
Our "Beyond Neural Scaling laws" paper got a #NeurIPS22 outstanding paper award! Congrats Ben Sorscher, Robert Geirhos, @sshkhr16 & @arimorcos awards: blog.neurips.cc/2022/11/21/ann… paper: arxiv.org/abs/2206.14486 🧵 twitter.com/SuryaGanguli/s…
sshkhr

@sshkhr16
Research Engineer @GoogleDeepMind. Previously: co-founder Dice; AI Research @MetaAI @VectorInst. Follow @awesomeMLSS

Our "Beyond Neural Scaling laws" paper got a #NeurIPS22 outstanding paper award! Congrats Ben Sorscher, Robert Geirhos, @sshkhr16 & @arimorcos awards: blog.neurips.cc/2022/11/21/ann… paper: arxiv.org/abs/2206.14486 🧵 twitter.com/SuryaGanguli/s…

Deepseek v4 still not released
Alibaba Qwen going closed
Western open weights models slacking

In these dark times for open source, who will save us? Alliances must be made, brothers must band together! A world of only closed source AI will lead to consolidation of power! Tyranny!



i open-sourced autokernel -- autoresearch for GPU kernels

you give it any pytorch model. it profiles the model, finds the bottleneck kernels, writes triton replacements, and runs experiments overnight. edit one file, benchmark, keep or revert, repeat forever. same loop as @karpathy autoresearch, applied to kernel optimization

95 experiments. 18 TFLOPS → 187 TFLOPS. 1.31x vs cuBLAS. all autonomous

9 kernel types (matmul, flash attention, fused mlp, layernorm, rmsnorm, softmax, rope, cross entropy, reduce). amdahl's law decides what to optimize next. 5-stage correctness checks before any speedup counts

the agent reads program.md (the "research org code"), edits kernel.py, runs bench.py, and either keeps or reverts. ~40 experiments/hour. ~320 overnight

ships with self-contained GPT-2, LLaMA, and BERT definitions so you don't need the transformers library to get started

github.com/RightNow-AI/au…
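for intuition, a minimal sketch of the loop the post describes. only kernel.py, bench.py, and program.md are named in the post; everything else here (check.py, the function names, the assumption that bench.py prints one latency number) is hypothetical and will differ from the actual repo. it also sketches the amdahl's-law prioritization: if a kernel takes fraction p of total runtime, the end-to-end speedup is capped at 1 / (1 - p) no matter how fast the replacement gets.

# a hypothetical sketch, not autokernel's real API: only kernel.py and bench.py
# come from the post; check.py, propose_edit, and the output format are
# assumptions made for illustration.
import shutil
import subprocess

def amdahl_priority(profile: dict[str, float]) -> str:
    # pick the kernel with the largest runtime share p: even an infinitely fast
    # replacement only yields 1 / (1 - p) end-to-end, so big-p kernels go first
    total = sum(profile.values())
    return max(profile, key=lambda name: profile[name] / total)

def run_bench() -> float:
    # assumes bench.py prints a single latency number (ms) on stdout
    out = subprocess.run(["python", "bench.py"], capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def passes_checks() -> bool:
    # stand-in for the 5-stage correctness gate; here a single script's exit code
    return subprocess.run(["python", "check.py"]).returncode == 0

def propose_edit(path: str) -> None:
    # placeholder for the agent step: read program.md, rewrite the triton kernel
    ...

def experiment_loop(n_experiments: int = 320) -> None:
    best = run_bench()
    for _ in range(n_experiments):
        shutil.copy("kernel.py", "kernel.py.bak")      # snapshot before the edit
        propose_edit("kernel.py")
        if passes_checks() and (t := run_bench()) < best:
            best = t                                   # keep: new baseline
        else:
            shutil.move("kernel.py.bak", "kernel.py")  # revert to the snapshot

the 1 / (1 - p) cap is why this ordering matters: a kernel at 40% of runtime is worth more attention than a 10x win on one at 5%, since the former is bounded by ~1.67x end-to-end while the latter delivers only ~1.05x.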



The correct answer was "over $4.5B"