sshkhr
@sshkhr16
1.1K posts

Research Engineer @GoogleDeepMind Previously: co-founder Dice, AI Research @MetaAI @VectorInst Follow @awesomeMLSS

Ontario, Canada · Joined April 2018
1.5K Following · 1.7K Followers
Pinned Tweet
sshkhr @sshkhr16 ·
Our work on improving neural scaling beyond power laws won an Outstanding Paper award at @NeurIPSConf 2022!! Come check it out on Wed, Nov 30, at Poster Session 3 in New Orleans.
Surya Ganguli@SuryaGanguli

Our "Beyond Neural Scaling laws" paper got a #NeurIPS22 outstanding paper award! Congrats Ben Sorscher, Robert Geirhos, @sshkhr16 & @arimorcos awards: blog.neurips.cc/2022/11/21/ann… paper: arxiv.org/abs/2206.14486 🧵 twitter.com/SuryaGanguli/s…

sshkhr @sshkhr16 ·
@recurseparadox shashankshekhar.com/blog/data-qual…
Pranav Shyam @recurseparadox ·
I don’t know what mid-training is and at this point I’m too afraid to ask
erin griffith @eringriffith ·
A detailed and brutal look at the tactics of buzzy AI compliance startup Delve: "Delve built a machine designed to make clients complicit without their knowledge, to manufacture plausible deniability while producing exactly the opposite." substack.com/home/post/p-19…
Awni Hannun @awnihannun ·
I joined Anthropic as a member of the technical staff. Excited to work on frontier modeling at a place with unwavering values and a generational mission.
wundram @wundram ·
@yacineMTB No, but you can use MLX, which is better, though Mac-only. Since the memory is unified, you don’t have to copy anything in and out of VRAM.
kache @yacineMTB ·
can i run cuda on a macbook
Joseph Suarez 🐡 @jsuarez ·
@yacineMTB no. You can run MPS, which is Apple's "we know better" dogshit-heavy C++ crap. We'll still have the torch backend in 4.0 though. We have a contributor that got MPS to 3M sps... but is it really worth maintaining 10k+ new lines to run at the speed of a $200 GPU?
sshkhr @sshkhr16 ·
Uhhh....I have a presentation in 2 hours 😅 @claudeai
sshkhr tweet media
Excalidraw @excalidraw ·
Excalidraw is live right now on @nvidia GTC 2026 ❤️
Excalidraw tweet media
sshkhr @sshkhr16 ·
Don't mind if I do...
sshkhr tweet media
sshkhr @sshkhr16 ·
Exhibit B (also from last week)
sshkhr tweet media
alth0u🧶 @alth0u ·
new market category that only has like two startups operating in it and no name yet. something roughly like either:
- post-training as a service
- post-training IDE
unclear if FDE or consultants sufficient
AT @AliesTaha ·
was skeptical but gave it a shot because @karpathy. anyways:
- 2x kernel perf (fp4 matmul)
- 3 minutes of work (1 prompt)
- triton beat cutlass (?!)
AT tweet media
Jaber @Akashi203

i open-sourced autokernel -- autoresearch for GPU kernels.

you give it any pytorch model. it profiles the model, finds the bottleneck kernels, writes triton replacements, and runs experiments overnight. edit one file, benchmark, keep or revert, repeat forever. same loop as @karpathy autoresearch, applied to kernel optimization.

95 experiments. 18 TFLOPS → 187 TFLOPS. 1.31x vs cuBLAS. all autonomous.

9 kernel types (matmul, flash attention, fused mlp, layernorm, rmsnorm, softmax, rope, cross entropy, reduce). amdahl's law decides what to optimize next. 5-stage correctness checks before any speedup counts.

the agent reads program.md (the "research org code"), edits kernel.py, runs bench.py, and either keeps or reverts. ~40 experiments/hour. ~320 overnight.

ships with self-contained GPT-2, LLaMA, and BERT definitions so you don't need the transformers library to get started. github.com/RightNow-AI/au…
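The loop the post describes (profile, pick the bottleneck via Amdahl's law, patch a kernel, benchmark, keep or revert) can be sketched in plain Python. All names below are hypothetical stand-ins for illustration, not the actual autokernel API:

```python
import random

# Hypothetical profile: fraction of total runtime spent in each kernel.
profile = {"matmul": 0.55, "softmax": 0.25, "layernorm": 0.20}
speedup = {k: 1.0 for k in profile}  # accepted speedup per kernel so far

def amdahl_priority(profile, speedup):
    """Pick the kernel whose *remaining* time share dominates (Amdahl's law):
    optimizing anything else is capped by this kernel's unoptimized share."""
    return max(profile, key=lambda k: profile[k] / speedup[k])

def try_variant(kernel):
    """Stand-in for 'agent edits kernel.py, bench.py measures it'.
    Returns the measured speedup of the candidate variant."""
    return random.uniform(0.8, 1.5)

random.seed(0)
for step in range(40):  # ~40 experiments/hour in the post
    target = amdahl_priority(profile, speedup)
    candidate = try_variant(target)
    if candidate > speedup[target]:  # keep only if faster,
        speedup[target] = candidate  # otherwise revert (a no-op here)

end_to_end = 1.0 / sum(profile[k] / speedup[k] for k in profile)
print(f"end-to-end speedup: {end_to_end:.2f}x")
```

The real system adds what a toy loop can't: correctness gates before any speedup counts, so a fast-but-wrong kernel is reverted rather than kept.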

You Jiacheng @YouJiacheng ·
142.5T cuBLAS on H100? hmmm.
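The figure being questioned appears to be back-derived from the quoted benchmark: 187 TFLOPS at "1.31x vs cuBLAS" implies a cuBLAS baseline of roughly 142.7 TFLOPS, which would be unusually low for a well-tuned matmul on an H100 — hence the skepticism:

```python
# Back-of-envelope check of the quoted claim: 187 TFLOPS at "1.31x vs cuBLAS"
achieved_tflops = 187.0
speedup_vs_cublas = 1.31

implied_cublas = achieved_tflops / speedup_vs_cublas
print(f"implied cuBLAS baseline: {implied_cublas:.1f} TFLOPS")  # ≈ 142.7
```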
Jaber @Akashi203
sshkhr @sshkhr16 ·
🧐🧐🧐
sshkhr tweet media
Jaber @Akashi203