Sharon Zhou

1.5K posts

@realSharonZhou

Recursively self-improving | VP Eng & AI, @AMD | Prev: Founder & CEO, Lamini. CS Faculty & PhD @Stanford. @Google. @Harvard | @MIT 35 under 35. Angel investor.

Stanford, CA · Joined November 2016
0 Following · 27.4K Followers
Pinned Tweet
Sharon Zhou @realSharonZhou
It's here: We just hit superhuman performance on AI kernel optimization! Real customer models & production settings. Not toy problems (what I typically see). This is the year that Claude writes its own kernels, Codex its own kernels, for every new GPU that it wants to run on -- something that takes months to port between GPU generations today. This has a massive impact on scaling intelligence. More compute means getting the next frontier model sooner.
36
77
950
240.8K
Sharon Zhou retweeted
Andrew Ng @AndrewYNg
New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment.
This course is built in partnership with @AMD and taught by @realSharonZhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks.
Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs
Join and understand what's really happening inside your LLMs: deeplearning.ai/courses/transf…
66
140
836
108.8K
Sharon Zhou retweeted
DeepLearning.AI @DeepLearningAI
Slow inference. Hallucinations. Costs that don't scale. The parts of LLMs you can't see are the parts that bite you. Build the intuition to debug them, in our new course with @RealSharonZhou and @AMD: Transformers in Practice. Enroll here: hubs.la/Q04g8KSv0
5
31
166
15.7K
Sharon Zhou @realSharonZhou
Happy to share I’m expanding my role to report directly to @LisaSu!
57
23
718
110K
Sharon Zhou @realSharonZhou
@corentinanjuna Yes, end to end: it’s already operating at the model/framework level
1
0
0
128
Corentin Kérisit @corentinanjuna
@realSharonZhou The real deal would be to let the agent improve the whole stack from driver to libs to kernel.
1
0
2
379
Sharon Zhou @realSharonZhou
codegen is cheap now. performance isn’t: most generated kernels are kinda mid. iteration and feedback are missing for both the agent and RL env layers. so we're open-sourcing Apex: an end-to-end agent using Claude Code + Codex to effectively optimize AMD kernels, instead of one-shotting them github.com/AMD-AGI/Apex
7
21
190
90.8K
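A minimal sketch of the "iterate with feedback instead of one-shotting" idea from the tweet above. This is not Apex's code or API: the function names, the toy list-squaring "kernels", and the fixed cost table are all hypothetical, chosen so the loop runs on plain Python. Each candidate is checked for correctness first, then kept only if it is measurably cheaper, and the feedback log is the kind of signal an agent could read before its next attempt.

```python
import time
from typing import Callable, List, Sequence, Tuple

Kernel = Callable[[List[float]], List[float]]
Cost = Callable[[Kernel, List[float]], float]


def wall_clock_cost(kernel: Kernel, xs: List[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock seconds; a crude stand-in for a real GPU profiler."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(xs)
        best = min(best, time.perf_counter() - start)
    return best


def optimize(
    baseline: Kernel,
    candidates: Sequence[Tuple[str, Kernel]],
    xs: List[float],
    cost: Cost = wall_clock_cost,
) -> Tuple[str, Kernel, List[Tuple[str, str]]]:
    """Try candidates in order; keep one only if it is correct AND cheaper."""
    expected = baseline(xs)
    best_name, best_kernel, best_cost = "baseline", baseline, cost(baseline, xs)
    feedback: List[Tuple[str, str]] = []
    for name, kernel in candidates:
        # Correctness gate first: a fast but wrong kernel is never accepted.
        # (Exact equality is fine for this toy; real kernels need tolerances.)
        if kernel(xs) != expected:
            feedback.append((name, "rejected: wrong output"))
            continue
        candidate_cost = cost(kernel, xs)
        if candidate_cost < best_cost:
            best_name, best_kernel, best_cost = name, kernel, candidate_cost
            feedback.append((name, "accepted"))
        else:
            feedback.append((name, "rejected: not faster"))
    return best_name, best_kernel, feedback


def square_loop(xs: List[float]) -> List[float]:
    out = []
    for x in xs:
        out.append(x * x)
    return out


def square_comprehension(xs: List[float]) -> List[float]:
    return [x * x for x in xs]


def double_instead(xs: List[float]) -> List[float]:
    return [x * 2 for x in xs]  # wrong on purpose


# Hypothetical fixed costs so the demo is reproducible;
# drop the cost= argument to use real wall-clock timing instead.
FIXED_COSTS = {"square_loop": 3.0, "square_comprehension": 2.0, "double_instead": 1.0}


def fixed_cost(kernel: Kernel, xs: List[float]) -> float:
    return FIXED_COSTS[kernel.__name__]


best_name, best_kernel, feedback = optimize(
    square_loop,
    [
        ("double_instead", double_instead),
        ("square_comprehension", square_comprehension),
        ("square_loop_again", square_loop),
    ],
    [1.0, 2.0, 3.0],
    cost=fixed_cost,
)
```

In this toy run the cheapest candidate is rejected because it fails the correctness gate, which is the failure mode a one-shot generator has no chance to recover from.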
Sharon Zhou @realSharonZhou
@Tahseen_Rahman yeah agreed on the 100% autonomy case - tho I guess for Codex, if it's an issue that a knowledgeable engineer can fix and still get gains, it'd be faster to solution as an agentic partner
1
0
5
341
Tahseen Rahman @Tahseen_Rahman
@realSharonZhou Speed vs. quality is the wrong framing. Codex shipped faster but unusable. Claude shipped slower but production-ready. The real metric: time to working solution. Not time to first attempt.
1
0
6
442
Sharon Zhou @realSharonZhou
Mood: agents optimizing kernels
Claude won on kernel optimization: gemm_bf16 at 1.19x vs Codex's 0.94x. Codex was faster (~1.3h vs ~3.4h) but produced no reinjectable optimizations. Claude used torch.mm (hipBLASLt) as a drop-in replacement for the custom Triton kernel. For Codex, shape mismatch caused slight regression.
Still improving, open sourcing soon
--- AMD-AGI team (Sina Rafati, Emad Barsoum, and many more)
16
13
190
23.6K
Sharon Zhou @realSharonZhou
@gonzaleshvili also memories: I teach the post-training course on DeepLearning.AI and am writing a book on post-training/RL, so Claude and I chat a lot about it… Also this Xmas I got obsessed with loss landscapes and vibe coded some loss landscape video games, and that shows up here…
0
0
3
80
Sharon Zhou @realSharonZhou
Wait this is so good
Asked Claude to make a video on RL post-training
70
9
206
18.7K
Sharon Zhou @realSharonZhou
“Thought pruned by optimizer” got me
2
0
11
1.6K
Sharon Zhou @realSharonZhou
@ysu_ChatData I’m worried about correctness checks being the bottleneck due to reward hacking and the fact that many existing tests are written for humans; less worried about benchmarking speed as we’ve been optimizing pieces there
0
0
6
1.2K
Yongrui Su @ysu_ChatData
This is a big deal. If models can write kernels, the hard part becomes correctness checks plus fast benchmarking and regression tracking across drivers and GPU generations. Curious what your verification loop looks like: do you use reference implementations, property tests, or differential testing against vendor libs?
1
0
1
469
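For readers unfamiliar with the "differential testing against a reference implementation" option raised in the question above, here is a minimal sketch in plain Python. It is an illustration only, not a description of anyone's actual verification loop: the matmul functions, shapes, and tolerances are all assumptions, and a real setup would compare GPU kernels against a vendor library rather than two list-based routines.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
MatMul = Callable[[Matrix, Matrix], Matrix]


def matmul_reference(a: Matrix, b: Matrix) -> Matrix:
    """Slow, obviously-correct triple loop: the trusted reference."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]


def matmul_candidate(a: Matrix, b: Matrix) -> Matrix:
    """A restructured version (column-wise, different summation) under test."""
    cols = list(zip(*b))
    return [[math.fsum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul_buggy(a: Matrix, b: Matrix) -> Matrix:
    """Deliberately wrong: silently drops the last term of each dot product."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k - 1)) for j in range(m)] for i in range(n)]


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def differential_test(
    candidate: MatMul,
    reference: MatMul,
    shapes: Sequence[Tuple[int, int, int]],
    trials: int = 20,
    seed: int = 0,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-12,
) -> List[str]:
    """Run both implementations on seeded random inputs; return failure messages."""
    rng = random.Random(seed)
    failures: List[str] = []
    for n, k, m in shapes:
        for _ in range(trials):
            a, b = random_matrix(rng, n, k), random_matrix(rng, k, m)
            want, got = reference(a, b), candidate(a, b)
            if len(got) != n or any(len(row) != m for row in got):
                failures.append(f"shape mismatch for {(n, k, m)}")
                continue
            for i in range(n):
                for j in range(m):
                    # Tolerance-based comparison: summation order may differ.
                    if not math.isclose(got[i][j], want[i][j], rel_tol=rel_tol, abs_tol=abs_tol):
                        failures.append(f"value mismatch at ({i}, {j}) for {(n, k, m)}")
    return failures


# Non-square shapes are included on purpose: they catch shape-handling bugs.
SHAPES = [(1, 1, 1), (2, 3, 4), (5, 2, 3)]
good_failures = differential_test(matmul_candidate, matmul_reference, SHAPES)
bad_failures = differential_test(matmul_buggy, matmul_reference, SHAPES)
```

The seeded generator makes any failure reproducible, and the tolerances are where reward hacking can hide: too loose and a wrong kernel passes, too tight and a legitimately reordered sum fails.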
Macro Guru @macro_guru
@realSharonZhou Focus on composability of inference optimizations for ROCm stack in order to defeat the CUDA moat. (wide EP + disagg + FP4 + KV offloading + more) for maximum efficiency/scaling on huge models. AMD/ROCm lags in making those combinations smooth
1
0
8
2.3K
Sharon Zhou @realSharonZhou
@Strakyo I agree that's ideal and we do see generated kernels that generalize, but also the "real only if" part is not necessarily true in the frontier world
1
0
6
2K
Strakyo @Strakyo
@realSharonZhou Kernel wins are real only if they hold across model families. What are the p95 speedup and numerical-drift deltas on held-out production workloads?
3
0
7
2.9K
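The two metrics asked about above can be made concrete with a small sketch. Everything here is an assumption for illustration: the helper names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate), "drift" is taken to mean the largest absolute output difference, and the sample numbers are invented, not measurements from any workload.

```python
import math
from typing import List, Sequence


def speedups(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> List[float]:
    """Per-workload speedup = baseline time / optimized time (below 1.0 is a regression)."""
    if len(baseline_ms) != len(optimized_ms):
        raise ValueError("need one optimized timing per baseline timing")
    return [b / o for b, o in zip(baseline_ms, optimized_ms)]


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def max_abs_drift(reference: Sequence[float], optimized: Sequence[float]) -> float:
    """Largest absolute difference between reference and optimized outputs."""
    return max(abs(r - o) for r, o in zip(reference, optimized))


# Invented timings (milliseconds) for four hypothetical held-out workloads.
sample = speedups([10.0, 20.0, 30.0, 40.0], [8.0, 20.0, 25.0, 50.0])
p95 = percentile(sample, 95)   # the best-case tail
p5 = percentile(sample, 5)     # the worst-case tail, where regressions show up
regressions = sum(1 for s in sample if s < 1.0)
drift = max_abs_drift([1.0, 2.0, 3.0], [1.0, 2.0005, 2.999])
```

Reporting the low percentile alongside p95 matters because a high p95 can coexist with regressions on some workloads, as in this invented sample.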
Sebastian Boehler @sebastianboehle
@realSharonZhou Can you share a preprint or GitHub repo? Am also working on a fork of cuda agent from bytedance on an optimized kernel writing agent. Would love to benchmark
1
0
2
1.4K