Sharon Zhou

1.5K posts

@realSharonZhou

Recursively self-improving | VP Eng & AI, @AMD | Prev: Founder & CEO, Lamini. CS Faculty & PhD @Stanford. @Google. @Harvard | @MIT 35 under 35. Angel investor.

Stanford, CA · Joined November 2016

0 Following · 27.4K Followers

Pinned Tweet

Sharon Zhou@realSharonZhou·10 Mar

It's here: We just hit superhuman performance on AI kernel optimization! Real customer models & production settings. Not toy problems (what I typically see). This is the year that Claude writes its own kernels, Codex its own kernels, for every new GPU that it wants to run on -- something that takes months to port between GPU generations today. This has a massive impact on scaling intelligence. More compute means getting the next frontier model sooner.

950

240.8K

Sharon Zhou retweeted

Andrew Ng@AndrewYNg·14 May

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment.

This course is built in partnership with @AMD and taught by @realSharonZhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks.

Skills you'll gain:
- Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate
- Look inside the model to see how attention and layers combine to predict the next token
- Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs

Join and understand what's really happening inside your LLMs: deeplearning.ai/courses/transf…

140

836

108.8K

Sharon Zhou@realSharonZhou·12 May

Excited to share a new, practical course on Transformers! In partnership with AMD and DeepLearning.AI. Enroll free :) deeplearning.ai/courses/transf…

1.6K

Sharon Zhou retweeted

DeepLearning.AI@DeepLearningAI·12 May

Slow inference. Hallucinations. Costs that don't scale. The parts of LLMs you can't see are the parts that bite you. Build the intuition to debug them, in our new course with @RealSharonZhou and @AMD: Transformers in Practice. Enroll here: hubs.la/Q04g8KSv0

166

15.7K

Sharon Zhou@realSharonZhou·2 Apr

@venkat_systems @LisaSu @SemiAnalysis_ @AnushElangovan Thankfully for myself, only roc-stars like Anush get that type of glory

Venkat Raman — inference/acc@venkat_systems·2 Apr

@realSharonZhou @LisaSu congrats ! 🚀 will you also appear now in @SemiAnalysis_ memes along with @AnushElangovan ?

1.6K

Sharon Zhou@realSharonZhou·2 Apr

Happy to share I’m expanding my role to report directly to @LisaSu!

718

110K

Sharon Zhou@realSharonZhou·2 Apr

@GrahamsonT @LisaSu @AnushElangovan Nah, no one's got anything on Anush. He's a roc-star.

1.1K

Two Shoe Grahamson@GrahamsonT·2 Apr

@realSharonZhou @LisaSu Congratulations. @AnushElangovan you got competition!

1.2K

Sharon Zhou@realSharonZhou·24 Mar

@corentinanjuna Yes end to end- it’s already operating at the model/framework level

128

Corentin Kérisit@corentinanjuna·24 Mar

@realSharonZhou The real deal would be to let the agent improve the whole stack from driver to libs to kernel.

379

Sharon Zhou@realSharonZhou·23 Mar

codegen is cheap now. performance isn’t: most generated kernels are kinda mid. iteration and feedback are missing for both the agent and RL env layers. so we're open-sourcing Apex: an end-to-end agent using Claude Code + Codex to effectively optimize AMD kernels, instead of one-shotting them github.com/AMD-AGI/Apex

190

90.8K

Sharon Zhou@realSharonZhou·22 Mar

@Tahseen_Rahman yeah agreed on the 100% autonomy case - tho I guess for Codex, if it's an issue that a knowledgeable engineer can fix and still get gains, it'd be faster to solution as an agentic partner

341

Tahseen Rahman@Tahseen_Rahman·22 Mar

@realSharonZhou Speed vs. quality is the wrong framing. Codex shipped faster but unusable. Claude shipped slower but production-ready. The real metric: time to working solution. Not time to first attempt.

442

Sharon Zhou@realSharonZhou·22 Mar

Mood: agents optimizing kernels

Claude won on kernel optimization: gemm_bf16 at 1.19x vs Codex's 0.94x. Codex was faster (~1.3h vs ~3.4h) but produced no reinjectable optimizations. Claude used torch.mm (hipBLASLt) as a drop-in replacement for the custom Triton kernel. For Codex, shape mismatch caused slight regression.

Still improving, open sourcing soon --- AMD-AGI team (Sina Rafati, Emad Barsoum, and many more)

190

23.6K
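
For readers unfamiliar with the "1.19x vs 0.94x" notation in the tweet above: a speedup factor is conventionally baseline time divided by candidate time, so values below 1 mean the "optimized" kernel got slower. A minimal sketch follows. The function names, the 2% noise band, and the timings are all assumptions made for illustration; the timings were picked only to reproduce the two ratios quoted in the tweet and are not measurements from the thread.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Speedup factor: >1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


def classify(factor, noise=0.02):
    """Label a speedup factor; `noise` is an assumed measurement-noise band."""
    if factor > 1 + noise:
        return "improvement"
    if factor < 1 - noise:
        return "regression"
    return "neutral"


# Hypothetical timings (arbitrary units), chosen to match the quoted ratios.
runs = {
    "agent A": speedup(1.19, 1.00),  # 1.19x
    "agent B": speedup(0.94, 1.00),  # 0.94x
}
labels = {name: classify(factor) for name, factor in runs.items()}
```

Under these assumptions `labels` marks the first run as an improvement and the second as a regression, which is the distinction the tweet draws.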

Sharon Zhou@realSharonZhou·11 Mar

@Haniel_Ulises Heard of triton and gluon?

146

Haniel Ulises@Haniel_Ulises·11 Mar

@realSharonZhou He does it in p*thon, so it's not impressive

235

Sharon Zhou@realSharonZhou·11 Mar

@gonzaleshvili Also memories. I teach the post-training course on DeepLearning.AI and am writing a book on post-training/RL, so Claude and I chat a lot about it… Also this Xmas I got obsessed with loss landscapes and vibe coded some loss landscape video games, and that shows up here…

Sharon Zhou@realSharonZhou·11 Mar

@gonzaleshvili Yes

276

Sharon Zhou@realSharonZhou·11 Mar

Wait, this is so good. Asked Claude to make a video on RL post-training

206

18.7K

Sharon Zhou@realSharonZhou·11 Mar

“Thought pruned by optimizer” got me

1.6K

Sharon Zhou@realSharonZhou·11 Mar

@ysu_ChatData I’m worried about correctness checks being the bottleneck due to reward hacking and the fact that many existing tests are written for humans; less worried about benchmarking speed as we’ve been optimizing pieces there

1.2K

Yongrui Su@ysu_ChatData·11 Mar

This is a big deal. If models can write kernels, the hard part becomes correctness checks plus fast benchmarking and regression tracking across drivers and GPU generations. Curious what your verification loop looks like, do you use reference implementations, property tests, or differential testing against vendor libs?

469
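
To make the "differential testing against a reference implementation" idea in the question above concrete, here is a minimal pure-Python sketch. It is not the AMD-AGI verification loop, which the thread does not describe. Every name here (`matmul_reference`, `matmul_tiled`, `differential_test`), the tolerance, and the shape range are assumptions for illustration. A tiled matrix multiply stands in for an optimized kernel, and a `drop_tail` flag simulates a shape-handling bug.

```python
import random


def matmul_reference(a, b):
    """Naive matrix multiply, treated as the trusted reference."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def matmul_tiled(a, b, tile=4, drop_tail=False):
    """Tiled multiply standing in for a kernel under test.

    With drop_tail=True it ignores the last partial tile along the inner
    dimension, simulating a bug that only appears for some shapes.
    """
    n, k, m = len(a), len(b), len(b[0])
    k_end = k - (k % tile) if drop_tail else k
    out = [[0.0] * m for _ in range(n)]
    for p0 in range(0, k_end, tile):
        p1 = min(p0 + tile, k_end)
        for i in range(n):
            for j in range(m):
                out[i][j] += sum(a[i][p] * b[p][j] for p in range(p0, p1))
    return out


def max_abs_diff(x, y):
    """Largest elementwise difference, a simple numerical-drift measure."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def differential_test(candidate, reference, trials=50, tol=1e-9, seed=0):
    """Compare candidate and reference on random shapes; return failures."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        n, k, m = (rng.randint(1, 9) for _ in range(3))
        a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
        b = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)]
        drift = max_abs_diff(candidate(a, b), reference(a, b))
        if drift > tol:
            failures.append(((n, k, m), drift))
    return failures


good = differential_test(matmul_tiled, matmul_reference)
bad = differential_test(lambda a, b: matmul_tiled(a, b, drop_tail=True),
                        matmul_reference)
```

Randomizing shapes matters here: the simulated bug is invisible whenever the inner dimension is a multiple of the tile size, so a test that only used "nice" shapes would pass it. A real loop would presumably compare GPU kernels against a vendor library with dtype-appropriate tolerances, which this sketch does not attempt.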

Sharon Zhou@realSharonZhou·10 Mar

@macro_guru rocm.blogs.amd.com/software-tools…

1.6K

Macro Guru@macro_guru·10 Mar

@realSharonZhou Focus on composability of inference optimizations for ROCm stack in order to defeat the CUDA moat. (wide EP + disagg + FP4 + KV offloading + more) for maximum efficiency/scaling on huge models. AMD/ROCm lags in making those combinations smooth

2.3K

Sharon Zhou@realSharonZhou·10 Mar

@Strakyo I agree that's ideal and we do see generated kernels that generalize, but also the "real only if" part is not necessarily true in the frontier world

Strakyo@Strakyo·10 Mar

@realSharonZhou Kernel wins are real only if they hold across model families. What are the p95 speedup and numerical-drift deltas on held-out production workloads?

2.9K

Sharon Zhou@realSharonZhou·10 Mar

@sebastianboehle here's the tool/mcp that helped (3000x more token efficient at regular profiling/benchmarking): github.com/AMD-AGI/Magpie

working on open-sourcing the end to end pipeline you see in the gif

1.3K

Sebastian Boehler@sebastianboehle·10 Mar

@realSharonZhou Can you share a preprint or GitHub repo? Am also working on a fork of cuda agent from bytedance on an optimized kernel writing agent. Would love to benchmark

1.4K
