Edward Z. Yang

9K posts

@ezyang

I work on PyTorch at Meta. Chatty alt at @difficultyang.

Edison, NJ · Joined May 2008

1.4K Following · 16.9K Followers

Edward Z. Yang@ezyang·3h

@RisingSayak Most of my plain Q&A tasks are either easily one-shottable, or too hard and I end up iterating anyway

Sayak Paul@RisingSayak·3h

@ezyang Don’t you find the lost quality and iteration speed trade-off cumbersome?

Edward Z. Yang@ezyang·4h

I have to say, it's pretty nice to be in the period of time when Muse Spark doesn't have any usage limits (my ChatGPT Pro downgraded to Plus and now when I'm intensely coding I gotta ration my regular Q&A usage. Muse Spark is fine for a lot of research tasks.)

Edward Z. Yang@ezyang·3d

@SkyLi0n Yes, user mode driver is sufficient

Aaron Gokaslan@SkyLi0n·3d

@ezyang Does it work with the cuda13-compat shims, or do you actually need the cuda13 driver?

Edward Z. Yang@ezyang·3d

This is not just nightlies, it is in 2.12! (Thanks Natalia for the correction)

Edward Z. Yang@ezyang

Pro-tip: using CUDA graphs and annoyed that all the kernels have no labels in your profiles? Get a nightly that has the mark_kernels context manager: github.com/pytorch/pytorc… (thanks Natalia and Shangdi for implementing!) You need a 13.1 driver, but the user mode driver is enough

Edward Z. Yang@ezyang·3d

@drisspg coined a really good term for a lot of what we've been doing with LLM coding: "painting APIs." You sit down with your agents and just shove characters around on the API, playing around with different options, just asking for whatever API you want.

Edward Z. Yang@ezyang·4d

@llllvvuu I showed this to the team and we think it's the same (@drisspg authored this to check: gist.github.com/ezyang/1990294…). torchtitan is following the HF convention.

L@llllvvuu·4d

Megatron MoE computes softmax over top k logits. torchtitan computes softmax over all logits and renormalizes top k weights afterwards. The result is the same in forwards but not in backwards. Which way is correct? Megatron, right?
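A minimal sketch of the comparison being discussed, in plain Python rather than either framework's code. It assumes the top-k indices are treated as constants and that the routing weights are the only thing being differentiated; the function names are hypothetical, and the Jacobian is estimated with central differences instead of autograd. Under those assumptions the two formulations are the same function of the logits, so both the forward values and the gradients agree.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_indices(logits, k):
    # Indices of the k largest logits, largest first.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def softmax_over_topk(logits, idx):
    # Variant A: softmax restricted to the selected logits.
    return softmax([logits[i] for i in idx])

def softmax_then_renormalize(logits, idx):
    # Variant B: softmax over all logits, then renormalize the selected weights.
    probs = softmax(logits)
    picked = [probs[i] for i in idx]
    total = sum(picked)
    return [p / total for p in picked]

def numerical_jacobian(fn, logits, idx, eps=1e-6):
    # jac[j][i] approximates d out[j] / d logits[i] with central differences.
    # idx is held fixed, i.e. the top-k selection is not differentiated.
    jac = [[0.0] * len(logits) for _ in idx]
    for i in range(len(logits)):
        hi = list(logits)
        lo = list(logits)
        hi[i] += eps
        lo[i] -= eps
        out_hi = fn(hi, idx)
        out_lo = fn(lo, idx)
        for j in range(len(idx)):
            jac[j][i] = (out_hi[j] - out_lo[j]) / (2 * eps)
    return jac

logits = [2.0, -1.0, 0.5, 3.0, 0.0]
idx = topk_indices(logits, 2)

w_a = softmax_over_topk(logits, idx)
w_b = softmax_then_renormalize(logits, idx)
jac_a = numerical_jacobian(softmax_over_topk, logits, idx)
jac_b = numerical_jacobian(softmax_then_renormalize, logits, idx)

max_forward_gap = max(abs(a - b) for a, b in zip(w_a, w_b))
max_jacobian_gap = max(
    abs(a - b) for ra, rb in zip(jac_a, jac_b) for a, b in zip(ra, rb)
)
print(max_forward_gap, max_jacobian_gap)
```

This only covers the pure softmax-and-renormalize path. A real implementation could still differ in backward if, for example, an auxiliary loss reads the full probability vector or the dtype of the intermediate softmax differs.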

Edward Z. Yang@ezyang·6d

I know, these days stars don't mean much, but it's still lovely to see such a nice round number. Thank you everyone for trusting your ML workloads to us, here's to the next 100K stars! (H/t Tianyu for pulling this screenshot)

Edward Z. Yang@ezyang·19 May

@typedfemale So it's definitely feasible to have the exact allocations from the caching allocator be deterministic run to run. This mostly boils down to not using record stream. I would check on this!

typedfemale@typedfemale·19 May

@ezyang oh yes, i've done that so many times!

typedfemale@typedfemale·17 May

the CUDA caching allocator is such a great way to create extremely "interesting" bugs for yourself

Edward Z. Yang@ezyang·19 May

@typedfemale I have had some "fun" out of bounds bugs where CUDA sanitizer didn't help because all the memory accessed was technically valid 😂

typedfemale@typedfemale·19 May

@ezyang we have a kernel that's corrupting memory between the forward and backward pass and i think caching allocator was making it non-deterministic (really not its fault, i was just being stupid and didn't realize what was going on)

Edward Z. Yang@ezyang·18 May

If you're willing to put up with explicit comms in your code, github.com/meta-pytorch/s… is our latest direction for addressing some of the silent perf cliffs in DTensor. There's a tradeoff space here between explicit and implicit, and spmd_types stakes out a different spot on the curve.

Kamil Sindi@kamilsindi

Distributed training is hard. We adopted DTensor at Runway to prevent silent gradient bugs and it delivered. But we traded performance for correctness, hitting dispatch overhead, recompilation storms, and MFU drops. Wrote up what we learned and how we work around it. runwayml.com/news/dtensor-d…

Edward Z. Yang@ezyang·18 May

@defnotbeka mfw

Rebecca Valentine@defnotbeka·18 May

man what did i type that ios """corrected""" to "merge"??? nothing makes sense

Rebecca Valentine@defnotbeka·16 May

merge second half has been academia for like 40 years

Edward Z. Yang@ezyang·18 May

deep in the refcounting ownership mines right now

Edward Z. Yang reposted

Armin Ronacher ⇌@mitsuhiko·16 May

I did not try it yet, but it does quite a few of the things that I wrote about recently! lucumr.pocoo.org/2026/2/9/a-lan…

Chris Tate@ctatedev

Introducing Zero: the programming language for agents. I wanted a systems language that was faster, smaller, and easier for agents to use and repair. Explicit capabilities. JSON diagnostics. Typed safe fixes. Made for agents on day zero.

Edward Z. Yang@ezyang·16 May

Here is a good example of a PR where Codex simply does better than me by default. Frickin Python generators lmao github.com/pytorch/pytorc…

Edward Z. Yang reposted

Prime Intellect@PrimeIntellect·13 May

Introducing Renderers. RL trainers work in tokens. Environments work in messages. Going back and forth corrupts sampled tokens, wasting compute on every agentic turn. With Renderers, we fix this mismatch. This unlocks >3x throughput on popular open models.

Edward Z. Yang@ezyang·12 May

Remember, you need to make sure you generate exactly the same random numbers in forwards and backwards. So we ended up doing even regular checkpointing via a totally graph based strategy: tag and then min-cut partitioner; the SAC in compiler strategy was an extension on this.
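A toy illustration of the requirement in the first sentence, using only Python's standard `random` module. This is not how PT2 or the min-cut partitioner handles RNG; it is a sketch of the underlying constraint, with hypothetical helper names: if a dropout-like op is recomputed during backward, it has to replay the same random draws it used in forward, otherwise the recomputed activation does not match the one the loss was computed from.

```python
import random

def dropout(values, p, rng):
    # Inverted dropout: zero each value with probability p, scale the survivors.
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

def forward_with_saved_rng(values, p, rng):
    # Record the RNG state *before* drawing, so the draws can be replayed later.
    state = rng.getstate()
    return dropout(values, p, rng), state

def recompute(values, p, rng, saved_state=None):
    # Recompute the activation, as a checkpointed region would during backward.
    if saved_state is None:
        # Naive recompute: draws fresh random numbers.
        return dropout(values, p, rng)
    current = rng.getstate()
    rng.setstate(saved_state)   # rewind to the forward-time state
    out = dropout(values, p, rng)
    rng.setstate(current)       # leave the global stream where it was
    return out

rng = random.Random(0)
values = [float(i) for i in range(1, 33)]

out, state = forward_with_saved_rng(values, 0.5, rng)
replayed = recompute(values, 0.5, rng, saved_state=state)
naive = recompute(values, 0.5, rng)

print(replayed == out)  # the replayed mask matches forward
print(naive == out)     # the naive recompute almost surely does not
```

Saving and restoring RNG state like this is the eager-style approach; the tweet describes moving to a graph-based strategy instead because functionalizing RNG operations under tracing was difficult.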

Edward Z. Yang@ezyang·12 May

Some extra backstory here: Animesh Jain was tasked with getting our preexisting checkpoint story to work with PT2. But dealing with functionalizing RNG operations was difficult, and we ended up being unable to trace through the eager checkpoint implementation.

Edward Z. Yang@ezyang·12 May

A thread about the history and internal implementation details of activation checkpointing APIs in PyTorch. 🧵

Edward Z. Yang@ezyang·12 May

I missed some history here! The original version of eager SAC came from github.com/facebookresear… ; implemented by @fvsmassa but with a lot of input from the xformers team at the time.

Edward Z. Yang@ezyang·12 May

The history here explains why the SAC APIs are all focused around classifying operations as cheap or expensive: it comes from the original min-cut formulation. In 2024, Jeffrey Wan released an eager mode version of SAC centered around the same ideas.
