Edward Z. Yang

9K posts

@ezyang

I work on PyTorch at Meta. Chatty alt at @difficultyang.

Edison, NJ · Joined May 2008
1.4K Following · 16.9K Followers
Edward Z. Yang@ezyang·
@RisingSayak Most of my plain Q&A tasks are either easily one-shottable, or too hard and I end up iterating anyway
Sayak Paul@RisingSayak·
@ezyang Don’t you find the lost quality and iteration speed trade-off cumbersome?
Edward Z. Yang@ezyang·
I have to say, it's pretty nice to be in the period of time when Muse Spark doesn't have any usage limits. (My ChatGPT Pro got downgraded to Plus, and now when I'm coding intensely I have to ration my regular Q&A usage. Muse Spark is fine for a lot of research tasks.)
Aaron Gokaslan@SkyLi0n·
@ezyang Does it work with the cuda13-compat shims, or do you actually need a cuda13 driver?
Edward Z. Yang@ezyang·
@drisspg coined a really good term for a lot of what we've been doing with LLM coding: "painting APIs." It's where you sit down with your agents and just shove characters around on the API, playing around with different options, just asking for whatever API you want.
L@llllvvuu·
Megatron MoE computes softmax over top k logits. torchtitan computes softmax over all logits and renormalizes top k weights afterwards. The result is the same in forwards but not in backwards. Which way is correct? Megatron, right?
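
For readers who want to see the forward-pass claim concretely, here is a minimal pure-Python sketch of the two routing-weight computations described above. The function names are hypothetical and this is not code from Megatron or torchtitan. It only compares forward values on one example input; it says nothing about the backward pass, which depends on how each framework implements and differentiates these ops.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_indices(xs, k):
    # Indices of the k largest entries, largest first.
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]

def softmax_over_topk(logits, k):
    # Variant 1: select the top-k logits, then softmax over just those.
    idx = topk_indices(logits, k)
    probs = softmax([logits[i] for i in idx])
    return dict(zip(idx, probs))

def softmax_then_renormalize(logits, k):
    # Variant 2: softmax over all logits, keep the top-k, renormalize to sum to 1.
    idx = topk_indices(logits, k)
    probs = softmax(logits)
    total = sum(probs[i] for i in idx)
    return {i: probs[i] / total for i in idx}

logits = [2.0, -1.0, 0.5, 3.0, 0.0]  # made-up router logits for 5 experts
a = softmax_over_topk(logits, k=2)
b = softmax_then_renormalize(logits, k=2)
```

Both variants pick the same experts here, and the weights agree up to floating-point rounding, because the full-softmax normalizer cancels when the top-k weights are renormalized.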
Edward Z. Yang@ezyang·
Pro-tip: using CUDA graphs and annoyed that all the kernels have no labels in your profiles? Get a nightly that has the mark_kernels context manager: github.com/pytorch/pytorc… (thanks Natalia and Shangdi for implementing!) You need a 13.1 driver, but the user-mode driver is enough
Edward Z. Yang@ezyang·
I know, these days stars don't mean much, but it's still lovely to see such a nice round number. Thank you everyone for trusting your ML workloads to us, here's to the next 100K stars! (H/t Tianyu for pulling this screenshot)
Edward Z. Yang tweet media
Edward Z. Yang@ezyang·
@typedfemale So it's definitely feasible to have the exact allocations from the caching allocator be deterministic run to run. This mostly boils down to not using record stream. I would check on this!
typedfemale@typedfemale·
@ezyang oh yes, i've done that so many times!
typedfemale@typedfemale·
the CUDA caching allocator is such a great way to create extremely "interesting" bugs for yourself
Edward Z. Yang@ezyang·
@typedfemale I have had some "fun" out-of-bounds bugs where the CUDA sanitizer didn't help because all the memory accessed was technically valid 😂
typedfemale@typedfemale·
@ezyang we have a kernel that's corrupting memory between the forward and backward pass and i think the caching allocator was making it non-deterministic (really not its fault, i was just being stupid and didn't realize what was going on)
Edward Z. Yang@ezyang·
If you're willing to put up with explicit comms in your code, github.com/meta-pytorch/s… is our latest direction for addressing some of the silent perf cliffs in DTensor. There's a tradeoff space here between explicit and implicit, and spmd_types stakes out a different spot on the curve.
Kamil Sindi@kamilsindi

Distributed training is hard. We adopted DTensor at Runway to prevent silent gradient bugs and it delivered. But we traded performance for correctness, hitting dispatch overhead, recompilation storms, and MFU drops. Wrote up what we learned and how we work around it. runwayml.com/news/dtensor-d…

Rebecca Valentine@defnotbeka·
man what did i type that ios """corrected""" to "merge"??? nothing makes sense
Edward Z. Yang@ezyang·
deep in the refcounting ownership mines right now
Edward Z. Yang retweeted
Prime Intellect@PrimeIntellect·
Introducing Renderers. RL trainers work in tokens. Environments work in messages. Going back and forth corrupts sampled tokens, wasting compute on every agentic turn. With Renderers, we fix this mismatch. This unlocks >3x throughput on popular open models.
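
To illustrate the kind of token/message mismatch the announcement refers to, here is a toy sketch. It assumes a made-up three-entry vocabulary and a greedy longest-match encoder, neither of which comes from Prime Intellect's Renderers. It shows that decoding sampled tokens to text and re-encoding the text need not return the tokens that were sampled.

```python
# Hypothetical toy vocabulary, ordered longest entry first for greedy matching.
VOCAB = ["ab", "a", "b"]

def decode(tokens):
    # Tokens are plain string pieces here, so decoding is concatenation.
    return "".join(tokens)

def encode(text):
    # Greedy longest-match tokenizer over VOCAB.
    tokens = []
    i = 0
    while i < len(text):
        for piece in VOCAB:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

sampled = ["a", "b"]                  # what the policy actually sampled
roundtrip = encode(decode(sampled))   # tokens -> message text -> tokens
```

In this toy setup the text is unchanged by the round trip, but `roundtrip` is the single token `"ab"` rather than the two sampled tokens, so a trainer that re-tokenizes messages would score a different token sequence than the one that was generated.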
Prime Intellect tweet media
Edward Z. Yang@ezyang·
Remember, you need to make sure you generate exactly the same random numbers in forwards and backwards. So we ended up doing even regular checkpointing via a totally graph based strategy: tag and then min-cut partitioner; the SAC in compiler strategy was an extension on this.
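
The RNG requirement can be sketched in plain Python. This is a toy illustration, assuming a hand-rolled dropout and Python's `random` module; it shows the eager-style idea of stashing and restoring RNG state around recomputation, not the graph-based tag and min-cut strategy described above and not PyTorch's implementation.

```python
import random

def dropout_forward(xs, p, rng):
    # Draw a fresh dropout mask from rng and apply it with inverted scaling.
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in xs]
    return [x * m for x, m in zip(xs, mask)], mask

rng = random.Random(0)
xs = [1.0, 2.0, 3.0, 4.0]

# Forward: remember the RNG state instead of saving the activations.
saved_state = rng.getstate()
out, mask = dropout_forward(xs, 0.5, rng)

# Other random ops may run between forward and backward and advance the RNG.
rng.random()

# Backward-time recompute: restore the saved state so the same mask is drawn.
rng.setstate(saved_state)
out_again, mask_again = dropout_forward(xs, 0.5, rng)
```

Without the `setstate` call, the recomputed mask would be drawn from whatever state the generator had reached by then, and gradients would be computed against activations that the forward pass never produced.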
Edward Z. Yang@ezyang·
Some extra backstory here: Animesh Jain was tasked with getting our preexisting checkpoint story to work with PT2. But dealing with functionalizing RNG operations was difficult, and we ended up being unable to trace through the eager checkpoint implementation.
Edward Z. Yang@ezyang·
A thread about the history and internal implementation details of activation checkpointing APIs in PyTorch. 🧵
Edward Z. Yang@ezyang·
The history here explains why the SAC APIs are all focused around classifying operations as cheap or expensive: it comes from the original min-cut formulation. In 2024, Jeffrey Wan released an eager mode version of SAC centered around the same ideas.