Garrett Goon


But in the backward pass, the story is much worse. Gradients get compressed via projection onto a D-dimensional subspace, and most of the training signal simply vanishes.
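A toy illustration of the claim (my own sketch, not the post's actual setup): project a gradient in R^N onto a random D-dimensional subspace and measure how much survives. For an isotropic gradient, the retained energy is roughly D/N, so with D much smaller than N almost all of the signal is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4096, 64  # hypothetical full dimension and subspace dimension

g = rng.standard_normal(N)                          # stand-in gradient
Q, _ = np.linalg.qr(rng.standard_normal((N, D)))    # orthonormal basis, shape (N, D)
g_proj = Q @ (Q.T @ g)                              # projection of g onto span(Q)

retained = np.linalg.norm(g_proj) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of gradient energy retained: {retained:.3f}")  # roughly D/N
```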



1-million-token context window: now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.



Ulysses/Ring SP, or their combined hybrid, can be used to speed up inference nearly linearly with the number of GPUs on compute-bound workloads, up to a point (for example, image/video diffusion). As a quick reference, here's what we did in diffusers: github.com/huggingface/di… and github.com/huggingface/di…, plus a quick standalone test: github.com/a-r-r-o-w/prod…
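To make the Ulysses idea concrete, here is a single-process NumPy simulation (my own sketch under simplifying assumptions, not the diffusers implementation): each "rank" starts with a sequence shard holding all heads, an all-to-all swaps that into the full sequence for a subset of heads, attention runs locally per rank, and a second all-to-all restores the sequence sharding. The result matches single-device attention exactly, since attention is independent across heads.

```python
import numpy as np

rng = np.random.default_rng(0)
W, S, H, Dh = 4, 8, 4, 16  # world size, seq shard length, heads, head dim

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, k, v):
    # full-sequence multi-head attention, tensors shaped (seq, heads, Dh)
    scores = np.einsum("ihd,jhd->hij", q, k) / np.sqrt(Dh)
    return np.einsum("hij,jhd->ihd", softmax(scores), v)

q, k, v = (rng.standard_normal((W * S, H, Dh)) for _ in range(3))

def all_to_all(shards):
    # seq-sharded / all-heads  ->  full-seq / head-sharded
    hp = H // W
    return [np.concatenate([s[:, r * hp:(r + 1) * hp] for s in shards], axis=0)
            for r in range(W)]

def all_to_all_back(shards):
    # full-seq / head-sharded  ->  seq-sharded / all-heads
    return [np.concatenate([s[r * S:(r + 1) * S] for s in shards], axis=1)
            for r in range(W)]

q_sp, k_sp, v_sp = (np.split(t, W) for t in (q, k, v))       # per-rank seq shards
q_h, k_h, v_h = (all_to_all(t) for t in (q_sp, k_sp, v_sp))  # after all-to-all
o_h = [attn(q_h[r], k_h[r], v_h[r]) for r in range(W)]       # local attention per rank
o = np.concatenate(all_to_all_back(o_h), axis=0)             # restore seq sharding

assert np.allclose(o, attn(q, k, v))  # matches single-device attention
```

The per-rank attention work scales as 1/W in heads, which is where the near-linear speedup on compute-bound workloads comes from; in a real deployment the list comprehensions are replaced by `all_to_all` collectives over NCCL.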


Sometimes being GPU rich doesn't help you. Does anyone know of a way to keep the FSDP backward pass from materializing full layer grads in a reduce-scatter buffer? For GLM5 that's 40 GB of VRAM per layer just for that, no matter the FSDP world size (orange in the img).
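Back-of-envelope for why that buffer doesn't shrink with world size (all numbers below are my assumptions, not measured GLM5 values): the reduce-scatter *input* on each rank is the full unsharded layer gradient; only the *output* shard scales as 1/world_size.

```python
# Hypothetical numbers chosen to land near the 40 GB figure in the post.
params_per_layer = 10e9   # assumed parameter count for one giant layer
bytes_per_grad = 4        # assumed fp32 gradient buffer
world_size = 64           # assumed FSDP world size

# Every rank materializes the FULL gradient as reduce-scatter input...
full_grad_gb = params_per_layer * bytes_per_grad / 2**30
# ...and only keeps its 1/world_size shard afterwards.
shard_gb = full_grad_gb / world_size

print(f"peak reduce-scatter input: {full_grad_gb:.1f} GB, "
      f"persistent shard: {shard_gb:.2f} GB")
```

The peak input buffer is independent of `world_size`, which matches the observation that growing the FSDP group doesn't help.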




you can just do things when you're gpu rich (full post-train GLM5 being the things)














nanochat now trains a GPT-2-capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in, but the biggest difference was a switch of the dataset from FineWeb-edu to NVIDIA ClimbMix (nice work NVIDIA!). I had tried Olmo, FineWeb, and DCLM, which all led to regressions; ClimbMix worked really well out of the box (to the point that I am slightly suspicious about goodharting, though reading the paper it seems ~ok).

In other news, after trying a few approaches for how to set things up, I now have AI agents iterating on nanochat automatically, so I'll just leave this running for a while, go relax a bit and enjoy the feeling of post-AGI :). Visualized here as an example: 110 changes made over the last ~12 hours, bringing the validation loss so far from 0.862415 down to 0.858039 for a d12 model, at no cost to wall-clock time. The agent works on a feature branch, tries out ideas, merges them when they work, and iterates.

Amusingly, over the last ~2 weeks I almost feel like I've iterated more on the "meta-setup", where I optimize and tune the agent flows, than on the nanochat repo directly.
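The agent flow described above (branch, try an idea, merge only if it helps) is a plain accept-if-it-improves loop. A minimal sketch, with `evaluate` and `propose_change` as hypothetical stand-ins for the real training run and the agent's edit step (neither is a nanochat API):

```python
import random

def evaluate(params):
    # stand-in for "train the d12 model and measure validation loss"
    return sum((p - 0.5) ** 2 for p in params)

def propose_change(params, rng):
    # stand-in for the agent trying an idea on a feature branch
    i = rng.randrange(len(params))
    new = list(params)
    new[i] += rng.uniform(-0.1, 0.1)
    return new

rng = random.Random(0)
best = [0.9, 0.1, 0.3]
start_loss = best_loss = evaluate(best)

for _ in range(200):           # many small attempts, compressed in time
    cand = propose_change(best, rng)
    loss = evaluate(cand)
    if loss < best_loss:       # "merge the branch" only when it helps
        best, best_loss = cand, loss

print(f"loss improved from {start_loss:.4f} to {best_loss:.4f}")
```

Because changes are only merged when the measured loss drops, the sequence of accepted states is monotonically improving, at the cost of re-evaluating every candidate.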






