Garrett Goon

99 posts

@GoonGarrett

AI at IBM Research.

Joined May 2022
147 Following · 16 Followers
Matej Sirovatka@m_sirovatka·
You can just do things in prime-rl: teach GLM5 to answer math in <2000 tokens, using 16 nodes for training and 12 for inference in a 2P4D configuration, with only uv run rl @ rl.toml. (@samsja19 told me I should tweet more things)
[image attached]
Garrett Goon@GoonGarrett·
@StasBekman @aryanvs_ The speed up was due to throwing more GPUs at the workload, though, right? Baseline being a single GPU, other cases using N, and seeing a near linear decrease in overall runtime? Not a per-GPU improvement
Stas Bekman@StasBekman·
@GoonGarrett @aryanvs_ That. And that there was a speedup at all. In general, parallelism methods are all about feasibility and always come at a cost, hence the surprise.
Stas Bekman@StasBekman·
Here is an unexpected use case of using Ulysses SP to speed up inference. Thank you for running the extended benchmarks, @aryanvs_
Aryan V S@aryanvs_

Ulysses/Ring SP, or their combined hybrid, can be used to speed up inference nearly linearly with the number of GPUs on compute-bound workloads, up to some extent (for example, image/video diffusion). As a quick reference, here's what we did in diffusers: github.com/huggingface/di… (#L42) and github.com/huggingface/di… (#L1971), and a quick standalone test: github.com/a-r-r-o-w/prod…

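A toy, single-process sketch of why the Ulysses trick works (my own illustration, not the diffusers code; names are made up): attention is independent across heads, so an all-to-all that trades a sequence shard for a head shard lets each rank run ordinary full-sequence attention on its subset of heads, and concatenating head outputs matches the unsharded result.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # q, k, v: [seq, heads, dim]; plain scaled dot-product attention.
    scores = np.einsum('shd,thd->hst', q, k) / np.sqrt(q.shape[-1])
    return np.einsum('hst,thd->shd', softmax(scores), v)

def ulysses_attention(q, k, v, world=2):
    # Single-process simulation: in real Ulysses SP each rank starts with a
    # sequence shard, and an all-to-all gives it the FULL sequence for a
    # SUBSET of heads. Here we model the post-all-to-all state directly.
    S, H, D = q.shape
    hs = H // world  # heads per "rank" (assumes H % world == 0)
    per_rank = [
        attention(q[:, r*hs:(r+1)*hs], k[:, r*hs:(r+1)*hs], v[:, r*hs:(r+1)*hs])
        for r in range(world)
    ]
    # A second all-to-all would re-shard by sequence; concatenating heads
    # recovers the unsharded output for checking.
    return np.concatenate(per_rank, axis=1)
```

Since each head's attention is fully independent, the per-rank work shrinks by the world size, which is why the speedup is near-linear on compute-bound workloads.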
Matej Sirovatka@m_sirovatka·
You only need this PR included in your torch (github.com/pytorch/pytorc…, 2.10 at the earliest) for the mem leak with EP to go away. The mem screenshot is with pure FSDP, so the grad all_reduce buffer is huge. With EP=8 its size falls ~8x, which saves an insane amount of memory. I was running against vLLM with torch 2.9.
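Back-of-envelope for the ~8x claim (my own illustrative numbers, not from the screenshot): under pure FSDP every rank materializes gradients for all experts before the reduce, while with EP=8 each rank only holds the experts it owns, so the expert-gradient buffer shrinks by the EP degree.

```python
# Illustrative arithmetic only; the parameter counts are made up.
num_experts = 64
params_per_expert = 100e6     # 100M params per expert (assumed)
bytes_per_grad = 4            # fp32 gradients

# Pure FSDP/data-parallel: every rank buffers grads for all experts.
fsdp_buffer = num_experts * params_per_expert * bytes_per_grad

# EP=8: each rank owns 1/8 of the experts, so only their grads are buffered.
ep = 8
ep_buffer = fsdp_buffer / ep

print(f"FSDP grad buffer: {fsdp_buffer / 1e9:.1f} GB")
print(f"EP=8 grad buffer: {ep_buffer / 1e9:.1f} GB ({ep}x smaller)")
```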
Garrett Goon@GoonGarrett·
@m_sirovatka Yeah agreed, just thinking about how you could solve it if you really needed to
Matej Sirovatka@m_sirovatka·
@GoonGarrett yeah, not worth it IMO, very rare use case. It was fixed by EP anyway, which I originally just couldn't use. TP would do the same
Garrett Goon@GoonGarrett·
@m_sirovatka Add something like a fully_shard_bwd API to wrap modules whose grads should be reduce-scattered together. E.g. fully_shard on a whole transformer block, and wrap the MLP and attn sub-blocks each with fully_shard_bwd so their grads are bucket-reduced and freed earlier than by default. Maybe too complicated
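A pseudocode sketch of the proposal above (fully_shard is the real FSDP2 entry point; fully_shard_bwd is hypothetical and does not exist in torch):

```
# Pseudocode -- fully_shard_bwd is a hypothetical API, not part of torch.
from torch.distributed.fsdp import fully_shard

for block in model.blocks:
    # Hypothetical: mark sub-modules whose grads should be
    # reduce-scattered (and freed) together in backward, ahead of the
    # single block-level collective FSDP2 would otherwise launch.
    fully_shard_bwd(block.attn)
    fully_shard_bwd(block.mlp)
    # Real FSDP2: one all-gather per block in forward.
    fully_shard(block)
fully_shard(model)
```

The point is to decouple the backward reduce-scatter granularity (finer, so buffers free sooner) from the forward all-gather granularity (coarser, so fewer launches).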
Garrett Goon@GoonGarrett·
@m_sirovatka Yeah, was thinking the same. Basically enabling the backward bucketing strategy to differ from the forward one, rather than forcing both to consolidate all collectives into a single launch
Garrett Goon@GoonGarrett·
@stochasticchasm @m_sirovatka rakkit (GH handle, no idea if they're on here) has some hack around this. They're involved in many titan PRs. Does this affect more than metrics? Double counting doesn't matter for simple loss-free balancing; I can imagine it would in more complex cases
stochasm@stochasticchasm·
@m_sirovatka what's this bug? the token counting thing?
Garrett Goon@GoonGarrett·
@marksaroufim @tqchenml Super interesting! It was unclear to me from the write-up whether the agent loop had explicit knowledge of the benchmarking and eval code, or whether it was able to back out its properties from repeatedly interacting with it
Mark Saroufim@marksaroufim·
LLMs are now superhuman at reward hacking our kernel competitions. Natalia Kokoromyti was #1 on the last problem of the NVFP4 competition for around 10 minutes before we scrubbed the reward hack. I know of very few humans who can write such a hack. gpumode.com/news/reward-ha…
Eric W. Tramel@fujikanaeda·
what I’ve found to improve this situation: loops
- generate a timing & mem & correctness benchmark
- script the standard implementation & the new implementation to run against it
- have optimizing agents run against it repeatedly
same as any rewrite, really :)
Hōrōshi バガボンド@KatanaLarp

x.com/i/article/2029…

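The loop above can be sketched as a minimal harness (my own sketch; names are illustrative, and the memory measurement is omitted for brevity): verify the new implementation against the reference for correctness first, then time both.

```python
import time

def benchmark(ref_fn, new_fn, inputs, tol=1e-9):
    # Correctness gate: the new implementation must match the reference.
    for x in inputs:
        ref, new = ref_fn(x), new_fn(x)
        assert abs(ref - new) <= tol, f"mismatch at {x!r}: {ref} vs {new}"

    def timeit(fn):
        t0 = time.perf_counter()
        for x in inputs:
            fn(x)
        return time.perf_counter() - t0

    # An optimizing agent can be pointed at this dict and told to shrink new_s.
    return {"ref_s": timeit(ref_fn), "new_s": timeit(new_fn)}

# Example: a loop vs. the closed form for the sum of the first n integers.
ref = lambda n: sum(range(n + 1))
new = lambda n: n * (n + 1) // 2
results = benchmark(ref, new, inputs=range(1000))
```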
Garrett Goon@GoonGarrett·
@eliebakouch > it's also one of the best ways to benchmark frontier models and harnesses
Can you elaborate on this? Very curious!
Lucas Beyer (bl16)@giffmana·
I use ChatGPT Pulse and it's now tuned to give me updates on the latest PRs and fixes in pytorch/gpu/distributed land. So every other morning I wake up to news about some silent corruptions being fixed. This is not helping my already significant trust issues lol
[image attached]
Davis Blalock@davisblalock·
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
[image attached]
Garrett Goon@GoonGarrett·
@giffmana @davisblalock What about the Adam denominator? Haven't tried lower precision, but I assume higher precision is still beneficial/necessary there?
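A small self-contained illustration of why the denominator is precision-sensitive (my own example, not FlashOptim code): emulate bf16 storage by truncating a float32 to its top 16 bits, then compare the Adam update m / (sqrt(v) + eps) with the second moment v kept in fp32 vs. rounded to bf16. bf16's 7-bit mantissa gives relative steps of up to ~0.8%, which flows straight into the update when v is tiny.

```python
import math
import struct

def to_bf16(x: float) -> float:
    # Emulate bf16 storage: keep the top 16 bits of the float32 encoding
    # (sign + 8 exponent bits + 7 mantissa bits), truncating the rest.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def adam_update(m, v, eps=1e-8):
    # The parameter delta (before the learning rate) in a single Adam step.
    return m / (math.sqrt(v) + eps)

m, v = 1e-3, 2.5e-9           # second moment can be tiny late in training
exact = adam_update(m, v)      # denominator kept in fp32
rounded = adam_update(m, to_bf16(v))  # second moment round-tripped via bf16
rel_err = abs(rounded - exact) / exact
```

Here rel_err lands in the tenths of a percent, per step; with momentum the same truncation mostly just damps the running average, which is presumably why bf16 momentum "just works" while the denominator is riskier.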
Lucas Beyer (bl16)@giffmana·
@davisblalock To be fair though, it's reasonably well established (?) that momentum in bf16 just works. We've been doing that as default for many years now. Doesn't take away from all the improvements you do, but the baseline can be a bit stronger.