Pete Walsh

31 posts


@epwalsh

Research Engineer @allen_ai | Python | Rust | Neovim

Joined April 2024
164 Following · 131 Followers
finbarr @finbarrtimbers
What's the best ML config library?
13 replies · 0 reposts · 37 likes · 6.3K views
Pete Walsh @epwalsh
@saurabh_shah2 @finbarrtimbers We use FSDP2 now (what torchtitan uses) because it’s just DTensor under the hood, which plays nicely with other axes of parallelism. I don’t think there’s much difference in performance between the old FSDP and the new.
1 reply · 0 reposts · 1 like · 58 views
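For context on the FSDP2 / DTensor point above, here is a minimal, hedged sketch of the fully_shard API. This is not OLMo's or torchtitan's actual training code; the model is a placeholder, and the import path assumes a recent PyTorch where FSDP2's entry point lives under torch.distributed._composable.fsdp (newer releases also re-export it as torch.distributed.fsdp.fully_shard).

```python
# Minimal FSDP2 sketch (placeholder model, not OLMo's code).
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # path may differ by torch version

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()

# Unlike FSDP1's single flattened FlatParameter, FSDP2 shards each parameter
# as a DTensor over the device mesh, which is why it composes cleanly with
# tensor parallelism and other parallelism axes.
fully_shard(model)

out = model(torch.randn(8, 128, 1024, device="cuda"))
out.sum().backward()
```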
Saurabh Shah @saurabh_shah2
@finbarrtimbers I’m not sure if they still do but pretrain used FSDP when I worked w them, maybe poke Pete and ask
2 replies · 0 reposts · 0 likes · 1K views
finbarr @finbarrtimbers
Dumb torch question: why don’t people use torch.distributed.fsdp and instead use DeepSpeed/torchtitan/whatever? Is it that inefficient?
14 replies · 8 reposts · 176 likes · 28.9K views
Pete Walsh retweeted
Ai2 @allen_ai
Introducing olmOCR, our open-source tool to extract clean plain text from PDFs! Built for scale, olmOCR handles many document types with high throughput. Run it on your own GPU for free—at over 3000 token/s, equivalent to $190 per million pages, or 1/32 the cost of GPT-4o!
85 replies · 261 reposts · 1.9K likes · 281.8K views
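A quick back-of-envelope check of the cost figure in the announcement above. The tokens-per-page and GPU hourly price below are my own assumptions, not numbers from Ai2; they are chosen only to show how roughly 3,000 tokens/s can translate to about $190 per million pages.

```python
# Back-of-envelope for the "$190 per million pages" claim.
tokens_per_page = 1_000           # assumption: average output tokens per PDF page
throughput_tok_per_s = 3_000      # from the olmOCR announcement
gpu_dollars_per_hour = 2.05       # assumption: on-demand price for a single GPU

seconds = 1_000_000 * tokens_per_page / throughput_tok_per_s
cost = seconds / 3600 * gpu_dollars_per_hour
print(f"~${cost:.0f} per million pages")  # ~$190 under these assumptions
```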
Luca Soldaini 🎀 @soldni
Love that feeling of starting a ✨ brand new data project 🥰
3 replies · 0 reposts · 40 likes · 2.4K views
Pete Walsh retweeted
Kyle Lo @kylelostat
kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡
🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into:
🚖 stable pretrain
🚔 lr anneal
🤝 data curricula
🤝 soups
🚘 tulu post-train
🚜 compute infra
👇🧵
[image]
3 replies · 71 reposts · 365 likes · 47.3K views
Pete Walsh @epwalsh
@Tanishq97836660 Hey Tanishq, really interesting work. I'm curious whether your low-precision training setup involves keeping the main copy of the weights in full precision (like how torchao's Float8Linear works)?
0 replies · 0 reposts · 0 likes · 113 views
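For background on the question above, a hedged sketch of how torchao's float8 training path handles master weights. It assumes a recent torchao release with the convert_to_float8_training entry point; the model is a placeholder, and running the float8 matmuls requires recent GPU hardware.

```python
# Sketch of torchao float8 training (module path may vary by torchao version).
# Float8Linear keeps its parameters in the original high-precision dtype, so
# the optimizer updates bf16/fp32 master weights; weights and activations are
# dynamically cast to float8 only for the matmul.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).to(torch.bfloat16).cuda()

convert_to_float8_training(model)  # swaps nn.Linear -> Float8Linear in place

# Master weights stay in high precision rather than float8.
for p in model.parameters():
    assert p.dtype in (torch.bfloat16, torch.float32)
```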
Tanishq Kumar @tanishqkumar07
[3/7] We then turn our attention to training in low precision. We study both quantization-aware training (weights only) and low-precision training (everything in low precision). We decompose the model into weights, activations, and KV cache, finding scaling laws for loss when any of these are quantized to any precision, and develop a compositional and interpretable functional form to predict the effect on loss of quantizing any combination of the three during pretraining.
2 replies · 0 reposts · 40 likes · 10.2K views
Tanishq Kumar @tanishqkumar07
[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: arxiv.org/pdf/2411.04330. TLDR:
- Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!
Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.
[image]
23 replies · 156 reposts · 839 likes · 761.1K views
Pete Walsh retweeted
Niklas Muennighoff @Muennighoff
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments
📜 arxiv.org/abs/2409.02060 🧵1/9
[image]
23 replies · 225 reposts · 931 likes · 203.4K views
Pete Walsh retweeted
Jesse Dodge @JesseDodge
Congrats to our team for winning two paper awards at #ACL2024! OLMo won the Best Theme Paper award, and Dolma won a Best Resource Paper award! All the credit goes to the whole team for the massive group effort 🎉🎉
[4 images]
11 replies · 42 reposts · 243 likes · 52.9K views
Saurabh Shah @saurabh_shah2
Ok this is cool….
1 reply · 0 reposts · 7 likes · 489 views
Pete Walsh @epwalsh
I attempted the Three Sisters Ski Traverse in one day with a buddy earlier this week. Despite the seemingly endless number of transitions between booting, skinning, and skiing, there were some great moments like standing on top of Middle Sister and skiing perfect corn snow ⛷️
3 replies · 3 reposts · 9 likes · 870 views
Pete Walsh @epwalsh
Thankfully we got to the road just before sunset and got to watch this very cool moonrise above Mt Bachelor to close out the day 🌔
[image]
0 replies · 0 reposts · 2 likes · 98 views
Pete Walsh @epwalsh
It took us 14 hours to cover 18.5 miles with 7.5k feet of vertical gain, though we had to bail short of the summit of South Sister as we were running out of daylight.
[image]
1 reply · 0 reposts · 0 likes · 132 views
Pete Walsh @epwalsh
But there were also many "wtf are we doing" moments, like walking for hours to get to the snow line or realizing our planned route up the final peak (South Sister) and down to our pickup location wasn't going to be straightforward.
[4 images]
1 reply · 0 reposts · 0 likes · 170 views
Pete Walsh @epwalsh
Hey babe want to watch W&B curves tonight?
1 reply · 0 reposts · 8 likes · 784 views
Saurabh Shah @saurabh_shah2
@epwalsh LMAO it’s actually ~/programming-massively-parallel-processors which is the name of the textbook I’m learning from
1 reply · 0 reposts · 2 likes · 76 views
Saurabh Shah @saurabh_shah2
No one else like me
[image]
4 replies · 0 reposts · 13 likes · 1.5K views
Saurabh Shah @saurabh_shah2
@epwalsh Only thing that could make you cooler at this point is if you snowboarded instead of skied
1 reply · 0 reposts · 0 likes · 29 views