Pete Walsh
31 posts

Pete Walsh
@epwalsh
Research Engineer @allen_ai | Python | Rust | Neovim
Joined April 2024
164 Following · 131 Followers


@soldni @finbarrtimbers This is it. But if you want to use dataclasses AND be able to serialize/deserialize to/from JSON/YAML, I wrote a little library for that: github.com/epwalsh/datacl…
Eventually we're going to replace omegaconf with that in olmo-core.
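For context, a minimal sketch of the pattern being described, using only stdlib dataclasses plus PyYAML; this is not the API of the linked library, just an illustration of dataclass config round-tripping:

```python
# Minimal sketch of dataclass <-> JSON/YAML round-tripping with the standard
# library plus PyYAML. NOT the API of the linked library; it only illustrates
# the pattern the tweet refers to.
import json
from dataclasses import dataclass, asdict

import yaml  # pip install pyyaml


@dataclass
class TrainConfig:
    model_name: str
    lr: float = 3e-4
    warmup_steps: int = 1000


def to_json(cfg: TrainConfig) -> str:
    return json.dumps(asdict(cfg), indent=2)


def from_yaml(text: str) -> TrainConfig:
    # Nested dataclasses would need recursive handling; kept flat here.
    return TrainConfig(**yaml.safe_load(text))


if __name__ == "__main__":
    cfg = TrainConfig(model_name="olmo-7b")
    print(to_json(cfg))
    print(from_yaml("model_name: olmo-1b\nlr: 0.0001\n"))
```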

@saurabh_shah2 @finbarrtimbers We use FSDP2 now (what torchtitan uses) because it’s just DTensor under the hood, which plays nicely with other axes of parallelism. I don’t think there’s much difference in performance between the old FSDP and the new.
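A rough sketch of how FSDP2 is typically applied (per transformer block, then the root module, as torchtitan does). The `fully_shard` import path is version-dependent (it lives under `torch.distributed._composable.fsdp` in PyTorch 2.4/2.5) and the code assumes an already-initialized process group, so treat this as an illustration rather than a drop-in recipe:

```python
# Sketch: apply FSDP2's fully_shard to each block, then to the root module,
# so parameters become DTensors sharded over the device mesh. Assumes
# torch.distributed is initialized (e.g. under torchrun); import path varies
# by PyTorch version.
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard


class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)


def shard_model(model: nn.Module) -> nn.Module:
    # Shard each block first so its params/grads become DTensors...
    for module in model.modules():
        if isinstance(module, Block):
            fully_shard(module)
    # ...then wrap the root module to shard everything that's left.
    fully_shard(model)
    return model
```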

@finbarrtimbers I’m not sure if they still do, but pretraining used FSDP when I worked with them; maybe poke Pete and ask.
Pete Walsh retweeted

@Tanishq97836660 Hey Tanishq, really interesting work. I'm curious if your low-precision training setup involves keeping the main copy of the weights in full precision (like with how torchao's Float8Linear works)?
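The pattern the question refers to, illustrated naively: keep a full-precision master copy of the weights, run the matmul on a low-precision copy, and let the optimizer update the master weights. This is not how torchao's Float8Linear is implemented (bf16 stands in for fp8 here); it only sketches the idea:

```python
# Naive "high-precision master weights" sketch: compute in low precision,
# but keep and update an fp32 master copy. NOT torchao's Float8Linear.
import torch
import torch.nn as nn


class MasterWeightLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # fp32 master weight that the optimizer sees and updates.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Downcast a copy of the weight just for the matmul; gradients still
        # flow back to the fp32 master parameter through the cast.
        w_low = self.weight.to(torch.bfloat16)
        return (x.to(torch.bfloat16) @ w_low.t()).to(torch.float32)
```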

[3/7] We then turn our attention to training in low precision. We study both quantization-aware training (weights only) and low-precision training (everything in low precision). We decompose the model into weights, activations, and KV cache, finding scaling laws for loss when any of these are quantized to any precision, and develop a compositional and interpretable functional form to predict the effect on loss of quantizing any combination of the three during pretraining.
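As a rough sketch of how such a compositional form could look (a Chinchilla-style baseline where each quantized component scales an effective parameter count); the paper's exact parameterization and fitted constants may differ:

```latex
% Hedged sketch only: Chinchilla-style loss with a precision-dependent
% multiplier per quantized component (weights w, activations a, KV cache kv).
% The paper's actual functional form may differ.
\[
  L(N, D, P_w, P_a, P_{kv})
    \;\approx\; \frac{A}{N_{\mathrm{eff}}^{\alpha}} + \frac{B}{D^{\beta}} + E,
  \qquad
  N_{\mathrm{eff}} = N \prod_{x \in \{w,\,a,\,kv\}} \bigl(1 - e^{-P_x/\gamma_x}\bigr)
\]
```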

[1/7] New paper alert! Heard about the BitNet hype, or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: arxiv.org/pdf/2411.04330. TL;DR:
- Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!
Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.

Pete Walsh retweeted

Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments
📜arxiv.org/abs/2409.02060
🧵1/9

Pete Walsh retweeted

Congrats to our team for winning two paper awards at #ACL2024!
OLMo won the Best Theme Paper award, and Dolma won a Best Resource Paper award!
All the credit goes to the whole team for the massive group effort 🎉🎉

Also, thanks to Rodney Kinney, @AnanyaHarsh, Pete Walsh, and @davidjwadden for contributing to Rusty-DAWG as part of the @allen_ai hackathon!

📜New preprint w/ @nlpnoah and @yanaiela that evaluates the novelty of LM-generated text using our n-gram search tool Rusty-DAWG 🐶
Code: github.com/viking-sudo-rm…
Paper: arxiv.org/abs/2406.13069
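A naive illustration of the kind of measurement such a tool enables (the longest n-gram in a generation that also appears in a reference corpus). Rusty-DAWG uses an indexed data structure to do this efficiently over large corpora; the brute-force sets below are just for illustration:

```python
# Brute-force n-gram novelty check: find the longest n-gram in a generated
# text that also occurs in a reference corpus. Illustration only; NOT how
# Rusty-DAWG is implemented.
def longest_matching_ngram(generated: list[str], corpus: list[str], max_n: int = 10) -> int:
    corpus_ngrams = {
        tuple(corpus[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(corpus) - n + 1)
    }
    best = 0
    for n in range(1, max_n + 1):
        for i in range(len(generated) - n + 1):
            if tuple(generated[i : i + n]) in corpus_ngrams:
                best = max(best, n)
    return best


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog".split()
    gen = "a quick brown fox ran away".split()
    print(longest_matching_ngram(gen, corpus))  # -> 3 ("quick brown fox")
```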

@saurabh_shah2 Oh cool I was just looking for some light reading material

@epwalsh LMAO it’s actually ~/programming-massively-parallel-processors which is the name of the textbook I’m learning from

@epwalsh Only thing that could make you cooler at this point is if you snowboarded instead of skied

To train a neural network you have to become one.
Feel the gradient descent for yourself.
Very bullish on AI2/olmo 🚀
Pete Walsh @epwalsh
Closing day pond skim at Mt Bachelor ⛷️