Nolan Dey

34 posts

@DeyNolan

Research Scientist @ Cerebras Systems

Toronto · Joined March 2022
58 Following · 475 Followers
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
Another new preprint from @cerebras 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. arxiv.org/abs/2509.25380
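A toy sketch of the scheduling question the paper studies: pick the point in training where the high-quality mixture kicks in, rather than always saving it for the very end. The switch fraction and the `phase_at_step` helper below are illustrative placeholders, not the TREC-predicted optimum from the paper.

```python
# Toy two-phase data curriculum: a base mixture, then a high-quality
# mixture starting at a chosen fraction of training.  Illustrative only;
# the switch point below is arbitrary, not the paper's predicted optimum.

def phase_at_step(step: int, total_steps: int, switch_frac: float = 0.7) -> str:
    """Return which data mixture to sample from at a given step."""
    return "high_quality" if step >= switch_frac * total_steps else "base"

total_steps = 100_000
# Count how many steps each phase receives under this schedule.
counts = {"base": 0, "high_quality": 0}
for step in range(total_steps):
    counts[phase_at_step(step, total_steps)] += 1
print(counts)  # {'base': 70000, 'high_quality': 30000}
```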
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
(1/4) @cerebras Hot off the presses 🔥📄arxiv.org/abs/2509.25087 If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.
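A rough sketch of what such an early-warning signal could look like in practice: compare a run's loss at a given fraction of training against a reference path measured on earlier runs. The reference values, tolerance, and `on_track` helper are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Toy early-warning check: compare a run's loss at a given fraction of
# training against a reference curve from earlier (smaller) runs.  The
# numbers below are made up; the paper's claim is that with the AdamW
# timescale tuned and tokens-per-parameter fixed, runs of different sizes
# trace out (roughly) one universal loss curve.

ref_frac = np.array([0.1, 0.25, 0.5, 0.75, 1.0])   # fraction of training done
ref_loss = np.array([3.8, 3.3, 3.0, 2.85, 2.75])   # hypothetical reference losses

def on_track(frac_done: float, loss: float, tol: float = 0.05) -> bool:
    """Flag a run whose loss deviates from the reference path by more than tol."""
    expected = np.interp(frac_done, ref_frac, ref_loss)
    return abs(loss - expected) <= tol

print(on_track(0.4, 3.10))  # True: close to the interpolated reference
print(on_track(0.4, 3.40))  # False: far above the expected path
```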
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
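A back-of-the-envelope sketch of the quantity involved, assuming the usual definition of the AdamW timescale as roughly 1/(eta * lambda) optimizer steps, i.e. batch_size/(eta * lambda) tokens. Pinning that timescale to a fixed fraction of the token budget, and the particular fraction used below, are simplifications for illustration rather than the paper's fitted scaling laws; the helper functions are hypothetical.

```python
# With learning rate eta and weight decay lam, AdamW decays weights by a
# factor (1 - eta*lam) each step, giving a characteristic timescale of
# about 1/(eta*lam) steps, or batch_tokens/(eta*lam) tokens.  Holding that
# timescale at a fixed fraction of the token budget implies lam grows with
# batch size and shrinks with dataset size.  Fraction below is a placeholder.

def timescale_tokens(eta: float, lam: float, batch_tokens: int) -> float:
    """Approximate AdamW weight-decay timescale, measured in tokens."""
    return batch_tokens / (eta * lam)

def weight_decay_for_fraction(eta: float, batch_tokens: int,
                              total_tokens: float, frac: float = 0.5) -> float:
    """Weight decay that places the timescale at `frac` of the token budget."""
    return batch_tokens / (eta * frac * total_tokens)

eta = 1e-2
batch_tokens = 2 * 1024 * 1024          # 2M tokens per batch
total_tokens = 100e9                    # 100B-token run
lam = weight_decay_for_fraction(eta, batch_tokens, total_tokens)
print(lam, timescale_tokens(eta, lam, batch_tokens) / total_tokens)  # ~0.0042, 0.5
```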
Mihai Nica@MihaiCNica·
@roydanroy @TheGregYang Thanks for the post! @DeyNolan was part of a group of students at UW I gave a tutorial to on some of these limits back in 2021. Amazing to see how far it's come!
Dan Roy@roydanroy·
This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the realization of years of careful theoretical research.

It starts with Greg Yang (@TheGregYang) opening up research on muP scaling and hyperparameter transfer in infinite-width models. Simultaneously, infinite-depth scaling was studied by Boris Hanin (@BorisHanin), Mihai Nica (@MihaiCNica), Mufan Li (@mufan_li), and Soufiane Hayou (@hayou_soufiane), including in networks with residual connections. This builds further with the study of infinite-depth scalings and Transformers by Lorenzo Noci (@lorenzo_noci), Blake Bordelon (@blake__bordelon), Mufan, Chuning Li (@ChuningLi), Hamzat Chaudhuri (@hamzatchaudhry), Boris, and Cengiz (@CPehlevan) in at least 3-4 papers, in particular using the DMFT framework.

My understanding is that translating these insights into this work was highly nontrivial, so congrats to Cerebras for seeing it through with this great team. I also think this work could serve as a wake-up call to those in industry who reacted to muP saying "yeah yeah yeah we ended up at effectively the same place through careful scrutiny". I'd love to know which labs landed here, if any. If not, it goes to show you cannot have everyone grinding code. You need fundamental research to fuel BIG leaps.
Nolan Dey@DeyNolan

(1/7) @cerebras Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇

Nolan Dey@DeyNolan·
(6/7) Implementing CompleteP is very simple, requiring only two lines of code. We provide a minimal implementation here: github.com/EleutherAI/nan…
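The actual two-line change lives in the linked repo. As a loose illustration of the depth-scaling idea, here is a toy residual block that multiplies its branch output by 1/n_layers, which is one reading of the alpha = 1 parameterization described in the paper; the accompanying learning-rate and LayerNorm adjustments are omitted, and the class below is hypothetical rather than the repo's code.

```python
import torch
import torch.nn as nn

# Rough illustration of a depth-scaled residual branch.  Assumes the
# alpha = 1 setting in the CompleteP paper corresponds to multiplying each
# residual-branch output by 1/n_layers; the paper and linked repo are the
# authoritative references for the full parameterization.

class ScaledResidualBlock(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.branch_scale = 1.0 / n_layers  # depth-dependent residual multiplier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch_scale * self.mlp(self.norm(x))

x = torch.randn(2, 16, 64)
block = ScaledResidualBlock(d_model=64, n_layers=32)
print(block(x).shape)  # torch.Size([2, 16, 64])
```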
Nolan Dey@DeyNolan·
(1/7) @cerebras Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇
Nolan Dey@DeyNolan·
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…
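A small sketch of the comparison: rank a layer's neurons by activation variance versus ranking principal components by explained variance. The random activations and the variance-based notion of "importance" are stand-ins for illustration, not the paper's exact setup.

```python
import numpy as np

# Compare two ways of explaining a layer: its most "important" neurons vs.
# its most important principal components.  Random activations stand in
# for a real model here; importance is measured as variance explained.

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))          # (examples, hidden units)

# Top neurons by activation variance.
neuron_var = acts.var(axis=0)
top_neurons = np.argsort(neuron_var)[::-1][:5]

# Top principal components by explained variance.
centered = acts - acts.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / (s**2).sum()

print("top neurons:", top_neurons)
print("variance in top-5 neurons:", neuron_var[top_neurons].sum() / neuron_var.sum())
print("variance in top-5 PCs:", explained[:5].sum())
```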
Nolan Dey retweeted
EleutherAI@AiEleuther·
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
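A rough sketch of the width-scaling rules the guide covers, at the level of a single hidden layer: initialization variance shrinks with fan-in, and the Adam learning rate is divided by the width multiplier relative to a small base model. The base width and learning rate below are placeholders; the guide and its reference implementation cover the full rule set (embeddings, output layer, attention scaling).

```python
import torch
import torch.nn as nn

# muP-style width scaling for a hidden layer, at a high level: init
# variance ~ 1/fan_in, and the Adam LR for hidden weights divided by the
# width multiplier relative to a small base model.  Placeholder numbers;
# see the linked practitioner's guide for the complete recipe.

base_width, width = 256, 1024
mult = width / base_width                            # width multiplier m

hidden = nn.Linear(width, width)
nn.init.normal_(hidden.weight, std=width ** -0.5)    # variance ~ 1/fan_in

base_lr = 1e-2
optimizer = torch.optim.AdamW(
    [{"params": hidden.parameters(), "lr": base_lr / mult}]  # hidden LR ~ 1/m
)
print(optimizer.param_groups[0]["lr"])  # 0.0025
```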
Nolan Dey retweeted
Cerebras@cerebras·
(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
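A loose reading of the abstract, sketched in code: correct a sparse layer's initialization and learning rate for its effective fan-in (density times width), so the optimal hyperparameters stay put as sparsity changes. The exact SμPar scaling rules are in the paper; treat this as an assumption-labelled approximation, not the paper's parameterization.

```python
import torch
import torch.nn as nn

# Sketch of the intuition behind SμPar as read from the abstract: scale
# the hidden-layer init and LR by the *effective* fan-in (density * width),
# analogous to how muP scales them with width.  Approximation only; the
# paper gives the actual rules.

width, density = 1024, 0.25                # 75% of weights pruned
fan_in_eff = density * width

layer = nn.Linear(width, width)
nn.init.normal_(layer.weight, std=fan_in_eff ** -0.5)  # variance ~ 1/(density*fan_in)

base_lr, base_width = 1e-2, 256
lr = base_lr * base_width / fan_in_eff                  # LR ~ 1/effective width multiplier
print(round(lr, 4))  # 0.01
```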
Nolan Dey retweeted
Vithu Thangarasa@vithursant19·
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of training GPT style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
Nolan Dey retweeted
Cerebras@cerebras·
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high quality inference at up to 16K with no additional fine tuning. 👇
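One simple way to read the method, sketched below: ALiBi biases each attention logit by -slope * distance, and dividing the distances by a scale factor (2 here, matching the 2x claim) keeps the bias range of a doubled context in the regime the model saw during training. The `alibi_bias` helper and the scale factor are illustrative; the paper has the precise formulation.

```python
import numpy as np

# ALiBi adds a linear penalty -slope * (query_pos - key_pos) to attention
# logits.  Dividing distances by a scale factor halves every penalty, so a
# 2x-longer context spans the same bias range as the original length.
# Illustrative reading of "position interpolation", not the paper's code.

def alibi_bias(seq_len: int, slope: float, scale: float = 1.0) -> np.ndarray:
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]          # (query, key) relative distance
    dist = np.maximum(dist, 0)                  # causal: only past tokens matter
    return -slope * dist / scale

print(alibi_bias(3, slope=0.5))             # plain ALiBi bias
print(alibi_bias(3, slope=0.5, scale=2.0))  # interpolated: penalties halved
```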
Nolan Dey retweeted
Cerebras@cerebras·
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs:
- Extensively deduplicated dataset (SlimPajama)
- Hyperparameter search using muP
- Variable sequence length training + ALiBi
- Aggressive LR decay
arxiv.org/abs/2309.11568
Nolan Dey retweeted
Cerebras@cerebras·
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌