Nolan Dey

34 posts

@DeyNolan

Research Scientist @ Cerebras Systems

Toronto · Joined March 2022
58 Following · 475 Followers
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
Another new preprint from @cerebras 🚨📄- this time on training *re-evaluation* curves (TRECs) for data curriculums in LLMs. Everyone sticks high-quality data at the end of training… we show the sweet spot is often earlier — and we can predict it. arxiv.org/abs/2509.25380
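A toy sketch of the scheduling question the paper studies: pick the point in training where the high-quality mixture kicks in, rather than always saving it for the very end. The switch fraction and the `phase_at_step` helper below are illustrative placeholders, not the TREC-predicted optimum from the paper.

```python
# Toy two-phase data curriculum: a base mixture, then a high-quality
# mixture starting at a chosen fraction of training.  Illustrative only;
# the switch point below is arbitrary, not the paper's predicted optimum.

def phase_at_step(step: int, total_steps: int, switch_frac: float = 0.7) -> str:
    """Return which data mixture to sample from at a given step."""
    return "high_quality" if step >= switch_frac * total_steps else "base"

total_steps = 100_000
# Count how many steps each phase receives under this schedule.
counts = {"base": 0, "high_quality": 0}
for step in range(total_steps):
    counts[phase_at_step(step, total_steps)] += 1
print(counts)  # {'base': 70000, 'high_quality': 30000}
```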
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
(1/4) @cerebras Hot off the presses 🔥📄arxiv.org/abs/2509.25087 If you're spending $1B to train an LLM, you need to know it’s on track—every step of the way. With optimal AdamW τ + fixed TPP, loss curves collapse to a universal path → an early-warning signal for training.
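A rough sketch of what such an early-warning signal could look like in practice: compare a run's loss at a given fraction of training against a reference path measured on earlier runs. The reference values, tolerance, and `on_track` helper are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Toy early-warning check: compare a run's loss at a given fraction of
# training against a reference curve from earlier (smaller) runs.  The
# numbers below are made up; the paper's claim is that with the AdamW
# timescale tuned and tokens-per-parameter fixed, runs of different sizes
# trace out (roughly) one universal loss curve.

ref_frac = np.array([0.1, 0.25, 0.5, 0.75, 1.0])   # fraction of training done
ref_loss = np.array([3.8, 3.3, 3.0, 2.85, 2.75])   # hypothetical reference losses

def on_track(frac_done: float, loss: float, tol: float = 0.05) -> bool:
    """Flag a run whose loss deviates from the reference path by more than tol."""
    expected = np.interp(frac_done, ref_frac, ref_loss)
    return abs(loss - expected) <= tol

print(on_track(0.4, 3.10))  # True: close to the interpolated reference
print(on_track(0.4, 3.40))  # False: far above the expected path
```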
Nolan Dey retweeted
Shane Bergsma@ShaneBergsma·
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
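A back-of-the-envelope sketch of the quantity involved, assuming the usual definition of the AdamW timescale as roughly 1/(eta * lambda) optimizer steps, i.e. batch_size/(eta * lambda) tokens. Pinning that timescale to a fixed fraction of the token budget, and the particular fraction used below, are simplifications for illustration rather than the paper's fitted scaling laws; the helper functions are hypothetical.

```python
# With learning rate eta and weight decay lam, AdamW decays weights by a
# factor (1 - eta*lam) each step, giving a characteristic timescale of
# about 1/(eta*lam) steps, or batch_tokens/(eta*lam) tokens.  Holding that
# timescale at a fixed fraction of the token budget implies lam grows with
# batch size and shrinks with dataset size.  Fraction below is a placeholder.

def timescale_tokens(eta: float, lam: float, batch_tokens: int) -> float:
    """Approximate AdamW weight-decay timescale, measured in tokens."""
    return batch_tokens / (eta * lam)

def weight_decay_for_fraction(eta: float, batch_tokens: int,
                              total_tokens: float, frac: float = 0.5) -> float:
    """Weight decay that places the timescale at `frac` of the token budget."""
    return batch_tokens / (eta * frac * total_tokens)

eta = 1e-2
batch_tokens = 2 * 1024 * 1024          # 2M tokens per batch
total_tokens = 100e9                    # 100B-token run
lam = weight_decay_for_fraction(eta, batch_tokens, total_tokens)
print(lam, timescale_tokens(eta, lam, batch_tokens) / total_tokens)  # ~0.0042, 0.5
```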
Mihai Nica@MihaiCNica·
@roydanroy @TheGregYang Thanks for the post! @DeyNolan was part of a group of students at UW I gave a tutorial to on some of these limits back in 2021. Amazing to see how far it's come!
Dan Roy@roydanroy·
This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the realization of years of careful theoretical research.

It starts with Greg Yang (@TheGregYang) opening up research on muP scaling and hyperparameter transfer in infinite-width models. Simultaneously, infinite-depth scaling was studied by Boris Hanin (@BorisHanin), Mihai Nica (@MihaiCNica), Mufan Li (@mufan_li), and Soufiane Hayou (@hayou_soufiane), including in networks with residual connections. This builds further with the study of infinite-depth scalings and Transformers by Lorenzo Noci (@lorenzo_noci), Blake Bordelon (@blake__bordelon), Mufan, Chuning Li (@ChuningLi), Hamzat Chaudhuri (@hamzatchaudhry), Boris, and Cengiz (@CPehlevan) in at least 3-4 papers, in particular using the DMFT framework.

My understanding is that translating these insights into this work was highly nontrivial, so congrats to Cerebras for seeing it through with this great team. I also think this work could serve as a wake-up call to those in industry who reacted to muP saying "yeah yeah yeah we ended up at effectively the same place through careful scrutiny". I'd love to know which labs landed here, if any. If not, it goes to show you cannot have everyone grinding code. You need fundamental research to fuel BIG leaps.
Nolan Dey@DeyNolan

(1/7) @cerebras Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇

Nolan Dey@DeyNolan·
(6/7) Implementing CompleteP is very simple, requiring only two lines of code. We provide a minimal implementation here: github.com/EleutherAI/nan…
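The actual two-line change lives in the linked repo. As a loose illustration of the depth-scaling idea, here is a toy residual block that multiplies its branch output by 1/n_layers, which is one reading of the alpha = 1 parameterization described in the paper; the accompanying learning-rate and LayerNorm adjustments are omitted, and the class below is hypothetical rather than the repo's code.

```python
import torch
import torch.nn as nn

# Rough illustration of a depth-scaled residual branch.  Assumes the
# alpha = 1 setting in the CompleteP paper corresponds to multiplying each
# residual-branch output by 1/n_layers; the paper and linked repo are the
# authoritative references for the full parameterization.

class ScaledResidualBlock(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.branch_scale = 1.0 / n_layers  # depth-dependent residual multiplier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch_scale * self.mlp(self.norm(x))

x = torch.randn(2, 16, 64)
block = ScaledResidualBlock(d_model=64, n_layers=32)
print(block(x).shape)  # torch.Size([2, 16, 64])
```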
Nolan Dey@DeyNolan·
(1/7) @cerebras Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇
Nolan Dey@DeyNolan·
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…
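A small sketch of the comparison: rank a layer's neurons by activation variance versus ranking principal components by explained variance. The random activations and the variance-based notion of "importance" are stand-ins for illustration, not the paper's exact setup.

```python
import numpy as np

# Compare two ways of explaining a layer: its most "important" neurons vs.
# its most important principal components.  Random activations stand in
# for a real model here; importance is measured as variance explained.

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))          # (examples, hidden units)

# Top neurons by activation variance.
neuron_var = acts.var(axis=0)
top_neurons = np.argsort(neuron_var)[::-1][:5]

# Top principal components by explained variance.
centered = acts - acts.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / (s**2).sum()

print("top neurons:", top_neurons)
print("variance in top-5 neurons:", neuron_var[top_neurons].sum() / neuron_var.sum())
print("variance in top-5 PCs:", explained[:5].sum())
```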
Nolan Dey retweeted
EleutherAI@AiEleuther·
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
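A rough sketch of the width-scaling rules the guide covers, at the level of a single hidden layer: initialization variance shrinks with fan-in, and the Adam learning rate is divided by the width multiplier relative to a small base model. The base width and learning rate below are placeholders; the guide and its reference implementation cover the full rule set (embeddings, output layer, attention scaling).

```python
import torch
import torch.nn as nn

# muP-style width scaling for a hidden layer, at a high level: init
# variance ~ 1/fan_in, and the Adam LR for hidden weights divided by the
# width multiplier relative to a small base model.  Placeholder numbers;
# see the linked practitioner's guide for the complete recipe.

base_width, width = 256, 1024
mult = width / base_width                            # width multiplier m

hidden = nn.Linear(width, width)
nn.init.normal_(hidden.weight, std=width ** -0.5)    # variance ~ 1/fan_in

base_lr = 1e-2
optimizer = torch.optim.AdamW(
    [{"params": hidden.parameters(), "lr": base_lr / mult}]  # hidden LR ~ 1/m
)
print(optimizer.param_groups[0]["lr"])  # 0.0025
```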
Nolan Dey retweeted
Cerebras@cerebras·
(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
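A loose reading of the abstract, sketched in code: correct a sparse layer's initialization and learning rate for its effective fan-in (density times width), so the optimal hyperparameters stay put as sparsity changes. The exact SμPar scaling rules are in the paper; treat this as an assumption-labelled approximation, not the paper's parameterization.

```python
import torch
import torch.nn as nn

# Sketch of the intuition behind SμPar as read from the abstract: scale
# the hidden-layer init and LR by the *effective* fan-in (density * width),
# analogous to how muP scales them with width.  Approximation only; the
# paper gives the actual rules.

width, density = 1024, 0.25                # 75% of weights pruned
fan_in_eff = density * width

layer = nn.Linear(width, width)
nn.init.normal_(layer.weight, std=fan_in_eff ** -0.5)  # variance ~ 1/(density*fan_in)

base_lr, base_width = 1e-2, 256
lr = base_lr * base_width / fan_in_eff                  # LR ~ 1/effective width multiplier
print(round(lr, 4))  # 0.01
```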
Nolan Dey retweeted
Vithu Thangarasa@vithursant19·
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, possibly enabling quick prototyping of training GPT style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
Nolan Dey retweeted
Cerebras@cerebras·
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high quality inference at up to 16K with no additional fine tuning. 👇
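One simple way to read the method, sketched below: ALiBi biases each attention logit by -slope * distance, and dividing the distances by a scale factor (2 here, matching the 2x claim) keeps the bias range of a doubled context in the regime the model saw during training. The `alibi_bias` helper and the scale factor are illustrative; the paper has the precise formulation.

```python
import numpy as np

# ALiBi adds a linear penalty -slope * (query_pos - key_pos) to attention
# logits.  Dividing distances by a scale factor halves every penalty, so a
# 2x-longer context spans the same bias range as the original length.
# Illustrative reading of "position interpolation", not the paper's code.

def alibi_bias(seq_len: int, slope: float, scale: float = 1.0) -> np.ndarray:
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]          # (query, key) relative distance
    dist = np.maximum(dist, 0)                  # causal: only past tokens matter
    return -slope * dist / scale

print(alibi_bias(3, slope=0.5))             # plain ALiBi bias
print(alibi_bias(3, slope=0.5, scale=2.0))  # interpolated: penalties halved
```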
Nolan Dey retweeted
Cerebras@cerebras·
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs:
- Extensively deduplicated dataset (SlimPajama)
- Hyperparameter search using muP
- Variable sequence length training + ALiBi
- Aggressive LR decay
arxiv.org/abs/2309.11568
Nolan Dey retweeted
Cerebras@cerebras·
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌