Nicholas Lourie @NickLourie

55 posts

Better empirical methods for deep learning. PhD at @nyuniversity (@CILVRatNYU). Advised by @kchonyc and @hhexiy. Prev: @allen_ai. I build things. 🤖

New York, NY · Joined March 2014
818 Following · 1.5K Followers
Pinned Tweet
Nicholas Lourie @NickLourie
LLMs are expensive—experiments cost a lot, mistakes even more. How do you make experiments cheap and reliable? By using hyperparameters' empirical structure. @kchonyc, @hhexiy, and I show you how in Hyperparameter Loss Surfaces Are Simple Near their Optima at #COLM2025! 🧵1/9
[GIF]
2 replies · 10 reposts · 32 likes · 14.1K views
Nicholas Lourie retweeted
Michael Hu @michahu8
if you truly believe in the bitter lesson, then why hand design scaling laws? introducing: neural neural scaling laws (NeuNeu), a neural network - trained on open-source LM trajectories - that predicts LMs' future downstream task performance 🧵👇
[image]
4 replies · 30 reposts · 205 likes · 19.3K views
Nicholas Lourie @NickLourie
If you're at #COLM2025, come say hi! We're presenting as Poster 67 at Poster Session 4 this afternoon!
0 replies · 0 reposts · 0 likes · 292 views
Nicholas Lourie @NickLourie
Thanks for the references!😁 This gets at the heart of our message: even for a fixed task, sometimes downstream scaling is predictable, other times it isn't, and we don't know why. What factors in your experiment made scaling laws work? It's a question we should try to answer.
0 replies · 0 reposts · 0 likes · 39 views
Michael Hu @michahu8
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜arxiv.org/abs/2507.00885 🧵1/5👇
[image]
4 replies · 37 reposts · 276 likes · 30.3K views
Nicholas Lourie @NickLourie
Interesting question! Switching from loss to compute would change the curve's shape, but it wouldn't make it linear (and easy to extrapolate). The curve's derivative is discontinuous at the breaks, and smoothly changing the x-axis preserves that discontinuity, since (f ∘ g)'(x) = f'(g(x)) g'(x), and multiplying a discontinuous function by anything nonzero leaves it discontinuous at that point.
0 replies · 0 reposts · 0 likes · 39 views
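The chain-rule argument in the reply above can be spelled out in a short derivation. Notation here is mine, not from the thread: $F$ is the downstream metric as a function of pretraining loss $x$, and $x = g(c)$ is a smooth, monotone reparametrization of the axis (e.g. loss as a function of compute $c$, with $g'(c_0) \neq 0$ at the break).

```latex
\[
  \frac{d}{dc}\, F(g(c)) \;=\; F'(g(c))\, g'(c).
\]
Suppose $F'$ jumps at a break point $x_0 = g(c_0)$:
\[
  \lim_{x \to x_0^-} F'(x) \;\neq\; \lim_{x \to x_0^+} F'(x).
\]
Since $g'$ is continuous and $g'(c_0) \neq 0$, the one-sided limits of the
reparametrized derivative are
\[
  \lim_{c \to c_0^\pm} F'(g(c))\, g'(c)
  \;=\; g'(c_0) \lim_{x \to x_0^\pm} F'(x),
\]
which still disagree. So the derivative of $F \circ g$ jumps at $c_0$:
the structural break survives any smooth change of the $x$-axis.
```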
Will Held @WilliamBarrHeld
@NickLourie @michahu8 @kchonyc Can we interpret this figure as showing emergence since the X axis is pretraining loss and not compute? Kaplan shows pretraining loss scales as a power law in terms of compute, so a small shift in that X-axis could be a large shift in compute space.
2 replies · 0 reposts · 1 like · 87 views
Nicholas Lourie @NickLourie
If we could understand when and why perplexity captures downstream performance, then it would be a powerful tool indeed. When the context allows it, we could compare language models on perplexity alone, without the need to run difficult downstream evaluations.
1 reply · 0 reposts · 1 like · 142 views
Nicholas Lourie @NickLourie
A standard approach has yet to emerge (a great area for research!). Task-specific losses are interesting; we share a few papers on them in our related work. Still, a task-agnostic loss has one big advantage: it gives one number to compare LLMs, regardless of the downstream task.
1 reply · 0 reposts · 2 likes · 145 views
Nicholas Lourie @NickLourie
@WilliamBarrHeld @michahu8 @kchonyc Even with continuous metrics, there are stubbornly emergent phenomena. For example, this figure from arxiv.org/abs/2411.16035. Scaling shows several structural breaks even when you look at a continuous metric like P(correct answer). It's a tough problem, but we're making progress!
[image]
1 reply · 0 reposts · 0 likes · 180 views
Nicholas Lourie @NickLourie
@WilliamBarrHeld @michahu8 @kchonyc Great question. 🙂 We only looked at downstream scaling laws in terms of pretraining loss, and a big conclusion is that we need more work like this! I'd guess that it'll take a few tricks to make downstream scaling laws reliable, and intermediate task losses could certainly be one.
1 reply · 0 reposts · 0 likes · 206 views