Darshil Doshi

51 posts

@darshilhdoshi1

Johns Hopkins | UMD | Brown | IITGN —— ML | Physics

Baltimore, MD · Joined September 2016
127 Following · 68 Followers
Darshil Doshi @darshilhdoshi1
@percyliang Can you explain what you meant by "scaling laws are bending"? The plot in the figure is semilogx. Moreover, AN^{-a} + B is bound to be "bent" even in log-log..?
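For reference, the point in this reply written out (notation follows the tweet's A N^{-a} + B form; the derivation is mine, not from the thread): with a nonzero irreducible term B, the local log-log slope is never constant, so some bending is expected even without any change of regime.

```latex
% Local slope of L(N) = A N^{-a} + B on log-log axes (symbols assumed from the tweet, not the plot)
\frac{\mathrm{d}\,\log L}{\mathrm{d}\,\log N}
  \;=\; -\,a\,\frac{A N^{-a}}{A N^{-a} + B}
  \;\longrightarrow\; 0 \quad \text{as } N \to \infty
% i.e. the curve flattens ("bends") toward the loss floor B whenever B > 0.
```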
Percy Liang @percyliang
Our 1e23 Delphi run finished last night. Its loss was within 0.005 of the projected (preregistered) loss. Note that these projections were based on only training models over 100x smaller (3e20)! Still more work to do. We still had loss spikes and, if you look closely, our scaling laws are bending. We have some ideas for fixing both...
Will Held @WilliamBarrHeld

How far do Marin's scaling laws extrapolate? At least 100x, apparently! Despite spooky spikes, our 1e23 Delphi finished on forecast. The compute-optimal ladder costs ~1e21 FLOPs to train. Good scaling science lets you “run” this (not tiny) experiment at 1/100th the cost.
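A hedged sketch of the kind of extrapolation described here (not Marin's actual pipeline): fit the saturating form L(C) = A·C^{-a} + B on small compute-optimal runs, then predict the loss of a run roughly 100x larger. The (compute, loss) pairs below are made-up placeholders, and the use of scipy is an assumption of tooling.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, A, a, B):
    # saturating power-law ansatz: loss vs. training compute
    return A * C ** (-a) + B

C_small = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])    # FLOPs of the small runs (placeholders)
loss_small = np.array([3.80, 3.50, 3.21, 2.99, 2.80, 2.64])  # their final losses (fake numbers)

# fit the three parameters on the small-scale ladder
params, _ = curve_fit(scaling_law, C_small, loss_small, p0=[1000.0, 0.15, 1.5], maxfev=20000)
A_hat, a_hat, B_hat = params

# extrapolate two orders of magnitude beyond the largest fitted run, as in the 1e23 Delphi run
print(f"predicted loss at 1e23 FLOPs: {scaling_law(1e23, A_hat, a_hat, B_hat):.3f}")
```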

Darshil Doshi @darshilhdoshi1
Not all heroes wear capes. Some write entertaining papers.
Darshil Doshi reposted
Matthieu wyart @MatthieuWyart
Many complex systems must navigate rugged energy landscapes. The glass transition is the archetypal example: cooling a liquid increases its viscosity by ~15 orders of magnitude. What causes such dramatic slowing? New review: arxiv.org/pdf/2603.05209
Darshil Doshi @darshilhdoshi1
Amazing work! For the first time, one can predict the exponents for language scaling laws. Perfect example of what the @SimonsFdn Collaboration on PLNC has to offer in the coming months/years!
Surya Ganguli @SuryaGanguli

Our new paper "Deriving neural scaling laws from the statistics of natural language" arxiv.org/abs/2602.07488, led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart, makes a breakthrough! We can predict data-limited neural scaling law exponents from first principles, using the structure of natural language itself, for the very first time!
If you give us two properties of your natural language dataset:
1) How the conditional entropy of the next token decays with conditioning length.
2) How pairwise token correlations decay with time separation.
Then we can give you the exponent of the neural scaling law (loss versus data amount) through a simple formula!
The key idea is that as you increase the amount of training data, models can look further back into the past to predict, and as long as they do this well, the conditional entropy of the next token, conditioned on all tokens up to this data-dependent prediction time horizon, completely governs the loss! This gets us our simple formula for the neural scaling law!
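A rough illustration (my sketch, not the paper's code) of the two corpus statistics named above: pairwise token correlations as a function of separation, and next-token conditional entropy as a function of context length. Function names are mine, and the n-gram entropy estimate is only feasible for small context lengths.

```python
import math
from collections import Counter

def pair_correlation(tokens, t):
    """Total deviation of pair frequencies at separation t from the independent baseline."""
    n = len(tokens) - t
    pairs = Counter(zip(tokens[:-t], tokens[t:]))
    uni = Counter(tokens)
    total = len(tokens)
    return sum(abs(c / n - (uni[a] / total) * (uni[b] / total)) for (a, b), c in pairs.items())

def conditional_entropy(tokens, k):
    """H(next token | previous k tokens), estimated from raw (k+1)-gram counts (small k only)."""
    n = len(tokens) - k
    ctx = Counter(tuple(tokens[i:i + k]) for i in range(n))
    full = Counter(tuple(tokens[i:i + k + 1]) for i in range(n))
    return -sum((c / n) * math.log2(c / ctx[g[:k]]) for g, c in full.items())

# Plotting pair_correlation(tokens, t) against t and conditional_entropy(tokens, k) against k
# gives the two decay curves that, per the tweet, determine the data-scaling exponent.
```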

Darshil Doshi reposted
Francesco Cagnetta @Fraccagnetta
🚨 We derive data-limited neural scaling exponents directly from measurable corpus statistics. No synthetic data models, only two ingredients:
- decay of token-token correlations with separation;
- decay of next-token conditional entropy with context length.
Darshil Doshi @darshilhdoshi1
I’ve attended several iterations of this school in the past, and every time I’ve come out of it with new ideas and new connections. Highly recommended for young researchers!
Boris Hanin @BorisHanin

🚨 2026 @Princeton ML Theory Summer School 🔥 Learn from amazing researchers and meet your peers. Mini-courses by:
- Subhabrata Sen @subhabratasen90
- Lenaic Chizat @LenaicChizat
- Sinho Chewi
- Elliot Paquette @poseypaquet
- Elad Hazan @HazanPrinceton
- Surya Ganguli @SuryaGanguli (to be confirmed)
August 3-14, 2026. Apply by March 31, 2026. Link 👇
Sponsored by @NSF, @PrincetonAInews, @EPrinceton, @JaneStreetGroup, @DARPA, @PrincetonPLI, Princeton NAM, Princeton AI2, Princeton PACM

Darshil Doshi reposted
Dayal Kalra @dayal_kalra
Excited to share work from my internship at MSL @AIatMeta! 🚀 We analyze Critical Sharpness: a scalable curvature measure, requiring only ~6 forward passes, for studying LLM training dynamics at scale. We extend this measure to introduce Relative Critical Sharpness, which measures the relative curvature between two landscapes. We use this to answer a major practical question: How much pre-training data should we mix during fine-tuning to avoid catastrophic forgetting? 🧵 (1/n)
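For intuition only, a generic finite-difference curvature probe that uses forward passes alone. This is my guess at the flavor of measurement described above, not the paper's definition of (Relative) Critical Sharpness; `loss_fn(model, batch)` is a hypothetical callable returning the scalar loss.

```python
import torch

@torch.no_grad()
def directional_curvature(model, loss_fn, batch, eps=1e-3, n_dirs=2):
    """Average second difference of the loss along n_dirs random unit directions in weight space."""
    params = [p for p in model.parameters() if p.requires_grad]
    base = loss_fn(model, batch).item()                       # 1 forward pass
    curvatures = []
    for _ in range(n_dirs):
        dirs = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((d ** 2).sum() for d in dirs))
        dirs = [d / norm for d in dirs]
        side_losses = []
        for sign in (1.0, -1.0):                              # 2 more forward passes per direction
            for p, d in zip(params, dirs):
                p.add_(sign * eps * d)
            side_losses.append(loss_fn(model, batch).item())
            for p, d in zip(params, dirs):
                p.sub_(sign * eps * d)                        # restore the original weights
        curvatures.append((side_losses[0] - 2 * base + side_losses[1]) / eps ** 2)
    return sum(curvatures) / len(curvatures)
```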
Darshil Doshi @darshilhdoshi1
@EkdeepL @GoodfireAI Very exciting stuff! Are there many other super-human biology models out there right now that need to be interp'd or was this a one-off?
Darshil Doshi reposted
Johns Hopkins Data Science and AI Institute
Join us in advancing data science and AI research! The Johns Hopkins Data Science and AI Institute Postdoctoral Fellowship Program is now accepting applications for the 2026–2027 academic year. Apply now! Deadline: Jan 23, 2026. Details and apply: apply.interfolio.com/1790598:26
Darshil Doshi reposted
Surya Ganguli @SuryaGanguli
We have 14 survey lectures for our @SimonsFdn Collaboration on the Physics of Learning and Neural Computation! All videos available at: physicsoflearning.org/webinar-series Here is the list:
@zdeborova: Attention-based models and how to solve them using tools from quadratic networks and matrix denoising
@KempeLab: Recent lessons from LLM reasoning
@MBarkeshli: Sharpness dynamics in neural network training
@KrzakalaF: How Do Neural Networks Learn Simple Functions with Gradient Descent?
Michael Douglas: Mathematics, Economics and AI
Yuhai Tu: Towards a Physics-based Theoretical Foundation for Deep Learning: Stochastic Learning Dynamics and Generalization
@SuryaGanguli: An analytic theory of creativity for convolutional diffusion models
Eva Silverstein: Hamiltonian dynamics for stabilizing neural simulation-based inference
@adnarim066: Generation with Unified Diffusion
Bernd Rosenow: Random matrix analysis of neural networks: distinguishing noise from learned information
@jhhalverson: Neural networks and conformal field theory
@KempeLab: Synthetic data: friend or foe in the age of scaling
@WyartMatthieu: Learning hierarchical representations with deep architectures
@CPehlevan: Mean-field theory of deep network learning dynamics and applications to neural scaling laws
Darshil Doshi @darshilhdoshi1
@EkdeepL Wonderful work Ekdeep.. congratulations! Big fan of fig.5!
Ekdeep Singh Lubana @EkdeepL
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
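A generic "predict the next activation" probe, included only to make the contrast with static SAE directions concrete. This is my illustration, not the paper's Temporal Feature Analysis; the function name is mine.

```python
import torch

def temporal_residuals(acts):
    """acts: [seq_len, d_model] activations from one layer of an LLM on one sequence."""
    X, Y = acts[:-1], acts[1:]                     # predict the activation at t+1 from the one at t
    W = torch.linalg.lstsq(X, Y).solution          # least-squares linear dynamics model
    return Y - X @ W                               # residuals = dynamics a static map does not capture
```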
Darshil Doshi @darshilhdoshi1
@abybaby_san Atoms are also an abstraction made by us. The connection between atoms and non-commutativity is as real as the idea of atoms themselves. The idea of “real” is also made up, along with the rest of the language… ad nauseam
Darshil Doshi @darshilhdoshi1
@MrnllMtt Not that insane. Increased swimming and ice cream consumption in the summer explains the trend-match. Tweaking the axis scaling explains the shape-match.
Darshil Doshi reposted
Andrei Mircea @mirandrom
I gave a talk on LLM zero-sum learning dynamics last week at MSR Montreal. I went over a few things that were not in the paper but that I'm particularly excited about; one of those is the connection between generalization and zero-sum learning. youtu.be/UyK3DgWY7yw
Darshil Doshi @darshilhdoshi1
@mirandrom @zmkzmkz Intuition for the muP/warmup connection: warmup brings the model to a flat region of the landscape, where higher LRs don't cause problems. muP (any mean-field init) starts from a flat region, so one could get away with less/no warmup. Shout-out to Dayal's paper: arxiv.org/abs/2406.09405
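A minimal sketch of the two knobs in this intuition (generic recipes with made-up function names, not a specific codebase): linear LR warmup, and the commonly cited muP rule of shrinking hidden-layer Adam LRs roughly as 1/width relative to a tuned base width.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp the LR from 0 to base_lr over warmup_steps, then hold it."""
    return base_lr * min(1.0, step / max(1, warmup_steps))

def mup_hidden_lr(base_lr, width, base_width=256):
    """muP-style transfer for Adam on hidden matrices: scale the LR by base_width / width."""
    return base_lr * base_width / width

# e.g. warmup_lr(100, 3e-4, 2000) gives a small early LR;
# mup_hidden_lr(3e-4, 4096) rescales a LR tuned at width 256 for a much wider model.
```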
Andrei Mircea @mirandrom
@darshilhdoshi1 @zmkzmkz I didn't know about the mup/warmup connection, I thought warmup was also important for Adam moments to stabilize. I've been meaning to rerun those experiments anyways so I'll try and see. If I recall correctly, "fast enough" just meant having a high enough LR at a specific loss.
zed @zmkzmkz
does anyone have any pointers on what this "hump" is in the gradient norm at the beginning of training a transformer? I've seen this happen at all scales, even in different architectural variants, even with or without warmup/decay lr
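A minimal way to measure the quantity in question (training-loop names like `grad_norm_history` are hypothetical): log the global gradient norm every step and plot the first few thousand steps to see the hump.

```python
import torch

def global_grad_norm(model):
    """L2 norm over all parameter gradients, i.e. the curve shown in the attached plots."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return total ** 0.5

# inside the training loop, call this right after loss.backward() and before optimizer.step():
#     grad_norm_history.append(global_grad_norm(model))
```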
Darshil Doshi @darshilhdoshi1
@mirandrom @zmkzmkz This is very interesting! I always thought that more warmup (i.e. slower LR increase) is never worse — looks like I was wrong. Is there a way to quantify “fast enough” for LR warmup? Also, muP is considered a substitute for warmup — wonder if that can circumvent these issues.
Andrei Mircea @mirandrom
@zmkzmkz but there's also just a sharpening of the loss landscape going on around that point I think. that would explain the effect of LR warmup I observed on this a while back, where gradnorm growth occurs when LR is too small relative to current loss, and reverses if you increase LR
Darshil Doshi reposted
Rose Yu @yuqirose
I'm hiring a postdoc on #AI for #science starting soon. We are building AI-enabled 3D human liver printing technology with many experts across disciplines! If you are excited about this, drop me an email with your CV and a brief introduction! Repost is much appreciated!