Darshil Doshi

51 posts

@darshilhdoshi1

Johns Hopkins | UMD | Brown | IITGN —— ML | Physics

Baltimore, MD · Joined September 2016
127 Following · 68 Followers
Darshil Doshi @darshilhdoshi1
@percyliang Can you explain what you meant by "scaling laws are bending"? The plot in the figure is semilogx. Moreover, AN^{-a} + B is bound to be "bent" even in log-log..?
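For reference, the point in this reply written out (notation follows the tweet's A N^{-a} + B form; the derivation is mine, not from the thread): with a nonzero irreducible term B, the local log-log slope is never constant, so some bending is expected even without any change of regime.

```latex
% Local slope of L(N) = A N^{-a} + B on log-log axes (symbols assumed from the tweet, not the plot)
\frac{\mathrm{d}\,\log L}{\mathrm{d}\,\log N}
  \;=\; -\,a\,\frac{A N^{-a}}{A N^{-a} + B}
  \;\longrightarrow\; 0 \quad \text{as } N \to \infty
% i.e. the curve flattens ("bends") toward the loss floor B whenever B > 0.
```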
Percy Liang @percyliang
Our 1e23 Delphi run finished last night. Its loss was within 0.005 of the projected (preregistered) loss. Note that these projections were based on only training models over 100x smaller (3e20)! Still more work to do. We still had loss spikes and, if you look closely, our scaling laws are bending. We have some ideas for fixing both...
Will Held @WilliamBarrHeld

How far do Marin's scaling laws extrapolate? At least 100x, apparently! Despite spooky spikes, our 1e23 Delphi finished on forecast. The compute-optimal ladder costs ~1e21 FLOPs to train. Good scaling science lets you “run” this (not tiny) experiment at 1/100th the cost.
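A hedged sketch of the kind of extrapolation described here (not Marin's actual pipeline): fit the saturating form L(C) = A·C^{-a} + B on small compute-optimal runs, then predict the loss of a run roughly 100x larger. The (compute, loss) pairs below are made-up placeholders, and the use of scipy is an assumption of tooling.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, A, a, B):
    # saturating power-law ansatz: loss vs. training compute
    return A * C ** (-a) + B

C_small = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])    # FLOPs of the small runs (placeholders)
loss_small = np.array([3.80, 3.50, 3.21, 2.99, 2.80, 2.64])  # their final losses (fake numbers)

# fit the three parameters on the small-scale ladder
params, _ = curve_fit(scaling_law, C_small, loss_small, p0=[1000.0, 0.15, 1.5], maxfev=20000)
A_hat, a_hat, B_hat = params

# extrapolate two orders of magnitude beyond the largest fitted run, as in the 1e23 Delphi run
print(f"predicted loss at 1e23 FLOPs: {scaling_law(1e23, A_hat, a_hat, B_hat):.3f}")
```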

Darshil Doshi @darshilhdoshi1
Not all heroes wear capes. Some write entertaining papers.
Darshil Doshi reposted
Matthieu wyart @MatthieuWyart
Many complex systems must navigate rugged energy landscapes. The glass transition is the archetypal example: cooling a liquid increases its viscosity by ~15 orders of magnitude. What causes such dramatic slowing? New review: arxiv.org/pdf/2603.05209
Darshil Doshi @darshilhdoshi1
Amazing work! For the first time, one can predict the exponents for language scaling laws. Perfect example of what the @SimonsFdn Collaboration on PLNC has to offer in the coming months/years!
Surya Ganguli @SuryaGanguli

Our new paper "Deriving neural scaling laws from the statistics of natural language" arxiv.org/abs/2602.07488, led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart, makes a breakthrough! We can predict data-limited neural scaling law exponents from first principles, using the structure of natural language itself, for the very first time!
If you give us two properties of your natural language dataset:
1) How the conditional entropy of the next token decays with conditioning length.
2) How pairwise token correlations decay with time separation.
Then we can give you the exponent of the neural scaling law (loss versus data amount) through a simple formula!
The key idea is that as you increase the amount of training data, models can look further back into the past to predict, and as long as they do this well, the conditional entropy of the next token, conditioned on all tokens up to this data-dependent prediction time horizon, completely governs the loss! This gets us our simple formula for the neural scaling law!
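A rough illustration (my sketch, not the paper's code) of the two corpus statistics named above: pairwise token correlations as a function of separation, and next-token conditional entropy as a function of context length. Function names are mine, and the n-gram entropy estimate is only feasible for small context lengths.

```python
import math
from collections import Counter

def pair_correlation(tokens, t):
    """Total deviation of pair frequencies at separation t from the independent baseline."""
    n = len(tokens) - t
    pairs = Counter(zip(tokens[:-t], tokens[t:]))
    uni = Counter(tokens)
    total = len(tokens)
    return sum(abs(c / n - (uni[a] / total) * (uni[b] / total)) for (a, b), c in pairs.items())

def conditional_entropy(tokens, k):
    """H(next token | previous k tokens), estimated from raw (k+1)-gram counts (small k only)."""
    n = len(tokens) - k
    ctx = Counter(tuple(tokens[i:i + k]) for i in range(n))
    full = Counter(tuple(tokens[i:i + k + 1]) for i in range(n))
    return -sum((c / n) * math.log2(c / ctx[g[:k]]) for g, c in full.items())

# Plotting pair_correlation(tokens, t) against t and conditional_entropy(tokens, k) against k
# gives the two decay curves that, per the tweet, determine the data-scaling exponent.
```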

Darshil Doshi reposted
Francesco Cagnetta @Fraccagnetta
🚨 We derive data-limited neural scaling exponents directly from measurable corpus statistics. No synthetic data models, only two ingredients:
- decay of token-token correlations with separation;
- decay of next-token conditional entropy with context length.
Darshil Doshi @darshilhdoshi1
I’ve attended several iterations of this school in the past, and every time I’ve come out of it with new ideas and new connections. Highly recommended for young researchers!
Boris Hanin @BorisHanin

🚨 2026 @Princeton ML Theory Summer School 🔥 Learn from amazing researchers and meet your peers. Mini-courses by:
- Subhabrata Sen @subhabratasen90
- Lenaic Chizat @LenaicChizat
- Sinho Chewi
- Elliot Paquette @poseypaquet
- Elad Hazan @HazanPrinceton
- Surya Ganguli @SuryaGanguli (to be confirmed)
August 3-14, 2026. Apply by March 31, 2026. Link 👇
Sponsored by @NSF, @PrincetonAInews, @EPrinceton, @JaneStreetGroup, @DARPA, @PrincetonPLI, Princeton NAM, Princeton AI2, Princeton PACM

Darshil Doshi reposted
Dayal Kalra @dayal_kalra
Excited to share work from my internship at MSL @AIatMeta! 🚀 We analyze Critical Sharpness: a scalable curvature measure, requiring only ~6 forward passes, for studying LLM training dynamics at scale. We extend this measure to introduce Relative Critical Sharpness, which measures the relative curvature between two landscapes. We use this to answer a major practical question: How much pre-training data should we mix during fine-tuning to avoid catastrophic forgetting? 🧵 (1/n)
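For intuition only, a generic finite-difference curvature probe that uses forward passes alone. This is my guess at the flavor of measurement described above, not the paper's definition of (Relative) Critical Sharpness; `loss_fn(model, batch)` is a hypothetical callable returning the scalar loss.

```python
import torch

@torch.no_grad()
def directional_curvature(model, loss_fn, batch, eps=1e-3, n_dirs=2):
    """Average second difference of the loss along n_dirs random unit directions in weight space."""
    params = [p for p in model.parameters() if p.requires_grad]
    base = loss_fn(model, batch).item()                       # 1 forward pass
    curvatures = []
    for _ in range(n_dirs):
        dirs = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((d ** 2).sum() for d in dirs))
        dirs = [d / norm for d in dirs]
        side_losses = []
        for sign in (1.0, -1.0):                              # 2 more forward passes per direction
            for p, d in zip(params, dirs):
                p.add_(sign * eps * d)
            side_losses.append(loss_fn(model, batch).item())
            for p, d in zip(params, dirs):
                p.sub_(sign * eps * d)                        # restore the original weights
        curvatures.append((side_losses[0] - 2 * base + side_losses[1]) / eps ** 2)
    return sum(curvatures) / len(curvatures)
```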
Darshil Doshi @darshilhdoshi1
@EkdeepL @GoodfireAI Very exciting stuff! Are there many other super-human biology models out there right now that need to be interp'd or was this a one-off?
Darshil Doshi reposted
Johns Hopkins Data Science and AI Institute
Join us in advancing data science and AI research! The Johns Hopkins Data Science and AI Institute Postdoctoral Fellowship Program is now accepting applications for the 2026–2027 academic year. Apply now! Deadline: Jan 23, 2026. Details and apply: apply.interfolio.com/1790598:26
Darshil Doshi reposted
Surya Ganguli @SuryaGanguli
We have 14 survey lectures for our @SimonsFdn Collaboration on the Physics of Learning and Neural Computation! All videos available at: physicsoflearning.org/webinar-series Here is the list:
@zdeborova: Attention-based models and how to solve them using tools from quadratic networks and matrix denoising
@KempeLab: Recent lessons from LLM reasoning
@MBarkeshli: Sharpness dynamics in neural network training
@KrzakalaF: How Do Neural Networks Learn Simple Functions with Gradient Descent?
Michael Douglas: Mathematics, Economics and AI
Yuhai Tu: Towards a Physics-based Theoretical Foundation for Deep Learning: Stochastic Learning Dynamics and Generalization
@SuryaGanguli: An analytic theory of creativity for convolutional diffusion models
Eva Silverstein: Hamiltonian dynamics for stabilizing neural simulation-based inference
@adnarim066: Generation with Unified Diffusion
Bernd Rosenow: Random matrix analysis of neural networks: distinguishing noise from learned information
@jhhalverson: Neural networks and conformal field theory
@KempeLab: Synthetic data: friend or foe in the age of scaling
@WyartMatthieu: Learning hierarchical representations with deep architectures
@CPehlevan: Mean-field theory of deep network learning dynamics and applications to neural scaling laws
Darshil Doshi @darshilhdoshi1
@EkdeepL Wonderful work Ekdeep.. congratulations! Big fan of fig.5!
Ekdeep Singh Lubana @EkdeepL
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
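A generic "predict the next activation" probe, included only to make the contrast with static SAE directions concrete. This is my illustration, not the paper's Temporal Feature Analysis; the function name is mine.

```python
import torch

def temporal_residuals(acts):
    """acts: [seq_len, d_model] activations from one layer of an LLM on one sequence."""
    X, Y = acts[:-1], acts[1:]                     # predict the activation at t+1 from the one at t
    W = torch.linalg.lstsq(X, Y).solution          # least-squares linear dynamics model
    return Y - X @ W                               # residuals = dynamics a static map does not capture
```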
Darshil Doshi @darshilhdoshi1
@abybaby_san Atoms are also an abstraction made by us. The connection between atoms and non-commutativity is as real as the idea of atoms themselves. The idea of “real” is also made up, along with the rest of the language… ad nauseam
Darshil Doshi @darshilhdoshi1
@MrnllMtt Not that insane. Increased swimming and ice cream consumption in the summer explains the trend-match. Tweaking the axis scaling explains the shape-match.
Darshil Doshi reposted
Andrei Mircea @mirandrom
I gave a talk on LLM zero-sum learning dynamics last week at MSR Montreal. I went over a few things that were not in the paper but that I'm particularly excited about; one of those is the connection between generalization and zero-sum learning. youtu.be/UyK3DgWY7yw
Darshil Doshi @darshilhdoshi1
@mirandrom @zmkzmkz Intuition for the muP/warmup connection: warmup brings the model to a flat region of the landscape, where higher LRs don't cause problems. muP (any mean-field init) starts from a flat region, so one could get away with less/no warmup. Shout-out to Dayal's paper: arxiv.org/abs/2406.09405
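A minimal sketch of the two knobs in this intuition (generic recipes with made-up function names, not a specific codebase): linear LR warmup, and the commonly cited muP rule of shrinking hidden-layer Adam LRs roughly as 1/width relative to a tuned base width.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp the LR from 0 to base_lr over warmup_steps, then hold it."""
    return base_lr * min(1.0, step / max(1, warmup_steps))

def mup_hidden_lr(base_lr, width, base_width=256):
    """muP-style transfer for Adam on hidden matrices: scale the LR by base_width / width."""
    return base_lr * base_width / width

# e.g. warmup_lr(100, 3e-4, 2000) gives a small early LR;
# mup_hidden_lr(3e-4, 4096) rescales a LR tuned at width 256 for a much wider model.
```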
Andrei Mircea @mirandrom
@darshilhdoshi1 @zmkzmkz I didn't know about the mup/warmup connection, I thought warmup was also important for Adam moments to stabilize. I've been meaning to rerun those experiments anyways so I'll try and see. If I recall correctly, "fast enough" just meant having a high enough LR at a specific loss.
zed @zmkzmkz
does anyone have any pointers on what this "hump" is in the gradient norm at the beginning of training a transformer? I've seen this happen at all scales, even in different architectural variants, even with or without warmup/decay lr
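A minimal way to measure the quantity in question (training-loop names like `grad_norm_history` are hypothetical): log the global gradient norm every step and plot the first few thousand steps to see the hump.

```python
import torch

def global_grad_norm(model):
    """L2 norm over all parameter gradients, i.e. the curve shown in the attached plots."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return total ** 0.5

# inside the training loop, call this right after loss.backward() and before optimizer.step():
#     grad_norm_history.append(global_grad_norm(model))
```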
Darshil Doshi @darshilhdoshi1
@mirandrom @zmkzmkz This is very interesting! I always thought that more warmup (i.e. slower LR increase) is never worse — looks like I was wrong. Is there a way to quantify “fast enough” for LR warmup? Also, muP is considered a substitute for warmup — wonder if that can circumvent these issues.
Andrei Mircea @mirandrom
@zmkzmkz but there's also just a sharpening of the loss landscape going on around that point I think. that would explain the effect of LR warmup I observed on this a while back, where gradnorm growth occurs when LR is too small relative to current loss, and reverses if you increase LR
Darshil Doshi reposted
Rose Yu @yuqirose
I'm hiring a postdoc on #AI for #science starting soon. We are building AI-enabled 3D human liver printing technology with many experts across disciplines! If you are excited about this, drop me an email with your CV and a brief introduction! Repost is much appreciated!