Larry Dial

86 posts

@classiclarryd

Technical Staff at Open Athena, working on Marin

Joined May 2024
28 Following · 1.3K Followers
Larry Dial @classiclarryd
Very cool. My 2 cents for participants: most compute will be spent on undifferentiated hill climbing from people functioning as LLM vessels. Agents can climb hills, but humans are still superior at finding them. What paradigm can you introduce? Sparse circuit discovery and compression during training? Variable embedding sizing? Manifold-ultra-connections? Paired head attn on steroids? Decision tree distillation? The list is endless.
OpenAI @OpenAI

Are you up for a challenge? openai.com/parameter-golf

Larry Dial @classiclarryd
Agreed. Neurons learn to sparsely activate, and MoE enforces clustering over this behavior in a hardware-friendly manner. IMO the term “active parameters” is a bit of a misnomer from the model's perspective: both the dense MLP and MoE are sparse under the common characterization avg_active_neurons/neurons.
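The avg_active_neurons/neurons point can be made concrete with a toy calculation. Everything below is invented for illustration (shapes, routing, and the assumption of roughly symmetric pre-activations), not taken from any real model:

```python
# Toy comparison of the avg_active_neurons / neurons ratio for a dense
# ReLU MLP vs. a top-k MoE layer. All numbers are illustrative.

import random

random.seed(0)

def relu_sparsity(hidden: int, trials: int = 1000) -> float:
    # Fraction of neurons with nonzero post-ReLU activation, averaged over
    # trials. With symmetric pre-activations this hovers around 0.5.
    active = 0
    for _ in range(trials):
        active += sum(1 for _ in range(hidden) if random.gauss(0, 1) > 0)
    return active / (hidden * trials)

def moe_sparsity(experts: int, top_k: int) -> float:
    # With hard top-k routing, exactly k of the E expert MLPs run per token.
    return top_k / experts

dense_ratio = relu_sparsity(hidden=512)        # ~0.5: dense MLPs are sparse too
moe_ratio = moe_sparsity(experts=64, top_k=8)  # 0.125, enforced by routing
```

Under this metric both layers are sparse; MoE just makes the sparsity block-structured so hardware can skip the inactive weights entirely.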
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 86.8 (-0.4s) from @.samacqua on GitHub, by tuning and reusing the transpose_copy kernel during the cross-entropy backward calc. Outside the main speedrun track, Sam did an interesting experiment in Jan showing how test-time training can improve perplexity. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@willdepue In the partitioned hc PR, the MLPs learned to mostly not write to the attn stream for the last 3 layers. A prior test where every module gets its own stream showed prediction pulled from the last 3 layers. However, in this PR I only tested 3, so I can't confirm it's better. Just a first guess.
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 86.1 (-0.7s), by replacing partitioned hyperconnections with a simple idea: feed the exact same context vector into the last 3 attn layers, so late stage attn doesn't get polluted by prediction MLPs. Opinion: AI research agents are handicapped until they have a mech-interp toolkit. Many sub-3min architecture improvements came from analyzing weights. github.com/KellerJordan/m…
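The wiring described above can be sketched in a few lines. This is an assumed, minimal reading of the idea (toy blend math standing in for real attention and MLP blocks; layer counts from the tweet), not the actual speedrun code:

```python
# Sketch: the last 3 attn layers all read one frozen context vector captured
# earlier in the residual stream, so late "prediction" MLP writes never
# pollute attention's input. All ops are toy elementwise stand-ins.

NUM_LAYERS = 12
SHARED_LAST = 3  # per the tweet: last 3 attn layers share one context

def attn(query, context):
    # toy attention: a fixed blend of the query stream and its context
    return [0.5 * (q + c) for q, c in zip(query, context)]

def mlp(x):
    # toy prediction MLP: a fixed elementwise write
    return [v + 1.0 for v in x]

def forward(x):
    frozen = None
    for layer in range(NUM_LAYERS):
        if layer == NUM_LAYERS - SHARED_LAST:
            frozen = list(x)  # snapshot the context the late attn layers see
        context = frozen if frozen is not None else x
        x = [a + b for a, b in zip(x, attn(x, context))]  # attn + residual
        x = [a + b for a, b in zip(x, mlp(x))]            # MLP + residual
    return x
```

After the snapshot, the MLPs keep writing to the residual stream for the final prediction, but those writes no longer feed back into what the late attention layers read.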
Larry Dial @classiclarryd
Most arch ones come from weight and circuit analysis. This one was one-shot based on the weights @sisovicm shared; he also used weight analysis in partitioned hc. Partial key offset was one-shot based on looking at the attn map. But some random ones too: paired head attn was random idea 50 in a search over crackpot ideas, after 100 other crackpot ideas that mostly failed.
Somi AI @somi_ai
@classiclarryd okay but the mech-interp point is so underrated. how much of the recent speedrun progress came from weight analysis vs brute force architecture search?
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 87.2 (-0.9s) from @.moof2x on GitHub, with optimizations to the cross-entropy kernel, primarily around the memory loads of the multi-token-prediction component. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@Thom_Wolf A custom harness on GPT5 + Opus 4.5 achieved 19 minutes. Surely better now, and it will get there eventually, but when it's $5-$10 to validate an idea, I'd much rather source that idea from a human at this stage. arxiv.org/abs/2601.14525
Larry Dial @classiclarryd
@Thom_Wolf Improvements tend to follow a pattern of new architecture -> engineering optimization of that arch. Many participants use AI agents to assist in the engineering optimization, but AI still seems poor at novel architecture, at least when bounded by limited H100 compute.
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 88.1 (-1s) from @ChrisJMcCormick, by optimizing kernels for transposed weights, removing the Block() abstraction, and tuning the prior PR on partitioned hyperconnections by reducing the lambda count. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 89.1 (-0.7s) from @sisovicm, with a technique called partitioned hyperconnections. The learned weights reveal that the final attn modules prefer to ignore the prediction vectors generated by the final MLPs, and instead query representations from slightly earlier layers. github.com/KellerJordan/m…
Larry Dial @classiclarryd
The NanoGPT Speedrun WR has broken below 90s, dropping from 92.1 to 89.8 thanks to 4 recent contributions:
1. Tuned kernels (-0.4s) from @.EmmetBicker on GitHub & AI System Aster
2. Tuned value embeds (-0.4s) from @photon_mz
3. Sparse comms for bigram gradients (-0.3s) from @roeeshenberg
4. max_seq_len schedule and increased min lr (-1.2s) from @.dualverse-ai on GitHub & AI System Station
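A max_seq_len schedule like the one credited in item 4 might look like the following. All constants (context length, warmup length, ramp duration, the multiple-of-128 rounding) are hypothetical, not the actual speedrun values:

```python
# Hypothetical sequence-length schedule: train on short sequences early,
# then ramp linearly to the full context length, rounding down to a
# multiple of 128 so attention kernels keep tidy block shapes.

FULL_LEN = 1024     # assumed full context length
START_LEN = 256     # assumed warmup length
RAMP_STEPS = 1000   # assumed ramp duration in optimizer steps

def max_seq_len(step: int) -> int:
    if step >= RAMP_STEPS:
        return FULL_LEN
    frac = step / RAMP_STEPS
    raw = START_LEN + frac * (FULL_LEN - START_LEN)
    return max(START_LEN, int(raw) // 128 * 128)
```

Short early sequences buy cheaper steps when the model is still learning local statistics; the ramp restores full context before the loss target is measured.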
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 92.1 (-0.3s) from @dhrvji, by moving the bigram hash from CPU to GPU. As shown here, recently added architectures are a great place to look for engineering improvements. github.com/KellerJordan/m…
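For readers unfamiliar with the bigram hash being moved to GPU: the idea is to map each (prev_token, token) pair to a slot in a fixed-size table. The hashing scheme below (shift-xor plus a multiplicative hash) is an illustrative stand-in, not the repo's actual kernel:

```python
# Illustrative bigram hash. On GPU this becomes one vectorized elementwise
# op over the whole batch, which is why moving it off the CPU removes a
# host-side bottleneck.

TABLE_SIZE = 1 << 20  # assumed hash-table size
MULT = 2654435761     # Knuth's multiplicative-hash constant

def bigram_slot(prev_token: int, token: int) -> int:
    key = (prev_token << 17) ^ token  # combine the pair into one integer
    return (key * MULT) % TABLE_SIZE  # multiplicative hash into the table

def bigram_slots(tokens: list[int]) -> list[int]:
    # one slot per position, conditioned on the previous token (0 at start)
    return [bigram_slot(tokens[i - 1] if i else 0, tokens[i])
            for i in range(len(tokens))]
```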
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 92.4 (-3.3s) from the kernel sorcerer @andrewbriand8, with a Triton kernel that fuses the fp8 quantization of the gradient into the backward kernel (the lm_head calc is run in fp8). On pace for 120% MFU by July. github.com/KellerJordan/m…
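The scale-and-clamp step such a fused kernel performs can be emulated in plain Python. This is a crude e4m3-style fake-quantization sketch to show the arithmetic only; the real work happens in-place inside the Triton backward kernel on the GPU:

```python
# Toy fake-fp8 (e4m3-style) quantization of a gradient vector: rescale so
# the largest value maps near the fp8 max, clamp, then emulate a 3-bit
# mantissa by rounding to 2^(exp-3). Pure-Python floats throughout.

import math

E4M3_MAX = 448.0  # largest finite e4m3 value

def quantize_e4m3(xs: list[float]) -> tuple[list[float], float]:
    amax = max((abs(x) for x in xs), default=0.0)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    out = []
    for x in xs:
        v = max(-E4M3_MAX, min(E4M3_MAX, x * scale))  # clamp to fp8 range
        if v != 0.0:
            exp = math.floor(math.log2(abs(v)))
            step = 2.0 ** (exp - 3)        # spacing with 3 mantissa bits
            v = round(v / step) * step
        out.append(v)
    return out, scale  # dequantize later with x ≈ v / scale
```

Fusing this into the backward kernel saves a separate pass over the gradient in memory, which is the whole win: the quantization math itself is trivial.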
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 96.8 (-1.0s) from @varunneal, achieved by combining the value embeds into a single param for faster indexing. On H100 this gives a 23% speedup! On A100 it gives a slowdown, indicating hardware-dependent interactions. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 97.8 (-1.2s) from @srashedll, with an update to the attention initialization. Motivated by mimetic initialization techniques, experiments uncovered that a small random init outperformed zero init on the attention out projection. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 99.0 (-0.3s) from @photon_mz, with an update from 3 to 5 value embeddings, enabling 1.5% fewer training steps! The trend of fewer steps with higher sparsity continues. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@KoszarskyB My guess is things got lost in translation, and they mean their caching implementation for inference is the innovative part.