Larry Dial

86 posts

@classiclarryd

Technical Staff at Open Athena, working on Marin

Joined May 2024
28 Following · 1.3K Followers
Larry Dial @classiclarryd
Very cool. My 2 cents for participants: most compute will be spent on undifferentiated hill climbing from people functioning as LLM vessels. Agents can climb hills, but humans are still superior at finding them. What paradigm can you introduce? Sparse circuit discovery and compression during training? Variable embedding sizing? Manifold-ultra-connections? Paired head attn on steroids? Decision tree distillation? The list is endless.
OpenAI @OpenAI

Are you up for a challenge? openai.com/parameter-golf

Larry Dial @classiclarryd
Agreed. Neurons learn to sparsely activate, and MoE enforces clustering over this behavior in a hardware-friendly manner. IMO the term “active parameters” is a bit of a misnomer from the model's perspective: both the dense MLP and MoE are sparse under the common characterization avg_active_neurons/neurons.
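The avg_active_neurons/neurons point can be made concrete with a toy calculation. Everything below is invented for illustration (shapes, routing, and the assumption of roughly symmetric pre-activations), not taken from any real model:

```python
# Toy comparison of the avg_active_neurons / neurons ratio for a dense
# ReLU MLP vs. a top-k MoE layer. All numbers are illustrative.

import random

random.seed(0)

def relu_sparsity(hidden: int, trials: int = 1000) -> float:
    # Fraction of neurons with nonzero post-ReLU activation, averaged over
    # trials. With symmetric pre-activations this hovers around 0.5.
    active = 0
    for _ in range(trials):
        active += sum(1 for _ in range(hidden) if random.gauss(0, 1) > 0)
    return active / (hidden * trials)

def moe_sparsity(experts: int, top_k: int) -> float:
    # With hard top-k routing, exactly k of the E expert MLPs run per token.
    return top_k / experts

dense_ratio = relu_sparsity(hidden=512)        # ~0.5: dense MLPs are sparse too
moe_ratio = moe_sparsity(experts=64, top_k=8)  # 0.125, enforced by routing
```

Under this metric both layers are sparse; MoE just makes the sparsity block-structured so hardware can skip the inactive weights entirely.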
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 86.8 (-0.4s) from @.samacqua on GitHub, by tuning and reusing the transpose_copy kernel during the cross-entropy backward calc. Outside the main speedrun track, Sam did an interesting experiment in Jan showing how test-time training can improve perplexity. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@willdepue In the partitioned hc PR, the MLPs learned to mostly not write to the attn stream for the last 3 layers. A prior test where every module gets its own stream showed prediction pulled from the last 3 layers. However, in this PR I only tested 3, so I can't confirm it's better. Just a first guess.
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 86.1 (-0.7s), by replacing partitioned hyperconnections with a simple idea: feed the exact same context vector into the last 3 attn layers, so late stage attn doesn't get polluted by prediction MLPs. Opinion: AI research agents are handicapped until they have a mech-interp toolkit. Many sub-3min architecture improvements came from analyzing weights. github.com/KellerJordan/m…
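The wiring described above can be sketched in a few lines. This is an assumed, minimal reading of the idea (toy blend math standing in for real attention and MLP blocks; layer counts from the tweet), not the actual speedrun code:

```python
# Sketch: the last 3 attn layers all read one frozen context vector captured
# earlier in the residual stream, so late "prediction" MLP writes never
# pollute attention's input. All ops are toy elementwise stand-ins.

NUM_LAYERS = 12
SHARED_LAST = 3  # per the tweet: last 3 attn layers share one context

def attn(query, context):
    # toy attention: a fixed blend of the query stream and its context
    return [0.5 * (q + c) for q, c in zip(query, context)]

def mlp(x):
    # toy prediction MLP: a fixed elementwise write
    return [v + 1.0 for v in x]

def forward(x):
    frozen = None
    for layer in range(NUM_LAYERS):
        if layer == NUM_LAYERS - SHARED_LAST:
            frozen = list(x)  # snapshot the context the late attn layers see
        context = frozen if frozen is not None else x
        x = [a + b for a, b in zip(x, attn(x, context))]  # attn + residual
        x = [a + b for a, b in zip(x, mlp(x))]            # MLP + residual
    return x
```

After the snapshot, the MLPs keep writing to the residual stream for the final prediction, but those writes no longer feed back into what the late attention layers read.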
Larry Dial @classiclarryd
Most arch ones come from weight and circuit analysis. This one was one-shot based on the weights @sisovicm shared; he also used weight analysis in partitioned hc. Partial key offset was one-shot based on looking at the attn map. But some random ones too: paired head attn was random idea 50 in a search over crackpot ideas, after 100 other crackpot ideas that mostly failed.
Somi AI @somi_ai
@classiclarryd okay but the mech-interp point is so underrated. how much of the recent speedrun progress came from weight analysis vs brute force architecture search?
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 87.2 (-0.9s) from @.moof2x on GitHub, with optimizations to the cross-entropy kernel, primarily around the memory loads of the multi-token-prediction component. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@Thom_Wolf A custom harness on GPT5 + Opus 4.5 achieved 19 minutes. Surely better now, and it will get there eventually, but when it's $5-$10 to validate an idea, I'd much rather source that idea from a human at this stage. arxiv.org/abs/2601.14525
Larry Dial @classiclarryd
@Thom_Wolf Improvements tend to follow a pattern of new architecture -> engineering optimization of that arch. Many participants use AI agents to assist in the engineering optimization, but AI still seems poor at novel architecture, at least when bounded by limited H100 compute.
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 88.1 (-1s) from @ChrisJMcCormick, by optimizing kernels for transposed weights, removing the Block() abstraction, and tuning the prior PR on partitioned hyperconnections by reducing the lambda count. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 89.1 (-0.7s) from @sisovicm, with a technique called partitioned hyperconnections. The learned weights reveal that the final attn modules prefer to ignore the prediction vectors generated by the final MLPs, and instead query representations from slightly earlier layers. github.com/KellerJordan/m…
Larry Dial @classiclarryd
The NanoGPT Speedrun WR has broken below 90s, dropping from 92.1 to 89.8 thanks to 4 recent contributions:
1. Tuned kernels (-0.4s) from @.EmmetBicker on GitHub & AI System Aster
2. Tuned value embeds (-0.4s) from @photon_mz
3. Sparse comms for bigram gradients (-0.3s) from @roeeshenberg
4. max_seq_len schedule and increased min lr (-1.2s) from @.dualverse-ai on GitHub & AI System Station
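A max_seq_len schedule like the one credited in item 4 might look like the following. All constants (context length, warmup length, ramp duration, the multiple-of-128 rounding) are hypothetical, not the actual speedrun values:

```python
# Hypothetical sequence-length schedule: train on short sequences early,
# then ramp linearly to the full context length, rounding down to a
# multiple of 128 so attention kernels keep tidy block shapes.

FULL_LEN = 1024     # assumed full context length
START_LEN = 256     # assumed warmup length
RAMP_STEPS = 1000   # assumed ramp duration in optimizer steps

def max_seq_len(step: int) -> int:
    if step >= RAMP_STEPS:
        return FULL_LEN
    frac = step / RAMP_STEPS
    raw = START_LEN + frac * (FULL_LEN - START_LEN)
    return max(START_LEN, int(raw) // 128 * 128)
```

Short early sequences buy cheaper steps when the model is still learning local statistics; the ramp restores full context before the loss target is measured.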
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 92.1 (-0.3s) from @dhrvji, by moving the bigram hash from CPU to GPU. As shown here, recently added architectures are a great place to look for engineering improvements. github.com/KellerJordan/m…
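For readers unfamiliar with the bigram hash being moved to GPU: the idea is to map each (prev_token, token) pair to a slot in a fixed-size table. The hashing scheme below (shift-xor plus a multiplicative hash) is an illustrative stand-in, not the repo's actual kernel:

```python
# Illustrative bigram hash. On GPU this becomes one vectorized elementwise
# op over the whole batch, which is why moving it off the CPU removes a
# host-side bottleneck.

TABLE_SIZE = 1 << 20  # assumed hash-table size
MULT = 2654435761     # Knuth's multiplicative-hash constant

def bigram_slot(prev_token: int, token: int) -> int:
    key = (prev_token << 17) ^ token  # combine the pair into one integer
    return (key * MULT) % TABLE_SIZE  # multiplicative hash into the table

def bigram_slots(tokens: list[int]) -> list[int]:
    # one slot per position, conditioned on the previous token (0 at start)
    return [bigram_slot(tokens[i - 1] if i else 0, tokens[i])
            for i in range(len(tokens))]
```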
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 92.4 (-3.3s) from the kernel sorcerer @andrewbriand8, with a Triton kernel that fuses the fp8 quantization of the gradient into the backward kernel (the lm_head calc is run in fp8). On pace for 120% MFU by July. github.com/KellerJordan/m…
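The scale-and-clamp step such a fused kernel performs can be emulated in plain Python. This is a crude e4m3-style fake-quantization sketch to show the arithmetic only; the real work happens in-place inside the Triton backward kernel on the GPU:

```python
# Toy fake-fp8 (e4m3-style) quantization of a gradient vector: rescale so
# the largest value maps near the fp8 max, clamp, then emulate a 3-bit
# mantissa by rounding to 2^(exp-3). Pure-Python floats throughout.

import math

E4M3_MAX = 448.0  # largest finite e4m3 value

def quantize_e4m3(xs: list[float]) -> tuple[list[float], float]:
    amax = max((abs(x) for x in xs), default=0.0)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    out = []
    for x in xs:
        v = max(-E4M3_MAX, min(E4M3_MAX, x * scale))  # clamp to fp8 range
        if v != 0.0:
            exp = math.floor(math.log2(abs(v)))
            step = 2.0 ** (exp - 3)        # spacing with 3 mantissa bits
            v = round(v / step) * step
        out.append(v)
    return out, scale  # dequantize later with x ≈ v / scale
```

Fusing this into the backward kernel saves a separate pass over the gradient in memory, which is the whole win: the quantization math itself is trivial.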
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 96.8 (-1.0s) from @varunneal, achieved by combining the value embeds into a single param for faster indexing. On H100 this gives a 23% speedup! On A100 it gives a slowdown, indicating hardware-dependent interactions. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 97.8 (-1.2s) from @srashedll, with an update to the attention initialization. Motivated by mimetic initialization techniques, experiments uncovered that a small random init outperformed zero init on the attention out projection. github.com/KellerJordan/m…
Larry Dial @classiclarryd
New NanoGPT Speedrun WR at 99.0 (-0.3s) from @photon_mz, with an update from 3 to 5 value embeddings, enabling 1.5% fewer training steps! The trend of fewer steps with higher sparsity continues. github.com/KellerJordan/m…
Larry Dial @classiclarryd
@KoszarskyB My guess is things got lost in translation, and they mean their caching implementation for inference is the innovative part.