Lucas Nestler
@Clashluke · Researcher
Zurich, Switzerland · Joined October 2020
372 Following · 4.9K Followers
2K posts
Lucas Nestler (@Clashluke):
A few notes: 1) The LayerNorms are copied as-is from Qwen 3. 2) The per-group scales are absmean, not learned (github.com/PrismML-Eng/ll…, #L44-L47). 3) Depth-based patterns appear through classical optimization, so it's unlikely that some layers (like embeddings) are treated differently. 4) Given the bit similarity, it likely warm-started from Qwen and ran O(1000) finetuning steps to recover lost performance. We can't know what algorithm was used here, but the weight pattern hints at AdamW. Based on (2), it's likely a BitNet variant.
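For readers unfamiliar with absmean scaling as mentioned in point (2): a minimal sketch of per-group ternary quantization in the BitNet b1.58 style. The function name, default group size, and flat-list layout are illustrative assumptions, not the linked repo's code.

```python
def absmean_quantize(weights, group_size=4):
    """Ternary {-1, 0, +1} quantization with per-group absmean scales,
    in the style of BitNet b1.58. Toy sketch over a flat list of floats."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # absmean scale: mean absolute value of the group, not learned
        # (fall back to 1.0 for an all-zero group to avoid division by zero)
        scale = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(scale)
        # scale, round to the nearest integer, clip to {-1, 0, +1}
        quantized.extend(max(-1, min(1, round(w / scale))) for w in group)
    return quantized, scales
```

For example, `absmean_quantize([0.5, -0.5, 0.1, -0.1])` maps the two large weights to ±1 and the two small ones to 0, with a group scale of roughly 0.3.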
Lucas Nestler (@Clashluke):
@zhengyaojiang Understood, thank you for the clarification! It's very interesting seeing how their findings differ mechanistically, regardless of the scores. Wishing you all the best with the paper.
Zhengyao Jiang (@zhengyaojiang):
Fair point, thanks for the pointer! I'll be honest, I'm not an expert in Bayesian optimization; TPE was what I knew as the most commonly used algorithm, so that's what we started with. We'd probably add HEBO and other SOTA baselines if we turn this into a proper paper. That said, it's worth noting that the AutoResearch setup we used is basically Karpathy's version, which was probably put together in a few days. I'm pretty sure there will be much better implementations soon, and better models driving them. This is a preliminary study, but we think the underlying trends it shows should hold up.
Zhengyao Jiang (@zhengyaojiang):
Is autoresearch really better than classic hyperparameter tuning? We did experiments comparing Optuna & autoresearch. Autoresearch converges faster, is more cost-efficient, and even generalizes better: 🧵(1/6)
[image attached]
Lucas Nestler (@Clashluke):
With replicated gradients that arrive after external reduction, possibly. The key addition for FSDP is automatic shape detection, which is needed only for optimizers like Muon and SOAP. It moves gradients in a round-robin fashion to run one whole-matrix Muon per device and pushes them back to their hosts, using two large all_to_all calls over the default group. AdamW isn't touched by this at all. If you have a simple script with it, I could look into adding multi-axis or EP support.
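The round-robin scheme described above can be sketched conceptually. All names here are hypothetical, and the sketch covers only the ownership assignment; the real path would gather sharded gradients onto the owning rank with all_to_all collectives and scatter the updates back, which is omitted.

```python
def round_robin_owners(param_names, world_size):
    """Assign each weight matrix to a rank in round-robin order, so every
    device runs the whole-matrix update (e.g. Muon's orthogonalization)
    for roughly 1/world_size of the parameters."""
    return {name: i % world_size for i, name in enumerate(param_names)}

def params_for_rank(owners, rank):
    """Matrices whose whole-matrix optimizer step runs on `rank`.
    In a real FSDP setup, their gradient shards would be moved to
    `rank` via all_to_all and the updated weights pushed back after."""
    return [name for name, owner in owners.items() if owner == rank]
```

With two ranks and five matrices, rank 0 owns matrices 0, 2, and 4, and rank 1 owns matrices 1 and 3, balancing the per-device orthogonalization work.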
samsja (@samsja19):
@Clashluke @tonysilveti Does it support EP? Or at least allow gathering on a different device mesh / axis?
Lucas Nestler (@Clashluke):
HeavyBall 3.0.0 is finally out. Key features:
* FSDP
* DDP
* End-to-end compilation (2.5x speedup)
* Higher-precision PSGDKron (grey, vs. HB2's blue)
* Faster Muon and SOAP
* PSGD-PRO (yellow)
* LATHER, a SOAP-like optimizer
* HyperBall
* Explicit `consume_grad`
* Simplified API
[image attached]
Lucas Nestler (@Clashluke):
@PrismML @HessianFree Do your Qwen quants gain more intelligence-per-bit from scaling than the baseline? I'd be curious to see if the scaling law holds with Bonsai 3.5 397B.
[image attached]
PrismML (@PrismML):
Today, we are emerging from stealth and launching PrismML, an AI lab with Caltech origins centered on building the most concentrated form of intelligence. At PrismML, we believe that the next major leaps in AI will be driven by order-of-magnitude improvements in intelligence density, not just sheer parameter count.

Our first proof point is Bonsai 8B, a 1-bit-weight model that fits into 1.15 GB of memory and delivers over 10x the intelligence density of its full-precision counterparts. It is 14x smaller, 8x faster, and 5x more energy-efficient on edge hardware while remaining competitive with other models in its parameter class. We are open-sourcing the model under the Apache 2.0 license, along with Bonsai 4B and 1.7B models.

When advanced models become small, fast, and efficient enough to run locally, the design space for AI changes immediately. We believe in a future of on-device agents, real-time robotics, offline intelligence, and entirely new products that were previously impossible. We are excited to share our vision with you and to keep pushing the frontier of intelligence to the edge.
[image attached]
Lucas Nestler (@Clashluke):
@jjgort It's ECC-noSR on the parameters with BF16-SR on the optimizer state, to follow the paper more closely. Full ECC-SR will follow soon.
Jose Javier Gonzalez (@jjgort):
@Clashluke Can you elaborate on the setup? I don't quite follow why ECC+SR would converge worse than BF16+SR. Don't both do the same debiasing?
Lucas Nestler (@Clashluke):
@giffmana @jjgort Exactly, it's compiled with fullgraph=True into one kernel with in-kernel RNG: gist.github.com/ClashLuke/d188… (#file-kernel-py)
Lucas Beyer (bl16) (@giffmana):
@Clashluke @jjgort It's this, right? github.com/HomebrewML/Hea… (#L2121-L2126) Yeah, I see memory traffic should dominate if the randint gets correctly fused per element (not sampled as a big tensor first), thx.
Lucas Nestler (@Clashluke):
@giffmana @jjgort 1) fixing, good catch 2) BF16+SR is the same speed as BF16+RNE in HeavyBall, as SR is fused into the optimizer kernel and adds no memory traffic. ECC adds ~50% memory traffic per tensor.
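For context on why stochastic rounding (SR) differs from truncation and round-to-nearest-even (RNE): a self-contained, bit-level sketch of float32-to-bfloat16 rounding. This is an illustrative toy, not the fused HeavyBall kernel, and it ignores rare exponent-overflow corner cases.

```python
import random
import struct

def bf16_truncate(x):
    """Naive float32 -> bfloat16 by dropping the low 16 mantissa bits
    (round-toward-zero): the biased rounding the thread warns about."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_stochastic_round(x, rng=random.random):
    """Float32 -> bfloat16 with stochastic rounding: round up with
    probability equal to the discarded fraction, so the result is
    unbiased in expectation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    low = bits & 0xFFFF          # discarded fraction of the mantissa
    hi = bits & 0xFFFF0000       # truncated bfloat16 bit pattern
    if rng() * 0x10000 < low:    # round up with probability low / 2**16
        hi = (hi + 0x10000) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", hi))[0]
```

For x = 1 + 2**-9, which lies between the bfloat16 neighbors 1.0 and 1 + 2**-7, truncation always returns 1.0, while stochastic rounding returns 1 + 2**-7 a quarter of the time, so repeated roundings average back to x.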
Lucas Beyer (bl16) (@giffmana):
@Clashluke @jjgort Cool blog post! Two comments: 1) change the orange color, it's way too similar to red. Mauve should be good 2) I'm curious if you have any speed comparison between these curves? Especially wondering about bf16 SR vs ECC vs RNE
Lucas Nestler (@Clashluke):
@SkyLi0n same lr for all methods, but the main effect isn't lr. naive bf16 truncates v, so v freezes and adam stops adapting. ecc and stochastic rounding both prevent that, even though neither changes the effective lr.
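The freezing effect can be shown with a toy rounding model at an 8-bit significand (roughly bfloat16's precision). The helper names are mine, not HeavyBall's; the point is only that nearest rounding can swallow Adam's tiny second-moment updates while stochastic rounding preserves them in expectation.

```python
import math
import random

def round_nearest(x, sig_bits=8):
    """Round x to sig_bits of significand (about bf16's precision)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** sig_bits
    return round(m * scale) / scale * 2.0 ** e

def round_stochastic(x, sig_bits=8, rng=random.random):
    """Round x to sig_bits of significand stochastically (unbiased)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    scale = 2.0 ** sig_bits
    scaled = m * scale
    frac = scaled - math.floor(scaled)
    rounded = math.floor(scaled) + (1 if rng() < frac else 0)
    return rounded / scale * 2.0 ** e

# One Adam second-moment step with a small gradient: the true update
# v <- beta2 * v + (1 - beta2) * g**2 lands near 0.999000001, but with
# an 8-bit significand nearest rounding snaps it back to 1.0, so v
# never moves. Stochastic rounding sometimes steps down to 255/256,
# matching the true decay in expectation.
v, g, beta2 = 1.0, 1e-3, 0.999
v_true = beta2 * v + (1 - beta2) * g * g
v_rne = round_nearest(v_true)         # stays at exactly 1.0: frozen
```

Under nearest rounding, v stays pinned at 1.0 no matter how many such steps run, which is the "v freezes and adam stops adapting" failure; the stochastic variant decays on average at the true rate.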
Aaron Gokaslan (@SkyLi0n):
@Clashluke Is this fully LR swept? Couldn't the ECC change the effective LR?