Lucas Nestler
@Clashluke · Researcher
Zurich, Switzerland · Joined October 2020
372 Following · 4.9K Followers
2K posts
Lucas Nestler (@Clashluke):
A few notes: 1) The LayerNorms are copied as-is from Qwen 3. 2) The per-group scales are absmean, not learned (github.com/PrismML-Eng/ll…, #L44-L47). 3) Depth-based patterns appear through classical optimization, so it's unlikely that some layers (like embeddings) are treated differently. 4) Given the bit similarity, it likely warm-started from Qwen and ran O(1000) finetuning steps to recover lost performance. We can't know what algorithm was used here, but the weight pattern hints at AdamW. Based on (2), it's likely a BitNet variant.
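For readers unfamiliar with absmean scaling as mentioned in point (2): a minimal sketch of per-group ternary quantization in the BitNet b1.58 style. The function name, default group size, and flat-list layout are illustrative assumptions, not the linked repo's code.

```python
def absmean_quantize(weights, group_size=4):
    """Ternary {-1, 0, +1} quantization with per-group absmean scales,
    in the style of BitNet b1.58. Toy sketch over a flat list of floats."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # absmean scale: mean absolute value of the group, not learned
        # (fall back to 1.0 for an all-zero group to avoid division by zero)
        scale = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(scale)
        # scale, round to the nearest integer, clip to {-1, 0, +1}
        quantized.extend(max(-1, min(1, round(w / scale))) for w in group)
    return quantized, scales
```

For example, `absmean_quantize([0.5, -0.5, 0.1, -0.1])` maps the two large weights to ±1 and the two small ones to 0, with a group scale of roughly 0.3.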
Lucas Nestler (@Clashluke):
@zhengyaojiang Understood, thank you for the clarification! It's very interesting seeing how their findings differ mechanistically, regardless of the scores. Wishing you all the best with the paper.
Zhengyao Jiang (@zhengyaojiang):
Fair point, thanks for the pointer! I'll be honest, I'm not an expert in Bayesian optimization; TPE was what I knew as the most commonly used algorithm, so that's what we started with. We'd probably add HEBO and other SOTA baselines if we turn this into a proper paper. That said, it's worth noting that the AutoResearch setup we used is basically Karpathy's version, which was probably put together in a few days. I'm pretty sure there will be much better implementations soon, and better models driving them. This is a preliminary study, but we think the underlying trends it shows should hold up.
Zhengyao Jiang (@zhengyaojiang):
Is autoresearch really better than classic hyperparameter tuning? We did experiments comparing Optuna & autoresearch. Autoresearch converges faster, is more cost-efficient, and even generalizes better: 🧵(1/6)
[image attached]
Lucas Nestler (@Clashluke):
With replicated gradients that arrive after external reduction, possibly. The key addition for FSDP is automatic shape detection, which is needed only for optimizers like Muon and SOAP. It moves gradients in a round-robin fashion to run one whole-matrix Muon per device and pushes them back to their hosts, using two large all_to_all calls over the default group. AdamW isn't touched by this at all. If you have a simple script with it, I could look into adding multi-axis or EP support.
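The round-robin scheme described above can be sketched conceptually. All names here are hypothetical, and the sketch covers only the ownership assignment; the real path would gather sharded gradients onto the owning rank with all_to_all collectives and scatter the updates back, which is omitted.

```python
def round_robin_owners(param_names, world_size):
    """Assign each weight matrix to a rank in round-robin order, so every
    device runs the whole-matrix update (e.g. Muon's orthogonalization)
    for roughly 1/world_size of the parameters."""
    return {name: i % world_size for i, name in enumerate(param_names)}

def params_for_rank(owners, rank):
    """Matrices whose whole-matrix optimizer step runs on `rank`.
    In a real FSDP setup, their gradient shards would be moved to
    `rank` via all_to_all and the updated weights pushed back after."""
    return [name for name, owner in owners.items() if owner == rank]
```

With two ranks and five matrices, rank 0 owns matrices 0, 2, and 4, and rank 1 owns matrices 1 and 3, balancing the per-device orthogonalization work.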
samsja (@samsja19):
@Clashluke @tonysilveti Does it support EP? Or at least allow gathering on a different device mesh / axis?
Lucas Nestler (@Clashluke):
HeavyBall 3.0.0 is finally out. Key features:
* FSDP
* DDP
* End-to-end compilation (2.5x speedup)
* Higher-precision PSGDKron (grey, vs. HB2's blue)
* Faster Muon and SOAP
* PSGD-PRO (yellow)
* LATHER, a SOAP-like optimizer
* HyperBall
* Explicit `consume_grad`
* Simplified API
[image attached]
Lucas Nestler (@Clashluke):
@PrismML @HessianFree Do your Qwen quants gain more intelligence-per-bit from scaling than the baseline? I'd be curious to see if the scaling law holds with Bonsai 3.5 397B.
[image attached]
PrismML (@PrismML):
Today, we are emerging from stealth and launching PrismML, an AI lab with Caltech origins centered on building the most concentrated form of intelligence. At PrismML, we believe that the next major leaps in AI will be driven by order-of-magnitude improvements in intelligence density, not just sheer parameter count.

Our first proof point is Bonsai 8B, a 1-bit-weight model that fits into 1.15 GB of memory and delivers over 10x the intelligence density of its full-precision counterparts. It is 14x smaller, 8x faster, and 5x more energy-efficient on edge hardware while remaining competitive with other models in its parameter class. We are open-sourcing the model under the Apache 2.0 license, along with Bonsai 4B and 1.7B models.

When advanced models become small, fast, and efficient enough to run locally, the design space for AI changes immediately. We believe in a future of on-device agents, real-time robotics, offline intelligence, and entirely new products that were previously impossible. We are excited to share our vision with you and to keep pushing the frontier of intelligence to the edge.
[image attached]
Lucas Nestler (@Clashluke):
@jjgort It's ECC-noSR on the parameters with BF16-SR on the optimizer state, to follow the paper more closely. Full ECC-SR will follow soon.
Jose Javier Gonzalez (@jjgort):
@Clashluke Can you elaborate on the setup? I don't quite follow why ECC+SR would converge worse than BF16+SR. Don't both do the same debiasing?
Lucas Nestler (@Clashluke):
@giffmana @jjgort Exactly, it's compiled with fullgraph=True into one kernel with in-kernel RNG: gist.github.com/ClashLuke/d188… (#file-kernel-py)
Lucas Beyer (bl16) (@giffmana):
@Clashluke @jjgort It's this, right? github.com/HomebrewML/Hea… (#L2121-L2126) Yeah, I see memory traffic should dominate if the randint gets correctly fused per element (not sampled as a big tensor first), thx.
Lucas Nestler (@Clashluke):
@giffmana @jjgort 1) fixing, good catch 2) BF16+SR is the same speed as BF16+RNE in HeavyBall, as SR is fused into the optimizer kernel and adds no memory traffic. ECC adds ~50% memory traffic per tensor.
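For context on why stochastic rounding (SR) differs from truncation and round-to-nearest-even (RNE): a self-contained, bit-level sketch of float32-to-bfloat16 rounding. This is an illustrative toy, not the fused HeavyBall kernel, and it ignores rare exponent-overflow corner cases.

```python
import random
import struct

def bf16_truncate(x):
    """Naive float32 -> bfloat16 by dropping the low 16 mantissa bits
    (round-toward-zero): the biased rounding the thread warns about."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_stochastic_round(x, rng=random.random):
    """Float32 -> bfloat16 with stochastic rounding: round up with
    probability equal to the discarded fraction, so the result is
    unbiased in expectation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    low = bits & 0xFFFF          # discarded fraction of the mantissa
    hi = bits & 0xFFFF0000       # truncated bfloat16 bit pattern
    if rng() * 0x10000 < low:    # round up with probability low / 2**16
        hi = (hi + 0x10000) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", hi))[0]
```

For x = 1 + 2**-9, which lies between the bfloat16 neighbors 1.0 and 1 + 2**-7, truncation always returns 1.0, while stochastic rounding returns 1 + 2**-7 a quarter of the time, so repeated roundings average back to x.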
Lucas Beyer (bl16) (@giffmana):
@Clashluke @jjgort Cool blog post! Two comments: 1) change the orange color, it's way too similar to red. Mauve should be good 2) I'm curious if you have any speed comparison between these curves? Especially wondering about bf16 SR vs ECC vs RNE
Lucas Nestler (@Clashluke):
@SkyLi0n same lr for all methods, but the main effect isn't lr. naive bf16 truncates v, so v freezes and adam stops adapting. ecc and stochastic rounding both prevent that, even though neither changes the effective lr.
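The freezing effect can be shown with a toy rounding model at an 8-bit significand (roughly bfloat16's precision). The helper names are mine, not HeavyBall's; the point is only that nearest rounding can swallow Adam's tiny second-moment updates while stochastic rounding preserves them in expectation.

```python
import math
import random

def round_nearest(x, sig_bits=8):
    """Round x to sig_bits of significand (about bf16's precision)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** sig_bits
    return round(m * scale) / scale * 2.0 ** e

def round_stochastic(x, sig_bits=8, rng=random.random):
    """Round x to sig_bits of significand stochastically (unbiased)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    scale = 2.0 ** sig_bits
    scaled = m * scale
    frac = scaled - math.floor(scaled)
    rounded = math.floor(scaled) + (1 if rng() < frac else 0)
    return rounded / scale * 2.0 ** e

# One Adam second-moment step with a small gradient: the true update
# v <- beta2 * v + (1 - beta2) * g**2 lands near 0.999000001, but with
# an 8-bit significand nearest rounding snaps it back to 1.0, so v
# never moves. Stochastic rounding sometimes steps down to 255/256,
# matching the true decay in expectation.
v, g, beta2 = 1.0, 1e-3, 0.999
v_true = beta2 * v + (1 - beta2) * g * g
v_rne = round_nearest(v_true)         # stays at exactly 1.0: frozen
```

Under nearest rounding, v stays pinned at 1.0 no matter how many such steps run, which is the "v freezes and adam stops adapting" failure; the stochastic variant decays on average at the true rate.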
Aaron Gokaslan (@SkyLi0n):
@Clashluke Is this fully LR swept? Couldn't the ECC change the effective LR?