Thomas Massena

58 posts

@thomasmassena

PhD-ing in Toulouse or Paris

Joined July 2025

238 Following · 46 Followers

Pinned Tweet

Thomas Massena@thomasmassena·24 Feb

Wrote my first blog post; turns out exploring adaptive orthogonalization methods is pretty fun. massena-t.github.io/blog/2026/02/1… TL;DR: An adaptive Newton-Schulz scheme gives a theoretically principled way to choose between SGD-esque and Muon-esque weight updates.

Thomas Massena retweeted

Thibaut Boissin@ThibautBoissin·15h

Muon is actually not that bad for CNNs: on ImageNet-1K, ResNet-50 gets 54% top-1 classification accuracy in a single epoch. It also gets close to the performance of ResNet-50 v1 trained for 90 epochs, with only a third of the budget.

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget

@VukRosic99 Not sure Muon would help for CNNs.

Thomas Massena retweeted

Thomas Fel@thomas_fel_·21 May

How do SAEs capture concept manifolds? 🍩 I think this is important work. We study how SAEs handle the geometric structures we've identified and find they tile/shatter them in a particular way we characterize, letting us recast unsupervised manifold discovery as inverse Ising

Goodfire@GoodfireAI

The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)

Thomas Massena retweeted

Thomas Fel@thomas_fel_·7 May

Happy to share my first post since joining Goodfire. Neural geometry has been my obsession for years, and our team here is building a really serious research agenda around it. I can't wait to share the series of papers coming over the next few weeks... Brace for shapes 🍩

Goodfire@GoodfireAI

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

Thomas Massena retweeted

Hanchen Li@lihanc02·11 May

NeurIPS paper bidding for me. Lollll

Thomas Massena retweeted

Erfanzar@eraznafre·28 Apr

Releasing SpectraX, a JAX-native neural-network library built around true MPMD pipeline parallelism. Each physical rank compiles and runs its own XLA program: no shared shard_map HLO, no SPMD-same-shape constraint. Heterogeneous stages (e.g., embed → blocks → head), nine pipeline schedules (GPipe, 1F1B, ZeroBubble, Interleaved, DualPipeV, …), and a unified spx.run()/spx.jit() entry point that dispatches to SPMD or MPMD from the same training script. github.com/erfanzar/Spect…

Thomas Massena retweeted

Pierfrancesco Beneventano@PierBeneventano·26 Apr

Muon leads to severely miscalibrated models! This is just one of the results of this new paper of ours: In “Too Sharp, Too Sure” we show calibration error tracks loss curvature during training and we tie both to margin tails.

Thomas Massena retweeted

Hayden Prairie@hayden_prairie·15 Apr

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

Thomas Massena retweeted

varun@varunneal·8 Apr

my version of gstack is a folder called "optimizer-literature" containing 250 arXiv TeXs and the entire Jianlin Su canon that I force all my agents to read

Thomas Massena retweeted

Tim Dettmers@Tim_Dettmers·7 Apr

We in the quantization community could quickly see this and were flabbergasted by the response to TurboQuant. Whenever I saw TurboQuant on my timeline, I found it hurtful, because the work of other academics who worked so hard was discounted.

Thomas Massena retweeted

Thibaut Boissin@ThibautBoissin·1 Apr

I'd be curious to combine it with turbo-muon. That could have double benefits: the extra speedup from the removal of one iter plus just enough stabilization to avoid the restart at iter 3. @noahamsel what do you think?

Jack Zhang@jcz42

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
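
To make the "iterate on the Gram matrix" idea concrete, here is a minimal pure-Python sketch. It is not the authors' algorithm: it assumes the classic cubic Newton-Schulz update X ← (1.5·I − 0.5·XXᵀ)X rather than Muon's tuned polynomial, it skips the stabilization the tweet mentions, and all function names are hypothetical. It only checks that tracking G = XXᵀ and a running product of step matrices gives the same result as iterating on X directly.

```python
# Toy comparison of direct vs Gram-matrix Newton-Schulz (cubic variant, an assumption).

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius(a):
    return sum(v * v for row in a for v in row) ** 0.5

def step_matrix(g):
    """Return M = 1.5*I - 0.5*G for a square matrix G."""
    n = len(g)
    return [[(1.5 if i == j else 0.0) - 0.5 * g[i][j] for j in range(n)]
            for i in range(n)]

def newton_schulz(x, steps):
    """Iterate directly on the rectangular matrix X."""
    for _ in range(steps):
        m = step_matrix(matmul(x, transpose(x)))
        x = matmul(m, x)
    return x

def gram_newton_schulz(x, steps):
    """Iterate on the small Gram matrix G = X X^T; X is only touched at the ends."""
    g = matmul(x, transpose(x))
    n = len(g)
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        m = step_matrix(g)
        q = matmul(m, q)             # running product of the step matrices
        g = matmul(matmul(m, g), m)  # Gram matrix of the next iterate
    return matmul(q, x)

# Hypothetical 2x3 "gradient", normalized by its Frobenius norm before iterating.
grad = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
scale = frobenius(grad)
x0 = [[v / scale for v in row] for row in grad]

direct = newton_schulz(x0, 12)
via_gram = gram_newton_schulz(x0, 12)
```

In this toy setting the loop in `gram_newton_schulz` only multiplies 2×2 matrices, while `newton_schulz` multiplies by the 2×3 matrix at every step; for a wide weight matrix that gap is where the FLOP savings described above would come from.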

Thomas Massena@thomasmassena·1 Apr

@breskanu You can check out the Turbo-Muon and Chebyshev Accelerated NS paper for this.

Thomas Massena@thomasmassena·1 Apr

@breskanu Oh it definitely is better to use the spectral norm. Only it's more costly to compute. Some AOL / Gelfand formula tricks can allow you to use the X^T . X matrix to normalize to [0, 1] better than the Frobenius norm (and more efficiently than with the spectral norm).

Nikita Breskanu@breskanu·31 Mar

Standard Muon takes X0 = G / ||G||_F. It feels like normalizing by spectral norm ||G||_2 may potentially be better than frobenius: it keeps the range [0, 1] needed for convergence, but singular values are more widespread across it.
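
The trade-off in this exchange can be sketched numerically. The pure-Python snippet below is one possible reading of the Gelfand-formula remark, not code from either author: it assumes the bound ‖(XᵀX)ᵏ‖_F^(1/(2k)), which is never smaller than the spectral norm and never larger than the Frobenius norm, and it uses power iteration as a stand-in for an exact spectral norm. The AOL rescaling mentioned in the reply is not shown, and all function names are hypothetical.

```python
# Compare three normalizers for a gradient matrix: Frobenius norm,
# a Gelfand-style upper bound, and the spectral norm (power iteration).

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius(a):
    return sum(v * v for row in a for v in row) ** 0.5

def gelfand_bound(x, squarings=2):
    """Upper bound on the largest singular value of x.

    Computes ||(X^T X)^k||_F ** (1 / (2k)) with k = 2**squarings,
    using only matrix products (no SVD, no iteration to convergence).
    """
    a = matmul(transpose(x), x)
    k = 1
    for _ in range(squarings):
        a = matmul(a, a)  # repeated squaring: a = (X^T X)^(2k)
        k *= 2
    return frobenius(a) ** (1.0 / (2 * k))

def spectral_norm(x, iters=200):
    """Largest singular value of x via power iteration on X^T X."""
    a = matmul(transpose(x), x)
    v = [[1.0] for _ in a]
    for _ in range(iters):
        w = matmul(a, v)
        norm = frobenius(w)
        v = [[row[0] / norm] for row in w]
    return frobenius(matmul(x, v))

# Hypothetical 2x3 "gradient".
grad = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
frob = frobenius(grad)
gelf = gelfand_bound(grad)
spec = spectral_norm(grad)

# Dividing by any of the three keeps singular values in [0, 1];
# the smaller the normalizer, the closer the top singular value sits to 1.
top_after = {name: spec / d for name, d in
             [("frobenius", frob), ("gelfand", gelf), ("spectral", spec)]}
```

With more squarings the bound tightens toward the spectral norm, at the cost of extra matrix products on the square XᵀX matrix.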

Thomas Massena@thomasmassena·1 Apr

@HessianFree Impressive stuff! Congrats

Omead Pooladzandi@HessianFree·31 Mar

your spotify cache is bigger than our largest AI model. Bonsai: 1-bit weights. 1.7B to 8B params. 14x compression vs bf16. 8x faster on edge. 256 MB to 1.2GB. Based on Qwen 3. we just came out of stealth. intelligence belongs at the edge and we're going to put it there. Apache 2.0. we compressed intelligence. more coming. @PrismML

PrismML@PrismML

Today, we are emerging from stealth and launching PrismML, an AI lab with Caltech origins that is centered on building the most concentrated form of intelligence. At PrismML, we believe that the next major leaps in AI will be driven by order-of-magnitude improvements in intelligence density, not just sheer parameter count. Our first proof point is the 1-bit Bonsai 8B, a 1-bit weight model that fits into 1.15 GBs of memory and delivers over 10x the intelligence density of its full-precision counterparts. It is 14x smaller, 8x faster, and 5x more energy efficient on edge hardware while remaining competitive with other models in its parameter-class. We are open-sourcing the model under Apache 2.0 license, along with Bonsai 4B and 1.7B models. When advanced models become small, fast, and efficient enough to run locally, the design space for AI changes immediately. We believe in a future of on-device agents, real-time robotics, offline intelligence and entirely new products that were previously impossible. We are excited to share our vision with you and keep working in the future to push the frontier of intelligence to the edge.

Thomas Massena@thomasmassena·23 Mar

@cloneofsimo That's so much more competitive than I would've imagined! Are you planning any kind of write-up? I'd be really interested

Simo Ryu@cloneofsimo·23 Mar

Parameter Golf, non-neural-network division: I've got (my gpt-5.4) to get val bpb of 1.6633 with Markov chains + bag of tricks, fully autonomously. Codex implemented methods from 1990~2005.

Thomas Massena retweeted

David@dnhkng·22 Mar

1/n I topped the HuggingFace Open LLM Leaderboard without changing a single weight. No training. No merging. No gradient descent. I duplicated 7 middle layers of Qwen2-72B and stitched it back together. This is the story of LLM Neuroanatomy 🧵
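
As a rough illustration of what "duplicating middle layers and stitching back together" could mean mechanically, here is a toy sketch. It assumes the duplicated span simply runs a second time right after its first pass, with the same weights; the tweet does not give the layer indices or the exact ordering, so `duplicate_span`, the 6-layer stack, and the span (2, 4) are all hypothetical.

```python
# Toy layer-duplication sketch: no weights change, only the execution order.

def duplicate_span(layers, start, end):
    """Return a new stack in which layers[start:end] run twice back to back."""
    if not 0 <= start < end <= len(layers):
        raise ValueError("span must lie inside the layer stack")
    return layers[:end] + layers[start:end] + layers[end:]

def run(layers, x):
    """Apply each layer in order."""
    for layer in layers:
        x = layer(x)
    return x

def make_layer(index):
    """Hypothetical stand-in for a transformer block: records its index in a trace."""
    return lambda trace: trace + [index]

toy_model = [make_layer(i) for i in range(6)]
stitched = duplicate_span(toy_model, 2, 4)
trace = run(stitched, [])
# trace is [0, 1, 2, 3, 2, 3, 4, 5]: layers 2 and 3 are reused, not retrained.
```

The stitched stack holds references to the same layer objects, which matches the "without changing a single weight" framing: depth grows while the parameter set stays the same.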

Thomas Massena retweeted

Thibaut Boissin@ThibautBoissin·6 Mar

Totally agree with the problem of Muon for CNNs: kernel reshaping + orthonormalization ≠ orthogonalizing the operator. Interestingly, a whole community studied this for convolutional weights (instead of gradients): arxiv.org/abs/1911.00937

Ji-Ha@Ji_Ha_Kim

How to ("properly") orthogonalize convolutional layers for Muon optimizer. Trick: Assume circular kernels to allow diagonalization. Blog post + proof of concept CIFAR10 speedrun fork (unoptimized and slow for now but better convergence per step).

Thomas Massena retweeted

Ji-Ha@Ji_Ha_Kim·5 Mar

Thomas Massena retweeted

Jason Ramapuram@jramapuram·26 Feb

Autoregressive models dominate, but what if we treat multimodal generation as discrete, order-agnostic iterative refinement? Excited to share our systematic study on the design space of Tri-Modal Masked Diffusion Models (MDMs). We pre-trained the first Tri-Modal MDM from scratch on (text,), (image, text), and (audio, text). The same model can do ASR, TTS, T2I, captioning and native text generation. What I'm the most proud of in this work is the scientific rigor. Over 3,500 training runs. Principled hyperparameter transfer. Honest results. Carefully controlled ablations across multiple different axes of entanglement. A thread on our empirical findings (arXiv: arxiv.org/abs/2602.21472)

Thomas Massena retweeted

Samip@industriaalist·26 Feb

1/ Introducing NanoGPT Slowrun 🐢: an open repo for state-of-the-art data-efficient learning algorithms. It's built for the crazy ideas that speedruns filter out -- expensive optimizers, heavy regularization, SGD replacements like evolutionary search.