Thomas Massena

58 posts

@thomasmassena

PhD-ing in Toulouse or Paris

Joined July 2025
238 Following · 46 Followers
Pinned Tweet
Thomas Massena @thomasmassena
Wrote my first blog post, turns out exploring adaptive orthogonalization methods is pretty fun. massena-t.github.io/blog/2026/02/1… TLDR: Using an adaptive Newton-Schulz scheme allows a theoretically principled way to choose between SGD-esque or Muon-esque weight updates.
2 replies · 10 reposts · 61 likes · 5.7K views
Thomas Massena retweeted
Thibaut Boissin @ThibautBoissin
Muon is actually not that bad for CNNs: on ImageNet-1K, ResNet-50 gets 54% top-1 classification accuracy in a single epoch. It also gets close to the performance of ResNet-50 v1 trained for 90 epochs, with only a third of the budget.
JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱 @JFPuget

@VukRosic99 Not sure Muon would help for CNNs.

1 reply · 1 repost · 6 likes · 321 views
Thomas Massena retweeted
Thomas Fel @thomas_fel_
How do SAEs capture concept manifolds? 🍩 I think this is important work. we study how SAEs handle the geometric structures we've identified and find they tile/shatter them in a particular way we characterize, letting us recast unsupervised manifold discovery as inverse Ising
Goodfire @GoodfireAI

The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)

1 reply · 11 reposts · 82 likes · 4.7K views
Thomas Massena retweeted
Thomas Fel @thomas_fel_
Happy to share my first post since joining Goodfire. Neural geometry has been my obsession for years, and our team here is building a really serious research agenda around it. I can't wait to share the series of papers coming over the next few weeks... Brace for shapes 🍩
Goodfire @GoodfireAI

Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵

55 replies · 71 reposts · 1.1K likes · 77.2K views
Thomas Massena retweeted
Hanchen Li @lihanc02
NeurIPS paper bidding for me. Lollll
5 replies · 3 reposts · 63 likes · 13K views
Thomas Massena retweeted
Erfanzar @eraznafre
Releasing SpectraX, a JAX-native neural-network library built around true MPMD pipeline parallelism. Each physical rank compiles and runs its own XLA program: no shared shard_map HLO, no SPMD-same-shape constraint. Heterogeneous stages (e.g., embed → blocks → head), nine pipeline schedules (GPipe, 1F1B, ZeroBubble, Interleaved, DualPipeV, …), and a unified spx.run()/spx.jit() entry point that dispatches to SPMD or MPMD from the same training script. github.com/erfanzar/Spect…
6 replies · 18 reposts · 161 likes · 37.2K views
Thomas Massena retweeted
Pierfrancesco Beneventano @PierBeneventano
Muon leads to severely miscalibrated models! This is just one of the results of this new paper of ours: In “Too Sharp, Too Sure” we show calibration error tracks loss curvature during training and we tie both to margin tails.
7 replies · 47 reposts · 447 likes · 83K views
Thomas Massena retweeted
Hayden Prairie @hayden_prairie
We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇
41 replies · 179 reposts · 1.3K likes · 293.6K views
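For readers unfamiliar with layer looping, here is a minimal sketch of the idea in the tweet above: the same block of layers is applied several times, so the parameter count stays fixed while the compute grows with the number of loops. The toy residual tanh layers and the names `looped_forward`, `count_params`, and `matmul_flops` are illustrative assumptions, not the authors' architecture or code.

```python
import numpy as np

def looped_forward(x, block_weights, loops):
    """Apply one shared block of residual layers `loops` times."""
    for _ in range(loops):
        for W in block_weights:  # the same weights are reused on every loop
            x = x + np.tanh(x @ W)
    return x

def count_params(block_weights):
    """Parameters are counted once, regardless of how often the block runs."""
    return sum(W.size for W in block_weights)

def matmul_flops(batch, block_weights, loops):
    """Rough matmul cost: 2 * batch * d_in * d_out per layer, per loop."""
    per_pass = sum(2 * batch * W.shape[0] * W.shape[1] for W in block_weights)
    return loops * per_pass

rng = np.random.default_rng(0)
d, batch = 16, 4
block = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]  # toy 3-layer block
x = rng.standard_normal((batch, d))

y1 = looped_forward(x, block, loops=1)
y2 = looped_forward(x, block, loops=2)
print("params:", count_params(block))
print("FLOPs with 1 loop:", matmul_flops(batch, block, 1))
print("FLOPs with 2 loops:", matmul_flops(batch, block, 2))
```

Doubling `loops` doubles the estimated FLOPs while `count_params` is unchanged, which is the trade-off the tweet describes. Whether the extra compute buys quality is the empirical question the scaling laws address, and this sketch says nothing about that.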
Thomas Massena retweeted
varun @varunneal
my version of gstack is a folder called "optimizer-literature" containing 250 arXiv TeXs and the entire Jianlin Su canon that I force all my agents to read
0 replies · 2 reposts · 15 likes · 431 views
Thomas Massena retweeted
Tim Dettmers @Tim_Dettmers
We in the quantization community could quickly see this and were flabbergasted by the response to TurboQuant. Whenever I saw TurboQuant on my timeline, I found it hurtful, because the work of other academics who worked so hard was discounted.
9 replies · 12 reposts · 237 likes · 19.4K views
Thomas Massena retweeted
Thomas Massena @thomasmassena
@breskanu You can check out the Turbo-Muon and Chebyshev Accelerated NS paper for this.
1 reply · 0 reposts · 0 likes · 38 views
Thomas Massena @thomasmassena
@breskanu Oh it definitely is better to use the spectral norm. Only it's more costly to compute. Some AOL / Gelfand formula tricks can allow you to use the X^T . X matrix to normalize to [0, 1] better than the Frobenius norm (and more efficiently than with the spectral norm).
1 reply · 0 reposts · 3 likes · 62 views
Nikita Breskanu @breskanu
Standard Muon takes X0 = G / ||G||_F. It feels like normalizing by the spectral norm ||G||_2 may be better than Frobenius: it keeps the range [0, 1] needed for convergence, but singular values are more widely spread across it.
1 reply · 0 reposts · 1 like · 626 views
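The normalization question in this exchange can be sketched numerically. The snippet below is a rough illustration, not Muon's implementation: it uses the classical cubic Newton-Schulz iteration X ← 1.5 X − 0.5 X Xᵀ X rather than Muon's tuned polynomial, a random Gaussian matrix as a stand-in for a gradient, and a hypothetical helper `ns_steps`. It computes the spectral norm exactly with `np.linalg.norm(G, 2)`, which is the costly step mentioned in the reply above.

```python
import numpy as np

def ns_steps(G, scale, tol=1e-6, max_steps=100):
    """Run X <- 1.5 X - 0.5 X X^T X from X0 = G / scale.

    Returns (steps applied, final X) and stops once ||X X^T - I||_2 < tol.
    Assumes G is wide or square with full row rank.
    """
    X = G / scale
    eye = np.eye(X.shape[0])
    for step in range(max_steps):
        if np.linalg.norm(X @ X.T - eye, 2) < tol:
            return step, X
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return max_steps, X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 32))  # stand-in for a gradient matrix

fro = np.linalg.norm(G)      # Frobenius norm
spec = np.linalg.norm(G, 2)  # spectral norm (largest singular value)

sv_fro = np.linalg.svd(G / fro, compute_uv=False)
sv_spec = np.linalg.svd(G / spec, compute_uv=False)
print("Frobenius-normalized range:", sv_fro.min(), sv_fro.max())
print("spectral-normalized range: ", sv_spec.min(), sv_spec.max())

steps_fro, X_fro = ns_steps(G, fro)
steps_spec, X_spec = ns_steps(G, spec)
print("steps to converge:", steps_fro, "(Frobenius) vs", steps_spec, "(spectral)")
```

Both normalizations put every singular value in [0, 1], but dividing by the spectral norm pushes the largest one to 1 and lifts the rest, so this iteration needs no more steps than with the Frobenius norm. The catch is the cost of the spectral norm itself, which is why cheaper bounds computed from X^T X are attractive.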
Omead Pooladzandi @HessianFree
your spotify cache is bigger than our largest AI model. Bonsai: 1-bit weights. 1.7B to 8B params. 14x compression vs bf16. 8x faster on edge. 256 MB to 1.2GB. Based on Qwen 3. we just came out of stealth. intelligence belongs at the edge and we're going to put it there. Apache 2.0. we compressed intelligence. more coming. @PrismML
PrismML @PrismML

Today, we are emerging from stealth and launching PrismML, an AI lab with Caltech origins that is centered on building the most concentrated form of intelligence. At PrismML, we believe that the next major leaps in AI will be driven by order-of-magnitude improvements in intelligence density, not just sheer parameter count. Our first proof point is the 1-bit Bonsai 8B, a 1-bit weight model that fits into 1.15 GBs of memory and delivers over 10x the intelligence density of its full-precision counterparts. It is 14x smaller, 8x faster, and 5x more energy efficient on edge hardware while remaining competitive with other models in its parameter-class. We are open-sourcing the model under Apache 2.0 license, along with Bonsai 4B and 1.7B models. When advanced models become small, fast, and efficient enough to run locally, the design space for AI changes immediately. We believe in a future of on-device agents, real-time robotics, offline intelligence and entirely new products that were previously impossible. We are excited to share our vision with you and keep working in the future to push the frontier of intelligence to the edge.

88 replies · 162 reposts · 2K likes · 204.9K views
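The size figures in the announcement above can be sanity-checked with back-of-the-envelope arithmetic. This sketch treats "8B" and "1.7B" as exact parameter counts and uses decimal gigabytes. Both are assumptions, and the real models likely differ slightly. The helper `model_size_gb` is hypothetical.

```python
def model_size_gb(n_params, bits_per_weight):
    """Storage in decimal GB for n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

COMPRESSION = 14  # "14x compression vs bf16", as quoted in the tweet

bf16_8b = model_size_gb(8e9, 16)       # bf16 baseline for a nominal 8B model
bonsai_8b = bf16_8b / COMPRESSION      # implied compressed size
bf16_1p7b = model_size_gb(1.7e9, 16)   # bf16 baseline for a nominal 1.7B model
bonsai_1p7b = bf16_1p7b / COMPRESSION
effective_bits = 16 / COMPRESSION      # implied average bits per weight

print(f"8B:   {bf16_8b:.1f} GB in bf16 -> {bonsai_8b:.2f} GB at 14x")
print(f"1.7B: {bf16_1p7b:.1f} GB in bf16 -> {bonsai_1p7b * 1000:.0f} MB at 14x")
print(f"effective bits per weight: {effective_bits:.2f}")
```

Under these assumptions the 8B model comes out at about 1.14 GB and the 1.7B model at about 243 MB, in line with the quoted 1.15 GB and 256 MB. The implied average of roughly 1.14 bits per weight would suggest some overhead beyond the 1-bit weights themselves, though the source does not say where it comes from.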
Thomas Massena @thomasmassena
@cloneofsimo That's so much more competitive than I would've imagined! Are you planning any kind of write-up? I'd be really interested.
0 replies · 0 reposts · 0 likes · 1.1K views
Simo Ryu @cloneofsimo
Parameter Golf, non-neural-network division: I've got (my gpt-5.4) to get val bpb of 1.6633 with markov chains + bag of tricks, fully autonomously. Codex implemented methods from 1990~2005.
11 replies · 10 reposts · 342 likes · 72K views
Thomas Massena retweeted
David @dnhkng
1/n I topped the HuggingFace Open LLM Leaderboard without changing a single weight. No training. No merging. No gradient descent. I duplicated 7 middle layers of Qwen2-72B and stitched it back together. This is the story of LLM Neuroanatomy 🧵
28 replies · 119 reposts · 1.1K likes · 129.3K views
Thomas Massena retweeted
Thibaut Boissin @ThibautBoissin
Totally agree with the problem of muon for CNNs: kernel reshaping + orthonormalization ≠ orthogonalizing the operator. Interestingly, a whole community studied this for convolutional weights (instead of gradients): arxiv.org/abs/1911.00937
Ji-Ha @Ji_Ha_Kim

How to ("properly") orthogonalize convolutional layers for Muon optimizer Trick: Assume circular kernels to allow diagonalization Blog post + proof of concept CIFAR10 speedrun fork (unoptimized and slow for now but better convergence per step)

3 replies · 6 reposts · 48 likes · 5K views
Thomas Massena retweeted
Ji-Ha @Ji_Ha_Kim
How to ("properly") orthogonalize convolutional layers for Muon optimizer Trick: Assume circular kernels to allow diagonalization Blog post + proof of concept CIFAR10 speedrun fork (unoptimized and slow for now but better convergence per step)
3 replies · 15 reposts · 224 likes · 39.3K views
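The point made in the two tweets above, that orthonormalizing the reshaped kernel does not orthogonalize the convolution operator, can be checked numerically under the circular-kernel assumption. For a circular 2D convolution, a 2D FFT of the zero-padded kernel gives one small c_out × c_in matrix per spatial frequency, and the operator's singular values are the singular values of those matrices. The sketch below is an independent illustration, not the code from the linked blog post. The helper `circular_conv_singular_values` and the toy sizes are assumptions.

```python
import numpy as np

def circular_conv_singular_values(kernel, n):
    """Singular values of the circular 2D convolution on n x n inputs.

    kernel has shape (c_out, c_in, k, k). The FFT over the zero-padded
    spatial axes yields one c_out x c_in matrix per frequency.
    """
    k_hat = np.fft.fft2(kernel, s=(n, n), axes=(2, 3))
    per_freq = np.transpose(k_hat, (2, 3, 0, 1))  # (n, n, c_out, c_in)
    return np.linalg.svd(per_freq, compute_uv=False).ravel()

rng = np.random.default_rng(0)
c, k, n = 4, 3, 8

# "Kernel reshaping + orthonormalization": give the flattened
# (c_out, c_in * k * k) matrix orthonormal rows, then reshape back.
flat = rng.standard_normal((c, c * k * k))
q, _ = np.linalg.qr(flat.T)           # q has orthonormal columns
kernel = q.T.reshape(c, c, k, k)
sv = circular_conv_singular_values(kernel, n)
print("reshaped-orthonormal 3x3 kernel:", sv.min(), sv.max())

# Control: a 1x1 kernel built from an orthogonal matrix is an orthogonal operator.
q1, _ = np.linalg.qr(rng.standard_normal((c, c)))
sv_1x1 = circular_conv_singular_values(q1.reshape(c, c, 1, 1), n)
print("orthogonal 1x1 kernel:", sv_1x1.min(), sv_1x1.max())
```

The 3×3 kernel has orthonormal rows once flattened, yet the operator's singular values spread on both sides of 1. Only the 1×1 control has all singular values equal to 1, which is what an orthogonal operator requires.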
Thomas Massena retweeted
Jason Ramapuram @jramapuram
Autoregressive models dominate, but what if we treat multimodal generation as discrete, order-agnostic iterative refinement? Excited to share our systematic study on the design space of Tri-Modal Masked Diffusion Models (MDMs). We pre-trained the first Tri-Modal MDM from scratch on (text,), (image, text), and (audio, text). The same model can do ASR, TTS, T2I, captioning and native text generation. What I'm the most proud of in this work is the scientific rigor. Over 3,500 training runs. Principled hyperparameter transfer. Honest results. Carefully controlled ablations across multiple axes of entanglement. A thread on our empirical findings (arXiv: arxiv.org/abs/2602.21472)
6 replies · 42 reposts · 240 likes · 40.1K views
Thomas Massena retweeted
Samip @industriaalist
1/ Introducing NanoGPT Slowrun 🐢: an open repo for state-of-the-art data-efficient learning algorithms. It's built for the crazy ideas that speedruns filter out -- expensive optimizers, heavy regularization, SGD replacements like evolutionary search.
21 replies · 106 reposts · 975 likes · 169.8K views
hans @wavefunk_
@thomasmassena Very nice! Just fyi, some of your plots are pretty hard to read on mobile.
2 replies · 0 reposts · 1 like · 106 views