Paul Janson
@janson002
160 posts

Ph.D. student @Mila_Quebec and Concordia University. Working on deep learning optimization, continual learning, and computer vision. Previously @Kaust.

Montréal, Québec · Joined September 2013
2.7K Following · 396 Followers
Pinned Tweet
Paul Janson (@janson002)
PyLO is accepted to MLSys 2026! 🎉🚀 A PyTorch-native library bringing SOTA learned optimizers to the codebases most of us actually use — with fast CUDA kernels and real speedups on large-scale training. Drop-in ready, no more JAX-only barriers. Library: github.com/Belilovsky-Lab…
Paul Janson (@janson002)

Have you ever trained a neural network using a learned optimizer instead of AdamW? Doubt it: you're probably coding in PyTorch! Excited to introduce PyLO: Towards Accessible Learned Optimizers in PyTorch! Accepted at the @icmlconf ICML 2025 CODEML workshop 🧵1/N
1 reply · 8 reposts · 10 likes · 1.3K views
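For readers who want to try it: a minimal usage sketch. The import path and optimizer class below are illustrative assumptions rather than PyLO's confirmed API (check the repo for the real entry points); the point is the drop-in, torch.optim-style interface the announcement describes.

```python
import torch

# Assumed import path -- see github.com/Belilovsky-Lab/pylo for the real one.
from pylo.optim import VeLO

model = torch.nn.Linear(512, 512)
# A learned optimizer used like a drop-in replacement for torch.optim.AdamW.
opt = VeLO(model.parameters())

for _ in range(100):
    loss = model(torch.randn(32, 512)).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()  # some learned optimizers also want the loss value; check the docs
```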
Paul Janson reposted
Volkan Cevher (@CevherLIONS)
This is like a good stress test for optimizers. Kaon is basically Muon/lmo + spectral noise: it preserves the singular vectors of the gradient and randomizes only the positive singular weights. For exchangeable noise, the conditional expectation is the spectral-norm-ball lmo direction up to scale, though individual draws are not necessarily lmos. Freon’s map for c > 1/2 is decreasing on the singular values, so the operator is non-monotone; exact fixed-step Freon can fail even on a simple convex quadratic minimization near rank deficiency. Freon’s map for c <= 1/2 (i.e., the monotone case) can also be analyzed using phi-convexity. Shameless plug: arxiv.org/abs/2605.11850
Francesco Capuano (@_fracapuano)

Finally got you, damned lich king @tensorqt

0 replies · 7 reposts · 51 likes · 8K views
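For context on the "spectral-norm-ball lmo direction" above: for a gradient matrix G with SVD G = U S Vᵀ, the linear minimization oracle over the unit spectral-norm ball is -U Vᵀ. A minimal PyTorch sketch (the function name is mine):

```python
import torch

def spectral_ball_lmo(grad: torch.Tensor) -> torch.Tensor:
    """argmin_{||X||_op <= 1} <grad, X> = -U @ Vh, where grad = U diag(S) Vh.
    Muon approximates this direction with Newton-Schulz iterations instead
    of an exact SVD."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -U @ Vh
```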
Paul Janson reposted
Tony S.F. (@tonysilveti)
New paper! We analyze proximal preconditioned gradient methods that extend Muon/Scion to handle nonconvex constraints (Stiefel manifold, spectral sphere, norm balls, ...) with convergence guarantees under heavy-tailed noise + variance reduction w/ STORM! arxiv.org/abs/2605.11850
1 reply · 30 reposts · 167 likes · 11.2K views
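For context, STORM is a recursive variance-reduction estimator that evaluates the gradient at both the current and previous iterate on the same minibatch, so the correlated noise cancels. A minimal sketch of the update (variable names are mine):

```python
import torch

def storm_update(d_prev, grad_curr, grad_prev_same_batch, a):
    """STORM momentum: d_t = g(x_t) + (1 - a) * (d_{t-1} - g(x_{t-1})),
    where both gradients are computed on the *same* minibatch, which is
    what makes the estimator's variance shrink over time."""
    return grad_curr + (1.0 - a) * (d_prev - grad_prev_same_batch)
```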
Paul Janson reposted
Keller Jordan (@kellerjordan0)
Modded-NanoGPT optimization result #13: @benjamintherien has achieved a new record of 3210 steps (-15), by wrapping NorMuonH in a MuLoCo-style outer Nesterov SGD. Compared to the target loss, this result has a p-value of p=1.3e-4. Compared to result #11, it has p=0.099.
3 replies · 11 reposts · 84 likes · 7.8K views
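A "MuLoCo-style outer Nesterov SGD" wraps an inner optimizer: every K inner steps, the change in the weights is treated as a pseudo-gradient and fed to an outer Nesterov SGD step. A hedged sketch of that outer step (hyperparameters are illustrative defaults, not the record run's settings):

```python
import torch

@torch.no_grad()
def outer_nesterov_step(w_anchor, w_inner, momentum_buf, lr=0.7, mu=0.9):
    """Outer step on the pseudo-gradient (anchor weights minus the weights
    after K inner-optimizer steps), using PyTorch's nesterov=True momentum
    formulation. Inner training then restarts from the updated anchor."""
    pseudo_grad = w_anchor - w_inner
    momentum_buf.mul_(mu).add_(pseudo_grad)                    # buf = mu*buf + g
    w_anchor.add_(pseudo_grad + mu * momentum_buf, alpha=-lr)  # w -= lr*(g + mu*buf)
    return w_anchor, momentum_buf
```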
Paul Janson (@janson002)
@WorldEdServices Applicants paying for credential evaluations deserve a two-way communication channel. Contact forms with no-reply responses aren't workable when documents go missing. DMing my reference number and asking for a named case owner. Hoping for real help.
3 replies · 2 reposts · 1 like · 120 views
World Education Services (@WorldEdServices)
Doors opened. Barriers lowered. Talent recognized. In 2025, WES served 430K+ applicants, expanded support for immigrant‑ & refugee‑led orgs, and advanced licensure pathways. Read our 2025 Annual Report + new 10‑year strategy: impact-wes.org/2025AR #WES
17 replies · 0 reposts · 2 likes · 648 views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
I’ll be at #ICLR2026 🇧🇷 this week to present “μLO: Compute-Efficient Meta-Generalization of Learned Optimizers” and give a talk about SparseLoCo at the Protocol Learning Workshop! If you work on these topics or just want to chat — DM me. 🧵1/3
1 reply · 4 reposts · 38 likes · 2.1K views
Paul Janson reposted
Abhinav Moudgil (@amoudgl)
Heading to Rio 🇧🇷 to present our Celo line of work at #ICLR2026! Get in touch if you are curious about new avenues in neural network training or how we scaled learned optimizers from CIFAR-10 to GPT-3 🚀 Details ⬇️
1 reply · 5 reposts · 18 likes · 1.4K views
Paul Janson reposted
Michael Rizvi-Martel (@frisbeemortel)
Latent CoT is an alternative LLM reasoning scheme hypothesized to enable “superposition” allowing models to hold uncertainty over multiple concepts during reasoning 💭 We revisit superposition in 3 latent CoT approaches and find that it is largely an illusion 🔮! More in 🧵
9 replies · 33 reposts · 167 likes · 14K views
Paul Janson reposted
Abhinav Moudgil (@amoudgl)
Introducing Celo2: Towards Learned Optimization Free Lunch We show that learned optimizers can generalize to practical tasks like GPT-3 1.3B pretraining and several out-of-distribution vision/RL tasks from limited meta-training (~4.5 GPU hours)! 🧵
3 replies · 22 reposts · 103 likes · 9.1K views
Paul Janson reposted
templar (@tplr_ai)
We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n
216 replies · 929 reposts · 6.2K likes · 1.9M views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
🚨 New Tech Report: Covenant-72B 🚨 TL;DR we use SparseLoCo to pre-train a 72B model on 1.1T tokens over the internet! This is the largest decentralized training run to date. x.com/tplr_ai/status…
templar (@tplr_ai)

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

2 replies · 7 reposts · 70 likes · 5.2K views
Paul Janson reposted
VAIBHAV SINGH (@VAIBHAV22155287)
Masked Diffusion LMs (MDLMs) are the most exciting paradigm shift away from AR generation because they can decode in parallel, infill, and self-correct. But they are bottlenecked by the transformer's quadratic attention, which makes throughput fall apart for long contexts. We offer a simple solution. Introducing DiffuMamba: the first diffusion LM with a bidirectional Mamba backbone. Better quality. Up to 8.2x faster. 🧵1/N
4 replies · 27 reposts · 160 likes · 14.4K views
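The thread doesn't spell out DiffuMamba's block design, but a common way to make a causal SSM bidirectional is to run a second copy on the time-reversed sequence and merge the outputs. A sketch with stand-in modules (an assumption for illustration, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class BidirectionalSSM(nn.Module):
    """Wrap two causal sequence blocks (e.g. Mamba layers) into one
    bidirectional block: the second copy sees the time-reversed sequence
    and its output is flipped back before merging."""
    def __init__(self, block_fwd: nn.Module, block_bwd: nn.Module):
        super().__init__()
        self.block_fwd = block_fwd
        self.block_bwd = block_bwd

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        fwd = self.block_fwd(x)
        bwd = self.block_bwd(x.flip(dims=[1])).flip(dims=[1])
        return fwd + bwd

# With the real package this might look like (signature may differ):
# from mamba_ssm import Mamba
# layer = BidirectionalSSM(Mamba(d_model=256), Mamba(d_model=256))
```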
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
This week, we released a paper from Meta @AIatmeta, “MuLoCo: Muon is a practical inner optimizer for DiLoCo”, showing that K=1 MuLoCo has a Pareto-optimal performance–training-time tradeoff. Let’s drill deeper into single-worker MuLoCo’s efficiency 🧵1/5 x.com/benjamintherie…
Benjamin Thérien @ ICLR 2026 (@benjamintherien)

Are frontier LLMs trained across datacenters? One thing is certain: if the pre-training optimizer’s critical batch size is too small, they are NOT! Excited to announce MuLoCo, a pre-training optimizer that can efficiently pre-train across datacenters while having large enough batch sizes to warrant doing so. 🧵1/N

2 replies · 11 reposts · 49 likes · 5.9K views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
Are frontier LLMs trained across datacenters? One thing is certain: if the pre-training optimizer’s critical batch size is too small, they are NOT! Excited to announce MuLoCo, a pre-training optimizer that can efficiently pre-train across datacenters while having large enough batch sizes to warrant doing so. 🧵1/N
3 replies · 33 reposts · 94 likes · 17.8K views
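Schematically, a DiLoCo-style method such as MuLoCo runs K inner-optimizer steps locally on each worker (Muon, in MuLoCo's case), then all-reduces the weight deltas as a pseudo-gradient for an outer step like the Nesterov sketch above. A hedged skeleton (the function and its names are mine, not the paper's code):

```python
import torch
import torch.distributed as dist

def muloco_round(model, inner_opt, data_iter, loss_fn, K):
    """One communication round: K local steps with the inner optimizer,
    then a single all-reduce of the weight deltas. The averaged delta is
    the pseudo-gradient consumed by the outer Nesterov step."""
    anchor = [p.detach().clone() for p in model.parameters()]
    for _ in range(K):  # purely local; no cross-worker communication
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    pseudo_grads = []
    for p, a in zip(model.parameters(), anchor):
        delta = a - p.detach()
        dist.all_reduce(delta, op=dist.ReduceOp.AVG)  # the only sync point
        pseudo_grads.append(delta)
    return anchor, pseudo_grads
```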
Meet Patel (@meetpatelfp)
@janson002 @ebelilov How does this correlate with compute budget? Do you need to feed a higher number of tokens to reach the same loss as the dense model? Do you need to train longer to reach the same loss as the dense counterpart?
1 reply · 0 reposts · 0 likes · 25 views
Paul Janson (@janson002)
We’re excited to share our new work: “Stabilizing Native Low-Rank LLM Pretraining” (arxiv.org/abs/2602.12429) 🚀 Can we train LLMs from scratch using only low-rank factorized weights and still match dense performance? Short answer: yes (with care).
5 replies · 42 reposts · 207 likes · 29.7K views
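Native low-rank pretraining replaces each dense weight with a product of two thin factors trained from scratch. A minimal sketch of such a layer (the paper's stabilization details, e.g. how the factors are initialized and scaled, are not reproduced here):

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """y = x @ (A @ B)^T: the full d_out x d_in weight is never
    materialized; only the rank-r factors A (d_out x r) and
    B (r x d_in) are stored and trained."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, rank) / rank**0.5)
        self.B = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.B.T) @ self.A.T
```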
Paul Janson (@janson002)
@oswaldjoh @ebelilov Thank you. It seems very interesting to naturally induce low rank in the attention weights; I'll give that a try.
0 replies · 0 reposts · 0 likes · 24 views
Johannes Oswald (@oswaldjoh)
@janson002 @ebelilov hey! Looks very cool! You could actually try it on the normal weights of softmax attention (without q/k norm), since there you are also (implicitly) training low-rank "weight" matrices, i.e. W_O V softmax(K^T Q) = W_O W_V X softmax((W_K X)^T W_Q X) :) see e.g. arxiv.org/abs/2410.23819
1 reply · 0 reposts · 1 like · 35 views
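The point of the identity above: the value and output projections only ever act through the product W_O W_V, whose rank is at most the head dimension, so attention already trains implicit low-rank matrices. A quick numerical check (sizes are illustrative):

```python
import torch

d_model, d_head = 512, 64                # illustrative sizes
W_V = torch.randn(d_head, d_model)       # value projection of one head
W_O = torch.randn(d_model, d_head)       # output projection of one head

W_OV = W_O @ W_V                         # the only combination attention uses
print(torch.linalg.matrix_rank(W_OV).item())  # <= 64: implicitly low-rank
```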