Paul Janson
@janson002
160 posts

Ph.D. student @Mila_Quebec and Concordia University. Working on deep learning optimization, continual learning, and computer vision. Previously @Kaust.

Montréal, Québec · Joined September 2013
2.7K Following · 396 Followers
Pinned Tweet
Paul Janson (@janson002)
PyLO is accepted to MLSys 2026! 🎉🚀 A PyTorch-native library bringing SOTA learned optimizers to the codebases most of us actually use — with fast CUDA kernels and real speedups on large-scale training. Drop-in ready, no more JAX-only barriers. Library: github.com/Belilovsky-Lab…
Paul Janson (@janson002)

Have you ever trained a neural network using a learned optimizer instead of AdamW? Doubt it: you're probably coding in PyTorch! Excited to introduce PyLO: Towards Accessible Learned Optimizers in PyTorch! Accepted at the @icmlconf ICML 2025 CODEML workshop 🧵1/N
1 reply · 8 reposts · 10 likes · 1.3K views
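For readers who want to try it: a minimal usage sketch. The import path and optimizer class below are illustrative assumptions rather than PyLO's confirmed API (check the repo for the real entry points); the point is the drop-in, torch.optim-style interface the announcement describes.

```python
import torch

# Assumed import path -- see github.com/Belilovsky-Lab/pylo for the real one.
from pylo.optim import VeLO

model = torch.nn.Linear(512, 512)
# A learned optimizer used like a drop-in replacement for torch.optim.AdamW.
opt = VeLO(model.parameters())

for _ in range(100):
    loss = model(torch.randn(32, 512)).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()  # some learned optimizers also want the loss value; check the docs
```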
Paul Janson reposted
Volkan Cevher (@CevherLIONS)
This is like a good stress test for optimizers. Kaon is basically Muon/lmo + spectral noise: it preserves the singular vectors of the gradient and randomizes only the positive singular weights. For exchangeable noise, the conditional expectation is the spectral-norm-ball lmo direction up to scale, though individual draws are not necessarily lmos. Freon’s map for c > 1/2 is decreasing on the singular values, so the operator is non-monotone; exact fixed-step Freon can fail even on a simple convex quadratic minimization near rank deficiency. Freon’s map for c <= 1/2 (i.e., the monotone case) can also be analyzed using phi-convexity. Shameless plug: arxiv.org/abs/2605.11850
Francesco Capuano (@_fracapuano)

Finally got you, damned lich king @tensorqt

0 replies · 7 reposts · 51 likes · 8K views
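For context on the "spectral-norm-ball lmo direction" above: for a gradient matrix G with SVD G = U S Vᵀ, the linear minimization oracle over the unit spectral-norm ball is -U Vᵀ. A minimal PyTorch sketch (the function name is mine):

```python
import torch

def spectral_ball_lmo(grad: torch.Tensor) -> torch.Tensor:
    """argmin_{||X||_op <= 1} <grad, X> = -U @ Vh, where grad = U diag(S) Vh.
    Muon approximates this direction with Newton-Schulz iterations instead
    of an exact SVD."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -U @ Vh
```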
Paul Janson reposted
Tony S.F. (@tonysilveti)
New paper! We analyze proximal preconditioned gradient methods that extend Muon/Scion to handle nonconvex constraints (Stiefel manifold, spectral sphere, norm balls, ...) with convergence guarantees under heavy-tailed noise + variance reduction w/ STORM! arxiv.org/abs/2605.11850
1 reply · 30 reposts · 167 likes · 11.2K views
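For context, STORM is a recursive variance-reduction estimator that evaluates the gradient at both the current and previous iterate on the same minibatch, so the correlated noise cancels. A minimal sketch of the update (variable names are mine):

```python
import torch

def storm_update(d_prev, grad_curr, grad_prev_same_batch, a):
    """STORM momentum: d_t = g(x_t) + (1 - a) * (d_{t-1} - g(x_{t-1})),
    where both gradients are computed on the *same* minibatch, which is
    what makes the estimator's variance shrink over time."""
    return grad_curr + (1.0 - a) * (d_prev - grad_prev_same_batch)
```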
Paul Janson reposted
Keller Jordan (@kellerjordan0)
Modded-NanoGPT optimization result #13: @benjamintherien has achieved a new record of 3210 steps (-15), by wrapping NorMuonH in a MuLoCo-style outer Nesterov SGD. Compared to the target loss, this result has a p-value of p=1.3e-4. Compared to result #11, it has p=0.099.
3 replies · 11 reposts · 84 likes · 7.8K views
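A "MuLoCo-style outer Nesterov SGD" wraps an inner optimizer: every K inner steps, the change in the weights is treated as a pseudo-gradient and fed to an outer Nesterov SGD step. A hedged sketch of that outer step (hyperparameters are illustrative defaults, not the record run's settings):

```python
import torch

@torch.no_grad()
def outer_nesterov_step(w_anchor, w_inner, momentum_buf, lr=0.7, mu=0.9):
    """Outer step on the pseudo-gradient (anchor weights minus the weights
    after K inner-optimizer steps), using PyTorch's nesterov=True momentum
    formulation. Inner training then restarts from the updated anchor."""
    pseudo_grad = w_anchor - w_inner
    momentum_buf.mul_(mu).add_(pseudo_grad)                    # buf = mu*buf + g
    w_anchor.add_(pseudo_grad + mu * momentum_buf, alpha=-lr)  # w -= lr*(g + mu*buf)
    return w_anchor, momentum_buf
```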
Paul Janson (@janson002)
@WorldEdServices Applicants paying for credential evaluations deserve a two-way communication channel. Contact forms with no-reply responses aren't workable when documents go missing. DMing my reference number and asking for a named case owner. Hoping for real help.
3 replies · 2 reposts · 1 like · 120 views
World Education Services (@WorldEdServices)
Doors opened. Barriers lowered. Talent recognized. In 2025, WES served 430K+ applicants, expanded support for immigrant‑ & refugee‑led orgs, and advanced licensure pathways. Read our 2025 Annual Report + new 10‑year strategy: impact-wes.org/2025AR #WES
17 replies · 0 reposts · 2 likes · 648 views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
I’ll be at #ICLR2026 🇧🇷 this week to present “μLO: Compute-Efficient Meta-Generalization of Learned Optimizers” and give a talk about SparseLoCo at the Protocol Learning Workshop! If you work on these topics or just want to chat — DM me. 🧵1/3
1 reply · 4 reposts · 38 likes · 2.1K views
Paul Janson reposted
Abhinav Moudgil (@amoudgl)
Heading to Rio 🇧🇷 to present our Celo line of work at #ICLR2026! Get in touch if you are curious about new avenues in neural network training or how we scaled learned optimizers from CIFAR-10 to GPT-3 🚀 Details ⬇️
1 reply · 5 reposts · 18 likes · 1.4K views
Paul Janson reposted
Michael Rizvi-Martel (@frisbeemortel)
Latent CoT is an alternative LLM reasoning scheme hypothesized to enable “superposition” allowing models to hold uncertainty over multiple concepts during reasoning 💭 We revisit superposition in 3 latent CoT approaches and find that it is largely an illusion 🔮! More in 🧵
9 replies · 33 reposts · 167 likes · 14K views
Paul Janson reposted
Abhinav Moudgil (@amoudgl)
Introducing Celo2: Towards Learned Optimization Free Lunch We show that learned optimizers can generalize to practical tasks like GPT-3 1.3B pretraining and several out-of-distribution vision/RL tasks from limited meta-training (~4.5 GPU hours)! 🧵
3 replies · 22 reposts · 103 likes · 9.1K views
Paul Janson reposted
templar (@tplr_ai)
We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n
216 replies · 929 reposts · 6.2K likes · 1.9M views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
🚨 New Tech Report: Covenant-72B 🚨 TL;DR we use SparseLoCo to pre-train a 72B model on 1.1T tokens over the internet! This is the largest decentralized training run to date. x.com/tplr_ai/status…
templar (@tplr_ai)

We just completed the largest decentralised LLM pre-training run in history: Covenant-72B. Permissionless, on Bittensor subnet 3. 72B parameters. ~1.1T tokens. Commodity internet. No centralized cluster. No whitelist. Anyone with GPUs could join or leave freely. 1/n

2 replies · 7 reposts · 70 likes · 5.2K views
Paul Janson reposted
VAIBHAV SINGH (@VAIBHAV22155287)
Masked Diffusion LMs (MDLMs) are the most exciting paradigm shift away from AR generation because they can decode in parallel, infill, and self-correct. But they are bottlenecked by the transformer's quadratic attention, which makes throughput fall apart for long contexts. We offer a simple solution. Introducing DiffuMamba: the first diffusion LM with a bidirectional Mamba backbone. Better quality. Up to 8.2x faster. 🧵1/N
4 replies · 27 reposts · 160 likes · 14.4K views
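The thread doesn't spell out DiffuMamba's block design, but a common way to make a causal SSM bidirectional is to run a second copy on the time-reversed sequence and merge the outputs. A sketch with stand-in modules (an assumption for illustration, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class BidirectionalSSM(nn.Module):
    """Wrap two causal sequence blocks (e.g. Mamba layers) into one
    bidirectional block: the second copy sees the time-reversed sequence
    and its output is flipped back before merging."""
    def __init__(self, block_fwd: nn.Module, block_bwd: nn.Module):
        super().__init__()
        self.block_fwd = block_fwd
        self.block_bwd = block_bwd

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        fwd = self.block_fwd(x)
        bwd = self.block_bwd(x.flip(dims=[1])).flip(dims=[1])
        return fwd + bwd

# With the real package this might look like (signature may differ):
# from mamba_ssm import Mamba
# layer = BidirectionalSSM(Mamba(d_model=256), Mamba(d_model=256))
```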
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
This week, we released a paper from Meta @AIatmeta, “MuLoCo: Muon is a practical inner optimizer for DiLoCo”, showing that K=1 MuLoCo has a Pareto-optimal performance–training-time tradeoff. Let’s drill deeper into single-worker MuLoCo’s efficiency 🧵1/5 x.com/benjamintherie…
Benjamin Thérien @ ICLR 2026 (@benjamintherien)

Are frontier LLMs trained across datacenters? One thing is certain: if the pre-training optimizer’s critical batch size is too small, they are NOT! Excited to announce MuLoCo, a pre-training optimizer that can efficiently pre-train across datacenters while having large enough batch sizes to warrant doing so. 🧵1/N

2 replies · 11 reposts · 49 likes · 5.9K views
Paul Janson reposted
Benjamin Thérien @ ICLR 2026 (@benjamintherien)
Are frontier LLMs trained across datacenters? One thing is certain: if the pre-training optimizer’s critical batch size is too small, they are NOT! Excited to announce MuLoCo, a pre-training optimizer that can efficiently pre-train across datacenters while having large enough batch sizes to warrant doing so. 🧵1/N
3 replies · 33 reposts · 94 likes · 17.8K views
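Schematically, a DiLoCo-style method such as MuLoCo runs K inner-optimizer steps locally on each worker (Muon, in MuLoCo's case), then all-reduces the weight deltas as a pseudo-gradient for an outer step like the Nesterov sketch above. A hedged skeleton (the function and its names are mine, not the paper's code):

```python
import torch
import torch.distributed as dist

def muloco_round(model, inner_opt, data_iter, loss_fn, K):
    """One communication round: K local steps with the inner optimizer,
    then a single all-reduce of the weight deltas. The averaged delta is
    the pseudo-gradient consumed by the outer Nesterov step."""
    anchor = [p.detach().clone() for p in model.parameters()]
    for _ in range(K):  # purely local; no cross-worker communication
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    pseudo_grads = []
    for p, a in zip(model.parameters(), anchor):
        delta = a - p.detach()
        dist.all_reduce(delta, op=dist.ReduceOp.AVG)  # the only sync point
        pseudo_grads.append(delta)
    return anchor, pseudo_grads
```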
Meet Patel (@meetpatelfp)
@janson002 @ebelilov How does this correlate with compute budget? Do you need to feed a higher number of tokens to reach the same loss as the dense model? Do you need to train longer to reach the same loss as the dense counterpart?
1 reply · 0 reposts · 0 likes · 25 views
Paul Janson (@janson002)
We’re excited to share our new work: “Stabilizing Native Low-Rank LLM Pretraining” (arxiv.org/abs/2602.12429) 🚀 Can we train LLMs from scratch using only low-rank factorized weights and still match dense performance? Short answer: yes (with care).
5 replies · 42 reposts · 207 likes · 29.7K views
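Native low-rank pretraining replaces each dense weight with a product of two thin factors trained from scratch. A minimal sketch of such a layer (the paper's stabilization details, e.g. how the factors are initialized and scaled, are not reproduced here):

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """y = x @ (A @ B)^T: the full d_out x d_in weight is never
    materialized; only the rank-r factors A (d_out x r) and
    B (r x d_in) are stored and trained."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, rank) / rank**0.5)
        self.B = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.B.T) @ self.A.T
```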
Paul Janson (@janson002)
@oswaldjoh @ebelilov Thank you. It seems very interesting to naturally induce low rank in the attention weights; I'll give that a try.
0 replies · 0 reposts · 0 likes · 24 views
Johannes Oswald (@oswaldjoh)
@janson002 @ebelilov hey! Looks very cool! You could actually try it on the normal weights of softmax attention (without q/k norm), since there you are also (implicitly) training low-rank "weight" matrices, i.e. W_O V softmax(K^T Q) = W_O W_V X softmax((W_K X)^T W_Q X) :) see e.g. arxiv.org/abs/2410.23819
1 reply · 0 reposts · 1 like · 35 views
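The point of the identity above: the value and output projections only ever act through the product W_O W_V, whose rank is at most the head dimension, so attention already trains implicit low-rank matrices. A quick numerical check (sizes are illustrative):

```python
import torch

d_model, d_head = 512, 64                # illustrative sizes
W_V = torch.randn(d_head, d_model)       # value projection of one head
W_O = torch.randn(d_model, d_head)       # output projection of one head

W_OV = W_O @ W_V                         # the only combination attention uses
print(torch.linalg.matrix_rank(W_OV).item())  # <= 64: implicitly low-rank
```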