Quanquan Gu

2.4K posts

@QuanquanGu

Professor @UCLA, Pretraining and Scaling at ByteDance Seed | Recent work: Seed2.0, SeedFold | Opinions are my own

Los Angeles, CA · Joined August 2017
2.4K Following · 20.6K Followers
Pinned Tweet
Quanquan Gu (@QuanquanGu)
Finally joined Xiaohongshu (RedNote) 👀 xhslink.com/m/AGPhXTjj3kE Will occasionally share thoughts on AI, scaling, and AI for science there too.
Replies: 7 · Retweets: 10 · Likes: 221 · Views: 88.7K
Quanquan Gu retweeted
Aaron Defazio (@aaron_defazio)
🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! arxiv.org/abs/2605.19095…
Replies: 7 · Retweets: 56 · Likes: 415 · Views: 81.9K
Quanquan Gu retweeted
Tim Lau (@timlautk)
1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…
Replies: 4 · Retweets: 30 · Likes: 136 · Views: 25.1K
Quanquan Gu retweeted
SHIYUAN ZHANG (@zsy25ucla)
Thrilled to announce our ICML 2026 paper: "Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence" 🎉 We give the first dimension-free KL bounds for discretized underdamped Langevin — depending on tr(H), not the ambient dimension d.
Replies: 2 · Retweets: 9 · Likes: 23 · Views: 2.6K
Quanquan Gu retweeted
Elad Hazan (@HazanPrinceton)
spectral filtering is a favorite technique, from learning in dynamics to neural architecture design, but extending beyond real eigenvalues is a pain. here is a reason from slepian theory, w. my brilliant postdoc Annie: arxiv.org/html/2601.2240…
Replies: 0 · Retweets: 13 · Likes: 108 · Views: 11.1K
Quanquan Gu retweeted
Peter Richtarik (@peter_richtarik)
Let me highlight one very surprising (was surprising to us!!) aspect of the Local LMO theory: Provided the radius follows a certain upper bound (is type-I or type-II admissible), the radius always equals the effective stepsize! While this is obvious in the unconstrained setting, it is very surprising in the constrained setting! Look at conclusion (ii) of Theorem 3.1.
Quoted tweet from Peter Richtarik (@peter_richtarik):

Imagine that projected gradient descent (PGD) was a new method, discovered today. How would that feel? This is a textbook algorithm... What further research, extensions, improvements and variants would this enable?

In fact, together with Kaja Gruntkowska and Hanmin Li, we have just discovered a sister method to projected gradient descent -- one of equal conceptual importance. Our method admits the same or very similar guarantees as PGD. However, instead of relying on projections onto the constraint, it relies on linear minimization!

You may say: Did you rediscover Frank-Wolfe? No. In contrast to Frank-Wolfe, which uses a global linear minimization oracle (global LMO), our method relies on a local linear minimization oracle (local LMO). For this reason, we simply call the method "Local LMO" (admittedly, conflating the oracle name with the method name).

Frank-Wolfe theory is much more limited than the theory of Local LMO. Here are some key differences:
1) Frank-Wolfe only works if the constraint is bounded, and its convergence theory depends on the diameter of the constraint set. Local LMO works even for unbounded constraints, and its theory does not depend on the diameter of the constraint set.
2) In fact, Local LMO reduces to gradient descent (GD) in the unconstrained case. If the constraint is affine, Local LMO reduces to (preconditioned) GD in the affine space.
3) While Frank-Wolfe does not converge linearly for smooth strongly convex functions, Local LMO does.
4) While Frank-Wolfe does not converge for non-smooth convex problems (its theory depends on a curvature assumption), Local LMO does.

arxiv.org/abs/2605.08850
Replies: 0 · Retweets: 6 · Likes: 30 · Views: 7K
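For readers unfamiliar with the two textbook baselines that the quoted thread contrasts, here is a minimal sketch of projected gradient descent and classical Frank-Wolfe with a global LMO. It does not implement Local LMO, whose update rule is not given in the thread. The toy objective f(x) = 0.5 * ||x - c||^2, the unit l2-ball constraint, the step sizes, and all function names are assumptions chosen for illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def grad(x, c):
    # Gradient of the assumed toy objective f(x) = 0.5 * ||x - c||^2.
    return [a - b for a, b in zip(x, c)]

def project_ball(x):
    # Euclidean projection onto the unit l2 ball.
    n = norm(x)
    return list(x) if n <= 1.0 else [a / n for a in x]

def global_lmo_ball(g):
    # argmin of <g, s> over the unit l2 ball: the boundary point opposite to g.
    n = norm(g)
    return [0.0] * len(g) if n == 0.0 else [-a / n for a in g]

def pgd(c, x0, step=1.0, iters=50):
    # Projected gradient descent: gradient step, then projection.
    x = list(x0)
    for _ in range(iters):
        g = grad(x, c)
        x = project_ball([a - step * b for a, b in zip(x, g)])
    return x

def frank_wolfe(c, x0, iters=200):
    # Classical Frank-Wolfe: move toward the global LMO output, no projection.
    x = list(x0)
    for k in range(iters):
        s = global_lmo_ball(grad(x, c))
        gamma = 2.0 / (k + 2)  # standard open-loop step size
        x = [(1 - gamma) * a + gamma * b for a, b in zip(x, s)]
    return x

c = [3.0, 4.0]        # target outside the ball
x_star = [0.6, 0.8]   # constrained minimizer c / ||c||
x_pgd = pgd(c, [1.0, 0.0])
x_fw = frank_wolfe(c, [1.0, 0.0])
```

Both routines stay inside the ball, but they get there differently: PGD needs a projection at every step, while Frank-Wolfe only needs a linear minimization over the whole set, which is why it requires the set to be bounded.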
Quanquan Gu retweeted
Rosinality (@rosinality)
arxiv.org/abs/2605.12715 How many repetitions could be allowed for a small dataset in pretraining mixtures? Naturally it would be a function of model scale and data size (and compute budget). But it could be larger than expected. arxiv.org/abs/2603.16177
Replies: 1 · Retweets: 28 · Likes: 249 · Views: 32.9K
Quanquan Gu retweeted
Rosinality (@rosinality)
Exploration over MoE. Increase the number of experts as much as you can, and pick adequate granularity. The rest of the choices are not that important.
Replies: 1 · Retweets: 11 · Likes: 88 · Views: 6.9K
Quanquan Gu retweeted
Kaixuan Ji (@Kaixuan_Ji_19)
Excited to share our ICML 2026 paper: Near-Optimal Regret for KL-Regularized Multi-Armed Bandits (arxiv.org/abs/2603.02155) 🚀 KL-regularization is at the heart of modern LLM post-training. ❓How does it change the statistical limits of online learning? In our paper, we give the first near-complete characterization of regret for KL-regularized multi-armed bandits:
• For low regularization: \sqrt{KT} regret, similar to standard MAB
• For high regularization: ηK·polylog(T) regret.
Replies: 3 · Retweets: 13 · Likes: 85 · Views: 9.9K
Quanquan Gu retweeted
Rosinality (@rosinality)
Power law relationship between RL compute and task complexity using synthetic logic problems. One of the interesting parts is how task complexity and RL compute affect the downstream performance.
Replies: 2 · Retweets: 14 · Likes: 67 · Views: 7K
Kai-Wei Chang (@kaiwei_chang)
I’ve recently been promoted to Full Professor at UCLA 🎉 It’s been a long journey, with many tears, laughs, and surprises along the way. When I was working on linear models 20 years ago, I couldn’t have imagined we’d be building trustworthy AI agents today. I feel incredibly fortunate and deeply grateful to my research group, mentors, collaborators, and students who have made this journey so meaningful. I still remember the moment of hooding each of my PhD students. Those are the happiest moments in my career. Many thanks as well to my family, colleagues, and friends for their support. Looking forward to the next chapter. For those interested, check out our recent work: web.cs.ucla.edu/~kwchang/ Photo: a decade after graduation
Replies: 66 · Retweets: 19 · Likes: 672 · Views: 39.9K
Quanquan Gu retweeted
Yiping Wang (@ypwang61)
We improve a 32-year-old lower bound in a challenging open problem, Ramsey numbers, simply by scaling autoresearch. ⭕ Proves R(3,17) >= 93. The previous bound of 92 was obtained in 1994. Google’s AlphaEvolve (2026) matched the previous result but did not beat it. All of this could be done with Claude Code / Codex + a CPU server. Graphs and evolution history are available at github.com/ypwang61/Scale… [1/n]
Replies: 11 · Retweets: 49 · Likes: 324 · Views: 52K
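A lower bound such as R(3,17) >= 93 is certified by a witness graph: 92 vertices, no triangle, and no independent set of size 17. The sketch below shows what checking such a witness means, using brute force on tiny graphs. The function names are hypothetical and this is not the search method from the thread; exhaustive enumeration like this is far too slow for 92 vertices, so it only illustrates the definition.

```python
from itertools import combinations

def is_ramsey_witness(n, edges, s, t):
    """True if the graph has no clique of size s and no independent set of size t.

    Such a graph on n vertices certifies R(s, t) >= n + 1.
    """
    adj = {frozenset(e) for e in edges}

    def all_adjacent(vs):
        return all(frozenset(p) in adj for p in combinations(vs, 2))

    def none_adjacent(vs):
        return all(frozenset(p) not in adj for p in combinations(vs, 2))

    no_clique = not any(all_adjacent(vs) for vs in combinations(range(n), s))
    no_indep = not any(none_adjacent(vs) for vs in combinations(range(n), t))
    return no_clique and no_indep

def cycle(n):
    # Edge list of the cycle graph on n vertices.
    return [(i, (i + 1) % n) for i in range(n)]

c5_ok = is_ramsey_witness(5, cycle(5), 3, 3)  # 5-cycle: certifies R(3,3) >= 6
c6_ok = is_ramsey_witness(6, cycle(6), 3, 3)  # 6-cycle: {0, 2, 4} is independent
```

The 5-cycle passes the check and the 6-cycle fails it, which matches the classical fact R(3,3) = 6.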
Quanquan Gu retweeted
Pierfrancesco Beneventano (@PierBeneventano)
Our new paper was accepted at ICML! 1) Momentum isn’t just “SGD but faster”. It affects sharpness (by orders of magnitude!) 2) The usual story says momentum lets you train in sharper regions. That’s true for large batches only! The opposite is true for minibatches!
Replies: 3 · Retweets: 14 · Likes: 112 · Views: 7.2K
Quanquan Gu retweeted
wh (@nrehiew_)
Excellent blog covering the recent post training meta of training specialized experts and then distilling into the final checkpoint.
Replies: 4 · Retweets: 51 · Likes: 392 · Views: 38.8K
Quanquan Gu retweeted
Sebastien Bubeck (@SebastienBubeck)
Talking at the Future of Math Symposium in 10 minutes (livestream link: youtube.com/live/tN4hsT5t0…). I decided to make the talk "personal" and explain the five moments that updated me on how fast AI will change mathematics. (ChatGPT's illustration of these moments is so good!!)
Replies: 17 · Retweets: 73 · Likes: 533 · Views: 51.8K