Quanquan Gu (@QuanquanGu) - Twitter Profile

Pinned Tweet

Quanquan Gu@QuanquanGu·17 May

Finally joined Xiaohongshu (RedNote) 👀 xhslink.com/m/AGPhXTjj3kE Will occasionally share thoughts on AI, scaling, and AI for science there too.

English

7

10

221

88.7K

Quanquan Gu retweeted

Aaron Defazio@aaron_defazio·5d

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! arxiv.org/abs/2605.19095…

English

7

56

415

81.9K

Quanquan Gu retweeted

Tim Lau@timlautk·6d

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

English

4

30

136

25.1K

Quanquan Gu retweeted

SHIYUAN ZHANG@zsy25ucla·18 May

Thrilled to announce our ICML 2026 paper: "Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence" 🎉 We give the first dimension-free KL bounds for discretized underdamped Langevin — depending on tr(H), not the ambient dimension d.

English

2

9

23

2.6K

Quanquan Gu@QuanquanGu·18 May

@Xianbao_QIAN @xiaohongshu Thanks! I think the account is back now :)

English

0

3

2.8K

Tiezhen WANG@Xianbao_QIAN·18 May

@QuanquanGu @xiaohongshu please help fix the account issue :)

English

8

0

5

3.2K

Quanquan Gu retweeted

Elad Hazan@HazanPrinceton·18 May

spectral filtering is a favorite technique, from learning in dynamics to neural architecture design, but extending beyond real eigenvalues is a pain. here is a reason from slepian theory, w. my brilliant postdoc Annie: arxiv.org/html/2601.2240…

English

0

13

108

11.1K

Quanquan Gu@QuanquanGu·17 May

@deveshlogs Thanks!

English

0

3

4.4K

Devesh | Reddit Marketing@deveshlogs·17 May

@QuanquanGu welcome aboard, will follow closely

English

1

0

3

5K

Quanquan Gu retweeted

Peter Richtarik@peter_richtarik·15 May

Let me highlight one very surprising (was surprising to us!!) aspect of the Local LMO theory: Provided the radius follows a certain upper bound (is type-I or type-II admissible), the radius always equals the effective stepsize! While this is obvious in the unconstrained setting, it is very surprising in the constrained setting! Look at conclusion (ii) of Theorem 3.1.

Peter Richtarik@peter_richtarik

Imagine that projected gradient descent (PGD) was a new method, discovered today. How would that feel? This is a textbook algorithm... What further research, extensions, improvements and variants would this enable?

In fact, together with Kaja Gruntkowska and Hanmin Li, we have just discovered a sister method to projected gradient descent -- one of equal conceptual importance. Our method admits the same or very similar guarantees as PGD. However, instead of relying on projections onto the constraint, it relies on linear minimization!

You may say: Did you rediscover Frank-Wolfe? No. In contrast to Frank-Wolfe, which uses a global linear minimization oracle (global LMO), our method relies on a local minimization oracle (local LMO). For this reason, we simply call the method "Local LMO" (admittedly, conflating the oracle name with the method name).

Frank-Wolfe theory is much more limited than the theory of Local LMO. Here are some key differences:

1) Frank-Wolfe only works if the constraint is bounded, and its convergence theory depends on the diameter of the constraint set. Local LMO works even for unbounded constraints, and its theory does not depend on the diameter of the constraint set.
2) In fact, Local LMO reduces to gradient descent (GD) in the unconstrained case. If the constraint is affine, Local LMO reduces to (preconditioned) GD in the affine space.
3) While Frank-Wolfe does not converge linearly for smooth strongly convex functions, Local LMO does.
4) While Frank-Wolfe does not converge for non-smooth convex problems (its theory depends on a curvature assumption), Local LMO does.

arxiv.org/abs/2605.08850

English

0

6

30

7K

Quanquan Gu retweeted

Rosinality@rosinality·14 May

arxiv.org/abs/2605.12715 How many repetitions could be allowed for a small dataset in pretraining mixtures? Naturally it would be a function of model scale and data size (and compute budget). But it could be larger than expected. arxiv.org/abs/2603.16177

English

1

28

249

32.9K

Quanquan Gu retweeted

Rosinality@rosinality·13 May

Exploration over MoE. Increase the number of experts as much as you can, and pick adequate granularity. The rest of the choices are not that important.

English

1

11

88

6.9K

Quanquan Gu retweeted

Peter Richtarik@peter_richtarik·12 May

Imagine that projected gradient descent (PGD) was a new method, discovered today. How would that feel? This is a textbook algorithm... What further research, extensions, improvements and variants would this enable?

In fact, together with Kaja Gruntkowska and Hanmin Li, we have just discovered a sister method to projected gradient descent -- one of equal conceptual importance. Our method admits the same or very similar guarantees as PGD. However, instead of relying on projections onto the constraint, it relies on linear minimization!

You may say: Did you rediscover Frank-Wolfe? No. In contrast to Frank-Wolfe, which uses a global linear minimization oracle (global LMO), our method relies on a local minimization oracle (local LMO). For this reason, we simply call the method "Local LMO" (admittedly, conflating the oracle name with the method name).

Frank-Wolfe theory is much more limited than the theory of Local LMO. Here are some key differences:

1) Frank-Wolfe only works if the constraint is bounded, and its convergence theory depends on the diameter of the constraint set. Local LMO works even for unbounded constraints, and its theory does not depend on the diameter of the constraint set.
2) In fact, Local LMO reduces to gradient descent (GD) in the unconstrained case. If the constraint is affine, Local LMO reduces to (preconditioned) GD in the affine space.
3) While Frank-Wolfe does not converge linearly for smooth strongly convex functions, Local LMO does.
4) While Frank-Wolfe does not converge for non-smooth convex problems (its theory depends on a curvature assumption), Local LMO does.

arxiv.org/abs/2605.08850

English

8

21

124

22.3K

Quanquan Gu retweeted

Kaixuan Ji@Kaixuan_Ji_19·10 May

Excited to share our ICML 2026 paper: Near-Optimal Regret for KL-Regularized Multi-Armed Bandits (arxiv.org/abs/2603.02155) 🚀 KL-regularization is at the heart of modern LLM post-training. ❓How does it change the statistical limits of online learning? In our paper, we give the first near-complete characterization of regret for KL-regularized multi-armed bandits:
• For low regularization: \sqrt{KT} regret, similar to standard MAB
• For high regularization: ηK*polylog(T) regret.

English

3

13

85

9.9K

Quanquan Gu retweeted

Rosinality@rosinality·8 May

Power law relationship between RL compute and task complexity using synthetic logic problems. One of the interesting parts is how task complexity and RL compute affect the downstream performance.

English

2

14

67

7K

Quanquan Gu@QuanquanGu·9 May

@kaiwei_chang Congrats!

English

0

2

527

Kai-Wei Chang@kaiwei_chang·9 May

I’ve recently been promoted to Full Professor at UCLA 🎉 It’s been a long journey, with many tears, laughs, and surprises along the way. When I was working on linear models 20 years ago, I couldn’t have imagined we’d be building trustworthy AI agents today. I feel incredibly fortunate and deeply grateful to my research group, mentors, collaborators, and students who have made this journey so meaningful. I still remember the moment of hooding each of my PhD students. Those are the happiest moments in my career. Many thanks as well to my family, colleagues, and friends for their support. Looking forward to the next chapter. For those interested, check out our recent work: web.cs.ucla.edu/~kwchang/ Photo: a decade after graduation

English

66

19

672

39.9K

Quanquan Gu retweeted

Yiping Wang@ypwang61·8 May

We improve a 32-year-old lower bound in a challenging open problem, Ramsey numbers, simply by scaling autoresearch. ⭕ Proves R(3,17) >= 93. The previous bound of 92 was obtained in 1994. Google’s AlphaEvolve (2026) matched the previous result but did not beat it. All of this could be done with Claude Code / Codex + a CPU server. Graphs and evolving history are available at github.com/ypwang61/Scale… [1/n]

English

11

49

324

52K

Quanquan Gu retweeted

Pierfrancesco Beneventano@PierBeneventano·3 May

Our new paper was accepted at ICML! 1) Momentum isn’t just “SGD but faster”. It affects sharpness (by orders of magnitude!) 2) The usual story says momentum lets you train in sharper regions. That’s true for large batches only! The opposite is true for minibatches!

English

3

14

112

7.2K

Quanquan Gu retweeted

wh@nrehiew_·2 May

Excellent blog covering the recent post-training meta of training specialized experts and then distilling them into the final checkpoint.

English

4

51

392

38.8K

Quanquan Gu retweeted

Boaz Barak@boazbaraktcs·2 May

Perhaps the first meaningful human-AI collaboration in math with substantial back and forth: GPT-5.4-pro came up with a new proof for one longstanding problem and then people identified that the proof contains a new idea that can be used for multiple other problems.

Jared Duker Lichtman@jdlichtman

Update on Erdős Problem 1196: In joint work, we refined and adapted the proof method from GPT-5.4 Pro to give proofs of several additional problems. This includes another 60-year-old conjecture by Erdős, Sárközy, and Szemerédi. A proof is valued not just by the problem it solves, but by what new avenues it opens up. This is perhaps one of the first examples of an AI-generated proof having downstream impacts, which we are still exploring. We are announcing the result today at the Future of Mathematics Symposium (see links below)

English

5

21

170

19.3K

Quanquan Gu retweeted

Sebastien Bubeck@SebastienBubeck·2 May

Talking at the Future of Math Symposium in 10 minutes (livestream link: youtube.com/live/tN4hsT5t0…). I decided to make the talk "personal" and explain the five moments that updated me on how fast AI will change mathematics. (ChatGPT's illustration of these moments is so good!!)

English

17

73

533

51.8K

Quanquan Gu retweeted

will brown@willccbb·1 May

x.com/i/article/2050…

ZXX

44

253

1.9K

475.4K
