Quanquan Gu

2.3K posts

@QuanquanGu

Professor @UCLA, Pretraining and Scaling at ByteDance Seed | Recent work: Seed2.0, SeedFold | Opinions are my own

Los Angeles, CA · Joined August 2017
2.4K Following · 19.3K Followers
Sam Redlich@SamRedlich·
@QuanquanGu it feels like i’ve just compressed 25 years of research into about six weeks.
Quanquan Gu@QuanquanGu·
Actually not just math, this is happening across almost every field. AI is collapsing the barrier to entry for research. What once required a PhD and years of training can now be started far more easily. We are moving toward a world where there is no "hard research", just unsolved problems. Big things are coming!
Andrew Curran@AndrewCurran_

Terence Tao responding to a question on what advice he would give someone considering a career in math in 2026: 'Yeah, so we live in a time of change. It is, as I said, we live in a particularly unpredictable era. And I think things that we've taken for granted for centuries may not hold anymore. So, yeah, the way we... do everything, not just mathematics, will change. In many ways, I would prefer the much more boring, quiet era where things are much the same as they were 10 years ago, 20 years ago. But I think one just has to embrace that there's going to be a lot of change and that, you know, the things that you study, some of them may become obsolete or revolutionized, but some things will be retained. There'll be a lot of opportunities for things that you wouldn't be able to do before. So, I mean, in math, you previously had to basically go through years and years of education to be a math PhD before you could contribute to the frontier of math research. But now it's quite possible at the high school level or whatever, that you could get involved in a math project and actually make a real contribution because of all these AI tools and lean and everything else. So there'll be a lot of non-traditional opportunities to learn. So you need a very adaptable mindset. There'll be one for pursuing things just for curiosity, for playing around. And I mean, you still need to get your credentials. I mean, I think for a while it would still be important to sort of still go through traditional education and learn math and science and so forth the old-fashioned way for a while. Yeah, but you should also be open to very, very different ways of doing science, some of which don't exist yet. Yeah, so it's a scary time, but also very exciting.'

Quanquan Gu retweeted
Percy Liang@percyliang·
In Marin, we are trying to get really good at scaling laws. We have trained models up to 1e22 FLOPs and have made a prediction of the loss at 1e23 FLOPs, which @WilliamBarrHeld is running. This prediction is preregistered on GitHub, so we'll see in a few days how accurate our prediction was. What we want is not just a single model but a training recipe that scales reliably.
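A rough sketch of what this kind of preregistered loss extrapolation looks like in code. The functional form (a saturating power law in compute) and the data points are assumptions chosen for illustration, not Marin's actual recipe or measurements:

```python
# Fit a saturating power law L(C) = a * (C / C0)^(-b) + c to small-scale
# runs, then extrapolate one order of magnitude further in compute.
# All numbers below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

C0 = 1e22  # normalize compute so the optimizer sees well-scaled values

def loss_curve(compute, a, b, c):
    # Power-law term that shrinks with compute, plus an irreducible loss c.
    return a * (compute / C0) ** (-b) + c

flops = np.array([1e19, 1e20, 1e21, 1e22])  # hypothetical training runs
loss = np.array([3.10, 2.78, 2.55, 2.39])   # hypothetical eval losses

params, _ = curve_fit(loss_curve, flops, loss, p0=[0.5, 0.2, 2.0])
a, b, c = params
print(f"fit: a={a:.2f}, b={b:.2f}, c={c:.2f}")
print(f"predicted loss at 1e23 FLOPs: {loss_curve(1e23, a, b, c):.2f}")
```

The preregistration step is then just committing the fitted curve and the 1e23 prediction to the repository before the large run finishes.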
Quanquan Gu retweeted
Sham Kakade@ShamKakade6·
1/ Au revoir, RLVR. New work: EBFT (Energy-Based Fine-Tuning), a post-training method that directly optimizes the long-horizon behavior of model generations, addressing SFT’s deployment-time error amplification without relying on sparse, task-specific rewards.
Quanquan Gu retweeted
Daisuke Okanohara / 岡野原 大輔
Pre-training of LLMs has once again become a major focus of attention. Although concerns about data scarcity are growing, pre-training itself continues to evolve. A key driver of this progress is the increasing use of synthetic data (see Tramel's presentation at Berkeley, linked below).

Although post-training can improve performance, the upper bound of a model's capabilities is generally believed to be determined during the pre-training phase. This is because pre-training is where fundamental representations and basic reasoning patterns are acquired, and these tend to change only marginally during post-training.

Looking at current scaling laws, the Chinchilla rule originally suggested that the optimal training data size is roughly 20 times the number of parameters. Recently, however, this ratio has increased to around 60 times the number of parameters. In addition, the emergence of Mixture-of-Experts (MoE) architectures has enabled increasing the total number of parameters without a proportional increase in inference compute, which further intensifies data requirements. Compared with dense models, MoE models see fewer data visits per parameter and are therefore more susceptible to overfitting. As a result, typical MoE implementations require roughly 40 tokens of training data per total parameter. For example, a 1T-parameter model may require on the order of 40T tokens.

Moreover, data diversity is critical. Simply repeating the same dataset multiple times does not meaningfully improve performance. However, when model-generated synthetic data is used directly as training data, the overall data quality can deteriorate. This phenomenon, often referred to as mode collapse, reduces the diversity present in the long tail of the data distribution and leads to more monotonous model outputs.

One effective mitigation strategy is to mix real data and synthetic data during training. In addition, instead of fully regenerating data, it is often preferable to generate paraphrases of existing data. By synthesizing alternative expressions that preserve the original data's factual content, it is possible to improve training efficiency while maintaining data diversity. Importantly, the models used for paraphrasing do not necessarily need to be powerful; relatively small or weak models can be sufficient. This approach follows the same fundamental principle as data augmentation in computer vision. By observing the same information expressed in many different forms, the model learns representations that are independent of specific surface expressions while simultaneously learning the mapping between expressions and internal semantic representations.

Recently, two types of synthetic data have emerged as particularly important. The first is program code. Code can be verified by execution, enabling automatic correctness checks and the generation of highly reliable training data. Beyond improving programming ability, code data appears to help models acquire broader representations and reasoning capabilities. The second is data containing explicit reasoning processes. If such reasoning traces are incorporated during pre-training rather than only during post-training, models may learn reasoning procedures, essentially certain classes of algorithms, during pre-training itself. In real-world data, explicit reasoning processes are often absent; texts rarely include detailed explanations of why particular outcomes occur.

To address this, one promising approach is to generate multiple reasoning trajectories with inexpensive, weaker models, then verify and filter them with stronger models. This pipeline can produce high-quality reasoning data suitable for inclusion in the pre-training corpus. In this sense, synthetic data acts as an amplifier of real-world data. Because human-generated data is fundamentally limited, synthetic data will likely play an increasingly central role in future large-scale model training.
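A quick back-of-the-envelope on the token budgets these ratios imply; the 20x, 60x, and 40x figures are the rough rules of thumb quoted above, not exact laws, and the model sizes are just examples:

```python
# Token budgets implied by tokens-per-parameter rules of thumb.
def tokens_needed(n_params, tokens_per_param):
    return n_params * tokens_per_param

DENSE_70B = 70e9  # example dense model
MOE_1T = 1e12     # example MoE model, counting total (not active) parameters

print(f"Dense 70B at Chinchilla ~20x: {tokens_needed(DENSE_70B, 20) / 1e12:.1f}T tokens")
print(f"Dense 70B at modern ~60x:     {tokens_needed(DENSE_70B, 60) / 1e12:.1f}T tokens")
print(f"MoE 1T total at ~40x:         {tokens_needed(MOE_1T, 40) / 1e12:.0f}T tokens")
```

The last line reproduces the 1T-parameter, 40T-token example above; budgets at that scale are hard to cover with human text alone, which is why synthetic and paraphrased data matter.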
Quanquan Gu retweeted
Massimo@Rainmaker1973·
Did you know? NASA only uses 15 digits of π for calculating interplanetary travel. With 40 digits, you could calculate the circumference of a circle the size of the visible universe with an error smaller than the diameter of a hydrogen atom. π Day 2026
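A quick sanity check of the claim; the physical constants below are rough approximations:

```python
# Error in a circle's circumference C = pi * d when pi is truncated
# after ~40 significant digits. Constants are rough approximations.
universe_diameter_m = 8.8e26   # observable universe, ~93 billion light-years
hydrogen_diameter_m = 1.1e-10  # roughly one angstrom
pi_error = 1e-40               # truncation error with ~40 digits of pi

circumference_error_m = universe_diameter_m * pi_error
print(f"circumference error: {circumference_error_m:.1e} m")
print(f"as a fraction of a hydrogen atom: {circumference_error_m / hydrogen_diameter_m:.1e}")
```

The error comes out around 1e-13 m, roughly a thousandth of a hydrogen atom's diameter, consistent with the claim.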
Carina Hong@CarinaLHong·
Excited to announce Axiom's Series A. We raised $200 million in fresh capital at a $1.6 billion+ valuation in a round led by Menlo Ventures to accelerate our strong execution momentum, extending our lead in formal math into Verified AI.

Mathematicians and theoretical scientists dream up theories and formulate hypotheses, then come up with proofs: a two-step process of discovery. We created Axiom to turn the sparks of curiosity into known truths, and to compress the timeline of breakthroughs.

The Verified AI dream is a generalization of this dream. It is more than providing safeguards for mission-critical systems. The same gap between expert intuition and the machinery needed to ground it exists today in any domain where the generation-verification loop can be made tighter. And yes, software eats the world, and recursive self-improvement is in near sight. Verified AI is not about hallucinations, the lousy part; it is about superintelligence, the brilliant part. We work on Verified AI not out of distrust in technology, but because we think the rapid advances of AI compel it.

I'm grateful to work with and learn from the best team in the world. It's not an easy journey, but climbing with you is what makes it worth it. I can't wait to build at an even faster pace. A nod to @shubho for grounding an ambitious vision in relentless execution every day.

This round was led by @mkraning with @CCgong. Thanks also to the existing investors who doubled down with conviction since the start (@jturow, @mattmcilwain of @MadronaVentures; @marcievu of @greycroftvc; @yanda, @IdaGirma, @nickgiometti of @BCapitalGroup; @ChrisAbshire_ of @Toyota_Ventures; @xtzhou, @jhuber of @TriatomicCap) and to the new firms we got to meet through the process.
Axiom@axiommathai

Axiom launched six months ago with one conviction: mathematics is the right foundation for building systems that reason. Today we announce Axiom's Series A. We raised $200M at a $1.6B+ valuation, led by @MenloVentures, to extend our lead in formal mathematics into Verified AI.

Quanquan Gu@QuanquanGu·
@elonmusk @yunta_tsai Grokking is an interesting training phenomenon. But it usually appears in relatively early training regimes (small compute). Scaling laws are fitted over much larger compute horizons, where these transient dynamics have little effect on the overall trend.
Yun-Ta Tsai@yunta_tsai·
I am curious how many scaling laws are based on the ability to memorize in fixed epochs instead of grokking (generalization). If grokking truly happened, you could use fewer parameters to achieve the same thing, but that requires a model to truly understand the problem in depth (e.g., discovering the laws of physics) instead of line fitting.
Quanquan Gu@QuanquanGu·
Improvements in computer use or economically valuable tasks (e.g., GDPval) don't necessarily require a major jump in pretraining capability. A lot of these gains come from post-training: better tool use, agent scaffolding, RL, or task-specific on-policy distillation. If there is a "wall" in post-training, it is often just a reflection of the ceiling set by pretraining. So I'm not entirely sure what "no wall" refers to here.
Noam Brown@polynoamial

GPT-5.4 is a big step up in computer use and economically valuable tasks (e.g., GDPval). We see no wall, and expect AI capabilities to continue to increase dramatically this year.

Stefano Ermon@StefanoErmon·
Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.
Quanquan Gu retweeted
Haitham Bou Ammar@hbouammar·
LLM decoding isn’t folklore. It’s an optimisation on the probability simplex. One master objective recovers greedy / softmax / top-k/top-p / sparsemax… and makes new samplers easy to design. (paper coming to arXiv — comment for early PDF) #AI #MachineLearning
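For context, here is the textbook view of a few standard samplers as truncate-and-renormalize operations on the probability simplex. This is background illustration only, not the unified master objective the post refers to:

```python
# Greedy, top-k, and top-p decoding as operations on the probability simplex:
# zero out part of the support, then renormalize so the result is again a
# distribution. Illustration only; not the paper's unified objective.
import numpy as np

def greedy(p):
    out = np.zeros_like(p)
    out[np.argmax(p)] = 1.0  # all mass on a single vertex of the simplex
    return out

def top_k(p, k):
    out = np.zeros_like(p)
    keep = np.argsort(p)[-k:]  # indices of the k most likely tokens
    out[keep] = p[keep]
    return out / out.sum()     # renormalize back onto the simplex

def top_p(p, threshold=0.9):
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), threshold) + 1
    keep = order[:cutoff]      # smallest prefix whose mass reaches the threshold
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(greedy(probs))
print(top_k(probs, 2))
print(top_p(probs, 0.8))
```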
Quanquan Gu@QuanquanGu·
Congrats to the Gemini 3.1 team! Strong results and great progress for the field. Healthy competition pushes everyone forward. Faster iteration cycles are coming! Frontier models are becoming a continuous process, not a one-time release.
Arena.ai@arena

Gemini 3.1 Pro is here! It’s top 3 across Text and Vision Arena, and #6 in Code Arena, tied closely with Claude Opus 4.5. Highlights: ▪️Tied #1 in Text (scoring 1500), 4 pts from Opus 4.6 ▪️Top 3 in Arena Expert Leaderboard (scoring 1538), just behind Opus 4.6 ▪️#6 in Code Arena, on par with Opus 4.5 and GLM-5 Competition is tight at the top as ranking spreads overlap. Congrats to the @GoogleDeepMind team on this strong release! 👏
