Jiaxuan Zou

156 posts

@SmartPig_Joe

Undergrad@XJTU, Research Intern@Gaolin School of Artificial Intelligence, RUC

Beijing Haidian · Joined November 2021

216 Following · 40 Followers

Pinned Tweet

Jiaxuan Zou@SmartPig_Joe·Nov 25

I am looking for new AI/ML research internship opportunities. Below is my personal homepage🙂: jiaxuanzou0714.github.io

Jiaxuan Zou@SmartPig_Joe·37m

@GoogleLabs When will we be able to use it?

Google Labs@GoogleLabs·18h

Today, we introduced Gemini For Science, a collection of experimental tools designed to expand the scale and precision of scientific exploration. Included in Gemini for Science are three (!!!) brand new Google Labs experiments. Meet your new AI research partners: 🧵👇

Jiaxuan Zou@SmartPig_Joe·8h

@timlautk @weijie444 Thank you very much for your reply. Your paper is an insightful piece of work.🫡

Tim Lau@timlautk·8h

@SmartPig_Joe @weijie444 Thanks for mentioning your concurrent work. We will discuss and cite it in our revision.

Tim Lau@timlautk·20h

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

Jiaxuan Zou@SmartPig_Joe·9h

@timlautk @weijie444 As analyzed in our work, the orthogonal constraint eliminates radial jitter and preserves weight norms, theoretically preventing dead neurons. We would greatly appreciate it if you could discuss and cite Nora as a concrete instance in your revisions!😀

Jiaxuan Zou@SmartPig_Joe·9h

@timlautk @weijie444 Congratulations on the symmetry-compatible principle paper! Your discussion on RMNP and Mano perfectly aligns with our recently proposed Nora optimizer (arxiv.org/abs/2605.03769). Nora bridges RMNP's row-wise mechanism and Mano's manifold constraints.

Jiaxuan Zou@SmartPig_Joe·1d

Why does the web version of ChatGPT always take up so much memory and respond quite slowly, making its user experience far inferior to that of the Gemini web version? Was your web version developed using Vibe Coding? @sama

Jiaxuan Zou@SmartPig_Joe·3d

interesting

Alexander Doria@Dorialexander

typical run with batch standing out.

Jiaxuan Zou retweeted

Shenyang Deng ✈️ ICML2026@DengShenyang24·3d

1/n I wrote a blog post explaining the asymptotic equivalence between normalized and orthogonalized descent directions (not preconditioners, but descent directions) in an extremely simple example. Here I'd like to respond to some gaps in the construction of this counterexample:

Tianyang Lin@tianylin

blogpost: Can row-normalization really replace Muon? nil9.net/posts/rownorm_…

Jiaxuan Zou@SmartPig_Joe·4d

@tonysilveti @StefanGliga Is this slide publicly available? Where can I download it?

Tony S.F.@tonysilveti·4d

@StefanGliga It's actually good to be forced to move in the beginning, though; it can be used to guarantee that you escape the lazy regime (not moving relative to init as width or depth scales).

Stefan@StefanGliga·4d

I was thinking about Muon a bit while waiting for the bus, and at first glance it seems like it can't attain exact convergence even after infinite time without a learning rate scheduler, only epsilon convergence at best. With GD on a quadratic, stepsize naturally shrinks and it converges. Adam interferes with this analysis, but assuming reasonable betas and epsilon, it should still quickly reach the regime where epsilon is significant and transition to GD. Epsilon might rescue convergence even in some pretty pathological cases of gradients or memory states, as long as they're local/transient. Meanwhile, the Muon update is always of a constant size; it has nothing to dampen it. It has the same problem as signGD. Now this is not a problem in practice as A) we use schedulers and B) we don't usually train to convergence, but it's interesting to think about how such a basic property is missing from such a prominent optimizer. If anything, this could be yet another explanation for why Muon works so well: if saddle points are both more common and more problematic than we think, erratic behavior around gradient~0 regions is a feature, not a bug.
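
A toy illustration of the point above, as a rough sketch rather than Muon itself: on the 1-D quadratic f(x) = 0.5 * x^2, plain GD takes steps proportional to the gradient, while a sign-based update (used here as a stand-in for a constant-size update, which is an assumption of this sketch) always moves by exactly the learning rate. The function names, starting point, and learning rate are all hypothetical choices for the demo.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def run_gd(x0, lr, steps):
    """Gradient descent on f(x) = 0.5 * x**2, whose gradient is x."""
    x = x0
    for _ in range(steps):
        x -= lr * x  # the step shrinks as the gradient shrinks
    return x


def run_sign_descent(x0, lr, steps):
    """Sign descent: every step has size lr regardless of gradient size."""
    x = x0
    for _ in range(steps):
        x -= lr * sign(x)
    return x


def run_sign_descent_decayed(x0, lr, steps):
    """Sign descent with a linearly decaying learning rate (a scheduler)."""
    x = x0
    for t in range(steps):
        lr_t = lr * (1.0 - t / steps)
        x -= lr_t * sign(x)
    return x


gd_final = run_gd(1.05, 0.1, 200)
sign_final = run_sign_descent(1.05, 0.1, 200)
decayed_final = run_sign_descent_decayed(1.05, 0.1, 200)

print("GD:", gd_final)
print("sign descent, constant lr:", sign_final)
print("sign descent, decayed lr:", decayed_final)
```

In this toy setting GD contracts toward zero, the constant-step variant ends up bouncing around the minimum at a distance bounded by the learning rate, and the decayed schedule shrinks that bounce. Momentum and the matrix structure of the real optimizer are deliberately left out.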

Jiaxuan Zou@SmartPig_Joe·4d

Why do the conclusions of experiments in different blogs vary? Which one should we choose?😵

Tianyang Lin@tianylin

blogpost: Can row-normalization really replace Muon? nil9.net/posts/rownorm_…

Jiaxuan Zou@SmartPig_Joe·4d

lol

Evan Walters@evaninwords

The thing is I have a theory on DL cycles, last time people were obsessed with optimizer variants, norm balls, and adding res connections everywhere was late 2010s right before the transformer explosion, so my bet is all these opt variants today are signaling the next stage of DL which will be significantly more powerful than the common models we see today.

Jiaxuan Zou@SmartPig_Joe·5d

Everyone should stop and take a look at Shenyang's work (arxiv.org/abs/2603.20527); it's a very insightful paper.😃

Shenyang Deng ✈️ ICML2026@DengShenyang24

1/n Please stop by👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NNs optimization. Quick overview below 🧵

Jiaxuan Zou retweeted

Mayukh@mayukh_panja·13 May

I don’t agree. A PhD student should not prioritize work-life balance. Getting to do a PhD is a privilege. You are paid to think. There is no pressure for you to be economically useful. It is a unique opportunity to push the boundaries of human knowledge and produce something groundbreaking. And nothing great ever happens without complete devotion. Look at everything that moved and shaped the world. Every single person who created anything meaningful, in science, in arts, in music, in movies, devoted their lives to their craft. Extraordinary outcomes require extraordinary inputs and some degree of sacrifice. Sure, have work-life balance during your PhD. But be content with a mediocre outcome.

Dr. Manabendra Saharia@m_saharia

Yesterday, I was giving an intro talk to our dept's new PhD students. Technical things aside, my number 1 suggestion has remained the same over the years: Treat your PhD like a job. - Avoid 1.5h lunch and three tea breaks. - Avoid gossiping and loitering at work. - Lab at 9 am and leave at 6 pm. Being productive till 11 pm in the lab is a lie people tell themselves when their day starts at 1 PM. Everything worth doing can be done with high intensity focus during work hours. And having fun in life is the secret to being productive in a marathon.

Jiaxuan Zou retweeted

Yuandong Tian@tydsh·6d

Today we launch Recursive. We are building AI that discovers knowledge automatically and improves itself recursively, an open-ended process that will fundamentally change how science and technology advance. Our 25 top researchers and engineers in San Francisco and London bring diverse expertise spanning agentic AI scientists, architecture and algorithm design, world models, optimization, and interpretability, united by a shared conviction that this is the most important problem we could be working on today. If you are interested in joining, please send your resume to talent@recursive.com. Follow us at @Recursive_SI!

Recursive@Recursive_SI

x.com/i/article/2054…

Jiaxuan Zou retweeted

Tony S.F.@tonysilveti·13 May

New paper! We analyze proximal preconditioned gradient methods that extend Muon/Scion to handle nonconvex constraints (Stiefel manifold, spectral sphere, norm balls, ...) with convergence guarantees under heavy-tailed noise + variance reduction w/ STORM! arxiv.org/abs/2605.11850

Jiaxuan Zou@SmartPig_Joe·13 May

@tmpethick @scottjmaddox @b_rich_now @wen_kaiyue interesting point

Thomas Pethick@tmpethick·13 May

@scottjmaddox @SmartPig_Joe @b_rich_now @wen_kaiyue I made the argument precise here in case it is interesting: pethick.dk/posts/2026-05-…

Kaiyue Wen@wen_kaiyue·10 May

One interesting observation one could make about the optimization track is that weight decay is removed from all top submissions.

Keller Jordan@kellerjordan0

Modded-NanoGPT optimization result #11: @nilinabra has achieved a new record of 3225 steps (-25) via a novel technique dubbed Contra-Muon, in which top SVD components are somewhat suppressed. This result builds on #9.

Jiaxuan Zou@SmartPig_Joe·12 May

@CV_novel_plume agree😃

Yuxin Fang@CV_novel_plume·12 May

@SmartPig_Joe Of course — I think we actually agree on this point. My earlier thoughts are here: x.com/cv_novel_plume…

Yuxin Fang@CV_novel_plume

This is a very meaningful benchmark, but there is one caveat worth keeping in mind. In speedrun settings, there is now a clear trend toward using different optimizers and hyperparameters for different modules. I have to admit that this can bring real gains. But when comparing optimizers, we should not give hyperparameters unlimited freedom. For example, if I first run a strong optimizer, then reverse-engineer an SGD hyperparameter schedule that tunes every neuron at every step to match it, SGD may appear to “simulate” Adam, Muon, or almost any optimizer. But that would not tell us much about SGD. It only means the optimizer has been hidden inside the hyperparameter schedule. To me, the value of a good optimizer is the opposite: it should adapt internally, require fewer hand-tuned knobs, and transfer robustly across model scales. This kind of invariance across model scales is exactly what makes hyperparameter scaling laws meaningful. If we over-optimize the recipe for one particular scale, we may win that benchmark while losing the cross-scale structure we actually want to understand.

Jiaxuan Zou@SmartPig_Joe·12 May

I think the latest several SOTA results are meaningless; they are merely overfitting the NanoGPT benchmark while ignoring the computational cost issue.

Yuxin Fang@CV_novel_plume

This result matches my earlier intuition (x.com/CV_novel_plume…): once optimizer-state budget is not constrained and the objective is step count, outer-loop tricks become natural candidates. Interesting that the gain is real, but still fairly small. I would have expected this direction to have more headroom. Maybe the gains would become more visible if the baseline required more iterations, giving these outer-loop dynamics more room to compound.

Jiaxuan Zou@SmartPig_Joe·12 May

@CV_novel_plume but... I still want to emphasize, or rather urge, the search for simple, elegant, and effective methods, rather than using complex hyperparameters and tricks to overfit the benchmark.😃😃😃

Yuxin Fang@CV_novel_plume·12 May

@SmartPig_Joe I’d view it differently. This benchmark is less about compute and more about data efficiency: fixed per-step batch size, lower loss per token/step. Over the long run, compute and storage scale more easily than high-quality data, so this constraint is actually meaningful.

Jiaxuan Zou@SmartPig_Joe·12 May

@CV_novel_plume You have a point. In the long run, with the supply of computing power constantly increasing, these methods may come in handy.

Jiaxuan Zou@SmartPig_Joe·12 May

what do u think? @wen_kaiyue 😃😃

Jiaxuan Zou@SmartPig_Joe·12 May

Have we misunderstood? Is tangential updating unnecessary? With the help of hyperball, is the effective learning rate already well controlled? Then what factors lead to Nora outperforming RMNP?

Jiaxuan Zou@SmartPig_Joe·12 May

Is tangential projection necessary? I have been pondering this question for the past few days. In our paper Nora (arxiv.org/pdf/2605.03769), our Nora optimizer merely adds one step of tangential projection (p2) before row normalization.
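
For readers who want to picture the "tangential projection, then row normalization" order described above, here is a minimal sketch. It is not taken from the Nora paper: it assumes the projection is applied per row, removing from each update row the component parallel to the matching weight row, and every function name below is hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def tangential_projection(w_row, g_row):
    """Remove the part of g_row that is parallel to w_row (the radial part)."""
    denom = dot(w_row, w_row)
    if denom == 0.0:
        return list(g_row)  # no direction to project against
    coeff = dot(g_row, w_row) / denom
    return [g - coeff * w for g, w in zip(g_row, w_row)]


def row_normalize(row, eps=1e-12):
    """Scale a row to (approximately) unit Euclidean norm."""
    norm = math.sqrt(dot(row, row))
    return [x / (norm + eps) for x in row]


def sketch_update(W, G):
    """Assumed order: project each update row, then normalize it."""
    return [row_normalize(tangential_projection(w, g)) for w, g in zip(W, G)]


W = [[3.0, 4.0], [1.0, 0.0]]   # toy weight matrix, one list per row
G = [[1.0, 2.0], [0.5, -2.0]]  # toy update direction (e.g. a momentum buffer)
U = sketch_update(W, G)

for w_row, u_row in zip(W, U):
    print("update row:", u_row, "| dot with weight row:", dot(w_row, u_row))
```

Under these assumptions each update row is orthogonal to its weight row, so to first order a small step leaves the row norm unchanged. Whether the real method projects per row or over the whole matrix is not stated in the thread.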
