Jiaxuan Zou

156 posts

@SmartPig_Joe

Undergrad @ XJTU, Research Intern @ Gaoling School of Artificial Intelligence, RUC

Beijing Haidian · Joined November 2021
216 Following · 40 Followers

Pinned Tweet
Jiaxuan Zou @SmartPig_Joe
I am looking for new AI/ML research internship opportunities. Below is my personal homepage🙂: jiaxuanzou0714.github.io
Replies 0 · Reposts 0 · Likes 0 · Views 440

Google Labs @GoogleLabs
Today, we introduced Gemini for Science, a collection of experimental tools designed to expand the scale and precision of scientific exploration. Included in Gemini for Science are three (!!!) brand new Google Labs experiments. Meet your new AI research partners: 🧵👇
Replies 14 · Reposts 147 · Likes 1.1K · Views 64.7K

Tim Lau @timlautk
1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…
Replies 3 · Reposts 22 · Likes 107 · Views 18.4K

Jiaxuan Zou @SmartPig_Joe
@timlautk @weijie444 As analyzed in our work, the orthogonal constraint eliminates radial jitter and preserves weight norms, theoretically preventing dead neurons. We would greatly appreciate it if you could discuss and cite Nora as a concrete instance in your revisions!😀
Replies 1 · Reposts 0 · Likes 2 · Views 47

Jiaxuan Zou @SmartPig_Joe
@timlautk @weijie444 Congratulations on the symmetry-compatible principle paper! Your discussion on RMNP and Mano perfectly aligns with our recently proposed Nora optimizer (arxiv.org/abs/2605.03769). Nora bridges RMNP's row-wise mechanism and Mano's manifold constraints.
Replies 2 · Reposts 0 · Likes 4 · Views 101

Jiaxuan Zou @SmartPig_Joe
Why does the web version of ChatGPT always take up so much memory and respond quite slowly, making its user experience far inferior to that of the Gemini web version? Was your web version developed using Vibe Coding? @sama
Replies 0 · Reposts 0 · Likes 2 · Views 49

Tony S.F. @tonysilveti
@StefanGliga It's actually good to be forced to move in the beginning, though; it can be used to guarantee that you escape the lazy regime (not moving relative to init as width or depth scales).
Replies 1 · Reposts 0 · Likes 4 · Views 222

Stefan @StefanGliga
I was thinking about Muon a bit while waiting for the bus, and at first glance it seems like it can't attain exact convergence even after infinite time without a learning rate scheduler, only epsilon convergence at best.

With GD on a quadratic, the step size naturally shrinks and it converges. Adam interferes with this analysis, but assuming reasonable betas and epsilon, it should still quickly reach the regime where epsilon is significant and transition to GD. Epsilon might rescue convergence even in some pretty pathological cases of gradients or memory states, as long as they're local/transient.

Meanwhile, the Muon update is always of a constant size; it has nothing to dampen it. It has the same problem as signGD.

Now, this is not a problem in practice, as (A) we use schedulers and (B) we don't usually train to convergence, but it's interesting to think about how such a basic property is missing from such a prominent optimizer.

If anything, this could be yet another explanation of why Muon works so well: if saddle points are both more common and more problematic than we think, erratic behavior around gradient~0 regions is a feature, not a bug.
Replies 3 · Reposts 2 · Likes 28 · Views 2.8K
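
A toy sketch of the point in the post above, under loud assumptions: a one-dimensional quadratic f(x) = 0.5 * x**2, no momentum, and a fixed learning rate. For a 1x1 weight, orthogonalizing the gradient reduces to taking its sign, so the constant-size update is modeled here by a sign step. This is an illustration only, not Muon itself, and the names `gd_step`, `sign_step`, and `run` are hypothetical.

```python
def gd_step(x, lr):
    # Gradient of 0.5 * x**2 is x, so the step shrinks as x approaches 0.
    return x - lr * x


def sign_step(x, lr):
    # Constant-size step: always moves by exactly lr (stand-in for a
    # sign-like / orthogonalized update in the scalar case).
    if x > 0:
        return x - lr
    if x < 0:
        return x + lr
    return x


def run(step, x0, lr, n_steps):
    # Apply the same update rule n_steps times with a fixed learning rate.
    x = x0
    for _ in range(n_steps):
        x = step(x, lr)
    return x


x_gd = run(gd_step, 1.05, 0.1, 500)      # decays geometrically toward 0
x_sign = run(sign_step, 1.05, 0.1, 500)  # ends up bouncing around 0

print(abs(x_gd), abs(x_sign))
```

With a fixed learning rate, the GD iterate keeps contracting, while the sign-step iterate stays within one step size of the minimum without settling on it. Decaying `lr` over time (a scheduler) is what would let the second rule converge.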

Jiaxuan Zou retweeted
Mayukh @mayukh_panja
I don’t agree. A PhD student should not prioritize work-life balance. Getting to do a PhD is a privilege. You are paid to think. There is no pressure for you to be economically useful. It is a unique opportunity to push the boundaries of human knowledge and produce something groundbreaking. And nothing great ever happens without complete devotion. Look at everything that moved and shaped the world. Every single person who created anything meaningful, in science, in arts, in music, in movies, devoted their lives to their craft. Extraordinary outcomes require extraordinary inputs and some degree of sacrifice. Sure, have work-life balance during your PhD. But be content with a mediocre outcome.
Dr. Manabendra Saharia @m_saharia

Yesterday, I was giving an intro talk to our dept's new PhD students. Technical things aside, my number 1 suggestion has remained the same over the years: Treat your PhD like a job.
- Avoid 1.5h lunch and three tea breaks.
- Avoid gossiping and loitering at work.
- Lab at 9 am and leave at 6 pm.
Being productive till 11 pm in the lab is a lie people tell themselves when their day starts at 1 pm. Everything worth doing can be done with high-intensity focus during work hours. And having fun in life is the secret to being productive in a marathon.

Replies 359 · Reposts 277 · Likes 3K · Views 1.1M

Jiaxuan Zou retweeted
Yuandong Tian @tydsh
Today we launch Recursive. We are building AI that discovers knowledge automatically and improves itself recursively, an open-ended process that will fundamentally change how science and technology advance. Our 25 top researchers and engineers in San Francisco and London bring diverse expertise spanning agentic AI scientists, architecture and algorithm design, world models, optimization, and interpretability, united by a shared conviction that this is the most important problem we could be working on today. If you are interested in joining, please send your resume to talent@recursive.com. Follow us at @Recursive_SI!
Recursive @Recursive_SI

x.com/i/article/2054…

Replies 87 · Reposts 149 · Likes 1.4K · Views 165.8K

Jiaxuan Zou retweeted
Tony S.F. @tonysilveti
New paper! We analyze proximal preconditioned gradient methods that extend Muon/Scion to handle nonconvex constraints (Stiefel manifold, spectral sphere, norm balls, ...) with convergence guarantees under heavy-tailed noise + variance reduction w/ STORM! arxiv.org/abs/2605.11850
Replies 1 · Reposts 31 · Likes 167 · Views 11.4K

Jiaxuan Zou @SmartPig_Joe
I think the several latest SOTA results are meaningless; they merely overfit the NanoGPT benchmark while ignoring computational cost.
Yuxin Fang @CV_novel_plume

This result matches my earlier intuition (x.com/CV_novel_plume…): once optimizer-state budget is not constrained and the objective is step count, outer-loop tricks become natural candidates. Interesting that the gain is real, but still fairly small. I would have expected this direction to have more headroom. Maybe the gains would become more visible if the baseline required more iterations, giving these outer-loop dynamics more room to compound.

Replies 1 · Reposts 0 · Likes 3 · Views 379

Jiaxuan Zou @SmartPig_Joe
@CV_novel_plume But... I still want to emphasize, or rather urge, the search for simple, elegant, and effective methods, rather than using complex hyperparameters and tricks to overfit the benchmark.😃😃😃
Replies 1 · Reposts 0 · Likes 0 · Views 50

Yuxin Fang @CV_novel_plume
@SmartPig_Joe I’d view it differently. This benchmark is less about compute and more about data efficiency: fixed per-step batch size, lower loss per token/step. Over the long run, compute and storage scale more easily than high-quality data, so this constraint is actually meaningful.
Replies 2 · Reposts 0 · Likes 4 · Views 135

Jiaxuan Zou @SmartPig_Joe
@CV_novel_plume You have a point. In the long run, with the supply of computing power constantly increasing, these methods may come in handy.
Replies 0 · Reposts 0 · Likes 1 · Views 30

Jiaxuan Zou @SmartPig_Joe
Have we misunderstood? Is tangential updating unnecessary? With the help of hyperball, is the effective learning rate already well controlled? Then what factors lead to Nora outperforming RMNP?
Replies 1 · Reposts 0 · Likes 0 · Views 60

Jiaxuan Zou @SmartPig_Joe
Is tangential projection necessary? I have been pondering this question for the past few days. In our Nora paper (arxiv.org/pdf/2605.03769), the optimizer merely adds one step of tangential projection (p2) before row normalization.
Replies 1 · Reposts 0 · Likes 0 · Views 88
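
One possible reading of "tangential projection before row normalization", sketched under stated assumptions rather than taken from the Nora paper: each row of the update has its component along the matching weight row removed, and the result is then scaled to unit length. The row-wise form of the projection, the `eps` guard, and the name `tangential_row_normalized` are all hypothetical; the paper's actual definition of the p2 step may differ.

```python
import math


def tangential_row_normalized(W, G, eps=1e-12):
    """For each row w of W and g of G: project g onto the tangent space
    at w (drop the radial part along w), then normalize the row."""
    out = []
    for w, g in zip(W, G):
        w_sq = sum(a * a for a in w)
        # Coefficient of the radial component of g along w.
        coef = sum(a * b for a, b in zip(w, g)) / (w_sq + eps)
        # Tangential part: g minus its projection onto w.
        t = [b - coef * a for a, b in zip(w, g)]
        norm = math.sqrt(sum(x * x for x in t))
        # Row normalization; a row with no tangential part stays near zero.
        out.append([x / (norm + eps) for x in t])
    return out


W = [[3.0, 4.0], [1.0, 0.0]]   # toy weight rows (assumed values)
G = [[1.0, 2.0], [0.5, -2.0]]  # toy gradient rows (assumed values)
U = tangential_row_normalized(W, G)
print(U)
```

Because each output row u is orthogonal to its weight row w, a step w - lr * u changes the squared row norm only by lr**2 * ||u||**2, which is the sense in which a tangential update leaves weight norms unchanged to first order.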