Haodong Wen
@herrywen1

22 posts

Cogito, ergo sum

Joined June 2022
281 Following · 39 Followers
Haodong Wen retweeted
Kaiyue Wen @wen_kaiyue
I won't be at ICLR this year, but @xingyudang will help present Fantastic Optimizers (arxiv.org/abs/2509.02046)! Stop by Pavilion 4, P4 5309, this afternoon to see what we found in our extensive sweeps and, more importantly, what we learned after the paper that led to Hyperball!
0 replies · 12 reposts · 101 likes · 9K views
Haodong Wen @herrywen1
Excited to be part of this paper — our first step toward uncovering the underlying connection between data ordering and LR schedule in LLM training. Come to Kairong's oral on Fri, Apr 24, 10:30 AM–12:00 PM in Room 202 A/B, or stop by our poster from 3:15 PM–5:45 PM at P3-#521!
Kairong Luo ✈️ ICLR2026 @openhonor

✈️ Heading to ICLR 🇧🇷 Apr 22–27. Come to our oral on Fri, Apr 24 (10:30 AM–12:00 PM, Room 202 A/B) or find me at our poster (3:15 PM–5:45 PM, P3-#521). We study why LR decay can hurt curriculum-based LLM pretraining — and how to fix it. Happy to chat!

0 replies · 0 reposts · 1 like · 40 views
Haodong Wen retweeted
Chris Hayduk @ChrisHayduk
I strongly suspect that Claude Mythos is a looped language model, as described in the paper "Scaling Latent Reasoning via Looped Language Models" from ByteDance. The authors of that paper called out graph search as one of the areas where looping provides a huge theoretical advantage over standard RLVR. And look at where Mythos blows out its competitors the most.
[image]
111 replies · 359 reposts · 4K likes · 594.3K views
Haodong Wen retweeted
Yiping Lu @2prime_PKU
Gradient-Lipschitz analysis can recover the scaling behind muP! Studying how network width changes the gradient Lipschitz constant under operator norms, we:
• recover muP scaling for Adam
• show that Muon's smoothness can be bad
• find that a new row-wise gradient normalization is competitive with Muon
[3 images]
3 replies · 36 reposts · 181 likes · 22.5K views
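As a rough illustration of what a row-wise gradient normalization rule could look like, here is a minimal sketch; this is my own toy reading of the idea, not the authors' method, and the scaling and usage are placeholders.

```python
import torch

def row_normalized_update(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Give every row of a 2-D gradient the same L2 norm.

    Sketch of the 'row-wise gradient normalization' idea: each output row
    receives an update of unit length, regardless of how large its raw
    gradient happens to be.
    """
    row_norms = grad.norm(dim=1, keepdim=True).clamp_min(eps)
    return grad / row_norms

# Hypothetical usage inside a plain SGD loop (lr is a placeholder value):
# for p in model.parameters():
#     if p.grad is not None and p.ndim == 2:
#         p.data.add_(row_normalized_update(p.grad), alpha=-lr)
```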
Haodong Wen retweeted
Kaifeng Lyu @vfleaking
Excited to introduce Hyperball! This story actually started a few years ago, when Zhiyuan Li (@zhiyuanli_) and I observed that weight decay eventually drives the dynamics toward an equilibrium in which the parameter norm and the effective (angular) step size stay constant (arxiv.org/abs/2010.02916). If that is the case, why not remove weight decay altogether and directly control the norm and effective step size?
Kaiyue Wen @wen_kaiyue

(1/n) Introducing Hyperball — an optimizer wrapper that keeps weight & update norm constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.

1 reply · 16 reposts · 166 likes · 21.6K views
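For intuition, here is a minimal sketch of an update rule with the two properties described above: a fixed weight norm and a directly controlled angular step size. This is my own toy version, not the released Hyperball implementation, and all names are placeholders.

```python
import torch

def constant_norm_angular_step(param: torch.Tensor,
                               update: torch.Tensor,
                               target_norm: float,
                               angular_lr: float) -> None:
    """One step with constant weight norm and constant angular step size.

    Sketch only: keep the tangential part of the update, rescale it so the
    weights rotate by a fixed angle `angular_lr`, then project back onto
    the sphere of radius `target_norm` (no weight decay needed).
    """
    with torch.no_grad():
        # Remove the radial component so the update is tangent to the sphere.
        radial = (update * param).sum() / (param.norm() ** 2 + 1e-12) * param
        tangent = update - radial
        # Move by a fixed fraction of the weight norm along the tangent direction.
        step = angular_lr * target_norm * tangent / (tangent.norm() + 1e-12)
        param.sub_(step)
        # Project the weights back onto the fixed-norm sphere.
        param.mul_(target_norm / (param.norm() + 1e-12))
```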
Haodong Wen retweeted
Haozhe Jiang @erichzjiang
Diffusion language models (DLMs) are provably optimal parallel samplers! In my new paper with @nhaghtal and @wjmzbmr1, we show that DLMs can sample distributions in the fewest possible steps, and further, with revision/remasking, with the least possible memory.
[image]
15 replies · 67 reposts · 432 likes · 34.3K views
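As a toy picture of the kind of parallel sampling loop with remasking that the tweet refers to (not the paper's algorithm; `model`, `mask_id`, and the confidence rule are all placeholders I made up):

```python
import torch

def parallel_sample_with_remasking(model, length, mask_id, num_steps, unmask_frac=0.5):
    """Toy diffusion-LM decoding loop.

    Start fully masked, predict every position in parallel at each step,
    commit only the most confident predictions, and leave the rest masked
    so later steps can revise them.
    """
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    for _ in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                                  # (1, length, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)         # per-position confidence
        # Only consider positions that are still masked.
        scores = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        k = max(1, int(unmask_frac * still_masked.sum().item()))
        top_idx = scores.topk(k, dim=-1).indices[0]
        tokens[0, top_idx] = pred[0, top_idx]
    return tokens
```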
Haodong Wen retweeted
alphaXiv @askalphaxiv
This paper shows that wildly different AI models for molecules, materials, and proteins are independently learning the same underlying representation of matter, suggesting we're converging on a shared, physics-grounded "latent reality" and that scientific foundation models might actually generalize across domains.
[image]
105 replies · 299 reposts · 1.5K likes · 417.3K views
Haodong Wen @herrywen1
🤩Come to meet us at #OPT2025 #NeurIPS2025! I'll present “Larger Datasets Can Be Repeated More” at the OPT 2025 workshop: a theory of how multi-epoch SGD reshapes data scaling laws and why larger datasets can be repeated more. 🗓️Sat Dec 6, 2:30 PM 📍Upper Level Ballroom 20A
[image]
0 replies · 0 reposts · 7 likes · 420 views
Haodong Wen retweeted
dr. jack morris @jxmnop
Wondering how to attend an ML conference the right way? Ahead of NeurIPS 2025 (30k attendees!), here are ten pro tips:
1. Your main goals: (i) meet people, (ii) regain excitement about work, (iii) learn things, in that order.
2. Make a list of papers you like and seek them out at poster sessions. Try to talk to the authors; you can learn much more from them than from a PDF.
3. Pick one workshop and one tutorial that sound most interesting. Skip the rest.
4. Cold email people you want to meet but haven't. Check Twitter and the accepted papers list. PhD students are especially responsive.
5. Practice a concise pitch of unpublished research you're working on for "what are you interested in rn?". Focus on big unanswered questions and exciting new directions, *not* papers.
6. Skip the orals. Posters are higher-bandwidth, more engaging, and more invigorating. Orals are a good time to go for a walk or talk in the hallway.
7. For the love of god, do NOT work on other research in your hotel room. Save mental bandwidth for the conference. (This may seem obvious; you'd be surprised.)
8. Talk to people outside your area. There are many smart people working on niches <10 people understand. Learn about one or two that won't help your own work.
9. Attend one social each night. Don't overthink it or get caught up in status games. They're all fun.
10. Take breaks. You can't go to everything, and conferences consume more energy than a normal workweek.
Hope this helps, and sad I'm not attending NeurIPS. Have fun :)
[image]
28 replies · 129 reposts · 1.5K likes · 136.6K views
Haodong Wen @herrywen1
📢 Come meet us at #NeurIPS2025! We'll be presenting our paper "Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold". 🗓️ Friday, Dec 5, 4:30–7:30 p.m. PST 👀 Exhibit Hall C, D, E, Poster #5102. We'd love your feedback!
[image]
Xinghan Li @XinghanLi66

Adam prefers a different minimizer than SGD (exemplified below), but how? 🤔 Our NeurIPS 2025 Paper: Based on our Slow SDE approximation of Adam, we show that under label noise Adam implicitly minimizes tr(Diag(H)^½), whereas prior works showed that SGD minimizes tr(H). 🧵1/n

0 replies · 1 repost · 8 likes · 4.1K views
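As a small numerical illustration of the two implicit-regularization quantities mentioned in the quoted thread (a toy Hessian, purely for intuition; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy positive semi-definite "Hessian".
A = rng.standard_normal((5, 5))
H = A @ A.T

# Quantity prior work associates with SGD under label noise: tr(H).
sgd_sharpness = np.trace(H)

# Quantity the paper associates with Adam: tr(Diag(H)^{1/2}),
# i.e. the sum of square roots of H's diagonal entries.
adam_sharpness = np.sum(np.sqrt(np.diag(H)))

print(f"tr(H)           = {sgd_sharpness:.3f}")
print(f"tr(Diag(H)^1/2) = {adam_sharpness:.3f}")
```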
Haodong Wen retweeted
ICLR 2026 @iclr_conf
[image]
52 replies · 139 reposts · 681 likes · 1M views
Haodong Wen retweeted
Xinghan Li @XinghanLi66
Sharing findings from my internship in @SimonShaoleiDu's group on math reasoning:
• Simple prompting works surprisingly well (though not universally).
• Offline RL doesn't work as expected, suggesting that online interaction is still crucial!
xinghanli.notion.site/Prompting-and-…
2 replies · 5 reposts · 17 likes · 13.5K views
Haodong Wen retweeted
Weijie Su @weijie444
Why and how does gradient/matrix orthogonalization work in Muon for training #LLMs? We introduce an isotropic curvature model to explain it. Take-aways: 1. Orthogonalization is a good idea, "on the right track". 2. But it might not be optimal. [1/n]
[image]
4 replies · 29 reposts · 196 likes · 55K views
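For readers new to Muon, the orthogonalization step under discussion replaces a gradient matrix by its polar factor, setting all singular values to one. A bare-bones SVD version looks roughly like the sketch below; Muon itself approximates this with a Newton-Schulz iteration rather than an explicit SVD, and this is not the paper's code.

```python
import torch

def orthogonalize(grad: torch.Tensor) -> torch.Tensor:
    """Replace a 2-D gradient with its polar factor U @ V^T.

    All singular values become 1, which is the 'matrix orthogonalization'
    step analyzed in the thread. Exact SVD is used here only for clarity.
    """
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U @ Vh
```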
Haodong Wen @herrywen1
Kaifeng, Xinghan, and I will be in San Diego for NeurIPS. We’d love to connect and chat! More details in Xinghan’s thread. Preprint: arxiv.org/abs/2511.02773
0 replies · 0 reposts · 1 like · 73 views
Haodong Wen @herrywen1
Our analysis applies to a family of adaptive gradient methods (e.g., Adam, Shampoo). Many thanks to Kaifeng (@vfleaking) for his kind, patient, and meticulous guidance, and to my great collaborator Xinghan (@XinghanLi66) for his hard work.
1 reply · 0 reposts · 1 like · 75 views
Haodong Wen retweeted
𝚐𝔪𝟾𝚡𝚡𝟾
MiniCPM4 edge series: 0.5B & 8B variants | 8T/1T tokens
- Trainable sparse InfLLM-v2 attention → each token attends to ~5% of others at 128K ctx
- FP8 pipeline + multi-token prediction; UltraClean/UltraChat-v2 data
- BitCPM ternary quant (−1/0/+1, ~90% weight drop); Eagle speculative heads draft ahead for fast decoding (vLLM / FRSpec)
- Jetson AGX Orin: ~7× faster than Qwen3-8B, strong 128K "needle-in-haystack" retrieval
- Apache-2.0
𝑻𝑯𝑰𝑵𝑲 𝑺𝑴𝑶𝑳
HF: huggingface.co/collections/op…
TR: github.com/OpenBMB/MiniCP…
[image]
1 reply · 31 reposts · 147 likes · 34K views
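For a sense of what ternary (−1/0/+1) weight quantization means mechanically, here is a generic absmean-style sketch; it is my illustration only, not OpenBMB's BitCPM recipe.

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Quantize a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Generic absmean-style ternarization: divide by the mean absolute value,
    round to the nearest of -1/0/+1, and keep the scale so the original
    weights can be approximated as scale * q.
    """
    scale = w.abs().mean().clamp_min(1e-8)
    q = (w / scale).round().clamp_(-1, 1)
    return q, scale

# Usage: q, s = ternary_quantize(weight); approx = s * q
```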