
Haodong Wen
22 posts



✈️ Heading to ICLR 🇧🇷 Apr 22–27. Come to our oral on Fri, Apr 24 (10:30 AM–12:00 PM, Room 202 A/B) or find me at our poster (3:15 PM–5:45 PM, P3-#521). We study why LR decay can hurt curriculum-based LLM pretraining — and how to fix it. Happy to chat!



(1/n) Introducing Hyperball: an optimizer wrapper that keeps the weight and update norms constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.
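Since the post only describes Hyperball in words, here is a rough sketch of the kind of wrapper it describes. The class name, the `eta_ang` parameter, and the per-tensor renormalization details are my own illustrative choices under those stated properties, not the released implementation.

```python
import torch

class HyperballStyleWrapper:
    """Hypothetical sketch of a Hyperball-like wrapper (names and details are mine,
    not the paper's code). An inner optimizer proposes an update; the wrapper then
    rescales that update so the angular step size ||delta_w|| / ||w|| equals a fixed
    `eta_ang`, and projects each weight back onto its initial-norm sphere so the
    weight norm stays constant."""

    def __init__(self, params, inner_opt, eta_ang=0.02):
        self.params = list(params)
        self.inner = inner_opt
        self.eta_ang = eta_ang
        # Target norm for each tensor, frozen at initialization.
        self.target_norms = [p.detach().norm().item() for p in self.params]

    @torch.no_grad()
    def step(self):
        old = [p.detach().clone() for p in self.params]
        self.inner.step()  # inner optimizer (e.g. Adam) writes its update into the params
        for p, w_old, r in zip(self.params, old, self.target_norms):
            delta = p - w_old
            if delta.norm() == 0:
                continue
            # Fix the update length to eta_ang * ||w_old||, i.e. a constant angular step.
            delta = delta * (self.eta_ang * w_old.norm() / delta.norm())
            w_new = w_old + delta
            # Renormalize so ||w|| stays at its initial value.
            p.copy_(w_new * (r / w_new.norm()))
```

Usage would look like wrapping torch.optim.Adam(params) and calling wrapper.step() after loss.backward(); the real Hyperball may differ in how it defines the angular step and which tensors it constrains.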

Adam prefers a different minimizer than SGD (exemplified below), but how? 🤔 Our NeurIPS 2025 paper: Based on our Slow SDE approximation of Adam, we show that under label noise Adam implicitly minimizes tr(Diag(H)^½), whereas prior work showed that SGD minimizes tr(H). 🧵1/n
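To make the difference concrete, here is a toy numeric check (my own numbers, not from the paper) showing that the two sharpness measures can rank the same pair of minima in opposite orders:

```python
import numpy as np

# Two toy diagonal Hessians standing in for two different minima of the same loss.
H_spiky = np.diag([4.0, 0.0, 0.0])   # curvature concentrated in one direction
H_flat  = np.diag([1.0, 1.0, 1.0])   # curvature spread evenly

def sgd_sharpness(H):
    # Implicit regularizer attributed to label-noise SGD in prior work: tr(H)
    return np.trace(H)

def adam_sharpness(H):
    # Form highlighted in this thread for Adam under label noise: tr(Diag(H)^(1/2))
    return np.sum(np.sqrt(np.diag(H)))

print(sgd_sharpness(H_spiky), sgd_sharpness(H_flat))    # 4.0 3.0 -> tr(H) prefers H_flat
print(adam_sharpness(H_spiky), adam_sharpness(H_flat))  # 2.0 3.0 -> tr(Diag(H)^(1/2)) prefers H_spiky
```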




🚀 Our NeurIPS 2025 paper: An SDE-based mathematical characterization of how adaptive gradient methods (e.g., Adam, Shampoo) implicitly reduce the sharpness of the local loss landscape. Under label noise, it is known that SGD implicitly minimizes tr(H). We show that Adam implicitly minimizes tr(Diag(H)^½), a distinctive form of sharpness! In sparse linear regression with diagonal nets, this difference in implicit bias enables Adam to recover the sparse ground truth with far fewer samples than SGD.

👥 Work from our group at Tsinghua, with undergrad intern Xinghan Li @XinghanLi66 and first-year PhD student Haodong Wen @herrywen1.
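For readers who haven't seen the "diagonal nets" setting, a minimal sketch of that parametrization is below. The beta = u * v form, the dimensions, the seed, and the plain gradient-descent loop are my own illustrative choices, not the paper's exact experiment.

```python
import numpy as np

# Hypothetical toy setup: sparse linear regression where the regression vector is
# parametrized by a diagonal linear network, beta = u * v (elementwise), so an
# optimizer's implicit bias acts on (u, v) rather than on beta directly.
rng = np.random.default_rng(0)
n, d, k = 50, 100, 3                      # samples, dimension, sparsity
beta_star = np.zeros(d)
beta_star[:k] = 1.0                       # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ beta_star                         # noiseless targets, for simplicity

u = 0.1 * np.ones(d)
v = 0.1 * np.ones(d)

def loss_and_grads(u, v):
    beta = u * v                          # diagonal-net parametrization
    r = X @ beta - y
    loss = 0.5 * np.mean(r ** 2)
    g_beta = X.T @ r / n
    return loss, g_beta * v, g_beta * u   # chain rule through beta = u * v

# Plain gradient descent on (u, v), just to show the setup runs; swapping in Adam
# here is where the implicit-bias difference described above would come into play.
lr = 0.1
for _ in range(200):
    loss, gu, gv = loss_and_grads(u, v)
    u -= lr * gu
    v -= lr * gv
print(round(loss, 4), np.round((u * v)[:5], 2))   # current loss and first few recovered coordinates
```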
