Weiyang Liu

871 posts

@Besteuler

Assistant Professor of CSE @CUHKofficial. Postdoc @MPI_IS. PhD @Cambridge_Uni & @GeorgiaTech. Previous Intern @Google & @nvidia. All opinions are my own.

Joined May 2009
782 Following · 2.5K Followers
Pinned Tweet
Weiyang Liu (@Besteuler)
🚀 We are very excited to introduce Pion, a spectrum-preserving optimizer for LLM training. Pion shows strong training stability in practice. This is the project we have been working on since POET. The central idea is to turn POET/POET-X (spherelab.ai/poet; spherelab.ai/poetx) into an easy-to-use optimizer. Instead of applying additive updates like Adam & Muon, Pion updates each weight matrix via coupled left & right orthogonal transformations, which keeps its weight spectrum stable throughout training. This update mechanism is directly inspired by the empirical effectiveness of Orthogonal Finetuning (oft.wyliu.com; boft.wyliu.com) and POET/POET-X, with roots tracing back to minimum-energy training methods such as MHE (arxiv.org/abs/1805.09298) and OPT (opt-training.github.io).
Pion's key features:
✨ Competitive on LLM pretraining, SFT & RLVR
✨ μP-compatible by construction
✨ Stably trains ultra-deep LLMs and even normalization-free LLMs, where AdamW & Muon diverge
🌐 Project: spherelab.ai/pion
📜 Paper: arxiv.org/abs/2605.12492
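The spectrum-preservation claim in the post above can be checked numerically. The following is a minimal sketch, not Pion itself: it assumes the left and right orthogonal factors are matrix exponentials of skew-symmetric matrices, and it uses random skew-symmetric matrices as placeholders for whatever direction the optimizer would actually choose. The helper names `random_skew` and `expm_skew`, the shapes, and the step sizes are all illustrative assumptions.

```python
import numpy as np


def random_skew(rng, n, scale):
    """Random skew-symmetric matrix; a placeholder for an optimizer-chosen direction."""
    g = scale * rng.standard_normal((n, n))
    return g - g.T


def expm_skew(a):
    """Matrix exponential of a real skew-symmetric matrix (the result is orthogonal)."""
    # 1j * a is Hermitian, so a = V diag(-1j * lam) V^H and
    # exp(a) = V diag(exp(-1j * lam)) V^H, whose imaginary part vanishes.
    lam, v = np.linalg.eigh(1j * a)
    return ((v * np.exp(-1j * lam)) @ v.conj().T).real


rng = np.random.default_rng(0)
m, n = 6, 4
w0 = rng.standard_normal((m, n))
w_orth = w0.copy()
w_add = w0.copy()

for _ in range(100):
    left = expm_skew(random_skew(rng, m, 0.05))    # m x m orthogonal factor
    right = expm_skew(random_skew(rng, n, 0.05))   # n x n orthogonal factor
    w_orth = left @ w_orth @ right                 # two-sided multiplicative update
    w_add = w_add + 0.05 * rng.standard_normal((m, n))  # additive update, for contrast

s0 = np.linalg.svd(w0, compute_uv=False)
orth_drift = float(np.max(np.abs(np.linalg.svd(w_orth, compute_uv=False) - s0)))
add_drift = float(np.max(np.abs(np.linalg.svd(w_add, compute_uv=False) - s0)))
print(f"orthogonal drift: {orth_drift:.2e}, additive drift: {add_drift:.2e}")
```

Under these assumptions the singular values of the two-sided update stay at their initial values up to floating-point error, while the additive update moves them freely.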
Weiyang Liu retweeted
Richard Sutton (@RichardSSutton)
The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.
Weiyang Liu (@Besteuler)
Good question! One of the design principles of Pion (and also POET) is minimum-energy training. Under the double-sided orthogonal transformation, the easiest way to (probabilistically) guarantee this is to use zero-mean Gaussian initialization. The spectrum is then the singular value distribution of a matrix with isotropic Gaussian entries (whose directions are equivalent to a uniform distribution on the hypersphere).
jianlin.su (@Jianlin_S)
@Besteuler Under Pion's design, the singular value distribution of each matrix parameter is fixed at initialization. How is this singular value distribution chosen? The paper does not seem to describe this either, but I believe this point is crucial.
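The exchange above says the spectrum is fixed by a zero-mean Gaussian initialization. As a rough illustration (a sketch, not taken from the paper), the snippet below draws such a matrix, inspects its singular values, and checks that its normalized columns behave like directions spread uniformly on the hypersphere. The shape `(512, 128)` and the scale `sigma = 1/sqrt(m)` are assumptions chosen for the demo; the edge estimates `sigma * (sqrt(m) ± sqrt(n))` are the standard large-matrix approximation for Gaussian matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 512, 128
sigma = 1.0 / np.sqrt(m)                 # assumed initialization scale
w = sigma * rng.standard_normal((m, n))  # zero-mean isotropic Gaussian init

# Singular values at initialization; an orthogonal two-sided update would keep these fixed.
s = np.linalg.svd(w, compute_uv=False)
lower = sigma * (np.sqrt(m) - np.sqrt(n))  # approximate lower edge of the spectrum
upper = sigma * (np.sqrt(m) + np.sqrt(n))  # approximate upper edge of the spectrum
print(f"singular values in [{s.min():.3f}, {s.max():.3f}], "
      f"approximate edges [{lower:.3f}, {upper:.3f}]")

# Normalized Gaussian columns are uniformly distributed on the unit sphere,
# so pairwise cosines should average out close to zero.
cols = w / np.linalg.norm(w, axis=0, keepdims=True)
gram = cols.T @ cols
off_diag = gram[~np.eye(n, dtype=bool)]
mean_cos = float(off_diag.mean())
print(f"mean pairwise cosine: {mean_cos:.4f}")
```

With this scale the singular values cluster between roughly 0.5 and 1.5, so no direction is strongly amplified or suppressed at initialization.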
Weiyang Liu (@Besteuler)
We did try a lot of variants that relax the iso-spectral manifold constraint. For example, we implemented a linearized version of Pion, which essentially takes a Taylor expansion of the matrix exponential. While this variant also works, it is not as good as the original Pion. But I do think there is plenty of room to improve Pion. :)
Federico Andres Lois (@federicolois)
@Besteuler Very interesting result!!! The spectral norm being bounded by construction is doing a lot of heavy lifting here. Curious how much of the gain survives if you relax the iso-spectral constraint while keeping everything else.
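To make the "linearized" idea in the exchange above concrete, here is a small sketch of why truncating the matrix exponential relaxes the iso-spectral constraint. It assumes the linearization means keeping only the first-order term, exp(A) ≈ I + A, for a skew-symmetric A; the thread does not spell out the exact variant, so treat this as an illustration. The helper `expm_taylor` is hypothetical.

```python
import numpy as np


def expm_taylor(a, order):
    """Truncated Taylor series of exp(a): sum of a^k / k! for k = 0..order."""
    out = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, order + 1):
        term = term @ a / k
        out = out + term
    return out


def orth_error(q):
    """Frobenius distance of q^T q from the identity (zero for orthogonal q)."""
    return float(np.linalg.norm(q.T @ q - np.eye(q.shape[0])))


rng = np.random.default_rng(0)
n = 8
g = 0.05 * rng.standard_normal((n, n))
a = g - g.T                           # skew-symmetric generator

q_linear = expm_taylor(a, 1)          # first-order: I + A
q_high = expm_taylor(a, 12)           # high-order truncation, close to exp(A)
err_linear = orth_error(q_linear)
err_high = orth_error(q_high)

# Effect on the spectrum of a weight matrix after one left multiplication.
w = rng.standard_normal((n, n))
s0 = np.linalg.svd(w, compute_uv=False)
drift_linear = float(np.max(np.abs(np.linalg.svd(q_linear @ w, compute_uv=False) - s0)))
drift_high = float(np.max(np.abs(np.linalg.svd(q_high @ w, compute_uv=False) - s0)))
print(f"orthogonality error: linear {err_linear:.2e}, high-order {err_high:.2e}")
print(f"singular value drift: linear {drift_linear:.2e}, high-order {drift_high:.2e}")
```

Because (I + A)^T (I + A) = I + A^T A for skew-symmetric A, the first-order factor is only approximately orthogonal, so each step nudges the singular values instead of preserving them exactly.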
Weiyang Liu (@Besteuler)
🎉OrthoMerge has been accepted to #ICML2026. This work introduces an elegant way to merge different model checkpoints. Kudos to my PhD students @sihany077 and @KexuanShi67338.
Weiyang Liu (@Besteuler)

Orthogonal Finetuning (oft.wyliu.com; boft.wyliu.com) has the unique advantage of preventing catastrophic forgetting. Inspired by this property, we find that merging models within the orthogonal group can effectively reduce model conflicts and preserve both pretraining and downstream knowledge. This is our OrthoMerge framework. The idea behind OrthoMerge is extremely simple. For OFT-tuned models, we first map the orthogonal adapters to the Lie algebra with the inverse Cayley transform and then perform merging there. This guarantees that the merged model differs from the pretrained model only by an orthogonal transformation. Even better, OrthoMerge can also be applied to non-OFT-tuned models. By solving the orthogonal Procrustes problem, we obtain the projection of the adapter onto the orthogonal group. OrthoMerge is then applied to this projected component, and the residual component can be merged using conventional merging methods. In other words, OrthoMerge can be used together with existing model merging methods! This is a great example of a simple yet effective idea. Great efforts by my PhD students Sihan Yang and Kexuan Shi. The project is already open-sourced, so feel free to give it a try! Project: spherelab.ai/OrthoMerge/ Paper: arxiv.org/pdf/2602.05943 Code: github.com/Sphere-AI-Lab/…

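The two steps described in the quoted post (merging in the Lie algebra via the inverse Cayley transform, and projecting a non-OFT update onto the orthogonal group via orthogonal Procrustes) can be sketched as follows. This is an illustration under stated assumptions, not the released OrthoMerge code: the adapters act by left multiplication on the pretrained weight, the merge rule is a plain average of the skew-symmetric matrices, and all names (`cayley`, `inverse_cayley`, `procrustes_rotation`, `w0`) are hypothetical.

```python
import numpy as np


def cayley(a):
    """Cayley transform: skew-symmetric a -> orthogonal (I + a)(I - a)^{-1}."""
    eye = np.eye(a.shape[0])
    return np.linalg.solve(eye - a, eye + a)  # the two factors commute


def inverse_cayley(r):
    """Inverse Cayley transform: orthogonal r without eigenvalue -1 -> skew-symmetric."""
    eye = np.eye(r.shape[0])
    return np.linalg.solve(r + eye, r - eye)


def procrustes_rotation(w_ft, w0):
    """Orthogonal r minimizing ||r @ w0 - w_ft||_F, via the SVD of w_ft @ w0.T."""
    u, _, vt = np.linalg.svd(w_ft @ w0.T)
    return u @ vt


def random_skew(rng, d, scale):
    g = scale * rng.standard_normal((d, d))
    return g - g.T


rng = np.random.default_rng(0)
d, k = 5, 7
w0 = rng.standard_normal((d, k))              # stand-in for a pretrained weight

# Two hypothetical OFT adapters (orthogonal matrices applied on the left).
r1 = cayley(random_skew(rng, d, 0.1))
r2 = cayley(random_skew(rng, d, 0.1))

# Merge in the Lie algebra (plain average is an assumption), then map back.
a_merged = 0.5 * (inverse_cayley(r1) + inverse_cayley(r2))
r_merged = cayley(a_merged)
w_merged = r_merged @ w0

orth_err = float(np.linalg.norm(r_merged.T @ r_merged - np.eye(d)))
spec_err = float(np.max(np.abs(np.linalg.svd(w_merged, compute_uv=False)
                               - np.linalg.svd(w0, compute_uv=False))))

# Non-OFT case: split a finetuned weight into an orthogonal part and a residual.
noise = 0.01 * rng.standard_normal((d, k))
w_ft = r1 @ w0 + noise
r_proj = procrustes_rotation(w_ft, w0)
residual = w_ft - r_proj @ w0                 # left for a conventional merging method
print(f"orthogonality error {orth_err:.2e}, spectrum error {spec_err:.2e}, "
      f"residual norm {np.linalg.norm(residual):.3f}")
```

Averaging skew-symmetric matrices keeps the result skew-symmetric, so mapping back through the Cayley transform always yields an orthogonal matrix, which is why the merged weight keeps the singular values of `w0` in this sketch.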
Weiyang Liu (@Besteuler)
🥳POET-X has been accepted to #ICML2026 as a Spotlight paper! We’re very grateful that the Area Chair and reviewers recognized the contribution of our work.
Weiyang Liu (@Besteuler)

🚀 Excited to introduce POET-X, a scalable and highly memory-efficient algorithm for LLM pretraining. ✨ LoRA-level GPU memory, better-than-AdamW pretraining performance! POET-X finally marries training stability (from POET's spectrum preservation) and practical scalability (from our new implementation and CUDA kernels). POET-X can pretrain billion-parameter LLMs (e.g., Llama-8B) on a single NVIDIA H100, where standard optimizers like AdamW run out of memory under the same settings. We carefully reimplemented every computation step of POET (arxiv.org/pdf/2506.08001). POET-X combines many small checkpointing and parallelization tricks. While each may appear incremental, together they dramatically improve scalability and reduce memory usage by over 70% compared to the original POET. The memory efficiency of POET-X comes from the unique parameter-efficient reparameterization (where sparsity comes in) of the weight update rule. POET-X bridges the gap between parameter efficiency and memory efficiency. Code is now public. Feel free to try it! ➡️ Paper: arxiv.org/pdf/2603.05500 💻 Code: github.com/Sphere-AI-Lab/… 🌐 Website: spherelab.ai/poetx #AI #LLM #MachineLearning #DeepLearning
