Zhixuan Lin
@zhxlin

178 posts

PhD student at @Mila_Quebec and @UMontreal. Working on (linear complexity) long-context sequence models and RL.

Joined April 2017
617 Following · 510 Followers
Pinned Tweet
Zhixuan Lin @zhxlin
#COLM2025 We introduce Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX), a provably safe pruning method that significantly speeds up our Forgetting Attention kernel, especially for long-context pretraining. Our simple Triton kernel with ACP is 1.7x to 2.4x faster than the official FlashAttention2 kernel when pretraining 760M-param models with context lengths from 4k to 16k on 4xL40S!
• Code: github.com/zhixuan-lin/fo…
• Paper: arxiv.org/abs/2504.06949
Joint work with @johanobandoc, Xu Owen He, @AaronCourville, from @Mila_Quebec and @makermaker_ai
More details👇
Zhixuan Lin tweet media
5 replies · 52 reposts · 309 likes · 27.8K views
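A rough sketch of the idea in the pinned tweet above, under my own assumptions (this is not the official Triton kernel; the block size, the threshold, and the simplified skip rule are illustrative): Forgetting Attention adds a decay bias built from the log forget gates to every attention logit, and since that bias only becomes more negative for older keys, whole key blocks whose best-case bias is already negligible can be skipped.

```python
# Hedged sketch, not the official FoX/ACP kernel: a naive NumPy reference for Forgetting
# Attention with a blockwise, ACP-style skip. The decay bias D[i, j] = sum_{l=j+1..i} log f_l
# is added to each logit; it only becomes more negative for older keys, so key blocks whose
# best-case bias is below a threshold are skipped. The block size, the threshold, and the
# simplified skip rule (ignoring q.k magnitudes) are illustrative assumptions.
import numpy as np

def forgetting_attention_acp(q, k, v, log_f, block=4, threshold=-15.0):
    """q, k, v: (T, d); log_f: (T,) log forget gates, each <= 0."""
    T, d = q.shape
    c = np.concatenate([[0.0], np.cumsum(log_f)])    # c[i] = sum of log_f[:i]
    out = np.zeros_like(v)
    for qs in range(0, T, block):                     # query blocks
        qe = min(qs + block, T)
        logits = np.full((qe - qs, T), -np.inf)
        for ks in range(0, qe, block):                # causal key blocks
            ke = min(ks + block, T)
            best_bias = c[qs + 1] - c[ke]             # largest bias this block pair can see
            if ke <= qs and best_bias < threshold:
                continue                              # negligible contribution: skip the block
            s = q[qs:qe] @ k[ks:ke].T / np.sqrt(d)
            i = np.arange(qs, qe)[:, None]
            j = np.arange(ks, ke)[None, :]
            s = np.where(j <= i, s + (c[i + 1] - c[j + 1]), -np.inf)   # causal mask + decay
            logits[:, ks:ke] = s
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        out[qs:qe] = (w / w.sum(axis=1, keepdims=True)) @ v
    return out

# Toy usage: strong forgetting makes distant key blocks prunable.
rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
log_f = np.log(rng.uniform(0.3, 0.9, size=T))
print(forgetting_attention_acp(q, k, v, log_f).shape)   # (16, 8)
```

The real criterion presumably also accounts for score magnitudes so that the skip is provably safe; see the linked paper and repo for the actual bound and the Triton implementation.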
Zhixuan Lin retweeted
Johan Obando-Ceron @ ICLR’26 👍🏽
🔥 The AutoRL workshop is shaping up to be an exciting venue. If your work aligns, we strongly encourage you to submit. Great talks and an exciting panel will be announced soon. #RLC @RL_Conference
AutoRL Workshop @AutoRL_Workshop

🔥 AutoRL Workshop returns to RLC 2026 in Montréal 🇨🇦 Join us to tackle RL brittleness and advance methods that work “out of the box”. More info: sites.google.com/view/automated… This year's organisers are: Theresa Eimer, @DierkesJul67648, @johanobandoc, @pcastr, @HolgerHoos

0 replies · 6 reposts · 17 likes · 2.4K views
Zhixuan Lin retweeted
Kimi.ai @Kimi_Moonshot
We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…
45 replies · 185 reposts · 1.8K likes · 206.8K views
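For context on what a kernel like this computes, below is a naive per-token reference for a gated delta-rule linear-attention recurrence in the DeltaNet family. Whether Kimi Delta Attention has exactly this form (scalar vs. channel-wise gating, normalization, chunking) is an assumption on my part, and this sketch is unrelated to the FlashKDA/CUTLASS implementation; it only shows the constant-size-state recurrence that such prefill kernels accelerate.

```python
# Hedged sketch, not FlashKDA: a naive per-token gated delta-rule recurrence
# (DeltaNet-style). Whether KDA matches this exact form is an assumption here; the point
# is only the O(1)-size fast-weight state that chunked prefill kernels parallelize over.
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, dk); v: (T, dv); alpha, beta: (T,) gates in (0, 1]."""
    T, dk = q.shape
    dv = v.shape[1]
    S = np.zeros((dk, dv))                        # fast-weight state (keys -> values)
    out = np.zeros((T, dv))
    for t in range(T):
        kt, vt = k[t], v[t]
        S = alpha[t] * S                          # decay the state
        S = S - beta[t] * np.outer(kt, kt @ S - vt)   # rank-1 delta correction toward v_t
        out[t] = q[t] @ S                         # read out with the query
    return out

T, dk, dv = 8, 4, 4
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, T, dk))
alpha = rng.uniform(0.9, 1.0, T)
beta = rng.uniform(0.1, 0.5, T)
print(gated_delta_rule(q, k, v, alpha, beta).shape)   # (8, 4)
```

Prefill kernels process this kind of recurrence in chunks rather than token by token, which is where the reported speedups come from.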
Zhixuan Lin retweeted
Yihao Sun @Tobealegend24
Most VLA-RL frameworks inherit the complexity of LLM-RL infra, but we found that none of it is necessary. We therefore introduce VLARLKit: a simple yet fast VLA RL framework. Code link: github.com/VLARLKit/VLARL…
Yihao Sun tweet media
4 replies · 20 reposts · 115 likes · 8.9K views
Zhixuan Lin retweeted
Tianwei Ni @twni2016
🔥Thrilled to announce the Continual Reinforcement Learning (CRL) Workshop @RL_Conference 2026 in Montreal, Canada! 📷 We welcome submissions on broad topics of continual RL. Interested in submitting or reviewing? Check out our website for more details!
0 replies · 1 repost · 13 likes · 581 views
Zhixuan Lin retweeted
Ziyan "Ray" Luo @RayZiyan41307
🔥Thrilled to announce the Continual Reinforcement Learning (CRL) Workshop @RL_Conference 2026 in Montreal, Canada! 📣 We welcome submissions on broad topics of continual RL. Interested in submitting or reviewing? Check out our website for more details!
1 reply · 5 reposts · 34 likes · 4.3K views
Zhixuan Lin retweeted
Yu Zhang 🐙🌘 @yzhang_cs
flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…
Yu Zhang 🐙🌘 tweet media
7 replies · 26 reposts · 233 likes · 27.9K views
Zhixuan Lin retweeted
Lucas Maes @lucasmaes_
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
107 replies · 561 reposts · 3.9K likes · 930.6K views
Zhixuan Lin retweeted
William Merrill @lambdaviking
[1/8] New paper with Hongjian Jiang, @YanhongLi2062, Anthony Lin, @Ashish_S_AI: 📜Why Are Linear RNNs More Parallelizable? We identify expressivity differences between linear/nonlinear RNNs and, conversely, barriers to parallelizing nonlinear RNNs 🧵👇
William Merrill tweet media
4 replies · 27 reposts · 188 likes · 16.5K views
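A minimal sketch of the parallelizability point (my illustration, not the paper's construction): a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t composes affine maps, and composition is associative, so the whole sequence can be evaluated with a logarithmic-depth prefix scan. Nonlinear RNNs have no such associative composition, which is the kind of barrier the paper studies.

```python
# Sketch of why diagonal linear RNNs parallelize: h_t = a_t * h_{t-1} + b_t is a prefix
# composition of affine maps, so a Hillis-Steele scan computes all h_t in O(log T) steps
# of elementwise work instead of a strictly sequential loop.
import numpy as np

def sequential(a, b):
    h = np.zeros_like(b[0])
    out = []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.stack(out)

def parallel_scan(a, b):
    a, b = a.copy(), b.copy()
    T = len(a)
    step = 1
    while step < T:                                   # O(log T) passes
        a_prev = np.concatenate([np.ones_like(a[:step]), a[:-step]])   # identity padding
        b_prev = np.concatenate([np.zeros_like(b[:step]), b[:-step]])
        # Compose (a_prev, b_prev) then (a, b): x -> a * (a_prev * x + b_prev) + b
        b = a * b_prev + b
        a = a * a_prev
        step *= 2
    return b                                          # b[t] == h_t with zero initial state

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(128, 16))
b = rng.standard_normal((128, 16))
print(np.allclose(sequential(a, b), parallel_scan(a, b)))   # True
```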
Zhixuan Lin retweeted
Johan Obando-Ceron @ ICLR’26 👍🏽
🚨 Very excited to share “Stable Deep Reinforcement Learning via Isotropic Gaussian Representations” 🎉 We show that isotropic Gaussian representations can stabilize training, prevent collapse, and reduce neuron dormancy in deep RL. A simple geometric idea with a big impact. Check out the amazing @psc thread below 👇🏽
Pablo Samuel Castro @pcastr

New paper 🚨 "Stable Deep Reinforcement Learning via Isotropic Gaussian Representations" Deep RL suffers from unstable training, representation collapse, and neuron dormancy. We show that a simple geometric insight, isotropic Gaussian representations, can fix this. Here's how 👇

2 replies · 10 reposts · 31 likes · 4.1K views
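I can't tell the paper's construction from this thread, so the following is only a generic illustration of the concept and explicitly not their method: one common way to push a batch of representations toward an isotropic Gaussian is to center them and penalize how far their empirical covariance is from a scaled identity.

```python
# Generic sketch only, NOT the paper's method: penalize the deviation of the batch's
# empirical feature covariance from a scaled identity, which is zero exactly when the
# (centered) representations are isotropic. The penalty form and weighting are arbitrary.
import numpy as np

def isotropy_penalty(feats):
    """feats: (batch, dim) representations from some encoder."""
    x = feats - feats.mean(axis=0, keepdims=True)             # center
    cov = (x.T @ x) / max(len(feats) - 1, 1)                   # empirical covariance
    scale = np.trace(cov) / feats.shape[1]                     # average per-dim variance
    iso_target = scale * np.eye(feats.shape[1])
    return np.mean((cov - iso_target) ** 2)

rng = np.random.default_rng(0)
collapsed = rng.standard_normal((256, 1)) * np.ones((1, 32))   # nearly collapsed (rank-1)
isotropic = rng.standard_normal((256, 32))
print(isotropy_penalty(collapsed) > isotropy_penalty(isotropic))   # True
```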
Zhixuan Lin retweeted
nathan chen @nathancgy4
This was the first plotted figure during our first attention residual run. I would've drawn it even if the loss was worse than the baseline. This by itself is something insightful and interpretable. Besides better performance, AttnRes gives a better "understanding" of the trained model :)
nathan chen tweet media
Joey (e/λ) @shxf0072

this is also soo sooo good for mech interp. now you can directly find which block is adding information to the res stream. we can identify concepts across time and depth. res_attn >> mhc in every way

13 replies · 18 reposts · 342 likes · 46.1K views
Zhixuan Lin retweeted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
Kimi.ai tweet media
334 replies · 2.1K reposts · 13.5K likes · 5M views
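A rough reading of the mechanism described above, as an illustration only (not Moonshot's code; Block AttnRes and all real projection and shape choices are omitted or made up): instead of the fixed residual sum x_{l+1} = x_l + F(x_l), each layer forms its input by attending, per token, over the outputs of all preceding layers, so depth-wise aggregation becomes learned and input-dependent.

```python
# Hedged sketch of the depth-wise idea only (an interpretation of the tweet, not the
# AttnRes implementation): each layer attends over the outputs of all preceding layers
# to decide what to pull back into the stream, instead of a fixed uniform residual sum.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual_stack(x, layer_fns, wq, wk):
    """x: (T, d) embeddings; layer_fns: list of (T, d) -> (T, d); wq, wk: (d, d_attn)."""
    history = [x]                                     # outputs of preceding layers
    for f in layer_fns:
        H = np.stack(history)                         # (L_prev, T, d)
        q = history[-1] @ wq                          # (T, d_attn), query from latest state
        k = H @ wk                                    # (L_prev, T, d_attn)
        # Per-token attention over depth: which earlier layers to read from.
        scores = np.einsum('td,ltd->tl', q, k) / np.sqrt(wq.shape[1])
        mix = np.einsum('tl,ltd->td', softmax(scores, axis=-1), H)
        history.append(mix + f(mix))                  # each layer keeps a local residual
    return history[-1]

rng = np.random.default_rng(0)
T, d = 6, 16
x = rng.standard_normal((T, d))
layers = [lambda h, W=rng.standard_normal((d, d)) / np.sqrt(d): np.tanh(h @ W) for _ in range(4)]
wq, wk = rng.standard_normal((2, d, d)) / np.sqrt(d)
print(attention_residual_stack(x, layers, wq, wk).shape)   # (6, 16)
```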
Zhixuan Lin retweeted
Aaron Defazio @aaron_defazio
@francoisfleuret When gradient norms drop early in training, you need warmup; when they don't… you don't. It's a simple, testable theory that holds in every case I know of. We essentially have a full theory of warmup and decay. arxiv.org/html/2310.0783…
Aaron Defazio tweet media
2 replies · 41 reposts · 312 likes · 37.7K views
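A toy reading of the diagnostic in the tweet above (not Defazio's code or data): train briefly at a constant learning rate, log the global gradient norm at every step, and check whether it drops sharply over the first steps, which is the signal the tweet associates with needing warmup. The model, the data, and the drop ratio below are arbitrary choices.

```python
# Toy sketch of the grad-norm diagnostic: log the global gradient norm over the first
# training steps and look for an early drop. Logistic regression stands in for the model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 20))
y = (X @ rng.standard_normal(20) > 0).astype(float)

w, lr, norms = np.zeros(20), 0.5, []
for step in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))         # logistic regression forward pass
    grad = X.T @ (p - y) / len(y)              # gradient of the mean log-loss
    norms.append(np.linalg.norm(grad))         # global gradient norm at this step
    w -= lr * grad

early_drop = norms[10] / norms[0]
print(f"grad-norm ratio (step 10 / step 0): {early_drop:.2f}")
print("heuristic suggests warmup" if early_drop < 0.5 else "heuristic suggests no warmup needed")
```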
Zhixuan Lin retweeted
Wonmin Byeon @wonmin_byeon
🚀 New paper: Mamba–Transformer hybrid VLMs can go fast without forgetting. We introduce stateful token reduction for long-video VLMs. ✅ Only 25% of visual tokens 🚀 3.8–4.2× faster prefilling (TTFT) 🎯 Near-baseline accuracy (can exceed baseline with light finetuning)
Wonmin Byeon tweet media
3 replies · 24 reposts · 218 likes · 14K views
Zhixuan Lin retweeted
Ai2 @allen_ai
Introducing Olmo Hybrid, a 7B fully open model combining transformer and linear RNN layers. It decisively outperforms Olmo 3 7B across evals, w/ new theory & scaling experiments explaining why. 🧵
Ai2 tweet media
17 replies · 128 reposts · 787 likes · 170.6K views
Zhixuan Lin retweeted
jianlin.su @Jianlin_S
Beyond MuP: 3. Special Cases, Special Treatment kexue.fm/archives/11647 Derived stability metrics and steepest descent directions for Embedding, LM Head, and RMS Norm layers — explaining why Embedding and LM Head don't play well with Muon.
2 replies · 22 reposts · 161 likes · 15.6K views