Zhixuan Lin
@zhxlin

178 posts

PhD student at @Mila_Quebec and @UMontreal. Working on (linear complexity) long-context sequence models and RL.

Joined April 2017
617 Following · 510 Followers
Pinned Tweet
Zhixuan Lin @zhxlin
#COLM2025 We introduce Adaptive Computation Pruning (ACP) for the Forgetting Transformer (FoX), a provably safe pruning method that significantly speeds up our Forgetting Attention kernel, especially for long-context pretraining. Our simple Triton kernel with ACP is 1.7x to 2.4x faster than the official FlashAttention2 kernel when pretraining 760M-param models with context lengths from 4k to 16k on 4xL40S!
• Code: github.com/zhixuan-lin/fo…
• Paper: arxiv.org/abs/2504.06949
Joint work with @johanobandoc, Xu Owen He, @AaronCourville, from @Mila_Quebec and @makermaker_ai
More details👇
Zhixuan Lin tweet media
5 replies · 52 reposts · 309 likes · 27.8K views
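A rough sketch of the idea in the pinned tweet above, under my own assumptions (this is not the official Triton kernel; the block size, the threshold, and the simplified skip rule are illustrative): Forgetting Attention adds a decay bias built from the log forget gates to every attention logit, and since that bias only becomes more negative for older keys, whole key blocks whose best-case bias is already negligible can be skipped.

```python
# Hedged sketch, not the official FoX/ACP kernel: a naive NumPy reference for Forgetting
# Attention with a blockwise, ACP-style skip. The decay bias D[i, j] = sum_{l=j+1..i} log f_l
# is added to each logit; it only becomes more negative for older keys, so key blocks whose
# best-case bias is below a threshold are skipped. The block size, the threshold, and the
# simplified skip rule (ignoring q.k magnitudes) are illustrative assumptions.
import numpy as np

def forgetting_attention_acp(q, k, v, log_f, block=4, threshold=-15.0):
    """q, k, v: (T, d); log_f: (T,) log forget gates, each <= 0."""
    T, d = q.shape
    c = np.concatenate([[0.0], np.cumsum(log_f)])    # c[i] = sum of log_f[:i]
    out = np.zeros_like(v)
    for qs in range(0, T, block):                     # query blocks
        qe = min(qs + block, T)
        logits = np.full((qe - qs, T), -np.inf)
        for ks in range(0, qe, block):                # causal key blocks
            ke = min(ks + block, T)
            best_bias = c[qs + 1] - c[ke]             # largest bias this block pair can see
            if ke <= qs and best_bias < threshold:
                continue                              # negligible contribution: skip the block
            s = q[qs:qe] @ k[ks:ke].T / np.sqrt(d)
            i = np.arange(qs, qe)[:, None]
            j = np.arange(ks, ke)[None, :]
            s = np.where(j <= i, s + (c[i + 1] - c[j + 1]), -np.inf)   # causal mask + decay
            logits[:, ks:ke] = s
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        out[qs:qe] = (w / w.sum(axis=1, keepdims=True)) @ v
    return out

# Toy usage: strong forgetting makes distant key blocks prunable.
rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
log_f = np.log(rng.uniform(0.3, 0.9, size=T))
print(forgetting_attention_acp(q, k, v, log_f).shape)   # (16, 8)
```

The real criterion presumably also accounts for score magnitudes so that the skip is provably safe; see the linked paper and repo for the actual bound and the Triton implementation.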
Zhixuan Lin retweeted
Johan Obando-Ceron @ ICLR’26 👍🏽
🔥 The AutoRL workshop is shaping up to be an exciting venue. If your work aligns, we strongly encourage you to submit. Great talks and an exciting panel will be announced soon. #RLC @RL_Conference
AutoRL Workshop @AutoRL_Workshop

🔥 AutoRL Workshop returns to RLC 2026 in Montréal 🇨🇦 Join us to tackle RL brittleness and advance methods that work “out of the box”. More info: sites.google.com/view/automated… This year's organisers are: Theresa Eimer, @DierkesJul67648, @johanobandoc, @pcastr, @HolgerHoos

0 replies · 6 reposts · 17 likes · 2.4K views
Zhixuan Lin retweeted
Kimi.ai @Kimi_Moonshot
We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…
45 replies · 185 reposts · 1.8K likes · 206.8K views
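For context on what a kernel like this computes, below is a naive per-token reference for a gated delta-rule linear-attention recurrence in the DeltaNet family. Whether Kimi Delta Attention has exactly this form (scalar vs. channel-wise gating, normalization, chunking) is an assumption on my part, and this sketch is unrelated to the FlashKDA/CUTLASS implementation; it only shows the constant-size-state recurrence that such prefill kernels accelerate.

```python
# Hedged sketch, not FlashKDA: a naive per-token gated delta-rule recurrence
# (DeltaNet-style). Whether KDA matches this exact form is an assumption here; the point
# is only the O(1)-size fast-weight state that chunked prefill kernels parallelize over.
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, dk); v: (T, dv); alpha, beta: (T,) gates in (0, 1]."""
    T, dk = q.shape
    dv = v.shape[1]
    S = np.zeros((dk, dv))                        # fast-weight state (keys -> values)
    out = np.zeros((T, dv))
    for t in range(T):
        kt, vt = k[t], v[t]
        S = alpha[t] * S                          # decay the state
        S = S - beta[t] * np.outer(kt, kt @ S - vt)   # rank-1 delta correction toward v_t
        out[t] = q[t] @ S                         # read out with the query
    return out

T, dk, dv = 8, 4, 4
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, T, dk))
alpha = rng.uniform(0.9, 1.0, T)
beta = rng.uniform(0.1, 0.5, T)
print(gated_delta_rule(q, k, v, alpha, beta).shape)   # (8, 4)
```

Prefill kernels process this kind of recurrence in chunks rather than token by token, which is where the reported speedups come from.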
Zhixuan Lin retweeted
Yihao Sun @Tobealegend24
Most VLA-RL frameworks inherit the complexity of LLM-RL infra, but we found that none of it is necessary. We therefore introduce VLARLKit: a simple yet fast VLA RL framework. Code link: github.com/VLARLKit/VLARL…
Yihao Sun tweet media
4 replies · 20 reposts · 115 likes · 8.9K views
Zhixuan Lin retweeted
Tianwei Ni @twni2016
🔥Thrilled to announce the Continual Reinforcement Learning (CRL) Workshop @RL_Conference 2026 in Montreal, Canada! 📷 We welcome submissions on broad topics of continual RL. Interested in submitting or reviewing? Check out our website for more details!
0 replies · 1 repost · 13 likes · 581 views
Zhixuan Lin retweeted
Ziyan "Ray" Luo @RayZiyan41307
🔥Thrilled to announce the Continual Reinforcement Learning (CRL) Workshop @RL_Conference 2026 in Montreal, Canada! 📣 We welcome submissions on broad topics of continual RL. Interested in submitting or reviewing? Check out our website for more details!
1 reply · 5 reposts · 34 likes · 4.3K views
Zhixuan Lin retweeted
Yu Zhang 🐙🌘 @yzhang_cs
flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…
Yu Zhang 🐙🌘 tweet media
7 replies · 26 reposts · 233 likes · 27.9K views
Zhixuan Lin retweeted
Lucas Maes @lucasmaes_
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
107 replies · 561 reposts · 3.9K likes · 930.6K views
Zhixuan Lin retweeted
William Merrill @lambdaviking
[1/8] New paper with Hongjian Jiang, @YanhongLi2062, Anthony Lin, @Ashish_S_AI: 📜Why Are Linear RNNs More Parallelizable? We identify expressivity differences between linear/nonlinear RNNs and, conversely, barriers to parallelizing nonlinear RNNs 🧵👇
William Merrill tweet media
4 replies · 27 reposts · 188 likes · 16.5K views
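A minimal sketch of the parallelizability point (my illustration, not the paper's construction): a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t composes affine maps, and composition is associative, so the whole sequence can be evaluated with a logarithmic-depth prefix scan. Nonlinear RNNs have no such associative composition, which is the kind of barrier the paper studies.

```python
# Sketch of why diagonal linear RNNs parallelize: h_t = a_t * h_{t-1} + b_t is a prefix
# composition of affine maps, so a Hillis-Steele scan computes all h_t in O(log T) steps
# of elementwise work instead of a strictly sequential loop.
import numpy as np

def sequential(a, b):
    h = np.zeros_like(b[0])
    out = []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.stack(out)

def parallel_scan(a, b):
    a, b = a.copy(), b.copy()
    T = len(a)
    step = 1
    while step < T:                                   # O(log T) passes
        a_prev = np.concatenate([np.ones_like(a[:step]), a[:-step]])   # identity padding
        b_prev = np.concatenate([np.zeros_like(b[:step]), b[:-step]])
        # Compose (a_prev, b_prev) then (a, b): x -> a * (a_prev * x + b_prev) + b
        b = a * b_prev + b
        a = a * a_prev
        step *= 2
    return b                                          # b[t] == h_t with zero initial state

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(128, 16))
b = rng.standard_normal((128, 16))
print(np.allclose(sequential(a, b), parallel_scan(a, b)))   # True
```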
Zhixuan Lin retweeted
Johan Obando-Ceron @ ICLR’26 👍🏽
🚨 Very excited to share “Stable Deep Reinforcement Learning via Isotropic Gaussian Representations” 🎉 We show that isotropic Gaussian representations can stabilize training, prevent collapse, and reduce neuron dormancy in deep RL. A simple geometric idea with a big impact. Check out the amazing @psc thread below 👇🏽
Pablo Samuel Castro @pcastr

New paper 🚨 "Stable Deep Reinforcement Learning via Isotropic Gaussian Representations" Deep RL suffers from unstable training, representation collapse, and neuron dormancy. We show that a simple geometric insight, isotropic Gaussian representations, can fix this. Here's how 👇

2 replies · 10 reposts · 31 likes · 4.1K views
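I can't tell the paper's construction from this thread, so the following is only a generic illustration of the concept and explicitly not their method: one common way to push a batch of representations toward an isotropic Gaussian is to center them and penalize how far their empirical covariance is from a scaled identity.

```python
# Generic sketch only, NOT the paper's method: penalize the deviation of the batch's
# empirical feature covariance from a scaled identity, which is zero exactly when the
# (centered) representations are isotropic. The penalty form and weighting are arbitrary.
import numpy as np

def isotropy_penalty(feats):
    """feats: (batch, dim) representations from some encoder."""
    x = feats - feats.mean(axis=0, keepdims=True)             # center
    cov = (x.T @ x) / max(len(feats) - 1, 1)                   # empirical covariance
    scale = np.trace(cov) / feats.shape[1]                     # average per-dim variance
    iso_target = scale * np.eye(feats.shape[1])
    return np.mean((cov - iso_target) ** 2)

rng = np.random.default_rng(0)
collapsed = rng.standard_normal((256, 1)) * np.ones((1, 32))   # nearly collapsed (rank-1)
isotropic = rng.standard_normal((256, 32))
print(isotropy_penalty(collapsed) > isotropy_penalty(isotropic))   # True
```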
Zhixuan Lin retweeted
nathan chen @nathancgy4
This was the first plotted figure during our first attention residual run. I would've drawn it even if the loss was worse than the baseline. This by itself is something insightful and interpretable. Besides better performance, AttnRes gives a better "understanding" of the trained model :)
nathan chen tweet media
Joey (e/λ) @shxf0072

this is also soo sooo good for mech interp. now you can directly find which block is adding information to the res stream. we can identify concepts across time and depth. res_attn >> mhc in every way

13 replies · 18 reposts · 342 likes · 46.1K views
Zhixuan Lin retweeted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
Kimi.ai tweet media
334 replies · 2.1K reposts · 13.5K likes · 5M views
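A rough reading of the mechanism described above, as an illustration only (not Moonshot's code; Block AttnRes and all real projection and shape choices are omitted or made up): instead of the fixed residual sum x_{l+1} = x_l + F(x_l), each layer forms its input by attending, per token, over the outputs of all preceding layers, so depth-wise aggregation becomes learned and input-dependent.

```python
# Hedged sketch of the depth-wise idea only (an interpretation of the tweet, not the
# AttnRes implementation): each layer attends over the outputs of all preceding layers
# to decide what to pull back into the stream, instead of a fixed uniform residual sum.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual_stack(x, layer_fns, wq, wk):
    """x: (T, d) embeddings; layer_fns: list of (T, d) -> (T, d); wq, wk: (d, d_attn)."""
    history = [x]                                     # outputs of preceding layers
    for f in layer_fns:
        H = np.stack(history)                         # (L_prev, T, d)
        q = history[-1] @ wq                          # (T, d_attn), query from latest state
        k = H @ wk                                    # (L_prev, T, d_attn)
        # Per-token attention over depth: which earlier layers to read from.
        scores = np.einsum('td,ltd->tl', q, k) / np.sqrt(wq.shape[1])
        mix = np.einsum('tl,ltd->td', softmax(scores, axis=-1), H)
        history.append(mix + f(mix))                  # each layer keeps a local residual
    return history[-1]

rng = np.random.default_rng(0)
T, d = 6, 16
x = rng.standard_normal((T, d))
layers = [lambda h, W=rng.standard_normal((d, d)) / np.sqrt(d): np.tanh(h @ W) for _ in range(4)]
wq, wk = rng.standard_normal((2, d, d)) / np.sqrt(d)
print(attention_residual_stack(x, layers, wq, wk).shape)   # (6, 16)
```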
Zhixuan Lin retweeted
Aaron Defazio @aaron_defazio
@francoisfleuret When gradient norms drop early in training, you need warmup; when they don't… you don't. It's a simple, testable theory that holds in every case I know of. We essentially have a full theory of warmup and decay. arxiv.org/html/2310.0783…
Aaron Defazio tweet media
2 replies · 41 reposts · 312 likes · 37.7K views
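A toy reading of the diagnostic in the tweet above (not Defazio's code or data): train briefly at a constant learning rate, log the global gradient norm at every step, and check whether it drops sharply over the first steps, which is the signal the tweet associates with needing warmup. The model, the data, and the drop ratio below are arbitrary choices.

```python
# Toy sketch of the grad-norm diagnostic: log the global gradient norm over the first
# training steps and look for an early drop. Logistic regression stands in for the model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 20))
y = (X @ rng.standard_normal(20) > 0).astype(float)

w, lr, norms = np.zeros(20), 0.5, []
for step in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))         # logistic regression forward pass
    grad = X.T @ (p - y) / len(y)              # gradient of the mean log-loss
    norms.append(np.linalg.norm(grad))         # global gradient norm at this step
    w -= lr * grad

early_drop = norms[10] / norms[0]
print(f"grad-norm ratio (step 10 / step 0): {early_drop:.2f}")
print("heuristic suggests warmup" if early_drop < 0.5 else "heuristic suggests no warmup needed")
```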
Zhixuan Lin retweeted
Wonmin Byeon @wonmin_byeon
🚀 New paper: Mamba–Transformer hybrid VLMs can go fast without forgetting. We introduce stateful token reduction for long-video VLMs. ✅ Only 25% of visual tokens 🚀 3.8–4.2× faster prefilling (TTFT) 🎯 Near-baseline accuracy (can exceed baseline with light finetuning)
Wonmin Byeon tweet media
3 replies · 24 reposts · 218 likes · 14K views
Zhixuan Lin retweeted
Ai2 @allen_ai
Introducing Olmo Hybrid, a 7B fully open model combining transformer and linear RNN layers. It decisively outperforms Olmo 3 7B across evals, w/ new theory & scaling experiments explaining why. 🧵
Ai2 tweet media
17 replies · 128 reposts · 787 likes · 170.6K views
Zhixuan Lin retweeted
jianlin.su @Jianlin_S
Beyond MuP: 3. Special Cases, Special Treatment kexue.fm/archives/11647 Derived stability metrics and steepest descent directions for Embedding, LM Head, and RMS Norm layers — explaining why Embedding and LM Head don't play well with Muon.
2 replies · 22 reposts · 161 likes · 15.6K views