
Xiaoliu.x
212 posts

@xiaolGo
ex-algo engineer → architect → founder → researcher → ? My reading list: https://t.co/raW6A7xfMS https://t.co/b0JOLNr4bE

Hello world! RWKV Gemma4 E2B
RWKV hxa07i + Tiny Infused Causal Attention
Prime RWKV L7 + Efficient RWKV L28
Headsize 512 -> 128 Compression

RWKV-7 G1e is here (13B/7B/3B/1B). Although Qwen 3.5 is strong, we are improving every month too 🙂 G1f is coming in April. (The G1d models are all released too.)

NEW paper from Apple. Interesting idea: "Attention to Mamba".

The paper introduces a two-stage recipe for cross-architecture distillation from Transformers into Mamba. Naive distillation fails to preserve the teacher's performance. Their trick: first distill the Transformer into a linearized-attention student using a kernel adaptation, then transfer that student into a pure Mamba model with no attention blocks.

On a 1B model trained on 10B tokens, the Mamba student reaches 14.11 perplexity against 13.86 for the Pythia-1B teacher, nearly matching its quality at linear-time inference cost.

If you can reliably convert trained Transformers into state-space models without retraining from scratch, the entire open-weights ecosystem becomes cheaper to serve at long context. This is the kind of quiet infrastructure work that decides which architectures actually get deployed in agent stacks.

Paper: arxiv.org/abs/2604.14191

Learn to build effective AI agents in our academy: academy.dair.ai
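Why a linearized-attention student is a plausible stepping stone toward a recurrent model can be seen in a small sketch. Causal linear attention can be computed either as a sum over all past tokens or as a running state updated once per token, and the two give the same output. This is a generic illustration, not the paper's method: the feature map `phi` (elu(x) + 1) and both function names are assumptions chosen for the example, and the post does not say which kernel the paper uses.

```python
import math
import random


def phi(x):
    # Positive feature map elu(x) + 1. This is an assumption for illustration,
    # not the kernel adaptation used in the paper.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(Q, K, V):
    # Attention-style form: for each position t, weight every past value
    # by phi(q_t) . phi(k_s) and normalize. Cost grows with t.
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        q = phi(Q[t])
        weights = [dot(q, phi(K[s])) for s in range(t + 1)]
        z = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / z
                    for j in range(d_v)])
    return out


def linear_attention_recurrent(Q, K, V):
    # Recurrent form: keep a fixed-size state S = sum phi(k) v^T and
    # z = sum phi(k), updated once per token. Cost per token is constant.
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = phi(q_raw), phi(k_raw)
        for i in range(d_k):
            z[i] += k[i]
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        denom = dot(q, z)
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    random.seed(0)
    T, d_k, d_v = 5, 4, 3
    Q = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
    K = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
    V = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(T)]
    a = linear_attention_quadratic(Q, K, V)
    b = linear_attention_recurrent(Q, K, V)
    gap = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print("max difference between the two forms:", gap)
```

The recurrent form carries a state whose size does not depend on sequence length, which is the property that makes inference linear-time.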
