

Sebastian Raschka

@rasbt
ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & reasoning (https://t.co/5TueQKx2Fk)



Meta observation: DeepSeek is still king of the active-parameter ratio
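For context, the active-parameter ratio is just the parameters activated per token divided by the total parameter count. Below is a quick back-of-the-envelope sketch; the figures are the commonly reported MoE configs and are included only for illustration, not as an exhaustive comparison.

```python
# Back-of-the-envelope: active-parameter ratio = params used per token / total params.
# Figures are the commonly reported MoE configs (illustrative, not exhaustive).
configs = {
    "DeepSeek-V3":     {"total_b": 671, "active_b": 37},
    "Qwen3-235B-A22B": {"total_b": 235, "active_b": 22},
}

for name, c in configs.items():
    ratio = c["active_b"] / c["total_b"]
    print(f"{name}: {c['active_b']}B active / {c['total_b']}B total = {ratio:.1%} per token")
# DeepSeek-V3 activates roughly 5.5% of its weights per token, the lowest ratio here.
```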








Came across Cola-DLM (hongcanguo.github.io/Cola-DLM/) from ByteDance. A hierarchical continuous latent diffusion LM that separates global semantic planning (DiT in latent space) from local token realization (VAE decoder). Paper is out, but no code and no HF model yet. So I reproduced it from scratch. Happy to share with anyone interested 👇
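To make the planner/decoder split concrete, here is a minimal PyTorch sketch of the general idea, assuming made-up module names, sizes, and a toy Transformer denoiser: a diffusion-style planner refines a sequence of chunk-level continuous latents, and a separate decoder realizes each latent chunk as tokens. This is neither the paper's architecture nor the reproduction mentioned above.

```python
import torch
import torch.nn as nn

# Hedged sketch of the two-level split: a diffusion model plans a sequence of
# continuous chunk latents (global semantic planning), and a VAE-style decoder
# maps each latent chunk to tokens (local realization). Everything here
# (names, sizes, the toy denoiser) is an illustrative placeholder.

class LatentPlanner(nn.Module):
    """Stand-in for the latent-space DiT: denoises chunk-level latents."""
    def __init__(self, latent_dim=64, hidden=256, num_layers=4, num_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden)
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(hidden, latent_dim)

    def forward(self, z_noisy, t):
        # z_noisy: (B, num_chunks, latent_dim), t: (B,) diffusion time in [0, 1]
        h = self.in_proj(z_noisy) + self.time_emb(t[:, None, None])
        return self.out_proj(self.blocks(h))  # predicted clean latents

class ChunkDecoder(nn.Module):
    """Stand-in for the VAE decoder: one latent chunk -> logits for a few tokens."""
    def __init__(self, latent_dim=64, vocab_size=8192, tokens_per_chunk=8, hidden=256):
        super().__init__()
        self.tokens_per_chunk, self.vocab_size = tokens_per_chunk, vocab_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, tokens_per_chunk * vocab_size),
        )

    def forward(self, z):
        # z: (B, num_chunks, latent_dim) -> (B, num_chunks * tokens_per_chunk, vocab_size)
        B, C, _ = z.shape
        return self.net(z).view(B, C * self.tokens_per_chunk, self.vocab_size)

# Toy forward pass: "plan" in latent space, then realize tokens locally.
planner, decoder = LatentPlanner(), ChunkDecoder()
z_noisy = torch.randn(2, 16, 64)               # 16 noisy chunk latents per sample
z_hat = planner(z_noisy, t=torch.rand(2))      # global planning step
token_logits = decoder(z_hat)                  # local realization step
print(token_logits.shape)                      # torch.Size([2, 128, 8192])
```

In the actual method the planner would run an iterative denoising loop rather than a single call, and the decoder would be trained as part of a VAE; the single pass above is only meant to show where each piece sits.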









Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment? That is the idea behind Lighthouse Attention.

The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically, preserving left-to-right causality. Crucially, it can be removed near the end of training in a short recovery phase, so the deployed model still runs vanilla attention with no architectural cost at inference. Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.

Why does it matter? Most efficient-attention work either changes the deployment-time architecture or pays a quality tax to do so. A training-only wrapper that survives a clean recovery phase sidesteps both. If it scales, this becomes an important training-time speedup for long-context pretraining.

Paper: arxiv.org/abs/2605.06554

Learn to build effective AI agents in our academy: academy.dair.ai
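To make the "removable wrapper" idea concrete, here is a minimal PyTorch sketch under stated assumptions: compression is plain block-mean pooling and decompression is nearest-neighbor expansion, standing in for the paper's hierarchical, gradient-free selection (no code has been released). The only point it illustrates is that the wrapper sits around ordinary SDPA and can be switched off, leaving vanilla attention.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the *idea* of a removable subquadratic wrapper around SDPA,
# not the actual Lighthouse Attention algorithm. The block-mean compression and
# nearest-neighbor decompression below are non-learned placeholders for the
# paper's hierarchical, gradient-free selection scheme.

def wrapped_sdpa(q, k, v, block=4, enabled=True):
    """q, k, v: (B, H, T, D). With enabled=False this is vanilla causal SDPA,
    i.e. what the deployed model runs once the wrapper has been removed."""
    if not enabled:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

    B, H, T, D = q.shape
    assert T % block == 0, "toy sketch assumes T is divisible by block"

    # Compress Q, K, V symmetrically: non-overlapping block means (T -> T // block).
    def compress(x):
        return x.view(B, H, T // block, block, D).mean(dim=3)

    qc, kc, vc = compress(q), compress(k), compress(v)

    # Attention over the shorter sequence; the causal mask here acts only at block
    # granularity (the real method preserves token-level causality exactly).
    out_c = F.scaled_dot_product_attention(qc, kc, vc, is_causal=True)

    # Decompress: broadcast each block's output back to its `block` positions.
    return out_c.repeat_interleave(block, dim=2)

# During long-context pretraining the wrapper is on; for the short recovery
# phase and at inference you flip enabled=False and run ordinary attention.
q = k = v = torch.randn(2, 8, 1024, 64)
fast = wrapped_sdpa(q, k, v, enabled=True)       # attention cost scales with (T / block)^2
vanilla = wrapped_sdpa(q, k, v, enabled=False)   # plain causal SDPA
print(fast.shape, vanilla.shape)                 # both torch.Size([2, 8, 1024, 64])
```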




ERNIE 5.1 just dropped. Built on ERNIE 5.0's pre-training foundation, our latest foundation model upgrades search, reasoning, knowledge Q&A, creative writing, and agentic capabilities, while using only around 6% of the pre-training cost of comparable models. More in the thread 🧵