Maximilian Beck
@maxmbeck
ELLIS PhD Student @ JKU Linz Institute for Machine Learning & PhD Researcher @nx_ai_com, Research Scientist Intern @Meta FAIR

xLSTM Distillation: arxiv.org/abs/2603.15590 Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.
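The post doesn't spell out the training objective, but a common recipe for this kind of teacher-student distillation is minimizing a temperature-scaled KL divergence between the Transformer teacher's and the linear student's next-token distributions. A minimal NumPy sketch (function names and the temperature are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def softmax(z, temp=1.0):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_logits, student_logits, temp=2.0):
    """KL(teacher || student) over the vocabulary, averaged over positions."""
    p = softmax(teacher_logits, temp)   # teacher's soft targets
    q = softmax(student_logits, temp)   # student's predictions
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# a student that matches the teacher exactly has zero divergence
logits = np.random.default_rng(0).normal(size=(2, 8))
assert distill_kl(logits, logits) < 1e-9
```

In practice this term is usually mixed with the ordinary next-token cross-entropy loss, and "near-lossless" results typically also involve initializing the student from the teacher's weights where shapes allow.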

🧵 Debugging Code World Models. A few months ago we started studying CWMs. The plan was post-training an LLM on code execution traces. Two weeks in, we realised a paper by Meta had already done much of this: arxiv.org/pdf/2510.02387. We did, however, identify what's wrong with them!

🧠🪲We introduce Neural Debuggers: 🧑🏭 LLMs that emulate traditional debuggers by predicting forward code execution (future states & outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions such as step over, step into, or breakpoints.
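One way to obtain such execution traces (a hypothetical sketch, not necessarily the paper's actual pipeline) is to record the local-variable state at every executed line with Python's `sys.settrace`, then serialize the (step, state) pairs as training text for forward or inverse prediction:

```python
import sys

def record_trace(fn, *args):
    """Record (relative line number, local variables) at each executed line of fn."""
    trace = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno - fn.__code__.co_firstlineno,
                          dict(frame.f_locals)))  # snapshot state before the line runs
        return tracer  # keep tracing line events in this frame
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return trace

def demo(x):
    y = x + 1
    z = y * 2
    return z

states = record_trace(demo, 3)
# each step exposes the intermediate locals, e.g. y == 4 before z exists
```

A "step over" action then corresponds to predicting the next (line, state) pair; inverse execution is the harder task of recovering an earlier state or the inputs from a later snapshot.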


Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
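The real kernel is CUDA on Blackwell tensor cores; the Python sketch below only illustrates two ingredients behind the exp2 remark: computing exp through the hardware-friendly identity exp(x) = 2^(x · log2 e), and the online-softmax rescaling that attention kernels overlap with matmuls so the exponential unit is never the serial bottleneck:

```python
import math

LOG2E = 1.4426950408889634  # log2(e)

def exp_via_exp2(x):
    # GPUs expose a fast exp2 instruction; exp(x) is evaluated as
    # 2^(x * log2 e), and the scale can be folded into earlier arithmetic
    return 2.0 ** (x * LOG2E)

def online_softmax(xs):
    """Single-pass softmax: track a running max m and rescale the running sum s."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * exp_via_exp2(m - m_new) + exp_via_exp2(x - m_new)
        m = m_new
    return [exp_via_exp2(x - m) / s for x in xs]
```

The rescaling step is what lets attention process keys block by block without ever materializing the full score row, which is the part FlashAttention pipelines against the tensor-core matmuls.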



> an example of this is that in hybrid models, sometimes "stronger" linear layers can lead to overall weaker models because it incentivizes the global attention to be "lazy"

Some people asked about this. I think it's a somewhat folklore result that I don't have a reference for, but here's another recent result that's similar: arxiv.org/abs/2509.24552. It's an example of a related phenomenon: in a SWA+xLSTM model, longer SWA windows led to worse long-context performance because they encouraged the xLSTM layers to be lazy.

We studied whether linear RNNs can learn state-tracking from code via next-token prediction. We converted permutation tracking (the shell game) into REPL traces and trained models on them. Key idea: We interleave variable swaps with print statements that reveal partial state.
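A minimal sketch of how such REPL traces could be generated (the variable names and trace format here are my assumptions, not necessarily the paper's): random swaps over a list of cups, interleaved with occasional prints that reveal one position of the hidden permutation.

```python
import random

def make_trace(n_cups=3, n_steps=6, p_print=0.4, seed=0):
    """Render one shell-game episode as a REPL-style training string:
    variable swaps interleaved with prints that reveal partial state."""
    rng = random.Random(seed)
    cups = list(range(n_cups))          # cups[i] = ball currently at slot i
    lines = [f">>> cups = {cups}"]
    for _ in range(n_steps):
        i, j = rng.sample(range(n_cups), 2)
        lines.append(f">>> cups[{i}], cups[{j}] = cups[{j}], cups[{i}]")
        cups[i], cups[j] = cups[j], cups[i]
        if rng.random() < p_print:      # partial-state reveal
            k = rng.randrange(n_cups)
            lines.append(f">>> print(cups[{k}])")
            lines.append(str(cups[k]))
    lines.append(">>> print(cups)")
    lines.append(str(cups))
    return "\n".join(lines)

print(make_trace())
```

Trained with next-token prediction on such strings, the model must maintain the full permutation internally to predict the printed values, which is exactly the state-tracking ability under study.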




