Maximilian Beck

290 posts

@maxmbeck

ELLIS PhD Student @ JKU Linz Institute for Machine Learning & PhD Researcher @nx_ai_com, Research Scientist Intern @Meta FAIR

Linz, Austria · Joined June 2021
854 Following · 1.1K Followers
Pinned Tweet
Maximilian Beck @maxmbeck
Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA), ⚡️ A new kernel algorithm for the mLSTM and other Linear Attention variants with Gating. We find TFLA is really fast! 🧵(1/11)
[image attached]
3 replies · 60 reposts · 347 likes · 47.1K views
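The thread above doesn't reproduce the TFLA algorithm itself, but the quantity any such kernel must compute is the gated linear-attention recurrence: a fixed-size state matrix decayed by a forget gate each step. A minimal sequential NumPy reference of my own (a sketch of the recurrence, not the tiled kernel, and without the mLSTM's stabilizer/normalizer states):

```python
import numpy as np

def gated_linear_attention(q, k, v, f):
    """Sequential reference for gated linear attention.

    q, k: (T, d_k), v: (T, d_v), f: (T,) scalar forget gates in [0, 1].
    The state S is a single d_v x d_k matrix updated once per step, so
    memory is constant in sequence length (unlike softmax attention).
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        # decay the old state, then add the new key-value outer product
        S = f[t] * S + np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out
```

TFLA's contribution is doing this chunkwise with two levels of tiling so it runs fast on GPUs; the loop above only defines what is being computed.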
Maximilian Beck reposted
Niklas Schmidinger @smdrnks
Excited to share our new paper: Effective Distillation to Hybrid xLSTM Architectures. TL;DR: we retrofit / graft / distill / linearize Transformers into xLSTM-SWA hybrids with fixed-size states. This gives a practical path to studying linear and hybrid architectures starting from already strong pretrained models.
Sepp Hochreiter@HochreiterSepp

xLSTM Distillation: arxiv.org/abs/2603.15590 Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.

1 reply · 5 reposts · 13 likes · 986 views
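The tweets don't spell out the distillation objective. As background, the standard soft-label knowledge-distillation loss that Transformer-to-xLSTM pipelines typically build on looks like the following (a generic NumPy sketch of my own, not the paper's exact recipe, which likely adds stages such as hidden-state alignment):

```python
import numpy as np

def softmax(x, t=1.0):
    """Temperature-softened softmax along the last axis."""
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by t^2 so gradient magnitudes stay comparable across
    temperatures, as in standard knowledge distillation.
    """
    t = temperature
    p = softmax(teacher_logits, t)              # teacher distribution
    log_q = np.log(softmax(student_logits, t))  # student log-probs
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean() * t * t)
```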
Babak Rahmani @babakRmni
@maxmbeck Thanks Maximilian. Looking forward to reading your upcoming work on CWMs :)
1 reply · 0 reposts · 1 like · 27 views
Maximilian Beck @maxmbeck
Very cool, in-depth prediction-error analysis of Code World Models (CWM) 🌍 ⬇️⬇️⬇️ However, instead of "debugging code world models", what about debugging WITH code world models? Stay tuned for more on this soon
Babak Rahmani@babakRmni

🧵Debugging Code World Models A few months ago we started studying CWMs. The plan was post-training an LLM on code execution traces. Two weeks in, we realised a paper by Meta had already done much of this: arxiv.org/pdf/2510.02387. However, we identified what's wrong with them!

1 reply · 0 reposts · 6 likes · 450 views
Maximilian Beck @maxmbeck
Looking forward, there are lots of exciting directions for future research: • Agentic program repair, reasoning & tool use with neural debuggers. • Expanding & improving data generation. • Improving inverse debugging. • Better Python object representations.
1 reply · 0 reposts · 4 likes · 214 views
Maximilian Beck @maxmbeck
🧠🪲We introduce Neural Debuggers: 🧑‍🏭 LLMs that emulate traditional debuggers by predicting forward code execution (future states & outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions such as step over, step into, or breakpoints.
[image attached]
1 reply · 15 reposts · 56 likes · 4.4K views
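The tweet doesn't describe how the execution traces behind such models are obtained. One standard way to harvest "step over"-style supervision in Python is `sys.settrace`, which reports a snapshot of local variables at every executed line. A minimal trace collector of my own (a sketch, not the paper's actual data pipeline):

```python
import sys

def collect_trace(fn, *args):
    """Record (relative line number, locals snapshot) for each executed
    line of fn, without descending into callees ("step over")."""
    trace = []
    code = fn.__code__

    def tracer(frame, event, arg):
        if event == "call" and frame.f_code is not code:
            return None  # don't trace inside other functions
        if event == "line" and frame.f_code is code:
            trace.append((frame.f_lineno - code.co_firstlineno,
                          dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)  # always restore the default tracer
    return result, trace
```

A forward-execution neural debugger would then be trained to predict the next snapshot from the source code plus the trace so far; inverse execution reverses that conditioning.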
Maximilian Beck reposted
Tri Dao @tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2-CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

30 replies · 230 reposts · 1.8K likes · 183.1K views
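For context on the "softmax rescaling" that FA4 mostly avoids: the classic online softmax keeps a running max and must rescale its normalizer and accumulator whenever the max grows. A scalar-per-step NumPy reference of my own (real kernels process tiles of scores, not single ones; this version rescales on every max update, which is exactly the cost FA4's redesign cuts by ~90%):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum, FlashAttention-style.

    m: running max, l: running normalizer, acc: weighted-value
    accumulator. Each step with a larger max rescales l and acc.
    """
    m = -np.inf
    l = 0.0
    acc = np.zeros_like(values[0], dtype=float)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # the rescaling step: shift old accumulators to the new max
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

The result is identical to materializing the full softmax, but only O(1) extra state is ever held, which is what lets attention stream through long sequences.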
Maximilian Beck reposted
Ronak Malde @rronak_
WTF, you can now train a 5-million-context-window 8B model on a single node of 8xH100s???? Most people don't realize that even on long-context pretrained frontier models, most RL post-training is only done on a small fraction of that context. Why? Long-context RL is notoriously memory hungry and requires a sharding strategy called Context Parallelism that takes up an inordinate number of GPUs. This paper from Together flew under the radar: it combines the best of Context Parallelism + Sequence Parallel-style head chunking to get memory-efficient long context. The gains are insane, cutting attention memory footprint by up to 87%. Authors: @m_ryabinin, @sereghik, Maksim Abraham, @ghadiaravi13
[image attached]
12 replies · 50 reposts · 470 likes · 37.6K views
Maximilian Beck reposted
Ai2 @allen_ai
📢 Update: the Molmo 2 codebase is now open source. We're releasing the code behind Molmo 2—our open model family for video & image understanding, pointing, tracking, & more. Now you can easily train Molmo 2 on your own data. 🧵
[image attached]
6 replies · 51 reposts · 364 likes · 30.8K views
Maximilian Beck @maxmbeck
Since xLSTM, or in this case the mLSTM which we used for SWAX, is very similar to Mamba2 (the mLSTM has no coupling between input & forget gates, while Mamba2 has one via dt), I suspect one would obtain similar results with Mamba2. Note: input & forget gate bias init also matters for the mLSTM!
Albert Gu@_albertgu

> an example of this is that in hybrid models, sometimes "stronger" linear layers can lead to overall weaker models because it incentivizes the global attention to be "lazy". Some people asked about this. I think this is a somewhat folklore result that I don't have a reference for, but here's another recent result that's similar: arxiv.org/abs/2509.24552. This is an example of a related phenomenon where in a SWA+xLSTM model, longer SWA windows led to worse long-context performance because it encouraged the xLSTM layers to be lazy.

0 replies · 1 repost · 10 likes · 1.1K views
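The gate-coupling difference discussed above can be made concrete with a scalar caricature (my own simplification: both models are really matrix-valued, and the mLSTM additionally tracks a stabilizer state omitted here):

```python
import numpy as np

def mlstm_update(c, i_pre, f_pre, kv):
    """mLSTM-style update: input and forget gates come from separate
    preactivations, so they can be set (and bias-initialized)
    independently of each other."""
    f = 1.0 / (1.0 + np.exp(-f_pre))  # sigmoid forget gate
    i = np.exp(i_pre)                 # exponential input gate, independent of f
    return f * c + i * kv

def mamba2_update(c, dt_pre, a, kv):
    """Mamba2-style update: a single step size dt couples decay and
    input through the discretization f = exp(dt * a), i = dt."""
    dt = np.log1p(np.exp(dt_pre))  # softplus step size
    f = np.exp(dt * a)             # a < 0 => larger dt means stronger decay
    i = dt                         # ...and simultaneously a larger input scale
    return f * c + i * kv
```

In the Mamba2 form, turning the input up necessarily turns retention down, which is the coupling via dt the tweet refers to; in the mLSTM form the two knobs move freely, so their bias initializations matter separately.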