Sebastian Lehner

127 posts

@sebaLeh

Machine learner at @jkulinz

Linz, Österreich · Joined January 2017
546 Following · 267 Followers
Sebastian Lehner retweeted
Jiajun He @JiajunHe614
🚨SPIGM is back at ICML 2026 — Call for Papers 🚨 SPIGM: Structured Probabilistic Inference & Generative Modeling — beyond scaling & benchmarks 📍Seoul 🇰🇷 🗓️Submit by April 24 (AoE) 👇Submission link below.
2 replies · 6 retweets · 51 likes · 9.8K views
Sebastian Lehner retweeted
Niklas Schmidinger @smdrnks
Excited to share our new paper: Effective Distillation to Hybrid xLSTM Architectures. TL;DR: we retrofit / graft / distill / linearize Transformers into xLSTM-SWA hybrids with fixed-size states. This gives a practical path to studying linear and hybrid architectures starting from already strong pretrained models.
Quoted tweet: Sepp Hochreiter @HochreiterSepp

xLSTM Distillation: arxiv.org/abs/2603.15590 Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.

1 reply · 6 retweets · 15 likes · 1.2K views
Sebastian Lehner retweeted
Sepp Hochreiter @HochreiterSepp
xLSTM Distillation: arxiv.org/abs/2603.15590 Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.
5 replies · 59 retweets · 311 likes · 22.6K views
Sebastian Lehner retweeted
Lorenz Richter @lorenz_richter
We extend stochastic interpolants to the setting where no data samples are available - only an unnormalized density. Our non-Markovian approach generalizes adjoint sampling and scales to targets in dimension 2500. Paper: arxiv.org/pdf/2603.00530 Talk: youtube.com/watch?v=mpBLax…
1 reply · 8 retweets · 56 likes · 4.8K views
Sebastian Lehner retweeted
Günter Klambauer @gklambauer
Symbol-equivariant Recurrent Reasoning Models (SE-RRM) SE-RRM advances HRM and TRM -- guaranteed identical solutions for problems with permuted colors (ARC AGI) or digits (Sudoku). Coolest part: extrapolation to larger problem sizes!!! P: arxiv.org/abs/2603.02193
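To make the equivariance guarantee concrete: a minimal, hypothetical check (not the SE-RRM code) that a model f is symbol-equivariant, i.e. relabelling the colors or digits of the input relabels the solution in exactly the same way, f(sigma(x)) = sigma(f(x)).

```python
import torch

# Hypothetical check (illustrative, not SE-RRM code) of symbol equivariance:
# for any permutation sigma of the symbol alphabet, f(sigma(x)) == sigma(f(x)).

def check_symbol_equivariance(model, grid: torch.Tensor, num_symbols: int = 10) -> bool:
    """grid: LongTensor of symbol ids (e.g. an ARC grid or a Sudoku board);
    model: maps a grid of symbol ids to a solved grid of symbol ids."""
    perm = torch.randperm(num_symbols)        # random relabelling sigma
    out = model(grid)                         # f(x)
    out_perm = model(perm[grid])              # f(sigma(x))
    return torch.equal(out_perm, perm[out])   # must hold exactly for SE models
```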
3 replies · 40 retweets · 214 likes · 13.6K views
Sebastian Lehner retweeted
Jiajun He @JiajunHe614
Let’s formally introduce FEAT (Free Energy Estimator with Adaptive Transport). [1/N] ✨ FEAT accelerates free-energy estimation via learned non-equilibrium dynamics, enabling low-variance estimators based on the escorted Jarzynski equality and the Crooks fluctuation theorem.
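For context, a minimal NumPy sketch of the plain (non-escorted) Jarzynski free-energy estimator that FEAT builds on; the function name and toy work samples are illustrative, not from the paper. The variance of this estimator is dominated by rare low-work trajectories, which is what a learned transport is meant to mitigate.

```python
import numpy as np

# Minimal sketch of the plain Jarzynski estimator (illustrative only; this is
# NOT the escorted/learned-transport estimator from FEAT).
# Jarzynski equality:  exp(-beta * dF) = E[ exp(-beta * W) ],
# where W is the work along a non-equilibrium trajectory from state A to B.

def jarzynski_free_energy(work: np.ndarray, beta: float = 1.0) -> float:
    """Estimate dF = F_B - F_A from per-trajectory work values W_i."""
    a = -beta * work
    m = a.max()                                        # log-sum-exp for stability
    log_mean_exp = m + np.log(np.mean(np.exp(a - m)))
    return -log_mean_exp / beta

# Toy usage with synthetic work values (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
work = rng.normal(loc=2.0, scale=1.0, size=100_000)
print(jarzynski_free_energy(work))                     # ~1.5 for this Gaussian toy
```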
4 replies · 9 retweets · 41 likes · 5.9K views
Sebastian Lehner retweeted
Ivan Skorokhodov @isskoro
The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact it can be seen as a scaled-up/generalized version of the good old GMMN from 2015 (the authors themselves acknowledge this in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (to higher-diversity / higher-resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:
1. Sample random noise z ~ N(0, I).
2. Feed it to the generator and get a fake sample x' = G(z).
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push toward the nearest ones the most).
5. To avoid any sort of mode collapse, repel each fake sample from the other fake samples via the same scheme.
6. Profit.

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging in normalization/scaling of the similarity scores or CFG.

Why didn't GMMN take off, and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (which happens when you switch to foundational T2I/T2V model training), or the batch size gets smaller (which happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (which happens for videos or more diverse datasets). You can surely get informative similarities with a batch size of 4096 on the object-centric, limited-diversity ImageNet with a ResNet-50 feature encoder, but for something like video generation we train on hundreds of millions of videos or, at high resolutions and larger model sizes, with a batch size of 1 per GPU (and I am not sure inter-GPU distance computations would be fast).

From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery used to formulate the framework is very different and enables direct access to the drifting field (e.g., to easily enable CFG, which the authors already did).

But I guess what I like the most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.
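A minimal PyTorch sketch of the attract/repel scheme in steps 1-6 above, as I read the description; this is not the paper's code, and `generator`, `encoder`, `latent_dim`, and `tau` are placeholder names and hyperparameters.

```python
import torch
import torch.nn.functional as F

# Sketch of one attract/repel update in the style described above
# (my reading of steps 1-6, not the paper's implementation).

def drifting_style_step(generator, encoder, real_batch, latent_dim=128, tau=0.1):
    z = torch.randn(real_batch.size(0), latent_dim)              # 1. sample noise
    fake = generator(z)                                          # 2. fake samples x' = G(z)

    f_fake = F.normalize(encoder(fake), dim=-1)                  # 3. features of fakes
    f_real = F.normalize(encoder(real_batch), dim=-1).detach()   #    and of real samples

    # 4. attract: similarity-weighted pull of each fake toward the real batch
    w_attract = F.softmax(f_fake @ f_real.T / tau, dim=-1).detach()
    attract = (w_attract * torch.cdist(f_fake, f_real).pow(2)).sum(-1).mean()

    # 5. repel: same scheme against the other fake samples (self excluded)
    sim_ff = f_fake @ f_fake.T
    eye = torch.eye(sim_ff.size(0), dtype=torch.bool)
    w_repel = F.softmax(sim_ff.masked_fill(eye, float("-inf")) / tau, dim=-1).detach()
    repel = (w_repel * torch.cdist(f_fake, f_fake).pow(2)).sum(-1).mean()

    loss = attract - repel            # pull toward nearby reals, push away from other fakes
    loss.backward()
    return loss.item()
```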
Quoted tweet: Alexia Jolicoeur-Martineau @jm_alexia

Byebye diffusion, say hello to Drifting models. Drifting models will take over diffusion models within the next year. I was told many times that we figured it all out, that there was nothing else to invent in generative AI and it was just about scaling. Wrong again and again.

8 replies · 46 retweets · 502 likes · 90K views
Sebastian Lehner retweeted
Maximilian Beck @maxmbeck
I am happy to announce that 2 papers with xLSTM are accepted at ICLR 2026! 📉xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity arxiv.org/abs/2510.02228 AND ⌛️Short window attention enables long-term memorization arxiv.org/abs/2509.24552
Quoted tweet: Maximilian Beck @maxmbeck

🚀 Excited to share our new paper on scaling laws for xLSTMs vs. Transformers. Key result: xLSTM models Pareto-dominate Transformers in cross-entropy loss. - At fixed FLOP budgets → xLSTMs perform better - At fixed validation loss → xLSTMs need fewer FLOPs 🧵 Details in thread

1 reply · 1 retweet · 7 likes · 349 views
Sebastian Lehner retweeted
Günter Klambauer @gklambauer
🏆 MolecularIQ is live — and open to the community 👉 Check how current LLMs perform on real molecular structure reasoning 👉 Submit your own chemistry LLM and get evaluated under a standardized protocol 🔗 Leaderboard & submissions: huggingface.co/spaces/ml-jku/…
0 replies · 3 retweets · 5 likes · 472 views
Sebastian Lehner retweeted
Sitan Chen @sitanch
Proponents of diffusion language models tout their ability to generate many tokens in parallel. Skeptics argue this is fundamentally broken as it ignores token dependencies. Who's right? 🤔🤔🤔 🚀 In a new work, we rigorously prove that the picture is a lot more nuanced... 1/
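A toy illustration (mine, not from the paper) of the skeptics' point about ignoring token dependencies: if the true distribution over two tokens puts all mass on "ab" and "ba", sampling each position independently from its marginals produces the impossible strings "aa" and "bb" about half the time.

```python
import numpy as np

# Toy example of factorized (per-position) sampling vs. a correlated joint.
# True joint: 50% "ab", 50% "ba". Marginals at each position: uniform over {a, b}.
rng = np.random.default_rng(0)
samples = ["".join(rng.choice(["a", "b"], size=2)) for _ in range(10_000)]

# The factorized sampler puts ~0.5 probability on strings ("aa", "bb") that
# have probability 0 under the true joint distribution.
frac_invalid = np.mean([s in ("aa", "bb") for s in samples])
print(frac_invalid)   # ~0.5
```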
3 replies · 24 retweets · 127 likes · 16.3K views
Sebastian Lehner retweeted
Sebastian @SebSanokowski
Ever experienced instabilities when using the popular LV (Log Variance) loss for training Diffusion Bridge Samplers?
1 reply · 6 retweets · 18 likes · 10.6K views
Sebastian Lehner retweeted
Luca Ambrogioni @LucaAmb
1/2) I am very happy to finally share something I have been working on, on and off, for the past year: "The Information Dynamics of Generative Diffusion". This paper connects entropy production, the divergence of vector fields, and spontaneous symmetry breaking in a unified framework.
15 replies · 121 retweets · 966 likes · 71K views