Isak Westerlund

5.2K posts

@westis96

Exploring Amortized Inference, Language and Speech.

🇪🇺 Joined March 2014

5K Following · 934 Followers

Isak Westerlund retweeted

Lucas Nestler@Clashluke·1d

HeavyBall 3.0.0 is finally out. Key features:
* FSDP
* DDP
* End-to-End Compilation (2.5x speedup)
* Higher-precision PSGDKron (grey, vs. HB2's blue)
* Faster Muon and SOAP
* PSGD-PRO (yellow)
* LATHER, a SOAP-like optimizer
* HyperBall
* explicit `consume_grad`
* simplified API

Isak Westerlund@westis96·3d

@ErikSchluntz @charles_irl @modal Hey, we have had the same weekend project 🤗 Currently around 1800 searchless Stockfish Elo trained on this dataset: huggingface.co/datasets/mateu…

Erik Schluntz@ErikSchluntz·4d

One of my weekend projects has been training Chess Transformers on @modal (h/t to @charles_irl for introducing it to me!) Claude already knows how to use it, you just tell it "run a hyper param sweep of X on modal" and it happens

Isak Westerlund retweeted

Xenova@xenovacom·3d

Get started with npm i @huggingface/transformers
Links:
📖 Blog post: huggingface.co/blog/transform…
📝 Release notes: github.com/huggingface/tr…
🛝 Demos: huggingface.co/collections/we…
🔗 Release video: youtu.be/KnhppkY4gHs?si…

Isak Westerlund@westis96·5d

@ml_4rtemi5 Interesting. I'll see if I can replicate improvements on other tasks.

Raphael Pisoni@ml_4rtemi5·5d

I'm open-sourcing all my code for scaled RBF-Attention. If you want to roast my Triton knowledge or check how far you have to scale things to make it break, feel free to have a look! 😅 github.com/4rtemi5/rbf_at…

Raphael Pisoni@ml_4rtemi5·5d

For some reason I decided to swap out standard dot-product attention for a scaled-rbf kernel. Pretty much expected it to fail to converge or be impossibly slow but the scaled-rbf-attention is getting unexpectedly good results?? 👇

Isak Westerlund retweeted

Adina Yakup@AdinaYakup·6d

Matrix-Game 3.0 🔥 real-time interactive world models from @Skywork_ai huggingface.co/Skywork/Matrix…
✨ MIT license
✨ 720p @ 40FPS with a 5B model
✨ Minute-long memory consistency
✨ Unreal + AAA + real-world data
✨ Scales up to 28B MoE

Isak Westerlund@westis96·6d

@ellen_in_sf literature review

ellen livia ᯅ 🇺🇸🇮🇩@ellen_in_sf·6d

Starting an AI Researcher group chat. The space is growing fast! Comment “literature review” to join.

Isak Westerlund@westis96·26 Mar

@levelsio What’s up ✌️

@levelsio@levelsio·26 Mar

Okay let's see who can reply to this

Isak Westerlund retweeted

Wildminder@wildmindai·25 Mar

Covo-Audio (7B) - full-duplex LALM from Tencent.
- Qwen2.5-7B + Whisper
- Listens and speaks simultaneously (barge-in support).
- No separate ASR or TTS pipelines.
- Decoupled intelligence/speaker for voice cloning.
- 8M hours of audio training.
huggingface.co/tencent/Covo-A…

Isak Westerlund@westis96·24 Mar

@RadianceFields @SplatK1ng SplatKing

Radiance Fields@RadianceFields·23 Mar

I'm giving away an NVIDIA RTX PRO 6000, but you only have three days left to enter. Also my capture app, @SplatK1ng, is now available in the EU! Thank you to NVIDIA for providing the GPU and hosting me at GTC.

Isak Westerlund retweeted

Conor Heins@conorheins·23 Mar

pymdp 1.0.0 is here: batched, autodifferentiable, JIT-compiled active inference in JAX: github.com/infer-actively…
This release brings:
- GPU/TPU-ready active inference
- autodiff through inference, planning and learning
- easy parallelization and batching with vmap()

Isak Westerlund retweeted

Burny - Effective Curiosity@burny_tech·23 Mar

I'm loving this. Delightful Policy Gradient is adding in some surprisal into RL. Karl Friston would approve. :D

Ian Osband@IanOsband

Something is rotten with policy gradient. PG has become *the* RL loss for LLMs. But it’s not even good at basic RL. Even on MNIST with bandit feedback, vanilla PG performs far worse than cross-entropy because it wastes gradient budget. Delightful Policy Gradient: arxiv.org/abs/2603.14608…

Isak Westerlund retweeted

Sophia Tang@_sophia_tang_·20 Mar

New tutorial paper on the “Foundations of Schrödinger Bridges for Generative Modeling” is out on arXiv! 🧩
📖 arXiv: arxiv.org/abs/2603.18992
🔮 Project Website: sophtang.github.io/foundations-of…
With 220 pages and 24 figures, this guide builds the theoretical foundations of Schrödinger bridges from the ground up, unifying the broad field of generative modeling with a single guiding principle: construct an optimal stochastic bridge between distributions while minimizing deviation from a reference process.
The rapid progress in generative modeling has made the field increasingly difficult to navigate from a foundational perspective, which motivated me to develop a resource that builds the core concepts needed to understand and contribute to new advances.
This guide contains intuitive explanations and step-by-step proofs covering:
🧩 The dynamic Schrödinger bridge formulation, lifting optimal transport to continuous-time stochastic processes between distributions, with direct connections to diffusion models, score-based methods, and flow matching.
🧩 A comprehensive toolkit for constructing Schrödinger bridges from first principles, describing stochastic optimal control, forward–backward SDEs, Doob’s h-transform, and Markov and reciprocal projections.
🧩 Extensions to complex and real-world problem settings, including the multi-marginal, unbalanced, discrete SB problems, highlighting the flexibility of the Schrödinger bridge framework in describing complex dynamical systems.
🧩 Practical, scalable algorithms for training and inference of dynamic Schrödinger bridges across modern generative modeling tasks.
More details in the thread 👇🏻

Isak Westerlund retweeted

Mayank Mishra@MayankMish98·19 Mar

Introducing M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
We bring back non-linear recurrence to language modeling and show it's been held back by small state sizes, not by non-linearity itself.
📄 Paper: arxiv.org/abs/2603.14360
💻 Code: github.com/open-lm-engine…
🤗 Models: huggingface.co/collections/op…

Isak Westerlund retweeted

Albert Gu@_albertgu·17 Mar

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

Isak Westerlund retweeted

Karsten Kreis@karsten_kreis·17 Mar

📢📢 Proteina-Complexa 📢📢
Atomistic Binder Design with Generative Pretraining and Test-Time Compute + Experimental Validation at Scale
⭐️ Project page (research.nvidia.com/labs/genair/pr…) for:
📜 Method paper (ICLR 2026 Oral)
🧬 Wet lab paper
🛠️ Code & models
📁 Data
🧵 Thread (1/n)

Isak Westerlund retweeted

Ion Stoica@istoica05·17 Mar

Excited to share our latest work rethinking k-means for modern GPU architectures. While the algorithm is classic, scaling it requires aggressively targeting memory bandwidth bottlenecks. Flash-KMeans introduces a strictly IO-aware design that achieves exact results with up to a 30x speedup over cuML and 200x over FAISS, completing million-scale iterations in milliseconds!

Haocheng Xi@HaochengXiUCB

K-means is simple. Making it fast on GPUs isn’t.
That’s why we built Flash-KMeans, an IO-aware implementation of exact k-means that rethinks the algorithm around modern GPU bottlenecks. By attacking the memory bottlenecks directly, Flash-KMeans achieves 30x speedup over cuML and 200x speedup over FAISS, with the same exact algorithm, just engineered for today’s hardware. At the million-scale, Flash-KMeans can complete a k-means iteration in milliseconds.
A classic algorithm, redesigned for modern GPUs.
Paper: arxiv.org/abs/2603.09229
Code: github.com/svg-project/fl…
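
For readers unfamiliar with what "a k-means iteration" covers, here is a minimal pure-Python sketch of one exact Lloyd step (assign each point to its nearest centroid, then recompute each centroid as the mean of its points). This is the textbook algorithm only, not the Flash-KMeans implementation; the function name `kmeans_step` and the empty-cluster handling are assumptions made for illustration.

```python
def kmeans_step(points, centroids):
    """One exact Lloyd iteration: assignment step, then update step.

    Illustrative only; this is plain k-means, not Flash-KMeans.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    labels = []
    for p in points:
        # Assignment: pick the centroid with the smallest squared distance.
        best = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        labels.append(best)
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    # Update: mean of assigned points. Assumption: an empty cluster
    # simply keeps its previous centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return labels, new_centroids


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels, centroids = kmeans_step(points, [[1.0, 1.0], [9.0, 1.0]])
print(labels, centroids)
```

Both steps read every point once per iteration, which is why memory traffic rather than arithmetic tends to dominate at large scale.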

Isak Westerlund retweeted

Mohammed AlQuraishi@MoAlQuraishi·13 Mar

New OpenFold3 preview out! (OF3p2) It closes the gap to AlphaFold3 for most modalities. Most critically, we're releasing everything, including training sets & configs, making OF3p2 the only current AF3-based model that is functionally trainable & reproducible from scratch🧵1/9

Isak Westerlund retweeted

Shuangfei Zhai@zhaisf·12 Mar

Say hi to Exclusive Self Attention (XSA), a (nearly) free improvement to Transformers for LM.
Observation: for y = attn(q, k, v), yᵢ and vᵢ tend to have a very high cosine similarity.
Fix: exclude vᵢ from yᵢ via zᵢ = yᵢ - (yᵢᵀvᵢ)vᵢ/‖vᵢ‖²
Result: better training/val loss across model sizes; increasing gains as sequence length grows.
See more: arxiv.org/abs/2603.09078
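
The fix quoted above, zᵢ = yᵢ - (yᵢᵀvᵢ)vᵢ/‖vᵢ‖², is a projection that removes the component of yᵢ along vᵢ. Below is a minimal pure-Python sketch of that single formula for one token. It is not the paper's implementation; the function name `exclude_self_value` and the zero-vector handling are assumptions for illustration.

```python
def exclude_self_value(y, v):
    """Return z = y - (y.v) v / ||v||^2, i.e. y with its v-component removed."""
    dot_yv = sum(a * b for a, b in zip(y, v))
    norm_sq = sum(b * b for b in v)
    if norm_sq == 0.0:
        # Assumption: with a zero value vector there is nothing to remove.
        return list(y)
    scale = dot_yv / norm_sq
    return [a - scale * b for a, b in zip(y, v)]


y_i = [3.0, 4.0]   # attention output for token i (toy numbers)
v_i = [1.0, 0.0]   # value vector of the same token
z_i = exclude_self_value(y_i, v_i)
print(z_i)  # [0.0, 4.0]
```

By construction zᵢ is orthogonal to vᵢ, so its cosine similarity with the token's own value vector is zero.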

Isak Westerlund@westis96·10 Mar

@OneTweetAwayMan @AuroraIntel It didn't explode. The "smoke" is just debris from the kinetic impact.

Tech Pro Dude@OneTweetAwayMan·9 Mar

@AuroraIntel A Tomahawk impact has a way bigger explosion... You don't even see any fire here...

Aurora Intel@AuroraIntel·9 Mar

It’s a Tomahawk, end of.

Matt Tardio@angertab

The evidence is clear: this is not a Tomahawk.
Iran alleged that an American Tomahawk cruise missile hit a school (buried in an IRGC compound) in southern Iran, killing 165 people. Analysis of a newly released video tells a different story.
ANALYSIS: AI analysis confirms the wings of the munition in question sit about 40%-45% down the body of the munition. On a Tomahawk, the wings sit roughly 49%-50% down the body of the munition. The wing-to-body ratio of the munition in question matches an Iranian Kh-55-derived Land Attack Cruise Missile.
Further, the video shows the munition in a steep dive angle for the final attack phase. This places the attack angle at approximately 70 degrees, which is the max attack angle for a Tomahawk. The attack angle does not match the Kh-55, whose angle maxes out at about 55 degrees. So what would have caused this?
CONCLUSION: The wing positioning alone makes it impossible for the munition to be a Tomahawk. The attack angle is at the max of the Tomahawk's capabilities. The typical attack angle for a Tomahawk is much lower than 70 degrees, between 20 and 45 degrees. This is due to the flight pattern of Tomahawks: they fly very low horizontally to the ground, often only 50-100 meters AGL, to avoid detection and interception. To achieve that attack angle, the missile would have had to gain altitude several kilometers away, which would leave it vulnerable to interception. This is highly unlikely on the first day of US attacks.
So what could have caused this? Simply put, GPS jamming of an Iranian Kh-55. The USA and Israel were, and continue to be, actively jamming Iranian airspace. If the Kh-55's signal was jammed, this could result in an uncontrollable dive. Think of GPS jamming more like disorienting the missile.
On 03/07 President Trump stated: “No, in my opinion, based on what I’ve seen, that was done by Iran.” Today, I concur with the President.

Isak Westerlund@westis96·9 Mar

@eduardwieandt It's not showing my port processes running through Docker, but really cool :)
