Ali Tayeb
31 posts

Ali Tayeb reposted

Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training!
If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-…
Comes with code + an assignment and test cases so you can follow along!


Ali Tayeb reposted

Open sourcing Telescope, a complete framework to post-train LLMs with RL for reasoning and agents.
Async training, 7 RL algorithms, FSDP & Megatron backends, multi-turn environments, tool calling, and more.
Telescope comes with a unique UI to visualize rollouts, infra, metrics, timelines, and much more.
Ali Tayeb reposted

New blog: Read Less, Steer More blog.ezyang.com/2026/03/read-l…
Ali Tayeb reposted

I used to play a rhythm game called "osu!" where you click circles to the beat of a song
got back to it recently and convinced 20+ friends to try it out
seeing them play made me reflect on how improvement works and what teaching really transfers
link: apmoverflow.xyz/on-fingerspitz…

Ali Tayeb reposted

The newest model in the Mamba series is finally here 🐍
Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models.
We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes.
This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!


wrote a Triton kernel for the Mamba-2 SSD layer that beats mamba-ssm by 1.56x in pure kernel time on H200
next up: seeing if the same approach finds gaps in Nemotron's inference stack (although Mamba-3, which uses a completely different recurrence, could be released before then)
Blog: tperm.xyz/mamba-2-triton/
Ali Tayeb reposted

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth.
Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
Ali Tayeb reposted

GUI update: The only GPU-powered chat client that's multi-OS, multi-provider and high performance (~0.1% cpu ~0.2 mem)
zero javascript
Switch from local to external in the same chat, long running tasks etc
There are many things that will make it more than just a "chat" app.
wavefnx @wavefnx
Regarding the Research task app, I was considering that it will eventually need a chat system. Let's flood-test virtualization with thousands of dynamically sized messages, the second [flood] sends them with 50ms delay so we can actually see them. Pure Rust/GPU, zero chromium

Always been frustrated that no home for F1 data existed, so I made f1muse.com!
There are a lot of queries to test out, whether it's pace, race, or qualifying related. Let me know your thoughts!