Ali Tayeb

31 posts

@amtayb

cs @ cmu

pit · Joined October 2025

306 Following · 21 Followers

Ali Tayeb reposted

clem 🤗@ClementDelangue·14 Apr

Introducing Kernels on the Hugging Face Hub ✨
What if shipping a GPU kernel was as easy as pushing a model?
- Pre-compiled for your exact GPU, PyTorch & OS
- Multiple kernel versions coexist in one process
- torch.compile compatible
- 1.7x–2.5x speedups over PyTorch baselines

223

1.7K

204.1K

Ali Tayeb reposted

Ramp Labs@RampLabs·10 Apr

x.com/i/article/2042…

139

1.4K

353.8K

Ali Tayeb reposted

alex zhang@a1zhang·10 Apr

x.com/i/article/2041…

134

1.1K

293.5K

Ali Tayeb@amtayb·8 Apr

GitHub: github.com/alityb/hotpath
Blog: tperm.xyz/hotpath/

Ali Tayeb@amtayb·8 Apr

Built a tool that profiles your vLLM/SGLang server and shows you where every millisecond goes! hotpath launches your model, traces kernels, parses server logs, replays traffic then breaks down queue wait, prefill, decode, and cache hits in one report.

Ali Tayeb reposted

Islam تايب@islamTyb·27 Mar

I take most of my notes in obsidian, but sharing them is a pain in the ass so I built a plugin that publishes any file or directory to github in 1 command

273

Ali Tayeb reposted

chuyi shang@chuyishang·24 Mar

Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!

602

32.4K

Ali Tayeb reposted

Eduardo Slonski@EduardoSlonski·10 Mar

Open sourcing Telescope, a complete framework to post-train LLMs with RL for reasoning and agents. Async training, 7 RL algorithms, FSDP & Megatron backends, multi-turn environments, tool calling, and more. Telescope comes with a unique UI to visualize rollouts, infra, metrics, timelines, and much more.

1.4K

Ali Tayeb reposted

MeekMill@MeekMill·23 Mar

Claude is helping me organize my whole music career and other businesses in days ... and it's moving my business forward at a high rate! Some tech youngbull I met on LinkedIn gave me an incredible template! Who else can help me with Claude?

946

725

12.4K

3.7M

Ali Tayeb@amtayb·22 Mar

GitHub: github.com/alityb/kerndiff
Blog: tperm.xyz/unbound/

Ali Tayeb@amtayb·22 Mar

Profiling two GPU kernels and comparing them has always been annoying. Run NCU on v1, run NCU on v2, try to hold both in your head. Built KernDiff to fix this: structured hardware metric diff, CUDA + Triton support, git mode.

344

Ali Tayeb reposted

Edward Z. Yang@ezyang·18 Mar

New blog: Read Less, Steer More blog.ezyang.com/2026/03/read-l…

204

12.6K

Ali Tayeb reposted

Islam تايب@islamTyb·18 Mar

I used to play a rhythm game called "osu!" where you click circles to the beat of a song
got back to it recently and convinced 20+ friends to try it out
seeing them play made me reflect on how improvement works and what teaching really transfers
link: apmoverflow.xyz/on-fingerspitz…

478

Ali Tayeb reposted

Albert Gu@_albertgu·17 Mar

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

312

1.6K

435.9K

Ali Tayeb@amtayb·17 Mar

wrote a Triton kernel for the Mamba-2 SSD layer that beats mamba-ssm by 1.56x in pure kernel time on H200
next up: seeing if the same approach finds gaps in Nemotron's inference stack (although Mamba-3, which uses a completely different recurrence, could be released before then)
Blog: tperm.xyz/mamba-2-triton/

Ali Tayeb reposted

Tri Dao@tri_dao·5 Mar

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now crazy fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.

Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/

229

1.8K

187.1K

Ali Tayeb reposted

baby keem@babykeem·26 Feb

how do u fix openclaw internal reasoning leaking

654

1.7K

18.7K

3.6M

Ali Tayeb reposted

Anthropic@AnthropicAI·23 Feb

We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.

7.2K

6.3K

54.7K

33.7M

Ali Tayeb reposted

wavefnx@wavefnx·18 Feb

GUI update: The only GPU-powered chat client that's multi-OS, multi-provider and high performance (~0.1% cpu ~0.2 mem) zero javascript Switch from local to external in the same chat, long running tasks etc There are many things that will make it more than just a "chat" app.

wavefnx@wavefnx

Regarding the Research task app, I was considering that it will eventually need a chat system. Let's flood-test virtualization with thousands of dynamically sized messages, the second [flood] sends them with 50ms delay so we can actually see them. Pure Rust/GPU, zero chromium

224

52.1K

Ali Tayeb@amtayb·18 Feb

GitHub: github.com/alityb/f1muse
Blog: tperm.bearblog.dev/latency-to-ins…

Ali Tayeb@amtayb·18 Feb

Always been frustrated that no home for F1 data existed, so I made f1muse.com! There are a lot of queries to test out, whether it's pace, race, or qualifying related. Let me know your thoughts! f1muse.com

390
