Ali Tayeb

31 posts

@amtayb

cs @ cmu

pit · Joined October 2025
306 Following · 21 Followers
Ali Tayeb retweeted
clem 🤗 @ClementDelangue
Introducing Kernels on the Hugging Face Hub ✨ What if shipping a GPU kernel was as easy as pushing a model?
- Pre-compiled for your exact GPU, PyTorch & OS
- Multiple kernel versions coexist in one process
- torch.compile compatible
- 1.7x–2.5x speedups over PyTorch baselines
71 replies · 223 retweets · 1.7K likes · 204.1K views
Ali Tayeb @amtayb
Built a tool that profiles your vLLM/SGLang server and shows you where every millisecond goes! hotpath launches your model, traces kernels, parses server logs, and replays traffic, then breaks down queue wait, prefill, decode, and cache hits in one report.
1 reply · 1 retweet · 3 likes · 90 views
Ali Tayeb retweeted
Islam تايب @islamTyb
I take most of my notes in obsidian, but sharing them is a pain in the ass so I built a plugin that publishes any file or directory to github in 1 command
1 reply · 3 retweets · 6 likes · 273 views
Ali Tayeb retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
9 replies · 66 retweets · 602 likes · 32.4K views
Ali Tayeb retweeted
Eduardo Slonski @EduardoSlonski
Open sourcing Telescope, a complete framework to post-train LLMs with RL for reasoning and agents. Async training, 7 RL algorithms, FSDP & Megatron backends, multi-turn environments, tool calling, and more. Telescope comes with a unique UI to visualize rollouts, infra, metrics, timelines, and much more.
1 reply · 4 retweets · 11 likes · 1.4K views
Ali Tayeb retweeted
MeekMill @MeekMill
Claude is helping me organize my whole music career and other businesses in days ... and it's moving my business forward at a high rate! Some tech youngbull I met on LinkedIn gave me an incredible template! Who else can help me with Claude?
946 replies · 725 retweets · 12.4K likes · 3.7M views
Ali Tayeb @amtayb
Profiling two GPU kernels and comparing them has always been annoying. Run NCU on v1, run NCU on v2, try to hold both in your head. Built KernDiff to fix this: structured hardware metric diff, CUDA + Triton support, git mode.
2 replies · 1 retweet · 5 likes · 344 views
Ali Tayeb retweeted
Islam تايب @islamTyb
I used to play a rhythm game called "osu!" where you click circles to the beat of a song. Got back to it recently and convinced 20+ friends to try it out. Seeing them play made me reflect on how improvement works and what teaching really transfers. Link: apmoverflow.xyz/on-fingerspitz…
2 replies · 4 retweets · 7 likes · 478 views
Ali Tayeb retweeted
Albert Gu @_albertgu
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
38 replies · 312 retweets · 1.6K likes · 435.9K views
Ali Tayeb @amtayb
Wrote a Triton kernel for the Mamba-2 SSD layer that beats mamba-ssm by 1.56x in pure kernel time on H200. Next up: seeing if the same approach finds gaps in Nemotron's inference stack (although Mamba-3, which uses a completely different recurrence, could be released before then). Blog: tperm.xyz/mamba-2-triton/
0 replies · 0 retweets · 2 likes · 89 views
Ali Tayeb retweeted
Tri Dao @tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

31 replies · 229 retweets · 1.8K likes · 187.1K views
Ali Tayeb retweeted
baby keem @babykeem
how do u fix openclaw internal reasoning leaking
654 replies · 1.7K retweets · 18.7K likes · 3.6M views
Ali Tayeb retweeted
Anthropic @AnthropicAI
We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.
7.2K replies · 6.3K retweets · 54.7K likes · 33.7M views
Ali Tayeb retweeted
wavefnx @wavefnx
GUI update: The only GPU-powered chat client that's multi-OS, multi-provider, and high performance (~0.1% cpu, ~0.2 mem), with zero JavaScript. Switch from local to external in the same chat, long-running tasks, etc. There are many things that will make it more than just a "chat" app.
wavefnx @wavefnx

Regarding the Research task app, I was considering that it will eventually need a chat system. Let's flood-test virtualization with thousands of dynamically sized messages; the second [flood] sends them with a 50ms delay so we can actually see them. Pure Rust/GPU, zero Chromium.

13 replies · 8 retweets · 224 likes · 52.1K views
Ali Tayeb @amtayb
Always been frustrated that no home for F1 data existed, so I made f1muse.com! There are a lot of queries to test out, whether it's pace, race, or qualifying related. Let me know your thoughts! f1muse.com
1 reply · 1 retweet · 4 likes · 390 views