Hamza Elshafie

83 posts

@hamzaelshafie

ML infra & GPU performance

github.com/HamzaElshafie · Joined January 2022
239 Following · 5.3K Followers
Pinned Tweet
Hamza Elshafie @hamzaelshafie ·
I’m rebuilding GEMM optimisation on H100 from scratch and documenting the full path toward cuBLAS-level performance in a deep-dive blog. 8 kernels in so far, from naive all the way to Tensor Cores (async TMA + WGMMA). Just hit 407.7 TFLOP/s, now at 56.9% of cuBLAS with Tensor Cores. Each kernel is broken down in detail: profiling bottlenecks, PTX and SASS inspection, memory behaviour, and 35+ visuals so far explaining what is happening at the hardware level for each kernel. I also included a full H100 architecture section to ground the optimisation decisions. Still more to push.
Blog: hamzaelshafie.bearblog.dev/worklog-optimi…
Repo: github.com/HamzaElshafie/…
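A quick way to reproduce the cuBLAS reference point those percentages are measured against (a minimal sketch, assuming PyTorch on a CUDA-capable H100; the problem size and iteration counts are my illustrative choices, not the blog's):

```python
# Time a BF16 GEMM via torch.matmul, which dispatches to cuBLAS on
# NVIDIA GPUs, to get the denominator behind a "% of cuBLAS" figure.
import torch

M = N = K = 8192
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

def bench(iters=50):
    # CUDA events give device-side timing, avoiding host-sync noise.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):                      # warm-up
        torch.matmul(a, b)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1e3 / iters   # seconds per GEMM

secs = bench()
tflops = 2 * M * N * K / secs / 1e12        # multiply + add = 2 FLOPs
print(f"cuBLAS reference: {tflops:.1f} TFLOP/s")
# 407.7 TFLOP/s against a ~716 TFLOP/s cuBLAS run gives
# 407.7 / 716.5 ≈ 56.9%, consistent with the numbers in the post.
```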
Hamza Elshafie retweeted
Patrick OShaughnessy @patrick_oshag ·
Every conversation I have with @dylan522p, I'm really just trying to understand the supply and demand of tokens. This is a unique episode in that it's entirely dedicated to talking about both sides of that equation. We discuss:
- The infinite demand for the newest models
- @SemiAnalysis_ going from $10K on AI spend to $7M
- Mythos and Anthropic's compute problem
- Why TSMC spending $100B on CapEx could cause a shortage
- Robotics as the next demand wave
- Why memory prices will double again
This is my second conversation with Dylan, and I find myself needing to speak with him more and more often to make sense of it all. Enjoy!
Timestamps:
0:00 Intro
1:00 Surging AI Spend
10:27 Token Demand
16:21 When Ideas Are Cheap and Execution Is Easy
20:46 Model Hoarding
22:34 Robotics
27:03 The Compute Bottleneck
30:26 The AI Permanent Underclass
31:39 Supply Chain Reality
37:47 CPUs
42:54 Predictions: Public Backlash
Hamza Elshafie @hamzaelshafie ·
Another visual walkthrough of @vllm_project's continuous batching with a full dummy example flow: prefill batch, slot mapping, paged KV block allocations, sampled tokens appended to request state, then a mixed decode + prefill step. The key point: sampled tokens are appended to each request's state, but the engine rebuilds a fresh flat batch every step from only the tokens that still need compute.
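A toy sketch of that step loop in code; the Request class and sample_token stub are illustrative stand-ins, not vLLM's actual internals:

```python
# Toy continuous-batching step: each engine step rebuilds a flat batch
# from only the tokens that still need compute, so fresh prefills and
# ongoing decodes share one forward pass.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]
    generated: list[int] = field(default_factory=list)
    prefilled: bool = False

def sample_token(rid: int) -> int:
    return 0                                 # stand-in for real sampling

def engine_step(requests: list[Request]) -> None:
    flat_batch: list[tuple[int, int]] = []   # (request_id, token)
    for rid, req in enumerate(requests):
        if not req.prefilled:
            flat_batch += [(rid, t) for t in req.prompt]   # prefill: whole prompt
        else:
            flat_batch.append((rid, req.generated[-1]))    # decode: one token
    # ... model forward over flat_batch; KV for each token is written into
    # a paged block chosen by the slot mapping (request, position) -> slot ...
    for rid, req in enumerate(requests):
        req.prefilled = True
        req.generated.append(sample_token(rid))   # appended to request state

reqs = [Request(prompt=[1, 2, 3]), Request(prompt=[4, 5])]
engine_step(reqs)    # step 1: two prefills
engine_step(reqs)    # step 2: two decodes, batch rebuilt flat from scratch
```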
Hamza Elshafie @hamzaelshafie ·
Visual walkthrough of prefix caching in vLLM on a multi-turn chat example for lower TTFT.
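A minimal sketch of the underlying idea, assuming hash-keyed KV blocks; the block size and cache structure are illustrative, not vLLM's implementation:

```python
# Toy prefix caching: KV blocks are keyed by a hash of the full token
# prefix up to the end of the block, so turn 2 of a chat reuses turn 1's
# blocks and only the new suffix pays prefill cost -- the TTFT saving.
BLOCK = 16                              # tokens per KV block (assumed)
cache: dict[int, str] = {}              # prefix hash -> KV block handle

def prefill(tokens: list[int]) -> int:
    """Return how many tokens actually needed prefill compute."""
    computed = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = hash(tuple(tokens[: i + BLOCK]))    # hash covers whole prefix
        if key not in cache:
            cache[key] = f"kv[{i}:{i + BLOCK}]"
            computed += BLOCK
    return computed

history = list(range(64))
print(prefill(history))                        # 64: cold cache
print(prefill(history + list(range(64, 96))))  # 32: shared prefix hits
```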
Hamza Elshafie retweeted
Chayenne Zhao @GenAI_is_real ·
every few months someone declares LLMs a dead end and every few months LLMs get dramatically better at things they supposedly can't do. but the LeWM result is worth paying attention to for a different reason: 15M parameters, single GPU, 48x faster planning. that efficiency-per-parameter ratio is what matters for real deployment. the future of AI inference isn't one architecture ruling everything, it's routing different tasks to different architectures optimized for each. physics reasoning on a tiny world model, language on an LLM, vision on a specialized encoder. the serving layer that routes between them is where the real value sits @HowToAI_
How To AI @HowToAI_

Yann LeCun was right the entire time. And generative AI might be a dead end.

For the last three years, the entire industry has been obsessed with building bigger LLMs. Trillions of parameters. Billions in compute. The theory was simple: if you make the model big enough, it will eventually understand how the world works.

Yann LeCun said that was stupid. He argued that generative AI is fundamentally inefficient. When an AI predicts the next word, or generates the next pixel, it wastes massive amounts of compute on surface-level details. It memorizes patterns instead of learning the actual physics of reality.

He proposed a different path: JEPA (Joint-Embedding Predictive Architecture). Instead of forcing the AI to paint the world pixel by pixel, JEPA forces it to predict abstract concepts. It predicts what happens next in a compressed "thought space."

But for years, JEPA had a fatal flaw. It suffered from "representation collapse." Because the AI was allowed to simplify reality, it would cheat. It would simplify everything so much that a dog, a car, and a human all looked identical. It learned nothing. To fix it, engineers had to use insanely complex hacks, frozen encoders, and massive compute overheads.

Until today. Researchers just dropped a paper called "LeWorldModel" (LeWM). They completely solved the collapse problem. They replaced the complex engineering hacks with a single, elegant mathematical regularizer. It forces the AI's internal "thoughts" into a perfect Gaussian distribution. The AI can no longer cheat. It is forced to understand the physical structure of reality to make its predictions.

The results completely rewrite the economics of AI. LeWM didn't need a massive, centralized supercomputer. It has just 15 million parameters. It trains on a single, standard GPU in a few hours. Yet it plans 48x faster than massive foundation world models. It intrinsically understands physics. It instantly detects impossible events.

We spent billions trying to force massive server farms to memorize the internet. Now, a tiny model running locally on a single graphics card is actually learning how the real world works.
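The thread above doesn't spell out the actual loss, so here is a generic sketch of the idea it gestures at: penalising a batch of latents for deviating from standard-Gaussian statistics, which makes collapse (everything mapped to one point) expensive. This is a common anti-collapse pattern, not LeWM's published regularizer:

```python
# Penalty is zero only when a batch of latents has zero mean and
# identity covariance; a collapsed encoder that maps all inputs to one
# point has near-zero covariance and pays a large penalty.
import torch

def gaussian_regularizer(z: torch.Tensor) -> torch.Tensor:
    """z: (batch, dim) latents; zero only for N(0, I) batch statistics."""
    mean = z.mean(dim=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device, dtype=z.dtype)
    return mean.pow(2).sum() + (cov - eye).pow(2).sum()

z = torch.randn(256, 32)
print(gaussian_regularizer(z))                      # smaller: roughly Gaussian
print(gaussian_regularizer(torch.zeros(256, 32)))   # larger: collapse penalised
```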

Hamza Elshafie retweeted
Reiner Pope @reinerpope ·
Intelligence per picojoule, with @itsclivetime and @dylan522p
(0:00) Intro
(1:22) What is codesign?
(2:49) Codesign example: Swish vs ReLU
(4:22) Are DeepSeek papers codesign?
(6:45) Predicting where ML research will go
(8:06) Should researchers hate your chips?
(9:34) Can you codesign too much?
(13:23) Picking the right grain size for specialization
(16:22) How much hardware flexibility for The Age of Research?
(20:05) Did reasoning and RL disrupt hardware roadmaps?
(23:09) Cerebras/Groq: unexpected wins on reasoning and RL
(25:34) Disaggregating MLP and attention
(29:06) The right metrics for quantization and codesign papers
Hamza Elshafie retweeted
Kimi.ai @Kimi_Moonshot ·
We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical.
Validated on a 20x scaled-up Kimi Linear model:
✅ 1.54× throughput
✅ 64% ↓ P90 TTFT
→ Directly translating into lower token cost.
More in Prefill-as-a-Service: arxiv.org/html/2604.1503…
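A rough back-of-envelope on why KV size is the blocker; every number below (model shape, hybrid ratio, link bandwidth) is an illustrative assumption, not Kimi's published configuration:

```python
# Cross-DC PD disaggregation ships the prompt's KV cache from the
# prefill cluster to the decode cluster, so its size sets the transfer.
ctx = 32_000                             # tokens handed from prefill to decode
layers, kv_heads, head_dim, dtype_bytes = 60, 8, 128, 2    # BF16
full_kv = ctx * layers * 2 * kv_heads * head_dim * dtype_bytes   # K and V
print(f"full-attention KV: {full_kv / 1e9:.1f} GB")              # ~7.9 GB

# Suppose a hybrid model (linear attention in most layers) keeps full KV
# in only 1 of every 4 layers, with a small fixed state elsewhere:
hybrid_kv = full_kv / 4
link = 10e9 / 8                          # assumed 10 Gb/s effective cross-DC
print(f"transfer: {full_kv / link:.1f} s full vs {hybrid_kv / link:.1f} s hybrid")
```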
Hamza Elshafie @hamzaelshafie ·
Decode throughput depends heavily on large effective batch sizes to amortise repeated weight loading, but pipeline parallelism reduces the batch each stage sees. So you get hit from both sides: lower memory bandwidth on each GPU and much worse inter-GPU scaling. That is why inference can collapse far more than expected when moving a medium-ish workload from Hopper to L40.
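A minimal sketch of that effect with made-up numbers:

```python
# Decode is bandwidth-bound: FLOPs scale with how many sequences share
# each weight read. Pipeline parallelism splits the global batch into
# microbatches so stages can overlap, so each stage amortises its weight
# read over a microbatch only. Numbers are illustrative.
global_batch = 64
pp_stages = 4
microbatch = global_batch // pp_stages       # 16 sequences per stage pass
print(f"weight-load amortisation: {global_batch} seqs -> {microbatch} seqs per read")
```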
Hamza Elshafie @hamzaelshafie ·
You’re running a medium-sized model and switch from H100s to L40s to cut cost. You expected inference performance to drop because of much lower memory bandwidth, but it dropped far more than expected, and scaling across multiple GPUs was especially poor. Why?
Hamza Elshafie @hamzaelshafie ·
That means tensor parallel communication, KV movement, and collectives become much more expensive, so scaling across GPUs degrades sharply. And if you try to compensate by leaning more on pipeline parallelism, that can make decode worse too.
Hamza Elshafie @hamzaelshafie ·
The second reason is multi-GPU communication. H100 systems have NVLink + NVSwitch, which make tensor parallelism inside a node very efficient. L40 is a PCIe-only card and does not have that high-bandwidth GPU-to-GPU fabric.
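Rough numbers on what that fabric gap means for tensor-parallel collectives; the model shape, per-direction bandwidths, and all-reduce count are illustrative assumptions:

```python
# Per layer, a TP decode step all-reduces activation tensors of size
# hidden * batch (attention and MLP each contribute one all-reduce).
# Bandwidths: H100 NVLink ~450 GB/s per direction vs PCIe Gen4 x16
# ~32 GB/s (approximate spec numbers).
hidden, batch, dtype_bytes = 8192, 32, 2          # BF16 activations
layers, allreduces_per_layer = 60, 2
payload = hidden * batch * dtype_bytes            # bytes per all-reduce

for name, bw in [("NVLink", 450e9), ("PCIe Gen4 x16", 32e9)]:
    us = payload * allreduces_per_layer * layers / bw * 1e6
    print(f"{name}: ~{us:.0f} us of comm per decode step")   # ~140 vs ~2000
```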
Hamza Elshafie @hamzaelshafie ·
Each step still has to stream model weights and KV cache data, so performance is often limited by moving bytes rather than raw compute. So even before talking about distributed inference, moving from H100 to L40 already hurts badly.
Hamza Elshafie @hamzaelshafie ·
The first reason is the obvious one: memory bandwidth. H100 has 80 GB HBM3 at 3.35 TB/s, while L40 has 48 GB GDDR6 at 864 GB/s. For inference, especially decode, that difference is significant because decode is usually memory-bandwidth bound.
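The decode-side floor falls straight out of that spec gap; a minimal sketch assuming a ~13B-parameter model in BF16 (the bandwidths are the spec numbers, the model size is my example):

```python
# Roofline-style floor for a bandwidth-bound decode step: at small
# batch, every generated token re-reads the weights, so step time is
# at least weights_bytes / memory_bandwidth.
weights_gb = 26                          # ~13B params x 2 bytes (BF16)

for name, gb_s in [("H100", 3350), ("L40", 864)]:
    ms = weights_gb / gb_s * 1000
    print(f"{name}: >= {ms:.1f} ms/token (<= {1000 / ms:.0f} tok/s at batch 1)")
# H100: ~7.8 ms/token vs L40: ~30 ms/token -- a ~3.9x gap from bandwidth alone.
```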
Hamza Elshafie retweeted
clem 🤗 @ClementDelangue ·
Introducing Kernels on the Hugging Face Hub ✨
What if shipping a GPU kernel was as easy as pushing a model?
- Pre-compiled for your exact GPU, PyTorch & OS
- Multiple kernel versions coexist in one process
- torch.compile compatible
- 1.7x–2.5x speedups over PyTorch baselines
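A sketch of what usage looks like, following the library's announcement examples; treat the exact repo id and entry point as assumptions and check the Hub for real ones:

```python
# Fetch a pre-compiled kernel from the Hub with the `kernels` library.
import torch
from kernels import get_kernel

# Resolves a pre-built binary matching this GPU, PyTorch build, and OS.
activation = get_kernel("kernels-community/activation")

x = torch.randn(16, 1024, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
activation.gelu_fast(y, x)        # drop-in replacement for a PyTorch op
```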
Hamza Elshafie retweeted
surya @suryasure05 ·
wrote an article breaking down the math behind TurboQuant by @GoogleResearch. I walk through a toy example using concrete numbers to show every single operation that goes on under the hood. link below:
Hamza Elshafie retweeted
Chayenne Zhao @GenAI_is_real ·
$852B valuation, $2B/month revenue, 15B tokens/minute on APIs. but the number that caught my eye is codex growing 70% month over month to 2M weekly users. coding agents are clearly the highest-ROI product in their portfolio right now, which explains why they acquired astral and killed sora in the same week. every GPU hour redirected from video generation to code generation probably has 10x better unit economics. openai is quietly becoming a developer tools company that happens to also run a chatbot @SawyerMerritt
Sawyer Merritt @SawyerMerritt

NEWS: OpenAI just announced that it has officially closed their latest funding round with $122 billion in committed capital at a post-money valuation of $852 billion.

"We are now generating $2B in revenue per month. At this stage, we are growing revenue four times faster than the companies who defined the Internet and mobile eras, including Alphabet and Meta. ChatGPT has more than 900 million weekly active users, and over 50 million subscribers. Search usage has nearly tripled in a year, and our ads pilot reached more than $100 million in ARR in under six weeks.

Momentum is just as strong on the enterprise side, which now makes up more than 40% of our revenue, and is on track to reach parity with consumer by the end of 2026. GPT‑5.4 is driving record engagement across agentic workflows. Our APIs now process more than 15 billion tokens per minute. Codex now serves over 2 million weekly users, up 5x in the past three months, with usage growing more than 70% month over month."
