Subbu

77 posts

@subbdue

I develop AI inference Chips/SoCs for a living. Read my work and subscribe to my newsletter at https://t.co/RGu9NKu1Ze

Santa Clara, CA · Joined June 2009

33 Following · 706 Followers

Subbu@subbdue·10h

One crazy metric from the vLLM x Mooncake article: “On a real Codex agentic trace … the distributed KV cache pool improves vLLM throughput by 3.8x and reduces P50 TTFT and E2E latency by 46x and 8.6x” 46x improvement in Time To First Token!

Matej Sirovatka@m_sirovatka

KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀

7.6K

Subbu@subbdue·11h

@vllm_project is steadily stacking KV cache gains. It’s worth understanding this recipe … distributed KV caching (and GPUDirect RDMA on Nvidia GPUs) is how real commercial inference deployments operate, and companies like @weka have been industrializing exactly this for a while now. I have a couple of articles explaining this:
NVIDIA KV-Cache Context Memory Storage: chiplog.io/p/analysis-of-…
Origins of GPUDirect RDMA: chiplog.io/p/how-mellanox…
One crazy metric from the vLLM x Mooncake article: “On a real Codex agentic trace, the distributed KV cache pool improves vLLM throughput by 3.8x and reduces P50 TTFT and E2E latency by 46x and 8.6x” 46x improvement in Time To First Token!

vLLM@vllm_project

🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake. Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them. By integrating Mooncake Store as a distributed KV cache pool, vLLM gets: 🚀 3.8x higher throughput ⚡ 46x lower P50 TTFT ⏱️ 8.6x lower E2E latency 📈 Cache hit rate 1.7% -> 92.2% 🌐 Scales near-linearly to 60 GB200 GPUs at >95% hit rate 🔥 Powered by a deep collaboration between @Inferact and @KT_Project_AI 📖 Read more: vllm.ai/blog/mooncake-… 🧵👇

191

Subbu retweeted

Kairos@KairosPraxis·23h

Absolutely sobering paragraph from @subbdue just as $MU and SK Hynix hit their 1T market caps.

237

40.4K

Subbu@subbdue·1d

This reminds me of when companies went all-in on abandoning their private clouds for AWS, only to realize a few years later that the savings weren't quite what they'd expected. History rhyming again. I think Geico was one of them .. they migrated a bunch of their apps, and then quietly repatriated them to their private cloud.

Citrini@citrini

Local inference not looking so crazy anymore now that companies are revolting over token spend

1.1K

Subbu retweeted

Lawrence Hamtil@lhamtil·3d

This is a very good history of the DRAM boom/bust cycles: "The product was a pure commodity, sold by the bit, indistinguishable across vendors. Five or six players were always willing to flood the market the moment demand softened. Every downturn turned into a price war, and every price war took out at least one company." chiplog.io/p/dram-was-the…

Subbu@subbdue·3d

A very American-80s style logo and graphic design. Tastefully done!

David Hansen 🇺🇸 🇳🇿@boxcardavid

Westmag is building American robot actuators and drone motors at scale. In 2025, @westmagco raised $11M led by @a16z, with participation from @FoundersFund, @LuxCapital, NFDG, @MenloVentures, and other top investors. Since then, we’ve been building industrial capacity, crawling up supply chains, and securing high-volume customers. Now, we’re ramping production at our factory in South San Francisco to deliver against committed offtake orders from high-volume customers. Westmag is committed to scaling quickly in the US to deliver millions of drone motors and robot actuators to the surging domestic and global market. We’re building the great American motor and actuator company.

1.3K

Subbu@subbdue·3d

Add on a few more months before improved kernels and meaningful MFU. I write about why this is the case in my deep dive: The Uncomfortable Truth Behind Deploying the Latest NVIDIA GPUs: MFU, Silent Data Corruption - chiplog.io/p/the-uncomfor…

SemiAnalysis@SemiAnalysis_

IMPORTANT: it is important to understand that the CoreWeave & Microsoft photos are still Engineering/Quality Samples, and there is still some time before the software stack bring-up finishes & first production tokens are generated. The VR200 & MI455 rack metric to watch out for is time to first at-scale production token TTF-(ASP)-T. You can clearly see in the CW rack photos that none of the scale-out 800G OSFP cages are even populated.

100.6K

Subbu@subbdue·3d

Make that Mango Habanero Pineapple and you can sell it to other geographies - APAC and BRICS

Nestor Gonzalez@NestorG99

@bubbleboi Pitch Mango Pineapple. Higher margins.

236

Subbu@subbdue·3d

So fun! Reads like a Netflix doc .. maybe they should consider making one.

Pushkar Ranade@magicsilicon

x.com/i/article/2060…

284

Subbu@subbdue·3d

@joinyellowbrick Finstack

Yellowbrick Investing@joinyellowbrick·5d

Bragging about how much money your substack subscribers would have made from your AI stocks if you had a substack is peak Fintwit. No notes

Jack Farley@JackFarley96

My 2nd biggest position btw I don’t have a Substack, but if I did people would be very happy with me

151

21.8K

Subbu@subbdue·3d

Boy! All these numbers are staggering: - 7.5T - 15T model - Trained on Trainium2?! - 20% MFU (yikes!)

Lisan al Gaib@scaling01

I think Anthropic likely trained the Mythos base model from roughly October to December using on the order of 6.7e26–1.0e27 FLOPs. Since then, the RL-to-base-model-training FLOP ratio is plausibly somewhere between 0.5 and 3, depending on how much of the expanded Trainium 2 fleet was actually allocated to Mythos RL. The reason this range is plausible is that public AWS/Anthropic statements imply Anthropic-accessible Trainium2 capacity grew from roughly 500k chips around Rainier’s launch to over 1M Trainium2 chips for Claude training and serving.

309

Subbu@subbdue·4d

Other winners in the NVLink Fusion story: $CDNS and $SNPS. In last year’s Computex keynote, Jensen announced that NVLink IP would be distributed through them. Of course, MediaTek and Marvell are also key partners, helping enable companies like Ayar Labs. The future of computing (especially AI inference) is clearly heading toward heterogeneous architectures. The NVIDIA + Groq deal basically cements it. NVLink Fusion is NVIDIA’s play to protect its turf while embracing this shift. My article on NVLink Fusion, AI Inference & Heterogenous Computing: chiplog.io/p/why-speculat… Jensen’s keynote where he announced NVLink Fusion and its partners: youtube.com/live/TLzna9__D…

Ayar Labs@AyarLabs

Today, @AyarLabs announced it has joined the @nvidia NVLink Fusion ecosystem, introducing co-packaged optics as a foundational building block for hyperscalers and system innovators deploying heterogeneous compute in NVIDIA AI factories. Press Release: bit.ly/4oa8epa

200.2K

Subbu@subbdue·4d

The NVIDIA-MediaTek partnership is turning out to be quite the love story. I wrote this deep dive article on DGX Spark's GB10 SOC, MediaTek's role in it, and how exactly the chemistry between NVIDIA and MediaTek worked ... and even with NVLink Fusion, MediaTek is an essential partner. GB10 SOC Deep Dive: chiplog.io/p/analysis-of-… NVLink Fusion & Heterogenous Computing : chiplog.io/i/189813953/wh…

𝐷𝑟. 𝐼𝑎𝑛 𝐶𝑢𝑡𝑟𝑒𝑠𝑠@IanCutress

If a baker makes a cake with their own recipe using ingredients from a mutual friend, then gives it to you, would you pass it off as your own? Mediatek makes a CPU with their own recipe with arm ingredients and gives it to Nvidia. Broadcom makes a TPU with their own recipe but with Google ingredients and gives it back to Google.

554

Subbu@subbdue·5d

@citrini Don't blame you! Those of us in the semi-industry aren't used to this kind of exuberance :) . Like, $MU ... $1T?! Who woulda thought. chiplog.io/p/dram-was-the…

780

Citrini@citrini·5d

I’ve sold a lot of AI winners too early, I’m intent on selling very slowly with the agentic utility winners like $BAND $ATEN $SAIL… If the cycle has taught me anything, it’s that a name being up 700% from your entry is absolutely not a reason to sell.

1.1K

120.1K

Subbu@subbdue·5d

@vikramskr Putting the sexy in semi!

Vikram Sekar@vikramskr·6d

If you run into me at Computex, ask me for my newly minted NFC biz card — No more rummaging through paper cards from me. Let’s talk semi research hand-crafted to your order. 🤌

9.4K

Subbu@subbdue·6d

Article link: chiplog.io/p/dram-was-the…

181

Subbu@subbdue·6d

Micron has seen 6 boom & bust cycles since 2000. In 2019, hyperscaler over-ordering crashed their revenue from $30B to $21B. In 2023, the pandemic unwind cut it in half again, from $30B to $15B. Has the AI super-cycle finally broken this pattern? Or is a reckoning coming? — especially since it's the same set of buyers, buying for the same reasons. I analyze Micron through three lenses: Engineering, Finance, and Strategy. The current tailwinds, and the possible headwinds. Full Article here: chiplog.io/p/dram-was-the…

45.2K

Subbu@subbdue·11 May

@AccBalanced A surprisingly common mistake 😄

b/acc, context platform engineer@AccBalanced·11 May

Per second

0xSero@0xSero

21 petabytes of memory bandwidth.

181

Subbu@subbdue·11 May

@theinformation I don't think anyone saw this abysmal 11% MFU number coming. I've written up a comprehensive technical report on why this happens and what it takes for ByteDance, DeepSeek, Meta, and Google engineers to squeeze the most out of their clusters. x.com/subbdue/status…

Subbu@subbdue

Last week The Information reported that xAI’s Colossus-1 achieves a mere 11% MFU (Model FLOPs Utilization), compared to the 45-55% other hyperscalers achieve. My latest article is a comprehensive analysis of: + What MFU really is, why it is hard to achieve, and where 50% of the FLOPs are lost: communication bottlenecks, silent data corruption (SDC), and stragglers + Lessons from Google, Meta, ByteDance, and DeepSeek + Who does it best, and what's the "real moat" + The uncomfortable truth behind deploying the latest NVIDIA GPUs. Full analysis here: chiplog.io/p/the-uncomfor…
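
For readers new to the metric: MFU is commonly estimated as the FLOPs a training run actually achieves divided by the cluster's theoretical peak. The sketch below is a minimal illustration, assuming the widely used approximation of about 6 FLOPs per parameter per token for dense transformer training. The function name and every number in the example are hypothetical and are not taken from the article or from any real cluster.

```python
def estimate_mfu(params: float, tokens_per_second: float,
                 num_chips: int, peak_flops_per_chip: float) -> float:
    """Rough Model FLOPs Utilization for dense transformer training.

    Assumption: ~6 FLOPs per parameter per token (forward + backward).
    All arguments are illustrative inputs, not measured values.
    """
    achieved_flops = 6.0 * params * tokens_per_second   # useful model FLOPs/s
    peak_flops = num_chips * peak_flops_per_chip        # theoretical cluster peak
    return achieved_flops / peak_flops


# Hypothetical example: a 70B-parameter model on 1,000 chips rated at
# 1e15 FLOP/s each, processing 262,000 tokens per second overall.
mfu = estimate_mfu(params=70e9, tokens_per_second=262_000,
                   num_chips=1000, peak_flops_per_chip=1e15)
print(f"MFU = {mfu:.1%}")  # prints "MFU = 11.0%"
```

Under these made-up inputs the cluster lands near 11%; reaching the 45-55% range at the same chip count would require roughly four to five times the token throughput.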

The Information@theinformation·2 May

xAI’s GPU fleet is running at about 11% utilization, exposing how hard it is for AI labs to fully use expensive Nvidia hardware. Read more in our AI Agenda newsletter: thein.fo/4cHRjWI

557

1.3M
