

Austin Lyons

@austinsemis
Substack: https://t.co/zxI0P7bBdO | Podcast: https://t.co/fmcrAXKfr6 | Tech Analyst @creativestrat



Today we’re announcing an agreement with Amazon Web Services to bring tens of millions of AWS Graviton cores to our compute portfolio. This partnership marks an expansion of our diversified AI infrastructure and will help scale systems behind Meta AI and agentic experiences that serve billions of people. Learn more: go.meta.me/2bc5c5






($) TSMC's Margins in Uncharted Territory chipstrat.com/p/tsmcs-margin… $TSM





Meta's core business is ads. Ads are AI workloads, but not LLM workloads. @austinlyons chatted with @Meta VP Matt Steiner to understand Meta's heterogeneous compute stack.

Surprises:
- Recommender training needs a different compute-to-memory ratio than LLMs. Hence MTIA.
- Retrieval is memory-bound at Meta scale. Andromeda runs on a co-designed Grace Hopper SKU, not off-the-shelf.
- Adaptive ranking scales compute per user. Power users with long histories get more.
- Consolidating N ranking models into one (Lattice) improved performance, not just cost.
- KernelEvolve (LLM-written kernels) flipped heterogeneous fleet economics. SWE demand going UP.
- Meta wants ~100x more kernels per chip.

Chapters:
(00:00) Intro and scale
(00:39) How Meta's ad system works
(02:00) Meta Andromeda and the custom NVIDIA SKU
(03:30) Lattice: consolidating ranking models
(05:00) GEM, Meta's ads foundation model
(06:30) Adaptive ranking for power users
(08:17) The scale: 3B DAUs at sub-second latency
(09:40) Why longer interaction histories matter
(10:45) The anniversary gift analogy
(12:57) A decade of compute evolution
(15:21) Meta's infra as a CP-SAT problem
(16:07) Co-designing Grace Hopper with NVIDIA
(17:47) Matching compute shape to workload
(18:26) Influencing hardware and software roadmaps
(20:23) MTIA: why ads aren't LLMs
(22:07) The personalization blob and I/O ratios
(26:38) One trillion parameters at sub-second latency
(28:26) Heterogeneous hardware trade-offs
(29:30) KernelEvolve: LLMs writing custom kernels
(33:30) GenAI and recommender systems cross-pollination
(35:21) The 2-year infrastructure outlook
(37:00) Why demand for software engineering is rising
(38:53) How Matt stays on top of it all

$META @austinlyons @vikramskr
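To make the "adaptive ranking scales compute per user" idea concrete, here is a minimal sketch. This is not Meta's code; every name, threshold, and formula below is a hypothetical illustration of the general pattern: users with longer interaction histories get a larger candidate-scoring budget, capped so tail latency stays bounded.

```python
# Illustrative sketch only (not Meta's implementation).
# Idea: scale per-user ranking compute with interaction-history length.
# All names, constants, and the growth formula are assumptions.

def ranking_budget(history_len: int, base: int = 100, cap: int = 1000) -> int:
    """How many candidate ads to fully score for this user.

    New users get the base budget; power users with long histories
    get more, up to a cap that keeps worst-case latency bounded.
    """
    # Grow the budget with history length, then cap it.
    return min(cap, base + history_len // 10)

def rank_ads(candidates: list, history: list) -> list:
    """Cheap first-pass filter down to the per-user budget.

    In a real system the survivors would then go through the
    expensive ranking model; here we just return the shortlist.
    """
    budget = ranking_budget(len(history))
    return candidates[:budget]
```

The point of the sketch is the shape of the trade-off, not the numbers: compute spent per request becomes a function of the user, rather than a fleet-wide constant.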





Everything that can be accelerated is, right? Nope. A surprising number of HPC and production rendering workloads still run on CPUs.

The GPUs available during the CPU-to-GPU wave weren't the right config or price point to make switching worthwhile for everyone. And now GPU makers are (rightly) focused on datacenter AI and neural rendering, making tradeoffs that don't help traditional simulation and production rendering (e.g. FP64 has been deprioritized while AI-focused formats like FP8 and FP4 get more silicon).

So these customers were missed by the CPU-to-GPU wave and are deprioritized in the AI era. That creates an opening for a newcomer like Bolt to make different architectural bets, for example up to 384 GB of memory per card vs 96 GB on NVIDIA's RTX 6000 Pro and 32 GB on the 5090. That'll help with rendering's "scenes don't fit in GPU memory" problem!

@boltgraphics's announcement today is a 12nm test chip. There's still a long way to go until Bolt's 4Q'27 production, but the market opportunity is definitely there. prnewswire.com/news-releases/…
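A back-of-envelope sketch of the "scenes don't fit in GPU memory" point. The per-card capacities are the ones quoted in the post; the 200 GB scene size is a made-up example, not a benchmark.

```python
# Back-of-envelope check: does a production-rendering scene fit in
# card memory? Capacities below are from the post above; the scene
# size is a hypothetical example.

CARD_VRAM_GB = {
    "Bolt (claimed)": 384,
    "RTX 6000 Pro": 96,
    "RTX 5090": 32,
}

def fits(scene_gb: float, card: str) -> bool:
    """True if the whole scene fits in the card's memory."""
    return scene_gb <= CARD_VRAM_GB[card]

# A hypothetical 200 GB film scene (geometry + textures):
for card in CARD_VRAM_GB:
    print(card, "fits" if fits(200, card) else "does not fit")
```

When the scene doesn't fit, renderers fall back to out-of-core streaming or CPU rendering, which is exactly the niche the larger-memory bet targets.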
