Reiner Pope

182 posts

@reinerpope

CEO and founder, @MatXComputing, developing high throughput chips tailored for LLMs

Mountain View, California · Joined July 2011
460 Following · 18.3K Followers
Pinned Tweet
Reiner Pope @reinerpope
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. The MatX One chip is based on a splittable systolic array, which has the energy and area efficiency that large systolic arrays are famous for, while also getting high utilization on smaller matrices with flexible shapes. The chip combines the low latency of SRAM-first designs with the long-context support of HBM. These elements, plus a fresh take on numerics, deliver higher throughput on LLMs than any announced system, while simultaneously matching the latency of SRAM-first designs. Higher throughput and lower latency give you smarter and faster models for your subscription dollar. We’ve raised a $500M Series B to wrap up development and quickly scale manufacturing, with tapeout in under a year. The round was led by Jane Street, one of the most tech-savvy Wall Street firms, and Situational Awareness LP, whose founder @leopoldasch wrote the definitive memo on AGI. Participants include @sparkcapital, @danielgross and @natfriedman’s fund, @patrickc and @collision, @TriatomicCap, @HarpoonVentures, @karpathy, @dwarkesh_sp, and others. We’re also welcoming investors across the supply chain, including Marvell and Alchip. @MikeGunter_ and I started MatX because we felt that the best chip for LLMs should be designed from first principles with a deep understanding of what LLMs need and how they will evolve. We are willing to give up on small-model performance, low-volume workloads, and even ease of programming to deliver on such a chip. We’re now a 100-person team with people who think about everything from learning rate schedules, to Swing Modulo Scheduling, to guard/round/sticky bits, to blind-mated connections—all in the same building. If you’d like to help us architect, design, and deploy many generations of chips in large volume, consider joining us.
122 replies · 202 reposts · 2.3K likes · 3M views
Reiner Pope retweeted
Dwarkesh Patel @dwarkesh_sp
.@reinerpope's new blackboard lecture goes all the way down: how AI training and inference are built up from logic gates on silicon. He walks me through a 4-bit multiply-accumulate by hand, and shows how that primitive is the foundation for the matrix multiplies in training runs.
11 replies · 47 reposts · 526 likes · 37.4K views
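To make the 4-bit multiply-accumulate mentioned in the post above concrete, here is a minimal Python sketch. It is not the derivation from the lecture. It assumes unsigned 4-bit operands, a 16-bit accumulator that wraps on overflow, and a shift-and-add multiplier built on a ripple-carry adder. The names `full_adder`, `ripple_add`, `mac4`, and `ACC_WIDTH` are made up for illustration.

```python
ACC_WIDTH = 16  # assumed accumulator width, chosen only for this sketch


def full_adder(a, b, cin):
    """One-bit full adder expressed with XOR, AND, and OR gates."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists; the final carry is dropped."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out


def to_bits(value, width):
    """Little-endian list of the low `width` bits of value."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def mac4(acc, a, b):
    """Return (acc + a * b) mod 2**ACC_WIDTH for unsigned 4-bit a and b."""
    assert 0 <= a < 16 and 0 <= b < 16
    acc_bits = to_bits(acc, ACC_WIDTH)
    a_bits = to_bits(a, 4)
    for i, b_i in enumerate(to_bits(b, 4)):
        # Partial product: AND every bit of a with bit i of b, shifted left by i.
        partial = [0] * i + [a_j & b_i for a_j in a_bits]
        partial += [0] * (ACC_WIDTH - len(partial))
        acc_bits = ripple_add(acc_bits, partial)
    return from_bits(acc_bits)
```

Calling `mac4` repeatedly with a running accumulator gives a dot product, which is the inner loop of a matrix multiply.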
Reiner Pope retweeted
Dwarkesh Patel @dwarkesh_sp
New blackboard lecture w @reinerpope How do chips actually work – starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. 0:00:00 – Building a multiply-accumulate from logic gates 0:16:20 – Muxes and the cost of data movement 0:25:59 – How systolic arrays work 0:39:00 – Clock cycles and pipeline registers 0:51:40 – FPGAs vs ASICs 1:03:14 – Cache vs scratchpad 1:07:16 – Why CPU cores are much bigger than GPU cores 1:11:49 – Brains vs chips 1:15:22 – A GPU is just a bunch of tiny TPUs Look up Dwarkesh Podcast on YouTube/Spotify/etc to watch. Enjoy!
94 replies · 716 reposts · 5.5K likes · 896.7K views
Reiner Pope retweeted
Dwarkesh Patel @dwarkesh_sp
It's very interesting that cryptographic protocols and neural networks have the same high-level architecture (where they jumble information as it moves sequentially across many layers). This is the result of a convergent evolution - cryptographic protocols need every output bit to depend on every input bit in complicated ways, and similarly, NNs need output to make connections between inputs. But they're in some sense doing opposite things. While cryptographic protocols take something which has a lot of structure and make it seem indistinguishable from random, NNs take something which may look random and extract structure from it. Much more on this idea in the full episode with @reinerpope
22 replies · 36 reposts · 487 likes · 59.9K views
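The claim that every output bit should depend on every input bit can be checked empirically. The sketch below uses SHA-256 from Python's standard `hashlib` as a stand-in (it is a hash function, not one of the protocols discussed in the episode). It flips one input bit at a time and counts how many of the 256 output bits change. The helper names `bit_diff` and `avalanche` and the sample message are assumptions for illustration.

```python
import hashlib


def bit_diff(a: bytes, b: bytes) -> int:
    """Number of bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche(message: bytes, bit_index: int) -> int:
    """Flip one input bit and count how many SHA-256 output bits change."""
    flipped = bytearray(message)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    original_digest = hashlib.sha256(message).digest()
    flipped_digest = hashlib.sha256(bytes(flipped)).digest()
    return bit_diff(original_digest, flipped_digest)


if __name__ == "__main__":
    msg = b"every output bit depends on every input bit"
    counts = [avalanche(msg, i) for i in range(len(msg) * 8)]
    print(f"mean flipped output bits: {sum(counts) / len(counts):.1f} of 256")
```

For a well-mixed function the mean should sit close to half of the output bits, whichever input bit is flipped.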
Reiner Pope @reinerpope
Let's forget about pipelining/microbatching for a moment. In that case you have just one batch of KVs in HBM, plus the weights. If you do 5ms scheduling rather than 20ms, you can only read 25% of the HBM data in that time. Assuming dense attention, you read the same set of KVs on every forward pass. You also read all of the weights on every forward pass. So you end up reading the same 25% on every forward pass. This is technically fine, and you can make it work, but you're wasting 75% of HBM capacity, which suggests you bought too much. You could consider using sparser MoE and/or larger batch size to profit from this spare HBM capacity. This should improve the (quality, throughput) Pareto frontier at the expense of latency.
2 replies · 1 repost · 73 likes · 3.9K views
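The 25% and 75% figures in the reply above follow from a simple ratio, sketched here in Python. It assumes HBM is streamed at a constant rate and that a full sweep of HBM takes 20ms, as in the thread. The function name `hbm_fraction_read` is hypothetical.

```python
def hbm_fraction_read(step_ms: float, full_sweep_ms: float) -> float:
    """Fraction of HBM that can be streamed once within one scheduling step."""
    return min(1.0, step_ms / full_sweep_ms)


# Numbers from the thread: 5ms scheduling against a 20ms full sweep of HBM.
touched = hbm_fraction_read(step_ms=5.0, full_sweep_ms=20.0)
idle = 1.0 - touched
print(f"HBM read per forward pass: {touched:.0%}, capacity left unused: {idle:.0%}")
```

Because dense attention and the weights re-read the same bytes on every pass, the untouched share stays untouched, which is the argument for spending it on sparser MoE or a larger batch.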
ar0cket1 @ar0cket1
Can someone explain why Blackwell batch scheduling is 20ms and not less? 20ms is derived via HBM capacity/bandwidth, so it's the time it takes to offload all of HBM. (So 20ms is close to the time that a full forward pass takes to finish. *Assume I’m talking about frontier massive batch sizes where you become FLOP bottlenecked.) Why not drop batch scheduling to, like, 5ms and basically send off the batch to the first layers of the model by the time that earlier batches are halfway in? Similar to how micro-batches in pipelining are streamed constantly. I was thinking: if you're FLOP limited and your tensor cores aren’t being utilised as the batches make their way to later layers, then why not just increase batch frequency? This would shift your bottleneck from FLOPs to HBM capacity, since you need a bunch of KV retrieval. Can someone explain why lower batch scheduling isn't done? (I tried to ask AI, but all of the models give some weird hand-wavy answer.) @reinerpope
2 replies · 1 repost · 24 likes · 4.1K views
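The "HBM capacity / bandwidth" estimate in the question above can be written out as a short Python sketch. The capacity and bandwidth values are illustrative round numbers of the kind quoted for a Blackwell-class GPU. They are assumptions and do not come from the thread, and the function name `hbm_sweep_ms` is hypothetical.

```python
def hbm_sweep_ms(capacity_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time in milliseconds to read every byte of HBM once at full bandwidth."""
    return capacity_gb * 1000.0 / bandwidth_gb_per_s


# Assumed figures: 192 GB of HBM read at 8 TB/s (8000 GB/s).
sweep = hbm_sweep_ms(capacity_gb=192.0, bandwidth_gb_per_s=8000.0)
print(f"full HBM sweep: {sweep:.0f} ms")
```

With these assumed inputs the sweep time comes out in the low tens of milliseconds, the same order as the 20ms figure in the question.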
Reiner Pope retweeted
Clive Chan @itsclivetime
podcast bro inventing university lectures from first principles (but also, this will be better than any university lecture you've ever watched)
Dwarkesh Patel @dwarkesh_sp

Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard. 0:00:00 – How batch size affects token cost and speed 0:31:59 – How MoE models are laid out across GPU racks 0:47:02 – How pipeline parallelism spreads model layers across racks 1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.” 1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal 1:32:52 – Deducing long context memory costs from API pricing 2:03:52 – Convergent evolution between neural nets and cryptography

11 replies · 32 reposts · 709 likes · 88.9K views
Reiner Pope retweeted
Sholto Douglas @_sholtodouglas
Sessions just like this one were incredibly formative for me at Google, no one better to learn from
Dwarkesh Patel @dwarkesh_sp

Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard. 0:00:00 – How batch size affects token cost and speed 0:31:59 – How MoE models are laid out across GPU racks 0:47:02 – How pipeline parallelism spreads model layers across racks 1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.” 1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal 1:32:52 – Deducing long context memory costs from API pricing 2:03:52 – Convergent evolution between neural nets and cryptography

8 replies · 43 reposts · 1.1K likes · 178.9K views
Reiner Pope @reinerpope
RT @dwarkesh_sp: Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained a…
0 replies · 16 reposts · 0 likes · 4.1K views
Reiner Pope retweeted
zach @blip_tm
“i kinda think you wanna make a chip that people do hate” is a very funny way to express an important fact about accelerator design. if your chip is super easy to use across many applications, it’s probably not specialized enough!
Reiner Pope @reinerpope

Intelligence per picojoule, with @itsclivetime and @dylan522p (0:00) Intro (1:22) What is codesign? (2:49) Codesign example: Swish vs ReLU (4:22) Are DeepSeek papers codesign? (6:45) Predicting where ML research will go (8:06) Should researchers hate your chips? (9:34) Can you codesign too much? (13:23) Picking the right grain size for specialization (16:22) How much hardware flexibility for The Age of Research? (20:05) Did reasoning and RL disrupt hardware roadmaps? (23:09) Cerebras/Groq: unexpected wins on reasoning and RL (25:34) Disaggregating MLP and attention (29:06) The right metrics for quantization and codesign papers

2 replies · 5 reposts · 76 likes · 11.9K views
Clive Chan @itsclivetime
Thanks for the fun chat @reinerpope @dylan522p!
Reiner Pope @reinerpope

Intelligence per picojoule, with @itsclivetime and @dylan522p (0:00) Intro (1:22) What is codesign? (2:49) Codesign example: Swish vs ReLU (4:22) Are DeepSeek papers codesign? (6:45) Predicting where ML research will go (8:06) Should researchers hate your chips? (9:34) Can you codesign too much? (13:23) Picking the right grain size for specialization (16:22) How much hardware flexibility for The Age of Research? (20:05) Did reasoning and RL disrupt hardware roadmaps? (23:09) Cerebras/Groq: unexpected wins on reasoning and RL (25:34) Disaggregating MLP and attention (29:06) The right metrics for quantization and codesign papers

5 replies · 5 reposts · 79 likes · 15.4K views
Reiner Pope @reinerpope
Intelligence per picojoule, with @itsclivetime and @dylan522p (0:00) Intro (1:22) What is codesign? (2:49) Codesign example: Swish vs ReLU (4:22) Are DeepSeek papers codesign? (6:45) Predicting where ML research will go (8:06) Should researchers hate your chips? (9:34) Can you codesign too much? (13:23) Picking the right grain size for specialization (16:22) How much hardware flexibility for The Age of Research? (20:05) Did reasoning and RL disrupt hardware roadmaps? (23:09) Cerebras/Groq: unexpected wins on reasoning and RL (25:34) Disaggregating MLP and attention (29:06) The right metrics for quantization and codesign papers
11 replies · 57 reposts · 622 likes · 144.2K views
Reiner Pope @reinerpope
I chatted with @ysmulki about MatX, chip design and where silicon designed for LLMs is headed (8:17) Tightly coupling SRAM and HBM on one chip (14:03) More MoE FLOPS, smaller KV cache load (16:08) Numerics: from 32-bit to 4-bit (19:02) Targeting both training and inference (22:14) Chip timelines (27:15) Logic and memory scarcity (29:42) Compute costs (32:07) Latency: from 20ms to 1ms as the new table stakes (40:50) Programming the chip (43:00) Starting MatX (47:11) Codesign without seeing the models (51:57) Interconnect design (55:44) Performance modeling philosophy (1:07:02) Prefill vs. decode (1:13:47) What's next
15 replies · 44 reposts · 323 likes · 68.8K views
Reiner Pope retweeted
Semi Doped @semidoped
New interview: @ReinerPope, co-founder/CEO of @MatXComputing A counterintuitive throughput insight: “Low latency means small batch sizes. That is just Little’s law. Memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts than you could if the latency were larger. Low latency is not just a usability win, it improves throughput.” We get into: • The hybrid SRAM + HBM bet, and why pipeline parallelism finally works • Why sparse MoE drives MatX to “the most interconnect of any announced product” • Why frontier labs are willing to bet on an AI ASIC startup • Memory-bandwidth-efficient attention, numerics, and what MatX publishes (and what it does not) • Why 95% of model-side news is noise for chip design • The biggest challenges ahead 00:00 “We left Google one week before ChatGPT” 00:24 Intro: who is MatX 01:17 Origin story: leaving Google for LLM chips 02:21 GPT-3 and the “too expensive” problem 04:25 Why buy hardware that is not a GPU 05:52 Overcoming the CUDA moat 08:46 Early investors 09:35 The name MatX 09:59 The chip: matrix multiply + hybrid SRAM/HBM 12:11 Why pipeline parallelism finally works 14:22 Reading papers and Google going dark 15:20 Research agenda: attention and numerics 17:06 Five specs and meeting customers where they are 19:24 Why frontier labs are the natural first customer 20:32 Workloads: training, prefill, decode 22:18 Little’s law and the throughput case for low latency 24:29 Interconnect and MoE topology 26:35 Inside the team: 100 people, full stack 28:32 Agentic AI: 95% noise for hardware 30:35 KV cache sizing in an agentic world 32:11 How MatX uses AI for chip design (Verilog + BlueSpec) 34:23 Go to market: proving credibility under NDA 35:12 Porting effort for frontier labs 36:34 Biggest skepticism: manufacturing at gigawatt scale 37:32 Hiring plug @austinlyons @vikramskr
1 reply · 10 reposts · 98 likes · 19.3K views
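The Little's law argument quoted in the interview post above can be sketched numerically. Little's law says the number of items in flight equals throughput times latency. If KV-cache memory scales with the number of sequences in flight, then at fixed throughput a lower latency leaves more HBM per sequence. All numbers and names below (`in_flight`, `max_context_tokens`, the KV budget, the bytes per token) are hypothetical and chosen only to show the scaling.

```python
def in_flight(throughput_tokens_per_s: float, latency_s: float) -> float:
    """Little's law: sequences in flight = token throughput x per-token latency."""
    return throughput_tokens_per_s * latency_s


def max_context_tokens(kv_budget_bytes: float, kv_bytes_per_token: float,
                       throughput_tokens_per_s: float, latency_s: float) -> float:
    """Longest context per sequence that fits in the KV budget at this batch size."""
    batch = in_flight(throughput_tokens_per_s, latency_s)
    return kv_budget_bytes / (batch * kv_bytes_per_token)


# Hypothetical system: 100 GB of HBM for KV cache, 100 KB of KV per token,
# and 10,000 decoded tokens per second in both cases.
for latency_ms in (20.0, 1.0):
    ctx = max_context_tokens(100e9, 100e3, 10_000.0, latency_ms / 1000.0)
    print(f"{latency_ms:>4.0f} ms per token -> about {ctx:,.0f} tokens of context")
```

Under these assumptions, cutting per-token latency by 20x shrinks the in-flight batch by 20x and lets each sequence hold 20x more context in the same memory.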
Josh Albrecht @joshalbrecht
mngr: programmatically manage 100s of claude code sessions in parallel 🤖 open source today. lets you do things like: — for each open GitHub issue, create a PR — for each flaky test in the past week, fix it — for each rule in style guide, scan codebase & fix all instances runs any agent: @claudeai, codex, @opencode, etc. runs on any compute: locally, @modal, @Docker, or anything you can ssh into.
GIF
28 replies · 38 reposts · 169 likes · 31K views
Musaran @bruno_mailly
@reinerpope
>mix thoroughly and in a complex way
>few other correctness requirements
>perform extremely well on hardware
PRNG is kinda similar too.
1 reply · 0 reposts · 1 like · 36 views
Reiner Pope @reinerpope
Why are neural nets and cryptographic ciphers so similar?
Reiner Pope tweet media
3 replies · 5 reposts · 51 likes · 9.2K views
Reiner Pope retweeted
John Collison @collision
Reiner Pope (@MatXComputing) just raised a $500m round led by @leopoldasch and Jane Street to build faster AI chips. I enjoyed having him on Cheeky Pint so I could ask all my questions about how chip design actually works, where the speed-up comes from, and how the industry will evolve. 00:00:15 Google’s AI revival 00:07:54 MatX 00:17:11 AI supply chain 00:21:48 Designing chips 00:37:11 TSMC 00:44:17 Token pricing 00:44:55 RL-ing chip design 00:49:26 Design to production 00:56:05 MatX culture 01:02:57 Rust 01:05:21 Cuckoo hashing 01:09:35 Unexplored model architectures
21 replies · 41 reposts · 433 likes · 53.3K views