tensorpro

19 posts


@tensorpro

SF · Joined March 2019
76 Following · 156 Followers
Pinned Tweet
tensorpro @tensorpro
We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. matx.com/research/leaky…
1 reply · 17 retweets · 97 likes · 29.6K views
tensorpro retweeted
arman @outsidearman
Many Iranians have given more than anyone should ever have to give for a chance at freedom. Freedom to speak, think, love, learn, work, dress, create, gather, believe or not believe.
2 replies · 1 retweet · 4 likes · 254 views
tensorpro retweeted
John Collison @collision
Reiner Pope (@MatXComputing) just raised a $500m round led by @leopoldasch and Jane Street to build faster AI chips. I enjoyed having him on Cheeky Pint so I could ask all my questions about how chip design actually works, where the speed-up comes from, and how the industry will evolve.
00:00:15 Google’s AI revival
00:07:54 MatX
00:17:11 AI supply chain
00:21:48 Designing chips
00:37:11 TSMC
00:44:17 Token pricing
00:44:55 RL-ing chip design
00:49:26 Design to production
00:56:05 MatX culture
01:02:57 Rust
01:05:21 Cuckoo hashing
01:09:35 Unexplored model architectures
21 replies · 39 retweets · 424 likes · 49.9K views
rohan anil @_arohan_
@reinerpope I guess add one more annotation “A B_v / t -> A / t B_v”
1 reply · 0 retweets · 1 like · 850 views
rohan anil @_arohan_
A bit polarizing comment: it's too late, but I kind of think whoever named the TP/DP/EP combinations probably slowed down progress by inventing terminology that's borderline absurd for describing basic sharding
9 replies · 11 retweets · 156 likes · 38.1K views
tensorpro retweeted
Andrej Karpathy @karpathy
With the coming tsunami of demand for tokens, there are significant opportunities to orchestrate the underlying memory+compute *just right* for LLMs. The fundamental and non-obvious constraint is that due to the chip fabrication process, you get two completely distinct pools of memory (of different physical implementations too): 1) on-chip SRAM that is immediately next to the compute units and incredibly fast but of very low capacity, and 2) off-chip DRAM, which has extremely high capacity, but the contents of which you can only suck through a long straw. On top of this, there are many details of the architecture (e.g. systolic arrays), numerics, etc.

The design of the optimal physical substrate and then the orchestration of memory+compute across the top-volume workflows of LLMs (inference prefill/decode, training/finetuning, etc.) with the best throughput/latency/$ is probably today's most interesting intellectual puzzle with the highest rewards (\cite 4.6T of NVDA). All of it to get many tokens, fast and cheap.

Arguably, the workflow that may matter the most (inference decode, *and* over long token contexts in tight agentic loops) is the one hardest to achieve simultaneously by both camps of what exists today (HBM-first NVIDIA-adjacent and SRAM-first Cerebras-adjacent). Anyway, the MatX team is A++ grade, so it's my pleasure to have a small involvement, and congratulations on the raise!
Reiner Pope @reinerpope:

We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One.

The MatX One chip is based on a splittable systolic array, which has the energy and area efficiency that large systolic arrays are famous for, while also getting high utilization on smaller matrices with flexible shapes. The chip combines the low latency of SRAM-first designs with the long-context support of HBM. These elements, plus a fresh take on numerics, deliver higher throughput on LLMs than any announced system, while simultaneously matching the latency of SRAM-first designs. Higher throughput and lower latency give you smarter and faster models for your subscription dollar.

We’ve raised a $500M Series B to wrap up development and quickly scale manufacturing, with tapeout in under a year. The round was led by Jane Street, one of the most tech-savvy Wall Street firms, and Situational Awareness LP, whose founder @leopoldasch wrote the definitive memo on AGI. Participants include @sparkcapital, @danielgross and @natfriedman’s fund, @patrickc and @collision, @TriatomicCap, @HarpoonVentures, @karpathy, @dwarkesh_sp, and others. We’re also welcoming investors across the supply chain, including Marvell and Alchip.

@MikeGunter_ and I started MatX because we felt that the best chip for LLMs should be designed from first principles with a deep understanding of what LLMs need and how they will evolve. We are willing to give up on small-model performance, low-volume workloads, and even ease of programming to deliver on such a chip.

We’re now a 100-person team with people who think about everything from learning rate schedules, to Swing Modulo Scheduling, to guard/round/sticky bits, to blind-mated connections—all in the same building. If you’d like to help us architect, design, and deploy many generations of chips in large volume, consider joining us.

322 replies · 506 retweets · 7.4K likes · 2.5M views
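Karpathy's "long straw" point can be made concrete with back-of-envelope decode arithmetic: at batch size 1, every generated token streams all the weights (plus the KV cache) through DRAM once, so decode speed is roughly bandwidth divided by bytes per token. All figures below are illustrative assumptions, not MatX or vendor specs:

```python
# Rough decode-speed estimate for a single accelerator, under the
# assumed (not vendor-published) figures below: decode is bandwidth-bound,
# so tokens/s ~ DRAM bandwidth / bytes read per generated token.
hbm_bw_bytes_per_s = 3.35e12       # assumed HBM bandwidth, ~3.35 TB/s
weight_bytes = 70e9 * 2            # assumed 70B-param model, 16-bit weights
kv_read_bytes_per_step = 5e9       # assumed KV-cache read per decode step

bytes_per_token = weight_bytes + kv_read_bytes_per_step
tokens_per_s = hbm_bw_bytes_per_s / bytes_per_token   # ~23 tokens/s here
print(round(tokens_per_s, 1))
```

Under these assumptions the compute units sit mostly idle during decode, which is exactly the memory-orchestration gap the tweet describes.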
tensorpro retweeted
James Bradbury @jekbradbury
Congratulations to Reiner and the MatX team! Combining the benefits of HBM and SRAM and the benefits of tile-based architectures and systolic arrays has been a clear gap in the accelerator design space for a while, and I’m excited to see how MatX One performs.
Reiner Pope @reinerpope:
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. […]
1 reply · 9 retweets · 115 likes · 16.3K views
tensorpro retweeted
Sholto Douglas @_sholtodouglas
Reiner taught me much of what I know - goes without saying that I trust him to make the best chip in the world.
Reiner Pope @reinerpope:
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. […]
12 replies · 10 retweets · 412 likes · 50.9K views
tensorpro retweeted
Reiner Pope @reinerpope
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. […]
124 replies · 201 retweets · 2.2K likes · 3M views
tensorpro @tensorpro
@_xjdr @0xAlansari Ah, but it sounds like multinode may be a goal for future releases? Is that what you're working towards now?
0 replies · 0 retweets · 1 like · 32 views
tensorpro @tensorpro
@_xjdr @0xAlansari I don't see any mentions of multi-node in the docs yet so I'm guessing not.
1 reply · 0 retweets · 1 like · 38 views
xjdr @_xjdr
man, training a large MoE is just really really hard. hats off to any and every team that gets something out the door, no matter how it benchmarks.
11 replies · 7 retweets · 500 likes · 32K views
tensorpro @tensorpro
@norxornor was it from expert choice, token dropping, or something else?
0 replies · 0 retweets · 5 likes · 148 views
tensorpro retweeted
Jack Cook @jackcookjack
Here's a non-obvious problem with block-scaled quantized Attention: at the edge of your causal mask, later tokens can leak information to earlier ones through the scale factor computation. I wouldn't expect this leakage to matter very much since it affects scales, not values, but it turns out it does actually cause the loss to decrease a little too quickly! Very cool post by @tensorpro and team.
tensorpro @tensorpro:

We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. matx.com/research/leaky…

0 replies · 3 retweets · 18 likes · 2.5K views
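The leak Jack describes can be reproduced in a few lines. With block-scaled formats like MXFP4, every block of 32 elements shares one power-of-two scale derived from the block's max magnitude, so a quantization block that straddles the causal boundary along the sequence axis lets a later token change the quantized values of earlier ones. A toy numpy sketch (the quantizer below is a simplified stand-in for illustration, not the MatX kernel or the exact OCP spec):

```python
import numpy as np

def mxfp4_quantize_block(x):
    """Toy MXFP4-style quantizer: one shared power-of-two scale per block,
    values snapped to the FP4 (E2M1) representable set."""
    fp4_pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    fp4 = np.concatenate([-fp4_pos[::-1], fp4_pos])
    amax = np.max(np.abs(x))
    # Choose the scale so the largest element fits in FP4's max value (6).
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    idx = np.abs(x[:, None] / scale - fp4[None, :]).argmin(axis=1)
    return fp4[idx] * scale

rng = np.random.default_rng(0)
k = rng.normal(size=32)      # one quantization block along the sequence axis
k2 = k.copy()
k2[31] = 100.0               # perturb only the last ("future") position

qa = mxfp4_quantize_block(k)
qb = mxfp4_quantize_block(k2)
# The shared scale changed, so earlier positions now quantize differently:
# information about position 31 has leaked into positions 0..30.
print(np.allclose(qa[:31], qb[:31]))   # prints False
```

This is why the leak goes through the scale, not the values: the per-block max-magnitude reduction looks across the whole block, including positions the causal mask is supposed to hide.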
tensorpro retweeted
Reiner Pope @reinerpope
Prefill and Decode are very different workloads. We should optimize differently for them! Some ideas and speculation 🧵
4 replies · 15 retweets · 146 likes · 30.1K views
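A quick way to see why the two workloads diverge is the arithmetic intensity (FLOPs per byte of memory traffic) of the core matmul: prefill multiplies the whole prompt against the weights at once, while decode does one matrix-vector product per token. A sketch with illustrative (assumed) dimensions:

```python
# Arithmetic intensity of C = A @ B, counting reads of A and B plus the
# write of C. Dimensions below are illustrative, not tied to any real model.
def matmul_intensity(m, k, n, dtype_bytes=2):
    flops = 2 * m * k * n                            # multiply-adds
    traffic = dtype_bytes * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic

d = 8192                                       # assumed hidden dimension
prefill = matmul_intensity(m=4096, k=d, n=d)   # 4096-token prompt at once
decode = matmul_intensity(m=1, k=d, n=d)       # one token at a time
print(prefill, decode)   # prefill is ~2000x more compute-dense
```

Prefill saturates compute units, while decode at ~1 FLOP/byte is pinned to memory bandwidth, so hardware optimized for one tends to waste the other.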
tensorpro retweeted
Sanjit Neelam @sanjitneelam
Speculative decoding (SD) and blockwise sparse attention both accelerate LLM decoding, but when combined naively, the KV cache may lose sparsity during the verification step of SD. A simple modification fixes this while preserving model quality. 1/5
1 reply · 2 retweets · 13 likes · 3.7K views
tensorpro retweeted
CNN @CNN
breaking news
7.1K replies · 26K retweets · 199.5K likes