Andrew Feldman@andrewdfeldman
GPUs are slow at AI inference because they hit the memory wall. Cerebras pioneered the SRAM-based AI accelerator because GPUs were memory-bandwidth constrained.
Let me explain.
There are two types of memory. Memory that can store a lot, but is slow.
And memory that is fast, but can’t store much per square millimeter of silicon.
The former is called DRAM (or HBM) and the latter is SRAM.
Graphics Processing Units use HBM.
In fact, graphics was the perfect use case for HBM.
It required a lot of stored data. But didn’t need it moved very often.
That made HBM a natural fit.
But AI inference has different characteristics than graphics.
It moves data constantly from memory to compute.
To generate each token, it needs to move all of the model’s weights from memory to compute. For the next token, it does it again. For every single token in the answer. Because HBM is slow, moving all that data is time-consuming.
The GPU is waiting for data to get to it.
It sits idle. Pulling power.
Doing no work.
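The bottleneck above can be sketched as simple arithmetic. The model size and bandwidth figures below are my own illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope ceiling on tokens/sec for memory-bound inference.
# Every generated token must stream all model weights from memory,
# so bandwidth / weight size bounds throughput from above.
# Numbers are illustrative assumptions, not vendor specs.

weights_gb = 140           # assumed: a 70B-parameter model at 16-bit precision
hbm_bandwidth_gb_s = 8000  # assumed: HBM bandwidth of a modern GPU, in GB/s

# Upper bound: one full read of the weights per token.
max_tokens_per_s = hbm_bandwidth_gb_s / weights_gb
print(f"Memory-bandwidth ceiling: ~{max_tokens_per_s:.0f} tokens/s")  # → ~57 tokens/s
```

No amount of extra compute raises this ceiling; only faster memory does.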
Cerebras chose to use SRAM so we could move data from memory to compute faster. Not a little bit faster, but more than 2,600 times faster than NVIDIA Blackwell GPUs. As a result, we can generate tokens 15 times faster. This is why we are the fastest in the world.
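As a sanity check on the bandwidth ratio, here is the arithmetic with assumed figures (aggregate on-chip SRAM bandwidth of ~21 PB/s and HBM3e bandwidth of ~8 TB/s are my assumptions, not numbers stated in the post):

```python
# Rough check of the quoted bandwidth ratio, using assumed figures:
# Cerebras wafer-scale aggregate SRAM bandwidth ~21 PB/s (assumption),
# NVIDIA Blackwell (B200) HBM3e bandwidth ~8 TB/s (assumption).

wse_sram_tb_s = 21_000   # 21 PB/s expressed in TB/s
blackwell_hbm_tb_s = 8   # 8 TB/s

ratio = wse_sram_tb_s / blackwell_hbm_tb_s
print(f"Bandwidth ratio: ~{ratio:,.0f}x")  # → ~2,625x
```

Under those assumptions the ratio lands right around the "more than 2,600 times" figure.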
But what about the weakness of SRAM? Surely there is a tradeoff. SRAM can’t store very much data per square millimeter. This is why Cerebras went to wafer scale.
By building a chip the size of a dinner plate, a chip that is 58 times larger than the largest GPU, Cerebras could stuff it to the gills with SRAM. We couldn’t make SRAM store more data per square millimeter, but we could provide more square millimeters by building a bigger chip.
If you build a solution with little chips and try to use SRAM, you need to link thousands of them together to support a larger model. There simply isn’t enough room on a little chip for lots of SRAM and lots of compute cores.
Thousands of little chips connected together with cables is slower and more power-hungry than keeping all that traffic on one big chip, or even several big chips.
And since communication between chips is slow while communication on-chip is fast, a system of many little chips is slower at inference as well.