Jason ward
@JasonWardmc
10K posts

Changing the game for AI inference with @Cerebras. Played GAA far too long. Sunny days on the right bank or Barolo.

Ireland · Joined July 2011
2K Following · 1.7K Followers
Jason ward reposted
Cerebras @cerebras
Cafe Compute went global this week 🌎 Two cities. Two continents. One big chip. This week Cafe Compute hit San Francisco for @HumanXCo and London at @OpenAI's office for @aiDotEngineer Europe — fueling developers with coffee and the fastest AI on the planet. Next stop: Miami for @aiDotEngineer Miami 🌴
6 replies · 6 reposts · 28 likes · 4.7K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
What is disaggregated inference? What does it do? When does it matter? Who is it built for?

What Is Disaggregated Inference?
In the AI world: "Training" is how AI is made. "Inference" is how AI is used. "Inference disaggregation" is a technique to divide and conquer inference compute. Disaggregation separates inference into two stages: prompt processing, called "prefill," and output generation, called "decode."

Prefill - where the model processes your prompt. This is what you type into ChatGPT, for example.
Decode - where the model generates new tokens one at a time to create the response that you read. This is the answer you get back from GPT.

Why Disaggregation Matters
These two stages have very different computational characteristics. Prefill is natively parallel and requires little memory bandwidth. Decode is inherently serial and memory bandwidth intensive. Prefill can be done quickly, while decode accounts for the majority of the time between hitting send and getting your full answer. This is because decode is a sequential process: each output token (word) must be generated before the next can begin.

Because the stages are so different, there's an opportunity to specialize, that is, to divide and conquer. Rather than one processor doing both jobs, you can use two different processors, each with an architecture suited to its task. The result of this specialization is higher throughput and lower power consumption.

The Tradeoff
In computer architecture, there is no free lunch. The cost of specialization is lost flexibility. Deploying separate hardware for prefill and decode locks in the ratio between them. For example, out of every 100 racks, you might allocate 30 to prefill and 70 to decode. That ratio is fixed at deployment time.

When you can predict key workload characteristics (input/output ratio, KV cache size, cache hit rate), specialization delivers exceptional value. But it's fragile. If workload characteristics shift, you end up with the wrong balance of prefill and decode hardware. The result: stranded capacity, lower utilization, higher power draw, and higher costs.

The challenge, of course, is that hardware deployments are meant to last for five or six years, and that data centers are physically configured for the hardware deployed. Change is expensive. When you can't predict workload characteristics with high accuracy, specialization through disaggregation will cost more and consume more power.

Who Benefits, Who Doesn't
Hyperscalers, who have fleets of different processors and who can move workloads across their fleet, will easily overcome the lack of flexibility in disaggregated solutions. And they will benefit enormously from it. If the workload changes, they can direct that traffic to different processors in their massive fleets. However, for enterprises and neoclouds, who have long depreciation schedules and are locked into a specific vendor's processor architecture, the rapidly changing AI landscape will make disaggregated solutions a real challenge.

The Bottom Line
If you know your workload well and are confident it won't change much, or if you have a large pool of diverse hardware to absorb shifts, disaggregation is a good choice. If you can't predict your traffic or lack a flexible hardware fleet, a more general-purpose approach that handles prefill and decode on the same hardware is probably the safer bet.

Final Thoughts
Disaggregated inference is still a new technology. I'm often asked what percentage of AI data centers will be built this way. The honest answer is that no one knows yet. The battle between specialized solutions and more general ones is always interesting and difficult to predict. But overall, with AI inference growing so quickly, I expect disaggregation will add to, rather than replace, the way we do inference today.
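A quick way to make the tradeoff concrete: the Python sketch below models a hypothetical 30/70 prefill/decode rack split. Every per-rack throughput number in it is an invented assumption for illustration, not a real figure; the point is only that a split sized for one traffic mix strands capacity when the mix shifts.

def utilization(prefill_racks, decode_racks, prefill_demand, decode_demand,
                prefill_rate=10.0, decode_rate=1.0):
    """Fraction of deployed racks doing useful work for a given demand.

    prefill_demand / decode_demand: stage throughput the workload needs.
    prefill_rate / decode_rate: throughput one rack can serve (assumed).
    """
    # Racks actually needed for this traffic, capped at what was deployed.
    used_prefill = min(prefill_racks, prefill_demand / prefill_rate)
    used_decode = min(decode_racks, decode_demand / decode_rate)
    return (used_prefill + used_decode) / (prefill_racks + decode_racks)

# Deployment sized for the predicted mix: 30 prefill racks, 70 decode racks.
deployed = (30, 70)

# Predicted workload (long prompts): every rack is busy.
print(utilization(*deployed, prefill_demand=300, decode_demand=70))  # -> 1.0

# Mix shifts to short prompts: prefill racks sit idle, utilization drops.
print(utilization(*deployed, prefill_demand=60, decode_demand=70))   # -> 0.76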
23 replies · 104 reposts · 1.2K likes · 653.7K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
At GTC, we saw the crumbling of one of @nvidia's most enduring moats. It was a perception moat: the perception that GPUs were all you need for AI.

Nvidia paid $20 billion for Groq, acknowledging that for fast inference, the GPU alone couldn't get the job done. Nvidia's newly announced inference solution, requiring 5 distinct systems comprising CPUs, GPUs, and LPUs, makes it clear that the GPU isn't enough for fast inference, and puts a stake through the heart of the notion that all you need for AI is the GPU.

What happened?
👍 Agentic coding took off.
👍 The market for chat is limited by the number of internet users.
👍 But coding tools are different. They are billed by the token, and developers are using more all the time.
👍 Coding agents need fast inference - speed, measured by tokens per second per user.
👍 When inference is fast, developers are more productive and ship faster, which in turn generates more revenue.

Nvidia said in the keynote: "Fast tokens are smart tokens and valuable tokens." I will add the corollary: slow tokens are not so smart, and not so valuable. Cerebras is the fastest inference hardware in the world.
17 replies · 13 reposts · 103 likes · 9.8K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
NVIDIA's biggest GTC announcement was a $20 billion bet on the same problem we solved 6 years ago. Their next-gen inference chip - not available yet - has 140x less memory bandwidth than @cerebras.

To run a single 2 trillion parameter model, you need 2,000+ Groq chips. On Cerebras, that's just over 20 wafers. Even paired with GPUs, Groq maxes out at ~1,000 tokens per second. We run at thousands of tokens per second today. And every day. In production now.

Why? When you connect 2,000 chips together, every interconnect has latency. Every cable has overhead. It doesn't matter what your memory bandwidth is on paper if you're bottlenecked by the wiring between thousands of tiny chips. We solved this with wafer scale. One integrated system. Little interconnect tax.

Jensen told the world that fast inference is where the value is. He's right - it's why the world's leading AI companies and hyperscalers are choosing Cerebras.
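Where do counts like "2,000+ chips" versus "just over 20 wafers" come from? A back-of-the-envelope sketch in Python; the per-device SRAM capacities below are illustrative assumptions picked so the arithmetic lands near the post's figures, not published specs.

import math

PARAMS = 2e12          # 2 trillion parameters (the model size in the post)
BYTES_PER_PARAM = 1    # assume FP8 weights

weight_bytes = PARAMS * BYTES_PER_PARAM   # ~2 TB of weights to hold on-chip

small_chip_sram = 1e9   # assumed ~1 GB of SRAM per small chip (illustrative)
wafer_sram = 100e9      # assumed ~100 GB of SRAM per wafer-scale system

print(math.ceil(weight_bytes / small_chip_sram))  # -> 2000 small chips
print(math.ceil(weight_bytes / wafer_sram))       # -> 20 wafers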
73 replies · 68 reposts · 745 likes · 155.1K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
GPUs are slow at AI inference because they hit the memory wall. Cerebras pioneered the SRAM-based AI accelerator because GPUs were memory bandwidth constrained. Let me explain.

There are two types of memory: memory that can store a lot but is slow, and memory that is fast but can't store much per square millimeter of silicon. The former is called DRAM (or HBM) and the latter is SRAM. In fact, graphics was the perfect use case for HBM: it required a lot of data stored, but didn't need it moved very often. This is why graphics processing units use HBM.

But AI inference has different characteristics than graphics. It moves data constantly from memory to compute. To generate each token, it needs to move all of the weights from memory to compute. And for the next token, it needs to do it again. For every single token in the answer. Because HBM is slow, moving data is time consuming. The GPU is waiting for data to get to it. It sits idle. Pulling power. Doing no work.

Cerebras chose to use SRAM so we could move data from memory to compute faster. Not a little bit faster, but more than 2,600 times faster than NVIDIA Blackwell GPUs. As a result, we can generate tokens 15 times faster. This is why we are the fastest in the world.

But what about the weakness of SRAM? Surely there is a tradeoff. SRAM can't store very much data per square millimeter. This is why Cerebras went to wafer scale. By building a chip the size of a dinner plate, a chip that is 58 times larger than the largest GPU, Cerebras could stuff it to the gills with SRAM. We couldn't make SRAM store more data per square millimeter, but we could provide more square millimeters by building a bigger chip.

If you build a solution with little chips and try to use SRAM, you need to link thousands of them together to support a larger model. There simply isn't enough room on the little chips for lots of SRAM and lots of compute cores. Thousands of little chips connected together with cables is slower and more power hungry than if all that traffic stayed on a big chip, or even several big chips. And since communication between chips is slow, and communication on chip is fast, lots of little chips is slower at inference as well.
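The memory-wall argument reduces to a one-line bound: during decode, every output token streams all of the weights from memory to compute, so tokens per second per user is capped at memory bandwidth divided by weight bytes. A minimal sketch with illustrative numbers (the bandwidths are assumptions, not vendor specs; only the 2,600x ratio comes from the post):

def max_decode_tokens_per_sec(weight_bytes, mem_bandwidth):
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return mem_bandwidth / weight_bytes

weights = 70e9 * 2        # e.g. a 70B-parameter model in FP16 -> 140 GB

hbm_bw = 8e12             # assume ~8 TB/s of HBM bandwidth
sram_bw = hbm_bw * 2600   # the post claims >2,600x more bandwidth on-wafer

print(f"{max_decode_tokens_per_sec(weights, hbm_bw):.0f} tok/s")   # ~57
print(f"{max_decode_tokens_per_sec(weights, sram_bw):.0f} tok/s")  # far higher

# Realized speedups (15x in the post) are smaller than the raw bandwidth
# ratio because once memory stops being the bottleneck, compute and
# interconnect take over.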
31 replies · 29 reposts · 368 likes · 36K views
Jason ward reposted
Cerebras @cerebras
Nvidia DLSS 5 off / DLSS 5 on
21 replies · 34 reposts · 591 likes · 41.7K views
Jason ward reposted
Cerebras @cerebras
Our coding workflows were designed to accommodate slow inference. @OpenAI's Codex Spark powered by @cerebras changes the game. Here's how we make the most out of 1,200 tokens per second, with @MilksandMatcha.
11 replies · 15 reposts · 108 likes · 13.1K views
Jason ward reposted
Cerebras @cerebras
gpt-oss-120b is one of the most-used models on Cerebras Inference. We sat down with @ml_angelopoulos from @arena and @SarahChieng to break down its strengths, weaknesses, and where it's outperforming. Here's what he's seeing.
10 replies · 9 reposts · 110 likes · 12.9K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
Love the smell of power plant construction in the morning. It’s the smell of AI. Little is better than the sight of giant gas turbines being installed. A 300MW power plant under construction. Hundreds of jobs for the local community. And 100MW of power for a new data center.
2 replies · 3 reposts · 37 likes · 2.7K views
Jason ward reposted
Cerebras @cerebras
@sama Cerebras 🤝 OpenAI
9 replies · 8 reposts · 276 likes · 9.7K views
Jason ward reposted
Andrew Feldman @andrewdfeldman
Just one month after announcing our partnership with @OpenAI, we're launching our first model together: OpenAI Codex-Spark, powered by @cerebras.

Codex-Spark is built for real-time software development. In coding, responsiveness is the product. It is not a nice-to-have. Codex-Spark is optimized for targeted code edits, logic revisions, and frontend iteration. It gives developers near-instant feedback so they can stay in flow.

Powered by the Cerebras Wafer-Scale Engine, it runs at over 1,000 tokens/s. That speed fundamentally changes the experience. We did not build this to win a benchmark. We built it so developers could move faster.

I'm proud of how quickly the OpenAI and Cerebras teams have brought this to life. This is what fast execution looks like - deep engineering collaboration, rapid iteration, and shipping real products developers can use today. We are just getting started. When inference is fast, entirely new markets open up. We plan to lead that shift with our partners at OpenAI.
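For anyone who wants to check a tokens-per-second claim themselves, here is a minimal measurement sketch using the OpenAI Python SDK against an OpenAI-compatible streaming endpoint. The base_url and model id are placeholders, not confirmed identifiers for the Codex-Spark launch, and counting stream chunks only approximates token count:

import time
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.example-inference.com/v1",  # hypothetical endpoint
    api_key="YOUR_KEY",
)

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="codex-spark",  # hypothetical model id for illustration
    messages=[{"role": "user", "content": "Rename variable x to count."}],
    stream=True,
)
for chunk in stream:
    # Most servers send roughly one content delta per generated token.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1

elapsed = time.time() - start
print(f"~{chunks / elapsed:.0f} tokens/sec")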
79 replies · 115 reposts · 1.8K likes · 173.4K views
Martin @Brown2Martin
@JasonWardmc And taking your phone with you…😂
1 reply · 0 reposts · 2 likes · 228 views
Jason ward @JasonWardmc
Can’t beat relaxing in warm water 💧
3 replies · 1 repost · 14 likes · 1.2K views
Jason ward @JasonWardmc
Can’t get much closer to water for your 🥪
0 replies · 2 reposts · 19 likes · 1.4K views
Jason ward @JasonWardmc
Great evening for ⛳️ at last
0 replies · 0 reposts · 10 likes · 1.7K views
Jason ward reposted
Tour Golf (not PGA Tour) @PGATUOR
Never forget one of the swaggiest putts of all-time 💯
122 replies · 1.5K reposts · 21.6K likes · 1.5M views