Anton McGonnell

694 posts

@aton2006

Product @SambaNovaAI #GenerativeAI

Palo Alto · Joined August 2009
489 Following · 559 Followers
Anton McGonnell
Anton McGonnell@aton2006·
We agree with Jensen on one thing: The future is fast, disaggregated inference. But the path there matters. Nvidia has proposed a fragmented, multi-chip system with strong benchmarks, but it will be inefficient, costly and complicated to deploy. We take a cleaner approach: Dataflow architecture. Fewer chips. High efficiency. Deployable. No Frankenstack. It’s clear: GPUs for prefill. RDUs for decode. @SambaNovaAI
Anton McGonnell tweet media
Anton McGonnell
Anton McGonnell@aton2006·
Someone needs to explain to me how this “magic deterministic compiler” is supposed to handle the inherent dynamism in the most common forms of sparsity. I.e. in MoEs, you don't know at compile time which token activates which parameters. So are you activating all parameters? This is incredibly wasteful if so. And we are going to see more things like this, e.g. DSA makes the KV cache sparse and therefore dynamic. The only super fast inference chip that makes sense is @SambaNovaAI.
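To make the dynamism concrete, here is a minimal sketch of a generic top-k MoE forward pass (my own illustration, not any vendor's implementation): which experts run, and how much work each does, depends on the router's scores for the tokens actually seen at runtime, so a purely static compile-time schedule either provisions for every expert or has to cope with data-dependent routing.

```python
import torch

def moe_forward(x, router, experts, k=2):
    """Minimal top-k mixture-of-experts forward pass.

    Which experts fire for each token depends on the router's scores for
    that token, so the per-expert workload is only known at runtime.
    """
    logits = router(x)                                  # [tokens, num_experts]
    weights, idx = torch.topk(logits.softmax(dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)  # data-dependent token set
        if rows.numel() == 0:
            continue                                     # this expert does no work this step
        w = weights[rows, slots].unsqueeze(-1)
        out[rows] += w * expert(x[rows])
    return out

# Toy usage: 4 tokens, 8 experts; the set of active experts changes with the input.
d, n_exp = 16, 8
router = torch.nn.Linear(d, n_exp)
experts = [torch.nn.Linear(d, d) for _ in range(n_exp)]
y = moe_forward(torch.randn(4, d), router, experts)
```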
Ryan Shrout
Ryan Shrout@ryanshrout·
Intel is getting in on the AI announcements today as well, in partnership with @SambaNovaAI. I still have a lot of digging to do on what makes these chips unique, but the claims are impressive, and the partnership with Intel means the potential to scale quickly and broadly. Today SambaNova unveiled its 5th-gen SN50 RDU, a shift from AI training to production-grade Agentic Inference.

1/ Hardware Architecture for Agents: Unlike GPUs designed for general compute, SambaNova claims the SN50's Reconfigurable Dataflow Unit (RDU) maps AI models directly to silicon. This minimizes off-chip memory calls, effectively solving the "latency tax" that plagues traditional architectures during complex reasoning loops.

2/ The "Memory Wall" Breakthrough: Featuring a three-tier memory architecture (SRAM, HBM, and high-capacity DDR), the SN50 supports models up to 10T+ parameters and 10M+ context lengths. Its "Agentic Cache" allows for hot-swapping models in milliseconds, critical for agents switching between reasoning, tool use, and retrieval.

3/ Aggressive Performance Claims: SambaNova is claiming 5x faster speeds and 3x the throughput of Nvidia's Blackwell B200 for agentic workloads (e.g., Llama 3.3 70B). More importantly, they cite an 8x TCO advantage, positioning the SN50 as the performance-per-watt leader for inference service providers. That's a VERY big claim, and one that (rightfully) needs a lot more scrutiny.

4/ Strategic @intel Collaboration: The multi-year deal with Intel is a big piece of this. SambaNova gains global supply chain and distribution scale, while Intel bolsters its AI story by pairing Xeon infrastructure with specialized accelerators to offer a credible alternative to the Nvidia-centric data center.

5/ Validation via Capital & Customers: A $350M Series E (led by Vista Equity Partners) provides the runway to scale, while an anchor deployment with SoftBank Corp. in Japan validates the tech's readiness for sovereign AI and massive-scale production.

6/ The "Intelligence per Joule" Metric: Operating at 20 kW per SambaRack, the SN50 fits into existing air-cooled data centers. This focus on efficiency is a sign that the industry is moving past raw FLOPS toward cost-per-token and energy efficiency as the primary KPIs for ROI.

While Nvidia remains the incumbent and AMD the strong #2, the Intel + SambaNova stack offers the diversification enterprises have been demanding. Of course, there is added complexity in the knowledge that Nvidia is an Intel investor now, and that they have their own x86+GPU partnership in play...🌐
Intel News@intelnews

Today, Intel and SambaNova announced plans to build a multi-year strategic collaboration to deliver high-performance and cost-efficient AI inference solutions for AI-native start-ups, model providers, enterprises, and government organizations worldwide, built around Intel Xeon-based infrastructure. It was also announced that Intel Capital is participating in SambaNova's Series E financing round. ms.spr.ly/6010QZ9Ow

SambaNova
SambaNova@SambaNovaAI·
SN50 is here, the fastest chip built for agentic AI. Max speed of up to 5X faster; run agentic AI at a 3X lower cost than GPUs, unlocking cloud-scale inference economics. We’ve also planned a multi-year strategic collaboration with @intel & raised $350M+ from @Vista_Equity, Cambium Capital & @TRowePrice to scale manufacturing & cloud capacity. Learn more: bit.ly/4qUsx9F
yontr
yontr@yontrtwt·
I completely agree. Currently, there are people who don't pay for AI, who pay for AI, and who pay a lot for AI. The ones who don't pay or pay a little want maximum tokens at the cheapest cost. The ones who pay a lot want the fastest tokens too, to unlock new workloads. Who does the balanced middle ground serve? The ones who balance speed and cost.
Bionic_Squash
Bionic_Squash@SquashBionic·
@aton2006 In terms of hardware customers, I guess we'll have to wait and see if this partnership helps in this regard. SoftBank as anchor customer is nice (would be cool to know the cluster size)
Bionic_Squash
Bionic_Squash@SquashBionic·
I guess good for SambaNova. But I don't feel like this is sustainable unless they start getting some actual momentum. The deal is good for both of them: it lets Intel ship a bunch of their GPUs for prefill and CPUs as hosts, and Intel helps SambaNova reach out to customers.
SambaNova@SambaNovaAI

SN50 is here, the fastest chip built for agentic AI. Max speed of up to 5X faster; run agentic AI at a 3X lower cost than GPUs, unlocking cloud-scale inference economics. We’ve also planned a multi-year strategic collaboration with @intel & raised $350M+ from @Vista_Equity, Cambium Capital & @TRowePrice to scale manufacturing & cloud capacity. Learn more: bit.ly/4qUsx9F

Anton McGonnell retweeted
Intel News
Intel News@intelnews·
Today, Intel and SambaNova announced plans to build a multi-year strategic collaboration to deliver high-performance and cost-efficient AI inference solutions for AI-native start-ups, model providers, enterprises, and government organizations worldwide, built around Intel Xeon-based infrastructure. It was also announced that Intel Capital is participating in SambaNova's Series E financing round. ms.spr.ly/6010QZ9Ow
yontr
yontr@yontrtwt·
@v_mohan_ @SambaNovaAI @intel @Vista_Equity Being slightly better than Blackwell is a tough market, but I wish you the best. Rubin will be faster and cheaper than Blackwell too. I hope SN50 beats that.
Anton McGonnell retweeted
SambaNova
SambaNova@SambaNovaAI·
Our RDU delivers 4X more intelligence per joule than Nvidia’s latest B200 Blackwell chip. Research from Stanford introduces "Intelligence per Joule", a new metric that best explains AI efficiency from chips to models. Learn more about this benchmark: sambanova.ai/blog/best-inte…
The Tactixology
The Tactixology@tactixology·
UNITED’S xG FALLACY

Are United simply unlucky this season, as seems to be the message after Ratcliffe & Amorim’s meeting? Will the tide soon turn? The G-xG underperformance certainly seems to suggest so: United are seemingly creating much more than they are scoring. Except that it’s not as simple. 🧵👇

What we’re seeing at work is a classic example of a common statistical fallacy - the fallacy of additive probabilities. Let’s dive in.

There are two ways to look at United’s xG: that the players have been unable to convert good chances, or that something might be off in the way that United’s chances are created. Since we just invested €230m into new forwards, option #1 would be bizarre, though possible. But if you watch the games, the eye test confirms what a deeper dive into attacking stats indicates: it’s not the finishing.

How can that be? How can United have a high total xG and a low number of goals scored, without poor finishing being responsible for it?

A lot of people have a distorted view of total xG over a number of games or a season, treating it like collectibles which add up and can then be traded for goals, like grain in Age of Empires. What needs to be considered is the quality of those chances, too. The sum itself is distorting the picture, because chances are summed up linearly.

United are taking a lot of shots, true, but they are low-quality shots, end products of direct passes, isolated players, rushed actions, often against settled blocks. What this means is that it’s fundamentally wrong to add those shots up and treat the whole as indicative of the probability to score. Enter the fallacy of additive probabilities.

Imagine if a manager instructed his players to shoot right away whenever they had the ball in the opposition half, literally whenever, regardless of distance, angle, blocks. They’d be extremely low xG shots, but they’d end up having 60-70 of them per game, so the total open-play xG would seem good. But in reality, none of those attempts had a realistic chance of going in. United’s example is less extreme, for sure, but the principle is the same.

If a majority of your shots are low-quality shots, the xG adds up, but it has no effect on your next low-quality chance. It’s a mistake to think that you can simply keep adding low-quality shots and eventually one is bound to result in a goal. People sometimes treat a bunch of small independent probabilities as if they summed linearly into one equivalent large probability. But probabilities don’t add that way. They combine via complementary probabilities, in a non-linear way.

It’s not unlike the Gambler’s fallacy, in which one ceases to consider each roll’s probability of success and simply believes that a volume of unsuccessful tries means the next roll is surely more likely to be successful. Never mind that if we take 10 bad shots with a 0.09 xG each, that adds up to 0.9 xG, so the next shot is bound to go in. But that’s not the way the world, nor stats, works.

In order to start scoring more, United have to improve how chances are being created.
The Tactixology tweet media
EPL - Analytics@DataAnalyticEPL

👉Total xG, Total Goals scored and xG performance for the teams in English Premier League 2025-2026. UPDATED after Gameweek 4.
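As a quick numerical illustration of the non-additive point (the 0.09 xG figure is the thread's; the arithmetic below is just standard probability, assuming each shot is an independent trial):

```python
def p_at_least_one_goal(xgs):
    """Chance of scoring at least once, treating each shot as an independent
    Bernoulli trial with success probability equal to its xG."""
    p_none = 1.0
    for xg in xgs:
        p_none *= (1.0 - xg)           # probability that every shot so far misses
    return 1.0 - p_none

shots = [0.09] * 10                     # ten low-quality chances, as in the thread
print(sum(shots))                       # naive "additive" total xG: 0.90
print(p_at_least_one_goal(shots))       # actual P(at least one goal): ~0.61

# The same 0.9 xG concentrated in two good chances is worth more:
print(p_at_least_one_goal([0.45, 0.45]))   # ~0.70
```

Adding more 0.09-xG shots pushes the probability toward 1 only asymptotically; each individual shot still goes in roughly 9% of the time, which is the thread's point about chance quality.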

Anton McGonnell retweeted
Will Manidis
Will Manidis@WillManidis·
regardless of where the ai cycle ends, it is inevitable that the number of investable assets pre-ipo is going to go from 1000s to dozens pretty quickly. most of the market has intuited this, but very few are taking it deadly seriously. my view of what's going on here:
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
Inception Labs has just launched the first diffusion language model publicly released for general chat. Mercury is a generalist language model with similar intelligence to OpenAI’s GPT-4.1 Nano that runs >7x faster than GPT-4.1 Nano on GPU hardware. This follows @InceptionAILabs' code-focused Mercury Coder model released earlier this year. Diffusion language models achieve faster output speeds than autoregressive language models on the same hardware because they can process many output tokens in parallel - allowing them to use more of a GPU’s compute without being limited by memory bandwidth. Mercury is available now via Inception’s first-party API at $0.25/$1 USD per million input/output tokens.
Artificial Analysis tweet media
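A rough back-of-the-envelope for the memory-bandwidth point (illustrative figures of my own, not Artificial Analysis' or Inception's benchmarks): an autoregressive decoder re-reads the model weights for every output token, so its decode speed is roughly capped at memory bandwidth divided by weight bytes; if a diffusion step refines many output tokens per weight read, that same bandwidth is amortized across them.

```python
def bandwidth_bound_tokens_per_s(bandwidth_gb_s, weight_gb, tokens_per_pass=1):
    """Crude estimate: each forward pass streams the full weights once from
    memory; `tokens_per_pass` output tokens share that single read."""
    passes_per_s = bandwidth_gb_s / weight_gb
    return passes_per_s * tokens_per_pass

# Hypothetical numbers: ~3 GB of weights, ~3 TB/s of HBM bandwidth.
print(bandwidth_bound_tokens_per_s(3000, 3.0, tokens_per_pass=1))   # autoregressive: ~1000 tok/s
print(bandwidth_bound_tokens_per_s(3000, 3.0, tokens_per_pass=8))   # 8 tokens per pass: ~8000 tok/s
```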
Anton McGonnell retweeted
SambaNova
SambaNova@SambaNovaAI·
We just keep moving up! 📈 131K Context Length is now available on #Llama4 Maverick! Unlock use cases with Maverick for RAG, Coding, and Image Processing 👇
Anton McGonnell
Anton McGonnell@aton2006·
I can only speak for @SambaNovaAI but these numbers are wrong. Our typical deployment caps power at 688 watts per chip. This can go up to a max of 1188 watts per chip but that is typically not required, particularly for inference. We have 838 TFLOPS of BF16 compute per chip. So our number is ~0.92.
hassan
hassan@hbou·
Samsung 2400: ~4.5
Qualcomm AI Hub: ~3
NVIDIA GB300: ~2.5
Apple M5: ~1.9
Cambricon MLU370-X8: ~0.85
Huawei 910B: ~0.83
Groq LPU: ~0.46
Google v5e: ~0.5
SambaNova DataScale: ~0.25
AWS Trainium2: ~0.2
AMD MI300X: ~0.12
Intel Gaudi3: ~0.13
Fujitsu A64FX: ~0.33
Cerebras CS-3: ~0.033
hassan
hassan@hbou·
AI Chip (TFLOPS/watt):
Qualcomm AI Hub: ~3
NVIDIA GB300 Blackwell: ~2.5
Apple Neural Engine (M5): ~1.9
Google TPU v5e: ~0.5
Groq LPU: ~0.46
SambaNova DataScale: ~0.25
AWS Trainium2: ~0.2
Intel Gaudi3: ~0.13
AMD Instinct MI300X: ~0.12
Cerebras CS-3: ~0.033
by @grok
Anton McGonnell retweeted
Meta for Developers
Meta for Developers@MetaforDevs·
Apply to participate in the LlamaCon Hackathon in SF May 3rd-4th🦙 We’ve partnered with @cerebral_valley, @Shack15sf, @nebiusai, @GroqInc, @LambdaAPI, @SambaNovaAI, @tavus, and @crewAIInc to bring together top-tier developers building groundbreaking applications.
👉 Work directly with the latest Llama models with guidance from Llama experts from Meta
👉 Connect with the top AI developers in the Bay Area, renowned names in the AI industry and Meta leaders
👉 The prize pool is up to $35,000 in cash and partner credits
Apply below 👇 bit.ly/4is5COV
Anton McGonnell
Anton McGonnell@aton2006·
Yes it is absolutely possible, but it depends on the chip. GPUs typically have a lot of synchronization issues as there are lots of overheads. But think of it this way - decoding is memory bound until you reach very, very large batch sizes, but if your use case is real-time inference, you don’t want to increase your batch size to such a degree that it is no longer interactive. In theory, you should actually be able to increase system throughput with parallelization if you can make sure p2p doesn’t become the bottleneck.

Let’s assume you have a 2-chip budget. And let’s say mapping 1 is 1 chip running one 8B model (BF16) at batch size 16. This would be 16GB plus KV cache, which is maybe 500MB. Maybe this runs at 100 tps per user. So you duplicate this on both chips (simple data parallel) for a total throughput of 3200 output tokens/s.

Mapping 2 is 2 chips running the same model using TP2 at batch size 32, so the KV cache is 1GB. I should be able to run this at greater than 100 tps (e.g. 150 tps) because I am loading less data per token. Total throughput here would be 4800 output tokens/s.

The difference is that in mapping 1 I am loading the weights twice (16GB on chip 1, 16GB on chip 2, and a total KV cache of 1GB, meaning total data to load of 33GB). In mapping 2, I am loading the weights once and using both chips’ HBM bandwidth to do it (16GB across chips 1+2, and a total KV cache of 1GB, meaning total data to load is 17GB).

So the difference between these two mappings is that I have the same amount of compute and the same amount of bandwidth and I need to do the same amount of computation, but I am loading less data per token, and as long as I am memory bound and have no p2p bottlenecks, then mapping 2 is better for speed and throughput. Now there are lots of practical things that make this hard, especially on GPUs, but it is what we do on our RDU chips.
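A minimal sketch of the arithmetic in the tweet above (the 8B/BF16, batch-16 data-parallel vs TP2/batch-32 numbers are Anton's; folding them into a tiny bandwidth-bound model is my own assumption):

```python
def decode_mapping(weight_gb, kv_cache_gb_total, batch, tps_per_user, replicas=1):
    """Return (output tokens/s, GB streamed per decode step) for a mapping.

    Each replica serves `batch` users at `tps_per_user`, and every decode
    step streams that replica's weights plus its share of the KV cache.
    """
    tokens_per_s = replicas * batch * tps_per_user
    data_per_step_gb = replicas * weight_gb + kv_cache_gb_total
    return tokens_per_s, data_per_step_gb

# Mapping 1: data parallel, one 8B BF16 model (~16 GB) per chip, batch 16, ~100 tps/user.
print(decode_mapping(weight_gb=16, kv_cache_gb_total=1.0,
                     batch=16, tps_per_user=100, replicas=2))   # (3200, 33 GB/step)

# Mapping 2: TP2 across both chips, one copy of the weights, batch 32, ~150 tps/user.
print(decode_mapping(weight_gb=16, kv_cache_gb_total=1.0,
                     batch=32, tps_per_user=150, replicas=1))   # (4800, 17 GB/step)
```

Same compute and aggregate bandwidth in both cases; the TP2 mapping simply moves less data per generated token, which is why it can be faster per user and higher throughput overall while it remains memory bound.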
xlr8harder
xlr8harder@xlr8harder·
@aton2006 oh wait you even said without sacrificing throughput? hmm. i should experiment with this sometime.
xlr8harder
xlr8harder@xlr8harder·
Temp 0 greedy decoding is often assumed to be deterministic, but in practice it is often not, due to the non-associativity of floating point operations combined with multi-GPU operations. But how different are the generations in practice? I devised a quick test: send the same prompt 10 times to every provider of the same model and look at the edit distance of each output to all other generations by that provider, and to all other generations by all other providers.

What do we see? Some providers are deterministic, even with large models! This could be evidence of providers returning cached results, or of using different architectures, like Groq, where parallelism is not required. (I will say that if providers are returning cached results for temp 0 inference requests without disclosing this, that seems dishonest to me.)

In the first chart for Llama 4 Maverick we can see a few things.
1. groq and fireworks are returning deterministic results. Architecture or caching? No idea.
2. Most providers cluster around a similar edit distance from one another. This makes sense; most are probably using similar implementations on similar hardware.
Lambda Labs is a clear outlier in the first chart. Being an outlier doesn't mean you're doing something wrong -- it could mean you're just using a slightly different inference stack than everyone else. But it could also be evidence of e.g. quantization or taking shortcuts, and might be worth a closer look.

In the second chart I ran against Llama 3.1 8B Instruct. Here determinism is more plausible, because the model easily fits on a single GPU. But what do we see in practice?
1. SambaNova and Fireworks both seem to be returning internally deterministic results. Though their results are not identical! You can see this because their edit distances differ slightly when compared to other providers.
2. Two providers seem to be returning nearly but not quite deterministic results. InferenceNet and Lepton both have nearly but not quite identical internal results. Not sure what explains this; maybe multiple GPU types in their serving stacks?
3. There is some clear clustering of differences across other providers, and also some clear outliers. The deterministic-ish providers all seem to show similar edit distances to other providers. There is another cluster that seems to include Novita, Friendli, Together. Kluster is a little bit odd here, and Lambda is a clear outlier here, again.
xlr8harder tweet media
xlr8harder tweet media
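A rough reconstruction of the test methodology described above (a sketch under assumptions: the `generate(provider, prompt)` call is a hypothetical stand-in for whatever client was actually used, and plain Levenshtein distance is assumed as the edit metric):

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def determinism_report(outputs_by_provider):
    """outputs_by_provider: {provider: [N completions of the same temp-0 prompt]}.
    Returns {provider: (mean intra-provider distance, mean inter-provider distance)}."""
    report = {}
    for name, outs in outputs_by_provider.items():
        intra = [levenshtein(x, y) for x, y in combinations(outs, 2)]
        others = [o for other, outs2 in outputs_by_provider.items()
                  if other != name for o in outs2]
        inter = [levenshtein(x, y) for x in outs for y in others]
        report[name] = (sum(intra) / len(intra), sum(inter) / len(inter))
    return report

# Usage sketch: `generate` is a hypothetical client call, not a real library API.
# outputs = {p: [generate(p, PROMPT, temperature=0) for _ in range(10)] for p in PROVIDERS}
# print(determinism_report(outputs))
```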