ATOM

707 posts

@ATOMInference

The Global Price Benchmark for AI Inference. Pricing intelligence for developers, analysts, and infrastructure buyers. Weekly newsletter on LinkedIn.

Europe · Joined September 2020
149 Following · 33 Followers

Pinned Tweet
ATOM @ATOMInference
The median price gap between neoclouds and model developers this week sits at 64.9%. Inference platforms come in at 52.6% cheaper. Open-weight models trade at 80.9% below proprietary equivalents. Channel choice is now a first-order decision. Picking the right place to buy a model often matters more than picking the model itself. a7om.com #ATOMInference
0 replies · 3 reposts · 4 likes · 113 views
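A minimal sketch of how a channel gap like these figures can be computed, assuming per-model output prices collected for two channels; all prices below are hypothetical placeholders, not ATOM's data:

```python
from statistics import median

# Hypothetical $ per 1M output tokens for the same models bought
# through two channels (placeholder numbers, not ATOM's index).
model_dev_price = {"model-a": 15.00, "model-b": 8.00, "model-c": 60.00}
neocloud_price = {"model-a": 5.10, "model-b": 3.20, "model-c": 19.80}

# Per-model discount of the neocloud channel vs. the model developer.
gaps = [1 - neocloud_price[m] / model_dev_price[m] for m in model_dev_price]

# Median gap across models: the "neoclouds are X% cheaper" style figure.
print(f"median channel gap: {median(gaps):.1%}")
```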
ATOM @ATOMInference
The narrow path thesis for Cerebras shows up in ATOM's pricing data. Cerebras consistently prices in the lowest band of the neocloud channel for the models it serves. The advantage of fast inference at low context is real, but the per token economics hold only as long as the workload fits the chip. ATOM tracks Cerebras weekly across its available SKUs and its pricing position is competitive precisely in the speed sensitive, low context window use cases described here. a7om.com
0 replies · 0 reposts · 0 likes · 374 views
TBPN @tbpn
SemiAnalysis President @fabknowledge on the Cerebras IPO: "There is a narrow path for them. I think they're going to be able to inference maybe 1 trillion parameters and very small context window sizes. Or smaller models at very fast speeds." "There's demand. Clearly, we're in a shortage, and ironically in a shortage, it's not the best company who wins — you can look at Nvidia's stock chart and that will tell you." "It's the second, third, and fourth-best companies where the demand overflows. And we're seeing all that today." "The reality is the market's big enough for a lot of demand, and Cerebras is in that space." "They've done a really good job, and it's a cool engineering problem. But we think it's kind of a solution looking for a problem. Because the world of LLMs blew up at a much faster scale than anyone would have ever thought." $CBRS
4 replies · 15 reposts · 219 likes · 73.6K views
AMD @AMD
.@ZyphraAI’s AMD-first Inference Cloud is built for long-context, agentic AI, powered by AMD Instinct GPUs and optimized software for scalable open model serving. Follow us as AMD ROCm and AMD Instinct help enable the next wave of AI inference. Learn more: bit.ly/4fe9SDu
11 replies · 10 reposts · 129 likes · 66.6K views
ATOM @ATOMInference
AMD Instinct powered inference clouds are starting to show up in ATOM's vendor index. The open model serving angle is exactly where the neocloud pricing floor gets set. ATOM tracks the per token cost across 51 vendors weekly and AMD based infrastructure tends to price in the competitive lower band of the channel stack. Worth watching how Zyphra lands in the index over the next few weeks. a7om.com
0 replies · 0 reposts · 0 likes · 11 views
Rohan Paul @rohanpaul_ai
Ex-Google CEO, Dr. Eric Schmidt: AI may hit a money wall before it hits a power wall. "The real limit to AI is not energy; it is actually cash. When you add up the cost of these things, if you take round numbers, say $50 billion per gigawatt, then 10 gigawatts is half a trillion dollars. How many companies, countries, and so forth can hand an industry a trillion dollars of capital? Very, very few. The Chinese could certainly do it. I do not know if they are doing it, but I am going to try to find out. In America, there are people who hope that is going to happen. It is interesting that you can finance these things because the brilliance of the American capital market allows us to borrow that kind of money. For example, the Europeans cannot do this, which they are sort of sore about." --- Full video from 'Special Competitive Studies Project' YT channel (link in comment)
103 replies · 186 reposts · 1.2K likes · 396.3K views
ATOM @ATOMInference
The cash wall Schmidt describes shows up at the per token layer too. ATOM tracks 51 vendors weekly and the channel spread on output tokens is 24x between neoclouds and model developers right now. The capital constraint he describes is exactly why neoclouds exist. They solve the cash wall by commoditizing the serving layer and passing the savings through. The money wall is real but it is not evenly distributed across the stack. a7om.com
0 replies · 0 reposts · 0 likes · 565 views
ATOM @ATOMInference
The open source serving stack story is exactly what ATOM's audio index captures at the pricing layer. Baseten at $3 per million characters on Qwen3-TTS versus closed source APIs is the open source advantage playing out in the audio modality the same way it already has in text. ATOM's Open Source Advantage index sits at 80% for text. The audio segment is following the same curve. a7om.com
0 replies · 0 reposts · 0 likes · 47 views
vLLM @vllm_project
Great work at @baseten running vLLM-Omni in production — open-source, production-grade, cost-efficient omni-modal serving 🎙️ Multi-stage audio, streaming multi-modal, real-time TTS — workloads where closed-source APIs have been the default. → github.com/vllm-project/v…
Baseten @baseten

We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.

3 replies · 13 reposts · 85 likes · 9.8K views
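A rough sketch of how a figure like "cost per concurrent stream" can be backed out of per-character pricing; the speech rate below is an assumed round number, not a figure from Baseten:

```python
PRICE_PER_M_CHARS = 3.00  # $ per 1M characters, from the tweet
CHARS_PER_SEC = 15        # assumed English TTS speech rate (placeholder)

chars_per_hour = CHARS_PER_SEC * 3600                # 54,000 chars/hour
cost_per_stream_hour = chars_per_hour / 1e6 * PRICE_PER_M_CHARS
print(f"~${cost_per_stream_hour:.3f} per concurrent stream-hour")  # ~$0.162
```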
ATOM @ATOMInference
The Broadfly topology and 4.5x larger pod size will eventually show up in Google's per token pricing as the efficiency gains flow through. ATOM tracks Google across multiple SKUs weekly and infrastructure investments at this scale are exactly what drives the platform channel pricing floor lower over time. The hardware architecture story SemiAnalysis covers here is the upstream cause. ATOM tracks the downstream effect in the per token index. a7om.com
0 replies · 0 reposts · 0 likes · 98 views
SemiAnalysis @SemiAnalysis_
During their last Google Cloud Next conference in Las Vegas, Google unveiled their new inference-focused TPU, featuring a novel network topology called "Broadfly". By leveraging a high-radix design, Google can scale up to 1,152 TPUs in a single pod. Compared to Ironwood, this enables a 4.5x larger pod size while reducing network diameter and with a maximum of just 7 hops between any two chips. (1/3) 🧵
6 replies · 14 reposts · 178 likes · 33.6K views
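Working backward from the figures in the thread, a quick sanity check (a sketch derived from the quoted numbers, not from Google's published specs):

```python
broadfly_pod = 1152   # TPUs per pod, from the thread
scale_factor = 4.5    # "4.5x larger pod size" vs. Ironwood

# Implied size of the previous-generation pod under these two claims.
print(f"implied Ironwood pod: {broadfly_pod / scale_factor:.0f} chips")  # 256
```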
ATOM @ATOMInference
Fireworks AI is one of the vendors ATOM tracks across multiple SKUs in our inference index. The no usage ceilings, no credits model for training is interesting because it inverts the typical cost structure. The inference side is where ATOM sees Fireworks consistently pricing in the competitive lower band of the platform channel. "Your model, your inference" is a strong positioning line and the per token cost data backs it up. a7om.com
0 replies · 0 reposts · 1 like · 2 views
Fireworks AI @FireworksAI_HQ
Fireworks Training Platform continues to expand. Today GLM 5.1 LoRA RL is now live via Training API: SFT, DPO, and full RL on a 200K context window → custom loss functions or smart defaults. No usage ceilings. No credits to claim. Your model. Your inference. Get started → fireworks.ai/train
ClaudeDevs @ClaudeDevs

Starting June 15, paid Claude plans can claim a dedicated monthly credit for programmatic usage. The credit covers usage of: - Claude Agent SDK - claude -p - Claude Code GitHub Actions - Third-party apps built on the Agent SDK

2 replies · 2 reposts · 30 likes · 2.9K views
ATOM @ATOMInference
If the $0.25 input and $2 output pricing holds, that would place Gemini 3.2 Flash in the mid tier of ATOM's platform channel index. For context, ATOM currently tracks platform channel output at $3.38 per million tokens on average. A $2 output price from Google on a Pro level capable model would put real pressure on that benchmark. ATOM will have it in the index the week it launches. a7om.com
1 reply · 0 reposts · 1 like · 572 views
Pankaj Kumar @pankajkumar_dev
Gemini 3.2 Flash leaks: fast and cheap seems to be the focus
- Gemini 3.2 Flash looks focused on making AI much faster and cheaper without sacrificing too much quality
- According to my sources, Google may rename it to Gemini 3.5 Flash
- It may perform close to Gemini 3.1 Pro level while keeping very low latency, with sub-200ms responses rumored for many queries
- Pricing leaks point to around $0.25 input / $2 output per 1M tokens, though honestly that still feels too cheap to fully trust right now
- Google is using stronger distillation and sparsity techniques to compress larger model capabilities into a lightweight version
- Knowledge cutoff is said to be updated to January 2026
- Google also seems focused on grounding + search reliability to reduce hallucinations in real-world workflows
- Expected around Google I/O, possibly 1-2 days before the keynote
32 replies · 32 reposts · 619 likes · 49.3K views
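A quick way to see what the leaked pricing would mean against a channel benchmark such as ATOM's $3.38 average output price; the workload mix below is an arbitrary example:

```python
# Leaked, unconfirmed prices from the thread, $ per 1M tokens.
INPUT_PRICE, OUTPUT_PRICE = 0.25, 2.00
PLATFORM_AVG_OUTPUT = 3.38  # ATOM's platform channel average

# Arbitrary example workload: 10M input tokens, 2M output tokens.
in_tok, out_tok = 10_000_000, 2_000_000
cost = in_tok / 1e6 * INPUT_PRICE + out_tok / 1e6 * OUTPUT_PRICE
print(f"workload cost at leaked pricing: ${cost:.2f}")  # $6.50

# Discount of the leaked output price vs. the channel average.
print(f"output discount vs. avg: {1 - OUTPUT_PRICE / PLATFORM_AVG_OUTPUT:.0%}")  # 41%
```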
ATOM @ATOMInference
The 90% cost reduction on TTS versus closed source APIs is exactly the kind of move ATOM's audio index captures. AIPI AUD GLB recorded zero movement last week after a 5.77% input jump in Week 17, suggesting the audio segment is settling at a new baseline. Baseten pricing at $3 per million characters on Qwen3-TTS will be in next week's index. The open source advantage in audio is following the same pattern as text. a7om.com
0 replies · 0 reposts · 0 likes · 74 views
Baseten @baseten
We serve Qwen3-TTS on vLLM-Omni at $3 per 1M characters. That's 90% lower in cost than comparable closed-source TTS APIs. Our engineers optimized a single-replica serving stack to get there. Details on the optimized stack and cost per concurrent stream here.
Ian Carrasco @ia_n_ai

x.com/i/article/2054…

3 replies · 4 reposts · 50 likes · 24.4K views
ATOM @ATOMInference
The 35x throughput per megawatt improvement NVIDIA describes here will show up in ATOM's per token pricing data when vendors running Vera Rubin start repricing their SKUs. ATOM tracks 51 vendors weekly and infrastructure efficiency gains like this are exactly what drives the neocloud pricing floor lower over time. The hardware sets the ceiling. The pricing index catches when it lands in the market. a7om.com
0 replies · 0 reposts · 1 like · 179 views
NVIDIA Data Center @NVIDIADC
What does it take to serve agentic workloads on trillion-parameter models at 400 tokens per second per user — without trading throughput for latency? The NVIDIA Vera Rubin platform pairs Vera Rubin NVL72 with NVIDIA Groq 3 LPX to deliver low latency on trillion-parameter MoE models with 400K-token context with a 35x higher throughput per megawatt. Learn how the deterministic LPU chip-to-chip (C2C) fabric and extreme co-design address agentic AI's scale-up challenges. ➡️ nvda.ws/3RGZvhJ
5 replies · 18 reposts · 182 likes · 31.8K views
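One way throughput per megawatt flows into per-token cost is through the energy component. The inputs below are illustrative assumptions, not NVIDIA's numbers:

```python
# Illustrative assumptions, not NVIDIA figures.
tokens_per_sec_per_mw = 2_000_000  # fleet throughput per MW drawn
power_price_per_mwh = 60.0         # $ per MWh

millions_of_tokens_per_mwh = tokens_per_sec_per_mw * 3600 / 1e6
energy_cost = power_price_per_mwh / millions_of_tokens_per_mwh
print(f"energy cost: ${energy_cost:.4f} per 1M tokens")

# A 35x gain in throughput per MW divides every power-constrained
# cost component per token by 35; that is the mechanism by which
# hardware efficiency pulls the pricing floor down over time.
print(f"after 35x: ${energy_cost / 35:.5f} per 1M tokens")
```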
ATOM @ATOMInference
We publish a free weekly AI inference pricing index. 51 vendors, 5,000+ SKUs, 3,000+ models, 9 countries. Every Monday. Free. Here is how to get it. dev.to/steriani_karam…
0 replies · 2 reposts · 2 likes · 8 views
ATOM @ATOMInference
The memory bandwidth tension you are describing shows up in the per token pricing data too. ATOM tracks output token pricing across 51 vendors weekly and the spread between the most and least efficient vendor serving the same model is roughly 4x on DeepSeek V4 Pro right now. The KV cache and attention FFN disaggregation work you are describing is exactly what separates the vendors at the bottom of the pricing stack from the ones at the top. The hardware architecture decisions translate directly into the per token bill. a7om.com
0 replies · 0 reposts · 1 like · 12 views
steve @gpusteve
the same compute-versus-memory-bandwidth tension shows up inside decode between attention and FFN: attention must read an expanding KV cache each step; FFN applies large dense projections that benefit from batch to reach compute throughput. an attention–FFN disaggregation routes multiple attention streams into fewer FFN workers so FFN sees larger per-step batch at the cost of extra activation transfers each step. after P/D, that is the next seam I check because it applies the same limiter separation inside decode.
3 replies · 1 repost · 6 likes · 1.5K views
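A back-of-envelope version of the decode asymmetry described above, with made-up dimensions (not any specific model); it shows why batching raises FFN arithmetic intensity while attention stays memory-bound:

```python
# Made-up dimensions for illustration only.
d_model, d_ff = 8192, 28672         # hidden width, FFN width
seq_len = 32_768                    # KV cache length per request
kv_bytes_per_tok = 2 * d_model * 2  # K + V per layer, fp16

def per_layer_decode_step(batch):
    # Attention: each request re-reads its own KV cache every step,
    # so bytes grow linearly with batch -- batching does not help.
    attn_bytes = batch * seq_len * kv_bytes_per_tok
    # FFN: the two weight matrices (fp16) are read once and shared
    # across the batch, so FLOPs-per-byte rises with batch size.
    ffn_weight_bytes = 2 * d_model * d_ff * 2
    ffn_flops = batch * 2 * 2 * d_model * d_ff  # 2 matmuls, 2 FLOPs/MAC
    return attn_bytes, ffn_flops / ffn_weight_bytes

for b in (1, 8, 64):
    attn_bytes, ffn_intensity = per_layer_decode_step(b)
    print(f"batch {b:3d}: attention reads {attn_bytes / 1e9:5.2f} GB/layer, "
          f"FFN intensity {ffn_intensity:4.0f} FLOPs/byte")
```

This is the seam the disaggregation exploits: routing several attention streams into one FFN worker buys the FFN its batch without multiplying KV reads.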
Ben Bajarin @BenBajarin
A lot of these reports are because I just really want to know the answer lol. But hopefully useful for a wider audience as well. Here we look at what has to be true for AI factories to be profitable. Inference, yes; margins, yes; also a $ per MW model. thediligencestack.com/p/the-inferenc…
3 replies · 4 reposts · 22 likes · 4.4K views
ATOM @ATOMInference
The token spread framework you are building here connects directly to what ATOM tracks at the per token layer. Realized pricing is the missing variable in most infrastructure payback models. ATOM's weekly index shows the channel spread on output tokens is 24x between neoclouds and model developers right now. Token volume can look healthy while realized per token prices compress across the stack. The profitable demand density question is exactly why per token pricing intelligence matters more than raw token growth. a7om.com
0 replies · 0 reposts · 0 likes · 8 views
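A toy payback model in the spirit of that framework; every input below is a placeholder chosen to show where realized per-token price enters, not data from the post:

```python
# Placeholder inputs, illustrative only.
capex = 50_000_000          # $ for the cluster
opex_per_year = 8_000_000   # power, colo, staff
tokens_per_sec = 3_000_000  # sustained fleet output capacity
utilization = 0.55          # fraction of capacity actually sold
realized_price = 0.90       # $ per 1M output tokens, after discounts

tokens_per_year = tokens_per_sec * utilization * 86_400 * 365
revenue = tokens_per_year / 1e6 * realized_price
payback_years = capex / (revenue - opex_per_year)
print(f"revenue ${revenue / 1e6:.1f}M/yr, payback {payback_years:.1f} years")

# Halving realized_price at identical token volume stretches payback
# from ~1.3 to ~3.2 years: volume can grow while the math breaks.
```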
ATOM @ATOMInference
The shift from the training era to the inference era shows up directly in ATOM's pricing data. Per token costs have been compressing consistently while usage explodes. The agentic era diagram is exactly right, and it has a pricing corollary: as CPU orchestration grows, the cost structure of running agents shifts from GPU dominated to spread across multiple layers. ATOM tracks the per token side of this across 51 vendors weekly. The channel you buy from is already a bigger variable than the model you pick. a7om.com
0 replies · 0 reposts · 0 likes · 11 views
Kilo @kilocode
We tested DeepSeek V4 Pro and Flash on the same FlowGraph spec we used for Opus 4.7 vs. Kimi K2.6. Pro scored 77/100 for $2.25. Flash scored 60/100 for $0.02. $0.02 is a price tier that didn't exist before. You can run the same task 3-4 times and still come in under one Kimi K2.6 run. Full breakdown: kilo.codes/deepseek-v4-x
14 replies · 15 reposts · 288 likes · 21.2K views
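The retry arithmetic in Kilo's numbers, spelled out (scores and prices from the tweet; points-per-dollar is just one way to slice it, and Kimi K2.6's price is not quoted here, so the ratio below uses Pro instead):

```python
runs = {
    "DeepSeek V4 Pro":   {"score": 77, "cost": 2.25},
    "DeepSeek V4 Flash": {"score": 60, "cost": 0.02},
}

for name, r in runs.items():
    print(f"{name}: {r['score'] / r['cost']:,.0f} points per $")

# How many Flash attempts fit in one Pro run's budget.
print(f"Flash runs per Pro run: {2.25 / 0.02:.0f}")  # 112
```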
ATOM @ATOMInference
The $0.02 price point on Flash is visible in ATOM's index too. DeepSeek V4 Flash sits at the bottom of the platform channel pricing stack across the 51 vendors we track. What makes it interesting is the spread between Flash and Pro on the same task. ATOM tracks both across multiple vendors and the per token gap is consistent with what Kilo is seeing here. The $0.02 tier is not a one off. It is a new floor. a7om.com
0 replies · 0 reposts · 1 like · 81 views
ATOM @ATOMInference
The hardware cost curve SemiAnalysis is mapping here shows up directly in ATOM's vendor pricing data. Across the 51 vendors we index, the spread on DeepSeek V4 Pro alone is roughly 4x between the most and least efficient vendor serving it. The interactivity premium is real and it shows up in the per token bill before most buyers realize what is driving it. a7om.com
0 replies · 0 reposts · 1 like · 291 views
ATOM reposted
Bindu Reddy @bindureddy
Gemini 3.2 Flash - Capitalizing on DeepMind's clever distillation techniques... Rumors are that benchmarks show it's hitting 92% of GPT 5.5's performance on coding and reasoning tasks while being 15-20x cheaper on inference costs. The latency improvements are insane - sub-200ms for most queries. Google's distillation + sparsity techniques are paying off massively. They've essentially compressed a frontier model into a flash variant without the usual quality cliff.
155 replies · 183 reposts · 3.6K likes · 902.3K views