Tensormesh

62 posts

Tensormesh

@tensormesh

AI inference optimization company that never charges you twice for cached tokens, making AI applications faster and dramatically cheaper to run anywhere.

Joined October 2025
7 Following · 86 Followers
Tensormesh
Tensormesh@tensormesh·
We held an interview with AI inference researcher Yuyang Huang on what agent developers should actually be measuring. Most AI agent dashboards are still measuring tokens per call, latency per call, cost per call, and requests per minute. Yuyang believes that is where developers are missing the most important metrics, the ones that tell the full story of agent performance:
✅ Task completion accuracy
✅ Time to first subtask
✅ End-to-end completion time and quality
✅ Cost per task, not cost per call
We wrote about why traditional LLM observability metrics break down for agent workloads, and what agent developers should actually be tracking instead.
📖 Read the full blog here: tensormesh.ai/blog-posts/ai-…
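A minimal sketch of the per-task accounting the interview argues for, rolling individual LLM call records up into cost per task and end-to-end LLM time. The field names and per-token prices are illustrative assumptions, not Tensormesh's API:

```python
from dataclasses import dataclass, field

@dataclass
class LLMCall:
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

@dataclass
class TaskTrace:
    calls: list = field(default_factory=list)  # every LLM call made for one task
    completed: bool = False                    # did the agent actually finish the task?

def cost_per_task(trace, usd_per_1m_input=0.50, usd_per_1m_output=1.50):
    """Cost of the whole task, not of any single call (prices are placeholders)."""
    return sum(
        c.prompt_tokens / 1e6 * usd_per_1m_input
        + c.completion_tokens / 1e6 * usd_per_1m_output
        for c in trace.calls
    )

def end_to_end_llm_time(trace):
    """Total LLM wall-clock time accumulated across all calls in the task."""
    return sum(c.latency_s for c in trace.calls)
```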
Tensormesh retweeted
Braden Hancock
Braden Hancock@bradenjhancock·
Great time to have sota KV cache optimization tech, eh @tensormesh? 👀
dylan ツ@demian_ai

Inference got a hundred times cheaper this year. The compute bill went up anyway. If you understand why those two sentences are both true at the same time, you understand the most important thing happening in AI right now. I work on inference for a living, at @nebiustf, where we run open-source managed inference at scale. Most of what follows is what I'm seeing from inside the bill.

12 months ago, the cost of 1M tokens of frontier-class reasoning was somewhere on the order of $60. Today, an equivalent quality of output costs roughly $0.50. The price per token of o1-level intelligence has dropped about 128x in a year. The price of GPT-4-level output has dropped roughly 100x since the original GPT-4 shipped. By any normal reading of a technology cost curve, this should be deflationary. It should be saving customers money.

The opposite has happened. The total compute bill at every hyperscaler is going up, not down. Anthropic just signed multi-year capacity deals with both XAI and Amazon. Microsoft's Azure capex guide for 2026 starts with an eight. OpenAI is reportedly spending more on compute every quarter than it did in all of 2023. Nvidia paid roughly twenty billion dollars to acquire Groq, an inference-specialist company that did not exist as a serious commercial entity three years ago. The cost curve and the demand curve crossed, and then the demand curve lapped the cost curve.

Here is what happened underneath. A reasoning model burns roughly 10x the output tokens of a non-reasoning model on the same task, because it spends most of its tokens thinking out loud before answering. An agentic workflow chains roughly twenty times the requests of a single-shot completion, because it loops, calls tools, plans, retries, and synthesizes. A modern deep-research query (the kind a research analyst can fire off in fifteen seconds and then walk away from for ten minutes) costs more compute than 10 original GPT-4 queries combined. We made every individual token a hundred times cheaper, and then we built a generation of products that consume ten thousand times more tokens.

This is the Jevons paradox playing out at trillion-dollar scale, in compressed time, in front of everyone. Jevons noticed in 1865 that making coal-burning more efficient did not reduce coal consumption. It increased it, because efficiency unlocked uses that were previously uneconomic. Steam engines became more practical at smaller scales. Whole industries that could not afford coal at the old price suddenly could. Britain's coal consumption rose sharply, not despite the efficiency gains, but because of them.

The same thing is happening to AI compute right now, and it is happening faster than any analogous historical cycle. Falling token prices did not contract demand. They unlocked agents, deep research, code-writing systems, multi-step reasoning, persistent memory, the entire next layer of AI products. Every product in that next layer consumes orders of magnitude more compute than the chat interfaces it is replacing. The math at the aggregate level is brutal: 100x cheaper tokens times 10,000x more tokens equals a 100x larger total bill.

The implications stack quickly. If you are running a hyperscaler, your 2026 capex guide is not a peak. It is a step on a curve. Inference is structurally always-on, twenty-four hours a day, in a way that training never was. Training is bursty. You spin up a cluster, run for weeks or months, and stop. Inference runs continuously, scales with usage, and the usage curve is exponential. Your power bill, your cooling bill, your transceiver count, your storage footprint, all of these were sized for a workload mix that no longer exists.

If you are running an AI software company built on top of someone else's closed API, you have a problem that did not exist a year ago. Your gross margins get worse as your customers get more value out of your product, because the more they use it, the more compute you pay for. The companies that win this are the ones that figured out vertical integration before the math caught them.

If you are watching this from a distance and trying to understand where the next bottlenecks form, the answer is everywhere downstream of "more inference compute, always-on, with massive memory state per session." The KV cache, the running memory state of a long conversation or an agent loop, is the silent monster of the inference era. It does not scale linearly with parameters. It scales linearly with context length and number of agent steps. A long agent session can hold tens of gigabytes of state per user, per session. Multiply that by every concurrent user of every product, and you understand why $MU, $SNDK, $TOWCF, and the entire memory and packaging layer have re-rated the way they have.

The CPU-to-GPU ratio is evolving. Training is 1:8. Basic chat inference is 1:4. Agentic inference is 1:1, sometimes CPU-heavy. Google has split its TPU line in two, with a dedicated inference chip carrying tripled SRAM for KV cache. $INTC and $AMD just spent two earnings calls explaining that this shift is structural, not cyclical. The hardware map is redrawing in real time and the financial press is mostly still writing about training clusters.

The right framing of where we are right now is not that AI is hitting a wall. The framing a year ago that scaling was hitting a wall was the most expensive bad take of the cycle. The right framing is that AI got dramatically cheaper, dramatically more capable, and dramatically more useful, and the cost of running it at the new equilibrium of demand is much higher than the cost at the old equilibrium of demand, because the new equilibrium is enormous.

A meaningful share of what we actually do at Token Factory, day to day, is help customers stop their bills from running away from them. KV-cache management. Speculative decoding. Quantization. Routing. The kind of vertical integration that, eighteen months ago, every product team was happy to leave abstracted away behind a closed API. The reason this stack matters now is the same reason this whole essay matters: at the new equilibrium of inference demand, the cost of treating compute as a commodity is no longer survivable. The companies that figure out the layer beneath the API are the ones who keep their margins.

Cheaper tokens. More tokens. Same coal as 1865.
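The "tens of gigabytes of state per user, per session" claim follows directly from how KV cache scales with context length rather than parameter count. A back-of-the-envelope sketch, assuming a roughly 70B-class model with grouped-query attention; the model dimensions are assumptions, not a statement about any specific deployment:

```python
def kv_cache_bytes(context_tokens,
                   num_layers=80,      # assumed 70B-class model
                   num_kv_heads=8,     # grouped-query attention
                   head_dim=128,
                   bytes_per_elem=2):  # fp16 / bf16
    # Keys and values are both stored at every layer, hence the factor of 2.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens

# An agent session that has accumulated ~128k tokens of context:
print(kv_cache_bytes(128_000) / 1e9)  # ~41.9 GB of KV state for one session
```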

Tensormesh
Tensormesh@tensormesh·
Most agent developers don't realize their biggest inference cost isn't the model: it's the system prompt. Every time your agent calls the LLM, it resends the same system prompt, the same tool catalog, and the same growing conversation history, and the model dutifully reprocesses all of it. For an agent making 30 LLM calls per task, you're paying to reprocess the same context 30 times. This is the "prefill tax," and it's where most of your inference bill actually goes. We just published a deep dive on why agent workloads are uniquely punished by default inference behavior, what KV caching actually does under the hood, and how to stop paying twice for cached tokens.
📖 Read the full blog: tensormesh.ai/blog-posts/che…
👉 Try Tensormesh with $100 in credit: app.tensormesh.ai/login?logged_o…
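A rough illustration of the prefill tax: the same system prompt, tool catalog, and growing history are re-read on every call unless the KV cache persists across calls. All token counts here are made-up but plausible assumptions, not measurements from the blog post:

```python
SYSTEM_PROMPT = 2_000   # tokens: instructions + tool catalog
NEW_PER_STEP = 200      # tokens of genuinely new context per step
CALLS_PER_TASK = 30

def prefill_without_cache():
    total = 0
    history = SYSTEM_PROMPT
    for _ in range(CALLS_PER_TASK):
        total += history          # the model re-reads everything, every call
        history += NEW_PER_STEP   # and the conversation keeps growing
    return total

def prefill_with_persistent_cache():
    # With a persistent KV cache, only new tokens need prefill on each call.
    return SYSTEM_PROMPT + CALLS_PER_TASK * NEW_PER_STEP

print(prefill_without_cache())         # 147,000 tokens prefilled per task
print(prefill_with_persistent_cache()) # 8,000 tokens prefilled per task
```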
Tensormesh
Tensormesh@tensormesh·
Tensormesh @ #AMD AI DevDay
Great turnout for @JunchenJiang's talk on KV Cache as a new memory layer for AI. Agents don't need more compute, just context. Taking KV Cache from research to production with features like compression + CacheBlend. Thanks @AIatAMD for the invite 🤝
Tensormesh
Tensormesh@tensormesh·
Most LLM developers know caching exists, but what catches teams off guard is how quickly agentic workloads break it. A single dynamic value injected into your system prompt at runtime invalidates your prefix cache entirely. By step 10 of a typical agent loop, your model is processing 11,500 tokens to act on 200 tokens of new information. We wrote a breakdown of exactly where existing caching falls short for production agentic workloads, what persistent session-aware KV caching looks like, and a four-question diagnostic you can run on any agent today. If you want to see what this looks like in practice, you can run your agents on Tensormesh and measure the difference yourself!
Full Blog: tensormesh.ai/blog-posts/age…
$100 Free to Run Your Agents on Tensormesh: app.tensormesh.ai/login?logged_o…
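A minimal sketch of the failure mode: prefix caching only reuses an identical leading token sequence, so a dynamic value near the top of the system prompt changes the prefix on every call. The prompts and layout below are illustrative assumptions, not Tensormesh's recommended template:

```python
from datetime import datetime, timezone

def prompt_cache_hostile(user_msg):
    # Timestamp embedded at the very start -> the prefix differs on every call,
    # so no prefix cache entry can ever be reused.
    system = (f"Current time: {datetime.now(timezone.utc).isoformat()}\n"
              "You are a helpful agent...")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_msg}]

def prompt_cache_friendly(user_msg):
    # Static instructions first (a stable, cacheable prefix); dynamic values last.
    system = "You are a helpful agent..."
    dynamic = f"Current time: {datetime.now(timezone.utc).isoformat()}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": f"{dynamic}\n\n{user_msg}"}]
```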
Tensormesh
Tensormesh@tensormesh·
🎙️𝗜𝗻𝘀𝗶𝗱𝗲 𝗧𝗲𝗻𝘀𝗼𝗿𝗺𝗲𝘀𝗵: Meet the brains behind @lmcache. We sat down with our CTO, 𝗬𝗶𝗵𝘂𝗮 𝗖𝗵𝗲𝗻𝗴, and Chief Scientist, 𝗞𝘂𝗻𝘁𝗮𝗶 𝗗𝘂, to discuss the journey from @lmcache research to powering the next layer of inference at scale with @tensormesh. They explain the motivation behind their journey: the real-world constraints, design decisions, and what shaped their framework to tackle one of the hardest bottlenecks in modern inference: "𝗞𝗩 𝗰𝗮𝗰𝗵𝗲".
Watch the full interview: y2u.be/tud54kSDr5s #LLMInference #KVCache #Tensormesh @this_will_echo
Tensormesh
Tensormesh@tensormesh·
A company with 60+ accounts just had its entire AI infrastructure taken offline by its provider. No reason was given; all they received was an appeal path in the form of a Google Form. This is not a one-off. We've mapped the pattern across every major closed-weight provider and laid out what enterprise teams can do about it.
📖 Read the full blog: tensormesh.ai/blog-posts/ent…
🚀 Try Tensormesh with $100 in free GPU Credits: app.tensormesh.ai/login?logged_o…
Tensormesh
Tensormesh@tensormesh·
Tensormesh Beta 2.2 is live: serverless LLM inference with $0 cached input tokens. Call an API → get an endpoint in seconds. Scale to zero when idle. OpenAI-compatible. No rewrite needed. Beta 2 was one-click deployment. Beta 2.2 is zero-click inference.
Try it 👉 app.tensormesh.ai/login
📖 Read all the Beta 2.2 updates: tensormesh.ai/blog-posts/ser…
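In practice, "OpenAI-compatible, no rewrite needed" usually amounts to swapping the client's base URL. A sketch with the standard openai Python SDK; the endpoint URL, API key, and model name are placeholders, not documented Tensormesh values:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-tensormesh-endpoint.ai/v1",  # hypothetical URL
    api_key="YOUR_TENSORMESH_API_KEY",                         # placeholder
)

# The rest of the code is unchanged from a stock OpenAI integration.
resp = client.chat.completions.create(
    model="your-deployed-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```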
Tensormesh
Tensormesh@tensormesh·
We partnered with @Redisinc to push LLM KV cache retrieval from 0.3 GB/s to 10 GB/s. @Redisinc is already the go-to for fast, scalable storage, but LLM KV caches aren't a typical workload: chunks run 500 KB to 40 MB, far beyond standard key-value patterns. Working together, we rebuilt the client to match:
→ Zero-copy RESP parsing: 0.3 → 2 GB/s
→ Fixed-size chunk optimization: 2 → 5 GB/s
→ C++ core with GIL release + Linux eventfd: 5 → 10 GB/s
✅ That's a 30x throughput improvement and 40% faster end-to-end inference, and we're still co-tuning across AWS and GCP.
📖 Read the full technical breakdown: tensormesh.ai/blog-posts/blo…
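For orientation, a simplified sketch of the access pattern being optimized: large KV-cache blobs split into fixed-size binary chunks and fetched in bulk. This is plain redis-py, not the rebuilt zero-copy client described in the post; the key scheme and chunk size are illustrative:

```python
import redis

CHUNK_BYTES = 4 * 1024 * 1024  # fixed-size chunks (illustrative value)
r = redis.Redis(host="localhost", port=6379)

def put_kv_blob(prefix_hash, blob):
    # Split one large serialized KV-cache blob into fixed-size chunks.
    chunks = [blob[i:i + CHUNK_BYTES] for i in range(0, len(blob), CHUNK_BYTES)]
    for idx, chunk in enumerate(chunks):
        r.set(f"kv:{prefix_hash}:{idx}", chunk)
    return len(chunks)

def get_kv_blob(prefix_hash, n_chunks):
    # Fetch every chunk in a single round trip and reassemble the blob.
    keys = [f"kv:{prefix_hash}:{idx}" for idx in range(n_chunks)]
    return b"".join(r.mget(keys))
```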
Tensormesh
Tensormesh@tensormesh·
@Redis × @tensormesh: Breaking the Inference Context Bottleneck
In a recent interview with our partner Redis (@tchutch94, Head of AI Engineering), one idea stuck: "If intelligence is the utility, inference is the delivery grid." Agentic systems with ever-growing context (KV cache) are why the delivery grid starts to break. Watch the interview and learn how @tensormesh and @Redis are tackling this challenge. #Redis #Tensormesh #AIInfrastructure #KVCache
Tensormesh
Tensormesh@tensormesh·
𝗞𝗩 𝗖𝗮𝗰𝗵𝗲 𝗶𝘀 𝗼𝗻𝗹𝘆 𝘂𝘀𝗲𝗳𝘂𝗹 𝗶𝗳 𝘆𝗼𝘂 𝗰𝗮𝗻 𝗺𝗲𝗮𝘀𝘂𝗿𝗲 𝗶𝘁. That’s why @tensormesh exposes 𝗞𝗩 𝗖𝗮𝗰𝗵𝗲 𝗛𝗶𝘁 𝗥𝗮𝘁𝗲: a real-time metric showing how often cached tensors are reused instead of recomputed.
🔽 Every cache hit means:
→ No recompute
→ Lower latency
→ Better GPU utilization
→ Lower inference cost
At @tensormesh, we don’t just store 𝗞𝗩 𝗰𝗮𝗰𝗵𝗲, we show you 𝘄𝗵𝗲𝗻 𝗶𝘁 𝗽𝗮𝘆𝘀 𝗼𝗳𝗳.
→ Claim $100 in 𝗳𝗿𝗲𝗲 𝗚𝗣𝗨 𝗰𝗿𝗲𝗱𝗶𝘁𝘀 and see it on your own workloads: 👉🏻 tinyurl.com/tensormesh
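The metric itself is straightforward. A minimal sketch of how a token-level hit-rate counter could be tracked, as an illustration rather than Tensormesh's internal implementation:

```python
class KVCacheStats:
    def __init__(self):
        self.hit_tokens = 0    # prefill tokens served from cached KV tensors
        self.miss_tokens = 0   # prefill tokens that had to be recomputed

    def record(self, cached, recomputed):
        self.hit_tokens += cached
        self.miss_tokens += recomputed

    @property
    def hit_rate(self):
        total = self.hit_tokens + self.miss_tokens
        return self.hit_tokens / total if total else 0.0
```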
Tensormesh
Tensormesh@tensormesh·
Tensormesh has officially launched a fully redesigned website. The new site reflects where we are as a company: a cleaner, more structured experience that makes it easier to understand our technology, our approach, and the problem we're solving. If your AI team is managing rising inference costs and latency constraints, explore how Tensormesh can cut GPU costs by 5-10x and bring inference down to sub-second latency.
👉 Explore the new Tensormesh: tensormesh.ai
Tensormesh
Tensormesh@tensormesh·
"𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸" — Kevin Deierling, SVP Networking #NVIDIA At his #GTC talk last week, he highlighted 𝗖𝗠𝗫 and 𝗖𝗮𝗰𝗵𝗲𝗕𝗹𝗲𝗻𝗱 from 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 (@tensormesh) were part of the new KV Cache memory stack for agents, and recognized @tensormesh among the 𝗖𝗠𝗫 𝘀𝘁𝗼𝗿𝗮𝗴𝗲 𝗽𝗮𝗿𝘁𝗻𝗲𝗿𝘀. As the stack evolves, @tensormesh keeps building for what's next. ▶️ session Replay: tinyurl.com/GTC-talk
Tensormesh
Tensormesh@tensormesh·
We're going to @nvidia GTC 2026 🎉
Booth 7022 South Market Lot | March 16–19 | San Jose, CA
Stop by for:
→ Live KV cache optimization demos
→ Meet the team
→ Tensormesh swag
If GPU inference costs are killing your margins, let's talk. #nvidiagtc2026 #nvidia #gpu #aiinference
Tensormesh
Tensormesh@tensormesh·
Most teams running MemGPT agents are wasting 56% of their prefill compute every turn, by default. The fix isn't better hardware. It's a different caching strategy. Read the full technical breakdown: tensormesh.ai/blog-posts/pre…