hackcore

83 posts

@hardmeta

Code to GTM. Now building AI Infra.

Joined January 2022

486 Following · 9 Followers

hackcore@hardmeta·24 Apr

DeepSeek just shipped V4 and casually dropped this in the pricing page: "Pro is throughput-limited. Wait for Huawei Ascend 950 supernodes in H2 — prices will drop a lot." A frontier lab publicly pinning its product roadmap to a Chinese chip release. That's the loudest ad Huawei never bought.

hackcore@hardmeta·17 Apr

When Apple killed NVIDIA support 7 years ago, nobody expected the deadlock to be broken not by Apple, not by NVIDIA, but by an open-source team writing a driver from scratch. The significance of tinygrad's driver isn't "NVIDIA works on Mac again." It's the first proof that GPU compute can run outside the vendor's proprietary driver stack. Plug in via Thunderbolt, compute. Performance will catch up. The decoupling has already happened.

the tiny corp@__tinygrad__·1 Apr

If you have a Thunderbolt or USB4 eGPU and a Mac, today is the day you've been waiting for! Apple finally approved our driver for both AMD and NVIDIA. It's so easy to install now a Qwen could do it, then it can run that Qwen...

hackcore@hardmeta·14 Apr

Found a way to keep OpenCode/OpenClaw humming on Claude Max Plan. Not telling how. Just know the cat-and-mouse game is very much still on. 🐈🐁

hackcore@hardmeta·7 Apr

@GenAI_is_real Agent-inference co-design and cache-aware scheduling feel like exactly the right direction. The current stateless contract is the HTTP/1.1 of our era — and it took HTTP/2 / gRPC nearly a decade to break that inertia with shared connection state.

hackcore@hardmeta·7 Apr

@SemiAnalysis_ The real moat here isn't CUDA — it's that NIXL is becoming the protocol layer for KV transfer between heterogeneous accelerators. And it's not a coincidence that it's open source.

SemiAnalysis@SemiAnalysis_·7 Apr

NVIDIA SOFTWARE MOAT ALERT: the recently announced AWS Trainium <> Cerebras setup will still use a small bit of NVIDIA software. To transfer the KV cache between the prefill Trainium and the decode Cerebras wafer, AWS will use the NVIDIA NIXL KV cache transfer agent along with EFA. They will RDMA over EFA from Trainium to the Cerebras CPU host memory, and the CPU host then talks to the wafer via the wafer engine's FPGA.

Chayenne Zhao@GenAI_is_real·6 Apr

We're Not Wasting Tokens. We're Wasting the Design Margin of the Entire Inference Stack
A few days ago I read a post by Fuli Luo on Twitter, discussing Anthropic's decision to cut off third-party harnesses (OpenClaw) from using Claude subscriptions, and the design thinking behind MiMo's Token Plan pricing. Her core argument: global compute capacity is seriously falling behind the token demand created by agents. The way forward isn't selling tokens cheaper in a race to the bottom; it's the co-evolution of "more efficient agent harnesses" and "more powerful, efficient models."
I read it several times over. People who build inference engines have long been frustrated by how wastefully agent frameworks burn through tokens. She articulated something the industry has tacitly acknowledged but rarely stated plainly, and she did it with precision and restraint: the compute allocation crisis we face today is not fundamentally about insufficient compute. It's about tokens being spent in the wrong places.
I want to push this one layer deeper, from my own perspective. I'm a heavy user of Claude Code, and I make no attempt to hide that. You can check that all the latest code in SGLang Omni was built with Claude Code powering my workflow. Its commercial success is beyond question; it genuinely gave many people (myself included) their first real experience of "coding with an agent." But I'm also an inference engine developer: my day job is figuring out how to push prefix cache hit rates higher, how to make KV cache memory layouts more efficient, how to drive down the cost of every single inference request. So when I plugged Claude Code into a local inference engine and started observing the actual request patterns it generates, my reaction was (how to put it) like that of a water engineer who spent months designing a conservation system, only to watch someone water their garden with a fire hose.
I measured Claude Code's cache hit rate on my local serving engine over the course of a day. The numbers were painful. This isn't a case of "decent but room to improve." It's a case of "the prefix cache mechanisms we carefully engineered at the inference layer are being almost entirely defeated."
Fuli Luo mentioned that OpenClaw's context management is poor: it fires off multiple rounds of low-value tool calls within a single user query, each carrying over 100K tokens of context window. Frankly, Claude Code's own context management is nowhere near making proper use of prefix cache or any of the other optimizations we've built into inference engines. Many people have already noticed; for example, the resume feature has a bug that causes KV cache misses entirely, which is borderline absurd. I'll say it plainly: the way sessions construct their context was never seriously designed with cache reuse in mind from the start.
Perhaps Anthropic has internal trade-offs we can't see. After all, they control both ends of the stack, model and inference, and can theoretically do optimizations at the API layer that are invisible to us. But from the external behavior I can observe, enormous volumes of tokens are being spent on: re-transmitting already-processed context, re-parsing already-confirmed tool call results, and maintaining an ever-inflating conversation history with extremely low information density.
If this is merely to earn more on inference token charges, I find it genuinely regrettable. But many Claude Code users are on subscriptions, so burning more tokens is fundamentally a cost burden for Anthropic, not revenue. I honestly don't understand what purpose such inefficient context management serves for Claude Code.
Here's a bold hypothesis: for those long sessions that consume 700K+ tokens, there is certainly a way to restructure the session's context so it accomplishes the exact same task with 10% of the tokens. Not by sacrificing quality, but through smarter context compression, more rational prefix reuse strategies, and more precise tool call scheduling. This isn't theoretical speculation: anyone who has worked on inference engine optimization, upon seeing current agent framework request patterns, would arrive at a similar conclusion.
Fuli Luo is right: global compute capacity can't keep up with the token demand agents are creating. But I'd add that a significant portion of that gap is an illusion of prosperity, artificial demand manufactured by the crude design of agent frameworks.
Here's an analogy I keep coming back to. I've always liked bringing up RAM bloat: in 1969, 64KB of memory sent Apollo to the moon. In 2026, I open a single webpage and 500MB of memory usage is nothing unusual. Every generation of hardware engineers pushes memory capacity higher, and every generation of software engineers lavishly fills it to the brim. People have gotten used to this cycle, even come to see it as the normal cost of progress.
But LLM inference is different. The cost of RAM bloat is your computer running a bit slower, spending a couple hundred bucks on a memory upgrade; users barely notice. The cost of token bloat is real money: GPU cluster electricity bills, user subscription fees, the industry's entire compute budget. And this cost scales exponentially as agent usage grows. If we don't establish the engineering discipline that "tokens should be used efficiently" in the early days of the agent era, the cost of catching up later, once scale kicks in, will be beyond imagination.
Fuli Luo notes that Anthropic cutting off third-party harness subscription access is objectively forcing these frameworks to improve their context management. I agree with that assessment, but my gut feeling is that this shouldn't stop at "third-party frameworks need to be more frugal with tokens." It should trigger a more fundamental reflection: what kind of agent-inference co-design do we actually need?
Right now, agent frameworks and inference engines are essentially fully decoupled: agent frameworks treat the inference engine as a stateless API, sending the full context with every request. Meanwhile, the inference engine does its best with prefix matching, caching whatever it can. This architecture is simple and general-purpose, but brutally inefficient for long sessions. If agent frameworks could be aware of the inference engine's cache state and proactively construct cache-friendly requests, and if inference engines could understand the session semantics of agents and make smarter cache eviction decisions, then once that information channel between the two opens up, the potential gains in token efficiency are enormous.
Of course, maybe I'm overthinking this. Maybe the market's ultimate answer is: compute gets cheap enough, waste is fine. Just like the RAM story, where in the end everyone chose "memory is big enough, no need to optimize." But I don't think the token economy will follow the same path, at least not in the near term, because the supply elasticity of GPU compute is far lower than that of DRAM. Under compute constraints, token efficiency isn't a "nice to have" optimization; it's the core competitive advantage that determines who survives.
Most people love hearing "we made the model bigger," "we stretched the context window to a million tokens," "we stacked HBM to new heights." These narratives are sexy, shareable, fundable. But I seriously believe that "finding ways to reduce the reckless waste of tokens" is a profoundly underestimated direction. This isn't a defensive optimization. It's an offensive capability: whoever first achieves an order-of-magnitude reduction in token consumption at equivalent quality can serve ten times the users on the same compute budget, or deliver ten times the agent depth to a single user.
The agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it most wisely. This line from Fuli Luo resonates deeply with me. But I want to press further: who gets to define "wisely"? The people building models? The people building inference engines? The people building agent frameworks? I think the answer is that all three must come to the table together. And right now, we're nowhere close.
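
The stateless pattern described in the post above (full context resent with every request, the engine caching whatever prefix matches) can be illustrated with a toy simulation. This is a minimal sketch, not how SGLang or any real engine implements prefix caching: `common_prefix_len`, `prefix_hit_rate`, the token IDs, and the session sizes are all hypothetical, and the cache is simplified to remember only the previous prompt.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: List[List[int]]) -> float:
    """Fraction of all prompt tokens that a prefix cache could reuse.

    Simplification (assumption): each request is compared only with the
    request before it, as if the cache held exactly the previous prompt.
    """
    total = sum(len(r) for r in requests)
    reused = sum(
        common_prefix_len(prev, cur)
        for prev, cur in zip(requests, requests[1:])
    )
    return reused / total if total else 0.0


# Hypothetical session: a 1000-token system prompt plus 5 turns of 200 tokens.
system = list(range(1000))
turns = [list(range(10_000 + i * 200, 10_000 + (i + 1) * 200)) for i in range(5)]

# Append-only: every request extends the previous prompt unchanged.
append_only: List[List[int]] = []
context = list(system)
for turn in turns:
    context = context + turn
    append_only.append(context)

# Mutating: a per-request token (say, a timestamp) sits at position 0,
# so no two prompts share even their first token.
mutating = [[-(i + 1)] + req for i, req in enumerate(append_only)]

print(f"append-only hit rate: {prefix_hit_rate(append_only):.2f}")
print(f"mutating hit rate:    {prefix_hit_rate(mutating):.2f}")
```

In this toy setup the append-only session reuses 75% of its prompt tokens, while the session that changes its first token reuses none. The only point is that anything altering the start of the context defeats prefix matching, no matter how much of the rest is identical.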

Fuli Luo@_LuoFuli

Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions, which was not surprising. Three days ago, MiMo launched its Token Plan, a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts:
1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess: it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close, and it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens). That is wasteful even with cache hits, and in extreme cases it drives up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap; that's a crater.
2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API; they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline.
3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap, the same trap Anthropic just walked out of. The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done, that's not a healthy cycle for user experience or retention.
4. On MiMo Token Plan: it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services, not getting you to impulse-pay and then abandon ship.
The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens; it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem, open source and closed source alike, in that direction. That's probably a good thing. The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.
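
The claim that fanning one query out into many long-context requests is "wasteful even with cache hits" can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not any provider's real billing formula: `query_input_cost`, the price of 1.0 unit per million tokens, the 90% cached fraction, and the 0.1 cached-price ratio are all hypothetical placeholders.

```python
def query_input_cost(
    num_requests: int,
    context_tokens: int,
    price_per_mtok: float,
    cached_fraction: float = 0.0,
    cached_price_ratio: float = 1.0,
) -> float:
    """Input-token cost of one user query split into several API requests.

    Every request carries `context_tokens` of context. A `cached_fraction`
    of those tokens is billed at `cached_price_ratio` times the normal
    price; the rest is billed in full. All parameters are hypothetical.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be in [0, 1]")
    cached = context_tokens * cached_fraction
    fresh = context_tokens - cached
    per_request = (fresh + cached * cached_price_ratio) * price_per_mtok / 1_000_000
    return num_requests * per_request


# Hypothetical numbers: 100K-token context, 1.0 price unit per million tokens.
single = query_input_cost(1, 100_000, 1.0)
fanout_cold = query_input_cost(10, 100_000, 1.0)
fanout_warm = query_input_cost(
    10, 100_000, 1.0, cached_fraction=0.9, cached_price_ratio=0.1
)

print(f"1 request, no cache:       {single:.3f}")
print(f"10 requests, no cache:     {fanout_cold:.3f}")
print(f"10 requests, 90% cached:   {fanout_warm:.3f}")
```

With these placeholder values, ten requests at a 90% cache hit rate still cost 1.9 times as much as a single uncached request, because the request count multiplies whatever is left after the cache discount.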

hackcore@hardmeta·31 Mar

Anthropic leaked their Claude Code source code. I built it from source. I unlocked a hidden Tamagotchi pet system called BUDDY. I hatched a goose named Mochi. It wiggles happily when I pet it. Happy April Fools' — except none of this is a joke. #ClaudeCode

hackcore@hardmeta·21 Mar

NVIDIA's Defense and Offense Strategy
Defense (protect GPU revenue):
Dynamo + CMX → make every GPU produce more tokens, raise switching cost
Offense (invade new markets):
Vera CPU → steal x86 server market
Spectrum-X → steal Ethernet switching
STX/BlueField-4 → steal storage
Groq LPX → steal inference accelerators
Jensen's playbook: defend the GPU moat by making everything around it NVIDIA too. Every new "X" platform is a new revenue stream disguised as GPU optimization.

hackcore@hardmeta·19 Mar

14/ For anyone building in this space, one number to remember: In our benchmarks, 80% KV cache hit rate = 185% throughput improvement. That's the value proposition of KV cache storage in one line. Not faster disks. Not more capacity. CONTEXT REUSE AT SCALE. The GPU produces tokens. But storage decides how many of those tokens are NEW WORK vs REPEATED WORK. Storage is no longer where data rests. It's where intelligence persists. Data: GTC 2026 keynote (Mar 16), NVIDIA CMX product page, NVIDIA STX launch press release. KV cache hit rate / throughput data from author's own benchmarks. Jensen quotes are direct transcriptions from GTC 2026 keynote.
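
One way to reason about how a cache hit rate could turn into a throughput gain is an Amdahl-style toy model. This is an assumption made for illustration, not the benchmark methodology behind the 185% figure in the post above: `throughput_gain` and `prefill_share` (the fraction of per-request GPU time spent on prefill when nothing is cached) are hypothetical names, and the model assumes reused prefix tokens cost nothing.

```python
def throughput_gain(hit_rate: float, prefill_share: float) -> float:
    """Speedup factor from KV cache reuse under a toy Amdahl-style model.

    Only the prefill portion of a request shrinks, in proportion to the
    cache hit rate; decode time is unchanged. Both inputs are fractions
    in [0, 1]. This is an illustrative assumption, not a measured law.
    """
    for name, value in (("hit_rate", hit_rate), ("prefill_share", prefill_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Time left per request, relative to the uncached baseline of 1.0.
    remaining = (1.0 - prefill_share) + prefill_share * (1.0 - hit_rate)
    return float("inf") if remaining == 0.0 else 1.0 / remaining


# Sweep hypothetical prefill shares at an 80% hit rate.
for share in (0.25, 0.5, 0.75):
    gain = throughput_gain(hit_rate=0.8, prefill_share=share)
    print(
        f"prefill_share={share:.2f}: {gain:.2f}x throughput "
        f"({(gain - 1) * 100:.0f}% improvement)"
    )
```

Under this model the same 80% hit rate yields very different gains depending on how prefill-heavy the workload is, which is why a single headline number only makes sense alongside the workload it was measured on.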

hackcore@hardmeta·19 Mar

13/ The strategic implication: NVIDIA now defines what "AI storage" means, just as it defined what "AI compute" means with CUDA. STX is the reference architecture. CMX is the reference product. NIXL is the reference data transfer library. BlueField-4 is the reference DPU. Storage vendors can build on this stack — but NVIDIA sets the standard. Sound familiar? It's CUDA for storage.

hackcore@hardmeta·19 Mar

1/ At GTC 2026, Jensen showed the Vera Rubin system: 5 rack-scale computers side by side. GPU compute. CPU orchestration. Networking. And for the first time — STORAGE. Jensen: "The storage system is going to get pounded... which is the reason why we reinvented the storage system." Here's why this changes everything. 🧵
