Roger

475 posts

Roger
@gerrox

🤖 AI hardware | On-device AI 🚀 Tech sharing | Product experiences | Mindset refreshes 🔗 WeChat official account: Rog3r

Joined February 2023
920 Following · 153 Followers

Pinned Tweet
Roger@gerrox·
Tokens measure how many pieces the tokenizer spits out, not the compute actually consumed. In agent scenarios, the KV Cache makes the conversion between the two increasingly blurry. I think the future billing unit may be:
• effective inference calls × per-model price, ignoring caching, since users pay for the result, not the process
• or simply total compute consumed (FLOPs), which removes the need to distinguish model sizes
If the 97:3 input/output ratio Professor Yang You mentioned keeps getting more extreme, the result is ever-longer system prompts while the actually billed "generated tokens" shrink as a share of the total. In the end, total token count is just a "false prosperity" metric.
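The two alternative billing units above can be made concrete with a toy comparison. All prices, parameter counts, and cache ratios below are hypothetical illustrations, not real figures:

```python
# Toy comparison of three billing schemes for one agent request.
# All numbers are hypothetical illustrations, not real prices.

input_tokens = 97_000      # the 97:3 input/output split from the post
output_tokens = 3_000
cached_fraction = 0.9      # share of the input served from KV cache

# Scheme 1: raw token meter -- counts cached tokens at full weight
raw_tokens = input_tokens + output_tokens

# Scheme 2: compute-weighted -- ignore cached prefill, price what is computed
effective_tokens = input_tokens * (1 - cached_fraction) + output_tokens

# Scheme 3: FLOPs -- roughly 2 * params * tokens computed, for a dense model
params = 70e9              # hypothetical 70B-parameter model
flops = 2 * params * effective_tokens

print(f"raw token meter:  {raw_tokens:,} tokens")
print(f"compute-weighted: {effective_tokens:,.0f} tokens")
print(f"total compute:    {flops:.2e} FLOPs")
```

Under the raw meter, the cached 90% of the input is billed as if it were recomputed, which is exactly the "false prosperity" the post describes.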
Roger@gerrox·
Great, I can finally use Grok translation.
Roger@gerrox·
I'm fully inside the filter bubble now: the recommendation feed is higher quality than my following feed.
Roger@gerrox·
The free qwen/qwen3.6-plus on OpenRouter is gone. Free APIs really can't survive this kind of freeloading. By comparison, ByteDance is truly deep-pocketed: Doubao now has hundreds of millions of DAU and trillions of tokens consumed daily, and it's all still free.
Roger@gerrox·
Any move from Deepseek sets everyone buzzing. I just tried the newly updated expert mode, and every web-search citation it pulled was from CSDN. No matter how strong the model is, even V4 can't save output quality like that.
Roger@gerrox·
Right now, open-sourcing by big-model vendors means little to individuals: the models are so large that local deployment is out of reach for one person. Better to do what Qwen, and recently Xiaomi, did and offer a free API directly. Whether the weights are open hardly matters; a free API you can actually use is the real open source.
Roger@gerrox·
Claude 4.6 Opus writes an enormous amount. Not sure if it's using Superpowers, but specs and plans now routinely run over a thousand lines, written in great detail.
Roger@gerrox·
Money-saving tips for Agent API usage

✅ Do
1. Ask several questions at once: batch 3 questions into one message so the model reads the cache once, saving the repeated cost of re-reading the system prompt
2. Keep chatting in one session: the longer it runs, the less each turn actually needs recomputing; the cache keeps amortizing your cost
3. Finalize CLAUDE.md before starting, then leave it alone: edit it mid-session and all subsequent conversation history must be recomputed
4. Disable MCP tools you won't use up front: if the tool list changes, the accumulated conversation cache is invalidated
5. Finish one task with one model: switching models resets the cache to zero; switch only when starting a new task
6. Pro/Max users: send a message before stepping away for a meal: each cache read renews it for 1 hour, and keeping it warm is cheaper than rebuilding it later

❌ Don't
1. Open new sessions frequently: each one recomputes the system prompt from scratch, like rebooting the machine every time you start work
2. Edit CLAUDE.md mid-task: all accumulated conversation cache is wiped afterward
3. Use a sub-agent for a simple search: every agent launch is a cold start; if you can search the files directly, don't take the detour
4. Run /compact on a short context: when the history isn't long yet, compact just breaks the cache and saves little
5. Switch models mid-task: the cache is fully invalidated; switch at task handoffs
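The cache behavior behind these tips can be sketched with a vLLM-style block-hash model. This is a hypothetical simplification: the cache stores hashes of fixed-size prompt blocks, and each block's hash depends on everything before it, so a hit requires every preceding block to match.

```python
import hashlib

BLOCK = 16  # tokens per cache block (vLLM caches fixed-size blocks)

def block_hashes(tokens, block=BLOCK):
    """Hash each full block together with its entire prefix, so a
    change anywhere early in the prompt invalidates all later blocks."""
    hashes, prefix = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        prefix += " ".join(tokens[i:i + block]).encode()
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes

def cached_blocks(old_prompt, new_prompt):
    """Count the leading blocks a new request can reuse from the cache."""
    hits = 0
    for a, b in zip(block_hashes(old_prompt), block_hashes(new_prompt)):
        if a != b:
            break
        hits += 1
    return hits

system = ["sys"] * 64                        # stable system prompt
turn1 = system + ["q1"] * 32
turn2 = turn1 + ["a1"] * 32 + ["q2"] * 16    # same session, next turn

print(cached_blocks(turn1, turn2))           # continuing: full prefix reuse
edited = ["sys*"] + ["sys"] * 63 + ["q1"] * 32
print(cached_blocks(turn1, edited))          # early edit: zero reuse
```

Changing a single early token (the CLAUDE.md edit, the tool-list change, the model switch) drops the reusable prefix to zero, which is exactly why tips 3 to 5 above matter.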
实践哥MinLi@MinLiBuilds

x.com/i/article/2040…

Roger@gerrox·
@jyrnan Upstream error from Alibaba: Request rate increased too quickly. To ensure system stability, please adjust your client logic to scale requests more smoothly over time. The free Qwen 3.6 Plus API on OpenRouter is getting a bit unstable now. Surprisingly, the free API Xiaomi opened up earlier actually held up.
 jyrnn 胖帅博
不知道 Openrouter 上的这个 Qwen3.6 plus 能免费用多久,现在用着还挺好的😂
Roger@gerrox·
@oran_ge HTML's biggest problem is that it has too many tags, which wastes tokens. When I have a model generate HTML now, I use template filling directly; otherwise, on complex pages, the tokens run long and the output quality drops.
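Template filling as described can be as simple as having the model emit only the variable data (JSON, say) and splicing it into a fixed HTML skeleton locally, so no tag tokens are ever generated. A minimal sketch with a hypothetical template:

```python
from string import Template
import json

# Fixed skeleton: the model never generates any of these tags.
PAGE = Template("""<!DOCTYPE html>
<html><head><title>$title</title></head>
<body><h1>$title</h1><ul>$items</ul></body></html>""")

def render(model_output: str) -> str:
    """model_output is the model's JSON, e.g. {"title": ..., "items": [...]}"""
    data = json.loads(model_output)
    items = "".join(f"<li>{x}</li>" for x in data["items"])
    return PAGE.substitute(title=data["title"], items=items)

html = render('{"title": "Demo", "items": ["a", "b"]}')
print(html)
```

The model's output shrinks from a full page to a few dozen tokens of data, and the page structure stays valid by construction.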
Orange AI@oran_ge·
The new Office trio: md, csv, html.
Roger@gerrox·
Two leaderboards I keep an eye on: one for coding ability, one for Openclaw ability. Not sure which model to pick? Just use the best model you can get your hands on. claw-eval.github.io arena.ai/leaderboard/co…
Roger@gerrox·
@oran_ge Enjoy the fun of building; doing it just for fun may bring unexpected rewards.
Roger@gerrox·
The "dancing on a shrinking stage" metaphor for context management in nanobot is interesting.

The core bottleneck of long tasks is the context window. An agent running for 10 minutes to an hour eats tokens at every step, and the window only shrinks. nanobot handles this in three layers:
• A Consolidator monitors token volume; as it nears the budget cap, it has the LLM summarize early conversation and archive it to disk
• A sliding cursor marks what has been consolidated; the model sees only recent context plus a distilled version of the past
• A safety net: every request is trimmed to fit the token budget, keeping the newest turns and trimming from the oldest. The agent never crashes from a full context

Tool outputs get the same treatment: a grep over a large codebase can return MBs of results, so nanobot persists them to disk and puts only a reference in context. The model knows where the results are without paying the token cost.

───

On top of this context management, the framework designs three layers of memory.

Layer 1: short-term memory
Stores every message, tool call, and result in the current session. It lives in memory, with real-time reads and writes at zero latency. Its lifetime is bound to the session: when the session ends, it's gone. The LLM sees all of it in the prompt, with no loss.
This layer solves coherence: the agent remembers what the user just said, what the last tool returned, and which step it's on. Without short-term memory, the agent can't even complete a multi-turn exchange.

Layer 2: mid-term memory
When the conversation grows long and token pressure mounts, the Consolidator steps in automatically. It takes the earliest unconsolidated turns, has the LLM compress them into structured summaries, and appends them to a file on disk. A sliding cursor records what has been processed, so no segment gets compressed twice.
Mid-term memory is lossy: details are dropped, but the main thread and key conclusions survive. It's injected into the prompt mixed with the recent full context, so the model sees recent events in full plus past events in distilled form. Every request to the LLM still has a final safety net: if it exceeds the token budget, trim from the oldest, keep the newest.
This layer solves context explosion. On long tasks the window only shrinks; compression lets the agent keep working without losing the thread.

Layer 3: long-term memory (Dream)
Dream runs on a schedule, triggered automatically every few hours, or invoked manually. It reads the day's accumulated mid-term summaries, cross-checks them against the agent's existing long-term knowledge, and decides what's new, what needs updating, and what no longer holds.
Unlike rewriting the whole memory file, Dream makes precise, minimal edits: add a user preference here, update a project convention there, delete what no longer holds. Every edit is git-committed with a timestamp and a diff. You can run /dream-log to see exactly what the agent learned and when, or /dream-restore to roll back to any historical state.
This layer solves a trust problem, not a technical one. Memory shouldn't be a black box: if the agent is going to remember things about you, you should be able to see what it remembers and why.
Explicitly not RAG. No vector database, no embedding similarity search. The agent observes its own experience and decides for itself what's worth keeping.

The three layers are progressive, not parallel: conversation happens, and short-term memory records it fully in real time. As token pressure grows, the Consolidator compresses early content into lossy summaries that land in mid-term memory. Every few hours Dream runs, extracting from the summaries the knowledge actually worth keeping across sessions and writing it to long-term memory. Information settles layer by layer: from fast to slow, from complete to distilled, from ephemeral to permanent.
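The safety-net trimming described above (keep the newest turns, drop whole turns from the oldest end) can be sketched as follows. `count_tokens` and the message shape are hypothetical stand-ins for whatever nanobot actually uses:

```python
def count_tokens(msg: dict) -> int:
    # Hypothetical stand-in tokenizer: roughly 1 token per 4 characters.
    return max(1, len(msg["content"]) // 4)

def trim_to_budget(history: list[dict], budget: int) -> list[dict]:
    """Drop whole turns from the oldest end until the rest fits.
    The newest turns always survive, so the agent never crashes on
    a full context -- it just sees less of the past."""
    kept, used = [], 0
    for msg in reversed(history):          # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))            # restore chronological order

history = [{"role": "user", "content": "x" * 400},   # 100 tokens, oldest
           {"role": "tool", "content": "y" * 200},   # 50 tokens
           {"role": "user", "content": "z" * 40}]    # 10 tokens, newest
print([count_tokens(m) for m in trim_to_budget(history, budget=70)])
```

In the full design, what gets trimmed here would already have been summarized into mid-term memory by the Consolidator, so the cut is lossy but not blind.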
Xubin Ren@xubinrencs

x.com/i/article/2041…

Roger@gerrox·
Using the knowledge-base pipeline Andrej Karpathy made famous across the internet a few days ago, I built a local knowledge base: Buffett's letters to shareholders and partners spanning nearly 70 years, turned from raw text into a searchable, citable concept knowledge base in Chinese.

With the system compiled from 98 original letters in place, the first question I asked was: what makes a good business model? Below is the answer indexed and assembled from the whole knowledge base.

───

Across Buffett's 69-year investing record, judging whether a business model is good comes down to one thing: can it keep producing excess returns while adding little or no tangible capital?

Concretely, three tests:

1. The three elements of an economic franchise (from the 1991 letter)
(1) The product or service is needed or desired
(2) Customers believe there is no close substitute
(3) It is not subject to price regulation
There is only one verification method: can it regularly raise prices on its own initiative while earning a high return on capital? A business that cannot raise prices is not a good business.

2. Two kinds of franchise, one essence (from the 1983 and 2000 letters)
• Brand-desire type: See's Candies, Coca-Cola. The moat comes from consumer goodwill, "deep goodwill built on countless pleasant experiences with the product and its people."
• Cost-leadership type: GEICO. The product itself is a commodity (auto insurance), but operating costs run far below peers. In Buffett's words: "When a business sells a commodity-type product, being the low-cost producer is everything."
Two seemingly different types, one essence: pricing power or a cost advantage that lets incremental capital keep earning high returns.

3. The ultimate test in an inflationary era (from the 1983 goodwill appendix)
See's earned $13M after tax on roughly $20M of net tangible assets. When inflation arrives, it can raise prices naturally, without extra investment, and maintain its profit level. A mediocre business needing $18M of net tangible assets to earn the same profit must pour in another $18M the moment inflation hits just to stay competitive, with the profit swallowed by inflation.

Buffett later called this the most important shift away from what Graham taught him: "The difficulty lies not in accepting new ideas, but in escaping from old ones." Thirty-five years earlier he had been taught to favor tangible assets; now he "strongly prefers businesses with large amounts of durable goodwill that require only small amounts of tangible assets."

The counterexample is just as clear: textiles, the classic ordinary business. No matter how hard management worked, the industry's nature destined it to "earn inadequate returns." It lost money almost every year from 1971 to 1980. Not bad management (Buffett described Ken Chace as "candid" and "persistent"); the industry simply had no pricing power, an utterly undifferentiated product, and customers who looked only at price.

In one sentence: a good business model can regularly raise prices (brand type) or is the lowest-cost producer (cost-leadership type), and can sustain growth without large additions of tangible capital. A bad business model can only compete on price, and every dollar it earns requires a big chunk of fixed assets up front.
Roger@gerrox·
AI differentiation isn't about model capability; it's about identity and continuity. The model is infrastructure; the mind layer is the product.

The new OpenClaw memory system has three layers, each a plain-text file:
1. SOUL.md (identity): defines who the AI is, its voice, values, personality, and way of thinking. No fine-tuning, just a description of a mind.
2. MEMORY.md (experience): accumulates what the AI has lived through, why decisions were made, how relationships changed, important projects, things to avoid. Real continuity instead of cold starts.
3. DREAMS.md (integration): the AI reflects while no one is interacting, surfacing patterns and connections you yourself haven't noticed. Turns the AI from a tool into a companion.

This matches my earlier thinking: the three-layer memory is model-agnostic and portable. Model capability, agent framework, and memory system form a trinity. The model can be replaced; the core cannot.
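Since the three layers are just plain-text files, "model-agnostic and portable" falls out naturally: a harness only has to concatenate them into the prompt. A sketch of how that might look, with only the file names taken from the post and everything else hypothetical:

```python
from pathlib import Path

# The three memory layers from the post, in prompt order.
MEMORY_FILES = ["SOUL.md", "MEMORY.md", "DREAMS.md"]

def build_system_prompt(workdir: str) -> str:
    """Concatenate whichever memory layers exist into one system prompt.
    Point any model at the same directory and it inherits the same mind."""
    sections = []
    for name in MEMORY_FILES:
        path = Path(workdir) / name
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(sections)
```

Missing layers are simply skipped, so a fresh setup degrades gracefully to a bare model rather than failing.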
Dave Morin 🦞@davemorin

x.com/i/article/2040…

Roger@gerrox·
Predictably, models' coding ability will keep improving in the short term. We've entered a data flywheel: AI-written code went from poor quality to fully usable today, and that code will in turn become high-quality training data, accelerating the whole loop.
Roger@gerrox·
Upstream error from Alibaba: Request rate increased too quickly. To ensure system stability, please adjust your client logic to scale requests more smoothly over time. The free Qwen 3.6 Plus API on OpenRouter has been really unstable today. Surprisingly, the free API Xiaomi opened up earlier actually held up.
Roger@gerrox·
Think back over the history of PC software.

640KB of RAM stopped being enough, so DOS added extended memory. Programmers looked at it and thought: plenty. A few years later, 4MB wasn't enough. 16MB, 64MB, 256MB: every generation's added memory got eaten clean by software within 18-24 months.

That's incentives at work. When hardware gives you more room, programmers have no motivation to optimize. `malloc` easier than `realloc`? Then alloc a bit extra. Redundant data structures? Memory's cheap. Compression too CPU-heavy? Skip it, there's enough memory.

Every generation of software engineers squanders the headroom the previous generation fought for.

Today's agent frameworks are doing exactly the same thing, except the thing being squandered is tokens. The consequences, though, are completely different.

When software engineers squander RAM, the user spends another 200 yuan on a RAM stick. When agent engineers squander tokens, the business of selling tokens goes to zero.

Why? Because RAM's supply elasticity is high. DRAM capacity expands in 6-12 months, it's highly standardized, and users can buy a stick and plug it in themselves. GPU compute can't do that. Chip fabrication takes 18-24 months. HBM is monopolized by three vendors with slow capacity ramps. CoWoS advanced packaging capacity is limited. Power infrastructure is a hard constraint; you can't just build it because you want to.

GPU supply elasticity is one to two orders of magnitude lower than DRAM's. Tokens run on GPUs. Squandering tokens costs far more than squandering memory.

---

Turn the abstract problem into a bill.

One user query gets split by the agent harness into N independent API requests. Each request carries 100K+ tokens of full context. The same context, re-sent N times. The request count is "several times" that of Claude Code's own framework. Converted to API pricing, the real cost is tens of times the subscription price.

Fuli Luo put it in one line: "Pain eventually converts to engineering discipline."

Anthropic cut third-party subscriptions because the math doesn't work. A user pays a $20 monthly subscription while the underlying token consumption is worth $500. That's charity.

The problem is that the agent harness architecture naturally produces duplicated token consumption, and this request pattern completely bypasses the inference layer's prefix cache.

First, what the prefix cache actually is.

LLM inference has two phases: prefill processes the input prompt, and decode emits tokens one by one. During prefill, the model computes a set of intermediate results for every token in the prompt, called the KV cache. These intermediates are the model's "memory." See the same prompt again, and they're reused without recomputation.

The opening section of your prompt is usually identical. Within one session, the system prompt + tool definitions + project context, thousands or even tens of thousands of tokens, are the same in every request. Only the user's latest question changes.

So the inference engine stores the KV cache for prompt prefixes it has already processed. When a new request arrives, it checks how much of its prefix overlaps a cached prefix, reuses the overlap, and computes only the new part.

That's the prefix cache: automatic prefix caching.

There are two mainstream implementations. vLLM splits the prompt into fixed-size blocks, hashes each block, and matches new requests block by block against cached hashes. SGLang uses a radix tree, matching at token granularity; every request sharing a prefix points at the same node in the tree. The former is simple but coarse-grained; the latter is more flexible but harder to implement.

How is the hit rate computed? Tokens reused from cache divided by total tokens to process. A 90% hit rate means 90K of 100K tokens come straight from cache and only 10K are computed fresh. A 10% hit rate means 90K get recomputed.

Is the system prompt's hit rate high? In theory it should be, since it's fully repeated content. In practice, system prompts often change dynamically: auto-injected file lists, git status, tool definitions varying by scenario. Change even a few tokens near the front of the prefix and the whole prefix breaks.

Even if the system prompt hits 100%, it's a tiny share of the prompt. In a 100K-token request, the system prompt might be 5K; the other 95K is conversation history and tool results, all of it missing. Overall hit rate = 5K / 100K = **5%**.

That's what "the numbers were painful" really means.

Inference engines like SGLang and vLLM have poured enormous engineering effort into KV cache management: KV cache reuse, radix attention trees, continuous batching, all so the same tokens get computed one fewer time.

Then the agent framework shows up: every request is an independent HTTP call with brand-new context. From the engine's perspective, each request arrives with a brand-new prefix, and the prefix cache misses entirely.

It's a parking lot whose manager built a smart valet system to maximize space utilization, while every driver pulls up to the gate, pauses, leaves, and comes back in a different car.

Chayenne measured the cache hit rate; the numbers were ugly. That's not the engine's fault. The request protocol between harness and engine simply never accounted for the cache's existence.

---

Lay out the actors in the system.

**Layer 1: the people selling tokens.** Metered billing: the more tokens consumed, the more they earn. A session burning 700K tokens versus 70K is a tenfold revenue difference to them.

**Layer 2: the people building inference engines.** They care about efficiency. A higher prefix cache hit rate means higher GPU utilization and lower per-token cost. But they have no authority over the agent framework's request patterns; they can only optimize whatever arrives.

**Layer 3: the users.** Under subscriptions, token cost is invisible. Whether a run takes 70K or 700K, the same monthly fee comes out at month's end, so there's no motivation to optimize context structure.

Three layers, three interests, no overlap. Nobody owns total efficiency.

More precisely: nobody **can** own total efficiency. The token sellers have no incentive, the engine builders no authority, the users no visibility. A classic incentive-incompatible system.

In economics, this is the inverse of the tragedy of the commons: the resource isn't overused, efficiency is systematically wasted, and every participant feels blameless. Framework developers think "I just call the API," engine developers think "I just accelerate inference," and the platform thinks "as long as users are happy."

A 700K-token session that 70K could have finished. Ten times the GPU time, ten times the tokens, ten times the waiting.

---

Between the agent harness and the inference engine sits a blank layer nobody owns.

No protocol lets the framework tell the engine: "The next five requests will all reuse that first 80K tokens of context; please keep the KV cache warm."

No protocol lets the engine tell the framework: "You keep re-sending this prefix; merge these into one request and it'll run three times faster."

That blank layer doesn't exist. Historically, whenever such a gap appears, someone steps in to fill it. There was no unified graphics interface between operating systems and applications, so Adobe and Microsoft built their own. There was no ORM layer between databases and applications, so Ruby on Rails and Hibernate built one.

The same blank layer now sits between agent harness and inference engine. Who will define the harness ↔ engine co-scheduling protocol?

Three concrete problems:

First, context continuity. How does the framework declare a context's lifetime to the engine: how long it will be reused, which parts stay fixed, which update incrementally.

Second, cache-aware dispatch. How does the framework route a batch of related requests to the same engine instance, guaranteeing prefix cache hits.

Third, cost feedback. How does the engine feed cache hit rate and token utilization data back to the framework, so the framework can do intelligent context compression and request merging.

None of these is a hard technical problem on its own. The hard part is that nobody currently has the incentive. Engine builders don't know the framework's request patterns. Framework builders don't know the engine's cache state. Users don't know what either side is doing.

---

MiMo Token Plan took a different path. It supports third-party harnesses, bills by token quota, doesn't restrict how you call it, and charges for actual consumption.

This model exposes token cost to the user. Pay for what you use, with no subscription illusion. Users start caring about token consumption, restructuring their context, asking "can I do the same thing with 10% of the tokens?"

When cost is visible, efficiency stops being icing on the cake and becomes a survival problem.

Subscriptions aren't inherently broken. But in agent scenarios, a subscription with invisible costs masks waste. Masked waste doesn't get optimized. Unoptimized waste blows up the unit economics.

Anthropic cutting third-party subscriptions is the symptom. The disease is the blank layer between harness and engine.

---

GPU supply elasticity is low, which makes the price of token waste climb over time. HBM capacity, CoWoS packaging, power infrastructure: their ramp speeds are measured in years. Agent frameworks' token consumption grows on a timescale of days.

The two lines will cross sooner or later. When they do, either someone builds the harness ↔ engine middle layer, or the entire agent business model gets forced into restructuring.

Whoever defines that protocol first holds the entrance to the next generation of AI infrastructure.
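The 5% arithmetic generalizes to a simple per-request hit-rate model. A sketch, with the segment proportions hypothetical but matching the example above:

```python
def hit_rate(segments):
    """segments: list of (token_count, served_from_cache) pairs
    describing one request's prompt."""
    total = sum(n for n, _ in segments)
    cached = sum(n for n, hit in segments if hit)
    return cached / total

# The request from the example: a 5K system prompt that hits the cache,
# plus 95K of conversation history and tool results that miss.
request = [(5_000, True), (95_000, False)]
print(f"{hit_rate(request):.0%}")   # 5%

# The same request if the harness had kept the prefix stable across
# calls, so only the 5K of genuinely new content needs prefill:
warm = [(5_000, True), (90_000, True), (5_000, False)]
print(f"{hit_rate(warm):.0%}")      # 95%
```

The gap between the two numbers is the "design margin of the inference stack" being wasted: the engine can only reuse what the request pattern lets it reuse.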
Fuli Luo@_LuoFuli

Two days ago, Anthropic cut off third-party harnesses from using Claude subscriptions — not surprising. Three days ago, MiMo launched its Token Plan — a design I spent real time on, and what I believe is a serious attempt at getting compute allocation and agent harness development right. Putting these two things together, some thoughts: 1. Claude Code's subscription is a beautifully designed system for balanced compute allocation. My guess — it doesn't make money, possibly bleeds it, unless their API margins are 10-20x, which I doubt. I can't rigorously calculate the losses from third-party harnesses plugging in, but I've looked at OpenClaw's context management up close — it's bad. Within a single user query, it fires off rounds of low-value tool calls as separate API requests, each carrying a long context window (often >100K tokens) — wasteful even with cache hits, and in extreme cases driving up cache miss rates for other queries. The actual request count per query ends up several times higher than Claude Code's own framework. Translated to API pricing, the real cost is probably tens of times the subscription price. That's not a gap — that's a crater. 2. Third-party harnesses like OpenClaw/OpenCode can still call Claude via API — they just can't ride on subscriptions anymore. Short term, these agent users will feel the pain, costs jumping easily tens of times. But that pressure is exactly what pushes these harnesses to improve context management, maximize prompt cache hit rates to reuse processed context, cut wasteful token burn. Pain eventually converts to engineering discipline. 3. I'd urge LLM companies not to blindly race to the bottom on pricing before figuring out how to price a coding plan without hemorrhaging money. Selling tokens dirt cheap while leaving the door wide open to third-party harnesses looks nice to users, but it's a trap — the same trap Anthropic just walked out of. 
The deeper problem: if users burn their attention on low-quality agent harnesses, highly unstable and slow inference services, and models downgraded to cut costs, only to find they still can't get anything done — that's not a healthy cycle for user experience or retention. 4. On MiMo Token Plan — it supports third-party harnesses, billed by token quota, same logic as Claude's newly launched extra usage packages. Because what we're going for is long-term stable delivery of high-quality models and services — not getting you to impulse-pay and then abandon ship. The bigger picture: global compute capacity can't keep up with the token demand agents are creating. The real way forward isn't cheaper tokens — it's co-evolution. "More token-efficient agent harnesses" × "more powerful and efficient models." Anthropic's move, whether they intended it or not, is pushing the entire ecosystem — open source and closed source alike — in that direction. That's probably a good thing. The Agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it wisely.

Roger Retweeted
Chayenne Zhao@GenAI_is_real·
We're Not Wasting Tokens — We're Wasting the Design Margin of the Entire Inference Stack A few days ago I read a post by Fuli Luo on Twitter, discussing Anthropic's decision to cut off third-party harnesses (OpenClaw) from using Claude subscriptions, and the design thinking behind MiMo's Token Plan pricing. Her core argument: global compute capacity is seriously falling behind the token demand created by agents. The way forward isn't selling tokens cheaper in a race to the bottom — it's the co-evolution of "more efficient agent harnesses" and "more powerful, efficient models." I read it several times over. People who build inference engines have long been frustrated by how wastefully agent frameworks burn through tokens. She articulated something the industry has tacitly acknowledged but rarely stated plainly — and she did it with precision and restraint: the compute allocation crisis we face today is not fundamentally about insufficient compute. It's about tokens being spent in the wrong places. I want to push this one layer deeper, from my own perspective. I'm a heavy user of Claude Code — I make no attempt to hide that. You can check that all the latest code in SGLang Omni was built with Claude Code powering my workflow. Its commercial success is beyond question; it genuinely gave many people (myself included) their first real experience of "coding with an agent." But I'm also an inference engine developer — my day job is figuring out how to push prefix cache hit rates higher, how to make KV cache memory layouts more efficient, how to drive down the cost of every single inference request. So when I plugged Claude Code into a local inference engine and started observing the actual request patterns it generates, my reaction was — how to put it — like a water engineer who spent months designing a conservation system, only to watch someone water their garden with a fire hose. 
I measured Claude Code's cache hit rate on my local serving engine over the course of a day. The numbers were painful. This isn't a case of "decent but room to improve." It's a case of "the prefix cache mechanisms we carefully engineered at the inference layer are being almost entirely defeated." Fuli Luo mentioned that OpenClaw's context management is poor — firing off multiple rounds of low-value tool calls within a single user query, each carrying over 100K tokens of context window. Frankly, Claude Code's own context management is nowhere near making proper use of prefix cache or any of the other optimizations we've built into inference engines. Many people have already noticed — for example, the resume feature has a bug that causes KV cache misses entirely, which is borderline absurd. I'll say it plainly: the way sessions construct their context was never seriously designed with cache reuse in mind from the start. Perhaps Anthropic has internal trade-offs we can't see — after all, they control both ends of the stack, model and inference, and can theoretically do optimizations at the API layer that are invisible to us. But from the external behavior I can observe, enormous volumes of tokens are being spent on: re-transmitting already-processed context, re-parsing already-confirmed tool call results, and maintaining an ever-inflating conversation history with extremely low information density. If this is merely to earn more on inference token charges, I find it genuinely regrettable. But many Claude Code users are on subscriptions — burning more tokens is fundamentally a cost burden for Anthropic, not revenue. I honestly don't understand what purpose such inefficient context management serves for Claude Code. Here's a bold hypothesis: for those long sessions that consume 700K+ tokens, there is certainly a way to restructure the session's context so it accomplishes the exact same task with 10% of the tokens. 
Not by sacrificing quality, but through smarter context compression, more rational prefix reuse strategies, and more precise tool call scheduling. This isn't theoretical speculation — anyone who has worked on inference engine optimization, upon seeing current agent framework request patterns, would arrive at a similar conclusion. Fuli Luo is right: global compute capacity can't keep up with the token demand agents are creating. But I'd add that a significant portion of that gap is an illusion of prosperity — artificial demand manufactured by the crude design of agent frameworks. Here's an analogy I keep coming back to. I've always liked bringing up RAM bloat — in 1969, 64KB of memory sent Apollo to the moon. In 2026, I open a single webpage and 500MB of memory usage is nothing unusual. Every generation of hardware engineers pushes memory capacity higher, and every generation of software engineers lavishly fills it to the brim. People have gotten used to this cycle, even come to see it as the normal cost of progress. But LLM inference is different. The cost of RAM bloat is your computer running a bit slower, spending a couple hundred bucks on a memory upgrade — users barely notice. The cost of token bloat is real money — GPU cluster electricity bills, user subscription fees, the industry's entire compute budget. And this cost scales exponentially as agent usage grows. If we don't establish the engineering discipline that "tokens should be used efficiently" in the early days of the agent era, the cost of catching up later, once scale kicks in, will be beyond imagination. Fuli Luo notes that Anthropic cutting off third-party harness subscription access is objectively forcing these frameworks to improve their context management. I agree with that assessment, but my gut feeling is that this shouldn't stop at "third-party frameworks need to be more frugal with tokens." 
It should trigger a more fundamental reflection: what kind of agent-inference co-design do we actually need? Right now, agent frameworks and inference engines are essentially fully decoupled — agent frameworks treat the inference engine as a stateless API, sending the full context with every request. Meanwhile, the inference engine does its best with prefix matching, caching whatever it can. This architecture is simple and general-purpose, but brutally inefficient for long sessions. If agent frameworks could be aware of the inference engine's cache state and proactively construct cache-friendly requests — if inference engines could understand the session semantics of agents and make smarter cache eviction decisions — once that information channel between the two opens up, the potential gains in token efficiency are enormous. Of course, maybe I'm overthinking this. Maybe the market's ultimate answer is: compute gets cheap enough, waste is fine. Just like the RAM story — in the end, everyone chose "memory is big enough, no need to optimize." But I don't think the token economy will follow the same path, at least not in the near term — because the supply elasticity of GPU compute is far lower than that of DRAM. Under compute constraints, token efficiency isn't a "nice to have" optimization — it's the core competitive advantage that determines who survives. Most people love hearing "we made the model bigger," "we stretched the context window to a million tokens," "we stacked HBM to new heights" — these narratives are sexy, shareable, fundable. But I seriously believe that "finding ways to reduce the reckless waste of tokens" is a profoundly underestimated direction. This isn't a defensive optimization. It's an offensive capability — whoever first achieves an order-of-magnitude reduction in token consumption at equivalent quality can serve ten times the users on the same compute budget, or deliver ten times the agent depth to a single user. 
The agent era doesn't belong to whoever burns the most compute. It belongs to whoever uses it most wisely. This line from Fuli Luo resonates deeply with me. But I want to press further: who gets to define "wisely"? The people building models? The people building inference engines? The people building agent frameworks? I think the answer is — all three must come to the table together. And right now, we're nowhere close.
Fuli Luo@_LuoFuli
dontbesilent@dontbesilent·
I will do my best to prove: every problem is a knowledge problem.