crypto_nerd 🦇🔊

2.8K posts

@crypto_nerd_33t

Truth set you free

Joined October 2021
7K Following · 288 Followers
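crypto_nerd 🦇🔊 reposted
WquGuru🦀 @wquguru
DeepSeek V4 is perfect proof that creativity comes from hallucination. Compared with GPT5.5 and Opus4.7, DeepSeek V4 Pro hallucinates noticeably more on coding tasks, yet its prose reads as far more free-wheeling and full of genuine human expression. Last night I casually asked it to write an essay. The paragraphs flowed one sentence after another, the emotional turns felt as natural as a late-night chat with an old friend, and the details hid unexpected images without seeming forced. It had none of the staid textbook flavor of other models. In code it really does improvise recklessly: variable names are sometimes baffling, and debugging is a headache. But perhaps it is exactly these surplus hallucinations that give the writing a soul. People write in much the same way; a rush of blood to the head often produces the most moving sentences. From now on I will most likely go to DeepSeek V4 Pro first for writing. It feels great.
Translated from Chinese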
116
38
697
142.7K
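crypto_nerd 🦇🔊 reposted
向阳乔木 @vista8
How do you make your Codex smarter and more attuned to you over time? During last week's livestream with @HiTw93, many people may have missed something he said: his development Skill, waza, gets painless updates every week. He has Codex scan the week's conversation logs and has the AI distill his development experience and aesthetic preferences and write them into the Skill, so it keeps getting stronger. I recommend everyone try it; the method and prompt are in the first comment.
Translated from Chinese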
53
76
576
67K
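crypto_nerd 🦇🔊 reposted
yibie @yibie
Multi-agent systems have been hyped for a year, but only three patterns have actually survived in production. The rest are in the graveyard.

This conclusion is not mine. It comes from three pieces of evidence that surfaced on the same day: an internal retrospective by the engineering lead at Cognition (the company behind Devin), an industry overview by Manning author Micheal Lanham, and a GitHub project called metaswarm.

Putting them side by side, I noticed something interesting: they all say the same thing.

---

## Three signals, one verdict

**Signal 1: metaswarm (18 agents, 127 PRs, one weekend)**

The hottest project on HN today. One person + 18 AI agents + one weekend = 127 PRs pushed to production. MIT-licensed open source. It looks like the ultimate case for multi-agent collaboration.

But look closely at its architecture and you find a deliberately hidden detail: **its 18 agents are not collaborating as peers. It is map-reduce-and-manage.**

One manager splits the work, 17 sub-agents each do their own part, and the manager collects the results, merges them, and pushes. The agents do not chat with each other, review each other, or vote. Each sub-agent faces only its own small, independent slice of context.

It looks like a swarm, but it is really a pipeline.

**Signal 2: Walden Yan's internal retrospective ("writes stay single-threaded")**

Walden Yan is the engineering lead at Cognition. Ten months ago he wrote "Don't Build Multi-Agents"; today he wrote "Multi-Agents: What's Actually Working".

The core conclusion, in his words: "Multi-agent systems work best today when writes stay single-threaded and the additional agents contribute intelligence rather than actions."

They tried three patterns:

1. **Code review loop**: a coding agent writes and a review agent reads. The review agent has a **completely clean context**: it does not see the coding process, only the diff. On average it finds 2 bugs per PR, 58% of them severe. Key finding: the two agents work better when they do **not share context**. The reason is context decay: after hours of work the coding agent has accumulated a huge context window and its attention is diluted. The clean review agent is actually smarter.
2. **Smart friend**: when the main model hits a hard problem, it calls a stronger (and more expensive) model as a "friend". The hard part is not reasoning ability but **communication**: how does the weaker model know it has reached its limit? What context should it pass to the stronger model? How should the stronger model reply so that the weaker model truly understands?
3. **Manager and sub-agents**: a manager Devin splits the task, child Devins each do their part, and the manager synthesizes. Every problem they hit was a **communication problem**: the manager over-specifies by default (because it lacks codebase context), sub-agents do not proactively report information their siblings should know, and agents do not message each other by default.

Three patterns, one rule: **only one agent performs writes.**

**Signal 3: Micheal Lanham's industry overview ("multi-agent failure is structural, not a prompt problem")**

Lanham is the author of Manning's "AI Agents in Action". The title of his article today says it all: "Multi-Agent in Production in 2026: What Actually Survived".

He divides multi-agent systems into three topologies:

- **Agent-flow (pipeline)**: sequential handoff. A finishes and hands to B; B finishes and hands to C. This is the form with the **highest survival rate** in production.
- **Agent-orchestration**: one manager dispatches multiple executors. Map-reduce-and-manage. The most practical form for complex tasks.
- **Agent-collaboration (peer-to-peer)**: agents communicate, negotiate, and vote among themselves. **Almost all dead.**

In his words: "Most of what looks like 'more agents = smarter' is just a redundant rearrangement of the same information."

Three reports, three authors, no cross-citations. Yet the conclusions agree completely.

---

## Why did peer collaboration die out?

The answer lies in two technical details.

**First, what Walden calls "actions carry implicit decisions."**

When an agent writes code it is making choices: which design pattern to use, how to handle edge cases, the variable naming style, the error-handling strategy. These choices are not explicit; they are implicit.

When two agents write at the same time, they make conflicting implicit decisions about the same problem. What you get at merge time is not a merge conflict but a **conflict of design philosophy**. No diff tool can resolve that automatically.

**Second, what Lanham calls the "cascade surface."**

Peer-collaboration failures are not linear; they are exponential. Agent A's error passes to Agent B, B amplifies it and passes it to C, and C amplifies it again and passes it back to A. After three loops, the semantic distance between output and input is too large to recover.

This explains why all those 2024 demos of "agent teams building apps automatically" never got past the demo stage.

---

## So what do the three surviving patterns look like?

**Pattern 1: Pipeline (Agent-flow)**

The simplest form. A → B → C, one after another, like a factory line.

Where it fits: clear requirements, separable steps, verifiable output. For example: requirements-analysis agent → code-generation agent → test-generation agent → code-review agent.

Why it survived: the input and output of every step are explicit and checkable. When something goes wrong, you can pin it to a specific stage.

**Pattern 2: Orchestration (map-reduce-and-manage)**

One strong agent plans, decomposes, and synthesizes; multiple weaker agents execute subtasks in parallel.

Where it fits: complex tasks that need parallel speedup but where decision authority must stay centralized. Examples: metaswarm's 18 agents, Devin's manager-worker.

Why it survived: only the manager performs writes. The sub-agents contribute "intelligence" (analysis, generation, search), not "decisions".

**Pattern 3: Generator-Validator**

One agent writes; another reads and picks holes. The writer does not see the reader's process, and the reader does not see the writer's. Clean context.

Where it fits: code review, security checks, content moderation. Walden says they have been running this in production for a long time.

Why it survived: the validating agent's context is clean. It carries no historical baggage and is not led astray by the coding agent's wrong assumptions.

---

## A counterintuitive conclusion

After reading these three reports, my main takeaway is not "multi-agent doesn't work" but something subtler:

**The problem multi-agent systems really solve is not "smarter"; it is "cheaper + more reliable".**

For the same money, a parallel pipeline of 5 cheap models produces more consistent quality, tolerates more faults, and runs faster than 1 expensive model doing the whole flow.

This is not an AGI breakthrough. It is a victory for system design.

As Walden says at the end of his article: "We are building a world where intelligence is injected into every stage of the software development lifecycle, not as a crowd of autonomous actors but as a coordinated system that extends human taste."

Note the wording: "a coordinated system", not "autonomous actors".

---

## So stop building agent swarms

If you are about to start a multi-agent project, ask yourself three questions:

1. **Can writes be limited to a single writer?** If yes, continue. If not, a single agent may be better.
2. **What context is passed between agents, and how much?** This is not a prompt problem; it is an architecture problem. Pass too much and you drown the receiver; pass too little and the receiver cannot decide correctly.
3. **How will failures cascade?** If Agent A is wrong, how wrong will Agents B, C, and D become as a result? Is there a circuit breaker?

If you do not have clear answers to these three questions, you are not ready for production.

The multi-agent future is real. But it is not the future you imagine.

It is not a group of agents debating what to do in a chat room. It is one conductor and many executors. It is structural design, not magic.

---

**Sources:**

- Walden Yan (Cognition): [Multi-Agents: What's Actually Working](x.com/walden_yan/sta…)
- Micheal Lanham: [Multi-Agent in Production in 2026: What Actually Survived](medium.com/@Micheal-Lanha…)
- metaswarm: [18 AI agents, 127 PRs to prod in a weekend](news.ycombinator.com/item?id=468649…)
- Anthropic: [anthropics/skills](github.com/anthropics/ski…) ⭐
Translated from Chinese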
11
65
339
78.9K
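A minimal sketch of the "one writer, many advisers" rule from the yibie post above, assuming a toy setup: `sub_agent`, `Proposal`, `Manager`, and the `max_failures` threshold are all hypothetical names invented for illustration, and the sub-agent is a stub rather than a real model call. It only shows the shape: sub-agents return proposals in parallel, the manager is the sole component that writes, and a simple circuit breaker stops a bad batch from cascading.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Proposal:
    """Read-only output of a sub-agent: a suggestion, never a write."""
    task: str
    patch: str
    ok: bool


def sub_agent(task: str) -> Proposal:
    # Hypothetical stand-in for a model call; it sees only its own task.
    if "bad" in task:
        return Proposal(task, "", ok=False)
    return Proposal(task, f"patch for {task}", ok=True)


class Manager:
    """Splits work, fans out to sub-agents, and is the only writer."""

    def __init__(self, max_failures: int = 1):
        self.max_failures = max_failures  # circuit-breaker threshold (assumed)
        self.repo: list[str] = []         # stands in for the shared codebase

    def run(self, tasks: list[str]) -> bool:
        # Map: sub-agents work in parallel on independent slices of context.
        with ThreadPoolExecutor() as pool:
            proposals = list(pool.map(sub_agent, tasks))
        failures = [p for p in proposals if not p.ok]
        if len(failures) > self.max_failures:
            return False  # breaker trips: nothing is written
        # Reduce: writes happen here, sequentially, in one place.
        for p in proposals:
            if p.ok:
                self.repo.append(p.patch)
        return True
```

The design choice worth noting is that `sub_agent` has no handle on `repo` at all, so conflicting implicit decisions can only meet at the manager.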
crypto_nerd 🦇🔊 reposted
Ken Wu @kenwuuuu
so what is the current industry standard to build agents? do people still use langchain/langgraph?
English
129
27
1K
262.7K
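crypto_nerd 🦇🔊 reposted
Cat Chen, @catchen@mastodon.world
Whether in system design interviews or in real work, I coach people that "you need to have your own opinion and also accept feedback." Pushing out your own views wholesale without actively giving others a chance to weigh in is a problem. But lacking ideas of your own and accepting others' decisions wholesale leaves you passive in execution. Striking the balance is key.
Facing a problem, you need to work out how to do it, then proactively propose that to your boss and ask what they think. If the problem has no perfect solution, work out several approaches, compare their pros and cons (trade-offs), lay them out for your boss, and point out the one you recommend, but let them make the final choice. That is the balanced approach.
Starting work the moment you have figured out an approach, without consulting your boss, is risky. You need to know in advance which things your boss judges only by results, and which things will get you punished if you go ahead with a trade-off your boss disagrees with.
Asking your boss for a solution without thinking at all is laziness, unless your boss knows the work is far beyond your abilities. (If you and your boss have different expectations of your "abilities", you are just as doomed.)
Claude Code is fairly balanced: it can explain clearly what it is about to do, but it also asks me to approve that plan. When its plan suits me, I just reply "go ahead".
Codex often jumps straight in, and when the trade-off does not suit me I still have to make it change things. In the other case, Codex lays out a pile of facts but does not say what to do. There is still a gap between facts and a solution, and closing it becomes my responsibility. I dislike both cases.
AI Babysitter @ai_babysitter

@CatChen Same here. My own impression is that claude code is more like an L6: it communicates very well, is relatively good at starting from the big picture, has more "common sense", and is better suited to high-uncertainty problems. codex is more like an L5: its execution pays more attention to detail and is more careful, but it lacks common sense. In my case it cannot handle uncertain tasks, especially ones with unclear boundaries, and communicating with it is exhausting (I never use codex or even chatgpt at night; too infuriating).

Translated from Chinese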
4
9
77
8.2K
crypto_nerd 🦇🔊 reposted
Naval @naval
What you really want is a challenge at the edge of your capability.
English
597
2.1K
16.7K
502.9K
crypto_nerd 🦇🔊 reposted
Simons @Simon_Ingari
HR: We lost another senior employee today.
CEO: What happened?
HR: He resigned after receiving an external offer.
CEO: That makes no sense. We could have matched it.
HR: That is the issue. We were willing to pay a stranger 70% more for the same role, but would not give our existing employee even a 20% raise.
CEO: External hiring is different. That is market pricing.
HR: He noticed that too.
CEO: We appreciated his loyalty. He had been here for years.
HR: Yes. And during those years, he consistently exceeded expectations while being told to “wait for the next review cycle.”
CEO: But budgets are complicated for internal employees.
HR: Apparently not for external candidates. The new hire budget was approved in three days. His raise request sat for eight months.
CEO: We had to stay competitive in the hiring market.
HR: He was part of that same market. The only difference is that another company valued him before we did.
CEO: So he left over salary?
HR: Not just salary. He left because he realized loyalty was being rewarded less than leaving.
CEO: That is unfortunate.
HR: Yes. Companies will sometimes trust a candidate after a 45-minute interview more than an employee who already proved themselves for five years.
CEO: So what are you saying?
HR: If companies only recognize employee value after a resignation letter appears, then eventually employees will stop waiting to be appreciated internally. Sometimes the fastest way for an employee to get market value is to stop being your employee.
English
212
1.6K
9.6K
1.5M
crypto_nerd 🦇🔊 reposted
Тsфdiиg @tsoding
Microsoft doesn't want you to know this
English
138
795
8.3K
614.5K
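crypto_nerd 🦇🔊 reposted
yetone @yetone
Exactly! One benefit I have found with Tmux is that a Coding Agent will proactively use tmux capture-pane to fetch the contents of other tmux windows, or even other tmux sessions, and so obtain more, and more accurate, context. For example: 1. debugging a backend's error logs 2. coordinating with another Coding Agent's session. These are all natural uses of Tmux, and today's LLMs handle them smoothly.
River Leaf @riverleaf88

It feels like today's IDEs and Coding Agent clients have built a whole pile of features, yet none of them can replace tmux's workspace persistence or its protection of long-running tasks against accidental closing. Why don't Coding Agents build their human-computer interaction around tmux? Perhaps tmux's learning curve is still a little steep, mainly because you have to memorize a bunch of shortcuts.

Translated from Chinese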
37
31
459
128.9K
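As a rough illustration of the `tmux capture-pane` usage yetone describes, here is a small sketch of how a script or agent harness might read another pane's output. The helper names and the target `backend:0.1` in the usage note are hypothetical; the flags used (`-p` to print to stdout, `-t` to pick the target pane, `-S` with a negative number to reach back into scrollback history) are standard tmux options. Running `capture_pane` requires a live tmux server with that pane.

```python
import subprocess


def capture_pane_cmd(target: str, history_lines: int = 200) -> list[str]:
    """Build the tmux command that prints a pane's recent contents."""
    # -p prints to stdout, -t selects the pane ("session:window.pane"),
    # -S -N starts the capture N lines back in the scrollback history.
    return ["tmux", "capture-pane", "-p", "-t", target, "-S", f"-{history_lines}"]


def capture_pane(target: str, history_lines: int = 200) -> str:
    """Run the command and return the pane text (needs a running tmux server)."""
    result = subprocess.run(
        capture_pane_cmd(target, history_lines),
        capture_output=True,
        text=True,
        check=True,  # raise if the target pane does not exist
    )
    return result.stdout
```

For example, `capture_pane("backend:0.1", 500)` would return the last 500 history lines plus the visible contents of that pane, which could then be handed to an agent as log context.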
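crypto_nerd 🦇🔊 reposted
思维怪怪 @0xLogicrw
Prime Intellect has published a two-week autonomous AI research experiment. The team had Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) autonomously iterate on optimizer designs in the nanoGPT speedrun, trying to reach the target validation loss in the fewest steps. After roughly 10,000 experiments and 14,000 H200 hours, Opus ultimately broke the human record of 2990 steps with 2930 steps.
The experiment revealed the capability boundary of current AI agents. In a test branch that required proposing new algorithms, neither model could get any idea working without relying on existing code or papers from the human community. Their record-breaking results depended entirely on massive recombination of existing open-source techniques and parameter sweeps.
The models showed very different behavioral flaws. Claude frequently violated the system instruction to keep running autonomously, repeatedly stopping on its own to wait for human intervention; in one 47-hour task it sat idle for 22 hours. Codex kept running around the clock but easily fell into dead loops, spending hours on fruitless exhaustive search within the same hyperparameter space.
When gathering outside information, Codex almost never looked at the latest activity on code-hosting platforms and searched only its local history. Claude spent a large share of its token budget reading human developers' pull requests.
The experiment shows that frontier AI is, for now, only a highly efficient engineering verification machine. Once humans stop supplying the initial algorithmic conjectures, it cannot complete zero-to-one innovation on its own.
elie @eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt

Translated from Chinese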
4
12
112
22.5K
crypto_nerd 🦇🔊 reposted
Vala Afshar @ValaAfshar
Do not use your energy to worry. Use your energy to believe, to create, to learn, to think and to grow. —Professor Richard Feynman
English
7
202
991
45.9K
crypto_nerd 🦇🔊 reposted
Dor @dorvonlevi
Building an AI-native @Coinbase means rebuilding everything, especially the hardest parts. We've put a lot of time into redefining compliance, where the stakes are incredibly high, and we have to be extremely thoughtful about implementation. We have invested heavily in rebuilding our compliance ops around AI with that reality as our starting constraint, not an afterthought. Here is an overview of what we've learned and what we built. Most people assume compliance work is mostly checking whether a name appears on a sanctions list. That is the easy 5%. The other 95% is interpretive judgment under uncertainty: a customer claims their wealth came from real estate. Do the property records actually support it? Does the timeline hold? Is the documentation legitimate, or does it feel too polished? You need compliance staff and investigators who understand what “suspicious” actually looks like in context. That's part of why compliance is so hard to automate—and so expensive. The first obvious AI approach is to hand the model the existing procedures and ask it to run them faster. That approach misunderstands what procedures are for. Good procedures are not bad investigations; they are deliberately incomplete investigations. Their job is to create consistency, auditability, and a minimum standard across thousands of cases. They excel at saying what must happen. They are far worse at capturing everything a strong analyst actually notices: which sources they trust, when they widen the search, when a document feels off, when an explanation technically fits but still does not feel earned. Procedures also carry the shape of the old operating model: fragmented systems, time pressure, queue pressure, and the hard limit of how much one human analyst can read, cross-reference, and hold in working memory at once. That is not a flaw in the procedure. It is how you design a process for humans. AI changes the constraint set. 
Reading, searching, comparing documents, and tracing inconsistencies no longer have to be treated as scarce analyst time. Done carefully, with proper controls and human review, models can explore more context, test more hypotheses, and surface more inconsistencies than any single analyst could reasonably do case by case. So if you simply automate the procedure exactly as written, you may gain efficiency. You will not unlock the full value of AI. You will just make the old bottleneck run faster. The better question is not “Can AI follow the analyst playbook?” It is: once the cost of reading, cross-referencing, and testing hypotheses collapses, what should the investigation become? A second tempting approach: feed it historical Suspicious Activity Reports (SARs) and let it learn from outcomes. This breaks down too. You rarely have the full state of what the analyst actually saw during the investigation. A case that looks straightforward today might only look that way because information surfaced later. A fraud indictment that didn't exist when the original analyst made the call, news articles that hadn't been published yet. Hindsight can contaminate your training data. Also, regulators themselves acknowledge that SAR decisions can be subjective. The architecture has four layers. The first is data: continuously enhancing the coverage, quality, and architecture of the signals the system depends on. The second is classical machine learning models that cluster and classify alerts to determine what type of investigation needs to run. The third is the investigation agent itself: a multi-agent system that orchestrates specialized agents to execute the investigation end to end. The fourth is a safety filter that runs independently of typology, ensuring no risk vector is missed regardless of how the alert is classified. Each layer is independently auditable and learns from the feedback provided by human reviewers. 
Inside the investigation agent, specialized sub-agents run across the full case surface: alert context, customer and identity signals, access patterns, risk indicators, transaction behavior, source-of-funds, onchain activity, and public adverse media. Each writes its findings into a shared case memory. A coordinator agent reconciles and challenges them. When sub-agents disagree, such as when source-of-funds marks activity as “explained” while adverse media surfaces a recent indictment, the coordinator attempts to resolve these disagreements knowing the common patterns. The narrative agent prepares the final report with all collected evidence and suggested resolution. The last self-validation agent acts as a guardrail: if the system cannot support its conclusion with sufficient confidence or data quality, the case is routed to manual investigation instead of being surfaced as an automated result. Before any of this touched a real customer case, we built what we call a “Golden Set” - historical cases with known right answers. "Known right answers" in compliance is harder than it sounds. It meant re-investigating old cases, getting multiple senior analysts to independently agree on what the right call would have been, then debating the disagreements until consensus. Months of work before we could even start measuring. Here's an important part (for now) - cases currently get BOTH the AI's full investigation AND a senior human review. We didn't reduce scrutiny, in fact, we added more of it until it no longer proves valuable. Cases resolve significantly faster AND get more eyes than they ever did before. Every human correction feeds back into the model as a training signal. It gets better because it's wrong in front of people who know how to fix it. None of this would have shipped without clearing structural blockers most financial institutions are still stuck on. Security and privacy sign-off to send customer data to LLMs at all. 
Senior compliance officer alignment on AI-assisted human decision making. Model Governance team embedded since December - they observed the entire Golden-Set Evaluation process and are running a formal validation review with our Internal Audit team now. Today this handles roughly 55% of our US fraud case volume with significantly less analyst time per case. Time freed goes to the harder cases AI can't yet handle - and to teaching it. Our internal compliance and quality teams are the ones who are building this system with the engineers, training it, validating it, and continuing to shape how it improves. In the process, they've developed skills that are incredibly valuable: how to design evals, how to think about model bias, how to think about human bias, how to architect human-in-the-loop systems, skills that are becoming among the most valuable at any company. This entire project started ~6 months ago with a whiteboarding session between @galpa42 and me, and was built by an AI-pilled cross-functional pod. It's just the first pod - there's a multi-month roadmap, rebuilding compliance from the ground up with AI. Huge thanks to everyone involved and congratulations to @galpa42 for shipping two babies to production this month :) The future of high-stakes work is not AI replacing judgment. It is AI making judgment scalable, auditable, and continuously improvable.
English
20
25
252
245.7K
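To make the guardrail idea in the Coinbase post above concrete, here is a deliberately simplified sketch. It is not Coinbase's implementation: `Finding`, `route_case`, the verdict labels, and the `0.8` threshold are all assumptions made up for illustration, and the coordinator's reconciliation step is reduced to "any unresolved disagreement goes to a human". It only shows the routing rule the post describes, where a case is surfaced as an automated result only if the conclusion is supported with enough confidence.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str         # e.g. "source_of_funds", "adverse_media" (illustrative)
    verdict: str       # "explained" or "suspicious" (illustrative labels)
    confidence: float  # 0.0 to 1.0, a hypothetical self-reported score


def route_case(findings: list[Finding], min_confidence: float = 0.8) -> str:
    """Return "automated" only when findings agree and are all confident."""
    if not findings:
        return "manual"  # no evidence, so no automated result
    verdicts = {f.verdict for f in findings}
    if len(verdicts) > 1:
        return "manual"  # sub-agents disagree and nothing resolved it
    if min(f.confidence for f in findings) < min_confidence:
        return "manual"  # conclusion is not supported strongly enough
    return "automated"
```

The post's own example maps onto the second branch: source-of-funds says "explained" while adverse media surfaces an indictment, so the sketch routes the case to manual investigation.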
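crypto_nerd 🦇🔊 reposted
硅谷王川 Chuan @Svwang1
I am noticing more and more people doing all their arguing and opinion-sharing with AI (that is, picking a position and having AI write a long passage to support it). The audience has no patience to read that long-winded, auto-generated AI slop, skips it, and then uses AI to generate an equally long-winded reply and rebuttal. In the end everyone is just burning effort talking past each other. This will become more and more common.
Translated from Chinese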
0
19
196
19.2K
crypto_nerd 🦇🔊 reposted
Ethereum Foundation @ethereumfndn
0/ Clear signing is now live. An open standard to end blind signing, making human-readable transactions default. This effort brings a major UX and Security upgrade to transaction signing on Ethereum.
[Image attached]
English
160
444
2.2K
320K
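crypto_nerd 🦇🔊 reposted
Hank_Zhao @hank_zhao
Poems by a food delivery rider: I read just a few and found every one of them striking. I am amazed at how deeply the author observes and thinks about life. He does not glorify hardship; hardship is only the raw material from which he tempers his poetry. He is the master of his fate. He needs no ornate language; plain description alone is enough to pierce the heart.
[4 images attached]
Translated from Chinese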
52
256
1.3K
297.7K
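crypto_nerd 🦇🔊 reposted
『華』「胡翌霖」
The homogenization of thought caused by using AI is dangerous, but dependence is not; it is the overall trend. Imagine, in ancient times, having one group learn to write and take notes while another group stays illiterate and relies purely on memory, then after a while "removing" pen and paper. You would likewise be surprised to find that the memory of those who relied on pen and paper had been irreversibly damaged. Similarly, people used to farming lost the ability to hunt; people used to industrial consumer goods lost the ability to make things by hand; people used to navigation had their sense of direction ruined... That is the price of technology. You insist that once the new technology is removed you become helpless, but the question is: why would the new technology be removed? Survival of the fittest means fitting the environment, and the human environment is shaped mainly by technology. Once the technological environment has turned over, telling mammals "remove those new plants, go back to the Jurassic, and you will not be adapted; you cannot match the dinosaurs" proves what, exactly? Times have changed.
Translated from Chinese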
2
8
64
6.9K
crypto_nerd 🦇🔊 reposted
jolestar @jolestar
How did a word like this get mixed in? gpt 5.5 is already impressive, but seeing this kind of problem always makes you doubt its reliability 😅
[Image attached]
Translated from Chinese
25
3
66
29.7K
crypto_nerd 🦇🔊 reposted
Cloudflare @Cloudflare
Starting today, agents can now be Cloudflare customers. They can create a Cloudflare account, start a paid subscription, register a domain, and get back an API token to deploy code right away. cfl.re/4sY0Uxn
English
162
819
5.3K
1.6M
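crypto_nerd 🦇🔊 reposted
jolestar @jolestar
Good programming habits still matter in the AI Coding era
I have recently been building an Agent benchmark and found that you cannot simply judge how complex a programming task is for an AI from a developer's point of view.
Take a refactoring task: splitting a large file of several thousand lines into a dozen or so small modules by function.
For a developer this task is not really hard. The work is mostly moving code, tidying imports, and verifying compilation; a novice could manage it.
So I thought I would use a simple task for the benchmark, and the result was unexpected.
Claude Code judged the task to be fairly large, split off part of it, and opened a PR with a Future work section, planning to proceed in steps.
My own Agent "brute-forced" it and pushed further toward a complete split, but the cost was obvious: token consumption was dozens of times Claude's, and much of the later time went into repeatedly reading files, fixing compile errors, reading files again, and fixing errors again.
This made me realize that a task people find simple is not necessarily simple for an Agent.
For a person, this kind of refactor is often just "move this block over there". But an Agent has to read the large file in batches, remember which functions relate to which tests, generate a pile of cross-file edits, and finally patch holes bit by bit from compile errors. It looks like mechanical work, but it actually becomes a task with high token and state-management costs.
A while ago I saw someone say that in the AI Coding era, principles like splitting modules matter less, since people do not read the code anyway. Now I do not quite agree. Clear module boundaries, appropriate file granularity, and simple dependencies do not just make code easier for people to read; they also help Agents by lowering task complexity.
From another angle, today's Agent tools for reading and editing files are not well suited to this kind of refactor either.
Coding Agents edit files mainly by text replacement. Claude Code, for example, commonly uses an old_string / new_string pattern: supply a piece of old text, then replace it with new text. Codex commonly uses apply_patch: generate a patch resembling a git diff that expresses replacing old content with new. Both suit small-scope changes, but to delete a large chunk of old code or move a batch of functions to another file, the model usually still has to read the original content into context and then generate a large replacement or diff.
So I later gave the Agent a hint: first use scripts, sed, perl, and similar tools to roughly split the large file, deleting the old content directly and writing it into new files, then fix things up slowly one by one. Its completion rate did improve a lot. Agents do not do this by default, mainly because the system prompt strongly requires them to modify files with built-in tools rather than command-line tools.
Thinking one step further, Coding Agents may need more advanced editing tools: not just a "replace text" interface, but one that first builds the code structure via a parser, LSP, or compiler, so the Agent can refactor like an IDE: move a function, delete an impl block, tidy imports. I wonder whether anyone is trying this.
Overall, even in the AI Coding era, good programming habits still have value. Using harness engineering early on to make good programming habits the Agent's default way of working costs far less than refactoring later.
Translated from Chinese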
13
9
49
13.1K
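The "coarse split by script first, fix up afterwards" hint in jolestar's post above can be sketched roughly as follows. This is an illustrative helper, not the author's script: `split_file`, the `plan` mapping, and the file names in the usage note are hypothetical, and the line ranges are assumed to have been chosen already (by a person or an agent) and not to overlap. The point it shows is that the relocated text moves on disk without the model having to re-emit it as a replacement or diff.

```python
from pathlib import Path


def split_file(source: Path, plan: dict[str, tuple[int, int]]) -> None:
    """Move 1-indexed, inclusive line ranges out of `source` into new files.

    `plan` maps a new file name to the (start, end) range it should receive.
    Ranges are assumed not to overlap.
    """
    lines = source.read_text().splitlines(keepends=True)
    moved: set[int] = set()
    for name, (start, end) in plan.items():
        chunk = lines[start - 1:end]
        (source.parent / name).write_text("".join(chunk))
        moved.update(range(start - 1, end))
    # Rewrite the original without the moved lines; imports and references
    # are left for a later, smaller fix-up pass guided by compile errors.
    remaining = [line for i, line in enumerate(lines) if i not in moved]
    source.write_text("".join(remaining))
```

A call such as `split_file(Path("big_module.py"), {"parser.py": (120, 480), "render.py": (481, 900)})` would leave three files to repair one at a time, which is the lower-context workflow the post reports working better.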