

Monk Zero
1.7K posts

@NoCommas
@antigma_labs, prev: @awsCloud, @Meta, @Mysten_Labs. A Turing Complete mind, wandering the world of Gödel Incompleteness.






We've reached an agreement to acquire Astral. After we close, OpenAI plans for @astral_sh to join our Codex team, with a continued focus on building great tools and advancing the shared mission of making developers more productive. openai.com/index/openai-t…






When you find out About the non-human life, That walks amongst us; Will your mind accept it?

a bit about how I use Claude to help me write, instead of having Claude write for me x.com/trq212/status/…



the ai environment is so frothy it's not surprising companies are cheating benchmarks but if you're gonna gamble your entire reputation why do it for a terminal bench ranking lmao

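Lots of people want to hand-build an AI agent or coding agent, hoping to roll their own do-everything programming robot. As I've said over and over: from the first-generation SWE Agent, to the slick context management of the cursor era, to the first claude code, to all the fancy memory mechanisms that followed, to plan mode, and then to a main agent controlling several subagents and background tasks, agent technology alone has gone through three or four industrial revolutions in just two or three years. Put differently, the design of this one tool, the coding agent, has already gone through iterations on the scale of horse cart, train, car, airplane, rocket. Today I have to tell everyone: hand-building a first-generation SWE Agent is necessary, because it has teaching value. It's like anyone hand-writing an operating system or a compiler 10 years ago; it's part of the hands-on coursework. But if you want to catch up with tools like codex, gemini cli, or claude code, you need to step into those projects' code and see how complex their designs already are. Even roo coder, cline, and aider, which a year ago were Silicon Valley's top star open-source products, are now a generation behind codex and claude code and have fallen completely behind, not to mention the Chinese big companies each flailing around with three coding agents of their own, which are simply not products of the same era as claude code and codex. Even a gap of just half a year is effectively steam train vs. giant rocket, and in the short term the distance is visibly still growing. I have to warn you about one thing: claude code and codex may well become the next chrome-style giant pile of crap: crappy, but objectively on track to become the industry's sole default standard. The end result is that every other coding agent on the market ends up three or four generations behind claude code, so they all turn into turtles hiding in their shells, going back to meekly selling cheap APIs that you configure by hand inside claude code. claude code becomes the king of closed source, codex the king of open source, and the two split the market. And by then nobody else can understand all the engineering details of codex and claude code, the same way that even with all of chromium's code open for you to read, you can't understand it at all. I just want to tell you that after three years of iteration, the complexity of coding agents is nothing like what it used to be. Even companies at the level of Alibaba, ByteDance, and the "six little tigers" of LLMs will probably be left far behind by their Silicon Valley peers, and that's a gap they can't close.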

This is unethical and sad. It's not even bad engineering; they could have used it for something good.

We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard. Thank you @NoCommas for helping us keep leaderboard entries honest! Recent leaderboard submissions are in huggingface.co/datasets/harbo… which makes it easy for the community to work together to detect cheating.


