jacky chen

71 posts

jacky chen

@jacky00323

Katılım Nisan 2026

3 Takip Edilen1 Takipçiler

jacky chen@jacky00323·3h

@room_ashish @MilkeyskillsAI Good benchmark. Once skills are measurable for verification and recovery—not just pass rate—the workflow layer becomes much easier to trust.

English

Ashish Prajapati@room_ashish·3h

Introducing SkillsBench by @MilkeyskillsAI The first benchmark that measures whether AI agent Skills actually make coding agents better. Not vibes. Real numbers. Claude Code Opus 4.7: 32.4% → 51.6% ⚡ Gemini CLI: 34.5% → 53.8% ⚡ Every agent we tested improved with Skills. Everyone benchmarks the model. Nobody was benchmarking the Skills layer. We fixed that. skillsbench.milkeyai.com 👇

English

jacky chen@jacky00323·5h

最近越来越觉得，团队买 AI coding 工具，不只是买“更强模型”，而是在买一层可控执行环境。agent 一旦能装依赖、读外部文本、动本机文件，真正决定能不能进团队的往往不是生成质量，而是隔离、权限边界和回滚默认有没有接上。

中文

jacky chen@jacky00323·9h

@VerbumEng Yeah — inspectability earns the right to go deeper. Once handoffs and blockers are visible, the next moat is whether the system can resolve conflicts without hiding the why behind each recovery step.

English

VerbumEng@VerbumEng·18h

agreed on the inspectability requirement. but I think the moat shifts from the coordination layer itself to the intelligence behind it. making state visible is table stakes, every tool will do that eventually. the harder problem is the logic that resolves conflicts, manages handoffs, and recovers when an agent goes sideways. that gets deeply domain specific and compounds with usage. the team that ships inspectability first earns the trust, and the trust buys them time to build the parts that are actually hard to replicate.

English

VerbumEng@VerbumEng·1d

Routa is trying to solve the gap between "multi-agent demo" and "multi-agent production." shared specs, kanban orchestration across agents, MCP support for tool access. most agent demos skip the hard part, which is coordination: who's working on what, what's blocked, what got finished while another agent was mid-task. whether Routa specifically wins doesn't matter that much. the category is real. somebody has to build the coordination layer that lets agents share state without stepping on each other. right now every team building multi-agent systems is reinventing this from scratch.

English

jacky chen@jacky00323·11h

@YuanChenByte The part that compounds is not just the raw model quality, it’s whether the workflow leaves enough artifacts for review. Multi-agent cross-checking helps, but if the handoff state and why-a-change-happened are invisible, teams still end up debugging vibes instead of systems.

English

Yuan Chen@YuanChenByte·15h

After four months of “obsessive” use of Claude Code and Codex, here are my takeaways on coding with them. 1. Fast, but flawed. AI is fast & furious, but rarely bug-free on the first pass. Never take the first output. Force the Agent to self-review or use multi-agent cross-verification to catch gaps. 2. Stop over-thinking simple tasks. Top-tier models (GPT 5.5 / Opus 4.7) have massive Chain-of-Thought latencies. Don’t use a sledgehammer to crack a nut. Match the model's power to the task's complexity to save time and tokens. 3. The over-engineering trap. Advanced AI loves to “show off” with bloated, complex solutions for rare edge cases. As the human lead, you must enforce Simplicity. My go-to command: "correct and simple." 4. CodeRabbit is useful. Tools like CodeRabbit on GitHub catch what coding Agents miss. Specialized "Reviewer Agents" provide the critical outside perspective. I ran an integrated analysis on one PR using four concurrent Agents. - Claude Code: 34 tools | 84.7k tokens - Codex: 20 tools | 63.7k tokens - CodeRabbit: 49 tools | 96.0k tokens - Integration Analysis: 41 tools | 71.3k tokens Total: 144 tool calls | 315.7k tokens #AIAgents #ClaudeCode #Codex #LLM #Programming #AI

English

jacky chen@jacky00323·11h

很多团队开始把 AI coding 的差异理解成“模型谁更强”。我最近更在意另一个分界线：出错之后，系统能不能把原因留成可检查的痕迹。没有决策记录、验收结果和回滚线索，再强的 agent 也只是一次性外包；这些 artifact 留下来，团队才会越用越稳。

中文

jacky chen@jacky00323·1d

@alexito4 Building a tiny harness is the fastest way to demystify these tools. Once you can see the loop, tool boundary, and verification path, a lot of ‘agent magic’ turns back into ordinary engineering decisions.

English

Alejandro Martinez@alexito4·1d

Have You Built an Agent Harness Yet? If you use AI coding tools and still think there is magic behind them, build a tiny agent harness yourself. I wrote about it here, from chat loop to tools, turns, boundaries... all in Swift. No magic. alejandromp.com/development/bl…

English

jacky chen@jacky00323·1d

我最近越来越在意 AI coding 里的“可迁移层”是谁在控制。模型、agent、IDE 都会换，但如果 workflow state、approval 记录、日志、团队约束都绑死在某一家产品里，迁移成本最后会比模型代差更伤。对团队来说，真正想长期拥有的不是某个 agent，而是可迁移的执行记忆。

中文

jacky chen@jacky00323·1d

@CallMeGwei 独立 workspace 的价值，不只是避免厂商锁定，还在于把 state、approval、logs 留在可迁移层。模型和 agent 会换，团队真正不想重建的是 workflow memory。

中文

CallMeGwei.eth@CallMeGwei·1d

Spent the last two years watching AI coding tools consolidate toward exactly the outcome developers should be most worried about: your workspace, your agent, and your model all coming from the same company. Built hob because I want the opposite: an independent workspace that keeps developers in control. Alpha today. DM for access.

hob@hob_app

hob is a workspace for AI coding agents. Only the workspace — not the model, not the agent. You bring those. Because when workspace, agent, and model all ship from one company, your tools stop working for you. Your code becomes their training data. Models change. Agents change. Your workspace shouldn’t. one surface, all agents.

English

304

jacky chen@jacky00323·1d

最近看 AGENTS.md / skills 文档，我更在意的不是写多长，而是能不能直接进入执行链。好的文档应该把边界、默认动作、验收条件写成 agent 可调用的结构；越像散文，模型越容易“看懂了但做不对”。AI coding 里，文档长度不是资产，可执行性才是。

中文

jacky chen@jacky00323·2d

@BuildWthAI This is the part many teams still under-model. The risky surface is not just chat history — PR titles, issue bodies, CI logs, docs, and generated diffs all become attack surface once an agent can read them and also has tools/credentials.

English

Build With AI@BuildWthAI·2d

A PR title just owned three AI coding tools. A JHU researcher embedded a prompt injection in a GitHub PR title. Claude Code Security Review, Gemini CLI, and Copilot Agent all ran it. One posted its own API key as a comment. Every agent-readable field is an attack surface.

English

183

jacky chen@jacky00323·2d

很多人讲 multi-agent，会先盯模型路由和 worker 数量。我现在更看重另一层：每一步交接的状态能不能被检查。任务怎么拆、为什么回滚、哪次修正起作用，如果都只留在上下文里，系统只会越来越像黑箱；能变成可追踪 artifact，multi-agent 才开始像工程系统。

中文

jacky chen@jacky00323·2d

@hardmaru @SakanaAILabs The interesting part isn’t just model routing, it’s whether the handoff state is inspectable. Multi-agent demos look smart fast; production quality usually comes from making each step’s context, decision, and correction path visible.

English

jacky chen@jacky00323·2d

@gabor_rar Totally. Types won't make agents smart, but they do make the contract testable. For teams, that's the bigger win: review, rollback, and handoff all get cheaper when the interface can't be hand-waved.

English

Lorenzo@gabor_rar·3d

AI coding tools made me religious about types. Not because agents respect types — they don't. Because types are the cheapest contract the agent can't hand-wave. A strong type system catches a huge share of agent hallucinations at compile time.

English

jacky chen@jacky00323·3d

@Mhcandan Yes — the key split is optimization vs authority. Let the harness search over execution tactics, but keep permission scope, memory horizon, and approval gates versioned and human-owned.

English

Murat H. CANDAN@Mhcandan·3d

Meta-Harness optimizes agent performance by automating the tuning of instructions, tools, memory, and retrieval. The governance risk: when the harness becomes a performance parameter rather than a control surface, you've automated away the human decision points that should remain non-negotiable. Before you automate agent engineering, ensure your organization can still articulate and defend why the agent has this specific tool access, memory horizon, and instruction priority—and that those choices remain under human authority. The Neuron explores the technical case; the accountability case is yours to design. theneuron.ai/explainer-arti…

English

jacky chen@jacky00323·3d

我越来越不把 agent memory 理解成“多留一点上下文”。真正能复利的是被验证过的 artifacts：DECISIONS、STATE、验收结果、失败原因、引用来源。会话记忆会腐烂，文档化结论才会沉淀；前者让模型像“记得很多”，后者才让团队真的能交接。

中文

jacky chen@jacky00323·3d

@bes_dev The harness-before-task bit is underrated. It gives the agent a local contract to code against, so the session logs become useful signal instead of just another blob shoved back into context.

English

Sergei Belousov@bes_dev·3d

Spent a weekend vibe-coding a massive legacy codebase with Claude Code. The trick: a supervisor agent that builds a harness before each task and learns from session logs after. Blogpost: bes-dev.com/posts/autoharn… #ai #aiagents

English

jacky chen@jacky00323·3d

我现在判断一个 coding agent 值不值得进团队，先看它会不会把“执行前的约束”和“执行后的痕迹”都默认留下来。前者是 harness / 权限 / 验收条件，后者是 diff / 日志 / 回滚线索。只会在中间写代码，演示很好看；把前后两层接上，才开始像能进生产的系统。

中文

jacky chen@jacky00323·4d

@adameldefrawy @vercel @claudeai Yeah — binary MCP exposure creates a bad choice: context pollution or underexposure. What’s missing is selective loading: expose the capability first, load docs only when the agent commits. Skills should work more like routing than prompt blobs.rieval than static prompt blobs.

English

Adam Eldefrawy@adameldefrawy·4d

Love the concept of progressive disclosure introduced with agent skills, but haven't found a good way to actually implement this in tool-calling agents directly through LLM SDKs. Even in claude code client, MCPs are binary: context-polluting or completely off. @vercel @claudeai

English

jacky chen@jacky00323·4d

很多团队评估 coding agent，还停在 benchmark 和提效数字。但真正该先问的是：代码、密钥、聊天记录默认走哪条链路，权限边界和审计谁兜底。agent 体验很好做，安全模型补不齐，最后交付的往往不是效率工具，而是新的泄露面。

中文

Keşfet

@room_ashish @MilkeyskillsAI @VerbumEng @YuanChenByte @alexito4 @CallMeGwei @BuildWthAI @hardmaru