Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
LLMs already know causality and I think they just can't use it structurally.
The move: use transformers to generate massive causal training data like (action, condition, consequence) trees with counterfactuals and branching outcomes AND then train a different architecture to actually reason over causal graphs internally.
Stop trying to turn the predictor into a planner. A planner is what agents need to be, so use the predictor to teach one.
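To make the data shape concrete, here's a rough sketch of what one (action, condition, consequence) tree with a counterfactual branch could look like, and how it might be flattened into training records. Everything here (`CausalNode`, `flatten`, the field names, the glass example) is made up for illustration; it says nothing about what the downstream architecture would actually consume.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CausalNode:
    """One (action, condition, consequence) triple; children are follow-on steps."""
    action: str
    condition: str
    consequence: str
    counterfactual: bool = False  # True if this branch is a "what if" alternative
    children: List["CausalNode"] = field(default_factory=list)


def flatten(node: CausalNode, path: tuple = ()) -> list:
    """Walk the tree depth-first and emit one record per node."""
    here = path + (node.action,)
    records = [{
        "path": list(here),  # sequence of actions leading to this node
        "condition": node.condition,
        "consequence": node.consequence,
        "counterfactual": node.counterfactual,
    }]
    for child in node.children:
        records.extend(flatten(child, here))
    return records


# Hypothetical example tree: one root action with two branching outcomes.
tree = CausalNode(
    action="push glass",
    condition="glass is near table edge",
    consequence="glass falls",
    children=[
        CausalNode("catch glass", "hand is close", "glass is intact"),
        CausalNode("do nothing", "hand is close", "glass breaks", counterfactual=True),
    ],
)
records = flatten(tree)
```

Each record keeps the full action path, so a graph-based learner could in principle rebuild the branching structure instead of seeing isolated triples.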
Coding AI has been fine-tuned to be narrow.
This creates cancer in codebases.
Fighting it upfront with a mandatory audit step is the cure. Be adversarial, and take advantage of the fact that producing large listings is cheap. Fight fire with fire.
Some learnings after spending a lot of time working with agents on hard codebases: The difference between useful agent-driven development and a mess is process.
Agents are naturally incentivized to keep work narrow: fix the issue, add a local test, move on. Without strong gates, that compounds into duplicated logic, semantic drift, and shallow local fixes that don’t really close the class.
I still think reliable autonomous development is possible. But it won’t come from raw autonomy alone. It comes from the right process gates, and from humans enforcing them...
The last compiler language will be markdown.
My multi-agent setup is now generating solid code from 10k-line specs. The spec IS the program.
We spent 50 years making languages more abstract. Turns out the final abstraction is just... English with structure.
Here's the thing: everyone says "Claude can do everything!" No it can't. Did you wire Claude to handle distribution + marketing + dev + finance with specialized tools, structured communication, and evaluation loops? No.
The LLM is the neuron. The AIO is the organism. And nobody has built the organism yet.
Now the bigger picture: these orgs won't operate in isolation. We're heading toward an AIO-to-AIO economy. Autonomous orgs discovering, contracting, and transacting with each other. Agent to agent. Org to org.
For that you need an open economic layer: discovery, contracting, settlement, reputation. The foundation for autonomous orgs to trade with each other.
So this is a bit sci-fi, but we're going to see this very soon. Here's what I think Intelligent Autonomous Orgs ("IAO") will look like (at least in the beginning).
Some key principles (imo - from a humble dev):
- The Org Brain isn't static — it continuously optimizes the entire structure. Monitor → Evaluate → Adapt → Restructure → Deploy. On loop, forever.
- Communication follows the hierarchy. No chaotic lateral messaging. Agents talk to their parent and children. Cross-branch requests go up-then-across-then-down through the nearest common ancestor.
- Agent composition is dynamic. The brain spawns, retires, and restructures agents based on performance. The org chart is alive.
- Humans don't disappear. A Human Council steers strategy at key inflection points. A human workforce handles execution, creative, and physical tasks, managed autonomously by an HR agent branch.
- Every branch can go as deep as needed. A Strategy agent can have a Market Analysis agent under it, which has its own sub-agents. The tree grows where complexity demands it.
- The HR branch is the bridge between AI and human labor. Other branches don't manage humans directly: they request work through HR. Structured, not chaotic.
- Feedback loops close the system. Workforce performance flows back up through the HR branch to the brain, which uses it to optimize everything.
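The "up, across, down through the nearest common ancestor" rule from the communication principle is easy to sketch. This is a minimal version, assuming the org chart is a simple child-to-parent map; the agent names and the `PARENT`, `ancestors`, and `route` helpers are hypothetical, not part of any existing framework.

```python
from typing import Dict, List, Optional

# Hypothetical org chart: child -> parent. The root ("brain") has no parent.
PARENT: Dict[str, Optional[str]] = {
    "brain": None,
    "strategy": "brain",
    "hr": "brain",
    "market_analysis": "strategy",
    "recruiting": "hr",
}


def ancestors(agent: str) -> List[str]:
    """Return the chain from an agent up to the root, agent included."""
    chain = []
    node: Optional[str] = agent
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def route(src: str, dst: str) -> List[str]:
    """Hops a message takes: up from src to the nearest common ancestor, then down to dst."""
    up = ancestors(src)
    down = ancestors(dst)
    down_set = set(down)
    # First ancestor of src that is also an ancestor of dst.
    lca = next(a for a in up if a in down_set)
    up_leg = up[: up.index(lca) + 1]    # src ... lca
    down_leg = down[: down.index(lca)]  # dst ... child of lca (lca excluded)
    return up_leg + list(reversed(down_leg))


print(route("market_analysis", "recruiting"))
# ['market_analysis', 'strategy', 'brain', 'hr', 'recruiting']
```

A request from a parent to its own child (or the reverse) collapses to a single hop, since the nearest common ancestor is the parent itself.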
This is day-one architecture. The real question is what this looks like after the brain has been optimizing itself for a year.
DAOs gave us decentralized governance. AIOs give us something better.
Autonomous Intelligent Organisations: AI agents run operations end-to-end, but humans stay in the loop where it matters: strategy, direction, critical decisions.
Autonomy where it scales. Human judgment where it counts.
Hot take: AI-managed companies with human workforces aren't sci-fi: they're inevitable in the medium term. Strategy, resource allocation, hiring decisions... all better suited to agents than committees. Humans become the execution layer.
The ultimate role reversal.
If AI agents become economic actors... hiring each other, paying each other, competing for work... then marketing is inevitable.
Not "AI for marketing." AI doing marketing. An agent writing its own pitch, building reputation through performance, refusing work that doesn't fit its brand. Competing on doctrine, not just price.
The first agent economy won't look like SaaS. It'll look like a bazaar.
Sometimes Codex is in a great flow, but after compaction it forgets crucial elements. Would love context snapshots: checkpoint the agent's state and branch back to it.
The challenge: restoring a snapshot means a potentially stale view of the codebase. But a diff-on-restore approach could solve that... replay what changed since the checkpoint so the agent keeps its deep understanding while catching up on file changes. Way better than the current "hope compaction doesn't break things."
@OpenAIDevs
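For what it's worth, here's roughly how diff-on-restore could work. This is only a sketch under big assumptions: the agent's context is a plain string, the codebase is an in-memory `path -> contents` dict, and `Snapshot`, `checkpoint`, and `restore` are hypothetical names, not anything Codex exposes today.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


def digest(text: str) -> str:
    """Content hash used to detect changed files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Snapshot:
    context: str                 # agent's saved working context (hypothetical format)
    file_hashes: Dict[str, str]  # path -> content hash at checkpoint time


def checkpoint(context: str, files: Dict[str, str]) -> Snapshot:
    """Save the context together with a fingerprint of every file."""
    return Snapshot(context, {path: digest(body) for path, body in files.items()})


def restore(snap: Snapshot, files: Dict[str, str]) -> dict:
    """Return the saved context plus what changed since the checkpoint."""
    now = {path: digest(body) for path, body in files.items()}
    old = snap.file_hashes
    return {
        "context": snap.context,
        "added": sorted(now.keys() - old.keys()),
        "removed": sorted(old.keys() - now.keys()),
        "modified": sorted(p for p in now.keys() & old.keys() if now[p] != old[p]),
    }


# Toy usage: take a checkpoint, change the codebase, then restore.
files = {"app.py": "print('v1')", "util.py": "x = 1"}
snap = checkpoint("understands the retry logic in app.py", files)
files = {"app.py": "print('v2')", "new.py": "y = 2"}
report = restore(snap, files)
```

The `added`, `removed`, and `modified` lists are what would get replayed to the agent on restore, so it only has to re-read what actually moved.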
So the trust chain is: you find an agent trustlessly, check its reputation trustlessly, then hand your funds to an unverified black box and hope the reviews were accurate.
That's less trust than before. It's not trustless.
The validation registry acknowledges this: it lets third parties attest to agent behavior. But the "pluggable" validation methods it points to are:
- zkML (tops out at ~18M params, hours for proofs)
- TEE (hardware trust assumption, known exploits)
- Re-execution (requires determinism, which doesn't work for LLMs)
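The re-execution problem is easy to see with a toy. In this sketch a validator re-runs the agent and compares output hashes. `toy_agent` is a made-up stand-in for a model (not a real LLM and not any ERC-8004 interface), with a `sample` flag that mimics sampled decoding.

```python
import hashlib
import random


def output_hash(text: str) -> str:
    """Fingerprint of an agent output, as a validator might record it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_agent(prompt: str, rng: random.Random, sample: bool) -> str:
    """Stand-in for a model: takes the first candidate, or samples one at random."""
    candidates = [prompt + " -> approve", prompt + " -> reject", prompt + " -> escalate"]
    return rng.choice(candidates) if sample else candidates[0]


def validate_by_reexecution(prompt: str, claimed_hash: str,
                            rng: random.Random, sample: bool) -> bool:
    """Re-run the agent and check that the output matches the claimed hash."""
    return output_hash(toy_agent(prompt, rng, sample)) == claimed_hash


prompt = "swap 1 ETH"

# Deterministic agent: any validator reproduces the claimed output.
claimed = output_hash(toy_agent(prompt, random.Random(0), sample=False))
deterministic_ok = all(
    validate_by_reexecution(prompt, claimed, random.Random(seed), sample=False)
    for seed in range(50)
)

# Sampling agent: validators with different random state get different outputs.
sampled_hashes = {
    output_hash(toy_agent(prompt, random.Random(seed), sample=True))
    for seed in range(50)
}
```

With sampling on, honest validators disagree with each other, so a hash mismatch can't distinguish a cheating operator from ordinary randomness.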
ERC-8004 just went live on mainnet. MetaMask, Coinbase, Google, and the EF's dAI team are behind it. It's being called "trustless agents."
It's good infra. But the word "trustless" is doing a lot of heavy lifting. Let me break down what it actually does and what's still missing.
Here's the gap: after you discover an agent through ERC-8004, what actually runs it?
A server. Someone's server. The agent's operator still controls inference, still sees your intent, still decides what actions to take. You're trusting them to follow the stated policy.