Recursive just came out of stealth, and the team has been cooking 🔥
Our first results: an automated AI research system that can improve AI in 3 very different settings, spanning training and GPU kernel optimization.
recursive.com/articles/first…
Does an LLM really need to be a helpful assistant all the time?
No. If you want to simulate people, “perfectly helpful” could be the wrong objective.
Meet OdysSim, a journey toward LLMs beyond assistants, as behavioral foundation models (10B tokens of real human behavior; 23 sim benchmarks, finally in one place; new open models that outperform or match GPT-5.5, Gemini 3.1, or Claude Opus 4.7 in many behavior-sim dimensions).
Human behavior simulation is becoming essential.
Agent evaluation needs realistic users before real users show up. Medical and classroom training need realistic patients and students. Social science needs synthetic participants at scale.
But real people are not ideal assistants.
Real patients panic or ignore good advice. Real students misunderstand. Real customers are vague, picky, impatient, or simply leave. Human behavior is messy, diverse, and often imperfect.
Frontier LLMs are getting better at math, code, and long-horizon tasks. They are NOT getting better at simulating human behavior. If anything, they drift the other way: more assistant-ish, more homogeneous, fewer of the errors and quirks real humans show.
This is no accident. The whole pipeline is built for helpfulness and task success, not behavioral realism.
And you can't prompt your way out of that.
So we rethought the recipe from scratch and are releasing:
🧠 The OdysSim corpus: 21.4M real human interactions (~10B tokens) from 62 sources, every conversation retrofitted with social grounding (who is talking, and why)
📏 SOUL-Index: 23 human-behavior benchmarks unified into one suite across 5 axes
🤖 OSim-8B: open weights; tops more SOUL-Index benchmarks than any frontier model, acts more like a real user than any of them on τ-bench (nearly matching real humans in the reaction dimension), and writes far more human-like text along the way.
FrontierMath: Tiers 1–4 (v2) is live.
We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.
For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed the specialized @EvidenceOpen and @UpToDate, as assessed by 12 US clinicians (randomized and blinded to which model) and by extensive testing/benchmarks. This was not anticipated. @NatureMedicine nature.com/articles/s4159…
@DarioAmodei The shift in this essay from "theoretical risk" to "demonstrated risk" is the part policymakers will need evidence for, case by case. Documenting how agents fail in practice is the job we've taken on at atella.ai
Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: darioamodei.com/post/policy-on…