Annas Bin Adil

8 posts

@annasbinadil

Joined June 2026

66 Following · 4 Followers

Annas Bin Adil @annasbinadil · 2d

@cong_ml Amazing stuff. Curious to look further into your approach to idea generation and selection.

Cong Lu @cong_ml · 4d

Recursive just came out of stealth, and the team has been cooking 🔥 Our first results: an automated AI research system that can improve AI in 3 very different settings spanning training and GPU kernel optimization. recursive.com/articles/first…

Annas Bin Adil @annasbinadil · 2d

@nlpxuhui This is amazing work and much needed, especially in safety research. Would love to collaborate with you; we're at atella (atella.ai).

Xuhui Zhou @nlpxuhui · 3d

Does an LLM really need to be a helpful assistant all the time? No. If you want to simulate people, “perfectly helpful” could be the wrong objective.

Meet OdysSim, a journey toward LLMs beyond assistants, as behavioral foundation models (10B tokens of real human behavior; 23 sim benchmarks, finally in one place; new open models that outperform or are on par with GPT-5.5, Gemini 3.1, or Claude Opus 4.7 in many behavior-sim dimensions).

Human behavior simulation is becoming essential. Agent evaluation needs realistic users before real users show up. Medical and classroom training need realistic patients and students. Social science needs synthetic participants at scale.

But real people are not ideal assistants. Real patients panic or ignore good advice. Real students misunderstand. Real customers are vague, picky, impatient, or simply leave. Human behavior is messy, diverse, and often imperfect.

Frontier LLMs are getting better at math, code, and long-horizon tasks. They are NOT getting better at simulating human behavior. If anything, they drift the other way: more assistant-ish, more homogeneous, fewer of the errors and quirks real humans show. This is no accident. The whole pipeline is built for helpfulness and task success, not behavioral realism. And you can't prompt your way out of that.

So we rethink the recipe from scratch and release:

🧠 The OdysSim corpus: 21.4M real human interactions (~10B tokens) from 62 sources, every conversation retrofitted with social grounding (who is talking, and why)

📏 SOUL-Index: 23 human-behavior benchmarks unified into one suite across 5 axes

🤖 OSim-8B: open weights; tops more SOUL-Index benchmarks than any frontier model, acts more like a real user than any of them on τ-bench (nearly matching real humans in the reaction dimension), and writes far more human-like text along the way.

Annas Bin Adil @annasbinadil · 2d

@EpochAIResearch Wow, that's concerning. How do we rebuild trust in this benchmark?

Epoch AI @EpochAIResearch · 2d

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

Annas Bin Adil @annasbinadil · 2d

@EricTopol @EvidenceOpen @UpToDate Not expected, yet not surprising. This is part of the bitter lesson...

Eric Topol @EricTopol · 3d

For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate, as assessed by 12 US clinicians, randomized and blinded to the model, and by extensive testing/benchmarks. This was not anticipated. @NatureMedicine nature.com/articles/s4159…

Annas Bin Adil @annasbinadil · 4d

@DarioAmodei The shift in this essay from "theoretical risk" to "demonstrated risk" is the part policymakers will need evidence for, case by case. Documenting how agents fail in practice is the job we've taken on at atella.ai

Dario Amodei @DarioAmodei · 4d

Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: darioamodei.com/post/policy-on…

Annas Bin Adil @annasbinadil · 4d

why are so many things "load bearing"? iykyk
