Leon Qi

17 posts

@dmon2048

Founding member of @actAVAai

Joined August 2023

690 Following · 14 Followers

Leon Qi retweeted

actAVA AI@actAVAai·1d

CHI-Bench is the world's 1st long-horizon healthcare benchmark for AI agents. If you're building or buying AI for healthcare, this is the test that actually matters — real clinical workflows, not toy demos. U.S. healthcare needs this. 🏥🔬

ModelScope@ModelScope2022

The best AI agent (Claude Code + Claude Opus 4.6) passes only 28% of real healthcare workflow tasks. CHI-Bench by @actAVAai @iscreamnearby @HaolinChen11, built with Johns Hopkins, Yale, Stanford, CMU, Oxford and 20+ institutions, was designed to find out exactly how far we are.

🏥 Try it yourself 👉 modelscope.ai/datasets/actav…

Three long-horizon domains tested:
🏥 Prior Authorization: provider intake and PA preparation for new referrals
📋 Utilization Management: full payer review cycle from intake to peer-to-peer
👥 Care Management: chronic disease follow-up, outreach, assessment, care planning

75 tasks + 3 marathon tasks + 23 end-to-end dual-agent scenarios. 20 medical apps via MCP, 1,279-document handbook.

💻 Git: github.com/actava-ai/chi-…
🔗 Leaderboard: actava.ai/benchmarks

Leon Qi retweeted

actAVA AI@actAVAai·1d

actAVA AI integrates CHI-Bench with @huggingface and @harborframework today. Users can run the CHI-Bench evaluation and RL training from both platforms.

Weiran Yao@iscreamnearby

Introducing CHI-Bench on @huggingface: the world’s first long-horizon healthcare benchmark for AI agents.

75 real healthcare workflows + 20 apps + 200+ MCP tools + 1,290 skills + process / outcome rewards

huggingface.co/datasets/actav…

Any questions, lmk!

Leon Qi@dmon2048·3d

Check out our dataset on Hugging Face.

Frank Wang@FWang9959

Great news. Today we ranked #6 most popular dataset on Hugging Face! Wow! 😊 huggingface.co/datasets

Leon Qi@dmon2048·5d

Awesome!

Frank Wang@FWang9959

🚨 Historic moment for @actAVAai! 📷 Just one day after launch, our benchmark dataset is already #10 most popular on Hugging Face, out of 1 million+ datasets!

Huge thanks to @iscreamnearby, @HaolinChen11, Deon Metelski, Leon Qi, Tao Xia, Joon Lee, Steve Brown, Kevin Riley, T. Y. Alvin Liu, M.D., Zhiwei Liu, Qingsong Wen, @CaimingXiong, Sanmi Koyejo, Eric Xing & all our collaborators. 📷📷

Leon Qi retweeted

The Agent Times@TheAgentTimes·20 May

A new 33-author benchmark called CHI-Bench finds that the best AI agent configuration resolves only 28% of realistic healthcare administration tasks, dropping to 3.8% in continuous-session testing.

Leon Qi@dmon2048·20 May

Great work! @HaolinChen11 @iscreamnearby

Haolin Chen@HaolinChen11

(1/n) After a few months of work with multiple hospitals, universities and research facilities, today we're open-sourcing CHI-Bench: the first long-horizon benchmark for healthcare AI agents on real clinical and healthcare workflows.

Best frontier agent overall: 28% pass@1. End-to-end prior authorization: 0%.

A thread on what we found 🧵

Leon Qi@dmon2048·20 May

@HopkinsMedicine @AlvinLiu_MD @WellstarHealth @YaleMed @StanfordAILab @zeyu1tang @sanmikoyejo @mbzuai @XiangchenSong @LingjingKong @kunkzhang @ericxing @ffeng01 @huang_biwei @SFResearch @JimZhiwei @zixianma02 @hjian42

8/ …and more 🔬
Brown: @FangliGeng
Boston College: @YuanYuan_MIT
Stony Brook: @Charlesyooo1
Oxford: @qingsongedu
ASU: @realhuawei, Yanjie Fu
USC: Yue Zhao
Emory: @yangji9181
@Recursive_SI: @CaimingXiong
UIC: Philip S. Yu

Leon Qi@dmon2048·20 May

@HopkinsMedicine @AlvinLiu_MD @WellstarHealth @YaleMed

7/ …and university & industry AI research labs 🔬
@StanfordAILab: @zeyu1tang, @sanmikoyejo
CMU & @mbzuai: @XiangchenSong, @LingjingKong, @kunkzhang, @ericxing
UCSD: @ffeng01, @huang_biwei
@SFResearch: @JimZhiwei
UW: @zixianma02
Northeastern: @hjian42

Leon Qi@dmon2048·20 May

1/ Introducing CHI-Bench 🧵 Can AI agents automate U.S. healthcare workflows end to end — given only clinician & insurer apps, operations, and a medical policy library? 75 long-horizon workflows × 30 frontier agents. Best agent solves just 28%. #AIinHealthcare 👇

Leon Qi@dmon2048·20 May

Proud to have helped build CHI-Bench 🧵 Can frontier agents run U.S. healthcare workflows end to end? 75 long-horizon tasks, 30 agents — best solves just 28%. We're early, and now we can measure it. Fully open 👇

Weiran Yao@iscreamnearby

1/🧵 Can AI agents automate U.S. healthcare workflows end to end, given just clinician & insurer apps, operations, and a medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28% #AIinHealthcare 👇

Leon Qi@dmon2048·20 May

@CaimingXiong Thanks for your collaboration.

Caiming Xiong@CaimingXiong·20 May

In real healthcare operations, agents must do far more than answer medical questions. They need to read charts, interpret clinical and operational policies, verify coverage, route referrals, draft P2P scripts, and finalize care plans, where a single policy violation can mean a denied claim or missed patient outcome.

@actAVAai @iscreamnearby led and developed CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon, policy-rich benchmark for AI agents operating across end-to-end U.S. healthcare workflows.

Key highlights:
▶️ High-fidelity simulators for Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, all exposed as MCP servers over patient, clinician, and insurer records.
🧪 Each trial runs 60–80 agent steps across 4–6 clinical stages, with access to 21 healthcare apps, 200+ MCP tools, and a 1,279-document operations handbook.

Leaderboard results across 30 frontier agents:
• Claude Code + Opus 4.6: 28% pass@1
• Codex + GPT-5.5: 21%
• Utilization review: 41%
• Care management: 32%
• Prior authorization: 29%

Reliability remains a major challenge: no agent exceeds 20% when the same case is repeated three times.

Leon Qi@dmon2048·20 May

@iscreamnearby Remarkable results! It's a game changer for integrating AI into healthcare.

Weiran Yao@iscreamnearby·20 May
