Frank Wang

19 posts

@FWang9959

Founder & CTO @ https://t.co/HzxZItCdqY, ex-Salesforce AI Research, Vlocity, Successfactors, SAP, BusinessObjects

Pleasanton · Joined September 2025
33 Following · 7 Followers
Frank Wang reposted
ModelScope (@ModelScope2022)
The best AI agent (Claude Code + Claude Opus 4.6) passes only 28% of real healthcare workflow tasks. CHI-Bench by @actAVAai @iscreamnearby @HaolinChen11, built with Johns Hopkins, Yale, Stanford, CMU, Oxford and 20+ institutions, was designed to find out exactly how far we are.
🏥 Try it yourself 👉 modelscope.ai/datasets/actav…
Three long-horizon domains tested:
🏥 Prior Authorization: provider intake and PA preparation for new referrals
📋 Utilization Management: full payer review cycle from intake to peer-to-peer
👥 Care Management: chronic disease follow-up, outreach, assessment, care planning
75 tasks + 3 marathon tasks + 23 end-to-end dual-agent scenarios. 20 medical apps via MCP, 1,279-document handbook.
💻 Git: github.com/actava-ai/chi-…
🔗 Leaderboard: actava.ai/benchmarks
7 replies · 6 reposts · 35 likes · 4.2K views
Frank Wang reposted
Weiran Yao (@iscreamnearby)
Introducing CHI-Bench on @huggingface: the world’s first long-horizon healthcare benchmark for AI agents.
75 real healthcare workflows + 20 apps + 200+ MCP tools + 1,290 skills + process / outcome rewards
huggingface.co/datasets/actav…
Any questions, let me know!
7 replies · 26 reposts · 131 likes · 24K views
Weiran Yao (@iscreamnearby)
1/🧵 Can AI agents automate U.S. healthcare workflows end to end, given just clinician & insurer apps, operations, and a medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28%. #AIinHealthcare 👇
12 replies · 23 reposts · 42 likes · 62.5K views
Frank Wang (@FWang9959)
🚨 Historic moment for @actAVAai! Just one day after launch, our benchmark dataset is already the #10 most popular on Hugging Face, out of 1 million+ datasets! Huge thanks to @iscreamnearby, @HaolinChen11, Deon Metelski, Leon Qi, Tao Xia, Joon Lee, Steve Brown, Kevin Riley, T. Y. Alvin Liu, M.D., Zhiwei Liu, Qingsong Wen, @CaimingXiong, Sanmi Koyejo, Eric Xing & all our collaborators.
2 replies · 2 reposts · 5 likes · 192 views
Frank Wang reposted
The Agent Times (@TheAgentTimes)
A new 33-author benchmark called CHI-Bench finds that the best AI agent configuration resolves only 28% of realistic healthcare administration tasks, dropping to 3.8% in continuous-session testing.
1 reply · 3 reposts · 4 likes · 151 views
Frank Wang (@FWang9959)
@dmon2048 Hell yeah @dmon2048 🔥 Great work on CHI-Bench, brother. The bar you just set will pull the whole field forward. So damn proud to build this with you. Leaderboard's live. Come take a swing at 28%. #AIinHealthcare #CHIBench
0 replies · 0 reposts · 1 like · 21 views
Leon Qi (@dmon2048)
1/ Introducing CHI-Bench 🧵 Can AI agents automate U.S. healthcare workflows end to end — given only clinician & insurer apps, operations, and a medical policy library? 75 long-horizon workflows × 30 frontier agents. Best agent solves just 28%. #AIinHealthcare 👇
5 replies · 3 reposts · 8 likes · 243 views
Caiming Xiong (@CaimingXiong)
In real healthcare operations, agents must do far more than answer medical questions. They need to read charts, interpret clinical and operational policies, verify coverage, route referrals, draft P2P scripts, and finalize care plans, where a single policy violation can mean a denied claim or missed patient outcome.
@actAVAai @iscreamnearby led and developed CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon, policy-rich benchmark for AI agents operating across end-to-end U.S. healthcare workflows.
Key highlights:
▶️ High-fidelity simulators for Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, all exposed as MCP servers over patient, clinician, and insurer records.
🧪 Each trial runs 60–80 agent steps across 4–6 clinical stages, with access to 21 healthcare apps, 200+ MCP tools, and a 1,279-document operations handbook.
Leaderboard results across 30 frontier agents:
• Claude Code + Opus 4.6: 28% pass@1
• Codex + GPT-5.5: 21%
• Utilization review: 41%
• Care management: 32%
• Prior authorization: 29%
Reliability remains a major challenge: no agent exceeds 20% when the same case is repeated three times.
7 replies · 19 reposts · 54 likes · 2.8K views
Frank Wang (@FWang9959)
Hell yeah @HaolinChen11!! 🔥 "Healthcare AI evaluation has to be done with clinicians in the loop, not on top of them." That line is the whole thesis right there. The 0% on the end-to-end provider→payer arena is the most honest result I've seen in this space. Real workflows compound. Every hop is a new place to break a policy check. So damn proud of what you built, brother. Massive shoutout to the 19 organizations across Hopkins, Yale, Stanford, CMU, Oxford, MBZUAI, and more who made this real. Leaderboard's live. Repo's open. If your agent beats 28%, we want to know. #AIinHealthcare #CHIBench
0 replies · 0 reposts · 1 like · 16 views
Haolin Chen (@HaolinChen11)
(1/n) After a few months of work with multiple hospitals, universities and research facilities, today we're open-sourcing CHI-Bench: the first long-horizon benchmark for healthcare AI agents on real clinical and healthcare workflows. Best frontier agent overall: 28% pass@1. End-to-end prior authorization: 0%. A thread on what we found 🧵
10 replies · 7 reposts · 15 likes · 474 views
Frank Wang (@FWang9959)
Hell yeah @iscreamnearby!! 🔥 CHI-Bench is the reality check healthcare AI has been waiting for. In real ops, one missed policy check delays care or triggers an audit. The bar has to be this high. End-to-end workflows across prior auth, utilization management, and care management. 60-80 agent steps per trial. 30 frontier agents tested. Best score: 28%. That's exactly the right signal. It shows how far we still have to go, and it shows we're aiming at the right target. So damn proud to build this with you, brother. Massive shoutout to the crew across @HopkinsMedicine, @WellstarHealth, @YaleMed, @StanfordAILab, and CMU who made this real. Leaderboard's live. Repo's open. Submissions are open. Time to climb. #AIinHealthcare #CHIBench
1 reply · 0 reposts · 5 likes · 165 views
Frank Wang (@FWang9959)
@CaimingXiong Thank you @CaimingXiong for the close collaboration. Your guidance shaped CHI-Bench from the simulator architecture to the eval design. The 20% reliability ceiling is the next problem worth chasing. More soon 🚀
0 replies · 0 reposts · 3 likes · 78 views