Frank Wang

19 posts

@FWang9959

Founder & CTO @ https://t.co/HzxZItCdqY, ex-Salesforce AI Research, Vlocity, Successfactors, SAP, BusinessObjects

Pleasanton · Joined September 2025

33 Following · 7 Followers

Frank Wang reposted

ModelScope@ModelScope2022·1d

The best AI agent (Claude Code + Claude Opus 4.6) passes only 28% of real healthcare workflow tasks. CHI-Bench by @actAVAai @iscreamnearby @HaolinChen11, built with Johns Hopkins, Yale, Stanford, CMU, Oxford and 20+ institutions, was designed to find out exactly how far we are.
🏥 Try it yourself 👉 modelscope.ai/datasets/actav…
Three long-horizon domains tested:
🏥 Prior Authorization: provider intake and PA preparation for new referrals
📋 Utilization Management: full payer review cycle from intake to peer-to-peer
👥 Care Management: chronic disease follow-up, outreach, assessment, care planning
75 tasks + 3 marathon tasks + 23 end-to-end dual-agent scenarios. 20 medical apps via MCP, 1,279-document handbook.
💻 Git: github.com/actava-ai/chi-…
🔗 Leaderboard: actava.ai/benchmarks

4.2K

Frank Wang reposted

Weiran Yao@iscreamnearby·1d

Introducing CHI-Bench on @huggingface: the world’s first long-horizon healthcare benchmark for AI agents. 75 real healthcare workflows + 20 apps + 200+ MCP tools + 1,290 skills + process / outcome rewards huggingface.co/datasets/actav… Any questions, lmk!

131

24K

Frank Wang@FWang9959·1d

☺️

Frank Wang@FWang9959·2d

Great news. Today we ranked #6 most popular dataset on Hugging Face! Wow! 😊 huggingface.co/datasets

Weiran Yao@iscreamnearby

1/🧵 Can AI agents automate U.S. healthcare workflows end to end, given just clinician & insurer apps, operations, and a medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28% #AIinHealthcare 👇

351

Frank Wang@FWang9959·1d

actava.ai/news/why-actav…

Frank Wang@FWang9959·1d

x.com/i/article/2058…

Frank Wang reposted

Haolin Chen@HaolinChen11·4d

CHI-Bench is now available on harbor hub!

Harbor Framework@harborframework

healthcare benchmark, built on harbor!

363

Frank Wang@FWang9959·2d

@iscreamnearby Great news. Today we ranked #6 most popular dataset on Hugging Face! Wow! 😊 huggingface.co/datasets

Weiran Yao@iscreamnearby·6d

62.5K

Frank Wang@FWang9959·5d

Unbelievable: just two days after announcing our benchmark, we're already the **#10 most popular dataset on Hugging Face**, out of 1 million+ datasets on the platform. 🎁

Weiran Yao@iscreamnearby

Frank Wang@FWang9959·5d

huggingface.co/datasets/actav…

Frank Wang@FWang9959·5d

🚨 Historic moment for @actAVAai! Just one day after launch, our benchmark dataset is already #10 most popular on Hugging Face, out of 1 million+ datasets! Huge thanks to @iscreamnearby, @HaolinChen11, Deon Metelski, Leon Qi, Tao Xia, Joon Lee, Steve Brown, Kevin Riley, T. Y. Alvin Liu, M.D., Zhiwei Liu, Qingsong Wen, @CaimingXiong, Sanmi Koyejo, Eric Xing & all our collaborators.

192

Frank Wang reposted

The Agent Times@TheAgentTimes·20 May

A new 33-author benchmark called CHI-Bench finds that the best AI agent configuration resolves only 28% of realistic healthcare administration tasks, dropping to 3.8% in continuous-session testing.

151

Frank Wang@FWang9959·6d

@dmon2048 Hell yeah @dmon2048 🔥 Great work on CHI-Bench, brother. The bar you just set will pull the whole field forward. So damn proud to build this with you. Leaderboard's live. Come take a swing at 28%. #AIinHealthcare #CHIBench

Leon Qi@dmon2048·6d

1/ Introducing CHI-Bench 🧵 Can AI agents automate U.S. healthcare workflows end to end — given only clinician & insurer apps, operations, and a medical policy library? 75 long-horizon workflows × 30 frontier agents. Best agent solves just 28%. #AIinHealthcare 👇

243

Caiming Xiong@CaimingXiong·6d

In real healthcare operations, agents must do far more than answer medical questions. They need to read charts, interpret clinical and operational policies, verify coverage, route referrals, draft P2P scripts, and finalize care plans, where a single policy violation can mean a denied claim or missed patient outcome.
@actAVAai @iscreamnearby led and developed CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon, policy-rich benchmark for AI agents operating across end-to-end U.S. healthcare workflows.
Key highlights:
▶️ High-fidelity simulators for Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, all exposed as MCP servers over patient, clinician, and insurer records.
🧪 Each trial runs 60–80 agent steps across 4–6 clinical stages, with access to 21 healthcare apps, 200+ MCP tools, and a 1,279-document operations handbook.
Leaderboard results across 30 frontier agents:
• Claude Code + Opus 4.6: 28% pass@1
• Codex + GPT-5.5: 21%
• Utilization review: 41%
• Care management: 32%
• Prior authorization: 29%
Reliability remains a major challenge: no agent exceeds 20% when the same case is repeated three times.
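
The gap between the pass@1 numbers and the "repeated three times" reliability figure above can be illustrated with a minimal sketch. It assumes each task has a list of pass/fail attempts; the function names, the task names, and the results are all hypothetical, and this is not CHI-Bench's actual scoring code.

```python
from typing import Dict, List


def pass_at_1(trials: Dict[str, List[bool]]) -> float:
    """Average per-task success rate over all recorded attempts."""
    per_task = [sum(attempts) / len(attempts) for attempts in trials.values()]
    return sum(per_task) / len(per_task)


def all_k_pass(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k attempts ALL succeed."""
    eligible = [a for a in trials.values() if len(a) >= k]
    if not eligible:
        raise ValueError(f"no task has at least {k} attempts")
    return sum(all(a[:k]) for a in eligible) / len(eligible)


# Hypothetical results for illustration only (not benchmark data).
trials = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, False, False],
}

print(f"pass@1:     {pass_at_1(trials):.2f}")
print(f"all-3-pass: {all_k_pass(trials, k=3):.2f}")
```

Under these made-up numbers the agent succeeds on half of its attempts, yet only one task in four passes every time, which is the kind of drop the reliability figure describes.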

2.8K

Frank Wang@FWang9959·6d

Hell yeah @HaolinChen11!! 🔥 "Healthcare AI evaluation has to be done with clinicians in the loop, not on top of them." That line is the whole thesis right there. The 0% on the end-to-end provider→payer arena is the most honest result I've seen in this space. Real workflows compound. Every hop is a new place to break a policy check. So damn proud of what you built, brother. Massive shoutout to the 19 organizations across Hopkins, Yale, Stanford, CMU, Oxford, MBZUAI, and more who made this real. Leaderboard's live. Repo's open. If your agent beats 28%, we want to know. #AIinHealthcare #CHIBench

Haolin Chen@HaolinChen11·6d

(1/n) After a few months of work with multiple hospitals, universities and research facilities, today we're open-sourcing CHI-Bench: the first long-horizon benchmark for healthcare AI agents on real clinical and healthcare workflows. Best frontier agent overall: 28% pass@1. End-to-end prior authorization: 0%. A thread on what we found 🧵

474

Frank Wang@FWang9959·6d

Hell yeah @iscreamnearby!! 🔥 CHI-Bench is the reality check healthcare AI has been waiting for. In real ops, one missed policy check delays care or triggers an audit. The bar has to be this high. End-to-end workflows across prior auth, utilization management, and care management. 60-80 agent steps per trial. 30 frontier agents tested. Best score: 28%. That's exactly the right signal. It shows how far we still have to go, and it shows we're aiming at the right target. So damn proud to build this with you, brother. Massive shoutout to the crew across @HopkinsMedicine, @WellstarHealth, @YaleMed, @StanfordAILab, and CMU who made this real. Leaderboard's live. Repo's open. Submissions are open. Time to climb. #AIinHealthcare #CHIBench

165

Frank Wang@FWang9959·6d

@CaimingXiong Thank you @CaimingXiong for the close collaboration. Your guidance shaped CHI-Bench from the simulator architecture to the eval design. The 20% reliability ceiling is the next problem worth chasing. More soon 🚀

Frank Wang@FWang9959·17 May

@CaimingXiong @Recursive_SI an exciting journey ahead, congrats!

Caiming Xiong@CaimingXiong·13 May

Today, we’re excited to launch Recursive (@recursive_si): an exceptional team across London and San Francisco, building AI systems that can safely improve their own capabilities over time.

Recursive@Recursive_SI

x.com/i/article/2054…

123

17.1K

Frank Wang@FWang9959·17 May

@CaimingXiong big congrats @CaimingXiong looking great! 😃

Caiming Xiong@CaimingXiong·13 May

NYT article about us. nytimes.com/2026/05/13/tec…

2.2K
