Introducing NEJM-Bench, a multimodal clinical reasoning benchmark we built from over two decades of real clinical image cases.
Across all models we tested, accuracy drops by ~30 percentage points when the models are given no answer options. This gap persists even on cases outside each model’s training distribution.
1/🧵
