Arjun (Raj) Manrai@arjunmanrai

🧵1/ Our new study on AI and physician reasoning just came out in @ScienceMagazine. As co-senior author, I'm excited about our findings, and I do think AI will reshape medicine. But after seeing some of the discussions, I'm also worried about how our findings may be misinterpreted.

Arjun (Raj) Manrai@arjunmanrai·1 May

2/ First, a huge shoutout to our superstar co-first authors @PeterBrodeurMD (IM resident @BIDMC_IM) and @tabuckley_ (PhD student in @AIM_Harvard_PhD). Without them this paper would not exist.

Arjun (Raj) Manrai@arjunmanrai·1 May

3/ Now, some background. In 1959, Ledley & Lusted published a paper (also in Science!) arguing that complex clinical cases (such as the NEJM CPCs) should be a gold standard for evaluating the reasoning abilities of medical AI. That gauntlet motivated every generation of medical AI since, and for decades AI systems fell short.

Arjun (Raj) Manrai@arjunmanrai·1 May

4/ Background #2: last year my lab created a system called Dr. CaBot that could not only solve the "final diagnosis" in these cases but also create long-form written differential diagnoses (w references) and narrate slide-based presentations. CaBot was the first AI system to generate a diagnosis published in the 100+ year history of NEJM CPCs. @DhruvKhullar reflected on CaBot in a great piece in @NewYorker

Arjun (Raj) Manrai@arjunmanrai·1 May

5/ Back to the @ScienceMagazine study that just came out. This study does not test a brand new model (o1 was released late 2024), but it introduces a new standard for physician-based and large-scale evaluation of AI models. We ran 6 diagnostic & management reasoning experiments with a baseline of hundreds of physicians, not only on curated clinical cases but on real, unstructured records straight from the BIDMC EHR.

Arjun (Raj) Manrai@arjunmanrai·1 May

6/ We began the Science study in late 2024 first to test a reasoning model (OpenAI's o1 series) "off the shelf" in generating final diagnoses and also diagnostic testing plans on these same cases. We had multiple doctors grade the quality of not only the final diagnosis but also the diagnostic testing plan. o1-preview did very well.

Arjun (Raj) Manrai@arjunmanrai·1 May

7/ The model did so well that we decided to dramatically expand the scope of the study and throw every diagnostic and management reasoning test we had at the model, provided the test had robust physician-adjudicated ground truths and physician baselines (still missing from most studies!). Overall, o1 outperformed our large (~hundreds) physician baseline.

Arjun (Raj) Manrai@arjunmanrai·1 May

8/ Then comes the most important experiment, added most recently. On real ER cases divided into three diagnostic touchpoints, the AI outperformed two expert attending physicians. Separate physician graders were blinded to the source of the differential (painstaking work). The widest gap was at initial triage. Least information, most urgency.

Arjun (Raj) Manrai@arjunmanrai·1 May

9/ The potential here is real. Our results suggest that AI second opinions may help, but whether they actually do in practice was not studied here. We desperately need rigorous, prospective trials to test this and studies of AI-human interaction in real clinical workflows.

Arjun (Raj) Manrai@arjunmanrai·1 May

10/ My worry is that our results will be misused to argue that AI replaces doctors, or used by companies with "AI doctors" to oversell the current state of AI in real clinical care. We did not study AI prospectively in real clinical care or human-computer interaction!

Arjun (Raj) Manrai@arjunmanrai·1 May

11/ Another limitation of our study is that we tested only text-based reasoning. Clinical practice draws extensively on non-text signals — imaging, physical exam, the sound of a patient's breathing, their level of distress.

Arjun (Raj) Manrai@arjunmanrai·1 May

12/ The ER experiment itself is only a proof of concept. We asked o1 to provide a second opinion at predefined touchpoints. We did not study triage decisions, what to do next, disposition, or patient outcomes.

Arjun (Raj) Manrai@arjunmanrai·1 May

13/ We also emphasize this in the paper: many of our benchmarks depend on carefully curated cases (by physicians!). Real clinical data is messy, incomplete, and contradictory in ways that structured cases aren't. Performance in actual clinical workflows may be lower.

Arjun (Raj) Manrai@arjunmanrai·1 May

14/ What do our results actually call for? Prospective clinical trials. Health systems investing in infrastructure now. Monitoring frameworks that track not just diagnostic accuracy but safety, efficiency, and cost. The science has reached a point where trials are justified.

Arjun (Raj) Manrai@arjunmanrai·1 May

15/ Our study also illustrates the importance of rigorously adjudicated physician "ground truth" labels and physician baselines. This takes time, effort, and close collaboration between computational and clinical researchers.

Arjun (Raj) Manrai@arjunmanrai·1 May

16/ Ledley & Lusted set a challenge in 1959. It took 67 years, but we think the answer to their challenge is yes. But meeting that challenge and improving patient lives are two different things. The next chapter, the harder one, is just beginning.

Arjun (Raj) Manrai@arjunmanrai·1 May

Full text: science.org/doi/10.1126/sc…

Austin Meyer@austingmeyer·2 May

This is a nice thread. I feel like a lot of non-physicians do not understand the caveats here. I might go further and argue that I'm not sure it is actually ready for prospective trials, except perhaps in a very limited sense. Ultimately, the training data includes lots of heavily curated case presentations from the published literature and essentially no real-world presentations with confusing and disorganized thoughts/data. The biggest sticking point is likely to be that very large models should be able to memorize, or at least easily pattern-match, test data that is very similar to the curated content in the training data, but whether they are useful at all on a stream-of-consciousness presentation (whether presented by a physician or a patient) is unclear. I'm not sure it is even clear that a model can consistently take in the information from a patient and present it as a completely hallucination-free curated case.

Austin Meyer@austingmeyer·2 May

One additional thing to keep in mind is that the RLHF that general models undergo for alignment is likely to make real-time unintentional biasing/steering a significant problem for non-crystallized case presentations. There would probably need to be some alignment tuning specific to this use case.
