Arjun (Raj) Manrai (@arjunmanrai)
🧵1/ Our new study on AI and physician reasoning just came out in @ScienceMagazine. As co-senior author, I'm excited about our findings, and I do think AI will reshape medicine. But after seeing some of the discussions, I'm also worried about how our findings may be misinterpreted.
3/ Now, some background. In 1959, Ledley & Lusted published a paper (also in Science!) arguing for complex clinical cases (such as the NEJM CPCs) to be a gold standard for evaluating the reasoning abilities of medical AI. That gauntlet motivated every generation of medical AI since, and for decades AI systems fell short.
4/ Background #2: last year my lab created a system called Dr. CaBot that could not only solve the "final diagnosis" in these cases but also create long-form written differential diagnoses (w references) and narrate slide-based presentations. CaBot was the first AI system to generate a diagnosis published in the 100+ year history of NEJM CPCs. @DhruvKhullar reflected on CaBot in a great piece in @NewYorker
5/ Back to the @ScienceMagazine study that just came out. This study does not test a brand new model (o1 was released late 2024), but it introduces a new standard for physician-based and large-scale evaluation of AI models. We ran 6 diagnostic & management reasoning experiments with a baseline of hundreds of physicians, not only on curated clinical cases but on real, unstructured records straight from the BIDMC EHR.
6/ We began the Science study in late 2024 first to test a reasoning model (OpenAI's o1 series) "off the shelf" in generating final diagnoses and also diagnostic testing plans on these same cases. We had multiple doctors grade the quality of not only the final diagnosis but also the diagnostic testing plan. o1-preview did very well.
7/ The model did so well that we decided to dramatically expand the scope of the study and throw every diagnostic and management reasoning test we had at the model, provided the test had robust physician-adjudicated ground truths and physician baselines (still missing from most studies!). Overall, o1 outperformed our large (~hundreds) physician baseline.
8/ Then comes the most important experiment, added most recently. On real ER cases divided into three diagnostic touchpoints, the AI outperformed two expert attending physicians. Separate physician graders were blinded to the source of the differential (painstaking work). The widest gap was at initial triage. Least information, most urgency.
9/ The potential here is real. Our results suggest that AI second opinions may help, but whether they actually do in practice was not studied here. We desperately need rigorous, prospective trials to test this and studies of AI-human interaction in real clinical workflows.
10/ My worry is that our results will be misused to argue that AI replaces doctors, or used by companies with "AI doctors" to oversell the current state of AI in real clinical care. We did not study AI prospectively in real clinical care or human-computer interaction!
11/ Another limitation of our study is that we tested only text-based reasoning. Clinical practice draws extensively on non-text signals — imaging, physical exam, the sound of a patient's breathing, their level of distress.
12/ The ER experiment itself is only a proof of concept. We asked o1 to provide a second opinion at predefined touchpoints. We did not study triage decisions, what to do next, disposition, or patient outcomes.
13/ We also emphasize this in the paper: many of our benchmarks depend on carefully curated cases (by physicians!). Real clinical data is messy, incomplete, and contradictory in ways that structured cases aren't. Performance in actual clinical workflows may be lower.
14/ What do our results actually call for? Prospective clinical trials. Health systems investing in infrastructure now. Monitoring frameworks that track not just diagnostic accuracy but safety, efficiency, and cost. The science has reached a point where trials are justified.
15/ Our study also illustrates the importance of rigorously adjudicated physician "ground truth" labels and physician baselines. This takes time, effort, and close collaboration between computational and clinical researchers.
16/ Ledley & Lusted set a challenge in 1959. It took 67 years, but we think the answer to their challenge is yes. But meeting that challenge and improving patient lives are two different things. The next chapter, the harder one, is just beginning.
Austin Meyer (@austingmeyer)
This is a nice thread. I feel like a lot of non-physicians do not understand the caveats here. I might go further and argue that I'm not sure it is actually ready for prospective trials except perhaps in a very limited sense. Ultimately, the training data includes lots of heavily curated case presentations from the published literature and essentially no real-world presentations with confusing and disorganized thoughts/data. The biggest sticking point is likely to be that very large models should be able to memorize, or at least easily pattern-match, test data that is very similar to the curated content in the training data, but whether they are useful at all on a stream-of-consciousness presentation (whether presented by a physician or a patient) is unclear. I'm not sure it is even clear that a model can consistently take in the information from a patient and present it as a completely hallucination-free curated case.
One additional thing to keep in mind is that the RLHF that general models undergo for alignment is likely to make real-time unintentional biasing/steering a significant problem for non-crystallized case presentations. There would probably need to be some specific alignment tuning for this use case.