
LLMs might not significantly augment diagnostic reasoning

Clinicians who used GPT-4 did not perform significantly better on diagnostic reasoning than those relying solely on conventional clinical resources.

New research published in JAMA Network Open suggests that clinicians' use of large language models (LLMs) might not significantly improve diagnostic reasoning performance.

LLMs, like other types of generative AI (GenAI) in healthcare, have demonstrated promise in applications such as streamlining nursing documentation and administrative tasks. The tools have also shown potential for medical reasoning, with chatbots previously achieving high performance on both multiple-choice and open-ended medical reasoning examinations.

However, the researchers noted that the impact these technologies have on clinicians' diagnostic reasoning is still not well understood. To help bridge this research gap, the team recruited 50 U.S.-licensed clinicians with backgrounds in family medicine, internal medicine or emergency medicine.

Participants were tasked with reviewing up to six clinical vignettes in 60 minutes, and each clinician was randomly assigned to use either GPT-4 alongside conventional resources or conventional resources alone.

Performance was primarily assessed on differential diagnosis accuracy, the appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, with responses validated against blinded expert consensus. Secondary outcomes, including time spent per case and final diagnosis accuracy, were also recorded.

The average diagnostic reasoning score per case was 76% for the LLM group and 74% for the conventional resources-only group. Time spent per case was 519 seconds for the LLM group and 565 seconds for the conventional resources group.

In a secondary analysis, the researchers also evaluated the LLM's standalone diagnostic reasoning capabilities. On its own, GPT-4 scored 16 percentage points higher than the clinicians in the conventional resources-only group.

These findings suggest that giving clinicians access to the LLM did not significantly enhance their diagnostic reasoning performance, highlighting the need to further explore how GenAI tools can best support clinicians.

"The field of AI is expanding rapidly and impacting our lives inside and outside of medicine. It is important that we study these tools and understand how we best use them to improve the care we provide as well as the experience of providing it," stated Andrew Olson, M.D., a professor at the University of Minnesota Medical School and hospitalist with M Health Fairview, in a press release. "This study suggests that there are opportunities for further improvement in physician-AI collaboration in clinical practice."

Shania Kennedy has been covering news related to health IT and analytics since 2022.
