
Chatbots show promise in simplifying pathology reports for patients

Bard and GPT-4 enhanced the readability of pathology reports while maintaining high medical accuracy, suggesting the tools could help explain findings to patients.

AI chatbots may be able to accurately simplify pathology reports to help patients more easily understand them, according to recent study findings published in JAMA Network Open.

Pathology reports play a critical role in healthcare delivery, allowing providers to glean important information about a patient’s health and well-being. These reports are often available online via the provider’s patient portal, but the findings are typically complex and difficult to understand.

Simplifying pathology reports would make them easier for patients to understand, but doing so on a large scale would require significant time and resources that many healthcare organizations lack.

To overcome this, the study authors set out to determine whether two popular chatbots – Google’s Bard and OpenAI’s GPT-4 – could interpret pathology reports and improve their readability.

To do this, the researchers gathered 1,134 pathology reports created between January 1, 2018, and May 31, 2023. The reports varied in length and complexity, were written by different pathologists, and addressed a variety of conditions and procedures.

The text of each report, which contained notes, comments, synoptic reports and addendums, was not edited for clarity, but potentially identifying information was anonymized.

From there, two expert reviewers categorized each report’s findings as normal, benign, atypical and/or suspicious, precancerous, malignant or nondiagnostic. In the event of a disagreement between the reviewers, a third pathologist served as a tiebreaker.

This information was then provided to the chatbots, and each was given a sequence of tasks: to simplify each report, to classify the findings and to determine the pathologic stage of any tumors. New chat threads were started for each report to minimize potential bias.
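The article does not reproduce the study's actual prompts or tooling, but a minimal sketch of that per-report workflow, assuming the OpenAI Python client and hypothetical prompt wording, might look like the following; each report gets its own fresh message list, mirroring the new-chat-thread-per-report design.

```python
# Hypothetical sketch of the per-report task sequence described above.
# The study's exact prompts, interface, and model settings are not given
# in the article; the wording and client usage here are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASKS = [
    "Rewrite this pathology report in plain language a patient can understand.",
    "Classify the findings as normal, benign, atypical/suspicious, "
    "precancerous, malignant, or nondiagnostic.",
    "If a tumor is present, state its pathologic stage.",
]

def process_report(report_text: str) -> list[str]:
    """Run the three tasks against one report in a single fresh chat thread."""
    messages = [{"role": "user", "content": report_text}]  # new thread per report
    answers = []
    for task in TASKS:
        messages.append({"role": "user", "content": task})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```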

Each model’s performance was measured in terms of readability – per the Flesch Reading Ease (FRE) and Flesch-Kincaid grade level (FKGL) formulas – and accuracy, as determined by human pathologists.
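Both readability measures are standard formulas based on average sentence length and syllables per word. For reference, a minimal sketch is shown below; it uses a rough vowel-group syllable heuristic rather than the dictionary-based counters that production readability tools rely on.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level) for a text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # average words per sentence
    spw = syllables / max(1, len(words))  # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw  # higher score = easier to read
    fkgl = 0.39 * wps + 11.8 * spw - 15.59    # approximate US school grade level
    return fre, fkgl
```

Because FKGL approximates a grade level and FRE rewards shorter sentences and words, easier text shows up as a lower FKGL and a higher FRE, which is why the improvements reported below move in opposite directions.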

In addition to screening the simplified reports for errors and categorizing each as medically correct, partially medically correct or medically incorrect, the reviewers also flagged any instances of chatbot hallucinations.

For the original reports, the average FKGL was 13.19, and both models lowered it: Bard achieved an average FKGL of 8.17, while GPT-4 brought it down to 7.45. FRE scores showed similar improvement, rising from 10.32 for the original reports to 61.32 for Bard and 70.80 for GPT-4.

Further, Bard interpreted 87.57% of reports correctly, 8.99% partially correctly and 3.44% incorrectly, while GPT-4 interpreted 97.44% correctly, 2.12% partially correctly and 0.44% incorrectly. Bard also hallucinated in 2.82% of reports, compared with 0.26% for GPT-4.

These results indicate that chatbots may have the potential to explain pathology reports to patients, but the research team cautioned that the tools must be implemented carefully.

“The findings of this cross-sectional study suggest that artificial intelligence chatbots were able to simplify pathology reports. However, some inaccuracies and hallucinations occurred. Simplified reports should be reviewed by clinicians before distribution to patients,” the researchers wrote.
