
Data Augmentation May Improve LLM Generalization on Clinical Notes

Using large language models in a healthcare domain-informed manner could address dataset shift and enhance the generalizability of AI-driven medical note analysis.

Researchers from Johns Hopkins University and Columbia University have developed a technique to improve the performance of artificial intelligence (AI) and machine learning (ML) models for medical note analysis, according to findings presented at the 37th Annual Conference on Neural Information Processing Systems (NeurIPS).

Clinical notes housed within electronic health records (EHRs) contain a wealth of valuable data that could be used to improve care. However, reviewing, sorting, and analyzing this information is too time- and resource-intensive to be done manually.

AI technologies are a potential solution to this problem, as these tools can process vast amounts of data quickly. But questions around model generalizability and performance present significant hurdles for deployment.

AI and ML models for medical note analysis are typically trained on health systems’ EHR data, which helps the tools “learn” to infer key information about patients’ medical conditions.

However, medical notes can vary significantly within and across health systems, meaning that models trained on these data may perform poorly when tasked with analyzing clinical notes from other sources. This phenomenon is known as “dataset shift,” and it can create safety concerns around healthcare AI deployment.  

To address these challenges, the researchers developed a data augmentation technique designed to enhance model generalizability.

The research team underscored that variations in clinicians’ writing styles can cause AI models to incorrectly infer associations between factors like grammar or word choice and a patient’s diagnosis or medical conditions. The same can happen with the templates – including tables and headings – that clinicians use frequently in their notes.

While these style-related factors are irrelevant to the analysis the AI performs, the same templates are often used by clinicians treating particular subgroups of patients. The tool then observes that a given template and certain diagnoses regularly appear together, which can lead the model to learn from spurious correlations rather than true associations in the data.

To combat this, the researchers propose using data augmentation to prevent the tools from picking up these spurious correlations.

“We found that we can greatly improve the robustness of these text models across different settings by making them less sensitive to changes in writing habits and styles observed between different caregivers,” said Yoav Wald, PhD, a postdoctoral fellow at Johns Hopkins’ Whiting School of Engineering who worked on the project, in a news release.

The technique enables researchers to make the models less sensitive to these factors by feeding them the same medical note written in multiple different styles. This allows the AI to learn from the content of the notes, rather than the style or templates used.

But rather than having clinicians rewrite each other’s notes to help achieve this – which would create undue burdens on already busy care teams – the research team turned to large language models (LLMs).

“Given a specific note that we wish to rewrite in the style of some caregiver—say, Dr. Beth—we instead ask an LLM, ‘How would this note look had Dr. Beth written it?’” Wald explained.
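The researchers’ exact prompts, model, and tooling are not reproduced in this article, but the idea is straightforward to sketch. The hypothetical Python snippet below uses the OpenAI chat API as a stand-in LLM; the model name, prompt wording, and the rewrite_in_style helper are all illustrative assumptions, not the paper’s published method.

```python
# Hypothetical sketch of the style-transfer rewrite step. The researchers'
# actual prompts, model, and tooling are not specified in this article.
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

REWRITE_PROMPT = (
    "Below is a de-identified clinical note. Rewrite it as if clinician "
    "'{author}' had written it. Preserve every clinical fact, finding, and "
    "diagnosis exactly; change only the writing style, formatting, and "
    "templating (headings, tables, abbreviations).\n\nNOTE:\n{note}"
)

def rewrite_in_style(note: str, author: str, model: str = "gpt-4o") -> str:
    """Ask the LLM: 'How would this note look had {author} written it?'"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(author=author, note=note)}],
        temperature=0.7,  # allow stylistic variation across rewrites
    )
    return response.choices[0].message.content
```

Rewriting each note several times, once per target style, gives a classifier multiple stylistic views of the same clinical content.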

This approach generates counterfactual data, which show what a model would predict if its input were changed in a controlled way. Training on these data helps negate spurious correlations in real-world datasets and can reduce the likelihood of an AI model making inaccurate predictions.
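In training terms, each rewrite is a label-preserving counterfactual: same patient, same diagnosis, different style. The minimal sketch below shows how such pairs might enter a training set, reusing the hypothetical rewrite_in_style helper above; the researchers’ actual training setup may differ, for instance by adding a consistency objective between a note and its rewrites.

```python
# Assumed design: pair every training note with style-counterfactual
# rewrites that keep the label fixed, so the only signal consistently
# associated with the label is the clinical content itself.
from dataclasses import dataclass

@dataclass
class Example:
    text: str   # the clinical note
    label: int  # e.g., whether a given diagnosis is present

def augment(dataset: list[Example], target_styles: list[str]) -> list[Example]:
    augmented = list(dataset)
    for ex in dataset:
        for author in target_styles:
            rewritten = rewrite_in_style(ex.text, author)  # sketch above
            augmented.append(Example(text=rewritten, label=ex.label))
    return augmented
```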

Auxiliary data drawn from clinical notes – such as patient demographics, timestamps, and document types – can be used to generate high-quality approximations of these counterfactual data.
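One plausible way to use that metadata (a sketch only; the field names and prompt below are illustrative assumptions, not the paper’s method) is to condition the rewrite on a target setting, so the generated note approximates what a different document type or care context would actually produce:

```python
# Hypothetical metadata-conditioned rewrite: steer the counterfactual
# toward a concrete target setting rather than a generic style change.
def rewrite_with_metadata(note: str, target: dict, model: str = "gpt-4o") -> str:
    prompt = (
        f"Rewrite the clinical note below as a {target['doc_type']} "
        f"written in a {target['care_setting']} on {target['date']}. "
        "Keep all clinical facts and diagnoses unchanged; alter only the "
        "style, structure, and templating.\n\nNOTE:\n" + note
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with made-up metadata values:
# counterfactual = rewrite_with_metadata(
#     note,
#     {"doc_type": "discharge summary",
#      "care_setting": "emergency department",
#      "date": "2021-03-14"},
# )
```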

The researchers demonstrated that this technique applies LLMs in a healthcare domain-informed manner and improves the generalizability of AI models for medical note analysis.

This work is part of a larger effort to develop an AI safety framework for healthcare applications.

“As we increase our use of AI in real-world applications and learn about its strengths and weaknesses, it is important to develop tools that improve AI models’ robustness and safety,” stated Suchi Saria, PhD, the John C. Malone Associate Professor of computer science at the Whiting School of Engineering. “This has been a key area of our focus over the last five years and this new work takes an important step in that direction. The methods we’ve developed here are directly applicable across many important text classification tasks.”

“Overall, we believe that causally motivated data augmentation methods like ours can help address challenges in developing robust and reliable ML systems, particularly in safety-critical applications,” Wald noted.

The use of AI in medical note analysis can play a key role in the clinical documentation improvement process.

Health systems are increasingly turning to EHR documentation assistants to streamline documentation and reduce clinician burnout. These tools, often scribe- or voice-based technologies, can help reduce the amount of time clinicians spend on documentation without sacrificing the quality of the notes.

Medical note analysis presents a potential avenue to make the information in clinical documentation even more useful by surfacing potential associations in the data that could be used to inform predictive analytics or guide clinical decision-making.
