
ChatGPT Fails American College of Gastroenterology Assessment Tests

ChatGPT-3 and ChatGPT-4 scored 65.1% and 62.4%, respectively, on American College of Gastroenterology Self-Assessment Tests, which require a score of 70% or higher to pass.

A study published this week in the American Journal of Gastroenterology demonstrated that ChatGPT-3 and ChatGPT-4 failed the 2021 and 2022 multiple-choice self-assessment tests for the American College of Gastroenterology (ACG), a result that may limit the tools’ usefulness for medical education in gastroenterology.

Over the past several months, excitement around how ChatGPT may transform healthcare has led researchers to investigate its potential use cases, including in medical education.

Recently, the large language model (LLM) has successfully passed US Medical Licensing Exam (USMLE)-style tests, proved capable of answering competency-based microbiology questions, and shown promise in providing accurate information on cancer misconceptions. However, ChatGPT’s application in various medical specialties, such as gastroenterology, has not been fully explored.

This led researchers from Arkansas Gastroenterology, Northwell Health, and Northwell’s Feinstein Institutes for Medical Research to evaluate ChatGPT-3 and ChatGPT-4, the most recent iterations of the natural language processing (NLP) tool, on their ability to pass an ACG assessment, which is designed to help test takers gauge how they would perform on the actual American Board of Internal Medicine (ABIM) Gastroenterology board examination.

To test the tools, the research team tasked both versions of ChatGPT with answering a total of 455 questions across two ACG tests. Overall, ChatGPT-3 correctly answered 296 of 455 questions, scoring 65.1 percent, while ChatGPT-4 correctly answered 284 questions, achieving 62.4 percent.

Neither version of ChatGPT successfully achieved a passing score of 70 percent or higher.

“Recently, there has been a lot of attention on ChatGPT and the use of AI across various industries. When it comes to medical education, there is a lack of research around this potential ground-breaking tool,” said Arvind Trindade, MD, senior author of the study and associate professor at the Feinstein Institutes’ Institute of Health System Science, in a press release discussing the findings. “Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time and has a ways to go before it should be implemented into the health care field.”

Alongside their recommendation that ChatGPT, in its current form, not be used for medical education in gastroenterology, the researchers highlighted some of the tool’s limitations.

They explained that because ChatGPT is designed to generate human-like text by predicting word sequences in response to user prompts, it can only answer questions based on the data it has been trained on.

The research team indicated that ChatGPT’s failure on gastroenterology tests could be a result of the tool sourcing outdated or questionable information from non-medical sources or lacking access to paid subscription medical journals, which would contain the most up-to-date, accurate information.

“ChatGPT has sparked enthusiasm, but with that enthusiasm comes skepticism around the accuracy and validity of AI’s current role in health care and education,” said Andrew C. Yacht, MD, senior vice president, academic affairs and chief academic officer at Northwell Health. “Dr. Trindade’s fascinating study is a reminder that, at least for now, nothing beats hitting time-tested resources like books, journals and traditional studying to pass those all-important medical exams.”
