
ChatGPT Fails American Urological Association Self-Assessment Exam

ChatGPT correctly answered less than 30 percent of questions from the 2022 American Urological Association Self-assessment Study Program exam.

A study published last week in Urology Practice demonstrated that ChatGPT performed poorly on the 2022 American Urological Association (AUA) Self-assessment Study Program exam, indicating that the tool may not be ready for use in urological education.

Since ChatGPT passed a United States Medical Licensing Exam (USMLE)-style exam in February, researchers have become increasingly interested in what role the large language model (LLM) could play in medical education.

These investigations have yielded mixed results, with one showing that the tool can successfully answer competency-based medical education questions for microbiology and another showing how ChatGPT can provide accurate information about cancer myths and misconceptions. However, the LLM has limitations, highlighted recently when it failed two American College of Gastroenterology Self-Assessment Tests.

Given its promising performance on the USMLE, researchers decided to test ChatGPT on the AUA’s 2022 Self-assessment Study Program exam, which is a commonly used preparatory tool for urology students.

The research team screened the exam's 150 questions for use in the study, removing 15 that contained visual assets ChatGPT could not process. The remaining 135 questions were either open-ended or multiple choice.

ChatGPT’s response to each question was coded as correct, incorrect, or indeterminate. Indeterminate outputs were regenerated up to two times to reduce the overall number of indeterminate responses and better assess the tool’s performance.

The LLM’s outputs were then evaluated based on quality, accuracy, and concordance by three independent researchers and reviewed by two physician adjudicators.

To avoid crossover learning, a phenomenon in which an algorithm may combine previous outputs to create a new one, the researchers started a new ChatGPT session for each question entry.
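The scoring workflow described above can be summarized in a short sketch. The snippet below is purely illustrative: the function names, the toy question set, and the automated grading step are hypothetical (the study relied on human reviewers and physician adjudicators, and queried ChatGPT directly in fresh sessions rather than through code).

```python
import random

# Hypothetical labels mirroring the correct / incorrect / indeterminate coding scheme.
CORRECT, INCORRECT, INDETERMINATE = "correct", "incorrect", "indeterminate"

def ask_chatgpt(question: str) -> str:
    """Placeholder for a fresh-session ChatGPT query.

    Each question in the study was entered in a brand-new session to avoid
    crossover learning; a real implementation would call the model here.
    """
    return random.choice(["answer text", "unclear"])  # stand-in output

def grade(output: str, answer_key: str) -> str:
    """Placeholder grading step; the study used human reviewers, not code."""
    if output == "unclear":
        return INDETERMINATE
    return CORRECT if output == answer_key else INCORRECT

def score_question(question: str, answer_key: str, max_regenerations: int = 2) -> str:
    """Grade one question, regenerating indeterminate outputs up to two times."""
    result = grade(ask_chatgpt(question), answer_key)
    attempts = 0
    while result == INDETERMINATE and attempts < max_regenerations:
        result = grade(ask_chatgpt(question), answer_key)  # new session each time
        attempts += 1
    return result

# Illustrative run over a toy question set (not the actual SASP items).
questions = {"Toy question 1": "answer text", "Toy question 2": "answer text"}
results = [score_question(q, a) for q, a in questions.items()]
print({label: results.count(label) for label in (CORRECT, INCORRECT, INDETERMINATE)})
```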

Overall, the researchers found that ChatGPT provided correct responses to 26.7 percent of open-ended questions and to 28.2 percent of multiple choice questions. The tool generated indeterminate outputs in response to 29.6 percent and 3.0 percent of open-ended and multiple choice questions, respectively.

Of the correct responses, 66.7 percent of open-ended and 94.7 percent of multiple choice outputs were generated on the first attempt; 22.2 percent and 2.6 percent, respectively, on the second; and 11.1 percent and 2.6 percent on the third and final attempt.

While regeneration did decrease indeterminate responses, it did not increase the proportion of correct responses.

ChatGPT performed better on multiple choice questions than on open-ended ones, for which it consistently generated vague, nonspecific responses written at a postgraduate reading level.

For both open-ended and multiple choice questions, ChatGPT offered consistent justifications for its incorrect responses and showed high concordance across both correct and incorrect answers.

These findings indicate that ChatGPT, in its current form, cannot be leveraged in urological education, the researchers concluded. The research team also raised concerns about the tool’s persistent justifications for incorrect responses, which could perpetuate medical misinformation if ChatGPT is not adequately tested and validated by medical professionals.
