AI Chatbots Provide Inconsistent Musculoskeletal Health Information

Multiple studies indicate that ChatGPT, Google Bard, and BingAI demonstrate limited accuracy when presenting information about orthopedic procedures.

Artificial intelligence (AI) chatbots like ChatGPT, Google Bard, and BingAI provide information about musculoskeletal health with inconsistent accuracy, according to recent studies presented at the 2024 Annual Meeting of the American Academy of Orthopaedic Surgeons (AAOS).

As large language model (LLM) chatbots become more popular, researchers have raised concerns about how these tools should be used in medicine. AI chatbots have shown promise in tasks like data processing and supporting patient education, but they also carry significant ethical and legal risks.

Many have emphasized that AI tools could supplement the expertise of medical professionals and improve care, but the extent to which they can do so is still being investigated across medical specialties.

The three studies presented at the AAOS meeting sought to explore the validity and accuracy of musculoskeletal health information conveyed by popular AI chatbots. Overall, the researchers found that chatbots can generate concise summaries of information about orthopedic conditions and procedures, but each was limited in one or more information categories.

The first study, “Potential misinformation and dangers associated with clinical use of LLM chatbots,” presented by a team from Weill Cornell Medicine, investigated how well ChatGPT-4, Google Bard, and BingAI can explain orthopedic concepts, address patient queries, and integrate clinical information into responses.

The chatbots were tasked with answering 45 questions across "Bone Physiology," "Referring Physician," and "Patient Query" categories. Each chatbot’s response was then assessed for accuracy on a zero to four scale by two independent, blinded reviewers.
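The article does not reproduce the study's scoring rubric or data, but the aggregation it describes is straightforward: each question's two blinded reviewer scores are averaged, then summarized by category. A minimal sketch in Python, using hypothetical ratings purely for illustration, might look like this:

```python
from statistics import mean
from collections import defaultdict

# Hypothetical ratings: (category, question_id) -> [reviewer_1_score, reviewer_2_score],
# each on the study's zero-to-four accuracy scale. Values are illustrative only.
ratings = {
    ("Bone Physiology", 1): [4, 3],
    ("Bone Physiology", 2): [2, 2],
    ("Referring Physician", 1): [3, 4],
    ("Patient Query", 1): [1, 2],
}

# Average the two blinded reviewers' scores for each question.
per_question = {key: mean(scores) for key, scores in ratings.items()}

# Summarize average accuracy by question category.
by_category = defaultdict(list)
for (category, _qid), avg in per_question.items():
    by_category[category].append(avg)

for category, avgs in sorted(by_category.items()):
    print(f"{category}: mean accuracy {mean(avgs):.2f} across {len(avgs)} questions")
```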

Analysis of the chatbots’ responses revealed that each tool included the most critical salient points in only a portion of its answers: ChatGPT in 76.6 percent of cases, Google Bard in 33 percent, and BingAI in 16.7 percent.

Every chatbot was also severely limited in its ability to provide clinical management suggestions, often omitting workup steps and deviating from the standard of care.

ChatGPT and Google Bard could provide mostly accurate responses to less complex patient queries, but failed to request key medical information required to provide a complete response.

The second study, “Is ChatGPT ready for prime time? Assessing the accuracy of AI in answering common arthroplasty patient questions,” presented by researchers from Connecticut Orthopaedics, evaluated the chatbot’s ability to respond to 80 questions about hip and knee replacements.

Each query was presented to ChatGPT twice: first, asking the question as written, and then requesting that the chatbot respond as an orthopedic surgeon. Members of the research team then scored the accuracy of each response from one to four.
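The study's exact prompts and model settings are not included in the article, but the two-pass querying it describes could be sketched as follows, assuming the OpenAI Python client, the GPT-4 chat model, and an illustrative persona instruction:

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str, as_surgeon: bool) -> str:
    """Pose a patient question either verbatim or with a surgeon persona prompt."""
    messages = []
    if as_surgeon:
        # The study's exact wording is not published; this instruction is illustrative.
        messages.append({
            "role": "system",
            "content": "Respond as an orthopedic surgeon counseling a patient.",
        })
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content


question = "How long does recovery take after a total knee replacement?"
unprompted_answer = ask(question, as_surgeon=False)
prompted_answer = ask(question, as_surgeon=True)
# Each pair of answers would then be scored from one to four by the research team.
```

In this sketch, the only difference between the two passes is the system message, mirroring the study's comparison of answers given with and without the surgeon persona.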

When the questions were asked without an additional prompt, approximately 26 percent of the chatbot’s responses received an average rating of three on the scale, indicating that the answer was partially accurate but incomplete. When asked with the surgeon prompt, only eight percent of responses had an average rating of less than three.

The tool performed significantly better when prompted to respond as a surgeon, reaching 92 percent accuracy. However, ChatGPT’s limitations led the research team to conclude that it is not an appropriate resource for patients and that an orthopedic-focused chatbot should be developed.

The third study, “Can ChatGPT 4.0 be used to answer patient questions concerning the Latarjet procedure for anterior shoulder instability?” presented by the Hospital for Special Surgery, sought to explore the tool’s potential as an adjunct for clinicians.

The research team conducted a Google search for "Latarjet" to determine the top ten frequently asked questions and sources pulled by the search engine in relation to the procedure. The researchers then asked ChatGPT to perform the same search to identify questions and sources.

Google provided a small percentage of academic sources related to Latarjet procedures, pulling most of its information from larger medical practices and surgeons’ personal websites. In contrast, ChatGPT provided a range of clinically relevant information, all of which was taken from academic sources.

The teams behind each study underscored that their findings are key to understanding the efficacy and potential future applications of AI chatbots in orthopedics.

ChatGPT has been shown to effectively answer patient questions, highlighting its promise for patient education, but the integration of AI in healthcare presents a host of challenges for patients and providers.

Patient trust is a major hurdle that healthcare organizations must tackle before the widespread adoption of these tools. Recent reports indicate that roughly 50 percent of patients don’t trust chatbot-provided medical advice, instead preferring to defer to their providers.

However, trust could be bolstered by ensuring that AI tools take a “human in the loop” approach and by informing patients that these technologies are guided by medical professionals.

But as AI continues to shape patient engagement, health systems will be forced to navigate the data privacy and HIPAA compliance concerns around chatbots.
