AI May Be More Prone to Errors in Image-Based Diagnoses Than Clinicians
New research indicates that AI may be more prone than humans to errors in image-based medical diagnoses because of the features it uses for analysis.
Researchers have found that deep neural networks (DNNs) make mistakes in image-based medical diagnoses that humans are less likely to make, and they hypothesize that these mistakes may indicate that clinicians and artificial intelligence (AI) use different features for analysis when looking at medical images.
According to a study published in Scientific Reports, DNNs can fail at image-based medical diagnosis tasks because their predictions can be driven by features unrelated to the underlying pathology of the condition they are designed to diagnose. For example, one AI skin classifier learned to associate surgical skin markings with malignant melanoma, which increased the classifier’s false positive rate by 40 percent, according to the researchers.
Since clinicians use their medical knowledge to make predictions and diagnose patients, while DNNs do not, the researchers designed their study to determine whether DNNs use different features than humans in image-based medical diagnoses. To compare the two, they applied Gaussian low-pass filters at nine increasing severities to blur mammogram images evaluated by both DNNs and radiologists screening for breast cancer.
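The study’s exact filter parameters are not reproduced here, but the underlying operation is simple: a spatial Gaussian blur acts as a Gaussian low-pass filter, and raising its sigma strips away progressively more high-frequency detail. Below is a minimal Python sketch of that idea, assuming scipy is available; the nine sigma values are illustrative stand-ins for the study’s severities, not the authors’ actual settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def low_pass_series(image: np.ndarray, sigmas) -> list[np.ndarray]:
    """Return progressively blurred copies of a grayscale image.

    A spatial Gaussian blur is a Gaussian low-pass filter:
    a larger sigma removes more high-frequency detail.
    """
    return [gaussian_filter(image.astype(np.float32), sigma=s) for s in sigmas]

# Nine illustrative severities (the study's actual parameters may differ).
SIGMAS = [1, 2, 4, 8, 16, 24, 32, 48, 64]

mammogram = np.random.rand(512, 512)  # placeholder for a real mammogram
filtered_versions = low_pass_series(mammogram, SIGMAS)
```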
To evaluate the clinicians, the researchers applied the various low-pass filters to 720 sets of mammogram images and presented them to 10 radiologists, who then indicated whether microcalcifications or soft tissue lesions were present in each breast. The DNNs were given the same task, and both were evaluated on their predictive confidence and the correctness of their predictions.
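The article does not spell out how those two scores were computed; one plausible reading for the DNN side is sketched below, where confidence is the probability the model assigns to its predicted class and correctness is plain accuracy. Both definitions are assumptions for illustration, not the study’s stated metrics.

```python
import numpy as np

def confidence_and_correctness(probs: np.ndarray, labels: np.ndarray):
    """Score binary predictions on mean confidence and accuracy.

    `probs` are predicted probabilities that a finding is present;
    `labels` are ground-truth 0/1 annotations. (Illustrative
    definitions -- the study's exact metrics may differ.)
    """
    preds = (probs >= 0.5).astype(int)
    # Confidence = probability assigned to whichever class was predicted.
    confidence = np.where(preds == 1, probs, 1.0 - probs)
    correctness = (preds == labels).mean()
    return confidence.mean(), correctness

# Example: model outputs for ten images with known labels.
probs = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3, 0.95, 0.55])
labels = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1])
print(confidence_and_correctness(probs, labels))
```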
The researchers found that low-pass filtering decreased the predictive confidence of both the radiologists and the DNNs on images with microcalcifications. However, the effect plateaued for the radiologists, while the DNNs continued to lose confidence as the filters became more severe.
Low-pass filtering decreased the correctness of predictions for both radiologists and DNNs as well.
Thus, radiologists and DNNs were both sensitive to low-pass filtering on images with microcalcifications, so the researchers could not conclude that humans and machines use different features to detect them.
When considering images with soft tissue lesions, low-pass filtering decreased the predictive confidence and correctness of the DNNs while having almost no effect on the radiologists. This indicated that the DNNs were sensitive to the perturbation in a way the radiologists were not, suggesting that the two rely on different features to detect soft tissue lesions. The difference could stem from biases built into DNNs, such as their known tendency to weight texture over shape, the researchers said.
The researchers also compared the degree to which radiologists and DNNs agree on the most suspicious regions of an image, which may further indicate the use of different features. For this, seven radiologists were asked to annotate 120 sets of exam images from the original 720 and indicate up to three regions of interest (ROIs) containing the most suspicious features in each image. Low-pass filters were then applied to the ROIs and the entire images.
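This interior-versus-exterior comparison can be pictured as selectively blurring only the pixels inside, or only the pixels outside, the annotated regions. The sketch below is a schematic reconstruction of that setup, not the authors’ code; the function name, sigma, and ROI coordinates are all hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_region(image, roi_mask, sigma, region="interior"):
    """Blur only inside or only outside the annotated ROIs.

    `roi_mask` is a boolean array that is True inside the regions a
    radiologist marked as suspicious. Blurring the interior tests
    whether a model depends on the features clinicians use; blurring
    the exterior tests whether it depends on features elsewhere.
    """
    blurred = gaussian_filter(image.astype(np.float32), sigma=sigma)
    target = roi_mask if region == "interior" else ~roi_mask
    return np.where(target, blurred, image)

image = np.random.rand(512, 512)          # placeholder mammogram
roi = np.zeros_like(image, dtype=bool)
roi[200:260, 180:240] = True              # one hypothetical annotated ROI

interior_filtered = filter_region(image, roi, sigma=8, region="interior")
exterior_filtered = filter_region(image, roi, sigma=8, region="exterior")
```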
At mild filter severities, the researchers found that DNNs relied on the same regions that radiologists found suspicious to detect microcalcifications. At high severities, however, DNN correctness decreased when the ROI exteriors were filtered, indicating that the DNNs also drew on features outside the annotated regions.
Filtering ROI interiors decreased DNN correctness for soft tissue lesions, but to a lesser extent than when filtering the entire image. This indicates that DNNs use features in regions that radiologists find suspicious, but they do so to a lesser degree than the clinicians.
Further, filtering ROI exteriors had a similar effect on correctness as filtering an entire image. These findings suggest that the features DNNs use for predictions may be scattered across an image rather than localized in regions clinicians find suspicious.
The researchers separated the analyses of microcalcifications and soft tissue lesions to avoid artificially inflating the similarity between human and machine perception, a flaw they found in other research.
They concluded that more research that does not artificially inflate similarities between clinicians and AI is needed to improve medical imaging and diagnosis.