
Improving Machine Learning With Big Data-Driven Algorithms

Incorporating big data into algorithms helps machine learning methods avoid shortcuts.

Massachusetts Institute of Technology researchers examined the issue of shortcuts in a popular machine learning method and proposed a solution forcing the model to use more data in its decision-making to avoid pitfalls.

By removing simpler characteristics from the data, the researchers can redirect the model's focus to more complex features it would otherwise miss. They then ask the model to solve the task in two ways: first by focusing on the simpler characteristics, and then by using the complex features. According to the researchers, this approach reduced the occurrence of shortcut solutions and boosted the model's performance.

Through this work, researchers aim to improve the effectiveness of machine learning at identifying disease in medical images and to reduce the number of false diagnoses.

“It is still difficult to tell why deep networks make the decisions that they do, and in particular, which parts of the data these networks choose to focus upon when making a decision,” said Joshua Robinson, a PhD candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

“If we can understand how shortcuts work in further detail, we can go even farther to answer some of the fundamental but very practical questions that are really important to people who are trying to deploy these networks.”

The researchers centered the study around contrastive learning, a powerful form of self-supervised machine learning. With this method, the model is trained using raw data that does not have labeled descriptions from humans, and it can be applied successfully across a wide range of data types.

In contrastive learning models, an encoder algorithm is trained to differentiate between pairs of similar inputs and pairs of dissimilar inputs. This process encodes rich and complex data into representations the learning model can interpret.
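As a rough sketch of this idea (not the researchers' implementation), a contrastive objective such as the widely used InfoNCE loss scores an encoder's embeddings so that a similar ("positive") pair is pulled together while dissimilar ("negative") pairs are pushed apart. The embeddings, temperature, and example values below are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.
    An encoder (not shown) would map raw inputs to these vectors;
    the loss is low when the anchor is much closer to the positive
    than to any of the negatives."""
    def cos(a, b):
        # cosine similarity between two embedding vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_score = np.exp(cos(anchor, positive) / temperature)
    neg_scores = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos_score / (pos_score + neg_scores))

# Toy 2-D embeddings: the positive nearly matches the anchor,
# the negatives point elsewhere, so the loss should be small.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
loss = info_nce_loss(anchor, positive, negatives)
```

Training the encoder to minimize this loss over many pairs is what forces it to learn features that distinguish similar inputs from dissimilar ones.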

The team tested the encoders with a series of images and discovered that they, too, fell into shortcut solutions. In response, the researchers made it more difficult to tell the difference between the similar and dissimilar pairs, and found that doing so changed which features the encoder relied on when making a decision.

“If you make the task of discriminating between similar and dissimilar items harder and harder, then your system is forced to learn more meaningful information in the data, because without learning that it cannot solve the task,” explained Stefanie Jegelka, the X-Consortium Career Development Associate Professor in EECS and a member of CSAIL and IDSS.
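One way to picture "making the task harder" (a hypothetical illustration, not the paper's exact method) is to perturb the positive embedding so that it is slightly less similar to its anchor; an encoder that relied only on an easy, near-identical feature can then no longer solve the pair-matching task. The function name, step size, and vectors below are assumptions for the sketch.

```python
import numpy as np

def perturb_positive(anchor, positive, eps=0.1):
    """Nudge the positive embedding against the anchor direction.
    Since the gradient of dot(anchor, positive) with respect to
    the positive is simply the anchor, stepping against it lowers
    the pair's similarity, making discrimination harder."""
    return positive - eps * anchor / np.linalg.norm(anchor)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
harder_positive = perturb_positive(anchor, positive)
```

Applied during training, perturbations of this kind push the encoder to find additional, more meaningful features rather than settling on the easiest one.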

To test this method, the researchers used images of vehicles, adjusting the color, orientation, and vehicle type to make it difficult for encoders to discriminate between similar and dissimilar pairs of images. The encoder improved its accuracy across all three features simultaneously.

“To see if the method would stand up to more complex data, the researchers also tested it with samples from a medical image database of chronic obstructive pulmonary disease (COPD). Again, the method led to simultaneous improvements across all features they evaluated,” the press release stated.

While the study is critical to understanding the causes of shortcuts and how to fix them, the researchers explained that continuing to refine these methods will pave the way for future advancements.

“This ties into some of the biggest questions about deep learning systems, like ‘Why do they fail?’ and ‘Can we know in advance the situations where your model will fail?’ There is still a lot farther to go if you want to understand shortcut learning in its full generality,” Robinson said.
