Synthetic data for machine learning combats privacy, bias issues
Synthetic data generation for machine learning can combat bias and privacy concerns while democratizing AI for smaller companies with data set issues.
Modern enterprises are inundated with data; however, not all data is usable as is for machine learning. Though an organization may have millions of data points, it could still have data struggles that stunt machine learning.
Turning to synthetic data for machine learning can boost privacy, democratize data, minimize bias in data sets and reduce costs. In practice, real data and synthetic data tend to be used in combination.
"I can't think of any project in the AI space where you wouldn't be able to get a better outcome by leveraging synthetic data," said Kjell Carlsson, principal analyst at Forrester Research. "There is no situation I know of where you have so much real-world data that you wouldn't want to use synthetic data as well."
Address privacy issues
A major benefit of using synthetic data is the creation of data without privacy risks. Enterprises in healthcare and financial services have long had to be careful about how they use personally identifiable information. With broad privacy laws such as the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), however, more businesses are now being regulated.
Stripping the personal information from collected data can be costly and error-prone. In data sets that include personal information, creating alternative identities not attached to a real person can provide a smoother path into machine learning.
"We can create tens of thousands of identities and vary the appearance, environment and camera modalities to create an entire pipeline," said Yashar Behzadi, CEO of Synthesis AI, a computer vision synthetic data generation platform.
"There are millions of images, none of which are based on any one individual," Behzadi said.
Democratize data
In some cases, the data required for a project exists in theory but can be difficult to get. Proprietary data from internal clients, academic studies and labeled data can be too costly or private to use for machine learning training.
"It sometimes takes a while to persuade people to give up their data because they may want to hold onto it until it's published, or they don't want it floating around for anyone to see," said Holly Rachel, co-founder of data consulting firm Rachel + Winfree Consulting.
Researchers could provide others with the synthetic equivalent of their data to democratize its use. Similarly, when an enterprise project is at risk of being scrapped because the data it requires is too time-consuming or costly to collect or label, or too expensive to buy, synthetic data can offer a solution.
"The ideal approach is to programmatically create vast amounts of synthetic training data at a fraction of the cost and time of traditional acquisition methods," said Behzadi.
Combat bias
High-profile AI gaffes have business leaders concerned about bias in their algorithms. Biased data can lead to unintentionally discriminatory outcomes that carry legal, regulatory and reputational risk. While synthetic data for machine learning can help combat bias, developers still need to be aware of what the synthetic data is derived from.
"AI models learn from the training data they are fed. This data is almost always skewed leading to inherent biases related to gender, ethnicity, socioeconomic status, age, etc.," Behzadi said. "The best way to deal with biases is to ensure well-balanced training data from the start."
A common assumption is that synthetic data is inherently unbiased, but this is not necessarily true. When synthetic data is derived from biased data, it can inherit the bias.
"Every data set that you have is biased from the point of view that it's never a perfect representation of the real world," said Carlsson. "Synthetic data allows you to mitigate bias by complementing it with what you haven't seen."
Kickstart machine learning
One of the biggest issues holding back machine learning development is not having enough data to build a good model, said Michael Berthold, CEO of open source software provider KNIME. Working with synthetic data can democratize access and allow more companies to begin implementing machine learning.
"You start with a distribution, and then you have a node that generates data following that distribution," Berthold said. "Then you inject rules [such as] 'I want to build something that simulates a supermarket with 1,000 products for 100,000 customers with a million transactions a day for a year.'" To train at such mass scale requires not only access to data, but millions of data points.
Quite often, it isn't clear at the beginning of a machine learning project what data will be needed to build the model or models. In that case, synthetic data could be used to build a proof-of-concept model. A further benefit is a better understanding of what data the production model would require.
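The approach Berthold describes (start from a distribution, generate data that follows it, then inject rules) can be sketched in a few lines. This is a scaled-down illustration in plain Python rather than a KNIME workflow, and every specific in it is an assumption made for the example: the function name `generate_transactions`, the 1/rank popularity curve, the log-normal prices and the rule that a purchase contains one to five units.

```python
import random


def generate_transactions(n_products, n_customers, n_transactions, n_days, seed=0):
    """Generate synthetic supermarket transactions (illustrative only)."""
    rng = random.Random(seed)

    # Assumed distribution 1: product popularity falls off as 1/rank,
    # so a handful of products account for most purchases.
    popularity = [1.0 / (rank + 1) for rank in range(n_products)]

    # Assumed distribution 2: unit prices are log-normal, floored at 0.01.
    prices = [
        max(0.01, round(rng.lognormvariate(1.0, 0.5), 2))
        for _ in range(n_products)
    ]

    # Draw the product for every transaction from the popularity weights.
    products = rng.choices(range(n_products), weights=popularity, k=n_transactions)

    transactions = []
    for tx_id, product in enumerate(products):
        # Injected rule (hypothetical): a purchase holds 1 to 5 units.
        quantity = rng.randint(1, 5)
        transactions.append({
            "transaction_id": tx_id,
            # Spread transactions evenly across the simulated days.
            "day": tx_id * n_days // n_transactions,
            "customer_id": rng.randrange(n_customers),
            "product_id": product,
            "quantity": quantity,
            "total": round(prices[product] * quantity, 2),
        })
    return transactions


if __name__ == "__main__":
    # Far smaller than the supermarket in the quote, to keep the run quick.
    sample = generate_transactions(
        n_products=100, n_customers=1_000, n_transactions=10_000, n_days=30
    )
    print(sample[0])
```

Fixing the seed makes the generated data set reproducible, which is useful when a proof-of-concept model needs to be rebuilt against the same inputs.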
According to Carlsson, synthetic data can also be used to accelerate innovation, such as developing a solution in a way the company hasn't tried before.
Synthetic data isn't perfect
Synthetic data can help overcome some of the issues with real data, but it isn't a substitute for human data analysis. For example, if the original data indicated that 10% of hospital patients are pregnant at any given time, but the data scientist or analyst overlooked the fact that only women can be pregnant, the synthetic data could assign pregnancies to male patients, and the end result would be a flawed model.
Other problems with synthetic data may include failing to replicate signals that are present in the original data set or, conversely, introducing signals that don't exist in it. Overfitting may also result if a small data set is used to generate a much larger synthetic data set.
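One simple way to look for lost or invented signals is to compare how pairs of features correlate in the real data versus the synthetic data. The sketch below is only one possible check, written under assumptions: the function names `pearson` and `signal_drift`, the record layout and the 0.2 tolerance are all illustrative, and Pearson correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # A constant column carries no linear signal.
    return cov / (spread_x * spread_y)


def signal_drift(real, synthetic, feature_pairs, tolerance=0.2):
    """Return feature pairs whose correlation differs between data sets.

    Hypothetical check: the tolerance is an arbitrary illustrative value.
    """
    drifted = {}
    for a, b in feature_pairs:
        r_real = pearson([row[a] for row in real], [row[b] for row in real])
        r_syn = pearson([row[a] for row in synthetic], [row[b] for row in synthetic])
        # A large gap means a signal was lost, weakened or invented.
        if abs(r_real - r_syn) > tolerance:
            drifted[(a, b)] = (r_real, r_syn)
    return drifted
```

An empty result does not prove the synthetic data is faithful. It only means this particular linear check found no large discrepancy.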
"If I have 10 people with a particular kind of condition, and we suspect there are thousands like them out there, I could create synthetic data based on the population of 10, but it looks nothing like [the sample of 10] because I used a biased sample," said Carlsson.
"I could also have ended up creating [the synthetic data] in such a way that we can now reverse-engineer who those people are because there are only a small number of people who have these comorbidities occurring in tandem and they're all in New York state," he added.
Synthetic data saves time and money, and it can help organizations address privacy and bias issues as well as gaps in data. However, data-related best practices still apply, and developers need to be aware of the unique problems when working with synthetic data.