How to create a data set for machine learning with limited data
A shortage of data for machine learning training sets can halt a company's AI development in its tracks. Turning to external sources and hidden data can solve the problem.
Most companies remain in the research and development phase of AI implementation, and one reason why few have actual AI deployments is that data science teams are facing data shortages. Analysts agree that the more data you have, the better trained your models will be. So how does a data shortage factor in when determining how to create a data set for machine learning? The solution may be to look for data in unique places and pull from research and prior collection.
At the recent AI World Conference & Expo, data scientist Madhu Bhattacharyya, managing director of enterprise data and analytics at global consultancy firm Protiviti, talked about internal data shortages, mitigating bias and the importance of external data collection.
Editor's note: The following has been edited for clarity and brevity.
What are some tips for how to create a data set for machine learning if you have limited internal data?
Madhu Bhattacharyya: In reality, the more data you have, the better the model is. From a prediction perspective, accuracy also increases with more data.
So if the data you have is very lean, or you're a company that doesn't have enough data, but wants to come up with insights, you need to figure out a way to gather additional data -- through analytics, analysis, data multiplication or data mining exercises.
Say you're a startup, or you're a company developing a new product. There will be some data which will be available right away, because before you start up with something, you do a lot of research. Nothing starts off out of the blue. Before releasing any product or service, think of what you do that collects data. You check for viability, you check for market penetration, and you check for potential ROI.
If you're selling a product, a platform as a service or a service, even before you generate your own data, you will have the initial market data that you researched prior to launching the product. How did you identify your potential customer? How did you identify that you need to have the launch in Boston versus in Dallas, for example? All of that information that helped you strategize multiple angles before the launch of the product is useful for building models and creating a data pipeline.
Don't restrict yourself only to internal data. Try to bring in relevant external data where appropriate (i.e., social media, credit reports, etc.). Ideally, you want a huge amount of data to fall back on from an amalgamation of both internal and external data sources. This data actually makes models and AI training much more robust from a decision-making perspective.
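As an illustration that is not part of Bhattacharyya's remarks, combining internal and external sources can be as simple as joining records on a shared key. The sketch below is a minimal example in plain Python. The field names (`customer_id`, `credit_band`) and the `left_join` helper are hypothetical, and it assumes both sources identify customers the same way.

```python
def left_join(internal_rows, external_rows, key):
    """Attach external attributes to each internal record that shares `key`.

    Internal records are always kept. Where both sources define the same
    field, the internal value wins.
    """
    external_by_key = {row[key]: row for row in external_rows}
    merged = []
    for row in internal_rows:
        extra = external_by_key.get(row[key], {})
        merged.append({**extra, **row})
    return merged


# Hypothetical internal sales records and purchased external attributes.
internal = [
    {"customer_id": "c1", "purchases": 4},
    {"customer_id": "c2", "purchases": 1},
]
external = [
    {"customer_id": "c1", "credit_band": "A"},
]

combined = left_join(internal, external, "customer_id")
for record in combined:
    print(record)
```

Records with no external match are kept with their internal fields only, so the gaps stay visible instead of silently shrinking the data set.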
Additionally, as data scientists, we need to check for data bias from the very outset, when we are first bringing in the data. Do data cleansing to check for data quality. Make sure that your data is not duplicated, so that each line item is unique and doesn't appear multiple times. Apply variable reduction to make sure that you have the right set of data.
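The duplicate check can be sketched in a few lines of plain Python. This is an illustrative example, not a method described in the interview. The `drop_duplicate_rows` helper and the order fields are hypothetical, and it assumes a record counts as a duplicate when its key fields match an earlier record.

```python
def drop_duplicate_rows(rows, key_fields):
    """Keep the first occurrence of each record, judged by `key_fields`."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical line items; order 1 was loaded twice.
orders = [
    {"order_id": 1, "customer": "A", "amount": 20.0},
    {"order_id": 2, "customer": "B", "amount": 35.5},
    {"order_id": 1, "customer": "A", "amount": 20.0},
]

unique_orders = drop_duplicate_rows(orders, ["order_id"])
print(len(orders), "rows in,", len(unique_orders), "rows out")
```

Which fields define uniqueness is a business decision. Matching on `order_id` alone is only one possible choice.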
Most companies think that when you get external data, you don't have control over it, but when you buy or acquire data, there is an expectation that the data is unbiased and clean.
Then you have your own internal data that you work with, and that is where you can actually have your data quality checks in place. When you try to build analytical models using both internal and external data, the first thing to do is identify the data that you want to use for the model and check it for multicollinearity. If there are five variables which are interdependent -- which means they are correlated and the presence of one would mean the presence of the others -- then we keep one and we remove the others because, of course, we don't want to bring in that bias.
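One simple way to approximate this "keep one, remove the others" step is to drop any variable that is strongly correlated with one already kept. The sketch below is an assumption-laden illustration, not Bhattacharyya's procedure. It uses pairwise Pearson correlation, which is a simplification of a full multicollinearity check. The 0.9 threshold, the `prune_correlated` helper and the feature names are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.9):
    """Return column names to keep, skipping any column whose absolute
    correlation with an already-kept column reaches `threshold`."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features: annual_spend is just monthly_spend times 12.
features = {
    "monthly_spend": [10, 20, 30, 40, 50],
    "annual_spend": [120, 240, 360, 480, 600],
    "support_tickets": [3, 1, 4, 1, 5],
}

kept = prune_correlated(features)
print(kept)
```

The first variable in each correlated group survives here purely because of ordering. In practice, the choice of which one to keep would normally rest on domain knowledge or data quality.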
In talking about data quantity, you don't have to work with every single variable in the data set. Break it down to whatever is relevant and significant for that particular model or whatever solves that particular business objective. Bring in whatever clean data you have and realize what model building you can perform with your existing data and the external data that you have. With that concept, we can actually build models or algorithms, while doing more analysis and data mining, to come up with insights.