Clean data for machine learning is key to successful AI
Many enterprises assume their AI projects should start with massive amounts of data, but clean data for machine learning should be their first step toward AI.
Since the dawn of AI, the emphasis has been on big data: If you want to automate a process, make sure you have massive amounts of data to train an algorithm.
However, the conversation around big data is now changing to the importance of clean data for machine learning. With machine learning bias errors making the news, it's becoming critical for enterprise AI projects to start with clean training data to produce relevant results.
From tech giants to startups, enterprise users at the O'Reilly AI Conference recounted issues they faced when managing giant data sets and explained how they found a path to success with clean data.
Big data complications
Most of the use cases described at the conference focused on problems such as unwieldy data sets, post-implementation data cleanup and algorithmic errors that enterprises encountered before they began to clean data for machine learning programs.
Jeff Thompson, assistant professor and director of visual arts and technology at the Stevens Institute of Technology in New Jersey, described his Empty Apartments art project, which used machine learning models to collect images of empty apartments from rental listings and sort them algorithmically by similarities in light, setup and image density. The goal of the project was to create an overall snapshot of the state of American rental homes.
Starting the machine learning training process with clean and targeted data -- in this case, photos of empty apartments on Craigslist rather than all of the photos on Craigslist -- enabled Thompson to create a manageable project that could correlate photos, sort them according to commonality and illustrate larger themes.
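Thompson's exact tooling wasn't detailed, but a minimal sketch of the idea -- reducing each photo to a few visual features and grouping similar ones -- might look like the following. The folder name, feature choices and cluster count are assumptions for illustration, not the project's actual code.

```python
# Illustrative sketch: sort listing photos by simple visual features
# (average brightness and edge density) and group similar images with k-means.
from pathlib import Path

import numpy as np
from PIL import Image, ImageFilter
from sklearn.cluster import KMeans

def image_features(path: Path) -> np.ndarray:
    """Return [mean brightness, edge density] for one image."""
    img = Image.open(path).convert("L").resize((256, 256))
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    edges = np.asarray(img.filter(ImageFilter.FIND_EDGES), dtype=np.float32) / 255.0
    return np.array([pixels.mean(), edges.mean()])

photos = sorted(Path("apartment_photos").glob("*.jpg"))  # assumed folder of listing images
features = np.vstack([image_features(p) for p in photos])

# Group the photos into a handful of visually similar clusters.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
for path, label in zip(photos, labels):
    print(label, path.name)
```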
Patrick Kaifosh, chief science officer at CTRL-labs, a New York-based startup developing a neural interface between humans and machines, has deployed machine learning models in a range of settings. In a clinical setting, for example, sensors attached to the skin pick up neural signals and transmit them to a computer, where the data is labeled and analyzed. At each of those stages, big data collection creates problems.
Primarily, because clinical neural interfaces have to be tailored to each individual's brain, neural pathways and outputs, data collection is essentially single-use. Training models is complicated because it's incredibly difficult to extract clean, reusable data from such personalized recordings.
Additionally, Kaifosh said that creating a complex data flow -- sensors that pick up neural activity from muscles and then send it to a computer -- means the sensors often gather far more data points than are relevant to the user. He aims to get clean, tailored data by reducing the number of connections and collecting smaller amounts of data.
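As an illustration of that idea -- keeping only the sensor channels that actually matter -- the hedged sketch below scores each channel against labeled outputs and discards the rest. The data shapes, channel counts and labels are invented for the demo; this is not CTRL-labs' pipeline.

```python
# Illustrative sketch: given multi-channel sensor recordings, keep only the
# channels most predictive of the labeled outcome, so downstream training
# sees a smaller, cleaner data set.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples, n_channels = 1000, 64          # 64 raw sensor channels (assumed)
X = rng.normal(size=(n_samples, n_channels))
y = rng.integers(0, 3, size=n_samples)    # assumed output labels

# Make a few channels genuinely informative for the demo.
X[:, :4] += y[:, None] * 0.8

# Keep only the 8 channels that score highest against the labels.
selector = SelectKBest(score_func=f_classif, k=8)
X_reduced = selector.fit_transform(X, y)
print("kept channels:", np.flatnonzero(selector.get_support()))
print("reduced shape:", X_reduced.shape)
```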
CTRL-labs' hope for the future is to pick up neural activity directly and send it to the computer to eliminate noise, Kaifosh said.
Kim Hazelwood, a senior engineering manager of AI infrastructure research at Facebook, spoke about Facebook's initial issues when tailoring results to the context and desired outcome of a program.
"Leveraging massive data is one of the greatest challenges seen when trying to scale machine learning to all users," she said.
Hazelwood said Facebook approaches AI implementations with three steps -- collecting unstructured data and preparing it for models, training models, and deploying models into production.
During the first step, data engineers have to clean the data for machine learning and optimize it for tools to perform tasks like automatic text translation and facial detection. The constant training of models, the variety of AI tools used and the different output requirements mean that Hazelwood and her team are constantly changing data set requirements and cannot work from one massive data haul.
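A minimal sketch of that first step -- turning raw, unstructured records into something a model can train on -- might look like the following. The field names and cleaning rules are assumptions for illustration, not Facebook's pipeline.

```python
# Illustrative data-preparation step: drop empty records, normalize text,
# remove duplicates and filter out junk fragments before training.
import pandas as pd

raw = pd.DataFrame({
    "text": ["Hello world", "hello world", None, "  Bonjour  ", "ok"],
    "lang": ["en", "en", "en", "fr", "en"],
})

cleaned = raw.dropna(subset=["text"])                       # drop records with no usable text
cleaned["text"] = cleaned["text"].str.strip().str.lower()   # normalize case and whitespace
cleaned = cleaned.drop_duplicates(subset=["text"])          # remove exact duplicates
cleaned = cleaned[cleaned["text"].str.len() > 2]            # filter out junk fragments
print(cleaned)
```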
While experts at Facebook have the time and resources to undertake massive data labeling, cleaning and optimizing, other enterprises that want to build their own machine learning products are coming up against a massive roadblock.
Automation squared
Data scientists are hard to train and expensive to hire, said Ruchir Puri, chief scientist at IBM Research. Instead of counting on data scientists, his approach is to create training data with an automated process he calls Auto ML.
Instead of amassing giant amounts of data that require data scientists to analyze, label and optimize every feature, Puri said, having programs create and label small amounts of data makes AI projects more sustainable and removes the need to hand-train algorithms, a barrier for smaller companies with fewer resources.
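A hedged sketch of that pattern -- programs, rather than people, labeling the training data -- is below, using simple rule-based labeling functions that vote on unlabeled text. The rules and examples are invented for illustration; this is not IBM's Auto ML implementation.

```python
# Illustrative programmatic labeling: rule-based labeling functions vote on
# each unlabeled example, and the majority label becomes training data.
from collections import Counter

ABSTAIN = None

def lf_positive_words(text):
    return "positive" if any(w in text.lower() for w in ("great", "love")) else ABSTAIN

def lf_negative_words(text):
    return "negative" if any(w in text.lower() for w in ("broken", "awful")) else ABSTAIN

def lf_exclamation(text):
    return "positive" if text.endswith("!") else ABSTAIN

LABELING_FUNCTIONS = [lf_positive_words, lf_negative_words, lf_exclamation]

def auto_label(texts):
    """Return (text, label) pairs where at least one labeling function fired."""
    labeled = []
    for text in texts:
        votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
        if votes:
            label, _ = Counter(votes).most_common(1)[0]
            labeled.append((text, label))
    return labeled

unlabeled = ["I love this phone!", "Screen arrived broken", "Battery life is fine"]
print(auto_label(unlabeled))
```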
Christopher Ré, associate professor at Stanford University, said he thinks the journey to democratizing AI for broader enterprise deployment begins with clean data. Creating a smaller data set tailored to the specific algorithm being trained, in light of its potential biases, outcomes and use cases, can enable enterprises to save time and money.
"People spend years collecting data, and every point is as expensive as the last," Ré said.
While hardware, models, programs and interfaces are all widely available to those who want to implement AI in the enterprise, Ré said that training data is not. The massive stores of unlabeled data that enterprises have collected are not directly usable for training probabilistic models.