Clean data is the foundation of machine learning

Clean data is crucial to achieving accurate, consistent and reliable machine learning models. With the right prep techniques, teams can improve data quality and model outcomes.

Machine learning projects can succeed or fail based on a single, seemingly simple factor: data quality. Data scientists and engineers have recognized -- often through hard-won lessons -- that this critical element can dramatically impact a model's predictive power and real-world practicality.

Most tech professionals are familiar with the phrase "garbage in, garbage out," which dates back to the 1950s. Accordingly, most projects aim to use clean data. But defining the term clean data can be a challenge due to common misunderstandings such as the following:

  • Clean data is not always error free. Achieving completely error-free data is impractical; instead, the goal should be to minimize errors to a level that does not significantly impact outcomes.
  • Data cleaning is not a one-time activity. Rather, it's a process that needs revisiting as new data is collected and developers' understandings evolve.
  • Clean data doesn't guarantee model accuracy. Model accuracy depends on factors like algorithm choice and feature relevance, not just data cleanliness.
  • Data cleaning can't remove all biases. Cleaning data can reduce certain biases, but it cannot eliminate biases intrinsic to the data collection process. For example, using health data from smartwatches or fitness trackers will create a model based on people who are already health conscious and monitoring their fitness.

Even with these caveats in mind, data quality is still essential to machine learning work. Machine learning engineers and data scientists need a thorough understanding of how data can become unclean and what they can do about it.

What is clean data?

Clean data is consistent, accurate and free of errors or outliers that could negatively affect the model's learning process. A clean data set should have few missing values, no duplicate records and no irrelevant information. Proper data cleaning therefore removes or corrects inaccuracies, handles missing values intelligently and standardizes data formats.

Another important aspect of clean data in supervised machine learning -- where the model is told what to look for -- is ensuring that data is tagged with the correct outcome or label. Labeled data provides a clear guide for the model's learning process. For example, in a data set used to train a model to recognize animals in images, each image (the data) should be labeled with the correct name of the animal in the image (the label).
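
In its simplest form, labeled data is just a pairing of each example with its expected output. The file names and labels in this minimal Python sketch are hypothetical placeholders:

```python
# Hypothetical labeled training data for an animal classifier:
# each example pairs an input (the image) with its label (the animal).
labeled_images = [
    ("images/photo_001.jpg", "cat"),
    ("images/photo_002.jpg", "dog"),
    ("images/photo_003.jpg", "horse"),
]

for image_path, label in labeled_images:
    print(f"{image_path} -> {label}")
```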

In all cases, training a machine learning model involves presenting the algorithm with a data set from which it can learn, thus equipping it to forecast outcomes or make choices when presented with new data. With clean data, the model can more easily discover patterns and structures.

The four pillars of an effective machine learning model

Using high-quality, clean data during training processes can help achieve four critical goals in machine learning: accuracy, robustness, fairness and efficiency.

In practice, the significance of each pillar varies depending on the model's specific application or context. For example, in healthcare, model developers might prioritize accuracy and fairness, whereas efficiency and robustness could be more critical for real-time applications. Nevertheless, a good model should achieve all the following goals to some degree.

1. Accuracy

An accurate machine learning model makes predictions or classifications that closely match observed outcomes in the real world. Accuracy is a critical metric for evaluating machine learning models' performance, especially for supervised learning tasks where the actual outcomes are known.
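
At its simplest, accuracy is the fraction of predictions that match the known outcomes. A minimal sketch, using illustrative labels:

```python
# Accuracy: the share of predictions that match known outcomes.
# The labels below are illustrative placeholders.
y_true = ["cat", "dog", "dog", "cat", "horse"]
y_pred = ["cat", "dog", "cat", "cat", "horse"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.2f}")  # 0.80
```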

2. Robustness

A robust model performs well across a wide range of inputs and can handle variations in the data without significant loss in accuracy. Clean, well-prepared data contributes to a robust model by ensuring that it is trained on accurate, representative samples of the real-world phenomena it aims to predict or classify.

3. Fairness

Fairness is a particular case of robustness where the model can handle the diverse scenarios and demographics it encounters. For this to be possible, the training data must represent those demographics and scenarios. Similarly, accurate and unbiased labeling helps ensure that the model does not learn incorrect or discriminatory patterns. For instance, in a facial recognition system, a balanced representation and correct labeling of faces from various ethnic backgrounds help minimize racial bias.
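
One common sanity check is to tally how many training examples each demographic group contributes before training begins. A brief sketch, using hypothetical records and a placeholder `group` attribute:

```python
from collections import Counter

# Hypothetical training records; "group" stands in for whatever
# demographic attribute needs balanced representation.
records = [
    {"image": "face_001.jpg", "group": "A"},
    {"image": "face_002.jpg", "group": "A"},
    {"image": "face_003.jpg", "group": "A"},
    {"image": "face_004.jpg", "group": "B"},
]

counts = Counter(r["group"] for r in records)
total = len(records)
for group, n in counts.most_common():
    print(f"Group {group}: {n} examples ({n / total:.0%})")
```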

4. Efficiency

When data is clean and accurately labeled, a model can more easily learn underlying patterns without being confused by errors or noise. Clean data reduces the need for extensive preprocessing and cleansing, which saves resources and time. In addition, models trained on high-quality data converge to optimal solutions faster, requiring less computational power and reducing overall model development costs. Because the computing costs of machine learning and AI are now a critical factor for many businesses, this advantage of clean data is an increasingly important consideration.

The flip side: Unclean data

Good-quality data has clear advantages, but to see the complete picture, it's also important to consider the downsides of bad data. What if your data is just not that good?

There are nearly endless ways in which data can be unclean. For example, an address might be incorrectly formatted, or it might be correctly formatted but represent a nonexistent location. Both are errors, but of different sorts, and each might be more or less critical for different models. However, three common data quality issues are problematic whenever they occur: inaccuracy, inconsistency and incompleteness.

1. Inaccuracy

Inaccuracies are incorrect or misleading data points that do not accurately represent the real-world entity or event being modeled. These inaccuracies can arise from various sources, including human error during data entry, errors in data collection processes, or malfunctioning sensors and equipment.

Consider a data set intended for a model that predicts house prices, where the square footage of several homes has been recorded inaccurately due to typos or misinterpreted unit conversions. If the faulty data is left uncorrected, the model could learn an incorrect relationship between house size and price, leading to unreliable price predictions.
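
One lightweight way to catch such errors is a plausibility check that flags values outside a sensible range before training. The bounds and records in this sketch are hypothetical:

```python
# Flag implausible square-footage values before the model sees them.
# The bounds are hypothetical; in practice, derive them from domain knowledge.
MIN_SQFT, MAX_SQFT = 100, 20_000

houses = [
    {"id": 1, "sqft": 1_450, "price": 320_000},
    {"id": 2, "sqft": 14, "price": 310_000},         # likely a typo
    {"id": 3, "sqft": 1_500_000, "price": 500_000},  # likely a unit error
]

suspect = [h for h in houses if not MIN_SQFT <= h["sqft"] <= MAX_SQFT]
for h in suspect:
    print(f"House {h['id']}: suspicious square footage {h['sqft']}")
```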

2. Inconsistency

Inconsistent data lacks a uniform format or standard across the data set, which can create confusion and lead to incorrect interpretations. For example, in a data set containing customer information from different countries, dates might be formatted MM/DD/YYYY in the U.S., but DD/MM/YYYY in Europe. Without standardizing these formats, the model could interpret the day as the month and vice versa, leading to errors in recognizing seasonal purchasing behaviors or trends.
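
The usual remedy is to parse every date into one canonical representation at ingestion. A minimal sketch using Python's standard library, assuming each record's source region -- and therefore its date convention -- is known:

```python
from datetime import datetime

# Hypothetical order records tagged with their source region, which
# tells us which date convention to use when parsing.
orders = [
    {"region": "US", "date": "03/04/2024"},  # U.S. convention: March 4
    {"region": "EU", "date": "03/04/2024"},  # European convention: April 3
]

FORMATS = {"US": "%m/%d/%Y", "EU": "%d/%m/%Y"}

for order in orders:
    parsed = datetime.strptime(order["date"], FORMATS[order["region"]])
    order["date"] = parsed.strftime("%Y-%m-%d")  # one canonical ISO format

print(orders)
```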

3. Incompleteness

Very few data sets are perfect. One common problem, even in otherwise clean data, is missing values or gaps in the data set. This can occur for various reasons, including lost records, unrecorded observations or, in some cases, data owners refusing to provide certain information.

These missing values can be critical. For example, in a medical data set used to train a model for predicting patient outcomes based on treatments, missing values for a patient's age or preexisting conditions could lead the model to make predictions based on incomplete profiles. This could potentially skew results toward the few patients with fully recorded data.

Data cleaning techniques for machine learning models

With all the potential problems unclean data presents, eliminating data quality issues is essential. Fortunately, data scientists and machine learning engineers can employ a range of techniques to clean their data effectively.

Identifying and handling missing data

Missing data can significantly skew a model's interpretations and predictions, though some algorithms are less affected by missing data than others. In many cases, it is possible to fill in values based on other data points, a technique known as imputation. Common methods include using the mean or median for numerical data or the mode for categorical data.
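
A minimal sketch of both imputation methods, using Python's standard library and hypothetical values:

```python
from statistics import median, mode

# Hypothetical feature columns with gaps; None marks a missing value.
ages = [34, 41, None, 29, None, 52]           # numerical feature
colors = ["red", "blue", None, "red", "red"]  # categorical feature

known_ages = [a for a in ages if a is not None]
age_fill = median(known_ages)  # median resists outliers better than mean

known_colors = [c for c in colors if c is not None]
color_fill = mode(known_colors)  # most frequent category

ages = [age_fill if a is None else a for a in ages]
colors = [color_fill if c is None else c for c in colors]
print(ages)    # [34, 41, 37.5, 29, 37.5, 52]
print(colors)  # ['red', 'blue', 'red', 'red', 'red']
```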

Data normalization

Normalization transforms values from different scales to a standard scale, thus avoiding distortion. For example, suppose a data scientist is evaluating the performance of partner companies across regions. Some partners in emerging markets might only make sales in the thousands of dollars, while sales might be in the millions for a partner in a more developed market. Directly comparing these entities without adjustment could lead to incorrect conclusions about their capabilities, performance and potential for improvement.

Normalization can help by equalizing the values and adjusting all sales figures to fall within a standard range, such as 0 to 1. This approach enables fair, relative comparisons of sales based on their proportion to the highest and lowest figures in the data set, ensuring a more accurate assessment of each partner's performance within their specific market context.
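
The simplest version of this is min-max scaling. A short sketch with hypothetical sales figures:

```python
# Min-max normalization: rescale every figure to the 0-1 range relative
# to the smallest and largest values in the data set.
sales = [8_000, 45_000, 1_200_000, 3_500_000]

lo, hi = min(sales), max(sales)
normalized = [(s - lo) / (hi - lo) for s in sales]
print([round(n, 3) for n in normalized])  # [0.0, 0.011, 0.341, 1.0]
```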

Feature selection

Feature selection is the process of identifying the most relevant variables in a data set -- known as features -- that contribute to the outcome the model is intended to predict or understand.

For example, sales data might include information such as the day of the week, marketing expenditure, weather conditions, and the number of employees working that day. Because not all information is equally useful for predicting sales, feature selection helps data scientists and model developers distinguish critical data from data that can be safely ignored. This can make the resulting models -- and, ultimately, the decisions made based on their outputs -- more efficient and effective.
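
One simple screening approach, among many, is to rank candidate features by how strongly they correlate with the target. A sketch with hypothetical daily records (`statistics.correlation` requires Python 3.10 or later):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical daily records: two candidate features and the sales outcome.
marketing_spend = [200, 450, 300, 800, 650]
employees = [5, 5, 6, 5, 6]
sales = [2_100, 4_300, 3_000, 7_900, 6_400]

# Rank candidate features by linear correlation with the target;
# weakly correlated features are candidates to drop.
for name, feature in [("marketing_spend", marketing_spend),
                      ("employees", employees)]:
    print(f"{name}: r = {correlation(feature, sales):.2f}")
```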

Keeping it clean

Clean data -- and therefore data cleaning -- is the foundation of accurate, robust, fair and efficient machine learning models. However, data cleaning cannot be a one-time process; it requires ongoing attention and refinement.

The nature of data quality issues also changes over time as the real world evolves. As models are updated and new data becomes available, the data cleaning process must adapt to ensure consistency, accuracy and relevance.

Donald Farmer is the principal of TreeHive Strategy, where he advises software vendors, enterprises and investors on data and advanced analytics strategy. He has worked on some of the leading data technologies in the market and in award-winning startups. He previously led design and innovation teams at Microsoft and Qlik.
