Data preparation in machine learning: 4 key steps
Data preparation is key to accurate machine learning results. Cleaning and structuring raw data boosts accuracy, improves training efficiency and reduces overfitting for more reliable predictions.
Data preparation refines raw data into a clean, organized and structured format that is ready for machine learning. Taking the time to clean and organize your data leads to more accurate models, faster training and better predictions.
ML revolves around data, and poor-quality data can lead to inaccurate or biased results. However, data scientists building ML models have more to think about than just data quality.
When an ML algorithm processes input data, it detects patterns and derives rules for future scenarios. Data preparation helps the algorithm effectively detect patterns and derive meaningful rules.
For example, in an insurance scenario, an actuarial model might derive age from a column with date-of-birth data. This transformation simplifies the data, which can improve model performance. More importantly, it allows the algorithm to identify relationships between age and risk, rather than analyzing the more complex date-of-birth column.
Preparing data can also reduce the possibility of overfitting, where a model learns the noise and random patterns in its training data instead of the general trends. If the model were trained directly on date of birth, it could pick up slight correlations that exist only by coincidence -- for example, that people born in July have a marginally higher risk -- rather than identifying meaningful patterns.
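To make that concrete, here is a minimal pandas sketch of the date-of-birth transformation described above. The DataFrame and column names are illustrative assumptions, not a reference to any particular insurer's data.

```python
import pandas as pd

# Hypothetical policyholder data; in practice this would come from a source system.
policies = pd.DataFrame({
    "policyholder_id": [101, 102, 103],
    "date_of_birth": ["1985-07-14", "1992-03-02", "1978-11-30"],
})

# Parse the text column into datetimes, then derive an approximate age in whole years.
policies["date_of_birth"] = pd.to_datetime(policies["date_of_birth"])
today = pd.Timestamp.today()
policies["age"] = (today - policies["date_of_birth"]).dt.days // 365  # ignores leap years

# The model would train on the simpler 'age' feature instead of raw birth dates.
print(policies[["policyholder_id", "age"]])
```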
Steps to prepare data for machine learning
To be effective, data preparation should be a systematic process involving multiple stages that transform raw data into a structure ready for ML. It is inefficient to deal with every new issue in an ad-hoc manner.
Most data engineers follow a sequential process, with each step building on the previous one. This process ensures that the data evolves from an unstructured, raw form to a clean, usable state.
The key stages typically include data collection, cleaning, transformation and splitting. These are usually executed in that sequence, but data preparation can be iterative, revisiting earlier steps if new issues or insights emerge.
1. Data collection
The first step is to identify the data sources you want to use and ensure that they are relevant, up to date and accessible. To collect the data for the model, you might have to export data from a source system or ensure connectivity to a data source with the right drivers or interfaces.
Common enterprise data sources include databases, enterprise applications, data warehouses and data lakes. These architectures support large volumes of data, but they are structured differently.
In a data lake, data is generally stored in its original format. That might be tabular, but it is often held in columnar formats, such as Parquet, or text-based formats, such as JSON or log files. Using data from a data lake requires detailed knowledge of how it is stored.
A traditional data warehouse stores data in a structured format, so querying for data can be simpler. Modern data warehouse architectures and hybrids such as the data lakehouse can store semi-structured data or enable querying unstructured and structured data together.
Other data sources, such as databases and applications, have their own connectors. Some are easier to use than others. Sometimes you can use techniques such as web scraping and real-time streaming when working with internet applications or log files.
Regardless of the data source and how the data is collected, data preparation for machine learning requires evaluating the data types involved, the volume of data and its quality before moving on to the next step: data cleaning.
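As a rough sketch of that evaluation step, the snippet below assumes a hypothetical Parquet extract and uses pandas to check column types, row counts and missing values before cleaning begins. The file path and columns are placeholders.

```python
import pandas as pd

# Hypothetical extract from a data lake or warehouse export;
# reading Parquet requires the pyarrow or fastparquet package.
orders = pd.read_parquet("orders.parquet")

# Evaluate data types, volume and basic quality before moving on to cleaning.
print(orders.dtypes)        # column data types
print(len(orders))          # row count (volume)
print(orders.isna().sum())  # missing values per column
print(orders.describe())    # summary statistics for numeric columns
```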
2. Data cleaning
Enterprise data is messy. Even in well-structured applications, there can be duplicates, errors and outliers. Think of your own use of e-commerce: You might have multiple versions of addresses, out-of-date credit card details and incomplete or canceled orders.
Data cleaning techniques such as deduplication, error correction, outlier removal and validation ensure data integrity.
Data quality issue | Potential techniques for fixing |
Missing data | Impute values, such as a column mean or median; remove incomplete records; or flag the gaps |
Incorrect data | Validate values against business rules or reference data and correct errors, ideally at the source |
Outliers | Remove, cap or transform extreme values after checking whether they are genuine |
Duplication | Deduplicate records by matching on keys or fuzzy matching similar entries |
Irrelevant data | Remove fields that do not help predict the target, typically through feature selection |
Figure 1. Some common data quality issues and their potential solutions.
The techniques you choose will depend on the data types, scale and project objectives. In many cases, you will be able to automate these processes.
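For example, a minimal pandas sketch of automated cleaning might look like the following. The toy order data, thresholds and choice of techniques are assumptions for illustration; a real pipeline would tune them to the data set.

```python
import pandas as pd

# Toy e-commerce orders with the kinds of issues described above;
# the columns and values are illustrative placeholders.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4, 5, 6],
    "customer": ["ann", "ann", "bob", None, "dee", "eve", "fay"],
    "amount":   [25.0, 25.0, 27.0, 30.0, 26.0, 28.0, 9999.0],
})

# Deduplication: keep one copy of each repeated order.
orders = orders.drop_duplicates(subset="order_id")

# Missing data: drop records with no customer (imputation is another option).
orders = orders.dropna(subset=["customer"])

# Outliers: flag values far outside the interquartile range and remove them.
q1, q3 = orders["amount"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
orders = orders[orders["amount"] <= upper]

print(orders)
```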
Feature selection
Feature selection helps address irrelevant data.
A feature is any data field or column that could help predict a target outcome. Age -- derived from date of birth -- might predict insurance risk. In an e-commerce data set, features such as "average days between orders" or "total spend last quarter" capture customer behavior, but they should be calculated in advance for accuracy and efficiency.
Some data, such as timestamps or street numbers, are just too detailed to be useful compared to ages or zip codes. You can often remove these features.
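A short sketch of feature selection along these lines might drop the overly detailed fields and precompute a behavioral feature. The customer columns here are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer data set; the column names are assumptions.
customers = pd.DataFrame({
    "customer_id":      [1, 2, 3],
    "signup_timestamp": ["2023-01-04 09:13:22", "2023-02-11 14:02:05", "2023-03-19 18:45:51"],
    "street_number":    [221, 17, 902],
    "zip_code":         ["98101", "98052", "98033"],
    "total_spend_q4":   [180.0, 42.5, 310.0],
    "orders_last_year": [12, 3, 20],
})

# Derived feature: average days between orders, precomputed for the model.
customers["avg_days_between_orders"] = 365 / customers["orders_last_year"]

# Drop features that are too detailed to generalize from.
features = customers.drop(columns=["signup_timestamp", "street_number"])

print(features)
```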
3. Data transformation
Beyond cleaning and feature selection, other transformation steps reshape and restructure data to make it more suitable for machine learning algorithms.
Here are some key transformation techniques:
Technique | Transformation |
Denormalization | Combines multiple tables or entities to flatten related data into a single structure |
Aggregation and summarization | Roll up detailed data points into meaningful statistics or metrics, such as daily or weekly totals |
Pivoting and reshaping | Restructure data between long and wide formats, such as converting columns to rows or reorganizing data dimensions |
Binning and discretization | Convert continuous values into discrete categories or ranges, such as age ranges for demographics |
Window functions | Create features based on rolling or sliding windows of data, such as moving averages of prices |
Feature scaling | Adjusts numerical values to a specific scale or range, such as a 1-5 scale |
Figure 2. Common techniques for transforming data.
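The sketch below illustrates a few of these techniques -- aggregation, binning, a window function and min-max feature scaling -- on a small, made-up sales table. The column names and bin edges are assumptions chosen for the example.

```python
import pandas as pd

# Illustrative daily sales records; names and values are placeholders.
sales = pd.DataFrame({
    "date":         pd.date_range("2024-01-01", periods=10, freq="D"),
    "amount":       [120, 95, 130, 110, 250, 90, 105, 300, 85, 140],
    "customer_age": [23, 45, 31, 62, 54, 19, 38, 71, 29, 47],
})

# Aggregation: roll daily amounts up into weekly totals.
weekly = sales.resample("W", on="date")["amount"].sum()

# Binning: convert continuous ages into discrete ranges.
sales["age_band"] = pd.cut(sales["customer_age"], bins=[0, 30, 50, 120],
                           labels=["under 30", "30-49", "50+"])

# Window function: three-day moving average of sales.
sales["amount_3d_avg"] = sales["amount"].rolling(window=3).mean()

# Feature scaling: rescale amounts to a 0-1 range (min-max scaling).
amin, amax = sales["amount"].min(), sales["amount"].max()
sales["amount_scaled"] = (sales["amount"] - amin) / (amax - amin)

print(weekly)
print(sales.head())
```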
4. Data splitting
The final step in data preparation is splitting your data set, sometimes called partitioning. This process divides your data into two or more subsets for training and testing. Sometimes, a third subset is used to validate models.
Splitting your data is critical to ensure your model generalizes well to new, unseen data rather than just memorizing patterns from its training data. You train the model on one subset and test it on another.
The training set typically contains about 60% to 80% of your data. A validation set, if used, is much smaller, perhaps only 10% of the data. Validation helps fine-tune model parameters and assess performance during development. Finally, the test set (10% to 20%) provides an unbiased evaluation of your final model's accuracy and performance.
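As a rough illustration of those proportions, the following scikit-learn sketch splits a synthetic data set roughly 70/15/15 by calling train_test_split twice. The data and seed values are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and target, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off 70% for training, leaving 30% behind.
# For time series data, pass shuffle=False so older rows stay in the training set.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Split the remaining 30% evenly into validation and test sets (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```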
There are several techniques to split data effectively. Random splitting is the simplest approach; it randomly assigns data points to each set. Some data sets need more sophisticated methods, however. For example, randomly splitting a time series would break the series and any patterns within the data. In this case, you might train on older data and test on newer data.
Cross-validation is a more advanced approach that creates multiple training and testing splits, rotating which subset is held out for testing. It can be particularly useful for smaller data sets, where a single split might not be representative.
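A brief sketch of k-fold cross-validation with scikit-learn, again on synthetic data, shows the idea: each of five folds takes a turn as the test set while the model trains on the rest. The logistic regression model here is just a stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a small real data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Five folds: each fold is held out once for testing while the rest train.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=folds)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across folds
```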
Donald Farmer is a data strategist with 30-plus years of experience, including as a product team leader at Microsoft and Qlik. He advises global clients on data, analytics, AI and innovation strategy, with expertise spanning from tech giants to startups. He lives in an experimental woodland home near Seattle.