Data splitting
What is data splitting?
Data splitting is the practice of dividing data into two or more subsets. Typically, in a two-part split, one part is used to train a model and the other to evaluate or test it.
Data splitting is an important aspect of data science, particularly for building models from data. The technique helps ensure that data models, and the processes that use them -- such as machine learning -- are accurate.
How data splitting works
In a basic two-part data split, the training data set is used to train and develop models. Training sets are commonly used to estimate model parameters or to compare the performance of different models.
The testing data set is used after training is done. The model's predictions on the test data are compared against known outcomes to confirm the final model works correctly. With machine learning, data is commonly split into three or more sets. With three sets, the additional set is the dev set, which is used to tune the parameters of the learning process.
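For illustration, here is a minimal sketch of a two-part split. It assumes scikit-learn is available and uses small placeholder NumPy arrays in place of real data:

```python
# A minimal sketch of a two-part split using scikit-learn's train_test_split.
# The feature matrix X and labels y here are placeholder arrays.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 example rows, 2 features each
y = np.array([0, 1] * 5)           # placeholder binary labels

# Hold out 20% of the rows as the test set; the rest becomes the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```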
There is no set guideline or metric for how the data should be split; the choice can depend on the size of the original data pool or the number of predictors in a predictive model. Organizations and data modelers might choose to split data based on data sampling methods, such as the following three (contrasted in the sketch after this list):
- Random sampling. This data sampling method protects the data modeling process from bias toward particular data characteristics. However, a purely random split can leave the data unevenly distributed across the subsets.
- Stratified random sampling. This method divides the data into subgroups, or strata, and samples at random within each. It ensures that key characteristics, such as class proportions, are distributed consistently across the training and test sets.
- Nonrandom sampling. This approach is typically used when data modelers want the most recent data as the test set.
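As a hedged illustration of the difference between random and stratified splitting, the sketch below applies scikit-learn's train_test_split to deliberately imbalanced placeholder labels; the stratify argument preserves class proportions in both subsets:

```python
# A sketch contrasting a purely random split with a stratified split.
# The labels y are deliberately imbalanced placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0] * 45 + [1] * 5)   # 90% class 0, 10% class 1

# Purely random split: the rare class may be over- or under-represented in the test set.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=7)

# Stratified split: class proportions are preserved in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

print("random test labels:    ", np.bincount(y_test_random, minlength=2))
print("stratified test labels:", np.bincount(y_test_strat, minlength=2))
```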
With data splitting, organizations don't have to choose between using the data for analytics and using it for statistical analysis, since the same data can be used in both processes.
Common data splitting uses
Ways that data splitting is used include the following:
- Data modeling. This uses data splitting to train models. An example is regression modeling, where a developer builds a model to predict a system's response to made-up input values. The developer selects a portion of that data to act as the training data, then compares the model's predictions on the test data against the known responses. This gives the developer a sense of whether the model is accurate (see the sketch after this list).
- Machine learning. This also uses data splitting to train models. Training data is fed to the model to fit its parameters during the training phase. After the training phase is finished, the test set is used to measure how well the model handles new observations.
- Cryptographic splitting. This is a different process from the uses of data splitting mentioned above. It is a technique used to secure data over a computer network. Cryptographic splitting is meant to protect systems from security breaches and involves encrypting data, splitting the encrypted data into smaller pieces and storing those pieces in different storage locations. The data is further encrypted when stored in its new location.
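The sketch below illustrates the data modeling and machine learning use cases under simple assumptions: synthetic input values, a linear regression model from scikit-learn and mean squared error as the accuracy check. It is illustrative, not a prescribed workflow:

```python
# A sketch of the data modeling use case: fit a regression model on the training
# split and check its predictions against the held-out test split.
# The synthetic data and the choice of LinearRegression are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))          # made-up input values
y = 3.0 * X.ravel() + rng.normal(0, 1, 200)    # noisy linear response

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Comparing test-set predictions against the known responses gives the developer
# a sense of how accurate the model is on data it was not trained on.
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```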
Data splitting in machine learning
In machine learning, data splitting is typically done to avoid overfitting, which occurs when a model fits its training data too well and fails to reliably fit additional data.
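The following sketch shows how a split exposes overfitting. It assumes scikit-learn and a synthetic classification data set; an unconstrained decision tree scores nearly perfectly on its own training split but noticeably worse on the held-out test split:

```python
# A sketch of why splitting guards against overfitting: an unconstrained decision
# tree memorizes the training split but scores worse on the test split.
# The synthetic classification data here is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0 (memorized)
print("test accuracy: ", tree.score(X_test, y_test))    # typically noticeably lower
```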
The original data for a machine learning model is typically split into three or four sets. The three most commonly used are the training set, the dev set and the testing set:
- The training set is the portion of data used to train the model. The model observes and learns from this set, optimizing its parameters.
- The dev set is a set of examples used to tune the parameters of the learning process. It is also called the cross-validation or model validation set. Its role is to gauge the model's accuracy during development, which can help with model selection.
- The testing set is the portion of data used to test the final model, with results compared against those from the previous sets. The testing set acts as an evaluation of the final model and algorithm.
Data should be split so that the training set receives the largest share of the data. For example, data might be split at an 80-20 or a 70-30 ratio of training to testing data. The exact ratio depends on the data, but a 70-20-10 ratio for training, dev and test splits is a common choice for small data sets, as in the sketch below.
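One way to get a 70-20-10 split is to call scikit-learn's train_test_split twice, as in this sketch with placeholder arrays standing in for real data:

```python
# A sketch of a 70-20-10 train/dev/test split done with two consecutive calls
# to train_test_split; the arrays here are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: hold out 30% of the data, keeping 70% for training.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)

# Second split: divide the held-out 30% into dev (20% overall) and test (10% overall).
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=0
)

print(len(X_train), len(X_dev), len(X_test))  # 70 20 10
```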