

2 data-wrangling techniques for better machine learning

Before data can be usefully input into algorithms, it must first be prepared. Learn two of the techniques that do the job and make machine learning work.

This article is excerpted from the course "Fundamental Machine Learning," part of the Machine Learning Specialist certification program from Arcitura Education. It is the seventh part of the 13-part series, "Using machine learning algorithms, practices and patterns."

This article explains machine learning data-wrangling techniques that help make data ready as input for a chosen machine learning algorithm. These techniques enable converting values between categorical and numerical representations, filling in missing data and bringing data values into a known range via normalization. This article covers the following two of four data-wrangling techniques: feature imputation and feature encoding. As explained in Part 4, these techniques are documented in a standard pattern profile format.

Feature imputation: Overview

  • How can a data set with missing feature values be used for model development without having to delete entire rows or columns of valuable data?
  • One way to utilize a data set containing missing values is to delete entire rows or columns of data. However, this comes at the expense of losing valuable data that could have contributed toward developing a more accurate model.
  • Instead of deleting data, the value of missing features is inferred from the rest of the features through the application of statistical techniques or machine learning algorithms.
  • Statistical techniques, such as mean, median or mode, or machine learning algorithms, such as k-nearest neighbors (KNN) and linear regression, are applied to the data set to find the values of the missing fields.

Feature imputation: Explained

Problem

It is rare that acquired data contains values for all features of all instances. Values can go missing for a number of reasons -- for example, a faulty sensor, a software bug, mapping issues from the source system or a survey field left intentionally blank.

Machine learning algorithms require a value to work with, so a quick and easy way to use such a data set for model training is to delete either the entire instances (rows) that contain missing values or the affected feature (column). However, doing so negatively impacts model training. Deleting instances not only decreases the amount of training data, but can also create an imbalance in the training examples. Removing a feature altogether reduces the predictive power of the resulting model (Figure 1).

Figure 1: A training data set contains missing values for Feature B and cannot be used to train a model (1). Consequently, the rows that contain missing values are deleted (2). The resulting data set now contains less data but can be used for training purposes (3). The data set is then used to train a model (4, 5). However, the resulting model's accuracy is not as expected (6).
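As a minimal sketch of this deletion approach (the data and column names here are hypothetical), pandas can drop either the affected rows or the affected column, with exactly the data loss described above:

    # A sketch of the deletion approach the pattern advises against;
    # the data and column names are hypothetical.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [5.0, np.nan, 7.0, np.nan],
    })

    print(df.dropna())        # drops rows 1 and 3: half the training data is gone
    print(df.dropna(axis=1))  # drops feature_b entirely, losing its predictive signal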

Solution

The features and instances are not deleted; instead, the values of the missing fields are calculated from other fields. Either the nonmissing values of the respective feature are used, or the value is extrapolated from one or more other features.

Application

Mean, median, or mode can be used to compute the missing values. As the average value is used for imputation, the resulting data set may have a reduced variance compared to the one without missing values.
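As a minimal sketch of the mean strategy (the column name feature_b and its values are hypothetical), pandas' fillna can fill the gaps and also demonstrate the variance reduction just described:

    # Mean imputation with pandas; the column name and values are hypothetical.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"feature_b": [2.0, 4.0, np.nan, 8.0, np.nan, 6.0]})

    # Replace each missing value with the column mean (5.0 here).
    df["feature_b_imputed"] = df["feature_b"].fillna(df["feature_b"].mean())

    # Every filled-in value sits at the center of the distribution, so the
    # imputed column has a smaller variance than the observed values alone.
    print(df["feature_b"].var())          # variance over observed values only
    print(df["feature_b_imputed"].var())  # reduced variance after imputation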

Another technique for numerical data is the use of linear regression. First, correlation analysis identifies the feature without missing values that correlates best with the feature containing missing values. Then, using the fitted linear regression equation, the known value of the complete feature is input to predict the missing value of the other feature. Although this technique does not suffer from the variance-reduction problem, it does assume a linear relationship between the features, which may not always hold.
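The following sketch assumes a complete feature, feature_a, that correlates well with the gappy feature_b (both names and values are hypothetical) and uses scikit-learn's LinearRegression to predict the missing values:

    # Regression-based imputation; feature names and values are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "feature_b": [2.1, np.nan, 6.2, np.nan, 10.1],
    })

    missing = df["feature_b"].isna()

    # Fit the regression on rows where both features are observed.
    model = LinearRegression()
    model.fit(df.loc[~missing, ["feature_a"]], df.loc[~missing, "feature_b"])

    # Predict feature_b from feature_a for the rows with missing values.
    df.loc[missing, "feature_b"] = model.predict(df.loc[missing, ["feature_a"]])
    print(df)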

For categorical features, either the mode or a constant value, such as N/A, can be used; the constant then becomes another category for that feature. In cases where the feature with missing values is the target, this constant value can be treated as a separate class (Figure 2).

Figure 2: A training data set contains missing values for Feature B and cannot be used to train a model (1). The imputation technique is applied to fill in the missing values (2). In the resulting data set, the missing values are imputed using the mean value of Feature B (3). The imputed data set is then used to train a model (4, 5). The resulting model's accuracy is within the expected range (6).
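A minimal sketch of both categorical options (the color column and its values are hypothetical): fill with the mode, or with a constant that becomes its own category:

    # Categorical imputation; the column and its values are hypothetical.
    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", None, "red", None]})

    # Option 1: fill with the mode (the most frequent category, "red" here).
    df["color_mode"] = df["color"].fillna(df["color"].mode()[0])

    # Option 2: fill with a constant that becomes another category.
    df["color_constant"] = df["color"].fillna("N/A")
    print(df)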

Feature encoding: Overview

  • How can categorical features be used for model development when the underlying machine learning algorithm only supports numerical features?
  • Most machine learning algorithms can only accept numerical features as input, which makes it impossible to include valuable categorical features in model development.
  • All categorical features in the data set are converted to a numeric representation using a mathematical function.
  • The categorical features are encoded to numerical features using a specific encoding scheme, such as a label encoder or a one-hot encoder.

Feature encoding: Explained

Problem

A training data set normally contains both numerical and categorical data. However, most machine learning algorithms are unable to work with categorical features. This is because such algorithms need to perform mathematical operations -- for example, addition and multiplication -- and carry out comparisons on feature values. At a higher level, algorithms such as KNN and k-means perform distance measurement, which is not possible with categorical values. Excluding categorical features from model training can severely impact the efficacy of the model. In other cases, exclusion might not be an option at all if the target class itself comprises categorical values, such as cat, dog or wolf (Figure 3).

Figure 3: A training data set comprises two numerical features and one categorical feature. The data set needs to be used to train a model, but the underlying algorithm only accepts numerical features (1). Due to this restriction, the model cannot be trained (2, 3).

Solution

Categorical features are converted to numerical features via reversible mathematical operations, which make it possible to retrieve the original categorical value from the converted numerical value. The conversion is performed in such a way that the resulting numerical value does not make one feature more important than another. The process normally involves first mapping each unique value to a number.

Although the value of the categorical feature is now a numerical value, using it as is for model training can misrepresent the facts. For example, consider a fruit feature containing values such as orange, apple and banana. After step one, the corresponding values are 0, 1 and 2, respectively. This means that instances with banana values for the fruit feature may get more weight than the ones with orange values. To rectify this issue, another step is required where the numerical values from the previous step become new features (columns), and for instances (rows) where a certain numerical value appears, the corresponding feature (column) contains a value of 1, while the other columns contain a value of 0.

Continuing with the previous example, consider an instance where the fruit value is 2. After step two, the same instance will have three additional features (columns): 0, 1 and 2, where features 0 and 1 have values of 0, while feature 2 has a value of 1.
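Here is a minimal sketch of the two steps applied to the fruit example, with pandas standing in for the mapping and one-hot conversion:

    # Two-step encoding of the fruit example from the text.
    import pandas as pd

    df = pd.DataFrame({"fruit": ["orange", "apple", "banana", "apple"]})

    # Step one: map each unique category to a number.
    df["fruit_label"] = df["fruit"].map({"orange": 0, "apple": 1, "banana": 2})

    # Step two: turn each label into its own 0/1 column so that no category
    # carries more numerical weight than another.
    one_hot = pd.get_dummies(df["fruit_label"], prefix="fruit", dtype=int)
    print(df.join(one_hot))

A banana row comes out as fruit_0 = 0, fruit_1 = 0, fruit_2 = 1, exactly as described above.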

Application

A label encoder is first applied to the categorical features, followed by a one-hot encoder. Both of these encoders are generally applied using data preprocessing functions available either within the model development software or in a separate library that can be imported (Figure 4).

Figure 4: A training data set comprises two numerical features and one categorical feature. The data set needs to be used to train a model, but the underlying algorithm only accepts numerical features (1). The data set's categorical Feature C is encoded first using a label encoder and then a one-hot encoder (2). The encoded data set now contains only numerical features, where the categories become the features, such that the presence of each categorical value in a certain row is represented by 1 and its absence by 0 (3). The resulting data set is then used to train a model without any issues (4, 5).
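As a sketch of the library route, scikit-learn's preprocessing module provides both encoders (the feature values below are hypothetical):

    # Label encoding followed by one-hot encoding with scikit-learn;
    # the feature values are hypothetical.
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    feature_c = np.array(["cat", "dog", "wolf", "dog"])

    # Step one: label-encode the categories to integers (cat=0, dog=1, wolf=2).
    labels = LabelEncoder().fit_transform(feature_c)

    # Step two: one-hot encode the integer labels into 0/1 columns.
    # Note: the parameter is named sparse, not sparse_output, in
    # scikit-learn versions before 1.2.
    encoder = OneHotEncoder(sparse_output=False)
    encoded = encoder.fit_transform(labels.reshape(-1, 1))
    print(encoded)

Note that recent versions of OneHotEncoder also accept string categories directly, which makes the separate label-encoding step optional in practice.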

What's next?

The next article covers the feature discretization and feature standardization data-wrangling patterns.
