Key steps in the feature engineering process
Feature engineering is key to machine learning algorithms. Read on to learn how features are created and chosen to increase the accuracy of the resulting models.
Machine learning may seem magical at times, but it's not. A lot of effort goes into ensuring algorithms perform properly and it starts with getting data sets in usable shape. That's where the feature engineering process comes in.
The feature engineering process is key to understanding what data is available to use in a machine learning algorithm. Features are also necessary to test how accurate models are and to further improve their accuracy.
What is feature engineering?
Feature engineering is the process of taking raw data and transforming it into features that can be used in machine learning algorithms. Features are the specific units of measurement that algorithms evaluate for correlations.
"Really, the feature engineering process is turning the data that you have into the most effective and usable version of what's going to get at the question you want to answer," said Hannah Pullen-Blasnik, a graduate research fellow at Columbia University and former senior data scientist at Digitas North America, a global marketing and technology agency.
According to Brian Lett, research director at Dresner Advisory Services, feature engineering is a balance of art and science. The art side incorporates domain expertise, while the science side finds the correct variables.
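As a simple sketch of that transformation (the table and column names below are made up for illustration), raw transaction records might be rolled up into per-customer features an algorithm can evaluate:

```python
import pandas as pd

# Hypothetical raw transaction records -- the columns are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-01-20", "2023-03-02", "2023-03-15"]
    ),
    "amount": [20.00, 35.50, 12.00, 80.00, 15.25],
})

# Turn raw rows into measurable features: totals, averages and recency per customer.
features = raw.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    num_purchases=("amount", "count"),
    last_purchase=("timestamp", "max"),
)
features["days_since_last_purchase"] = (
    pd.Timestamp("2023-04-01") - features["last_purchase"]
).dt.days
print(features)
```

Each derived column, such as days_since_last_purchase, is a feature a model can test for correlation with the outcome of interest.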
Why feature engineering is important
Features are what make machine learning models work. The feature engineering process helps data scientists choose the features that will make the model accurate. This process is especially important with data sets that have many outliers.
"Some people argue that it's the most important part [in the machine learning process] because the quality of data you're putting into your model is going to influence the quality of results that you're able to get back," Pullen-Blasnik said.
Feature engineering is also key to getting at what's important in the data.
"I think [feature engineering] is important because a lot of times we can get bogged down in information that's not helpful for the problem at hand," said Gilad Barash, vice president of analytics at Dstillery, a custom audience marketing consultancy based in New York.
Mike Gualtieri, an analyst at Forrester Research, said feature engineering is the most important part of the machine learning process because it can make or break an algorithm's accuracy.
"You have to give the machine learning algorithm a fighting chance to analyze that data, and feature engineering basically is the process of revealing more detail, giving more information," Gualtieri said.
The feature engineering process
Feature engineering is still a relatively new discipline, meaning there isn't an exact process.
"It varies, I think, a lot between the type of problem that you're solving, the type of model that you're using and the industry that you're in," Pullen-Blasnik said.
Regardless of differences between industries and data scientist approaches, these are the main steps in the feature engineering process.
Data preparation
Though technically a different process, data preparation is integral to the feature engineering process. Data prep for feature engineering can sometimes be more of an integration process, Gualtieri said.
"You may take data from five or six different sources, and you have to put it together. You have to make it one," he said.
Lett said data preparation is especially important to ensure all the data is formatted in the same way. For example, in Europe, dates are formatted as day/month/year, while in North America they are formatted as month/day/year.
"It's the kind of thing where somebody has to be doing that cleansing to make sure ahead of time that it's not going to cause a problem," Lett said.
Exploratory data analysis
The first step in the feature engineering process is understanding the data you have. According to Pullen-Blasnik, the quality of documentation varies by data set, and when documentation is minimal or missing, exploratory data analysis fills in what you can't learn from the information you've been provided.
"I want to make sure that I know exactly what's in it before I start working with it," Pullen-Blasnik said.
Exploratory data analysis is also a key step in choosing the right features for a model. Barash said exploratory data analysis can reduce dimensionality between features. It can also help data scientists find unique features and understand ones that are less intuitive.
Gualtieri likewise pointed to exploratory data analysis as important in choosing features for a model.
"If you're doing exploratory data analysis and the pre-feature engineering, you're looking for things -- you're hypothesizing and looking for things that might matter and making a prediction," he said.
Establish a benchmark and choose features
Before choosing features, it's important to know what the question is that you're trying to answer, which requires a good amount of creativity.
When testing a hypothesis, it's important to establish a baseline for accuracy. From that benchmark, you can introduce simple features and see whether they improve on that accuracy.
"Your goal is to reduce error rates and [improve] predictability of the model that you're working with," Lett said. "As you're trying to reduce this error rate, you have to have that benchmark against which you're going to compare everything to."
Lett said collaboration and domain expertise are especially beneficial in this step.
"You need to have people with domain expertise who can start filtering the data correctly and start to come up with these good hypotheses that the model is actually testing," Lett said.
This is a good point to incorporate business users into the team. Not only can business users add domain expertise, they can also help data scientists keep the ultimate business goal in mind, Lett said.
Avoid bias in feature engineering
One thing data scientists want to avoid is introducing bias into the model, which can start in the feature engineering process.
Barash said investigation is important for keeping biases out and that the human element shouldn't be undervalued in the process.
"The more different sets of eyes -- they can look at these features and make the decisions together in terms of, for example, eliminating features -- the more change you have of successfully avoiding bias," Barash said.
According to Lett, one way to keep bias out is to include a governance strategy.
"It is very important from a business standpoint -- especially if this is going to be used in external-facing or customer-facing activities -- to be able to fully explain that model," Lett said.
Larger organizations tend to have more formal governance structures in place when it comes to the feature engineering process, while smaller organizations may not have the same governance team or strategy.
"If you start to see some bias in there, you'll be able to understand where that came from -- where those sources of data may be contributing or not contributing to that bias," Lett said. "And then it's much easier to understand where you might be able to take action to be able to address that versus trying to retroactively figure that out."
The role of automated tools
Automated feature engineering tools have been on the rise to help data scientists, but the human element is still necessary.
"The great thing about automated tools is that they allow you to utilize or look at and process a much bigger dimensionality," Barash said.
According to Gualtieri, automated feature engineering gives data scientists choices.
"What if you have a column and 70% of the values are missing?" Gualtieri said. "There's a choice there, so the automated feature engineering may reveal that choice to the data scientists."
He also said some products will provide more control and alert data scientists when adjustments are necessary.
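As an illustration of the kind of choice Gualtieri describes (the helper below is hypothetical, not taken from any particular product), a simple check can surface sparse columns and leave the drop-or-impute decision to the data scientist:

```python
import pandas as pd

def flag_sparse_columns(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return the share of missing values for columns at or above the threshold,
    so a data scientist can decide whether to drop, impute or keep them."""
    missing_share = df.isna().mean()
    return missing_share[missing_share >= threshold]

# Hypothetical data set where one column is mostly missing.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [None, None, None, 9.5],
})
print(flag_sparse_columns(df))  # surfaces feature_b for a human decision
```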