
How to streamline feature engineering for machine learning

Structured data is necessary in machine learning, but sifting through data is time consuming. Streamlining the feature engineering process can help data scientists be more productive.

For impactful machine learning, data scientists first need clean, structured data. That's where feature engineering comes in -- to refine data structures that improve the efficiency and accuracy of machine learning models.

Ryohei Fujimaki, Ph.D., CEO and founder of dotData, a data science platform, said, "Features are, without question, even more critical than the machine learning algorithm itself." Poor quality features will result in a failure of the machine learning algorithm, he said. On the other hand, high-quality features will allow even simple machine learning algorithms like linear regression to perform well.

"It's quite common for feature engineering and data engineering to require a considerable amount of time and significant manual effort," Fujimaki said. Accelerating the feature engineering process will significantly shorten overall project timelines.

There are a variety of ways data engineers can improve the process of feature engineering for machine learning and eliminate some of the grunt work for data scientists. These include starting with higher-quality data, taking advantage of popular techniques for organizing the data, organizing and sharing data to enable self-service, and using automated feature engineering tools.

How it works

Feature engineering involves expanding and organizing the raw data set in a way that exposes the behavior of data relevant to a prediction. Effective feature engineering has traditionally required good domain expertise to help intuit the types of transformations that are most helpful to the machine learning process, said Saif Ahmed, product owner of machine learning at Kinetica, an analytics database.

Sifting through these variables lets data scientists form feature combinations that are often more influential than the raw data alone. For example, knowing that a transaction occurred on a holiday, a weekend or a weekday is more important to a sales prediction model than the raw date, said Elias Lankinen, founder of Deepez, a machine learning tool provider.
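As a minimal pandas sketch of that date example, the column names and holiday list below are invented for illustration:

import pandas as pd

# Hypothetical raw transactions with a timestamp column.
sales = pd.DataFrame({
    "transaction_date": pd.to_datetime(
        ["2023-12-25", "2023-12-30", "2024-01-02"]),
    "amount": [120.0, 85.5, 42.0],
})

# Derive calendar features that are usually more predictive than the raw date.
sales["day_of_week"] = sales["transaction_date"].dt.dayofweek
sales["is_weekend"] = sales["day_of_week"] >= 5

# A real project would use a proper holiday calendar; this list is illustrative.
holidays = pd.to_datetime(["2023-12-25", "2024-01-01"])
sales["is_holiday"] = sales["transaction_date"].isin(holidays)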

Prepping for feature engineering

When it comes to data preparation, especially in feature engineering for machine learning, there are several major steps.

The first step is data collection, which consists of gathering raw data from various sources, such as web services, mobile apps, desktop apps and back-end systems, and bringing it all into one place. Tools like Kafka and Amazon Kinesis are often used to collect raw event data and stream it to data lakes, such as Amazon S3 or Azure Data Lake, or data warehouses, such as Snowflake or Amazon Redshift.
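As a rough sketch of that collection step, raw events can be pushed to a Kafka topic with the kafka-python client; the broker address, topic name and event fields below are placeholders:

import json
from kafka import KafkaProducer

# Broker address and topic name are placeholders for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A raw event as it might arrive from a web or mobile front end.
event = {"user_id": 42, "action": "add_to_cart", "ts": "2024-05-01T12:00:00Z"}
producer.send("raw-events", value=event)
producer.flush()  # block until the event has been delivered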

The second step involves validating, cleaning and merging data together to create a single source of truth for all data analysis. On top of this single source of truth, new data sets are usually created to support specific use cases in a convenient, high-performing and cost-effective way.
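A minimal pandas sketch of that validate-clean-merge step, with invented table and column names:

import pandas as pd

# Hypothetical extracts from two upstream systems.
orders = pd.DataFrame({"order_id": [1, 2, 2], "customer_id": [10, 11, 11],
                       "total": [99.0, 15.0, 15.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["EU", "US"]})

# Basic cleaning: drop exact duplicates and rows missing a key.
orders = orders.drop_duplicates().dropna(subset=["order_id", "customer_id"])

# Merge into one analysis-ready table; validate= guards against accidental
# fan-out if the customer table ever contains duplicate keys.
source_of_truth = orders.merge(customers, on="customer_id",
                               how="left", validate="many_to_one")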

Feature engineering is essentially the third step in the machine learning lifecycle, said Pavel Dmitriev, vice president of data science at Outreach, a sales engagement company. "The feature engineering step transforms the data from the single source of truth dataset into a set of features that can be directly used in a machine learning model," Dmitriev said.

Typical transformations include scaling, truncating outliers, binning, handling missing values and converting categorical values into numeric values. Dmitriev said the importance of manual feature engineering has been declining in recent years due to improvements in deep learning algorithms that require less feature engineering and the development of automated feature engineering techniques.
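Those transformations map onto standard scikit-learn preprocessors. The sketch below is one possible arrangement, with made-up column names:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, KBinsDiscretizer,
                                   OneHotEncoder, StandardScaler)

numeric_cols = ["income", "age"]          # hypothetical numeric fields
categorical_cols = ["state", "segment"]   # hypothetical categorical fields

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),                    # missing values
    ("clip", FunctionTransformer(lambda X: np.clip(X, -1e6, 1e6))),  # truncate outliers
    ("scale", StandardScaler()),                                     # scaling
])

preprocess = ColumnTransformer([
    ("numeric", numeric_pipeline, numeric_cols),
    # Bin one numeric field into quantile buckets.
    ("binned_age", KBinsDiscretizer(n_bins=5, encode="onehot-dense"), ["age"]),
    # Turn categorical values into numeric indicator columns.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])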

Common techniques

Data engineers use a variety of techniques to combine and transform raw data into different kinds of features that may be the most relevant to a particular machine learning problem.

Dhanya Bijith, data analyst at Fingent, a software development firm, said some of the more common techniques include:

  • Correlation matrix. In this process, feature engineering identifies the correlation between the different fields in the raw data. If two fields vary in the same way, they are dependent and one of them can be eliminated (see the sketch after this list).
  • Eliminating values. Feature engineering helps data scientists eliminate null values and extreme values.
  • Normalizing values. This process transforms raw data to give different fields equal importance. For example, a machine learning model for house appraisal prediction might normalize the representations for the number of bedrooms, the size of the bedrooms and their coordinates within a house.
  • Identifying output fields. Feature engineering helps identify what fields will influence the output by eliminating fields with a lower correlation. This improves the computational efficiency and accuracy of the model.
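As a small illustration of the correlation matrix technique above, this sketch flags nearly duplicate features in a toy table; the column names and the 0.95 threshold are arbitrary choices:

import pandas as pd

# Hypothetical feature table for a house-price model.
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000],
    "rooms": [2, 3, 4, 5],
    "age_years": [40, 10, 25, 5],
})

corr = df.corr().abs()  # correlation matrix

# Flag pairs of features that move together almost perfectly.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.95]
print(redundant)  # [('rooms', 'sqft')] -> one of the pair can be dropped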

Document the process

It's possible to use Excel for feature engineering, which reduces the amount of coding involved in selecting features, said Deepez's Lankinen. But Excel does not automatically document what you do. "It's rare that people document every step, and then it's hard to go back if you realize a mistake," he said. It's also hard to duplicate the process for new data.

Lankinen said it's better to work with Jupyter Notebooks or even ordinary code files, which do a far better job of automatically documenting the entire feature engineering process. Data cleaning tools such as Trifacta and OpenRefine can also help document the process in a way that makes it easier to explore different feature engineering transformations iteratively, he noted. These kinds of platforms can also make it easier for teams to collaborate on cleaning and structuring the data, with monitoring and alerting built in.

"It is important to empower all stakeholders in this process," said David McNamara, product marketing manager at Trifacta, a data wrangling tools provider.

IT teams and data engineers often work at odds with data science teams: IT wants to ensure proper governance and automation for production applications, while data scientists want to run new experiments. A data transformation platform can provide self-service for data science experiments and make it easier to push applications into production with centralized management.

Working with unstructured data

Feature engineering typically refers to improving the presentation of structured or semi-structured data to machine learning algorithms. With unstructured data, deep learning algorithms often generate the features automatically as part of the training process. These algorithms can be a huge benefit in feature engineering for machine learning, but they won't necessarily be the final step in the process.

"Automatic feature engineering through the use of neural networks and deep learning certainly can reduce the grunt work, but it's not a panacea," said Hadayat Seddiqi, director of machine learning at InCloudCounsel, a legal technology company. The neural networks used to build the machine learning models can be sensitive to train and require the right objective to learn good representations.

Seddiqi's team has been crafting algorithms to improve the ability to answer questions from collections of legal contracts. The hardest part is cleaning and organizing the raw text data. This typically means segmenting a string of text into words and sentences. Sometimes other representations are useful too, such as tagging the part of speech of words, adding indicators for the existence of certain prefixes or suffixes, or extracting named entities such as places, companies or people.
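As a sketch of those text-preparation steps, a library such as spaCy exposes sentence segmentation, part-of-speech tags and named entities directly; this assumes the small English model has been downloaded, and the sample text is invented:

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("This Agreement is entered into by Acme Corp. "
        "The term begins on January 1, 2024.")
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]             # sentence segmentation
pos_tags = [(tok.text, tok.pos_) for tok in doc]          # part-of-speech tags
entities = [(ent.text, ent.label_) for ent in doc.ents]   # named entities (ORG, DATE, ...)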

Certain featurization algorithms can handle messy text data by working at the character level, which makes them robust to different types of noise. For example, character-based word vectors use groups of characters inside a word to produce the same or similar representations for words even when they contain misspellings.
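Subword embeddings are one common character-level approach. In the gensim FastText sketch below, trained on a toy corpus, word vectors are built from character n-grams, so even an unseen misspelling such as "contrct" still receives a vector that shares structure with "contract":

from gensim.models import FastText

# Toy corpus; a real model would be trained on a large body of text.
sentences = [
    ["the", "contract", "was", "signed"],
    ["the", "agreement", "was", "terminated"],
]

# min_n and max_n control the character n-gram sizes used to build word vectors.
model = FastText(sentences, vector_size=32, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# FastText can produce a vector for an out-of-vocabulary misspelling
# from the character n-grams it shares with the correctly spelled word.
print(model.wv.similarity("contract", "contrct"))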

"One thing that is less talked about because it's not as glamorous is data collection," Seddiqi said. He has found that manually annotating relevant portions of a document they want to extract can be hugely valuable for a model being trained because it learns how to ignore the unnecessary noise in the document and focus on the right places. "Having a solid annotations pipeline can be the biggest driver for performance," he said.

Another good practice is to create an error analysis pipeline on the back end of the machine learning training process. This requires having the right tools, infrastructure and processes in place to home in on the mistakes your models make. "People have to get down and dirty with their models, otherwise you can't make progress," Seddiqi said.
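A very simple piece of such an error analysis pipeline is just collecting the examples a model gets wrong so they can be reviewed by hand; the function below is a hypothetical sketch:

import pandas as pd

def collect_errors(texts, y_true, y_pred):
    """Gather misclassified examples for manual review."""
    rows = [{"text": t, "expected": yt, "predicted": yp}
            for t, yt, yp in zip(texts, y_true, y_pred)
            if yt != yp]
    return pd.DataFrame(rows)

# Example usage (names are placeholders):
# errors = collect_errors(val_texts, val_labels, model.predict(val_texts))
# errors.to_csv("error_analysis.csv", index=False)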

Auto-generation

"The majority of work for feature engineering was traditionally performed in SQL-like environments or as a very complex pipeline on visual programming-based systems like SAS or SPSS," said dotData's Fujimaki. Researchers at MIT launched the field of automated feature generation for structured data in 2015 with the development of the Deep Feature Synthesis algorithm. This was eventually released as an open source tool called Featuretools.

Other developers have built on this early work with open source libraries for specific domains like autofeat, a Python library for science applications. In addition, several commercial products from IBM, DataRobot, Explorium and dotData are starting to support automated feature engineering capabilities.


"In most companies, feature engineering is more of an art than a science," said Omer Har, co-founder and CTO of Explorium. Automated feature engineering can create code that tries different ideas for features without needing the data scientist to sit and think of all possible features from a category.

Although these tools can make it easy to create new features, Kinetica's Ahmed cautioned that "evaluating new features is always computationally intensive."
