Building a data science pipeline: Benefits, cautions
A data science development pipeline is critical for digital business. But the pipeline must be monitored closely to ensure its output reflects the business goal.
Enterprises are adopting data science pipelines for artificial intelligence, machine learning and plain old statistics. A data science pipeline -- a sequence of actions for processing data -- will help companies be more competitive in a digital, fast-moving economy.
Before CIOs take this approach, however, it's important to consider some of the key differences between data science development workflows and traditional application development workflows.
Data science development pipelines used for building predictive and data science models are inherently experimental and don't always pan out the way other software development processes, such as Agile and DevOps, do. Because data science models break and lose accuracy differently than traditional IT apps do, a data science pipeline needs to be scrutinized to ensure the model reflects what the business is hoping to achieve.
At the recent Rev Data Science Leaders Summit in San Francisco, leading experts explored some of these important distinctions, and elaborated on ways that IT leaders can responsibly implement a data science pipeline. Most significantly, data science development pipelines need accountability, transparency and auditability. In addition, CIOs need to implement mechanisms for addressing the degradation of a model over time, or "model drift." Having the right teams in place in the data science pipeline is also critical: Data science generalists work best in the early stages, while specialists add value to more mature data science processes.
Data science at Moody's
CIOs might want to take note from Moody's, the financial analytics giant, which was an early pioneer in using predictive modeling to assess the risks of bonds and investment portfolios. Jacob Grotta, managing director at Moody's Analytics, said the company has streamlined the data science pipeline it uses to create models in order to be able to quickly adapt to changing business and economic conditions.
"As soon as a new model is built, it is at its peak performance, and over time, they get worse," Grotta said. Declining model performance can have significant impacts. For example, in the finance industry, a model that doesn't accurately predict mortgage default rates puts a bank in jeopardy.
Watch out for assumptions
Grotta said it is important to keep in mind that data science models are created by and represent the assumptions of the data scientists behind them. Before the 2008 financial crisis, a firm approached Grotta with a new model for predicting the value of mortgage-backed derivatives, he said. When he asked what would happen if the prices of houses went down, the firm responded that the model predicted the market would be fine. But it didn't have any data to support this. Mistakes like these cost the economy almost $14 trillion by some estimates.
Companies often expect that someone understands what a model does and the risks inherent in it. But unverified assumptions can create blind spots for even the most accurate models. Grotta said it is good practice to create lines of defense against these sorts of blind spots.
The first line of defense is to encourage the data modelers to be honest about what they do and don't know and to be clear on the questions they are being asked to solve. "It is not an easy thing for people to do," Grotta said.
A second line of defense is verification and validation. Model verification involves checking that the model was implemented correctly and that no mistakes were made while coding it. Model validation, in contrast, is an independent challenge process that helps the model's developer identify the assumptions baked into the data. Ultimately, Grotta said, the only way to know whether a modeler's assumptions are accurate is to wait for the future.
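Verification, at least, lends itself to automation. The sketch below shows what a minimal verification check might look like in Python: a unit test that compares a scoring function against a hand-computed reference value. The `logistic_score` function and its coefficients are hypothetical illustrations, not anything from Moody's actual process.

```python
import math

def logistic_score(features, weights, bias):
    """Hypothetical scoring function under verification:
    plain logistic regression over a feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def test_logistic_score_matches_spec():
    # Reference value computed by hand from the (hypothetical)
    # model specification document.
    features = [0.5, 1.2]
    weights = [0.8, -0.3]
    bias = 0.1
    expected = 1.0 / (1.0 + math.exp(-(0.1 + 0.8 * 0.5 - 0.3 * 1.2)))
    assert abs(logistic_score(features, weights, bias) - expected) < 1e-12

if __name__ == "__main__":
    test_logistic_score_matches_spec()
    print("verification check passed")
```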
A third line of defense is an internal audit or governance process. This involves making the results of these models explainable to front-line business managers. Grotta said he recently worked with a bank that complained its managers would not use a model if they didn't understand what was driving its results. But the managers, he said, were right to insist on that. Having a governance process and ensuring information flows up and down the organization is extremely important, Grotta said.
Baking in accountability
Models degrade or "drift" over time, which is part of the reason organizations need to streamline their model development processes. It can take years to craft a new model. "By that time, you might have to go back and rebuild it," Grotta said. Critical models must be revalidated every year.
To address this challenge, CIOs should think about creating a data science pipeline with an auditable, repeatable and transparent process. This promises to allow organizations to bring the same kind of iterative agility to model development that Agile and DevOps have brought to software development.
Transparent means that people upstream and downstream understand what drives the model. Repeatable means the process used to create the model can be reproduced. Auditable means there is a program in place to manage the process, take in new information and move the model through monitoring. There are varying levels of this kind of agility today, but Grotta believes it is important for organizations to make it easy to update data science models in order to stay competitive.
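One way to make those properties concrete is to record, for every trained model, the information an auditor or a future modeler would need. The following Python sketch assumes a simple append-only JSON registry file; the field names are illustrative, not any particular vendor's format.

```python
import hashlib
import json
from datetime import datetime, timezone

def register_model(registry_path, model_name, version,
                   training_data_file, assumptions, validation_metrics):
    """Append an audit-trail entry for a newly trained model."""
    # Hashing the training data makes the run repeatable in the
    # sense that the exact inputs can be verified later.
    with open(training_data_file, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    entry = {
        "model": model_name,
        "version": version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "training_data_sha256": data_hash,
        "assumptions": assumptions,            # documented for transparency
        "validation_metrics": validation_metrics,
    }
    with open(registry_path, "a") as f:
        f.write(json.dumps(entry) + "\n")      # append-only audit log
```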
How to keep up with model drift
Nick Elprin, CEO and co-founder of Domino Data Lab, a data science platform vendor, agreed that model drift is a problem that must be addressed head on when building a data science development pipeline. In some cases, the drift might be due to changes in the environment, like changing customer preferences or behavior. In other cases, drift could be caused by more adversarial factors. For example, criminals might adopt new strategies for defeating a new fraud detection model.
To keep up with this drift, CIOs need a process for monitoring the effectiveness of their data models over time and for establishing thresholds that trigger a model's replacement when performance degrades.
With traditional software monitoring, IT service management teams track metrics related to CPU, network and memory usage. With data science, CIOs need to capture metrics related to the accuracy of model results. "Software for [data science] production models needs to look at the output they are getting from those models, and if drift has occurred, that should raise an alarm to retrain it," Elprin said.
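A minimal sketch of that kind of monitoring, assuming ground-truth labels eventually arrive for scored records; the window size and accuracy threshold here are illustrative choices, not fixed rules.

```python
from collections import deque

class DriftMonitor:
    """Track the rolling accuracy of a production model and raise
    an alarm when performance degrades past a threshold."""

    def __init__(self, window=1000, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)  # most recent hit/miss results
        self.min_accuracy = min_accuracy      # retraining threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def drift_detected(self):
        # Wait for a full window before alarming, to avoid noise
        # from the first handful of observations.
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.accuracy() < self.min_accuracy)

# Usage: when ground truth arrives for a scored transaction,
# feed the pair back and check the alarm.
monitor = DriftMonitor(window=500, min_accuracy=0.85)
monitor.record(predicted=1, actual=1)
if monitor.drift_detected():
    print("Accuracy below threshold -- schedule retraining")
```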
Fashion-forward data science
At Stitch Fix, a personal shopping service, the company's data science pipeline allows it to sell clothes online at full price. Using data science in various ways lets the company find new ways to add value against deep-discount giants like Amazon, said Eric Colson, chief algorithms officer at Stitch Fix.
For example, the data science team has used natural language processing to improve its recommendation engines and buy inventory. Stitch Fix also uses genetic algorithms -- algorithms designed to mimic evolution by iteratively selecting the best results from a set of randomized changes. These streamline the clothing design process by generating countless candidate designs, which fashion designers then vet.
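As a rough illustration of the technique (a toy sketch, not Stitch Fix's actual system), a genetic algorithm scores a population of candidate "genomes," keeps the fittest and applies randomized mutations to produce the next generation:

```python
import random

def evolve(fitness, genome_length=10, population_size=30,
           generations=50, keep=10, mutation_rate=0.1):
    """Toy genetic algorithm: keep the fittest candidates each
    generation and breed offspring via random mutation."""
    population = [[random.random() for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:keep]
        children = []
        while len(survivors) + len(children) < population_size:
            child = list(random.choice(survivors))   # copy a survivor
            for i in range(len(child)):
                if random.random() < mutation_rate:
                    child[i] = random.random()       # randomized change
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Stand-in fitness function: reward genomes whose values are near 0.5.
best = evolve(lambda g: -sum((x - 0.5) ** 2 for x in g))
print(best)
```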
This kind of digital innovation, however, was only possible, he said, because the company created an efficient data science pipeline. He added that it was also critical that the data science team is considered a top-level department at Stitch Fix and reports directly to the CEO.
Specialists or generalists?
One important consideration for CIOs in constructing the data science development pipeline is whether to recruit data science specialists or generalists. Specialists are good at optimizing one step in a complex data science pipeline. Generalists can execute all the different tasks in a data science pipeline. In the early stages of a data science initiative, generalists can adapt to changes in the workflow more easily, Colson said.
These tasks include feature engineering, model training, extracting, transforming and loading (ETL) data, API integration, and application development. It is tempting to staff each of these tasks with specialists to improve individual performance. "This may be true of assembly lines, but with data science, you don't know what you are building, and you need to iterate," Colson said. Iteration requires fluidity, and if the different roles are staffed with different people, there will be longer wait times whenever a change is made.
In the beginning, at least, companies will benefit more from generalists. But once data science processes have been established for a few years, specialists may be more efficient.
Align data science with business
Today a lot of data science models are built in silos that are disconnected from normal business operations, Domino's Elprin said. To make data science effective, it must be integrated into existing business processes. This comes from aligning data science projects with business initiatives. This might involve things like reducing the cost of fraudulent claims or improving customer engagement.
In less effective organizations, management tends to start with the data the company has collected and wonder what a data science team can do with it. In more effective organizations, data science is driven by business objectives.
"Getting to digital transformation requires top down buy-in to say this is important," Elprin said. "The most successful organizations find ways to get quick wins to get political capital. Instead of twelve-month projects, quick wins will demonstrate value, and get more concrete engagement."