
Today's top data pipeline management challenges

IT executives say pricing models, agility and auditability are some of the biggest challenges they have faced in managing today's increasingly complex data pipelines.

Data pipeline management today isn't easy -- especially as machine learning models gain prevalence.

One of the issues is that existing analytics and data science platforms aren't always suited for today's modern machine learning applications. They're often not agile enough to deliver the real-time data needed to continuously -- and quickly -- train machine learning models.

Instead, data science teams can spend large chunks of their time building complex data pipelines, transforming the data and then training the models on the updated data. That complexity, along with the heavy volume of data in the pipelines and the pricing models of the data analytics tools, can hinder agility, auditability and overall organizational efficiency. Here, IT executives break down the challenges they've been facing in data pipeline management.
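
Before diving into those challenges, here is what that build-transform-train loop looks like in miniature: a minimal sketch assuming a scikit-learn-style setup, where the data source, storage path and feature names are hypothetical placeholders rather than any particular company's pipeline.

```python
# Minimal sketch of the build-transform-train loop described above.
# The fetch_new_records() source, the storage path and the feature columns
# are hypothetical; a real pipeline would plug in its own ingestion and transforms.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def fetch_new_records() -> pd.DataFrame:
    """Placeholder for pulling the latest records from the pipeline."""
    return pd.read_parquet("s3://example-bucket/latest/")  # hypothetical path

def retrain():
    df = fetch_new_records()
    X = df[["feature_a", "feature_b"]]               # hypothetical features
    y = df["label"]
    model = Pipeline([
        ("scale", StandardScaler()),                 # transform step
        ("clf", LogisticRegression(max_iter=1000)),  # training step
    ])
    model.fit(X, y)                                  # retrain on the updated data
    return model
```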

Data platform pricing models

One challenge in data pipeline management is that many commercial platforms charge by volume, according to Len Greski, vice president of technology at travel information company Travelport. The company currently handles more than 100 terabytes of data a day.

"If you pay more based on how much data you process, you're essentially adding a tax on the business process," Greski said. "[With many] products in a data pipeline, the costs can add up quickly."

Travelport is currently re-evaluating its data strategy based on the total cost of ownership of a data science application compared to the economic value it generates, he said. That means, for example, putting more resources into the highest value areas like search. "If you get a better search experience, you improve the conversion rate," Greski said.

Travelport currently uses a combination of home-built, open source and commercial tools to help in its data pipeline management and address recurring issues. These range from messaging and stream processing tools such as IBM MQ and Apache Kafka to Hadoop and Microsoft Azure for data storage and data science capabilities. Some of Microsoft Azure's newer tools allow for the creation of complex, high-volume data pipelines, but there are tradeoffs, according to Greski.
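
As a rough illustration of what a single stage in that kind of stack involves, here is a minimal Kafka consume-transform-produce loop using the kafka-python client; the broker address, topic names and field names are placeholders, not Travelport's actual setup.

```python
# Minimal Kafka pipeline stage: consume raw events, transform, republish.
# Broker address, topic names and fields are placeholders; this is an
# illustrative sketch only. Requires the kafka-python package.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-bookings",                        # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Example transform: keep only the fields downstream models need.
    cleaned = {k: record.get(k) for k in ("origin", "destination", "price")}
    producer.send("cleaned-bookings", cleaned)  # hypothetical output topic
```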

"The complexity makes the systems more prone to failure," he said. "[And] if you have to make a change, [the systems] are more expensive to change. If you have to have relationships with thousands of hotels and airlines, the amount of effort needed to manage all those connections is expensive."

That's not the only pricing concern with Azure, Greski said. Enterprises pay usage charges for orchestration, and there are a variety of execution charges for data movement, for work in the pipeline and for interaction with tools outside of Azure.

There's also the challenge of integrating cloud workloads with those of legacy applications, he said.

Open source tools help address some of the cost issues but have their downsides as well.

"If you use open source tools, you have to make a significant investment in learning how to use them effectively and how to make them run and scale well," Greski said. "And then, how do you get technical support for that when you have a problem?"

The agility challenge and software 2.0

A lack of data and business agility is another hindrance to efficient data pipeline management. As AI moves from being a one-off, add-on feature or pilot project to a core business process, data becomes the new software. Tesla's AI director Andrej Karpathy calls this "software 2.0." If the AI model isn't giving the required results, the solution isn't necessarily to rewrite code in the model but to look at the training data to identify gaps or biases.

Getting better results requires better, more focused data, better labeling and the use of different attributes, according to Greski. It also means that data scientists and data engineers need to be part of the software development process. Over the past year, Travelport has been doing just that by integrating data scientists into its release trains. This ensures the data science work is a "natural part of the overall product as opposed to off to the side," Greski said.

Another company that's completely shaken up its development team is Alegion, an Austin-based training data company. Alegion uses crowdsourced workers to classify training data for companies like Airbnb, Charles Schwab, Home Depot and Walmart.

Until two years ago, the company used basic statistics to track which workers were most accurate or most productive. In 2017, the company began to add intelligence to its classification process. Instead of having, for example, three human workers look at a particular image to determine if it was a cat or a dog, machine learning could step in and replace the third worker in the clearest examples. Reducing the number of humans needed by a third while maintaining -- or even improving -- accuracy would be a substantial benefit and cost savings for customers.
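
A sketch of that consensus logic might look like the following, under the assumption that the model only stands in for the third reviewer when it is highly confident; the threshold and function interface are illustrative, not Alegion's actual implementation.

```python
# Sketch of a model standing in for the third human reviewer, along the lines
# Gates describes. The confidence threshold and interfaces are assumptions
# for illustration, not Alegion's production system.
CONFIDENCE_THRESHOLD = 0.95  # hypothetical cutoff for "clear" examples

def final_label(human_label_1, human_label_2, model_label, model_confidence,
                request_third_human):
    # Two humans agree and the model confidently concurs: no third human needed.
    if (human_label_1 == human_label_2 == model_label
            and model_confidence >= CONFIDENCE_THRESHOLD):
        return human_label_1
    # Otherwise fall back to a third human reviewer to break the tie
    # or double-check an ambiguous example.
    human_label_3 = request_third_human()
    votes = [human_label_1, human_label_2, human_label_3]
    return max(set(votes), key=votes.count)  # simple majority vote
```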

"It was a paradigm change," said Nathaniel Gates, Alegion's co-founder and CEO. "Instead of just making the workers as efficient as possible, we're training our own neural net to augment and provide our own level of judgment."

But the data classification problems that Alegion tackles change constantly as customers need new data to meet their own AI needs. There's natural language processing for chatbots and video annotations for aerial cameras or autonomous vehicles, for example. On top of that, new edge cases are constantly being discovered in existing data, requiring retraining of Alegion's internal machine learning models.

Customers may also want to try different models on the data they're collecting and may need to have Alegion look at different attributes of the same data. Some customers might have 30 different models they're testing and training in parallel, Gates said.

That means that for its own model development, Alegion needed to keep up with a steady stream of changes. It wasn't a matter of building the machine learning technology once, setting it up and just letting it run. The company sent its 10 software engineers for machine learning training and hired six more engineers who were already machine learning specialists. Alegion integrated the new hires with the existing engineers so that machine learning expertise would permeate the whole team.

How auditable is your data pipeline?

Governance, testability and transparency are also among the major challenges that arise in data pipeline management. Take, for example, the question of training data sets. A company collects a giant amount of data, some of which is used to train a model. When the results aren't quite up to par, the model is retrained with a slightly different set of data, said Ken Seier, chief architect for data and AI at technology consulting and system integration firm Insight.

Which data was used to train each version of the model? Keeping a copy of the training data set for each iteration of the training process could quickly result in giant data sets and unnecessarily complex data pipelines.

"Training data should be flagged in order for the system to be provable and auditable," Seier said. "Most of the third-party vendors out there have barely scratched the surface in the complexity of the data management required. The open source tools and commercial tools don't come close."

When the data comes from multiple sources or third-party feeds, flagging becomes even more of an issue. Many companies also delete data they no longer need, which is a problem if the data has been used as part of a training data set and is needed again for technical, legal or compliance reasons.

"The business complexity of which data needs to be frozen is huge," Seier said.

Then there's the issue of transformed and restructured data. Companies may, for example, fix the contrast on images before training image recognition systems or create new synthetic images so that they have a more diverse training set. Should companies save all of that transformed data or remove it from their data pipelines altogether?
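
One middle path for that save-or-discard question is to store the transformation recipe rather than the transformed images themselves, so derived data can be regenerated on demand; a minimal sketch, assuming deterministic transforms and a fixed random seed (parameter values are hypothetical).

```python
# Sketch: store the transform recipe (parameters + seed) instead of the
# transformed images, so augmented training data can be regenerated on demand.
# Parameter values are hypothetical.
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AugmentationRecipe:
    source_image_id: str
    contrast_factor: float   # e.g., the contrast fix mentioned above
    random_seed: int         # makes synthetic variants reproducible

def recipe_record(recipe: AugmentationRecipe) -> dict:
    """What gets stored alongside the model version instead of pixel data."""
    return asdict(recipe)

# A few bytes of provenance per image replaces gigabytes of derived pixels,
# as long as the augmentation code itself is versioned too.
example = AugmentationRecipe("img_00042", contrast_factor=1.2, random_seed=7)
print(recipe_record(example))
```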

Companies facing this issue have a build-or-wait dilemma. Vendors are usually two to three years behind the need, but building these systems from scratch can be costly and time-consuming, Seier said.

"It depends on the business market," Seier said. Companies that don't see an opportunity to create operational benefits or claim huge market share as a result of the new AI projects -- or that are in slower-moving industries -- might want to wait for the vendors to catch up."

"But it only takes one disruptor to change an industry from slow to fast," he said.
