
Platform teams draw on DataOps, MLOps to support GenAI

GenAI calls for more rigorous data discipline than previous forms of analytics, but there are lessons platform teams can learn from existing approaches.

DataOps early adopters were inspired by DevOps principles to help data scientists quickly create business value from big data; as generative AI apps go mainstream, that interplay between IT disciplines is coming full circle.

As with Agile software development and DevOps methods, data operations sought to break down organizational barriers and encourage collaboration between business stakeholders and IT teams. Some exchange of techniques between the disciplines has already occurred: DevOps and platform engineers have applied methods developed by DataOps and machine learning operations (MLOps) pros to AIOps and observability workflows for years.

But now, generative AI is pushing data-driven analytics further into the mainstream among business users, bringing data management and data governance to the forefront of enterprise IT ops concerns.

Adoption of AI, including GenAI, was cited as the top driver for increased usage of corporate data in recent market research. Of 318 respondents to a June 2024 survey by Informa TechTarget's Enterprise Strategy Group, 56% rated AI as the main reason for having more users of corporate data within their organizations. AI adoption also prompts more intense scrutiny of governance and security, concerns at the heart of DataOps principles: 83% of respondents added new data governance roles or expanded existing roles -- or both -- over the last year because of AI.

As early adopters of AIOps and AI automation tools have already discovered, data quality and integrity can make or break such initiatives -- 70% of survey respondents classified these as very high or high priorities for AI projects. And a slim majority, 51%, said they either don't fully trust or somewhat distrust the accuracy of the data used in decision-making.

DevOps and DataOps are similar but separate disciplines that are coming back together as generative AI goes mainstream.

GenAI data pipelines vs. ML data pipelines

For one enterprise with experience in DataOps and MLOps, GenAI has been a different beast when it comes to data quality management and data pipeline architecture.

"Virtually any interesting [large language] model isn't just going to be trained on one stream of data," said Stephen Manley, CTO at Druva, which markets a data resilience SaaS platform. "You're going to be putting in multiple streams, which requires, I think, a much higher degree of rigor than some of what we may have done on the ML side, where that tended to be a smaller set of higher-intensity data."

Stephen Manley, CTO, Druva

For example, Druva's ML pipeline processes 100 billion operations per hour, but that much data wouldn't be well suited to GenAI models, Manley said. ML data tends to be used to discover anomalies in a series of well-defined events; by contrast, training the large language models behind Druva's Dru AI assistant required fine-tuning on three different data streams with mostly lower rates of change to ensure consistency.

One stream of LLM data consists of Druva documentation; another, information on the Mitre ATT&CK incident response framework, which changes infrequently. The third, more frequently updated data feed consists of customer information used to personalize AI responses for individual customers.
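
The article doesn't show Druva's implementation, but the multi-stream idea can be sketched in a few lines of Python. Everything below -- the stream names, loaders and record format -- is hypothetical; the point is that each source carries its own refresh cadence and a provenance tag, so consistency can be checked per stream.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Iterable

@dataclass
class Stream:
    name: str
    refresh: timedelta                    # how often the upstream source changes
    loader: Callable[[], Iterable[str]]   # returns raw text records

# Stub loaders standing in for the three feeds described above
def load_docs():
    return ["To restore a backup, open the Dru console ..."]

def load_mitre():
    return ["T1486 Data Encrypted for Impact: adversaries encrypt data ..."]

def load_customer_context():
    return ["Tenant 42: last successful backup 02:00 UTC ..."]

def build_corpus(streams: list[Stream]) -> list[dict]:
    """Merge several source streams into one fine-tuning corpus,
    tagging each record with its origin for per-stream QA."""
    return [
        {"source": s.name, "refresh": str(s.refresh), "text": text}
        for s in streams
        for text in s.loader()
    ]

corpus = build_corpus([
    Stream("product_docs", timedelta(days=30), load_docs),
    Stream("mitre_attack", timedelta(days=90), load_mitre),
    Stream("customer_context", timedelta(hours=1), load_customer_context),
])
```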

The integration of customer data puts GenAI at Druva on a different footing than past big data and ML apps, Manley said.

"On the ML side, it was all internal metadata that we were processing, so it was all [subject to] our standard [need to] be secure, but it was all internal and never externally viewed," he said. "Because [the GenAI app is] externally viewed, you need a DataOps pipeline that is rigorous around privacy and security."


To address this, Druva engineers integrated the GenAI data pipelines into the company's mainline IT security infrastructure for functions such as role-based access control and authentication. Building the GenAI data pipeline also presented scaling challenges, since Druva has to support 6,000 logically separate pipelines -- one for each of its customers -- within this rapidly changing data stream.
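
The article doesn't detail those integrations, but the pattern is straightforward: the GenAI pipeline calls the platform's existing RBAC layer before any tenant data is retrieved, so the model never sees data the caller couldn't view directly. A loose illustration -- every name here is made up:

```python
class AccessDenied(Exception):
    pass

class RBACService:
    """Toy in-memory stand-in for the platform's real RBAC backend."""
    def __init__(self, grants: set[tuple[str, str, str]]):
        self.grants = grants

    def is_authorized(self, user: str, resource: str, action: str) -> bool:
        return (user, resource, action) in self.grants

def fetch_tenant_context(user: str, tenant_id: str, rbac: RBACService) -> str:
    """Gate per-tenant GenAI context behind the same checks
    the rest of the platform uses."""
    if not rbac.is_authorized(user, f"tenant:{tenant_id}", "read"):
        raise AccessDenied(f"{user} may not read tenant {tenant_id}")
    return f"context for tenant {tenant_id}"  # placeholder data fetch

rbac = RBACService({("alice@example.com", "tenant:42", "read")})
print(fetch_tenant_context("alice@example.com", "42", rbac))
```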

"DataOps and DevOps are separate ... but there are overlaps. ... It's not as if they can't learn from each other," Manley said. "And so what we learned on the DevOps side is [to] containerize everything, make everything fairly ephemeral. ... We viewed it the same way on our data pipeline ... so that I'm not putting in persistent infrastructure, because that would bankrupt us if we tried to do it for 6,000 customers."

Manley predicted that the deepening relationship between DevOps and DataOps will continue, as AI-generated data from the company's LLMs could loop back into analytics and ML systems to test data pipelines. GenAI data pipelines and workflows will also grow more complex as Dru takes action on customer systems rather than generating passive recommendations.

Workflow orchestration aids DataOps platforms

Druva built its own ML data pipelines years ago, based on Amazon S3 storage and the Apache Iceberg table format, as well as internally developed code for data conversions. So far, the company's need for fine-grained versioning of data and custom data quality checks has kept it from using a commercial product, Manley said.

For other organizations, open source and commercial orchestration tools have proven vital to platform engineers who support DataOps. For example, the open source Apache Airflow workflow-as-code project uses directed acyclic graphs (DAGs) to organize tasks and their dependencies into sophisticated systems. It can be used as a data flow tool, which focuses on how data moves through a distributed system -- including GenAI prompts -- or as a workflow tool, which focuses on how a particular data object progresses through a series of changes.
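
A minimal Airflow DAG using the TaskFlow API shows the shape: each decorated function becomes a task, and the call chain defines the graph's dependencies. The task bodies here are placeholders.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_data_pipeline():
    @task
    def extract() -> list[dict]:
        return [{"event": "login", "ok": True}]  # placeholder source read

    @task
    def transform(rows: list[dict]) -> list[dict]:
        return [r for r in rows if r["ok"]]

    @task
    def load(rows: list[dict]) -> None:
        print(f"loading {len(rows)} rows")  # placeholder sink write

    # Calling the tasks wires the DAG: extract -> transform -> load
    load(transform(extract()))

example_data_pipeline()
```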

Workflow orchestration is primarily how Black Crow AI, makers of an AI-driven analytics app for e-commerce companies, uses Airflow. Black Crow has a platform team specifically for DataOps that oversees the development and implementation of data pipelines for machine learning, predictive AI and generative AI. Airflow helps the data platform team act as site reliability engineers by automating parts of the troubleshooting process for data pipelines, said Will Norman, senior staff engineer at Black Crow AI.

Will Norman, senior staff engineer, Black Crow AI

"It's [handling] the orchestration part [of the workflow], but also the alerting part, so if something goes wrong, it triggers a PagerDuty alert so that whoever's on call can investigate the issue with the job," he said. "Or if these things start running longer than we expect, then [it will] trigger some SLA [service-level agreement] alerts so that we can get some visibility and start digging into why."

So far, the biggest change prompted by GenAI has been the importance of human guidance when working with LLMs, Norman said.

"You can build tooling [for] generating code or content using GenAI, but having somebody that actually looks at that before it goes live is super helpful," he said. "Developing evaluation criteria is also super helpful, and those are things that you could potentially orchestrate with Airflow [to make] sure that your evaluation criteria are being met."

Here, Black Crow can borrow from similar tooling that the data platform team developed to evaluate ML models.

"We have jobs that are constantly running, checking out our model performance," Norman said. "Some of our model retraining that runs on a daily cadence is just looking at model performance and saying, 'Has there been degradation? Do we need to retrain these models? Is there anything that someone needs to look into?'"

As an extensive user of AWS services, Black Crow's data platform team jumped on Amazon Managed Workflows for Apache Airflow (MWAA) as soon as it became available in 2020 to eliminate the toil of managing its own servers. As its business and Airflow environment grew, Black Crow turned to a specialized managed service from Astronomer that better handled incremental scaling and supported upgrade rollbacks -- work that had chewed up the data platform team's time under MWAA.

Astronomer's Astro suite supports the automated deployment of Airflow on containers, including ephemeral deployments for test jobs, along with direct integration into CI/CD tools that Norman said he plans to evaluate for future use.

DataOps tools adapt to GenAI era

Platform teams have enlisted other automation tools to support AI apps, including microservices orchestration frameworks and microservices choreography utilities. Prefect, a competitor to Apache Airflow and its associated managed service providers, touts workflow orchestration that doesn't rely on a DAG. In June, Prefect released ControlFlow, an open source Python framework for AI workflows, in anticipation of the more complex orchestration that agentic AI will demand.

"We have found that one of the ways that people can get higher-quality outputs is by being very structured and teaching AI what a workflow actually looks like -- giving it discrete objectives to accomplish that it's not allowed to move on [from] until [it does it correctly]," said Jeremiah Lowin, founder and CEO at Prefect, in an interview with Informa TechTarget Editorial earlier this year. "We've used our workflow expertise to develop a product for controlling [AI] agents, and letting people actually have confidence in the fact that [they can] turn one on and it doesn't run out into the world and scream at customers, because it's got a very tight leash on it."

Prefect customer Flatiron Health hasn't yet applied ControlFlow to agentic AI workflows, but it's still early for both of those things, according to Will Shapiro, chief of AI and vice president of data science at the biotechnology research firm in New York City.

"We're using agentic design to work on scientific document creation, and it's not something that we need to do even weekly -- the volume hasn't required us to think about more robust orchestration on production systems [yet]," Shapiro said during an August interview. "[But] we might start using agentic approaches in more high-volume situations that would require thinking about orchestration in more depth."

Flatiron Health, which aggregates and analyzes data from cancer patients that researchers can use to improve oncology outcomes, was able to scale up its data sets tenfold using machine learning. Shapiro said he hoped that LLMs and agentic AI would further expand the process of interpreting electronic health records, which is done by humans at Flatiron now.

"Historically, we've hired human abstractors to go through and manually curate data from the chart, and we've written up in-depth policies and procedures for how each type of variable should be extracted," he said. "What we've been experimenting with is feeding that set of instructions into an LLM, and then using one [model to] take a pass at ingesting the instructions and extracting data, then another one to look at the data that was extracted, identify errors and then iteratively do prompt refinement using multiple agents."

In the meantime, Prefect brought order to Flatiron's MLOps practices by replacing an internally developed extract, transform and load (ETL) system, Shapiro said.

"The homegrown ETL system was ... fairly opaque and very challenging from an onboarding perspective, from an interpretability perspective, from an observational perspective," he said. "Prefect has been great at creating observability, reproducibility and standardization of data processing at large."

Shapiro has since left Flatiron for another company, but a Flatiron spokesperson said there had been no significant change in the company's use of Prefect and agentic AI as of late October.

SnapLogic adds GenAI app and agent builders as a service

SnapLogic, an integration platform as a service (iPaaS) vendor founded in 2006, has built a business on cloud-hosted prebuilt utilities that integrate multiple software applications and their associated data. It was a natural evolution for the company to begin offering automated, low-code/no-code tools to wire together the elements of GenAI apps and AI agents as LLMs emerged over the last two years, according to company officials. Competitors in iPaaS such as Boomi and Informatica also added GenAI integrations over the last year.

SnapLogic released GenAI Builder in January 2023 and replaced it with AgentCreator in October 2024. AgentCreator expanded GenAI Builder to support agents -- autonomous sets of AI-driven systems that can take action to execute workflows.

Not every enterprise is comfortable with a cloud-hosted platform or needs the scale and low-code/no-code features built into iPaaS products, which can come at an enterprise-scale price, according to Black Crow AI's Norman. But one SnapLogic customer can say it has something many other enterprises didn't as of early November: GenAI apps in production.

Mike Wertz, program engineering leader, Aptia Group U.S.

"SnapLogic with AgentCreator allows us to go from concept to prototype to development to deployment really fast," said Mike Wertz, program engineering leader at Aptia Group U.S., an employee benefits and pensions administration company based in Boston. "Traditionally, you would have had to spend a lot of time coding -- being able to put that into two Snaps [prebuilt connectors] with very limited configuration was an excellent fit for us."

Aptia was already a SnapLogic iPaaS customer for ETL workflows before using AgentCreator. Faster GenAI application development with the new tool sped up the process of improving and verifying the quality of LLM results, Wertz said. This helped Aptia get GenAI automation for summarizing and organizing unstructured data from benefits plan documents into production on its website.

"[Previously, that meant] an analyst needing to get that data from the source, read through that data from various sources, extract the data we need, format that data, load the data -- we estimated that [took], at best, an hour," Wertz said. "We've cut that down to just a couple seconds. It passes through the large language model pretty quick, SnapLogic takes care of the rest of it and loads it into the data source, and we're done."

Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.
