Some of the same lessons -- and unsolved problems -- from supporting machine learning apps in production carry over to generative AI apps, but not all.
SANTA CLARA, Calif. -- Machine learning introduced collaboration and performance management issues for engineers, but large language models present an even greater departure from traditional approaches to reliability engineering.
Machine learning operations versus large language model operations was the topic of presentations and sessions at SREcon this week, including a discussion session on attendees' experiences supporting machine learning models in production and how those compare with supporting LLMs.
The rise of MLOps and LLMOps echoes the earlier transition to DevOps -- past efforts to get IT ops specialists, developers and data scientists collaborating introduced similar organizational friction, attendees said.
"This goes back to DevOps versus SRE and how you handle expertise and responsibility," said Jacob Scott, an SREcon attendee and software engineer with 15 years of experience in operational excellence. "[Things] like, should data scientists be on call? And how do you get people to do that?"
With LLM-based apps, SRE teams face a similar question: "Who can fix it?" Scott said. "There are a lot of failures that SREs are best positioned to fix, like load shedding if your database is on overload and figuring out it's overloaded. But who is positioned to respond to an LLM being buggy, or hallucinations?"
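Load shedding, the example Scott cites of a failure SREs are well positioned to handle, typically means rejecting some traffic before an overloaded dependency collapses. The toy sketch below illustrates the general idea only; the thresholds, field names and queue-depth signal are assumptions for the example, not anyone's production setup.

```python
import random

# Illustrative only: reject an increasing fraction of non-critical requests
# once a saturation signal (an assumed database queue depth) passes a threshold.
SHED_START_DEPTH = 200   # queue depth at which shedding begins (assumed)
MAX_QUEUE_DEPTH = 500    # queue depth at which all non-critical traffic is shed (assumed)

def should_shed(db_queue_depth: int, critical: bool = False) -> bool:
    """Decide whether to reject this request to protect the overloaded database."""
    if critical or db_queue_depth < SHED_START_DEPTH:
        return False
    # Probability of shedding ramps linearly from 0 to 1 between the thresholds.
    overload = (db_queue_depth - SHED_START_DEPTH) / (MAX_QUEUE_DEPTH - SHED_START_DEPTH)
    return random.random() < min(overload, 1.0)

def handle_request(request: dict, db_queue_depth: int) -> dict:
    if should_shed(db_queue_depth, critical=request.get("critical", False)):
        return {"status": 503, "body": "shedding load, please retry"}
    return {"status": 200, "body": "handled"}  # normal path would query the database here
```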
MLOps and LLMOps are both all about the data
Another similarity between MLOps and LLMOps is an emphasis on precision when managing data inputs and the struggle to deal with the deep and subtle dependencies this can create in the overall system. That was the consensus among participants in a discussion session this week conducted under the Chatham House Rule, in which statements are repeatable but not attributable to their specific sources.
"I've seen multiple places where like you get bad data, like a sub-pipeline goes down, and you don't notice, or you silently stop emitting events that feed it in some way, and the model degrades," a discussion session participant said. "But if you don't have enough fidelity of measuring what success actually is, and days or weeks later, someone running a query across a higher-fidelity system will be like, 'Why did this business metric go whack?' And you're like, 'Oh, [expletive]. Now I realize my pipeline is broken.'"
This can worsen political friction between parts of the organization, according to another participant, who recalled an ML team interpreting such a failure as a sign that SREs didn't care about its systems.
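The guardrail participants wished for amounts to checking that upstream data keeps flowing. A minimal sketch of such a freshness and volume check appears below; the SLO, volume floor and function names are assumptions, not any participant's actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: surface a silently broken feature sub-pipeline in minutes,
# rather than discovering it weeks later through a degraded business metric.
FRESHNESS_SLO = timedelta(hours=1)  # assumed: events should arrive at least hourly
MIN_HOURLY_EVENTS = 1000            # assumed: typical volume floor for this pipeline

def check_feature_pipeline(last_event_time: datetime, events_last_hour: int) -> list[str]:
    """Return alert messages for a single upstream feature pipeline."""
    alerts = []
    now = datetime.now(timezone.utc)
    if now - last_event_time > FRESHNESS_SLO:
        alerts.append(f"stale: no events since {last_event_time.isoformat()}")
    if events_last_hour < MIN_HOURLY_EVENTS:
        alerts.append(f"low volume: {events_last_hour} events in the last hour")
    return alerts
```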
The problem of subtle degradation in complex systems also applies to LLMOps, where small changes to prompts and models can have damaging results, participants said.
LLMOps is made of people
However, the effects of subtle changes to LLM data can be even thornier to track down and can cause higher-profile failures, said Niall Murphy, co-founder and CEO of SRE tools vendor Stanza Systems, in an interview with Informa TechTarget.
"A lot of quality concerns [with LLMs] can be quite narrow, like, 'This model has now gone off the rails with respect to the Schleswig-Holstein debate of the 1850s, but it doesn't actually make a difference to 'How do I make pancakes?'" Murphy said. "There are some question spaces which are commoner than others, so you can have a degradation and still not affect the people who care about pancakes. And that's OK, except when that starts to drift and affect other things as well."
MLOps presented a monitoring challenge because failures could be more subtle than a system being "up" or "down." Still, those failures could be measured more concretely -- and avoided more easily -- than with LLMs, which must be evaluated through subjective human judgments of how the AI responds to text-based prompts.
"You're taking a bunch of statistics that traditionally a product manager would look at, and you are pulling them into the reliability side of the house," said a discussion session participant.
According to another, "When you only have signals from a traditional application or a traditional system, you're figuring out how to keep an adaptive system undersaturated, under capacity, in a happy little circle [of performance] ... [But] the golden signal for AI might be, 'Is this prompt in the right context? If I try a family of contexts, are the ones that I'm picking effective?' That replaces a golden signal with a process."
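One way to turn that "family of contexts" into a recurring measurement is to score a fixed set of representative prompts against reference answers on a schedule and alert on the aggregate score. The sketch below is illustrative only; call_llm() and score_answer() are placeholder stand-ins, not a real model client or judge.

```python
# Illustrative only: a scheduled evaluation process in place of a single golden signal.
CONTEXT_FAMILY = [
    {"prompt": "How do I back up my database?",
     "reference": "use the scheduled backup feature in the console"},
    {"prompt": "How do I download last month's invoice?",
     "reference": "open billing and choose invoices for the previous month"},
]

def call_llm(prompt: str) -> str:
    # Placeholder: a real pipeline would call the production model endpoint.
    return f"canned answer for: {prompt}"

def score_answer(answer: str, reference: str) -> float:
    # Placeholder judge: crude word overlap; real setups might use human raters
    # or an LLM-as-judge.
    ref_words = set(reference.lower().split())
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / max(len(ref_words), 1)

def evaluate_context_family() -> float:
    """Mean quality score across the context family for this evaluation run."""
    scores = [score_answer(call_llm(c["prompt"]), c["reference"]) for c in CONTEXT_FAMILY]
    return sum(scores) / len(scores)
```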
The human factor with LLMs, and the relatively high profile of the technology, can also worsen the kinds of organizational conflict some attendees saw with MLOps. For example, in a presentation Tuesday, Microsoft corporate vice president Brendan Burns described intense mistrust of Azure Copilot among users when the company first rolled out the AI assistant.
"We had some very, very disturbed users when we first rolled out the Azure Copilot who strongly believed that we had stolen ... their VM data ... and trained it into the model," Burns said.
As the Azure Copilot team refined prompts behind the scenes, there were also sometimes tense discussions among internal stakeholders about which team's prompts were used to respond to user questions, Burns said.
"If I walk up to the Azure Copilot and I say, 'How do I back up my database?' Is that a prompt for the database team to handle, or is that a prompt for the backup team to handle?" he said. "You can look at agentic approaches to blend those answers together, but especially early on, we just chose one handler and went with it. And then teams would get very upset, and they'd say, 'Why didn't you ever choose my handler? Why do you hate us and love the backup team so much?'"
Some discussion session participants reported that user mistrust of AI went even further in their organizations. They recalled users deliberately sabotaging AI test responses because they feared AI would eventually replace their jobs.
LLMs' murkier versioning makes for disruptive upgrades
There are also significant technical differences between managing machine learning models and LLMs in production, attendees said, including costs. Machine learning models require specialized distributed systems but can be trained relatively quickly with fewer resources and more discrete versioning than LLMs, Murphy said.
"When you're building LLMs ... for which the training run is, shall we say, weeks or months rather than hours, if you make a mistake, you are out of luck until you build a new one," Murphy said.
At the same time, even tiny changes to an LLM can have huge implications for apps that use it, Burns said during his presentation.
"Honestly, it's like switching the underlying database or switching the underlying architecture of your system," he said. "It's that scale of change in our experience."
When this statement surfaced during the discussion session, another participant added, "But you have to do it three times a year."
Moreover, since specialist companies typically maintain foundation models, downstream users might not be aware that a model has changed until they see disruption in their apps, Murphy said.
For example, after Anthropic updated its Claude Sonnet model from version 3.5 to version 3.7 last month, some users reported increased errors and poorer quality results.
"There was no way for the outside world to determine that actually, the thing that they had pointed their API at had, in fact, changed behind the scenes," Murphy said.
Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.