
SREs map uncharted territory with LLMOps

Generative AI apps upend longstanding reliability engineering principles, mindsets and skill sets. Here's how experts are weathering the disruption.

SANTA CLARA, Calif. -- The engineers tasked with keeping the world's systems running smoothly face a steep -- and in some ways unprecedented -- learning curve as generative AI takes center stage in IT.

That was the dominant topic of discussion at SREcon Americas 2025 this week, from its kickoff general session by a Microsoft corporate vice president on lessons learned building Microsoft Azure Copilot to birds-of-a-feather sessions and hallway discussions about large language model operations (LLMOps).

Generative AI creates a profound change in the way systems behave and, thus, a profound shift in how they must be managed, said Niall Murphy, co-founder and CEO of SRE tools vendor Stanza Systems, who is better known in the industry as a co-author of Google's seminal 2016 book "Site Reliability Engineering."


With LLMOps, "We move into a world where determinism with respect to management of a system has gone away, and we are into probabilistic management," Murphy said in an interview with Informa TechTarget. "And so a huge amount of the techniques and mindsets and approaches that we learned from cybernetics in the '50s ... have to be supplanted by things like confidence signals and approaches that attempt to look at holism of a system rather than a specific response."

Microsoft Azure Copilot: Lessons learned

Microsoft corporate vice president Brendan Burns shared some of his company's early experiences with probabilistic management as it deployed Microsoft Azure Copilot during a plenary session presentation Tuesday.

Among the issues that surfaced for the team that built Azure Copilot was a major change to testing, debugging and observing systems, Burns said.

"Obviously, we're going to monitor all the same things: Do we return results successfully? What's the latency? None of that stuff changes, but it no longer says that your system is working right," he said. "And what is going to say whether your system is working right is the user's feedback."

That means that the most important signal for SRE and DevOps teams maintaining an LLM-based app is a user's "thumbs up" or "thumbs down," which can be quantified with statistical measures such as net promoter score and net satisfaction but is ultimately "a lot less like measuring things and a lot more like social media," Burns said.
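To make that concrete, here is a minimal Python sketch of turning thumbs-up/thumbs-down events into a rolling net-satisfaction-style score a team could alert on. The field names and alert floor are illustrative assumptions, not Microsoft's telemetry schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeedbackEvent:
    response_id: str
    thumbs_up: bool          # True = thumbs up, False = thumbs down
    timestamp: datetime

def net_satisfaction(events: list[FeedbackEvent]) -> float:
    """Percentage of positive votes minus percentage of negative votes."""
    if not events:
        return 0.0
    ups = sum(1 for e in events if e.thumbs_up)
    downs = len(events) - ups
    return 100.0 * (ups - downs) / len(events)

def should_investigate(events: list[FeedbackEvent], floor: float = 20.0) -> bool:
    """Open an investigation when the score drops below a chosen floor."""
    return net_satisfaction(events) < floor
```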

Measuring human behavior is slippery at best. For example, the Azure Copilot team noticed that an outage anywhere in the Azure system tended to skew human Copilot evaluations negatively.

"If Azure has an outage in general ... the net promoter score for the client tools takes a dip just because people are just a little bit grumpy," he said. "It's not that different from understanding, 'Is this outage due to my system failing, or some downstream dependency failing?' But it's a lot fuzzier."

Prompt engineering introduced another new wrinkle for the Azure Copilot team, Burns said.

"In these systems, the prompt -- and really, actually, it's more the meta prompt, the stuff that you're putting around the prompt -- is the code," he said. "And so the same things that you think about when you think about rolling out software, you need to be thinking about when you're rolling out the prompt. Any changes there can have a really big impact on the overall quality of the system that you're building."

LLM apps are still maturing, especially for structured approaches to versioning and developing meta prompts, Burns said.

"I don't think we have a really good way of having things like our integrated development environments [IDEs] reason about that right now, or even version it independently," Burns said. "The fact that it's tied into the code is probably a problem, because you'd want to be able to move through it in version space independently. ... We're still exploring the right ways to do software development here."

LLMs evaluating LLMs?

Release qualification for LLM-driven products is another major area of disruption the Azure Copilot team encountered, and one it ultimately called in LLM reinforcements to help with, Burns said.

"Previously, you'd ... run all the tests, and if it's green, you release it," he said. "In the world of AI prompts ... every single change that you make is going to make some things better, but it's probably going to make some things worse. And what you need to be able to say is, 'On aggregate, I think that this thing makes things better.'"

Brendan Burns, corporate vice president at Microsoft, presents at SREcon 2025.

The Azure Copilot team used LLMs to generate test cases at a high volume -- tens of thousands -- and evaluate the quality of their results, according to Burns.
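The pattern Burns describes, in which one model generates test prompts at volume while another grades the product's answers, can be sketched roughly as follows. The `generate_model`, `judge_model` and `product` callables are hypothetical stand-ins for whatever completion API is in use; this is not Azure Copilot's actual pipeline.

```python
import random

def build_eval_set(generate_model, topics: list[str], cases_per_topic: int) -> list[str]:
    """Have a model write synthetic user questions, many per topic."""
    cases = []
    for topic in topics:
        for _ in range(cases_per_topic):
            cases.append(generate_model(f"Write a realistic user question about {topic}."))
    return cases

def judge(judge_model, question: str, answer: str) -> float:
    """Ask a judge model for a 0-1 quality score; parse defensively."""
    raw = judge_model(
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer's quality from 0.0 to 1.0. Reply with only the number."
    )
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # treat unparseable judgments as failures, not free passes

def evaluate(product, judge_model, eval_set: list[str], sample_size: int = 1000) -> float:
    """Score the product's answers over a random sample of the eval set."""
    sample = random.sample(eval_set, min(sample_size, len(eval_set)))
    return sum(judge(judge_model, q, product(q)) for q in sample) / len(sample)
```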

These remarks generated some debate among attendees about the trustworthiness of the approach. One question submitted on Slack and asked in the Q&A portion of Burns' talk said, "How do you prevent the models from 'cheating' and saying 'yes, the result is good' when it is not? Or generating only the test cases they 'know' the answers to?"

Burns replied that models are not motivated to generate tests they know are bad, an idea Murphy disputed in the Slack thread. However, Burns and Murphy both said the sheer scale of tests required for prompt evaluation makes generating them a job better suited to LLMs than to humans. Both also emphasized the need for human evaluation at some stage of the release cycle, along with progressive deployments.

"If you don't have an automatic way of, in some sense, assessing the performance of the model in coarse-grained form, then it's actually incredibly hard to get anything out the door at all," Murphy said in an interview. And for most businesses, the economic incentives to get LLM apps out the door supersede many other concerns.

Companies that don't have a built-in audience of 100,000 internal employees to perform canary testing, as Azure does, can define sets of must-pass tests for result quality and measure signals such as whether users ask a model reformulations of the same question, Murphy said.
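Those two signals, a small must-pass check set and a "did the user have to rephrase?" rate mined from session logs, could be approximated with something like the following sketch. The similarity heuristic and threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def must_pass(answer_checks: dict[str, bool]) -> bool:
    """Every named check (grounded, no policy violation, cites a source...) must hold."""
    return all(answer_checks.values())

def reformulation_rate(sessions: list[list[str]], threshold: float = 0.8) -> float:
    """Fraction of consecutive queries that look like rewordings of the previous one."""
    reformulations = comparisons = 0
    for queries in sessions:
        for prev, cur in zip(queries, queries[1:]):
            comparisons += 1
            if SequenceMatcher(None, prev.lower(), cur.lower()).ratio() >= threshold:
                reformulations += 1
    return reformulations / comparisons if comparisons else 0.0
```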

Jacob Scott, an SREcon attendee and software engineer with 15 years of experience in operational excellence, said model result evaluations can still be systematized in some ways.

"There are things you can do, like probes or golden path tests," Scott said. "You can run an end-to-end test with synthetic traffic [with a] known set of inputs and outputs."

Scott and others in the SREcon Slack cited blog posts detailing such an approach by Hamel Husain, a machine learning engineer who previously worked at Airbnb and GitHub and now works as an independent consultant.

Murphy said another emerging tool that could help reliability engineers improve LLM performance is Model Context Protocol (MCP), an open standard developed by Anthropic.

"That is a way of essentially putting an LLM-legible wrapper around a data set or system, such that the model finds it easier and more reliable to interact with, rather than the API or whatever it is by default," Murphy said. "The [incentives] for [better] tests and accuracy [are more] for the foundational model building companies than the massively distributed set of people who are wrapping their stuff with MCP, but maybe they're not completely absent [elsewhere]."

Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.
