Dynatrace strategist favors SLMs for AIOps

Specialized small language models will be necessary to automate revisions to production code for incident response, according to a longtime industry expert.

Generative AI already helps boost preventive IT automation with AIOps tools, but automatically revising production code in response to incidents will require further development, according to an experienced industry expert.

Alois Reitbauer, chief technology strategist, Dynatrace

Alois Reitbauer is chief technology strategist, head of open source and research leader at observability vendor Dynatrace, where he has worked for most of the last 18 years. In that time, he's seen AIOps evolve from warning of impending problems before they happen to identifying the root causes of issues and suggesting possible solutions. Most recently, generative AI has given users quicker answers to observability questions than traditional query languages, and explained errors and incidents along with suggested remediations in natural language.

Such features are enough to prevent or remediate most issues, Reitbauer said during an interview on Informa TechTarget's IT Ops Query podcast. But for the roughly 5% of problems that are impossible to predict, finer-tuned small language models (SLMs) and AI agents will be necessary to automate remediation.

"You need the model to be very good at understanding how to rewrite code," he said. "Most people think you tell it exactly how to rewrite a certain manifest to change something, but that's actually not the case."

Many generative AI coding models exist, but Reitbauer said there's a difference between writing new code in response to prompts and understanding how rewriting existing infrastructure code will change its output in a production context.

"It has been a bit of a challenge when you look at [large language] models, because most of them are trained to work on the actual code, but not so much the output of code and saying, 'Give me something that changes this output somehow,'" he said. "That's where I see the challenge right now, and we're experimenting with [SLMs,] which are particularly good at that."

AI agents will also give rise to more specialized SLMs, Reitbauer predicted.

"Even today with a combination of the different AIs, you often tell the system what to do: 'If it does this, [do that to] keep it running,'" he said. "Where we eventually want to go with agentic is moving even more responsibility over to the models: 'Ensure that I don't run out of disk space.' That's all I would tell the system to do."
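Reitbauer's disk-space example can be illustrated with a minimal sketch of goal-level delegation. Everything here is hypothetical: the threshold, function names and remediation hook are illustrative assumptions, not part of any Dynatrace product. The point is that the operator states only the outcome, and the agent decides when and how to act:

```python
import shutil

# Assumed threshold for illustration only, not a real product setting.
DISK_USAGE_LIMIT = 0.85


def check_disk(path="/"):
    """Return the fraction of disk capacity currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def ensure_disk_space(path="/", remediate=None):
    """Goal-driven loop body: 'ensure I don't run out of disk space.'

    `remediate` stands in for whatever action the agent chooses --
    rotating logs, pruning images, expanding a volume -- rather than
    a step-by-step runbook the operator wrote in advance.
    """
    used = check_disk(path)
    if remediate is not None and used > DISK_USAGE_LIMIT:
        remediate(path, used)
    return used
```

In the "if it does this, do that" style Reitbauer describes as today's norm, the operator would have to supply the trigger and the action; in the agentic style, only the goal is given and the `remediate` decision moves into the model.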

To effectively respond to a range of issues, AI agents would need to work in groups made up of specialists in different areas, such as databases or cloud hyperscalers' infrastructure, Reitbauer said.

"That's what's really coming along with agentic AI, that we delegate these higher-level tasks, and even in the resolution process say, 'OK, this service has a response time of 500 milliseconds. How can it get it to 200?'" he said.

Beth Pariseau, a senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.
