Platform engineers call for cloud-native GenAI updates
GenAI apps put unprecedented pressure on IT infrastructure management, which has IT pros in the field looking to the open source community for reinforcements.
SALT LAKE CITY -- Platform engineers face unique challenges supporting their organizations' generative AI workloads, and they're asking open source projects to address them.
In the year since the last KubeCon + CloudNativeCon North America, the dizzying pace of generative AI development has continued. Last year, efforts to provision GPU hardware more efficiently using Kubernetes for early AI experimentation were just beginning; this year, enterprises have already moved through the first cloud-based phase of AI development. Now, some companies want to host more large language models (LLMs) and associated workloads on self-managed infrastructure to cut costs and preserve data privacy.
But, as enterprise IT pros are now discovering, there are good reasons cloud providers with experience running complex infrastructure at scale were the first to shoulder that operational burden.
Generative AI infrastructure is not only complex, often requiring multiple Kubernetes clusters working in concert and connected by equally complex multi-cloud and multi-cluster networks, but it must also be highly reliable, according to a keynote by engineers at CoreWeave, a cloud computing startup that specializes in hosting AI infrastructure.
"There are tons of complex physical components that have a direct impact to your hyper-connected training cluster in Kubernetes. Any change in any layer of the stack can impact the cluster health or the job performance, and if something goes wrong … it has detrimental impact to the entire cluster," said Chen Goldberg, senior vice president of engineering at CoreWeave, during the presentation. "Silent data corruption might also occur and can severely affect model quality."
Moreover, "repeat offenders and intermittent failures are not a nuisance," Goldberg said. "They are the obstacle for experimenting and getting things done quickly."
Even when enterprises don't train large foundation models, hosting and serving LLMs, along with fine-tuning and inference workflows, place new demands on internal developer platforms, according to a keynote presentation by Aparna Sinha, senior vice president and head of AI product at Capital One.
Despite the team's years of experience running a machine learning platform, generative AI required Sinha's group to add new data services to support the large amounts of unstructured data used by LLMs, additional cross-platform services for semantic search and summarization, user-friendly interfaces to support software developers as well as data scientists and researchers, and updated API management and security guardrails. On top of that, the platform had to remain easy for developers to use, and all of that work merely laid the groundwork for the next frontier of agentic AI.
Now, as more platform engineers embark on this path, open source tools can be helpful, but require further development, Sinha said.
"If you use closed source [tools], you can actually get most of this platform, or many aspects of it, [with] very little work to be done, and that gives you good time to market," she said. "But on the other hand, open source [has] really started to catch up … and so now you have the ability to create a platform in house that's far more customized and tailored to your needs. … But of course, building up that platform requires an open source community and a number of components that are yet to be invented."
Kubernetes must learn new GenAI tricks
Kubernetes turned 10 this year, bordering on ancient in cloud-native circles, and its management practices had matured before the generative AI wave. Now, according to platform engineers, some aspects of the platform need refurbishment.
Standard CPU-based workloads were well suited to horizontally scaling pods and nodes in Kubernetes clusters. But the cost and scarcity of GPU hardware require more stringent bin packing of workloads onto larger individual nodes and greater efficiency of resource consumption across Kubernetes clusters, according to panelist presentations at KubeCon.
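To make that concrete, the following is a minimal sketch of what a tightly packed GPU workload can look like when defined with the official Kubernetes Python client. The pod name, image, namespace and node label value are hypothetical, and the nvidia.com/gpu resource name assumes NVIDIA's device plugin is installed on the cluster.

```python
# Minimal sketch, assuming the official `kubernetes` Python client and a cluster
# running NVIDIA's device plugin; names such as "llm-inference" are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference", labels={"app": "llm-inference"}),
    spec=client.V1PodSpec(
        # Steer the pod toward large GPU nodes so accelerators are packed tightly.
        node_selector={"node.kubernetes.io/instance-type": "gpu-large"},
        containers=[
            client.V1Container(
                name="server",
                image="registry.example.com/llm-server:latest",
                resources=client.V1ResourceRequirements(
                    # GPUs are extended resources: requests and limits must match,
                    # and only whole devices can be requested.
                    requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "2"},
                    limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "2"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="inference", body=pod)
```

Because GPUs are handed out only as whole devices, wasted fractions of an accelerator can't be reclaimed by other pods, which is part of why panelists emphasized packing workloads onto fewer, larger nodes.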
"We're looking to things like Vertical Pod Autoscaling to get recommendations on how our resources should be running [to avoid] running at, like, 50% traffic in a cluster," said Rachel Sheikh, an engineer at Cash App, a digital wallet service provider in San Francisco, during a press conference panel Q&A session. "I would love to see more tools … integrate that into their system. … That would make my team's job so much easier."
Vertical Pod Autoscaling reached version 1.0 two years ago with Kubernetes 1.25, but its development continues upstream. Kubernetes 1.32, the next minor release, due out next month, will add in-place pod vertical scaling for CPU and memory, according to Lucy Sweet, senior software engineer at Uber and a participant in the Kubernetes project's SIG Node, which oversees the feature.
"We're trying to nail that, and then what we're hoping to do is do in-place scaling for all the resources," Sweet said in an interview following a breakout panel session Wednesday. "For training, that would be particularly useful because you can start a pod, [and if] it needs more GPUs, I don't have to start it again. I can just change how much GPU it has, change the cgroup and move on with my life."
Sheikh said she'd also like to see better documentation and guidance from the community on how to handle cluster failures, particularly when they're caused by a faulty Kubernetes upgrade.
"Often the thing that causes those outages [is related to] edge cases that you don't often consider, like if you do an in-place cluster upgrade and it goes wrong, and you can't downgrade your cluster API server," she said. "Any tooling -- anything in that space that's working to improve guardrails [and] give you a path forward out of an emergency state -- would be really useful and beneficial to us."
Smoother migration of containers between clusters was on the wish list for another panelist, Mukulika Kapas, director of product management for Intuit's Modern SaaS platform.
"We are building service mobility, like, 'Take all the services on one cluster and move them to another seamlessly,'" Kapas said during the press conference. "A tool from the open source community would really help."
Other press conference panelists called for better monitoring and diagnostic tools to pinpoint interference between namespaces on a shared cluster, as well as improved performance for Kubernetes operators at high scale.
"Also, there is no single answer for Kubernetes federation after all these years in the community, unfortunately," said panelist Ahmet Alp Balkan, senior staff software engineer in compute infrastructure at LinkedIn. "That, I think, is one of the factors that's been pushing us to build our own tools."
A breakout session panelist added that GPU vendors have a part to play in making Kubernetes cluster management easier and more reliable.
"It is incredibly hard to work with most of these GPU vendors," said Rebecca Weekly, vice president of infrastructure at Geico, during a Q&A at the end of the session. "They don't all support the same versions of Linux. They don't all have the same drivers. It's obnoxious [and complicates node rebuilds]. I would love to see the GPU vendors put one-tenth of the effort into open source as I have seen from the CPU vendors."
Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.