How and why to run machine learning workloads on Kubernetes

Running ML model development and deployment on Kubernetes lets organizations decouple workloads, making better use of resources and cutting costs.

Machine learning and AI have moved into the mainstream. Regardless of their job role, most business and IT professionals are now familiar with leading AI tools like ChatGPT.

As the buzz around AI grows, so do the engineering needs in ML and AI. In particular, managing machine learning workloads is top of mind for many organizations due to rising costs and complexity. Key considerations are related to how models are trained and deployed, including the scalability, efficiency and cost-effectiveness of those processes.

As ML use cases have become increasingly complex, training ML models has grown more resource-intensive and expensive, which is a key reason GPUs have become so pricey and sought-after. Containerizing ML workloads can help address these challenges.

Containerization can alleviate many of the challenges associated with ML model development and deployment, including scaling, automation and infrastructure sharing. Kubernetes, a popular platform for orchestrating containerized workloads, is a viable option for organizations looking to streamline their ML processes.

Kubernetes basics

Over the years, engineering priorities have shifted, but one consistent trend is the drive to shrink applications' operational footprints: from mainframes to dedicated servers and, later, virtualization, each step has moved toward running workloads on less infrastructure.

After virtualization, containers emerged as a method for decoupling application stacks into the smallest possible entities while maintaining performance. Containers started with cgroups and namespaces in Linux, but gained more widespread popularity with Docker. The problem was that containers alone didn't scale well; if a container went down, it didn't automatically start back up.

Kubernetes, an open source platform for managing containerized workloads, came onto the scene to fix this issue. As an orchestration tool, Kubernetes not only runs containerized applications, but also handles workload scaling and restarts failed containers, ensuring that workloads stay active and properly managed.

In Kubernetes, containers run inside resources called pods, which house all the information needed to run the application. In addition to containers, Kubernetes has also become valuable for orchestrating other types of resources, such as virtual machines.
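To make the pod concept concrete, here is a minimal sketch using the official Kubernetes Python client; the pod name, container image and command are placeholders, and it assumes a reachable cluster with credentials in a local kubeconfig.

```python
# Minimal sketch: define and create a single-container pod with the official
# Kubernetes Python client. Names, image and command are placeholders.
from kubernetes import client, config

config.load_kube_config()  # load cluster credentials from ~/.kube/config

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-demo-pod"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="python:3.11-slim",  # placeholder image
                command=["python", "-c", "print('training step')"],
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```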

Machine learning on Kubernetes

AI and ML systems' demands are a major driver of the recent surge in GPU costs, which has posed challenges for consumers and tech pros alike.

ML systems require vast amounts of system power, including CPU, memory and GPU resources. Without ample compute, the training process can be highly time-consuming, especially for larger models. Traditionally, this forced users to buy multiple servers to train models, as there was no way to efficiently share those resources.

That's where Kubernetes comes into play with its ability to orchestrate containers and decouple workloads. Within a Kubernetes cluster, multiple pods can run models simultaneously, using the same CPU, memory and GPU power for training.

This can assist with many ML practices, including automated deployment and scaling. Although a powerful Kubernetes cluster with GPUs attached to its worker nodes is still required, the ability to share those resources increases production velocity and reduces costs.
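As a hedged illustration of that resource sharing, the container spec below requests CPU, memory and one GPU. It assumes the NVIDIA device plugin is installed so worker nodes expose the nvidia.com/gpu resource; the image and entrypoint are placeholders.

```python
# Sketch of a container spec that requests a share of node resources,
# including one GPU exposed by the (assumed) NVIDIA device plugin.
from kubernetes import client

gpu_container = client.V1Container(
    name="train",
    image="nvcr.io/nvidia/pytorch:24.01-py3",  # placeholder training image
    command=["python", "train.py"],            # placeholder entrypoint
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        limits={"nvidia.com/gpu": "1"},
    ),
)
```

A container spec like this can then be dropped into a pod or deployment definition such as the one sketched earlier.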

Examples of ML workloads that can be run on Kubernetes include the following:

  • Distributing model training tasks across multiple pods at the same time.
  • Automatically deploying models to production, with the ability to make updates and rollbacks as needed.
  • Optimizing model performance by concurrently running multiple hyperparameter tuning experiments.
  • Scaling workloads dynamically based on demand at inference time (see the autoscaling sketch after this list).
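For the last item, here is a hedged sketch of dynamic scaling: the HorizontalPodAutoscaler below scales a hypothetical "model-inference" Deployment between two and 10 replicas based on CPU utilization, assuming that Deployment already exists in the default namespace.

```python
# Sketch: autoscale a (hypothetical) "model-inference" Deployment based on CPU.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="model-inference-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1",
            kind="Deployment",
            name="model-inference",  # assumed existing inference Deployment
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above 70% CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```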

ML on Kubernetes pitfalls

Running ML workloads on Kubernetes is a stable and popular option. Even OpenAI, the creator of ChatGPT, runs its experiments on Kubernetes. However, organizations should be aware of two notable disadvantages:

  • Tool maturity. Software designed for running ML on Kubernetes, such as Kubeflow, is still relatively young. Because these tools are still evolving, they can change in ways that disrupt workflows and require teams to spend time keeping up with the latest developments.
  • Talent availability. Finding experts with the knowledge and experience to effectively run ML on Kubernetes can be expensive and time-consuming. The specialized combination of IT operations and AI skills is in demand and relatively rare, making hiring costly and challenging.

Tools for machine learning on Kubernetes

Kubernetes by itself isn't equipped to manage ML workloads; instead, it needs specific tools or software designed to run ML workloads on top of Kubernetes. These tools integrate with Kubernetes, using its orchestration capabilities to handle the specialized requirements of ML tasks.

Just as Kubernetes uses the Container Runtime Interface (CRI) to interact with the software that runs containers, it uses a flexible plugin model to manage other types of resources. There are three primary ML tools in the Kubernetes ecosystem:

  • Kubeflow, an open source platform for running and experimenting with ML models on Kubernetes.
  • MLflow, an open source platform for tracking ML experiments and serving trained models through a Flask-based inference endpoint.
  • KubeRay, a tool built by the creators of Ray, an open source framework for scaling AI and Python-based applications. KubeRay adapts Ray's capabilities for Kubernetes environments (a minimal Ray sketch follows this list).
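To illustrate the kind of work KubeRay schedules, here is a minimal Ray sketch (plain Ray, not KubeRay-specific). It assumes the ray package is installed; on a KubeRay-managed cluster, ray.init() would connect to the cluster head rather than starting a local instance.

```python
# Minimal Ray sketch: parallelize placeholder training tasks across workers.
import ray

ray.init()  # starts a local Ray instance if no cluster address is supplied

@ray.remote
def train_on_shard(shard_id: int) -> int:
    # placeholder for training on one data shard
    return shard_id * shard_id

# run four shards in parallel and collect the results
results = ray.get([train_on_shard.remote(i) for i in range(4)])
print(results)
```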

Another option is to use TensorFlow on Kubernetes. However, TensorFlow isn't built specifically for Kubernetes, so it lacks the dedicated integration and optimization of Kubernetes-focused tools like Kubeflow.

For those looking to run ML workloads on Kubernetes, exploring Kubeflow first is often the best option. At the time of writing, Kubeflow is the most advanced and mature tool in terms of capabilities, ease of use, community support and overall functionality.
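As a starting point, the sketch below outlines a two-step pipeline with the Kubeflow Pipelines SDK (KFP v2-style decorators). The component bodies are placeholders, and the compiled YAML would be uploaded to a Kubeflow Pipelines instance running on the cluster.

```python
# Hedged sketch of a two-step Kubeflow pipeline; component logic is placeholder.
from kfp import dsl, compiler

@dsl.component
def preprocess() -> str:
    return "cleaned-dataset"  # placeholder preprocessing step

@dsl.component
def train(dataset: str) -> str:
    return f"model-trained-on-{dataset}"  # placeholder training step

@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline():
    data = preprocess()
    train(dataset=data.output)

compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```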

Michael Levan is a cloud enthusiast, DevOps pro and HashiCorp Ambassador. He speaks internationally, blogs, publishes books, creates online courses on various IT topics and makes real-world, project-focused content to coach engineers on how to create quality work.

