Red Hat CTO, Nvidia AI exec reveal joint LLMOps roadmap
The company leaders discussed deepening integration between OpenShift AI and Nvidia NIMs, how they fit into RHEL AI and what's surprised them about AI growth so far.
DENVER -- Red Hat unveiled plans to extend integrations between its products and Nvidia's AI platform, expanding its ongoing partnership amid the industry's generative AI craze.
Previously, Red Hat OpenShift supported a GPU Operator for Nvidia's Compute Unified Device Architecture (CUDA), a parallel computing framework for general computing on GPUs. A limited number of joint customers in private preview are using cloud-based Nvidia Inference Microservices (NIMs) on OpenShift, according to Nvidia vice president of enterprise AI Justin Boitano.
But according to Red Hat CTO Chris Wright, there are further opportunities to simplify and streamline the integration of NIMs into OpenShift AI to support emerging LLMOps workflows, especially for on-premises users, as well as a fresh set of jointly supported tools under RHEL AI.
TechTarget Editorial sat down with Boitano and Wright this week for an exclusive interview about the technical details of the companies' integration plans, along with their views on broader enterprise AI adoption trends.
Editor's note: This Q&A has been edited for clarity and brevity.
What's available right now that people can use already, and what's coming from Red Hat and Nvidia AI in the future?
Chris Wright: NIMs as container images are already available to run on OpenShift. Taking Nvidia's work inside the NIM and integrating that into OpenShift AI is the work that we're doing. OpenShift AI expands OpenShift into an MLOps platform that includes lifecycle management for deploying, monitoring, redeploying and retraining models. KServe is the part of OpenShift AI that serves model content.
Model serving is about inference -- the 'I' in NIM is inference. It's [the difference between] NIMs as a pure container versus NIMs as something that's more deeply integrated into the MLOps workflow that's part of the OpenShift AI platform. At the tail end of a workflow, you do KServe-based deployment for inference, and the content that we're serving in that context is NIM-based content. When you serve this inference engine, it needs to be placed on a worker node in the cluster that has access to hardware, so that's a Kubernetes scheduling challenge. Part of KServe is how you scale the deployment of your inference instances. The same content can be used without that integration, but then you have to take on the responsibility yourself of figuring out your MLOps lifecycle around the content that's inside the core model.
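To make the serving step concrete, here is a minimal, hypothetical sketch of what calling a NIM deployed behind KServe on OpenShift AI could look like from an application. It assumes the deployment exposes an OpenAI-style chat completions endpoint; the route URL, model name and token are placeholders, not details from the announcement.

```python
# Hypothetical client call to a model served behind KServe on OpenShift AI.
# The endpoint URL, model name and token are placeholders; the request body
# follows the OpenAI-style chat completions format that NIM endpoints expose.
import requests

ENDPOINT = "https://my-nim.apps.example.com/v1/chat/completions"  # placeholder route
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"                      # placeholder credential

payload = {
    "model": "example-llm",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize our support policy."}],
    "max_tokens": 256,
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()

# Print the generated answer from the first choice in the response.
print(resp.json()["choices"][0]["message"]["content"])
```

The scheduling and scaling concerns Wright describes stay on the platform side; the application only sees an endpoint like this one.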
What kind of support will there be for on-premises users of OpenShift AI with NIMs?
Justin Boitano: Right now, there's an integration where you set up OpenShift AI, connect your Nvidia account, pull all the container images down and then set up these endpoints. We're looking at how to streamline that so it's all pre-populated for you.
We run all of our NIMs on global infrastructure through serverless APIs -- it's the API queue that you call, and then the work gets pulled off the back and run across our DGX Cloud. The KServe architecture is the same architecture but running on-prem, so it makes it very simple for developers just to start calling APIs -- they don't have to work with IT teams to set up infrastructure. KServe gives you serverless queuing to call APIs, but you're having them run against your data in your cloud or data center of choice.
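The practical upshot of Boitano's point is that the calling code doesn't change between the hosted and on-prem paths. The sketch below is illustrative only, assuming both endpoints speak an OpenAI-compatible chat completions API; the base URLs, model name and keys are placeholders rather than confirmed product details.

```python
# Illustrative "same code, different endpoint" pattern: point an
# OpenAI-compatible client at a hosted NIM API for experimentation, then swap
# the base URL for an on-prem OpenShift AI / KServe route. URLs, keys and the
# model name below are placeholders.
from openai import OpenAI

# Hosted, serverless NIM endpoint used for cloud experimentation (placeholder).
cloud = OpenAI(base_url="https://integrate.api.example.com/v1", api_key="CLOUD_API_KEY")

# Same client, pointed at an on-prem KServe route so prompts and data stay local.
onprem = OpenAI(base_url="https://nim.apps.internal.example.com/v1", api_key="LOCAL_TOKEN")

for client in (cloud, onprem):
    reply = client.chat.completions.create(
        model="example-llm",  # placeholder model name
        messages=[{"role": "user", "content": "Classify this support ticket."}],
    )
    print(reply.choices[0].message.content)
```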
CIOs love the fact that the prompts stay on the network, the data stays on the network, none of their IP is going up to these models. Whereas I think when enterprises first started with experimentation, to be honest, they didn't realize corporate data was being used [for model training]. Enterprises are learning what they need to do to control their IP, and this provides them a great building block for that.
How does this fit alongside RHEL AI?
Wright: A container running on Linux needs access to an accelerator, and part of RHEL AI's job is to make that easy. An operating system does hardware enablement and creates some level of consistency for applications so they're not hard-coded individually to registers in a given hardware device. The other pieces of RHEL AI are a Granite model, InstructLab as a way to add skills and knowledge to that Granite model, and then below that is an optimized runtime to run that model on the operating system and hardware platform. In the context of Nvidia GPUs, you've got a device driver that gets plugged into the kernel and a layer of software libraries from Nvidia -- the CUDA stack -- that gives consistency to higher-level runtimes. That whole optimized stack is part of what we deliver with RHEL AI, and you can think of it as a building block for OpenShift AI.
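As an illustration of that layering -- application code talking to a framework, the framework talking to the CUDA libraries, and the libraries talking to the kernel driver -- the short, generic PyTorch sketch below shows how an application sees the accelerator without touching the hardware directly. It is not RHEL AI code, just the consistency pattern Wright describes.

```python
# Generic sketch of the stack Wright describes: the application asks the
# framework (PyTorch) for an accelerator; the framework talks to the CUDA
# libraries, which talk to the kernel driver. The application never programs
# device registers directly.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

if device.type == "cuda":
    # These details are surfaced by the driver and CUDA stack, not by the app.
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)

# The same application code runs on either device thanks to that consistency layer.
x = torch.randn(4, 4, device=device)
print((x @ x.T).sum().item())
```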
Boitano: We take PyTorch as a framework and performance-optimize it to run on our GPUs. As there are new techniques, we have to keep optimizing those frameworks. Then there's a downstream [flow] from PyTorch to PyTorch Lightning to NeMo, where we add additional tuning techniques for parallelizing [workloads]. ... We are offering InstructLab with NeMo on top of RHEL AI as a type of accelerated framework folks can choose to use.
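A rough sketch of that framework stack, stopping at the PyTorch Lightning layer: plain PyTorch defines the model, and Lightning's Trainer takes over the training loop and scales it across whatever GPUs are present. The tiny model and data are placeholders, and NeMo's additional tuning techniques sit a layer above this and aren't shown.

```python
# Placeholder example of the PyTorch -> PyTorch Lightning layering: PyTorch
# defines the network, Lightning runs the training loop and picks a
# parallelization strategy (e.g., DDP) when multiple GPUs are available.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Synthetic stand-in data; a real workload would load curated training data.
data = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=32)

# Lightning's Trainer handles device placement and data-parallel scaling.
trainer = pl.Trainer(accelerator="auto", devices="auto", strategy="auto", max_epochs=1)
trainer.fit(TinyRegressor(), data)
```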
It's early, obviously, for all of this. But you have some early adopters -- is there any pattern you've seen taking shape in how they're developing generative AI apps?
Wright: It's actually interesting, because it moves so quickly. I'll give you an example: When Justin and I started the conversation that culminated in this announcement, our vision was that bringing the value of the work Nvidia is doing with NIMs to OpenShift and OpenShift AI on premises was the really important thing. Clouds are an easy place to get started and do experimentation. But then the very next question after you've learned something is, 'What about my data?'
So our expectation was that the next most important thing would be providing a RAG [retrieval-augmented generation]-based solution to augment the foundation model with your own data. And the reason RAG was so important is because fine tuning is actually quite hard. But with InstructLab alignment tuning to bring new skills and knowledge into the foundation model, the relationship between that model and RAG evolves a little bit. You can find contexts in which you don't need RAG because you've trained the model with the core knowledge you need to infuse it with. In other contexts, RAG has specific properties -- well-defined consistency in how the database works and how the responses are generated -- which can themselves be valuable. But if you had asked just a few months ago where we were going, I'd have said RAG was a clear trend of what we needed to focus on next. Now it's just a few months later, and you can see the industry changes really quickly.
Boitano: Customers that want to get their first wins with generative AI usually start on that RAG spectrum. And then they work back into fine tuning [and] progressively learn the new techniques as they go. As they're doing that, they're driving up the accuracy of applications. RAG can address some small percentage of [those applications] to get quick wins and show success. And then you start to peel the onion back on where you need fine tuning.
As we shift left into the harder problems together, there's a longer-term commitment [to] work through how to do data curation in a more simplified way, so you don't need hardcore AI researchers to refine data blends and do the fine tuning … and [to] continue lowering the barriers to entry.
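For readers following the RAG-versus-fine-tuning thread, the toy Python sketch below shows the retrieval-augmented pattern both executives describe: pull the most relevant internal documents for a question and prepend them to the prompt before calling the served model. The documents and the word-overlap scoring are stand-ins; real deployments use an embedding model and a vector database.

```python
# Toy sketch of the RAG pattern: retrieve the most relevant internal documents
# for a question, then prepend them to the prompt so the model answers from
# that context. Word-overlap scoring is a stand-in for embedding similarity.

DOCS = [
    "Refunds are issued within 14 days of a return being received.",
    "Enterprise support contracts include a 4-hour response SLA.",
    "GPU nodes are scheduled through the cluster's device plugin.",
]


def score(question: str, doc: str) -> int:
    # Placeholder relevance score: count shared lowercase words.
    return len(set(question.lower().split()) & set(doc.lower().split()))


def build_prompt(question: str, k: int = 2) -> str:
    # Keep the top-k documents by score and fold them into the prompt.
    top_docs = sorted(DOCS, key=lambda d: score(question, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


print(build_prompt("How fast is the support response time?"))
```

The resulting prompt is what gets sent to the served model (for example, a chat completions endpoint like the one sketched earlier); fine tuning, by contrast, bakes the knowledge into the model so no retrieval step is needed at inference time.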
Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.