

Don't buy a GPU for machine learning unless you have the workload

Virtual GPUs and machine learning might seem like a perfect match, but specialized chips might not be worth the investment if your workloads won't use the cards' full capacity.

Recent advances in virtualization technology enable IT administrators to take better advantage of the parallel processing capabilities of GPUs. But organizations must be sure that the workloads they support with vGPUs use them efficiently to make the hardware expense worthwhile.

Traditionally, organizations used GPUs to support graphics-hungry workloads, such as computer-aided design (CAD) and virtual desktops, but today, more companies use general-purpose GPUs to support other demanding workloads. General-purpose GPUs can provide organizations with the power to analyze large data sets, which makes them ideal for supporting AI workloads and supercomputing. However, inconsistent or unpredictable demand can leave expensive GPUs sitting idle, negating their potential advantage over CPUs.

Understanding GPUs

Because the technology to easily virtualize GPUs didn't exist when the cards were invented, early GPU use cases were largely graphics-adjacent or restricted to extreme high-performance computing (HPC) environments. Admins would load up servers with as many GPUs as they could and build either vast render farms or highly specialized, academia-oriented HPC environments.

Over time, admins incorporated GPUs into x86 virtualization environments, where they were used almost exclusively for VDI. This started with pass-through capabilities and a 1-to-1 mapping of GPUs to VMs. However, this was fairly inefficient because most VMs didn't need an entire GPU to themselves to complete CAD or VDI tasks. Building a large cluster was also problematic because it was hard to find a server that could fit more than four GPUs in a chassis.
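To make the pass-through step concrete, here is a minimal sketch, assuming a Linux KVM host with the IOMMU enabled, that walks the standard /sys/kernel/iommu_groups layout and flags GPU-class PCI devices. Devices that share an IOMMU group generally have to be assigned to the same VM, which is part of what made the 1-to-1 mapping so rigid.

```python
#!/usr/bin/env python3
"""List IOMMU groups on a Linux host and flag GPU-class PCI devices.

A minimal sketch for spotting pass-through candidates. It assumes a Linux
KVM host with the IOMMU enabled (intel_iommu=on or amd_iommu=on), so that
/sys/kernel/iommu_groups is populated.
"""
from pathlib import Path

IOMMU_ROOT = Path("/sys/kernel/iommu_groups")


def pci_class(device: str) -> str:
    """Return the PCI class code (e.g. '0x030000') for a device address."""
    return (Path("/sys/bus/pci/devices") / device / "class").read_text().strip()


def list_iommu_groups() -> None:
    if not IOMMU_ROOT.exists():
        raise SystemExit("No IOMMU groups found -- is the IOMMU enabled?")
    for group in sorted(IOMMU_ROOT.iterdir(), key=lambda p: int(p.name)):
        for dev in sorted(d.name for d in (group / "devices").iterdir()):
            cls = pci_class(dev)
            # PCI class 0x03xxxx covers VGA, 3D and display controllers, i.e. GPUs.
            tag = "GPU" if cls.startswith("0x03") else "   "
            print(f"group {group.name:>3}  {tag}  {dev}  class={cls}")


if __name__ == "__main__":
    list_iommu_groups()
```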

Early attempts to virtualize GPUs were difficult to configure and incredibly fragile. However, x86 virtualization vendors, such as VMware, eventually got it right.

In academia, it's easy to understand why GPU virtualization is taking off. Large institutions with hundreds of faculty members and thousands of students create spontaneous and unpredictable demands on compute infrastructure. Virtualization and workload scheduling enable supercomputers and HPC clusters to stay occupied the majority of the time, optimizing ROI.
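A rough, self-contained simulation, using entirely made-up arrival rates and job lengths, illustrates the effect: scheduling bursty jobs from many users onto a small shared pool keeps each card far busier than dedicating a GPU to every user.

```python
import random


def simulate(num_users=40, num_pool_gpus=8, hours=720,
             jobs_per_user_per_day=0.5, job_hours=4, seed=1):
    """Compare GPU utilization: one card per user vs. a shared, scheduled pool.

    Every workload number here is invented for illustration; the point is the
    utilization gap, not the absolute figures.
    """
    random.seed(seed)
    # Each job is (submit_hour, duration); submissions arrive at random times.
    jobs = []
    for _ in range(num_users):
        n_jobs = int(hours / 24 * jobs_per_user_per_day)
        jobs += [(random.uniform(0, hours), job_hours) for _ in range(n_jobs)]
    busy_hours = sum(duration for _, duration in jobs)

    # Dedicated: every user owns a GPU, so capacity is num_users * hours.
    dedicated_util = busy_hours / (num_users * hours)

    # Pooled: jobs queue for the next free GPU (simple FIFO by submit time).
    free_at = [0.0] * num_pool_gpus
    waits = []
    for submit, duration in sorted(jobs):
        gpu = min(range(num_pool_gpus), key=lambda i: free_at[i])
        start = max(submit, free_at[gpu])
        waits.append(start - submit)
        free_at[gpu] = start + duration
    span = max(max(free_at), hours)
    pooled_util = busy_hours / (num_pool_gpus * span)

    print(f"dedicated (1 GPU per user): {dedicated_util:.0%} utilization")
    print(f"shared pool of {num_pool_gpus} GPUs: {pooled_util:.0%} utilization, "
          f"average queue wait {sum(waits) / len(waits):.1f} h")


simulate()
```

With these made-up numbers, the dedicated layout idles below 10% utilization while the shared pool runs several times busier; the absolute figures don't matter, only the gap.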

The machine learning revolution

GPUs are also finding their way into everyday enterprise computing as machine learning workloads become more common in the enterprise. As big data analytics expands and traditionally mundane endeavors, such as logistics, become increasingly compute-intensive, virtual GPUs (vGPUs) are a welcome tool.

Machine learning workloads require a lot of number crunching, and vGPUs can crunch a lot of numbers. So, the two should be a match made in heaven. But the majority of machine learning workloads don't generate a steady enough stream of work to keep vGPUs occupied. This idle time ends up being a waste of resources and money.
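A back-of-the-envelope calculation shows how quickly idle time erodes the value of the hardware. The card price and useful life below are placeholder assumptions, not vendor figures.

```python
def effective_gpu_hour_cost(card_price, useful_life_years, utilization):
    """Effective cost per *useful* GPU-hour for a card you own outright.

    All inputs are illustrative assumptions, not vendor pricing, and the
    model ignores power, hosting and administration overhead.
    """
    total_hours = useful_life_years * 365 * 24
    busy_hours = total_hours * utilization
    return card_price / busy_hours


# Hypothetical $10,000 data center GPU amortized over three years.
for utilization in (0.90, 0.50, 0.10):
    cost = effective_gpu_hour_cost(10_000, 3, utilization)
    print(f"{utilization:.0%} busy -> ${cost:.2f} per useful GPU-hour")
```

With these assumptions, the same card that costs well under a dollar per useful hour at 90% utilization costs several dollars per useful hour at 10%.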


There are some use cases where vGPUs are right for machine learning workloads, namely those with a steady stream of tasks to keep the cards busy. One example is the FBI's and Immigration and Customs Enforcement's extensive use of facial recognition: a large facial recognition database working around the clock means there's always work for the vGPUs.

But the argument for using vGPUs to support machine learning workloads in the average business is harder to make because those workloads are unlikely to generate constant demand. Most machine learning implementations generate unpredictable demand, and in these cases, GPUs spend a great deal of time waiting for work. Virtualizing GPUs certainly helps ensure the physical GPUs can be fed different work streams, but in most cases, it's not an efficient use of money or resources.

And machine learning isn't the only use case that vGPUs might not be right for. For example, geographic information systems (GISes) with accurate, up-to-date data are vital to fields such as resource exploration and utility regulation. Updating GIS databases and rendering new maps take a significant amount of compute capacity. A CPU could handle this work, but it would take significantly longer than a server equipped with a GPU. The problem is that there's almost never a steady stream of data to feed into a GIS application because updates are driven by unpredictable changes to data, such as the purchase and sale of property. In many other cases, complex, hands-on field work with sensors is required before data can be processed and added to a GIS. Under these conditions, it might make sense to keep a handful of base VM instances with vGPUs to handle read requests.

In a modern GIS setup, when new data arrives from a field team, additional compute VMs can be spun up to process it. Because there is so much data to process, as well as potentially thousands of images to re-render based on the new data, this is a perfect use case for GPUs. GPU virtualization enables multiple independent platforms to efficiently share a single corporate -- or academic -- compute infrastructure, despite their unpredictable demand.
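The sketch below outlines that pattern: a couple of always-on vGPU VMs serve read requests, and extra GPU VMs are added only when a batch of field data lands. The throughput figures are assumptions, and provision_vm and teardown_idle_vms are hypothetical hooks standing in for whatever virtualization or cloud API is actually in use.

```python
import math

BASELINE_VMS = 2              # always-on vGPU VMs serving map read requests
TILES_PER_VM_PER_HOUR = 500   # assumed re-render throughput per GPU VM
TARGET_HOURS = 4              # finish a field-data batch within this window


def vms_needed_for_batch(pending_tiles: int) -> int:
    """How many extra GPU VMs to add for a newly arrived batch of GIS data."""
    if pending_tiles == 0:
        return 0
    capacity_per_vm = TILES_PER_VM_PER_HOUR * TARGET_HOURS
    return math.ceil(pending_tiles / capacity_per_vm)


def on_field_data_arrival(pending_tiles, provision_vm, teardown_idle_vms):
    """Scale out for a batch, then fall back to the always-on baseline.

    provision_vm and teardown_idle_vms are hypothetical hooks standing in for
    whatever virtualization or cloud API the shop actually uses.
    """
    for _ in range(vms_needed_for_batch(pending_tiles)):
        provision_vm(gpu=True)
    teardown_idle_vms(keep=BASELINE_VMS)


# Example: a survey drop of 12,000 map tiles to re-render -> 6 extra GPU VMs.
print(vms_needed_for_batch(12_000))
```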

Major cloud providers now offer GPU-accelerated instances, which could enable a company to consume GPU resources only when needed. While GPU-accelerated computing is still growing in popularity, it might be only the vanguard of the next generation of workload-specific chips administrators must consider. Intel, for example, is focused on bringing field-programmable gate arrays (FPGAs) into enterprise data centers to support machine learning workloads. Admins looking to outfit their company for the future now have other options to consider, and it might be time to look beyond the x86 CPU.
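To put the rent-or-buy decision in perspective, here is a simple break-even estimate with placeholder prices: below a certain number of GPU-hours per month, renting cloud GPU instances is cheaper than amortizing a card you own.

```python
def breakeven_hours_per_month(card_price, useful_life_months, cloud_rate_per_hour):
    """Monthly GPU-hours above which owning a card beats renting cloud GPU time.

    The prices are illustrative placeholders; plug in real quotes, and note
    that the model ignores power, hosting and admin costs for owned hardware.
    """
    owned_cost_per_month = card_price / useful_life_months
    return owned_cost_per_month / cloud_rate_per_hour


# Hypothetical: a $10,000 card over 36 months vs. a $3/hour cloud GPU instance.
hours = breakeven_hours_per_month(10_000, 36, 3.00)
print(f"Owning wins only above ~{hours:.0f} GPU-hours per month "
      f"({hours / 730:.0%} of the month)")
```

Even with these generous placeholder figures, the card has to stay busy a meaningful share of every month to pay for itself, and adding power, hosting and administration costs pushes that threshold higher still, which further favors on-demand consumption for spiky workloads.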
