CW Asia-Pacific

CW APAC: Buyer’s guide to NVMe storage

NVMe for AI: A powerful pairing

NVMe storage capabilities provide the bandwidth and low latency that demanding AI and machine learning applications need to access and manage the massive amounts of data they use.

AI and machine learning systems have long relied on traditional compute architectures and storage technologies to meet their performance needs. But that won't be the case for much longer. Today's AI and machine learning systems -- using GPUs, field-programmable gate arrays and application-specific integrated circuits -- process data much faster than their predecessors.

Meanwhile, the data sets used to train those smart systems have grown progressively larger. To meet these growing demands, adopters are turning to NVMe for AI functionality.

NVMe provides greater bandwidth and lower latency than SAS and SATA, enabling maximum performance for demanding workloads. Machine learning training, for instance, uses millions of data examples to train algorithms so they can make decisions about new data.

"NVMe has moved from the bleeding edge when launched early this decade to the mainstream storage option for AI in 2019," said Jason Echols, senior technical marketing manager at Micron Technology, which offers NVMe SSDs.

Traditional spinning storage has an access time that's three orders of magnitude slower than current NVMe technology, said Scott Schweitzer, director and technology evangelist at Solarflare Communications, which offers technologies designed to accelerate cloud data center applications and electronic trading platforms.

Traditional storage, designed with disk heads reading off a spinning disk, is serial in nature, he said. "The controllers only provide a handful of queues that often map back to the number of heads on the disk," he said. NVMe devices, by contrast, support as many as 64,000 queues, each able to hold as many as 64,000 commands, enabling them to serve tens of thousands of requests for data in parallel.
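
One way to see that parallelism on a Linux host is to count the blk-mq hardware queues the kernel has set up for a device. The sketch below is illustrative only: /sys/block/<dev>/mq is the standard blk-mq location, but the device names are placeholders for whatever exists on your system.

```python
import os

def hw_queue_count(block_device: str) -> int:
    """Count the blk-mq hardware queues exposed for a block device.

    Assumes a Linux host where /sys/block/<dev>/mq/ holds one
    directory per hardware queue, as it does for NVMe devices.
    """
    mq_dir = f"/sys/block/{block_device}/mq"
    if not os.path.isdir(mq_dir):
        return 0  # device is not using blk-mq, or the path layout differs
    return len(os.listdir(mq_dir))

if __name__ == "__main__":
    # Placeholder device names -- substitute your own.
    for dev in ("nvme0n1", "sda"):
        print(dev, hw_queue_count(dev))
```

On most systems the NVMe device will report roughly one queue per CPU core -- well short of the 64,000 the specification allows, but still far more than the single queue a SAS or SATA device typically presents.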

Faster is better

Flash is already a key component in AI platforms that pair high-performance, scale-out storage with GPU-accelerated compute to eliminate I/O bottlenecks and fuel AI insights at scale, said Matthew Hausmann, AI and analytics product marketing manager at Dell EMC. "Faster is always better, so NVMe is a natural progression of these solutions, driving additional performance and moving them closer to real time."

Schweitzer expects NVMe will replace traditional storage in AI environments. AI applications often require enormous data sets, and as applications become more performance-oriented, waiting for traditional disk subsystems quickly becomes the long pole in the computational tent.

"It wasn't but a few years ago that networking was the performance curve on the far right that limited overall system performance," he observed. "As we moved to 10 [Gigabit Ethernet], then 25 GbE and soon 100 GbE and later 400 GbE, networking is rapidly approaching local memory access speeds."

AI applications running on GPU-based systems can use NVMe storage to feed virtually any size GPU farm with far greater performance than traditional storage technologies, said Kirill Shoikhet, chief architect at Excelero, a distributed block storage supplier. "Modern GPUs used in AI and [machine learning] applications have an amazing appetite for data, up to 16 GBps per GPU," he noted. "Starving that appetite with slow storage or wasting time copying data back and forth wastes the most expensive resource you've purchased."
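
The arithmetic behind that appetite is easy to sketch. The figures below are illustrative assumptions rather than measurements; only the 16 GBps-per-GPU number comes from the quote above.

```python
# Back-of-the-envelope check: how much NVMe read bandwidth does a GPU node
# need just to keep its accelerators fed? All figures are illustrative.
gpus = 8                     # GPUs in one hypothetical training node
gb_per_s_per_gpu = 16        # peak ingest rate cited above
drive_gb_per_s = 3.5         # assumed sequential read rate of one NVMe SSD

aggregate = gpus * gb_per_s_per_gpu        # 128 GB/s for the whole node
drives = -(-aggregate // drive_gb_per_s)   # ceiling division
print(f"{aggregate} GB/s aggregate -> at least {int(drives)} NVMe drives")
```

Even with generous assumptions, a single drive cannot come close, which is why these deployments spread reads across many NVMe devices, often over a fabric.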

NVMe for AI use cases

NVMe works well for specific AI use cases, such as training a machine learning model and checkpointing, in which snapshots of the training in progress are saved so it can be resumed. Machine learning includes two phases: training a model on a data set and then actually running the model against new data. "Training a model is the most resource-hungry stage," Shoikhet explained. "Hardware used for this phase -- usually, high-end GPUs or specialized SoCs [systems on a chip] -- is expensive to buy and run, so it should be always busy."

[Figure: The machine learning process, showing how data sets are used to train machine learning applications]

Modern data sets used for model training can be huge. MRI scans, for example, can reach multiple terabytes apiece, and a machine learning training set may require tens or even hundreds of thousands of such images.

"Even if the training itself runs from RAM, the memory should be fed from non-volatile storage," Shoikhet said. Paging out old training data and bringing in the new data should be done as fast as possible to keep the GPUs running. That means latency should be low as well, he said, and for this type of application, NVMe is the only protocol that supports both high bandwidth and low latency.

Checkpoint setting also benefits from NVMe technology. "If a training process is long, the system can choose to save a snapshot of the memory into non-volatile storage to allow a restart from that snapshot in case of a crash," Shoikhet explained. "NVMe storage is very suitable for this kind of usage as well."
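
A framework-agnostic checkpointing sketch, assuming the training state can be pickled and that the checkpoint directory sits on an NVMe-backed file system; the paths and field names are illustrative. The atomic rename avoids leaving a half-written snapshot behind if the process crashes mid-save.

```python
import os
import pickle

def save_checkpoint(state: dict, checkpoint_dir: str, step: int) -> str:
    """Write a training snapshot, then rename it into place atomically."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"ckpt_{step:08d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # ensure the bytes reach the device
    os.replace(tmp_path, final_path)  # atomic on POSIX file systems
    return final_path

# Example: snapshot every 1,000 steps of a hypothetical training loop.
# state = {"step": step, "weights": model_weights}
# save_checkpoint(state, "/mnt/nvme/checkpoints", step)
```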

Potential pitfalls

It's important to fully understand the storage I/O profile of an AI application in order to match the right NVMe SSD to specific needs. "Some AI environments, especially training, are very read-centric, meaning you can realize cost and performance benefits without breaking the bank," Echols said.
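
One way to verify how read-centric a workload actually is: sample the kernel's per-disk I/O counters before and after a training run. The snippet below assumes the psutil package and a Linux block device name such as nvme0n1; both are stand-ins for whatever your environment uses.

```python
import psutil

def io_profile(device: str, run, *args, **kwargs):
    """Report read/write bytes for one block device while `run` executes."""
    before = psutil.disk_io_counters(perdisk=True)[device]
    run(*args, **kwargs)
    after = psutil.disk_io_counters(perdisk=True)[device]
    reads = after.read_bytes - before.read_bytes
    writes = after.write_bytes - before.write_bytes
    total = reads + writes or 1          # avoid division by zero
    print(f"{device}: {reads / 1e9:.2f} GB read, "
          f"{writes / 1e9:.2f} GB written ({100 * reads / total:.0f}% read)")

# Example: io_profile("nvme0n1", train_one_epoch)  # hypothetical function
```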

For all use cases involving NVMe for AI, Hausmann advised steering clear of proprietary NVMe storage technologies and, instead, looking for NVMe that's built into flagship enterprise products. "You might lose a few nanoseconds on paper, but you'll be light years ahead when your system stays up and running and is still supported six months down the road."
