NVMe performance challenges expose the CPU chokepoint
NVMe systems see diminishing marginal performance gains as hardware is added and storage software consumes ever more CPU resources. Discover how to address this problem.
NVMe flash SSDs have reduced the performance bottleneck between the server or storage controller CPU and attached flash SSDs, lowering latency and raising performance compared with SAS and SATA SSDs. And NVMe-oF has solved the problem of getting local, embedded-NVMe latency and performance from shared storage, whether DAS or SAN-attached.
These are critical storage performance technologies. However, as crucial as they are, they've exposed another NVMe performance challenge: the CPU chokepoint in the server or the storage controller.
The CPU chokepoint
Moore's law has slowed, and it turns out there are limits to doubling transistor counts every 18 to 24 months. The latest Intel x86 processors provide up to 48 PCIe lanes, supporting up to 24 NVMe flash SSDs. The latest plug-compatible AMD x86 processors provide up to 128 PCIe lanes, supporting up to 32 NVMe flash SSDs.
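Those drive counts follow directly from the lane budget. Here's a minimal sketch of the arithmetic, assuming two PCIe lanes per SSD in the Intel case and four per SSD in the AMD case to match the figures above:

```python
# Back-of-the-envelope drive counts from PCIe lane budgets.
# Lane widths per SSD (x2 vs. x4) are assumptions chosen to match
# the drive counts cited above, not vendor specifications.

def max_drives(pcie_lanes: int, lanes_per_drive: int) -> int:
    return pcie_lanes // lanes_per_drive

print(max_drives(48, 2))    # Intel-class CPU at x2 per drive -> 24 drives
print(max_drives(128, 4))   # AMD-class CPU at x4 per drive   -> 32 drives
```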
If more NVMe flash SSDs are required, then the supporting hardware gets increasingly complicated. It usually means more CPUs, either internal or external ones. The storage can be DAS or shared across NVMe-oF. Either way, more CPUs, drives, drive drawers, switches, adapters, transceivers and cables will be required.
The general industry consensus is that scaling capacity and performance using NVMe drives and NVMe-oF just requires more hardware. There are some clever systems available with multiple CPUs, large numbers of NVMe drives, NVMe-oF interconnect and high-performance storage in a small footprint. Apeiron, E8 Storage, Pavilion Data Systems and Vexata are among the vendors offering them.
But here's the rub. These systems show noticeable diminishing marginal returns: the hardware grows much faster than the performance gains, no matter how many CPUs or NVMe flash SSDs are added. Eventually, more hardware yields a negative return on overall performance.
The root cause of this NVMe performance challenge isn't hardware. It's storage software that wasn't designed for CPU efficiency. Why bother with efficiency when CPU performance was doubling every 18 to 24 months? Features such as deduplication, compression, snapshots, clones, replication, tiering, and error detection and correction were continually added to storage software, and many of them are CPU intensive. Every CPU cycle the storage software consumes is a cycle that isn't available for storage I/O to the high-performance drives.
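A toy model makes the diminishing returns concrete. Assume each I/O costs the storage software a fixed number of CPU cycles; aggregate performance is then capped by whichever runs out first, drive throughput or CPU cycles. All figures below are illustrative assumptions, not benchmarks:

```python
# Toy capacity model: aggregate IOPS is the lesser of what the drives can
# deliver and what the CPU-bound storage software path can process.
# All constants are illustrative assumptions, not measured values.

DRIVE_IOPS = 700_000           # per NVMe flash SSD
CPU_CYCLES_PER_SEC = 96e9      # e.g., 32 cores at 3 GHz
CYCLES_PER_IO = 40_000         # storage software path: dedupe, checksums, etc.

def system_iops(num_drives: int) -> float:
    drive_limit = num_drives * DRIVE_IOPS
    cpu_limit = CPU_CYCLES_PER_SEC / CYCLES_PER_IO
    return min(drive_limit, cpu_limit)

for n in (1, 2, 4, 8, 16, 24):
    print(f"{n:2d} drives -> {system_iops(n):>12,.0f} IOPS")
# Beyond three drives, the CPU ceiling (2.4 million IOPS here) dominates;
# extra drives add cost but no performance. Lowering CYCLES_PER_IO --
# more efficient software -- raises the ceiling without new hardware.
```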
Solutions at hand to the NVMe performance challenge
Some believe storage class memory (SCM), the next generation of non-volatile memory, will fix this NVMe performance challenge. It won't. SCM technologies will only exacerbate it, because their higher performance puts even more pressure on the CPU.
While this has become a difficult problem in scaling storage performance, there are several ways it's being handled, including the following:
- Throwing more CPUs -- servers or storage controllers -- and interconnect at it. This is the most common approach, but it comes with a high cost and diminishing marginal returns.
- Using dynamic RAM (DRAM) caching in front of the NVMe flash SSDs (see the first sketch after this list). DRAM is as much as 1,000 times faster, with lower latencies, than the fastest NVMe flash SSDs. However, it has severe capacity limitations -- typically 3 TB or less per server or storage controller. DRAM is also expensive and volatile, requiring power backup to protect cached data. As SCM technologies start to replace DRAM, the cost of DRAM caching will come down, and the hardware will become less complex. The biggest issue with caching is scaling out. Cache coherency is needed to prevent application errors, but cache coherency algorithms are complicated, and the complexity increases geometrically with the number of server nodes or storage controllers.
- Computational storage from Burlywood, NGD Systems, Pliops, ScaleFlux and others (see the second sketch after this list). Computational storage puts one or more processors and RAM on the NVMe flash drive. These drives can run executables closer to the data, reducing data movement and latency. They enable cooperative processing between the main CPUs and the ones on the flash drives and eliminate the PCIe lane limitations. These drives cost more than standard ones and are mostly provided by startups, but that will change.
- Making storage software efficient. Storage software over the past three decades hasn't needed to be efficient. There were plenty of server and controller resources to handle the software without affecting read/write performance; the HDD was the performance bottleneck. Flash drives, and now NVMe, have exposed the CPU bottleneck. Fixing storage software requires completely rewriting it to be more efficient, using fewer server or storage controller resources -- in other words, getting more storage functionality from less server or controller hardware. StorOne was the first to take this approach.
- Bypassing the target storage CPU. Remote direct drive access, or RDDA, technology builds on remote direct memory access technology to access NVMe drive controllers directly and bypass the storage server CPU. This technology requires specific NICs from Mellanox Technologies or Broadcom; however, it has the potential to eliminate performance scalability issues. Excelero was the first to use it.
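To illustrate the DRAM caching approach, here is a minimal single-node sketch of a least-recently-used read cache sitting in front of slower NVMe reads. The backend_read callable and block-level interface are assumptions for illustration; a real cache also handles writes and, when scaled out, the coherency problem noted above:

```python
# Minimal sketch of a DRAM read cache in front of NVMe reads (single node,
# reads only). backend_read is a hypothetical callable: block number -> bytes.

from collections import OrderedDict

class DramReadCache:
    def __init__(self, backend_read, capacity_blocks):
        self.backend_read = backend_read     # slow path: read from the NVMe drive
        self.capacity = capacity_blocks      # DRAM is scarce, so capacity is bounded
        self.blocks = OrderedDict()          # block number -> cached data, in LRU order

    def read(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # cache hit: served at DRAM speed
            return self.blocks[block]
        data = self.backend_read(block)      # cache miss: go to the drive
        self.blocks[block] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data
```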
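And to illustrate the computational storage idea, here is a toy comparison of host-side versus drive-side filtering. The ComputationalDrive class is hypothetical, not a vendor API; the point is simply how much data crosses PCIe to the host CPU in each case:

```python
# Toy model of computational storage: run the filter on the drive's own
# processor so only matching records cross PCIe. ComputationalDrive is a
# hypothetical class for illustration, not a real device interface.

class ComputationalDrive:
    def __init__(self, records):
        self.records = records                    # data resident on the drive

    def read_all(self):
        # Conventional path: every record is shipped to the host CPU.
        return list(self.records)

    def query(self, predicate):
        # Computational-storage path: the drive filters in place and
        # returns only the matching records to the host.
        return [r for r in self.records if predicate(r)]

drive = ComputationalDrive(range(1_000_000))

host_filtered = [r for r in drive.read_all() if r % 1_000 == 0]   # moves 1,000,000 records
drive_filtered = drive.query(lambda r: r % 1_000 == 0)            # moves 1,000 records

assert host_filtered == drive_filtered
```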
Some of these approaches to the NVMe performance challenge are cost-effective and some aren't; some are easier to implement than others. All have pros, cons and risks, and there's no one-size-fits-all answer. This is a difficult problem to solve, but it is solvable.