Leading storage for AI tools address workload capacity, performance
Today's AI storage platforms provide organizations with stand-alone and prepackaged products designed to address the storage capacity and performance needs of their AI workloads.
Storage for AI vendors offer either converged infrastructure products or products that organizations can build into their AI projects. Several vendors, including DataDirect Networks, Dell EMC, Hewlett Packard Enterprise, IBM, NetApp and Pure Storage, offer packaged products or reference architectures with server and networking partners. These offerings increase storage performance and capacity in line with CPU and GPU compute.
Other vendors, including Excelero, Vast Data and WekaIO, offer software products that deliver scalable storage performance and capacity. The customer performs the integration work with these AI platforms. These three vendors also work with partners to deliver their products as prepackaged appliances.
Here, we take an in-depth look at what types of storage for AI products these nine vendors offer.
DataDirect Networks
DataDirect Networks (DDN) has two storage models, the AI200 and the AI400, both offered in reference architectures with Nvidia and Hewlett Packard Enterprise (HPE) servers. The Nvidia reference architecture consists of one, four or nine DGX-1 appliances or one or three DGX-2 appliances. The systems use 100 Gigabit Ethernet (GbE) or InfiniBand networking, and two appliances provide the storage. Both models are all-flash NVMe appliances that implement a parallel file system and support up to 24 dual-ported 2.5-inch NVMe drives. DDN quotes performance figures for the AI200 of up to 25 GBps read and write throughput and 750,000 IOPS. For the AI400, the vendor quotes 33 GBps read and write throughput and 1.5 million IOPS, with a maximum capacity of 360 TB.
HPE systems use the Apollo 6500 Gen10 server platform, which supports up to eight GPUs per server and NVLink 2.0. Supported storage configurations include the AI200 and AI400, as well as the AI7990 hybrid storage offering, which scales to 5.4 petabytes (PB) with up to 750,000 IOPS and 23 GBps read and 16 GBps write throughput. DDN recommends two Apollo-based reference architectures, each using one AI400 and either one or four Apollo 6500 servers with multiple GPU configurations.
Dell EMC
Dell EMC offers three storage for AI product sets: one based on Nvidia GPUs, one on Dell servers without GPUs and one on Dell servers with Hadoop. The Dell EMC-Nvidia product deploys a PowerEdge R740xd head node and four PowerEdge C4140 worker nodes, each with four Nvidia Tesla V100 GPUs. Mellanox InfiniBand switches provide the networking, and Isilon F800 all-flash NAS provides the storage. The F800 scales from a single chassis, delivering 250,000 IOPS and 15 GBps of throughput, to a full 252-node cluster delivering 15.75 million IOPS, 945 GBps and 58 PB of capacity.
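Those cluster maximums follow directly from linear scale-out. Here is a quick sanity check in Python, assuming the F800's four-nodes-per-chassis design, so that a 252-node cluster is 63 chassis:

```python
# Per-chassis figures quoted above; four F800 nodes per chassis is the
# Isilon chassis design assumed here.
CHASSIS_IOPS = 250_000
CHASSIS_THROUGHPUT_GBPS = 15

nodes = 252
chassis = nodes // 4                       # 63 chassis

print(chassis * CHASSIS_IOPS)              # 15,750,000 IOPS
print(chassis * CHASSIS_THROUGHPUT_GBPS)   # 945 GBps
```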
Dell servers without GPUs deploy a single PowerEdge R740xd head node and 16 PowerEdge C6420 nodes, each with two Xeon Gold 6230 processors, 192 GB of dynamic RAM and 250 GB of local M.2 storage. Isilon H600 hybrid storage provides the shared storage. H600 systems deliver up to 120,000 IOPS and 12 GBps of throughput per chassis.
Dell EMC's Hadoop product for AI deploys a PowerEdge R640 head node and two PowerEdge R640 worker nodes connected by Dell 25 GbE networking. Worker nodes use local SSD storage. The Hadoop infrastructure is built from as many as 10 PowerEdge R740xd servers that provide shared storage.
Excelero
Excelero is a startup that has developed scale-out block storage for high-performance, low-latency requirements such as machine learning and AI. Excelero NVMesh software uses a patented protocol called Remote Direct Drive Access, or RDDA. This protocol, which is similar to Remote Direct Memory Access (RDMA), enables nodes or servers in an NVMesh cluster to communicate with NVMe drives in another node without involving the CPU of the target server. This enables NVMesh to deliver highly linear scalability, either as a dedicated storage product or in a hyper-converged configuration. NVMesh can be combined with IBM Spectrum Scale to deliver a scale-out file system for machine learning and AI.
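RDDA itself is proprietary and lives in the NIC and kernel data path, so it can't be reproduced here. The toy Python model below is purely conceptual, with all names hypothetical; it exists only to show the defining property of the protocol: a read is serviced by the remote drive over DMA while the target server's CPU is never involved.

```python
# Conceptual model only: RDDA is implemented in hardware and the kernel,
# not Python. This just illustrates the CPU-bypass data path.
from dataclasses import dataclass, field

@dataclass
class RemoteNvmeDrive:
    blocks: dict = field(default_factory=dict)

    def dma_read(self, lba: int) -> bytes:
        # Serviced by the drive and NIC via DMA, not by the host CPU.
        return self.blocks.get(lba, b"\x00" * 4096)

@dataclass
class TargetServer:
    drive: RemoteNvmeDrive = field(default_factory=RemoteNvmeDrive)
    cpu_interrupts: int = 0   # stays at zero on the RDDA-style path

def rdda_style_read(target: TargetServer, lba: int) -> bytes:
    # The initiator's NIC places the command directly in the drive's
    # queue; target.cpu_interrupts is never touched.
    return target.drive.dma_read(lba)

target = TargetServer()
target.drive.blocks[7] = b"hello".ljust(4096, b"\x00")
assert rdda_style_read(target, 7).startswith(b"hello")
assert target.cpu_interrupts == 0
```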
Excelero doesn't provide performance figures, but the vendor works with partners and resellers to develop integrated hardware and software products. The Talyn platform from Boston Ltd. in the U.K., for example, can deliver up to 4.9 million IOPS and 23 GBps of throughput at less than 200 microseconds (µs) of latency in a 2U all-flash appliance.
Hewlett Packard Enterprise
HPE partners with WekaIO and Scality to deliver a tiered offering that meets both capacity and performance requirements. HPE provides two reference architectures: one based on HPE Apollo 2000 servers for WekaIO Matrix and one on Apollo 4200 servers for Scality Ring. It also offers a combined product that runs both Matrix and Ring on the same Apollo 4200 hardware. A single Apollo 4200 Gen10 server supports up to 46 TB of NVMe storage or 288 TB of HDD capacity. Typical configurations consist of a minimum of six Apollo 4200 servers for a mixed cluster or six Apollo 4200 and six Apollo 2000 servers in a disaggregated cluster.
HPE offers an AI reference architecture that uses WekaIO software deployed on ProLiant DL360 Gen10 servers with NVMe SSDs. Networking is delivered through Mellanox 100 Gb InfiniBand switches, while Apollo 6500 Gen10 servers provide up to eight Nvidia Tesla V100 GPUs.
IBM
IBM's reference architecture for AI is Spectrum Storage for AI. The product uses either IBM Power System servers or Nvidia DGX-1 and DGX-2 servers. The Power System AC922 variant uses IBM Power9 processors and as many as six Nvidia Tesla V100 GPUs in a single server. The DGX variants support as many as nine DGX-1 or three DGX-2 servers per rack. In both instances, the products use Mellanox InfiniBand switches or 100 GbE and IBM Elastic Storage Server (ESS) all-flash appliances. Typical DGX configurations pair three DGX-1 servers with one all-flash appliance or one DGX-2 with one all-flash appliance.
IBM ESS combines NVMe block storage with IBM Spectrum Scale, formerly known as General Parallel File System, or GPFS. Each ESS appliance is capable of delivering 40 GBps of throughput at 100 µs of latency, enough to saturate the GPUs on three DGX-1 systems.
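Those figures imply a simple sizing rule of thumb. The sketch below is our back-of-envelope arithmetic, not an IBM tool; it divides one appliance's 40 GBps across the three DGX-1 systems it is typically paired with:

```python
import math

ESS_THROUGHPUT_GBPS = 40   # per all-flash appliance, per IBM's figures
DGX1_PER_APPLIANCE = 3     # typical pairing described above

def appliances_needed(dgx1_count: int) -> int:
    """ESS appliances required to keep dgx1_count DGX-1 systems fed."""
    return math.ceil(dgx1_count / DGX1_PER_APPLIANCE)

print(ESS_THROUGHPUT_GBPS / DGX1_PER_APPLIANCE)  # ~13.3 GBps per DGX-1
print(appliances_needed(9))                      # a nine-DGX-1 rack needs 3
```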
NetApp
Ontap AI combines NetApp All Flash Fabric-Attached Storage (AFF) all-flash storage with Nvidia DGX-1 servers and Cisco networking. This product is offered as a validated reference architecture using tested combinations of NetApp AFF A800 storage and DGX-1 servers. Typical configurations use a single AFF A800 with one, four or seven DGX-1 systems. One AFF A800 supports up to 25 GBps of sequential read throughput and 1 million IOPS, and scales to 300 GBps and 11.4 million IOPS in a 24-node cluster.
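The cluster maximums track the single-system numbers closely if you count HA pairs rather than nodes, since a 24-node cluster is 12 two-controller A800 systems. A quick check:

```python
SINGLE_A800_GBPS = 25
SINGLE_A800_IOPS = 1_000_000

ha_pairs = 24 // 2                  # 12 two-controller A800 systems

print(ha_pairs * SINGLE_A800_GBPS)  # 300: matches the quoted 300 GBps
print(ha_pairs * SINGLE_A800_IOPS)  # 12M: the quoted 11.4M is slightly sublinear
```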
Using a reference architecture of one AFF A800 system and four Nvidia DGX-1 servers, NetApp claims to operate the GPUs at more than 95% utilization and achieve close to the theoretical maximum processing capability on industry-standard ResNet-50, ResNet-152, Inception-v3 and VGG16 image processing training models.
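Benchmarks of this kind boil down to whether the data pipeline can keep the GPUs from stalling. The sketch below is not NetApp's test harness, just a minimal PyTorch loop of the same general shape: train ResNet-50 on images read from a shared mount (the /mnt/aff800 path is hypothetical) and compare the measured images per second against a GPU-only run on synthetic data.

```python
import time
import torch
import torchvision
from torch.utils.data import DataLoader

transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(),
])
# Hypothetical mount point for the shared all-flash filer.
dataset = torchvision.datasets.ImageFolder("/mnt/aff800/train", transform)
loader = DataLoader(dataset, batch_size=256, num_workers=16, pin_memory=True)

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

start, seen = time.time(), 0
for images, labels in loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    seen += images.size(0)

# A rate well below the synthetic-data rate means storage is the bottleneck.
print(f"{seen / (time.time() - start):.0f} images/sec")
```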
One advantage of the NetApp design is that it lets businesses use existing Ontap features, and it integrates with NetApp Data Fabric to move data in and out of a dedicated AI product.
Three categories of products from storage for AI vendors
- Storage products delivered with a reference architecture. This includes products that either package AI compute and storage directly or provide a reference architecture that validates scalability and performance capabilities.
Almost all these types of products are built on the Nvidia DGX platform with Tesla V100 GPUs. Internally, the DGX server system uses an interconnect called NVLink, which provides a high-bandwidth network between multiple GPUs and CPUs in the platform. NVLink can scale up to 300 GBps of internal bandwidth in a DGX-2 server with 16 GPUs and deliver 2 petaflops of AI compute power (see the sketch after this list). DGX systems do have some local storage, but to provide capacity and performance scalability, they also need fast shared external storage.
The packaged and reference architecture products from vendors of storage for AI offer organizations validated configurations that ensure the bandwidth capabilities of the Tesla GPUs are fully exploited. Generally, this is achieved with fast shared storage, 100 Gb Ethernet or InfiniBand networking, and one or more DGX-1 or DGX-2 systems. Most storage systems offered with these architectures use all-flash media to deliver high throughput at low latency.
- High-performance file storage. This includes storage that's delivered as software-defined storage, either as software or with bundled hardware from partners. In this instance, performance validation is provided through white papers and internal testing but not specifically through a reference architecture.
- Object storage. Finally, object storage vendors are providing the capacity to store large quantities of unstructured data for machine learning and AI and are partnering with other vendors to deliver integrated products that move data between fast and capacity tiers.
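The 2 petaflops figure mentioned above is simple multiplication, assuming Nvidia's published V100 Tensor Core peak of 125 TFLOPS in mixed precision:

```python
V100_TENSOR_TFLOPS = 125   # Nvidia's published mixed-precision peak per GPU
DGX2_GPUS = 16

print(DGX2_GPUS * V100_TENSOR_TFLOPS / 1000, "PFLOPS")  # 2.0 PFLOPS
```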
Pure Storage
Pure Storage AIRI is a converged infrastructure-style packaging of Pure Storage FlashBlade, Nvidia DGX-1 servers and Arista 100 GbE switches. AIRI enables admins to scale storage capacity and performance by adding more blades to the FlashBlade configuration and to scale compute performance with additional DGX-1 servers. A FlashBlade configuration with 15 blades delivers around 17 GBps bandwidth and 1.5 million NFS IOPS.
Pure has expanded AIRI to offer smaller and larger configurations. AIRI Mini incorporates dual switches, either Ethernet or InfiniBand, with seven 17 TB FlashBlades and two Nvidia DGX-1 systems using Tesla V100 GPUs.
The standard AIRI configuration provides dual switches, four Nvidia DGX-1 servers and fifteen 17 TB FlashBlades. Hyperscale AIRI offers three configurations, each with dual Ethernet or InfiniBand fabrics. Users can choose nine Nvidia DGX-1 systems with thirty 17 TB FlashBlades across two chassis. A second configuration uses three Nvidia DGX-2 servers with thirty 17 TB FlashBlades across two chassis. A third configuration uses two Nvidia DGX-2 systems and fifteen 17 TB FlashBlades.
Pure Storage recently announced FlashStack for AI, a product based on Cisco Unified Computing System C480 ML servers, Cisco Nexus switches and FlashBlade that enables organizations to build end-to-end data pipelines for managing AI applications.
Vast Data
Vast Data is a relatively new storage startup. The company has developed a scale-out architecture based on inexpensive quad-level cell NAND flash and Intel Optane, with the intention of replacing hybrid and HDD-based systems in the enterprise. The Vast Universal Storage system offers machine learning and AI workloads low-cost capacity per gigabyte with submillisecond latency. The current release supports NFSv3 and NFS over RDMA, which enables data transfers to overcome the limitations of traditional NFS-over-IP networking. Vast intends for its system to be the main repository for enterprises with large unstructured data lakes that machine learning and AI infrastructure process directly.
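On Linux, NFS over RDMA is requested with standard mount options rather than a special client. The sketch below shows the general form, using the conventional NFS/RDMA port 20049; the server name and export path are hypothetical, not Vast-specific syntax.

```python
import subprocess

# Mount an NFSv3 export over RDMA. proto=rdma and port 20049 are the
# standard Linux NFS/RDMA mount options; paths here are placeholders.
subprocess.run([
    "mount", "-t", "nfs",
    "-o", "vers=3,proto=rdma,port=20049",
    "vast-server:/datalake", "/mnt/datalake",
], check=True)
```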
WekaIO
WekaIO Matrix software implements a scale-out distributed file system. Matrix can be deployed in the public cloud or with on-premises infrastructure that uses NVMe storage and NVMe-oF to link thousands of nodes into a huge parallel file system. Although Matrix can provide NFS support, the main protocol access route to the file system is through a client agent that exposes a local file system to the application.
WekaIO doesn't sell complete storage for AI products directly but instead works with partners and resellers. HPE, for example, offers products based on HPE Apollo 6500 Gen10 servers that support Nvidia GPUs. Matrix is implemented on storage nodes, such as Apollo 4200 and ProLiant DL360 servers.
Matrix has broad applicability across machine learning and AI workloads, with design characteristics that support both small files and large file counts. Matrix enables organizations to tier data to less expensive storage through support of the S3 protocol, including public cloud targets and partners such as Scality.
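Matrix handles the demotion transparently, but the capacity tier only has to speak the S3 protocol. The sketch below shows the general pattern of addressing an S3-compatible endpoint, such as an on-premises Scality Ring, from Python; the endpoint, bucket and credentials are all hypothetical placeholders, not WekaIO's internal mechanism.

```python
import boto3

# Any S3-compatible endpoint, whether public cloud or an on-premises
# object store, can serve as the capacity tier. Values are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ring.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.put_object(Bucket="capacity-tier", Key="cold/chunk-0001", Body=b"demoted data")
```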
Editor's note
Using extensive research into the storage for AI market, TechTarget editors focused this article series on storage systems that are used to run heavy-duty AI and machine learning analytics loads. Our research included data from TechTarget surveys and reports from other well-respected research firms, including Gartner.