Storage strategies for machine learning and AI workloads
Learn how organizations are using machine learning and AI to create actionable insights and what challenges they face as they develop their storage strategies.
Businesses are increasingly using data assets to sharpen their competitive edge and drive greater revenue. Part of this strategy is to use machine learning and AI tools and technologies. But AI workloads have significantly different data storage and computing needs than generic workloads.
AI and machine learning workloads require huge amounts of data both to build and train the models and to keep them running. When it comes to storage for these workloads, high performance and long-term retention are the most important concerns.
Organizations can use existing data sources to build and train AI models that create insights to improve business processes, target customers more accurately or develop better products. Machine learning/AI processing typically follows two steps. In the first, an organization uses its data to build and train machine learning/AI models, which are essentially algorithms that some business process will use. This training step requires machine learning algorithms to repeatedly process large amounts of data as the models take shape.
Once an organization creates a model, it deploys the model against a data source to generate results that produce value for the business. However, this isn't the end of the process. Machine learning/AI model design is an iterative process in which models are developed, evaluated and rebuilt as new data is added and the model is refined. This closed loop repeats continuously.
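To make the loop concrete, here is a minimal sketch in Python, using scikit-learn as a stand-in for whatever modeling framework an organization actually runs. The fetch_latest_data function, the data shape and the three-cycle count are all hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fetch_latest_data(cycle: int):
    """Stand-in for pulling the latest records from the organization's data store."""
    return make_classification(n_samples=5000, n_features=20, random_state=cycle)

for cycle in range(3):                                    # each pass of the closed loop
    X, y = fetch_latest_data(cycle)                       # new data is added
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=cycle)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # build/train step
    accuracy = model.score(X_test, y_test)                # evaluate step
    print(f"cycle {cycle}: holdout accuracy {accuracy:.2%}")
# In production, each refined model would be redeployed against live data.
```

Every pass through that loop rereads the accumulated data, which is why storage demands grow with each iteration.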
When examining the storage requirements for AI workloads, it's important to note that there's already widespread use of AI within storage platforms themselves. Application I/O profiles aren't entirely random, even considering the effect of the virtualization I/O blender. This predictability enables vendors to train their storage systems to improve the overall performance of the platform.
Most modern self-tuning capabilities were developed to address an organization's need to manage multiple tiers of storage within a single appliance. Products such as Dell EMC's Fully Automated Storage Tiering, or FAST, moved inactive data to lower-cost storage, while dynamically promoting active data to faster media. Today, this requirement is less relevant with all-flash systems but will become more important as tiered flash gains widespread use within the enterprise.
Using data from the field to improve platform reliability is probably the most interesting use of AI in storage. Vendors such as Hewlett Packard Enterprise and Pure Storage collect telemetry from deployed systems to detect and resolve performance anomalies and spot potential bugs. This wisdom-of-the-crowd approach means uptime for dual-controller platforms, such as Nimble Storage, can be increased to six nines -- 99.9999% -- or more.
Developing a storage for AI strategy
As organizations develop their storage strategies to take advantage of machine learning and AI, they're faced with two main challenges:
- Storing and retaining data for the long term. At the outset of a machine learning/AI project, it may not be clear which data is useful and which can be discarded. Long-term archives, such as object stores or the public cloud, can retain data in well-indexed platforms that act as a data lake; the sketch after this list shows the landing pattern.
- High-performance options. At some point, an organization must move active data to a high-performance platform for processing. Vendors have released products that combine their fastest storage systems with machine learning hardware, such as Nvidia's GPU-based DGX-1 and DGX-2 systems.
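For the first challenge, the landing pattern is typically a simple object-store write with enough metadata to keep the data findable later. A minimal sketch, assuming an S3-compatible object store accessed via boto3, with hypothetical bucket, key and metadata values:

```python
import boto3

# Object store client; boto3 works against Amazon S3 and S3-compatible stores.
s3 = boto3.client("s3")

# Land a raw data file under a date-partitioned key so it stays easy to index
# and retrieve later. Bucket, key and metadata values are all hypothetical.
s3.upload_file(
    Filename="sensor-batch-0421.parquet",
    Bucket="ml-data-lake",
    Key="raw/vehicles/2024/04/sensor-batch-0421.parquet",
    ExtraArgs={"Metadata": {"source": "test-fleet", "schema": "v2"}},
)
```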
Building the right platform can incur significant cost and requires specific skills to ensure that machine learning hardware, such as GPUs, is continually fed with data. This can make packaged storage for AI products more attractive: they offer a measurable level of performance, and vendors tune and optimize them for the features AI workloads need, rather than for generic workloads.
Storage for AI workload requirements
Machine learning and AI workloads have very specific storage requirements. These include:
Scalability. Machine learning requires organizations to process vast amounts of data, yet processing exponentially larger data volumes typically yields only linear improvements in model quality. This means that to keep increasing the accuracy of machine learning/AI models, businesses must collect and store ever more data.
Accessibility. Data must be continuously accessible. Machine learning/AI training requires the storage system to read and reread entire data sets, usually in a random fashion. This means it isn't possible to use archive systems, such as tape, that only offer sequential access methods.
Latency. The latency of I/O is important to building and using machine learning/AI models because data is read and reread many times. Cutting I/O latency can shave days or even months off machine learning/AI training time, and faster model development translates directly to greater business advantage.
Throughput. Naturally, the throughput of storage systems is also critical to efficient machine learning/AI training. Training processes consume massive amounts of data, often measured in terabytes per hour, and it can be challenging for many storage systems to deliver this level of randomly accessed data; the back-of-envelope numbers at the end of this section put those figures in perspective.
Parallel access. In order to achieve high throughput, machine learning/AI training models will split activity into multiple parallel tasks. Often this means machine learning algorithms access the same files from multiple processes -- potentially on multiple physical servers -- at the same time. Storage systems must be able to cope with this concurrent demand without degrading performance, as the sketch below illustrates.
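Taken together, the accessibility and parallel-access requirements describe many processes rereading the same files in random order at once. A minimal sketch of that load pattern, with a hypothetical mount point and worker count:

```python
import multiprocessing as mp
import random
from pathlib import Path

DATA_DIR = Path("/mnt/fast-storage/train")   # hypothetical data set location

def run_epoch(worker_id: int) -> int:
    """One training worker re-reads every file in its own random order."""
    files = sorted(DATA_DIR.glob("*.bin"))
    random.seed(worker_id)
    random.shuffle(files)                    # random, not sequential, access
    return sum(len(f.read_bytes()) for f in files)

if __name__ == "__main__":
    # Eight workers read the same files at the same time -- the concurrent,
    # random-read demand the storage system must absorb without slowing down.
    with mp.Pool(processes=8) as pool:
        totals = pool.map(run_epoch, range(8))
    print(f"served {sum(totals) / 1e9:.1f} GB of concurrent random reads")
```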
Naturally, these requirements are very specific and focused on high performance. In general, machine learning/AI uses unstructured data -- either objects or files -- which dictates the type of storage systems an organization can use.
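To see why latency and throughput dominate that list, consider a back-of-envelope calculation with hypothetical but plausible numbers; a tenfold latency cut turns days of accumulated I/O wait into hours:

```python
# Hypothetical workload: a 10 TB training set re-read 50 times as 1 MB random I/Os.
dataset_bytes = 10e12
epochs = 50
io_size = 1e6
ios = dataset_bytes / io_size * epochs            # 500 million reads in total

for latency in (1e-3, 1e-4):                      # 1 ms vs. 0.1 ms per I/O
    hours = ios * latency / 3600
    print(f"{latency * 1e3:.1f} ms per read -> {hours:,.0f} hours of serialized I/O wait")

# The throughput view: completing one epoch per hour needs sustained bandwidth of
print(f"{dataset_bytes / 3600 / 1e9:.2f} GB/s to stream 10 TB in an hour")
```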
The pros and cons of different storage technologies
Given the choice, the fastest way to process any data set is to hold its contents in memory, as dynamic RAM (DRAM) operates at nanosecond speeds. However, server platforms are limited in memory capacity: even a large server topping out at 6 TB of DRAM is too small to hold the data sets that machine learning/AI workloads process.
This means machine learning algorithms need to access persistent storage in some form. Here's where things become challenging. Inevitably, different storage products have benefits and disadvantages.
- Block-based storage has historically offered the lowest latency for I/O, but it doesn't provide scalability for multi-petabyte deployments. Cost is also a factor in high-performance block products. Some vendors are implementing hybrid options that combine block and a scalable file system, which we will discuss later.
- File-based storage provides scalability and the right access method for unstructured data. But historically, file-based products haven't offered the highest levels of performance.
- Object storage offers the greatest level of scalability and a more simplified access protocol via HTTP(S). Object stores are good at managing multiple concurrent I/O requests, but they generally don't offer the best throughput or lowest latency. This is because most object storage systems are based on spinning media to reduce costs.
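The difference in access methods between the last two options is easy to see side by side. A minimal sketch, with a hypothetical file path and object URL:

```python
import requests

# File access: byte-addressable reads through the operating system's file system.
with open("/mnt/filer/train/sample-000.bin", "rb") as f:
    file_bytes = f.read()

# Object access: a whole-object GET over HTTP(S); no file system mount required.
resp = requests.get("https://objects.example.com/ml-data-lake/sample-000.bin")
resp.raise_for_status()
object_bytes = resp.content
```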
Given the various trade-offs, some machine learning/AI implementations use a mix of platform types, for example, storing the majority of data on an object store and moving the active data set to a high-performance file system as part of the training process. Where possible, this should be avoided, as shuffling data between tiers introduces extra processing delays.
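Where that two-tier approach is used anyway, the staging step is usually a bulk copy from the object store to the fast tier before training starts. A minimal sketch, assuming an S3-compatible store accessed via boto3 and a hypothetical NVMe scratch mount:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
FAST_TIER = Path("/mnt/nvme-scratch")        # hypothetical high-performance mount

# Copy every object under the active/ prefix down to the fast tier.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="ml-data-lake", Prefix="active/"):
    for obj in page.get("Contents", []):
        dest = FAST_TIER / obj["Key"]
        dest.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file("ml-data-lake", obj["Key"], str(dest))
# Training jobs then read from FAST_TIER at file-system speed.
```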
How organizations are deploying machine learning/AI
What types of machine learning/AI workloads do organizations process today? Clearly, organizations that have large volumes of input data are at an advantage.
Probably the most often-cited application is autonomous vehicles. Each self-driving car can collect many terabytes of data per day, which adds up to a massive amount of data across even a small fleet of test vehicles.
The airline industry is using AI extensively for everything from collecting statistics on aircraft in flight to efficient baggage handling and facial recognition. Consumer-focused products are also being developed that let customers ask common questions through voice assistants such as Amazon's Alexa.
Smart cities are collecting everything from traffic patterns to energy usage in an effort to create better and safer urban environments for everyone.
An overview of the machine learning/AI market
Many vendors are selling stand-alone and prepackaged storage products for machine learning and AI workloads. Pure Storage, Dell EMC, IBM and NetApp Inc. all offer converged infrastructure-style products that package storage, networking and compute with Nvidia Corp. DGX systems into a single rack. DataDirect Networks' product packages scale-out file storage with Nvidia's DGX-1 systems.
WekaIO and Excelero offer software-based products that turn a cluster of servers into high-performance storage for AI. WekaIO's product is file-based, while Excelero's system offers block storage that organizations can combine with a scale-out file system. Users can then build these products into AI systems of their own design.
Vast Data has developed a high-performance and highly scalable storage product that can hold multiple petabytes of machine learning/AI data as the source for machine learning training models.