Building a business case for all-flash array storage
To make a business case for purchasing AFA storage, you must first understand what your applications need in terms of IOPS, latency and throughput.
Before you can determine if your organization could benefit from all-flash array storage, you must gather information about your workloads. Do your applications require more IOPS or throughput, or lower latency? Are they read-intensive or write-intensive? You may need to use tools to pinpoint performance bottlenecks and characterize data performance. And you need to understand that all-flash array storage operates differently from disk arrays.
Keep in mind that all-flash storage arrays are not all alike. Many systems have been built with specific applications or workloads in mind, and an all-flash storage array that makes one application run 100 times faster will not necessarily have a similar effect on other workloads. Applications that can take advantage of the deduplication and compression features in all-flash array (AFA) storage, such as virtualization apps, will find it far more cost-effective than workloads that deal with data that is already compressed or unique, such as video files or real-time streams.
How do HDDs compare to SSDs?
HDDs in storage systems can cause performance bottlenecks for applications. For example, the time needed to move the read/write head from one position to another on the disk to read information causes a relatively large gap between the minimum, average and maximum times required to access information. In addition, reading large amounts of data from the innermost tracks on a disk takes longer than from the outermost tracks because the number of bits passing under the head each millisecond varies with the position on the disk.
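If you want to see this latency spread for yourself, a quick measurement is often more convincing than a spec sheet. The following Python sketch issues random reads against a test file and reports the minimum, average and maximum latency. The file path and read size are placeholders, and because some reads may be served from the OS page cache, treat the numbers as a rough illustration rather than a device benchmark.

```python
# Rough sketch: measure the spread between minimum, average and maximum
# random-read latency. PATH and READ_SIZE are placeholders; reads served
# from the OS page cache will understate true device latency.
import os
import random
import time

PATH = "/path/to/testfile"   # hypothetical test file on the storage being evaluated
READ_SIZE = 4096             # 4 KB per read
SAMPLES = 1000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
latencies_ms = []

for _ in range(SAMPLES):
    offset = random.randrange(0, max(size - READ_SIZE, 1))
    start = time.perf_counter()
    os.pread(fd, READ_SIZE, offset)                        # random read
    latencies_ms.append((time.perf_counter() - start) * 1000)

os.close(fd)
print(f"min {min(latencies_ms):.3f} ms  "
      f"avg {sum(latencies_ms) / len(latencies_ms):.3f} ms  "
      f"max {max(latencies_ms):.3f} ms")
```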
Solid-state drives (SSDs) have much smaller variations between minimum, average and maximum latency on reads, which makes them ideal for read-intensive applications. SSDs are also fast enough to enable real-time deduplication. This helps applications that use many similar disk images, including both server virtualization and virtual desktop infrastructure (VDI). Because most of the millions of files in a typical Microsoft Windows or Linux OS installation are the same from system to system, a hundred virtual OSes will take up little more space than a single installation.
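To see why identical OS images deduplicate so well, consider a simplified, offline version of what an array does inline: hash every fixed-size block and store each unique block only once. The sketch below is illustrative only; the image file names are hypothetical, and real arrays deduplicate at the block level in firmware with far more sophistication.

```python
# Simplified block-level deduplication: count total vs. unique 4 KB blocks
# across a set of image files. Illustrative only; the image paths are made up.
import hashlib

BLOCK_SIZE = 4096

def dedup_stats(paths):
    """Return (total_blocks, unique_blocks) across the given files."""
    seen = set()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                total += 1
                seen.add(hashlib.sha256(block).digest())
    return total, len(seen)

total, unique = dedup_stats(["win10_vm_1.img", "win10_vm_2.img"])  # hypothetical images
if unique:
    print(f"{total} blocks stored as {unique} unique blocks "
          f"(~{total / unique:.1f}:1 dedup ratio)")
```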
But SSDs have their own Achilles' heel: write amplification. This phenomenon occurs because an SSD must erase an entire block and rewrite it to change even one bit in that block. Depending on the block size, an SSD may have to rewrite 256 KB to 4 MB of data to save a single byte of new information, amplifying writes by a ratio of roughly 256,000:1 to 4,000,000:1. In a write-intensive situation, this must be addressed with special algorithms that collect writes destined for the same block on the SSD whenever possible, so the system can commit multiple changes to that block in a single rewrite.
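The arithmetic behind those ratios is simple enough to sanity-check. This back-of-the-envelope sketch assumes the worst case described above, where a one-byte change forces a full erase-block rewrite, and shows how batching several changes to the same block shrinks the ratio; the 64 KB batching figure is an assumption for illustration.

```python
# Worst-case write amplification: a tiny change forces the whole erase block
# to be rewritten. Erase-block sizes are the 256 KB and 4 MB figures above.
ERASE_BLOCK_SIZES = [256 * 1024, 4 * 1024 * 1024]

for block_bytes in ERASE_BLOCK_SIZES:
    worst_case = block_bytes // 1          # one byte changed per rewrite
    batched = block_bytes // (64 * 1024)   # 64 KB of changes batched per rewrite (assumed)
    print(f"{block_bytes // 1024:>5} KB block: "
          f"worst case ~{worst_case:,}:1, with batching ~{batched:,}:1")
```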
A similar issue involves garbage collection, the recycling of blocks whose data has been rewritten elsewhere. With an HDD, a file that has been written will often remain in the same blocks on disk indefinitely. Because the number of times data can be written to each SSD cell is limited, data is often rewritten to new blocks to ensure that all cells are used evenly. This process is known as wear leveling. Once data is rewritten, the old cells must be erased before they can be used again. Keeping track of the cells to be erased and then erasing them can cause additional write amplification unless the storage system's algorithms are optimized to minimize that overhead.
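A toy model can make the interaction between wear leveling and garbage collection easier to picture. The sketch below is not how any real flash translation layer works; it simply sends each write to the least-worn free block, marks the old copy stale, and erases stale blocks later, which is where the extra background writes come from.

```python
# Toy flash translation layer: wear leveling plus garbage collection.
# Purely illustrative; real FTLs are vastly more sophisticated.
import heapq

class ToyFTL:
    def __init__(self, num_blocks):
        self.free = [(0, b) for b in range(num_blocks)]  # (erase_count, block_id)
        heapq.heapify(self.free)
        self.live = {}     # logical address -> (erase_count, block_id)
        self.stale = []    # blocks waiting to be erased

    def write(self, logical_addr):
        if logical_addr in self.live:
            self.stale.append(self.live.pop(logical_addr))   # old copy becomes garbage
        self.live[logical_addr] = heapq.heappop(self.free)    # least-worn free block

    def garbage_collect(self):
        for erase_count, block_id in self.stale:
            heapq.heappush(self.free, (erase_count + 1, block_id))  # erasing wears the block
        self.stale.clear()

ftl = ToyFTL(num_blocks=8)
for addr in (0, 1, 0, 0, 2):       # rewriting address 0 keeps creating stale blocks
    ftl.write(addr)
ftl.garbage_collect()
print("free blocks and their erase counts:", sorted(ftl.free))
```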
Even with write amplification, all-flash array storage systems are generally much faster than HDD-based systems, but the optimum methods for getting the best performance out of a system are different. The latest all-flash storage systems are designed to minimize the impact of write amplification and garbage collection, but if you have write-intensive applications, you may want to do some testing on your own to determine the optimum configurations.
All-flash array storage advantages and applications
AFA storage can deliver massive improvements in storage performance -- including much higher IOPS, increased throughput and reduced latency -- while increasing the effective capacity of a system through inline compression and deduplication. It is important to understand how these work to know whether the applications in your data center will be able to get the most from an all-flash array storage system.
Some apps need IOPS, others need low latency and some need high throughput. For instance, server virtualization and VDI apps are generally most affected by IOPS, high-performance computing and databases are sensitive to latency, and video systems need high throughput. Being able to characterize your apps and their requirements will go a long way toward making the case for a new AFA storage system.
Testing and benchmarking applications, whether software-only products or appliances, can help find issues with your current systems and can be used to create tests to see whether an AFA system can handle the traffic. Bear in mind that you cannot eliminate bottlenecks, only move them. Increasing the performance of one part of the system will only expose the next choke point.
Applications that require more IOPS. Systems that execute many parallel operations typically require a large number of IOPS. These include server virtualization and VDI, but also database systems with many simultaneous users, from search engines such as Google to e-commerce apps such as Amazon. While many all-flash array vendors advertise that their products can perform a million or more IOPS, these numbers are difficult to achieve in the real world. A benchmark app can generate millions of small (2 KB) requests to produce a higher number, but real-world apps tend to use larger request sizes, from 100 KB to several megabytes.
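The gap between those headline numbers and real workloads mostly comes down to request size. For a fixed amount of bandwidth, the achievable IOPS figure falls as requests get bigger, as this rough calculation shows; the 2 GB/s bandwidth figure is an arbitrary assumption for illustration.

```python
# How request size changes the IOPS headline number for a fixed bandwidth.
# The 2 GB/s figure is assumed purely for illustration.
BANDWIDTH_BPS = 2 * 1024**3                              # 2 GB/s

for request_bytes in (2 * 1024, 100 * 1024, 1024**2):    # 2 KB, 100 KB, 1 MB
    iops = BANDWIDTH_BPS / request_bytes
    print(f"{request_bytes // 1024:>5} KB requests -> about {iops:,.0f} IOPS")
```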
This makes benchmarking an application and setting up a test bed problematic. A single all-flash SAN system with a couple of servers just won't cut it, even with 10 Gb or faster connections. Sometimes a better solution is to ask the vendor for customer contacts who have already implemented a system running the same applications as your organization, and to talk to their data center managers about the issues they encountered when implementing AFA storage in an environment similar to yours.
Programs that are affected by latency. Applications that chain processing stages, passing data between nodes in a cluster, are particularly sensitive to latency issues. Because each stage waits for data from the previous one, any delay in processing or transmitting data causes further delays downstream. High-performance computing systems, clustered databases, and real-time and streaming apps are all sensitive to latency. When testing, average latency is more informative than minimum latency, but it is also important to look at maximum latencies. If one condition produces latencies many times greater than the norm, it is worth investigating.
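When you analyze your test results, it helps to report the tail alongside the average. A minimal sketch of that analysis, using made-up latency samples, might look like this; in practice you would feed in measurements from your own runs.

```python
# Look past the average: report mean, 99th-percentile and maximum latency.
# The samples below are invented for illustration; one slow outlier is included.
import statistics

samples_ms = [0.4, 0.5, 0.5, 0.6, 0.4, 0.5, 0.7, 0.5, 0.6, 12.0]

mean_ms = statistics.mean(samples_ms)
p99_ms = statistics.quantiles(samples_ms, n=100)[98]   # 99th percentile
max_ms = max(samples_ms)

print(f"mean {mean_ms:.2f} ms, p99 {p99_ms:.2f} ms, max {max_ms:.2f} ms")
# A maximum or p99 many times the mean is exactly the condition worth investigating.
```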
Applications with high throughput needs. Applications that move around large amounts of data require high throughput. These include video processing for special effects and data analytics on big data sets, such as seismic analysis and data mining. They all depend on sharing and processing volumes of data ranging from hundreds of gigabytes to petabytes. Being able to move data around is crucial, and it often requires large pipes (10 Gb to 100 Gb Ethernet, 16 Gbps or 32 Gbps Fibre Channel) and efficient data handling at the network stack layer, as well as deduplication and compression of data being transmitted over LAN or WAN connections.
As with the other I/O measurements, there are throughput issues that may not be readily apparent. For instance, moving from a 1 Gb to a 10 Gb Ethernet connection will produce higher throughput, but it can also multiply the CPU cycles the server spends handling network traffic by 10 times or more. That can push a server running at 25% utilization to 50% or more just to access data, which can cause crashes or logjams when network utilization spikes.
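Some rough arithmetic shows why the faster pipe raises the load on the server as well: at line rate, a 10 GbE link delivers roughly 10 times the bytes, and the host has roughly 10 times as many frames to process. The frame size below is a standard Ethernet MTU, used here only for illustration; offload features on modern NICs change the picture.

```python
# Line-rate throughput and the frame-processing load it implies for the host.
# Assumes standard 1,500-byte Ethernet frames, purely for illustration.
MTU_BYTES = 1500

for name, gbps in (("1 GbE", 1), ("10 GbE", 10)):
    bytes_per_sec = gbps * 1e9 / 8
    frames_per_sec = bytes_per_sec / MTU_BYTES
    print(f"{name}: ~{bytes_per_sec / 1e6:,.0f} MB/s, "
          f"~{frames_per_sec:,.0f} frames/s for the host to handle")
```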
Making the case for all-flash array storage
Justifying an upgrade to AFA storage begins with collecting enough data to be able to say more than just that the network is slow. Data collection tools range from the logging functions built into Microsoft's Systems Management Server or System Center Configuration Manager and VMware's vCenter to specialized tools such as SolarWinds Storage Manager. These tools help pinpoint bottlenecks and characterize data performance. They also build historical trend data, showing, for example, that a system is growing at a rate that will cause problems in six months or a year. This data lets you proactively build a new storage tier, dealing with growth before it starts affecting application performance.
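The trend projection itself does not need anything elaborate. A sketch of the calculation, using placeholder capacity and growth figures rather than real measurements, might look like this:

```python
# Simple capacity-trend projection from historical growth data.
# All figures are placeholders for illustration.
capacity_tb = 100.0
used_tb = 62.0
growth_tb_per_month = 4.5      # taken from historical trend data in practice

months = 0
while used_tb < capacity_tb:
    used_tb += growth_tb_per_month
    months += 1

print(f"At the current growth rate, capacity runs out in about {months} months.")
```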
Multi-tier applications are complex, with many different servers passing data between them, so simply throwing hardware at a slow app may not help at all. It is necessary to isolate the real issues, find the root causes and then test the proposed solution before upgrading anything. Fortunately, the big AFA vendors will often make trial systems as well as expert systems engineers available in the pre-sales environment. Leveraging this kind of expertise can help a storage administrator prove to the financial gatekeepers that a new system is necessary and will benefit the organization.