Big data analytics applications impact storage systems
Analytics applications for big data have placed extensive demands on storage systems, which Mike Matchett says often require new or modified storage architectures.
Whether driven by direct competition or internal business pressure, CIOs, CDOs and even CEOs today are looking to squeeze more value, more insight and more intelligence out of their data. They can no longer afford to archive, ignore or throw away data if it can be turned into a valuable asset. At face value, it might seem like a no-brainer -- "we just need to analyze all that data to mine its value." But, as you know, keeping any data, much less big data, has a definite cost. Processing larger amounts of data at scale is challenging, and hosting all that data on primary storage hasn't always been feasible.
Historically, unless data had some corporate value -- possibly as a history trail for compliance, a source of strategic insight or intelligence that could optimize operational processes -- it was tough to justify keeping it. Today, thanks in large part to big data analytics applications, that thinking is changing. All of that bulky, low-level bigger data may have little immediate value, but it could hold great future potential, so you want to keep it -- once it's gone, you lose any downstream opportunity.
Big data alchemy
To extract value from all that data, however, IT must not only store increasingly large volumes of data, but also architect systems that can process and analyze it in multiple ways. For many years, the standard approach was to aggregate limited structured/transactional elements into data warehouses (and related architectures) to feed BI workflows and archive older or file-based data for compliance and targeted search/recall needs. Thus, we've long supported expensive scale-up performance platforms for structured query-based analytics alongside capacity-efficient deep content stores to freeze historical and compliance data until it expires. Both are complex and expensive to implement and operate effectively.
But, that limited bimodal approach left a lot of potential data-driven value out of practical reach. Naturally, the market was ripe for innovation that could not only bring down the cost of active analysis at larger scale and faster speeds, but also fill in the gaps where value-laden data was being left unexploited. For example, archiving architectures began embedding native search-type analytics to make their captured cold data more "actively" useful. Today, after generations of performance and scalability design improvements, what was once considered a dumping ground for dying data has evolved into Web-scale object storage (e.g., AWS S3).
Likewise, the emerging Hadoop ecosystem brought HPC-inspired scale-out parallel processing onto affordable hardware, enabling rank-and-file organizations to conduct cost-efficient, high-performance data analysis on a large scale. As a first use case, Hadoop is a good place to land raw detail data and host large-scale ELT (extract, load, transform)/ETL (extract, transform, load) for highly structured BI/DW architectures. But the growing Hadoop ecosystem has also unlocked the ability to mine value from less-structured, higher-volume and faster-aggregating data streams. Today, complex Hadoop and Hadoop-converged processing offerings (e.g., HP Haven integrating Hadoop and Vertica) are tightly marrying structured and unstructured analytical superpowers, enabling operationally focused (i.e., in business real-time) big data analytics applications.
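As a rough sketch of that "land raw data, then shape it for BI/DW" pattern, here is a minimal PySpark job that reads hypothetical raw JSON events from a landing zone and writes a cleaned, partitioned Parquet extract. The paths, column names and layout are illustrative placeholders, not a prescribed design:

```python
# Minimal ELT sketch: land raw detail data in Hadoop, then publish a clean,
# columnar extract for downstream BI/data warehouse tools.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("landing-elt-sketch").getOrCreate()

# Extract/Load: read the raw, semi-structured event files exactly as they arrived.
raw = spark.read.json("hdfs:///landing/raw_events/")

# Transform: keep only well-formed records and derive a partitioning column.
clean = (raw
         .filter(F.col("event_id").isNotNull())
         .withColumn("event_date", F.to_date("event_timestamp")))

# Publish a partitioned Parquet copy that query engines can scan efficiently.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("hdfs:///warehouse/events_clean/"))
```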
In fact, we are seeing new and exciting examples of convergent (i.e., convenient) IT architectures combining ever larger-scale storage with ever larger-scale data processing. While the opportunity for organizations to profitably analyze data on a massive scale has never been better, the number of storage options today is bewildering. Understanding the best of the options available can be quite challenging.
Keys to big data storage success
When tasked with storing and analyzing data on a large scale, consider the following:
- While storage (and compute, memory and so on) keeps getting cheaper, the analytical data footprint grows over time, and so does its cost. When you budget, account for data transmission and migration fees, lifetime data access operations (even the cost of eventual deletion) and other administrative/operational factors (e.g., storage management Opex). A rough cost model, like the sketch after this list, can surface those line items early.
- Data protection, business continuity/availability and security do not get easier at scale. And placing tons of eggs into only a few baskets can create massive points of vulnerability.
- While many analytical needs have been met with batch-oriented processing, more and more analytical outputs are being applied in real time to affect the outcomes of dynamic business processes (i.e., meeting live prospect/customer needs). This operational-speed intelligence demands well-planned big data workflows that will likely span multiple systems and probably call for judicious amounts of flash cache or in-memory processing.
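To put some rough numbers behind the first point above, here is a back-of-the-envelope lifetime cost model in plain Python. Every rate in it is an invented placeholder; plug in your own vendor, cloud or chargeback pricing:

```python
# Back-of-the-envelope lifetime cost model for an analytical data store.
# All rates below are illustrative placeholders, not real pricing.

def lifetime_storage_cost(tb_stored, years,
                          annual_growth=0.40,      # data footprint growth per year
                          cost_per_tb_month=25.0,  # capacity cost ($/TB-month)
                          egress_per_tb=90.0,      # transmission/migration ($/TB)
                          tb_moved_per_year=10.0,  # data migrated or recalled yearly
                          admin_per_year=20000.0): # storage management opex ($/year)
    total = 0.0
    tb = tb_stored
    for _ in range(years):
        total += tb * cost_per_tb_month * 12        # raw capacity
        total += tb_moved_per_year * egress_per_tb  # transmission and migration fees
        total += admin_per_year                     # administrative/operational overhead
        tb *= 1 + annual_growth                     # the footprint keeps growing
    return total

print(f"5-year estimate for 100 TB: ${lifetime_storage_cost(100, 5):,.0f}")
```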
Massive analytical data storage
So what does it take to store and make bigger data analytically useful to the enterprise? Obviously, handling data at larger scale is the first thing that most folks need to address. A popular approach is to leverage a scale-out design in which additional storage nodes can be added as needed to grow capacity. Scale-out products also deliver almost linear performance ramp-up to keep pace with data growth -- more nodes for capacity mean more nodes serving IOPS. Certain architectures even allow you to add flash-heavy nodes to improve latency and capacity-laden ones to expand the storage pool.
Many scale-out storage products are available as software-defined storage; in other words, they can be purchased as software and installed on more cost-effective hardware. In the real world, however, we see most folks still buying SDS as appliances, pre-loaded or converged to avoid the pain of DIY implementation.
The second thing we find with these new generations of massive analytical systems is that the analytical processing is being converged with the storage infrastructure. There is, of course, a remote I/O performance "tax" to be paid when analyzing data stored on a separate storage system, and with bigger data and intensive analytics that tax can be staggering, if not an outright obstacle.
When we look at the widely popular and growing Hadoop ecosystem (YARN, MapReduce, Spark and so on), we see a true paradigm shift. The Hadoop Distributed File System (HDFS) has been designed to run across a scale-out cluster that also hosts compute processing. Cleverly parallelized algorithms and supporting job schedulers farm out analysis tasks to run on each node to process relevant chunks of locally stored data. By adding nodes to deal with growing scale, capacity can be increased while overall performance remains relatively constant.
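To make that "move the code to the data" idea concrete, here is a minimal PySpark RDD sketch. It assumes Spark running on YARN with executors co-located on the HDFS data nodes; the clickstream path and tab-delimited layout are hypothetical:

```python
# Sketch of the map/reduce, code-to-data pattern on a Hadoop cluster.
# Assumes Spark on YARN co-located with HDFS; the input path is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="locality-sketch")

# Each HDFS block of the file becomes a partition; the scheduler tries to run
# each map task on a node that already holds that block locally.
lines = sc.textFile("hdfs:///logs/clickstream.log")

page_counts = (lines
               .filter(lambda line: "\t" in line)            # skip malformed lines
               .map(lambda line: (line.split("\t")[1], 1))   # map: emit (page, 1)
               .reduceByKey(lambda a, b: a + b))             # reduce: sum per page

# Pull back only a small summary to the driver, not the raw data.
for page, count in page_counts.top(10, key=lambda kv: kv[1]):
    print(page, count)

sc.stop()
```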
Since Hadoop is a scale-out platform designed to run on commodity servers, HDFS is basically software-defined storage custom-designed for big data. However, there are some drawbacks to a straight-up Hadoop implementation, including challenges with handling multiple kinds of data at the same time, mixed user/workloads with varying QoS needs, and multi-stage data flows. Within a single Hadoop cluster, it can be hard to separately scale capacity and performance. And Hadoop's native products are still maturing around enterprise data management requirements, although Hadoop vendors like Hortonworks and Cloudera continue to fill the remaining gaps.
For some use cases, the fact that enterprise networks are always getting faster means that a separate scale-out storage system closely networked to a scale-out processing system can also make sense. Instead of fully converging processing with storage, maintaining loosely coupled infrastructure domains can preserve existing data management platforms and provide for shared multi-protocol data access patterns.
Before attempting to use existing enterprise storage, be sure to consider the large-scale analytical data demands -- will traditional storage platforms designed to share centralized data with many similar client workloads be able to serve a tremendous number of different small files, or a small number of tremendously large files to each of many analytical applications at the same time? For example, HDFS is designed to support analyses that call for huge streams of long serial reads, while traditional NAS might focus on tiering and caching hot data for small file read and write use cases.
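As a toy illustration of why those two access patterns stress storage so differently, the standalone script below times one long serial read against thousands of small-file reads of the same total volume. It runs against a local temporary directory, so treat the numbers as directional only; the gap widens considerably on shared, networked or spinning storage:

```python
# Toy comparison of streaming vs. small-file access patterns on local disk.
# Sizes are deliberately small so the sketch runs anywhere.
import os
import tempfile
import time

TOTAL_MB = 64
CHUNK = 1024 * 1024                           # 1 MB read/write unit
SMALL_FILES = 4096                            # many small files...
SMALL_SIZE = TOTAL_MB * CHUNK // SMALL_FILES  # ...holding the same total bytes

with tempfile.TemporaryDirectory() as work:
    # Lay out one big file and many small files of equal total size.
    big_path = os.path.join(work, "big.dat")
    with open(big_path, "wb") as f:
        for _ in range(TOTAL_MB):
            f.write(os.urandom(CHUNK))
    for i in range(SMALL_FILES):
        with open(os.path.join(work, f"small_{i}.dat"), "wb") as f:
            f.write(os.urandom(SMALL_SIZE))

    # One long serial read: the HDFS-style streaming pattern.
    start = time.perf_counter()
    with open(big_path, "rb") as f:
        while f.read(CHUNK):
            pass
    print(f"1 large file   : {time.perf_counter() - start:.3f}s")

    # Thousands of opens and tiny reads: the classic small-file NAS pattern.
    start = time.perf_counter()
    for i in range(SMALL_FILES):
        with open(os.path.join(work, f"small_{i}.dat"), "rb") as f:
            f.read()
    print(f"{SMALL_FILES} small files: {time.perf_counter() - start:.3f}s")
```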
Managing massive storage for analytics
Here is a list of areas to pay attention to:
1. Capacity planning. Balancing vast amounts of data against a seemingly infinite scale-out infrastructure is not trivial. Capacity needs ongoing attention and planning to optimize cost while avoiding getting caught without enough space (see the projection sketch after this list).
2. Clusters. As clusters of any kind of IT infrastructure grow to hundreds or even thousands of nodes, effective cluster management becomes increasingly important. Patching, provisioning and other tasks become difficult without world-class management.
3. Big data workflows. When designing really effective storage systems, think about data from an end-to-end lifecycle perspective by following the data from sources, to results, to content distribution and consumption (and back again).
4. Data protection. At scale it's even more important to protect data from loss or corruption, and recover from potential disasters. Look for snapshot, replication, backup and DR approaches that could address bigger data stores.
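For the capacity planning item above, even a crude projection beats guessing. In this sketch, the growth rate, utilization ceiling and per-node capacity are all made-up assumptions to be replaced with your own measurements:

```python
# Crude capacity projection for a scale-out cluster.
# Growth rate, utilization ceiling and node size are illustrative assumptions.
import math

def months_until_full(used_tb, usable_tb, monthly_growth=0.05, ceiling=0.80):
    """Months until utilization crosses the planning ceiling (e.g., 80%)."""
    months = 0
    while used_tb / usable_tb < ceiling:
        used_tb *= 1 + monthly_growth
        months += 1
        if months > 240:                  # give up beyond a 20-year horizon
            return None
    return months

def nodes_to_add(used_tb, usable_tb, node_usable_tb=48.0,
                 monthly_growth=0.05, horizon_months=12, ceiling=0.80):
    """Nodes needed to stay under the ceiling for the planning horizon."""
    projected = used_tb * (1 + monthly_growth) ** horizon_months
    shortfall = projected / ceiling - usable_tb
    return max(0, math.ceil(shortfall / node_usable_tb))

# Example: 300 TB used on a 500 TB (usable) cluster growing 5% per month.
print(months_until_full(300, 500))   # months of headroom left
print(nodes_to_add(300, 500))        # nodes to add for the next 12 months
```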
Futures: converged data lakes
Overall, it seems like converged (and hyper-converged) products are the future. The evolving Hadoop architecture is but one example. With the advent of container technologies, we are starting to hear about how traditional storage arrays may soon be able to natively host data-intensive applications.
IT convergence is happening on several levels, including the integration of storage and compute (and networking services) and by mixing diverse data types (e.g., transactional records with unstructured documents and machine data repositories) together to support increasingly complex and demanding applications.
Many vendors today are pushing the idea of an enterprise big data lake in which all relevant corporate data first lands to be captured, preserved and mastered in a scale-out Hadoop cluster. Data from that master repository would then be directly available for shared access by big data analytics applications and users from across the organization.
However, some of the thorniest challenges to the data lake concept are governance and security. It's hard to track exactly what data is in a structured database, much less sunk into a huge unstructured data lake.
That's important not just to help figure out what can be useful in any given analytical scenario, but also to find and perhaps mask things like credit card numbers for compliance reasons. And with multiple modes of access across an ever-changing data lake, how do you control who has access to which data and track who has accessed it? From a data quality perspective, users will want to know which data is most current, where it came from and what exactly about it has been validated.
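As a tiny example of the masking side of that governance problem, a landing pipeline might scan free-text fields for card-like numbers before records settle into the lake. This standalone sketch pairs a simple regex with a Luhn checksum to cut false positives; real data-loss-prevention and governance tooling goes far beyond this:

```python
# Minimal sketch of masking credit-card-like values before data lands in a lake.
# A regex plus a Luhn checksum reduces false positives; production DLP tools
# are far more thorough.
import re

CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13-16 digits, optional separators

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum used by payment card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def mask_cards(text: str) -> str:
    """Replace all but the last four digits of Luhn-valid card numbers."""
    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group(0))
        if luhn_ok(digits):
            return "*" * (len(digits) - 4) + digits[-4:]
        return match.group(0)
    return CARD_RE.sub(_mask, text)

print(mask_cards("order 991 paid with 4111 1111 1111 1111 yesterday"))
# -> order 991 paid with ************1111 yesterday
```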
Back to the cloudy future
We are seeing a resurgence of storage previously associated with HPC-like environments now coming into its own in enterprise data centers to support large-scale analytical processing. Some examples include Panasas' PanFS, Lustre-based arrays (e.g., DDN EXAScaler) and IBM's GPFS, now packaged as IBM Spectrum Scale.
Also, public cloud storage and burstable analytical processing go hand in hand (e.g., AWS S3 and Amazon Elastic MapReduce). Today, many cloud security regimes are better than those of some enterprise data centers, and cloud options can now meet most compliance regulations.
One perceived sticking point with the cloud is the cost and time of moving data into and across clouds. In practice, though, most data for many applications only needs to be moved into the cloud once; from then on, only manageable increments need be migrated (if they aren't produced in the cloud to begin with). Of course, cloud data storage costs will accrue over time, but they can be budgeted.
With software-defined, scale-out storage, converged infrastructure (storage and processing), virtualization, containers and cloud approaches, enterprises are now well-armed to build cost-effective scalable storage to support whatever analytical challenge they take on.