Use cases for object storage ramp up to meet emerging demands

Computational and storage advances are expanding the role of object storage beyond traditional HPC and cloud to emerging data analytics, machine learning and deep learning use cases.

Object storage is the newest of the three primary storage techniques, complementing block volumes and file systems. As the technology has matured, object storage has found a place in new and different use cases beyond its initial strengths.

Object-based storage originated in the 1990s, when high-performance computing researchers sought more scalable alternatives for their massive data sets, and it became popular with the rise of massively scalable cloud services. In describing their motivation, the Google researchers who designed one of the earliest object storage implementations, the Google File System (GFS), detailed design requirements that still resonate after two decades.

Namely, an object storage system should do the following:

  • tolerate frequent component failures in systems spanning hundreds or thousands of nodes through continual monitoring, error detection, fault tolerance and automatic recovery;
  • accommodate massive, multi-terabyte data sets and multi-gigabyte files; and
  • be optimized for files that are predominantly read-only and read sequentially, with "practically nonexistent" random writes and new data appended to files rather than overwriting existing data.

Engineers optimized early forms of object storage like GFS for online service providers. But, over time, the storage requirements for enterprise workloads took on similar characteristics.

The changing face of object storage

Not only are organizations adopting cloud infrastructure and cloud-native application designs, but the following trends show that the enterprise data footprint is evolving in ways that favor object storage:

  • tremendous growth in the volume of unstructured data like text, images, audio and video;
  • a similar expansion in the amount of semistructured data from system logs, email repositories and tagged information like HTML and JavaScript Object Notation documents;
  • the accumulation of vast data repositories that can span hundreds of terabytes or even petabytes;
  • concurrent, real-time storage access by hundreds of users in multiple locations;
  • the increased use of data lakes or other techniques that aggregate data from various sources in different formats; and
  • an accompanying decoupling of data from a particular application, with the same repository used by many workloads.

Each of these enterprise storage trends favors object formats over block or file storage: object storage is inherently scalable, highly distributed and more efficient -- in terms of both space and cost -- and it enables granular security policies with access controls tailored to subsets of a repository. Furthermore, the increasing use of infrastructure-as-a-service resources from AWS, Azure, Google Cloud Platform (GCP) and others -- where object storage services are the most scalable, low-cost option -- encourages the use of on-premises object stores as part of an integrated, hybrid cloud environment.
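
To make the access-control point concrete, here is a minimal sketch, using Python and the boto3 S3 client, of a bucket policy that grants one role read access to a single prefix of a repository. The bucket name, prefix and IAM role ARN are hypothetical.

```python
import json
import boto3

# Hypothetical bucket; any S3-compatible store with policy support works similarly.
BUCKET = "analytics-data-lake"

# Grant a single analytics role read access to one prefix of the repository,
# rather than to the bucket as a whole.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ReadOnlyClickstreamPrefix",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
        "Action": ["s3:GetObject"],
        "Resource": f"arn:aws:s3:::{BUCKET}/clickstream/*",
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```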

Reasons behind the changes

Extreme scalability and an inherently distributed, redundant design drew high-performance computing and cloud researchers to object storage. However, it was the technology's cost advantage over file and block alternatives that garnered the attention of enterprise users and cloud services, with AWS S3, notably, providing the introduction.

As organizations began dabbling with cloud services over the last decade, they sought uses that were both low-risk and easy to implement: backup and archiving became the answer, and cloud object stores the vehicle. Backup and archive remains the bread-and-butter use case and will likely stay the primary enterprise application for object storage for years to come.

The object storage market

There are no widely available public measures tracking the size of the object storage market. By one estimate, it remains small at approximately $4 billion in sales this year, growing 14% annually to $6 billion in 2023. These numbers seem far too low given that EMC predicted Isilon and Atmos would hit $1 billion in sales by 2012 when it acquired Isilon almost a decade ago. Either way, there's little doubt that enterprise object storage remains a far smaller market than that for block or network file storage products.

The big three cloud providers -- AWS, Azure and GCP -- engage in sporadic price wars for commodity services like compute instances and object storage. Economies of scale make it difficult for object storage vendors to compete on cost per gigabyte, so they have taken a page from Marketing 101: When you can't compete on price, focus on features and performance.

Realizing that trying to win a price war with Amazon or Google is a losing proposition for a smaller company, most firms specializing in object storage technology have shifted their focus to AI, machine learning and big data analytics workloads that can exploit faster I/O and new, embedded features. Indeed, the bargain-basement stigma of the term object storage has led some to shift their terminology, emphasizing data platforms, universal storage and distributed data management instead.

Nevertheless, these vendors must acknowledge the dominance of cloud object storage services, particularly S3, among IT teams and developers. Consequently, compatibility with the S3 API has become the baseline upon which companies add data analytics features.
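
This S3-as-baseline compatibility is easy to see from the client side. The following sketch uses Python's boto3 library against a hypothetical S3-compatible endpoint; the same code targets AWS itself if the endpoint_url argument is omitted. The endpoint, credentials and bucket name are placeholders.

```python
import boto3

# The same client code works against AWS or an on-premises object store;
# only the endpoint changes. Endpoint and credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="ml-training-data")

# Upload one object, then enumerate the bucket.
with open("batch-0001.tar", "rb") as f:
    s3.put_object(Bucket="ml-training-data", Key="images/batch-0001.tar", Body=f)

for obj in s3.list_objects_v2(Bucket="ml-training-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```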

State-of-the-art use cases for object storage: Development, analytics, AI

Modern object storage platforms are designed for several emerging use cases:

  • data lakes for streaming data, such as system events and logs, application telemetry, sensor readings, financial transactions, online interactions, social media activity and other metadata;
  • object storage databases for metadata, unstructured content and binary large objects;
  • storage for big data analytics using software like Spark, Flink, Hive and their commercial alternatives;
  • machine and deep learning training data and input streams for analytics using previously trained models;
  • search engine repositories;
  • rich media streaming;
  • persistent data stores for container-based and cloud-native applications; and
  • repositories for software development environments, including source code management, continuous integration and continuous delivery pipelines, issue tracking and documentation.

Because many development tools expect network-mounted file shares, object stores are often also exposed via the NFS and SMB protocols.
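
As a sketch of the analytics scenarios above, the following PySpark snippet reads event logs directly from an S3-compatible object store through the Hadoop S3A connector. The endpoint, credentials and bucket layout are hypothetical, and the cluster must have the hadoop-aws package on its classpath.

```python
from pyspark.sql import SparkSession

# Point Spark's Hadoop S3A connector at an S3-compatible object store.
# Endpoint and credentials are placeholders.
spark = (
    SparkSession.builder.appName("object-store-analytics")
    .config("spark.hadoop.fs.s3a.endpoint", "https://objects.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read JSON event logs straight from the object store and run a simple aggregation.
events = spark.read.json("s3a://data-lake/events/2020/*/*.json")
events.groupBy("event_type").count().show()
```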

Technological advances enabling these changes

Object storage benefits from broader trends in computational and storage technologies. The most significant include the following:

  • Software virtualization of the OS (VM instances), applications (containers) and storage resources (software-defined storage), which interposes an abstraction layer between hardware implementations and applications. By decoupling the storage data and control planes, virtualization enables distributed, scale-out clusters of any size and capacity.
  • Vast increases in the power of general-purpose CPUs that, combined with virtualization, enable clusters of commodity servers to manage vast storage capacity.
  • Continued declines in the price per byte of SATA SSDs and NVMe drives, making them suitable for high-capacity object storage systems that combine massive capacity and high throughput.
  • Commercialization of persistent memory technologies like Optane -- from Intel and Micron Technology -- and magnetoresistive RAM -- from vendors like Everspin Technologies and Avalanche Technology -- which fill the gap between high-density but relatively slow storage using magnetic disks or 3D NAND flash and fast, low-latency, but volatile DRAM caches.

Most object storage products incorporate some, if not all, of these advancements. However, these products typically evolved their core storage control software from older, HDD-based systems. Such a software legacy renders them suboptimal for AI, machine learning and analytics workloads with a mixed I/O pattern of random and sequential reads and writes that also require low latency and high throughput.

In response to shortcomings in existing implementations, Intel and others have developed a new software platform, Distributed Asynchronous Object Storage (DAOS). DAOS is an open source effort to build an object storage system that decouples the data and control planes while also segregating I/O metadata and indexing workloads from bulk data storage.

DAOS implements lightweight protocols designed for NVMe and Optane persistent storage, and has a low-latency, high-throughput messaging interface that bypasses the OS. It stores metadata on fast, persistent memory and bulk data on NVMe SSDs and includes built-in support for big data interfaces, including Hierarchical Data Format version 5, Apache Arrow and Spark.
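
The following toy sketch illustrates the metadata/bulk-data split that DAOS describes. It is not the DAOS API; it is a Python stand-in in which an in-memory dict plays the role of the fast persistent-memory metadata tier and a local directory stands in for NVMe bulk storage.

```python
import hashlib
from pathlib import Path

class TieredObjectStore:
    """Toy two-tier store: fast metadata index, separate bulk-data tier."""

    def __init__(self, bulk_dir: str):
        self.index = {}                  # fast tier: key -> (path, size, checksum)
        self.bulk = Path(bulk_dir)       # slow, dense tier for object payloads
        self.bulk.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        path = self.bulk / hashlib.sha256(key.encode()).hexdigest()
        path.write_bytes(data)           # payload goes to the bulk tier
        self.index[key] = (path, len(data), hashlib.sha256(data).hexdigest())

    def get(self, key: str) -> bytes:
        path, _size, _checksum = self.index[key]  # lookup never touches bulk tier
        return path.read_bytes()

store = TieredObjectStore("/tmp/bulk")
store.put("logs/2020-06-01.json", b'{"event": "login"}')
print(store.get("logs/2020-06-01.json"))
```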

According to Intel, DAOS read and write I/O performance scales almost linearly with an increasing number of client I/O requests -- up to approximately 32 to 64 remote clients -- making it well suited for the cloud and other shared environments. The first significant production deployment of DAOS is Aurora, the U.S. Department of Energy's exascale supercomputer at Argonne National Laboratory.

Bleeding-edge technologies, notably computational storage -- which embeds small, power-efficient processors in individual SSDs -- will eventually make their way into object storage designs, since object storage's scale-out nature is ideally suited to distributing storage processing across hundreds of devices.

[Chart: GigaOm enterprise object storage rating]

Specific vendors and products making it all possible

The primary vendors in the object storage market include, but are not limited to, the following.

Caringo Swarm software and servers. Swarm is a software-defined object storage platform that supports heterogeneous environments and provides a unified namespace exposed via NFS, SMB, AWS S3 and Swarm's native HTTP API.

Cloudian HyperStore object storage exposes an S3-compatible API and both NFS and SMB NAS interfaces via an integrated software-hardware platform available in three models, ranging from 1U and 168 TB up to 4U and 1.5 petabytes using HDDs.

DataDirect Networks Web Object Scaler is an S3-compatible object system that scales to a petabyte of capacity and trillions of objects.

Dell EMC ECS, formerly known as Elastic Cloud Storage, is a set of integrated hardware appliances available in three sizes, ranging from 12 drives of 1 TB to 8 TB each up to 90 12 TB HDDs.

Hitachi Vantara Content Intelligence is a noteworthy supplement to Hitachi's object storage products. The product provides data processing workflows and a library of analytics, extraction, transformation and reporting functions that users can apply to incoming data. Transformed and extracted data can then be forwarded to applications or a storage tier for long-term retention.

IBM Red Hat Ceph Storage is a storage platform that supports S3 and OpenStack object APIs, along with block (iSCSI) and NFS file protocols.

MinIO is open source, cloud-optimized object storage software that supports the S3 API and runs on Kubernetes clusters. MinIO can replace the Hadoop Distributed File System and is designed for analytics and AI workloads, including Spark, Presto, TensorFlow and H2O.ai.
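
For a feel of the developer experience, here is a minimal sketch using MinIO's Python SDK; the endpoint, credentials, bucket and object names are placeholders.

```python
from minio import Minio

# Endpoint and credentials below are placeholders.
client = Minio(
    "minio.example.internal:9000",
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
)

if not client.bucket_exists("training-data"):
    client.make_bucket("training-data")

# Upload a local training shard, then stream it back.
client.fput_object("training-data", "shards/shard-0001.tfrecord", "shard-0001.tfrecord")
response = client.get_object("training-data", "shards/shard-0001.tfrecord")
try:
    print(len(response.read()))
finally:
    response.close()
    response.release_conn()
```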

NetApp StorageGrid is the company's S3-compatible object product. However, NetApp also partners with Nvidia on the NetApp Ontap AI platform, which integrates its AFF A-Series all-flash array (AFA) storage systems, Ontap 9 and AI Control Plane software with Nvidia's DGX A100 AI servers.

Pure Storage FlashBlade AFA provides object storage software and an expandable 4U chassis with 15 hot-pluggable storage modules. Pure also partners with Nvidia to offer the AIRI AI platform that integrates two or more DGX-1 or -2 compute servers with one or two FlashBlades and a converged Ethernet-InfiniBand network fabric.

The Qumulo file system supports multi-cloud deployments on Qumulo storage appliances, qualified third-party hardware from Hewlett Packard Enterprise and Fujitsu, or public cloud infrastructure on AWS or GCP.

Scality Ring is a petabyte-scale software storage control plane for x86 servers that provides both S3 object and file interfaces. The vendor's Zenko software delivers a single management interface for Ring and other object platforms, including S3, Azure Blob, GCP and Ceph.

SwiftStack, recently acquired by Nvidia, is object storage software with policy-based workflows that can, for example, add metadata, labels and tags used to preprocess data for search and analysis. SwiftStack's 1space management software supports multiple clouds, while the 1space File Connector provides a unified namespace across heterogeneous environments.
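
SwiftStack's workflow engine is proprietary, but the underlying idea -- attaching searchable metadata to objects at ingest -- can be sketched with the generic S3 API in Python. The endpoint, bucket and metadata keys here are hypothetical.

```python
import boto3

# Placeholder endpoint and credentials for an S3-compatible object store.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Attach user-defined metadata at ingest so downstream search and analytics
# can filter objects without reading their contents.
with open("scan-0042.dcm", "rb") as f:
    s3.put_object(
        Bucket="medical-imaging",
        Key="scans/2020/scan-0042.dcm",
        Body=f,
        Metadata={"modality": "mri", "study": "trial-7", "anonymized": "true"},
    )

# The metadata comes back on a HEAD request, again without a bulk data transfer.
head = s3.head_object(Bucket="medical-imaging", Key="scans/2020/scan-0042.dcm")
print(head["Metadata"])
```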

Vast Data describes its scale-out storage environment as universal storage that is available as software-only, a hybrid of hardware and container software, or a packaged hardware appliance. Vast's high I/O performance and NAS support make it well suited for machine learning and deep learning workloads.

Evaluation criteria

When evaluating object storage products, buyers should consider several critical factors, as strength in one area might come at the cost of weakness in another.

These include capacity and storage efficiency versus performance, resilience and redundancy versus capacity, and interoperability versus proprietary features such as embedded support for AI and analytics functions. Manageability versus convenience is also important, as is security versus multi-tenant flexibility.

Advancements in object storage software with designs optimized for AFAs and hybrid HDD/SSD hardware have significantly improved performance, while disaggregated control and data planes enable heterogeneous deployments across multiple cloud environments. Together, these developments have made distributed, scale-out object systems the premier storage environment for emerging data analytics, machine learning and deep learning workloads.

Advancements like DAOS, persistent memory support and integrated data processing workflows promise still greater performance and flexibility, opening object storage to even more emerging use cases. Stay tuned.
