
Build a cloud-ready, global distributed file system

Organizations that want to extend traditional network file shares across data centers, branch offices and public cloud infrastructure have many options.

The global distributed file system has been available since the early 1980s, when Carnegie Mellon University developed the Andrew File System. It became one of the first platforms developers built for a new era of networked workstations and servers. Developers refined and productized the concept in the years following. The most recent advancements focused on bridging the gap between private systems in data centers and branch offices, and an organization's IaaS resources in cloud environments.

One of the core features of cloud is a distributed object storage service that can replicate across regions and that admins can access through standard file-sharing protocols by way of a storage gateway. Organizations would now benefit from extending the same geo-redundancy across cloud and on-premises environments.

It is essential to understand the basics of a distributed file or object store. Systems such as the Andrew File System, Ceph and the original Google File System are a form of software-defined storage that separates stored data from metadata through distinct data and control planes. A replicated control server typically manages the metadata, often in a redundant, multi-master configuration. The data is usually split into smaller pieces, or chunks, and spread across multiple storage nodes, often in different data centers.
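The split between control plane and data plane can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the class names, the tiny 4-byte chunk size and the round-robin placement are assumptions, not how Ceph, the Andrew File System or the Google File System actually place data.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; deliberately tiny (assumption for readability)


@dataclass
class MetadataServer:
    """Control plane: maps a file name to an ordered list of (node, chunk ID)."""
    index: dict = field(default_factory=dict)


@dataclass
class StorageNode:
    """Data plane: holds opaque chunks keyed by chunk ID."""
    name: str
    chunks: dict = field(default_factory=dict)


def write_file(name, data, meta, nodes):
    """Split data into chunks, spread them across nodes, record the layout."""
    placements = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        # Round-robin placement; a real system would use a placement policy.
        node = nodes[(offset // CHUNK_SIZE) % len(nodes)]
        node.chunks[chunk_id] = chunk
        placements.append((node.name, chunk_id))
    meta.index[name] = placements


def read_file(name, meta, nodes):
    """Ask the metadata server for the layout, then fetch chunks in order."""
    by_name = {node.name: node for node in nodes}
    return b"".join(by_name[n].chunks[cid] for n, cid in meta.index[name])


nodes = [StorageNode("dc-east"), StorageNode("dc-west")]
meta = MetadataServer()
write_file("report.txt", b"hello world!", meta, nodes)
restored = read_file("report.txt", meta, nodes)
```

In this sketch the metadata server never stores file contents, only the map of which node holds which chunk. That is the property that lets the control plane be replicated independently of the storage nodes.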

Data chunking works well for object storage, but not for traditional file systems -- networked or not. It is possible to access distributed object stores with network file protocols such as NFS and SMB, through a gateway. However, for native file stores, it's more common to create geographic redundancy with the Andrew File System technique that replicates shares and fronts them with a globally unique virtual namespace.

For example, Microsoft Distributed File System separates the logical and physical namespaces. The logical namespace is managed by a global namespace server, which holds metadata and pointers to various secondary root servers. Each of those servers, in turn, holds a pointer to the physical network file shares.
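The idea of a logical namespace that resolves to physical shares can be sketched in a few lines of Python. This is an illustrative model only, not the Microsoft DFS API: the `GlobalNamespace` class, the site labels and the share names are all hypothetical, and the longest-prefix lookup is a simplifying assumption.

```python
class GlobalNamespace:
    """Hypothetical logical namespace: folder path -> list of physical targets."""

    def __init__(self):
        self._links = {}

    def add_link(self, logical, targets):
        """Register one logical folder backed by one or more replicated shares."""
        self._links[logical.rstrip("/")] = list(targets)

    def resolve(self, path, site=None):
        """Translate a logical path into a physical one, preferring the caller's site."""
        best = None
        for logical in self._links:
            # Match whole path components only, and keep the longest match.
            if path == logical or path.startswith(logical + "/"):
                if best is None or len(logical) > len(best):
                    best = logical
        if best is None:
            raise KeyError(f"no namespace link covers {path}")
        targets = self._links[best]
        # Fall back to the first target when no replica is in the caller's site.
        chosen = next((t for t in targets if t["site"] == site), targets[0])
        return chosen["share"] + path[len(best):]


ns = GlobalNamespace()
ns.add_link("/corp/finance", [
    {"site": "nyc", "share": "//nas-nyc/finance"},   # hypothetical NAS share
    {"site": "aws", "share": "s3://corp-finance"},   # hypothetical bucket
])
local_path = ns.resolve("/corp/finance/q1/report.xlsx", site="nyc")
cloud_path = ns.resolve("/corp/finance/q1/report.xlsx", site="aws")
```

Clients only ever see the logical path, so an admin can add, move or retire a physical share by editing the link table rather than remapping every client.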

Almost all product implementations of the global distributed file system use similar techniques. A logical control plane manages a virtual namespace that points to NAS shares or object buckets and pools distributed across storage nodes in multiple locations. To extend a global distributed file system to the cloud is relatively straightforward. Cloud services expose file and object storage using either standard protocols such as NFS and SMB or published APIs such as Amazon S3 and Google Cloud Storage.

Survey products that feature global distributed file systems

Below are summaries of some popular enterprise products with distributed file storage and, where noted, cloud connectivity. Note that this is a sample of representative products, not an exhaustive buyer's guide.

Cloudian HyperFile is a scale-out controller that provides standard NAS file access with a unique namespace to a Cloudian object store. It can access data on and replicate to AWS, Azure and Google Cloud. It supports multiple, geographically distributed controllers. It also includes data protection features such as object and file versioning, asynchronous replication and write-once, read-many drives.

Dell EMC PowerScale is the hardware manifestation of its OneFS scale-out NAS software that provides a clustered file system, volume management, data protection services and a global namespace. PowerScale nodes are standard single- or dual-socket 1U servers with two sets of dual network ports -- one for front-end clients, one for intra-cluster communication -- plus four or eight drives and up to 384 GB RAM. OneFS clusters can scale out to 252 nodes. OneFS includes a load balancer that distributes traffic across the cluster. It also has a smart failover feature that non-disruptively redirects in-flight reads and writes from a failed node to another active node in the cluster.


IBM Spectrum Scale is a parallel file system that can unify SSD, HDD, tape and object storage under a single global namespace and expose data via NAS (NFS, SMB), object (S3, OpenStack Swift), and big data (Hadoop Distributed File System) protocols or APIs. It includes an auto-tiering feature that works with IBM Cloud, AWS S3 or OpenStack Swift object storage. As a distinct software layer, Spectrum Scale works with a variety of servers and OSes, including IBM Power Systems that run AIX. Spectrum Scale is also bundled with IBM's Elastic Storage System, available as a 2U chassis or rack-scale integrated system.

NetApp OnTap is a comprehensive data management platform that works with flash, disk and cloud storage using SAN, NAS and object protocols. It helps move data within and between local clusters and cloud services through a single management interface. It can auto-tier to cloud storage based on data usage and age. It includes a full set of enterprise storage features such as data compression, deduplication, quality of service, snapshots, replication, mirroring and encryption. OnTap provides a global namespace for file volumes that can span on-premises arrays like the NetApp All Flash FAS and cloud services with Cloud Volumes, a set of managed services on AWS, Google Cloud and Azure.

Scality Ring is scale-out file and object software that runs on x86 Linux systems. It offers a high-availability control plane, self-healing infrastructure and petabyte-scale capacity. Its global distributed file system provides a global namespace and multisite, asynchronous replication. It supports NFS and SMB protocols along with the S3 API. Its Zenko multi-cloud controller provides a single namespace across AWS S3, Azure Blob Storage, Google Cloud Storage, Wasabi, Ceph and legacy NAS environments.

Cloud NAS gateways

Another way for admins to build a multi-cloud file system is to place a cloud NAS gateway between on-premises NAS arrays and cloud-native object storage. Such gateways are typically implemented as an on-premises hardware or software appliance that acts as a proxy cache between the NAS and cloud environments.
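The proxy-cache role of a gateway can be sketched as a small read-through, write-through cache in Python. The `CloudNasGateway` class is hypothetical, a plain dictionary stands in for the cloud bucket, and the least-recently-used eviction policy is an assumption. Commercial gateways use their own caching and consistency logic.

```python
from collections import OrderedDict


class CloudNasGateway:
    """Hypothetical gateway: small local cache in front of a cloud object store."""

    def __init__(self, object_store, capacity):
        self.object_store = object_store  # dict-like stand-in for a cloud bucket
        self.capacity = capacity          # maximum number of cached files
        self.cache = OrderedDict()        # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, path):
        """Serve from the local cache when possible, else fetch from the cloud."""
        if path in self.cache:
            self.cache.move_to_end(path)
            self.hits += 1
            return self.cache[path]
        self.misses += 1
        data = self.object_store[path]
        self._store(path, data)
        return data

    def write(self, path, data):
        """Write-through: persist to the cloud first, then keep a local copy."""
        self.object_store[path] = data
        self._store(path, data)

    def _store(self, path, data):
        self.cache[path] = data
        self.cache.move_to_end(path)
        # Evict least recently used entries once the cache is over capacity.
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)


bucket = {"a.txt": b"1", "b.txt": b"2", "c.txt": b"3"}
gateway = CloudNasGateway(bucket, capacity=2)
for name in ["a.txt", "a.txt", "b.txt", "c.txt"]:
    gateway.read(name)
gateway.write("d.txt", b"4")
```

The second read of `a.txt` is served locally, which is the latency benefit a branch office gets from the appliance. The cloud copy remains the authoritative one because every write goes there first.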

Popular products include:

  • Azure FXT Edge Filer -- formerly Avere -- works with Azure Blob Storage. Like other multi-cloud file services, FXT provides a single global namespace.
  • Ctera offers both an edge filer and client agent to bridge local NAS and AWS, Azure, IBM and Dell EMC cloud environments. It provides a global namespace spanning sites and clouds.
  • Nasuni UniFS is something of a hybrid between a multi-cloud file system and a gateway appliance. It provides a software control plane that bridges traditional NAS filers and cloud environments but stores all data on cloud object storage, using edge caching appliances as an intermediary. It supports S3, Azure Blob Storage, Dell EMC Elastic Cloud Storage, Google Cloud Storage, Hitachi object storage, IBM Cloud Object Storage and Western Digital HGST.
  • Panzura Freedom NAS uses a global file system to create scale-out NAS clusters that can expand to multiple data centers, branch offices and cloud environments. It includes data services such as inline global deduplication, compression, encryption and mirroring. It can automatically make data copies on multiple clouds to enhance availability and durability. Panzura caching appliances can run as cloud instances on Amazon Elastic Compute Cloud, Azure, Google Cloud and IBM Cloud or on local VMware ESXi hosts. Panzura also sells three hardware appliances with local caches of 7 to 28 TB that can simultaneously host 5,000 users. Its proprietary distributed file-locking technology guarantees data write-order consistency across geographic boundaries and multiple users, and it maintains data and metadata state throughout multiuser transactions.
