Alluxio adds support for more than 1 billion files

An update to Alluxio's data management software boosts scalability to more than 1 billion files, adds a POSIX API and enables a job service for replication and data transfer.

Carol Sliwa

Published: 14 Mar 2019

Startup Alluxio is previewing version 2.0 of its open source storage software that it claims can support access to more than 1 billion files across private and public clouds.

The Alluxio software layer sits between the application servers and the back-end file and object storage systems. The distributed file system facilitates access to stored data through a global namespace. Alluxio caches hot data in memory, within close proximity of the applications, to accelerate access.

Alluxio software was jointly developed by researchers at the University of California, Berkeley, and MIT. The company was known as Tachyon until 2015.

Alluxio claimed early versions of its software supported access to 200 million or 300 million files. The new 2.0 version adds the option to tier file metadata management to both memory and disk and enables Alluxio to manage more than 1 billion files without hitting memory limits, said Dipti Borkar, the startup's vice president of product and marketing.
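The idea of tiering metadata across memory and disk can be illustrated with a toy sketch. This is not Alluxio's implementation; the class name, the SQLite-backed disk tier and the least-recently-used eviction policy are all assumptions chosen for illustration. It shows how a bounded in-memory tier can spill older entries to disk so the total entry count is not limited by RAM.

```python
import sqlite3
from collections import OrderedDict


class TieredMetadataStore:
    """Toy two-tier metadata store (hypothetical, not Alluxio's design).

    Hot entries live in a bounded in-memory map; the least recently used
    entries spill to a SQLite table that stands in for the disk tier.
    """

    def __init__(self, db_path=":memory:", memory_capacity=2):
        # Pass a file path as db_path for a tier that actually lives on disk.
        self._hot = OrderedDict()
        self._capacity = memory_capacity
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS meta (path TEXT PRIMARY KEY, size INTEGER)"
        )

    def put(self, path, size):
        # The memory tier always holds the newest value, so drop any stale disk copy.
        self._db.execute("DELETE FROM meta WHERE path = ?", (path,))
        self._hot[path] = size
        self._hot.move_to_end(path)
        self._spill_if_needed()

    def get(self, path):
        if path in self._hot:
            self._hot.move_to_end(path)  # mark as recently used
            return self._hot[path]
        row = self._db.execute(
            "SELECT size FROM meta WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            return None
        self.put(path, row[0])  # promote the entry back into memory
        return row[0]

    def in_memory(self):
        """Paths currently held in the memory tier, oldest first."""
        return list(self._hot)

    def _spill_if_needed(self):
        # Move least recently used entries to the disk tier when memory is full.
        while len(self._hot) > self._capacity:
            old_path, old_size = self._hot.popitem(last=False)
            self._db.execute(
                "INSERT OR REPLACE INTO meta VALUES (?, ?)", (old_path, old_size)
            )
```

In this sketch, lookups for spilled entries still succeed; they cost a disk read and push another entry out of memory.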

Scalability is critical for Alluxio's customer base, which includes top telecom, retail and internet companies, such as Alibaba, China Unicom, JD.com and Tencent. Baidu uses Alluxio to speed interactive data queries across a 200-plus node deployment to gain insight into its products and business. Another Chinese customer, online travel information provider Qunar, uses real-time machine learning to deliver web ads, according to Alluxio.

Machine learning, analytics are top use cases

Alluxio CEO Haoyuan Li said most customers use the company's software for machine learning, and about 25% support analytics workloads. Typical applications on the front end include Apache Spark, Presto and TensorFlow. Common storage back ends under Alluxio include the Hadoop Distributed File System (HDFS), NetApp NAS and object storage such as Amazon Simple Storage Service, Dell EMC Elastic Cloud Storage, Hitachi Content Platform and IBM Cloud Object Storage.

Customer deployments generally range from 10 to 1,500 nodes, with tens of petabytes of data in the back-end file or object storage, Li said. Alluxio ultimately aims to enable customers to scale to more than 5,000 or 10,000 nodes, he said.

The Alluxio software supports server-side API translation to enable applications to access data from any file or object storage. With the 2.0 release, Alluxio's Filesystem in Userspace, or FUSE, feature supports a POSIX-compatible API. That enables machine learning and deep learning frameworks such as TensorFlow and Caffe, along with other Python-based tools, to access data directly from any storage system.
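In practice, a POSIX-compatible mount means a program can use ordinary file calls with no storage-specific client library. The following is a minimal sketch under that assumption. The mount point /mnt/alluxio-fuse, the ALLUXIO_FUSE_MOUNT environment variable and the helper names are hypothetical; the article does not specify them.

```python
import os
from pathlib import Path

# Hypothetical mount point; override it with an environment variable if needed.
ALLUXIO_FUSE_MOUNT = Path(os.environ.get("ALLUXIO_FUSE_MOUNT", "/mnt/alluxio-fuse"))


def list_files(root: Path, suffix: str = "") -> list[Path]:
    """Walk a directory tree with standard calls and return matching files."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.name.endswith(suffix)
    )


def read_bytes(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """Read a file in fixed-size chunks using the built-in open()."""
    chunks = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            chunks.append(chunk)
    return b"".join(chunks)
```

A training script could then call list_files(ALLUXIO_FUSE_MOUNT / "dataset", ".csv") and pass each result to read_bytes, exactly as it would for a local directory.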

Alluxio 2.0 supports unified access to different versions of HDFS to assist enterprises that run multiple Hadoop clusters. The new version also integrates with HDFS inotify to enable the Alluxio software to proactively update the metadata and data it manages whenever there are changes to the persistent storage system.

New features for hyperscale workloads

For hyperscale workloads, Alluxio 2.0 adds a fault-tolerant, high-availability mode for file and object metadata. Adaptive replication lets users make priority data more readily accessible across the compute cluster, and customers can configure how many data copies they want Alluxio to manage automatically.
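A configurable copy count boils down to comparing the number of copies that exist with the range the user asked for. The sketch below illustrates that decision only; the function name and the minimum/maximum parameters are assumptions and do not reflect Alluxio's actual configuration or job service.

```python
def copies_to_adjust(current: int, min_copies: int, max_copies: int) -> int:
    """Return how many copies to add (positive) or remove (negative).

    Hypothetical helper: a result of 0 means the current count is already
    within the configured range.
    """
    if min_copies < 0 or max_copies < min_copies:
        raise ValueError("expected 0 <= min_copies <= max_copies")
    if current < min_copies:
        return min_copies - current  # too few copies, replicate more
    if current > max_copies:
        return max_copies - current  # too many copies, evict the surplus
    return 0
```

For example, with a configured range of two to four copies, a file that has one copy would need one more, and a file that has six would need two removed.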

China-based online retailer Vipshop has used Alluxio software since 2016 to speed access to its HDFS-based data. Its engineers allocate a portion of the memory in each of the company's 27 Apache Spark servers for Alluxio to manage and use as a cache for faster data access.

Wanchun Wang, the chief architect based in San Jose, Calif., who manages Vipshop's big data team, said Vipshop has lots of old servers, and it was far easier for Alluxio to tap their internal memory than it would have been to set up a cluster of machines with faster SSDs. He estimated a 100% performance improvement using Alluxio on the disk-based servers.

"But the main benefit is the access to the remote data," Wang said. "The data lake is a huge jungle. So many things can impact the response time. The inconsistency hurt the most. With Alluxio as the cache, the response is consistent."

Li said he expects the Alluxio 2.0 release to become generally available in the second or third quarter. Alluxio offers a free Alluxio Community Edition of the open source software and the Alluxio Enterprise Edition that adds cloud features, such as enterprise security and data compression. The Alluxio Open Source Project has more than 1,000 contributors, Li said.
