parallel file system
What is a parallel file system?
A parallel file system is a software component designed to store data across multiple networked servers. It facilitates high-performance access through simultaneous, coordinated input/output (I/O) operations between clients and storage nodes.
Parallel file system implementations can span thousands of server nodes and manage petabytes or exabytes of data. Users typically deploy high-speed networking, such as Fast Ethernet, InfiniBand and proprietary technologies, to optimize the I/O path and enable greater bandwidth.
How does a parallel file system work?
Parallel file systems break up a data set and distribute, or stripe, the blocks to multiple storage drives that are located in local and remote servers. Users don't need to know the physical location of the data blocks to retrieve a file. Systems use a global namespace to facilitate data access. These systems often use a metadata server to store information about the data, such as the file name, location and owner.
A parallel file system reads and writes data to distributed storage devices using multiple I/O paths concurrently, as part of one or more processes of a computer program. The coordinated use of multiple I/O paths can provide a significant performance benefit, especially when streaming workloads that involve many clients.
Capacity and bandwidth can be scaled to accommodate enormous quantities of data and different data center needs. Storage features include high availability, mirroring, replication and snapshots.
Common use cases of parallel file systems
Parallel file systems tend to target high-performance computing (HPC) environments that require access to large files, massive amounts of data or simultaneous access from multiple compute servers.
Users of parallel file systems include national laboratories, government agencies and universities, as well as industries such as financial services, life sciences, manufacturing, media, entertainment, and oil and gas.
Applications include the following:
- Climate modeling.
- Computer-aided engineering.
- Exploratory data analysis.
- Financial modeling.
- Genomic sequencing.
- Machine learning and artificial intelligence.
- Seismic data processing.
- Video editing and visual effects rendering.
Parallel file system vs. distributed file system
A parallel file system is a type of distributed file system. Both distributed and parallel file systems can spread data across multiple storage servers, scale to accommodate petabytes of data and support high bandwidth.
Distributed file systems typically support a shared global namespace, as parallel file systems do. But, with a distributed file system, all client systems accessing a given portion of the namespace generally go through the same storage node to access the data and metadata, even if parts of the file are stored on other servers. With a parallel file system, the client systems have direct access to all the storage nodes for data transfer without having to go through a single coordinating server.
Additional distinctions between parallel and distributed file systems include the following:
- A distributed file system generally uses a standard network file access protocol, such as Network File System or Server Message Block, to access a storage server. A parallel file system requires the installation of client-based software drivers to access the shared storage system usually via high-speed Ethernet, InfiniBand and Omni-Path networks.
- A distributed file system often stores a file on a single storage node, whereas a parallel file system breaks up the file and stripes the data blocks across multiple storage nodes.
- Distributed file system deployments store data on the application servers or centralized servers, while typical parallel file system deployments separate the compute and storage servers for performance reasons.
- Distributed file systems target loosely coupled, data-heavy applications or active archives. Parallel file systems focus on high-performance workloads that can benefit from coordinated I/O access and significant bandwidth.
- Distributed file systems use techniques such as three-way replication and erasure coding to provide fault tolerance in the software, whereas many parallel file systems run on a shared storage platform.
Pros and cons of parallel file systems
On the positive side, parallel file systems can support HPC, data replication and scale-out storage deployments. They're also important tools for disaster recovery events since data can be stored in multiple locations for rapid retrieval and recovery.
Conversely, a high-performance system often results in increased complexity and administrative tasks. It can be more challenging to maintain a parallel system, and activities associated with system upgrades are often complicated.
Examples of parallel file systems
Open source parallel file systems identified by industry experts and TechTarget research include the following:
- Parallel Virtual File System. PVFS is an open source file system for Linux-based clusters. It was developed by the Parallel Architecture Research Laboratory at Clemson University and the Mathematics and Computer Science Division at Argonne National Laboratory. PVFS is based on Vesta, which was developed at IBM's T.J. Watson Research Center. The current version, PVFS2, was released in 2003.
- OrangeFS. An open source parallel file system targeting parallel computation environments, OrangeFS is a branch of PVFS to support a broader range of use cases and features.
- Lustre. This open source, object-based parallel file system has file regions that vary in length and static metadata for information distribution. The Lustre file system supports a range of Linux distributions and offers features such as scalability of metadata servers, an online consistency checker and quality of service.
Find out more about the key features in distributed file systems and why they could be important to you.