This article covers the basics of parallel file systems: what they are and how they work. It explains file systems that support data redundancy and high-performance computing, and includes use cases where a parallel file system is an advantage. It also surveys file systems available in the marketplace and looks at some real-world deployments.
What Is A Parallel File System?
A parallel file system is a specialized type of clustered file system, where the physical storage medium is composed of storage devices on multiple servers. When the file system receives data for storage, the system distributes the data across multiple storage nodes. Many parallel file systems also replicate each piece of data on physically distinct nodes. This replication makes the system more fault-tolerant and provides data redundancy, should any storage node fail.
This distribution of data also improves performance when the data is read back. Instead of reading a single source file on a single storage node, the system simultaneously retrieves data from multiple nodes. Instead of a serial read, each part of the file is read in parallel, hence the file system's name. This approach gives the system much faster response times, especially for large files.
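To make the striping and replication described above concrete, here is a minimal sketch in Python. It is purely illustrative, not how any real parallel file system is implemented: the node names, chunk size, and replica count are hypothetical, and "fetching" a chunk simply returns it. It shows how a file can be split into chunks placed round-robin across nodes, with each replica on a distinct node, and then reassembled with a parallel read.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster layout for illustration only.
NODES = ["node0", "node1", "node2", "node3"]
CHUNK_SIZE = 4   # bytes per chunk (tiny, to keep the example readable)
REPLICAS = 2     # copies of each chunk, each on a distinct node

def stripe(data: bytes):
    """Split data into chunks and assign each chunk to REPLICAS distinct nodes."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    placement = []
    for idx, chunk in enumerate(chunks):
        # Round-robin primary placement; replicas go on the next nodes over.
        targets = [NODES[(idx + r) % len(NODES)] for r in range(REPLICAS)]
        placement.append((chunk, targets))
    return placement

def parallel_read(placement):
    """Fetch all chunks concurrently and reassemble them in order."""
    with ThreadPoolExecutor() as pool:
        # map preserves input order, so the file reassembles correctly.
        parts = pool.map(lambda entry: entry[0], placement)
    return b"".join(parts)

layout = stripe(b"parallel file systems stripe data")
assert parallel_read(layout) == b"parallel file systems stripe data"
```

In a real system the chunk size is much larger (often megabytes), placement accounts for node capacity and failure domains, and reads fall back to a replica when a node is down; this sketch only captures the basic idea of striping plus parallel retrieval.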
We’ve already touched on increased performance and data redundancy. These are the primary advantages of using a parallel file system. They are especially useful in high-performance computing (HPC), of which we’ll share some impressive examples below.
Any system that requires fast and unfettered access to data, especially large amounts of data, can benefit from using a parallel file system. Examples include Big Data analysis, machine learning (ML), and artificial intelligence (AI). In each of these cases, high-performance access to large datasets is critical to the system’s effectiveness.
Parallel file systems are also highly scalable, and multiple clients can access the data simultaneously.
One of the historical downsides of parallel file systems is the handling of the metadata for each file. Modern implementations of this model have overcome this by establishing a dedicated metadata server (MDS) within the storage cluster. The MDS holds the metadata for each file in the system, including where each part of the file is stored.
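The role of a metadata server can be sketched as a simple index. This is an illustrative assumption, not the API of any real MDS: the class and method names are hypothetical. The point is that a client asks the MDS once for a file's layout, then reads the chunks directly from the storage nodes in parallel.

```python
# Hypothetical sketch of an MDS: it maps each file name to the ordered
# list of (node, chunk id) locations that make up the file.
class MetadataServer:
    def __init__(self):
        self._index = {}  # file name -> list of (node, chunk_id)

    def register(self, name, locations):
        """Record where each chunk of a file lives."""
        self._index[name] = list(locations)

    def lookup(self, name):
        # A real MDS also tracks permissions, sizes, timestamps, and more.
        return self._index[name]

mds = MetadataServer()
mds.register("results.dat", [("node0", 0), ("node1", 1), ("node2", 2)])
print(mds.lookup("results.dat"))
```

Separating this lookup from the data path is what lets the storage nodes serve raw chunks at full speed while the MDS handles the bookkeeping.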
Cost is another factor that needs consideration. Parallel file systems are generally a cost-effective way of achieving high performance and data redundancy, but they are more expensive than standard storage solutions. Modern implementations take care of much of the complexity involved in configuring and managing the system; however, you will also need trained personnel available to support it, which adds to ongoing costs.
Parallel File Systems to Investigate
One of the oldest and most widely used systems is Lustre. Lustre was initially released at the end of 2003 and is currently available under the GNU General Public License. It supports everything from small private clusters to many of the world’s fastest supercomputers.
Another system is BeeGFS, which was developed at the Fraunhofer Center for High-Performance Computing in 2005 and was formerly known as FhGFS. BeeGFS is available as a free community edition and an Enterprise edition, which includes support. Like Lustre, it supports a range of installations from small clusters to massive supercomputers, although, given its relatively young age, it lacks the breadth of adoption of Lustre.
The technology landscape has shifted in recent years toward public cloud offerings like AWS and Google Cloud, at the same time that applications have become more performance- and data-intensive. Consequently, there is increased demand for managed offerings that provide access to a parallel file system without the barrier to entry of dedicated infrastructure and specialized personnel.
Real-World Examples Powered By Parallel File Systems
If high-performance computing interests you, the Top 500 List may be a list with which you’re already familiar. Twice a year, it ranks the highest-performing supercomputers in the world. It should be no surprise that the top performers leverage the power of parallel file systems as part of their architecture.
At the time of this writing, the most recent list was published in June of 2020 and included a new top system from Japan. Fugaku achieved over 1,000 petaflops on mixed-precision benchmarks, making it the first system to enter exaflop territory. That’s 10^18, or one quintillion, floating-point operations every second. As part of Fugaku’s impressive high-performing architecture, it employs a Lustre-based file system.
The second supercomputer on the list – and currently the fastest in the United States – is Summit, located in Tennessee. Summit has Non-Volatile Memory Express (NVMe) storage devices on each of its nodes to support its performance. The NVMe devices serve as a cache of sorts between the compute nodes and the underlying file system. Summit uses a parallel file system supported by IBM Spectrum Scale, formerly known as the General Parallel File System (GPFS).
The highest-performing computers in the world use parallel file systems for fast, fault-tolerant data storage. As the volume of data we need to process increases, and as we build increasingly complex machine learning and AI systems, we need data storage that can keep pace with the computing power required.
Open-source systems like Lustre offer best-in-class performance and can be deployed on a range of hardware configurations from small clusters to the world’s fastest supercomputers.