Super saver

By Joab Jackson

February 23, 2007

DOE leads race to build fast, global file system.

Save often, especially when you run a supercomputer.

For Gary Grider, group leader of Los Alamos National Laboratory's High Performance Computing Systems Integration Group, saving data is of the highest importance. He is part of a team that is developing what may be the world's fastest supercomputer, a petascale machine named Roadrunner with more than 32,000 processors. IBM Corp. is leading the effort.

Simulating nuclear-weapon degradation can take months to run. A single failed processor, a statistical probability given the huge number of CPUs used, would corrupt the work. So the lab must save often, just as you would do with your PC. But in this case, the procedure involves frequently saving terabytes of data as quickly as possible, which is no small feat.

That's why Los Alamos specified that data must be able to flow from the processors to the storage arrays at an unprecedented 50 gigabits/sec, far beyond the capability of any single storage cluster. Running multiple storage arrays in parallel would do the trick, but that approach requires advanced techniques for coordinating the storage and management of data.

Roadrunner isn't alone in facing this challenge. "You can easily put a lot of CPU power in the room, but to do useful work, you also need very good I/O," said Mike Gigante, Silicon Graphics Inc.'s engineering director of file-serving technologies. "Unfortunately, many people don't think about the I/O until the CPU is set up, and they realize that the overall utilization efficiency of their computer is very low."

Managing a computer's data is the job of the file system, and industry is working with agencies and volunteer organizations on a new generation of file systems, often called global or parallel file systems, that can support machines such as Roadrunner. The challenge is picking the right one for the job.

In many ways, Energy Department laboratories have been a driving force behind the development of global file systems.
In 1994, DOE labs banded together to develop Lustre, a file system designed specifically for supercomputer deployments. "We didn't see anyone out there who had what we wanted," Grider said.

In short, a file system is a data structure for storing files on disk, wrote Andrew Tanenbaum and Albert Woodhull in "Operating Systems: Design and Implementation."

A global file system must provide a means of keeping track of data across multiple storage arrays. For large implementations, simply buying a number of independent arrays only leads to confusion because users must remember which array holds their information, and swapping information between arrays is difficult.

"You definitely want all [data] shared across all the nodes in a cluster, so all the nodes see the same data, can read the data and write the data to common places," Grider said. But functions that help find data should not hamper speed of access, which is a difficult problem to tackle.

Lustre attacks that problem, and many supercomputer systems still use it. Cluster File Systems Inc., based in Boulder, Colo., now manages the Lustre code base. Lustre can be used in applications that need to move hundreds of megabytes per second.

Supercomputer manufacturer Cray Inc., based in Seattle, plans to use Lustre for the internal file system of its next-generation X1 supercomputer, nicknamed Black Widow, said Peter Rigsbee, the company's product marketing manager. Lustre also worked well in Cray's Red Storm XT3/4 series of supercomputers, whose nodes do not have full operating systems. "Since Lustre is an open-source product, it is far more malleable," Grider said.

However, Lustre might not be suitable for all implementations. "Lustre has some wonderful attributes, but it is pretty complicated; it is difficult to set up and difficult to maintain," one industry veteran said. Few Lustre experts are out there, and tools are fairly rudimentary.

DOE labs use a mixture of the major global file systems, Grider said.
The Lawrence Livermore and Sandia national laboratories use Lustre. Livermore also uses a file system developed by IBM for its systems, named the General Parallel File System (GPFS). The Argonne National Laboratory and the Ohio Supercomputer Center are developing the Parallel Virtual File System (pVFS), an open-source parallel file system best suited for small clusters.

For the Roadrunner supercomputer project, Los Alamos chose Panasas Inc.'s ActiveScale File System, which will run on the ActiveScale Storage Cluster made by Panasas, based in Fremont, Calif.

"We do things in parallel," said Larry Jones, vice president of marketing at Panasas. "A typical network appliance handles things serially. Basically, you send in a request, it works really hard to process that request and send [the data] back as fast as it can." Panasas' approach to speeding the exchange is to break the job into multiple portions that the system can execute simultaneously.

The Network File System, originally developed by Sun Microsystems Inc., has long been the norm in Unix-based network environments. It is easy to set up and reliable. An NFS storage server, on average, can offer 500 megabits/sec transfer rates. It is the basis for the network-attached storage systems offered by Network Appliance Inc., based in Sunnyvale, Calif.

Although it is good in normal network storage implementations, NFS has a reputation for not scaling well to supercomputing environments. An environment that needs throughputs faster than 500 megabits/sec must set up multiple arrays and spread out the data among them. Companies have developed software to aggregate multiple NFS servers into a common pool, but that approach tends to create a bottleneck because all requests go through a single server.

Because NFS is well-known in the network administrator community, various initiatives are under way to boost its output.
The University of Michigan, with some DOE funding, is developing an extension to NFS called NFS Remote Direct Memory Access. NFS RDMA promises to break the bottleneck by eliminating some of the work a server does when moving files, Gigante said.

Another NFS enhancement in development, called Parallel NFS, promises even greater throughput. Like NFS RDMA, pNFS is a planned extension to Version 4 of NFS. It would solve the bottleneck problem by establishing parallel file services. In essence, data can be spread out across multiple servers, according to an Internet draft on pNFS written by Garth Gibson, Panasas' chief technology officer, and Peter Corbett, an engineer at Network Appliance.

Most vendors say pNFS is the way forward, although researchers note that it is less mature than NFS RDMA and at least two years from commercial deployment.

Today's supercomputer designers can choose a variety of solutions. Paul Buerger, who leads systems and operations at the Ohio Supercomputer Center, said the center has been experimenting with distributed, parallel file systems including GPFS, Lustre and pVFS. The center provides supercomputing power to universities and businesses in Ohio and the surrounding region.

"Each of these file systems has its advantages and disadvantages," he said. "None of them has yet distinguished itself as the answer to all I/O issues in supercomputing."
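
The core idea behind parallel file services, spreading data across multiple servers so that no single server handles every request, can be illustrated with a toy sketch. Everything in it is a hypothetical illustration: the function names, the stripe size and the in-memory lists standing in for servers are assumptions, and it does not represent how pNFS, PanFS, Lustre or any other product is actually implemented. The last function does the back-of-the-envelope arithmetic with the article's figures (a 50 gigabits/sec target and roughly 500 megabits/sec per NFS server), assuming 1 gigabit equals 1,000 megabits and perfect scaling.

```python
import math


def stripe(data: bytes, num_servers: int, stripe_size: int) -> list[list[bytes]]:
    """Deal fixed-size chunks of data round-robin across num_servers lists.

    Each inner list is a stand-in for one storage server (an assumption).
    """
    if num_servers < 1 or stripe_size < 1:
        raise ValueError("num_servers and stripe_size must be positive")
    servers: list[list[bytes]] = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), stripe_size):
        chunk_index = offset // stripe_size
        servers[chunk_index % num_servers].append(data[offset:offset + stripe_size])
    return servers


def reassemble(servers: list[list[bytes]]) -> bytes:
    """Rebuild the original bytes by reading chunks back in round-robin order."""
    out = bytearray()
    longest = max((len(chunks) for chunks in servers), default=0)
    for row in range(longest):
        for chunks in servers:
            if row < len(chunks):
                out += chunks[row]
    return bytes(out)


def servers_needed(target_mbps: float, per_server_mbps: float) -> int:
    """Minimum number of servers to reach a target rate, assuming ideal scaling."""
    return math.ceil(target_mbps / per_server_mbps)


if __name__ == "__main__":
    data = b"checkpoint written by many processors"
    placed = stripe(data, num_servers=4, stripe_size=4)
    assert reassemble(placed) == data
    print(servers_needed(50_000, 500))  # prints 100
```

A real parallel file system also has to record which server holds which stripe and keep that record consistent under failures; that bookkeeping, the hard part the article describes, is omitted here.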

A guide to global file systems

CXFS (Clustered XFS): CXFS is an extension of Silicon Graphics Inc.'s XFS file system, which was developed for the SGI IRIX operating system. CXFS is optimized for large computer clusters that work together on a single-system image, such as the NASA Ames Research Center's Columbia supercomputer, where it is deployed. (www.sgi.com/products/storage/tech/file_systems.html)

GPFS (General Parallel File System): GPFS is a file system for clustering that IBM developed. First developed for IBM's AIX Unix operating system, GPFS now also works for Linux implementations. It can support more than 1,000 disks within a single file system. (www.ibm.com/systems/clusters/software/gpfs.html)

GFS (Global File System): GFS is an open-source file system developed for Linux clusters. Red Hat, based in Raleigh, N.C., incorporates GFS into its Red Hat Enterprise Linux operating system. (sources.redhat.com/cluster/gfs)

NFS RDMA (NFS Remote Direct Memory Access): NFS RDMA fuses the widely used Network File System with the RDMA protocol, which can be used to offload work from server CPUs to network cards, thereby increasing the potential amount of data that can be downloaded. It is being incorporated into Version 4 of NFS. (sourceforge.net/projects/nfs-rdma)

PanFS (Panasas ActiveScale File System): An object-based file system for Linux clusters used with storage arrays from Panasas, based in Fremont, Calif. (www.panasas.com/panfs.html)

pNFS (Parallel NFS): A version of NFS, pNFS facilitates the storage of data across multiple storage arrays, which could speed downloads and writes of large data sets. It is being incorporated into Version 4 of NFS. (www.pdl.cmu.edu/pNFS)

pVFS (Parallel Virtual File System): pVFS is a dedicated open-source file system for parallel environments. The Energy Department, NASA and the National Science Foundation are funding development. (www.parl.clemson.edu/pvfs)

ZFS (Zettabyte File System): ZFS is the next-generation file system that Sun Microsystems developed as a successor to the Unix File System (UFS) used in Solaris. It is packaged in the Sun Solaris 10 operating system. ZFS is the first file system built to support 128-bit addressing, allowing systems to manage a virtually unlimited amount of storage. (www.opensolaris.org/os/community/zfs)
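
To give a sense of what "virtually unlimited" means for 128-bit addressing, the short sketch below compares the number of distinct addresses available with 64 and 128 bits. The function name is hypothetical, and treating each address as one unit of storage is a simplifying assumption made for illustration, not a description of ZFS internals.

```python
def addressable_units(bits: int) -> int:
    """Number of distinct addresses expressible with the given number of bits."""
    if bits < 0:
        raise ValueError("bits must be non-negative")
    return 2 ** bits


if __name__ == "__main__":
    print(addressable_units(64))
    print(addressable_units(128))
    # How many times larger the 128-bit space is than the 64-bit space.
    print(addressable_units(128) // addressable_units(64))
```

Under that assumption, a 128-bit address space is larger than a 64-bit one by a factor of 2 to the 64th power.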

Joab Jackson is an assistant managing editor at Government Computer News. He can be reached at jjackson@1105govinfo.com.
