DOE leads race to build fast, global file system
- By Joab Jackson
- Feb 23, 2007
Save often, especially when you run a supercomputer.
For Gary Grider, group leader of Los Alamos National Laboratory's High Performance Computing Systems Integration Group, saving data is of the highest importance. He is part of a team that is developing what may be the world's fastest supercomputer, a petascale machine named Roadrunner with more than 32,000 processors. IBM Corp. is leading the effort.
Simulating nuclear-weapon degradation can take months to run. A single failed processor, a statistical probability given the huge number of CPUs used, would corrupt the work. So the lab must save often, just as you would do with your PC. But in this case, the procedure involves frequently saving terabytes of data as quickly as possible, which is no small feat.
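The save-often idea described above is commonly called checkpointing. What follows is a minimal sketch of it, not the lab's actual software: the function names, the pickle-based file format and the toy "simulation" that merely sums step numbers are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then rename it over `path`.

    The rename via os.replace means a crash during the write leaves the
    previous checkpoint intact (temp file is created in the same directory).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_simulation(total_steps, checkpoint_every, path, fail_at=None):
    """Toy long-running job (hypothetical): sums the step numbers.

    `fail_at` simulates a processor failure by raising at that step.
    On restart, the job resumes from the most recent checkpoint.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated processor failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If a run that checkpoints every 10 steps fails at step 57, calling `run_simulation` again with the same path resumes from step 50 instead of step 0. The trade-off is the one the article describes: more frequent checkpoints lose less work but demand more I/O bandwidth.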
That's why Los Alamos specified that data must be able to flow from the processors to the storage arrays at an unprecedented 50 gigabits/sec, far beyond the capability of any single storage cluster. Running multiple storage arrays in parallel would do the trick, but that approach requires advanced techniques for coordinating the storage and management of data.
Roadrunner isn't alone in facing this challenge. "You can easily put a lot of CPU power in the room, but to do useful work, you also need very good I/O," said Mike Gigante, Silicon Graphics Inc.'s engineering director of file-serving technologies. "Unfortunately, many people don't think about the I/O until the CPU is set up, and they realize that the overall utilization efficiency of their computer is very low."
Managing a computer's data is the job of the file system, and industry is working with agencies and volunteer organizations on a new generation of file systems, often called global or parallel file systems, that can support machines such as Roadrunner. The challenge is picking the right one for the job.
In many ways, Energy Department laboratories have been a driving force behind the development of global file systems. In 1994, DOE labs banded together to develop Lustre, a file system designed specifically for supercomputer deployments. "We didn't see anyone out there who had what we wanted," Grider said.
In short, a file system is a data structure for storing files on disk, wrote Andrew Tanenbaum and Albert Woodhull in "Operating Systems: Design and Implementation."
A global file system must provide a means of keeping track of data across multiple storage arrays. For large implementations, simply buying a number of independent arrays only leads to confusion because users must remember which array holds their information, and swapping information between arrays is difficult.
"You definitely want all [data] shared across all the nodes in a cluster, so all the nodes see the same data, can read the data and write the data to common places," Grider said. But functions that help find data should not hamper speed of access, which is a difficult problem to tackle.
Lustre attacks that problem, and many supercomputer systems still use it. Cluster File Systems Inc., based in Boulder, Colo., now manages the Lustre code base. Lustre can be used in applications that need to move hundreds of megabytes per second.
Supercomputer manufacturer Cray Inc., based in Seattle, plans to use Lustre for the internal file system of its next-generation X1 supercomputer, nicknamed Black Widow, said Peter Rigsbee, the company's product marketing manager. Lustre also worked well in Cray's Red Storm XT3/4 series of supercomputers, whose nodes do not have full operating systems. "Since Lustre is an open-source product, it is far more malleable," Grider said.
However, Lustre might not be suitable for all implementations. "Lustre has some wonderful attributes, but it is pretty complicated; it is difficult to set up and difficult to maintain," one industry veteran said. Few Lustre experts are out there, and tools are fairly rudimentary.
DOE labs use a mixture of the major global file systems, Grider said. The Lawrence Livermore and Sandia national laboratories use Lustre. Livermore also uses a file system developed by IBM for its systems, named the General Parallel File System (GPFS). The Argonne National Laboratory and the Ohio Supercomputer Center are developing the Parallel Virtual File System (pVFS), an open-source parallel file system best suited for small clusters.
For the Roadrunner supercomputer project, Los Alamos chose Panasas Inc.'s ActiveScale File System, which will run on the ActiveScale Storage Cluster made by Panasas, based in Fremont, Calif.
"We do things in parallel," said Larry Jones, vice president of marketing at Panasas. "A typical network appliance handles things serially. Basically, you send in a request, it works really hard to process that request and send [the data] back as fast as it can." Panasas' approach to speeding the exchange is to break the job into multiple portions that the system can execute simultaneously.
The Network File System, originally developed by Sun Microsystems Inc., has long been the norm in Unix-based network environments. It is easy to set up and reliable. An NFS storage server, on average, can offer 500 megabits/sec transfer rates. It is the basis for the network-attached storage systems offered by Network Appliance Inc., based in Sunnyvale, Calif.
Although it is good in normal network storage implementations, NFS has a reputation for not scaling well to supercomputing environments. An environment that needs throughputs faster than 500 megabits/sec must set up multiple arrays and spread out the data among them. Companies have developed software to aggregate multiple NFS servers into a common pool, but that approach tends to create a bottleneck because all requests go through a single server.
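To put the scaling gap in numbers, the short calculation below combines two figures from this article: the 50 gigabits/sec Los Alamos target and the 500 megabits/sec average for a single NFS server. It is a back-of-the-envelope sketch only. It assumes decimal units (1 gigabit = 1,000 megabits) and perfect linear scaling with no coordination overhead, which real deployments would not achieve; the function name is hypothetical.

```python
import math


def servers_needed(target_mbits_per_sec, per_server_mbits_per_sec):
    """Minimum number of servers, assuming throughput adds up perfectly."""
    return math.ceil(target_mbits_per_sec / per_server_mbits_per_sec)


TARGET_MBITS = 50_000      # 50 gigabits/sec, decimal units assumed
PER_SERVER_MBITS = 500     # average NFS server rate cited in the article

print(servers_needed(TARGET_MBITS, PER_SERVER_MBITS))  # prints 100
```

Even under these idealized assumptions, about 100 independent servers would be required, which is why funneling every request through a single aggregating server becomes the limiting factor.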
Because NFS is well-known in the network administrator community, various initiatives are under way to boost its output. The University of Michigan, with some DOE funding, is developing an extension to NFS called NFS Remote Direct Memory Access. NFS RDMA promises to break the bottleneck by eliminating some of the work a server does when moving files, Gigante said.
Another NFS enhancement in development, called Parallel NFS, promises even greater throughput. Like NFS RDMA, pNFS is a planned extension to Version 4 of NFS. It would solve the bottleneck problem by establishing parallel file services. In essence, data can be spread out across multiple servers, according to an Internet draft on pNFS written by Garth Gibson, Panasas' chief technology officer, and Peter Corbett, an engineer at Network Appliance.
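The general idea of spreading data across multiple servers and working on the pieces at the same time can be sketched in a few lines. This is an illustration of round-robin striping only. It is not the pNFS protocol, nor the Panasas or Lustre implementation. The in-memory dictionaries standing in for storage servers, the tiny stripe size and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe; unrealistically small, for illustration


def split_into_stripes(data, stripe_size=STRIPE_SIZE):
    """Cut a byte string into consecutive fixed-size pieces."""
    return [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]


def write_striped(data, servers, stripe_size=STRIPE_SIZE):
    """Write stripes round-robin across `servers`, one worker per server.

    Each server is a dict (a stand-in for a storage node) keyed by stripe
    number. Returns the layout: a list of (server_index, stripe_number).
    """
    stripes = split_into_stripes(data, stripe_size)
    layout = [(i % len(servers), i) for i in range(len(stripes))]

    def write_one(item):
        (server_idx, stripe_no), chunk = item
        servers[server_idx][stripe_no] = chunk

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        list(pool.map(write_one, zip(layout, stripes)))
    return layout


def read_striped(layout, servers):
    """Fetch all stripes in parallel and reassemble them in order."""
    def read_one(entry):
        server_idx, stripe_no = entry
        return servers[server_idx][stripe_no]

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        return b"".join(pool.map(read_one, layout))
```

A caller would create the stand-in servers with `servers = [{} for _ in range(3)]`, call `write_striped(data, servers)` and pass the returned layout to `read_striped`. The layout plays the role of the bookkeeping a parallel file system must maintain: without it, no client could tell which server holds which piece.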
Most vendors say pNFS is the way forward, although researchers note that it is less mature than NFS RDMA and at least two years from commercial deployment.
Today's supercomputer designers can choose from a variety of solutions. Paul Buerger, who leads systems and operations at the Ohio Supercomputer Center, said the center has been experimenting with distributed, parallel file systems including GPFS, Lustre and pVFS. The center provides supercomputing power to universities and businesses in Ohio and the surrounding region.
"Each of these file systems has its advantages and disadvantages," he said. "None of them has yet distinguished itself as the answer to all I/O issues in supercomputing."Joab Jackson is an assistant managing editor at Government Computer News. He can be reached at firstname.lastname@example.org.