Long live e-records!
Miles of files at the National Archives
Forget gigabytes or even terabytes.
The National Archives and Records Administration's creation of a
permanent online archive of its electronic records is one of the few projects
anywhere in which data storage is measured by the petabyte, or a quadrillion
bytes, and that is what fascinates Steve Hansen.
"It is hard to get your mind around the sheer volume of it," said
Hansen, chief engineer for a Lockheed Martin Corp. team competing for the
contract. The other team is led by Harris Corp.
But the project is significant not just for its mammoth size. It is the
highest-profile example of the growing trend of information lifecycle
management, a strategy for managing records from creation through use to
archiving.
Data storage is becoming a much more strategic decision. In past years,
data storage was bought as needed, with regular migrations of the data to new
formats as the old formats and applications became obsolete.
But over the last several years, powerful forces have complicated data
management and storage, including steep increases in the volume of data to be
stored and new regulatory, business and legal requirements that data remain
accessible over longer periods of time.
Major corporate scandals, such as Enron's financial collapse and Martha
Stewart's conviction, have focused attention on electronic records, such as
e-mails, as evidence in lawsuits and in shaping public opinion. New compliance
regulations such as the Sarbanes-Oxley and the Health Insurance Portability and
Accountability acts include tough new requirements for storing data.
Facing these demands, many government agencies and corporations are
applying a lifecycle approach to electronic records, assessing the data's
value now and in the future, and designing systems to manage and store the data
based on those priorities. Major IT systems integrators have entered the field
to chase what could be a multibillion-dollar global market.
The total global storage market is estimated at $65 billion for 2005 and
is expected to grow to $80 billion by 2009, according to research firm IDC of
Framingham, Mass. The firm doesn't have an estimate for spending on the
lifecycle approach for records management. But the National Archives project may
bolster the lifecycle approach.
70 years in the making
The lifecycle concept may have originated with the National Archives
itself 70 years ago.
"Information lifecycle management is an obvious copycat of the basic
application of records archiving," said Kenneth Thibodeau, director of the
electronic archives project for the National Archives.
Typically, many agencies stored old data in proprietary formats, mostly
offline and out of immediate reach. But that approach is changing as demand
rises for online access over longer periods of time, and as technologies such as
Extensible Markup Language have developed. XML is an open, nonproprietary
standard that is interoperable with multiple applications and data formats. It
does not become obsolete over time.
Even so, many users are heavily invested in proprietary systems and
technologies, including massive amounts of data stored in Adobe Systems Inc.'s
Portable Document Format files and Microsoft Corp.'s Word documents. Switching
to other formats can carry steep upfront costs, although from a long-term
perspective, there may be cost savings from alternative formats. In addition,
standards and protocols for e-records management and storage are not mature,
particularly for long-term, accessible records.
"There are no standards yet for information lifecycle management,"
said Michael Peterson, program director for the Storage Networking Industry
Association's data
Industry association members are developing such standards, Peterson
said, through projects such as SNIA's own "100-year-archive" committee and
through lessons expected from the National Archives electronic records project.
The data management and storage industry is evolving, complex and
fragmented, with sectors devoted to storage and management software, hardware
and services. Many industry observers are looking to the National Archives
project to solve, or at least shine a bright light on, the stickiest
technical problems of long-term e-records management, such as how to keep huge
volumes of data accessible, searchable and authentic while formats and
applications become obsolete.
The project "will be influential in solving some of the problems vexing
the broader market," said Charles Brett, managing principal for electronic
records management at Xerox Corp.
The project is supposed to create an accessible, authentic, secure
archive that functions in perpetuity and transcends today's and even
"The goal is to make the information available, so that it's
independent of any particular hardware or software," said Clyde Relick,
project manager for the team headed by Bethesda, Md.-based Lockheed Martin.
Other team members include BearingPoint Inc.; EDS Corp.; Fenestra Technologies
Corp.; Filetech Storage Systems Inc.; Metier Ltd., Washington; and Science
Applications International Corp.
In the short term, the National Archives project also will set data
management and storage standards for all other federal agencies to follow and
could have a big impact on IT electronic records projects governmentwide.
"It will be extremely influential for all federal agencies
initially," said Karen Knockel, program manager for the Harris team. Other
members include Booz Allen Hamilton Inc.; CACI International Inc.; and
Information Manufacturing Corp.
The project has been budgeted at $136 million through October 2006 but is
expected to cost hundreds of millions, or even billions, more. The National
Archives is expected to pick a winner in August.
Though Relick and Knockel declined to say what their proposed solutions
entail, experts said XML and metadata, which is data about data, are likely to
be part of the answer.
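As a rough illustration of what "XML and metadata" could mean in practice, the sketch below wraps a record's descriptive metadata in a plain XML document. It is not either team's design: the element names (`record`, `originalFormat`, `sha256` and so on), the sample values and the `describe_record` function are all hypothetical, chosen only to show how data about data can travel with a record independently of the application that created it.

```python
import hashlib
import xml.etree.ElementTree as ET


def describe_record(title, created, original_format, payload):
    """Build an XML metadata description for one record (hypothetical schema)."""
    record = ET.Element("record")
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "created").text = created
    ET.SubElement(record, "originalFormat").text = original_format
    # Size and digest let a future reader check that the bytes are unchanged.
    ET.SubElement(record, "sizeBytes").text = str(len(payload))
    ET.SubElement(record, "sha256").text = hashlib.sha256(payload).hexdigest()
    return ET.tostring(record, encoding="unicode")


# Sample values are invented for illustration.
xml_text = describe_record(
    "Sample memo", "1998-03-14", "application/msword", b"example payload"
)
print(xml_text)
```

Because the description is ordinary text, any XML parser can read it back, which is the kind of software independence the article describes. The digest is one simple way to support a claim of authenticity, though a real archive would need far more than this.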
"We need a data architecture that is scalable and evolvable over
time," Thibodeau said. The archives' own research with the San Diego
Supercomputer Center suggested that such an architecture might be built on XML,
but "obviously we don't have the answer yet," he said.
The National Archives project will handle electronic records dating back
to 1970, which constitute about a terabyte in total, but it must be able to
handle about three petabytes a year when it becomes operational in 2007,
The records include about 100 million White House e-mails from the
Clinton and Bush administrations. The e-mails themselves are not difficult to
archive, "but the attachments will kill you," he said.
New technology is likely to assist National Archives personnel in
automating data management at each step: imaging, formatting, authenticating,
securing and storing the records. For example, newly developed,
"content-addressed" software may be used to automatically classify which
data is likely to be in frequent demand and ought to be made immediately
accessible online, and which data is in less-frequent demand and could be saved
on magnetic tape, Thibodeau said. The data on the tapes could be made accessible
to a researcher within four to five minutes per item, at best, he said.
He estimated 70 percent of the e-records would need infrequent
accessibility, and 30 percent would need active accessibility.
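To put those figures in perspective, the back-of-the-envelope arithmetic can be sketched as follows. The three-petabyte annual intake, the one-terabyte backlog and the 70/30 split are the estimates quoted above; the decimal petabyte (10^15 bytes) matches the "quadrillion bytes" definition used earlier. The function name is illustrative only.

```python
PETABYTE = 10**15  # bytes, the "quadrillion bytes" definition
TERABYTE = 10**12  # bytes


def split_by_access(total_bytes, infrequent_percent=70):
    """Return (infrequent_bytes, active_bytes) for a given volume of records."""
    infrequent = total_bytes * infrequent_percent // 100
    return infrequent, total_bytes - infrequent


annual_intake = 3 * PETABYTE
infrequent, active = split_by_access(annual_intake)

print("Infrequent access (PB):", infrequent / PETABYTE)
print("Active access (PB):", active / PETABYTE)
print("Annual intake in terabytes:", annual_intake // TERABYTE)
```

Under these assumptions, roughly 2.1 petabytes of each year's intake could sit on slower media, while about 0.9 petabytes would need to stay online. The annual intake is also about 3,000 times the size of the entire backlog of records dating back to 1970.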
Regardless of the outcome of the National Archives project, the lifecycle
approach appears to be gaining steam in the data storage industry. Companies
need to reduce their storage costs with strategic decisions, said Russ Kennedy,
director of software product management for information lifecycle management at
StorageTek Corp., a data storage company.
"With lifecycle management, you decide the value of an individual
object and, based on its value, you decide how long it is to be retained, in
what storage format and with what devices. Then you can decide whether it ought
to be moved to a lower-cost tier of storage that fits the information's
value," he said.
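The decision chain Kennedy describes (value first, then retention period, format and storage tier) can be pictured as a simple lookup table. The sketch below is a minimal illustration, not StorageTek's product logic: the value classes, retention periods, formats and tier names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    retention_years: int
    storage_format: str
    tier: str


# Hypothetical value classes and policies, for illustration only.
POLICIES = {
    "high": Policy(retention_years=100, storage_format="XML", tier="online disk"),
    "medium": Policy(retention_years=25, storage_format="XML", tier="nearline disk"),
    "low": Policy(retention_years=7, storage_format="original", tier="magnetic tape"),
}


def assign_policy(value_class):
    """Map an object's assessed value to its retention and storage policy."""
    try:
        return POLICIES[value_class]
    except KeyError:
        raise ValueError(f"unknown value class: {value_class!r}") from None


for value_class in ("high", "medium", "low"):
    print(value_class, assign_policy(value_class))
```

The point of the table is that the expensive judgment, deciding what a record is worth, is made once. Retention, format and tier then follow from it automatically, and a record can be moved to a cheaper tier simply by reassessing its value.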
StorageTek in June announced its new IntelliStore data management and
archiving system, a lifecycle-based system that permits multiple tiers of
storage.
Another popular solution that stresses the lifecycle approach is the
Centera data archiving and networked storage system from EMC Corp. of Hopkinton,
Mass. Centera can expand to hold petabytes of data and store it according to a
lifecycle strategy for a long-term archive, said Kenneth Steinhardt, EMC's
director of technical analysis.
"Instead of being places where information goes to die, archives are
becoming the place where information goes to live in perpetuity," Steinhardt
said.