Patent Pending

Recently, a large medical services company with tens of terabytes of digital records tried an experiment. It copied a subset of its electronic records, which included MRI scans and X-rays, from expensive disk arrays to tape, then restored the files from tape back to disk.

Unfortunately, a lot of bits were lost in the transfer. The company discovered a 10 percent variance between the restored data and the original data.

The difference was disconcerting. While a small amount of data loss might not represent a serious problem for certain types of applications, it is different with medical records, which might be used later as a reference for ongoing treatment or as evidence in a malpractice court case. Even a slight change in data, one that could manifest itself, say, as a shadow on a lung, could have significant repercussions for doctors and patients.

The issue of data integrity is a growing problem for data storage managers. Several efforts are afoot within standards-making bodies, such as the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C), to come up with a fix. At the same time, work is proceeding behind the scenes within many storage vendor shops to develop proprietary solutions. Just keeping up with the terminology can be a challenge.

For example, many of the efforts involve using MD5 hashing. Greg Wade, architect for Legato Systems in Palo Alto, Calif., explained that Message Digest Version 5 (MD5) is an algorithm submitted to the IETF in 1992 to provide a means for authenticating data by creating a "128-bit fingerprint or message digest from a much longer string of bits." The algorithm, or hash, was intended to authenticate the contents of a dataset, since each fingerprint created with the hash would be different from all others.

"You can confirm the contents of a file with an MD5 signature," said Mark Hayden, senior director of virtualization software for Left Hand Networks of Boulder, Colo. "It is similar to the checksum function commonly used in storage, but it is humanproof. You can't change the data and still produce the same MD5 hash."

How could a technology such as MD5 hashing, which can aid in data integrity testing, be harnessed to support content networking or hierarchical storage management? Scott Carson, director of technology services for IMS Systems, a mass-storage systems integrator in Silver Spring, Md., gave this explanation: "Content networking and, to some degree, large-scale HSM [hierarchical storage management] schemes, rely on content addressing to be effective. Addressing is commonly used to map contents to a memory location or disk sector in everyday computing."

Content addressing, he said, takes the same idea but applies it to data stored across a large storage platform or network. An effective content addressing scheme allows you to retrieve an entire object by looking up a portion of its contents. "Since MD5 hashing takes a large amount of data and condenses it to a small message header, I can see how you might be able to use it as the basis for content networking in a next-generation storage system," Carson said.

Can MD5 or related technologies be pressed into service not only as a guarantor of data integrity, but also as an enabler of a grander scheme of content networking? What alternatives exist to facilitate the needs of organizations with lots of data to store over an extended period of time? Watch this space.
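
For readers who want to see the two ideas side by side, here is a minimal Python sketch using the standard `hashlib` module. It is an illustration only, not the method of any vendor quoted above. The helper names `md5_digest` and `restored_matches` and the in-memory `ContentStore` class are hypothetical, invented for this example; a real archive would hash files in chunks and keep the objects on disk or tape.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the 128-bit MD5 fingerprint of data as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def restored_matches(original: bytes, restored: bytes) -> bool:
    """Integrity check: compare fingerprints taken before and after a restore."""
    return md5_digest(original) == md5_digest(restored)


class ContentStore:
    """Hypothetical content-addressed store: the digest serves as the address."""

    def __init__(self):
        self._objects = {}  # maps hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        """Store data and return the address derived from its contents."""
        address = md5_digest(data)
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Fetch by address and re-verify the contents on the way out."""
        data = self._objects[address]
        if md5_digest(data) != address:
            raise ValueError("stored object no longer matches its address")
        return data


if __name__ == "__main__":
    scan = b"stand-in bytes for an X-ray image"
    damaged = bytes([scan[0] ^ 0x01]) + scan[1:]  # flip a single bit

    print(restored_matches(scan, scan))      # identical copy
    print(restored_matches(scan, damaged))   # one bit lost in transfer

    store = ContentStore()
    address = store.put(scan)
    print(address, store.get(address) == scan)
```

The same fingerprint does double duty here: it flags a restore that altered even one bit, and it gives the store a compact key for looking up the whole object. Swapping in another digest from the same module, such as `hashlib.sha256`, changes only one line.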

Jon William Toigo

Jon William Toigo is an independent consultant and author of more than 1,000 articles and 12 books. If there is an emerging technology you would like Jon to look at, contact him through www.toigoproductions.com or via e-mail at jtoigo@intnet.net.
