On all the time
Disaster recovery means never going down
- By David Essex
- Jul 21, 2006
Disaster recovery was a low priority for many government agencies until the terrorist attacks, hurricanes and other disasters of recent years. Now disaster recovery, ensuring that IT keeps working without interruption, is a key component of the continuity-of-operations plans that government expects industry to help it carry out.
The technologies for keeping systems online in a catastrophe are essentially the same as those for maintaining redundant servers, applications and databases for high availability. But dispersing systems geographically so resources stay online when disaster strikes adds a new wrinkle and a fresh set of IT challenges, ones that experts say contractors and their agency customers can and should overcome.
"Do something," said Steve Duplessie, senior analyst at Milford, Mass.-based Enterprise Strategy Group. "It's too cheap and too easy not to start moving data offsite."

Three pillars of recovery
The technology behind disaster recovery falls into three areas: backup and restore, replication and failover. How your customer approaches those and the infrastructure it eventually deploys depends on how it chooses to characterize disaster recovery.
Two common disaster recovery benchmarks are recovery point objective and recovery time objective. Recovery point objective gets at the issue of acceptable data loss from a failure. For instance, if a system goes down, is it acceptable for you to bring it back online with month-old or week-old data? Recovery time objective is about availability: How quickly does a system need to be back up and running?
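The two benchmarks reduce to simple arithmetic on timestamps. A minimal Python sketch of checking a recovery event against both objectives (the function name and dates are illustrative, not any vendor's tooling):

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, failure_time, restore_done, rpo, rto):
    """Check one recovery event against RPO and RTO targets."""
    data_lost = failure_time - last_backup   # work created since the last good copy
    downtime = restore_done - failure_time   # how long the system was unavailable
    return data_lost <= rpo and downtime <= rto

# Nightly backups, a midday failure, service restored 2.5 hours later,
# judged against a 24-hour RPO and a 4-hour RTO.
ok = meets_objectives(datetime(2006, 7, 20, 2, 0),    # last backup, 2 a.m.
                      datetime(2006, 7, 20, 14, 30),  # failure
                      datetime(2006, 7, 20, 17, 0),   # restored
                      rpo=timedelta(hours=24), rto=timedelta(hours=4))
print(ok)  # True: 12.5 hours of lost data and 2.5 hours of downtime both pass
```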
Having identified systems and spelled out recovery point objective and recovery time objective requirements, you can solicit products that meet your customer's requirements.
At the most basic level, disaster recovery solutions have a backup-and-restore infrastructure.
Much of today's disaster recovery technology debate echoes the thinking behind storage management's evolution over the past decade. Agencies that need data backups in case of emergency must weigh the price and performance differences among storage media such as optical media, tape and, increasingly, cheap serial ATA hard drives that can function as virtual tape drives.
Integrators should help evaluate how these technologies fit into a cost-effective disaster recovery strategy while meeting their customers' recovery point and time objectives.
"Tape actually does a pretty good job, and tape technology continues to evolve to keep pace," said Matt Fairbanks, senior director of product management at Symantec Corp., which sells backup, replication and clustering software. "Just about everybody uses tape at the back end," he said.
But tape has limitations. "When you need to get online quickly, tape is sometimes not the best option," Fairbanks said.
Related to backup is data replication, which maintains a copy of an application or database so the information is as fresh as you need it. Asynchronous replication, the most common type, means the primary system sends changes to the backup, then proceeds without waiting for proof that the information was copied.
There's a risk that systems could fall out of sync, but asynchronous replication is faster than the alternative.
In a synchronous replication scheme, the primary system waits for confirmation before proceeding, so each database updates the other, which slows performance but guarantees a true duplicate. This is important for agencies that have stringent recovery point objectives.
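The difference between the two disciplines can be sketched in a few lines of illustrative Python; the classes below are a toy in-memory model, not any product's API:

```python
from collections import deque

class Replica:
    """Stand-in for the backup copy at the secondary site."""
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgment back to the primary

class Primary:
    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.pending = deque()  # changes queued for asynchronous shipping

    def write_sync(self, key, value):
        # Synchronous: do not proceed until the replica confirms the copy,
        # so both sides stay identical at every step.
        self.data[key] = value
        if not self.replica.apply(key, value):
            raise RuntimeError("replica did not confirm the write")

    def write_async(self, key, value):
        # Asynchronous: record the change locally and move on immediately.
        self.data[key] = value
        self.pending.append((key, value))

    def drain(self):
        # Ship queued changes later. Anything still queued when the primary
        # site fails is exactly the loss that asynchronous replication accepts.
        while self.pending:
            self.replica.apply(*self.pending.popleft())

primary = Primary(Replica())
primary.write_async("row1", "v1")   # replica is momentarily stale here
primary.drain()                     # now the two sites match again
```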
Synchronous replication can be expensive to run over WANs because of the bandwidth required to maintain acceptable performance.
"Keeping a hot site up and running at full speed can be very costly," said Joe Gentry, global vice president at Reston, Va.-based Software AG Inc., which sells replication software for its Adabas database management system.
Still, experts said it's important for agencies and their contractors to understand the type of replication they have and how it could affect availability in a continuity-of-operations or disaster recovery situation. Asynchronous replication could leave blocks of data unavailable if they were en route to the replicated system when the outage occurred, said Terry Stowers, a senior storage technology specialist at Microsoft Corp.
"That may be OK," he said. "The important thing is to know ahead of time."
Finally, disaster recovery typically includes a failover function, in which services automatically switch from a failed system to a replicated system or site. These days, failover servers in a high availability configuration can act like a single computer, effectively a disaster recovery spin on clustering. Storage for high availability clusters can either be shared, which creates a single point of failure and could be a drawback, or replicated on independent hardware linked via an IP network.
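At heart, an automatic failover trigger is a policy applied to repeated health checks. A hedged sketch in Python, using a hypothetical three-strikes rule so a single dropped check doesn't cause needless failovers:

```python
def should_fail_over(health_checks, max_misses=3):
    """Trip the failover after max_misses consecutive failed health checks;
    one successful check resets the count to avoid flapping on brief blips."""
    misses = 0
    for healthy in health_checks:
        misses = 0 if healthy else misses + 1
        if misses >= max_misses:
            return True
    return False

print(should_fail_over([True, False, False, True]))   # False: primary recovered
print(should_fail_over([True, False, False, False]))  # True: three misses in a row
```

A real monitor would run this continuously against network probes; the threshold trades recovery speed against the risk of failing over on a transient glitch.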
Increasingly, high availability infrastructures include servers clustered not merely onsite over LANs, but across hundreds of miles in a WAN configuration, a technological stretch that brings risk with reward.
With wide-area clustering, Fairbanks said, "literally at the click of a button, we can move an entire service, an application and everything associated with it, to an alternate site."
Failover can be manual or automatic, depending on needs and budgets. It can also be set up to happen imperceptibly to users.

Not so simple
High availability replication and clustering technologies are the most expensive disaster recovery options, and overkill for some agencies. That said, vendors insist they're the preferred disaster recovery techniques for agencies with strict security and availability requirements. XOsoft, for example, said 11 federal agencies, including the Labor Department, use its WANSync and WANSyncHA asynchronous replication products.
"You need to have geographic separation of copies of your data," Fairbanks said. "It's not good enough to have a copy sitting on a storage array that's sitting next to your primary server."
Pulling off wide-area disaster recovery can be tricky. For example, third-party disaster recovery clustering tools are sometimes needed to supplement the cluster technology that comes with major operating systems and high availability servers from companies such as Hewlett-Packard Co. and IBM Corp.
Microsoft Windows Server 2003 comes with its own cluster service, but users often augment it with products such as Steeleye Technology Inc.'s LifeKeeper or Double-Take Software Inc.'s Double-Take tools, because Windows' shared-storage design can be a challenge for WAN replication.
There are two types of disaster recovery clusters: active-active and active-passive. Microsoft Cluster Service is an example of the former, while the Neverfail and XOsoft clusters are active-passive, according to John Posavatz, vice president of product management at Neverfail Inc.
"With active-passive, only one of your servers is doing anything for users," Posavatz said. "That passive system is basically ... a hot spare."
In replication and failover scenarios, whether clustered or not, applications present another set of disaster recovery challenges.
"E-mail has probably become the most important application for business organizations and government to protect in the event of an outage," said Bob Williamson, vice president of products at Steeleye. But because of network, server and operating system dependencies, applications don't always come up easily on remote systems, a problem that is compounded when workers try to recreate their office setups at home.
Steeleye and Neverfail claim application awareness and sell versions of their disaster recovery tools tailored to popular applications such as Oracle, Microsoft Exchange, Internet Information Server and SQL Server.
Replication tools such as LifeKeeper handle some of the messy details.
"If that primary data center should go out, more often than not, the remote site is going to be on a different subnet," Williamson said. "How are you going to get them updated with the IP address of the remote site where the applications are now running?"
Neverfail's True Clone technology replicates the active server's IP address and server name on the passive server. Vendors that don't offer this feature usually require users to redirect their PCs to the passive server during failover.
"Since it looks identical to the primary, it's completely transparent to the end user," Posavatz said. XOsoft can redirect client systems to new addresses.
And because the database is a critical component of an application, most disaster recovery tool vendors sell versions customized to specific brands, principally DB2, Oracle and SQL Server. In addition, the database vendors sell replication modules for their products.
The most frequently heard disaster recovery phrase is "continuous data protection," or the ability to return systems to their state before the disruption. Research analyst Duplessie identifies two kinds: "true" CDP, which captures every change and provides the most granularity in rollback options, and CDP that periodically takes "snapshots." Both can be useful disaster recovery components, he said, but even vendors of CDP say they can be tricky to use.
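"True" CDP amounts to keeping a journal of every change so the data set can be rebuilt as of any moment, rather than only as of the last snapshot. A toy Python illustration of the idea (not any vendor's on-disk format):

```python
class ChangeJournal:
    """'True' CDP in miniature: every write is logged, so state can be
    rewound to any point in time, not just to the last periodic snapshot."""
    def __init__(self):
        self.log = []  # (timestamp, key, value) entries, in arrival order

    def write(self, ts, key, value):
        self.log.append((ts, key, value))

    def state_at(self, ts):
        # Rebuild state by replaying every change up to and including ts.
        state = {}
        for t, key, value in self.log:
            if t <= ts:
                state[key] = value
        return state

journal = ChangeJournal()
journal.write(1, "balance", 100)
journal.write(2, "balance", 250)
journal.write(3, "balance", 0)  # the bad write we want to roll back past
print(journal.state_at(2))      # {'balance': 250}
```

The granularity comes at a price: the journal grows with every write, and, as the vendors concede, picking the right timestamp to rewind to is the hard part.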
File-level CDP, for example, can cause more trouble than it's worth if it fails to account for global effects on applications. It also can be hard for IT managers to find the right rollback point. XOsoft's Enterprise Rewinder is a CDP program that works at the application level, according to executive vice president Gil Rapaport. "We are translating input-output requests of the disks to major events in the application itself," he said.

Infrastructure investment
Perhaps the biggest challenge in building a comprehensive disaster recovery plan is simply having the right infrastructure to execute it. Experts have argued that the greatest infrastructure issue is network bandwidth among the primary, secondary and, in some cases, even tertiary sites. Some agencies are prepared for coordinated attacks on more than one site.
If the network is too slow, data might take too long to transmit to the remote site or to fail back to the primary site. Investing in a high-speed WAN may be money down the drain, however, if asynchronous replication, which can run well at lower speeds, meets your agency's recovery point objective.
"Many customers will say, 'I'm okay losing five minutes of data,'" Posavatz said. "They don't have to invest in so much bandwidth."
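The bandwidth math behind that tradeoff is straightforward. As an illustrative sketch, assuming no protocol overhead, compression or de-duplication, the sustained link speed asynchronous replication needs is simply the rate at which data changes:

```python
def link_mbps_for_change_rate(change_gb_per_hour):
    """Minimum sustained link speed for asynchronous replication to keep up:
    the link must move changes at least as fast as they are produced
    (ignoring protocol overhead, compression and de-duplication)."""
    return change_gb_per_hour * 8 * 1000**3 / 3600 / 1e6

# A system generating 10 GB of changes per hour needs roughly a 22 Mbps link.
print(round(link_mbps_for_change_rate(10), 1))  # 22.2
```

A looser recovery point objective doesn't change that average, but it lets the replication queue absorb bursts, so the link can be sized for the average change rate rather than the peak.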
Such distinctions are significant. Even a basic setup with asynchronous failover and no clustering can get expensive, with software costing $5,000 to $10,000 per server. Integrators can see economies by making disaster recovery infrastructures do double duty, improving service availability regardless of disaster. They can also use other technologies not typically associated with disaster recovery, such as virtualization, to better manage costs.
"If I don't have virtualization, that means I have to have the same number of servers at the remote site as the primary site," said George Symons, chief technology officer at EMC Corp., an enterprise storage vendor that also sells disaster recovery products. Of course, as with most technologies, agencies need to weigh the tradeoffs.
In the case of virtualizing disaster recovery resources, "you won't have the same CPU power that you had at the primary site," Symons said.
Duplessie said an emerging technology that could make disaster recovery plans more cost-efficient is data de-duplication. "By eliminating duplicate data," he said, "we can reduce the amount of data that needs to be physically moved to the disaster recovery site, perhaps by 100 to 1."
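Content-addressed de-duplication can be shown in miniature: hash each data chunk and ship only chunks not seen before, replacing repeats with references. A hypothetical sketch:

```python
import hashlib

def dedup(chunks):
    """Content-addressed de-duplication in miniature: each unique chunk is
    stored (and would cross the WAN) once; repeats become references."""
    store = {}     # digest -> chunk: what actually gets transmitted
    manifest = []  # ordered digests: enough to reconstruct the stream
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
        manifest.append(digest)
    return store, manifest

chunks = [b"header", b"payload", b"payload", b"payload", b"header"]
store, manifest = dedup(chunks)
print(len(store), len(manifest))  # 2 5: five chunks travel as two
```

The savings depend entirely on how repetitive the data is, which is why backup streams, full of unchanged files, are where ratios like Duplessie's 100 to 1 become plausible.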
And once you and your customer have disaster recovery in place, use it, even if there isn't a disaster. Because a disaster recovery system is regarded as an insurance policy, it's tempting to test it only once a year. Experts recommend something more frequent, such as monthly or quarterly testing.
David Essex is a freelance technology writer in Antrim, N.H.