Virginia fights computer failures

EMC under fire for performance

A massive computer failure across multiple Virginia agencies last week continues to cause problems. Three agencies’ systems are still being worked on, including the Department of Motor Vehicles, which isn’t able to process driver’s licenses or identification cards at its 74 customer-service centers.

Technicians worked through the weekend to address the problem, which was blamed on the failure of two circuit boards installed and maintained by EMC, a Northrop Grumman subcontractor, as reported by the Richmond Times-Dispatch. The equipment is located at a headquarters office in Chesterfield County shared by the Virginia Information Technologies Agency (VITA) and Northrop Grumman.

"We've been told by EMC engineers that this is the first instance of a simultaneous memory board failure on one of their systems," Samuel Nixon Jr., the state's chief information officer and head of VITA, told the newspaper. "We've asked for specifics [and] proof."

The state's largest computer failure, which began on the afternoon of Aug. 25, affected 25 out of 80 agencies, the governor’s office and Northrop Grumman. Virginia officials had hired Northrop Grumman in 2005 under a $2.3 billion contract to provide computer and communication services to state government, the state’s largest outsourcing contract.

In all, the failure hit 483 servers, about 13 percent of the total number of Virginia's government servers.

VITA has not yet quantified the amount of stranded data — data stored in computer memory but not yet written to the hard drive — that was lost when the system failed last week, Nixon said. However, the interruption was not serious enough to activate a backup system at a duplicate computer center in Russell County, Va., he added.

Northrop Grumman will have to pay a penalty of at least $100,000 for the outage, Nixon said, and the state is considering whether agencies should also get credits or refunds for service interruptions.

Megan Mitchell, a spokeswoman for Northrop Grumman, had no comment.

The outage shut down Web sites, prevented the processing of jobless benefits and delayed welfare payments, the Times-Dispatch reported. At the state Department of Taxation, taxpayers could not file returns, make payments or register a business through the agency's Web site.

According to the newspaper, VITA and Northrop Grumman have quarreled for months over what the state characterizes as shoddy, expensive service. This past spring, the two entered into a new agreement that gives the company an additional $236 million in exchange for a pledge to provide better service.

About the Author

Kathleen Hickey is a freelance writer for GCN.

Reader Comments

Thu, Sep 2, 2010 MD_Steve

"the interruption was not serious enough to activate a backup system at a duplicate computer center" Not serious enough to flip? What are their criteria for a "serious failure"? They must have some very serious reservations about the viability of their redundant site.

Wed, Sep 1, 2010

DMX-3 hardware stores host writes in cache before destaging to disk. These writes are stored on mirrored memory cache boards. The article suggests that both these boards failed at the same time. Obviously hardware can fail, and in this instance, it most likely was a one-in-a-billion chance. However, this is the reason why service providers (not vendors) provide business continuity through storage replication.

Tue, Aug 31, 2010 Michael D. Long Knoxville, TN

This outage created such a major problem for three reasons: 1) the project staff lack sufficient experience with real world implementations to properly plan for contingencies; 2) failure to maintain on-site spares of critical components; and 3) an architecture that centralizes data storage and retrieval at the risk of implementing a single point of failure (and potential performance choke point). Back in the 90's my team developed archival storage solutions that were deployed on Stratus RADIO clusters for high availability due to the mission critical nature of both storage and retrieval of the data. We experienced a network segmentation fault at two sites in less than three months. Immediately following the first such fault, the lead engineer for Stratus' 2-node HA architecture had to pay us an emergency visit to provide root cause analysis and explain to my team, our corporate management, and our customer's technical representatives how such an event could occur, how to avoid similar issues in the future, and why we should continue forward with their technology. The architect was adamant that a network segmentation fault was "statistically impossible" - and held to this argument even after the second occurrence at a different customer's facility. I questioned the viability of 2-node redundancy then, and still question the approach today, as it lacks voting majority. Stratus staked its claims on widespread use of its 2-node HA systems and a track record for its hardware meeting customers' mission critical needs, which I must say was a compelling argument in the face of logic that would call such a design into question. The EMC solution shares the weakness of any 2-node HA implementation, and sadly, such a failure scenario should have been planned for. As for myself, I'd never accept a 2-node HA solution again after having to face the music for another company's failure on two separate occasions.
You need a 3-node implementation at a minimum in order to maintain voting majority. The only safe course of action when majority vote cannot be maintained is to cease operations and perform an orderly shutdown.

Tue, Aug 31, 2010

Absolutely makes technical sense. All those servers have storage on an EMC array. If the array's controllers both go belly up, it's akin to ripping out the hard drives of live servers - all of them - that are attached to the array.

Tue, Aug 31, 2010 Randy Broadwater, PhD Herndon, VA

This is obviously a perfect example of a poorly executed Continuity of Operations plan. The overall system should have been capable of automatic fail-over to the backup site at the first sign of problems. This would have allowed for better root cause analysis and repair. The sad thing is that VA is paying $236 million more to a company that promises to do better; that is absurd.
