Data warehouses need integrator's touch
Technology is only one part of the solution
The Internet and IP networks have gone a long way toward removing barriers between government agencies. To further a spirit of cooperation, agencies also are beginning to share information from disparate databases by sending some of that data to common data warehouses, where it can be merged, queried and analyzed.
The Transportation Department's Bureau of Transportation Statistics, for example, has created the TranStats portal (www.transtats.bts.gov). There, researchers, analysts and citizens can view, search, run reports and generate graphics from 400 tables in more than 100 databases from different departments and agencies.
The Bureau of Transportation Statistics has three goals for TranStats, said Cheryl Young, TranStats data team manager.
"The first is to assist public users in obtaining data easily and efficiently; the second is to support the public and internal users who perform online analysis; and the third is to offer data to internal users to perform advanced statistical analysis using third-party tools," she said.
But setting up a data warehouse is more complicated than building a database. Agencies often need integrators to handle the task.
"We learned that data collection and presentation involves much more than technology," Young said. "It is difficult to change the concept of data stewardship. We would like to have focused much earlier on this issue, so that the technical progress can follow."Something in common
Although they share many characteristics, a data warehouse is different in design and function than a production database. The primary differentiator is whether the system will be used for online transaction processing (OLTP) or online analytic processing (OLAP).
Most production databases are primarily for OLTP and require high-speed access to individual data records. A motor vehicles department, for example, would need an OLTP database to let clerks quickly access a vehicle's record, determine whether its registration is current and if not, to update the data.
The TranStats data warehouse, on the other hand, requires a database management system designed for OLAP. This determines the hardware needed to support the data warehouse as well as the database management system itself.
TranStats runs on four Sun Microsystems Solaris servers and a 2.5T Sybase IQ Database Management System from Sybase Inc. of Dublin, Calif. Query speed, not transaction speed, led to the Transportation Department's selection of Sybase for this application, Young said.
"Our data changes monthly or quarterly, but our queries must be performed within a split second," he said. "Sybase IQ is much faster than other databases, and the only one that can meet our dynamic Web presentation criteria."
Size is another differentiator. Because data warehouses extract information from multiple databases, they tend to be much larger than any one of their sources.
"Scalability becomes a bigger issue with a data warehouse, as does the way in which the database management system implements parallel queries, because of the heavy query nature of the data warehouse," said Craig Mullins, consultant and author of "Database Administration: The Complete Guide to Practices and Procedures" (Addison-Wesley Professional, 2002).Mart or warehouse?
There are two main approaches to setting up a data warehouse. Bill Inmon, known as the father of data warehousing, favors a top-down approach. The organization decides what data it wants in the central warehouse and how it will be used. Lower-level analytic databases, called data marts, draw data from the central warehouse. This method ensures consistency of data throughout all the organization's data marts.
The other method is a bottom-up approach. Data marts, set up to meet specific business needs, feed into the data warehouse. With multiple sources populating the warehouse with data, the organization more quickly realizes a return on its investment.
Perhaps of greater importance than which design you choose is that you make that choice early in the process.
"In setting up a data warehouse, it is important to have the data model correct right from the get-go," said Nathaniel Booth, IT specialist at the U.S. Geological Survey Data Center in Middleton, Wis. "You have piles of applications that will depend on that data going forward, so it is really hard to change later."
The Geological Survey hired a consultant to design the data warehouse and to handle early business intelligence deployments.
A database management system also must be chosen. This often is complicated by the fact that the organization usually has a database management system with which any new system must be compatible. The choice of software also must be interoperable with systems at any other agencies that will be feeding data into the data warehouse.
"Assuming you already have a database, which almost everyone in government has, [you could] just buy more of what you have and do a sole-source acquisition," said John Kost, analyst for Gartner Inc. of Stamford, Conn. The risk in any procurement, he said, is that an agency will end up buying new software and won't have in-house staff with the skills to support it.
The Geological Survey chose to stick with software it knew in selecting a database management system for a central data warehouse to collect water-quality data.
"We stayed with Oracle because of its support for a wide array of functionality, including Oracle spatial, text search and indexing," Booth said. "We had to bring in data from SQL Server and Ingres databases, as well as text data."
The Geological Survey takes datastreams from state, local and federal agencies nationwide and aggregates them in Oracle databases at its data center. Researchers can do browser-based queries via a flexible interface. The largest database contains the National Water Quality Assessment, and includes more than 11 million water and biological test results.
The data center started with Oracle 7.3, and now runs three instances of Oracle's 10g database on Red Hat Linux.
"We try to stay on the most recent version of the database," Booth said, even though this may cause problems. "Sometimes the new version doesn't support the applications you are running, and previous queries don't always work right, so just dealing with the change management through the versions of the database is somewhat of a challenge."More not merrier
Ensuring that all the data follows a consistent format is an ongoing effort, Booth said.
"Vendors still haven't come to us with a standardized metadata repository, so we have a difficult time having to manage the metadata about multiple products," he said. "In hindsight, it is nice to pick ones that play well with metadata."
When an agency is working with many other entities, each with their own legacy systems and unique data needs, managing the data warehouse means constant negotiations. Sonoma County, Calif., for example, set up an IBM Corp. DB2 geographic information systems data warehouse for use by all county agencies, including agricultural inspectors, tax assessors, the building department and emergency services. The county already had been using GIS software from ESRI for some of its applications, but it had no countywide standard.
"We were looking for a central repository to share information, so people didn't have to guess which version is most up-to-date," said GIS Manager Tim Pudoff. "By having a central database where people can publish their data, we can provide users with a one-source shopping point."
The GIS staff has real-time access to add or update information in the warehouse. Participating agencies periodically send automatic or manual feeds from their servers. Overseeing the system is a technical advisory committee, composed of representatives from 16 agencies. The committee provides requirements and buy-offs on any upgrades or new services. With so many stakeholders, Pudoff said, it is vital to have an executive champion to help manage the agencies' competing interests.
"You need someone who can help you wade through the bureaucratic trials," he said. "I have that from my manager, and it made all the difference."
Selecting the right hardware and software for a data warehouse involves looking at more than its technical features or initial cost.
"It is almost irrelevant what the vendor tells you the price is," Kost said. "You need to figure out what are the parts of the implementation, including your own costs." This includes the hardware, as well as services in setting up the database, creating links to other databases, applications or agencies and ongoing support.
Although an initial database purchase might be geared to a single use, Mullins said agencies should break out hardware, software and services components and examine them separately, even if the entire contract winds up going to a single contractor.
"I would split out the hardware and programming services from the database management system, unless this particular system is going to be used only by this one project," he said. "A database management system is designed for sharing data across multiple applications, and you don't have to always have a brand-new system for every application."
In evaluating a database management system, look at the analytical and reporting features, not the transaction speed, Mullins said. A proposal for a data warehouse "should focus on analytical processing instead of transactional processing," Mullins said. "The reporting features embedded with the database management system become much more important, as does the ability to manage very large amounts of data. A database management system with embedded [extract, transfer, load] features will also move to the head of the pack of data warehouse proposals, because warehousing implementations require heavy data movement from operational databases to the warehouse."
It's important to consider who is going to support the warehouse after the integrator wraps up work implementing the project. That was one factor in Geological Survey's decision to stick with Oracle rather than go with a product with a smaller installed base and local support network.
"For us, it is important to have face time with the consultants," Booth said. "It is difficult, when you are not in a major metropolitan area, to find skills in these very specific technologies."Drew Robb is a freelance technology writer in Los Angeles