Washington Technology Online Data Warehousing: A Paradigm Shift

Data Warehousing: A Paradigm Shift

Companies worldwide are proactively using data as a competitive weapon

By John Makulowich

Among the more rabid supporters of data warehousing and its prospects for growth and prosperity, David M. Thomas stands tall.

The First Albany, Albany, N.Y., analyst waxed enthusiastic about this IT segment in a report he co-authored, "Relational Database Management Systems." In the report, he boldly proclaimed, "We believe the [data warehousing] market represents one of the biggest, if not the biggest, software investment opportunities in the IT industry."

Thomas went on to cite studies from his company's own META Research Group showing that more than 80 percent of the Global-2000 firms are implementing data warehousing strategies. And he rated nine publicly traded companies, all household names in the data warehousing/online analytical processing industry, firms such as Oracle Corp., Redwood Shores, Calif.; Cognos Inc., Burlington, Mass.; Informix Software Inc., Menlo Park, Calif.; Prism Solutions Inc., Sunnyvale, Calif.; Red Brick Systems Inc., Los Gatos, Calif.; Arbor Software Corp., Sunnyvale, Calif.; and Sybase Inc., Emeryville, Calif.

Since then, all the firms, save Oracle, have plummeted in price, dropping faster than a skier trying to outrun an avalanche. What happened to these companies whose common goal is data warehousing - the systematic collection of large volumes of business information from transaction processing, legacy systems and external sources to benefit end users and ultimately to make the organization more competitive?

"Like other stocks in the IT sector, investors bid these up to unheard of multiples. There was bound to be a correction," says Thomas. "Part of the reason stems from the Oracle legend, a stock that set the standard for software growth, 10 straight years of 100 percent growth. We're not likely to see that ever again."

Still, Thomas is bullish on the stocks and the prospects for data warehousing, focusing on the fact that the revenue growth rate of his universe of nine stocks is still strong.

"Organizations are increasingly viewing their data as a valuable asset that can be used as a competitive weapon. Using data this way, that is, proactively, is a paradigm shift in the use of IT and specifically in the use of database management," says Thomas.

He expects the data warehouse and data mining tools (used to uncover previously hidden or undiscovered data patterns or relationships in the warehouse) to evolve toward temporary data marts: organized, project-specific collections of data. He gives the example of companies that place advertisements in the Sunday newspapers of a particular market for a specific product.

"Those companies, whether they are Sears or Proctor & Gamble, want to figure out how that ad will impact sales. The data generated could seed a data mart for a specific purpose, but it would be a temporary data mart, just for one project and for one particular time and market."

Another Perspective

One gets a completely different perspective on data warehousing from Michael Abbey, co-author with fellow Oracle expert Michael J. Corey of the newly published Oracle Press book, "Oracle Data Warehousing: A Practical Guide to Successful Data Warehouse Analysis, Build, and Roll-Out." Abbey is also president of the consulting firm Michael Abbey Systems International Inc., Ottawa, Ontario.

"The book was written with two goals: to feature Oracle, which obviously enjoys a significant market share. We feel they have a tendency to do things properly and are in tune with what is going on in the industry. We also wanted to offer a network of guidelines that would help warehouse projects succeed," says Abbey.

According to First Albany, Oracle has a 45 percent online transaction processing relational database management system market share and 60 percent for very large databases.

Asked for his guidance to organizations embarking on the data warehousing adventure, Abbey strongly advises them to ensure that power users are involved in design, deployment and rollout.

"Someone in the organization must ensure that power users play a part in the process. The user voice is critical," explains Abbey. "Since you're looking at a 24- to 72-month project, you also need to find subject specialists, especially IT professionals.

"If the membership on the data warehousing team changes as the project moves ahead, management needs to make sure that team members toe the line, that they have corporate direction to stay on task and corporate commitment."

He further stresses the importance of clearly stating the roles and responsibilities of the data warehousing team, especially in having infotech staff reorient themselves to meet user needs. Included in the book is a data warehouse roles checklist. It contains 19 areas of responsibility that companies need to cover in building a warehouse.

It's clear that one of his model companies for a successful data warehouse is Hertz System Inc., Park Ridge, N.J. The key is the business' ability to leverage information that will make it more competitive.

And the example he and Corey give is the high level of customer service that Hertz offers based on "knowledge, care and trust," phrased in the questions: Do they know what they are talking about? Do they care about me? And do I trust them?

According to the authors, Hertz has collapsed a 40-plus minute transaction into a few minutes. "They have taken the information they have learned about me as a customer and customized their services to meet my individual needs. Many customers are willing to pay a premium for this level of service."

Data Mining

Beneath the standard steps in building a data warehouse, that is, forming a team, finding, gathering, cleaning and transforming the data, organizing it, automating the process of moving the data to areas for quick access, is the task of uncovering previously overlooked information nuggets. This is the role of data mining tools.

It's an area that most experts agree is currently up for grabs and represents the next major development for the data warehousing industry.

According to Larry Mazlack, associate professor of Computer Science in the College of Engineering at the University of Cincinnati, data mining tools are at a relatively primitive stage. Mazlack just returned from a year's sabbatical at the University of California at Berkeley studying with, among others, the legendary Lotfi A. Zadeh, the man generally credited with originating fuzzy logic, which is used by designers and researchers to model complex, nonlinear systems.

"Most products out there are too early to market and it's really unclear whether they do a lot of good. For the most part, the tools look for predefined relationships, for example, between a person's age and the products they buy. In this case, it is nothing special that you could not do with statistical analysis," says Mazlack.

The critical issue, he says, is to look at the database as a whole and find different, if not surprising, relationships. Among the fruitful areas he identified are consumer marketing using scanner data and the financial industry, analyzing stock market data, especially in trying to uncover time patterns and sequences.

Following the lead of Zadeh, he stresses the importance of what he terms approximate reasoning and fuzzy arithmetic "to capture the implicit imprecision of data. We need to allow the data to retain its imprecision. Not everything can be reduced to computerized algorithms; the binary universe does not correspond to reality. We need to make computers deal with impreciseness," says Mazlack.
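One way to picture what it means for data to "retain its imprecision" is a fuzzy membership function, which assigns a degree between 0 and 1 rather than a yes-or-no answer. The sketch below is only an illustration, not a method described by Mazlack in this article; the function names, the "young" category and the age thresholds are all assumptions chosen for the example. The minimum and maximum operators are the standard fuzzy AND and OR from Zadeh's fuzzy set theory.

```python
def young_membership(age, full_until=25.0, zero_from=45.0):
    """Degree (0.0 to 1.0) to which an age counts as 'young'.

    The thresholds are illustrative assumptions, not values from the article.
    """
    if age <= full_until:
        return 1.0
    if age >= zero_from:
        return 0.0
    # Linear fall-off between the two thresholds
    return (zero_from - age) / (zero_from - full_until)


def fuzzy_and(a, b):
    """Standard fuzzy AND: the smaller of the two degrees."""
    return min(a, b)


def fuzzy_or(a, b):
    """Standard fuzzy OR: the larger of the two degrees."""
    return max(a, b)


if __name__ == "__main__":
    for age in (20, 30, 35, 50):
        print(age, young_membership(age))
```

A binary rule would put a 35-year-old either in or out of the "young" group; here the same customer belongs to it to degree 0.5, and that partial membership can be carried into later analysis.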

Red Brick Systems Inc.

On the commercial front, Red Brick Systems Inc. is one of the few companies to integrate data mining algorithms into its data warehousing RDBMS, or relational database management system, which it did through a joint development agreement with DataMind Corp. It introduced this option just two months ago with its Red Brick Warehouse 5.0, a high-performance client/server RDBMS.

"The Red Brick Data Mine approach is to leave the data in the warehouse in relational tables and bring the data mining software to the data, rather than vice versa," explains Steve O'Brien, director of Product Marketing for Red Brick.

Even with his company's product on the market, he is the first to admit that data mining is at a nascent stage, much like data warehousing was three or four years ago.

"In our visits with customers, we hear lots of confusion and conflicting messages. The recurring theme is: help me understand what this is, what it does and how it can help me. People don't understand the technology behind it and how it can help their business," admits O'Brien.

The bottom line for a data warehouse, he says, is that the product should allow complex queries against very large data sets. Part of the reasoning behind integrating the data mining tool into the data warehouse is to bring the technology down to the level of the decision maker, the end user.

The customers who have expressed the deepest interest in the data mining option so far are in the telecommunications industry, such as regional Bell operating companies and long distance carriers, and the health care industry. The likes of AT&T, MCI and Cellular One are interested in churn analysis, that is, analysis of the rate of customer turnover in their different telephone services.
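At its simplest, the turnover rate these carriers want to analyze is the share of customers lost in a period, broken out by service. The following is a minimal sketch under stated assumptions: the record layout, the service names and the function names are hypothetical and do not describe Red Brick's or any carrier's actual system.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def churn_by_service(records):
    """Compute the churn rate per service.

    records: iterable of (service_name, churned) pairs, where churned is
    True if the customer left during the period (hypothetical layout).
    """
    totals = {}
    lost = {}
    for service, churned in records:
        totals[service] = totals.get(service, 0) + 1
        if churned:
            lost[service] = lost.get(service, 0) + 1
    return {s: churn_rate(totals[s], lost.get(s, 0)) for s in totals}


if __name__ == "__main__":
    # Made-up sample records, for illustration only
    sample = [
        ("long_distance", True),
        ("long_distance", False),
        ("cellular", False),
        ("cellular", False),
    ]
    print(churn_by_service(sample))
```

A data mining tool would go further and look for the customer attributes that predict who leaves; this sketch covers only the rate itself.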

O'Brien gives two examples of the successful use of data mining tools. In the first case, an instance of business-to-business analysis, a large Japanese automobile manufacturer was trying to reduce warranty expenditures. As he noted, there is room in that process for unscrupulous dealers to charge both the company and the customer for repairs while the vehicle is under warranty. The manufacturer was interested in identifying the likelihood of fraudulent claims and uncovering them. Using data mining tools, they found relationships in cases of fraud among the type of work performed on the vehicles, the hours of labor and the type of problem.

In the second example, automobile insurance companies are continually trying to determine how best to set rates and how to attract customers from their competitors. One approach is to separate good drivers, for example, those with strong safety records, from bad ones. There are already very well-known criteria for this exercise. However, using data mining tools, one insurance company found out that people with good credit ratings have fewer accidents, a relationship previously unknown.

O'Brien feels that systems integrators have a strong role to play in helping customers "size the business problem."

"Systems integrators can help customers understand the software tools and hardware platforms available today to solve their problem. It's not enough to walk the cus tomer through the process of sizing the problem by asking the right questions. You also need to recommend the right pieces. There is no one vendor today that solves the whole problem; we ourselves are one of the pieces," says O'Brien.

DataMind Corp.

The company working with Red Brick, among others, is DataMind Corp., Redwood City, Calif. With its Agent Network Technology, the firm offers two products, the DataMind DataCruncher, which is a server-based data mining engine for Unix and Microsoft Windows-NT platforms, and the DataMind Professional Edition, for end users to define data mining studies and view data mining results.

According to AJ Brown, vice president of marketing, the company's product strategy is to offer a client/server solution.

"We offer an easy-to-use tool. We take a single algorithm approach in the belief that it will satisfy 90 percent of data mining queries that end users have. It's an open algorithm in the sense that it can be modified," says Brown. "With products like ours, systems integrators need to start getting expertise in specific industry segments. Since we can modify our product, we need to hear from them that this is what we need, for example, for the financial market. Frankly, at this point, we do not know where the best applications will be with data mining."

DataMind's products started shipping in May and they have been working on pilots with a number of companies, mainly in the telecommunications, retail and health-care industries.

NCR Corp.

On a completely different plane in data warehousing and data mining is NCR Corp., Dayton, Ohio, which recently was given a 1996 Best Practices Award by the Data Warehousing Institute, Bethesda, Md., for its data warehouse application for retailing giant Wal-Mart Stores Inc. The company focuses its computer product and service efforts on the retail, financial and communications industries and could easily claim to have started the data warehousing industry in 1985, according to Boston-based Patricia Seybold Group.

Among NCR's recent initiatives are the Scalable Data Warehouse framework, a blueprint for building a data warehouse solution through one-stop shopping, and an agreement with Knowledge Discovery One, Richardson, Texas, to deliver a data mining consulting program, called Voyager.

For Joe Wenig, general manager of NCR's Data Warehousing Professional Services, data mining is a new process for answering questions, not a solution by itself.

"Obviously, the more details you have, the more likely you are to pick up patterns. Equally important to detailed data is understanding the nature of the problem you are looking to solve. You are unlikely to find one tool that runs the gamut of the data mining space," says Wenig.

The typical engagement for NCR moves through four phases. First is the basic, high-level problem identification. Second is detailed analysis and design. Third is data collection. And fourth is the mining, using sophisticated querying techniques.

Down the road, Wenig sees the tools maturing to the extent that they will be as simple as using Excel to find an average. Driving that development is end user awareness.

"Customers today are coming to the realization that data mining is the future. To a large degree, most of the tools are leading edge. It's not clear what types of problems they are best used to solve," admits Wenig.

He also sees enormous opportunity for systems integrators as users try to get a handle on how data mining tools can be used in their company and industry.

The Storage and Middleware Issues

While data mining may be the industry rage, two other areas of critical importance for building the successful data warehouse are storage and middleware. The latter is software that allows different systems to communicate. In many cases, customers overlook the need for both.

The first is the domain of EMC Corp., Hopkinton, Mass., which has to its credit a record 11-terabyte (11,000 gigabytes) data warehouse built with NCR. It showcased the decision support system last year in Tokyo as a simulated manufacturing company's order entry and sales history records. Eleven terabytes is equivalent to 2.75 billion pages of text, or enough information to jam 220,000 four-drawer filing cabinets. The EMC hardware is the Symmetrix 3500, capable of storing more than 1 terabyte of data in the space of 17 square feet.
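The equivalences quoted above imply a page size and a cabinet capacity that the article does not state. A quick check backs them out, assuming decimal units (1 gigabyte = 10^9 bytes, consistent with the 11,000-gigabyte figure); the variable names are illustrative.

```python
# Figures taken from the text
warehouse_bytes = 11_000 * 10**9   # 11 terabytes, given as 11,000 gigabytes
pages = 2_750_000_000              # 2.75 billion pages of text
cabinets = 220_000                 # four-drawer filing cabinets

# Implied values (derived here, not stated in the article)
bytes_per_page = warehouse_bytes / pages
pages_per_cabinet = pages / cabinets

print(bytes_per_page)     # 4000.0
print(pages_per_cabinet)  # 12500.0
```

In other words, the comparison assumes roughly 4,000 characters per page and 12,500 pages per cabinet.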

Says Mitch Seigle, EMC senior product manager, "When you look at the key data warehouse requirements from a customer's perspective, what do you see? Often it's folks building a data warehouse focused on short-term efforts, such as source data problems and end users. Many fail to see the long-term issues. What they overlook is a critical element - that of scalability, the ability to quickly grow the data warehouse. You have to remember that success in data warehousing breeds success. It attracts users and more demand, which requires more storage. These don't amount to percentage issues; these are orders of magnitude issues."

He points out that the total storage required includes temporary space and indices as well as data protection schemes. Thus, the actual physical storage could be two and a half to four times the raw storage.

"There are at least three views of storage to consider," says Seigle. "First is the owner of the data, the end user. His or her view is the amount of raw data. The second is the database administrator, who factors in temporary space and index space. The third view is the overall data warehouse, the protection scheme and the mirroring configuration."

He likes to refer to what he calls the "ity" words when suggesting what companies should focus on when evaluating the needs of the data warehouse.

"People should be looking for scalability, availability, manageability and flexibility. The warehouse has to scale well as demand grows. It has to be available when management starts to sees its role as mission-critical. It must be manageable, that is, key users have to be able to get their arms around it. And it must be flexible, for example, in its storage solution, so if necessary you can change platforms."

Overall, Seigle sees a 12- to 18-month time line for building the data warehouse: the first three months talking about data needs and setting up pilot projects, the next six to nine months piloting with early adopters of the technology inside the company, the so-called power users, and 12 months and beyond in implementing the large-scale data warehouse.

Intersolv

With the diversity of legacy systems and platforms holding corporate data from different departments, the role of middleware becomes critical. One company trying to capture the market is Intersolv, Rockville, Md., with its DataDirect SequeLink and DataDirect SmartData, integrated programs that promise a single-point, client/server solution for smooth access to enterprise data.

According to Edward Peters, vice president and general manager of the Data Direct Division of Intersolv, the role of middleware is to help you gain access to the data and return that data to the application, whether it be desktop business intelligence or custom-written applications.

"Today, many customers have at least three different [database management systems] in house. For them, it's useful to have middleware that supports open standards, Java, ActiveX and ODBC 3.5 [Microsoft's Open Database Connectivity]," he says.

To explain middleware, he compares it to the production plant that manufactures the product, moves it to one or many warehouses and then distributes it to the retail outlet. The warehouse is the data warehouse and the store is the end user.

"Middleware allows you access to data in a timely fashion; it should simplify data access. In the model, middleware connects you to the different warehouses where the product is kept. The model is the distributed network where you need some level of connectivity across all those nodes. Something has to sit in the middle. We see middleware serving an expanding role; it won't be enough to just provide access. Middleware [must] join data across different databases and data types. It focuses on a big issue today for data warehousing: Should all the data be in one warehouse or in different places, called data marts," says Peters.