Government Data Mining Systems Defy Definition

In the best of all possible digital worlds, information is king. The universe, nothing but an expanding data warehouse, is completely catalogued by omniscient reference librarians and continuously updated by IT professionals who require no sleep. The ultimate goal is mass customization. Every meaningful relationship in the data is uncovered with sophisticated analytical tools. Every end user interaction is captured. Every transaction between buyer and seller is completely transparent.

By John Makulowich




Refocus on the real world. This grand vision of data mining suddenly narrows. The data is dirty, that is, it can't be used without extensive "cleaning." The end user lacks the numeracy to work intelligently with the tools, the IT department wars with statisticians over access to the data, and the field lacks international standards.

Such growing pains are part and parcel of any emerging computer area, whether IP telephony, wireless telecommunication or data mining. And data mining, according to experts, is a field that holds great promise.

Case in point: The mission of the Department of Agriculture's Rural Development Division is to "improve the economy and quality of life in rural America through financial lending and grant assistance programs." This amounts to funding essential public services like water and sewer systems, housing, health clinics and public utilities as well as investing in businesses and agricultural cooperatives.

It is not a small operation. In fact, the Rural Development Division's lending volume of $60 billion rivals the fourth largest bank in the nation. And its 900 offices nationwide operate 39 loan and grant programs while maintaining relations with lenders, borrowers, governments, businesses and development organizations.

Not surprisingly, the computer systems that house information about these different programs are varied, from mainframes to PCs, and run under a multitude of operating systems, from Virtual Machine and Virtual Memory System to Windows 95.

The division's goal was to link the databases in the mainframes so analysis of community investments by field staff could be done on laptops in seconds rather than in weeks or months. So Rural Development contracted for the creation of a so-called data mining system.

As J. Norman Reid, associate deputy administrator in the Office of Community Development in the division, said, "The DMS is specifically designed to help top USDA policy-makers navigate and access that data so they can quickly and accurately respond to financial questions from Congress and the Clinton administration."

But is it data mining? Not really, said Reid. The Web-driven data warehouse allows number crunching, the kind generally used by banks to analyze the effectiveness of lending programs.

"True, we are not using the term data mining in the most accepted ways. We mean access to the numbers and just being able to pop them up," he said. "We can go farther and correlate, but we are not doing that right now. What we are doing is closer to a data mart, that is, access to different databases on a single screen."

While an extremely valuable effort, one that has been nominated for a Computerworld Smithsonian Award for information technology, the Agriculture Department's data mining system highlights the difficulty in drilling down to a concrete definition.

Turning to the experts begins the quest for a resolution to the never-ending story. A perfect example is the easily retrieved, 31-page tutorial, "Introduction to Data Mining and Knowledge Discovery" (www.twocrows.com).

Prepared by Herbert Edelstein, president of Two Crows Corp., Potomac, Md., and a recognized expert in data warehousing and data mining, the document alerts the reader that the goal of data mining is "to produce new knowledge that the user can act upon."

It quickly shifts into high gear. By the third and fourth paragraphs the casual reader is deeply immersed in a quasi-academic discussion of predictive and descriptive models, the two main kinds in data mining.

In short order come the box diagrams, distinctions and acronyms, such as the difference between data mining and on-line analytical processing — Edelstein said that OLAP can verify hypothetical patterns, while data mining can uncover them — and the six types of data mining models: classification, regression, time series, clustering, association analysis and sequence discovery.
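Of the six model types Edelstein lists, clustering is perhaps the easiest to show in miniature. The following sketch, not drawn from his tutorial, is a toy one-dimensional k-means that groups customer spending figures into segments without any predefined labels, which is what makes it a descriptive rather than predictive model:

```python
# Toy 1-D k-means clustering. The spending figures are invented.
def kmeans_1d(values, k, iterations=20):
    values = sorted(values)
    # Start centroids at evenly spaced sample points.
    step = max(1, len(values) // k)
    centroids = [values[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spending = [12, 15, 14, 95, 102, 99, 510, 480, 530]
centroids, clusters = kmeans_1d(spending, k=3)
```

Run on these nine figures, the algorithm recovers three natural spending tiers on its own; by contrast, an OLAP query could only confirm tiers an analyst had already hypothesized, which is exactly the distinction Edelstein draws.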

In the midst of Edelstein's study, one item stands out: the reference to those who can use data mining tools. "The key point is that data mining is the application of these and other [artificial intelligence] and statistical techniques to common business problems in a fashion that makes these techniques available to the skilled knowledge worker, as well as the trained statistics professional."

And what exactly counts as the skilled knowledge worker?

"What you really need to know to do data mining effectively is the line of business you are in and your data. What matters more is the data, not the mining. The standing joke is that data mining is what IS people call statistics," said Edelstein.

Asked for examples, he pointed to a vendor who was having problems with the printers he was selling.

After drawing a map plot, identifying the areas where problems were occurring, Edelstein said it took two minutes to analyze the data. The bottom line? Users living in cold, dry climates had an ink problem caused by humidity.
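The analysis Edelstein describes can be sketched in a few lines. The regions, unit counts and climate labels below are invented for illustration, not his actual data; the point is that simply aggregating trouble reports by climate, rather than by region, is enough to surface the pattern.

```python
# Hypothetical printer trouble reports, one row per sales region.
reports = [
    {"region": "Denver",  "climate": "cold-dry", "units": 400, "failures": 48},
    {"region": "Phoenix", "climate": "hot-dry",  "units": 350, "failures": 9},
    {"region": "Miami",   "climate": "hot-wet",  "units": 500, "failures": 11},
    {"region": "Boston",  "climate": "cold-wet", "units": 450, "failures": 13},
]

# Aggregate units shipped and failures by climate instead of region.
totals = {}
for r in reports:
    units, fails = totals.get(r["climate"], (0, 0))
    totals[r["climate"]] = (units + r["units"], fails + r["failures"])

# Failure rate per climate; the outlier points to the likely cause.
rates = {climate: fails / units
         for climate, (units, fails) in totals.items()}
worst = max(rates, key=rates.get)
```

With these made-up numbers, the cold, dry climate stands out with a failure rate several times the others, which is the shape of the finding Edelstein describes.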

Another rudimentary data mining analysis he cited was the famous one performed by Richard Feynman, the Nobel Prize-winning physicist who uncovered the cause of the 1986 Challenger space shuttle disaster by pointing out the relationship between the temperature and the O-ring failure.

The difficulty in defining the field is echoed by another expert, Christopher Westphal, president and founder of Bethesda, Md.-based Visual Analytics and co-author of "Data Mining Solutions: Methods and Tools for Solving Real-World Problems."

He admits early on that even though data mining has been part of the IT vernacular for several years, people still view the process as magic. "This is partly because no established road maps or procedures have been formally identified for guiding analysts to profitable outcomes," said Westphal.

That is the task he sets for himself in his book. But brush aside the differences in approach among practitioners and the level of understanding required by end users. The critical issue for data mining on which all agree is to correctly define the business problem.

As Edelstein put it: "There are two keys to success in data mining. First is coming up with a good formulation of the problem you are trying to solve. A good, focused problem usually results in the best payoff. The second key is using the right data."

In fact, more often than not, in the consulting work that both Westphal and Edelstein offer, the initial hurdle to data mining is helping the client develop a well-formed business problem.

According to Gregory Piatetsky-Shapiro, director and chief scientist of Boston-based Knowledge Stream Partners, a business question could be wrong in several ways. It could focus on the wrong problem, could be too broad or could overlook certain restrictions.

"An example of focusing on the wrong business problem would be trying to predict which people to contact before their contract expires. If there is a business policy to contact everybody, then it does not make sense to select who to contact, but rather 'what to offer' at the contact," said Piatetsky-Shapiro.

In the case of the too broadly stated question, he offered the example where the problem was defined as "predict attrition of users of product X." On closer examination, he found there were seven different types of attrition.

"Some covered customers we wanted to attrite, that is, fraud, while for valuable customers there were at least two very different types of attrition that required very different models," said Piatetsky-Shapiro.
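Piatetsky-Shapiro's point is that a single binary "attrition" label hides several distinct behaviors. A minimal sketch, with type names and rules invented for illustration, is to label each departing customer with an attrition type first, so each type can get its own model:

```python
# Hypothetical attrition labeling. The field names and the three
# attrition categories are illustrative, not Piatetsky-Shapiro's actual
# seven types.
def attrition_type(customer):
    if customer["fraud_flag"]:
        return "fraud"             # attrition the firm wants to happen
    if customer["competitor_offer"]:
        return "competitive-loss"  # valuable customer being poached
    if customer["usage"] == 0:
        return "silent-churn"      # customer simply drifted away
    return "active"
```

Splitting the label this way is what lets "predict attrition of users of product X" become several well-formed, separately modeled questions.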

In the third case, when dealing with personal data, the user must be aware of legal restrictions, for example, that credit cannot be denied on the basis of sex, age or race. Thus, it would not be valid to come up with a rule, "if race = X, then credit_accept = NO."
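One straightforward safeguard, sketched here with hypothetical field names, is to strip the legally protected attributes from each record before any rules are mined, so a rule such as "if race = X, then credit_accept = NO" cannot be learned in the first place:

```python
# Attributes that, by law, a credit decision may not be based on.
# The set and the record fields below are illustrative.
PROTECTED = {"sex", "age", "race"}

def scrub(record):
    # Keep only the attributes a credit model is allowed to consider.
    return {k: v for k, v in record.items() if k not in PROTECTED}

applicant = {"income": 52000, "debt": 9000, "age": 34,
             "race": "X", "on_time_payments": 0.97}
clean = scrub(applicant)
```

Dropping the fields before mining is cruder than auditing the resulting rules, but it guarantees no prohibited attribute can appear in a model's conditions.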

He suggested that making it easier for users to ask the right business questions requires a combination of business expertise and some domain knowledge, which can be encoded into a mini-expert system that sits on top of a data mining system. In the industry, this is referred to as the solutions framework.

Examples can be found at www.kdnuggets.com/solutions/.

Fast forward to the future, and the evolution of the Internet as the venue for e-business and e-commerce makes data mining seem a natural. For Erick Brethenoux, vice president of Lazard Freres & Co. LLC, New York, and a data warehouse and data mining analyst, the field is moving in at least two major directions: one aligned with knowledge management and the other with greater data visualization techniques.

Brethenoux divides the data mining aspect of knowledge management into categorization and filtering. In the first case, it amounts to finding interesting trends and business intelligence in business data.

For example, in launching a new product, the firm might want to determine if there are existing patents on the product or process and, if there are, whether or not to buy them. Filtering can be combing through information for use in knowledge management.

"With the trend of increasing use of the World Wide Web, data mining opens the whole area of such things as pattern analysis of traffic, allowing you to see who has access to what information," said Brethenoux.

What is even more interesting are the implications for mass customization.

For instance, while the paradigm of mass customization is one-on-one marketing, he said, the reality is to define market segments more tightly than is now done. Thus, while the 19-to-25 age group may be one segment to Procter & Gamble Co., it can really be seen as 20 segments, depending on the categories of product you are talking about.

This whole notion fits with his own segmentation of the data mining market, which he formerly classified along three lines: technologies (neural nets, for example), users and techniques.

"I now segment the market into four different users," said Brethenoux. "There is the end user, who simply approaches the system to get data. Then there is the business analyst, a kind of hybrid, who understands the business function very well and has a background in techniques.

"Next are the statisticians, who know the techniques and the limitations of the technology, but not the business. Last are the scientists and researchers, like those at NASA."