Data Mining


By John Makulowich

Nearly five years ago, 50 researchers who had taken part in a Knowledge Discovery and Data Mining conference workshop received Gregory Piatetsky-Shapiro's electronic newsletter, Knowledge Discovery Nuggets, once a month.

The newsletter's readership has since grown to 4,000. Its frequency has increased to two to three times per month, and the community of knowledge discovery and data mining professionals has blossomed. And data mining, the use of statistical methods with computers to uncover useful patterns inside databases, continues to attract more and more attention in the business and scientific communities.

"We are clearly moving into the next generation of data mining systems, toward more and more embedded solutions. The early data mining efforts were research driven; they did one data mining task," said Piatetsky-Shapiro, currently editor of the newsletter and director of applied research at Knowledge Stream Partners, a data mining and customer modeling company based in Chicago.


Knowledge Stream Partners photo

Gregory Piatetsky-Shapiro, editor of Knowledge Discovery Nuggets and director of applied research at Knowledge Stream Partners

The third generation of data mining systems offers complete data mining solutions that talk the language of business and target the end user. The first and second generations not only required an understanding of often arcane research tools used by professionals trained in statistical techniques, but were also clearly directed at the data analyst and technical power user.

For Piatetsky-Shapiro, embedded solutions mean hiding the data mining agents and allowing business analysts, marketing personnel and salespeople to focus on their specific skills rather than having to learn sophisticated statistical tools.

"Generally, data mining tools are still not oriented to the business analyst, to the end user. The marketing people want to know what customers to target for telemarketing or direct marketing and the reasons why the customer is likely to do one thing and not another," he said.

The marketing people "do not care how to interpret a data mining algorithm; they don't want to look under the hood of the car, as it were." Data mining can answer many of the marketers' questions, but the algorithms should be hidden, said Piatetsky-Shapiro.

He is quick to point out that embedding the agents does not address another critical issue: accurately analyzing and interpreting the results of data mining. The industry is rich with stories about how the misinterpretation of operational data in data marts led business users far astray from their sales goals. Still, unlike many other fields, the data mining tasks may remain an in-house activity.

"There is a trend to outsourcing in data mining. One reason is the difficulty of finding professionals who understand the intricacies of data mining. But outsourcing is a temporary solution because most firms want to keep it in-house, if only for competitive reasons. Besides, a lot of critical knowledge is domain specific, such as telecommunications, and requires a strong background if not industry experience," said Piatetsky-Shapiro.

Over the last two to three years, there has been more emphasis by developers on integration, visualization and data access, said Piatetsky-Shapiro, who has started a new section on his World Wide Web page for these integrated multitask tools. "It's the fastest growing section with more than 20 tools of this type," he said.

Within the next one to two years, Piatetsky-Shapiro sees a round of mergers, acquisitions and consolidation among major players, such as IBM and Silicon Graphics, and the developer boutiques that are producing the dizzying array of new data mining solutions.

He also sees the role of training taking on more importance, at conferences and on CDs. He himself will serve as general chair for the fourth annual international conference on Knowledge Discovery and Data Mining - KDD-98 - set for New York City in August.


AbTech photo

Gerard Montgomery, president and CEO of AbTech Corp.

One person on whom Piatetsky-Shapiro's observations are not lost is Gerard Montgomery, president and CEO of AbTech Corp., Charlottesville, Va. The company has been fielding data mining applications for more than nine years. Its government success stories include work for the Agriculture Department in developing real-time models for estimating grasshopper infestation, and for the Defense Advanced Research Projects Agency and the U.S. Air Force in target recognition.

In the Agriculture Department's case, the agency replaced rules in an expert system with AbTech's ModelQuest Statistical Networks, which recommend treatments for grasshopper infestation. The Statistical Network models reduced system run time from 20 minutes to seconds, allowing the system, called Hopper-Lite, to be taken into the field.

AbTech boasts over 4,000 clients, including the U.S. Department of State, University of Virginia's Health Sciences Center, DuPont, Dow Chemical, Moody's and Oppenheimer Capital Investments. Montgomery remembers when data mining consisted of neural networking and artificial intelligence. Then the market, as he puts it, went south last year.

"Even if you developed a successful app [application], you had to change the way you worked the data. To get the app working required organizational change; that was the hard part. The staff did not have basic data systems and lacked infrastructure. And AI [artificial intelligence] tools were too complicated. Most organizations didn't understand the value of database systems. Data warehousing changed all that; it made executives change business processes to take advantage of what could be learned from analysis," said Montgomery.

He sees the same type of recognition now dawning on executives: that data mining can offer substantial value to organizations. The next key issue is making the information it uncovers actionable, that is, having it directly affect what the organization is trying to accomplish. One way is through direct marketing, a new direction in which he is steering his own company. He feels that data mining tools in themselves are highly overrated right now and that, aside from his own company, very few are turning a profit. AbTech, a privately held company, does not release its financial results, but Montgomery said the firm hired 10 people last year, increasing its staff to 35 employees.

"Our primary focus is shifting to direct marketing. The chief information officers were a hindrance to making sales. They were knee-deep in alligators with data warehouses. The majority of CIOs are still holding back. We want to offer a total solution to the customer, address everything from data analysis all the way to acquiring and retaining clients. The person who is going to drive the requirement for our products is someone like the vice president of marketing," said Montgomery.

Both the emerging interactive markets being opened by the Internet and CD-ROMs and the hard numbers back up his decision to make this strategic change. According to the New York-based Direct Marketing Association, direct marketing in 1995 generated an estimated $594.4 billion in consumer sales and $498.1 billion in business-to-business sales.

A report recently released by the association found that "enhanced information availability leads to potential changes in the customer's ability to find information about goods and services, the time required for marketers to notice and react to market changes, and the relationship between corporate size and marketing clout." In this dynamic, a company's new media application is akin to a "magnet," where the objective is to attract customers, said the report, "Marketing in the Interactive Age."

Further, new media direct-selling applications require robust back-end capabilities such as fulfillment, customer service, database marketing, response analysis, and customer segmentation, the report said. The instant gratification afforded by new media will stress the timely delivery of products (both informational and physical).

According to the report, "this technology can allow manufacturers to directly bypass intermediaries in the supply chain - such as wholesalers, retailers, and, most importantly, direct marketers. The management of new media's disintermediation effects is one of the most challenging issues that companies will face."

Montgomery's approach is to focus on departmental and division data marts, where managers can show return on investment quickly. He currently has three beta sites, all catalogers, for whom he is doing initial modeling. In fact, the company is generating 50 percent of its revenues from such business services. He strongly believes that, given the way the market is moving, many data mining firms will need to become vertical if they are to be successful.

Just last month, AbTech debuted its ModelQuest MarketMiner at the Direct Marketing Association's Annual Conference in Chicago. The product is a high-end data mining tool for direct marketers scheduled for release early in 1998. Its purpose is to help direct marketing professionals increase customer acquisition, retention and value by creating data models that highlight the most likely and most profitable buyers of new products and services.
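Grounded only in the product description above - a tool that builds models to rank the most likely buyers - the sketch below shows, in generic terms, what such a response model can look like. It is not ModelQuest MarketMiner or AbTech's algorithm; the logistic-regression approach, the feature names and the synthetic customer data are all illustrative assumptions.

```python
# Illustrative propensity-model sketch, not AbTech's method: rank prospects by
# predicted likelihood of responding to a direct-marketing campaign.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical customer history: [orders last year, months since last purchase].
X = rng.uniform([0, 0], [20, 24], size=(500, 2))
# Synthetic "responded to the mailing" flag: frequent, recent buyers respond more.
p = 1 / (1 + np.exp(-(0.3 * X[:, 0] - 0.2 * X[:, 1])))
y = rng.random(500) < p

# Fit a simple response model on the historical campaign.
model = LogisticRegression().fit(X, y)

# Score a fresh prospect list and rank it -- the "actionable" output a
# marketing vice president would actually use.
prospects = np.array([[15, 1], [2, 20], [8, 6]])
scores = model.predict_proba(prospects)[:, 1]
for (orders, months), score in sorted(zip(prospects.tolist(), scores),
                                      key=lambda t: -t[1]):
    print(f"orders={orders:>3}  months_since_purchase={months:>3}  "
          f"response score={score:.2f}")
```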

"Data mining companies will need to form partnerships and team up with firms that can provide the expertise and information they lack. On the other hand, an interesting example of what is happening is the company, Capitol One," said Montgomery.

A Falls Church, Va., holding company, Capital One Financial Corp. owns Capital One Bank, one of the major issuers of MasterCard and Visa credit cards in the country. Yet the description on its Web page profiling the company sounds very much like a firm exploiting the potential of data warehousing and data mining for direct marketing, among other activities.

Thus, "Capital One has grown dramatically due to the success of its proprietary information-based strategy, which combines advances in information technology and sophisticated analytical techniques to identify, manage and take advantage of business opportunities."

Montgomery's point is that here is a company ostensibly in the credit card business that sees itself as an information technology company, using data mining techniques for marketing. He feels this may be a more familiar scenario in the future.

A perspective from deep inside the academic, business and developer communities comes from George H. John, who carries the typically California business title of data mining guru at Epiphany Marketing Software of Mountain View, Calif.

A former IBM consultant, John received his doctorate from Stanford University earlier this year with a thesis on data mining titled, "Enhancements to the Data Mining Process." In it, he described the data mining process and presented advances and novel methods for the six data mining steps: extracting data from a database or data warehouse, cleaning the data, data engineering, algorithm engineering, data mining and analyzing the results.
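The six steps are named but not spelled out here, so the sketch below lays them out as a bare-bones pipeline to make the sequence concrete. It assumes nothing from the thesis beyond the step names; the tiny in-memory customer table, the derived feature and the threshold "model" are hypothetical stand-ins.

```python
# Illustrative sketch only: a generic pipeline organized around the six steps
# named above (extract, clean, data engineering, algorithm engineering,
# data mining, analysis). The data and the "model" are invented for the example.
import sqlite3
import statistics

def extract(conn):
    """Step 1: pull raw rows out of the operational database or warehouse."""
    return conn.execute("SELECT age, orders, responded FROM customers").fetchall()

def clean(rows):
    """Step 2: drop records with missing or impossible values."""
    return [r for r in rows if all(v is not None for v in r) and r[0] > 0]

def engineer_data(rows):
    """Step 3: derive a feature the algorithm can use (orders per year of age)."""
    return [(age, orders, orders / age, responded) for age, orders, responded in rows]

def engineer_algorithm(rows):
    """Step 4: pick and tune the method; here, a cutoff taken from the data."""
    return statistics.median(r[2] for r in rows)

def mine(rows, cutoff):
    """Step 5: apply the method and surface a pattern worth reporting."""
    heavy = [r for r in rows if r[2] >= cutoff]
    light = [r for r in rows if r[2] < cutoff]
    rate = lambda group: sum(r[3] for r in group) / max(len(group), 1)
    return {"heavy_buyer_response": rate(heavy), "light_buyer_response": rate(light)}

def analyze(result):
    """Step 6: turn the raw finding into something a marketer can act on."""
    lift = result["heavy_buyer_response"] - result["light_buyer_response"]
    print(f"Response lift for heavy buyers: {lift:+.2f}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (age INTEGER, orders INTEGER, responded INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(34, 10, 1), (51, 2, 0), (28, 7, 1), (45, 1, 0), (39, 12, 1)])
    rows = engineer_data(clean(extract(conn)))
    analyze(mine(rows, engineer_algorithm(rows)))
```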

He agrees with Piatetsky-Shapiro about the trend toward embedding the data mining tools in complete business solutions.

"Successful technology becomes invisible. I heard [Oracle CEO] Larry Ellison give a talk the other day and that was one of his points. He raised the question, What if buying a car were like buying an information system? You walk in and say you want a four-door car. The guy behind the counter yells back, 'Charley, get the lady four doors!' then says, 'No problem - what else do you want?' The point is that people need a complete solution and vendors aren't doing a great job of giving it to them right now," John said.

"We're now in the generation of "in your face" data mining products. You run the program and your screen is filled with menus and buttons that are all telling you, 'OK, we're going to do some data mining. Now precisely what do you want to do?' The next generation will be 'behind the scenes' data mining. There'll be some great technology working for you, but you won't even know it's there."

For him, data warehousing and data mining are simply about corporations remembering and learning from their collective experience. The problem, as everyone knows, is that there is just too much data, too many experiences generated each day for the people in the company to comprehend. For example, the marketing manager conducting a direct mail campaign is overwhelmed with data on the customers contacted and the patterns in who is responding.

While many have recognized that revenues in data mining are not making Wall Street jump and shout, John feels this is more the result of packaging and timing than of a real problem with data mining technology or philosophy.

"There are two problems with the packaging of data mining tools today. First, vendors are building tools for propeller heads, not business users. Second, they're only building a single slice of a solution, relying on the customers to put it all together themselves or hire expensive consultants to do it for them," he said.

While the explosion in database sizes at many corporations made the time seem ripe for data mining, he said, the reality is that it is very hard to mine operational data. "You usually need to start with a data warehouse or data mart, and many companies are just starting to build these," said John.

His firm, Epiphany Marketing Software, is building enterprise automation software for marketing. By focusing on marketing, he said, the firm can deliver a very useful package of tools, not just raw, untailored technology. Clarity, its first product, includes a marketing data mart and a Web-based query tool for access to the data. After access comes analysis and action, the themes of the next products that John is working on.

One beta customer used Clarity to discover that a trade show their marketing department was pushing as a "must-attend" was indeed generating a lot of leads, but no sales.


Epiphany Marketing Software photo

George H. John of Epiphany Marketing Software

Though simple in principle, building a data mart is very hard to do well, said John. "The company has some very bright computer scientists doing nothing but 'architecting' the data mart, developing connectors to existing operational databases and writing algorithms to efficiently pre-compute answers to queries during nightly runs," said John, who believes that pre-computation is essential.

If a marketer wants to see the sales of all products over the past four quarters broken down by distributor, he wants to get the answer when he thinks of the question, not several minutes later, John said.

"It's not as if you can put up an explanation about how hard it is to do four outer joins and have them read that while the query is running. So we precompute answers overnight after the data is loaded. Then the answers are all just waiting for the questions to be asked. This sets us up well for adding data mining, to automatically ask tons of questions and see which ones have interesting answers," said John.

To his way of thinking, access to marketing data about how products are selling and how campaigns are going is either nonexistent, available only by word of mouth or contained in reports that come a month after you need them. It's the reason why marketers love the World Wide Web.

"All the data is there in real time if you've done a good job of constructing your site and you're working with the right Web advertising companies. But there's no reason in principle why other communication channels have to have such long lag times - it's just a function of the stages in the channel [manufacturer - distributor - retailer] not having good tools to collect and trade information," said John.


Current Analysis photo

Mike Schiff, principal analyst for data warehousing strategies at Current Analysis

Coming at the issues of data mining and data warehousing from the perspective of a big company is Mike Schiff, former executive director of data warehouse and decision support for Oracle Global Public Sector.

Now principal analyst for data warehousing strategies at Current Analysis of Sterling, Va., a firm offering expert analysis of information technology news, Schiff sees a number of trends as the industry starts to mature.

For example, data warehousing and data mining will become business critical, with the likelihood that some data mining activities will be outsourced, he said. "We'll also see the movement of products from proprietary to open, and database vendors packaging functions to support data mining."

He sees data warehousing and mining covering not only structured data but any data, such as multimedia. He also sees front-end tools becoming easier to use, along with open access to the data warehouse for any authorized user. And while some argue for the move to data marts and division or departmental databases, Schiff argues for the importance of the bigger picture.

"There is a drastic need for [information services] to lay out the overall architecture, one that covers production, data warehouses and data marts. There will always be a need, for example, to establish policy and develop standards that apply across data warehouses and data marts, whether subject or application specific. Still, data marts are going to happen, even if those in IS feel they are losing control," said Schiff.

A different approach to the market, one that targets scientists and engineers as end users and that advances the Hierarchical Data Format standard for viewing data sets, is taken by Fortner Software, formerly known as Fortner Research, Sterling, Va., with its new product, Noesys.

According to Ted Meyer, Fortner's chief technology officer and a former NASA researcher who was the information architect of the agency's Earth Observing System Data and Information System, the Internet changed the market for science data analysis and visualization software. With access to vast amounts of data and powerful, inexpensive PCs, the only item missing was easy-to-use scientific analysis software that could take advantage of both.

Into the spotlight steps Noesys, which the company claims is the first desktop science data mining application with easy access to Hierarchical Data Format data. With it, users can read, write, edit and analyze data formatted in HDF, a multiobject, open-standard file format developed by the National Center for Supercomputing Applications and accepted by large producers of publicly available science data like NASA.
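For readers who want to see what programmatic access to this kind of file looks like, the sketch below creates and then walks a small hierarchical file. It uses h5py, which reads the later HDF5 revision of the NCSA format rather than the HDF version the article discusses, and the file name, dataset and attribute are invented; it is meant only to suggest the read, inspect and subset operations Noesys exposes interactively.

```python
# A minimal, hypothetical sketch of hierarchical science-data access; it is not
# Noesys and uses HDF5 (via h5py) rather than the HDF format of the article's era.
import h5py
import numpy as np

# Create a small example file so the sketch is self-contained.
with h5py.File("example_granule.h5", "w") as f:
    f.create_dataset("radiance", data=np.random.rand(4, 5).astype("float32"))
    f["radiance"].attrs["units"] = "W/m^2/sr/um"

# Read it back: walk the objects, inspect shapes and attributes, pull out a slice.
with h5py.File("example_granule.h5", "r") as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype, dict(dset.attrs))
    first_row = f["radiance"][0, :]   # subsetting without loading the whole array
    print("first scan line:", first_row)
```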

"While the business community had a protocol or access standard called SQL, which provided a common interface to the collection of data, the science community did not. There seemed no need for generic cross-discipline tools. With the development of data warehousing and the emergence of data mining, the need arises for such tools. Noesys is part of the scientific data mining process, supporting users in developing their data mining needs," said Meyer.

Benefits of a Data Warehouse

Business decisions: improves the decision-making process; provides a basis for strategic planning efforts; improves business decisions in quality and quantity; improves sales metrics; improves trend visibility; improves cost analysis; improves inventory and distribution channel management; improves monitoring of business initiatives.

Data access: improves data availability and timeliness; improves data quality; improves data integration; improves access to historical information; provides easier data access; allows high-performance data mining; allows access to data not previously available; improves data availability for customers.

Costs: reduces staff; identifies lost revenue; optimizes space utilization; reduces inventory; reduces inventory replenishment time.

Productivity: provides access to data without programmer intervention; facilitates elimination of legacy systems; reduces analysis effort; reduces impact on operational systems; reduces manual analysis and data consolidation efforts.

Source: Florida Power and Light Co.

"Our belief is that killer apps are based on and driven by data standards. SQL databases were killer apps in the business community and promoted the data warehousing of information. The killer app for science must be a tool that supports standards and data functionality, one that provides data mining functionality," Meyer said.

Noesys was developed with the intention of allowing a variety of types of plug-ins, a concept familiar to anyone who's ever used a browser to surf the World Wide Web. From Fortner's standpoint, there are three basic types of plug-ins: translators, used to open different types of data structures; editors, to manipulate data; and analyzers, to interpret the data through imagers and visualizing tools.
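To make the three-slot plug-in idea concrete, here is a minimal registry sketch in which translators, editors and analyzers are registered and then chained. The registry, the decorator and the CSV, blank-row and summary plug-ins are hypothetical illustrations, not Fortner's actual plug-in interface.

```python
# A sketch of a three-kind plug-in registry (translators, editors, analyzers);
# names and behavior are invented for illustration.
from typing import Callable, Dict

REGISTRY: Dict[str, Dict[str, Callable]] = {"translator": {}, "editor": {}, "analyzer": {}}

def plugin(kind: str, name: str):
    """Register a function as a plug-in of the given kind."""
    def register(func: Callable) -> Callable:
        REGISTRY[kind][name] = func
        return func
    return register

@plugin("translator", "csv")
def open_csv(text: str) -> list:
    # Translator: turn an external structure into the tool's in-memory table.
    return [row.split(",") for row in text.strip().splitlines()]

@plugin("editor", "drop_blank_rows")
def drop_blank_rows(table: list) -> list:
    # Editor: manipulate the data.
    return [row for row in table if any(cell.strip() for cell in row)]

@plugin("analyzer", "summary")
def summary(table: list) -> str:
    # Analyzer: interpret the data (a stand-in for imagers and visualizers).
    return f"{len(table)} rows x {len(table[0]) if table else 0} columns"

if __name__ == "__main__":
    raw = "a,b,c\n , , \n1,2,3\n"
    table = REGISTRY["translator"]["csv"](raw)
    table = REGISTRY["editor"]["drop_blank_rows"](table)
    print(REGISTRY["analyzer"]["summary"](table))
```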

Fortner plans to release the first Software Developer Kit for translators by December, with the other two coming early next year. Eventually, there are plans to produce a server product targeted to small work groups of scientists operating on the Windows NT platform.

