The Profitable Search for Search Engines

Editor's Note: This is the first of a two-part article examining Internet search engines, what they can provide, and the growing business in creating them.

Kent Summers (Electronic Book Technologies Inc.; kjs@ebt.com) wasn't just having a bad hair day when he began a presentation about the Internet recently with the pronouncement, "Surfing sucks." Rather, he was echoing the frustration Internet searchers feel when probing cyberspace looking for that particular piece of data or item of information. Finding specific documents is difficult enough; exploring collections, libraries or servers is nearly impossible with the current crop of Internet search engines.

This challenge of creating finely tuned engines to explore and exploit the information richness of the Internet -- not just the World Wide Web, but discussion groups, news groups, et al. -- confronts not only software developers but also document producers. It's also attracting serious attention from entrepreneurs and big business. Beacons broke the fog a short time ago when Microsoft signed a non-exclusive license to use one of the more popular engines, Lycos.

What's out there now? Even for those familiar with the Internet, the current crop of search engines, at times referred to as spiders, wanderers and robots, bears arcane names: Lycos (the first five letters of the Latin name for the wolf spider family, Lycosidae; URL: http://lycos.cs.cmu.edu/); JumpStation II (URL: http://js.stir.ac.uk/jsbin/jsii); World Wide Web Worm (URL: http://www.cs.colorado.edu/home/mcbryan/WWWW.html); WebCrawler (URL: http://webcrawler.cs.washington.edu/WebCrawler/); NIKOS (URL: http://www.rns.com/cgi-bin/nikos) and DA-CLOD (Distributedly Administered Categorical List of Documents; URL: http://schiller.wustl.edu/DACLOD/daclod).

These engines allow you to search for information many ways -- some search the titles or the headers of documents, such as WWW home pages, others search documents and still others search indexes or directories. Many offer subsystems to manipulate and manage the information you get.

Lycos™, for example, which bills itself as "the catalog of the Internet," trawls the Web, Gopherspace and FTP archives daily and creates a database of all the Web pages it uncovers. The index of the database is updated each week. Its search engine, called Pursuit, presents "probabilistic retrieval from this catalog, taking a user's query and returning a sorted list of hits (the list is sorted by match score, and only documents with scores above the threshold are retrieved)."
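The idea of returning a list sorted by match score, cut off at a threshold, can be illustrated with a minimal Python sketch. This is not the Pursuit algorithm, whose scoring the description above does not spell out; the score here is simply the fraction of query words found in a document, and the catalog entries, example.org URLs, function names and the 0.3 threshold are all hypothetical.

```python
def score(query, text):
    """Fraction of distinct query words that appear in the text (0.0 to 1.0).

    A stand-in for a real match score; assumed for illustration only.
    """
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def search(query, catalog, threshold=0.3):
    """Return (score, url) pairs scoring above the threshold, best match first."""
    hits = [(score(query, text), url) for url, text in catalog.items()]
    hits = [hit for hit in hits if hit[0] > threshold]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical catalog: URL -> text gathered from the page.
catalog = {
    "http://example.org/ftp": "ftp archive listing",
    "http://example.org/web": "world wide web robots",
    "http://example.org/spiders": "wolf spider hunting habits",
}

results = search("wolf spider web", catalog)
for match_score, url in results:
    print(f"{match_score:.2f}  {url}")
```

In this sketch the page matching two of the three query words is listed first, the page matching one word comes second, and the page matching none falls below the threshold and is dropped.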

This Lycos site, administered by Michael L. Mauldin (fuzzy@cmu.edu), contains references to 3.85 million Web pages out of his estimate of more than 5 million Web documents. This does not include, as he notes, pages inside databases such as the Library of Congress, the Human Genome Database or WAIS indexes.

A newer set of search tools developed by Mike Schwartz is the Harvest Information Discovery and Access System (URL: http://harvest.cs.colorado.edu/). Described as an "integrated set of tools to gather, extract, organize, search, cache and replicate relevant information across the Internet," this system, so developers claim, allows users -- with only modest effort -- to "tailor Harvest to digest information in many different formats, and offer custom search services on the Internet."

The home page has hypertext information on such subjects as demonstrations and useful indexes, technical discussion, user's manual, FAQ, papers, talks, press release, HPCC blue book pages, getting the software and Harvest team contact information.

If you are interested in trying out the different search engines, you can find most of them at the heavily trafficked Yahoo (http://www.yahoo.com/Reference/Searching_the_Web/). Those who want to track this subject in finer detail should consider following patent announcements available via the Web at Source Translation & Optimization's (STO) Internet Patent Search System (URL: http://sunsite.unc.edu/patents/intropat.html) and through its mailing list (send the word, News, to: patents@world.std.com). The STO list offers a weekly mailing of all patents listed in the most recent issue of the USPTO Patent Gazette as well as other valuable services.

Further, there is the discussion group on technical aspects of WWW robots (E-mail robots-request@nexor.co.uk; type the words, "subscribe", "help", and "stop" on separate lines in your message) as well as news groups, for example, the comp.infosystems.www.* hierarchy, which includes comp.infosystems.www.announce, comp.infosystems.www.misc, comp.infosystems.www.providers, comp.infosystems.www.servers.* and comp.infosystems.www.users.

Part 2 will address what's ahead in search engine development.

