When you become involved in database development and the Web, it is easy to feel overwhelmed by the volume of material
available about products, research, standards, and leading-edge technologies.
The curse of the Web is that, even when we're doing goal-directed Web-surfing,
we often follow links that take us in unanticipated directions. The beauty of
the Web is that, by doing this, we often uncover nuggets. If you're interested
in the latest R&D on information retrieval, indexing, data mining,
parallel processing, content-based queries, or other research topics, you'll
find volumes on the Web.
The Association for Computing Machinery posts the proceedings of its
Special Interest Group on Management of Data (SIGMOD) conference, which
typically include interesting papers on data mining and information retrieval
techniques. Database research long ago moved past the premise that we'll
always deal with queries against structured data. Web search engines are a
concrete example of applied research into information retrieval techniques for
non-structured data. Researchers continue to develop technologies for querying
the diverse mix of data sources that constitute the Web.
WHIRL
The WHIRL search engine was the product of a research project at AT&T Research
Labs (Machine Learning and Information Retrieval Research department). The
project's purpose was to advance the technology for searching with queries
based on textual similarity. Object-relational DBMS (ORDBMS) products from
Oracle and IBM offer extenders for text searching. For example, IBM DB2's Text
Extender supports text indexing using linguistic, precise, dual, and ngram
indexes. Linguistic indexes support synonym and variant searches, and ngram
indexes support fuzzy searches.
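To make the link between n-gram indexes and fuzzy searching concrete, here is a minimal Python sketch. It is not how DB2's Text Extender is implemented; the function names, the trigram size, the padding scheme, and the 0.4 threshold are all assumptions chosen for illustration. The idea is that two spellings of a word share most of their character n-grams even when they differ by a transposition or typo.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams in a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two strings' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_search(term, vocabulary, threshold=0.4):
    """Return vocabulary entries similar enough to term, best match first.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    scored = [(ngram_similarity(term, word), word) for word in vocabulary]
    return [word for score, word in sorted(scored, reverse=True) if score >= threshold]


# A misspelled query still finds the intended index term.
matches = fuzzy_search("retreival", ["retrieval", "indexing", "mining"])
```

A production index would store the n-grams in an inverted index rather than comparing the query against every entry, but the matching principle is the same.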
WHIRL differs from current object-relational DBMS extensions in its ability to query
heterogeneous data sources and treat the Web as a unified database that
supports fuzzy-text searching. WHIRL treats Web pages as data, not hypertext,
and it supports joins between Web pages. The difference between the WHIRL
Search Engine and engines such as Google or Alta Vista is a question of
quantity. The WHIRL Search Engine indexes Web pages at fewer sites, but it has
more information in its indexes.
WHIRL is useful technology for a fuzzy search across heterogeneous sources,
when there are differences in vocabulary. It retrieves information even when
data sources use different names for the same entity. WHIRL does this by
modeling information sources as relations and matching on names. Instead of
joining on identical keys (hard joins), it finds tuples with similar names
(soft joins) and produces a ranked list.
Using WHIRL, the following type of query will return results when one Web page
refers to a software publisher as Microsoft Kids and another
refers to the same publisher as Scholastic/Microsoft:
    SELECT s.game, s.pub, p.name, p.href
    FROM publisher AS p, superkid AS s
    WHERE similar(s.pub, p.name)
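The soft join described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WHIRL's code: it scores names with a simple TF-IDF weighted cosine similarity, and the function names, the weighting formula, the 0.1 threshold, and the "Acme" rows are all made up for the example. Only the Microsoft Kids and Scholastic/Microsoft names come from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a name into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(names):
    """Weight each token by how rare it is across all names (assumed formula)."""
    doc_freq = Counter(tok for name in names for tok in set(tokenize(name)))
    total = len(names)
    return {tok: math.log(1 + total / df) for tok, df in doc_freq.items()}


def vectorize(name, idf):
    """Represent a name as a sparse vector of token weights."""
    counts = Counter(tokenize(name))
    return {tok: count * idf.get(tok, 0.0) for tok, count in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def soft_join(left, right, threshold=0.1):
    """Pair each left name with every similar right name, ranked by score."""
    idf = build_idf(left + right)
    pairs = []
    for a in left:
        for b in right:
            score = cosine(vectorize(a, idf), vectorize(b, idf))
            if score >= threshold:
                pairs.append((score, a, b))
    return sorted(pairs, reverse=True)


# Hypothetical rows standing in for the superkid and publisher relations.
superkid_pubs = ["Microsoft Kids", "Acme Games"]
publisher_names = ["Scholastic/Microsoft", "Acme Learning Software"]
ranked = soft_join(superkid_pubs, publisher_names)
```

A hard join on these names would return nothing, because no two strings are identical. The soft join returns both plausible pairings, with the Microsoft pair ranked first because its names share a larger fraction of their weighted tokens.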
William Cohen, PhD, the principal WHIRL researcher,
is a member of the faculty at Carnegie Mellon University. The WHIRL software
(C++ source code) is available for download.
WHIRL is a powerful solution for searches in the absence of a controlled,
domain-specific vocabulary. (Domain in this context refers to a knowledge
domain such as biochemistry, not a Web domain such as nih.gov or GridSummit.com.)
Domain Vocabularies
Imprecise matches can uncover nuggets, but often we are overwhelmed by the
sheer volume of results. Applying a domain context will address that problem.
For example, an Alta Vista search turned up 922 hits for "impedance
mismatch." The first link was to a page about selecting coaxial cable for
baseband video. The second link was to a page that discussed object
orientation, software and SGML. The first page used a vocabulary such as
resistance, ground-loop interference, signal attenuation, and
"ground-ground-leak-through." The domain of the second page was software
engineering, not video engineering, so the vocabulary included instances,
abstraction, and inheritance. The search would have been more precise if I'd
used exclusion terms and Boolean logic, or if I'd been able to tell the search
engine to use a domain thesaurus to constrain the search to either software
engineering or video engineering. A domain-specific vocabulary enables us to
use a dictionary or thesaurus to restrict querying, indexing, and abstracting
to a standard vocabulary. This simplifies the task of the query processor and
makes it easier to produce precise matches.
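As a rough illustration of constraining a search with a domain thesaurus, the Python sketch below filters hits by how many terms from the chosen domain's vocabulary they use. The two tiny vocabularies borrow words from the pages described above, but the page titles, page text, function names, and the two-term cutoff are hypothetical, and no search engine of the time exposed such an interface.

```python
# Hypothetical miniature thesauri, one set of preferred terms per domain.
DOMAIN_THESAURI = {
    "video engineering": {"resistance", "attenuation", "coaxial", "baseband"},
    "software engineering": {"instances", "abstraction", "inheritance", "object"},
}


def domain_score(text, domain):
    """Count the distinct terms from the domain's thesaurus that occur in the text."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    return len(words & DOMAIN_THESAURI[domain])


def filter_hits(hits, domain, min_terms=2):
    """Keep the titles of hits that use at least min_terms words from the domain."""
    return [title for title, text in hits if domain_score(text, domain) >= min_terms]


# Made-up stand-ins for the two "impedance mismatch" pages.
hits = [
    ("Choosing coaxial cable",
     "Impedance mismatch raises signal attenuation in coaxial cable for baseband video."),
    ("Objects and SGML",
     "The impedance mismatch between object instances, inheritance, and SGML markup."),
]

software_hits = filter_hits(hits, "software engineering")
video_hits = filter_hits(hits, "video engineering")
```

Both pages match the query phrase, but each survives only the filter for its own domain.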
The U.S. National Library of Medicine (NLM) pioneered controlled
vocabularies in the biomedical field. NLM's first controlled vocabulary was
the Medical Subject Headings (MeSH), used for indexing, abstracting,
cataloging, and retrieving bibliographic citations to medical literature.
Since the 1960s, MeSH has been used for computer queries related to medicine,
nursing, dentistry, and veterinary medicine.
Years ago, I worked on a query processor for NLM's Medical Literature
Analysis and Retrieval System (MEDLARS). It queried on Boolean combinations of
MeSH headings and subheadings, an approach that is still used today. By 1999:
MeSH contained 18,000 subject headings and 96,000+ chemical records
MEDLARS contained about 18 million references to medical literature
NLM added 31,000 new citations each month.
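Querying on Boolean combinations of headings and subheadings maps naturally onto set operations over an inverted index. The Python sketch below shows the idea only; it is not the MEDLARS query processor, and the citation records, the "heading/subheading" string convention, and the function names are assumptions made for the example.

```python
# Hypothetical citation records: id -> set of "heading" or "heading/subheading" terms.
CITATIONS = {
    1: {"Asthma/drug therapy", "Child"},
    2: {"Asthma/diagnosis", "Adult"},
    3: {"Diabetes Mellitus/drug therapy", "Child"},
}


def build_index(citations):
    """Invert the records so each term maps to the ids of citations that carry it."""
    index = {}
    for cid, terms in citations.items():
        for term in terms:
            index.setdefault(term, set()).add(cid)
            # A heading/subheading pair is also retrievable by the bare heading.
            index.setdefault(term.split("/")[0], set()).add(cid)
    return index


INDEX = build_index(CITATIONS)


def lookup(term):
    """Return the set of citation ids indexed under a heading or heading/subheading."""
    return INDEX.get(term, set())


# Boolean operators become set intersection, union, and difference.
asthma_and_child = lookup("Asthma") & lookup("Child")
asthma_or_diabetes = lookup("Asthma") | lookup("Diabetes Mellitus")
asthma_not_child = lookup("Asthma") - lookup("Child")
asthma_drug_therapy = lookup("Asthma/drug therapy")
```

Because every citation is indexed with terms from the controlled vocabulary, the query processor never has to guess at synonyms; it only combines sets.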
The successors to the original MEDLARS application include MEDLINE, Internet
Grateful Med, and PubMed, which provides literature searches such as the
example shown in Figure 1. Internet Grateful Med and PubMed are examples of
modern technology and the Web making information widely accessible. (In the
1960s and 1970s, MEDLARS access was restricted to authorized users such as
medical libraries.)
MeSH is an example of a controlled domain vocabulary, but NLM has been
involved in other medical language research and development projects. More
recently, NLM sponsored the development of the Unified Medical Language System
(UMLS) for retrieving information from diverse sources. The objective of UMLS
is to support diverse applications involving digital libraries, patient data,
decision support, bibliographies, and Web information retrieval. Besides MeSH,
UMLS includes other vocabularies, lexical programs, and UMLS Knowledge
Sources. The Knowledge Sources include a semantic network, a meta-thesaurus,
and an information sources map. The UMLS knowledge base includes a dictionary
of concepts that defines core concepts and constraints that guide the browsing
of UMLS Knowledge Sources. Research projects have been measuring the
effectiveness of concept-matching algorithms and using canonical concepts to
index documents, as opposed to using word indexes. Results have been promising
in machine-learning situations, where systems automatically learned
connections between words and concepts in MEDLINE documents. (For more
information about machine learning, visit www.aic.nrl.navy.mil/~aha/research/machine-learning.html.)
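The difference between a word index and a concept index can be shown with a small Python sketch. The phrase-to-concept table, the concept labels, the sample documents, and the function names are hypothetical; they are not drawn from the UMLS Knowledge Sources, and the naive substring matching stands in for the far more careful concept-matching algorithms the research projects evaluated.

```python
# Hypothetical mini-thesaurus: surface phrase -> canonical concept label.
CONCEPTS = {
    "heart attack": "MYOCARDIAL_INFARCTION",
    "myocardial infarction": "MYOCARDIAL_INFARCTION",
    "high blood pressure": "HYPERTENSION",
    "hypertension": "HYPERTENSION",
}


def concepts_in(text):
    """Return the canonical concepts whose phrases appear in the text."""
    lowered = text.lower()
    return {concept for phrase, concept in CONCEPTS.items() if phrase in lowered}


def build_concept_index(documents):
    """Map each concept to the ids of documents that mention any of its phrases."""
    index = {}
    for doc_id, text in documents.items():
        for concept in concepts_in(text):
            index.setdefault(concept, set()).add(doc_id)
    return index


# Made-up document titles that use different words for the same concept.
documents = {
    "a": "Aspirin after a heart attack",
    "b": "Outcomes of myocardial infarction in patients with hypertension",
}
concept_index = build_concept_index(documents)
```

A word index would file these two documents under different terms. The concept index files both under the same canonical concept, so a query for either phrase retrieves both.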
Research continues on new approaches to automated indexing and retrieval from
medical documents. Searches would be more efficient if we had a standard
thesaurus for each domain. For example, a "Cajun-cooking" thesaurus would draw from the
expertise and works of people such as Marcelle Bienvenu and Paul Prudhomme. If
I were coding a page containing Cajun recipes, I might include information
about their origin and nutritional content, and use a tag such as:
<thesaurus> "Cajun Cooking, Nutrition, American History"
Perhaps one day we will see hybrid search engines that are capable of using
a domain-specific vocabulary and name-matching capabilities such as WHIRL.
When asked (in 1999) about integrating domain-specific vocabularies with WHIRL, Dr.
Cohen replied, "So far I've used only general-purpose similarity metrics,
although you can obviously set up the interface so that the queries themselves
are suited to a domain."
Perhaps we will update each thesaurus by a combination of domain expert
input, automated indexing, and machine intelligence. Publishers might
routinely submit new books, magazines, journals, and other publications to
indexing engines. The engines will assist in generating the index for each
publication, and contribute to the statistical databases used to update the
standard thesaurus for each domain. When a term or concept consistently
appears in the literature, it will pass some qualitative measure of acceptance
that can be detected without human analysis. In addition, domain experts would
periodically review the thesaurus for completeness and accuracy.
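One simple way to picture such an automated acceptance measure is a count of how many publications index a term. The Python sketch below is speculative, like the scenario it illustrates: the function name, the three-publication cutoff, and the sample Cajun-cooking index terms are all assumptions, and a real measure would likely weigh sources and time as well.

```python
from collections import Counter


def candidate_terms(publication_indexes, thesaurus, min_publications=3):
    """Return terms missing from the thesaurus that appear in enough publication indexes.

    publication_indexes is a list of term collections, one per publication.
    The min_publications cutoff is an arbitrary illustrative value.
    """
    counts = Counter(term for index in publication_indexes for term in set(index))
    return sorted(term for term, n in counts.items()
                  if n >= min_publications and term not in thesaurus)


# Hypothetical index terms from four newly submitted publications.
new_indexes = [
    {"roux", "gumbo"},
    {"roux", "etouffee"},
    {"roux", "gumbo"},
    {"gumbo", "file powder"},
]
current_thesaurus = {"gumbo"}
proposed = candidate_terms(new_indexes, current_thesaurus)
```

Terms that clear the cutoff would go to the domain experts as proposals during their periodic review, rather than being added blindly.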
An earlier version of this article was supplemental reading for the January
1999 "Database Developer" column in Web Techniques. Since
then, there have been changes to the bibliographic databases at the
National Library of Medicine. Their scope has been increased to supplement
biomedical topics with more life sciences information. NLM retired Internet Grateful
Med in September 2001, but Internet users can now search MEDLINE bibliographic
citations via PubMed. The PubMed databases contain approximately 12 million
citations.
WHIRL, and before that, MEDLARS, made significant contributions to the body
of knowledge about information retrieval strategies. There has been additional
work on searching semi-structured data, and the W3C has published specifications
for XQuery and the Resource Description Framework (RDF). Research continues on
distributed queries,
machine learning, knowledge representation, and ontology-based
computing.
Ken North is an author and consultant. He teaches Expert Series seminars. Contact him at