Web search engines as we know them today appeared after the world
wide web became popular. However, research in fields like information
retrieval began long ago. Today's search engines draw on many results
that originated in such research, which for a long time had no
pressing application. Now the situation is reversed: practical
applications in this area wait on ongoing research, rather than the
other way around.
Information retrieval technology was used in electronic library
catalogues and other forms of databases that needed search capabilities.
The first attempts to search the web used essentially the same methods
as on these traditional data sources, i.e. capturing a small textual
representation of the content and performing expression matching.
An example of the state of early search engines was the World Wide
Web Worm, developed by Oliver McBryan around 1994. He writes
in one of his papers:
The WWW Worm is a resource
location tool. It is intended to locate almost all of the WWW-addressable
resources on the Internet, and provide a powerful search interface
to those resources. Searches can be performed on document titles,
reference hypertext, or within the components of the URL name
strings of documents - for example to locate all mpeg movies in
Finland. WWWW has been accessed 60,000 times in 45 days.
Evidently, the WWWW only performed expression matching of user queries
against a few key components of the web documents it searched. Despite
the author's claims, this was too indiscriminate. As the web grew
too large, a title or a link text was simply not enough to capture
the fine detail of a web page and to provide relevant search results.
Everything that had even a slight connection to the search query, or
perhaps none at all, came through with equal prominence.
This is just one example of the challenges of web searching. Because
the world wide web was different from traditional data sources (it
was hypertext-based, constantly evolving, distributed, and huge),
information retrieval research was renewed and many known methods
were modified specifically for the problem of web searching.
The Theory Behind Web Searching
Search engines for the web all must perform at least these tasks:
finding all documents relevant to a query, a process that we will
call "matching," and sorting the matched documents in some order
of importance, which we will call "prioritizing."
When both are done correctly, we should get results that are relevant
to our query, listed with the most important ones first. Solutions
to these problems required building models that approximate what we
intuitively mean by the vague words "relevance" and "importance."
Note that because of the subjective nature of these problems, the
performance of algorithmic models is often measured against usage
experience (i.e. human evaluations of relevance and importance, or
of whether a search engine is satisfying to use). Often, such models
have variables that can be tuned against experimental data.
Query Matching
To fix some assumptions for convenience, suppose we have a set of
known documents D and a known user query Q, which for all practical
purposes is a short document.
A common method, from which many variations are derived, is the vector
distance model of keyword matching. Documents and queries alike are
represented as vectors based on their word content, so finding
word-content similarity becomes a task of computing some form of
"distance" or "angular distance" between vectors.
For example, we take all the distinct words in all the documents
of D, say there are k of them, and associate each with a unit vector
in a k-dimensional vector space. Each document, then, is a vector
in this space, scaled in each dimension by the frequency with which
the associated word occurs in it. A query is likewise a vector in
the space, and it may even be the zero vector.
Among several possibilities, we may choose the inner product as our
measure of vector similarity. The larger the inner product between
the query vector and a particular document vector, the more relevant
the query is taken to be to that document. Under this definition,
"more relevant" means that the query and the matching document have
more word content in common: more query keywords were found, or the
same keyword was found a greater number of times. For our match
result, we can return the top-scoring documents, as many as desired.
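As a sketch of this model (the documents, tokenizer, and scoring here are illustrative, not taken from any particular system), term-frequency vectors and their inner products can be computed directly:

```python
from collections import Counter

def vectorize(text):
    # Term-frequency vector, represented sparsely as a word -> count mapping.
    return Counter(text.lower().split())

def inner_product(v1, v2):
    # Sum of products of matching coordinates; absent words contribute 0.
    return sum(count * v2.get(word, 0) for word, count in v1.items())

def match(documents, query, top=3):
    # Rank documents by inner product with the query vector,
    # returning up to `top` names of documents with a nonzero score.
    q = vectorize(query)
    scored = [(inner_product(vectorize(text), q), name)
              for name, text in documents.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored if score > 0][:top]

docs = {
    "d1": "mpeg movies in Finland",
    "d2": "movies and more movies",
    "d3": "information retrieval research",
}
print(match(docs, "mpeg movies"))
```

Note that "d2" scores as high as "d1" simply by repeating one query word, which already hints at the weighting problems discussed below.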
This has many problems. Words in documents do not occur independently.
Also, this scheme matches only the exact words given in the query,
which is not reasonable to expect from a user conducting an exploratory
search: we would like synonyms or alternate phrasings to match almost
as well. A method called Latent Semantic Indexing can resolve some
of these issues.
Prioritizing Algorithms
From experience, it seems possible that two documents, while equally
relevant to some search query, do not necessarily carry the same
weight. One may be somehow more authoritative or "important" than
the other. In fact, the less important document may even be search
engine "spam": material consisting of the same raw word content,
but gibberish. It is clear that to construct an ordering of the
documents, we must use information other than word content.
There are many proposed ways to order search results. After all,
some extra information is contained in textual elements such as page
titles and link URLs. But by far the most widely accepted algorithm
comes out of a number of frequently cited papers (see Kleinberg;
Page and Brin) written around 1997 on the topic of link analysis.
They all propose the same idea: that the linkage structure of the
web itself conveys interesting information.
Information retrieval researchers have loosely identified web links
with citations in scientific journals and applied older graph-theoretic
results to this problem. Individual links are said to express "tacit
approval" of what is linked to, which is not always true but, in
aggregate, turns out to be a good enough assumption. In this scheme,
a document with many outgoing links may be called a hub or a directory.
Many pages that we consider "important" fall under this category,
and their outgoing links are often highly valued as well. More
importantly, a document with many incoming links may be called an
authority, because many others tacitly approve of it.
Of the similar ranking systems proposed, such as HITS, IBM's CLEVER,
and PageRank, we look at the last, which is rather easy to describe.
We will assume a known set of documents D and a directed graph G
that represents the linkage structure of the documents. No queries
are considered.
PageRank assigns a number called a rank to each document in D,
based solely on its position in the graph G. A larger number indicates
higher "importance." In the PageRank model, each document has a
definite rank, and this rank is divided evenly among its outgoing
links and contributed to the documents it links to. Therefore, a
document's rank depends on two things: its incoming links, and the
ranks of the pages that provide those links.
The definition of PageRank has a recursive feel. To state a simplified
version of the PageRank system, let r_v denote the rank of document
v, let B_u be the set of documents that point to u, and let N_v be
the number of outgoing links on document v. The rank of a document
u is then defined as:

r_u = c Σ_{v ∈ B_u} r_v / N_v

where the sum is taken over the set B_u of documents that point to
u, and c normalizes the result so that the total rank of all
documents is constant.
If we write down this rank equation for all the documents
in the set D, we can make this mathematically compact by writing
it in matrix form:
r = cAr
where r is a column vector representing the ranks
of all documents in D and A is an adjacency matrix
of sorts. If we number the documents from 1 to n, then the entry
of A in the ith row and jth column is 1/N_j if document j has a
link to i, and 0 if there is no such link.
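To make this construction concrete, here is a sketch that builds A as a nested list from an adjacency-list description of the links (the three-document graph is invented for illustration):

```python
def build_matrix(links, n):
    # links[j] is the list of documents that document j points to.
    # A[i][j] = 1/N_j when j links to i, and 0 otherwise.
    A = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            A[i][j] = 1.0 / len(targets)
    return A

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
A = build_matrix(links, 3)
```

Each column of A corresponding to a document with outgoing links sums to 1, which is exactly the "rank divided evenly among outgoing links" rule stated earlier.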
The goal is to find the rank of every page given a link structure,
which is the same as to find r from the above equation.
The problem has thus been converted into one of finding eigenvectors.
This can be done with numerical methods, starting from some seed
value for the rank vector r. PageRank takes only the "principal
eigenvector" as the rank vector.
In fact, PageRank is a bit more complicated: it adds a damping
term to the above to deal with documents that form isolated internal
loops.
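The iteration can be sketched in a few lines. This is a minimal illustration, not the production algorithm: the damping factor 0.85 is the commonly quoted value, the example graph is invented, and dangling documents (no outgoing links) would need extra handling.

```python
def pagerank(links, n, damping=0.85, iterations=50):
    # links[j] lists the documents that document j points to.
    # Every document here is assumed to have at least one outgoing link.
    rank = [1.0 / n] * n  # seed value: uniform ranks summing to 1
    for _ in range(iterations):
        # The (1 - damping) term is the damping correction mentioned above.
        new = [(1.0 - damping) / n] * n
        for j, targets in links.items():
            # j passes an equal share of its (damped) rank to each target.
            share = damping * rank[j] / len(targets)
            for i in targets:
                new[i] += share
        rank = new
    return rank

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank(links, 3)
```

In this graph, document 2 ends up with the highest rank: it receives all of document 1's rank and half of document 0's, matching the intuition that a heavily linked-to document is an authority.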