Web search engines as we know them today appeared after the world
wide web became popular. However, research in fields like information
retrieval began long ago. Today's search engines draw on many results
that originated in such research, which for a long time had no
pressing application. Now the situation is reversed: practical
applications in this area wait on ongoing research, rather than the
other way around.
Information retrieval technology was used in electronic library
catalogues and other forms of databases that needed search capabilities.
The first attempts to search the web used essentially the same methods
as on these traditional data sources, i.e. capturing a small textual
representation of the content and performing expression matching.
An example of the state of early search engines was the World Wide
Web Worm, developed by Oliver McBryan around 1994. He writes
in one of his papers:
The WWW Worm is a resource
location tool. It is intended to locate almost all of the WWW-addressable
resources on the Internet, and provide a powerful search interface
to those resources. Searches can be performed on document titles,
reference hypertext, or within the components of the URL name
strings of documents - for example to locate all mpeg movies in
Finland. WWWW has been accessed 60,000 times in 45 days.
Evidently, the WWWW only performed expression matching of user queries
against a few key components of the web documents it searched. Despite
the author's claims, this was too indiscriminate. As the web grew
too large, a title or a link text was simply not enough to capture
the fine detail of a web page and to provide relevant search results.
Everything that had even a slight connection to the search query, or
perhaps none at all, came through with equal prominence.
This is just one example of the challenges of web searching. Because
the world wide web was different from traditional data sources (it
was hypertext-based, constantly evolving, distributed, and huge),
information retrieval research was renewed and many known methods
were modified specifically for the problem of web searching.
The Theory Behind Web Searching
Search engines for the web all must perform at least these tasks:
finding all documents relevant to a query, a process that we will
call "matching," and sorting the matched documents in some order
of importance, which we will call "prioritizing."
When both are done correctly, we should get results that are relevant
to our query, listed with the most important ones first. Solutions
to these problems required building models that approximate what we
intuitively mean by the vague words "relevance" and "importance."
Note that because of the subjective nature of these problems, the
performance of algorithmic models is often measured against usage
experience (i.e. human evaluations of relevance and importance, or
of whether a search engine is satisfying to use). Often, such models
have variables that can be tuned against experimental data.
Query Matching
To fix some assumptions for convenience, suppose we have a set of
known documents D and a known user query Q, which for all practical
purposes is a short document.
A common method, from which many variations are derived, is the vector
distance model of keyword matching. Documents and queries alike are
represented as vectors based on their word content, so finding
word-content similarity becomes a task of computing some form of
"distance" or "angular distance" between vectors.
For example, we take all the distinct words in all the documents
of D, say there are k of them, and associate each with a unit vector
in a k-dimensional vector space. Each document, then, is a vector
in this space, scaled in each dimension by the frequency with which
the associated word occurs in it. A query is likewise a vector in
the space, and it may even be the zero vector.
Among several possibilities, we may choose the inner product as our
measure of vector similarity. The larger the inner product between
the query vector and a particular document vector, the more relevant
the query is taken to be to that document. Under this definition,
"more relevant" means that the query and the matching document have
more word content in common: more query keywords were found, or the
same keyword was found a greater number of times. For our match
result, we can return the top-scoring documents, as many as desired.
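As a sketch of this model (the documents, tokenizer, and scoring here are illustrative, not taken from any particular system), term-frequency vectors and their inner products can be computed directly:

```python
from collections import Counter

def vectorize(text):
    # Term-frequency vector, represented sparsely as a word -> count mapping.
    return Counter(text.lower().split())

def inner_product(v1, v2):
    # Sum of products of matching coordinates; absent words contribute 0.
    return sum(count * v2.get(word, 0) for word, count in v1.items())

def match(documents, query, top=3):
    # Rank documents by inner product with the query vector,
    # returning up to `top` names of documents with a nonzero score.
    q = vectorize(query)
    scored = [(inner_product(vectorize(text), q), name)
              for name, text in documents.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored if score > 0][:top]

docs = {
    "d1": "mpeg movies in Finland",
    "d2": "movies and more movies",
    "d3": "information retrieval research",
}
print(match(docs, "mpeg movies"))
```

Note that "d2" scores as high as "d1" simply by repeating one query word, which already hints at the weighting problems discussed below.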
This has many problems. Words in documents do not occur independently.
Also, this scheme matches only the exact words given in the query,
which is not reasonable to expect from a user conducting an exploratory
search: we would like synonyms or alternate phrasings to match almost
as well. A method called Latent Semantic Indexing can resolve some
of these issues.
Prioritizing Algorithms
From experience, it seems possible that two documents, while equally
relevant to some search query, do not necessarily carry the same
weight. One may be somehow more authoritative or "important" than
the other. In fact, the less important document may even be search
engine "spam": material consisting of the same raw word content,
but gibberish. It is clear that to construct an ordering of the
documents, we must use information other than word content.
There are many proposed ways to order search results. After all,
some extra information is contained in textual elements such as page
titles and link URLs. But by far the most widely accepted algorithm
comes out of a number of frequently cited papers (see Kleinberg;
Page and Brin) written around 1997 on the topic of link analysis.
They all propose the same idea: that the linkage structure of the
web itself conveys interesting information.
Information retrieval researchers have loosely identified web links
with citations in scientific journals and applied older graph-theoretic
results to this problem. Individual links are said to express "tacit
approval" of what is linked to, which is not always true but, in
aggregate, turns out to be a good enough assumption. In this scheme,
a document with many outgoing links may be called a hub or a directory.
Many pages that we consider "important" fall under this category,
and their outgoing links are often highly valued as well. More
importantly, a document with many incoming links may be called an
authority, because many others tacitly approve of it.
Of the similar ranking systems proposed, such as HITS, IBM's CLEVER,
and PageRank, we look at the last, which is rather easy to describe.
We will assume a known set of documents D and a directed graph G
that represents the linkage structure of the documents. No queries
are considered.
PageRank assigns a number called a rank to each document in D,
based solely on its position in the graph G. A larger number indicates
higher "importance." In the PageRank model, each document has a
definite rank, and this rank is divided evenly among its outgoing
links and contributed to the documents it links to. Therefore, a
document's rank depends on two things: its incoming links, and the
ranks of the pages that provide those links.
The definition of PageRank has a recursive feel. To state a simplified
version of the PageRank system, let r_v denote the rank of document
v, let B_u be the set of documents that point to u, and let N_v be
the number of outgoing links on document v. The rank of a document
u is then defined as:

r_u = c Σ_{v ∈ B_u} r_v / N_v

where the sum is taken over the set B_u of documents that point to
u, and c normalizes the result so that the total rank of all
documents is constant.
If we write down this rank equation for all the documents
in the set D, we can make this mathematically compact by writing
it in matrix form:
r = cAr
where r is a column vector representing the ranks
of all documents in D and A is an adjacency matrix
of sorts. If we number the documents from 1 to n, then the entry
of A in the ith row and jth column is 1/N_j if document j has a
link to i, and 0 if there is no such link.
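To make this construction concrete, here is a sketch that builds A as a nested list from an adjacency-list description of the links (the three-document graph is invented for illustration):

```python
def build_matrix(links, n):
    # links[j] is the list of documents that document j points to.
    # A[i][j] = 1/N_j when j links to i, and 0 otherwise.
    A = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            A[i][j] = 1.0 / len(targets)
    return A

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
A = build_matrix(links, 3)
```

Each column of A corresponding to a document with outgoing links sums to 1, which is exactly the "rank divided evenly among outgoing links" rule stated earlier.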
The goal is to find the rank of every page given a link structure,
which is the same as to find r from the above equation.
The problem has thus been converted into one of finding eigenvectors.
This can be done with numerical methods, starting from some seed
value for the rank vector r. PageRank takes only the "principal
eigenvector" as the rank vector.
In fact, PageRank is a bit more complicated: it adds a damping
term to the above to deal with documents that form isolated internal
loops.
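The iteration can be sketched in a few lines. This is a minimal illustration, not the production algorithm: the damping factor 0.85 is the commonly quoted value, the example graph is invented, and dangling documents (no outgoing links) would need extra handling.

```python
def pagerank(links, n, damping=0.85, iterations=50):
    # links[j] lists the documents that document j points to.
    # Every document here is assumed to have at least one outgoing link.
    rank = [1.0 / n] * n  # seed value: uniform ranks summing to 1
    for _ in range(iterations):
        # The (1 - damping) term is the damping correction mentioned above.
        new = [(1.0 - damping) / n] * n
        for j, targets in links.items():
            # j passes an equal share of its (damped) rank to each target.
            share = damping * rank[j] / len(targets)
            for i in targets:
                new[i] += share
        rank = new
    return rank

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank(links, 3)
```

In this graph, document 2 ends up with the highest rank: it receives all of document 1's rank and half of document 0's, matching the intuition that a heavily linked-to document is an authority.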