The Many Languages of the Web

The world wide web is just that: world-wide. With the burst of technology in so many countries, the web is rapidly becoming a multi-lingual, multi-cultural space. Though this seems advantageous for building a global community, the coexistence of multi-lingual documents on the web has created problems for search engines. While indexing documents in a single language may be easy enough, indexing an array of documents in different languages, i.e. the internet, has proved problematic for most search engines.

One of the important aspects of filtering the multi-lingual web involves the need to "automatically determine the language(s) in which [documents] are written." The future of search engines in such a global space may depend on ideas that facilitate this process. Identifying the language of a document matters because it would allow more refined searches within a particular language and make the categorization process much quicker. One concept that could play a role in dealing with multi-lingual databases is "Linguini," "a vector-space based categorizer tailored for high-precision language identification."

Linguini, or Language Identification for Multilingual Documents, is a highly scientific approach to identifying languages; its purpose is to provide search engines with a possible tool for categorizing and filtering information in a multi-lingual database. Linguini "could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy." However, Linguini's testing was limited to thirteen western European languages, so its accuracy with languages from other continents cannot be predicted. It is worth noting that because the thirteen languages "share etymological roots and have largely overlapping character sets, they provide a more difficult test of Linguini's classification powers" than a wider set would. This shows how available technology could enable us to identify languages relatively accurately and thereby aid us in categorizing the world wide web as a complete database.
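The exact features Linguini uses are not detailed here, but a vector-space language identifier of this general kind can be sketched with character n-gram frequency vectors compared by cosine similarity. The training snippets, language labels, and function names below are illustrative assumptions, not Linguini's actual design:

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Build a character n-gram frequency vector for a text sample."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_language(document, profiles):
    """Return the training language whose profile is closest to the document."""
    doc_profile = ngram_profile(document)
    return max(profiles, key=lambda lang: cosine_similarity(doc_profile, profiles[lang]))

# Toy training data -- a real system would build profiles from large corpora.
training = {
    "english": "the quick brown fox jumps over the lazy dog and runs away",
    "french": "le renard brun saute par dessus le chien paresseux et s'enfuit",
    "spanish": "el zorro marron salta sobre el perro perezoso y huye corriendo",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}

print(identify_language("the dog runs over the brown fox", profiles))  # english
```

Because etymologically related languages share many character n-grams, their profiles sit close together in this vector space, which is why a test set of thirteen related European languages is a genuinely hard case.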

The other issues in searching multi-lingual databases extend beyond recognizing languages to integrating the web as a whole. This would include searching multi-lingual documents accurately and identifying desired materials. Integration is hindered by a lack of communication, as most information on the internet is accessible only to those who share its language, rather than to everyone in the world who may be interested in the knowledge. Translation and identification are necessary parts of an integrated web in the future. For now, though, our main concerns are streamlining existing processes and properly categorizing and identifying documents in different languages.

Searching the Visual Web

Due to the increasingly visual nature of the internet, content-based image retrieval (or visual information retrieval) has become an important aspect of effective searching. With the advent of high-speed connections and more powerful software, web pages now contain more detailed visual information than ever before. Search engines have failed to properly categorize and index these visual images, preventing users from finding the information they desire. This stems from the fact that the existing technology most search engines use is incapable of interpreting visual images, as it was designed to analyze text. This gap has prompted various innovators to design methods of harnessing databases of visual information in hopes of translating those technologies to the web at large. Several competing technologies have emerged, some attempting to interpret visual images by translating them into words, while others attempt to analyze the visual content itself. Though some methods enjoy a degree of success, the process of accurately categorizing and retrieving images from the web remains a challenge.

IBM's Query by Image Content (QBIC)

One of the first attempts at a visual search engine was IBM's QBIC, which used a simple, text-based method to index images. QBIC searches still images and videos based on shape, texture, and sketches, as well as other descriptors. In its Imagebase at the Fine Arts Museums of San Francisco, QBIC used an interesting method to facilitate the indexing of works: the "word soup." The "soup" contains both the standard information about a work and an array of connected information revealed through free-form descriptions of the image. "This free form text is created by having knowledgeable museum volunteers describe the works in a 60-second stream-of-consciousness session, leaving out art and museum jargon." As could be expected, this kind of categorization yields both useful and completely irrelevant results due to the random nature of the information used. This "word soup," however, allows the use of existing text-based search technologies to both categorize and retrieve visual images, without any direct interpretation of raw visual data. Though QBIC does not use the "word soup" for all its applications, it still tends more toward verbal than visual evidence.
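The "word soup" approach amounts to pooling catalog data and free-form descriptions into one bag of words and indexing it with ordinary text-search machinery. A minimal sketch, with hypothetical artwork IDs and descriptions standing in for real museum records:

```python
from collections import defaultdict

# Hypothetical "word soup" records: standard catalog terms plus free-form
# volunteer descriptions, pooled into one searchable string per work.
artworks = {
    "work_001": "oil painting landscape green hills sheep clouds pastoral calm",
    "work_002": "portrait woman red dress stern gaze dark background formal",
    "work_003": "landscape river boats sunset orange warm reflective water",
}

def build_index(records):
    """Map each word in the soup to the set of works that mention it."""
    index = defaultdict(set)
    for work_id, soup in records.items():
        for word in soup.split():
            index[word].add(work_id)
    return index

index = build_index(artworks)
print(sorted(index["landscape"]))  # ['work_001', 'work_003']
```

The same inverted-index structure that powers text search thus retrieves images, but only as well as the words volunteered about them, which is why the results mix the useful with the irrelevant.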

QBIC in Practice

In a demo of QBIC that searches a database of the U.S. stamp collection, both the strengths and weaknesses of the system can be seen. For instance, searches under keywords such as "animal" or "love," words for which specific groups of stamps exist, return on-target results.

Search "animal":

Search "love":

However, searches under keywords that do not divide the stamps so neatly return rather undesirable results.

Search "rose":

These results appear because the color of the stamp is defined as "rose pink." Here, QBIC reveals the fundamental flaw in trying to interpret visual data in words alone. The confusion between the visual nature of an object and its content (which may share the same words despite being unrelated) makes searching for images that much more challenging.
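The "rose" failure can be reproduced with a toy model: if the search term is matched against every metadata field, a stamp whose color is recorded as "rose pink" matches as well as a stamp actually depicting roses. The records and field names below are hypothetical, not QBIC's schema:

```python
# Hypothetical stamp records modeled on the demo: "rose" appears in a
# color field for stamps 2 and 3, not in their subject.
stamps = [
    {"id": 1, "subject": "flowers rose garden", "color": "red"},
    {"id": 2, "subject": "airmail biplane", "color": "rose pink"},
    {"id": 3, "subject": "presidential portrait", "color": "rose pink"},
]

def search_all_fields(term, records):
    """Naive search: match the term anywhere in the metadata."""
    return [s["id"] for s in records
            if any(term in value for value in s.values() if isinstance(value, str))]

def search_subject_only(term, records):
    """Restricted search: match only the subject field."""
    return [s["id"] for s in records if term in s["subject"]]

print(search_all_fields("rose", stamps))    # [1, 2, 3] -- two false positives
print(search_subject_only("rose", stamps))  # [1]
```

Restricting the match to a subject field avoids the false positives, but only if the indexer has separated what an image depicts from how it looks, which is precisely the distinction word-based indexing blurs.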

Virage's Image Search Engine Library: Image (System)-Building

Virage is another pioneer of the image search engine, though its methods are quite different from QBIC's. Virage's categorization and analysis depend on software-based analysis of the images, as opposed to human indexes. Virage categorizes images based on major descriptors such as texture, color type and distribution, shape, and structure. Unfortunately, this visually-based analysis limits searches because it disregards actual content. The kind of analysis described, though stringent, fails to interpret the subject of an image. For example, a user who wants to find a "similar image" to a map of city streets would be just as likely to find the layout of a computer chip as another street map. This and other undesired results stem from the fact that many images may share core visual properties while depicting completely different objects. Virage and QBIC together show how important both content information and visual information are to accurately searching the visual web.
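Virage's actual algorithms are proprietary, but descriptor-based matching of this kind can be sketched with a coarse color histogram compared by histogram intersection. The pixel data below is invented to show how a street map and a circuit layout, both mostly light gray with dark lines, score as highly similar despite depicting entirely different things:

```python
def color_histogram(pixels, bins=4):
    """Quantize RGB pixels into a coarse, normalized color histogram."""
    step = 256 // bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    return [count / len(pixels) for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the color mass the two images share."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Invented pixels: light-gray background with dark lines, for both images.
street_map = [(200, 200, 200)] * 90 + [(30, 30, 30)] * 10
circuit_layout = [(210, 210, 210)] * 88 + [(20, 20, 20)] * 12

similarity = histogram_intersection(color_histogram(street_map),
                                    color_histogram(circuit_layout))
print(round(similarity, 2))  # 0.98
```

A similarity of 0.98 between a map and a chip layout is exactly the failure mode described above: the descriptors capture how the images look, not what they depict.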

Excalibur's Visual Retrievalware: Neural-Logical Image Management

The next level of searching methods involves processes that more closely mirror the brain's interpretation of images. Excalibur's content-based retrieval products use neither traditional text retrieval methods nor stringent visual criteria. Their neural network technology uses a method they call Adaptive Pattern Recognition Processing (APRP). These systems "develop an increasingly rich essential notion of objects by analyzing many instances of these objects at different angles or renditions." This suggests a deeper analysis of the object than its mere physical qualities. APRP attempts to analyze patterns in both text and images, thereby bypassing many of the minor details, such as misspellings, that throw off traditional text-based searches. APRP is modeled on the way biological systems use neural networks to process information. This design requires a proper sample set to calibrate the neural network at the outset, so that it becomes familiar with the kinds of patterns to look for. Its major drawback, therefore, is that its success depends on that initial sample set: a poorly chosen one would greatly reduce its success in future searches. Though these problems cannot be immediately solved, the software's attempt to integrate knowledge from both surrounding text and images speaks to the bigger picture of the nature of the visual web.
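APRP's internals are not public, so as a generic stand-in for sample-set-calibrated pattern recognition, a nearest-prototype classifier makes the dependence on training data concrete. The feature names and labels below are invented for illustration:

```python
def train_prototypes(samples):
    """Average the feature vectors of each labeled class into a prototype.

    The quality of these prototypes -- and of every later match --
    depends entirely on how representative the training samples are.
    """
    prototypes = {}
    for label, vectors in samples.items():
        dims = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors)
                             for i in range(dims)]
    return prototypes

def classify(vector, prototypes):
    """Assign the label of the nearest prototype (squared Euclidean distance)."""
    def dist(proto):
        return sum((a - b) ** 2 for a, b in zip(vector, proto))
    return min(prototypes, key=lambda label: dist(prototypes[label]))

# Invented features, e.g. (edge density, brightness). A skewed sample set
# would shift the prototypes and degrade every subsequent match.
samples = {
    "text_page": [(0.9, 0.8), (0.85, 0.75), (0.95, 0.85)],
    "photograph": [(0.3, 0.5), (0.25, 0.45), (0.35, 0.55)],
}
prototypes = train_prototypes(samples)
print(classify((0.88, 0.8), prototypes))  # text_page
```

If the "photograph" samples above were all unusually bright, its prototype would drift toward the text-page region and misclassifications would follow, which is the calibration drawback described above in miniature.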

NEC's Advanced Multimedia- Oriented Retrieval Engine (Amore)

In a similar vein, NEC's attempt at an image retrieval system also borrows from human methods of comprehension. NEC's system, though a lesser-known image-based search engine, has received high marks from the Getty Information Institute in the institute's research to "improve users' attempt to find cultural material on the Web." Marty Harris of the institute's Technology, Research and Development Group remarked, "What set Amore apart was the algorithms: when you used it you got results back that felt good, not unsettling as with some other tools." Amore uses content-oriented image retrieval (COIR), which essentially, according to Yoshi Hara, Amore's product manager, "recognizes objects within an image and compares those objects for similarity in the same way that humans perceive and translate images into meaningful data." It mainly supports color and shape feature extraction; these limited descriptors may nonetheless allow it greater depth of analysis and comparison. Though limited, Amore follows today's trend of interpreting and categorizing images with more advanced methods of analysis and pattern-building in order to better harness the visual web.

Conclusion

Overall, the visual web has created many challenges for search engines. The technology is by no means refined: a search for "apple" on Google's image search is just as likely to turn up pictures of New York City (the Big Apple) or Apple computers as images of the fruit itself. This is to be expected, however, since even text retrieval has its fair share of random results. As the web becomes more visual, it is increasingly urgent to devise accurate ways of searching these images and video clips, so that users can find what they're looking for.