
Word Prediction

The task of predicting the next word in a sentence might seem irrelevant if one thinks of natural language processing (NLP) only in terms of processing text for semantic understanding. However, NLP also involves processing noisy data and checking text for errors. For example, noisy data can be produced in speech or handwriting recognition, as the computer may not properly recognize words due to unclear speech or handwriting that differs significantly from the computer’s model. Additionally, NLP could be extended to such functions as spell checking in order to catch errors in which no word is misspelled but the user has accidentally typed a word that she or he did not intend. In the sentence “I picked up the phone to answer her fall,” for instance, fall may have been the intended word, but it is more likely that call was simply mistyped. A spell checker cannot catch this error because both fall and call are English words. An NLP algorithm that could catch this error would thus need to look beyond what letters form words and instead attempt to determine what word is most probable in a given sentence.

N-Gram Models
One of the oldest methods for computing the probability that a given word comes next in a sentence is the n-gram model. An n-gram model guesses the next word in a sentence based on the (n - 1) previous words. These models base their guesses on the probability of a given word without any context (e.g., the is a more common word than green and is thus more probable than green if context is ignored) and on the probability of a word given the last (n - 1) words. For example, take the sentence beginning “The four leaf clover was the color...”. Using a bigram model, one would compute P(green | color) and P(the | color) to determine the more probable guess between these two words. Based on this example, one might imagine that the model’s guess would be even more accurate if we computed P(green | The four leaf clover was the color), which conditions on seven previous words and thus makes an 8-gram model. However, such a model would take enormous computing power and a much greater amount of time than the bigram model to compute. Since good estimates can be made based on smaller models, it is more practical to use bi- or trigram models. This idea that a future event (in this case, the next word) can be predicted using a relatively short history (for the example, one or two words) is called a Markov assumption.
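
The approximation described above can be written compactly. In the notation below, w_k stands for the k-th word of a sentence; this symbol is introduced here only for illustration and is not part of the original discussion.

```latex
% Markov assumption for an n-gram model: only the last (n - 1) words matter.
P(w_k \mid w_1, \ldots, w_{k-1}) \approx P(w_k \mid w_{k-n+1}, \ldots, w_{k-1})

% Bigram case (n = 2) applied to the clover example:
P(\text{green} \mid \text{The four leaf clover was the color}) \approx P(\text{green} \mid \text{color})
```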

In order to make these predictions about the next word in a sentence, the NLP application must have access to probabilities describing how often specific words occur in general and how often specific words occur after particular words. To program a computer with these probabilities by hand would be extremely tedious and raises the question of how those probabilities would be reached in the first place. A simple way to find the probabilities might involve counting the number of occurrences of words in samples of text; however, a human would probably introduce errors into this process and be unable to sort through millions of words quickly. Thus, n-gram models are usually trained on corpora, huge text files that can be processed to determine statistical properties of the words and sentences within.
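
The counting approach described above might look roughly like the following in Python. This is a minimal sketch, not a reference implementation: the three-sentence toy corpus, the whitespace tokenization, and the function names (`train`, `unigram_probability`, `bigram_probability`) are all assumptions made for illustration, and a real model would be trained on a far larger corpus.

```python
from collections import Counter

def train(tokens):
    """Collect the counts a bigram model needs from a list of tokens."""
    word_counts = Counter(tokens)                   # how often each word occurs
    context_counts = Counter(tokens[:-1])           # how often each word is followed by another
    pair_counts = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs
    return word_counts, context_counts, pair_counts

def unigram_probability(word, word_counts):
    """Estimate P(word) with no context, as a relative frequency."""
    total = sum(word_counts.values())
    return word_counts[word] / total if total else 0.0

def bigram_probability(word, previous, context_counts, pair_counts):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    seen = context_counts[previous]
    return pair_counts[(previous, word)] / seen if seen else 0.0

# Hypothetical toy corpus; the sentences are invented for this example.
TOY_CORPUS = (
    "the clover was the color green . "
    "the sky was the color blue . "
    "the leaf was the color green ."
)
tokens = TOY_CORPUS.split()
word_counts, context_counts, pair_counts = train(tokens)

# Without context, "the" is more probable than "green" ...
print(unigram_probability("the", word_counts), unigram_probability("green", word_counts))
# ... but after "color", the bigram estimate favors "green".
print(bigram_probability("green", "color", context_counts, pair_counts))
print(bigram_probability("the", "color", context_counts, pair_counts))
```

In this toy data, the pair "color the" never occurs, so its estimated probability is exactly zero even though nothing rules it out in English.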

Training Issues
While training n-grams on corpora saves humans time and effort, it raises the question of style and how style differs between genres and sources. For example, one expects very different words and sentences when reading Virginia Woolf’s works than when reading The New York Times. Consequently, a model that is trained solely on Virginia Woolf’s works will probably not be very good at guessing the next word in sentences from The New York Times. One solution that has been suggested for this problem is to train models on corpora that include a variety of sources. Different genres of work and work from different periods could be included to give models a “sense” of many varieties of text. However, this introduces new issues. The New York Times is not likely to have sentences and words similar to those in Shakespeare, yet a model trained on corpora containing both would weigh them equally when guessing the next word in a newspaper article. As this example suggests, n-gram models perform much better when they are trained on corpora similar to their intended application than when they are trained on dissimilar corpora. At this point, researchers have not found a technique that solves this issue, and it is more likely that it would be solved by labeling text as coming from a certain source than by some new development that allows models trained on one type of corpus to perform well on texts that are stylistically very different.

Another potentially problematic aspect of training n-grams is that all training corpora will leave out some combinations of words. These combinations will then have zero probability, even though they may be acceptable English combinations. Consequently, many researchers employ techniques called smoothing that give small probabilities to combinations that have not been seen in the training corpus. Smoothing is simply another statistical technique, one that allows the model to guess words that it has never observed after the previous word.
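
The text above does not name a particular smoothing method, so the sketch below uses add-one (Laplace) smoothing, which is among the simplest, purely as an illustration. The toy token list and the function name `add_one_probability` are assumptions made for this example.

```python
from collections import Counter

def add_one_probability(word, previous, context_counts, pair_counts, vocabulary_size):
    """Add-one estimate of P(word | previous); never returns zero."""
    # Pretend every possible pair was seen one extra time.
    numerator = pair_counts[(previous, word)] + 1
    denominator = context_counts[previous] + vocabulary_size
    return numerator / denominator

# Hypothetical toy data, invented for this example.
tokens = "the color green the color blue the color green".split()
vocabulary = sorted(set(tokens))
context_counts = Counter(tokens[:-1])           # words that have a successor
pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs

for candidate in vocabulary:
    p = add_one_probability(candidate, "color", context_counts, pair_counts, len(vocabulary))
    print(f"P({candidate} | color) = {p:.3f}")
```

The pair "color the" never appears in the toy data, yet it now receives a small nonzero probability, and the estimates over the whole vocabulary still sum to one. Add-one smoothing tends to shift too much probability to unseen pairs when the vocabulary is large, which is one reason more refined methods are generally preferred in practice.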

Current Progress on N-Grams
Although the idea of n-grams has existed since 1913, when Markov proposed statistical techniques for computing such models, researchers have been creating more sophisticated models in recent years. In order to combat some of the issues with stylistic differences between texts, scientists have experimented with giving greater weight to combinations that have recently occurred and with letting more distant words influence guesses to some extent. Additionally, thesauri have been used to increase the vocabulary of n-grams and make them more feasible for generating language as well as guessing words.
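
One way to give greater weight to recently seen words, often called a cache model, is to mix a static model's estimate with the relative frequency of the word in a window of recent text. The sketch below is an assumption-laden illustration of that general idea rather than any specific published system: the window size of 200, the mixing weight of 0.2, the base probability of 0.001, and the function names are all hypothetical.

```python
from collections import Counter, deque

def cache_probability(word, recent_words):
    """Relative frequency of `word` among the recently seen words."""
    if not recent_words:
        return 0.0
    return Counter(recent_words)[word] / len(recent_words)

def interpolated_probability(word, base_probability, recent_words, cache_weight=0.2):
    """Mix a static model's estimate with the recency-based cache estimate."""
    cached = cache_probability(word, recent_words)
    return (1 - cache_weight) * base_probability + cache_weight * cached

# Keep only the most recent words; older ones fall out of the window automatically.
recent = deque(maxlen=200)
for token in "the clover was green and the clover was small".split():
    recent.append(token)

# Hypothetical estimate from a static n-gram model for the rare word "clover".
base = 0.001
print(interpolated_probability("clover", base, recent))
```

Because "clover" has just appeared twice, its mixed estimate is considerably higher than the static estimate alone, while a word absent from the window is only scaled down slightly.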