Word Prediction
The task of predicting the next
word in a sentence might seem irrelevant if one thinks of natural
language processing (NLP) only in terms of processing text for semantic
understanding. However, NLP also involves processing noisy data
and checking text for errors. For example, noisy data can be produced
in speech or handwriting recognition, as the computer may not properly
recognize words due to unclear speech or handwriting that differs
significantly from the computer’s model. Additionally, NLP
could be extended to such functions as spell checking in order to
catch errors in which no word is misspelled but the user has accidentally
typed a word that she or he did not intend. In the sentence “I
picked up the phone to answer her fall,” for instance, fall
may have been the intended word, but it is more likely that call
was simply mistyped. A spell checker cannot catch this error because
both fall and call are English words. An NLP algorithm that could
catch this error would thus need to look beyond what letters form
words and instead attempt to determine what word is most probable
in a given sentence.
N-Gram Models
One of the oldest methods for computing the probability
that a given word is the next word in a sentence is the n-gram
model. N-gram models guess the next word in a sentence
based on the (n - 1) previous words. These models
base their guesses on the probability of a given word without any
context (i.e., the is more common than green and is thus
the more probable guess if context is ignored) and on the probability
of a word given the last (n - 1) words. For example, take
the sentence beginning “The four leaf clover was the color...”.
Using a bigram model, one would compute P(green | color) and P(the
| color) to determine the more probable guess between these two
words. Based on this example, one might imagine that the model’s
guess would be even more accurate if we computed P(green | The four
leaf clover was the color), making a 7-gram model. However, such
a model would take enormous computing power and a much greater amount
of time than the bigram model to compute. Since good estimates can
be made based on smaller models, it is more practical to use bi-
or trigram models. This idea that a future event (in this case,
the next word) can be predicted using a relatively short history
(for the example, one or two words) is called a Markov assumption.
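To make the bigram example concrete, the following Python sketch builds such a model from a small invented toy corpus; the corpus, function names, and resulting numbers are illustrative only and are not drawn from any standard library or real training data.

    from collections import Counter, defaultdict

    def train_bigram_model(sentences):
        """Count unigrams and bigrams from a list of sentences."""
        unigram_counts = Counter()
        bigram_counts = defaultdict(Counter)
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            unigram_counts.update(tokens)
            for prev, curr in zip(tokens, tokens[1:]):
                bigram_counts[prev][curr] += 1
        return unigram_counts, bigram_counts

    def bigram_probability(unigram_counts, bigram_counts, prev, word):
        """Maximum-likelihood estimate of P(word | prev)."""
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[prev][word] / unigram_counts[prev]

    # Invented toy corpus; a real model would be trained on far more text.
    corpus = [
        "the clover was the color green",
        "the sky was the color blue",
        "the color green is common",
    ]
    unigrams, bigrams = train_bigram_model(corpus)
    print(bigram_probability(unigrams, bigrams, "color", "green"))  # about 0.67 on this toy data
    print(bigram_probability(unigrams, bigrams, "color", "the"))    # 0.0 on this toy data

On this toy data, green follows color in two of the three sentences in which color appears, so the model would guess green over the.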
In
order to make these predictions about the next word in a sentence,
the NLP application must have access to probabilities about how
often specific words occur in general and how often specific words
occur after particular words. To program a computer with these probabilities
by hand would be extremely tedious and raises the question of how
those probabilities would be reached in the first place. A simple
way to find the probabilities might involve counting the number
of occurrences of words in samples of text; however, a human would
probably introduce errors into this process and be unable
to sort through millions of words quickly. Thus, n-gram models are
usually trained on corpora, huge text files that can be processed
to determine statistical properties about the words and sentences
within.
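As a sketch of how such corpus counts might be turned into an actual prediction, the hypothetical function below reuses the toy counts from the earlier example and ranks candidate next words by their estimated bigram probability, falling back on overall word frequency when the previous word has never been seen; the fallback is an assumption made for illustration, not a prescribed method.

    def predict_next_word(unigram_counts, bigram_counts, prev, k=3):
        """Rank up to k candidate next words by the estimated P(word | prev)."""
        total = unigram_counts[prev]
        if total == 0:
            # Unseen previous word: fall back on how common each word is overall.
            overall = sum(unigram_counts.values())
            return [(w, c / overall) for w, c in unigram_counts.most_common(k)]
        return [(w, c / total) for w, c in bigram_counts[prev].most_common(k)]

    print(predict_next_word(unigrams, bigrams, "color"))
    # e.g. [('green', 0.67), ('blue', 0.33)] on the toy corpus above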
Training Issues
While training n-grams on corpora saves humans time and effort,
it raises the question of style and how style differs between genres
and sources. For example, one expects very different words and sentences
when reading Virginia Woolf’s works than when reading The
New York Times. Consequently, a model that is trained using solely
Virginia Woolf’s works will probably not be very good at guessing
the next word in sentences from The New York Times. One solution
that has been suggested for this problem is to train models on corpora
that include a variety of sources. Different genres of work and
work from different periods could be included to give models a “sense”
of many varieties of text. However, this approach introduces new
issues. The New York Times is not likely to contain sentences
and words similar to those in Shakespeare, yet a model trained
on a corpus containing both texts would weight them equally when
guessing the next word in a newspaper article. As this example suggests,
n-gram models perform much better when they
are trained on corpora similar to their intended application than
when they are trained on dissimilar corpora. At this point, researchers
have not developed techniques that fully resolve this issue, and it
is more likely to be addressed by labeling text by its source or genre
than by some new development that allows models trained on one type
of corpus to perform well on texts that are stylistically very
different.
Another
potentially problematic aspect of training n-grams is that all training
corpora will leave out some combinations of words. These combinations
will then have zero probability, even though they may be acceptable
English combinations. Consequently, many researchers employ techniques
called smoothing that give small probabilities to combinations that
have not been seen in the training corpus. Smoothing is simply
a statistical adjustment that allows the model to guess
words that it has never observed after the previous word.
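One of the simplest smoothing variants is add-one (Laplace) smoothing, sketched below with the toy counts from the earlier examples; it is shown only as an illustration of the idea, since more sophisticated smoothing methods are usually preferred in practice.

    def laplace_bigram_probability(unigram_counts, bigram_counts, prev, word, vocab_size):
        """Add-one smoothed estimate of P(word | prev): every bigram,
        seen or unseen, receives a small nonzero probability."""
        return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + vocab_size)

    vocab_size = len(unigrams)  # vocabulary size, including the <s> and </s> markers
    print(laplace_bigram_probability(unigrams, bigrams, "color", "the", vocab_size))
    # small but nonzero, even though "color the" never occurs in the toy corpus
    print(laplace_bigram_probability(unigrams, bigrams, "color", "green", vocab_size))
    # still the most probable continuation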
Current Progress on N-Grams
Although the idea of n-grams has existed since 1913 when Markov
proposed statistical techniques for computing such models, researchers
have been creating more sophisticated models in recent years. In
order to combat some of the issues with stylistic differences between
texts, scientists have experimented with giving greater weight to
combinations that have occurred recently and with letting more distant
words influence guesses to some extent. Additionally, thesauri
have been used to increase the vocabulary of n-grams and make them
more feasible for generating language as well as guessing words.
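One way to picture the recency idea is to interpolate the base n-gram estimate with a "cache" of words from the recent history, so that words seen recently receive extra weight. The sketch below, reusing the imports and toy counts from the earlier examples, is a minimal illustration under an arbitrary mixing weight, not a reproduction of any particular published model.

    def cache_interpolated_probability(ngram_prob, recent_words, word, lam=0.9):
        """Blend a base n-gram estimate with the relative frequency of `word`
        in the recent history; lam is an illustrative weight, tuned in practice."""
        if not recent_words:
            return ngram_prob
        cache_prob = Counter(recent_words)[word] / len(recent_words)
        return lam * ngram_prob + (1 - lam) * cache_prob

    history = "the article kept repeating the word article over and over".split()
    base = bigram_probability(unigrams, bigrams, "the", "article")  # 0.0 in the toy corpus
    print(cache_interpolated_probability(base, history, "article"))
    # nonzero, because "article" occurred in the recent history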