
Entropy and Redundancy in English

Having defined entropy and redundancy, it is useful to consider an example of these concepts applied to the English language. Shannon, in his paper "Prediction and Entropy of Printed English," gives two methods of estimating the entropy of English. The redundancy of English, which arises from the constraints imposed on its text, lowers its overall entropy. For example, the rule "i before e except after c" and the fact that a q must always be followed by a u are dependencies that make the English language more redundant. Rules of grammar, parts of speech, and the fact that we cannot simply make up words add further redundancy.

Redundancy in the English language is actually beneficial at times; how else might one discern what is said in a noisy room? The redundancy allows one to infer a message when only part of it comes across. For example, if one hears "Turn phat mufic down!", one can make a fairly good guess as to what the speaker meant.

One possible way of calculating the entropy of English uses N-grams. One can statistically estimate the entropy of the next letter when the previous N - 1 letters are known. As N increases, this conditional entropy approaches H, the true entropy of English. The table below gives the values calculated in Shannon's paper; FN is the entropy of the Nth letter when the previous N - 1 letters are known. Gathering the statistics for FN is O(26^N), because that is how many sequences of N letters there are. Note that F0 is simply the maximum entropy for the set of letters, where each letter is equally probable. (A rough computational sketch of such an estimate follows the table and its discussion.)

             F0      F1      F2      F3      Fword
26 letters   4.70    4.14    3.56    3.3     2.62
27 letters   4.76    4.03    3.32    3.1     2.14

The 27-letter sequences include the space as a letter. Onecanalmostalwaysfillinthespacesfromasequenceofwordswithnospaces. Therefore, spaces are largely redundant and lower the calculated entropies when taken into account. Only in the case where no statistics are taken into account, F0, is the entropy higher when the space is added, since this simply adds another possible symbol and therefore more uncertainty.
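As a rough sketch of how such statistics might be gathered in practice, FN can be estimated from a corpus as the difference between consecutive block entropies, FN = H(N-letter blocks) - H((N-1)-letter blocks). The Python snippet below uses a tiny made-up sample in place of a real corpus and is only illustrative; meaningful estimates of F2 and F3 require a very large body of text, precisely because the number of possible N-letter blocks grows as 26^N.

    import math
    from collections import Counter

    def block_entropy(text, n):
        # Entropy (in bits) of the distribution of n-letter blocks in the text.
        if n == 0:
            return 0.0
        blocks = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(blocks.values())
        return -sum((c / total) * math.log2(c / total) for c in blocks.values())

    def f_n(text, n):
        # FN, the entropy of the Nth letter given the previous N - 1 letters,
        # computed as the difference of consecutive block entropies.
        return block_entropy(text, n) - block_entropy(text, n - 1)

    # Tiny stand-in sample; a real estimate needs a large, representative corpus.
    sample = ("the redundancy of english text lowers its entropy and makes "
              "partial messages easy to reconstruct ") * 40
    for n in (1, 2, 3):
        print(f"F{n} is approximately {f_n(sample, n):.2f} bits per letter")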

Another strategy Shannon suggests is to calculate the entropy associated with every word in the English language, and take a weighted average. Shannon uses an approximating function to estimate the entropy for over 8,000 words. The calculated value he gets for the entropy per word is 11.82 bits, and since the average word has 4.5 letters, the entropy is 2.62 bits per letter. This is given as Fword in the above table.
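A minimal sketch of that kind of weighted average is shown below. This is not Shannon's exact procedure, which fitted an approximating function to word-frequency data; here the probabilities are simply taken from a sample text. The arithmetic behind the table entry is just 11.82 bits per word divided by 4.5 letters per word, or about 2.6 bits per letter.

    import math
    from collections import Counter

    def word_entropy(text):
        # Weighted-average entropy per word, -sum(p * log2 p), and the
        # bits per letter implied by the average word length.
        words = text.lower().split()
        counts = Counter(words)
        total = len(words)
        per_word = -sum((c / total) * math.log2(c / total) for c in counts.values())
        avg_len = sum(len(w) for w in words) / len(words)
        return per_word, per_word / avg_len

    # Toy sample; Shannon's 11.82 bits/word rests on statistics for
    # thousands of distinct words, not a sentence or two.
    h_word, h_letter = word_entropy(
        "the cat sat on the mat and the dog lay down near the door")
    print(f"{h_word:.2f} bits per word, {h_letter:.2f} bits per letter")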

We have already discussed how to calculate redundancy from entropy. The maximum entropy occurs when all the symbols are equally likely, and is equal to -log2(1/26) = 4.7 bits/letter. Therefore, using the formula redundancy = 1 - H/Hmax, we can estimate the redundancy of English. Shannon initially estimated this value at 50%, meaning that about half the letters in English text are redundant!
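As a quick check of the arithmetic, here is the formula applied to a few illustrative entropy values; the 2.3 bits/letter case roughly reproduces Shannon's 50% figure, and the other values are included only for comparison.

    import math

    h_max = math.log2(26)          # about 4.70 bits/letter when all 26 letters
                                   # are equally likely
    for h in (3.3, 2.3, 1.0):      # assorted entropy estimates, in bits/letter
        print(f"H = {h:.1f} bits/letter -> redundancy = {1 - h / h_max:.0%}")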

Later in the same article, Shannon describes a rather ingenious way of estimating the entropy of English. It incorporates features of the language, such as line of thought and context, that statistical methods cannot explicitly account for. Crossword puzzles and games such as Hangman and Wheel of Fortune exploit redundancy by assuming that humans can guess letters in a word or phrase from their prior knowledge of the language. Shannon's idea was to exploit this natural measure of redundancy: the human mind. He asked subjects to guess the letters in a phrase one by one. If a subject guessed correctly, he or she moved on to the next letter; if not, the subject was told the next letter. Of the 129 letters in a phrase, 69% were guessed correctly, which suggests a redundancy of roughly 69% in the English language.

Suppose we record only the letters that were guessed wrong, about 31% of them. Then an identical copy of the subject, guessing from scratch, could recover the original sentence: the copy gets the same 69% of the letters right on its own and is handed the other 31%, so the text can be reproduced from only about 31% of the information. In practice slightly more than 31% is needed, since the copy must also be told where the wrong guesses occur, so the redundancy is a little less than 69%. This is a good thought experiment, but it is hard to carry out rigorously: sampling error, in both the sentences chosen and the subjects tested, can significantly distort the results. Nevertheless, the experiment is a vivid practical illustration of redundancy, and it sheds light on how English might be coded. There is no need to construct a statistical grammar of the English language in order to estimate its entropy; humans have the grammar built in.
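The "identical copy" argument can be sketched in code. In place of the human subject, assume any deterministic predictor shared by both sides: the sender records, for each position, whether the predictor's guess was right, plus the actual letters at the misses; the receiver reruns the same predictor and fills in the misses. The predictor below is a deliberately crude stand-in, so its hit rate is nowhere near 69%; the point is only the structure of the scheme, and the hit/miss flags are exactly the "slightly more than 31%" of extra information mentioned above.

    def encode(text, predict):
        # For each position record a hit/miss flag, and keep the actual
        # letter only where the predictor missed.
        flags, misses = [], []
        for i, ch in enumerate(text):
            if predict(text[:i]) == ch:
                flags.append(True)
            else:
                flags.append(False)
                misses.append(ch)
        return flags, misses

    def decode(flags, misses, predict):
        # Rerun the same predictor and substitute the transmitted letters
        # wherever it was wrong; the reconstruction is exact.
        out, missing = [], iter(misses)
        for hit in flags:
            prefix = "".join(out)
            out.append(predict(prefix) if hit else next(missing))
        return "".join(out)

    def crude_predict(prefix):
        # Stand-in for the human subject: guess 't' at the start of a word,
        # otherwise guess 'e', the most common English letter.
        return "t" if not prefix or prefix.endswith(" ") else "e"

    message = "turn that music down please"
    flags, misses = encode(message, crude_predict)
    assert decode(flags, misses, crude_predict) == message
    print(f"transmitted {len(misses)} of {len(message)} characters explicitly")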

Statistically calculating the redundancy of the English language has numerous practical applications. ASCII reserves exactly 8 binary digits per character. However, this is highly inefficient, considering that some calculations place the entropy of English at around 1 bit per letter. This means that, theoretically, there is a compression scheme roughly eight times as compact as ASCII. Although modern computers have enough memory that this inefficiency is not crucial, Huffman and Lempel-Ziv compression algorithms save significant space when text is stored.
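For a hedged illustration of how far an off-the-shelf Lempel-Ziv/Huffman compressor gets in practice, the snippet below compresses a sample with zlib (DEFLATE). The repetitive sample flatters the ratio; on ordinary English prose, general-purpose compressors usually land around 2-3 bits per character, still well short of the roughly 1 bit/letter that the entropy estimates suggest is possible.

    import zlib

    text = ("Redundancy in English lets a compressor represent each character "
            "in far fewer than the eight bits that ASCII storage reserves. ") * 30
    raw = text.encode("ascii")        # 8 bits per character, as stored in ASCII
    packed = zlib.compress(raw, 9)    # DEFLATE: Lempel-Ziv matching + Huffman coding
    print(f"{len(raw)} bytes -> {len(packed)} bytes, "
          f"{8 * len(packed) / len(raw):.2f} bits per character")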

Normally, when people say the English language is redundant, they are speaking of the numerous synonyms which clutter our dictionaries. Redundancy in the sense of information theory is a measure of how efficiently the letters/symbols are used in the language. The fact that English is a redundant language is not necessarily bad. That our language is spoken as well as written brings in many issues besides efficiency. We want to be understood in a noisy room, we want the sound of words to correspond to their meaning, and we want to be able to pronounce words with ease. Information rates are only one small part of the analysis of the English language.

A very interesting illustration of how well a language can be described statistically is given by the nth-order approximations of English, reproduced here from Shannon's paper "The Mathematical Theory of Communication." Would a monkey who knew the n-gram frequencies of English letters, for large n, be able to produce credible English text? Furthermore, does this monkey "know" English? If the n-gram monkey is behind one door and a human is behind another, could a third-party observer tell which is the monkey? This question rings of the Turing test for artificial intelligence, to which there is no easy answer. A sketch of such an n-gram text generator follows the examples below.

Zero-order approximation: XFOML RXKHRJFFJUJ ALPWXFWJXYJ FFJEYVJCQSGHYD QPAAMKBZAACIBZLKJQD
First-order approximation: OCRO HLO RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Third-order approximation: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE
First-order word approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
Second-order word approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
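For the curious, here is a minimal sketch of such an n-gram "monkey": it tallies, for every (n-1)-letter context in a training text, how often each letter follows, and then babbles by sampling from those counts. The file name corpus.txt is only a placeholder for whatever large body of plain English text is used for training; with a short sample the output will be far less convincing than Shannon's examples.

    import random
    from collections import Counter, defaultdict

    def train(text, n):
        # For every (n-1)-letter context, count how often each letter follows it.
        model = defaultdict(Counter)
        for i in range(len(text) - n + 1):
            context, nxt = text[i:i + n - 1], text[i + n - 1]
            model[context][nxt] += 1
        return model

    def babble(model, n, length=200, seed="the "):
        # Generate letters one at a time by sampling from the counts for the
        # current (n-1)-letter context, in the spirit of an nth-order approximation.
        out = list(seed)
        for _ in range(length):
            context = "".join(out[-(n - 1):]) if n > 1 else ""
            followers = model.get(context)
            if not followers:          # unseen context: fall back to a space
                out.append(" ")
                continue
            letters, counts = zip(*followers.items())
            out.append(random.choices(letters, weights=counts)[0])
        return "".join(out)

    corpus = open("corpus.txt").read().lower()   # placeholder training text
    print(babble(train(corpus, n=3), n=3))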