Bits and Binary Digits

Information

"Nature must be interpreted as matter, energy, and information" (Jeremy Campbell, Grammatical Man)

"It is hardly to be expected that a single concept of information would satisfactorily account for the numerous possible applications of this general field." (Claude Shannon)

"Where is the life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" (T.S. Eliot, "The Rock")

Information, in Shannon's theory of information, is viewed stochastically, or probabilistically. It is carried discretely as symbols, which are selected from a set of possible symbols. The meaning of these symbols is completely irrelevant, though a binary digit may represent the toss of a coin (heads or tails) or the fate of the Universe (expand or collapse). The information carried by a message or symbol depends on its probability. If there is one possibile symbol, then there is no information to be gained since the outcome is not in question. More generally, if one symbol from a set has a probability of unity, then no information is conveyed. That is, if Alice and Bob are having a conversation, and Alice knows what Bob is going to say, then she gives Bob no information by speaking. At the other end of the information spectrum, one can imagine, is a set where all the probabilities are equal. Note that this need not be a random system, although randomness is one way of achieving a balanced set of results. It merely means that each symbol is equally likely from the point of view of the receiver.

The simplest of the many definitions of information in Shannon's theory is that information is a decrease in uncertainty. To illustrate a more concrete example of this definition, consider the following set of colored shapes. This is the set of all possible shapes, and Bob is trying to guess which particular shape Alice is holding. Bob is in a state of uncertainty, as he has no idea which particular shape Alice holds. What's more, each shape has an equally likely probability of being held by Alice.

Now, say Alice tells Bob that she is holding a blue shape. This definitely reduces the set of possibilities, and represents a decrease in Bob's uncertainty. He is by no means certain of the shape, but has a better idea of what single shape she is holding. Bob has gained information.

Now for a subtler point. Say that Alice had instead told Bob that she was holding a red shape. In this case, she would actually have conveyed more information. Instead of having three possible choices, Bob now would have two. Therefore, his final uncertainty would be less. Since the initial uncertainty is the same, his overall decrease in uncertainty would be greater in this case.

Another interesting case occurs when the probabilities are not all equal. Say that the probability Alice is holding a blue shape is less than the probability that she is holding any other shape. In this scenario, by telling Bob that she is holding a blue shape, she has conveyed more information than in the original case where all probabilities are equal. What if she instead had said that she is holding a red shape? Whether she conveys more or less information depends on the relative probabilities of the blue and red shapes.

The above metaphor allows us to visualize information theory as encompassing two realms, the known and the unknown. Information is gained when the size of the known becomes larger, and/or the size of the unknown becomes smaller. Another paradigm that is helpful in understanding Information Theory defines information as "surprisal." If event X has a smaller probability than event Y, I should be more surprised if you told be that X had occurred. Hence, I would get more information in this case.

An instructive example of information theory in practice is its application to language. English, and any other language for that matter, has sets of constraints which must be met. There are dependencies and rules which govern how letters/symbols are put together to form words, sentences, and paragraphs. Therefore, written English has less information in the probabilistic sense than random a set of letters.

Given this definition of information as a decrease in uncertainty, Shannon devised an information-generating function h(p), which would, given a set of N independent states a1, ... aN and N corresponding possibilities p1, ... pN, determine how much information is generated. In his ground-breaking paper, The Mathematical Theory of Communication, he set out these five postulates for h(p).

h(p) is continuous for 0 <= p <= 1	Fairly intuitive that this should be so.
h(pi) = infinity if pi = 0	If an event with 0 probability occurred, I should be infinitely surprised.
h(pi) = 0 if pi = 1	No surprise if I win a sure bet.
h(pi) > h(pj) if pj > pi	A formal restatement of the above shape example.
h(pi) + h(pj) = h(pi*pj) if the two states si and sj are independent	Says that the information from independent symbols is linearly additive.

The last property, which is not immediately obvious, follows from the fact that the probability of two events occurring in concert is the product of their individual probabilities provided that the events are independent. It can thus be restated, "The information from event XY equals the information from event X plus the information from event Y."

It can be proved that the only function that satisfies that these five properties is

h(p) = - logb(p)

This is Shannon's most famous result. Although the logarithmic relation between information and probability had been conceived of before, Shannon was able to draw mind-boggling implications from this simple formula.

An immediate implication of the information-generating function is the formula for the average information per symbol in a set of symbols with a priori probabilities. This formula can be interpreted as sum of the information totals associated with every symbol, where each total weighted by the probability of the corresponding symbol.

H = p1*h(p1) + p2*h(p2) + p3h(p3) + ... pN-1*h(pN-1) pN*h(pN)

H is also called entropy. This formula can predict the efficiency limits on any binary code for the set of symbols it describes. In particular, it can predict the most efficient code for compressing the English language.