Statistical
CompressorsComputers generally encode characters using the standard ASCII chart,
which assigns an 8-bit code to each symbol. For an example, the letter a has
an ASCII value of 97, and is encoded as 01100001. Characters that occur more
frequently (such as e) are treated the same as rare characters, such as
¸. A file that has 100 characters will require 800 bits this value is
fixed, whether the file contains 100 unique characters or if it has 100 occurrences of the
same character. The advantages of the ASCII encoding scheme is that boundaries between
characters are easily determined, and the pattern used for each character is fixed and
universal.
However, in almost any given text file, some characters occur with more frequency than
others. Wouldnt it thus make more sense to assign shorter bit codes to more frequent
characters than less frequent ones? This idea is not new. An early example of data
compression is Morse code, developed by Samuel Morse in the mid-19th century. Letters sent
by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred
more often than others. In order to reduce the average time required to send a message, he
assigned shorter sequences to letters that occur more frequently such as e(…) and a(…
-) , and longer sequences to letters that occur less frequently such as q(- - …-)
and j(…- - -).
This idea of using shorter codes for more frequently occurring characters was taken
into the field of computing by Claude Shannon and R.M. Fano in the 1950s, when they
developed the Shannon-Fano compression algorithm. However, D.A. Huffman published a paper
in 1952 that improved the algorithm slightly, bypassing the Shannon-Fano compression
algorithm with the aptly named Huffman coding.
To see the advantages of these compression algorithms, consider a text file that has 35
letters with the following letter frequencies A: 14; B: 7; C: 5; D: 5; E: 4. Using
the ASCII encoding, each of these letters will have a separate bit pattern, consuming 35*8
= 280 bits of space. "Decoding" such a string (i.e. translate the binary
encoding back into the original text file) simply consists of breaking the encoding into
8-bit bytes, and converting each byte into the character is represents using the ASCII
table. A file that is encoded in this scheme has the advantage of needing no additional
information to be passed along with the encoding, since all files and computers have the
same binary-to-character mapping.
Variable-length encoding schemes such as the Huffman and Shannon-Fano schemes have the
following properties:
- Codes for more frequent characters are shorter than ones for less probable characters.
- Each code can be uniquely decoded. This is also called the prefix property, i.e. no
characters encoding is a prefix of any other. To see why this property is important,
consider the case where A is encoded as 0, and B is encoded as
01, and C is encoded as 10. If the decoder encounters
the bit-stream 0010, is this ABA or AAC? With the
prefix guarantee, there is no ambiguity in determining where the character boundaries are.
We start reading from the beginning, and gather bits in a sequence until we find a match.
That indicated the end of a character, and we move along to the next character.