Statistical Compressors

Computers generally encode characters using the standard ASCII chart, which assigns an 8-bit code to each symbol. For example, the letter a has an ASCII value of 97 and is encoded as 01100001. Characters that occur more frequently (such as e) are treated the same as rare characters, such as ¸. A file that has 100 characters will require 800 bits; this value is fixed whether the file contains 100 unique characters or 100 occurrences of the same character. The advantages of the ASCII encoding scheme are that boundaries between characters are easily determined, and that the pattern used for each character is fixed and universal.
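To make the fixed-width scheme concrete, here is a minimal Python sketch (the helper name ascii_encode is purely illustrative, not from the text) that maps each character to its 8-bit ASCII pattern; it reproduces the 01100001 pattern for a and the 800-bit cost of a 100-character file.

```python
# Minimal sketch of fixed-width ASCII encoding: every character costs
# exactly 8 bits, no matter how often it occurs.

def ascii_encode(text: str) -> str:
    """Return the text as a string of 8-bit ASCII codes."""
    return "".join(format(ord(ch), "08b") for ch in text)

print(ascii_encode("a"))             # 01100001 (ASCII 97)
print(len(ascii_encode("x" * 100)))  # 800 -- any 100-character file needs 800 bits
```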
However, in almost any given text file, some characters occur more frequently than others. Wouldn't it thus make more sense to assign shorter bit codes to more frequent characters and longer codes to less frequent ones? This idea is not
new. An early example of data compression is Morse code, developed by Samuel Morse in the
mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse
noticed that certain letters occurred more often than others. In order to reduce the
average time required to send a message, he assigned shorter sequences to letters that
occur more frequently, such as e (.) and a (. -), and longer sequences to letters that occur less frequently, such as q (- - . -) and j (. - - -).
This idea of using shorter codes for more frequently occurring characters was taken
into the field of computing by Claude Shannon and R.M. Fano in the late 1940s, when they developed the Shannon-Fano compression algorithm. However, D.A. Huffman published a paper in 1952 that improved on their work, surpassing the Shannon-Fano algorithm with the aptly named Huffman coding.
To see the advantages of these compression algorithms, consider a text file that has 35 letters with the following letter frequencies: A: 14; B: 7; C: 5; D: 5; E: 4. Using the ASCII encoding, each of these letters is assigned its own 8-bit pattern, consuming 35*8 = 280 bits of space. "Decoding" such a string (i.e. translating the binary encoding back into the original text file) simply consists of breaking the encoding into 8-bit bytes and converting each byte into the character it represents using the ASCII table. A file encoded in this scheme has the advantage of needing no additional information to be passed along with the encoding, since every computer uses the same binary-to-character mapping.
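The fixed-width decoding step can be sketched directly in Python (ascii_decode is an illustrative helper, not from the text): split the encoding into 8-bit bytes and convert each byte back into its character. The 35-letter file always costs 280 bits, regardless of the letter frequencies.

```python
# Sketch of fixed-width decoding: break the bit string into 8-bit bytes and
# map each byte back to a character via the ASCII table.

def ascii_encode(text: str) -> str:
    return "".join(format(ord(ch), "08b") for ch in text)

def ascii_decode(bits: str) -> str:
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

# The 35-letter file from the example: A x14, B x7, C x5, D x5, E x4.
frequencies = {"A": 14, "B": 7, "C": 5, "D": 5, "E": 4}
text = "".join(ch * count for ch, count in frequencies.items())

encoded = ascii_encode(text)
print(len(encoded))                   # 280 bits, whatever the frequencies are
print(ascii_decode(encoded) == text)  # True -- no side information is needed
```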
Variable-length encoding schemes such as the Huffman and Shannon-Fano schemes have the
following properties:
- Codes for more frequent characters are shorter than those for less frequent characters.
- Each code can be uniquely decoded. This is also called the prefix property, i.e. no character's encoding is a prefix of any other. To see why this property is important, consider the case where A is encoded as 0, B as 01, and C as 10. If the decoder encounters the bit-stream 0010, is this ABA or AAC? With the prefix guarantee, there is no ambiguity in determining where the character boundaries are: we start reading from the beginning and gather bits in a sequence until we find a match. That indicates the end of a character, and we move along to the next character, as the decoding sketch below shows.
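The decoding procedure just described can be sketched in Python as follows; the function name decode_prefix_code and the code table are hypothetical, chosen only so that no codeword is a prefix of another.

```python
# Sketch of decoding a prefix-free code: gather bits from the start of the
# stream until they match a codeword, emit that character, then repeat.

def decode_prefix_code(bits: str, codes: dict) -> str:
    reverse = {code: ch for ch, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:      # a complete codeword has been gathered
            decoded.append(reverse[buffer])
            buffer = ""            # start collecting the next character
    if buffer:
        raise ValueError("bit stream ended in the middle of a codeword")
    return "".join(decoded)

# Illustrative prefix-free code for the A-E alphabet: shorter codes for the
# more frequent letters (not necessarily the optimal assignment).
codes = {"A": "0", "B": "10", "C": "110", "D": "1110", "E": "1111"}
print(decode_prefix_code("010110", codes))  # ABC
```

With this illustrative table, the 35-letter file from the earlier example would take 14*1 + 7*2 + 5*3 + 5*4 + 4*4 = 79 bits rather than the 280 bits required by the fixed-width ASCII encoding.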