Lossless Data Compression: Huffman

Statistical Compressors
Concept
Algorithm
Example
Comparison (H vs. SF)

Statistical Compressors

Computers generally encode characters using the standard ASCII chart, which assigns an 8-bit code to each symbol. For an example, the letter ‘a’ has an ASCII value of 97, and is encoded as ‘01100001’. Characters that occur more frequently (such as ‘e’) are treated the same as rare characters, such as ‘¸’. A file that has 100 characters will require 800 bits – this value is fixed, whether the file contains 100 unique characters or if it has 100 occurrences of the same character. The advantages of the ASCII encoding scheme is that boundaries between characters are easily determined, and the pattern used for each character is fixed and universal.

However, in almost any given text file, some characters occur with more frequency than others. Wouldn’t it thus make more sense to assign shorter bit codes to more frequent characters than less frequent ones? This idea is not new. An early example of data compression is Morse code, developed by Samuel Morse in the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred more often than others. In order to reduce the average time required to send a message, he assigned shorter sequences to letters that occur more frequently such as e(…) and a(… -) , and longer sequences to letters that occur less frequently such as q(- - …-) and j(…- - -).

This idea of using shorter codes for more frequently occurring characters was taken into the field of computing by Claude Shannon and R.M. Fano in the 1950’s, when they developed the Shannon-Fano compression algorithm. However, D.A. Huffman published a paper in 1952 that improved the algorithm slightly, bypassing the Shannon-Fano compression algorithm with the aptly named Huffman coding.

To see the advantages of these compression algorithms, consider a text file that has 35 letters with the following letter frequencies – A: 14; B: 7; C: 5; D: 5; E: 4. Using the ASCII encoding, each of these letters will have a separate bit pattern, consuming 35*8 = 280 bits of space. "Decoding" such a string (i.e. translate the binary encoding back into the original text file) simply consists of breaking the encoding into 8-bit bytes, and converting each byte into the character is represents using the ASCII table. A file that is encoded in this scheme has the advantage of needing no additional information to be passed along with the encoding, since all files and computers have the same binary-to-character mapping.

Variable-length encoding schemes such as the Huffman and Shannon-Fano schemes have the following properties:

Codes for more frequent characters are shorter than ones for less probable characters.

Each code can be uniquely decoded. This is also called the prefix property, i.e. no character’s encoding is a prefix of any other. To see why this property is important, consider the case where ‘A’ is encoded as 0, and ‘B’ is encoded as ‘01’, and ‘C’ is encoded as ‘10’. If the decoder encounters the bit-stream ‘0010’, is this ‘ABA’ or ‘AAC’? With the prefix guarantee, there is no ambiguity in determining where the character boundaries are. We start reading from the beginning, and gather bits in a sequence until we find a match. That indicated the end of a character, and we move along to the next character.