Lossless Data Compression: LZ78

Dictionary-based Compressors
Concept
Algorithm
Example
Variations

Dictionary-based Compressors

Statistical models, such as the Huffman and Shannon-Fano models illustrated above, usually encode a single symbol at a time—by generating a one-to-one symbol-to-code map. The basic idea behind a dictionary-based compressor is to replace an occurrence of a particular phrase or group of bytes in a piece of data with a reference to a previous occurrence of that phrase.

Like the Huffman Algorithm, dictionary based compression schemes also have a historical basis. Where Morse code uses the frequency of occurrence of single characters, a widely used form of Braille code, also developed in the mid-19th century, uses the frequency of occurrence of words to provide compression. In Braille coding, 2 x 3 arrays of dots are used to represent text. Different letters are represented by different combinations of raised and flat dots. In Grade I Braille, each array of six dots represents a single character. However, given six dots with two positions for each dot, we can obtain 2⁶ or 64 different combinations. If we use 26 of these for the different letters, we have 38 combinations left over. In Grade 2 Braille, some of these leftover combinations are used to represent words that occur frequently, such as "and" and "for". One of the combinations is used as a special symbol indicating that the symbol following is a word and not a character, thus allowing a large number of words to be represented by two arrays of dots. These modifications, along with contractions of some of the words, result in a significant compression in the average amount of space used.

In modern data compression, there are two main classes of dictionary-based schemes schemes, named after Jakob Ziv and Abraham Lempel, who first proposed them in 1977 and 1978. These are called LZ77 and LZ78, respectively.