maintitle.gif (3356 bytes)
menu_intro.gif (1886 bytes)
menu_bar_77.GIF (1845 bytes)
lossless.jpg (77783 bytes)

Dictionary-based Compressors

Dictionary-based Compressors

Statistical models, such as the Huffman and Shannon-Fano models illustrated above, usually encode a single symbol at a time—by generating a one-to-one symbol-to-code map. The basic idea behind a dictionary-based compressor is to replace an occurrence of a particular phrase or group of bytes in a piece of data with a reference to a previous occurrence of that phrase.

Like the Huffman Algorithm, dictionary based compression schemes also have a historical basis. Where Morse code uses the frequency of occurrence of single characters, a widely used form of Braille code, also developed in the mid-19th century, uses the frequency of occurrence of words to provide compression. In Braille coding, 2 x 3 arrays of dots are used to represent text. Different letters are represented by different combinations of raised and flat dots. In Grade I Braille, each array of six dots represents a single character. However, given six dots with two positions for each dot, we can obtain 26 or 64 different combinations. If we use 26 of these for the different letters, we have 38 combinations left over. In Grade 2 Braille, some of these leftover combinations are used to represent words that occur frequently, such as "and" and "for". One of the combinations is used as a special symbol indicating that the symbol following is a word and not a character, thus allowing a large number of words to be represented by two arrays of dots. These modifications, along with contractions of some of the words, result in a significant compression in the average amount of space used.

In modern data compression, there are two main classes of dictionary-based schemes schemes, named after Jakob Ziv and Abraham Lempel, who first proposed them in 1977 and 1978. These are called LZ77 and LZ78, respectively.

back to top | home