Lossless Data Compression: LZ78

Dictionary-based Compressors
Concept
Algorithm
Example
Variations

The Concept

To avoid the problems that occurred with LZ77, Ziv and Lempel developed a different form of dictionary-based compression. LZ78 abandons the concept of a text window. In LZ77, the dictionary of phrases was defined by a fixed-length window of previously seen text. Under LZ78, the dictionary is a potentially unlimited collection of previously seen phrases. LZ78-based schemes work by entering phrases into a ‘dictionary’ and then, when a repeat occurrence of that particular phrase is found, outputting a token that consists of the dictionary index instead of the phrase, as well as a single character that follows that phrase. Unlike LZ77, there is no need to pass the phrase length as a parameter because decoder already has this information.

Several compression algorithms based on this principle, differing mainly in the manner in which they manage the dictionary. The most well-known scheme (in fact the most well-known of all the Lempel-Ziv compressors is Terry Welch's LZW scheme, developed in 1984.

Just as files have certain symbols that occur with more regularity than others, many files (especially, but not restricted to, text files) also have certain string that are repeated frequently. For an example, take the string " the " (including the spaces). Including the instances of the spaces, the string takes 5Ą 8 = 40 bits to encode. However, if we were to append this entire string to the list of characters, at position 256, then at every subsequent occurrence of " the " we could send the code 256 instead of the index sequence 32, 116, 104, 101, 32. Since 256 cannot be represent by the standard 8-bit byte, this would require 9 bits – nevertheless, it is still a great improvement on the standard 40-bit ASCII encoding.

Unlike LZ77, LZ78 does not have a ready-made window full of text (the search windows) to use as a dictionary. On the contrary, it has to create a dictionary ‘on the fly’: it creates a new phrase each time a token is output, and it adds that phrase to the dictionary. After the phrase is appended, it will available to the encoder at any time in the future – not just for the next few thousand characters as with LZ77.