To avoid the problems that occurred with LZ77, Ziv and Lempel
developed a different form of dictionary-based compression. LZ78 abandons the concept of a
text window. In LZ77, the dictionary of phrases was defined by a fixed-length window of
previously seen text. Under LZ78, the dictionary is a potentially unlimited collection of
previously seen phrases. LZ78-based schemes work by entering phrases into a
dictionary and then, when a repeat occurrence of that particular phrase is
found, outputting a token that consists of the dictionary index instead of the phrase, as
well as a single character that follows that phrase. Unlike LZ77, there is no need to pass
the phrase length as a parameter because decoder already has this information.
Several compression algorithms based on this principle, differing mainly in the manner
in which they manage the dictionary. The most well-known scheme (in fact the most
well-known of all the Lempel-Ziv compressors is Terry Welch's LZW scheme, developed in
1984.
Just as files have certain symbols that occur with more regularity than others, many
files (especially, but not restricted to, text files) also have certain string that are
repeated frequently. For an example, take the string " the " (including the
spaces). Including the instances of the spaces, the string takes 5¥
8 = 40 bits to encode. However, if we were to append this entire string to the list of
characters, at position 256, then at every subsequent occurrence of " the " we
could send the code 256 instead of the index sequence 32, 116, 104, 101, 32. Since 256
cannot be represent by the standard 8-bit byte, this would require 9 bits
nevertheless, it is still a great improvement on the standard 40-bit ASCII encoding.
Unlike LZ77, LZ78 does not have a ready-made window full of text (the search windows)
to use as a dictionary. On the contrary, it has to create a dictionary on the
fly: it creates a new phrase each time a token is output, and it adds that phrase to
the dictionary. After the phrase is appended, it will available to the encoder at any time
in the future not just for the next few thousand characters as with LZ77.