Lossless Data Compression: LZ77

Dictionary-based Compressors
Concept
Algorithm
Example
Shortcomings
Variations

The Algorithm

To encode the sequence in the look-ahead buffer, the encoder moves a search pointer back through the search buffer until it encounters a match to the first symbol in the look-ahead buffer. The distance of the pointer from the look-ahead buffer is called the offset. The encoder then examines the symbols following the symbol at the pointer location to see if they match consecutive symbols in the look-ahead buffer. The number of consecutive symbols in the search buffer that match consecutive symbols in the look-ahead buffer, starting with the first symbol, is called the length of the match. The encoder searches the search buffer for the longest match. Once the longest match has been found, the encoder encodes it with a triple <o, l, c>, where o is the offset, l is the length of the match, and c is the codeword corresponding to the symbol in the look-ahead buffer that follows the match. For example, in the diagram above, the longest match is the first ‘a’ of the search buffer. The offset o in this case is 7, the length of the match l is 4, and the symbol in the look-ahead buffer following the match is ‘r’. The reason for sending the third element in the triple is to take care of the situation where no match for the symbol in the look-ahead buffer can be found in the search buffer. In this case, the offset and match-length values are set to 0, and the third element of the triple is the code for the symbol itself.

The algorithm looks like the following:

while( lookAheadBuffer not empty )
{
    get a pointer (position, match) to the longest match
    in the window for the lookAheadBuffer;

    output a (position, length, char()) triple;
    shift the window length+1 characters along;
}

For the decoding process, it is basically a table loop-up procedure and can be done by reversing the encoding procedures. We employ the same buffer sized n characters, and then use its first (N-n) spaces to hold the previously decoded characters, where N is the size of the window (sum of the size of the look-ahead buffer and the search buffer) used in the encoding process. If we break up each triple that we encounter back into its components:- position offset o, match length l, and the last symbol of the incoming stream c, we extract the match string from buffer according to o, and thus obtain the original content. What has been shifted out of the buffer is sent to the output. The process continues until all code words have been decoded, and the decoding is then complete.