 Dictionary-based
Compressors
Concept
Algorithm
Example
Shortcomings
Variations |
The Algorithm To encode the sequence in the look-ahead buffer, the encoder
moves a search pointer back through the search buffer until it encounters a match to the
first symbol in the look-ahead buffer. The distance of the pointer from the look-ahead
buffer is called the offset. The encoder then examines the symbols following the
symbol at the pointer location to see if they match consecutive symbols in the look-ahead
buffer. The number of consecutive symbols in the search buffer that match consecutive
symbols in the look-ahead buffer, starting with the first symbol, is called the length
of the match. The encoder searches the search buffer for the longest match. Once the
longest match has been found, the encoder encodes it with a triple <o, l, c>,
where o is the offset, l is the length of the match, and c is the
codeword corresponding to the symbol in the look-ahead buffer that follows the match. For
example, in the diagram above, the longest match is the first a of the search
buffer. The offset o in this case is 7, the length of the match l is 4, and
the symbol in the look-ahead buffer following the match is r. The reason for
sending the third element in the triple is to take care of the situation where no match
for the symbol in the look-ahead buffer can be found in the search buffer. In this case,
the offset and match-length values are set to 0, and the third element of the triple is
the code for the symbol itself.
The algorithm looks like the following:
while( lookAheadBuffer not empty )
{
get a pointer (position, match) to the longest match
in the window for the lookAheadBuffer;
output a (position, length, char()) triple;
shift the window length+1 characters along;
}
For the decoding process, it is basically a table loop-up procedure and can be
done by reversing the encoding procedures. We employ the same buffer sized n
characters, and then use its first (N-n) spaces to hold the
previously decoded characters, where N is the size of the window (sum of the
size of the look-ahead buffer and the search buffer) used in the encoding process. If we
break up each triple that we encounter back into its components:- position offset o,
match length l, and the last symbol of the incoming stream c, we extract the
match string from buffer according to o, and thus obtain the original content. What
has been shifted out of the buffer is sent to the output. The process continues until all
code words have been decoded, and the decoding is then complete.
|