
Dictionary-Based Compression

A compression program using this scheme has four basic requirements: (1) to parse the input text stream into fragments; (2) to test fragments against the dictionary; (3) to add new phrases to the dictionary; and (4) to encode dictionary indices and plain text so that they are distinguishable.

The corresponding decompression program has a slightly different set of requirements. It no longer has to parse the input text stream into fragments, and it doesn’t have to test fragments against the dictionary. Instead, it has the following requirements: (1) to decode the input stream into either dictionary indices or plain text; (2) to add new phrases to the dictionary; (3) to convert dictionary indices into phrases; and (4) to output phrases as plain text. The ability to accomplish these tasks with relatively low costs in system resources made dictionary-based programs popular over the last ten years.
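To make these four decoder tasks concrete, here is a toy decoder in C. It is only a sketch under assumed conventions, not any real program's format: the token stream is a hand-built array of hypothetical LZ78-style (dictionary index, symbol) pairs, and a symbol value of 0 is a convention adopted here to mark an index-only final token.

#include <stdio.h>
#include <string.h>

#define MAX_PHRASES 16
#define MAX_LEN 16

struct token { int index; char symbol; };

int main( void )
{
    /* Task 1: decode the input stream -- here the "stream" is a
     * pre-parsed array of hypothetical tokens that spells "ababab" */
    struct token stream[] = { {0,'a'}, {0,'b'}, {1,'b'}, {3,0} };
    int n_tokens = sizeof stream / sizeof stream[ 0 ];
    char dict[ MAX_PHRASES ][ MAX_LEN ] = { "" };  /* entry 0: empty phrase */
    int next = 1;

    for ( int i = 0; i < n_tokens; i++ ) {
        char phrase[ MAX_LEN ];

        /* Task 3: convert the dictionary index into a phrase */
        strcpy( phrase, dict[ stream[ i ].index ] );
        if ( stream[ i ].symbol ) {
            size_t n = strlen( phrase );
            phrase[ n ] = stream[ i ].symbol;
            phrase[ n + 1 ] = '\0';
        }

        /* Task 4: output the phrase as plain text */
        fputs( phrase, stdout );

        /* Task 2: add the new phrase to the dictionary */
        if ( next < MAX_PHRASES )
            strcpy( dict[ next++ ], phrase );
    }
    putchar( '\n' );   /* prints "ababab" */
    return 0;
}

Note how the decoder never searches the dictionary; it only looks entries up by index, which is why decompression is typically cheaper than compression in these schemes.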

A Representative Example

Compressing data when sending it to magnetic tape has several nice side effects. First, it reduces the use of magnetic tape. Though magnetic tape is not particularly expensive, some applications make prodigious use of it. Second, the effective transfer rate to and from the tape is increased.

Improvements in transfer speed through hardware are generally expensive, but compression through software is in a sense “free.” Finally, in some cases, the overall CPU time involved may actually be reduced. If the CPU cost of writing a byte to magnetic tape is sufficiently high, writing half as many compressed bytes may save enough cycles to pay for the compression.

While the benefits of compressing data before sending it to magnetic tape have long been clear, compression saw only sporadic use until the late 1980s. In 1989, however, Stac Electronics successfully implemented a dictionary-based compression algorithm on a chip. This algorithm was quickly embraced as an industry standard and is now widely used by tape-drive manufacturers worldwide.

This compression method is generally referred to by the standard which defines it: QIC-122. (QIC refers to the Quarter Inch Cartridge industry group, a trade association of tape-drive manufacturers.) As you may know, Stac Electronics expanded the scope of this algorithm beyond tape drives to the consumer hard-disk utility market in the form of its successful Stacker program (discussed later in this chapter).

QIC-122 provides a good example of how a sliding-window, dictionary-based compression algorithm actually works. It is based on the LZ77 sliding-window concept. As symbols are read in by the encoder, they are added to the end of a 2K window that forms the phrase dictionary. To encode a symbol, the encoder checks to see if it is part of a phrase already in the dictionary. If it is, it creates a token that defines the location of the phrase and its length. If it is not, the symbol is passed through unencoded.

The output of a QIC-122 encoder is a stream of tokens and symbols freely intermixed. Each token or symbol is prefixed by a single-bit flag that indicates whether the following data is a dictionary reference or a plain symbol. The definitions for these two sequences are: (1) plaintext: <1><eight-bit-symbol>; (2) dictionary reference: <0><window-offset><phrase-length>.

The QIC-122 encoder complicates things by further encoding the window-offset and phrase-length codes. Window offsets of less than 128 bytes are encoded in seven bits. Offsets between 128 bytes and 2,047 bytes are encoded in eleven bits. The phrase length uses a variable-bit coding scheme which favors short phrases over long. This explanation will gloss over these as “implementation details.” The glossed-over version of the C code for this algorithm is shown here.

while ( !out_of_symbols ) {
    /* find the longest dictionary phrase matching the input */
    length = find_longest_match( &offset );
    if ( length > 1 ) {
        output_bit( 0 );               /* flag: dictionary reference  */
        output_bits( offset );         /* 7- or 11-bit window offset  */
        output_bits( length );         /* variable-length phrase size */
        shift_input_buffer( length );
    } else {
        output_bit( 1 );               /* flag: plain symbol          */
        output_byte( buffer[ 0 ] );    /* pass the symbol through     */
        shift_input_buffer( 1 );
    }
}
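One of the details glossed over above is how the decoder tells a 7-bit offset from an 11-bit one. A common technique, and the one assumed in this sketch, is a one-bit selector in front of the offset field; the actual framing (and the phrase-length coding) should be taken from the QIC-122 standard itself, not from this fragment. put_bits() here is a hypothetical helper that simply prints bits for illustration.

#include <stdio.h>

/* Hypothetical bit writer: prints the low `count` bits of `value`,
 * most significant bit first. A real encoder would pack bits into
 * an output buffer rather than printing characters. */
static void put_bits( unsigned int value, int count )
{
    for ( int i = count - 1; i >= 0; i-- )
        putchar( ( ( value >> i ) & 1 ) ? '1' : '0' );
}

/* Sketch of the variable-width offset encoding, assuming a one-bit
 * selector distinguishes the 7-bit form from the 11-bit form. */
static void output_offset( unsigned int offset )
{
    if ( offset < 128 ) {
        put_bits( 1, 1 );              /* assumed selector: short form */
        put_bits( offset, 7 );         /* offsets 0..127 in 7 bits     */
    } else {
        put_bits( 0, 1 );              /* assumed selector: long form  */
        put_bits( offset, 11 );        /* offsets 128..2047 in 11 bits */
    }
}

int main( void )
{
    output_offset( 15 );    /* the offset used in the example below */
    putchar( '\n' );        /* prints 10001111: selector + 7 bits   */
    return 0;
}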

Following is an example of what this sliding window looks like when used to encode some C code, in this case the phrase “output_byte.” The previously encoded text, which ends with the phrase “output_bit( 1 );\r,” is at the end of the window. The find_longest_match routine will return a value of 8, since the first eight characters of “output_byte” match the first eight characters of “output_bit.”

The encoder will then output a 0 bit to indicate that a dictionary reference is following. Next it will output a 15 to indicate that the start of the phrase is fifteen characters back into the window (‘\r’ is a single symbol). Finally, it will output an 8 to indicate that there are eight matching symbols from the phrase.

Figure 7.1 A sliding window used to encode some C code.

Using QIC-122 encoding, this phrase will take exactly sixteen bits to encode, which means 8 bytes of data are encoded using only 2 bytes. This is clearly a respectable compression ratio, and it is typical of how QIC-122 performs under the best of circumstances, as shown here:

Figure 7.2 Encoding 8 bytes of data using only 2 bytes.

After the dictionary reference is output, the input stream is shifted over eight characters, with the last symbol encoded becoming the last symbol in the window. The next three symbols will not match anything in the window, so they will have to be individually encoded.

This example of QIC-122 gives a brief look at how a dictionary-based compression scheme might work. Chapter 8 will take a more extensive look at LZ77 and its derivatives.

Israeli Roots

Dig beneath the surface of virtually any dictionary-based compression program, and you will find the work of Jacob Ziv and Abraham Lempel. For all practical purposes, these two Israeli researchers gave birth to this branch of information theory in the late 1970s.

Research in data compression up to 1977 included work on entropy, character and word frequencies, and various other facets of statistical modeling. There were minor forays into other esoteric areas of interest, such as finite state machines and linguistic models, but research went mainly into new and improved methodologies for driving Huffman coders.

All this changed in 1977 with the publication of Jacob Ziv’s and Abraham Lempel’s “A Universal Algorithm for Sequential Data Compression” in IEEE Transactions on Information Theory. This paper, with its 1978 sequel “Compression of Individual Sequences via Variable-Rate Coding,” triggered a flood of dictionary-based compression research, algorithms, and programs.

The two compression techniques developed in these papers are called LZ77 and LZ78 (the transposition of the authors’ initials is apparently an innocent historical accident, but one that is here to stay). LZ77 is a “sliding window” technique in which the dictionary consists of a set of fixed-length phrases found in a “window” into the previously processed text. The size of the window is generally somewhere between 2K and 16K bytes, with the maximum phrase length ranging from perhaps 16 to 64 bytes. LZ78 takes a completely different approach to building a dictionary. Instead of using fixed-length phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.
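To make the contrast concrete, here is a toy sketch of LZ78-style phrase building in C. It follows the general scheme of the 1978 paper, but the details, such as the linear dictionary search and the printed (index, symbol) output, are illustrative simplifications, not the paper's algorithm verbatim.

#include <stdio.h>
#include <string.h>

#define MAX_PHRASES 256
#define MAX_LEN 64

static char dict[ MAX_PHRASES ][ MAX_LEN ];
static int next_entry = 1;          /* entry 0 is the empty phrase */

static int find_phrase( const char *p )
{
    for ( int i = 0; i < next_entry; i++ )
        if ( strcmp( dict[ i ], p ) == 0 )
            return i;
    return -1;
}

int main( void )
{
    const char *input = "abababab";
    char phrase[ MAX_LEN ] = "";

    for ( const char *s = input; *s; s++ ) {
        char candidate[ MAX_LEN ];
        size_t n = strlen( phrase );

        /* try to grow the current phrase by one symbol */
        memcpy( candidate, phrase, n );
        candidate[ n ] = *s;
        candidate[ n + 1 ] = '\0';

        if ( find_phrase( candidate ) >= 0 ) {
            strcpy( phrase, candidate );       /* keep growing the phrase */
        } else {
            /* emit (index of longest known phrase, new symbol)
             * and add the grown phrase as a new dictionary entry */
            printf( "(%d,'%c') ", find_phrase( phrase ), *s );
            if ( next_entry < MAX_PHRASES )
                strcpy( dict[ next_entry++ ], candidate );
            phrase[ 0 ] = '\0';                /* start a new phrase      */
        }
    }
    if ( phrase[ 0 ] )                         /* flush a trailing phrase */
        printf( "(%d,-) ", find_phrase( phrase ) );
    putchar( '\n' );   /* prints: (0,'a') (0,'b') (1,'b') (3,'a') (2,-) */
    return 0;
}

Notice that, unlike the QIC-122 window, nothing here ever slides out of scope: once a phrase enters the dictionary it keeps its index for the rest of the run.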

It is easy to think that these two compression methods are closely related, particularly since people will casually speak of “Lempel Ziv Compression” as if it were just one thing. These are, however, two very different techniques. They have different implementation problems and solutions, different strengths and weaknesses. Since they have had such a large impact on the world of data compression, the next two chapters of this book will take a detailed look at an LZ77 and an LZ78 implementation.

History

While the publication of the two papers in 1977 and 1978 may have had an immediate impact in the world of information theory, it was some time before programmers noticed the effects. In fact, it took the publication of another paper in 1984 to really get things moving.

The June 1984 issue of IEEE Computer had an article entitled “A Technique for High-Performance Data Compression” by Terry Welch. Welch described work performed at Sperry Research Center (now part of Unisys). His paper was a practical description of his implementation of the LZ78 algorithm, which he called LZW. It discussed the LZW compression algorithm in reference to its possible use in disk and tape-drive controllers, but it was clear that the same algorithm could easily be built into a general-purpose compression program.

Almost immediately after the article appeared, work began on the UNIX compress program. compress is a C program initially developed on DEC’s VAX. Ports to other machines, including the IBM PC, followed shortly. The public release of compress became available almost exactly a year after the publication of the IEEE article.

compress was a very influential program for a number of reasons. The program was well written, performed well, and had a reasonable level of documentation. Many UNIX installations began actively using compress soon after its release. Manual pages distributed with UNIX systems are now routinely shipped in compressed form, and they are not decompressed until accessed for the first time by the man program. The code was in the public domain from its initial release, which made for wide distribution and study. Perhaps most importantly, the authors went out of their way to ensure that the code was portable, so that it could be used on a wide variety of systems with no modifications.

While compress was becoming a standard in the UNIX community, desktop software was still struggling along with a rather inefficient order-0 Huffman coding program known as SQ. But in 1985, desktop power was increasing; more and more people were using modems to communicate; and hard-disk space was still relatively expensive. Conditions were ripe for an improvement in compression, and dictionary-based coding stepped in.

ARC: The Father of MS-DOS Dictionary Compression

In 1985, System Enhancement Associates released a general-purpose compression and cataloging program called ARC. ARC quickly took the MS-DOS desktop world by storm, becoming a de facto standard for PC users in a matter of months. Several factors helped ARC gain this position. First, it ordinarily used a close derivative of compress to compress files. At the time, this provided state-of-the-art compression and was essentially without peer. Second, ARC provided a cataloging or archiving function as an integral part of the program. UNIX users were accustomed to using the “tar” program to combine groups of files into a single archive, but PC users did not have a similar function as part of their operating system. ARC added that capability, vital for transferring groups of files by modem or even floppy diskette. Finally, ARC was distributed as shareware, which helped saturate the user base in a short time.

With compress reigning supreme in the UNIX world and ARC ruling the MS-DOS world, it seemed LZ78 would be the dominant compression method for years. Imitators such as PKWare’s PKARC only strengthened LZ78’s hold by providing performance improvements in both speed and compression ratios. But oddly enough, in recent years the field has taken a step back, if you consider moving from LZ78 to LZ77 a step backwards.

ARC lost its dominance of the desktop world to new contenders, most notably PKZIP, by PKWare; but also LHarc, by Haruyasu Yoshizaki; and ARJ, by Robert Jung. These programs are built on an LZ77 algorithm which uses a dictionary based on a sliding window that moves through the text.

LZ77 was not a practical algorithm to implement until refinements were made in the mid-1980s. Now LZ77 has a legitimate position alongside LZ78 as co-ruler of the general-purpose compression world.

Most recently, a patent dispute between Unisys, which owns the patent on LZW (Terry Welch’s LZ78-derived work), and the rest of the computer industry has resulted in a definite shift toward LZ77-derived algorithms. For example, the recently designed PNG format (discussed later in this book) is being promulgated as a replacement for CompuServe’s GIF format in order to sidestep Unisys’ patent claims.

Dictionary Compression: Where It Shows Up

Dictionary-based compression has found more and more homes in the last ten years as both hardware and software improvements make it practical. We can subdivide applications for dictionary-based compression into two areas: general-purpose programs and hardware-specific code.

As we have seen, dictionary-based coding took over desktop general-purpose compression. In the MS-DOS world, programs such as PKZIP, ARC, ARJ, and LHarc all use dictionary-based algorithms to compress and archive files in a general-purpose manner. Most of these programs have been ported to at least one or two other platforms, UNIX being the most popular.

Dictionary-based compression is also used in some special-purpose desktop programs. Most backup programs, for example, use some form of compression to make their operation faster and more efficient. PC Backup, developed by Central Point Software (a company later acquired by Symantec), uses a dictionary-based algorithm from Stac Electronics, the company that produces the Stacker disk compression utility and that initiated the QIC-122 compression standard.

CompuServe Information Service developed a dictionary-based compression scheme used to encode bit-mapped graphical images. The GIF format uses an LZW variant to compress repeated sequences in screen images. Compression is clearly needed with these types of images. Computer images take up lots of storage space, and as video resolutions improve, the size of the saved images grows dramatically. CompuServe users also typically use modems to upload or download these images. On older 2400-baud modems, compressing images becomes even more crucial. Even at 14,400 bps and faster speeds, the exploding use of the World-Wide Web on the Internet means increased demand for speedy transfer of graphics, and increased reliance on the GIF format for non-photographic images. (Image-oriented formats will be discussed in later chapters.)

Compressing files before transmitting them saves telecommunications bandwidth. But this requires compatible compression software on both ends. A more convenient method of conserving bandwidth is to build data compression directly into the modem. Microcom Corp. originally developed this idea, using Huffman coding to compress data before it was transmitted by its modems. Microcom’s compression algorithm, MNP-5, uses a dynamic Huffman coding scheme that performs well as a general-purpose compressor on most data streams.

In recent years, the international telecommunications industry has ratified and widely adopted a new compression algorithm used by modem manufacturers: V.42bis, a dictionary-based compression scheme which offers better compression ratios than MNP-5. With the adoption of an international standard, modem builders can now implement data compression in their modems and have confidence that they can communicate with modems from other manufacturers.

As mentioned previously, tape drive manufacturers have adopted an industry-standard compression algorithm: QIC-122. QIC-122 is generally implemented on the tape drive itself or on the tape controller using a dedicated microcontroller or the Stac Electronics compression engine chip.

Hewlett-Packard has proposed an alternative compression standard known as DCLZ. DCLZ uses an LZ78-type algorithm, which supposedly offers better compression performance than programs based on QIC-122.

The hard disk has been an active battleground for dictionary-based compression. In recent years, the utility programs for archiving and compressing files have been supplemented by programs that work at the device-driver level to transparently compress data stored on disk. The most visible of these utilities is Stacker, from Stac Electronics. There are others, including DiskDoubler on the Macintosh platform. In version 6 of MS-DOS, Microsoft added Stacker-like compression to its operating system, and then was forced to remove it after a well-publicized lawsuit between Microsoft and Stac Electronics. When operating at the device-driver level, these programs add a level of convenience to compression that is hard to beat.

One of the primary difficulties with compressing data on the hard disk is that most of today’s dictionary schemes are adaptive. An algorithm operating on a disk drive or controller has to be general-purpose, which makes a static dictionary difficult to implement. Most adaptive algorithms, however, do not perform very well until they have built up some statistical information, which may take several K of input data.

Unfortunately, disk drives are used in a random-access mode, which means a program can begin reading data at any point in the file. If the file were compressed using a conventional adaptive method, we might have to go back to the start and begin reading there in order to properly decompress the file. This may not be much of a problem in a 16 Kbyte text file, but imagine the performance problems in a 16-Mbyte database!

At present, manufacturers of software- and hardware-based disk compression drivers avoid these problems by compressing at the sector level. Since the device driver typically reads in a sector or more at a time, the adaptive algorithm restarts at the beginning of each sector. Even better, if the device driver controls the size of the sector, it can be set to a somewhat larger value than might normally be used, giving the adaptive algorithm a chance to improve its compression.
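As a minimal sketch of this per-sector restart, the following C fragment uses zlib’s one-shot compress2() as a stand-in for the driver’s own dictionary coder (real drivers such as Stacker use their own algorithms, and SECTOR_SIZE is simply an assumed value for illustration). Because each call starts with a fresh model, any sector can later be decompressed without reading the rest of the file.

#include <zlib.h>

#define SECTOR_SIZE 8192   /* assumed sector size for illustration */

/* Compress one sector independently of all others. Each call builds
 * its adaptive model from scratch, so the driver can decompress any
 * sector without reading from the start of the file. The out buffer
 * must hold at least compressBound( SECTOR_SIZE ) bytes. */
int compress_sector( const unsigned char *sector,
                     unsigned char *out, uLongf *out_len )
{
    /* compressBound() reports zlib's worst-case compressed size */
    *out_len = compressBound( SECTOR_SIZE );
    return compress2( out, out_len, sector, SECTOR_SIZE,
                      Z_DEFAULT_COMPRESSION );
}

The tradeoff the text describes is visible here: a larger SECTOR_SIZE gives the adaptive model more data to learn from, at the cost of more data read and decompressed per random access.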

With algorithms such as QIC-122, increasing the sector size past a certain point will not likely improve matters, since the dictionary is composed of only the previous 2K bytes of data. But more powerful compression algorithms that take advantage of older information will frequently benefit from a larger sector size.

In practice, many users of disk-compression programs that work at the device-driver level find performance to be less of an issue than one might expect. This is because, while compression/decompression does consume additional CPU cycles, that effort is compensated for by the reduced amount of data that needs to be transferred between memory and the hard disk. When a mechanical procedure (physically moving the hard-disk arm across the platter) competes with an electronic one (decompressing data in a CPU cache), the electronic process usually wins out.

Danger Ahead—Patents

The fact that most work on dictionary-based compression has been done over the last ten or fifteen years has a potentially dangerous side effect. Until the early 1980s, it was generally not possible to patent software. But during the past ten years, increasingly large numbers of patents were awarded for what are clearly algorithms.

One of the first data-compression patents was granted to Sperry Corp. (now Unisys) for the

improvements to LZ78 developed by Terry Welch at the Sperry Research Center. In fact, this patent became a point of contention during the standardization process for the V.42bis

data-communications standard. Since V.42bis is based on the LZW algorithm, Unisys claimed the right to collect royalties on implementations which use V.42bis. There was some concern in the CCITT about the effect of basing a standard on a patented technique. Unisys dampened concern while protecting its patent rights by publicly offering to license the algorithm to any modem manufacturer for a onetime $25,000 fee.

After a hiatus of several years, Unisys recently woke from its slumber and began patent-related legal maneuvers, seeking licensing fees from CompuServe and other users of the popular GIF format. Some in the industry viewed this demand as unreasonable, given the lengthy delay between the introduction of the GIF format and the demand for licensing fees once the format had gained wide acceptance. The response to Unisys’ effort was creative rather than contentious.

