


The Dawn Age: Minimum Redundancy Coding


Figure 3.1 A simple Shannon-Fano tree.

The tree structure shows how codes are uniquely defined though they have different numbers of bits.

The tree structure seems designed for computer implementations, but it is also well suited for machines made of relays and switches, like the teletype machines of the 1950s.

While the table shows one of the three properties discussed earlier, that of having variable numbers of bits, more information is needed to talk about the other two properties. After all, code trees look interesting, but do they actually perform a valuable service?

The Shannon-Fano Algorithm

A Shannon-Fano tree is built according to a specification designed to define an effective code table.

The actual algorithm is simple:

1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol’s relative frequency of occurrence is known.

2. Sort the list of symbols according to frequency, with the most frequently occurring symbols at the top and the least common at the bottom.

3. Divide the list into two parts, with the total frequency counts of the upper half being as close to the total of the bottom half as possible.

4. The upper half of the list is assigned the binary digit 0, and the lower half is assigned the digit 1. This means that the codes for the symbols in the first half will all start with 0, and the codes in the second half will all start with 1.

5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree. (A short C sketch of this recursive division follows the list.)
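The division in steps 3 through 5 is naturally expressed as a recursive routine. The sketch below is not from the book's source code; the symbol structure, the field names, and the exact rule used to pick the split point are assumptions made for illustration. It expects the array to arrive already sorted by descending count, with every code[] string initially empty.

/* A sketch of the recursive Shannon-Fano division. Illustrative only:
 * symbols[] must arrive sorted by descending count (step 2), with
 * every code[] string initially empty. */
#include <string.h>

struct sym {
    int count;
    char code[ 33 ];                /* bits accumulated so far, as text */
};

void shannon_fano( symbols, first, last )
struct sym *symbols;
int first;
int last;
{
    int total = 0, running = 0, best_diff, diff, split, i;

    if ( first >= last )            /* a single symbol is a finished leaf */
        return;
    for ( i = first ; i <= last ; i++ )
        total += symbols[ i ].count;
    best_diff = total + 1;
    split = first;
    for ( i = first ; i < last ; i++ ) {      /* step 3: find the split  */
        running += symbols[ i ].count;
        diff = 2 * running - total;           /* imbalance of this split */
        if ( diff < 0 )
            diff = -diff;
        if ( diff < best_diff ) {
            best_diff = diff;
            split = i;
        }
    }
    for ( i = first ; i <= last ; i++ )       /* step 4: append 0 or 1   */
        strcat( symbols[ i ].code, i <= split ? "0" : "1" );
    shannon_fano( symbols, first, split );    /* step 5: recurse         */
    shannon_fano( symbols, split + 1, last );
}

Run over the five-symbol table that follows, this routine reproduces the codes developed in the text.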

The Shannon-Fano tree shown in Figure 3.1 was developed from the table of symbol frequencies shown next.

Symbol  Count
A       15
B       7
C       6
D       6
E       5

Putting the dividing line between symbols B and C assigns a count of 22 to the upper group and 17 to the lower, the closest to exactly half. This means that A and B will each have a code that starts with a 0 bit, and C, D, and E are all going to start with a 1, as shown:

Symbol  Count
A       15     0
B       7      0
------------------  First division
C       6      1
D       6      1
E       5      1

Subsequently, the upper half of the table gets a new division between A and B, which puts A on a leaf with code 00 and B on a leaf with code 01. After four division procedures, a table of codes results. In the final table, the three symbols with the highest frequencies have all been assigned 2-bit codes, and the two symbols with lower counts have 3-bit codes, as shown next.

Symbol  Count
A       15     0  0
------------------  Second division
B       7      0  1
------------------  First division
C       6      1  0
------------------  Third division
D       6      1  1  0
------------------  Fourth division
E       5      1  1  1

That symbols with the higher probability of occurrence have fewer bits in their codes indicates we are on the right track. The formula for the information content of a given symbol is the negative of the base two logarithm of the symbol's probability. For our theoretical message, the information content of each symbol, along with the total number of bits for that symbol in the message, is found in the following table.

Symbol  Count  Info Cont.  Info Bits
A       15     1.38        20.68
B       7      2.48        17.35
C       6      2.70        16.20
D       6      2.70        16.20
E       5      2.96        14.82

The information for this message adds up to about 85.25 bits. If we code the characters using 8-bit ASCII characters, we would use 39 × 8 bits, or 312 bits. Obviously there is room for improvement.
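As a quick check of these figures, the short program below (not one of the book's utilities) computes the information content of each symbol from its count and totals the results; the symbol counts are the ones in the table above.

/* Verifies the information-content arithmetic above. Illustrative only. */
#include <stdio.h>
#include <math.h>

int main()
{
    static int counts[ 5 ] = { 15, 7, 6, 6, 5 };       /* A through E */
    double info, total = 0.0;
    int i;

    for ( i = 0 ; i < 5 ; i++ ) {
        info = -log( counts[ i ] / 39.0 ) / log( 2.0 );     /* -log2( p ) */
        printf( "%c %2d %5.2f %6.2f\n",
                'A' + i, counts[ i ], info, counts[ i ] * info );
        total += counts[ i ] * info;
    }
    printf( "total: %.2f bits vs. %d bits of 8-bit ASCII\n", total, 39 * 8 );
    return 0;
}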

When we encode the same data using Shannon-Fano codes, we come up with some pretty good numbers, as shown below.

Symbol  Count  Info Cont.  Info Bits  SF Size  SF Bits
A       15     1.38        20.68      2        30
B       7      2.48        17.35      2        14
C       6      2.70        16.20      2        12
D       6      2.70        16.20      3        18
E       5      2.96        14.82      3        15

With the Shannon-Fano coding system, it takes only 89 bits to encode 85.25 bits of information.

Clearly we have come a long way in our quest for efficient coding methods. And while Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.

The Huffman Algorithm

Huffman coding shares most characteristics of Shannon-Fano coding. It creates variable-length codes that are an integral number of bits. Symbols with higher probabilities get shorter codes.

Huffman codes have the unique prefix attribute, which means they can be correctly decoded despite being variable length. Decoding a stream of Huffman codes is generally done by following a binary decoder tree.
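As a sketch of what that decoder walk can look like, consider the fragment below. The node layout is an assumption made for illustration, not the representation the book's HUFF-C code uses; InputBit() is the bit-at-a-time reader from the BITIO package introduced later in this chapter.

/* Decode one symbol by walking the tree from the root. Illustrative
 * node layout: leaves have NULL children and a valid symbol field. */
#include "bitio.h"

typedef struct huff_node {
    int symbol;                      /* meaningful only at a leaf */
    struct huff_node *child[ 2 ];    /* the 0 and 1 branches      */
} HUFF_NODE;

int DecodeSymbol( root, input )
HUFF_NODE *root;
BIT_FILE *input;
{
    HUFF_NODE *node = root;

    while ( node->child[ 0 ] != NULL )   /* each input bit picks a branch */
        node = node->child[ InputBit( input ) ];
    return( node->symbol );
}

Because every code ends at a leaf, the walk stops at an unambiguous symbol no matter how the code lengths vary.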

Building the Huffman decoding tree is done using a completely different algorithm from that of the Shannon-Fano method. The Shannon-Fano tree is built from the top down, starting by assigning the most significant bits to each code and working down the tree until finished. Huffman codes are built from the bottom up, starting with the leaves of the tree and working progressively closer to the root.

The procedure for building the tree is simple and elegant. The individual symbols are laid out as a string of leaf nodes that are going to be connected by a binary tree. Each node has a weight, which is simply the frequency or probability of the symbol’s appearance. The tree is then built with the following steps (a short code sketch follows the list):

• The two free nodes with the lowest weights are located.

• A parent node for these two nodes is created. It is assigned a weight equal to the sum of the two child nodes.

• The parent node is added to the list of free nodes, and the two child nodes are removed from the list.

• One of the child nodes is designated as the path taken from the parent node when decoding a 0 bit. The other is arbitrarily set to the 1 bit.

• The previous steps are repeated until only one free node is left. This free node is designated the root of the tree.
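A minimal sketch of that loop appears below. The node array and its field names are assumptions made for illustration; the HUFF-C program later in this book uses its own structures.

/* Build a Huffman tree over SYMBOL_COUNT leaves. Illustrative only:
 * nodes[ 0 .. SYMBOL_COUNT - 1 ] arrive as leaves with weight set,
 * child_0 = child_1 = -1, and in_free_list = 1. */
#define SYMBOL_COUNT 5
#define NODE_COUNT   ( 2 * SYMBOL_COUNT - 1 )

struct node {
    int weight;
    int child_0;
    int child_1;
    int in_free_list;
};

int build_tree( nodes )
struct node *nodes;
{
    int next, i, min1, min2;

    for ( next = SYMBOL_COUNT ; next < NODE_COUNT ; next++ ) {
        min1 = min2 = -1;             /* find the two lightest free nodes */
        for ( i = 0 ; i < next ; i++ )
            if ( nodes[ i ].in_free_list ) {
                if ( min1 < 0 || nodes[ i ].weight < nodes[ min1 ].weight ) {
                    min2 = min1;
                    min1 = i;
                } else if ( min2 < 0 || nodes[ i ].weight < nodes[ min2 ].weight )
                    min2 = i;
            }
        nodes[ next ].weight = nodes[ min1 ].weight + nodes[ min2 ].weight;
        nodes[ next ].child_0 = min1;       /* 0 branch, chosen arbitrarily */
        nodes[ next ].child_1 = min2;       /* 1 branch                     */
        nodes[ next ].in_free_list = 1;     /* parent joins the free list   */
        nodes[ min1 ].in_free_list = 0;     /* children leave it            */
        nodes[ min2 ].in_free_list = 0;
    }
    return( NODE_COUNT - 1 );               /* the last parent is the root  */
}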

This algorithm can be applied to the symbols used in the previous example. The five symbols in our message are laid out, along with their frequencies, as shown:

15    7    6    6    5
 A    B    C    D    E

These five nodes are going to end up as the leaves of the decoding tree. When the process first starts, they make up the entire list of free nodes.

The first pass through the tree identifies the two free nodes with the lowest weights: D and E, with weights of 6 and 5. (The tie between C and D was broken arbitrarily. While the way that ties are broken affects the final value of the codes, it will not affect the compression ratio achieved.) These two nodes are joined to a parent node, which is assigned a weight of 11. Nodes D and E are then removed from the free list.

Once this step is complete, we know what the least significant bits in the codes for D and E are going to be. D is assigned to the 0 branch of the parent node, and E is assigned to the 1 branch. These two bits will be the LSBs of the resulting codes.

On the next pass through the list of free nodes, the B and C nodes are picked as the two with the lowest weight. These are then attached to a new parent node. The parent node is assigned a weight of 13, and B and C are removed from the free node list. At this point, the tree looks like that shown in Figure 3.2.

Figure 3.2 The Huffman tree after two passes.

On the next pass, the two nodes with the lowest weights are the parent nodes for the B/C and D/E pairs. These are tied together with a new parent node, which is assigned a weight of 24, and the children are removed from the free list. At this point, we have assigned two bits each to the Huffman codes for B, C, D, and E, and we have yet to assign a single bit to the code for A.

Finally, on the last pass, only two free nodes are left. The parent with a weight of 24 is tied with the A node to create a new parent with a weight of 39. After removing the two child nodes from the free list, we are left with just one parent, meaning the tree is complete. The final result looks like that shown in Figure 3.3.

Figure 3.3 The Huffman tree.

To determine the code for a given symbol, we have to walk from the leaf node to the root of the Huffman tree, accumulating new bits as we pass through each parent node. Unfortunately, the bits are returned to us in the reverse order that we want them, which means we have to push the bits onto a stack, then pop them off to generate the code. This strategy gives our message the code structure shown in the following table.

The Huffman Code Table

Symbol  Code
A       0
B       100
C       101
D       110
E       111

As you can see, the codes have the unique prefix property. Since no code is a prefix to another code, Huffman codes can be unambiguously decoded as they arrive in a stream. The symbol with the highest probability, A, has been assigned the fewest bits, and the symbol with the lowest probability, E, has been assigned the most bits.
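The fragment below sketches the leaf-to-root walk and the stack just described. It assumes each node records its parent and the bit on the branch that leads to it; these arrays are illustrative, not part of the book's HUFF-C code.

/* Recover one symbol's code by climbing from its leaf to the root,
 * stacking bits and popping them back off in the right order. */
unsigned int CodeForLeaf( leaf, parent, branch_bit, code_size )
int leaf;                /* index of the symbol's leaf node           */
int *parent;             /* parent[ n ] is n's parent, -1 at the root */
int *branch_bit;         /* bit on the branch from parent[ n ] to n   */
int *code_size;          /* receives the number of bits in the code   */
{
    int stack[ 32 ];
    int sp = 0;
    unsigned int code = 0;

    while ( parent[ leaf ] >= 0 ) {     /* climb, pushing bits LSB first  */
        stack[ sp++ ] = branch_bit[ leaf ];
        leaf = parent[ leaf ];
    }
    *code_size = sp;
    while ( sp > 0 )                    /* pop to reverse them, MSB first */
        code = ( code << 1 ) | stack[ --sp ];
    return( code );
}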

Note, however, that the Huffman codes differ in length from the Shannon-Fano codes. The code length for A is only a single bit instead of two, and the B and C symbols have 3-bit codes instead of 2-bit codes. The following table shows the effect this has on the total number of bits produced by the message.

Symbol  Count  Shannon-Fano Size  Shannon-Fano Bits  Huffman Size  Huffman Bits
A       15     2                  30                 1             15
B       7      2                  14                 3             21
C       6      2                  12                 3             18
D       6      3                  18                 3             18
E       5      3                  15                 3             15

This adjustment in code size adds 13 bits to the number needed to encode the B and C symbols, but it saves 15 bits when coding the A symbol, for a net savings of 2 bits. Thus, for a message with an information content of 85.25 bits, Shannon-Fano coding requires 89 bits, but Huffman coding requires only 87.

In general, Shannon-Fano and Huffman coding are close in performance. But Huffman coding will always at least equal the efficiency of Shannon-Fano coding, so it has become the predominant coding method of its type. Since both algorithms take a similar amount of processing power, it seems sensible to take the one that gives slightly better performance. And Huffman was able to prove that this coding method cannot be improved on by any other integral bit-width coding stream.

Since D. A. Huffman first published his 1952 paper, “A Method for the Construction of Minimum-Redundancy Codes,” his coding algorithm has been the subject of an overwhelming amount of additional research. Information theory journals to this day carry numerous papers on the implementation of various esoteric flavors of Huffman codes, searching for ever better ways to use this coding method. Huffman coding is used in commercial compression programs, FAX machines, and even the JPEG algorithm. The next logical step in this book is to outline the C code needed to implement the Huffman coding scheme.

Huffman in C

A Huffman coding tree is built as a binary tree, from the leaf nodes up. Huffman may or may not have had digital computers in mind when he developed his code, but programmers use the tree data structure all the time.


Two programs used here illustrate Huffman coding. The compressor, HUFF-C, implements a simple order-0 model and a single Huffman tree to encode it. HUFF-E expands files compressed using HUFF-C. Both programs use a few pieces of utility code that will be seen throughout this book.

Before we go on to the actual Huffman code, here is a quick overview of what some of the utility modules do.

BITIO.C

Data-compression programs perform lots of input/output (I/O) that reads or writes unconventional numbers of bits. Huffman coding, for example, reads and writes bits one at a time. LZW programs read and write codes that can range in size from 9 to 16 bits. The standard C I/O library defined in STDIO.H only accommodates I/O on even byte boundaries. Routines like putc() and getc() read and write single bytes, while fread() and fwrite() read and write whole blocks of bytes at a time. The library offers no help for programmers needing a routine to write a single bit at a time.

To support this unconventional I/O in a conventional way, bit-oriented I/O routines are confined to a single source module, BITIO.C. Access to these routines is provided via a header file called BITIO.H, which contains a structure definition and several function prototypes.

Two routines open files for bit I/O, one for input and one for output. As defined in BITIO.H, they are

BIT_FILE *OpenInputBitFile( char *name );

BIT_FILE *OpenOutputBitFile( char *name );

These two routines return a pointer to a new structure, BIT_FILE. BIT_FILE is also defined in BITIO.H as shown:

typedef struct bit_file {
    FILE *file;
    unsigned char mask;
    int rack;
    int pacifier_counter;
} BIT_FILE;

OpenInputBitFile() or OpenOutputBitFile() perform a conventional fopen() call and store the returned FILE structure pointer in the BIT_FILE structure. The other structure elements are initialized to their startup values, and a pointer to the resulting BIT_FILE structure is returned.

In BITIO.H, rack contains the current byte of data either read in from the file or waiting to be written out to the file. mask contains a single bit mask used either to set or clear the current output bit or to mask in the current input bit.

The two new structure elements, rack and mask, manage the bit-oriented side of the I/O: the most significant bit in the I/O byte gets or returns the first bit, and the least significant bit in the I/O byte gets or returns the last bit. This means that the mask element of the structure is initialized to 0x80 when the BIT_FILE is first opened. During output, the first write to the BIT_FILE will set or clear that bit, then the mask element will shift to the next bit. Once the mask has shifted to the point at which all the bits in the output rack have been set or cleared, the rack is written out to the file, and a new rack byte is started.

Performing input from a BIT_FILE is done in a similar fashion. The mask is first set to 0x80, and a single byte from the file is read into the rack element. Each call to read a bit from the file masks in a new bit, then shifts the mask over to the next lower significant bit. Eventually, all the bits in the input rack have been returned, and the input routine can read in a new byte from the input file.

Two types of I/O routines are defined in BITIO.C. The first two routines read or write a single bit at a time. The second two read or write multiple bits, up to the size of an unsigned long. These four routines have the following ANSI prototypes in BITIO.H:

void OutputBit( BIT_FILE *bit_file, int bit );

void OutputBits( BIT_FILE *bit_file,

unsigned long code, int count);

int InputBit( BIT_FILE *bit_file );

unsigned long InputBits( BIT_FILE *bit_file, int bit_count );

Two specialized routines open a BIT_FILE, and two specialized routines close one. The output routine makes sure that the last byte gets written out to the file. Both the input and output routines need to close their files, then free up the BIT_FILE structure allocated when the file was opened.

The BIT_FILE routines used to close a file are defined in BITIO.H with these ANSI prototypes:

void CloseInputBitFile( BIT_FILE *bit_file );

void CloseOutputBitFile( BIT_FILE *bit_file );
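A short usage example may make the interface clearer. It is not from the book's programs, and the file name is invented for illustration; fatal_error() comes from the ERRHAND utility module that BITIO.C links against.

/* Write a 1 bit, a 3-bit code, and an 8-bit code to a bit file. */
#include "bitio.h"
#include "errhand.h"

void write_sample_codes()
{
    BIT_FILE *output;

    output = OpenOutputBitFile( "test.bit" );
    if ( output == NULL )
        fatal_error( "Error opening test.bit\n" );
    OutputBit( output, 1 );            /* the single bit 1       */
    OutputBits( output, 0x5L, 3 );     /* the three bits 101     */
    OutputBits( output, 0x96L, 8 );    /* the full byte 10010110 */
    CloseOutputBitFile( output );      /* flushes the final rack */
}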

The input and output routines in BITIO.H also have a pacifier feature that can be useful in testing compression code. Every BIT_FILE structure has a pacifier_counter that gets incremented every time a new byte is read in or written out to the corresponding file. Once every 2,048 bytes, a single character is written to stdout. This helps assure the impatient user that real work is being done. On MS-DOS systems, it also helps ensure that the user can break out of the program if it does not appear to be working correctly.

The header file and code for BITIO are shown next:

/******************** Start of BITIO.H **********************/

#ifndef _BITIO_H

#define _BITIO_H

#include <stdio.h>

typedef struct bit_file {
    FILE *file;
    unsigned char mask;
    int rack;
    int pacifier_counter;
} BIT_FILE;

#ifdef __STDC__

BIT_FILE* OpenInputBitFile( char *name );

BIT_FILE* OpenOutputBitFile( char *name );

void OutputBit( BIT_FILE *bit_file, int bit );

void OutputBits( BIT_FILE *bit_file,

unsigned long code, int count );

int InputBit( BIT_FILE *bit_file );

unsigned long InputBits( BIT_FILE *bit_file, int bit_count );

void CloseInputBitFile( BIT_FILE *bit_file );

void CloseOutputBitFile( BIT_FILE *bit_file );

void FilePrintBinary( FILE *file, unsigned int code, int bits);

#else /* __STDC__ */

BIT_FILE* OpenInputBitFile();

BIT_FILE* OpenOutputBitFile();

void OutputBit();

void OutputBits();

int InputBit();

unsigned long InputBits();

void CloseInputBitFile();

void CloseOutputBitFile();

void FilePrintBinary();

#endif /* __STDC__ */

#endif /* _BITIO_H */

/********************** End of BITIO.H *********************/

/******************** Start of BITIO.C ********************/

/*

* This utility file contains all of the routines needed to implement

* bit oriented routines under either ANSI or K&R C. It needs to be

* linked with every program used in the book.

*/

#include <stdio.h>

#include <stdlib.h>

#include "bitio.h"

#include "errhand.h"

BIT_FILE *OpenOutputBitFile( name )
char *name;

{

BIT_FILE *bit_file;

bit_file = (BIT_FILE *) calloc( 1, sizeof( BIT_FILE ) );

if ( bit_file == NULL ) return( bit_file );

bit_file->file = fopen( name, "rb" );

bit_file->rack = 0;

bit_file->mask = 0x80;

bit_file->pacifier_counter = 0;

return( bit_file );

}

BIT_FILE *OpenInputBitFile( name )
char *name;

{

BIT_FILE *bit_file;

bit_file = (BIT_FILE *) calloc( 1, sizeof( BIT_FILE ) );

if ( bit_file == NULL ) return( bit_file );

bit_file->file = fopen( name, "rb" );

bit_file->rack = 0;

bit_file->mask = 0x80;

bit_file->pacifier_counter = 0;

return( bit_file );

}

void CloseOutputBitFile( bit_file )
BIT_FILE *bit_file;

{

if ( bit_file->mask != 0x80 )
    if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack )
        fatal_error( "Fatal error in CloseBitFile!\n" );

fclose( bit_file->file );

free( (char *) bit_file );

}

void CloseInputBitFile( bit_file )
BIT_FILE *bit_file;
{
    fclose( bit_file->file );
    free( (char *) bit_file );
}

void OutputBit( bit_file, bit )
BIT_FILE *bit_file;
int bit;
{
    if ( bit )
        bit_file->rack |= bit_file->mask;
    bit_file->mask >>= 1;
    if ( bit_file->mask == 0 ) {
        if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack )
            fatal_error( "Fatal error in OutputBit!\n" );
        else if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 )
            putc( '.', stdout );
        bit_file->rack = 0;
        bit_file->mask = 0x80;
    }
}

void OutputBits( bit_file, code, count )
BIT_FILE *bit_file;
unsigned long code;
int count;
{
    unsigned long mask;

    mask = 1L << ( count - 1 );
    while ( mask != 0 ) {
        if ( mask & code )
            bit_file->rack |= bit_file->mask;
        bit_file->mask >>= 1;
        if ( bit_file->mask == 0 ) {
            if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack )
                fatal_error( "Fatal error in OutputBits!\n" );
            else if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 )
                putc( '.', stdout );
            bit_file->rack = 0;
            bit_file->mask = 0x80;
        }
        mask >>= 1;
    }
}

int InputBit( bit_file )
BIT_FILE *bit_file;
{
    int value;

    if ( bit_file->mask == 0x80 ) {
        bit_file->rack = getc( bit_file->file );
        if ( bit_file->rack == EOF )
            fatal_error( "Fatal error in InputBit!\n" );
        if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 )
            putc( '.', stdout );
    }
    value = bit_file->rack & bit_file->mask;
    bit_file->mask >>= 1;
    if ( bit_file->mask == 0 )
        bit_file->mask = 0x80;
    return( value ? 1 : 0 );
}

unsigned long InputBits( bit_file, bit_count )
BIT_FILE *bit_file;

int bit_count;

{

unsigned long mask;

unsigned long return_value;

mask = 1L << ( bit_count - 1 );

return_value = 0;

while ( mask != 0) {

if ( bit_file->mask == 0x80 ) {

bit_file->rack = getc( bit_file->file );

if ( bit_file->rack == EOF )

fatal_error( "Fatal error in InputBit!\n" );

if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 )
    putc( '.', stdout );

}

if ( bit_file->rack & bit_file->mask )
    return_value |= mask;

mask >>= 1;

bit_file->mask >>= 1;

if ( bit_file->mask == 0 )
    bit_file->mask = 0x80;

}

return( return_value );

}

void FilePrintBinary( file, code, bits )
FILE *file;

unsigned int code;

int bits;

{

unsigned int mask;

mask = 1 << ( bits - 1 );

while ( mask != 0 ) {
    if ( code & mask )
        fputc( '1', file );
    else
        fputc( '0', file );
    mask >>= 1;
}
}

/********************** End of BITIO.C **********************/

A Reminder about Prototypes

The code in this book works on both Unix K&R and the more modern MS-DOS compilers. This affects the code in this book mainly in the area of function parameters in both prototypes and the function body itself. For the function body, all code in this book will use old-fashioned parameter specifications like this:

int main( argc, argv )
int argc;

char *argv[];

{

...

This is the only method of parameter declaration acceptable to K&R compilers, and as such it has the blessing of the ANSI standard. A few compilers (Microsoft C 6.0 at Warning Level 4, for example) will issue a warning when they encounter this type of function declaration, so be prepared to ignore those warnings. Declaring function parameters in this manner will generally have no effect on code reliability or readability, so using the K&R style should be considered a benign anachronism.

Parameters in function declarations present a little more of a problem. The ANSI C specification will accept old-style K&R function declarations (such as int main();), but there are good reasons to specify all function arguments in the declaration. When using full prototyping—as in int main( int argc, char *argv[] );—the compiler checks for correct parameter passing when it encounters a call to a function. This helps avoid one of the most commonplace C coding mistakes: incorrect parameter passing.
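As a small illustration (not from the book's source), the fragment below shows the kind of mistake full prototyping catches. With the ANSI prototype in scope, the first call would be rejected at compile time; with only a K&R declaration, it would compile and fail at run time.

#include "bitio.h"     /* declares InputBits() with a full prototype */

void prototype_example( input )
BIT_FILE *input;
{
    unsigned long code;

    /* code = InputBits( input ); */   /* rejected: too few arguments */
    code = InputBits( input, 12 );     /* checked and accepted        */
}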

