
3.6 SOURCE CODING ALGORITHMS

From the information theoretic perspective, source coding can mean both lossless and lossy compression. However, researchers often use the term to indicate lossless coding only. In the signal processing community, source coding is used to mean source model based coding. In this section, we describe some basic source coding algorithms, such as run-length coding and Huffman coding, in greater detail.

3.6.1 Run-length coding

Run-length coding is a simple approach to source coding that applies when a dataset contains long consecutive runs of the same value. As an example, the data d = '6 6 6 6 6 6 0 9 0 5 5 5 5 5 5 2 2 2 2 2 2 1 3 4 4 4 4 4 ...' contains long runs of 6's, 5's, 2's, 4's, etc. Rather than coding each sample in the run individually, the data can be represented compactly by simply indicating the value of the sample and the length of its run wherever it appears. For example, if a portion of an image is represented by "5 5 5 5 5 5 5 19 19 19 19 19 19 19 19 19 19 19 19 0 0 0 0 0 0 0 0 8 23 23 23 23 23 23,"

this can be run-length encoded as (5 7) (19 12) (0 8) (8 1) (23 6). For ease of understanding, we have shown each pair within parentheses. Here the first value represents the pixel value, while the second indicates the length of its run.
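This pairing scheme can be illustrated with a short Python sketch. The helper names run_length_encode and run_length_decode below are ours, chosen purely for illustration; they are not part of any standard library or of the JPEG/CCITT standards discussed later.

def run_length_encode(samples):
    """Return (value, run_length) pairs for a sequence of samples."""
    runs = []
    for value in samples:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((value, 1))               # start a new run
    return runs

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    decoded = []
    for value, length in runs:
        decoded.extend([value] * length)
    return decoded

data = [5]*7 + [19]*12 + [0]*8 + [8] + [23]*6
encoded = run_length_encode(data)
print(encoded)   # [(5, 7), (19, 12), (0, 8), (8, 1), (23, 6)]
assert run_length_decode(encoded) == data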

In some cases, runs of identical symbols may not be readily apparent in the raw data, but the data can often be preprocessed in order to aid run-length coding.

Consider the data d = '26 29 32 35 38 41 44 50 56 62 68 78 88 98 108 118 116 114 112 110 108 106 104 102 100 98 96'. We can simply preprocess this data by taking the sample difference e(i) = d(i) - d(i-1), to produce the processed data e = '26 3 3 3 3 3 3 6 6 6 6 10 10 10 10 10 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 -2'. This preprocessed data can now easily be run-length encoded as (26 1) (3 6) (6 4) (10 5) (-2 11). A variation of this technique is applied in the baseline JPEG standard for still picture compression [7]. The same technique can be applied to numeric databases as well.
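A minimal sketch of this differencing step, again in Python and with our own helper names (delta_encode, run_lengths), might look as follows; it reproduces the run-length pairs quoted above.

from itertools import groupby

def delta_encode(samples):
    """e(i) = d(i) - d(i-1); the first sample is kept unchanged."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def run_lengths(samples):
    """(value, run_length) pairs, computed here with itertools.groupby."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(samples)]

d = [26, 29, 32, 35, 38, 41, 44, 50, 56, 62, 68, 78, 88, 98, 108, 118,
     116, 114, 112, 110, 108, 106, 104, 102, 100, 98, 96]
print(run_lengths(delta_encode(d)))
# [(26, 1), (3, 6), (6, 4), (10, 5), (-2, 11)]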

On the other hand, binary (black and white) images, such as facsimile images, usually consist of runs of 0's and 1's. As an example, if a segment of a binary image is represented as d =

"0000000001111111111100000000000000011100000000000001001111111111,"

it can be compactly represented as c(d) = (9, 11, 15, 3, 13, 1, 2, 10) by simply listing the lengths of the alternating runs of 0's and 1's. While the original binary data d requires 64 bits for storage, its compact representation c(d) requires only 32 bits, under the assumption that each run length is represented by 4 bits. The early facsimile compression standards (CCITT Group 3 and CCITT Group 4) were developed based on this principle [8].
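A sketch of this alternating-run representation is given below; binary_run_lengths is our own illustrative helper and assumes, as in the example above, that the sequence starts with a run of 0's.

from itertools import groupby

def binary_run_lengths(bits):
    """Lengths of the alternating runs of 0's and 1's in a bit string,
    assuming the string starts with a run of 0's (as in the example above)."""
    return [sum(1 for _ in group) for _, group in groupby(bits)]

runs = [9, 11, 15, 3, 13, 1, 2, 10]
# rebuild the example segment from its stated run lengths, then round-trip it
d = "".join(("0" if i % 2 == 0 else "1") * n for i, n in enumerate(runs))
assert binary_run_lengths(d) == runs
print(len(d), "bits reduced to", 4 * len(runs), "bits with 4-bit run lengths")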

3.6.2 Huffman coding

In 1952, D. A. Huffman [9] invented a coding technique to produce the shortest possible average code length, given the source symbol set and the associated probability of occurrence of the symbols. The Huffman coding technique is based on the following two observations regarding optimum prefix codes.

• The more frequently occurring symbols are allocated shorter codewords than the less frequently occurring symbols.

• The two least frequently occurring symbols will have codewords of the same length, and they differ only in the least significant bit.

The average length of these codes is close to the entropy of the source.

Let us assume that there are m source symbols {s1, s2, ..., sm} with associated probabilities of occurrence {p1, p2, ..., pm}. Using these probability values, we can generate a set of Huffman codes of the source symbols. The Huffman codes can be mapped into a binary tree, popularly known as the Huffman tree. We describe below the algorithm to generate the Huffman codes of the source symbols; a short code sketch illustrating the same construction follows the numbered steps.

1. Produce a set N = {N1, N2, ..., Nm} of m nodes as leaves of a binary tree. Assign node Ni to the source symbol si, i = 1, 2, ..., m, and label the node with the associated probability pi.

(Example: As shown in Fig. 3.3, we start with eight nodes N0, N1, N2, N3, N4, N5, N6, N7 corresponding to the eight source symbols a, b, c, d, e, f, g, h, respectively. The probability of occurrence of each symbol is indicated in the associated parentheses.)

2. Find the two nodes with the two lowest probability symbols from the current node set, and produce a new node as a parent of these two nodes.


Fig. 3.3 Huffman tree construction for Example 2.

(Example: From Fig. 3.3 we find that the two lowest probability symbols g and d are associated with nodes N6 and N3, respectively. The new node N8 becomes the parent of N3 and N6.)

3. Label the probability of this new parent node as the sum of the probabilities of its two child nodes.

(Example: The new node N8 is now labeled with probability 0.09, which is the sum of the probabilities 0.06 and 0.03 of the symbols d and g associated with the nodes N3 and N6, respectively.)

4. Label the branch of one child node of the new parent node as 1 and the branch of the other child node as 0.

(Example: The branch N3 to N8 is labeled by 1 and the branch N6 to N8 is labeled by 0.)

5. Update the node set by replacing the two child nodes of smallest probabilities with the newly generated parent node. If the number of nodes remaining in the node set is greater than 1, go to Step 2.

(Example: The new node set now contains the nodes N0, N1, N2, N4, N5, N7, N8, and the associated probabilities are 0.30, 0.10, 0.20, 0.09, 0.07, 0.15, 0.09, respectively. Since there is more than one node in the node set, Steps 2 to 5 are repeated and the nodes N9, N10, N11, N12, N13, N14 are generated in the next six iterations, until the node set consists of N14 only.)

6. Traverse the generated binary tree from the root node to each leaf node Ni, i = 1, 2, ..., m, to produce the codeword of the corresponding symbol si, which is a concatenation of the binary labels (0 or 1) of the branches from the root to the leaf node.

(Example: The Huffman code of symbol h is "110," formed by concatenating the binary labels of the branches N14 to N13, N13 to N11, and N11 to N7.)
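The construction in Steps 1 to 6 can be sketched compactly in Python using a priority queue. The probabilities below are those of Example 2. Since ties (for instance, between the two nodes of probability 0.09) may be broken either way, and either branch may receive the label 0 or 1, the individual codewords printed by this sketch can differ from Fig. 3.3 and Table 3.1, although the resulting average code length is the same.

import heapq
import itertools

def huffman_code(probabilities):
    """Build Huffman codewords from a {symbol: probability} mapping."""
    tie = itertools.count()            # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # the two lowest-probability nodes (Step 2)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}         # label one branch 0 ...
        merged.update({s: "1" + c for s, c in codes1.items()})   # ... and the other 1 (Step 4)
        heapq.heappush(heap, (p0 + p1, next(tie), merged))       # parent node (Steps 3 and 5)
    return heap[0][2]

p = {'a': 0.30, 'b': 0.10, 'c': 0.20, 'd': 0.06,
     'e': 0.09, 'f': 0.07, 'g': 0.03, 'h': 0.15}
for symbol, code in sorted(huffman_code(p).items()):
    print(symbol, code)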

Table 3.1 Huffman code table

The source of Example 2 consists of the eight symbols a, b, c, d, e, f, g, and h, with probabilities of occurrence p(a) = 0.30, p(b) = 0.10, p(c) = 0.20, p(d) = 0.06, p(e) = 0.09, p(f) = 0.07, p(g) = 0.03, and p(h) = 0.15, respectively. The Huffman tree for this source is depicted in Fig. 3.3, while the Huffman code is shown in Table 3.1.

Let us consider a string M of 200 symbols generated from the above source, where the numbers of occurrences of a, b, c, d, e, f, g, and h in M are 60, 20, 40, 12, 18, 14, 6, and 30, respectively. The size of the message M encoded using the Huffman codes in Table 3.1 will be 550 bits, that is, 2.75 bits per symbol on the average. On the other hand, the length of the encoded message M will be 600 bits if it is encoded by a fixed-length code of length 3 for each of the symbols. This simple example demonstrates how we can achieve compression using variable-length coding or source coding techniques.
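As a check on these figures, the entropy of the source (see the remark on optimum prefix codes above) can be computed directly; the short self-contained sketch below works out to roughly 2.70 bits per symbol, while the 2.75 and 3.0 bits-per-symbol values are those quoted in the text.

import math

p = {'a': 0.30, 'b': 0.10, 'c': 0.20, 'd': 0.06,
     'e': 0.09, 'f': 0.07, 'g': 0.03, 'h': 0.15}
entropy = -sum(q * math.log2(q) for q in p.values())
print(f"source entropy     : {entropy:.2f} bits/symbol")    # about 2.70
print(f"Huffman coding     : {550 / 200:.2f} bits/symbol")  # 2.75, from Table 3.1
print(f"fixed-length coding: {600 / 200:.2f} bits/symbol")  # 3.00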


3.7 PRINCIPAL COMPONENT ANALYSIS FOR DATA COMPRESSION
