
Arithmetic Bit Recycling Data Compression

Thesis by Ahmad Al-Rababa'a
Doctorat en informatique, Philosophiæ doctor (Ph.D.)
Québec, Canada
© Ahmad Al-Rababa'a, 2016


Résumé

Data compression is the computing technique that aims to reduce the size of information in order to minimize the required storage space and to speed up data transmission over bandwidth-limited networks. Several compression techniques, such as LZ77 and its variants, suffer from a problem that we call the redundancy caused by the multiplicity of encodings. The multiplicity of encodings (ME) means that the source data can be encoded in different ways. In its simplest case, ME occurs when a compression technique has the opportunity, during the encoding process, to encode a symbol in different ways.

The bit recycling compression technique was introduced by D. Dubé and V. Beaudoin to minimize the redundancy caused by ME. Variants of bit recycling have been applied to LZ77, and the experimental results obtained lead to better compression (a reduction of about 9% in the size of files compressed by Gzip) by exploiting ME.

Dubé and Beaudoin pointed out that their technique may not minimize the redundancy caused by ME perfectly, because it is built on Huffman coding, which cannot handle codewords of fractional lengths; that is, it only generates codewords of integral lengths. Moreover, Huffman-based bit recycling (HuBR) imposes additional constraints to avoid certain situations that degrade its performance. Unlike Huffman codes, arithmetic coding (AC) can manipulate codewords of fractional lengths. Furthermore, over the last few decades, arithmetic codes have attracted many researchers because they are more powerful and more flexible than Huffman codes. Consequently, this work aims to adapt bit recycling to arithmetic codes in order to improve coding efficiency and flexibility. We addressed this problem through our four (published) contributions. These contributions are presented in this thesis and can be summarized as follows.

First, we propose a new technique for adapting the Huffman-based bit recycling (HuBR) to arithmetic coding. This technique is named arithmetic-coding-based bit recycling (ACBR). It describes the framework and the principles of adapting HuBR to ACBR. We also present the theoretical analysis needed to estimate the redundancy that can be removed by HuBR and ACBR in the applications that suffer from ME. This analysis shows that ACBR achieves perfect recycling in all cases, whereas HuBR achieves such performance only in very specific cases.

Second, the problem with the aforementioned ACBR technique is that it requires arbitrary-precision calculations, which demand unbounded (or infinite) resources. In order to benefit from ACBR in practice, we propose a new finite-precision version. The technique thereby becomes efficient and applicable on computers with conventional fixed-size registers, and it can easily be interfaced with the applications that suffer from ME.

Third, we propose the use of HuBR and ACBR as a means of reducing redundancy in order to obtain a binary variable-to-fixed length code. We proved theoretically and experimentally that both techniques yield a significant improvement (less redundancy). In this respect, ACBR outperforms HuBR and provides a wider class of binary sources that can benefit from a plurally parsable dictionary. Moreover, we show that ACBR is more flexible than HuBR in practice.

Fourth, we use HuBR to reduce the redundancy of the balanced codes generated by Knuth's algorithm. In order to compare the performance of HuBR and ACBR, the corresponding theoretical results for HuBR and ACBR are presented. The results show that both techniques achieve almost the same reduction in the redundancy of the balanced codes generated by Knuth's algorithm.


Abstract

Data compression aims to reduce the size of data so that it requires less storage space and less communication channel bandwidth. Many compression techniques (such as LZ77 and its variants) suffer from a problem that we call the redundancy caused by the multiplicity of encodings. The Multiplicity of Encodings (ME) means that the source data may be encoded in more than one way. In its simplest case, it occurs when a compression technique with ME has the opportunity, at certain steps during the encoding process, to encode the same symbol in different ways.

The Bit Recycling compression technique was introduced by D. Dubé and V. Beaudoin to minimize the redundancy caused by ME. Variants of bit recycling have been applied to LZ77, and the experimental results showed that bit recycling achieved better compression (a reduction of about 9% in the size of files that had been compressed by Gzip) by exploiting ME.

Dubé and Beaudoin have pointed out that their technique could not minimize the redundancy caused by ME perfectly, since it is built on Huffman coding, which does not have the ability to deal with codewords of fractional lengths; i.e., it is constrained to generating codewords of integral lengths. Moreover, Huffman-based Bit Recycling (HuBR) imposes an additional burden to avoid some situations that affect its performance negatively. Unlike Huffman coding, Arithmetic Coding (AC) can manipulate codewords of fractional lengths. Furthermore, it has attracted researchers in the last few decades since it is more powerful and flexible than Huffman coding. Accordingly, this work aims to address the problem of adapting bit recycling to arithmetic coding in order to improve the code efficiency and the flexibility of HuBR. We addressed this problem through our four (published) contributions. These contributions are presented in this thesis and can be summarized as follows.

Firstly, we propose a new scheme for adapting HuBR to AC. The proposed scheme, named Arithmetic-Coding-based Bit Recycling (ACBR), describes the framework and the principle of adapting HuBR to AC. We also present the necessary theoretical analysis required to estimate the average amount of redundancy that can be removed by HuBR and ACBR in the applications that suffer from ME, which shows that ACBR achieves perfect recycling in all cases whereas HuBR achieves perfect recycling only in very specific cases.


Secondly, the problem of the aforementioned ACBR scheme is that it uses arbitrary-precision calculations, which require unbounded (or infinite) resources. Hence, in order to benefit from ACBR in practice, we propose a new finite-precision version of the ACBR scheme, which makes it efficiently applicable on computers with conventional fixed-size registers and easy to interface with the applications that suffer from ME.

Thirdly, we propose the use of both techniques (HuBR and ACBR) as the means to reduce the redundancy in plurally parsable dictionaries that are used to obtain a binary variable-to-fixed length code. We theoretically and experimentally show that both techniques achieve a significant improvement (less redundancy) in this respect, but ACBR outperforms HuBR and provides a wider class of binary sources that may benefit from a plurally parsable dictionary. Moreover, we show that ACBR is more flexible than HuBR in practice.

Fourthly, we use HuBR to reduce the redundancy of the balanced codes generated by Knuth's algorithm. In order to compare the performance of HuBR and ACBR, the corresponding theoretical results and analysis of HuBR and ACBR are presented. The results show that both techniques achieve almost the same significant reduction in the redundancy of the balanced codes generated by Knuth's algorithm.


Contents

Résumé
Abstract
Contents
List of Tables
List of Figures
Abbreviations

1 Introduction
1.1 Background and Motivation
1.2 Statement of the Problem
1.3 Methodology
1.4 Organization of this Document

2 Literature Review
2.1 Data Compression Overview
2.2 The Presence of the Multiplicity of Encoding in Various Applications
2.3 The Bit Recycling Compression Technique
2.4 Arithmetic Coding

3 Arithmetic-Coding-Based Bit Recycling (ACBR)
3.1 The Principle of ACBR
3.2 Problems and Needs of ACBR
3.3 The Non-Deterministic Process of ACBR
3.4 Theoretical Analysis of ACBR Performance

4 The Finite-Precision ACBR Scheme
4.1 The Finite-Precision ACBR Coder
4.2 The Finite-Precision ACBR Decoder

5 Using Bit Recycling to Reduce the Redundancy in Plurally Parsable Dictionaries
5.1 Introduction
5.2 Tunstall's and Savari's Dictionaries
5.4 The Finite-Precision ACBR Scheme for the Redundancy in Plurally Parsable Dictionaries
5.5 Theoretical Analysis and Results
5.6 Experimental Results

6 Using Bit Recycling to Reduce Knuth's Balanced Codes Redundancy
6.1 Background and Motivation
6.2 Balanced Codes Redundancy
6.3 Knuth's Algorithm and its Redundancy
6.4 Applying Bit Recycling (HuBR & ACBR) on Knuth's Algorithm
6.5 Theoretical Analysis and Results

7 Conclusion and Future Work
7.1 Prior Work
7.2 Proposed Schemes and Results
7.3 Our Published Contributions
7.4 Future Work

A Proofs of Various Formulas
A.1 The Derivation of the Minimum Cost Formula of Two Choices using ACBR
A.2 The Derivation of the Closed Formula for AT


List of Tables

2.1 The steps of encoding the string 'aabraabcaabbaaba' by LZ77
2.2 The steps of decoding the string 'aabraabcaabbaaba' by LZ77
2.3 The steps of encoding the string 'aabraabcaabbaaba' by LZ78
2.4 An order-2 model of the encoded string 'aabaa'
2.5 Volf and Willems switching-compression technique
2.6 A 3-bit Tunstall code
2.7 The Tunstall and plurally parsable dictionaries of a binary source, α = {0, 1}
2.8 The probability distribution and the corresponding Huffman codewords of α at time t
2.9 The statistical model of the source data
3.1 The probability distribution and the corresponding Huffman codewords of α at time t
5.1 The Tunstall and Savari dictionaries for a skewed binary source
5.2 The experimental results of applying ACBR, HuBR, the Tunstall code, and the Savari code using a dictionary of size M = 4
6.1 Calculated H(m), I(m), H_N(m), and I_N(m) for different values of m
6.2 The computed values of the minimum redundancy and the number of redundant bits per m-bit block in Knuth's algorithm, HuBRK, and ACBRK


List of Figures

2.1 LZ77 sliding window
2.2 Illustration of ME in the string 'abbaaabbabbabbb' using LZ77
2.3 The principle of the HuBR compressor
2.4 Illustration of the decompressor's bit stream
2.5 The steps of constructing the optimal recycling tree for a given list of choices and the corresponding costs
2.6 The model representation in arithmetic coding
2.7 The steps of encoding S = '3320' using the arbitrary-precision AC
2.8 The steps of encoding S = '3320' using the finite-precision AC
3.1 (A) The binary tree of the source alphabets. (B) The skeleton of the equivalent messages
3.2 The principle of the arbitrary-precision ACBR scheme
3.3 A closer view of the composed interval I_{t+1}
3.4 The non-deterministic process of ACBR (A)
3.5 The non-deterministic process of ACBR (B)
3.6 Comparison of HuBR and ACBR performance with uniform distribution
3.7 Comparison of HuBR and ACBR performance with two skewed choices
4.1 The framework of the finite-precision ACBR scheme
4.2 (A) The binary tree of the source alphabets. (B) The skeleton of the tree of the equivalent messages
4.3 The steps of encoding S = {m1, m3, m5} m3 using the finite-precision ACBR scheme
4.4 The steps of decoding σ = 101000000 into S = {m1, m3, m5} m3 using the finite-precision ACBR scheme
5.1 Encoding the chunk s = (0^21 1) with ACBR
5.2 The principle of the finite-precision ACBR compressor for binary sources
5.3 A comparison of the cost of the chunk (0^l 1) achieved by Tunstall, Savari, HuBR, and ACBR for a dictionary of size M = 4
5.4 A comparison of the cost of the chunk (0^l 1) achieved by Tunstall, Savari, HuBR, and ACBR for a dictionary of size M = 8
5.5 A comparison of the expected codeword length per bit achieved by Tunstall, Savari, HuBR, and ACBR for a dictionary of size M = 4
5.6 A comparison of the expected codeword length per bit achieved by Tunstall, Savari, HuBR, and ACBR for a dictionary of size M = 8
5.8 A comparison of the cost of the chunk (0^l 1) achieved by Tunstall using its standard dictionary, and by HuBR and ACBR using Dic2, for a dictionary of size M = 4
5.9 A comparison of the expected codeword length per bit achieved by Tunstall using its standard dictionary, and by HuBR and ACBR using Dic2, for a dictionary of size M = 4
6.1 The main steps of the HuBRK scheme
6.2 The termination process
6.3 The decoding process of HuBRK
6.4 Comparison between the performance of Knuth's algorithm with HuBRK, ACBRK, and the Immink & Weber technique


Abbreviations

AC      Arithmetic Coding
ACBR    Arithmetic-Coding-based Bit Recycling
ACBRK   Arithmetic-Coding-based Bit Recycling for Knuth's algorithm
EQ      Equivalent Messages
FF      Fixed-to-Fixed
FIFO    First-In-First-Out
FV      Fixed-to-Variable
HuBR    Huffman-based Bit Recycling
HuBRK   Huffman-based Bit Recycling for Knuth's algorithm
LSB     Least Significant Bit
LZ77    The 1977 Lempel and Ziv compression algorithm
LZ78    The 1978 Lempel and Ziv compression algorithm
LZSS    The Lempel-Ziv-Storer-Szymanski compression algorithm
LZW     The Lempel-Ziv-Welch compression algorithm
ME      Multiplicity of Encodings
MSB     Most Significant Bit
ND      Non-Determinism, Non-Deterministic, or Non-Deterministically
PP      Plurally Parsable
PPM     Prediction by Partial Match
RLE     Run-Length Encoding
RT      Recycling Tree
VF      Variable-to-Fixed
VV      Variable-to-Variable


To my parents, Alshiekh Bader Al-Rababa'a and Aminah Al-Tamimi


Chapter 1

Introduction

1.1 Background and Motivation

Data storage and transmission cost money, and this cost increases with the amount of data available; therefore, data compression aims to reduce the size of data so that it requires less storage space and less bandwidth from the communication channels. An additional benefit of data compression arises in wireless communication devices, as power consumption is directly proportional to the size of the transmitted data. Power can be saved by compressing data prior to transmission, because the wireless transmission of a single bit can require over 1000 times more power than a single 32-bit computation [12]. Data compression has therefore recently been used to reduce the energy consumption in many wireless applications, such as wireless-networked handheld devices [37] and wireless sensor networks (WSN) [28].

Data encoding is the process of mapping the original (or source) data, which is a sequence of symbols taken from a given source alphabet α, into (usually) binary code sequences. The symbols may be characters, words, messages, pixels, sequences of symbols, or any units that may be stored in a dictionary of symbols. Data compression can be thought of as a two-stage process: modeling and encoding. Many encoding techniques have been widely used in developing data compression algorithms, in which the original data is represented (encoded) according to the underlying statistical model in fewer bits, as the coder assigns shorter codewords to the more likely symbols. Data compression algorithms can be categorized into two main groups: lossless and lossy. In lossless compression (also known as reversible compression), every single symbol of the original data can be reconstructed (decoded), whereas in lossy compression (also known as irreversible compression), only an estimation of the original data can be decoded.

Compression is possible because data usually contains redundancies, or information that is often repeated. The redundancy of a source code (lossless data compression) is the amount by which the average number of code symbols per source symbol for that code exceeds the entropy of the source [52], which is the optimal (minimum) number of code symbols (bits) per source symbol. The reasons for the redundancies that remain in compressed data are: inappropriate modeling of the data, incorrect (or inaccurate) random-source statistics, and the multiplicity of encodings. This work concerns itself with the redundancy caused by the multiplicity of encodings in lossless data compression.
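Written symbolically (this merely restates the definition above, with $\bar{L}$ denoting the average number of code symbols per source symbol and $H$ the entropy of the source):

$R = \bar{L} - H \geq 0$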

The Multiplicity of Encodings (ME) means that the original data may be encoded in more than one way. In its simplest case, it occurs when a compression technique has the opportunity, at certain steps during the encoding process, to encode a symbol differently, i.e., different codewords for the same symbol can be sent to the decoder, and any one of these codewords can be decoded correctly. Upon the occurrence of such an opportunity, the default behavior of most techniques is to encode the symbol using the shortest codeword and, if possible, the least computation.

Many applications suffer from the redundancy caused by ME, such as LZ77 (Lempel and Ziv, 1977) [68] and its variants, some variants of the Prediction by Partial Matching (PPM) technique [14], Volf and Willems' switching-compression technique [61], variable-to-fixed length codes based on plurally parsable dictionaries [50, 66], and Knuth's algorithm [30] for the generation of balanced codes.

The bit recycling technique [16] has been introduced by D. Dubé and V. Beaudoin to minimize the redundancy caused by ME. It reduces that kind of redundancy by harnessing ME in a certain way, so that it is not always necessary to select the shortest codeword; instead, all the appropriate codewords are taken into account with some agreement between the encoder and the decoder. Variants of bit recycling have been applied to the LZ77 algorithm in [17, 18, 19]. The experimental results of [19] showed that bit recycling achieved better compression by exploiting ME rather than systematically selecting the shortest codeword.

Dubé and Beaudoin have pointed out that their technique could not minimize the redundancy perfectly since it is built on Huffman coding [25], which does not have the ability to deal with codewords of fractional lengths, i.e., it is constrained to generating codewords of integral lengths, so it can only completely utilize probabilities that are powers of 1/2. Moreover, Huffman-based Bit Recycling (HuBR) has imposed an additional burden to avoid some situations that affect its performance negatively. Unlike Huffman coding, Arithmetic Coding (AC) [64] can manipulate codewords of fractional lengths. Furthermore, it has attracted researchers in the last few decades since it is more powerful and flexible than Huffman coding. Consequently, this work aims to address the problem of adapting bit recycling to AC in order to achieve better compression with fewer burdens.


1.2 Statement of the Problem

This work aims to address the following key questions:

1. Why can bit recycling achieve more compression with fewer burdens if it is adapted to AC?

2. How do we adapt HuBR to AC in a practical way?

3. How much more compression can be achieved by this adaptation?

4. What are the practical applications that could benefit from this work, and how?

1.3 Methodology

In order to answer the aforementioned questions (the statement of the problem), we follow the methodology below. To answer the first question, we demonstrate the theory behind improving HuBR by adapting it to AC. Furthermore, we show theoretically that this adaptation can achieve better compression than HuBR.

To answer the second question, a new scheme named Arithmetic-Coding-based Bit Recycling (ACBR) is proposed to address the weaknesses of HuBR. The problem of the proposed ACBR scheme is that it uses arbitrary-precision calculations, which require unbounded (or infinite) resources. Hence, in order to benefit from ACBR in practice, we propose the corresponding finite-precision ACBR, which addresses the problem of arbitrary-precision ACBR and can be implemented and evaluated in practice.

We answer the third and the fourth questions by using both ACBR and HuBR as the means to reduce the redundancy caused by ME in the following two applications:

1. The variable-to-fixed length codes based on plurally parsable dictionaries designed by Savari [50].

2. The generation of balanced codes using Knuth's algorithm [30].

In fact, each application has its own characteristics; therefore, the finite-precision ACBR and HuBR had to be adjusted so that they fulfill the requirements of these applications. Accordingly, in the first application, two new algorithms, the compressor and the decompressor, are proposed based on the finite-precision ACBR scheme to reduce the redundancy of the plurally parsable dictionaries designed by Savari. In the second application, which is not a case of data compression, but rather one of constrained coding [31], a new scheme based on HuBR is proposed to reduce the redundancy of the balanced codes generated by Knuth's algorithm. In the first application, in order to evaluate the performance of the finite-precision ACBR and to compare it with the performance of HuBR, we present a theoretical analysis that evaluates the performance of variable-to-fixed codes based on the Tunstall dictionary [57] and ones based on plurally parsable dictionaries, using Savari's coding on the one hand and coding with bit recycling (ACBR and HuBR) on the other hand. We also present the experimental results of our (software) implementation, which aims to: (i) ensure the correctness of the proposed algorithms; (ii) experimentally evaluate the performance of the aforementioned methods and compare them with each other; (iii) experimentally evaluate the degradation in the performance of ACBR and HuBR incurred by a combination of practical factors and some factors that have not been considered in the presented theoretical analysis; and (iv) examine the consistency between the theoretical and experimental results. We also present the time-complexity analysis for the proposed compressor and decompressor and the effect of using finite-precision computations on the code efficiency.

Similarly, in the second application, to evaluate the performance of the proposed scheme and to compare it with the performance of ACBR, we present a theoretical analysis that evaluates the performance of each method, as well as an analysis of the time complexity of the proposed scheme.

1.4 Organization of this Document

This document is organized as follows. The literature review is presented in Chapter 2, which contains the necessary background information for understanding the newly proposed schemes. This chapter explains the essential concepts underlying data compression that will be used throughout the document, shows the presence of ME in five different applications, explains the principle of bit recycling and its weaknesses, which is the core of our problem, and explains the basic notions of arithmetic coding, as it is the coding technique upon which ACBR is based. The arbitrary-precision scheme for adapting HuBR to arithmetic coding is presented in Chapter 3. We present the principle of ACBR, the non-deterministic process of ACBR in general, a theoretical analysis that estimates its performance, and the problems and the associated needs of the proposed arbitrary-precision ACBR scheme. The corresponding finite-precision version of ACBR is presented in Chapter 4.

Chapter 5 proposes the use of the finite-precision ACBR and HuBR as solutions to reduce the redundancy in plurally parsable dictionaries, as well as to extend the range of random binary sources that may benefit from a plurally parsable dictionary. We present a theoretical analysis and experimental results that evaluate the performance of the proposed solutions in this respect.

Chapter 6 mainly shows the use of HuBR as the means to reduce Knuth's algorithm redundancy. We present a theoretical analysis that evaluates the performance of HuBR and ACBR as solutions for Knuth's algorithm redundancy. Finally, the conclusion and future work are presented in Chapter 7.


Chapter 2

Literature Review

This chapter is divided into four sections. In Section 2.1, we give a general overview of data compression. Section 2.2 shows the presence of the multiplicity of encodings in five different applications. Section 2.3 explains the mechanism of HuBR based on what is explained in Section 2.2; this section is necessary to clearly identify the main weakness of bit recycling, which is the core of our problem. Section 2.4 presents the basic notions of arithmetic coding, as it is the coding technique upon which our proposed scheme is based.

2.1 Data Compression Overview

This section introduces the essential terms and concepts underlying data compression that we use throughout this thesis. For consistency, we review some terms that have already been introduced in the introduction.

Data compression is about storing and sending a smaller number of bits. The compression process is a combination of modeling and encoding. The model embodies the knowledge about the input (or source) data to be compressed. The coder, based on this knowledge, represents the source data in fewer bits by mapping it into a sequence of binary codewords. In other words, "one can think of the model as the intelligence of a compression scheme, which is responsible for deducing or interpolating the structure of the input, whereas the coder is the engine room of the compression system, which converts a probability distribution and a single symbol drawn from that distribution into a code" (Alistair Moffat [39]).

2.1.1 A Brief Introduction to Information Theory

Information theory is a branch of science concerned with the quantification of information. Information theory was developed by Claude Elwood Shannon [52]. Shannon defined a quantity called self-information. Suppose we have an event A of a random experiment. If P(A) is the probability that event A will occur, the self-information of event A is given by

i(A) = \log_b \frac{1}{P(A)} = -\log_b P(A)    (2.1)

If b equals 2, the units of i(A) are bits; if b equals e, the units are nats; and if b equals 10, the units are hartleys. All logarithms in this thesis are to base 2. It is clear that the amount of self-information depends solely on P(A), and i(A) increases as P(A) decreases. For example, if P(A) = 1, then we are certain that A will occur, therefore we get no information (no surprise) from its outcome, i.e., i(A) = 0.

As established in Shannon's source coding theorem [52], the information entropy (Shannon entropy) of a random event is the expected value of its self-information. Consider a general source S with alphabet α = {1, 2, ..., m} that generates a sequence X = {X_1, X_2, ..., X_n}, where X_i ∈ α for 1 ≤ i ≤ n. If each element in X is independent and identically distributed (iid), the entropy of S is given by:

H(S) = -\sum_{i=1}^{m} P(i) \log P(i)    (2.2)

The average number of bits per symbol for any lossless compression method is always greater than or equal to the entropy H.

In other words, the smallest size that can be achieved by any lossless compression scheme that we could design to compress an independent sequence generated by S is greater than or equal to n · H.
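To make Equations (2.1) and (2.2) concrete, here is a small Python sketch (not part of the thesis itself) that computes the self-information of individual symbols and the entropy of an iid source; the probability values are invented for the illustration.

```python
import math

def self_information(p, b=2):
    # i(A) = -log_b P(A); measured in bits when b = 2
    return -math.log(p, b)

def entropy(probs):
    # H(S) = -sum_i P(i) * log2 P(i), for an iid source
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example source: alphabet {a, b, c} with an arbitrary distribution
probs = {'a': 0.6, 'b': 0.3, 'c': 0.1}
for sym, p in probs.items():
    print(f"i({sym}) = {self_information(p):.3f} bits")
print(f"H(S) = {entropy(probs.values()):.3f} bits/symbol")
# Any lossless code for n iid symbols from S needs at least n * H(S) bits on average.
```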

2.1.2 Categorization of Data Compression Algorithms

Data compression algorithms can be categorized by several characteristics, such as data compression fidelity, symbols, models, and efficiency. We take advantage of this categorization to introduce some necessary concepts. Next, we give a brief summary of each characteristic.

Data Compression Fidelity

Data compression is often discussed in terms of the fidelity of the compressed data to the source data; fidelity is therefore the most important and common characteristic based on which data compression techniques are classified as lossless or lossy. In lossless compression (also known as reversible compression), every single symbol of the original data can be reconstructed (decoded). In contrast, in lossy compression (also known as irreversible compression), only an estimation of the original data can be decoded. In the context of data transmission, lossless compression is called source coding, since data encoding is performed at the source of the data before transmission.


Lossless compression is used when it is important that the original and the decompressed data be identical, such as when compressing text files, executable files, database files, etc. There are many well-known lossless compression techniques, such as Huffman coding [25], Arithmetic Coding [64], PPM (Prediction by Partial Matching) [14], RLE (Run-Length Encoding) [22], LZ77 [68], etc.

Lossy compression is used in applications where some loss of the original data is acceptable, such as compressing audio files, video files, images, etc. Since some information can be discarded, more compression can be achieved than with lossless compression, depending on the data being encoded. The JPEG (Joint Photographic Experts Group [2]) image file format, commonly used for images on the Web, uses lossy compression, as do MP3 audio files. More details and information about lossless and lossy compression can be found in many references [51, 49, 40, 23, 15].

Data Compression Symbols

According to the length (or number) of the symbols, that is, whether the algorithm uses variable- or fixed-length symbols in the original data and in the compressed data, the methods (or algorithms) of encoding are mainly categorized into the following four classes:

1. Variable-to-Fixed length codes (VF), such as Tunstall's codes [57] and other codes [50, 58, 42, 56];
2. Variable-to-Variable length codes (VV), such as LZ77 [68], LZ78 [69], LZW [63], and LZSS [54];
3. Fixed-to-Variable length codes (FV), such as Huffman coding [25] and Shannon-Fano coding [52]; and
4. Fixed-to-Fixed length codes (FF) [9].

Data Compression Models

Another distinguishing feature of data compression methods is the modeling of the source data. The method for determining the probability distribution of the source data is called a model. Basically, there are three types of models: static, semi-adaptive (semi-static), and adaptive. Let us briefly describe each model as follows:

1. A static model is a fixed model that is known to both the compressor and the decompressor and is independent of the input data. Constructing a model based on the frequencies of symbols in the English language computed from a very large English text, and using the same model to compress any English text, is an example of this model.

2. A semi-adaptive or semi-static model is a fixed model that is constructed in accordance with the input data before compression; therefore, the methods that use this model perform two passes to compress the input data: the first pass constructs the model, and the second pass encodes the input data based on the model constructed in the first pass. The model has to be included as part of the compressed data. Many compression methods use this model, such as static Huffman coding [25], static arithmetic coding [46, 45, 44, 64], and the range coder [41]. Many references classify this model as a static model; however, what matters is the concept, not the name.

3. An adaptive model is a model that changes the symbols' probabilities during compression according to the input data. Therefore, at a certain point during compression, the model is a function of the previously compressed part of the data. The model does not need to be included as part of the compressed data, since it can be reconstructed in the same way by the decompressor. Using an adaptive model enables the compression method to achieve more compression, but more computation is required. Adaptive methods usually start with a minimal symbol table to bias the algorithms toward the type of data they are expecting. Many (adaptive) compression methods use this model, such as dynamic Huffman coding [59, 60, 29], adaptive arithmetic coding [64], the M-coder [35], and adaptive binary arithmetic coding [53, 36].

Higher-order models take advantage of the correlation between the symbols of the input data, providing much better compression when this correlation is strong. Modeling the data with higher context orders provides more ability in probability estimation (prediction) compared with zero-order models, which are context-0 or non-contextual models. Zero-order models evaluate the symbols' probabilities independently of the preceding symbols in the string: one pass before compression is required to collect the occurrences of each alphabet symbol in a certain string (file) independently, and the constructed model is used as the probability estimator (predictor) to encode a string of symbols. Modeling with higher orders evaluates the probability of each symbol based on symbol occurrences conditioned on the preceding (or adjacent) symbols. Therefore, the first-order model contains the information of the zero-order model in addition to the probability of each symbol's occurrence depending on one preceding or adjacent symbol. The second-order model provides further estimation than the first-order model by providing the probability of each symbol depending on two preceding or adjacent symbols, and so on. As a result, higher-order models provide more accurate prediction when there is a correlation between the symbols of the source data; consequently, better compression can be achieved, but more computations are added to the compression process.
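As an illustration of the adaptive case, the following Python sketch shows a tiny adaptive order-1 model; the class name Order1Model and the initial count of 1 per symbol are choices made for this example, not a description of any particular compressor. Because the counts are updated only after each symbol is coded, the decompressor can maintain an identical copy of the model without it being transmitted.

```python
from collections import defaultdict

class Order1Model:
    """Adaptive order-1 frequency model: estimates P(symbol | previous symbol)."""
    def __init__(self, alphabet):
        # Start every context with count 1 per symbol so no probability is zero.
        self.counts = defaultdict(lambda: {s: 1 for s in alphabet})

    def probability(self, prev, sym):
        ctx = self.counts[prev]
        return ctx[sym] / sum(ctx.values())

    def update(self, prev, sym):
        # Called by both the compressor and the decompressor after each symbol,
        # so their models stay synchronized.
        self.counts[prev][sym] += 1

model = Order1Model("ab")
prev = None
for sym in "aabaa":
    print(f"P({sym!r} | {prev!r}) = {model.probability(prev, sym):.3f}")
    model.update(prev, sym)
    prev = sym
```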

Data Compression Efficiency

Data compression algorithms are also compared based on their performance. There are two performance evaluation parameters: the code efficiency, which is the expected codeword length per source symbol, and the processing time or algorithm complexity. In general, more compression requires more computation and time complexity; therefore, the design of a data compression scheme involves a trade-off between code efficiency and time complexity.

Several parameters are used to measure the amount of compression that can be achieved by a particular data compression algorithm, such as the compression ratio C_R and the reduction ratio R_R. The compression ratio C_R is defined as the ratio between the size of the data before compression and the size of the data after compression. It can be expressed as:

C_R = \frac{S_o}{S_c}    (2.3)

where S_o and S_c are the sizes of the original and the compressed data, respectively. The reduction ratio R_R represents the ratio of S_o minus S_c to S_o. It is usually stated as a percentage and can be expressed as:

R_R = \left( \frac{S_o - S_c}{S_o} \right) \cdot 100\%    (2.4)
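For example, the following lines compute both ratios for a hypothetical file compressed from 1000 bytes down to 250 bytes (the sizes are arbitrary):

```python
def compression_ratio(original_size, compressed_size):
    # C_R = S_o / S_c  (Equation 2.3)
    return original_size / compressed_size

def reduction_ratio(original_size, compressed_size):
    # R_R = (S_o - S_c) / S_o * 100%  (Equation 2.4)
    return (original_size - compressed_size) / original_size * 100

print(compression_ratio(1000, 250))  # 4.0
print(reduction_ratio(1000, 250))    # 75.0 (%)
```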

Time complexity is an important measure of the performance of data compression algorithms and a basis for comparing them to each other, because the fastest algorithm is preferable in many applications, such as video surveillance, mobile TV, and other applications where computational complexity plays a critical role.

2.2 The Presence of the Multiplicity of Encoding in Various Applications

It has been mentioned that the main goal of the bit recycling technique is to reduce the redundancy caused by ME; the objective of this section is merely to show the presence of this property in different applications. Therefore, the presence of ME in the following applications will be exposed.

1. The LZ77 compression technique [68] and its variant LZSS [54].
2. The LZ78 compression technique [69].
3. Some variants of the PPM compression technique [14].
4. Volf and Willems' switching-compression technique [61].
5. Knuth's algorithm for generating balanced codes [30].
6. Variable-to-fixed length codes and plurally parsable dictionaries [50].

2.2.1 The LZ77 Compression Technique and its Variant LZSS

The LZ77 compression technique goes through the text to be encoded sequentially in a sliding window consisting of a search buffer and a look-ahead buffer, as shown in Figure 2.1. The search buffer represents the storage space of the symbols that have already been encoded and have been seen by both the encoder and the decoder so far. The look-ahead buffer is the storage space of the string to be encoded, which so far is in the hands of the encoder only. After each encoding step, in which a portion of the string of length n symbols is encoded, the window slides to the right by, say, n symbols, so that the separator between the search buffer and the look-ahead buffer (the cursor) moves to the beginning of the not-yet-encoded string. Therefore, the cursor represents the current position of the symbol being encoded. The sizes of the window, the search buffer, and the look-ahead buffer are parameters of the implementation. For example, let the string 'aabraabcaabbaaba' be the string to be encoded by LZ77. For the sake of simplicity, we assume that the sizes of the search buffer and the look-ahead buffer are 8 and 5, respectively. Table 2.1 shows the encoding steps for the given string 'aabraabcaabbaaba'. To encode the string in the look-ahead buffer, the encoder searches for the current symbol being encoded in the search buffer until it finds the longest match to that symbol followed by the same consecutive symbols. The distance to that symbol from the current position is called the offset (o). The number of consecutive symbols in the search buffer that match the consecutive symbols in the look-ahead buffer, counting the first symbol, is called the match length (l). Therefore, for each encoding step, the encoder outputs the triple (o, l, c), which means that the next l characters are a copy of the l characters that appear o characters before the position of the symbol being encoded, and the character after them is c. On the other side, the sequence of triples transmitted by the encoder is decoded by the decoder triple by triple, as shown in Table 2.2.

Figure 2.1: LZ77 sliding window

Step  Search buffer  Look-ahead buffer  Remaining string  Output (o, l, c)
1     (empty)        aabra              abcaabbaaba       (0, 0, a)
2     a              abraa              bcaabbaaba        (1, 1, b)
3     aab            raabc              aabbaaba          (0, 0, r)
4     aabr           aabca              abbaaba           (4, 3, c)
5     aabraabc       aabba              aba               (4, 3, b)
6     aabcaabb       aaba               (empty)           (4, 3, a)
7     aabbaaba       (empty)            (empty)           (encoding complete)

Table 2.1: The steps of encoding the string 'aabraabcaabbaaba' by LZ77

Step  Input      Output so far
1     (0, 0, a)  a
2     (1, 1, b)  aab
3     (0, 0, r)  aabr
4     (4, 3, c)  aabraabc
5     (4, 3, b)  aabraabcaabb
6     (4, 3, a)  aabraabcaabbaaba

Table 2.2: The steps of decoding the string 'aabraabcaabbaaba' by LZ77

To show the presence of ME in LZ77, notice that at step 5 in Table 2.1 there is another triple that represents the same match but at offset 8, which means that the triple (8, 3, b) is also an equivalent triple (message). The same thing happens at step 6, where LZ77 can see that the longest match is 'aab' and that there are two copies (occurrences) of this match in the search buffer. Thus the encoder can send either the (8, 3, b) or the (4, 3, b) triple, and the result will be the same, since both of them are valid and can be used by the decoder to decode the string 'aab'. This means that the file can be encoded in multiple ways and numerous compressed files can be generated. Any one of the possible compressed files can be used by the decoder to restore the same original file, which is a case of ME.
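The following toy Python encoder is a sketch written only to make this multiplicity visible: at every step it records every offset at which a longest match occurs and, like Table 2.1, finally keeps the closest one. Buffer management and real codeword emission are deliberately omitted.

```python
def lz77_encode(text, search_size=8, look_ahead=5):
    """Toy LZ77 encoder: emits (offset, length, next_char) triples and reports ME."""
    triples, pos = [], 0
    while pos < len(text):
        start = max(0, pos - search_size)
        best_len, offsets = 0, [0]
        for cand in range(start, pos):
            length = 0
            # A match is at most look_ahead - 1 long and must leave a next character.
            while (length < look_ahead - 1 and pos + length < len(text) - 1
                   and text[cand + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_len, offsets = length, [pos - cand]
            elif length == best_len and length > 0:
                offsets.append(pos - cand)      # another equivalent triple: this is ME
        if len(offsets) > 1:
            print(f"ME at position {pos}: offsets {offsets} all give length {best_len}")
        # Default policy: keep the closest match, as Table 2.1 does.
        triples.append((min(offsets), best_len, text[pos + best_len]))
        pos += best_len + 1
    return triples

print(lz77_encode("aabraabcaabbaaba"))
# [(0, 0, 'a'), (1, 1, 'b'), (0, 0, 'r'), (4, 3, 'c'), (4, 3, 'b'), (4, 3, 'a')]
```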

The principle of the LZSS variant is the same as LZ77, but a 1-bit flag is needed to indicate whether the message to be encoded is a character (literal message) or a pointer to the closest longest match (match message) that has already been encoded (in the search buffer). Therefore, there are two message types that can be transmitted at each encoding step: literal messages or match messages. The literal message contains the codeword of symbol s, let us say [s], and the match message contains the codeword of the reference to the same string that has already been encoded. The match message is represented by the tuple (l, d), where l is the length of the longest matched string and d is the distance to that match from the current position. For example, let the string 'abbaaabbabbabbb' be the string to be encoded, and let the current position of encoding be indicated by the underlined character ('a', the last 'a' in the string), so the match message in this case is represented by the tuple (3, 3). Notice that there are two more equivalent messages, which are (3, 6) and (3, 11), and any one of these messages can be used by the decoder to retrieve the same original string; but since LZSS selects the closest longest match, the tuple (3, 3) will be the selected tuple. The freedom of selection among the three available tuples (3, 3), (3, 6), and (3, 11) represents a case of ME.

2.2.2 The LZ78 Compression Technique

The main difference between LZ77 and LZ78 is that LZ78 encodes the string by creating the dictionary progressively, as depicted in Table 2.3. For each newly encountered symbol or entry, an index is assigned to this entry; the inputs are encoded by the tuple (i, c), where i is the index of the longest match in the dictionary, and c is the codeword of the symbol following that match. The value of i is 0 in case of no match. For example, Table 2.3 shows the encoding steps of the LZ78 technique for the same string 'aabraabcaabbaaba' that has been used in the LZ77 example.

Step  Encoded so far      Remaining input     New dictionary entry  Output (i, c)
1     a                   abraabcaabbaaba     1: a                  (0, a)
2     aab                 raabcaabbaaba       2: ab                 (1, b)
3     aabr                aabcaabbaaba        3: r                  (0, r)
4     aabraa              bcaabbaaba          4: aa                 (1, a)
5     aabraab             caabbaaba           5: b                  (0, b)
6     aabraabc            aabbaaba            6: c                  (0, c)
7     aabraabcaab         baaba               7: aab                (4, b)
8     aabraabcaabba       aba                 8: ba                 (5, a)
9     aabraabcaabbaaba    (empty)             9: aba                (2, a)

Table 2.3: The steps of encoding the string 'aabraabcaabbaaba' by LZ78

To show ME in LZ78, the following variant of LZ78 could be devised, such that seeking a match in the dictionary is not constrained to the longest match. Thus, there is no need to add the encoded entry to the dictionary if there is a longer match, since the shorter match(es) must exist too as prefixes of the longer one. This suggested variant provides a valid encoding and contains ME, since this relaxation provides more than one possible tuple to be selected. For instance, at step 7, the output is the tuple (4, b), which represents the codeword of the entry 'aab'. According to the devised variant, the tuple (1, a) is another option, encoding the entry 'aa' without adding any new entry to the dictionary; symbol 'b' would be delayed to step 8, where the output would be (5, b) instead of (5, a), the new entry in step 8 would be 'bb' instead of 'ba', and so on.
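A compact sketch of the standard (greedy) LZ78 parser is given below; run on the same string, it reproduces the (i, c) tuples of Table 2.3. The relaxation described above would simply allow the parser to stop extending the current phrase early.

```python
def lz78_encode(text):
    """Toy LZ78: greedy longest match; outputs (index, next_char) tuples."""
    dictionary = {}          # entry string -> index (1-based); index 0 means "no match"
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                     # keep extending the longest match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                               # flush a trailing match, if any
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("aabraabcaabbaaba"))
# [(0,'a'), (1,'b'), (0,'r'), (1,'a'), (0,'b'), (0,'c'), (4,'b'), (5,'a'), (2,'a')]
```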


2.2.3 The PPM Compression Technique

The Prediction by Partial Matching (PPM) compression technique is an adaptive statistical technique that is based on context modeling and prediction. We explain the main idea of PPM using the following example. Assume that the string to be encoded is S = 'aabaaaba', and that the underlined symbol ('a', the sixth symbol) is the symbol being encoded; the encoder uses an order-2 model as depicted in Table 2.4, which contains the information about the string encoded so far ('aabaa'). At each point in time, the information available to the encoder and decoder consists of the frequency counts of each encoded symbol in each encountered context (of orders 0, 1, and 2) so far. The encoder uses an escape symbol, esc, to signal to the decoder that a symbol has not been seen in a given context.

Context-2                   Context-1                Context-0        Context(-1)
Context  Symbol  Count      Context  Symbol  Count   Symbol  Count    Symbol  Count
aa       b       1          a        a       2       a       4        a       1
         esc     1                   b       1       b       1        b       1
         Total   2                   esc     1       esc     1        Total   2
ab       a       1                   Total   4       Total   6
         esc     1          b        a       1
         Total   2                   esc     1
ba       a       1                   Total   2
         esc     1
         Total   2

Table 2.4: An order-2 model of the encoded string 'aabaa'

To encode the current symbol, 'a', the encoder starts by searching for 'a' after 'aa' in the context-2 table. The context 'aa' exists, but 'aa' followed by 'a' does not exist. Therefore esc is encoded, 'a' is added to the 'aa' context with count = 1, and the total count is updated to 3. Each symbol is encoded by arithmetic coding according to the symbol count with respect to the total count (i.e., the probability = 1/2). The number of bits used by arithmetic coding to encode a symbol with probability p is very close to -log2 p. Therefore, encoding esc at this step requires -log2(1/2) = 1 bit. Let us roughly describe how this esc is encoded and decoded using one bit of information. The context 'aa' at this point has two symbols ('b' and esc), and each of them has the same probability p = 1/2, therefore each symbol can be described by one bit according to p. Let us say that the codeword '0' is assigned to esc and the codeword '1' to 'b'; using the same principle, the decoder can recognize that the transmitted codeword ('0') means that the symbol to be decoded does not exist in this context ('aa').

Next, the encoder crosses to the lower-order context (the context-1 table) to search for 'a' followed by 'a'; since it exists, the encoder encodes it as described above according to the corresponding count and total count (probability = 2/4). The three symbols after 'a' in the context-1 table ('a', 'b', and esc) are assigned the codewords '1', '01', and '00', respectively, according to their probabilities. Thus the encoder sends bit '1' to the decoder, and the decoder decodes this bit ('1') as symbol 'a', using the same principle described above. Of course, the corresponding count and the total count are updated to 3 and 5, respectively. Once the symbol is located in a certain context, going from the higher to the lower context order, the encoder stops searching in the lower contexts and moves on to encode the next symbol, and so on. Thus, the encoder encodes the current symbol, 'a', as <esc><aa>, where <x> denotes the codeword of x.
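The cost of this two-step encoding can be checked with a few lines of Python, using the counts of Table 2.4 as they stand before the updates:

```python
import math

bits = lambda count, total: -math.log2(count / total)

# Step 1: 'a' is unseen in context 'aa' -> encode esc, which has count 1 out of 2.
esc_in_aa = bits(1, 2)        # 1.0 bit
# Step 2: fall back to context 'a', where 'a' has count 2 out of 4.
a_in_a = bits(2, 4)           # 1.0 bit
print(esc_in_aa + a_in_a)     # 2.0 bits in total for the current symbol
```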

The aforementioned variant of PPM was invented by Cleary and Witten [14] and is called PPMA; in this variant, the esc count is one. Another similar variant, called PPMB, was proposed by Cleary and Witten to set the esc count so that the cost of esc is decreased, which lowers the cost of crossing from the higher to the lower context and thus improves the compression efficiency. In PPMB, the esc count equals the number of symbols in the context (more symbols, higher probability for esc). Moffat [38] proposed another variant, PPMC, to set the esc count, in which esc is treated as a separate symbol, so when a symbol occurs for the first time, 1 is added to both the esc count and the new symbol's count. As a result, the PPMB variant is the same as PPMC except that the count for each symbol is one less than the symbol's count in PPMC. However, in practice, PPMC compresses better than PPMA and PPMB.

To show ME in PPMA, notice that once the encoder finds the symbol to be encoded in the context of a higher order, it does not cross to the lower order. Now consider the possible variant of PPMA that could be devised as follows: even if the symbol to be encoded exists in the higher context, the encoder is allowed to send <esc> and to cross over to the lower context. This provides a correct encoding and can be applied if, for example, the encoder could achieve better compression this way. Accordingly, the previous symbol can be encoded as <esc><aa> or <esc><esc><a>. In conclusion, under certain valid conditions, the original data can be encoded correctly in multiple ways, leading to multiple possible compressed files, each of which can be used by the decoder to retrieve the original data, which is a case of ME.

2.2.4 Volf and Willems Switching-Compression Technique

Volf and Willems [61] have proposed a technique that switches between two universal source encoding algorithms, A and B. For each symbol to be encoded, the two algorithms, A and B, provide their probability estimates of the symbol to be encoded (prediction). The algorithm with the better prediction (higher probability) is selected to encode that symbol. A flag bit is used in order to tell the decoder which algorithm has been used to encode that symbol, and thus to provide the decoder with the necessary information for correct decoding. The two algorithms provide different estimations of the probability distribution, since their modeling of the source data is different. For instance, the two algorithms A and B could be PPMA and LZ77. Each algorithm provides a different prediction for the symbol being encoded due to its nature. Therefore, for each symbol to be encoded, the two algorithms compete to encode that symbol according to their predictions. Thus, if the string to be encoded is of length n symbols, then there are 2^n possible compressed files. The switching algorithm gives the same performance as the best of the two algorithms plus one flag bit of redundancy per symbol. In order to reduce the redundancy caused by this flag bit, the switching technique divides the source sequence into blocks. The switching algorithm decides whether algorithm A or B is used to encode a whole block based on the performance of A and B on that block. Table 2.5 shows an example of encoding the source sequence 'x1 x2 ... x15' using the Volf and Willems switching-compression technique. The Volf and Willems switching technique achieves more compression for the whole given string of symbols by exploiting this competition, despite the additional one bit (the flag bit) of redundancy. It is clear that multiple outputs can be generated for the same original data. As a result, the same original string can be encoded in multiple ways. This capability represents a case of ME.

Source string    x1 x2 x3 x4 x5    x6 x7 x8 x9 x10    x11 x12 x13 x14 x15
Block            b1                b2                 b3
Best algorithm   A  A  B  A  B     A  B  B  B  A      A   B   A   A   B
Switch to        A                 B                  A
Output           0* Code A(b1)     1* Code B(b2)      0* Code A(b3)

*: flag bit, 0 for Algorithm A and 1 for Algorithm B
Code X(y) means that Algorithm X is used to code block y

Table 2.5: Volf and Willems switching-compression technique

2.2.5 Knuth's Balanced Codes Scheme

A balanced binary codeword is a codeword that contains an equal number of ones and zeros. Balanced codes have found many applications, such as digital recording on magnetic and optical storage devices like magnetic tapes, Compact Discs (CD), Blu-ray Discs (BD), and Digital Versatile Discs (DVD), as well as applications in optical networks and in error-correcting and error-detecting codes.

In 1986, Donald Knuth published an efficient algorithm for constructing balanced codes [30]. Knuth's algorithm is an efficient coding algorithm (not a compression algorithm) due to its simplicity and the fact that it does not need look-up tables. The main idea of Knuth's algorithm is that any unbalanced user data word w of length m, where m is even, can be encoded to construct a new balanced codeword, say ŵ, by complementing the first k bits of w, where k is chosen in such a way that ŵ has equal numbers of ones and zeros. Knuth showed that such an index k can always be found. A balanced prefix codeword v of even length p, which represents the index k, is appended in front of ŵ. Thus the whole newly constructed balanced codeword is composed of v followed by ŵ (vŵ). The decoder decodes vŵ as follows: it reads the first p bits, which are the prefix codeword for k, and then knows the number of bits from the start of ŵ that need to be complemented in order to retrieve the original word w. The prefix codeword itself is discarded once the original word is retrieved.

Let us explain the main principle of Knuth's algorithm using the following practical example. Let w, of length 8 bits, be '10101101'; this word is clearly unbalanced, since it contains 5 ones and 3 zeros. By complementing the first bit (k = 1), the newly constructed balanced codeword, ŵ, will be '00101101'. Let the balanced prefix codewords be '000111', '001011', '001101', '001110', '010011', '010101', '010110', and '011001' for k = 1, 2, ..., and 8, respectively. Thus the whole newly constructed balanced codeword is '00011100101101', of length p + m = 14 bits. The decoder reads the first 6 bits, '000111'. Knowing that '000111' is the prefix codeword for k = 1, the decoder complements the first bit of '00101101' to reconstruct the original word '10101101'. The prefix codeword '000111' is discarded after reconstructing the original word. Notice that the encoder does not have to choose k = 1, since it could also choose k = 3, 5, or 7, so it has the freedom to construct any of the following four balanced codewords: '00101101', '01001101', '01010101', and '01010011', by complementing the first k = 1, 3, 5, and 7 bits, respectively. This selection freedom means that Knuth's algorithm does have ME, since the encoder can encode the original word '10101101' in multiple ways.
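A sketch of the balancing step is shown below (the prefix codeword for k would be looked up in a fixed table, as in the example above); the helper names are invented for this illustration. The list returned by balancing_indices is exactly the selection freedom that bit recycling can exploit.

```python
def balance(word, k):
    # Complement the first k bits of the word.
    return "".join("1" if b == "0" else "0" for b in word[:k]) + word[k:]

def balancing_indices(word):
    """All k for which complementing the first k bits balances the word."""
    half = len(word) // 2
    return [k for k in range(1, len(word) + 1)
            if balance(word, k).count("1") == half]

w = "10101101"
print(balancing_indices(w))                 # [1, 3, 5, 7]  -> multiplicity of encodings
print([balance(w, k) for k in (1, 3, 5, 7)])
# ['00101101', '01001101', '01010101', '01010011']
```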

2.2.6 Variable-to-Fixed Length Codes and Plurally Parsable Dictionaries

Many techniques used for lossless data compression, such as LZ77 [68], LZ78 [69], and Tunstall codes [57], are dictionary based and process the input string of symbols by first parsing it. The input string S, which is drawn from some alphabet α, is turned into the concatenation of a number n of dictionary entries, s = s1 s2 ... sn, so that each si is a dictionary entry dj, for j = 1, 2, ..., M, where M is the dictionary size, and the entries may have variable lengths. Accordingly, the encoding procedure is a mapping from the set of dictionary entries to the corresponding set of codewords. In variable-to-fixed (VF) length encoding procedures such as Tunstall codes [57], each di is mapped into the corresponding fixed-length codeword, whereas in variable-to-variable (VV) length codes such as LZ78, each di is mapped into the corresponding variable-length codeword.

A dictionary D is said to be uniquely parsable if no entry in D is a prefix of any other entry in the same dictionary. Tunstall [57] devised a simple and efficient algorithm to construct the optimal uniquely parsable dictionary of size M. The main feature of the Tunstall code is that errors in codewords do not propagate as they do in variable-length codes (such as Huffman codes [25]). Let us explain the Tunstall code with the following simple example. Suppose that we want a 3-bit Tunstall code for an alphabet α = {A, B, C}, with P(A) = 0.6, P(B) = 0.3, and P(C) = 0.1, where P is the probability. A 3-bit Tunstall code means that we first need to construct a dictionary of M = 8 entries (2^3); then each dictionary entry is mapped into the corresponding 3-bit codeword. The three steps of constructing this dictionary are shown in Table 2.6. At step 1, we start with the 3 alphabet symbols {A, B, C} and the corresponding probabilities in our dictionary. At step 2, we remove the entry that has the highest probability (A) from the dictionary and add three entries that are obtained by concatenating this entry (A) with every symbol in α, including itself. Thus we get the following three new entries: AA, AB, and AC. The associated probability of each new entry is the product of the probabilities of the concatenated symbols. At this point, we have 5 dictionary entries, which is less than the desired dictionary size (M = 8). Therefore we repeat the procedure of step 2 in order to obtain more entries, as shown in Table 2.6. Now we have 7 dictionary entries, and going through another iteration would increase the dictionary size to 9. Therefore we stop iterating, and each dictionary entry is assigned a 3-bit codeword as shown in Table 2.6. Notice that no entry in the constructed dictionary is a prefix of any other entry in the constructed dictionary, therefore this dictionary is called a uniquely parsable dictionary.

Step 1            Step 2            Step 3
Symbol  P         Entries  P        Entries  Codeword
A       0.60      B        0.30     B        000
B       0.30      C        0.10     C        001
C       0.10      AA       0.36     AB       010
                  AB       0.18     AC       011
                  AC       0.06     AAA      100
                                    AAB      101
                                    AAC      110

Table 2.6: A 3-bit Tunstall code
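The construction loop just described can be sketched in a few lines of Python; run on the example source, it reproduces the dictionary of Table 2.6. The function name and the tie-breaking details are choices made for this illustration.

```python
def tunstall(probs, codeword_bits):
    """Grow a uniquely parsable dictionary of at most 2**codeword_bits entries."""
    entries = dict(probs)                      # entry string -> probability
    max_size = 2 ** codeword_bits
    while len(entries) + len(probs) - 1 <= max_size:
        best = max(entries, key=entries.get)   # most probable entry
        p = entries.pop(best)
        for sym, ps in probs.items():          # replace it by its extensions
            entries[best + sym] = p * ps
    # Assign fixed-length codewords.
    return {e: format(i, f"0{codeword_bits}b")
            for i, e in enumerate(sorted(entries, key=len))}

print(tunstall({"A": 0.6, "B": 0.3, "C": 0.1}, 3))
# {'B': '000', 'C': '001', 'AB': '010', 'AC': '011',
#  'AAA': '100', 'AAB': '101', 'AAC': '110'}
```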

Many VV codes such as LZ77 [68], LZ78 [69], and LZW [63] (see section 2.2.1) use Plurally Parsable (PP) dictionaries; i.e. some of the dictionary entries are prexes of other entries in the same dictionary, from which it follows that there exists some string, S, that can be parsed into a concatenation of dictionary entries in two or more ways. S. Savari [50] designed PP dictionaries for a memoryless, very predictable (highly skewed), binary source that can outperform the corresponding Tunstall dictionary for the same source and dictionary size. The main idea of Savari's work is as follows. Suppose we want to obtain the optimal dictionary of size M = N + 2, N ≥ 2, and assume that P (0)N ≥ P (1). Accordingly, the desired

Tunstall dictionary for this source is of the form 0N +1, 0N1, 0N −11, · · · , 01, 1

, where 0l

denotes the string of l zeros. Savari designed the corresponding PP dictionary of the form 0iN, 0N, 0N −11, · · · , 01, 1

, for some i ≥ 2, that provides less redundancy of the same source. The redundancy of a source code is the amount by which the average number of code symbols per source symbol for that code exceeds the entropy (the optimal codelength) of the source.

(31)

Entry Tunstall's Dictionary Plurally Parsable Dictionary Codeword 1 1 1 000 2 01 01 001 3 001 001 010 4 0001 0001 011 5 00001 00001 100 6 000001 000001 101 7 0000001 000000 110 8 0000000 000000000000 111

Table 2.7  The Tunstall and plurally parsable dictionaries of a binary source, α = {0, 1} For example, suppose that we want to construct the Tunstall dictionary and the correspond-ing PP dictionary proposed by Savari for the binary source of the alphabet α = {0, 1}, with the corresponding probabilities P (0) = 0.9 and P (1) = 0.1. Let the desired dictio-nary size be M = (N = 6) + 2 = 8 entries, so this dictiodictio-nary satises the aforementioned assumption: P (0)N ≥ P (1). For this binary source, the required Tunstall dictionary of

the form 0N +1, 0N1, 0N −11, · · · , 01, 1

, and the corresponding PP dictionary of the form 0iN, 0N, 0N −11, · · · , 01, 1

, for i = 2, are shown in Table2.7. It is obvious that the dierence between the two dictionaries is in the last two entries (7 and 8). In the PP dictionary, entry 7 (0N) is a prex of entry 8 (02N) and no entry in the Tunstall dictionary is a prex of any other

entry. However, Savari has proved that the PP dictionary shown in Table 2.7outperforms the corresponding Tunstall dictionary under the same assumptions given above.

In fact, the aforementioned PP dictionary does have ME since there exists some string, S, that can be parsed (segmented) into two or more ways. For example, the string S = 0211

can be parsed in the following three ways (choices): (i) 000000-000000-000000-0001; (ii) 000000000000-000000-0001; (iii) 000000-000000000000-0001. These three choices result in the following codewords, respectively: (i) 110-110-110-011; (ii) 111-110-011; (iii) 110-111-011. Since the cost of encoding S = 0211 by choice (i) (the cost is the length of the codewords

= 4 × 3 = 12 bits ) is greater than the cost of choices (ii) and (iii) (9 bits), then Savari's code excludes choice (i). Nevertheless, the encoder is still oered two choices, each with the same cost, to encode S. Hence, we have an opportunity to exploit this freedom, by employing the principle of bit recycling to outperform (reduce the redundancy of) the Tunstall and PP dictionaries for the same binary source and under the same assumptions given above.

2.3 The Bit Recycling Compression Technique

In order to explain the principle of bit recycling and to keep the consistency, we need to rst briey review the LZ77 compression technique already mentioned in the literature review in order to introduce some preliminaries related to the concept of bit recycling before describing

(32)

Figure 2.2  Illustration of ME in the string'abbaaabbabbabbb' using LZ77.

the principle of HuBR and its weakness.

2.3.1 LZ77 and ME

LZSS (the variant of LZ77) is a compression technique that compresses a string of characters, S, by transmitting a sequence of messages. A message is either a literal message, denoted by [c], which means that the next character is c, or a match message, denoted by hl, di, which means that the next l characters are a copy of the l characters that appear d characters before the position of the symbol being encoded in S. Let 'abbaaabbabbabbb' be the string to be encoded, the underlined substring is the so-far encoded part as depicted in Figure 2.2. The next character to be encoded is 'a'. The position of character 'a' is called the current position of encoding. The next character 'a' can be encoded by transmitting the literal message [a], and the encoder can also encode 'ab' or 'abb' since these substrings have many copies at dierent distances in the underlined string. For instance, 'abb' can be encoded by transmitting the match message h3, 3i, h3, 6i, or h3, 11i since 'abb' has 3 copies at the distances 3, 6, and 11 in the underlined substring. We call the set of all the literal and match messages that describe 'a', 'ab' or 'abb' the adequate messages.

LZ77 typically selects the longest match ('abb'), and among the available longest matches, it selects the one at the shortest distance (d = 3), therefore the match message h3, 3i is transmitted and the current position is moved to the last 'b'. We call the selected choice the default choice. It is clear that LZ77 has the freedom to encode 'abb' by selecting any message

(33)

from the set of the equivalent messages EQ = {M1= h3, 3i , M2= h3, 6i , M3 = h3, 11i}. These

messages are called the equivalent messages since any message in EQ can be used to encode the same string ('abb'). Accordingly, many dierent sequences of messages may be transmitted to describe S, and any possible sequence can be decoded correctly.

Most implementations of LZ77 select the longest match, since selecting the longest match enables the encoder to use almost the same number of bits to encode the longest possible string. For instance, encoding three characters ('abb') using M1, M2, or M3 is better than

encoding one character ('a') using the message [a]. When many longest matches exist, the shortest one is preferred in order to create a skew in favor of short distances. Accordingly, with higher frequencies for the short distances, the encoder can take advantage of this skewness and send shorter codewords on average.

2.3.2 The Principle of Bit Recycling and HuBR

The main objective of the bit recycling technique is to minimize the redundancy caused by ME. The idea of recycling comes from the fact that bits that would have been wasted due to ME are turned into useful information. Bit recycling [16] aims to improve code eciency by exploiting the redundancy resulting from ME. In bit recycling, the compressor is not restricted to selecting the default choice of the shortest codeword and to ignoring the other choices, but instead, with some arrangement between the compressor and decompressor, it uses ME to implicitly carry information from the compressor to the decompressor. Since data compression is a combination of modeling and coding, the added arrangement requires more computation and a stronger coordination between the model and the coder/decoder.

Without loss of generality, we explain the principle of HuBR by means of the following example with the help of Figure2.3. This example will be used as a running example in this section, and (for convenience) its essence will be used in many sections throughout this thesis. Assume that, at time t during encoding, the alphabet α is {mi}5i=0, and the corresponding distribution and

Human codewords of α are as shown in Table2.8. Let EQ = {M1= m1, M2 = m3, M3 = m5}

be the set of the equivalent messages at t, like the equivalent messages of 'abb' in the above example. The corresponding Human codeword (Hui) of message miin Table2.8is generated

using Human coding and based on the count of occurrences (Cnti) of mi. The default choice

(without recycling) is the equivalent message with the shortest codeword (M1 = m1), thus

the corresponding Human codeword ('11') will be transmitted to the decoder.

Let us explain the principle of HuBR and show how it can add more computations and extra information between the model and coder/decoder to reduce the default cost (2 bits), where the cost is the codeword length. At time t, the compressor has the freedom to encode M1, M2,

or M3 with the costs 2 bits, 4 bits, and 3 bits, respectively. The compressor, using Human

(34)

Figure 2.3  The principle of the HuBR compressor.

Table 2.8  The probability distribution and the corresponding Human codewords of α at time t. Message (mi) Count (Cnti) Probability (pi) Human Codeword (Hui) m5 (M3) 2 0.125 011 m4 3 0.1875 00 m3 (M2) 1 0.0625 0100 m2 4 0.25 10 m1 (M1) 5 0.3125 11 m0 1 0.0625 0101 Total 16 1

us say '0', '10', and '11' for M1, M2, and M3 respectively, as shown in Figure2.3. The created

codewords are called the recycled codewords. Thus, each Mi will have two codewords: the

explicit codeword (Hui) that may be sent to the decoder, and the corresponding implicit

codeword (ri) that will be constructed implicitly and identically by the compressor and

de-compressor as follows. Suppose that each ri is compared with the rst bit(s), let us say b0b1,

of the codeword of the next message, then one and only one match should occur. The equiv-alent message whose ri matches b0b1 is sent to the decompressor. More precisely, b0 should

either be '0' or '1'; if b0 ='0', select M1, otherwise b0b1 should be either '10' or '11', if b0b1 =

(35)

i ∈ {1, 2, 3}, would be able to acknowledge that the selection of the received message among M1, M2, and M3 was intentional and it would then be able to recover the corresponding ri of

the received Mi. Accordingly, the compressor can omit the matched bits from the compressed

stream since the decompressor can follow the same approach so that it can implicitly deduce the corresponding ri of the received Hui and restore them at the same location.

In detail, it will be easier to rst explain how the decoder will behave at time t, then to resume the discussion of the encoding process. The decoder will receive Mi, i ∈ {1, 2, 3}, at

time t, so the received message is rst decoded; the decoder can then infer that the received message has two other copies in the so-far decoded source string (the string before t). Recall the equivalent messages, from t to t + 1, in the above LZ77 example; whatever the received message is, once it is decoded into 'abb', the decoder can infer that 'abb' has three copies at dierent distances in the so-far decoded string, and the received message is one of these three copies, so the other two equivalent messages can be inferred accordingly, i.e. the decoder can list the three equivalent messages that could have been sent by the encoder. Therefore, the list of equivalent messages at time t can be established by both the encoder and the decoder identically and implicitly. Knowing the list of equivalent messages, the decoder can rebuild the corresponding Human tree that has been built by the encoder, which means that one of the recycled codewords of the available messages has been transmitted implicitly from the encoder to the decoder. If the compressor has decided to send M1, M2, or M3 then the codeword '0',

'10', or '11' is transmitted implicitly, respectively.

The selection of message Mi among M1, M2, and M3constitutes an eye wink or a signal from

the encoder to the decoder. In other words, the act of choosing among the equivalent messages constitutes a form of implicit communication from the compressor to the decompressor. The key question now is: where is the compression? According to an agreement between the encoder and decoder and based on the available information, the decoder inserts the recycled codeword of the received message into the compressed stream just before t + 1; accordingly, these bits have been omitted from that location by the encoder. Note that the compressor can aord to omit these bits because they are transmitted implicitly by the selection process among the equivalent messages.

Let us go back to the encoding process at time t. The encoder at this point has three recycled codewords '0', '10', and '11' available, but it does not know which one to select yet. Hence, these codewords need to be compared with the bits starting from t + 1, which is the beginning of the next message codeword. There will be one and only one match between the rst bits of the next message and a recycled codeword. The corresponding message of the match will be the eye wink. As a consequence, the matched bits starting from t + 1 need not be sent. The matched bits can be omitted from the stream, since they can be deduced implicitly by the decoder as described above.

(36)

Figure 2.4  Illustration of the decompressor's bit stream

Let us illustrate the state of the decompressor's bit stream depicted in Figure 2.4. The decompressor, at Instant II, decodes the underlined prex codeword at Instant I (codeword of M3). According to the discussion above, after decoding M3, the decoder can infer that

M3 is one of the three equivalent messages M1, M2, and M3 that could have been selected

by the encoder, thereby the decompressor can identically rebuild the corresponding recycled codewords ('0', '10', and '11') that have been constructed by the compressor as shown in Figure 2.3; accordingly, it determines that the corresponding recycled codeword ('11') of the received message (M3) needs to be inserted into the bit stream as illustrated at Instant III.

At Instant I of time t + 1, the bit stream is exactly left as it was at instant III of time t. If the bit stream at Instant I was: '01000011011010101· · · ' the underlined codeword would be the codeword of M2, and the decoder would have to recycle bit '10', so the bit stream at Instant

III would be: '100011011010101· · · ', and so on.

The sequence of implicitly transmitted recycled codewords forms a kind of implicit channel between the encoder and the decoder. We call this implicit channel the side-channel. The idea of the side-channel has been used previously in many applications such as: data hiding [13], data authentication [10], and embedding of error-correction codes inside the compressed documents [65, 33, 34]. The idea of side-channel has been used, for the rst time, for data compression purposes by bit recycling [16] then by Yokoo et al. [67].

2.3.3 The Performance of HuBR

Let us evaluate the performance of HuBR in the previous example. According to the afore-mentioned principle, if M1 is selected, then the coder will send the corresponding Hui of M1

of length 2 bits and recycle the corresponding ri of length 1 bit, so the net cost of encoding

M1 will be 1 (2 − 1) bit, and similarly, if M2 or M3 is selected, the net cost will be 2 (4 − 2)

or 1 (3 − 2) bit, respectively.

Since recycling proceeds by extracting prexes of the bit stream, which is a sequence of entropy-encoded events3, the bit stream starts by '0' half the time, and by '1' the other half of the

time; also the bit stream starts by '11' one fourth of the time and so on. Accordingly, M1 has

Figure

Figure 2.1  LZ77 sliding window
Table 2.2  The steps of decoding the string ' aabraabcaabbaaba ' by LZ77
Table 2.3  The steps of encoding the string ' aabraabcaabbaaba ' by LZ78.
Table 2.5  Volf and Willems switching-compression technique.
+7

Références

Documents relatifs

The above account of binding through data compression bears close resemblance to Tononi’s (2008) integrated information theory. In line with compressionism, Tononi proposes

If the backward search yields no match, an LZ77 token with zero offset and length and with the unmatched symbol is written. This is also the reason a token has to have a third

If there is a phrase already in the table composed of a CHARACTER, STRING pair, and the input stream then sees a sequence of CHARACTER, STRING, CHARACTER, STRING, CHARACTER,

“An improvement in lossless data compression via substring enumeration.” In Proceedings of the IEEE/ACIS International Conference on Computer and Information Science, pages

The following proposition corrects the part of the proof of Proposition 5.7 of the

If collection data contains tuples with variable dimension length we can save from 55 % (only by using block which support variable dimension length) up to 85 % (for Fi-

The digital revolution has affected most of the arenas that previously produced and used analog data: photographs, sounds, drawings, and countless scienti fi c recordings, such

This is one of the key conclusions of the second meeting of the African Learning Group on the Poverty Reduction Strategy Papers (PRSP-LG), that brought together high level