
Data compression using antidictionaries


HAL Id: hal-00619579

https://hal-upec-upem.archives-ouvertes.fr/hal-00619579

Submitted on 13 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Maxime Crochemore, Filippo Mignosi, Antonio Restivo, Sergio Salemi

To cite this version:

Maxime Crochemore, Filippo Mignosi, Antonio Restivo, Sergio Salemi. Data compression using antidictionaries. Proceedings of the I.E.E.E., 2000, 88 (11), pp.1756-1768. ⟨10.1109/5.892711⟩. ⟨hal-00619579⟩


Data Compression Using Antidictionaries

M. Crochemore, F. Mignosi, A. Restivo, S. Salemi

Abstract: We give a new text compression scheme based on Forbidden Words ("antidictionary"). We prove that our algorithms attain the entropy for balanced binary sources. They run in linear time. Moreover, one of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful to search compressed data and allows parallel compression. The techniques used in this paper are from Information Theory and Finite Automata.

Keywords: Data Compression, Lossless compression, Information Theory, Finite Automaton, Forbidden Word, Pattern Matching.

I. Introduction

We present a simple text compression method called DCA (Data Compression with Antidictionaries) that uses some "negative" information about the text, which is described in terms of antidictionaries. In contrast to other methods that make use, as a main tool, of dictionaries, i.e., particular sets of words occurring as factors in the text (cf. [1], [2], [3], [4] and [5]), our method takes advantage of words that do not occur as factors in the text, i.e., that are forbidden. Such sets of words are called here antidictionaries.

We describe a static compression scheme that runs in linear time (Sections II and III) including the construction of antidictionaries (Section V and Section VI). Variations using statistical or dynamical considerations are discussed in the conclusion (Section VII).

Let $w$ be a text on the binary alphabet $\{0,1\}$ and let $AD$ be an antidictionary for $w$. By reading the text $w$ from left to right, if at a certain moment the current prefix $v$ of the text has as suffix a word $u'$ such that $u = u'a \in AD$ with $a \in \{0,1\}$, i.e., $u$ is forbidden, then surely the letter following $v$ in the text cannot be $a$ and, since the alphabet is binary, it is the letter $b \neq a$. In other terms, we know in advance the next letter $b$, which turns out to be redundant or predictable. The main idea of our method is to eliminate redundant letters in order to achieve compression. The decoding algorithm recovers the text $w$ by predicting the letter following the current prefix $v$ of $w$ already decompressed.

DCA URL is http://www-igm.univ-mlv.fr/~mac/DCA.html

M. Crochemore, Institut Gaspard-Monge, Université de Marne-la-Vallée, France. E-mail: Maxime.Crochemore@univ-mlv.fr.

F. Mignosi, Università degli Studi di Palermo, Italy and Brandeis University, U.S.A. E-mail: mignosi@altair.math.unipa.it and mignosi@cs.brandeis.edu. Work partially supported by the CNR-NATO fellowship n. 215.31 and by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori" MURST, Italy.

A. Restivo, Università degli Studi di Palermo, Italy. E-mail: restivo@altair.math.unipa.it. Work partially supported by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori" MURST, Italy.

S. Salemi, Università degli Studi di Palermo, Italy. E-mail: salemi@altair.math.unipa.it. Work partially supported by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori"

The method proposed here presents some analogies with ideas discussed by C. Shannon at the very beginning of Information Theory. In [6] Shannon designed psychological experiments in order to evaluate the entropy of English. One of such experiments was about the human ability to reconstruct an English text where some characters were erased. Actually our compression method erases some characters and the decompression reconstructs them.

We prove (Section IV) that the compression rate of our compressor reaches the entropy almost surely, provided that the source is balanced and produced from a finite antidictionary. This type of source approximates a large class of sources, and consequently, a variant of the basic scheme gives an optimal compression for them. The idea of using antidictionaries is founded on the fact that there exists a topological invariant for Dynamical Systems based on forbidden words, an invariant that is independent of the entropy (cf. [7] and [8]).

The use of the antidictionary $AD$ in coding and decoding algorithms requires that $AD$ be structured in order to answer the following query on a word $v$: does there exist a word $u = u'a$, $a \in \{0,1\}$, in $AD$ such that $u'$ is a suffix of $v$? In the case of a positive answer the output should also include the letter $b$ defined by $b \neq a$. One of the main features of our method is that we are able to implement finite antidictionaries efficiently in terms of finite automata. This leads to fast linear-time compression and decompression algorithms that can be realized by sequential transducers (generalized sequential machines). This is especially relevant for fixed sources. It is then comparable to the fastest compression methods because the basic operation at compression and decompression time is just table lookup.

A central notion of the present method is that of minimal forbidden words, which allows us to reduce the size of antidictionaries. This notion also has some interesting combinatorial properties. Our compression method includes algorithms to compute antidictionaries, algorithms that are based on the above combinatorial properties and that are described in detail in [9] and [10].

The compression method also has an interesting synchronization property, in the case of finite antidictionaries. It states that the encoding of a block of data does not depend on the left and right contexts except for a limited-size prefix of the encoded block. This is also helpful to search compressed data, and the same property allows the design of efficient parallel compression algorithms.

The paper is organized as follows.

In Section II we give the definition of Forbidden Words and of antidictionaries. We describe DCA, our text compression and decompression algorithms (binary oriented), assuming that the antidictionary is given. In Section III we describe a data structure for finite antidictionaries that allows us to answer in an efficient way the queries needed by our compression and decompression algorithms; we show how to implement it given a finite antidictionary. In the case of rational antidictionaries the compression is also described in terms of transducers. We end the section by proving the synchronization property. In Section IV we evaluate the compression rate of our compression algorithm relative to a given antidictionary. In Section V we show how to construct antidictionaries for single words and sources. As a consequence we obtain a family of linear-time optimal algorithms for text compression that are universal for balanced Markov sources with finite memory. In Section VI we give improved linear-time algorithms for building antidictionaries for a static approach. They use the ideas of pruning and self-compressing. We discuss improvements and generalizations in Section VII.

Some of the results present in this paper have been succinctly stated in [11].

II. Basic Algorithms

Let us first introduce the main ideas of our algorithm on its static version. We discuss variations of this first approach in Section VII.

Let $w$ be a finite binary word and let $F(w)$ be the set of factors of $w$. For instance, if $w = 01001010$ then $F(w) = \{\varepsilon, 0, 1, 00, 01, 10, 001, 010, \ldots, 01001010\}$ where $\varepsilon$ denotes the empty word.
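As an illustration of the definition of $F(w)$, the following minimal Python sketch enumerates the factors of a word by brute force. The function name `factors` is our own choice for this example and does not come from the paper; the quadratic enumeration is only meant for small inputs.

```python
def factors(w: str) -> set[str]:
    """Return F(w): all contiguous substrings of w, including the empty word."""
    result = {""}
    for start in range(len(w)):
        for end in range(start + 1, len(w) + 1):
            result.add(w[start:end])
    return result


F = factors("01001010")
# Shortest factors first, ties broken lexicographically.
print(sorted(F, key=lambda x: (len(x), x))[:8])
# Words outside F(w) are the candidates for an antidictionary.
print([u for u in ("11", "000", "10101") if u not in F])
```

The second line printed lists the three words 11, 000 and 10101, none of which occurs in $w$.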

Let us take some words in the complement of $F(w)$, i.e., let us take some words that are not factors of $w$ and that we call forbidden. This set of such words $AD$ is called an antidictionary for the language $F(w)$. Antidictionaries can be finite as well as infinite. For instance, if $w = 01001010$ the words 11, 000, and 10101 are forbidden and the set $\{11, 000, 10101\}$ is an antidictionary for $F(w)$. For instance, if $w_1 = 001001001001$ the infinite set of all words that have two 1's as $i$-th and as $(i+2)$-th letter for some integer $i$ is an antidictionary for $w_1$. We want here to stress that an antidictionary can be any subset of the complement of $F(w)$. Therefore an antidictionary can be defined by any property that concerns words.

The compression algorithm treats the input word in an on-line manner. At a certain step in this process we have read the word $v$, a proper prefix of $w$. If there exists any word $u = u'a$, $a \in \{0,1\}$, in the antidictionary $AD$ such that $u'$ is a suffix of $v$, then surely the letter following $v$ cannot be $a$, i.e., the next letter is $b$, $b \neq a$. In other words, we know in advance the next letter $b$ that turns out to be "redundant" or predictable. Remark that this argument works only in the case of binary alphabets.

The main idea in the algorithm we describe is to eliminate redundant letters. In what follows we first describe the compression algorithm, Encoder, and then the decompression algorithm, Decoder. The word to be compressed is noted $w = a_1 \cdots a_n$ and its compressed version is denoted by $\gamma(w)$.

Encoder (antidictionary AD, word w ∈ {0,1}*)
1. v ← ε; γ ← ε;
2. for a ← first to last letter of w
3.     if for every suffix u' of v, u'0, u'1 ∉ AD
4.         γ ← γ.a;
5.     v ← v.a;
6. return (|v|, γ);

As an example, let us run the algorithm Encoder on the string $w = 01001010$ with the antidictionary $AD = \{000, 10101, 11\}$. The steps of the treatment are described in the next array by the current values of the prefix $v_i = a_1 \cdots a_i$ of $w$ that has just been considered and of the output $\gamma(w)$. In the case of a positive answer to the query to the antidictionary $AD$, the array also indicates the value of the corresponding forbidden word $u$. The number of times the answer is positive in a run corresponds to the number of bits erased.

ε                      γ(w) = ε
v_1  = 0               γ(w) = 0
v_2  = 01              γ(w) = 01      u = 11 ∈ AD
v_3  = 010             γ(w) = 01
v_4  = 0100            γ(w) = 010     u = 000 ∈ AD
v_5  = 01001           γ(w) = 010     u = 11 ∈ AD
v_6  = 010010          γ(w) = 010
v_7  = 0100101         γ(w) = 0101    u = 11 ∈ AD
v_8  = 01001010        γ(w) = 0101    u = 10101 ∈ AD
v_9  = 010010100       γ(w) = 0101    u = 000 ∈ AD
v_10 = 0100101001      γ(w) = 0101    u = 11 ∈ AD

Remark that the function $\gamma$ is not injective.

For instance $\gamma(01) = \gamma(010) = 01$.
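The Encoder pseudocode can be transcribed almost literally. The sketch below is a brute-force Python version, assuming the antidictionary is given as a plain set of strings; the name `encode` and the constant `AD` are ours. It tests every suffix of $v$ at each step, so it runs in quadratic time rather than the linear time obtained with the automaton of Section III.

```python
def encode(ad: set[str], w: str) -> tuple[int, str]:
    """Brute-force version of Encoder: returns (|w|, gamma(w))."""
    v, gamma = "", ""
    for a in w:
        # The letter a is predictable if, for some suffix u' of v,
        # u'0 or u'1 belongs to the antidictionary.
        suffixes = [v[i:] for i in range(len(v) + 1)]
        predictable = any(u + c in ad for u in suffixes for c in "01")
        if not predictable:
            gamma += a          # only unpredictable letters are emitted
        v += a
    return len(v), gamma


AD = {"000", "10101", "11"}
print(encode(AD, "01001010"))   # (8, '0101')
print(encode(AD, "01")[1], encode(AD, "010")[1])   # same output: gamma is not injective
```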

In order to have an injective mapping we can consider the function $\gamma'(w) = (|w|, \gamma(w))$. In this case we can reconstruct the original word $w$ from both $\gamma'(w)$ and the antidictionary.

The decoding algorithm works as follows. The compressed word is $\gamma(w) = b_1 \cdots b_h$ and the length of $w$ is $n$. The algorithm recovers the word $w$ by predicting the letter following the current prefix $v$ of $w$ already decompressed. If there exists one word $u = u'a$, $a \in \{0,1\}$, in the antidictionary $AD$ such that $u'$ is a suffix of $v$, then the output letter is $b$, $b \neq a$. Otherwise, the next letter is read from the input $\gamma$.

Decoder (antidictionary AD, word γ ∈ {0,1}*, integer n)
1. v ← ε;
2. while |v| < n
3.     if for some u' suffix of v and a ∈ {0,1}, u'a belongs to AD
4.         v ← v.¬a;
5.     else
6.         b ← next letter of γ;
7.         v ← v.b;
8. return (v);
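A matching brute-force transcription of Decoder is sketched below, under the same assumptions as for the encoder (antidictionary as a set of strings; the names `decode` and `AD` are ours). It assumes that the input really is the encoding of a word avoiding $AD$: it does not handle the case where both letters are forbidden, and it raises `StopIteration` if $\gamma$ is exhausted before $n$ letters are produced.

```python
def decode(ad: set[str], gamma: str, n: int) -> str:
    """Brute-force version of Decoder: rebuilds w from gamma(w) and n = |w|."""
    v = ""
    pending = iter(gamma)
    while len(v) < n:
        forced = None
        for i in range(len(v) + 1):
            suffix = v[i:]
            for a in "01":
                if suffix + a in ad:
                    # a is forbidden here, so the next letter is the other one
                    forced = "1" if a == "0" else "0"
        if forced is not None:
            v += forced
        else:
            v += next(pending)   # unpredictable letter: read it from gamma
    return v


AD = {"000", "10101", "11"}
print(decode(AD, "0101", 8))    # 01001010
```

Calling `decode(AD, "0101", 10)` instead yields 0100101001, which illustrates why the length $n$ has to be transmitted along with $\gamma(w)$.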

The antidictionary $AD$ must be structured in order to answer the following query on a word $v$: does there exist one word $u = u'a$, $a \in \{0,1\}$, in $AD$ such that $u'$ is a suffix of $v$? In case of a positive answer the output should also include the letter $b$ defined by $b \neq a$. Notice that the letter $a$ considered at line 3 is unique because, at this point, the end of the text $w$ has not been reached so far.

In this approach, where the antidictionary is static and available to both the encoder and the decoder, the encoder must send to the decoder the length of the word $|w|$, in addition to the compressed word $\gamma(w)$, in order to give the decoder a "stop" criterion. Slight variations of the previous compression-decompression algorithm can be easily obtained by giving other "stop" criteria. For instance, the encoder can send the number of letters that the decoder has to reconstruct after the last letter of the compressed word $\gamma(w)$ has been read. Or the encoder can let the decoder stop when there is no more letter available in $\gamma$ (line 6), or when both letters are impossible to reconstruct according to $AD$. Doing so, the encoder must send to the decoder the number of letters to erase in order to recover the original message. For such variations antidictionaries can be structured to answer slightly more complex queries.

Since we are considering here the static case, the encoder must send the antidictionary to the decoder unless the decoder already has a copy of the antidictionary or it has an algorithmic way to reconstruct the antidictionary from some previously acquired information.

The method presented here brings to mind some ideas proposed by C. Shannon at the very beginning of Information Theory. In [6] Shannon designed psychological experiments in order to evaluate the entropy of English. One of such experiments was about the human ability to reconstruct an English text where some characters were erased. Actually our compression method erases some characters and the decompression reconstructs them. For instance in the previous example the input string is $01\bar{0}0\bar{1}\bar{0}1\bar{0}\bar{0}\bar{1}$, where bars indicate which letters are erased during the compression.

In order to get good compression rates (at least in the static approach when the antidictionary has to be sent) we need to minimize in particular the size of the antidictionary. Remark that if there exists a forbidden word $u = u'a$, $a \in \{0,1\}$, in the antidictionary such that $u'$ is also forbidden then our algorithm will never use this word $u$. So we can erase this word from the antidictionary without any loss for the compression of $w$. This argument leads us to consider the notion of minimal forbidden word with respect to a factorial language $L$, and the notion of anti-factorial language, points that are discussed in the next section.

III. Implementation of Finite Antidictionaries

When the antidictionary is a finite set, the queries on the antidictionary required by the algorithms of the previous section are realized as follows. We build a deterministic automaton accepting the words having no factor in the antidictionary. Then, while reading the text to encode, if a transition leads to a sink state, the output is the other letter. We denote by $A(AD)$ the automaton associated with the antidictionary $AD$. An algorithm to build $A(AD)$ is described in [9] and [10]. The same construction has been discovered by Choffrut et al. [12], it is similar to a description given by Aho and Corasick ([13], see [14]), by Diekert et al. [15], and it is related to a more general construction given in [16].

The required automaton accepts a factorial language $L$. Recall that a language $L$ is factorial if $L$ satisfies the following property: for any words $u$, $v$, $uv \in L \Rightarrow u \in L$ and $v \in L$. The complement language $L^c = A^* \setminus L$ is a (two-sided) ideal of $A^*$. Denoting by $MF(L)$ the base of this ideal, we have $L^c = A^* MF(L) A^*$. The set $MF(L)$ is called the set of minimal forbidden words for $L$. A word $v \in A^*$ is forbidden for the factorial language $L$ if $v \notin L$, which is equivalent to saying that $v$ occurs in no word of $L$. In addition, $v$ is minimal if it has no proper factor that is forbidden.

One can note that the set $MF(L)$ uniquely characterizes $L$, just because $L = A^* \setminus A^* MF(L) A^*$. This set $MF(L)$ is an anti-factorial language or a factor code, which means that it satisfies: $\forall u, v \in MF(L)$, $u \neq v \Rightarrow u$ is not a factor of $v$, a property that comes from the minimality of words of $MF(L)$. Indeed, there is a duality between factorial and anti-factorial languages, because we also have the equality: $MF(L) = AL \cap LA \cap (A^* \setminus L)$. In view of the remark made at the end of the previous section, from now on in the paper we consider only antidictionaries that consist of minimal forbidden words. Thus they are anti-factorial languages.
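The anti-factorial condition is easy to test on a finite set. The following Python sketch is only an illustration; the names `is_antifactorial` and `drop_words_with_forbidden_factor` are hypothetical and not taken from the paper. Note that the second function merely removes words that contain another word of the set as a proper factor: the result is anti-factorial, but this alone does not guarantee that its words are the minimal forbidden words of a given language.

```python
def is_antifactorial(words: set[str]) -> bool:
    """True if no word of the set is a proper factor of another word of the set."""
    return not any(u != v and u in v for u in words for v in words)


def drop_words_with_forbidden_factor(ad: set[str]) -> set[str]:
    """Keep only the words having no proper factor inside the set itself."""
    return {v for v in ad if not any(u != v and u in v for u in ad)}


print(is_antifactorial({"000", "10101", "11"}))                     # True
print(sorted(drop_words_with_forbidden_factor({"11", "110", "000"})))   # ['000', '11']
```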

Figure 1 displays the trie that accepts the anti-factorial language $AD = \{000, 10101, 11\}$. The automaton produced from the trie is shown in Figure 2.

Fig. 1. Trie of the factor code $\{000, 10101, 11\}$. Squares represent terminal states.

The following theorem is proved in [10]. It is based on an algorithm called L-automaton that has as (finite) input $AD$ in the form of a trie $T$. It is straightforward to get $T$ if $AD$ is given in the form of a list of words. The algorithm can be adapted to test whether $T$ represents an anti-factorial set, to generate the trie of the anti-factorial language associated with a set of words, or even to build the automaton associated with the anti-factorial language corresponding to any set of words.

Theorem 1: The construction of $A(AD)$ from $T$ can be realized in linear time.

We report here, for the sake of completeness, the algorithm L-automaton described in [10]. Its input, the trie $T$ that represents $AD$, is a tree-like automaton accepting the set $AD$ and, as such, it is noted $(Q, A, i, T, \delta')$. The set $T$ of terminal states is the set of leaves of the trie.

Fig. 2. Automaton accepting the words that avoid the set $\{000, 10101, 11\}$. Squares represent non-terminal states (sink states).

The algorithm uses a function $f$ called a failure function and defined on states of $T$ as follows. States of the trie $T$ are identified with the prefixes of words in $AD$. For a state $au$ ($a \in A$, $u \in A^*$), $f(au)$ is the longest suffix of $u$ that is a state of the trie $T$, a word that may happen to be $u$ itself. This state is also $\delta(i, u)$, where $\delta$ is the transition function of $A(AD)$, and this can be easily proved by induction on the length of $u$. Note that $f(i)$ is undefined, which justifies a specific treatment of the initial state in the algorithm.

L-automaton (trie T = (Q, A, i, T, δ'))
 1. for each a ∈ A
 2.     if δ'(i, a) defined
 3.         δ(i, a) ← δ'(i, a);
 4.         f(δ(i, a)) ← i;
 5.     else
 6.         δ(i, a) ← i;
 7. for each state p ∈ Q \ {i} in width-first search and each a ∈ A
 8.     if δ'(p, a) defined
 9.         δ(p, a) ← δ'(p, a);
10.         f(δ(p, a)) ← δ(f(p), a);
11.     else if p ∉ T
12.         δ(p, a) ← δ(f(p), a);
13.     else
14.         δ(p, a) ← p;
15. return (Q, A, i, Q \ T, δ);
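The following Python sketch follows the L-automaton pseudocode step by step, under assumptions of our own: states are represented by the prefixes themselves (strings), the trie is implicit in the set of prefixes, and $AD$ is assumed to be anti-factorial so that its words are exactly the leaves. The names `l_automaton` and `run` are hypothetical helpers introduced for this example.

```python
from collections import deque


def l_automaton(ad: set[str], alphabet: str = "01"):
    """Return (delta, sinks); states are the prefixes of the words of AD."""
    states = {w[:j] for w in ad for j in range(len(w) + 1)}   # trie nodes
    delta, fail = {}, {}
    root = ""
    for a in alphabet:                       # lines 1-6: initial state
        if a in states:
            delta[root, a] = a
            fail[a] = root
        else:
            delta[root, a] = root
    queue = deque(delta[root, a] for a in alphabet if delta[root, a] != root)
    while queue:                             # line 7: breadth-first traversal
        p = queue.popleft()
        for a in alphabet:
            if p + a in states:              # lines 8-10: trie edge exists
                delta[p, a] = p + a
                fail[p + a] = delta[fail[p], a]
                queue.append(p + a)
            elif p not in ad:                # lines 11-12: follow the failure state
                delta[p, a] = delta[fail[p], a]
            else:                            # lines 13-14: sink state loops on itself
                delta[p, a] = p
    return delta, set(ad)


def run(delta, w: str, start: str = "") -> str:
    """State reached after reading w from the initial state."""
    q = start
    for a in w:
        q = delta[q, a]
    return q


delta, sinks = l_automaton({"000", "10101", "11"})
print(run(delta, "01001010"))          # 1010
print(run(delta, "0110") in sinks)     # True: 0110 contains the forbidden word 11
```

Because the traversal is breadth-first, `fail[p]` always refers to a strictly shorter prefix whose transitions have already been filled in, which is what makes lines 10 and 12 well defined.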

A. Transducers

From the automaton $A(AD)$ we can easily construct a (finite-state) transducer $B(AD)$ that realizes the compression algorithm Encoder, i.e., that computes the function $\gamma$. The input part of $B(AD)$ coincides with $A(AD)$, with sink states removed, and the output is given as follows: if a state of $A(AD)$ has two outgoing edges, then the output labels of these edges coincide with their input label; if a state of $A(AD)$ has only one outgoing edge, then the output label of this edge is the empty word. The transducer $B(AD)$ works as follows on an input string $w$. Consider the (unique) path in $B(AD)$ corresponding to $w$. The letters of $w$ that correspond to an edge that is the unique outgoing edge of a given state are erased; other letters are unchanged.

We can then state the following theorem.

Theorem 2: Algorithm Encoder can be realized by a sequential transducer (generalized sequential machine).

Concerning the algorithm Decoder, remark (see Section II) that the function $\gamma$ is not injective and that we need some additional information, for instance the length of the original uncompressed word, in order to reconstruct it without ambiguity. Therefore, Decoder can be realized by the same transducer as above, by interchanging input and output labels (denote it by $B'(AD)$), with a supplementary instruction to stop the decoding.

Let $Q = Q_1 \cup Q_2$ be a partition of the set of states $Q$, where $Q_j$ is the set of states having $j$ outgoing edges ($j = 1, 2$). For any $q \in Q_1$, define $p(q) = (q, q_1, \ldots, q_r)$ as the unique path in the transducer for which $q_h \in Q_1$ for $h < r$ and $q_r \in Q_2$.

Given an input word $v = b_1 b_2 \ldots b_m$, there exists in $B'(AD)$ a unique path $i, q_1, \ldots, q_{m'}$ such that $q_{m'-1} \in Q_2$ and the transition from $q_{m'-1}$ to $q_{m'}$ corresponds to the input letter $b_m$. If $q_{m'} \in Q_2$, then the output word corresponding to this path in $B'(AD)$ is the unique word $w$ such that $\gamma(w) = v$. If $q_{m'} \in Q_1$, then we can stop the decoding algorithm realized by $B'(AD)$ in any state $q \in p(q_{m'})$, and, for different states, we obtain different decodings. So we need supplementary information (for instance, the length of the original uncompressed word) to perform the decoding. In this sense we can say that $B'(AD)$ realizes sequentially the algorithm Decoder (cf. also [17]).

The constructions and the results given above on finite antidictionaries and transducers can be generalized also to the case of rational antidictionaries, or, equivalently, when the set of words "produced by the source" is a regular (rational) language. In these cases it is not, in a strict sense, necessary to introduce explicitly antidictionaries and all the methods can be presented in terms of automata and transducers, as above. Remark however that the presentation given in Section II in terms of antidictionaries is more general, since it includes the non-rational case. Moreover, even in the finite case, the construction of automata and transducers from a fixed text, given in the next section, makes an explicit use of the notion of minimal forbidden words and of antidictionaries.

B. A Synchronization Property

In the sequel we prove a synchronization property of automata built from finite antidictionaries, as described above. This property also "characterizes" in some sense finite antidictionaries. This property is a classical one and it is of fundamental importance in practical applications.

Definition 1: Given a deterministic finite automaton $A$, we say that a word $w = a_1 \cdots a_k$ is synchronizing for $A$ if, whenever $w$ represents the label of two paths $(q_1, a_1, q_2) \cdots (q_k, a_k, q_{k+1})$ and $(q'_1, a_1, q'_2) \cdots (q'_k, a_k, q'_{k+1})$ of length $k$, then the two ending states $q_{k+1}$ and $q'_{k+1}$ are equal.

If $L(A)$ is factorial, any word that does not belong to $L(A)$ is synchronizing. Clearly in this case synchronizing words in $L(A)$ are much more interesting. Remark also that, since $A$ is deterministic, if $w$ is synchronizing for $A$, then any word $w' = wv$ that has $w$ as prefix is also synchronizing for $A$.

Definition 2: A deterministic finite automaton $A$ is local if there exists an integer $k$ such that any word of length $k$ is synchronizing. Automaton $A$ is also called $k$-local.

Remark that if $A$ is $k$-local then it is $m$-local for any $m \geq k$.

Given a finite antifactorial language $AD$, let $A(AD)$ be the automaton associated with $AD$ that recognizes the language $L(AD)$. Let us eliminate the sink states and edges going to them. Since there is no possibility of misunderstanding, we denote the resulting automaton by $A(AD)$ again. Notice that it has no sink state, that all states are terminal, and that $L(A(AD))$ is factorial.

Theorem 3: Let $AD$ be a finite antifactorial antidictionary and let $k$ be the length of the longest word in $AD$. Then automaton $A(AD)$ associated to $AD$ is $(k-1)$-local.

Proof: Let $u = a_1 \cdots a_{k-1}$ be a word of length $k-1$. We have to prove that $u$ is synchronizing. Suppose that there exist two paths $(q_1, a_1, q_2) \cdots (q_{k-1}, a_{k-1}, q_k)$ and $(q'_1, a_1, q'_2) \cdots (q'_{k-1}, a_{k-1}, q'_k)$ of length $k-1$ labeled by $u$. We have to prove that the two ending states $q_k$ and $q'_k$ are equal. Recall that states of $A$ are words, and, more precisely, they are the proper prefixes of words in $AD$. A simple induction on $i$, $1 \leq i \leq k$, shows that $q_i$ (respectively $q'_i$) "is" the longest suffix of the word $q_1 a_1 \cdots a_{i-1}$ (respectively $q'_1 a_1 \cdots a_{i-1}$) that is also a "state", i.e., a proper prefix of a word in $AD$. Hence $q_k$ (respectively $q'_k$) is the longest suffix of the word $q_1 u$ (respectively $q'_1 u$) that is also a proper prefix of a word in $AD$. Since all proper prefixes of words in $AD$ have length at most $k-1$, both $q_k$ and $q'_k$ have length at most $k-1$. Since $u$ has length $k-1$, both of them are the longest suffix of $u$ that is also a proper prefix of a word in $AD$, i.e., they are equal.

In other terms, the theorem says that only the last $k-1$ bits matter for determining whether $AD$ is avoided or not. The theorem admits a "converse" that shows that locality characterizes in some sense finite antidictionaries (cf. Propositions 2.8 and 2.14 of [18]).
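Theorem 3 can be checked exhaustively on small examples. The Python sketch below is only an illustration under our own assumptions: states are the proper prefixes of words of an anti-factorial $AD$, edges to sinks are treated as missing, and the name `all_words_synchronizing` is hypothetical. It reads every binary word of a given length from every state and reports whether the surviving paths always end in the same state.

```python
from itertools import product


def all_words_synchronizing(ad: set[str], length: int) -> bool:
    """True if every binary word of the given length is synchronizing for A(AD)."""
    states = {w[:j] for w in ad for j in range(len(w))}     # proper prefixes

    def step(q: str, a: str):
        s = q + a
        if any(s.endswith(u) for u in ad):
            return None                                      # edge to a sink: removed
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in states)

    for letters in product("01", repeat=length):
        ends = set()
        for q in states:
            for a in letters:
                q = step(q, a)
                if q is None:
                    break                                    # no such path from this state
            else:
                ends.add(q)                                  # path exists: record its end
        if len(ends) > 1:
            return False
    return True


AD = {"000", "10101", "11"}                # longest word has length k = 5
print(all_words_synchronizing(AD, 4))      # True, as predicted for length k - 1
print(all_words_synchronizing(AD, 1))      # False: one letter is not enough here
```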

Theorem 4: If automaton $A$ is local and $L(A)$ is a factorial language then there exists a finite antifactorial language $AD$ such that $L(A) = L(AD)$.

Let $AD$ be an antifactorial antidictionary and let $k$ be the length of the longest word in $AD$. Let also $w = w_1 u v w_2 \in L(AD)$ with $|u| = k-1$ and let $\gamma(w) = y_1 y_2 y_3$ be the word produced by our encoder of Section II with input $AD$ and $w$. The word $y_1$ is the word produced by our encoder after processing $w_1 u$, the word $y_2$ is the word produced by our encoder after processing $v$ and the word $y_3$ is the word produced by our encoder after processing $w_2$.

The proof of the next theorem is an easy consequence of the previous definitions and of the statement of Theorem 3.

Theorem 5: The word $y_2$ depends only on the word $uv$ and it does not depend on its contexts $w_1$ and $w_2$.

The property stated in the theorem has an interesting consequence for the design of pattern matching algorithms on words compressed by the algorithm Encoder. It implies that to search the compressed word for a pattern, it is not necessary to decode the whole word. Just a limited left context of an occurrence of the pattern needs to be processed. The same property allows the design of highly parallelizable compression algorithms. The idea is that the compression can be performed independently and in parallel on any block of data. If the text to be compressed is parsed into blocks of data in such a way that each block overlaps the next block by a length not smaller than the length of the longest word in the antidictionary, then it is possible to run the whole compression process in parallel.

IV. Efficiency

In this section we evaluate the efficiency of our compression algorithm relative to a source corresponding to the finite antidictionary $AD$.

Indeed, the antidictionary $AD$ naturally defines a source $S(AD)$ in the following way. Let $A(AD)$ be the automaton constructed in the previous section with no sink states and recognizing the factorial language $L(AD)$ (all states are terminal). To avoid trivial cases, we suppose that in this automaton all the states have at least one outgoing edge. Recall that since our algorithms work on a binary alphabet, all states have at most two outgoing edges.

For any state of $A(AD)$ with only one outgoing edge we give to this edge probability 1. For any state of $A(AD)$ with two outgoing edges we give to these edges probability 1/2. This defines a deterministic (or unifilar, cf. [19]) Markov source, denoted $S(AD)$. Notice also that, by Theorem 3, $S(AD)$ is a Markov source of finite order or finite memory (cf. [19]). We call a binary Markov source with this probability distribution a balanced source.

Remark that our compression algorithm is defined exactly for all the words "emitted" by $S(AD)$.

In what follows we suppose that the graph of the source $S$, i.e., the graph of automaton $A(AD)$, is strongly connected. The results that we prove can be extended to the general case by using standard techniques of Markov Chains (cf. [19], [20], [21] and [22]). Recall (cf. Theorem 6.4.2 of [19]) that the entropy $H(S)$ of a deterministic Markov source $S$ is $H(S) = -\sum_{i,j=1}^{n} \pi_i \gamma_{i,j} \log_2(\gamma_{i,j})$, where $(\gamma_{i,j})$ is the stochastic matrix of $S$ and $(\pi_1, \ldots, \pi_n)$ is the stationary distribution of $S$.

We now state three lemmas.

Lemma 1: The entropy of a balanced source $S$ is given by $H(S) = \sum_{i \in D} \pi_i$ where $D$ is the set of all states that have two outgoing edges.

Proof: By definition

$$H(S) = -\sum_{i,j=1}^{n} \pi_i \gamma_{i,j} \log_2(\gamma_{i,j}).$$

If $i$ is a state with only one outgoing edge, by definition this edge must have probability 1. Then $\sum_{j} \pi_i \gamma_{i,j} \log_2(\gamma_{i,j})$ reduces to $\pi_i \log_2(1)$, that is equal to 0. Hence

$$H(S) = -\sum_{i \in D} \sum_{j=1}^{n} \pi_i \gamma_{i,j} \log_2(\gamma_{i,j}).$$

Since from each $i \in D$ there are exactly two outgoing edges, each having probability 1/2, one has

$$H(S) = -\sum_{i \in D} 2 \pi_i (1/2) \log_2(1/2) = \sum_{i \in D} \pi_i$$

as stated.
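Lemma 1 can be checked numerically on a small balanced source. The sketch below is an illustration under assumptions of our own: states are the proper prefixes of words of an anti-factorial $AD$, every state is assumed to have at least one outgoing edge, and the stationary distribution is approximated by plain power iteration, which presumes that the iteration converges (for instance when the recurrent part of the chain is aperiodic). The name `balanced_source_entropy` is hypothetical.

```python
from math import log2


def balanced_source_entropy(ad: set[str], iterations: int = 1000):
    """Return (entropy by definition, entropy by Lemma 1) for the source S(AD)."""
    states = {w[:j] for w in ad for j in range(len(w))}     # proper prefixes

    def step(q: str, a: str):
        s = q + a
        if any(s.endswith(u) for u in ad):
            return None                                      # edge to a sink: removed
        return next(s[i:] for i in range(len(s) + 1) if s[i:] in states)

    out = {q: [t for t in (step(q, a) for a in "01") if t is not None]
           for q in states}
    pi = {q: 1.0 / len(states) for q in states}
    for _ in range(iterations):                              # power iteration
        nxt = dict.fromkeys(states, 0.0)
        for q in states:
            for t in out[q]:
                nxt[t] += pi[q] / len(out[q])                # probability 1 or 1/2
        pi = nxt
    by_definition = -sum(pi[q] * (1 / len(out[q])) * log2(1 / len(out[q]))
                         for q in states for _ in out[q])
    by_lemma = sum(pi[q] for q in states if len(out[q]) == 2)
    return by_definition, by_lemma


print(balanced_source_entropy({"11"}))   # both values are close to 2/3
```

For $AD = \{11\}$ the automaton has two states, the stationary distribution is $(2/3, 1/3)$, and only the first state has two outgoing edges, so both computations give $2/3$.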

Lemma 2: Let $w = a_1 \cdots a_m$ be a word in $L(AD)$ and let $q_1 \cdots q_{m+1}$ be the sequence of states in the path determined by $w$ in $A(AD)$ starting from the initial state. The length of $\gamma(w)$ is equal to the number of states $q_i$, $i = 1, \ldots, m$, that belong to $D$, where $D$ is the set of all states that have two outgoing edges.

Proof: The statement is straightforward from the description of the compression algorithm and the implementation of the antidictionary with automaton $A(AD)$.

Through a well-known result on "large deviations" (cf. Problem IX.6.7 of [23]), we get a kind of optimality of the compression scheme.

Let $q = q_1, \ldots, q_m$ be the sequence of $m$ states of a path of $A(AD)$ and let $L_{m,i}(q)$ be the frequency of state $q_i$ in this sequence, i.e., $L_{m,i}(q) = m_i/m$, where $m_i$ is the number of occurrences of $q_i$ in the sequence $q$. Let also $X_m(\epsilon) = \{ q \mid q \text{ has } m \text{ states and } \max_i |L_{m,i}(q) - \pi_i| \geq \epsilon \}$, where $q$ represents a sequence of $m$ states of a path in $A(AD)$. In other words, $X_m(\epsilon)$ is the set of all sequences of states representing paths in $A(AD)$ that "deviate" by at least $\epsilon$ in at least one state $q_i$ from the theoretical frequency $\pi_i$.

Lemma 3: For any $\epsilon > 0$, the set $X_m(\epsilon)$ satisfies the equality $\lim_{m \to \infty} \frac{1}{m} \log_2 \Pr(X_m(\epsilon)) = -c(\epsilon)$, where $c(\epsilon)$ is a positive constant depending on $\epsilon$.

We now state the main theorem of this section. Its proof uses the three previous lemmas. It states that for any $\epsilon$ the probability that the compression rate $\tau(v) = |\gamma(v)|/|v|$ of a string $v$ of length $n$ is greater than $H(S(AD)) + \epsilon$ goes exponentially to zero. Hence, as a corollary, almost surely the compression rate of an infinite sequence emitted by $S(AD)$ reaches the entropy $H(S(AD))$, that is the best possible result.

Theorem 6: Let $K_m(\epsilon)$ be the set of words $w$ of length $m$ such that the compression rate $\tau(w) = |\gamma(w)|/|w|$ is greater than $H(S(AD)) + \epsilon$. For any $\epsilon > 0$ there exist a real number $r(\epsilon)$, $0 < r(\epsilon) < 1$, and an integer $m(\epsilon)$ such that for any $m > m(\epsilon)$, $\Pr(K_m(\epsilon)) \leq r(\epsilon)^m$.

Proof: Let $w$ be a word of length $m$ in the language $L(AD)$ and let $q_1, \ldots, q_{m+1}$ be the sequence of states in the path determined by $w$ in $A(AD)$ starting from the initial state. Let $q = (q_1, \ldots, q_m)$ be the sequence of the first $m$ states. We know, by Lemma 2, that the length of $\gamma(w)$ is equal to the number of states $q_i$, $i = 1, \ldots, m$, in $q$ that belong to $D$, where $D$ is the set of all states having two outgoing edges.

If $w$ belongs to $K_m(\epsilon)$, i.e., if the compression rate $\tau(w) = |\gamma(w)|/|w|$ is greater than $H(S(AD)) + \epsilon$, then there must exist an index $j$ such that $L_{m,j}(q) > \pi_j + \epsilon/|D|$. In fact, if for all $j$, $L_{m,j}(q) \leq \pi_j + \epsilon/|D|$ then, by definitions and by Lemma 1,

$$\tau(w) = \sum_{j \in D} L_{m,j}(q) \leq \sum_{j \in D} \left( \pi_j + \frac{\epsilon}{|D|} \right) = H(S(AD)) + \epsilon,$$

a contradiction. Therefore the sequence of states $q$ belongs to $X_m(\epsilon/d)$, where $d = |D|$. Hence $\Pr(K_m(\epsilon)) \leq \Pr(X_m(\epsilon/d))$.

By Lemma 3, there exists an integer $m(\epsilon)$ such that for any $m > m(\epsilon)$ one has

$$\frac{1}{m} \log_2 \Pr(X_m(\epsilon/d)) \leq -\frac{1}{2} c(\epsilon/d).$$

Then $\Pr(K_m(\epsilon)) \leq 2^{-(1/2) c(\epsilon/d) m}$. If we set $r(\epsilon) = 2^{-(1/2) c(\epsilon/d)}$, the statement of the theorem follows.

Theorem 7: The compression rate $\tau(x)$ of an infinite sequence $x$ emitted by the source $S(AD)$ reaches the entropy $H(S(AD))$ almost surely.

V. How to build Antidictionaries

In practical applications the antidictionary might not be given a priori but it must be derived either from the text to be compressed or from a family of texts belonging to the assumed source of the text to be compressed.

There exist several criteria to build efficient antidictionaries, depending on different aspects or parameters that one wishes to optimize in the compression process. Each criterion gives rise to different algorithms and implementations.

All our methods to build antidictionaries are based on data structures to store factors of words, such as suffix tries, suffix trees, DAWGs, and suffix and factor automata (see for instance Theorem 15 in [10]). In these structures, it is possible to consider a notion of suffix link. This link is essential to design efficient algorithms to build representations of sets of minimal forbidden words in terms of tries or trees. This approach leads to construction algorithms that run in linear time in the length of the text to be compressed.

A rough solution to control the size of antidictionaries is obviously to bound the length of words in the antidictionary. A better solution in the static compression scheme is to prune the trie of the antidictionary with a criterion based on the tradeoff between the space of the trie to be sent and the gain in compression; this will be developed in the next section. However, the first solution is enough to get compression rates that reach asymptotically the entropy for balanced sources, even if this is not true for general sources. Both solutions can be designed to run in linear time.

We present in this section a very simple construction to build finite antidictionaries of a finite word $w$. It is the base on which several variations are developed. The idea is to build the automaton accepting the words having the same factors of length $k$ as $w$ and, from this, to build the set of minimal forbidden words of length $k$ of the word $w$. It can be used as a first step to build antidictionaries for fixed sources. In this case our scheme can be considered as a step for a compressor generator (compressor compiler). In the design of a compressor generator, or compressor compiler, statistical considerations and the possibility of making "errors" in predicting the next letter play an important role, as discussed in Section VII.

(8)

AlgorithmBuild-ADdesribedhereafterbuildstheset

ofminimalforbiddenwordsoflengthk(k>0)oftheword

w. It takes as input an automaton aepting the words

that havethesamefatorsoflengthk (or less)asw, i.e.,

aeptingthelanguage

L

k

=fx2f0;1g

j(u2F(x)andjujk))u2F(w)g:

The preproessing of the automatonis done by the al-

gorithmBuild-Fatwhoseentraloperationisdesribed

bythefuntionNext.

Build-Fat(wordw2f0;1g

,integerk>0)

1. i newstate;Q fig;

2. level(i) 0;

3. p i;

4. whilenotendofstringw

5. a nextletterofw;

6. p Next(p;a;k);

7. returntrie(Q;i;Q;Æ),funtion f;

Next(state p, letter a, integer k > 0)
1. if δ(p, a) defined
2.   return δ(p, a);
3. else if level(p) = k
4.   return Next(f(p), a, k);
5. else
6.   q ← new state; Q ← Q ∪ {q};
7.   level(q) ← level(p) + 1;
8.   δ(p, a) ← q;
9.   if (p = i) f(q) ← i;
10.  else f(q) ← Next(f(p), a, k);
11.  return q;

Build-AD(trie (Q, i, Q, δ), function f, integer k > 0)
1. T ← ∅; δ' ← δ;
2. for each p ∈ Q, 0 < level(p) < k, in breadth-first order
3.   for a ← 0 then 1
4.     if δ(p, a) is undefined and δ(f(p), a) is defined
5.       q ← new state; T ← T ∪ {q};
6.       δ'(p, a) ← q;
7. Q ← Q \ {states of Q from which no δ'-path leads to T}
8. return trie (Q ∪ T, i, T, δ');

The automaton is represented by both a trie and its failure function $f$. If $p$ is a node of the trie associated with the word $av$, $v \in \{0,1\}^*$ and $a \in \{0,1\}$, $f(p)$ is the node associated with $v$. This is a standard technique used in the construction of suffix trees (see [24] for example). It is used here in algorithm Build-AD (line 4) to test the minimality of forbidden words according to the equality $MF(L) = AL \cap LA \cap (A^* \setminus L)$.
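The minimality test can be illustrated by a brute-force sketch that applies the equality $MF(L) = AL \cap LA \cap (A^* \setminus L)$ directly to the factors of a word. This is not the linear-time Build-AD construction; it is a quadratic check written only for illustration, and the function names `factors` and `minimal_forbidden_words` are assumptions introduced here.

```python
def factors(w, max_len):
    """All factors (substrings) of w of length <= max_len, empty word included."""
    return {w[i:j]
            for i in range(len(w) + 1)
            for j in range(i, min(len(w), i + max_len) + 1)}


def minimal_forbidden_words(w, k):
    """Minimal forbidden words of w of length <= k over the alphabet {0, 1}.

    A word v is minimal forbidden when v is not a factor of w while both
    its longest proper prefix and its longest proper suffix are factors.
    """
    fact = factors(w, k)
    result = set()
    for x in fact:                      # x is the longest proper prefix of v
        if len(x) >= k:
            continue
        for b in "01":
            v = x + b
            # v must be forbidden, and its longest proper suffix a factor
            if v not in fact and v[1:] in fact:
                result.add(v)
    return result


print(sorted(minimal_forbidden_words("0110", 3)))
```

For the word 0110 and k = 3 the sketch returns the four words 00, 010, 101 and 111; with k = 2 only 00 remains.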

The above construction gives rise to the following static compression scheme in which we need to read the text twice, the first time to construct the antidictionary AD and the second time to encode the text.

Informally, the encoder sends a message $z$ of the form $(x, y, \sigma(n))$ to the decoder, where $x$ is a description of the antidictionary AD, $y$ is the text coded according to AD, as described in Section II, and $\sigma(n)$ is the usual binary code of the length $n$ of the text. The decoder first reconstructs from $x$ the antidictionary and then decodes $y$ according to the algorithm in Section II. The antidictionary AD is composed in this simple compression scheme of all minimal forbidden words of length $k$ of $w$, but other intelligent choices of subsets of AD are possible. We can describe the antidictionary AD for instance by coding with standard techniques the trie associated with AD to obtain the word $x$. A basic question is how fast the number $k$ must grow as a function of the length $n$ of the word $w$. In this simple compression scheme we choose $k$ to be any function such that $|x| = o(n)$, but other choices are possible. Since the compression rate is the size $|z|$ of $z$ divided by the length $n$ of the text, and since $|x| + |\sigma(n)| = o(n)$, we have that $|z|/n = |y|/n + o(1)$. Assuming that for $n$ and $k$ large enough the source $S(AD)$, as in Section IV, approximates the source of the text, then, by the results of Section IV, the compression rate is "optimal".
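The pair $(y, \sigma(n))$ of the message can be illustrated with a small round trip. The sketch below assumes the encoding rule of Section II is the usual one: a letter is omitted whenever some word of AD, minus its last letter, is a suffix of the text read so far, because the next letter is then forced. The names `predicted`, `encode` and `decode` are hypothetical, and the linear scan over AD stands in for the automaton used in the paper.

```python
def predicted(prefix, ad):
    """Letter forced after prefix by the antidictionary ad, or None."""
    for w in ad:
        if prefix.endswith(w[:-1]):
            # prefix + w[-1] would contain the forbidden word w
            return "1" if w[-1] == "0" else "0"
    return None


def encode(text, ad):
    """Keep only the letters that the antidictionary cannot predict."""
    return "".join(a for i, a in enumerate(text)
                   if predicted(text[:i], ad) is None)


def decode(code, n, ad):
    """Rebuild the n letters of the text from the compressed word code."""
    letters = iter(code)
    text = ""
    while len(text) < n:
        b = predicted(text, ad)
        text += b if b is not None else next(letters)
    return text


AD = {"00", "111", "1011"}          # all three words are forbidden in TEXT
TEXT = "01101010"
y = encode(TEXT, AD)
print(y, decode(y, len(TEXT), AD) == TEXT)
```

Here the eight letters of the text are reduced to the two letters 01, and the decoder needs the length 8 to know when to stop.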

For instance, suppose that $w$ is emitted by a balanced Markov source $S$ with memory $h$, and let $L$ be the formal language composed of all finite words that can be emitted by $S$. By Theorem 4 there exists a finite antifactorial language $N$ such that $L = L(N)$. Moreover, since $S$ has memory $h$, the words in $N$ have length smaller than or equal to $h+1$. If $|w|$ is such that $k > h$ then AD contains $N$ and, therefore, $H(S(AD)) \le H(S(N)) = H(S)$. By Corollary 1 we can deduce that this simple compression scheme turns out to be universal for the family of balanced Markov sources with finite memory (cf. [25]).

Let $w = a_1 a_2 \cdots$ be a binary infinite word that is periodic (i.e., there exists an integer $P > 0$ such that for any index $i$ the letter $a_i$ is equal to the letter $a_{i+P}$), and let $w_n$ be the prefix of $w$ of length $n$. We want to compress the word $w_n$ following our simple scheme informally described above.

It is not difficult to prove that the size of the compressed message for $w_n$ is $|z| = O(|\sigma(n)|) = O(\log_2 n)$, which means that the scheme can achieve an exponential compression.

VI. Pruning Antidictionaries

In this section, as well as in the previous section, we consider a static compression scheme in which we need to read the text twice: the first time to construct the antidictionary AD and the second time to encode the text.

In this section, however, we suppose that we have enough resources to build, in linear time, a suffix or a factor automaton (or their compacted version, cf. [26]) of the finite text string to be compressed. From these structures we can obtain in linear time a trie representation of all minimal forbidden words of the text (cf. [10]). It can be shown that the total length of all minimal forbidden words can be quadratic in the size of the original text. However the trie


we want to get good compression ratios, not all the minimal forbidden words should be considered.

The first idea developed in this section is to prune the trie of the antidictionary with some criteria based on the tradeoff between the space of the trie to be sent and the gain in compression. Clearly, the space of the trie to be sent strictly depends on how we encode the trie.

Using a classical approach, in this section we recall that a binary tree that has $k$ nodes can be encoded using two bits for each node, which gives $2k$ bits for the whole tree. Indeed, depending on whether a subtree $S$ of a binary tree $T$ has both subtrees, only the right subtree, only the left subtree, or no subtree, the root of $S$ can be encoded respectively by the strings 11, 10, 01, 00. This is done recursively in a prefix traversal of the whole tree. All the results presented in this section can be easily extended to the case when a node of the trie can be encoded using $\alpha$ bits for each node, where $\alpha$ is a positive real number.
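The two-bits-per-node encoding can be sketched as follows. The representation of a tree as a pair `(left, right)` with `None` for a missing subtree, and the names `encode_tree` and `decode_tree`, are assumptions made for this illustration; the bit strings 11, 10, 01, 00 follow the convention stated above.

```python
def encode_tree(node):
    """Encode a binary tree, given as (left, right) pairs, in prefix order."""
    left, right = node
    if left is not None and right is not None:
        bits = "11"                 # both subtrees
    elif right is not None:
        bits = "10"                 # only the right subtree
    elif left is not None:
        bits = "01"                 # only the left subtree
    else:
        bits = "00"                 # no subtree: the node is a leaf
    for child in (left, right):
        if child is not None:
            bits += encode_tree(child)
    return bits


def decode_tree(bits, pos=0):
    """Inverse of encode_tree; returns the tree and the next read position."""
    code = bits[pos:pos + 2]
    pos += 2
    left = right = None
    if code in ("11", "01"):
        left, pos = decode_tree(bits, pos)
    if code in ("11", "10"):
        right, pos = decode_tree(bits, pos)
    return (left, right), pos


LEAF = (None, None)
TREE = ((LEAF, None), LEAF)         # 4 nodes
print(encode_tree(TREE))            # 8 bits, two per node
```

The recursion keeps the sketch short; a very deep trie would call for an explicit stack instead.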

The second idea presented afterwards is to compress the words retained in the antidictionary using the antidictionary itself.

The two operations, pruning and self-compressing, can be applied iteratively on antidictionaries. They lead to very compact representations of antidictionaries, producing higher compression ratios.

A. Pruned Antidictionary

A linear-time algorithm for obtaining the trie $T$ of all minimal forbidden words of a fixed text $t$ can be found in [10]. Hence we suppose here that we have this trie $T$.

In order to make a tradeoff between the space of the trie to be sent and the gain in compression, we have to know how much each forbidden word contributes to the compression. Minimal forbidden words of text $t$ correspond in a bijective way to the leaves of the trie $T$, i.e. with any leaf $q$ of the tree we can associate the corresponding minimal forbidden word $w(q)$. Indeed if we identify, as in Section III, the nodes of the trie $T$ with the prefixes of the minimal forbidden words, then the function $w$ is the identity.

We define a cost function $c$ that associates with any leaf $q$ of $T$ the number of bits $c(q)$ that the word $w(q)$ contributes to erase during the compression of the text $t$. This number $c(q)$ is also the number of times that the longest proper prefix of $w(q)$ appears in text $t$ as a factor but not as a suffix. In other words, the number $c(q)$ is the number of times that a state $p$ is traversed while reading the text $t$ in the automaton $A(AD)$, where $p$ leads to state $q$ by some letter $a$ (cf. Section III and Theorem 1). Indeed the last letter of the text is not considered in this process because there is nothing to erase after it. By Theorem 1, the function $c$ can be computed in linear time.
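The characterization of $c(q)$ by occurrences of the longest proper prefix can be checked with a direct count. This is a quadratic sketch for illustration, not the linear-time computation through the automaton $A(AD)$; the function name `cost` and the sample text are assumptions.

```python
def cost(t, v):
    """Number of occurrences in t of the longest proper prefix of v
    that are not a suffix of t, i.e. the letters of t erased thanks to v."""
    u = v[:-1]
    # an occurrence starting at i ends before the last letter of t
    return sum(1 for i in range(len(t) - len(u)) if t.startswith(u, i))


TEXT = "01101010"
AD = ["00", "111", "1011"]          # antifactorial, forbidden in TEXT
costs = {v: cost(TEXT, v) for v in AD}
print(costs, sum(costs.values()))
```

On this example the three words erase 3, 1 and 2 letters respectively, so 6 of the 8 letters of the text are erased. Summing the costs gives the number of erased letters only because the antidictionary is antifactorial, so that each erased letter is attributed to exactly one word.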

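We further define the gain (saving) of a subtree $S$ of the trie $T$ representing an antidictionary as

$$g(S) = \sum \{ c(q) \mid q \text{ leaf of } S \} - 2 m_S,$$

where $m_S$ is the number of nodes of $S$.

Indeed, the number of bits that have to be sent after compression consists of: $2\lfloor \log n \rfloor$ bits for the length $n$ of the text $t$ (cf. the cascading lengths technique in [4] and references therein); $2 m_T$ bits for a description of the antidictionary $T$; $|\gamma(t)|$ bits for the text compressed using $T$. The overall size is

$$2\lfloor \log n \rfloor + 2 m_T + |\gamma(t)| = 2\lfloor \log n \rfloor + n - g(T)$$

by definition of $g(T)$, since $|\gamma(t)|$ is $n$ minus the sum of the costs $c(q)$ over the leaves $q$ of $T$.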

Since $2\lfloor \log n \rfloor + n$ is fixed and since the gain $g(T)$ is the sum of the gains of its subtrees minus 2 bits (for encoding the root), pruning subtrees of $T$ that have a negative gain increases the gain of $T$ and, consequently, decreases the overall number of bits that have to be sent after compression.

Suppose however that $S_2$ is a subtree of $S_1$ which is, in turn, a subtree of the trie $T$. Suppose further that $S_2$ has a negative gain and the same holds for $S_1$, but that $S_1$ has a positive gain if $S_2$ is pruned from it. In this case, in order to obtain better compression ratios, the best thing to do is to prune $S_2$ and not the whole $S_1$. It is thus natural to consider the optimization problem related to an abstract non-negative function $c$ (defined on leaves of $T$) where one instance is a trie $T$ representing a prefix code $C$, and a solution is a trie $T'$ that represents a subset of $C$ and that maximizes the gain $g(T')$.

In what follows we show that a bottom-up approach gives a linear-time solution to this problem.

With any subtree $S$ of $T$ we associate the function $g'$, called the pruned gain, that is defined by

$$g'(S) = \begin{cases} 0 & \text{if } S \text{ is empty} \\ c(S) - 2 & \text{if } S \text{ is a leaf} \\ g'(S_1) - 2 & \text{if } S \text{ has one child } S_1 \\ M & \text{if } S \text{ has two children } S_1 \text{ and } S_2 \end{cases}$$

where $M = \max(g'(S_1), g'(S_2), g'(S_1) + g'(S_2)) - 2$, with $S_1$ and $S_2$ children of $S$.

From the above definition it is not difficult to see that it is possible to compute function $g'$ in linear time with respect to the size of the trie $T$, in a bottom-up traversal of the trie.

We can now present the simple pruning algorithm.

Simple Pruning(trie T, function c)
1. compute g'(S) for each subtree S of T;
2. eliminate subtrees S of T for which g'(S) ≤ 0;
3. return modified trie T;

The following proposition is a consequence of the descriptions given above, and the next theorem shows that the output of the algorithm gives a solution to the optimization problem described above.

Proposition 1: Algorithm Simple Pruning can be performed in linear time.

Theorem 8: Let $T$ be a trie representing a prefix code $C$ and let $c$ be a non-negative function defined on leaves of $T$. The output $T'$ of algorithm Simple Pruning represents a subset of $C$ and $g'(T')$ is maximum. Moreover we have $g(T') = g'(T')$.


Proof: First of all we claim that the trie $T'$ output by algorithm Simple Pruning represents a subset of $C$. Indeed, by the definition of $g'$ it follows that if a subtree $S$ of $T$ is not a leaf and if $g'(S) > 0$, then $S$ must have at least one child $S_1$ with positive pruned gain, i.e. $g'(S_1) > 0$. This fact implies that all leaves of $T'$ are leaves of $T$, proving the claim.

The rest of the proof is done by induction on the height of $T$. If $T$ is empty there is nothing to prove. If $T$ has height 0 then $T$ is a leaf and we already have $g(T) = g'(T)$. If $g(T) > 0$, $T$ itself is equal to $T'$, otherwise $T'$ is the empty tree. In both cases the statement of the theorem is satisfied.

Suppose now that $T$ has height $> 0$. Either it has just one child $S_1$ or it has two children $S_1$ and $S_2$.

Suppose that $T$ has two children $S_1$ and $S_2$; they are both tries and we can associate with them the restriction of the function gain to all their subtrees. By applying algorithm Simple Pruning with input $S_i$, $i = 1, 2$, and function $c$ (restricted to leaves of the corresponding subtrees), we obtain as output a modified trie $S'_i$. By induction we know that $g(S'_i) = g'(S'_i)$ and that this value maximizes the function gain. Therefore, if both $g(S'_1)$ and $g(S'_2)$ are positive, a trie $T'$ representing a subset of $C$ and maximizing the function gain is the trie that has the same root as $T$ and has children $S'_1$ and $S'_2$. Moreover $g(T') = g'(T')$ and algorithm Simple Pruning does not prune $S_1$ and $S_2$ from $T'$, so the theorem is proved in this case.

The other cases, ($g(S_1) \le 0$ and $g(S_2) > 0$), ($g(S_1) > 0$ and $g(S_2) \le 0$), ($g(S_1) \le 0$ and $g(S_2) \le 0$), and the case when $T$ has only one child $S_1$, are dealt with in an analogous manner.

Remark that the statement of Theorem 8 holds essentially because pruning a subtree $S$ of $T$ does not affect the value of function gain over all other subtrees of $T$. This fact is not true anymore with the self-compressing approach used in the next subsection.

B. Self-compressing the antidictionary

Let AD be an antifactorial antidictionary for text $t$. Since AD is antifactorial then, for any $v \in AD$, the set $AD \setminus \{v\}$ is an antidictionary for $v$. Therefore it is possible to compress $v$ using $AD \setminus \{v\}$ or a subset of it.

One can think of a strategy that sends to the decoder, in a static approach, all words $v$ of AD compressed by algorithm Encoder with a subset of $AD \setminus \{v\}$ and $v$ as input. This would achieve better compression. We call this approach self-compression; it is the subject of this subsection.

Let us first try to compress any word $v \in AD$ by using the whole $AD \setminus \{v\}$ and let us denote by $\gamma_1(v)$ the compressed version of $v$ by using $AD \setminus \{v\}$. Notice that the words of AD that are used in compressing $v$ have length $\le |v|$. Further, if $u \in AD$ with $|u| = |v|$ is used to erase the last letter of $v$, then $u$ must coincide with $v$ except for the last letter, that is, $u = xa$, $v = xb$ and $a \ne b$. In addition it is easy to see that $\gamma_1(u) = \gamma_1(v)$. This word is also equal to $\gamma_1(x)$ that has been compressed by using the antidictionary of all

As a special case of the next proposition, a set $\{u, v\}$ having these properties can occur at most once in any antidictionary AD of a text $t$.

A pair of words $(v, v_1)$ is called a stopping pair if $v = ua$, $v_1 = u_1 b \in AD$, with $a, b \in \{0,1\}$, $a \ne b$, and $u$ is a suffix of $u_1$.

Proposition 2: Let AD be an antifactorial antidictionary of a text $t$. If there exists a stopping pair $(v, v_1)$ with $v_1 = u_1 b$, $b \in \{0,1\}$, then $u_1$ is a suffix of $t$ and does not appear elsewhere in $t$. Moreover there exists at most one pair of words having these properties.

Proof: Since $u_1 b \in AD$, $u_1$ is a factor of $t$. Suppose that $u_1 c$ appears as a factor of $t$, with $c \in \{0,1\}$. Since $u$ is a suffix of $u_1$, letter $c$ is not letter $a$ (because $ua$ is forbidden) and is not letter $b$ (because $u_1 b$ is forbidden), a contradiction. Hence $u_1$ is a suffix of $t$ and does not appear elsewhere in $t$.

Since $u_1$ is a suffix of $t$, then also $u$ is a suffix of $t$. Suppose that there exists another pair $(v' = u'c, v'_1 = u'_1 d) \ne (v, v_1)$ of words in AD with $c, d \in \{0,1\}$, $c \ne d$, and $u'$ is a suffix of $u'_1$. Then $u'_1$ and $u'$ are also suffixes of $t$ and it is not difficult to prove by cases that one of the four words among $v, v_1, v', v'_1$ is a factor of another, contradicting the antifactoriality of AD.

Let us suppose now that $v_1, \ldots, v_k$ is a sequence of all words in AD such that for any $i$, $1 \le i \le k-1$, $|v_i| \le |v_{i+1}|$. If one knows that there exists no $v_j$ such that $|v_j| = |v_i|$ and $v_j$ has been used to erase the last letter of $v_i$, then the set $AD_1 = \{v_1, \ldots, v_{i-1}\}$ is the antidictionary used for compressing $v_i$ to get $\gamma_1(v_i)$, and $v_i$ can be recovered from both $\gamma_1(v_i)$ and $|v_i|$ using algorithm Decoder. If there exists $v_j$ such that $|v_j| = |v_i|$ and $v_j$ has been used to erase the last letter of $v_i$, then the set $AD_1 = \{v_1, \ldots, v_{i-1}\}$ is the antidictionary used for obtaining the compressed version $\gamma_1(x) = \gamma_1(v_i)$ of the longest common prefix $x$ of $v_i$ and $v_j$, with $|x| = |v_i| - 1$. Also in this case $x$, and therefore $v_i$ and $v_j$, can be recovered from both $\gamma_1(x) = \gamma_1(v_i)$ and $|x| = |v_i| - 1$ using algorithm Decoder.

By the above discussion, it follows that if one knows the sequence $(\gamma_1(v_1), |v_1|), (\gamma_1(v_2), |v_2|), \ldots, (\gamma_1(v_k), |v_k|)$, together with the couple $(i, j)$ such that $v_i$ and $v_j$ have been used to mutually erase their last letter ($i = j = 0$ if there is no such pair), then the decoder can reconstruct, in this order, words $v_1, v_2, \ldots, v_k$. That is, the decoder can reconstruct the whole antidictionary AD.

Unfortunately, while AD, being antifactorial, is also a prefix code and can be represented by a trie, this is not true anymore for the set $X_1 = \{\gamma_1(v) \mid v \in AD\}$. For example, the reader can easily verify that if $AD = \{11, 000, 10101, 00100100, 1010010100101\}$ then $X_1 = \{11, 000, 111, 0000, 1111\}$. Also, if $AD = \{10, 110, \ldots, 1^n 0\}$ then, for any $n \ge 0$, $X_1 = \{10\}$. Consequently the space saved by self-compressing the antidictionary could be lost in encoding the set $X_1$.

We propose a different approach that makes use of the same idea and leads to simple algorithms for self-compressing and recovering the antidictionary AD. These


senting the antifactorial antidictionary AD and, moreover, the compression ratios obtained with the pruning technique can only be improved by the next self-compression technique.

We present a formal description of the technique. Given a word $v \in AD$, we compress it using an antidictionary $AD'$ that dynamically changes at any step of the while loop on line 2 of algorithm Encoder. While dealing with a proper prefix $u$ of $v$ and the letter $a$ following it, the antidictionary $AD'$ is composed of all words belonging to AD with length not greater than $|u|$. Letter $a$ is erased if and only if there exists a word $u'b \in AD$, $b \ne a$, with $u'$ a proper suffix of $u$. Let us call $\gamma_2(v)$ the compressed version of $v$ obtained in this way and let $X_2 = \{\gamma_2(v) \mid v \in AD\}$.
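The two self-compression rules can be compared on the five-word antidictionary used as an example in this subsection. The sketch below is a direct, quadratic transcription of the definitions of $\gamma_1$ and $\gamma_2$, not the linear-time trie algorithm; the names `gamma1`, `gamma2` and `is_prefix_code` are assumptions introduced for the illustration.

```python
def gamma1(v, ad):
    """Compress v with the static antidictionary AD minus {v}."""
    others = ad - {v}
    out = []
    for i, a in enumerate(v):
        u = v[:i]
        # a is erased when some other word u'b, b != a, has u' as a suffix of u
        if not any(u.endswith(w[:-1]) and w[-1] != a for w in others):
            out.append(a)
    return "".join(out)


def gamma2(v, ad):
    """Compress v using only words u'b with u' a PROPER suffix of the prefix u."""
    out = []
    for i, a in enumerate(v):
        u = v[:i]
        if not any(len(w) - 1 < len(u) and u.endswith(w[:-1]) and w[-1] != a
                   for w in ad):
            out.append(a)
    return "".join(out)


def is_prefix_code(words):
    """True when no word of the set is a proper prefix of another one."""
    return not any(x != y and y.startswith(x) for x in words for y in words)


AD = {"11", "000", "10101", "00100100", "1010010100101"}
X1 = {gamma1(v, AD) for v in AD}
X2 = {gamma2(v, AD) for v in AD}
print(sorted(X1), is_prefix_code(X1))
print(sorted(X2), is_prefix_code(X2))
```

With the first rule the set obtained is the one quoted in the text, which is not a prefix code because 11 is a prefix of 111. With the second rule the five compressed words stay distinct and form a prefix code, at the price of erasing fewer letters.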

This kind of self-compression can be performed in linear time by the next algorithm Self-compress. It has as input both the trie $T$ that represents AD and the function $\delta$ of automaton $A(AD)$ (cf. algorithm L-automaton). Notice that $\delta$ is defined on nodes of $T$. Its output $T'$ is the trie accepting the set $X_2 = \{\gamma_2(v) \mid v \in AD\}$. The algorithm performs a breadth-first traversal of $T$ implemented by the queue $Q$. During the traversal, it creates a self-compressed version $T'$ of $T$ that represents the set $X_2$.

Self-compress(trie T, function δ)
1. i ← root of T;
2. create root i';
3. add (i, i') to empty queue Q;
4. while Q ≠ ∅
5.   extract (p, p') from Q;
6.   if q_0 and q_1 are children of p
7.     create q'_0 and q'_1 as children of p';
8.     add (q_0, q'_0) and (q_1, q'_1) to Q;
9.   else if q is a unique child of p and q = δ(p, a), a ∈ A
10.    if δ(p, ¬a) is a leaf
11.      add (q, p') to Q;
12.    else create q' as a-child of p';
13.      add (q, q') to Q;
14. return trie having root i';

The correctness of algorithm Self-compress relies on the following proposition and the discussion thereafter.

Proposition 3: If a node $p$ in the trie $T$ has two children $q_0$ and $q_1$ then its corresponding node $p'$ in the output trie $T'$ also has two children.

Proof: If $q_0$ and $q_1$ are both leaves, they represent two minimal forbidden words $ua$ and $ub$, $a \ne b$. There is no minimal forbidden word of the form $u'a$ or $u'b$ with $u'$ a proper suffix of $u$ because AD is antifactorial. Therefore neither letter $a$ nor letter $b$ can be erased by the technique.

If $q_0$ and $q_1$ are not leaves, they represent two words $ua$ and $ub$, $a \ne b$, that are factors of text $t$. There is no minimal forbidden word of the form $u'a$ or $u'b$ with $u'$ a proper suffix of $u$ because these words are also factors of $t$. Therefore neither letter $a$ nor letter $b$ can be erased by the technique.

Let us suppose now that only one node among $q_0$ and $q_1$ is a leaf. For instance, let us assume that $q_0$ is a leaf and $q_1$ is not a leaf. They represent respectively two words $ua$ and $ub$, $a \ne b$. Letter $a$ cannot be erased because in the antidictionary there is no word of the form $u'b$ with $u'$ a proper suffix of $u$, $ub$ being a factor of $t$. Letter $b$ cannot be erased because in the antidictionary there is no word of the form $u'a$ with $u'$ a proper suffix of $u$, since $ua$ is in the antidictionary and the antidictionary is antifactorial.

The previous proposition explains why the algorithm creates two nodes $q'_0$ and $q'_1$ at line 7.

We next consider lines 10–13, in which node $p$ of $T$ has only one child $q = \delta(p, a)$. The node $\delta(p, \neg a)$ cannot have higher level than $p$ because $p$ has only one child. Hence, letter $a$ is erased if and only if $\delta(p, \neg a)$ is a leaf, by definition of the technique.

Finally, if $p$ has no children, i.e. $p$ is a leaf, nothing is done by the algorithm but extracting $(p, p')$ from the queue.

Corollary 1: Tries $T$ and $T'$ have the same number of internal nodes that have two children and, consequently, have the same number of leaves. Trie $T'$ represents the prefix code $X_2$.

The corollary implies that $X_2 = \{\gamma_2(v) \mid v \in AD\}$ can be uniquely reconstructed from $T'$. There is an additional property that allows reconstructing AD from $X_2$ without considering lengths of words in AD. This simplifies the procedure. The next proposition follows readily from definitions.

Proposition 4: If there exists no stopping pair in AD then for any $v \in AD$, the last letter of $v$ is not erased during the self-compression to get $\gamma_2(v)$.

If the decoder has the additional information that the last letter of $t$ was not erased at compression time then it can use this fact as a stop criterion. This is also possible even if the antidictionary changes dynamically. Indeed the decoder just has to stop after processing the last letter of the compressed text. Therefore there is no need to use the length of the text to stop decoding.

To ensure that the last letter of any $v \in AD$ is not erased and to meet the above hypothesis, it is sufficient to eliminate the only possible stopping pair (cf. Proposition 2). To do that, we delete from AD the longest word $v_1$ of such a pair. By Proposition 2 this word does not contribute to erasing letters in text $t$ during the compression because there is nothing to erase after the last letter.

Hence we suppose that in our antidictionary AD this word is not included, or, equivalently, that the branch of trie $T$ that has this word as unique leaf is pruned. In other words, we suppose from now on that antidictionary AD (and obviously all its subsets) has no stopping pair.

Algorithm Self-automaton uses the previous hypothesis to reconstruct AD from $T'$. More precisely, its input is a trie $T'$, self-compressed from trie $T$, with its transition function $\delta'$. Its output is the automaton $A(AD)$, where AD is the antidictionary represented by trie $T$. It is similar to algorithm L-automaton. Indeed it makes a breadth-first traversal on states of the trie $T$. It is possible to do this because, any time a state is reached, if a child was "erased" during the execution of Self-compress, it is now created


and added to the queue $Q$. In order to create a new child, function $\delta$ must be previously restored, as done in algorithm L-automaton, by using the failure function $f$. When a leaf is reached in the self-compressed trie, the new stop criterion tells us that there is nothing more to reconstruct in that branch.

Trie $T$ can be obtained from the automaton $A(AD)$, output of the next algorithm, by using a linear-time algorithm described in [10].

The current situation in the next algorithm is as follows: when a node $p$ is popped from the queue, trie $T$ has been decompressed up to the level of $p$ in $T$, $f(p)$ is defined and function $\delta$ is defined for all previous nodes, which includes nodes at the previous level. After processing $p$, $\delta$ is also defined for $p$ and the failure function $f$ is defined on its children.

Self-automaton(trie T')
1. i' ← root of T';
2. Q ← ∅;
3. for each a ∈ A
4.   if δ'(i', a) is defined
5.     δ(i', a) ← δ'(i', a);
6.     f(δ(i', a)) ← i';
7.     add δ(i', a) to Q;
8.   else
9.     δ(i', a) ← i';
10. while Q ≠ ∅
11.   extract p from Q;
12.   if p is not a leaf
13.     if δ(f(p), a) is a leaf for a ∈ A
14.       create p_1;
15.       for each b ∈ A
16.         if δ'(p, b) is defined
17.           δ'(p_1, b) ← δ'(p, b);
18.       δ(p, ¬a) ← p_1;
19.       δ(p, a) ← δ(f(p), a);
20.       f(p_1) ← δ(f(p), ¬a);
21.       add p_1 to Q;
22.     else
23.       for each a ∈ A
24.         if δ'(p, a) is defined
25.           δ(p, a) ← δ'(p, a);
26.           f(δ(p, a)) ← δ(f(p), a);
27.           add δ(p, a) to Q;
28.         else
29.           δ(p, a) ← δ(f(p), a);
30.   else
31.     for each a ∈ A
32.       δ(p, a) ← p;
33. return (Q, A, i', Q \ {leaves}, δ);

Since there is a bijection between leaves of $T$ and leaves of $T'$, we can associate with any leaf $q'$ of $T'$ the same value $c(q)$ of the corresponding leaf $q$ in $T$. This is the number of bits that the word $w(q)$ leads to erase during the compression of text $t$. Analogously, as in the previous subsection, we can define functions gain and pruned gain and, as a first step, we can run algorithm Simple Pruning on $T'$. At the same time we prune the corresponding subtrees in $T$ and obtain a trie $T_1$. Doing so, the modified trie $T_1$ represents a subset of AD. As a second step, we can use again algorithm Self-compress on $T_1$ to get $T'_1$. Note that $T'_1$ can be different from the pruned trie $T'$ because pruning subtrees can affect self-compression.

We can iterate the above two steps for a fixed number of

VII. Conclusion

We have described DCA, a text compression method that uses some "negative" information about the text, represented in terms of antidictionaries. The advantages of the scheme are:

• it is fast at decompressing data,
• it is fast at compressing data for fixed sources,
• it has a synchronization property in the case of finite antidictionaries, a property that leads to efficient parallel compression and to search engines on compressed data.

In the previous sections we presented some static DCA schemes in which the text to be compressed needs to be scanned twice. Starting from these static schemes, several variations and improvements can be proposed. These variations are all based on clever combinations of two elements that can be introduced in our model:

• statistical considerations,
• dynamic approaches.

These are classical features that are often included in other data compression methods.

Statistical considerations are used in the construction of antidictionaries. If a forbidden word is responsible for "erasing" few bits of the text in the compression algorithm of Section II and if its "description" as an element of the antidictionary is "expensive" then the compression ratio improves if it is not included in the antidictionary. This idea has been partially exploited in the previous section. On the contrary, one can introduce into the antidictionary a word that is not forbidden but that occurs very rarely in the text. In this case, the compression algorithm will produce some "errors" or "mistakes" in predicting the next letter. In order to have a lossless compression, encoder and decoder must be adapted to manage such errors. Typical errors occur in the case of antidictionaries built for fixed sources as well as in the dynamic approach.

Even with errors, assuming that they are rare with respect to the maximum length of words of the antidictionary, our compression scheme preserves the synchronization property of Theorem 3. The use of errors becomes necessary for some artificial strings like $1^m 0$ if one wants to use a static approach. Without errors and with a static approach, the algorithms described in the previous section are unable to compress such strings.

Antidictionaries for fixed sources also have an intrinsic interest. A compressor generator, or compressor compiler, can create, starting from words obtained from a source $S$, an antidictionary that can be used to compress all other words from the same source $S$. Error management is essential for this kind of application. Having a fixed antidictionary makes the compression fast because basic operations are just table lookups.

In the dynamic approach, we construct the antidictionary and encode the text at the same time. The antidictionary is constructed (also with statistical considerations) by considering the whole text previously scanned or just a part of it. The antidictionary can change at any step and the algorithmic rules for its construction must be


File      original size   compressed size
          (in bytes)      (in bytes)
bib          111261           35535
book1        768771          295966
book2        610856          214476
geo          102400           79633
news         377109          161004
obj1          21504           13094
obj2         246814          111295
paper1        53161           21058
paper2        82199           32282
pic          513216           70240
progc         39611           15736
progl         71646           20092
progp         49379           13988
trans         93695           22695

Fig. 3. Compression ratios on files of the Calgary Corpus.

We have realized prototypes of the compression and decompression algorithms. They also implement the dynamic version of the method. They have been tested on the Calgary Corpus (see Figure 3), and experiments show that we get compression ratios equivalent to those of most common compressors (such as pkzip for example).

We are considering several generalizations:

• Compressor schemes and implementations of antidictionaries on more general alphabets or on other types of data (images, sounds, etc.),
• Use of lossy compression especially to deal with images,
• Combination of DCA with other compression schemes; for instance, using both dictionaries and antidictionaries like positive and negative sets of examples as in Learning Theory,
• Design of chips dedicated to fixed sources.

Several problems concerning the data compression scheme are still open. We list some of them.

• Are balanced sources dense inside the family of Markov sources? A positive answer would raise the question of adapting the scheme so that it becomes universal for Markov or ergodic sources. Can self-compression be used to settle this question?
• Are there efficient algorithms to build good antidictionaries for syntactic sources, generated for instance by grammars? This raises a question of coding on a binary alphabet.
• What is the average of the maximum length of minimal forbidden words in texts of length $n$ generated by an ergodic source having entropy $H$?
• How many times on the average should pruning and self-compressing be iterated before the process stabilizes (see previous section)? We would expect a maximum of $\log n$ steps. Is the stabilized trie optimal?

Acknowledgments

We thank M. P. Béal, M. Cohn, F. M. Dekking, R. Grossi

References

[1] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression, Prentice Hall, 1990.

[2] J. Gailly, "Frequently asked questions in data compression," 2000, FAQ, URL http://www.faqs.org/faqs/faqs/compression-faq/.

[3] M. Nelson and J. Gailly, The Data Compression Book, M&T Books, New York, NY, 1996.

[4] J. A. Storer, Data Compression: Methods and Theory, Computer Science Press, 1988.

[5] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes, Van Nostrand Reinhold, 1994.

[6] C. Shannon, "Prediction and entropy of printed English," Bell System Technical J., January 1951.

[7] M.-P. Béal, F. Mignosi, and A. Restivo, "Minimal forbidden words and symbolic dynamics," in STACS'96, C. Puech and R. Reischuk, Eds., number 1046 in Lecture Notes in Computer Science, pp. 555–566. Springer-Verlag, Berlin, 1996.

[8] M.-P. Béal, F. Mignosi, A. Restivo, and M. Sciortino, "Minimal forbidden words and symbolic dynamics," Advances in Appl. Math., to appear.

[9] M. Crochemore, F. Mignosi, and A. Restivo, "Minimal forbidden words and factor automata," in MFCS'98, L. Brim, J. Gruska, and J. Slatuska, Eds., number 1450 in Lecture Notes in Computer Science, pp. 665–673. Springer-Verlag, Berlin, 1998.

[10] M. Crochemore, F. Mignosi, and A. Restivo, "Automata and forbidden words," Inf. Process. Lett., vol. 67, no. 3, pp. 111–117, 1998.

[11] M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi, "Text compression using antidictionaries," in ICALP'99, L. Brim, J. Gruska, and J. Slatuska, Eds., number 1664 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1999.

[12] C. Choffrut and K. Culik, "On extendibility of unavoidable sets," Discrete Appl. Math., vol. 9, pp. 125–137, 1984.

[13] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Commun. ACM, vol. 18, no. 6, pp. 333–340, 1975.

[14] M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1994.

[15] V. Diekert and Y. Kobayashi, "Some identities related to automata, determinants, and Möbius functions," Report 1997/05, Universität Stuttgart, 1997.

[16] J. Berstel and D. Perrin, "Finite and infinite words," in Algebraic Combinatorics on Words, J. Berstel and D. Perrin, Eds. Cambridge University Press, to appear.

[17] Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa, "Pattern matching in text compressed by using antidictionaries," in CPM'99, M. Crochemore and M. Paterson, Eds., number 1645 in Lecture Notes in Computer Science, pp. 37–49. Springer-Verlag, Berlin, 1999.

[18] M. P. Béal, Codage Symbolique, Masson, 1993.

[19] R. Ash, Information Theory, Tracts in mathematics. Interscience Publishers, J. Wiley & Sons, 1985.

[20] R. G. Gallager, Information Theory and Reliable Communication, J. Wiley and Sons, Inc., 1968.

[21] R. G. Gallager, Discrete Stochastic Processes, Kluwer Acad. Publ., 1995.

[22] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Van Nostrand Reinhold, 1960.

[23] R. S. Ellis, Entropy, Large Deviations, and Statistical Mechanics, Springer Verlag, 1985.

[24] M. Crochemore and C. Hancart, "Automata for matching patterns," in Handbook of Formal Languages, Volume 2, Linear Modeling: Background and Application, G. Rozenberg and A. Salomaa, Eds. Springer-Verlag, 1997.

[25] R. Krichevsky, Universal Compression and Retrieval, Kluwer Academic Publishers, 1994.

[26] M. Crochemore and R. Vérin, "On compact directed acyclic word graphs," in Structures in Logic and Computer Science, J. Mycielski, G. Rozenberg, and A. Salomaa, Eds., number 1261 in Lecture Notes in Computer Science, pp. 192–211. Springer-
