HAL Id: hal-00619579
https://hal-upec-upem.archives-ouvertes.fr/hal-00619579
Submitted on 13 Feb 2013
Maxime Crochemore, Filippo Mignosi, Antonio Restivo, Sergio Salemi
To cite this version:
Maxime Crochemore, Filippo Mignosi, Antonio Restivo, Sergio Salemi. Data compression using antidictionaries. Proceedings of the I.E.E.E., 2000, 88 (11), pp.1756-1768. 10.1109/5.892711. hal-00619579
Data Compression Using Antidictionaries
M. Crochemore, F. Mignosi, A. Restivo, S. Salemi
Abstract. We give a new text compression scheme based on Forbidden Words ("antidictionary"). We prove that our algorithms attain the entropy for balanced binary sources. They run in linear time. Moreover, one of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful to search compressed data and allows parallel compression. The techniques used in this paper are from Information Theory and Finite Automata.

Keywords. Data Compression, Lossless compression, Information Theory, Finite Automaton, Forbidden Word, Pattern Matching.
I. Introduction

We present a simple text compression method called DCA (Data Compression with Antidictionaries) that uses some "negative" information about the text, which is described in terms of antidictionaries. In contrast to other methods that make use, as a main tool, of dictionaries, i.e., particular sets of words occurring as factors in the text (cf. [1], [2], [3], [4] and [5]), our method takes advantage of words that do not occur as factors in the text, i.e., that are forbidden. Such sets of words are called here antidictionaries.

We describe a static compression scheme that runs in linear time (Sections II and III) including the construction of antidictionaries (Section V and Section VI). Variations using statistical or dynamical considerations are discussed in the conclusion (Section VII).
Let $w$ be a text on the binary alphabet $\{0,1\}$ and let AD be an antidictionary for $w$. By reading the text $w$ from left to right, if at a certain moment the current prefix $v$ of the text has as suffix a word $u'$ such that $u = u'a \in AD$ with $a \in \{0,1\}$, i.e., $u$ is forbidden, then surely the letter following $v$ in the text cannot be $a$ and, since the alphabet is binary, it is the letter $b \neq a$. In other terms, we know in advance the next letter $b$, which turns out to be redundant or predictable. The main idea of our method is to eliminate redundant letters in order to achieve compression. The decoding algorithm recovers the text $w$ by predicting the letter following the current prefix $v$ of $w$ already decompressed.

The DCA URL is http://www-igm.univ-mlv.fr/~mac/DCA.html
M. Crochemore, Institut Gaspard-Monge, Université de Marne-la-Vallée, France. E-mail: Maxime.Crochemore@univ-mlv.fr.
F. Mignosi, Università degli Studi di Palermo, Italy and Brandeis University, U.S.A. E-mail: mignosi@altair.math.unipa.it and mignosi@cs.brandeis.edu. Work partially supported by the CNR-NATO fellowship n. 215.31 and by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori" MURST, Italy.
A. Restivo, Università degli Studi di Palermo, Italy. E-mail: restivo@altair.math.unipa.it. Work partially supported by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori" MURST, Italy.
S. Salemi, Università degli Studi di Palermo, Italy. E-mail: salemi@altair.math.unipa.it. Work partially supported by the project "Modelli innovativi di calcolo: metodi sintattici e combinatori".
The method proposed here presents some analogies with ideas discussed by C. Shannon at the very beginning of Information Theory. In [6] Shannon designed psychological experiments in order to evaluate the entropy of English. One of such experiments was about the human ability to reconstruct an English text where some characters were erased. Actually our compression method erases some characters and the decompression reconstructs them.
We prove (Section IV) that the compression rate of our compressor reaches the entropy almost surely, provided that the source is balanced and produced from a finite antidictionary. This type of source approximates a large class of sources and, consequently, a variant of the basic scheme gives an optimal compression for them. The idea of using antidictionaries is founded on the fact that there exists a topological invariant for Dynamical Systems based on forbidden words, an invariant that is independent of the entropy (cf. [7] and [8]).
The use of the antidictionary AD in coding and decoding algorithms requires that AD be structured in order to answer the following query on a word $v$: does there exist a word $u = u'a$, $a \in \{0,1\}$, in AD such that $u'$ is a suffix of $v$? In the case of a positive answer the output should also include the letter $b$ defined by $b \neq a$. One of the main features of our method is that we are able to implement finite antidictionaries efficiently in terms of finite automata. This leads to fast linear-time compression and decompression algorithms that can be realized by sequential transducers (generalized sequential machines). This is especially relevant for fixed sources. It is then comparable to the fastest compression methods because the basic operation at compression and decompression time is just table lookup.

A central notion of the present method is that of minimal forbidden words, which allows us to reduce the size of antidictionaries. This notion also has some interesting combinatorial properties. Our compression method includes algorithms to compute antidictionaries, algorithms that are based on the above combinatorial properties and that are described in detail in [9] and [10].

The compression method also has an interesting synchronization property in the case of finite antidictionaries. It states that the encoding of a block of data does not depend on the left and right contexts except for a limited-size prefix of the encoded block. This is also helpful to search compressed data, and the same property allows us to design efficient parallel compression algorithms.
The paper is organized as follows.

In Section II we give the definition of forbidden words and the compression and decompression algorithms (binary oriented), assuming that the antidictionary is given. In Section III we describe a data structure for finite antidictionaries that allows us to answer efficiently the queries needed by our compression and decompression algorithms; we show how to implement it given a finite antidictionary. In the case of rational antidictionaries the compression is also described in terms of transducers. We end the section by proving the synchronization property. In Section IV we evaluate the compression rate of our compression algorithm relative to a given antidictionary. In Section V we show how to construct antidictionaries for single words and sources. As a consequence we obtain a family of linear-time optimal algorithms for text compression that are universal for balanced Markov sources with finite memory. In Section VI we give improved linear-time algorithms for building antidictionaries for a static approach. They use the ideas of pruning and self-compressing. We discuss improvements and generalizations in Section VII.

Some of the results present in this paper have been succinctly stated in [11].
II. Basic Algorithms

Let us first introduce the main ideas of our algorithm on its static version. We discuss variations of this first approach in Section VII.

Let $w$ be a finite binary word and let $F(w)$ be the set of factors of $w$. For instance, if $w = 01001010$ then $F(w) = \{\varepsilon, 0, 1, 00, 01, 10, 001, 010, \ldots, 01001010\}$ where $\varepsilon$ denotes the empty word.
Let us take some words in the complement of $F(w)$, i.e., let us take some words that are not factors of $w$ and that we call forbidden. Such a set of words AD is called an antidictionary for the language $F(w)$. Antidictionaries can be finite as well as infinite. For instance, if $w = 01001010$ the words 11, 000, and 10101 are forbidden and the set $\{11, 000, 10101\}$ is an antidictionary for $F(w)$. As another instance, if $w_1 = 001001001001$ the infinite set of all words that have two 1's as $i$-th and as $(i+2)$-th letter for some integer $i$ is an antidictionary for $w_1$. We want here to stress that an antidictionary can be any subset of the complement of $F(w)$. Therefore an antidictionary can be defined by any property that concerns words.
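The definitions above can be checked mechanically on small words. The following is a minimal sketch, not part of the original method: the helper names `factors` and `is_antidictionary` are illustrative assumptions, and the quadratic enumeration of factors is only meant for short examples.

```python
def factors(w):
    """Return F(w): every factor (substring) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def is_antidictionary(words, w):
    """True when no word of `words` occurs as a factor of w."""
    fw = factors(w)
    return all(u not in fw for u in words)


w = "01001010"
print(is_antidictionary({"11", "000", "10101"}, w))  # True
print(is_antidictionary({"11", "001"}, w))           # False: 001 is a factor of w
```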
The compression algorithm treats the input word in an on-line manner. At a certain step in this process we have read the word $v$, a proper prefix of $w$. If there exists a word $u = u'a$, $a \in \{0,1\}$, in the antidictionary AD such that $u'$ is a suffix of $v$, then surely the letter following $v$ cannot be $a$, i.e., the next letter is $b$, $b \neq a$. In other words, we know in advance the next letter $b$, which turns out to be "redundant" or predictable. Remark that this argument works only in the case of binary alphabets.
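The main idea in the algorithm we describe is to eliminate redundant letters. In what follows we first describe the compression algorithm, Encoder, and then the decompression algorithm, Decoder. The word to be compressed is denoted $w = a_1 \cdots a_n$ and its compressed version is denoted $\gamma(w)$.

Encoder (antidictionary AD, word $w \in \{0,1\}^*$)
1. $v \leftarrow \varepsilon$; $\gamma \leftarrow \varepsilon$;
2. for $a \leftarrow$ first to last letter of $w$
3.     if for every suffix $u'$ of $v$, $u'0, u'1 \notin AD$
4.         $\gamma \leftarrow \gamma . a$;
5.     $v \leftarrow v . a$;
6. return ($|v|$, $\gamma$);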
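The pseudocode above can be transcribed almost literally. The sketch below is a naive illustration and not the linear-time implementation of Section III: the function name `encode` and the representation of AD as a Python set of strings are assumptions, and each query scans the whole antidictionary.

```python
def encode(antidictionary, w):
    """Naive Encoder: drop every letter that the antidictionary makes predictable."""
    out = []
    for i, a in enumerate(w):
        v = w[:i]  # prefix already read
        # The next letter is forced when some suffix u' of v satisfies u'c in AD.
        predictable = any(v.endswith(u[:-1]) for u in antidictionary)
        if not predictable:
            out.append(a)
    return len(w), "".join(out)


AD = {"000", "10101", "11"}
print(encode(AD, "0100101001"))  # (10, '0101')
```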
As an example, let us run the algorithm Encoder on the string $w = 0100101001$ with the antidictionary $AD = \{000, 10101, 11\}$. The steps of the treatment are described in the next array by the current values of the prefix $v_i = a_1 \cdots a_i$ of $w$ that has just been considered and of the output $\gamma(w)$. In the case of a positive answer to the query to the antidictionary AD, the array also indicates the value of the corresponding forbidden word $u$. The number of times the answer is positive in a run corresponds to the number of bits erased.

prefix                    output                       forbidden word
$\varepsilon$             $\gamma(w) = \varepsilon$
$v_1 = 0$                 $\gamma(w) = 0$
$v_2 = 01$                $\gamma(w) = 01$             $u = 11 \in AD$
$v_3 = 010$               $\gamma(w) = 01$
$v_4 = 0100$              $\gamma(w) = 010$            $u = 000 \in AD$
$v_5 = 01001$             $\gamma(w) = 010$            $u = 11 \in AD$
$v_6 = 010010$            $\gamma(w) = 010$
$v_7 = 0100101$           $\gamma(w) = 0101$           $u = 11 \in AD$
$v_8 = 01001010$          $\gamma(w) = 0101$           $u = 10101 \in AD$
$v_9 = 010010100$         $\gamma(w) = 0101$           $u = 000 \in AD$
$v_{10} = 0100101001$     $\gamma(w) = 0101$           $u = 11 \in AD$
Remark that the function $\gamma$ is not injective. For instance $\gamma(01) = \gamma(010) = 01$. In order to have an injective mapping we can consider the function $\gamma'(w) = (|w|, \gamma(w))$. In this case we can reconstruct the original word $w$ from both $\gamma'(w)$ and the antidictionary.
The decoding algorithm works as follows. The compressed word is $\gamma(w) = b_1 \cdots b_h$ and the length of $w$ is $n$. The algorithm recovers the word $w$ by predicting the letter following the current prefix $v$ of $w$ already decompressed. If there exists a word $u = u'a$, $a \in \{0,1\}$, in the antidictionary AD such that $u'$ is a suffix of $v$, then the output letter is $b$, $b \neq a$. Otherwise, the next letter is read from the input.
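Decoder (antidictionary AD, word $\gamma \in \{0,1\}^*$, integer $n$)
1. $v \leftarrow \varepsilon$;
2. while $|v| < n$
3.     if for some $u'$ suffix of $v$ and $a \in \{0,1\}$, $u'a$ belongs to AD
4.         $v \leftarrow v . b$, with $b \neq a$;
5.     else
6.         $b \leftarrow$ next letter of $\gamma$;
7.         $v \leftarrow v . b$;
8. return ($v$);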
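A direct transcription of Decoder is sketched below under the same assumptions as before (AD as a Python set of strings, naive suffix queries). The names `predicted_letter` and `decode` are illustrative, not taken from the paper.

```python
def predicted_letter(antidictionary, v):
    """Return the letter forced after prefix v, or None when nothing is forced."""
    for u in antidictionary:
        if v.endswith(u[:-1]):
            # u = u'a is forbidden, so the next letter is the complement of a.
            return "1" if u[-1] == "0" else "0"
    return None


def decode(antidictionary, gamma, n):
    """Naive Decoder: rebuild the word of length n from its compressed form."""
    v = ""
    letters = iter(gamma)
    while len(v) < n:
        b = predicted_letter(antidictionary, v)
        if b is None:
            b = next(letters)  # unpredictable letter: read it from the input
        v += b
    return v


AD = {"000", "10101", "11"}
print(decode(AD, "0101", 10))  # 0100101001
```

The length argument acts as the stop criterion: the same compressed word yields `01` for length 2 and `010` for length 3.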
The antidictionary AD must be structured in order to answer the following query on a word $v$: does there exist a word $u = u'a$, $a \in \{0,1\}$, in AD such that $u'$ is a suffix of $v$? In case of a positive answer the output should also include the letter $b$ defined by $b \neq a$. Notice that the letter $a$ considered at line 3 is unique because, at this point, the end of the text $w$ has not been reached so far.
In this approach, where the antidictionary is static and available to both the encoder and the decoder, the encoder must send to the decoder the length of the word $|w|$, in addition to the compressed word $\gamma(w)$, in order to give the decoder a "stop" criterion. Slight variations of the previous compression-decompression algorithm can easily be obtained by giving other "stop" criteria. For instance, the encoder can send the number of letters that the decoder has to reconstruct after the last letter of the compressed word $\gamma(w)$ has been read. Or the encoder can let the decoder stop when there is no more letter available in $\gamma$ (line 6), or when both letters are impossible to reconstruct according to AD. Doing so, the encoder must send to the decoder the number of letters to erase in order to recover the original message. For such variations antidictionaries can be structured to answer slightly more complex queries.

Since we are considering here the static case, the encoder must send the antidictionary to the decoder unless the decoder already has a copy of the antidictionary or has an algorithmic way to reconstruct the antidictionary from some previously acquired information.
The method presented here brings to mind some ideas proposed by C. Shannon at the very beginning of Information Theory. In [6] Shannon designed psychological experiments in order to evaluate the entropy of English. One of such experiments was about the human ability to reconstruct an English text where some characters were erased. Actually our compression method erases some characters and the decompression reconstructs them. For instance, in the previous example the input string is $0\,1\,\bar{0}\,0\,\bar{1}\,\bar{0}\,1\,\bar{0}\,\bar{0}\,\bar{1}$, where bars indicate which letters are erased during the compression.
In order to get good compression rates (at least in the static approach when the antidictionary has to be sent) we need in particular to minimize the size of the antidictionary. Remark that if there exists a forbidden word $u = u'a$, $a \in \{0,1\}$, in the antidictionary such that $u'$ is also forbidden, then our algorithms will never use this word $u$. So we can erase this word from the antidictionary without any loss for the compression of $w$. This argument leads us to consider the notion of minimal forbidden word with respect to a factorial language $L$, and the notion of anti-factorial language, points that are discussed in the next section.
III. Implementation of Finite Antidictionaries

When the antidictionary is a finite set, the queries on the antidictionary required by the algorithms of the previous section are realized as follows. We build a deterministic automaton accepting the words having no factor in the antidictionary. Then, while reading the text to encode, if a transition leads to a sink state, the output is the other letter. We denote by $\mathcal{A}(AD)$ the automaton associated with the antidictionary AD. An algorithm to build $\mathcal{A}(AD)$ is described in [9] and [10]. The same construction has been discovered by Choffrut et al. [12]; it is similar to a description given by Aho and Corasick ([13], see [14]), by Diekert et al. [15], and it is related to a more general construction given in [16].
The required automaton accepts a factorial language $L$. Recall that a language $L$ is factorial if $L$ satisfies the following property: for any words $u$, $v$, $uv \in L \Rightarrow u \in L$ and $v \in L$. The complement language $L^c = A^* \setminus L$ is a (two-sided) ideal of $A^*$. Denoting by $MF(L)$ the base of this ideal, we have $L^c = A^* MF(L) A^*$. The set $MF(L)$ is called the set of minimal forbidden words for $L$. A word $v \in A^*$ is forbidden for the factorial language $L$ if $v \notin L$, which is equivalent to saying that $v$ occurs in no word of $L$. In addition, $v$ is minimal if it has no proper factor that is forbidden.
One can note that the set $MF(L)$ uniquely characterizes $L$, just because $L = A^* \setminus A^* MF(L) A^*$. This set $MF(L)$ is an anti-factorial language or a factor code, which means that it satisfies: $\forall u, v \in MF(L)$, $u \neq v \Rightarrow u$ is not a factor of $v$, a property that comes from the minimality of words of $MF(L)$. Indeed, there is a duality between factorial and anti-factorial languages, because we also have the equality $MF(L) = AL \cap LA \cap (A^* \setminus L)$. In view of the remark made at the end of the previous section, from now on in the paper we consider only antidictionaries that consist of minimal forbidden words. Thus they are anti-factorial languages.
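The equality $MF(L) = AL \cap LA \cap (A^* \setminus L)$ gives a direct membership test: a word is a minimal forbidden word exactly when it is not a factor while both its longest proper prefix and its longest proper suffix are factors. The brute-force sketch below applies this test to $L = F(w)$; the function names and the length bound `max_len` are assumptions made for illustration, and the linear-time constructions are those of [9] and [10].

```python
def factors(w):
    """Return F(w): every factor of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def minimal_forbidden_words(w, max_len, alphabet="01"):
    """Minimal forbidden words of F(w) of length at most max_len (brute force)."""
    fw = factors(w)
    result = set()
    for u in fw:                  # u in L, so u + a belongs to LA
        for a in alphabet:
            x = u + a
            # x must lie outside L while its suffix x[1:] is in L (x in AL).
            if len(x) <= max_len and x not in fw and x[1:] in fw:
                result.add(x)
    return result


print(sorted(minimal_forbidden_words("01001010", 5)))
# ['000', '00100', '10100', '10101', '11']
```

The antidictionary $\{000, 10101, 11\}$ used in the running example is a subset of this set.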
Figure 1 displays the trie that accepts the anti-factorial language $AD = \{000, 10101, 11\}$. The automaton produced from the trie is shown in Figure 2.
Fig. 1. Trie of the factor code $\{000, 10101, 11\}$. Squares represent terminal states.
The following theorem is proved in [10]. It is based on an algorithm called L-automaton that has as (finite) input AD in the form of a trie $T$. It is straightforward to get $T$ if AD is given in the form of a list of words. The algorithm can be adapted to test whether $T$ represents an anti-factorial set, to generate the trie of the anti-factorial language associated with a set of words, or even to build the automaton associated with the anti-factorial language corresponding to any set of words.

Theorem 1: The construction of $\mathcal{A}(AD)$ from $T$ can be realized in linear time.
We report here, for the sake of completeness, the algorithm L-automaton described in [10]. Its input, the trie $T$ that represents AD, is viewed as an automaton accepting the set AD and, as such, it is denoted $(Q, A, i, T, \delta')$. The set $T$ of terminal states is the set of leaves of the trie.

Fig. 2. Automaton accepting the words that avoid the set $\{000, 10101, 11\}$. Squares represent non-terminal states (sink states).
The algorithm uses a function $f$ called a failure function and defined on states of $T$ as follows. States of the trie $T$ are identified with the prefixes of words in AD. For a state $au$ ($a \in A$, $u \in A^*$), $f(au)$ is the longest suffix of $u$ that is a state of the trie $T$, a word that may happen to be $u$ itself. This state is also $\delta(i, u)$, where $\delta$ is the transition function of $\mathcal{A}(AD)$, and this can easily be proved by induction on the length of $u$. Note that $f(i)$ is undefined, which justifies a specific treatment of the initial state in the algorithm.
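L-automaton (trie $T = (Q, A, i, T, \delta')$)
1. for each $a \in A$
2.     if $\delta'(i, a)$ defined
3.         $\delta(i, a) \leftarrow \delta'(i, a)$;
4.         $f(\delta(i, a)) \leftarrow i$;
5.     else
6.         $\delta(i, a) \leftarrow i$;
7. for each state $p \in Q \setminus \{i\}$ in width-first search and each $a \in A$
8.     if $\delta'(p, a)$ defined
9.         $\delta(p, a) \leftarrow \delta'(p, a)$;
10.        $f(\delta(p, a)) \leftarrow \delta(f(p), a)$;
11.    else if $p \notin T$
12.        $\delta(p, a) \leftarrow \delta(f(p), a)$;
13.    else
14.        $\delta(p, a) \leftarrow p$;
15. return $(Q, A, i, Q \setminus T, \delta)$;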
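The algorithm above translates into a short breadth-first traversal of the trie. The sketch below is one possible transcription: the integer state numbering, the helper `build_trie`, and the function `avoids` are assumptions added for illustration. Terminal states of the trie become the sink states of the automaton.

```python
from collections import deque


def build_trie(words):
    """Trie of the words: delta[p][a] is the child of state p by letter a."""
    delta, terminal = [{}], set()
    for word in words:
        p = 0
        for a in word:
            if a not in delta[p]:
                delta.append({})
                delta[p][a] = len(delta) - 1
            p = delta[p][a]
        terminal.add(p)
    return delta, terminal


def l_automaton(words, alphabet="01"):
    """Complete automaton of the words avoiding `words`; returns (delta, sinks)."""
    trie, terminal = build_trie(words)
    delta = [dict() for _ in trie]
    fail = {}
    queue = deque()
    for a in alphabet:                      # lines 1-6: initial state
        if a in trie[0]:
            delta[0][a] = trie[0][a]
            fail[trie[0][a]] = 0
            queue.append(trie[0][a])
        else:
            delta[0][a] = 0
    while queue:                            # lines 7-14: width-first search
        p = queue.popleft()
        for a in alphabet:
            if a in trie[p]:
                q = trie[p][a]
                delta[p][a] = q
                fail[q] = delta[fail[p]][a]
                queue.append(q)
            elif p not in terminal:
                delta[p][a] = delta[fail[p]][a]
            else:
                delta[p][a] = p             # sink states loop on themselves
    return delta, terminal


def avoids(words, text):
    """True when no word of `words` is a factor of text."""
    delta, sinks = l_automaton(words)
    p = 0
    for a in text:
        p = delta[p][a]
    return p not in sinks


AD = ["000", "10101", "11"]
print(avoids(AD, "0100101001"))  # True
print(avoids(AD, "1010100"))     # False: contains 10101
```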
A. Transducers

From the automaton $\mathcal{A}(AD)$ we can easily construct a (finite-state) transducer $\mathcal{B}(AD)$ that realizes the compression algorithm Encoder, i.e., that computes the function $\gamma$. The input part of $\mathcal{B}(AD)$ coincides with $\mathcal{A}(AD)$, with sink states removed, and the output is given as follows: if a state of $\mathcal{A}(AD)$ has two outgoing edges, then the output labels of these edges coincide with their input labels; if a state of $\mathcal{A}(AD)$ has only one outgoing edge, then the output label of this edge is the empty word. The transducer $\mathcal{B}(AD)$ works as follows on an input string $w$. Consider the path in $\mathcal{B}(AD)$ labeled by $w$: the letters of $w$ that correspond to an edge that is the unique outgoing edge of a given state are erased; other letters are unchanged.
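The behaviour of the transducer can be sketched as follows. For brevity, and as an assumption of this example, the automaton is rebuilt here by brute force (states are the proper prefixes of words of AD, and each transition goes to the longest suffix that is a state) rather than by the linear-time L-automaton algorithm. The names `build_transducer` and `transduce` are illustrative.

```python
def build_transducer(antidictionary, alphabet="01"):
    """Edges of A(AD) with sink states removed; states are proper prefixes."""
    states = {u[:i] for u in antidictionary for i in range(len(u))}
    edges = {}
    for p in states:
        for a in alphabet:
            x = p + a
            if any(x.endswith(u) for u in antidictionary):
                continue  # this edge would enter a sink state: it is removed
            # Target: longest suffix of x that is a state (the empty word always is).
            edges[p, a] = next(x[i:] for i in range(len(x) + 1) if x[i:] in states)
    return edges


def transduce(antidictionary, w, alphabet="01"):
    """Output of the transducer on w: letters on unique outgoing edges are erased."""
    edges = build_transducer(antidictionary, alphabet)
    state, out = "", []
    for a in w:
        outgoing = sum((state, b) in edges for b in alphabet)
        if outgoing == 2:
            out.append(a)        # output label equals input label
        state = edges[state, a]  # KeyError if w contains a forbidden word
    return "".join(out)


AD = {"000", "10101", "11"}
print(transduce(AD, "0100101001"))  # 0101
```

On the running example the output coincides with the result of Encoder, as expected from the construction.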
We can then state the following theorem.

Theorem 2: Algorithm Encoder can be realized by a sequential transducer (generalized sequential machine).

Concerning the algorithm Decoder, remark (see Section II) that the function $\gamma$ is not injective and that we need some additional information, for instance the length of the original uncompressed word, in order to reconstruct it without ambiguity. Therefore, Decoder can be realized by the same transducer as above, by interchanging input and output labels (denote it by $\mathcal{B}'(AD)$), with a supplementary instruction to stop the decoding.
Let $Q = Q_1 \cup Q_2$ be a partition of the set of states $Q$, where $Q_j$ is the set of states having $j$ outgoing edges ($j = 1, 2$). For any $q \in Q_1$, define $p(q) = (q, q_1, \ldots, q_r)$ as the unique path in the transducer for which $q_h \in Q_1$ for $h < r$ and $q_r \in Q_2$.

Given an input word $v = b_1 b_2 \cdots b_m$, there exists in $\mathcal{B}'(AD)$ a unique path $i, q_1, \ldots, q_{m'}$ such that $q_{m'-1} \in Q_2$ and the transition from $q_{m'-1}$ to $q_{m'}$ corresponds to the input letter $b_m$. If $q_{m'} \in Q_2$, then the output word corresponding to this path in $\mathcal{B}'(AD)$ is the unique word $w$ such that $\gamma(w) = v$. If $q_{m'} \in Q_1$, then we can stop the decoding algorithm realized by $\mathcal{B}'(AD)$ in any state $q \in p(q_{m'})$, and, for different states, we obtain different decodings. So we need supplementary information (for instance, the length of the original uncompressed word) to perform the decoding. In this sense we can say that $\mathcal{B}'(AD)$ realizes sequentially the algorithm Decoder (cf. also [17]).

The constructions and the results given above on finite antidictionaries and transducers can also be generalized to the case of rational antidictionaries or, equivalently, when the set of words "produced by the source" is a regular (rational) language. In these cases it is not, in a strict sense, necessary to introduce antidictionaries explicitly, and all the methods can be presented in terms of automata and transducers, as above. Remark however that the presentation given in Section II in terms of antidictionaries is more general, since it includes the non-rational case. Moreover, even in the finite case, the construction of automata and transducers from a fixed text, given in the next section, makes an explicit use of the notion of minimal forbidden words and of antidictionaries.
B. A Synchronization Property

In the sequel we prove a synchronization property of automata built from finite antidictionaries, as described above. This property also "characterizes" in some sense finite antidictionaries. This property is a classical one and it is of fundamental importance in practical applications.

Definition 1: Given a deterministic finite automaton $\mathcal{A}$, we say that a word $w = a_1 \cdots a_k$ is synchronizing for $\mathcal{A}$ if, whenever $w$ represents the label of two paths $(q_1, a_1, q_2) \cdots (q_k, a_k, q_{k+1})$ and $(q'_1, a_1, q'_2) \cdots (q'_k, a_k, q'_{k+1})$ of length $k$, then the two ending states $q_{k+1}$ and $q'_{k+1}$ are equal.
If $L(\mathcal{A})$ is factorial, any word that does not belong to $L(\mathcal{A})$ is synchronizing. Clearly in this case synchronizing words in $L(\mathcal{A})$ are much more interesting. Remark also that, since $\mathcal{A}$ is deterministic, if $w$ is synchronizing for $\mathcal{A}$, then any word $w' = wv$ that has $w$ as a prefix is also synchronizing for $\mathcal{A}$.

Definition 2: A deterministic finite automaton $\mathcal{A}$ is local if there exists an integer $k$ such that any word of length $k$ is synchronizing. Automaton $\mathcal{A}$ is also called $k$-local.

Remark that if $\mathcal{A}$ is $k$-local then it is $m$-local for any $m \geq k$.
Given a finite anti-factorial language AD, let $\mathcal{A}(AD)$ be the automaton associated with AD that recognizes the language $L(AD)$. Let us eliminate the sink states and the edges going to them. Since there is no possibility of misunderstanding, we denote the resulting automaton by $\mathcal{A}(AD)$ again. Notice that it has no sink state, that all states are terminal, and that $L(\mathcal{A}(AD))$ is factorial.
Theorem 3: Let AD be a finite anti-factorial antidictionary and let $k$ be the length of the longest word in AD. Then the automaton $\mathcal{A}(AD)$ associated with AD is $(k-1)$-local.
Proof: Let $n = k$ be the length of the longest word in AD and let $u = a_1 \cdots a_{n-1}$ be a word of length $n-1$. We have to prove that $u$ is synchronizing. Suppose that there exist two paths $(q_1, a_1, q_2) \cdots (q_{n-1}, a_{n-1}, q_n)$ and $(q'_1, a_1, q'_2) \cdots (q'_{n-1}, a_{n-1}, q'_n)$ of length $n-1$ labeled by $u$. We have to prove that the two ending states $q_n$ and $q'_n$ are equal. Recall that states of $\mathcal{A}$ are words and, more precisely, they are the proper prefixes of words in AD. A simple induction on $i$, $1 \leq i \leq n$, shows that $q_i$ (respectively $q'_i$) "is" the longest suffix of the word $q_1 a_1 \cdots a_{i-1}$ (respectively $q'_1 a_1 \cdots a_{i-1}$) that is also a "state", i.e., a proper prefix of a word in AD. Hence $q_n$ (respectively $q'_n$) is the longest suffix of the word $q_1 u$ (respectively $q'_1 u$) that is also a proper prefix of a word in AD. Since all proper prefixes of words in AD have length at most $n-1$, both $q_n$ and $q'_n$ have length at most $n-1$. Since $u$ has length $n-1$, both are the longest suffix of $u$ that is also a proper prefix of a word in AD, i.e., they are equal.
In other terms, the theorem says that only the last $k-1$ bits matter for determining whether AD is avoided or not. The theorem admits a "converse" that shows that locality characterizes in some sense finite antidictionaries (cf. Propositions 2.8 and 2.14 of [18]).

Theorem 4: If automaton $\mathcal{A}$ is local and $L(\mathcal{A})$ is a factorial language then there exists a finite anti-factorial language AD such that $L(\mathcal{A}) = L(AD)$.
Let AD be an anti-factorial antidictionary and let $k$ be the length of the longest word in AD. Let also $w = w_1 u v w_2 \in L(AD)$ with $|u| = k-1$ and let $\gamma(w) = y_1 y_2 y_3$ be the word produced by our encoder of Section II with input AD and $w$. The word $y_1$ is the word produced by our encoder after processing $w_1 u$, the word $y_2$ is the word produced by our encoder after processing $v$ and the word $y_3$ is the word produced by our encoder after processing $w_2$.

The proof of the next theorem is an easy consequence of the previous definitions and of the statement of Theorem 3.

Theorem 5: The word $y_2$ depends only on the word $uv$ and it does not depend on its contexts $w_1$ and $w_2$.
The property stated in the theorem has an interesting consequence for the design of pattern matching algorithms on words compressed by the algorithm Encoder. It implies that, to search the compressed word for a pattern, it is not necessary to decode the whole word. Just a limited left context of an occurrence of the pattern needs to be processed. The same property allows the design of highly parallelizable compression algorithms. The idea is that the compression can be performed independently and in parallel on any block of data. If the text to be compressed is parsed into blocks of data in such a way that each block overlaps the next block by a length not smaller than the length of the longest word in the antidictionary, then it is possible to run the whole compression process in parallel.
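The block-wise idea can be illustrated with the naive encoder of Section II. In the sketch below each block is encoded on its own, given only the $k-1$ letters that precede it, which suffices by Theorem 3; the blocks are processed in a plain loop, but no block depends on the output of another. The function names, the `context` parameter and the choice of an overlap of exactly $k-1$ letters are assumptions of this example.

```python
def encode(antidictionary, w, context=""):
    """Encode w, assuming it is preceded in the text by `context`."""
    out = []
    for i, a in enumerate(w):
        v = context + w[:i]
        if not any(v.endswith(u[:-1]) for u in antidictionary):
            out.append(a)
    return "".join(out)


def encode_by_blocks(antidictionary, w, block_size):
    """Encode independent blocks, each seeing only its k-1 preceding letters."""
    k = max(len(u) for u in antidictionary)
    pieces = []
    for start in range(0, len(w), block_size):
        context = w[max(0, start - (k - 1)):start]
        block = w[start:start + block_size]
        pieces.append(encode(antidictionary, block, context))
    return "".join(pieces)


AD = {"000", "10101", "11"}
w = "0100101001"
print(encode(AD, w))               # 0101
print(encode_by_blocks(AD, w, 3))  # 0101
```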
IV. Efficiency

In this section we evaluate the efficiency of our compression algorithm relative to a source corresponding to the finite antidictionary AD.
Indeed, the antidictionary AD naturally defines a source $S(AD)$ in the following way. Let $\mathcal{A}(AD)$ be the automaton constructed in the previous section with no sink states and recognizing the factorial language $L(AD)$ (all states are terminal). To avoid trivial cases, we suppose that in this automaton all the states have at least one outgoing edge. Recall that since our algorithms work on a binary alphabet, all states have at most two outgoing edges.

For any state of $\mathcal{A}(AD)$ with only one outgoing edge we give this edge probability 1. For any state of $\mathcal{A}(AD)$ with two outgoing edges we give these edges probability 1/2. This defines a deterministic (or unifilar, cf. [19]) Markov source, denoted $S(AD)$. Notice also that, by Theorem 3, $S(AD)$ is a Markov source of finite order or finite memory (cf. [19]). We call a binary Markov source with this probability distribution a balanced source.
Remark that our compression algorithm is defined exactly for all the words "emitted" by $S(AD)$.

In what follows we suppose that the graph of the source $S$, i.e., the graph of automaton $\mathcal{A}(AD)$, is strongly connected. The results that we prove can be extended to the general case by using standard techniques of Markov Chains (cf. [19], [20], [21] and [22]). Recall (cf. Theorem 6.4.2 of [19]) that the entropy $H(S)$ of a deterministic Markov source $S$ is $H(S) = -\sum_{i,j=1}^{n} \mu_i \gamma_{i,j} \log_2(\gamma_{i,j})$, where $(\gamma_{i,j})$ is the stochastic matrix of $S$ and $(\mu_1, \ldots, \mu_n)$ is the stationary distribution of $S$.
We now state three lemmas.

Lemma 1: The entropy of a balanced source $S$ is given by $H(S) = \sum_{i \in D} \mu_i$ where $D$ is the set of all states that have two outgoing edges.
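Proof: By definition
$$H(S) = -\sum_{i,j=1}^{n} \mu_i \gamma_{i,j} \log_2(\gamma_{i,j}).$$
If $i$ is a state with only one outgoing edge, by definition this edge must have probability 1. Then $\sum_{j} \mu_i \gamma_{i,j} \log_2(\gamma_{i,j})$ reduces to $\mu_i \log_2(1)$, which is equal to 0. Hence
$$H(S) = -\sum_{i \in D} \sum_{j=1}^{n} \mu_i \gamma_{i,j} \log_2(\gamma_{i,j}).$$
Since from each $i \in D$ there are exactly two outgoing edges, each having probability 1/2, one has
$$H(S) = -\sum_{i \in D} 2 \mu_i (1/2) \log_2(1/2) = \sum_{i \in D} \mu_i$$
as stated.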
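As a small worked instance of Lemma 1, consider the antidictionary $AD = \{11\}$. This example is an assumption chosen for illustration and does not appear in the paper. After removing the sink, the automaton has the two states $\varepsilon$ and $1$, and only $\varepsilon$ has two outgoing edges.

```latex
% States of A(AD) for AD = {11}, sink removed: \varepsilon and 1.
% Edges: \varepsilon --0--> \varepsilon, \varepsilon --1--> 1, 1 --0--> \varepsilon.
% Balanced source: probability 1/2 on both edges leaving \varepsilon, 1 on the edge leaving 1.
\begin{align*}
\mu_{\varepsilon} &= \tfrac12 \mu_{\varepsilon} + \mu_{1}, \qquad
\mu_{1} = \tfrac12 \mu_{\varepsilon}, \qquad
\mu_{\varepsilon} + \mu_{1} = 1
\;\Longrightarrow\; \mu_{\varepsilon} = \tfrac23,\ \mu_{1} = \tfrac13, \\
H(S) &= -\mu_{\varepsilon}\Bigl(2 \cdot \tfrac12 \log_2 \tfrac12\Bigr)
        - \mu_{1} \cdot 1 \cdot \log_2 1
      = \mu_{\varepsilon} = \tfrac23 = \sum_{i \in D} \mu_i,
      \qquad D = \{\varepsilon\}.
\end{align*}
```

In this instance the compressor keeps, on average, two letters out of three, namely those read in state $\varepsilon$.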
Lemma 2: Let $w = a_1 \cdots a_m$ be a word in $L(AD)$ and let $q_1 \cdots q_{m+1}$ be the sequence of states in the path determined by $w$ in $\mathcal{A}(AD)$ starting from the initial state. The length of $\gamma(w)$ is equal to the number of states $q_i$, $i = 1, \ldots, m$, that belong to $D$, where $D$ is the set of all states that have two outgoing edges.

Proof: The statement is straightforward from the description of the compression algorithm and the implementation of the antidictionary with automaton $\mathcal{A}(AD)$.
Through a well-known result on "large deviations" (cf. Problem IX.6.7 of [23]), we get a kind of optimality of the compression scheme.

Let $q = q_1, \ldots, q_m$ be the sequence of $m$ states of a path of $\mathcal{A}(AD)$ and let $L_{m,i}(q)$ be the frequency of state $q_i$ in this sequence, i.e., $L_{m,i}(q) = m_i/m$, where $m_i$ is the number of occurrences of $q_i$ in the sequence $q$. Let also
$$X_m(\epsilon) = \{ q \mid q \text{ has } m \text{ states and } \max_i |L_{m,i}(q) - \mu_i| \geq \epsilon \},$$
where $q$ represents a sequence of $m$ states of a path in $\mathcal{A}(AD)$. In other words, $X_m(\epsilon)$ is the set of all sequences of states representing a path in $\mathcal{A}(AD)$ that "deviate" by at least $\epsilon$ in at least one state $q_i$ from the theoretical frequency $\mu_i$.
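Lemma 3: For any $\epsilon > 0$, the set $X_m(\epsilon)$ satisfies the equality $\lim_{m \to \infty} \frac{1}{m} \log_2 \Pr(X_m(\epsilon)) = -\kappa(\epsilon)$, where $\kappa(\epsilon)$ is a positive constant depending on $\epsilon$.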
We now state the main theorem of this section. Its proof uses the three previous lemmas. It states that for any $\epsilon$ the probability that the compression rate $\tau(v) = |\gamma(v)|/|v|$ of a string of length $n$ is greater than $H(S(AD)) + \epsilon$ goes exponentially to zero. Hence, as a corollary, almost surely the compression rate of an infinite sequence emitted by $S(AD)$ reaches the entropy $H(S(AD))$, which is the best possible result.
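Theorem 6: Let $K_m(\epsilon)$ be the set of words $w$ of length $m$ such that the compression rate $\tau(w) = |\gamma(w)|/|w|$ is greater than $H(S(AD)) + \epsilon$. For any $\epsilon > 0$ there exist a real number $r(\epsilon)$, $0 < r(\epsilon) < 1$, and an integer $m(\epsilon)$ such that for any $m > m(\epsilon)$, $\Pr(K_m(\epsilon)) \leq r(\epsilon)^m$.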
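Proof: Let $w$ be a word of length $m$ in the language $L(AD)$ and let $q_1, \ldots, q_{m+1}$ be the sequence of states in the path determined by $w$ in $\mathcal{A}(AD)$ starting from the initial state. Let $q = (q_1, \ldots, q_m)$ be the sequence of the first $m$ states. We know, by Lemma 2, that the length of $\gamma(w)$ is equal to the number of states $q_i$, $i = 1, \ldots, m$, in $q$ that belong to $D$, where $D$ is the set of all states having two outgoing edges.

If $w$ belongs to $K_m(\epsilon)$, i.e., if the compression rate $\tau(w) = |\gamma(w)|/|w|$ is greater than $H(S(AD)) + \epsilon$, then there must exist an index $j$ such that $L_{m,j}(q) > \mu_j + \epsilon/|D|$. In fact, if for all $j$, $L_{m,j}(q) \leq \mu_j + \epsilon/|D|$ then, by definitions and by Lemma 1,
$$\tau(w) = \sum_{j \in D} L_{m,j}(q) \leq \sum_{j \in D} \mu_j + \epsilon = H(S(AD)) + \epsilon,$$
a contradiction. Therefore the sequence of states $q$ belongs to $X_m(\epsilon/d)$, where $d = |D|$. Hence $\Pr(K_m(\epsilon)) \leq \Pr(X_m(\epsilon/d))$.

By Lemma 3, there exists an integer $m(\epsilon)$ such that for any $m > m(\epsilon)$ one has
$$\frac{1}{m} \log_2 \Pr(X_m(\epsilon/d)) \leq -\frac{1}{2} \kappa(\epsilon/d).$$
Then $\Pr(K_m(\epsilon)) \leq 2^{-(1/2)\kappa(\epsilon/d) m}$. If we set $r(\epsilon) = 2^{-(1/2)\kappa(\epsilon/d)}$, the statement of the theorem follows.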
Theorem 7: The compression rate $\tau(x)$ of an infinite sequence $x$ emitted by the source $S(AD)$ reaches the entropy $H(S(AD))$ almost surely.
V. How to build Antidictionaries

In practical applications the antidictionary might not be given a priori but it must be derived either from the text to be compressed or from a family of texts belonging to the assumed source of the text to be compressed.

There exist several criteria to build efficient antidictionaries, depending on different aspects or parameters that one wishes to optimize in the compression process. Each criterion gives rise to different algorithms and implementations.

All our methods to build antidictionaries are based on data structures to store factors of words, such as suffix tries, suffix trees, DAWGs, and suffix and factor automata (see for instance Theorem 15 in [10]). In these structures, it is possible to consider a notion of suffix link. This link is essential to design efficient algorithms to build representations of sets of minimal forbidden words in terms of tries or trees. This approach leads to construction algorithms that run in linear time in the length of the text to be compressed.
A rough solution to control the size of antidictionaries is obviously to bound the length of words in the antidictionary. A better solution in the static compression scheme is to prune the trie of the antidictionary with a criterion based on the tradeoff between the space of the trie to be sent and the gain in compression; this will be developed in the next section. However, the first solution is enough to get compression rates that asymptotically reach the entropy for balanced sources, even if this is not true for general sources. Both solutions can be designed to run in linear time.
We present in this section a very simple construction to build finite antidictionaries of a finite word $w$. It is the base on which several variations are developed. The idea is to build the automaton accepting the words having the same factors of length $k$ (or less) as $w$ and, from this, to build the set of minimal forbidden words of length at most $k$ of the word $w$. It can be used as a first step to build antidictionaries for fixed sources. In this case our scheme can be considered as a step for a compressor generator (compressor compiler). In the design of a compressor generator, or compressor compiler, statistical considerations and the possibility of making "errors" in predicting the next letter play an important role, as discussed in Section VII.
Algorithm Build-AD described hereafter builds the set of minimal forbidden words of length at most $k$ ($k > 0$) of the word $w$. It takes as input an automaton accepting the words that have the same factors of length $k$ (or less) as $w$, i.e., accepting the language
$$L_k = \{ x \in \{0,1\}^* \mid (u \in F(x) \text{ and } |u| \leq k) \Rightarrow u \in F(w) \}.$$
The preprocessing of the automaton is done by the algorithm Build-Fact whose central operation is described by the function Next.
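Build-Fact (word $w \in \{0,1\}^*$, integer $k > 0$)
1. $i \leftarrow$ new state; $Q \leftarrow \{i\}$;
2. level($i$) $\leftarrow 0$;
3. $p \leftarrow i$;
4. while not end of string $w$
5.     $a \leftarrow$ next letter of $w$;
6.     $p \leftarrow$ Next($p, a, k$);
7. return trie $(Q, i, Q, \delta)$, function $f$;

Next (state $p$, letter $a$, integer $k > 0$)
1. if $\delta(p, a)$ defined
2.     return $\delta(p, a)$;
3. else if level($p$) $= k$
4.     return Next($f(p), a, k$);
5. else
6.     $q \leftarrow$ new state; $Q \leftarrow Q \cup \{q\}$;
7.     level($q$) $\leftarrow$ level($p$) $+ 1$;
8.     $\delta(p, a) \leftarrow q$;
9.     if ($p = i$) $f(q) \leftarrow i$;
10.    else $f(q) \leftarrow$ Next($f(p), a, k$);
11.    return $q$;

Build-AD (trie $(Q, i, Q, \delta)$, function $f$, integer $k > 0$)
1. $T \leftarrow \emptyset$; $\delta' \leftarrow \delta$;
2. for each $p \in Q$, $0 <$ level($p$) $< k$, in breadth-first order
3.     for $a \leftarrow 0$ then $1$
4.         if $\delta(p, a)$ is undefined and $\delta(f(p), a)$ is defined
5.             $q \leftarrow$ new state; $T \leftarrow T \cup \{q\}$;
6.             $\delta'(p, a) \leftarrow q$;
7. $Q \leftarrow Q \setminus \{$states of $Q$ from which no $\delta'$-path leads to $T\}$
8. return trie $(Q \cup T, i, T, \delta')$;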
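The three procedures above can be sketched in Python as follows. This is an illustrative transcription under stated assumptions: states are integers with 0 as the initial state, and `build_ad` returns the minimal forbidden words as a set of strings instead of the pruned trie, so that line 7 of Build-AD is implicit. The function names are not from the paper.

```python
def build_fact(w, k):
    """Trie of the factors of w of length at most k, with failure function."""
    delta, level, fail = [{}], [0], [None]

    def next_state(p, a):
        if a in delta[p]:
            return delta[p][a]
        if level[p] == k:
            return next_state(fail[p], a)
        q = len(delta)                      # create a new state q
        delta.append({})
        level.append(level[p] + 1)
        fail.append(None)
        delta[p][a] = q
        fail[q] = 0 if p == 0 else next_state(fail[p], a)
        return q

    p = 0
    for a in w:
        p = next_state(p, a)
    return delta, level, fail


def build_ad(w, k):
    """Minimal forbidden words of w whose length lies between 2 and k."""
    delta, level, fail = build_fact(w, k)
    found = set()
    stack = [(0, "")]                       # (state, word spelled by the state)
    while stack:
        p, word = stack.pop()
        for a in "01":
            if a in delta[p]:
                stack.append((delta[p][a], word + a))
            elif 0 < level[p] < k and a in delta[fail[p]]:
                found.add(word + a)         # test of line 4 in Build-AD
    return found


print(sorted(build_ad("01001010", 5)))
# ['000', '00100', '10100', '10101', '11']
```

As in the pseudocode, the root is skipped in Build-AD, so a forbidden word of length 1 (a letter absent from $w$) is not reported by this sketch.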
The automaton is represented by both a trie and its failure function $f$. If $p$ is a node of the trie associated with the word $av$, $v \in \{0,1\}^*$ and $a \in \{0,1\}$, $f(p)$ is the node associated with $v$. This is a standard technique used in the construction of suffix trees (see [24] for example). It is used here in algorithm Build-AD (line 4) to test the minimality of forbidden words according to the equality $MF(L) = AL \cap LA \cap (A^* \setminus L)$.
The above construction gives rise to the following static compression scheme in which we need to read the text twice, the first time to construct the antidictionary AD and the second time to encode the text.
Informally, the encoder sends a message $z$ of the form $(x, y, \sigma(n))$ to the decoder, where $x$ is a description of the antidictionary AD, $y$ is the text coded according to AD, as described in Section II, and $\sigma(n)$ is the usual binary code of the length $n$ of the text. The decoder first reconstructs the antidictionary from $x$ and then decodes $y$ according to the algorithm in Section II. In this simple compression scheme the antidictionary AD is composed of all minimal forbidden words of length $k$ of $w$, but other intelligent choices of subsets of AD are possible. We can describe the antidictionary AD, for instance, by coding with standard techniques the trie associated with AD to obtain the word $x$. A basic question is how fast the number $k$ must grow as a function of the length $n$ of the word $w$. In this simple compression scheme we choose $k$ to be any function such that $|x| = o(n)$, but other choices are possible. Since the compression rate is the size $|z|$ of $z$ divided by the length $n$ of the text, we have that $|z|/n = |y|/n + o(n)$. Assuming that for $n$ and $k$ large enough the source $S(\mathrm{AD})$, as in Section IV, approximates the source of the text, then, by the results of Section IV, the compression rate is "optimal".
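As a rough illustration of the message $z$, the Python sketch below produces the coded text $y$ from a text and an antidictionary, and recovers the text from $y$ and the length $n$. It assumes that the encoder of Section II erases a letter exactly when some suffix of the text read so far, followed by the opposite letter, belongs to AD. The function names are hypothetical, and the quadratic suffix scan only stands in for the automaton-based implementation.

```python
def flip(a):
    """The other letter of the binary alphabet."""
    return "1" if a == "0" else "0"


def forced_letter(ad, v):
    """Letter that must follow v because the other one would create a
    forbidden word, or None if the antidictionary predicts nothing."""
    for j in range(len(v) + 1):
        for b in "01":
            if v[j:] + b in ad:
                return flip(b)
    return None


def encode(ad, w):
    """Keep only the letters of w that the antidictionary cannot predict.
    Assumes no word of ad occurs in w."""
    return "".join(a for i, a in enumerate(w)
                   if forced_letter(ad, w[:i]) != a)


def decode(ad, y, n):
    """Rebuild the text of length n from the coded text y."""
    v, rest = "", iter(y)
    while len(v) < n:
        a = forced_letter(ad, v)
        v += a if a is not None else next(rest)
    return v
```

With the antidictionary {00, 0110, 1011, 1111} and the text 01110101, `encode` keeps only the two letters 01, and `decode` recovers the text from them once it is told that n = 8.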
For instance, suppose that $w$ is emitted by a balanced Markov source $S$ with memory $h$, and let $L$ be the formal language composed of all finite words that can be emitted by $S$. By Theorem 4 there exists a finite antifactorial language $N$ such that $L = L(N)$. Moreover, since $S$ has memory $h$, the words in $N$ have length smaller than or equal to $h + 1$. If $|w|$ is such that $k > h$ then AD contains $N$ and, therefore, $H(S(\mathrm{AD})) \le H(S(N)) = H(S)$. By Corollary 1 we can deduce that this simple compression scheme turns out to be universal for the family of balanced Markov sources with finite memory (cf. [25]).
Let $w = a_1 a_2 \cdots$ be a binary infinite word that is periodic (i.e., there exists an integer $P > 0$ such that for any index $i$ the letter $a_i$ is equal to the letter $a_{i+P}$), and let $w_n$ be the prefix of $w$ of length $n$. We want to compress the word $w_n$ following our simple scheme informally described above.
It is not difficult to prove that the compression rate for $w_n$ is $|z|/n = O(\sigma(n)) = O(\log_2(n))$, which means that the scheme can achieve an exponential compression.
VI. Pruning Antidictionaries
In this section, as well as in the previous section, we consider a static compression scheme in which we need to read the text twice: the first time to construct the antidictionary AD and the second time to encode the text.
In this section, however, we suppose that we have enough resources to build, in linear time, a suffix or a factor automaton (or their compacted version, cf. [26]) of the finite text string to be compressed. From these structures we can obtain in linear time a trie representation of all minimal forbidden words of the text (cf. [10]). It can be shown that the total length of all minimal forbidden words can be quadratic in the size of the original text. However the trie
we want to get good compression ratios not all the minimal forbidden words should be considered.
The first idea developed in this section is to prune the trie of the antidictionary with some criteria based on the tradeoff between the space of the trie to be sent and the gain in compression. Clearly, the space of the trie to be sent strictly depends on how we encode the trie.
Using a classical approach, in this section we recall that a binary tree that has $k$ nodes can be encoded using two bits for each node, which gives $2k$ bits for the whole tree. Indeed, depending on whether a subtree $S$ of a binary tree $T$ has both subtrees, only the right subtree, only the left subtree, or no subtree, the root of $S$ can be encoded respectively by the strings 11, 10, 01, 00. This is done recursively in a prefix traversal of the whole tree. All the results presented in this section can be easily extended to the case when a node of the trie can be encoded using $\alpha$ bits for each node, where $\alpha$ is a positive real number.
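The two-bits-per-node encoding can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: a node is assumed to be a pair `(left, right)` with `None` for a missing child, and the names `encode_tree` and `decode_tree` are hypothetical. Following the convention stated above, the first bit records the presence of the right subtree and the second bit the presence of the left one.

```python
LEAF = (None, None)


def encode_tree(node):
    """Prefix (preorder) traversal emitting two bits per node:
    11 both subtrees, 10 right only, 01 left only, 00 none."""
    left, right = node
    bits = ("1" if right is not None else "0") + ("1" if left is not None else "0")
    if left is not None:
        bits += encode_tree(left)
    if right is not None:
        bits += encode_tree(right)
    return bits


def decode_tree(bits):
    """Inverse of encode_tree."""
    def parse(pos):
        has_right, has_left = bits[pos] == "1", bits[pos + 1] == "1"
        pos += 2
        left = right = None
        if has_left:
            left, pos = parse(pos)
        if has_right:
            right, pos = parse(pos)
        return (left, right), pos

    tree, end = parse(0)
    assert end == len(bits), "trailing bits after the tree"
    return tree
```

A tree with four nodes, such as `(LEAF, (None, LEAF))`, is encoded on eight bits and can be rebuilt exactly from them.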
The second idea presented afterwards is to compress the words retained in the antidictionary using the antidictionary itself.
The two operations, pruning and self-compressing, can be applied iteratively on antidictionaries. They lead to very compact representations of antidictionaries, producing higher compression ratios.
A. Pruned Antidictionary
A linear-time algorithm for obtaining the trie $T$ of all minimal forbidden words of a fixed text $t$ can be found in [10]. Hence we suppose here that we have this trie $T$.
In order to make a tradeoff between the space of the trie to be sent and the gain in compression, we have to know how much each forbidden word contributes to the compression. Minimal forbidden words of text $t$ correspond in a bijective way to the leaves of the trie $T$, i.e., with any leaf $q$ of the tree we can associate the corresponding minimal forbidden word $w(q)$. Indeed if we identify, as in Section III, the nodes of the trie $T$ with the prefixes of the minimal forbidden words, then the function $w$ is the identity.
We define a cost function $c$ that associates with any leaf $q$ of $T$ the number of bits $c(q)$ that the word $w(q)$ contributes to erase during the compression of the text $t$. This number $c(q)$ is also the number of times that the longest proper prefix of $w(q)$ appears in text $t$ as a factor but not as a suffix. In other words, the number $c(q)$ is the number of times that a state $p$ is traversed while reading the text $t$ in the automaton $A(\mathrm{AD})$, where $p$ leads to state $q$ by some letter $a$ (cf. Section III and Theorem 1). Indeed the last letter of the text is not considered in this process because there is nothing to erase after it. By Theorem 1, the function $c$ can be computed in linear time.
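The characterization of the cost by occurrences of the longest proper prefix can be checked directly on small texts. The Python sketch below is a brute-force illustration under that characterization only; the name `cost` is hypothetical and the scan is quadratic, unlike the linear-time computation mentioned above.

```python
def cost(t, v):
    """Number of occurrences in t of the longest proper prefix of the
    forbidden word v, not counting an occurrence that is a suffix of t."""
    u = v[:-1]
    # Start positions i with i + len(u) < len(t): the occurrence is
    # followed by at least one letter, so it is not a suffix of t.
    return sum(1 for i in range(len(t) - len(u)) if t[i:i + len(u)] == u)
```

For the text 01110101, the word 00 gets cost 3 (the letter 0 occurs three times and is never last), while 0110, 1011 and 1111 each get cost 1.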
We further define the gain (saving) of a subtree $S$ of the trie $T$ representing an antidictionary $T$ as $g(S) = \sum \{c(q) \mid q \text{ leaf of } S\} - 2m_S$, where $m_S$ is the number of nodes of $S$.
Indeed the number of bits that have to be sent after
length $n$ of the text $t$ (cf. the cascading lengths technique in [4] and references therein); $2m_T$ bits for a description of the antidictionary $T$; $|\gamma(t)|$ bits for the text compressed using $T$. The overall size is
$$2\lfloor \log n \rfloor + 2m_T + |\gamma(t)| = 2\lfloor \log n \rfloor + n - g(T)$$
by definition of $g(T)$.
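The last equality can be unpacked in one line. The derivation below is a sketch that writes $\gamma(t)$ for the compressed text and assumes, as the definition of the cost function suggests but the text does not state in this form, that each erased letter is counted exactly once among the leaf costs, so that $|\gamma(t)| = n - \sum_{q} c(q)$:

$$2m_T + |\gamma(t)| = 2m_T + n - \sum_{q \text{ leaf of } T} c(q) = n - \Big(\sum_{q \text{ leaf of } T} c(q) - 2m_T\Big) = n - g(T).$$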
Since $2\lfloor \log n \rfloor + n$ is fixed and since the gain $g(T)$ is the sum of the gains of its subtrees minus 2 bits (for encoding the root), pruning subtrees of $T$ that have a negative gain increases the gain of $T$ and, consequently, decreases the overall number of bits that have to be sent after compression.
Suppose however that $S_2$ is a subtree of $S_1$ which is, in turn, a subtree of the trie $T$. Suppose further that $S_2$ has a negative gain and the same holds for $S_1$, but that $S_1$ has a positive gain if $S_2$ is pruned from it. In this case, in order to obtain better compression ratios, the best thing to do is to prune $S_2$ and not the whole $S_1$. It is thus natural to consider the optimization problem related to an abstract non-negative function $c$ (defined on leaves of $T$) where one instance is a trie $T$ representing a prefix code $C$, and a solution is a trie $T'$ that represents a subset of $C$ and that maximizes the gain $g(T')$.
In what follows we show that a bottom-up approach gives a linear-time solution to this problem.
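With any subtree $S$ of $T$ we associate the function $g'$, called the pruned gain, that is defined by
$$g'(S) = \begin{cases} 0 & \text{if } S \text{ is empty}, \\ c(S) - 2 & \text{if } S \text{ is a leaf}, \\ g'(S_1) - 2 & \text{if } S \text{ has one child } S_1, \\ M & \text{otherwise}, \end{cases}$$
where $M = \max(g'(S_1), g'(S_2), g'(S_1) + g'(S_2)) - 2$, with $S_1$ and $S_2$ children of $S$.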
From the above definition it is not difficult to see that it is possible to compute function $g'$ in linear time with respect to the size of the trie $T$, in a bottom-up traversal of the trie.
We can now present the simple pruning algorithm.
Simple Pruning (trie T, function c)
1. compute g'(S) for each subtree S of T;
2. eliminate subtrees S of T for which g'(S) ≤ 0;
3. return modified trie T;
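The bottom-up computation of the pruned gain and the elimination step can be combined in a single recursive pass. The Python sketch below is one possible reading of the algorithm, under representation choices that are assumptions of this example: a leaf is an integer holding its cost, an internal node is a dict with one or two children keyed by letter, and the empty tree is `None`. The name `prune` is hypothetical.

```python
def prune(s):
    """Return (pruned gain of s, pruned copy of s or None if eliminated)."""
    if s is None:                      # empty tree
        return 0, None
    if isinstance(s, int):             # leaf: cost minus the 2 bits of its node
        g = s - 2
        return g, (s if g > 0 else None)
    results = {a: prune(child) for a, child in s.items()}
    gains = [g for g, _ in results.values()]
    if len(gains) == 1:
        g = gains[0] - 2
    else:
        g = max(gains[0], gains[1], gains[0] + gains[1]) - 2
    if g <= 0:                         # the whole subtree is eliminated
        return g, None
    kept = {a: sub for a, (_, sub) in results.items() if sub is not None}
    return g, kept
```

On the trie `{"0": {"0": 7, "1": 1}, "1": 3}` the leaf of cost 1 is removed and the pruned gain is 2, which equals the gain of the remaining trie: leaf costs 7 + 3 minus two bits for each of its 4 nodes.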
The following proposition is a consequence of the descriptions given above, and the next theorem shows that the output of the algorithm gives a solution to the optimization problem described above.
Proposition 1: Algorithm Simple Pruning can be performed in linear time.
Theorem 8: Let $T$ be a trie representing a prefix code $C$ and let $c$ be a non-negative function defined on leaves of $T$. The output $T'$ of algorithm Simple Pruning represents a subset of $C$ and $g'(T')$ is maximum. Moreover we have
0 0 0
Proof: First of all we claim that the trie $T'$ output by algorithm Simple Pruning represents a subset of $C$. Indeed, by the definition of $g'$ it follows that if a subtree $S$ of $T$ is not a leaf and if $g'(S) > 0$, then $S$ must have at least one child $S_1$ with positive pruned gain, i.e., $g'(S_1) > 0$. This fact implies that all leaves of $T'$ are leaves of $T$, proving the claim.
The rest of the proof is done by induction on the height of $T$. If $T$ is empty there is nothing to prove. If $T$ has height 0 then $T$ is a leaf and we already have $g(T) = g'(T)$. If $g(T) > 0$, $T$ itself is equal to $T'$, otherwise $T'$ is the empty tree. In both cases the statement of the theorem is satisfied.
Suppose now that $T$ has height $> 0$. Either it has just one child $S_1$ or it has two children $S_1$ and $S_2$.
Suppose that $T$ has two children $S_1$ and $S_2$; they are both tries and we can associate with them the restriction of the function gain to all subtrees. By applying algorithm Simple Pruning with input $S_i$, $i = 1, 2$, and function $c$ (restricted to leaves of corresponding subtrees), we obtain as output a modified trie $S'_i$. By induction we know that $g(S'_i) = g'(S'_i)$ and that this value maximizes the function gain. Therefore, if both $g(S'_1)$ and $g(S'_2)$ are positive, a trie $T'$ representing a subset of $C$ and maximizing the function gain is the trie that has the same root as $T$ and has children $S'_1$ and $S'_2$. Moreover $g(T') = g'(T')$ and algorithm Simple Pruning does not prune $S_1$ and $S_2$ from $T'$, so the theorem is proved in this case.
The other cases, ($g(S_1) \le 0$ and $g(S_2) > 0$), ($g(S_1) > 0$ and $g(S_2) \le 0$), ($g(S_1) \le 0$ and $g(S_2) \le 0$), and the case when $T$ has only one child $S_1$ are dealt with in an analogous manner.
Remark that the statement of Theorem 8 holds essentially because pruning a subtree $S$ of $T$ does not affect the value of function gain over all other subtrees of $T$. This fact is not true anymore with the self-compressing approach used in the next subsection.
B. Self-compressing the antidictionary
Let AD be an antifactorial antidictionary for text $t$. Since AD is antifactorial then, for any $v \in \mathrm{AD}$ the set $\mathrm{AD} \setminus \{v\}$ is an antidictionary for $v$. Therefore it is possible to compress $v$ using $\mathrm{AD} \setminus \{v\}$ or a subset of it.
One can think of a strategy that sends to the decoder, in a static approach, all words $v$ of AD compressed by algorithm Encoder with a subset of $\mathrm{AD} \setminus \{v\}$ and $v$ as input. This would achieve better compression. We call this approach self-compression; it is the subject of this subsection.
Let us first try to compress any word $v \in \mathrm{AD}$ by using the whole $\mathrm{AD} \setminus \{v\}$ and let us denote by $\gamma_1(v)$ the compressed version of $v$ by using $\mathrm{AD} \setminus \{v\}$. Notice that the words of AD that are used in compressing $v$ have length $\le |v|$. Further, if $u \in \mathrm{AD}$ with $|u| = |v|$ is used to erase the last letter of $v$, then $u$ must coincide with $v$ except for the last letter, that is, $u = xa$, $v = xb$ and $a \ne b$. In addition it is easy to see that $\gamma_1(u) = \gamma_1(v)$. This word is also equal to $\gamma_1(x)$ that has been compressed by using the antidictionary of all
As a special case of the next proposition, a set $\{u, v\}$ having these properties can occur at most once in any antidictionary AD of a text $t$.
A pair of words $(v, v_1)$ is called a stopping pair if $v = ua$, $v_1 = u_1 b \in \mathrm{AD}$, with $a, b \in \{0,1\}$, $a \ne b$, and $u$ is a suffix of $u_1$.
Proposition 2: Let AD be an antifactorial antidictionary of a text $t$. If there exists a stopping pair $(v, v_1)$ with $v_1 = u_1 b$, $b \in \{0,1\}$, then $u_1$ is a suffix of $t$ and does not appear elsewhere in $t$. Moreover there exists at most one pair of words having these properties.
Proof: Since $u_1 b \in \mathrm{AD}$, $u_1$ is a factor of $t$. Suppose that $u_1 c$ appears as a factor of $t$, with $c \in \{0,1\}$. Since $u$ is a suffix of $u_1$, letter $c$ is not letter $a$ (because $ua$ is forbidden) and is not letter $b$ (because $u_1 b$ is forbidden), a contradiction. Hence $u_1$ is a suffix of $t$ and does not appear elsewhere in $t$.
Since $u_1$ is a suffix of $t$, then also $u$ is a suffix of $t$. Suppose that there exists another pair $(v' = u'c, v'_1 = u'_1 d) \ne (v, v_1)$ of words in AD with $c, d \in \{0,1\}$, $c \ne d$, and $u'$ is a suffix of $u'_1$. Then $u'_1$ and $u'$ are also suffixes of $t$ and it is not difficult to prove by cases that one of the four words among $v, v_1, v', v'_1$ is a factor of another, contradicting the antifactoriality of AD.
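The definition of a stopping pair translates into a direct test on pairs of words. The following Python sketch is a quadratic illustration over a small antidictionary given as a set of non-empty binary strings; the name `stopping_pairs` is hypothetical.

```python
def stopping_pairs(ad):
    """All pairs (v, v1) of words of ad with v = u + a, v1 = u1 + b,
    a != b, and u a suffix of u1."""
    pairs = []
    for v in sorted(ad):
        for v1 in sorted(ad):
            u, a = v[:-1], v[-1]
            u1, b = v1[:-1], v1[-1]
            if a != b and u1.endswith(u):
                pairs.append((v, v1))
    return pairs
```

For instance, the words 00, 101 and 111 are minimal forbidden words of the text 0110, and the only stopping pair among them is (00, 101). In agreement with Proposition 2, the word 10 is then a suffix of 0110 and occurs nowhere else in it.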
Let us suppose now that $v_1, \ldots, v_k$ is a sequence of all words in AD such that for any $i$, $1 \le i \le k - 1$, $|v_i| \le |v_{i+1}|$. If one knows that there exists no $v_j$ such that $|v_j| = |v_i|$ and $v_j$ has been used to erase the last letter of $v_i$, then the set $\mathrm{AD}_1 = \{v_1, \ldots, v_{i-1}\}$ is the antidictionary used for compressing $v_i$ to get $\gamma_1(v_i)$, and $v_i$ can be recovered from both $\gamma_1(v_i)$ and $|v_i|$ using algorithm Decoder. If there exists $v_j$ such that $|v_j| = |v_i|$ and $v_j$ has been used to erase the last letter of $v_i$ then the set $\mathrm{AD}_1 = \{v_1, \ldots, v_{i-1}\}$ is the antidictionary used for obtaining the compressed version $\gamma_1(x) = \gamma_1(v_i)$ of the longest common prefix $x$ of $v_i$ and $v_j$, with $|x| = |v_i| - 1$. Also in this case $x$, and therefore $v_i$ and $v_j$, can be recovered from both $\gamma_1(x) = \gamma_1(v_i)$ and $|x| = |v_i| - 1$ using algorithm Decoder.
By the above discussion, it follows that if one knows the sequence $(\gamma_1(v_1), |v_1|), (\gamma_1(v_2), |v_2|), \ldots, (\gamma_1(v_k), |v_k|)$, together with the couple $(i, j)$ such that $v_i$ and $v_j$ have been used to mutually erase their last letter ($i = j = 0$ if there is no such pair), then the decoder can reconstruct, in this order, words $v_1, v_2, \ldots, v_k$. That is, the decoder can reconstruct the whole antidictionary AD.
Unfortunately, while AD, being antifactorial, is also a prefix code and can be represented by a trie, this is not true anymore for the set $X_1 = \{\gamma_1(v) \mid v \in \mathrm{AD}\}$. For example, the reader can easily verify that if AD = {11, 000, 10101, 00100100, 1010010100101} then $X_1$ = {11, 000, 111, 0000, 1111}. Also, if AD = $\{10, 110, \ldots, 1^n 0\}$ then, for any $n \ge 0$, $X_1 = \{10\}$. Consequently the space saved by self-compressing the antidictionary could be lost in encoding the set $X_1$.
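The first example can be checked mechanically. The Python sketch below recomputes the set of compressed words under the assumption that algorithm Encoder erases a letter exactly when some suffix of the prefix read so far, followed by the opposite letter, is in the antidictionary; `encode` and `gamma1` are hypothetical names and the scan is brute force.

```python
def encode(ad, w):
    """Drop every letter of w that is predictable from the antidictionary ad."""
    out = []
    for i, a in enumerate(w):
        other = "1" if a == "0" else "0"
        v = w[:i]
        # a is erased when some suffix of v followed by the other letter is forbidden
        if not any(v[j:] + other in ad for j in range(len(v) + 1)):
            out.append(a)
    return "".join(out)


def gamma1(ad, v):
    """Compress the word v of ad with all the other words of ad."""
    return encode(ad - {v}, v)


AD = {"11", "000", "10101", "00100100", "1010010100101"}
X1 = {gamma1(AD, v) for v in AD}
```

The resulting set contains both 11 and 111, so it is not a prefix code, which is the difficulty pointed out above.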
We propose a different approach that makes use of the same idea and leads to simple algorithms for self-compressing and recovering the antidictionary AD. These
senting the antifactorial antidictionary AD and, moreover, the compression ratios obtained with the pruning technique can only be improved by the next self-compression technique.
We present a formal description of the technique. Given a word $v \in \mathrm{AD}$, we compress it using an antidictionary $\mathrm{AD}'$ that dynamically changes at any step of the while loop on line 2 of algorithm Encoder. While dealing with a proper prefix $u$ of $v$ and the letter $a$ following it, the antidictionary $\mathrm{AD}'$ is composed of all words belonging to AD with length not greater than $|u|$. Letter $a$ is erased if and only if there exists a word $u'b \in \mathrm{AD}$, $b \ne a$, with $u'$ a proper suffix of $u$. Let us call $\gamma_2(v)$ the compressed version of $v$ obtained in this way and let $X_2 = \{\gamma_2(v) \mid v \in \mathrm{AD}\}$.
This kind of self-compression can be performed in linear time by the next algorithm Self-compress. It has as input both the trie $T$ that represents AD and the function $\delta$ of automaton $A(\mathrm{AD})$ (cf. algorithm L-automaton). Notice that $\delta$ is defined on nodes of $T$. Its output $T'$ is the trie accepting the set $X_2 = \{\gamma_2(v) \mid v \in \mathrm{AD}\}$. The algorithm performs a breadth-first traversal of $T$ implemented by the queue $Q$. During the traversal, it creates a self-compressed version $T'$ of $T$ that represents the set $X_2$.
Self-compress (trie T, function δ)
1. i ← root of T;
2. create root i';
3. add (i, i') to empty queue Q;
4. while Q ≠ ∅
5.     extract (p, p') from Q;
6.     if q_0 and q_1 are children of p
7.         create q'_0 and q'_1 as children of p';
8.         add (q_0, q'_0) and (q_1, q'_1) to Q;
9.     else if q is a unique child of p and q = δ(p, a), a ∈ A
10.        if δ(p, ¬a) is a leaf
11.            add (q, p') to Q;
12.        else create q' as a-child of p';
13.            add (q, q') to Q;
14. return trie having root i';
The correctness of algorithm Self-compress relies on the following proposition and the discussion thereafter.
Proposition 3: If a node $p$ in the trie $T$ has two children $q_0$ and $q_1$ then its corresponding node $p'$ in the output trie $T'$ also has two children.
Proof: If $q_0$ and $q_1$ are both leaves, they represent two minimal forbidden words $ua$ and $ub$, $a \ne b$. There are no minimal forbidden words of the form $u'a$ or $u'b$ with $u'$ a proper suffix of $u$ because AD is antifactorial. Therefore neither letter $a$ nor letter $b$ can be erased by the technique.
If $q_0$ and $q_1$ are not leaves, they represent two words $ua$ and $ub$, $a \ne b$, that are factors of text $t$. There are no minimal forbidden words of the form $u'a$ or $u'b$ with $u'$ a proper suffix of $u$ because these words are also factors of $t$. Therefore neither letter $a$ nor letter $b$ can be erased by the technique.
Let us suppose now that only one node among $q_0$ and $q_1$ is a leaf. For instance, let us assume that $q_0$ is a leaf and $q_1$ is not a leaf. They represent respectively two words $ua$ and $ub$, $a \ne b$. Letter $a$ cannot be erased because in the antidictionary there is no word of the form $u'b$ with $u'$ a proper suffix of $u$, $ub$ being a factor of $t$. Letter $b$ cannot be erased because in the antidictionary there is no word of the form $u'a$ with $u'$ a proper suffix of $u$, since $ua$ is in the antidictionary and the antidictionary is antifactorial.
The previous proposition explains why the algorithm creates two nodes $q'_0$ and $q'_1$ at line 7.
We next consider lines 10–13, in which node $p$ of $T$ has only one child $q = \delta(p, a)$. The node $\delta(p, \neg a)$ cannot have higher level than $p$ because $p$ has only one child. Hence, letter $a$ is erased if and only if $\delta(p, \neg a)$ is a leaf, by definition of the technique.
Finally, if $p$ has no children, i.e., $p$ is a leaf, nothing is done by the algorithm but extracting $(p, p')$ from the queue.
Corollary 1: Tries $T$ and $T'$ have the same number of internal nodes that have two children and, consequently, have the same number of leaves. Trie $T'$ represents the prefix code $X_2$.
The corollary implies that $X_2 = \{\gamma_2(v) \mid v \in \mathrm{AD}\}$ can be uniquely reconstructed from $T'$. There is an additional property that allows reconstructing AD from $X_2$ without considering lengths of words in AD. This simplifies the procedure. The next proposition follows readily from definitions.
Proposition 4: If there exists no stopping pair in AD then for any $v \in \mathrm{AD}$, the last letter of $v$ is not erased during the self-compression to get $\gamma_2(v)$.
If the decoder has the additional information that the last letter of $t$ was not erased at compression time then it can use this fact as a stop criterion. This is also possible even if the antidictionary changes dynamically. Indeed the decoder just has to stop after processing the last letter of the compressed text. Therefore there is no need to use the length of the text to stop decoding.
To ensure that the last letter of any $v \in \mathrm{AD}$ is not erased and to meet the above hypothesis, it is sufficient to eliminate the only possible stopping pair (cf. Proposition 2). To do that, we delete from AD the longest word $v_1$ of such a pair. By Proposition 2 this word does not contribute to erasing letters in text $t$ during the compression because there is nothing to erase after the last letter.
Hence we suppose that in our antidictionary AD this word is not included, or, equivalently, that the branch of trie $T$ that has this word as unique leaf is pruned. In other words, we suppose from now on that antidictionary AD (and obviously all its subsets) has no stopping pair.
Algorithm Self-automaton uses the previous hypothesis to reconstruct AD from $T'$. More precisely, its input is a trie $T'$, self-compressed from trie $T$, with its transition function $\delta'$. Its output is the automaton $A(\mathrm{AD})$, where AD is the antidictionary represented by trie $T$. It is similar to algorithm L-automaton. Indeed it makes a breadth-first traversal on states of the trie $T$. It is possible to do this because, any time a state is reached, if a child was "erased" during the execution of Self-compress, it is now created and added to the queue $Q$. In order to create a new child, function $\delta$ must be previously restored, as done in algorithm L-automaton, by using the failure function $f$. When a leaf is reached in the self-compressed trie, the new stop criterion tells us that there is nothing more to reconstruct in that branch.
Trie $T$ can be obtained from the automaton $A(\mathrm{AD})$, output of the next algorithm, by using a linear-time algorithm described in [10].
The current situation in the next algorithm is as follows: when a node $p$ is popped from the queue, trie $T$ has been decompressed up to the level of $p$ in $T$, $f(p)$ is defined and function $\delta$ is defined for all previous nodes, which includes nodes at the previous level. After processing $p$, $\delta$ is also defined for $p$ and the failure function $f$ is defined on its children.
Self-automaton (trie T')
1. i' ← root of T';
2. Q ← ∅;
3. for each a ∈ A
4.     if δ'(i', a) is defined
5.         δ(i', a) ← δ'(i', a);
6.         f(δ(i', a)) ← i';
7.         add δ(i', a) to Q;
8.     else
9.         δ(i', a) ← i';
10. while Q ≠ ∅
11.     extract p from Q;
12.     if p is not a leaf
13.         if δ(f(p), a) is a leaf for a ∈ A
14.             create p1;
15.             for each b ∈ A
16.                 if δ'(p, b) is defined
17.                     δ'(p1, b) ← δ'(p, b);
18.             δ(p, ¬a) ← p1;
19.             δ(p, a) ← δ(f(p), a);
20.             f(p1) ← δ(f(p), ¬a);
21.             add p1 to Q;
22.         else
23.             for each a ∈ A
24.                 if δ'(p, a) is defined
25.                     δ(p, a) ← δ'(p, a);
26.                     f(δ(p, a)) ← δ(f(p), a);
27.                     add δ(p, a) to Q;
28.                 else
29.                     δ(p, a) ← δ(f(p), a);
30.     else
31.         for each a ∈ A
32.             δ(p, a) ← p;
33. return (Q, A, i', Q \ {leaves}, δ);
Since there is a bijection between leaves of $T$ and leaves of $T'$, we can associate with any leaf $q'$ of $T'$ the same value $c(q)$ of the corresponding leaf $q$ in $T$. This is the number of bits that the word $w(q)$ leads to erase during the compression of text $t$. Analogously, as in the previous subsection, we can define functions gain and pruned gain and, as a first step, we can run algorithm Simple Pruning on $T'$. At the same time we prune corresponding subtrees in $T$ and obtain a trie $T_1$. Doing so, the modified trie $T_1$ represents a subset of AD. As a second step, we can use again algorithm Self-compress on $T_1$ to get $T'_1$. Note that $T'_1$ can be different from the pruned trie $T'$ because pruning subtrees can affect self-compression.
We can iterate the above two steps for a fixed number of
VII. Conclusion
We have described DCA, a text compression method that uses some "negative" information about the text, represented in terms of antidictionaries. The advantages of the scheme are:
• it is fast at decompressing data,
• it is fast at compressing data for fixed sources,
• it has a synchronization property in the case of finite antidictionaries, a property that leads to efficient parallel compression and to search engines on compressed data.
In the previous sections we presented some static DCA schemes in which the text to be compressed needs to be scanned twice. Starting from these static schemes, several variations and improvements can be proposed. These variations are all based on clever combinations of two elements that can be introduced in our model:
• statistical considerations,
• dynamic approaches.
These are classical features that are often included in other data compression methods.
Statistical considerations are used in the construction of antidictionaries. If a forbidden word is responsible for "erasing" few bits of the text in the compression algorithm of Section II and if its "description" as an element of the antidictionary is "expensive" then the compression ratio improves if it is not included in the antidictionary. This idea has been partially exploited in the previous section. Conversely, one can introduce into the antidictionary a word that is not forbidden but that occurs very rarely in the text. In this case, the compression algorithm will produce some "errors" or "mistakes" in predicting the next letter. In order to have a lossless compression, encoder and decoder must be adapted to manage such errors. Typical errors occur in the case of antidictionaries built for fixed sources as well as in the dynamic approach.
Even with errors, assuming that they are rare with respect to the maximum length of words of the antidictionary, our compression scheme preserves the synchronization property of Theorem 3. The use of errors becomes necessary for some artificial strings like $1^m 0$ if one wants to use a static approach. Without errors and with a static approach, the algorithms described in the previous section are unable to compress such strings.
Antidictionaries for fixed sources also have an intrinsic interest. A compressor generator, or compressor compiler, can create, starting from words obtained from a source $S$, an antidictionary that can be used to compress all other words from the same source $S$. Error management is essential for this kind of application. Having a fixed antidictionary makes the compression fast because basic operations are just table lookups.
In the dynamic approach, we construct the antidictionary and encode the text at the same time. The antidictionary is constructed (also with statistical consideration) by considering the whole text previously scanned or just a part of it. The antidictionary can change at any step and the algorithmic rules for its construction must be
File      original size   compressed size
          (in bytes)      (in bytes)
bib       111261          35535
book1     768771          295966
book2     610856          214476
geo       102400          79633
news      377109          161004
obj1      21504           13094
obj2      246814          111295
paper1    53161           21058
paper2    82199           32282
pic       513216          70240
progc     39611           15736
progl     71646           20092
progp     49379           13988
trans     93695           22695
Fig. 3. Compression ratios on files of the Calgary Corpus.
We have realized prototypes of the compression and decompression algorithms. They also implement the dynamic version of the method. They have been tested on the Calgary Corpus (see Figure 3), and experiments show that we get compression ratios equivalent to those of most common compressors (such as pkzip for example).
We are considering several generalizations:
• Compressor schemes and implementations of antidictionaries on more general alphabets or on other types of data (images, sounds, etc.),
• Use of lossy compression especially to deal with images,
• Combination of DCA with other compression schemes; for instance, using both dictionaries and antidictionaries like positive and negative sets of examples as in Learning Theory,
• Design of chips dedicated to fixed sources.
Several problems concerning the data compression scheme are still open. We list some of them.
• Are balanced sources dense inside the family of Markov sources? A positive answer would raise the question of adapting the scheme so that it becomes universal for Markov or ergodic sources. Can self-compression be used to settle this question?
• Are there efficient algorithms to build good antidictionaries for syntactic sources, generated for instance by grammars? This raises a question of coding on a binary alphabet.
• What is the average of the maximum length of minimal forbidden words in texts of length $n$ generated by an ergodic source having entropy $H$?
• How many times on the average should pruning and self-compressing be iterated before the process stabilizes (see previous section)? We would expect a maximum of $\log n$ steps. Is the stabilized trie optimal?
Acknowledgments
We thank M.-P. Béal, M. Cohn, F. M. Dekking, R. Grossi
References
[1] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression, Prentice Hall, 1990.
[2] J. Gailly, "Frequently asked questions in data compression," 2000, FAQ, URL http://www.faqs.org/faqs/faqs/compression-faq/.
[3] M. Nelson and J. Gailly, The Data Compression Book, M&T Books, New York, NY, 1996.
[4] J. A. Storer, Data Compression: Methods and Theory, Computer Science Press, 1988.
[5] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes, Van Nostrand Reinhold, 1994.
[6] C. Shannon, "Prediction and entropy of printed English," Bell System Technical J., January 1951.
[7] M.-P. Béal, F. Mignosi, and A. Restivo, "Minimal forbidden words and symbolic dynamics," in STACS'96, C. Puech and R. Reischuk, Eds., number 1046 in Lecture Notes in Computer Science, pp. 555–566. Springer-Verlag, Berlin, 1996.
[8] M.-P. Béal, F. Mignosi, A. Restivo, and M. Sciortino, "Minimal forbidden words and symbolic dynamics," Advances in Appl. Math., to appear.
[9] M. Crochemore, F. Mignosi, and A. Restivo, "Minimal forbidden words and factor automata," in MFCS'98, L. Brim, J. Gruska, and J. Slatuska, Eds., number 1450 in Lecture Notes in Computer Science, pp. 665–673. Springer-Verlag, Berlin, 1998.
[10] M. Crochemore, F. Mignosi, and A. Restivo, "Automata and forbidden words," Inf. Process. Lett., vol. 67, no. 3, pp. 111–117, 1998.
[11] M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi, "Text compression using antidictionaries," in ICALP'99, L. Brim, J. Gruska, and J. Slatuska, Eds., number 1664 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1999.
[12] C. Choffrut and K. Culik, "On extendibility of unavoidable sets," Discrete Appl. Math., vol. 9, pp. 125–137, 1984.
[13] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Commun. ACM, vol. 18, no. 6, pp. 333–340, 1975.
[14] M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1994.
[15] V. Diekert and Y. Kobayashi, "Some identities related to automata, determinants, and Möbius functions," Report 1997/05, Universität Stuttgart, 1997.
[16] J. Berstel and D. Perrin, "Finite and infinite words," in Algebraic Combinatorics on Words, J. Berstel and D. Perrin, Eds. Cambridge University Press, to appear.
[17] Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa, "Pattern matching in text compressed by using antidictionaries," in CPM'99, M. Crochemore and M. Paterson, Eds., number 1645 in Lecture Notes in Computer Science, pp. 37–49. Springer-Verlag, Berlin, 1999.
[18] M.-P. Béal, Codage Symbolique, Masson, 1993.
[19] R. Ash, Information Theory, Tracts in Mathematics. Interscience Publishers, J. Wiley & Sons, 1985.
[20] R. G. Gallager, Information Theory and Reliable Communication, J. Wiley and Sons, Inc., 1968.
[21] R. G. Gallager, Discrete Stochastic Processes, Kluwer Acad. Publ., 1995.
[22] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Van Nostrand Reinhold, 1960.
[23] R. S. Ellis, Entropy, Large Deviations, and Statistical Mechanics, Springer Verlag, 1985.
[24] M. Crochemore and C. Hancart, "Automata for matching patterns," in Handbook of Formal Languages, Volume 2, Linear Modeling: Background and Application, G. Rozenberg and A. Salomaa, Eds. Springer-Verlag, 1997.
[25] R. Krichevsky, Universal Compression and Retrieval, Kluwer Academic Publishers, 1994.
[26] M. Crochemore and R. Vérin, "On compact directed acyclic word graphs," in Structures in Logic and Computer Science, J. Mycielski, G. Rozenberg, and A. Salomaa, Eds., number 1261 in Lecture Notes in Computer Science, pp. 192–211. Springer-