Algorithmic Rate-Distortion Theory
Nikolai Vereshchagin∗ Paul Vitányi† November 28, 2005
Abstract
We propose and develop rate-distortion theory in the Kolmogorov complexity setting.
This gives the ultimate limits of lossy compression of individual data objects, taking all effective regularities of the data into account.
1 Introduction
Kolmogorov complexity is the accepted absolute measure of the information content of an individual finite object. It gives the ultimate limit on the number of bits resulting from lossless compression of the object—more precisely, the number of bits from which effective lossless decompression of the object is possible. A similar absolute approach is needed for lossy compression, that is, a rate-distortion theory giving the ultimate effective limits for individual finite data objects. We give natural definitions of the rate-distortion functions of individual data (independent of a random source producing those data). We analyze the possible shapes of the rate-distortion graphs for all data and all computable distortions. The classic Shannon rate-distortion curve corresponds approximately to the individual curves of typical (random) data from the postulated random source, while the nonrandom data have completely different curves. It is easy to see that one is generally interested in the behavior of lossy compression on complex structured nonrandom data, like pictures, movies, music, while the typical unstructured random data like noise (represented by the Shannon curve) is discarded (we are not likely to want to store it). Finally, we formulate a new problem related to the practice of lossy compression. Is it the case that a code word that realizes least distortion of the source word at a given rate also captures the most properties of that source word that are possible at this rate? Clearly, this question cannot be well posed in the Shannon setting, where we deal with expected distortion, while also the notion of capturing a certain amount of the properties of the data cannot be well expressed. We show that in our setting this question is answered in the affirmative for every distortion measure that satisfies a certain parsimony-of-covering property.
2 Preliminaries
Compared to the classical information theory setting we dispense with sequences of random variables, and we also generalize the distortion measures from single-letter distortion measures to full generality. We start from a discrete set X called the source alphabet. Its elements will be called letters or source words. (The discreteness is not essential.) Suppose we want to communicate source words x from X using a code of at most r bits for each such word. (We call r the rate.) If 2^r is smaller than |X|, then this is clearly impossible. However, for every x we can use a representation y that in some sense is close to x. For example, assume that we want to communicate a computable real number x ∈ [0; 1]. Using r bits we are able to communicate a representation y that is a rational number at distance ≤ 2^{−r} from x.
∗Department of Mathematical Logic and Theory of Algorithms, Faculty of Mechanics and Mathematics, Moscow State University, Leninskie Gory, Moscow, Russia 119992. Email: [email protected].
†CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: [email protected].
Assume that the representations are chosen from a set Y, possibly different from X, variously called the reproduction alphabet, user alphabet, and the destination alphabet. We call its elements destination words. Assume furthermore that we are given a function d from X × Y to the reals, called the distortion measure. It measures the lack of fidelity, which we call distortion, of the destination word y against the source word x. (In our example, this is the Euclidean distance between x and y.)
In the Shannon theory [11, 12, 1, 2], we are given a random variable X with values in X. Thus every source word appears with a given probability. The goal is, for a given rate r and distortion a, to find an encoding function E, with a range of cardinality at most 2^r, such that the expected distortion between a source word x ∈ X and the corresponding destination word y = E(x) is at most a. The set P of all pairs ⟨r, a⟩ for which this is possible is called the rate-distortion profile of the random variable X. For every distortion a, we consider the minimum rate r such that the pair ⟨r, a⟩ is an element of the profile of X. This way we obtain the rate-distortion function of the random variable X:
r(a) = min{r : ⟨r, a⟩ ∈ P}. (1)
Here, as in [16, 9, 13], we are interested in what happens for individual source words, irrespective of the probability distribution on X induced by a random variable. To this end we use Kolmogorov complexity K(y) and conditional Kolmogorov complexity K(x|y), as defined in [6] and the textbook [8]. In our treatment it is not essential which version of complexity we use, the plain one or the prefix one. We assume that the destination alphabet Y consists of finite objects and thus K(y) is defined for all y ∈ Y. For every x ∈ X we want to identify the set of pairs ⟨r, a⟩ such that there is y ∈ Y with K(y) ≤ r and d(x, y) ≤ a. The set P_x of all such pairs will be called the rate-distortion profile of the source word x. For every distortion a consider the minimum rate r such that the pair ⟨r, a⟩ belongs to the profile of x. This way we obtain the rate-distortion function of the source word x:
r_x(a) = min{K(y) : d(x, y) ≤ a}.
It is often more intuitive to consider, for every rate r, the minimum distortion a such that the pair ⟨r, a⟩ belongs to the profile of x. This way we obtain the distortion-rate function of the individual word x:
d_x(r) = min{d(x, y) : K(y) ≤ r}.
It is straightforward from the definitions that d_x(r) is a sort of “inverse” of r_x(a).
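Since K is not computable, the two definitions can only be illustrated with a computable stand-in. The following Python sketch is a toy under stated assumptions: the hypothetical function toy_complexity (the number of maximal runs of equal bits) replaces K(y), the destination set is all of {0,1}^n for a small n, and d is the Hamming distortion. None of these names come from the paper.

```python
from itertools import product


def toy_complexity(y: str) -> int:
    """Hypothetical computable stand-in for K(y): number of maximal runs of equal bits."""
    return 1 + sum(1 for u, v in zip(y, y[1:]) if u != v)


def hamming_distortion(x: str, y: str) -> float:
    """Fraction of positions in which y differs from x."""
    return sum(u != v for u, v in zip(x, y)) / len(x)


def rate_distortion(x: str, a: float) -> int:
    """Toy r_x(a) = min{ complexity(y) : d(x, y) <= a }, by brute force over {0,1}^n."""
    words = ("".join(bits) for bits in product("01", repeat=len(x)))
    return min(toy_complexity(y) for y in words if hamming_distortion(x, y) <= a)


def distortion_rate(x: str, r: int) -> float:
    """Toy d_x(r) = min{ d(x, y) : complexity(y) <= r }; requires r >= 1."""
    words = ("".join(bits) for bits in product("01", repeat=len(x)))
    return min(hamming_distortion(x, y) for y in words if toy_complexity(y) <= r)
```

For x = 00001111 the toy rate stays at 2 for distortions 0 and 1/4 and drops to 1 at distortion 1/2, while the toy distortion-rate function goes from 1/2 at rate 1 to 0 at rate 2, which is the "inverse" relation in miniature.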
2.1 Related work
In Shannon’s paper [12], it is assumed that X = A^n, Y = B^n are the sets of strings of a certain length n over finite alphabets A, B. (Here we ignore the generalizations to infinite or continuous sets.) The distortion measure has the form d_n(x, y) = (1/n) Σ_{i=1}^{n} d(x_i, y_i), where d maps pairs of letters from A × B to the reals (the single-letter distortion measure). The random variable X is taken as X = X_1, . . . , X_n where the X_i’s are independent Bernoulli random variables identically distributed over A. For every n we obtain the rate-distortion function r_n(a). Shannon shows that the limit lim_{n→∞} r_n(a)/n exists and determines its non-constructive description in terms of A,B, h, a, Xi. In the next paragraph this limit will be denoted by R(a). More general distortion measures and random variables, that were studied later, are treated in [1, 2].
The papers [16, 9, 13], using the same i.i.d. Bernoulli assumptions on X, Y, X, d, establish the value of the rate distortion functions r_x(a) for specific x’s in A^n and compare them with R(a). It is shown that the limit lim_{n→∞} r_x(a)/n is equal to R(a) almost surely (i.e. with probability 1), and that the limit of the expectation of r_x(a)/n is also equal to R(a). Similar results hold also for ergodic stationary sources X. These results show that if x is obtained from a random (ergodic stationary) source, then with high probability its rate distortion function is close to the function nR(a). For example, if we flip a fair coin n times, then the resulting sequence has maximal Kolmogorov complexity (within a small additive constant) with overwhelming probability. For every distortion measure (of O(log n) Kolmogorov complexity), the individual rate-distortion functions of all these x’s will be close to the Shannon rate-distortion function (trivially achieving the “almost surely” part), and since almost all probability is concentrated on these x’s they also coincide with the expectation up to negligible error. Our results show that for nonrandom x (containing structure and regularity), there are many different shapes of r_x, and all of them are vastly different from that of R.
Ziv [17] also considers a kind of rate distortion function for individual data. The rate distortion function is assigned to every infinite sequence ω of letters of a finite alphabet A (and not to a finite object, as in the present paper). The source words x are prefixes of ω and the encoding function is computed by a finite state transducer. Kolmogorov complexity is not involved.
In [14], we analyzed the questions we address here for general rate-distortion, in the context of model selection, for the particular case of list decoding distortion: X = {0,1}^n, and Y is the set of all finite subsets of {0,1}^n; the distortion function d(x, y) is equal to ⌈log|y|⌉ if y contains x and is equal to infinity otherwise (we need ⌈log|y|⌉ bits of extra information to identify x given y). This is such a special case of distortion that the proofs and techniques do not generalize. Nonetheless, and surprisingly so to the authors, the results generalize by more powerful techniques to somewhat weaker versions, yielding a completely general algorithmic rate-distortion theory.
2.2 Results
Given X, Y and the distortion measure d, satisfying certain mild properties, we determine all possible shapes of the profiles P_x for the different source words x ∈ X, within a small margin of error. (Equivalently, we establish all possible shapes of the graph of r_x up to that margin.) In contrast to the Shannon case where one obtains a single profile P and rate-distortion function r(a) (for every X, Y, X and distortion measure d), the new approach establishes that different x’s can lead to vastly different profiles. This analysis may illuminate the dichotomy between Shannon’s rate-distortion theory and practical lossy compression where one tries to exploit syntactical and other regularities in an application domain. A successful lossy compressor succeeds in capturing a good distortion measure and lossily compresses all individual data appropriately; many of them below the rate-distortion curve of an assumed random variable (since the latter gives the point-wise average). We will illustrate the variability of the shapes by the instructive example of Hamming distortion: The set of source words X and the set of destination words Y are both equal to the set {0,1}^n of all binary strings of length n. The distortion function d is defined by d(x, y) = a/n if y differs from x in a bit positions. Other examples we have analyzed are Kolmogorov distortion (list decoding distortion), and Euclidean distortion.
A second issue has not been addressed before in either theory or practice of lossy compression (as far as the authors know). By definition, a destination word that witnesses the distortion-rate curve d_x(r) at rate r is a destination word giving least distortion at that rate. But a priori it is possible that there are destination words at the same rate, with equal or greater distortion, that represent more properties of the data. It is difficult (if possible at all) to express this idea in Shannon’s theory, but here it is related to the Kolmogorov complexity notion of “randomness deficiency” of source word x with respect to destination word y. We show that a destination word y with K(y) ≤ r that witnesses the distortion-rate graph of d_x(r) (that is, d(x, y) = d_x(r)) represents at least as many, and the same, properties of data x as it, or any destination word z that witnesses d_x(r), can represent for any source word u of which it witnesses d_u(r) = d_x(r).
Finally, we show that given a sequence of i.i.d. recursive random variables X_1, . . . , X_n, and a recursive distortion measure d, the Shannon distortion-rate function D(R) pointwise equals the limit inferior of the expectation of the individual rate-distortion curves, (1/n) d^n_x(nR + o(n)), the expectation taken over the probabilities of the outcomes of the n-fold random variable, the limit inferior taken with n → ∞.
3 Rate-Distortion Graph
Under certain mild restrictions on the considered distortion measures, the shape of the rate-distortion function will follow a fixed pattern, associated with the particular distortion measure involved, and, moreover, every function that follows that pattern is the rate distortion function of some source word x ∈ X within a negligible tolerance for that distortion measure. Before formulating this claim we need to list these restrictions and introduce some notation.
Although the results of this section can be naturally generalized to infinite or even uncountable sets X, like the segment [0; 1], for simplicity we will assume further that X is a finite subset of a fixed set of finite objects U. We assume that a computable bijection σ between U and the set of all binary strings is fixed. The Kolmogorov complexity K(x) for x ∈ U is defined as K(σ(x)). The bijection σ induces a well order on U and hence on X. We make the same assumptions about Y; the fixed set of finite objects in which Y is included is denoted by V.
The distortion measure d is a function mapping X × Y to the non-negative reals. (This is not an essential restriction for this paper, since every real can be approximated by a rational and all bounds in the paper hold to limited precision only.) Let D denote the range of d and d_max the maximal element of D. That is, we assume there is a y ∈ Y with distortion d(x, y) ≤ d_max for all source words x ∈ X.
A ball of radius a in X is a set of the form B_y(a) = {x ∈ X : d(x, y) ≤ a}. The destination word y is called the center of the ball. Note that such a ball, being a subset of X, may have more than one center. Let B(a) stand for the maximal cardinality of a ball of radius a in X. We assume that B(0) = 1, and for every x ∈ X there is y ∈ Y with B_y(0) = {x}. (This is equivalent to the statement that, for every x ∈ X, there is a y ∈ Y with distortion d(x, y) = 0.) Note that B(d_max) = |X|.
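The notions of ball and maximal ball cardinality can be made concrete on a very small instance. The Python sketch below is only an illustration under assumptions of our own choosing: X = Y = {0,1}^4 with Hamming distortion, and the helper names ball and max_ball_size are hypothetical.

```python
from itertools import product


def hamming_distortion(x: str, y: str) -> float:
    """Fraction of positions in which y differs from x."""
    return sum(u != v for u, v in zip(x, y)) / len(x)


def ball(y, a, source_words, d):
    """B_y(a) = {x in X : d(x, y) <= a}."""
    return {x for x in source_words if d(x, y) <= a}


def max_ball_size(a, source_words, destination_words, d):
    """B(a): maximal cardinality of a ball of radius a."""
    return max(len(ball(y, a, source_words, d)) for y in destination_words)


# Assumed toy instance: all binary strings of length 4 on both sides.
n = 4
words = ["".join(bits) for bits in product("01", repeat=n)]
```

On this instance B(0) = 1, B(1/4) = 5 and B(d_max) = B(1) = 16 = |X|, in line with the assumptions listed in the text.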
Let α denote the covering coefficient related to X, Y, d, defined as the minimal number that satisfies the following conditions: (i) for every a the cardinality of every ball of radius a is at least B(a)/α; (ii) for every a > 0 there is c < a with B(c) ≥ B(a)/α; and (iii) for all 0 ≤ a < a′, every ball of radius a′ in X can be covered by at most αB(a′)/B(a) balls of radius a.
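For Hamming distortion, condition (i) holds with factor 1, since all balls of a given radius have the same size, and condition (ii) already forces α to be at least n + 1. The Python sketch below computes this lower bound; it is a toy check that says nothing about condition (iii), and the function names are hypothetical.

```python
from math import comb


def hamming_ball_size(n: int, k: int) -> int:
    """Number of n-bit strings within Hamming distance k of a fixed string."""
    return sum(comb(n, i) for i in range(k + 1))


def alpha_lower_bound_from_condition_ii(n: int) -> float:
    """Smallest alpha allowed by condition (ii): the largest ratio B(a)/B(c),
    where c is the largest radius in D strictly below a."""
    sizes = [hamming_ball_size(n, k) for k in range(n + 1)]
    return max(sizes[k] / sizes[k - 1] for k in range(1, n + 1))
```

The maximum is attained at the step from radius 0 to radius 1/n, where the ball grows from 1 to n + 1 elements.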
The graph of the distortion measure d is the set of triples ⟨x, y, d(x, y)⟩ ordered lexicographically. Note that this list identifies also X, Y and D. Let K(d) stand for the Kolmogorov complexity of the graph of d.
The results in this paper will usually be precise up to an additive term of order O(s) where
s = log log|X| + log α + log|D| + K(d).
We say that ε = O(δ), where ε, δ are functions of X, Y, d, if |ε| ≤ cδ + C, with c an absolute constant, and C depending on the choice of the optimal description method in the definition of Kolmogorov complexity, and on the choice of computable bijection between the set of binary strings and the universes U, V.
For the Hamming distortion example, all the assumptions are satisfied, and s is of order O(log n) (see Lemma 1 below). For the list decoding distortion, again all the assumptions are satisfied, and s is of order O(log n). This is a good accuracy, as the values in question are proportional to n = log|X|.
3.1 The common properties of rate distortion profile
We first establish three easy properties, Theorem 1, that are satisfied by the rate distortion profile P_x of every source word x ∈ X. Item (i) is self-explanatory. Item (ii) states that the profile P_x of x contains a pair which is close to the pair ⟨0, d_max⟩ (every word can be easily described with maximal distortion). The set P_x also contains a pair that is close to ⟨K(x), 0⟩ (each word can be exactly described in about K(x) bits), and no pairs ⟨r, 0⟩ with r significantly less than K(x). Item (iii) states that we can decrease the distortion from a to a′ at the expense of increasing the rate by at most log(B(a)/B(a′)) + O(s). That is, if P_x has a pair ⟨r, a⟩, then it contains all the pairs ⟨r + log(B(a)/B(a′)) + O(s), a′⟩ for a′ < a. Note that moving in the other direction is impossible in general: in some cases we cannot decrease the rate at the expense of increasing the distortion by any reasonable amount. This is a corollary of Theorem 2 below (see Fig. 1).
Theorem 1. For some ε = O(s) and all X, Y, d and x ∈ X:
(i) The set P_x is upward closed: if ⟨r, a⟩ ∈ P_x, then P_x also contains all ⟨r′, a′⟩ with r′ ≥ r and a′ ≥ a.
(ii) The set P_x contains the pairs ⟨ε, d_max⟩ and ⟨K(x) + ε, 0⟩, and does not contain the pair ⟨K(x) − ε, 0⟩.
(iii) If a pair ⟨r, a⟩ ∈ P_x and 0 ≤ a′ < a, then also ⟨r + log(B(a)/B(a′)) + ε, a′⟩ ∈ P_x.
Proof. Item (i) is obvious.
To prove Item (ii), consider the first destination word y_0 with distortion at most d_max with respect to every source word. Its complexity is at most K(Y) + O(1) ≤ K(d) + O(1). By Item (i), this proves that the pair ⟨ε, d_max⟩ is in P_x. Towards the other extreme, we have assumed that for every source word x there is a code word y such that d(x, y) = 0. The complexity of the first such y is at most K(x) + O(s), which proves that the pair ⟨K(x) + ε, 0⟩ is in P_x. On the other hand, every y having distortion 0 with x completely identifies x (given the graph of d). Therefore, ⟨r, 0⟩ ∈ P_x with witness y implies that K(x) ≤ K(y) + O(s) ≤ r + O(s).
To prove Item (iii), let y witness r_x(a) = K(y). Find a covering of B_y(a) by at most αB(a)/B(a′) balls of radius a′. Let B′ be a ball in the covering containing x. Its center y′ can be specified by y and the index i of B′ among the covering balls, given the following extra information: the graph of d, and the values of a and a′. Without loss of generality we can assume that both a, a′ belong to D (otherwise we can decrease them to the closest values in D). We need also O(log log i) = O(log log|X|) extra bits to separate the description of y from the binary representation of i. All the extra information and separator bits are included in O(s). Altogether, K(y′) ≤ K(y) + log(B(a)/B(a′)) + O(s). If there is more than one center of B′, then y′ is the first such center in the order in which the process using this description enumerates them.
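The two-part description used for Item (iii), a center y plus the index of a covering ball, can be tried out on a small Hamming instance. The Python sketch below swaps in a greedy set-cover heuristic to produce the covering; the proof only needs that a covering of size at most αB(a)/B(a′) exists, so the greedy choice, the tie-breaking rule and all function names are assumptions made for illustration.

```python
from math import ceil, log2


def hamming_ball(center: int, k: int, n: int) -> set:
    """All n-bit words (encoded as ints) within Hamming distance k of center."""
    return {w for w in range(2 ** n) if bin(w ^ center).count("1") <= k}


def greedy_cover(big_center: int, k_big: int, k_small: int, n: int) -> list:
    """Greedily cover the ball of radius k_big by balls of radius k_small."""
    uncovered = hamming_ball(big_center, k_big, n)
    centers = []
    while uncovered:
        # Pick the center whose small ball captures the most uncovered words;
        # ties go to the smallest integer, so the output is deterministic.
        best = max(
            range(2 ** n),
            key=lambda c: (len(hamming_ball(c, k_small, n) & uncovered), -c),
        )
        centers.append(best)
        uncovered -= hamming_ball(best, k_small, n)
    return centers


def index_bits(centers: list) -> int:
    """Bits needed to name one ball of the covering by its index."""
    return ceil(log2(len(centers))) if len(centers) > 1 else 0
```

For n = 6, covering a ball of radius 2 bits (22 words) by balls of radius 1 bit (7 words each) needs at least ⌈22/7⌉ = 4 balls, so at least 2 index bits are added to the description of the original center.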
Corollary 1.
r_x(d_max) ≤ ε, (2)
r_x(0) = K(x) ± ε, (3)
0 ≤ r_x(a′) − r_x(a) ≤ log(B(a)/B(a′)) + ε for all a′ < a. (4)
Property (4) implies that r_x(a) is a rather smooth function provided log B(a) is so. The similar property doesn’t hold for the “inverse” d_x(r). Theorem 2 below will establish that d_x(r) can decrease a lot for r increasing only a little (see Fig. 1). Corollary 1 shows that the rate distortion function is confined within the following bounds:
K(x) − log B(a) − O(s) ≤ r_x(a) ≤ log|X| − log B(a) + O(s).
The right-hand bound is obtained by letting a = d_max in Equation (4). The left-hand bound can be derived by letting a′ = 0 in (4). It can also be argued as follows: Let y witness r_x(a). Then we have the following two-part description of x: y, and the index of x in B_y(a). Since the complexity of this description cannot be less than K(x), we obtain
K(x) ≤ K(y) + log B(a) + O(s) = r_x(a) + log B(a) + O(s).
If x is a random element of X, that is, the complexity of x equals log|X| + O(s), then the lower and upper bounds for r_x(a) coincide and we can conclude that r_x(a) = log|X| − log B(a) + O(s).
If x is not such a random element, then there are many possible behaviors of r_x(a), and we will show in the next section that they are all realizable.
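The width of the band in which r_x(a) is confined can be tabulated for Hamming distortion. The Python sketch below drops the O(s) terms entirely and treats the complexity of x as a given number, since K(x) itself cannot be computed; the function names are hypothetical.

```python
from math import comb, log2


def log_ball_size(n: int, k: int) -> float:
    """log B(k/n) for Hamming balls in {0,1}^n."""
    return log2(sum(comb(n, i) for i in range(k + 1)))


def rate_bounds(n: int, k: int, complexity: float) -> tuple:
    """(lower, upper) bounds on r_x(k/n) with the O(s) terms dropped.

    complexity plays the role of K(x) and is an input, not something computed.
    """
    lb = log_ball_size(n, k)
    return complexity - lb, n - lb
```

The gap between the two bounds is log|X| − K(x) = n − K(x) at every distortion level, so it closes exactly when x is random.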
3.2 Every function is realized by some data
Assume that we are given a non-increasing function r(a) : D → N satisfying the constraints in Corollary 1. The graph of r is the set of pairs ⟨a, r(a)⟩ (0 ≤ a ≤ d_max) ordered lexicographically. Is there a source word x ∈ X whose rate-distortion function r_x(a) is close to r(a)? The next theorem answers this question in the affirmative:
Theorem 2. Let r : D → N satisfy (2) by having r(d_max) = 0, satisfy (3) by having r(0) = k, and satisfy (4) with ε = 0. Then there is a source word x ∈ X of complexity k ± ε such that
|r(a) − r_x(a)| ≤ ε, (5)
where ε = O(√(log|X|) log(2α) + K(d) + K(r)) and K(r) stands for the complexity of the graph of r.
Proof. First note that the complexity of the potential witness x is fixed by r: if r_x(0) = k ± ε, then by Equation (3) in Corollary 1, we have K(x) = k ± ε.
Next, we claim that there is a sequence of elements a_0 = d_max > a_1 > · · · > a_N = 0 in D, where N = O(√(log|X|)), such that every a ∈ D \ {a_0, . . . , a_N} belongs to a segment [a_{i+1}; a_i] with log B(a_i) − log B(a_{i+1}) ≤ √(log|X|). Indeed, chop the interval [0; log B(d_max)] into subintervals of length √(log|X|), and for each subinterval [b; c] consider the set B^{−1}[b; c] = {a : b ≤ log B(a) ≤ c}. Include the least element and the greatest element of the intersection D ∩ B^{−1}[b; c] in the sequence of the a_i’s (if the intersection is empty, do not include anything).
To prove the theorem, it suffices to find an x such that (5) holds for all a_i. To see that the inequality then also holds for the remaining a’s, let [a_{i+1}; a_i] be the subsegment containing a. Since both functions r(a), r_x(a) are non-increasing, we have
r(a) ∈ [r(a_i), r(a_{i+1})],
r_x(a) ∈ [r_x(a_i), r_x(a_{i+1})] ⊂ [r(a_i) − ε, r(a_{i+1}) + ε].
By the conditions of the theorem, and the property of the sequence of a_i’s, the length of the segment [r(a_i), r(a_{i+1})] is at most
r(a_{i+1}) − r(a_i) ≤ log(B(a_i)/B(a_{i+1})) ≤ √(log|X|).
Hence |r(a) − r_x(a)| ≤ √(log|X|) + ε and we are done.
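The choice of the breakpoints a_0 > a_1 > · · · > a_N can be sketched for Hamming distortion, where log|X| = n and D = {0, 1/n, . . . , 1}. The Python code below is an illustration only: it uses half-open buckets [j√n; (j+1)√n) instead of the closed subintervals of the text, measures radii in bits (a = k/n), and the function names are hypothetical.

```python
from math import comb, log2, sqrt


def log_ball_size(n: int, k: int) -> float:
    """log B(k/n) for Hamming balls in {0,1}^n."""
    return log2(sum(comb(n, i) for i in range(k + 1)))


def breakpoints(n: int) -> list:
    """Radii k (in bits) selected by chopping [0, log B(d_max)] into buckets
    of length sqrt(n) and keeping the least and greatest radius per bucket."""
    length = sqrt(n)
    buckets = {}
    for k in range(n + 1):
        buckets.setdefault(int(log_ball_size(n, k) // length), []).append(k)
    chosen = set()
    for radii in buckets.values():
        chosen.update((min(radii), max(radii)))
    return sorted(chosen, reverse=True)  # a_0 = d_max first, a_N = 0 last
```

Every radius that is not selected lies strictly between the two selected radii of its own bucket, so log B changes by at most √n across the segment that contains it, and at most two radii are kept per bucket.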
To find the desired x, we run the following non-halting algorithm that takes as input the graphs of d and r.
Algorithm: Outline: Enumerate all the balls in X of radiuses a_i and complexities less than r(a_i) − ε, for 0 ≤ i ≤ N, respectively. Call such balls forbidden, since the desired source word x cannot belong to any such ball. Maintain a variable G containing X minus the union of all forbidden balls discovered so far.
Construct, in parallel, balls B_0, . . . , B_N of radiuses a_0, . . . , a_N, respectively, as described below. Call them candidate balls since their union contains the desired word x. These are balls ensuring the inequality r_x(a_i) ≤ r(a_i) + ε. Every candidate ball is changed from time to time to maintain the following invariant: for all i ≤ N the cardinality of the intersection B_0 ∩ · · · ∩ B_i ∩ G is at least
B(a_i) 2^{−i−1} α^{−i}. (6)
Initialize: Find balls B_0, . . . , B_N of radiuses a_0, . . . , a_N such that |B_0 ∩ · · · ∩ B_i| ≥ B(a_i) 2^{−i−1} α^{−i}. We will amply fulfill the requirement by producing balls with a much larger intersection, without the factor of 2^{−i−1}. Let B_0 be the ball of radius d_max centered at the first element in Y with distortion at most d_max for every x ∈ X. The next balls are constructed inductively. Assume that B_0, . . . , B_i are already defined, and the cardinality of their joint intersection is at least B(a_i) α^{−i}. To find B_{i+1}, cover B_i by at most αB(a_i)/B(a_{i+1}) balls of radius a_{i+1} (this cover exists by the definition of α). This covering also covers the set B_0 ∩ · · · ∩ B_i, which has at least B(a_i) α^{−i} elements. Thus the intersection of at least one of the covering balls with this set has at least
B(a_i) α^{−i} / (αB(a_i)/B(a_{i+1})) = B(a_{i+1}) α^{−i−1}
elements. Let B_{i+1} be the first such ball.
Enumerate all forbidden balls and maintain invariant: Enumerating forbidden balls we update G. Whenever the invariant (6) becomes false, we change some candidate balls to restore the invariant. Let us prove first that for i = 0 the invariant never becomes false. In other words the cardinality of G never gets smaller than half of the cardinality of B_0 = X. Indeed, for fixed i the total cardinality of all the balls of radius a_i and complexity less than r(a_i) − ε does not exceed 2^{r(a_i)−ε} B(a_i). Since the function r(a) + log B(a) is monotonic non-decreasing, the total number of elements in all forbidden balls is at most
Σ_{i=0}^{N} 2^{r(a_i)−ε} B(a_i) ≤ (N + 1) 2^{r(d_max)−ε} B(d_max) = (N + 1) 2^{−ε} B(d_max) ≤ B(d_max)/2 = |X|/2,
where we note that r(d_max) = 0 by definition, and the last inequality holds provided ε is chosen appropriately.
Now assume that the invariant has become false for some i > 0. Let i be the least such index. Since the invariant is true for i − 1, the cardinality of the joint intersection G′ of all the balls B_1, . . . , B_{i−1} and G is at least B(a_{i−1}) 2^{−i} α^{−i+1}. We update B_i, . . . , B_N as follows. To define the new B_i find a covering of B_{i−1} by at most αB(a_{i−1})/B(a_i) balls of radius a_i. The cardinality of G′ ∩ B for at least one covering ball B is at least
|G′| / (αB(a_{i−1})/B(a_i)) ≥ B(a_i) 2^{−i} α^{−i}.
Let B_i be the first such ball B. Note that B(a_i) 2^{−i} α^{−i} is twice the threshold required by the invariant. We will use this in the sequel: after each change of any candidate ball B_j the required threshold for j is exceeded at least two times. Using the same procedure find B_{i+1}, . . . , B_N. End of Algorithm
Although the algorithm does not halt, at some (unknown) moment the last forbidden ball is enumerated. After this moment the candidate balls are not changed. Take as x any object in the intersection of G and all the candidate balls. The intersection is not empty, since its cardinality is positive by the invariant. By construction x avoids all the forbidden balls, thus r_x(a) satisfies the required lower bound.
To finish the proof it remains to show that the complexity of every candidate ball B_i (after the stabilization moment) does not exceed r(a_i) + ε for an appropriate ε = O(s). Fix i ≤ N. Consider the description of the final B_i consisting of i, the graphs of d, r, and the total number M of changes to intermediate versions of B_i. The final ball B_i can be algorithmically found from this description by running the Algorithm. Thus it remains to upper bound log M by something close to r(a_i). Let us prove that the candidate ball B_i is changed at most 2^{r(a_i)+i} times. Distinguish two possible cases when B_i is changed: (1) the invariant has become false for an index strictly less than i, and (2) the invariant has become false for i and remained true for all smaller indexes. By induction, the number of changes of the first kind can be upper bounded by 2^{r(a_{i−1})+i−1} ≤ 2^{r(a_i)+i−1}. To upper bound the number of changes of the second kind divide them again into two categories: (2a) after the last change of B_i at least one forbidden ball of radius greater than a_i has been enumerated, (2b) after the last change of B_i no forbidden ball of radius greater than a_i has been enumerated. The number of changes of type (2a) is at most the number of forbidden balls of radiuses a_j ≥ a_i. By monotonicity of r(a), this is at most (N + 1) 2^{r(a_i)−ε} ≪ 2^{r(a_i)}. Finally, for every change of type (2b), between the last change of B_i and the current one no candidate balls with indexes less than i have been changed and no forbidden balls with radiuses a_j ≥ a_i have been enumerated. Thus the cardinality of G has decreased by at least B(a_i) 2^{−i−1} α^{−i} due to enumerating forbidden balls with radiuses a_j < a_i (recall that after the last change of B_i the threshold was exceeded at least two times). The total cardinality of forbidden balls of these radiuses does not exceed N 2^{r(a_i)−ε} B(a_i) (we use the monotonicity of r(a) + log B(a)). The number of changes of type (2b) is less than the ratio of this number to the threshold B(a_i) 2^{−i−1} α^{−i}. Hence it is less than N 2^{r(a_i)−ε} 2^{i+1} α^{i}. One can choose ε = O(N log(2α)) = O(√(log|X|) log(2α)) so that this is much less than 2^{r(a_i)}. The theorem is proved.
3.3 Example: Hamming distortion
Recall that in the case of Hamming distortion, X = Y is the set {0,1}^n of all binary strings of length n. The distortion measure d(x, y) is equal to the fraction of bits where y differs from x. Let us estimate the term log B(a) in Corollary 1 and Theorem 2. For all a ≤ 1/2 the term log B(a) differs by at most O(log n) from nH(a), where H(a) = a log(1/a) + (1 − a) log(1/(1 − a)) is the Shannon entropy function. For a ∈ [1/2; 1] the function B(a) is almost constant: n − 1 ≤ log B(a) ≤ n.
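The estimate of log B(a) by nH(a) can be checked numerically. The Python sketch below computes the gap nH(k/n) − log B(k/n) exactly from binomial sums; the standard bounds 2^{nH(a)}/(n + 1) ≤ B(a) ≤ 2^{nH(a)} for a ≤ 1/2 place this gap between 0 and log(n + 1). The function names are hypothetical.

```python
from math import comb, log2


def binary_entropy(a: float) -> float:
    """H(a) = a log(1/a) + (1 - a) log(1/(1 - a)), with H(0) = H(1) = 0."""
    if a in (0.0, 1.0):
        return 0.0
    return a * log2(1 / a) + (1 - a) * log2(1 / (1 - a))


def entropy_gap(n: int, k: int) -> float:
    """n*H(k/n) - log B(k/n) for the Hamming ball of radius k bits."""
    ball = sum(comb(n, i) for i in range(k + 1))
    return n * binary_entropy(k / n) - log2(ball)
```

For n = 200 the gap stays within [0, log 201] for every radius up to n/2, which is the O(log n) accuracy claimed in the text.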
The terms log log|X|, log|D|, K(d) are all of order O(log n). As to the term log α, it also is of the same order, as the following lemma shows. (Although one would think that everything is known about covering Hamming balls, as far as the authors were able to ascertain this is a new combinatorial result.)
Lemma 1. For all a ≤ a′ ≤ 1/2 every Hamming ball of radius a′ can be covered by at most αB(a′)/B(a) Hamming balls of radius a, where α is a polynomial in n.
Proof. The lemma implies that the set of all strings of length n can be covered by at most
N = poly(n) 2^n / B(a)
balls of radius a. We will first prove this corollary, and then use the same method to prove the full lemma.
Fix a string x. The probability that x is not covered by a randomly selected ball of radius a is equal to 1 − B(a)2^{−n}. Thus the probability that no ball out of N randomly selected balls of radius a covers x is (1 − B(a)2^{−n})^N < e^{−N·B(a)2^{−n}}.
For N = n2^n/B(a), the exponent in the right-hand side of the latter inequality is at most −n and the probability that x is not covered is less than e^{−n}. This probability remains exponentially small even after multiplying by 2^n, the number of different x’s. Hence, with probability close to 1, N random balls cover all the strings of length n.
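The union bound in this argument is easy to evaluate for concrete parameters. The Python sketch below rounds N = n2^n/B(a) up to an integer and returns the bound 2^n(1 − B(a)2^{−n})^N on the expected number of uncovered strings; the function name is hypothetical, and no random balls are actually drawn.

```python
from math import ceil, comb, e


def covering_union_bound(n: int, radius_bits: int) -> tuple:
    """Return (N, bound): N = ceil(n * 2^n / B(a)) random balls, and the union
    bound 2^n * (1 - B(a) / 2^n)^N on the chance that some string is missed."""
    ball = sum(comb(n, i) for i in range(radius_bits + 1))
    num_balls = ceil(n * 2 ** n / ball)
    miss_one = (1 - ball / 2 ** n) ** num_balls
    return num_balls, 2 ** n * miss_one
```

For n = 10 and radius 2 bits this gives N = 183 balls and a bound below (2/e)^10, so a covering of that size exists.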
Let us proceed to the proof of the lemma. Fix a ball with center y and radius a′. All the strings in the ball that are at Hamming distance at most a from y can be covered by one ball of radius a with center y. Thus it suffices, for every a″ of the form i/n such that a < a″ ≤ a′, to cover by poly(n)B(a′)/B(a) balls of radius a all the strings at distance a″ from y.
Fix a″ and let S denote the set of all strings at distance exactly a″ from y. Let f be the solution to the equation a + f(1 − 2a) = a″ rounded to the closest rational of the form i/n. As a < a″ ≤ a′ ≤ 1/2 this equation has a unique solution and it lies in the interval [0; 1]. Consider a ball B of radius a with a random center z at distance f from y. As in the first argument, it suffices to show that
Prob[x ∈ B] ≥ B(a) / (poly(n) B(a′))
for all x ∈ S.
Fix any string z at distance f from y. We claim that the ball of radius a with center z covers at least B(a)/poly(n) strings in S. W.l.o.g. assume that the string y consists of only zeros and the string z consists of fn ones and (1 − f)n zeros. Flip a set of fan ones and a set of (1 − f)an zeros in z. The total number of flipped bits is equal to an, therefore the resulting string is at distance a from z. The number of ones in the resulting string is fn − fan + (1 − f)an = a″n, therefore it belongs to S. Different choices of flipped bits result in different strings in S. The number of ways to choose flipped bits is equal to C(fn, fan) · C((1 − f)n, (1 − f)an), where C(m, j) denotes the binomial coefficient. By Stirling’s formula, the second factor is 2^{(1−f)nH(a)−O(log n)} (we use that a < 1/2 and that H(a) increases on [0; 1/2]). The first factor can be estimated as C(fn, fan) ≥ 2^{fnH(a)−O(log n)}. Therefore, the number of ways to choose flipped bits is at least
2^{fnH(a)+(1−f)nH(a)−O(log n)} = 2^{nH(a)−O(log n)} ≥ B(a)/poly(n).
By symmetry, the probability that a random ball B covers a fixed string x ∈ S does not depend on x. We have shown that a random ball B covers at least B(a)/poly(n) strings in S. Hence with probability
B(a) / (poly(n)|S|) ≥ B(a) / (poly(n) B(a′))
a random ball B covers a fixed string in S. The lemma is proved. (A more accurate calculation shows that the lemma holds with α = O(n^4).)
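The counting step of the proof can be checked on concrete numbers. The Python sketch below takes y to be the all-zero string, lets z have fn ones, and counts the ways to flip fan ones and (1 − f)an zeros; it assumes parameters for which fan is an integer, and the function name is hypothetical.

```python
from math import comb


def flip_count(n: int, ones_in_z: int, radius_bits: int) -> tuple:
    """Return (weight, count): the number of ones in every string reached and
    the number of distinct strings reached from z by the flips of the proof.

    ones_in_z is f*n and radius_bits is a*n; f*a*n must be an integer.
    """
    flipped_ones, remainder = divmod(ones_in_z * radius_bits, n)
    if remainder:
        raise ValueError("f*a*n must be an integer in this sketch")
    flipped_zeros = radius_bits - flipped_ones
    weight = ones_in_z - flipped_ones + flipped_zeros  # equals a''*n
    count = comb(ones_in_z, flipped_ones) * comb(n - ones_in_z, flipped_zeros)
    return weight, count
```

With n = 100, a = 1/5 and f = 1/2 every reached string has 50 ones, matching a″ = a + f(1 − 2a) = 1/2, and the number of reached strings is within a small polynomial factor of B(a).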
Corollary 2. For every x of length n the rate distortion function r_x of x satisfies the inequalities:
r_x(1/2) = O(log n), r_x(0) = K(x) + O(log n), (7)
0 ≤ r_x(a) − r_x(a′) ≤ n(H(a′) − H(a)) + O(log n) (8)
for all 0 ≤ a < a′ ≤ 1/2. On the other hand, let r be a function mapping the set {0, 1/n, 2/n, . . . , 1/2} to the naturals satisfying the condition (8) without the O(log n) term and such that r(1/2) = 0 and r(0) = k. Then there is a string x of length n and complexity k ± O(log n) such that r_x(a) = r(a) + O(√n log n) for all a ≤ 1/2.
For example, we can apply the second part of Corollary 2 to the function r(a) shown in Fig. 1. The rate distortion graph of the string x existing by Corollary 2 is in the strip of size O(√n log n) of the graph of r(a). Therefore r_x(a) is almost constant on the segment [1/6; 1/3].
[Figure 1: A possible shape of the rate distortion function. Horizontal axis: distortion, with marks at 1/6, 1/3 and 1/2; vertical axis: rate. Curve labels: n(1 − H(a)) and n(1 − H(a) + H(1/6) − H(1/3)).]
Allowing the distortion to increase on this interval, all the way from 1/6 to 1/3, so allowing n/6 incorrect extra bits, we still cannot decrease the rate. This means that the distortion-rate function d_x(r) of x drops from 1/3 to 1/6 near the point r = n(1 − H(1/3)), exhibiting a very non-smooth behavior.
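The shape of Figure 1 and the resulting jump can be reproduced numerically. The Python sketch below builds the curve for radii k/n with a flat part between k_low/n and k_high/n and reads off the least distortion available at a given rate; it keeps real-valued rates (the rounding to naturals required by Corollary 2 is omitted), and the function names are hypothetical.

```python
from math import log2


def binary_entropy(a: float) -> float:
    """H(a) with the convention H(0) = H(1) = 0."""
    if a in (0.0, 1.0):
        return 0.0
    return a * log2(1 / a) + (1 - a) * log2(1 / (1 - a))


def figure_shape(n: int, k_low: int, k_high: int) -> list:
    """Rates r(k/n) for k = 0..n/2, flat between k_low/n and k_high/n."""
    flat = n * (1 - binary_entropy(k_high / n))
    rates = []
    for k in range(n // 2 + 1):
        if k >= k_high:
            rates.append(n * (1 - binary_entropy(k / n)))
        elif k >= k_low:
            rates.append(flat)
        else:
            rates.append(flat + n * (binary_entropy(k_low / n) - binary_entropy(k / n)))
    return rates


def least_distortion_bits(rates: list, budget: float) -> int:
    """Smallest k with r(k/n) <= budget: the 'inverse' reading of the curve."""
    return next(k for k, r in enumerate(rates) if r <= budget)
```

With n = 60, k_low = 10 and k_high = 20, a rate budget equal to the flat level buys distortion 10/60 = 1/6, while a budget only half a bit lower buys nothing better than 21/60, just above 1/3.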
3.4 Example: List decoding distortion
The value s is again of order O(log n), and hence the accuracy ε is of order O(log n) in Corollary 1, and ε = O(√n log n + K(r)) in Theorem 2. However, we can achieve better accuracy in Theorem 2, as shown in [14]:
Theorem 3. Let $r_x(a)$ stand for the minimal complexity of a set of cardinality at most $2^a$ containing $x$ of length $n$. Then
$$r_x(0) = K(x) + O(1), \quad r_x(n) = O(\log n), \quad 0 \le r_x(a) - r_x(a') \le a' - a + O(\log n)$$
for all $0 \le a \le a' \le n$. On the other hand, let $r : \{0, 1, 2, \ldots, n\} \to \mathbb{N}$ be a non-increasing function such that $r(n) = 0$, $r(0) = k$ and the function $r(a) + a$ is monotonic non-decreasing. Then there is a string $x$ of length $n$ and complexity $k + O(\log n)$ such that $r_x(a) = r(a) + O(\log n + K(r))$.
List decoding distortion is the distortion proposed by A.N. Kolmogorov [7] in the context of model selection and a non-probabilistic approach to statistics. It is representative of a family of distortions whose rate-distortion graphs coincide up to logarithmic terms: Shannon-Fano distortion, where $\mathcal{Y}$ is the set of computable probability mass functions $p$ with distortion $\log 1/p(x)$, and minimizing distortion means finding the distribution that maximizes probability (maximum likelihood estimation) [14]; total recursive function distortion, where $\mathcal{Y}$ is the set of total recursive functions $f$ with distortion $d$ for $f(d) = x$ [14]; and communication complexity distortion [3].
3.5 Example: Euclidean distortion
Let $\mathcal{X} = \mathcal{Y}$ be the set of rational numbers in the segment $[0,1]$ having $n$ binary digits. Let $d(x,y)$ be equal to 0 if $x = y$ and to $n + 1 + \lceil \log|x-y| \rceil$ otherwise. Given any $y$ with $d(x,y) \le a$ we can find about the first $n-a$ bits of the binary expansion of $x$, and vice versa. Thus in this example $r_x(a)$ differs by at most $O(1)$ from the Kolmogorov complexity of the prefix of length $n-a$ of the binary expansion of $x$.
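The distortion just defined is easy to evaluate exactly. The sketch below is a minimal illustration in Python, assuming binary logarithms and representing the $n$-digit rationals as `fractions.Fraction` values; the helper names are ours, not the paper's.

```python
from fractions import Fraction


def ceil_log2(q):
    """Smallest integer e with 2**e >= q, for a positive rational q (exact arithmetic)."""
    e = 0
    while Fraction(2) ** e < q:
        e += 1
    while Fraction(2) ** (e - 1) >= q:
        e -= 1
    return e


def euclidean_distortion(x, y, n):
    """d(x, y) = 0 if x == y, else n + 1 + ceil(log2 |x - y|).

    x and y are assumed to be rationals in [0, 1] with n binary digits,
    so the value ranges over 0, 1, ..., n + 1.
    """
    if x == y:
        return 0
    return n + 1 + ceil_log2(abs(x - y))


if __name__ == "__main__":
    n = 4
    x = Fraction(11, 16)                                   # 0.1011 in binary
    print(euclidean_distortion(x, Fraction(10, 16), n))    # differ only in the last digit
    print(euclidean_distortion(x, Fraction(3, 16), n))     # differ already in the first digit
```

A small distortion forces $x$ and $y$ to be close, and therefore (up to a carry) to share a long binary prefix, which is the informal reason why $r_x(a)$ tracks the complexity of the prefix of length $n-a$.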
The value $s$ is again of order $O(\log n)$ and hence we have the accuracy $\varepsilon = O(\log n)$ in Corollary 1 and $\varepsilon = O(\sqrt{n}\log n + K(r))$ in Theorem 2. In [10], improving an earlier treatment in a draft of this paper, Salnikov strengthened this result and extended it to the case of all reals in $[0; 1]$:
Theorem 4. Let $r_x(a)$ stand for the minimal complexity of a rational number $y$ at distance at most $a$ from a real $x \in [0; 1]$. For all $x$ we have
$$r_x(1/2) = O(1),$$
$$0 \le r_x(a) - r_x(a') \le \log a' - \log a + O(\log\log(a'/a))$$
for all $0 < a \le a' \le \frac{1}{2}$. On the other hand, let $r : \mathbb{Q} \to \mathbb{N}$ be a given non-increasing function such that $r(1/2) = 0$ and the function $r(a) + \log a$ is monotonic non-decreasing. Then there is a real $x \in [0; 1]$ such that
$$r_x(a) = r(a) + O(\sqrt{\log 1/a})$$
for all $0 < a \le 1/2$. The constant in $O(\sqrt{\log 1/a})$ does not depend on $r$.
4 Minimizing rate and randomness deficiency
Assume that a destination word $y$ witnesses a point $\langle r, d_x(r)\rangle$ on the distortion-rate graph of $x$ (that is, $d(x,y) = d_x(r)$ and $K(y) \le r$). We will demonstrate that $y$ can be considered a “best-fit” description of $x$, of a quality that is as good as is possible for any destination word $z$ with $K(z) \le r$. We will measure the quality of destination words, as fitting descriptions of $x$, by the randomness deficiency of $x$ in the ball $B_y(d(x,y))$. The randomness deficiency of $x$ in a set $A \subseteq \mathcal{X}$ containing $x$ is defined as
$$\delta(x|A) = \log|A| - K(x|A),$$
where $A \subseteq \mathcal{X}$ in the conditional of $K(x|A)$ is given as the list of elements of $A$ (in the fixed order of $\mathcal{X}$). The following properties of randomness deficiency explain its meaning:
(1) Randomness deficiency is almost non-negative, that is, $\delta(x|A) \ge C$ for some constant $C$ and all $x \in A$. Indeed, every element $x$ of $A$ can be described by its $\log|A|$-bit index in $A$ conditional to $A$. Thus $K(x|A) \le \log|A| + O(1)$.
(2) For all $A$, the randomness deficiency of almost all elements of $A$ is very small: the number of $x \in A$ with $\delta(x|A) > \beta$ is less than $|A|2^{-\beta}$. Indeed, $\delta(x|A) > \beta$ implies $K(x|A) < \log|A| - \beta$. Since there are at most $2^{\log|A|-\beta}$ programs of less than $\log|A| - \beta$ bits, the number of $x$'s satisfying the inequality cannot be larger. Thus, elements of small deficiency form a majority of $A$, which has a simple description conditional to $A$ (there is a program of size about $\log\beta$ enumerating all elements with deficiency at least $\beta$).
(3) Elements with small deficiency belong to all simply described majorities $B \subseteq A$ with, say, $|B| \ge (1-2^{-\beta})|A|$. Assume to the contrary that $x \in A \setminus B$ and $x$ has small randomness deficiency in $A$. Assume furthermore that $B$ is enumerated by a program $p$ that is simple conditional to $A$, say $K(p|A) \le \gamma$. Because $A$ is in the conditional, there is another program $q$ that enumerates $A \setminus B$, with $K(q|A) \le \gamma + O(1)$. Then the randomness deficiency of $x$ in $A$ is large for large $\beta$ and small $\gamma$: $\delta(x|A) > \beta - \gamma$, contradicting the assumption. Indeed, $K(x|A) \le \log|A \setminus B| + K(q|A) \le \log|A \setminus B| + \gamma < \log|A| - \beta + \gamma$. (We omitted logarithmic terms.)
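The counting step behind property (2) can be spelled out in a few lines. The following Python sketch assumes, for simplicity, that $|A| = 2^m$ and that $\beta$ is an integer; the function names are hypothetical.

```python
def programs_shorter_than(m):
    """Number of binary strings (candidate programs) of length < m, i.e. 2**m - 1."""
    return sum(2 ** i for i in range(m))


def bound_on_deficient_elements(log_size, beta):
    """Upper bound on #{x in A : delta(x|A) > beta} when |A| = 2**log_size.

    Every such x has K(x|A) < log_size - beta, so it needs its own program
    of fewer than log_size - beta bits; distinct x need distinct programs.
    """
    return programs_shorter_than(max(log_size - beta, 0))


if __name__ == "__main__":
    log_size, beta = 10, 3
    print(bound_on_deficient_elements(log_size, beta), "<", 2 ** log_size / 2 ** beta)
```

Since $2^{m-\beta} - 1 < |A|2^{-\beta}$, fewer than a $2^{-\beta}$ fraction of the elements of $A$ can have deficiency above $\beta$.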
According to general probabilistic principles, a property represented by a set $A$ is a subset of $A$ that contains a large majority of the elements of $A$. (For example, consider the set of infinite binary sequences, outcomes of infinitely many flips of a fair coin. The subset of sequences $x = x_1x_2\ldots$ satisfying $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{2}$ has full measure and corresponds to a property represented by this set.) On the other hand, every such subset of $A$ can be viewed as a property (of belonging to that subset). Thus, we equate simply described large majorities (of complexity, say, $O(\log n)$) of elements of $A$ of length $n$ with simple properties represented by $A$. We can generalize large majorities to high-probability subsets. (In our example of infinite binary sequences, $n \to \infty$, and hence $\log n \to \infty$, and the “simple” properties turn into “all effective” properties.) Roughly speaking, items (2) and (3) mean that every element with small deficiency in $A$ belongs to the intersection of all simply described large subsets of $A$ (where a description of a subset of $A$ is a program that enumerates its complement given the list of $A$ as input). Thus, $x$ has all simple properties represented by $A$, and, conversely, $A$ represents all simple properties possessed by $x$.
We claim that if $y$ witnesses $d_x(r) = a$, then the randomness deficiency of $x$ in the ball $B_y(a)$ is as small as is possible for balls of complexity at most $r$ (minus a small value). This ball thus represents as many simple properties of $x$ (and of every other source word in the ball with that randomness deficiency) as is possible for balls at this rate. To formulate this rigorously, let the deficiency-rate function $\beta_x(r)$ be defined by
$$\beta_x(r) = \min\{\delta(x|B_z(b)) : z \in \mathcal{Y},\ b \in D,\ K(z) \le r,\ x \in B_z(b)\}.$$
From now on, we use $\delta(x|z,b)$ as a shortcut for $\delta(x|B_z(b))$, and $\delta(x|z)$ denotes $\delta(x|z, d(x,z))$.
Theorem 5. For some $\varepsilon = O(s)$, for all $\mathcal{X}, \mathcal{Y}, d, x$ and $r \le K(x)$ the following holds. If a destination word $y \in \mathcal{Y}$ witnesses $d_x(r+\varepsilon) = a$, that is, $d(x,y) = a$ and $K(y) \le r+\varepsilon$, then
$$\delta(x|y) \le \beta_x(r) + \varepsilon.$$
Moreover, for every $y \in \mathcal{Y}$ we have
$$\delta(x|y) \le \beta_x(r) + \varepsilon + \log\frac{B(d(x,y))}{B(d_x(r+\varepsilon))} + K(y) - r. \qquad (9)$$
Note that Equation (9) gives the bound
$$\delta(x|y) \le \beta_x(r) + \varepsilon + \log\frac{B(d_x(r))}{B(d_x(r+\varepsilon))}$$
for every $y$ witnessing $d_x(r)$ (and not $d_x(r+\varepsilon)$). This bound is useful when the term $\log\frac{B(d_x(r))}{B(d_x(r+\varepsilon))}$ is small. Unfortunately this is not always the case, since $d_x(r)$ can be much greater than $d_x(r+\varepsilon)$. Theorem 2 shows that this is indeed possible (see Fig. 1).
Note that for the destination words $y$ witnessing the rate-distortion curve, instead of the distortion-rate curve, Theorem 5 is not necessarily true. For example, for Hamming distortion, let $x$ be the string of $n$ zeros. The string $y = 0101\ldots01$ witnesses $r_x(n/2)$ close to 0. The randomness deficiency of $x$ in the ball $B_y(n/2)$ is about $n - \varepsilon \approx n$. However, $\beta_x(r_x(0))$ is close to 0, with $r_x(0)$ also close to 0, as witnessed by the string $x$ itself.
Proof of Theorem 5. The proof is based on relating the randomness deficiency of $x$ in a set $A$ with the optimality deficiency of $x$ in $A$, defined by
$$\log|A| + K(A) - K(x).$$
This is the number of extra bits incurred by the two-part code for $x$ using $A$, compared to the optimal one-part code of $x$ using $K(x)$ bits. The randomness deficiency is always less than the optimality deficiency, and the difference between them is equal to
$$\log|A| + K(A) - K(x) - \delta(x|A) = K(A) - K(x) + K(x|A) = K(A|x).$$
We ignore in the proof additive terms of order $O(s)$. The last equality is true by the Symmetry of Information (see [8]). Since $K(A|x)$ is nonnegative, we obtain the inequality
$$\delta(x|A) \le \log|A| + K(A) - K(x), \qquad (10)$$
which explains why a ball $A \ni x$ with minimal radius, and hence minimal cardinality (among balls of complexity at most $r$), can minimize randomness deficiency. To prove the theorem by this argument we need to show that this way to obtain balls of small deficiency is optimal. Formally, this translates to the inequality
$$\log B(d_x(r+\varepsilon)) + r - K(x) \le \beta_x(r) + \varepsilon \qquad (11)$$
for appropriate $\varepsilon = O(s)$. Combining inequalities (11) and (10) with $A = B_y(d(x,y))$, and the inequality $K(A) \le K(y)$, we obtain inequality (9).
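For convenience, the combination of (10) and (11) that yields (9) can be written out step by step. This is our own expansion of the argument, with additive $O(s)$ terms absorbed into $\varepsilon$, and it uses that a ball of radius $b$ has cardinality $B(b)$, as elsewhere in this section.

```latex
\begin{align*}
\delta(x|y) &\le \log B(d(x,y)) + K(y) - K(x)
  && \text{by (10) with } A = B_y(d(x,y)) \text{ and } K(A) \le K(y), \\
-K(x) &\le \beta_x(r) + \varepsilon - \log B(d_x(r+\varepsilon)) - r
  && \text{by (11)}, \\
\delta(x|y) &\le \beta_x(r) + \varepsilon
  + \log\frac{B(d(x,y))}{B(d_x(r+\varepsilon))} + K(y) - r
  && \text{substituting the second line into the first.}
\end{align*}
```

If $y$ witnesses $d_x(r+\varepsilon)$, then $d(x,y) = d_x(r+\varepsilon)$ makes the logarithmic term vanish and $K(y) - r \le \varepsilon$, so the first claim of Theorem 5 follows with $2\varepsilon$ in place of $\varepsilon$, which is still $O(s)$.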
Note that the converse of (10) is not necessarily true: the randomness deficiency can be much less than the optimality deficiency. Here is an example for $\mathcal{X} = \{0,1\}^n$. Let $x$ be a random string ($K(x)$ is close to $n$). Let $A$ be the set of all strings of length $n$ except a string $x'$ that is random and independent of $x$. Then $\delta(x|A) = n - n = 0$, but the optimality deficiency is close to $n + n - n = n$. This set $A$ can be improved by adding the string $x'$. The resulting set $B$ has much smaller complexity and almost the same cardinality. It turns out that this is a rather general situation: for every ball $A \ni x$ with large complexity $K(A|x)$ there is another ball $B \ni x$ of much smaller complexity and the same radius. We need this statement to prove Equation (11). Let us state it rigorously:
Lemma 2. For every ball $A \ni x$ there is a ball $B \ni x$ of the same radius $b$ as $A$, with $K(B) \le K(A) - K(A|x)$. (We ignore here additive terms of order $O(s)$.)
Proof. Let $N$ be the number of balls of radius $b$ and complexity at most $K(A)$ covering $x$. We claim that $\log N \ge K(A|x)$ (in the proof we ignore additive terms of order $O(s)$). Indeed, given $x$ and $K(A)$ we can generate all the balls of radius $b$ and complexity at most $K(A)$ covering $x$. Thus we can describe $A$ by its index among the generated balls, and $K(A|x) \le \log N$. Applying Theorem 6 below to the family of balls of radius $b$, with $r = K(A)$ and $k = \lfloor \log N \rfloor$, we conclude that there is a ball $B \ni x$ of radius $b$ with $K(B) \le K(A) - k \le K(A) - K(A|x)$.
Apply Lemma 2 to a ball $A = B_z(b) \ni x$ minimizing $\delta(x|z,b)$ subject to $K(z) \le r$. By Lemma 2 there is a ball $B \ni x$ of radius $b$ with $K(B) \le K(A) - K(A|x)$ and hence
$$K(B) + \log|B| - K(x) \le K(A) - K(A|x) + \log|B| - K(x) = K(A) - K(A|x) + \log|A| - K(x) = \delta(x|A) = \beta_x(r) \qquad (12)$$
(we ignore additive terms of order $O(s)$). To prove Equation (11) we need to show that the optimality deficiency has the following property:
Lemma 3. If $d_x(i) = a$, then $i + \log B(a)$ is less than $K(u) + \log B(b) + O(s)$ for every ball $B_u(b) \ni x$ with $K(u) \le i$. (Here the inequality $K(u) \le i$ is understood literally, without hidden $O(s)$ terms.)
Assuming that the lemma is true (we prove it later), we prove Equation (11) as follows:
With $A$ and $B$ as in (12), let $u$ be the first code word in $\mathcal{Y}$ with $B = B_u(b)$ (note that a ball might have many centers). Then
$$K(u) \le K(B) \le K(A) - K(A|x) \le K(A) \le K(z).$$
Recall that these inequalities hold to within an additive $O(s)$ term. We have required that $K(z) \le r$. Therefore, for appropriate $\varepsilon = O(s)$ these inequalities imply that $K(u) \le r + \varepsilon$. Thus we can apply Lemma 3 for $i = r + \varepsilon$, use (12), and conclude that
$$r + \log B(a) - K(x) \le K(u) + \log B(b) - K(x) = K(B) + \log|B| - K(x) \le \beta_x(r).$$
Proof of Lemma 3. Let $u, b$ satisfy its conditions. Since $a$ is the minimal possible radius of a ball containing $x$ with center of complexity at most $i$, we can conclude that $b \ge a$. By definition of $\alpha$ there is $c < a$ with $B(a)/B(c) \le \alpha$.¹ By Item (iii) in Theorem 1, there is a destination word $v$ such that $d(x,v) \le c$ and
$$K(v) \le K(u) + \log(B(b)/B(c)) = K(u) + \log(B(b)/B(a)),$$
where we ignore $O(s)$ terms. Since $d(v,x) < a$ we know that
$$i < K(v) \le K(u) + \log(B(b)/B(a)),$$
and we are done.
In the proof of Lemma 2 we have used Theorem 6 below. We state it separately, since the theorem is interesting in its own right. Previously an analog of this theorem was known for exact descriptions: if a string $x$ has at least $2^k$ descriptions of length at most $r$ (with respect to an optimal description method), then $K(x) \le r - k + O(\log k + \log r)$ (Ex. 4.3.8 in [8]). It was also known in the case of the list decoding distortion [14]: if a binary string $x$ belongs to at least $2^k$ sets $A$ of a fixed cardinality $2^c$ and complexity $K(A) \le r$, then $x$ belongs to a set $B$ of cardinality $2^c$ and complexity $K(B) \le r - k + O(\log r + \log k + \log c)$. The difference is that in the first case there is no restriction on description syntax, and in the second case the descriptions are restricted to be finite sets of fixed cardinality. In the next theorem we restrict the description format further to allow an arbitrary set family (for example, Hamming balls).
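Theorem 6. Let $\mathcal{A}$ be a family of subsets of $\mathcal{X}$. If $x \in \mathcal{X}$ is covered by at least $2^k$ sets $A \in \mathcal{A}$ with $K(A) \le r$, then $x$ belongs to a set $A$ in $\mathcal{A}$ with $K(A) \le r - k + \varepsilon$, where $\varepsilon = O(\log k + \log r + \log\log|\mathcal{X}| + K(\mathcal{A}) + K(\mathcal{X}))$.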
Proof. The statement of the theorem easily follows from its combinatorial version. Consider a game between two players, called Producer (P) and Consumer (C). The game consists of alternating moves by the players, each making $2^r$ moves, starting with P's move. A move of P consists in producing a subset of $\mathcal{X}$. A move of C consists in marking some sets previously produced by P (the number of marked sets can be 0). There are two versions of the game: the on-line version and the off-line one. In the on-line game, C wins if, following every one of his moves, every $x \in \mathcal{X}$ that is covered at least $2^k$ times by P's sets belongs to a marked set. In the off-line game C wins if this condition holds after his last move. Consumer can easily win (in both games) if he marks every set produced by P. However, we are interested in minimizing the total number of marked sets.
Footnote 1 (to the proof of Lemma 3): If $a = 0$, then the statement is true, since $K(u) + \log B(b) \ge K(x) \ge r$.
Lemma 4. In the off-line game, Consumer has a winning strategy that marks at most $2^{r-k}\log|\mathcal{X}|$ sets. In the on-line game, Consumer has a winning strategy that marks at most $r^2 2^{r-k}\log|\mathcal{X}|$ sets.
Proof. Off-line case: We show that there is a selection of $2^{r-k}\log|\mathcal{X}|$ sets produced by P that cover all $x \in \mathcal{X}$ that are covered by at least $2^k$ sets produced by P. Choose at random $2^{r-k}\log|\mathcal{X}|$ of P's sets. Let $x \in \mathcal{X}$ be covered by at least $2^k$ sets produced by P. Then the probability that $x$ is not covered by the chosen sets is at most
$$(1 - 2^{k-r})^{2^{r-k}\log|\mathcal{X}|} \le e^{-\log|\mathcal{X}|} < 1/|\mathcal{X}|$$
(the last inequality holds because the logarithm is binary). Multiplying this upper bound by $|\mathcal{X}|$ we get less than 1. Therefore, there is a selection of $2^{r-k}\log|\mathcal{X}|$ sets produced by P that covers all $x \in \mathcal{X}$ with multiplicity $2^k$ or more.
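The probabilistic argument for the off-line case can be tried out on a small random instance. The following Python sketch is only an illustration under our own assumptions: Producer's sets are drawn at random, the chosen sets are sampled independently with replacement, $\log$ is binary and rounded up, and the draw is repeated until it succeeds. None of the names come from the paper.

```python
import math
import random


def heavy_elements(sets, threshold):
    """Elements covered by at least `threshold` of the given sets."""
    counts = {}
    for s in sets:
        for x in s:
            counts[x] = counts.get(x, 0) + 1
    return {x for x, c in counts.items() if c >= threshold}


def random_cover(sets, universe_size, k, rng, max_attempts=1000):
    """Pick at most 2**(r-k) * ceil(log2 |X|) sets covering every element of
    multiplicity >= 2**k, where 2**r >= len(sets).

    Each attempt samples set indices independently; by the union bound in the
    text an attempt succeeds with positive probability, so we simply retry.
    """
    r = math.ceil(math.log2(len(sets)))
    budget = 2 ** max(r - k, 0) * math.ceil(math.log2(universe_size))
    heavy = heavy_elements(sets, 2 ** k)
    for _ in range(max_attempts):
        chosen = {rng.randrange(len(sets)) for _ in range(budget)}
        covered = set().union(*(sets[i] for i in chosen))
        if heavy <= covered:
            return sorted(chosen)
    raise RuntimeError("no cover found; increase max_attempts")


def demo(seed=0):
    """Hypothetical instance: 256 random subsets of a 64-element universe, k = 5."""
    rng = random.Random(seed)
    universe_size, num_sets, k = 64, 256, 5
    sets = [frozenset(x for x in range(universe_size) if rng.random() < 0.2)
            for _ in range(num_sets)]
    return sets, random_cover(sets, universe_size, k, rng)


if __name__ == "__main__":
    sets, chosen = demo()
    print(len(chosen), "of", len(sets), "sets marked")
```

In this instance the budget is $2^{8-5}\cdot 6 = 48$ marked sets out of 256, in line with the bound of Lemma 4 for the off-line game.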
On-line case: Consumer simultaneously uses $r$ strategies, denoted by $j = 1, 2, \ldots, r$. Strategy number $j$ works as follows. Divide the sequence of Producer's sets into $2^{r-j}$ segments of $2^j$ sets. After receiving each segment, cover all the $x \in \mathcal{X}$ of multiplicity $\ge 2^k/r$ in that segment by marked sets. From the off-line case we know that it suffices to use $2^{j-k} r\log|\mathcal{X}|$ marked sets per segment. Since there are $2^{r-j}$ segments (for fixed $j$), the total number of marked sets C needs to use is $2^{j-k} 2^{r-j} r\log|\mathcal{X}| = 2^{r-k} r\log|\mathcal{X}|$ (for fixed $j$). Summing over all $j$, this comes to $2^{r-k} r^2\log|\mathcal{X}|$ marked sets.
We claim that after every move $t = 1, \ldots, 2^r$ of C, each $x \in \mathcal{X}$ of multiplicity $2^k$ belongs to a marked set. Assume to the contrary that there is an $x$ that has multiplicity $2^k$ following step $t$ of C, and $x$ belongs to no set marked on step $t$ or earlier. Let $t = 2^{j_1} + 2^{j_2} + \ldots$, where $j_1 > j_2 > \ldots$, be the binary expansion of $t$. The element $x$ has multiplicity less than $2^k/r$ in the first segment of $2^{j_1}$ of P's sets, less than $2^k/r$ in the next segment of $2^{j_2}$ sets, and so on. Thus its total multiplicity among the first $t$ sets is less than $r 2^k/r = 2^k$. The contradiction proves the claim.
Let us finish the proof of the theorem. Given $\mathcal{A}$, $\mathcal{X}$ and $k, r$, enumerate the sets in $\mathcal{A}$ of complexity at most $r$. Using the on-line strategy of Lemma 4, mark at most $2^{r-k} r^2\log|\mathcal{X}|$ of the generated sets so that they cover all the strings of multiplicity $2^k$, i.e., those that are covered $2^k$ times by the generated sets. The complexity of each marked set is at most the logarithm $r - k + 2\log r + \log\log|\mathcal{X}|$ of the number of marked sets plus the amount of information needed to run the algorithm. The latter is $O(\log r + \log k + K(\mathcal{X}) + K(\mathcal{A}))$.
5 Relation with Shannon’s notion
Recall the initial paragraphs of Section 2.1. We generalize the approach from Bernoulli random variables to more general random variables. We define $r^n(a)$ from the rate-distortion profile of the $n$-fold random variable $X = X_1, \ldots, X_n$, analogously to $r(a)$ in Equation 1. Interchanging the independent and dependent variables, we obtain the distortion-rate function $d^n(r)$, resulting in precisely the same curve in the $(d,r)$-plane. We now consider the average minimal expected distortion $d^n(r)/n$, that is, the part per random variable $X_i$, the expectation taken over the distribution $P(X = x)$. In this general case we have no proof that $\lim_{n\to\infty} d^n(r)/n$ exists. Instead, we set $D(R) = \liminf_{n\to\infty} d^n(nR)/n$. We can now treat the relation between the expected value of $\frac{1}{n}d^n_x(nR)$, the expectation taken over the distribution $f(x) = P(X = x)$, and $D(R)$.
Theorem 7. Assume the discussion above. Let $d^n$ be a recursive distortion measure and let $f(x)$ be a recursive probability mass function. Then
$$E\,d^n_x(r+\varepsilon) \le d^n(r) \le E\,d^n_x(r),$$
with $\varepsilon = K(\mathcal{X}, \mathcal{Y}, f, d, n, r) + O(1)$, the expectations taken over $f$.
Proof. Left inequality: By definition,
$$d^n(r) = \min_{Y' \subseteq \mathcal{Y},\ |Y'| \le 2^r}\ \min_{y : \mathcal{X} \to Y'}\ \sum_{x \in \mathcal{X}} f(x)\, d(x, y(x)). \qquad (13)$$
Given $n$, $r$ and programs to compute $\mathcal{X}, \mathcal{Y}, d, f$, we can compute an optimal assignment of (13). Then $K(y(x)) \le r + K(\mathcal{X}, \mathcal{Y}, f, d, n, r) + O(1)$ ($1 \le i \le k$), and we have $d^n_x(r + K(\mathcal{X}, \mathcal{Y}, f, d, n, r) + O(1)) \le d(x, y(x))$. Therefore, $E\,d^n_x(r + K(\mathcal{X}, \mathcal{Y}, f, d, n, r) + O(1)) \le d^n(r)$, the expectation taken over $f$.
Right inequality: By definition, $E\,d^n_x(r) = \sum_{x \in \mathcal{X}} f(x)\, d^n_x(r)$, with each $y$ witnessing $d_x(r)$ for some $x \in \mathcal{X}$ having $K(y) \le r$. There are at most $2^r$ of these $y$'s, but not every subset $Y' \subseteq \mathcal{Y}$ of cardinality $\le 2^r$ contains only $y$'s with $K(y) \le r$. Hence the $y$'s involved in $d^n_x(r)$ for the different $x$'s are more restricted than the ones involved in $d^n(r)$, which shows $E\,d^n_x(r) \ge d^n(r)$.
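For very small instances, the expected distortion-rate function of definition (13) can be evaluated by exhaustive search. The Python sketch below is a toy illustration under our own assumptions: $\mathcal{X} = \mathcal{Y} = \{0,1\}^3$, Hamming distortion, and the uniform distribution in place of $f$. For a fixed codebook $Y'$ the inner minimum over assignments $y : \mathcal{X} \to Y'$ is attained by mapping each $x$ to its nearest code word. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def hamming(x, y):
    """Hamming distance between two equal-length tuples of bits."""
    return sum(a != b for a, b in zip(x, y))


def expected_distortion_rate(words, prob, distortion, r):
    """Brute-force value of (13): the least expected distortion achievable
    with a codebook of at most 2**r code words drawn from `words`.

    Adding code words never hurts, so only the largest allowed size is tried.
    """
    size = min(2 ** r, len(words))
    best = None
    for codebook in combinations(words, size):
        cost = sum(prob[x] * min(distortion(x, y) for y in codebook)
                   for x in words)
        if best is None or cost < best:
            best = cost
    return best


def uniform_cube(n):
    """All binary words of length n with the uniform distribution (toy source)."""
    words = list(product((0, 1), repeat=n))
    prob = {x: Fraction(1, len(words)) for x in words}
    return words, prob


if __name__ == "__main__":
    words, prob = uniform_cube(3)
    for r in range(4):
        print(r, expected_distortion_rate(words, prob, hamming, r))
```

The search ranges over all codebooks of the allowed size, with no restriction on the complexity of the code words; this is exactly the freedom that the right inequality of Theorem 7 exploits.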
Corollary 3. It follows from the above theorem that, for recursive $d, f$ and fixed $R$, we have
$$\liminf_{n\to\infty} \frac{1}{n} E\,d^n_x(nR+\varepsilon) \le D(R) \le \liminf_{n\to\infty} \frac{1}{n} E\,d^n_x(nR),$$
with $\varepsilon = K(\mathcal{X}, \mathcal{Y}, f, d, n, R) + O(1)$, for outcomes $x = x_1 \ldots x_n$ of i.i.d. random variables $X_i = x_i$ with $x_i \in \mathcal{A}$ for $1 \le i \le n$, the expectation taken over $f(x) = P(X_i = x_i, i = 1, \ldots, n)$.
6 Computability
Given a recursive distortion measure, both $d_x$ and $r_x$ are upper semi-computable, but not computable up to any significant degree of precision. The function $\beta_x$ is not even semi-computable (neither upper nor lower) up to any significant degree of precision, but it is computable using an oracle for the halting problem. Thus, while we can upper semi-compute (as the limit of a computation) a destination word $y$ witnessing $d_x(r)$, with $y$ also witnessing $\beta_x(r)$, we cannot semi-compute $\beta_x$ explicitly. This shows that minimizing distortion is the proper computational approach: it is computationally feasible, in contrast to optimizing best-fit, and nonetheless the destination word optimizing both is delivered in the limit by the former computation process.
The definitions and proofs can be found in [14] for the special case of list decoding distortion. The upper semi-computability follows by the same argument in the general case as in the particular case, and the uncomputability of the special case holds a fortiori for the general case.
References
[1] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, NJ, 1971.
[2] T. Berger, J.D. Gibson, Lossy source coding, IEEE Trans. Inform. Th., 44:6(1998), 2693–2723.
[3] H. Buhrman, H. Klauck, N.K. Vereshchagin, and P.M.B. Vitányi, Individual communication complexity. In Proc. 21st Symp. Theoret. Aspects of Comput. Sci., Lecture Notes in Computer Science, Vol. 2996, Springer-Verlag, Berlin, 2004, 19–30.
[4] P. Elias, List decoding for noisy channels, Wescon Convention Record, Part 2, Institute for Radio Engineers (now IEEE), 1957, 94–104.
[5] P. Elias, Error-correcting codes for List decoding, IEEE Trans. Inform. Th., 37:1(1991), 5–12.
[6] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7.
[7] A.N. Kolmogorov, Complexity of Algorithms and Objective Definition of Randomness. A talk at Moscow Math. Soc. meeting 4/16/1974. An abstract available in Uspekhi Mat. Nauk, 29:4(1974), 155; English translation in [14].
[8] M. Li and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, 1997. 2nd Edition.
[9] J. Muramatsu, F. Kanaya, Distortion-complexity and rate-distortion function, IEICE Trans. Fundamentals, E77-A:8(1994), 1224–1229.
[10] S. Salnikov. Kolmogorov complexity of initial segments of binary sequences. Manuscript, 2004.
[11] C.E. Shannon. The mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
[12] C.E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In IRE National Convention Record, Part 4, pages 142–163, 1959.
[13] D.M. Sow, A. Eleftheriadis, Complexity distortion theory, IEEE Trans. Inform. Th., 49:3(2003), 604–608.
[14] N.K. Vereshchagin and P.M.B. Vitányi, Kolmogorov’s structure functions and model selection, IEEE Trans. Inform. Theory, 50:12(2004), 3265–3290.
[15] J.M. Wozencraft, List decoding, Quarterly Progress Report, Research Laboratory for Electronics, MIT, Vol. 58(1958), 90–95.
[16] E.-H. Yang, S.-Y. Shen, Distortion program-size complexity with respect to a fidelity criterion and rate-distortion function, IEEE Trans. Inform. Th., 39:1(1993), 288–292.
[17] J. Ziv, Distortion-rate theory for individual sequences, IEEE Trans. Inform. Th., 26:2(1980), 137–143.