
Randomness as incompressibility


Aside from pure probability theory, random sequences were thoroughly studied in the mid twentieth century in the realm of Information Theory, a branch of science devoted to studying the information content of messages and initially developed by C.E. Shannon and D.A. Huffman. Following their approach, the entropy of a source was introduced as a measure of the degree of uncertainty with which the messages generated by a source may appear.

Without getting into mathematical definitions, we notice that if all messages have the same probability of occurring the source has maximal entropy, while if one message occurs with probability one, i.e., with complete certainty, and all the other messages never appear, then the entropy of the source is zero. That is consistent with the observation that no new information can be extracted from a source whose output is known a priori.
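As a minimal illustration, Shannon's standard formula H = -Σ p·log2(p) can be evaluated directly; a short sketch in Python showing the two extreme cases just described:

    import math

    def entropy(probs):
        # Shannon entropy in bits: H = -sum of p * log2(p), skipping zero terms
        return sum(-p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform source: maximal entropy, 2.0 bits
    print(entropy([1.0, 0.0, 0.0, 0.0]))      # certain source: zero entropy, 0.0 bits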

All messages are represented with the symbols of an arbitrary alphabet. They become binary sequences when representing coin tosses; or musical notes on a song score; or sentences in the Latin alphabet as in this text. So messages, mathematical definitions, and computing algorithms are all citizens of the same world of sequences once a common alphabet is fixed, and since this alphabet is arbitrary we will refer to the binary one. On these grounds Information Theory took an unexpected direction in the 1960s with the computational approach independently adopted first by Solomonoff in the U.S.A., then by Kolmogorov in Russia and Chaitin in the U.S.A. again. The concept of algorithmic complexity of a sequence arose, leading to a new characterization of randomness that is no longer a consequence of the source but depends entirely on the inner nature of the sequence. As the name suggests these studies depend strictly on the theory of algorithms that had begun to develop in the preceding decades. This is the new tool by which many intuitions of the past were rephrased in solid mathematical terms.6

6 The characterization of random sequences has essentially followed the three intersecting lines of stochasticness, incompressibility, and typicality, all strictly connected with the theory of algorithms. Here we refer to the second line, the one most closely related to computer science. A rigorous study of randomness is hard - see the bibliographical notes for references.

The reference system is strictly mathematical and quite sophisticated. Computations are ideally performed on Turing Machines that, as we have already seen, may execute any of an infinite set of algorithms. In practical terms the machine can be any computer if no limitation is put on the memory size, and the algorithms may be coded in any programming language. A point to be remembered is that, in any family of machines, there always exists at least one that is universal in the sense that it can simulate the computation of any other machine executing any one of its programs. In the real world any reasonable computer accepting programs written in a general-purpose programming language such as Java or C is universal, provided its memory is large enough to execute any program at hand.
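In this spirit, any interpreter of a general-purpose language is itself universal: handed the text of another program as data, it reproduces that program's computation. A minimal sketch in Python (an illustration under this loose reading, not a formal construction):

    def universal(program: str) -> None:
        # The interpreter plays the role of the universal machine Su:
        # 'program' is just a sequence of symbols, executed as code.
        exec(program)

    universal("print('01' * 3)")  # simulates a machine that prints 010101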

Sequences to work on, machines, and algorithms are described with the characters of the same (binary) alphabet. Each sequence may be generated by any machine with an elementary algorithm that contains the message itself and a command to output it. But of course a more skilled algorithm might exist to reconstruct the sequence by using a smarter computation. We can apply these concepts informally to the two sequences of Figure 5.2. The first one may be generated by a skilled "program" such as:

put{expand (01)^15 followed by 0}    (5.1)

while for the second we would probably opt for the naive solution:

put{1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 0 1 1 1}    (5.2)

where the whole sequence is explicitly shown, because computing a prefix of the binary expansion of π is definitely more complicated.
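Rendered in Python (a sketch; the put/expand notation above is the book's informal pseudocode), the two programs might read:

    # Program (5.1): a short rule suffices for the regular sequence.
    def skilled():
        return "01" * 15 + "0"

    # Program (5.2): the irregular sequence is spelled out in full.
    def naive():
        return "1010100011001011011001100000111"

However far the pattern is extended, the first function keeps essentially the same size, while the second grows with the sequence itself.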

Once the reference system has been chosen, a generating algorithm has nothing to do with the source of the sequence. Moreover different algorithms may generate the same sequence.7 On these grounds Kolmogorov and Chaitin defined the algorithmic complexity of a sequence S as the length of the shortest algorithm that generates S (i.e., the length of the sequence representing such an algorithm). From this it followed that: a random sequence is one having complexity "substantially" equal to its length (i.e., the length of the naive program above). Conversely: a sequence is nonrandom if it admits a generating algorithm shorter than the sequence itself.

Put this way, it seems too simple. In fact the informal statements above convey ideas that stand behind a complex mathematical theory. Before going into more depth, however, we note that a sequence like the first one above, possibly extended to a large number 2n + 1 of bits, is now declared non random because the generating program:

put{expand (01)^n followed by 0}    (5.3)

is certainly shorter than the sequence itself for large enough n.8 Laplace suggested that such a sequence is non random because, more than likely, it was generated by a non random source. The new definition, instead, declares that the sequence is non random on the sole basis of its structure: it could have been generated by a purely random source, although with very low probability, but it must still be considered non random.
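The log n + c estimate of footnote 8 can be observed directly by measuring the text of such a program; a small Python sketch (using source-code length as a stand-in for program length):

    # The description grows only with the digits of n (about log n),
    # while the generated sequence has length 2n + 1.
    def description(n: int) -> str:
        return f"print('01' * {n} + '0')"

    for n in (10, 1000, 1000000):
        print(len(description(n)), 2 * n + 1)  # description vs sequence length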

The second sequence of Figure 5.2 is too short for drawing a firm conclusion. However, if the sequence is extended to represent n binary digits of π, for n sufficiently large, a program to compute it would be definitely shorter than the sequence itself. Then even the second sequence is non random, as may be expected from its definition.9 But then, do random sequences exist under this definition? An affirmative answer will be given shortly through a schematic introduction to Kolmogorov and Chaitin's theory. Readers interested only in a general picture of the field may jump to the discussion on data compression in the next section, and perhaps return here later.

7 In fact a sequence can be generated by an infinite number of algorithms because, once one such algorithm is found (e.g., the naive algorithm that specifies the sequence explicitly), another algorithm can be formed from the former one with the addition of a statement that has no effect on the output, and so on.

8 The length of this program is of order log n + c, where c is a constant, because the only part that depends on n is n itself, appearing as an exponent, and the binary representation of n requires log n bits.

9 The length of this program would also be of order log n + c, with c constant, because the only part that depends on n is the specification of the number of output digits required.

Enumerate all the binary sequences σ0, σ1, σ2, ... in canonical order as explained in Chapter 4 (for clarity we repeat this ordering here):

σ0   σ1   σ2   σ3   σ4   σ5   σ6    σ7    ...   σ13   σ14    ...
0    1    00   01   10   11   000   001   ...   111   0000   ...        (5.4)
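This canonical enumeration is easy to mechanize; a Python sketch (using the standard trick of writing i + 2 in binary and dropping the leading 1):

    def sigma(i: int) -> str:
        # bin(i + 2) gives '0b1...'; stripping '0b' and the leading 1
        # yields the i-th sequence in canonical order.
        return bin(i + 2)[3:]

    print([sigma(i) for i in range(8)])  # ['0', '1', '00', '01', '10', '11', '000', '001']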

Now consider a (possibly infinite) family S0, S1, S2, ... of computing systems, and let p be a program for Si. We write Si(p) = σ to indicate that the computation of Si with program p produces the sequence σ. Recall also that p is represented by a binary sequence, and let |p|, |σ| denote the lengths (number of bits) of the two sequences. At the basis of the theory stands the concept of the algorithmic complexity of σ, first defined with respect to a computing system Si as:

Ki(σ) = min{|p| : Si(p) = σ}.    (5.5)

That is, the complexity of σ is the length of the shortest program that generates this sequence using Si. What makes this definition not so interesting is its dependence on Si. This seems inevitable at first glance because, in general terms, a program is written for a specific machine. However, a brilliant invariance theorem proves that we can get rid of such dependence if we refer to a universal system Su that simulates the functioning of all the others with essentially the same complexity. Without getting into mathematical details, for which the reader may refer to the bibliographical notes, the theorem states that, for any system Si:

Ku(σ) ≤ Ki(σ) + k, where k is a constant.    (5.6)

Any program p that generates σ on Su is called a description of σ. By definition Ku(σ) is the length of the shortest description of σ and, after relation (5.6), is taken as the algorithmic complexity of the sequence with precision up to a constant. This value is now simply denoted K(σ) and often called the Kolmogorov complexity of the sequence. It is also customary to say that this complexity represents the information content of σ.

At this point a concept related to the Internet steps in. Having put any sequence σ in relation with a program p that generates σ, we may decide to transmit this sequence from a node A to a node B, installing proper software into the two nodes so that σ is "coded" in A as p, transmitted to B in this form, and then "decoded" in B obtaining σ again. Clearly we can hope to gain something only if |p| < |σ|. In this case we talk of data compression.

While specific compression techniques are used in practice, the concept of algorithmic complexity is crucial for understanding what is possible.
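A real compressor gives a hands-on upper bound in this spirit: the compressed form, plus the fixed-size decompressor, is a description of σ. A sketch with Python's standard zlib module (an upper bound only; the true algorithmic complexity may be far smaller):

    import os
    import zlib

    regular = ("01" * 50000).encode()   # a highly regular 100,000-symbol sequence
    irregular = os.urandom(100000)      # 100,000 (pseudo-)random bytes

    print(len(zlib.compress(regular)))    # a few hundred bytes: very compressible
    print(len(zlib.compress(irregular)))  # about 100,000: essentially incompressible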

Although the expansion of π has many statistical characteristics of a random string, it is obviously non random.

Recalling that σ can always be generated by a trivial program that contains σ inside and simply transfers it to the output, and noting that the length of this program is |σ| + c, where c is the constant number of bits needed to code the output operation (e.g., the word put plus the two braces {} in program (5.2)), we conclude that:

K(σ) ≤ |σ| + c.    (5.7)

Of course K(σ) is smaller than this limit value if there is a better program to generate σ, and in particular it may be much smaller than |σ|. The question is: how many sequences can actually be compressed, and by how much? Now Laplace's intuition on the scarcity of "regular combinations" will be confirmed.

It turns out that there are not enough short descriptions for all sequences, no matter how long such sequences may be. In fact there are 2^n binary sequences of length n, and 1 + 2 + ... + 2^(n−1) = 2^n − 1 binary sequences of length at most n−1; hence there is at least one sequence σ of length n without a shorter description. This σ is incompressible, but this is not the end of the story.

There are 1 + 2 + ... + 2^(n−2) = 2^(n−1) − 1 = 2^n/2 − 1 binary sequences of length at most n−2, hence about one half of the sequences of length n cannot be compressed to fewer than n−1 bits. Similarly about 3/4 of the sequences of length n cannot be compressed to fewer than n−2 bits, and so on. Out of all possible sequences, the compressible ones are a small minority.10

10 In particular any sequence representing a program p that constitutes a minimal description of a sequence σ, i.e., K(σ) = |p|, is incompressible. Otherwise p would admit a shorter description p′ that in turn would allow us to reconstruct σ, against the hypothesis that p is minimal for σ.
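The same count can be written for any number k of saved bits (a restatement of the argument just given, not an addition to it):

    descriptions of length at most n − k:   1 + 2 + ... + 2^(n−k) = 2^(n−k+1) − 1 < 2^(n−k+1)
    fraction of the 2^n sequences of length n with K(σ) ≤ n − k:   < 2^(n−k+1) / 2^n = 2^(1−k)

For k = 2 this gives the one-half bound above, for k = 3 the three-quarters bound, and so on.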

It is now quite natural to call a sequence σ that cannot be compressed, that is K(σ) ≥ |σ|, random. However there are subtle theoretical reasons indicating that a strictly valid definition of randomness can be given only for infinite sequences; furthermore, pure incompressibility is too demanding a property on practical grounds, because there is no interest in compressing a sequence of n bits to, say, not less than n−1 bits. Therefore, for a parameter f > 0 that may be a constant or a function of n, σ is called f-random if it cannot be compressed to fewer than n − f bits, that is K(σ) ≥ |σ| − f. The logarithmic function is a standard choice for f, so we loosely say:

A finite sequence σ is random if and only if K(σ) ≥ |σ| − ⌈log2 |σ|⌉.    (5.8)

From what we have seen about the number of compressible sequences, the statement above is an implicit proof that random sequences do exist, and that they are in the vast majority among all possible sequences.
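As a predicate, definition (5.8) reads as follows in Python (a sketch only: K itself is not computable, so any practical use must plug in an upper-bound estimate for it, such as a compressed length):

    import math

    def is_random(sigma: str, K) -> bool:
        # Test (5.8): sigma is random iff K(sigma) >= |sigma| - ceil(log2 |sigma|)
        n = len(sigma)
        return K(sigma) >= n - math.ceil(math.log2(n))

Passing a compressor-based bound for K can only ever demonstrate nonrandomness, never certify randomness, in line with the undecidability result below.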

Taking these concepts intuitively, imagine that our world of sequences is that of well-formed sentences in English, and the formal system at hand is a series of commands to locate different sentences within the books of a vast library. Nobody has ever thought that the sentence of fifty-three characters:

It is an ancient Mariner and he stoppeth one of three

was created at random, although this may have happened with the minuscule probability of 1/27^53 after the draw of fifty-three characters from an alphabet of twenty-seven symbols including the blank. In our system the sequence has to be considered non random because it can be uniquely determined by a short description like A335.h15.1, v.1-2, which gives the library accession number of the poem "The Rime of the Ancient Mariner" by Samuel Taylor Coleridge, with the addition of "v.1-2" to indicate the first two verses. By the same reasoning the whole poem is non random, but the sequence of the first two words "It is" must be taken as random, since it has no shorter description than the one that explicitly spells it out.

Statement (5.8) defines randomness independently of a generating cause. This is no more than an arbitrary definition, but it turns out to be very important in probability theory where, incidentally, it leads to the same conclusions as other reasonable models. In particular it has been proved in the realm of information theory that, as the length of the sequences tends to infinity, the Kolmogorov complexity of a sequence tends to the value of the source entropy. This is a nice touch for assessing the soundness of both theories. Furthermore this definition of randomness gives a mathematical sense to Laplace's assertion that we are led to believe that a regular (now: compressible) sequence σ of length n was generated by a specific rule (now: a short description p) rather than randomly. We are in the case K(σ) = |p| < n. As 2^n sequences may be generated at random, and only 2^|p| of them can be generated by a rule of length |p|, it is 2^(n−|p|) times more likely that σ was generated by a non random cause.

The existence of random sequences is good news for many fields of applied science where such sequences are needed for simulating physical events, building cryptographic keys, etc. But, as in many jokes, after the good news comes the bad news. Mathematical logic puts a theoretical limit on the discovery of random sequences, because the problem of establishing whether an arbitrary sequence is random is undecidable. That is, if a short description is found we conclude that the sequence is nonrandom; if such a description is not found, we will never know whether it does not exist or we were just unable to find it. Readers with an interest in computability may enjoy a rigorous proof of this fact, among the most elegant in the field. We show here the structure of this proof, built on the so-called Berry's paradox on integers of 1908.

For the sequences σ0, σ1, σ2, ... listed in (5.4), assume that there exists a program R(i) to decide whether an arbitrary sequence σi is random. Take now the new program P of Figure 5.3:

    program P
        for i = 1 to ∞
            if {(R(i) = true) ∩ (|σi| > |R| + |P|)}
                print(i) and stop;

FIGURE 5.3: A contradictory program to prove that randomness cannot be decided algorithmically.

Although the existence of R is merely assumed, the length |R| + |P| of the whole program is independent of the sequences σi treated at each iteration, no matter what R looks like. Since such sequences are treated in order of increasing length as in the listing (5.4), the condition |σi| > |R| + |P| must be verified from a certain value of i on. Furthermore we have seen that random sequences of any length exist, so the stop condition of P must eventually be met on a sequence σi that is both random (R(i) = true) and longer than the program that allows it to be detected (|σi| > |R| + |P|): a contradiction. We must conclude that program R cannot exist, i.e., it is impossible to decide randomness algorithmically.
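For concreteness, Figure 5.3 transcribes into Python as follows (a sketch; the decision procedure R is left as a parameter precisely because, as just proved, it cannot exist, and sigma is the enumeration function sketched earlier):

    def P(R, sigma, length_bound):
        # Search for the first i whose sequence R declares random and whose
        # length exceeds the fixed bound |R| + |P|; printing such an i would
        # be a short description of a random sequence: a contradiction.
        i = 0
        while True:
            if R(i) and len(sigma(i)) > length_bound:
                print(i)
                return
            i += 1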

This argument rules out the possibility of deciding whether a given sequence σ is random as long as a shorter description of it has not been found.11 However the use of random sequences is important in applied science, and therefore in most cases they are actually generated by computer programs. Since these programs are inevitably shorter than the sequences themselves, the sequences are non random by definition, but they are useful anyway if they pass some standard suite of statistical tests. In this case the sequences must be more correctly called pseudo-random.
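For instance, a linear congruential generator, among the simplest pseudo-random generators, produces an arbitrarily long bit sequence from a program of fixed size plus a seed; a sketch (the multiplier and increment are the widely used "Numerical Recipes" constants):

    def lcg_bits(seed: int, n: int) -> str:
        # A fixed-size program plus a seed generates n 'random-looking' bits,
        # so the output is compressible by definition: pseudo-random, not random.
        x, bits = seed, []
        for _ in range(n):
            x = (1664525 * x + 1013904223) % 2**32
            bits.append(str(x >> 31))  # keep the top, best-mixed bit
        return "".join(bits)

    print(lcg_bits(seed=42, n=64))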
