In the previous section we checked that the block entropy estimator H_est^(2)(k) given by formula (24) returns quite good estimates for Bernoulli processes and persistently provides a lower bound of the true block entropy. In this section we would like to apply this estimator to some real data, namely texts in natural language.
As we have mentioned, there have been some attempts to estimate the block entropy of texts in natural language from frequencies of blocks [10–12]. These attempts were quite heuristic, whereas we now have an estimator of block entropy that may work for a certain class of processes.
Let us recall that estimator H_est^(2)(k) works under the assumption that the process is properly skewed, which holds, e.g., if the distribution of strings of a fixed length is skewed towards less probable values. In fact, in texts in natural language the empirical distribution of orthographic words, which are strings of varying length, is highly skewed in the required direction, as described by Zipf's law [30]. Hence we may suppose that the hypothetical process generating texts in natural language is also properly skewed. Consequently, the estimates returned by H_est^(2)(k) on natural language data should be smaller than the true block entropy.
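To illustrate the kind of skew that Zipf's law describes, the following sketch (with a toy word list, not one of the corpora analyzed below) tallies word frequencies and sorts them by rank; in real natural-language data the resulting rank-frequency curve falls off roughly as a power law.

```python
from collections import Counter

# Toy illustration (hypothetical sample, not one of the corpora studied
# here): the empirical distribution of orthographic words in natural
# language is highly skewed, as described by Zipf's law.
text = ("the quick brown fox jumps over the lazy dog and the fox "
        "runs over the hill and the dog sleeps").split()
freq = Counter(text)
ranked = [count for _, count in freq.most_common()]
print(ranked)  # frequencies in decreasing rank order
# A few top-ranked words carry most of the probability mass, while the
# tail consists of many words that occur only once.
```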
Having this in mind, let us make an experiment with natural language data. In the following, we will analyze three texts in English: First Folio/35 Plays by William Shakespeare (4,500,500 characters), Gulliver's Travels by Jonathan Swift (579,438
Fig. 4 Entropy as a function of block length for Bernoulli(p) processes, where p = 0.1, 0.2, 0.3, 0.4, 0.5. Crosses are the true values H(k) = hk. Squares are estimates (24) for C = 2 and samples X_1^n of varying length n = 2^j, where n < 70000
characters), and Complete Memoirs by Jacques Casanova de Seingalt (6,719,801 characters), all downloaded from Project Gutenberg.1 For each of these texts we have computed estimator H_est^(2)(k) for C = 2 and initial fragments of the texts of length n = 2^j, where j = 1, 2, 3, ... and n is smaller than the text length. In this way we have obtained the data points in Fig. 6. The estimates look reasonable. The maximal block length for which the estimates can be found is k ≈ 10, and in this
1 www.gutenberg.org.
Estimation of Entropy from Subword Complexity 67
Fig. 5 Entropy rate h for Bernoulli(p) processes as a function of parameter p.
case we obtain H_est^(2)(10) ≈ 12.5 nats, i.e., about 1.25 nats per character, which is less than the upper bound estimate of H(10)/10 ≈ 1.5 nats per character by Shannon [27]. Our data also corroborate Hilberg's hypothesis. Namely, using nonlinear least squares, we have fitted the model
H_est^(2)(k) = A k^β − C,     (31)

where C was chosen as 2, and we have obtained quite a good fit. The values of parameters A and β with their standard errors are given in Table 1.
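As an illustration of the fitting step, here is a sketch on synthetic data (the parameter values are merely of the same order as those in Table 1). The paper uses nonlinear least squares; with C held fixed at 2 the model linearizes as log(H_est(k) + C) = log A + β log k, so for this sketch ordinary least squares in log-log coordinates suffices.

```python
import numpy as np

# Fit the model H(k) = A * k**beta - C with C fixed at 2 on synthetic,
# noiseless data (hypothetical values, not the paper's estimates).
C = 2.0
k = np.arange(1, 11, dtype=float)
A_true, beta_true = 3.57, 0.652
H = A_true * k**beta_true - C  # synthetic "entropy estimates"

# Linearized fit: log(H + C) = log A + beta * log k.
beta_hat, logA_hat = np.polyfit(np.log(k), np.log(H + C), 1)
A_hat = np.exp(logA_hat)
print(A_hat, beta_hat)  # recovers the generating parameters
```

On noisy real estimates, a nonlinear fit of (31) directly (as in the paper) weights the errors differently from this log-domain shortcut, which is why the two approaches can give slightly different parameters.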
To verify Hilberg's hypothesis for k ≥ 10 we need much larger data, such as balanced corpora of texts. The size of modern text corpora is of order n = 10^9 characters. If relationship (1) persists in such large data, then we could obtain estimates of entropy H(k) for block lengths k ≤ 20. This is still quite far from the range of block lengths k ∈ [1, 100] considered by Shannon in his experiment with human subjects [27]. But we hope that estimator H_est^(2)(k) can be improved so as to be applicable also to
Table 1 The fitted parameters of model (31). The values after ± are standard errors

Text                                               A             β
First Folio/35 Plays by William Shakespeare        3.57 ± 0.05   0.652 ± 0.007
Gulliver's Travels by Jonathan Swift               3.56 ± 0.07   0.608 ± 0.012
Complete Memoirs by Jacques Casanova de Seingalt   3.60 ± 0.15   0.602 ± 0.021
so large block lengths. As we have shown, for the Bernoulli model with uniform probabilities, subword complexity may convey information about block entropy for block lengths smaller than or equal to the maximal repetition. For many texts in natural language, the maximal repetition is of order 100 or greater [8]. Hence we hope that, using an improved entropy estimator, we may obtain reasonable estimates of block entropy H(k) for k ≤ 100.
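The maximal repetition of a text, i.e., the length of the longest substring that occurs at least twice, can be computed by a simple sketch like the following (a hash-set scan inside a binary search; suffix-tree methods would be faster for corpus-scale texts):

```python
def has_repeat(x: str, k: int) -> bool:
    """Check whether some substring of length k occurs twice in x."""
    seen = set()
    for i in range(len(x) - k + 1):
        s = x[i:i + k]
        if s in seen:
            return True
        seen.add(s)
    return False

def maximal_repetition(x: str) -> int:
    """Length of the longest substring occurring at least twice in x."""
    # If a length-k substring repeats, so does some shorter substring,
    # so the property is monotone in k and binary search applies.
    lo, hi = 0, max(len(x) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(x, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

print(maximal_repetition("abcab"))  # 2 ("ab" occurs twice)
```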
5 Conclusion
In this paper we have considered some new methods of estimating block entropy. The idea is to base inference on empirical subword complexity—a function that counts the number of distinct substrings of a given length in the sample. In an entangled form, the expectation of subword complexity carries information about the probability distribution of blocks of a given length from which information about block entropies can be extracted in some cases.
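The empirical subword complexity itself is straightforward to compute; a minimal sketch follows (the entropy estimators built on top of these counts are given by formula (24) and its variants earlier in the paper and are not reproduced here):

```python
def subword_complexity(x: str, k: int) -> int:
    """Count the distinct substrings of length k in the sample x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

# Example: "abracadabra" has 5 distinct letters and 7 distinct 2-grams.
sample = "abracadabra"
print(subword_complexity(sample, 1))  # 5
print(subword_complexity(sample, 2))  # 7
```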
We have proposed two estimators of block entropy. The first estimator was designed for IID processes, but it turned out to work well only in the trivial case of uniform probabilities. Thus we have proposed a second estimator, which works for any properly skewed stationary process. This assumption is satisfied, e.g., if the distribution of strings of a given length is skewed towards less probable values.
It is remarkable that the second estimator provides, with high probability, a lower bound of entropy, in contrast to estimators based on source coding, which give an upper bound. We stress that the consistency of the second estimator remains an open problem.
Moreover, using the second estimator, we have estimated block entropy for texts in natural language, confirming earlier estimates as well as Hilberg's hypothesis for block lengths k ≤ 10. Further research is needed to provide an estimator for larger block lengths. We hope that subword complexity carries information about block entropy for block lengths smaller than or equal to the maximal repetition, which would allow us to estimate entropy for k ≤ 100 in the case of natural language.
References
1. Algoet PH, Cover TM (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann. Probab. 16:899–909
2. Baayen H (2001) Word frequency distributions. Kluwer Academic Publishers, Dordrecht
3. Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
4. Crutchfield JP, Feldman DP (2003) Regularities unseen, randomness observed: the entropy convergence hierarchy. Chaos 15:25–54
5. Dillon WR, Goldstein M (1984) Multivariate analysis: methods and applications. Wiley, New York
6. Dębowski Ł (2011) On the vocabulary of grammar-based codes and the logical consistency of texts. IEEE Trans. Inform. Theor. 57:4589–4599
7. Dębowski Ł (2013) A preadapted universal switch distribution for testing Hilberg's conjecture. http://arxiv.org/abs/1310.8511
8. Dębowski Ł (2014) Maximal repetitions in written texts: finite energy hypothesis vs. strong Hilberg conjecture. http://www.ipipan.waw.pl/~ldebowsk/
9. Dębowski Ł (2014) A new universal code helps to distinguish natural language from random texts. http://www.ipipan.waw.pl/~ldebowsk/
10. Ebeling W, Pöschel T (1994) Entropy and long-range correlations in literary English. Europhys. Lett. 26:241–246
11. Ebeling W, Nicolis G (1991) Entropy of symbolic sequences: the role of correlations. Europhys. Lett. 14:191–196
12. Ebeling W, Nicolis G (1992) Word frequency and entropy of symbolic sequences: a dynamical perspective. Chaos Sol. Fract. 2:635–650
13. Ferenczi S (1999) Complexity of sequences and dynamical systems. Discr. Math. 206:145–154
14. Gheorghiciuc I, Ward MD (2007) On correlation polynomials and subword complexity. Discr. Math. Theo. Comp. Sci. AH, 1–18
15. Graham RL, Knuth DE, Patashnik O (1994) Concrete mathematics: a foundation for computer science. Addison-Wesley, New York
16. Hall P, Morton SC (1993) On the estimation of entropy. Ann. Inst. Statist. Math. 45:69–88
17. Hilberg W (1990) Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? Frequenz 44:243–248
18. Ivanko EE (2008) Exact approximation of average subword complexity of finite random words over finite alphabet. Trud. Inst. Mat. Meh. UrO RAN 14(4):185–189
19. Janson S, Lonardi S, Szpankowski W (2004) On average sequence complexity. Theor. Comput. Sci. 326:213–227
20. Joe H (1989) Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math. 41:683–697
21. Khmaladze E (1988) The statistical analysis of large number of rare events. Technical Report MS-R8804, Centrum voor Wiskunde en Informatica, Amsterdam
22. Kontoyiannis I, Algoet PH, Suhov YM, Wyner AJ (1998) Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Trans. Inform. Theor. 44:1319–1327
23. Koslicki D (2011) Topological entropy of DNA sequences. Bioinformatics 27:1061–1067
24. Krzanowski W (2000) Principles of multivariate analysis. Oxford University Press, Oxford
25. de Luca A (1999) On the combinatorics of finite words. Theor. Comput. Sci. 218:13–39
26. Schmitt AO, Herzel H, Ebeling W (1993) A new method to calculate higher-order entropies from finite samples. Europhys. Lett. 23:303–309
27. Shannon C (1951) Prediction and entropy of printed English. Bell Syst. Tech. J. 30:50–64
28. Vogel H (2013) On the shape of subword complexity sequences of finite words. http://arxiv.org/abs/1309.3441
29. Wyner AD, Ziv J (1989) Some asymptotic properties of entropy of a stationary ergodic data source with applications to data compression. IEEE Trans. Inform. Theor. 35:1250–1258
30. Zipf GK (1935) The psycho-biology of language: an introduction to dynamic philology. Houghton Mifflin, Boston
31. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans. Inform. Theor. 23:337–343
32. Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theor. 24:530–536