In the previous section we checked that the block entropy estimator H_est^(2)(k) given by formula (24) returns quite good estimates for Bernoulli processes and persistently provides a lower bound of the true block entropy. In this section we would like to apply this estimator to some real data, namely texts in natural language.
As we have mentioned, there have been some attempts to estimate the block entropy of texts in natural language from frequencies of blocks [10–12]. These attempts were quite heuristic, whereas we now have an estimator of block entropy that may work for a certain class of processes.
Let us recall that estimator H_est^(2)(k) works under the assumption that the process is properly skewed, which holds, e.g., if the distribution of strings of a fixed length is skewed towards less probable values. In fact, in texts in natural language the empirical distribution of orthographic words, which are strings of varying length, is highly skewed in the required direction, as described by Zipf's law [30]. Hence we may suppose that the hypothetical process generating texts in natural language is also properly skewed. Consequently, the estimates returned by H_est^(2)(k) on natural language data should be smaller than the true block entropy.
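To illustrate the kind of skew that Zipf's law describes, the following sketch (with a toy word list, not one of the corpora analyzed below) tallies word frequencies and sorts them by rank; in real natural-language data the resulting rank-frequency curve falls off roughly as a power law.

```python
from collections import Counter

# Toy illustration (hypothetical sample, not one of the corpora studied
# here): the empirical distribution of orthographic words in natural
# language is highly skewed, as described by Zipf's law.
text = ("the quick brown fox jumps over the lazy dog and the fox "
        "runs over the hill and the dog sleeps").split()
freq = Counter(text)
ranked = [count for _, count in freq.most_common()]
print(ranked)  # frequencies in decreasing rank order
# A few top-ranked words carry most of the probability mass, while the
# tail consists of many words that occur only once.
```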
Having this in mind, let us make an experiment with natural language data. In the following, we will analyze three texts in English: First Folio/35 Plays by William Shakespeare (4,500,500 characters), Gulliver's Travels by Jonathan Swift (579,438
Fig. 4 Entropy as a function of block length for Bernoulli(p) processes, where p = 0.1, 0.2, 0.3, 0.4, 0.5. Crosses are the true values H(k) = hk. Squares are estimates (24) for C = 2 and samples X_1^n of varying length n = 2^j, where n < 70000
characters), and Complete Memoirs by Jacques Casanova de Seingalt (6,719,801 characters), all downloaded from Project Gutenberg.1 For each of these texts we have computed estimator H_est^(2)(k) for C = 2 and initial fragments of the texts of length n = 2^j, where j = 1, 2, 3, ... and n is smaller than the text length. In this way we have obtained the data points in Fig. 6. The estimates look reasonable. The maximal block length for which the estimates can be found is k ≈ 10, and in this
1 www.gutenberg.org.
Estimation of Entropy from Subword Complexity 67
Fig. 5 Entropy rate h for Bernoulli(p) processes as a function of parameter p.
case we obtain H_est^(2)(10) ≈ 12.5 nats, i.e., about 1.25 nats per character, which is less than the upper bound estimate of H(10)/10 ≈ 1.5 nats per character by Shannon [27]. Our data also corroborate Hilberg's hypothesis. Namely, using nonlinear least squares, we have fitted the model
H_est^(2)(k) = A k^β − C,     (31)

where C was chosen as 2, and we have obtained quite a good fit. The values of parameters A and β with their standard errors are given in Table 1.
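As an illustration of the fitting step, here is a sketch on synthetic data (the parameter values are merely of the same order as those in Table 1). The paper uses nonlinear least squares; with C held fixed at 2 the model linearizes as log(H_est(k) + C) = log A + β log k, so for this sketch ordinary least squares in log-log coordinates suffices.

```python
import numpy as np

# Fit the model H(k) = A * k**beta - C with C fixed at 2 on synthetic,
# noiseless data (hypothetical values, not the paper's estimates).
C = 2.0
k = np.arange(1, 11, dtype=float)
A_true, beta_true = 3.57, 0.652
H = A_true * k**beta_true - C  # synthetic "entropy estimates"

# Linearized fit: log(H + C) = log A + beta * log k.
beta_hat, logA_hat = np.polyfit(np.log(k), np.log(H + C), 1)
A_hat = np.exp(logA_hat)
print(A_hat, beta_hat)  # recovers the generating parameters
```

On noisy real estimates, a nonlinear fit of (31) directly (as in the paper) weights the errors differently from this log-domain shortcut, which is why the two approaches can give slightly different parameters.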
To verify Hilberg's hypothesis for k ≥ 10 we need much larger data, such as balanced corpora of texts. The size of modern text corpora is of order n = 10^9 characters. If relationship (1) persists in such large data, then we could obtain estimates of entropy H(k) for block lengths k ≤ 20. This is still quite far from the range of block lengths k ∈ [1, 100] considered by Shannon in his experiment with human subjects [27]. But we hope that estimator H_est^(2)(k) can be improved so as to be applicable also to
Table 1 The fitted parameters of model (31). The values after ± are standard errors

Text                                               A             β
First Folio/35 Plays by William Shakespeare        3.57 ± 0.05   0.652 ± 0.007
Gulliver's Travels by Jonathan Swift               3.56 ± 0.07   0.608 ± 0.012
Complete Memoirs by Jacques Casanova de Seingalt   3.60 ± 0.15   0.602 ± 0.021
so large block lengths. As we have shown, for the Bernoulli model with uniform probabilities, subword complexity may convey information about block entropy for block lengths smaller than or equal to the maximal repetition. For many texts in natural language, the maximal repetition is of order 100 or greater [8]. Hence we hope that, using an improved entropy estimator, we may obtain reasonable estimates of block entropy H(k) for k ≤ 100.
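The maximal repetition of a text, i.e., the length of the longest substring that occurs at least twice, can be computed by a simple sketch like the following (a hash-set scan inside a binary search; suffix-tree methods would be faster for corpus-scale texts):

```python
def has_repeat(x: str, k: int) -> bool:
    """Check whether some substring of length k occurs twice in x."""
    seen = set()
    for i in range(len(x) - k + 1):
        s = x[i:i + k]
        if s in seen:
            return True
        seen.add(s)
    return False

def maximal_repetition(x: str) -> int:
    """Length of the longest substring occurring at least twice in x."""
    # If a length-k substring repeats, so does some shorter substring,
    # so the property is monotone in k and binary search applies.
    lo, hi = 0, max(len(x) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(x, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

print(maximal_repetition("abcab"))  # 2 ("ab" occurs twice)
```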
5 Conclusion
In this paper we have considered some new methods of estimating block entropy. The idea is to base inference on empirical subword complexity—a function that counts the number of distinct substrings of a given length in the sample. In an entangled form, the expectation of subword complexity carries information about the probability distribution of blocks of a given length from which information about block entropies can be extracted in some cases.
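The empirical subword complexity itself is straightforward to compute; a minimal sketch follows (the entropy estimators built on top of these counts are given by formula (24) and its variants earlier in the paper and are not reproduced here):

```python
def subword_complexity(x: str, k: int) -> int:
    """Count the distinct substrings of length k in the sample x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

# Example: "abracadabra" has 5 distinct letters and 7 distinct 2-grams.
sample = "abracadabra"
print(subword_complexity(sample, 1))  # 5
print(subword_complexity(sample, 2))  # 7
```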
We have proposed two estimators of block entropy. The first estimator was designed for IID processes, but it turned out to work well only in the trivial case of uniform probabilities. Thus we have proposed a second estimator, which works for any properly skewed stationary process. This assumption is satisfied, e.g., if the distribution of strings of a given length is skewed towards less probable values.
It is remarkable that the second estimator provides, with high probability, a lower bound of entropy, in contrast to estimators based on source coding, which give an upper bound. We stress that the consistency of the second estimator remains an open problem.
Moreover, using the second estimator, we have estimated block entropy for texts in natural language, confirming earlier estimates as well as Hilberg's hypothesis for block lengths k ≤ 10. Further research is needed to provide an estimator for larger block lengths. We hope that subword complexity carries information about block entropy for block lengths smaller than or equal to the maximal repetition, which would allow us to estimate entropy for k ≤ 100 in the case of natural language.
References
1. Algoet PH, Cover TM (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann. Probab. 16:899–909
2. Baayen H (2001) Word frequency distributions. Kluwer Academic Publishers, Dordrecht
3. Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
4. Crutchfield JP, Feldman DP (2003) Regularities unseen, randomness observed: the entropy convergence hierarchy. Chaos 15:25–54
5. Dillon WR, Goldstein M (1984) Multivariate analysis: methods and applications. Wiley, New York
6. Dębowski Ł (2011) On the vocabulary of grammar-based codes and the logical consistency of texts. IEEE Trans. Inform. Theor. 57:4589–4599
7. Dębowski Ł (2013) A preadapted universal switch distribution for testing Hilberg's conjecture. http://arxiv.org/abs/1310.8511
8. Dębowski Ł (2014) Maximal repetitions in written texts: finite energy hypothesis vs. strong Hilberg conjecture. http://www.ipipan.waw.pl/~ldebowsk/
9. Dębowski Ł (2014) A new universal code helps to distinguish natural language from random texts. http://www.ipipan.waw.pl/~ldebowsk/
10. Ebeling W, Pöschel T (1994) Entropy and long-range correlations in literary English. Europhys. Lett. 26:241–246
11. Ebeling W, Nicolis G (1991) Entropy of symbolic sequences: the role of correlations. Europhys. Lett. 14:191–196
12. Ebeling W, Nicolis G (1992) Word frequency and entropy of symbolic sequences: a dynamical perspective. Chaos Sol. Fract. 2:635–650
13. Ferenczi S (1999) Complexity of sequences and dynamical systems. Discr. Math. 206:145–154
14. Gheorghiciuc I, Ward MD (2007) On correlation polynomials and subword complexity. Discr. Math. Theo. Comp. Sci. AH, 1–18
15. Graham RL, Knuth DE, Patashnik O (1994) Concrete mathematics: a foundation for computer science. Addison-Wesley, New York
16. Hall P, Morton SC (1993) On the estimation of entropy. Ann. Inst. Statist. Math. 45:69–88
17. Hilberg W (1990) Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? Frequenz 44:243–248
18. Ivanko EE (2008) Exact approximation of average subword complexity of finite random words over finite alphabet. Trud. Inst. Mat. Meh. UrO RAN 14(4):185–189
19. Janson S, Lonardi S, Szpankowski W (2004) On average sequence complexity. Theor. Comput. Sci. 326:213–227
20. Joe H (1989) Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math. 41:683–697
21. Khmaladze E (1988) The statistical analysis of large number of rare events. Technical Report MS-R8804, Centrum voor Wiskunde en Informatica, Amsterdam
22. Kontoyiannis I, Algoet PH, Suhov YM, Wyner AJ (1998) Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Trans. Inform. Theor. 44:1319–1327
23. Koslicki D (2011) Topological entropy of DNA sequences. Bioinformatics 27:1061–1067
24. Krzanowski W (2000) Principles of multivariate analysis. Oxford University Press, Oxford
25. de Luca A (1999) On the combinatorics of finite words. Theor. Comput. Sci. 218:13–39
26. Schmitt AO, Herzel H, Ebeling W (1993) A new method to calculate higher-order entropies from finite samples. Europhys. Lett. 23:303–309
27. Shannon C (1951) Prediction and entropy of printed English. Bell Syst. Tech. J. 30:50–64
28. Vogel H (2013) On the shape of subword complexity sequences of finite words. http://arxiv.org/abs/1309.3441
29. Wyner AD, Ziv J (1989) Some asymptotic properties of entropy of a stationary ergodic data source with applications to data compression. IEEE Trans. Inform. Theor. 35:1250–1258
30. Zipf GK (1935) The psycho-biology of language: an introduction to dynamic philology. Houghton Mifflin, Boston
31. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans. Inform. Theor. 23:337–343
32. Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theor. 24:530–536