TO THE B ALANCE OF C OMPLEXITY

(1)

A ^N I ^NFORMATION T ^HEORY -B ^ASED A ^PPROACH

TO THE B ^{ALANCE OF} C ^OMPLEXITY

BETWEEN P ^HONETICS , P HONOLOGY AND M ORPHOSYNTAX

François PELLEGRINO, Christophe COUPÉ & Egidio MARSICO Dynamics of Language Lab

UMR5596 CNRS – Université de Lyon FRANCE

LSA Complexity Panel

This research is supported by the French National Research Agency (ANR)

(2)

2

Overview

9 General framework

9 Proposed IT-based approach

9 Data

9 Results

(3)

3

Information Theory & Complexity

9 Information Theory (IT) and Entropy

ª Initiated from C. Shannon’s works in the 1940’s

ª Applied to linguistics, cybernetics and cognitive science

ª Related to notions like functional load, statistics, complexity

9 Quantitative typology and sciences of Complexity

ª Shed new light on typological issues (and cognition)

ª Quest for correlations and compensations among linguistic components (phonetics, phonology, morphology, syntax, etc.)

ª Problems when dealing with complexity:

¾

No straightforward definition

¾

Multidimensional problem

9 IT may provide relevant tools to evaluate complexity

(4)

4

IT and phonology: Recent revival?

9 Goldsmith (1998): On information theory, entropy, and phonology in the 20

^th

century , Royaumont

9 Compression approach (Juola 1998, Kettunen et al., 2006)

9 Functional load approach (Surendran and Niyogi, 2004-2006)

9 Probabilistic approaches (among others, Goldsmith, 2002; Hume,

2004-2006, …)

(5)

5

Issues and stakes

9 Typological approaches involving phonology

ª Often leave phonetics aside

¾

Either because of difficulties or considering it as irrelevant

9 Complexity balance between linguistic levels

ª “Slippery” issue

ª Several studies challenged this statement

¾

Among others, Auer (1993), Planck (1998) Maddieson (1986, 2006), Fenk-Oczlon & Fenk (1999, 2005), Shosted (2006)

¾

Different indices lead to different results

¾

No universal methodology (yet)

9 Goals of this study

ª Evaluate whether IT is relevant in this context

ª Draw attention on some methodological pitfalls

¾

Cross-linguistic comparison based on “tiny” corpora

¾

(too) coarse-grained evaluation of indices

¾

Interaction between phonetics and phonology (speech rate)

(6)

PROPOSED APPROACH

(7)

7

Proposed Approach

9 For several languages, texts conveying an “equivalent”

semantic content

9 Estimation of parameters from these texts uttered by several speakers

Increase the number of texts and speakers to “neutralize” within-language variability and get significant cross-linguistic results.

9 Estimation of the information carried by linguistic units in these languages

ª For given units, how to calculate Information?

ª How to choose the units compatible with the chosen approach to

estimation of information?

(8)

8 Proposed Approach

How to calculate the Information

ª Considering that language L is a source of linguistic sequences s composed of units ( u ) from a finite set ( N

_L

)

ª Assuming that the units are independent from each other

¾

s(t) = u

₁

u

₂

u

₃

… u

_t-1

u

_t

¾

P( u

_t

) is supposed to be independent of s(t-1) = u

₁

u

₂

u

₃

… u

_t-1

ª Quantity of Information of unit u = entropy h( u ) = -log

₂

(P( u ))

¾

Less probable => more informative

¾

More probable => less informative

¾

Certain (P = 1) => no information

ª H(L) Quantity of Information of L (Entropy of L )

¾

Easy to compute from the set of units and their probabilities

¾

H(L) is always inferior to log

₂

( N

_L

) _∑ _∑ ( )

=

−

=

^L

i i

L i

N

i

u u

N

i

u

h u p p

p L

H

1

2 1

log )

( . )

(

0 0.2 0.4 0.6 0.8 1

1 2 3 4

Probability P_u

Entropy h(u)

(9)

9 Proposed Approach

Information Estimation

9 What linguistic units?

ª That do not violate too heavily the independence rule?

ª For which the inventory is known and P

_u

(probability of occurrence) is calculable.

9 Our choice: the syllable

ª Pros

¾

Higher independence than between phonemes (or features, gestures?)

¾

Syllable frequency estimated from big written corpora for several languages

¾

Way to somewhat get rid of coarticulation issues ª Cons

¾

Independence is not absolute

¾

Relative frequencies differ from written to speech production

¾

Oral syllables resulting from phonological processes (elision, liaison, etc.) are absent from written data

¾

Relevance of syllable as a linguistic unit across languages?

(10)

EXPERIMENTAL DATA

(11)

11 Raw material

Spoken data and syllable frequencies

9 Need for comparable data across languages

9 2 types of data

ª Speech data : subset of MULTEXT corpus

¾

7 languages (5 European languages, 2 East-Asian languages)

¾

20 Passages (texts composed of 5 semantically connected sentences)

¾

4-10 speakers per text

¾

Broadly speaking, semantically equivalent texts in each language

y

Last night I opened the front door to let the cat out. It was such a beautiful evening that I wandered down the garden for a breath of fresh air. Then I heard a click as the door closed behind me. I realised I'd locked myself out. To cap it all, I was arrested while I was trying to force the door open!

y

Hier soir, j'ai ouvert la porte d'entrée pour laisser sortir le chat. La nuit était si belle que je suis descendu dans la rue prendre le frais. J'avais à peine fait quelque pas que j'ai entendu la porte claquer derrière moi. J'ai réalisé, tout d'un coup, que j'étais fermé dehors. Le comble c'est que je me suis fait arrêter alors que j'essayais de forcer ma propre porte !

ª Syllable Frequency data

¾

Computed from large text resources (newspapers, books, etc.)

¾

Different resources depending on the languages

(12)

12

Available Data for this study

LANGUAGE Code SPEECH DATA SYLLABLE FREQUENCIES

English EN

FR GE IT JA MA

SP DU

VI

9 9

French 9 9

German 9 9

Italian 9 9

Japanese 9 9

Mandarin Chinese 9 9

Spanish 9 9

Dutch 8 9

Vietnamese 9 8

(13)

13

Speech Data

Language Source No of

speakers

Total duration

EN Multext 10 18 min.

FR Multext 6 14 min.

GE Multext 10 27 min.

IT Multext 10 18 min.

JA Kitazawa, 2002 5 33 min.

VI Courtesy of E.

Castelli MICA, Hanoi 4

(two times each) 38 min.

OVERALL 62 > 3 hours

MA Komatsu et al., 2004 9 23 min.

SP Multext 8 17 min.

Small but not tiny corpus

(14)

14

Syllable Frequencies

Language Source No of

different syllables

Total No of syllables

DU WebCelex 6 486 1.4 M mmm

EN WebCelex 7 931 1.0 M mmm

FR Lexique3 5 685 1.3 M mmm

GE WebCelex 4 207 0.8 M mmm

IT PhD Massimiliano Pone,

2005 2 719 27.0 M mmm

JA Tamaoka and Makioka,

2004 416 575.7M mmm

MA PhD

Peng Gang, 2005, 1 191

(incl. tones) 138.0 M mmm SP PhD Massimiliano Pone,

2005 1 593 0.9 M mmm

(15)

RESULTS

9 Speech corpus

9 Syllabic Entropy

9 Back to initial issues

(16)

16

CAVEAT

9 Very Few languages => no typological dimension

9 Differences between oral and written syllables may be significant (but invisible here)

9 Descriptive and not explanatory results so far

(17)

17 Speech Corpora Comparison

Parameters

9 Raw parameters for each Passage

ª Passage Duration (in seconds, Silences >= 150 ms are discarded)

ª Number of Syllables (from canonical pronunciation)

ª Number of Words (according to language-specific standards) 9 Normalized parameters

ª Significant differences in length among Passages for a given language (e.g. from 62 to 104 syllables in the English corpus)

ª BUT Passages are matched among languages

ª Normalization procedure

¾

Normalized Length (Syllables)

y

Ratio between the number of syllables in each Passage in language L and the matched Passage in English

y

Median Value calculated among Passages for each language

¾

Normalized Length (Words)

¾

Normalized Duration (Time) ª Additional parameters

¾

Syllabic Rate : Number of syllables per second (average value among Passages)

¾

ASW = Average Number of Syllables per Word (average value among

Passages)

(18)

18 Speech Corpora Comparison

Normalized Lengths

9 High within-language variation

ª

Cross-linguistic comparison CANNOT be done with just one utterance…

9 No significant correlation between Syllabic and Word Lengths

0,5 0,7 0,9 1,1 1,3 1,5 1,7 1,9

0,5 1,0 1,5 2,0

Normalized Length (Syllable)

Normalized Length (Word)

MA EN

IT FR

SP

GE JA

VI

Average values and interquartile ranges

(19)

19 Speech Corpora Comparison

Speech Rate

9 Cross-linguistic variation of Syllabic Rate

ª

Not only a speaker-specific parameter!

ª

Values are pretty high (compared to spontaneous speech)

ª

Inter-speaker variation is pretty low in this task (reading)

9 Somewhat linked to syllable structure (shell complexity)

9 But not only

ª

MA and VI exhibit low Syllabic Rates though their syllable structures is moderately complex

ª

Tone dimension may not be "orthogonal" to syllabic structure complexity

0 2 4 6 8 10

ENGLISH FR ENC H

GERMAN ITALIA N

JAPANESE MA NDARIN SPA NISH

VIETNAMES E

Syllabic Rate

(20)

20 Speech Corpora Comparison (cont’d)

Are Syllabic Rate

and Text Length linked?

9

Normalized Duration does NOT correlate with Syllabic Rate

ª

Speak faster does not mean speak shorter!

Syllable Rate x Normalized Duration

0,80 1,10 1,40

4,0 5,0 6,0 7,0 8,0 9,0

Syllabic Rate

Normalized Duration

MA

EN

IT

FR SP

GE

JA

VI

Syllable Rate x Normalized Length (Syllables)

R² = 0,70; p<0.01

0,5 1,0 1,5 2,0

4,0 4,5 5,0 5,5 6,0 6,5 7,0 7,5 8,0 8,5

Syllabic Rate

Normalized Length (Syllables)

MA

EN

IT FR GE SP

JA

VI

9

Normalized Length (Syllable) correlates with Syllabic Rate

ª

Is this just an artifact (Duration linked to Number of syllables?)

ª

Is there any causality?

(21)

21

9

Cross-linguistic variations

9

How much of the potential offered by a given syllabic inventory is used?

ª

Redundancy = difference between the maximum possible entropy for each inventory and the observed entropy in a written corpora

ª

Pretty similar redundancy across languages

9

Comment: Sizes of Syllabic Inventories (calculated from corpora) are much lower than those computed just from phonotactic rules

Syllabic Entropy

Syllabic Entropy

0 2 4 6 8 10

ENGLIS H

FRENCH

GERMAN

ITALIAN JAPAN

ESE

MANDARIN

SPANISH

DUTCH

Syllable Inventory Size x Syllabic Entropy

Syllable Inventory Size

Syllabic Entropy

R²= 1

R² = 0,67

0 2 4 6 8 10 12 14

0 2 000 4 000 6 000 8 000 10 000

MA

EN IT

FR SP

GE JA

Actual Syllabic Entropy

Theoretical Maximum Syllabic Entropy

(22)

22 9 Very high correlation between Syllabic Entropy and

Normalized Length (Syllable)

ª Syllabic entropy (or Information load of syllables) is efficient to quantify the linguistic amount of information

9 Size of syllable inventory is not efficient to do so

=> Syllabic inventory (without frequency) is probably not informative enough for cross-linguistic comparison

Back to initial questions

Is Syllabic entropy relevant?

Normalized Length (Syllable) vs. Syllabic Entropy R² = 0,94 p<.01

5,0 5,5 6,0 6,5 7,0 7,5 8,0 8,5 9,0 9,5 10,0

0,5 0,7 0,9 1,1 1,3 1,5 1,7 1,9 2,1 2,3

MA

EN

IT FR GE SP

JA

0 1 000 2 000 3 000 4 000 5 000 6 000 7 000 8 000 9 000

0,8 1,0 1,2 1,4 1,6 1,8

Size of Syllabic Inventory

MA EN

IT FR

SP GE

JA

(23)

23

5,0 6,0 7,0 8,0 9,0 10,0

MA

EN

IT FR

SP GE

JA

5,0 6,0 7,0 8,0 9,0 10,0

Syllabic Rate

MA

EN

IT FR GE SP

JA

4,0 5,0 6,0 7,0 8,0 9,0

MA

EN

IT FR

SP GE

JA

Back to initial questions (cont’d)

Is there any trade-off between Syllabic Rate and Syllabic Entropy?

9

Tendency to negatively correlate

ª

Left-skewed distribution (Syllabic Rates)

ª

Normalization through transformation

ª

Significant correlation with exp(Syllabic Rate)

¾

R² = 0.61 (p<.05)

9

Consequence on Syllabic Information Rate

ª

SIR = Syllabic Entropy x Syllabic Rate

ª

Amount of syllabic information per second

Syllabic Information Rate is not predictable

from syllable inventory and probabilities

9

Possible balance between syntagmatic and paradigmatic information

ª

Cognitive (memory and process) load?

9

Speech Rate matters when looking for correlations!

9

Information RATE may be more important than Information LOAD

R² = 0,47; p=.09 (tendency)

Syllabic Information Rate vs. Syllabic Entropy

0,0 10,0 20,0 30,0 40,0 50,0 60,0 70,0

5,0 6,0 7,0 8,0 9,0 10,0

SIR

MA EN IT

FR SP

JA GE

(24)

24

Normalized Length (Word) vs. SIR

40,0 43,0 46,0 49,0 52,0 55,0 58,0 61,0 64,0

0,8 1,0 1,2 1,4

Normalized Length (Words)

MA EN IT

SP FR

GE JA

SIR

Back to initial questions (still cont’d)

What about morphosyntax?

9 Limitations

ª No explicit knowledge on morphology and syntax in the corpora

ª Hypotheses: indirect indices related to morphosyntax

¾

Normalized Length (Word)

¾

ASW (Average Number of Syllables per Word)

y

Trade-off to limit word informational load (or complexity)

Normalized Length (Word) vs. Syllabic Entropy

5,0 6,0 7,0 8,0 9,0 10,0

0,8 1,0 1,2 1,4

Normalized Length (Words)

Syllabic Entropy

MA EN

IT

FR SP

GE

JA

=> not correlated to any index

ASW vs. Syllabic Entropy

R² = 0,90 p<.01

5,0 5,5 6,0 6,5 7,0 7,5 8,0 8,5 9,0 9,5 10,0

1,0 1,5 2,0 2,5 3,0

ASW

MA EN

FR IT

SP GE

JA

(25)

25

Conclusions & Perspectives

9 Results

ª

Methodology assessed on a small set of languages => no definitive conclusion

ª

Syllabic Entropy seems to be relevant in terms of linguistic information

ª

High Syllabic Entropy does not automatically result in high Syllabic Information Rate!

9 Take Home message

If Information is what matters,

Just looking at descriptions or inventories of linguistic systems is not enough:

Pay attention to the SPEECH dimension!

9 Perspectives

ª

Take more phonetic and phonological factors of speech into account

ª

Add more languages

¾ Related languages to track historical trajectories (e.g. Romance languages)

¾ Typologically distant languages

ª

Rank these languages on morphological and syntactic scales of complexity

(26)

26

Thanks

Ching-pong AU, Solène BLANDIN, Eric CASTELLI, Gang PENG, Miyuki ISHIBASHI,

Shogo MAKIOKA and Katsuo TAMAOKA for their help with data processing

Ian MADDIESON

THANK YOU FOR YOUR ATTENTION

(27)

27

References

Auer, P. 1993, Is a rhythm-based typology possible? “A study of the role of prosody in phonological typology”. KontRI Working Paper (Universitat Konstanz) 21. 96p.

Campione, E. and Véronis, J., 1998, “A multilingual prosodic database”, in proc. of ICSLP98, Sydney, Australia, pp. 3163-3166 Dahl, O. 2004. The growth and maintenance of linguistic complexity, Studies in Language Companion Series, John Benjamins.

Fenk-Oczlon G. and Fenk A. 1999, “Cognition, quantitative linguistics, and systemic typology”, Linguistic Typology, vol. 3-2, pp. 151-177 Fenk-Oczlon G. and Fenk A. 2005, “Crosslinguistic correlations between size of syllables, number of cases, and adposition order”, in Sprache

und Natürlichkeit, Gedenkband für Willi Mayerthaler, G. Fenk-Oczlon & Ch. Winkler (eds), Tübingen.

Goldsmith, J., 1998, “On information theory, entropy, and phonology in the 20th century”, Royaumont CTIP II Round Table on Phonology in the 20th Century, Royaumont (June 26, 1998)

Goldsmith, J., 2002, “Probabilistic models of grammar: phonology as information minimization”, in Phonological Studies, Vol. 5, pp. 21-46 Greenberg, J., 1960, “A quantitative approach to the morphological typology of language”, in International Journal of American Linguistics, Vol.

26:3, pp. 178-194

Hume, E., 2006, “Language specific and universal markedness: an information-theoretic approach”, Linguistic Society of America Annual Meeting (January 7, 2006)

Juola, P., 1998, "Measuring Linguistic Complexity : The Morphological Tier" (1998). Journal of Quantitative Linguistics 5(3):206-213 Karlgren, H., 1961, “Speech Rate and Information Theory”, in proc. of 4th ICPhS, pp. 671-677

Kettunen, K., Sadeniemi, M., Lindh-Knuutila, T. & Honkela, T. 2006. Analysis of EU Languages through Text Compression. In T. Salakoski et al. (Eds.): Advances in Natural Language Processing, LNAI 4139, pp. 99 - 109, 2006. Springer-Verlag Berlin Heidelberg.

Maddieson, I., 1986, “The size and structure of phonological inventories: analysis of UPSID”, in Experimental phonology, J. J. Ohala and J.

Jaeger (eds), Academic Press, New York, pp. 105-123

Maddieson, I., 2006, “Correlating phonological complexity: Data and validation”, Linguistic Typology, 10-1

Plank, F., 1998, “The co-variation of phonology with morphology and syntax: a hopeful history”, Linguistic Typology 2:2 pp. 195-230.

Shannon, C.E., 1948, “A Mathematical Theory of Communication”, Bell Syst. Tech. J. 27, pp. 379-423 Shosted, R. K., 2006, “Correlating complexity: a typological approach”, Linguistic Typology, 10-1

Surendran D. and Niyogi P. to appear, “Quantifying the functional load of phonemic oppositions, distinctive features, and suprasegmentals”, in Current trends in the theory of linguistic change. In commemoration of Eugenio Coseriu (1921-2002), O. Nedergaard Thomsen (ed), Amsterdam & Philadelphia: Benjamins.

(28)

28

Additional slides

9 Normalization Procedure

9 Suitable linguistic units?

9 How to estimate Syllabic Entropy

9 Methodology for Japanese and Chinese Mandarin

9 Zipf-like curves

(29)

29

NORMALIZATION

Language Data Passages MEDIAN IQTL

Reference Passage EN O1 O2 O3 O4 O6 ...

Nb Words 51 57 48 53 47 58 52,3 4,5

Nb Syllables 72 86 84 86 66 86 80,0 8,8

Mean duration 10,7 13,7 13,0 14,7 11,0 13,6 13,3 1,6

Normalized Length(Syl) 1,00 1,00 1,00 1,00 1,00 1,00 1,00 0,00

Normalized Length (Wd) 1,00 1,00 1,00 1,00 1,00 1,00 1,00 0,00

Normalized duration 1,00 1,00 1,00 1,00 1,00 1,00 1,00 0,00

Nb Words 62 77 40 84 63 69 65,8 15,2

Nb Syllables 81 106 67 120 96 99 94,8 18,7

Mean duration 11,3 14,5 9,4 17,6 13,6 13,7 13,7 2,8

Normalized Length(Syl) 1,1 1,2 0,8 1,4 1,5 1,2 1,19 0,22

Normalized Length (Wd) 1,22 1,35 0,83 1,58 1,34 1,19 1,28 0,15

Normalized duration 1,06 1,06 0,72 1,20 1,24 1,01 1,06 0,14

FR

EN

(30)

30

Candidate linguistic units

9 What linguistic units?

ª That do not violate too heavily the independence rule?

ª For which inventory is known and P

_u

(probability of appearance) is calculable.

Candidate Independence Calculable Phoneme 8 phonotactics 9 from corpora

Syllable 9 no too bad 9 from corpora Morpheme 8 not really! 8 morpheme count?

Phrase 9 Not too bad 8 from HUGE corpora?

Feature 8 9 from corpora?

Gesture 9 To be explored… 9 from corpora?

(31)

31 9 How to estimate the information carried by syllables?

ª

Using syllable inventories AND syllable frequencies

9 How to estimate syllable inventories and frequencies?

1.

Phonotactic constraints from language description (.CV. .CVC. etc.)

=> “skeleton” inventory, rough frequency

2.

Lexicon or dictionary ([a], [pa], etc. from lexicon)

=> written/oral syllable inventory, “type” frequency

3.

Written Corpora ([a], [pa], etc. from written production)

=> written syllable inventory, “token” frequency

4.

Oral Corpora ([a], [pa], etc. from oral spontaneous data)

=> written syllable inventory, “token” frequency

9 Comparison of 2 methods in French

ª

From (written) syllable frequencies

¾ Statistical evaluation (only the number of syllables present in a text is considered)

¾ From syllabification (the observed syllables are individually taken into account)

COARSE-GRAINED

BIASES (e.g. morphology) BIAS (≠ from oral production) UNAVAILABLE*

Estimation of Syllabic Entropy

* Except from very few languages

(32)

32

Estimation of Syllabic Entropy (cont’d)

9 How to evaluate the information carried by syllables?

1. From distribution of syllable types and phonological inventories

2. From distributions of syllables in corpora (.a., .e., .dø., .la. …)

Syllable type

% of occurrence (tokens)

Number of V

Number of C

CV 60,4 1 1

V 12,5 1 0

CCV 9,2 1 2

CVC 11,6 1 2

VC 1,6 1 1

CCVC 1,4 1 3

CVCC 1,4 1 3

CCCV 0,4 1 3

Other 1,5

Syllable

Number of Occurrences

per M. of σ Probability Syllable’s information

a ^{44 255} ^0,04 ^4,50

e ^{31 889} ^0,03 ^4,97

dø 30 472 0,03 5,04

la 21 752 0,02 5,52

l 18 200 0,02 5,78

...

zlø 0.05 ∼0,00 24,19

zla ^0.05 ^∼0,00 24,19

zuk ^0.05 ^∼^0,00 ^24,19

zwa 0.05 ∼0,00 24,19

zyp 0.05 ∼0,00 24,19

Then, considering that vowels (resp. consonants) are equiprobable, syllable’s information is the sum of consonantal and vocalic information:

h(.CCVC.) = - (1 x log2(1/Nv) + 3 x log2(1/Nc))

H(L) = sum of h(.XXX.) weighted by % of occurrence

⇒Several significant approximations

(33)

33

Estimation of Syllabic Entropy (cont’d)

9 Evaluation from distribution of syllables in corpora

Syllable

Number of Occurrences

per M. of σ Probability Syllable’s information

a 44 255 0,04 4,50

e 31 889 0,03 4,97

dø 30 472 0,03 5,04

la 21 752 0,02 5,52

l 18 200 0,02 5,78

...

zlø 0.05 ∼0,00 24,19

zla 0.05 ∼0,00 24,19

zuk 0.05 ∼0,00 24,19

zwa 0.05 ∼0,00 24,19

zyp 0.05 ∼0,00 24,19

Syllabification available?

• Exact Calculation

• Text of n syllables

• H(Text) = sum of h( σ

_i

), i from 1 to n

• Statistical Estimation

• Text of n syllables

• H(Text) = n x H(L) = n x mean of h( σ )

YES NO

(34)

34

French: Methodology comparison

Syllabic Entropy

from syllable types

from syllable frequencies

(statistical)

from syllable frequencies

(exact)

FRENCH 9,05 8,43 9,21

y = 0,8638x + 27,93

R²

= 0,9261

0 200 400 600 800 1000 1200 1400

EXACT CALCULATION (FROM SYLLABIFIED TEXTS)

STATISTICAL ESTIMATION (FROM CORPUS)

(35)

TO THE B ALANCE OF C OMPLEXITY

A N I NFORMATION T HEORY -B ASED A PPROACH