Modelling Chinese dialect evolution

(1)

HAL Id: hprints-00758549

https://hal-hprints.archives-ouvertes.fr/hprints-00758549

Submitted on 28 Nov 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Modelling Chinese dialect evolution

Johann-Mattis List, Shijulal Nelson-Sathi, Dagan Tal

To cite this version:

Johann-Mattis List, Shijulal Nelson-Sathi, Dagan Tal. Modelling Chinese dialect evolution. Beyond Phylogeny: Quantitative diachronic explanations of language diversity (SLE Workshop), Aug 2012, Stockholm, Sweden. �hprints-00758549�

(2)

.

... .

. .

Modelling Chinese Dialect Evolution

Johann-Mattis List^∗, Shijulal Nelson-Sathi⁺, and Tal Dagan⁺

∗Institute for Romance Languages and Literature

+Institute for Genomic Microbiology Heinrich Heine University Düsseldorf

2012/08/31

1 / 30

(3)

Structure of the Talk

. .¹. Languages Languages Diasystems Change

. .². Modelling Language History Trees

Waves Networks

. .³. Modelling Chinese Dialect History Data

Analysis Results

2 / 30

(4)

Languages

语言

language

1

язык

språk

1

Languages

3 / 30

(5)

Languages and Dialects

Norwegian, Danish, and Swedish are different languages.

. .

Beijing-Chinese, Shanghai-Chinese, and Hakka-Chinese are dialects of the same Chinese language.

4 / 30

(6)

Languages Languages

Languages and Dialects

Beijing Chinese 1 iou²¹ i⁵⁵ xuei³⁵ pei²¹fəŋ⁵⁵ kən⁵⁵ tʰai⁵¹iaŋ¹¹ t͡ʂəŋ⁵⁵ ʦai⁵³ naɚ⁵¹ t͡ʂəŋ⁵⁵luən⁵¹ Hakka Chinese 1 iu³³ it⁵⁵ pai³³a¹¹ pet³³fuŋ³³ tʰuŋ¹¹ ɲit¹¹tʰeu¹¹ hɔk³³ e⁵³ au⁵⁵ Shanghai Chinese 1 ɦi²² tʰɑ̃⁵⁵ ʦɿ²¹ poʔ³foŋ⁴⁴ taʔ⁵ tʰa³³ɦiã⁴⁴ ʦəŋ³³ hɔ⁴⁴ ləʔ¹lə²³ʦa⁵³ Beijing Chinese 2 ʂei³⁵ də⁵⁵ pən³⁵ liŋ²¹ ta⁵¹

Hakka Chinese 2 man³³ ɲin¹¹ kʷɔ⁵⁵ vɔi⁵³

Shanghai Chinese 2 sa³³ ɲiŋ⁵⁵ ɦəʔ²¹ pəŋ³³ zɿ⁴⁴ du¹³

Norwegian 1 nuːɾɑʋinˑn̩ ɔ suːln̩ kɾɑŋlət ɔm

Swedish 1 nuːɖanvɪndən ɔ suːlən tv̥ɪstadə ən gɔŋ ɔm

Danish 1 noʌ̯ʌnvenˀn̩ ʌ soːl̩ˀn kʰʌm eŋg̊ɑŋ i sd̥ʁiðˀ ʌmˀ

Norwegian 2 ʋem ɑ dem sɱ̩ ʋɑː ɖɳ̩ stæɾ̥kəstə

Swedish 2 vɛm ɑv dɔm sɔm vɑ staɹkast

Danish 2 vɛmˀ a b̥m̩ d̥ vɑ d̥n̩ sd̥æʌ̯g̊əsd̥ə

5 / 30

(7)

Languages and Dialects

From the perspective of the lexicon and the sound system, the Chinese dialects are at least equally if not more different than the Scandinavian languages.

5 / 30

(8)

Languages Diasystems

Language as a Diasystem

Languages are complex aggregates of different linguistic systems that ‘coexist and influence each other’ (Coseriu 1973: 40, my translation).

. .

A linguistic diasystem requires a “roof language” (Goossens 1973:11), i.e. a linguistic variety that serves as a standard for interdialectal communication.

6 / 30

(9)

Language as a Diasystem

Languages are complex aggregates of different linguistic systems that ‘coexist and influence each other’ (Coseriu 1973: 40, my translation).

. .

A linguistic diasystem requires a “roof language” (Goossens 1973:11), i.e. a linguistic variety that serves as a standard for interdialectal communication.

6 / 30

(10)

Languages Diasystems

Language as a Diasystem

Standard Language

Diatopic Varieties

Diastratic Varieties

Diaphasic Varieties

7 / 30

(11)

Languages Change

Change

expected Mandarin [ma₅₅po₂₁lou] attested Mandarin [wan₅₁paw₂₁lu₅₁] explanation Cantonese [maːn₂₂pow₃₅low₃₂]

8 / 30

(12)

Languages Change

Change

expected Mandarin [ma₅₅po₂₁lou]

attested Mandarin [wan₅₁paw₂₁lu₅₁] explanation Cantonese [maːn₂₂pow₃₅low₃₂]

8 / 30

(13)

Languages Change

Change

attested Mandarin [wan₅₁paw₂₁lu₅₁]

explanation Cantonese [maːn₂₂pow₃₅low₃₂]

8 / 30

(14)

Languages Change

Change

attested Mandarin [wan₅₁paw₂₁lu₅₁]

explanation Cantonese [maːn₂₂pow₃₅low₃₂]

8 / 30

(15)

Change

English Cantonese Mandarin

maːlboʁo maːn22pow35low32 wan51paw21lu51

Proper Name “Road of 1000 Tre- asures”

“Road of 1000 Tre- asures”

万宝路

1

9 / 30

(16)

Modelling Language History

10 / 30

(17)

Dendrophilia

August Schleicher (1821-1868)

11 / 30

(18)

Modelling Language History Trees

Dendrophilia

August Schleicher (1821-1868)

These assumptions that logically follow from the results of our re- search can be best illustrated with help of a branching tree. (Schle- icher 1853: 787, my translation)

11 / 30

(19)

Dendrophilia

Schleicher (1853)

12 / 30

(20)

Modelling Language History Waves

Dendrophobia

Johannes Schmidt (1843-1901)

No matter how we look at it, as long as we stick to the assumption that today’s languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifi- cally adequate way. (Schmidt 1872: 17, my translation)

13 / 30

(21)

Dendrophobia

Johannes Schmidt (1843-1901) No matter how we look at it, as long as we stick to the assumption that today’s languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifi- cally adequate way. (Schmidt 1872:

17, my translation)

13 / 30

(22)

Dendrophobia

Johannes Schmidt (1843-1901) I want to replace [the tree] by the im- age of a wave that spreads out from the center in concentric circles be- coming weaker and weaker the far- ther they get away from the center.

(Schmidt 1872: 27, my translation)

14 / 30

(23)

Dendrophobia

Schmidt (1875)

15 / 30

(24)

Dendrophobia

Meillet (1908)

Hirt (1905)

Bloomfield (1933)

Bonfante (1931)

16 / 30

(25)

Modelling Language History Networks

Phylogenetic Networks

Trees are bad because

they are difficult to reconstruct...

languages do not separate in split processes

they are boring, since they only capture certain aspects of language history, namely the vertical relations

nobody knows how to reconstruct them

languages still separate, even if not in split processes they are boring, since they only capture certain aspects of language history, namely, the horizontal relations

17 / 30

(26)

Phylogenetic Networks

Trees are bad because they are difficult to reconstruct...

Waves are bad because nobody knows how to reconstruct them

17 / 30

(27)

Phylogenetic Networks

nobody knows how to reconstruct them

17 / 30

(28)

Phylogenetic Networks

17 / 30

(29)

Phylogenetic Networks

if not in split processes they are boring, since they only capture certain aspects of language history, namely, the horizontal relations

17 / 30

(30)

Phylogenetic Networks

languages still separate, even if not in split processes

they are boring, since they only capture certain aspects of language history, namely, the horizontal relations

17 / 30

(31)

Phylogenetic Networks

17 / 30

(32)

Phylogenetic Networks

Hugo Schuchardt (1842-1927)

We connect the branches and twigs of the tree with countless horizon- tal lines and it ceases to be a tree (Schuchardt 1870 [1900]: 11)

18 / 30

(33)

Phylogenetic Networks

Hugo Schuchardt (1842-1927) We connect the branches and twigs of the tree with countless horizon- tal lines and it ceases to be a tree (Schuchardt 1870 [1900]: 11)

18 / 30

(34)

Phylogenetic Networks

19 / 30

(35)

魚

1

魚

1

魚

1 ?

首

1

首

1

首

1

首

1 Modelling Chinese Dialect History

20 / 30

(36)

Modelling Chinese Dialect History Data

Data

The data for this study was taken from the Xiàndài Hànyǔ Fāngyán Yīnkù (Hou 2004).

It consists of 180 items (“meanings”) translated into 40 contemporary Chinese dialects.

The data is available on a CD in RTF format along with recordings for all dialect entries.

For this study, the transcriptions in RTF were converted to Unicode. Every word was compared with the recordings in order to minimize errors resulting from the extraction process and the original

encoding itself.

21 / 30

(37)

Data

contemporary Chinese dialects.

encoding itself.

21 / 30

(38)

Data

encoding itself.

21 / 30

(39)

Data

Every word was compared with the recordings in order to minimize errors resulting from the extraction process and the original

encoding itself.

21 / 30

(40)

Data

For this study, the transcriptions in RTF were converted to Unicode.

encoding itself.

21 / 30

(41)

Data

For this study, the transcriptions in RTF were converted to Unicode.

encoding itself.

21 / 30

(42)

Data

ITEM 太阳 tàiyáng “sun”

.

Dialect Pronunciation Characters Cognacy Shanghai tʰa³⁴⁻³³ɦiã¹³⁻⁴⁴ 太阳 1

Shanghai ȵjɪʔ¹⁻¹¹dɤ¹³⁻²³ 日头 2

Wenzhou tʰa⁴²⁻²²ji 太阳 1

Wenzhou ȵi²¹³⁻²²dɤu 日头 2

Guangzhou jit²tʰɐu²¹⁻³⁵ 热头 3

Guangzhou tʰai³³jœŋ²¹ 太阳 1

Haikou zit³hau³¹ 日头 2

Beijing tʰai⁵¹iɑŋ¹ 太阳 1

22 / 30

(43)

Data

01 02 03 04

06 05 07

08 09 10 11

12

13 14

15 17 16 18

19 20 21 22 23

24 25

26 27

29 28 30

31 32

33 34

35

36 37 39 38 40

Guanhua JinKejia YueGan HuiMin Xiang Wu

Dialect Locations in the Xiàndài Hànyǔ Fāngyán Yīnkù

01 Shanghai 上海

02 Suzhou 苏州

03 Hangzhou 杭州 04 Wenzhou 温州 05 Guangzhou 广州 06 Nanning 南宁 07 Xianggang 香港

08 Xiamen 厦门

09 Fuzhou 福州

10 Jian'ou 建瓯 11 Shantou 汕头

12 Haikou 海口

13 Taibei 台北

14 Meixian 梅县 15 Taoyuan 桃园 16 Nanchang 南昌 17 Changsha 长沙 18 Xiangtan 湘潭 19 Shexian 歙县

20 Tunxi 屯溪

21 Taiyuan 太原 22 Pingyao 平遥 23 Huhehaote 呼和浩特 24 Beijing 北京 25 Tianjin 天津

26 Jinan 济南

27 Qingdao 青岛 28 Nanjing 南京

29 Hefei 合肥

30 Zhengzhou 郑州

31 Wuhan 武汉

32 Chengdu 成都 33 Guiyang 贵阳 34 Kunming 昆明 35 Haerbin 哈尔滨

36 Xi'an 西安

37 Yinchuan 银川 38 Lanzhou 兰州

39 Xining 西宁

40 Wulumuqi 乌鲁木齐 01 Shanghai 02 Suzhou 03 Hangzhou 04 Wenzhou 05 Guangzhou 06 Nanning 07 Xianggang 08 Xiamen 09 Fuzhou 10 Jian'ou 11 Shantou 12 Haikou 13 Taibei 14 Meixian 15 Taoyuan 16 Nanchang 17 Changsha 18 Xiangtan 19 Shexian 20 Tunxi 21 Taiyuan 22 Pingyao 23 Huhehaote 24 Beijing 25 Tianjin 26 Jinan 27 Qingdao 28 Nanjing 29 Hefei 30 Zhengzhou 31 Wuhan 32 Chengdu 33 Guiyang 34 Kunming 35 Haerbin 36 Xi'an 37 Yinchuan 38 Lanzhou 39 Xining 40 Wulumuqi

23 / 30

(44)

Modelling Chinese Dialect History Analysis

Analysis

The data was analyzed with help of Dagan and Martin’s (2008) method for phylogenetic network reconstruction, that was applied to linguistic data before (Nelson-Sathi et al. 2011).

Given a binary reference tree reflecting the vertical history of a language family and a list of homologs (“cognates”) distributed over the languages, the method reconstructs horizontal relations between the languages and the internal nodes of the tree. The reconstruction of horizontal relations is done by seeking specific evolutionary models (loss and gain of characters) that fit the given distribution best.

The main criterion by which the fitness of the distributions is evaluated is the “vocabulary size”, i.e. the distribution of word forms over a set of meanings. Comparing the vocabulary sizes of different models that infer different amounts of lateral events, the model that comes closest to the vocabulary sizes of the

contemporary languages is chosen.

24 / 30

(45)

Analysis

language family and a list of homologs (“cognates”) distributed over the languages, the method reconstructs horizontal relations between the languages and the internal nodes of the tree. The reconstruction of horizontal relations is done by seeking specific evolutionary models (loss and gain of characters) that fit the given distribution best.

24 / 30

(46)

Analysis

Given a binary reference tree reflecting the vertical history of a language family and a list of homologs (“cognates”) distributed over the languages, the method reconstructs horizontal relations between the languages and the internal nodes of the tree.

The reconstruction of horizontal relations is done by seeking specific evolutionary models (loss and gain of characters) that fit the given distribution best.

24 / 30

(47)

Analysis

evaluated is the “vocabulary size”, i.e. the distribution of word forms over a set of meanings. Comparing the vocabulary sizes of different models that infer different amounts of lateral events, the model that comes closest to the vocabulary sizes of the

24 / 30

(48)

Analysis

24 / 30

(49)

Analysis

Xi’an Zhengzhou

Harbin

Yinchuan Lanzhou

Xining

Qingdao

Beijing

Tunxi Hangzhou Suzhou Shanghai Wenzhou Tianjin

Jinan Shexian

Hefei Nanjing Wulumuqi

Guiyang Wuhan

Xiangtan Changsha

Huhehaote Pingyao Taiyuan Kunming Chengdu

Haikou

Fuzhou

Jian’ou

Guangzhou Xianggang

Shantou

Xiamen

Taibei Nanning Taoyuan Nanchang

Meixian

“sun” 日头 rìtou

25 / 30

(50)

Analysis

Xi’an Zhengzhou

Harbin

Yinchuan Lanzhou

Xining

Qingdao

Beijing

Jinan Shexian

Guiyang Wuhan

Xiangtan Changsha

Haikou

Fuzhou

Jian’ou

Guangzhou Xianggang Taoyuan Nanchang

Meixian

Shantou

Xiamen

Taibei Nanning

“sun” 太阳 tàiyáng

25 / 30

(51)

Analysis

Xi’an Zhengzhou

Harbin

Yinchuan Lanzhou Xining

Qingdao

Beijing

Jinan Shexian

Guiyang Wuhan

Xiangtan Changsha

Haikou

Fuzhou

Jian’ou

Meixian

Shantou

Xiamen

Taibei Nanning

“become sick” 生病 shēngbìng

25 / 30

(52)

Analysis

Xi’an Zhengzhou

Harbin

Xining

Yinchuan Lanzhou

Qingdao

Beijing

Jinan Shexian

Guiyang Wuhan

Xiangtan Changsha

Haikou

Fuzhou

Jian’ou

Meixian

Shantou

Xiamen

Taibei Nanning

“aubergine” 茄子 qiézi

25 / 30

(53)

Results

0 200 400 600 800 1000 1200

Genome size

p<0.05 p<0.05 p<0.05 p=0.2 p<0.05 p<0.05

26 / 30

(54)

Modelling Chinese Dialect History Results

Results

The BOR3-model fits the distribution best. It allows up to three lateral connections per homolog.

Out of 1152 homologs distributed over the Chinese dialects, 264 are monophyletic, 328 require one, 355 two, and 177 three lateral links in order to explain the distribution neatly.

This corresponds to a borrowing rate of 0.5286 borrowing events per homolog per lifetime.

For 78 percent of all homologs in the dataset the method reconstructs lateral links and therefore suggests that these have been involved in borrowing events during their history.

Suprisingly, the 48 homologs that correspond to basic vocabulary concepts in the dataset do not show significant differences in their borrowing rates compared to the non-basic items.

26 / 30

(55)

Results: General Results

Nanjing

Wuhan Hefei

Guiyang Xining

Zhengzhou

Yinchuan Lanzhou

Wulumuqi Xi’an

Qingdao Tianjin

Wenzhou

Jinan

Kunming Chengdu

Taiyuan Harbin

Beijing Nanchang

Tunxi

Taoyuan

Meixian

Shantou

Xiamen

Taibei Guangzhou

Nanning Xianggang

Huhehaote Pingyao

Xiangtan Shanghai Suzhou

Hangzhou Shexian

Fuzhou

Changsha Jian’ou

Haikou

Whole Dataset _{27 / 30}

(56)

Results: General Results

Nanjing

Wuhan Hefei

Guiyang Xining

Zhengzhou

Yinchuan Lanzhou

Wulumuqi Xi’an

Qingdao Tianjin

Wenzhou

Jinan

Kunming Chengdu

Taiyuan Harbin

Beijing Nanchang

Tunxi

Taoyuan

Meixian

Shantou

Xiamen

Taibei Guangzhou

Nanning Xianggang

Huhehaote Pingyao

Xiangtan Shanghai Suzhou

Hangzhou Shexian

Fuzhou

Changsha Jian’ou

Haikou

Swadesh Subset _{27 / 30}

(57)

Results: General Results

Nanchang Shexian

Tunxi

Taoyuan Shanghai Suzhou

Hangzhou

Huhehaote Changsha

Pingyao

Jian’ou

Xiangtan

Fuzhou Haikou

Meixian

Xiamen

Taibei Guangzhou

Shantou Nanning Xianggang

Wulumuqi Yinchuan Lanzhou Xining

Zhengzhou

Xi’an

Chengdu Kunming

Taiyuan Beijing

Qingdao

Harbin

Tianjin Wenzhou

Jinan

Nanjing

Wuhan Hefei

Guiyang

Whole Dataset (Cutoff 5) _{27 / 30}

(58)

Results: General Results

Nanchang Shexian

Tunxi

Taoyuan Shanghai Suzhou

Hangzhou

Huhehaote Changsha

Pingyao

Jian’ou

Xiangtan

Fuzhou Haikou

Meixian

Xiamen

Taibei Guangzhou

Shantou Nanning Xianggang

Wulumuqi Yinchuan Lanzhou Xining

Zhengzhou

Xi’an

Chengdu Kunming

Taiyuan Beijing

Qingdao

Harbin

Tianjin Wenzhou

Jinan

Nanjing

Wuhan Hefei

Guiyang

Whole Dataset (Cutoff 10) _{27 / 30}

(59)

Results: Chengdu

Haikou Changsha

Fuzhou Wuhan

Xianggang Xiangtan

Taoyuan Qingdao

Zhengzhou Xi’an

Pingyao Tianjin Taiyuan

Lanzhou Yinchuan

Jinan Wulumuqi

Xining

Huhehaote

Harbin

Beijing

Shanghai

Suzhou Hangzhou Shexian Hefei

Wenzhou Tunxi

Jian’ou Chengdu

Nanjing

Nanchang

Guangzhou Nanning

Meixian Xiamen

Taibei Kunming

Shantou Guiyang

Contemporary Links Mapped to Coordinates _{28 / 30}

(60)

Results: Chengdu

Haikou Changsha

Fuzhou Wuhan

Xianggang Xiangtan

Taoyuan Qingdao

Zhengzhou Xi’an

Lanzhou Yinchuan

Jinan Wulumuqi

Xining

Huhehaote

Harbin

Beijing

Shanghai

Wenzhou Tunxi

Jian’ou Chengdu

Nanjing

Nanchang

Guangzhou Nanning

Meixian Xiamen Taibei Kunming

Shantou Guiyang

Contemporary Links of Chengdu _{28 / 30}

(61)

Results: Chengdu

Xiamen Shantou

Taibei Meixian

Xianggang

Guangzhou

Nanning

Jian’ou

Changsha Haikou

Huhehaote

Fuzhou

Xiangtan Suzhou

Hangzhou

Tunxi

Nanchang

Taoyuan Shexian

Guiyang

Chengdu Kunming

Taiyuan Pingyao Qingdao

Tianjin Shanghai Wenzhou

Beijing Jinan

Xining Zhengzhou

Lanzhou

Yinchuan Harbin

Xi’an

Wuhan Hefei Nanjing Wulumuqi

Links of Chengdu _{28 / 30}

(62)

Results: Nanchang

Nanchang Tunxi

Taoyuan Shexian

Suzhou Hangzhou

Jian’ou

Xiangtan Huhehaote

Changsha Fuzhou

Haikou Guangzhou

Nanning

Shantou

Xiamen Meixian

Xianggang

Taibei Yinchuan

Lanzhou Xining

Kunming Pingyao Taiyuan Chengdu Qingdao

Tianjin Wenzhou Shanghai

Beijing Jinan

Wuhan Hefei Wulumuqi

Nanjing

Guiyang Harbin

Zhengzhou

Xi’an

Links of Nanchang _{29 / 30}

(63)

Results: Nanchang

Haikou Changsha

Fuzhou Wuhan

Xianggang Xiangtan

Taoyuan Qingdao

Zhengzhou Xi’an

Lanzhou Yinchuan

Jinan Wulumuqi

Xining

Huhehaote

Harbin

Beijing

Shanghai

Wenzhou Tunxi

Jian’ou Chengdu

Nanjing

Nanchang

Guangzhou Nanning

Meixian Xiamen Taibei Kunming

Shantou Guiyang

Contemporary Links of Nanchang _{29 / 30}

(64)

Results: Nanchang

Shanghai Nanjing

Suzhou Hangzhou Hefei

Shexian

Tunxi

Wenzhou Wuhan

Xiangtan Changsha

Jian’ou Nanchang

Links between Nanchang and its Neighbors _{29 / 30}

(65)

Concluding Remarks

Phylogenetic networks look nice.

Phylogenetic networks are – if properly reconstructed – a valid alternative to both the tree and the wave model.

We need to test the method by Dagan and Martin (2008) on more data and in more detail in order to be able to give an account on its full potential and its limits.

30 / 30

(66)

Concluding Remarks

谢谢大家！

30 / 30

(67)

Concluding Remarks

Thank you!

30 / 30