Phonological Parsing for Bi-directional Letter-to-Sound / Sound-to-Letter Generation



Doctor of Philosophy in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

June 1995

© 1995 Helen Mei-Ling Meng. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part,

and to grant others the right to do so.

Signature of Author: Department of Electrical Engineering and Computer Science, February 14, 1995

Certified by: Stephanie Seneff, Thesis Supervisor

Certified by: Victor W. Zue, Thesis Supervisor

Accepted by: Frederic R. Morgenthaler, Chairman, Departmental Committee on Graduate Students


Phonological Parsing for Bi-directional Letter-to-Sound / Sound-to-Letter Generation

by

Helen Mei-Ling Meng

Submitted to the Department of Electrical Engineering and Computer Science on February 14, 1995 in partial fulfillment

of the requirements for the degree of Doctor of Philosophy

Abstract

This thesis proposes a unified framework for integrating a variety of linguistic knowledge sources for representing speech, in order to facilitate their concurrent utilization in spoken language systems. The feasibility of the proposed methodology is demonstrated on the test bed of bi-directional letter-to-sound / sound-to-letter generation. We present a hierarchical lexical representation which includes information such as morphology, stress, syllabification, phonemics and graphemics. Each of these knowledge sources occupies a distinct stratum in the hierarchy, and the constraints they provide are administered in parallel during generation. A probabilistic parsing paradigm is adopted for generation. The parser is a hybrid of a rule-based formalism and data-driven techniques, and is capable of bi-directional generation. Our training and testing corpora are derived from the high-frequency portion of the Brown Corpus (10,000 words), augmented with markers indicating stress and word morphology. We evaluated our performance on an unseen test set. The percentages of nonparsable words for letter-to-sound and sound-to-letter generation were 6% and 5% respectively. On the remaining words our system achieved a word accuracy of 71.8% and a phoneme accuracy of 92.5% for letter-to-sound generation, and a word accuracy of 55.8% and a letter accuracy of 89.4% for sound-to-letter generation. The implementation of a robust parsing mechanism shows how generation constraints can be relaxed within the hierarchical framework, in order to broaden coverage and handle nonparsable words. Additionally, a pilot study provides evidence that the framework can be generalized to encompass other linguistic knowledge sources for potential applications in speech synthesis, recognition and understanding.

Thesis Supervisors:

Dr. Stephanie Seneff, Principal Research Scientist
Dr. Victor W. Zue, Senior Research Scientist


as unwavering support and constant encouragement. Working with Stephanie and Victor has been a great pleasure and honor, and I cannot conceive of better thesis advisors. Their profound inspiration will extend far beyond the scope of this work.

I am also grateful to the members of my thesis committee for an expeditious yet careful reading of my thesis. Thanks to Professor Jonathan Allen for sharing with me his experience with the development of the famous MITalk system. I thank Dr.

Andrew Golding for taking a keen interest throughout the course of this work, and for his enlightening technical comments and critiques of this thesis. I also thank Dr.

Kim Silverman for his stimulating input concerning this research, and for travelling from California to Boston to attend my thesis defense.

I would also like to extend my appreciation to the past and present members of the Spoken Language Systems Group. My thanks go to Dr. Sheri Hunnicutt for many informative discussions about English morphology and her experience with rules generation in MITalk, and for providing the labelled corpus for my experiments;

to Dr. Eric Brill for his help with the transformation-based error-driven learning algorithms; to the research staff for many thoughtful comments and feedback about



my work; to Christine Pao and Joe Polifroni for keeping the machines up and running;

and to Vicky Palay and Sally Lee for ensuring that everything else runs smoothly.

I thank all my fellow students in the Spoken Language Systems Group for their comradeship, and for making school life a lot of fun. Aside from learning from one another and discussing technicalities, we have shared many Friday-afternoon-happy-hours over spectrograms, and together discovered the therapeutic effects of chocolate and tennis. I especially thank my office-mates, TJ Hazen and Raymond Lau, for their moral support and funny jokes, which make the thesis home-stretch much more bearable.

Special thanks also go to all my good friends for many enjoyable times which make student life at MIT (outside the lab) both fun and memorable.

Finally, I express my whole-hearted gratitude to my family: my grandmother, my parents, my brothers and my sister-in-law, for their unconditional love, unfailing support and uplifting encouragement. I thank my parents for providing me with the best education, for instilling in me a strong desire to learn, for their comforting words during times of hardship, and for having faith that I will attain my goals.

This research was supported by ARPA under Contract N00014-89-J-1332, monitored through the Office of Naval Research, and a grant from Apple Computer Inc.


Contents

1 Introduction
  1.1 Overview
  1.2 An Integrated Hierarchical Framework for Speech
  1.3 Spelling-Phonemics Conversion
    1.3.1 Orthographic-phonological Correspondences in English
  1.4 Previous Work
    1.4.1 Letter-to-Sound Generation
    1.4.2 Sound-to-Letter Generation
    1.4.3 Summary of Previous Approaches
  1.5 Thesis Goals
  1.6 Thesis Outline

2 The Lexical Representation
  2.1 Integration of Various Linguistic Knowledge Sources
  2.2 Some Examples
  2.3 Chapter Summary

3 The Parsing Algorithm
  3.1 Data Preparation
  3.2 The Training Procedure



  4.3 Results on Sound-to-Letter Generation
  4.4 Error Analyses
  4.5 Data Partitioning
  4.6 Chapter Summary

5 Evaluating the Hierarchy
  5.1 Investigations on the Hierarchy
    5.1.1 Results
  5.2 The Non-linguistic Approach
    5.2.1 Results
  5.3 Chapter Summary

6 Robust Parsing
  6.1 The Causes of Parse Failure
  6.2 The Robust Parser
  6.3 Performance
  6.4 Chapter Summary

7 Extending the Hierarchy
  7.1 Background
  7.2 Motivation
  7.3 Experimental Corpus
  7.4 Phonological Variations


  7.5 Extending the Hierarchical Representation
  7.6 Extending the Layered Bigrams Parser
    7.6.1 Training in the Extended Layered Bigrams
    7.6.2 Testing in the Extended Layered Bigrams
    7.6.3 Lexical Access in the Extended Layered Bigrams
  7.7 Captured Phonological Variations
    7.7.1 Allophonic Variations
    7.7.2 Across-word Phonological Variations
    7.7.3 Within-word Phonological Variations
    7.7.4 Capturing Phonological Rules
  7.8 Experimental Results
  7.9 Chapter Summary

8 Conclusions and Future Work
  8.1 Thesis Summary
  8.2 Performance Improvement
  8.3 Large Vocabulary Speech Recognition
  8.4 Interface with Pen-based Systems
  8.5 Multilingual Applications
  8.6 Speech Generation, Understanding and Learning in a Single Framework

A List of Morphs
B List of Syllables
C List of Subsyllabic Units
D List of Broad Manner Classes
E List of Phonemes
F List of Graphemes


List of Figures

1-1 A Proposed Grand Hierarchy for Representing Speech
2-1 Lexical representation for the word "monkey", shown here in a parse tree format.
2-2 Lexical representation for the word "dedicated", shown here in a parse tree format, and with the different linguistic layers indicated numerically.
2-3 Lexical representation for the word "dedicate", shown here in a parse tree format.
2-4 Lexical representation for the word "taxes", shown here in a parse tree format.
2-5 Lexical representation for the word "hero."
2-6 Lexical representation for the word "heroic."
2-7 Lexical representation for the word "accelerometer."
2-8 Lexical representation for the word "headlight."
2-9 Lexical representation for the name "Arkansas."
2-10 Lexical representation for the name "Meredith."
2-11 Lexical representation for the word "buddhism."
2-12 Lexical representation for the word "national."
2-13 Lexical representation for the word "issue."
2-14 Lexical representation for the word "define."
2-15 Lexical representation for the word "defining."
2-16 Lexical representation for the word "definition."



3-1 A parse tree generated by TINA for the word "predicted." pre denotes "prefix," isuf denotes "inflectional suffix," syl denotes "unstressed syllable," ssyl1 denotes "primary stressed syllable," and nuc denotes "nucleus."
3-2 The parse generated by TINA for the word "predicted," shown in a parse tree format in the previous figure, but displayed here in layered bigrams format.
4-1 Letter-to-sound generation experiments: Percent correct whole-word theories as a function of N-best depth for the test set.
4-2 Sound-to-letter generation experiments: Percent correct whole-word theories as a function of N-best depth for the test set.
5-1 Word accuracies as a function of the different layers omitted from the hierarchical lexical representation. Layer 4 is the layer of subsyllabic units.
5-2 Perplexities as a function of the different layers omitted from the hierarchical lexical representation. Layer 4 is the layer of subsyllabic units.
5-3 Coverage as a function of the different layers omitted from the hierarchical lexical representation. Layer 4 is the layer of subsyllabic units.


5-4 Number of parameters as a function of the different layers omitted from the hierarchical lexical representation. Layer 4 is the layer of subsyllabic units.
6-1 Parse tree for the word "typewriter."
6-2 Parse tree for the word "lloyd."
6-3 Parse tree for the word "tightly."
6-4 Parse trees for the word "cushion": (left) from letter-to-sound generation and (right) from sound-to-letter generation.
6-5 Top-level architecture for the robust parser.
6-6 Robust parser output for the word "typewriter."
6-7 Robust parser output for the word "lloyd."
6-8 Parse tree for the word "lightly."
6-9 Parse tree for the word "charlie" from robust letter-to-sound generation.
6-10 Parse tree for the word "henrietta" from robust letter-to-sound generation.
6-11 Parse tree for the word "joe" from robust letter-to-sound generation.
6-12 Parse tree for the word "cushion" from robust sound-to-letter generation.
6-13 Parse tree for the word "henrietta" from robust sound-to-letter generation.
6-14 Parse tree for the word "typewriter" from robust sound-to-letter generation.
7-1 Some phonological variations occurring in the sa-1 training sentences, "She had your dark suit in greasy wash water all year." dcl and kcl denote d-closure and k-closure respectively.
7-2 Some phonological variations occurring in the sa-2 training sentences, "Don't ask me to carry an oily rag like that." tcl and kcl denote t-closure and k-closure respectively.


7-6 ... "wash water all year."
7-7 Bar graph showing the occurrences of the different allophones of /t/.
7-8 Word and sentence accuracies of the layered bigrams in parsing sentences, plotted as a function of increasing training data.
8-1 A portion of the Speech Maker grid representing the word "outstanding."
8-2 An example of a two-dimensional rule in Speech Maker. The upward arrows delineate the letter to be transcribed and the corresponding phoneme. The rule expresses that the letter "a" which precedes an arbitrary number of consonants ending with the letter "e" should be pronounced as /e/.


List of Tables

1.1 Previous Approaches for Letter-to-sound Generation
1.2 Previous Approaches for Sound-to-letter Generation
2.1 Table showing the different layers in the lexical representation, the number of categories in each layer and some example categories.
3.1 Examples of lexical entries in the training corpus.
3.2 Examples of lexical entries in the training corpus.
4.1 Letter-to-sound generation experiments: Word and phoneme accuracies for training and testing data. Nonparsable words are excluded.
4.2 Sound-to-letter generation experiments: Word and letter accuracy for training and testing data.
4.3 Some examples of generation errors.
5.1 Examples of generated outputs using the non-linguistic approach.
5.2 Experimental results for spelling-to-pronunciation generation using the non-linguistic approach.
5.3 Error examples made by the non-linguistic approach.
6.1 Performance improvement on the development test set with the addition of robust parsing. Zero accuracies were given to nonparsable words.



8.1 Some examples of rule templates for transformational error-driven learning. These rules include context up to a window of seven phonemes/letters centered at the current phoneme/letter, i.e. the windows are P-3 P-2 P-1 P0 P1 P2 P3 and L-3 L-2 L-1 L0 L1 L2 L3, where P0 is the current phoneme and L0 is the current letter.


Chapter 1 Introduction

1.1 Overview

Human-machine communication via speech is the shared vision and common goal of many speech researchers. Computers with the power to speak and listen can create a user-friendly, hands-free and eyes-free environment for the user, and the speech medium can provide an efficient and economical mode of transmission. Great strides have been made in many areas of speech research over the past few decades. Speech synthesizers [41] have achieved a reasonable degree of clarity and naturalness, and are striving to cover unlimited vocabularies. Speech recognizers are now capable of speaker-independent, large-vocabulary, continuous speech recognition. The speech input may either be read or spontaneous.1 Vocabulary sizes can range from a few thousand words to tens of thousands of words [63] and efforts to handle out-of-vocabulary words are under way [6], [35]. Natural language understanding systems can analyze a recognized sentence to obtain a meaning representation [73]. The semantics are then channelled to the appropriate locations to perform specific actions (such as sav-

1 Read speech tends to be "cleaner" than spontaneous speech. The latter is characterized by hesitations, filled pauses such as "um" and "ah," false starts (e.g. "I want to fly to Bos- Boston") and ungrammatical constructs.



domains [28]). The systems accept spontaneous speech as input and respond with synthesized speech as output. They enable the user to solve problems within the designated domain (such as trip planning, weather inquiries, etc.) [24] [28], convey a spoken message with another language via machine translation [94], or learn to read [8] [33] [54].

The development of conversational systems necessitates correct interpretation of spoken input, and accurate generation of spoken output. Decoding the semantics embedded in an acoustic signal, or encoding a message in synthesized speech, involve diverse sources of linguistic knowledge [14] [96]. Amongst these are:

Signal processing: the transformation of a continuously-varying acoustic speech signal into a discrete form.

Phonology and acoustic-phonetics: the study of speech sounds, their variabilities as a result of coarticulation, as well as their acoustic characteristics. For example, although the underlying phoneme sequences in "nitrate" and "night rate" are identical, they are realized differently.

Lexical constraints and word morphology: the knowledge about the composition of words in a language.2

Syntactic information: the rules about grammatical constructs in the formation of clauses, phrases or sentences from words.

2 A morpheme is the minimal syntactic unit in a language which carries meaning, and a morph is the surface realization of a morpheme.



Semantic information: the meaning of the spoken input. For example, it may be difficult to differentiate between the two sentences "Meter at the end of the street" and "Meet her at the end of the street" based on the acoustics of continuous speech, but they are different both syntactically and semantically.

Prosodics: the stress and intonation patterns of speech. The location of emphasis in a spoken sentence conveys specific meaning. "I am FLYING to Chicago tomorrow" indicates that flying is the means of transportation to Chicago (and not driving or others); while "I am flying to CHICAGO tomorrow" proclaims that Chicago is the flight destination (and not Boston or another city).

Discourse and pragmatics: the context of the conversation and the rational chain of thoughts invoked. Consider as examples the sentences "It is easy to recognize speech" and "It is easy to wreck a nice beach." Both are semantically reasonable and syntactically well-formed. However, acoustically they are almost indistinguishable. In order to achieve disambiguation, a conversational system needs to resort to information regarding the dialogue context.

These different knowledge sources interact to modify the speech signal.

Word pronunciations, which are a major concern in recognition and synthesis, can be influenced by word morphology, syntax, semantics and discourse. For example, the pronunciation of "unionization" depends on whether the topic of interest concerns "ions" or "unions," which may give the respective derivations "un+ionization" or "union+ization."3 Semantics is needed for the disambiguation of homonyms such as "see," "sea" and "C". Syntax leads to the different pronunciations between the noun and verb forms of "conduct" (/k a n d { k t/ and /k { n d ^ k t/). Coarticulatory effects in different phonetic contexts and across word boundaries are expressed as phonological rules [62]. Examples include the flapping of the /t/ in "water"

3 This example is borrowed from [21].


word in the sentence "I want to fly from Boston to..." the search algorithm probably should focus on the city names in the vocabulary.

It is therefore obvious that these interrelated knowledge sources are indispensable in the development of speech systems, be it synthesis, recognition or understanding. The different types of information, or subsets of them, are often incorporated independently, and with ad hoc methodologies, into the components of existing conversational systems. Phonological rules are applied in letter-to-sound generation in speech synthesis [17]. They are also embedded in pronunciation models and networks in speech recognizers [31]. n-gram language models [39] are popular for guiding the search in speech recognizers, because they can be automatically acquired for different tasks with a wide range of perplexities, and are thus more adaptable than finite-state grammars [47] [49]. The recognition outputs may be further re-processed using natural language parsers to provide syntactic analysis and derive meaning. As was shown earlier, semantics and syntax may come into play for reducing the search space of the recognizer, especially for high perplexity5 tasks (such as those with large vocabularies) where constraints given by the n-gram language models are weak. A lower search complexity should help avoid search errors and maintain high recognition performance. Discourse and prosody have also been used in dialogue management [42].

4 There also exist systems which attempt to obtain semantics without involving syntactic analysis; see [65] [92].

5 Perplexity is an information-theoretic measure of the average uncertainty, at each word boundary, about the next possible words to follow. Later in the thesis we will show how it is computed. A high perplexity signifies a large search space, and a more difficult recognition problem.
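For reference, a standard formulation consistent with this description (the thesis presents its own computation later; the textbook definition is stated here only as an aid):

```latex
% Perplexity of a language model q over a test word sequence w_1 ... w_N:
\mathrm{PP} = 2^{H}, \qquad
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 q\left(w_i \mid w_1, \ldots, w_{i-1}\right)
```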


We feel that instead of allowing these knowledge sources to reside individually in a conversational system, it is more desirable to model their interrelationships in an integrated framework. The objective of this thesis is to propose a methodology for such integration. The resultant framework should facilitate the concurrent utilization of the knowledge sources, and should be applicable in speech synthesis, recognition and understanding.

1.2 An Integrated Hierarchical Framework for Speech

Having a common framework which integrates all the relevant knowledge sources for speech synthesis, recognition and understanding is advantageous. Not only can it reduce redundancy in development efforts, but also any improvements made in the framework can be inherited by all three tasks. This integration is best exemplified by the human communication system. Our framework is therefore designed to mirror the chain of events underlying the communication between a speaker and a listener, a sequence which has been described as the speech chain [20].

When a speaker wants to convey a spoken message to a listener, he first gathers his thoughts, which constitutes the semantics of his speech. The semantics is generally coherent with the context of the dialogue, which involves discourse and pragmatics.

The speaker proceeds to configure his message into a linguistic form. He chooses the appropriate words and their morphological forms from his vocabulary, and organizes them into sentences and phrases according to the grammar rules or syntax of the language. The speech utterance is then formulated in the brain, along with prosodic features (pitch, intonation and duration) and stress (sentential and word-internal stress) to aid expression. The utterance is then spoken by the coordinated movements of the vocal organs and articulators, producing the phonetics of the speech wave which is transmitted from the speaker to the listener. The acoustics of the speech wave


It seems plausible that the design of a unified framework for speech be modeled after the speech chain. We conceive of a grand speech hierarchy with multiple levels of linguistic knowledge sources, grossly ranging from discourse, pragmatics and semantics at the upper levels, through the intermediate levels including prosody and stress, syntax, word morphology, syllabification, distinctive features, to the lower levels of word pronunciations, phonotactics and phonology, graphemics,6 phonetics and acoustics. The framework should encode not only the constraints propagated along each level of linguistic representation, but also the interactions among the different layers.

The hierarchy is illustrated in Figure 1-1. From one perspective, the order of events in speech production is roughly simulated as we descend the hierarchy; while the reverse order as we ascend the hierarchy approximately models the speech perception process. Looking from another perspective, this unified body of linguistic knowledge should be applicable in speech generation/synthesis, recognition and understanding.

Furthermore, learning can be achieved if the regularities within the framework can be derived and utilized in generating new structures or representations.

The prime objective of this thesis is to propose such a unified framework of linguistic knowledge sources for multiple speech applications. The test-bed selected for demonstrating the feasibility of our methodology is the task of bi-directional spelling-to-phonemics/phonemics-to-spelling generation.

6By \graphemes" we are referring to contiguous letters which correspond to a phoneme.


[Figure 1-1: A Proposed Grand Hierarchy for Representing Speech. Levels, top to bottom: Discourse and Pragmatics; Syntax and Semantics; Sentence/Phrase Prosodics; Word Morphology; Word Stress; Syllabification and Phonotactics; Phonemics; Phonetics and Graphemics; Acoustics.]


English word. We have selected as our test-bed the design of representations and algorithms pertaining to the simultaneous/synchronized application of these knowledge sources for bi-directional spelling-phonemics conversion. This task should suffice as apt evidence for the viability of our proposed unified framework, at least on the (smaller) scale of the English word. The thesis will also include preliminary experiments which show the extendability of the implemented framework. The versatility of the framework is demonstrated by the bi-directionality: the same set of knowledge sources remains pertinent, be it spelling-to-phonemics generation or phonemics-to-spelling generation. In a similar manner, if the grand speech hierarchy in Figure 1-1 is realized, its versatility should transcend to applications in speech synthesis, recognition and understanding.

The bi-directional generation task is also chosen because of its usefulness in handling out-of-vocabulary words in unrestricted text-to-speech synthesis and large vocabulary speech recognition. Text-to-speech synthesizers used as reading machines for the blind, or as interactive voice response for transmission over the telephone, often encounter new words outside their vocabulary. When this happens, letter-to-sound generation becomes the key operating mechanism. Similarly, it is difficult to fully specify the active vocabulary of a conversational system beyond a static initial set. Users should be able to enter new words by providing the spoken, typed or handwritten spellings and/or pronunciations. If only one of the two elements is given, a bi-directional system will be able to automatically generate the other element, and dynamically update the system's vocabulary accordingly.


The development of a bi-directional letter-to-sound/sound-to-letter generator warrants an understanding of the relationship between English orthography and phonology. This will be examined in the following subsection.

1.3.1 Orthographic-phonological Correspondences in English

The English writing system is built from the 26 letters in the alphabet. However, only certain letter sequences are found in English words. Adams [1] [53] noted that,

From an alphabet of 26 letters, we could generate over 475,254 unique strings of 4 letters or less, or 12,376,630 of 5 letters or less. Alternatively, we could represent 823,543 unique strings with an alphabet of only 7 letters, or 16,777,216 with an alphabet of only 8. For comparison, the total number of entries in Webster's New Collegiate Dictionary is only 150,000.
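These figures follow from simple combinatorics; a quick illustrative check (note that the direct sum for the 5-letter case comes to 12,356,630, slightly below the figure as quoted):

```python
# Count the non-empty strings of length <= k over an alphabet of a symbols.
def n_strings(a: int, k: int) -> int:
    return sum(a ** i for i in range(1, k + 1))

print(n_strings(26, 4))  # 475254   ("4 letters or less")
print(n_strings(26, 5))  # 12356630 (quoted above as 12,376,630)
print(7 ** 7)            # 823543   (length-7 strings over 7 letters)
print(8 ** 8)            # 16777216 (length-8 strings over 8 letters)
```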

Such a limited set of letters and letter patterns, however, encodes a vast body of knowledge. In fact, the graphemic constraints may very well be a consequence of this.

The alphabetic principle [69] refers to the occurrence of systematic correspondences between the spoken and written forms of words: the letters and letter patterns found in written English map somewhat consistently to the speech units such as phonemes in spoken English. Chomsky and Halle [11] pointed out that English phonology and morphology are simultaneously represented in the orthography. This suggests that the orthography should exhibit cues which reflect lexical structures like the morpheme.

Other lexical structures like the syllable are derived from phonotactic constraints specific to the language, so if written English largely corresponds to spoken English, then syllabic structures should be found in the orthography as well [1] [53].

The way that English orthography corresponds to morphology, syllabification and phonology is fairly systematic, but it also admits many irregularities. Therefore, English has been described as a quasi-regular system [53]. To illustrate correspondences in morphology, consider the words "preview" and "decode," which contain


tent, while "gave"-"have" is not.7 Vowels account for many of the inconsistencies in letter-phoneme mappings, since the identity of the vowel in a word is strongly affected by the stress pattern of the word. The stress pattern is in turn dependent on the part of speech of a word, e.g., homographs which can take on two parts-of-speech often have a stress-unstress pattern if they are nouns, and an unstress-stress pattern if they are verbs, as in "record" and "permit." Another interesting class of exceptional pronunciations arises from high-frequency words [3]. Initial "th" is pronounced as /T/ (a voiceless fricative) in many words (such as "thin," "thesis," "thimble"), but for very frequent words such as the short function words ("the," "this," "there," "those"),

\th" is pronounced as /D/ (a voiced fricative). Similarly, \f" is always pronounced as an /f/ (an unvoiced fricative) except for the single case \of." Finally, the nal \s" in

\atlas" and \canvas" is realized as the unvoiced /s/, but for the function words \is,"

\was" and \has," it is realized as the voiced /z/.

As we can see, English orthographic-phonological correspondences seem to operate through the intermediate levels of morphology and syllabification, and contain both regularities and irregularities. Irregularities arise due to the stress pattern of a word, different dialects (e.g. British and American English), lexical borrowings from other languages and spelling reforms, to name a few reasons [53]. Since English is quasi-regular in nature, it seems that a possible way to tackle the spelling-to-pronunciation or pronunciation-to-spelling conversion problems is to capture regularities using rules

7 These examples are borrowed from [53].


and statistics, while accommodating irregularities using exception dictionaries. Any attempt to determine the orthographic-phonological regularities in English must consider the two important areas of representing and deriving such regularities. In the next section, we will give an overview of the approaches adopted in previous attempts to capture letter-sound regularities for the development of pronunciation and spelling systems.

1.4 Previous Work

1.4.1 Letter-to-Sound Generation

A myriad of approaches have been applied to the problem of letter-to-sound generation. Excellent reviews can be found in [18], [29] and [41]. The various approaches have given rise to a wide range of letter-to-sound generation accuracies. Many of these accuracies are based on different corpora, and some corpora may be more difficult than others. Furthermore, certain systems are evaluated by human subjects, while others have their pronunciation accuracies reported on a per phoneme or per letter basis. Insertion errors or stress errors may be included in some cases, and ignored in others. There are also systems which look up an exceptions dictionary prior to generation, and the performance accuracies of these systems tend to increase with the use of larger dictionaries. Due to the above reasons, we should be careful when comparing different systems based on the quoted performance values.

The following is a sketch of the various approaches with a few illustrative examples.

1. Rule-based Approaches

The classic examples of rule-based approaches include MITalk [3], the NRL system [21], and DECtalk [17]. These use a set of hand-engineered, ordered rules for transliteration. Transformation rules may also be applied in multiple passes in order to process linguistic units larger than the phoneme/grapheme, e.g., morphs. The rule-based approaches have by far given the best generation


Maker formalism [89] developed for Dutch. The two-dimensional rules in the Speech Maker are modelled after the delta system [34]. The rules manipulate the contents of a data structure known as the grid, which contains streams of linguistic representations synchronized by markers.

Writing rule sets is an arduous process. As the rule set increases in size, the determination of rule ordering and the tracing of rule interactions become more difficult. Furthermore, rules generally have low portability across domains or languages. Therefore, there are also other approaches which try to automatically infer these transcription rules, or the letter/sound correspondences which they represent.

2. Induction Approaches

Induction approaches attempt to infer letter-to-sound rules from a body of training data. The rules follow the form of generative phonology, which gives a letter and its transcription under a specified spelling context. Examples of this approach can be found in [36], [40], [51], [61], [71] and [87]. The following briefly recounts a few of them.

Klatt and Shipman [40] used a 20,000 word phonemic dictionary to create letter-to-sound rules of the form A → [b] / CD _ EF, i.e., the letter "A" goes to the phoneme [b] in the letter environment consisting of 2 letters on each side. If there are rule conflicts, the most popular rule in the conflicting set is used. The computer program organizes the rules into a tree for run-time efficiency, and


the system achieved an accuracy of 93% correct by letter.

Lucassen and Mercer [51] designed another letter-pattern learner using an information-theoretic approach. The phonemic pronunciation is viewed as being generated from the spelling via a noisy channel. The channel context consists of 4 letters to the left and right of the current letter, and the 3 phonemes to the left. A decision tree is constructed based on a 50,000 word lexicon, where at each step, the tree includes the context feature with the maximum conditional mutual information.8 They reported a performance of 94% accuracy per letter on a test set of 5,000 words.

Hochberg et al. [36] devised a default hierarchy of rules, ranging from the most general rule set at the bottom to the most specific rule set on top. The bottom level (Level 1) has 26 general rules, each being a context-independent transcription of a single letter to its most frequent phoneme according to the training corpus. At the next level up (Level 2), each rule includes as context one letter to the left/right of the letter to be transcribed. Level 3 rules are a natural extrapolation: they include up to 2 letters to the left or right of the current letter. Therefore, the rules at level i contain (i-1) letters as context. Training identifies the phoneme x in each rule to be the most frequently occurring pronunciation in the training corpus. Each rule has a numerical value computed as its "strength," which is based on the training corpus statistics. Testing pronounces each letter sequentially, and rule applications are ordered top-down in the hierarchy. Rule "conflicts" are reconciled according to rule strengths and a "majority rules" principle. The system was trained and tested on disjoint sets of 18,000 and 2,000 words respectively, and achieved an accuracy of 90% by phoneme. A similar approach was also adopted at Martin Marietta Laboratories [70].
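A minimal sketch of this specific-to-general lookup (the rule entries and strengths below are hypothetical; the real system's strength computation and "majority rules" reconciliation are more involved):

```python
# Rules keyed by (left context, letter, right context); a rule at level i
# carries i-1 total context letters in some left/right split. Values are
# (phoneme, strength) pairs; the strengths stand in for corpus statistics.
RULES = {
    ("t", "h", "e"): [("D", 0.8)],   # hypothetical level-3 rule
    ("", "h", ""):   [("h", 0.6)],   # level-1 context-independent default
}

def transcribe_letter(word, i, max_ctx=2):
    # Try the most specific context first, then back off to more general.
    for total in range(max_ctx * 2, -1, -1):
        candidates = []
        for left in range(min(total, max_ctx, i), -1, -1):
            right = total - left
            if right > max_ctx or i + 1 + right > len(word):
                continue
            key = (word[i - left:i], word[i], word[i + 1:i + 1 + right])
            candidates += RULES.get(key, [])
        if candidates:                      # reconcile conflicts by strength
            return max(candidates, key=lambda r: r[1])[0]
    return None

print(transcribe_letter("the", 1))   # -> "D" (voiced fricative)
```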

8 The conditional mutual information between u1, u2 and u3 is defined as log [P(u1 | u2, u3) / P(u1 | u3)].
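Written out (the pointwise form follows the footnote as reconstructed above; the averaged form, which a greedy tree-growing step would maximize, is added here for clarity):

```latex
% Pointwise form, per footnote 8:
i(u_1; u_2 \mid u_3) = \log \frac{P(u_1 \mid u_2, u_3)}{P(u_1 \mid u_3)}

% Averaged over the data when ranking candidate decision-tree questions:
I(U_1; U_2 \mid U_3) = \sum_{u_1, u_2, u_3} P(u_1, u_2, u_3)\,
\log \frac{P(u_1 \mid u_2, u_3)}{P(u_1 \mid u_3)}
```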


performance of 85% accuracy per phoneme. Aside from this work, HMMs have also been used for the alignment of orthography and phonemics prior to an inductive learning transliteration procedure for Dutch [86]. Another approach related to HMMs can be found in [52].

4. Connectionist Approach

A well-known example of this approach is NETtalk developed by Sejnowski and Rosenberg [72]. NETtalk is a neural network that learns the pronunciations of letters. The network consists of three fully connected layers: the input layer takes in a 7-letter context window, where the middle letter is the one to be pronounced and the other six serve as left and right context; the hidden middle layer performs intermediate calculations, and the output layer gives a vector indicative of a phoneme and a stress level (two degrees of stress are included).

The network was trained for 5 passes on 1,000 words and tested on a non-disjoint dictionary of 20,012 words. The "best guess"9 performance was found to be 90% correct by letter. NETtalk was also re-implemented by McCulloch et al. [55] to become NETspeak, in order to examine the effects of different input and output encodings in the architecture, and of the word frequencies on network performance.
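For concreteness, a sketch of this kind of 7-letter sliding-window input encoding (a simplified assumption: one-hot letter units and an underscore pad; the published network's coding units differ in detail):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz_"   # "_" pads word boundaries

def encode_windows(word, width=7):
    # One input vector per letter: the letter to be pronounced sits in the
    # middle; three letters on each side provide context; each of the 7
    # positions is a one-hot block over the alphabet.
    half = width // 2
    padded = "_" * half + word.lower() + "_" * half
    for i in range(len(word)):
        window = padded[i:i + width]
        vec = np.zeros(width * len(ALPHABET))
        for j, ch in enumerate(window):
            vec[j * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
        yield window, vec

for window, vec in encode_windows("monkey"):
    print(window, int(vec.sum()))   # each window activates exactly 7 units
```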

Lucas and Damper [50] developed a system for bi-directional text-phonetics

9 The dot products between the output vector and the code vector of every phoneme are computed. The phoneme that has the smallest product is the "best guess" output.


translation using two "syntactic" neural networks (SNN) to perform statistical string translation. This system, unlike the others, does not require pre-aligned text-phonetic pairs for training, but instead tries to infer appropriate segmentations and alignments. The first SNN models orthography while the second models phonemics. Training is done in three phases. In the first phase, each SNN allocates a neuron node for the high-frequency substrings in its own domain. In the second phase, transition (bigram) probabilities corresponding to the recurrent connections between neurons within an SNN are estimated. Finally, the third phase learns the translation probabilities between the nodes of one domain and those in the other domain. The activation of a node takes into account all the weighted recurrent connections to that node. The output symbol corresponding to the node with the highest activation is selected as the generated translation. In text-to-phonemics conversion, training and testing on two disjoint 2000-word corpora gave a 66% phoneme accuracy and 26% word accuracy.

5. Psychological Approaches

Dedina and Nusbaum [19] developed the system PRONOUNCE to demonstrate the computational feasibility of the analogical model. This model was proposed by Glushko [26] in the psychology literature, and suggests that humans use a process of analogy to derive the pronunciation for a spelling pattern, as an alternative to the pronunciation-by-rule theory. PRONOUNCE uses a lexical database of approximately 20,000 words. It does not have a training phase. Instead, PRONOUNCE matches each spelling pattern in the test word against every lexical entry, and if there are matching substrings, the corresponding phonetic pattern is retrieved to build a pronunciation lattice. After the matching phase, PRONOUNCE traverses the lattice to find the "best path," using the lengths and frequencies of the subpaths as search heuristics. The system was evaluated on a set of 70 nonsense monosyllabic words, and was found to disagree with


to PRONOUNCE, except that it uses a scoring mechanism based on the text-phonemic mapping statistics, instead of a lexicographic function. The phonemic "analogiser" begins by using a set of context-free rules to generate multiple pronunciations, and these are re-ranked in a way similar to the lexical analogies. The outputs from the orthographic and phonemic analogisers are eventually combined to generate the result.

6. Case-based Reasoning and Hybrid Approaches

Case-based approaches generate a pronunciation of an input word based on similar exemplars in the training corpus. The TTS system [16] developed at Bell Labs adopts this approach for generating name pronunciations. It operates primarily as a 50K dictionary lookup, but if direct lookup fails, the system will try using rhyming analogies (e.g. "ALIFANO" and "CALIFANO"), perform suffix-exchanges (e.g. "AGNANO" = "AGNELLI" - "ELLI" + "ANO") or append suffixes (e.g. "ABELSON" = "ABEL" + "SON"). If everything fails, then TTS will fall back on a rule-based system named NAMSA for prefix and suffix analysis and stress reassignment.

MBRtalk [78] [79] is a pronunciation system operating within the memory-based reasoning paradigm. The primary inference mechanism is a best-match recall from memory. A data record is generated for every letter in a training word. Each record contains the current letter, the previous three letters, the

10 This terminology is adopted from reference [83].


next three letters, and the phoneme and stress assigned to the current letter.

For each letter in the test word, the system retrieves the 10 data records that are most "similar" to the letter under consideration. A special dissimilarity metric is used for the retrieval. Weights are assigned to each of the 10 records according to their dissimilarity to the current letter, whose pronunciation is then determined from the records and their respective weights. Training on 4438 words and testing on 100 novel words gave a performance accuracy of 86%

per phoneme. Evaluation by six human subjects gave a word accuracy between 47% and 68%. An extension of this work is found in [80]. Another approach using case-based reasoning can be found in [46].
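A minimal sketch of the recall-and-vote step (the memory entries and the plain mismatch-count metric here are hypothetical stand-ins for MBRtalk's special dissimilarity metric):

```python
from collections import defaultdict

# One record per training letter: a 7-tuple of (l-3, l-2, l-1, letter,
# l+1, l+2, l+3) paired with the (phoneme, stress) assigned to that letter.
MEMORY = [
    ((" ", " ", " ", "t", "h", "i", "n"), ("T", "1")),  # hypothetical entries
    ((" ", " ", " ", "t", "h", "e", " "), ("D", "0")),
]

def dissimilarity(a, b):
    # Stand-in metric: count mismatching positions in the two 7-tuples.
    return sum(x != y for x, y in zip(a, b))

def pronounce_letter(query, k=10):
    # Recall the k best-matching records and let them vote, weighted
    # inversely by their dissimilarity to the query.
    nearest = sorted(MEMORY, key=lambda rec: dissimilarity(rec[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        votes[label] += 1.0 / (1 + dissimilarity(features, query))
    return max(votes, key=votes.get)

print(pronounce_letter((" ", " ", " ", "t", "h", "i", "c")))  # -> ('T', '1')
```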

Golding [29] proposed a hybrid approach based on the interaction of rule-based and case-based reasoning and developed the system ANAPRON. Rules are used to implement broad trends and the cases are for pockets of exceptions. The set of rules is adapted from MITalk and foreign-language textbooks. Each rule records its own set of positive and negative exemplars. In pronunciation generation, the hand-crafted rules are applied to obtain a first approximation to the output, and this is then refined by the case-base if any compelling analogies are found. The judgement for compellingness is based on the ratio between the positive and negative exemplars in the rules, and the similarity between the test token and the negative exemplars. In this way, rules and the case-base form nice complements. This approach was evaluated on a name pronunciation task, with a case-library of 5000 names, and a separate set of 400 names for testing. The percentage of acceptable pronunciations was measured and compared with NETtalk and other commercial systems (from Bellcore [77], Bell Labs [16], and DEC [17]). ANAPRON performed significantly better than NETtalk in this task, yielding a word accuracy of 86%, which is very close to the performance of the commercial systems.

Van den Bosch et al. [88] experimented with two data-oriented methods for grapheme-to-phoneme conversion.


letter in the training data, the table stores the minimum context required to arrive at an unambiguous transcription, up to five letters to the left and right of the current letter (a 5-1-5 grapheme window). Testing is essentially a table retrieval process. However, if retrieval fails to find a match, the test procedure is supported by two "default tables," which use grapheme windows of 1-1-1 and 0-1-0 respectively. The reference also suggested the use of IBL to replace the default tables. This idea is similar to Golding's method in that it is also a hybrid: between the table and the case-base, instead of rules and the case-base. Using the table method on English transliteration (18,500 training words and 1,500 testing words) gave a 90.1% accuracy per letter.
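A sketch of the table-with-defaults retrieval (all entries below are hypothetical; the real tables are compiled from the training corpus, storing the minimal disambiguating context per training case):

```python
def window(word, i, ctx):
    # Symmetric letter window of ctx letters on each side, padded with "_".
    padded = "_" * ctx + word + "_" * ctx
    p = i + ctx
    return (padded[p - ctx:p], word[i], padded[p + 1:p + 1 + ctx])

# Hypothetical entries; "-" marks a null (silent) phoneme.
TABLE_515 = {window("night", 3, 5): "-"}   # the "h" in "night" is silent
TABLE_111 = {}
TABLE_010 = {("", "h", ""): "h"}

def transliterate_letter(word, i):
    # Consult the 5-1-5 table first, then fall back on the default tables.
    for table, ctx in ((TABLE_515, 5), (TABLE_111, 1), (TABLE_010, 0)):
        key = window(word, i, ctx)
        if key in table:
            return table[key]
    return None

print(transliterate_letter("night", 3))   # -> "-" (matched in 5-1-5 table)
print(transliterate_letter("house", 0))   # -> "h" (via the 0-1-0 default)
```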

1.4.2 Sound-to-Letter Generation

The development of spelling systems is a task rarely undertaken. We know of three approaches that have previously been adopted:

1. A Combined Rule-based and Inductive Approach

The rule formalism in generative phonology is also used in generating spelling rules [95]. Two lexicons of respective sizes 96,939 and 11,638 were transcribed with one-to-one phoneme-to-grapheme matches, using the /null/ phoneme and "null" letter when necessary. Upon analysis of the lexicons, it was felt that there was insufficient consistency for a rule-based system. Therefore, each lexicon was split according to word phonemic length, and their respective rule sets were


found as a function of phoneme position, in addition to the local phonemic context. Therefore, the format of a typical rule is:

Rule: (num, pos, P0, phoneme context) → G

where num is the number of phonemes in the pronunciation, pos is the position of the current phoneme P0, which maps to grapheme G under the specified phonemic context (up to two phonemes on either side of P0). For example, the rule {5, 3, /aɪ/, P-1 = /b/ and P-2 = /a/, "I"} states that the phoneme /aɪ/, when preceded by the di-phoneme /a b/, generates the grapheme "I" (e.g. in the word "abides", pronounced as /a b aɪ d z/).

The rules are searched sequentially, given the word length and phonemic position, in the general order of increasing phonemic context: (i) no neighboring phonemes, (ii) one phoneme on the right, (iii) one phoneme on the left, (iv) one phoneme on each side, (v) two phonemes on the right and (vi) two phonemes on the left. The search proceeds until a unique grapheme is found. If there are none, the system is considered to encounter a failure. Each rule set is tested on the lexicon used for its generation. Word accuracies on the small and large lexicons are 72.4% and 33.7% respectively. Another set of experiments was conducted whereby the system can revert to a set of "default" rules upon failure. These rules are manually written with reference to the lexicons. Accuracies rose to 84.5% and 62.8% for the small and large lexicons respectively.
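A sketch of this ordered search (the single rule below is a hypothetical entry modeled on the /aɪ/ → "I" example above; the ARPAbet-style symbol "ay" stands in for /aɪ/):

```python
# Rule: (num_phonemes, position, P0, context) -> grapheme, where the context
# constrains up to two phonemes on either side of P0.
RULES = [
    (5, 3, "ay", {"P-1": "b", "P-2": "a"}, "I"),   # hypothetical entry
]

# Context patterns in the prescribed order: none, one right, one left,
# one on each side, two on the right, two on the left.
ORDER = [(), ("P+1",), ("P-1",), ("P-1", "P+1"), ("P+1", "P+2"), ("P-1", "P-2")]

def spell_phoneme(pron, i):
    neighbors = {f"P{d:+d}": pron[i + d]
                 for d in (-2, -1, 1, 2) if 0 <= i + d < len(pron)}
    for slots in ORDER:
        matches = {g for (n, pos, p0, ctx, g) in RULES
                   if n == len(pron) and pos == i + 1 and p0 == pron[i]
                   and set(ctx) == set(slots)
                   and all(neighbors.get(s) == v for s, v in ctx.items())}
        if len(matches) == 1:         # stop at the first unique grapheme
            return matches.pop()
    return None                       # failure; [95] then tries default rules

print(spell_phoneme(["a", "b", "ay", "d", "z"], 2))   # -> "I"  ("abides")
```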

2. Hidden Markov Models

HMMs have also been used by Alleva and Lee [4] for acoustics-to-spelling generation. The problem is formulated roughly as an inverse of the previous application of HMMs on spelling-to-pronunciation generation: the surface form is the acoustic signal, and the underlying form is the orthography. Therefore the HMMs model the relationship between the acoustics and orthography


to more than 100%, it is assured that insertion errors were omitted for letter accuracy.

3. Connectionism

The aforementioned Syntactic Neural Network system [50], which is the only reversible system we have found in the literature, gave a 71% letter accuracy and 23% word accuracy when trained and tested on two disjoint 2000-word corpora.

1.4.3 Summary of Previous Approaches

Tables 1.1 and 1.2 summarize the two previous subsections.

1.5 Thesis Goals

In essence, the common thread running behind most automatic generation systems is the acquisition of transcription rules or swatches of letter/phoneme patterns, which enfold local context for letter/sound generation. These entities (rules or patterns) can either be written by linguistic experts or inferred from training data. If the window of context involved is narrow, the entity tends to have high data coverage, i.e., it is applicable to many test words. However, entities with narrow context windows also have a lot of ambiguities. Disambiguation needs long-distance constraints, which leads to the widening of the context windows. The corresponding rules/patterns hence


| Approach | Example Systems | Corpora | Word Accuracy | Phoneme Accuracy |
|---|---|---|---|---|
| Rule-based | MITalk | 200 (test) | 66%-77% | -- |
| | SPP (rules only) | | 85% | -- |
| | SPP (rules and exceptions) | | 97% | -- |
| Induction | Klatt & Shipman | 20K | -- | 93% per letter |
| | Lucassen & Mercer | 50K (train), 5K (test) | -- | 94% per letter |
| | Hochberg et al. | 18K (train), 2K (test) | -- | 90% per phoneme |
| HMM | Parfitt & Sharman | 50K (train and test) | -- | 85% per phoneme |
| Connectionist | NETtalk | 20K (train), 1K (non-disjoint test) | -- | 90% per letter |
| | Lucas & Damper (SNN) | 2K (train), 2K (test) | 38% | 71% per phoneme |
| Psychological | PRONOUNCE | 70 nonsense syllables | 91% | -- |
| Case-based Reasoning | MBRtalk | 4K (train), 100 (test) | 47-68% | 86% |
| Case and Rule Hybrid | Golding (ANAPRON) | 5K (train), 400 (test) | 86% | -- |

Table 1.1: Previous Approaches for Letter-to-sound Generation

| Approach | Example Systems | Corpora | Word Accuracy | Letter Accuracy |
|---|---|---|---|---|
| Rule-based and Inductive Hybrid | Yannakoudakis & Hutton | 12K (train and test) | 72% | 85% |
| | | 97K (train and test) | 34% | 63% |
| HMM | Alleva & Lee | 15K sentences (train), 30 embedded words (test) | 21% | 61% |
| Connectionist | Lucas & Damper (SNN) | 2K (train), 2K (test) | 23% | 71% |

Table 1.2: Previous Approaches for Sound-to-letter Generation


that an average word is 6 letters long, and the probability of pronouncing each letter correctly in a word is independent of the other letters. It is therefore obvious that there is quite a wide performance gap between the automatic systems and systems using hand-crafted rules, which typically can attain word accuracies in the 80-90% range.
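The back-of-the-envelope relation implied here, under the stated independence assumption (illustrative arithmetic only):

```latex
% Expected word accuracy from per-letter accuracy p, for a 6-letter word:
P(\text{word correct}) \approx p^{6},
\qquad \text{e.g. } 0.90^{6} \approx 0.53, \quad 0.97^{6} \approx 0.83
```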

This tacitly reflects the insufficiency of local context for generation. It is mainly the rule-based approaches which apply suprasegmental constraints to some significant extent. Suprasegmental rules operate on larger linguistic units, e.g. morphs and syllables, to enforce long-distance constraints concerning morphology12 and syllable stress patterns in a word. These rules also tend to be executed in a sequential manner, adding further complexity to the existing rule specification.

Reiterating our earlier statement, this thesis adopts a novel approach to spelling-phonemics conversion which differs from the ordered transformations and local pattern matchers. Relevant knowledge sources, including those beyond the local letter/phoneme context, are united in a hierarchical framework, where each knowledge source occupies a distinct stratum. All the constraints beneficial to the generation task at hand (from long-distance constraints for suprasegments to short-distance constraints for transcription) are administered in parallel.13 The advantages of this formalism are three-fold:

11 The accuracies quoted amongst the different systems should not be strictly compared, because some are measured on a per letter basis, others on a per phoneme basis, and with different data sets. We will address this later in the thesis.

12 Morphotactics refers to the positional constraints for the morphs in a word. In general, the location of morph boundaries is considered to be very important in letter-to-sound generation, because generation rules which operate within a morpheme often break down across morph boundaries.

13 This idea shares similarities with the synchronized rules in the Speech Maker formalism [89] for text-to-speech synthesis, and the two-level rules found in the PC-KIMMO system for morphological analysis [5].



1. The higher strata in the hierarchy embody longer-distance constraints. These provide additional information to the limited context used in local string matches, and may also help eliminate the large number of "specific" transcription rules.

2. Interactions between the variable-sized units from different knowledge sources (morphs, syllables, phonemes, graphemes, etc.) are harnessed in the hierarchical framework. Hence, one can avoid the tedium of tracking rule interactions and resolving rule conflicts in the determination of a rule order. The framework also offers a thorough description of the English word at various degrees of resolution.

3. Serial, transformational rules generate arbitrarily many intermediate representations between the input form and the output form. Once a rewrite rule is applied, the identity of the representation prior to the rewrite is lost. Therefore, transformation from the input form to the output form is irreversible. By contrast, the integrated framework is inherently bi-directional. The hierarchical framework preserves the same constraints exercised in both letter-to-sound and sound-to-letter generation. Consequently, the new formalism should be more efficient and economical.

Generation is performed in a parsing framework, which is suitable for providing a hierarchical analysis of the input. The parser design is a hybrid which combines the merits of a knowledge-based approach (i.e. high performance accuracy) with those of a data-driven approach (i.e. automation and robustness), by incorporating simple and straightforward linguistic rules into a probabilistic parser. The probabilistic parsing paradigm is preferred for four reasons: First, the probabilities serve to augment the



parser also permits us to automatically relax constraints to attain better coverage of the data.

In short, the goals of this thesis are:

- to demonstrate the feasibility of assembling and integrating multiple linguistic knowledge sources (lying within the scope of the English word) in a hierarchical framework, and

- to illustrate the versatility and parsimony of this unified framework in terms of the bi-directionality in spelling-phonemics conversion via a probabilistic parsing paradigm.

1.6 Thesis Outline

In this introductory chapter, we have given a brief overview of spoken language research, placing particular emphasis on the interdisciplinary aspect of the problems involved. We feel that it is desirable to combine and coordinate the suite of knowledge sources to form a coherent framework for the various speech components, and will proceed in the following to describe our attempts to achieve this goal. The rest of the thesis is organized as follows:

Chapter 2 describes the lexical representation which we have created for the English word. It is a hierarchical representation designed to integrate different levels of linguistic representation, namely morphology, stress, syllabification, distinctive features, phonemics and graphemics. Therefore, a collection of variable-length units such as morphs and syllables is used, in addition to phonemes and letters.

Chapter 3 explains the bi-directional, synthesis-by-analysis algorithm used to accomplish our generation tasks. It is based on a probabilistic parsing paradigm, entitled Layered Bigrams, which is used in accordance with the hierarchical lexical representation. The parser is a hybrid of rule-based and data-driven strategies. Details about the training phase, the testing phase, as well as the search mechanism, will be provided.

Chapter 4 presents information about the data used for our experiments, and the evaluation criteria by which we measure our performance. Results will also be reported for both letter-to-sound and sound-to-letter generation, followed by an analysis of some generation errors.

Chapter 5 lists a series of experiments which illustrate the advantages of using the hierarchical framework by comparing it with an alternative "non-linguistic" analysis based on variable-length letter/phoneme n-grams. The hierarchical representation supplies a collection of constraints which together enhance efficiency and accuracy in generation. In addition, it is a compact representation, requiring few system parameters, as it promotes a high degree of sharing among different words.

Chapter 6 addresses a major issue of concern, parser coverage, because a nonparsable word spelling or pronunciation does not yield any generated output. We have implemented a "robust" parser, which is capable of relaxing certain constraints to handle the problematic words and broaden coverage.

Chapter 7 examines the extendability of the hierarchical layered bigrams framework. It is a small step towards an existence proof that this framework can encompass other linguistic levels in the full-fledged speech hierarchy conceived in Figure 1-1. We have added a phone layer to the layered bigrams framework, and shown how it is possible to automatically capture phonological rules with probabilities trained from [...]

Chapter 2

The Lexical Representation

This chapter presents the lexical representation which we have designed for the bi-directional generation tasks. The knowledge sources which have immediate relevance to graphemic-phonemic mappings in the English word, and which form a subhierarchy in Figure 1-1, are united and integrated into a succinct description. This description also forms the infrastructure for generation, which is explained in the next chapter.

2.1 Integration of Various Linguistic Knowledge Sources

The lexical representation is a hierarchical structure which assimilates the relevant linguistic knowledge sources to capture the orthographic-phonological correspondences in English. Each level of linguistic representation is composed of a small set of lexical units, each serving a unique descriptive role. The several distinct and well-defined layers in the lexical representation preserve the ordering of the knowledge sources in Figure 1-1. The layers are defined from top to bottom in Table 2.1.1

1 Phonemes are enclosed in / /, graphemes in #[ ], and phones in [ ], as will be seen in Chapter 7. The categories for each layer are shown in Appendices A through F. If we define a column history to be a feature vector with seven categories, one from each level shown in the table, then there are fewer than 1,500 unique column histories in our training data.


Table 2.1: The different layers in the lexical representation, the number of categories in each layer, and some example categories.
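Purely as an illustrative stand-in for the table, the following sketch lists the seven strata with a few example categories drawn from this chapter; the full category inventories are in Appendices A through F, and the per-layer category counts are not reproduced here.

```python
# Illustrative summary of the seven strata; example categories are taken
# from the examples discussed in this chapter.
LAYERS = [
    ("top-level",         ["word"]),
    ("morphs",            ["prefix", "root", "suffix"]),
    ("stress",            ["ssyl1 (stressed)", "syl (unstressed)"]),
    ("subsyllabic units", ["onset", "nuc", "coda"]),
    ("broad classes",     ["nasal", "vow", "stop"]),
    ("phonemes",          ["/m/", "/k/", "/i/"]),
    ("graphemes",         ["#[m]", "#[o]", "#[ey]"]),
]
for depth, (name, examples) in enumerate(LAYERS, start=1):
    print(f"{depth}. {name}: {', '.join(examples)}")
```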

The size of the lexical units decreases as we descend the hierarchy. A word consists of one or more morphs, and different words may share the same morph(s). The same relationship is found between morphs and syllables. A given syllable is often identified with its level of stress, and each syllable has one or more syllable parts (or subsyllabic units). The syllable structure provides tactics for phonology, which is why the manner, place and voicing features are placed beneath the subsyllabic-unit level. Letters, or graphemes, are located at the bottom because they often have direct correspondences with phonemes. The phonetic layer is considered to occupy the same level as the graphemic layer in this hierarchical ordering.

The top level currently consists of a generic [word] category, but it can conceivably be used to encode semantic information such as word sense, or syntactic information such as part-of-speech or tense.2 Semantic and syntactic characterization may change the pronunciation of words. For example, "bass" may be pronounced as /b e s/ or /b @ s/, depending on whether we are referring to music or a fish.

Homographs like "permit" and "record" are pronounced with complementary stress patterns, depending on the part of speech (noun or verb forms). Similarly, "read" may be pronounced as /r i d/ or /r E d/, depending on the tense.

2 Other information may also be included, such as the origin of loan words, should we decide to model words of different origins as separate categories.
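Purely as an illustration of how such top-level categories could disambiguate homographs, consider this toy lookup; the table and its keys are invented for the sketch, and the pronunciations use the notation of the text.

```python
# Hypothetical sketch: a (word, top-level category) pair selecting among
# the pronunciations of a homograph.
PRONUNCIATIONS = {
    ("bass", "music"):   "/b e s/",
    ("bass", "fish"):    "/b @ s/",
    ("read", "present"): "/r i d/",
    ("read", "past"):    "/r E d/",
}
print(PRONUNCIATIONS[("read", "past")])   # /r E d/
```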

The second layer, morphology, embodies some morphophonemic effects. Letter-to-sound mappings which are consistent within morphs may be altered across morph boundaries. For example, the letter sequence "sch" in "discharge" is pronounced differently from that in "scheme." The former "sch" sequence overarches a morph boundary between "s" and "c" which separates the prefix morph "dis-" and the root morph "-charge," while the latter sequence belongs to a single root morph. Another similar example is provided by the word "penthouse," where the letter sequence "th" is not realized as a medial fricative, due to the presence of a morph boundary between the two letters. Morph composition also brings about spelling changes [59].3 For instance, the final "e" in the suffix "ize" of the word "baptized" is redundant with the "e" of the inflectional suffix "ed," and so one of the redundant letters is dropped. Other cases of deletion are evidenced in the word "handful," derived from "hand" and "full," and in the word "handicap," coming from the three words "hand," "in" and "cap." There are also examples of insertions due to morph combinations, such as the gemination of "g" in "begged," which did not appear in the root "beg."
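To illustrate why letter-to-sound rules must be morph-boundary-aware, here is a hedged sketch; the function is hypothetical, not the thesis's mechanism, and simply checks whether a letter sequence lies within a single morph before a within-morph rule is allowed to fire.

```python
# Hypothetical sketch: a within-morph grapheme rule (e.g., for "sch") should
# not fire when the letter sequence straddles a morph boundary.
def within_single_morph(morphs, start, length):
    """morphs: morph spellings in order; start/length index into the word."""
    pos = 0
    for m in morphs:
        if pos <= start and start + length <= pos + len(m):
            return True
        pos += len(m)
    return False

print(within_single_morph(["scheme"], 0, 3))         # True: "sch" inside one morph
print(within_single_morph(["dis", "charge"], 2, 3))  # False: crosses dis|charge
```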

The third layer is a sequence of stressed and unstressed syllables. Stress strongly affects the identity of the vowel in a syllable, as can be seen by comparing the words "finite" and "infinite." The first syllable in "finite" is stressed and contains the diphthong /aɪ/, but the corresponding syllable in "infinite" becomes unstressed, and the diphthong reduces to a front schwa /ɨ/. In addition, stress affects the placement of syllable boundaries, which is illustrated by the words "fabric" and "fabrication." The letter "c" in "fabric" forms the coda of the second syllable. However, upon the addition of a "stress-affecting suffix" such as "-ation,"4 "c" has been moved to become the onset of the following stressed suffix syllable. This movement occurs [...] therefore the second /t/ is flapped in "strategy" but not in "strategic."

3 These spelling change rules, however, have not been explicitly incorporated in our lexical representation.

4 When a morph is extended by a "stress-affecting" suffix, the syllable preceding the suffix is forced to become unstressed.
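A minimal sketch of the destressing rule in footnote 4 follows; the syllable representation, the suffix list, and the function are invented for illustration and model only the destressing step, not the accompanying resyllabification.

```python
# Hypothetical sketch of footnote 4: a stress-affecting suffix forces the
# syllable preceding it to become unstressed.
STRESS_AFFECTING = {"-ation", "-ic"}

def attach_suffix(stem_syllables, suffix_syllables, suffix):
    """Syllables are (text, stressed) pairs; returns the combined word."""
    out = list(stem_syllables)
    if suffix in STRESS_AFFECTING and out:
        text, _ = out[-1]
        out[-1] = (text, False)   # the syllable preceding the suffix loses stress
    return out + suffix_syllables

stem = [("fab", True), ("ric", False)]
print(attach_suffix(stem, [("a", True), ("tion", False)], "-ation"))
```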

The next couple of layers, with subsyllabic units in the fourth and broad manner classes in the fifth, jointly define the syllable structure of the word. The morph layer is deliberately positioned above the syllable layer. This is because syllable theory implicitly assumes that a given syllable can transition to any other syllable. However, since there are only a finite number of prefixes and suffixes, morphology provides constraints for the syllables. In addition, precise syllable boundaries are often hard to locate. For example, the syllable boundary in "monkey" may be placed between the phonemes /ŋ/ and /k/, or between /k/ and /i/. In these circumstances, we may be able to utilize morph boundaries to aid placement of the syllable boundaries. According to our data,6 the word "monkey" is composed of the root morph "monk-" and the suffix "-ey." Consequently, the selected syllabification for the word "monkey" places the syllable boundary between the phonemes /k/ and /i/. This is shown in Figure 2-1.
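A minimal sketch of the idea that morph boundaries can arbitrate between competing syllable boundaries; the function and the index-based representation are illustrative, not the thesis's algorithm.

```python
# Illustrative sketch: among phonotactically legal syllable boundaries,
# prefer one that coincides with a morph boundary (the "monkey" case).
def pick_boundary(candidates, morph_boundaries):
    """candidates / morph_boundaries: phoneme indices where a break may fall."""
    for b in candidates:
        if b in morph_boundaries:
            return b
    return candidates[0]          # fall back when no morph boundary agrees

# /m ʌ ŋ k i/: the break may fall before /k/ (index 3) or before /i/ (index 4);
# the morph boundary after "monk-" corresponds to index 4.
print(pick_boundary([3, 4], {4}))   # 4, i.e. "monk . ey"
```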

The fourth layer, syllable parts, also provides tactics for the two successive layers of distinctive features [22] [81]. The sequence of broad classes (manner features) in the fifth layer obeys the Sonority Sequencing Constraint. This rule states that the relative prominence or "vowel-likeness" of a sound decreases as we move from the nucleus of a syllable towards its margins.

5 The Maximal Onset Principle states that the number of consonants in the onset position should be maximized when phonotactic and morphological constraints permit, and Stress Resyllabification refers to maximizing the number of consonants in the stressed syllables.

6 The morphological decomposition of our data is provided by Sheri Hunnicutt [59].
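To make the Sonority Sequencing Constraint concrete, here is a minimal sketch; the sonority scale values are invented for illustration and are not taken from the thesis.

```python
# Sketch of a Sonority Sequencing check: sonority must rise through the
# onset to the nucleus and fall through the coda (scale values invented).
SONORITY = {"stop": 1, "fricative": 2, "nasal": 3, "liquid": 4, "glide": 5, "vow": 6}

def obeys_sonority(onset, nucleus, coda):
    seq = [SONORITY[c] for c in onset + [nucleus] + coda]
    peak = len(onset)                                    # index of the nucleus
    rising  = all(a < b for a, b in zip(seq[:peak], seq[1:peak + 1]))
    falling = all(a > b for a, b in zip(seq[peak:], seq[peak + 1:]))
    return rising and falling

print(obeys_sonority(["stop", "liquid"], "vow", ["nasal"]))  # True, e.g. "plum"
print(obeys_sonority(["liquid", "stop"], "vow", []))         # False: sonority dips
```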

    1. top-level          word
    2. morphs             root ("monk-")              suf ("-ey")
    3. stress             ssyl1                       syl
    4. subsyllabic units  onset   nuc    coda         nuc
    5. broad classes      nasal   vow    nasal  stop  vow
    6. phonemes           /m/     /ʌ/    /ŋ/    /k/   /i/
    7. graphemes          #[m]    #[o]   #[n]   #[k]  #[ey]

Figure 2-1: Lexical representation for the word "monkey," shown here in a parse tree format.
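Footnote 1 above defines a column history as a seven-category feature vector, one category per layer; as an illustrative sketch (data structure invented), the columns of Figure 2-1 can be enumerated directly:

```python
# Sketch of the "column histories" from footnote 1: reading Figure 2-1
# column by column yields a seven-category feature vector per grapheme.
columns = [
    # (top,   morph,  stress,  subsyl,  class,   phoneme, grapheme)
    ("word", "root", "ssyl1", "onset", "nasal", "/m/",   "#[m]"),
    ("word", "root", "ssyl1", "nuc",   "vow",   "/ʌ/",   "#[o]"),
    ("word", "root", "ssyl1", "coda",  "nasal", "/ŋ/",   "#[n]"),
    ("word", "root", "ssyl1", "coda",  "stop",  "/k/",   "#[k]"),
    ("word", "suf",  "syl",   "nuc",   "vow",   "/i/",   "#[ey]"),
]
print(len(set(columns)))   # 5 distinct column histories for "monkey"
```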
