Analysis of acoustic cues for identifying consonant /ð/ in continuous speech


Analysis of Acoustic Cues for

Identifying the Consonant /ð/ in Continuous Speech

by

Ying Alisa Cao

Submitted to the Department of Electrical Engineering and Computer Science

In partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 13, 2002

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author ...

Department of Electrical Engineering and Computer Science

June 13, 2002

Certified by ...

Kenneth N. Stevens

Professor, Research Laboratory for Electronics

Thesis Supervisor

Accepted by ...

Arthur C. Smith

Chairman, Department Committee on Graduate Theses


Analysis of Acoustic Cues for

Identifying the Consonant /ð/ in Continuous Speech

by Ying Alisa Cao

Submitted to the Department of Electrical Engineering and Computer Science June 13, 2002

in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

This project aims to advance automatic recognition of continuous speech by recognizing individual phonemes using acoustic cues unique to each phoneme. It focuses on the acoustic characteristics of one of the most prevalent phonemes of English: the fricative consonant /ð/, as in the words the, this, and those. Since previous research has shown that /ð/ can assimilate to its preceding phoneme, the characteristics of /ð/ and its close-sounding phonemes /n/, /d/, and /v/ are studied in the preceding contexts of a nasal consonant, a stop consonant, and a vowel, respectively. By examining the characteristics of /ð/ and its counterparts in phrases such as win those and win nose, the goal is to find the context-based and invariant cues for identifying /ð/.

Spectrum analysis tools are used to extract important acoustic information such as the formant frequencies and their changes over time, the energy distribution over frequencies, and the duration of utterances. In the context of a nasal consonant, F2 and the change in F2 over a fixed interval of time (DeltaF2) are found to be the best cues: /ð/ has lower F2 and higher DeltaF2 than /n/. In the context of a stop consonant, F2 and the amplitude difference in the noise burst between the high and low frequency ranges are the best cues: /ð/ has lower F2 and a more negative amplitude difference. In the context of vowels, F3 and DeltaF3 are found to be the best cues: /ð/ generally has higher F3 and DeltaF3 than /v/, although these cues are not as reliable as those of the other two contexts. The formant frequencies are greatly influenced by the speaker's gender and the succeeding vowel, and they vary among speakers of the same gender. Thus, the more contextual and speaker information the identification criteria are based on, the more accurate the identification of /ð/ is likely to be. This correlation suggests that the human auditory system is also likely to rely on contextual information for the accurate processing of continuous speech.

Thesis Supervisor: Kenneth N. Stevens, Sc.D.

Title: Clarence J. Lebel Professor of Electrical Engineering, Research Laboratory for Electronics


Acknowledgments

My heartfelt gratitude goes to Professor Kenneth Stevens, for giving me this valuable learning experience to engage in research and scientific writing, for personally guiding me through the entire process, for his patience and understanding, and for his care and concern for my well-being.

I would like to thank my friends at MIT who have given me many warm and joyous memories and who have helped me immensely in all aspects of my life in the past five years.

I also want to thank my parents who have worked hard and sacrificed much to provide me with the best education and much more.

Lastly, even though this thesis hardly measures up to all that I have received from them, I would like to dedicate this thesis to my grandparents. They, too, have sacrificed much to give me the best opportunities, and they gave me a wonderful childhood from which I draw faith and confidence. Their sincerity, work ethic, simple living, dedication toward family, and service to


I: Introduction
I.1 Motivation
I.2 Background Information
I.3 Preliminary Work
I.4 Research Objectives

II: Methodology
II.1 Overview
II.2 Database
II.3 Parameters Analyzed
II.3.1 Second and Third Formant Frequencies at the Onset of the Succeeding Vowel (F2 and F3)
II.3.2 Change of f2 and f3 Over a Period of 50 ms (DeltaF2 and DeltaF3)
II.3.3 Amplitude Difference (Amp(High-Mid)) and Duration of Burst of /ð/ and /d/

III: Results and Analysis
III.1 Context of Nasal Consonant
III.1.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.1.2 Second Formant at the Onset of the Succeeding Vowel (F2)
III.1.3 Movement of f2 Over a Period of 50 ms (DeltaF2)
III.1.4 F2 and DeltaF2
III.2 Context of Stop Consonant
III.2.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.2.2 The Second Formant Frequency at the Onset of the Succeeding Vowel (F2)
III.2.3 Amplitude Difference in the Burst (Amp(High-Mid))
III.2.4 Burst Duration
III.2.5 F2 and Amp(High-Mid)
III.3 Context of Vowel
III.3.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.3.2 Third Formant Frequency at the Onset of the Succeeding Vowel (F3)
III.3.3 F3 and DeltaF3

IV: Conclusion and Future Work
IV.1 Conclusion
IV.2 Future Work

V. References

VI. Appendices
Appendix A: Sentences Used in the Study
Appendix B: Additional Results
Appendix B.1 Context of Nasal Consonant


List of Figures

Figure 1-1: a) An example spectrogram, b) An example output of formant tracks extracted by xkl. Note that this thesis denotes the second and third formant frequencies with lowercase "f", as "f2" and "f3", and denotes the second and third formant frequencies at the onset of the succeeding vowel with uppercase "F", as "F2" and "F3."

Figure 1-2: An idealized spectrogram and the important acoustic features measured in the preliminary study.

Figure 1-3: Three commonly observed patterns in the second and third formant frequencies of /ð/ and /v/.

Figure 1-4: Example of a combination (F3 and -F3) that separates /ð/ and /v/. Notice that a line representing some differentiation criteria can be drawn and would separate all utterances of /ð/ from those of /v/.

Figure 2-1: A male speaker's waveform of the utterance "V" in "TV" (Sentence #12). The changes in the waveform at around 224 ms indicate the onset of the vowel.

Figure 2-2: An example spectrum showing the auto-picked value of F3 (the dotted vertical line) by xkl.

Figure 2-3: Spectra of the burst of a) /ð/ vs. b) /d/, of a female speaker. Notice that a prominent peak at around 4.7 kHz is seen in the spectrum of /d/ but not in that of /ð/.

Figure 2-4: Sample waveforms of a) /ð/ and b) /d/. In each waveform, the two vertical lines represent the points of waveform change, and the time in between the lines is the duration of the burst.

Figure 3-1: The mid-sagittal view of the vocal tract for /ð/, /n/, and /d/ [5]. The back cavity is more restricted for /ð/, the difference that leads to a lower F2 for /ð/ than for /n/ and /d/.

Figure 3-2: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of female speakers, given the succeeding vowel /o/. Notice that the F2 ranges of /ð/ and /n/ do not overlap.

Figure 3-3: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of male speakers, given the succeeding vowel /o/. Notice the slight overlap between the two F2 ranges.


Figure 3-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, regardless of the succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on one side and /n/ points on the other.

Figure 3-5: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, regardless of the succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on one side and /n/ points on the other.

Figure 3-6: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding vowel /i/.

Figure 3-7: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b) F#2, given succeeding vowel /i/. /ð/ and /n/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-6).

Figure 3-8: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding vowel /o/. Notice that /ð/ and /n/ occupy different ranges of F2 but not of DeltaF2.

Figure 3-9: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b) F#2, given succeeding vowel /o/. /ð/ and /n/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-8).

Figure 3-10: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of female speakers, given the succeeding vowel /o/. Notice that the F2 of /ð/ is significantly lower than that of /d/.

Figure 3-11: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of male speakers, given the succeeding vowel /o/. Male speakers show a similar trend as female speakers do (Figure 3-10).

Figure 3-12: F2 and amplitude difference of /ð/ vs. /d/ in stop context of female speakers. Notice that /ð/ and /d/ are clearly separated by both F2 and amplitude difference.

Figure 3-13: F2 and amplitude difference of /ð/ vs. /d/ in stop context of male speakers. Unlike in the similar plot for female speakers (Figure 3-12), the regions of /ð/ and /d/ overlap slightly for the male speakers.

Figure 3-14: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual female speakers a) F#1, and b) F#2. Notice that regions of /ð/ and /d/ are clearly and consistently separated in both speakers.


Figure 3-15: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual male speakers a) M#1, b) M#2. /ð/ and /d/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-13).

Figure 3-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers, given succeeding vowel //.

Figure 3-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker a) M#1, b) M#2 given succeeding vowel /i/. /ð/ and /v/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-16b).

Figure 3-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker a) F#1, b) F#2 given succeeding vowel /i/. /ð/ and /v/ remain difficult to separate even when the speakers are analyzed individually.

Figure 3-19: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker a) M#1, b) M#2 in two pairs of sentences succeeded by similar vowels (/o/ and //). Notice the large variations in both F3 and DeltaF3 of /ð/ and /v/.

Figure 3-20: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#1) a) in sentences #3 and #11 (succeeded by vowel /c/) b) in sentences #5 and #16 (succeeded by vowel /D/). Notice the significant differences in F3 and DeltaF3 of both /ð/ and /v/ between the two plots.

Figure 3-21: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#2) a) in sentences #3 and #11 (succeeded by vowel /c/) b) in sentences #5 and #16 (succeeded by vowel /;/). Notice that similar decreases in F3 and DeltaF3 of both /ð/ and /v/ are seen in graph b) as in Figure 3-20b.

Figure A-1: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel /i/. Notice that the regions of /ð/ and /n/ overlap.

Figure A-2: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b) M#2, given succeeding vowel /i/. Notice that /ð/ and /n/ are less separable for these male speakers than they are for female speakers (Figure 3-7).

Figure A-3: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel /o/.


Figure A-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b) M#2, given succeeding vowel /o/. Notice that /ð/ and /n/ become more separable (mostly by F2) for each individual speaker than for both speakers (Figure A-3), but the separation is not as distinct as in similar plots for the female speakers (Figure 3-9).

Figure A-5: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers, regardless of the succeeding vowel.

Figure A-6: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers, given succeeding vowel /i/.

Figure A-7: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context, given succeeding vowel /i/, of an individual male speaker (M#1). Notice that the overlap is significantly smaller for a given speaker than for both male speakers (Figure A-6b).

Figure A-8: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) male, b) female speakers in two pairs of sentences succeeded by similar vowels (/3/ and /r/). Notice the wide ranges of F3 and DeltaF3 for both /ð/ and /v/.

Figure A-9: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker a) F#1, b) F#2 in two pairs of sentences succeeded by similar vowels (/3/ and /0/). Notice the large variations in both F3 and DeltaF3 of /ð/ and /v/.

Figure A-10: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#1) a) in sentences #3 and #11 (succeeded by I1/) b) in sentences #5 and #16 (succeeded by /o/).

Figure A-11: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#2) a) in sentences #3 and #11 (succeeded by vowel //) b) in sentences #5 and #16 (succeeded by vowel /3/). Notice the differences in F3 and especially in DeltaF3 between the two plots, similar to those seen in the other speakers (Figures A-10, 3-20, 3-21).

Figure A-12: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers, given succeeding vowel /o/.

Figure A-13: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual female speakers a) F#1, b) F#2, given succeeding vowel /o/. Notice that /ð/ and /v/ are mixed in the same region.

Figure A-14: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual male speakers a) M#1, b) M#2, given succeeding vowel /o/. Notice that /ð/ has higher F3 and lower DeltaF3 than /v/; the same general pattern is seen in Figure A-13, despite its exceptions in individual enunciations.


Figure A-15: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers, given succeeding vowel /a/.

Figure A-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2, c) M#1, and d) M#2, given succeeding vowel /w/. Notice that only c) shows clear distinction between /ð/ and /v/. The average values of /ð/ and /v/ of each plot, however, consistently show that /ð/ has higher F3 and lower DeltaF3 than /v/.

Figure A-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers given succeeding vowel /I/. Notice that /ð/ tends to have higher F3 than /v/, but their DeltaF3 values tend to be in the same range.

Figure A-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2, c) M#1, d) M#2, given succeeding vowel /I/. Neither parameter is very effective at separating /ð/ from /v/. In general, F3 is higher in /ð/. The average DeltaF3 of /ð/ seems slightly higher than that of /v/, which is an exception to the trend of DeltaF3 observed for the rest of the vowels.


List of Tables

Table 2-1: The contexts, close-sounding consonants, succeeding vowel, and phrases studied in the project. The underlined segments are the parts of the phrases analyzed in detail.

Table 2-2: Number of utterances analyzed in the study for each context.

Table 3-1: F2 and F3 of /ð/ vs. /n/ in the nasal context, listed by speakers and vowels. Notice that the corresponding F2 of /ð/ is less than that of /n/ for all cases, but F3 does not show such consistency (the asterisks mark cases of inconsistency). Thus F2 is further analyzed.

Table 3-2: Average F2 values for /ð/ vs. /n/ in the nasal context, listed by gender and succeeding vowel. Notice that F2 varies according to gender and the succeeding vowel, but F2 of /ð/ is always lower than the corresponding F2 of /n/.

Table 3-3: Average DeltaF2 of /ð/ vs. /n/ in nasal context, listed by speaker and succeeding vowel. The average DeltaF2 is positive for utterances succeeded by /o/ and negative for those by /i/. DeltaF2 of /ð/ is greater than that of /n/ in all cases.

Table 3-4: Average F2 values for /ð/ vs. /d/ in stop context, listed by gender. Notice that F2 is higher in /d/ than in /ð/. Note that for /ð/ and /d/ only succeeding vowel /o/ is examined.

Table 3-5: Amplitude difference between the high and mid frequency ranges of /ð/ vs. /d/ in stop context, listed by speaker and enunciation. Notice that the amplitude difference of /ð/ is usually negative and is less than that of /d/ in almost all cases.

Table 3-6: Burst duration of /ð/ vs. /d/ in stop context, listed by each speaker. Notice that no consistent difference is seen between the burst duration of /ð/ and /d/ across all speakers.

Table 3-7: Average F2 and amplitude difference between high and mid frequency ranges of /ð/ vs. /d/ in stop context, listed by gender. Notice that F2 is substantially lower in /ð/ than in /d/ and Amp(High-Mid) is negative for /ð/ but positive for /d/.


Chapter I: Introduction

I.1 Motivation

Progress in the intense study of acoustics and speech recognition over the past century has been manifested in automatic speech recognition software such as SUMMIT, which is used in the MIT-developed Jupiter weather information system. Jupiter, in particular, has achieved a response accuracy of about 90% for novice users and over 98% for experienced users [1]. Such software, however, has a limited domain of acceptable queries, set by the limited number of recognized words; Jupiter can recognize only about 2000 words. This limited scope of recognition is a result of the algorithm used by speech recognition units such as SUMMIT, which recognize words in the queries by matching key segments of the speech signals against a pre-stored database of phonemes [1]. This kind of search-and-match algorithm makes real-time recognition of continuous speech impractical, both in terms of the size limitations of the pre-stored vocabulary library and the computation power currently available to run such an algorithm.

In light of such limitations, this study seeks to advance automatic speech recognition not by reducing the limitations of the search-and-match algorithm mentioned above, but by finding sets of acoustic cues that can unambiguously identify individual phonemes in continuous speech. This study examines the spectral characteristics of phonemes and aims to incorporate these characteristics into algorithms that computers can use to achieve automatic phoneme recognition. In the process, the goal is to gain further understanding of how the human perceptual system processes, identifies, and differentiates between similar-sounding phonemes. In addition to building better recognition tools that can potentially identify every phoneme in continuous speech, this research can also contribute to the development of computer-based methods for transforming speech into its written form. Such a function would be highly valuable in a number of applications, such as taking notes for deaf and hard-of-hearing individuals.

I.2 Background Information

This project focuses on finding the acoustic characteristics of /ð/ for two main reasons. First, /ð/ is one of the most common phonemes in English [2]. Because /ð/ is found in common function words such as they, them, those, then, and the, it is the 7th most frequent consonant in spoken English and the most frequent consonant in word-initial position. Because function words could be important for extracting the meaning of sentences, it is important to be able to recognize intended /ð/ in natural speech. Second, because /ð/ is usually found in the same location, at the beginning of words, it is easier to study.

/ð/ is a voiced fricative with weak noise, and is produced by air turbulence created when air from the lungs is forced through the vocal tract constriction formed by the tongue and the teeth [3]. /ð/ is generally unstressed, as in see the ball, but it can also be stressed, as in see that ball. As mentioned, /ð/ is often found at the beginning of words, but it can also be found in the middle of words, such as bother, father, and weather, and at the end of a few words, such as seethe, bathe, and teethe [4].

Some recent research on consonants in the varied contexts that occur in normal speech has shown that the acoustic features of phonemes vary according to the identity of adjacent phonemes


auditory system may identify /ð/ based on one of several different sets of cues, depending on the perceived context. At the same time, given that our auditory system can recognize intended /ð/ in various contexts as /ð/, it is possible that there exists a set of invariant acoustic cues for speech perception: characteristics that are common to /ð/ in all contexts. Since our purpose is to find such invariant and context-dependent cues, the characteristics of /ð/ are analyzed in various contexts.

Spectral characterization of /ð/ preceded by nasal consonants has already been studied for /n/ [4]. The research found that /ð/ assimilates and becomes nasalized when preceded by /n/, i.e., the entire consonant region in the spectrogram of /ð/ shows characteristics like those found in /n/. At the same time, acoustic evidence suggests that contextually nasalized /ð/ retains its dental place of articulation [4]. This evidence is based on the second formant frequency (f2) in the following vowel. F2 is considerably lower at the release of contextually nasalized /ð/ than at the release of a true /n/. Furthermore, listeners can generally tell the difference between natural tokens of win nose and win those, even when /ð/ is completely nasalized [4]. When synthetic stimuli were constructed in which the signals differed only in F2 near the nasal consonant region of win nose and win those, listeners systematically reported hearing the latter more often when F2 was low at the release of the nasal consonant [4]. These results are consistent with the claims in the literature that, despite contextual assimilation, listeners can recognize the intended phoneme [3]. Finding the acoustic cues that help listeners recognize all of the contextually modified forms of /ð/ as the same phoneme is the objective of this research.

To achieve this objective, this study first analyzes the characteristics of /ð/ in isolated enunciations of vowel-consonant(fricative)-vowel (VCV) combinations by comparing the spectrogram of /ð/ to those of its two closest phonemes, /v/ and /z/. It then proceeds to examine /ð/ in a variety of contexts in sentence material.

I.3 Preliminary Work

Preliminary characterization has been done for the spectrograms of the fricatives /ð/, /z/, and /v/ that occur in isolated vowel-fricative-vowel (VCV) combinations containing the vowels /a/, /x/, /e/, /i/, and /u/.

The fricative /z/ has visible high-frequency content in its spectrogram that is not found in the spectrograms of /ð/ and /v/. Thus, /z/ can easily be differentiated from the other two fricatives, and efforts are subsequently focused on finding the more subtle spectral differences between /ð/ and /v/. To do so, a program called xkl is used to extract frequency information from the original spectrogram (see Figure 1-1). An idealized spectrogram illustrating the features of /ð/ and /v/ that are measured and compared is displayed in Figure 1-2.

[Figure 1-1 image: a) spectrogram, b) formant tracks; horizontal axis time (ms), vertical axis frequency (kHz)]

Figure 1-1: a) An example spectrogram, b) An example output of formant tracks extracted by xkl. Note that this thesis denotes the second and third formant frequencies with lowercase "f", as "f2" and "f3", and denotes the second and third formant frequencies at the onset of the succeeding vowel with uppercase "F", as "F2" and "F3."


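[Figure 1-2 image: idealized Vowel (V), Fricative (C), Vowel (V) spectrogram marking the onset of the fricative, the release of the fricative, the fricative duration, and the formant values on either side of the fricative; horizontal axis time (ms)]

Figure 1-2: An idealized spectrogram and the important acoustic features measured in the preliminary study.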

The parameters measured include +F1, +F2, +F3, -F1, -F2, -F3, +Slope of F2, +Slope of F3, -Slope of F2, -Slope of F3, and fricative duration. The most useful parameter for differentiating /ð/ and /v/ turns out to be +F3. F3 values of /ð/ are greater than those of /v/ in all of the VCV enunciations studied. The rest of the parameters can also be useful in differentiating between /ð/ and /v/, especially when the context is known.

Three common patterns of movement of F2 and F3 are observed, as shown in Figure 1-3.

[Figure 1-3 image: three panels of formant tracks; horizontal axis time (ms)]

Figure 1-3: Three commonly observed patterns in the second and third formant frequencies of /ð/ and /v/.


Furthermore, using combinations of parameters, /ð/ and /v/ can be distinguished in a number of utterances, but not in all. Figure 1-4 shows an example of a combination of characteristics that is successful in distinguishing /ð/ and /v/.

[Figure 1-4 image: scatter plot of /ð/ and /v/ utterances; horizontal axis F3, 2000 to 2700; vertical axis 2000 to 2600]

Figure 1-4: Example of a combination (F3 and -F3) that separates /ð/ and /v/. Notice that a line representing some differentiation criteria can be drawn and would separate all utterances of /ð/ from those of /v/.
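The idea of a separating line in a two-parameter plot can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the thesis: the function name `separates`, the line coefficients, and all of the (F3, -F3) pairs below are hypothetical and are not measurements from the study.

```python
def separates(line, group_a, group_b):
    """Return True if the line a*x + b*y + c = 0 puts every point of
    group_a strictly on one side and every point of group_b strictly
    on the other side."""
    a, b, c = line

    def side(point):
        x, y = point
        return a * x + b * y + c

    sides_a = [side(p) for p in group_a]
    sides_b = [side(p) for p in group_b]
    a_pos_b_neg = all(s > 0 for s in sides_a) and all(s < 0 for s in sides_b)
    a_neg_b_pos = all(s < 0 for s in sides_a) and all(s > 0 for s in sides_b)
    return a_pos_b_neg or a_neg_b_pos


# Hypothetical (F3, -F3) pairs in Hz, invented for illustration only.
dh_tokens = [(2600, 2550), (2650, 2500)]
v_tokens = [(2200, 2250), (2300, 2200)]

# Hypothetical candidate criterion: the line x + y = 4800.
candidate_line = (1.0, 1.0, -4800.0)
line_works = separates(candidate_line, dh_tokens, v_tokens)
```

A criterion of this kind is only as good as the data it was drawn on, which is why the text notes that the combinations separate the two consonants in some utterances but not in all.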

I.4 Research Objectives

The objective of the project is to apply the methods and results of the analysis of the spectral characteristics of /ð/ and /v/ in the simpler VCV enunciations toward finding cues that can identify all intended instances of /ð/ in various contexts in continuous speech. More specifically, the research intends to accomplish the following:

1. Identify the invariant and context-based differences between the spectrograms and spectra of /ð/ and its close-sounding phonemes.

2. Gain insight into how the human perceptual system processes and identifies the contextually varying /ð/ and other phonemes during cognitive processing of continuous speech.


Chapter II: Methodology

II.1 Overview

This project examines the invariant and context-based characteristics of /ð/. A set of eighteen sentences is designed such that /ð/ and its close-sounding phonemes occur between the same preceding and succeeding contexts. These sentences are listed in Appendix A. The three contexts studied are nasal, stop, and vowel, and the corresponding close-sounding phonemes of /ð/ are /n/, /d/, and /v/, respectively. Based on the findings of the preliminary work, characteristics useful in identifying /ð/, such as F2 and F3, are measured and analyzed for the new database of sentences. Additional parameters of the burst of /ð/ and /d/ in the context of stop consonants are also measured and analyzed.

II.2 Database

Eighteen sentences containing /ð/ and its close-sounding phonemes, /n/, /d/, and /v/, are constructed to form the database for this project (see Appendix A). Phrases in the original sentences that contain the consonants of interest are included in Table 2-1. The sequences of preceding context, consonant (/ð/; or /n/, /d/, and /v/), and succeeding vowel that are analyzed in detail are underlined in Table 2-1. Because the succeeding vowel may influence the characteristics of the consonants, the succeeding vowels of each pair of phrases are chosen to be the same, as seen in Table 2-1.


Type of      Phoneme in      Close-sounding   Succeeding   Phrases
preceding    the preceding   consonant        vowel
context      context         (/ð/ vs. ...)

Nasal        /n/             /n/              /o/          1. win those / win nose
                                              /i/          2. win these / win niece

Stop         /t/             /d/              /o/          1. put those / put dough

Vowel        /i/             /v/              /i/          1. guarantee these / guard TV's
                                              /ɪ/          2. see this / see Victorian
             /u/                              /ə/          3. to the racial / two veracious
             /ai/                             /æ/          4. dye that / dye vat
             /a/                              /o/          5. via those / via votes
             /ei/                             /ɛ/          6. may then / may vend

Table 2-1: The contexts, close-sounding consonants, succeeding vowel, and phrases studied in the project. The underlined segments are the parts of the phrases analyzed in detail.

To find the spectral characteristics of /ð/ that are common to both genders, the sentences are spoken by both male and female speakers. Because there may be variations among speakers and enunciations, each sentence is spoken by two male and two female speakers and is repeated three times by each speaker.

For each context, the database provides the following number of consonant pairs, listed in Table 2-2.


Type of preceding   Number of pairs   Number of     Total number of
context             of phrases        repetitions   utterances analyzed

Nasal               2                 11            22
Stop                1                 11            11
Vowel               6                 11            66

Table 2-2: Number of utterances analyzed in the study for each context.

The actual number of data points analyzed may be less than the numbers listed here, because some formant frequencies are impossible to determine in some enunciations. Also, 12 repetitions (4 speakers and 3 repetitions per speaker) were planned, but due to problems during recording, one set of utterances by one of the female speakers was not used for the study. So, only 11 repetitions of each pair of sentences were available for analysis and study.
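The counts in Table 2-2 follow directly from the recording plan described above, and the arithmetic can be checked with a short Python sketch. The variable names are chosen here for illustration; the numbers are the ones stated in the text and the table.

```python
SPEAKERS = 4              # two male and two female speakers
REPS_PER_SPEAKER = 3      # each sentence repeated three times per speaker
UNUSED_SETS = 1           # one set by a female speaker was not used

# 12 planned repetitions minus the one unusable set
repetitions = SPEAKERS * REPS_PER_SPEAKER - UNUSED_SETS

# Number of phrase pairs per preceding context (Table 2-2)
pairs_per_context = {"nasal": 2, "stop": 1, "vowel": 6}

# Total entries per context: pairs times available repetitions
totals = {context: pairs * repetitions
          for context, pairs in pairs_per_context.items()}
```

With these inputs, `repetitions` is 11 and `totals` reproduces the last column of Table 2-2 (22, 11, and 66).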

II.3 Parameters Analyzed

II.3.1 Second and Third Formant Frequencies at the Onset of the Succeeding Vowel (F2 and F3)

The onset time of the succeeding vowel is determined by combining three pieces of information: changes in the waveform of the enunciation, the formant tracks extracted by xkl, and the spectrogram. Because vowels, unlike consonants, are formed with no obstruction in the vocal tract, significantly more sound energy is concentrated at the lower formant frequencies of vowels than in the case of consonants. High amplitude of sound energy is represented by dark bands on the spectrogram and by bold points in the formant tracks extracted by xkl. Thus, the abrupt appearance of dark bands in the spectrogram and of bold formant frequency points in the formant tracks gives the approximate onset time of the succeeding vowel.


A more exact onset time is determined by looking for abrupt changes in the waveform. Because the vocal tract has different shapes for vowels and consonants, the change as a speaker switches from a consonant to a vowel is reflected in the waveform, as shown in Figure 2-1. In the example given below, the point of abrupt change (the appearance of the first negative peak) at around 224 ms marks the onset of the vowel; this value would correspond to the times of abrupt appearance of dark bands and bold points in the spectrogram and formant tracks, respectively.

[Waveform plot: amplitude versus TIME (ms), 170 to 250 ms]

Figure 2-1: A male speaker's waveform of the utterance "V" in "TV" (Sentence #12). The changes in the waveform at around 224 ms indicate the onset of the vowel.
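The waveform-based step can be approximated in code. The sketch below is a stand-in for the visual inspection described above, not the procedure used in this study: it flags the first frame whose short-time energy exceeds that of the preceding frame by a chosen ratio. The function name, the 2 ms frame length, and the ratio threshold are all assumptions made for illustration, and the test signal is synthetic.

```python
import math

def find_onset_ms(samples, sample_rate_hz, frame_ms=2.0, ratio=4.0):
    """Return the start time (ms) of the first frame whose mean energy is at
    least `ratio` times that of the previous frame, or None if there is none.
    frame_ms and ratio are illustrative choices, not values from the thesis."""
    frame_len = max(1, int(sample_rate_hz * frame_ms / 1000.0))
    prev_energy = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # The small floor keeps the comparison meaningful for silent frames.
        if prev_energy is not None and energy > ratio * max(prev_energy, 1e-12):
            return 1000.0 * start / sample_rate_hz
        prev_energy = energy
    return None

# Synthetic example: a weak 500 Hz tone that becomes strong at 224 ms.
RATE_HZ = 10000
signal = [(0.01 if n < 2240 else 1.0) * math.sin(2 * math.pi * 500 * n / RATE_HZ)
          for n in range(4000)]
onset_ms = find_onset_ms(signal, RATE_HZ)
```

On real speech a single energy ratio would likely need tuning per recording, which is one reason the spectrogram and formant tracks are consulted as well.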

Because the interest is to study the pattern of change of the second and third formant frequencies over time, a narrow time window that gives detailed time domain information is used. In this case, a 6.0 millisecond Hamming window is chosen. Since the amplitudes of sound energy at higher formant frequencies tend to be low for vowels, which makes F3 difficult to determine, the pre-emphasis is set to 100 to raise the amplitude of the F3 prominence.

The formant tracks extracted by xkl (the right-side panel of Figure 1-1) show abrupt jumps that are uncharacteristic of formant frequencies, due to the measuring algorithm used in xkl. To get a more accurate measurement of F2 and F3, the formant frequencies given by the 6.0 ms Hamming window are averaged over 15 milliseconds, from 7 milliseconds before the onset time to 7 milliseconds after. An example of the averaged spectrum is shown in Figure 2-2. The auto-pick option is turned on to let xkl find the most accurate values for the formant frequencies, which correspond to the peak frequencies (x-coordinates) seen in the spectrum.

[Spectrum plot: amplitude (dB, 0 to 70) versus FREQ (kHz); Avg DFT-spect, win: 6.0 ms, start 276, end 290; marked peak at 2741 Hz]

Figure 2-2: An example spectrum showing the auto-picked value of F3 (the dotted vertical line) by xkl.
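The averaging and peak-picking steps can be illustrated with a short sketch. It is not xkl's algorithm, which is not described here: the 1 ms step between windows, the averaging of magnitude spectra, the 256-point DFT, and all function names are assumptions, and pre-emphasis is left out. The input is a synthetic 2700 Hz tone rather than speech.

```python
import cmath
import math

def hamming(n_points):
    """Hamming window coefficients of length n_points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_points - 1))
            for k in range(n_points)]

def magnitude_spectrum(frame, n_fft):
    """Magnitudes of a zero-padded DFT for bins 0 .. n_fft/2 (naive O(N^2))."""
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n_fft)
                    for k, x in enumerate(frame)))
            for b in range(n_fft // 2 + 1)]

def averaged_spectrum(samples, rate_hz, center_ms,
                      win_ms=6.0, half_span_ms=7.0, n_fft=256):
    """Average the spectra of Hamming windows centered at 1 ms steps from
    center_ms - half_span_ms to center_ms + half_span_ms (15 windows).
    The caller must make sure every window lies inside `samples`."""
    win_len = int(round(rate_hz * win_ms / 1000.0))
    window = hamming(win_len)
    n_frames = int(round(2 * half_span_ms)) + 1
    total = [0.0] * (n_fft // 2 + 1)
    for i in range(n_frames):
        t_ms = center_ms - half_span_ms + i
        start = int(round(rate_hz * t_ms / 1000.0)) - win_len // 2
        frame = [s * w for s, w in zip(samples[start:start + win_len], window)]
        total = [a + m for a, m in zip(total, magnitude_spectrum(frame, n_fft))]
    return [a / n_frames for a in total]

def pick_peak_hz(spectrum, rate_hz, n_fft, lo_hz, hi_hz):
    """Frequency of the largest spectral bin between lo_hz and hi_hz."""
    bins = [b for b in range(len(spectrum)) if lo_hz <= b * rate_hz / n_fft <= hi_hz]
    return max(bins, key=lambda b: spectrum[b]) * rate_hz / n_fft

RATE_HZ = 16000
tone = [math.sin(2 * math.pi * 2700 * n / RATE_HZ) for n in range(1600)]
spectrum = averaged_spectrum(tone, RATE_HZ, center_ms=50.0)
f3_estimate = pick_peak_hz(spectrum, RATE_HZ, 256, 2000.0, 3500.0)
```

With a 256-point DFT at 16 kHz the bins are 62.5 Hz apart, so the picked peak can only be expected to land within one bin of the true frequency.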

II.3.2 Change of f2 and f3 Over a Period of 50 ms (DeltaF2 and DeltaF3)

The same parameter setting of 6.0 ms Hamming window, 100 pre-emphasis, and 15 ms

time-averaging is used to read the f2 and f3 values at 50 ms after the onset. The midpoint of the

time average is the onset time plus 50 ms. DeltaF2 and DeltaF3 are the differences of the f2 and f3

values at the later time minus those at the earlier time, respectively.
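In symbols, and writing t_onset for the vowel onset time (a label introduced here only for convenience), the two definitions above can be stated as follows, using the thesis convention that lowercase f2 and f3 are the formant tracks and uppercase F2 and F3 are their values at the onset:

```latex
\Delta F2 = f2\,(t_{\mathrm{onset}} + 50\ \mathrm{ms}) - F2,
\qquad
\Delta F3 = f3\,(t_{\mathrm{onset}} + 50\ \mathrm{ms}) - F3,
\qquad
F2 = f2\,(t_{\mathrm{onset}}),\quad F3 = f3\,(t_{\mathrm{onset}})
```

A negative value therefore means that the formant falls during the first 50 ms of the vowel, and a positive value means that it rises.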


II.3.3 Amplitude Difference (Amp(High-Mid)) and Duration of Burst of /ð/ and /d/

The burst spectrum of /d/ is expected to contain more energy in the higher frequency range than that of /ð/, and /d/ is expected to have a longer duration than /ð/. Thus the amplitude and duration of the two consonant bursts are measured in the hope of finding characteristics that would separate the two phonemes. Example spectra of bursts of /ð/ and /d/ that illustrate the difference are shown in Figure 2-3.

[Spectrum plots a) and b): amplitude (dB, 0 to 70) versus FREQ (kHz)]

Figure 2-3: Spectra of the burst of a) /ð/ vs. b) /d/, of a female speaker. Notice that a prominent peak at around 4.7 kHz is seen in the spectrum of /d/ but not in that of /ð/.

To measure and quantify this spectral difference, the amplitudes of the peaks in the high and mid

frequency ranges are measured, and the difference is taken. This difference is named the amplitude

difference, and will be denoted as Amp(High-Mid) from here on.

Due to the differences in formant frequencies between the genders, the Amp(High-Mid)

value is defined slightly differently for each gender; the cutoff frequency defining the ranges is

higher for female speakers than for male speakers. More specifically:

1. For female speakers, Amp(High-Mid) equals the amplitude difference between the highest peak at frequencies higher than 4000 Hz and the highest peak (excluding the F1 peak) at under 3000 Hz.

2. For male speakers, Amp(High-Mid) equals the amplitude difference between the highest peak at frequencies higher than 3500 Hz and the highest peak (excluding the F1 peak) at under 2500 Hz.
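The two definitions above can be written as one small function. This is a minimal sketch, assuming the spectral peaks have already been picked and that the F1 peak is identified by its frequency; the function name, the data layout, and the example peak values are hypothetical and are not measurements from this study.

```python
# (lower edge of the high band, upper edge of the mid band) in Hz, per the list above.
CUTOFFS_HZ = {"female": (4000.0, 3000.0), "male": (3500.0, 2500.0)}

def amp_high_mid(peaks, gender, f1_hz):
    """Amp(High-Mid) in dB.

    peaks: list of (frequency_hz, amplitude_db) spectral peaks.
    f1_hz: frequency of the F1 peak, which is excluded from the mid band.
    Returns None when either band contains no usable peak."""
    high_edge, mid_edge = CUTOFFS_HZ[gender]
    high = [amp for freq, amp in peaks if freq > high_edge]
    mid = [amp for freq, amp in peaks if freq < mid_edge and freq != f1_hz]
    if not high or not mid:
        return None
    return max(high) - max(mid)

# Hypothetical peaks for illustration only.
example_peaks = [(500.0, 45.0), (1800.0, 38.0), (2600.0, 30.0), (4700.0, 44.0)]
female_value = amp_high_mid(example_peaks, "female", f1_hz=500.0)
male_value = amp_high_mid(example_peaks, "male", f1_hz=500.0)
```

A positive result means the strongest high-frequency peak exceeds the strongest mid-frequency peak, which is the pattern expected for /d/.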

As done for the succeeding vowels, the spectra of /ð/ and /d/ are also generated with a 6.0 ms Hamming window, averaged over 15 ms, and the formant frequencies for the spectra are auto-picked by xkl. The midpoint of the time average is the onset of /ð/ or /d/, which is easily determined by looking for abrupt changes in the waveform. (See Figure 2-4.) The pre-emphasis is set at 0, because /ð/ and /d/ both have enough energy distributed in the higher frequencies for f3 to be determined unambiguously. Figure 2-4 also illustrates how the durations of /ð/ and /d/ are determined.

[Waveform plots a) and b): amplitude versus TIME (ms)]

Figure 2-4: Sample waveforms of a) /ð/ and b) /d/. In each waveform, the two vertical lines represent the points of waveform change, and the time in between the lines is the duration of the burst.

In each waveform, the former and latter points of change indicate the beginning and the end of the burst of /ð/ or /d/. The time difference between the two points is the duration of the burst.

Chapter III: Results and Analysis

The second and third formant frequencies at the onset of the succeeding vowel (F2 and F3) and their change over time (DeltaF2 and DeltaF3) were measured and analyzed for all three contexts. The duration and the amplitude difference of the burst of /ð/ and /d/ were also examined in the context of a stop consonant. The resulting measurements of the three contexts are presented in the tables and figures below. Pairs of parameters that show consistent differences between /ð/ and its close-sounding phonemes are graphed as (x, y) pairs for the purpose of finding criteria that would separate the consonants.

III.1 Context of Nasal Consonant

III.1.1 Second vs. Third Formant Frequencies (F2 vs. F3)

F2 and F3, the second and third formants at the onset of the succeeding vowel, were measured for /ð/ and /n/. Table 3-1 shows the average value for each speaker, given a particular consonant and succeeding vowel (F#1 denotes female speaker #1, M#1 denotes male speaker #1, and so on).


Parameter   Vowel   Consonant   Average of Individual Speaker (Hz)
                                F#1     F#2     M#1     M#2
F2          /o/     /ð/         1654    1606    1260    1386
F2          /o/     /n/         2394    1952    1312    1533

[Remaining entries of Table 3-1 (F2 for vowel /i/ and F3 for both vowels); their row and column positions are not recoverable: 2856 3072 3009 2825 3087 2993 2741 2694* 2667* 1974 2032 2342* 2258* 2426 2436 2363 2552 2773 2037 2394 2793 2016 2058 2667]

Table 3-1: F2 and F3 of /ð/ vs. /n/ in the nasal context, listed by speakers and vowels. Notice that the corresponding F2 of /ð/ is less than that of /n/ for all cases, but F3 does not show such consistency (the asterisks mark cases of inconsistency). Thus F2 is further analyzed.

Notice that in Table 3-1, given the same speaker and vowel, the F2 of /ð/ is less than that of /n/ for all cases. This difference holds not only for the average values but for each enunciation as well. F3, however, does not show the same consistency. The average F3 of /ð/ is less than that of /n/ for only 6 out of 8 of the cases (see Table 3-1; not true for the two sets of F3 marked with "*"). The relationship between F3 of /ð/ and /n/ is even less consistent in each enunciation; only about half of the enunciations have a higher F3 for /n/ than for /ð/. Therefore, F2 is determined to be a better parameter for identifying /ð/ in nasal context and is thus further studied.

III.1.2 Second Formant at the Onset of the Succeeding Vowel (F2)

F2 of /ð/ and /n/ is further analyzed by taking the average of F2 for all of the utterances produced within each gender group. See Table 3-2 below for results.

Succeeding Vowel   F2 of Female Speakers (Hz)   F2 of Male Speakers (Hz)
                   /ð/        /n/               /ð/        /n/
/o/                1660       1993              1360       1423
/i/                2200       2473              1995       2045

Table 3-2: Average F2 values for /ð/ vs. /n/ in the nasal context, listed by gender and succeeding vowel. Notice that F2 varies according to gender and the succeeding vowel, but F2 of /ð/ is always lower than the corresponding F2 of /n/.

Table 3-2 indicates that corresponding F2's differ significantly between the genders, with

the difference ranging from around 200 to 570 Hz. This gender difference is expected because male speakers have longer vocal tracts, which produce lower formant frequencies, according to

perturbation properties of resonators [6].

Table 3-2 indicates that the succeeding vowel also significantly influences F2. F2 of both /ð/ and /n/ when succeeded by /o/ is consistently lower, by roughly 480 to 640 Hz, than that of the consonants when succeeded by /i/. F2's correlation with the succeeding vowels is expected, because different vowels are results of different vocal tract shapes, which in turn resonate at different formant frequencies, according to the perturbation properties of resonators [6]. The vowel /i/ is expected to have a high F2 because /i/ has a fronted and high tongue body position [6]. On the other hand, /o/ has a back tongue position and thus should have a lower F2 [6]. The measured F2 agrees with these expectations, and thus helps ensure the validity of the rest of the analysis.

The differences in F2 of the same consonant when succeeded by different vowels (Table 3-2) suggest that knowledge of the identity of the succeeding vowel could be important in the correct identification of /ð/. More specifically, if the succeeding vowel is not taken into consideration, the average F2 of /ð/ for female speakers would be the average of 1660 and 2200 Hz (Table 3-2), or 1930 Hz; and F2 of /n/ would be the average of 1993 and 2473 Hz (Table 3-2), or 2233 Hz. Given these average F2 values of /ð/ and /n/, a reasonable cutoff F2 to differentiate /ð/ and /n/ could be the midpoint, or 2082 Hz, i.e. consonants with a higher F2 would be identified as /n/, and consonants with a lower F2 would be identified as /ð/. If this were the case, for /ð/ and /n/ succeeded by /o/, the vast majority of both /ð/ and /n/ would be determined to be /ð/, since more than half of the /n/ would have F2 less than 2082 Hz. To avoid this type of misclassification, differentiation criteria of the consonants would be much more accurate if they are set with regard to a particular succeeding vowel.
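The arithmetic in this argument can be checked directly from the female averages in Table 3-2. In the sketch below, "dh" stands for /ð/; the per-vowel cutoff is taken as the midpoint of the two consonant averages for that vowel, which is an assumption made here for illustration, and the function and variable names are hypothetical.

```python
# Average F2 (Hz) of female speakers from Table 3-2, keyed by (consonant, vowel).
F2_FEMALE = {("dh", "o"): 1660, ("n", "o"): 1993,
             ("dh", "i"): 2200, ("n", "i"): 2473}

def classify(f2_hz, cutoff_hz):
    """Label a consonant /n/ if its F2 lies above the cutoff, else /dh/."""
    return "n" if f2_hz > cutoff_hz else "dh"

# Vowel-independent cutoff, as in the text: pool the vowels, then take the midpoint.
pooled_dh = (F2_FEMALE[("dh", "o")] + F2_FEMALE[("dh", "i")]) / 2   # 1930.0
pooled_n = (F2_FEMALE[("n", "o")] + F2_FEMALE[("n", "i")]) / 2      # 2233.0
pooled_cutoff = (pooled_dh + pooled_n) / 2                          # 2081.5, about 2082

# Vowel-specific cutoff for /o/ (assumed here to be the midpoint of the two averages).
cutoff_o = (F2_FEMALE[("dh", "o")] + F2_FEMALE[("n", "o")]) / 2     # 1826.5

# An /n/ with the average F2 for the /o/ context is mislabeled by the pooled cutoff
# and labeled correctly by the vowel-specific one.
label_pooled = classify(F2_FEMALE[("n", "o")], pooled_cutoff)
label_by_vowel = classify(F2_FEMALE[("n", "o")], cutoff_o)
```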

Table 3-2 also shows that for both genders and vowels, F2 of /n/ is consistently greater than that of the corresponding /ð/. This observation is reasonable based on the difference in the vocal tract shapes of the two consonants. Figure 3-1 shows that /ð/ has more constriction in the back cavity than /n/ (and /d/). The shape of the back cavity is known to have the strongest influence on the value of F2: the more constriction, the lower F2 [6]. The data obtained are consistent with findings in the literature and thus strengthen the validity of the results.

[Mid-sagittal diagrams with the tongue labeled: one for /d/ and /n/, one for /ð/]

Figure 3-1: The mid-sagittal view of the vocal tract for /ð/, /n/, and /d/ [5]. The back cavity is more restricted for /ð/, a difference that leads to a lower F2 for /ð/ than for /n/ and /d/.

The average and standard deviation of F2 for vowel /o/ are calculated and graphed in Figures 3-2 and 3-3. The height of each bar represents the average F2, and the extension above and below the bar represents one standard deviation above and below the average.

[Bar chart: F2 (Hz, axis 1000 to 2600) for /ð/ and /n/; labeled value 1626]

Figure 3-2: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of female speakers, given the succeeding vowel /o/. Notice that the F2 ranges of /ð/ and /n/ do not overlap.

[Bar chart: F2 (Hz, axis 1000 to 2600) for /ð/ and /n/; labeled values 1423 and 1323]

Figure 3-3: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of male speakers, given the succeeding vowel /o/. Notice the slight overlap between the two F2 ranges.

For female speakers, the ranges of F2 for /ð/ and /n/ are relatively distinct. Most of their F2 values fall between 1590-1662 Hz for /ð/, as opposed to 1886-2371 Hz for /n/. For male speakers, however, F2 of /n/ and /ð/ are less distinct and their ranges overlap (Figure 3-3). The ranges of the majority of F2 values of male speakers are 1249-1398 Hz and 1301-1545 Hz, for /ð/ and /n/ respectively.

The lack of overlap of F2 ranges in female speakers (Figure 3-2) suggests that substantial differences exist between the F2's of the two consonants, and thus it is possible to set a criterion that would separate most utterances of /ð/ from /n/ on the basis of F2 alone. In this particular case, the cutoff F2 that would best differentiate between the two consonants is somewhere between the lower end of the /n/ range (1886 Hz) and the upper end of the /ð/ range (1662 Hz). If the criterion is set halfway between the two values for simplicity, at 1774 Hz, it would be 1.47 standard deviations away from the mean F2 of /n/ and 4.11 standard deviations from the mean F2 of /ð/. Therefore, about 93% of /n/ and virtually 100% of /ð/ would be identified correctly, assuming Gaussian distributions. With fine adjustments, the cutoff could be set so that it would be as many standard deviations away from the average of both consonants as possible. Such adjustment would further increase the accuracy of identifying the intended consonants.
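The percentages quoted for the halfway cutoff follow from the standard normal cumulative distribution function. The sketch below reproduces them under the same Gaussian assumption, using only the range limits and the standard-deviation distances stated in the text; the function and variable names are hypothetical, and "dh" stands for /ð/.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Halfway between the lower end of the /n/ range and the upper end of the /dh/ range.
cutoff_hz = (1886 + 1662) / 2           # 1774.0

# The cutoff lies 1.47 SD below the /n/ mean and 4.11 SD above the /dh/ mean,
# so the correctly identified fraction of each class is the CDF at that distance.
frac_n_correct = normal_cdf(1.47)       # about 0.929
frac_dh_correct = normal_cdf(4.11)      # about 0.99998
```

Moving the cutoff toward the /ð/ range trades a little of the large /ð/ margin for a larger /n/ margin, which is the fine adjustment suggested above.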

The criterion for differentiating /ð/ and /n/ based on F2 is harder to set for the given male speakers. But since the two consonants are articulated similarly between the two genders, the apparent greater difficulty in differentiating /ð/ and /n/ in the male speakers is most likely speaker


two male speakers reveal substantial variation of F2 between the two speakers: F2 of one speaker is significantly and consistently higher than that of the other. For each speaker, however, F2 of /ð/ and /n/ show differences similar to those observed in the female speakers. Thus, the more acoustic characteristics are known about a particular speaker, the more accurate the identification of /ð/ would likely be for that speaker.

III.1.3 Movement of f2 Over a Period of 50 ms (DeltaF2)

The second formant frequency at 50 ms after the onset of the vowel is measured by methods

described in Chapter II. The average frequency differences at the two times for a given speaker and

succeeding vowel are tabulated in Table 3-3.

Succeeding Vowel   Consonant   Average DeltaF2 of Individual Speaker (Hz)
                               F#1     F#2     M#1     M#2
/o/                /ð/         -205    -94     -178    -147
/o/                /n/         -409    -188    -179    -221
/i/                /ð/         315     410     221     111
/i/                /n/         205     241     179     74

Table 3-3: Average DeltaF2 of /ð/ vs. /n/ in nasal context, listed by speaker and succeeding vowel. The average DeltaF2 is negative for utterances succeeded by /o/ and positive for those succeeded by /i/. DeltaF2 of /ð/ is greater than that of /n/ in all cases.

Notice that the average DeltaF2 is negative for all speakers when the succeeding vowel is /o/ and is positive when the succeeding vowel is /i/. This observation is also consistent with characteristics of F2 of /o/ and /i/. As mentioned earlier, the tongue is raised and fronted when producing /i/. These actions cause a widening of the back cavity of the vocal tract and are reflected by a rising F2. The tongue is moved toward a back position when producing /o/, and the movement results in a lowering of F2, as observed in the data.

III.1.4 F2 and DeltaF2

Since F2 and DeltaF2 are, for the most part, consistently different for /ð/ and /n/, they are chosen to be graphed as x- and y-coordinates, i.e. (F2, DeltaF2). Figures 3-4 and 3-5 show F2 and DeltaF2 of /ð/ vs. /n/, regardless of the succeeding vowel, of the female and male speakers, respectively. Notice that /ð/ and /n/ are mixed together in both figures. It is impossible to draw a line, representing a certain differentiation criterion, such that points of /ð/ would be on one side and points of /n/ on the other.

The F2 and DeltaF2 of /ð/ and /n/ of the female speakers when plotted for a particular succeeding vowel (/i/), shown in Figure 3-6, show better separation between /ð/ and /n/ than in the case of mixed vowels in Figure 3-4. The separation becomes even clearer in Figure 3-7, when the plots of F2 and DeltaF2 are further narrowed down to each speaker. Notice that for both female speakers, /ð/ has lower F2 and higher DeltaF2 than /n/. The same trend is also observed for both female speakers in /ð/ and /n/ succeeded by /o/ in Figures 3-8 and 3-9. After normalization by vowel and speaker, points for /ð/ and /n/ can be separated onto different sides of a positively sloped line. Such a line would be the graphical representation of an algorithm that differentiates intended /ð/ from /n/, and it would be specified by assigning the appropriate coefficients to F2 and DeltaF2. Figures 3-7 and 3-9 show that /ð/ and /n/ are indeed distinct enough that it is possible for an


Similar plots of F2 and DeltaF2 of /ð/ and /n/ of a particular vowel for each of the two male speakers are included in Appendix B (Figures A-2 and A-4). Notice that each of the four graphs shows more separation of the /ð/ and /n/ points than the corresponding plot in which the two male speakers' data are mixed (Figures A-1 and A-3). Again, the clearer separation of /ð/ and /n/ in the plots for each individual speaker shows that, in addition to knowledge of the succeeding vowel, knowledge of the particular speaker also greatly assists in separating /ð/ from its close-sounding phoneme.

The extent of separation between /ð/ and /n/ in the four plots of Figures A-2 and A-4 for male speakers, however, is not as clear and consistent as for the females (Figures 3-7 and 3-9). Both Figure A-2a and A-4b show the same kind of separation as seen in the female speakers but with less distinction, whereas Figures A-2b and A-4a have overlapping /ð/ and /n/ points. Despite the lack of more convincing separation in the male speakers, the clusters of /ð/ and /n/ points are located in similar positions relative to each other. This similarity is expected because speakers of both genders articulate /ð/ and /n/ by shaping their vocal tracts in similar ways, thus resulting in the same relationship in the values of F2 and DeltaF2 between /ð/ and /n/. The lack of more convincing separation for the male speakers is most likely due to the limited number of utterances that were available for this study. Had more speakers of both genders been asked to repeat each utterance more times, the pattern of relatively lower F2 and higher DeltaF2 of /ð/ would probably be more apparent, and the separation between the two consonants would be more distinct.
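The separating line discussed in this section can be sketched as a linear rule on (F2, DeltaF2). The thesis does not specify how the coefficients would be chosen; the sketch below assumes a nearest-mean rule, whose boundary is the perpendicular bisector of the segment joining the two class means, with no rescaling of the axes. The points are the speaker averages for the succeeding vowel /o/ (F2 as read from Table 3-1, DeltaF2 from Table 3-3), not individual enunciations; "dh" stands for /ð/, and all names are hypothetical.

```python
def boundary(mean_dh, mean_n):
    """Perpendicular bisector between two class means in (F2, DeltaF2) space.
    Returns (w_f2, w_delta, bias); the score w_f2*F2 + w_delta*DeltaF2 + bias
    is positive on the /n/ side and negative on the /dh/ side."""
    w_f2 = mean_n[0] - mean_dh[0]
    w_delta = mean_n[1] - mean_dh[1]
    mid_f2 = (mean_n[0] + mean_dh[0]) / 2
    mid_delta = (mean_n[1] + mean_dh[1]) / 2
    return w_f2, w_delta, -(w_f2 * mid_f2 + w_delta * mid_delta)

def classify(point, line):
    w_f2, w_delta, bias = line
    return "n" if w_f2 * point[0] + w_delta * point[1] + bias > 0 else "dh"

def slope(line):
    """Slope of the boundary when DeltaF2 is drawn against F2."""
    w_f2, w_delta, _ = line
    return -w_f2 / w_delta

# Speaker averages (F2 in Hz, DeltaF2 in Hz) for succeeding vowel /o/.
F1_DH, F1_N = (1654, -205), (2394, -409)   # female speaker F#1
F2_DH, F2_N = (1606, -94), (1952, -188)    # female speaker F#2

line_f1 = boundary(F1_DH, F1_N)
line_f2 = boundary(F2_DH, F2_N)

# Each speaker's own line has a positive slope and labels that speaker's /n/
# correctly, but F#2's /n/ falls on the /dh/ side of F#1's line.
own_line_label = classify(F2_N, line_f2)
other_line_label = classify(F2_N, line_f1)
```

The last comparison illustrates, on averages only, why a criterion set for one speaker need not transfer to another.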

[Scatter plot: DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, regardless of the succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on one side and /n/ points on the other.

[Scatter plot: DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-5: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, regardless of the succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on one side and /n/ points on the other.

[Scatter plot: DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-6: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding vowel /i/.

[Scatter plots a) and b): DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-7: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b) F#2, given succeeding vowel /i/. /ð/ and /n/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-6).

[Scatter plot: DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-8: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding vowel /o/. Notice that /ð/ and /n/ occupy different ranges of F2 but not of DeltaF2.

[Scatter plots a) and b): DeltaF2 (Hz, -600 to 600) versus F2 (Hz)]

Figure 3-9: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b) F#2, given succeeding vowel /o/. /ð/ and /n/ are more distinctly separated for each individual speaker than for both speakers (Figure 3-8).

III.2 Context of Stop Consonant

III.2.1 Second vs. Third Formant Frequencies (F2 vs. F3)

Data on F2 and F3 similar to those in Table 3-1 were collected for phrases that compare /ð/ and /d/. As in the case of nasal context, F2 of /ð/ is consistently lower than that of /d/ whereas F3 of /ð/ and /d/ does not show a consistent pattern. Therefore, F2 is further analyzed.

III.2.2 The Second Formant Frequency at the Onset of the Succeeding Vowel (F2)

The second formant frequency of /ð/ and /d/ is further analyzed by taking the average of F2 for all utterances produced by the female and male speakers separately. The results are summarized in Table 3-4. Note that for /ð/ and /d/ only succeeding vowel /o/ is examined.

F2 of Female Speakers (Hz)   F2 of Male Speakers (Hz)
/ð/        /d/               /ð/        /d/
1695       2079              1367       1650

Table 3-4: Average F2 values for /ð/ vs. /d/ in stop context, listed by gender. Notice that F2 is higher in /d/ than in /ð/. Note that for /ð/ and /d/ only succeeding vowel /o/ is examined.

Again, F2 varies significantly between the genders, by about 330 Hz for /ð/ and about 430 Hz for /d/ (Table 3-4). F2 of /ð/ is lower than that of /d/ because /d/, like /n/, has less constriction in the back cavity than /ð/ (see Figure 3-1).

The average and standard deviation of F2 of female and male speakers are shown in Figures

3-10 and 3-11. As in Figure 3-2, the height of each bar represents the average F2, whereas the

extension above and below the bar represents one standard deviation away from the average.

[Bar chart: F2 (Hz, axis 1000 to 2600) for /ð/ and /d/; labeled value 2079]

Figure 3-10: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of female speakers, given the succeeding vowel /o/. Notice that the F2 of /ð/ is significantly lower than that of /d/.

[Bar chart: F2 (Hz, axis 1000 to 2600) for /ð/ and /d/; labeled values 1650 and 1397]

Figure 3-11: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of male speakers, given the succeeding vowel /o/. Male speakers show a trend similar to that of the female speakers (Figure 3-10).

For female speakers, the ranges of F2 of /ð/ and /d/ are relatively distinct (Figure 3-10). Most of the F2 values of /ð/ fall between 1660-1729 Hz, as opposed to 2012-2146 Hz for /d/. For male speakers, F2 of /ð/ and /d/ are relatively distinct too (Figure 3-11). Most F2 values of male speakers are in the range of 1329-1465 Hz and 1552-1747 Hz, for /ð/ and /d/ respectively. Because the ranges do not overlap, as in the case of /ð/ and /n/ of the female speakers in the nasal context, it is possible to set up a cutoff of F2 such that /ð/ and /d/ would be correctly identified by their lower and higher F2 values, respectively.

III.2.3 Amplitude Difference in the Burst (Amp(High-Mid))

As described in Chapter II, the energy distribution in the burst spectrum can potentially be used to distinguish intended /ð/ and /d/; the amplitude differences in the burst of /ð/ and /d/ are tabulated in Table 3-5.

Speaker   Consonant   Amplitude Difference (Amp(High-Mid)) (dB)
                      Enunciation #1   Enunciation #2   Enunciation #3   Average of the Enunciations
F#1       /ð/         N/A              -11.1            -14.5            -12.8
F#1       /d/         N/A              6.7              5.8              6.25
F#2       /ð/         -18.1            -12.5            -8.1             -12.9
F#2       /d/         17.3             14               6.3              12.6
M#1       /ð/         -6.8             -8.1             -3.5             -6.1
M#1       /d/         -7.8             -1.2             -4.5             -4.5
M#2       /ð/         1.3              -1.3             -6.2             -2.1
M#2       /d/         1                5.5              10.4             5.6

Table 3-5: Amplitude difference between the high and mid frequency ranges of /ð/ vs. /d/ in stop context, listed by speaker and enunciation. Notice that the amplitude difference of /ð/ is usually negative and is less than that of /d/ in most cases (8 of the 11 enunciations, and in every speaker's average).


Abstract

Acknowledgments

Table of Contents

I: Introduction

II: Methodology

III: Results and Analysis

IV: Conclusion and Future Work

V: References

VI: Appendices

List of Figures

List of Tables

Chapter I: Introduction

I.1 Motivation

I.2 Background Information

spectrogram of /ð/ to that of its two closest phonemes, /v/ and /z/. It then proceeds to examine /ð/ in a variety of contexts in sentence material.

I.3 Preliminary Work

Preliminary characterization has been done for the spectrograms of fricatives /ð/, /z/, and /v/ that occur in isolated vowel-fricative-vowel (VCV) combinations containing the vowels /a/, /x/, /e/, /i/, and /u/.

The fricative /z/ has visible high frequency content in its spectrogram that is not found in the spectrograms of /ð/ and /v/. Thus, /z/ can easily be differentiated from the other two fricatives, and efforts are subsequently focused on finding the more subtle spectral differences between /ð/ and /v/. To do so, a program called xkl is used to extract frequency information from the original spectrogram (see Figure 1-1). An idealized spectrogram illustrating the features of /ð/ and /v/ that are measured and compared is displayed in Figure 1-2.

Figure 1-1: a) An example spectrogram, b) An example output of formant tracks extracted by xkl.

Note that this thesis denotes the second and third formant frequencies with lowercase "f", as "f2"

and "f3", and denotes the second and third formant frequencies at the onset of the succeeding vowel

with uppercase "F", as "F2" and "F3."

Furthermore, using combinations of parameters, /ð/ and /v/ can be distinguished in a number of utterances, but not in all. Figure 1-4 shows an example of a combination of characteristics that is successful in distinguishing /ð/ and /v/.

Figure 1-4: Example of a combination (F3 and DeltaF3) that separates /ð/ and /v/. Notice that a line representing some differentiation criteria can be drawn and would separate all utterances of /ð/ from those of /v/.

I.4 Research Objectives

The objective of the project is to apply the methods and results of the analysis of the spectral characteristics of /ð/ and /v/ in the simpler VCV enunciation toward finding cues that can
