I. Automatic Speech Recognition
1.2 Main processing phases
1.2.2 Acoustic models
Acoustic models define the relationship between the structure of an acoustic signal and the distinct sounds of a language (Chen & Jokinen, 2010, Chapter 12) and associate sequences of
speech sounds with individual words, such as the sequence /ˈflaʊər/ and the word flower in the example below.
Figure 5. Acoustic Modelling.
The first phase of the acoustic model of a speech recognizer, delimited by a red line in Figure 5, constitutes the acoustic modelling task in itself, i.e., the process by which phonemes can be identified in the spectral representation of an acoustic signal; this process can also be referred to as phoneme identification. The second phase consists in building a pronunciation dictionary or lexicon that associates sequences of phonemes to individual words. Both phases will be dealt with successively in the present section.
An understanding of the first phase of the process requires some consideration of how linguistic sounds are produced by the articulators of the human vocal tract and how they are realized acoustically.
All sound waves are produced by the vibration of a source and intensified or damped by the body around them. In a trumpet or a flute, for example, sound waves are produced by the vibration of air at the mouthpiece and then amplified as they pass through the body of the instrument. Likewise, speech sounds are produced by the vibration of a source —typically the vocal folds— and then amplified by the resonating cavities in the vocal tract, namely: the pharynx, the mouth and the nose.
What causes certain components of the speech sound to be amplified, or damped, is the particular shape of the resonating cavities of the speaker, which in its turn is determined by the position of the tongue and other articulators in the vocal tract.
In the case of vowels, the parameters that determine the shape of the vocal tract are tongue height, which indicates how close the tongue is to the roof of the mouth; tongue frontness or backness, which indicates whether the tongue is positioned towards the front or the back of the
mouth; and lip protrusion, which indicates whether the shape of the lips is rounded or not at the moment of producing the vowel sound (Johnson, 2011).
The correlation between the shape of the vocal tract and the quality of vowel sounds is made evident by the fact that a speaker cannot pronounce an /æ/ sound while holding their vocal tract in the configuration of an /i/ sound (the same principle applies to any speech sound paired with the configuration of a different one).
In the specific case of /æ/ and /i/, it is by drawing the tongue closer to or further from the roof of the mouth that a speaker modifies the size and shape of the oral cavity and, thus, determines the quality of the resulting vowel sound. The /i/ sound is produced with the tongue placed near the palate, while the /æ/ sound is produced with the tongue in a much lower position.
This account of speech production, i.e., conceiving of the vocal tract as a natural acoustic filter that modifies the sound made by the vocal folds, is known as the filter theory or source-filter model of speech production.
The central idea is that filters block or let pass components of sound of different frequencies.
As the sound wave travels from the vocal folds to the exterior of the mouth, it excites resonances in the cavities of the vocal tract, forming a bundle of frequency components. The frequency value produced at the source is called the fundamental frequency (F0); its integer multiples, also produced at the source, are called harmonics; and the frequency regions amplified by the resonating cavities are called formants.
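The source-filter model can be sketched computationally: a periodic source at F0 is passed through resonant filters whose centre frequencies play the role of formants. The following minimal Python sketch is illustrative only (the sampling rate, formant bandwidths and filter design are assumptions, not values from the text); it approximates the glottal source as an impulse train and each formant as a two-pole resonator.

```python
import math

def impulse_train(f0, fs, n_samples):
    # Glottal source approximated as an impulse train at the fundamental frequency F0.
    period = int(fs / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, freq, bandwidth, fs):
    # Two-pole resonant filter: amplifies components near `freq`, damps the rest.
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2 * math.pi * freq / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs = 16000                                 # sampling rate (assumed)
source = impulse_train(120, fs, fs // 10)  # 100 ms of voicing at F0 = 120 Hz
# Filter the source through two formant resonators; the centre frequencies
# are the /i/-like values discussed later in this section.
vowel = resonator(resonator(source, 260, 80, fs), 2450, 120, fs)
```

Chaining the two resonators mimics the way each cavity of the vocal tract shapes the spectrum of the source in turn.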
Frequency components are essential to automatic speech recognition —as they are to the perception of speech by humans— because they account for a large portion of the phonetic quality or the “identity” of the resulting speech sound, and because they can be made visible in various spectral representations.
Figure 6 shows a waveform view and a spectrogram of two sequences of English speech sounds separated by a short pause. The sounds were pronounced by the author and recorded with specialized software for phonetic analysis2.
2 praat.org (accessed in July 2015).
To expand on the brief description given in the previous section, a spectrogram is "a way of envisioning how the different frequencies that make up a waveform change over time" (Jurafsky & Martin, 2009, p. 262). The horizontal axis shows time, as in the waveform view, and the vertical axis shows frequency in hertz. Amplitude is represented by the intensity of each point in the image, varying from black at the strongest intensity (the highest amplitude value) to white at the weakest (the lowest amplitude value).
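A spectrogram is computed by slicing the signal into short overlapping frames and taking the magnitude spectrum of each frame; the columns of magnitudes, laid side by side, form the image. The sketch below uses a naive discrete Fourier transform for clarity (real systems use the FFT), and the frame length, hop size and test frequency are illustrative assumptions.

```python
import cmath
import math

def dft_magnitudes(frame):
    # Naive DFT: magnitude of each frequency bin of one analysis frame.
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2)]

def spectrogram(signal, frame_len=64, hop=32):
    # One magnitude spectrum per overlapping frame: the columns of the image.
    return [dft_magnitudes(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]

# A 200 Hz sine sampled at 1600 Hz: energy should concentrate in one low bin.
signal = [math.sin(2 * math.pi * 200 * n / 1600) for n in range(320)]
spec = spectrogram(signal)
```

For this pure tone the energy lands in a single frequency bin of every frame, which is exactly the kind of dark horizontal band one reads off a spectrogram.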
As can be observed, the phoneme /i/ presents a constant pattern in all four instances: a low first formant (F1) occurring at about 260 Hz, and a very high second formant (F2) occurring at about 2450 Hz. The location of these two frequency components —represented by visibly darker bars marked with green lines— can be said to be the characteristic spectral pattern of the /i/ sound and, consequently, can be used to distinguish it from other vowel sounds.
In fact, all vowel sounds can be recognized in a spectrogram by analyzing the disposition of their first two formants.
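The idea that a vowel can be identified from the disposition of its first two formants can be sketched as a nearest-neighbour lookup. The reference values below are typical figures for an adult speaker and are assumptions added for illustration, except /i/, whose values match those given above.

```python
import math

# Approximate (F1, F2) reference values in Hz; illustrative assumptions,
# except /i/, which uses the values measured in the spectrogram above.
VOWEL_FORMANTS = {
    "i": (260, 2450),   # as in "see"
    "ae": (660, 1700),  # as in "cat"
    "u": (300, 870),    # as in "boot"
    "a": (730, 1090),   # as in "father"
}

def classify_vowel(f1, f2):
    # Pick the vowel whose reference formants are closest (Euclidean distance).
    return min(VOWEL_FORMANTS,
               key=lambda v: math.dist((f1, f2), VOWEL_FORMANTS[v]))
```

Measured formants never match the reference values exactly, which is why a distance measure, rather than exact lookup, is the natural choice.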
Consonants, for their part, are different in that the layout of formants in a spectrogram is not the sole requirement for their identification. Other acoustic cues come into play as a result of the fact that they involve some kind of restriction to the flow of air coming from the lungs.
Figure 6. A waveform and a spectrogram.
Before analyzing the spectral representation of /f/ and /v/, we will introduce some preliminary notions about the production of consonants (just as was done in the case of vowels a few pages above).
From an articulatory perspective, consonants can be classified according to three main criteria:
voicing, i.e., whether the vocal folds vibrate or not when the sound is produced; place of articulation, i.e., the specific anatomical point in the vocal tract where the maximum restriction of airflow is produced; and manner of articulation, i.e., how the obstruction of airflow is produced.
Let us begin with /f/. If a speaker places their lower lip loosely on the lower edge of their upper teeth and forces air out of their mouth, they create a partial obstruction to the flow of air coming from the lungs. The noise made by the air escaping past that obstruction will sound exactly like the /f/ sound at the beginning of the words flower and philosophy, or at the end of the word laugh (please note that one single speech sound corresponds to multiple orthographic realizations and keep this in mind when we discuss the notion of ambiguity in the next section).
Since the articulators at play in this process are the lower lip and the upper teeth, /f/ is said to be a labiodental sound. Similarly, because the hissy noise made by the air escaping the lips is caused by friction, the sound is characterized as a fricative. Lastly, /f/ is characterized as a voiceless sound because it does not require vibration of the vocal folds.
The /v/ sound differs in that it does require vocal fold vibration, but otherwise it behaves exactly as /f/ regarding articulatory properties. Thus, /f/ can be said to be a voiceless labiodental fricative, whereas /v/ can be said to be a voiced labiodental fricative.
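The three articulatory criteria lend themselves to a small feature structure. In the sketch below, only the /f/ and /v/ entries come from the text; /s/ and /z/ are standard classifications added for illustration.

```python
from typing import NamedTuple

class Consonant(NamedTuple):
    voicing: str   # "voiced" or "voiceless"
    place: str     # place of articulation
    manner: str    # manner of articulation

CONSONANTS = {
    "f": Consonant("voiceless", "labiodental", "fricative"),
    "v": Consonant("voiced", "labiodental", "fricative"),
    "s": Consonant("voiceless", "alveolar", "fricative"),
    "z": Consonant("voiced", "alveolar", "fricative"),
}

def differing_features(a, b):
    # Return the names of the features on which two consonants differ.
    return [f for f in Consonant._fields
            if getattr(CONSONANTS[a], f) != getattr(CONSONANTS[b], f)]
```

Applied to /f/ and /v/, the function isolates voicing as the single distinguishing feature, which mirrors the description above.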
This distinction is visible in the spectrogram displayed in Figure 6. The dark bar at the bottom of the plot in the portion that corresponds to /v/, though slightly lighter than the other bars, means that the vocal folds were vibrating when the sound was produced, and that the resulting sound was subsequently amplified by the vocal tract of the speaker in a particular way (unsurprisingly, in a manner similar to the way the neighboring /i/ was amplified).
Conversely, the portion of the spectrogram that corresponds to /f/ shows no bars. This is because there was no vocal fold vibration involved in the production of /f/ and no vocal tract filtering.
The source of the sound in the case of /f/ is to be found in the lips (very far forward in the mouth) rather than in the vocal folds of the speaker.
Another acoustic cue that can help distinguish these sounds from one another is their amplitude.
The portion of the spectrogram that corresponds to /f/ is visibly darker than the portion that corresponds to /v/, suggesting that overall amplitude is higher. This difference is perhaps easier to spot in the waveform view, where the chunk of the wave that corresponds to /f/ is bigger than that of /v/.
The articulatory principles underlying this last distinction are related to one of the two main factors that determine the production of turbulent noise, namely: “the volume velocity of the airflow (volume of air going past a certain point per unit of time)” (Johnson, 2011, p. 152), the other factor being the size of the channel that lets the air through. The idea is that the amplitude of turbulent noise, characteristic of all fricative sounds, is determined by the velocity of the air molecules as they pass through a channel (the faster they move, the louder the sound).
Keith Johnson explains the intrinsic differences in amplitude between /f/ and /v/ in terms of their articulatory properties by suggesting a quite revealing practical demonstration.
[…] if you drape a sheet of paper over your head, you can easily blow it off with a voiceless labio-dental fricative [f] but not with a voiced labio-dental fricative [v]. This is because during voicing the vocal cords are shut (or nearly so) as much as they are open. Therefore, given a comparable amount of air pressure produced by the lungs (the subglottal pressure), the volume velocity during voicing is much lower than it is when the glottis is held open (Johnson, 2011, p. 156).
It is interesting to note that volume velocity, the physical property determining the amplitude of /f/ and /v/, is directly related to voicing, the articulatory phenomenon that typically elicits the formation of frequency components. This reveals the complex interplay that exists between the articulation of speech sounds, their acoustic realizations and their visual representations; and makes it clear that a grasp of such relations is key to understanding how acoustic models work.
To further such an understanding, we present three basic notions that offer a summary account of the present section: 1) that spectrograms allow for the recognition of phonemes; 2) that the recognition task is performed on the basis of the intrinsic acoustic properties of speech sounds;
and 3) that acoustic properties bear a relationship with articulatory principles.
These three basic notions are meant to complement the two notions about digital signal processing presented at the end of Section 1.2.1 and, all together, provide the reader with an intuitive global grasp of the initial acoustic-level phases of the recognition process.
Table 1 below gathers all five notions.
Digital signal processing
1) A speech signal is stored in a computer as a series of digits that contain all the information that is important for human audition.
2) This numeric version of the signal can be plotted over time into various kinds of visual representations.
Acoustic modelling
3) Some of these representations (spectral ones) allow for the recognition of phonemes.
4) The recognition task is performed on the basis of the intrinsic acoustic properties of speech sounds.
5) The acoustic properties of speech sounds bear a relationship with articulatory principles.
Table 1. A minimal conceptualization of the initial acoustic-level phases of the speech recognition process.
Finally, the second phase of the acoustic model of a speech recognizer, delimited by a green line in Figure 5 and replicated in Figure 7 below, consists in associating sequences of phonemes with individual words of a language (Rayner et al., 2006).
Figure 7. Pronunciation dictionary.
Building a good pronunciation dictionary constitutes a major task for various reasons. Perhaps the most salient ones are 1) that the number of words in natural languages is very high and 2) that the number of possible pronunciations for those words is even higher.
The challenges associated with the first factor are fairly straightforward; they stem from the sheer size of the undertaking. Pronunciation dictionaries for large-vocabulary applications typically contain hundreds of thousands of entries. Creating such a large number of entries demands great effort and close attention to questions of storage, access and maintenance.
The second reason is explained by the fact that speakers of the same language pronounce certain words differently depending on various factors, such as the geographical region where they live, their socioeconomic status and their age, among many others. These differences in pronunciation are instances of a broader linguistic phenomenon called variation, which is also concerned with the lexical and structural choices of speakers and the factors that motivate them.
Let us consider a specific example. If we were to build a speech translation tool for the catering industry that includes Spanish as an input language —not an unlikely scenario, since many commercial systems target this domain at present— the pronunciation dictionary would have to include an entry for the Spanish word “sándwich” (a loanword from English), along with all of its many pronunciation variants, including the more refined /ˈsandwitʃ/ and /ˈsaŋgwitʃ/, but also the less prestigious /ˈsaŋgwitʃe/, /ˈsaŋgutʃe/, /ˈsandutʃe/ and /ˈsambutʃe/. The ratio is seven sequences of phonemes to one individual word.
Since acoustic models are trained from large quantities of recorded and transcribed speech, the success of a speech application in recognizing the various pronunciations of the Spanish word
“sándwich” will depend on whether the training data include examples of people speaking those variants.
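At its simplest, a pronunciation lexicon of this kind is a mapping from orthographic words to lists of phoneme sequences. The sketch below encodes the “sándwich” entry with the six variants listed above; the data-structure design is an illustrative assumption, not the format of any particular recognizer.

```python
# Pronunciation lexicon: one orthographic word -> many phoneme sequences.
LEXICON = {
    "sándwich": [
        ["s", "a", "n", "d", "w", "i", "tʃ"],
        ["s", "a", "ŋ", "g", "w", "i", "tʃ"],
        ["s", "a", "ŋ", "g", "w", "i", "tʃ", "e"],
        ["s", "a", "ŋ", "g", "u", "tʃ", "e"],
        ["s", "a", "n", "d", "u", "tʃ", "e"],
        ["s", "a", "m", "b", "u", "tʃ", "e"],
    ],
}

def pronunciations(word):
    # All phoneme sequences a recognizer should accept for this word.
    return LEXICON.get(word, [])
```

Each listed sequence is only useful if the acoustic model was trained on speech containing it, which is precisely the dependency on training data noted above.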
Some applications, in particular large-vocabulary dictation tools like Dragon NaturallySpeaking, can be trained to recognize the voice of a single user taking into account their idiosyncratic preferences in pronunciation. This typically improves word accuracy, but the complications that result from creating and appropriately activating individual user profiles mean that this method is often hard to implement (Bouillon, Routledge).
Now, the fact that one single word can be associated with several different sequences of phonemes is just one part of the problem. The other part is that one sequence of phonemes (or more) can be associated with several different words. For example, in French, the sequences of phonemes /vɛ̃/ and /vain/ can both be associated with the words vin (wine), vain (vain, futile), vingt (twenty), vaincs (to defeat, 1st and 2nd person singular, present tense), vainc (to defeat, 3rd person singular, present tense), vins (to come, 1st and 2nd person singular, simple past) and vint (to come, 3rd person singular, simple past). The ratio in this example is two sequences of phonemes to seven individual words.
This phenomenon is called homophony and will be dealt with more carefully in Section 1.2.3, since it is the language model of a speech recognizer that distinguishes acceptable from unacceptable sequences of words.
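Homophony is the mirror image of variation, and both directions can be read off the same lexicon: inverting a word-to-pronunciations mapping yields a pronunciation-to-words mapping. The French data and the layout below are illustrative (pronunciations are written as plain strings for simplicity).

```python
from collections import defaultdict

# Word -> pronunciation(s); a tiny illustrative French lexicon.
LEXICON = {
    "vin": ["vɛ̃"],
    "vain": ["vɛ̃"],
    "vingt": ["vɛ̃"],
    "vint": ["vɛ̃"],
}

def homophones(lexicon):
    # Invert the lexicon: pronunciation -> all words that share it.
    table = defaultdict(list)
    for word, prons in lexicon.items():
        for pron in prons:
            table[pron].append(word)
    # Keep only pronunciations shared by more than one word.
    return {p: ws for p, ws in table.items() if len(ws) > 1}
```

The acoustic model alone cannot choose among the words returned for a shared pronunciation; as noted above, that choice falls to the language model.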
Figure 8 below picks up the word used as an example in previous figures (3, 5 and 7) to illustrate how multiple sequences of phonemes can be associated with multiple words within a single entry of the pronunciation lexicon. The phenomenon to the left of the arrow concerns the acoustic modelling phase and constitutes an instance of linguistic variation, whereas the phenomenon to the right concerns the language modelling phase and can be described as homophony.
Figure 8. Variation and homophony.