Summary - Novel speech processing techniques for robust automatic speech recognition

In this chapter, we have demonstrated that the variable-scale piecewise quasi-stationary spectral analysis of the speech signal can improve the recognition accuracies of state-of-the-art ASR systems in clean acoustic conditions. Such a technique can partially overcome the time-frequency resolution limitations of fixed-scale spectral analysis techniques. However, it can be argued that most of the frequency resolution is lost anyway due to the Mel-filter binning of the DFT samples.

Nevertheless, a spectrum (DFT) estimated over a quasi-stationary segment will help to reduce the variance of the estimated Mel-filter bank energies and consequently that of the MFCC feature vectors. However, as we need a certain minimum number of samples to estimate the AR parameters and the residuals, our algorithm cannot detect QSSs shorter than 20 ms. We believe that in order to realize the true potential of variable-scale piecewise quasi-stationary analysis, we will have to research new statistical modeling techniques that can handle the spectral vectors derived from the variable-sized segments in a suitable way. One example could be Dynamic Bayesian Networks (DBNs) (Stephenson et al., 2004), where the length of the QSS can be an auxiliary variable that conditions the emitted MFCC vector. However, this discussion is beyond the scope of our present work.

As linear prediction analysis is very sensitive to noise, we expect the proposed technique to be sensitive to noisy speech. In particular, if the acoustic model (HMM-GMM) has been trained using only clean speech and the test utterances are noisy, then we can expect a mismatch between the LPC analysis of the training set and that of the test set. Consequently, this will affect the values of the likelihood ratio in (3.4). As the threshold γ in (3.4) has been determined over a clean development set, this will lead to a mismatch whenever the test conditions are noisy. However, if the training and development sets include noisy utterances as well, then this mismatch can be reduced. At this point, we would like to stress that the proposed technique is aimed at improving ASR performance in clean acoustic conditions (SNR ≥ 18 dB). To address the problem of noise-robust ASR, this thesis proposes other features, which are elaborated in Chapters 5, 6 and 7.

As the performance gains obtained through the variable-scale QSS analysis were modest, this led us to design new features that can inherently describe non-stationary signals such as speech. Amplitude modulations and frequency modulations (AM-FM) (Haykin, 1994) of any given signal can model the non-stationarity inherent in that signal reasonably well. In the next chapter we will study the AM-FM demodulation techniques that have been specially developed and designed for the analysis of speech signals, and in particular for their use as a feature vector in ASR. We will identify the shortcomings of the previously proposed modulation spectrum related techniques (Tyagi et al., 2003; Athineos et al., 2004; Zhu and Alwan, 2000; Kingsbury et al., 1998a; Kanedera et al., 1998). These previously published modulation spectrum related techniques have primarily improved the word recognition accuracies in noisy conditions but have performed much worse than the MFCC features in clean acoustic conditions.

We will outline the development and design of a novel feature (Fepstrum) that has been derived using a theoretically consistent AM-FM demodulation technique and has led to significant improvements in word recognition accuracies in clean acoustic conditions.

Chapter 4

Fepstrum representation of speech signal

4.1 Introduction

In the past several years, significant efforts have been made to develop new speech signal representations which can better describe the non-stationarity (spectral dynamics) inherent in the speech signal. Some representative examples are the temporal patterns (TRAPS) features (Hermansky, 2003; Athineos et al., 2004) and the several modulation spectrum related techniques (Tyagi et al., 2003; Athineos et al., 2004; Zhu and Alwan, 2000; Kingsbury et al., 1998a; Kanedera et al., 1998). In the TRAPS technique, temporal trajectories of spectral energies in individual critical bands over windows as long as one second are used as features for pattern classification.

The notions of amplitude modulation (AM) and frequency modulation (FM) were initially developed for communication signals (Haykin, 1994). In theory, the AM signal modulates a narrow-band carrier signal (specifically, a monochromatic sinusoidal signal). Therefore, to be able to extract the AM signals of a wide-band signal such as speech (typically 4 kHz), it is necessary to decompose the speech signal into several narrow spectral bands, where each band's output signal can be modeled as an AM signal modulating a single carrier (FM) signal. In the past (Tyagi et al., 2003; Athineos et al., 2004; Zhu and Alwan, 2000; Kingsbury et al., 1998a; Kanedera et al., 1998), the modulation spectrum used as a feature vector for ASR has been defined and extracted in a slightly ad hoc manner. For instance, several researchers have extracted the speech modulation spectrum by computing a discrete Fourier transform (DFT) of the Mel or critical band spectral energy trajectories, where each sample of the trajectory has been obtained through a power spectrum (followed by Mel filtering) over 20-30 ms long windows. An illustration of this is provided in Fig. 4.1. The major limitations of such a technique are the following:

• It implicitly assumes that within each Mel or critical band, the amplitude modulation (AM) signal remains constant within the duration of the window length that is typically 20-30ms long.

• Instead of modeling the constantly and slowly changing amplitude modulation signal in each band, it mostly models the spurious and abrupt modulation frequency changes that occur due to the frame shifting of 10 ms.
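The conventional frame-based extraction criticized above can be sketched as follows. This is a minimal illustration rather than any of the cited authors' implementations: the function names, the 25 ms window, the uniform rectangular bands (standing in for Mel or critical-band filters) and the synthetic test tone are all assumptions made for the example. Each trajectory sample is one band energy per 25 ms frame, so any AM variation inside a frame is averaged away, and the trajectory is sampled only at the 10 ms frame shift.

```python
import numpy as np

def band_energy_trajectories(x, fs, n_bands=8, win_ms=25.0, shift_ms=10.0):
    """Per-frame band energies: power spectrum of each frame, pooled into bands.

    Assumption: uniform rectangular bands replace Mel filters for brevity.
    """
    win = int(round(fs * win_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    n_frames = 1 + (len(x) - win) // shift
    window = np.hamming(win)
    # Band edges expressed as indices into the rfft bins (win // 2 + 1 bins).
    edges = np.linspace(0, win // 2 + 1, n_bands + 1).astype(int)
    energies = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = x[i * shift:i * shift + win] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        for b in range(n_bands):
            energies[i, b] = power[edges[b]:edges[b + 1]].sum()
    return energies

def modulation_spectrum(energies, frame_rate):
    """DFT across frames of each band's mean-removed log-energy trajectory."""
    traj = np.log(energies + 1e-12)
    traj = traj - traj.mean(axis=0)
    spec = np.abs(np.fft.rfft(traj, axis=0))
    freqs = np.fft.rfftfreq(traj.shape[0], d=1.0 / frame_rate)
    return freqs, spec

# Hypothetical test signal: a 1 kHz tone whose amplitude is modulated at 4 Hz.
fs = 8000
t = np.arange(fs) / fs
x = (1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 1000 * t)

energies = band_energy_trajectories(x, fs)
freqs, spec = modulation_spectrum(energies, frame_rate=100.0)  # 10 ms shift
strongest_band = int(np.argmax(energies.sum(axis=0)))
peak_mod_freq = freqs[np.argmax(spec[:, strongest_band])]
print("peak modulation frequency (Hz):", peak_mod_freq)
```

For a slow 4 Hz modulation the frame-based estimate is adequate; the limitations listed above matter when the AM signal changes appreciably within a single analysis window.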

In this chapter, we propose an algorithm to perform AM-FM demodulation of the speech signal in the time domain. As the AM-FM signal model is defined in the time domain¹, a demodulation in the time domain leads to a robust estimation of the continuously though slowly changing AM signals. It also leads to a better understanding of the relationships between the various signal sub-components.
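As an illustration of what time-domain AM-FM demodulation means for a single narrow-band signal, the sketch below uses the standard analytic-signal (Hilbert) approach: the AM signal is taken as the magnitude of the analytic signal and the FM signal as the derivative of its unwrapped phase. This is an assumption chosen for illustration and is not the algorithm developed in this chapter; the function names and the synthetic signal are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: keep DC, double positive frequencies,
    zero negative frequencies (the usual discrete Hilbert construction)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def am_fm_demodulate(x, fs):
    """Return (AM envelope, instantaneous frequency in Hz) of a narrow-band x."""
    z = analytic_signal(x)
    am = np.abs(z)
    phase = np.unwrap(np.angle(z))
    fm = np.diff(phase) * fs / (2.0 * np.pi)  # one sample shorter than x
    return am, fm

# Hypothetical narrow-band signal following x(t) = a(t) cos(phase(t)),
# with a slow 4 Hz AM and a carrier near 1 kHz carrying a small 3 Hz FM.
fs = 8000
t = np.arange(fs) / fs
a_true = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)
f_true = 1000.0 + 20.0 * np.cos(2 * np.pi * 3 * t)
phase_true = 2 * np.pi * 1000.0 * t + (20.0 / 3.0) * np.sin(2 * np.pi * 3 * t)
x = a_true * np.cos(phase_true)

am, fm = am_fm_demodulate(x, fs)
print("max AM error:", np.max(np.abs(am - a_true)))
print("max FM error (Hz):", np.max(np.abs(fm - f_true[:-1])))
```

Because the envelope is recovered sample by sample, it is not tied to a 20-30 ms window or a 10 ms frame shift. The recovery is only meaningful when the carrier is narrow-band, which is the constraint discussed in the remainder of this section.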

Through examples, we will show that for a theoretically meaningful estimation of the AM signals, it is important to constrain the companion FM signal to be narrow-band. Similar arguments from the modulation filtering point of view, as applied to speech coding, were presented by Schimmel and Atlas (Schimmel and Atlas, 2005). In their experiment, they consider a wide-band filtered speech signal $x(t) = a(t)c(t)$, where $a(t)$ is the AM signal and $c(t)$ is the broad-band carrier signal. Then, they perform a low-pass modulation filtering of the AM signal $a(t)$ to obtain $a_{LP}(t)$. The low-pass filtered AM signal $a_{LP}(t)$ is then multiplied with the original carrier $c(t)$ to obtain a new signal $\tilde{x}(t)$. They show that the acoustic bandwidth of the reconstructed signal $\tilde{x}(t)$ is not necessarily less than that of the original signal $x(t)$. This unexpected result is a consequence of the signal decomposition into wide spectral bands that results in a broad-band carrier (Schimmel and Atlas, 2005) and is explained below. Let us consider the original signal $x(t)$ and its Fourier transform $X(f)$, which is obtained as the convolution of the spectra of the AM and the carrier signals, namely $A(f)$ and $C(f)$.

$$x(t) = a(t)\,c(t), \qquad X(f) = A(f) * C(f) \tag{4.1}$$

The Fourier transform of the reconstructed signal $\tilde{x}(t)$ can be expressed as follows,

$$\tilde{x}(t) = a_{LP}(t)\,c(t), \qquad \tilde{X}(f) = A_{LP}(f) * C(f) \tag{4.2}$$

where '*' denotes convolution and $A(f)$, $A_{LP}(f)$ and $C(f)$ are the Fourier transforms of the AM signal $a(t)$, its low-pass filtered version $a_{LP}(t)$ and the carrier signal $c(t)$, respectively. Now if $c(t)$ is a sinusoidal carrier, i.e. $c(t) = \sin(2\pi f_0 t)$, then it can be seen that the acoustic bandwidth of $\tilde{x}(t)$ is less than that of the original signal $x(t)$, which is the desired result of low-pass filtering the AM signal $a(t)$. However, this is not necessarily the case if the carrier $c(t)$ is not sinusoidal and is broad-band. This can be seen as follows. As $x(t)$ has finite bandwidth (let us say $X(f)$ is non-zero only over the interval [100, 200] Hz), the broad-band carrier spectrum $C(f)$ and the AM spectrum $A(f)$ have a special structure such that their convolution in (4.1) is zero outside the interval [100, 200] Hz. The low-pass filtering operation² will not preserve this "special structure" between $A_{LP}(f)$ and $C(f)$. Therefore, the convolution in (4.2) is not guaranteed to be zero outside the interval [100, 200] Hz, thus increasing the bandwidth of $\tilde{x}(t)$ and defeating the intent of the low-pass filter. It is therefore important to ensure that the carrier signal is narrow-band (ideally monochromatic). We realize that this is a serious problem not only for modulation filtering as applied to speech coding (Schimmel and Atlas, 2005), but also for modulation spectrum analysis (which is used as a feature vector for ASR and is the topic of this chapter). As a solution, we propose using narrow-band filters to decompose the speech signal, followed by AM signal estimation in each band using analytic signals in the time domain. The usefulness of this modification is further explained later in this chapter.
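The argument around (4.1) and (4.2) can be checked numerically with a toy example. The sketch below is not the experiment of Schimmel and Atlas; the signals, the 20 Hz modulation cutoff, the brick-wall filter and all function names are assumptions chosen so that every component falls on an exact DFT bin. Case A uses a sinusoidal carrier at 150 Hz. Case B starts from a signal confined to [100, 200] Hz (two tones at 120 Hz and 180 Hz) and splits it into a Hilbert envelope and the corresponding carrier, which is broad-band.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (discrete Hilbert construction)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def ideal_lowpass(a, fs, cutoff_hz):
    """Brick-wall low-pass filter applied in the DFT domain."""
    spectrum = np.fft.rfft(a)
    freqs = np.fft.rfftfreq(len(a), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(a))

def occupied_band(x, fs, rel_floor=1e-3):
    """Lowest and highest frequency whose magnitude exceeds rel_floor * peak."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    active = freqs[mag > rel_floor * mag.max()]
    return float(active.min()), float(active.max())

def modulation_filter(a, c, fs, cutoff_hz):
    """x_tilde(t) = a_LP(t) c(t), as in (4.2)."""
    return ideal_lowpass(a, fs, cutoff_hz) * c

fs = 8000
t = np.arange(fs) / fs  # one second, so DFT bins are 1 Hz apart

# Case A: sinusoidal carrier, AM with 10 Hz and 40 Hz components.
a_nb = 1.0 + 0.5 * np.cos(2 * np.pi * 10 * t) + 0.3 * np.cos(2 * np.pi * 40 * t)
c_nb = np.sin(2 * np.pi * 150 * t)
x_nb = a_nb * c_nb
x_nb_tilde = modulation_filter(a_nb, c_nb, fs, cutoff_hz=20.0)

# Case B: band-limited signal split into Hilbert envelope and broad-band carrier.
x_bb = np.cos(2 * np.pi * 120 * t) + 0.8 * np.cos(2 * np.pi * 180 * t)
z = analytic_signal(x_bb)
a_bb = np.abs(z)
c_bb = np.cos(np.angle(z))  # a_bb * c_bb reproduces x_bb
x_bb_tilde = modulation_filter(a_bb, c_bb, fs, cutoff_hz=20.0)

print("A: original band  ", occupied_band(x_nb, fs))
print("A: filtered band  ", occupied_band(x_nb_tilde, fs))
print("B: original band  ", occupied_band(x_bb, fs))
print("B: filtered band  ", occupied_band(x_bb_tilde, fs))
```

In case A the low-pass filter removes the 40 Hz modulation, so the occupied band shrinks around the carrier. In case B the same filter leaves only the mean of the envelope, and the reconstructed signal inherits the full spectrum of the broad-band carrier, which extends beyond the original [100, 200] Hz band.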

Over the past few decades, pole-zero transfer functions that are used for modeling the frequency response of a signal, have been well studied and understood (Atal and Hanauer, 1971; Makhoul, 1975; Haykin, 1993). In this work we will denote them by “F-PZ”. Lately, Kumaresan and his colleagues (Kumaresan and Rao, 1999; Kumaresan, 1998) have proposed to model analytic signals (Haykin, 1994) using pole-zero models in the temporal domain (denoted by T-PZ to distinguish them from the F-PZ). Along similar lines, Athineos et al. (Athineos and Ellis, 2003; Athineos et al., 2004) have used the dual of the linear prediction in the frequency domain to improve upon the TRAP features.

¹ $x(t) = a(t)\cos\left(\int_0^t 2\pi f(\tau)\,d\tau\right)$, where $x(t)$ is a narrow band-pass filtered speech signal, $a(t)$ is the corresponding AM signal and $f(t)$ is the corresponding FM signal.

² In fact, any filtering operation except for the identity one.
