Contributions of the thesis - Novel speech processing techniques for robust automatic speech re

This thesis focuses on addressing certain deficiencies of the existing spectral envelope based feature extraction techniques to improve the ASR performance in clean conditions. Moreover, it is also important to design new adaptive feature extraction techniques to achieve satisfactory noise robust ASR performance. The contributions of this thesis, in the form of new feature extraction algorithms for clean as well as noisy acoustic conditions, are listed below.

1.4.1 Features for clean speech

1. Variable scale piece-wise quasi-stationary analysis of speech signals: It is often acknowledged that speech signals contain short-term and long-term temporal properties (Rabiner and Juang, 1993) that are difficult to capture and model by using the usual fixed scale (typically 20ms) short time spectral analysis used in hidden Markov models (HMMs), based on piecewise stationarity and state conditional independence assumptions of acoustic vectors.

For example, vowels are typically quasi-stationary over 40-80ms segments, while plosives typically require analysis over segments shorter than 20ms. Thus, fixed scale analysis is clearly sub-optimal for “optimal” time-frequency resolution and modeling of the different stationary phones found in the speech signal. In this work, we have studied the potential advantages of using variable size analysis windows towards improving state-of-the-art speech recognition systems in clean acoustic conditions. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, based on the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the Linear Prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. Such an approach thus results in a variable scale time spectral analysis, adaptively estimating the largest possible analysis window size such that the signal remains quasi-stationary, and hence the best temporal/frequency resolution tradeoff.
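The segmentation idea described above can be sketched as follows. This is only an illustrative sketch, not the algorithm of the thesis: the generalized likelihood ratio (GLR) form of the test, the function names, and the parameters `min_len`, `step`, `max_len`, `threshold` and `order` are all assumptions made for the example. A window is grown in small steps for as long as the LP residual energies suggest that the window and the next chunk of samples come from the same Gaussian AR process.

```python
import numpy as np

def lp_residual_energy(x, order=10):
    """Mean squared LP residual of x, estimated with the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    # Toeplitz normal equations R a = r[1:], lightly regularised for stability.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += (1e-9 * r[0] + 1e-12) * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    return max(r[0] - np.dot(a, r[1:]), 1e-12)

def glr_distance(left, right, order=10):
    """Log-likelihood ratio: two separate AR models versus one shared AR model.

    Small values suggest that both pieces were generated by the same AR process.
    """
    n1, n2 = len(left), len(right)
    e_both = lp_residual_energy(np.concatenate([left, right]), order)
    e1 = lp_residual_energy(left, order)
    e2 = lp_residual_energy(right, order)
    return 0.5 * ((n1 + n2) * np.log(e_both) - n1 * np.log(e1) - n2 * np.log(e2))

def quasi_stationary_segments(x, min_len=160, step=80, max_len=1280,
                              threshold=40.0, order=10):
    """Greedy variable-size windows as (start, end) sample index pairs.

    All default values are hypothetical and chosen for illustration only.
    """
    x = np.asarray(x, dtype=float)
    segments, start = [], 0
    while start < len(x):
        end = min(start + min_len, len(x))
        # Extend the window while the next chunk still fits the same AR model.
        while end + step <= len(x) and end - start + step <= max_len:
            if glr_distance(x[start:end], x[end:end + step], order) > threshold:
                break
            end += step
        if len(x) - end < step:      # absorb a short tail into the last window
            end = len(x)
        segments.append((start, end))
        start = end
    return segments
```

Each returned `(start, end)` pair would then serve as one analysis window for spectral feature extraction, so stationary regions yield long windows and rapidly changing regions yield short ones.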

2. Fepstrum representation of speech: In this work, a theoretically consistent amplitude modulation (AM) signal analysis technique has been developed, in contrast to the previous ones (Tyagi et al., 2003; Athineos et al., 2004; Zhu and Alwan, 2000; Kingsbury et al., 1998a; Kanedera et al., 1998). We have shown that a “meaningful” AM signal estimation is possible only if we decompose the speech analytic signal using several narrow-band filters, which results in narrow-band carrier signals. Secondly, we use the lower modulation frequency spectrum of the downsampled AM signal as a feature vector (termed FEPSTRUM, as it is shown to be an exact dual of the well known quantity, cepstrum). While the Fepstrum describes the amplitude modulations (AM) occurring within a single speech frame of size 80-100 ms, the MFCC provides a description of the static energy in each of the Mel-bands of each frame and its variation across several frames (through the use of the delta and double delta features). The Fepstrum provides complementary information to the MFCC features and we show that a combination of the two features provides a significant ASR accuracy improvement in clean conditions over several speech databases.

Figure 1.1. An approximate AM demodulation technique. Short time Fourier transform vectors X(1), ..., X(5) are indexed by frame number (time axis as frame index) and plotted against frequency (frequency axis). A DFT over a small number of frames for a fixed (Kth) frequency bin yields a crude approximation to the modulation spectrum of the corresponding frequency bin.
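The processing chain described above (narrow-band decomposition of the analytic signal, AM envelope, downsampling, lower modulation spectrum) can be illustrated with a short sketch for a single band. It is not the thesis implementation: the rectangular FFT-domain band-pass filter, the logarithm applied to the envelope, the use of a DCT as the modulation spectrum, and the names `narrowband_analytic`, `fepstrum_band`, `decim` and `n_coef` are assumptions made for the example.

```python
import numpy as np

def narrowband_analytic(frame, fs, f_lo, f_hi):
    """Analytic signal of one narrow band, built in the FFT domain.

    Only positive frequencies in [f_lo, f_hi) are kept (f_lo > 0 assumed),
    so the result is the analytic signal of the band-passed frame.
    """
    frame = np.asarray(frame, dtype=float)
    spec = np.fft.fft(frame)
    freqs = np.fft.fftfreq(len(frame), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return np.fft.ifft(2.0 * spec * mask)

def fepstrum_band(frame, fs, f_lo, f_hi, decim=10, n_coef=12):
    """Low-order DCT coefficients of the log AM envelope of one band."""
    envelope = np.abs(narrowband_analytic(frame, fs, f_lo, f_hi))  # AM signal
    # Plain subsampling; assumes the envelope bandwidth is below fs / (2 * decim).
    log_env = np.log(envelope + 1e-8)[::decim]
    m = len(log_env)
    k = np.arange(n_coef)[:, None]
    n = np.arange(m)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * m))  # unnormalised DCT-II
    return dct_basis @ log_env
```

In a full front end, the per-band coefficients over a frame of 80-100 ms would be concatenated across all narrow bands and used alongside the MFCC vector of the same frame.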

1.4.2 Speech signal enhancement and noise robust features

1. Least squares filtering of speech signals: We have developed an adaptive filtering technique that enhances a speech signal corrupted by additive broad-band noise. We have analyzed the behavior of the least squares filter (LeSF) operating on a noisy speech signal.

Speech analysis is performed on a block by block basis and a LeSF filter is designed for each block of signal, using a computationally efficient algorithm. Unlike the classical spectral subtraction and Wiener filtering techniques that require the noise to be stationary, the proposed LeSF technique makes no such assumption as this technique works on a block by block basis.

Moreover, the proposed technique does not require a reference noise signal, as is required in Wiener filtering. This makes it a highly practical signal enhancement technique, since in realistic scenarios the reference noise channel is usually not available.

The key contribution in the approach proposed in this work is that we relax the assumption of the input signal being stationary. The method of least squares may be viewed as an alternative to Wiener filter theory (Haykin, 1993). Wiener filters are derived from ensemble averages and they require good estimates of the clean signal power spectral density (PSD) as well as the noise PSD. Consequently, one filter (optimum in a probabilistic sense) is obtained for all realizations of the operational environment, assumed to be wide-sense stationary. On the other hand, the method of least squares is deterministic in approach. Specifically, it involves the use of time averages over a block of data, with the result that the filter depends on the number of samples used in the computation. Moreover, the method of least squares does not require the noise PSD estimate. Therefore the input signal is blocked into frames and we analyze an L-weight least squares filter (LeSF), estimated on each frame which consists of N samples of the input signal. In this work, analytic expressions for the weights and the output of the LeSF are derived as a function of the block length and the signal SNR computed over the corresponding block.

Unlike other feature level noise robustness techniques, the LeSF filter enhances the signal waveform itself, and an MFCC feature computed over this enhanced signal leads to a significant improvement in noisy speech recognition accuracy as compared to competing feature level noise robustness techniques such as RASTA-PLP and spectral subtraction. In distributed speech recognition (DSR), in the context of mobile telephony and voice-over IP systems, it may be desirable not only to have a noise robust feature extraction algorithm but also to enhance the noisy speech signal for the human listener. Therefore, a signal enhancement technique that also leads to noise robust ASR is desirable. The proposed LeSF filtering technique falls into this category: not only does it enhance the signal, but a simple MFCC feature computed over this enhanced signal also leads to significant ASR accuracy improvements in several realistic noisy conditions.
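The block-wise least squares filtering described in this item can be sketched as follows. The sketch assumes a delayed linear-predictor structure (the filter estimates the current sample from L past samples, delayed by P), which is one common way to enhance a signal without a reference noise channel; the source does not spell out the filter structure, so this structure, the names `lesf_block` and `lesf_enhance`, the delay `P`, the non-overlapping blocks and the default values are all assumptions made for the example.

```python
import numpy as np

def lesf_block(x, L=20, P=1):
    """Fit an L-weight least squares filter on one block and apply it.

    The weights minimise, over the block itself (time averages, no ensemble
    statistics), the squared error between x[n] and a weighted sum of the
    delayed samples x[n-P], ..., x[n-P-L+1].
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    start = P + L - 1                       # first index with a full history
    # Column k holds x[n - P - k] for n = start .. N-1.
    A = np.column_stack([x[start - P - k: N - P - k] for k in range(L)])
    d = x[start:]
    w, *_ = np.linalg.lstsq(A, d, rcond=None)
    y = x.copy()                            # leading samples are left unfiltered
    y[start:] = A @ w
    return y, w

def lesf_enhance(x, N=500, L=20, P=1):
    """Apply an independently estimated filter to each block of N samples."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for s in range(0, len(x), N):
        block = x[s:s + N]
        if len(block) < 2 * (L + P):        # too short to estimate L weights
            out[s:s + N] = block
        else:
            out[s:s + N] = lesf_block(block, L, P)[0]
    return out
```

Because the weights are re-estimated on every block, no stationarity assumption is made across blocks, and no separate noise PSD estimate or reference noise signal is needed. MFCC features would then be computed on the output of `lesf_enhance` in the usual way.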

2. Cepstral desensitization to noise and its modulations as a noise robust feature: As is well known, in the presence of commonly encountered additive noise levels, the formants are less affected than the spectral “valleys”, which exhibit spurious ripples. The DCT of a log Mel-filter bank spectrum (logMelFBS), commonly known as the MFCC (Davis and Mermelstein, 1980) feature vector, is sensitive to ripples in the spectral valleys which, otherwise, do not characterize the speech sounds. This is one of the reasons for the poor performance of MFCC features in additive noise conditions. Observing that the higher amplitude portions (such as formants) of a spectrum are relatively less affected by noise, Paliwal proposed spectral subband centroids (SSC) as features (Paliwal, 1998; Chen et al., 2004). In this work, we analytically show that exponentiating the logMelFBS can decrease the sensitivity of the cepstra to the spurious perturbations in the logMelFBS valleys as compared to the peaks.
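The sensitivity argument above can be checked numerically with a small sketch. It assumes that "exponentiating" means raising each (positive) logMelFBS value to a power alpha greater than 1 before the DCT; this interpretation, the toy spectrum, and the names `cepstrum`, `cepstral_shift` and `alpha` are assumptions made for the example and are not taken from the thesis. Under this assumption the derivative of s**alpha is alpha * s**(alpha - 1), which is smaller at low-valued valleys than at high-valued peaks.

```python
import numpy as np

def dct_matrix(n_coef, n_bands):
    """Unnormalised DCT-II basis, one row per cepstral coefficient."""
    k = np.arange(n_coef)[:, None]
    i = np.arange(n_bands)[None, :]
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n_bands))

def cepstrum(log_mel, alpha=1.0, n_coef=13):
    """DCT of the logMelFBS raised to the power alpha (alpha=1 is plain MFCC).

    Assumes all logMelFBS values are positive.
    """
    s = np.asarray(log_mel, dtype=float)
    return dct_matrix(n_coef, len(s)) @ (s ** alpha)

def cepstral_shift(log_mel, band, delta=0.1, alpha=1.0):
    """Norm of the cepstral change caused by a ripple of size delta in one band."""
    perturbed = np.array(log_mel, dtype=float)
    perturbed[band] += delta
    return np.linalg.norm(cepstrum(perturbed, alpha) - cepstrum(log_mel, alpha))

# Toy logMelFBS with peaks near 10 and valleys near 3 (hypothetical values).
log_mel = np.array([3.0, 4.0, 10.0, 9.0, 3.5, 3.0, 8.0, 9.5, 4.0, 3.0,
                    3.2, 7.0, 8.0, 3.5, 3.0, 3.1, 3.0, 3.2, 3.0, 3.1])
peak_band, valley_band = 2, 5

for alpha in (1.0, 2.0):
    ratio = (cepstral_shift(log_mel, valley_band, alpha=alpha)
             / cepstral_shift(log_mel, peak_band, alpha=alpha))
    print(f"alpha={alpha}: valley/peak cepstral sensitivity ratio = {ratio:.3f}")
```

The printed ratio for alpha = 2 is smaller than for alpha = 1, meaning that the same ripple placed in a valley moves the cepstrum less, relative to a ripple at a peak, once the logMelFBS has been exponentiated.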

Lim has proposed the use of spectral root homomorphic deconvolution system (SRDS) (Lim,


From the document: Novel speech processing techniques for robust automatic speech recognition (pages 20-23)