Conclusions and perspectives

This thesis presented exploratory research on the application of a novel non-linear multi-scale formalism, called the MMF, to speech analysis. In this final chapter, we briefly summarize the results of the thesis and draw conclusions about the work, pointing out the future research directions that these results suggest.

10.1 Summary and discussion

We have presented in this thesis our step-by-step progress in studying the relevance of the MMF to speech analysis. At first glance, the MMF seemed a suitable candidate for such analysis, as it relies on local geometric parameters that give precise access to the underlying multi-scale organization of a complex signal without relying on stationarity assumptions. This is an important feature of the formalism: it relaxes one limiting assumption of many other linear and non-linear speech analysis approaches, namely time invariance.

We first ran a set of preliminary experiments investigating the validity of the general theoretical conditions for the applicability of the formalism to speech analysis. Despite some practical issues hindering the exact evaluation of these conditions, we observed that we can confidently conclude that local scaling behavior exists for most points in the domain of the speech signal. Hence, precise and meaningful estimates of the Singularity Exponents (SEs), the key parameters characterizing multi-scale complexity, can be achieved.
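The local scaling behavior mentioned above is usually stated in the MMF literature as a power law in the scale r. The form below is a sketch in assumed notation (s for the signal, Γ_r for a scale-dependent functional, α(t) for a scale-independent amplitude), since this chapter does not restate the definition:

```latex
\Gamma_r\big(s(t)\big) = \alpha(t)\, r^{h(t)} + o\!\left(r^{h(t)}\right), \qquad r \to 0
```

Here h(t) is the singularity exponent at time t. Under this convention, smaller values of h(t) mark sharper, less predictable transitions.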

Consequently, the study of the applicability of the MMF to speech analysis was initiated. To that aim, our strategy in this thesis was to enrich our understanding of the MMF as applied to the speech signal through the development of practical speech analysis applications. We started with a simple adaptation of the formalism to the speech signal, treated as a general 1-D signal, and through the implementation of those applications we updated some of the computational details of the formalism according to the particularities of this signal.

The first observation regarding the informativeness of the SEs for speech analysis was made on the time evolution of the local distribution of SEs, which led us to develop a very simple yet highly accurate phonetic segmentation algorithm. The particular property of the algorithm was its remarkably high resolution in the detection of phoneme boundaries. This accuracy can be attributed to the geometric nature of the formalism.

tel-00821896, version 1 - 13 May 2013

This first successful experiment motivated us to study another major component of the MMF: the MSM, which has been shown to be the most informative subset of points in the domain of complex signals. Indeed, it has been shown that a universal reconstruction kernel can reconstruct the whole complex signal from its gradient field restricted to the MSM. However, the reconstruction kernel was originally developed for 2-D signals (using known spectral properties of natural images), and its direct adaptation to 1-D signals results in a simple integration over the gradient field restricted to the MSM.

It is noteworthy that the most powerful definition of SEs in the case of natural images is derived by local evaluation of the above-mentioned kernel. This in turn relates the resulting SEs to the local concept of predictability. However, this kernel reduces to an integration operation for 1-D signals, and hence its local evaluation for the estimation of SEs simplifies to the evaluation of finite differences. The proper definition of SEs for speech analysis therefore was (and still is) an open question, to which we provided a pragmatic answer in chapter 6. We started the study of the MSM of the speech signal by viewing it as a general 1-D signal: to assess the reconstructability of the signal from the MSM, we employed a generic classical reconstruction technique that can reconstruct any band-limited signal from a set of irregularly spaced samples (such as the MSM). We thus used several definitions of SEs (based on different definitions of the scale-dependent functional Γ_r) to form the MSM and evaluated the reconstruction quality using this generic reconstructor. We then introduced a new scale-dependent functional for SE estimation (the two-sided variations functional), which resulted in the highest reconstruction quality among the functionals considered. We used this functional in the remaining experiments of this thesis.
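As a rough illustration of the ideas in this paragraph, the sketch below estimates one exponent per sample by log-log regression over a few scales and then keeps the samples with the smallest exponents as an MSM. The functional used here (the mean absolute finite difference in a window of radius r) is a placeholder chosen for simplicity; it is not the two-sided variations functional of chapter 6, and the function names, the radii and the `density` parameter are all assumptions.

```python
import numpy as np

def singularity_exponents(signal, radii=(1, 2, 4, 8)):
    """Estimate one exponent h(t) per sample by log-log regression.

    Assumed functional (illustrative only): the mean absolute first
    difference of the signal in a window of radius r centred on t.
    """
    x = np.asarray(signal, dtype=float)
    grad = np.abs(np.diff(x, prepend=x[0]))       # finite-difference gradient
    log_r = np.log(np.asarray(radii, dtype=float))
    log_gamma = np.empty((len(radii), x.size))
    for i, r in enumerate(radii):
        kernel = np.ones(2 * r + 1) / (2 * r + 1)  # box window of radius r
        gamma = np.convolve(grad, kernel, mode="same")
        log_gamma[i] = np.log(gamma + 1e-12)       # small offset avoids log(0)
    # Least-squares slope of log Gamma_r(t) against log r, for every t at once
    centred = log_r - log_r.mean()
    return centred @ log_gamma / (centred @ centred)

def most_singular_manifold(h, density=0.1):
    """Return the sorted indices of the `density` fraction of smallest exponents."""
    k = max(1, int(round(density * h.size)))
    return np.sort(np.argsort(h)[:k])
```

For a signal containing a single step, the exponents around the step are negative while those in flat regions are close to zero, so the step location is the first point to enter the MSM as `density` grows.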

Our next step was to study the potential physical meaning of the MSM. By comparing the MSM with the locations of the GCIs extracted from the electroglottograph (EGG) signal, we observed that the MSM can give access to the time instants of significant excitations of the vocal tract. Consequently, we developed a GCI detector using the MSM, with quite interesting results: in clean speech the algorithm was almost as reliable as the state of the art while being more accurate. More importantly, the algorithm showed higher robustness in the presence of different types of noise.

As a direct application of our GCI detection algorithm, we then tackled the problem of sparse Linear Prediction Analysis (LPA). The aim of sparse LPA is to achieve a spectral representation whose residuals are sparse. Our strategy was to down-weight the l2-norm of the LP error at the GCIs, so that the optimizer can relax the error at those points and concentrate on minimizing the error at the remaining points.

This provided an efficient closed-form solution for the interesting problem of sparse LPA. Apart from the numerical motivations for this solution, this work can be seen as an effort to overcome one major invalid assumption of the classical LPA framework: the independent functioning of the vocal tract and the excitation source. With the proposed strategy we discard the time instants where the coupling of the source and the filter is maximized (the GCIs, where the significant excitations of the vocal tract take place). It is noteworthy that our GCI algorithm has two properties that make it practically suitable for this weighted l2-norm solution. Firstly, the computational burden of the algorithm is low (this justifies the use of the whole solution instead of computationally complex l1-norm minimizers). Secondly, it does not rely on any model of the speech signal and it extracts the residuals directly from the signal itself (some GCI detection methods themselves rely on the residuals, and hence are not suitable for modifying the residuals).
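The down-weighting strategy described above can be sketched as a weighted least-squares problem whose normal equations are solved in closed form. The code below is a minimal illustration under stated assumptions: the weighting function (a constant `gci_weight` inside a window of `halfwidth` samples around each GCI), the covariance-style formulation and all names are hypothetical and are not taken from chapter 8. In particular, this plain formulation does not by itself guarantee a stable prediction filter.

```python
import numpy as np

def weighted_lpc(signal, order, gci_indices, gci_weight=0.1, halfwidth=2):
    """Linear prediction with the squared error down-weighted around GCIs.

    Returns the predictor coefficients a_1..a_order (so that
    x[t] is predicted by sum_k a_k * x[t-k]) and the prediction residual.
    """
    x = np.asarray(signal, dtype=float)
    n = np.arange(order, x.size)
    # Row for time t holds the previous samples x[t-1], ..., x[t-order]
    past = np.stack([x[n - k] for k in range(1, order + 1)], axis=1)
    target = x[n]

    # Assumed weighting: 1 everywhere, gci_weight in a window around each GCI
    w = np.ones(x.size)
    for g in gci_indices:
        lo, hi = max(0, g - halfwidth), min(x.size, g + halfwidth + 1)
        w[lo:hi] = gci_weight
    w = w[n]

    # Closed-form solution of the weighted normal equations
    A = past.T @ (w[:, None] * past)
    b = past.T @ (w * target)
    coeffs = np.linalg.solve(A, b)
    residual = target - past @ coeffs
    return coeffs, residual
```

On a synthetic all-pole signal driven by a few isolated impulses, setting the weight at the impulse locations to zero lets the solver fit the filter from the unexcited samples only, and the residual then consists of the impulses alone.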

We finally studied the use of larger MSMs to estimate the excitation source of the speech signal. With the GCI algorithm, the size of the MSM is restricted to one location per glottal cycle. However, in the context of multi-pulse coding, more than one location per cycle is required in order to reconstruct the speech signal with good perceptual quality. We thus used the MSM to determine the locations of the excitation sequence. We then minimized the spectrally weighted reconstruction error (as in the classical approach) to find the pulse amplitudes. We showed that we can achieve almost the same level of perceptual quality as the classical techniques, but with lower computational complexity. The good perceptual quality may indicate that the value of the SE indeed provides a hierarchical ranking of speech samples: the smallest ones correspond to the GCIs, while the next smallest ones correspond to the secondary significant excitations of the vocal tract.

Care must be taken, however, as the high perceptual quality of the MPE synthesizer might instead be attributed to the explicit minimization of the perceptually weighted reconstruction error.
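Once the pulse locations are fixed (here, by the MSM), finding the amplitudes that minimize a squared reconstruction error is a linear least-squares problem. The sketch below illustrates that step only. It assumes that `target` and `impulse_response` have already been passed through whatever perceptual weighting filter is in use, and the function name and arguments are hypothetical rather than the implementation of chapter 9.

```python
import numpy as np

def pulse_amplitudes(target, impulse_response, positions):
    """Least-squares pulse amplitudes for fixed pulse positions.

    Each column of H is the synthesis-filter impulse response shifted to one
    pulse position, so H @ amps is the signal synthesized from the pulses.
    """
    s = np.asarray(target, dtype=float)
    h = np.asarray(impulse_response, dtype=float)
    H = np.zeros((s.size, len(positions)))
    for j, p in enumerate(positions):
        m = min(h.size, s.size - p)    # truncate the response at the frame end
        H[p:p + m, j] = h[:m]
    amps, *_ = np.linalg.lstsq(H, s, rcond=None)
    return amps, H @ amps
```

Because the positions are given rather than searched for, a single linear solve per frame suffices, which is consistent with the lower computational complexity reported above.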

Overall, in this thesis we have demonstrated that the MMF provides precise access to important events in the domain of the speech signal at different levels: the phoneme boundaries, the GCIs and the locations of the excitation sequence of the speech signal.

We have shown the usefulness of two major components of the MMF (the SEs and the MSM) for speech analysis. Our approach to the other important aspect of the MMF, the reconstruction of the signal from its MSM, was rather pragmatic: as the direct adaptation of the 2-D reconstruction kernel loses information in the case of the speech signal, we used two alternatives. First, we employed a generic reconstructor called the Sauer-Allebach algorithm; then, for the MPE coding example in chapter 9, we used a classical LP-based synthesizer.

10.2 Perspectives

A number of potentially promising research directions follow from the results of this thesis. In the short term, improvements or more sophisticated extensions can readily be anticipated for the applications that have been developed in their simplest form in this thesis:

• the high geometrical resolution of the text-independent phonetic segmentation algorithm in chapter 5 suggests the possibility of adapting it to text-dependent phonetic segmentation, using statistical models of the speech signal;

• the GCI detector introduced in chapter 7 is particularly efficient and sufficiently reliable. This suggests investigating its applicability to pitch-synchronous speech processing problems. This may serve as a more concrete (and subjective) demonstration of the ability of the algorithm to detect meaningful events in the speech signal;

• the efficient GCI detector of chapter 7, together with the closed-form stable solution for sparse linear prediction analysis in chapter 8, opens the way to investigating the usefulness of such a sparse representation in high-level practical speech processing applications such as speech recognition, speaker identification and synthesis. In particular, the accurate estimates of the vocal tract filter may have very high potential in text-to-speech synthesis applications, where proper modeling of the excitation source and the synthesis filter is a serious concern. The first interesting subject to investigate is the possible improvement of synthesis quality obtained by using the residuals of this sparse representation to model the excitation source in HMM-based speech synthesis. This synthesizer, in its basic form, uses a very simple model of the excitation source: one pulse per period. This is the highest level of sparsity, and it is far from reality under classical minimum-variance modeling. It is therefore reasonable to expect that a representation whose residuals are effectively sparser may improve the synthesis quality;

• another interesting subject may arise from considering the results of chapters 8 and 9 as a whole. In chapter 8, we showed how weighting the objective function at the GCIs may help the optimizer to relax the prediction error at those points and to focus on minimizing the error at the remaining time instants. In chapter 9, we showed how the MSM can provide direct access to a subset of points that are of interest for multi-pulse coding of speech. However, we used the classical synthesizer for the reconstruction of the speech, which is found through the minimization of the l2-norm error. One can fairly expect that a similar weighting of the l2-norm error, but on the MSM (not just on the GCIs), may improve quality. Of course, there are practical issues to overcome, such as instabilities or the proper design of the weighting function. Such a combination would result in a unified coding framework based on the MMF.

Apart from the perspectives drawn from the applications introduced in this thesis, a number of other applications might be considered in which the MMF may give rise to appreciable results:

• the development of a new feature set for classification problems in the context of statistical speech processing (recognition, identification and synthesis tasks). Our preliminary observations show that such a development is indeed possible. For instance, we have observed that the time evolution of the time-conditioned histogram of singularity exponents (the one shown in Figure 5.1 in chapter 5) exhibits consistent patterns for different realizations of the same phoneme. As such, this histogram can readily be considered as a feature for classification tasks;

• the successful detection of the GCIs and of the multi-pulse excitation sequence also suggests investigating the use of the MMF for the detection of Glottal Opening Instants, which are much more difficult to detect than GCIs.

All these potential extensions of the present work may help to establish the MMF as a powerful tool for the analysis of the speech signal.

However, another important track for future research is to follow a strategy similar to the one taken for 2-D natural images in order to attain more powerful definitions of SEs. As mentioned in section 10.1, our strategy regarding the reconstruction of speech from its MSM was purely pragmatic. It nevertheless provided us with interesting results. Noting, however, that the most powerful definition of SEs for 2-D signals is obtained by local evaluation of an appropriate reconstruction kernel, one can fairly expect that the definition of appropriate reconstruction kernels for speech signals, and the consequent definition of the corresponding scaling exponents, would greatly contribute to improving the results of this thesis.
