
Text Independent phonetic segmentation using the MMF


These observations suggest that the SEs indeed convey valuable information about the local dynamics of speech, which can even be used for phoneme-level classification. For our purposes, however, the considerable change in the distribution of SE between neighboring phonemes can be readily used for the phonetic segmentation task.

5.3 MMF based phonetic segmentation

To develop an automatic segmentation method which uses the properties of SEs mentioned in section 5.2, we employ a two-step procedure:

1. First, we define a quantity representative of the distinctive properties (section 5.3.1) and we develop a simple and efficient method for detecting changes in this quantity (section 5.3.2).

2. We apply the above method to the signal itself and to its low-passed version, to make a preliminary list of phoneme boundary candidates. These candidates are then refined by dynamic windowing and a Log-Likelihood Ratio Test (LLRT) to make the final decision (section 5.3.3).

This 2-step approach is similar to that of traditional segmentation methods where there is a boundary pre-selection followed by statistical hypothesis tests to make the final decision [1].

5.3.1 The Accumulative function ACC

To define a simple measure that quantitatively represents the changes in the distribution of SE between neighboring phonemes, we use the simplest descriptive statistic, the average of SE. We consider h(t) as a random variable whose average changes between adjacent phonemes, and we search for the locations of changes in the local averages of h(t) as the candidate phoneme boundaries. However, since the SE estimates are available at the finest resolution, and this resolution is worth preserving for the phonetic segmentation task, we want to avoid any windowing when estimating the averages. To do so, we use the primitive of h(t) as the representative quantity. Indeed, inside the boundaries of each phoneme, the slope of this quantity is an estimate of the local average of h(t). Formally, the definition reads:

$$\mathrm{ACC}(t) = \int_{t_0}^{t} d\tau \, h(\tau) \tag{5.9}$$

tel-00821896, version 1 - 13 May 2013

The resulting functional is plotted in Fig. 5.2. This time-variable functional is detrended to make the values easier to read. Just as we expected, this new functional reveals the changes in distribution more precisely. Indeed, inside each phoneme the functional ACC is almost linear (if we neglect the small-scale fluctuations). Moreover, there is a clear change in the slope at the phoneme boundaries. These slope changes can even identify the boundaries between extremely short phonemes, such as stops. Extensive observations over different sentences confirm this behavior and thus the strength of the functional proposed in Eq. (5.9).
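To make the link between the slope of ACC and the local average of h(t) concrete, here is a minimal sketch of a discrete counterpart of Eq. (5.9). It assumes unit sample spacing, so the integral becomes a running sum; the function name `acc` and the toy exponent values are illustrative assumptions, not taken from the thesis.

```python
from itertools import accumulate


def acc(h):
    """Discrete counterpart of Eq. (5.9): running sum of h (unit sample spacing assumed)."""
    return list(accumulate(h))


# Toy exponents (made-up numbers): two "phonemes" with different local averages.
h = [0.25] * 4 + [1.0] * 4
a = acc(h)

# First differences of ACC give its slope; inside each segment the slope
# equals the local average of h, and it jumps at the segment boundary.
slopes = [right - left for left, right in zip(a[:-1], a[1:])]
print(a)
print(slopes)
```

In this toy case the slope is constant inside each segment and changes exactly where the local average of h changes, which is the behavior the piecewise linear fit is meant to detect.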

The next step is to develop an automatic method for detecting changes in the slope of the ACC. A very simple solution is to fit a piecewise linear curve to the ACC and take the break-points as the candidates for a change in the distribution of SE, i.e. the phoneme boundary candidates.

5.3.2 Piece-wise linear approximation of ACC

Assuming that, for a speech signal of length N, ACC has K significant break-points, the problem of finding these break-points can be formalized as the following optimization problem (note that K is unknown):

• find $\mathrm{LA}_m(\mathrm{ACC}, 1, N)$ such that $m$ is minimized and $E_m^{1 \to N} < \epsilon$ for a given error threshold $\epsilon$,

where $\mathrm{LA}_m(\mathrm{ACC}, 1, N)$ denotes the m-piece linear approximation of the curve ACC between the time indices 1 and N, and $E_m^{1 \to N}$ is the corresponding mean squared approximation error. This optimization problem is addressed in [72] for the denoising of piece-wise constant functions, where it is argued that when $K \ll N$, a greedy search for jump placements (called the jump penalization method) may yield a more efficient and more accurate approximation than other solvers. However, the computational complexity of this method still increases with K, which is undesirable for automatic phonetic segmentation since the task deals with large speech databases.
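As an illustration of the error term in this problem, the sketch below computes the mean squared error of a single least-squares line over a range of samples. The helper name `line_fit_mse`, the 0-based indexing (the text uses indices 1 to N) and the toy curve are assumptions made for the example only.

```python
def line_fit_mse(y, start, end):
    """Mean squared error of the least-squares line through y[start..end] (inclusive)."""
    n = end - start + 1
    if n < 3:
        return 0.0  # one or two points are always fitted exactly
    xs = range(start, end + 1)
    ys = y[start:end + 1]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((v - (slope * x + intercept)) ** 2 for x, v in zip(xs, ys)) / n


# Toy curve (made-up numbers): slope 1 up to index 4, slope 2 afterwards.
y = [0, 1, 2, 3, 4, 6, 8, 10, 12]
print(line_fit_mse(y, 0, 4))  # single piece, left of the kink
print(line_fit_mse(y, 4, 8))  # single piece, right of the kink
print(line_fit_mse(y, 0, 8))  # one line across the kink
```

A single line fits each side of the kink with essentially zero error but not the whole curve, so the smallest m that satisfies the error constraint here is 2.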

We develop a similar but more efficient solver for the above optimization problem, whose computational complexity is independent of K. Our optimization method is motivated by the dynamic programming approach to the Piece-wise Linear Approximation (PLA) problem described in [58]. There, the problem is solved by recursively solving easier sub-problems of the type $\mathrm{LA}_2(\mathrm{ACC}, 1, N)$; this approach also suffers from a computational complexity that increases with K (it requires $\log_2 K$ passes over the N samples).


Figure 5.2: Top: the speech signal "Masquerade parties tax one’s imagination" from the TIMIT database. Bottom: the ACC functional of Eq. (5.9). The functional is detrended by subtracting its global mean.


Our efficient implementation replaces the m-point approximation problem $\mathrm{LA}_m(\mathrm{ACC}, 1, N)$ with sequential solutions of 1-point approximations $\mathrm{LA}_1(\mathrm{ACC}, t_i, t_{i+1})$, where $i \in \{1, \dots, m-1\}$ and $t_i < t_{i+1}$. The algorithm starts with $t_1 = 1$, and at each iteration $t_{i+1}$ is determined by the procedure explained in Alg. 1. There, $c_i$ stands for the i-th phoneme boundary candidate. The particularity of the method is that its computational complexity is independent of the number of break-points (K): it requires only 2 passes over the N data samples. The detected candidates ($c_i$) are rejected if they are located in silence, i.e. where the average power in a 32 ms window is less than −30 dB.

It is important to note that at each iteration, $t_{i+1}$ is determined through a greedy minimization (Alg. 1, line 7) and not through a thresholding operation. This greedy minimization decreases the impact of the threshold (Alg. 1, line 3). The experimental results will show that the algorithm is not very sensitive to the choice of this threshold. Insensitivity to the choice of algorithm parameters is an important property for text-independent phonetic segmentation, where no data is available for training the parameters.
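Alg. 1 itself is not reproduced in this section, so the following is not the author's algorithm. It is a generic sequential sketch in the same spirit, under stated assumptions: a segment is grown from the current start point until the error of a single line exceeds a threshold, and the break-point is then chosen by minimizing the total squared error of two lines rather than by taking the threshold crossing itself. The names `line_fit_sse`, `sequential_breakpoints` and `eps` are hypothetical, indices are 0-based, and this naive version refits lines repeatedly, so it does not have the two-pass cost stated above.

```python
def line_fit_sse(y, start, end):
    """Sum of squared residuals of the least-squares line through y[start..end]."""
    n = end - start + 1
    if n < 3:
        return 0.0  # one or two points are always fitted exactly
    xs = range(start, end + 1)
    ys = y[start:end + 1]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((v - (slope * x + intercept)) ** 2 for x, v in zip(xs, ys))


def sequential_breakpoints(y, eps):
    """Generic sequential break-point search (an illustrative stand-in, not Alg. 1)."""
    breaks = []
    anchor, end = 0, 2
    while end < len(y):
        n = end - anchor + 1
        if line_fit_sse(y, anchor, end) / n <= eps:
            end += 1  # one line still fits: keep growing the segment
            continue
        # One line no longer fits: pick the split that minimizes the total
        # squared error of two lines (greedy minimization, not thresholding).
        best = min(
            range(anchor + 1, end),
            key=lambda b: line_fit_sse(y, anchor, b) + line_fit_sse(y, b, end),
        )
        breaks.append(best)
        anchor, end = best, best + 2  # restart from the new break-point
    return breaks


# Toy curve (made-up numbers): slope 1 up to index 4, slope 2 afterwards.
y = [0, 1, 2, 3, 4, 6, 8, 10, 12]
print(sequential_breakpoints(y, eps=1e-6))
```

In this sketch the threshold only decides when to look for a break-point, while the location comes from the minimization, which is one way to see why the result can be fairly insensitive to the exact threshold value.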

Alg. 1 is very simple to implement and quite fast in practice (as it requires only 2 passes over the N data samples). It is already a phonetic segmentation algorithm in its own right, and it provides very good performance in detecting phoneme boundaries, as shown by the extensive experiments in section 5.4 (where this simple algorithm is denoted SE-ACC). In the following subsection, however, we develop a more accurate segmentation algorithm by using this simple change criterion inside a classical decision-making procedure.
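The silence rule applied to the detected candidates (average power in a 32 ms window below −30 dB) can be sketched as follows. The text does not say how the window is positioned or what the 0 dB reference is, so this example assumes a window centered on the candidate and power measured relative to a hypothetical `ref_power` of 1.0; the function names are illustrative as well.

```python
import math


def is_silence(signal, center, sample_rate, window_s=0.032,
               threshold_db=-30.0, ref_power=1.0):
    """True if the average power around `center` is below `threshold_db`.

    Assumptions: the window is centered on the candidate and power is
    expressed in dB relative to `ref_power`.
    """
    half = int(round(window_s * sample_rate / 2))
    window = signal[max(0, center - half):min(len(signal), center + half)]
    if not window:
        return True  # nothing to measure: treat as silence
    power = sum(x * x for x in window) / len(window)
    if power == 0.0:
        return True  # avoid log10(0)
    return 10.0 * math.log10(power / ref_power) < threshold_db


def drop_silent_candidates(candidates, signal, sample_rate):
    """Keep only the boundary candidates that do not fall in silence."""
    return [c for c in candidates if not is_silence(signal, c, sample_rate)]


# Toy signal (made-up numbers) at 16 kHz: a quiet stretch, then a loud one.
sample_rate = 16000
signal = [0.001] * 1000 + [0.5] * 1000
kept = drop_silent_candidates([500, 1500], signal, sample_rate)
print(kept)
```

With these toy values the quiet stretch sits near −60 dB and the loud one near −6 dB, so only the candidate in the loud stretch survives.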
