MONO-MICROPHONE BLIND AUDIO SOURCE SEPARATION USING EM-KALMAN FILTERS AND SHORT+LONG TERM AR MODELING
Siouar Bensaïd, Antony Schutz, Dirk Slock
Eurecom Institute, 2229 route des Crêtes, B.P. 193, 06904 Sophia Antipolis Cedex, FRANCE
e-mail: {siouar.bensaid, antony.schutz, dirk.slock}@eurecom.fr
ABSTRACT
Blind source separation (BSS) arises in a variety of speech processing tasks such as speech enhancement, speaker diarization and identification. Methods for BSS generally rely on several observations of the same recording. Single-microphone analysis is the worst underdetermined case, but it is also the most realistic one. In our approach, the autoregressive structure (short-term prediction) and the periodic signature (long-term prediction) of the voiced speech signal are jointly modeled. The filter parameters are extracted using a combined version of the EM algorithm with the Rauch-Tung-Striebel optimal smoother, while the fixed-lag Kalman smoother algorithm is used for the initialization.
Index Terms— Blind source extraction, mono-microphone analysis, short+long term prediction, EM algorithm.
1. CONTACT AUTHOR Antony Schutz
antony.schutz@eurecom.fr.
EURECOM
Mobile Communication Department
2229 Route des Crêtes BP 193, 06904 Sophia Antipolis Cedex, France
Tel: +33 (0)4 93 00 81 98 - Fax: +33 (0)4 93 00 82 00
2. EXTENDED SUMMARY
(EURECOM's research is partially supported by its industrial members: BMW Group Research And Technology BMW Group Company, Bouygues Telecom, Cisco Systems, France Telecom, Hitachi, SFR, Sharp, STMicroelectronics, Swisscom, Thales. The research work leading to this paper has also been partially supported by the European Commission under contract FP6-027026.)

We consider the problem of estimating an unknown number of mixed Gaussian sources. We use a voice production model in which an excitation signal is filtered by a short-term prediction filter followed by a long-term one; the model is mathematically formulated as
$$y_t = \sum_{k=1}^{c} x_{k,t} + n_t \qquad (1)$$
$$x_{k,t} = \sum_{n=1}^{p_k} a_{k,n}\, x_{k,t-n} + \tilde{x}_{k,t}$$
$$\tilde{x}_{k,t} = b_k\, \tilde{x}_{k,t-T_k} + e_{k,t}$$
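As a concrete illustration, the production model above can be simulated directly. The sketch below synthesizes one voiced source; all parameter values (AR coefficients, pitch period, variances) are hypothetical, and the pitch period is taken integer here for simplicity, although the paper allows a fractional T_k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for one source k (illustration only).
a = np.array([0.6, -0.3])   # short-term AR coefficients a_{k,1..p_k}
b = 0.9                     # long-term prediction coefficient b_k
T = 80                      # pitch period T_k (integer for simplicity)
rho = 1.0                   # innovation variance rho_k
M = 1000                    # number of samples

e = rng.normal(0.0, np.sqrt(rho), M)   # Gaussian innovation e_{k,t}
x_tilde = np.zeros(M)                  # long-term residual \tilde{x}_{k,t}
x = np.zeros(M)                        # source x_{k,t}
for t in range(M):
    # long-term (pitch) recursion: \tilde{x}_t = b * \tilde{x}_{t-T} + e_t
    x_tilde[t] = (b * x_tilde[t - T] if t >= T else 0.0) + e[t]
    # short-term AR recursion: x_t = sum_n a_n * x_{t-n} + \tilde{x}_t
    past = sum(a[n] * x[t - n - 1] for n in range(len(a)) if t - n - 1 >= 0)
    x[t] = past + x_tilde[t]
```

With these (stable) coefficients the output is a quasi-periodic signal whose period shows up as peaks of the autocorrelation function at multiples of T.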
where y_t is the scalar observation and x_{k,t} is source k at time t; a_{k,n} is the n-th short-term coefficient of source k, while \tilde{x}_{k,t} is the short-term prediction error. b_k is the long-term prediction coefficient of the k-th source, T_k its period, which is not necessarily an integer, and \{e_{k,t}\} are mutually uncorrelated Gaussian innovation sequences with variance \rho_k. \{n_t\} is a white Gaussian process with variance \sigma_n^2. We aim to separate these sources jointly by estimating the short- and long-term autoregressive (AR) coefficients of each one. We proceed as follows: first, we coarsely estimate the different pitches by extending a classical monopitch estimation approach to the multipitch case; then we use the EM algorithm to estimate the different parameters of the problem. For the first step, it is common in speech processing to use the autocorrelation method (ACM) for pitch estimation, which consists in finding the first highest peak of the autocorrelation function. But when several pitches are present, ACM is a poor estimator, because all the periodicities present in the signal contribute and generate both proper peaks and interaction peaks. To estimate the different periodicities in the ACM, we propose to compare the ratio R_{2T}/R_T to a threshold for a set of candidate periods T, where R is the autocorrelation function. To achieve the second step (separation), we derive a state-space model formulated as follows:
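The ratio test for the coarse multipitch stage can be sketched as follows. The threshold value and the exact decision rule are illustrative assumptions, since the summary does not specify them; the idea is that a genuine period T produces comparable autocorrelation peaks at T and 2T, while an interaction peak does not.

```python
import numpy as np

def detect_periods(y, candidate_periods, threshold=0.5):
    """Coarse multipitch detection from the autocorrelation function.

    For each candidate period T, compare R_{2T}/R_T against a threshold
    (hypothetical decision rule; the paper's summary does not give it).
    """
    y = y - np.mean(y)
    R = np.correlate(y, y, mode="full")[len(y) - 1:]  # R[tau] for tau >= 0
    R = R / R[0]                                      # normalize by R[0]
    detected = []
    for T in candidate_periods:
        if 2 * T < len(R) and R[T] > 0:
            if R[2 * T] / R[T] > threshold:
                detected.append(T)
    return detected
```

For a single sinusoid of period 50 samples, `detect_periods(y, [50])` returns `[50]`, since the peaks at lags 50 and 100 have nearly equal height.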
$$x_{k,t} = F_k\, x_{k,t-1} + g_k\, e_{k,t}, \qquad k = 1,\dots,c \qquad (2)$$
where
$$x_{k,t} = [\,s_k(t)\; s_k(t-1) \cdots s_k(t-p_k) \mid \tilde{s}_k(t)\; \tilde{s}_k(t-1) \cdots \tilde{s}_k(t-\lfloor T_k \rfloor) \cdots \tilde{s}_k(t-N+1)\,]^T$$
and
$$g_k = [\,1\; 0 \cdots 0 \mid 1\; 0 \cdots 0\,]^T$$
$$F_k = \begin{pmatrix} F_{11,k} & F_{12,k} \\ F_{21,k} & F_{22,k} \end{pmatrix}$$

$$F_{11,k} = \begin{pmatrix}
a_{k,1} & a_{k,2} & \cdots & a_{k,p_k} & 0 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix} \quad (p_k+1)\times(p_k+1)$$

$$F_{12,k} = \begin{pmatrix}
0 & 0 & \cdots & \alpha b_k & (1-\alpha) b_k & 0 \\
0 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0
\end{pmatrix} \quad (p_k+1)\times N$$

$$F_{21,k} = \begin{pmatrix}
0 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & 0
\end{pmatrix} \quad N\times(p_k+1)$$

$$F_{22,k} = \begin{pmatrix}
0 & 0 & \cdots & \alpha b_k & (1-\alpha) b_k & 0 \\
1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix} \quad N\times N$$
N is chosen larger than the maximum pitch period T_k in order to capture the long-term behavior. We concatenate the state equations (2) for k = 1,\dots,c and introduce an observation equation for \{y_t\} to obtain the state-space model
$$x_t = F\, x_{t-1} + G\, e_t \qquad (3)$$
$$y_t = h^T x_t + n_t \qquad (4)$$
where x_t = [x_{1,t}^T\; x_{2,t}^T \cdots x_{c,t}^T]^T and e_t = [e_{1,t}\; e_{2,t} \cdots e_{c,t}]^T. The (\sum_{k=1}^{c} p_k + c + Nc) \times (\sum_{k=1}^{c} p_k + c + Nc) block-diagonal matrix F is given by F = \mathrm{block\,diag}(F_1,\dots,F_c). The (\sum_{k=1}^{c} p_k + c + Nc) \times c matrix G and the (\sum_{k=1}^{c} p_k + c + Nc) \times 1 column vector h are given respectively by G = \mathrm{block\,diag}(g_1,\dots,g_c) and h = [h_1^T \cdots h_c^T]^T, where h_i = [1\; 0 \cdots 0]^T. The covariance matrix of the input vector e_t is the c \times c diagonal matrix Q = \mathrm{Cov}(e_t) = \mathrm{diag}(\rho_1,\dots,\rho_c).
This State-Space model will be used in the EM algorithm.
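The per-source matrices F_k, g_k and their block-diagonal stacking into F, G, h, Q can be sketched as below. The placement of the fractional-pitch weights \alpha b_k and (1-\alpha) b_k at the columns for lags \lfloor T_k \rfloor and \lfloor T_k \rfloor + 1 is our reading of the matrix layout above, and the numeric parameter values are hypothetical.

```python
import numpy as np
from scipy.linalg import block_diag

def source_matrices(a, b, T_floor, alpha, N):
    """Build F_k and g_k for one source (sketch; T_floor = floor(T_k),
    alpha is the fractional-pitch interpolation weight)."""
    p = len(a)
    # F11: short-term companion block, (p+1) x (p+1)
    F11 = np.zeros((p + 1, p + 1))
    F11[0, :p] = a
    F11[1:, :p] = np.eye(p)
    # first row of F12/F22: alpha*b and (1-alpha)*b at lags T_floor, T_floor+1
    long_row = np.zeros(N)
    long_row[T_floor - 1] = alpha * b        # assumed column indexing
    long_row[T_floor] = (1 - alpha) * b
    F12 = np.zeros((p + 1, N))
    F12[0] = long_row
    F21 = np.zeros((N, p + 1))
    F22 = np.zeros((N, N))
    F22[0] = long_row
    F22[1:, :N - 1] = np.eye(N - 1)          # delay-line shift
    F = np.block([[F11, F12], [F21, F22]])
    g = np.zeros(p + 1 + N)
    g[0] = 1.0
    g[p + 1] = 1.0
    return F, g

# Stack c = 2 hypothetical sources into the joint model (3)-(4).
params = [([0.5], 0.8, 60, 0.3, 100), ([0.4, 0.2], 0.7, 80, 0.5, 100)]
Fs, gs = zip(*(source_matrices(*prm) for prm in params))
F = block_diag(*Fs)
G = block_diag(*[g.reshape(-1, 1) for g in gs])
h = np.concatenate([np.eye(1, Fk.shape[0]).ravel() for Fk in Fs])  # picks s_k(t)
Q = np.diag([1.0, 1.0])                                            # rho_k values
```

The resulting F is block diagonal of size \sum_k (p_k + 1 + N), matching the dimension count given above.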
The Expectation-Maximization (EM) algorithm is a general iterative method to compute ML estimates when the observed data can be regarded as "incomplete" and the incomplete data set can be related to some complete data set through a noninvertible transformation. Let the vector of observations y be the incomplete data set, with probability density p(y;\theta). The ML estimate of \theta is obtained by maximizing the log-likelihood function
$$\theta_{ML} = \arg\max_\theta \log p(y;\theta) \qquad (5)$$
In our problem the parameters of interest are \theta = [\rho_1 \cdots \rho_c\; \sigma_n^2\; v_1 \cdots v_c]^T, where v_k contains all the short- and long-term coefficients of source k. A natural choice for the complete data set X is X = [x_1,\dots,x_M, n_1,\dots,n_M]. The complete and incomplete data sets are related in our case by the vector h in equation (4). The EM algorithm is composed of two steps: the expectation step (E-step), where we compute the following function,
E-step:
$$U(\theta, \theta^{(i)}) = E_{\theta^{(i)}}\{\log p(X;\theta) \mid y\} \qquad (6)$$
and the maximization step (M-step), where we maximize (6):
$$\theta^{(i+1)} = \arg\max_\theta U(\theta, \theta^{(i)}) \qquad (7)$$
Here i denotes the iteration index. In order to compute (6), we first evaluate the log-likelihood function \log p(X;\theta):
$$\log p(X;\theta) = \sum_{k=1}^{c} \log p(x_{k,1},\dots,x_{k,M}) + \log p(n_1,\dots,n_M)$$
$$= \sum_{k=1}^{c} \big(\log p(x_{k,2},\dots,x_{k,M} \mid x_{k,1}) + \log p(x_{k,1})\big) + \log p(n_1,\dots,n_M)$$
Using the Gaussian assumption, we find, after some algebraic manipulation,
$$\log p(X;\theta) = \sum_{k=1}^{c} \Big[ -\tfrac{1}{2}(M+N+p_k)\log \rho_k - \tfrac{1}{2\rho_k}\Big( v_k^T \Big(\sum_{t=2}^{M} \breve{x}_{k,t}\breve{x}_{k,t}^T\Big) v_k + \mathrm{trace}\big(\Gamma_k^{-1} x_{k,1}x_{k,1}^T\big)\Big)\Big]$$
$$- \tfrac{M}{2}\log\sigma_n^2 - \tfrac{1}{2\sigma_n^2}\sum_{t=1}^{M}\big(y_t^2 - 2y_t h^T x_t + h^T x_t x_t^T h\big) + K$$
where \breve{x}_{k,t} = [s_k(t,\theta) \cdots s_k(t-p_k,\theta)\; \tilde{s}_k(t-\lfloor T_k\rfloor,\theta)\; \tilde{s}_k(t-\lfloor T_k\rfloor-1,\theta)]^T = S_k^T x_{k,t}, where S_k^T is a transformation matrix which extracts the partial state \breve{x}_{k,t} from the complete state x_{k,t}, v_k = [1\; -a_{k,1} \cdots -a_{k,p_k}\; -\alpha b_k\; -(1-\alpha) b_k]^T, K is a constant independent of \theta, and \Gamma_k = \rho_k^{-1}\mathrm{Cov}(x_{k,1}) is the normalized covariance matrix. After applying the E_{\theta^{(i)}}\{\cdot \mid y\} operator, we deduce that
$$U(\theta,\theta^{(i)}) = \sum_{k=1}^{c}\Big[ -\tfrac{1}{2}(M+N+p_k)\log\rho_k - \tfrac{1}{2\rho_k}\Big( v_k^T S_k^T \Big(\sum_{t=2}^{M} \hat{x}^{(i)}_{k,t|M}\big(\hat{x}^{(i)}_{k,t|M}\big)^T + P^{(i)}_{k,t|M}\Big) S_k\, v_k$$
$$+ \mathrm{trace}\big\{\Gamma_k^{-1}\big(\hat{x}^{(i)}_{k,1|M}(\hat{x}^{(i)}_{k,1|M})^T + P^{(i)}_{k,1|M}\big)\big\}\Big)\Big]$$
$$- \tfrac{M}{2}\log\sigma_n^2 - \tfrac{1}{2\sigma_n^2}\sum_{t=1}^{M}\Big(y_t^2 - 2y_t h^T\hat{x}^{(i)}_{t|M} + h^T\big(\hat{x}^{(i)}_{t|M}(\hat{x}^{(i)}_{t|M})^T + P^{(i)}_{t|M}\big)h\Big) + K \qquad (8)$$
where \hat{x}^{(i)}_{t|M} = E_{\theta^{(i)}}\{x_t \mid y\} and \hat{x}^{(i)}_{k,t|M} = E_{\theta^{(i)}}\{x_{k,t} \mid y\} denote the smoothed conditional means, and P^{(i)}_{t|M} = \mathrm{Cov}_{\theta^{(i)}}(x_t \mid y) and P^{(i)}_{k,t|M} = \mathrm{Cov}_{\theta^{(i)}}(x_{k,t} \mid y) are the smoothed conditional covariance matrices. These quantities are calculated in the E-step using the Rauch-Tung-Striebel optimal smoother as described below:
Set \hat{x}_{1|0} = 0
Set P_{1|0} = \mathrm{block\,diag}(\rho_1\Gamma_1,\dots,\rho_c\Gamma_c)
For t = 1 to M do
  K_t = P_{t|t-1}\, h\, (h^T P_{t|t-1} h + \sigma_n^2)^{-1}
  \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - h^T \hat{x}_{t|t-1})
  P_{t|t} = P_{t|t-1} - K_t h^T P_{t|t-1}
  \hat{x}_{t+1|t} = \hat{F} \hat{x}_{t|t}
  P_{t+1|t} = \hat{F} P_{t|t} \hat{F}^T + G Q G^T
For t = M-1 down to 1 do
  A_t = P_{t|t} \hat{F}^T P_{t+1|t}^{-1}
  \hat{x}_{t|M} = \hat{x}_{t|t} + A_t (\hat{x}_{t+1|M} - \hat{F}\hat{x}_{t|t})
  P_{t|M} = P_{t|t} + A_t (P_{t+1|M} - P_{t+1|t}) A_t^T
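A generic implementation of this forward filter plus backward RTS pass can be sketched in Python for the scalar-observation model (3)-(4). Here F is treated as known (in the EM loop it is the current estimate \hat{F}), and the measurement update of the covariance uses the standard form.

```python
import numpy as np

def kalman_rts(y, F, G, Q, h, sigma2, x1, P1):
    """Kalman filter + Rauch-Tung-Striebel smoother for
    x_t = F x_{t-1} + G e_t,  y_t = h^T x_t + n_t (scalar y_t).
    Returns smoothed means and covariances (sketch)."""
    M, n = len(y), F.shape[0]
    xp = np.zeros((M, n)); Pp = np.zeros((M, n, n))   # predicted x_{t|t-1}
    xf = np.zeros((M, n)); Pf = np.zeros((M, n, n))   # filtered  x_{t|t}
    xpred, Ppred = x1, P1
    for t in range(M):
        xp[t], Pp[t] = xpred, Ppred
        S = h @ Ppred @ h + sigma2                     # innovation variance
        K = Ppred @ h / S                              # Kalman gain K_t
        xf[t] = xpred + K * (y[t] - h @ xpred)
        Pf[t] = Ppred - np.outer(K, h @ Ppred)         # P_{t|t}
        xpred = F @ xf[t]                              # time update
        Ppred = F @ Pf[t] @ F.T + G @ Q @ G.T
    xs = xf.copy(); Ps = Pf.copy()                     # backward RTS pass
    for t in range(M - 2, -1, -1):
        A = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])     # smoother gain A_t
        xs[t] = xf[t] + A @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + A @ (Ps[t + 1] - Pp[t + 1]) @ A.T
    return xs, Ps
```

In the E-step, `xs` and `Ps` play the roles of \hat{x}_{t|M} and P_{t|M} in (8).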
The maximization of (8) during the M-step is trivial. Setting the partial derivatives with respect to \rho_1,\dots,\rho_c, \sigma_n^2 to zero yields the new estimates of these variables. To estimate the vectors v_k we use the fact that e_{k,t} = \breve{x}_{k,t}^T v_k; multiplying on the left by \breve{x}_{k,t} and applying the operator E_{\theta^{(i)}}\{\cdot \mid y\}, we obtain
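For instance, zeroing the derivative of (8) with respect to \sigma_n^2 gives the closed-form update below: a plain average of the expected squared observation residuals, where `xs` and `Ps` are the smoothed means and covariances from the E-step (sketch).

```python
import numpy as np

def update_sigma2(y, xs, Ps, h):
    """M-step update of sigma_n^2 obtained by zeroing d U / d sigma_n^2:
    sigma_n^2 = (1/M) sum_t E{ (y_t - h^T x_t)^2 | y }  (sketch)."""
    M = len(y)
    total = 0.0
    for t in range(M):
        # expected squared residual: y^2 - 2 y h^T xhat + h^T (xhat xhat^T + P) h
        total += (y[t] ** 2
                  - 2 * y[t] * (h @ xs[t])
                  + h @ (np.outer(xs[t], xs[t]) + Ps[t]) @ h)
    return total / M
```

When the smoothed state reproduces the observations exactly and the covariances are zero, the update returns zero, as expected from the residual interpretation.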
$$S_k^T\, \frac{1}{M} \sum_{t=1}^{M}\Big(\hat{x}^{(i)}_{k,t|M}\big(\hat{x}^{(i)}_{k,t|M}\big)^T + P^{(i)}_{k,t|M}\Big) S_k\, v_k = [\rho_k\; 0 \cdots \rho_k\; 0 \cdots 0]^T$$
where the second \rho_k is the (p_k+2)-th component. F depends on the parameters of the problem; therefore, it must be recomputed at each iteration before being used in the Kalman algorithm (hence its estimate \hat{F}). Converging to good parameter values requires a good initialization. Therefore, we first perform a sweep using the fixed-lag Kalman smoothing algorithm. The obtained values are then used to initialize the Rauch-Tung-Striebel optimal smoother algorithm described above, which is run repeatedly until convergence.