HAL Id: inria-00541863

https://hal.inria.fr/inria-00541863

Submitted on 27 Jan 2011


Under-determined convolutive blind source separation using spatial covariance models

Ngoc Duong, Emmanuel Vincent, Rémi Gribonval

To cite this version:

Ngoc Duong, Emmanuel Vincent, Rémi Gribonval. Under-determined convolutive blind source separation using spatial covariance models. Acoustics, Speech and Signal Processing, IEEE Conference on (ICASSP'10), Mar 2010, Dallas, United States. pp. 9–12, ⟨10.1109/ICASSP.2010.5496284⟩. ⟨inria-00541863⟩


UNDER-DETERMINED CONVOLUTIVE BLIND SOURCE SEPARATION USING SPATIAL COVARIANCE MODELS

Ngoc Q. K. Duong, Emmanuel Vincent and Rémi Gribonval

METISS project team, IRISA-INRIA

Campus de Beaulieu, 35042 Rennes Cedex, France

{qduong, emmanuel.vincent, remi.gribonval}@irisa.fr

ABSTRACT

This paper deals with the problem of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial properties of the source. We consider two covariance models and address the estimation of their parameters from the recorded mixture by a suitable initialization scheme followed by an iterative expectation-maximization (EM) procedure in each frequency bin. We then align the order of the estimated sources across all frequency bins based on their estimated directions of arrival (DOA). Experimental results over a stereo reverberant speech mixture show the effectiveness of the proposed approach.

Index Terms— Convolutive blind source separation, under-determined mixtures, spatial covariance models, EM algorithm, permutation problem.

1. INTRODUCTION

In blind source separation (BSS), the recorded multichannel signal $\mathbf{x}(t)$ is a mixture of several sound sources

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \qquad (1)$$

where $\mathbf{c}_j(t)$ is the spatial image of source $j$, that is its contribution to all mixture channels. Each $\mathbf{c}_j(t)$ can be modeled via the convolutive mixing process

$$\mathbf{c}_j(t) = \sum_{\tau} \mathbf{h}_j(\tau)\, s_j(t-\tau) \qquad (2)$$

where $s_j(t)$ is the $j$-th source signal and $\mathbf{h}_j(\tau)$ is the vector of mixing filter coefficients modeling the acoustic path from source $j$ to all microphones. Under-determined BSS consists in recovering either the $J$ original source signals or their spatial images given the $I$ mixture channels, where $I < J$.
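As a concrete illustration of the mixing model (1)-(2), the following minimal numpy sketch generates an under-determined convolutive mixture from hypothetical sources and filters. The array shapes, variable names and toy filter decay are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def convolutive_mixture(sources, filters):
    """Simulate x(t) = sum_j c_j(t) with c_j(t) = sum_tau h_j(tau) s_j(t - tau).

    sources : (J, T) array of source signals s_j(t)
    filters : (J, I, L) array of mixing filters h_j(tau), one I-channel
              impulse response of length L per source
    Returns the (I, T) mixture x(t) and the (J, I, T) spatial images c_j(t).
    """
    J, T = sources.shape
    _, I, L = filters.shape
    images = np.zeros((J, I, T))
    for j in range(J):
        for i in range(I):
            # convolve source j with the filter from source j to microphone i
            images[j, i] = fftconvolve(sources[j], filters[j, i])[:T]
    x = images.sum(axis=0)   # mixture = sum of spatial images, cf. eq. (1)
    return x, images

# toy example: J = 3 sources, I = 2 microphones (under-determined, I < J)
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 16000))
h = rng.standard_normal((3, 2, 512)) * np.exp(-np.arange(512) / 100.0)
x, c = convolutive_mixture(s, h)
```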

Most existing approaches transform the signals into the time-frequency domain via the short-time Fourier transform (STFT) and approximate the convolutive mixing process by a complex-valued mixing matrix in each frequency bin. Source separation is then achieved by estimating the mixing matrices in all frequency bins and deriving the source STFT coefficients under a sparse prior distribution. Popular algorithms include binary masking [1] or ℓ1-norm minimization [2].
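For context, the following sketch shows the binary-masking idea in its simplest form: each time-frequency point is assigned to the source whose mixing vector correlates best with the observed mixture vector. This is a schematic stand-in rather than the exact algorithm of [1]; the array layout and the assumption that the mixing vectors are already known are mine.

```python
import numpy as np

def binary_mask_separation(X, H):
    """Schematic binary masking in the STFT domain (illustration only).

    X : (I, F, N) mixture STFT coefficients
    H : (F, I, J) mixing vectors, one per source and frequency bin
    Each time-frequency point is assigned to the source whose normalized
    mixing vector correlates best with the observed mixture vector; the
    returned array of spatial-image estimates has shape (J, I, F, N).
    """
    I, F, N = X.shape
    J = H.shape[2]
    C = np.zeros((J, I, F, N), dtype=complex)
    for f in range(F):
        Hf = H[f] / np.linalg.norm(H[f], axis=0, keepdims=True)  # (I, J) unit-norm columns
        scores = np.abs(Hf.conj().T @ X[:, f, :])                # (J, N) correlation magnitudes
        winner = scores.argmax(axis=0)                           # winning source per frame
        for j in range(J):
            C[j, :, f, :] = X[:, f, :] * (winner == j)           # keep only the winning points
    return C
```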

A different framework [3, 4] assumes that the sources are uncorrelated and that the vector $\mathbf{c}_j(n,f)$ of STFT coefficients of each spatial source image in time frame $n$ and frequency bin $f$ is modeled as a zero-mean Gaussian random variable with covariance matrix

$$\mathbf{R}_{\mathbf{c}_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f) \qquad (3)$$

where $v_j(n,f)$ is a scalar time-varying variance and $\mathbf{R}_j(f)$ a time-invariant spatial covariance matrix encoding the spatial properties of the source. The parameters $v_j(n,f)$ and $\mathbf{R}_j(f)$ can be estimated in the maximum likelihood (ML) sense. The spatial images of all sources are then obtained in the minimum mean square error (MMSE) sense by Wiener filtering.
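Under model (3), the MMSE estimate of each spatial image is given by the multichannel Wiener filter $\hat{\mathbf{c}}_j(n,f) = v_j(n,f)\,\mathbf{R}_j(f)\,\mathbf{R}_{\mathbf{x}}^{-1}(n,f)\,\mathbf{x}(n,f)$, where $\mathbf{R}_{\mathbf{x}}(n,f)$ is the mixture covariance. A minimal numpy sketch, assuming the model parameters are already known and using hypothetical array shapes:

```python
import numpy as np

def wiener_spatial_images(X, v, R):
    """MMSE estimation of the spatial images under the local Gaussian model (3).

    X : (F, N, I)    mixture STFT coefficients x(n, f)
    v : (J, F, N)    time-varying source variances v_j(n, f)
    R : (J, F, I, I) time-invariant spatial covariance matrices R_j(f)
    Returns estimates of c_j(n, f) with shape (J, F, N, I).
    """
    J, F, N = v.shape
    I = X.shape[-1]
    # mixture covariance R_x(n, f) = sum_j v_j(n, f) R_j(f)
    Rx = np.einsum('jfn,jfab->fnab', v, R)
    Rx_inv = np.linalg.inv(Rx)                                   # (F, N, I, I)
    C = np.empty((J, F, N, I), dtype=complex)
    for j in range(J):
        # per-bin Wiener filters W_j(n, f) = v_j(n, f) R_j(f) R_x(n, f)^{-1}
        W = v[j][..., None, None] * np.einsum('fab,fnbc->fnac', R[j], Rx_inv)
        C[j] = np.einsum('fnab,fnb->fna', W, X)                  # filtered mixture
    return C
```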

This framework was first applied to the separation of instantaneous audio mixtures in [5] using a rank-1 spatial covariance matrix $\mathbf{R}_j(f)$. In [3], we extended this approach to convolutive mixtures and proposed to consider $\mathbf{R}_j(f)$ as a full-rank matrix. This model was shown to improve the separation performance of reverberant mixtures in an oracle context, where the true values of $\mathbf{R}_j(f)$ and $v_j(n,f)$ are known, and in a semi-blind context, where $\mathbf{R}_j(f)$ was estimated from single-source training data but $v_j(n,f)$ was blindly estimated from the mixture.

In this paper, we investigate the estimation of both $\mathbf{R}_j(f)$ and $v_j(n,f)$ in a blind context where only the mixture signal is available. For that purpose, we use the expectation-maximization (EM) algorithm and propose an effective parameter initialization scheme. We also solve the source permutation problem arising when the model parameters at different frequencies are independently estimated. We argue for the better source separation performance of the full-rank model over the rank-1 model and state-of-the-art algorithms on mixtures with realistic reverberation time.

The structure of the rest of the paper is as follows. We present the rank-1 and full-rank spatial covariance models in Section 2 and address the blind estimation of the model parameters in Section 3. We compare the source separation performance achieved by the rank-1 and full-rank models and by state-of-the-art algorithms over speech data in Section 4. Finally, we conclude in Section 5.

2. SPATIAL COVARIANCE MODELS

We investigate two general spatial source models with different degrees of flexibility, resulting in a rank-1 or a full-rank spatial covariance matrix.

2.1. Rank-1 model

Most under-determined BSS approaches model the convolutive mixing process (2) in the frequency domain as $\mathbf{c}_j(n,f) \approx \mathbf{h}_j(f)\, s_j(n,f)$, where $\mathbf{h}_j(f)$ is the Fourier transform of the mixing filters $\mathbf{h}_j(\tau)$ and $s_j(n,f)$ is the STFT of $s_j(t)$ [2]. The spatial covariance matrix of source $j$ is then equal to the rank-1 matrix

$$\mathbf{R}_j(f) = \mathbf{h}_j(f)\, \mathbf{h}_j^H(f) \qquad (4)$$

with $^H$ denoting matrix conjugate transposition. In the following, we assume that the mixing vectors $\mathbf{h}_j(f)$ associated with different sources $j$ are not collinear.

2.2. Full-rank model

The above rank-1 model assumes that the sound of source $j$ as recorded at the microphones comes from a single spatial position at each frequency, as specified by $\mathbf{h}_j(f)$. But in practice, reverberation increases the spatial spread of each source due to echoes at many different positions on the walls.

Therefore, we also investigate the modeling of each source via a full-rank spatial covariance matrix $\mathbf{R}_j(f)$ without any constraint on its entries [3]. Since this model is more general than the rank-1 model in (4), it allows more flexible modeling of the mixing process.
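A toy numpy illustration of the difference between the two models; the specific numbers are arbitrary and only meant to show that (4) constrains $\mathbf{R}_j(f)$ to rank 1, whereas the full-rank model does not:

```python
import numpy as np

# Rank-1 model (4): R_j(f) = h_j(f) h_j(f)^H, built from a single mixing vector.
# Full-rank model: R_j(f) is an unconstrained Hermitian positive definite matrix.
I = 2
rng = np.random.default_rng(0)

h = rng.standard_normal(I) + 1j * rng.standard_normal(I)   # hypothetical mixing vector
R_rank1 = np.outer(h, h.conj())                            # rank 1: a single spatial position

A = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
R_full = A @ A.conj().T + 1e-3 * np.eye(I)                 # full rank: spatially spread source

print(np.linalg.matrix_rank(R_rank1))   # 1
print(np.linalg.matrix_rank(R_full))    # I
```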

3. ESTIMATION OF MODEL PARAMETERS

We estimate the model parameters $\mathbf{h}_j(f)$ or $\mathbf{R}_j(f)$ and $v_j(n,f)$ from the recorded mixture using a three-step procedure: initialization by hierarchical clustering, iterative estimation via the EM algorithm, and permutation alignment. The overall process is shown in Fig. 1.

3.1. Initialization by hierarchical clustering

Preliminary experiments showed that the initialization of the model parameters greatly affects the separation performance resulting from the EM algorithm. In the following, we propose a hierarchical clustering-based initialization scheme inspired by the algorithm in [2].

Fig. 1. Flow of the proposed BSS approach.

This scheme relies on the assumption that the sound from each source comes from a certain region of space at each frequency, which is different for all sources. The vectors $\mathbf{x}(n,f)$ of mixture STFT coefficients are then likely to cluster around the direction of the associated mixing vector $\mathbf{h}_j(f)$ in the time frames $n$ where source $j$ is predominant.

In order to estimate these clusters, we first normalize the vectors of mixture STFT coefficients as

$$\bar{\mathbf{x}}(n,f) = \frac{\mathbf{x}(n,f)}{\|\mathbf{x}(n,f)\|_2}\, e^{-i \arg(x_1(n,f))} \qquad (5)$$

where $\arg(\cdot)$ denotes the phase of a complex number and $\|\cdot\|_2$ the Euclidean norm. We then define the distance between two clusters $C_i$ and $C_k$ as the average distance between the associated normalized mixture STFT coefficients

$$d(C_i, C_k) = \frac{1}{|C_i|\,|C_k|} \sum_{\bar{\mathbf{x}} \in C_i} \sum_{\bar{\mathbf{x}}' \in C_k} \|\bar{\mathbf{x}} - \bar{\mathbf{x}}'\|_2. \qquad (6)$$

At a given frequency, the vectors of mixture STFT coefficients on all time frames are first considered as clusters containing a single item. The distance between each pair of clusters is computed and the two clusters with the smallest distance are merged. This "bottom-up" process, called linking, is repeated until the number of clusters is less than a predetermined threshold $K$. This threshold is usually much larger than the number of sources [2], so as to eliminate outliers.

We finally choose the $J$ clusters with the largest number of samples and calculate the initial values as

$$\mathbf{h}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{n \in C_j} \mathbf{x}(n,f)\, e^{-i \arg(x_1(n,f))} \qquad (7)$$

$$\mathbf{R}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{n \in C_j} \mathbf{x}(n,f)\, \mathbf{x}^H(n,f). \qquad (8)$$

Note that, contrary to the algorithm in [2], we define the distance between clusters as the average distance between the normalized mixture STFT coefficients instead of the minimum distance between them. Besides, the mixing vector $\mathbf{h}_j^{\mathrm{init}}(f)$ is computed from the phase-normalized mixture STFT coefficients as in (7) instead of the normalized coefficients as in (5). These modifications were found to provide better initial approximation of the mixing parameters in


our experiments. We also tested random initialization and direction-of-arrival (DOA) based initialization, i.e. where the mixing vectors $\mathbf{h}_j^{\mathrm{init}}(f)$ are derived from known source and microphone positions assuming no reverberation. Both schemes were found to result in slower convergence and poorer separation performance for both the rank-1 and the full-rank model than the proposed scheme.
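The sketch below implements this initialization for one frequency bin under my reading of (5)-(8). It substitutes scipy's average-linkage agglomerative clustering for the hand-rolled linking described above; the function name, the array layout and the choice of scipy routines are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def init_by_clustering(Xf, J, K):
    """Hierarchical-clustering initialization in one frequency bin (Section 3.1).

    Xf : (N, I) complex mixture STFT coefficients in this bin
    J  : number of sources, K : number of clusters kept (K >> J)
    Returns initial mixing vectors h_init (J, I) and spatial covariances
    R_init (J, I, I).
    """
    phase = np.exp(-1j * np.angle(Xf[:, 0]))                       # cancel the phase of channel 1
    X_phase = Xf * phase[:, None]                                  # phase-normalized coefficients
    X_norm = X_phase / np.linalg.norm(Xf, axis=1, keepdims=True)   # eq. (5)

    # average-linkage clustering on stacked real/imaginary parts
    feats = np.hstack([X_norm.real, X_norm.imag])
    labels = fcluster(linkage(feats, method='average'), t=K, criterion='maxclust')

    # keep the J largest clusters and average their members, eqs. (7)-(8)
    sizes = np.bincount(labels)
    largest = np.argsort(sizes)[::-1][:J]
    I = Xf.shape[1]
    h_init = np.zeros((J, I), dtype=complex)
    R_init = np.zeros((J, I, I), dtype=complex)
    for j, c in enumerate(largest):
        idx = np.flatnonzero(labels == c)
        h_init[j] = X_phase[idx].mean(axis=0)
        R_init[j] = X_phase[idx].T @ X_phase[idx].conj() / len(idx)
    return h_init, R_init
```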

3.2. Maximum likelihood estimation with EM

Under model (3), the STFT coefficients of the mixture signal are zero-mean Gaussian with covariance matrix

$$\mathbf{R}_{\mathbf{x}}(n,f) = \sum_{j=1}^{J} v_j(n,f)\, \mathbf{R}_j(f). \qquad (9)$$

In each frequency bin, we now wish to estimate both the mixing parameters $\mathbf{h}_j(f)$ or $\mathbf{R}_j(f)$ and the source variances $v_j(n,f)$ in the ML sense. The EM algorithm is well known to be an appropriate choice in this case [6].

For the rank-1 model, the EM algorithm is derived based on the complete data $\{\mathbf{x}(n,f), s_1(n,f), \dots, s_J(n,f)\}$, that is the set of STFT coefficients of all mixture channels and all sources on all time frames. Due to lack of space, we do not provide the EM parameter updates for this model here. Similar updates can be found in [5, 7].

For the full-rank model, the complete data becomes $\{\mathbf{x}(n,f), \mathbf{c}_1(n,f), \dots, \mathbf{c}_J(n,f)\}$, that is the set of STFT coefficients of the mixture and of all source images on all channels and all time frames. The details of one EM iteration for each source $j$ are as follows.

In the E-step, the Wiener filter $\mathbf{W}_j(n,f)$, the mean $\hat{\mathbf{c}}_j(n,f)$ and the covariance matrix $\hat{\mathbf{R}}_{\mathbf{c}_j}(n,f)$ of the estimated source image are computed as

$$\mathbf{W}_j(n,f) = \mathbf{R}_{\mathbf{c}_j}(n,f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f) \qquad (10)$$

$$\hat{\mathbf{c}}_j(n,f) = \mathbf{W}_j(n,f)\, \mathbf{x}(n,f) \qquad (11)$$

$$\hat{\mathbf{R}}_{\mathbf{c}_j}(n,f) = \hat{\mathbf{c}}_j(n,f)\, \hat{\mathbf{c}}_j^H(n,f) + \left(\mathbf{I} - \mathbf{W}_j(n,f)\right) \mathbf{R}_{\mathbf{c}_j}(n,f) \qquad (12)$$

where $\mathbf{I}$ is the $I \times I$ identity matrix, $\mathbf{R}_{\mathbf{c}_j}(n,f)$ is defined in (3) and $\mathbf{R}_{\mathbf{x}}(n,f)$ in (9).

In the M-step, $v_j(n,f)$ and $\mathbf{R}_j(f)$ are updated as [3]

$$v_j(n,f) = \frac{1}{I}\, \mathrm{tr}\!\left(\mathbf{R}_j^{-1}(f)\, \hat{\mathbf{R}}_{\mathbf{c}_j}(n,f)\right) \qquad (13)$$

$$\mathbf{R}_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(n,f)}\, \hat{\mathbf{R}}_{\mathbf{c}_j}(n,f) \qquad (14)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix and $N$ the number of time frames.
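A compact numpy sketch of one such EM iteration in a single frequency bin, following (10)-(14) as reconstructed above. The regularization term eps and the array layout are my own additions for numerical safety, not part of the paper.

```python
import numpy as np

def em_iteration_full_rank(Xf, v, R, eps=1e-7):
    """One EM iteration of the full-rank model in a single frequency bin.

    Xf : (N, I) mixture STFT coefficients in this bin
    v  : (J, N) current source variances v_j(n, f)
    R  : (J, I, I) current spatial covariance matrices R_j(f)
    Returns updated (v, R).
    """
    J, N = v.shape
    I = Xf.shape[1]
    Rc = v[:, :, None, None] * R[:, None, :, :]              # R_{c_j}(n,f) = v_j R_j, eq. (3)
    Rx = Rc.sum(axis=0) + eps * np.eye(I)                    # mixture covariance, eq. (9)
    Rx_inv = np.linalg.inv(Rx)                               # (N, I, I)

    v_new = np.empty_like(v)
    R_new = np.empty_like(R)
    for j in range(J):
        W = Rc[j] @ Rx_inv                                   # Wiener filter, eq. (10)
        c_hat = np.einsum('nab,nb->na', W, Xf)               # posterior mean, eq. (11)
        # posterior covariance of the spatial image, eq. (12)
        Rc_hat = (np.einsum('na,nb->nab', c_hat, c_hat.conj())
                  + (np.eye(I) - W) @ Rc[j])
        # M-step updates, eqs. (13)-(14)
        Rj_inv = np.linalg.inv(R[j] + eps * np.eye(I))
        v_new[j] = np.real(np.trace(Rj_inv @ Rc_hat, axis1=1, axis2=2)) / I
        R_new[j] = (Rc_hat / v_new[j][:, None, None]).mean(axis=0)
    return v_new, R_new
```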

3.3. Permutation alignment

Since the model parameters are estimated independently in each frequency bin, they should be ordered so as to correspond to the same source across all frequency bins. In order to solve this so-called permutation problem, we apply the DOA-based algorithm described in [8] for the rank-1 model. Given the geometry of the microphone array, this algorithm computes the DOAs of all sources and permutes the model parameters by clustering the estimated mixing vectors $\mathbf{h}_j(f)$ normalized as in (5).

Regarding the full-rank model, we first apply principal component analysis (PCA) to calculate the first principal component $\mathbf{w}_j(f)$ of the spatial covariance matrix $\mathbf{R}_j(f)$ of each source $j$ in each frequency bin $f$. This vector is conceptually equivalent to the mixing vector $\mathbf{h}_j(f)$ of the rank-1 model. Thus, we can apply the same procedure to solve the permutation problem. Fig. 2 depicts the phase of the second entry $w_{2j}(f)$ of $\mathbf{w}_j(f)$ before and after solving the permutation for a stereo mixture of three sources in a reverberant room, where $\mathbf{w}_j(f)$ has been normalized as in (5). This phase is unambiguously related to the source DOAs below 5 kHz; above that frequency, spatial aliasing occurs. We can see that the source order is globally aligned for most frequency bins after solving the permutation.

Fig. 2. Normalized argument of $w_{2j}(f)$ before and after permutation alignment for a stereo mixture of three sources.
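To make the idea concrete, here is a simplified sketch that extracts the principal component of each $\mathbf{R}_j(f)$ and aligns the per-bin source order by matching it against slowly tracked centroids. It is a stand-in for the DOA-based clustering of [8], not that algorithm; the matching rule and the tracking constant are assumptions.

```python
import numpy as np
from itertools import permutations

def align_permutations(R):
    """Simple frequency-wise permutation alignment for the full-rank model.

    R : (F, J, I, I) estimated spatial covariance matrices R_j(f).
    The first principal component w_j(f) of R_j(f) plays the role of a
    mixing vector; bins are aligned by greedily matching the normalized
    w_j(f) to running source centroids.
    """
    F, J, I, _ = R.shape
    _, eigvec = np.linalg.eigh(R)               # eigenvalues in ascending order
    W = eigvec[..., -1]                         # (F, J, I) principal eigenvectors
    # normalize as in eq. (5): unit norm, zero phase on the first channel
    W = W * np.exp(-1j * np.angle(W[..., :1]))
    W = W / np.linalg.norm(W, axis=-1, keepdims=True)

    perm = np.zeros((F, J), dtype=int)
    centroids = W[0].copy()                     # initialize with the first bin
    for f in range(F):
        best = min(permutations(range(J)),
                   key=lambda p: sum(np.linalg.norm(W[f, p[j]] - centroids[j])
                                     for j in range(J)))
        perm[f] = best
        centroids = 0.9 * centroids + 0.1 * W[f, list(best)]   # slowly track each source
    return perm   # apply per bin, e.g. R[f] = R[f, perm[f]]
```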

4. EXPERIMENTAL RESULTS

We evaluated the blind source separation performance of the proposed algorithm with the rank-1 and the full-rank model over stereo mixtures of three speech sources with different reverberation times. The mixtures were generated by convolving 8 s speech signals sampled at 16 kHz with room impulse responses simulated via the source image method. The STFT was computed with a sine window of length 1024. The distance between the two microphones was 5 cm and the distance from the sources to the microphones 50 cm. The number of clusters $K$ was set to a value much larger than the number of sources, as discussed in Section 3.1, and the number of EM iterations to 20.

We also computed the performance of binary masking and ℓ1-norm minimization using the same mixing parameters as those estimated from the hierarchical clustering step. Separation performance was evaluated using the signal-to-distortion ratio (SDR) criterion measuring overall distortion, as well as the signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) and source image-to-spatial distortion ratio (ISR) criteria defined in [9], averaged over all sources and shown in Table 1.

The separation results for ℓ1-norm minimization were not included in the table since they fall below those of the rank-1 model in terms of SDR regardless of the reverberation time.

RT60 (ms)  Approach          SDR    SIR    SAR    ISR
50         Binary masking     8.9   18.2    9.4   18.3
           Rank-1 model      11.4   17.2   12.8   20.8
           Full-rank model   10.6   16.8   11.9   17.7
130        Binary masking     7.2   14.3    7.8   14.7
           Rank-1 model       7.4   11.4   10.0   14.2
           Full-rank model    8.8   13.8   11.2   15.2
250        Binary masking     5.2   10.9    6.0   11.0
           Rank-1 model       4.0    7.9    7.5    9.2
           Full-rank model    6.7   10.4   10.0   10.9
500        Binary masking     2.3    6.1    4.2    7.4
           Rank-1 model       0.9    3.6    6.4    5.7
           Full-rank model    3.8    5.8    7.5    7.2

Table 1. Average source separation performance (dB).

In a low-reverberation environment, i.e. RT60 = 50 ms, the rank-1 model provides the best SDR and SAR among the three remaining approaches. This is consistent with the fact that the direct sound part contains most of the energy received at the microphones, so that the rank-1 spatial covariance matrix provides modeling accuracy similar to that of the full-rank model with fewer parameters. However, in environments with realistic reverberation time, i.e. RT60 ≥ 130 ms, the full-rank model outperforms both the rank-1 model and binary masking in terms of SDR and SAR and results in a SIR very close to that of binary masking. For instance, with

RT60 = 250 ms, the SDR achieved via the full-rank model is 2.7 dB and 1.5 dB larger than that of the rank-1 model and binary masking, respectively. These results confirm the effectiveness of the proposed parameter estimation approach and also show that full-rank spatial covariance matrices better approximate the mixing process in a reverberant room.

5. CONCLUSION

In this paper, we investigated the blind source separation performance stemming from rank-1 and full-rank models of the source spatial covariances. For that purpose, we addressed the estimation of the model parameters by maximizing the likelihood of the observed mixture data using the EM algorithm with a proper initialization scheme. Experimental results over speech data confirm that the full-rank model outperforms both the rank-1 model and state-of-the-art approaches, i.e. binary masking and ℓ1-norm minimization, in a reverberant environment. Future work will take into account background noise within the models and validate the performance of the proposed algorithms over real-world recordings with a larger number of sources. We will also consider combining the proposed models with models of the source spectra, such as those proposed in [7].

6. REFERENCES

[1] Ö. Yılmaz and S. T. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Trans. on Signal Processing, vol. 52, no. 7, pp. 1830–1847, July 2004.

[2] S. Winter, W. Kellermann, H. Sawada, and S. Makino, “MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and ℓ1-norm minimization,” EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007, article ID 24717.

[3] N. Q. K. Duong, E. Vincent, and R. Gribonval, “Spatial covariance models for under-determined reverberant audio source separation,” in Proc. WASPAA, 2009, pp. 129–132.

[4] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, “Probabilistic modeling paradigms for audio source separation,” in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed., IGI Global, to appear.

[5] C. Févotte and J.-F. Cardoso, “Maximum likelihood approach for blind audio source separation using time-frequency Gaussian models,” in Proc. WASPAA, 2005, pp. 78–81.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.

[7] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures with application to blind audio source separation,” in Proc. ICASSP, 2009, pp. 3137–3140.

[8] H. Sawada, S. Araki, R. Mukai, and S. Makino, “Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1592–1604, July 2007.

[9] E. Vincent, H. Sawada, P. Bofill, S. Makino, and J. P. Rosca, “First stereo audio source separation evaluation campaign: data, algorithms and results,” in Proc. ICA, 2007, pp. 552–559.
