Blind Source Separation for Robot Audition using fixed HRTF beamforming


HAL Id: hal-00683198

https://hal.archives-ouvertes.fr/hal-00683198

Submitted on 28 Mar 2012



Mounira Maazaoui, Karim Abed-Meraim, Yves Grenier

To cite this version:

Mounira Maazaoui, Karim Abed-Meraim, Yves Grenier. Blind Source Separation for Robot Audition using fixed HRTF beamforming. EURASIP Journal on Advances in Signal Processing, SpringerOpen, 2012, pp.1687-6180. 10.1186/1687-6180-2012-58. hal-00683198


This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.


EURASIP Journal on Advances in Signal Processing 2012, 2012:58 doi:10.1186/1687-6180-2012-58


ISSN 1687-6180

Article type Research

Submission date 15 June 2011

Acceptance date 6 March 2012

Publication date 6 March 2012

Article URL http://asp.eurasipjournals.com/content/2012/1/58

This peer-reviewed article was published immediately upon acceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright notice below).



Institute TELECOM, TELECOM ParisTech, CNRS-LTCI 37/39,

rue Dareau, 75014, Paris, France


Abstract

In this article, we present a two-stage blind source separation (BSS) algorithm for robot audition. The first stage is a fixed beamforming preprocessing that reduces reverberation and environmental noise. In a robot audition context, the manifold of the sensor array is hard to model because of the presence of the robot's head, so we use pre-measured head related transfer functions (HRTFs) to estimate the beamforming filters. Using the HRTFs to estimate the beamformers captures the effect of the head on the manifold of the microphone array. The second stage is a BSS algorithm based on a sparsity criterion, namely the minimization of the l1 norm of the sources. We present different configurations of our algorithm and show that it gives promising results and that the fixed beamforming preprocessing improves the separation results.

1 Introduction

Robot audition is the ability of a humanoid to understand its acoustic environment, separate and localize sources, identify speakers and recognize their emotions. This complex task is one of the target points of the ROMEO project (see endnote a) that we work on. This project aims to build a humanoid (ROMEO) that can act as a comprehensive assistant for persons suffering from loss of autonomy. Our task in this project is focused on blind source separation (BSS) using a microphone array (more than two sensors). Source separation is a very important step for human-robot interaction: it allows later tasks like speaker identification, speech and motion recognition and environmental sound analysis

received microphone signals without prior knowledge of the mixing process. The

only knowledge is limited to the array geometry.

The problem of BSS has been studied by many authors [1], and we present here some of the state-of-the-art methods related to robot audition. Tamai et al. [2] performed sound source localization by delay-and-sum beamforming and source separation in a real environment with frequency band selection, using a microphone array located on three rings with 32 microphones. Yamamoto et al. [3] proposed a source separation technique based on geometric constraints as a preprocessing for the speech recognition module in their robot audition system. This system was implemented in the humanoids SIG2 and Honda ASIMO with an eight-sensor microphone array, as part of a more complete system for robot audition named HARK [4]. Saruwatari et al. [5] proposed a two-stage binaural BSS system for a humanoid. They combined a single-input multiple-output model based on independent component analysis (ICA) and binary mask processing.

One of the main challenges of BSS remains obtaining good performance in real reverberant environments. A beamforming preprocessing can be a solution to improve BSS performance in a reverberant room. Beamforming consists in estimating a spatial filter that operates on the outputs of a microphone array

many purposes, particularly for enhancing a desired signal from its measurement corrupted by noise, competing sources and reverberation [6]. Beamforming filters can be estimated in a fixed or in an adaptive way. A fixed beamformer, contrary to an adaptive one, does not depend on the sensor data: it is built for a set of fixed desired directions. In this article, we propose a two-stage BSS technique where a fixed beamforming is used as a preprocessing step.

Ding et al. propose to use a beamforming preprocessing where the steering directions are the directions of arrival (DOAs) of the sources. In this case, the DOAs of the sources are supposed to be known a priori [7]. The authors evaluate their method in a determined case with 2 and 4 sources and a circular microphone array. Saruwatari et al. present a combined ICA [8] and beamforming method: first, the authors perform a subband ICA and estimate the DOAs of the sources using the directivity patterns in each frequency bin; second, they use the estimated DOAs to build a null beamformer; and third, they integrate the subband ICA and the null beamforming by selecting the most suitable separation matrix in each frequency bin [9]. In this article, we propose to use a fixed beamforming preprocessing with fixed steering directions, independent of the directions of arrival of the sources, and we compare this preprocessing method to the one

beamforming as a preprocessing tool, so we do not include the algorithm of [9] in our evaluation (the authors of [9] use beamforming as a separation method alternately with ICA).

However, a beamforming task requires knowledge of the manifold of the sensor array, which is hard to model in robot audition because the head of the robot alters the acoustic near field. To overcome the problem of array geometry modeling and take into account the influence of the robot's head on the received signals, we propose to use the head related transfer functions (HRTFs) of the robot's head as steering vectors to build the fixed beamformer. The main advantages of our method are its reduced computational cost (compared to a method based on adaptive beamforming), its improved separation quality and its relatively fast convergence rate. Its weaknesses are the lack of theoretical analysis or proofs that guarantee convergence to the desired solution and, when source localization is needed, the fact that it provides only a rough estimation of the direction of arrival.

This article is organized as follows: in Section 2, we present the signal model used in the BSS task, Sections 3 and 4 are dedicated respectively to the beamforming step using HRTFs and to the presentation of the BSS using a sparsity criterion, Section 5 presents the experimental results and Section 6 provides some concluding remarks.

2 Signal model

Assume $N$ sound sources $s(t) = [s_1(t), \ldots, s_N(t)]^T$ and an array of $M$ microphones with outputs denoted by $x(t) = [x_1(t), \ldots, x_M(t)]^T$, where $t$ is the time index. We assume an overdetermined case with $M > N$ and that the number of sources $N$ is known a priori. In Section 3.3, however, we propose a method of source number estimation for the robot audition case. As we are in a real environment, the output signals in the time domain are modeled as the sum of the convolutions between the sound sources and the impulse responses of the different propagation paths between the sources and the sensors, truncated at length $L + 1$:

$$x(t) = \sum_{l=0}^{L} h(l)\, s(t - l) + n(t) \qquad (1)$$

where $h(l)$ is the $l$th impulse response matrix and $n(t)$ is a noise vector. We consider a spatially decorrelated diffuse noise whose energy is supposed to be negligible compared to that of the point sources. If the noise is punctual, it will

and real life application setups.
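The mixing model of Equation (1) can be simulated in a few lines. The following is a minimal NumPy sketch, not the authors' code: the function name `convolutive_mixture` and the array layouts (sources as an `(N, T)` array, impulse responses as an `(L+1, M, N)` array) are assumptions made for illustration.

```python
import numpy as np

def convolutive_mixture(s, h, noise_std=0.0, rng=None):
    """Simulate x(t) = sum_{l=0}^{L} h(l) s(t - l) + n(t), as in Eq. (1).

    s: (N, T) source signals (assumed layout).
    h: (L + 1, M, N) impulse-response matrices h(l) (assumed layout).
    noise_std: standard deviation of an additive white noise n(t).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_taps, M, N = h.shape
    T = s.shape[1]
    x = np.zeros((M, T))
    for l in range(n_taps):
        # Contribution of tap l: the sources delayed by l samples,
        # mixed by the M x N matrix h(l). Samples before t = 0 are zero.
        x[:, l:] += h[l] @ s[:, :T - l]
    return x + noise_std * rng.standard_normal((M, T))
```

Each microphone signal produced this way equals the sum over sources of a truncated 1-D convolution, which is how the function can be checked against `np.convolve`.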

In the frequency domain, when the length $N_f$ of the analysis window of the short-time Fourier transform (STFT) is longer than twice the length $L$ of the mixing filter, the output signals at the time-frequency bin $(f, k)$ can be approximated as:

$$X(f, k) \simeq H(f)\, S(f, k) \qquad (2)$$

where $X(f, k) = [X_1(f, k), \ldots, X_M(f, k)]^H$ (respectively $S(f, k) = [S_1(f, k), \ldots, S_N(f, k)]^H$) is the STFT of $\{x(t)\}_{1 \le t \le T}$ (respectively $\{s(t)\}_{1 \le t \le T}$) in the frequency bin $f \in \left[1, \frac{N_f}{2} + 1\right]$ and the time bin $k \in [1, N_T]$, and $H$ is the Fourier transform of the mixing filters $\{h(l)\}_{0 \le l \le L}$. Using an appropriate separation criterion, our objective is to find for each frequency bin a separation matrix $F(f)$ that leads to an estimation of the original sources in the time-frequency domain:

$$Y(f, k) = F(f)\, X(f, k) \qquad (3)$$

The inverse STFT of the estimated sources in the frequency domain $Y$ allows the recovery of the estimated sources $y(t) = [y_1(t), \ldots, y_N(t)]^T$ in the time domain.

Separating the sources for each frequency bin introduces the permutation


another. To solve the permutation problem, we use the method proposed by Weihua and Fenggang and described in [10]. This method is based on the signal correlation between two adjacent frequencies. In this article, we do not investigate the permutation problem further and we use the cited method for all the proposed algorithms.
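To illustrate the general idea of aligning permutations through the correlation between adjacent frequencies, here is a small NumPy sketch. It is a generic illustration and not the specific algorithm of [10]: the use of magnitude envelopes, the exhaustive search over permutations (only practical for a small number of sources) and the function name `align_permutations` are all assumptions.

```python
import itertools
import numpy as np

def align_permutations(Y):
    """Reorder the outputs of each frequency bin to match the previous bin.

    Y: (F, N, K) separated STFT coefficients (assumed layout:
       frequency bins, estimated sources, time frames).
    Returns a reordered copy of Y.
    """
    Y = Y.copy()
    n_bins, N, _ = Y.shape
    for f in range(1, n_bins):
        prev_env = np.abs(Y[f - 1])   # envelopes of the already aligned bin
        cur_env = np.abs(Y[f])
        # Cross-correlation coefficients between previous and current envelopes.
        C = np.corrcoef(prev_env, cur_env)[:N, N:]
        # Keep the permutation that maximizes the total correlation.
        best = max(itertools.permutations(range(N)),
                   key=lambda p: sum(C[i, p[i]] for i in range(N)))
        Y[f] = Y[f, list(best)]
    return Y
```

Because each bin is aligned with its already corrected neighbor, an error at one frequency can propagate to the following ones; more robust schemes exist, and the choice here is only meant to keep the sketch short.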

The separation matrix F ( f ) is estimated using a two-step blind separation

algorithm: a fixed beamforming preprocessing step and a BSS step (cf. Figure 1).

F ( f ) is written as the combination of the results of those two steps:

F ( f ) = W ( f ) B ( f ) (4)

where W ( f ) is the separation matrix estimated using a sparsity criterion and

B ( f ) is a fixed beamforming filter. More details are presented in the following

subsections (cf. Algorithm 1).
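Equations (3) and (4) amount to one matrix product per frequency bin. The sketch below shows this with batched NumPy matrix products; the function name `two_stage_separate` and the array layouts are assumptions made for illustration, and the matrices `B` and `W` are taken as given (their estimation is the subject of Sections 3 and 4).

```python
import numpy as np

def two_stage_separate(X, B, W):
    """Apply the two-stage separation matrix F(f) = W(f) B(f) in every bin.

    X: (F, M, NT) mixtures in the time-frequency domain (assumed layout).
    B: (F, K, M) fixed beamforming filters.
    W: (F, N, K) separation matrices estimated by the BSS step.
    Returns the estimated sources Y of shape (F, N, NT) and F of shape (F, N, M).
    """
    F_mat = W @ B      # Eq. (4), one product per frequency bin
    Y = F_mat @ X      # Eq. (3)
    return Y, F_mat
```

Computing `W @ (B @ X)` instead gives the same result and exposes the beamformer outputs as an intermediate value, which is the form needed when `W` is estimated from those outputs.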

2.1 Beamforming preprocessing

The role of the beamformer is essentially to reduce the reverberation and the interferences coming from directions other than the steering directions. Once the reverberation is reduced, Equation (2) is better satisfied, which leads to an improved

We consider $\{B(f)\}_{1 \le f \le \frac{N_f}{2} + 1}$ a set of fixed beamforming filters of size $K \times M$, where $K$ is the number of desired beams, $K \ge N$. Those filters are calculated beforehand (before the beginning of the processing) and used in the beamforming preprocessing step (cf. Section 3). The outputs of the beamformers at each frequency $f$ are:

$$Z(f, k) = B(f)\, X(f, k) \qquad (5)$$

2.2 Blind source separation

The BSS step consists in estimating a separation matrix $W(f)$ that leads to separated sources at each frequency bin $f$. The separation matrix $W(f)$ is estimated by minimizing, with respect to $W(f)$, a cost function $\psi$ based on a sparsity criterion, under a unit norm constraint for $W(f)$. The chosen optimization technique is the natural gradient (cf. Section 4). The separation matrix is estimated from the output signals of the beamformers $Z(f, k)$ and, combining Equations (3), (4) and (5), the estimated sources are then written as:

$$Y(f, k) = W(f)\, Z(f, k) \qquad (6)$$

3 Fixed beamforming using HRTF

In the case of robot audition, the geometry of the microphone array is fixed once and for all. To build the fixed beamformers, we need to determine the “desired” steering directions and the characteristics of the beam pattern (the beamwidth, the amplitude of the sidelobes and the position of the nulls). The beamformers are estimated only once for all scenarios using this spatial information, independently of the mixture measured at the sensors.

The least-squares (LS) technique is used [6] to estimate the beamformer filters that will achieve the desired beam pattern according to a desired direction response. To estimate these beamformers, we need to calculate the steering vectors, which represent the phase delays of a plane wave evaluated at the microphone array elements.

In the free field, the steering vector of an $M$-element array at a frequency $f$ and for a steering direction $\theta$ is known. For example, for a linear array, we have:

$$a(f, \theta) = \left[1, e^{-j 2\pi f \frac{d}{c} \sin\theta}, \ldots, e^{-j 2\pi f \frac{d}{c} (M-1) \sin\theta}\right]^T \qquad (7)$$

where $d$ is the distance between two sensors and $c$ is the speed of sound.
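As a concrete instance of Equation (7), the following NumPy sketch computes the free-field steering vector of a uniform linear array. The function name `ula_steering_vector` and the default speed of sound of 343 m/s are assumptions made for illustration.

```python
import numpy as np

def ula_steering_vector(f, theta, M, d, c=343.0):
    """Free-field steering vector of Eq. (7) for a uniform linear array.

    f: frequency in Hz, theta: steering direction in radians,
    M: number of sensors, d: inter-sensor distance in meters,
    c: speed of sound in m/s (343 m/s is an assumed default).
    """
    m = np.arange(M)   # sensor indices 0, ..., M - 1
    return np.exp(-1j * 2 * np.pi * f * (d / c) * m * np.sin(theta))
```

For a broadside direction (`theta = 0`) all entries equal 1, since the plane wave reaches every sensor at the same time.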

In the case of robot audition, the microphones are often fixed in the head of


Equation (7) does not take into account the influence of the head on the surrounding acoustic fields, and in this case, the microphone array manifold is not modeled (unknown).

In human hearing, the sound source is spectrally filtered by the head and the pinna, and thus a transfer function between the source and each ear is defined and referred to as the HRTF. The HRTF takes into account the interaural time difference (ITD, see endnote b), the interaural intensity difference (IID, see endnote c) and the shape of the head and the pinna. It defines how a sound emitted from a specific location and altered by the head and the pinna is received at an ear. The notion of HRTF remains the same if we replace the human head by a dummy head and the ears by two microphones. We extend the usual concept of binaural HRTF to the context of robot audition where the humanoid is equipped with a microphone array. In our case, an HRTF $h_m(f, \theta)$ at frequency $f$ characterizes how a signal emitted from a specific direction $\theta$ is received at the $m$th sensor fixed in a head.

We propose to use the HRTFs as steering vectors for the calculation of the beamformer filters (cf. Figure 3) and replace the unknown array manifold by a discrete distribution of HRTFs on a group of $N_S$ a priori chosen steering directions $\Theta = \{\theta_1, \ldots, \theta_{N_S}\}$. The HRTFs are measured in an anechoic room (cf. Section 5.1.1).

Let $h_m(f, \theta)$ be the HRTF at frequency $f$ from the emission point located at $\theta$ to the $m$th sensor. The steering vector is then:

$$a(f, \theta) = [h_1(f, \theta), \ldots, h_M(f, \theta)]^T \qquad (8)$$

Given Equation (8), one can express the normalized LS beamformer for a desired direction $\theta_i$ as [6]:

$$b(f, \theta_i) = \frac{R_{aa}^{-1}(f)\, a(f, \theta_i)}{a^H(f, \theta_i)\, R_{aa}^{-1}(f)\, a(f, \theta_i)} \qquad (9)$$

where $R_{aa}(f) = \frac{1}{N_S} \sum_{\theta \in \Theta} a(f, \theta)\, a^H(f, \theta)$. Given $K$ desired steering directions $\theta_1, \ldots, \theta_K$, the beamforming matrix $B(f)$ is:

$$B(f) = [b(f, \theta_1), \ldots, b(f, \theta_K)]^T \qquad (10)$$
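Equations (8) to (10) can be sketched in NumPy for a single frequency bin as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `ls_beamformers`, the layout of the steering vectors (one HRTF vector per row) and the small diagonal loading added before inversion are assumptions that do not appear in the text.

```python
import numpy as np

def ls_beamformers(A, desired_idx, loading=1e-6):
    """Normalized LS beamformers of Eq. (9), stacked as in Eq. (10).

    A: (NS, M) array whose rows are the steering vectors a(f, theta)^T of
       Eq. (8) for one frequency, e.g. measured HRTFs (assumed layout).
    desired_idx: indices, among the NS directions, of the K desired beams.
    loading: relative diagonal loading for numerical stability (an
             assumption of this sketch, not part of Eq. (9)).
    Returns B of shape (K, M), whose rows are b(f, theta_i)^T.
    """
    NS, M = A.shape
    # R_aa = (1 / NS) * sum over theta of a a^H
    Raa = (A.T @ A.conj()) / NS
    Raa = Raa + loading * (np.trace(Raa).real / M) * np.eye(M)
    rows = []
    for i in desired_idx:
        a = A[i]
        Ra = np.linalg.solve(Raa, a)          # R_aa^{-1} a
        rows.append(Ra / (a.conj() @ Ra))     # Eq. (9)
    return np.array(rows)
```

By construction each filter satisfies $a^H(f, \theta_i)\, b(f, \theta_i) = 1$, which is the normalization expressed by the denominator of Equation (9).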

In the following, we present the different configurations of the combined

beamforming-BSS algorithm.

3.1 Beamforming with known DOA

If the directions of arrival (DOAs) of the sources are known a priori, mainly by a source localization method, the beamforming filters are estimated using this spatial information on the source locations (cf. Figure 4). Therefore, the desired

to build the desired response vectors a ( f , θ ). This is an ideal method to compare

our results with. Indeed, we consider that source localization is beyond the scope

of this article (in [7] where the beamforming with known DOAs was proposed for

a circular microphone array, the authors have assumed that the DOAs are known

a priori).

3.2 Beamforming with fixed DOA

Estimating the DOAs of the sources to build the beamformers is time consuming and not always accurate in reverberant environments. So we propose to build K fixed beams with arbitrary desired directions chosen such that they cover all the useful space directions (cf. Figure 5). We use the outputs of all the beamformers directly in the BSS algorithm. In this case, we still have an overdetermined separation problem with N sources and K mixtures.

3.3 Beamforming with beams selection

In this configuration, we still have K fixed beams with arbitrary desired directions, but we do not use all the outputs of those beamformers (cf. Figure 6). We select the N beamformer outputs with the highest energy, corresponding to

sources are quite close to each other). In this case, after beamforming, we are in a

determined separation problem with N sources and K = N mixtures (cf. Algorithm

2).
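The selection of the N beamformer outputs with the highest energy can be sketched as follows for one frequency bin. The function name `select_beams` and the array layout are assumptions made for illustration.

```python
import numpy as np

def select_beams(Z, n_select):
    """Keep the n_select beamformer outputs with the highest energy.

    Z: (K, NT) beamformer outputs Z(f, :) for one frequency bin (assumed layout).
    Returns the selected outputs, shape (n_select, NT), and the indices of
    the selected beams sorted in increasing order.
    """
    energy = np.sum(np.abs(Z) ** 2, axis=1)           # energy of each beam
    idx = np.sort(np.argsort(energy)[-n_select:])     # strongest beams
    return Z[idx], idx
```

With `n_select` equal to the number of sources, the BSS step that follows operates on a square (determined) problem.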

Fixed beamforming with beam selection can also be used to estimate the number of sources and to obtain a rough DOA estimation. We fix a maximum number of sources Nmax < K. In each frequency bin, after the beamforming filtering (5), we select the Nmax beams with the highest energies (instead of selecting N beams as in the previous paragraph). Then, we build over all the selected steering directions a histogram of their overall number of occurrences (cf. Figure 7). After thresholding, we select the beams corresponding to the peaks (a peak corresponds to a local maximum of the number of selected beams over all the frequencies). The filters that correspond to those beams are our final beamforming filters, the number of peaks corresponds to the number of sources and the corresponding steering directions provide us with a rough estimation of the DOAs.
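The histogram-based estimation of the number of sources described in this subsection can be sketched as follows. This is only an illustration: the function name `estimate_sources_from_beams`, the use of an absolute count as threshold and the exact definition of a local maximum (strictly above the left neighbor, at least equal to the right one) are assumptions, since the text does not specify them.

```python
import numpy as np

def estimate_sources_from_beams(Z, n_max, threshold):
    """Histogram of the strongest beams over all frequency bins.

    Z: (F, K, NT) beamformer outputs (assumed layout).
    n_max: maximum number of sources, n_max < K.
    threshold: minimum number of occurrences for a peak (assumed rule).
    Returns the indices of the peak beams and the histogram.
    """
    n_bins, K, _ = Z.shape
    energy = np.sum(np.abs(Z) ** 2, axis=2)           # (F, K) beam energies
    top = np.argsort(energy, axis=1)[:, -n_max:]      # n_max strongest per bin
    hist = np.bincount(top.ravel(), minlength=K)      # occurrences per beam
    padded = np.concatenate(([-1], hist, [-1]))
    is_peak = ((hist > padded[:-2]) & (hist >= padded[2:])
               & (hist >= threshold))
    return np.flatnonzero(is_peak), hist
```

The number of returned peaks plays the role of the estimated number of sources, and the steering directions attached to those beams give the rough DOA estimate.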

4 BSS using sparsity criterion

In the BSS step, we estimate the separation matrix $W(f)$ by minimizing, with respect to the separation matrix $W(f)$, a cost function $\psi$ based on a sparsity criterion, under a unit norm constraint for $W(f)$:

$$\min_{W} \psi(W(f)) \quad \text{such that} \quad \|W(f)\| = 1 \qquad (11)$$

The optimization technique used to update the separation matrix $W(f)$ is the natural gradient. Section 4.1 summarizes the natural gradient algorithm [11], and Section 4.2 shows how we use this optimization algorithm with our cost function.

4.1 Natural gradient algorithm

The natural gradient is an optimization method proposed by Amari et al. [11]. In this modified gradient search method, the standard gradient search direction is altered according to the local Riemannian structure of the parameter space. This guarantees the invariance of the natural gradient search direction to the statistical relationship between the parameters of the model and leads to a statistically efficient learning performance [12].

function ψ (W). The gradient update of this matrix is given by:

$$W_{t+1} = W_t - \mu \nabla\psi(W_t) \qquad (12)$$

where $\nabla\psi(W)$ is the gradient of the function $\psi(W)$ and $t$ refers to the iteration (or time) index. From [12], the natural gradient of a loss function $\psi(W)$, noted $\tilde{\nabla}\psi(W)$, is given by:

$$\tilde{\nabla}\psi(W) = \nabla\psi(W)\, W^H W \qquad (13)$$

The natural gradient update of the separation matrix $W$ is then:

$$W_{t+1} = W_t - \mu \nabla\psi(W_t)\, W_t^H W_t \qquad (14)$$

4.2 Sparsity separation criterion

Speech signals are known to be sparse in the time-frequency domain: the number of time-frequency points where the speech signal is active (i.e., of non-negligible energy) is small compared to the total number of time-frequency points (cf. Figure 8).

We consider a separation criterion based on the sparsity of the signals in the time-frequency domain. For every frequency bin, we look for a separation matrix

In the same manner, we define the mixture matrix in each frequency bin X ( f , :) =

[X ( f , 1) , . . . , X ( f , NT)].

To measure the sparsity of a signal, the $l_1$ norm is the most widely used measure thanks to its convexity [13]. The smaller the $l_1$ norm of a signal, the sparser it is. However, the $l_1$ norm is not the only measure of sparsity [13]. We recently presented a parameterized $l_p$ norm algorithm for BSS, where we made the sparsity constraint harder through the iterations of the optimization process [14]. In this article, we use the $l_1$ norm to measure the sparsity of the signal $Y(f, :)$, and hence the cost function is:

$$\psi(W(f)) = \sum_{i=1}^{N} \sum_{k=1}^{N_T} |Y_i(f, k)| \qquad (15)$$

To have the sparsest estimated sources, we should minimize $\psi(W(f))$, and we use the natural gradient search technique to find the optimum separation matrix $W(f)$:

$$W_{t+1}(f) = W_t(f) - \mu \nabla\psi(W_t(f))\, W_t^H(f)\, W_t(f) \qquad (16)$$

The differential of $\psi(W(f))$ is expressed in terms of $f(Y(f, :)) = \mathrm{sign}(Y(f, :))$, a matrix with the same size as $Y(f, :)$ in which the $(i, j)$th entry is $\mathrm{sign}(Y_i(f, j))$ (see endnote d). Thus, the gradient of $\psi(W)$ is expressed as:

$$\nabla\psi(W(f)) = f(Y(f, :))\, X^H(f, :) \qquad (18)$$

which gives the expression of the natural gradient of $\psi(W_t(f))$:

$$\tilde{\nabla}\psi(W_t(f)) = \nabla\psi(W_t(f))\, W_t^H(f)\, W_t(f) = f(Y_t(f, :))\, Y_t^H(f, :)\, W_t(f) \qquad (19)$$

The update equation of $W_t(f)$ for a frequency bin $f$ is then:

$$W_{t+1}(f) = W_t(f) - \mu\, G_t(f)\, W_t(f) \qquad (20)$$

with $G_t(f) = f(Y_t(f, :))\, Y_t^H(f, :)$.
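One iteration of the update of Equation (20) can be sketched in NumPy as follows, for a single frequency bin. This is a minimal sketch under assumptions: the function names are hypothetical, a small `eps` guards the division in the complex sign, and the unit norm constraint of Equation (11) is not enforced here.

```python
import numpy as np

def complex_sign(Y, eps=1e-12):
    """Entry-wise sign of a complex matrix, z / |z| (eps avoids 0 / 0)."""
    return Y / np.maximum(np.abs(Y), eps)

def l1_cost(Y):
    """Cost function of Eq. (15): sum of the moduli of the entries of Y."""
    return np.sum(np.abs(Y))

def gradient_matrix(Y):
    """G_t = f(Y_t) Y_t^H, with f the entry-wise complex sign."""
    return complex_sign(Y) @ Y.conj().T

def natural_gradient_step(W, X, mu):
    """One update W_{t+1} = W_t - mu * G_t W_t, as in Eq. (20).

    W: (N, K) current separation matrix, X: (K, NT) input data for one
    frequency bin (mixtures or beamformer outputs), mu: step size.
    """
    Y = W @ X
    return W - mu * gradient_matrix(Y) @ W
```

A quick sanity check on the algebra is that the trace of `gradient_matrix(Y)` equals `l1_cost(Y)`, because each diagonal entry sums $\mathrm{sign}(Y_i) \overline{Y_i} = |Y_i|$ over the time frames.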

The convergence of the natural gradient is conditioned both by the initial coefficients $W_0(f)$ of the separation matrix and by the step size of the update, and it is quite difficult to choose parameters that allow fast convergence without risking divergence. Douglas and Gupta [15] proposed to impose a scaling constraint on the separation matrix $W_t(f)$ to maintain a constant gradient magnitude along

the algorithm has fast convergence and excellent performance independently of the magnitude of $X(f, :)$ and $W_0(f)$. Applying this scaling constraint, our update function becomes:

$$W_{t+1}(f) = c_t(f)\, W_t(f) - \mu\, c_t^2(f)\, G_t(f)\, W_t(f) \qquad (21)$$

with $c_t(f) = \dfrac{1}{\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \left| g_{ij}^t(f) \right|}$ and $g_{ij}^t(f) = [G_t(f)]_{ij}$.
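The scaled update of Equation (21) can be sketched as follows for one frequency bin. The function name is hypothetical, and the scaling factor is computed with the moduli of the entries of $G_t(f)$, which is an assumption about the exact form of $c_t(f)$.

```python
import numpy as np

def scaled_natural_gradient_step(W, X, mu):
    """One scaled update, following Eq. (21).

    W: (N, K) current separation matrix, X: (K, NT) input data for one
    frequency bin, mu: step size.
    """
    Y = W @ X
    sign_Y = Y / np.maximum(np.abs(Y), 1e-12)   # entry-wise complex sign
    G = sign_Y @ Y.conj().T                     # G_t = f(Y_t) Y_t^H
    N = G.shape[0]
    # Scaling factor: inverse of (1 / N) * sum of |g_ij| (assumed form).
    c = 1.0 / (np.sum(np.abs(G)) / N)
    return c * W - mu * c ** 2 * (G @ W)
```

With this scaling, multiplying the input data by a positive constant divides the updated matrix by the same constant, so the separated signals after one step do not depend on the magnitude of the input, in line with the robustness property stated above.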

4.3 Initialization

When we are in an overdetermined case, we use a whitening process for the initialization of the separation matrix $W_0$. Whitening is an important preprocessing in an overdetermined BSS algorithm as it focuses the energy of the received signals in the useful signal space. The separation matrix is initialized as follows:

$$W_0 = \sqrt{D_M^{-1}}\, E_{:M}^H$$

where $D_M$ is a matrix containing the first $M$ rows and $M$ columns of the matrix $D$ and $E_{:M}$ is the matrix containing the first $M$ columns of the matrix $E$. $D$ and $E$ are the eigenvalue and eigenvector matrices given by the decomposition of the autocorrelation matrix of the received data $X(f, :)$ or of the filtered data after beamforming $Z(f, :)$.

If we are in a determined case, in particular when we select the beams with the highest energy after the beamforming filtering or when the steering directions correspond to the directions of arrival of the sources, the separation matrix is initialized with the identity matrix:

$$W_0 = I_N$$
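A whitening-based initialization of this kind can be sketched with an eigendecomposition of the sample autocorrelation matrix. The function name `whitening_init`, the parameter `n_keep` (the number of retained components) and the sorting of the eigenvalues in decreasing order are assumptions made for illustration.

```python
import numpy as np

def whitening_init(X, n_keep):
    """Whitening matrix W0 = sqrt(D^{-1}) E^H restricted to n_keep components.

    X: (M, NT) data for one frequency bin (mixtures or beamformer outputs).
    n_keep: number of principal components kept (hypothetical parameter).
    """
    R = (X @ X.conj().T) / X.shape[1]        # sample autocorrelation matrix
    eigval, eigvec = np.linalg.eigh(R)       # Hermitian eigendecomposition
    order = np.argsort(eigval)[::-1][:n_keep]   # largest eigenvalues first
    D = eigval[order]
    E = eigvec[:, order]
    return np.diag(1.0 / np.sqrt(D)) @ E.conj().T
```

The whitened data `W0 @ X` then has an identity sample autocorrelation matrix, which is what makes this a convenient starting point for the iterative updates.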

5 Experimental results

5.1 Experimental database

To evaluate the proposed BSS techniques, we built two databases: an HRTF database and a speech database.

5.1.1 HRTF database

We recorded the HRTF database in the anechoic room of Telecom ParisTech (cf. Figure 2) using the Golay codes process [16]. As we are in a robot audition context, we model the future robot by a child-size dummy (1.20 m) for the sound

We measured 504 HRTFs for each microphone as follows:

• 72 azimuth angles from 0° to 355° with a 5° step

• 7 elevation angles: −40°, −27°, 0°, 20°, 45°, 60° and 90°

To measure the HRTFs, the dummy was fixed on a turntable in the center of the loudspeaker arc in the anechoic room (cf. Figure 2). For each azimuth angle, a sequence of complementary Golay codes is emitted sequentially from each loudspeaker (to vary the elevation) and recorded with the 16-sensor array. This operation was repeated for all the azimuth angles. Golay complementary sequences have the useful property that their autocorrelation functions have complementary sidelobes: the sum of the autocorrelation sequences is exactly zero everywhere except at the origin. Using this property and the recorded complementary Golay codes, the HRTFs are calculated as in [16].
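The complementary property of Golay sequences can be checked numerically. The sketch below builds a pair with the standard recursive construction and sums the two autocorrelations; the function names are hypothetical, and this only illustrates the property itself, not the measurement procedure of [16].

```python
import numpy as np

def golay_pair(n_doublings):
    """Complementary Golay pair of length 2**n_doublings.

    Recursive construction: (a, b) -> (a | b, a | -b), starting from ([1], [1]).
    """
    a = np.array([1.0])
    b = np.array([1.0])
    for _ in range(n_doublings):
        a, b = np.concatenate((a, b)), np.concatenate((a, -b))
    return a, b

def autocorr(x):
    """Full aperiodic autocorrelation; lag 0 is at index len(x) - 1."""
    return np.correlate(x, x, mode="full")
```

For a pair of length `L`, `autocorr(a) + autocorr(b)` equals `2 * L` at lag 0 and zero at every other lag, so correlating the recorded responses with the emitted codes and summing the two results isolates the impulse response up to a known gain.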

Details about the experimental process of HRTF calculation as well as the

HRTF databases at the sampling frequencies of 48 and 16 kHz are available at

http://www.tsi.telecom-paristech.fr/aao/?p=347.

5.1.2 Test database

The test signals were recorded in a moderately reverberant room where the


positions of the sources in the room. We chose to evaluate the proposed algorithm

on a separation of two sources: the first source is always the one placed at 0° and

the second source is chosen from 20° to 90°.

The output signals $x(t)$ are the convolutions of 40 pairs of speech sources (male and female speakers, in French and English) by two of the impulse responses $\{h(l)\}_{0 \le l \le L}$ measured for the directions of arrival presented in Figure 11.

The characteristics of the signals and the BSS algorithms are summarized in

Table 1.

5.2 Evaluation results

In this section, we evaluate different configurations of the presented algorithm (see endnote e):

(1) The beamforming stage only: beamforming of 37 lobes from −90° to 90° with a step angle of 5° (BF[5°])

(2) The BSS algorithm only

(a) with minimization of the l1 norm (BSS-l1)

(b) with ICA from [15] (ICA)

(3) The two-stage algorithm: beamforming preprocessing followed by BSS

(a) beamforming of N lobes in the DOA of the sources (BF[DOA] + BSS-l1)

(b) beamforming of 7 lobes from −90° to 90° with a step angle of 30° (BF[30°] + BSS-l1 when the l1 norm minimization is used in the BSS step and BF[30°] + ICA when ICA is used in the BSS step)

(c) beamforming of 13 lobes from −90° to 90° with a step angle of 15° (BF[15°] + BSS-l1)

(d) beamforming of 19 lobes from −90° to 90° with a step angle of 10° (BF[10°] + BSS-l1)

(e) beamforming of 37 lobes from −90° to 90° with a step angle of 5° (BF[5°] + BSS-l1)

(f) beamforming of 7 lobes from −90° to 90° with a step angle of 30° with selection of the N beams containing the highest energy before proceeding to the BSS (BF[30°] + BS + BSS-l1)

(g) beamforming of 37 lobes from −90° to 90° with a step angle of 5° with selection of the N beams containing the highest energy before proceeding to the BSS (BF[5°] + BS + BSS-l1)
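The numbers of lobes quoted in the configurations above follow directly from the angular range and the step angle. A short NumPy check, with a hypothetical helper name `steering_grid`:

```python
import numpy as np

def steering_grid(step_deg, lo=-90, hi=90):
    """Steering directions in degrees from lo to hi (inclusive) with a fixed step."""
    return np.arange(lo, hi + 1, step_deg)

# Number of beams for each inter-beam angle used in the evaluation.
beam_counts = {step: len(steering_grid(step)) for step in (30, 15, 10, 5)}
```

The resulting counts are 7, 13, 19 and 37 beams for steps of 30°, 15°, 10° and 5°, matching the configurations listed.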

We evaluate the proposed two-stage algorithm by the signal-to-interference ratio


presented curves are the average result of the 40 pairs of speech.

5.2.1 Influence of the beamforming preprocessing

Figures 12 and 13 show that the SIR and SDR of the two-stage algorithm with the fixed beamforming preprocessing (BF[5°] + BSS-l1 and BF[30°] + BSS-l1) are better than the SIR and SDR of the separation algorithm with the l1 norm alone (BSS-l1) and much better than the ones obtained with the fixed beamforming only (BF[5°]). The SIR and SDR of the signals received at microphones 1 and 2 (labeled as sensors data in the figures) are taken as a reference to illustrate the performance gain of our method. However, this increase in SIR and SDR brought by the fixed beamforming preprocessing is limited and does not reach the performance of the beamforming preprocessing with known DOA (BF[DOA] + BSS-l1), as shown in Figures 14 and 15. We can overcome this limitation with beam selection, as shown in the sequel.

Figures 16 and 17 show the SIR and SDR obtained with different inter-beam angles of the beamforming preprocessing, with steering directions varying from −90° to 90°: beamforming with 7 beams with a step angle of 30° (BF[30°] + BSS-l1), beamforming with 13 beams with a step angle of 15° (BF[15°] + BSS-l1), beamforming with 19 beams with a step angle of 10° (BF[10°] + BSS-l1) and beamforming with 37 beams with a step angle of 5° (BF[5°] + BSS-l1). The results show that when we increase the number of beams, the SIR and especially the SDR increase. For BF[15°] + BSS-l1, BF[10°] + BSS-l1 and BF[5°] + BSS-l1, the beamforming preprocessing increases the SDR of the estimated sources compared with the single-stage BSS-l1 algorithm. The SIR with a beamforming preprocessing is also better than that of the single-stage BSS-l1 algorithm, for all the tested configurations of the fixed steering direction beamforming preprocessing.

Influence of the beams selection

As we can observe from Figures 12, 13, 14, and 15, the beamforming preprocessing with beam selection (BF[30°] + BS + BSS-l1 and BF[5°] + BS + BSS-l1) and the beamforming preprocessing with known directions of arrival (BF[DOA] + BSS-l1) have close results in terms of SIR (cf. Figures 12 and 14) and SDR (cf. Figures 13 and 15). However, in a reverberant environment where the directions of arrival cannot be estimated accurately, the beamforming preprocessing with beam selection would be a good solution to improve the SIR and the SDR of the estimated sources compared to the use of the BSS algorithm only (BSS-l1).

Comparing BF[5°] + BS + BSS-l1 in Figure 12 and BF[30°] + BS + BSS-l1 in

to the separation gain. However, the beamforming preprocessing with beam selection and a 5° inter-beam angle step allows us to estimate the DOAs of the sources correctly, with a resolution of 5°, as shown in Figure 18. The latter represents the selected beam directions for all considered experiments (i.e., the 40 experiments) and for different source locations.

5.2.2 Comparison between BSS-l1 and ICA

Independent component analysis and the l1 norm minimization give quite close results, with or without the preprocessing step. However, we believe that replacing BSS-l1 by BSS-lp with p < 1 or with a varying p value might lead to a significant improvement of the separation quality. This observation is based on the preliminary results we obtained in [14] and will be the focus of future investigations.

5.2.3 Convergence analysis

We analyze the convergence of the proposed algorithm by observing the convergence rates through the iterations and for the considered DOAs (cf. Figure 19). Each curve represents the cost function (15) averaged over all the frequencies. As we can see in Figure 19b, our iterative algorithm

notice also that the convergence rate of the proposed two-stage method with beam selection is better than that of BSS-l1. Indeed, in this context, the separation algorithm BSS-l1 converges to its steady state after 30 to 40 iterations. Moreover, the cost function of the two-stage algorithm reaches lower values than that of the separation algorithm alone; thus, the beamforming preprocessing helps convergence.

6 Conclusion

In this article, we present a two-stage BSS algorithm for robot audition. The first

stage is a preprocessing step with fixed beamforming. To deal with the effect of

the head of the robot in the acoustic near field and model the manifold of the

sensors array, we used HRTFs as steering vectors in the beamformers estimation

step. The second stage is a BSS algorithm exploiting the sparsity of the sources

in the time-frequency domain.

We tested different configurations of this algorithm with steering directions of

the beams equal to the direction of arrivals of the sources and with fixed steering

directions. We also varied the step angle between the beams. The beamforming


and noise effects. The maximum gain is obtained when we select the beams with the highest energies and use the corresponding filters as beamformers or when the source DOAs are known. The beamforming preprocessing with fixed steering directions also performs well and does not require an estimation of the DOAs or beam selection, which represents a gain in processing time. Using the 5° step beamforming preprocessing with beam selection, we can also obtain a rough estimation of the directions of arrival of the sources.

Acknowledgement

This work is funded by the Ile-de-France region, the General Directorate for

Competitiveness, Industry and Services (DGCIS) and the City of Paris, as a part of the

ROMEO project.

Competing interests

Endnotes

a Romeo project: www.projetromeo.com.

b The ITD is the difference in arrival times of a sound wavefront at the left and right ears.

c The IID is the amplitude difference of a sound that reaches the right and left ears.

d For a complex number $z$, $\operatorname{sign}(z) = z/|z|$.

e The names of the algorithms that we are going to use in the legends

References

[1] Pierre Comon and Christian Jutten, Handbook of Blind Source Separation:

Independent Component Analysis and Applications, Elsevier, 2010.

[2] Y. Tamai, Y. Sasaki, S. Kagami, and H. Mizoguchi, “Three ring microphone

array for 3D sound localization and separation for mobile robot audition,”

IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.

4172–4177, Aug. 2005.

[3] S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J.-M. Valin, K. Komatani,

T. Ogata, and H.G. Okuno, “Design and implementation of a robot

audition system for automatic speech recognition of simultaneous speech,”

IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 111–116,

2007.

[4] H. Nakajima, K. Nakadai, Y. Hasegawa, and H. Tsujino, “High performance

sound source separation adaptable to environmental changes for robot

audition,” IEEE/RSJ International Conference on Intelligent Robots and

Systems, pp. 2165–2171, Sept. 2008.

T. Morita, “Two-stage blind source separation based on ICA and binary

masking for real-time robot audition system,” IEEE/RSJ International Conference

on Intelligent Robots and Systems, pp. 2303–2308, 2005.

[6] Jacob Benesty, Jingdong Chen, and Yiteng Huang, Microphone Array Signal

Processing, Chapter 3: Conventional beamforming techniques, Springer,

1st edition, 2008.

[7] Lin Wang, Heping Ding, and Fuliang Yin, “Combining superdirective

beamforming and frequency-domain blind source separation for highly

reverberant signals,” EURASIP Journal on Audio, Speech, and Music Processing,

vol. 2010, 2010.

[8] Pierre Comon, “Independent component analysis, a new concept?,” Signal

Processing, 1994.

[9] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and

K. Shikano, “Blind source separation combining independent component

analysis and beamforming,” EURASIP Journal on Applied Signal

[10] Wang Weihua and Huang Fenggang, “Improved method for solving

permutation problem of frequency domain blind source separation,” 6th IEEE

International Conference on Industrial Informatics, pp. 703–706, July 2008.

[11] S. Amari, A. Cichocki, and H. H. Yang, “A new learning algorithm for blind

signal separation,” Advances in Neural Information Processing Systems, pp.

757–763, 1996.

[12] Shun-Ichi Amari, “Natural gradient works efficiently in learning,” Neural

Computation, vol. 10, pp. 251–276, 1998.

[13] Niall Hurley and Scott Rickard, “Comparing measures of sparsity,” IEEE

Workshop on Machine Learning for Signal Processing, vol. 55, pp. 4723–4741,

October 2009.

[14] M. Maazaoui, Y. Grenier, and K. Abed-Meraim, “Frequency domain blind

source separation for robot audition using a parameterized sparsity

criterion,” 19th European Signal Processing Conference, EUSIPCO, 2011.

[15] S.C. Douglas and M. Gupta, “Scaled natural gradient algorithms for

instantaneous and convolutive blind source separation,” IEEE International

Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 637–640,

[16] S. Foster, “Impulse response measurement using Golay codes,” in IEEE

International Conference on Acoustics, Speech, and Signal Processing,

ICASSP ’86, Apr. 1986, vol. 11, pp. 929–932.

[17] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in

blind audio source separation,” IEEE Transactions on Audio, Speech, and

Algorithm 1 Combined beamforming and BSS algorithm

1. Input:
   (a) The output of the microphone array $x = [x(t_1), \ldots, x(t_T)]$
   (b) The precalculated beamforming filters $\{B(f)\}_{1 \le f \le N_f/2 + 1}$
2. $\{X(f,k)\}_{1 \le f \le N_f,\ 1 \le k \le N_T} = \mathrm{STFT}(x)$
3. for each frequency bin $f$:
   (a) beamforming preprocessing step: $Z(f,:) = B(f)\,X(f,:)$
   (b) initialization step: $W(f) = W_0(f)$
   (c) $Y_0(f,:) = W_0(f)\,Z(f,:)$
   (d) for each iteration $t$: blind source separation step to estimate $W(f)$
4. Permutation problem solving
5. Output: the estimated sources $y = \mathrm{ISTFT}\big(\{Y(f,k)\}_{1 \le f \le N_f,\ 1 \le k \le N_T}\big)$
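Step 3 of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The STFT, the permutation solving and the ISTFT are omitted. The array layouts and the names `csign`, `separate_bin` and `beamform_then_separate` are hypothetical. The separation step is an assumed stand-in: a natural-gradient descent on the l1 norm of $Y = WZ$ with $W$ renormalized to unit Frobenius norm, using the complex sign $z/|z|$ from the endnotes. The update used in the article may differ.

```python
import numpy as np

def csign(y):
    """Complex sign z/|z|; returns 0 for z = 0."""
    return y / np.maximum(np.abs(y), 1e-12)

def separate_bin(Z, W0, mu=0.2, n_iter=50):
    """Assumed BSS step for one frequency bin: natural-gradient descent on
    the l1 norm of Y = W Z, with W kept at unit Frobenius norm."""
    W = W0 / np.linalg.norm(W0)
    n_frames = Z.shape[1]
    for _ in range(n_iter):
        Y = W @ Z
        G = csign(Y) @ Y.conj().T / n_frames   # sign(Y) Y^H / N_T
        W = W - mu * G @ W                     # natural-gradient update
        W = W / np.linalg.norm(W)              # avoid the trivial solution W = 0
    return W

def beamform_then_separate(X, B, mu=0.2, n_iter=50):
    """X: (n_freq, n_mics, n_frames) STFT of the array signals.
    B: (n_freq, n_beams, n_mics) precalculated beamforming filters."""
    n_freq, _, n_frames = X.shape
    n_beams = B.shape[1]
    Y = np.empty((n_freq, n_beams, n_frames), dtype=complex)
    for f in range(n_freq):
        Z = B[f] @ X[f]                        # (a) beamforming preprocessing
        W0 = np.eye(n_beams, dtype=complex)    # (b) initialization (assumed identity)
        W = separate_bin(Z, W0, mu, n_iter)    # (d) iterative separation
        Y[f] = W @ Z
    return Y

# Toy usage with random data (made up): 3 bins, 4 microphones, 2 beams.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 200)) + 1j * rng.standard_normal((3, 4, 200))
B = rng.standard_normal((3, 2, 4)) + 1j * rng.standard_normal((3, 2, 4))
Y = beamform_then_separate(X, B)
print(Y.shape)  # (3, 2, 200)
```

Because the beamformer reduces the number of channels from the number of microphones to the number of beams before separation, the matrices handled in the inner loop are smaller than they would be without preprocessing.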

Algorithm 2 Beams selection algorithm

1. SelectedBeams = ∅
2. for each frequency bin $f$:
   (a) Form $K$ beams (beamformer outputs) $Z(f,:) = B(f)\,X(f,:)$, $Z(f,:) = [z_1(f,:), \ldots, z_K(f,:)]^T$
   (b) Compute the energy of the beamformer outputs: $E(f) = [e_1(f), \ldots, e_K(f)]$ with $e_i(f) = \frac{1}{N_T} \sum_{k=1}^{N_T} |z_i(f,k)|^2$
   (c) Sort $E(f)$ in decreasing order; Beams are the beams corresponding to the sorted energies: Beams = sort($E(f)$)
   (d) Select the $N$ highest energies; the indexes are stored in B.
   (e) SelectedBeams = SelectedBeams ∪ B
3. Compute the frequency of appearance of each beam and store the occurrences in I.
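Steps 1 to 3 of Algorithm 2 can be sketched as follows. This is a hedged illustration: the function name `beam_occurrences`, the array layouts and the toy data are assumptions, and the sketch stops at the occurrence counts I of step 3.

```python
import numpy as np

def beam_occurrences(X, B, n_select):
    """X: (n_freq, n_mics, n_frames) STFT; B: (n_freq, K, n_mics) filters.
    Returns I, where I[i] counts the frequency bins in which beam i is
    among the n_select most energetic beams."""
    n_freq, K = B.shape[0], B.shape[1]
    occurrences = np.zeros(K, dtype=int)
    for f in range(n_freq):
        Z = B[f] @ X[f]                        # (a) K beamformer outputs
        E = np.mean(np.abs(Z) ** 2, axis=1)    # (b) energy of each beam
        order = np.argsort(E)[::-1]            # (c) decreasing energy order
        selected = order[:n_select]            # (d) N highest energies
        occurrences[selected] += 1             # (e) and 3.: accumulate counts
    return occurrences

# Toy usage (made-up data): identity "beamformers" on 3 channels whose
# amplitudes are 3, 1 and 2, so beams 0 and 2 carry the most energy.
X = np.ones((5, 3, 10)) * np.array([3.0, 1.0, 2.0])[None, :, None]
B = np.tile(np.eye(3), (5, 1, 1))
I = beam_occurrences(X, B, n_select=2)
print(I)  # [5 0 5]
```

When each beam index corresponds to a known steering angle, the beams with the largest counts also give a coarse indication of the source directions.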

Table 1: Parameters of the blind source separation algorithms

Parameter                 Value
Sampling frequency        16 kHz
Analysis window           Hanning
Analysis window length    2048 samples
Shift length              1024 samples
µ                         0.2
Signal length             5 s
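The quantities implied by these parameters can be worked out with a few lines of arithmetic. The sketch assumes that the window and shift lengths are given in samples, that the FFT length equals the window length, and that no padding is applied at the signal edges. None of these assumptions is stated in the table.

```python
fs = 16_000          # sampling frequency in Hz
win_len = 2048       # analysis window length in samples (assumed unit)
hop = 1024           # shift length in samples (assumed unit)
duration_s = 5       # signal length in seconds

n_samples = fs * duration_s                   # 80000 samples
frame_ms = 1000 * win_len / fs                # 128.0 ms per analysis window
hop_ms = 1000 * hop / fs                      # 64.0 ms between windows
overlap = 1 - hop / win_len                   # 0.5, i.e. 50% overlap
n_bins = win_len // 2 + 1                     # 1025 non-redundant frequency bins
n_frames = 1 + (n_samples - win_len) // hop   # 77 full frames without padding

print(frame_ms, hop_ms, overlap, n_bins, n_frames)
```

Under these assumptions, the per-bin separation processes 1025 frequency bins with 77 frames each.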

Figure 1: The processing scheme of the combined beamforming-BSS

algorithm.

Figure 2: The dummy in the anechoic room (left) and the microphone array

of 16 sensors (right).

Figure 3: Example of a beam pattern using HRTFs for θi= 50° (in dB).

Figure 4: Beamforming with known DOAs.

Figure 5: Beamforming with fixed steering directions (fixed lobes).

Figure 6: Beamforming with fixed steering directions and beams selection.

Figure 7: Estimation of the source number and DOAs using fixed

beamforming: DOAs = 0° and 40°: we used Nmax = 5, 1024 frequency bins and an

Figure 8: Sparsity of the speech signal in the time-frequency domain

compared with the time domain. (a) Speech sentence in the time domain. (b)

Time-frequency representation of the speech sentence.

Figure 9: The detailed configuration of the microphone array.

Figure 10: Energy decay curve of the room used for the reverberant

recording.

Figure 11: The position of the sources and their directions

reverberant room.

Figure 12: SIR comparison in a real environment: source 1 is at 0° and

source 2 varies from 20° to 90°. Effect of the beamforming preprocessing on

the SIR of the estimated sources.

Figure 13: SDR comparison in a real environment: source 1 is at 0° and

source 2 varies from 20° to 90°. Effect of the beamforming preprocessing on

the SDR of the estimated sources.

Figure 14: SIR comparison in a real environment: source 1 is at 0° and source

2 varies from 20° to 90°.

Figure 15: SDR comparison in a real environment: source 1 is at 0° and source

Figure 16: SIR of different configurations of the beamforming preprocessing

with fixed steering directions: inter-beam angles are 30°, 15°, 10°, and 5°,

respectively.

Figure 17: SDR of different configurations of the beamforming preprocessing

with fixed steering directions: inter-beam angles are 30°, 15°, 10°, and 5°,

Figure 18: DOA estimation using the BF[5°]+BS algorithm for the 40

experiments.

Figure 19: Convergence rates: the value of the cost function through the
