37-39 rue Dareau, 75014 Paris, France alexey.ozerov@telecom-paristech.fr
Abstract. The underdetermined blind audio source separation problem is often addressed in the time-frequency domain by assuming that each time-frequency point is an independently distributed random variable. Other approaches, which are not blind, assume a more structured model, such as Spectral Gaussian Mixture Models (Spectral-GMMs), thus exploiting the statistical diversity of audio sources in the separation process. However, in this latter approach, Spectral-GMMs are assumed to be learned from training signals. In this paper, we propose a new approach for learning Spectral-GMMs of the sources without the need for training signals. The proposed blind method significantly outperforms state-of-the-art approaches on stereophonic instantaneous music mixtures.
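As a minimal illustration of what a Spectral-GMM encodes (a toy sketch under stated assumptions, not the paper's model or training procedure): each spectral frame of a source is assumed drawn from one of K zero-mean Gaussians with a diagonal spectral covariance (one variance per frequency bin), and decoding picks the most likely state per frame. All names, shapes and values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Spectral-GMM": K states, each with one variance per frequency bin.
F, K = 4, 2                                  # frequency bins, GMM states
weights = np.array([0.6, 0.4])               # state priors (assumed)
variances = rng.uniform(0.5, 2.0, (K, F))    # per-state spectral variances

def loglik(frame):
    """Log-posterior (up to a constant) of a complex spectral frame
    under each state, using a circular complex Gaussian model:
    log p = log(weight) - sum(log(pi * v) + |x|^2 / v)."""
    power = np.abs(frame) ** 2
    return (np.log(weights)
            - np.sum(np.log(np.pi * variances) + power / variances, axis=1))

frame = rng.normal(size=F) + 1j * rng.normal(size=F)
state = int(np.argmax(loglik(frame)))        # most likely spectral state
print(state)
```

In the actual separation setting, the per-frame state posteriors of each source's GMM drive a state-dependent Wiener-like filter; this sketch only shows the state-selection step.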

These and other results show that improved separation performance in many scenarios can be obtained by modeling and exploiting spatial and spectral properties of sounds, i.e., by designing models and constraints which account for the specificities of audio sources and acoustic mixing conditions. Two trends can be seen: developing complex, hierarchical models with little training so as to adapt to unknown situations with small amounts of data, or training simpler models on huge amounts of data, e.g., thousands of room impulse responses and dozens of hours of speech, so as to benefit from the power of big data and turn parameter estimation into a model selection problem.

No BSS/ICA algorithm is truly blind, in the sense that a minimal number of assumptions (generally involving some form of prior knowledge) on the sources and/or on the mixing process must be integrated into the algorithms to derive solutions to the separation problem [4].1 In the underdetermined case, many relevant techniques take advantage of the sparse nature of audio source signals. These methods make the (realistic) assumption that, in a given basis, source signals have a parsimonious representation, i.e. most of the source coefficients are close to zero. A direct consequence of sparsity is the limited overlap of sources in the appropriate basis, since the probability that several sources are simultaneously active is low. For most music signals, the time-frequency domain is a natural domain for exploiting sparsity (much more so than the time domain, where source signals generally overlap strongly) [5], [6]. As a consequence, many USS techniques are based on sparse time-frequency (TF) representations of signals. For example, in [7] the authors make the assumption that the non-stationary source signals to be separated are disjoint in the TF domain. Specific points of the TF plane

1 As a major example underlined in [4], the Bayesian approach to BSS
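The sparsity argument above can be illustrated numerically (a toy example, not taken from the cited works): a signal that is dense in time can concentrate almost all of its energy in a few frequency coefficients.

```python
import numpy as np

# A two-tone "source" with an integer number of cycles per window:
# dense in time, but only a handful of significant Fourier coefficients.
N = 1024
t = np.arange(N) / N
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

X = np.fft.rfft(x)
# fraction of coefficients above 1% of the peak, in each domain
time_active = np.mean(np.abs(x) > 0.01 * np.abs(x).max())
freq_active = np.mean(np.abs(X) > 0.01 * np.abs(X).max())
print(freq_active < time_active)   # frequency domain is far sparser
```

With two sources of this kind, the significant frequency coefficients rarely collide, which is exactly the limited-overlap property the text describes.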

Fig. 1. (a) TF disjoint, (b) TF non-disjoint
extension of both time-domain and frequency-domain processing that involves representing signals in a two-dimensional space, the joint TF domain, hence providing a distribution of signal energy versus time and frequency simultaneously. For this reason, a TF representation is commonly referred to as a time-frequency distribution (TFD). TFDs have been applied to a wide variety of engineering problems. Specifically, they have been successfully used for signal recovery at low signal-to-noise ratio (SNR), accurate estimation of the instantaneous frequency (IF), signal detection in communications, radar processing, and the design of time-varying filters. For more details on TFDs and related methods, see for example the recent comprehensive reference [5].

The goal of source separation algorithms is to recover the constituent sources, or audio objects, from their mixture. However, blind algorithms still do not yield estimates of sufficient quality for many practical uses. Informed Source Separation (ISS) is a solution that makes separation robust when the audio objects are known during a so-called encoding stage. During that stage, a small amount of side information is computed and transmitted with the mixture. At a decoding stage, when the sources are no longer available, the mixture is processed using the side information to recover the audio objects, thus greatly improving the quality of the estimates at the cost of additional bitrate, which depends on the size of the side information. In this study, we compare six methods from the state of the art in terms of quality versus bitrate, and show that good separation performance can be attained at competitive bitrates.

H_{1,2i} = [ y_1  y_2  \cdots  y_j ;  y_2  y_3  \cdots  y_{j+1} ;  \vdots ;  y_{2i}  y_{2i+1}  \cdots  y_N ]   (7)

where y_k denotes the m-dimensional measurement vector at sample k, 2i is a user-defined number of row blocks, each block contains m rows (the number of measurement sensors), and j is the number of columns (in practice j = N − 2i + 1, where N is the number of sampling points). The Hankel matrix H_{1,2i} is split into two equal parts of i block rows: past and future data. Thus the algorithm considers vibration signals at different instants and not only instantaneous representations of the responses. This makes it possible to take into account temporal correlations between measurements, where current data depend on past data. Therefore, the objective pursued here in using the block Hankel matrix rather than the observation matrix is to improve the sensitivity of the detection method. The combined methods will be called enhanced SOBI (ESOBI) and enhanced BMID (EBMID) in the following.
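The block Hankel construction described above can be sketched as follows. This is an illustrative implementation of the stated definition (2i row blocks of m rows each, j = N − 2i + 1 columns); the function name and the toy data are assumptions.

```python
import numpy as np

def block_hankel(Y, i):
    """Block Hankel matrix H_{1,2i} with 2*i block rows built from
    m-channel data Y of shape (m, N).

    Row block k (k = 0..2i-1) holds Y[:, k : k + j] with j = N - 2*i + 1.
    The 'past' part is H[: i*m] and the 'future' part is H[i*m :].
    """
    m, N = Y.shape
    j = N - 2 * i + 1
    return np.vstack([Y[:, k:k + j] for k in range(2 * i)])

Y = np.arange(12, dtype=float).reshape(2, 6)   # m=2 sensors, N=6 samples
H = block_hankel(Y, i=2)
print(H.shape)   # (2*i*m, N - 2*i + 1)
```

Stacking time-shifted copies of the data is what lets the subsequent subspace step exploit correlations between past and future measurements, as the text explains.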

I. INTRODUCTION
Blind Source Separation (BSS) is a major tool to learn meaningful decompositions of multivalued data [1], [2]. Most of the work has been dedicated to linear BSS: m observations are linear combinations of n sources of t samples. In matrix form, X = AS + N, with X (size m × t) the observation matrix corrupted by some unknown noise N, S (n × t) the sources, and A (m × n; here, n ≤ m) the mixing matrix. The goal of linear BSS is to recover A and S from X up to a permutation and scaling indeterminacy. While this is ill-posed, the sparsity prior [3] – assuming the sources to have many zero coefficients – has been shown to lead to high separation quality [4], [5]. Less work has however been done on non-linear BSS, where

X = f(S) + N    (1)

with f an unknown non-linear function from R^{n×t} to R^{m×t}. Here, we will consider general functions f, mostly assuming that f is invertible and symmetrical around the origin, as well as regular enough (i.e. L-Lipschitz with L small, and not deviating from a linear mixing too fast as a function of the source amplitude). Despite increased indeterminacies compared to the linear case, [6] claimed the possibility of recovering sparse sources up to a non-linear function h under some conditions. Our approach is fully different from manifold clustering ones [7], [6], and also differs from neural network ones [8], as it brings a geometrical interpretation (and uses the regularity of the mixing) and an automatic hyperparameter choice (potentially enabling increased robustness and building on the linear BSS literature [4]).
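The permutation and scaling indeterminacy mentioned above can be checked directly: relabeling and rescaling the sources, with the inverse transform absorbed into the mixing matrix, leaves X unchanged, so the two factorizations cannot be distinguished from the observations alone. A minimal numerical check (toy sizes, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, t = 3, 2, 5
A = rng.normal(size=(m, n))          # mixing matrix
S = rng.normal(size=(n, t))          # sources

P = np.array([[0., 1.], [1., 0.]])   # permutation: swap the two sources
D = np.diag([2.0, 0.5])              # diagonal rescaling

A2 = A @ P @ D                       # equivalent mixing matrix
S2 = np.linalg.inv(D) @ P.T @ S      # equivalent sources

print(np.allclose(A @ S, A2 @ S2))   # same observations X
```

This is why BSS results are always stated "up to permutation and scaling", and why extra priors such as sparsity are needed to pin the solution down further.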

Ming Jiang∗, Jérôme Bobin∗, and Jean-Luc Starck∗
Abstract. Blind Source Separation (BSS) is a challenging matrix factorization problem that plays a central role in multichannel imaging science. In a large number of applications, such as astrophysics, current unmixing methods are limited since real-world mixtures are generally affected by extra instrumental effects like blurring. Therefore, BSS has to be solved jointly with a deconvolution problem, which requires tackling a new inverse problem: deconvolution BSS (DBSS). In this article, we introduce an innovative DBSS approach, called DecGMCA, based on sparse signal modeling and an efficient alternating projected least-squares algorithm. Numerical results demonstrate that the DecGMCA algorithm performs very well on simulations. They further highlight the importance of jointly solving BSS and deconvolution instead of considering these two problems independently. Furthermore, the performance of the proposed DecGMCA algorithm is demonstrated on simulated radio-interferometric data.

In [6], the authors propose a model for piano spectrogram restoration, based on generalized coupled tensor factorization, where additional information comes from an approximate musical score and spectra of isolated piano sounds. The framework described in [9] proposes a separation model between voice and background guided by another speech sample corresponding to the pronunciation of the same sentence. The speech reference is either recorded by a human speaker or created with a voice synthesizer from the available text pronounced by the speaker of the mixture to be separated. A nonnegative matrix co-factorization (NMcF) model is designed so that some of the factorized matrices are shared by the mixture and the speech reference. The authors of [7] incorporate knowledge of the fundamental frequency f0 in an NMF model, by fixing the source part of a source-filter model to be a harmonic spectral comb following the known f0 value of the target source over time. In the context of audio separation of movie soundtracks, the separation can be guided by other available international versions of the same movie [12]. A cover-informed source separation principle is introduced in [10], where the authors assume that cover multitrack signals are available and use them as initialization of the NMF algorithm. The original mixture and the cover signals are time-aligned in a pre-processing step.

Among the variety of phase recovery techniques, the multiple input spectrogram inversion (MISI) algorithm [13] is particularly popular. This iterative procedure consists in retrieving time-domain sources from their STFT magnitudes while respecting a mixing constraint: the estimates must add up to the mixture. This algorithm exhibits good performance in source separation when combined with DNNs [10, 11]. However, MISI suffers from one limitation: it is derived as a solution to an optimization problem that involves the quadratic loss, which is not the best-suited metric for evaluating discrepancies in the TF domain. For instance, it does not properly account for the large dynamic range of audio signals [14].
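A minimal sketch of the MISI idea as described above: impose the target magnitudes, distribute the mixing-constraint error equally across the sources, and iterate. This is an illustrative reimplementation under stated assumptions, not the reference code of [13]; the window parameters and the toy demo data are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def misi(mixture, mag, n_iter=10, nperseg=256):
    """Toy MISI: recover time-domain sources from target STFT magnitudes
    `mag` (K x F x T) while enforcing that the estimates sum to `mixture`."""
    K = mag.shape[0]
    _, _, X = stft(mixture, nperseg=nperseg)
    phases = np.repeat(np.angle(X)[None], K, axis=0)  # init: mixture phase
    for _ in range(n_iter):
        # back to the time domain with current phases and target magnitudes
        srcs = np.stack([istft(mag[k] * np.exp(1j * phases[k]),
                               nperseg=nperseg)[1][:len(mixture)]
                         for k in range(K)])
        err = mixture - srcs.sum(axis=0)   # mixing-constraint error
        srcs += err / K                    # distribute it equally
        for k in range(K):
            _, _, Sk = stft(srcs[k], nperseg=nperseg)
            phases[k] = np.angle(Sk)       # keep phase, reimpose magnitude next
    return srcs

# toy demo: two synthetic sources, oracle magnitudes (illustrative only)
t = np.arange(1024)
true = np.stack([np.sin(2 * np.pi * 0.01 * t),
                 np.sign(np.sin(2 * np.pi * 0.003 * t))])
mix = true.sum(axis=0)
mags = np.stack([np.abs(stft(s, nperseg=256)[2]) for s in true])
est = misi(mix, mags)
print(est.shape, np.allclose(est.sum(axis=0), mix))
```

Note that the error-redistribution step enforces the mixing constraint exactly at every iteration, which is the defining feature of MISI compared with running spectrogram inversion on each source independently.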

Although it is similar to classification, audio source separation barely exploits fusion principles. Kirbiz et al. implemented in [9] a form of late fusion: NMF (Non-negative Matrix Factorization) based separation is applied in parallel to spectrograms with different time-frequency resolutions, and the resulting time-domain source estimates are combined through an adaptive filter bank. Their paper thus applies data fusion principles to an audio source separation problem through the design of a specific algorithm. In image processing, Meganem et al. also propose in [10] to compute several source estimates using different analysis parameters and to use a correlation measure to select the best estimate.

Fig. 3. Source separation performance for various methods on the DSD100 test dataset. Oracle (top) and estimated (bottom) magnitude spectrograms.
and maximum values, and crosses representing the outliers. We first observe that using the phase unwrapping prior only leads to poor results. Indeed, this technique neglects the phase of the mixture, so the prior error is propagated over time frames, leading to audible artifacts. In both Oracle and non-Oracle scenarios, the proposed estimator (denoted MMSE in Fig. 3) leads to better results than Wiener, but slightly worse than Cons-W in terms of SDR, SIR and SAR. However, we perceptually observe that Cons-W tends to produce more artifacts in the bass and drums tracks than the proposed MMSE technique. Finally, it is important to note that Cons-W is computationally costly: for a 10-second excerpt, the separation is performed in 27 seconds with Cons-W vs. 4 seconds with our estimator. The proposed approach thus appears appealing for efficient audio source separation.

Model-based STFT Phase Recovery for Audio Source Separation
Paul Magron, Roland Badeau, Senior Member, IEEE, and Bertrand David, Member, IEEE
Abstract—For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to further synthesize time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach, which boils down to using the phase of the original mixture. In this paper, a different standpoint is adopted. Many music events are partially composed of slowly varying sinusoids, and the STFT phase increment over time of those frequency components takes a specific form. This allows phase recovery by an unwrapping technique once a short-term frequency estimate has been obtained. Herein, a novel iterative source separation procedure is proposed which builds upon these results. It consists in minimizing the mixing error by means of the auxiliary function method. This procedure is initialized by exploiting the unwrapping technique in order to generate estimates that benefit from a temporal continuity property. Experiments conducted on realistic music pieces show that, given accurate magnitude estimates, this procedure outperforms the state-of-the-art consistent Wiener filter.
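The phase-increment property underlying the unwrapping technique can be checked numerically: for a slowly varying sinusoid at frequency f0, the STFT phase in the dominant bin advances by 2π·f0·hop/fs between consecutive frames. A toy verification (the sample rate, window and hop values are illustrative assumptions, not the paper's settings):

```python
import numpy as np

fs, f0, hop, nfft = 16000, 440.0, 256, 1024
t = np.arange(2 * nfft) / fs
x = np.cos(2 * np.pi * f0 * t)

# two consecutive Hann-windowed frames and their spectra
frames = [x[i * hop:i * hop + nfft] for i in range(2)]
spec = [np.fft.rfft(fr * np.hanning(nfft)) for fr in frames]
k = int(round(f0 * nfft / fs))               # bin closest to f0

measured = np.angle(spec[1][k]) - np.angle(spec[0][k])
predicted = 2 * np.pi * f0 * hop / fs        # sinusoidal-model prediction

def wrap(a):
    """Wrap a phase difference into (-pi, pi]."""
    return (a + np.pi) % (2 * np.pi) - np.pi

print(abs(wrap(measured - predicted)))       # close to zero
```

Given a short-term frequency estimate, this prediction can be accumulated frame by frame, which is the "unwrapping" initialization the abstract refers to.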

The main principle behind most score-informed separation techniques is to make use of the onset/offset information found in MIDI files to correctly initialize the parameters of a parametric model. In the case of the NMF model described above in section 3.1, the score sheets make it possible to set to 0 the activation parameters H(n, j) for a given source when it is known to be inactive. Such a simple procedure is shown in [9] to dramatically increase separation performance, by initializing the parameters to sensible values, hence much closer to the global minimum sought during optimization. Pitch information can also be used to initialize the spectral templates W, or to adequately drive comb filters as in [26]. In the case of more flexible parametric models, such as an NMF with time-varying spectral templates [13], score information may also be used with a noticeable gain in separation quality. The main issue with such score-initialized audio decompositions is the requirement that MIDI files be synchronized with the audio mixtures. Even if efficient alignment techniques do exist for this purpose, mismatch in the alignment may lead to wrongly initialized decompositions, yielding poorly separated sources. In a recent study, Simsekli et al. [28] showed that MIDI information can actually be used without assuming such an alignment. The main fact underlying their technique is that, apart from their temporal position, the score also contains information about co-occurrences of the notes as well as their pitch. Even in case of misalignment, these may be supposed to be the same in the actual audio mixture. In practice, such co-occurrences are modeled as common factors in a Generalized Tensor Factorization framework [34], where both scores and audio are jointly analyzed.
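The score-informed initialization described above relies on a convenient property of multiplicative NMF updates: an activation initialized to zero stays zero throughout the optimization, so the score constraint survives without any extra machinery. A minimal sketch (toy sizes and standard Euclidean multiplicative update rules, used here as assumptions; not the exact model of [9]):

```python
import numpy as np

rng = np.random.default_rng(0)

F, N, T = 6, 2, 8                       # freq bins, sources, time frames
W = rng.random((F, N)) + 0.1            # spectral templates
H = rng.random((N, T)) + 0.1            # activations
H[1, :4] = 0.0                          # score: source 2 silent in frames 0-3
V = rng.random((F, T)) + 0.1            # magnitude spectrogram to factorize

for _ in range(30):                     # Euclidean multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

print(np.all(H[1, :4] == 0))            # the zeroed activations never reactivate
```

Because each update multiplies H elementwise by a nonnegative factor, a zero entry can never become positive, which is exactly why zero-initialization from the score acts as a hard inactivity constraint.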

<firstname>.<lastname>@telecom-paristech.fr
ABSTRACT
In this paper we show that considering early contributions of mixing filters through a probabilistic prior can help blind source separation in reverberant recording conditions. By modeling mixing filters as the direct path plus R−1 reflections, we represent the propagation from a source to a mixture channel as an autoregressive process of order R in the frequency domain. This model is used as a prior to derive a Maximum A Posteriori (MAP) estimation of the mixing filters using the Expectation-Maximization (EM) algorithm. Experimental results on reverberant synthetic mixtures and live recordings show that MAP estimation with this prior provides better separation results than Maximum Likelihood (ML) estimation.

In SO applications, the extracted sources and/or mixing parameters are processed to obtain information at more abstract levels, in order to find a representation of the observations related to human perception. For instance, looking for the number and the kind of instruments in a musical excerpt enters the scope of SO separation. Separation quality criteria are generally less demanding than in AQO applications because the aim of SO separation is only to keep specific features of the sources. Thus, a rough separation may be sufficient (possibly with high distortion), depending on the robustness of the subsequent feature extraction algorithms.

While most source separation methods are designed for a specific scenario, the flexible audio source separation framework in [2] introduced a compositional approach [3] where the mixture signal is modeled by composing multiple source models together. Each model is parameterized by a number of variables, which may be constrained by the user, trained from separate data, or adapted to the considered mixture according to the available information. This framework has been applied to a wide variety of speech and music separation scenarios by exploiting information such as note spectra, cover music recordings, reference speech pronounced by another speaker, or target spatial direction [4–8]. It has also been used as a preprocessing step for instrument recognition, beat tracking, and automatic speech recognition [8, 9].

z_x(f, n) = |x̃(f, n)| was the magnitude of the single-channel signal obtained from the multichannel noisy signals x(f, n) by DS beamforming [45, 46]. In doing so, the time-varying time differences of arrival (TDOAs) between the speaker's mouth and each of the microphones are first measured using the provided baseline speaker localization tool [47], which relies on a nonlinear variant of steered response power using the phase transform (SRP-PHAT) [59, 60]. All channels are then aligned with each other by shifting the phase of the STFT of the input noisy signal x(f, n) in all time-frequency bins (f, n) by the opposite of the measured delay. This preprocessing is required to satisfy the model in (1.2), which assumes that the sources do not move over time. Finally, a single-channel signal is obtained by averaging the realigned channels together. On the output side, the estimated speech spatial images are averaged over channels to obtain a single-channel signal for the speech recognition evaluation, which empirically provided better ASR performance than the use of one of the channels. Likewise, the
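The realignment step described above can be sketched as a per-bin phase shift followed by channel averaging. This is a toy single-frame illustration with assumed delays, not the SRP-PHAT localization pipeline of [47]:

```python
import numpy as np

fs, nfft = 16000, 512
freqs = np.fft.rfftfreq(nfft, d=1 / fs)            # bin frequencies in Hz
taus = np.array([0.0, 2.5e-4])                     # per-channel delays (s), assumed

rng = np.random.default_rng(0)
s = rng.normal(size=nfft)                          # reference frame
S = np.fft.rfft(s)
# simulate two channels, each a (circularly) delayed copy of s
X = np.stack([S * np.exp(-2j * np.pi * freqs * tau) for tau in taus])

# realign: shift each channel's phase by the opposite of its delay, then average
aligned = X * np.exp(2j * np.pi * freqs * taus[:, None])
Y = aligned.mean(axis=0)
print(np.allclose(Y, S))                           # delays removed, DS output = s
```

Applying exactly this shift in every STFT bin of every frame, with SRP-PHAT-estimated (possibly time-varying) delays, gives the delay-and-sum preprocessing the text describes.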

Multichannel Audio Source Separation with Probabilistic Reverberation Priors
Simon Leglaive, Roland Badeau, Senior Member, IEEE, and Gaël Richard, Senior Member, IEEE
Abstract—Incorporating prior knowledge about the sources and/or the mixture is a way to improve under-determined audio source separation performance. A great number of informed source separation techniques concentrate on taking priors on the sources into account, but fewer works have focused on constraining the mixing model. In this paper we address the problem of under-determined multichannel audio source separation in reverberant conditions. We target a semi-informed scenario where some room parameters are known. Two probabilistic priors on the frequency response of the mixing filters are proposed. Early reverberation is characterized by an autoregressive model while, according to statistical room acoustics results, late reverberation is represented by an autoregressive moving average model. Both reverberation models are defined in the frequency domain. They aim to transcribe the temporal characteristics of the mixing filters into frequency-domain correlations. Our approach leads to a maximum a posteriori estimation of the mixing filters, which is achieved thanks to an expectation-maximization algorithm. We experimentally show the superiority of this approach compared with a maximum likelihood estimation of the mixing filters.

Adaptive blind source separation with HRTFs beamforming preprocessing
Mounira Maazaoui, Karim Abed-Meraim and Yves Grenier
Abstract—We propose an adaptive blind source separation algorithm in the context of robot audition using a microphone array. Our algorithm presents two steps: a fixed beamforming step to reduce the reverberation and the background noise, and a source separation step. In the fixed beamforming preprocessing, we build the beamforming filters using the Head Related Transfer Functions (HRTFs), which allows us to take into consideration the effect of the robot's head on the near acoustic field. In the source separation step, we use a separation algorithm based on ℓ1-norm minimization. We evaluate the performance of the
