Abstract. The underdetermined blind audio source separation problem is often addressed in the time-frequency domain by assuming that each time-frequency point is an independently distributed random variable. Other approaches, which are not blind, assume a more structured model, such as Spectral Gaussian Mixture Models (Spectral-GMMs), thus exploiting the statistical diversity of audio sources in the separation process. However, in the latter approach, Spectral-GMMs are supposed to be learned from some training signals. In this paper, we propose a new approach for learning Spectral-GMMs of the sources without the need for training signals. The proposed blind method significantly outperforms state-of-the-art approaches on stereophonic instantaneous music mixtures.
These and other results show that improved separation performance in many scenarios can be obtained by modeling and exploiting the spatial and spectral properties of sounds, i.e., by designing models and constraints which account for the specificities of audio sources and acoustic mixing conditions. Two trends can be seen: developing complex, hierarchical models with little training so as to adapt to unknown situations from small amounts of data, or training simpler models on huge amounts of data, e.g., thousands of room impulse responses and dozens of hours of speech, so as to benefit from the power of big data and turn parameter estimation into a model selection problem.
No BSS/ICA algorithm is truly blind, in the sense that a minimal number of assumptions (generally involving some form of prior knowledge) on the sources and/or on the mixing process must be integrated in the algorithms to derive solutions to the separation problem.¹ In the underdetermined case, many relevant techniques take advantage of the sparse nature of audio source signals. These methods make the (realistic) assumption that, in a given basis, source signals have a parsimonious representation, i.e., most of the source coefficients are close to zero. A direct consequence of sparsity is the limited overlap of sources in the appropriate basis, since the probability that several sources are simultaneously active is low. For most music signals, the time-frequency domain is a natural domain for exploiting sparsity (much more so than the time domain, where source signals generally strongly overlap). As a consequence, many underdetermined source separation (USS) techniques are based on sparse time-frequency (TF) representations of signals. For example, some authors make the assumption that the non-stationary source signals to be separated are disjoint in the TF domain. Specific points of the TF plane
¹ As a major example, the Bayesian approach to BSS
Fig. 1. (a) TF disjoint; (b) TF non-disjoint.
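The TF-disjointness assumption above can be sketched with a toy example (synthetic data, not from any cited method): when two sources occupy disjoint sets of TF points, a binary mask computed on the mixture recovers each source exactly.

```python
import numpy as np

# Two synthetic "spectrograms" that are active on disjoint sets of TF points.
rng = np.random.default_rng(0)
S1 = np.zeros((8, 16)); S1[:4, :] = rng.random((4, 16))   # source 1: low bins
S2 = np.zeros((8, 16)); S2[4:, :] = rng.random((4, 16))   # source 2: high bins
X = S1 + S2                                               # instantaneous mix

# Dominant-source (binary) mask: exact recovery under disjointness.
mask = S1 > S2
est1, est2 = X * mask, X * ~mask
```

With real, only approximately disjoint sources, the same masking introduces interference and artifacts, which is why the probabilistic models discussed in this document go beyond binary masks.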
extension of both time-domain and frequency-domain processing, which involves representing signals in a two-dimensional space, the joint TF domain, hence providing a distribution of signal energy versus time and frequency simultaneously. For this reason, a TF representation is commonly referred to as a time-frequency distribution (TFD). TFDs have been applied to a wide variety of engineering problems. Specifically, they have been successfully used for signal recovery at low signal-to-noise ratio (SNR), accurate estimation of the instantaneous frequency (IF), signal detection in communications, radar processing, and the design of time-varying filters. For more details on TFDs and related methods, see for example the recent comprehensive reference.
The goal of source separation algorithms is to recover the constituent sources, or audio objects, from their mixture. However, blind algorithms still do not yield estimates of sufficient quality for many practical uses. Informed Source Separation (ISS) is a solution that makes separation robust when the audio objects are known during a so-called encoding stage. During that stage, a small amount of side information is computed and transmitted with the mixture. At the decoding stage, when the sources are no longer available, the mixture is processed using the side information to recover the audio objects, thus greatly improving the quality of the estimates at the cost of additional bitrate, which depends on the size of the side information. In this study, we compare six state-of-the-art methods in terms of quality versus bitrate, and show that good separation performance can be attained at competitive bitrates.
where 2i is a user-defined number of row blocks, each block contains m rows (the number of measurement sensors), and j is the number of columns (in practice j = N − 2i + 1, where N is the number of sampling points). The Hankel matrix H_{1,2i} is split into two equal parts of i block rows: past and future data. The algorithm thus considers vibration signals at different instants and not only instantaneous representations of the responses. This makes it possible to take into account temporal correlations between measurements, where current data depend on past data. Therefore, the objective pursued in using the block Hankel matrix rather than the observation matrix is to improve the sensitivity of the detection method. The combined methods will be called enhanced SOBI (ESOBI) or enhanced BMID (EBMID) in the following.
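The construction just described can be sketched as follows (a minimal sketch with illustrative sizes; the function name and example values are not from the source):

```python
import numpy as np

def block_hankel(Y, i):
    """Build the block Hankel matrix H_{1,2i} from an (m x N) measurement
    matrix Y: 2i block rows of m rows each, j = N - 2i + 1 columns, then
    split it into past and future halves of i block rows each."""
    m, N = Y.shape
    j = N - 2 * i + 1
    H = np.vstack([Y[:, k:k + j] for k in range(2 * i)])
    return H, H[: i * m], H[i * m :]

# Example: m = 3 sensors, N = 10 samples, i = 2 -> 4 block rows, 7 columns.
Y = np.arange(30.0).reshape(3, 10)
H, H_past, H_future = block_hankel(Y, i=2)
```

Each block row is the measurement matrix shifted by one sample, so a column of H stacks 2i consecutive snapshots, which is what lets the method exploit temporal correlations.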
I. INTRODUCTION
Blind Source Separation (BSS) is a major tool to learn meaningful decompositions of multivalued data. Most of the work has been dedicated to linear BSS: m observations are linear combinations of n sources of t samples. In matrix form, X = AS + N, with X (size m × t) the observation matrix corrupted by some unknown noise N, S (n × t) the sources, and A (m × n; here, n ≤ m) the mixing matrix. The goal of linear BSS is to recover A and S from X up to a permutation and scaling indeterminacy. While this problem is ill-posed, the sparsity prior, which assumes the sources to have many zero coefficients, has been shown to lead to high separation quality. Less work has however been done on non-linear BSS, where: X = f(S) + N (1) with f an unknown non-linear function from R^(n×t) to R^(m×t). Here, we will consider general functions f, mostly assuming that f is invertible and symmetrical around the origin, as well as regular enough (i.e., L-Lipschitz with L small, and not deviating from a linear mixing too fast as a function of the source amplitude). Despite increased indeterminacies compared to the linear case, the possibility of recovering sparse sources up to a non-linear function h under some conditions has been claimed. Our approach is fully different from manifold clustering ones, and also differs from neural network ones, as it brings a geometrical interpretation (and uses the mixing regularity) and an automatic hyperparameter choice (potentially enabling increased robustness and building on the linear BSS literature).
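The linear model and its permutation/scaling indeterminacy can be illustrated numerically (a minimal sketch; the sizes, the Bernoulli-Gaussian source model, and the chosen permutation and scaling are arbitrary assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 2, 3, 1000
# Sparse (mostly zero) Bernoulli-Gaussian sources and a random mixing matrix.
S = rng.normal(size=(n, t)) * (rng.random((n, t)) < 0.05)
A = rng.normal(size=(m, n))
X = A @ S                                 # noiseless linear mixing X = A S

# Indeterminacy: (A P D) mixing the sources (D^-1 P^T S) yields the same X,
# so A and S are only identifiable up to permutation P and scaling D.
P = np.array([[0.0, 1.0], [1.0, 0.0]])    # swap the two sources
D = np.diag([2.0, -0.5])                  # rescale them
X2 = (A @ P @ D) @ (np.linalg.inv(D) @ P.T @ S)
```

Since A P D D^-1 P^T = A, the observations X2 and X are identical, which is exactly the "up to a permutation and scaling" statement above.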
Ming Jiang∗, Jérôme Bobin∗, and Jean-Luc Starck∗
Abstract. Blind Source Separation (BSS) is a challenging matrix factorization problem that plays a central role in multichannel imaging science. In a large number of applications, such as astrophysics, current unmixing methods are limited since real-world mixtures are generally affected by extra instrumental effects like blurring. Therefore, BSS has to be solved jointly with a deconvolution problem, which requires tackling a new inverse problem: deconvolution BSS (DBSS). In this article, we introduce an innovative DBSS approach, called DecGMCA, based on sparse signal modeling and an efficient alternative projected least square algorithm. Numerical results demonstrate that the DecGMCA algorithm performs very well on simulations. They further highlight the importance of jointly solving BSS and deconvolution instead of considering these two problems independently. Furthermore, the performance of the proposed DecGMCA algorithm is demonstrated on simulated radio-interferometric data.
A model has been proposed for piano spectrogram restoration, based on generalized coupled tensor factorization, where additional information comes from an approximate musical score and spectra of isolated piano sounds. Another framework proposes a separation model between voice and background guided by another speech sample corresponding to the pronunciation of the same sentence. The speech reference is either recorded by a human speaker or created with a voice synthesizer from the available text pronounced by the speaker of the mixture to be separated. A nonnegative matrix co-factorization (NMcF) model is designed so that some of the factorized matrices are shared by the mixture and the speech reference. Other authors incorporate knowledge of the fundamental frequency f0 in an NMF model, by fixing the source part of a source-filter model to be a harmonic spectral comb following the known f0 value of the target source over time. In the context of audio separation of movie soundtracks, the separation can be guided by other available international versions of the same movie. A cover-informed source separation principle has also been introduced, in which the authors assume that cover multitrack signals are available and use them as initialization of the NMF algorithm. The original mixture and the cover signals are time-aligned in a pre-processing step.
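The "harmonic spectral comb" idea can be sketched as follows (a hypothetical helper; the function name, sample rate, FFT size, harmonic count and decay law are all illustrative assumptions, not taken from the cited work):

```python
import numpy as np

def harmonic_comb(f0, sr=16000, n_fft=1024, n_harm=10, width=2):
    """Fixed spectral template with peaks at the harmonics of f0, which
    could be used to pin the source part of a source-filter NMF to a
    known pitch trajectory (one template per f0 value)."""
    w = np.zeros(n_fft // 2 + 1)
    for h in range(1, n_harm + 1):
        k = int(round(h * f0 * n_fft / sr))   # bin of the h-th harmonic
        if k >= len(w):
            break
        w[max(0, k - width):k + width + 1] = 1.0 / h   # decaying harmonics
    return w / w.sum()                        # normalize to unit sum

w = harmonic_comb(220.0)                      # template for an A3 note
```

Fixing such templates constrains the NMF so that only the filter part and the activations are learned from the mixture.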
Among the variety of phase recovery techniques, the multiple input spectrogram inversion (MISI) algorithm is particularly popular. This iterative procedure consists in retrieving time-domain sources from their STFT magnitudes while respecting a mixing constraint: the estimates must add up to the mixture. This algorithm exhibits good performance in source separation when combined with DNNs [10, 11]. However, MISI suffers from one limitation: it is derived as a solution to an optimization problem that involves the quadratic loss, which is not the best-suited metric for evaluating discrepancies in the TF domain. For instance, it does not properly account for the large dynamic range of audio signals.
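A minimal sketch of the MISI iteration follows (the function name, toy signals and parameter values are illustrative assumptions; the structure, inverting magnitudes and redistributing the mixing error equally across sources, matches the procedure described above):

```python
import numpy as np
from scipy.signal import stft, istft

def misi(x, mags, n_iter=50, nperseg=256):
    """Recover J time-domain sources from their STFT magnitudes `mags`,
    under the constraint that the estimates add up to the mixture `x`."""
    J = len(mags)
    X = stft(x, nperseg=nperseg)[2]
    phases = [np.angle(X)] * J                     # init with the mixture phase
    for _ in range(n_iter):
        # Inverse STFT of each magnitude/phase pair, cropped to len(x).
        s = [istft(m * np.exp(1j * p), nperseg=nperseg)[1][:len(x)]
             for m, p in zip(mags, phases)]
        err = (x - sum(s)) / J                     # distribute the mixing error
        phases = [np.angle(stft(si + err, nperseg=nperseg)[2]) for si in s]
    return [istft(m * np.exp(1j * p), nperseg=nperseg)[1][:len(x)]
            for m, p in zip(mags, phases)]

# Toy usage: two spectrally disjoint components of a synthetic mixture.
fs = 1000
t = np.arange(fs) / fs
s_true = [np.sin(2 * np.pi * 5 * t), 0.5 * np.sin(2 * np.pi * 80 * t)]
x = sum(s_true)
mags = [np.abs(stft(si, nperseg=256)[2]) for si in s_true]
est = misi(x, mags)
```

Note that every step of this loop minimizes a quadratic loss between spectrograms, which is precisely the limitation pointed out above.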
Although it is similar to classification, audio source separation barely exploits fusion principles. Kirbiz et al. implemented a form of late fusion: NMF (Non-negative Matrix Factorization) based separation is applied in parallel to spectrograms with different time-frequency resolutions, and the resulting time-domain source estimates are combined through an adaptive filter bank. Their paper thus applies data fusion principles to an audio source separation problem thanks to the design of a specific algorithm. In image processing, Meganem et al. also propose to compute several source estimates with different analysis parameters and to use a correlation measure in order to select the best estimate.
Fig. 3. Source separation performance for various methods on the DSD100 test dataset: oracle (top) and estimated (bottom) magnitude spectrograms.
and maximum values, and crosses representing the outliers. We first observe that using the phase unwrapping prior only leads to poor results. Indeed, this technique neglects the phase of the mixture, so the prior error is propagated over time frames, leading to audible artifacts. In both the Oracle and non-Oracle scenarios, the proposed estimator (denoted MMSE in Fig. 3) leads to better results than Wiener, but slightly worse than Cons-W in terms of SDR, SIR and SAR. However, we perceptually observe that Cons-W tends to produce more artifacts in the bass and drums tracks than the proposed MMSE technique. Finally, it is important to note that Cons-W is computationally costly: for a 10-second excerpt, the separation is performed in 27 seconds with Cons-W vs. 4 seconds with our estimator. The proposed approach therefore appears appealing for efficient audio source separation.
Model-based STFT Phase Recovery for Audio Source Separation
Paul Magron, Roland Badeau, Senior Member, IEEE, and Bertrand David, Member, IEEE
Abstract—For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to further synthesize time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach, which boils down to using the phase of the original mixture. In this paper, a different standpoint is adopted. Many music events are partially composed of slowly varying sinusoids, and the STFT phase increment over time of those frequency components takes a specific form. This allows phase recovery by an unwrapping technique once a short-term frequency estimate has been obtained. Herein, a novel iterative source separation procedure is proposed which builds upon these results. It consists in minimizing the mixing error by means of the auxiliary function method. This procedure is initialized by exploiting the unwrapping technique in order to generate estimates that benefit from a temporal continuity property. Experiments conducted on realistic music pieces show that, given accurate magnitude estimates, this procedure outperforms the state-of-the-art consistent Wiener filter.
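The "specific form" of the phase increment can be checked numerically (a toy sketch; the sample rate, window, hop and sinusoid frequency are arbitrary assumptions): for a slowly varying sinusoid of frequency f_sin, the STFT phase in the bin it occupies advances by 2*pi*f_sin*hop/fs between consecutive frames.

```python
import numpy as np

fs, win, hop, f_sin = 8000, 1024, 256, 440.0
n = np.arange(win + hop)
x = np.sin(2 * np.pi * f_sin * n / fs)

w = np.hanning(win)
spec0 = np.fft.rfft(x[:win] * w)              # frame at time 0
spec1 = np.fft.rfft(x[hop:hop + win] * w)     # frame one hop later
k = int(round(f_sin * win / fs))              # bin containing the sinusoid

measured = np.angle(spec1[k]) - np.angle(spec0[k])
predicted = 2 * np.pi * f_sin * hop / fs      # sinusoidal-model increment
# Wrapped difference between measured and predicted; should be near zero.
diff = np.angle(np.exp(1j * (measured - predicted)))
```

This is the relation that phase unwrapping exploits: given a short-term frequency estimate, the phase of each partial can be propagated from frame to frame instead of being copied from the mixture.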
The main principle behind most score-informed separation techniques is to make use of the onset/offset information found in MIDI files to correctly initialize the parameters of a parametric model. In the case of the NMF model described above in Section 3.1, the score sheets make it possible to set the activation parameters H(n, j) to 0 for a given source when it is known to be inactive. Such a simple procedure has been shown to dramatically increase separation performance, by initializing the parameters to a sensible value, hence much closer to the global minimum sought during optimization. Pitch information can also be used to initialize the spectral templates W, or to adequately drive comb filters. In the case of more flexible parametric models such as an NMF with time-varying spectral templates, score information may also be used with a noticeable gain in separation quality. The main issue with such score-initialized audio decompositions is the requirement that MIDI files be synchronized with the audio mixtures. Even if efficient alignment techniques do exist for this purpose, a mismatch in the alignment may lead to wrongly initialized decompositions, yielding poorly separated sources. In a recent study, Simsekli et al. showed that MIDI information can actually be used without assuming such an alignment. The main fact underlying their technique is that, apart from the temporal positions, the score also contains information about co-occurrences of the notes as well as their pitch. Even in case of misalignment, these may be supposed to be the same in the actual audio mixture. In practice, such co-occurrences are modeled as common factors in a Generalized Tensor Factorization framework, where both scores and audio are jointly analyzed.
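The zeroing of activations described above has a useful property worth making explicit: multiplicative NMF updates preserve exact zeros, so the score constraint holds throughout optimization, not only at initialization. A toy sketch (random data and Euclidean multiplicative updates; the sizes and the random "score" are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
F, N, T = 64, 3, 100                   # frequency bins, templates, frames
W = rng.random((F, N)) + 0.1
H = rng.random((N, T)) + 0.1
active = rng.random((N, T)) < 0.4      # toy "score": True where a note sounds
H *= active                            # score-informed initialization

V = W @ (rng.random((N, T)) * active)  # nonnegative data consistent with the score
for _ in range(30):                    # standard Euclidean multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
```

After the updates, every entry of H that the score set to zero is still exactly zero, since each update multiplies the current value by a finite nonnegative factor.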
In this paper we show that considering early contributions of mixing filters through a probabilistic prior can help blind source separation in reverberant recording conditions. By modeling mixing filters as the direct path plus R−1 reflections, we represent the propagation from a source to a mixture channel as an autoregressive process of order R in the frequency domain. This model is used as a prior to derive a Maximum A Posteriori (MAP) estimation of the mixing filters using the Expectation-Maximization (EM) algorithm. Experimental results over reverberant synthetic mixtures and live recordings show that MAP estimation with this prior provides better separation results than Maximum Likelihood (ML) estimation.
In SO applications, the extracted sources and/or mixing parameters are processed to obtain information at more abstract levels, in order to find a representation of the observations related to human perception. For instance, looking for the number and the kind of instruments in a musical excerpt enters the scope of SO separation. Separation quality criteria are generally less demanding than in AQO applications because the aim of SO separation is only to keep specific features of the sources. Thus, a rough separation may be sufficient (possibly with high distortion), depending on the robustness of the subsequent feature extraction algorithms.
While most source separation methods are designed for a specific scenario, a flexible audio source separation framework has introduced a compositional approach in which the mixture signal is modeled by composing multiple source models together. Each model is parameterized by a number of variables, which may be constrained by the user, trained from separate data, or adapted to the considered mixture according to the available information. This framework has been applied to a wide variety of speech and music separation scenarios by exploiting information such as note spectra, cover music recordings, reference speech pronounced by another speaker, or target spatial direction [4–8]. It has also been used as a preprocessing step for instrument recognition, beat tracking, and automatic speech recognition [8, 9].
z_x(f, n) = |x̃(f, n)| was the magnitude of the single-channel signal obtained from the multichannel noisy signals x(f, n) by DS beamforming [45, 46]. In doing so, the time-varying time differences of arrival (TDOAs) between the speaker's mouth and each of the microphones are first measured using the provided baseline speaker localization tool, which relies on a nonlinear variant of steered response power using the phase transform (SRP-PHAT) [59, 60]. All channels are then aligned with each other by shifting the phase of the STFT of the input noisy signal x(f, n) in all time-frequency bins (f, n) by the opposite of the measured delay. This preprocessing is required to satisfy the model in (1.2), which assumes that the sources do not move over time. Finally, a single-channel signal is obtained by averaging the realigned channels together. On the output side, the estimated speech spatial images are averaged over channels to obtain a single-channel signal for the speech recognition evaluation, which empirically provided better ASR performance than the use of one of the channels. Likewise, the
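The realign-and-average step can be sketched in the STFT domain (a minimal sketch; the function name, sample rate, FFT size and delay values are illustrative assumptions, and the TDOAs would in practice come from a localizer such as SRP-PHAT):

```python
import numpy as np

fs, n_fft = 16000, 512
freqs = np.arange(n_fft // 2 + 1) * fs / n_fft   # one-sided STFT frequencies

def ds_beamform(X, tdoas):
    """Delay-and-sum in the STFT domain.
    X: (channels, bins, frames) STFT; tdoas: per-channel delays in seconds.
    Each channel's phase is shifted by the opposite of its delay, then the
    realigned channels are averaged."""
    shifts = np.exp(2j * np.pi * freqs[None, :] * np.asarray(tdoas)[:, None])
    return np.mean(X * shifts[:, :, None], axis=0)

# Toy check: a channel delayed by 2.5 samples is realigned exactly.
rng = np.random.default_rng(3)
S = rng.normal(size=(n_fft // 2 + 1, 5)) + 1j * rng.normal(size=(n_fft // 2 + 1, 5))
tau = 2.5 / fs                                   # fractional-sample delay
X = np.stack([S, S * np.exp(-2j * np.pi * freqs[:, None] * tau)])
Y = ds_beamform(X, [0.0, tau])
```

Working in the frequency domain makes fractional-sample delays straightforward, which is why the alignment is applied to the STFT rather than to the waveforms.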
Multichannel Audio Source Separation with Probabilistic Reverberation Priors
Simon Leglaive, Roland Badeau, Senior Member, IEEE, and Gaël Richard, Senior Member, IEEE
Abstract—Incorporating prior knowledge about the sources and/or the mixture is a way to improve under-determined audio source separation performance. A great number of informed source separation techniques concentrate on taking priors on the sources into account, but fewer works have focused on constraining the mixing model. In this paper we address the problem of under-determined multichannel audio source separation in reverberant conditions. We target a semi-informed scenario where some room parameters are known. Two probabilistic priors on the frequency response of the mixing filters are proposed. Early reverberation is characterized by an autoregressive model while, according to statistical room acoustics results, late reverberation is represented by an autoregressive moving average model. Both reverberation models are defined in the frequency domain. They aim to transcribe the temporal characteristics of the mixing filters into frequency-domain correlations. Our approach leads to a maximum a posteriori estimation of the mixing filters, which is achieved thanks to an expectation-maximization algorithm. We experimentally show the superiority of this approach compared with a maximum likelihood estimation of the mixing filters.
Adaptive blind source separation with HRTFs beamforming preprocessing
Mounira Maazaoui, Karim Abed-Meraim and Yves Grenier
Abstract—We propose an adaptive blind source separation algorithm in the context of robot audition using a microphone array. Our algorithm presents two steps: a fixed beamforming step to reduce the reverberation and the background noise, and a source separation step. In the fixed beamforming preprocessing, we build the beamforming filters using Head Related Transfer Functions (HRTFs), which allows us to take into consideration the effect of the robot's head on the near acoustic field. In the source separation step, we use a separation algorithm based on ℓ1-norm minimization. We evaluate the performance of the