Non-linear echo suppression - Dual microphone based echo postfilter

Dual microphone based echo postfilter

6.2 Non-linear echo suppression

6.2.1 Problem description

Before introducing the new DM approach to non-linear echo processing we ﬁrst establish a model of the non-linear echo problem. One suitable model which summarizes the coupling of acoustic sources to two transducers is illustrated in Figure 6.2. The system still has two acoustic sources: near-end speech denoted bys(n) and the non-linear signalx_nl(n) emitted by the loudspeaker. The non-linear signal xnl(n) is related to the received loudspeaker or far-end speech signal x(n) through a non-linear transformation which we denote by

h₁(n)

g₁(n)

g₂(n)

h₂(n)

y₂(n) y₁(n) s(n)

d₁(n) s₁(n)

s₂(n)

d₂(n) x(n) xnl(n)

f(·)

Figure 6.2: Signal model in presence of loudspeaker non-linearities

f(.) such that x_nl(n) = f(x(n)). The non-linear transformation f(.) is a generic model appropriate to any existing non-linear loudspeaker model such as a clipping model [Stenger and Kellermann, 2000] or a Volterra model [Azpicueta-Ruiz et al., 2011]. For real mobile devices, such modeling accounts for all non-linearities related to the loudspeaker from those due to the cone vibrations to those due to dynamic saturations.

Both near-end speech s(n) and the non-linear signal x_nl(n) are recorded by the two microphones, each according to a diﬀerent acoustic path which we denote similarly as in Chapter 5. The microphone signals are denotedy₁(n) andy₂(n) respectively for primary or secondary microphone signals. The near-end speech signal at the primary and secondary microphone is denoteds₁(n) ands₂(n) and is the result of convolution with acoustic path g₁(n) and g₂(n) respectively. The non-linear loudspeaker x_nl(n) also encounters diﬀerent acoustic paths denoted by h1(n) and h2(n) to produce signals d1(n) and d2(n) at the primary and secondary microphone respectively. In order words, the microphone signals y₁(n) and y₂(n) are expressed as follows:

y_j(n) =s_j(n) +d_j(n)

=g_j(n)∗s(n) +h_j(n)∗x_nl(n) (6.34) wherej ∈ {1,2}.

6.2.2 Limitations of the proposed DM PSD estimate

We want to achieve echo cancellation using the echo processing shown in Figure 6.1.

In [Mossi et al., 2010], it is showed that although the AEC is less disturbed by the presence of the non-linearities than by the presence of ambient noise. Because of the correlation between the linear and non-linear components of the echo, the AEC is still able to estimate the acoustic path. The AEC can still be used to estimate the linear part of the echo. The error signal at the output of the AEC can be written as:

e₁(n) =y₁(n)−ˆh₁(n)∗x(n) (6.35)

=s₁(n) +h₁(n)∗x_nl(n)−ˆh₁(n)∗x(n) (6.36)

=s₁(n) + ˜d₁(n) (6.37)

114 6. Dual microphone based echo postfilter where ˜d₁(n) denotes the residual echo which we consider to be composed of the non-linear echo and of the residual linear echo. In this context, the DM postﬁlter aims to suppress the non-linear part of the echo as well as the residual linear echo. We consider the same gain rule as in Section 6.1.The required residual echo PSD Φ^d^˜¹^d^˜¹ is deﬁned, from equations 6.36 and 6.37, as:

Φ^d^˜¹^d^˜¹(k, i) =|Hˆ1|²·Φ^xx(k, i) +|H1|²·Φ^x^nl^x^nl(k, i)−2 Re( ˆH1(k, i)·H1(k, i)^∗·Φ^xx^nl(k, i)) (6.38) where Φ^x^nl^x^nl is the PSD of x_nl(n) and Φ^xx^nl is the cross-PSD betweenx(n) andx_nl(n).

The ﬁrst term of the sum|Hˆ1(k, i)|²·Φ^xx(k, i) can be computed using the impulse response estimate from the AEC ˆh₁(n) and the received loudspeaker signal x(n). The two other terms of the sum need to be estimated ash₁(n) and x_nl(n) are unknown.

The PSD of the echo signal present at the secondary microphone can be written as:

Φ^d²^d²(k, i) =|H2(k, i)|²·Φ^x^nl^x^nl(k, i). (6.39) In Section 6.1, we used the echo RTF Γ to link Φ^d^˜¹^d^˜¹ and Φ^d²^d² and to derive an estimate of Φ^d^˜¹^d^˜¹. Here we cannot link the two PSDs mainly because of the cross-PSD term Φ^xx^nl which is present in Equation 6.38 but not in Equation 6.39. It cannot be computed since x_nl(n) is unavailable in the system. To circumvent this problem, we deﬁne the residual echo PSD diﬀerently. Two options are considered and presented in the following.

6.2.3 Residual echo PSD estimate in presence of loudspeaker nonlinearities 6.2.3.1 Residual echo PSD as a function of the error signal

Based on Equation 6.35, we can deﬁne the residual echo PSD Φ^d^˜¹^d^˜¹ as a function of the error signal PSD:

Φ^d^˜¹^d^˜¹(k, i) = Φê¹ê¹(k, i)−Φ^s¹^s¹(k, i). (6.40) The definition in Equation 6.40 is valid under the assumption that x(n) and s(n) are decorrelated. We estimate Φ^s¹^s¹ through a Wiener filter which we apply to the primary microphone signal y₁(n) as follows:

Φˆ^s¹^s¹(k, i) =τ·V²(k, i)·Φ^y¹^y¹(k, i) with V(k, i) = ψ(k, i)

1 +ψ(k, i) (6.41) ψ(k, i) =βΦˆ^s¹^s¹(k−1, i)

Φˆ^d¹^d¹(k−1, i) + (1−β)·max Φ^y¹^y¹(k, i)

Φˆ^d¹^d¹(k, i)−1,0 (6.42) where V is a Wiener filter, τ is a dynamic amplification gain and ψ denotes the SER before AEC. More specificallyψcan be defined as the ratio between Φ^s¹^s¹ and Φ^d¹^d¹. ψis estimated through decision directed approach and therefore its computation only requires an estimate of the echo PSD Φ^d¹^d¹.

Similarly to Equation 6.39, the primary microphone echo PSD can be deﬁned as Φ^d¹^d¹(k, i) =|H₁(k, i)|²·Φ^x^nl^x^nl(k, i). (6.43) Assuming once more that s(n) and x(n) are decorrelated, the PSDs of the microphone signals y_j(n) withj∈ {1,2} can be written as:

Φ^y^j^y^j(k, i) = Φ^s^j^s^j(k, i) + Φ^d^j^d^j(k, i)

=|G_j|²·Φ^ss(k, i) +|H_j|²·Φ^x^nl^x^nl(k, i) (6.44)

Let us introduce a slightly diﬀerent echo RTF to that used previously Γ^′(k, i) = H2(k, i)

H₁(k, i). (6.45)

The echo RTF Γ^′ deﬁnes the ratio between the frequency responses of the echo paths before the AEC at the contrary of Γ which deﬁnes the RTF after the AEC. By using the echo RTF Γ^′ and the near-end RTF Θ, the secondary microphone PSD can be rewritten as:

Φ^y²^y²(k, i) =|Θ(k, i)|²· |G₁(k, i)|²·Φ^ss(k, i) +|Γ^′(k, i)|²· |H₁(k, i)|²·Φ^x^nl^x^nl(k, i). (6.46) An estimate of Φ^d¹^d¹ is derived by combining the deﬁnitions of Φ^y¹^y¹ and Φ^y²^y² (Equa-tions 6.44 and 6.46):

Φˆ^d¹^d¹(k, i) =

|Θ(k, i)|²·Φ^y¹^y¹(k, i)−Φ^y²^y²(k, i)

max(|Θ(k, i)|²− |Γ|²,0.1) . (6.47) Once more, the denominator is thresholded to avoid problems due to division by small numbers. ˆΦ^d¹^d¹ computed as in Equation 6.47 is used to compute the SERψ and later on Φˆ^s¹^s¹.

The ampliﬁcation gain τ in Equation 6.41 is used to avoid excessive attenuation of near-end speech during double talk. In our implementation, it is controlled using the normalized PLD described in Section 5.3 (See Equation 5.18). In echo-only periods, τ is set to 1 while during double-talk periodsτ is set to 2.

6.2.3.2 Residual echo PSD as a function of the echo picked by the primary microphone

The echo recorded by the primary microphone can be written as the sum of its linear estimate from the AEC and of the residual echo

d₁(n) = ˆd₁(n) + ˜d₁(n). (6.48) Equivalently, its PSD can be deﬁned as:

Φ^d¹^d¹(k, i) = Φ^d^ˆ¹^d^ˆ¹(k, i) + Φ^d^˜¹^d^˜¹(k, i) + 2 Re(Φ^d^ˆ¹^d^˜¹(k, i)). (6.49) If we consider that the AEC has reached its optimum, based on the orthogonality principle, the cross-PSD term Φ^d^ˆ¹^d^˜¹ is null [Haykin, 2002]. In other terms, when the AEC reaches its optimum, the echo estimate ˆd₁(n) and the residual echo ˜d₁(n) deﬁne two orthogonal sub-spaces of the vectorial space generated byd₁(n). In a real time system, the AEC takes some time to reach its optimum, therefore, the orthogonality principle does not apply. But yet, according to Equation 6.48, we can somehow say ˆd₁(n) and ˜d₁(n) generate two complementary sub-spaces of the vectorial space generated byd₁(n). Based on this complementarity, the cross-PSD term is therefore negligible.

By neglecting Φ^d^ˆ¹^d^˜¹, another estimate of the residual echo can be derived as

Φˆ^d^˜¹^d^˜¹(k, i) = Φ^d¹^d¹(k, i)−Φ^d^ˆ¹^d^ˆ¹(k, i). (6.50) The required echo PSD Φ^d¹^d¹ can be estimated as in Equation 6.47. The PSD of the echo estimate from the AEC Φ^d^ˆ¹^d^ˆ¹ can be computed through any appropriate method as the signald1(n) is known in the system. In our case, Φ^d^ˆ¹^d^ˆ¹ is computed through autoregressive smoothing.

116 6. Dual microphone based echo postfilter 1. Update of PSDs and cross-PSDs Φ^y¹^y¹, Φ^y²^y², Φ^e¹^e¹, Φ^xy¹ and Φ^xy²

2. Update of RTFs estimate ˆΘ and ˆΓ^′ 3. Compute ˆΦ^d¹^d¹ (Equation 6.47)

Dans le document Acoustic echo cancellation for single-and dual-microphone devices: Application to mobile devices (Page 131-135)