Approximate Inference based Static and Dynamic Sparse Bayesian Learning with Kronecker Structured Dictionaries
Christo Kurisummoottil Thomas, Dirk Slock
Communication Systems Department, EURECOM, Sophia Antipolis, France
GdR ISIS Meeting (Réunion du GdR ISIS), May 27, 2020
Décompositions Tensorielles (Tensor Decompositions), Video Conference
Outline
1 Introduction
2 Combined BP-MF-EP Framework
3 Kronecker Structured Dictionary Learning using BP/VB
4 Numerical Results and Conclusion
GdR ISIS, Décompositions Tensorielles, 2020, Christo Kurisummoottil Thomas, Dirk Slock, EURECOM, FRANCE
Time Varying Sparse State Tracking
Sparse signal $x_t$ is modeled using an AR(1) process with diagonal correlation coefficient matrix $F$.
Define: $\Lambda = \mathrm{diag}(\lambda)$, $F = \mathrm{diag}(f)$.
$f_i$: correlation coefficient, and $x_{i,t} \sim \mathcal{CN}(0, \lambda_i^{-1})$. Further, $w_t \sim \mathcal{CN}(0, \Gamma^{-1} = \Lambda^{-1}(I - FF^H))$ and $v_t \sim \mathcal{CN}(0, \gamma^{-1} I)$. VB leads to Gaussian SAVE-Kalman Filtering (GS-KF).
Applications: localization, adaptive filtering.
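As a quick numerical illustration of this state model, the sketch below simulates one real-valued component with the stated stationary variance (the values of $f_i$ and $\lambda_i$ are illustrative, not the paper's settings):

```python
import numpy as np

# AR(1) state model per component: x_t = f * x_{t-1} + w_t,
# with w_t ~ N(0, (1 - f^2)/lam) so that the stationary
# variance of x_t is exactly 1/lam (real-valued sketch).
rng = np.random.default_rng(0)
f, lam = 0.9, 4.0          # illustrative correlation coefficient and precision
T = 200_000
x = np.zeros(T)
w = rng.normal(0.0, np.sqrt((1 - f**2) / lam), size=T)
for t in range(1, T):
    x[t] = f * x[t - 1] + w[t]

print(np.var(x))           # should be close to 1/lam = 0.25
```

The innovation variance $(1 - f^2)/\lambda$ mirrors the $\Gamma^{-1} = \Lambda^{-1}(I - FF^H)$ choice above, which decouples the stationary marginal variance from the temporal correlation.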
Approximate Inference based Static and Dynamic Sparse Bayesian Learning with Kronecker Structured Dictionaries
Sparsification of the Innovation Sequence (Sparse Bayesian Learning-SBL)
Bayesian compressed sensing: 2-layer hierarchical prior for $x$ as in [1], inducing sparsity for $x$.
$p(x_{i,t}|\lambda_i) = \mathcal{N}(x_{i,t}; 0, \lambda_i^{-1})$, $\quad p(\lambda_i|a,b) = \Gamma(a)^{-1} b^a \lambda_i^{a-1} e^{-b\lambda_i}$ $\Rightarrow$ sparsifying Student-t marginal
$p(x_{i,t}) = \frac{b^a\, \Gamma(a+\frac{1}{2})}{(2\pi)^{\frac{1}{2}}\, \Gamma(a)}\, \big(b + x_{i,t}^2/2\big)^{-(a+\frac{1}{2})}$.
We apply the (Gamma) prior not to the precision of the state $x$ but to its innovation $w$, allowing us to sparsify at the same time the components of $x$ AND their variation in time (innovation).
[1] Tipping'01
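The scale-mixture construction behind this marginal can be checked numerically (a sketch with illustrative shape/rate values $a$, $b$, not tied to the paper's choices): sampling $\lambda \sim \mathrm{Gamma}(a, b)$ (rate parameterization) and then $x \mid \lambda \sim \mathcal{N}(0, 1/\lambda)$ yields a Student-t marginal with $\nu = 2a$ degrees of freedom, whose variance is $b/(a-1)$ for $a > 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 3.0, 2.0                      # illustrative Gamma shape and rate
n = 500_000
lam = rng.gamma(shape=a, scale=1.0 / b, size=n)   # precision samples
x = rng.normal(0.0, 1.0 / np.sqrt(lam))           # x | lam ~ N(0, 1/lam)

# Marginal of x is Student-t with nu = 2a dof; Var(x) = E[1/lam] = b/(a-1).
print(np.var(x), b / (a - 1))        # the two should nearly match
```

Small $a$, $b$ make the marginal heavy-tailed and sharply peaked at zero, which is what induces sparsity.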
Tensor Representation (Channel Tracking in MaMIMO OFDM)
Sampling across Doppler space and stacking all the subcarrier and sampled (in Doppler) elements as a vector:
$\mathrm{vec}(H(t)) = \sum_{i=1}^{L} x_{i,t}\, h_t(\phi_i) \otimes h_r(\theta_i) \otimes v_f(\tau_i) \otimes v_t(f_i) = A(t)\, x_t$
4-D tensor model: delay, Doppler, and Tx/Rx spatial dimensions.
The array response itself has Kronecker structure in the case of polarization, or for 2D antenna arrays with separable structure [2].
User mobility changes the scattering geometry and path coefficients.
The tensor-based KF proposed here avoids off-grid basis mismatch issues.
[2] Qian, Fu, Sidiropoulos, Yang'18
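A small numerical check of this stacked sum-of-Kronecker representation (a sketch with random vectors standing in for the steering/delay/Doppler responses, and small illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
L = 3                                   # number of paths
dims = (2, 3, 4, 5)                     # illustrative Tx/Rx/delay/Doppler sizes

x = rng.standard_normal(L)              # path coefficients x_{i,t}
resp = [rng.standard_normal((d, L)) for d in dims]   # stand-in response vectors

# Channel tensor as a sum of rank-1 (outer-product) terms.
H = sum(
    x[i] * np.einsum('a,b,c,d->abcd',
                     resp[0][:, i], resp[1][:, i], resp[2][:, i], resp[3][:, i])
    for i in range(L)
)

# Dictionary with one Kronecker-structured column per path; vec(H) = A x.
cols = [
    np.kron(np.kron(np.kron(resp[0][:, i], resp[1][:, i]), resp[2][:, i]),
            resp[3][:, i])
    for i in range(L)
]
A = np.stack(cols, axis=1)
assert np.allclose(H.ravel(), A @ x)    # C-order vec matches the h1⊗h2⊗h3⊗h4 ordering
```

Here the C-order (row-major) flattening of the 4-way tensor matches the Kronecker ordering written above, with the first factor's index varying slowest.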
Kronecker Structured Tensor Models
(Figure: CPD of the data tensor Y as a sum of rank-1 terms.)
Tensor signals appear in many applications: massive multi-input multi-output (MIMO) radar, massive MIMO (MaMIMO) channel estimation, speech processing, image and video processing.
Exploiting the tensorial structure is beneficial compared to estimating an unstructured dictionary.
From Canonical Polyadic Decompositions (CPD) to Tucker Decompositions (TD).
The signal model for the recovery of a time-varying sparse signal under a Kronecker structured (KS) dictionary matrix can be formulated as (tensor representation: $Y_t = X_t + N_t$)
Observation: $y_t = \underbrace{(A_1(t) \otimes A_2(t) \otimes \cdots \otimes A_N(t))}_{A(t)}\, x_t + v_t$, $\quad y_t = \mathrm{vec}(Y_t)$. State update: $x_t = F x_{t-1} + w_t$.
$Y_t \in \mathbb{C}^{I_1 \times I_2 \times \cdots \times I_N}$ is the observed data at time $t$, $A_{j,i}(t) \in \mathbb{C}^{I_j}$ is the $i$-th column of the factor matrix $A_j(t) = [A_{j,1}(t), \ldots, A_{j,P_j}(t)]$, the overall unknown parameters are $[[A_1(t), \ldots, A_N(t); x_t]]$, $x_t$ is the $M\,(= \prod_{j=1}^{N} P_j)$-dimensional sparse core tensor, and $w_t$, $v_t$ are the state and measurement noise.
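The Kronecker-structured measurement matrix acting on the vectorized core is equivalent to Tucker-style mode products of the core tensor by the factor matrices. A small check (illustrative sizes; C-order vectorization, under which the factor order $A_1 \otimes \cdots \otimes A_N$ matches a row-major vec):

```python
import numpy as np

rng = np.random.default_rng(3)
I, P = (2, 3, 4), (2, 2, 3)             # observation and core dimensions
A = [rng.standard_normal((I[j], P[j])) for j in range(3)]
X = rng.standard_normal(P)               # core tensor (dense here for the check)

# Tucker-style mode products: Y = X x_1 A_1 x_2 A_2 x_3 A_3
Y = np.einsum('ia,jb,kc,abc->ijk', A[0], A[1], A[2], X)

# Kronecker form acting on the C-order (row-major) vectorization of X.
A_kron = np.kron(np.kron(A[0], A[1]), A[2])
assert np.allclose(Y.ravel(), A_kron @ X.ravel())
```

In the noisy model, $y_t$ adds $v_t$ to this product; the check above only concerns the dictionary structure.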
Identifiability of Tensor Decomposition
CPD is essentially unique up to scaling and permutation indeterminacies: if $T = [[A_1(t), \ldots, A_N(t); x_t]] = [[A'_1(t), \ldots, A'_N(t); x'_t]]$, then
$A'_j(t) = A_j(t)\, \Pi_j D_j, \ \forall j$,
with $\Pi_j$ a $P_j \times P_j$ permutation matrix and $D_j$ a non-singular diagonal matrix.
To circumvent the scaling indeterminacies, we model the columns of $A_j(t)$ as $A_{j,i}(t) = [1 \ a_{j,i}^T]^T$. There are recent results on local identifiability for the Kronecker structured dictionary matrix (so TD) case [3]: they derived sample complexity upper bounds for coordinate dictionary identification up to a specified error, under a separable sparsity model for $X$.
Global identifiability is still an open issue.
[3] Shakeri, Sarwate, Bajwa'18
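The scaling/permutation indeterminacy can be seen numerically: permuting the rank-1 terms and rescaling the factors (with the scalings multiplying to one within each term) leaves the tensor unchanged. A minimal 3-way sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
R, dims = 3, (4, 5, 6)
A1, A2, A3 = (rng.standard_normal((d, R)) for d in dims)

def cpd(a1, a2, a3):
    # Tensor from CPD factors: T = sum_r a1[:,r] o a2[:,r] o a3[:,r]
    return np.einsum('ir,jr,kr->ijk', a1, a2, a3)

T = cpd(A1, A2, A3)

perm = np.array([2, 0, 1])               # permutation of the rank-1 terms
d1, d2 = np.array([2.0, -1.0, 0.5]), np.array([3.0, 4.0, -2.0])
d3 = 1.0 / (d1 * d2)                     # scalings cancel within each term

T2 = cpd(A1[:, perm] * d1, A2[:, perm] * d2, A3[:, perm] * d3)
assert np.allclose(T, T2)                # same tensor, different factors
```

Anchoring the first entry of each column to 1, as above, removes the scaling ambiguity but not the permutation ambiguity.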
Our Contribution Here Compared to State of the Art
Extension of our previous work [4] on Kronecker structured (KS) dictionary learning (DL), called SAVED-KS DL (Space Alternating Variational Estimation for KS DL), to the case of a time-varying sparse state vector.
Variational Bayes (VB) at the scalar level (i.e. mean field (MF)) for $x_t$, and at the column level for the factor matrices in the measurement matrix, is quite suboptimal, since the computed posterior covariance does not converge to the true posterior covariance [5]. One potential solution: belief propagation (BP).
Also, better variational free energy (VFE) approximations, e.g. Belief Propagation (BP) or Expectation Propagation (EP) instead of MF.
Inspired by the framework of [6], we combine BP and MF approximations in such a way as to optimize the message passing (MP) framework.
The main focus: reduced-complexity algorithms (e.g. VB leads to low complexity) without sacrificing much performance (BP for performance improvement).
Dynamic autoregressive SBL (DAR-SBL): a case of joint Kalman filtering (KF) with a linear time-invariant diagonal state-space model, and parameter estimation $\Longrightarrow$ an instance of nonlinear filtering.
[4] Thomas, Slock'19
[5] Thomas, Slock'18
[6] Riegler, Kirkelund, Manchón, Badiu, Fleury'13
Variational Free Energy (VFE) Framework
Intractable joint posterior distribution of the parameters θ = {x , A, f, λ, γ}.
Actual posterior: $p(\theta) = \frac{1}{Z} \underbrace{\prod_{a \in \mathcal{A}_{BP}} f_a(\theta_a) \prod_{b \in \mathcal{A}_{MF}} f_b(\theta_b)}_{\text{factor nodes}}$, where $\mathcal{A}_{BP}$, $\mathcal{A}_{MF}$ are the sets of factor nodes belonging to the BP/MF part, with $\mathcal{A}_{BP} \cap \mathcal{A}_{MF} = \emptyset$.
The whole $\theta$ is partitioned into the sets $\theta_i$, and we want to approximate the true posterior $p(\theta)$ by an approximate posterior $q(\theta) = \prod_i q_i(\theta_i)$. The individual variable portions $\theta_i$ are the variable nodes in a factor graph.
$\mathcal{N}_{BP}(i)$, $\mathcal{N}_{MF}(i)$: the sets of neighbouring factor nodes of variable node $i$ which belong to the BP/MF part.
$\mathcal{I}_{MF} = \bigcup_{a \in \mathcal{A}_{MF}} \mathcal{N}(a)$, $\quad \mathcal{I}_{BP} = \bigcup_{a \in \mathcal{A}_{BP}} \mathcal{N}(a)$. $\mathcal{N}(a)$: the set of neighbouring variable nodes of factor node $a$.
The resulting free energy (entropy $-$ average energy) obtained by the combination of BP and MF is written as below (let $q_i(\theta_i)$ represent the belief about $\theta_i$, i.e. the approximate posterior):
$F_{BP,MF} = \sum_{a \in \mathcal{A}_{BP}} \sum_{\theta_a} q_a(\theta_a) \ln \frac{q_a(\theta_a)}{f_a(\theta_a)} - \sum_{b \in \mathcal{A}_{MF}} \sum_{\theta_b} \prod_{i \in \mathcal{N}(b)} q_i(\theta_i) \ln f_b(\theta_b) - \sum_{i \in \mathcal{I}} (|\mathcal{N}_{BP}(i)| - 1) \sum_{\theta_i} q_i(\theta_i) \ln q_i(\theta_i)$.
Message Passing Expressions
The beliefs have to satisfy the following normalization and marginalization constraints:
$\sum_{\theta_i} q_i(\theta_i) = 1, \ \forall i \in \mathcal{I}_{MF} \setminus \mathcal{I}_{BP}$, $\quad \sum_{\theta_a} q_a(\theta_a) = 1, \ \forall a \in \mathcal{A}_{BP}$, $\quad q_i(\theta_i) = \sum_{\theta_a \setminus \theta_i} q_a(\theta_a), \ \forall a \in \mathcal{A}_{BP},\ i \in \mathcal{N}(a)$.
The fixed-point equations corresponding to the constrained optimization of the approximate VFE can be written as follows [7]:
$q_i(\theta_i) = z_i \prod_{a \in \mathcal{N}_{BP}(i)} m^{BP}_{a \to i}(\theta_i) \prod_{a \in \mathcal{N}_{MF}(i)} m^{MF}_{a \to i}(\theta_i)$ $\Longrightarrow$ product of incoming beliefs,
$n_{i \to a}(\theta_i) = \prod_{c \in \mathcal{N}_{BP}(i) \setminus a} m^{BP}_{c \to i}(\theta_i) \prod_{d \in \mathcal{N}_{MF}(i)} m^{MF}_{d \to i}(\theta_i)$ $\Longrightarrow$ variable to factor nodes,
$m^{MF}_{a \to i}(\theta_i) = \exp\big( \langle \ln f_a(\theta_a) \rangle_{\prod_{j \in \mathcal{N}(a) \setminus i} n_{j \to a}(\theta_j)} \big)$, $\quad m^{BP}_{a \to i}(\theta_i) = \int \prod_{j \in \mathcal{N}(a) \setminus i} n_{j \to a}(\theta_j)\, f_a(\theta_a) \prod_{j \neq i} d\theta_j$ $\Longrightarrow$ factor to variable nodes,
where $\langle \cdot \rangle_q$ represents the expectation w.r.t. the distribution $q$.
[7] Riegler, Kirkelund, Manchón, Badiu, Fleury'13
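For Gaussian factors these message integrals are available in closed form. A toy check of the BP factor-to-variable rule for a pairwise factor $f_a(\theta_i, \theta_j) = \mathcal{N}(\theta_i; c\,\theta_j, v)$ with an incoming Gaussian message $n_{j \to a} = \mathcal{N}(\mu_j, \sigma_j^2)$ (all numerical values illustrative), against brute-force numerical marginalization on a grid:

```python
import numpy as np

# Factor f_a(ti, tj) = N(ti; c*tj, v); incoming message n_{j->a} = N(mu_j, s2_j).
c, v = 1.5, 0.4
mu_j, s2_j = 0.7, 0.25

# Closed form: m_{a->i}(ti) = N(ti; c*mu_j, v + c^2 * s2_j)
mean_cf, var_cf = c * mu_j, v + c**2 * s2_j

def gauss(x, m, s2):
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Brute force: integrate n_{j->a}(tj) * f_a(ti, tj) over tj on a grid.
ti = np.linspace(-8, 8, 1001)
tj = np.linspace(-8, 8, 2001)
dti, dtj = ti[1] - ti[0], tj[1] - tj[0]
m = (gauss(tj[None, :], mu_j, s2_j)
     * gauss(ti[:, None], c * tj[None, :], v)).sum(axis=1) * dtj

mean_num = (ti * m).sum() * dti           # the message integrates to 1 here
var_num = ((ti - mean_num) ** 2 * m).sum() * dti
assert abs(mean_num - mean_cf) < 1e-3 and abs(var_num - var_cf) < 1e-3
```

In the Gaussian case BP therefore reduces to propagating means and variances, which is what the Gaussian BP-MF-EP KF of the next slides exploits.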
Gaussian BP-MF-EP KF
Proposed method: alternating optimization between a nonlinear KF for the sparse states (plus the hyperparameters) and BP for dictionary learning (DL).
Diagonal AR(1) (DAR(1)) prediction stage: since there is no coupling between the scalars in the state update, it is enough to update the prediction stage using MF. However, the interaction between $x_{l,t}$ and $f_l$ requires a Gaussian projection, using expectation propagation (EP). More details in [8].
$y_n$: factor node, $x_l$: variable node. $(l,n)$ or $(n,l)$ represent the messages passed from $l$ to $n$ or vice versa. Gaussian messages from $y_n$ to $x_l$ are parameterized by mean $\hat{x}_{n,l}(t)$ and variance $\nu_{n,l}(t)$.
Measurement update (filtering) stage: the posterior for $x_t$ is inferred using BP. In the measurement stage, the prior for $x_{l,t}$ gets replaced by the belief from the prediction stage. We define
$d_{l,t} = \big( \sum_{n=1}^{N} \nu_{n,l}(t)^{-1} \big)^{-1}$, $\quad r_{l,t} = d_{l,t} \big( \sum_{n=1}^{N} \frac{\hat{x}_{n,l}(t)}{\nu_{n,l}(t)} + \frac{\hat{x}_{l,t|t-1}}{\sigma^2_{l,t|t-1}} \big)$,
$\sigma^{-2}_{l,t|t} = \lambda_{l,t} + d^{-1}_{l,t}$, $\quad \hat{x}_{l,t|t} = \frac{r_{l,t}}{1 + d_{l,t}\lambda_{l,t}}$.
[8] Thomas, Slock, Asilomar'19
Lag-1 Smoothing Stage for Correlation Coefficient f
$y_t = A(t) F x_{t-1} + \tilde{v}_t$, where $\tilde{v}_t = A(t) w_t + v_t$, $\quad \tilde{v}_t \sim \mathcal{CN}(0, \tilde{R}_t)$.
We show in Lemma 1 of [9] that KF is not enough to adapt the hyperparameters; instead we need at least lag-1 smoothing (i.e. the computation of $\hat{x}_{k,t-1|t}$, $\sigma^2_{k,t-1|t}$ through BP). For the smoothing stage, we use BP.
Gaussian posterior for $x_{t-1}$:
$\sigma^{-2,(i)}_{k,t-1|t} = (\hat{f}^2_{k|t} + \sigma^2_{f_k|t})\, A^H_k(t)\, \tilde{R}_t^{-1} A_k(t) + \sigma^{-2}_{k,t-1|t-1}$,
$\langle x^{(i)}_{k,t-1|t} \rangle = \sigma^{2,(i)}_{k,t-1|t} \big( \hat{f}^H_{k|t} A^H_k(t)\, \tilde{R}_t^{-1} (y_t - A_{\overline{k}}(t) F_{\overline{k}|t} \langle x^{(i-1)}_{\overline{k},t-1|t} \rangle) + \frac{\hat{x}_{k,t-1|t-1}}{\sigma^2_{k,t-1|t-1}} \big)$,
where $\overline{k}$ gathers all components other than $k$.
Applying the MF rule, the resulting Gaussian distribution for $f_i$ has variance $\sigma^2_{f_i|t}$ and mean $\hat{f}_{i|t}$; the detailed derivations are in our paper [10]:
$\sigma^{-2}_{f_i|t} = (|\hat{x}_{i,t-1|t}|^2 + \sigma^2_{i,t-1|t})\, A^H_i(t)\, \tilde{R}_t^{-1} A_i(t)$,
$\hat{f}_{i|t} = \sigma^2_{f_i|t}\, \hat{x}^H_{i,t-1|t} A^H_i(t)\, \tilde{R}_t^{-1} (y_t - A_{\overline{i}}(t) \hat{F}_{\overline{i}|t} \hat{x}_{\overline{i},t-1|t})$.
$\tilde{R}_t = A(t) \Lambda^{-1} A(t)^H + \frac{1}{\gamma} I$.
[9] Thomas, Slock, Asilomar'19
[10] Thomas, Slock'20
Dictionary Learning using Tensor Signal Processing
(Figure: CPD of the data tensor Y as a sum of rank-1 terms.)
Let $Y_{i_1,\ldots,i_N}$ represent the $(i_1, i_2, \ldots, i_N)$-th element of the tensor and $y = [Y_{1,1,\ldots,1}, Y_{1,1,\ldots,2}, \ldots, Y_{I_1,I_2,\ldots,I_N}]^T$; then it can be verified that [11]
$y_t = (A_1(t) \otimes A_2(t) \otimes \cdots \otimes A_N(t))\, x_t + w_t$, $\quad w_t \sim \mathcal{N}(0, \gamma^{-1} I)$.
Matrix unfolding: $Y_{(n)} = A_n(t)\, X_{(n)}\, (A_N(t) \otimes \cdots \otimes A_{n+1}(t) \otimes A_{n-1}(t) \otimes \cdots \otimes A_1(t))^T$. $A_j(t)$ is of dimension $I_j \times P_j$ and the resulting tensor is in $\mathbb{C}^{I_1 \times \cdots \times I_N}$.
Retaining the tensor structure in the dictionary matrix leads to better estimates than learning the matricized (unstructured) version of $A$.
Fewer free variables to be estimated in the tensor structured case.
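The unfolding identity above can be checked numerically. Note the vectorization convention: with the mode-$n$ unfolding in which the earliest mode index varies fastest (Fortran order), the Kronecker factors appear in the reverse order $A_N \otimes \cdots \otimes A_1$ skipping $A_n$, as written above (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
I, P = (3, 4, 5), (2, 3, 2)
A = [rng.standard_normal((I[j], P[j])) for j in range(3)]
X = rng.standard_normal(P)

# Full tensor: Y = X x_1 A_1 x_2 A_2 x_3 A_3
Y = np.einsum('ia,jb,kc,abc->ijk', A[0], A[1], A[2], X)

def unfold(T, n):
    # Mode-n unfolding: mode n to the front, remaining modes flattened
    # with the earliest remaining mode index varying fastest (F order).
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

for n in range(3):
    others = [A[k] for k in reversed(range(3)) if k != n]  # A_N ... A_1, skip n
    K = np.kron(others[0], others[1])
    assert np.allclose(unfold(Y, n), A[n] @ unfold(X, n) @ K.T)
```

The same identity underlies the per-mode updates in the DL algorithms below: each factor $A_n(t)$ multiplies its own unfolding of the core from the left.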
Variational Bayesian inference using one of the following approximate posteriors:
$q(x, \lambda, \gamma, A) = q_\gamma(\gamma) \prod_{i=1}^{M} q_{x_i}(x_i) \prod_{i=1}^{M} q_{\lambda_i}(\lambda_i) \prod_{j=1}^{N} \prod_{i=1}^{P_j} q_{A_{j,i}}(A_{j,i})$ $\Longrightarrow$ SAVED-KS DL, or
$q(x, \lambda, \gamma, A) = q_\gamma(\gamma) \prod_{i=1}^{M} q_{x_i}(x_i) \prod_{i=1}^{M} q_{\lambda_i}(\lambda_i) \prod_{j=1}^{N} q_{A_j}(A_j)$ $\Longrightarrow$ Joint VB for DL.
[11] Sidiropoulos, De Lathauwer, Fu, Huang, Papalexakis, Faloutsos'17
Suboptimality of SAVED-KS DL and Joint VB
From the expression for the error covariance in the estimation of the factor $A_{j,i}$ (SAVED-KS DL in [12]), $\mathrm{tr}\{ (\bigotimes_{k=N, k \neq j}^{1} \langle A^T_k(t) A^*_k(t) \rangle) \langle X_{(j)}^T X_{(j)} \rangle \}\, I$ $\Longrightarrow$ it does not take into account the estimation error in the other columns of $A_j(t)$. The columns of $A_j(t)$ can be correlated: e.g., if we consider two paths (say $i$, $j$) with the same DoA but different delays, the delay responses $v_f(\tau_i(t))$ and $v_f(\tau_j(t))$ may be correlated.
The joint VB estimates (mean and covariance) can be obtained as
$M_j^T = \hat{A}^T_{1,j}(t) = \langle \gamma \rangle \Psi_j^{-1} B_j^T$, $\quad \Psi_j = \langle \gamma \rangle \big\langle X_{(j)} \big(\bigotimes_{k=N, k \neq j}^{1} \langle A^T_k(t) A^*_k(t) \rangle\big) X_{(j)}^T \big\rangle$, (1)
where $V_j = \langle X_{(j)} \rangle \big\langle \big(\bigotimes_{k=N, k \neq j}^{1} A_k(t)\big)^T \big\rangle$ and $B_j$ is defined as $(Y_{(j)} V_j^T)$ with its first row removed.
However, joint VB involves a matrix inversion and is not recommended for large system dimensions.
Nevertheless, it is possible to estimate each column of $A_j(t)$ by BP, since each column estimate can be expressed as the solution of a linear system of equations from (1): $\hat{A}^T_{j,i}(t) = \Psi_j^{-1} b_{j,i}$, where $b_{j,i}$ represents the $i$-th column of $B_j^T$.
[12] Thomas, Slock'19
Optimal Partitioning of the Measurement Stage and KS DL
Lemma
For the measurement stage, an optimal partitioning is to apply BP for the sparse vector $x_t$ and VB (SAVED-KS) for the columns of the factor matrices $A_{j,i}(t)$, assuming the vectors $A_{j,i}(t)$ are independent and have zero mean. However, if the columns of $A_j(t)$ are correlated, then a joint VB, with the posteriors of the factor matrices assumed independent, should be done for optimal performance.
Proof: follows from Lemma 1 of [13], where the main message was that if the parameter partitioning in VB is such that the different parameter blocks are decoupled at the level of the FIM, then VB is not suboptimal in terms of the (mismatched) Cramér-Rao bound (mCRB).
$y_t = \underbrace{\big( \sum_{r=1}^{M} x_{r,t} F_r \big)}_{F(x_t)} \underbrace{\big( \bigotimes_{j=1}^{N} \Phi_{j,t} \big)}_{f(\Phi_t)} + w_t$. $\quad J(\Phi_t) = [J(\Phi_{1,t}) \ \ldots \ J(\Phi_{N,t})]$,
where $J(\Phi_{j,t}) = F(x_t)\, (\Phi_{1,t} \otimes \cdots I_{I_j P_j} \cdots \otimes \Phi_{N,t})$,
$\mathrm{FIM} = \mathrm{blockdiag}\big( E(\gamma) J(\Phi_t)^T J(\Phi_t),\ \ E(\gamma) J(x_t)^T J(x_t) + E(\Gamma^{-1}),\ \ a\, E(\Gamma^{-2}),\ \ (N+c-1)\, E(\gamma^{-2}) \big)$.
[13] Kalyan, Thomas, Slock'19
BP-MF-EP Outperforms SAVED-KS DL
(Figure: NMSE vs SNR in dB. Curves: BP-MF-EP with ALS for DL; SAVED-KS with DL; BP-MF-EP with BP for DL; BP-MF-EP with Joint VB for DL; BP-MF-EP with known dictionary matrix.)
Static SBL: NMSE as a function of SNR.
ALS: Alternating Least Squares.
Exponential power delay profile for $x_t$.
30 non-zero elements in $x_t$, same support across all time.
Dimensions: 3-D tensor $(4, 8, 8)$, with $M = 200$.
Conclusions and Thank You!
Further advancements from [14]: VB with a too fine variable partitioning is quite suboptimal.
Better approximations are message-passing based methods such as belief propagation (BP) and expectation propagation (EP).
BP or EP message passing can be implemented using low-complexity methods such as AMP/GAMP/VAMP (AMP: approximate message passing), which are proven to be Bayes optimal under certain conditions on $A$.
We also derived a Generalized Vector AMP (GVAMP-SBL) version of SBL to take care of a diagonal power delay profile.
Further work to be done on learning a combination of structured and unstructured Kronecker factor matrices.
[14] Thomas, Slock'19
References (and Questions?)
K. Gopala, C. K. Thomas, D. Slock, "Sparse Bayesian learning for a bilinear calibration model and mismatched CRB," EUSIPCO, Sept. 2019.
C. Qian, X. Fu, N. D. Sidiropoulos, Y. Yang, "Tensor-based Parameter Estimation of Double Directional Massive MIMO Channel with Dual-Polarized Antennas," ICASSP, Apr. 2018.
E. Riegler, G. E. Kirkelund, C. N. Manchón, M. A. Badiu, B. H. Fleury, "Merging Belief Propagation and the Mean Field Approximation: A Free Energy Approach," IEEE Trans. on Info. Theory, Jan. 2013.
Z. Shakeri, A. D. Sarwate, W. U. Bajwa, "Identifiability of Kronecker-Structured Dictionaries for Tensor Data," IEEE Journ. Sel. Top. in Sig. Process., Oct. 2018.
C. K. Thomas, D. Slock, "SAVE - Space alternating variational estimation for sparse Bayesian learning," IEEE Data Sci. Wkshp, June 2018.
C. K. Thomas, D. Slock, "Space alternating variational estimation and Kronecker structured dictionary learning," ICASSP, May 2019.
C. K. Thomas, D. Slock, "Low Complexity Static and Dynamic Sparse Bayesian Learning Combining BP, VB and EP Message Passing," IEEE Asilomar, Nov. 2019.
C. K. Thomas, D. Slock, "BP-VB-EP based Static and Dynamic Sparse Bayesian Learning with Kronecker Structured Dictionaries," ICASSP, May 2020.
M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journ. Mach. Learn. Rese., Jun. 2001.