
3.3.2.2 Linear decoding: known covariance matrix

Here we restrict our decoding to be linear, i.e., consisting only of a projection step. Let us then try to learn this projection matrix from the data, i.e., we are given a training set F = [f_1, · · · , f_N]. Furthermore, we assume that the data is (second-order) stationary, so that a covariance matrix can be estimated from F. For high-dimensional data, e.g., images, and particularly when the training set is small, estimating a covariance matrix that also generalizes to the test set is very difficult. However, as we will see later, since we will use only the first several eigenvectors, which are more robust to estimate, the assumption that the covariance matrix is available is safe in practice.

Concretely, assume that C_F ≜ (1/N) E[FF^T] is the covariance matrix of F. We are interested in finding an optimal decoder that provides the best reconstruction given the ternary codes.

Projection: We want the codes x to be as informative as possible, which requires independence (or at least no correlation) among the dimensions. Thus, we perform the PCA transform by choosing A = U_F^T, where C_F = U_F Σ_F U_F^T is the eigenvalue decomposition of C_F.

⁹ Notice, however, that this formulation is interesting since, in a way, it is the opposite of compressed sensing. While compressed sensing aims at recovering high-dimensional sparse signals from low-dimensional dense measurements, Eq. 3.30 tries to reconstruct a dense signal from the sparsity pattern of a high-dimensional projection. Therefore, it may be considered as a future direction.

Therefore, the projected data f̃ ≜ Af is de-correlated and its marginal distributions converge to a Gaussian¹⁰ for sufficiently large n, as F̃_j ∼ N(0, σ_j²), where Σ_F = diag([σ_1², · · · , σ_n²]^T), with the σ_j²'s being the eigenvalues of C_F, which are decaying in value for correlated F.
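For illustration purposes, the covariance estimation and projection step could be sketched in a few lines of NumPy; the function name pca_projection and the column-wise, zero-mean layout of F are assumptions of this sketch rather than part of the formulation:

```python
import numpy as np

def pca_projection(F):
    """Estimate C_F from the training matrix F (n x N, assumed zero-mean, one sample
    per column) and return the PCA projection A = U_F^T together with the
    eigenvalues sigma_j^2 sorted in decreasing order."""
    C_F = F @ F.T / F.shape[1]          # empirical covariance matrix
    eigvals, U_F = np.linalg.eigh(C_F)  # eigh: ascending eigenvalues, orthonormal U_F
    order = np.argsort(eigvals)[::-1]   # re-order to a decaying variance profile
    return U_F[:, order].T, eigvals[order]
```

With F_tilde = A @ F, the empirical covariance of the projected data is then (approximately) diagonal, with the decaying σ_j² on its diagonal.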

Encoding: When thinking about reconstruction from the codes, we should take into account the fact that the dimensions of f̃ have a decaying variance profile and hence contribute differently to the reconstruction of f. However, when encoded as in Eq. 3.19, these differences are not taken into account for reconstruction. Therefore, at this point, it makes sense to improve our encoding with a slight change as in Eq. 3.29, i.e., by weighting the codes with a weighting vector β ≜ [β_1, · · · , β_m]^T,¹¹ whose optimal values should be calculated:

\[ x = Q[f] = \phi_\lambda(A f) \odot \beta. \qquad (3.29) \]
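For illustration, the weighted ternary encoder of Eq. 3.29 could be sketched as follows; phi_lambda and encode are illustrative names, and lam may be a scalar or a per-dimension vector of thresholds:

```python
import numpy as np

def phi_lambda(v, lam):
    """Ternary thresholding: +1 where v > lam, -1 where v < -lam, 0 otherwise."""
    return np.sign(v) * (np.abs(v) > lam)

def encode(f, A, lam, beta):
    """x = Q[f] = phi_lambda(A f) weighted element-wise by beta, as in Eq. 3.29."""
    return phi_lambda(A @ f, lam) * beta
```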

Decoding (reconstruction): As mentioned in section 3.3.2.1, we prefer fast, and hence linear, decoding for the reconstruction f̂ = Q⁻¹[x]. This is done simply using a (back-)projection matrix B as:

\[ \hat{f} = B x = B\,\big(\phi_\lambda(A f) \odot \beta\big). \qquad (3.30) \]

The optimal value of B can be learned from the data. In order to facilitate this, we can provide B with the knowledge of the forward projection A. Therefore, for a general (full-rank) A and without loss of generality, we decompose B as B = (A^T A)⁻¹ A^T B′, i.e., the pseudo-inverse of A followed by a matrix B′ that leaves the degrees of freedom for training.¹² The optimization of B is then equivalent to the optimization of B′ as:

\[ B'^{*} = \arg\min_{B'} \; \| F - (A^T A)^{-1} A^T B' X \|_F^2. \qquad (3.31) \]

This can easily be re-expressed as:

\[
\begin{aligned}
B'^{*} &= \arg\min_{B'} \; \| A^T A F - A^T B' X \|_F^2 \\
       &= \arg\min_{B'} \; \operatorname{Tr}\!\big[ (A F - B' X)^T A A^T (A F - B' X) \big] \\
       &= \arg\min_{B'} \; \operatorname{Tr}\!\big[ -2\, A A^T A F X^T B'^T + B' X X^T B'^T A A^T \big].
\end{aligned}
\]

¹⁰ Note that, according to the CLT, the Gaussianity of the marginals of F̃ does not require the Gaussianity of F, due to the projection step performed. However, the joint distribution p(f̃) is not a (multivariate) Gaussian in general.

¹¹ Note that in our general formulation, we have A as an m-by-n projection and hence m-dimensional codes. However, throughout section 3.3.2.2 and for simplicity of analysis, we assume m = n.

¹² This decomposition mainly eases the presentation, but in general and within iterative procedures, such decompositions can help convergence.

Differentiating w.r.t. B′ and equating to zero gives:

\[ B'^{*} = A F X^T (X X^T)^{-1}. \qquad (3.32) \]
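Eq. 3.32 is the familiar least-squares solution and can be computed with a linear solve rather than an explicit inverse; in the illustrative sketch below, the optional ridge term is our own addition for numerical stability when XX^T is ill-conditioned:

```python
import numpy as np

def optimal_B_prime(A, F, X, ridge=0.0):
    """B'_* = A F X^T (X X^T)^{-1}  (Eq. 3.32), computed via a linear solve."""
    G = X @ X.T + ridge * np.eye(X.shape[0])   # Gram matrix of the codes
    # Solve B' G = A F X^T for B' (equivalently G B'^T = X F^T A^T, since G is symmetric)
    return np.linalg.solve(G, X @ F.T @ A.T).T
```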

Noting that AF = F̃ and X = ϕ_λ(F̃) ⊙ (β 1_N^T), and by expressing the matrix multiplications as sums of rank-1 matrices, B′* can be re-written as:

\[
\begin{aligned}
B'^{*} &= \Big[ \tilde{F}\,\big(\phi_\lambda(\tilde{F}) \odot (\beta 1_N^T)\big)^T \Big]
          \Big[ \big(\phi_\lambda(\tilde{F}) \odot (\beta 1_N^T)\big)\big(\phi_\lambda(\tilde{F}) \odot (\beta 1_N^T)\big)^T \Big]^{-1} \\
       &= \Big[ \sum_{i=1}^{N} \tilde{f}_i \big(\phi_\lambda(\tilde{f}_i) \odot \beta\big)^T \Big]
          \Big[ (\beta \beta^T) \odot \sum_{i=1}^{N} \phi_\lambda(\tilde{f}_i)\, \phi_\lambda(\tilde{f}_i)^T \Big]^{-1}.
\end{aligned}
\]

As mentioned earlier, let us now choose A = U_F^T, i.e., the PCA. Due to the de-correlating property of A, in the projected domain and hence in the code domain, the covariance matrix ϕ_λ(F̃)ϕ_λ(F̃)^T and the cross-covariance matrix F̃ϕ_λ(F̃)^T are diagonal. Therefore, the expression of B′* above is a product of two diagonal matrices and hence is itself diagonal.

Moreover, it is a function only of the weighting vector β. This means that, in our encoding-decoding formulation and its optimization to minimize the reconstruction distortion, we can set B′ = I_n and optimize for β instead, without any loss of generality or of encoding-decoding power.

Consequently, our optimal decoder can in fact be taken as B = (A^T A)⁻¹ A^T, and since A in this case is orthonormal (due to the PCA), we simply have B = A^T.
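This diagonality argument can be checked numerically on synthetic data; in the self-contained sketch below, the toy data and the threshold choice λ_j = 0.5 σ_j are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 8, 200_000
F = rng.normal(size=(n, n)) @ rng.normal(size=(n, N))   # correlated zero-mean toy data

C_F = F @ F.T / N
eigvals, U_F = np.linalg.eigh(C_F)
A = U_F[:, ::-1].T                                       # PCA projection A = U_F^T
F_t = A @ F                                              # de-correlated data F_tilde
sigma = F_t.std(axis=1, keepdims=True)

Phi = np.sign(F_t) * (np.abs(F_t) > 0.5 * sigma)         # ternary codes phi_lambda(F_tilde)
cross = F_t @ Phi.T / N                                  # cross-covariance F_tilde phi(F_tilde)^T
auto = Phi @ Phi.T / N                                   # code covariance phi(F_tilde) phi(F_tilde)^T

off_diag = lambda M: np.abs(M - np.diag(np.diag(M))).max()
print(off_diag(cross), off_diag(auto))                   # both close to 0: (near-)diagonal matrices
```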

Now that B is also determined, we are left only with optimizing β. For this, we should first calculate the distortion as follows.

Distortion: By noting that f̂ = A^T x, the expected reconstruction distortion can probabilistically be characterized as:

\[
\begin{aligned}
D = \mathbb{E}\big[ d_E(F, \hat{F}) \big]
  &= \frac{1}{n}\, \mathbb{E}\big[ \| F - A^T X \|_2^2 \big] \\
  &= \frac{1}{n}\, \mathbb{E}\big[ \| A F - X \|_2^2 \big] && (3.33\text{a}) \\
  &= \frac{1}{n}\, \mathbb{E}\big[ \| A F - \phi_\lambda(A F) \odot \beta \|_2^2 \big] && (3.33\text{b}) \\
  &= \frac{1}{n}\, \mathbb{E}\big[ \| \tilde{F} - \phi_\lambda(\tilde{F}) \odot \beta \|_2^2 \big],
\end{aligned}
\]
where Eq. 3.33a follows from the orthonormality of A.

This expression is useful since it links the distortion of the original domain with that of the projection domain.

The total distortion D can be expressed as the sum of the distortions of the individual dimensions, i.e., D = Σ_{j=1}^n D_j. Noting that F̃_j, i.e., the j-th element of F̃ = [F̃_1, · · · , F̃_n]^T, is distributed as N(0, σ_j²), each D_j can be computed by integrating over the three decision regions of the ternary encoder. This integration leads to the expression of distortion as:

\[
D_j = \sigma_j^2 + 2 \beta_j^2\, Q\!\Big(\frac{\lambda_j}{\sigma_j}\Big)
      - 2 \beta_j \sigma_j \sqrt{\frac{2}{\pi}}\; e^{-\lambda_j^2 / (2 \sigma_j^2)}. \qquad (3.34)
\]

Weighting vector: Now we can find the optimal β_j that minimizes D_j. Fortunately, this can be expressed in closed form as:

\[
\beta_j^{*} = \arg\min_{\beta_j} D_j
            = \frac{\sigma_j \sqrt{2/\pi}\; e^{-\lambda_j^2 / (2 \sigma_j^2)}}{2\, Q(\lambda_j / \sigma_j)}. \qquad (3.35)
\]

Fig. 3.3e sketches β_j for the ternary encoding of a Gaussian distribution. Note that in the special case where λ = 0, the ternary encoding reduces to binary encoding and Eq. 3.35 reduces to the well-known binary quantization formula Δ = ±√(2/π) σ, e.g., as in (10.1) of [2].
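As a numerical sanity check of this closed form (as reconstructed above), the analytic β_j can be compared against a brute-force minimization of the empirical distortion of a single dimension; the helper beta_opt and the test values are illustrative:

```python
import numpy as np
from scipy.stats import norm

def beta_opt(sigma, lam):
    """beta_j of Eq. 3.35: sigma*sqrt(2/pi)*exp(-lam^2/(2 sigma^2)) / (2 Q(lam/sigma))."""
    return sigma * np.sqrt(2 / np.pi) * np.exp(-lam**2 / (2 * sigma**2)) / (2 * norm.sf(lam / sigma))

rng = np.random.default_rng(1)
sigma, lam = 2.0, 1.0
z = rng.normal(0.0, sigma, size=1_000_000)               # one projected dimension F_tilde_j
t = np.sign(z) * (np.abs(z) > lam)                       # its ternary code
betas = np.linspace(0.1, 5.0, 500)
D = [np.mean((z - b * t) ** 2) for b in betas]           # empirical distortion D_j(beta)
print(beta_opt(sigma, lam), betas[np.argmin(D)])         # closed form vs. brute force: close
print(beta_opt(sigma, 0.0), np.sqrt(2 / np.pi) * sigma)  # lam = 0 recovers the binary case
```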

It should be mentioned that for the storage of the codes, since β is the same for all codes, the x's can be stored as fixed-point {+1, 0, −1} values in memory. In fact, β is only needed to convert these fixed-point codes to the floating-point values that reconstruct the f's.

Rate: Now that we have fully characterized the distortion of ternary encoding, we should also characterize its rate. In fact, this is similar to the entropy calculation of section 3.3.1, where we had independent ternary variables with different sparsities. In this setup, however, we can only guarantee that these ternary variables are uncorrelated. Therefore, by assuming independence, we can provide an upper bound on the rate as:

\[
R \leqslant \frac{1}{n} \sum_{j=1}^{n} H(x_j),
\]

where we used the ternary entropy of Eq. 3.22.
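For illustration, the ternary entropy and the resulting per-dimension rate bound could be computed as below; the name ternary_entropy and the symmetric parameterization with sparsity α = P[x_j ≠ 0] are assumptions of this sketch:

```python
import numpy as np

def ternary_entropy(alpha):
    """Entropy in bits of a symmetric ternary variable with
    P(+1) = P(-1) = alpha/2 and P(0) = 1 - alpha."""
    alpha = np.asarray(alpha, dtype=float)
    p = np.stack([alpha / 2, alpha / 2, 1.0 - alpha])   # the three symbol probabilities
    safe_p = np.where(p > 0, p, 1.0)                    # avoid log(0); those terms contribute 0
    return -(p * np.log2(safe_p)).sum(axis=0)

def rate_upper_bound(alphas):
    """R <= (1/n) * sum_j H(alpha_j), assuming independent dimensions."""
    return float(np.mean(ternary_entropy(alphas)))
```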

Summary: Here we summarize the STC encoding of this section 3.3.2.2. Assuming that the data in the projection domain marginally follows a Gaussian distribution, we first apply the PCA transform to de-correlate the data.

Fig. 3.6 A unit of STC encoding and decoding for reconstruction: f is projected as Af, ternarized by ϕ(·) and weighted by β to give the code x = ϕ(Af) ⊙ β, which is back-projected by A^T to give the reconstruction f̂ = A^T x.

The ternarization is then performed according to Eq. 3.29, for which the values of β are calculated from Eq. 3.35. The decoding step for reconstruction is then performed using Eq. 3.30, for which we showed that B = A^T is the optimal choice. These steps are sketched in Fig. 3.6.
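Finally, for illustration, the complete unit of Fig. 3.6 can be put together in a few lines; in the sketch below, the threshold rule λ_j = 0.7 σ_j is only an illustrative choice and not a recommendation of this section:

```python
import numpy as np
from scipy.stats import norm

def train_stc(F, lam_scale=0.7):
    """Learn the projection A = U_F^T, thresholds lam and weights beta from training data F (n x N)."""
    C_F = F @ F.T / F.shape[1]
    eigvals, U_F = np.linalg.eigh(C_F)
    A = U_F[:, ::-1].T                                   # PCA, decreasing variance profile
    sigma = np.sqrt(np.maximum(eigvals[::-1], 1e-12))
    lam = lam_scale * sigma                              # illustrative per-dimension thresholds
    beta = (sigma * np.sqrt(2 / np.pi) * np.exp(-lam**2 / (2 * sigma**2))
            / (2 * norm.sf(lam / sigma)))                # optimal weights (Eq. 3.35)
    return A, lam, beta

def stc_encode(f, A, lam, beta):
    """x = phi_lambda(A f) weighted by beta  (Eq. 3.29)."""
    proj = A @ f
    return np.sign(proj) * (np.abs(proj) > lam) * beta

def stc_decode(x, A):
    """f_hat = A^T x, i.e., Eq. 3.30 with the optimal linear decoder B = A^T."""
    return A.T @ x
```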