On a fundamental predictability-distortion trade-off in decentralized decisional networks

Lorenzo Miretti, Paul de Kerret, and David Gesbert EURECOM, Sophia Antipolis, France

ABSTRACT

Decentralized decisional networks are composed of agents that take coordinated decisions on the basis of individual noisy information about the system state, i.e. under a so-called distributed state information configuration. In such scenarios, the application of algorithms directly derived from classical centralized optimization often incurs severe performance degradation. This is because the agents fail to predict each other’s decisions due to local state information noise, and hence do not coordinate properly. On the other hand, coordination can be easily enhanced by letting one agent reduce its dependency on its state information, thus making its decision more predictable but less adapted to the system state. This observation naturally leads to a fundamental trade-off, coined here the predictability-distortion trade-off. The goal of this paper is to formulate this trade-off and to propose a framework to explore it, based on a concept of quantization under a predictability constraint.

Index Terms— quantization, coordination, multi-agent systems, team decision problems, Lloyd algorithm

1. INTRODUCTION

Many types of networks benefit from the cooperative behavior of their agents. Examples include car networks, energy networks, interference-limited radio networks [1], etc. Unfortunately, agents are often limited in the quality of the system state information to which they seek to adapt their decisions so as to maximize system performance. More importantly, noisy state information (with possibly unequal noise levels) at the agents prevents them from reliably predicting each other's decisions, and hence from coordinating. For instance, a reliable distributed resource allocation scheme in an interference-limited mobile network is difficult to obtain when the available channel state information is very noisy at some of the radio devices.

In this paper we consider a system where two agents wish to cooperatively coordinate their decisions based on possibly imperfect observations of the system state $X$, a so-called distributed state information configuration. When the

L. Miretti, P. de Kerret, and D. Gesbert are supported by the European Research Council under the European Union’s Horizon 2020 research and innovation program (Agreement no. 670896).

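[Figure: the source $X \sim p(x)$ is observed through the channel $p(y_1, y_2 | x)$. Agent 1 (quantizer) maps $Y_1$ to $Z = \beta(\alpha(Y_1))$, and Agent 2 (estimator) maps $Y_2$ to $\hat{Z} = \beta(\gamma(Y_2))$. Performance is measured by $E[d(X, Z)]$ and $P(\hat{Z} \neq Z)$.]

Fig. 1. System model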

state information is perfectly shared among all agents, perfect coordination is achieved by simply running at each agent an instance of a centralized decision algorithm. We call such a configuration a logically centralized configuration. However, when this condition is not met, the situation becomes significantly more complex, as each agent needs to take its decision as a function of its beliefs about the actions of the other agents [2]. This falls into the framework of so-called team decision problems [3], which have remained largely unsolved despite significant efforts (see [4] for an overview related to wireless communications). An alternative approach to alleviate the complexity of such problems is to consider systems with hierarchical state information, in which one agent has access to the state information of the other agent or, equivalently, to the decision taken at the other agent. In that setting, the coordination task is simplified and efficient low-complexity heuristics are known [4].

In this work, we focus on enforcing hierarchical state information through quantization. Indeed, quantizing the information (or the decision) clearly makes it easier for the other node to guess it correctly. However, there is intuitively a trade-off between the information loss, or distortion, introduced by the quantizer, and the predictability, which is measured here in terms of the probability that the other agent guesses the right value.

The main goal of this paper is precisely to investigate the fundamental interplay between distortion and predictability, and to discuss algorithms achieving performance as close as possible to the fundamental limits.

2. SYSTEM MODEL AND PROBLEM STATEMENT

Let us consider a source symbol $X \in \mathcal{X}$ and two correlated observations $Y_1 \in \mathcal{Y}_1$, $Y_2 \in \mathcal{Y}_2$, distributed according to $p_{XY_1Y_2}$ and available at two different agents. We then let $Z \in \mathcal{Z}$ be a quantized representation of $X$ obtained from the observation $Y_1$ at Agent 1, where $\mathcal{Z} \subset \mathcal{X}$ is a finite codebook of cardinality $|\mathcal{Z}| = M$. Furthermore, we let $\hat{Z} \in \mathcal{Z}$ be an estimate of $Z$ obtained from $Y_2$ at Agent 2.

We wish to design the quantizer at Agent 1 and the estimator at Agent 2 such that $Z$ can be reliably estimated, while minimizing an expected distortion. More precisely, we look at the problem

$$\begin{aligned} \text{minimize} \quad & D = E[d(X, Z)] \\ \text{subject to} \quad & P_e = P(\hat{Z} \neq Z) \leq \epsilon, \end{aligned} \tag{1}$$

where $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ is an arbitrary distortion measure, and where $\epsilon \in [0, 1]$ defines a given estimation reliability tolerance. The minimization is carried out over the set of tuples $(\alpha, \beta, \gamma)$, where $\alpha: \mathcal{Y}_1 \to \mathcal{I} = \{1, \ldots, M\}$ is an encoder that maps $Y_1$ into indices, $\beta: \mathcal{I} \to \mathcal{Z}$ is a decoder that bijectively maps the indices into the reconstruction codewords $z_i = \beta(i)$, and where $\gamma: \mathcal{Y}_2 \to \mathcal{I}$ is an estimator that maps $Y_2$ into estimated indices. We assume w.l.o.g. $z_i \neq z_j$ for all $i \neq j$, so that, in order to estimate $Z$, Agent 2 needs only to produce an estimate $\hat{i} = \gamma(y_2)$ of $i = \alpha(y_1)$ and then apply $\hat{z} = \beta(\hat{i})$.

Let us denote by $D^\star(\epsilon)$ the optimum of Problem (1) for a given predictability $\epsilon$. The function $D^\star(\epsilon)$ is non-increasing, but in general non-convex. However, any point in the convex envelope of $D^\star(\epsilon)$ can be achieved by allowing time-sharing between two solutions $(\alpha, \beta, \gamma)$ and $(\alpha, \beta, \gamma)'$. Hence, the goal of this work is to provide a characterization of the distortion-predictability function

$$D(\epsilon) = \mathrm{Conv}\{D^\star(\epsilon)\}. \tag{2}$$

A conventional technique for characterizing the convex envelope (2) of the curve $D^\star(\epsilon)$ is to minimize the Lagrangian functional

$$L_\lambda(\alpha, \beta, \gamma) := E[d(X, \beta(\alpha(Y_1)))] + \lambda P(\gamma(Y_2) \neq \alpha(Y_1)) \tag{3}$$

for a fixed Lagrangian multiplier $\lambda \in [0, \infty)$. In fact, for problems of the type of (1), the optimal $L^\star(\lambda)$ identifies a line of slope $\lambda$ supporting the convex envelope of $D^\star(\epsilon)$. Hence, by sweeping $\lambda$ from zero to infinity, and given that we can compute the minimum of $L_\lambda(\alpha, \beta, \gamma)$, it is possible to completely characterize $D(\epsilon)$.

For the aforementioned reasons, in the remainder of this paper, we focus on the following unconstrained problem

$$\underset{(\alpha, \beta, \gamma)}{\text{minimize}} \quad L_\lambda(\alpha, \beta, \gamma), \quad \text{for } \lambda \geq 0. \tag{4}$$

3. CHARACTERIZATION OF THE DISTORTION-PREDICTABILITY FUNCTION

In this section we propose an algorithm for solving (4), hereafter referred to as the Predictability Constrained Quantizer (PCQ), that guarantees convergence to a local optimum. The proposed algorithm is a modification of the celebrated Lloyd algorithm [5], in a similar spirit to [6], i.e. based on an alternating minimization procedure.

3.1. Necessary conditions for optimality

We start by recalling a necessary condition for the optimality of the decoder $\beta$, known as the centroid condition, developed for standard vector quantization [7] and readily extensible to the problem considered here.

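Lemma 1. Let $(\alpha, \gamma)$ be any tuple of encoders and estimators. Define the set $\mathcal{I}_{\mathrm{eff}} := \{i \in \mathcal{I} \mid P(\alpha(Y_1) = i) > 0\}$. Then, the optimal decoder $\beta^\star(i) = z_i^\star$ that minimizes the Lagrangian (3) is given by

$$\forall i \in \mathcal{I}_{\mathrm{eff}}, \quad z_i^\star \in \arg\min_{z \in \mathcal{X}} E[d(X, z) \mid \alpha(Y_1) = i].$$

The points $z_i^\star$ are called the centroids of the quantization regions $\{y_1 \in \mathcal{Y}_1 \mid \alpha(y_1) = i\}$.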

Proof. Since the constraint set of problem (1) does not depend on $\beta$, and neither does the term $\lambda P(\gamma(Y_2) \neq \alpha(Y_1))$ of (3), the proof follows directly from the classical result in [7] for unconstrained minimization.
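As an illustration of the centroid condition, the following is a minimal sketch that assumes finite alphabets and the squared-error distortion, for which the centroid is the conditional mean. The function name `centroids` and the toy joint distribution are hypothetical and are not part of the original development.

```python
def centroids(p_xy1, alpha, num_cells):
    """Conditional-mean centroids z_i = E[X | alpha(Y1) = i] (squared error).

    p_xy1: dict mapping (x, y1) -> probability (hypothetical toy input).
    alpha: dict mapping y1 -> cell index in range(num_cells).
    Cells with zero probability return None, since their centroid is arbitrary.
    """
    mass = [0.0] * num_cells
    moment = [0.0] * num_cells
    for (x, y1), prob in p_xy1.items():
        i = alpha[y1]
        mass[i] += prob
        moment[i] += prob * x
    return [moment[i] / mass[i] if mass[i] > 0 else None
            for i in range(num_cells)]


# Toy joint pmf (assumption): X in {-1, +1}, Y1 equals X flipped w.p. 0.1.
p_xy1 = {(-1, -1): 0.45, (-1, 1): 0.05, (1, -1): 0.05, (1, 1): 0.45}
alpha = {-1: 0, 1: 1}
print(centroids(p_xy1, alpha, num_cells=3))
```

The third cell is never selected by the encoder in this toy case, so no centroid is computed for it.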

Note that for $i \in \mathcal{I} \setminus \mathcal{I}_{\mathrm{eff}}$, the centroids are not well-defined, but they can be selected arbitrarily without impacting the cost function. Furthermore, if $d(x, z)$ is convex in $z$, the centroids can be easily computed by standard convex optimization techniques. As a classical example, if the distortion measure is the squared error $d(x, z) = \|x - z\|_2^2$ over a Euclidean space, then the centroids can be computed as

$$z_i^\star = E[X \mid \alpha(Y_1) = i]. \tag{5}$$

Next, we state a necessary condition for the optimality of the encoder $\alpha$.

Lemma 2. Let $(\beta, \gamma)$ be any tuple of decoders and estimators. Then, the optimal encoder $\alpha^\star$ that minimizes the Lagrangian (3) is given by

$$\alpha^\star(y_1) \in \arg\min_i \Big\{ E[d(X, z_i) \mid Y_1 = y_1] + \lambda P(\gamma(Y_2) \neq i \mid Y_1 = y_1) \Big\}. \tag{6}$$

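Proof. We lower bound the objective $L_\lambda(\alpha, \beta, \gamma)$ as

$$\begin{aligned} L_\lambda(\alpha, \beta, \gamma) &= \int_{\mathcal{Y}_1} \Big\{ E[d(X, \beta(\alpha(y_1))) \mid Y_1 = y_1] + \lambda P(\gamma(Y_2) \neq \alpha(y_1) \mid Y_1 = y_1) \Big\} \, dP_{Y_1}(y_1) \\ &\geq \int_{\mathcal{Y}_1} \min_i \Big\{ E[d(X, z_i) \mid Y_1 = y_1] + \lambda P(\gamma(Y_2) \neq i \mid Y_1 = y_1) \Big\} \, dP_{Y_1}(y_1), \end{aligned}$$

which is achieved by a mapping $\alpha$ that minimizes the integrand almost everywhere, i.e. by (6).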

As an example, for the squared-error distortion we obtain

$$\alpha^\star(y_1) \in \arg\min_i \Big\{ \| E[X \mid Y_1 = y_1] - z_i \|_2^2 + \lambda P(\gamma(Y_2) \neq i \mid Y_1 = y_1) \Big\}.$$
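The squared-error encoder rule above can be sketched as follows for finite alphabets. This is only an illustration under stated assumptions: the function name `encoder_update`, the dictionary-based inputs, and the toy numbers are hypothetical, and ties are broken towards the lowest index.

```python
def encoder_update(cond_mean, p_y2_given_y1, z, gamma, lam):
    """For each y1, pick the index i minimizing
    (E[X | Y1 = y1] - z_i)^2 + lam * P(gamma(Y2) != i | Y1 = y1).

    cond_mean: dict y1 -> E[X | Y1 = y1]
    p_y2_given_y1: dict y1 -> dict y2 -> P(Y2 = y2 | Y1 = y1)
    z: list of codewords; gamma: dict y2 -> estimated index
    """
    alpha = {}
    for y1, mean in cond_mean.items():
        best_i, best_cost = None, float("inf")
        for i, zi in enumerate(z):
            # Probability that Agent 2 would guess index i given Y1 = y1.
            p_match = sum(prob for y2, prob in p_y2_given_y1[y1].items()
                          if gamma[y2] == i)
            cost = (mean - zi) ** 2 + lam * (1.0 - p_match)
            if cost < best_cost:
                best_i, best_cost = i, cost
        alpha[y1] = best_i
    return alpha


# Toy inputs (assumptions): three observations at Agent 1, two at Agent 2.
cond_mean = {0: -1.0, 1: 0.1, 2: 1.0}
p_y2_given_y1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.8, 1: 0.2}, 2: {0: 0.1, 1: 0.9}}
z = [-1.0, 1.0]
gamma = {0: 0, 1: 1}
print(encoder_update(cond_mean, p_y2_given_y1, z, gamma, lam=0.0))
print(encoder_update(cond_mean, p_y2_given_y1, z, gamma, lam=1.0))
```

In this toy case the middle observation is assigned to the nearest codeword for $\lambda = 0$, and to the more predictable index once the predictability term is weighted by $\lambda = 1$.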

Finally, we state a necessary condition for the optimality of the estimatorγ.

Lemma 3. Let $(\alpha, \beta)$ be any tuple of encoder and decoder. Then, the optimal estimator $\gamma^\star$ that minimizes the Lagrangian (3) is given by

$$\gamma^\star(y_2) \in \arg\max_i P(\alpha(Y_1) = i \mid Y_2 = y_2),$$

which corresponds to the maximum a posteriori (MAP) estimator.

Proof. Similar to the proof of Lemma 2, hence omitted.
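For finite alphabets, the MAP rule of Lemma 3 and the resulting error probability $P_e$ can be sketched as follows. The function names and the toy joint distribution of $(Y_1, Y_2)$ are hypothetical, and ties are resolved towards the lowest index rather than at random.

```python
def map_estimator(p_y1y2, alpha, num_cells):
    """gamma[y2] = argmax_i P(alpha(Y1) = i, Y2 = y2), which has the same
    maximizer as the posterior P(alpha(Y1) = i | Y2 = y2)."""
    score = {}
    for (y1, y2), prob in p_y1y2.items():
        row = score.setdefault(y2, [0.0] * num_cells)
        row[alpha[y1]] += prob
    return {y2: max(range(num_cells), key=row.__getitem__)
            for y2, row in score.items()}


def error_probability(p_y1y2, alpha, gamma):
    """P_e = P(gamma(Y2) != alpha(Y1))."""
    return sum(prob for (y1, y2), prob in p_y1y2.items()
               if gamma[y2] != alpha[y1])


# Toy joint pmf of (Y1, Y2) (assumption) and a fixed two-cell encoder.
p_y1y2 = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.20,
          (1, 1): 0.10, (2, 0): 0.05, (2, 1): 0.30}
alpha = {0: 0, 1: 1, 2: 1}
gamma = map_estimator(p_y1y2, alpha, num_cells=2)
print(gamma, error_probability(p_y1y2, alpha, gamma))
```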

We conclude this part by noticing that the above necessary conditions can be equivalently stated by designing the quantizer according to a modified distortion measure $\tilde{d}: \mathcal{Y}_1 \times \mathcal{X} \to \mathbb{R}^+$ operating directly on the noisy source $Y_1$, given by [8]

$$\tilde{d}(y_1, z) := E[d(X, z) \mid Y_1 = y_1], \tag{7}$$

since $D = E[d(X, Z)] = E[E[d(X, Z) \mid Y_1]] = E[\tilde{d}(Y_1, Z)]$.

3.2. The PCQ algorithm

We now describe the proposed PCQ algorithm for solving Problem (4), based on the necessary conditions for optimality developed in the previous section.

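Starting from an arbitrary initial tuple $(\alpha^{(0)}, \beta^{(0)}, \gamma^{(0)})$, and fixing a predictability weight $\lambda \in [0, \infty)$, we form a sequence $\{(\alpha^{(k)}, \beta^{(k)}, \gamma^{(k)})\}_{k=0}^{\infty}$ defined by the following alternating minimization procedure:

$$\begin{aligned} \alpha^{(k+1)} &\in \arg\min_{\alpha} L_\lambda(\alpha, \beta^{(k)}, \gamma^{(k)}) \\ \beta^{(k+1)} &\in \arg\min_{\beta} L_\lambda(\alpha^{(k+1)}, \beta, \gamma^{(k)}) \\ \gamma^{(k+1)} &\in \arg\min_{\gamma} L_\lambda(\alpha^{(k+1)}, \beta^{(k+1)}, \gamma). \end{aligned}$$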

Every step of iteration $k \geq 1$ is obtained by applying the optimality conditions in Lemma 1, Lemma 2, and Lemma 3, and the modified distortion measure (7). In case of multiple minima, the algorithm picks one solution at random. By definition, the sequence $L_\lambda^{(k)} := L_\lambda(\alpha^{(k)}, \beta^{(k)}, \gamma^{(k)})$ satisfies $0 \leq L_\lambda^{(k+1)} \leq L_\lambda^{(k)}$, hence the convergence of the algorithm to a local minimum is guaranteed. In practice, the algorithm is executed until a given stopping criterion $|L_\lambda^{(k+1)} - L_\lambda^{(k)}| / L_\lambda^{(k+1)} \leq \delta$ is met. The algorithm is summarized in Figure 2.
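To make the alternating procedure concrete, the following is a minimal sketch for finite alphabets and squared-error distortion. It is not the implementation used for the results of this paper: the function names, the toy source with integer noise, the initialization, and the deterministic tie-breaking (lowest index instead of a random choice) are all assumptions made for illustration.

```python
def lagrangian(p, alpha, z, gamma, lam):
    """L = E[(X - z_alpha(Y1))^2] + lam * P(gamma(Y2) != alpha(Y1))."""
    return sum(q * ((x - z[alpha[y1]]) ** 2 + lam * (gamma[y2] != alpha[y1]))
               for (x, y1, y2), q in p.items())


def pcq(p, alpha, z, gamma, lam, delta=1e-9, max_iter=200):
    """Alternating minimization over (alpha, beta, gamma) on finite alphabets.

    p: dict (x, y1, y2) -> probability; alpha: dict y1 -> index;
    z: list of codewords (the decoder beta); gamma: dict y2 -> index.
    """
    M = len(z)
    history = [lagrangian(p, alpha, z, gamma, lam)]
    for _ in range(max_iter):
        # Encoder step (Lemma 2): per-y1 cost, up to the common factor p(y1).
        cost = {y1: [0.0] * M for y1 in alpha}
        for (x, y1, y2), q in p.items():
            for i in range(M):
                cost[y1][i] += q * ((x - z[i]) ** 2 + lam * (gamma[y2] != i))
        alpha = {y1: min(range(M), key=c.__getitem__) for y1, c in cost.items()}
        # Decoder step (Lemma 1): conditional means; empty cells keep their value.
        mass, moment = [0.0] * M, [0.0] * M
        for (x, y1, _), q in p.items():
            mass[alpha[y1]] += q
            moment[alpha[y1]] += q * x
        z = [moment[i] / mass[i] if mass[i] > 0 else z[i] for i in range(M)]
        # Estimator step (Lemma 3): MAP estimate of the encoder index from y2.
        post = {y2: [0.0] * M for y2 in gamma}
        for (_, y1, y2), q in p.items():
            post[y2][alpha[y1]] += q
        gamma = {y2: max(range(M), key=r.__getitem__) for y2, r in post.items()}
        history.append(lagrangian(p, alpha, z, gamma, lam))
        # Relative-decrease stopping rule, written without a division.
        if history[-2] - history[-1] <= delta * history[-1]:
            break
    return alpha, z, gamma, history


# Hypothetical toy model: X uniform on {-3, ..., 3} with additive integer noise.
xs = range(-3, 4)
noise1 = {-1: 0.1, 0: 0.8, 1: 0.1}
noise2 = {-1: 0.25, 0: 0.5, 1: 0.25}
p = {(x, x + n1, x + n2): q1 * q2 / len(xs)
     for x in xs for n1, q1 in noise1.items() for n2, q2 in noise2.items()}

z0 = [-3.0, -1.0, 1.0, 3.0]
nearest = lambda y: min(range(len(z0)), key=lambda i: abs(y - z0[i]))
alpha0 = {y: nearest(y) for y in range(-4, 5)}
gamma0 = {y: nearest(y) for y in range(-4, 5)}

for lam in (0.0, 1.0, 10.0):
    alpha, z, gamma, history = pcq(p, alpha0, z0, gamma0, lam)
    pe = sum(q for (x, y1, y2), q in p.items() if gamma[y2] != alpha[y1])
    print(f"lambda={lam}: L={history[-1]:.4f}, Pe={pe:.4f}, codebook={z}")
```

Each of the three steps is an exact minimization with the other two components fixed, so the recorded values in `history` are non-increasing, in line with the monotonicity argument above. The point reached still depends on the initialization.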

It is easy to see that for $\lambda = 0$ the PCQ algorithm reduces to the classical unconstrained quantizer of [7, 8]. For $\lambda \to \infty$ instead, where the distortion $D$ becomes asymptotically negligible, the Lagrangian is minimized by a perfectly predictable quantizer of rate zero, where all the probability mass is put on a single quantization point $z \in \arg\min_z E[d(X, z)]$.

As for the classical Lloyd algorithm, it is important to underline that the PCQ algorithm does not specify the codebook cardinality $M$. Indeed, the larger $M$, the more precise the characterization of $D(\epsilon)$, since the optimization problem (1) is defined for any $M$, which should in principle be optimized as well. Clearly, for the extreme point $D(1)$ the optimal number of codewords is often unbounded. In practice, we set $M$ as high as the available computational capabilities allow. However, for the region of interest of most applications, i.e. for relatively small $\epsilon$, the PCQ algorithm usually puts zero probability mass in most of the quantization regions, resulting in a low-rate quantizer.

Finally, since the PCQ algorithm cannot guarantee global optimality, the trade-off region $D(\epsilon)$ is characterized only in terms of an achievable upper bound.

3.3. Statistical inference of the PCQ algorithm

In some applications the distribution $p_{XY_1Y_2}$ might be unknown, or the update rules in Figure 2 might be too difficult to be treated analytically. However, given the availability of a training sequence $\{x^N, y_1^N, y_2^N\}$ generated according to the joint distribution $\prod_{n=1}^{N} p_{XY_1Y_2}(x_n, y_{1n}, y_{2n})$, it is still possible to approximately infer the PCQ.

Similarly to [7], we can consider the minimization of the functional

$$\hat{L}_\lambda(\alpha, \beta, \gamma) := \frac{1}{N} \sum_{n=1}^{N} \Big\{ \tilde{d}(y_{1n}, \beta(\alpha(y_{1n}))) + \lambda \mathbb{1}[\gamma(y_{2n}) \neq \alpha(y_{1n})] \Big\}, \tag{8}$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. This ‘empirical average’ functional corresponds to the ‘expected’ functional (3) with respect to the sample distribution of the training data.
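The empirical functional (8) can be evaluated directly from paired training samples, as in the following minimal sketch. The function name, the callable-based interface, and the toy threshold encoder, estimator, and modified distortion with gain 0.5 are hypothetical choices made for illustration.

```python
def empirical_lagrangian(samples, d_tilde, alpha, beta, gamma, lam):
    """Average of d_tilde(y1, beta[alpha(y1)]) + lam * 1[gamma(y2) != alpha(y1)]
    over paired training samples (y1, y2)."""
    total = 0.0
    for y1, y2 in samples:
        i = alpha(y1)
        total += d_tilde(y1, beta[i]) + lam * (gamma(y2) != i)
    return total / len(samples)


# Toy setup (assumptions): two-cell sign encoder and estimator.
samples = [(-1.0, -0.8), (0.5, -0.2), (2.0, 1.5)]
alpha = lambda y1: 0 if y1 < 0 else 1
gamma = lambda y2: 0 if y2 < 0 else 1
beta = [-1.0, 1.0]
d_tilde = lambda y1, z: (0.5 * y1 - z) ** 2  # additive constant dropped

print(empirical_lagrangian(samples, d_tilde, alpha, beta, gamma, lam=2.0))
```

In this toy case only the second sample is misestimated, so the predictability term contributes $\lambda / 3$ to the average.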

By the strong law of large numbers, we have that $\hat{L}_\lambda \to L_\lambda$ almost surely for $N \to \infty$. It is possible to minimize (8) by simply substituting all the statistical operators $E[\cdot]$ and $P(\cdot)$

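Algorithm 1: PCQ

Set $(\alpha^{(0)}, \beta^{(0)}, \gamma^{(0)})$, $\lambda \geq 0$, $M \gg 1$, $\delta > 0$, $L_\lambda^{(0)} = \infty$
while $|L_\lambda^{(k+1)} - L_\lambda^{(k)}| / L_\lambda^{(k+1)} > \delta$ do
    $\alpha(y_1) = \arg\min_i \{ \tilde{d}(y_1, \beta(i)) + \lambda P(\gamma(Y_2) \neq i \mid Y_1 = y_1) \}$;
    $\beta(i) = \arg\min_{z \in \mathcal{X}} E[\tilde{d}(Y_1, z) \mid \alpha(Y_1) = i]$;
    $\gamma(y_2) = \arg\max_i P(\alpha(Y_1) = i \mid Y_2 = y_2)$;
end

Fig. 2. The proposed PCQ algorithm for finding a local minimum of $L_\lambda(\alpha, \beta, \gamma)$.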

in Figure 2 with their respective empirical operators, i.e. their expression with respect to the sample distribution of the training sequence $\{y_1^N, y_2^N\}$. Note that $(\alpha, \gamma)$ are defined over the finite alphabets $\hat{\mathcal{Y}}_1, \hat{\mathcal{Y}}_2$ composed of the occurrences of $Y_1, Y_2$ in the training data. The resulting scheme is indeed a clustering algorithm over the training set.

The method described above has some important limitations. Firstly, it cannot be applied to continuous alphabets. In fact, due to the particular structure of the sample distribution derived from continuous alphabets, it can be easily seen that, $\forall \lambda$, the $\gamma$ update step degenerates to $\gamma^{(k+1)}(y_{2n}) = \alpha^{(k)}(y_{1n})$ almost surely, leading to extremely suboptimal solutions. Secondly, it requires the knowledge of $\tilde{d}$, i.e. at least the knowledge of the marginal $p_{XY_1}$. Similarly to [8], if this information is not available, it is not possible to adapt the previous scheme in a straightforward way.

A possible heuristic to address the first problem is to consider instead a discretized version $(Y_1^Q, Y_2^Q)$ of $(Y_1, Y_2)$, obtained by applying some quantizer of properly tuned (finite) rate. For the second problem, one can obtain approximations of $\tilde{d}$ from samples $\{x^N, y_1^N\}$ by applying density estimation techniques [9] (see [10] for the squared-error distortion).
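The degeneration of the empirical estimator on continuous data, and the effect of the discretization heuristic, can be illustrated with the following minimal sketch. The noise levels, the fixed two-level encoder, the uniform binning step of 0.5, and all function names are assumptions chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def empirical_map(y2_values, indices):
    """Empirical MAP: for each observed y2 value, the most frequent index."""
    counts = defaultdict(Counter)
    for y2, i in zip(y2_values, indices):
        counts[y2][i] += 1
    return {y2: c.most_common(1)[0][0] for y2, c in counts.items()}


def uniform_bin(y, step):
    """Hypothetical uniform quantizer used to discretize the observations."""
    return round(y / step)


random.seed(0)
N = 2000
x = [random.gauss(0, 1) for _ in range(N)]
y1 = [xi + random.gauss(0, 0.1) for xi in x]
y2 = [xi + random.gauss(0, 0.3) for xi in x]
idx = [int(v > 0) for v in y1]  # a fixed two-level encoder alpha(y1)

# On raw continuous samples, every y2 value is (almost surely) unique, so the
# empirical MAP estimator simply memorizes the encoder output of each sample.
gamma_raw = empirical_map(y2, idx)
err_raw = sum(gamma_raw[v] != i for v, i in zip(y2, idx)) / N

# After discretization, several samples share a bin and errors become visible.
y2q = [uniform_bin(v, 0.5) for v in y2]
gamma_q = empirical_map(y2q, idx)
err_q = sum(gamma_q[b] != i for b, i in zip(y2q, idx)) / N

print(err_raw, err_q)
```

The training error on the raw samples is uninformative about the true error probability, whereas the discretized version yields a non-trivial estimate whose accuracy depends on the chosen binning rate.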

4. EXAMPLE

We consider $Y_1$ and $Y_2$ to be Gaussian noisy measurements of a Gaussian source $X \sim \mathcal{N}(0, 1)$. More specifically, we let

$$Y_1 = X + N_1, \quad N_1 \sim \mathcal{N}(0, \mathrm{SNR}_1^{-1}), \qquad Y_2 = X + N_2, \quad N_2 \sim \mathcal{N}(0, \mathrm{SNR}_2^{-1}),$$

for a given pair of signal-to-noise ratios (SNR), and where $N_1$, $N_2$, and $X$ are independent. As distortion measure, we choose the squared-error distortion $d(x, z) = (x - z)^2$.

Under these assumptions, it is easy to see that $\tilde{d}(y_1, z) = (\kappa y_1 - z)^2 + c$, where $\kappa = \sigma_{XY_1} \sigma_{Y_1}^{-2}$ is the gain of the minimum mean-square error (MMSE) estimator $\kappa Y_1$ of $X$ given $Y_1$, and where $c$ is a constant

[Figure 3, panels (a) and (b): pdf, thresholds, and quantization points, plotted over the interval from -5 to 5.]

(a) $\lambda = 0$, $M = 8$, $\mathrm{SNR}_1 = 30$ dB, $\mathrm{SNR}_2 = 20$ dB.

(b) $\lambda = 10$, $M = 8$, $\mathrm{SNR}_1 = 30$ dB, $\mathrm{SNR}_2 = 20$ dB.

Fig. 3. Illustration of the behaviour of the quantizer for two different choices of the predictability weight $\lambda$. In panel (a), for $\lambda = 0$, the solution is similar to the classical $M$-level Lloyd quantizer. In panel (b), for $\lambda = 10$, we see that the PCQ algorithm produces fewer quantization regions containing more probability mass.

that depends only on the conditional distribution $p_{X|Y_1}$ and hence can be neglected throughout the optimization steps of the PCQ algorithm (it is indeed the MMSE value itself, i.e. the distortion floor for every quantizer on $Y_1$). The decoder update rule is then given by

$$\beta(i) = \kappa E[Y_1 \mid \alpha(Y_1) = i],$$

which is evaluated analytically in terms of $Q$-functions. The encoder update is obtained numerically by discretizing the alphabet $\mathcal{Y}_1$, by evaluating analytically the augmented distances $\tilde{d}(y_1, \beta(i)) + \lambda P(\gamma(Y_2) \neq i \mid Y_1 = y_1)$ for every $i$ and $y_1$, and by solving the optimization problem through exhaustive search over $i = 1, \ldots, M$. The estimator update rule is obtained in a similar way. The PCQ algorithm is run several times with random initialization $(\alpha^{(0)}, \beta^{(0)}, \gamma^{(0)})$, and for several $\lambda$. An achievable $D(\epsilon)$ curve is obtained by taking the convex hull of the obtained $(D, \epsilon)$ points.
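The closed-form quantities of this example can be sketched as follows. Since $\sigma_{XY_1} = 1$ and $\sigma^2_{Y_1} = 1 + \mathrm{SNR}_1^{-1}$, the gain is $\kappa = \mathrm{SNR}_1 / (1 + \mathrm{SNR}_1)$ and the constant is $c = 1 - \kappa$. The decoder point below assumes that the quantization region is an interval $(a, b]$, which is an assumption suggested by the thresholds of Figure 3 rather than a stated property, and it uses the error function in place of the $Q$-function. All function names are hypothetical.

```python
import math


def gaussian_example_constants(snr1_db):
    """Return (kappa, c, var_y1) for X ~ N(0, 1) and Y1 = X + N1."""
    snr1 = 10 ** (snr1_db / 10)
    var_y1 = 1 + 1 / snr1      # Var(Y1) = Var(X) + Var(N1)
    kappa = 1 / var_y1         # Cov(X, Y1) / Var(Y1), with Cov(X, Y1) = 1
    c = 1 - kappa              # MMSE of X given Y1
    return kappa, c, var_y1


def std_normal_pdf(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def std_normal_cdf(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))


def decoder_point(a, b, snr1_db):
    """beta(i) = kappa * E[Y1 | a < Y1 <= b] for an interval cell (assumed)."""
    kappa, _, var_y1 = gaussian_example_constants(snr1_db)
    s = math.sqrt(var_y1)
    mass = std_normal_cdf(b / s) - std_normal_cdf(a / s)
    # Mean of a zero-mean Gaussian with standard deviation s truncated to (a, b].
    truncated_mean = s * (std_normal_pdf(a / s) - std_normal_pdf(b / s)) / mass
    return kappa * truncated_mean


kappa, c, var_y1 = gaussian_example_constants(30.0)
print(kappa, c)
print(decoder_point(0.0, float("inf"), 30.0))
```

For a cell that is symmetric around zero, the decoder point is zero, as expected from the symmetry of the source.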

Figure 3 illustrates the effect of the choice of $\lambda$ on the output of the PCQ algorithm. Note that we only plot the thresholds and the quantization points of the quantizer $\beta(\alpha(\cdot))$ applied at Agent 1, since $\gamma$ simply corresponds to a MAP estimator.

Figure 4 shows the achievable $D(\epsilon)$ curve obtained via the PCQ algorithm, normalized by the minimum mean squared error (MMSE) of $X$ given $Y_1$. We also show the performance of a heuristic approach based on the Lloyd algorithm for noisy sources [8], where different degrees of predictability are enforced by simply varying the number $M$ of quantization points. We denote this second approach as Predictable Lloyd Quantizer (PLQ).
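The step that turns the collected operating points into an achievable curve can be sketched as a lower convex envelope computation. The following uses the standard monotone-chain construction. The function name and the toy points are hypothetical, and the final truncation reflects the fact that any tolerance larger than a given $\epsilon$ also admits the solutions feasible for $\epsilon$, so the curve is non-increasing.

```python
def lower_convex_envelope(points):
    """Lower convex envelope of (pe, d) points, ordered by increasing pe.

    Consecutive envelope points can be connected by time-sharing between the
    corresponding solutions.
    """
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop hull[-1] if it lies on or above the segment hull[-2] -> p.
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    # Keep only the non-increasing part, up to the minimum-distortion point.
    best = min(range(len(hull)), key=lambda k: hull[k][1])
    return hull[:best + 1]


# Toy (pe, d) points, e.g. collected over several runs and values of lambda.
points = [(0.0, 1.0), (0.1, 0.5), (0.2, 0.45), (0.3, 0.2),
          (0.5, 0.1), (0.6, 0.3), (0.1, 0.9)]
print(lower_convex_envelope(points))
```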

[Figure 4: normalized distortion $D(\epsilon) / \mathrm{MMSE}$ versus the tolerance $\epsilon$ in $P_e \leq \epsilon$, for the PCQ and PLQ schemes.]

Fig. 4. Achievable distortion-predictability function for the considered jointly Gaussian example with squared-error distortion. $M = 32$, $\mathrm{SNR}_1 = 30$ dB, $\mathrm{SNR}_2 = 20$ dB.

5. DISCUSSION

In this preliminary study we describe a method for exploring the trade-off between predictability and distortion in a setting composed of a pair of distributed measurements $Y_1, Y_2$ of a random system state $X$. As a natural next step, we are currently investigating the benefits of applying the quantizer developed here to obtain suboptimal solutions to team decision problems. Specifically, by limiting the optimization space to the class of decision functions derived from a hierarchical information structure, a close-to-optimal operating point can be obtained by varying the parameter $\lambda$ of the proposed PCQ algorithm, applied to $Y_1$ as a pre-processing step to enhance coordination. Note that a similar idea has been successfully applied in [11] to the problem of distributed precoding in cooperative wireless networks.

6. REFERENCES

[1] D. Gesbert, S. Hanly, H. Huang, S. Shamai Shitz, O. Simeone, and W. Yu, “Multi-cell MIMO cooperative networks: A new look at interference,” IEEE Journal on Selected Areas in Communications, vol. 28, no. 9, pp. 1380–1408, December 2010.

[2] J. C. Harsanyi, “Games with incomplete information played by Bayesian players,” Management Science, 1967.

[3] R. Radner, “Team decision problems,” The Annals of Mathematical Statistics, 1962.

[4] D. Gesbert and P. de Kerret, “Team methods for device cooperation in wireless networks,” in Cooperative and Graph Signal Processing, pp. 469–487. Elsevier, 2018.

[5] S. Lloyd, “Least squares quantization in PCM,” Unpublished Bell Laboratories Technical Note, 1957. Published in the March 1982 special issue on quantization of the IEEE Transactions on Information Theory.

[6] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

[7] Y. Linde, A. Buzo, and R. Gray, “An algorithm for vector quantizer design,” IEEE Transactions on Communications, vol. 28, no. 1, pp. 84–95, 1980.

[8] Y. Ephraim and R. M. Gray, “A unified approach for encoding clean and noisy sources by means of waveform and autoregressive model vector quantization,” IEEE Transactions on Information Theory, 1988.

[9] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, 1986.

[10] Y. A. Ghassabeh and F. Rudzicz, “Noisy source vector quantization using kernel regression,” IEEE Transactions on Communications, vol. 62, no. 11, pp. 3825–3834, 2014.

[11] A. N. Bazco, L. Miretti, P. de Kerret, D. Gesbert, and N. Gresset, “Achieving vanishing rate loss in decentralized network MIMO,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT). IEEE, 2019.