
Descriptive Modeling (Clustering, Outlier Detection)

Theorem 7.3. Algorithm 4 returns as output the set of points that are global outliers while revealing no other information to any party, provided parties do not collude.

Proof. All parties know the number (and identity) of objects in O. Thus they can set up the loops; the simulator just runs the algorithm to generate most of the simulation. The only communication is at lines 8, 11, 15, and 16.

Step 8:

Each party P_q sees x' = x + Σ_{r=1}^{q-1} Distance_r(o_i, o_j), where x is the random value chosen by P_1. Pr(x' = y) = Pr(x + Σ_{r=1}^{q-1} Distance_r(o_i, o_j) = y) = Pr(x = y − Σ_{r=1}^{q-1} Distance_r(o_i, o_j)) = 1/|D|. Thus we can simulate the value received by choosing a random value from a uniform distribution over D.
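The uniformity argument can be checked concretely. The following sketch (illustrative Python, with a small toy modulus D standing in for the actual field) verifies that masking with a uniform x makes the message x' = (x + s) mod D a permutation of the mask values, which is exactly why the simulator can replace it with a uniform draw:

```python
# Illustrative check of the Step 8 simulation argument (D is a toy modulus).
D = 32          # size of the field over which the secure sum runs
s = 17          # stands in for the partial sum of Distance_r(o_i, o_j)

# For every possible mask x chosen by P_1, the message seen is (x + s) mod D.
messages = sorted((x + s) % D for x in range(D))

# The map x -> (x + s) mod D is a bijection on {0, ..., D-1}: the message
# is uniform whenever the mask is, so Pr(x' = y) = 1/|D| for every y.
assert messages == list(range(D))
print("x' is uniform over D; a simulator can sample it directly")
```

The same check succeeds for any value of s, which mirrors the claim that the received value leaks nothing about the partial sum.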

Steps 11 and 15:

Each step is again a secure comparison, so messages are simulated as in Steps 11 and 19 of Theorem 7.2.

Step 16:

This is again the final result, simulated as in Step 20 of Theorem 7.2. temp_1 is simulated by choosing a random value, with temp_k = result − temp_1. By the same argument on random shares used above, the distribution of simulated values is indistinguishable from the distribution of the shares.
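The random-share argument can likewise be made concrete. In this sketch (illustrative Python, toy modulus D) the result is split as a uniform temp_1 and temp_k = result − temp_1 mod D; each share on its own is uniformly distributed regardless of the result, so a simulated random share has an identical distribution:

```python
D = 32   # toy modulus standing in for the field

def share_result(result, temp1):
    """Split result into additive shares: temp_1 and temp_k = result - temp_1 (mod D)."""
    return temp1, (result - temp1) % D

# For a fixed result, as temp_1 ranges over all mask values, temp_k takes
# every value in {0, ..., D-1} exactly once: each share alone is uniform.
result = 23
temp_ks = sorted(share_result(result, t)[1] for t in range(D))
assert temp_ks == list(range(D))

# And the two shares always recombine to the result:
assert all(sum(share_result(result, t)) % D == result for t in range(D))
```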

Again, the simulator clearly runs in polynomial time (the same as the algorithm). Since each party is able to simulate the view of its execution (i.e., the probability of any particular value is the same as in a real execution with the same inputs/results) in polynomial time, the algorithm is secure with respect to Definition 3.1.

Without collusion and assuming a malicious-model secure comparison, a malicious party is unable to learn anything it could not learn from altering its input. Step 8 is particularly sensitive to collusion, but can be improved (at cost) by splitting the sum into shares and performing several such sums (see [47] for more discussion of collusion-resistant secure sum).
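One way to realize the share-splitting idea mentioned above is for each party to split its input into k additive shares and run k independent sums, so parties colluding in any single round see at most one uniformly random share of a victim's input. This is a minimal sketch under our own assumptions, not the specific construction of [47]:

```python
import random

D = 1 << 16   # modulus for the secure sum (illustrative choice)

def split(value, k):
    """Split value into k additive shares mod D."""
    shares = [random.randrange(D) for _ in range(k - 1)]
    shares.append((value - sum(shares)) % D)
    return shares

def collusion_resistant_sum(inputs, k=3):
    """Each party splits its input once; round r sums only the r-th shares.
    Parties colluding in one round learn a single (uniform) share at most."""
    all_shares = [split(v, k) for v in inputs]
    round_totals = [sum(s[r] for s in all_shares) % D for r in range(k)]
    return sum(round_totals) % D

assert collusion_resistant_sum([5, 11, 40]) == 56
```

In practice each round would also use a different ordering of the parties, so the same neighbors are not adjacent in every sum.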

7.3.7 Computation and Communication Analysis

In general we do not discuss the computational and communication complexity of the algorithms in detail. In this case, however, the algorithmic complexity raises interesting issues vis-à-vis security, so we discuss it below in detail.

Both Algorithms 3 and 4 suffer the drawback of having quadratic computation complexity due to the nested iteration over all objects.

Due to the nested iteration, Algorithm 3 requires O(n²) distance computations and secure comparisons (steps 10-11), where n is the total number of objects. Similarly, Algorithm 4 also requires O(n²) secure comparisons (step 11). While operation parallelism can be used to reduce the round complexity of communication, the key practical issue is the computational complexity of the encryption required for the secure comparison and scalar product protocols.
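For reference, the nested-loop structure behind the quadratic cost looks as follows. This is a plain, non-private sketch using one common distance-based outlier definition; the exact predicate in Algorithms 3 and 4 may differ, and the private versions replace the distance test with secure comparisons:

```python
import math

def distance_based_outliers(points, dt, p):
    """Flag o_i as an outlier if more than a fraction p of the points lie
    farther than dt from it (a DB(p, dt)-style definition).  The nested
    loops perform O(n^2) distance computations -- the quadratic cost."""
    n = len(points)
    outliers = []
    for i, oi in enumerate(points):
        far = sum(1 for j, oj in enumerate(points)
                  if i != j and math.dist(oi, oj) > dt)
        if far > p * n:
            outliers.append(i)
    return outliers

# A tight cluster plus one distant point: only the distant point is flagged.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(pts, dt=2.0, p=0.5))   # [4]
```

Every pair of points is examined, which is precisely what a sub-quadratic algorithm would have to avoid, and what makes the security analysis of such an algorithm delicate.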

This quadratic complexity is troubling, since the major focus of new algorithms for outlier detection has been to reduce the complexity, n² being assumed to be inordinately large. However, achieving lower than quadratic complexity is challenging, at least with the basic algorithm. Failing to compare all pairs of points is likely to reveal information about the relative distances of the points that are compared. Developing protocols where such revelation can be proven not to disclose information beyond that revealed by simply knowing the outliers is a challenge. Otherwise, completely novel techniques must be developed which do not require any pairwise comparison. When there are three or more parties, assuming no collusion, much more efficient solutions that reveal some information can be developed. In the following sections we discuss some of these techniques, developed by Vaidya and Clifton [83], for both partitionings of data. While not completely secure, the privacy versus cost tradeoff may be acceptable in some situations. An alternative (and another approach to future work) is demonstrating lower bounds on the complexity of fully secure outlier detection. However, significant work is required to make any of this happen, opening a rich area for future work.

Horizontally partitioned data

The most computation- and communication-intensive parts of the algorithm are the secure comparisons. With horizontally partitioned data, a semi-trusted third party can perform the comparisons and return random shares. The two comparing parties simply give the values to be compared to the third party to add and compare. As long as the third party does not collude with either of the comparing parties, the comparing parties learn nothing.
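A minimal sketch of this idea (our own illustrative Python with hypothetical names, assuming the distance is already held as additive shares mod D from the secure sum): the two parties hand their shares to the third party, which reconstructs the distance, compares it against the threshold, and hands the answer back as XOR-shares so that neither comparing party learns the outcome on its own:

```python
import random

D = 1 << 32   # modulus for the additive sharing (illustrative)

def third_party_compare(share_a, share_b, dt):
    """Semi-trusted third party: reconstructs the distance from the two
    additive shares, compares it with dt, and returns the comparison bit
    split into XOR-shares (one for each comparing party)."""
    distance = (share_a + share_b) % D
    bit = int(distance > dt)
    r = random.randrange(2)
    return r, bit ^ r

# Parties A and B hold additive shares of Distance(o_i, o_j):
distance = 12345
share_a = random.randrange(D)
share_b = (distance - share_a) % D
ra, rb = third_party_compare(share_a, share_b, dt=10_000)
assert ra ^ rb == 1   # 12345 > 10000, yet neither result share alone reveals it
```

The four messages mentioned below correspond to the two incoming shares and the two returned result shares.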

The real question is, what is disclosed to the third party? Since the data is horizontally partitioned, the third party has no idea of the respective locations of the two objects. All it can find out is the distance between the two objects. While this is information that is not a part of the result, by itself it is not very significant, and disclosing it allows a tremendous increase in efficiency. The cost of secure comparison now reduces to a total of 4 messages (which can be combined for all comparisons performed by the pair, for a constant number of rounds of communication) and insignificant computation cost.

Vertically partitioned data

The simple approach used in horizontal partitioning is not suitable for vertically partitioned data. Since all of the parties share all of the points, partial knowledge about a point does reveal useful information to a party. Instead, one of the remaining parties is chosen to play the part of a completely untrusted, non-colluding party. With this assumption, a much more efficient secure comparison algorithm, which reveals nothing to the third party, has been proposed by Cachin [11]. The algorithm is otherwise equivalent, but the cost of the comparisons is reduced substantially.

7.3.8 Summary

This section has presented privacy-preserving solutions for finding distance-based outliers in distributed data sets. An especially important and useful part is the security proofs for the algorithms. The basic proof technique (simulation) is the same for every secure algorithm in the Secure Multiparty Computation framework.
