
An Alternative: Secure Powell's Algorithm

In the document PRIVACY PRESERVING DATA MINING (pages 70-73)

Predictive Modeling for Regression

Theorem 6.2. If $M$ is $k$-secure, any nonzero linear combination of the columns of $M$ generates a column vector with at least $k + 1$ nonzero entries.

5.2.4 An Alternative: Secure Powell's Algorithm

In the previous sections, all the protocols for secure regression analysis are based on the following formula,

$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$

As a matter of fact, $\hat{\beta}$ is an explicit minimizer of the following minimization problem:

$$\hat{\beta} = \arg\min_{\beta} (Y - X\beta)^T (Y - X\beta). \qquad (5.9)$$

Recall that the columns of $X$ may be distributed across several parties. For example, when there are two parties, $X = (A, B)$, where $A$ belongs to the first party and $B$ to the second. The protocols discussed in the previous sections tried to secure the direct calculation of $\hat{\beta}$ through Equation 5.3. Recently, Sanil et al. proposed a new method that avoids the direct use of the formula for computing the regression coefficients. Instead, they propose a secure protocol for the minimization of Equation 5.9. As will be discussed later, this leads to higher security for the calculation of $\beta$ and may have further implications for secure statistical analysis and distributed computing in general.
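As a small illustration of the two formulations above, the following sketch checks numerically that the closed-form estimate and the minimizer of Equation 5.9 coincide. The data, the sizes, and the names `A`, `B`, `objective` and `beta_hat` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.normal(size=(n, 2))      # attributes held by the first party (made-up data)
B = rng.normal(size=(n, 1))      # attributes held by the second party (made-up data)
X = np.hstack([A, B])            # X = (A, B): columns split across the two parties
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def objective(beta):
    """Sum of squared residuals (Y - X beta)^T (Y - X beta)."""
    r = Y - X @ beta
    return r @ r

# Closed-form estimate (X^T X)^{-1} X^T Y, solved without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares minimizer of the same objective, computed independently.
beta_min, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The off-diagonal block of X^T X is A^T B, which mixes both parties' columns.
cross_block = (X.T @ X)[:2, 2:]
```

The `cross_block` term is the part of the direct formula that cannot be computed by either party alone, which is why the earlier protocols need a secure matrix product.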

This alternative approach challenges the wisdom of using the standard analytical solution in distributed and/or secure computing. The procedure proposed by Sanil et al.[77] is a modification of the well-known Powell's algorithm for quadratic optimization.

Suppose $f(\beta)$ is the target function one wants to minimize, where $\beta \in \mathbb{R}^p$.

Powell's algorithm is a derivative-free procedure that finds the minimizer of $f(\beta)$ through a series of line optimizations of $f(\beta)$ in various directions. The following is pseudocode for the algorithm.

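Initialization: Select an arbitrary orthogonal basis for $\mathbb{R}^p$: $d^{(1)}, d^{(2)}, \ldots, d^{(p)}$. Also set an arbitrary starting point $\tilde{\beta}$.

Iteration: Repeat the following block of steps $p$ times:

• Set $\beta \leftarrow \tilde{\beta}$.

• For $i = 1, 2, \ldots, p$:

- Find $\delta$ that minimizes $f(\beta + \delta d^{(i)})$.

- Set $\beta \leftarrow \beta + \delta d^{(i)}$.

• For $i = 1, 2, \ldots, (p - 1)$: Set $d^{(i)} \leftarrow d^{(i+1)}$.

• Set $d^{(p)} \leftarrow \beta - \tilde{\beta}$.

• Final line search:

- Find $\delta$ that minimizes $f(\beta + \delta d^{(p)})$.

- Set $\tilde{\beta} \leftarrow \beta + \delta d^{(p)}$.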
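The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than Sanil et al.'s implementation: it assumes $f$ is quadratic, so each line search can be solved exactly from three function values, which keeps it derivative-free. The names `line_min_quadratic` and `powell_quadratic`, and the test data, are hypothetical.

```python
import numpy as np

def line_min_quadratic(f, beta, d):
    """Exact minimizer of f(beta + delta * d) over delta, assuming f is quadratic.

    Uses only three function values along the line (no derivatives).
    """
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return 0.0                       # nothing to search along a zero direction
    u = d / norm                         # unit direction keeps the fit well scaled
    g0, gp, gm = f(beta), f(beta + u), f(beta - u)
    curvature = gp - 2.0 * g0 + gm       # second difference of the parabola
    if curvature <= 0.0:
        return 0.0                       # flat or degenerate line: do not move
    return (gm - gp) / (2.0 * curvature) / norm

def powell_quadratic(f, beta_start):
    """Powell's algorithm with p outer iterations, as in the pseudocode."""
    beta_tilde = np.asarray(beta_start, dtype=float).copy()
    p = beta_tilde.size
    directions = [row.copy() for row in np.eye(p)]      # orthogonal basis d^(1..p)
    for _ in range(p):
        beta = beta_tilde.copy()
        for d in directions:                             # p line searches
            beta = beta + line_min_quadratic(f, beta, d) * d
        new_d = beta - beta_tilde
        directions = directions[1:] + [new_d]            # shift, then append d^(p)
        beta_tilde = beta + line_min_quadratic(f, beta, new_d) * new_d
    return beta_tilde

# Made-up regression problem used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
Y = rng.normal(size=40)
loss = lambda b: float(np.sum((Y - X @ b) ** 2))
beta_powell = powell_quadratic(loss, np.zeros(4))
```

For a general (non-quadratic) target, the three-point step would have to be replaced by a one-dimensional optimizer, and more than $p$ outer iterations would typically be needed.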

Note that during each iteration in the algorithm above, $(p + 1)$ one-dimensional minimizations need to be carried out. Powell proved that when $f(\beta)$ is a quadratic function, the final output of the algorithm is exactly the minimizer of $f(\beta)$. In regression, the target function $L(\beta) = (Y - X\beta)^T (Y - X\beta)$ is clearly a quadratic function, hence the algorithm can be employed directly for computing $\hat{\beta}$. For secure regression, however, additional effort needs to be exerted so that the involved parties do not disclose their attribute values. Sanil et al.[77] assume that all the participants are semi-honest. Utilizing the secure sum protocol, they propose the following procedure for securely computing the regression coefficients.

The key to Powell's algorithm is line optimization. Along any given direction $d$, the minimizer of $L(\beta + \delta d)$ over $\delta$ is

$$\delta = \frac{(Y - X\beta)^T X d}{(Xd)^T X d} = \frac{z^T w}{w^T w},$$

where $z = Y - X\beta$ and $w = Xd$. For convenience, we assume that there are three parties denoted by $A_1$, $A_2$ and $A_3$ respectively. The data owned by $A_i$ is denoted by $X_{A_i}$, and the regression coefficients for the attributes held by $A_i$ are denoted by $\beta_{A_i}$, for $i = 1, 2, 3$. Hence, $X = (X_{A_1}, X_{A_2}, X_{A_3})$ and $\beta^T = (\beta_{A_1}^T, \beta_{A_2}^T, \beta_{A_3}^T)$. Any given direction $d$ can also be decomposed into three components as $d^T = (d_{A_1}^T, d_{A_2}^T, d_{A_3}^T)$. Then,

$$z = Y - [X_{A_1}\beta_{A_1} + X_{A_2}\beta_{A_2} + X_{A_3}\beta_{A_3}]$$

and

$$w = X_{A_1}d_{A_1} + X_{A_2}d_{A_2} + X_{A_3}d_{A_3}.$$

Clearly, $z$ and $w$ can be computed using the secure summation protocol. It is straightforward to extend the discussion above to any number of parties.
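The computation of $z$, $w$ and the step $\delta$ from per-party terms can be simulated as below. The secure summation protocol is not spelled out in this section, so the sketch assumes a common ring-based variant in which the first party adds a random mask that it removes at the end. Real protocols typically work with modular integer arithmetic; floating-point masks are used here only for brevity. The function `secure_sum`, the data, and the assumption that $Y$ is available to the party finishing the sum are all hypothetical.

```python
import numpy as np

def secure_sum(local_vectors, rng):
    """Simulate a ring-based secure sum of one private vector per party.

    Assumed variant: party 1 masks its vector, every other party adds its own
    vector to the running total, and party 1 finally removes the mask.
    """
    mask = rng.uniform(-1e3, 1e3, size=local_vectors[0].shape)  # known to party 1 only
    running = mask + local_vectors[0]
    for v in local_vectors[1:]:          # each party sees only a masked running total
        running = running + v
    return running - mask

rng = np.random.default_rng(1)
n, sizes = 6, (2, 1, 2)
X_parts = [rng.normal(size=(n, q)) for q in sizes]    # X_{A_1}, X_{A_2}, X_{A_3}
beta_parts = [rng.normal(size=q) for q in sizes]      # beta_{A_1}, beta_{A_2}, beta_{A_3}
d_parts = [rng.normal(size=q) for q in sizes]         # d_{A_1}, d_{A_2}, d_{A_3}
Y = rng.normal(size=n)                                # assumed available for the final step

# Each party contributes only its own product X_{A_j} beta_{A_j} or X_{A_j} d_{A_j}.
z = Y - secure_sum([Xa @ ba for Xa, ba in zip(X_parts, beta_parts)], rng)
w = secure_sum([Xa @ da for Xa, da in zip(X_parts, d_parts)], rng)
delta = (z @ w) / (w @ w)                             # line minimizer along d
```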

The secure Powell's algorithm proposed by Sanil et al.[77] is described as follows. First, an initial set of search directions $d^{(1)}, d^{(2)}, \ldots, d^{(p)}$ is chosen such that the component $d^{(r)}_j$ is zero if attributes $r$ and $j$ are not held by the same party. In the algorithm, each $A_j$ will know and update only $d^{(r)}_{A_j}$, which is the portion related to the attributes owned by $A_j$. Second, set up the initial values for the regression coefficients $\tilde{\beta}^T = (\tilde{\beta}_{A_1}^T, \tilde{\beta}_{A_2}^T, \ldots, \tilde{\beta}_{A_k}^T)$. Third, the following block of steps is iterated $p$ times, which leads to the exact least squares estimate $\hat{\beta}$.

Secure Iteration

1. Each $A_j$ sets $\beta_{A_j} \leftarrow \tilde{\beta}_{A_j}$.

2. For $r = 1, 2, \ldots, p$:

(a) Each $A_j$ computes $X_{A_j}\beta_{A_j}$ and $X_{A_j}d^{(r)}_{A_j}$.

(b) $z = Y - \sum_{j=1}^{k} X_{A_j}\beta_{A_j}$ and $w = \sum_{j=1}^{k} X_{A_j}d^{(r)}_{A_j}$ are computed collectively by $A_1, \ldots, A_k$ using the secure summation protocol.

(c) Compute $\delta = z^T w / w^T w$.

(d) Each $A_j$ updates $\beta_{A_j} \leftarrow \beta_{A_j} + \delta d^{(r)}_{A_j}$.

3. For $r = 1, 2, \ldots, (p - 1)$: Each $A_j$ updates $d^{(r)}_{A_j} \leftarrow d^{(r+1)}_{A_j}$.

4. Each $A_j$ updates $d^{(p)}_{A_j} \leftarrow \beta_{A_j} - \tilde{\beta}_{A_j}$.

5. $z$, $w$ and $\delta$ are computed as before, and each $A_j$ updates $\tilde{\beta}_{A_j} \leftarrow \beta_{A_j} + \delta d^{(p)}_{A_j}$.

The secure Powell's algorithm outputs the regression coefficients $\hat{\beta}$ and the residuals, which are equal to the final $z$ when the algorithm terminates.

Using these results, diagnostics including the all-purpose residual plot and other goodness of fit measures such as $R^2$ can be generated. Compared to the protocols discussed in Section 5.2.1, the secure Powell's algorithm is at least as competitive in terms of generating the necessary results for fitting a regression model with diagnostics.
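The secure iteration and its outputs can be simulated in a single process as sketched below. Each party's coefficients and direction pieces are kept in separate arrays, and only summed vectors cross party boundaries. The `secure_sum` function here is a plain-sum placeholder standing in for the actual protocol, and the data, the names, and the choice of standard basis vectors and a zero starting point are assumptions made for illustration.

```python
import numpy as np

def secure_sum(vectors):
    # Placeholder for the secure summation protocol: only the total is revealed.
    return np.sum(vectors, axis=0)

def line_step(Y, X_parts, beta_parts, d_parts):
    """Steps (a)-(c): collective z and w, then delta = z^T w / w^T w."""
    z = Y - secure_sum([X @ b for X, b in zip(X_parts, beta_parts)])
    w = secure_sum([X @ d for X, d in zip(X_parts, d_parts)])
    ww = w @ w
    return (z @ w) / ww if ww > 0 else 0.0

def secure_powell(Y, X_parts):
    sizes = [X.shape[1] for X in X_parts]
    p = sum(sizes)
    offsets = np.cumsum([0] + sizes[:-1])
    eye = np.eye(p)
    # d^(r) is a standard basis vector, stored as one slice per party, so each
    # initial direction is nonzero only inside the party owning attribute r.
    directions = [[eye[r, o:o + s].copy() for o, s in zip(offsets, sizes)]
                  for r in range(p)]
    tilde = [np.zeros(s) for s in sizes]                 # arbitrary starting point
    for _ in range(p):
        beta = [t.copy() for t in tilde]                 # step 1
        for d in directions:                             # step 2
            delta = line_step(Y, X_parts, beta, d)
            beta = [b + delta * dj for b, dj in zip(beta, d)]
        new_d = [b - t for b, t in zip(beta, tilde)]     # step 4
        directions = directions[1:] + [new_d]            # step 3 (shift) plus new d^(p)
        delta = line_step(Y, X_parts, beta, new_d)       # step 5
        tilde = [b + delta * dj for b, dj in zip(beta, new_d)]
    # This sketch recomputes z at the returned coefficients with one more sum.
    residuals = Y - secure_sum([X @ t for X, t in zip(X_parts, tilde)])
    return tilde, residuals

rng = np.random.default_rng(2)
X_parts = [rng.normal(size=(40, q)) for q in (2, 1, 2)]  # three parties, made-up data
Y = rng.normal(size=40)
coef_parts, resid = secure_powell(Y, X_parts)
r_squared = 1.0 - (resid @ resid) / np.sum((Y - Y.mean()) ** 2)
```

The last line shows how a goodness of fit measure such as $R^2$ follows from the residuals alone, without any party seeing another party's columns.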

Although a rigorous security comparison between the secure Powell's algorithm and the protocols based on Section 5.2.1 would be difficult, it appears that the former ought to be more secure because, during the iterations of the algorithm, each $A_j$ only updates its own portions of the regression coefficients and the search directions. As a result, only some aggregate values of linear combinations of the parties' private data are revealed, while the other parties do not know the coefficients of the linear combinations at all. In other words, the other parties know at most the results, not the linear equations. In the matrix-product based protocols, the parties do know the linear equations and the values of these equations, so there is a risk that they can solve these equations to guess or infer the original values.

A possible downside of the secure Powell's algorithm is that it may not be appropriate for a small number of parties. For example, if there are only two parties, the secure summation protocol will not be able to protect the private data very well. This is true for a small number of parties ($> 2$) as well, especially when many iterations are needed. Another risk is that, if a party holds only a small number of attributes and this is known to the other parties, the privacy protection it receives is significantly weaker than that of a party that holds a large number of attributes. For other concerns as well as more details of the technique, readers are referred to Sanil et al.[77].
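The two-party case can be made concrete with a few lines of arithmetic: whatever protocol produces the sum, a party that knows the total and its own term can subtract to obtain the other party's term. The data and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
X_A1, X_A2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))   # private columns
d_A1, d_A2 = rng.normal(size=2), rng.normal(size=2)             # private direction pieces

# The total w is what both parties learn from the summation step.
w = X_A1 @ d_A1 + X_A2 @ d_A2

# Party A_1 subtracts its own term and recovers A_2's contribution exactly.
recovered = w - X_A1 @ d_A1
```

Party $A_1$ still does not know $d_{A_2}$, so what it obtains is a linear combination of $A_2$'s columns with unknown coefficients, but a new combination of this kind is exposed at every line search.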

Most data mining procedures and analyses involve various optimizations. This alternative approach to secure regression indicates that non-standard approaches to these optimizations might be more suitable when constraints like privacy protection are present. This idea is worth further investigation in other tasks of statistical analysis and data mining as well.
