Solution Approaches / Problems - PRIVACY PRESERVING DATA MINING

In the current day and age, data collection is ubiquitous. Collating knowledge from this data is a valuable task. If the data is collected and mined at a single site, the data mining itself does not really pose an additional privacy risk;

anyone with access to data at that site already has the specific individual information. While privacy laws may restrict use of such data for data mining (e.g., EC95/46 restricts how private data can be used), controlling such use is not really within the domain of privacy-preserving data mining technology.

The technologies discussed in this book are instead concerned with preventing disclosure of private data: mining the data when we aren't allowed to see it.

If individually identifiable data is not disclosed, the potential for intrusive misuse (and the resultant privacy breach) is eliminated.

The techniques presented in this book all start with an assumption that the source(s) and mining of the data are not all at the same site. This would seem to lead to distributed data mining techniques as a solution for privacy-preserving data mining. While we will see that such techniques serve as a basis for some privacy-preserving data mining algorithms, they do not solve the problem. Distributed data mining is eff"ective when control of the data resides with a single party. From a privacy point of view, this is little dif-ferent from data residing at a single site. If control/ownership of the data is centralized, the data could be centrally collected and classical data mining algorithms run. Distributed data mining approaches focus on increasing ef-ficiency relative to such centralization of data. In order to save bandwidth or utilize the parallelism inherent in a distributed system, distributed data mining solutions often transfer summary information which in itself reveals significant information.

If data control or ownership is distributed, then disclosure of private in-formation becomes an issue. This is the domain of privacy-preserving data mining. How control is distributed has a great impact on the appropriate so-lutions. For example, the first two privacy-preserving data mining papers both dealt with a situation where each party controlled information for a subset of individuals. In [56], the assumption was that two parties had the data divided

18 Solution Approaches / Problems

between them: A "collaborating companies" model. The motivation for [4], individual survey data, lead to the opposite extreme: each of thousands of individuals controlled data on themselves. Because the way control or owner-ship of data is divided has such an impact on privacy-preserving data mining solutions, we now go into some detail on the way data can be divided and the resulting classes of solutions.

3.1 Data Partitioning Models

Before formulating solutions, it is necessary to first model the different ways in which data is distributed in the real world. There are two basic data partition-ing / data distribution models: hurizontai partitionpartition-ing (a.k.a. homogeneous distribution) and vertical partitioning (a.k.a. heterogeneous distribution). We will now formally define these models. We define a dataset D in terms of the entities for whom the data is collected and the information that is collected for each entity. Thus, D = {E, / ) , where E is the entity set for whom information is collected and / is the feature set that is collected. We assume that there are k different sites. P i , . . . , P / ^ collecting datasets Di = (^i, / i ) , . . . ,Dk = {Ek,Ik) respectively.

Horizontal partitioning of data assumes that different sites collect the same sort of information about different entities. Therefore, in horizontal partition-ing. EG - [JiEi = Ei[j'"[JEk. a n d / c = ^^ - hf]'"f]h- Many such situations exist in real life. For example, all banks collect very similar infor-mation. However, the customer base for each bank tends to be quite different.

Figure 3.1 demonstrates horizontal partitioning of data. The figure shows two banks. Citibank and JPMorgan Chase, each of which collects credit card infor-mation for their respective customers. Attributes such as the account balance, whether the account is new, active, delinquent are collected by both. Merging the two databases together should lead to more accurate predictive models used for activities like fraud detection.

On the other hand, vertical partitioning of data assumes that different sites collect different feature sets for the same set of entities. Thus, in verti-cal partitioning. EG =- f]iEi = Eif].. .f]Ek, dmd IQ = [J^ = hi) • •-Uh-For example. •-Uh-Ford collects information about vehicles manufactured. Fire-stone collects information about tires manufactured. Vehicles can be linked to tires. This linking information can be used to join the databases. The global database could then be mined to reveal useful information. Figure 3.2 demon-strates vertical partitioning of data. First, we see a hypothetical hospital / insurance company collecting medical records such as the type of brain tu-mor and diabetes (none if the person does not suffer from the condition).

On the other hand, a wireless provider might be collecting other information such as the approximate amount of airtime used every day, the model of the cellphone and the kind of battery used. Together, merging this information for common customers and running data mining algorithms might give

com-Perturbation 19

CC# Active? Delinquent? New? Balance

Citibank

Fig. 3.1. Horizontal partitioning / Homogeneous distribution of data pletely unexpected correlations (for example, a person with Type I diabetes using a cell phone with Li/Ion batteries for more than an hour per day is very likely to suffer from primary brain tumors.) It would be impossible to get such information by considering either database in isolation.

While there has been some work on more complex partitionings of data (e.g., [44] deals with data where the partitioning of each entity may be differ-ent), there is still considerable work to be done in this area.

3.2 Perturbation

One approach to privacy-preserving data mining is based on perturbating the original data, then providing the perturbed dataset as input to the data mining algorithm. The privacy-preserving properties are a result of the pertur-bation: Data values for individual entities are distorted, and thus individually identifiable (private) values are not revealed. An example would be a survey:

A company wishes to mine data from a survey of private data values. While the respondents may be unwilling to provide those data values directly, they would be willing to provide perturbed/distorted results.

If an attribute is continuous, a simple perturbation method is to add noise generated from a specified probability distribution. Let X be an attribute and an individual have X = x, where x is a real value. Let r be a number

20 Solution Approaches / Problems

Dans le document PRIVACY PRESERVING DATA MINING (Page 23-26)