6 What a Miner Should Not Know

In the case of web data, the miner is processing public data after all, and if there is any sensitive information it is only because some website contains it. But this is not always the case: sometimes the data are privately held by some company, and given to the miner only by virtue of a contract that should in any case preserve the rights of the individuals whose personal information is contained in the data being processed. For example, the Facebook graph mentioned above was provided by Facebook itself, and that graph is likely to contain a lot of sensitive information about Facebook users. In general, privacy is becoming a more and more central problem in the data mining field.

An early, infamous example of the privacy risks involved in the data studied by miners is the so-called “AOL search data leak”. AOL (previously known as “America Online”) is a quite popular Internet company that used to run a search engine; in 2006 they decided to distribute a large querylog to research institutes around the world. A querylog is a dataset containing the queries submitted to a search engine (during a certain time frame and from a certain geographical region); some of the queries (besides the query text and other data, like when the query was issued or which links the user decided to click on) came with an identification of the user who made the query (for those users who were logged in). To avoid putting the privacy of its users at risk, AOL substituted the names of the users with numeric identifiers.

Two journalists from The New York Times, though, by analysing the text of the queries, were able to give a name and a face to one of the users: they established that user number 4417749 was in fact Thelma Arnold, a 62-year-old widow who lived in Lilburn, Georgia. From that, they were able, for example, to determine that Mrs. Arnold was suffering from a range of ailments (she kept searching for things like “hand tremors”, “dry mouth” etc.), along with a number of other potentially sensitive pieces of information about her.

AOL acknowledged their mistake and the dataset was immediately removed, but in September 2006 a class action was filed against AOL in the U.S. District Court for the Northern District of California. The privacy breach ultimately led to the resignation of AOL’s chief technology officer, Maureen Govern.

Preserving the anonymity of individuals when publishing social-network data is a challenging problem that has recently attracted a lot of attention [29,30].

Even just the case of graph publishing is difficult, and poses a number of theoretical and practical questions. Overall, the idea is to introduce into the data some amount of noise so as to protect the identity of individuals. There is clearly a trade-off between privacy and utility: introducing too much noise certainly protects individuals, but it makes the published data unusable for any practical purpose by the data miners! Solving this conundrum is the mixed blessing of a whole research area often referred to as data anonymization.

Limiting our attention to graph data only, most methods rely on a number of (deterministic or randomized) modifications of the graph, where typically edges are added, deleted or switched. Recently we proposed an interesting alternative, based on uncertain graphs. An uncertain graph is a graph endowed with probabilities on its edges (where each probability is to be interpreted as the “probability that the edge exists”, independently for every edge); in fact, uncertain graphs are a compact way to express some graph distributions.
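To fix ideas, here is a minimal sketch (in Python, with made-up data; it is not the representation used in [31]) of an uncertain graph stored as a map from edges to probabilities, together with the sampling step that makes the “distribution over graphs” reading concrete: each possible world is obtained by keeping every edge independently with its own probability.

    import random

    # An uncertain graph as a dictionary mapping each (undirected) edge to the
    # probability that the edge exists. Data and names are illustrative only.
    uncertain_graph = {
        ("a", "b"): 0.9,   # almost certainly present
        ("a", "c"): 0.5,   # "half" an edge
        ("b", "c"): 0.1,   # almost certainly absent
    }

    def sample_possible_world(g, rng=random):
        """Draw one deterministic graph from the distribution encoded by the
        uncertain graph: each edge is kept independently with its probability."""
        return {e for e, p in g.items() if rng.random() < p}

    print(sample_possible_world(uncertain_graph))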

The advantage of uncertain graphs for anonymization is that using proba-bilities you have the possibility of “partially deleting” or “partially adding” an edge, so to have a more precise knob to fine-tune the amount of noise you are introducing. The idea is that you modify the graph to be published, turning it into an uncertain graph, that is what the data miner will see at the end. The uncertain graph will share (in expectation) many properties of the original graph (e.g., degree distribution, distance distribution etc.), but the noise introduced in the process will be enough to guarantee a certain level of anonymity.
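As a toy illustration only (the actual obfuscation algorithm of [31] chooses its edge probabilities much more carefully), one could keep every original edge with probability 1 − ε and inject a few random non-edges with probability ε; the parameter ε is exactly the “knob” mentioned above. All names and parameters below are hypothetical.

    import itertools
    import random

    def perturb(vertices, edges, eps=0.2, injected=2, rng=random):
        """Turn a deterministic graph into an uncertain one: original edges are
        "partially deleted" (probability 1 - eps), while a few random non-edges
        are "partially added" (probability eps)."""
        uncertain = {e: 1.0 - eps for e in edges}
        non_edges = [e for e in itertools.combinations(sorted(vertices), 2)
                     if e not in edges]
        for e in rng.sample(non_edges, min(injected, len(non_edges))):
            uncertain[e] = eps
        return uncertain

    # Edges are given as sorted vertex pairs, matching itertools.combinations.
    print(perturb({"a", "b", "c", "d"}, {("a", "b"), ("b", "c")}))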

The amount of anonymity can be determined precisely using entropy, as explained in [31]. Suppose you have some property P that you want to preserve: a property is a map from vertices to values (of some kind); you want to be sure that, even if the adversary knows the property of a certain vertex, (s)he will still not be able to single out the vertex in the published graph. An easy example is degree: the adversary knows that Mr. John Smith has 173 Facebook friends and (s)he would like to find John Smith out based only on this information; we will introduce the minimum amount of noise needed to be sure that (s)he will always be uncertain about who John Smith is, with a fixed desired minimum amount of uncertainty k (meaning that (s)he will only be able to find a set of candidate nodes whose cardinality is k or more).

For the sake of simplicity, let us assume that you take the original graph G and only augment it with probabilities on its edges. In the original graph G every vertex (say, x) had a definite value of the property (P(x)); in the uncertain graph, it has a distribution of values: for example, the degree of x will be zero in the possible world where all edges incident on x do not exist (which happens with some probability depending on the probability values we have put on those edges), it will be 1 with some other probability, and so on.
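Concretely, for the degree property, the degree of x in a random possible world follows a Poisson-binomial distribution determined by the probabilities of the edges incident on x; here is a small sketch (the naming is mine) of the standard dynamic program that computes it:

    def degree_distribution(incident_probs):
        """dist[d] = probability that the vertex has degree d, given the
        (independent) existence probabilities of its incident edges."""
        dist = [1.0]
        for p in incident_probs:
            new = [0.0] * (len(dist) + 1)
            for d, q in enumerate(dist):
                new[d] += q * (1 - p)    # edge absent: degree stays d
                new[d + 1] += q * p      # edge present: degree becomes d + 1
            dist = new
        return dist

    # Three incident edges with probabilities 0.9, 0.5, 0.1: the vertex has
    # degree 0 with probability 0.1 * 0.5 * 0.9 = 0.045, and so on.
    print(degree_distribution([0.9, 0.5, 0.1]))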

Let me write X_x(ω) for the probability that vertex x has value ω; mutatis mutandis, you can determine the probability Y_ω(x) that a given node is x, provided that you know it had the property value ω (say, degree 173). Now, you want these probability distributions Y_ω(·) to be as “flat” as possible, because otherwise there may be values of the property for which singling out the right vertex will be easy for the adversary. In terms of probability, you want H(Y_ω) ≥ log k, where H denotes the entropy. The problem is then choosing the probability labels in such a way that the above property is guaranteed; in [31] we explain how this can be done.
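A minimal sketch of that entropy check, assuming a uniform prior over vertices so that Y_ω(x) is proportional to X_x(ω), and using base-2 logarithms (names and data are mine, not those of [31]):

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def candidate_distribution(X, omega):
        """Y_omega: probability that the sought vertex is x, given that its
        property value is omega; X[x] is the property distribution of vertex x."""
        weights = {x: d.get(omega, 0.0) for x, d in X.items()}
        total = sum(weights.values())
        return {x: w / total for x, w in weights.items()} if total > 0 else {}

    def is_obfuscated(X, omega, k):
        """True if the adversary's uncertainty is at least log2(k) bits, i.e. the
        candidate set is at least as uncertain as k equally likely vertices."""
        return entropy(candidate_distribution(X, omega).values()) >= math.log2(k)

    # Toy example: three vertices whose degree distributions overlap on degree 2.
    X = {"u": {1: 0.2, 2: 0.8}, "v": {2: 0.7, 3: 0.3}, "w": {2: 0.6, 4: 0.4}}
    print(is_obfuscated(X, omega=2, k=2))   # True: entropy is about 1.57 bits >= 1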

7 Conclusions

I was reading this paper again, and it is not clear (not even clear to myself) what the final message to the reader was, if there was one. I think that I am myself learning a lot, and I am not sure I can teach what I am learning, yet. The first lesson is that computer science, in these years, and particularly data mining, is hitting really “big” data, and when I say “big” I mean so big (please, please: say “big data” one more time!) that traditional feasibility assumptions (e.g., “polynomial time is ok!”) do not apply anymore.

This is a stimulus to look for new algorithms, new paradigms, new ideas. And if you think that “big data” can be processed using “big machines” (or large distributed systems, like MapReduce), you are wrong: muscles are nothing without intelligence (systems are nothing without good algorithms)! The second lesson is that computer science (studying things like social networks, web graphs, autonomous systems etc.) is going back to its roots in physics, and is more and more a Galilean science: experimental, explorative, intrinsically inexact. This means that we need more models, more explanations, more conjectures... Can you see anything more fun around, folks?

Acknowledgements. I want to thank Andrea Marino and Sebastiano Vigna for commenting on an early draft of the manuscript.

References

1. Johnson, S.: The Ghost Map: the Story of London’s Most Terrifying Epidemic -And How It Changed Science, Cities, and the Modern World. Riverhead Books (2006)

2. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical Report 66, Stanford University (1999)


3. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience 34(8), 711–726 (2004)

4. Boldi, P., Marino, A., Santini, M., Vigna, S.: BUbiNG: Massive crawling for the masses. In: Poster Proc. of the 23rd International World Wide Web Conference, Seoul, Korea (2014)

5. Lee, H.T., Leonard, D., Wang, X., Loguinov, D.: IRLbot: Scaling to 6 billion pages and beyond. ACM Trans. Web 3(5), 8:1–8:34 (2009)

6. Cho, J., Garcia-Molina, H.: Parallel crawlers. In: Proceedings of the 11th International Conference on World Wide Web, pp. 124–135. ACM (2002)

7. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, pp. 654–663. ACM (1997)

8. Majewski, B.S., Wormald, N.C., Havas, G., Czech, Z.J.: A family of perfect hashing methods. Comput. J. 39(6), 547–554 (1996)

9. Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, pp. 549–554. IEEE (1989)

10. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Theory and practise of monotone minimal perfect hashing. In: Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 132–144. SIAM (2009)

11. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 785–794. ACM Press, New York (2009)

12. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: de Berg, M., Meyer, U. (eds.) ESA 2010, Part I. LNCS, vol. 6346, pp. 427–438. Springer, Heidelberg (2010)

13. Belazzougui, D., Boldi, P., Vigna, S.: Dynamic z-fast tries. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 159–172. Springer, Heidelberg (2010)

14. Randall, K.H., Stata, R., Wiener, J.L., Wickremesinghe, R.G.: The Link Database: Fast access to graphs of the web. In: Proceedings of the Data Compression Conference, pp. 122–131. IEEE Computer Society, Washington, DC (2002)

15. Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: Proc. of the Thirteenth International World Wide Web Conference, pp. 595–601. ACM Press (2004)

16. Moffat, A.: Compressing integer sequences and sets. In: Kao, M.-Y. (ed.) Encyclopedia of Algorithms, pp. 1–99. Springer, US (2008)

17. Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: KDD 2009: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 219–228. ACM, New York (2009)

18. Boldi, P., Santini, M., Vigna, S.: Permuting web and social graphs. Internet Math. 6(3), 257–283 (2010)

19. Boldi, P., Santini, M., Vigna, S.: Permuting web graphs. In: Avrachenkov, K., Donato, D., Litvak, N. (eds.) WAW 2009. LNCS, vol. 5427, pp. 116–126. Springer, Heidelberg (2009)

20. Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In: Srinivasan, S., Ramamritham, K., Kumar, A., Ravindra, M.P., Bertino, E., Kumar, R. (eds.) Proceedings of the 20th International Conference on World Wide Web, pp. 587–596. ACM (2011)

21. Milgram, S.: The small world problem. Psychology Today 2(1), 60–67 (1967)

22. Travers, J., Milgram, S.: An experimental study of the small world problem. Sociometry 32(4), 425–443 (1969)

23. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB 1989: Proceedings of the 15th International Conference on Very Large Data Bases, pp. 165–171. Morgan Kaufmann Publishers Inc. (1989)

24. Crescenzi, P., Grossi, R., Lanzi, L., Marino, A.: A comparison of three algorithms for approximating the distance distribution in real-world graphs. In: Marchetti-Spaccamela, A., Segal, M. (eds.) TAPAS 2011. LNCS, vol. 6595, pp. 92–103. Springer, Heidelberg (2011)

25. Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: A fast and scalable tool for data mining in massive graphs. In: KDD 2002: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 81–90. ACM, New York (2002)

26. Boldi, P., Rosa, M., Vigna, S.: HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In: Srinivasan, S., Ramamritham, K., Kumar, A., Ravindra, M.P., Bertino, E., Kumar, R. (eds.) Proceedings of the 20th International Conference on World Wide Web, pp. 625–634. ACM (2011)

27. Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of the 13th Conference on Analysis of Algorithms (AofA 2007), pp. 127–146 (2007)

28. Backstrom, L., Boldi, P., Rosa, M., Ugander, J., Vigna, S.: Four degrees of separation. In: ACM Web Science 2012: Conference Proceedings, pp. 45–54. ACM Press (2012), Best paper award

29. Backstrom, L., Dwork, C., Kleinberg, J.M.: Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In: WWW, pp. 181–190 (2007)

30. Narayanan, A., Shmatikov, V.: De-anonymizing social networks. In: IEEE Symposium on Security and Privacy (2009)

31. Boldi, P., Bonchi, F., Gionis, A., Tassa, T.: Injecting uncertainty in graphs for identity obfuscation. Proceedings of the VLDB Endowment 5(11), 1376–1387 (2012)
