Crowd Mining (joint work with Y. Amsterdamer, Y. Grossman, and T. Milo)


(1) Crowd Mining (joint work with Y. Amsterdamer, Y. Grossman, and T. Milo). Pierre Senellart. 5 December 2012, The University of Hong Kong.

(2) Association rule mining
One of the most studied aspects of data mining [Agrawal et al., 1993]: discovering rules in a database of transactions D. Transaction: a set of items. Rule: X → Y with X, Y sets of items. We are only interested in rules with support and confidence greater than given thresholds s, c:

supp(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #D
conf(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #{t ∈ D | X ⊆ t}

Typical application: market basket, e.g., Diaper → Beer. (Télécom PT & Tel Aviv U. Pierre Senellart.)
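These two definitions translate directly into code. The following is a toy illustration (not the paper's implementation; the transaction data and names are ours):

```python
# Support and confidence of a rule X -> Y over a transaction database D,
# following the definitions above. Transactions are sets of items.

def supp(X, Y, D):
    """supp(X -> Y) = #{t in D | X ∪ Y ⊆ t} / #D"""
    both = X | Y
    return sum(1 for t in D if both <= t) / len(D)

def conf(X, Y, D):
    """conf(X -> Y) = #{t in D | X ∪ Y ⊆ t} / #{t in D | X ⊆ t}"""
    both = X | Y
    covered = sum(1 for t in D if X <= t)
    return sum(1 for t in D if both <= t) / covered if covered else 0.0

# Toy market-basket database (our example data)
D = [{"diaper", "beer", "milk"},
     {"diaper", "beer"},
     {"diaper", "bread"},
     {"milk", "bread"}]
print(supp({"diaper"}, {"beer"}, D))  # 2 of 4 transactions -> 0.5
print(conf({"diaper"}, {"beer"}, D))  # 2 of the 3 diaper transactions
```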

(3) Crowd-sourced data
In many applications, raw, extensional, exhaustive data is not available, but is intensionally hidden in people's collective minds. ⇒ Resort to asking humans (the crowd) for bits of the data they know (shopping history, life habits, etc.). Humans are bad at remembering the full history, and also bad at discovering correlations. The crowd is a costly resource [Parameswaran and Polyzotis, 2011].

(4) Mining association rules from the crowd
Goal of this work: determining association rules on crowd-sourced data, by: asking questions to humans that are easy to answer; determining which is the best question to ask at any given point; deducing from all answers a (probabilistic) set of valid association rules; optimizing this computation as much as possible.

(5) Outline: Introduction, Concepts, Crowd Mining Algorithm, The CrowdMiner System, Experiments, Conclusions

(6) User support and confidence
Let U be a set of users. Each user u ∈ U has a (hidden) transaction database Du. Each rule X → Y is associated with its user support and user confidence:

supp_u(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #Du
conf_u(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #{t ∈ Du | X ⊆ t}

(7) Significant rules
Significant rules are those whose overall support and confidence are above specified thresholds s, c. Overall support and confidence are defined as the mean user support and confidence:

supp(r) = avg_{u ∈ U} supp_u(r)
conf(r) = avg_{u ∈ U} conf_u(r)

Goal: finding significant rules while asking as few questions to the crowd as possible.
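A minimal sketch of this averaging and threshold test (the names `theta_s`, `theta_c` and all numeric values are ours, for illustration):

```python
# Overall support/confidence as the mean of per-user values, and a
# significance check against thresholds theta_s, theta_c.
from statistics import mean

def overall(per_user_values):
    """Mean of per-user support (or confidence) values for one rule."""
    return mean(per_user_values)

def is_significant(supps, confs, theta_s, theta_c):
    return overall(supps) >= theta_s and overall(confs) >= theta_c

# Three users' (approximate) answers for one rule (toy values)
supps = [0.10, 0.08, 0.12]
confs = [0.60, 0.50, 0.70]
print(overall(supps))                           # ≈ 0.10
print(is_significant(supps, confs, 0.05, 0.5))  # True
```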

(8) Questions to the crowd
Two kinds of questions. Closed questions (X → Y?): ask a user for her (approximate) support and confidence for this rule. Open questions (? → ?): ask a user for one arbitrary rule and its (approximate) support and confidence.

(9) Questions to the crowd
Two kinds of questions. Closed questions (X → Y?): ask a user for her (approximate) support and confidence for this rule. Open questions (? → ?): ask a user for one arbitrary rule and its (approximate) support and confidence. Users will not be precise, but that's fine.

Example (Morning → Jogging). "How often do you go jogging in the morning?" "I go jogging three times per week in the morning." Then conf_u(Morning → Jogging) = 3/7 and supp_u(Morning → Jogging) = 3/21 (if there is one transaction for each morning, afternoon, and evening).
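The example's arithmetic can be sketched as follows, assuming (as the slide does) one transaction per morning, afternoon, and evening, i.e. 21 transactions per week:

```python
# Turning a natural "k times per week" answer into user support and
# confidence for the rule Morning -> Jogging.

TRANSACTIONS_PER_WEEK = 21  # 3 per day, 7 days
MORNINGS_PER_WEEK = 7

def jogging_answer(times_per_week):
    """supp counts over all transactions; conf only over mornings."""
    supp_u = times_per_week / TRANSACTIONS_PER_WEEK
    conf_u = times_per_week / MORNINGS_PER_WEEK
    return supp_u, conf_u

s, c = jogging_answer(3)   # "three times per week"
print(s, c)                # 3/21 and 3/7
```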

(10) Outline: Introduction, Concepts, Crowd Mining Algorithm, The CrowdMiner System, Experiments, Conclusions

(11) Algorithm components
One general framework for crowd mining: choose the next question (open or closed?); choose candidate rules; choose the next closed question; rank the rules by grade; estimate the current error and the next error; estimate the sample distribution, the mean distribution, and rule significance. [Block diagram of these components.] One particular choice of implementation of all black boxes: we do not claim any optimality, but we validate by experiments.

(12) Estimating distributions
Attention: support and confidence are correlated, so we need to consider bivariate distributions!

(13) Estimating distributions
Attention: support and confidence are correlated, so we need to consider bivariate distributions! Central limit theorem: the sample distribution of (confidence, support) pairs for a rule is normally distributed.

(14) Estimating distributions
Attention: support and confidence are correlated, so we need to consider bivariate distributions! Central limit theorem: the sample distribution of (confidence, support) pairs for a rule is normally distributed. Hypothesis: the distribution of (confidence, support) values for rules among the whole set of users is normally distributed.

(15) Estimating distributions
Attention: support and confidence are correlated, so we need to consider bivariate distributions! Central limit theorem: the sample distribution of (confidence, support) pairs for a rule is normally distributed. Hypothesis: the distribution of (confidence, support) values for rules among the whole set of users is normally distributed. The sample mean μ̂ and covariance matrix Σ̂ are unbiased estimators of those of the original distribution. [Scatter plot: support (x-axis, 0–1) vs. confidence (y-axis, 0–1) of samples for a rule.]

(16) Estimating rule significance
A rule is significant if:

∫_s^1 ∫_c^1 N_{μ̂, Σ̂/K}(c′, s′) dc′ ds′ > 0.5

where μ̂, Σ̂ are the sample mean and covariance matrix, K is the number of samples, and N_{μ̂, Σ̂/K} is the bivariate normal distribution. Efficient algorithms [Genz, 2004] exist for the numerical integration of bivariate normal distributions. The current error probability on rule significance is simply the distance of this integral to 0 or 1.
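A sketch of this test using SciPy's bivariate normal CDF (the slide relies on Genz's algorithm; SciPy's implementation belongs to the same family of methods). All numeric values, and the names `theta_s`, `theta_c`, are toy assumptions of ours:

```python
# P(supp > theta_s and conf > theta_c) under N(mu_hat, Sigma_hat / K),
# computed by inclusion-exclusion from the joint and marginal CDFs.
import numpy as np
from scipy.stats import multivariate_normal, norm

def significance(mu, sigma, K, theta_s, theta_c):
    cov = np.asarray(sigma) / K
    F = multivariate_normal(mean=mu, cov=cov).cdf    # joint CDF
    Fs = norm(mu[0], np.sqrt(cov[0, 0])).cdf         # marginal of support
    Fc = norm(mu[1], np.sqrt(cov[1, 1])).cdf         # marginal of confidence
    # P(S > theta_s, C > theta_c) = 1 - Fs - Fc + F(theta_s, theta_c)
    return 1 - Fs(theta_s) - Fc(theta_c) + F([theta_s, theta_c])

mu = [0.12, 0.60]                   # sample mean (support, confidence)
sigma = np.array([[0.01, 0.004],
                  [0.004, 0.04]])   # sample covariance matrix
p = significance(mu, sigma, K=20, theta_s=0.05, theta_c=0.5)
print(p > 0.5)         # rule deemed significant
print(min(p, 1 - p))   # current error probability (distance to 0 or 1)
```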

(17) Estimating next error
The current distribution N_{μ̂, Σ̂} for a rule can be used as an estimator of what the next answer would be.

(18) Estimating next error
The current distribution N_{μ̂, Σ̂} for a rule can be used as an estimator of what the next answer would be. We sample according to N_{μ̂, Σ̂}, recompute rule significance and error probabilities, and deduce from that the next error probability in this particular case. [Scatter plot: support vs. confidence samples, with a hypothetical next sample drawn.]

(19) Estimating next error
The current distribution N_{μ̂, Σ̂} for a rule can be used as an estimator of what the next answer would be. We sample according to N_{μ̂, Σ̂}, recompute rule significance and error probabilities, and deduce from that the next error probability in this particular case. [Scatter plot, as before.] By averaging over all samples, we obtain an estimate of the next error probability.

(20) Estimating next error
The current distribution N_{μ̂, Σ̂} for a rule can be used as an estimator of what the next answer would be. We sample according to N_{μ̂, Σ̂}, recompute rule significance and error probabilities, and deduce from that the next error probability in this particular case. [Scatter plot, as before.] By averaging over all samples, we obtain an estimate of the next error probability. The difference between next error and current error (expected error reduction) is an estimate of how much we gain by asking a question on this rule!
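The sample, recompute, average loop can be sketched as follows. Here `error_probability` is a crude stand-in for the integral-based error of the previous slide, and the independent per-coordinate Gaussian draws stand in for sampling from the fitted bivariate normal; all names and values are ours:

```python
# Monte Carlo estimate of the next error probability for one rule.
import random

def error_probability(samples, theta_s=0.05, theta_c=0.5):
    # Stand-in for the bivariate-normal test: distance of the empirical
    # significance vote from 0 or 1.
    votes = [s > theta_s and c > theta_c for s, c in samples]
    p = sum(votes) / len(votes)
    return min(p, 1 - p)

def next_error(samples, n_draws=200, rng=random.Random(0)):
    ms = sum(s for s, _ in samples) / len(samples)
    mc = sum(c for _, c in samples) / len(samples)
    errs = []
    for _ in range(n_draws):
        # Hypothetical next answer, drawn near the sample mean
        ans = (rng.gauss(ms, 0.02), rng.gauss(mc, 0.05))
        errs.append(error_probability(samples + [ans]))
    return sum(errs) / len(errs)

samples = [(0.10, 0.62), (0.12, 0.55), (0.14, 0.58)]  # (supp, conf) answers
cur = error_probability(samples)
nxt = next_error(samples)
print(cur, nxt)  # expected error reduction is cur - nxt (may be <= 0 here)
```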

(21) Putting everything together
Candidate rules are rules of length 1, rules for which we have samples, and rules all of whose subrules are significant (analogous to Apriori [Agrawal et al., 1994]). [Block diagram of the algorithm components, as before.] The grade of a rule is the expected error reduction when known, and an estimate based on subrules otherwise. We decide between closed and open questions by flipping a coin (exploitation vs. exploration).
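The top-level selection step might look like this (a sketch only; the coin bias `p_open` and all names are ours):

```python
# Choose the next question: flip a coin between exploration (open question)
# and exploitation (closed question on the best-graded candidate rule).
import random

def choose_question(candidates, grades, p_open=0.3, rng=random.Random(0)):
    """candidates: list of rules; grades: rule -> expected error reduction."""
    if rng.random() < p_open or not candidates:
        return ("open", None)                      # "? -> ?"
    best = max(candidates, key=lambda r: grades.get(r, 0.0))
    return ("closed", best)                        # "X -> Y?"

# Toy grades for two candidate rules
grades = {("Morning",): 0.02, ("Morning", "Jogging"): 0.08}
kind, rule = choose_question(list(grades), grades)
print(kind, rule)
```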

(22) Outline: Introduction, Concepts, Crowd Mining Algorithm, The CrowdMiner System, Experiments, Conclusions

(23) Architecture
[System diagram: users ask and answer questions through a Question Display; a Portal User Interface receives user queries and returns results; a Query Selector picks the next question (rule + query); a Data Aggregator stores [rule, conf, supp] answers in the Rule Database, seeded with Initial Data; a Best Rules Extractor computes the results.]

(24) Outcome

| Rank | Change | Rule                            | Support | Conf. | Error Prob. |
|------|--------|---------------------------------|---------|-------|-------------|
| 1    | +1     | Morning → Jogging               | 0.087   | 0.61  | 0.521e-11   |
| 2    | -1     | Jogging → Energy Drink, Granola | 0.085   | 0.5   | 0.66e-8     |
| 3    |        | Morning → Coffee                | 0.067   | 0.52  | 0.54e-7     |
| …    | …      | …                               | …       | …     | …           |
| 1752 | -8     | Upset Stomach → Chamomile       | 0.032   | 0.05  | 0.03        |
| 1753 |        | Vegetarian, Yoga → Raw Foods    | 0.009   | 0.047 | 0.012       |
| …    | …      | …                               | …       | …     | …           |

(25) Outline: Introduction, Concepts, Crowd Mining Algorithm, The CrowdMiner System, Experiments, Conclusions

(26) Datasets
We experimented on several datasets. Real-world Retail dataset [Brijs et al., 1999] from a shopping basket application; since the data is anonymized, users are assigned transactions in a random fashion. Edits on categories in Simple English Wikipedia: transactions are articles, items are high-level categories (WordNet-level classes of YAGO [Suchanek et al., 2007]) assigned to articles, and users are editors of these articles. Synthetic dataset (not discussed here).

(27) Experimental setting
Baselines: Random: at each step, we choose a random rule to ask a user about. Greedy: ask about the known rule with the fewest samples (starting with smaller rules). Settings: Zero-knowledge: we start with no information about the world. Known items: the set of items is known, but there is no information about rules. Rule refinement: some rules are already known (not discussed here). We evaluate in terms of precision, recall, and F-measure of predicted significant rules, as well as the absolute number of errors (not discussed here).
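The evaluation metrics used below follow the standard definitions and can be computed as follows (the rule sets are toy data of ours):

```python
# Precision, recall, and F-measure of a predicted set of significant rules
# against the true set.
def prf(predicted, truth):
    tp = len(predicted & truth)                     # true positives
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(truth) if truth else 0.0           # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f

pred = {"A->B", "B->C", "C->D"}
truth = {"A->B", "B->C", "D->E", "E->F"}
p, r, f = prf(pred, truth)
print(p, r, f)  # precision 2/3, recall 1/2
```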

(28) F-measure, zero-knowledge
[Plots: F-measure vs. number of samples (0–2000) for CrowdMiner, Random, and Greedy, on the Retail dataset (left) and the Wikipedia dataset (right).]

(29) Precision and recall, zero-knowledge
[Plots: precision (left) and recall (right) vs. number of samples (0–2000) for CrowdMiner, Random, and Greedy, on the Retail dataset.] Better precision: we make sure to reduce the global expected number of errors; Greedy loses precision as new rules are explored. Much better recall: due to adding potentially large rules as candidates once candidate subrules are found (Greedy will only add such rules much later).

(30) F-measure, known items
[Plot: F-measure vs. number of samples (0–2000) for CrowdMiner, Random, and Greedy, on the Retail dataset.] Good initial precision of the greedy algorithm: the best thing to do is to start by asking about rules of small size anyway. CrowdMiner then overtakes Greedy: larger rules are soon made candidates and their significance assessed.

(31) Outline: Introduction, Concepts, Crowd Mining Algorithm, The CrowdMiner System, Experiments, Conclusions

(32) In brief
How to design an interactive poll? There are many situations where one wants to find correlations in non-extensionally accessible data. A "crowd-sourced" Apriori (but with subtleties). Good behavior in practice. Many other design choices are possible for replacing the black boxes, especially in the presence of priors. Connections with active learning [Lindenbaum et al., 2004].

(33) Perspectives
What are the best next k questions to ask? This allows parallelization, and is also possible to do by sampling, without being significantly more costly! Take into account correlations between rules to refine estimates. Which user to ask which question?

(34) References I
R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 1993.
R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, 1994.
T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, 1999.
A. Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14, 2004.

(35) References II
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 2004.
A. G. Parameswaran and N. Polyzotis. Answering queries using humans, algorithms and databases. In CIDR, 2011.
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. Unifying WordNet and Wikipedia. In WWW, 2007.

