• Aucun résultat trouvé

Association Rules and Statistics

Dans le document Encyclopedia of Data Warehousing and Mining (Page 174-178)

Martine Cadot

University of Henri Poincaré/LORIA, Nancy, France Jean-Baptiste Maj

LORIA/INRIA, France Tarek Ziadé

NUXEO, France

INTRODUCTION

A manager would like to have a dashboard of his company without manipulating data. Usually, statistics have solved this challenge, but nowadays, data have changed (Jensen, 1992); their size has increased, and they are badly structured (Han & Kamber, 2001). A recent method—data mining—has been developed to analyze this type of data (Piatetski-Shapiro, 2000).

A specific method of data mining, which fits the goal of the manager, is the extraction of association rules (Hand, Mannila & Smyth, 2001). This extraction is a part of attribute-oriented induction (Guyon & Elis-seeff, 2003).

The aim of this paper is to compare both types of extracted knowledge: association rules and results of statistics.

BACKGROUND

Statistics have been used by people who want to extract knowledge from data for one century (Freeman, 1997).

Statistics can describe, summarize and represent the data. In this paper data are structured in tables, where lines are called objects, subjects or transactions and columns are called variables, properties or attributes.

For a specific variable, the value of an object can have different types: quantitative, ordinal, qualitative or bi-nary. Furthermore, statistics tell if an effect is significant or not. They are called inferential statistics.

Data mining (Srikant, 2001) has been developed to precede a huge amount of data, which is the result of progress in digital data acquisition, storage technol-ogy, and computational power. The association rules, which are produced by data-mining methods, express links on database attributes. The knowledge brought

by the association rules is shared in two different parts.

The first describes general links, and the second finds specific links (knowledge nuggets) (Fabris & Freitas, 1999; Padmanabhan & Tuzhilin, 2000). In this article, only the first part is discussed and compared to statis-tics. Furthermore, in this article, only data structured in tables are used for association rules.

MAIN THRUST

The problem differs with the number of variables. In the sequel, problems with two, three, or more variables are discussed.

Two Variables

The link between two variables (A and B) depends on the coding. The outcome of statistics is better when data are quantitative. A current model is linear regres-sion. For instance, the salary (S) of a worker can be expressed by the following equation:

S = 100 Y + 20000 + ε (1)

where Y is the number of years in the company, and ε is a random number. This model means that the salary of a newcomer in the company is $20,000 and increases by $100 per year.

The association rule for this model is: Y→S. This means that there are a few senior workers with a small paycheck. For this, the variables are translated into binary variables. Y is not the number of years, but the property has seniority, which is not quantitative but of type Yes/No. The same transformation is applied to the salary S, which becomes the property “has a big salary.”

Association Rules and Statistics

A

Therefore, these two methods both provide the link between the two variables and have their own instru-ments for measuring the quality of the link. For statistics, there are the tests of regression model (Baillargeon, 1996), and for association rules, there are measures like support, confidence, and so forth (Kodratoff, 2001).

But, depending on the type of data, one model is more appropriate than the other (Figure 1).

Three Variables

If a third variable E, the experience of the worker, is integrated, the equation (1) becomes:

S = 100 Y + 2000 E + 19000 + ε (2) E is the property “has experience.” If E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,000.

The increase of the salary, as a function of seniority (Y), is the same in both cases of experience.

S = 50 Y + 1500 E + 50 E ´ Y + 19500 + ε (3)

Now, if E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,500. The increase of the salary, as a function of seniority (Y), is $50 higher for experienced workers. These regression models belong to a linear model of statistics (Prum, 1996), where, in the equation (3), the third variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991).

The association rules for this model are:

• Y→S, E→S for the equation (2)

• Y→S, E→S, YE→S for the equation (3)

The statistical test of the regression model allows to choose with or without interaction (2) or (3). For the association rules, it is necessary to prune the set of three rules, because their measures do not give the choice between a model of two rules and a model of three rules (Zaki, 2000; Zhu, 1998).

More Variables

With more variables, it is difficult to use statistical models to test the link between variables (Megiddo

& Srikant, 1998). However, there are still some ways to group variables: clustering, factor analysis, and taxonomy (Govaert, 2003). But the complex links between variables, like interactions, are not given by these models and decrease the quality of the results.

Comparison

Table 1 briefly compares statistics with the associa-tion rules. Two types of statistics are described: by tests and by taxonomy. Statistical tests are applied to a small amount of variables and the taxonomy to a great Figure 1. Coding and analysis methods

Quantitative Ordinal Qualitative Yes/No

Association Rules +

+ Statistics

Table 1. Comparison between statistics and association rules

Statistics Data mining

Tests Taxonomy Association rules Decision Tests (+) Threshold defined (-) Threshold defined (-) Level of Knowledge Low (-) High and simple (+) High and complex (+) Nb. of Variables Small (-) High (+) Small and high (+)

Complex Link Yes (-) No (+) No (-)

Association Rules and Statistics

amount of variables. In statistics, the decision is easy to make out of test results, unlike association rules, where a difficult choice on several indices thresholds has to be performed. For the level of knowledge, the statistical results need more interpretation relative to the taxonomy and the association rules.

Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.

FUTURE TRENDS

With association rules, some researchers try to find the right indices and thresholds with stochastic methods.

More development needs to be done in this area. Another sensitive problem is the set of association rules that is not made for deductive reasoning. One of the most common solutions is the pruning to suppress redundan-cies, contradictions and loss of transitivity. Pruning is a new method and needs to be developed.

CONCLUSION

With association rules, the manager can have a fully detailed dashboard of his or her company without ma-nipulating data. The advantage of the set of association rules relative to statistics is a high level of knowledge.

This means that the manager does not have the incon-venience of reading tables of numbers and making interpretations. Furthermore, the manager can find knowledge nuggets that are not present in statistics.

The association rules have some inconvenience;

however, it is a new method that still needs to be developed.

REFERENCES

Baillargeon, G. (1996). Méthodes statistiques de l’ingénieur: Vol. 2. Trois-Riveres, Quebec: Editions SMG.

Fabris,C., & Freitas, A. (1999). Discovery surprising patterns by detecting occurrences of Simpson’s paradox:

Research and development in intelligent systems XVI.

Proceedings of the 19th Conference of Knowledge-Based Systems and Applied Artificial Intelligence, Cambridge, UK.

Foucart, T. (1997). L’analyse des données, mode d’emploi. Rennes, France: Presses Universitaires de Rennes.

Freedman, D. (1997). Statistics. W.W. New York:

Norton & Company.

Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermes-Science.

Gras, R., & Bailleul, M. (2001). La fouille dans les don-nées par la méthode d’analyse statistique implicative.

Colloque de Caen. Ecole polytechnique de l’Université de Nantes, Nantes, France.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Figure 2. a) Regression equations b) Taxonomy c) Association rules

(a) (b) (c)

Association Rules and Statistics

Han, J., & Kamber, M. (2001). Data mining: Con-

A

cepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: John Hopkins Press.

Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO.

Kodratoff, Y. (2001). Rating the interest of rules induced from data and within texts. Proceedings of the 12th IEEE International Conference on Database and Expert Systems Aplications-Dexa, Munich, Germany.

Megiddo, N., & Srikant, R. (1998). Discovering predic-tive association rules. Proceedings of the Conference on Knowledge Discovery in Data, New York.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Conference on Knowledge Discovery in Data. Boston, Massachusetts.

Piatetski-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. Proceedings of the Confer-ence on Knowledge Discovery in Data, Boston, Mas-sachusetts.

Prum, B. (1996). Modèle linéaire: Comparaison de groupes et régression. Paris, France: INSERM.

Srikant, R. (2001). Association rules: Past, present, future. Proceedings of the Workshop on Concept Lat-tice-Based Theory, Methods and Tools for Knowledge Discovery in Databases, California.

Winer, B.J., Brown, D.R., & Michels, K.M.(1991).

Statistical principles in experimental design. New York: McGraw-Hill.

Zaki, M.J. (2000). Generating non-redundant associa-tion rules. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Zhu, H. (1998). On-line analytical mining of associa-tion rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.

KEY TERMS

Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.

Badly Structured Data: Data, like texts of corpus or log sessions, often do not contain explicit variables.

To extract association rules, it is necessary to create variables (e.g., keyword) after defining their values (frequency of apparition in corpus texts or simply ap-parition/non apparition).

Interaction: Two variables, A and B, are in interac-tion if their acinterac-tions are not seperate.

Linear Model: A variable is fitted by a linear com-bination of other variables and interactions between them.

Pruning: The algorithms of extraction for the as-sociation rule are optimized in computationality cost but not in other constraints. This is why a suppression has to be performed on the results that do not satisfy special constraints.

Structural Equations: System of several regression equations with numerous possibilities. For instance, a same variable can be made into different equations, and a latent (not defined in data) variable can be accepted.

Taxonomy: This belongs to clustering methods and is usually represented by a tree. Often used in life categorization.

Tests of Regression Model: Regression models and analysis of variance models have numerous hypothesis, e.g. normal distribution of errors. These constraints allow to determine if a coefficient of regression equa-tion can be considered as null with a fixed level of significance.

This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 74-77, copyright 2005 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).

Section: Multimedia

Dans le document Encyclopedia of Data Warehousing and Mining (Page 174-178)