Specialised vs Declarative Data Mining

(1)

Specialised vs

Declarative Data Mining

Software Testing Applications

Nadjib Lazaar, CNRS, University of Montpellier

Join works with: M. Maamar, Y. Lebbah, S. Loudni, C. Bessiere, et. al.

(2)

DATA MINING

2

(3)

DATA MINING

➤ Data Mining (DM) or Knowledge Discovery in Databases (KDD) revolves around the investigation and creation of knowledge, processes, algorithms, and the mechanisms for retrieving potential knowledge from data collections.

(4)

DATA MINING

Mining on:

➤ Itemsets (Finding itemsets from a collection of transactions)

2

(5)

DATA MINING

Mining on:

➤ Sequences (Finding subsequences from collection of sequences)

(6)

DATA MINING

Mining on:

➤ Graphs (Finding subgraphs from collection of graphs)

2

(7)

DATA MINING

Mining on:

➤ Graphs (Finding subgraphs from collection of graphs)

(8)

DATA MINING APPLICATIONS

3

(9)

➤ Market Basket Analysis [Agrawal93]

(10)

➤ Future Healthcare

➤ Great potential to improve health systems [Obenshain04]

3

(11)

➤ Education

➤ Knowledge from data educational environments [Scheuer12]

(12)

➤ Education

➤ Fraud and Intrusion detection [Wang10] [Lee98]

3

(13)

➤ Education

➤ Lie detection and Criminal Investigation [Chen04]

(14)

➤ Education

➤ Bio Informatics [Hoffman97]

3

(15)

➤ Education

➤ Bio Informatics [Hoffman97]

(16)

4

Bio-

Informatics Marketing

Mining process

Inputs Outputs

- Custumers behavior - Frequent products - shopping basket

- Protein structure prediction - Cancer classification

- DNA sequencing - Classes of genes

Software Engineering

- Program comprehension

- Fault localization/prediction - Execution traces

- Flow diagram - Source code

Aurora Project

(17)

Bio-

Mining process

Aurora Project

(18)

4

Bio-

Mining process

Aurora Project

(19)

Bio-

Mining process

Aurora Project

(20)

FREQUENT ITEMSET MINING [Agrawal et al, 93]

5

(21)

FREQUENT ITEMSET MINING

➤ Aims at finding regularities in datasets (e.g., shopping behavior of customers)

[Agrawal et al, 93]

(22)

In market basket analysis:

➤ Find sets of products that are frequently bought together [Agrawal et al, 93]

5

(23)

In market basket analysis:

➤ Find sets of products that are frequently bought together

Often found patterns are expressed as association rules, for example:

➤ If a customer buys bread and wine, then she/he will probably also buy cheese.

[Agrawal et al, 93]

(24)

FREQUENT ITEMSET MINING (PROBLEM)

6

(25)

(26)

6

➤ Given:

➤ A set of items

➤ A set of transactions overs the items

➤ A minimum support

I = {i₁, …, i_n}

T = {t₁, …, t_m} θ

(27)

➤ Given:

➤ A set of items

➤ A set of transactions overs the items

➤ A minimum support

➤ The need:

➤ The set of itemset P s.t.:

I = {i₁, …, i_n}

T = {t₁, …, t_m} θ

freq(P) ≥ θ

(28)

STANDARD ITEMSET MINING

7

(29)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

(30)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

7

(31)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G cover(BEF) = {t₁, t₅, t₆}

(32)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

freq(BEF) = 50 % cover(BEF) = {t₁, t₅, t₆}

7

(33)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

➤ Brute force enumeration is infeasible

➤ 128 items 10⁶⁸ itemsets (atoms in the universe)

(34)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

➤ Several specialised algorithms have been developed:

Apriori, Eclat, FP-Growth, LCM…

7

(35)

t1: B C E F G H

t2: A D G

t3: A C D H

t4: A E F

t5: B E F

t6: B E F G

➤ Several specialised algorithms have been developed:

Apriori, Eclat, FP-Growth, LCM…

➤ Dealing with basic user’s constraints:

Frequency, Condensed representations (closedness, maximality,…), Size…

(36)

EXAMPLE

8

(37)

EXAMPLE

(2^I, ⊆ )

(38)

EXAMPLE

8

(2^I, ⊆ )

D

(39)

EXAMPLE

(2^I, ⊆ )

θ = 3 D

(40)

EXAMPLE

8

(2^I, ⊆ )

θ = 3 D

(41)

EXAMPLE

(2^I, ⊆ )

θ = 3 D

Maximal

(42)

EXAMPLE

9

(2^I, ⊆ )

θ = 3 D

M_✓ = {P 2 I| f req(P) ✓ ^ 8P ⁰ P : f req(P⁰) < ✓} Maximal

(43)

EXAMPLE

(2^I, ⊆ )

θ = 3 D

(44)

EXAMPLE

10

(2^I, ⊆ )

θ = 3 D

(45)

EXAMPLE

(2^I, ⊆ )

θ = 3 D

Closedness

(46)

EXAMPLE

11

(2^I, ⊆ )

θ = 3 D

M_✓ = {P 2 I| f req(P) ✓ ^ 8P ⁰ P : f req(P⁰) < ✓} Closedness

(47)

CONDENSED REPRESENTATION

(48)

12

(49)

(50)

12

Dataset #Frequent #Closed #Maximal

Zoo-1 151 807 3 292 230

Mushroom 155 734 3 287 453

Lymph 9 967 402 46 802 5 191

Hepa;;s 27 . 10⁷ 1 827 264 189 205

(51)

SPECIALIZED VS DECLARATIVE DATA MINING

(52)

dataset

13

(53)

Basic user’s constraints

Query

dataset

(54)

Specialised Miner

+

Query

dataset

13

(55)

Specialised Miner

+

Patterns

Query

dataset

(56)

Specialised Miner

+

Patterns

Query

dataset

13

Limitations: Dealing with sophisticated user’s constraints [Wojciechowski and Zakrzewicz, 02]

(57)

Specialised Miner

+

Patterns

Query

dataset

Sophisticated user’s constraints

(58)

Specialised Miner

+

Patterns

Query

dataset

13

1

preprocessing

(59)

Specialised Miner

+

Patterns

Query

dataset

1

preprocessing

2

post- processing

(60)

Specialised Miner

+

Patterns

Query

dataset

13

1

preprocessing

2

3 new algo

(61)

Specialised Miner

+

Patterns

Query

dataset

1

preprocessing

2

3 new algo

Need: Declarative way to deal with more complex queries

(62)

+

Patterns

Query

dataset

14

1

preprocessing

2

3 new algo

➤ Declarative data Mining

CP model CP solver⁺

(63)

+

Patterns

Query

dataset

1

preprocessing

2

3 new algo

Need: Declarative way to deal with more complex queries CP model

CP solver⁺

(64)

+

Patterns

Query

dataset

14

➤ Declarative data Mining

CP model CP solver⁺

(65)

SPECIALISED VS DECLARATIVE DATA MINING

(66)

15

(67)

Specialised is the winner!

(68)

15

(69)

Declarative is the winner!

(70)

16

(71)

Preprocessing + Specialised step vs Declarative

(72)

17

(73)

Specialised + postprocessing vs Declarative

(74)

CONCLUSIONS (PART I)

18

(75)

➤ Specialised methods are suitable for:

➤ Enumerating Patterns

➤ Taking into account classic constraints (simple queries)

(76)

➤ Declarative methods are suitable for:

➤ Taking into account user’s constraints (complex queries)

➤ Iterative data mining process

18

(77)

➤ Declarative methods are suitable for:

➤ Taking into account user’s constraints (complex queries)

➤ Iterative data mining process

Time left?

(78)

FAULT LOCALISATION

19

(79)

➤ The need: identify a subset of statements that are susceptible to explain a fault in a program

➤ Precision <=> Efficiency

(80)

➤ The need: identify a subset of statements that are susceptible to explain a fault in a program

➤ Precision <=> Efficiency

➤ Spectrum-based approaches: (ranking metrics - suspiciousness score)

➤ Tarantula [Jones and Harrold 05]

➤ Ochiai [Abreu et al. 07]

➤ Jaccard [Abreu et al. 07]

➤ …

19

(81)

FAULT LOCALISATION (MOTIVATIONS)

(82)

➤ Pros: Quick localisation

20

(83)

➤ Cons: independent evaluation of each statement at the expense of accuracy

(84)

21

(85)

Test cases

Program : Character counter tc₁ tc₂ tc₃ tc₄ tc₅ tc₆ tc₇ tc₈ function count (char *s) {

int let, dig, other, i = 0;

char c;

e₁: while (c = s[i++]) { 1 1 1 1 1 1 1 1

e₂: if(’A’<=c && ’Z’>=c) 1 1 1 1 1 1 0 1

e₃: let += 2; //- fault - 1 1 1 1 1 1 0 0

e₄: else if ( ’a’<=c && ’z’>=c ) 1 1 1 1 1 0 0 1

e₅: let += 1; 1 1 0 0 1 0 0 0

e₆: else if ( ’0’<=c && ’9’>=c ) 1 1 1 1 0 0 0 1

e₇: dig += 1; 0 1 0 1 0 0 0 0

e₈: else if (isprint (c)) 1 0 1 0 0 0 0 1

e₉: other += 1; 1 0 1 0 0 0 0 1

e₁₀: printf("%d %d %d\n", let, dig, other);} 1 1 1 1 1 1 1 1

Passing/Failing F F F F F F P P

(86)

21

Test cases

char c;

e₁: while (c = s[i++]) { 1 1 1 1 1 1 1 1

e₂: if(’A’<=c && ’Z’>=c) 1 1 1 1 1 1 0 1

e₃: let += 2; //- fault - 1 1 1 1 1 1 0 0

e₄: else if ( ’a’<=c && ’z’>=c ) 1 1 1 1 1 0 0 1

e₅: let += 1; 1 1 0 0 1 0 0 0

e₆: else if ( ’0’<=c && ’9’>=c ) 1 1 1 1 0 0 0 1

e₇: dig += 1; 0 1 0 1 0 0 0 0

e₉: other += 1; 1 0 1 0 0 0 0 1

(87)

Test cases

char c;

e₁: while (c = s[i++]) { 1 1 1 1 1 1 1 1

e₂: if(’A’<=c && ’Z’>=c) 1 1 1 1 1 1 0 1

e₃: let += 2; //- fault - 1 1 1 1 1 1 0 0

e₄: else if ( ’a’<=c && ’z’>=c ) 1 1 1 1 1 0 0 1

e₅: let += 1; 1 1 0 0 1 0 0 0

e₆: else if ( ’0’<=c && ’9’>=c ) 1 1 1 1 0 0 0 1

e₇: dig += 1; 0 1 0 1 0 0 0 0

e₉: other += 1; 1 0 1 0 0 0 0 1

(88)

22

(89)

(90)

22

(91)

➤ Need: more finer-grained localisation, taking into account user’s constraints

(92)

➤ Need: more finer-grained localisation, taking into account user’s constraints

➤ How: Use of Declarative Data Mining

22

(93)

Test cases

char c;

e₁: while (c = s[i++]) { 1 1 1 1 1 1 1 1

e₂: if(’A’<=c && ’Z’>=c) 1 1 1 1 1 1 0 1

e₃: let += 2; //- fault - 1 1 1 1 1 1 0 0

e₄: else if ( ’a’<=c && ’z’>=c ) 1 1 1 1 1 0 0 1

e₅: let += 1; 1 1 0 0 1 0 0 0

e₆: else if ( ’0’<=c && ’9’>=c ) 1 1 1 1 0 0 0 1

e₇: dig += 1; 0 1 0 1 0 0 0 0

e₉: other += 1; 1 0 1 0 0 0 0 1

(94)

23

Test cases

char c;

e₁: while (c = s[i++]) { 1 1 1 1 1 1 1 1

e₂: if(’A’<=c && ’Z’>=c) 1 1 1 1 1 1 0 1

e₃: let += 2; //- fault - 1 1 1 1 1 1 0 0

e₄: else if ( ’a’<=c && ’z’>=c ) 1 1 1 1 1 0 0 1

e₅: let += 1; 1 1 0 0 1 0 0 0

e₆: else if ( ’0’<=c && ’9’>=c ) 1 1 1 1 0 0 0 1

e₇: dig += 1; 0 1 0 1 0 0 0 0

e₉: other += 1; 1 0 1 0 0 0 0 1

Fault localisation

=

Mining Task

(95)

PATTERN SUSPICIOUSNESS DEGREE (PSD)

(96)

➤ PSD function. Given a pattern P of a program:

24

P SD(P ) = f req (P ) + ^|^{F AIL}_|_{P ASS}^| ^{f req}_|₊₁⁺^(P ⁾

(97)

➤ PSD-dominance relation. Given two patterns Pi and Pj

P_i B^{P SD} P_j , P SD(P_i) > P SD(P_j)

(98)

➤ PSD-dominance relation. Given two patterns Pi and Pj

➤ Top-k suspicious patterns.

24

P_i B^{P SD} P_j , P SD(P_i) > P SD(P_j)

top-k= {P | 6 9P₁, . . . , P_k : 81  j  k, P_j B^{P SD} P }

(99)

FCP-MINER TOOL (SOME RESULTS)

(100)

CONCLUSIONS (PART II)

26

(101)

➤ Software Testing/Program comprehension tasks can be tackled using Data Mining

➤ Trace analysis

➤ Test suites mining

➤ Source code mining

➤ …

(102)

➤ Trace analysis

➤ …

➤ Think about using Declarative methods in Software Testing

26

(103)

➤ Trace analysis

➤ …

➤ Think about using Declarative methods in Software Testing