A Global Constraint for Closed Frequent Itemset Mining


(1)

(1)

N. Lazaar (2), Y. Lebbah (1), S. Loudni (3), M. Maamar (1,2), V. Lemière (3), C. Bessiere (2), P. Boizumault (3)

(1) Lab. LITIO, University of Oran 1, Algeria
(2) Lab. LIRMM, University of Montpellier, France
(3) Lab. GREYC, University of Caen, France

CP 2016, Toulouse, France, 6 September 2016

(2)

Itemset Mining: Definition

Data mining tasks include itemset, sequence, and graph mining.

• Itemset mining aims at finding regularities in a dataset, e.g., finding sets of products that are frequently bought together.

(3)

Itemset Mining: Example

Set of items: I = {A, B, C, D, E, F, G, H}
Set of transactions: T = {t1, t2, t3, t4, t5}
Itemset: P ⊆ I
Cover: the set of transactions that contain P.
  Cover(AD) = {t2, t3}
  Cover(BEFG) = {t5}
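As a concrete illustration, the cover can be computed in a few lines of Python. The transaction database below is a hypothetical stand-in: the slides show covers, not the full contents of t1–t5.

```python
def cover(itemset, transactions):
    """Return the ids of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

# Hypothetical transaction database (not the one from the slides).
T = {
    "t1": {"A", "B", "C"},
    "t2": {"A", "D"},
    "t3": {"A", "C", "D"},
    "t4": {"B", "D"},
}

print(sorted(cover({"A", "D"}, T)))  # ['t2', 't3']
```

The subset test `itemset <= items` is exactly the definition of "transaction t contains P".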

(4)

Closed Frequent Itemset Mining

• The frequency of an itemset is the size of its cover: freq(P) = |Cover(P)|.
• Let θ ∈ ℕ+ be the minimum support.

Frequent itemset mining problem:

• Extract all itemsets P satisfying freq(P) ≥ θ.

With θ = 3, BC (freq(BC) = 1) and EG (freq(EG) = 2) are infrequent, while A:3, D:3, E:4, F:3 and AD:3 are frequent.

Can we compact these itemsets? → Closed itemsets: an itemset is closed if no strict superset has the same frequency. Here A and D are not closed (their superset AD has the same frequency, 3), whereas AD is closed.
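These definitions can be turned into a naive, exponential reference implementation — a specification of the problem, not a mining algorithm. The example database is again hypothetical.

```python
from itertools import combinations

def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def closure(itemset, transactions):
    """Intersection of all covering transactions: the largest superset of
    `itemset` with the same cover."""
    tids = cover(itemset, transactions)
    sets = [set(transactions[t]) for t in tids]
    return set.intersection(*sets) if sets else set()

def closed_frequent(items, transactions, theta):
    """Enumerate all closed frequent itemsets by brute force.
    A frequent itemset P is closed iff closure(P) == P."""
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            p = set(combo)
            if len(cover(p, transactions)) >= theta and closure(p, transactions) == p:
                result.add(frozenset(p))
    return result

# Hypothetical database: AB appears twice, so B alone is not closed.
T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(closed_frequent({"A", "B", "C"}, T, 2))
```

With θ = 2, {B} is frequent but not closed (its closure is {A, B}, which has the same cover), so only {A} and {A, B} are returned.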

(5)

Closed Frequent Itemset Mining

The need: extract all closed frequent itemsets. How?

(6)

Closed Frequent Itemset: State of the Art

→ Many efficient ad hoc algorithms for mining closed itemsets, but:
  • they are dedicated to particular classes of constraints;
  • adding new constraints requires new implementations.

→ CP framework:
  • modeling: declarative → facilitates the addition of new constraints;
  • solving: efficient solvers based on filtering.

(7)

State of the Art: CP Approach

Reified model [De Raedt et al., 2008]:

Variables:
  • item variables (decision)
  • transaction variables (auxiliary)

Reified constraints: coverage, frequency, closedness.

Drawbacks:
  • an additional dimension of transaction variables;
  • the huge number of reified constraints → limitation.

(8)

Contribution

Proposition: a global constraint that efficiently encodes the Closed Frequent Itemset Mining problem.

• Domain consistency with a polynomial algorithm.
• No reified constraints or extra variables.
• Backtrack-free search.

(9)

CLOSEDPATTERN: Definition

• Encoding: P = {P1, …, Pn}, with D(Pi) = {0, 1}.
• Definition: CLOSEDPATTERN_{D,θ}(P) ⇔ (freq(P) ≥ θ) ∧ (P is a closed pattern).

Example: CLOSEDPATTERN_{D,2}(AD) holds, since freq(AD) ≥ 2 and AD is closed.
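A direct check of this definition on a complete assignment can be sketched in Python. The database is a hypothetical stand-in, not the slides' dataset.

```python
def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def closed_pattern(p, transactions, theta):
    """Check CLOSEDPATTERN_{D,theta}(P): P is frequent and closed.
    Assumes theta >= 1, as on the slides (theta in N+), so the cover of a
    satisfying P is never empty."""
    p = set(p)
    cov = cover(p, transactions)
    if len(cov) < theta:
        return False  # violates freq(P) >= theta
    # P is closed iff no item outside P occurs in every covering transaction,
    # i.e. the intersection of the covering transactions equals P itself.
    clo = set.intersection(*(set(transactions[t]) for t in cov))
    return clo == p

T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(closed_pattern({"A", "B"}, T, 2))  # True
print(closed_pattern({"B"}, T, 2))       # False: closure of {B} is {A, B}
```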

(10)

CLOSEDPATTERN: Filtering Rules

3 filtering rules:

0 ∉ D(Pj):
  • Rule 1: full-extension items (e.g., D must be present).

1 ∉ D(Pj):
  • Rule 2: infrequent items (with θ = 2, H is infrequent).
  • Rule 3: absent items (every transaction containing G also contains E, so once E is assigned 0, G = 0).
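One pass of the three rules can be sketched as follows. This is my reading of the slides, not the paper's propagator: `p_in` holds items assigned 1, `p_out` items assigned 0, and the database is hypothetical.

```python
def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def filter_rules(p_in, p_out, free, transactions, theta):
    """One pass of the three filtering rules over the free items.
    Returns (items forced to 1, items forced to 0)."""
    cov = cover(p_in, transactions)
    to_one, to_zero = set(), set()
    for j in free:
        cov_j = cov & cover({j}, transactions)
        if cov_j == cov:
            to_one.add(j)        # Rule 1: j is a full extension, remove 0
        elif len(cov_j) < theta:
            to_zero.add(j)       # Rule 2: extending with j is infrequent
        elif any(cov_j <= cover({k}, transactions) for k in p_out):
            to_zero.add(j)       # Rule 3: adding j would force an absent item
    return to_one, to_zero

# Hypothetical database: H occurs everywhere, B is rare, G implies E.
T = {"t1": {"A", "B", "D", "H"},
     "t2": {"A", "D", "E", "G", "H"},
     "t3": {"A", "E", "G", "H"}}
print(filter_rules({"A"}, {"E"}, {"B", "D", "G", "H"}, T, theta=2))
```

With A in the pattern and E excluded, H is forced in (Rule 1), B is forced out (Rule 2), G is forced out because every transaction containing it also contains the absent item E (Rule 3), and D stays free.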

(11)

Filtering Algorithm & Complexity

Algorithm (n items, m transactions):
  • for each free variable:
      Rule 1: maintained in O(n × m)
      Rule 2: maintained in O(n × m)
  • for each absent item:
      Rule 3: maintained in O(n × (n × m))

Time: O(n × (n × m)) — cubic.
Space: O(n × m) — quadratic.

(12)

CLOSEDPATTERN vs Reified

At each node, domain consistency is enforced:
  • 0 ∉ D(Pj): Rule 1
  • 1 ∉ D(Pj): Rules 2 and 3

[Figure: search trees of CLOSEDPATTERN and the reified model on the running example with θ = 2 (variable ordering: Lex; value ordering: max_val). The CLOSEDPATTERN tree reaches the closed itemsets <AD>, <A>, <BG>, <CH>, <EF> and <> without failures, pruning the item values D = 0, G = 0, H = 0 and F = 0; the reified model explores a larger tree, additionally prunes transaction values, and encounters failures.]

(13)

CLOSEDPATTERN: Backtrack-free

• DC at each node on Boolean variables.
• Tree search size: a full binary tree with 2 × |C| − 1 nodes, where C is the set of solutions (closed frequent itemsets).
• Backtrack-free search in O(|C| × n² × m).

[Figure: backtrack-free search tree enumerating the closed itemsets <AD>, <A>, <BG>, <CH>, <EF> and <>.]
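The overall search scheme can be sketched by running the filtering rules to a fixpoint at every node of a binary search tree. This is a simplified illustration, not the paper's propagator: it uses the rule pass sketched earlier plus a leaf sanity check, and the database is hypothetical.

```python
def cover(itemset, transactions):
    return {tid for tid, items in transactions.items() if itemset <= items}

def propagate(p_in, p_out, free, transactions, theta):
    """Apply the three filtering rules to a fixpoint (in place)."""
    changed = True
    while changed:
        changed = False
        cov = cover(p_in, transactions)
        for j in list(free):
            cov_j = cov & cover({j}, transactions)
            if cov_j == cov:                       # Rule 1: force j = 1
                p_in.add(j); free.discard(j); changed = True
            elif len(cov_j) < theta or any(        # Rules 2 and 3: force j = 0
                    cov_j <= cover({k}, transactions) for k in p_out):
                p_out.add(j); free.discard(j); changed = True

def enumerate_closed(transactions, theta):
    """DFS over item assignments; with the rules applied at each node,
    every leaf of this sketch passes the closedness check on the example."""
    out = []
    def dfs(p_in, p_out, free):
        propagate(p_in, p_out, free, transactions, theta)
        if not free:
            cov = cover(p_in, transactions)
            if len(cov) >= theta:
                clo = set.intersection(*(set(transactions[t]) for t in cov))
                if clo == p_in:                    # leaf sanity check
                    out.append(frozenset(p_in))
            return
        j = min(free)  # lexicographic branching, as on the slides
        dfs(p_in | {j}, set(p_out), free - {j})
        dfs(set(p_in), p_out | {j}, free - {j})
    dfs(set(), set(), set().union(*transactions.values()))
    return out

T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(enumerate_closed(T, 2))
```

On this example the search visits one branching node and two leaves, both of which are solutions ({A, B} and {A}) — no backtracking is wasted on failed subtrees.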

(14)

CLOSEDPATTERN: Experiments

Comparison with:
  • the most efficient CP method: CP4IM (reified);
  • the most efficient ad hoc algorithm: LCM v5.2.

Solver: or-tools, on an Intel Xeon E5-2680 @ 2.5 GHz with 128 GB of RAM.

CLOSEDPATTERN-DC: rules 1, 2 and 3 (cubic pruning).

(15)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CP4IM:

Dataset    θ(%)   Time (s)             #Failures
                  DC       CP4IM       DC    CP4IM
Chess      30     45.92    136.31      0     1 054
           20     187.89   467.52      0     32 381
           10     969.40   1 950.5     0     725 617
Mushroom   1      1.74     24.06       0     32 828
           0.5    3.62     29.84       0     54 584
           0.1    5.40     41.38       0     100 528
           0.05   6.37     43.69       0     107 191
Pumsb      80     133.97   OOM         0     OOM
           75     271      OOM         0     OOM
           70     509.79   OOM         0     OOM

→ CLOSEDPATTERN-DC is backtrack-free.

(16)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CLOSEDPATTERN-WC:

Dataset    θ(%)   Time (s)            #Failures
                  DC       WC         DC    WC
Chess      30     45.92    109.36     0     3 × 10^7
           20     187.89   480.52     0     1.3 × 10^8
           10     969.40   2 288      0     6.3 × 10^8
Mushroom   1      1.74     3.42       0     1 117 883
           0.5    3.62     5.00       0     1 863 542
           0.1    5.40     9.96       0     3 401 297
           0.05   6.37     9.94       0     3 787 678
Pumsb      80     133.97   640.59     0     51 564
           75     271      1 010.4    0     131 551
           70     509.79   2 150.8    0     239 623

(17)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CP4IM vs LCM:

Dataset    θ(%)   Time (s)
                  DC       CP4IM     LCM
Chess      30     45.92    136.31    6.07
           20     187.89   467.52    27.55
           10     969.40   1 950.5   141.55
Mushroom   1      1.74     24.06     0.22
           0.5    3.62     29.84     0.34
           0.1    5.40     41.38     0.47
           0.05   6.37     43.69     0.51
Pumsb      80     133.97   OOM       0.33
           75     271      OOM       0.48
           70     509.79   OOM       0.69

(18–21)

CLOSEDPATTERN: Experiments

The aim: find k closed itemsets such that:
  (i) CLOSEDPATTERN-DC holds;
  (ii) the itemsets are distinct;
  (iii) lb < size < ub.

[Figure: CPU time (s) as a function of k (from 2 to 12, y-axis 0–4 000 s) for CLOSEDPATTERN, CP4IM and LCM on chess (θ = 80%, lb = 2, ub = 10) and connect (θ = 90%, lb = 2, ub = 10). On these queries, CP4IM runs out of memory (OOM) and LCM times out (TO).]

(22)

Conclusion and Perspectives

• Conclusion:
  - A global constraint for Closed Frequent Itemset Mining ensuring DC.
  - No need for reified constraints or extra variables.
  - A filtering algorithm cubic in time and quadratic in space.
