A Global Constraint for Closed Frequent Itemset Mining


(1)

(1)

N. Lazaar (2), Y. Lebbah (1), S. Loudni (3), M. Maamar (1,2), V. Lemière (3), C. Bessiere (2), P. Boizumault (3)

(1) Lab. LITIO, University of Oran 1, Algeria
(2) Lab. LIRMM, University of Montpellier, France
(3) Lab. GREYC, University of Caen, France

CP 2016, Toulouse, France, 6 September 2016

(2)

Itemset Mining: Definition

Data mining tasks include itemset, sequence, and graph mining.

• Itemset mining aims at finding regularities in a dataset, e.g., finding sets of products that are frequently bought together.

(3)

Itemset Mining: Example

Set of items: I = {A, B, C, D, E, F, G, H}
Set of transactions: T = {t1, t2, t3, t4, t5}
Itemset: P ⊆ I
Cover: the set of transactions that contain P.
  Cover(AD) = {t2, t3}
  Cover(BEFG) = {t5}
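As a concrete illustration, the cover can be computed in a few lines of Python. The transaction database below is a hypothetical stand-in: the slides show covers, not the full contents of t1–t5.

```python
def cover(itemset, transactions):
    """Return the ids of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

# Hypothetical transaction database (not the one from the slides).
T = {
    "t1": {"A", "B", "C"},
    "t2": {"A", "D"},
    "t3": {"A", "C", "D"},
    "t4": {"B", "D"},
}

print(sorted(cover({"A", "D"}, T)))  # ['t2', 't3']
```

The subset test `itemset <= items` is exactly the definition of "transaction t contains P".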

(4)

Closed Frequent Itemset Mining

• The frequency of an itemset is the size of its cover: freq(P) = |Cover(P)|.
• Let θ ∈ ℕ+ be the minimum support.

Frequent itemset mining problem:

• Extract all itemsets P satisfying freq(P) ≥ θ.

With θ = 3, BC (freq(BC) = 1) and EG (freq(EG) = 2) are infrequent, while A:3, D:3, E:4, F:3 and AD:3 are frequent.

Can we compact these itemsets? → Closed itemsets: an itemset is closed if no strict superset has the same frequency. Here A and D are not closed (their superset AD has the same frequency, 3), whereas AD is closed.
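These definitions can be turned into a naive, exponential reference implementation — a specification of the problem, not a mining algorithm. The example database is again hypothetical.

```python
from itertools import combinations

def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def closure(itemset, transactions):
    """Intersection of all covering transactions: the largest superset of
    `itemset` with the same cover."""
    tids = cover(itemset, transactions)
    sets = [set(transactions[t]) for t in tids]
    return set.intersection(*sets) if sets else set()

def closed_frequent(items, transactions, theta):
    """Enumerate all closed frequent itemsets by brute force.
    A frequent itemset P is closed iff closure(P) == P."""
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            p = set(combo)
            if len(cover(p, transactions)) >= theta and closure(p, transactions) == p:
                result.add(frozenset(p))
    return result

# Hypothetical database: AB appears twice, so B alone is not closed.
T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(closed_frequent({"A", "B", "C"}, T, 2))
```

With θ = 2, {B} is frequent but not closed (its closure is {A, B}, which has the same cover), so only {A} and {A, B} are returned.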

(5)

Closed Frequent Itemset Mining

The need: extract all closed frequent itemsets. How?

(6)

Closed Frequent Itemset: State of the Art

→ Many efficient ad hoc algorithms for mining closed itemsets, but:
  • they are dedicated to particular classes of constraints;
  • adding new constraints requires new implementations.

→ CP framework:
  • modeling: declarative → facilitates the addition of new constraints;
  • solving: efficient solvers based on filtering.

(7)

State of the Art: CP Approach

Reified model [De Raedt et al., 2008]:

Variables:
  • item variables (decision)
  • transaction variables (auxiliary)

Reified constraints: coverage, frequency, closedness.

Drawbacks:
  • an additional dimension of transaction variables;
  • the huge number of reified constraints → limitation.

(8)

Contribution

Proposition: a global constraint that efficiently encodes the Closed Frequent Itemset Mining problem.

• Domain consistency with a polynomial algorithm.
• No reified constraints or extra variables.
• Backtrack-free search.

(9)

CLOSEDPATTERN: Definition

• Encoding: P = {P1, …, Pn}, with D(Pi) = {0, 1}.
• Definition: CLOSEDPATTERN_{D,θ}(P) ⇔ (freq(P) ≥ θ) ∧ (P is a closed pattern).

Example: CLOSEDPATTERN_{D,2}(AD) holds, since freq(AD) ≥ 2 and AD is closed.
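A direct check of this definition on a complete assignment can be sketched in Python. The database is a hypothetical stand-in, not the slides' dataset.

```python
def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def closed_pattern(p, transactions, theta):
    """Check CLOSEDPATTERN_{D,theta}(P): P is frequent and closed.
    Assumes theta >= 1, as on the slides (theta in N+), so the cover of a
    satisfying P is never empty."""
    p = set(p)
    cov = cover(p, transactions)
    if len(cov) < theta:
        return False  # violates freq(P) >= theta
    # P is closed iff no item outside P occurs in every covering transaction,
    # i.e. the intersection of the covering transactions equals P itself.
    clo = set.intersection(*(set(transactions[t]) for t in cov))
    return clo == p

T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(closed_pattern({"A", "B"}, T, 2))  # True
print(closed_pattern({"B"}, T, 2))       # False: closure of {B} is {A, B}
```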

(10)

CLOSEDPATTERN: Filtering Rules

3 filtering rules:

0 ∉ D(Pj):
  • Rule 1: full-extension items (e.g., D must be present).

1 ∉ D(Pj):
  • Rule 2: infrequent items (with θ = 2, H is infrequent).
  • Rule 3: absent items (every transaction containing G also contains E, so once E is assigned 0, G = 0).
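One pass of the three rules can be sketched as follows. This is my reading of the slides, not the paper's propagator: `p_in` holds items assigned 1, `p_out` items assigned 0, and the database is hypothetical.

```python
def cover(itemset, transactions):
    """Ids of the transactions containing every item of `itemset`."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def filter_rules(p_in, p_out, free, transactions, theta):
    """One pass of the three filtering rules over the free items.
    Returns (items forced to 1, items forced to 0)."""
    cov = cover(p_in, transactions)
    to_one, to_zero = set(), set()
    for j in free:
        cov_j = cov & cover({j}, transactions)
        if cov_j == cov:
            to_one.add(j)        # Rule 1: j is a full extension, remove 0
        elif len(cov_j) < theta:
            to_zero.add(j)       # Rule 2: extending with j is infrequent
        elif any(cov_j <= cover({k}, transactions) for k in p_out):
            to_zero.add(j)       # Rule 3: adding j would force an absent item
    return to_one, to_zero

# Hypothetical database: H occurs everywhere, B is rare, G implies E.
T = {"t1": {"A", "B", "D", "H"},
     "t2": {"A", "D", "E", "G", "H"},
     "t3": {"A", "E", "G", "H"}}
print(filter_rules({"A"}, {"E"}, {"B", "D", "G", "H"}, T, theta=2))
```

With A in the pattern and E excluded, H is forced in (Rule 1), B is forced out (Rule 2), G is forced out because every transaction containing it also contains the absent item E (Rule 3), and D stays free.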

(11)

Filtering Algorithm & Complexity

Algorithm (n items, m transactions):
  • for each free variable:
      Rule 1: maintained in O(n × m)
      Rule 2: maintained in O(n × m)
  • for each absent item:
      Rule 3: maintained in O(n × (n × m))

Time: O(n × (n × m)) — cubic.
Space: O(n × m) — quadratic.

(12)

CLOSEDPATTERN vs Reified

At each node, domain consistency is enforced:
  • 0 ∉ D(Pj): Rule 1
  • 1 ∉ D(Pj): Rules 2 and 3

[Figure: search trees of CLOSEDPATTERN and the reified model on the running example with θ = 2 (variable ordering: Lex; value ordering: max_val). The CLOSEDPATTERN tree reaches the closed itemsets <AD>, <A>, <BG>, <CH>, <EF> and <> without failures, pruning the item values D = 0, G = 0, H = 0 and F = 0; the reified model explores a larger tree, additionally prunes transaction values, and encounters failures.]

(13)

CLOSEDPATTERN: Backtrack-free

• DC at each node on Boolean variables.
• Tree search size: a full binary tree with 2 × |C| − 1 nodes, where C is the set of solutions (closed frequent itemsets).
• Backtrack-free search in O(|C| × n² × m).

[Figure: backtrack-free search tree enumerating the closed itemsets <AD>, <A>, <BG>, <CH>, <EF> and <>.]
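The overall search scheme can be sketched by running the filtering rules to a fixpoint at every node of a binary search tree. This is a simplified illustration, not the paper's propagator: it uses the rule pass sketched earlier plus a leaf sanity check, and the database is hypothetical.

```python
def cover(itemset, transactions):
    return {tid for tid, items in transactions.items() if itemset <= items}

def propagate(p_in, p_out, free, transactions, theta):
    """Apply the three filtering rules to a fixpoint (in place)."""
    changed = True
    while changed:
        changed = False
        cov = cover(p_in, transactions)
        for j in list(free):
            cov_j = cov & cover({j}, transactions)
            if cov_j == cov:                       # Rule 1: force j = 1
                p_in.add(j); free.discard(j); changed = True
            elif len(cov_j) < theta or any(        # Rules 2 and 3: force j = 0
                    cov_j <= cover({k}, transactions) for k in p_out):
                p_out.add(j); free.discard(j); changed = True

def enumerate_closed(transactions, theta):
    """DFS over item assignments; with the rules applied at each node,
    every leaf of this sketch passes the closedness check on the example."""
    out = []
    def dfs(p_in, p_out, free):
        propagate(p_in, p_out, free, transactions, theta)
        if not free:
            cov = cover(p_in, transactions)
            if len(cov) >= theta:
                clo = set.intersection(*(set(transactions[t]) for t in cov))
                if clo == p_in:                    # leaf sanity check
                    out.append(frozenset(p_in))
            return
        j = min(free)  # lexicographic branching, as on the slides
        dfs(p_in | {j}, set(p_out), free - {j})
        dfs(set(p_in), p_out | {j}, free - {j})
    dfs(set(), set(), set().union(*transactions.values()))
    return out

T = {"t1": {"A", "B"}, "t2": {"A", "B"}, "t3": {"A", "C"}}
print(enumerate_closed(T, 2))
```

On this example the search visits one branching node and two leaves, both of which are solutions ({A, B} and {A}) — no backtracking is wasted on failed subtrees.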

(14)

CLOSEDPATTERN: Experiments

Comparison with:
  • the most efficient CP method: CP4IM (reified);
  • the most efficient ad hoc algorithm: LCM v5.2.

Solver: or-tools, on an Intel Xeon E5-2680 @ 2.5 GHz with 128 GB of RAM.

CLOSEDPATTERN-DC: rules 1, 2 and 3 (cubic pruning).

(15)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CP4IM:

Dataset    θ(%)   Time (s)             #Failures
                  DC       CP4IM       DC    CP4IM
Chess      30     45.92    136.31      0     1 054
           20     187.89   467.52      0     32 381
           10     969.40   1 950.5     0     725 617
Mushroom   1      1.74     24.06       0     32 828
           0.5    3.62     29.84       0     54 584
           0.1    5.40     41.38       0     100 528
           0.05   6.37     43.69       0     107 191
Pumsb      80     133.97   OOM         0     OOM
           75     271      OOM         0     OOM
           70     509.79   OOM         0     OOM

→ CLOSEDPATTERN-DC is backtrack-free.

(16)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CLOSEDPATTERN-WC:

Dataset    θ(%)   Time (s)            #Failures
                  DC       WC         DC    WC
Chess      30     45.92    109.36     0     3 × 10^7
           20     187.89   480.52     0     1.3 × 10^8
           10     969.40   2 288      0     6.3 × 10^8
Mushroom   1      1.74     3.42       0     1 117 883
           0.5    3.62     5.00       0     1 863 542
           0.1    5.40     9.96       0     3 401 297
           0.05   6.37     9.94       0     3 787 678
Pumsb      80     133.97   640.59     0     51 564
           75     271      1 010.4    0     131 551
           70     509.79   2 150.8    0     239 623

(17)

CLOSEDPATTERN: Experiments

CLOSEDPATTERN-DC vs CP4IM vs LCM:

Dataset    θ(%)   Time (s)
                  DC       CP4IM     LCM
Chess      30     45.92    136.31    6.07
           20     187.89   467.52    27.55
           10     969.40   1 950.5   141.55
Mushroom   1      1.74     24.06     0.22
           0.5    3.62     29.84     0.34
           0.1    5.40     41.38     0.47
           0.05   6.37     43.69     0.51
Pumsb      80     133.97   OOM       0.33
           75     271      OOM       0.48
           70     509.79   OOM       0.69

(18–21)

CLOSEDPATTERN: Experiments

The aim: find k closed itemsets such that:
  (i) CLOSEDPATTERN-DC holds;
  (ii) the itemsets are distinct;
  (iii) lb < size < ub.

[Figure: CPU time (s) as a function of k (from 2 to 12, y-axis 0–4 000 s) for CLOSEDPATTERN, CP4IM and LCM on chess (θ = 80%, lb = 2, ub = 10) and connect (θ = 90%, lb = 2, ub = 10). On these queries, CP4IM runs out of memory (OOM) and LCM times out (TO).]

(22)

Conclusion and Perspectives

• Conclusion:
  - A global constraint for Closed Frequent Itemset Mining ensuring DC.
  - No need for reified constraints or extra variables.
  - A filtering algorithm cubic in time and quadratic in space.
