Invalidating a conjecture on the average number of closed sets in a random database

Olivier Bodini

LIPN, Université Paris 13, and CNRS, UMR 7030. 99, av. J.-B. Clément, 93430 Villetaneuse, France

Julien David

LIPN, Université Paris 13, and CNRS, UMR 7030. 99, av. J.-B. Clément, 93430 Villetaneuse, France

Alexandre Bazin


My Department My University My City, My Country

Abstract

In this paper, we invalidate a conjecture from [3] which stated that, given a random database with $n$ columns and $m$ lines, the average number of closed sets of size $x$ with a support $y$ is maximized when $x = \log n$ and $y = \log m$. We prove a refinement of this conjecture and obtain $x = \log m - \log\log\log m + O(1)$ and $y = \log n - \log\log\log n + O(1)$. From there we obtain an estimate of the average number of closed sets in a random database.

Keywords: Average analysis, closed sets, data mining.

1 Introduction

Frequent and closed patterns and their enumeration are a widely studied subject. Both kinds of patterns are useful for mining association rules in a database. Most papers that introduce new mining algorithms rely on experimental results to demonstrate the efficiency of their methods.

Fig. 1. Average number of closed sets for fixed parameters $n$ and $m$ equal to 1000, with $x$ and $y$ in $[\log\log 1000, \ldots, \log 1000]$. The plotted axes are $x$ and $y$ (ticks from 3 to 9), with values ranging up to $6 \times 10^{16}$.

In the past fifteen years, a few papers [3,2,1] performed theoretical studies of random databases under various models. In [3], the authors study the average number of frequent sets in a database. The paper ends with a list of conjectures, including one stating that the average number of closed sets of size $x$ with a support $y$ is maximized when $x = \log n$ and $y = \log m$.

Using Maple and the closed formula that counts the average number of closed sets, we obtained an experimental result refining the conjecture.

As Figure 1 shows, the maximum value appears before the point $(\log n, \log m)$. In the following, we prove that the maximum is reached at $(\log m - \log\log\log m + O(1),\ \log n - \log\log\log n + O(1))$.¹

¹ A Maple worksheet containing the details of the computation can be found at http://lipn.fr/david/ArticleDoc/closedSets.nw


2 Model

In this section, we recall the model defined in [3] to study random databases. A database can be represented by an $n \times m$ matrix $(\chi_{x,y})_{x=1\ldots n,\ y=1\ldots m}$. Each column of the matrix is associated with an attribute and each row with a transaction. The set of attributes is denoted $\mathcal{A}$ and the set of transactions $\mathcal{T}$. The support of an attribute $x$ is the set $Y \subseteq \mathcal{T}$ of transactions in which it appears (meaning $\chi_{x,y} = 1$ for all $y \in Y$). The support of a set of attributes, or pattern, is the intersection of the supports of all the attributes it contains.

Definition 2.1 A pattern is closed if there exist two sets $X \subseteq \mathcal{A}$ and $Y \subseteq \mathcal{T}$ such that

for all $x \in X$ and all $y \in Y$, $\chi_{x,y} = 1$,

for all $i \in \mathcal{A} \setminus X$, there exists $j \in Y$ such that $\chi_{i,j} = 0$ ($X$ is maximal by inclusion for its support $Y$),

for all $j \in \mathcal{T} \setminus Y$, there exists $i \in X$ such that $\chi_{i,j} = 0$ ($Y$ is maximal by inclusion for its associated set $X$).

From a probabilistic point of view, the entries of the matrix form an independent family of random variables that all follow the same Bernoulli law of parameter $p \in\ ]0,1[$.

We suppose that $n$ and $m$ are polynomially related, that is, $m = n^{\alpha}$.
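The definition and the Bernoulli model above can be illustrated with a small brute-force sketch. The following Python code is not taken from [3]; it assumes the database is stored as a list of rows (transactions) over columns (attributes), and the helper names `random_database`, `support`, `closure` and `closed_sets` are hypothetical. It enumerates every non-empty attribute set, so it is only usable for very small $n$.

```python
import itertools
import random


def random_database(n, m, p, seed=None):
    """Return m rows (transactions) of n i.i.d. Bernoulli(p) entries (attributes)."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]


def support(db, attrs):
    """Rows in which every attribute of `attrs` equals 1."""
    return frozenset(j for j, row in enumerate(db) if all(row[i] for i in attrs))


def closure(db, rows, n):
    """Attributes that equal 1 on every row of `rows`."""
    return frozenset(i for i in range(n) if all(db[j][i] for j in rows))


def closed_sets(db, n):
    """Brute-force enumeration of closed pairs (X, Y) with X and Y non-empty."""
    found = set()
    for size in range(1, n + 1):
        for attrs in itertools.combinations(range(n), size):
            rows = support(db, attrs)  # Y is maximal for X by construction
            # X is maximal for Y exactly when the closure of Y gives X back.
            if rows and closure(db, rows, n) == frozenset(attrs):
                found.add((frozenset(attrs), rows))
    return found
```

For instance, `len(closed_sets(random_database(8, 20, 0.5, seed=1), 8))` counts the closed sets of one random database with 8 attributes and 20 transactions; averaging this count over many seeds gives an empirical estimate of the mean studied in the next section.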

3 Result

Let $x$ be the cardinality of a closed set $X$. Let $y$ be the maximal number of lines on which the elements of $X$ form a rectangle of values equal to 1. There are $\binom{m}{y}$ possible choices of lines, and the probability of obtaining a rectangle of 1s is $p^{xy}$. The probability that there is at least one 0 on each of the other lines and columns is $(1-p^x)^{m-y}(1-p^y)^{n-x}$.

The average number of closed sets is given by the following formula:

$$E(C) = \sum_{x=1}^{n} \binom{n}{x} \left( \sum_{y=1}^{m} \binom{m}{y}\, p^{xy} (1-p^y)^{n-x} (1-p^x)^{m-y} \right)$$
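As a sanity check, the double sum can be evaluated directly for small parameters. The sketch below is a hypothetical helper (`expected_closed_sets` is not a name from the paper), written with the exponents $n-x$ and $m-y$ that follow from the counting argument above. It uses floating-point arithmetic and is meant for small $n$ and $m$ only.

```python
from math import comb


def expected_closed_sets(n, m, p):
    """Evaluate E(C) = sum_x C(n,x) sum_y C(m,y) p^(xy) (1-p^y)^(n-x) (1-p^x)^(m-y)."""
    total = 0.0
    for x in range(1, n + 1):
        inner = 0.0
        for y in range(1, m + 1):
            rectangle = p ** (x * y)                 # the x-by-y block is all ones
            other_columns = (1 - p ** y) ** (n - x)  # each other column has a 0 on the y lines
            other_lines = (1 - p ** x) ** (m - y)    # each other line has a 0 on the x columns
            inner += comb(m, y) * rectangle * other_columns * other_lines
        total += comb(n, x) * inner
    return total
```

With a single attribute and a single transaction the sum reduces to the one term $p$, and with $n = 2$, $m = 1$ it gives $2p(1-p) + p^2$, the probability that the only transaction contains at least one attribute.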

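Lemma 3.1 When $n$ tends to infinity, the function

$$\kappa(n, m, x, y, p) = \binom{n}{x} \binom{m}{y}\, p^{xy} (1-p^y)^{n-x} (1-p^x)^{m-y}$$

is maximal when

$$\begin{cases} x = \log m - \log\log\log m + s_1 \\ y = \log n - \log\log\log n + s_2 \end{cases}$$

where

$$s_1 = \ln\left( \frac{\ln\ln n}{\ln p \left( -\ln\ln n + \ln\left( \ln\ln m \cdot \ln\ln n \right) \right)} \right), \qquad s_2 = -\ln\left( \ln p \left( \frac{\ln\ln\ln m}{\ln\ln n} - 1 \right) \right),$$

which both tend, very slowly, to $-\ln\ln\frac{1}{p}$.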

Proof. In the following, we assume that $\kappa(n, m, x, y, p)$ is maximal when $x_{\max} = \log m - \log\log\log m + s_1$ and $y_{\max} = \log n - \log\log\log n + s_2$, and prove that $s_1$ and $s_2$ are constants. The idea of the proof is simple: we compute the logarithm of the formula, then compute its gradient and solve for the point where it vanishes. The series expansion of $\ln n!$ is

$$(\ln n - 1)\, n + \ln\left(\sqrt{2}\sqrt{\pi}\right) + \frac{1}{2}\ln n + \frac{1}{12}\, n^{-1} - \frac{1}{360}\, n^{-3} + O\left(n^{-4}\right),$$

from which we compute an approximation of the logarithm of the formula.
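The quality of this truncated expansion is easy to check numerically. The short sketch below is an illustration only (the function name `ln_factorial_series` is hypothetical); it compares the five explicit terms of the series with the value of $\ln n!$ given by the standard library's log-gamma function.

```python
from math import lgamma, log, pi, sqrt


def ln_factorial_series(n):
    """Truncated series for ln(n!): the five explicit terms of the expansion."""
    return ((log(n) - 1) * n
            + log(sqrt(2) * sqrt(pi))
            + 0.5 * log(n)
            + 1 / (12 * n)
            - 1 / (360 * n ** 3))


def ln_factorial_exact(n):
    """ln(n!) computed through the log-gamma function, since n! = Gamma(n + 1)."""
    return lgamma(n + 1)
```

Comparing `ln_factorial_series(n)` with `ln_factorial_exact(n)` for a few values of `n` shows how quickly the remainder shrinks, which is what justifies dropping it in the approximation of $\ln\kappa$.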

Removing negligible terms, we obtain:

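$$\ln \kappa(n, n^{\alpha}, x_{\max}, y_{\max}, p) \sim \frac{(\ln n)^2 - A \ln n + B \ln\ln n + C \ln\ln\ln m - D \ln\ln\ln n}{\ln\frac{1}{p}}$$

where

$$\begin{cases}
A = \alpha\ln\alpha - \alpha\ln\ln p + i\alpha\pi + \alpha\ln\ln n - \alpha + \ln\ln n - 1 - \ln\ln p + i\pi \\
B = \ln\ln\ln n - s_1 + \ln\ln\ln m + \dfrac{1 + e^{-s_2} + e^{-s_1}}{\ln p} - s_2 \\
C = \ln\dfrac{\alpha}{\ln p} + s_1 + \ln\ln\ln n + s_2 \\
D = \ln\ln\dfrac{1}{p} + s_1
\end{cases}$$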

As one can see, the dominant term is $\log_{1/p} n \times \ln n$, meaning, assuming $s_1$ and $s_2$ are constants, that the average number of closed sets is quasi-polynomial and of the form $n^{\log_{1/p} n}$.

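We now show that $s_1$ and $s_2$ are asymptotically constant. The gradient of the previous formula is

$$\begin{pmatrix} \ln\ln n \left( 1 + \dfrac{e^{-s_1}}{\ln p} \right) - \ln\ln\ln m - \ln\ln\ln n \\[2ex] \ln\ln n \left( 1 + \dfrac{e^{-s_2}}{\ln p} \right) - \ln\ln\ln m \end{pmatrix}$$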

Setting this gradient to zero and solving, we obtain the values announced in the lemma. □ Note that in practice, $s_1$ and $s_2$ will be slightly higher than $-\ln\ln\frac{1}{p}$, since the convergence is very slow.
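The location of the maximum can also be checked numerically for moderate sizes. The sketch below is an illustration and not the authors' Maple computation: it evaluates $\ln\kappa$ with log-gamma functions, using the exponents $n-x$ and $m-y$, and searches all integer pairs $(x, y)$. The value $p = 1/2$ used in the example call and the reading of $\log$ as the logarithm in base $1/p$ are assumptions, since the text does not state them for Figure 1; the function names are hypothetical.

```python
from math import lgamma, log, log1p


def log_binom(a, b):
    """Natural logarithm of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def log_kappa(n, m, x, y, p):
    """ln of C(n,x) C(m,y) p^(xy) (1-p^y)^(n-x) (1-p^x)^(m-y)."""
    return (log_binom(n, x) + log_binom(m, y) + x * y * log(p)
            + (n - x) * log1p(-(p ** y))
            + (m - y) * log1p(-(p ** x)))


def argmax_kappa(n, m, p):
    """Integer pair (x, y) maximizing kappa, found by exhaustive search."""
    pairs = ((x, y) for x in range(1, n + 1) for y in range(1, m + 1))
    return max(pairs, key=lambda xy: log_kappa(n, m, xy[0], xy[1], p))
```

Calling `argmax_kappa(1000, 1000, 0.5)` and comparing the result with `log(1000) / log(2)` gives a direct way to see whether the integer maximizer lies before the point $(\log n, \log m)$ for these parameters. The exhaustive search costs $n \times m$ evaluations, which is acceptable at this size but would need a restricted search window for much larger databases.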

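Theorem 3.2 The average number of closed sets in a random database is quasi-polynomial. More precisely, we have

$$E(C) = n^{\log_{1/p} n + A}\, (\ln n)^{B}\, (\ln\ln m)^{C}\, (\ln\ln n)^{D}$$

with

$$\begin{cases}
A = \dfrac{1}{\ln\frac{1}{p}} \left( \alpha\ln\alpha + \alpha\left( i\pi + (\alpha+1)\ln\ln n - 1 - \ln\ln p \right) + 1 \right) \\[2ex]
B = \dfrac{1}{\ln\frac{1}{p}} \left( \ln\ln\ln n + \ln\ln\ln m + 2\ln\ln\dfrac{1}{p} + \dfrac{1 + 2\ln\frac{1}{p}}{\ln p} \right) \\[2ex]
C = \dfrac{1}{\ln\frac{1}{p}} \left( \ln\ln\ln n + \ln\dfrac{\alpha}{\ln\frac{1}{p}} + 2\ln\ln\dfrac{1}{p} \right) \\[2ex]
D = \dfrac{1}{\ln\frac{1}{p}} \cdot 2\ln\ln\dfrac{1}{p}
\end{cases}$$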

References

[1] Julien David, Loïck Lhote, Arnaud Mary, and François Rioult. An average study of hypergraphs and their minimal transversals. Theor. Comput. Sci., 596:124–141, 2015.

[2] Loïck Lhote, François Rioult, and Arnaud Soulet. Average number of frequent and closed patterns in random databases. In Actes de CAP 05, Conférence francophone sur l’apprentissage automatique - 2005, Nice, France, du 31 mai au 3 juin 2005, pages 345–360, 2005.

[3] Loïck Lhote, François Rioult, and Arnaud Soulet. Average number of frequent (closed) patterns in bernouilli and markovian databases. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 27-30 November 2005, Houston, Texas, USA, pages 713–716, 2005.
