• Aucun résultat trouvé

An Algorithm for Verifying Approximate Pure Evolving Functional Dependencies

N/A
N/A
Protected

Academic year: 2022

Partager "An Algorithm for Verifying Approximate Pure Evolving Functional Dependencies"

Copied!
15
0
0

Texte intégral

(1)

An algorithm for verifying Approximate Pure Evolving Functional Dependencies

Pietro Sala

Department of Computer Science, University of Verona, Italy pietro.sala@univr.it

Abstract. Fuctional dependencies (FDs) form the foundations of data- base theory since the beginning of time, given they may represent con- straints such as “the employee role determines his monthly salary”. If we consider temporal databases where each tuple is timestamped with a valid time (V T) attribute, finer constraints may be imposed on the evo- lution of the data such as “the employee role before a promotion and his role afterwards determine his new monthly salary”. Such constraints are calledPure Evolving Functional Dependencies (PEFDs) according to the classification introduced in [3]. By adding approximation to such rules they may be used for extracting knowledge in place of imposing con- straints. Then we may want todiscover that “the employee role before a promotion and his role afterwardsgenerally determine his new monthly salary”. This kind ofapproximate dependencies are calledApproximate Pure Evolving Functional Dependencies(APEFDs). Verifying whether or not a givenAPEFD holds over a database instance is an NP-Complete problem [4]. Despite the discouraging complexity, in this work we pro- pose an algorithms that solve in an efficient way this problem by trying to divide-and-conquer and prune the search space as much as possible.

Keywords: Temporal Databases·Approximate Functional Dependen- cies·Data Mining.

1 Temporally Mixed Functional Dependencies

According to [3], in the following we study a family of temporal functional de- pendencies [2, 5, 6] i.e., thePure Evolving Functional Dependencies, calledPEFD.

From now on our temporal functional dependencies will be given over a temporal schema R “ UY tV Tu, where U is a set of atemporal attributes and V T is a special attribute denoting the valid time of each tuple. Hereinafter we assume tuples timestamped with natural numbers (i.e, DompV Tq “N). Let J ĎU be a non-empty subset ofU. We define the set W as W “UzJ and setW, which is basically a renaming of attributes inW. Formally, for each attribute APW, we haveAPW (i.e.,W “ tA:APWu). Given an instancerofR, we define an instanceτJrof schemaRev“J W WtV T, V Tuas follows:

This work has not been previously published and it is an extension of the results published in [4].

(2)

τJr

$

’’

&

’’

% u

ˇ ˇ ˇ ˇ ˇ ˇ ˇ ˇ

Dt,t1

¨

˚

˚

˚

˚

˚

˝

rptq^rpt1q^trJs“t1rJs“urJs^urWs“trWs^

urWs“t1rWs^trV Ts“urV Ts^t1rV Ts“urV Ts

^trV Tsăt1rV Ts^

@t2pprpt2q^trV Tsăt2rV TsqÑt1rV Tsďt2rV Tsq

˛

, // . // -

Schema Rev is called the evolution schema ofR. We will denote by τJR the view on R that is built by expression τJr for every instance r of R. View τJR joins two tuples t1 and t2 that agree on the values of the attributes in J (i.e.

t1rJs “ t2rJs) and t1rV Ts ăt2rV Ts. Moreover, such tuples are joined if does not exist a tuple t P r with trJs “t1rJs and t1rV Ts ătrV Ts ă t2rV Ts (i.e., there exists a tuple that holds at some point in between the valid times of such tuples). For application purposes, it is correct correct to consider in a evolution schema only those pair of consecutive tuples whose the difference between V T and V T respects some given bound. Given a parameterk PNY t`8u, tuples ofτJrare filtered by means of the selection∆kJrq “σtrV Ts´trV TsďkJrq(notice that∆`8Jrq “τJr).∆kJrqforces to consider only those tuples belonging toτJr having a temporal distance within the given thresholdk. In the following, given a tupletPτJr, we denote its temporal distancetrV Ts ´trV Tswith∆ptq.

A Pure Temporally Evolving Functional Dependency [3] over the temporal schemaR“UY tV Tu,PEFDfor short, is an expression of the form

r∆kJRqsXY ÑZ.

We have thatX ĎW andY , ZĎW withX ‰ Hand|Z| “1 (Zcontains a single attribute). We say that an instancerofRfulfills aPEFDr∆kJRqsXY Ñ Z, written r|ù r∆kJRqsXY ÑZ, if and only if for each pair of tuples t, t1 P

kJrqwe have trXs “t1rXs ^trYs “t1rYs ÑtrZs “t1rZs.

Now add approximation to PEFD in a very standard way [7, 2]. First, we provide measurementGto deal withPEFDas follows:

Gpr∆kJRqsXY ÑZ,rq “ |r| ´maxt|s|:sĎr,s|ù r∆kJRqsXY ÑZu.

By means ofGwe can define the relativescaled measurement g for TM-FD as follows:

gpr∆kJRqsXY ÑZ,rq “Gpr∆kJRqsXY ÑZ,rq

|r| .

Now we are ready to define theApproximate Pure Temporally Evolving Func- tional Dependency (APEFDfor short). Formally, given a real number 0ďď1, we say that an instancerofRsatisfies theAPEFDr∆kJRqsX Y Ñ Z, written r|ù r∆kJRqsXY Ñ Z, if and only ifgpr∆kJRqsXY ÑZ,rq ď.

In Section 2 we provide an algorithm for checking an APEFD against an instancer, we call this problem Check-APEFD:

Problem 1. (Check-APEFD). Given a temporal schema R, a PEFD r∆kJRqs XY Ñ Z on R, an instance r of R, and a real number 0 ď ď1 determine whether or notr|ù r∆kJRqsXY Ñ Z.

(3)

r“

# Name P hys CT Durpminutesq V T

1 M cM urphy Sayer self „15 1 Jan 2016 2 M cM urphy Sayer f amily „5 5 Jan 2016 3 M cM urphy M aguire f amily „15 10 Jan 2016 4 M cM urphy M aguire self „15 15 Jan 2016 5 M cM urphy M aguire self „5 20 Jan 2016 6 M cM urphy M aguire self „5 25 Jan 2016

7 Lowe Sayer f amily „15 7 Jan 2016

8 Lowe Sayer f amily „15 15 Jan 2016

9 Lowe M aguire self „15 22 Jan 2016

10 Lowe Sayer self „5 28 Jan 2016

11 Lowe M aguire self „15 3 Feb 2016

12 Lowe Sayer self „5 6 Feb 2016

Fig. 1. An instance r of schemaContact that stores the phone contacts about two psychiatric cases. Attribute # represents the tuple number and it is used only for referencing tuples in the text (i.e., # does not belong to the schemaContact).

τNamer

# Name P hys CT Dur V T P hys CT Dur V T

1,2 M cM urphy Sayer self „15 1 Jan 2016 Sayer f amily „5 5 Jan 2016 2,3 M cM urphy Sayer f amily 5 5 Jan 2016 M aguire f amily 15 10 Jan 2016 3,4 M cM urphy M aguire f amily „15 10 Jan 2016 M aguire self 15 15 Jan 2016 4,5 M cM urphy M aguire self „15 15 Jan 2016 M aguire self „5 20 Jan 2016 5,6 M cM urphy M aguire self 5 20 Jan 2016 M aguire self „5 25 Jan 2016 7,8 Lowe Sayer f amily „15 7 Jan 2016 Sayer f amily 15 15 Jan 2016 8,9 Lowe Sayer f amily „15 15 Jan 2016 M aguire self 15 22 Jan 2016 9,10 Lowe M aguire self „15 22 Jan 2016 Sayer self „5 28 Jan 2016 10,11 Lowe Sayer self 5 28 Jan 2016 M aguire self 15 3 Feb 2016 11,12 Lowe M aguire self „15 3 Feb 2016 Sayer self „5 6 Feb 2016

Fig. 2.The evolution expressionτN amer .

(4)

Now we consider a scenario, borrowed from the clinical domain, in order to provide examples of how PEFDsandAPEFDs work. In particular, this scenario is taken from psychiatric case register. Let us consider the temporal schema Contact“ tN ame, P hys, CT, Duru Y tV Tu. Such a schema stores values about a phone-call service provided to psychiatric patients. This service is intended for monitoring and helping psychiatric patients, who are not hospitalized. Whenever a patient feels the need to talk to a physician, he can call the service. Data about calls are collected according to schemaContact. For the sake of simplicity, temporal attribute V T identifies the day when the call has being received, not the exact time the call has begun which is usually its semantics in real world applications. In addition, the service may be used by people somehow related to patients, as, for instance, relatives afraid of current conditions of a patient.

More precisely, attribute N ame identifies patients, P hys identifies physicians, CT (Contact Type) identifies the physical person who is doing the call (e.g., value ’self’ stand for the patient himself, ’family’ for a relative), andDurstores information about total duration of calls (value „ n means approximately n minutes). An instance r of R is provided in Figure 1. Instance τN amer , and τJr in general, may be seen as the output of a two-phase procedure. First, table Contactis partitioned into subsets of tuples, one for each value ofN ame. Then, each tuple is joined with its immediate successor in its partition, w.r.t. V T values. The whole relation τN amer is provided in Figure 2. In the following we will usetfor referencing tuples ofrandufor referencing tuples ofτJr. Moreover, in the following each tupleuinτJrwill be identified by the pair of indexes of the tuples inrthat generateu. For instance, the first tuple ofτJr in Figure 2 will be denoted byu1,2 since it is generated by the join of tuplest1 andt2 in r.

Going back to our example, it is worth noting that tuples t2 and t7 are not joined in τN amer , even if t7rV Ts “ t2rV Ts `2 and there is no tuple t with trV Ts “ t7rV Ts `1. This is due to the fact that t7rNames ‰ t2rNames forbids the join in τN amer . Moreover, t1 and t3 are not joined in τNamer . In- deed, the presence of tuple t2 with t1rNames “ t2rNames “ t3rNames and t1rV Ts ăt2rV Ts ăt3rV Tsforbids the join inτNamer . Figure 3 graphically depicts how pairs of tuplespt1, t2q,pt2, t3q,pt3, t4q,pt4, t5q,pt5, t6qandpt7, t8q,pt8, t9q,pt9, t10q,pt10, t11q,pt11, t12qare joined inτNamer . In both scenarios depicted in Figure 3, nodes represent tuples and are labeled by the corresponding tuple number.

Values for attribute Dur are reported above each node. Values of P hys and CT attributes are reported below every node, respectively. Every edgepti, tjqis labeled by value∆pui,jq “tjrV Ts ´tirV Ts(i.e., the temporal distance between two tuples). Basically, each tupleuPτN amer corresponds to an edge in Figure 3 while we have a node for each tuple inr.

In our example, we have thatr |ù r∆5NameContactqsP hys, P hys ÑCT. How- ever,r* r∆6NameContactqsP hys, P hysÑCT, because of pairspt2, t3qandpt10, t11q.

More precisely, we have thatt2rNames “t3rNames “M cM urphy,t10rNames “ t11rNames “ Lowe, t2rP hyss “ t10rP hyss “ Sayer, t3rP hyss “ t11rP hyss “ M aguire, butt3rCTs ‰t11rCTs(i.e., t3rCTs “f amily, andt11rCTs “self).

(5)

t1 t2 t3 t4 t5 t6

M cM urphy

t7 t8 t9 t10 t11 t12

4days 5days 5days 5days 5days

10

8days 7days 6days 6days 3days

Sayer self

„15

Sayer f amily

„5

M aguire f amily

„15

M aguire self

„15

M aguire self

„5

M aguire self

„5

Sayer f amily

„15

Sayer f amily

„15

M aguire self

„15

Sayer self

„5

M aguire self

„15

Sayer self

„5 Lowe

Fig. 3.A graph-based representation ofτN amer .

In other words, we have that the set of tuples tu2,3, u10,11udoes not satisfy the FDP hysP hysÑCT.

Let us observe that the two proposed PEFDsdiffer only for the maximum temporal distance allowed. In particular, tuple u10,11 is responsible for r |ùz r∆6NameContactqsP hys, P hysÑCT and it does not belong to∆5NameContactqbecause of∆pu10,11q ą5 then we haver|ù r∆5NameContactqsP hys, P hysÑCT. This allows us to point out a general property ofPEFDs, such a property holds forAPEFDs, too. Given aPEFDr∆kJRqsXY ÑZwe have that for every instancerofRsuch that r|ù r∆kJRqsXY ÑZ then for everyhďk we haver|ù r∆hJRqsXY Ñ Z.

In our example, if we considerAPEFDr∆6NameR qsP hys, P hysÑ CT with “ 121, we have that r|ù r∆6NameR qs P hys, P hysÑ CT. It is worth noting that, by considering relation r1 “rztt3u, this dependency would hold without the need of approximation (i.e., if tuplet3 is deleted from relationr). More pre- cisely, we haveτNamer1 “τNamer ztu2,3, u3,4uYtu2,4u. Notice that tupleu2,4was not originally inτNamer because of tuplet3. Figure 3 depicts this new scenario, by re- placing edgespt2, t3qandpt3, t4qwith the dashed edgept2, t4q. Moreover, we have that r1 |ù r∆`8NameR qsP hys, P hys Ñ CT. Thus, r |ù r∆`8NameR qsP hys, P hysÑ CT with“ 121.

2 An algorithm for checking APEFDs

As we proved in [4], the problem Check-APEFDgiven anAPEFDrτJR, swpkqsXY Ñ Z and an instance r of R is NP-Complete in |r|. This is not an uncommon situation in the realm of temporal functional dependencies (see [8, 1, 9, 7] for further examples).

(6)

Then, in principle, there is no asimptoptycally better algorithm than explore the whole set of of possible subsets r1 of r with |r|´|r|r| 1| ď . However, in the following, we provide an algorithm that make use of heuristics for pruning the search space in order to achieve the tractability for many cases.

Such algorithm is general and it may be applied under no assumptions on the input instancer, it makes use of two optimization techniques. The first op- timization technique consists of trying, whenever is possible, to split the current subset ofr in two subsets on which the problem may be solved independently (i.e., choices in one subset do not affect choices in the other one and viceversa).

The latter optimization technique consists of checking if the current partial so- lution may not lead to an optimal solution (i.e., a solution r1 where |r1| is the maximum possible number of tuples that may be kept) if this happen the subtree is pruned immediately (i.e., we are looking only for optimal solutions).

Our algorithm relies on the concept ofcolor that we will explain through an example in the following. GivenAPEFDrτJR, swpkqsXY Ñ Z and an instancer ofR, let us suppose that we are solving problem Check-APEFDon such instance with a simple guess-and-check procedure that make use of two, initially empty, subsetr`(the tuples to be kept in the solution) andr´(the tuples to be deleted in the solution) ofr. At each step the procedure guess a tupletinrzpr`Yr´qand decides non-deterministically (guessimg phase) either to updater` to r`Y ttu (i.e., t is kept in the current partial solution) to update r´ to r´Y ttu (i.e,t is deleted in the current partial solution). When r “ r` Yr´ (check phase), the procedure returns Y ES if r` |ù rτJR, swpkqsXY Ñ Z and |r´| ď ¨ |r|, otherwise it returns N O. From now on, for us a partial solution is a triple pr,r`,r´qsuch thatpr`Yr´q Ďrandr`Xr´ “ H, ifr“r`Yr´ we simply say thatpr`Yr´qis asolution. A solution isconsistent pr,r`,r´qif and only if r`|ù rτJR, swpkqsXY Ñ Z. Given two partial solutionpr,r`1,r´1qandpr,r`2,r´2q we say thatpr,r`2,r´2qextends pr,r`1,r´1qif and only if r`1 Ďr`2 andr´1 Ďr´2.

Is there a way to check if we are generating an inconsistent solution possibly without guessing all the tuples in r? Violations of the latter constraint (i.e.,

|r´| ď¨ |r|) are fairly simple to detect during the guessing phase, it suffices to check after each insertion inr´ifr´ exceeds¨ |r|, if it is the case the procedure may return N O immediately without guessing any further. Violations of the first constraint (i.e.,r` |ù rτJR, swpkqsXY ÑZ) during the guessing phase are trickier to detect, then we need the following additional definitions. From now on when two tuplest, t1 share the same value for the attributeJ (i.e.,trJs “t1rJs) we will say that they are in the sameJ-group and we will say thattrJsis the value of the J-group containing t and t1. For the sake of brevity, for a given j P DompJq we will use j-group for denoting the J-group with value j. An ordered pair, written o-pair, is a pairpt, t1q Prˆrsuch that tand t1 are in the sameJ-group andtrV Ts ăt1rV Ts. Given an o-pairt, t1 Prwe say that the pair pt, t1qis an edge if and only if 0ăt1rV Ts ´trV Ts ďk. Given triplepr,r`,r´q, an o-pair pt, t1q P r is active if and only if t, t1 P r` and for every tuple t in the same J-group oft iftrV Ts ă t ăt1rV Ts we have t Pr´ (i.e., t and t1 are selected in the current partial solution and all the tuples between t and t1 in

(7)

the same J-group are deleted). Given two valid times vt, vt1 P DompV Tq and a value j P DompJq we say that vt and vt1 are consecutive in j-group if and only if there exists an active o-pair pt, t1q with trV Ts “ vt, t1rV Ts “vt1, and trJs “j. Notice that we may have two distinct values j, j1 PDompJqand two distinct valid timesvt, vt1 PDompV Tqwhich are consecutive inj-group and not consecutive in j1-group. Moreover, we may have edgespt, t1qthat are not active and active pairs pt, t1qthat are not edges, such is the case of active pairs pt, t1q witht1rV Ts´trV Ts ąk. Acoloris a tuplecon the schemaC“XY Z, two colors px, y, zq andpx1, y1, z1qare conflicting if and only if x“x1, y “y1 andz ‰z1. Given an o-pairpt, t1q, denoted bycpt, t1qis the tuplecpt, t1q “ ptrXs, trYs, t1rZsq.

Two o-pairs pt, t1q,pt2, t3q areconflicting if and only if cpt, t1q and cpt2, t3q are conflicting. The following result it is easy to prove and its proof is left as exercise for the reader.

Theorem 1. Given anAPEFDrτJR, swpkqsXY Ñ Z, an instancerofR, and a partial solution pr,r`,r´q, if there exists two active edges pt, t1q and pt2, t3q then every solution pr,r`f,r´fq that follows pr,r`,r´q satisfy r`f z|ù rτJR, swpkqsXY ÑZ (i.e., is inconsistent).

The above theorem guarantees that from a partial solutionpr,r`,r´qthat feature at least two conflicting edges we cannot reach a solution pr,r`f,r´fq that satisfy the first constraint and in such a case we may return immedi- ately N O without going any further from pr,r`,r´q. The colours of a par- tial solution pr,r`,r´q is the set colourspr,r`,r´q “ tptrXs, t1rYs, t1rZsq : pt, t1qis an active edge in pr,r`,r´qu. It is easy to see that the hypothesis of Theorem 1 applies if and only ifcolourspr,r`,r´qcontains at least two conflict- ing colours. Then, by means of colours, our above guess-and-check procedure may be improved by adding the control on the size of r´ and by keeping up- dated the current set of colourscolourspr,r`,r´q. Once an insertion of a tuple in eitherr` orr´ introduce a colorc that is conflicting with at least one colour in pr,r`,r´qthe procedure answersN O immediately.

An example of how the procedure works is given in Figure 4 where we have an instance of 5 tuples with“0.2 (i.e., we may delete at most one tuple) and swpkq “ 6 (all the tuples are in the same window). The execution depicted in Figure 4 guesses the values of the tuples from the oldest (t1) to the newest one (t2) according to the value ofV T. First it tries to put the current tupletinr`if no violation arises it continues, if some violation arises it tries to insert the tuple t inr´ if no violation arises it continues otherwise it goes back to the previous choice (i.e., backtracking). Every internal node is labelled with the current tuple which will be guessed next, every leaf is labelled either with Y ES (i.e., the current branch is a solution) orN O(i.e, a violation has arisen), the current set of colours is reported within the node. Nodes are numbered according to their order of appearance. We have that the root isn1followed by the introduction of the nodesn1. . . n4in this precise order. If we introducet4in the partial solution associated ton4we violate the first constraint. Since inn4 addingt4inr´does not generate any violation, node n5 is created as child of n4. However, node

(8)

# J X Y Z V T 1 j1 x1y1 z1 1 2 j1 x2y1 z1 2 3 j1 x1y1 z2 3 4 j1 x2y1 z2 4 5 j1 x1y1 z2 5

x1|y1|z1 px1,y1,z1q x2|y1|z1 x1|y1|z2 x2|y1|z2 x1|y1|z2 px1,y1,z2q

px1,y1,z2q

px1,y1,z2q

px2,y1,z2q

px2,y1,z2q

px2,y1,z2q

px1,y1,z2q

px1,y1,z2q

px2,y1,z2q t1,tu

n1

t2,tu n2

t3,tpx1,y1,z1qu n3

t4,

"

px1,y1,z1q, px2,y1,z2q

*

n4 n6 t4,tpx1,y1,z1qu

t5,

"

px1,y1,z1q, px2,y1,z2q

* n5

t5,

"

px1,y1,z1q, px2,y1,z2q

* n7

N O conflict between px1,y1,z1qandpx1,y1,z2q

N O conflict between px1,y1,z1qandpx1,y1,z2q

N O

|r´ |ą1

Y ES r“r` Yr´,|r´ | ď1, colourspr,r`,r´ q “"

px1,y1,z1q, px2,y1,z2q

* t1Pr`

t2Pr`

t3Pr`

t3Pr´

t4Pr`

t4Pr´

t5Pr` t5Pr´

t4Pr`

t5Pr`

Fig. 4.An example of how the use of colors improve a guess and check procedure for solving the problem Check-APEFD.

n5 cannot be extended without introducing a violation in the above constraints because if putt5in r`we introduce a conflicting color, while if we put t5 inr´ we exceed the maximum number of allowed deletions. We backtrack to n4 and we have that even for it all the possible choices have been explored and thus we backtrack ton3where the choice of addingt3 tor´is attempted generating the noden6. Fromn6we putt5inr` without violating any constraint and thus we have thattt1, t2, t4, t5u |ù rτJR,6sXY 0.2ÑZ.

Let us consider deeper in the description of the first algorithm. On an higher lever the algorithm operates like the procedure above apart from some trivial technicalities. However, two more heuristics are introduced in order to possibly stop early during the exploration of a branch in the tree of computation. The main procedure of the algorithm is reported in Figure 7 with auxiliary procedures reported in Figure 5 and in Figure 6. The algorithm is implemented by the functionT upleW iseM inthat takes 4 arguments. The first argument isGrwhich is a pre-processing ofrbased on theAPEFDrτJR, swpkqsXY Ñ Z that has to be checked. More precisely, Gr is an instance of the schema J, X, Y, Z, V T, count, with Dompcountq “ N. We have that t P Gr if and only if there exists t1 P r for whichpt1rJs, t1rXs, t1rYs, t1rZsq “ ptrJs, trXs, trYs, trZsqandtrcounts “ |tt1P r : pt1rJs, t1rXs, t1rYs, t1rZsq “ ptrJs, trXs, trYs, trZsqu|, that is, we count how

(9)

procedureInBetweenpGr, t1, t2q

comment: returns the subset ofGrconsisting of all the tuples in the sameJ-group oft1 andt2 that feature valid times betweent1rV Tsandt2rV Ts.

return ttPGr:t1rV Ts ătrV Ts ăt2rV Ts ^trJs “t1rJsu procedureEdgeConflict?ppt1, t2q,pt11, t12qq

comment: returns true if and only if the two input edges feature conflicting colors.

if t1rXs “t11rXs ^t2rYs “t12rYs ^t2rZs ‰t12rZs then return true

else return false

procedureColorConflict?ppt1, t2q,Cq

comment: returns true if and only if the color of the edgept1, t2qis conflicting with at least one color in the set of colorsC.

if DcpcPC^t1rXs “crXs ^t2rYs “crYs ^t2rZs ‰crZsq then return true

else return false

procedureE!pGr, k, G`r, G´rq

comment: returns the set of all and only active edges in the current partial solution.

return

"

pt1, t2q: t1rJs “t2rJs ^0ăt2rV Ts ´t1rV Ts ďk^

tt1, t2u ĎG`r ^InBetweenpGr, t1, t2q ĎG´rq

*

procedureE?pGr, k, G`r, G´r,Cq

comment: returns the set of all and only pending edges in the current partial solution.

return

$

’’

&

’’

% pt1, t2q:

t1rJs “t2rJs ^0ăt2rV Ts ´t1rV Ts ďk^ tt1, t2u XG´r “ H^

^ ColorConf lict?ppt1, t2q,Cq^

^InBetweenpGr, t1, t2q XG`r “ H^

^ptt1, t2u YInBetweenpGr, t1, t2qq Ę pG`r YG´rq

, // . // -

procedureReach?pt1, t2, N odes, Edgesq comment:

returns true if and only if there exists a path fromt1 tot2in the graph pN odes, Edgesq. It is a function that checks wether or not there exits a path between two nodes in a graph.

if Dptt1. . . tmu ĎN odesqpt1“t1^tm“t2^ @p1ďiămqppti, ti`1q PEdgesqq then return true

else return false

Fig. 5.Auxiliary procedures used by procedures presented in Figures 6 and 7.

(10)

procedureGroupIndependent?pj, Gr, k, G`r, G´r,Cq comment:

returns true if and only if in the current partial solution pending edges involving tuples belonging to theJ-group with valuejare not

conflicting with the pending edges introduced by the othersJ-groups.

EjÐ tpt1, t2q PE?pGr, k, G`r, G´r,Cq:t1rJs “ju EjÐE?pGr, k, G`r, G´r,CqzEj

if Dt1Dt2Dt1Dt2ppt1, t2q PEj^ pt,t2q PEj^EdgeConf lict?ppt1, t2q,pt1, t2qqq then return false

else return true

procedureMacConsistentSubsetpEq comment:

returns the maximal subsetE1ofEsuch that d for every edge pt11, t12q PE1and for every edgept1, t2q PE we have thatcpt11, t12qand cpt1, t2qare not conflicting.

E1Ð H

for eachpt1, t2q PE do

"

if @t1@t2ppt1, t2q PEÑ EdgeConf lict?ppt1, t2q,pt1, t2qqq thenE1ÐE1Y tpt1, t2qu

return E1

procedureReplacePath?pt, t, Gr, k, G`r, G´r,Cq

comment: returns true if and only if in the current partial solution the consecutive valid timestrV TsandtrV TsintrJs-group may be safely replaced.

if Dt1pt1PGr^t1rJs “trJs ^trV Ts ăt1rV Ts ătrV Tsq then return false

NsÐ tt1PG`r :t1rJs “trJs ^t1rV Ts “trV Tsu NeÐ tt1PG`r :t1rJs “trJs ^t1rV Ts “trV Tsu

NmÐ tt1PG´r :t1rJs “trJs ^trV Ts ăt1rV Ts ătrV Tsu NÐNsYNeYNm

E˜Ð tpt1, t2q:t1PN^t2PN^t1rV Ts ăt2rV Ts ^ pt1RNs_t2RNequ P airsÐ tpt1, t2q:t1PN^t2PN^t1rV Ts `kăt2rV Ts ^ pt1RNs_t2 RNequ E˜okÐ pM axConsistentSubsetpE?pGr, k, G`r, G´r,Cqq XEq Y˜ P airs

N otSaf eÐ H

for eachpt1, t2q PEz˜ E˜ok

do

$

&

%

for eachpt1, t2q P pE?pGr, k, G`r, G´r,Cq YE!pGr, k, G`r, G´rq YEq˜ do

"

ifEdgeConf lict?ppt1, t2q,pt1, t2qq then N otSaf eÐN otSaf eY tpt1, t2qu E˜ÐEzN otSaf e˜

if Dt1Dt2@t1@t2

ˆ t1, t2PNm^t1PNs^t2PNe^

^pt1, t1q,pt2, t2q PE^Reach?pt˜ 1, t2, N,Eq˜

˙

then return true return false

Fig. 6.Auxiliary procedures used by procedureT upleW iseM in(Figure 7).

(11)

Algorithm 2.1: TupleWiseMin(Gr, k, G`r “ H, G´r “ H,C“ H) procedureMaximalPaths?pGr, k, G`r, G´r,Cq

comment:

returns true if and only if in the current partial solution for every J-groupj-group every pair of consecutive valid timesvtandvt1in j-group cannot be safely replaced.

ifDt1Dt2ppt1, t2q PE!pGr, k, G`r, G´rq ^ReplaceP ath?pt1, t2, k, Gr, G`rG´r,Cqq then return false

else return true

procedureIsConsistent?pGr, k, G`r, G´r,Cq comment:

returns true if and only if the current partial solution features consistent colors inC and for everyJ-groupj-group every pair of consecutive valid timesvtandvt1 inj-group cannot be safely replaced.

ifDt1Dt2ppt1, t2q PE!pGr, k, G`r, G´rq ^ColorConf lict?ppt1, t2q,Cqq then return false

ifM aximalP aths?pGr, k, G`r, G´r,Cq then return true

else return false main

comment: returns the minimum valuem“min||GrzG1r||

forG1r intG2rĎGr:G1r|ù rτGXY Zcountr sXY ÑZu.

ifGr“ Hthen return||G´r||

ifDjpjPDompJq ^GroupIndependent?pj, Gr, k, G`r, G´r,Cqq then

$

&

%

GrÐ ttPGr:trJs “ju return

ˆT upleW iseCheckpGr, k, G`r XGr, G´r XGr,Cq`

T upleW iseCheckpGrzGr, k, G`rzGr, G´rzGr,Cq

˙

lettPGr

ifIsConsistent?pGrzttu, k, G`r Y ttu, G´r,Cq then

"

C1ÐCY tpt1rXs, t2rYs, t2rZsq:pt1, t2q PE!pGrzttu, k, G`r Y ttu, G´rqu mtÐT upleW iseCheckpGrzttu, k, G`r Y ttu, G´r,C1q

else mtÐ `8

ifIsConsistent?pGrzttu, k, G`r, G´r Y ttu,Cq then

"

C1ÐCY tpt1rXs, t2rYs, t2rZsq:pt1, t2q PE!pGrzttu, k, G`r, G´r Y ttuqu mztÐT upleW iseCheckpGrzttu, k, G`r, G´r Y ttu,C1q

else mztÐ `8 return minpmt, mztq

Fig. 7.The main procedure for checkingAPEFDs in a tuple-wise fashion. Notice that we use a compact notation for the recursive procedure which is initially called as T upleW iseM inpGr, kqwhere whenG`r,G´r, andCare ommitted in the procedure call they get their respective default values specified in the procedure declaration (i.e.,H for each of them in this case).

(12)

many tuples inrshare the same values for attributesJ, X, Y,andZ. The input parameterk is the length for the sliding window grouping swpkq. The setsG`r and G´r, originally initialized to H, represent the tuples ofGr that are either kept or deleted in the current solution respectively. On instances r of schema R satisfyingcountPR we denote with ||r|| the sum on thecountattribute for the tuples in r (i.e.,||r|| “ř

tPrtrcounts). Finally, C is a set of colors which is initially set to H. A color cis a tuple on the schemaX, Y, Z. As we will see,C keeps track via colours of the constraints introduced so far in the construction of the solution.

The procedureT upleW iseM inreturns the minimum number of tuples that has to be deleted from r in order to obtain an instance r1 such that r1 |ù rτJR, swpkqsXY Ñ Z. Then if such minimum is less or equal than ¨ |r| we can conclude r|ù rτJR, swpkqsXY Ñ Z else we haver |ù rτz JR, swpkqsXY Ñ Z. Given Gr,G`r, G´r, and a set of colours Cwe say that an edge pt, t1q PGrˆGr

ispending if and only if the following conditions hold:

1. t, t1RG´r andptrXs, t1rYs, zq RCfor every zPDompZq;

2. for everyt2witht2rJs “trJsandtrV Ts ăt2rV Ts ăt1rV Tswe havet2RG`r; 3. there exists t2 P pGrY tt, t1uqzpG´r YG`rqwith t2rJs “trJs and trV Ts ď

t2rV Ts ďt1rV Ts.

Informally speaking, a pending edge is an edge that is not active in the current partial solution but it may become active during the computation and, if it happens, it introduces a new color inC. In our algorithm, pending edges for the current partial solution are retrieved by the procedureE? while active edges are retrieved by the procedureE!.

The procedureT upleW iseM in(Figure 7) works as follows. IfG`rYG´r “Gr it means that we have obtained a solution without violating any constraint and thus we can return||G´r||(i.e., the number of deleted tuples). IfG`r YG´r ‰Gr the algorithm guesses a tupletPGrzpG`r YG´rqand proceed as follows. First, it checks if insertingtintoG`r does not cause any violation in the constraints, if so it stores inmtthe value of the recursive call toT upleW iseM inwheretbelongs to G`r andC has been updated accordingly. Notice that by inserting a tuplet in G`r the algorithm is asserting that t belongs to the current partial solution, while by inserting t in G´r the algorithm is asserting thatt does not belong to the current partial solution. If a constraint is violated the algorithm stores inmt

the value`8, which means thatt may not be kept in the current solution.

Then, it checks if inserting t into G´r does not cause any violation in the constraints, if so it stores inmztthe value of the recursive call toT upleW iseM in whereG´r andCare updated accordingly. If a constraint is violated the algorithm stores inmztthe value`8, which means thattmust be kept in the current partial solution. In procedureT upleW iseM in the only way in which a constraint may be violated is that, after the insertion a tupletinG`r (resp.G´r), an edgept1, t2q turns out to be active and its colorpt1rXs, t2rYs, t2rZsqturns out to be conflicting with at least one color inC.

The first one allows us to restrict the search space by splitting the problem into independent sub-problems in a divide-and-conquer fashion. Let us suppose

(13)

that at a certain step of our computation there exists a valuejP ttrJs:tPGru for which for each pair of conflicting pending edgespt, t1qandpt, t1qwe have that either allt, t1, t, andt1 belong toj-group or allt, t1, t, andt1 do not belong toj- group (such condition is verified by sub-procedureGroupIndependent? reported in Figure 6.). Let Gr “ tt:trJs “ju, if every edge involving tuples inj-group is not conflicting with every edge that may be introduced outsidej-group then we can split the problem into the two sub-problemspGr, k, G`r XGr, G´r XGrq and pGrzGr, k, G`rzGr, G´rzGrq. Such problems are independent and may be resolved in a separate way. The resulting value for the solution is the sum of the values returned by T upleW iseM in applied to both the two sub-problems. Let H “ |G´rzpG`r YG´rq|andh“ |ttP pG´rzpG`r YG´rqq:trJs “ju|, notice that, in this case, the upper bound of the complexity at the current step of computation drops fromOp2HqtoOp2H´h`2hq.

The second optimization allows us to prune a sub-tree of computation even before a contradiction arises. The second optimization implements a criteria to detect, in many cases, if every possible solution that may be built starting from the current partial one turns to be not minimal. Suppose that there exists an active o-pairpt, t1qin a partial solutionpGr, G`r, G´rqsuch that there existstPGr

in the same J-group of twith trV Ts ătrV Ts ăt1rV Ts. By definition of active o-pair, we have that t belongs to G´r as well as every tuple t1 in the same J- group oft withtrV Ts ăt1rV Ts ăt1rV Ts, here the additional condition is that there exists at least one of such tuples. Given a partial solution pGr, G`r, G´rq we denote with colorspGr, G`r, G´rq the set colorspGr, G`r, G´rq “ tpx, y, zq : there exists an active edge pt, t1q withcpt, t1q “ px, y, zqu. Let us define the set of colors pendingpGr, G`r, G´rq “ tpx, y, zq : there exists a pending edge pt, t1q withcpt, t1q “ px, y, zqu which collects all and only the colors that may be introduced later on in the current computation.

A colorpx, y, zqissafe inpvt, vt1, jqif and only if one of the following three conditions hold:

1. px, y, zq PcolorspGr, G`r, G´rq;

2. every colorpx, y, z1qinpendingpGr, G`r, G´rqsatisfiesz1“z(i.e.,px, y, zqis a pending color and there is no pending color that is conflicting withpx, y, zq);

3. the color is not conflicting with any color incolorspGr, G`r, G´rq Ypendingp Gr, G`r, G´rq and do not exist two tuples t, t1 P pG`r YG´rq X tt2 P Gr : t2rJs “j^trV Ts ďt2 ďt1rV Tsusuch thatpt, t1qis an edge and the color ptrXs, t1rYs, t1rZsqis conflicting with px, y, zq

Notice that the above three conditions imply that if a color is safe in pvt, vt1, jq then it is neither in conflict with a colorcolorspGr, G`r, G´rqnor with a color in pendingpGr, G`r, G´rq, however this is just a necessary but not sufficient condi- tion. Given a partial solutionpGr, G`r, G´rqand the triplepvt, vt1, jqapvt, vt1, jq- replace DAG is a DAGpV, EqwhereV “ ttPG`r :trV Ts “vt^trJs “ju Y ttP

(14)

G´r :vtătrV Ts ăvt1^trJs “ju Y ttPG`r :trV Ts “vt1^trJs “juand

E“

"

pt, t1q PV ˆV : ptrV Ts ‰vt_t1rV Ts ‰vt1q ^trV Ts ăt1rV Ts^

cpt, t1qis safe inpvt, vt1, jq

*

Y

tpt, t1q PV ˆV :ptrV Ts ‰vt_t1rV Ts ‰vt1q ^t1rV Ts ´trV Ts ąku A node t P V is a starting node (resp. ending node) if and only if vt ă trV Ts ăvt1 and for everyt1 PV witht1rV Ts “vt (resp.t1rV Ts “vt1) we have pt1, tq PE (resp.pt, t1q PE). Areplace path is pvt, vt1, jq-replace DAG pV, Eq is any patht1. . . tm inpV, Eqfor whichtt1 is a starting node andtmis an ending node. We say thatvtandvt1injcan besafely replacedif and only if there exists a replace path in thepvt, vt1, jq-replace DAGpV, Eq.

Using the above definitions of replace DAGs/paths we can provide the fol- lowing result.

Theorem 2. Given a partial solutionpGr, G`r, G´rqif there exists a groupjwith two consecutive valid timesvtandvt1 such thatvtandvt1 can be safely replaced inj then every consistent solution that followspGr, G`r, G´rqis not optimal.

The proof of the theorem is very simple and left as exercise to the reader.

Let us suppose that t1. . . tm is a replace path in the pvt, vt1, jq-replace DAG, it must exists by hypothesis and by definition we havet1, . . . , tm PG´r, it suf- fices to take any consistent solution pGr, G`r, G´rq that follows pGr, G`r, G´rq and that pGr, G`r Y tt1, . . . , tmu, G´rztt1, . . . , tmuq is still a consistent solution, non-optimality immediately follows. We take advantage of Theorem 2 by prun- ing every computation rooted in a partial solution pGr, G`r, G´rq that features a J-group j-group and two consecutive valid timesvt and vt1 in j-group such that vt and vt1 can be safely replaced in j. Notice that verify wether or not such condition applies may be performed in polynomial time. In the procedure T upleW iseM in, this optimization is realized by the joint work of sub-procedures M aximalP aths? andReplaceP ath? reported in Figure 7 and 6 respectively.

3 Conclusion

In this work we propose an algorithm for solving the problem of verifying whether or not a givenAPEFDEF holds over a given instance rof a temporal schema.

Such problem is known to be NP-Complete in the size ofr(i.e., data-complexity).

We expect our algrithm to perform considerably better w.r.t. standard branch and bound procedures since it considerably prune the search space in most cases.

We are building a prototype in order to measure the effective improvement w.r.t.

frameworks that solve generic NP-Complete problems (i.e., Integer Linear Pro- grammming tools, SAT solvers, and logic programming), obviously this implies that we have to encode our problem in such formalisms.

(15)

References

1. Combi, C., Mantovani, M., Sabaini, A., Sala, P., Amaddeo, F., Moretti, U., Pozzi, G.: Mining approximate temporal functional dependencies with pure tem- poral grouping in clinical databases. Comp. in Bio. and Med.62, 306–324 (2015).

https://doi.org/10.1016/j.compbiomed.2014.08.004, https://doi.org/10.1016/j.

compbiomed.2014.08.004

2. Combi, C., Mantovani, M., Sala, P.: Discovering quantitative temporal functional dependencies on clinical data. In: 2017 IEEE International Conference on Health- care Informatics, ICHI 2017, Park City, UT, USA, August 23-26, 2017. pp. 248–257.

IEEE (2017). https://doi.org/10.1109/ICHI.2017.80, https://doi.org/10.1109/

ICHI.2017.80

3. Combi, C., Montanari, A., Sala, P.: A uniform framework for temporal functional dependencies with multiple granularities. In: Pfoser, D., Tao, Y., Mouratidis, K., Nascimento, M.A., Mokbel, M.F., Shekhar, S., Huang, Y. (eds.) Advances in Spatial and Temporal Databases - 12th International Symposium, SSTD 2011, Minneapolis, MN, USA, August 24-26, 2011, Proceedings. Lecture Notes in Computer Science, vol. 6849, pp. 404–421. Springer (2011). https://doi.org/10.1007/978-3-642-22922- 0 24,https://doi.org/10.1007/978-3-642-22922-0$\_$24

4. Combi, C., Rizzi, R., Sala, P.: The price of evolution in temporal databases.

In: Grandi, F., Lange, M., Lomuscio, A. (eds.) 22nd International Sympo- sium on Temporal Representation and Reasoning, TIME 2015, Kassel, Ger- many, September 23-25, 2015. pp. 47–58. IEEE Computer Society (2015).

https://doi.org/10.1109/TIME.2015.24,https://doi.org/10.1109/TIME.2015.24 5. Combi, C., Sala, P.: Temporal functional dependencies based on interval

relations. In: Combi, C., Leucker, M., Wolter, F. (eds.) Eighteenth In- ternational Symposium on Temporal Representation and Reasoning, TIME 2011, L¨ubeck , Germany, September 12-14, 2011. pp. 23–30. IEEE (2011).

https://doi.org/10.1109/TIME.2011.15,https://doi.org/10.1109/TIME.2011.15 6. Combi, C., Sala, P.: Interval-based temporal functional dependencies: spec-

ification and verification. Ann. Math. Artif. Intell. 71(1-3), 85–130 (2014).

https://doi.org/10.1007/s10472-013-9387-1, https://doi.org/10.1007/

s10472-013-9387-1

7. Combi, C., Sala, P.: Mining approximate interval-based temporal dependen- cies. Acta Inf.53(6-8), 547–585 (2016). https://doi.org/10.1007/s00236-015-0246-x, https://doi.org/10.1007/s00236-015-0246-x

8. Sala, P.: Approximate interval-based temporal dependencies: The complex- ity landscape. In: Cesta, A., Combi, C., Laroussinie, F. (eds.) 21st Interna- tional Symposium on Temporal Representation and Reasoning, TIME 2014, Verona, Italy, September 8-10, 2014. pp. 69–78. IEEE Computer Society (2014).

https://doi.org/10.1109/TIME.2014.20,https://doi.org/10.1109/TIME.2014.20 9. Sala, P., Combi, C., Cuccato, M., Galvani, A., Sabaini, A.: A framework for mining

evolution rules and its application to the clinical domain. In: Balakrishnan, P., Sri- vatsava, J., Fu, W., Harabagiu, S.M., Wang, F. (eds.) 2015 International Conference on Healthcare Informatics, ICHI 2015, Dallas, TX, USA, October 21-23, 2015. pp.

293–302. IEEE Computer Society (2015). https://doi.org/10.1109/ICHI.2015.42, https://doi.org/10.1109/ICHI.2015.42

Références

Documents relatifs

Information about patients and therapies is stored in a re- lation according to the schema PatTherapies ( TherType, PatId , DrugCode, Qty, Phys, B, E ) , where TherType identifies

Indeed, under the restriction to BCNF, the original form of ids introduced by Calvanese, De Giacomo, and Lenzerini [9] is expressive enough to transfer FDs from the relational schema

Adaptive execution of join will combine all the benefits of the existing standard algorithms, whether sorted inputs, the presence of permanent indexes for one or

We have presented a method to compute the characterization of a set of func- tional dependencies that hold in a set of many-valued data, based on formal concept analysis plus

We evaluated the proposed approach on several real world datasets, previ- ously used for testing fd discovery algorithms [13]. Statistics on the characteris- tics of the

On the one hand, we adapt the definition of a canonical direct basis of im- plications with proper premises [4, 22] to the formalism of pattern structures, in order to prove that

Given a DLR ` knowledge base pT , <q, we define the projection signature as the set T including the signatures τpRNq of the relations RN P R, the singletons associated with

(The complexity of inference for limited AMDs is also linear; the proof will appear in the extended version of this paper.) Algorithm 1 presents an inference procedure for AMFDs..