An Algorithm for Verifying Approximate Pure Evolving Functional Dependencies

(1)

An algorithm for verifying Approximate Pure Evolving Functional Dependencies

^‹

Pietro Sala

Department of Computer Science, University of Verona, Italy pietro.sala@univr.it

Abstract. Fuctional dependencies (FDs) form the foundations of database theory since the beginning of time, given they may represent constraints such as “the employee role determines his monthly salary”. If we consider temporal databases where each tuple is timestamped with a valid time (V T) attribute, finer constraints may be imposed on the evolution of the data such as “the employee role before a promotion and his role afterwards determine his new monthly salary”. Such constraints are calledPure Evolving Functional Dependencies (PEFDs) according to the classification introduced in [3]. By adding approximation to such rules they may be used for extracting knowledge in place of imposing constraints. Then we may want todiscover that “the employee role before a promotion and his role afterwardsgenerally determine his new monthly salary”. This kind ofapproximate dependencies are calledApproximate Pure Evolving Functional Dependencies(APEFDs). Verifying whether or not a givenAPEFD holds over a database instance is an NP-Complete problem [4]. Despite the discouraging complexity, in this work we propose an algorithms that solve in an efficient way this problem by trying to divide-and-conquer and prune the search space as much as possible.

Keywords: Temporal Databases·Approximate Functional Dependen- cies·Data Mining.

1 Temporally Mixed Functional Dependencies

According to [3], in the following we study a family of temporal functional dependencies [2, 5, 6] i.e., thePure Evolving Functional Dependencies, calledPEFD.

From now on our temporal functional dependencies will be given over a temporal schema R “ UY tV Tu, where U is a set of atemporal attributes and V T is a special attribute denoting the valid time of each tuple. Hereinafter we assume tuples timestamped with natural numbers (i.e, DompV Tq “N). Let J ĎU be a non-empty subset ofU. We define the set W as W “UzJ and setW, which is basically a renaming of attributes inW. Formally, for each attribute APW, we haveAPW (i.e.,W “ tA:APWu). Given an instancerofR, we define an instanceτ_J^rof schemaRev“J W WtV T, V Tuas follows:

‹ This work has not been previously published and it is an extension of the results published in [4].

(2)

τ_J^r“

$

’’

&

’’

% u

ˇ ˇ ˇ ˇ ˇ ˇ ˇ ˇ

Dt,t¹

¨

˚

˝

rptq^rpt¹q^trJs“t¹rJs“urJs^urWs“trWs^

urWs“t¹rWs^trV Ts“urV Ts^t¹rV Ts“urV Ts

^trV Tsăt¹rV Ts^

@t²pprpt²q^trV Tsăt²rV TsqÑt¹rV Tsďt²rV Tsq

˛

‹

‚

, // . // -

Schema Rev is called the evolution schema ofR. We will denote by τ_J^R the view on R that is built by expression τ_J^r for every instance r of R. View τ_J^R joins two tuples t1 and t2 that agree on the values of the attributes in J (i.e.

t1rJs “ t2rJs) and t1rV Ts ăt2rV Ts. Moreover, such tuples are joined if does not exist a tuple t P r with trJs “t1rJs and t1rV Ts ătrV Ts ă t2rV Ts (i.e., there exists a tuple that holds at some point in between the valid times of such tuples). For application purposes, it is correct correct to consider in a evolution schema only those pair of consecutive tuples whose the difference between V T and V T respects some given bound. Given a parameterk PNY t`8u, tuples ofτ_J^rare filtered by means of the selection∆_kpτ_J^rq “σtrV Ts´trV Tsďkpτ_J^rq(notice that∆_`8pτ_J^rq “τ_J^r).∆kpτ_J^rqforces to consider only those tuples belonging toτ_J^r having a temporal distance within the given thresholdk. In the following, given a tupletPτ_J^r, we denote its temporal distancetrV Ts ´trV Tswith∆ptq.

A Pure Temporally Evolving Functional Dependency [3] over the temporal schemaR“UY tV Tu,PEFDfor short, is an expression of the form

r∆kpτ_J^RqsXY ÑZ.

We have thatX ĎW andY , ZĎW withX ‰ Hand|Z| “1 (Zcontains a single attribute). We say that an instancerofRfulfills aPEFDr∆_kpτ_J^RqsXY Ñ Z, written r|ù r∆_kpτ_J^RqsXY ÑZ, if and only if for each pair of tuples t, t¹ P

∆_kpτ_J^rqwe have trXs “t¹rXs ^trYs “t¹rYs ÑtrZs “t¹rZs.

Now add approximation to PEFD in a very standard way [7, 2]. First, we provide measurementGto deal withPEFDas follows:

By means ofGwe can define the relativescaled measurement g for TM-FD as follows:

gpr∆kpτ_J^RqsXY ÑZ,rq “Gpr∆_kpτ_J^RqsXY ÑZ,rq

|r| .

Now we are ready to define theApproximate Pure Temporally Evolving Func- tional Dependency (APEFDfor short). Formally, given a real number 0ďď1, we say that an instancerofRsatisfies theAPEFDr∆kpτ_J^RqsX Y Ñ Z, written r|ù r∆kpτ_J^RqsXY Ñ Z, if and only ifgpr∆kpτ_J^RqsXY ÑZ,rq ď.

In Section 2 we provide an algorithm for checking an APEFD against an instancer, we call this problem Check-APEFD:

Problem 1. (Check-APEFD). Given a temporal schema R, a PEFD r∆kpτ_J^Rqs XY Ñ Z on R, an instance r of R, and a real number 0 ď ď1 determine whether or notr|ù r∆kpτ_J^RqsXY Ñ Z.

(3)

r“

# Name P hys CT Durpminutesq V T

1 M cM urphy Sayer self „15 1 Jan 2016 2 M cM urphy Sayer f amily „5 5 Jan 2016 3 M cM urphy M aguire f amily „15 10 Jan 2016 4 M cM urphy M aguire self „15 15 Jan 2016 5 M cM urphy M aguire self „5 20 Jan 2016 6 M cM urphy M aguire self „5 25 Jan 2016

7 Lowe Sayer f amily „15 7 Jan 2016

8 Lowe Sayer f amily „15 15 Jan 2016

9 Lowe M aguire self „15 22 Jan 2016

10 Lowe Sayer self „5 28 Jan 2016

11 Lowe M aguire self „15 3 Feb 2016

12 Lowe Sayer self „5 6 Feb 2016

Fig. 1. An instance r of schemaContact that stores the phone contacts about two psychiatric cases. Attribute # represents the tuple number and it is used only for referencing tuples in the text (i.e., # does not belong to the schemaContact).

τ_Name^r

# Name P hys CT Dur V T P hys CT Dur V T

1,2 M cM urphy Sayer self „15 1 Jan 2016 Sayer f amily „5 5 Jan 2016 2,3 M cM urphy Sayer f amily „5 5 Jan 2016 M aguire f amily „15 10 Jan 2016 3,4 M cM urphy M aguire f amily „15 10 Jan 2016 M aguire self „15 15 Jan 2016 4,5 M cM urphy M aguire self „15 15 Jan 2016 M aguire self „5 20 Jan 2016 5,6 M cM urphy M aguire self „5 20 Jan 2016 M aguire self „5 25 Jan 2016 7,8 Lowe Sayer f amily „15 7 Jan 2016 Sayer f amily „15 15 Jan 2016 8,9 Lowe Sayer f amily „15 15 Jan 2016 M aguire self „15 22 Jan 2016 9,10 Lowe M aguire self „15 22 Jan 2016 Sayer self „5 28 Jan 2016 10,11 Lowe Sayer self „5 28 Jan 2016 M aguire self „15 3 Feb 2016 11,12 Lowe M aguire self „15 3 Feb 2016 Sayer self „5 6 Feb 2016

Fig. 2.The evolution expressionτ_{N ame}^r .

(4)

Now we consider a scenario, borrowed from the clinical domain, in order to provide examples of how PEFDsandAPEFDs work. In particular, this scenario is taken from psychiatric case register. Let us consider the temporal schema Contact“ tN ame, P hys, CT, Duru Y tV Tu. Such a schema stores values about a phone-call service provided to psychiatric patients. This service is intended for monitoring and helping psychiatric patients, who are not hospitalized. Whenever a patient feels the need to talk to a physician, he can call the service. Data about calls are collected according to schemaContact. For the sake of simplicity, temporal attribute V T identifies the day when the call has being received, not the exact time the call has begun which is usually its semantics in real world applications. In addition, the service may be used by people somehow related to patients, as, for instance, relatives afraid of current conditions of a patient.

More precisely, attribute N ame identifies patients, P hys identifies physicians, CT (Contact Type) identifies the physical person who is doing the call (e.g., value ’self’ stand for the patient himself, ’family’ for a relative), andDurstores information about total duration of calls (value „ n means approximately n minutes). An instance r of R is provided in Figure 1. Instance τ_{N ame}^r , and τ_J^r in general, may be seen as the output of a two-phase procedure. First, table Contactis partitioned into subsets of tuples, one for each value ofN ame. Then, each tuple is joined with its immediate successor in its partition, w.r.t. V T values. The whole relation τ_{N ame}^r is provided in Figure 2. In the following we will usetfor referencing tuples ofrandufor referencing tuples ofτ_J^r. Moreover, in the following each tupleuinτ_J^rwill be identified by the pair of indexes of the tuples inrthat generateu. For instance, the first tuple ofτ_J^r in Figure 2 will be denoted byu1,2 since it is generated by the join of tuplest1 andt2 in r.

Going back to our example, it is worth noting that tuples t2 and t7 are not joined in τ_{N ame}^r , even if t7rV Ts “ t2rV Ts `2 and there is no tuple t with trV Ts “ t7rV Ts `1. This is due to the fact that t7rNames ‰ t2rNames forbids the join in τ_{N ame}^r . Moreover, t1 and t3 are not joined in τ_Name^r . In- deed, the presence of tuple t2 with t1rNames “ t2rNames “ t3rNames and t1rV Ts ăt2rV Ts ăt3rV Tsforbids the join inτ_Name^r . Figure 3 graphically depicts how pairs of tuplespt1, t2q,pt2, t3q,pt3, t4q,pt4, t5q,pt5, t6qandpt7, t8q,pt8, t9q,pt9, t₁₀q,pt₁₀, t₁₁q,pt₁₁, t₁₂qare joined inτ_Name^r . In both scenarios depicted in Figure 3, nodes represent tuples and are labeled by the corresponding tuple number.

Values for attribute Dur are reported above each node. Values of P hys and CT attributes are reported below every node, respectively. Every edgept_i, t_jqis labeled by value∆pu_i,jq “t_jrV Ts ´t_irV Ts(i.e., the temporal distance between two tuples). Basically, each tupleuPτ_{N ame}^r corresponds to an edge in Figure 3 while we have a node for each tuple inr.

In our example, we have thatr |ù r∆₅pτ_Name^ContactqsP hys, P hys ÑCT. How- ever,r* r∆₆pτ_Name^ContactqsP hys, P hysÑCT, because of pairspt₂, t₃qandpt₁₀, t₁₁q.

More precisely, we have thatt2rNames “t3rNames “M cM urphy,t10rNames “ t11rNames “ Lowe, t2rP hyss “ t10rP hyss “ Sayer, t3rP hyss “ t11rP hyss “ M aguire, butt3rCTs ‰t11rCTs(i.e., t3rCTs “f amily, andt11rCTs “self).

(5)

t1 t2 t3 t4 t5 t6

M cM urphy

t7 t8 t9 t10 t11 t12

4days 5days 5days 5days 5days

10

8days 7days 6days 6days 3days

Sayer self

„15

Sayer f amily

„5

M aguire f amily

„15

M aguire self

„15

M aguire self

„5

M aguire self

„5

Sayer f amily

„15

Sayer f amily

„15

M aguire self

„15

Sayer self

„5

M aguire self

„15

Sayer self

„5 Lowe

Fig. 3.A graph-based representation ofτ_{N ame}^r .

In other words, we have that the set of tuples tu2,3, u10,11udoes not satisfy the FDP hysP hysÑCT.

Let us observe that the two proposed PEFDsdiffer only for the maximum temporal distance allowed. In particular, tuple u10,11 is responsible for r |ùz r∆6pτ_Name^ContactqsP hys, P hysÑCT and it does not belong to∆5pτ_Name^Contactqbecause of∆pu10,11q ą5 then we haver|ù r∆5pτ_Name^ContactqsP hys, P hysÑCT. This allows us to point out a general property ofPEFDs, such a property holds forAPEFDs, too. Given aPEFDr∆kpτ_J^RqsXY ÑZwe have that for every instancerofRsuch that r|ù r∆kpτ_J^RqsXY ÑZ then for everyhďk we haver|ù r∆hpτ_J^RqsXY Ñ Z.

In our example, if we considerAPEFDr∆6pτ_Name^R qsP hys, P hysÑ CT with “ ₁₂¹, we have that r|ù r∆6pτ_Name^R qs P hys, P hysÑ CT. It is worth noting that, by considering relation r¹ “rztt3u, this dependency would hold without the need of approximation (i.e., if tuplet3 is deleted from relationr). More precisely, we haveτ_Name^r¹ “τ_Name^r ztu2,3, u3,4uYtu2,4u. Notice that tupleu2,4was not originally inτ_Name^r because of tuplet3. Figure 3 depicts this new scenario, by re- placing edgespt₂, t₃qandpt₃, t₄qwith the dashed edgept₂, t₄q. Moreover, we have that r¹ |ù r∆_`8pτ_Name^R qsP hys, P hys Ñ CT. Thus, r |ù r∆_`8pτ_Name^R qsP hys, P hysÑ CT with“ ₁₂¹.

2 An algorithm for checking APEFDs

As we proved in [4], the problem Check-APEFDgiven anAPEFDrτ_J^R, swpkqsXY Ñ Z and an instance r of R is NP-Complete in |r|. This is not an uncommon situation in the realm of temporal functional dependencies (see [8, 1, 9, 7] for further examples).

(6)

Then, in principle, there is no asimptoptycally better algorithm than explore the whole set of of possible subsets r¹ of r with ^|r|´|r_|r| ¹^| ď . However, in the following, we provide an algorithm that make use of heuristics for pruning the search space in order to achieve the tractability for many cases.

Such algorithm is general and it may be applied under no assumptions on the input instancer, it makes use of two optimization techniques. The first optimization technique consists of trying, whenever is possible, to split the current subset ofr in two subsets on which the problem may be solved independently (i.e., choices in one subset do not affect choices in the other one and viceversa).

The latter optimization technique consists of checking if the current partial solution may not lead to an optimal solution (i.e., a solution r¹ where |r¹| is the maximum possible number of tuples that may be kept) if this happen the subtree is pruned immediately (i.e., we are looking only for optimal solutions).

Our algorithm relies on the concept ofcolor that we will explain through an example in the following. GivenAPEFDrτ_J^R, swpkqsXY Ñ Z and an instancer ofR, let us suppose that we are solving problem Check-APEFDon such instance with a simple guess-and-check procedure that make use of two, initially empty, subsetr^`(the tuples to be kept in the solution) andr^´(the tuples to be deleted in the solution) ofr. At each step the procedure guess a tupletinrzpr^`Yr^´qand decides non-deterministically (guessimg phase) either to updater^` to r^`Y ttu (i.e., t is kept in the current partial solution) to update r^´ to r^´Y ttu (i.e,t is deleted in the current partial solution). When r “ r^` Yr^´ (check phase), the procedure returns Y ES if r^` |ù rτ_J^R, swpkqsXY Ñ Z and |r^´| ď ¨ |r|, otherwise it returns N O. From now on, for us a partial solution is a triple pr,r^`,r^´qsuch thatpr^`Yr^´q Ďrandr^`Xr^´ “ H, ifr“r^`Yr^´ we simply say thatpr^`Yr^´qis asolution. A solution isconsistent pr,r^`,r^´qif and only if r^`|ù rτ_J^R, swpkqsXY Ñ Z. Given two partial solutionpr,r^`₁,r^´₁qandpr,r^`₂,r^´₂q we say thatpr,r^`₂,r^´₂qextends pr,r^`₁,r^´₁qif and only if r^`₁ Ďr^`₂ andr^´₁ Ďr^´₂.

Is there a way to check if we are generating an inconsistent solution possibly without guessing all the tuples in r? Violations of the latter constraint (i.e.,

|r^´| ď¨ |r|) are fairly simple to detect during the guessing phase, it suffices to check after each insertion inr^´ifr^´ exceeds¨ |r|, if it is the case the procedure may return N O immediately without guessing any further. Violations of the first constraint (i.e.,r^` |ù rτ_J^R, swpkqsXY ÑZ) during the guessing phase are trickier to detect, then we need the following additional definitions. From now on when two tuplest, t¹ share the same value for the attributeJ (i.e.,trJs “t¹rJs) we will say that they are in the sameJ-group and we will say thattrJsis the value of the J-group containing t and t¹. For the sake of brevity, for a given j P DompJq we will use j-group for denoting the J-group with value j. An ordered pair, written o-pair, is a pairpt, t¹q Prˆrsuch that tand t¹ are in the sameJ-group andtrV Ts ăt¹rV Ts. Given an o-pairt, t¹ Prwe say that the pair pt, t¹qis an edge if and only if 0ăt¹rV Ts ´trV Ts ďk. Given triplepr,r^`,r^´q, an o-pair pt, t¹q P r is active if and only if t, t¹ P r^` and for every tuple t in the same J-group oft iftrV Ts ă t ăt¹rV Ts we have t Pr^´ (i.e., t and t¹ are selected in the current partial solution and all the tuples between t and t¹ in

(7)

the same J-group are deleted). Given two valid times vt, vt¹ P DompV Tq and a value j P DompJq we say that vt and vt¹ are consecutive in j-group if and only if there exists an active o-pair pt, t¹q with trV Ts “ vt, t¹rV Ts “vt¹, and trJs “j. Notice that we may have two distinct values j, j¹ PDompJqand two distinct valid timesvt, vt¹ PDompV Tqwhich are consecutive inj-group and not consecutive in j¹-group. Moreover, we may have edgespt, t¹qthat are not active and active pairs pt, t¹qthat are not edges, such is the case of active pairs pt, t¹q witht¹rV Ts´trV Ts ąk. Acoloris a tuplecon the schemaC“XY Z, two colors px, y, zq andpx¹, y¹, z¹qare conflicting if and only if x“x¹, y “y¹ andz ‰z¹. Given an o-pairpt, t¹q, denoted bycpt, t¹qis the tuplecpt, t¹q “ ptrXs, trYs, t¹rZsq.

Two o-pairs pt, t¹q,pt², t³q areconflicting if and only if cpt, t¹q and cpt², t³q are conflicting. The following result it is easy to prove and its proof is left as exercise for the reader.

Theorem 1. Given anAPEFDrτ_J^R, swpkqsXY Ñ Z, an instancerofR, and a partial solution pr,r^`,r^´q, if there exists two active edges pt, t¹q and pt², t³q then every solution pr,r^`f,r^´fq that follows pr,r^`,r^´q satisfy r^`_f z|ù rτ_J^R, swpkqsXY ÑZ (i.e., is inconsistent).

The above theorem guarantees that from a partial solutionpr,r^`,r^´qthat feature at least two conflicting edges we cannot reach a solution pr,r^`_f,r^´_fq that satisfy the first constraint and in such a case we may return immediately N O without going any further from pr,r^`,r^´q. The colours of a partial solution pr,r^`,r^´q is the set colourspr,r^`,r^´q “ tptrXs, t¹rYs, t¹rZsq : pt, t¹qis an active edge in pr,r^`,r^´qu. It is easy to see that the hypothesis of Theorem 1 applies if and only ifcolourspr,r^`,r^´qcontains at least two conflicting colours. Then, by means of colours, our above guess-and-check procedure may be improved by adding the control on the size of r^´ and by keeping updated the current set of colourscolourspr,r^`,r^´q. Once an insertion of a tuple in eitherr^` orr^´ introduce a colorc that is conflicting with at least one colour in pr,r^`,r^´qthe procedure answersN O immediately.

An example of how the procedure works is given in Figure 4 where we have an instance of 5 tuples with“0.2 (i.e., we may delete at most one tuple) and swpkq “ 6 (all the tuples are in the same window). The execution depicted in Figure 4 guesses the values of the tuples from the oldest (t1) to the newest one (t2) according to the value ofV T. First it tries to put the current tupletinr^`if no violation arises it continues, if some violation arises it tries to insert the tuple t inr^´ if no violation arises it continues otherwise it goes back to the previous choice (i.e., backtracking). Every internal node is labelled with the current tuple which will be guessed next, every leaf is labelled either with Y ES (i.e., the current branch is a solution) orN O(i.e, a violation has arisen), the current set of colours is reported within the node. Nodes are numbered according to their order of appearance. We have that the root isn₁followed by the introduction of the nodesn₁. . . n₄in this precise order. If we introducet₄in the partial solution associated ton4we violate the first constraint. Since inn4 addingt4inr^´does not generate any violation, node n5 is created as child of n4. However, node

(8)

# J X Y Z V T 1 j₁ x₁y₁ z₁ 1 2 j₁ x₂y₁ z₁ 2 3 j₁ x₁y₁ z₂ 3 4 j1 x2y1 z2 4 5 j₁ x₁y₁ z₂ 5

x1|y1|z1 ^px1,y1,z1q x2|y1|z1 x1|y1|z2 x2|y1|z2 x1|y1|z2 px1,y1,z2q

px1,y1,z2q

px2,y1,z2q

px1,y1,z2q

px2,y1,z2q t1,tu

n1

t2,tu n2

t3,tpx1,y1,z1qu n3

t4,

"

px1,y1,z1q, px2,y1,z2q

*

n4 n6 t4,tpx1,y1,z1qu

t5,

"

* n5

t5,

"

* n7

N O conflict between px1,y1,z1qandpx1,y1,z2q

N O

|r´ |ą1

Y ES r“r` Yr´,|r´ | ď1, colourspr,r`,r´ q “"

* t1Pr`

t2Pr`

t3Pr`

t3Pr´

t4Pr`

t4Pr´

t5Pr` t5Pr´

t4Pr`

t5Pr`

Fig. 4.An example of how the use of colors improve a guess and check procedure for solving the problem Check-APEFD.

n5 cannot be extended without introducing a violation in the above constraints because if putt5in r^`we introduce a conflicting color, while if we put t5 inr^´ we exceed the maximum number of allowed deletions. We backtrack to n4 and we have that even for it all the possible choices have been explored and thus we backtrack ton3where the choice of addingt3 tor^´is attempted generating the noden₆. Fromn₆we putt₅inr^` without violating any constraint and thus we have thattt₁, t₂, t₄, t₅u |ù rτ_J^R,6sXY ^0.2ÑZ.

Let us consider deeper in the description of the first algorithm. On an higher lever the algorithm operates like the procedure above apart from some trivial technicalities. However, two more heuristics are introduced in order to possibly stop early during the exploration of a branch in the tree of computation. The main procedure of the algorithm is reported in Figure 7 with auxiliary procedures reported in Figure 5 and in Figure 6. The algorithm is implemented by the functionT upleW iseM inthat takes 4 arguments. The first argument isG_rwhich is a pre-processing ofrbased on theAPEFDrτ_J^R, swpkqsXY Ñ Z that has to be checked. More precisely, G_r is an instance of the schema J, X, Y, Z, V T, count, with Dompcountq “ N. We have that t P Gr if and only if there exists t¹ P r for whichpt¹rJs, t¹rXs, t¹rYs, t¹rZsq “ ptrJs, trXs, trYs, trZsqandtrcounts “ |tt¹P r : pt¹rJs, t¹rXs, t¹rYs, t¹rZsq “ ptrJs, trXs, trYs, trZsqu|, that is, we count how

(9)

procedureInBetweenpGr, t1, t2q

comment: returns the subset ofGrconsisting of all the tuples in the sameJ-group oft1 andt2 that feature valid times betweent1rV Tsandt2rV Ts.

return ttPGr:t1rV Ts ătrV Ts ăt2rV Ts ^trJs “t1rJsu procedureEdgeConflict?ppt1, t2q,pt¹1, t¹2qq

comment: returns true if and only if the two input edges feature conflicting colors.

if t1rXs “t¹1rXs ^t2rYs “t¹2rYs ^t2rZs ‰t¹2rZs then return true

else return false

procedureColorConflict?ppt1, t2q,Cq

comment: returns true if and only if the color of the edgept1, t2qis conflicting with at least one color in the set of colorsC.

if DcpcPC^t1rXs “crXs ^t2rYs “crYs ^t2rZs ‰crZsq then return true

else return false

procedureE!pGr, k, G^`r, G^´rq

comment: returns the set of all and only active edges in the current partial solution.

return

"

pt1, t2q: t1rJs “t2rJs ^0ăt2rV Ts ´t1rV Ts ďk^

tt1, t2u ĎG^`r ^InBetweenpGr, t1, t2q ĎG^´rq

*

procedureE?pGr, k, G^`_r, G^´_r,Cq

comment: returns the set of all and only pending edges in the current partial solution.

return

$

’’

&

’’

% pt1, t2q:

t1rJs “t2rJs ^0ăt2rV Ts ´t1rV Ts ďk^ tt1, t2u XG^´_r “ H^

^ ColorConf lict?ppt1, t2q,Cq^

^InBetweenpGr, t1, t2q XG^`_r “ H^

^ptt1, t2u YInBetweenpGr, t1, t2qq Ę pG^`r YG^´rq

, // . // -

procedureReach?pt1, t2, N odes, Edgesq comment:

returns true if and only if there exists a path fromt1 tot2in the graph pN odes, Edgesq. It is a function that checks wether or not there exits a path between two nodes in a graph.

if Dptt1. . . tmu ĎN odesqpt1“t1^tm“t2^ @p1ďiămqppti, ti`1q PEdgesqq then return true

else return false

Fig. 5.Auxiliary procedures used by procedures presented in Figures 6 and 7.

(10)

procedureGroupIndependent?pj, Gr, k, G^`_r, G^´_r,Cq comment:

returns true if and only if in the current partial solution pending edges involving tuples belonging to theJ-group with valuejare not

conflicting with the pending edges introduced by the othersJ-groups.

EjÐ tpt1, t2q PE?pGr, k, G^`_r, G^´_r,Cq:t1rJs “ju EjÐE?pGr, k, G^`_r, G^´_r,CqzEj

if Dt1Dt2Dt1Dt2ppt1, t2q PEj^ pt,t2q PEj^EdgeConf lict?ppt1, t2q,pt1, t2qqq then return false

else return true

procedureMacConsistentSubsetpEq comment:

returns the maximal subsetE¹ofEsuch that d for every edge pt¹₁, t¹₂q PE¹and for every edgept1, t2q PE we have thatcpt¹₁, t¹₂qand cpt1, t2qare not conflicting.

E¹Ð H

for eachpt1, t2q PE do

"

if @t1@t2ppt1, t2q PEÑ EdgeConf lict?ppt1, t2q,pt1, t2qqq thenE¹ÐE¹Y tpt1, t2qu

return E¹

procedureReplacePath?pt, t, Gr, k, G^`r, G^´r,Cq

comment: returns true if and only if in the current partial solution the consecutive valid timestrV TsandtrV TsintrJs-group may be safely replaced.

if Dt1pt1PGr^t1rJs “trJs ^trV Ts ăt1rV Ts ătrV Tsq then return false

NsÐ tt1PG^`r :t1rJs “trJs ^t1rV Ts “trV Tsu NeÐ tt1PG^`_r :t1rJs “trJs ^t1rV Ts “trV Tsu

NmÐ tt1PG^´r :t1rJs “trJs ^trV Ts ăt1rV Ts ătrV Tsu NÐNsYNeYNm

E˜Ð tpt1, t2q:t1PN^t2PN^t1rV Ts ăt2rV Ts ^ pt1RNs_t2RNequ P airsÐ tpt1, t2q:t1PN^t2PN^t1rV Ts `kăt2rV Ts ^ pt1RNs_t2 RNequ E˜okÐ pM axConsistentSubsetpE?pGr, k, G^`_r, G^´_r,Cqq XEq Y˜ P airs

N otSaf eÐ H

for eachpt1, t2q PEz˜ E˜ok

do

$

&

%

for eachpt1, t2q P pE?pGr, k, G^`_r, G^´_r,Cq YE!pGr, k, G^`_r, G^´_rq YEq˜ do

"

ifEdgeConf lict?ppt1, t2q,pt1, t2qq then N otSaf eÐN otSaf eY tpt1, t2qu E˜ÐEzN otSaf e˜

if Dt1Dt2@t1@t2

ˆ t1, t2PNm^t1PNs^t2PNe^

^pt1, t1q,pt2, t2q PE^Reach?pt˜ 1, t2, N,Eq˜

˙

then return true return false

Fig. 6.Auxiliary procedures used by procedureT upleW iseM in(Figure 7).

(11)

Algorithm 2.1: TupleWiseMin(Gr, k, G^`_r “ H, G^´_r “ H,C“ H) procedureMaximalPaths?pGr, k, G^`r, G^´r,Cq

comment:

returns true if and only if in the current partial solution for every J-groupj-group every pair of consecutive valid timesvtandvt¹in j-group cannot be safely replaced.

ifDt1Dt2ppt1, t2q PE!pGr, k, G^`r, G^´rq ^ReplaceP ath?pt1, t2, k, Gr, G^`rG^´r,Cqq then return false

else return true

procedureIsConsistent?pGr, k, G^`_r, G^´_r,Cq comment:

returns true if and only if the current partial solution features consistent colors inC and for everyJ-groupj-group every pair of consecutive valid timesvtandvt¹ inj-group cannot be safely replaced.

ifDt1Dt2ppt1, t2q PE!pGr, k, G^`r, G^´rq ^ColorConf lict?ppt1, t2q,Cqq then return false

ifM aximalP aths?pGr, k, G^`r, G^´r,Cq then return true

else return false main

comment: returns the minimum valuem“min||GrzG¹_r||

forG¹_r intG²_rĎGr:G¹_r|ù rτ_G^{XY Zcount}_r sXY ÑZu.

ifGr“ Hthen return||G^´_r||

ifDjpjPDompJq ^GroupIndependent?pj, Gr, k, G^`r, G^´r,Cqq then

$

&

%

GrÐ ttPGr:trJs “ju return

ˆT upleW iseCheckpGr, k, G^`r XGr, G^´r XGr,Cq`

T upleW iseCheckpGrzGr, k, G^`rzGr, G^´rzGr,Cq

˙

lettPGr

ifIsConsistent?pGrzttu, k, G^`r Y ttu, G^´r,Cq then

"

C¹ÐCY tpt¹rXs, t²rYs, t²rZsq:pt¹, t²q PE!pGrzttu, k, G^`r Y ttu, G^´rqu mtÐT upleW iseCheckpGrzttu, k, G^`_r Y ttu, G^´_r,C¹q

else mtÐ `8

ifIsConsistent?pGrzttu, k, G^`_r, G^´_r Y ttu,Cq then

"

C¹ÐCY tpt¹rXs, t²rYs, t²rZsq:pt¹, t²q PE!pGrzttu, k, G^`r, G^´r Y ttuqu m_ztÐT upleW iseCheckpGrzttu, k, G^`_r, G^´_r Y ttu,C¹q

else mztÐ `8 return minpmt, m_ztq

Fig. 7.The main procedure for checkingAPEFDs in a tuple-wise fashion. Notice that we use a compact notation for the recursive procedure which is initially called as T upleW iseM inpGr, kqwhere whenG^`r,G^´r, andCare ommitted in the procedure call they get their respective default values specified in the procedure declaration (i.e.,H for each of them in this case).

(12)

many tuples inrshare the same values for attributesJ, X, Y,andZ. The input parameterk is the length for the sliding window grouping swpkq. The setsG^`_r and G^´_r, originally initialized to H, represent the tuples ofG_r that are either kept or deleted in the current solution respectively. On instances r of schema R satisfyingcountPR we denote with ||r|| the sum on thecountattribute for the tuples in r (i.e.,||r|| “ř

tPrtrcounts). Finally, C is a set of colors which is initially set to H. A color cis a tuple on the schemaX, Y, Z. As we will see,C keeps track via colours of the constraints introduced so far in the construction of the solution.

The procedureT upleW iseM inreturns the minimum number of tuples that has to be deleted from r in order to obtain an instance r¹ such that r¹ |ù rτ_J^R, swpkqsXY Ñ Z. Then if such minimum is less or equal than ¨ |r| we can conclude r|ù rτ_J^R, swpkqsXY Ñ Z else we haver |ù rτz _J^R, swpkqsXY Ñ Z. Given Gr,G^`_r, G^´_r, and a set of colours Cwe say that an edge pt, t¹q PGrˆGr

ispending if and only if the following conditions hold:

1. t, t¹RG^´_r andptrXs, t¹rYs, zq RCfor every zPDompZq;

2. for everyt²witht²rJs “trJsandtrV Ts ăt²rV Ts ăt¹rV Tswe havet²RG^`_r; 3. there exists t² P pGrY tt, t¹uqzpG^´_r YG^`_rqwith t²rJs “trJs and trV Ts ď

t²rV Ts ďt¹rV Ts.

Informally speaking, a pending edge is an edge that is not active in the current partial solution but it may become active during the computation and, if it happens, it introduces a new color inC. In our algorithm, pending edges for the current partial solution are retrieved by the procedureE? while active edges are retrieved by the procedureE!.

The procedureT upleW iseM in(Figure 7) works as follows. IfG^`_rYG^´_r “G_r it means that we have obtained a solution without violating any constraint and thus we can return||G^´_r||(i.e., the number of deleted tuples). IfG^`_r YG^´_r ‰G_r the algorithm guesses a tupletPG_rzpG^`_r YG^´_rqand proceed as follows. First, it checks if insertingtintoG^`_r does not cause any violation in the constraints, if so it stores inmtthe value of the recursive call toT upleW iseM inwheretbelongs to G^`_r andC has been updated accordingly. Notice that by inserting a tuplet in G^`_r the algorithm is asserting that t belongs to the current partial solution, while by inserting t in G^´_r the algorithm is asserting thatt does not belong to the current partial solution. If a constraint is violated the algorithm stores inmt

the value`8, which means thatt may not be kept in the current solution.

Then, it checks if inserting t into G^´_r does not cause any violation in the constraints, if so it stores inm_z_tthe value of the recursive call toT upleW iseM in whereG^´_r andCare updated accordingly. If a constraint is violated the algorithm stores inm_z_tthe value`8, which means thattmust be kept in the current partial solution. In procedureT upleW iseM in the only way in which a constraint may be violated is that, after the insertion a tupletinG^`_r (resp.G^´_r), an edgept¹, t²q turns out to be active and its colorpt¹rXs, t²rYs, t²rZsqturns out to be conflicting with at least one color inC.

The first one allows us to restrict the search space by splitting the problem into independent sub-problems in a divide-and-conquer fashion. Let us suppose

(13)

that at a certain step of our computation there exists a valuejP ttrJs:tPG_ru for which for each pair of conflicting pending edgespt, t¹qandpt, t¹qwe have that either allt, t¹, t, andt¹ belong toj-group or allt, t¹, t, andt¹ do not belong toj- group (such condition is verified by sub-procedureGroupIndependent? reported in Figure 6.). Let G_r “ tt:trJs “ju, if every edge involving tuples inj-group is not conflicting with every edge that may be introduced outsidej-group then we can split the problem into the two sub-problemspG_r, k, G^`_r XG_r, G^´_r XG_rq and pG_rzG_r, k, G^`_rzG_r, G^´_rzG_rq. Such problems are independent and may be resolved in a separate way. The resulting value for the solution is the sum of the values returned by T upleW iseM in applied to both the two sub-problems. Let H “ |G^´_rzpG^`_r YG^´_rq|andh“ |ttP pG^´_rzpG^`_r YG^´_rqq:trJs “ju|, notice that, in this case, the upper bound of the complexity at the current step of computation drops fromOp2^HqtoOp2^H´h`2^hq.

The second optimization allows us to prune a sub-tree of computation even before a contradiction arises. The second optimization implements a criteria to detect, in many cases, if every possible solution that may be built starting from the current partial one turns to be not minimal. Suppose that there exists an active o-pairpt, t¹qin a partial solutionpGr, G^`_r, G^´_rqsuch that there existstPGr

in the same J-group of twith trV Ts ătrV Ts ăt¹rV Ts. By definition of active o-pair, we have that t belongs to G^´_r as well as every tuple t¹ in the same J- group oft withtrV Ts ăt¹rV Ts ăt¹rV Ts, here the additional condition is that there exists at least one of such tuples. Given a partial solution pGr, G^`_r, G^´_rq we denote with colorspGr, G^`_r, G^´_rq the set colorspGr, G^`_r, G^´_rq “ tpx, y, zq : there exists an active edge pt, t¹q withcpt, t¹q “ px, y, zqu. Let us define the set of colors pendingpGr, G^`_r, G^´_rq “ tpx, y, zq : there exists a pending edge pt, t¹q withcpt, t¹q “ px, y, zqu which collects all and only the colors that may be introduced later on in the current computation.

A colorpx, y, zqissafe inpvt, vt¹, jqif and only if one of the following three conditions hold:

1. px, y, zq PcolorspGr, G^`_r, G^´_rq;

2. every colorpx, y, z¹qinpendingpG_r, G^`_r, G^´_rqsatisfiesz¹“z(i.e.,px, y, zqis a pending color and there is no pending color that is conflicting withpx, y, zq);

3. the color is not conflicting with any color incolorspG_r, G^`_r, G^´_rq Ypendingp Gr, G^`_r, G^´_rq and do not exist two tuples t, t¹ P pG^`_r YG^´_rq X tt² P Gr : t²rJs “j^trV Ts ďt² ďt¹rV Tsusuch thatpt, t¹qis an edge and the color ptrXs, t¹rYs, t¹rZsqis conflicting with px, y, zq

Notice that the above three conditions imply that if a color is safe in pvt, vt¹, jq then it is neither in conflict with a colorcolorspG_r, G^`_r, G^´_rqnor with a color in pendingpGr, G^`_r, G^´_rq, however this is just a necessary but not sufficient condition. Given a partial solutionpGr, G^`_r, G^´_rqand the triplepvt, vt¹, jqapvt, vt¹, jq- replace DAG is a DAGpV, EqwhereV “ ttPG^`_r :trV Ts “vt^trJs “ju Y ttP

(14)

G^´_r :vtătrV Ts ăvt¹^trJs “ju Y ttPG^`_r :trV Ts “vt¹^trJs “juand

E“

"

pt, t¹q PV ˆV : ptrV Ts ‰vt_t¹rV Ts ‰vt¹q ^trV Ts ăt¹rV Ts^

cpt, t¹qis safe inpvt, vt¹, jq

*

Y

tpt, t¹q PV ˆV :ptrV Ts ‰vt_t¹rV Ts ‰vt¹q ^t¹rV Ts ´trV Ts ąku A node t P V is a starting node (resp. ending node) if and only if vt ă trV Ts ăvt¹ and for everyt¹ PV witht¹rV Ts “vt (resp.t¹rV Ts “vt¹) we have pt¹, tq PE (resp.pt, t¹q PE). Areplace path is pvt, vt¹, jq-replace DAG pV, Eq is any patht₁. . . t_m inpV, Eqfor whichtt₁ is a starting node andt_mis an ending node. We say thatvtandvt¹injcan besafely replacedif and only if there exists a replace path in thepvt, vt¹, jq-replace DAGpV, Eq.

Using the above definitions of replace DAGs/paths we can provide the following result.

Theorem 2. Given a partial solutionpGr, G^`_r, G^´_rqif there exists a groupjwith two consecutive valid timesvtandvt¹ such thatvtandvt¹ can be safely replaced inj then every consistent solution that followspGr, G^`_r, G^´_rqis not optimal.

The proof of the theorem is very simple and left as exercise to the reader.

Let us suppose that t1. . . tm is a replace path in the pvt, vt¹, jq-replace DAG, it must exists by hypothesis and by definition we havet1, . . . , tm PG^´_r, it suffices to take any consistent solution pGr, G^`_r, G^´_rq that follows pGr, G^`_r, G^´_rq and that pGr, G^`_r Y tt1, . . . , tmu, G^´_rztt1, . . . , tmuq is still a consistent solution, non-optimality immediately follows. We take advantage of Theorem 2 by pruning every computation rooted in a partial solution pGr, G^`_r, G^´_rq that features a J-group j-group and two consecutive valid timesvt and vt¹ in j-group such that vt and vt¹ can be safely replaced in j. Notice that verify wether or not such condition applies may be performed in polynomial time. In the procedure T upleW iseM in, this optimization is realized by the joint work of sub-procedures M aximalP aths? andReplaceP ath? reported in Figure 7 and 6 respectively.

3 Conclusion

In this work we propose an algorithm for solving the problem of verifying whether or not a givenAPEFDEF holds over a given instance rof a temporal schema.

Such problem is known to be NP-Complete in the size ofr(i.e., data-complexity).

We expect our algrithm to perform considerably better w.r.t. standard branch and bound procedures since it considerably prune the search space in most cases.

We are building a prototype in order to measure the effective improvement w.r.t.

frameworks that solve generic NP-Complete problems (i.e., Integer Linear Pro- grammming tools, SAT solvers, and logic programming), obviously this implies that we have to encode our problem in such formalisms.

(15)

References

1. Combi, C., Mantovani, M., Sabaini, A., Sala, P., Amaddeo, F., Moretti, U., Pozzi, G.: Mining approximate temporal functional dependencies with pure temporal grouping in clinical databases. Comp. in Bio. and Med.62, 306–324 (2015).

https://doi.org/10.1016/j.compbiomed.2014.08.004, https://doi.org/10.1016/j.

compbiomed.2014.08.004

2. Combi, C., Mantovani, M., Sala, P.: Discovering quantitative temporal functional dependencies on clinical data. In: 2017 IEEE International Conference on Health- care Informatics, ICHI 2017, Park City, UT, USA, August 23-26, 2017. pp. 248–257.

IEEE (2017). https://doi.org/10.1109/ICHI.2017.80, https://doi.org/10.1109/

ICHI.2017.80

3. Combi, C., Montanari, A., Sala, P.: A uniform framework for temporal functional dependencies with multiple granularities. In: Pfoser, D., Tao, Y., Mouratidis, K., Nascimento, M.A., Mokbel, M.F., Shekhar, S., Huang, Y. (eds.) Advances in Spatial and Temporal Databases - 12th International Symposium, SSTD 2011, Minneapolis, MN, USA, August 24-26, 2011, Proceedings. Lecture Notes in Computer Science, vol. 6849, pp. 404–421. Springer (2011). https://doi.org/10.1007/978-3-642-22922- 0 24,https://doi.org/10.1007/978-3-642-22922-0$\_$24

4. Combi, C., Rizzi, R., Sala, P.: The price of evolution in temporal databases.

In: Grandi, F., Lange, M., Lomuscio, A. (eds.) 22nd International Sympo- sium on Temporal Representation and Reasoning, TIME 2015, Kassel, Ger- many, September 23-25, 2015. pp. 47–58. IEEE Computer Society (2015).

https://doi.org/10.1109/TIME.2015.24,https://doi.org/10.1109/TIME.2015.24 5. Combi, C., Sala, P.: Temporal functional dependencies based on interval

relations. In: Combi, C., Leucker, M., Wolter, F. (eds.) Eighteenth In- ternational Symposium on Temporal Representation and Reasoning, TIME 2011, L¨ubeck , Germany, September 12-14, 2011. pp. 23–30. IEEE (2011).

https://doi.org/10.1109/TIME.2011.15,https://doi.org/10.1109/TIME.2011.15 6. Combi, C., Sala, P.: Interval-based temporal functional dependencies: spec-

ification and verification. Ann. Math. Artif. Intell. 71(1-3), 85–130 (2014).

https://doi.org/10.1007/s10472-013-9387-1, https://doi.org/10.1007/

s10472-013-9387-1

7. Combi, C., Sala, P.: Mining approximate interval-based temporal dependencies. Acta Inf.53(6-8), 547–585 (2016). https://doi.org/10.1007/s00236-015-0246-x, https://doi.org/10.1007/s00236-015-0246-x

8. Sala, P.: Approximate interval-based temporal dependencies: The complexity landscape. In: Cesta, A., Combi, C., Laroussinie, F. (eds.) 21st Interna- tional Symposium on Temporal Representation and Reasoning, TIME 2014, Verona, Italy, September 8-10, 2014. pp. 69–78. IEEE Computer Society (2014).

https://doi.org/10.1109/TIME.2014.20,https://doi.org/10.1109/TIME.2014.20 9. Sala, P., Combi, C., Cuccato, M., Galvani, A., Sabaini, A.: A framework for mining

evolution rules and its application to the clinical domain. In: Balakrishnan, P., Sri- vatsava, J., Fu, W., Harabagiu, S.M., Wang, F. (eds.) 2015 International Conference on Healthcare Informatics, ICHI 2015, Dallas, TX, USA, October 21-23, 2015. pp.

293–302. IEEE Computer Society (2015). https://doi.org/10.1109/ICHI.2015.42, https://doi.org/10.1109/ICHI.2015.42