
DOI: 10.1051/ita:2007052 www.rairo-ita.org

TREE INCLUSION PROBLEMS∗,∗∗

Patrick Cégielski¹, Irène Guessarian² and Yuri Matiyasevich³

Abstract. Given two trees (a target T and a pattern P) and a natural number w, window embedded subtree problems consist in deciding whether P occurs as an embedded subtree of T and/or finding the number of windows of T of size (at most) w which contain pattern P as an embedded subtree. P is an embedded subtree of T if P can be obtained by deleting some nodes from T (if a node v is deleted, all edges adjacent to v are also deleted, and outgoing edges are replaced by edges going from the parent of v (if it exists) to the children of v). Deciding whether P is an embedded subtree of T is known to be NP-complete. Our algorithms run in time $O(|T|\,2^{2^{|P|}})$, where |T| (resp. |P|) is the size of T (resp. P).

Mathematics Subject Classification. 68Q25, 68W05.

Keywords and phrases. Subtree inclusion, algorithm.

∗ For the 60th birthday of Serge Grigorieff.

∗∗ Support from the Council for Grants of the President of the Russian Federation under grant NSh-8464.2006.1 is acknowledged by the third author.

¹ LACL EA 4213, Université Paris Est, Route forestière Hurtault, 77300 Fontainebleau, France; cegielski@univ-paris12.fr

² LIAFA, UMR 7089 and Université Paris 6, 2 place Jussieu, 75254 Paris Cedex 5, France; ig@liafa.jussieu.fr

³ Steklov Institute of Mathematics, Fontanka 27, St. Petersburg, 191023, Russia; yumat@pdmi.ras.ru

© EDP Sciences 2007

Article published by EDP Sciences and available at http://www.rairo-ita.org or http://dx.doi.org/10.1051/ita:2007052

1. Introduction

Given two trees, we study the following problems: can P be obtained from T by deleting nodes? If so, is P contained in a reasonably small subtree of T (i.e. one of small height)? If P is contained in a subtree of T of height w (a “window”), how many times can this occur?

These problems generalize in a natural way the subsequence problems for words: we proved in [1] that the problem of counting the number of w-windows of a text t containing a pattern p as a subsequence (i.e. the letters of p appear in the window in the same order as in p, but are not necessarily consecutive and may be interleaved with other letters) can be solved in time O(n), where n is the size of t. The generalization to trees can be stated as follows: P is an embedded subtree of T if P can be obtained by deleting some nodes from T (if a node v is deleted, the ingoing edge to v (if it exists) is also deleted, and outgoing edges are replaced by edges going from the parent of v (if it exists) to the children of v). P is an embedded subtree of T within a w-window if P is an embedded subtree of W, and W is a subtree of T of height w.

We cannot hope to reduce (in a simple and succinct way) the problem of finding whether P is an embedded subtree of T within a w-window to the subsequence problem for words by encoding T and P by words t and p respectively, and then solving some subsequence problem for t and p: it is known [4] that even the simpler problem of deciding whether P is an embedded subtree of T is NP-complete, hence the length of t and/or p would probably be exponential in the size of T and/or P. The problem of finding embedded occurrences of a pattern in a window of a tree is important in two areas extensively studied recently:

1. retrieving information from structured documents [4] such as dictionaries: via a pattern embedding the user can specify the structure and content of the parts of the document she/he is interested in;

2. discovering frequent substructures in semi-structured data; most semi-structured data are modeled by colored labeled trees (e.g. itemsets in relational databases, chemical compounds, XML documents), and mining such structures is naturally done via finding tree embeddings.

The paper is organized as follows: in Section 2 we define notations and problems; in Section 3 we describe a new algorithm to decide whether a pattern is embedded in a target (interesting not per se, but for the generalisations given in the following section); and Section 4 presents our main contribution, namely (i) determining (and counting) the w-windows containing a pattern as an embedded subtree, and (ii) determining (and counting) the embeddings of a pattern within a w-window of the target.

2. Notations

Let A be an alphabet.

Definition 2.1. (i) A tree T on A is a finite connected acyclic graph T = ⟨V, E, color⟩, where V is the set of nodes, E is the set of edges, and color: V → A is a coloring function: each node (or vertex) is colored by a letter of A.

(ii) A rooted tree T = ⟨V, E, color, r⟩ is a tree where a node r has been distinguished and is called the root of the tree.

In a rooted tree, edges are naturally directed (from the root to the leaves): all nodes have in-degree one except for the root, which has in-degree zero; if (u, v) is an edge oriented from u to v, u is the parent of v and v is a child of u; each node has exactly one parent, except for the root, which has none; a node can have any finite number of children, and childless nodes are called leaves. Nodes have a depth: the depth of the root is 0, and if the depth of node v is n, then all its children have depth n + 1. Two nodes having the same parent are called siblings. The transitive closure of the parent (resp. child) relation is called the ancestor (resp. descendant) relation. The height of a tree is the maximum of the depths of its nodes.

In the case of trees, notions similar to subword and subsequence for words exist and are called subtree and embedded subtree. Formally:

Definition 2.2. Let T = ⟨V, E, color, r⟩ and T′ = ⟨V′, E′, color′, r′⟩ be rooted trees such that:

1. V′ ⊆ V and E′ ⊆ E;

2. the restriction to V′ of the ancestor relation on V coincides with the ancestor relation on V′;

3. the coloring of V is preserved in T′, i.e. ∀v ∈ V′, color′(v) = color(v).

Then T′ is said to be a subtree of T.

Moreover, if for each node v from T′ all its descendants in T are also its descendants in T′, then T′ is said to be a bottom-up subtree of T.

T′ is said to be a proper (bottom-up) subtree of T if T′ is a (bottom-up) subtree of T and T′ ≠ T.

Intuitively, a bottom-up subtree of T can be obtained by taking a node v of T together with all of v’s descendants and the corresponding edges; a subtree of T can be obtained by taking a bottom-up subtree of T and pruning some edges together with the subtrees below the pruned edges. The bottom-up subtree of T rooted at node v will be denoted by T[v].

Definition 2.3. Let T = ⟨V, E, color, r⟩ and T′ = ⟨V′, E′, color′, r′⟩ be rooted trees; an embedding from T′ into T is an injective mapping τ: V′ → V such that:

1. for every v ∈ V′, color(τ(v)) = color′(v), i.e. τ preserves colors;

2. v1 is an ancestor of v2 in T′ iff τ(v1) is an ancestor of τ(v2) in T, i.e. τ preserves the ancestor-descendant relationship.

T′ is said to be an embedded subtree of T if there exists an embedding from T′ into T.

Intuitively, an embedded subtree of T is obtained by deleting some nodes from T and gluing together the remaining edges in a way preserving the ancestor-descendant relationship of T.
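The two conditions of Definition 2.3 can be checked mechanically for a candidate map. The following Python sketch is only an illustration: the `Tree` class, the parent-map representation, and the example trees are assumptions made here, not notation from the paper.

```python
class Tree:
    """Rooted colored tree stored as a parent map (the root maps to None)."""

    def __init__(self, parent, color):
        self.parent = parent  # node -> its parent, None for the root
        self.color = color    # node -> its color (a letter of the alphabet)

    def is_ancestor(self, u, v):
        """True if u is a proper ancestor of v."""
        v = self.parent[v]
        while v is not None:
            if v == u:
                return True
            v = self.parent[v]
        return False


def is_embedding(tau, small, big):
    """Check that tau (dict: node of small -> node of big) satisfies Definition 2.3."""
    nodes = list(small.parent)
    if len({tau[v] for v in nodes}) != len(nodes):  # tau must be injective
        return False
    if any(big.color[tau[v]] != small.color[v] for v in nodes):  # condition 1
        return False
    return all(  # condition 2: ancestorship is preserved in both directions
        small.is_ancestor(u, v) == big.is_ancestor(tau[u], tau[v])
        for u in nodes for v in nodes if u != v
    )


# Hypothetical target a(x(b, c)): deleting x yields the pattern a(b, c).
T = Tree({0: None, 1: 0, 2: 1, 3: 1}, {0: "a", 1: "x", 2: "b", 3: "c"})
P = Tree({"r": None, "l1": "r", "l2": "r"}, {"r": "a", "l1": "b", "l2": "c"})
chain = Tree({"r": None, "m": "r", "l": "m"}, {"r": "a", "m": "b", "l": "c"})

good = is_embedding({"r": 0, "l1": 2, "l2": 3}, P, T)
bad = is_embedding({"r": 0, "m": 2, "l": 3}, chain, T)
```

The second map fails because b is an ancestor of c in the chain a(b(c)), while the nodes colored b and c are siblings in the target.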

Definition 2.4. A window of T = ⟨V, E, color, r⟩ is a subtree W = ⟨V′, E′, color′, r′⟩ of T such that V′ contains all the descendants of r′ from depth depth(r′) down to depth depth(r′) + height(W).

For w ∈ ℕ, a w-window of T is a window of T of height at most w. P is an embedded subtree of T within a w-window if there is an embedding from P into T and, moreover, the image of P is contained in a w-window of T.

Example 1. In Figure 1, T′, T′′ and T′′′ are respectively a bottom-up subtree, a subtree, and an embedded subtree of T. T′′′ is an embedded subtree of T within a 2-window. W is a 1-window of T.

Figure 1. A tree T with bottom-up subtree T′, subtree T′′, embedded subtree T′′′, and 1-window W.

The problems

Problem 1. Given two trees, a target tree T and a pattern tree P, decide whether P is an embedded subtree of T.

Problem 2. Given two trees, target T and pattern P, decide whether P is an embedded subtree of T within a w-window. Subsidiarily, count the number of windows of T of height at most w containing P as an embedded subtree.

Problem 3. Given two trees, target T and pattern P, count the number of occurrences of P as an embedded subtree of T within a w-window.

Related results

Different versions of the first problem have been considered. Papers [3, 4] show that Problem 1 is NP-complete, but can be solved in time $O(p\,t\,k\,2^{2k})$, where p = |P| (resp. t = |T|) is the number of nodes (size) of P (resp. T), if the out-degrees of the nodes of P are bounded by k.

3. Embedded subtree search

We study Problem 1: given a target tree T and a pattern tree P, decide whether P is an embedded subtree of T.

3.1. Notations

Without loss of generality we may assume that the nodes of P are labeled: each node has a unique label from {1, . . . , p}, where p is the number of nodes of P, see Figure 2. This yields a labeling of the bottom-up subtrees of P: the bottom-up subtree rooted at node v has the same label as node v. A bottom-up subtree of P rooted at node v is represented either by the label of v or in the form P[v]: the bottom-up subtree rooted at node v having label j will thus be denoted by j or P[v] according to the context.

Figure 2. A pattern P and a postorder labeling of its bottom-up subtrees.

Definition 3.1. A forest of P is a set of bottom-up subtrees of P. A forest will be denoted by the set of labels of its roots. Forest F is said to be a max-forest if all its trees are incomparable, i.e. there are no t, t′ ∈ F such that t is a proper bottom-up subtree of t′.

We say that forest F dominates forest F′ (different from F itself) if every tree from F′ is a bottom-up subtree of some tree from F.

3.2. Idea

Let T be a (big) tree (called the target) and let P be a (small) tree (called the pattern). For each node v of T we will compute a configuration, which will be a set of max-forests of bottom-up subtrees of P.

Intuitively, each forest of the configuration at node v represents a set of subtrees of P which can be embedded in T[v] simultaneously, i.e. in such a way that the images of different trees do not intersect.

Definition 3.2. A configuration is a set C = {F1, F2, . . . , Fk}, where each Fi is a max-forest of P, and if i ≠ j then Fi does not dominate Fj.

Definition 3.3. The union of two configurations $C = \{F_1, F_2, \ldots, F_k\}$ and $C' = \{F'_1, F'_2, \ldots, F'_{k'}\}$ is the configuration D, denoted by $D = C \otimes C'$ and obtained as follows:

(1) let $D'' = \{F_i \cup F'_{i'} \mid i \in \{1, \ldots, k\},\ i' \in \{1, \ldots, k'\}\}$;

(2) we pass from D′′ to D′ by removing, from each $F''_j = F_i \cup F'_{i'}$ which is not a max-forest, all labels of subtrees of P which are proper bottom-up subtrees of a tree whose label is also present in $F''_j$;

(3) we pass from D′ to D by removing each Fj which is dominated by some Fi.

Note that:

– In (2), we obtain a forest F′j such that all subtrees of P belonging to F′j are maximal in F′j (all subtrees of F′j are incomparable), i.e. F′j is a max-forest.

– The resulting D = {Fi | i = 1, . . . , l} is a set of max-forests of P such that if i ≠ j then Fi does not dominate Fj; hence it is a configuration.
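The three steps of Definition 3.3 can be sketched in Python as follows. The representation is an assumption made for illustration: a forest is a `frozenset` of labels, a configuration is a set of forests, and the hypothetical map `below` sends each label j to the labels of the proper descendants of j in P (so P[t] is a proper bottom-up subtree of P[u] exactly when `t in below[u]`).

```python
def max_forest(forest, below):
    """Step (2): drop every label lying strictly below another label of the forest."""
    return frozenset(t for t in forest
                     if not any(t in below[u] for u in forest))


def dominates(f, g, below):
    """f dominates g: f != g and every tree of g is a bottom-up subtree of a tree of f."""
    return f != g and all(any(t == u or t in below[u] for u in f) for t in g)


def union(c1, c2, below):
    """C (x) C' following steps (1)-(3) of Definition 3.3."""
    pairs = {max_forest(f1 | f2, below) for f1 in c1 for f2 in c2}  # steps (1), (2)
    return {g for g in pairs
            if not any(dominates(f, g, below) for f in pairs)}     # step (3)


# Hypothetical pattern a(b, c) labeled in postorder: b = 1, c = 2, a = 3.
below = {1: set(), 2: set(), 3: {1, 2}}

d1 = union({frozenset({1})}, {frozenset({2})}, below)
d2 = union({frozenset({1})}, {frozenset({2}), frozenset({3})}, below)
```

In the second call the pair {1} ∪ {3} shrinks to {3} in step (2), and {3} then dominates {1, 2} in step (3), so only {3} survives.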

Figure 3. A target T where the pattern of Figure 2 is embedded (panels (1) and (2)).

3.3. Algorithm

We traverse T bottom-up (from leaves to root, i.e. in postorder): to scan a node v of T we must first have scanned all its children. With each node v of T we will associate a configuration C_v such that for every forest F from C_v all trees from F can be simultaneously embedded into T[v]. This leads to the following algorithm. (See Fig. 3(1) for an example.)

Algorithm 1

Let r := the label of the root of P; // initialization
FORALL nodes v of T visited in bottom-up order DO
  (1) IF node v is a leaf of T, its configuration is the set of singletons {i} where i is the label of a leaf v′ of P such that color(v′) = color(v) = a.
      // If no leaf of P is colored a, the configuration of v is the empty set.
      ENDIF
  (2) IF node v is an internal node colored a, with children v1, . . . , vn, THEN DO
      (a) Δ := D := C_v1 ⊗ C_v2 ⊗ · · · ⊗ C_vn;
      (b) FORALL nodes w of P colored a and labeled j DO // w has the same color as v
            Let j1, ..., jp be the children (if they exist) of j in P
            IF there is an Fi ∈ D such that {j1, ..., jp} ⊆ Fi // true if there are no children
            THEN
              IF w = r THEN DO
                output “P is an embedded subtree of T”;
                STOP
              ENDDO ENDIF
              let Δ′ be the result of deleting j1, ..., jp from all forests in Δ;
              Δ := Δ′ ∪ {{j}}
            ENDIF
          ENDDO
      (c) Remove from Δ all dominated forests and take the resulting configuration for C_v
      ENDDO ENDIF
ENDDO
output “P isn’t an embedded subtree of T”

Comment. When computing Δ′ we delete every occurrence of j1, ..., jp in all forests because they are used only to obtain j; at a later stage (i) either we will choose j, but it already appears, or (ii) we will choose another subtree, and in that latter case we do not need j1, ..., jp.

Complexity. The number of bottom-up subtrees of P is bounded by p, where p is the size of P, and the number of forests is bounded by $2^p$; hence the number of configurations is at most $O(2^{2^p})$.
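A compact Python sketch of Algorithm 1 may help to follow the bookkeeping. It is an illustration under stated assumptions, not the authors' implementation: trees are nested tuples `(color, [children])`; a node of T under which nothing has been embedded yet carries the configuration {∅} (only the empty forest), so that ⊗ has a neutral element, whereas the text speaks of "the empty set"; and the leaf case (1) is folded into the general case (2), so the root test is also applied at leaves of T.

```python
from functools import reduce


def embedded(T, P):
    """Decide whether P is an embedded subtree of T (sketch of Algorithm 1)."""
    info = {}  # postorder label j -> (color, labels of children, labels below j)

    def label(node):
        color, kids = node
        kid_labels = [label(k) for k in kids]
        below = set(kid_labels).union(*(info[k][2] for k in kid_labels))
        j = len(info) + 1  # children were labeled first, hence postorder
        info[j] = (color, frozenset(kid_labels), below)
        return j

    root = label(P)

    def max_forest(f):  # keep only the maximal subtrees of a forest
        return frozenset(t for t in f if not any(t in info[u][2] for u in f))

    def prune(forests):  # remove dominated forests
        def sub(t, u):
            return t == u or t in info[u][2]
        return {g for g in forests
                if not any(f != g and all(any(sub(t, u) for u in f) for t in g)
                           for f in forests)}

    def union(c1, c2):  # the operation of Definition 3.3
        return prune({max_forest(f | g) for f in c1 for g in c2})

    class Found(Exception):
        pass

    def scan(node):  # returns C_v; children are scanned first (bottom-up)
        color, kids = node
        d = reduce(union, [scan(k) for k in kids], {frozenset()})  # step (a)
        delta = set(d)
        for j, (c, kid_labels, _) in info.items():                 # step (b)
            if c == color and any(kid_labels <= f for f in d):
                if j == root:
                    raise Found
                delta = {f - kid_labels for f in delta} | {frozenset({j})}
        return prune(delta)                                        # step (c)

    try:
        scan(T)
    except Found:
        return True
    return False


# Hypothetical example: deleting x from a(x(b, c)) yields a(b, c).
yes = embedded(("a", [("x", [("b", []), ("c", [])])]), ("a", [("b", []), ("c", [])]))
no = embedded(("a", [("b", []), ("c", [])]), ("a", [("b", [("c", [])])]))
```

Raising an exception plays the role of STOP in the pseudocode; it unwinds the recursion as soon as the root label of P is produced.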

3.4. Improvements

Improvement 1

Algorithm 1 can be improved in practice by reducing the number of configurations. Let us say that node v from the target and node v′ from the pattern are upward compatible if the path from v′ to the root of P can be embedded into the path from v to the root of T. We can preprocess target T in order to precompute, for each node v in T, the set c(v) of all nodes of P which are upward compatible with it. This is trivial for the root of T. Passing from a node v in T to its child w, we just add to c(v) each node u in P such that: (1) u has the same color as w, and (2) the parent of u is in c(v).

The algorithm computing the set c(w) of nodes of P upward compatible with w is as follows:

Let r := the label of the root of P; // initialization
FORALL nodes w of T visited top-down DO
  IF node w is the root of T,
  – THEN c(w) = {r} if color(w) = color(r), ∅ otherwise.
  – ELSE DO // w has parent v whose set of upward compatible nodes is c(v)
      c(w) = c(v) ∪ {u | u is a node of P, and parent(u) ∈ c(v), and color(u) = color(w)}
    ENDDO
  ENDIF
ENDDO
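The top-down pass above can be transcribed almost literally. In this Python sketch the dictionary-based representation and all parameter names are assumptions made for illustration. It follows the pseudocode as written, so the root of P enters c(w) only when the root of T has its color.

```python
def upward_compatible(T_children, T_color, T_root, P_parent, P_color, P_root):
    """Return c: node of T -> set of nodes of P, computed top-down as in the pseudocode."""
    c = {T_root: {P_root} if T_color[T_root] == P_color[P_root] else set()}
    stack = [T_root]
    while stack:  # a node is popped only after its own set c(v) is known
        v = stack.pop()
        for w in T_children.get(v, []):
            c[w] = c[v] | {u for u, par in P_parent.items()
                           if par in c[v] and P_color[u] == T_color[w]}
            stack.append(w)
    return c


# Hypothetical pattern a(b(c)) with postorder labels c = 1, b = 2, a = 3,
# and a thread-like target a - x - b - c with nodes t0, t1, t2, t3.
P_parent = {1: 2, 2: 3, 3: None}
P_color = {1: "c", 2: "b", 3: "a"}
T_children = {"t0": ["t1"], "t1": ["t2"], "t2": ["t3"]}
T_color = {"t0": "a", "t1": "x", "t2": "b", "t3": "c"}

c = upward_compatible(T_children, T_color, "t0", P_parent, P_color, 3)
```

Each node of T is visited once and each visit scans the p nodes of P, which matches the O(tp) bound stated for this preprocessing.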

The complexity of this preprocessing is O(tp). Then in step 1 of Algorithm 1 we can demand that v′ should be upward compatible with v. This could considerably reduce the number of configurations to deal with. For instance, in Figure 3 the set of nodes upward compatible with the rightmost leaf of T is {1, 2, 6}, which reduces by half the set of configurations on the rightmost path of T, see Figure 3(2).

Improvement 2

The idea of upward compatibility can be further developed as follows. Suppose that τ: V′ → V is an embedding of P = ⟨V′, E′, color′, r′⟩ into T = ⟨V, E, color, r⟩.

We can consider an inverse embedding σ, which is a partial map from V onto E′ ∪ V′ defined as follows:

– if v = τ(v′) then σ(v) = v′;

– if v′1 is the parent of v′2 in P, then for every internal node v on the path from τ(v′1) to τ(v′2) the value of σ(v) is equal to the edge between v′1 and v′2;

– for all other nodes of T the value of σ is left undefined.

For every v from T we can consider the set S(v) of all possible values of σ(v), for all possible embeddings τ. It is easy to check that these sets satisfy the following conditions:

[A1] if S(v) contains some v′ from V′ then color′(v′) = color(v);

[A2] if S(v) contains some v′ from V′ then

  – if v′ has the parent v′1 in P, then v has the parent v1 in T and S(v1) contains either v′1 or the edge between v′1 and v′;

  – for every child v′2 of v′ in P, the node v has a child v2 such that the set S(v2) contains either v′2 or the edge between v′ and v′2;

[A3] if S(v) contains the edge between some v′2 and its parent v′1 in P then

  – v has the parent v1 in T and S(v1) contains either v′1 or the edge between v′1 and v′2;

  – v has a child v2 such that S(v2) contains either v′2 or the edge between v′1 and v′2.

We cannot easily calculate the “true” sets S(v), so instead we calculate (and dynamically maintain during the entire work of the algorithms) some sets S̃(v) such that S(v) ⊆ S̃(v). Initially, we put

$$\tilde{S}(v) := \{v' \mid v' \in V' \text{ and } color'(v') = color(v)\} \cup E' \qquad (1)$$

and then diminish these sets as long as either [A2] or [A3] is violated.

As soon as we have calculated configuration C_v for some node v, we try to further trim S̃(v) in the following way. The set S̃(v) can be represented as the union $\tilde{S}(v) = \tilde{S}_V(v) \cup \tilde{S}_E(v)$, where $\tilde{S}_V(v) = \tilde{S}(v) \cap V(P)$ and $\tilde{S}_E(v) = \tilde{S}(v) \cap E(P)$. Now we can put

$$\tilde{S}(v) := \Bigl(\tilde{S}_V(v) \cap \bigcup_{w \in F \in C_v} V(P[w])\Bigr) \cup \tilde{S}_E(v) \qquad (2)$$

(and then check conditions [A2] and [A3], of course).

In its turn, the calculation of S̃(v) allows us to add in step (b) an additional restriction on the choice of j, namely, we can demand that j ∈ S̃(v).

Calculation of the S̃(v)’s can be done with only a linear slow-down of the algorithm. This can be implemented as follows. Each of the two conditions in [A2] and the two conditions in [A3] can be expressed by a logical formula of the form

$$u \in S(v) \Rightarrow u_1 \in S(v_1) \vee \cdots \vee u_k \in S(v_k) \qquad (3)$$

where v1, . . . , vk are adjacent to v in T and u1, . . . , uk are either adjacent or incident to u in P. Initially, each node v of T writes down these formulas for each element u from the set (1). As soon as the right-hand side of the implication (3) turns out to be empty (= false), the node removes u from S̃(v), and then v informs its parent (if it exists) and its children (if they exist) about this removal. Having got this information, the parent and the children delete the corresponding disjunctive terms in their formulas. This process can propagate along a chain, but since each time at least one disjunctive term is deleted, the total complexity is bounded by the total size of the initial formulas, which is at most $O(tp^2)$.

4. The window subsequence algorithm

4.1. Problem 2

We study Problem 2: given a target tree T and a pattern tree P, decide whether pattern P is an embedded subtree of tree T within a window of height w. We will solve an extended version of Problem 2, where we count the number of w-windows of T where P can be embedded.

The algorithm is somewhat similar to Algorithm 1, but the configurations contain not only the embedded bottom-up subtrees of P, but also the least possible depth of their images in T. We will thus store in configurations ordered pairs consisting of a bottom-up subtree t of P embedded in T, together with an integer n representing the length of the longest root-to-leaf path (in T) of the current embedding of t. The number n will be called a depth-stamp.

Moreover, for the extended version of Problem 2, a counter N will count the number of w-windows containing P.

Configurations will be replaced by s-configurations, where each occurrence of a subtree of P will be augmented by the least possible value of the depth-stamp. We will modify accordingly the definition of union of configurations to keep track of the depth-stamps. The intuitive meaning of stamp n ∈ ℕ in stamped subtree (t, n) will be that subtree t of P can be embedded in a subtree of T of height n located below the current position. Intuitively, each forest of the configuration at node v represents a set of subtrees of P which can be embedded in T[v] simultaneously, i.e. in such a way that the images of different trees do not intersect.

Definition 4.1. A stamped subtree is an ordered pair (t, n) where t is a bottom-up subtree of P, and n ∈ ℕ is called a depth-stamp.

A set F of stamped trees is called an s-forest, and it is said to be a min-s-forest if the following two conditions hold:

– there are no (t, n) and (t, n′) ∈ F such that n < n′, i.e. all its subtrees occur at the least possible depth in T;

– there are no (t, n) and (t′, n′) ∈ F such that n ≥ n′ and t is a proper bottom-up subtree of t′.

An s-forest F is said to dominate s-forest F′ if for every (t′, n′) from F′ there is a (t, n) from F such that

– t′ is a bottom-up subtree of t; and

– n′ ≥ n.

An s-configuration is a set C = {F1, F2, . . . , Fk}, where each Fi is a min-s-forest, and if i ≠ j then Fi does not dominate Fj.

Example 4.2. In a min-s-forest, only minimal stamped subtrees appear: for instance, considering the pattern of Figure 5 and identifying a subtree of P with the label of its root, {(1,1), (3,0), (4,1)} is a min-s-forest, while neither {(1,1), (4,1), (4,2)} nor {(1,1), (4,1), (3,1)} are min-s-forests. Forest F dominates forest F′ if, intuitively, all the possibilities of embeddings contained in F′ are subsumed by those contained in F. For example, considering again the pattern of Figure 5, forest {(2,1), (4,1)} dominates forests {(2,1), (3,1)} and {(2,1), (4,2)}, but forest {(2,1), (4,1)} does not dominate forest {(2,0), (4,2)}.

Definition 4.3. If an s-forest F is not a min-s-forest, we can associate with F a reduced forest, the min-s-forest D = red(F) obtained by the following algorithm:

(1) remove all stamped subtrees (t, n′) ∈ F such that there is a (t, n) ∈ F with n < n′: then all subtrees of P belonging to F are assigned the minimal possible depth-stamp;

(2) remove all stamped subtrees (t, n) ∈ F such that there is a (t′, n′) ∈ F such that n ≥ n′ and t is a proper bottom-up subtree of t′.

Notice that we remove stamped subtrees having the larger stamp n because only the subtrees having the minimal possible n will give us the best possible embeddings.

Example 4.4. red({(1,1), (4,1), (4,2)}) = {(1,1), (4,1)} because (4,1) has a depth-stamp less than (4,2), hence the latter can be dropped;

red({(1,1), (4,1), (3,1)}) = {(1,1), (4,1)} because 3 is a subtree of 4 and they have the same depth-stamp, hence (3,1) can be dropped.
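The reduction red and the domination test on s-forests can be replayed on Examples 4.2 and 4.4 with a short Python sketch. The representation is an assumption made for illustration: an s-forest is a set of pairs (label, stamp), and `proper_sub` is a hypothetical set of pairs (t, t′) meaning that P[t] is a proper bottom-up subtree of P[t′]. Only the relation "3 is a subtree of 4" is stated in the text, so only that pair is used here.

```python
def red(forest, proper_sub):
    """Definition 4.3: keep minimal stamps (step 1), then drop subsumed subtrees (step 2)."""
    step1 = {(t, n) for (t, n) in forest
             if not any(t2 == t and n2 < n for (t2, n2) in forest)}
    return {(t, n) for (t, n) in step1
            if not any((t, t2) in proper_sub and n >= n2 for (t2, n2) in step1)}


def dominates(f, g, proper_sub):
    """f dominates g: each (t2, n2) of g is covered by some (t, n) of f with n2 >= n."""
    return all(
        any((t2 == t or (t2, t) in proper_sub) and n2 >= n for (t, n) in f)
        for (t2, n2) in g
    )


PROPER_SUB = {(3, 4)}  # assumption: the only inclusion the text states for Figure 5

r1 = red({(1, 1), (4, 1), (4, 2)}, PROPER_SUB)
r2 = red({(1, 1), (4, 1), (3, 1)}, PROPER_SUB)
F = {(2, 1), (4, 1)}
checks = [dominates(F, {(2, 1), (3, 1)}, PROPER_SUB),
          dominates(F, {(2, 1), (4, 2)}, PROPER_SUB),
          dominates(F, {(2, 0), (4, 2)}, PROPER_SUB)]
```

The last check fails on the pair (2, 0): no element of F has a stamp of at most 0, whatever the inclusions between subtrees are.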

Definition 4.5. A subtree T′ of T is said to be a minimal subtree of T containing P iff P is an embedded subtree of T′, but there is no proper subtree T′′ of T′ such that P is an embedded subtree of T′′.

Definition 4.6. The union of two s-configurations $C = \{F_1, F_2, \ldots, F_k\}$ and $C' = \{F'_1, F'_2, \ldots, F'_{k'}\}$ is the s-configuration $D = C \otimes_s C'$ obtained as follows:

(1) let $D' = \{\mathrm{red}(F_i \cup F'_{i'}) \mid i \in \{1, \ldots, k\},\ i' \in \{1, \ldots, k'\}\}$;

(2) we pass from D′ to D by removing each Fj ∈ D′ which is dominated by some Fi ∈ D′.

Figure 4. Pattern P occurs in four 3-windows (windows of height at most 3) in target T.

Intuitively, (1) ensures that each forest of $C \otimes_s C'$ is a min-s-forest, namely we keep only the “best” (i.e. minimal) possible stamped subtrees, and (2) ensures that we keep only non-redundant min-s-forests, i.e. if all information from forest F′ is already present in forest F, we discard forest F′.

It is easy to see [1, 3] that a window of height w of T contains P as an embedded subtree iff it contains a minimal subtree of T containing P; therefore, it is enough to count the number of w-windows of T containing a minimal subtree containing P. For each node v of T we will compute an s-configuration, which will be a set of min-s-forests of stamped bottom-up subtrees of P. The idea of the algorithm is to increment the number of w-windows each time a stamped tree (P, d) with d ≤ w is found in one of the forests of the configuration.

We now present Algorithm 2 for Problem 2. See Figure 4 for an illustration of Algorithm 2 with w = 3.

Algorithm 2

Let r := the label of the root of P; N := 0; // initializations
FORALL nodes v of T visited in bottom-up order DO
  (1) IF node v is a leaf of T, THEN the configuration of v is the set of singletons {(i, 0)} where i is the label of a leaf v′ of P such that color(v′) = color(v),
      // if no leaf of P has the same color as v the configuration of v is the empty set.
  (2) IF node v is an internal node colored a, with children v1, . . . , vn, whose respective configurations are C_vi, i = 1, . . . , n, THEN DO
      (a) FOR i = 1, . . . , n, DO
            FOR Fj ∈ C_vi DO
              F′j := {(l, d + 1) | (l, d) ∈ Fj and h + d + 1 ≤ w where h is the height of l in P}
              // subtree (l, d + 1) cannot contribute to any embedding in a w-window if h + d + 1 > w
            ENDDO
            C_vi := {F′j | Fj ∈ C_vi}
          ENDDO
      (b) Δ := D := C_v1 ⊗s C_v2 ⊗s · · · ⊗s C_vn;
      (c) FORALL nodes w of P colored a and with label j DO // w has same color as v
            IF node w is not a leaf of P, // w is an internal node of P labeled j
            THEN Let j1, ..., jp be the children of j in P,
              FOR {(j1, d1), ..., (jp, dp)} ⊆ Fi ∈ D
                Δ := Δ ∪ {{(j, max{d1, ..., dp})}};
            ELSE // w is a leaf labeled j and colored a
              Δ := Δ ∪ {{(j, 0)}};
            ENDIF
          ENDDO
      (d) Reduce the forests in Δ and remove from Δ all dominated forests
      (e) Take the resulting configuration Δ for C_v
      (f) IF there are d and F ∈ C_v such that (r, d) ∈ F THEN DO
            let d0 be the least possible value of such a d;
            N := N + 1 + (w − d0);
            output “P is an embedded subtree of T within 1 + (w − d0) w-windows at node v”.
          ENDDO ENDIF
      ENDDO ENDIF
ENDDO

Notice that in step (c) of our algorithm, the loop

  FOR {(j1, d1), ..., (jp, dp)} ⊆ Fi ∈ D
    Δ := Δ ∪ {{(j, max{d1, ..., dp})}};

ensures that, for each “good” subset of each Fi, we add in the configuration a forest consisting of a single subtree P′ of P: the choice of subtree P′ excludes all other subtrees of P possible at that stage (because node v of T can be used to match only one subtree of P with root having the same label as v). Consider P, T as in Figure 5: the FOR loop in step (c) rightly prevents us from saying that P is embedded in T (by excluding the simultaneous embedding of subtrees 2 and 4 of P in the subtree with root colored b of T).
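The depth-stamp arithmetic of steps (a) and (f) of Algorithm 2 can be sketched separately from the rest of the algorithm. In this Python illustration a configuration is a list of sets of pairs (label, stamp); the function names, the map `height_in_P` (standing for the quantity h of step (a), taken as given) and the sample data are hypothetical.

```python
def shift_stamps(config, height_in_P, w):
    """Step (a): raise every stamp by one and keep only pairs with h + d + 1 <= w."""
    return [
        {(l, d + 1) for (l, d) in forest if height_in_P[l] + d + 1 <= w}
        for forest in config
    ]


def windows_at(config, root_label, w):
    """Step (f): the amount 1 + (w - d0) added to N at this node, or 0 if P is absent."""
    stamps = [d for forest in config for (l, d) in forest if l == root_label]
    return 1 + (w - min(stamps)) if stamps else 0


# Hypothetical data: subtree 1 has h = 0 and subtree 2 has h = 1.
shifted = shift_stamps([{(1, 0), (2, 1)}], {1: 0, 2: 1}, w=2)
added = windows_at([{(2, 1)}, {(2, 3)}], root_label=2, w=3)
```

With w = 2 the pair (2, 1) is dropped in step (a) because 1 + 1 + 1 > 2, and with w = 3 the least stamp d0 = 1 of the root label contributes 1 + (3 − 1) = 3 to the counter N.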

Figure 5. Pattern P is not embedded in a 2-window of target T.

4.2. Problem 3

Given two trees, target T and pattern P, we want to count the number of occurrences of P as an embedded subtree of T within a window of height w.

In order to solve Problem 3, configurations will now be replaced by ms-configurations, which are multisets of multiforests (i.e. multisets of stamped subtrees of P). We must also modify the definition of union of configurations to keep track of the multiplicities. Note that we now neither reduce the multiforests nor remove dominated multiforests.

Definition 4.7. An ms-configuration is a multiset C = {F1, F2, . . . , Fk}, where each Fi is a multiset of bottom-up stamped subtrees of P.

Definition 4.8. The union of two ms-configurations $C = \{F_1, F_2, \ldots, F_k\}$ and $C' = \{F'_1, F'_2, \ldots, F'_{k'}\}$ is the ms-configuration $D = C \otimes_{ms} C' = \{F_i \cup F'_{i'} \mid i \in \{1, \ldots, k\},\ i' \in \{1, \ldots, k'\}\}$.
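One possible reading of Definition 4.8 in Python uses `collections.Counter` for multiforests and a plain list for the multiset of multiforests. This is a sketch under an assumption the text does not spell out: the union of two multiforests is taken to be additive, i.e. multiplicities are summed.

```python
from collections import Counter


def ms_union(c1, c2):
    """C (x)ms C': one multiforest F + F' per pair (F, F'), duplicates kept."""
    return [f + g for f in c1 for g in c2]


# Hypothetical ms-configurations over stamped subtrees (label, stamp).
C1 = [Counter({(1, 0): 1})]
C2 = [Counter({(1, 0): 1}), Counter({(2, 1): 1})]
D = ms_union(C1, C2)
```

Unlike the unions of Definitions 3.3 and 4.6, nothing is reduced or pruned here, which is what allows Algorithm 3 to count every occurrence.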

Algorithm 3 below will count the number of embeddings of P as an embedded subtree of T within a w-window. The reader is invited to note that

(1) we now must count all embedded occurrences of P within a w-window, and not only the minimal ones;

(2) we consider P as a colored labeled tree, that is, each node has a (unique) label from {1, . . . , p} and a (not necessarily unique) color from the alphabet: different nodes can have the same color but not the same label. For instance, in Figure 6, P has 2 occurrences in a 2-window of T.

Figure 7 illustrates Algorithm 3 on the same P, T as in Figure 4, with w = 2. Figure 8 illustrates Algorithm 3 with thread-like trees P, T with w = 3.

Figure 6. Pattern P has 2 occurrences in a window of height 2 of T.

Figure 7. Pattern P has 5 occurrences (which are crossed with a backslash) in a 2-window (window of height at most 2) of target T.

Algorithm 3

Let r := the label of the root of P; N := 0; // initializations
FORALL nodes v of T visited bottom-up DO
  (1) IF node v is a leaf of T, THEN the configuration of v is the set of singletons {(i, 0)} where i is the label of a leaf v′ of P such that color(v′) = color(v),
      // the configuration of v is the empty set if no leaf of P has the same color as v.
  (2) IF node v is an internal node colored a, with children v1, . . . , vn, whose respective configurations are C_vi, i = 1, . . . , n, THEN DO
      (a) FOR i = 1, . . . , n, DO
            FOR Fj ∈ C_vi DO
              F′j := {(l, d + 1) | (l, d) ∈ Fj and h + d + 1 ≤ w where h is the height of l in P}
            ENDDO
            C_vi := {F′j | Fj ∈ C_vi}
          ENDDO
      (b) Δ := D := C_v1 ⊗ms C_v2 ⊗ms · · · ⊗ms C_vn;
      (c) FORALL nodes w of P colored a and labeled j DO // w has same color as v
            IF node w is not a leaf // w is an internal node labeled j
            THEN Let j1, ..., jp be the children of j in P; DO
              FORALL occurrences {(j1, d1), ..., (jp, dp)} ⊆ Fi ∈ D
                Δ := Δ ∪ {{(j, max{d1, ..., dp})}};
            ENDDO
            ELSE Δ := Δ ∪ {{(j, 0)}}; // w is a leaf labeled j
            ENDIF
          ENDDO
      (d) FORALL (r, dr) such that (r, dr) ∈ Fj ∈ Δ DO
            output “P is an embedded subtree of T within a w-window at node v”.
            N := N + 1;
            Fj := Fj \ {(r, dr)};
          ENDDO
      (e) C_v := Δ
      ENDDO ENDIF
ENDDO

Figure 8. Pattern P has 7 occurrences (crossed with a backslash) in a 3-window of target T.

5. Conclusion

In the present paper we answered some problems about tree inclusions, namely deciding whether a pattern is an embedded subtree of a target within a w-window, counting the number of windows of height at most w of the target containing the pattern as an embedded subtree, and counting the number of occurrences of the pattern as an embedded subtree of the target within a w-window.

There are many other interesting problems concerning tree inclusions, for instance, counting the number of windows of height exactly w of the target containing the pattern as an embedded subtree, or counting in a slice of the target tree.

References

[1] L. Boasson, P. Cegielski, I. Guessarian and Yu. Matiyasevich, Window accumulated subsequence matching is linear. Ann. Pure Appl. Logic 113 (2001) 59–80.

[2] Y. Chi, R. Muntz, S. Nijssen and J. Kok, Frequent subtree mining – an overview. Fund. Inform. 66 (2005) 161–198.

[3] P. Kilpeläinen, Tree matching problems with applications to structured text databases. Ph.D. thesis, Helsinki (1992). http://thesis.helsinki.fi/julkaisut/mat/tieto/vk/kilpelainen/

[4] P. Kilpeläinen and H. Mannila, Ordered and unordered tree inclusion. SIAM J. Comput. 24 (1995) 340–356.
