Hotlinks and Dictionaries

Université Libre de Bruxelles, Faculté des Sciences

Thesis submitted for the degree of Doctor of Science

Karim DOUIEB, academic year 2008–2009


Before starting this thesis, I did not see myself heading into the world of scientific research. It is thanks to the urging, advice and encouragement of Stefan, my advisor, that I chose to take this path. I would therefore like to thank him, because the choice has proved fulfilling in every respect. I would also like to express my full gratitude for his unconditional support, both scientific and moral. Stefan is a researcher of great quality, and he will remain a role model for me.

I would like to thank the Carleton Computational Geometry Group for inviting me several times. They are all wonderful people to work with, and I was always pleased to visit them. I am particularly grateful to Jit, who offered me the opportunity to pursue my research work in his group.

I would also like to thank Prosenjit Bose, Gerth Brodal, Ian Munro, Jean Cardinal, Samuel Fiorini and Sebastien Collette for kindly agreeing to be members of my committee and to review my thesis.

I would also like to thank the members of the Algo group as well as my colleagues in the computer science department of the ULB. They allowed me to work in a pleasant, friendly and productive environment.

Above all, I would like to warmly thank my parents, Sélim, Sofia and Dèa, for their support and their love.


This thesis has been written under the supervision of Prof. Stefan Langerman (Université Libre de Bruxelles, Belgium). The members of the committee are:

• Prof. Prosenjit Bose (Carleton University, Canada)

• Prof. Gerth Brodal (University of Aarhus, Denmark)

• Prof. Ian Munro (University of Waterloo, Canada)

• Prof. Jean Cardinal (Université Libre de Bruxelles, Belgium)

• Prof. Samuel Fiorini (Université Libre de Bruxelles, Belgium)

• Dr. Sebastien Collette (Université Libre de Bruxelles, Belgium)


Introduction

“Information is not knowledge”

— Albert Einstein

Knowledge has always been a decisive factor in humankind’s social evolution. Collecting the world’s knowledge is one of the greatest challenges of our civilization. Knowledge involves the use of information, but information is not knowledge: knowledge is a way of acquiring and understanding information. Improving the visibility and accessibility of information requires organizing it efficiently. This thesis focuses on that general goal.

A fundamental objective of computer science is to store and retrieve information efficiently. This is known as the dictionary problem. A dictionary is a data structure whose essential operation is search. In general, information that is important and popular at a given time has to be accessed faster than less relevant information.

This can be achieved by dynamically reorganizing the data structure so that relevant information is located closer to the starting point of the search. The second part of this thesis is devoted to the development and understanding of self-adjusting dictionaries in various models of computation. In particular, we focus on dictionaries that have no knowledge of future accesses. Such dictionaries have to adapt themselves to remain competitive with dictionaries specifically tuned for a given access sequence.

This approach, which transforms the structure of the information, is not always feasible. One reason can be that the structure is based on the semantics of the information, as in a categorization. In this context, the search procedure is tied to the structure itself, and modifying the structure affects how a search is performed. A solution developed to improve search in a static structure is hotlink assignment. It is a way to enhance a structure without altering its original design: the approach speeds up the search by creating shortcuts in the structure. The first part of this thesis is devoted to this approach.

Contributions and Acknowledgement

This work was supported by the F.R.I.A. (Fonds pour la formation à la Recherche dans l’Industrie et dans l’Agriculture). To our knowledge, all contributions of this work are original results. The results on hotlinks, presented in Part I, were found in collaboration with Stefan Langerman. The work on dynamic hotlinks, which can be found in Section 3.2 and Section 4.2.4, was presented and published at the 9th Workshop on Algorithms and Data Structures (WADS) [32]. An extended version of the paper was published in a special issue of the journal Algorithmica [34] dedicated to the best 7 papers of the conference. The work on near-entropy hotlink assignment, which can be found in Chapter 4, was presented and published at the 14th Annual European Symposium on Algorithms (ESA) [33]. An extended version of the paper will appear in the journal Algorithmica [31]. The work on self-adjusting dictionaries, presented in Part II, was done in collaboration with Stefan Langerman and Prosenjit Bose. Those results were presented at the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) [16].


Part I

Hotlinks


Chapter 1

Introduction

Contents

1.1 The Problem
    User Model(s)
    Objective Functions
    Lower Bound
    Top-down Methods
1.2 Related Work
    Complexity
    Exact Algorithms
    Path Length Approximations
    Gain Approximations
    Experimentations
    Applications
1.3 Contributions
    Tools
    Hotlink Assignment Methods for Lists
    Hotlink Assignment Methods for Trees
    Fast Hotlink Assignment Algorithm
    Lower Bound on Running Time of Approximation Methods

Netcraft [1] reported recently that the number of web sites in the World Wide Web has grown steadily, from approximately 23000 in 1995 to over 175 million by the middle of 2008. Finding information in such a large database is becoming a complex task.

There are many ways to speed up access to information on the Web. An interesting one is the hierarchical category approach, which consists in structuring the information according to a precise categorization of the data into topics. Hierarchical category structures have flourished since the popularization of the World Wide Web.

Yahoo [2] and the open directory service Dmoz [3] both developed web search facilities of this kind, and Google [4] developed an automatic classification system. Most of the time, the categories are organized as main categories and subcategories, so they can easily be represented by a categorization tree (see Fig. 1.1.a).

Figure 1.1: a. Example of a categorization tree, b. Example of hotlink assignment on a categorization tree (legend: hyperlinks, hotlinks).

The hope is that the relevance of search results is improved simply because the design is based on the categorization (we should not find information about the Chicago Bulls basketball team in the animals category). A search consists in traversing a path in the categorization tree from its root to the node corresponding to the desired topic. This path is sometimes long because the categorization tree is not necessarily balanced. Consider for example the search for the Web site of the vocalist Sade in the categorization of Dmoz (Top : World : Français : Art : Musique : Genres : Rhythm and Blues : Soul : Artistes : Sade): 10 steps are needed to reach the desired category. If the 590000 categories constituting Dmoz were arranged in a perfectly balanced tree, and given that a category has about 10 subcategories, the number of steps needed to reach a given category would be about 6.
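The estimate of about 6 steps is simply the depth of a balanced tree with branching factor 10 that is large enough to hold 590000 categories. A minimal Python sketch of that arithmetic follows; `balanced_depth` is a hypothetical helper name, and both inputs are the rounded figures quoted above.

```python
import math


def balanced_depth(num_categories: int, branching: int) -> int:
    """Smallest depth h such that branching**h >= num_categories."""
    depth, capacity = 0, 1
    while capacity < num_categories:
        capacity *= branching
        depth += 1
    return depth


if __name__ == "__main__":
    print(balanced_depth(590_000, 10))      # prints 6
    print(round(math.log10(590_000), 2))    # prints 5.77
```

The Sade example needs 10 steps, so an ideally balanced categorization would already save several steps for that destination.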

The main problem with the hierarchical approach resides in the fact that the access frequencies of the items are not considered. Indeed, a problematic situation occurs when items with high access frequencies are located deeper in the tree than the less popular items, implying that the most popular categories may require long access paths while the less significant ones may be much closer to the root of the tree. However, the search by categories has an intuitive and useful aspect for the users. To preserve those properties, an improvement of this approach should keep the structure unchanged. The solution discussed in this thesis does not change the original hyperlink structure but enhances it with additional hyperlinks in order to speed up the access to a destination. This addition of hyperlinks is called a hotlink assignment. The hotlink assignment problem was originally introduced by Perkowitz and Etzioni [69] to improve the search in Web sites.

Hotlinks are pointers added to a structure with the goal of improving its design by reducing the expected number of steps to reach an element. A hotlink can be seen as a shortcut from a category to a subcategory or more generally from a web page to another one (see Fig. 1.1.b).

1.1 The Problem

Formally, a web site can be modeled as a directed graph $G = (V, E)$ where the nodes $V$ correspond to the web pages and the edges $E$ represent the links. Each node carries a weight representative of its access frequency. We assume that all web pages are reachable from the homepage $r$. A hotlink $(x, x')$ is defined as an extra directed link from the page $x$ to the page $x'$. A hotlink assignment adds one, or up to $k$, hotlinks per page of a web site. Our goal in assigning hotlinks is to minimize the expected number of steps needed to reach a page from the homepage $r$.

We restrict our attention to the case where $G$ is a rooted directed tree $T$ with $n$ nodes and arbitrary maximum degree $d$ (the maximum number of children of a node). Some results of this thesis (in particular all those that relate the search time to the entropy bound) extend to general graphs by taking $T$ to be the shortest-path tree of $G$ from the homepage $r$. We assume that only the leaves of the tree contain information. Let $L$ be the set of all leaves in $T$. Every leaf $i$ in $L$ is associated with a weight $w_i$ representative of its access frequency, and $W = \sum_{i \in L} w_i$. The restriction that only the leaves are accessed can easily be removed by adding a leaf child to every node, with a weight corresponding to the access frequency of the node. This transformation only increases the length of the search paths by 1. We use $T_x$ to denote the subtree rooted at $x$ and $W(T_x)$ to denote its weight, i.e., the sum of the weights of its leaves. For simplicity, we refer to $W(T_x)$ as the weight of the node $x$.
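To make the notation concrete, the following Python sketch computes $W(T_x)$ for every node of a small tree. The dictionary-based tree representation, the function name `subtree_weights` and the example weights are assumptions made for illustration only.

```python
from typing import Dict, List


def subtree_weights(children: Dict[str, List[str]],
                    leaf_weight: Dict[str, float],
                    root: str) -> Dict[str, float]:
    """Return W(T_x) for every node x: the sum of the leaf weights in T_x.

    `children` maps an internal node to its children; leaves have no entry.
    Only leaves carry a weight, as assumed in the text.
    """
    weights: Dict[str, float] = {}

    def visit(x: str) -> float:
        kids = children.get(x, [])
        if not kids:
            weights[x] = leaf_weight[x]
        else:
            weights[x] = sum(visit(c) for c in kids)
        return weights[x]

    visit(root)
    return weights


# Hypothetical example: r has children a and b; a has leaf children c and d.
children = {"r": ["a", "b"], "a": ["c", "d"]}
leaf_weight = {"b": 3.0, "c": 1.0, "d": 2.0}
W = subtree_weights(children, leaf_weight, "r")
# W["a"] is 3.0 and W["r"] is 6.0, the total weight of the tree.
```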

User Model(s)

Two different user models have been analyzed in the literature. The greedy user model [46] assumes that from a node the user always takes the pointer that leads him as close as possible, in the initial structure, to his desired destination. We say that the hotlink $(y, y')$ crosses the hotlink $(x, x')$ if $x$ is an ancestor of $y$ and $x'$ is a descendant of $y$ but not of $y'$ (see Fig. 1.2.b). In this greedy user model, a hotlink $(x, x')$ is useless if $x$ is not an ancestor of $x'$ in the initial tree (see Fig. 1.2.a) or if it crosses another one. Assigning a hotlink that points to a node $i$ renders useless the original hyperlink that ends in $i$, because a user who does not follow the hotlink will not access $T_i$. We can thus assume that the hyperlink ending in $i$ is deleted. Hence a hotlink assignment in the greedy user model can be seen as an adoption (see Fig. 1.3). A node adopted by another one becomes its hotchild, while all the others remain the original children of their parent in the initial tree. Although each node could have up to $k$ hotlinks, the total number of hotlinks assigned is smaller than $n$ because there cannot be more than one hotlink pointing to each node.

Figure 1.2: Useless hotlinks in the greedy user model: a. a hotlink $(x, x')$ where $x$ is not an ancestor of $x'$, b. crossing hotlinks $(x, x')$ and $(y, y')$.


In contrast to the greedy user model, Matichin and Peleg [60] studied the clairvoyant user model. It assumes that the user always takes the shortest path in the enhanced structure from the root to the desired destination. It is equivalent to the greedy user model if hotlinks are not allowed to cross and if they are not allowed to link an element . In this thesis, as in the majority of the literature, we only consider the greedy user model.

Figure 1.3: Adoption in the greedy user model (legend: hyperlinks, hotlinks).

Objective Functions

A hotlink method $\mathcal{A}$ determines the hotlink assignment $A = \mathcal{A}(T)$ to be applied on a tree $T$. Let $T^A$ denote the tree $T$ enhanced by the hotlink assignment $A$. The average path length (or just path length), a measure of the average access time to the leaves in $T^A$, is defined as

\[
E[T^A, p] = \sum_{i \in L} d_A(i)\, p_i,
\]

where $d_A(i)$ is the distance to the leaf $i$ from the root of $T^A$, and $p = \langle p_i = w_i/W : i = 1, \ldots, n \rangle$ is the probability distribution on the nodes of the original tree $T$.
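The definition can be evaluated directly once adoption is simulated: each hotchild is re-attached under the source of its hotlink, and $d_A(i)$ is the number of pointers between the root and the leaf $i$ in the resulting tree. The Python sketch below does this under stated assumptions: the parent-map representation, the name `path_length` and the example tree are illustrative, and the hotlinks passed in are assumed to be useful in the greedy user model (each goes from an ancestor to a descendant, no two cross, and at most one points to any node).

```python
from typing import Dict, Hashable, List


def path_length(parent: Dict[Hashable, Hashable],
                leaf_weight: Dict[Hashable, float],
                hotlinks: Dict[Hashable, List[Hashable]]) -> float:
    """Average path length E[T^A, p] in the greedy user model.

    parent      : child -> parent in the original tree (the root has no entry)
    leaf_weight : leaf -> weight w_i
    hotlinks    : source -> list of hotchildren (assumed valid, see above)
    """
    # Adoption: the hyperlink ending at a hotchild is replaced by the hotlink.
    enhanced_parent = dict(parent)
    for source, targets in hotlinks.items():
        for target in targets:
            enhanced_parent[target] = source

    total = sum(leaf_weight.values())  # W
    expected = 0.0
    for leaf, w in leaf_weight.items():
        steps, node = 0, leaf
        while node in enhanced_parent:  # climb until the root is reached
            node = enhanced_parent[node]
            steps += 1
        expected += steps * w / total   # d_A(i) * p_i
    return expected


# Hypothetical example: r -> {a, b}, a -> {c, d}; leaf d is the popular one.
parent = {"a": "r", "b": "r", "c": "a", "d": "a"}
leaf_weight = {"b": 1.0, "c": 1.0, "d": 6.0}
before = path_length(parent, leaf_weight, {})           # 1.875
after = path_length(parent, leaf_weight, {"r": ["d"]})  # 1.125
```

In this example the single hotlink from the root to the heavy leaf reduces the average path length by 0.75.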

We are interested in finding an assignment $A$ that minimizes $E[T^A, p]$; we refer to this problem as path length minimization. Another formulation of the problem is the maximization of the gain $g(A) = E[T, p] - E[T^A, p]$. These formulations define essentially the same problem but differ in terms of approximation ratios. Let $\mathcal{A}^*$ be the optimal hotlink method, which determines the hotlink assignment $A^*$ that minimizes the path length. An $\alpha$-approximation assignment $A$ in terms of the gain or in terms of the path length guarantees $\alpha\, g(A) \ge g(A^*)$ or $\alpha\, E[T^{A^*}, p] \ge E[T^A, p]$, respectively.

A good approximation algorithm in terms of the gain does not necessarily yield a good approximation in terms of path length. Consider for example the 2-approximation algorithm in terms of the gain presented by Matichin and Peleg [61]. This algorithm assigns only hotlinks that bypass exactly one node. In the case where the input tree $T$ is a linked list of even length $\ell$ with only one leaf of weight 1 (see Fig. 1.4), the algorithm produces an enhanced tree $T^A$ which is equivalent to a linked list of length $\ell/2$. The optimal assignment in this case is trivial and consists in assigning the hotlink of the root to the unique leaf of the tree. The initial path length is $E[T, p] = \ell$, the path length of the enhanced tree produced by the approximation algorithm is $E[T^A, p] = \ell/2$ and the path length of the optimal enhanced tree is $E[T^{A^*}, p] = 1$. Thus the algorithm is a 2-approximation of the best assignment in terms of the gain, since $2 g(A) = \ell \ge \ell - 1 = g(A^*)$. However, in terms of path length the algorithm gives an approximation that is arbitrarily far from the optimal assignment for large values of $\ell$. This example also shows that, at least in some cases, a good approximation ratio in terms of the gain does not necessarily reflect the quality we expect of a good assignment.

Figure 1.4: a. Tree $T$, b. Tree $T^A$, c. Tree $T^{A^*}$.
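The list example can be replayed numerically. The Python sketch below assumes nodes numbered $0, \ldots, \ell$ along the list (0 is the root, $\ell$ the unique leaf) and at most one hotlink per node; the helper name `steps_to_leaf` and the choice $\ell = 1000$ are illustrative.

```python
from typing import Dict


def steps_to_leaf(length: int, hotlinks: Dict[int, int]) -> int:
    """Steps a greedy user needs on the list 0 -> 1 -> ... -> length.

    Every hotlink is assumed to jump forward without passing the leaf, so
    taking the hotlink of the current node (when it has one) always brings
    the user closer to the unique destination.
    """
    node, steps = 0, 0
    while node != length:
        node = hotlinks.get(node, node + 1)
        steps += 1
    return steps


length = 1000
plain = steps_to_leaf(length, {})                                          # 1000
skip_one = steps_to_leaf(length, {x: x + 2 for x in range(0, length, 2)})  # 500
optimal = steps_to_leaf(length, {0: length})                               # 1

gain_skip_one = plain - skip_one   # 500
gain_optimal = plain - optimal     # 999
```

Here `2 * gain_skip_one >= gain_optimal` holds, so the assignment is a 2-approximation of the gain, while its path length is 500 times the optimal one.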


Lower Bound

A lower bound on the path length $E[T^A, p]$ was given by Bose et al. [17] using Shannon's theory [5] on the average code length over an alphabet of a given size. Let $H(p)$ be the entropy of the probability distribution $p$, defined by

\[
H(p) = -\sum_{i \in L} p_i \log p_i.
\]

Then for every assignment of at most $k$ hotlinks per node, the path length of a tree of maximum degree $d$ is at least $H(p)/\log(d+k)$. The tree could be a list, in which case we have a lower bound of $H(p)/\log(1+k)$. We define a near-entropy hotlink assignment method as a method that guarantees a path length of $O(H(p))$.
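As a quick numerical illustration of this bound, the Python sketch below evaluates $H(p)$ and $H(p)/\log(d+k)$. Base-2 logarithms are an assumption made here, and the function names and the uniform example distribution are illustrative.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Entropy H(p) in bits; zero probabilities contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def path_length_lower_bound(p: Sequence[float], d: int, k: int) -> float:
    """Entropy lower bound H(p) / log(d + k) on the path length."""
    return entropy(p) / math.log2(d + k)


uniform = [1 / 8] * 8  # eight equally likely leaves, H(p) = 3
binary_one_hotlink = path_length_lower_bound(uniform, d=2, k=1)  # 3 / log2(3)
list_one_hotlink = path_length_lower_bound(uniform, d=1, k=1)    # 3.0
```

For the list ($d = 1$) with one hotlink per node the bound is exactly $H(p)$, which is why a method with path length $O(H(p))$ is within a constant factor of the best possible there.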

Top-down Methods

In this thesis we focus our attention on recursive algorithms that first choose the hotlinks of the root of the tree $T$, simulate the adoption (see Fig. 1.3), and recursively assign hotlinks to the children and hotchildren of the root. We call such an algorithm top-down if the hotlink assignment of a subtree only depends on the subtree itself minus the subtrees adopted by its own ancestors. Thus any top-down method is fully characterized by the choice of the hotchildren of a node. In this context $T_x^A$ denotes the subtree rooted at $x$ minus the subtrees adopted by its ancestors.

Top-down methods have the property that they guarantee the same upper bound on the path length for all subtrees composing the augmented tree $T^A$. Namely, if a top-down method $\mathcal{A}$ guarantees $E[T^A, p] \le \alpha H(p)$ for some fixed $\alpha$, then $E[T_x^A, p^{(x)}] \le \alpha H(p^{(x)})$ for each element $x \in T$, where $p^{(x)}$ is the normalized access probability distribution on the leaves inside the subtree $T_x^A$. Any method satisfying this property is called strong. Therefore top-down methods are strong.

1.2 Related Work

Complexity

The idea of hotlinks was suggested by Perkowitz and Etzioni [69] to improve the search in Web sites (seen as DAGs). Later, Bose et al. [17] proved that finding the optimal hotlink assignment for a DAG is NP-hard, and analyzed several heuristics for assigning hotlinks. The majority of the work done on hotlinks restricts the input structure to general trees. Recently Jacobs [51] proved that finding the optimal hotlink assignment for an arbitrary tree is NP-hard even if only one hotlink is assigned per node of the tree.

Exact Algorithms

Pessoa et al. [70] and Gerstel et al. [46] independently discovered a polynomial time dynamic programming algorithm for finding the optimal placement of hotlinks when restricted to a tree whose depth is logarithmic in the number of nodes (both results have been merged in a journal article [45]). The running time of the algorithm of Gerstel et al. is $O(n 3^D)$ where $D$ is the height of the output tree. In a model where hotlinks are only allowed to point at the leaves of the tree, Jacobs [51] found an optimal hotlink assignment algorithm running in $O(n^4)$ time.

Path Length Approximations

Previous works focusing on path length approximation use the greedy user model. For single hotlink assignment methods (assigning exactly 1 hotlink per node), Kranakis, Krizanc and Shende [56] gave an $O(n^2)$ time algorithm which guarantees that the path length attains the entropy bound within a constant factor, i.e., it is at most

\[
\frac{H(p)}{\log(d+1) - \frac{d}{d+1}\log d} + \frac{d+1}{d}.
\]

Still for the path length minimization problem, but with respect to the approximation factor, the first 2-approximation algorithm, running in $O(n^4)$ time, was given by Jacobs [52].

It seems difficult to directly generalize those methods to multiple hotlink assignments (at most $k$ hotlinks per node). Some studies have been done on this topic; namely, Fuhrmann et al. [41] present algorithms to reduce the height of a $d$-regular complete tree by a constant factor. The algorithms for optimal hotlink assignment by dynamic programming allow the assignment of $k$ hotlinks per node [46, 70]. The method of Kranakis et al. [56] has been generalized by Bose et al. [18] to assign $k$ hotlinks per node, but it is restricted to binary trees; it guarantees an average access time of at most $\frac{H(p)}{\log(k+2) - 1} + 1$.

Gain Approximations

Single hotlink assignment methods: Matichin and Peleg [61] gave the first 2-approximation algorithm in terms of the gain in both user models. Jacobs [52] generalized this algorithm to obtain a PTAS in terms of the gain. The natural Greedy top-down method, which assigns the hotlink of the root achieving the greatest gain, has exhibited the best performance among the algorithms studied experimentally by Czyzowicz et al. [26]. Matichin and Peleg [60] showed that the Greedy method is a 2-approximation in terms of the gain in the clairvoyant user model. Jacobs [52] proved that it is also a 2-approximation in terms of the gain in the greedy user model.

Multiple hotlink assignment methods: Every method presented in the previous paragraph, which approximates the gain, has been generalized to assign up to $k$ hotlinks per node. However, no approximation results depending on $k$ have been presented.
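The Greedy top-down method for a single hotlink per node can be sketched as follows. This is an illustrative and deliberately naive Python version, not the implementation evaluated in the cited experiments: the tree representation, the function names and the tie-breaking are assumptions, subtree weights are recomputed from scratch, and the gain of a hotlink from $x$ to a descendant $h$ at depth $t$ below $x$ is taken as $(t-1)\,W(T_h)$ in the current subtree of $x$.

```python
from typing import Dict, Hashable, List, Optional


def greedy_top_down(children: Dict[Hashable, List[Hashable]],
                    leaf_weight: Dict[Hashable, float],
                    root: Hashable) -> Dict[Hashable, Hashable]:
    """Return {node: hotchild} chosen greedily from the root downwards."""
    # Working copy of the tree; adoptions modify it in place.
    kids = {x: list(cs) for x, cs in children.items()}
    parent = {c: x for x, cs in kids.items() for c in cs}
    hotchild: Dict[Hashable, Hashable] = {}

    def weight(x: Hashable) -> float:
        cs = kids.get(x, [])
        if not cs:
            # A real leaf, or an internal node whose subtree was adopted away.
            return leaf_weight.get(x, 0.0)
        return sum(weight(c) for c in cs)

    def best_hotchild(x: Hashable) -> Optional[Hashable]:
        best, best_gain = None, 0.0
        stack = [(c, 1) for c in kids.get(x, [])]
        while stack:
            node, depth = stack.pop()
            gain = (depth - 1) * weight(node)  # leaves below node get closer
            if gain > best_gain:
                best, best_gain = node, gain
            stack.extend((c, depth + 1) for c in kids.get(node, []))
        return best

    def assign(x: Hashable) -> None:
        h = best_hotchild(x)
        if h is not None:
            kids[parent[h]].remove(h)         # hyperlink ending at h is dropped
            kids.setdefault(x, []).append(h)  # h is adopted by x
            parent[h] = x
            hotchild[x] = h
        for c in list(kids.get(x, [])):       # children and hotchild of x
            assign(c)

    assign(root)
    return hotchild


# On a list 0 -> 1 -> 2 -> 3 -> 4 with a single leaf, the root links to the leaf.
result = greedy_top_down({0: [1], 1: [2], 2: [3], 3: [4]}, {4: 1.0}, 0)
# result is {0: 4}
```

Because each recursive call only looks at the current subtree of its node, after the subtrees adopted by ancestors have been detached, the sketch follows the top-down pattern defined in Section 1.1.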

Experimentations

Experimental results [26, 71] have demonstrated the validity of the hotlinks approach. Jacobs [50] studied the most recent hotlink assignment methods. A software tool to structure websites efficiently by automatic assignment of hotlinks has been developed by Kranakis et al. [55].

Applications

The concept of hotlinks can be applied to problems other than web structuring. For instance, Bose et al. [18] use hotlink assignments to design efficient asymmetric communication protocols. Hotlinks can also be used to design data structures, as was demonstrated by Brönnimann, Cazals and Durand [19] with their jumplist dynamic dictionary data structure. The jumplist structure can be seen as a randomized hotlink assignment on a list, and is meant as a simplification of the skiplist structure [72]. Deterministic versions of the randomized jumplist were developed independently by Elmasry [36] and by Douïeb and Langerman [32].

1.3 Contributions

In this thesis we mainly develop hotlink assignment methods for lists and arbitrary trees in the greedy user model. Those methods are focused on achieving the best path length upper bound or the best approximation ratio in terms of path length. All methods given in this section are new.


Tools

In Chapter 2, we give a general framework for analyzing both the path length upper bound (Section 2.1) and the approximation ratio (Section 2.2) of many hotlink assignment methods.

Hotlink Assignment Methods for Lists

In Chapter 3, Section 3.1, we give several static methods for assigning one hotlink per node in a list. In Subsection 3.1.1, we develop a dynamic programming algorithm for computing the optimal assignment; the algorithm has an $O(n^3)$ running time. In Subsection 3.1.2, we develop an algorithm for computing a nearly optimal assignment. The algorithm has $O(n)$ running time. In Subsection 3.1.3, we show how we can develop hotlink assignment methods for lists based on algorithms for building binary search trees. In Section 3.2, we show how a nearly optimal assignment can be maintained dynamically under update operations such as insertion, deletion and reweighting.

Hotlink Assignment Methods for Trees

In Chapter 4, we develop hotlink assignment methods for assigning 1 or up to $k$ hotlinks per node of trees. Section 4.2 first considers the assignment of one hotlink per node of a tree. The best previous algorithm, the KKS method [56], guarantees a path length of at most $\frac{H(p)}{\log(d+1) - \frac{d}{d+1}\log d} + \frac{d+1}{d}$; its asymptotic behavior is $H(p)\,\frac{d}{\log d}$ for sufficiently large values of $d$ (the maximum degree of the tree). This upper bound is shown to be tight in Section 4.2.1. We also show that the KKS method is a $(d+1)$-approximation of the optimal path length.

New top-down methods for assigning one hotlink per node are presented. The MinMax method, an intuitive variant of the previous method, is presented in Section 4.2.2 and shown not to improve significantly over KKS. In Section 4.2.3, the h/ph method guarantees a path length of at most $1.141 H(p) + 1$. We also show that the h/ph method is a 3-approximation of the optimal path length. These performances, in contrast to those of KKS or MinMax, are completely independent of the maximum degree of the tree and improve on KKS for all values of $d > 2$. Furthermore, the h/ph method matches the bound of KKS for $d = 2$. Those methods are top-down, thus they are strong in the sense that they guarantee the same average access time for each subtree in the enhanced tree.


In Section 4.2.4, we present the first $O(n)$ time algorithm for assigning one hotlink per node so that the number of steps to access a leaf $x$ from the root of the tree reaches the entropy bound, i.e., is at most $O(\log \frac{W}{w_x})$ where $W = \sum_{i \in L} w_i$. We also give the first efficient data structure for maintaining hotlinks when nodes are added, deleted or their weights modified, in amortized time $O(\log \frac{W}{w_x})$ per update. The data structure can be made adaptive, i.e., it reaches the entropy bound in the amortized sense without knowing the weights $w_i$ in advance.

In Section 4.3, we consider multiple hotlink assignment methods for trees. In Section 4.3.1, we present a natural generalization of the algorithm of Bose et al. [18] for assigning $k$ hotlinks per node in trees of arbitrary maximum degree $d$ instead of binary trees; it guarantees an upper bound on the path length of $\frac{H(p)}{\log(k+d) - \log d} + 1$. We also show that this method gives a $\frac{1}{1 - \frac{dk}{k+d}}$-approximation of the optimal path length. As the performance of this method degrades when $d$ grows, we give in Section 4.3.2 a second method, the Above $\frac{1}{k+1}$ method, whose path length is at most

\[
\min\left\{ \frac{2H(p)}{\log(k+1)},\ \frac{H(p)}{\log(k+1) - \log d} \right\} + 1,
\]

constituting the first multiple hotlink assignment method giving a near-entropy bound that is independent of the degree. We also show that the Above $\frac{1}{k+1}$ method is a $\frac{(k+1)(4k+2)}{k^2+k+1}$-approximation (at most $\approx 4.3$ for all values of $k$) of the optimal path length, which constitutes the first constant-approximation multiple-hotlink assignment method. Those multiple assignment methods are top-down, thus they are strong.

Table 1.1 summarizes the performances of hotlink assignment methods for trees.

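Table 1.1: Performances of hotlink assignment methods for trees with maximum degree $d$.

| Method | Path length | Path length approx. | Running time | Multiple | Dynamic |
|---|---|---|---|---|---|
| Prior methods | | | | | |
| KKS [56] | $\le \frac{H(p)}{\log(d+1) - \frac{d}{d+1}\log d} + \frac{d+1}{d}$ [56]; $\ge \frac{H(p)}{\log(d+1) - \frac{d}{d+1}\log d} + \tau$ [Lemma 4.1] | $(d+1)$ [Thm 4.2] | $O(n^2)$ [56]; $O(n \log n)$ [Thm 5.1] | No | No |
| Heavy Centipede [52] | ? | 2 [52] | $O(n^4)$ [52] | No | No |
| New methods | | | | | |
| MinMax | $\le \frac{H(p)}{\log(d+1) - \frac{d}{d+1}\log d} + \frac{d+1}{d}$ [Thm 4.3]; $\ge \frac{H(p)(d^2+3d)}{2(d+1)\log(d+1)}$ [Lemma 4.4] | $(d+1)$ [Thm 4.5] | $O(n \log n)$ [Thm 5.1] | No | No |
| h/ph | $\le 1.141 H(p) + 1$ [Thm 4.7] | 3 [Thm 4.8] | $O(n \log n)$ [Thm 5.1] | No | No |
| Heavy Paths | $\le 3H(p)$ [Thm 4.11] | $3\log(d+1)$ [Thm 4.12] | $O(n)$ [Sec. 4.2.4] | No | Yes |
| Below $\frac{d}{k+d}$ | $\le \frac{H(p)}{\log(k+d) - \log d} + 1$ [Thm 4.13] | $\frac{1}{1 - \frac{dk}{k+d}}$ [Thm 4.14] | $O(n \log n)$ [Thm 5.1] | Yes | No |
| Above $\frac{1}{k+1}$ | $\le \frac{2H(p)}{\log(k+1)} + 1$ [Thm 4.16] | $\frac{4k^2+6k+2}{k^2+k+1} < 4.31$ [Thm 4.17] | $O(n \log n)$ [Thm 5.1] | Yes | No |
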

Fast Hotlink Assignment Algorithm

In Chapter 5 we develop a fast algorithm for several strong methods; it uses an enhanced version of the link-cut trees of Sleator and Tarjan [76] and performs the hotlink assignment in $O(n \log n)$ time for all our strong methods and the KKS method [56]. This is an improvement over the previous $O(n^2)$ algorithms.

Lower Bound on Running Time of Approximation Methods

Finally, in Chapter 6 we give an $\Omega(n \log n)$ lower bound on the running time of any strong near-entropy hotlink assignment method which guarantees a path length of at most $\alpha H(p)$, and of any method that is an $\alpha$-approximation with $\alpha < 2$.


Chapter 2

Top-Down Methods

A top-down hotlink assignment method $\mathcal{A}$ is defined as a method which always performs the hotlink assignment of an element $x$ after the assignment of its ancestors. The hotchild of $x$ must be contained in $T_x^A$ and its choice must depend only on $T_x^A$, where $T_x^A$ is the subtree $T_x$ rooted at $x$ minus the subtrees adopted by its own ancestors. Thus any top-down method is fully characterized by the choice of the hotchildren of a node.

The next sections describe several lemmas which will help in analyzing

• the upper bound on the path length, and

• the approximation ratio in terms of the path length for many top-down hotlink assignment methods.

2.1 Path Length Upper Bound

We start with a lemma concerning the entropy. Consider a probability distribution $p = \langle p_1, p_2, \ldots, p_n \rangle$ and a partition $A_1, A_2, \ldots, A_k$ of the index set $\{1, 2, \ldots, n\}$ into $k$ non-empty subsets. Define the weight of a subset $A_i$ as $S_i = \sum_{j \in A_i} p_j$ for $i = 1, 2, \ldots, k$. Consider the new distributions $p^{(i)} = \langle p^{(i)}_j := p_j / S_i : j \in A_i \rangle$ for $i = 1, 2, \ldots, k$. Kranakis, Krizanc and Shende [56] proved the following lemma:


Lemma 2.1 [56] For any partition $A_1, A_2, \ldots, A_k$ of the index set of the probability distribution we have the identity

\[
H(p) = \sum_{i=1}^{k} S_i H(p^{(i)}) - \sum_{i=1}^{k} S_i \log S_i,
\]

where $S_i$ and $p^{(i)}$ are defined in the above equations.
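The identity is easy to check numerically. The Python sketch below evaluates both sides for one distribution and one partition; the function names, the example values and the use of base-2 logarithms are illustrative choices (the identity holds for any logarithm base).

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """H(p) with base-2 logarithms; zero probabilities contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def decomposition(p: Sequence[float], parts: List[List[int]]) -> float:
    """Right-hand side of Lemma 2.1 for a partition of the indices of p.

    Every part is assumed to be non-empty and to have positive weight S_i.
    """
    total = 0.0
    for part in parts:
        s = sum(p[j] for j in part)              # S_i
        conditional = [p[j] / s for j in part]   # p^(i)
        total += s * entropy(conditional) - s * math.log2(s)
    return total


p = [0.1, 0.2, 0.3, 0.4]
parts = [[0, 1], [2, 3]]   # A_1 = {0, 1}, A_2 = {2, 3} (0-based indices)
lhs = entropy(p)
rhs = decomposition(p, parts)
# lhs and rhs agree up to floating-point rounding.
```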

Define a leader set as a connected subset of nodes in $T^A$ that includes the root. A leader set $X$ induces a partition of all other nodes in $T^A$ into several subtrees $T_1^A, T_2^A, \ldots, T_k^A$ with corresponding weights $S_i = W(T_i^A)$ for $i = 1, \ldots, k$. These subtrees are rooted at the children and hotchildren of nodes in $X$ that are not in $X$ (see Fig. 2.1). Each subtree $T_i^A$ has a depth $d(T_i^A)$, corresponding to the number of pointers that we must follow from the root of $T^A$ to reach $T_i^A$.

Figure 2.1: Partition of the tree $T^A$ into subtrees $T_1^A, T_2^A, \ldots, T_k^A$ by a leader set (black nodes).

Lemma 2.2 Given a top-down hotlink assignment method $\mathcal{A}$, suppose we can fix a constant $a$ such that for each tree $T$ there exists a leader set partitioning $T^A$ into subtrees $T_1^A, T_2^A, \ldots, T_k^A$ of weights $S_1, S_2, \ldots, S_k$ that satisfy

\[
a \ge -\frac{\sum_{i=1}^{k} S_i\, d(T_i^A)}{\sum_{i=1}^{k} S_i \log S_i},
\]

then the path length of the tree $T^A$ is at most $aH(p) + 1$.


Proof By induction on the depth of the tree. If the depth is equal to 1, then the lemma is true for any positive value of $a$, since the path length is always 1. Assume the induction hypothesis is valid for the subtrees $T_i^A$ for all $i$. We now compute the path length of the tree $T^A$:

\[
\begin{aligned}
E[T^A, p] &= \sum_{i=1}^{k} S_i \left( d(T_i^A) + E[T_i^A, p^{(i)}] \right) \\
&= \sum_{i=1}^{k} S_i\, d(T_i^A) + \sum_{i=1}^{k} S_i\, E[T_i^A, p^{(i)}] \\
&\le \sum_{i=1}^{k} S_i\, d(T_i^A) + \sum_{i=1}^{k} S_i \left( a H(p^{(i)}) + 1 \right) \\
&= \sum_{i=1}^{k} S_i\, d(T_i^A) + a H(p) + a \sum_{i=1}^{k} S_i \log S_i + \sum_{i=1}^{k} S_i \\
&\le a H(p) + 1.
\end{aligned}
\]

The third line follows from the induction hypothesis, the fourth is obtained using Lemma 2.1 (where $A_i$ corresponds to the set of leaves in the subtree $T_i^A$), and the last inequality, which uses $\sum_{i=1}^{k} S_i = 1$, is valid if we choose the constant $a$ such that

\[
\sum_{i=1}^{k} S_i\, d(T_i^A) + a \sum_{i=1}^{k} S_i \log S_i \le 0.
\]

That is,

\[
-\frac{\sum_{i=1}^{k} S_i\, d(T_i^A)}{\sum_{i=1}^{k} S_i \log S_i} \le a.
\]

Finally we generalize Lemma 5 of [56]:


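Lemma 2.3 For any fixed constant $\alpha \in [0, \frac{1}{2}]$, the solutions of the optimization problem

\[
\begin{aligned}
\text{maximize} \quad & f(s_1, s_2, \ldots, s_k) = \sum_{i=1}^{k} s_i \log s_i \\
\text{subject to} \quad & 0 \le s_i \ \ \forall i, \qquad \sum_{i=1}^{k} s_i = 1, \qquad \alpha \le s_k \le 1 - \alpha,
\end{aligned}
\]

are obtained either when $s_k = \alpha$ and one among the quantities $s_1, s_2, \ldots, s_{k-1}$ attains the value $1 - \alpha$, or when $s_k = 1 - \alpha$ and one among the quantities $s_1, s_2, \ldots, s_{k-1}$ attains the value $\alpha$, while the other quantities are equal to 0.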

Proof This optimization problem is similar to the minimization of the entropy (equal to $-f$) of $p = \langle s_1, s_2, \ldots, s_k \rangle$ with the same constraints. As the entropy is a concave function (Theorem 2.7.3 in [25]), $f$ is convex. Thus the maximum of the function $f$ is attained at a vertex of the polytope defined by the constraints of the optimization problem.

The polytope corresponding to all constraints except the last one (i.e. $\alpha \le s_k \le 1 - \alpha$) is a simplex and its vertices are the unit vectors ($s_i = 1$, $s_{j \ne i} = 0$) for all $i = 1, \ldots, k$. None of these vertices satisfies this last constraint. Thus all the vertices of the polytope are on the boundary of the last constraint. In other words, they have their component $s_k$ equal either to $\alpha$ or to $1 - \alpha$ (the end point values of $s_k$). Once $s_k$ is fixed, the constraints become $s_i \ge 0$ and $\sum_{i=1}^{k-1} s_i = 1 - s_k$, which is a simplex as well. If a vertex has its component $s_k$ equal to $\alpha$, then one of its other components is exactly equal to $1 - \alpha$ and all other components are equal to 0. Otherwise, if a vertex has its component $s_k$ equal to $1 - \alpha$, then one of its other components is exactly equal to $\alpha$ and all the other components are equal to 0. Thus the value of the function $f$ is the same for all those vertices and equals

\[
\alpha \log \alpha + (1 - \alpha) \log(1 - \alpha).
\]
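A randomized sanity check of the lemma can be sketched in Python: sample feasible points and confirm that none of them exceeds the value attained at the vertices. The sampling scheme, the function names, the seed and the parameters $k = 5$, $\alpha = 0.2$ are all illustrative assumptions; this is a numerical check, not a proof.

```python
import math
import random
from typing import List, Sequence, Tuple


def f(s: Sequence[float]) -> float:
    """f(s) = sum of s_i log s_i, with the convention 0 log 0 = 0."""
    return sum(x * math.log2(x) for x in s if x > 0)


def random_feasible(k: int, alpha: float, rng: random.Random) -> List[float]:
    """Random point with s_i >= 0, sum s_i = 1 and alpha <= s_k <= 1 - alpha."""
    s_k = rng.uniform(alpha, 1 - alpha)
    raw = [rng.random() for _ in range(k - 1)]
    scale = (1 - s_k) / sum(raw)
    return [x * scale for x in raw] + [s_k]


def check(k: int, alpha: float, samples: int, seed: int = 0) -> Tuple[float, float]:
    """Return (largest sampled value of f, value of f at a vertex)."""
    rng = random.Random(seed)
    vertex = f([1 - alpha] + [0.0] * (k - 2) + [alpha])
    largest = max(f(random_feasible(k, alpha, rng)) for _ in range(samples))
    return largest, vertex


largest, vertex = check(k=5, alpha=0.2, samples=10_000)
# largest never exceeds vertex = 0.2 log 0.2 + 0.8 log 0.8.
```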


2.2 Path Length Approximation Ratio

Here we give a general framework to analyze the approximation ratio in terms of the path length for any top-down hotlink assignment method. We first show how to transform the optimal assignment on an input tree into another. Then we present a lemma that shows when the transformation does not increase the path length by too much.

Transformation

The transformation allowing the modification of an assignment into another is described by two basic operations. It is similar to the technique developed by Jacobs [52]. The transformation uses basic operations in order to modify all hotlinks of a given node. Note that the transformation considers the general case where an element has $k$ hotlinks. Here is the description of the basic operations:

• PUSHDOWN($u$): Define $H = \{h_1, \ldots, h_k\}$ as the set of hotchildren of $u$. The operation PUSHDOWN($u$) deletes the hotlink $(u, h_i)$ for every $h_i$ that is a grandchild of $u$ (at distance 2 from $u$ in $T$). Let $V$ be the set of direct children of $u$ that are ancestors of at least one $h_i \in H$. Apply PUSHDOWN($v$) recursively for all $v \in V$ and replace $(u, h_i)$ by $(v, h_i)$ in the assignment for every hotchild $h_i$ that is a descendant of $v$ (see Fig. 2.2).

Figure 2.2: Basic operation: PUSHDOWN($u$).

• INSERT($u, h$): Consider $u, h \in T$ with $h$ a descendant of $u$. Let $u_1, u_2, \ldots, u_\ell$ be the elements on the path from $u$ to $h$ that have at least one hotchild in $T_h$, ordered by increasing depth (they do not include $u$ or $h$). Furthermore, let $H_i$ be the set of hotchildren of $u_i$ in $T_h$. The operation INSERT($u, h$) consists, for $i = \ell, \ldots, 1$, in applying PUSHDOWN($h$) and replacing $(u_i, h_j)$ by $(h, h_j)$ for all $h_j \in H_i$. After these $\ell$ iterations, add $(u, h)$ to the assignment (see Fig. 2.3).

Figure 2.3: Basic operation: INSERT($u, h$).

Here we describe the procedure that transforms, for a specific element $x$, an assignment $A'$ into another assignment $A$. Let $H' = \{h'_1, \ldots, h'_k\}$ and $H = \{h_1, \ldots, h_k\}$ be the sets of hotchildren of $x$ in the assignment $A'$ and in the assignment $A$, respectively. The transformation performs PUSHDOWN($x$), which has the effect of moving down all $h'_i \in H'$. Then it performs INSERT($x, h_i$) for all $h_i \in H$ (see Fig. 2.4).

Lemma 2.4 Consider the tree $T$ under the assignment $A'$. PUSHDOWN($u$) increases by at most 1 the distance between $u$ and every leaf in $T_u^{A'}$.

Proof We prove this lemma by induction on the height of the tree. It is trivially true for trees of height 1. Consider now the subtree $T_u^{A'}$ and assume by induction that the lemma is true for the subtrees rooted at the original children of $u$. If $u$ does not have any hotchildren, or if the hotlinks of $u$ bypass at most one node, then the lemma is satisfied. Otherwise PUSHDOWN($v$), with $v$ an original child of $u$, increases by at most 1 the distance between $v$ (and $u$ at the same time) and the leaves in $T_v^{A'}$. Replacing $(u, h_i)$ by $(v, h_i)$ increases by exactly one the distance from $u$ to the leaves in $T_{h_i}^{A'}$. Thus PUSHDOWN($u$) increases by at most 1 the distance between $u$ and the leaves in $T_u^{A'}$.