2.3 DATA STRUCTURES

In this section we cover three basic data structures used to organize data: search trees, digital trees, and hashing. They are used not only for storing text in secondary memory, but also as components in searching algorithms (especially digital trees). We do not describe arrays, because they are a well-known structure that can be used to implement static search tables, bit vectors for set manipulation, suffix arrays (Chapter 5), and so on.

These three data structures differ in how a search is performed. Trees define a lexicographical order over the data. However, in search trees the complete value of a key is used to direct the search, while in digital trees the digital (symbol) decomposition of the key is used to direct the search. Hashing, on the other hand, "randomizes" the data order and can therefore search faster on average, with the disadvantage that scanning in sequential order is not possible (for example, range searches are expensive).

Some examples of their use in subsequent chapters of this book are:

Search trees: for optical disk files (Chapter 6), prefix B-trees (Chapter 3), stoplists (Chapter 7).

Hashing: hashing itself (Chapter 13), string searching (Chapter 10), associative retrieval, Boolean operations (Chapters 12 and 15), optical disk file structures (Chapter 6), signature files (Chapter 4), stoplists (Chapter 7).

Digital trees: string searching (Chapter 10), suffix trees (Chapter 5).

We refer the reader to Gonnet and Baeza-Yates (1991) for search and update algorithms related to the data structures of this section.

2.3.1 Search Trees

The best-known search tree is the binary search tree. Each internal node contains a key; the left subtree stores all keys smaller than the parent key, while the right subtree stores all keys larger than the parent key. Binary search trees are adequate for main memory. For secondary memory, however, multiway search trees are better, because their internal nodes are bigger. In particular, we describe a special class of balanced multiway search trees called the B-tree.
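As an illustration, the following is a minimal sketch of the binary search tree lookup loop (the node layout and integer keys are our own assumptions for this example, not part of the definition above):

    /* Minimal binary search tree lookup (illustrative sketch). */
    typedef struct node {
        int key;
        struct node *left, *right;  /* left: smaller keys; right: larger keys */
    } node;

    /* Returns the node holding key x, or NULL if x is not in the tree. */
    node *bst_search(node *root, int x)
    {
        while (root != NULL) {
            if (x == root->key)
                return root;                    /* found */
            root = (x < root->key) ? root->left : root->right;
        }
        return NULL;                            /* unsuccessful search */
    }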

A B-tree of order m is defined as follows:

The root has between 1 and 2m keys, while all other internal nodes have between m and 2m keys.

If k_i is the i-th key of a given internal node, then all keys in the (i-1)-th child are smaller than k_i, while all the keys in the i-th child are bigger.

All leaves are at the same depth.

Usually, a B-tree is used as an index, and all the associated data are stored in the leaves or buckets. This structure is called a B+-tree. An example of a B+-tree of order 2 with bucket size 4 is shown in Figure 2.3.

Figure 2.3: A B+-tree example (D_i denotes the primary key i, plus its associated data).

B-trees are mainly used as a primary-key access method for large databases in secondary memory. To search for a given key, we go down the tree, choosing the appropriate branch at each step. The number of disk accesses is equal to the height of the tree.
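A minimal sketch of this descent follows, assuming an in-memory node layout for a B-tree of order 2; in practice each node would be fetched from disk, so each level of the loop costs one disk access:

    /* Sketch of B-tree search (in-memory layout assumed for illustration). */
    #define MAXKEYS 4                       /* 2m keys for order m = 2 */

    typedef struct btnode {
        int nkeys;                          /* number of keys in this node */
        int key[MAXKEYS];                   /* keys, in increasing order */
        struct btnode *child[MAXKEYS + 1];  /* all NULL in a leaf */
    } btnode;

    btnode *btree_search(btnode *t, int x)
    {
        while (t != NULL) {
            int i = 0;
            while (i < t->nkeys && x > t->key[i])
                i++;                        /* choose the appropriate branch */
            if (i < t->nkeys && x == t->key[i])
                return t;                   /* found */
            t = t->child[i];                /* descend one level */
        }
        return NULL;                        /* unsuccessful search */
    }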

Updates are done bottom-up. To insert a new record, we search for the insertion point. If there is not enough space in the corresponding leaf, we split it and promote a key to the previous level. The algorithm is applied recursively, up to the root if necessary; in that case, the height of the tree increases by one.

Splits provide a minimum storage utilization of 50 percent. Therefore, the height of the tree is at most log_{m+1}(n/b) + 2, where n is the number of keys and b is the number of records that can be stored in a leaf. Deletions are handled in a similar fashion, by merging nodes. On average, the expected storage utilization is ln 2 ≈ 0.69 (Yao 1979; Baeza-Yates 1989).
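For example, for a tree of order m = 50 holding n = 1,000,000 keys in leaves of capacity b = 10, the height is at most log_51(100,000) + 2 ≈ 4.9, so a search requires at most five disk accesses.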

To improve storage utilization, several overflow techniques exist. Some of them are:

B*-trees: in case of overflow, we first see whether neighboring nodes have space. If so, a subset of the keys is shifted, avoiding a split. With this technique, a 66 percent minimum storage utilization is guaranteed. The main disadvantage is that updates are more expensive (Bayer and McCreight 1972; Knuth 1973).

Partial expansions: buckets of different sizes are used. If an overflow occurs, a bucket is expanded (if possible) or split. Using two bucket sizes of relative ratio 2/3, a 66 percent minimum and 81 percent average storage utilization is achieved (Lomet 1987; Baeza-Yates and Larson 1989). This technique does not degrade update time.

Adaptive splits: two bucket sizes of relative ratio 1/2 are used. However, splits are not symmetric (balanced); they depend on the insertion point. This technique achieves 77 percent average storage utilization and is robust against nonuniform distributions (low variance) (Baeza-Yates 1990).

A special kind of B-tree, the prefix B-tree (Bayer and Unterauer 1977), efficiently supports variable-length keys, as is the case with words. This kind of B-tree is discussed in detail in Chapter 3.

2.3.2 Hashing

A hashing function h(x) maps a key x to an integer in a given range (for example, 0 to m - 1). Hashing functions are designed to produce values uniformly distributed in the given range. For a good discussion of choosing hashing functions, see Ullman (1972), Knuth (1973), and Knott (1975). The hashing value is also called a signature.
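As a concrete illustration, the following is a simple multiplicative hashing function for strings (this particular function is our own example, not one recommended by the references above):

    /* Illustrative hashing function mapping a string to 0..m-1. */
    unsigned long hash(const char *x, unsigned long m)
    {
        unsigned long h = 0;
        while (*x != '\0')
            h = 31 * h + (unsigned char)*x++;   /* mix in the next symbol */
        return h % m;                           /* slot in the range 0..m-1 */
    }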

A hashing function is used to map a set of keys to slots in a hashing table. If the hashing function gives the same slot for two different keys, we say that we have a collision. Hashing techniques mainly differ in how collisions are handled. There are two classes of collision resolution schemes: open addressing and overflow addressing.

In open addressing (Peterson 1957), the colliding key is "rehashed" into the table by computing a new index value. The most widely used technique in this class is double hashing, which uses a second hashing function (Bell and Kaman 1970; Guibas and Szemeredi 1978). The main limitation of this technique is that when the table becomes full, some kind of reorganization must be done. Figure 2.4 shows a hashing table of size 13 and the insertion of a key using the hashing function h(x) = x mod 13 (this is only an example, and we do not recommend using this hashing function!).

Figure 2.4: Insertion of a new key using double hashing.
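The following sketch shows insertion with double hashing; the table layout and the two hashing functions are illustrative assumptions, and keys are taken to be nonnegative integers:

    /* Sketch of open addressing with double hashing. */
    #define M 13                    /* table size, as in Figure 2.4 */
    #define EMPTY (-1)

    int table[M];                   /* set every slot to EMPTY before use */

    int h1(int x) { return x % M; }
    int h2(int x) { return 1 + x % (M - 1); }   /* step is never 0 */

    /* Returns the slot used, or -1 if the table is full. */
    int insert(int x)
    {
        int i = h1(x), step = h2(x), probes;
        for (probes = 0; probes < M; probes++) {
            if (table[i] == EMPTY) {
                table[i] = x;       /* free slot found */
                return i;
            }
            i = (i + step) % M;     /* collision: rehash to a new index */
        }
        return -1;                  /* full table: reorganization needed */
    }

Because M is prime, the probe sequence visits every slot, so insertion fails only when the table is truly full.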

In overflow addressing (Williams 1959; Knuth 1973), the colliding key is stored in an overflow area, such that all key values with the same hashing value are linked together. The main problem with this scheme is that a search may degenerate into a linear search.

Searches follow the insertion path until the given key is found, or the path ends (the unsuccessful case). The average search time is constant for nonfull tables.
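A sketch of overflow addressing with linked chains follows (the layout is again an illustrative assumption, with nonnegative integer keys):

    /* Sketch of overflow addressing: keys with the same hashing value
       are linked together in a chain. */
    #include <stdlib.h>

    #define TABLESIZE 13

    typedef struct entry {
        int key;
        struct entry *next;         /* next key with the same hashing value */
    } entry;

    entry *chains[TABLESIZE];       /* static storage: all chains start NULL */

    void chain_insert(int x)
    {
        entry *e = malloc(sizeof *e);
        if (e == NULL) return;      /* out of memory */
        e->key = x;
        e->next = chains[x % TABLESIZE];    /* link in front of the chain */
        chains[x % TABLESIZE] = e;
    }

    /* Degenerates to a linear search if many keys share one slot. */
    entry *chain_search(int x)
    {
        entry *e;
        for (e = chains[x % TABLESIZE]; e != NULL; e = e->next)
            if (e->key == x)
                return e;
        return NULL;                /* unsuccessful search */
    }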

Because hashing "randomizes" the location of keys, a sequential scan in lexicographical order is not possible. Thus, ordered scanning or range searches are very expensive. More details on hashing can be

found in Chapter 13.

Hashing schemes have also been used for secondary memory. The main difference is that tables have to grow dynamically as the number of keys increases. The main techniques are extendible hashing, which uses hashing on two levels, a directory level and a bucket level (Fagin et al. 1979), and linear hashing, which uses an overflow area and grows in a predetermined way (Litwin 1980; Larson 1980; Larson and Kajla 1984). For textual databases, a special technique called signature files (Faloutsos 1987) is used most frequently. This technique is covered in detail in Chapter 4 of this book.

To improve search time on B-trees, and to allow range searches in hashing schemes, several hybrid methods have been devised. Among them, we should mention the bounded disorder method (Litwin and Lomet 1987), where B+-tree buckets are organized as hashing tables.

2.3.3 Digital Trees

Efficient prefix searching can be done using indices. One of the best indices for prefix searching is a binary digital tree or binary trie constructed from a set of substrings of the text. This data structure is used in several algorithms.

Tries are recursive tree structures that use the digital decomposition of strings to represent a set of strings and to direct the searching. Tries were invented by de la Briandais (1959), and the name was suggested by Fredkin (1960), from information retrieval. If the alphabet is ordered, we have a lexicographically ordered tree. The root of the trie uses the first character, the children of the root use the second character, and so on. If the remaining subtrie contains only one string, that string's identity is stored in an external node.

Figure 2.5 shows a binary trie (binary alphabet) for the string "01100100010111 . . . " after inserting all the substrings that start from positions 1 through 8. (In this case, the substring's identity is represented by its starting position in the text.)
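The following sketch shows how such a binary trie is searched; the node layout and the bit() helper are assumptions for this example, and query strings are '0'/'1' character strings at least as long as the trie is high:

    /* Sketch of search in a binary trie. */
    typedef struct tnode {
        int pos;                    /* external node: position in the text;
                                       -1 for internal nodes */
        struct tnode *child[2];     /* subtries for bit 0 and bit 1 */
    } tnode;

    /* Bit of string s at depth d (s holds '0'/'1' characters). */
    int bit(const char *s, int d) { return s[d] - '0'; }

    /* Follows the digital decomposition of the query, one bit per level. */
    int trie_search(tnode *t, const char *query)
    {
        int depth = 0;
        while (t != NULL && t->pos == -1)       /* descend internal nodes */
            t = t->child[bit(query, depth++)];
        return (t == NULL) ? -1 : t->pos;       /* -1 = unsuccessful */
    }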

The height of a trie is the number of nodes in the longest path from the root to an external node. The length of any path from the root to an external node is bounded by the height of the trie. On average, the height of a trie is logarithmic for any square-integrable probability distribution (Devroye 1982). For a random uniform distribution (Regnier 1981), the expected height is 2 log_2 n + O(1) for a binary trie containing n strings.

The average number of internal nodes inspected during a successful or unsuccessful search in a binary trie with n strings is log_2 n + O(1). The average number of internal nodes is n/ln 2 + O(n) (Knuth 1973).

A Patricia tree (Morrison 1968) is a trie with the additional constraint that single-descendant nodes are eliminated. The name is an acronym for "Practical Algorithm To Retrieve Information Coded In Alphanumeric." A counter is kept in each internal node to indicate which bit to inspect next. Figure 2.6 shows the Patricia tree corresponding to the binary trie of Figure 2.5.

Figure 2.5: Binary trie (external node label indicates position in the text) for the first eight suffixes in "01100100010111 . . .".

Figure 2.6: Patricia tree (internal node label indicates bit number).

For n strings, such an index has n external nodes (the n positions of the text) and n - 1 internal nodes. Each internal node consists of a pair of pointers plus some counters. Thus, the space required is O(n).

It is possible to build the index in O(nH) time, where H denotes the height of the tree. As for tries, the expected height of a Patricia tree is logarithmic (and at most the height of the binary trie); it is log_2 n + o(log_2 n) (Pittel 1986).
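A search sketch for a Patricia tree follows, under the same illustrative assumptions as the trie sketch above; because bits are skipped during the descent, one final comparison of the query against the text (not shown) is needed to confirm a match:

    /* Sketch of Patricia tree search: each internal node stores the
       number of the next bit to inspect, skipping one-way branches. */
    typedef struct pnode {
        int bitno;                  /* internal node: bit to inspect; -1 = external */
        int pos;                    /* external node: position in the text */
        struct pnode *child[2];
    } pnode;

    int getbit(const char *s, int b) { return s[b] - '0'; }

    int patricia_search(pnode *t, const char *query)
    {
        while (t != NULL && t->bitno != -1)     /* inspect only stored bits */
            t = t->child[getbit(query, t->bitno)];
        return (t == NULL) ? -1 : t->pos;       /* candidate: verify in text */
    }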

A trie built using the substrings (suffixes) of a string is also called a suffix tree (McCreight [1976] or Aho et al. [1974]). A variation of these is the position tree (Weiner 1973). Similarly, a Patricia tree built over the suffixes is called a compact suffix tree.

2.4 ALGORITHMS

It is hard to classify IR algorithms and to draw clear boundaries between the types of applications. However, we can identify three main types of algorithms, which are described below.

There are other algorithms used in IR that do not fall within our classification, for example, user interface algorithms. The reason they cannot be considered IR algorithms is that they are inherent to any computer application.

2.4.1 Retrieval Algorithms

The main class of algorithms in IR is retrieval algorithms, that is, algorithms that extract information from a textual database. We can distinguish two types of retrieval algorithms, according to how much extra memory they need:

Sequential scanning of the text: extra memory is, in the worst case, a function of the query size, not of the database size. On the other hand, the running time is at least proportional to the size of the text, as in string searching (Chapter 10); a brute-force example is sketched after this list.

Indexed text: an "index" of the text is available, and can be used to speed up the search. The index size is usually proportional to the database size, and the search time is sublinear on the size of the text, for example, inverted files (Chapter 3) and signature files (Chapter 4).

Formally, we can describe a generic searching problem as follows: given a string t (the text), a regular expression q (the query), and information (optionally) obtained by preprocessing the pattern and/or the text, the problem consists of finding whether t ∈ Σ*qΣ* (t ∈ q for short, where Σ denotes the alphabet) and obtaining some or all of the following information:

1. The location where an occurrence (or, specifically, the first, the longest, etc.) of q exists. Formally, if t ∈ Σ*qΣ*, find a position m ≥ 0 such that t ∈ Σ^m qΣ*. For example, the first occurrence is defined as the least m that fulfills this condition.

2. The number of occurrences of the pattern in the text. Formally, the number of all possible values of m in the previous item.