All the locations where the pattern occurs (the set of all possible values of m)

In general, the complexities of these problems are different.

We assume that is not a member of L(q). If it is, the answer is trivial. Note that string matching is a particular case where q is a string. Algorithms to solve this problem are discussed in Chapter 10.

The efficiency of retrieval algorithms is very important, because we expect them to solve on-line queries with a short answer time. This need has triggered the implementation of retrieval algorithms in many different ways: by hardware, by parallel machines, and so on. These cases are explained in detail in Chapter 17 (algorithms by hardware) and Chapter 18 (parallel algorithms).

2.4.2 Filtering Algorithms

This class of algorithms is such that the text is the input and a processed or filtered version of the text is the output. This is a typical transformation in IR, for example to reduce the size of a text, and/or

standardize it to simplify searching.

The most common filtering/processing operations are:

Common words removed using a list of stopwords. This operation is discussed in Chapter 7.

Uppercase letters transformed to lowercase letters.

Special symbols removed and sequences of multiple spaces reduced to one space.

Numbers and dates transformed to a standard format (Gonnet 1987).

Spelling variants transformed using Soundex-like methods (Knuth 1973).

Word stemming (removing suffixes and/or prefixes). This is the topic of Chapter 8.

Automatic keyword extraction.

Word ranking.

Unfortunately, these filtering operations may also have some disadvantages. Any query, before consulting the database, must be filtered as is the text; and, it is not possible to search for common words, special symbols, or uppercase letters, nor to distinguish text fragments that have been mapped to the same internal form.

2.4.3 Indexing Algorithms

The usual meaning of indexing is to build a data structure that will allow quick searching of the text, as we mentioned previously. There are many classes of indices, based on different retrieval approaches. For example, we have inverted files (Chapter 3), signature files (Chapter 4), tries (Chapter 5), and so on, as we have seen in the previous section. Almost all type of indices are based on some kind of tree or hashing. Perhaps the main exceptions are clustered data structures (this kind of indexing is called clustering), which is covered in Chapter 16, and the Direct Acyclic Word Graph (DAWG) of the text, which represents all possible subwords of the text using a linear amount of space (Blumer et al. 1985), and is based on finite automata theory.

Usually, before indexing, the text is filtered. Figure 2.7 shows the complete process for the text.

Figure 2.7: Text preprocessing

The preprocessing time needed to build the index is amortized by using it in searches. For example, if

building the index requires O (n log n) time, we would expect to query the database at least O (n) times to amortize the preprocessing cost. In that case, we add O (log n) preprocessing time to the total query time (that may also be logarithmic).

Many special indices, and their building algorithms (some of them in parallel machines), are covered in this book.

REFERENCES

AHO, A., J. HOPCROFT, and J. ULLMAN. 1974. The Design and Analysis of Computer Algorithms.

Reading, Mass.: Addison-Wesley.

BAEZA-YATES, R. 1989. "Expected Behaviour of B⁺-Trees under Random Insertions." Acta Informatica, 26(5), 439-72. Also as Research Report CS-86-67, University of Waterloo, 1986.

BAEZA-YATES, R. 1990. "An Adaptive Overflow Technique for the B-tree," in Extending Data Base Technology Conference (EDBT 90), eds. F. Bancilhon, C. Thanos and D. Tsichritzis, pp. 16-28, Venice.

Springer Verlag Lecture Notes in Computer Science 416.

BAEZA-YATES, R., and P.-A. LARSON. 1989. "Performance of B⁺-trees with Partial Expansions."

IEEE Trans. on Knowledge and Data Engineering, 1, 248-57. Also as Research Report CS-87-04, Dept.

of Computer Science, University of Waterloo, 1987.

BAYER, R., and E. MCCREIGHT. 1972. "Organization and Maintenance of Large Ordered Indexes."Acta Informatica, 1(3), 173-89.

BAYER, R., and K. UNTERAUER. 1977. "Prefix B-trees." ACM TODS, 2(1), 11-26.

BELL, J., and C. KAMAN. 1970. "The Linear Quotient Hash Code." CACM, 13(11), 675-77.

BLUMER, A., J. BLUMER, D. HAUSSLER, A. EHRENFEUCHT, M. CHEN, and J. SEIFERAS.

1985. "The Smallest Automaton Recognizing the Subwords of a Text." Theoretical Computer Science, 40, 31-55.

DE LA BRIANDAIS, R. 1959. "File Searching Using Variable Length Keys, in AFIPS Western JCC, pp. 295-98, San Francisco, Calif.

DEVROYE, L. 1982. "A Note on the Average Depth of Tries." Computing, 28, 367-71.

FAGIN, R., J. NIEVERGELT, N. PIPPENGER, and H. STRONG. 1979. "Extendible Hashing--a Fast

Access Method for Dynamic Files. ACM TODS, 4(3), 315-44.

FALOUTSOS, C. 1987. Signature Files: An Integrated Access Method for Text and Attributes, Suitable for Optical Disk Storage." Technical Report CS-TR-1867, University of Maryland.

FREDKIN, E. 1960. "Trie Memory." CACM, 3, 490-99.

GONNET, G. 1987. "Extracting Information from a Text Database: An Example with Dates and

Numerical Data," in Third Annual Conference of the UW Centre for the New Oxford English Dictionary, pp. 85-89, Waterloo, Canada.

GONNET, G. and R. BAEZA-YATES. 1991. Handbook of Algorithms and Data Structures--In Pascal and C. (2nd ed.). Wokingham, U.K.: Addison-Wesley.

GUIBAS, L., and E. SZEMEREDI. 1978. "The Analysis of Double Hashing." JCSS, 16(2), 226-74.

HOPCROFT, J., and J. ULLMAN. 1979. Introduction to Automata Theory. Reading, Mass.: Addison-Wesley.

KNOTT, G. D. 1975. "Hashing Functions." Computer Journal, 18(3), 265-78.

KNUTH, D. 1973. The Art of Computer Programming: Sorting and Searching, vol. 3. Reading, Mass.:

Addison-Wesley.

LARSON, P.-A. 1980. "Linear Hashing with Partial Expansions," in VLDB, vol. 6, pp. 224-32, Montreal.

LARSON, P.-A., and A. KAJLA. 1984. "File Organization: Implementation of a Method Guaranteeing Retrieval in One Access." CACM, 27(7), 670-77.

LITWlN, W. 1980. "Linear Hashing: A New Tool for File and Table Addressing," in VLDB, vol. 6, pp.

212-23, Montreal.

LITWIN, W., and LOMET, D. 1987. "A New Method for Fast Data Searches with Keys. IEEE Software, 4(2), 16-24.

LOMET, D. 1987. "Partial Expansions for File Organizations with an Index. ACM TODS, 12: 65-84.

Also as tech report, Wang Institute, TR-86-06, 1986.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." JACM, 23, 262-72.

MORRlSON, D. 1968. "PATRlClA-Practical Algorithm to Retrieve Information Coded in

Alphanumeric." JACM, 15, 514-34.

PETERSON, W. 1957. "Addressing for Random-Access Storage. IBM J Res. Development, 1(4), 130-46.

PITTEL, B. 1986. "Paths in a Random Digital Tree: Limiting Distributions." Adv. Appl. Prob., 18, 139-55.

REGNIER, M. 1981. "On the Average Height of Trees in Digital Search and Dynamic Hashing." Inf.

Proc. Letters, 13, 64-66.

ULLMAN, J. 1972. "A Note on the Efficiency of Hashing Functions." JACM, 19(3), 569-75.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in FOCS, vol. 14, pp. 1-11.

WILLIAMS, F. 1959. "Handling Identifiers as Internal Symbols in Language Processors." CACM, 2(6), 21-24.

YAO, A. 1979. "The Complexity of Pattern Matching for a Random String." SIAM J. Computing, 8, 368-87.

Go to Chapter 3 Back to Table of Contents

Dans le document Information Retrieval: Data Structures & Algorithms (Page 28-33)