Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

(1)

Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on

Multi-component Key Indexes

Alexander B. Veretennikov1[0000−0002−3399−1889]

Ural Federal University, 620002 Mira street, Yekaterinburg, Russia alexander@veretennikov.ru

Abstract. Proximity full-text search is commonly implemented in con- temporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the stor- age of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to theM axDistanceparameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high-frequency. In this paper, we consider how both the search performance and the search quality depend on the value of M axDistanceand other parameters. Well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.

Keywords: Full-text search·Inverted indexes·Proximity search·Term proximity·Query processing·DAAT.

1 Introduction

In full-text search, a query is a list of words. The result of the search is a list of documents containing these words. Consider a query that consists of words occurring with high-frequency. In [16], we improved the average query processing time by a factor of 130 for these queries relative to traditional inverted indexes.

A methodology for high-performance proximity full-text searches that covers different types of queries was discussed in [17].

The factor of proximity or nearness between the queried words in the indexed document plays an important role in modern information retrieval [19, 12, 3].

We assume that a document should contain query words near each other to be relevant for the user in the context of the search query. Taking this factor into account is essential if the query consists of frequently occurring words.

(2)

Some words occur in texts significantly more frequently than others. We can illustrate this [17] by referring to Zipf’s law [20]. An example of a typical word occurrence distribution is presented in Fig. 1. The horizontal axis represents different words in decreasing order of their occurrence in texts. On the vertical axis, we plot the number of occurrences of each word. This peculiarity of language has a strong influence on the search performance.

The inverted index [21, 2] is a commonly used data structure for full-text searches. The traditional inverted index contains records (ID, P), whereID is the identifier of the document andP is the position of the word in the document, for example, an ordinal number of the word in the document. This record corresponds to an occurrence of a word in a document. These (ID, P) records are called “postings”. The inverted index enables us to obtain a list of postings that corresponds to the given word. These lists of postings are used for the search.

For proximity full-text searches, we need to store the (ID, P) record for every occurrence of every word in the indexed document [15, 8, 9]. In other words, for proximity searches, we require a word-level inverted index instead of a document- level index [8]. Consequently, if a word occurs frequently in the texts, then its list of postings is long [17]. The query search time is proportional to the number of occurrences of the queried words in the indexed documents. To process a search query that contains words occurring with high-frequency, a search system requires much more time, as shown on the left side of Fig. 1, than a query that contains only ordinary words, as shown on the right side of Fig. 1.

Fig. 1.Example of a word frequency distribution.

According to [13], we can consider a full-text query as a “simple inquiry”

[17]. In this case, we may require that the search results be provided within two seconds, as stated in [13], to prevent the interruption of the thought continuity of the user. To enhance the performance, the following approaches can be used.

Early-termination approaches can be employed for full-text searches [1, 11].

However, these methods are not effective in the case of proximity full-text searches [16]. Usually, early-termination approaches are applied to document- level indexes. It is difficult to combine the early-termination approach with the incorporation of term-proximity data into relevance models.

Additional indexes can improve the search performance. In [14, 18], additional indexes were used to improve phrase searches. However, the approaches reported in [14, 18] cannot be used for proximity full-text searches. Their area of application is limited by phrase searches. We have overcome this limitation.

(3)

With our additional indexes, an arbitrary query can be executed very fast and quickly [17].

In an example given in [16], we indexed a subset of the Project Gutenberg web site using Apache Lucene and Apache Tika and performed some searches. A query that contained words occurring with high-frequency was evaluated within 21 sec. On the other hand, another query that contained ordinary words was evaluated within 172 milliseconds. This difference is considerable. However, our algorithms reported in [16, 17] can help address this issue.

The goals and questions of this paper are as follows:

1) We need to examine how search performance is improved with consideration of different values ofM axDistanceand other parameters.

2) We need to investigate search performance on the commonly used text collection.

3) Does the search performance depend on the document size?

4) What are other factors that can affect the performance?

5) We evaluate the performance with respect to both short and long queries.

Key points of our research are the following. We use the word-level index.

We use the DAAT approach [6, 10]. We include in the indexes information about all lemmas. Our indexes support incremental updates. We can store one posting list in several data streams. We read entire posting lists when searching and no early-termination are used.

2 Lemmatization and Lemma Type

For this paper, we use an English dictionary with approximately 92 thousand English lemmas. This dictionary is used by a morphology analyzer. The analyzer produces a list of numbers of lemmas, that is, basic or canonical forms, for every word from the dictionary. Usually, a word has one lemma, but some words have several lemmas. For example, the word “mine” has two lemmas, namely, “mine”

and “my”.

Consider an array of all lemmas. Let us sort all lemmas in decreasing order of their occurrence frequency in the texts. We call the result of sorting theF L- list [17]. The number of a lemma win theF L-list is called itsF L-number [17]

and is denoted by F L(w). Let us say that lemma “earth” > “day” because F L(earth) = 309, F L(day) = 199, and 309>199. We use theF L-numbers to define the order of the lemmas in the collection of all lemmas.

In our search methodology [16], we defined three types of lemmas, namely, stop lemmas, frequently used lemmas and ordinary lemmas.

The firstSW Countmost frequently occurring lemmas are stop lemmas. Ex- amples include “earth”, “yes”, “who”, “day”, “war”, “time”, “man” and “be”.

The secondF U Countmost frequently occurring lemmas are frequently used lemmas. Examples include “red”, “beautiful”, and “mountain”.

All other lemmas are ordinary lemmas, e.g., “fiber” and “undersea”.

(4)

SW CountandF U Countare the parameters. We useSW Count= 500 and F U Count = 1050 in the experiments described here. Let M axDistance be a parameter that can take a value of 5, 7, 9 or even greater.

The value of SW Count is very near 421 from [7]. However, as we include information about all lemmas in the indexes, we can use very different values of the parameters. If an ordinary lemma,q, occurs in the text so rarely thatF L(q) is irrelevant, then we can say thatF L(q) =∼. Here, “∼” denotes a large number.

3 Can Stop Words be Skipped?

Let us discuss the stop-word approach, in which some words occurring with high- frequency or their lemmas are excluded from the search. We do not agree with this approach. A word cannot be excluded from the search because even a word occurring with high-frequency can have a specific meaning in the context of a specific query [16, 18]. Therefore, excluding some words from the search can lead to search quality degradation or unpredictable effects [18]. Additionally, stop words are often employed in higher-order term proximity feature models [5].

Consider the query “who are you who” [17]. The Who are an English rock band, and “Who are You” is one of their works. The word “Who” has a specific meaning in the context of this query. Therefore, in our approach, we include information about all of the words in the indexes. Moreover, we can easily see that modern search systems, such as Google, do not skip stop words in a search.

4 The Types of Additional Indexes

The three-component key (f, s, t) index [17] is the list of the occurrences of the lemma f for which lemmas s and t both occur in the text at distances that are less than or equal to the M axDistance from f. Each posting includes the distance betweenf and s in the text and the distance between f and t in the text. An (f, s, t) index is created only for the case in whichf ≤s≤t. Here,f, s, andt are all stop lemmas.

In [16], we considered queries consisting of high-frequency occurring words. In addition, we showed that the average query execution time can be improved with three-component key indexes by up to 15.6 times relative to the time necessary using two-component key indexes only. Therefore, (f, s, t) indexes are required if we need to search queries that contain high-frequently occurring words.

The two-component key (w, v) index [17] is the list of occurrences of the lemmawfor which lemmavoccurs in the text at a distance that is less than or equal to theM axDistancefromw. Each posting includes the distance between wandv in the text. Here,wdenotes a frequently used lemma, andvdenotes a frequently used or ordinary lemma.

Let us consider the traditional index with near stop word (NSW) records. For each occurrence of each ordinary or frequently used lemma in each document, we include a posting record (ID, P,NSW) in the index.IDcan be the ordinal number of the specific document, andP is the position of the word in the document,

(5)

e.g., the ordinal number of the word in the document. The NSW record contains information about all high-frequency lemmas, that is, stop lemmas, occurring near positionP in the document (at a distance that is less than or equal to the M axDistancefrom P). Examples of the indexes are given in [17].

The posting list of a key is stored in several data streams [17]. For the traditional index with NSW records, we can use up to three streams for one key: one stream for (ID), one for (P) and one for (NSW). On the other hand, for this index, we can use two streams: the first is (ID, P), and the second is (NSW).

The actual choice depends on the length of the posting list. For short lists, we use two streams; for long lists, we use three streams. This architecture allows us to skip NSW records when they are not required. For the (w, v) and (f, s, t) indexes, we use one or two data streams for every key.

5 Experiment

5.1 Indexes, Collection and Environment We create the following indexes:

Idx0: the traditional inverted index without any enhancements, such as NSW records. The total size is 143 GB. This value includes the total size of indexed texts in compressed form, which is 57.3 GB.

Idx5: our indexes, including the traditional inverted index with the NSW records and the (w, v) and (f, s, t) indexes, whereM axDistance= 5. The total size is 1.29 TB, the total size of the (w, v) index is 104 GB, the total size of the (f, s, t) index is 727 GB, and the total size of the traditional index with NSW records is 192 GB.

Idx7: our indexes, whereM axDistance= 7. The total size is 2.16 TB, the total size of the (w, v) index is 148 GB, the total size of the (f, s, t) index is 1.422 TB, and the total size of the traditional index with NSW records is 239 GB.

Idx9: our indexes, whereM axDistance= 9. The total size is 3.27 TB, the total size of the (w, v) index is 191 GB, the total size of the (f, s, t) index is 2.349 TB, and the total size of the traditional index with NSW records is 283 GB.

For the experiment, GOV2 [4] text collection and the following queries are used: title queries from TREC Robust Task 2004 (with 250 queries in total), title queries from TREC Terabyte Task from 2004 to 2006 (with 150 queries in total), title queries from TREC Web Task from 2009 to 2014 (with 300 queries in total), queries from TREC 2007 Million Query Track (10000 queries in total).

The total size of the query set after duplicate removal is 10 665 queries. GOV2 text collection contains 25 million documents. The total size of the collection is approximately 426 GB, and after HTML tag removal, there is approximately 167 GB of plain text. The average document text size is approximately 7 KB.

We used the following computational resources: CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz. HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows 2008 R2 Enterprise.

The query set can be divided into the following subsets depending on lemmas in a concrete query [17]. All queries are evaluated within one program thread.

(6)

If we have a query with a length greater thanM axDistance, then we should divide it into several parts. For example, when the value of M axDistanceis 5, then the query “to be or not to be that is the question” should be rewritten as “(to be or not to) (be that is the question)”, and these two parts should be evaluated independently; then, the results should be combined.

5.2 Q1. Only Stop Lemmas

Every query in the subset contains only stop lemmas. There are 119 queries in this subset. Examples include the following: to be or not to be that is the question, kids earth day activities. With F L-numbers, we have the following queries: [to: 9] [be: 7] [or: 38] [not: 64] [to: 9] [be: 7] [that: 40] [be: 7] [the: 1]

[question: 305] and [kid: 447] [earth: 309] [day: 199] [activity: 247].

For these queries, the (f, s, t) indexes are used [16].

Average query processing times:

Idx0: 51.4 s,Idx5: 0.82 s,Idx7: 0.86 s,Idx9: 1.05 s (see Fig. 2).

Average data read sizes per query:

Idx0: 1.3 GB,Idx5: 11.1 MB,Idx7: 15.6 MB,Idx9: 20.1 MB.

Average numbers of postings per query:

Idx0: 317.8 million,Idx5: 0.88 million,Idx7: 1.15 million,Idx9: 1.5 million.

We improved the query time by a factor of 62.7 with Idx5, by a factor of 59.4 withIdx7, and by a factor of 48.7 withIdx9 in comparison withIdx0.

Fig. 2.The average query execution times forIdx0,Idx5,Idx7, andIdx9 (seconds);

the query subsets Q1, Q2, Q3.

5.3 Q2. Stop and Frequently Used and/or Ordinary Lemmas

Every query in the subset contains one or several stop lemmas. The query also contains some other lemmas that may be frequently used or ordinary. There are 7 244 queries in the subset. Examples include the following: History of Physicians in America. WithF L-numbers, we have the following query:

[history: 598] [of: 4] [physician: 1760] [in: 14] [America: 1391]

For these queries, we need to read one posting list with NSW records from the traditional index for some query lemma. This lemma is designated as the

“main” lemma of the query. If there is a frequently used lemma in the query, then we can use two-component key indexes, like (physician, history), for other

(7)

lemmas. If the query consists of only stop and ordinary lemmas, then we use posting lists from the traditional index for the remaining ordinary lemmas but without reading the NSW records. The details are described in [17].

Average query times:Idx0: 50 s,Idx5: 1.9 s,Idx7: 2 s,Idx9: 2.14 s.

Idx0: 1.3 GB,Idx5: 64.3 MB,Idx7: 68.7 MB,Idx9: 76 MB.

The number of postings for Idx7 is less than that for Idx5 because fewer queries were divided into parts. Additionally, the sizes of the NSW records are different forIdx5,Idx7 andIdx9. We improved the query processing time by a factor of 25.6 with Idx5, by a factor of 24.7 withIdx7, and by a factor of 23.3 withIdx9 in comparison withIdx0.

5.4 Q3. Only Frequently Used Lemmas

Every query in the subset contains only frequently used lemmas. There are 79 queries in this subset. Examples include the following: california mountain pass.

WithF L-numbers, we have the following query:

[california: 518] [mountain: 704] [pass: 528]

Two-component (w, v) key indexes are used. For example, we can use the (california, pass) and (*pass, mountain) two-component key indexes.

Average query times:Idx0: 2.5 s,Idx5: 0.32 s,Idx7: 0.33 s,Idx9: 0.34 s.

Idx0: 53.3 MB,Idx5: 4.69 MB,Idx7: 4.79 MB,Idx9: 4.93 MB.

We improved the query time by a factor of 7.83 with Idx5, by a factor of 7.58 withIdx7, and by a factor of 7.44 withIdx9 in comparison withIdx0.

5.5 Q4. Frequently Used Lemmas and Ordinary Lemmas

Every query in the subset contains frequently used lemmas and ordinary lemmas.

There are 1 388 queries in this subset.

Examples include the following: Scalable Vector Graphics.

[scalable:∼] [vector: 2953] [graphics: 921]

Each query contains a frequently used lemma; therefore, two-component key indexes can be used. For example, we can use the (graphics, scalable) and (graphics, vector) two-component key indexes.

Idx0: 41.7 MB,Idx5: 1.5 MB,Idx7: 1.6 MB, Idx9: 1.8 MB.

(8)

We improved the query time by a factor of 6 withIdx5, by a factor of 6.4 withIdx7, and by a factor of 6.8 withIdx9 in comparison withIdx0.

5.6 Q5. Only Ordinary Lemmas

Every query in the subset contains only ordinary lemmas. There are 1 835 queries in this subset. Examples include the following: Undersea Fiber Optic Cable.

[undersea: 15873] [fiber: 3127] [optic: 2986] [cable: 2771]

We use the traditional index only for these queries. We do not need to read the NSW records because they are stored in separated streams of data.

Idx0: 14.2 MB,Idx5: 15.6 MB,Idx7: 15.7 MB,Idx9: 15.8 MB.

The query processing time for these queries does not need any improvement.

5.7 Results for Entire Query Set

Average query times: Idx0: 34.9 s,Idx5: 1.51 s,Idx7: 1.57 s,Idx9: 1.66 s.

Idx0: 0.93 GB,Idx5: 46.4 MB,Idx7: 49.5 MB,Idx9: 54.5 MB.

We improved the query time by a factor of 23.1 with Idx5, by a factor of 22.4 withIdx7, and by a factor of 21 withIdx9 in comparison withIdx0.

5.8 Analysis of the Results

The average time that is needed for the search for Idx5, Idx7, and Idx9, is approximately 1-2 sec. for every query type.

For Idx0, we need 2-2.5 sec. for the search if the query consists only of ordinary or frequently used lemmas. For the queries that contain any stop lemma (Q1, Q2),Idx0 requires approximately 50 sec. for the search on average.

This means that the search inIdx5,Idx7, andIdx9 is stable from the performance point of view. However, forIdx0, the search system has a performance problem if the query contains any high-frequency occurring lemma. The difference in performance betweenIdx0 and multi-component key indexesIdx5,Idx7 and Idx9 is larger in cases when more high-frequency occurring lemmas occur in the queries.

In [16], we improved the average query execution time by up to 130 times for queries that consist of words occurring with high-frequency. For TREC GOV2, we improved the average query execution time by up to 62.7 times for these queries (i.e., the query set Q1). Perhaps the difference in the “improvement

(9)

factor” is dependent on the average document text size, which is approximately 7 KB for TREG GOV2 and 384 KB for the text collection that was used in [16].

The other factor is the length of the query. However, an improvement of 62.7 times in the average query execution time also seems good.

Let us analyze this in more detail. Let us consider the following example. We have two documents,D0 andD1. Let us consider a wordw. Forw, we have the following posting list in the traditional index: (0,1),(0,5),(0,7),(1,2),(1,5).

We use two very common encoding schemes. The idea of the first scheme as follows. We group the records that are related to a specific document. We convert the original posting list as follows: (0,(1,5,7)),(1,(2,5)).For other kinds of indexes, we have the same.

Therefore, we store in the index theIDof the document and then the list of word’s positions. The second scheme is a delta-encoding scheme. Consequently, we have the following. (0,(1,4,2)),(1,(2,3)).For example, consider (1,5,7). In- stead of 5, we store 4 = 5−1, where 1 is the previous value in the list.

The first scheme is much more effective for text collections that consist of large documents. This explains the difference in “improvement factor” in the experiments with two aforementioned collections. However, with our method, we can use any other encoding scheme and any other inverted index organization.

In addition, the “improvement factor” can depend on the structure of the query set. To analyze this question, we performed additional experiment. We formed another query set, using the method from [16]. Consequently, we selected a document from the TREC GOV2 collection. We used the content of the document to produce a set of 3500 queries. We performed our experiments again, using this query set. The “improvement factor” was similar to the foregoing results that we have for TREC GOV2.

The next question is, how can we select the values of our parameters? The value of M axDistance affects relevance and should be determined according to the selected relevance function [17]. The results of experiments allows us to predict how the change of theM axDistance affects the search time and index size. Let us discuss now, how the values ofSW CountandF U Countaffect these metrics.

The value ofSW Countis more important than the value ofF U Count, because it affects Q1 and Q2 queries, which are most complex from the performance point of view. Let us considerSW Countnow. Let us consider Q2 query subset.

We use Zipf’s law [20] as a representation of our word occurrence distribution.

According to this, the second most frequently occurring lemma, occurs half as often as the first. The third most frequently occurring lemma, occurs 1/3 as often as the first, and so on. Let V be the number of occurrences of the most frequently occurring lemma. Therefore, our lemmas have the following number of occurrences: V, V /2, V /3, ...

Let us consider a queryq = (q₁, q₂, ..., q_n) from Q2. Here, q_i is the number of corresponding lemma in the F L-list. The query contains one or several stop lemmas and some other lemmas that may be frequently used or ordinary. To evaluate the query using traditional index, we need to read the following number

(10)

of postings from the index:

n

P

i=1

V /qi. Let, without loss of generality, q1 and q2

be stop lemmas and qn is the main lemma of the query. With the use of NSW records, we need to read the following number of postings.

n

P

i=3

V /qi+ (V /qn)× (N SW F actor−1). That means, we do not need to read the postings list for q1

andq2. However, forqn we need to read its posting list withN SW records.

The size of the posting (ID, P, N SW) in bytes is up to N SW F actor= 4.5 times large than the size of (ID, P). The experiments withM axDistance = 5 show this. Therefore, we can calculate the ”planned performance gain” as follows:

P P G(q) = _n

P

i=1

1/q_i

/ _n

P

i=3

1/q_i+ (1/q_n)×(N SW F actor−1)

.

If we have a query setQ, then we can calculate the average planned performance gain, AP P G(Q) = _|Q|¹ P

q∈Q

P P G(q). Let now use only Q2 queries for estimations. If a query q is not a Q2 query, then let P P G(q) be 1. Let AP P G(Q, SW Count) be the average planned performance gain that is calcu- lated for a specific value ofSW Count.

For the query set that we use in this paper, i.e., 10 665 queries, we have the following: AP P G(Q,100) = 43, AP P G(Q,500) = 105, AP P G(Q,1000) = 153.

However, this model do not take into account (w, v) and (f, s, t) indexes. In future, we plan to develop more precise models. However, from this model, we can predict that SW Countshould not be increased.

The foregoing results need a more detailed examination. Let us consider a search query. The query consists of some set of lemmas. LetM in-F L-number be the minimumF L-number among all lemmas of the query. A lower F L-number corresponds to a more frequently occurring lemma. If the M in-F L-number of a query is a small number, then the query can induce performance problems because the query contains some high-frequency occurring lemma.

Then, we divide the entire query set into subsets based on the M in-F L- number of queries. We select 100 as the division step. In the first subset, we include all queries with M in-F L-numbers from 0 to 99; in the second subset, we include all queries withM in-F L-numbers from 100 to 199; and so on. In the following diagrams, we consider the first 21 subsets.

Fig. 3.The average query execution times forIdx0 (seconds); the query set is divided based on theM in-F L-number with a step of 100.

In Fig. 3, the average query execution time forIdx0 in seconds is displayed for every subset. The first bar is significantly larger than the other bars, showing

(11)

that the first subset induces some performance problems. The first 8 subsets have an average query execution time of more than two seconds.

Fig. 4.The average query execution times forIdx5 (seconds); the query set is divided based on theM in-F L-number with a step of 100.

In Fig. 4, the average query execution time forIdx5 in seconds is displayed for every subset. For the first subset, the average query execution time is about two seconds. For each following subset, the average query execution time is less than two seconds.

Fig. 5.The improvement factor forIdx5 in comparison withIdx0 (times); the query set is divided based on theM in-F L-number with a step of 100.

In Fig. 5, we show the improvement factor forIdx5 in comparison withIdx0.

The first bar value is 33, which means that the average query execution time for the first subset is improved by a factor of 33. We also see how the performance is changed when theM in-F L-number of a query crosses theSW Countof 500.

By this diagram, we can propose that the value ofSW Countcan be lowered to 100. Consequently, we created another indexIdx5/SW100, withSW Count= 100 and F U Count= 1450.

Fig. 6.The average query execution times forIdx5/SW100 (seconds); the query set is divided based on theM in-F L-number with a step of 100.

In Fig. 6, the average query execution time for Idx5/SW100 in seconds is displayed for every subset. For each subset, the average query execution time is less than two seconds. The results look more promising than those forIdx5.

(12)

However, let us consider Q1 queries when the value ofSW Countis 500, which are the queries that consist only of lemmas withF L-number<500. There are 119 queries in this subset, as discussed above. It was established that for this subset, the average query search time withIdx5 is 0.8 sec. but with Idx5/SW100, it is 6.7 sec., which is significantly larger. When this subset is evaluated withIdx5, three-component key indexes are used.

Fig. 7.The average query execution times forIdx0,Idx5, andIdx5/SW100 (seconds);

the query set Q1 is divided based on theM in-F L-number with a step of 100.

Let us divide query set Q1 into five subsets based on theM in-F L-number value of the concrete query. In Fig. 7, the average query execution time forIdx0, Idx5 andIdx5/SW100 in seconds is displayed for every subset. We see that the search inIdx5/SW100 works significantly more slowly than that inIdx5.

Therefore, we come to the following conclusions for different kinds of queries.

Consequently, we propose the new index schema.

T1 A query consists of extreme high-frequency occurring lemmas and high- frequency occurring lemmas. This means that all lemmas of the query are extreme high-frequency occurring or high-frequently occurring. E.g., for every lemma of the query,F L-number <500.

The (f, s, t) indexes work better for these queries than do other types of indexes, as shown by Fig. 7. The search inIdx5/SW100 is significantly slower than that in Idx5. When the search is performed using Idx5, the (f, s, t) indexes are used for every query in T1. When the search is performed using Idx5/SW100, the (f, s, t) indexes are used only if F L(f) <100,F L(s)<

100, andF L(t)<100.

T2 A query contains a low-frequency occurring lemma (ordinary or frequently used). Additionally, the query may contain high-frequency occurring lemmas, but for every query lemma,F L-number≥100.

The (w, v) indexes allow one to achieve a better performance improvement, as shown in Fig. 6. For subsets starting fromF L-number = 100, no (f, s, t) indexes or NSW records are used when we do the search usingIdx5/SW100.

Here,Idx5/SW100 is better thanIdx5.

T3 A query contains a low-frequency occurring lemma (ordinary or frequently used) and some extreme high-frequency occurring lemmas withF L-number

<100.

The (w, v) indexes and NSW records are required. The first bar in Figures 3, 4 and 6 supports this, and our previous experiments in [16] confirm it.

(13)

5.9 The New Index Schema

The original schema can be represented by the following rules:

1) (f, s, t) indexes, whereF L(f), F L(s), F L(t)< SW Count.

2) (w, v) indexes, where

SW Count≤F L(w)< SW Count+F U Count, andSW Count≤F L(v).

3) Traditional indexes (x) with NSW records,SW Count ≤F L(x); the NSW records contain information about all lemmasywith the conditionF L(y)<

SW Countthat occur near lemmaxin the text.

For the new schema, we use the following parameters with example values.

EHF Count= 100 – for extreme high-frequency occurring lemmas.

HF Count= 400 – for high-frequency occurring lemmas.

F U Count= 1050 – for frequently used lemmas.

We propose using the following indexes.

1) (f, s, t) indexes that can be used for T1 queries, where

F L(f), F L(s), F L(t)< EHF Count+HF Count= 500, 2) (w, v) indexes that can be used for T2 and T3 queries, where

100 =EHF Count≤F L(w)< EHF Count+HF Count+F U Count= 1450, 100 =EHF Count≤F L(v).

3) Traditional indexes (x) with NSW records, 100 = EHF Count ≤ F L(x);

the NSW records contain information about all lemmasywith the condition F L(y) < EHF Count = 100 that occur near lemma x in the text. These indexes can be used for T3 queries.

The concrete values of the parameters are provided only for example and can be different for different languages and text collections.

6 Conclusions and Future Work

In this paper, we investigated how multi-component key indexes help to improve search performance. We used well-known GOV2 text collection. We proposed a method of analyzing the search performance by considering different types of queries. By following this method, we found that the performance can be improved further and proposed a new index schema.

We analyzed how the value ofM axDistanceaffects the search performance.

With an increase in the value of M axDistancefrom 5 to 9, the average search time using multi-component key indexes was increased from 1.51 sec. to 1.66 sec. Therefore, the value of M axDistance can be increased even further, and the main limitations here are the disk space and the time of indexing. Our multi-component indexes are relatively large for large values of M axDistance.

However, large hard disk drives are now available. In many cases, it would be preferable to spend several TB of disk space but to increase the search speed by a factor of 20 times or more.

(14)

We found that multi-component key indexes work significantly better on text collections with large documents (e.g., documents with sizes of approximately several hundred KB or more) than on text collection that consists of small documents; thus, an improvement factor of 20 can be considered as a minimum improvement factor. In the future, it is important to consider how different com- pression technologies can reduce the index total size, which will increase the search speed even more.

The proposed indexes with multi-component keys have one limitation. If we have a document that contains queried words and the distance between these words is greater than M axDistance, then this document can be absent in the search results. This is usually not a problem if the average document size in the text collection is relatively large, e.g., several hundreds of kilobytes. In this case, after the proximity search with multi-component key indexes, we can run a search without distance. When the former requires the word-level index, the latter needs only the document-level index and works significantly faster.

Second, in the majority of modern relevance models, it is defined that the weight of the document is inversely proportional to the square of the distance between queried words in the document [19]. With a relatively large value of M axDistance, we can be sure that all relevant documents will occur in the search results. When the first consideration for GOV2 collection is under question, because the documents are small, the second is still valid.

References

1. Anh, V.N., de Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: Proceedings of the 24th Annual International ACM SIGIR Con- ference on Research and Development in Information Retrieval. pp. 35–42. SIGIR

’01, ACM, New York, NY, USA (2001). https://doi.org/10.1145/383952.383957 2. Borodin, A., Mirvoda, S., Porshnev, S., Ponomareva, O.: Improving generalized in-

verted index lock wait times. Journal of Physics: Conference Series944(1), 012022 (jan 2018). https://doi.org/10.1088/1742-6596/944/1/012022

3. Broschart, A., Schenkel, R.: High-performance processing of text queries with tun- able pruned term and term pair indexes. ACM Trans. Inf. Syst.30(1) (Mar 2012).

https://doi.org/10.1145/2094072.2094077

4. B¨uttcher, S., Clarke, C., Soboroff, I.: The trec 2006 terabyte track. In: Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006. pp. 128–141 (2006) 5. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison

of document-at-a-time and score-at-a-time query evaluation. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp.

201–210. WSDM 17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018726

6. Daoud, C.M., Silva de Moura, E., Carvalho, A., Soares da Silva, A., Fernandes, D., Rossi, C.: Fast top-k preserving query processing using two-tier indexes. Inf. Process. Manage. 52(5), 855–872 (Sep 2016).

https://doi.org/10.1016/j.ipm.2016.03.005

7. Fox, C.: A stop list for general text. SIGIR Forum 24(1-2), 19–21 (Sep 1989).

https://doi.org/10.1145/378881.378888

(15)

8. Gall, M., Brost, G.: K-word proximity search on encrypted data.

In: 30th International Conference on Advanced Information Network- ing and Applications Workshops (WAINA). pp. 365–372. IEEE (2016).

https://doi.org/10.1109/WAINA.2016.104

9. Imran Rafique, M., Hassan, M.: Utilizing distinct terms for proximity and phrases in the document for better information retrieval. In: 2014 International Conference on Emerging Technologies (ICET). pp. 100–105. IEEE (2014). https://doi.org/10.1 109/ICET.2014.7021024

10. Jiang, D., Leung, K.W.T., Yang, L., Ng, W.: Teii: Topic enhanced inverted index for top-k document retrieval. Know.-Based Syst.89(C), 346–358 (Nov 2015).

https://doi.org/10.1016/j.knosys.2015.07.014

11. Lin, J., Trotman, A.: Anytime ranking for impact-ordered indexes. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval. pp.

301–304. ICTIR 15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2808194.2809477

12. Lu, X., Moffat, A., Culpepper, J.S.: Efficient and effective higher order proximity modeling. In: Proceedings of the 2016 ACM International Conference on the The- ory of Information Retrieval. pp. 21–30. ICTIR ’16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2970398.2970404

13. Miller, R.B.: Response time in man-computer conversational transactions.

In: Proceedings of the December 9-11, 1968, Fall Joint Computer Confer- ence, Part I. pp. 267–277. AFIPS 68 (Fall, part I), ACM, USA (1968).

https://doi.org/10.1145/1476589.1476628

14. Petri, M., Moffat, A.: On the cost of phrase-based ranking. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. p. 931934. SIGIR 15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2766462.2767769

15. Sadakane, K., Imai, H.: Fast algorithms for k-word proximity search. In: IEICE Transactions on Fundamentals of Electronics Communications and Computer Sci- ences. vol. 84, pp. 2311–2318 (2001)

16. Veretennikov, A.B.: Proximity full-text search by means of additional indexes with multi-component keys: In pursuit of optimal performance. In: Manolopoulos Y., Stupnikov S. (eds) Data Analytics and Management in Data Intensive Domains.

DAMDID/RCDL 2018. Communications in Computer and Information Science.

vol. 1003, pp. 111–130. Springer. https://doi.org/10.1007/978-3-030-23584-0 7 17. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by

means of additional indexes. In: Arai K., Kapoor S., Bhatia R. (eds) Intelligent Sys- tems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Com- puting. vol. 868, pp. 936–954. Springer, Cham (2019). https://doi.org/10.1007/978- 3-030-01054-6 66

18. Williams, H.E., Zobel, J., Bahle, D.: Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. 22(4), 573–594 (Oct 2004).

https://doi.org/10.1145/1028099.1028102

19. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.R.: Efficient term proximity search with term-pair indexes. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1229–1238. CIKM ’10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1871437.1871593

20. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language.

Harvard University Press (1932)

21. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv.

38(2) (Jul 2006). https://doi.org/10.1145/1132956.1132959