
3.2.9 Finding the Kth Largest Element

Consider the following recursive approach; note that A is not assumed to be sorted:

Select(A[1:n], K)

1. Randomly choose a pivot element m.

2. Use m to construct the sets L, E, and G of those elements that are strictly smaller than, equal to, and strictly greater than m, respectively:

   for i := 1, …, n do
      if A[i] = m then add A[i] to E
      else if A[i] < m then add A[i] to L
      else add A[i] to G

   During the construction of L, E, and G, also count their elements cL, cE, and cG, respectively.

3. If cG ≥ K then return Select(G, K)
   else if cG + cE ≥ K then return m
   else return Select(L, K − (cG + cE)).

This algorithm splits the search space A[1:n] around the pivot element m and then determines the three sets L, E, and G. If there are at least K elements in G, then we call Select recursively to determine the Kth largest element in G. Otherwise we check whether G∪E contains at least K elements; if so, m is the desired Kth largest element. Finally, if neither of these cases applies, we call Select recursively again, but now with the set L, and instead of finding its Kth largest element, we find L's element with number K − (cG + cE), reflecting that we removed G and E from the search space and therefore K has to be reduced by the number of the removed elements in G and E.

28 A very common value is K = n/2, in which case the problem is finding the median of the set. Informally, the median of a set is the element with the property that half of the elements of the set are larger and half of the elements are smaller than the median.

29 We would spend O(t) time to find the maximum of an array of size t; first t = n (find the maximum of the entire set), then t = n − 1 (find the maximum of the remaining n − 1 elements), and so on, until t = n − K + 1. Summing this work up yields a quadratic time complexity.
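To make the control flow concrete, here is a minimal Python transcription of the algorithm; the function name select and the convention that K = 1 yields the maximum are ours, since the text gives only the pseudocode above.

import random

def select(A, K):
    """Return the Kth largest element of the list A (K = 1 is the maximum)."""
    # Step 1: randomly choose a pivot element m.
    m = random.choice(A)

    # Step 2: split A into L (< m), E (= m), and G (> m).
    L, E, G = [], [], []
    for x in A:
        if x == m:
            E.append(x)
        elif x < m:
            L.append(x)
        else:
            G.append(x)
    # cE and cG are the only counts step 3 actually needs;
    # cL = len(L) is tracked in the pseudocode but never used there.
    cE, cG = len(E), len(G)

    # Step 3: recurse into the part that must contain the answer.
    if cG >= K:
        return select(G, K)
    if cG + cE >= K:
        return m
    return select(L, K - (cG + cE))

For example, select([3, 1, 4, 1, 5, 9, 2, 6], 2) returns 6, the second-largest element. Since the pivot always lands in E, each recursive call receives a strictly smaller list, so the recursion terminates.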

The most important factor in the determination of the complexity is the recursive calls to Select, and more specifically, the size of the sets involved in these calls. Initially, our search space size is n; in the single recursive call in step 3, the search space size is either cG or cL (note that there is no more than one recursive call, as at most one of the two cases with recursion can apply). It is not difficult to see that the worst case occurs if max{cG, cL} = n − 1. If this occurs in every recursive call, we have the situation of Scenario 1 of Section 3.1, with the proviso that the additional work (namely the construction of the sets L, E, and G) takes O(n) time; consequently, the worst-case time complexity is a pathetic O(n²).30
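Restated as a recurrence (our rephrasing of footnote 30, with c the constant hidden in the O(n) work of step 2), the worst case satisfies

T(n) ≤ T(n − 1) + c·n, T(1) ≤ c,

which unrolls to

T(n) ≤ c·(n + (n − 1) + … + 1) = c·n·(n + 1)/2 = O(n²).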

What would be a desirable situation? Recall that in binary search, the search space was split into two equal halves. Can we achieve something similar here? Suppose m were the median of the search space; then we would mirror binary search, except for step 3, which is concerned with keeping track of the index K. Of course, we do not know how to get m to be the median,31 but we can get very close. Here is how.

Replace step 1 in Select with the following steps:

1.1 Split the search space A[1:n] into groups of five elements each and sort each of these groups.

1.2 Determine M to be the set of all the medians of these five-element groups.

1.3 m := Select(M, ⌈n/10⌉).

We will refer to steps 1.1, 1.2, 1.3, 2, and 3 as the modified Select algorithm.

While it is clear that any choice of m will work, and hence the m determined in steps 1.1, 1.2, and 1.3 will also work, it is not so clear what this convoluted construction buys us. Step 1.1 groups the search space into five-element sets and sorts each of them. This requires O(n) time, since the grouping operation implies one scan of the search space and sorting five elements can be done in a constant number of comparisons (seven comparisons are sufficient to sort five numbers). Also, we can incorporate step 1.2 into this process: just take the middle element of each five-element (by now sorted) set and add it to M. How large is M? Since we have about n/5 five-element sets and we take one element from each, M has about n/5 elements. Step 1.3 then consists of determining, again recursively, the median of the set M, that is, the median of the medians of the five-element sets.32

30 To see this directly, consider that in the first call, we need O(n) time. In the second call, we need O(n − 1) time. In general, in the ith call, we need O(n − i + 1) time, for i = 1, …, n. Summing this up yields the claim of O(n²).

31 The best way we know at this point of finding the median is to find the Kth largest element with K = n/2. Since we are still struggling with this problem, looking for the median is not exactly promising.
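The pivot computation of steps 1.1 through 1.3 can be sketched in the same Python style; the helper name median_of_medians is ours, and for brevity it reuses the randomized select above, whereas in the modified algorithm the call in step 1.3 goes recursively to the modified Select itself (with a base case such as the n ≤ 50 cutoff of footnote 33).

def median_of_medians(A):
    """Steps 1.1-1.3: choose the pivot m as the median of the medians
    of the five-element groups of A."""
    # Step 1.1: split A into groups of (at most) five and sort each group.
    groups = [sorted(A[i:i + 5]) for i in range(0, len(A), 5)]
    # Step 1.2: M collects the median (middle element) of each sorted group.
    M = [g[len(g) // 2] for g in groups]
    # Step 1.3: m is the ceil(n/10)-th largest element of M, that is,
    # the median of the roughly n/5 group medians.
    return select(M, (len(A) + 9) // 10)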

With this specific choice of m, let us revisit the question of how large the sets L and G determined in step 2 are. Let us first determine which elements cannot possibly be in L. Clearly, m cannot be in L, and since m is the median of M, half of the elements in M are ≥ m, so they cannot be in L either. Moreover, each of these elements is the median of its five-element set, so in each such set there are two more elements that are ≥ m, namely the elements greater than or equal to its set's median. Summing all this up, we reach the conclusion that there are at least n/10 + 2·n/10, or 3·n/10, elements ≥ m; therefore, none of them can be in L, and hence cL cannot be larger than 7·n/10. By a similar argument, one sees that there are at least 3·n/10 elements in A[1:n] that are ≤ m, so none of them can be in G, and hence cG ≤ 7·n/10. It follows therefore that in step 3, the search space for either of the two recursive calls is no larger than 7·n/10.
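A concrete instance of this counting (ours, with n = 100 and all groups full): M consists of 20 group medians, at least 10 of which are ≥ m, and each of those 10 brings along two further elements ≥ m from its own group. That gives at least 10 + 2·10 = 30 = 3·100/10 elements ≥ m, and hence cL ≤ 70 = 7·100/10.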

Let T(n) be the time the modified Select algorithm requires for a search space with n elements. Then the recursive call in step 1.3 requires time T(n/5), and the (single) recursive call in step 3 requires time at most T(7·n/10). Since 7·n/10 < 3·n/4, we can bound T(7·n/10) from above by T(3·n/4). (Clearly, T is monotonically increasing; that is, if s < t, then T(s) < T(t).) Our final expression for T(n) is therefore

T(n) ≤ T(n/5) + T(3·n/4) + C·n,

where C·n reflects the work to be done in steps 1.1, 1.2, and 2. It follows now that

T(n) = 20·C·n

satisfies this relation. (The essential point is that the two recursive calls together involve only n/5 + 3·n/4 = 19·n/20 < n elements, so the total work shrinks geometrically from one level of the recursion to the next.) The worst-case time complexity of the modified Select algorithm for finding the Kth largest element in the set A[1:n] of n elements is therefore O(n).33 Furthermore, the space complexity is O(log2(n)), since the recursive calls require space proportional to the depth of the recursion. Since in the worst case the modified Select algorithm reduces the search space size by a factor of 7/10, such a reduction can take place at most log10/7(n) times before the search space is reduced to nothing, and that is O(log2(n)) times (recall that log10/7(n) = log2(n)/log2(10/7), a constant multiple of log2(n)). Hence, the worst-case space complexity of modified Select is O(log2(n)).34

32 In modified Select, we used five-element sets. There is nothing magic about the number 5; any odd number (>4) would do (it should be odd to keep the arguments for L and G symmetric). However, 5 turns out to be most effective for the analysis.

33 We assume a suitable termination condition for Select; typically something like: if n ≤ 50, then sort directly and return the desired element. (Here, 50 plays the role of n0, a value of n so small that how one deals with the cases where n ≤ n0 has little significance for the asymptotic complexity.) Under this assumption, we write T(n) = T(n/5) + T(3·n/4) + C·n and show by direct substitution that T(n) = 20·C·n satisfies the equality: on the right-hand side we get 20·C·n/5 + 20·C·3·n/4 + C·n = 4·C·n + 15·C·n + C·n, which adds up to exactly 20·C·n.

In comparing the initial version and the modified version of Select, one is struck by the fact that the modification, even though far more complicated, has a guaranteed worst-case complexity far superior to that of the original. However, what this statement hides is the enormous constant that afflicts its linear complexity. On average, the original Select tends to be much faster, even though one does run the risk of a truly horrible worst-case complexity.

It is tempting to improve QuickSort by employing modified Select to determine its pivot. Specifically, we might want to use modified Select to determine the pivot x as the median of the subarray to be sorted. While this would certainly guarantee that the worst-case complexity of the resulting version of QuickSort equals its best-case complexity, it would also convert QuickSort into something like ComatoseSort. The constant factor attached to such a sort would make it at least one, if not two, orders of magnitude slower than HeapSort.
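To see what this temptation amounts to in code, here is a hypothetical sketch in the same Python style (the name comatose_sort echoes the text's joke; for brevity the pivot comes from the randomized select above, although the worst-case guarantee of course requires the modified Select):

def comatose_sort(A):
    """QuickSort whose pivot is the exact median of the (sub)array."""
    if len(A) <= 1:
        return A
    # The median is the ceil(n/2)-th largest element.
    x = select(A, (len(A) + 1) // 2)
    # Partition around x exactly as in step 2 of Select.
    smaller = [v for v in A if v < x]
    equal = [v for v in A if v == x]
    greater = [v for v in A if v > x]
    # Each recursive call receives roughly half of the elements,
    # so the depth is O(log n) and every level does O(n) work.
    return comatose_sort(smaller) + equal + comatose_sort(greater)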

So far, all the algorithms we have explored were essentially off-line. In other words, we expected the input to be completely specified before we started any work toward a solution. This may not be the most practicable approach to some problems, especially to problems where the underlying data sets are not static, but change dynamically. For example, an on-line telephone directory should be up to date: It should reflect at any time all current subscribers and should purge former customers or entries. This requires the continual ability to update the directory. Were we to use an unsorted linear list to represent the directory, adding to the list would be cheap, but looking up a number would be a prohibitive O(n) if there were n entries. Thus, we would like to be able to use a scheme that is at least as efficient as binary search — which requires the array to be sorted. However, it would be clearly impractical to sort the entire array whenever we want to insert or delete an entry. Here is where search trees come into their own.