
Hash-based Search

From Algorithms in a Nutshell, 2nd Edition (pages 122-138)

The previous sections on searching are appropriate when there are a small number of elements (Sequential Search) or the collection is already ordered (Binary Search). We need more powerful techniques for searching larger collections that are not necessarily ordered. One of the most common approaches is to use a hash function to transform one or more characteristics of the searched-for item into an index into a hash table. Hash-based searching has better average-case performance than the other search algorithms described in this chapter.

Many books on algorithms discuss hash-based searching under the topic of hash tables (Chapter 11 in Cormen et al., 2009); you may also find this topic in books on data structures that describe hash tables.

In Hash-based Search the n elements of a collection C are first loaded into a hash table H with b bins structured as an array. This pre-processing step has O(n) performance, but improves performance of future searches. The concept of a hash function makes this possible.

A hash function is a deterministic function that maps each element Ci to an integer value hi. For a moment, let's assume that 0 ≤ hi < b. When loading the elements into a hash table, element Ci is inserted into the bin H[hi]. Once all elements have been inserted, searching for an item t becomes a search for t within H[hash(t)].

The hash function guarantees only that if two elements Ci and Cj are equal, hash(Ci) = hash(Cj). It can happen that multiple elements in C have the same hash value; this is known as a collision and the hash table needs a strategy to deal with these situations. The most common solution is to store a linked list at each hash index (even though many of these linked lists will contain only one element), so all colliding values can be stored in the hash table. The linked lists have to be searched linearly, but this will be quick because each is likely to store at most a few different items. The following pseudocode describes a linked list solution to collisions.

Hash-based Search | 111

Hash-based Search Summary
Best, Average: O(1)  Worst: O(n)

loadTable (size, C)
  H = new array of given size
  foreach e in C do
    h = hash(e)
    if H[h] is empty then
      H[h] = new Linked List    # create linked lists when a value is inserted into an empty bin
    add e to H[h]
  return H
end

search (H, t)
  h = hash(t)
  list = H[h]
  if list is empty then
    return false
  if list contains t then       # use Sequential Search on small lists
    return true
  return false
end

The general pattern for Hash-based Search is shown in Figure 5-1 with a small example. The components are:

• A set U that defines the set of possible hash values. Each element e ∈ C maps to a hash value h ∈ U.

• A hash table, H, containing b bins that store the n elements from the original collection C.

• The hash function, hash, which computes an integer value h for every element e, where 0 ≤ h < b.

This information is stored in memory using arrays and linked lists.

Figure 5-1. General approach to hashing

Hash-based Search raises two main concerns: the design of the hash function, and how to handle collisions. A badly designed hash function can leave keys poorly distributed in the primary storage, with two consequences: many bins in the hash table may be unused, wasting space, and collisions force many keys into the same bin, which worsens search performance.

Input/Output

Unlike Binary Search, the original collection C does not need to be ordered for Hash-based Search. Indeed, even if the elements in C were ordered in some way, the hashing method that inserts elements into the hash table H does not attempt to replicate this ordering within H.

The input to Hash-based Search is the computed hash table, H, and the target element t being sought. The algorithm returns true if t exists in the linked list stored by H[h] where h = hash(t). If t does not exist within the linked list stored by H[h], false is returned to indicate that t is not present in H (and by implication, does not exist in C).


[1] http://www.wordlist.com

[2] http://openjdk.java.net

Context

Assume you had an array of 213,557 sorted English words [1]. We know from our discussion on Binary Search that we can expect about 18 string comparisons on average to locate a word in this array (since log2(213,557) ≈ 17.70). With Hash-based Search the number of string comparisons is based on the length of the linked lists rather than the size of the collection.

We first need to define the hash function. One goal is to produce as many different values as possible, but not all values need to be unique.

Hashing has been studied for decades, and numerous papers describe effective hash functions, but they can be used for much more than searching. For example, special hash functions are essential for cryptography. For searching, a hash function should have a good distribution and should be quick to compute with respect to machine cycles.

A popular technique is to produce a value based on each character from the original string:

hashCode(s) = s[0]*31^(len−1) + s[1]*31^(len−2) + … + s[len−1]

where s[i] is the ith character (as an ASCII value between 0 and 255) and len is the length of the string s. Computing this function is simple, as shown in the Java code in Example 5-5 (adapted from the OpenJDK source code [2]), where chars is the array of characters that defines a string. By our definition, the hashCode() method for the java.lang.String class is not yet a hash function because its computed value is not guaranteed to be in the range [0, b).

Example 5-5. Sample Java hashCode
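A sketch of the listing follows, reconstructed from the description in the surrounding text: the HashString wrapper class and its main method are added here for demonstration, while the hashCode body follows the OpenJDK logic the text describes, using the chars and hash fields named in the prose.

```java
// Sketch of Example 5-5: polynomial hash with caching, following the
// OpenJDK String.hashCode logic described in the text.
public class HashString {
    private final char[] chars;   // characters that define the string
    private int hash;             // cached hash value; 0 means "not yet computed"

    public HashString(String s) {
        this.chars = s.toCharArray();
        this.hash = 0;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            // hashCode(s) = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-1]
            for (int i = 0; i < chars.length; i++) {
                h = 31 * h + chars[i];
            }
            hash = h;             // cache so the loop runs at most once
        }
        return h;
    }

    public static void main(String[] args) {
        // 32-bit arithmetic overflows for longer strings, yielding a negative int.
        System.out.println(new HashString("aaaaaa").hashCode());
    }
}
```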

For efficiency, this hashCode method caches the value of the computed hash to avoid recomputation (i.e., it computes the value only if hash is 0). Observe that this function returns very large integers and sometimes negative values because the int type in Java can store only 32 bits of information. To compute the integer bin for a given element, define hash(s) to be:

hash(s) = abs(hashCode(s)) % b

where abs is the absolute value function and % is the modulo operator that returns the remainder when dividing by b. Doing so ensures that the computed integer is in the range [0, b).
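This definition can be transcribed directly into Java; the HashToBin class name and demonstration main below are added for illustration only.

```java
public class HashToBin {
    // hash(s) = abs(hashCode(s)) % b maps any int hashCode into [0, b).
    // Note: Math.abs(Integer.MIN_VALUE) is still negative, so a production
    // version should guard against that one hashCode value.
    public static int hash(String s, int b) {
        return Math.abs(s.hashCode()) % b;
    }

    public static void main(String[] args) {
        int b = 262143;                           // 2^18 - 1, the size used later in the chapter
        System.out.println("aaaaaa".hashCode());  // negative due to 32-bit overflow
        System.out.println(hash("aaaaaa", b));    // now in [0, 262143)
    }
}
```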

Choosing the hashing function is just the first decision to make when implementing Hash-based Search. Storage space poses another design issue. The primary storage, H, must be large enough to reduce the size of the linked lists storing the elements in each bin. You might think that H could contain b=n bins, where the hash function is a one-to-one mapping from the set of strings in the collection onto the integers [0, n), but this is not easy to accomplish! Instead, we try to define a hash table that will contain as few empty bins as possible. If our hash function distributes the keys evenly, we can achieve reasonable success by selecting an array size approximately as large as the collection.

The size of H is typically chosen to be a prime number to ensure that using the % modulo operator efficiently distributes the computed bin numbers. A good choice in practice is 2^k − 1, even though this value isn't always prime.

The elements stored in the hash table have a direct effect on memory.

Since each bin stores string elements in a linked list, the elements of the list are pointers to objects on the heap. Each list has overhead storage that contains pointers to the first and last elements of the list and, if you use the LinkedList class from the Java JDK, a significant amount of additional fields. One could write a much simpler linked list class that provides only the necessary capabilities, but this certainly adds additional cost to the implementation of Hash-based Search.

The advanced reader should, at this point, question the use of a basic hash function and hash table for this problem. When the word list is fixed and not likely to change, we can do better by creating a perfect hash function. A perfect hash function is one that guarantees no collisions for a specific set of keys; this option is discussed in "Variations" on page 122. Let's first try to solve the problem without one.


For our first try at this problem, we choose a primary hash table H that will hold b = 2^18 − 1 = 262,143 elements. Our word list contains 213,557 words. If our hash function perfectly distributed the strings, there would be no collisions and only 48,586 open bins. This, however, is not the case. Table 5-4 shows the distribution of the hash values for the Java String class on our word list with a table of 262,143 bins. Recall that hash(s) = abs(hashCode(s)) % b. As you can see, no slot contains more than seven strings; for non-empty bins, the average number of strings per bin is approximately 1.46. Each row shows the number of bins used and how many words hash to those bins. Almost half of the table bins (116,186) have no strings that hash to them. So this hashing function wastes about 500KB of memory (assuming that the size of a pointer is four bytes). You may be surprised that this is quite a good hashing function and that finding one with better distribution will require a more complex scheme.

For the record, there were only five pairs of strings with identical hashCode values (for example, both "hypoplankton" and "unheavenly" have a computed hashCode value of 427,589,249)! The long linked lists are created by the modulo function that reduces the size of the hash to be less than b.

Table 5-4. Hash distribution using Java String.hashCode( ) method as key with b=262,143

If you use the LinkedList class, each non-empty element of H will require 12 bytes of memory, assuming that the size of a pointer is four bytes. Each string element is incorporated into a ListElement that requires an additional 12 bytes. For the previous example of 213,557 words, we require 5,005,488 bytes of memory beyond the actual string storage. The breakdown of this memory usage is:

• Size of the primary table: 1,048,572 bytes

• Size of 116,186 linked lists: 1,394,232 bytes

• Size of 213,557 list elements: 2,562,684 bytes

Storing the strings also has an overhead when using the JDK String class. Each string has 12 bytes of overhead. We can therefore add 213,557*12 = 2,562,684 additional bytes to our overhead. So the algorithm chosen in the example requires 7,568,172 bytes of memory.

The actual number of characters in the strings in the word list we used in the example is only 2,099,075, so it requires approximately 3.6 times the space required for the characters in the strings.
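The memory figures above follow from straightforward arithmetic; the short sketch below recomputes them, with all byte counts taken from the text.

```java
public class MemoryBreakdown {
    public static void main(String[] args) {
        long table    = 262143L * 4;    // primary table: one 4-byte pointer per bin = 1,048,572
        long lists    = 116186L * 12;   // linked-list overhead at 12 bytes apiece  = 1,394,232
        long elements = 213557L * 12;   // one 12-byte ListElement per word         = 2,562,684
        long strings  = 213557L * 12;   // 12 bytes of String overhead per word     = 2,562,684

        long total = table + lists + elements + strings;
        System.out.println(total);                       // 7,568,172 bytes
        System.out.println((double) total / 2099075L);   // roughly 3.6x the raw character data
    }
}
```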

Most of this overhead is the price of using the classes in the JDK. The engineering tradeoff must weigh the simplicity and reuse of the classes compared to a more complex implementation that reduces the memory usage. When memory is at a premium, you can use one of several variations discussed later to optimize memory usage. If, however, you have available memory, a reasonable hash function that does not produce too many collisions, and a ready-to-use linked list implementation, the JDK solution is usually acceptable.

The major force affecting the implementation is whether the collection is static or dynamic. In our example, the word list is fixed and known in advance. If, however, we have a dynamic collection that requires many additions and deletions of elements, we must choose a data structure for the hash table that optimizes these operations. Our collision handling in the example works quite well because inserting an item into a linked list can be done in constant time, and deleting an item is proportional to the length of the list. If the hash function distributes the elements evenly, the individual lists are relatively short.

Solution

In addition to the hash function, the solution for Hash-based Search contains two parts. The first is to create the hash table. The code in Example 5-6 shows how to use linked lists to hold the elements that hash into a specific table element. The input elements from collection C are retrieved using an Iterator.

Example 5-6. Loading the hash table

public void load (Iterator<V> it) {
  listTable = (LinkedList<V>[]) new LinkedList[tableSize];

  // Pull each value from the iterator and find its desired bin h.
  // Add to the existing list or create a new one into which the value is added.
  while (it.hasNext()) {
    V v = it.next();
    int h = hashMethod.hash(v);
    LinkedList<V> list = (LinkedList<V>) listTable[h];
    if (list == null) {
      list = new LinkedList<V>();
      listTable[h] = list;
    }
    list.add(v);
  }
}

Note that listTable is composed of tableSize bins, each of which is of type LinkedList<V> to store the elements.

Searching the table for elements now becomes trivial. The code in Example 5-7 does the job. Once the hash function returns an index into the hash table, we look to see whether the table bin is empty. If it’s empty, we return false, indicating that the searched-for string is not in the collection. Otherwise, we search the linked list for that bin to determine the presence or absence of the searched-for string.

Example 5-7. Searching for an element

public boolean search (V v) {
  int h = hashMethod.hash(v);
  LinkedList<V> list = (LinkedList<V>) listTable[h];
  if (list == null) { return false; }
  return list.contains(v);
}

Note that the hash function ensures that the hash index is in the range [0, tableSize). With the hashCode function for the String class, the hash function must account for the possibility that the integer arithmetic in hashCode has overflowed and returned a negative number. This is necessary because the modulo operator (%) returns a negative number if given a negative value (i.e., the Java expression −5%3 is equal to the value −2). For example, using the JDK hashCode method for String objects, the string "aaaaaa" returns the value −1,425,372,064.
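Putting the pieces together, the fragments in Examples 5-6 and 5-7 can be collected into a minimal self-contained class. This is a sketch: the HashTable class name and constructor are assumptions, and hash() stands in for hashMethod using the abs(hashCode) % b definition from earlier in this section.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;

// Minimal sketch combining Examples 5-6 and 5-7: a chained hash table.
public class HashTable<V> {
    private final int tableSize;
    private LinkedList<V>[] listTable;

    public HashTable(int tableSize) {
        this.tableSize = tableSize;
    }

    private int hash(V v) {
        return Math.abs(v.hashCode()) % tableSize;   // index in [0, tableSize)
    }

    @SuppressWarnings("unchecked")
    public void load(Iterator<V> it) {
        listTable = (LinkedList<V>[]) new LinkedList[tableSize];
        while (it.hasNext()) {
            V v = it.next();
            int h = hash(v);
            if (listTable[h] == null) {
                listTable[h] = new LinkedList<V>();  // create the list on first use of this bin
            }
            listTable[h].add(v);
        }
    }

    public boolean search(V v) {
        LinkedList<V> list = listTable[hash(v)];
        return list != null && list.contains(v);     // Sequential Search within the bin
    }

    public static void main(String[] args) {
        HashTable<String> h = new HashTable<>(31);
        h.load(Arrays.asList("cat", "dog", "emu").iterator());
        System.out.println(h.search("dog"));   // true
        System.out.println(h.search("fox"));   // false
    }
}
```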

Analysis

As long as the hash function distributes the elements in the collection fairly evenly, Hash-based Search has excellent performance. The average time required to search for an element is constant, or O(1).

The search consists of a single look-up in H followed by a linear search through a short list of collisions. The components to searching for an element in a hash table are:

• Computing the hash value

• Accessing the item in the table indexed by the hash value

• Finding the specified item in the presence of collisions

All Hash-based Search algorithms share the first two components; different behaviors stem from variations in collision handling.

The cost of computing the hash value must be bounded by a fixed, constant upper bound. In our word list example, computing the hash value was proportional to the length of the string. If Tk is the time it takes to compute the hash value for the longest string, it will require ≤ Tk to compute any hash value. Computing the hash value is therefore considered a constant-time operation.

The second part of the algorithm also performs in constant time. If the table is stored on secondary storage, there may be a variation that depends upon the position of the element and the time required to position the device, but this has a constant upper bound.

If we can show that the third part of the computation also has a constant upper bound, it will mean that the overall time performance of Hash-based Search is constant. For a hash table, define the load factor α to be the average number of elements in a linked list. More precisely, α = n/b, where b is the number of bins in the hash table and n is the number of elements stored in the hash table. Table 5-5 shows the actual load factor in the hash tables we create as b increases. Note how the maximum length of the element linked lists drops while the number of bins containing a unique element rapidly increases once b is sufficiently large. In the final row, 81% of the elements are hashed to a unique bin and the average number of elements in a bin becomes just a single digit. Regardless of the initial number of elements, you can choose a sufficiently large b value to ensure that there will be a small, fixed number of elements (on average) in every bin. This means that searching for an element in a hash table is no longer dependent on the number of elements in the hash table, but rather is a fixed cost, producing amortized O(1) performance.
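The load factor for the table sizes used in this chapter can be computed directly; a small sketch, with n taken from the word-list size discussed earlier:

```java
public class LoadFactor {
    // alpha = n / b: the average number of elements per bin.
    public static double alpha(int n, int b) {
        return (double) n / b;
    }

    public static void main(String[] args) {
        int n = 213557;                                     // words in the sample list
        for (int b : new int[] {262143, 524287, 1048575}) { // table sizes used in the chapter
            System.out.printf("b=%,d  alpha=%.3f%n", b, alpha(n, b));
        }
    }
}
```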

Table 5-5. Statistics of hash tables created by example code (columns: b, load factor α, minimum and maximum linked-list lengths)

Table 5-6 compares the performance of the code from Example 5-7 with the JDK class java.util.Hashtable on hash tables of different sizes. For the tests labeled p=1.0, each of the 213,557 words is used as the target item to ensure that the word exists in the hash table. For the tests labeled p=0.0, each of these words has its last character replaced with a “*” to ensure that the word does not exist in the hash table. Note also that we keep the size of the search words for these two cases the same to ensure that the cost for computing the hash is identical.

We ran each test 100 times and discarded the best- and worst-performing trials. The average of the remaining 98 trials is shown in Table 5-6. To help understand these results, statistics on the hash tables we created are shown in Table 5-5.

Table 5-6. Search time (in milliseconds) for various hash table sizes

             Our hashtable shown in Example 5-7    java.util.Hashtable with default capacity
b            p = 1.0       p = 0.0                 p = 1.0       p = 0.0
524,287      75.00         33.05                   69.33         28.26
1,048,575    76.68         29.02                   72.24         27.37

As the load factor goes down, the average length of each element linked list also goes down, leading to improved performance. Indeed, by the time b=1,048,575 no linked list contains more than five elements. Because a hash table can typically grow large enough to ensure that all linked lists of elements are small, its search performance is considered to be O(1). However, this is contingent (as always) on having sufficient memory and a suitable hash function to disperse the elements throughout the bins of the hash table.

The performance of the existing java.util.Hashtable class outperforms our example code, but the savings are reduced as the size of the hash table grows. The reason is that java.util.Hashtable contains optimized list classes that efficiently manage the element chains. In addition, java.util.Hashtable automatically "rehashes" the entire hash table when the load factor is too high; the rehash strategy is discussed in "Variations" on page 122. The implementation increases the cost of building the hash table, but improves search performance. If we prevent the "rehash" capability, search performance in java.util.Hashtable is nearly the same as our implementation.
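The rehash idea can be sketched as follows. This is an illustration only, not the JDK's actual implementation; the method shape and the bin computation reuse the abs(hashCode) % b scheme from this section.

```java
import java.util.LinkedList;

public class RehashSketch {
    // When the load factor n/b grows past some threshold, allocate a larger
    // table and re-insert every element, recomputing each bin for the new size.
    @SuppressWarnings("unchecked")
    public static <V> LinkedList<V>[] rehash(LinkedList<V>[] old, int newSize) {
        LinkedList<V>[] bigger = (LinkedList<V>[]) new LinkedList[newSize];
        for (LinkedList<V> bin : old) {
            if (bin == null) continue;
            for (V v : bin) {
                int h = Math.abs(v.hashCode()) % newSize;  // new bin under the larger size
                if (bigger[h] == null) bigger[h] = new LinkedList<V>();
                bigger[h].add(v);
            }
        }
        return bigger;
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        LinkedList<String>[] table = (LinkedList<String>[]) new LinkedList[3];
        int h = Math.abs("cat".hashCode()) % 3;
        table[h] = new LinkedList<String>();
        table[h].add("cat");

        LinkedList<String>[] bigger = rehash(table, 7);
        System.out.println(bigger[Math.abs("cat".hashCode()) % 7]);  // [cat]
    }
}
```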

Table 5-7 shows the number of times rehash is invoked when building the java.util.Hashtable hash table and the total time (in milliseconds) required to build the hash table. We constructed the hash tables from the word list described earlier; after running 100 trials, the best and worst performing timings were discarded and the table contains the average of the remaining 98 trials. The java.util.Hashtable class performs extra computation while the hash table is being constructed to improve the performance of searches (a common tradeoff). Columns 3 and 5 of Table 5-7 show a noticeable cost penalty when a rehash occurs. Also note that in the last two rows, the hash tables do not rehash themselves, so the results in columns 3, 5, and 7 are nearly identical.

