Previous Section
 < Free Open Study > 
Next Section


10.3 Hashing

So far, we have succeeded in paring down our O(N) search to O(log2N) complexity by keeping the list sorted sequentially with respect to the key value. That is, the key in the first element is less than (or equal to) the key in the second element, which is less than the key in the third element, and so on. Can we do even better? Is it possible to design a search of O(1)-that is, one that has a constant search time, no matter where the element is located in the list?

In theory, that goal is not an impossible dream. Let's look at an example, a list of employees of a fairly small company. Each of the 100 employees has an ID number in the range 0 to 99, and we want to access the employee records using the key idNum. If we store the elements in an array that is indexed from 0 to 99, we can directly access any employee's record through the array index. There is a one-to-one correspondence between the element keys and the array index; in effect, the array index functions as the key of each element.

In practice, however, this perfect relationship between the key value and the location of an element is not easy to establish or maintain. Consider a similar small company that uses its employees' five-digit ID number as the primary key. Now the range of key values goes from 00000 to 99999. Obviously, it is impractical to set up an array of 100,000 elements, of which only 100 are needed, just to make sure that each employee's element is in a perfectly unique and predictable location.

What if we keep the array size down to the size that we actually need (an array of 100 elements) and use just the last two digits of the key to identify each employee? For instance, the element of employee 53374 is in employeeList.info [74], and the element of employee 81235 is in employeeList.info[35]. Note that the elements are not sorted according to the value of the key as they were in our earlier discussion; the position of employee 81235's record precedes that of employee 53374 in the array, even though the value of its key is larger. Instead, the elements are sorted with respect to some function of the key value.

Hash function A function used to manipulate the key of an element in a list to identify its location in the list

Hashing The technique used for ordering and accessing elements in a list in a relatively constant amount of time by manipulating the key to identify its location in the list

This function is called a hash function, and the search technique we are using is called hashing. In the case of the employee list, the hash function is (Key % 100). The key (idNum) is divided by 100, and the remainder is used as an index into the array of employee elements, as illustrated in Figure 10.17. This function assumes that the array is indexed from 0 to 99 (MAX_ITEMS = 100). The function to perform the conversion of key values to indexes is very simple:

int ItemType::Hash() const
// Post: Returns an integer between 0 and MAX_ITEMS - 1.
{
  return (idNum % MAX_ITEMS);
}
Click To expand
Figure 10.17: Using a hash function to determine the location of the element in an array

Here we assume that Hash is a member function of ItemType, the type of the items in the list, and that idNum is a data member of ItemType.

This hash function has two uses. First, it is used as a method of accessing the list element. The result of the hash function tells us where to look for a particular element-information we need to retrieve, modify, or delete the element. Here, for example, is a simple version of the function RetrieveItem, which assumes that the element is present in the list:

template<class ItemType>
void ListType<ItemType>::RetrieveItem(ItemType& item)
// Post: Returns the element in the array at position
//       item.Hash().
{
  int location;

  location = item.Hash();
  item = info[location];
}

Second, the hash function determines where in the array to store the element. If the employee list elements were inserted into the list using an insert operation from Chapter 3-into sequential array slots or into slots with their relative order determined by the key value-we could not use the hash function to retrieve them. We have to create a version of an insert operation that puts each new element into the correct slot according to the hash function. Here is a simple version of InsertItem, which assumes that the array slot at the index returned from the hash function is not in use:

template<class ItemType>
void ListType<ItemType>::InsertItem(ItemType item)
// Post: item is stored in the array at position item.Hash().
{
  int location;

  location = item.Hash();
  info [location] = item;
  length++;
}

Figure 10.18(a) shows an array whose elements-records for the employees with the key values (unique ID numbers) 12704, 31300, 49001, 52202, and 65606-were added using InsertItem. Note that this function does not fill the array positions sequentially. Because we have not yet inserted any elements whose keys produce the hash values 3 and 5, the array slots [3] and [5] are logically "empty." This technique differs from the approach we used in Chapter 3 to create a sorted list. In Figure 10.18(b), the same employee records have been inserted into a sorted list using the InsertItem operation from Chapter 3. Note that, unless the hash function was used to determine where to insert an element, the hash function is useless for finding the element.

Collisions

By now you are probably objecting to this scheme on the grounds that it does not guarantee unique hash locations. For example, ID number 01234 and ID number 91234 both "hash" to the same location: list.info [34]. The problem of avoiding these collisions represents the biggest challenge in designing a good hash function. A good hash function minimizes collisions by spreading the elements uniformly throughout the array. We say "minimizes collisions," because it is extremely difficult to avoid them completely.

Collision The condition resulting when two or more keys produce the same hash location

Assuming that some collisions will occur, where do you store the elements that cause them? We briefly describe several popular collision-handling algorithms in the next sections. Note that the scheme used to find the place to store an element determines the method subsequently used to retrieve it.

Linear probing Resolving a hash collision by sequentially searching a hash table beginning at the location returned by the hash function

Linear Probing A simple approach to resolving collisions is to store the colliding element in the next available space. This technique is known as linear probing. In the situation depicted in Figure 10.19, we want to add the employee element with the key ID number 77003. The hash function returns 3. But there is already an element stored in this array slot, the record for Employee 50003. We increment location to 4 and examine the next array slot. The list.info[4] entry is also in use, so we increment location again. This time we find an empty slot, so we store the new element into list.info[5].

Click To expand
Figure 10.18: Comparing hashed and sequential lists of identical elements
Click To expand
Figure 10.19: Handling collisions with linear probing

What happens if the key hashes to the last index in the array and that space is in use? We can consider the array to be a circular structure and continue looking for an empty slot at the beginning of the array. This situation is similar to the circular array-based queue we developed in Chapter 4. There we used the % operator when we incremented our index. We can use similar logic here.

How do we know whether an array slot is "empty"? We can initialize the array slot to contain a special emptyItem value. This value (a parameter to the class constructor) must be syntactically legal, but semantically illegal. For instance, if all employees have nonnegative integer idNum keys, we can use -1 as the key value for an "empty" slot. Now it is easy to tell whether the slot is free: We just compare the value in the position to emptyItem.

The following version of InsertItem uses linear probing to find a place to store a new element. It assumes that the array has room for another element; that is, the client checks for IsFull before calling the function. (We have retained the length member of ListType. Even though it no longer tells us where to find the end of the list, it is still useful in determining whether the list is full.)

template<class ItemType>
void ListType<ItemType>::InsertItem(ItemType item)
// Post: item is stored in the array at position item.Hash()
//       or the next free spot.
{
  int location;

  location = item.Hash();
  while (info [location] != emptyItem)
    location = (location + 1) % MAX_ITEMS;
  info [location] = item;
  length++;
}

To search for an element using this collision-handling technique, we perform the hash function on the key, then compare the desired key to the actual key in the element at the designated location. If the keys do not match, we use linear probing, beginning at the next slot in the array. Following is a version of the function RetrieveItem that uses this approach. If the element is not found in the list, the outgoing parameter found is false, and item is undefined.

template<class ItemType>
void ListType<ItemType>::RetrieveItem(ItemType& item, bool& found)
{
  int location;
  int startLoc;
  bool moreToSearch = true;

  startLoc = item.Hash();
  location = startLoc;
  do
  {
    if (info [location] == item | | info[location] == emptyItem)
      moreToSearch = false;
    else
      location = (location + 1) % MAX_ITEMS;
  } while (location != startLoc && moreToSearch);
  found = (info[location] == item);
  if (found)
    item = info [location];
}

We have discussed the insertion and retrieval of elements in a hash table, but we have not yet mentioned how to delete an element from the table. If we did not need to concern ourselves with collisions, the deletion algorithm would be simple:

Collisions, however, complicate matters. We can find the element using the same search approach as we used for RetrieveItem. But when we locate the element in the hash table, we cannot merely replace the item with emptyItem. A review of RetrieveItem shows the problem. In the loop, the detection of an empty slot ends the search. If DeleteItem "empties" the slot occupied by a deleted element, we may terminate a subsequent search prematurely.

Let's look at an example. In Figure 10.20, suppose we delete the element with the key 77003 by setting the array slot [5] to emptyItem. A subsequent search for the element with the key 42504 would begin at the hash location [4]. The record in this slot is not the one we are looking for, so we increment the hash location to [5]. This slot, which was formerly occupied by the record that we deleted, is now empty (contains emptyItem), so we terminate the search. We haven't really finished searching, however-the record that we want is found in the next slot.

Click To expand
Figure 10.20: A hash program with linear probing

One solution to this problem is to create a third constant value, deletedItem, to use in slots that were occupied by deleted records. If a slot contains deletedItem, it means that this slot is currently free but was previously occupied.

With this change, we must modify both the insertion and the retrieval operations to process slots correctly. The insertion algorithm treats a slot with deletedItem and emptyItem in the same way; the search for an available slot for the new element ends. emptyItem halts the search in the function RetrieveItem, but deletedItem does not.

This solution corrects the search problem, but produces another dilemma: After many deletions, the search "path" to a record may travel through many array slots with deletedItem. This "wandering" may cause the efficiency of retrieving an element to deteriorate. These problems illustrate that a hash table, in the forms that we have studied thus far, is not the most effective data structure for implementing lists whose elements may be deleted.

Clustering One problem with linear probing is that it results in a situation called clustering. A good hash function produces a uniform distribution of indexes throughout the array's index range. Initially, therefore, records are inserted throughout the array, with each slot being equally likely to be filled. Over time, after a number of collisions have been resolved, the distribution of records in the array becomes less and less uniform. The records tend to cluster together, as multiple keys begin to compete for a single hash location.

Clustering The tendency of elements to become unevenly distributed in the hash table, with many elements clustering around a single hash location

Consider the hash table in Figure 10.20. Only a record whose key produces the hash value 8 would be inserted into array slot [8]. However, any records with keys that produce the hash values 3, 4, 5, 6, or 7 would be inserted into array slot [7]. That is, array slot [7] is five times as likely as array slot [8] to be filled. Clustering results in inconsistent efficiency of insertion and retrieval operations.

Rehashing Resolving a collision by computing a new hash location from a hash function that manipulates the original location rather than the element's key

Rehashing The technique of linear probing discussed here is an example of collision resolution by rehashing. If the hash function produces a collision, the hash value serves as the input to a rehash function to compute a new hash value. In the previous section, we added 1 to the hash value to create a new hash value; that is, we used the rehash function:

(HashValue + 1) % 100

For rehashing with linear probing, you can use any function

(HashValue + constant) % array-size

as long as constant and array-size are relatively prime-that is, if the largest number that divides both of them evenly is 1. For instance, given the 100-slot array in Figure 10.21, we might use the constant 3 in the rehash function:

(HashValue + 3) % 100

Click To expand
Figure 10.21: Handling collisions with rehashing

(Although 100 is not a prime number, 3 and 100 are relatively prime; they have no common factor larger than 1.)

Suppose that we want to add a record with the key 14001 to the hash table in Figure 10.21. The original hash function (Key % 100) returns the hash value 1, but this array slot is already in use; it contains the record with the key 44001. To determine the next array slot to try, we apply the rehash function using the results of the first hash function as input: (1 + 3) % 100 = 4. The array slot at index [4] is also in use, so we reapply the rehash function until we find an available slot. Each time, we use the value computed from the previous rehash as input to the rehash function. The second rehash gives us (4 + 3) % 100 = 7; this slot is in use, too. The third rehash gives us (7 + 3) % 100 = 10; the array slot at index [10] is empty, so the new element is inserted there.

To understand why the constant and the number of array slots must be relatively prime, consider the rehash function

(HashValue + 2) % 100

We want to add the record with the key 14001 to the hash table pictured in Figure 10.21. The original hash function, Key % 100, returns the hash value 1. This array slot is already occupied. We resolve the collision by applying the rehash function above, examining successive odd-numbered indexes until a free slot is found. What happens if all of the slots with odd-numbered indexes are already in use? The search would fail-even though the hash table includes free slots with even-numbered indexes. This rehash function does not cover the full index range of the array. However, if the constant and the number of array slots are relatively prime (like 3 and 100), the function produces successive rehashes that eventually cover every index in the array.

Rehash functions that use linear probing do not eliminate clustering (although the clusters are not always visually apparent in a figure). For example, in Figure 10.21, any record with a key that produces the hash value 1, 4, 7, or 10 would be inserted into the slot at index [10].

Quadratic probing Resolving a hash collision by using the rehashing formula (HashValue ± I2) % array-size, where I is the number of times that the rehash function has been applied

Random probing Resolving a hash collision by generating pseudo-random hash values in successive applications of the rehash function

In linear probing, we add a constant (usually 1) in each successive application of the rehash function. A second approach, called quadratic probing, makes the result of rehashing dependent on how many times the rehash function has been applied. In the Ith rehash, the function is

(HashValue ± I2) % array-size

The first rehash adds 1 to HashValue, the second rehash subtracts 1 from HashValue, the third rehash adds 4, the fourth subtracts 4, and so on. Quadratic probing reduces clustering, but it does not necessarily examine every slot in the array. For example, if array-size is a power of 2 (512 or 1,024, for example), relatively few array slots are examined. If array-size is a prime number of the form (4 * some-integer + 3), however, quadratic probing does examine every slot in the array.

A third approach uses a pseudo-random number generator to determine the increment to HashValue in each application of the rehash function. Random probing is an excellent technique for eliminating clustering, but it tends to be slower than the other techniques we have discussed.

Bucket A collection of elements associated with a particular hash location

Chain A linked list of elements that share the same hash location

Buckets and Chaining Another alternative for handling collisions is to allow multiple element keys to hash to the same location. One solution is to let each computed hash location contain slots for multiple elements, rather than just a single element. Each of these multi-element locations is called a bucket. Figure 10.22 shows a hash table with buckets that can hold three elements each. Using this approach, we can allow collisions to produce duplicate entries at the same hash location, up to a point. When the bucket becomes full, we must again deal with the problem of handling collisions.

Click To expand
Figure 10.22: Handling collisions by hashing with buckets

Another solution, which avoids this problem, is to use the hash value not as the actual location of the element, but rather as the index into an array of pointers. Each pointer accesses a chain of elements that share the same hash location. Figure 10.23 illustrates this solution to the problem of collisions. Rather than rehashing, we simply allow both elements to share hash location [3]. The entry in the array at this location contains a pointer to a linked list that includes both elements.

Click To expand
Figure 10.23: Handling collisions by hashing with chaining

To search for a given element, you first apply the hash function to the key and then search the chain for the element. Searching is not eliminated, but it is limited to elements that actually share a hash location. In contrast, with linear probing you may have to search through many additional elements if the slots following the hash location are filled with elements from collisions on other hash locations.

Figure 10.24 compares the chaining and hash-and-search schemes. The elements were added in the order shown on the next page.

Click To expand
Figure 10.24: Comparison of linear probing and chaining schemes

45300

20006

50002

40000

25001

13000

65905

30001

95000

Figure 10.24(a) depicts the linear probing approach to collision handling; Figure 10.24(b) shows the result of chaining the colliding elements. Let's search for the element with the key 30001.

Using linear probing, we apply the hash function to get the index [1]. Because list.info[1] does not contain the element with the key 30001, we search sequentially until we find the element in list.info[7].

Using the chaining approach, we apply the hash function to get the index [1]. list.info[1] directs us to a chain of elements whose keys hash to 1. We search this linked list until we find the element with the desired key.

Another advantage of chaining is that it simplifies the deletion of records from the hash table. We apply the hash function to obtain the index of the array slot that contains the pointer to the appropriate chain. The node can then be deleted from this chain using the linked-list algorithm from Chapter 5.

Choosing a Good Hash Function

One way to minimize collisions is to use a data structure that has more space than is actually needed for the number of elements, so as to increase the range of the hash function. In practice it is desirable to have the array size somewhat larger than the number of elements required, thereby reducing the number of collisions.

Selecting the table size involves a space versus time tradeoff. The larger the range of hash locations, the less likely that two keys will hash to the same location. However, allocating an array that contains a large number of empty slots wastes space.

More importantly, you can design your hash function to minimize collisions. The goal is to distribute the elements as uniformly as possible throughout the array. Therefore you want your hash function to produce unique values as often as possible. Once you admit collisions, you must introduce some sort of searching, either through array or chain searching or through rehashing. The access to each element is no longer direct, and the search is no longer O(1). In fact, if the collisions cause very disproportionate chains, the worst case may be almost O(N)![2]

To avoid such a situation, you need to know something about the statistical distribution of keys. Imagine a company whose employee records are sorted based on a six-digit company ID. The company has 500 employees, and we decide to use a chained approach to handle collisions. We set up 100 chains (expecting an average of five elements per chain) and use the hash function

idNum % 100

That is, we use the last two digits of the six-digit ID number as our index. The planned hash scheme is shown in Figure 10.25(a). Figure 10.25(b) shows what happened when we implemented the hash scheme. How could the distribution of the elements have come out so skewed? It turns out that the company's ID number is a concatenation of three fields:

Click To expand
Figure 10.25: Hash scheme to handle employee records

The hash scheme depended solely on the year hired to produce hash values. Because the company was founded in 1987, all the elements were crowded very disproportionately into a small subset of the hash locations. A search for an employee element, in this case, is O(N). Although this is an exaggerated example, it illustrates the need to understand as completely as possible the domain and predicted values of keys in a hash scheme.

Division Method The most common hash functions use the division method (%) to compute hash values. We used this type of function in the preceding examples. The general function is

Key % TableSize

We have already mentioned the idea of making the table somewhat larger than the number of elements required, so as to increase the range of hash values. In addition, better results are produced with the division method when the table size is a prime number.

The advantage of the division hash function is its simplicity. Sometimes, however, it is necessary to use a more complicated (or even exotic) hash function to get a good distribution of hash values.

Other Hash Methods How can we use hashing if the element key is a string instead of an integer? One approach is to use the internal representations of the string's characters to create a number that can serve as an index. (Recall that each ASCII character is represented in memory as an integer in the range 0 through 127.) For instance, the following simple hash function takes a five-element char array and produces a hash value in the range 0 through MAX_ITEMS - 1:

int Hash(char letters [])
Post: Returns an integer between 0 and MAX_ITEMS - 1.
{
  int sum = 0;

  for (int index = 0; index < 5; index++)
    sum = sum + int(letters[index]);
  return sum % MAX_ITEMS;
}

Folding A hash method that breaks the key into several pieces and concatenates or exclusive-ORs some of the pieces to form the hash value

A hash method called folding involves breaking the key into several pieces and concatenating or exclusive-OR'ing some of the pieces to form the hash value. Another method is to square the key and then use some of the digits (or bits) of the key as a hash value. A number of other techniques exist, all of which are intended to make the hash location as unique and random (within the allowed range) as possible.

Let's look at an example of folding. Suppose we want to devise a hash function that results in an index between 0 and 255, and the internal representation of the int key is a bit string of 32 bits. We know that it takes 8 bits to represent the 256 index values (28 = 256). A folding algorithm to create a hash function might

  1. Break the key into four bit strings of 8 bits each,

  2. Exclusive-OR the first and last bit strings,

  3. Exclusive-OR the two middle bit strings, and

  4. Exclusive-OR the results of steps 2 and 3 to produce the 8-bit index into the array.

We illustrate this scheme using the key 618403. The binary representation of this key is

00000000000010010110111110100011

We break this bit string into four 8-bit strings:

00000000

(leftmost 8 bits)

00001001

(next 8 bits)

01101111

(next 8 bits)

10100011

(rightmost 8 bits)

The next step is to exclusive-OR the first and last bit strings. The exclusive OR of two bits is 0 if the two bits are the same, and 1 if they are different. To exclusive-OR (denoted as XOR) bit strings, we apply this rule to successive pairs of bits.

00000000

(XOR) 10100011


10100011

Next, we exclusive-OR the middle two bit strings:

00001001

(XOR) 01101111


01100110

Finally, we exclusive-OR the results of the preceding two steps:

10100011

(XOR) 01100110


11000101

This binary number is equivalent to the decimal number 197, so the key 618403 hashes into the index 197. We leave the implementation of this hash function as an exercise.

The relationship between the key and the index is not intuitively obvious, but the indexes produced are likely to be uniformly distributed through the range of possible values.

When using an exotic hash function, you should keep two considerations in mind. First, you should consider the efficiency of calculating the function. Even if a hash function always produces unique values, it is not a good hash function if it takes longer to calculate the hash value than to search half the list. Second, you should consider programmer time. An extremely exotic function that somehow produces unique hash values for all of the known key values may fail if the domain of possible key values changes in a later modification. The programmer who has to modify the program may then waste a lot of time trying to find another hash function that is equally clever.

Of course, if you know all of the possible keys ahead of time, it is possible to determine a perfect hash function. For example, if you needed a list of elements whose keys were the reserved words in a computer language, you could find a hash function that hashes each word to a unique location. In general, it takes a great deal of work to discover a perfect hash function. And usually, we find that its computational complexity is very high, perhaps comparable to the effort required to execute a binary search.

Complexity

We began the discussion of hashing by trying to find a list implementation where the insertion and deletion had O(1) complexity. If our hash function never produces duplicates or if the array size is very large compared to the expected number of items in the list, then we have reached our goal. In general, this will not be the case. Clearly, as the number of elements approaches the array size, the efficiency of the algorithms deteriorates. A precise analysis of the complexity of hashing is beyond the scope of this book. Informally we can say that the larger the array relative to the expected number of elements, the more efficient the algorithms.

[2]This is O(N) "exclamation point," not O(N) factorial, as one long-ago student complained when he got the answer wrong on a quiz.



Previous Section
 < Free Open Study > 
Next Section
Converted from CHM to HTML with chm2web Pro 2.85 (unicode)