Hashing Study Guide

Author: Josh Hug

Overview

Brute force approach. All data is just a sequence of bits. Can treat key as a gigantic number and use it as an array index. Requires exponentially large amounts of memory.

Hashing. Instead of using the entire key, represent entire key by a smaller value. In Java, we hash objects with a hashCode() method that returns an integer (32 bit) representation of the object.

hashCode() to index conversion. To use hashCode() results as an index, we must convert the hashCode() to a valid index. Modulus does not work since hashCode may be negative. Taking the absolute value then the modulus also doesn't work since Math.abs(Integer.MIN_VALUE) is negative. Typical approach: use hashCode & 0x7FFFFFFF instead before taking the modulus.
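The pitfalls above can be checked directly. Below is a minimal sketch, assuming a hypothetical helper named indexFor and a bucket count M greater than 0; neither name comes from the course code.

```java
class HashIndex {
    /** Maps any hashCode to an index in [0, M). Assumes M > 0. */
    static int indexFor(int hashCode, int M) {
        // Clear the sign bit so the value is non-negative, then reduce mod M.
        return (hashCode & 0x7FFFFFFF) % M;
    }

    public static void main(String[] args) {
        int M = 10;
        System.out.println(-7 % M);                          // -7: not a valid index
        System.out.println(Math.abs(Integer.MIN_VALUE));     // -2147483648: still negative
        System.out.println(indexFor(Integer.MIN_VALUE, M));  // 0
        System.out.println(indexFor(-7, M));                 // 1
    }
}
```

Math.floorMod(hashCode, M) is another way to get a non-negative index in Java 8 and later; the masking version is shown here because it is the one described above.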

Hash function. Converts a key to a value between 0 and M-1. In Java, this means calling hashCode(), setting the sign bit to 0, then taking the modulus.

Designing good hash functions. Requires a blending of sophisticated mathematics and clever engineering; beyond the scope of this course. Most important guideline is to use all the bits in the key. If hashCode() is known and easy to invert, adversary can design a sequence of inputs that result in everything being placed in one bin. Or if hashCode() is just plain bad, same thing can happen.

Uniform hashing assumption. For our analyses below, we assumed that our hash function distributes all input data evenly across bins. This is a strong assumption and never exactly satisfied in practice.

Collision resolution. Two philosophies for resolving collisions were discussed in class: separate (a.k.a. external) chaining and open addressing.

Separate-chaining hash table. Key-value pairs are stored in an array of M linked lists of nodes. The hash function tells us which of these linked lists to use. Get and insert both potentially require scanning through the entire list.
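A minimal sketch of a separate-chaining table, to make the structure concrete. The class name ChainedMap, its Entry helper, and the fixed bucket count are illustrative assumptions, not the course implementation; resizing and deletion are left out.

```java
import java.util.LinkedList;

class ChainedMap<K, V> {
    /** One key-value pair stored in a chain. */
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets; // array of M chains
    private int size;

    @SuppressWarnings("unchecked")
    ChainedMap(int M) {
        buckets = (LinkedList<Entry<K, V>>[]) new LinkedList[M];
        for (int i = 0; i < M; i += 1) {
            buckets[i] = new LinkedList<>();
        }
    }

    /** Hash function: hashCode(), sign bit cleared, then modulus. */
    private int indexFor(K key) {
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    /** Scans only the chain chosen by the hash function. */
    V get(K key) {
        for (Entry<K, V> e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    /** Overwrites the value if the key is present, otherwise appends. */
    void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = buckets[indexFor(key)];
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        chain.add(new Entry<>(key, value));
        size += 1;
    }

    int size() { return size; }
}
```

Both get and put touch exactly one chain, so their cost is the number of entries in that chain.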

Resizing separate chaining hash tables. Understand how resizing may lead to objects moving from one linked list to another. Primary goal is so that M is always proportional to N, i.e. maintaining a load factor bounded above by some constant.
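To see why objects move during a resize, here is a small sketch that re-inserts every key into a table with a different number of bins. It stores bare Integer keys in lists of lists purely for illustration; the names ResizeDemo, indexFor, and resize are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class ResizeDemo {
    static int indexFor(int hashCode, int M) {
        return (hashCode & 0x7FFFFFFF) % M;
    }

    /** Builds a new table with newM bins and re-inserts every key from the old one. */
    static List<List<Integer>> resize(List<List<Integer>> old, int newM) {
        List<List<Integer>> fresh = new ArrayList<>();
        for (int i = 0; i < newM; i += 1) {
            fresh.add(new ArrayList<>());
        }
        for (List<Integer> chain : old) {
            for (Integer key : chain) {
                // The index depends on M, so it must be recomputed for the new size.
                fresh.get(indexFor(key.hashCode(), newM)).add(key);
            }
        }
        return fresh;
    }
}
```

For example, with M = 4 the keys 1, 5, and 9 all land in bin 1. After resizing to M = 8, keys 1 and 9 stay in bin 1 while key 5 moves to bin 5.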

Performance of separate-chaining hash tables. Cost of a given get, insert, or delete is given by number of entries in the linked list that must be examined.

The expected amortized search and insert time (assuming items are distributed evenly) is N / M, which is no larger than some constant (due to resizing).

Open-addressing hash tables. We didn't go over this in detail in 61B, but this is where you use empty array entries to handle collisions, e.g. via linear probing. Not required for exam.
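For the curious, a minimal linear-probing sketch. It assumes a fixed capacity M, non-null keys, and no deletion or resizing; the class name LinearProbingMap is hypothetical and this is not the linked example implementation.

```java
class LinearProbingMap<K, V> {
    private final K[] keys;   // null marks an empty slot
    private final V[] values;
    private int size;

    @SuppressWarnings("unchecked")
    LinearProbingMap(int M) {
        keys = (K[]) new Object[M];
        values = (V[]) new Object[M];
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7FFFFFFF) % keys.length;
    }

    /** Starts at the hashed slot and walks forward until the key or an empty slot is found. */
    void put(K key, V value) {
        int i = indexFor(key);
        for (int probes = 0; probes < keys.length; probes += 1) {
            if (keys[i] == null) {
                keys[i] = key;
                values[i] = value;
                size += 1;
                return;
            }
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length; // wrap around at the end of the array
        }
        throw new IllegalStateException("table is full");
    }

    /** Follows the same probe sequence; an empty slot means the key is absent. */
    V get(K key) {
        int i = indexFor(key);
        for (int probes = 0; probes < keys.length; probes += 1) {
            if (keys[i] == null) {
                return null;
            }
            if (keys[i].equals(key)) {
                return values[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }

    int size() { return size; }
}
```

With M = 4, the Integer keys 1, 5, and 9 all hash to slot 1, so they end up in slots 1, 2, and 3 respectively.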

Example Implementations

External Chaining HT

Linear Probing HT

C level

[Adapted from Textbook 3.4.5] Is the following implementation of hashCode() legal?
```
public int hashCode() {
    return 17;
}
```

If so, describe the effect of using it. If not, explain why.

B level

HW7.
Problem 2 of the Fall 2014 midterm.

A level

One strategy discussed in class for hashing objects containing multiple pieces of data is as follows (in pseudocode), where x.get(i) returns the ith piece of data in x:

```
int hashCode = 0;
for (int i = 0; i < x.length; i += 1) {
    hashCode *= R;
    hashCode += x.get(i).hashCode();
    hashCode = hashCode % M;
}
return hashCode;
```

Effectively, we're summing each piece of data multiplied by a different power of R. In class, we used R = 31, and M is the number of buckets. Explain why the idea above does not work well if M = 31 (or some power of 31).

Suppose we start with a hash table of size 2 and double its size whenever the load factor exceeds some constant. Why is this procedure for setting sizes suboptimal from the perspective of utilizing all of the bits of the hashCode?
[Adapted from textbook 3.4.23] Suppose that we hash strings as described in A-level problem 1, using R = 256 and M = 255. Show that any permutation of letters within a string hashes to the same value. Why is this a bad thing?
Find 2 strings in Java that hash to the same value (writing code is probably best).
CS61B Fall 2009 midterm, #4 (really beautiful problem)
Explain why the approach in A-level question 1 works better if we initially start the hashCode at 1 instead of 0.

A+ level

Give a simple procedure that can be carried out by hand that takes a Java string X and finds another Java string Y with the same hashCode().