What is hashing?
Hashing is a method that reliably takes an input and converts it into a hash (output), the hash cannot be converted back into the input (usually).
Why Hash in the first place?
To understand why we hash things, first ask yourself. How can facebook check that your password is correct without having it saved? The answer is Hashing.
Hashing can also be used to quickly (Like really quickly) find where you previously stored something. This is what we will explore.
Let’s start with some definitions.
Hashmap: A table/array with a hash function
Hash Function: A function that takes input x and spits out output y –> Cannot convert y to x again.
Hash Function Properties:
- Can map any key to the table with index 0 -> M-1
- Should have fast computation
- Should distribute keys uniformly to the table
- Should try to avoid collisions
Hash Function Types:
Division Hash: h(k) = k % m / (% = modulus, m is a large prime number)
Example: Given table size m = 10
h(89) = 89 % 10 = 9 –> insert element 89 into slot 9
Multiplication Hash: h(k) = Truncate(m * frac(k*A)) Where 0 < A < 1, frac = get fraction portion
Simple Uniform Hash (SUH): Each item is equally likely to hash to a slot independently of other items.
Theorem: SUH with Chain based collision resolution takes O(1+@) to search
Rule of thumb: For chaining, use load factor @ ~1
Sounds good so far, what’s the issue?
Collisions man, collisions suck. A collision is when a hashing function hashes 2 separate elements into 1 slot on a table (Tables/arrays aren’t infinite in size ya know).
How do we solve them?
—-> Collision Resolution Algorithms
Collision Resolution Algorithms
Chaining / (Open Hashing)
Chaining: Each table entry is a linked list
Collision Resolution: When a collision happens, just append to the linked list and carry on
Issue: When a collision happens, we append to a linked list. Usually, hashing has an O(1) constant lookup time but with open hashing, we run the risk of having to search through a linked list. If somehow our hashing function sucks and all the elements are in one table entry, then our lookup time becomes O(n). (BAD)
Ideally: For good hash functions, get n/m elements in each slot
Closed Hashing/ (Open Addressed Hash)
Each slot can only hold one element, Load factor < 1, ideally Load Factor = 0.5
Linear Probing: Collision Function is linear
Hashing Function: hi(k) = [h‘(k) + i] % m
Intuition: Oops my parking spot is taken, go to next available spot
Problem: Primary Clustering: Interference with neighboring hashes, can take significantly longer to find the element you are looking for.
Quadratic Probing: Collision Function is quadratic in nature
Hashing Function: hi(k) = [h‘(k) + C0 + C1*i + C2*i2] % m
Intuition: Oops my parking spot is taken, go to next available spot but look for the next available spot in a quadratic fashion (I guess).
- – Secondary Clustering: Interference between Items of the same hash.
- – May get a half-empty table with unreachable spots (Due to quadratic nature of collisions)
- – Need to use a lazy form of deleting
Double Hashing: Use 2 hash functions
Hashing Function: hi(k) = [h1(k) + i * h2(k)] % m
Intuition: Oops this parking spot is full, let’s magically look in another spot for some unknown reason.
Problem: Overhead of computing 2 hashes.
Cuckoo Hashing: Use 1 hash function, but 2 tables
Hashing Function: Can be any hash function
Intuition: Oops this parking spot is full, let’s tow their car to another lot. and repeat.
Algorithm for inserting (x): Apply the first hash on x, h1(x), place x in table 1 with h1(x). If the cell is occupied by y, move y to location h2(y) in table 2. Repeat RECURSIVELY
Guarantees: Constant lookup time given (num elements/size of table) < 1/2.
Note: The probability of a cycle is very low.
Problem: Can get stuck in an infinite loop (rare but possible). if this happens then rehash enough elements to break the loop.