Hash Tables
A Hash Table is a data structure that provides an efficient way to store and retrieve data using key-value pairs. The underlying mechanism relies on a hash function that maps keys to specific indices in an array, ensuring fast access times.
Key Characteristics
- Time Complexity:
- Average-case: O(1) for insertion, deletion, and lookup.
- Worst-case: O(n) when collisions are not handled properly.
- Space Complexity: O(n), where n is the number of entries.
- Applications:
- Caching (e.g., LRU Cache)
- Databases (e.g., indexing)
- Implementing associative arrays and sets
Components of a Hash Table
- Array: The underlying data storage.
- Hash Function: Converts a key into an array index.
- Collision Handling Mechanism: Resolves conflicts when multiple keys hash to the same index.
Hash Functions
A Hash Function should have the following properties:
- Deterministic: Same input always produces the same output.
- Uniformity: Distributes keys uniformly across the array.
- Efficiency: Fast to compute.
- Minimization of Collisions: Reduces instances of keys hashing to the same index.
Popular Hash Functions:
-
Modulo Operation: Simplest method, computes
index = hash(key) % table_size
. However, not always ideal for non-uniform data distributions. -
Multiplicative Hash Function: Uses multiplication and fractional parts to achieve better uniformity (e.g.,
index = floor(table_size * frac(hash(key) * A))
, whereA
is a constant). -
Cryptographic Hash Functions:
- Examples: MD5, SHA-256.
- Ensure strong collision resistance.
- Slower but crucial for security-sensitive applications like digital signatures.
-
Universal Hashing: Family of hash functions designed to minimize collisions across any set of keys.
-
MurMurHash:
- Non-cryptographic hash function known for high speed and uniformity.
- Frequently used in distributed systems like Cassandra.
-
CityHash:
- Optimized for speed and commonly used in database indexing.
- Developed by Google.
-
XXHash:
- Extremely fast, designed for non-cryptographic use cases.
- Popularized by Facebook, with the latest version being XXHash3, offering high-speed hashing with excellent distribution and low collision rates.
Collision Handling Mechanisms
Hash functions can produce the same index for different keys, resulting in a collision. Effective collision handling is crucial for maintaining the performance and integrity of a hash table.
Two common techniques for handling collisions are chaining and open addressing.
Chaining
Chaining resolves collisions by storing all elements with the same hash index in a separate data structure, typically a linked list or a dynamic array. Each index of the hash table contains a pointer to the head of the chain.
Steps:
- Compute the hash index using the hash function.
- If the index is empty, insert the element there.
- If the index already contains a chain, append the new element to the chain.
Example:
When inserting values 10
, 20
, and 30
into a hash table of size 10, all map to index 0
(assuming hash(key) = key % size
):
Advantages:
- Simple to implement.
- Hash table size doesn't limit the number of elements (beyond memory constraints).
- Works well with a good hash function and low load factor.
Disadvantages:
- Performance degrades if chains grow long, especially with poor hash functions.
- Extra memory overhead for pointers or additional data structures.
Optimization Tips:
- Use a balanced tree or dynamic array instead of a linked list for better performance.
- Keep the load factor low (e.g., rehash when the load factor exceeds a threshold).
Open Addressing
In open addressing, all elements are stored directly within the hash table array. When a collision occurs, the algorithm searches for the next available slot according to a specific probing sequence.
Types of Probing:
1. Linear Probing:
- If a collision occurs, incrementally check the next slot (i.e., ).
- For example, if the hash table size is
10
andhash(20) = 2
, inserting30
(also mapping to2
) results in probing the next available slot.
Advantages:
- Simple to implement.
- Efficient for low load factors.
Disadvantages:
- Clustering: A chain of consecutive filled slots increases the probability of further collisions.
2. Quadratic Probing:
- Uses a quadratic function to compute the next slot: .
- For example, if
hash(20) = 2
, the next slots probed would be ( (2 + 1^2) % 10 ), ( (2 + 2^2) % 10 ), and so on.
Advantages:
- Reduces clustering compared to linear probing.
Disadvantages:
- May leave gaps in the table, reducing overall efficiency.
3. Double Hashing:
- Uses a secondary hash function to determine the step size for probing: .
- The second hash function ensures the probing sequence is more distributed.
- For example, with
hash1(key) = key % size
andhash2(key) = 7 - (key % 7)
, inserting values follows this pattern:
Advantages:
- Reduces clustering significantly, making it one of the most efficient methods.
Disadvantages:
- More complex than other probing techniques.
Comparison of Chaining and Open Addressing
Feature | Chaining | Open Addressing |
---|---|---|
Storage | Requires extra memory for chains | Stores all elements in the table |
Load Factor | Can exceed 1 | Must remain below 1 |
Deletion | Easy | Complex |
Cache Performance | Poor (pointer traversal) | Better (contiguous memory) |
Performance with Poor Hash Functions | Degrades significantly | Degrades significantly |
Static Hash Tables
Static hash tables have a fixed size and do not allow resizing.
-
Use Cases:
- Applications where the maximum number of entries is known in advance.
- Memory-constrained environments.
-
Advantages:
- Simple to implement.
- No overhead for resizing.
-
Disadvantages:
- Inefficient if size is underestimated (causing high collision rates).
- Wasted space if size is overestimated.
Linear Probing
Linear probing is a technique used in static hash tables to resolve collisions. If a collision occurs, the algorithm searches the next available slot in a sequential manner.
-
Insertion:
- Compute the hash index for the key.
- If the slot is occupied, probe the next slot until an empty one is found.
- Insert the key-value pair into the empty slot.
-
Search:
- Compute the hash index for the key.
- Check the slot. If the key is not found, continue probing sequentially until the key is located or an empty slot is encountered.
-
Deletion:
- Locate the key using the search process.
- Mark the slot as deleted (use a special marker to distinguish it from an empty slot).
- During search and insertion, skip over marked slots but recheck their sequence for proper key relocation.
-
Advantages:
- Simple to implement.
- High cache performance due to locality of reference.
-
Disadvantages:
- Clustering: Consecutive slots may become filled, degrading performance.
Example Implementation
class LinearProbingHashTable:
def __init__(self, size):
self.size = size
self.table = [None] * size
def hash_function(self, key):
return hash(key) % self.size
def insert(self, key, value):
index = self.hash_function(key)
original_index = index
while self.table[index] is not None:
if self.table[index][0] == key:
self.table[index] = (key, value)
return
index = (index + 1) % self.size
if index == original_index: # Table is full
raise Exception("Hash table is full")
self.table[index] = (key, value)
def search(self, key):
index = self.hash_function(key)
original_index = index
while self.table[index] is not None:
if self.table[index][0] == key:
return self.table[index][1]
index = (index + 1) % self.size
if index == original_index:
return None
return None
def delete(self, key):
index = self.hash_function(key)
original_index = index
while self.table[index] is not None:
if self.table[index][0] == key:
self.table[index] = None
return True
index = (index + 1) % self.size
if index == original_index:
return False
return False
Cuckoo Hashing
Cuckoo hashing uses two or more hash functions and two tables to resolve collisions. Each key can reside in one of several possible locations, and if all are occupied, an eviction process is triggered.
-
Insertion:
- Compute the possible hash indices for the key.
- If one slot is empty, place the key there.
- If all slots are occupied, evict an existing key and rehash it to its alternative location. Repeat the process until all keys are placed.
-
Search:
- Compute the possible hash indices for the key.
- Check each index; if the key is found, return its value.
-
Deletion:
- Compute the possible hash indices for the key.
- Remove the key if it exists in any of the indices.
-
Advantages:
- Constant worst-case lookup time.
- Simple collision handling.
-
Disadvantages:
- High insertion cost due to potential multiple evictions.
- Inefficient at high load factors.
Example Implementation
class CuckooHashTable:
def __init__(self, size):
self.size = size
self.table1 = [None] * size
self.table2 = [None] * size
def hash1(self, key):
return hash(key) % self.size
def hash2(self, key):
return (hash(key) // self.size) % self.size
def insert(self, key, value):
for _ in range(self.size):
index1 = self.hash1(key)
if self.table1[index1] is None:
self.table1[index1] = (key, value)
return
key, value, self.table1[index1] = self.table1[index1][0], self.table1[index1][1], (key, value)
index2 = self.hash2(key)
if self.table2[index2] is None:
self.table2[index2] = (key, value)
return
key, value, self.table2[index2] = self.table2[index2][0], self.table2[index2][1], (key, value)
raise Exception("Rehash required")
def search(self, key):
index1 = self.hash1(key)
if self.table1[index1] and self.table1[index1][0] == key:
return self.table1[index1][1]
index2 = self.hash2(key)
if self.table2[index2] and self.table2[index2][0] == key:
return self.table2[index2][1]
return None
def delete(self, key):
index1 = self.hash1(key)
if self.table1[index1] and self.table1[index1][0] == key:
self.table1[index1] = None
return True
index2 = self.hash2(key)
if self.table2[index2] and self.table2[index2][0] == key:
self.table2[index2] = None
return True
return False
Dynamic Hash Tables
Dynamic hash tables can grow or shrink as the number of entries changes, typically by resizing the underlying array.
Resizing Mechanism
- When the load factor exceeds a threshold (e.g., 0.75), the table is resized.
- Create a new array (typically double the size).
- Rehash all keys into the new array.
-
Load Factor: Ratio of the number of entries to the table size.
- Load Factor = Number of Entries / Table Size
Collision Handling in Dynamic Hash Tables
-
Chained Hashing:
-
Uses linked lists (or other data structures like trees) to handle collisions.
-
Add:
- Compute the hash index.
- Insert the key-value pair into the linked list at that index.
-
Get:
- Compute the hash index.
- Traverse the linked list at that index to find the key.
-
Delete:
- Compute the hash index.
- Remove the key from the linked list.
-
-
Open Addressing with Linear Probing:
-
Stores entries directly in the array, probing for the next available slot if a collision occurs.
-
Add:
- Compute the hash index.
- If the slot is occupied, probe the next slot sequentially until an empty slot is found.
- Insert the key-value pair.
-
Get:
- Compute the hash index.
- Check sequentially until the key is found or an empty slot is reached.
-
Delete:
- Locate the key and mark the slot as deleted (special marker).
-
-
Double Hashing:
- Uses a second hash function to determine the probe sequence.
- Reduces clustering compared to linear probing.
Advanced Dynamic Hashing Techniques
-
Extensible Hashing:
-
Uses a directory of pointers to buckets and allows the hash table to grow incrementally without full rehashing.
-
Add:
- Compute the hash and map it to a bucket via a directory.
- If the bucket overflows, split the bucket and redistribute the entries.
- Update the directory to point to the new buckets.
-
Get:
- Use the directory and hash value to locate the bucket.
- Search within the bucket for the key.
-
Delete:
- Locate the bucket and remove the key.
- Optionally merge buckets if underutilized.
-
Advantages:
- Grows incrementally, avoiding full rehashing.
- Efficient for disk-based storage systems.
-
-
Linear Hashing:
-
Incrementally grows the hash table by splitting buckets dynamically.
-
Add:
- Compute the hash and locate the bucket.
- If the bucket overflows, split the bucket and redistribute its contents.
- Increase the range of the hash function as needed.
-
Get:
- Compute the hash and locate the bucket.
- Search within the bucket for the key.
-
Delete:
- Locate the bucket and remove the key.
-
Advantages:
- Avoids global resizing.
- Gradual growth reduces performance spikes.
-
-
Chained Hashing with Dynamic Buckets:
- Combines chained hashing with buckets that can grow dynamically.
- Useful for applications with large key-value stores and unpredictable growth.
Sample Python Code
Chained Hashing
class ChainedHashTable:
def __init__(self, size):
self.size = size
self.table = [[] for _ in range(size)]
def _hash(self, key):
return hash(key) % self.size
def add(self, key, value):
index = self._hash(key)
for pair in self.table[index]:
if pair[0] == key:
pair[1] = value
return
self.table[index].append([key, value])
def get(self, key):
index = self._hash(key)
for pair in self.table[index]:
if pair[0] == key:
return pair[1]
return None
def delete(self, key):
index = self._hash(key)
self.table[index] = [pair for pair in self.table[index] if pair[0] != key]
# Example usage
cht = ChainedHashTable(10)
cht.add("key1", "value1")
print(cht.get("key1")) # Output: value1
cht.delete("key1")
print(cht.get("key1")) # Output: None
Linear Probing
class LinearProbingHashTable:
def __init__(self, size):
self.size = size
self.table = [None] * size
def _hash(self, key):
return hash(key) % self.size
def add(self, key, value):
index = self._hash(key)
start_index = index
while self.table[index] is not None and self.table[index][0] != key:
index = (index + 1) % self.size
if index == start_index:
raise Exception("Hash table is full")
self.table[index] = (key, value)
def get(self, key):
index = self._hash(key)
start_index = index
while self.table[index] is not None:
if self.table[index][0] == key:
return self.table[index][1]
index = (index + 1) % self.size
if index == start_index:
break
return None
def delete(self, key):
index = self._hash(key)
start_index = index
while self.table[index] is not None:
if self.table[index][0] == key:
self.table[index] = None
return
index = (index + 1) % self.size
if index == start_index:
break
# Example usage
lht = LinearProbingHashTable(10)
lht.add("key1", "value1")
print(lht.get("key1")) # Output: value1
lht.delete("key1")
print(lht.get("key1")) # Output: None
Advantages of Dynamic Hash Tables
- Scalability: Automatically adjusts to changes in data volume.
- Efficient Memory Utilization: Expands or contracts based on load.
- Improved Performance: Reduces collision rates through resizing and advanced techniques.
Disadvantages of Dynamic Hash Tables
- Resizing Overhead: Resizing operations are expensive (O(n)).
- Complexity: Implementation is more complex compared to static hash tables.
- Temporary Performance Degradation: Resizing causes brief performance drops.
Comparison: Static vs Dynamic Hash Tables
Feature | Static Hash Table | Dynamic Hash Table |
---|---|---|
Size | Fixed | Variable |
Resizing | Not applicable | Supported (costly operation) |
Memory Efficiency | May waste space | More efficient |
Use Cases | Predictable entry count | Unpredictable entry |