Hash Tables
A Hash Table is a data structure that provides an efficient way to store and retrieve data using key-value pairs. The underlying mechanism relies on a hash function that maps keys to specific indices in an array, ensuring fast access times.
Key Characteristics
- Time Complexity:
- Average-case: O(1) for insertion, deletion, and lookup.
- Worst-case: O(n) when collisions are not handled properly.
- Space Complexity: O(n), where n is the number of entries.
- Applications:
- Caching (e.g., LRU Cache)
- Databases (e.g., indexing)
- Implementing associative arrays and sets
Components of a Hash Table
- Array: The underlying data storage.
- Hash Function: Converts a key into an array index.
- Collision Handling Mechanism: Resolves conflicts when multiple keys hash to the same index.
Hash Functions
A Hash Function should have the following properties:
- Deterministic: Same input always produces the same output.
- Uniformity: Distributes keys uniformly across the array.
- Efficiency: Fast to compute.
- Minimization of Collisions: Reduces instances of keys hashing to the same index.
Popular Hash Functions:
-
Modulo Operation: Simplest method, computes
index = hash(key) % table_size
. However, not always ideal for non-uniform data distributions. -
Multiplicative Hash Function: Uses multiplication and fractional parts to achieve better uniformity (e.g.,
index = floor(table_size * frac(hash(key) * A))
, whereA
is a constant). -
Cryptographic Hash Functions:
- Examples: MD5, SHA-256.
- Ensure strong collision resistance.
- Slower but crucial for security-sensitive applications like digital signatures.
-
Universal Hashing: Family of hash functions designed to minimize collisions across any set of keys.
-
MurMurHash:
- Non-cryptographic hash function known for high speed and uniformity.
- Frequently used in distributed systems like Cassandra.
-
CityHash:
- Optimized for speed and commonly used in database indexing.
- Developed by Google.
-
XXHash:
- Extremely fast, designed for non-cryptographic use cases.
- Popularized by Facebook, with the latest version being XXHash3, offering high-speed hashing with excellent distribution and low collision rates.
Collision Handling Mechanisms
Hash functions can produce the same index for different keys, resulting in a collision. Effective collision handling is crucial for maintaining the performance and integrity of a hash table.
Two common techniques for handling collisions are chaining and open addressing.
Chaining
Chaining resolves collisions by storing all elements with the same hash index in a separate data structure, typically a linked list or a dynamic array. Each index of the hash table contains a pointer to the head of the chain.
Steps:
- Compute the hash index using the hash function.
- If the index is empty, insert the element there.
- If the index already contains a chain, append the new element to the chain.
Example:
When inserting values 10
, 20
, and 30
into a hash table of size 10, all map to index 0
(assuming hash(key) = key % size
):
Advantages:
- Simple to implement.
- Hash table size doesn't limit the number of elements (beyond memory constraints).
- Works well with a good hash function and low load factor.
Disadvantages:
- Performance degrades if chains grow long, especially with poor hash functions.
- Extra memory overhead for pointers or additional data structures.
Optimization Tips:
- Use a balanced tree or dynamic array instead of a linked list for better performance.
- Keep the load factor low (e.g., rehash when the load factor exceeds a threshold).
Open Addressing
In open addressing, all elements are stored directly within the hash table array. When a collision occurs, the algorithm searches for the next available slot according to a specific probing sequence.