Hash Tables
This note is complete, reviewed, and considered stable.
A hash table is a data structure that implements an associative array abstraction:
Given a key, efficiently store and retrieve a corresponding value.
Formally, it supports:
- put(key, value)
- get(key)
- delete(key)
The defining promise of a hash table is:
Constant-time performance on average, independent of the number of stored elements.
This is achieved by directly computing where a key should live, instead of searching for it.
Why Hash Tables Are Fundamentally Different
Most data structures answer this question:
“Where should I search for this value?”
Hash tables answer a different question:
“Where must this value be stored?”
That shift is what enables O(1) access.
Comparison
| Structure | How lookup works |
|---|---|
| Array | Index-based |
| Linked List | Sequential scan |
| Tree | Guided traversal |
| Hash Table | Direct computation |
Hash tables eliminate traversal.
Core Components of a Hash Table
A hash table is composed of four essential parts:
- Key
- Hash Function
- Bucket Array
- Collision Resolution Mechanism
Without any one of these, the data structure breaks down.
Hash Function
A hash function maps a key to an integer:
hash(key) → integer
That integer is then mapped to a bucket:
index = hash(key) % capacity
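As an illustration, here is a minimal sketch of a polynomial string hash (the multiplier 31 and the 32-bit mask are arbitrary illustrative choices, not what Python's built-in hash uses):

```python
def simple_hash(key: str) -> int:
    # Polynomial rolling hash: h = h*31 + ord(ch), truncated to 32 bits
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

capacity = 8
index = simple_hash("apple") % capacity  # the bucket this key lives in
```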
What the Hash Function Actually Controls
The hash function determines:
- Distribution of keys
- Collision frequency
- Performance stability
- Resistance to adversarial input
A poor hash function turns a hash table into a linked list.
Required Properties
A valid hash function must:
- Be deterministic: same key → same hash
- Be consistent across the lifetime of the key: hashing must not depend on mutable data
- Be fast: hashing must be cheaper than searching
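Consistency is why mutable objects make poor keys. Python enforces this by refusing to hash its mutable built-ins:

```python
d = {}
d[(1, 2)] = "ok"       # tuples are immutable, so they are hashable
try:
    d[[1, 2]] = "bad"  # lists are mutable; Python refuses to hash them
except TypeError as e:
    print(e)           # unhashable type: 'list'
```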
Desired Properties (Performance-Critical)
A good hash function should:
- Spread keys uniformly
- Minimize clustering
- Avoid predictable patterns
Uniform distribution is the entire reason hash tables work.
Bucket Array
At its core, a hash table uses a contiguous array.
Each index is called a bucket.
What a bucket stores depends on the collision strategy:
- Linked list
- Dynamic array
- Single entry with probing
Insertion
Let’s insert (key, value):
- Hash the key
- Compute the index
- Store the entry in that bucket (how depends on the collision strategy)
Key observation:
Insertion never scans unrelated elements.
Lookup
Lookup is insertion without mutation.
- Hash the key
- Compute index
- Search only inside that bucket
This is why lookup is fast.
Collisions
A collision occurs when:
hash(k1) % size == hash(k2) % size
Collisions are mathematically unavoidable:
- Infinite key space
- Finite bucket space
The goal is not to eliminate collisions, but to handle them efficiently.
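A minimal demonstration in Python (in CPython, small non-negative integers hash to themselves, so this collision is guaranteed):

```python
capacity = 8
k1, k2 = 1, 9
print(hash(k1) % capacity, hash(k2) % capacity)  # 1 1 — distinct keys, same bucket
```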
Collision Resolution Strategies
Separate Chaining
Each bucket stores a collection of entries.
Characteristics
- Simple to implement
- Performance degrades linearly with bucket size
- Memory overhead from pointers
Time Complexity
| Case | Time |
|---|---|
| Average | O(1) |
| Worst | O(n) |
Worst case happens when all keys hash to one bucket.
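A minimal separate-chaining sketch (class and method names are illustrative, not a standard API):

```python
class ChainedHashTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]  # one list of (key, value) pairs per bucket

    def _index(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: only this bucket is touched

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:  # scan only this bucket
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("x", 10)
print(t.get("x"))  # 10
```

The O(n) worst case is visible here: if every key lands in one bucket, put and get degenerate into a scan of all entries.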
Open Addressing
All entries live directly inside the array.
If a collision occurs, search for another slot.
Linear Probing
index, index+1, index+2, ...
Problems
- Primary clustering: runs of occupied slots grow, and new keys extend them
- Cache-friendly, but performance degrades sharply at high load factors
- Deletions are complex: see the tombstone handling in the sketch below
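A minimal linear-probing sketch showing why deletions need tombstones (names are illustrative; a real table would also resize):

```python
_EMPTY, _TOMBSTONE = object(), object()  # sentinels: never-used vs. deleted slot

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [_EMPTY] * capacity  # each slot: (key, value) or a sentinel

    def put(self, key, value):
        # Simplified: stops at the first reusable slot, so it will not notice
        # the same key stored further along the probe chain past a tombstone.
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is _EMPTY or slot is _TOMBSTONE or slot[0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.capacity   # linear step to the next slot
        raise RuntimeError("table full; a real table resizes before this")

    def get(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is _EMPTY:            # a never-used slot ends the probe chain
                break
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
            i = (i + 1) % self.capacity   # tombstones are stepped over, not stopped at
        raise KeyError(key)

    def delete(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is _EMPTY:
                break
            if slot is not _TOMBSTONE and slot[0] == key:
                self.slots[i] = _TOMBSTONE  # keep the probe chain intact
                return
            i = (i + 1) % self.capacity
        raise KeyError(key)
```

Replacing the tombstone with _EMPTY would break get for any key whose probe chain passes through the deleted slot.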
Quadratic Probing
index + 1², index + 2², index + 3², ...
Reduces clustering, but still sensitive to load factor.
Load Factor
load_factor = entries / buckets
As load factor increases:
- Collision probability increases
- Performance degrades
Typical resize thresholds:
- 0.7
- 0.75 (Java's HashMap default, for example)
Resizing and Rehashing
Why Resizing Is Mandatory
Without resizing:
- Buckets grow unbounded
- Hash table degenerates
What Rehashing Actually Means
Rehashing is not a plain memory copy: every entry must be re-indexed against the new capacity.
Steps:
- Allocate new bucket array
- Recompute index for every key
- Insert again
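A sketch of these steps, assuming the bucket-of-pairs layout from the chaining example above:

```python
def rehash(buckets, new_capacity):
    new_buckets = [[] for _ in range(new_capacity)]
    for bucket in buckets:
        for key, value in bucket:
            # The index is recomputed against the NEW capacity;
            # entries generally land in different buckets than before.
            new_buckets[hash(key) % new_capacity].append((key, value))
    return new_buckets
```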
Complexity
- Single resize: O(n)
- Amortized per operation: O(1)
This is identical to dynamic array growth behavior.
Python dict
Python dictionaries are highly optimized hash tables.
Properties
- Open addressing
- Perturbation-based probing
- Insertion-ordered (since 3.7)
- Randomized hash seed
Example
```python
d = {}
d["x"] = 10
d["y"] = 20
print(d["x"])  # 10
```
Internally, Python stores:
- Hash
- Key
- Value
This avoids recomputing hashes during probing.
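Conceptually, an entry can be pictured as a triple (a simplification of CPython's actual C structs):

```python
entry = (hash("x"), "x", 10)  # (cached hash, key, value)
# During probing, cached hashes are compared first; the potentially
# expensive key equality check runs only when two hashes match.
```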
Time & Space Complexity
| Operation | Average | Worst |
|---|---|---|
| Insert | O(1) | O(n) |
| Lookup | O(1) | O(n) |
| Delete | O(1) | O(n) |
Worst case is rare but theoretically possible.
Algorithms Using Hash Tables
Frequency Counting
```python
freq = {}
for x in nums:
    freq[x] = freq.get(x, 0) + 1
```
Used in:
- Histograms
- Voting algorithms
- Anagrams
Two Sum
```python
def two_sum(nums, target):
    seen = {}  # value -> index where it was seen
    for i, num in enumerate(nums):
        need = target - num
        if need in seen:            # the complement was seen earlier
            return [seen[need], i]
        seen[num] = i

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```
Transforms O(n²) → O(n).
Deduplication
```python
unique = set(nums)  # hashing makes each membership check O(1) on average
```
Grouping by Key
```python
groups = {}
for item in items:
    key = compute_key(item)
    groups.setdefault(key, []).append(item)
```
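For example, with compute_key chosen as the first letter of a word (an arbitrary choice for illustration):

```python
items = ["apple", "avocado", "banana", "cherry"]
compute_key = lambda word: word[0]

groups = {}
for item in items:
    groups.setdefault(compute_key(item), []).append(item)

print(groups)  # {'a': ['apple', 'avocado'], 'b': ['banana'], 'c': ['cherry']}
```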
Hash Set vs Hash Map
| Hash Set | Hash Map |
|---|---|
| Stores keys | Stores key-value |
| Membership test | Value retrieval |
| No duplicates | Unique keys |
Both are hash tables.
Hash Tables vs Balanced Trees
| Hash Table | Tree |
|---|---|
| O(1) avg | O(log n) |
| Unordered | Ordered |
| Faster on average | Predictable worst case |
Trees are chosen when order or range queries matter.
Security: Hash Flooding Attacks
Attackers can exploit predictable hashes.
Defenses:
- Randomized hashing
- Dynamic resizing
- Strong hash functions
Python includes built-in protections.
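The randomized seed is observable: the same string hashes differently in separate interpreter processes (a small demonstration; two seeds could coincide only by astronomical chance):

```python
import subprocess, sys

code = 'print(hash("flood-me"))'
hashes = [
    subprocess.run([sys.executable, "-c", code],
                   capture_output=True, text=True).stdout.strip()
    for _ in range(2)
]
print(hashes)  # two different values: each process gets a fresh hash seed
```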
When Hash Tables Are a Bad Choice
- When ordering matters
- When memory is constrained
- When worst-case guarantees are required
- When keys are mutable