What is hash index?
A hash index is a data structure used in database systems to speed up data retrieval by mapping index keys directly to their physical storage location using a hash function. It excels at equality-based lookups but has significant limitations for other types of queries.
What is a Hash Index?
A hash index employs a hash function to compute a hash value for each key value in the indexed column. This hash value then points directly to the row's physical address (or a bucket containing pointers to rows). When a query requests a specific key, the hash function is applied to the search key, and the resulting hash value is used to quickly locate the corresponding data row. This direct mapping makes equality lookups extremely fast, often approaching O(1) complexity in the best case, as it avoids traversing a tree structure.
How it Works
- Hash Function: A mathematical algorithm takes an input (the column value) and returns a fixed-size string of characters, which is the hash value.
- Hash Table/Array: These hash values are stored in a hash table or array, along with pointers to the actual data rows.
- Lookup: When searching for a value, the database computes its hash value and goes directly to the corresponding entry in the hash table to retrieve the row pointer.
Advantages
- Extremely Fast Equality Lookups: Ideal for queries using
=(equals operator) as it provides near-constant time retrieval once the hash is computed. - Efficient for Exact Matches: Performs exceptionally well when you need to find specific unique values or frequently accessed exact values.
- Minimal I/O Operations: For exact matches, it can often retrieve data with fewer disk I/O operations compared to other index types.
Disadvantages
- No Support for Range Queries: Cannot be used for
BETWEEN,>,<,>=or<=operations because the hash function scrambles the order of keys. The hash values themselves are not ordered. - No Support for Sorting: Cannot be used for
ORDER BYclauses since the index does not store data in any particular sorted order. - No Support for Prefix Matching: Cannot be used for
LIKE 'prefix%'queries because changing even a single character in the input key results in a completely different hash value. - Hash Collisions: Different input keys might produce the same hash value (a collision). The database must have a strategy to handle these collisions, which can slightly degrade performance.
- Higher Memory Consumption: Can consume more memory than B-tree indexes, especially if the hash table becomes sparse or collision handling is complex.
- Not Universal: Not all database systems fully support hash indexes, and their implementation details and limitations vary (e.g., PostgreSQL only supports them for equality comparisons and they are not crash-safe by default, requiring a rebuild after a crash).
Common Use Cases
- Primary Key Lookups: Especially useful for unique identifiers like UUIDs or serial numbers where only exact matches are expected.
- Exact Value Filtering: When a column is frequently queried for specific, distinct values and range queries are rare.
- Joins: Can sometimes accelerate hash joins when one of the join keys has a hash index.
Comparison with B-tree Indexes
| Feature | Hash Index | B-tree Index |
|---|---|---|
| Primary Goal | Fast equality lookups | Fast range, equality, and sort operations |
| Order of Keys | Not preserved | Preserved (sorted) |
| Equality Queries (`=`) | Excellent (O(1) average) | Good (O(log n)) |
| Range Queries (`>`, `<`, `BETWEEN`) | Not supported | Excellent |
| Sorting (`ORDER BY`) | Not supported | Excellent |
| Prefix Matching (`LIKE 'prefix%'`) | Not supported | Supported |
| Memory Usage | Potentially higher | Generally lower/optimized |
| Collision Handling | Required | Not applicable |