:::warning[Experimental]
Hash index is an experimental feature available in Spice v1.11.0-rc.2 and later.
:::
The hash index is an optional, high-performance indexing feature for Arrow-accelerated datasets. It provides O(1) point lookups on primary key and secondary index columns, dramatically improving query performance for equality predicates.
To use the hash index, explicitly enable it and specify a primary key:
Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the indexes field in the acceleration configuration:
Index types:
- `unique`: Enforces uniqueness and enables O(1) indexed lookups.
- `enabled`: Permits duplicates. The index is built and maintained but does not currently accelerate queries (queries fall back to a full scan).

Compound secondary indexes can be defined with a multicolumn key in parentheses, e.g. `'(col1, col2)': unique`, but are not yet used for query optimization.
:::note
Only single-column unique secondary indexes currently accelerate queries. Non-unique and compound secondary indexes are maintained for future use.
:::
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `hash_index` | `enabled`/`disabled` | No | `disabled` | Enable hash indexing |
| `primary_key` | string or list | Yes (if `hash_index` enabled) | None | Column(s) for the primary key index |
| `indexes` | YAML map | No | None | Secondary indexes (see indexes) |
The hash index supports the following primary key column types:
- Int8, Int16, Int32, Int64
- UInt8, UInt16, UInt32, UInt64
- Utf8, LargeUtf8
- Binary, LargeBinary

The hash index automatically accelerates queries with equality predicates on indexed columns.
When a primary key lookup is combined with additional filters (e.g. `WHERE id = 123 AND status = 'active'`), the index is used for the primary key lookup and the remaining filters are applied afterward by DataFusion.
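The lookup-then-filter order can be sketched in a few lines of Python. This is only a conceptual illustration: a plain `dict` stands in for the hash index, and the names `rows`, `pk_index`, and `lookup` are hypothetical, not part of Spice or DataFusion.

```python
# Illustrative only: a plain dict stands in for the hash index.
rows = [
    {"id": 122, "status": "active"},
    {"id": 123, "status": "inactive"},
    {"id": 124, "status": "active"},
]

# Build the "index": primary key value -> row position.
pk_index = {row["id"]: pos for pos, row in enumerate(rows)}


def lookup(pk, residual=lambda row: True):
    """Point lookup by primary key, then apply the remaining filters."""
    pos = pk_index.get(pk)  # average-case O(1), no scan over `rows`
    if pos is None:
        return []
    row = rows[pos]
    # Residual predicates run only on the single row the index returned.
    return [row] if residual(row) else []


# Equivalent of: WHERE id = 123 AND status = 'active'
result = lookup(123, lambda r: r["status"] == "active")
```

Here the index narrows the candidate set to at most one row before the `status` predicate is evaluated, so the extra filter adds no scan cost.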
The hash index is only built when the dataset exceeds a minimum size:
| CPU Cores | Minimum Rows for Index |
|---|---|
| 1 | 256 |
| 4 | 1,024 |
| 8 | 2,048 |
| 16 | 4,096 |
| 32 | 8,192 |
For small tables below the threshold, a full scan is faster than an indexed lookup once index maintenance overhead is taken into account.
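Every row of the table above is consistent with a threshold of 256 rows per CPU core. The sketch below encodes that pattern. The constant `ROWS_PER_CORE` and the function `min_rows_for_index` are hypothetical names, and the behavior for core counts not listed in the table is an assumption rather than documented behavior.

```python
# Assumption: the threshold scales linearly at 256 rows per core,
# as inferred from the five rows listed in the table.
ROWS_PER_CORE = 256


def min_rows_for_index(cpu_cores: int) -> int:
    """Estimated minimum row count before the hash index is built."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return ROWS_PER_CORE * cpu_cores


for cores in (1, 4, 8, 16, 32):
    print(f"{cores:>2} cores -> {min_rows_for_index(cores):,} rows")
```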
The built-in bloom filter provides:
| Component | Memory per Entry |
|---|---|
| Hash slot | 16 bytes (8-byte hash + 8-byte location) |
| Bloom filter | ~1.25 bytes |
| Total | ~17.25 bytes per indexed row |
For a 10 million row dataset:
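The per-entry figures in the table give a rough estimate by simple multiplication. The sketch below is arithmetic only; the function name `index_memory_bytes` is hypothetical, and the result ignores any per-shard or allocator overhead, which the table does not cover.

```python
HASH_SLOT_BYTES = 16   # 8-byte hash + 8-byte location, per the table
BLOOM_BYTES = 1.25     # approximate per-entry bloom filter cost, per the table


def index_memory_bytes(rows: int) -> dict:
    """Estimate index memory from the per-entry figures."""
    slots = rows * HASH_SLOT_BYTES
    bloom = rows * BLOOM_BYTES
    return {"hash_slots": slots, "bloom_filter": bloom, "total": slots + bloom}


est = index_memory_bytes(10_000_000)
for name, size in est.items():
    print(f"{name}: {size / 1_000_000:.1f} MB")
```

Under these assumptions the estimate is about 160 MB for hash slots plus 12.5 MB for the bloom filter, roughly 172.5 MB in total.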
The index uses 256 independent shards to minimize lock contention:
Shard Selection: Uses XOR-folded hash bits: `((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF`
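The shard selection expression translates directly to Python. This is a sketch of the formula as written above, not the actual implementation; `shard_for_hash` is a hypothetical name, and the explicit 64-bit mask is added because Python integers are unbounded.

```python
NUM_SHARDS = 256
MASK_64 = 0xFFFF_FFFF_FFFF_FFFF


def shard_for_hash(h: int) -> int:
    """Fold the top two bytes of a 64-bit hash into its low byte."""
    h &= MASK_64  # treat the input as an unsigned 64-bit value
    return ((h >> 56) ^ (h >> 48) ^ h) & 0xFF


# Worked example: 0x01 ^ 0x23 ^ 0xEF == 0xCD
example = shard_for_hash(0x0123_4567_89AB_CDEF)
```

The `& 0xFF` mask keeps the result in the range 0 to 255, which matches the 256 shards. Mixing in the high bytes means keys whose hashes share the same low byte can still land on different shards.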
Each indexed key maps to a RowLocation:
Uses XXH3_64 with a fixed seed (0x5370_6963_6541_4920 = "SpiceAI ") for:
- Requires `engine: arrow` acceleration.
- Only single-column unique secondary indexes accelerate queries. Non-unique and compound secondary indexes are built and maintained but do not yet optimize queries.

Cause: Dataset row count is below the index threshold.
Solution: This is expected behavior for small datasets. The full scan is faster than index overhead.
Cause: primary_key is specified but hash_index is not enabled.
Solution: Add hash_index: enabled to params:
Cause: Index consumes ~17 bytes per row.
Solution: