:::warning[Experimental]
Hash index is an experimental feature available in Spice v1.11.0-rc.2 and later.
:::
The hash index is an optional, high-performance indexing feature for Arrow-accelerated datasets. It provides O(1) point lookups on primary key and secondary index columns, dramatically improving query performance for equality predicates.
Hash indexing activates automatically on Arrow-accelerated datasets when a primary_key or secondary index is configured. No additional parameter is required.
The hash index activates whenever:
- engine is arrow or partitioned_arrow,
- acceleration.enabled is true, and
- indexes is set, or primary_key is set with a non-caching refresh_mode.

Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the indexes field in the acceleration configuration.
Index types:
- unique: Enforces uniqueness and enables O(1) indexed lookups.
- enabled: Permits duplicates. The index is built and maintained but does not currently accelerate queries (queries fall back to a full scan).

Compound secondary indexes can be defined with a multicolumn key in parentheses, e.g. '(col1, col2)': unique, but are not yet used for query optimization.
:::note
Only single-column unique secondary indexes currently accelerate queries. Non-unique and compound secondary indexes are maintained for future use.
:::
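The index rules above can be summarized in a few lines of Python. This is only an illustrative sketch: the helper names `index_columns` and `accelerates_queries` and the sample column names are hypothetical, and the key parsing assumes the parenthesized compound-key form shown earlier. It does not reflect how the runtime parses the configuration.

```python
def index_columns(key: str) -> list[str]:
    """Split an index key such as 'email' or '(col1, col2)' into column names."""
    key = key.strip()
    if key.startswith("(") and key.endswith(")"):
        return [col.strip() for col in key[1:-1].split(",") if col.strip()]
    return [key]


def accelerates_queries(key: str, index_type: str) -> bool:
    """True only for single-column indexes of type 'unique'."""
    return index_type == "unique" and len(index_columns(key)) == 1


# Hypothetical index definitions, keyed the same way as the indexes map.
indexes = {"email": "unique", "status": "enabled", "(col1, col2)": "unique"}
for key, index_type in indexes.items():
    print(key, index_type, accelerates_queries(key, index_type))
```

In this sketch only the `email` entry reports `True`; the non-unique and compound entries are still valid definitions but do not speed up queries today.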
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| primary_key | string or list | Yes (unless indexes is set) | None | Column(s) for the primary key index |
| indexes | YAML map | No | None | Secondary indexes (see indexes) |
:::note[hash_index parameter is ignored]
The legacy hash_index: enabled parameter is accepted but no longer activates indexing on its own. When set, the runtime logs a warning and falls back to the automatic rules above. Remove hash_index from params to clear the warning.
:::
The hash index supports the following primary key column types:
- Int8, Int16, Int32, Int64
- UInt8, UInt16, UInt32, UInt64
- Utf8, LargeUtf8
- Binary, LargeBinary

The hash index automatically accelerates queries with equality predicates on indexed columns.
When a primary key lookup is combined with additional filters (e.g. WHERE id = 123 AND status = 'active'), the index is used for the primary key lookup and the remaining filters are applied afterward by DataFusion.
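The two-step behavior can be pictured with a toy model in Python: probe a key-to-position map first, then apply the remaining filter to the single candidate row. This is a conceptual sketch only. The `build_index` and `point_lookup` helpers and the sample rows are hypothetical, a plain `dict` stands in for the hash index, and nothing here reflects how DataFusion or the runtime is actually implemented.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def build_index(rows: list[Row], key: str) -> dict[Any, int]:
    """Map each key value to its row position (assumes key values are unique)."""
    return {row[key]: pos for pos, row in enumerate(rows)}


def point_lookup(
    rows: list[Row],
    index: dict[Any, int],
    key_value: Any,
    residual: Optional[Callable[[Row], bool]] = None,
) -> list[Row]:
    """Find the row by key, then apply any remaining filter to it."""
    pos = index.get(key_value)  # average-case O(1) probe, no scan
    if pos is None:
        return []
    row = rows[pos]
    if residual is not None and not residual(row):
        return []  # key matched, but the extra filter rejected the row
    return [row]


rows = [
    {"id": 122, "status": "active"},
    {"id": 123, "status": "inactive"},
    {"id": 124, "status": "active"},
]
index = build_index(rows, "id")

# Equivalent in spirit to: WHERE id = 124 AND status = 'active'
hits = point_lookup(rows, index, 124, lambda r: r["status"] == "active")
print(hits)
```

The residual filter only ever sees the rows the index returned, which is why combining a key lookup with extra predicates stays cheap.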
The hash index is only built when the dataset exceeds a minimum size:
| CPU Cores | Minimum Rows for Index |
|---|---|
| 1 | 256 |
| 4 | 1,024 |
| 8 | 2,048 |
| 16 | 4,096 |
| 32 | 8,192 |
For small tables below the threshold, a full scan is faster than the overhead of maintaining the index.
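Every row of the threshold table is consistent with 256 rows per CPU core. The Python sketch below encodes that pattern so the threshold can be estimated for other core counts. The linear rule is an assumption inferred from the five listed values, not a documented formula, and `min_rows_for_index` is a hypothetical helper name.

```python
ROWS_PER_CORE = 256  # assumption: inferred from the table, not a documented constant


def min_rows_for_index(cpu_cores: int) -> int:
    """Estimate the minimum row count at which the hash index is built."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return ROWS_PER_CORE * cpu_cores


for cores in (1, 4, 8, 16, 32):
    print(cores, min_rows_for_index(cores))
```

Core counts that are not listed in the table (for example 2 or 64) may not follow the same rule, so treat the result as a rough guide.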
The built-in bloom filter provides:
| Component | Memory per Entry |
|---|---|
| Hash slot | 16 bytes (8-byte hash + 8-byte location) |
| Bloom filter | ~1.25 bytes |
| Total | ~17.25 bytes per indexed row |
For a 10 million row dataset:
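The per-entry figures in the table above can be multiplied out directly. The Python sketch below does that arithmetic, assuming decimal megabytes (1 MB = 1,000,000 bytes) and ignoring any fixed per-shard or allocator overhead, which this page does not quantify. The function name `index_memory_bytes` is hypothetical.

```python
HASH_SLOT_BYTES = 16       # 8-byte hash + 8-byte location
BLOOM_BYTES_PER_ROW = 1.25  # approximate, per the table above


def index_memory_bytes(row_count: int) -> dict[str, float]:
    """Estimate index memory from the per-entry sizes."""
    hash_slots = row_count * HASH_SLOT_BYTES
    bloom_filter = row_count * BLOOM_BYTES_PER_ROW
    return {
        "hash_slots": hash_slots,
        "bloom_filter": bloom_filter,
        "total": hash_slots + bloom_filter,
    }


estimate = index_memory_bytes(10_000_000)
for component, size in estimate.items():
    print(f"{component}: {size / 1e6:.1f} MB")
# hash_slots: 160.0 MB
# bloom_filter: 12.5 MB
# total: 172.5 MB
```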
The index uses 256 independent shards to minimize lock contention:
Shard Selection: Uses XOR-folded hash bits: ((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF
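The shard selection expression can be tried out in Python. This sketch applies the formula above to an arbitrary 64-bit integer. It does not compute XXH3_64 itself, so the input values are stand-ins for real hashes, and `shard_for` is a hypothetical name.

```python
SHARD_COUNT = 256
MASK_64 = (1 << 64) - 1


def shard_for(hash64: int) -> int:
    """Fold the top two bytes of a 64-bit hash into its low byte."""
    h = hash64 & MASK_64  # Python ints are unbounded, so clamp to 64 bits
    return ((h >> 56) ^ (h >> 48) ^ h) & 0xFF


# Only the top byte (0xAB) and the low byte (0xCD) are set here, and the
# second-highest byte is zero, so the result is 0xAB ^ 0xCD.
print(hex(shard_for(0xAB00_0000_0000_00CD)))  # 0x66
```

Because the result is masked with 0xFF, it always falls in the range 0 to 255, matching the 256 shards.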
Each indexed key maps to a RowLocation:
Uses XXH3_64 with a fixed seed (0x5370_6963_6541_4920 = "SpiceAI ") for:

Limitations:
- Available only with engine: arrow acceleration.
- Only single-column unique secondary indexes accelerate queries. Non-unique and compound secondary indexes are built and maintained but do not yet optimize queries.

Cause: Dataset row count is below the index threshold.
Solution: This is expected behavior for small datasets. The full scan is faster than index overhead.
Cause: hash_index: enabled is set in params but no longer activates indexing on its own.
Solution: Remove hash_index from params. Hash indexing activates automatically when primary_key or indexes is configured on an Arrow-accelerated dataset (see Configuration).
Hash index inactive despite primary_key being set

Cause: refresh_mode: caching disables hash indexing even when primary_key is set; the caching path uses its own lookup strategy.
Solution: Use a non-caching refresh_mode (e.g. full, append, changes) for datasets that need point-lookup acceleration via the hash index.
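When debugging why the index is not active, it can help to write the activation rules down as a predicate. The Python sketch below is one reading of the rules on this page, not the runtime's actual logic. It takes a plain dict mirroring the acceleration settings, requires engine to be set explicitly, and treats indexes as sufficient regardless of refresh_mode. The function name `hash_index_activates` is hypothetical.

```python
ARROW_ENGINES = {"arrow", "partitioned_arrow"}


def hash_index_activates(acceleration: dict) -> bool:
    """Return True if, under this reading of the rules, hash indexing applies."""
    if acceleration.get("engine") not in ARROW_ENGINES:
        return False
    if not acceleration.get("enabled", False):
        return False
    if acceleration.get("indexes"):
        return True  # secondary indexes configured
    has_primary_key = bool(acceleration.get("primary_key"))
    # A primary key alone only counts with a non-caching refresh_mode.
    return has_primary_key and acceleration.get("refresh_mode") != "caching"


full = {"engine": "arrow", "enabled": True, "primary_key": "id", "refresh_mode": "full"}
caching = {"engine": "arrow", "enabled": True, "primary_key": "id", "refresh_mode": "caching"}

print(hash_index_activates(full))     # True
print(hash_index_activates(caching))  # False
```

The legacy hash_index parameter is deliberately absent from the predicate, since it no longer affects activation.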
Cause: Index consumes ~17 bytes per row.
Solution:
- Remove primary_key for datasets where point lookups are rare (hash indexing stops being applied).