---
title: 'Hash Index for Arrow Acceleration'
sidebar_label: 'Hash Index'
sidebar_position: 3
description: 'Learn how to use hash indexes for O(1) point lookups on Arrow-accelerated datasets.'
---

:::warning[Experimental]
Hash index is an experimental feature available in Spice v1.11.0-rc.2 and later.
:::

The hash index is an optional, high-performance indexing feature for Arrow-accelerated datasets. It provides O(1) point lookups on primary key and secondary index columns, dramatically improving query performance for equality predicates.

Key Features

O(1) Point Lookups: Direct row access via primary key or secondary indexes without full table scans
Secondary Indexes: Optional indexes on non-primary-key columns for fast lookups
256-Shard Design: Minimizes lock contention for concurrent reads
SIMD-Optimized Hashing: Uses XXH3_64 for fast, high-quality hashing
Built-in Bloom Filter: Fast negative lookups to skip unnecessary hash table probes
Auto-Threshold: Index is only built when data size exceeds a minimum threshold

Configuration

Hash indexing activates automatically on Arrow-accelerated datasets when a primary_key or secondary index is configured. No additional parameter is required.

The hash index activates whenever:

engine is arrow or partitioned_arrow,
acceleration.enabled is true,
and either indexes is set, or primary_key is set with a non-caching refresh_mode.
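The activation rules above can be read as a single boolean condition. The following is a minimal sketch of that reading, not the runtime's actual code: the `Acceleration` struct and `hash_index_active` function are hypothetical names introduced here, and the sketch assumes that "non-caching" means any `refresh_mode` other than `caching`.

```rust
/// Hypothetical view of the acceleration settings relevant to hash indexing.
/// Field names mirror the spicepod keys; the struct itself is illustrative.
struct Acceleration<'a> {
    enabled: bool,
    engine: &'a str,
    refresh_mode: &'a str,
    has_primary_key: bool,
    has_indexes: bool,
}

/// Returns true when the activation rules listed above are all satisfied.
fn hash_index_active(acc: &Acceleration) -> bool {
    // Rule 1: the engine must be one of the Arrow engines.
    let arrow_engine = matches!(acc.engine, "arrow" | "partitioned_arrow");
    // Assumption: every refresh mode except `caching` counts as non-caching.
    let non_caching = acc.refresh_mode != "caching";
    // Rule 2: acceleration is enabled.
    // Rule 3: `indexes` is set, or `primary_key` is set with a non-caching mode.
    arrow_engine
        && acc.enabled
        && (acc.has_indexes || (acc.has_primary_key && non_caching))
}
```

Under this reading, a dataset with only `primary_key` set and `refresh_mode: caching` does not get a hash index, while adding an `indexes` entry activates it regardless of the refresh mode.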

Secondary Indexes

Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the indexes field in the acceleration configuration:

Index types:

unique — Enforces uniqueness and enables O(1) indexed lookups.
enabled — Permits duplicates. The index is built and maintained but does not currently accelerate queries (queries fall back to a full scan).

Compound secondary indexes can be defined with a multicolumn key in parentheses, e.g. '(col1, col2)': unique, but are not yet used for query optimization.

:::note
Only single-column unique secondary indexes currently accelerate queries. Non-unique and compound secondary indexes are maintained for future use.
:::

Configuration Options

Parameter	Type	Required	Default	Description
`primary_key`	string or list	Yes (unless `indexes` is set)	None	Column(s) for the primary key index
`indexes`	YAML map	No	None	Secondary indexes (see Secondary Indexes)

:::note hash_index parameter is ignored
The legacy hash_index: enabled parameter is accepted but no longer activates indexing on its own. When set, the runtime logs a warning and falls back to the automatic rules above. Remove hash_index from params to clear the warning.
:::

Supported Data Types

The hash index supports the following primary key column types:

Primitive Types

Int8, Int16, Int32, Int64
UInt8, UInt16, UInt32, UInt64

String Types

Utf8, LargeUtf8

Binary Types

Binary, LargeBinary

Query Optimization

The hash index automatically accelerates queries with equality predicates on indexed columns.

Optimized Queries

When a primary key lookup is combined with additional filters (e.g. WHERE id = 123 AND status = 'active'), the index is used for the primary key lookup and the remaining filters are applied afterward by DataFusion.
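The "index lookup, then post-filter" behavior can be illustrated with a small stand-in. This sketch uses a plain `HashMap` keyed by `id` in place of the real index, and a closure in place of the filters that DataFusion applies; `Row` and `lookup_with_filter` are hypothetical names for illustration only.

```rust
use std::collections::HashMap;

/// Illustrative row type; the real engine works on Arrow record batches.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i64,
    status: String,
}

/// Models `WHERE id = <id> AND status = <status>`:
/// first a point lookup by primary key, then the remaining predicate.
fn lookup_with_filter<'a>(
    index: &'a HashMap<i64, Row>,
    id: i64,
    status: &str,
) -> Option<&'a Row> {
    index
        .get(&id) // point lookup by primary key
        .filter(|row| row.status == status) // post-filter on the fetched row
}
```

The point of the sketch is that the second predicate is evaluated only on the single row returned by the key lookup, not on the whole table.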

Non-Optimized Queries

Index Threshold

The hash index is only built when the dataset exceeds a minimum size:

CPU Cores	Minimum Rows for Index
1	256
4	1,024
8	2,048
16	4,096
32	8,192

For small tables below the threshold, a full scan is faster than index maintenance overhead.
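The table values follow a linear rule of 256 rows per CPU core. A minimal sketch of that rule is shown here; the function names are hypothetical, and the strict "greater than" comparison is an assumption about how "exceeds" is evaluated.

```rust
/// Rows required per CPU core before an index is built (256, per the table above).
const ROWS_PER_CORE: usize = 256;

/// Minimum row count for a given number of CPU cores.
fn index_threshold(cpu_cores: usize) -> usize {
    ROWS_PER_CORE * cpu_cores
}

/// Assumption: "exceeds" is read as strictly greater than the threshold.
fn should_build_index(row_count: usize, cpu_cores: usize) -> bool {
    row_count > index_threshold(cpu_cores)
}
```

For example, on an 8-core machine the threshold is 2,048 rows, so under this reading a 2,049-row dataset would be indexed and a 2,048-row dataset would not.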

Performance

Bloom Filter Performance

The built-in bloom filter provides:

~0.82% false positive rate (10 bits/item, 7 hash functions)
O(1) negative lookup confirmation
Reduced unnecessary hash table probes for non-existent keys
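The quoted false positive rate is consistent with the standard Bloom filter approximation p ≈ (1 − e^(−k/b))^k, where b is the number of bits per item and k is the number of hash functions. The sketch below evaluates that textbook formula; it is not taken from the Spice implementation, and the function name is illustrative.

```rust
/// Approximate false positive rate of a Bloom filter with `bits_per_item`
/// bits per stored key and `num_hashes` hash functions: (1 - e^(-k/b))^k.
fn bloom_false_positive_rate(bits_per_item: f64, num_hashes: u32) -> f64 {
    let k = f64::from(num_hashes);
    (1.0 - (-k / bits_per_item).exp()).powf(k)
}
```

With 10 bits per item and 7 hash functions this evaluates to roughly 0.0082, matching the ~0.82% figure above.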

Memory Usage

Component	Memory per Entry
Hash slot	16 bytes (8-byte hash + 8-byte location)
Bloom filter	~1.25 bytes
Total	~17.25 bytes per indexed row

Estimating Memory

For a 10 million row dataset:
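The estimate can be worked through from the per-entry figures in the table above: 16 bytes per hash slot plus 10 Bloom filter bits (1.25 bytes) per row. The following sketch does the arithmetic with integers; the constant and function names are hypothetical.

```rust
const HASH_SLOT_BYTES: u64 = 16; // 8-byte hash + 8-byte location
const BLOOM_BITS_PER_ROW: u64 = 10; // 10 bits = 1.25 bytes

/// Estimated index size in bytes for `rows` indexed rows.
fn estimated_index_bytes(rows: u64) -> u64 {
    rows * HASH_SLOT_BYTES + rows * BLOOM_BITS_PER_ROW / 8
}

/// Converts a byte count to mebibytes (1 MiB = 1,048,576 bytes).
fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For 10 million rows this gives 172,500,000 bytes, which is about 172.5 MB in decimal units or roughly 165 MiB.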

Architecture

Sharded Hash Table

The index uses 256 independent shards to minimize lock contention:

Shard Selection: Uses XOR-folded hash bits: ((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF
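The shard selection expression translates directly into code. In this sketch the function name `shard_index` is illustrative; the expression itself is the one given above.

```rust
/// Maps a 64-bit hash to a shard index in 0..=255 by XOR-folding the top
/// two bytes of the hash into its low byte.
fn shard_index(hash: u64) -> usize {
    (((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF) as usize
}
```

Folding the high bytes into the low byte means the shard choice depends on bits from both ends of the hash rather than on the lowest 8 bits alone.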

Row Location

Each indexed key maps to a RowLocation:
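To show how the shards and row locations fit together, here is a simplified sketch built only on the standard library. It assumes one `RwLock`-protected `HashMap` per shard keyed by the precomputed 64-bit hash, and it ignores hash collisions and the Bloom filter; the actual data structures in Spice may differ. `ShardedIndex` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Location of a row, using the three fields described in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowLocation {
    partition: u32, // Partition index
    batch: u32,     // Batch index within partition
    row: u32,       // Row index within batch
}

/// Illustrative 256-shard index: each shard has its own lock, so readers
/// and writers touching different shards do not contend with each other.
struct ShardedIndex {
    shards: Vec<RwLock<HashMap<u64, RowLocation>>>,
}

impl ShardedIndex {
    fn new() -> Self {
        let shards = (0..256).map(|_| RwLock::new(HashMap::new())).collect();
        Self { shards }
    }

    /// Picks the shard using the XOR-folded expression from this section.
    fn shard_for(&self, hash: u64) -> &RwLock<HashMap<u64, RowLocation>> {
        let idx = (((hash >> 56) ^ (hash >> 48) ^ hash) & 0xFF) as usize;
        &self.shards[idx]
    }

    /// Takes a write lock on one shard only.
    fn insert(&self, hash: u64, loc: RowLocation) {
        self.shard_for(hash).write().unwrap().insert(hash, loc);
    }

    /// Takes a read lock on one shard only.
    fn get(&self, hash: u64) -> Option<RowLocation> {
        self.shard_for(hash).read().unwrap().get(&hash).copied()
    }
}
```

A lookup therefore hashes the key once, selects a shard from the hash, and resolves to a `RowLocation` that identifies the partition, batch, and row to read.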

Hash Function

Uses XXH3_64 with a fixed seed (0x5370_6963_6541_4920 = "SpiceAI ") for:

Deterministic hashing across instances
High-quality distribution (passes SMHasher)
SIMD acceleration on arm64/amd64
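The seed value can be checked against its ASCII reading by interpreting the 64-bit constant as eight big-endian bytes. This sketch uses only the standard library and does not call any XXH3 implementation; the helper name is illustrative.

```rust
/// The fixed seed quoted above.
const SEED: u64 = 0x5370_6963_6541_4920;

/// Decodes the seed's eight big-endian bytes as text.
fn seed_as_ascii() -> String {
    String::from_utf8_lossy(&SEED.to_be_bytes()).into_owned()
}
```

The bytes 0x53, 0x70, 0x69, 0x63, 0x65, 0x41, 0x49, 0x20 decode to the characters S, p, i, c, e, A, I, and a trailing space.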

Limitations

Arrow Engine Only: Hash index is only available for Arrow acceleration (engine: arrow or partitioned_arrow)
Single-Column Primary Keys Only: Composite primary keys are not yet supported for indexed lookups; only single-column primary keys use the index
Experimental: API and behavior may change in future releases
No Persistence: Index is rebuilt on restart (data persists, index is in-memory)
Duplicate Keys: Primary key columns must have unique values
Secondary Index Limitations: Only single-column unique secondary indexes accelerate queries. Non-unique and compound secondary indexes are built and maintained but do not yet optimize queries

Troubleshooting

"No index available for point lookup"

Cause: Dataset row count is below the index threshold.

Solution: This is expected behavior for small datasets. The full scan is faster than index overhead.

Warning: "The hash_index acceleration parameter is ignored for Arrow acceleration"

Cause: hash_index: enabled is set in params but no longer activates indexing on its own.

Solution: Remove hash_index from params. Hash indexing activates automatically when primary_key or indexes is configured on an Arrow-accelerated dataset (see Configuration).

Hash index not active despite `primary_key` being set

Cause: refresh_mode: caching disables hash indexing even when primary_key is set; the caching path uses its own lookup strategy.

Solution: Use a non-caching refresh_mode (e.g. full, append, changes) for datasets that need point-lookup acceleration via the hash index.

High Memory Usage

Cause: Index consumes ~17 bytes per row.

Solution:

Remove primary_key for datasets where point lookups are rare (hash indexing stops being applied)
Consider using a different acceleration engine for very large datasets

threshold = 256 × CPU_cores

Index memory ≈ 10M × 17.25 bytes ≈ 172.5 MB (≈ 165 MiB)

┌────────────────────────────────────────────────┐
│                  HashIndex                     │
├────────────────────────────────────────────────┤
│  ┌─────────┐ ┌─────────┐      ┌─────────┐      │
│  │ Shard 0 │ │ Shard 1 │ ...  │Shard 255│      │
│  │ RwLock  │ │ RwLock  │      │ RwLock  │      │
│  └────┬────┘ └────┬────┘      └────┬────┘      │
│       │           │                │           │
│       ▼           ▼                ▼           │
│  ┌─────────┐ ┌─────────┐      ┌─────────┐      │
│  │  Hash   │ │  Hash   │ ...  │  Hash   │      │
│  │  Table  │ │  Table  │      │  Table  │      │
│  └─────────┘ └─────────┘      └─────────┘      │
├────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────┐   │
│  │        Optional Bloom Filter            │   │
│  │        (Fast Negative Lookups)          │   │
│  └─────────────────────────────────────────┘   │
└────────────────────────────────────────────────┘

datasets:
  - from: s3://bucket/orders.parquet
    name: orders
    acceleration:
      engine: arrow
      primary_key: order_id

datasets:
  - from: s3://bucket/users.parquet
    name: users
    acceleration:
      engine: arrow
      primary_key: user_id
      indexes:
        email: unique
        status: enabled
        '(region, category)': unique

-- Primary key lookup (uses primary key index)
SELECT * FROM my_dataset WHERE id = 123;

-- Multiple key lookups (uses primary key index for each key)
SELECT * FROM my_dataset WHERE id IN (1, 2, 3);

-- Secondary index lookup (uses unique secondary index)
SELECT * FROM my_dataset WHERE email = 'user@example.com';

-- Primary key lookup with additional filter (index + post-filter)
SELECT * FROM my_dataset WHERE id = 123 AND status = 'active';

-- Range queries (full scan)
SELECT * FROM my_dataset WHERE id > 100 AND id < 200;

-- Pattern matching (full scan)
SELECT * FROM my_dataset WHERE id LIKE 'A%';

-- Composite primary keys (full scan, not yet supported)
SELECT * FROM my_dataset WHERE region = 'US' AND customer_id = 42;

-- Non-unique secondary index (full scan, not yet optimized)
SELECT * FROM my_dataset WHERE status = 'active';

struct RowLocation {
    partition: u32,  // Partition index
    batch: u32,      // Batch index within partition
    row: u32,        // Row index within batch
}
