Partitioning splits an accelerated dataset into multiple physical units (files, tables, or in-memory tables) keyed by an expression evaluated per row. Queries that filter on the partitioning expression — or on a column it references — only read the partitions that can match, dramatically reducing the data scanned.
This config writes 50 separate DuckDB files, each containing the rows whose PULocationID hashes to the same bucket. A query like WHERE PULocationID = 132 only opens the single bucket file that could contain 132.
partition_by expression for every row and routes the row to a partition keyed by the expression's value.year then month — to combine pruning across dimensions.Partitioning is most useful when:
region = 'EU', tenant_id IN (...), created_at >= today() - INTERVAL '7 days').bucket(N, col) collapses to N partitions.partition_byLives directly under acceleration: (not under acceleration.params:). It's a list of expressions; each entry is either a plain string or a single-entry { name: expression } mapping:
Multi-entry mappings (- year: "…", month: "…" on one list item) are rejected at load time.
partition_mode (DuckDB only)Under acceleration.params.partition_mode. Selects how DuckDB physically lays out partitions:
| Value | Layout | Default |
|---|---|---|
files | Each partition is its own DuckDB file in a Hive-style directory (column=value/…). | ✓ |
tables | Single DuckDB file with one table per partition (discovered via information_schema.tables). |
| Engine | Required mode: | Multi-expression | Layout |
|---|---|---|---|
arrow | (memory; default) | Yes | One Arrow MemTable per partition value. |
duckdb | file | No (single only) | Hive-style files (partition_mode: files) or per-partition tables (partition_mode: tables). |
cayenne | file | Yes | One Vortex table per partition; catalog in a SQLite metadata file. |
sqlite, postgres, and turso accelerators do not support partition_by.
partition_by accepts any DataFusion-compatible scalar SQL expression that returns a String, integer, Boolean, or Timestamp and references exactly one column from the dataset. The most common partition transforms are:
bucket(num_buckets, column)Hashes column into num_buckets deterministic buckets.
num_buckets must be a positive integer literal ≤ 1,000,000.num_buckets — bucket(50, …) (Int64 literal) returns Int64; bucket(50::int32, …) returns Int32.num_buckets (uses ahash with a fixed seed).user_id, account_id, device_id).See the bucket reference for the full SQL signature.
truncate(width, value)Truncates to the next-lower multiple of width. Iceberg's truncate transform.
width must be a positive Int64 literal.value may be any signed/unsigned integer, Decimal128/Decimal256, Utf8 (string), or Binary. For strings/binary, returns the first width units.value.truncate(1000, amount)), or grouping strings by prefix (truncate(2, country_code)).See the truncate reference for examples.
date_part(unit, column) and date_trunc(unit, column)Built-in DataFusion datetime functions. Useful for time-based partitioning at year, month, day, or hour granularity.
date_part('year', col) returns the integer year (e.g. 2026).date_trunc('day', col) returns the timestamp truncated to the start of the day.YEAR(col), MONTH(col), DAY(col), etc. are aliases of date_part(...) and work identically.
:::note Filter pruning for date_part() is not yet implemented
A date_part('year', l_shipdate) partition still produces correctly-distributed partitions, but a filter like WHERE l_shipdate >= '2026-01-01' does not currently translate to a partition pruning. Queries return correct results — they just scan more partitions than necessary. If your filter is on the bare partition expression (WHERE date_part('year', l_shipdate) = 2026), the equality form does prune.
:::
column % N)A plain modulo expression also produces stable, partition-prunable buckets:
Range filters on the base column (id BETWEEN 0 AND 1000) are pruned for % partitions.
A bare column name partitions one partition per distinct value — useful for low-cardinality columns:
Pruning supports equality, IN, NOT IN, and range filters on the column.
Arrow and Cayenne accelerations accept multiple partition_by expressions. Spice partitions hierarchically — first by the leftmost expression, then by the next, and so on.
A query with WHERE region = 'EU' AND date_part('year', created_at) = 2026 prunes on both axes.
DuckDB partitioned acceleration rejects multiple expressions with PartitionByRequired (single-expression only).
Spice attempts to translate query-time filters into a partition selection. The matrix below summarizes which filter shapes prune which partition transforms:
| Filter shape | Plain column / truncate / date_trunc / % N | bucket(N, col) |
|---|---|---|
col = X | ✓ | ✓ (filter substituted into bucket expression) |
col != X | ✓ | ✗ (no pruning) |
col IN (a, b, …) | ✓ | ✓ |
col NOT IN (a, b, …) | ✓ | ✗ (no pruning) |
col < X / <= X / > X / >= X | ✓ | ✓ only for bounded ranges (both lower and upper) |
col BETWEEN a AND b | ✓ | ✓ (when range expands to ≤ a few thousand Int32 values) |
col = a OR col = b OR … | ✓ | ✓ |
partition_expr = X (e.g. ) |
Pruning notes:
WHERE user_id = 42 against a bucket(100, user_id) partition evaluates bucket(100, 42) once and reads only that partition.Int32 range (capped at MAX_BUCKET_ENUMERATION_I32 candidate values). Open-ended ranges (col < X with no lower bound) and non-Int32 types fall back to no pruning.date_part() filter pruning is not yet implemented for time-range filters. Equality on the partition expression still prunes; range filters on the base column do not.engine: arrow)In-memory MemTable per partition value. Partitions are rebuilt on every refresh. The simplest engine to start with for moderate datasets that fit in RAM. hash_index and sort_columns are propagated per partition.
engine: duckdb, partition_mode: files)Each partition value becomes its own DuckDB file under a Hive-style directory layout (e.g. accelerator_dir/dataset/year=2026/data.duckdb). Useful when you want OS-level visibility of partitions and per-partition I/O.
Requires mode: file. Single-expression only.
engine: duckdb, partition_mode: tables)A single DuckDB file with one table per partition value. Cheaper to manage operationally (one file to back up, one connection pool to size), but loses per-partition file-level isolation.
Requires mode: file. Single-expression only.
engine: cayenne)Each partition is a separate Cayenne (Vortex) table; the partition catalog is tracked in a SQLite metadata file. Cayenne supports composite partitioning natively and is the right pick for very large datasets where Arrow would not fit in memory.
Requires mode: file.
partition_by after refreshOnce partitions exist on disk, changing partition_by is rejected:
The
partition_byexpressions are different from the expressions used to create the existing partition files. Revert thepartition_byexpressions, delete the partition files, or change the location the partition files are stored to create new partitions.
To re-partition:
accelerator_dir at a fresh location.partition_by.There is no automatic in-place re-partitioning.
A partition_by expression is rejected at startup if any of the following hold:
String, an integer (signed or unsigned), Boolean, or Timestamp.OUTER REFERENCE, UNNEST, window function, aggregate function, EXISTS, GROUPING SET, or PLACEHOLDER.col AS partition_key).For DuckDB engines, additionally:
mode: must be file.partition_by expression is allowed.Partition events by user, collapsing millions of users into 100 stable buckets:
bucket SQL referencetruncate SQL referencedate_part SQL referencedate_trunc SQL referencetime_partition_column for time-pruned refreshesbucket(50, c) = 7| ✓ |
| ✓ |
expr1_filter AND expr2_filter (composite partitions) | ✓ | ✓ |