The DuckDB Data Accelerator helps improve query performance by using DuckDB, an embedded analytical database engine optimized for efficient data processing.
It supports in-memory and file-based operation modes, enabling workloads that exceed available memory and optionally providing persistent storage for datasets.
To enable DuckDB acceleration, set the dataset's acceleration.engine to duckdb:
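For example, a minimal spicepod.yaml sketch (the dataset source and name are illustrative placeholders):

```yaml
datasets:
  - from: postgres:public.orders   # placeholder source
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
```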
By default, DuckDB acceleration uses mode: memory, loading datasets into memory.
When using mode: file, datasets are stored by default in a DuckDB file on disk in the .spice/data directory relative to the spicepod.yaml. Specify the duckdb_file parameter to store the DuckDB file in a different location. For datasets intended to be joined, set the same duckdb_file path for all related datasets.
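As a sketch, two datasets sharing one DuckDB file so they can be joined (the sources and file path are illustrative):

```yaml
datasets:
  - from: postgres:public.orders     # placeholder source
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params:
        duckdb_file: .spice/data/shared.db   # same file for joinable datasets
  - from: postgres:public.customers  # placeholder source
    name: customers
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params:
        duckdb_file: .spice/data/shared.db
```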
DuckDB acceleration supports the following optional parameters under acceleration.params:
- duckdb_file (string, default: .spice/data/accelerated_duckdb.db): Path to the DuckDB database file. Applies only when mode is set to file. If the file does not exist, Spice creates it automatically.
- duckdb_data_dir (string, default: .spice/data/): Path to the directory in which the DuckDB database file(s) are placed. Useful when using the partition_by acceleration parameter. If both duckdb_data_dir and duckdb_file are specified, duckdb_file is used and duckdb_data_dir is ignored.
- duckdb_memory_limit (string, default: none): Limits memory usage for the DuckDB instance. Acceptable units are KB, MB, GB, TB (decimal: 1000^i) or KiB, MiB, GiB, TiB (binary: 1024^i). See the DuckDB memory limit documentation.
- duckdb_preserve_insertion_order (boolean, default: true): Controls whether DuckDB preserves the insertion order of rows in tables. When set to true, rows are returned in the order they were inserted.

Refer to the datasets configuration reference for additional supported fields.
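The parameters above combine under acceleration.params like this (a sketch; all values are illustrative):

```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  params:
    duckdb_file: .spice/data/analytics.db
    duckdb_memory_limit: 8GiB
    duckdb_preserve_insertion_order: false  # relaxing order can reduce memory pressure on large loads
```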
Consider the following limitations when using DuckDB acceleration:
- Decimal256 (76 digits) is unsupported.
- on_zero_results: use_source cannot filter binary columns directly (e.g., WHERE col_blob <> ''). Instead, cast binary columns to another type (e.g., WHERE CAST(col_blob AS TEXT) <> '').
- Resource requirements depend on workload, dataset size, query complexity, and refresh modes.
DuckDB manages memory through streaming execution, intermediate spilling, and buffer management. By default, each DuckDB instance (one per DuckDB file) uses up to 80% of available system memory. To control memory usage, set the duckdb_memory_limit parameter:
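For example (the limit value is illustrative):

```yaml
acceleration:
  engine: duckdb
  params:
    duckdb_memory_limit: 4GiB  # binary units; decimal units such as 4GB are also accepted
```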
Note that duckdb_memory_limit limits only the DuckDB instance it is set on, not the entire runtime process. It also does not cover all DuckDB operations; for example, some insert operations allocate memory outside the limit. Index creation and scans are constrained by duckdb_memory_limit, so ensure adequate memory is provisioned.
Allocate at least 30% more container/machine memory for the runtime process than the configured DuckDB memory limit.
DuckDB indexes currently do not support spilling to disk. While index memory usage is registered through the buffer manager, index buffers are not managed by the buffer eviction mechanism. As a result, indexes may consume significant memory, which can degrade performance for memory-intensive queries.
Indexes are serialized to disk and loaded lazily upon database reopening, ensuring they do not affect database opening performance. Also consider index serialization when allocating disk storage.
For more details, see DuckDB's Indexes and Memory documentation.
Query performance, data load, and refresh operations scale with available CPU resources. Allocate sufficient CPU cores based on query complexity and concurrency.
Ensure adequate disk space for temporary files, swap files, WAL files, and intermediate spilling. Monitor disk usage regularly and adjust storage capacity based on dataset growth and query patterns.
The Spice runtime supports configuring a temporary directory for query and acceleration operations that spill to disk. By default, this is the directory of the duckdb_file.
Set the runtime.query.temp_directory parameter to specify a custom temporary directory. This can help distribute I/O operations across multiple volumes for improved throughput. For example, pointing runtime.query.temp_directory at a high-IOPS volume separate from the DuckDB data file can improve performance for workloads exceeding available memory.
Example configuration:
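A sketch, assuming runtime.query.temp_directory nests as the dotted path suggests (the volume path is a placeholder):

```yaml
runtime:
  query:
    temp_directory: /mnt/nvme/spice-tmp  # placeholder path on a high-IOPS volume
```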
Use this parameter when:

- Workloads exceed available memory and frequently spill to disk.
- You want to distribute I/O across multiple volumes, for example by placing temporary files on a separate high-IOPS volume from the DuckDB data file.
For more details, refer to the runtime parameters documentation.
For detailed DuckDB limits, see the DuckDB Memory Management Guide.
For practical examples, see the DuckDB Data Accelerator Cookbook Recipe.
- connection_pool_size (integer, default: 10 or the number of datasets sharing the same DuckDB file, whichever is larger): Controls the maximum number of connections kept open in the connection pool for concurrent query execution.
- on_refresh_recompute_statistics (string, default: enabled): Triggers automatic ANALYZE execution after data refreshes, keeping DuckDB optimizer statistics up to date for efficient query plans and performance. Set to disabled to turn automatic statistics recomputation off. See the DuckDB ANALYZE statement documentation.
- partition_mode (string, default: files): Controls how partitioned data is stored. Can only be used with partition_by. Set to tables to store partitions as separate tables within a single DuckDB database, which improves resource usage through a single shared connection pool for all partitions. The default files mode creates a separate database file per partition with individual connection pools and generally faster query performance.
- duckdb_partitioned_write_flush_threshold (integer, default: 122880): The number of rows buffered per partition before flushing data to acceleration storage. Only applicable when using partition_mode: tables. A larger value can improve write performance but requires more memory.
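As an illustrative sketch of table-based partitioning (the placement of partition_by under acceleration and its expression are assumptions based on the parameter descriptions above; the source is a placeholder):

```yaml
datasets:
  - from: s3://my-bucket/events/   # placeholder source
    name: events
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      partition_by: bucket(100, event_id)  # assumed placement and expression
      params:
        partition_mode: tables
        duckdb_partitioned_write_flush_threshold: 245760  # 2x default: buffers more rows per flush
        connection_pool_size: 20
```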