Apache DataFusion vs DuckDB

Apache DataFusion and DuckDB are both fast, in-process analytical query engines. DataFusion is an embeddable Rust library designed to be extended. DuckDB is a self-contained database designed to be used as-is. Choosing between them comes down to whether you are building a data system or querying data.

In-process analytical databases have changed how teams think about query performance. Instead of sending queries over the network to a remote warehouse, teams can embed a fast columnar engine directly in their application or data pipeline and query data at memory bandwidth speeds. Apache DataFusion and DuckDB are the two most prominent options in this space, and they are frequently compared by engineers making real architectural decisions.

They are not the same tool. DataFusion is a query engine framework written in Rust that teams embed into larger systems and extend with custom logic. DuckDB is a complete analytical database system with its own storage engine and transaction manager. Both are capable of fast analytical queries, but they answer different questions.

What Apache DataFusion Is

Apache DataFusion is an open-source SQL query engine framework within the Apache Arrow ecosystem. It provides parsing, logical planning, optimization, and vectorized execution as a Rust library. DataFusion does not include a storage engine -- it relies on external table providers -- and it does not manage transactions or data persistence independently.

The distinguishing characteristic of DataFusion is extensibility. Every major component is designed to be replaced or augmented: table providers can connect to any data source, optimizer rules can be added without modifying DataFusion's source code, user-defined functions (UDFs) can extend SQL with custom operations, and custom physical plan nodes can implement new execution strategies. Systems built on DataFusion include Spice, InfluxDB 3.0, Apache Ballista, and Delta-rs.

DataFusion produces results as Apache Arrow record batches throughout its pipeline. There is zero serialization overhead between operators, and results integrate directly with the broader Arrow ecosystem (PyArrow, Arrow Flight, Parquet readers, etc.).

What DuckDB Is

DuckDB is an open-source, embedded analytical database management system written in C++. It includes a columnar storage engine, a vectorized query executor, full ACID transaction support, and a PostgreSQL-compatible SQL dialect. DuckDB is designed to be used directly, not extended into a platform.

The distinguishing characteristic of DuckDB is completeness. It is a full database that works out of the box. A developer installs it, opens a connection, and starts querying -- no custom code required. DuckDB handles data storage, schema management, transactions, and compression automatically.

DuckDB can query Parquet, CSV, and JSON files directly without loading them into a database. It runs in-process with no external dependencies, with bindings available for Python, R, Go, Rust, Java, Node.js, and others.

Architecture Comparison

The core architectural difference is that DataFusion is a query engine without storage, while DuckDB is a complete database that includes storage.

Apache DataFusion: SQL Query → Logical Plan → Optimizer → Vectorized Execution → Arrow Record Batches, reading data through custom table providers (any data source).

DuckDB: SQL Query → Parse & Plan → Optimizer → Vectorized Execution → Result Set, reading data from built-in columnar storage (DuckDB files, Parquet, CSV).

Storage model

DataFusion has no built-in storage. It reads data through TableProvider implementations, which can point to anything: local Parquet files, remote databases, in-memory Arrow buffers, or custom storage formats. Building a DataFusion-based system requires implementing or choosing table providers.

DuckDB has a full native storage engine. Data is persisted in DuckDB's columnar format on disk, and the storage layer handles compression, indexing, and crash recovery automatically. DuckDB also reads Parquet, CSV, and JSON files directly, without loading them into its native format.

Extensibility model

DataFusion is designed for extension first. The extension surface covers custom table providers, logical optimizer rules, physical execution plans, scalar and aggregate UDFs, and custom analyzers. These extension points are typed Rust traits -- implementing them is straightforward and does not require forking DataFusion.

DuckDB supports extension through a loadable extension API (for adding file format readers, custom functions, and data types) but is fundamentally a closed system. You use DuckDB's capabilities; you do not rebuild DuckDB's internals.

Language and integration

DataFusion is a pure Rust library; Rust applications add it as an ordinary Cargo dependency. Python, Java, and other language bindings exist but are thin wrappers around the Rust core. Systems built on DataFusion are typically written primarily in Rust.
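As a sketch, a Rust project pulls DataFusion in like any other crate (the versions shown are illustrative -- pin to the releases you target):

```toml
[dependencies]
datafusion = "43"                                          # illustrative version
tokio = { version = "1", features = ["rt-multi-thread"] }  # async runtime DataFusion executes on
```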

DuckDB is written in C++ with first-class bindings across Python, R, Go, Rust, Java, Node.js, and others. The Python API in particular is mature and widely used for interactive analysis and data pipelines.

Feature Comparison

Feature | Apache DataFusion | DuckDB
Execution model | Vectorized (Arrow-native) | Vectorized (columnar)
Storage | None (external table providers) | Full native columnar storage
Persistence | Via table provider | Full (WAL, crash recovery)
Transactions | None (stateless query engine) | Full ACID
Full SQL support | Comprehensive (extensible) | Comprehensive (PostgreSQL-compatible)
File format support | Via providers: Parquet, CSV, JSON, Arrow | Native: Parquet, CSV, JSON; extensible
Parallelism | Multi-threaded, partition-aware | Automatic multi-core
Primary language | Rust | C++ (bindings for many languages)
Extensibility | Deep (table providers, optimizer rules, UDFs, custom operators) | Limited (extension API for discrete additions)
Startup overhead | Milliseconds (library init) | Milliseconds (in-process)
Ecosystem | Apache Arrow ecosystem | Standalone; integrates with Parquet, Arrow, Python
Primary use case | Building data systems | Analyzing data

Performance

Both DataFusion and DuckDB deliver excellent analytical query performance relative to row-oriented databases and remote query engines. On standard benchmarks like TPC-H, they perform within a similar range, though results vary by query type and hardware.

The practical performance difference comes from the workload pattern:

DataFusion excels when queries are distributed across custom sources or when the execution pipeline is extended with domain-specific operators. Because DataFusion operates natively on Arrow throughout, there is zero serialization cost when data is already in Arrow format (from Arrow Flight, from in-memory caches, or from a connected streaming system).

DuckDB excels at single-node analytical queries over files and when the full Parquet reader with zone maps, dictionary pushdown, and late materialization is needed. DuckDB's C++ implementation and extensive query optimizer tuning give it an edge on pure file-scanning workloads.

For Spice's data acceleration use case, DuckDB is one of several available accelerator engines. The recommended option for production workloads is Spice Cayenne, which uses the Vortex columnar format and outperforms DuckDB on TPC-H benchmarks for accelerated datasets.

When to Choose DataFusion

Choose Apache DataFusion when:

  • You are building a data system, not just querying data. If you need to connect to 10+ data sources, add custom SQL functions, or implement a proprietary execution strategy, DataFusion's extension model is the right foundation.
  • Your application is written in Rust and you need deep embedding with zero cross-language overhead.
  • You need SQL federation. DataFusion's custom table provider API makes it straightforward to add connectors to remote databases, object stores, and streaming systems -- the approach used by SQL federation engines like Spice.
  • You are building on the Arrow ecosystem. DataFusion's native Arrow output integrates directly with Arrow Flight, Parquet writers, and notebook environments without conversion.
  • Long-term extensibility matters. DataFusion's architecture is designed to evolve with domain requirements; DuckDB's extension API covers common additions but not deep architectural changes.

When to Choose DuckDB

Choose DuckDB when:

  • You want a database, not a framework. DuckDB works out of the box for analytical queries without writing any Rust code or configuring table providers.
  • You need persistence and transactions. If you need to write data back to a durable store and query it later, DuckDB's native storage engine handles this. DataFusion has no built-in storage.
  • Your team works in Python, R, or another non-Rust language. DuckDB's Python API is mature, widely adopted, and well-documented. It integrates naturally with pandas, PyArrow, and dbt.
  • You do interactive, exploratory analysis. DuckDB in a Jupyter notebook or the DuckDB CLI is an excellent tool for ad hoc exploration of Parquet files and structured data.
  • You need SQL-level compatibility with PostgreSQL. DuckDB's SQL dialect is closely aligned with PostgreSQL, which simplifies porting queries.

Advanced Topics

DataFusion's Physical Planning and Extensibility Depth

DataFusion separates logical planning (what to compute) from physical planning (how to compute it). This separation lets developers inject custom physical operators -- for example, a custom join operator that reads one side from a remote database and the other from a local buffer, merging the results inside DataFusion's execution pipeline.

This is not possible with DuckDB. DuckDB's execution engine is a closed system: you can add custom scalar functions and file format readers, but you cannot replace its execution operators or inject new ones.

For SQL federation use cases -- where different tables come from different sources and the query planner must make pushdown decisions for each source type -- DataFusion's extensibility is essential.

DuckDB's Parquet Zone Maps and Late Materialization

DuckDB's Parquet reader is one of the most optimized in the industry. It uses zone maps (min/max statistics stored in Parquet row group metadata) to skip row groups that cannot contain matching rows before reading any data. It also uses late materialization: columns not needed by a filter are not decoded until after the filter has been applied, further reducing I/O.

DataFusion also implements these optimizations, but DuckDB's C++ implementation and years of tuning give it a consistent edge on raw Parquet scan workloads.

Memory Management

DataFusion uses a MemoryPool abstraction that tracks and limits memory usage during query execution. Operators that accumulate state (hash joins, sorts, hash aggregations) register reservations and can spill to disk when the pool budget is exceeded.

DuckDB uses a similar buffer pool model with automatic spilling. Both handle out-of-core execution, but the behavior under memory pressure differs. DataFusion's memory pool is configurable and replaceable -- a system builder can implement custom memory management strategies. DuckDB's memory management is internal and not externally extensible.

DataFusion and DuckDB in the Spice Ecosystem

Spice uses both engines:

  • Apache DataFusion is Spice's core query engine. All federated queries across 30+ connected data sources are planned and executed through DataFusion. Spice registers custom table providers for each connector, adds optimizer rules for pushdown, and extends DataFusion with UDFs for hybrid search and LLM inference.
  • DuckDB is available as a data acceleration engine. When datasets are accelerated locally in Spice, users can choose DuckDB as the backing store for the local cache. DuckDB's columnar storage and analytical performance make it well-suited for scan-heavy accelerated queries.

For production acceleration workloads, the recommended option is Spice Cayenne, which uses the Vortex columnar format and delivers faster queries at lower memory usage than DuckDB for large accelerated datasets.

Apache DataFusion vs DuckDB FAQ

What is the main difference between Apache DataFusion and DuckDB?

DataFusion is a query engine framework -- a Rust library that provides SQL parsing, planning, optimization, and execution that developers embed into larger systems. DuckDB is a complete embedded analytical database with its own storage engine, transaction support, and persistence. DataFusion is for building data systems. DuckDB is for querying data.

Which is faster: DataFusion or DuckDB?

Performance depends on the workload. Both deliver fast analytical query execution through vectorized, columnar processing. DuckDB has an edge on raw Parquet file scanning due to its highly optimized C++ implementation. DataFusion has an advantage when queries are distributed across custom data sources (via table providers) or when data is already in Arrow format and zero-copy integration matters. For most analytical workloads, the difference is within a small factor.

Can I use both DataFusion and DuckDB in the same system?

Yes. Spice does exactly this: Apache DataFusion is the federation and query planning layer, while DuckDB is available as a local acceleration engine for cached datasets. Queries are planned through DataFusion and can be routed to the local DuckDB-backed cache for accelerated datasets, or pushed to remote sources for federated datasets.

Is Apache DataFusion production ready?

Yes. Apache DataFusion is used in production systems including Spice, InfluxDB 3.0, and Apache Comet (a Spark accelerator). It is an Apache Software Foundation project with active development, regular releases, and comprehensive test coverage. Its Rust implementation provides memory safety and predictable performance for production workloads.

Does DuckDB support SQL federation across multiple databases?

DuckDB has limited federation capabilities through its extension system (e.g., the postgres_scanner extension can query PostgreSQL). However, DuckDB is not designed as a multi-source federation engine -- it lacks the custom optimizer rules and connector architecture needed for production multi-source federation with predicate pushdown across heterogeneous systems. Apache DataFusion with custom table providers is the appropriate foundation for production SQL federation.

See Spice in action

Get a guided walkthrough of how development teams use Spice to query, accelerate, and integrate AI for mission-critical workloads.

Get a demo