What is Apache DataFusion?

Apache DataFusion is an open-source, extensible SQL query engine written in Rust. Part of the Apache Arrow ecosystem, DataFusion is designed to be embedded as a library rather than deployed as a standalone server -- making it the foundation for a growing number of data systems.

Building a SQL query engine from scratch is a multi-year effort. Parsing SQL, generating logical plans, optimizing query execution, managing memory, and parallelizing work across cores are all hard problems individually. Combining them into a reliable, performant system is harder still.

Apache DataFusion provides a production-quality implementation of all of these components as an embeddable Rust library. Instead of building a query engine from zero, developers embed DataFusion and extend it with custom table providers, user-defined functions, and optimizer rules specific to their use case. The result is a fully featured SQL engine tailored to a particular domain, built in weeks instead of years.

Core Architecture

DataFusion processes a SQL query through a well-defined pipeline: parsing, planning, optimization, and execution. Each stage is modular and extensible.

SQL Parsing and Logical Planning

When a SQL query arrives, DataFusion parses it into an abstract syntax tree (AST) and then converts the AST into a logical plan. The logical plan is a tree of relational algebra operations -- scans, filters, projections, joins, aggregations, sorts -- that describe what the query computes without specifying how to compute it.

For example, the query:

SELECT customer_name, SUM(amount)
FROM orders
WHERE created_at > '2026-01-01'
GROUP BY customer_name
ORDER BY SUM(amount) DESC
LIMIT 10

produces a logical plan roughly equivalent to:

Limit (10)
  Sort (SUM(amount) DESC)
    Aggregate (GROUP BY customer_name, SUM(amount))
      Filter (created_at > '2026-01-01')
        Scan (orders)

Query Optimization

DataFusion applies a series of optimization passes to the logical plan. These include:

  • Predicate pushdown: Moving filter expressions closer to the data source so less data is read
  • Projection pushdown: Eliminating columns that are not needed by downstream operators
  • Constant folding: Evaluating constant expressions at planning time rather than execution time
  • Join reordering: Selecting the most efficient join order based on available statistics
  • Common subexpression elimination: Computing repeated expressions once and reusing the result
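The flavor of these rule-based passes can be illustrated with a toy plan type. The following is a conceptual sketch of predicate pushdown in plain Rust -- the `Plan` enum and `push_down_filters` function are simplified stand-ins, not DataFusion's actual `LogicalPlan` or `OptimizerRule` API:

```rust
// Conceptual sketch of predicate pushdown on a toy logical plan.
// This mirrors the idea of a rule-based optimizer pass, but with a
// simplified plan type rather than DataFusion's real API.

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Push Filter nodes beneath Projection nodes so predicates sit
/// closer to the Scan (a real rule also checks column validity).
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => match *input {
            Plan::Projection { columns, input } => Plan::Projection {
                columns,
                input: Box::new(push_down_filters(Plan::Filter { predicate, input })),
            },
            other => Plan::Filter {
                predicate,
                input: Box::new(push_down_filters(other)),
            },
        },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_down_filters(*input)),
        },
        scan => scan,
    }
}

fn main() {
    // Filter sits above the Projection before optimization.
    let plan = Plan::Filter {
        predicate: "created_at > '2026-01-01'".into(),
        input: Box::new(Plan::Projection {
            columns: vec!["customer_name".into(), "amount".into()],
            input: Box::new(Plan::Scan { table: "orders".into() }),
        }),
    };
    let optimized = push_down_filters(plan);
    // After the pass, the Projection is on top and the Filter is below it.
    assert!(matches!(optimized, Plan::Projection { .. }));
    println!("{optimized:?}");
}
```

A real rule works the same way structurally: it pattern-matches on a plan subtree and returns a rewritten subtree, and the optimizer applies rules repeatedly until the plan stops changing.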

The optimizer is rule-based and extensible. Developers can register custom optimization rules that apply domain-specific transformations. For example, a federation engine can add rules that push certain operations down to remote data sources.

Physical Planning and Execution

After optimization, the logical plan is converted to a physical plan that specifies the actual execution strategy: which join algorithm to use (hash join, sort-merge join, nested loop), how to partition work across threads, and how to manage memory.

Execution produces a stream of Apache Arrow record batches. Arrow is a columnar in-memory format that enables zero-copy data exchange between operators and between systems. Because DataFusion is built natively on Arrow, there is no serialization or deserialization overhead between planning and execution -- the data stays in Arrow format throughout.
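The zero-copy property can be pictured with reference-counted column buffers. This std-only sketch is a simplified analogue of Arrow-style batches, not the real arrow-rs `RecordBatch` type:

```rust
use std::sync::Arc;

// Simplified picture of Arrow-style columnar batches: operators share
// immutable column buffers by reference count rather than copying them.
// A std-only sketch, not the real arrow-rs RecordBatch.

#[derive(Clone)]
struct Column {
    name: String,
    values: Arc<Vec<i64>>, // shared, immutable buffer
}

struct RecordBatch {
    columns: Vec<Column>,
}

/// A "projection" operator: selects columns by name. Cloning a Column
/// clones the Arc (a pointer bump), not the underlying values.
fn project(batch: &RecordBatch, keep: &[&str]) -> RecordBatch {
    RecordBatch {
        columns: batch
            .columns
            .iter()
            .filter(|c| keep.contains(&c.name.as_str()))
            .cloned()
            .collect(),
    }
}

fn main() {
    let amounts = Arc::new(vec![10, 20, 30]);
    let batch = RecordBatch {
        columns: vec![
            Column { name: "id".into(), values: Arc::new(vec![1, 2, 3]) },
            Column { name: "amount".into(), values: Arc::clone(&amounts) },
        ],
    };
    let projected = project(&batch, &["amount"]);
    // Same buffer, no copy: both handles point at one allocation.
    assert!(Arc::ptr_eq(&projected.columns[0].values, &amounts));
    assert_eq!(projected.columns.len(), 1);
}
```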

Key Features

Full SQL Support

DataFusion supports a broad subset of SQL, including:

  • Standard SELECT, INSERT, UPDATE, DELETE statements
  • JOIN (inner, left, right, full outer, cross, semi, anti)
  • Window functions (ROW_NUMBER, RANK, LAG, LEAD, etc.)
  • Common table expressions (CTEs) with WITH clauses
  • Subqueries and correlated subqueries
  • UNION, INTERSECT, EXCEPT set operations
  • GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET

Extensibility

DataFusion's primary design goal is extensibility. The key extension points are:

Custom table providers allow DataFusion to query any data source. A table provider implements the TableProvider trait, telling DataFusion how to scan data from a specific source. Out of the box, DataFusion includes providers for Parquet, CSV, JSON, and Arrow IPC files. Custom providers can connect to databases, APIs, object stores, or any other data source.
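The shape of the idea can be sketched with a simplified trait: the engine only knows a table's schema and how to scan it, so any source can plug in behind that interface. This is a std-only analogue -- the real datafusion TableProvider trait is async and produces Arrow data:

```rust
// Loose, std-only sketch of the table-provider idea: a trait that tells
// the engine what a table's schema is and how to scan its rows. The real
// datafusion TableProvider trait is async and Arrow-based.

trait SimpleTableProvider {
    fn schema(&self) -> Vec<String>;
    fn scan(&self) -> Vec<Vec<String>>; // rows of string cells
}

/// An in-memory provider; a real one might wrap a database, API, or file.
struct InMemoryTable {
    columns: Vec<String>,
    rows: Vec<Vec<String>>,
}

impl SimpleTableProvider for InMemoryTable {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }
    fn scan(&self) -> Vec<Vec<String>> {
        self.rows.clone()
    }
}

fn main() {
    let orders = InMemoryTable {
        columns: vec!["customer_name".into(), "amount".into()],
        rows: vec![
            vec!["acme".into(), "100".into()],
            vec!["globex".into(), "250".into()],
        ],
    };
    // The engine only sees the trait object, so any source can plug in.
    let provider: &dyn SimpleTableProvider = &orders;
    assert_eq!(provider.schema(), vec!["customer_name", "amount"]);
    assert_eq!(provider.scan().len(), 2);
}
```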

User-defined functions (UDFs) extend DataFusion's expression language. Scalar UDFs operate on individual rows, aggregate UDFs operate on groups of rows, and window UDFs operate over window frames. UDFs are registered with the session context and can be used in SQL queries like built-in functions.
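The registration-and-lookup pattern behind scalar UDFs can be sketched with a toy function registry. The registry type, the `apply_discount` function, and the `f64`-slice signature here are all illustrative inventions -- the real API registers UDFs on the session context and evaluates them over Arrow arrays:

```rust
use std::collections::HashMap;

// Sketch of how scalar UDFs extend an engine's expression language:
// named functions registered in a session-level registry and looked up
// at planning/evaluation time. Simplified; the real API operates on
// Arrow arrays, not plain f64 slices.

type ScalarUdf = fn(&[f64]) -> f64;

struct FunctionRegistry {
    udfs: HashMap<String, ScalarUdf>,
}

impl FunctionRegistry {
    fn new() -> Self {
        Self { udfs: HashMap::new() }
    }
    fn register_udf(&mut self, name: &str, f: ScalarUdf) {
        // Function names are case-insensitive, like SQL identifiers.
        self.udfs.insert(name.to_lowercase(), f);
    }
    fn call(&self, name: &str, args: &[f64]) -> Option<f64> {
        self.udfs.get(&name.to_lowercase()).map(|f| f(args))
    }
}

fn main() {
    let mut registry = FunctionRegistry::new();
    // A hypothetical UDF: a percentage discount applied to an amount.
    registry.register_udf("apply_discount", |args| args[0] * (1.0 - args[1]));
    // Once registered, the planner can resolve it like a built-in.
    assert_eq!(registry.call("APPLY_DISCOUNT", &[200.0, 0.1]), Some(180.0));
}
```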

Custom optimizer rules allow developers to add domain-specific optimizations. For example, a federated query engine can add rules that detect when a filter or aggregation can be pushed down to a remote source and rewrite the plan accordingly.

Custom physical plan operators allow developers to implement new execution strategies. For example, a distributed query engine can replace DataFusion's local join operator with a distributed shuffle-join that partitions data across multiple nodes.

Native Arrow Integration

DataFusion operates on Apache Arrow arrays throughout the entire pipeline. This means:

  • No serialization overhead between operators
  • Zero-copy data exchange with other Arrow-based systems
  • Compatibility with the broader Arrow ecosystem (PyArrow, Arrow Flight, etc.)
  • Efficient SIMD operations on columnar data

DataFusion vs. Other Query Engines

DataFusion vs. DuckDB

DuckDB is an embedded analytical database -- a "SQLite for analytics." Like DataFusion, it is designed for in-process analytical queries. The key difference is in extensibility and architecture.

DuckDB is a complete, self-contained database with its own storage engine, transaction manager, and query executor. It is written in C++ and provides a SQL interface with minimal configuration.

DataFusion is a query engine library, not a database. It does not include a storage engine or transaction manager. Instead, it provides the query planning and execution components that developers embed into their own systems. DataFusion is written in Rust and designed to be extended with custom table providers, UDFs, and optimizer rules.

Choose DuckDB when you need a self-contained analytical database. Choose DataFusion when you are building a custom data system and need an embeddable, extensible query engine as a foundation.

DataFusion vs. Trino (Presto)

Trino (formerly Presto) is a distributed SQL query engine designed for federated queries across data warehouses and data lakes. It is deployed as a standalone cluster of coordinator and worker nodes.

DataFusion is a single-node, embeddable library. It does not include built-in distributed execution (though Apache Ballista adds distributed capabilities on top of DataFusion). Trino is a production-ready distributed system with its own cluster management, fault tolerance, and resource scheduling.

Choose Trino when you need a standalone, distributed federated query engine with its own cluster infrastructure. Choose DataFusion when you need an embeddable query engine that you control and extend within your own application.

DataFusion vs. Apache Spark SQL

Spark SQL is the SQL interface to Apache Spark, a distributed data processing framework. Spark is designed for large-scale batch processing and runs on JVM-based cluster infrastructure (YARN, Mesos, Kubernetes).

DataFusion is a lightweight, Rust-native library with no JVM dependency. It is designed for low-latency, in-process query execution rather than large-scale distributed batch processing. DataFusion's startup time is measured in milliseconds; Spark's is measured in seconds to minutes.

Choose Spark when you need large-scale distributed batch processing with a mature ecosystem. Choose DataFusion when you need a lightweight, low-latency query engine embedded in a Rust or Python application.

How Spice Extends DataFusion

Spice uses DataFusion as its core query engine and extends it with several capabilities:

Custom Table Providers for Federated Data

Spice registers custom DataFusion table providers for each connected data source -- PostgreSQL, MySQL, Databricks, Amazon S3, Snowflake, and 30+ others. When a query references a table backed by a remote source, the corresponding table provider handles connectivity, dialect translation, and data retrieval.

Custom Optimizer Rules for Pushdown

Spice adds optimizer rules that analyze the query plan and determine which operations can be pushed down to each source. For example, a filter on a PostgreSQL-backed table is rewritten into a WHERE clause in the generated PostgreSQL query, so only matching rows are transferred. This minimizes data movement and maximizes source-side performance.
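The end result of such a rule can be sketched as SQL generation from a scan node that has collected pushed-down filters. The `RemoteScan` type and its method are illustrative, not Spice's internals:

```rust
// Sketch of the pushdown outcome: filters found in the plan are
// rewritten into a WHERE clause in the SQL sent to the remote source,
// so filtering happens source-side. Names here are illustrative.

struct RemoteScan {
    table: String,
    pushed_filters: Vec<String>,
}

impl RemoteScan {
    /// Generate the SQL actually sent to the remote database.
    fn to_remote_sql(&self) -> String {
        let mut sql = format!("SELECT * FROM {}", self.table);
        if !self.pushed_filters.is_empty() {
            sql.push_str(" WHERE ");
            sql.push_str(&self.pushed_filters.join(" AND "));
        }
        sql
    }
}

fn main() {
    let scan = RemoteScan {
        table: "orders".into(),
        pushed_filters: vec![
            "created_at > '2026-01-01'".into(),
            "amount > 100".into(),
        ],
    };
    // Only rows matching both predicates cross the network.
    assert_eq!(
        scan.to_remote_sql(),
        "SELECT * FROM orders WHERE created_at > '2026-01-01' AND amount > 100"
    );
}
```

A production implementation additionally translates expressions into the target's SQL dialect and falls back to local filtering for predicates the source cannot evaluate.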

UDFs for Search and AI Inference

Spice extends DataFusion's function library with UDFs that enable hybrid search (full-text and vector search within SQL), AI model inference, and embedding generation. These functions are available in standard SQL queries:

SELECT id, content, search_score
FROM documents
WHERE search(content, 'deployment strategies', 'hybrid')
ORDER BY search_score DESC
LIMIT 10

Integration with the Acceleration Layer

When query acceleration is enabled, Spice stores cached data in Vortex format and exposes it through custom DataFusion table providers. The query optimizer can push filters and projections directly into the Vortex storage layer, enabling sub-second query performance over locally cached data.

The DataFusion Ecosystem

DataFusion's embeddable design has led to a growing ecosystem of projects built on top of it:

  • Spice: SQL federation, acceleration, and AI inference engine
  • Apache Ballista: Distributed query execution layer for DataFusion
  • InfluxDB 3.0: Time-series database rebuilt on DataFusion and Arrow
  • Apache Comet: Spark-compatible query accelerator using DataFusion
  • Delta-rs: Delta Lake implementation in Rust with DataFusion integration
  • GlareDB: SQL interface for querying across databases and data lakes

This ecosystem demonstrates DataFusion's value proposition: rather than each project building its own SQL parser, optimizer, and execution engine, they share a common, well-tested foundation and focus on their differentiating features.

Advanced Topics

The Query Pipeline in Detail

A SQL query passes through a series of well-defined stages before producing results. Understanding this pipeline is essential for developers who need to extend or debug DataFusion behavior.

SQL String → Parse → Logical Plan → Optimize → Optimized Logical Plan → Physical Planning → Physical Plan → Execute → Arrow Record Batches

The parser converts a SQL string into a logical plan tree. The optimizer applies a sequence of rule-based passes -- predicate pushdown, projection pruning, join reordering, and others -- to produce an optimized logical plan. The physical planner then selects concrete execution strategies (e.g., hash join vs. sort-merge join) and generates a physical plan. Finally, the execution engine evaluates the physical plan and streams results as Apache Arrow record batches.

Each stage is independently extensible. Developers can register custom analyzer rules (which run before optimization), custom optimizer rules (which transform the logical plan), and custom physical plan nodes (which implement new execution strategies).

The Catalog System

DataFusion organizes data through a three-level naming hierarchy: catalog, schema, and table. A SessionContext holds a default catalog that contains one or more schemas, each of which contains tables. When a query references orders, DataFusion resolves it through this hierarchy -- by default, datafusion.public.orders.
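The default-qualification step can be sketched as follows. The default names match DataFusion's defaults; the resolution function itself is a simplified illustration:

```rust
// Toy version of catalog.schema.table resolution: a bare table name is
// qualified with the session defaults before lookup. Simplified sketch.

fn resolve(
    reference: &str,
    default_catalog: &str,
    default_schema: &str,
) -> (String, String, String) {
    let parts: Vec<&str> = reference.split('.').collect();
    match parts.as_slice() {
        [table] => (default_catalog.into(), default_schema.into(), (*table).into()),
        [schema, table] => (default_catalog.into(), (*schema).into(), (*table).into()),
        [catalog, schema, table] => ((*catalog).into(), (*schema).into(), (*table).into()),
        _ => panic!("invalid table reference: {reference}"),
    }
}

fn main() {
    // A bare `orders` resolves to datafusion.public.orders by default.
    assert_eq!(
        resolve("orders", "datafusion", "public"),
        ("datafusion".to_string(), "public".to_string(), "orders".to_string())
    );
    // A fully qualified reference overrides the defaults.
    assert_eq!(
        resolve("sales.eu.orders", "datafusion", "public"),
        ("sales".to_string(), "eu".to_string(), "orders".to_string())
    );
}
```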

The catalog system is trait-based and fully replaceable. Developers implement the CatalogProvider, SchemaProvider, and TableProvider traits to integrate their own metadata stores. For example, a SQL federation engine can register a catalog provider that discovers schemas and tables dynamically from a remote database's information schema, making remote tables queryable as if they were local.

The TableProvider trait is the most commonly extended interface. It defines how DataFusion scans data from a source, what statistics are available for the optimizer, what filters can be pushed down to the source, and what the schema of the data is.

Custom Execution Plans

When the built-in physical plan operators are not sufficient, developers create custom ExecutionPlan implementations. A custom execution plan node participates in the standard pipeline: it receives input partitions, applies its logic, and produces output partitions as Arrow record batch streams.

Common use cases for custom execution plans include remote execution (sending part of a query to an external system and reading results back as Arrow), custom caching (materializing intermediate results for reuse across queries), and specialized operators (e.g., a graph traversal operator or a time-series interpolation operator that has no SQL equivalent).

Custom execution plans integrate with DataFusion's partition-aware execution model. They declare how many output partitions they produce, whether they require a specific input partitioning, and how they distribute work across threads.
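The partition-aware model above can be sketched with a toy operator trait: an operator declares its output partitioning and transforms each input partition independently. The trait and the `DoubleOp` operator are simplified inventions -- the real ExecutionPlan trait is async and streams Arrow record batches:

```rust
// Conceptual sketch of a custom operator in a partition-aware model:
// it declares its output partitioning and transforms each input
// partition into an output partition. The real trait is async and
// streams Arrow record batches; this is a simplified analogue.

trait SimpleExecutionPlan {
    fn output_partitions(&self) -> usize;
    fn execute(&self, partition: usize) -> Vec<i64>;
}

/// A hypothetical operator that doubles every value in its input,
/// preserving the input's partitioning one-to-one.
struct DoubleOp {
    input_partitions: Vec<Vec<i64>>,
}

impl SimpleExecutionPlan for DoubleOp {
    fn output_partitions(&self) -> usize {
        self.input_partitions.len()
    }
    fn execute(&self, partition: usize) -> Vec<i64> {
        self.input_partitions[partition].iter().map(|v| v * 2).collect()
    }
}

fn main() {
    let op = DoubleOp {
        input_partitions: vec![vec![1, 2], vec![3]],
    };
    // Each partition could run on its own thread; here we run serially.
    let results: Vec<Vec<i64>> = (0..op.output_partitions())
        .map(|p| op.execute(p))
        .collect();
    assert_eq!(results, vec![vec![2, 4], vec![6]]);
}
```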

Memory Management and Spilling

DataFusion uses a MemoryPool abstraction to track and limit memory consumption during query execution. Operators that accumulate state -- hash joins, hash aggregations, sorts -- register their memory usage with the pool. When an operator's allocation request would exceed the configured memory limit, DataFusion triggers spilling: the operator writes its in-memory state to temporary files on disk and continues processing with reduced memory.

The spilling mechanism is critical for handling queries that process more data than available memory. Hash joins spill their build-side partitions, sorts spill sorted runs, and aggregations spill partial aggregate state. When the spilled data is needed, it is read back from disk and merged. This enables DataFusion to process arbitrarily large datasets within a fixed memory budget, albeit with the performance trade-off of disk I/O.

Developers can implement custom MemoryPool strategies to match their deployment constraints -- for example, a pool that reserves memory for concurrent queries or one that integrates with an external resource manager.
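The grow-or-spill decision described above can be sketched with a toy pool and a sort operator. The types and thresholds here are illustrative, not DataFusion's MemoryPool implementation, and "spilling" is reduced to a counter rather than actual disk I/O:

```rust
// Toy memory pool illustrating the spill decision: an operator asks the
// pool for memory, and when the request would exceed the limit it
// releases its in-memory state as a "spilled run" and retries.
// Simplified; the real pool is shared across operators and queries.

struct MemoryPool {
    limit: usize,
    used: usize,
}

impl MemoryPool {
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes > self.limit {
            false
        } else {
            self.used += bytes;
            true
        }
    }
    fn shrink(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

struct SortOperator {
    in_memory_bytes: usize,
    spilled_runs: usize,
}

impl SortOperator {
    /// Buffer a batch; spill the in-memory state if the pool refuses.
    fn buffer_batch(&mut self, pool: &mut MemoryPool, bytes: usize) {
        if !pool.try_grow(bytes) {
            // Spill: release everything held in memory, note a sorted run
            // (a real sort writes it to a temp file for later merging).
            pool.shrink(self.in_memory_bytes);
            self.in_memory_bytes = 0;
            self.spilled_runs += 1;
            assert!(pool.try_grow(bytes), "single batch larger than limit");
        }
        self.in_memory_bytes += bytes;
    }
}

fn main() {
    let mut pool = MemoryPool { limit: 100, used: 0 };
    let mut sort = SortOperator { in_memory_bytes: 0, spilled_runs: 0 };
    for _ in 0..5 {
        sort.buffer_batch(&mut pool, 40); // five 40-byte batches, 100-byte limit
    }
    // Batches 1-2 fit, batch 3 spills, batches 3-4 fit, batch 5 spills.
    assert_eq!(sort.spilled_runs, 2);
}
```

At the end of execution, a real sort merges the spilled runs with whatever remains in memory to produce the final sorted output.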

Apache DataFusion FAQ

What is the difference between DataFusion and DuckDB?

DuckDB is a self-contained embedded analytical database with its own storage engine and transaction manager. DataFusion is a query engine library without built-in storage -- it provides SQL parsing, planning, optimization, and execution that developers embed into their own systems. DuckDB is designed for end users who want a ready-to-use analytical database. DataFusion is designed for developers building custom data systems who need an extensible query engine as a foundation.

Is Apache DataFusion production ready?

Yes. DataFusion is used in production by multiple companies and projects, including Spice, InfluxDB 3.0, and Apache Comet. It is an Apache Software Foundation project with an active contributor community, regular releases, and comprehensive test coverage. Its Rust implementation provides memory safety and predictable performance characteristics suited for production workloads.

How does DataFusion compare to Trino?

Trino is a standalone, distributed SQL query engine deployed as a cluster of coordinator and worker nodes. DataFusion is a single-node, embeddable library. Trino includes built-in cluster management, fault tolerance, and resource scheduling. DataFusion provides query planning and execution components that developers embed and extend within their own applications. Trino is designed for standalone deployment; DataFusion is designed for embedding.

How do you extend DataFusion with custom functionality?

DataFusion provides several extension points: custom table providers (to query any data source), user-defined functions (scalar, aggregate, and window), custom optimizer rules (to add domain-specific plan transformations), and custom physical plan operators (to implement new execution strategies). Extensions are implemented as Rust traits and registered with the DataFusion session context.

What is the relationship between DataFusion and Apache Arrow?

DataFusion is part of the Apache Arrow ecosystem and operates on Arrow arrays throughout its entire query pipeline. Arrow provides the columnar in-memory data format; DataFusion provides the SQL query engine that operates on that format. This native integration means zero serialization overhead between operators and zero-copy data exchange with other Arrow-based systems like PyArrow, Arrow Flight, and Ballista.
