Managed Apache DataFusion: Federated SQL at Scale

Managed Apache DataFusion is an operational model where teams use DataFusion as the query core while a platform handles source connectivity, federation planning, acceleration policies, and runtime reliability.

Teams rarely struggle with writing SQL itself. The harder problem is operating SQL across many systems with different dialects, latency profiles, and reliability characteristics. Managed Apache DataFusion addresses that operational gap by combining DataFusion's embeddable query engine with platform-level controls for connectors, pushdown, cache freshness, fallback behavior, and multi-tenant execution.

At a high level, this model keeps what DataFusion does best (parsing, logical planning, optimization, and execution in Rust) while shifting operational complexity into a managed control plane. This matters when a single workload must query PostgreSQL, Snowflake, S3, and streams through one query layer while still meeting isolation, reliability, and cost constraints.

For a practical implementation walkthrough, see How we use Apache DataFusion at Spice AI. For the product-level architecture and deployment model, see SQL federation and acceleration.

Why Teams Choose Managed DataFusion

Most teams evaluating DataFusion can ship a prototype quickly. The challenge appears later, when they need to run federated SQL continuously in production.

Common pressure points include:

Connector lifecycle management across many source systems and credentials.
Cross-source query planning where each backend has different SQL capabilities.
Freshness controls for accelerated datasets and fallback behavior on cache misses.
Multi-tenant safeguards to keep policies, workload isolation, and routing consistent.
Operational visibility for query failures, retries, and performance regressions.

A managed approach packages these concerns into repeatable platform behavior. Instead of every application team implementing the same engine-level controls, they reuse a common runtime with centralized policy.

What Managed Apache DataFusion Includes

Managed DataFusion is not a separate SQL language. It is a runtime and operations model layered around DataFusion's extension points.

Federated SQL Across Heterogeneous Sources

DataFusion provides the planning pipeline. Managed platforms implement source adapters and table providers that expose remote systems as queryable tables. This allows one query to span databases, warehouses, object storage, and streams through the same interface.

In practice, federated execution depends on two paths:

Push work down to each source when possible to reduce data movement.
Merge or post-process results locally when a query spans multiple systems.
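The two paths can be sketched in a few lines of Python. This is a minimal illustration, not DataFusion code: two in-memory SQLite databases stand in for independent remote sources, and the table names, columns, and the `federated_large_orders` helper are assumptions made for the example.

```python
import sqlite3

# Two in-memory SQLite databases stand in for independent remote sources.
orders_src = sqlite3.connect(":memory:")
orders_src.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
orders_src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 300.0), (3, 10, 120.0)],
)

customers_src = sqlite3.connect(":memory:")
customers_src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
customers_src.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(10, "Ada"), (11, "Grace")]
)


def federated_large_orders(min_total):
    """Hypothetical federated query: customer names with orders >= min_total."""
    # Path 1: push the filter and projection down, so the orders source
    # returns only the rows and columns the query needs.
    orders = orders_src.execute(
        "SELECT customer_id, total FROM orders WHERE total >= ?", (min_total,)
    ).fetchall()
    customers = dict(customers_src.execute("SELECT id, name FROM customers").fetchall())
    # Path 2: the join spans two systems, so it runs locally (a simple hash join).
    return sorted((customers[cid], total) for cid, total in orders if cid in customers)
```

Only the filtered order rows cross the network boundary in this sketch; the cross-source join is the part that has to happen in the local engine.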

For detailed background on federation mechanics, see SQL federation.

Optimizer and Pushdown Management

Managed DataFusion systems usually add analyzer and optimizer rules that are specific to federation. These rules decide which filters, projections, and aggregates are safe to run at the source versus locally.
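One way to picture such a rule is as a capability check per source. The sketch below is an assumption-laden illustration in Python, not DataFusion's optimizer API: the `SOURCE_CAPABILITIES` table, the `Filter` type, and the source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    column: str
    op: str  # e.g. "=", ">", "regexp"
    value: object


# Hypothetical capability table: the operators each source can evaluate itself.
SOURCE_CAPABILITIES = {
    "postgres": {"=", "<", ">", "regexp"},
    "object_store": {"="},
}


def split_filters(source, filters):
    """Return (filters to push to the source, filters to evaluate locally)."""
    supported = SOURCE_CAPABILITIES.get(source, set())
    pushed = [f for f in filters if f.op in supported]
    local = [f for f in filters if f.op not in supported]
    return pushed, local
```

An unknown source gets an empty capability set, so every filter stays local. Defaulting to local evaluation is the conservative choice: it costs data movement but never changes results.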

This is one reason teams adopt managed offerings: optimizer behavior becomes part of platform configuration rather than ad hoc logic in each application.

Acceleration and Freshness Controls

Many production workloads combine federation with local acceleration to improve latency. Managed runtimes configure refresh policy (for example, append or CDC-based updates), stale-read policy, and fallback behavior in one place.
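The interaction between refresh, stale reads, and fallback can be sketched as a small read path. This Python example is illustrative only: the `AcceleratedDataset` class and its parameters are hypothetical, and it models a simple full refresh on read rather than append or CDC-based updates.

```python
import time


class AcceleratedDataset:
    """Hypothetical local acceleration with a max-age freshness policy."""

    def __init__(self, load_from_source, max_age_seconds,
                 allow_stale_on_error=True, clock=time.monotonic):
        self._load = load_from_source
        self._max_age = max_age_seconds
        self._allow_stale = allow_stale_on_error
        self._clock = clock
        self._rows = None
        self._loaded_at = None

    def read(self):
        """Return (rows, origin), where origin says how the read was served."""
        now = self._clock()
        fresh = self._loaded_at is not None and now - self._loaded_at <= self._max_age
        if fresh:
            return self._rows, "accelerated"
        try:
            self._rows = self._load()
            self._loaded_at = now
            return self._rows, "refreshed"
        except Exception:
            # Stale-read policy: serve the last good copy if the source fails.
            if self._allow_stale and self._loaded_at is not None:
                return self._rows, "stale"
            raise
```

Returning the origin alongside the rows keeps the policy observable, so callers and dashboards can tell an accelerated read from a stale fallback.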

This pattern aligns with real-time change data capture and data lake acceleration use cases where freshness and response time both matter.

SQL-Embedded Search and AI Operators

DataFusion's UDF and table-function model allows platforms to expose search and AI capabilities inside SQL. Managed systems can register these functions consistently and enforce governance controls around model access, execution cost, and tenancy.
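As a rough analogy, the sketch below registers a function per tenant using SQLite's `create_function` from the Python standard library. It is not DataFusion's UDF API; the `keyword_score` function, the `ALLOWED_FUNCTIONS` policy table, and the tenant names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical governance policy: which SQL functions each tenant may call.
ALLOWED_FUNCTIONS = {"tenant_a": {"keyword_score"}, "tenant_b": set()}


def keyword_score(text, keyword):
    """Toy search operator: count whole-word occurrences of keyword in text."""
    words = text.lower().split()
    return words.count(keyword.lower())


def connect_for_tenant(tenant):
    """Open a session and register only the functions the tenant is allowed to use."""
    conn = sqlite3.connect(":memory:")
    if "keyword_score" in ALLOWED_FUNCTIONS.get(tenant, set()):
        conn.create_function("keyword_score", 2, keyword_score)
    return conn
```

A session for `tenant_a` can call `keyword_score(...)` inside any SQL statement, while the same query in a `tenant_b` session fails because the function was never registered. Enforcing the policy at registration time means it applies to every query path.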

This is how teams combine hybrid SQL search and LLM inference with federated data access inside a single query workflow.

How DataFusion Enables the Managed Model

The Apache DataFusion architecture makes this operational model practical because its core interfaces are designed to be extended.

Query Pipeline

DataFusion executes a stable pipeline:

SQL -> AST -> Logical Plan -> Optimizer -> Physical Plan -> Execution -> Arrow results

Managed platforms can attach custom behavior at each stage, including:

Table providers for source registration.
Analyzer and optimizer rules for federated rewrites.
Execution operators for fallback, schema casting, and runtime policies.
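To make the rule hook concrete, here is a toy rewrite pass in Python. It is a sketch under stated assumptions, not DataFusion's plan representation: `ScanNode`, `FilterNode`, and `optimize` are hypothetical names, and the single rule shown folds a filter into the scan beneath it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanNode:
    table: str
    predicate: Optional[str] = None  # filter evaluated by the source, if any


@dataclass(frozen=True)
class FilterNode:
    predicate: str
    child: object


def push_filter_into_scan(plan):
    """Rewrite FilterNode(ScanNode) into a ScanNode that carries the predicate."""
    if isinstance(plan, FilterNode):
        child = push_filter_into_scan(plan.child)
        if isinstance(child, ScanNode) and child.predicate is None:
            return ScanNode(child.table, plan.predicate)
        return FilterNode(plan.predicate, child)
    return plan


def optimize(plan, rules):
    """Apply each registered rule in order, as a platform-level rule list would."""
    for rule in rules:
        plan = rule(plan)
    return plan
```

Because the rules are plain entries in a list, a platform can add, remove, or reorder them centrally without changing the queries that applications send.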

Extension Points That Matter in Production

DataFusion's TableProvider, AnalyzerRule, OptimizerRule, and ExecutionPlan traits, together with its user-defined function (UDF) interfaces, are the primary hooks for managed behavior.

This is reflected in Spice's implementation approach described in How we use Apache DataFusion at Spice AI, where federation, acceleration, search, and AI behaviors are implemented in the planner and runtime rather than bolted on externally.

Managed Execution Topology

The architecture below shows a common managed pattern for federated SQL at scale.

Managed vs Self-Managed DataFusion

Both approaches can work. The decision depends on team shape, operational maturity, and workload requirements.

Dimension	Self-managed DataFusion	Managed DataFusion
Initial flexibility	Highest, full custom control	High, within platform extension model
Time to production	Slower for most teams	Faster for most teams
Connector operations	Built and maintained in-house	Centralized and standardized
Federated optimizer behavior	Team-specific implementation	Platform-level governance
Runtime reliability features	Team builds fallback and recovery	Included as managed capabilities
Multi-tenant controls	Custom policy implementation	Built-in policy and routing patterns

Teams that already run a mature query platform may prefer full control. Teams focused on product delivery often prefer managed execution so data access does not become a long-running infrastructure project.

Advanced Topics

Dialect Translation and Function Rewrites

Federated SQL is not only about connectivity. It requires dialect-aware plan rewriting. A single logical expression may need source-specific SQL forms to run correctly across PostgreSQL, Snowflake, and other engines. Managed platforms maintain these translation layers as part of runtime compatibility.

This is also where function mapping matters. For example, the equivalents of random, regexp, or distance functions differ in name, and sometimes in semantics, from one backend to another. Keeping those mappings in a managed planner prevents each team from maintaining its own translations that gradually diverge.
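A function mapping layer can be as simple as a per-dialect template table. The Python sketch below is illustrative: the `FUNCTION_TEMPLATES` table and `render_call` helper are hypothetical, and MySQL is used here only as a second example dialect. A real planner would cover many more functions and verify semantic differences such as case sensitivity.

```python
# Illustrative mapping only: logical function name -> SQL template per dialect.
FUNCTION_TEMPLATES = {
    "random": {"postgres": "random()", "mysql": "RAND()"},
    "regexp_match": {"postgres": "({0} ~ {1})", "mysql": "({0} REGEXP {1})"},
}


def render_call(name, dialect, *args):
    """Render a logical function call as SQL text for one target dialect."""
    try:
        template = FUNCTION_TEMPLATES[name][dialect]
    except KeyError:
        # No known translation: the planner should keep this expression local.
        raise NotImplementedError(
            f"{name} has no mapping for {dialect}; evaluate locally"
        ) from None
    return template.format(*args)
```

For example, `render_call("regexp_match", "postgres", "name", "'^a'")` yields `(name ~ '^a')`, while a dialect with no entry raises an error that a planner could treat as a signal to evaluate the expression locally instead of pushing it down.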

Multi-Source Query Splitting

When a query joins data from multiple systems, managed federation planners usually split the query into per-source subqueries, push source-compatible operations down, and run cross-source join or union stages locally. The quality of this split directly affects network cost and latency.
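The splitting decision can be sketched as a lookup against a catalog of table ownership. This Python example is a simplified assumption, not a real federation planner: the `CATALOG` contents and the `split_joins` helper are hypothetical, and joins are reduced to pairs of table names.

```python
from collections import defaultdict

# Hypothetical catalog: table name -> source system that owns it.
CATALOG = {"orders": "postgres", "payments": "postgres", "events": "s3"}


def split_joins(joins):
    """Split (left_table, right_table) joins into per-source and local joins.

    Returns (pushed, local): pushed maps each source to the joins it can run
    itself; local lists the cross-source joins the engine must run.
    """
    pushed = defaultdict(list)
    local = []
    for left, right in joins:
        if CATALOG[left] == CATALOG[right]:
            pushed[CATALOG[left]].append((left, right))
        else:
            local.append((left, right))
    return dict(pushed), local
```

Here a join between `orders` and `payments` is sent to PostgreSQL as one subquery, while a join between `orders` and `events` crosses sources and stays local. Returning both sides of the split makes the pushdown boundary explicit.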

A mature managed implementation tracks pushdown boundaries explicitly, so teams can reason about where compute occurred and tune policies over time.

Runtime Reliability Patterns

Reliable federated execution requires more than successful planning. Managed DataFusion systems generally include deferred connection handling, schema-cast operators, and fallback operators that keep query behavior predictable when sources are slow or temporarily unavailable.
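These three patterns compose naturally. The Python sketch below shows one possible shape, with every name (`DeferredSource`, `cast_rows`, `scan_with_fallback`) being a hypothetical stand-in rather than a DataFusion or Spice operator.

```python
class DeferredSource:
    """Connects on first scan, so an unavailable source does not block startup."""

    def __init__(self, connect):
        self._connect = connect  # callable returning a zero-argument scan function
        self._scan_fn = None

    def scan(self):
        if self._scan_fn is None:
            self._scan_fn = self._connect()
        return self._scan_fn()


def cast_rows(rows, schema):
    """Schema cast: coerce each row dict to the declared column types."""
    return [{col: typ(row[col]) for col, typ in schema.items()} for row in rows]


def scan_with_fallback(primary, fallback, schema):
    """Try the primary source; on a connection failure, read from the fallback."""
    try:
        rows = primary.scan()
    except (ConnectionError, TimeoutError):
        rows = fallback.scan()
    # Cast after the fallback decision so both paths return the same schema.
    return cast_rows(rows, schema)
```

Applying the cast after the fallback decision is the design point worth noting: callers see the same column types whichever source actually served the rows.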

These patterns reduce the operational blast radius of transient source failures and help maintain service-level objectives for applications and agents.

Managed Apache DataFusion with Spice

Spice uses Apache DataFusion as the query core and extends it with connector table providers, federated planning rules, acceleration controls, and SQL operators for search and AI workloads. This enables one query layer across 30+ integrations while preserving flexibility around source pushdown and local execution.

For teams evaluating this model, start with SQL federation and acceleration, hybrid SQL search, LLM inference, and Spice pricing.

Managed Apache DataFusion FAQ

What is managed Apache DataFusion?

Managed Apache DataFusion is an operational model where DataFusion provides SQL planning and execution while a platform manages connectors, federation rules, acceleration policy, reliability controls, and tenant-aware runtime operations.

How is managed DataFusion different from running DataFusion directly?

Running DataFusion directly gives maximum implementation control but requires teams to build and operate source adapters, optimizer policies, and runtime safeguards themselves. Managed DataFusion centralizes those responsibilities so application teams can focus on product workloads instead of engine operations.

Can managed DataFusion query multiple systems in one SQL statement?

Yes. Managed DataFusion platforms typically use custom table providers and federation analyzers to split cross-source plans into per-source subqueries, push down compatible operations, and combine results locally when needed.

How does managed DataFusion handle freshness and low latency together?

Most managed implementations combine source federation with local acceleration and configurable refresh policies. Hot datasets can be served from local acceleration while freshness is maintained through scheduled or CDC-based refresh modes.

When should a team choose managed DataFusion?

Managed DataFusion is typically the better fit when teams need federated SQL across many systems, multi-tenant controls, and reliable production behavior without building a query platform from scratch.

Learn more about managed DataFusion

Explore technical resources on Apache DataFusion, federation planning, and production runtime patterns.

Docs

Spice.ai OSS Documentation

Reference architecture and runtime configuration for federation, acceleration, and AI-enabled SQL workflows.

Blog

How we use Apache DataFusion at Spice AI

A technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs.

Blog

Contribution of TableProviders to DataFusion

How Spice contributes table providers to Apache DataFusion and why connector abstractions matter for federated SQL.

Talk to an engineer

See Spice in action

Walk through your use case with an engineer and see how Spice handles federation, acceleration, and AI integration for production workloads.
