Localhost Latency at Scale: The Spice Cluster-Sidecar Architecture


Luke Kim

Founder and CEO of Spice AI
April 21, 2026
Spice cluster-sidecar architecture for applications, services, and AI agents

TL;DR: Any application, service, or AI agent that needs high-performance, low-latency access to large-scale operational data faces the same three challenges: low-latency retrieval (SQL, full-text, vector), a safe blast radius so a misbehaving workload can't take down a database or access data it shouldn't see, and enough compute behind it to answer the hard questions. No single deployment model delivers all three. The Spice cluster-sidecar (hybrid) architecture does: a lightweight Spice sidecar runs inside each application pod and serves query, search, and LLM inference on localhost from a scoped working set (acting as a sandbox between the application and the underlying data systems), while a central Spice cluster (self-managed or Spice Cloud) handles ingestion, Cayenne acceleration, Ballista-powered distributed execution, hybrid search indexing, and refresh. The application sees one endpoint on localhost. Spice transparently decides whether to serve locally, delegate to the cluster, or return a cached result. Databases, data lakes, and CDC streams never see the application directly. This is especially powerful for AI agents, where autonomous query generation makes sandboxing critical, but the architecture benefits any workload that needs fast, safe access to data at scale.

The problem: applications need fast, safe, distributed access to data

Any application, service, or AI agent that queries operational data at scale puts pressure on three things at once. AI agents make these challenges acute because they write their own queries, but the problems exist for any workload that needs low-latency access to large datasets.

  1. Latency on the retrieval path. Every query that feeds a user-facing response (RAG lookups, tool calls that read state, text-to-SQL, vector search, dashboard panels, API responses) adds its retrieval latency directly to the user's wait time. A few hundred milliseconds of round-trips across a retrieval chain turns a one-second response into four. Applications want single-digit-millisecond answers regardless of query type: SQL, full-text, vector, or hybrid search.
  2. Blast radius. Giving any workload direct credentials to production Postgres, a data lake, or a warehouse means a bad plan, a runaway loop, or a misconfigured service can exhaust connection pools, scan petabytes, or touch rows it shouldn't. For AI agents, which write their own queries, the risk is amplified: a prompt injection or a bad planning step can generate arbitrary SQL. The retrieval layer needs to be a sandbox, not a passthrough.
  3. Occasional heavy queries. Most reads are narrow: a tenant's recent orders, the docs for one entity, the last 24 hours of events. But "summarize our churn trend over the last year across all regions" is one API call (or one prompt) away, and when it happens the system has to answer it without collapsing.

Traditional architectures force a choice.

A centralized query cluster handles the heavy analytical side, but every retrieval pays a network round trip, each additional replica adds load to the cluster, and every application holds credentials to the cluster (and often to the origin systems behind it). A sidecar-per-pod model gives applications localhost reads, but every sidecar independently ingests from source systems, and that doesn't scale once you have dozens or hundreds of replicas: the source database buckles under the connection count, and per-pod CDC multiplies cloud costs linearly with fleet size.

Teams typically end up stitching together a Redis or Memcached tier for retrieval, a vector database, a CDC pipeline, a materialization layer, a separate analytical warehouse, and an auth proxy in front of everything. Cache invalidation, schema drift, TTL tuning, and credential management become a permanent tax on the team. None of it solves the fundamental problem that the application is still talking to production systems, just through more layers.

The solution: an application-local data, search, and inference plane

The Spice cluster-sidecar architecture gives each application a complete data plane on localhost and keeps the data systems behind a single, centrally managed tier:

  • Application-local sidecars serve the hot path. Every application pod gets its own Spice sidecar on loopback. That sidecar answers SQL, full-text, vector, and hybrid search queries from a scoped working set, handles LLM inference and tool calls locally, and is the only data-plane endpoint the application knows about. The application never holds credentials to Postgres, S3, Snowflake, or Iceberg. It holds a token for its sidecar.
  • A centralized Spice cluster (self-managed, or the managed Spice Cloud Platform) is the only tier that talks to your data systems. It handles ingestion, Cayenne acceleration, refresh scheduling, hybrid search indexing, and distributed query execution powered by Apache Ballista.
  • Transparent delegation. When an application asks something the sidecar can't answer from its working set (a historical lookup, a cross-dataset join, a broad vector search), the sidecar forwards the query to the cluster over Arrow Flight (gRPC), streams the result back as Arrow record batches, and caches it for future reads. The application never knows delegation happened.

Think of it as a CDN for your data, with the sidecar as a sandbox in front of the application. The cluster is the origin server and the sidecars serve as edge nodes, with Spice handling routing, caching, and invalidation. Your origin data systems see traffic from the cluster only and never from the application fleet.

From the application's perspective, the entire data plane (SQL, search, vector, inference) is localhost, served through one endpoint with one wire format. From an infrastructure perspective, you get distributed query throughput and embedded-database latency, and a hard isolation boundary between the application and your data systems, without writing or operating ETL or an auth proxy.

Architecture at a glance

Applications only ever talk to their sidecar. Sidecars only ever talk to the cluster. The cluster is the only tier with credentials to your data systems.

The sidecar as a sandbox

The most important property of the sidecar isn't just latency: it's isolation. The sidecar is the only data-plane surface the application touches, and it's deliberately scoped so that a misbehaving workload can't escape. This is especially valuable for AI agents, where autonomous query generation makes the blast radius unpredictable, but the same isolation properties benefit any application that shouldn't have direct database credentials.

  • Scoped working set, not the whole warehouse. A sidecar's spicepod.yaml declares exactly which datasets, views, and search indices the application is allowed to query. Anything not declared simply doesn't exist from the application's perspective: not filtered by a policy, not hidden by a row-level rule, but physically absent from the catalog. Compare this with approaches like row-level security (RLS), where the underlying data is present and a single policy misconfiguration can silently expose it. A sidecar's isolation is structural: there is no rule to misconfigure because the data was never there. For AI agents, even a perfectly crafted prompt injection can't query a table that isn't in the catalog.
  • No origin credentials in the application. The application connects to its sidecar with a local token. The sidecar connects to the cluster over Arrow Flight (gRPC). The cluster holds the credentials for Postgres, Snowflake, Databricks, S3, Kafka, and the rest. Compromising an application pod cannot leak origin credentials, because the pod never had them.
  • Narrow network surface. The application's only outbound data dependency is the loopback interface. Network policy can (and should) pin the sidecar's egress to the Spice cluster endpoint only. No direct connectivity to databases, warehouses, or the public internet is required for the data plane.
  • Per-tenant / per-application data views. Sidecars can be specialized per application class or per tenant. A customer-service agent, a fraud-review agent, and an internal dashboard can each run pods with different spicepods pointing at different slices of the same cluster, without any code change to the application itself. This is physical tenant isolation, not policy-based filtering on a shared database. Each sidecar's catalog is a separate, bounded surface that can be reviewed, diffed, and tested independently.
  • Bounded resource use. A rogue query plan or a runaway loop exhausts the sidecar's local memory and CPU budget, not the cluster's. Delegated queries hit the cluster's multi-tenant fair-scheduling; they can't saturate Postgres. Policy-based isolation can't bound resource consumption this way: a poorly scoped query against a shared database can still exhaust connection pools or scan entire tables, even if the result set is filtered.
  • LLM inference stays local too. Sidecars can serve model inference and tool calls on loopback, so sensitive prompts, tool arguments, and retrieved context don't leave the pod for low-latency paths. Heavier models can still be routed to the cluster.
  • Auditability. Because every query flows through the sidecar, you get one well-known place to log, rate-limit, and enforce policy on what the application is actually doing, regardless of how it decided to ask.

The data flow is unambiguous: application -> sidecar -> (optionally) cluster -> origin. Data only moves back along that path; it never skips a tier. That's the security boundary that production systems need and don't typically get from a direct-to-database setup. For more on how Spice isolates agent workloads, see the secure AI agents use case.
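The structural nature of this isolation is worth sketching. The class below is a hypothetical stand-in for a sidecar's declared working set, not the runtime's actual catalog API; the point is that an undeclared table fails at name resolution, before any policy logic could run:

```python
class ScopedCatalog:
    """Sketch of a sidecar catalog: only datasets declared in its spicepod exist."""

    def __init__(self, declared_datasets):
        self._datasets = set(declared_datasets)

    def resolve(self, table):
        # Isolation is structural: an undeclared table fails name
        # resolution. There is no row-level rule to misconfigure,
        # because the data was never in this catalog to begin with.
        if table not in self._datasets:
            raise LookupError(f"table not found: {table}")
        return table

# A support-agent sidecar declares only its scoped working set.
catalog = ScopedCatalog({"orders", "tickets"})
assert catalog.resolve("orders") == "orders"

try:
    # Even a perfectly crafted injected query can't reach undeclared data.
    catalog.resolve("employee_salaries")
except LookupError as e:
    print(e)  # table not found: employee_salaries
```

Compare this with row-level security, where the failure mode is a policy that silently matches too much: here there is nothing to match against.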

Why split the tiers

Splitting ingestion, acceleration, and distributed compute from local caching is the other key design decision. It's what makes this architecture scale where naive alternatives don't. The cluster side is object-storage-native by design: coordination state, acceleration files, and failover metadata all persist to S3-compatible object storage rather than local disk or an external database. That means cluster nodes are stateless and recoverable, and the sidecar tier scales independently without coupling to the cluster's storage.

Sidecars stay lightweight

A sidecar's job is to hold a working set, answer queries from it, and forward everything else to the cluster. It leaves refresh, Cayenne acceleration builds, and long-lived connections to Postgres, S3, DynamoDB, and Kafka entirely to the cluster tier.

That means a sidecar:

  • Starts in seconds, important when application pods autoscale aggressively with traffic.
  • Runs on a few hundred megabytes of memory.
  • Scales 1:1 with application pods: go from 5 replicas to 50 and 50 sidecars come up automatically, each with the right scoped working set.
  • Doesn't add 50 new connections to your source database when the fleet scales out.

The cluster ingests once

Refreshing from upstream data sources, running Cayenne acceleration on large Iceberg tables, and keeping CDC streams (like DynamoDB Streams, Postgres logical replication, Debezium, or Kafka consumers) connected are all resource-intensive operations. Doing any of that N times for N sidecars is wasteful and often infeasible. Many source systems have hard connection limits, and per-pod CDC multiplies cloud costs linearly with fleet size.

The cluster ingests each dataset once, producing one authoritative materialization that every sidecar gets a consistent view of. Source load is bounded by your cluster size, not your fleet size. Refreshes, backfills, and large initial loads all happen once on the cluster rather than N times across the fleet, and CDC-backed datasets stay in sync with bounded lag. The cluster is also the single place to reason about data freshness: if you need to know when a dataset was last refreshed from Postgres, the answer is the same for every sidecar.

This makes bootstrap efficient too. A new pod's sidecar pulls its working set from the cluster (which already has the data hot) rather than re-scanning the source. An alternative bootstrap mechanism is acceleration snapshots, where the cluster writes pre-built acceleration files to object storage and sidecars download them on startup (see also the DynamoDB Streams deep dive for an end-to-end example). Either way, new nodes are operational in seconds, not minutes.
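The bootstrap decision itself is simple. Here is a sketch of the logic a bootstrap_only sidecar follows on startup; the callables are illustrative stand-ins for the runtime's internals, not its actual API:

```python
import time

def bootstrap_working_set(download_snapshot, refresh_from_cluster):
    """Sketch of a sidecar's startup path under `snapshots: bootstrap_only`."""
    started = time.monotonic()
    try:
        # Fast path: pull the cluster's pre-built acceleration file
        # from object storage and open it directly.
        working_set = download_snapshot()
        source = "snapshot"
    except FileNotFoundError:
        # First deploy or object-storage blip: start empty and let the
        # next refresh cycle fill the working set from the cluster.
        working_set = []
        source = "empty"
    # Either way the sidecar serves within seconds; the regular refresh
    # keeps it converging on the cluster's materialization.
    refresh_from_cluster(working_set)
    return source, time.monotonic() - started

source, elapsed = bootstrap_working_set(
    download_snapshot=lambda: ["orders.db"],  # pretend the snapshot exists
    refresh_from_cluster=lambda ws: None,     # no-op stand-in
)
print(source)  # snapshot
```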

Query delegation uses Arrow Flight end-to-end

Sidecar-to-cluster communication is Arrow Flight over gRPC. Results flow as Arrow record batches directly into the sidecar's query engine with no serialization detour through JSON or row-based wire formats. Zero-copy materialization into Arrow is a core performance principle of Spice; Arrow Flight preserves that boundary across the network.

The sidecar decides locally whether a query can be served from its working set. If not, it forwards and streams results back. The application sees one endpoint and one query. It never knows or cares where execution happened, or whether Ballista fanned the query out across ten cluster nodes to answer it.

The cluster tier: Ballista + Spice Cayenne

The value of a sidecar is only as good as the cluster behind it. Two pieces of the Spice stack do most of the heavy lifting there: Apache Ballista for distributed query execution, and Spice Cayenne for scale-out acceleration.

Apache Ballista: distributed SQL execution

Spice's cluster mode uses Apache Ballista to execute DataFusion query plans across multiple nodes. When a sidecar delegates a query (say, a join across a 500-million-row orders table and a 10-billion-row events table), the cluster's scheduler splits the plan into stages, distributes them across executor nodes, shuffles intermediate results, and streams the final result back over Arrow Flight.

Upstream Ballista uses a single-scheduler architecture; Spice extends it for production with multi-active HA and object-storage-native persistence (detailed below). From the sidecar's perspective, delegation looks identical to talking to a single node. From the operator's perspective, the cluster scales horizontally (add executor nodes when analytical workload grows), and because everything is Arrow-native, shuffles move columnar data without row-by-row serialization. Delegated queries don't have a soft ceiling: a sidecar can safely hand off an expensive query because the cluster has the parallelism to answer it.
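As a rough mental model of staged, partitioned execution (the shape of the technique, not Ballista's actual scheduler), a distributed hash join splits each input by join key, shuffles partitions so matching keys land on the same executor, and joins each partition independently:

```python
from collections import defaultdict

def partition_by_key(rows, key, n_partitions):
    """Stage 1: hash-partition rows so equal keys land in the same partition."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    return parts

def join_partition(left, right, key):
    """Stage 2: each executor joins only its own partition."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    return [{**l, **r} for r in right for l in index[r[key]]]

orders = [{"id": 1, "user": "a"}, {"id": 2, "user": "b"}]
events = [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}]

# "Shuffle": align matching partitions, then fan the joins out across executors.
n = 4
results = []
for lpart, rpart in zip(partition_by_key(orders, "id", n),
                        partition_by_key(events, "id", n)):
    results.extend(join_partition(lpart, rpart, "id"))

print(len(results))  # 2 joined rows, regardless of partition count
```

Because partitions are independent after the shuffle, adding executor nodes adds parallelism without changing the result, which is why delegated queries scale with cluster size.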

Production characteristics:

  • Multi-active schedulers, no single point of failure. Multiple scheduler instances run concurrently; any can handle any query. Failover is automatic and doesn't require a separate consensus service.
  • mTLS for inter-node communication. Scheduler-to-executor communication within the cluster is mutually authenticated and encrypted on the wire.
  • Object-storage-native persistence. Cluster state, acceleration snapshots, and Cayenne files persist to S3-compatible object storage. Nodes are stateless and recoverable; a restarted node comes back online in seconds without re-ingesting from source.
  • Spice Kubernetes Operator. Automated deployment, zero-downtime rolling upgrades, health checks, and Prometheus metrics out of the box.

This is the engine that lets Spice replace Spark-class workloads: the same distributed compute surface, but with full SQL, local acceleration, hybrid search, and LLM inference in one system instead of a stack of integrations.

Cayenne: acceleration that scales past 1 TB

For the cluster's accelerated datasets, Spice uses Cayenne, an acceleration engine built on Vortex, a next-generation open-source columnar format from the Linux Foundation.

Vortex runs compute kernels directly on encoded data, so many predicates and projections execute without ever decompressing. When decompression is required, data lands directly in Arrow arrays with no intermediate copy. Compared to Parquet, Vortex delivers roughly 100x faster random access reads and 10-20x faster scans, which is exactly the access pattern an operational data lakehouse produces: lots of selective lookups, lots of segment pruning, hot repeated queries.

Cayenne pairs Vortex files with per-segment min/max/null-count statistics (zone-map equivalents) and fast random-access encodings like FSST for strings, FastLanes for integers, and ALP for floats. The net effect is that the cluster can accelerate datasets well past 1 TB (comfortably beyond where DuckDB file mode tops out) while still answering point lookups and small range queries fast enough to be useful as a hot backend for sidecars. The entire acceleration tier is object-storage-native: Cayenne persists Vortex files to S3-compatible object storage (including S3 Express One Zone for single-digit-millisecond first-byte latency), so accelerated data is durable across cluster restarts and nodes are stateless and recoverable.
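Zone-map pruning is easy to sketch. Each segment carries min/max statistics; a selective predicate skips every segment whose range can't contain a match, so most files are never read. This is illustrative, not Cayenne's actual file layout:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    min_val: int
    max_val: int
    rows: list  # stand-in for a Vortex file's contents

def prune_and_scan(segments, lo, hi):
    """Answer `value BETWEEN lo AND hi` using per-segment min/max stats."""
    hits, scanned = [], 0
    for seg in segments:
        # Zone-map check: skip segments whose [min, max] range
        # cannot intersect the predicate at all.
        if seg.max_val < lo or seg.min_val > hi:
            continue
        scanned += 1
        hits.extend(v for v in seg.rows if lo <= v <= hi)
    return hits, scanned

segments = [Segment(0, 99, list(range(100))),
            Segment(100, 199, list(range(100, 200))),
            Segment(200, 299, list(range(200, 300)))]

hits, scanned = prune_and_scan(segments, 150, 160)
print(scanned)    # 1 -- two of three segments pruned without being read
print(len(hits))  # 11
```

On a real dataset the ratio is far more dramatic: a selective point lookup against a multi-terabyte table touches a handful of segments out of thousands.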

Sidecars don't need to run Cayenne themselves. They materialize a smaller working set into a lightweight engine (Arrow in-memory, DuckDB, or SQLite) and let the cluster handle the heavyweight acceleration tier. That's the right division of labor: Vortex and Cayenne where the data volume demands it, embedded engines where latency demands it.

Acceleration snapshots: single writer, many readers

The cluster-sidecar split maps directly onto Spice's acceleration snapshots feature. Snapshots let you persist a pre-built acceleration file to object storage (S3, GCS, or local filesystem) and reuse it on startup instead of refreshing from source, turning a minutes-long cold start into a seconds-long file download.

In the hybrid model, snapshots have a natural single-writer / multiple-reader topology:

  • The cluster is the single writer. It uses snapshots: create_only on each accelerated dataset. After every refresh (or on a configured interval), the cluster uploads a new snapshot of the acceleration file to object storage. The cluster never downloads snapshots on startup; it always refreshes from the authoritative source.
  • Each sidecar is a reader. It uses snapshots: bootstrap_only on its local DuckDB or SQLite acceleration. On startup (or when the sidecar's ephemeral NVMe is recycled), the sidecar downloads the most recent snapshot from object storage and is immediately ready to serve queries. The sidecar never writes snapshots back; the cluster is the single source of truth.

This avoids snapshot conflicts (multiple writers racing to upload), keeps the cluster as the authoritative refresh point, and gives every sidecar in the fleet fast, consistent bootstraps from the same materialization.

Diagram: data sources (PostgreSQL, S3, Kafka, …) feed the Spice cluster (single writer), which uploads a create_only snapshot to object storage (S3 / GCS) on each refresh (CDC / scheduled); sidecars A through N (bootstrap_only) download the latest snapshot on startup.

The cluster-side spicepod for a snapshot-enabled dataset looks like:

snapshots:
  enabled: true
  location: s3://my-bucket/spice-snapshots/
  params:
    s3_auth: iam_role

datasets:
  - from: postgres:public.orders
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      snapshots: create_only # Write snapshots, never download
      snapshots_trigger: refresh_complete
      snapshots_compaction: enabled # Compact before upload
      params:
        duckdb_file: /nvme/orders.db

The sidecar-side spicepod for the same dataset:

snapshots:
  enabled: true
  location: s3://my-bucket/spice-snapshots/
  bootstrap_on_failure_behavior: warn # Fall back to empty if no snapshot
  params:
    s3_auth: iam_role

datasets:
  - from: spice.ai/<your-org>/<your-app>/datasets/orders
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      snapshots: bootstrap_only # Download on startup, never write
      params:
        duckdb_file: /nvme/orders.db

On a fresh pod, the sidecar downloads the cluster's latest snapshot, opens the DuckDB file, and starts serving queries, all before the first refresh from the cluster completes. Subsequent refreshes pull incremental updates from the cluster as usual. If the snapshot is unavailable (first deploy, S3 blip), the sidecar falls back to an empty acceleration and catches up on the next refresh cycle.

This pattern is especially valuable on Kubernetes with ephemeral NVMe instance storage: pods lose their local disk on every restart, but the snapshot gives them a warm start without re-pulling the full dataset from the cluster. For CDC-backed datasets with large initial state, the difference between a snapshot bootstrap and a full re-sync can be the difference between a pod being ready in 5 seconds and 5 minutes.

Results caching: the third latency tier

Spice has one more lever that fits naturally into the hybrid model: the results cache.

Both the cluster and the sidecar have an in-memory LRU (or TinyLFU) results cache for SQL queries, search results, and embeddings. It's enabled by default on HTTP (/v1/sql, /v1/search) and Arrow Flight endpoints. On a cache hit, the response header reports Results-Cache-Status: HIT and the query doesn't re-execute for the lifetime of that cache entry.

That gives you three latency tiers on a single deployment:

  1. Sidecar results cache. Repeat queries against the sidecar return from cache in microseconds. No query execution, no local scan.
  2. Sidecar working set. Novel queries that can be answered from the locally materialized dataset execute on-node in single-digit milliseconds.
  3. Cluster delegation. Queries that exceed the working set are forwarded to the cluster, where Ballista runs them distributed and Cayenne accelerates the scans. The sidecar can then cache the result for subsequent reads, pulling them back into tier 1.

A few of the knobs matter in production and are worth calling out:

  • cache_key_type: plan vs sql. The default plan cache key uses the query's logical plan, so semantically equivalent queries share a cache entry even if the SQL text differs, which matters for ORM-generated queries where whitespace and column order drift. sql is a faster lookup but string-exact.
  • Spice-Cache-Key header. When your application knows two queries should share a result (for example, a templated query rendered two ways), it can supply an explicit cache key and bypass the plan hash altogether.
  • Stale-while-revalidate. Setting stale_while_revalidate_ttl lets the sidecar serve a stale cached result immediately (with Results-Cache-Status: STALE) while a background task refreshes the entry. For user-facing latency this is usually the right default once you're willing to accept a bounded freshness window.
  • Cache-Control directives. Clients can opt in per-request: no-cache to skip the cache, only-if-cached to fail on miss (useful for read-your-writes gating), stale-if-error=600 to serve cached results for up to 10 minutes if the fresh fetch fails. Works over HTTP and Arrow FlightSQL.
  • encoding: zstd. Compressing cache entries with zstd typically cuts memory use by 50-90%, letting the cluster's results cache hold substantially more distinct queries at the same footprint.
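The stale-while-revalidate behavior in particular is worth internalizing. A minimal sketch of the state machine, with illustrative names and an inline refresh where the real runtime uses a background task:

```python
import time

class SWRCache:
    """Minimal stale-while-revalidate results cache (sketch).

    Entries are fresh for `ttl` seconds, then servable-but-stale for a
    further `swr_ttl` seconds while the entry is refreshed.
    """

    def __init__(self, ttl, swr_ttl):
        self.ttl, self.swr_ttl = ttl, swr_ttl
        self._entries = {}  # key -> (value, stored_at)

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.ttl:
                return value, "HIT"
            if age <= self.ttl + self.swr_ttl:
                # Serve the stale value immediately; a real runtime
                # refreshes in the background instead of inline.
                self._entries[key] = (compute(), now)
                return value, "STALE"
        value = compute()
        self._entries[key] = (value, now)
        return value, "MISS"

cache = SWRCache(ttl=60, swr_ttl=30)
cache.get("q1", lambda: "result")         # first ask: ("result", "MISS")
print(cache.get("q1", lambda: "result"))  # repeat ask: ('result', 'HIT')
```

The user-visible effect is that once an entry exists, no request ever blocks on recomputation until the stale window is exhausted.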

Results caching is where the tiers collapse for the application. A sidecar miss becomes a cluster call, and that cluster result becomes a sidecar cache entry, so subsequent reads from any pod in the Deployment return in microseconds. The cluster's own cache amplifies this further: with many sidecars delegating overlapping queries, one sidecar's miss becomes a cluster cache hit for every subsequent sidecar. The application-visible p50 settles at tier 1, p95 at tier 2, and only tail latency touches tier 3.

Engineering decisions

Three decisions shape how this architecture behaves in production.

1. Declarative sidecar configuration

Every sidecar is configured through a spicepod.yaml that declares the datasets, views, acceleration engines, search indices, and models it manages. That's the same declarative model the cluster uses. There's no imperative "register this table at startup" API, and no coordination service that sidecars check in with at boot.

This matters because sidecars are cattle, not pets. A new pod gets its sidecar from the same manifest as every other pod. If you need to change what a sidecar materializes, you change the spicepod and roll the deployment. Every pod converges on the same manifest, eliminating drift and special cases.

It also means the sidecar is safe to treat as part of the application deployment artifact: versioned, reviewed, rolled out with the same process as the service it sits next to.

2. Cache coherency is a refresh policy, not a protocol

Sidecars pull from the cluster on a configurable interval using append or full refresh strategies. They do not participate in a distributed invalidation protocol. This is deliberate.

Distributed cache invalidation is hard, and the failure modes are worse than stale data for most workloads. A pull-based model with explicit refresh intervals makes staleness bounded, predictable, and debuggable. Teams pick the interval that matches their freshness requirements (seconds for configuration-like data, minutes for analytical rollups), and the system stays simple.

For workloads that need sub-second freshness, the cluster consumes CDC streams (DynamoDB Streams, Debezium, Kafka) and the sidecars pull the resulting accelerated dataset on a short interval. That gives you near-real-time propagation without a fleet-wide invalidation bus.
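The staleness bound is easy to state: the worst case a sidecar can observe is the sum of the stages a change passes through. A back-of-the-envelope helper, with illustrative numbers:

```python
def worst_case_staleness(cdc_lag_s, cluster_refresh_s, sidecar_pull_interval_s):
    """Upper bound on how old a sidecar-served row can be, in seconds.

    A change traverses: source -> CDC stream -> cluster materialization
    -> sidecar pull. Each stage adds its own bounded lag.
    """
    return cdc_lag_s + cluster_refresh_s + sidecar_pull_interval_s

# CDC lag ~1s, cluster applies changes within ~2s, sidecars pull every 10s:
print(worst_case_staleness(1, 2, 10))  # 13 -- bounded and predictable
```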

3. Resilience through local state

If the cluster is temporarily unavailable (a rolling upgrade, a network blip, a zone event), sidecars keep serving cached data and their accelerated working sets. Refreshes pause and resume when connectivity returns. Applications don't fail because the central tier is briefly unreachable; they just see slightly staler data.

This is a significant operational property. It means the blast radius of a cluster incident is "retrieved context gets slightly stale" rather than "every application goes down." And because the sidecar is the only endpoint the application ever talked to, there's no fallback logic to write in the application itself.

Putting it together: a request path

Let's walk through a request from the sidecar's perspective, using an AI agent as the example (the flow is the same for any application):

  1. The agent needs context to answer a user message. It calls its sidecar on localhost:8090 with a tool call: SELECT ... FROM orders WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT 20.
  2. The sidecar checks its results cache. Hit → return in microseconds, header Results-Cache-Status: HIT. The agent uses the result as grounding context and proceeds to the LLM call.
  3. On a miss, the sidecar plans the query against its local catalog. orders is materialized locally (refreshed from the cluster every 10 seconds). The sidecar executes against DuckDB, returns in single-digit milliseconds, and populates its results cache. The origin Postgres sees no traffic.
  4. The agent follows up with a hybrid search for "find similar past tickets", a combined full-text + vector query over a multi-gigabyte tickets index.
  5. The sidecar's catalog shows tickets as a cluster-resident dataset with no local materialization. It opens an Arrow Flight stream to the cluster. Ballista distributes the search across executor nodes; Cayenne's segment statistics prune most Vortex files.
  6. Results stream back as Arrow record batches. The sidecar materializes them, returns to the agent, and caches the result with the configured TTL. The agent passes the retrieved context into the LLM call.
  7. The agent then asks the sidecar for an LLM inference with tool use. The sidecar serves a small local model on loopback for the routing step and delegates the large-model call to the cluster's inference pool.
  8. The cluster independently caches its own query and inference results. The next replica that asks the same question gets it from the cluster cache. No re-execution, no re-inference.
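The routing in steps 2 through 5 above can be sketched as a three-tier decision. All names here are illustrative; the real sidecar plans the query against its catalog rather than checking a set:

```python
from collections import namedtuple

Query = namedtuple("Query", ["sql", "table"])

def serve(query, results_cache, local_datasets, delegate):
    """Tiered routing: results cache -> local working set -> cluster."""
    if query in results_cache:
        # Tier 1: repeat query, no execution at all.
        return results_cache[query], "cache"
    if query.table in local_datasets:
        # Tier 2: answerable from the locally materialized working set.
        result, tier = f"local:{query.table}", "local"
    else:
        # Tier 3: delegate to the cluster; the application can't tell.
        result, tier = delegate(query), "cluster"
    results_cache[query] = result  # future repeats return from tier 1
    return result, tier

cache, local = {}, {"orders"}
q = Query("SELECT ...", "tickets")
print(serve(q, cache, local, lambda q: f"cluster:{q.table}"))  # ('cluster:tickets', 'cluster')
print(serve(q, cache, local, lambda q: f"cluster:{q.table}"))  # ('cluster:tickets', 'cache')
```

Note how a delegated result is immediately cached locally: the expensive path runs once and every repeat lands in tier 1.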

Every step is Arrow-native end to end. The application sees one endpoint, one wire format, and one latency distribution. What's actually happening underneath is a coordinated dance between a sandboxed local engine, three caching tiers, a distributed executor, an inference pool, and a columnar accelerator.

When to use it

The cluster-sidecar architecture is the right fit when:

  • You're running AI agents, LLM-backed features, or data-intensive services and want fast, unified retrieval (SQL, full-text, vector, hybrid) plus inference on localhost, without giving the application direct database credentials.
  • You need a clear isolation boundary between application code and your data systems, including scoped catalogs, policy enforcement, and one audit point per pod. This is especially critical for AI agents, which generate their own queries autonomously.
  • Multiple application or agent replicas need fast access to the same datasets and you want one ingestion path, not N.
  • You want to shield upstream data sources from application query volume: the cluster ingests once, sidecars sandbox the application, and results caching deduplicates repeat work.
  • Workloads span real-time operational retrieval and large-scale analytics on the same underlying data: the operational data lakehouse on S3 / Iceberg pattern.
  • You're running on Kubernetes and already use the sidecar pattern for other concerns (service mesh, logging, config).

It's not the right fit when:

  • You have a single application instance and no multi-tenant or isolation requirement: a standalone Sidecar deployment is simpler.
  • All queries are batch or analytical with relaxed latency: a Microservice deployment is enough.
  • Network connectivity between sidecars and cluster is unreliable: delegation needs a working path back to the origin.

A concrete example: a multi-tenant agent platform

The architecture applies to any multi-replica deployment, but AI agents are where the sandboxing properties shine brightest. Here's a concrete example. For a deeper look at multi-tenant agent isolation, see Multi-Tenancy for AI Agents Without Pipelines.

A multi-tenant SaaS platform runs an AI support agent. Each tenant gets a dedicated set of agent pods. Every agent pod has a Spice sidecar that:

  • Materializes the tenant's working set (recent tickets, active customer records, the last 7 days of events, the tenant's private knowledge-base embeddings) into a local DuckDB + vector index.
  • Exposes exactly those datasets, and no others, to the agent through a scoped spicepod.
  • Serves the routing-model LLM call locally on loopback; delegates the large-model call to the cluster.
  • Holds no credentials for Postgres, S3, Snowflake, or the tenant-shared embedding store.

Agent turns hit the sidecar on localhost and return in single-digit milliseconds; repeat retrieval calls return from the sidecar's results cache in microseconds. The user feels a single-pass response instead of a retrieval-stall-then-answer.

Behind the sidecars, a Spice cluster (in this case, Spice Cloud) ingests from PostgreSQL (via CDC), S3 (Iceberg tables), and Databricks. Cayenne acceleration builds run on the cluster on a schedule, with the larger tables (a multi-billion-row events table and a 2-TB Iceberg knowledge corpus) materialized into Vortex files on NVMe-backed nodes (with S3 Express One Zone as the persistent tier). The cluster also runs the shared large-model inference pool.

When an agent asks a broader question ("summarize this tenant's churn signal across the last 12 months"), the sidecar recognizes that the query exceeds its working set and delegates over Arrow Flight. Ballista plans and distributes it across cluster executors. Cayenne's segment statistics prune most of the 2 TB away. The result streams back as Arrow batches. The sidecar caches it with a 1-minute TTL and a 30-second stale-while-revalidate window. The next ten agent turns for that tenant return in microseconds; the eleventh triggers a background refresh without blocking the user. The same query from a different tenant's agent benefits from the cluster's own results cache, so a 2-TB scan happens at most once per freshness window across the whole fleet.

Meanwhile, the origin Postgres sees exactly one consumer (the cluster) and the agent pods have exactly one outbound data dependency (their own sidecar). If a replica is compromised, the attacker gets a loopback endpoint scoped to that tenant's working set, not database credentials and not a query interface to the whole warehouse.

The same pattern (single ingestion path, per-pod sandboxing, tiered latency, one isolation boundary) works for any application fleet, not just agents: microservices serving real-time dashboards, services powering search, or internal tools querying operational data.

A minimal spicepod illustrating the pattern

The cluster side materializes and accelerates a large dataset with Cayenne, consumes CDC from Postgres, and exposes a results cache:

version: v1
kind: Spicepod
name: platform-cluster

runtime:
  caching:
    sql_results:
      enabled: true
      cache_max_size: 4GiB
      item_ttl: 1m
      stale_while_revalidate_ttl: 30s
      encoding: zstd

datasets:
  - from: postgres:public.orders
    name: orders
    acceleration:
      enabled: true
      engine: cayenne
      mode: file
      refresh_mode: changes
      primary_key: id
      on_conflict:
        id: upsert
      params:
        cayenne_compression_strategy: btrblocks
        cayenne_footer_cache_mb: 512
        cayenne_segment_cache_mb: 1024

  - from: s3://lakehouse/events/
    name: events
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: cayenne
      mode: file
      refresh_mode: append
      refresh_check_interval: 15m
      params:
        sort_columns: event_time,user_id

The sidecar side is much smaller. It pulls from the cluster, keeps a working-set engine local, and caches results:

version: v1
kind: Spicepod
name: app-sidecar

runtime:
  caching:
    sql_results:
      enabled: true
      cache_max_size: 256MiB
      item_ttl: 30s
      stale_while_revalidate_ttl: 30s

datasets:
  - from: spice.ai/<your-org>/<your-app>/datasets/orders
    name: orders
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 10s

  - from: spice.ai/<your-org>/<your-app>/datasets/events
    name: events
    # No local acceleration - delegate to cluster on demand.
    # Queries that match recent hot events still hit the results cache.

Two manifests. Everything else (shuffle plumbing, Cayenne file layout, CDC checkpointing, results cache keying, Arrow Flight transport) is handled by Spice.

The Spice Cloud hybrid model

The cluster-sidecar architecture gives you optionality not just in where each tier runs, but in who operates it. The tier that benefits most from being managed is the cluster. Scheduler tuning, capacity planning, failover drills, upgrade choreography, and observability for a distributed SQL engine are real operational commitments. The tier that benefits most from staying close to your application is the sidecar. That's the whole point of loopback latency.

Spice Cloud splits those responsibilities:

  • Spice Cloud operates the cluster: a fully managed, multi-node Spice cluster with high-availability distributed query, Cayenne acceleration, hybrid search, and LLM inference. SOC 2 Type II, automatic scaling, built-in monitoring, and enterprise SLAs, without running a scheduler or executor node yourself.
  • Your sidecars run where your applications live: in your Kubernetes clusters, VPCs, on-premises data centers, or edge locations. Each sidecar connects to the Spice Cloud cluster over an encrypted channel (data is encrypted in transit and at rest) and transparently delegates queries it can't serve from its local working set. The heavy compute (terabyte-scale scans, distributed joins, cross-dataset search, large aggregations) runs on managed infrastructure. The latency-sensitive reads stay on localhost inside your security perimeter. Your data stays in your object storage; Spice Cloud makes it fast and queryable.

For teams that want the operational properties of a managed distributed engine without giving up data locality or the ability to keep sidecars inside a regulated environment, this is usually the right topology. It's also incremental: start with sidecars pointed at Spice Cloud, and scale the cluster tier up or down as analytical load changes, without touching the application side.

Self-hosted at enterprise scale: Spice.ai Enterprise

For organizations that need to run the whole cluster-sidecar stack inside their own environment (regulated industries, air-gapped networks, on-prem data centers, sovereign cloud), Spice.ai Enterprise is the self-hosted, production-grade distribution of Spice designed to manage and scale an application-serving fleet.

Enterprise extends the open-source runtime with the pieces a platform team needs to run Spice as a shared service across an organization:

  • Kubernetes Operator for management and scale. Automated deployment, autoscaling, zero-downtime rolling upgrades, and lifecycle management of sidecars and clusters through pod annotations and CRDs. Declarative scale-up and scale-down of the application fleet, the cluster tier, or both, without hand-operating scheduler or executor nodes.
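As a sketch of what annotation-driven injection looks like, a pod might opt in to a sidecar along these lines. The annotation keys below are hypothetical placeholders, not the Operator's documented API; consult the Spice.ai Enterprise documentation for the actual names:

```yaml
# Illustrative only: the spice.ai/* annotation keys are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: support-agent
  annotations:
    spice.ai/inject-sidecar: "true"   # hypothetical: request sidecar injection
    spice.ai/spicepod: app-sidecar    # hypothetical: which spicepod the sidecar loads
spec:
  containers:
    - name: agent
      image: example.com/support-agent:latest
```

The point of the pattern is that the application team declares intent on the pod, and the Operator owns sidecar lifecycle, upgrades, and scale.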
  • Multi-active HA clustering. Multi-node distributed query on Apache Ballista with automatic failover, mTLS between cluster nodes, and auto-provisioned cluster certificates.
  • Authentication: OIDC, API keys, identity SQL functions. Applications and agents authenticate with OIDC bearer tokens or scoped API keys. Identity-aware SQL functions make the authenticated principal queryable inside policies and views, which is the foundation for per-tenant and per-application row-level scoping.
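As an illustration of identity-aware scoping, per-tenant row-level access can be expressed as a view over an identity function. The function name `spice_identity_claim(...)` below is a placeholder standing in for the runtime's actual identity SQL functions; it is shown only to illustrate the pattern:

```yaml
views:
  - name: my_tickets
    # Hypothetical identity function: substitute the actual identity
    # SQL function from the Enterprise documentation. Each principal
    # sees only rows matching its authenticated tenant claim.
    sql: |
      SELECT * FROM tickets
      WHERE tenant_id = spice_identity_claim('tenant_id')
```

Exposing `my_tickets` (and not `tickets`) in a tenant's spicepod combines structural scoping with row-level scoping in one place.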
  • Policy and authorization. Enforce per-dataset, per-view, and row-/column-level access rules centrally. The sidecar is the one place each application's requests flow through, which makes policy both simple to express and impossible to bypass.
  • Audit logging. Every query, search, and inference call is logged with identity, dataset, and policy decision. You get one canonical audit trail of what every application or agent asked, what it was allowed to see, and what was returned, which is what compliance actually wants to see.
  • Enterprise distributions. NAS (SMB/NFS), CUDA GPU-accelerated inference, data-only, and ODBC-connector builds, so the same cluster-sidecar architecture runs on the hardware and networks you already have.
  • Operational assurance. SOC 2 Type II, tiered security updates with up to 3 years of guaranteed patches, 99.9%+ uptime SLA, and 24/7 premium support via Slack Connect, email, and pager.

Enterprise adds the identity, policy, audit, and operator tooling a platform team needs to run the cluster-sidecar model at fleet scale on their own infrastructure.

Getting started

The hybrid architecture is documented at spiceai.org/docs/deployment/architectures/hybrid. It works with Spice.ai open source, Spice.ai Enterprise (self-hosted, with the K8s Operator, OIDC, policy, and audit logging described above), and the managed Spice Cloud Platform.

If you're running Spice on Kubernetes today and want to evolve from standalone sidecars or a centralized cluster into the hybrid model, the migration path is incremental: the same spicepod manifests work in both tiers, and you can move ingestion, Cayenne acceleration, and distributed query to the cluster one dataset at a time.

Ready to evaluate it on your own workload? See the Spice.ai platform overview, compare editions on pricing, or get a demo to talk through your architecture with the Spice team.

If you want to dig deeper, ask questions, or share what you're building, join the Spice community on Slack.


Cluster-Sidecar Architecture FAQ

What is the Spice cluster-sidecar architecture?

It is a hybrid deployment model where a lightweight Spice sidecar runs inside each application pod, serving SQL, search, and inference on localhost from a scoped working set, while a central Spice cluster handles ingestion, acceleration, distributed query execution, and refresh. The application sees one endpoint; Spice transparently decides whether to serve locally, delegate to the cluster, or return a cached result.

How is sidecar isolation different from row-level security (RLS)?

RLS filters data at query time. The underlying tables are still present, and a single policy misconfiguration can silently expose them. A sidecar's isolation is structural: its spicepod.yaml declares exactly which datasets exist in the catalog. Anything not declared is physically absent, so there is no rule to misconfigure. The sidecar also bounds resource consumption per pod, which policy-based approaches on a shared database cannot do.
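Concretely, a scoped spicepod like the sketch below (names illustrative) exposes a single dataset. No other table can be named in a query, because no other table exists in the sidecar's catalog:

```yaml
version: v1
kind: Spicepod
name: tenant-scoped-sidecar

datasets:
  # The catalog contains exactly this one dataset. There is no policy
  # rule to misconfigure and nothing to "fail open" to.
  - from: spice.ai/<your-org>/<your-app>/datasets/tickets
    name: tickets
```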

What happens when a sidecar can't answer a query locally?

The sidecar transparently forwards the query to the cluster over Arrow Flight (gRPC). The cluster executes it (potentially distributing it across multiple nodes with Apache Ballista) and streams the result back as Arrow record batches. The sidecar caches the result for subsequent reads. The application never knows delegation happened.

Does the application need credentials to the underlying data sources?

No. The application connects to its sidecar with a local token. The sidecar connects to the cluster. Only the cluster holds credentials for Postgres, Snowflake, S3, Kafka, and other data systems. Compromising an application pod cannot leak origin credentials because the pod never had them.

How do sidecars stay in sync with the cluster?

Sidecars pull from the cluster on a configurable interval using append or full refresh strategies. For CDC-backed datasets, the cluster consumes the change stream once and sidecars pull the resulting accelerated dataset on a short interval. Acceleration snapshots provide an alternative bootstrap path: the cluster writes pre-built files to object storage, and sidecars download them on startup for a warm start in seconds.

What happens if the cluster goes down?

Sidecars continue serving queries from their local working set and results cache. Refreshes pause and resume when connectivity returns. The blast radius of a cluster outage is slightly staler data, not application downtime. A stale-if-error cache directive can extend this further by serving last-known-good results for delegated queries.

Can I run the cluster as a managed service?

Yes. Spice Cloud is a fully managed cluster with high-availability distributed query, Cayenne acceleration, hybrid search, and LLM inference. Your sidecars run in your own infrastructure and connect to Spice Cloud for delegation. For organizations that need the full stack on-premises, Spice.ai Enterprise provides the same architecture with a Kubernetes Operator, OIDC authentication, policy enforcement, and audit logging.