Learn Data & AI

Understand the core technologies behind modern data and AI infrastructure. Each guide explains a key concept in depth: how it works, when to use it, and how it connects to the broader data stack.

Data Infrastructure

How to Do SQL Query Federation

Query multiple databases with a single SQL statement without moving data. Learn how to set up federated queries with predicate pushdown and acceleration.


What is Data Virtualization?

Access and combine data from multiple sources through a unified interface without replication. Learn how it compares to ETL and when to use it.


What is Data Acceleration?

Cache frequently accessed data locally for sub-second queries while keeping it fresh with CDC. Learn acceleration strategies and when to use them.


What is Zero-ETL?

Eliminate ETL pipelines by federating data in place and synchronizing acceleration caches with CDC. Learn the three zero-ETL patterns and when to use each.


How to Implement Change Data Capture

Track row-level database changes and stream them in real time. Learn how to implement log-based, trigger-based, and polling patterns for real-time pipelines.


Search

What is Hybrid Search?

Combine vector similarity with keyword matching for more accurate results. Learn about RRF, score fusion, and why hybrid search matters for RAG.


What is BM25 Full-Text Search?

BM25 is the standard ranking function for full-text search. Learn how BM25 scores documents, how inverted indexes work, and when keyword search should be paired with vector search.


What is Vector Search?

Find semantically similar content by comparing vector embeddings. Learn about ANN algorithms, distance metrics, and vector indexes.


What is Reciprocal Rank Fusion (RRF)?

The rank-based merging algorithm behind many hybrid search systems. Learn how RRF combines keyword and vector results without requiring score normalization.


AI & LLMs

What is RAG?

Retrieval-augmented generation (RAG) grounds LLM responses in real data at inference time. Learn the three-stage pipeline, production challenges, and hybrid search integration.


What is LLM Inference?

Understand how large language models generate responses. Learn about tokenization, KV caching, quantization, and latency optimization.


What is LLM Tool Calling?

LLMs emit structured function calls rather than free-form text to interact with external tools. Learn the tool calling loop, security considerations, and MCP.


What is the Model Context Protocol?

MCP standardizes how AI models discover and invoke external tools and data. Learn the client-server architecture and how gateways enable enterprise AI.


How to Use Text-to-SQL

Translate natural language questions into SQL queries using LLMs. Learn how to implement schema-aware generation and production safeguards.


What are Embeddings?

Dense vector representations that capture semantic meaning. Learn how embedding models work, how they enable semantic search and RAG, and how to choose the right model.


How to Connect AI Agents to Live Operational Data Without ETL

Practical architecture guide for connecting agents to live operational systems using federation, acceleration, and policy controls instead of batch ETL.


Isolated Data Environments for AI Agents

How to Give Each AI Agent Its Own Isolated Data Environment

Step-by-step guide to isolating data access per AI agent with scoped identities, runtime boundaries, and policy controls for safer production operations.


How to Sandbox Data Access for AI Agents

Learn how to sandbox AI agent retrieval paths with least-privilege access, query guardrails, output redaction, and policy-aware monitoring.


Reducing Data Lakehouse Costs for Agentic Workloads

How to Reduce Data Lakehouse Costs for Agentic Workloads

Practical framework for lowering data lakehouse cost in agentic systems by separating serving and analytics paths, tuning query classes, and applying acceleration.


Open-Source Technologies

What is Apache DataFusion?

An extensible SQL query engine written in Rust. Learn the architecture, how it compares to DuckDB and Trino, and how Spice extends it.


Managed Apache DataFusion: Federated SQL at Scale

Learn how teams run Apache DataFusion in production with managed federation, optimizer controls, acceleration policy, and tenant-aware operations.


What is Apache Ballista?

A distributed SQL query engine that scales DataFusion across multiple nodes. Learn the scheduler-executor architecture and how it compares to Spark.


What is Vortex?

A compressed columnar file format with adaptive encoding for fast analytical queries. Learn how it compares to Parquet and powers Spice Cayenne.


What is DuckDB?

An in-process analytical database designed for fast OLAP queries with zero external dependencies. Learn the columnar engine, vectorized execution, and how Spice uses it.


What is Apache Arrow?

A cross-language columnar in-memory data format for zero-copy analytics. Learn how Arrow enables high-speed data exchange and powers modern query engines.


What is Apache Iceberg?

An open table format for large analytic datasets with schema evolution, hidden partitioning, and time travel. Learn the architecture and how Spice queries Iceberg tables.


What is Delta Lake?

An open-source storage layer that brings ACID transactions to data lakes. Learn the transaction log architecture and how Spice federates Delta tables.


What is Tantivy?

A full-text search engine library written in Rust, inspired by Apache Lucene. Learn about inverted indexes, BM25 scoring, and how Spice embeds it for hybrid search.


Comparisons

SQL Federation vs ETL

Query data in place or move it to a warehouse? Compare federation and ETL across latency, freshness, cost, and operational complexity.


Full-Text Search vs Vector Search

Keyword matching or semantic similarity? Compare BM25 and vector search across accuracy, performance, and use cases, and learn when hybrid search wins.


RAG vs Fine-Tuning

Ground LLM responses with retrieved data or train the model directly? Compare RAG and fine-tuning across cost, freshness, accuracy, and implementation effort.


Data Virtualization vs Data Replication

Access data virtually or replicate it physically? Compare virtualization and replication across latency, consistency, cost, and when to combine both.


Sidecar vs Microservice Architecture

Deploy a data runtime alongside your app or as a shared service? Compare sidecar and microservice architectures across latency, scaling, resource usage, and operational complexity.


Apache DataFusion vs DuckDB

Both are fast in-process SQL engines built for analytical workloads. Compare architecture, extensibility, language bindings, and when each is the better fit.


Apache Iceberg vs Delta Lake

Open table formats with different metadata architectures and ecosystem strengths. Compare hidden partitioning, catalog support, and engine compatibility.


Comparing Data Federation Tools for AI Agents

An objective comparison of data federation tool categories for AI agents, including latency, governance, connector coverage, and deployment tradeoffs.


Best Alternatives to ETL Pipelines for AI Agents

Objective guide to ETL alternatives for AI agent workloads, including federation, CDC acceleration, streaming, and hybrid architecture tradeoffs.


Architecture

What is a Hybrid Data Architecture?

Combine sidecar caching with a centralized cluster for sub-millisecond reads and centralized data management. Learn the CDN-for-data pattern.


What is the Sidecar Pattern?

Co-locate a runtime alongside your application container for low-latency data access without network hops. Learn the sidecar pattern and when to use it.


What is a Data Substrate?

A co-located data and AI layer that provides applications with sub-millisecond access to any data source. Learn how a substrate differs from a warehouse, lake, or lakehouse.

