spiceai/docs


/docs/website/versioned_docs/version-2.0.x/features/data-ingestion/index.md


---
title: 'Data Ingestion'
sidebar_label: 'Data Ingestion'
description: 'Learn how to ingest data in Spice.'
sidebar_position: 6
pagination_prev: null
pagination_next: null
tags:
  - features
  - write
---

Data can be ingested into the Spice runtime using the following methods:

Acceleration Refresh Modes – Pull data from a source connector into a local accelerator using one of the standard refresh modes (full, append, changes, snapshot, caching). This is the most common ingestion path for keeping a local accelerator in sync with an upstream system.
SQL Statements – Write data directly to write-capable connectors using standard SQL INSERT (and, where supported, UPDATE/DELETE) syntax.
OpenTelemetry (OTEL) Ingestion – Stream OTEL metrics for real-time processing and acceleration.

Data ingestion is useful for scenarios such as keeping a local accelerator continuously in sync with an upstream database, collecting metrics from edge devices, writing application events for later analysis, or populating datasets from external sources.

Ingestion via Acceleration Refresh Modes

When a dataset is configured with acceleration.enabled: true, Spice ingests rows from the source connector into a local engine (Arrow, DuckDB, SQLite, PostgreSQL, or Cayenne). The refresh_mode controls how that ingestion happens.

| Refresh Mode | What it ingests | Typical source |
| --- | --- | --- |
| `full` | Replaces the accelerator's contents with a fresh read of the source on every refresh. | Slowly-changing reference tables; small lookup datasets. |
| `append` | Inserts only rows newer than the highest seen `time_column` value on each refresh. | Time-series, event/log data, append-only tables. |
| `changes` | Streams row-level inserts, updates, and deletes from a source CDC feed (PostgreSQL logical replication, DynamoDB Streams, MongoDB Change Streams, Debezium, Kafka, etc.). | Operational databases where you need a near real-time mirror of the source. |
| `snapshot` | Loads exclusively from an external snapshot store; no source reads. | Read-only replicas bootstrapped from a centralized snapshot, e.g. for fan-out reader fleets. |
| `caching` | Caches per-request HTTP/HTTPS responses on a read-through basis, with a TTL. | API search results or other request-keyed content fetched lazily. |
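
As an illustration of how a refresh mode is selected, here is a minimal sketch of an `append` dataset. The source table `public.events`, the dataset name, the `created_at` column, and the `refresh_check_interval` value are hypothetical. Connection parameters are omitted, and the exact field layout should be checked against the spicepod reference for your version.

```yaml
datasets:
  # Hypothetical source table; connection parameters are omitted.
  - from: postgres:public.events
    name: events
    time_column: created_at       # hypothetical timestamp column
    acceleration:
      enabled: true
      refresh_mode: append        # ingest only rows newer than the highest seen time_column
      refresh_check_interval: 10s # assumed polling interval for new rows
```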

For cross-cutting refresh behavior (refresh intervals, on-demand refresh, retries, retention, and zero-results handling), see Data Refresh.

Example: continuous CDC ingestion into an accelerator

This example uses PostgreSQL logical replication to ingest every INSERT, UPDATE, and DELETE from public.users into a local DuckDB accelerator with low latency.
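
A minimal spicepod sketch of this setup might look like the following. The `postgres:public.users` path syntax and the local dataset name are assumptions. Connection parameters, and any replication-specific settings the connector may require, are omitted.

```yaml
datasets:
  # Assumed connector path syntax; connection parameters are omitted.
  - from: postgres:public.users
    name: users                 # hypothetical local dataset name
    acceleration:
      enabled: true
      engine: duckdb            # local DuckDB accelerator
      refresh_mode: changes     # apply row-level inserts, updates, and deletes
```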

SQL Statements

Spice supports writing data to compatible data connectors using standard SQL INSERT INTO syntax.

Write-Capable Connectors

Data connectors that support write operations are tagged as write:

Apache Iceberg - Write to Iceberg tables via data connector or catalog connector
AWS Glue - Write to Glue Data Catalog tables via data connector or catalog connector

Configuration for Write Operations

To enable write operations, configure your dataset or catalog with read_write access:
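
A minimal sketch of such a configuration, assuming the access level is set through an `access` field on the dataset or catalog entry. The `<table-url>` placeholder and both `name` values are hypothetical, and connector-specific parameters are omitted.

```yaml
datasets:
  # Placeholder source; substitute the address of a real Iceberg table.
  - from: iceberg:<table-url>
    name: my_table              # hypothetical dataset name
    access: read_write          # assumed field that enables writes

catalogs:
  - from: glue                  # assumed catalog connector identifier
    name: glue_catalog          # hypothetical catalog name
    access: read_write
```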

Example SQL

For more details on the INSERT statement syntax, see the SQL INSERT documentation.

OpenTelemetry Data Ingestion

By default, the runtime exposes an OpenTelemetry (OTEL) endpoint at grpc://127.0.0.1:50051 for OTEL data ingestion.

OTEL metrics are inserted into the dataset whose name matches the metric name (metric name = dataset name) and can optionally be replicated to the dataset source.

Benefits

Spice.ai OSS includes built-in data ingestion support for collecting the latest data from edge nodes for use in subsequent queries. This removes the need for additional ETL pipelines and shortens the feedback loop.

For example, consider CPU usage anomaly detection. When CPU metrics are sent to the Spice OpenTelemetry endpoint, the loaded machine learning model can use the most recent observations for inference and provide recommendations to the edge node. This happens on the edge node itself, within milliseconds, and without generating network traffic.

Spice also periodically replicates the data to the data connector for further use.

Considerations

Data Quality: Use Spice SQL capabilities to transform and cleanse ingested edge data, ensuring high-quality inputs.

Data Security: Evaluate data sensitivity and secure network connections between the edge and data connector when replicating data for further use. Implement encryption, access controls, and secure protocols.

Example

Disk SMART

Start Spice with the following dataset:
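
A sketch of a dataset consistent with the names used in this example. The `from` path format, the `access` field, and the `replication` block are assumptions and should be checked against the spicepod reference for your version.

```yaml
datasets:
  # Assumed path form for the coolorg.smart.drive_stats dataset in Spice.ai Cloud.
  - from: spice.ai/coolorg/smart/datasets/drive_stats
    name: smart_attribute_raw_value   # must match the OTEL metric name
    access: read_write                # assumed field that enables writes
    replication:
      enabled: true                   # assumed switch for replicating to the source
    acceleration:
      enabled: true
```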

Start telegraf with the following config:

SMART data will be available in the smart_attribute_raw_value dataset in Spice.ai OSS and replicated to the coolorg.smart.drive_stats dataset in Spice.ai Cloud.

Limitations

:::warning[Current Limitations]

Write Support: Only selected write-capable connectors and catalogs support write operations.
OpenTelemetry Replication: Only Spice.ai replication is supported for OpenTelemetry ingestion.

:::