---
title: 'Data Ingestion'
sidebar_label: 'Data Ingestion'
description: 'Learn how to ingest data in Spice.'
sidebar_position: 6
pagination_prev: null
pagination_next: null
tags:
---

Data can be ingested into the Spice runtime using the following methods:
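
- **Accelerated dataset refresh**: Rows are loaded from a source connector into a local accelerator according to the dataset's `refresh_mode` (`full`, `append`, `changes`, `snapshot`, `caching`). This is the most common ingestion path for keeping a local accelerator in sync with an upstream system.
- **SQL writes**: Data is written to compatible data connectors using standard SQL `INSERT` (and, where supported, `UPDATE`/`DELETE`) syntax.

Data ingestion is useful for scenarios such as keeping a local accelerator continuously in sync with an upstream database, collecting metrics from edge devices, writing application events for later analysis, or populating datasets from external sources.
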
When a dataset is configured with `acceleration.enabled: true`, Spice ingests rows from the source connector into a local engine (Arrow, DuckDB, SQLite, PostgreSQL, or Cayenne). The `refresh_mode` controls how that ingestion happens.

| Refresh Mode | What it ingests | Typical source |
|---|---|---|
| `full` | Replaces the accelerator's contents with a fresh read of the source on every refresh. | Slowly-changing reference tables; small lookup datasets. |
| `append` | Inserts only rows newer than the highest seen `time_column` value on each refresh. | Time-series, event/log data, append-only tables. |
| `changes` | Streams row-level inserts, updates, and deletes from a source CDC feed (PostgreSQL logical replication, DynamoDB Streams, MongoDB Change Streams, Debezium, Kafka, etc.). | Operational databases where you need a near real-time mirror of the source. |
| `snapshot` | Loads exclusively from an external snapshot store; no source reads. | Read-only replicas bootstrapped from a centralized snapshot, e.g. for fan-out reader fleets. |
| `caching` | Acts as a read-through cache for per-request HTTP/HTTPS responses, each held for a TTL. | API search results or other request-keyed content fetched lazily. |
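
As an illustration of the `append` row above, the following is a minimal sketch of an accelerated dataset that ingests only new rows on each refresh. The table name `public.events`, the `created_at` column, and the 10-second check interval are hypothetical values chosen for this example; connection parameters are omitted.

```yaml
datasets:
  # Hypothetical source table; substitute your own connector and path.
  - from: postgres:public.events
    name: events
    # Column used to find rows newer than the highest value already ingested.
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      refresh_mode: append
      # Assumed interval for this sketch; tune it to the source's write rate.
      refresh_check_interval: 10s
```
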
For cross-cutting refresh behavior — refresh intervals, on-demand refresh, retries, retention, and zero-results handling — see Data Refresh.
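
A CDC-backed dataset might be configured roughly as follows. This is a sketch under stated assumptions: the host, database, user, and secret names are placeholders, and any replication-specific settings the PostgreSQL connector requires (for example, a publication or replication slot) are not shown, so check the connector reference before relying on it.

```yaml
datasets:
  - from: postgres:public.users
    name: users
    params:
      # Placeholder connection details (assumptions for this sketch).
      pg_host: localhost
      pg_db: app
      pg_user: spice
      pg_pass: ${secrets:PG_PASS}
    acceleration:
      enabled: true
      engine: duckdb
      # Apply row-level changes from the source's CDC feed.
      refresh_mode: changes
```
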
This uses PostgreSQL Logical Replication to ingest every INSERT, UPDATE, and DELETE from public.users into a local DuckDB accelerator with low latency.
Spice supports writing data to compatible data connectors using standard SQL `INSERT INTO` syntax.
Data connectors that support write operations are tagged as write:
To enable write operations, configure your dataset or catalog with `read_write` access:
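
A minimal sketch of a writable dataset is shown here. It assumes an Iceberg-backed table at a hypothetical catalog URL and that the connector in use supports writes; only the `access: read_write` line is the point of the example.

```yaml
datasets:
  # Hypothetical Iceberg table; the catalog URL, namespace, and table are placeholders.
  - from: iceberg:https://catalog.example.com/v1/namespaces/sales/tables/orders
    name: orders
    # Permit SQL writes in addition to reads.
    access: read_write
```
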
For more details on the `INSERT` statement syntax, see the SQL INSERT documentation.
By default, the runtime exposes an OpenTelemetry (OTEL) endpoint at `grpc://127.0.0.1:50051` for OTEL data ingestion.
OTEL metrics are inserted into the dataset whose name matches the metric name, and are optionally replicated to the dataset source.
Spice.ai OSS includes built-in data ingestion support for collecting the latest data from edge nodes for use in subsequent queries. This feature eliminates the need for additional ETL pipelines and improves the speed of the feedback loop.
For example, consider CPU usage anomaly detection. When CPU metrics are sent to the Spice OpenTelemetry endpoint, the loaded machine learning model can use the most recent observations for inferencing and provide recommendations to the edge node. This process occurs quickly on the edge itself, within milliseconds, and without generating network traffic.
Additionally, Spice will periodically replicate the data to the data connector for further use.
- **Data Quality**: Use Spice SQL capabilities to transform and cleanse ingested edge data, ensuring high-quality inputs.
- **Data Security**: Evaluate data sensitivity and secure network connections between the edge and data connector when replicating data for further use. Implement encryption, access controls, and secure protocols.

Start Spice with the following dataset:
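
The dataset definition might look roughly like the sketch below. The dataset name and the Spice.ai Cloud path are inferred from the dataset names used in this walkthrough; the `access` and `replication` settings are assumptions about how writes and replication to the source are enabled, and should be checked against the current Spicepod reference.

```yaml
datasets:
  # Cloud dataset path inferred from coolorg.smart.drive_stats (assumption).
  - from: spice.ai/coolorg/smart/datasets/drive_stats
    # Must match the OTEL metric name so incoming metrics land in this dataset.
    name: smart_attribute_raw_value
    # Assumed settings: allow writes and replicate ingested rows to the source.
    access: read_write
    replication:
      enabled: true
    acceleration:
      enabled: true
```
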
Start telegraf with the following config:
SMART data will be available in the `smart_attribute_raw_value` dataset in Spice.ai OSS and replicated to the `coolorg.smart.drive_stats` dataset in Spice.ai Cloud.
:::warning[Current Limitations]
:::