---
title: 'Data Connectors'
sidebar_label: 'Data Connectors'
description: 'Learn how to use Data Connectors to query external data.'
image: /img/og/data-connectors.png
sidebar_position: 1
pagination_prev: null
pagination_next: null
tags:
---
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Each connector is configured using the `from` field in a dataset definition. For example:
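A minimal sketch of a `spicepod.yaml` dataset definition (the table and dataset names are illustrative):

```yaml
datasets:
  # from selects the connector (postgres) and the source table
  - from: postgres:cleaned_sales_data
    name: sales
```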
Supported Data Connectors include:
| Name | Description | Status | Protocol/Format |
|---|---|---|---|
| postgres | PostgreSQL, Amazon Redshift | Stable | PostgreSQL-wire |
| mysql | MySQL | Stable | |
| s3 | S3 | Stable | Parquet, CSV, JSON |
| file | File | Stable | Parquet, CSV, JSON |
| duckdb | DuckDB | Stable | Embedded |
| dremio | Dremio | Stable | Arrow Flight |
| spice.ai | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| databricks (mode: delta_lake) | Databricks | Stable | S3/Delta Lake |
| delta_lake | Delta Lake | Stable | Delta Lake |
| github | GitHub | Stable | GitHub API |
| graphql | GraphQL | Release Candidate | JSON |
| dynamodb | DynamoDB | Release Candidate | |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| snowflake | Snowflake | Beta | Arrow |
| spark | Spark | Beta | Spark Connect |
| iceberg | Apache Iceberg | Beta | Parquet |
| abfs | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| smb | SMB | Alpha | Parquet, CSV, JSON |
| nfs | NFS | Alpha | Parquet, CSV, JSON |
| glue | Glue | Alpha | Iceberg, Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV, JSON |
| imap | IMAP | Alpha | IMAP Emails |
| localpod | Local dataset replication | Alpha | |
| oracle | Oracle | Alpha | Oracle ODPI-C |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| clickhouse | ClickHouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| kafka | Kafka | Alpha | Kafka + JSON |
| mongodb | MongoDB | Alpha | |
| scylladb | ScyllaDB | Alpha | CQL, Alternator (DynamoDB) |
| elasticsearch | Elasticsearch | Roadmap | |
Data connectors that read files from object stores (S3, Azure Blob, GCS) or network-attached storage (FTP, SFTP, SMB, NFS) support a variety of file formats. These connectors work with both structured data formats (Parquet, CSV) and document formats (Markdown, PDF).
When connecting to a directory, specify the file format using `params.file_format`:
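A sketch, assuming an S3 bucket named my-bucket:

```yaml
datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      # required when pointing at a directory
      file_format: parquet
```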
When connecting to a specific file, the format is inferred from the file extension:
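For instance (path illustrative):

```yaml
datasets:
  # the .parquet extension implies file_format: parquet
  - from: s3://my-bucket/sales/data.parquet
    name: sales
```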
| Name | Parameter | Status | Description |
|---|---|---|---|
| Apache Parquet | file_format: parquet | Stable | Columnar format optimized for analytics |
| CSV | file_format: csv | Stable | Comma-separated values |
| JSON | file_format: json | Stable | JavaScript Object Notation |
| Delta Lake | file_format: delta | Stable | Open table format with ACID transactions. Object stores only. |
| Apache Iceberg | file_format: iceberg | Beta | Open table format for large analytic datasets. Object stores only. Requires a catalog. |
| Microsoft Excel | file_format: xlsx | Roadmap | Excel spreadsheet format |
| Markdown | file_format: md | Stable | Plain text with formatting (document format) |
| Text | file_format: txt | Stable | Plain text files (document format) |
| PDF | file_format: pdf | Beta | Portable Document Format (document format) |
| Microsoft Word | file_format: docx | Alpha | Word document format (document format) |
File formats support additional parameters for fine-grained control. Common examples include:
| Parameter | Applies To | Description |
|---|---|---|
| csv_has_header | CSV | Whether the first row contains column headers |
| csv_delimiter | CSV | Field delimiter character (default: `,`) |
| csv_quote | CSV | Quote character for fields containing delimiters |
For complete format options, see File Formats Reference.
The following data connectors support file format configuration:
| Connector Type | Connectors |
|---|---|
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable it with `hive_partitioning_enabled: true`.
Given a folder structure:
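For example (bucket name and partition columns are illustrative):

```text
s3://my-bucket/sales/
├── year=2024/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
└── year=2025/
    └── month=01/
        └── data.parquet
```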
Configure the dataset:
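A sketch, assuming partition folders named `year=.../month=...` (names illustrative):

```yaml
datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      file_format: parquet
      # expose year and month as queryable partition columns
      hive_partitioning_enabled: true
```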
Query with partition filters:
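For example (the partition column types depend on inference; numeric literals are shown for illustration):

```sql
SELECT COUNT(*)
FROM sales
WHERE year = 2024 AND month = 1;
```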
Partition pruning improves query performance by reading only the relevant files.
File-based connectors can expose per-file object store metadata as virtual columns in the dataset schema. These columns are not stored in the data files — they are derived from object store file metadata at query time.
| Column | Type | Description |
|---|---|---|
| location | Utf8 | Full URI of the source file |
| last_modified | Timestamp(µs, "UTC") | When the file was last modified |
| size | UInt64 | File size in bytes |
Metadata columns are enabled by adding a metadata section to the dataset definition with each desired column set to enabled:
Each column can be individually enabled or omitted:
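A sketch of the metadata section, following the description above:

```yaml
datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      file_format: parquet
    metadata:
      location:
        enabled: true
      last_modified:
        enabled: true
      # size is omitted, so it is not exposed
```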
:::note
If the data files already contain a column with the same name as a metadata column (e.g., a Parquet file with a `size` column), the metadata column is not added to avoid conflicts.
:::
Once enabled, metadata columns appear alongside the regular data columns:
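For example, where `order_id` and `amount` stand in for the dataset's own columns:

```sql
SELECT location, last_modified, size, order_id, amount
FROM sales
LIMIT 5;
```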
Metadata columns can be used in filters, projections, aggregations, and joins like any other column:
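A sketch that counts rows per source file, restricted to recently modified files:

```sql
SELECT location, COUNT(*) AS rows_per_file
FROM sales
WHERE last_modified > now() - INTERVAL '7 days'
GROUP BY location
ORDER BY rows_per_file DESC;
```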
Metadata columns are supported by all file-based connectors:
| Connector Type | Connectors |
|---|---|
| Object Stores | S3, Azure Blob (ABFS), HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:
| Connector | Parameter | Default | Description |
|---|---|---|---|
| Kafka | schema_infer_max_records | 10 | Number of messages sampled to infer the JSON schema |
| DynamoDB | schema_infer_max_records | 10 | Number of items sampled to infer the schema |
| MongoDB | mongodb_num_docs_to_infer_schema | 400 | Number of documents sampled to infer the schema |
| CSV files | csv_schema_infer_max_records | 1000 | Number of rows sampled to infer the CSV schema |
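For example, to sample more documents when inferring a MongoDB schema (dataset name illustrative):

```yaml
datasets:
  - from: mongodb:events
    name: events
    params:
      mongodb_num_docs_to_infer_schema: 1000
```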
For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running — for example, new columns are added, columns are removed, or data types change — subsequent data refreshes will fail with an error such as:
```text
Failed to load data for dataset <name>: Cannot cast struct field ...
```
This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.
To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.
:::tip[Recommendation]
Pin a known-good schema version in the data source or use the columns configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
:::
:::note
Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.
:::
| Name | Parameter | Supported | Is Document Format |
|---|---|---|---|
| Apache Parquet | file_format: parquet | ✅ | ❌ |
| CSV | file_format: csv | ✅ | ❌ |
| Delta Lake | file_format: delta | ✅ | ❌ |
| Apache Iceberg | file_format: iceberg | ✅ | ❌ |
| JSON | file_format: json | ✅ | ❌ |
| Microsoft Excel | file_format: xlsx | Roadmap | ❌ |
| Markdown | file_format: md | ✅ | ✅ |
| Text | file_format: txt | ✅ | ✅ |
| PDF | file_format: pdf | Beta | ✅ |
| Microsoft Word | file_format: docx | Alpha | ✅ |
Document formats (Markdown, Text, PDF, Word) are handled differently from structured data formats. Each file becomes a row in the resulting table, with the file contents stored in a content column.
:::warning[Note]
Document formats in Alpha or Beta (PDF, DOCX) may not parse all structure or text from the underlying documents correctly.
:::
| Column | Type | Description |
|---|---|---|
| location | String | Path to the source file |
| content | String | Full text content of the document |
Consider a local filesystem:
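For example (file names illustrative):

```text
docs/
├── architecture.md
└── runbook.md
```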
And the corresponding spicepod:
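A sketch using the file connector:

```yaml
datasets:
  - from: file://docs/
    name: docs
    params:
      file_format: md
```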
A document table will be created, with one row per file.
import DocCardList from '@theme/DocCardList';