---
title: 'Data Connectors'
sidebar_label: 'Data Connectors'
description: 'Learn how to use Data Connectors to query external data.'
image: /img/og/data-connectors.png
sidebar_position: 1
pagination_prev: null
pagination_next: null
tags:
---
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Supported Data Connectors include:
| Name | Description | Status | Protocol/Format |
|---|---|---|---|
| postgres | PostgreSQL, Amazon Redshift | Stable | PostgreSQL-wire |
| mysql | MySQL | Stable | |
| s3 | S3 | Stable | Parquet, CSV, JSON |
| file | File | Stable | Parquet, CSV, JSON |
| duckdb | DuckDB | Stable | Embedded |
| dremio | Dremio | Stable | Arrow Flight |
| spice.ai | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| databricks (mode: delta_lake) | Databricks | Stable | S3/Delta Lake |
| delta_lake | Delta Lake | Stable | Delta Lake |
| github | GitHub | Stable | GitHub API |
| graphql | GraphQL | Release Candidate | JSON |
| dynamodb | DynamoDB | Release Candidate | |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| snowflake | Snowflake | Beta | Arrow |
| spark | Spark | Beta | Spark Connect |
| iceberg | Apache Iceberg | Beta | Parquet |
| abfs | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| glue | Glue | Alpha | Iceberg, Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV, JSON |
| imap | IMAP | Alpha | IMAP Emails |
| localpod | Local dataset replication | Alpha | |
| oracle | Oracle | Alpha | Oracle ODPI-C |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| clickhouse | ClickHouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| kafka | Kafka | Alpha | Kafka + JSON |
| mongodb | MongoDB | Alpha | |
| elasticsearch | Elasticsearch | Roadmap | |
For data connectors that are object-store compatible, if a folder is provided, the file format must be specified with params.file_format.
If a file is provided, the file format is inferred from the file extension, and params.file_format is not required.
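For example, a minimal spicepod dataset sketch (the bucket, paths, and dataset names are hypothetical):

```yaml
datasets:
  # Folder source: the format cannot be inferred, so file_format is required.
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
  # File source: the format is inferred from the .parquet extension.
  - from: s3://my-bucket/events/2024.parquet
    name: events_file
```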
The file formats currently supported are:
| Name | Parameter | Status | Description |
|---|---|---|---|
| Apache Parquet | file_format: parquet | Stable | Columnar format optimized for analytics |
| CSV | file_format: csv | Stable | Comma-separated values |
| JSON | file_format: json | Roadmap | JavaScript Object Notation |
| Apache Iceberg | file_format: iceberg | Roadmap | Open table format for large analytic datasets |
| Microsoft Excel | file_format: xlsx | Roadmap | Excel spreadsheet format |
| Markdown | file_format: md | Stable | Plain text with formatting (document format) |
| Text | file_format: txt | Stable | Plain text files (document format) |
| PDF | file_format: pdf | Alpha | Portable Document Format (document format) |
| Microsoft Word | file_format: docx | Alpha | Word document format (document format) |
File formats support additional parameters for fine-grained control. Common examples include:
| Parameter | Applies To | Description |
|---|---|---|
| csv_has_header | CSV | Whether the first row contains column headers |
| csv_delimiter | CSV | Field delimiter character (default: ,) |
| csv_quote | CSV | Quote character for fields containing delimiters |
For complete format options, see File Formats Reference.
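For example, a hypothetical CSV dataset using these parameters (the path, dataset name, and values are illustrative):

```yaml
datasets:
  - from: file://data/orders.csv
    name: orders
    params:
      file_format: csv
      csv_has_header: true # first row contains column names
      csv_delimiter: ';' # override the default comma delimiter
```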
The following data connectors support file format configuration:
| Connector Type | Connectors |
|---|---|
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable with hive_partitioning_enabled: true.
Given a folder structure:
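As a hypothetical example (the dataset name and partition columns are illustrative):

```
sales/
├── year=2024/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
└── year=2023/
    └── month=12/
        └── data.parquet
```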
Configure the dataset:
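A minimal sketch, assuming a sales folder with year=/month= partitions in an S3 bucket (the bucket name is hypothetical):

```yaml
datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      file_format: parquet
      hive_partitioning_enabled: true # exposes year and month as columns
```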
Query with partition filters:
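For instance, assuming partition columns year and month extracted from the folder names (partition values are typically read as strings unless configured otherwise):

```sql
-- Only files under sales/year=2024/month=01/ are read.
SELECT COUNT(*) FROM sales
WHERE year = '2024' AND month = '01';
```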
Partition pruning improves query performance by reading only the relevant files.
| Name | Parameter | Supported | Is Document Format |
|---|---|---|---|
| Apache Parquet | file_format: parquet | ✅ | ❌ |
| CSV | file_format: csv | ✅ | ❌ |
| Apache Iceberg | file_format: iceberg | Roadmap | ❌ |
| JSON | file_format: json | Roadmap | ❌ |
| Microsoft Excel | file_format: xlsx | Roadmap | ❌ |
| Markdown | file_format: md | ✅ | ✅ |
| Text | file_format: txt | ✅ | ✅ |
| PDF | file_format: pdf | Alpha | ✅ |
| Microsoft Word | file_format: docx | Alpha | ✅ |
File formats support additional parameters specified in params (such as csv_has_header), described in File Formats.
If a format is a document format, each file will be treated as a document, as per document support below.
:::warning[Note]
Document formats in Alpha (e.g. pdf, docx) may not parse all structure or text from the underlying documents correctly.
:::
If a Data Connector supports documents, then when a document file format is specified (see above), each file is treated as a row in the table, with the file's contents in the content column. Additional columns exist depending on the data connector.
Consider a local filesystem:
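For illustration, suppose the directory contains a handful of Markdown files (the directory layout and file names are hypothetical):

```
docs/
├── guide.md
└── notes/
    └── setup.md
```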
And the spicepod:
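A minimal sketch of a matching spicepod (the path and dataset name are assumptions):

```yaml
datasets:
  - from: file://docs/
    name: docs
    params:
      file_format: md # document format: each file becomes a row
```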
A Document table will be created.
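Assuming a dataset named docs configured with a document format, each file appears as a row and its text is available in the content column:

```sql
SELECT content FROM docs LIMIT 1;
```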
import DocCardList from '@theme/DocCardList';

<DocCardList />