---
title: 'Data Connectors'
sidebar_label: 'Data Connectors'
description: 'Learn how to use Data Connectors to query external data.'
image: /img/og/data-connectors.png
sidebar_position: 1
pagination_prev: null
pagination_next: null
tags:
---
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Supported Data Connectors include:
| Name | Description | Status | Protocol/Format |
|---|---|---|---|
| postgres | PostgreSQL, Amazon Redshift | Stable | PostgreSQL-wire |
| mysql | MySQL | Stable | |
| s3 | S3 | Stable | Parquet, CSV, JSON |
| file | File | Stable | Parquet, CSV, JSON |
| duckdb | DuckDB | Stable | Embedded |
| dremio | Dremio | Stable | Arrow Flight |
| spice.ai | Spice.ai OSS & Cloud | Stable | Arrow Flight |
| databricks (mode: delta_lake) | Databricks | Stable | S3/Delta Lake |
| delta_lake | Delta Lake | Stable | Delta Lake |
| github | GitHub | Stable | GitHub API |
| graphql | GraphQL | Release Candidate | JSON |
| dynamodb | DynamoDB | Release Candidate | |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| snowflake | Snowflake | Beta | Arrow |
| spark | Spark | Beta | Spark Connect |
| iceberg | Apache Iceberg | Beta | Parquet |
| abfs | Azure BlobFS | Alpha | Parquet, CSV, JSON |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV, JSON |
| glue | Glue | Alpha | Iceberg, Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV, JSON |
| imap | IMAP | Alpha | IMAP Emails |
| localpod | Local dataset replication | Alpha | |
| oracle | Oracle | Alpha | Oracle ODPI-C |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| clickhouse | ClickHouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| kafka | Kafka | Alpha | Kafka + JSON |
| mongodb | MongoDB | Alpha | |
| elasticsearch | Elasticsearch | Roadmap | |
For data connectors that are object-store compatible, if a folder is provided, the file format must be specified with params.file_format.
If a file is provided, the file format is inferred from the file extension, and params.file_format is not required.
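For example, a minimal spicepod dataset sketch (the bucket, paths, and dataset names are hypothetical):

```yaml
datasets:
  # Folder source: the format cannot be inferred, so file_format is required.
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
  # File source: the format is inferred from the .parquet extension.
  - from: s3://my-bucket/events/2024.parquet
    name: events_file
```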
The file formats currently supported are:
| Name | Parameter | Status | Description |
|---|---|---|---|
| Apache Parquet | file_format: parquet | Stable | Columnar format optimized for analytics |
| CSV | file_format: csv | Stable | Comma-separated values |
| JSON | file_format: json | Roadmap | JavaScript Object Notation |
| Apache Iceberg | file_format: iceberg | Roadmap | Open table format for large analytic datasets |
| Microsoft Excel | file_format: xlsx | Roadmap | Excel spreadsheet format |
| Markdown | file_format: md | Stable | Plain text with formatting (document format) |
| Text | file_format: txt | Stable | Plain text files (document format) |
| PDF | file_format: pdf | Alpha | Portable Document Format (document format) |
| Microsoft Word | file_format: docx | Alpha | Word document format (document format) |
File formats support additional parameters for fine-grained control. Common examples include:
| Parameter | Applies To | Description |
|---|---|---|
| csv_has_header | CSV | Whether the first row contains column headers |
| csv_delimiter | CSV | Field delimiter character (default: ,) |
| csv_quote | CSV | Quote character for fields containing delimiters |
For complete format options, see File Formats Reference.
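For example, a hypothetical CSV dataset using these parameters (the path, dataset name, and values are illustrative):

```yaml
datasets:
  - from: file://data/orders.csv
    name: orders
    params:
      file_format: csv
      csv_has_header: true # first row contains column names
      csv_delimiter: ';' # override the default comma delimiter
```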
The following data connectors support file format configuration:
| Connector Type | Connectors |
|---|---|
| Object Stores | S3, Azure Blob (ABFS), GCS, HTTP/HTTPS |
| Network-Attached Storage | FTP, SFTP, SMB, NFS |
| Local Storage | File |
File-based connectors support Hive-style partitioning, which extracts partition columns from folder names. Enable with hive_partitioning_enabled: true.
Given a folder structure:
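As a hypothetical example (the dataset name and partition columns are illustrative):

```
sales/
├── year=2024/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
└── year=2023/
    └── month=12/
        └── data.parquet
```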
Configure the dataset:
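A minimal sketch, assuming a sales folder with year=/month= partitions in an S3 bucket (the bucket name is hypothetical):

```yaml
datasets:
  - from: s3://my-bucket/sales/
    name: sales
    params:
      file_format: parquet
      hive_partitioning_enabled: true # exposes year and month as columns
```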
Query with partition filters:
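For instance, assuming partition columns year and month extracted from the folder names (partition values are typically read as strings unless configured otherwise):

```sql
-- Only files under sales/year=2024/month=01/ are read.
SELECT COUNT(*) FROM sales
WHERE year = '2024' AND month = '01';
```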
Partition pruning improves query performance by reading only the relevant files.
| Name | Parameter | Supported | Is Document Format |
|---|---|---|---|
| Apache Parquet | file_format: parquet | ✅ | ❌ |
| CSV | file_format: csv | ✅ | ❌ |
| Apache Iceberg | file_format: iceberg | Roadmap | ❌ |
| JSON | file_format: json | Roadmap | ❌ |
| Microsoft Excel | file_format: xlsx | Roadmap | ❌ |
| Markdown | file_format: md | ✅ | ✅ |
| Text | file_format: txt | ✅ | ✅ |
| PDF | file_format: pdf | Alpha | ✅ |
| Microsoft Word | file_format: docx | Alpha | ✅ |
File formats support additional parameters specified in params (such as csv_has_header), described in File Formats.
If a format is a document format, each file will be treated as a document, as per document support below.
:::warning[Note]
Document formats in Alpha (e.g. pdf, docx) may not parse all structure or text from the underlying documents correctly.
:::
If a Data Connector supports documents, then when a document file format is specified (see above), each file is treated as a row in the table, with the file's contents in the content column. Additional columns exist depending on the data connector.
Consider a local filesystem:
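For illustration, suppose the directory contains a handful of Markdown files (the directory layout and file names are hypothetical):

```
docs/
├── guide.md
└── notes/
    └── setup.md
```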
And the spicepod:
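A minimal sketch of a matching spicepod (the path and dataset name are assumptions):

```yaml
datasets:
  - from: file://docs/
    name: docs
    params:
      file_format: md # document format: each file becomes a row
```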
A Document table will be created.
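Assuming a dataset named docs configured with a document format, each file appears as a row and its text is available in the content column:

```sql
SELECT content FROM docs LIMIT 1;
```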
import DocCardList from '@theme/DocCardList';

<DocCardList />