spiceai/docs

| Parameter Name | Description |
| --- | --- |
| `file_format` | Specifies the data format. Required if it cannot be inferred from the object URI. Options: `parquet`, `csv`, `json`. Refer to File Formats for details. |
| `s3_endpoint` | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. Example: `s3_endpoint: https://my.minio.server` |
| `s3_region` | S3 bucket region. Default: `us-east-1`. |
| `client_timeout` | Optional. Timeout for S3 operations. No timeout by default. |
| `hive_partitioning_enabled` | Enables Hive-style partitioning inferred from the folder structure. Default: `false`. |
| `s3_auth` | Authentication type. Options: `public`, `key`, and `iam_role`. If set to `key`, the `s3_key` and `s3_secret` parameters must also be set. If set to `iam_role`, credentials are loaded from environment variables or IAM roles (see Authentication for details). |

For additional CSV-, JSON-, and Parquet-specific parameters, see File Formats.
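As a rough illustration of how the parameters above combine, the sketch below sets a region and a timeout for a folder of Parquet files. The bucket, path, and dataset name are hypothetical, and the `60s` duration format is an assumption rather than something stated in the table.

```yaml
# spicepod.yaml fragment (sketch; bucket, path, and dataset name are hypothetical)
datasets:
  - from: s3://my-bucket/path/to/folder/
    name: my_dataset
    params:
      file_format: parquet   # a folder URI has no file extension to infer the format from
      s3_region: us-west-2   # overrides the us-east-1 default
      client_timeout: 60s    # assumed duration format; omit to keep the default of no timeout
```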

Authentication

No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`.

If `s3_auth` is set to `iam_role`, the connector automatically loads credentials from the following sources, in order:

1. Environment Variables:
   - `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
   - `AWS_SESSION_TOKEN` (if using temporary credentials)
2. Shared AWS Config/Credentials Files:
   - Config file: `~/.aws/config` (Linux/Mac) or `%UserProfile%\.aws\config` (Windows)
   - Credentials file: `~/.aws/credentials` (Linux/Mac) or `%UserProfile%\.aws\credentials` (Windows)
   - The `AWS_PROFILE` environment variable can be used to specify a named profile; otherwise the `[default]` profile is used.
   - Supports both static credentials and SSO sessions
   - Example credentials file:

   :::tip
   To set up SSO authentication:
   1. Run `aws configure sso` to configure a new SSO profile
   2. Use the profile by setting `AWS_PROFILE=sso-profile`
   3. Run `aws sso login --profile sso-profile` to start a new SSO session
   :::
3. AWS STS Web Identity Token Credentials:
   - Used primarily with OpenID Connect (OIDC) and OAuth

The connector tries each source in order until valid credentials are found. If none are found, an authentication error is returned.
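On the Spicepod side, selecting this credential chain might look like the following minimal sketch. The bucket, path, and dataset name are hypothetical; only `s3_auth: iam_role` comes from the description above.

```yaml
# spicepod.yaml fragment (sketch; bucket, path, and dataset name are hypothetical)
datasets:
  - from: s3://my-private-bucket/path/to/folder/
    name: private_data
    params:
      file_format: parquet
      s3_auth: iam_role   # credentials are resolved from the sources listed above, in order
```

No key or secret appears in the file in this mode; the runtime environment supplies the credentials.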

:::note[IAM Permissions]

Regardless of the credential source, the IAM role or user must have appropriate S3 permissions (e.g., `s3:ListBucket`, `s3:GetObject`) to access the files. If the Spicepod connects to multiple different AWS services, the permissions should cover all of them.

:::

:::note[kube2iam]

kube2iam is a project that provides IAM roles to Kubernetes pods based on annotations. It has been superseded by IAM Roles for Service Accounts (IRSA), which should be preferred for new deployments.

Spice requires kube2iam >= 0.12; versions prior to 0.12 only supported IMDSv1.

:::

Required IAM Permissions

Minimum IAM policy for S3 access:

Permission Details

| Permission | Purpose |
| --- | --- |
| `s3:ListBucket` | Required. Allows scanning all objects from the bucket. |
| `s3:GetObject` | Required. Allows fetching objects. |

Types

Refer to Object Store Data Types for the mapping from object store file data types to Arrow data types.

Examples

Public Bucket Example

Create a dataset named `taxi_trips` from a public S3 folder.
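A minimal sketch of such a dataset definition follows. The bucket and path are assumptions used for illustration, and `s3_auth: public` is set explicitly to make the intent visible.

```yaml
# spicepod.yaml fragment (sketch; the bucket and path are assumed for illustration)
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet   # required here because the source is a folder
      s3_auth: public        # no credentials are sent
```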

MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.
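One way this could be written is sketched below. The bucket and object path are hypothetical; the endpoint reuses the example value from the `s3_endpoint` parameter description. A private MinIO bucket would additionally need authentication parameters, which are left out here.

```yaml
# spicepod.yaml fragment (sketch; bucket and object path are hypothetical)
datasets:
  - from: s3://my-bucket/path/to/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_endpoint: https://my.minio.server   # points the connector at MinIO instead of AWS
      s3_region: us-east-1                   # same as the documented default, shown for clarity
```

The `.parquet` extension lets the format be inferred from the URI, so `file_format` is omitted.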

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This enables efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

Spice can automatically infer these partition columns from the directory structure when `hive_partitioning_enabled` is set to `true`.
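The sketch below shows how this might be configured. The bucket, folder layout, and dataset name are hypothetical; the layout in the comments follows the usual Hive `key=value` naming convention.

```yaml
# Assumed object layout (hypothetical bucket and files):
#   s3://my-bucket/events/year=2024/month=03/day=15/data.parquet
#   s3://my-bucket/events/year=2024/month=03/day=16/data.parquet
datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
      hive_partitioning_enabled: true   # year, month, and day become queryable columns
```

With this in place, a filter such as `WHERE year = '2024' AND month = '03'` can be answered from the matching folders only, which is where the reduced scanning comes from.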

Schema Source Path Example

Use `schema_source_path` to speed up dataset registration by specifying a URL from which the schema is inferred.
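A hedged sketch of how this might look is shown below. It assumes `schema_source_path` is set under `params` like the other connector parameters and that it accepts the URL of a single representative file; the bucket and paths are hypothetical.

```yaml
# spicepod.yaml fragment (sketch; bucket and paths are hypothetical)
datasets:
  - from: s3://my-bucket/large/dataset/
    name: large_dataset
    params:
      file_format: parquet
      # Assumed placement and value: one representative file used for schema inference
      schema_source_path: s3://my-bucket/large/dataset/part-0000.parquet
```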

Metadata Columns Example

Metadata columns expose per-file S3 object metadata (`location`, `last_modified`, `size`) as virtual columns in query results. See Metadata Columns for full details.

Query metadata alongside regular data:

Filter by specific file:

Aggregate per file:

Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. To use referenced secrets in component parameters, see the using referenced secrets guide.

Limitations

:::warning[Performance Considerations]

When using the S3 Data Connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

Memory limitations can be mitigated by storing acceleration data on disk. The `duckdb` and `sqlite` accelerators support this when `mode: file` is specified.

Each query retrieves data from the S3 source, which might result in significant network requests and bandwidth consumption. This can affect network performance and incur costs related to data transfer from S3.

:::
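The disk-backed acceleration mentioned in the warning above might be configured as in the following sketch. The bucket, path, and dataset name are hypothetical.

```yaml
# spicepod.yaml fragment (sketch; bucket, path, and dataset name are hypothetical)
datasets:
  - from: s3://my-bucket/path/to/folder/
    name: my_dataset
    params:
      file_format: parquet
    acceleration:
      enabled: true
      engine: duckdb   # sqlite is the other engine named above as supporting file mode
      mode: file       # keep accelerated data on disk rather than in memory
```

Queries against an accelerated dataset are served from the local copy, which also avoids fetching from S3 on every query.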

Cookbook

A cookbook recipe to configure S3 as a data connector in Spice: S3 Data Connector.

spiceai/docs/README.md

---
title: 'S3 Data Connector'
sidebar_label: 'S3 Data Connector'
description: 'S3 Data Connector Documentation'
---

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the `file_format` parameter, as described in File Formats.

Quickstart

Query a public S3 dataset with no authentication:

For private buckets, add authentication (see Authentication):
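A minimal sketch of a private-bucket dataset using key-based authentication follows. The bucket, dataset name, and secret names are hypothetical, and the `${secrets:...}` reference syntax is an assumption; check the Secrets section for the exact form.

```yaml
# spicepod.yaml fragment (sketch; bucket, dataset name, and secret names are hypothetical)
datasets:
  - from: s3://my-private-bucket/path/to/folder/
    name: private_data
    params:
      file_format: parquet
      s3_auth: key
      s3_key: ${secrets:S3_KEY}         # assumed secret reference syntax
      s3_secret: ${secrets:S3_SECRET}   # avoids committing credentials to the file
```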

Configuration

`from`

S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.

Example: `from: s3://my-bucket/path/to/file.parquet`

`name`

The dataset name. This will be used as the table name within Spice.

Example:
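As a small sketch (the bucket and path are hypothetical), the definition below registers a table that can be queried as `cool_dataset`, for instance with `SELECT COUNT(*) FROM cool_dataset`.

```yaml
# spicepod.yaml fragment (sketch; bucket and path are hypothetical)
datasets:
  - from: s3://my-bucket/path/to/file.parquet
    name: cool_dataset   # becomes the table name in SQL queries
```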

The dataset name cannot be a reserved keyword.

`params`

| Parameter Name | Description |
| --- | --- |
| `file_format` | Specifies the data format. Required if it cannot be inferred from the object URI. Options: `parquet`, `csv`, `json`. Refer to File Formats for details. |
| `s3_endpoint` | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. Example: `s3_endpoint: https://my.minio.server` |
| `s3_region` | S3 bucket region. Default: `us-east-1`. |
| `client_timeout` | Optional. Timeout for S3 operations. No timeout by default. |
| `hive_partitioning_enabled` | Enables Hive-style partitioning inferred from the folder structure. Default: `false`. |
| `s3_auth` | Authentication type. Options: `public`, `key`, and `iam_role`. If set to `key`, the `s3_key` and `s3_secret` parameters must also be set. If set to `iam_role`, credentials are loaded from environment variables or IAM roles (see Authentication for details). |