title: 'Azure Cosmos DB Data Connector'
sidebar_label: 'Azure Cosmos DB Data Connector'
description: 'Query Azure Cosmos DB (NoSQL / Core SQL API) containers as SQL tables in Spice. Read-only scan with schema inferred from a sample of documents.'
tags:
The Azure Cosmos DB Data Connector exposes Cosmos DB containers (NoSQL / Core SQL API) as SQL tables in Spice. The connector samples a configurable number of documents at startup, infers an Arrow schema, and streams documents into DataFusion for federated SQL queries alongside data from other connectors.
from

The from field takes the form cosmosdb:{database}.{container} or cosmosdb:{database}/{container}. The connector also accepts a bare {container} when cosmosdb_database is provided in params.

name

The dataset name used as the table name within Spice. The dataset name cannot be a reserved keyword.

params

Provide either a full Cosmos DB connection string (preferred) or the discrete account_endpoint + account_key pair. Secrets must be sourced from a secret store in production.
| Parameter Name | Description | Required |
|---|---|---|
| cosmosdb_connection_string | Full connection string copied from the Azure portal. Takes precedence over account_endpoint / account_key. | Either this or both endpoint+key |
| cosmosdb_account_endpoint | Account endpoint URL, e.g. https://my-account.documents.azure.com:443/. | When connection string isn't set |
| cosmosdb_account_key | Primary or secondary account key. | When connection string isn't set |
Microsoft Entra ID and managed-identity authentication are tracked as a post-RC enhancement and are not supported in the current release.
| Parameter Name | Description | Default |
|---|---|---|
| cosmosdb_database | Database name. When unset, parsed from the from: path (database.container). | - |
| query | Cosmos SQL query used to scan the container. Useful when the container is large and only a subset should be surfaced as a dataset. | SELECT * FROM c |
| schema_infer_max_records | Number of documents sampled during schema inference at dataset registration. Larger samples produce a more precise schema at the cost of more RU consumption. | 100 |
The connector applies per-account concurrency limits, bounded retries with backoff, and a permanent-error latch that disables the connector account-wide on 401/403/404 responses.
| Parameter Name | Description | Default |
|---|---|---|
| max_concurrent_requests | Maximum number of concurrent Cosmos DB requests per account endpoint, shared across all datasets pointing at the same account. | 4 |
| http_max_retries | Maximum number of retries for transient errors (HTTP 429, 5xx, network) during the schema-inference pass at dataset registration. Retries honor Retry-After and x-ms-retry-after-ms headers. | 3 |
| backoff_method | Backoff strategy between retries. exponential doubles the delay each attempt (capped at 30s); fibonacci follows the Fibonacci sequence (capped at 30s). | exponential |
| disable_on_permanent_error | When true, a permanent error (401/403/404) latches the connector into a disabled state and short-circuits subsequent requests until Spice is restarted. | true |
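
To illustrate where these settings might go, the sketch below sets all four resilience parameters on a single dataset. It assumes they sit under params alongside the connection parameters; the database, container, dataset, and secret names, as well as the ${secrets:...} reference syntax, are placeholders rather than values taken from this page.

```yaml
datasets:
  - from: cosmosdb:my_database.my_container   # placeholder database and container
    name: my_container
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}  # assumed secret name
      max_concurrent_requests: 8        # default is 4; shared per account endpoint
      http_max_retries: 5               # default is 3
      backoff_method: fibonacci         # default is exponential
      disable_on_permanent_error: true  # default; latches on 401/403/404
```

Because max_concurrent_requests is shared across all datasets that point at the same account, setting it on several datasets for one account is likely redundant.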
See the deployment guide for sizing, troubleshooting, and observability details.
Copy the connection string from the Azure portal under Settings → Keys for the Cosmos DB account, then reference it from a secret store:
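
A minimal sketch of a dataset that uses a connection string might look like the following. The database, container, and dataset names are placeholders, and the secret name and the ${secrets:...} reference syntax are assumptions about how the secret store is configured.

```yaml
datasets:
  - from: cosmosdb:my_database.my_container   # placeholder database and container
    name: my_container
    params:
      # Assumed secret name; the value is the full connection string from the portal.
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}
```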
When the endpoint and key are stored separately (for example in Key Vault), provide both:
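
A sketch of the discrete form, under the same assumptions (placeholder names, an assumed secret name, and the ${secrets:...} reference syntax). The endpoint URL reuses the example format from the parameter table.

```yaml
datasets:
  - from: cosmosdb:my_database.my_container   # placeholder database and container
    name: my_container
    params:
      cosmosdb_account_endpoint: https://my-account.documents.azure.com:443/
      cosmosdb_account_key: ${secrets:COSMOSDB_ACCOUNT_KEY}  # assumed secret name
```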
If both styles are supplied, the connection string takes precedence.
Cosmos DB has no native schema. At dataset registration the connector runs the configured query (default SELECT * FROM c) limited to schema_infer_max_records documents and hands the result to Arrow's JSON inference. The inferred schema is locked for the lifetime of the runtime process.
| Cosmos / JSON value | Arrow type | Notes |
|---|---|---|
| "abc" | Utf8 | |
| Integer (42, -7) | Int64 | Widens to Float64 if any sampled document contains a decimal value for the same field. |
| Floating (3.14, 1.0e9) | Float64 | |
| true / false | Boolean | |
| Object { ... } | Struct | Nested objects are preserved as Arrow structs. |
| Array [ ... ] | List | The element type is inferred from the first non-null item; heterogeneous arrays may require a wider sample to disambiguate. |
Cosmos does not emit Date, Time, Timestamp, Decimal, or Binary natively — they round-trip as strings and should be handled with CAST at query time.
When optional fields are sparse in the first 100 documents but present in production data, increase schema_infer_max_records:
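
For example, a dataset that samples 1,000 documents instead of the default 100 could be sketched as below. The value 1000 is arbitrary, and the names, secret, and ${secrets:...} syntax are placeholders.

```yaml
datasets:
  - from: cosmosdb:my_database.my_container   # placeholder database and container
    name: my_container
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}  # assumed secret name
      schema_infer_max_records: 1000   # default is 100; a larger sample costs more RUs
```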
Each additional sampled document consumes more Request Units (RUs) at startup. For the most precise control, pin the schema explicitly with the columns: field.
unsupported_type_action

Optional. Controls behavior for fields that infer as DataType::Null (every sampled document had null for the field). Defaults to warn.

- error: Fail dataset registration.
- warn: Log a warning and drop the column. (Default.)
- ignore: Silently drop the column.
- string: Coerce the column to Utf8.

After registering a dataset, query it like any other Spice table.
When the container is large and only a subset should be surfaced as a dataset, push the predicate to Cosmos with a custom query:
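
One way this could look, assuming documents carry a status field (a hypothetical field, as are the names and the secret reference):

```yaml
datasets:
  - from: cosmosdb:my_database.orders   # placeholder database and container
    name: active_orders
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}  # assumed secret name
      # Cosmos SQL, evaluated by Cosmos DB; "status" is a hypothetical document field.
      query: "SELECT * FROM c WHERE c.status = 'active'"
```

Because schema inference runs the configured query, the inferred schema reflects only documents that match the predicate.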
Cosmos DB does not support joins across containers. Spice federates joins between Cosmos-backed datasets (and any other connector) in the local DataFusion engine:
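
As a sketch, the join can be written as ordinary SQL over two registered datasets. The example below expresses it as a Spicepod view so that it stays in configuration form; the container names, the customer_id and name fields, and the secret reference are all hypothetical.

```yaml
datasets:
  - from: cosmosdb:sales.orders       # placeholder database and container
    name: orders
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}  # assumed secret name
  - from: cosmosdb:sales.customers    # placeholder database and container
    name: customers
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}

views:
  - name: orders_with_customers
    # The join runs in Spice, not in Cosmos DB; field names are hypothetical.
    sql: |
      SELECT o.id AS order_id, c.name AS customer_name
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
```

The same SELECT statement can also be issued directly as an ad hoc query instead of being defined as a view.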
Standard Spice acceleration (DuckDB, SQLite, Arrow in-memory, Cayenne) works on top of the Cosmos DB connector. Acceleration is recommended when the container is large or when query latency matters — it avoids per-query RU consumption against the Cosmos account.
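
A minimal sketch of an accelerated dataset, assuming DuckDB file-mode acceleration with a periodic full refresh. The engine choice, the 10m interval, the names, and the secret reference are illustrative.

```yaml
datasets:
  - from: cosmosdb:my_database.my_container   # placeholder database and container
    name: my_container
    params:
      cosmosdb_connection_string: ${secrets:COSMOSDB_CONNECTION_STRING}  # assumed secret name
    acceleration:
      enabled: true
      engine: duckdb               # one of the engines listed above
      mode: file                   # persist the accelerated copy to disk
      refresh_mode: full           # refresh_mode: changes is not supported by this connector
      refresh_check_interval: 10m  # illustrative; each refresh rescans the container and consumes RUs
```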
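- SELECT scans are supported. Writes (INSERT / UPDATE / DELETE) are not implemented.
- Use the query parameter to narrow the scan at the Cosmos side.
- The changes refresh mode (acceleration.refresh_mode: changes) is not supported.
- Date, Time, Timestamp, Decimal, and Binary values arrive as strings; convert them with CAST in SQL.
- Authentication is limited to a connection string or account_endpoint + account_key.

A copy-pasteable example Spicepod is in the runtime repo at examples/cosmosdb-connector/.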
| Cosmos / JSON value | Arrow type | Notes |
|---|---|---|
| All-null in sample | Null | Warn-dropped by default. Set unsupported_type_action: string to coerce to Utf8, or widen the sample so real values appear. |
| System fields (_rid, ...) | stripped | The system fields _rid, _self, _etag, _attachments, and _ts are stripped and never appear in the dataset schema. |