---
title: 'Embedding Models'
sidebar_label: 'Embeddings'
description: 'Describes how embedding models are used in Spice to convert text into numerical vectors for machine learning and search applications.'
image: /img/og/embeddings.png
sidebar_position: 6
pagination_prev: null
pagination_next: null
tags:
---
Embedding models transform raw text into numerical vectors that machine learning models can use. Spice supports running embedding models locally or via hosted services such as OpenAI, Amazon Bedrock, Databricks Mosaic AI, or la Plateforme.
Embeddings enable vector-based and similarity search, such as document retrieval. For chat-based large language models, see Model Providers.
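To make the idea of similarity search concrete, here is a minimal sketch in plain Python. It is not Spice's implementation: the three-dimensional vectors, the `DOCS` dictionary, and the `top_k` helper are all hypothetical, and cosine similarity is assumed as the distance measure purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2):
    """Rank documents by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; real models emit hundreds of dimensions.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return an item": [0.8, 0.2, 0.1],
}

results = top_k([1.0, 0.0, 0.0], DOCS, k=2)
```

Documents whose vectors point in nearly the same direction as the query rank first, which is the property document retrieval relies on.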
Spice supports a variety of embedding model sources and formats:
| Name | Description | Status | ML Format(s) | LLM Format(s)* |
|---|---|---|---|---|
| `file` | Local filesystem | Release Candidate | ONNX | GGUF, GGML, SafeTensor |
| `huggingface` | Models hosted on HuggingFace | Release Candidate | ONNX | GGUF, GGML, SafeTensor |
| `openai` | OpenAI (or compatible) LLM endpoint | Release Candidate | - | OpenAI-compatible HTTP endpoint |
| `azure` | Azure OpenAI | Alpha | - | OpenAI-compatible HTTP endpoint |
| `databricks` | Models deployed to Databricks Mosaic AI | Alpha | - | OpenAI-compatible HTTP endpoint |
| `bedrock` | Models deployed on AWS Bedrock | Alpha | - | OpenAI-compatible HTTP endpoint |
| `model2vec` | Model2Vec static word embeddings | Alpha | - | Model2Vec format |
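Spice provides three ways to handle embedding columns in datasets: computing them just-in-time (JIT) at query time, precomputing them in a data accelerator, or passing through embedding columns that already exist in the source data.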
Define embedding models in the spicepod.yaml file as top-level components.
Example configuration in spicepod.yaml:
Embedding models can be used via:
To create vector embeddings for specific dataset columns, define them under columns in the spicepod.yaml file, within the datasets section.
Example configuration in spicepod.yaml:
See the embeddings and datasets reference for more details.
JIT embeddings are computed at query time. This is useful when precomputing is impractical (e.g., large or rarely queried datasets, or heavy prefiltering). To add a JIT embedding column, specify it in the dataset's column config.
To speed up queries, embeddings can be precomputed and stored in a data accelerator. Enable this by adding:
to the dataset configuration. Other data accelerator settings are optional and can be applied as described in their respective documentation.
Full example:
If the dataset already contains embedding columns, Spice can use them for vector search and other embedding features. The schema must match that of Spice-generated embeddings (or be adapted with a view).
Example:
A sales table with an address column and its embedding:
The same table if it was chunked:
Passthrough embedding columns must still be defined in the spicepod.yaml file. The Spice instance must also have access to the same embedding model used to generate the embeddings.
To ensure compatibility, embedding columns must meet these requirements:
- The underlying text column must be of `string` Arrow data type.
- The embedding column must be named `<column_name>_embedding` (e.g., `review_embedding` for a `review` column).
- The embedding column must have type `FixedSizeList[Float32 or Float64, N]` for unchunked data, where `N` is the embedding vector size.
- The embedding column must have type `List[FixedSizeList[Float32 or Float64, N]]` for chunked data.
- For chunked data, a column named `<column_name>_offsets` must exist with type `List[FixedSizeList[Int32, 2]]`, where each pair `[start, end]` maps a chunk to its text segment.
- For example, `[[0, 100], [101, 200]]` means two chunks covering indices 0 to 100 and 101 to 200.

Following these guidelines ensures that the dataset's pre-existing embeddings are fully compatible with Spice.
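The shape requirements for chunked data can be checked before a dataset is handed to Spice. The sketch below is a hypothetical pre-flight check in plain Python, not part of Spice. `validate_chunked_row` is an illustrative name, rows are modelled as nested Python lists rather than Arrow arrays, and it assumes one offsets pair per chunk embedding.

```python
def validate_chunked_row(embeddings, offsets, dim):
    """Return a list of problems found in one row of chunked passthrough data.

    embeddings: list of chunk vectors, mirroring List[FixedSizeList[Float, N]]
    offsets:    list of [start, end] pairs, mirroring List[FixedSizeList[Int32, 2]]
    dim:        expected embedding vector size N
    """
    problems = []
    # Assumption: every chunk embedding has exactly one offsets pair.
    if len(embeddings) != len(offsets):
        problems.append("chunk count mismatch between embeddings and offsets")
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            problems.append(f"chunk {i}: expected {dim} values, got {len(vec)}")
    for i, pair in enumerate(offsets):
        if len(pair) != 2:
            problems.append(f"offset {i}: expected [start, end]")
        elif pair[0] < 0 or pair[0] > pair[1]:
            problems.append(f"offset {i}: invalid range {list(pair)}")
    return problems

# A well-formed row: two chunks, two offsets pairs, vectors of size 2.
good = validate_chunked_row([[0.1, 0.2], [0.3, 0.4]], [[0, 100], [101, 200]], dim=2)

# A malformed row: wrong vector size, reversed range, and mismatched counts.
bad = validate_chunked_row([[0.1, 0.2, 0.3]], [[5, 2], [0, 1]], dim=2)
```

An empty result means the row passed these structural checks. It does not confirm that the vectors came from the same embedding model that Spice is configured to use.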
Spice supports chunking large text columns before embedding, which is useful for Document Tables. Chunking helps return only the most relevant text during search. Configure chunking in the embedding config:
The body column will be split into chunks of about 512 tokens, preserving sentence and semantic boundaries. See the API reference for details.
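As a rough illustration of splitting text on sentence boundaries up to a target size, the following sketch packs whole sentences into chunks. It is not the splitter Spice uses: tokens are approximated by whitespace-separated words, sentence detection is a simple regular expression, and `chunk_by_sentences` and its `target_chunk_size` argument are hypothetical names for this example.

```python
import re

def chunk_by_sentences(text, target_chunk_size=512):
    """Greedily pack whole sentences into chunks of at most target_chunk_size words.

    A single sentence longer than the target becomes its own oversized chunk,
    because this sketch never splits inside a sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the target.
        if current and current_len + n_tokens > target_chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_by_sentences(
    "One two three. Four five six. Seven eight nine.", target_chunk_size=6
)
```

With a target of six words, the first two sentences share a chunk and the third starts a new one, so no sentence is cut in half.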
The row_id field specifies which column(s) uniquely identify each row, similar to a primary key. This is important for chunked embeddings, so that operations (e.g., v1/search) can map multiple chunked vectors to a single row. Set row_id in columns[*].embeddings[*].row_id.
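The sketch below shows why a row identifier matters once a row has several chunk vectors: matches arrive per chunk and must be collapsed to one result per row. This is a hypothetical illustration in plain Python, not how `v1/search` is implemented. The hit dictionaries, their field names, and the "keep the best-scoring chunk" rule are assumptions.

```python
def best_hit_per_row(chunk_hits):
    """Collapse chunk-level matches to one best-scoring match per row."""
    best = {}
    for hit in chunk_hits:
        row_id = hit["row_id"]
        # Keep only the highest-scoring chunk seen so far for this row.
        if row_id not in best or hit["score"] > best[row_id]["score"]:
            best[row_id] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Hypothetical chunk-level matches: row 7 matched on two different chunks.
hits = [
    {"row_id": 7, "chunk": 0, "score": 0.61},
    {"row_id": 7, "chunk": 2, "score": 0.88},
    {"row_id": 3, "chunk": 1, "score": 0.75},
]

rows = best_hit_per_row(hits)
```

Without a stable identifier, the two matches for row 7 could not be recognised as belonging to the same row and would surface as separate results.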
import DocCardList from '@theme/DocCardList';