/docs/website/versioned_docs/version-1.6.x/features/search/vector-search.md


---
title: 'Vector-Based Search'
sidebar_label: 'Vector Search'
description: 'Learn how Spice can perform searches using vector-based methods.'
sidebar_position: 1
tags:
  - search
  - models
  - embeddings
---

🎓 Learn how it works in the engineering blog post "Amazon S3 Vectors with Spice".

Spice provides vector-based search: dataset columns are embedded with a configured model so that queries can match rows by semantic similarity rather than by exact keywords.

Embedding Models

Spice supports two types of embedding providers:

Local embedding models, e.g., sentence-transformers/all-MiniLM-L6-v2.
Remote embedding services, e.g., the OpenAI Embeddings API.

Embedding models are defined in the spicepod.yaml file as top-level components.

Configuring Datasets for Embeddings

To enable vector search, specify embeddings for the dataset columns in spicepod.yaml:

This configuration instructs Spice to create embeddings from the body column, enabling similarity searches on body content.
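The relationship between the two pieces of configuration (a top-level embedding model and a dataset column that references it) can be sketched in Python. The dict below mirrors what a parsed spicepod.yaml might look like. The key names (`embeddings`, `datasets`, `columns`, `name`, `from`), the dataset name `issues`, the model name `local_embedding_model`, and the `huggingface:` source string are assumptions for illustration, not a confirmed schema. The helper only checks that every column-level embedding refers to a model declared at the top level.

```python
# Hypothetical parsed form of a spicepod.yaml; all key names are assumptions.
spicepod = {
    "embeddings": [
        {
            "name": "local_embedding_model",
            "from": "huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2",
        }
    ],
    "datasets": [
        {
            "name": "issues",
            "columns": [
                # The body column is embedded with the model declared above.
                {"name": "body", "embeddings": [{"from": "local_embedding_model"}]}
            ],
        }
    ],
}


def undefined_embedding_models(pod):
    """Return (dataset, column, model) triples whose model is not declared."""
    declared = {model["name"] for model in pod.get("embeddings", [])}
    missing = []
    for dataset in pod.get("datasets", []):
        for column in dataset.get("columns", []):
            for embedding in column.get("embeddings", []):
                if embedding["from"] not in declared:
                    missing.append((dataset["name"], column["name"], embedding["from"]))
    return missing


print(undefined_embedding_models(spicepod))  # prints [] for the config above
```

A non-empty result points at a column whose embedding model name has no matching top-level definition.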

Performing a Vector Search

Execute similarity searches using Spice's HTTP API:

For detailed API documentation, see Search API Reference.
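A request of this kind could be sent with Python's standard library, as sketched below. The endpoint `http://localhost:8090/v1/search`, the JSON field names (`text`, `datasets`, `limit`, `additional_columns`), and the dataset name `issues` are assumptions here; the Search API Reference is the authority on the request shape.

```python
import json
import urllib.request

# Assumed address of a locally running Spice HTTP endpoint.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_request(text, datasets, limit=3, additional_columns=None):
    """Build (but do not send) a POST request for a similarity search."""
    payload = {"text": text, "datasets": list(datasets), "limit": limit}
    if additional_columns:
        payload["additional_columns"] = list(additional_columns)
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, datasets, **options):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(text, datasets, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With a Spice instance running and an embedded dataset loaded, `search("cutting edge AI", ["issues"], limit=2)` would return the decoded response body.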

Retrieving Full Documents

If the dataset uses chunking, Spice returns relevant chunks. To retrieve entire documents, include the embedding column in additional_columns:
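As a sketch of such a request body in Python, the helper below adds the embedded column to `additional_columns` without duplicating it. It assumes that "the embedding column" means the source column that was embedded (`body` in the configuration described earlier) and that the request fields are named `text`, `datasets`, and `additional_columns`. Both points are assumptions to verify against the Search API Reference.

```python
import json


def with_full_document(payload, embedded_column):
    """Return a copy of payload that also requests the full embedded column."""
    columns = list(payload.get("additional_columns", []))
    if embedded_column not in columns:
        columns.append(embedded_column)
    return {**payload, "additional_columns": columns}


# Hypothetical search over a chunked "body" column of a dataset named "issues".
request_body = with_full_document(
    {"text": "cutting edge AI", "datasets": ["issues"], "additional_columns": ["title"]},
    "body",
)
print(json.dumps(request_body, indent=2))
```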

Response:

SQL UDTF

The embedding index can also be searched from SQL through a user-defined table function (UDTF).

SQL Function Signature of vector_search:

:::warning[Limitations]

The vector_search UDTF does not support chunked embedding columns.

:::
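A query that uses the UDTF could be assembled and submitted from Python as sketched below. The argument order `vector_search(<dataset>, '<query text>')`, the dataset name `reviews`, and the `http://localhost:8090/v1/sql` endpoint that accepts SQL as a plain-text POST body are assumptions; confirm the argument list against the function signature before relying on it.

```python
import re
import urllib.request

SQL_URL = "http://localhost:8090/v1/sql"  # assumed local Spice SQL endpoint

# Accept only plain or dot-qualified identifiers as table names.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def vector_search_sql(table, query, limit=5):
    """Build a SELECT over vector_search with a safely quoted query string."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    quoted = query.replace("'", "''")  # standard SQL single-quote escaping
    return f"SELECT * FROM vector_search({table}, '{quoted}') LIMIT {int(limit)}"


def run_sql(sql):
    """POST the SQL text and return the raw response body as a string."""
    request = urllib.request.Request(
        SQL_URL,
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


print(vector_search_sql("reviews", "bats for kids' cricket"))
# prints: SELECT * FROM vector_search(reviews, 'bats for kids'' cricket') LIMIT 5
```

Per the limitation above, a query like this applies only to columns with non-chunked embeddings.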

Using Existing Embeddings

Spice supports vector searches on datasets with pre-existing embeddings. Ensure the dataset meets these requirements:

Column Naming: The embedding column name must be <original_column_name>_embedding.
Data Types: Embedding columns must use Arrow types:
- Non-chunked: FixedSizeList[Float32|Float64, N]
- Chunked: List[FixedSizeList[Float32|Float64, N]]
Offset Columns: For chunked embeddings, an additional offset column (<column_name>_offsets) is required:
- Type: List[FixedSizeList[Int32, 2]], indicating chunk boundaries.
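The naming rules above can be checked mechanically before a dataset is loaded. The sketch below works on plain column names only (it does not inspect Arrow types), and the column names in the example are illustrative.

```python
def missing_embedding_columns(column_names, source_column, chunked=False):
    """List columns that source_column needs but the table lacks."""
    required = [f"{source_column}_embedding"]
    if chunked:
        # Chunked embeddings also need the offsets column.
        required.append(f"{source_column}_offsets")
    present = set(column_names)
    if source_column not in present:
        # The underlying text column itself must exist too.
        required.insert(0, source_column)
    return [name for name in required if name not in present]


columns = ["order_number", "address", "address_embedding"]
print(missing_embedding_columns(columns, "address"))                # prints []
print(missing_embedding_columns(columns, "address", chunked=True))  # prints ['address_offsets']
```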

Example dataset structure (sales table):

Non-chunked:

Chunked:

Constraints

Underlying Column Presence:
- The underlying column must exist in the table and be of the Arrow string data type.
Embeddings Column Naming Convention:
- For each underlying column, the corresponding embeddings column must be named as <column_name>_embedding. For example, a customer_reviews table with a review column must have a review_embedding column.
Embeddings Column Data Type:
- The embeddings column must have the following Arrow data type when loaded into Spice:
  1. FixedSizeList[Float32 or Float64, N], where N is the dimension (size) of the embedding vector. FixedSizeList is used for efficient storage and processing of fixed-size vectors.
  2. If the column is chunked, use List[FixedSizeList[Float32 or Float64, N]].
Offset Column for Chunked Data:
- If the underlying column is chunked, there must be an additional offset column named <column_name>_offsets with the Arrow data type List[FixedSizeList[Int32, 2]], where each pair marks the boundaries of one chunk.

A dataset that meets these requirements is compatible with vector search and the other embedding features in Spice.

Example

A table sales with an address column and corresponding embedding column(s).
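As an illustration, the rows below show one possible shape for such a table in plain Python, together with a check of the structural rules from the constraints. The values, the 4-dimensional vectors, and the reading of each offset pair as `[start, end)` character positions are all assumptions made for the example. Real embeddings are much wider and are stored as Arrow arrays.

```python
DIM = 4  # illustrative embedding dimension N

# Non-chunked: one vector per row.
non_chunked_row = {
    "address": "123 Main St, Springfield",
    "address_embedding": [0.1, 0.2, 0.3, 0.4],
}

# Chunked: one vector and one offset pair per chunk.
chunked_row = {
    "address": "123 Main St, Springfield",
    "address_embedding": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    "address_offsets": [[0, 11], [13, 24]],
}


def is_vector(value, dim=DIM):
    """True if value is a flat list of dim floats."""
    return len(value) == dim and all(isinstance(x, float) for x in value)


def check_row(row, column, dim=DIM):
    """Return True if the row follows the non-chunked or chunked layout."""
    embedding = row[f"{column}_embedding"]
    offsets = row.get(f"{column}_offsets")
    if offsets is None:
        return is_vector(embedding, dim)
    return (
        len(embedding) == len(offsets)
        and all(is_vector(vector, dim) for vector in embedding)
        and all(len(pair) == 2 for pair in offsets)
    )


def chunk_texts(row, column):
    """Slice the source text by its offsets (assumed [start, end) positions)."""
    return [row[column][start:end] for start, end in row[f"{column}_offsets"]]


print(check_row(non_chunked_row, "address"), check_row(chunked_row, "address"))
# prints: True True
print(chunk_texts(chunked_row, "address"))
# prints: ['123 Main St', 'Springfield']
```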