What is Tantivy?

Tantivy is an open-source, full-text search engine library written in Rust. Inspired by Apache Lucene, Tantivy provides fast, reliable text indexing and search as an embeddable library rather than a standalone server.

Full-text search is a core capability for any system that needs to find relevant documents from a large corpus based on natural language queries. The dominant technology behind full-text search -- Apache Lucene -- has been the industry standard for over two decades. But Lucene is a Java library, and embedding it into non-JVM systems introduces complexity, overhead, and operational constraints.

Tantivy brings Lucene-class search capabilities to the Rust ecosystem. It implements the same fundamental data structures and algorithms -- inverted indexes, BM25 scoring, segment-based architecture -- while taking advantage of Rust's memory safety, zero-cost abstractions, and native performance. Like Lucene, Tantivy is a library, not a server. It is designed to be embedded directly into applications that need search functionality without the overhead of running a separate search service.

Core Architecture

Tantivy's architecture follows the same proven design that made Lucene successful: documents are indexed into segments, each containing an inverted index that maps terms to the documents where they appear.

Inverted Indexes

The inverted index is the fundamental data structure behind full-text search. For every unique term that appears in the indexed documents, the inverted index maintains a posting list -- a sorted list of document IDs where that term appears, along with metadata like term frequency and positions.

For example, indexing three documents about database topics might produce:

"query"     → [doc_1, doc_2, doc_3]
"optimize"  → [doc_1, doc_3]
"postgres"  → [doc_2]
"index"     → [doc_1, doc_2]

When a search query arrives, Tantivy tokenizes the query, looks up the posting lists for each term, and combines them to find matching documents. This is fast because the work is proportional to the number of matching documents, not the total corpus size.
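The AND-combination of posting lists can be sketched as a two-pointer walk over sorted document IDs. This is a simplified conceptual model, not Tantivy's actual implementation:

```rust
/// Intersect two sorted posting lists (document IDs) with a two-pointer walk.
/// Work is proportional to the lengths of the lists, not the corpus size.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]);
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}

fn main() {
    // Posting lists from the example above: "query" AND "optimize".
    let query = vec![1, 2, 3];
    let optimize = vec![1, 3];
    println!("{:?}", intersect(&query, &optimize)); // [1, 3]
}
```

Because both lists are sorted, each pointer only moves forward, so the intersection costs a single pass over the shorter lists involved.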

Tantivy stores posting lists in a compressed format optimized for sequential access and intersection operations. The compression uses techniques like variable-byte encoding and block-based compression that balance space efficiency with decompression speed.

Segments and the Segment Architecture

Tantivy organizes its index into segments, each of which is an independent, self-contained inverted index. New documents are written to new segments rather than modifying existing ones. This append-only design has several advantages:

  • Concurrent reads and writes: Readers operate on immutable segments while writers create new ones, so indexing never blocks searching.
  • Crash safety: If the process crashes mid-write, only the incomplete segment is lost. Existing segments remain intact.
  • Efficient updates: Deleting a document marks it as deleted in a bitmap rather than physically removing it from the segment. The deleted document is excluded from search results and physically cleaned up during merging.

Each segment contains its own inverted index, stored fields, fast fields (columnar numeric data for sorting and filtering), and term dictionary. The term dictionary maps terms to their posting lists and uses a finite state transducer (FST) for compact, fast prefix lookups.

Segment Merging

Over time, as new segments accumulate, the index can become fragmented -- many small segments increase the overhead of searching (each segment must be searched independently and results merged). Tantivy addresses this with segment merging, a background process that combines multiple segments into larger ones.

Merging serves multiple purposes:

  • Performance: Fewer, larger segments reduce per-query overhead
  • Space reclamation: Documents marked as deleted are physically removed during merging
  • Compaction: The merged segment has a more compact representation than the sum of its inputs

Tantivy uses a configurable merge policy that determines when and how segments are merged. The default policy targets a logarithmic distribution of segment sizes, similar to Lucene's tiered merge policy. This balances merge cost against search performance.

Key Features

BM25 Scoring

Tantivy uses BM25 (Best Match 25) as its default ranking function. BM25 scores documents based on three factors: term frequency (how often query terms appear in the document), inverse document frequency (how rare those terms are across the corpus), and document length normalization (penalizing longer documents that naturally contain more term occurrences).

BM25 is the same ranking function used by Elasticsearch, Apache Solr, and most production search engines. Using BM25 as the default means Tantivy produces relevance rankings comparable to these established systems.
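The BM25 formula itself is compact. Here is a minimal sketch of a single-term score using the conventional parameters k1 = 1.2 and b = 0.75 (an illustration of the math, not Tantivy's scoring code):

```rust
/// A minimal BM25 term score, assuming standard parameters k1 = 1.2, b = 0.75.
/// tf: term frequency in the document; df: number of docs containing the term;
/// n_docs: corpus size; dl / avg_dl: this document's length vs. the average.
fn bm25(tf: f64, df: f64, n_docs: f64, dl: f64, avg_dl: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // Inverse document frequency: rarer terms score higher.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    // Saturating term frequency with document-length normalization.
    let norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * dl / avg_dl));
    idf * norm
}

fn main() {
    // A rare term (in 10 of 1000 docs) outscores a common one (in 800 of 1000).
    let rare = bm25(3.0, 10.0, 1000.0, 100.0, 120.0);
    let common = bm25(3.0, 800.0, 1000.0, 100.0, 120.0);
    assert!(rare > common);
    println!("rare = {:.3}, common = {:.3}", rare, common);
}
```

The saturation term is what distinguishes BM25 from raw TF-IDF: doubling a term's frequency does not double its contribution, which prevents keyword-stuffed documents from dominating results.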

Phrase Queries and Positional Indexing

Tantivy supports phrase queries -- queries that require terms to appear in a specific order and proximity. The query "database optimization" matches only documents where "database" and "optimization" appear adjacent and in that order, excluding documents where the terms appear separately.

Phrase queries require positional information in the inverted index. For each term occurrence, Tantivy records not just the document ID but also the position within the document. This positional data enables phrase matching, proximity queries (terms within N positions of each other), and highlighting of matching passages.
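Adjacency checking against positional data can be sketched as a merge over two sorted position lists (a simplified model of a two-term phrase match):

```rust
/// Check whether `second` appears immediately after `first` in a document,
/// given each term's sorted position list (a simplified positional-index model).
fn phrase_match(first: &[u32], second: &[u32]) -> bool {
    let mut j = 0;
    for &p in first {
        // Advance in `second` until we reach or pass position p + 1.
        while j < second.len() && second[j] < p + 1 {
            j += 1;
        }
        if j < second.len() && second[j] == p + 1 {
            return true;
        }
    }
    false
}

fn main() {
    // "database" at positions [4, 9]; "optimization" at positions [5, 20].
    assert!(phrase_match(&[4, 9], &[5, 20])); // positions 4 -> 5: adjacent
    assert!(!phrase_match(&[4, 9], &[7, 20])); // never adjacent
    println!("ok");
}
```

Proximity queries generalize the same idea by accepting any gap up to N positions instead of exactly one.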

Faceted Search

Faceted search enables categorization and filtering of search results by structured attributes. Tantivy supports hierarchical facets -- structured paths like /category/databases/postgresql -- that allow users to drill down into search results by category.

Facets are stored as a special field type in the index and can be combined with full-text queries. A search for "query optimization" can be filtered to only documents faceted under /category/databases, with counts showing how many results exist under each sub-facet.
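The per-sub-facet counts can be modeled as grouping matching documents by the first path segment below the requested prefix (a conceptual sketch, not Tantivy's facet collector):

```rust
use std::collections::BTreeMap;

/// Count matching documents under each child of a facet prefix
/// (a simplified model of hierarchical facet counting).
fn facet_counts<'a>(docs: &[&'a str], prefix: &str) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for path in docs {
        if let Some(rest) = path.strip_prefix(prefix) {
            // First path segment below the prefix, e.g. "postgresql".
            let child = rest.trim_start_matches('/').split('/').next().unwrap();
            if !child.is_empty() {
                *counts.entry(child).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let docs = [
        "/category/databases/postgresql",
        "/category/databases/postgresql",
        "/category/databases/mysql",
        "/category/search/tantivy",
    ];
    let counts = facet_counts(&docs, "/category/databases");
    println!("{:?}", counts); // {"mysql": 1, "postgresql": 2}
}
```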

Range Queries and Fast Fields

Tantivy supports range queries on numeric and date fields -- for example, finding documents where published_date falls between two dates or where price is below a threshold. These queries use fast fields, Tantivy's equivalent of Lucene's doc values.

Fast fields store columnar numeric data alongside the inverted index. Unlike the inverted index (which maps terms to documents), fast fields map documents to values. This columnar layout enables efficient sorting, filtering, and aggregation on numeric fields without reading the full document.
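The document-to-value layout can be modeled as a plain column indexed by doc ID, which a range query scans without touching stored documents (a sketch of the access pattern, not Tantivy's fast-field encoding):

```rust
/// Columnar storage: one value per document, indexed by doc ID
/// (a simplified model of fast fields / doc values).
struct FastField {
    values: Vec<i64>, // values[doc_id] = field value for that document
}

impl FastField {
    /// Collect doc IDs whose value falls in [lo, hi] without reading
    /// any stored document content.
    fn range_query(&self, lo: i64, hi: i64) -> Vec<u32> {
        self.values
            .iter()
            .enumerate()
            .filter(|&(_, &v)| lo <= v && v <= hi)
            .map(|(doc, _)| doc as u32)
            .collect()
    }
}

fn main() {
    let price = FastField { values: vec![30, 120, 45, 80, 200] };
    println!("{:?}", price.range_query(40, 100)); // [2, 3]
}
```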

Multi-Threaded Indexing

Tantivy supports multi-threaded indexing out of the box. Multiple threads can add documents to the index concurrently, with each thread writing to its own segment. This parallelism is particularly valuable for bulk indexing operations where throughput is critical.

The indexing pipeline includes configurable tokenization, concurrent segment writing, and automatic segment merging. A configurable memory budget controls how much data is buffered in memory before being flushed to disk as a new segment.

Custom Tokenizers

Tantivy provides a tokenizer pipeline that can be customized per field. The default tokenizer splits text on whitespace and punctuation and lowercases tokens; additional filters can add stop-word removal, stemming (reducing words to their root form), n-gram generation, language-specific analysis, or any other text processing step.

The tokenizer pipeline is applied both at index time (when documents are added) and at query time (when queries are processed), ensuring consistent token handling across indexing and search.
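A pipeline of this shape (split, lowercase, filter stop words) can be sketched in plain Rust. Tantivy's real tokenizers implement a trait-based pipeline; this is only a conceptual model:

```rust
/// A minimal tokenizer pipeline: split on non-alphanumeric characters,
/// lowercase, then drop stop words (a sketch, not Tantivy's tokenizer API).
fn tokenize(text: &str, stop_words: &[&str]) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !stop_words.contains(&t.as_str()))
        .collect()
}

fn main() {
    let stop = ["the", "a", "of"];
    let tokens = tokenize("The Art of Query Optimization", &stop);
    println!("{:?}", tokens); // ["art", "query", "optimization"]
}
```

Because the same function runs at index time and at query time, the query term "Optimization" and the indexed token "optimization" normalize to the same form and match.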

Tantivy vs. Apache Lucene

Tantivy and Apache Lucene share the same fundamental architecture -- inverted indexes, segment-based storage, BM25 scoring -- but differ in language, deployment model, and ecosystem.

Language and runtime: Lucene is written in Java and requires the JVM. Tantivy is written in Rust with no runtime dependencies. Rust's ownership model provides memory safety without garbage collection pauses, and its zero-cost abstractions enable performance comparable to hand-written C/C++ code.

Embedding model: Both are libraries, but embedding Lucene into a non-Java application requires JNI bridges, JVM lifecycle management, and cross-language memory coordination. Tantivy can be embedded directly into any Rust application or accessed through FFI bindings from C, Python, or other languages with simpler interop.

Feature parity: Lucene has a broader feature set developed over two decades -- custom similarity models, spatial search, auto-suggest, and a rich analyzer ecosystem. Tantivy covers the core search features (inverted indexes, BM25, phrase queries, facets, range queries) but does not yet match Lucene's full breadth. For most full-text search use cases, Tantivy's feature set is sufficient.

Performance characteristics: Benchmarks show Tantivy achieving competitive or superior indexing throughput and query latency compared to Lucene, particularly for single-node workloads. Rust's predictable performance -- no GC pauses, no JIT warm-up -- makes Tantivy's latency profile more consistent, which matters for search workloads where tail latency affects user experience.

Tantivy vs. Elasticsearch

Elasticsearch is a distributed search and analytics engine built on top of Lucene. Comparing Tantivy to Elasticsearch is comparing a library to a full system.

Architecture: Elasticsearch is a distributed server with REST APIs, cluster management, replication, and sharding. Tantivy is an embeddable library with no network layer, clustering, or server infrastructure. Elasticsearch adds operational complexity but provides horizontal scalability. Tantivy adds zero operational overhead but requires the application to handle distribution if needed.

Deployment: Elasticsearch requires deploying, monitoring, and maintaining a cluster. Tantivy is embedded directly in the application process -- there is no separate system to manage, no network hops between the application and the search engine, and no data synchronization between systems.

Use case fit: Choose Elasticsearch when you need a standalone, distributed search service with its own cluster infrastructure, REST APIs, and a rich ecosystem of clients and integrations. Choose Tantivy when you need search capabilities embedded directly in a Rust application without the overhead of a separate service.

Performance: For single-node search workloads, Tantivy's embedded model eliminates the network serialization and deserialization overhead inherent in Elasticsearch's HTTP-based API. Query latency is lower because the search happens in-process. For distributed workloads across large clusters, Elasticsearch's built-in sharding and replication provide capabilities that Tantivy does not include.

How Spice Uses Tantivy

Spice embeds Tantivy as its full-text search engine. When users enable full-text search on accelerated datasets, Spice builds Tantivy indexes automatically and exposes search through SQL. This powers BM25 keyword search, which combines with vector search to enable hybrid search in a single SQL query.

Tantivy's Rust-native design aligns with Spice's Rust-based architecture. Spice is built on Apache DataFusion and Apache Arrow, both of which are Rust-native. Embedding Tantivy means full-text search runs in the same process, with the same memory model, and without the overhead of crossing language boundaries (as would be required with a JVM-based library like Lucene) or network boundaries (as would be required with a separate service like Elasticsearch).

Automatic Index Management

When full-text search is enabled on a dataset in Spice, the runtime automatically:

  1. Builds a Tantivy index over the specified text columns
  2. Keeps the index synchronized as source data changes through change data capture
  3. Exposes the index through SQL query functions

Users do not interact with Tantivy directly -- they write SQL queries and Spice translates full-text search operations into Tantivy queries internally.

Hybrid Search in SQL

Spice combines Tantivy-powered full-text search with vector similarity search in a unified SQL interface. A single query can perform BM25 keyword search, vector search, or both, with results fused using Reciprocal Rank Fusion (RRF):

-- Hybrid search combining BM25 full-text and vector similarity
SELECT * FROM search(
  'knowledge_base',
  'kubernetes deployment troubleshooting',
  mode => 'hybrid',
  limit => 10
)

This hybrid approach addresses the vocabulary mismatch problem -- BM25 handles exact keyword matches while vector search captures semantic similarity -- without requiring separate search infrastructure or complex result merging logic in the application layer.
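Reciprocal Rank Fusion itself is simple: each ranked list contributes 1 / (k + rank) per document, and documents are re-sorted by the summed score. A sketch using the conventional constant k = 60 (an illustration of the fusion step, not Spice's implementation):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: fuse two ranked lists of doc IDs.
/// Each list contributes 1 / (k + rank) per document; k = 60 is the
/// conventional constant.
fn rrf(bm25: &[u32], vector: &[u32], k: f64) -> Vec<u32> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in [bm25, vector] {
        for (rank, &doc) in list.iter().enumerate() {
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Sort by fused score, best first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().map(|(doc, _)| doc).collect()
}

fn main() {
    // Doc 7 ranks well in both lists, so it wins after fusion.
    let keyword = [3, 7, 1];
    let semantic = [7, 9, 3];
    println!("{:?}", rrf(&keyword, &semantic, 60.0)); // [7, 3, 9, 1]
}
```

Because RRF uses only ranks, it needs no score normalization between BM25 scores and vector distances, which live on incompatible scales.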

When to Use Tantivy

Tantivy is the right choice when:

  • You are building a Rust application that needs search: Tantivy integrates natively with Rust codebases. No JVM, no external services, no FFI complexity.
  • You need an embedded search library: If search is a feature within a larger application (rather than a standalone service), Tantivy's library model eliminates the operational overhead of running and synchronizing a separate search system.
  • Latency consistency matters: Rust's lack of garbage collection means no GC pauses during search. Query latency is predictable, which matters for user-facing search and real-time applications.
  • You need core full-text search features: BM25 scoring, phrase queries, faceted search, range queries, and multi-threaded indexing cover the needs of most full-text search use cases.

Elasticsearch or Solr remain better choices when you need a standalone distributed search cluster, built-in REST APIs, a rich client ecosystem, or advanced features like learning-to-rank, cross-cluster replication, or the full Lucene analyzer ecosystem.

Advanced Topics

The Term Dictionary and Finite State Transducers

Tantivy's term dictionary maps terms to their posting list offsets in the inverted index. Rather than using a hash map or B-tree, Tantivy uses a finite state transducer (FST) -- a compact, immutable data structure that represents a sorted set of key-value pairs.

FSTs are space-efficient because they share common prefixes and suffixes between terms. For a typical English text corpus, the FST representation of the term dictionary is significantly smaller than a hash map or sorted array. FSTs also support efficient prefix lookups, which Tantivy uses for wildcard queries and auto-completion.

The trade-off is that FSTs are immutable -- they cannot be updated in place. This fits Tantivy's segment architecture: each segment has its own term dictionary, and new terms are added by creating new segments rather than modifying existing ones.
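The key property the FST provides -- all terms sharing a prefix form a contiguous, cheaply locatable range in the sorted dictionary -- can be illustrated with a sorted array. An FST supports the same lookup with far better compression; this sketch shows only the access pattern, not the data structure:

```rust
/// Prefix lookup over a sorted, immutable term dictionary. An FST answers
/// the same query while sharing prefixes and suffixes between terms.
fn prefix_range<'a>(terms: &'a [&'a str], prefix: &str) -> &'a [&'a str] {
    // First term >= the prefix.
    let start = terms.partition_point(|t| *t < prefix);
    // End of the run of terms that still start with the prefix.
    let end = terms[start..]
        .iter()
        .position(|t| !t.starts_with(prefix))
        .map_or(terms.len(), |i| start + i);
    &terms[start..end]
}

fn main() {
    // Term dictionaries are sorted, so all terms with a prefix are contiguous.
    let terms = ["index", "optimize", "postgres", "postgresql", "query"];
    println!("{:?}", prefix_range(&terms, "post")); // ["postgres", "postgresql"]
}
```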

Posting List Compression

Posting lists -- the sorted lists of document IDs for each term -- can be large for common terms. Tantivy compresses posting lists using a combination of techniques:

  • Delta encoding: Instead of storing absolute document IDs, Tantivy stores the difference (delta) between consecutive IDs. For dense posting lists, these deltas are small and compress well.
  • Block-based compression: Posting lists are divided into fixed-size blocks (typically 128 document IDs). Each block is compressed independently using bitpacking, where each delta is stored using only as many bits as needed. This enables fast decompression of individual blocks without decompressing the entire list.
  • Skip lists: Tantivy maintains skip pointers that allow jumping forward in a posting list without reading every entry. This accelerates intersection operations (AND queries) where the engine needs to find document IDs present in multiple posting lists.

These compression techniques reduce index size while maintaining fast query execution. The block-based approach is particularly efficient because modern CPUs can decompress a block of 128 bitpacked integers with a handful of SIMD instructions.
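The first two steps -- delta encoding and choosing a bit width for a block -- can be sketched directly (a simplified model; Tantivy's block format differs in detail):

```rust
/// Delta-encode a sorted posting list: store gaps between consecutive doc IDs.
fn delta_encode(doc_ids: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    doc_ids
        .iter()
        .map(|&id| {
            let delta = id - prev;
            prev = id;
            delta
        })
        .collect()
}

/// Bits needed to bitpack a block of deltas: the width of the largest delta.
fn bits_per_delta(deltas: &[u32]) -> u32 {
    32 - deltas.iter().copied().max().unwrap_or(1).leading_zeros()
}

fn main() {
    // A dense posting list compresses well: small deltas, few bits each.
    let postings = [100, 103, 104, 110, 112];
    let deltas = delta_encode(&postings);
    println!("{:?}", deltas); // [100, 3, 1, 6, 2]
    // After the first entry, the largest gap is 6, so 3 bits per delta suffice
    // instead of 32 for raw doc IDs.
    println!("{} bits", bits_per_delta(&deltas[1..]));
}
```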

Concurrent Search and Indexing

Tantivy uses a searcher snapshot model to enable concurrent reads and writes. When a search is executed, Tantivy captures a snapshot of the current set of segments and searches that snapshot. New segments created by concurrent indexing operations are not visible to in-progress searches -- they become visible only when the next searcher snapshot is created.

This model provides read consistency (a search sees a consistent view of the index) without blocking (indexing continues unimpeded while searches execute). The searcher snapshot is lightweight -- it references existing immutable segments rather than copying data.

The IndexReader component manages searcher lifecycle, including warming (pre-loading segment data into OS page cache) and recycling (reusing searcher resources across queries). Applications can configure the reload policy to control how quickly new segments become visible to searches.

Tantivy's Scoring Pipeline

While BM25 is the default, Tantivy's scoring pipeline is customizable. The pipeline consists of:

  1. Weight creation: At query time, each query clause creates a Weight object that encapsulates the global statistics (document frequency, average document length) needed for scoring.
  2. Scorer iteration: The Weight produces a Scorer that iterates over matching documents in a segment. The scorer combines posting list traversal with score computation.
  3. Score combination: For multi-term queries, individual term scores are combined according to the query structure -- boolean queries sum the scores of their matching clauses, with minimum-should-match semantics constraining how many optional clauses must match.

Developers can implement custom Weight and Scorer types to use alternative scoring functions (e.g., BM25F for multi-field scoring, or custom learned ranking models) while reusing Tantivy's indexing and posting list infrastructure.

Tantivy FAQ

How does Tantivy compare to Apache Lucene?

Tantivy and Lucene share the same core architecture -- inverted indexes, segment-based storage, BM25 scoring -- but differ in language and runtime. Lucene is written in Java and requires the JVM. Tantivy is written in Rust with no runtime dependencies. Tantivy offers competitive performance with more predictable latency (no garbage collection pauses), while Lucene has a broader feature set developed over two decades.

Is Tantivy a replacement for Elasticsearch?

No. Elasticsearch is a distributed search server built on Lucene, with clustering, REST APIs, and operational tooling. Tantivy is an embeddable library without networking or distribution. They serve different use cases: Elasticsearch for standalone distributed search infrastructure, Tantivy for embedding search directly into an application without external dependencies.

What search features does Tantivy support?

Tantivy supports BM25 scoring, phrase queries, boolean queries, range queries on numeric and date fields, faceted search with hierarchical facets, fuzzy queries, regex queries, and custom tokenizers. It also provides multi-threaded indexing, configurable merge policies, and concurrent search during indexing.

Can Tantivy be used from languages other than Rust?

Yes. Tantivy has bindings for Python (tantivy-py), and its C-compatible FFI interface allows integration from any language that supports C interop. However, the primary and most complete API is the native Rust API. The Python bindings cover the most common use cases but may not expose every feature.

How does Spice use Tantivy?

Spice embeds Tantivy as its full-text search engine. When full-text search is enabled on an accelerated dataset, Spice automatically builds and maintains Tantivy indexes and exposes search through SQL. This powers BM25 keyword search, which combines with vector search to enable hybrid search in a single SQL query -- without requiring a separate search service.

See Spice in action

Walk through your use case with an engineer and see how Spice handles federation, acceleration, and AI integration for production workloads.

Talk to an engineer