What is Reciprocal Rank Fusion (RRF)?

Reciprocal Rank Fusion (RRF) is a ranking algorithm that combines result lists from multiple retrieval methods into a single, unified ranking. Instead of comparing raw scores -- which are on incompatible scales -- RRF uses only each result's position in each list. Documents appearing near the top of multiple result lists get the highest combined scores.

Hybrid search runs two retrieval methods in parallel -- vector similarity search for semantic relevance and BM25 full-text search for keyword precision -- and combines their results. The challenge is that the scores from these two methods are on fundamentally different scales: cosine similarity scores range from -1 to 1, while BM25 scores are unbounded positive numbers. Combining raw scores directly produces unpredictable results.

Reciprocal Rank Fusion solves this by ignoring raw scores entirely. It assigns each result a score based only on where it ranked in each result list, then sums those scores. This rank-based approach is robust to score distribution differences, requires no calibration, and is straightforward to implement. It was introduced by Cormack, Clarke, and Buettcher in their 2009 paper and has become the default fusion algorithm for hybrid search systems.

The RRF Formula

The RRF score for a document d across n ranked lists is:

RRF(d) = Σ [ 1 / (k + rank_i(d)) ]
         i=1 to n

Where:

  • rank_i(d) is the position of document d in the i-th result list (1-indexed, so the top result is rank 1)
  • k is a constant (typically 60) that dampens the impact of high-ranking documents

If a document does not appear in a result list, it contributes 0 to the sum for that list.

Example

Consider a hybrid search query that returns the following results:

Document   Vector search rank   BM25 rank
Doc A      1                    5
Doc B      4                    1
Doc C      2                    3
Doc D      3                    not ranked

With k = 60:

  • Doc A: 1/61 + 1/65 = 0.01639 + 0.01538 = 0.03177
  • Doc B: 1/64 + 1/61 = 0.01563 + 0.01639 = 0.03202
  • Doc C: 1/62 + 1/63 = 0.01613 + 0.01587 = 0.03200
  • Doc D: 1/63 + 0 = 0.01587

Final ranking: Doc B (0.03202) > Doc C (0.03200) > Doc A (0.03177) > Doc D (0.01587)

Document B wins because it was top-ranked by BM25 and still appeared in the vector search results. Document A was the top vector result but ranked poorly for keywords, so it is slightly behind Doc B and Doc C.
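The worked example can be reproduced with a short, generic implementation (a sketch, not any particular library's API; the document IDs and ranks come from the table above):

```python
def rrf(rankings, k=60):
    """Fuse rank positions from multiple result lists with Reciprocal Rank Fusion.

    Each ranking maps a document ID to its 1-indexed rank in one result list;
    a document absent from a list contributes 0 for that list.
    """
    scores = {}
    for ranking in rankings:
        for doc, rank in ranking.items():
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_ranks = {"Doc A": 1, "Doc C": 2, "Doc D": 3, "Doc B": 4}
bm25_ranks = {"Doc B": 1, "Doc C": 3, "Doc A": 5}  # Doc D not ranked by BM25

fused = rrf([vector_ranks, bm25_ranks])
# Order matches the example: Doc B > Doc C > Doc A > Doc D
```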

Why Use Rank Positions Instead of Scores?

The core insight behind RRF is that rank positions are more meaningful than raw scores when combining different retrieval methods.

Score incompatibility. A vector cosine similarity of 0.85 and a BM25 score of 12.4 cannot be directly compared. They are produced by different models with different normalization. Any attempt to add or average them requires arbitrary scaling decisions that can favor one method over the other depending on the current query.

Score instability. BM25 scores depend on corpus statistics (document frequency, inverse document frequency) that change as the corpus grows. Vector scores depend on the embedding model. Calibrating the scales between them would require frequent recalibration as either changes.

Rank stability. Whether a document is the top result or third result is a meaningful, stable signal that does not depend on score scale. RRF leverages this stability.

Robustness to outliers. A single extremely high BM25 score for an exact keyword match would dominate a weighted score combination. RRF gives that document only a slight advantage from its high rank rather than allowing its raw score to overwhelm the list.

The k Parameter

The constant k is the most important tuning parameter in RRF. Its role is to reduce the score advantage of documents at the very top of a list relative to those ranked slightly lower.

With k = 60 (the value recommended in the original paper):

  • Rank 1 score: 1/61 ≈ 0.01639
  • Rank 2 score: 1/62 ≈ 0.01613
  • Rank 10 score: 1/70 ≈ 0.01429
  • Rank 60 score: 1/120 ≈ 0.00833

The difference between rank 1 and rank 2 is small (1.6% relative). The difference between rank 1 and rank 60 is about 49%. RRF treats the top ranks roughly equally and has a smooth decay toward lower ranks.

With a smaller k (e.g., k = 1), the score advantage for rank 1 becomes much larger relative to rank 2, making the top result from each list more dominant in the combined ranking. With a larger k (e.g., k = 1000), all positions contribute nearly equal scores and the ranking becomes very flat.

In practice, k = 60 is a sensible default. The original paper showed that this value was robust across many benchmark datasets; empirical testing on your specific dataset and query patterns should guide further tuning.
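A few lines make the effect of k concrete: the relative score drop from rank 1 to rank 2 works out to 1/(k + 2), so small k values sharply favor each list's top result (a quick sketch; the k values are chosen for illustration):

```python
# Relative drop in RRF contribution from rank 1 to rank 2 for several k values
for k in (1, 10, 60, 1000):
    r1 = 1 / (k + 1)
    r2 = 1 / (k + 2)
    drop = (r1 - r2) / r1  # algebraically equal to 1 / (k + 2)
    print(f"k={k:>4}: rank-1={r1:.5f}  rank-2={r2:.5f}  drop={drop:.1%}")
# k=1 drops ~33.3% from rank 1 to rank 2; k=60 drops only ~1.6%
```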

RRF vs. Weighted Score Combination

The alternative to RRF is a weighted linear combination of normalized scores:

Combined(d) = α × score_vector(d) + (1 - α) × score_keyword(d)

Where α (alpha) controls the balance between the two methods (0 = all keyword, 1 = all vector).
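For comparison, a minimal sketch of the weighted alternative; min-max normalization is one common choice (not the only one) for putting both score sets on a shared [0, 1] scale before blending, and the scores here are illustrative:

```python
def minmax(scores):
    """Rescale raw scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_combine(vector_scores, keyword_scores, alpha=0.5):
    v, kw = minmax(vector_scores), minmax(keyword_scores)
    docs = set(v) | set(kw)
    # A document missing from one method contributes 0 for that method
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}

# Illustrative raw scores: cosine similarities vs. unbounded BM25 scores
combined = weighted_combine(
    {"Doc A": 0.91, "Doc B": 0.42},
    {"Doc A": 3.1, "Doc B": 12.4},
    alpha=0.7,
)
```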

When to prefer RRF:

  • When you do not want to calibrate α and cannot run evaluation experiments
  • When the score distributions of the two methods differ significantly or change over time
  • When simplicity and interpretability matter -- RRF has one meaningful parameter (k)
  • In most production deployments where both methods should contribute equally

When to prefer weighted combination:

  • When evaluation data is available and shows that one method significantly outperforms the other on your query distribution
  • When domain knowledge justifies weighting (e.g., exact product code searches where keyword precision should dominate)
  • When implementing cross-encoder re-ranking at a later stage and the first-pass scores need to reflect calibrated relevance

Most practitioners start with RRF for its robustness and only switch to weighted combination when evaluation data justifies it.

RRF with More Than Two Result Lists

RRF is not limited to two retrieval methods. The formula sums contributions from any number of result lists, making it straightforward to combine three or more signals:

RRF(d) = Σ [ 1 / (k + rank_i(d)) ]
         i=1 to n

For example, a hybrid search system might combine:

  1. Vector search results (semantic similarity via embeddings)
  2. BM25 full-text search results (keyword precision via Tantivy)
  3. Recency scores (time-weighted ranking to prefer recent documents)

Each list contributes independently, and documents that rank well across all three signals receive the highest combined scores.
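The same summation extends directly to a third signal; a sketch with made-up ranks for illustration:

```python
k = 60
rankings = [
    {"Doc A": 1, "Doc B": 2},               # vector search
    {"Doc B": 1, "Doc C": 2},               # BM25 full-text search
    {"Doc C": 1, "Doc A": 2, "Doc B": 3},   # recency
]
docs = set().union(*rankings)
# Each list a document appears in adds one 1/(k + rank) term to its score
scores = {d: sum(1 / (k + r[d]) for r in rankings if d in r) for d in docs}
best = max(scores, key=scores.get)  # Doc B: the only document in all three lists
```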

How RRF Is Used in Production

In a typical hybrid search pipeline with RRF:

  1. The query is submitted to both the vector index and the inverted index simultaneously
  2. Each index returns a ranked list of its top N candidates (e.g., the top 100 from each)
  3. The union of both lists is formed (up to 200 unique documents)
  4. RRF scores are computed for each document in the union
  5. Documents are sorted by RRF score, and the top results (e.g., top 10) are returned to the application

[Diagram] Query → Vector Search (top 100) + BM25 Search (top 100) → Union (≤ 200 docs) → RRF Scoring (1 / (k + rank)) → Final Ranking (top 10)

The number of candidates retrieved from each method (the "retrieval depth") affects both recall and computation. Retrieving more candidates (e.g., top 500 from each) increases the chance that the best documents are included in the union, but also increases the cost of RRF scoring and any subsequent re-ranking step. A retrieval depth of 50-200 is typical for most production deployments.

Advanced Topics

RRF and Multi-Stage Retrieval

RRF is commonly used as the fusion step in a two-stage retrieval pipeline:

  1. Stage 1: BM25 and vector search independently retrieve a broad candidate set (hundreds of documents) with high recall
  2. Stage 2 (RRF): Candidates from both lists are merged and scored by RRF to produce a shorter, higher-quality ranked list
  3. Stage 3 (optional): A cross-encoder re-ranker scores each remaining candidate against the query for maximum precision

RRF's rank-based approach makes it fast and deterministic at Stage 2, which is important when the pipeline must complete within a latency budget. Cross-encoder re-ranking (Stage 3) is expensive -- each candidate-query pair requires a full model forward pass -- so applying it only to the top-20 or top-50 from RRF limits its latency impact.

RRF in Distributed Search

In distributed search systems where each shard returns a local top-k result list, RRF is applied after gathering results from all shards. The shard-local rankings are often imperfect (a document ranked first on one shard might rank 50th globally), but RRF's tolerance for rank imprecision makes it robust to this limitation.

For exact global relevance ordering, some systems perform a second-pass re-ranking over the merged candidates using raw scores from each shard. This is more expensive than RRF but produces more accurate global rankings when shard score calibration is reliable.

Evaluating RRF Quality

The standard metric for evaluating a retrieval system's ranking quality is Normalized Discounted Cumulative Gain (NDCG). NDCG@10 measures how well the top 10 results are ordered relative to an ideal ranking, discounting the contribution of relevant results that appear lower in the list.

To evaluate whether RRF improves over single-method retrieval:

  1. Collect a ground truth relevance dataset -- a set of queries with known relevant documents
  2. Run each retrieval method independently (vector only, BM25 only) and measure NDCG@10 for each
  3. Run hybrid search with RRF and measure NDCG@10
  4. Compare: ideally, RRF should outperform both individual methods
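The evaluation loop needs an NDCG implementation; a minimal sketch, assuming graded relevance labels are available as a dict per query (the document IDs and labels below are illustrative):

```python
import math

def ndcg_at_10(ranked_docs, relevance):
    """NDCG@10: DCG of the returned top 10 divided by the DCG of an ideal ordering."""
    def dcg(docs):
        return sum(
            relevance.get(doc, 0) / math.log2(pos + 1)
            for pos, doc in enumerate(docs[:10], start=1)
        )
    ideal = sorted(relevance, key=relevance.get, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranked_docs) / ideal_dcg if ideal_dcg else 0.0

labels = {"doc-1": 3, "doc-2": 1}                  # graded ground-truth relevance
perfect = ndcg_at_10(["doc-1", "doc-2"], labels)   # 1.0 for the ideal order
worse = ndcg_at_10(["doc-2", "doc-1"], labels)     # < 1.0 when the order is swapped
```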

In practice, hybrid search with RRF consistently outperforms both individual methods on datasets with mixed query types (some semantic, some keyword-dominant). The improvement is largest for technical datasets where users mix natural language questions with exact identifier lookups.

RRF with Spice

Spice implements hybrid search with RRF natively in a single SQL runtime. Vector search, BM25 full-text search, and RRF fusion run in the same Apache DataFusion-based execution engine with no separate systems to manage.

The rrf() function accepts two or more vector_search() or text_search() UDTFs as arguments and returns a unified result set with a fused_score column:

-- Hybrid search with RRF in Spice
SELECT id, title, content, fused_score
FROM rrf(
    vector_search(product_docs, 'how to cancel subscription'),
    text_search(product_docs, 'cancel subscription refund', content),
    join_key => 'id'   -- explicit join key for optimal performance
)
ORDER BY fused_score DESC
LIMIT 10;

To weight one method more heavily than the other, pass a rank_weight argument to the relevant search UDTF:

-- Boost semantic search over exact keyword matching
SELECT id, title, content, fused_score
FROM rrf(
    text_search(product_docs, 'cancel subscription', content,
                rank_weight => 50.0),
    vector_search(product_docs, 'how to cancel subscription',
                  rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;

The k smoothing parameter (default 60.0) is configurable per call. Because results are returned as a standard SQL result set, they can be filtered by additional predicates, joined with metadata tables, or combined with application-specific signals.

Reciprocal Rank Fusion FAQ

What does Reciprocal Rank Fusion do?

RRF combines result lists from multiple retrieval methods (e.g., vector search and BM25) into a single ranked output. Instead of comparing raw scores, it assigns each document a value based on its position in each list and sums those values. Documents that appear near the top of multiple lists receive the highest combined scores.

What is the k parameter in RRF?

The k constant (typically 60) dampens the score advantage for documents ranked at the very top of a list. A smaller k makes the top result more dominant; a larger k flattens the ranking so all positions contribute nearly equal scores. The value 60 was shown to be robust across many benchmark datasets in the original 2009 paper and is a sensible default for most use cases.

When should I use RRF instead of weighted score combination?

Use RRF when you do not have evaluation data to calibrate the weighting between methods, or when the score distributions of your retrieval methods differ significantly. Use weighted combination when you have evaluation data showing that one method clearly outperforms the other for your query distribution and you want to favor it proportionally.

Can RRF combine more than two result lists?

Yes. The RRF formula sums contributions from any number of ranked lists. You can combine vector search, BM25, recency scores, and other signals -- each list contributes a term to the sum. Documents ranking well across all signals receive the highest scores.

Does RRF work for all types of queries?

RRF works well for most query types, especially mixed workloads that include both semantic questions and exact identifier lookups. For highly specialized domains where one retrieval method clearly dominates (e.g., exact code search where keyword matching always wins), weighted combination or single-method retrieval may be more appropriate. Evaluating on your specific dataset and query distribution is the best way to determine the right approach.

See Spice in action

Get a guided walkthrough of how development teams use Spice to query, accelerate, and integrate AI for mission-critical workloads.

Get a demo