Lexical Reciprocal Rank Fusion

In some applications, lexical search can be a better approach to certain queries despite its known deficiencies on others. Typical cases include queries with measurements (e.g. Rolex with 35mm face) or exact model names (e.g. iPhone 12 Pro Max 256GB).

To get the best of both lexical and tensor search, the results from both can be combined using a technique called Reciprocal Rank Fusion (RRF). This technique combines the results from both lexical and tensor search and boosts the results that are common to both.

How does Reciprocal Rank Fusion work?

Reciprocal Rank Fusion (RRF) is a technique that combines the results from two or more search methods by assigning a score to each result based on its rank in the result list. The score is calculated as the reciprocal of the rank of the result in the list. The scores from all the result lists are then combined to produce a final list of results. The results are then sorted based on the combined scores.

The score of a document in RRF if calculated as 1 / (k + rank), where k is a constant and rank is the position of the document in the result list. The constant k is used to control the importance of the rank in the final score. A higher value of k gives more importance to the rank, while a lower value gives more importance to the document itself. Many implementations of RRF use a value of k = 60 as a default.

Recipe for Lexical Reciprocal Rank Fusion with Structured Indexes

For structured indexes Marqo supports hybrid search out of the box through the search API and can be used as follows:

results = mq.index("my-structured-index").search(
    q="Rolex with 35mm face",
    search_method="HYBRID",
    limit=100,
    hybrid_parameters={
        "retrievalMethod": "disjunction",
        "rankingMethod": "rrf",
        "alpha": 0.5,
        "rrfK": 60,
    },
)

In this example we set the following parameters:

retrievalMethod: disjunction retrieves both lexical and tensor results
rankingMethod: rrf combines the results using reciprocal rank fusion
alpha: 0.5 weights the lexical and tensor results (0 for pure lexical, 1 for pure tensor)
rrfK: 60 the constant k in the RRF formula, default is 60

Recipe for Lexical Reciprocal Rank Fusion with Unstructured Indexes

The following implementation is a generic implementation of RRF for unstructured Marqo indexes. Native hybrid search will be coming to unstructured indexes in a future release.

You may wish to modify the search requests in marqo_rrf_search to use different score modifiers or searchable attributes.

import marqo
from typing import List, Dict


def reciprocal_rank_fusion(
    list1: List[str], list2: List[str], k: int = 60
) -> List[str]:
    rrf_scores = {}

    def update_scores(result_list: List[str], scores_dict: Dict[str, float]) -> None:
        for rank, item in enumerate(result_list, start=1):
            if item not in scores_dict:
                scores_dict[item] = 0
            scores_dict[item] += 1 / (k + rank)

    update_scores(list1, rrf_scores)
    update_scores(list2, rrf_scores)

    sorted_items = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    sorted_result_ids = [item[0] for item in sorted_items]

    max_length = max(len(list1), len(list2))
    return sorted_result_ids[:max_length]


def marqo_rrf_search(
    mq: marqo.Client,
    index_name: str,
    query: str,
    limit: int,
):
    lexical_results = mq.index(index_name).search(
        q=query, limit=limit, search_method="LEXICAL"
    )
    vector_results = mq.index(index_name).search(q=query, limit=limit)
    lexical_ids = [r["_id"] for r in lexical_results["hits"]]
    vector_ids = [r["_id"] for r in vector_results["hits"]]

    search_data = {}
    for hit in lexical_results["hits"] + vector_results["hits"]:
        search_data[hit["_id"]] = hit

    rrf_results = reciprocal_rank_fusion(lexical_ids, vector_ids)

    rrf_hits = []
    for _id in rrf_results:
        rrf_hits.append(search_data[_id])

    return {"hits": rrf_hits}