Calculating Recall
Marqo provides configurable parameters for the underlying HNSW Approximate Nearest Neighbours (ANN) search algorithm. These parameters can be tuned to balance recall, latency, and memory usage. The key parameters are efConstruction, m, and efSearch. More details on these can be found in the Understanding HNSW Parameters section.
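As a rough illustration of where these parameters might be set, the sketch below builds the ANN portion of an index settings dictionary. It assumes a settings layout in which efConstruction and m sit under annParameters.parameters, and that efSearch is supplied at search time rather than at index creation. The helper name build_hnsw_settings and the values 512 and 16 are illustrative only, not recommendations; confirm the exact schema for your Marqo version in the Understanding HNSW Parameters section.

```python
from typing import Any, Dict


def build_hnsw_settings(ef_construction: int, m: int) -> Dict[str, Any]:
    """Build the ANN portion of an index settings dict (assumed layout)."""
    if ef_construction <= 0 or m <= 0:
        raise ValueError("efConstruction and m must be positive integers")
    return {
        "annParameters": {
            "parameters": {
                "efConstruction": ef_construction,
                "m": m,
            }
        }
    }


# Illustrative values only.
settings = build_hnsw_settings(ef_construction=512, m=16)
print(settings)

# Assumed usage with the Python client (not run here):
#   mq.create_index("my-first-index", settings_dict=settings)
```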
The defaults achieve >99% recall on average for most datasets, but advanced users, or those with use cases that are highly sensitive to recall, may wish to tune these parameters to achieve higher recall.
Recall in this context is the proportion of items returned by an exact search that are also returned by the approximate search. For example, if the result sets of the approximate and exact searches are A and B respectively, then recall is calculated as |A ∩ B| / |B|.
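The formula can be checked on a small example that needs no Marqo instance. The function name recall and the document IDs below are hypothetical, chosen only to illustrate the calculation:

```python
from typing import Iterable, Set


def recall(approximate_ids: Iterable[str], exact_ids: Iterable[str]) -> float:
    """Return |A ∩ B| / |B| for approximate IDs A and exact IDs B."""
    a: Set[str] = set(approximate_ids)
    b: Set[str] = set(exact_ids)
    if not b:
        raise ValueError("exact_ids must not be empty")
    return len(a & b) / len(b)


# Hypothetical document IDs, for illustration only.
approximate = ["doc1", "doc2", "doc3", "doc5"]  # A
exact = ["doc1", "doc2", "doc3", "doc4"]  # B

# 3 of the 4 exact results also appear in the approximate results.
print(recall(approximate, exact))  # 0.75
```

Here the approximate search missed doc4 and returned doc5 instead, so recall is 3/4.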
Recipe for Calculating Recall with Marqo
Marqo allows you to toggle between approximate and exact search. This allows you to calculate the recall for any index you have in Marqo.
The following function takes an instance of the client, an index name, a limit, and a list of queries, and calculates the average recall for the given index:
import marqo
from typing import List


def calculate_average_recall(
    mq: marqo.Client, index_name: str, limit: int, queries: List[str]
) -> float:
    recalls = []
    for query in queries:
        approximate_results = mq.index(index_name).search(q=query, limit=limit)
        exact_results = mq.index(index_name).search(
            q=query, limit=limit, approximate=False
        )
        approximate_ids = [result["_id"] for result in approximate_results["hits"]]  # A
        exact_ids = [result["_id"] for result in exact_results["hits"]]  # B
        if not exact_ids:
            continue  # recall is undefined when the exact search returns no hits
        intersection = set(approximate_ids).intersection(exact_ids)  # A ∩ B
        recall = len(intersection) / len(exact_ids)  # |A ∩ B| / |B|
        recalls.append(recall)
    if not recalls:
        raise ValueError("no query returned any exact results")
    return sum(recalls) / len(recalls)  # average recall over the set of queries
Example usage:
import marqo
mq = marqo.Client()
index_name = "my-first-index"
limit = 10
queries = ["apple", "banana", "cherry"]
average_recall = calculate_average_recall(mq, index_name, limit, queries)
print(f"Average Recall@{limit}: {average_recall}")
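To see how a search-time parameter such as efSearch affects recall, the same measurement can be repeated over several candidate values. The sketch below is deliberately client-agnostic: search_fn is a hypothetical callable that returns document IDs for a query, and the efSearch value is simply passed through to it. The Marqo wiring appears only as a comment and assumes the client's search method accepts an ef_search argument, which is worth confirming against your client version.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: (query, limit, approximate, ef_search) -> document IDs.
SearchFn = Callable[[str, int, bool, Optional[int]], List[str]]


def recall_by_ef_search(
    search_fn: SearchFn, queries: List[str], limit: int, ef_values: List[int]
) -> Dict[int, float]:
    """Average recall@limit for each candidate efSearch value."""
    # Exact results do not depend on efSearch, so fetch them once per query.
    exact = {q: set(search_fn(q, limit, False, None)) for q in queries}
    results: Dict[int, float] = {}
    for ef in ef_values:
        recalls = []
        for q in queries:
            if not exact[q]:
                continue  # recall is undefined without exact hits
            approx = set(search_fn(q, limit, True, ef))
            recalls.append(len(approx & exact[q]) / len(exact[q]))
        if recalls:
            results[ef] = sum(recalls) / len(recalls)
    return results


# Assumed Marqo wiring (not run here):
# def marqo_search(q, limit, approximate, ef_search):
#     hits = mq.index(index_name).search(
#         q=q, limit=limit, approximate=approximate, ef_search=ef_search
#     )["hits"]
#     return [hit["_id"] for hit in hits]
#
# print(recall_by_ef_search(marqo_search, queries, limit, [100, 500, 2000]))
```

Fetching the exact results once per query avoids repeating the slower exact search for every candidate value.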