Sort And Relevance Cutoff

Sort and relevance cutoff are two features added to Marqo in version 2.22.0. They let you control the order of search results and filter out results that fall below a relevance threshold. This document explains how to use both features effectively and describes the mechanics behind them.

Sort

Sorting is configured through the sort_by search parameter. Its fields parameter contains a list of dictionaries, and each dictionary specifies the sorting behaviour through three keys:
- fieldName: The name of the field to sort by.
- order: The order of the sort, either asc for ascending or desc for descending. Defaults to desc.
- missing: Where to place documents that do not have the specified field, or whose value for that field is not numeric. This can be set to last or first, and defaults to last.

The order of the fields in the fields parameter determines the order in which the fields are sorted. The field that appears earlier in the list will be sorted first, followed by the next field as the tie-breaker, and so on. If all fields are equal, the documents will be sorted by relevance score, with the highest score appearing first.

Note that if the target sorting field is not present in the document, or not a numeric value, the document will be treated as missing and will be sorted according to the missing value specified in the fields parameter.
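The ordering rules above can be sketched in plain Python. This is only an illustration of the described behaviour, not Marqo's implementation: the helper names (is_numeric, sort_key, sort_hits) and the sample hits with their _score values are hypothetical.

```python
from numbers import Real


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def sort_key(doc, fields):
    """Build a tuple key: one component per sort field, then relevance."""
    key = []
    for spec in fields:
        value = doc.get(spec["fieldName"])
        descending = spec.get("order", "desc") == "desc"
        missing_last = spec.get("missing", "last") == "last"
        if is_numeric(value):
            group = 0 if missing_last else 1
            key.append((group, -value if descending else value))
        else:
            # Absent or non-numeric values are treated as missing.
            group = 1 if missing_last else 0
            key.append((group, 0))
    # Final tie-breaker: higher relevance score first.
    key.append(-doc["_score"])
    return tuple(key)


def sort_hits(hits, fields):
    return sorted(hits, key=lambda doc: sort_key(doc, fields))


# Hypothetical hits; the _score values are made up for illustration.
hits = [
    {"_id": "a", "price": 20, "rating": 3.5, "_score": 0.9},
    {"_id": "b", "price": 5, "rating": 5.0, "_score": 0.4},
    {"_id": "c", "rating": 5, "_score": 0.8},
    {"_id": "d", "price": 20, "rating": 3.5, "_score": 0.95},
    {"_id": "e", "price": 15, "rating": 4.3, "_score": 0.5},
    {"_id": "f", "price": 15, "rating": 4, "_score": 0.6},
    {"_id": "g", "price": "cheap", "rating": 1, "_score": 0.99},
]
fields = [
    {"fieldName": "price", "order": "asc", "missing": "last"},
    {"fieldName": "rating", "order": "desc", "missing": "last"},
]

print([doc["_id"] for doc in sort_hits(hits, fields)])
# ['b', 'e', 'f', 'd', 'a', 'c', 'g']
```

Documents d and a tie on both price and rating, so the higher relevance score decides. Documents c and g are treated as missing for price (absent and non-numeric, respectively) and therefore land at the end.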

Retrievals in the sort

Sorting happens after retrieval: the documents are first retrieved based on the search query and then sorted on the specified fields. By default, the retrieval size is set to 3 * limit or limit + offset, whichever is larger. This means that if you set a limit of 10 and an offset of 0, the retrieval size will be 30. If you set a limit of 10 and an offset of 30, the retrieval size will be 40. This ensures that you get consistent sort results for the first 3 pages of results. If you want to change the retrieval size, you can set the minSortCandidates parameter in the search request. For example, if you want 60 results per page and expect to need 10 pages, you can set minSortCandidates to 600. This ensures that you get consistent sort results for the first 10 pages of results.

The sortDepth parameter controls how many documents are sorted after the retrievals. By default, all the retrieved documents are sorted, but you can set this to a lower value to limit the number of documents that are sorted. This can be useful if you only want to sort a subset of the retrieved documents. For example, if you set sortDepth to 100, only the first 100 retrieved documents will be sorted based on the specified fields.

If sortBy is enabled, a _sortCandidates field is added to the search response. This field contains the number of documents that were retrieved and considered for sorting, before sortDepth is applied. Inspecting it can help you understand how the sorting was applied and how many documents were considered.

Here are some examples of how many documents are retrieved and sorted for given limit, offset, and sortDepth parameters:
- limit=30, offset=0, sortDepth=None: Marqo retrieves 90 documents, sorts all 90, and returns the top 30 sorted documents.
- limit=30, offset=30, sortDepth=None: Marqo retrieves 90 documents, sorts all 90, and returns the 31st to 60th sorted documents.
- limit=10, offset=30, sortDepth=None: Marqo retrieves 40 documents, sorts all 40, and returns the 31st to 40th sorted documents.
- limit=20, offset=0, sortDepth=10: Marqo retrieves 60 documents, sorts the first 10, and returns the top 20 documents. Only the first 10 retrieved documents are ordered by the specified fields; the 11th to 20th results are ordered by relevance score instead.
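The arithmetic behind these cases can be reproduced with a small helper. This is a minimal sketch of the default rule described above, assuming enough documents match the query and that minSortCandidates is not set; the function names default_retrieval_size and sort_plan are hypothetical and not part of the Marqo client.

```python
def default_retrieval_size(limit, offset=0):
    # Default rule from the text: the larger of 3 * limit and limit + offset.
    return max(3 * limit, limit + offset)


def sort_plan(limit, offset=0, sort_depth=None):
    """Return how many documents are retrieved, sorted, and which ranks come back."""
    retrieved = default_retrieval_size(limit, offset)
    # Without sortDepth every retrieved document is sorted.
    sorted_count = retrieved if sort_depth is None else min(sort_depth, retrieved)
    return {
        "retrieved": retrieved,
        "sorted": sorted_count,
        "returned_ranks": (offset + 1, offset + limit),
    }


print(sort_plan(limit=30, offset=0))
# {'retrieved': 90, 'sorted': 90, 'returned_ranks': (1, 30)}
print(sort_plan(limit=10, offset=30))
# {'retrieved': 40, 'sorted': 40, 'returned_ranks': (31, 40)}
print(sort_plan(limit=20, offset=0, sort_depth=10))
# {'retrieved': 60, 'sorted': 10, 'returned_ranks': (1, 20)}
```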

In some cases, _sortCandidates can be smaller than expected because not enough documents match the search query. Conversely, if you use disjunction hybrid search, _sortCandidates can be larger than the minSortCandidates parameter, because the documents are retrieved from both lexical and vector search and the sorting is applied to the combined results.

A practical example of sort

Here is a more practical example of how to use sort in Marqo.

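import marqo

# Assumes a Marqo instance is reachable at the default local URL
mq = marqo.Client(url="http://localhost:8882")

mq.create_index("my-first-index", model="hf/all-MiniLM-L6-v2")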

documents = [
    {"_id": "1", "price": 10, "rating": 4.5, "content": "This is a t-shirt"},
    {"_id": "2", "price": 20, "rating": 3.5, "content": "This is the second t-shirt"},
    {"_id": "3", "price": 15, "rating": 4, "content": "This is the third t-shirt"},
    {"_id": "4", "rating": 5, "content": "This is the fourth t-shirt"},
    {"_id": "5", "price": 20, "rating": 3.5, "content": "This is fifth t-shirt"},
    {"_id": "6", "price": 15, "rating": 4.3, "content": "This is a sixth t-shirt"},
    {"_id": "7", "price": 5, "rating": 5.0, "content": "This is a seventh t-shirt"},
]

mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

results = mq.index("my-first-index").search(
    q="t-shirt",
    search_method="HYBRID",
    sort_by={
        "fields": [
            {"fieldName": "price", "order": "asc", "missing": "last"},
            {"fieldName": "rating", "order": "desc", "missing": "last"},
        ]
    },
    limit=10,
)

print(results)

# Output:
{
    "_sortCandidates": 7,
    "hits": [
        {
            "_highlights": [{"content": "This is a seventh t-shirt"}],
            "_id": "7",
            "_lexical_score": 0.06241087758358531,
            "_score": 1.0,
            "_tensor_score": 0.7826970430882307,
            "content": "This is a seventh t-shirt",
            "price": 5,
            "rating": 5.0,
        },
        {
            "_highlights": [{"content": "This is a t-shirt"}],
            "_id": "1",
            "_lexical_score": 0.06721171432078418,
            "_score": 0.5,
            "_tensor_score": 0.9050907719855004,
            "content": "This is a t-shirt",
            "price": 10,
            "rating": 4.5,
        },
        {
            "_highlights": [{"content": "This is a sixth t-shirt"}],
            "_id": "6",
            "_lexical_score": 0.06241087758358531,
            "_score": 0.3333333333333333,
            "_tensor_score": 0.7878834783379681,
            "content": "This is a sixth t-shirt",
            "price": 15,
            "rating": 4.3,
        },
        {
            "_highlights": [{"content": "This is the third t-shirt"}],
            "_id": "3",
            "_lexical_score": 0.06241087758358531,
            "_score": 0.25,
            "_tensor_score": 0.811950408796305,
            "content": "This is the third t-shirt",
            "price": 15,
            "rating": 4,
        },
        {
            "_highlights": [{"content": "This is the second t-shirt"}],
            "_id": "2",
            "_lexical_score": 0.06241087758358531,
            "_score": 0.2,
            "_tensor_score": 0.8196942179786915,
            "content": "This is the second t-shirt",
            "price": 20,
            "rating": 3.5,
        },
        {
            "_highlights": [{"content": "This is fifth t-shirt"}],
            "_id": "5",
            "_lexical_score": 0.06721171432078418,
            "_score": 0.16666666666666666,
            "_tensor_score": 0.7908237159402687,
            "content": "This is fifth t-shirt",
            "price": 20,
            "rating": 3.5,
        },
        {
            "_highlights": [{"content": "This is the fourth t-shirt"}],
            "_id": "4",
            "_lexical_score": 0.06241087758358531,
            "_score": 0.14285714285714285,
            "_tensor_score": 0.7983804748639495,
            "content": "This is the fourth t-shirt",
            "rating": 5,
        },
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 46,
    "query": "t-shirt",
}

As we can see, the results contain a _sortCandidates field, which indicates the number of documents that were retrieved and considered for sorting. The sorting follows the fields specified in the sort_by parameter: by price in ascending order first, and then by rating in descending order as the tie-breaker. Document 4 has no price field, so it is placed last.

Relevance Cutoff

Relevance cutoff is a feature that allows you to filter out results that do not meet a certain relevance threshold. If this feature is enabled, this is what happens:

1. Marqo runs a probe lexical search with a retrieval size of probeDepth and gets _probeCandidates documents.
2. Marqo collects the relevance scores of the retrieved documents, applies the cutoff defined by the relevanceCutoff parameter, and computes the number of relevant documents, _relevantCandidates.
3. Marqo runs the actual search, retrieving _relevantCandidates or limit + offset documents, whichever is smaller, and returns the results.

In this sense, Marqo relies on the lexical search results to determine the number of relevant documents inside the index. If the relevanceCutoff is enabled, you can check _relevantCandidates and _probeCandidates in the search response.

If fewer documents match the probe search than probeDepth, you may see _probeCandidates smaller than probeDepth. _relevantCandidates will not exceed _probeCandidates, and how much smaller it is depends on how strict the relevance cutoff threshold is. If _relevantCandidates is larger than limit + offset, Marqo uses limit + offset as the number of documents to retrieve, so your search results should be the same as with relevance cutoff disabled. However, if _relevantCandidates is smaller than limit + offset, Marqo uses _relevantCandidates as the number of documents to retrieve, and you may see fewer results than expected.
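The three steps can be traced with a short sketch. It does not reproduce how Marqo derives the threshold from a method such as mean_std_dev; a fixed, hypothetical threshold stands in for that step, and the function name relevance_cutoff_plan and the sample scores are assumptions made for illustration.

```python
def relevance_cutoff_plan(lexical_scores, probe_depth, threshold, limit, offset=0):
    """Trace the probe, cutoff, and retrieval counts described above."""
    # Step 1: the probe keeps at most probe_depth lexical matches.
    probe_scores = sorted(lexical_scores, reverse=True)[:probe_depth]
    # Step 2: count probe results that pass the cutoff.
    # ASSUMPTION: a fixed threshold replaces the method-specific calculation.
    relevant = sum(1 for score in probe_scores if score >= threshold)
    # Step 3: the actual search retrieves the smaller of the two numbers.
    retrieved = min(relevant, limit + offset)
    return {
        "_probeCandidates": len(probe_scores),
        "_relevantCandidates": relevant,
        "retrieved": retrieved,
    }


# Hypothetical lexical scores for four matching documents.
scores = [2.61, 2.45, 0.90, 0.29]

print(relevance_cutoff_plan(scores, probe_depth=1000, threshold=2.0, limit=5))
# {'_probeCandidates': 4, '_relevantCandidates': 2, 'retrieved': 2}
print(relevance_cutoff_plan(scores, probe_depth=1000, threshold=0.0, limit=3))
# {'_probeCandidates': 4, '_relevantCandidates': 4, 'retrieved': 3}
```

In the first call the cutoff is the limiting factor, so fewer than limit results come back. In the second call every probe result passes, so limit + offset decides and the outcome matches a search without relevance cutoff.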

A practical example of relevance cutoff

Let's say you have an unstructured index called my-first-index with the following documents:

# Unstructured index: tensor fields are declared when adding documents
mq.create_index("my-first-index")

documents = [
    {"_id": "h1", "content": "Machine learning algorithms in artificial intelligence"},
    {
        "_id": "h2",
        "content": "Artificial intelligence relies on machine learning algorithms",
    },
    {"_id": "m1", "content": "Machine learning processes data efficiently"},
    {"_id": "l1", "content": "Engineers use machine tools for cutting"},
    {"_id": "l2", "content": "Bright morning sunlight streams through the room"},
]

mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

# Hybrid search, so each hit carries both a lexical and a tensor score
regular_results = mq.index("my-first-index").search(
    q="machine learning artificial intelligence",
    search_method="HYBRID",
    limit=10,
)


relevance_cutoff_results = mq.index("my-first-index").search(
    q="machine learning artificial intelligence",
    search_method="HYBRID",
    limit=10,
    relevance_cutoff={"method": "mean_std_dev", "parameters": {"stdDevFactor": 1.2}},
)


print(regular_results)

# Regular search results
{
    "hits": [
        {
            "_highlights": [
                {"content": "Machine learning algorithms in " "artificial intelligence"}
            ],
            "_id": "h1",
            "_lexical_score": 2.612086396229609,
            "_score": 0.01639344262295082,
            "_tensor_score": 0.8889584065865476,
            "content": "Machine learning algorithms in artificial intelligence",
        },
        {
            "_highlights": [
                {
                    "content": "Artificial intelligence relies on "
                    "machine learning algorithms"
                }
            ],
            "_id": "h2",
            "_lexical_score": 2.4483762460480873,
            "_score": 0.016129032258064516,
            "_tensor_score": 0.7812147670366723,
            "content": "Artificial intelligence relies on machine learning "
            "algorithms",
        },
        {
            "_highlights": [
                {"content": "Machine learning processes data " "efficiently"}
            ],
            "_id": "m1",
            "_lexical_score": 0.8977623995410945,
            "_score": 0.015873015873015872,
            "_tensor_score": 0.6570781567113382,
            "content": "Machine learning processes data efficiently",
        },
        {
            "_highlights": [{"content": "Engineers use machine tools for " "cutting"}],
            "_id": "l1",
            "_lexical_score": 0.29152923241027423,
            "_score": 0.015625,
            "_tensor_score": 0.5454894444827113,
            "content": "Engineers use machine tools for cutting",
        },
        {
            "_highlights": [
                {"content": "Bright morning sunlight streams " "through the room"}
            ],
            "_id": "l2",
            "_score": 0.007692307692307693,
            "_tensor_score": 0.5126319998660153,
            "content": "Bright morning sunlight streams through the room",
        },
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 60,
    "query": "machine learning artificial intelligence",
}


print(relevance_cutoff_results)

# Relevance cutoff search results
{
    "_probeCandidates": 4,
    "_relevantCandidates": 2,
    "hits": [
        {
            "_highlights": [
                {"content": "Machine learning algorithms in " "artificial intelligence"}
            ],
            "_id": "h1",
            "_lexical_score": 2.612086396229609,
            "_score": 0.01639344262295082,
            "_tensor_score": 0.8889584065865476,
            "content": "Machine learning algorithms in artificial intelligence",
        },
        {
            "_highlights": [
                {
                    "content": "Artificial intelligence relies on "
                    "machine learning algorithms"
                }
            ],
            "_id": "h2",
            "_lexical_score": 2.4483762460480873,
            "_score": 0.016129032258064516,
            "_tensor_score": 0.7812147670366723,
            "content": "Artificial intelligence relies on machine learning "
            "algorithms",
        },
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 24,
    "query": "machine learning artificial intelligence",
}

As we can see, the regular search returned 5 documents, while the relevance cutoff search returned only the top 2 documents that were considered relevant based on the relevance cutoff threshold. The _probeCandidates field indicates that Marqo retrieved 4 documents in the probe lexical search, and the _relevantCandidates field indicates that only 2 of them were considered relevant.

Relevance Cutoff and Sort

When sorting, you may sometimes get irrelevant results at the top of the search results because of the sort criteria. For example, you search for "black t-shirt" and sort by price in ascending order. You may get a very cheap t-shirt that is not black, simply because its price is lower than that of the black t-shirts. In this case, you can use relevance cutoff to filter out the irrelevant results before they are sorted.

For example, to search for "black t-shirt", sort by price in ascending order, and filter out results that do not match the search query, you can send the following request:

mq.index("my-first-index").search(
    q="black t-shirt",
    sort_by={"fields": [{"fieldName": "price", "order": "asc"}]},
    limit=60,
    relevance_cutoff={
        "method": "mean_std_dev",
        "parameters": {"stdDevFactor": 1.2},
        "probeDepth": 1000,
    },
    search_method="HYBRID",
    hybrid_parameters={
        "retrievalMethod": "disjunction",
        "rankingMethod": "rrf",
        "alpha": 0.3,
        "rrfK": 10,
    },
)

In this case, Marqo will:
1. Run a probe lexical search with a retrieval size of 1000 for the query "black t-shirt" and get _probeCandidates documents.
2. Collect the relevance scores of the retrieved documents, apply the mean_std_dev cutoff with a stdDevFactor of 1.2, and compute _relevantCandidates.
3. Run the actual search with _relevantCandidates. Since the retrievalMethod is disjunction, Marqo retrieves documents from both lexical and vector search, applies the rrf ranking method, and ends up with _sortCandidates documents.
4. Sort the documents by the price field in ascending order. Because sortDepth is not set, all of these documents are sorted.
5. Return the top 60 sorted documents.

In this case, you should not see irrelevant results in the search results, because the relevance cutoff filters out the results that do not match the search query. If you still see irrelevant results, you can increase the stdDevFactor to make the relevance cutoff more strict. These parameters are data dependent, so you may need to experiment with different values to find the best fit for your use case by inspecting _probeCandidates, _relevantCandidates, and _sortCandidates in the search response.
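When experimenting with these parameters, it can help to pull the three counters out of each response and compare them side by side. The helper below is a minimal sketch that works on the response dictionary alone; the name summarize_candidates and the abbreviated sample response are hypothetical.

```python
def summarize_candidates(response):
    """Collect the diagnostic counters from a search response dictionary."""
    return {
        # Each counter is only present when the related feature is enabled,
        # so .get() returns None for the ones that are absent.
        "probe": response.get("_probeCandidates"),
        "relevant": response.get("_relevantCandidates"),
        "sort": response.get("_sortCandidates"),
        "hits": len(response.get("hits", [])),
    }


# Abbreviated, hypothetical response with both features enabled.
sample_response = {
    "_probeCandidates": 5,
    "_relevantCandidates": 2,
    "_sortCandidates": 2,
    "hits": [{"_id": "2"}, {"_id": "1"}],
}

print(summarize_candidates(sample_response))
# {'probe': 5, 'relevant': 2, 'sort': 2, 'hits': 2}
```

A large gap between probe and relevant suggests the cutoff is doing most of the filtering, while equal values suggest that it filtered nothing.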

Here is a practical example of how to use relevance cutoff with sort in Marqo:

mq.create_index("my-first-index", model="hf/all-MiniLM-L6-v2")

# Documents matching the search outputs shown below
documents = [
    {"_id": "1", "price": 19.99, "content": "A black T-shirt"},
    {"_id": "2", "price": 15.99, "content": "Another black T-shirt"},
    {"_id": "3", "price": 10.99, "content": "A blue T-shirt"},
    {"_id": "4", "price": 8.99, "content": "A red T-shirt"},
    {"_id": "5", "price": 20.99, "content": "A green T-shirt"},
]

mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

regular_results = mq.index("my-first-index").search(
    q="black t-shirt",
    search_method="HYBRID",
    sort_by={
        "fields": [{"fieldName": "price", "order": "asc", "missing": "last"}],
    },
)


relevance_cutoff_results = mq.index("my-first-index").search(
    q="black t-shirt",
    search_method="HYBRID",
    relevance_cutoff={
        "method": "relative_max_score",
        "parameters": {"relativeScoreFactor": 0.8},
    },
    sort_by={
        "fields": [{"fieldName": "price", "order": "asc", "missing": "last"}],
    },
)


print(regular_results)

# Regular sort search results
{
    "_sortCandidates": 5,
    "hits": [
        {
            "_highlights": [{"content": "A red T-shirt"}],
            "_id": "4",
            "_lexical_score": 0.0870113769896297,
            "_score": 1.0,
            "_tensor_score": 0.7778304865789906,
            "content": "A red T-shirt",
            "price": 8.99,
        },
        {
            "_highlights": [{"content": "A blue T-shirt"}],
            "_id": "3",
            "_lexical_score": 0.0870113769896297,
            "_score": 0.5,
            "_tensor_score": 0.7897496445065721,
            "content": "A blue T-shirt",
            "price": 10.99,
        },
        {
            "_highlights": [{"content": "Another black T-shirt"}],
            "_id": "2",
            "_lexical_score": 0.9624801143435295,
            "_score": 0.3333333333333333,
            "_tensor_score": 0.9244735618095294,
            "content": "Another black T-shirt",
            "price": 15.99,
        },
        {
            "_highlights": [{"content": "A black T-shirt"}],
            "_id": "1",
            "_lexical_score": 0.9624801143435295,
            "_score": 0.25,
            "_tensor_score": 0.9610388137414024,
            "content": "A black T-shirt",
            "price": 19.99,
        },
        {
            "_highlights": [{"content": "A green T-shirt"}],
            "_id": "5",
            "_lexical_score": 0.0870113769896297,
            "_score": 0.2,
            "_tensor_score": 0.7825439831964851,
            "content": "A green T-shirt",
            "price": 20.99,
        },
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 58,
    "query": "black t-shirt",
}


print(relevance_cutoff_results)

# Sort and relevance cutoff results
{
    "_probeCandidates": 5,
    "_relevantCandidates": 2,
    "_sortCandidates": 2,
    "hits": [
        {
            "_highlights": [{"content": "Another black T-shirt"}],
            "_id": "2",
            "_lexical_score": 0.9624801143435295,
            "_score": 1.0,
            "_tensor_score": 0.9244735618095294,
            "content": "Another black T-shirt",
            "price": 15.99,
        },
        {
            "_highlights": [{"content": "A black T-shirt"}],
            "_id": "1",
            "_lexical_score": 0.9624801143435295,
            "_score": 0.5,
            "_tensor_score": 0.9610388137414024,
            "content": "A black T-shirt",
            "price": 19.99,
        },
    ],
    "limit": 10,
    "offset": 0,
    "processingTimeMs": 17,
    "query": "black t-shirt",
}

The example above shows how to use relevance cutoff with sort in Marqo. Without relevance cutoff, the red T-shirt is returned as the first result because it has the lowest price. With relevance cutoff enabled, only the 2 documents that are relevant to the search query are returned, sorted by price in ascending order.