Sort And Relevance Cutoff
Sort and relevance cutoff are two features added to Marqo in version 2.22.0. They let you control the order of search results and filter out results that do not meet a relevance threshold. This document explains how to use both features effectively and describes the mechanics behind them.
Sort
The fields parameter contains a list of dictionaries. In each dictionary you can set fieldName, order, and missing to control the sorting behaviour. The names are largely self-explanatory, but here is a quick summary:
- fieldName: The name of the field to sort by.
- order: The order of the sort, either asc for ascending or desc for descending. Defaults to desc.
- missing: Where to place documents that do not have the specified field, or whose value for that field is not numeric. This can be set to last or first, and defaults to last.
The order of the fields in the fields parameter determines the order in which the fields are sorted. The field that appears earlier in the list is sorted first, followed by the next field as the tie-breaker, and so on.
If all fields are equal, the documents will be sorted by relevance score, with the highest score appearing first.
Note that if the target sorting field is not present in the document, or is not a numeric value, the document is treated as missing and is sorted according to the missing value specified in the fields parameter.
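The ordering rules above can be sketched in plain Python. This is only an illustration of the behaviour described here, not Marqo's implementation; the helper name sort_hits and the sample hits are hypothetical.

```python
from functools import cmp_to_key


def _is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def sort_hits(hits, fields):
    """Order hits by the field specs; full ties fall back to _score, highest first."""

    def compare(a, b):
        for spec in fields:
            name = spec["fieldName"]
            ascending = spec.get("order", "desc") == "asc"
            missing_first = spec.get("missing", "last") == "first"
            a_val, b_val = a.get(name), b.get(name)
            a_missing, b_missing = not _is_numeric(a_val), not _is_numeric(b_val)
            if a_missing and b_missing:
                continue  # both missing: defer to the next field
            if a_missing or b_missing:
                # Exactly one value is missing: place it first or last as requested.
                return -1 if a_missing == missing_first else 1
            if a_val != b_val:
                return -1 if (a_val < b_val) == ascending else 1
        # Every sort field is equal: the higher relevance score comes first.
        return (b["_score"] > a["_score"]) - (b["_score"] < a["_score"])

    return sorted(hits, key=cmp_to_key(compare))


hits = [
    {"_id": "a", "price": 20, "rating": 3.5, "_score": 0.4},
    {"_id": "b", "price": 20, "rating": 3.5, "_score": 0.9},
    {"_id": "c", "rating": 5.0, "_score": 0.8},  # no price field
    {"_id": "d", "price": 15, "rating": 4.0, "_score": 0.7},
    {"_id": "e", "price": 15, "rating": 4.3, "_score": 0.1},
    {"_id": "f", "price": "cheap", "rating": 4.9, "_score": 0.6},  # non-numeric price
]
fields = [
    {"fieldName": "price", "order": "asc", "missing": "last"},
    {"fieldName": "rating", "order": "desc", "missing": "last"},
]
ordered_ids = [hit["_id"] for hit in sort_hits(hits, fields)]
print(ordered_ids)  # ['e', 'd', 'b', 'a', 'c', 'f']
```

Documents "a" and "b" tie on both fields, so the higher _score decides. Documents "c" and "f" are both treated as missing on price and are placed last, ordered between themselves by rating.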
Retrievals in the sort
Sorting happens after retrieval: documents are first retrieved based on the search query and then sorted by the specified fields. By default, the retrieval size is 3 * limit or limit + offset, whichever is larger. For example, with a limit of 10 and an offset of 0 the retrieval size is 30, and with a limit of 10 and an offset of 30 it is 40.
This ensures that you get consistent sort results for the first 3 pages of results. To change the retrieval size, set the minSortCandidates parameter in the search request.
For example, if you want 60 results per page and expect to need 10 pages, set minSortCandidates to 600. This ensures consistent sort results for the first 10 pages of results.
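The default retrieval size rule can be written as a one-line function. The name default_retrieval_size is hypothetical, and the sketch covers only the default rule stated above; how minSortCandidates overrides it is not modelled here.

```python
def default_retrieval_size(limit, offset=0):
    """Default number of documents retrieved before sorting: max(3 * limit, limit + offset)."""
    return max(3 * limit, limit + offset)


print(default_retrieval_size(limit=10, offset=0))   # 30
print(default_retrieval_size(limit=10, offset=30))  # 40
print(default_retrieval_size(limit=30, offset=30))  # 90
```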
The sortDepth parameter controls how many documents are sorted after retrieval. By default, all the retrieved documents are sorted, but you can set this to a lower value to limit the number of documents that are sorted.
This can be useful if you only want to sort a subset of the retrieved documents. For example, if you set sortDepth to 100, only the first 100 retrieved documents are sorted based on the specified fields.
If sortBy is enabled, a _sortCandidates field is added to the search response. This field reports the number of documents that were retrieved and considered for sorting, before sortDepth is applied.
Inspecting this field can help you understand how the sorting is applied and how many documents are considered for sorting.
Here are some examples to help you understand how many documents are retrieved and sorted based on the limit, offset, and sortDepth parameters:
- You set limit=30, offset=0, and sortDepth=None: Marqo retrieves 90 documents, sorts all 90 by the specified fields, and returns the top 30 sorted documents.
- You set limit=30, offset=30, and sortDepth=None: Marqo retrieves 90 documents, sorts all 90 by the specified fields, and returns the 31st to 60th sorted documents.
- You set limit=10, offset=30, and sortDepth=None: Marqo retrieves 40 documents, sorts all 40 by the specified fields, and returns the 31st to 40th sorted documents.
- You set limit=20, offset=0, and sortDepth=10: Marqo retrieves 60 documents, sorts only the first 10 retrieved documents by the specified fields, and returns the top 20 documents. In this case, the 11th to 20th results are not sorted by the specified fields; they stay ordered by relevance score.
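The last case can be simulated with a small sketch. It assumes, as one reading of the description above, that the first sortDepth retrieved documents are sorted, the rest keep their relevance order, and the page is then cut with offset and limit. The function name sort_and_page and the sample data are hypothetical, and missing-field handling is left out.

```python
def sort_and_page(retrieved, field, limit, offset=0, sort_depth=None, ascending=True):
    """Sort the first sort_depth documents by field, keep the rest as retrieved, then page."""
    depth = len(retrieved) if sort_depth is None else min(sort_depth, len(retrieved))
    head = sorted(retrieved[:depth], key=lambda doc: doc[field], reverse=not ascending)
    tail = retrieved[depth:]  # stays in relevance order
    return (head + tail)[offset:offset + limit]


# 60 retrieved documents in relevance order (rank 1 is the most relevant).
# Price falls as rank rises, so sorting by price ascending reverses the sorted part.
retrieved = [{"rank": rank, "price": 100 - rank} for rank in range(1, 61)]

page = sort_and_page(retrieved, "price", limit=20, offset=0, sort_depth=10)
ranks = [doc["rank"] for doc in page]
print(ranks[:10])  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(ranks[10:])  # [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```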
In some cases, your _sortCandidates can be smaller than expected, because not enough documents were retrieved to sort or not enough documents match the search query.
If you use disjunction hybrid search, _sortCandidates can be larger than the minSortCandidates parameter, because documents are retrieved from both lexical and vector search and the sorting is applied to the combined results.
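A rough way to picture why the combined candidate set can grow: if each retriever returns its own list and the lists are merged by document ID, the union can hold more documents than either list. The sketch below assumes a simple ID-based union; the function name and IDs are hypothetical and this is not Marqo's merging logic.

```python
def merge_candidates(lexical_ids, tensor_ids):
    """Union of two ID lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*lexical_ids, *tensor_ids]))


lexical_ids = ["d1", "d2", "d3"]  # 3 documents from lexical search
tensor_ids = ["d2", "d4", "d5"]   # 3 documents from vector search
merged = merge_candidates(lexical_ids, tensor_ids)
print(merged)       # ['d1', 'd2', 'd3', 'd4', 'd5']
print(len(merged))  # 5, more than either list on its own
```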
A practical example of sort
Here is a more practical example of how to use sort in Marqo.
import marqo

# Assumes a Marqo instance is running locally on the default port.
mq = marqo.Client(url="http://localhost:8882")

mq.create_index("my-first-index", model="hf/all-MiniLM-L6-v2")
documents = [
    {"_id": "1", "price": 10, "rating": 4.5, "content": "This is a t-shirt"},
    {"_id": "2", "price": 20, "rating": 3.5, "content": "This is the second t-shirt"},
    {"_id": "3", "price": 15, "rating": 4, "content": "This is the third t-shirt"},
    {"_id": "4", "rating": 5, "content": "This is the fourth t-shirt"},
    {"_id": "5", "price": 20, "rating": 3.5, "content": "This is fifth t-shirt"},
    {"_id": "6", "price": 15, "rating": 4.3, "content": "This is a sixth t-shirt"},
    {"_id": "7", "price": 5, "rating": 5.0, "content": "This is a seventh t-shirt"},
]
mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

# Sort by price (ascending), then by rating (descending) as the tie-breaker.
results = mq.index("my-first-index").search(
    q="t-shirt",
    search_method="HYBRID",
    sort_by={
        "fields": [
            {"fieldName": "price", "order": "asc", "missing": "last"},
            {"fieldName": "rating", "order": "desc", "missing": "last"},
        ]
    },
    limit=10,
)
print(results)
# Output:
{
"_sortCandidates": 7,
"hits": [
{
"_highlights": [{"content": "This is a seventh t-shirt"}],
"_id": "7",
"_lexical_score": 0.06241087758358531,
"_score": 1.0,
"_tensor_score": 0.7826970430882307,
"content": "This is a seventh t-shirt",
"price": 5,
"rating": 5.0,
},
{
"_highlights": [{"content": "This is a t-shirt"}],
"_id": "1",
"_lexical_score": 0.06721171432078418,
"_score": 0.5,
"_tensor_score": 0.9050907719855004,
"content": "This is a t-shirt",
"price": 10,
"rating": 4.5,
},
{
"_highlights": [{"content": "This is a sixth t-shirt"}],
"_id": "6",
"_lexical_score": 0.06241087758358531,
"_score": 0.3333333333333333,
"_tensor_score": 0.7878834783379681,
"content": "This is a sixth t-shirt",
"price": 15,
"rating": 4.3,
},
{
"_highlights": [{"content": "This is the third t-shirt"}],
"_id": "3",
"_lexical_score": 0.06241087758358531,
"_score": 0.25,
"_tensor_score": 0.811950408796305,
"content": "This is the third t-shirt",
"price": 15,
"rating": 4,
},
{
"_highlights": [{"content": "This is the second t-shirt"}],
"_id": "2",
"_lexical_score": 0.06241087758358531,
"_score": 0.2,
"_tensor_score": 0.8196942179786915,
"content": "This is the second t-shirt",
"price": 20,
"rating": 3.5,
},
{
"_highlights": [{"content": "This is fifth t-shirt"}],
"_id": "5",
"_lexical_score": 0.06721171432078418,
"_score": 0.16666666666666666,
"_tensor_score": 0.7908237159402687,
"content": "This is fifth t-shirt",
"price": 20,
"rating": 3.5,
},
{
"_highlights": [{"content": "This is the fourth t-shirt"}],
"_id": "4",
"_lexical_score": 0.06241087758358531,
"_score": 0.14285714285714285,
"_tensor_score": 0.7983804748639495,
"content": "This is the fourth t-shirt",
"rating": 5,
},
],
"limit": 10,
"offset": 0,
"processingTimeMs": 46,
"query": "t-shirt",
}
As we can see, the results contain a _sortCandidates field, which indicates the number of documents that were retrieved and considered for sorting.
The sorting follows the fields specified in the sort_by parameter: by price in ascending order first, and then by rating in descending order as the tie-breaker. Document 4 has no price field, so it is placed last.
Relevance Cutoff
Relevance cutoff is a feature that allows you to filter out results that do not meet a certain relevance threshold. If this feature is enabled, this is what happens:
- Marqo runs a probe lexical search with a retrieval size of probeDepth and gets _probeCandidates documents.
- Marqo collects the relevance scores of the retrieved documents, applies a relevance cutoff based on the relevanceCutoff parameter, and computes the _relevantCandidates number.
- Marqo runs the real search, retrieving _relevantCandidates or limit + offset documents, whichever is smaller, and returns the results.
In this sense, Marqo relies on the lexical search results to determine the number of relevant documents in the index. If relevanceCutoff is enabled, you can check _relevantCandidates and _probeCandidates in the search response.
If there are fewer relevant documents in the index than probeDepth, you may see _probeCandidates smaller than probeDepth.
The _relevantCandidates value will not exceed _probeCandidates, and how much smaller it is depends on how strict the relevance cutoff threshold is.
If _relevantCandidates is larger than limit + offset, Marqo uses limit + offset as the number of documents to retrieve. In that case, your search results should be the same as without relevance cutoff.
However, if _relevantCandidates is smaller than limit + offset, Marqo uses _relevantCandidates as the number of documents to retrieve, and you may see fewer results than expected.
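The last two steps can be sketched as follows. The threshold here is supplied by hand: how Marqo derives it from the chosen cutoff method is not modelled, and whether the comparison is inclusive is an assumption. Both function names and the sample scores are hypothetical.

```python
def count_relevant(probe_scores, threshold):
    """Number of probe results whose score reaches the threshold (inclusive by assumption)."""
    return sum(1 for score in probe_scores if score >= threshold)


def effective_retrieval_size(relevant_candidates, limit, offset=0):
    """Documents retrieved by the real search: the smaller of the two values."""
    return min(relevant_candidates, limit + offset)


probe_scores = [2.61, 2.45, 0.90, 0.29]  # made-up lexical scores from a probe search
relevant = count_relevant(probe_scores, threshold=2.0)  # hand-picked threshold

print(relevant)                                           # 2
print(effective_retrieval_size(relevant, limit=5))        # 2, fewer results than the limit
print(effective_retrieval_size(50, limit=10, offset=20))  # 30, same as without the cutoff
```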
A practical example of relevance cutoff
Let's say you have an unstructured index called my-first-index with the following documents:
mq.create_index("my-first-index", tensor_fields=["content"])
documents = [
    {"_id": "h1", "content": "Machine learning algorithms in artificial intelligence"},
    {
        "_id": "h2",
        "content": "Artificial intelligence relies on machine learning algorithms",
    },
    {"_id": "m1", "content": "Machine learning processes data efficiently"},
    {"_id": "l1", "content": "Engineers use machine tools for cutting"},
    {"_id": "l2", "content": "Bright morning sunlight streams through the room"},
]
mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

# Same query, without and with relevance cutoff.
regular_results = mq.index("my-first-index").search(
    q="machine learning artificial intelligence", limit=5
)
relevance_cutoff_results = mq.index("my-first-index").search(
    q="machine learning artificial intelligence",
    limit=5,
    relevance_cutoff={"method": "mean_std_dev", "parameters": {"stdDevFactor": 1.2}},
)
print(regular_results)
# Regular search results
{
"hits": [
{
"_highlights": [
{"content": "Machine learning algorithms in " "artificial intelligence"}
],
"_id": "h1",
"_lexical_score": 2.612086396229609,
"_score": 0.01639344262295082,
"_tensor_score": 0.8889584065865476,
"content": "Machine learning algorithms in artificial intelligence",
},
{
"_highlights": [
{
"content": "Artificial intelligence relies on "
"machine learning algorithms"
}
],
"_id": "h2",
"_lexical_score": 2.4483762460480873,
"_score": 0.016129032258064516,
"_tensor_score": 0.7812147670366723,
"content": "Artificial intelligence relies on machine learning "
"algorithms",
},
{
"_highlights": [
{"content": "Machine learning processes data " "efficiently"}
],
"_id": "m1",
"_lexical_score": 0.8977623995410945,
"_score": 0.015873015873015872,
"_tensor_score": 0.6570781567113382,
"content": "Machine learning processes data efficiently",
},
{
"_highlights": [{"content": "Engineers use machine tools for " "cutting"}],
"_id": "l1",
"_lexical_score": 0.29152923241027423,
"_score": 0.015625,
"_tensor_score": 0.5454894444827113,
"content": "Engineers use machine tools for cutting",
},
{
"_highlights": [
{"content": "Bright morning sunlight streams " "through the room"}
],
"_id": "l2",
"_score": 0.007692307692307693,
"_tensor_score": 0.5126319998660153,
"content": "Bright morning sunlight streams through the room",
},
],
"limit": 10,
"offset": 0,
"processingTimeMs": 60,
"query": "machine learning artificial intelligence",
}
print(relevance_cutoff_results)
# Relevance cutoff search results
{
"_probeCandidates": 4,
"_relevantCandidates": 2,
"hits": [
{
"_highlights": [
{"content": "Machine learning algorithms in " "artificial intelligence"}
],
"_id": "h1",
"_lexical_score": 2.612086396229609,
"_score": 0.01639344262295082,
"_tensor_score": 0.8889584065865476,
"content": "Machine learning algorithms in artificial intelligence",
},
{
"_highlights": [
{
"content": "Artificial intelligence relies on "
"machine learning algorithms"
}
],
"_id": "h2",
"_lexical_score": 2.4483762460480873,
"_score": 0.016129032258064516,
"_tensor_score": 0.7812147670366723,
"content": "Artificial intelligence relies on machine learning "
"algorithms",
},
],
"limit": 10,
"offset": 0,
"processingTimeMs": 24,
"query": "machine learning artificial intelligence",
}
As we can see, the regular search returned 5 documents, while the relevance cutoff search returned only the top 2 documents, which were the ones considered relevant based on the relevance cutoff threshold.
The _probeCandidates field indicates that Marqo retrieved 4 documents in the probe lexical search, and the _relevantCandidates field indicates that only 2 of them were considered relevant based on the relevance cutoff threshold.
Relevance Cutoff and Sort
When sorting, you may sometimes get irrelevant results near the top because of the sort criteria. For example, if you search for "black t-shirt" and sort by price in ascending order, you may get a very cheap t-shirt that is not black, simply because its price is lower than that of the black t-shirts. In this case, you can use relevance cutoff to filter out the irrelevant results.
For example, to search for "black t-shirt", sort by price in ascending order, and filter out results that do not match the search query, you can send:
mq.index("my-first-index").search(
    q="black t-shirt",
    # sort_by takes a dictionary with a "fields" list, as described in the Sort section.
    sort_by={"fields": [{"fieldName": "price", "order": "asc"}]},
    limit=60,
    relevance_cutoff={
        "method": "mean_std_dev",
        "parameters": {"stdDevFactor": 1.2},
        "probeDepth": 1000,
    },
    search_method="HYBRID",
    hybrid_parameters={
        "retrievalMethod": "disjunction",
        "rankingMethod": "rrf",
        "alpha": 0.3,
        "rrfK": 10,
    },
)
In this case, Marqo will:
1. Run a probe lexical search with a retrieval size of 1000 for the query "black t-shirt" and get _probeCandidates documents;
2. Collect the relevance scores of the retrieved documents, apply a relevance cutoff using the mean_std_dev method with a stdDevFactor of 1.2, and compute the _relevantCandidates number;
3. Run the real search with _relevantCandidates. In this example, since the retrievalMethod is disjunction, Marqo retrieves documents from both lexical and vector search, applies the rrf ranking method, and returns _sortCandidates documents;
4. Sort the documents by the price field in ascending order. In this case, all the documents are sorted;
5. Return the top 60 sorted documents.
In this case, you should not see irrelevant results in the search results, because the relevance cutoff filters out results that do not match the search query.
If you still see irrelevant results in the search results, you can increase the stdDevFactor to make the relevance cutoff stricter. These parameters are data dependent, so you may need to experiment with different values to find the best fit for your use case by investigating _probeCandidates, _relevantCandidates, and _sortCandidates in the search response.
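When tuning, it can help to pull these counters out of each response and compare them side by side. The helper below is a hypothetical convenience function, and the sample response is a hand-written dictionary shaped like the responses shown in this document.

```python
def candidate_summary(response):
    """Collect the candidate counters from a search response; absent ones become None."""
    keys = ("_probeCandidates", "_relevantCandidates", "_sortCandidates")
    summary = {key: response.get(key) for key in keys}
    summary["returned"] = len(response.get("hits", []))
    return summary


# Hand-written stand-in for a search response.
sample_response = {
    "_probeCandidates": 5,
    "_relevantCandidates": 2,
    "_sortCandidates": 2,
    "hits": [{"_id": "2"}, {"_id": "1"}],
}
summary = candidate_summary(sample_response)
print(summary)
```

A response from a search without relevance cutoff has no _probeCandidates or _relevantCandidates fields, so those entries come back as None.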
Here is a practical example of how to use relevance cutoff with sort in Marqo:
mq.create_index("my-first-index", model="hf/all-MiniLM-L6-v2")
# Documents matching the search output shown below.
documents = [
    {"_id": "1", "price": 19.99, "content": "A black T-shirt"},
    {"_id": "2", "price": 15.99, "content": "Another black T-shirt"},
    {"_id": "3", "price": 10.99, "content": "A blue T-shirt"},
    {"_id": "4", "price": 8.99, "content": "A red T-shirt"},
    {"_id": "5", "price": 20.99, "content": "A green T-shirt"},
]
mq.index("my-first-index").add_documents(documents, tensor_fields=["content"])

# Sort by price only.
regular_results = mq.index("my-first-index").search(
    q="black t-shirt",
    search_method="HYBRID",
    sort_by={
        "fields": [{"fieldName": "price", "order": "asc", "missing": "last"}],
    },
)
# Sort by price after applying relevance cutoff.
relevance_cutoff_results = mq.index("my-first-index").search(
    q="black t-shirt",
    search_method="HYBRID",
    relevance_cutoff={
        "method": "relative_max_score",
        "parameters": {"relativeScoreFactor": 0.8},
    },
    sort_by={
        "fields": [{"fieldName": "price", "order": "asc", "missing": "last"}],
    },
)
print(regular_results)
# Regular sort search results
{
"_sortCandidates": 5,
"hits": [
{
"_highlights": [{"content": "A red T-shirt"}],
"_id": "4",
"_lexical_score": 0.0870113769896297,
"_score": 1.0,
"_tensor_score": 0.7778304865789906,
"content": "A red T-shirt",
"price": 8.99,
},
{
"_highlights": [{"content": "A blue T-shirt"}],
"_id": "3",
"_lexical_score": 0.0870113769896297,
"_score": 0.5,
"_tensor_score": 0.7897496445065721,
"content": "A blue T-shirt",
"price": 10.99,
},
{
"_highlights": [{"content": "Another black T-shirt"}],
"_id": "2",
"_lexical_score": 0.9624801143435295,
"_score": 0.3333333333333333,
"_tensor_score": 0.9244735618095294,
"content": "Another black T-shirt",
"price": 15.99,
},
{
"_highlights": [{"content": "A black T-shirt"}],
"_id": "1",
"_lexical_score": 0.9624801143435295,
"_score": 0.25,
"_tensor_score": 0.9610388137414024,
"content": "A black T-shirt",
"price": 19.99,
},
{
"_highlights": [{"content": "A green T-shirt"}],
"_id": "5",
"_lexical_score": 0.0870113769896297,
"_score": 0.2,
"_tensor_score": 0.7825439831964851,
"content": "A green T-shirt",
"price": 20.99,
},
],
"limit": 10,
"offset": 0,
"processingTimeMs": 58,
"query": "black t-shirt",
}
print(relevance_cutoff_results)
# Sort and relevance cutoff results
{
"_probeCandidates": 5,
"_relevantCandidates": 2,
"_sortCandidates": 2,
"hits": [
{
"_highlights": [{"content": "Another black T-shirt"}],
"_id": "2",
"_lexical_score": 0.9624801143435295,
"_score": 1.0,
"_tensor_score": 0.9244735618095294,
"content": "Another black T-shirt",
"price": 15.99,
},
{
"_highlights": [{"content": "A black T-shirt"}],
"_id": "1",
"_lexical_score": 0.9624801143435295,
"_score": 0.5,
"_tensor_score": 0.9610388137414024,
"content": "A black T-shirt",
"price": 19.99,
},
],
"limit": 10,
"offset": 0,
"processingTimeMs": 17,
"query": "black t-shirt",
}
The example above shows how to use relevance cutoff with sort in Marqo. If relevance cutoff is not enabled, the red T-shirt is returned as the first result, as it has the lowest price. With relevance cutoff enabled, only the 2 documents that are relevant to the search query are returned, sorted by price in ascending order.