Duplicate Detection

Vector search excels at duplicate detection, especially with multimodal indexes where copies of images may have very slight variations due to compression, resizing, or other minor transformations.

Finding Duplicates

The simplest way to find duplicates is to search with the vector of a known document. This returns the most similar documents in the index, and documents whose similarity score exceeds a sufficiently high threshold can be considered duplicates. Because this is a symmetric retrieval task, the threshold can be set quite high. The best value depends on your data and model and takes some experimentation to find, but 0.99 is a good starting point for strong duplicates.

import marqo

mq = marqo.Client()

index_name = "my-first-index"

base_document_id = "document1"

item_with_facets = mq.index(index_name).get_document(
    document_id=base_document_id, expose_facets=True
)

# NOTE: this example assumes one tensor field with no text chunking (multimodal index).
# Adjust accordingly for your data: _tensor_facets is a list with one entry per tensor field / text chunk.
vec = item_with_facets["_tensor_facets"][0]["_embedding"]

results = mq.index(index_name).search(
    q=None,  # no query, just context vectors
    context={"tensor": [{"vector": vec, "weight": 1.0}]},
)
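
Once the search returns, the threshold still has to be applied to the hits. The following is a minimal sketch of that step, not part of the Marqo client: `find_duplicates` is a hypothetical helper, and it assumes the response is a dict with a `hits` list whose entries carry `_id` and `_score`. It uses a hand-written sample response so it runs without a Marqo instance, and it skips the base document, which normally comes back as its own best match.

```python
def find_duplicates(results, base_document_id, threshold=0.99):
    """Return (id, score) pairs for hits at or above the threshold.

    Hypothetical helper: assumes `results` is shaped like a search
    response, i.e. a dict with a "hits" list of dicts that each
    contain "_id" and "_score".
    """
    duplicates = []
    for hit in results.get("hits", []):
        # The base document matches itself, so leave it out.
        if hit["_id"] == base_document_id:
            continue
        if hit["_score"] >= threshold:
            duplicates.append((hit["_id"], hit["_score"]))
    return duplicates


# Illustrative data only; in practice pass the `results` from the search above.
sample_results = {
    "hits": [
        {"_id": "document1", "_score": 1.0},
        {"_id": "document7", "_score": 0.995},
        {"_id": "document3", "_score": 0.82},
    ]
}

duplicates = find_duplicates(sample_results, base_document_id="document1")
print(duplicates)
```

The 0.99 default mirrors the starting point suggested above. How scores are scaled depends on the model and the index's distance metric, so treat the threshold as something to tune against known duplicate and non-duplicate pairs from your own data.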