Text Pre-Processing Best Practices

Marqo provides built-in text pre-processing, which splits longer passages of text into chunks and stores multiple vectors per document. This is useful for longer documents where you want to search for specific sections of the text, and it enables the _highlights feature, which shows you which text matched your query.

Marqo provides the following defaults:

{
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence"
    }
}
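As a minimal sketch of where these settings go, the snippet below builds the settings dictionary and passes it to the Python client when creating an index. It assumes the `marqo` Python package is installed and a Marqo instance is reachable at `http://localhost:8882`; the helper names, the index name, and the overlap sanity check are hypothetical choices for this sketch, not part of Marqo.

```python
def text_preprocessing_settings(split_length=2, split_overlap=0, split_method="sentence"):
    """Build index settings containing a textPreprocessing block.

    The default arguments mirror the defaults listed above. The range check
    below is a sanity check added for this sketch (an assumption, not
    documented Marqo validation).
    """
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    return {
        "textPreprocessing": {
            "splitLength": split_length,
            "splitOverlap": split_overlap,
            "splitMethod": split_method,
        }
    }


def create_index_with_settings(index_name, settings, url="http://localhost:8882"):
    """Create an index with the given settings (requires a running Marqo instance)."""
    import marqo  # third-party client, imported lazily so the builder works without it

    client = marqo.Client(url=url)
    return client.create_index(index_name, settings_dict=settings)


if __name__ == "__main__":
    # "my-first-index" is a placeholder index name.
    create_index_with_settings("my-first-index", text_preprocessing_settings())
```

Calling `text_preprocessing_settings(3, 1)` produces a three-sentence, one-sentence-overlap configuration in the same shape.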

The defaults work well for many use cases. Another recommended configuration, which works well when longer passages are useful, is:

{
    "textPreprocessing": {
        "splitLength": 3,
        "splitOverlap": 1,
        "splitMethod": "sentence"
    }
}

This configuration splits the text into chunks of 3 sentences, with consecutive chunks overlapping by 1 sentence. This is useful for Retrieval Augmented Generation (RAG) applications and for search over longer documents, where larger sections contain the relevant information and that information does not split neatly into distinct sentence chunks.
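To make the effect of splitLength and splitOverlap concrete, here is a rough illustration of sentence-window chunking. It is a sketch of the general technique, not Marqo's actual implementation: the regex-based sentence splitter is deliberately naive, the function names are hypothetical, and the way the final partial chunk is handled is an assumption.

```python
import re


def split_sentences(text):
    """Naively split text into sentences at whitespace following '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, split_length, split_overlap):
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_length < 1 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length >= 1 and 0 <= split_overlap < split_length")
    step = split_length - split_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break  # the last sentence is already covered by this window
    return chunks


if __name__ == "__main__":
    sentences = split_sentences("One. Two. Three. Four. Five.")
    print(chunk_sentences(sentences, split_length=2, split_overlap=0))
    print(chunk_sentences(sentences, split_length=3, split_overlap=1))
```

With an overlap of 1, the last sentence of each chunk reappears as the first sentence of the next, so a passage that straddles a chunk boundary still appears whole in at least one chunk.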