Text

Marqo supports pre-processing of text. Currently, text pre-processing consists of optionally chunking text into shorter pieces.

Text chunking

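import marqo

# Assumes a Marqo instance is running locally on the default port.
mq = marqo.Client(url="http://localhost:8882")

settings = {
    "textPreprocessing": {
        "splitLength": 2,  # number of units (here, sentences) per chunk
        "splitOverlap": 0,  # units shared between consecutive chunks
        "splitMethod": "sentence",  # unit used for splitting
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)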

The settings above split the text in every field into chunks of 2 sentences (splitMethod is "sentence" and splitLength is 2), with no overlap between consecutive chunks (splitOverlap is 0). For example, consider a document given by the following Python dictionary:

document = {
    "title": "This is a short title.",
    "description": "This field is for a description. In this example, it contains some text. And some more. And even more!",
    "other": 100,
}

Then each string field ("title" and "description") will be chunked according to the settings provided above. The "title" will remain unchanged as "This is a short title." because it is only a single sentence. The "description" text will go from "This field is for a description. In this example, it contains some text. And some more. And even more!" to ["This field is for a description. In this example, it contains some text.", "And some more. And even more!"], because each chunk is 2 sentences long with 0 overlap between consecutive chunks.

Other methods of splitting are also available: "character", which performs the chunking based on characters; "word", which performs it based on words; and "passage", which performs it based on passages of text (separated by \n\n).

Currently the settings are applied to all fields in the same way. Field-specific settings will be added soon.
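To make the effect of the length and overlap settings concrete, here is a minimal sketch of sentence-based chunking in plain Python. It is not Marqo's implementation: the function name chunk_sentences, its parameters, and the naive regex sentence splitter are all assumptions made for illustration, and Marqo's own sentence detection may split some text differently.

```python
import re


def chunk_sentences(text, split_length=2, split_overlap=0):
    """Illustrative only (hypothetical helper, not part of Marqo).

    Groups sentences into chunks of `split_length` sentences, where
    consecutive chunks share `split_overlap` sentences.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")

    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once the current chunk has reached the final sentence.
        if start + split_length >= len(sentences):
            break
    return chunks


description = (
    "This field is for a description. In this example, it contains some text. "
    "And some more. And even more!"
)

print(chunk_sentences("This is a short title."))
# ['This is a short title.']
print(chunk_sentences(description, split_length=2, split_overlap=0))
# ['This field is for a description. In this example, it contains some text.',
#  'And some more. And even more!']
```

In this sketch, setting split_overlap to 1 would produce three chunks for the description, each sharing one sentence with its neighbour.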