Text
Marqo supports pre-processing of text. Currently, this consists of (optionally) chunking text into shorter pieces.
Text chunking
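import marqo

# Assumes a Marqo instance is running locally on the default port.
mq = marqo.Client(url="http://localhost:8882")

settings = {
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence",
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)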
The settings above will split text (in any field) into sentences (splitMethod), grouped into chunks of length 2 (splitLength), with no overlap (splitOverlap) between consecutive chunks of text. For example, suppose we had a document given by the following Python dictionary:
document = {
    "title": "This is a short title.",
    "description": "This field is for a description. In this example, it contains some text. And some more. And even more!",
    "other": 100,
}
The text fields ("title" and "description") will be chunked according to the settings provided above. The "title" will remain unchanged as "This is a short title." because it is only a single sentence. The "description" text will go from "This field is for a description. In this example, it contains some text. And some more. And even more!" to being ["This field is for a description. In this example, it contains some text.", "And some more. And even more!"], as each chunk is 2 sentences long with 0 overlap between consecutive chunks.
Other methods of splitting are also available: "character", which performs the chunking based on characters; "word", which performs it based on words; and "passage", which performs it based on passages of text (separated by \n\n).
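The sentence-based chunking described above can be sketched in a few lines of Python. This is only an illustration of how the length and overlap settings interact, not Marqo's actual implementation: the helpers split_sentences and chunk_text are hypothetical names, and the regex-based sentence splitter is a simplifying assumption.

```python
import re


def split_sentences(text):
    # Hypothetical helper: naive split after '.', '!' or '?' followed by
    # whitespace. A real sentence splitter handles many more cases.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, split_length=2, split_overlap=0):
    # Hypothetical helper: group sentences into chunks of `split_length`
    # sentences, where consecutive chunks share `split_overlap` sentences.
    if split_length < 1 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length >= 1 and 0 <= split_overlap < split_length")
    sentences = split_sentences(text)
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break  # the last sentence has been included; stop here
    return chunks


description = (
    "This field is for a description. In this example, it contains some text. "
    "And some more. And even more!"
)
print(chunk_text(description, split_length=2, split_overlap=0))
print(chunk_text("This is a short title.", split_length=2, split_overlap=0))
```

In this sketch, setting split_overlap to 1 would make each chunk begin with the last sentence of the previous chunk, so the description would yield three chunks instead of two.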
Currently, the settings are applied to all fields in the same way. Field-specific settings will be added soon.