Overview
The size of the indexed data depends on the number of vectors, the dimension of the vectors, and the amount of metadata indexed alongside them. The number of vectors depends on the index settings, the vector dimension depends on the model, and the size of the metadata depends on what else is indexed.
The exact amount will vary from use case to use case, but the examples below can help estimate and understand the storage requirements. For the most accurate estimate, index a small but representative sample of data (around 1,000 documents) and extrapolate from the per-document storage.
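As a rough sketch of that extrapolation (the measured sample storage figure here is a hypothetical value, not a measurement):

```python
# Hypothetical example: extrapolate total storage from a 1,000-document sample.
sample_docs = 1_000
sample_storage_kb = 7_500  # assumed measured storage for the sample, in kB
per_doc_kb = sample_storage_kb / sample_docs

target_docs = 1_000_000
estimated_gb = per_doc_kb * target_docs / 1_000_000  # kB -> GB
print(f"~{per_doc_kb:.1f} kB per document, ~{estimated_gb:.1f} GB for 1M documents")
```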
Examples
Below are some prototypical examples to help understand the storage requirements.
Example 1. Indexing text with the default model
In this example, each document has a single text field ("text") containing a short amount of text. For example:
document = {"text": "this is an example. here is some more text."}
Each of these documents occupies between 5-10 kB depending on the model used. Thus, indexing 1M documents would require approximately 5-10 GB of storage.
Example 2. Indexing images with CLIP ViT-B/32
In this example, each document has a single image field ("image") that contains a URI for the image. For example:
document = {"image": "https://some.domain/an.image/image.webp"}
Each of these documents occupies between 15-20 kB depending on the model used. Thus, indexing 1M documents would require approximately 15-20 GB of storage.
Example 3. Indexing text with the default model and multiple fields
In this example, each document has multiple text fields ("text1" and "text2"), each with a short amount of text. For example:
document = {
    "text1": "this is an example. here is some more text.",
    "text2": "this is an example. here is some more text.",
}
Each of these documents occupies between 10-20 kB depending on the model used. Thus, indexing 1M documents would require approximately 10-20 GB of storage.
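The three examples above all follow the same arithmetic. A minimal helper (the per-document figures are the ranges quoted above, not exact measurements):

```python
def estimate_storage_gb(per_doc_kb: float, num_docs: int) -> float:
    """Estimate index storage in GB from a per-document size in kB."""
    return per_doc_kb * num_docs / 1_000_000

# Ranges quoted in the examples above, for 1M documents:
print(estimate_storage_gb(5, 1_000_000), estimate_storage_gb(10, 1_000_000))   # single text field
print(estimate_storage_gb(15, 1_000_000), estimate_storage_gb(20, 1_000_000))  # image with CLIP ViT-B/32
print(estimate_storage_gb(10, 1_000_000), estimate_storage_gb(20, 1_000_000))  # two text fields
```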
Reducing storage
There are several strategies that can be used to reduce the amount of storage required. These are listed below:
- Specify which fields should be transformed into vector (tensor) fields. The non-tensor fields can still be stored, filtered, and searched with lexical search. The example below turns the field "Description" into vectors but excludes "Title" and "Genre". This would reduce the amount of storage by ~3x.
mq.index("my-first-index").add_documents(
    [
        {
            "Title": "The Travels of Marco Polo",
            "Description": "A 13th-century travelogue describing the travels of Polo",
            "Genre": "History",
        },
        {
            "Title": "Extravehicular Mobility Unit (EMU)",
            "Description": "The EMU is a spacesuit that provides environmental protection",
            "_id": "article_591",
            "Genre": "Science",
        },
    ],
    tensor_fields=["Description"],
)
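As a back-of-the-envelope check on the ~3x figure (assuming vector storage dominates the index size and each text field would contribute a similar number of vectors if indexed):

```python
# Three text fields per document (Title, Description, Genre), but only
# Description is vectorised, so vector storage drops by roughly 3x.
fields_per_doc = 3
tensor_fields = 1
reduction = fields_per_doc / tensor_fields
print(f"~{reduction:.0f}x reduction in vector storage")
```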
- Modify the internal segmentation settings for text (see here for details). By default, blocks of text two sentences long are turned into one vector. To create an index with modified settings, change the text preprocessing parameters. For fields longer than two sentences, the settings below would reduce storage by ~2x. Note that existing indexes cannot have their settings modified.
settings = {
    "textPreprocessing": {
        "splitLength": 4,
        "splitOverlap": 0,
        "splitMethod": "sentence",
    }
}
response = mq.create_index("my-index", settings_dict=settings)
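A sketch of why this roughly halves storage for longer fields: with a split overlap of 0, the number of vectors per field is approximately the number of sentence blocks (the sentence count below is a hypothetical example):

```python
import math

def vectors_per_field(num_sentences: int, split_length: int) -> int:
    # With splitOverlap = 0, each block of split_length sentences becomes one vector.
    return math.ceil(num_sentences / split_length)

sentences = 8  # hypothetical field length
default_vectors = vectors_per_field(sentences, 2)   # default splitLength of 2
modified_vectors = vectors_per_field(sentences, 4)  # splitLength of 4, as above
print(default_vectors, modified_vectors)  # 4 vectors vs 2 vectors, ~2x fewer
```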