
Overview

The size of the indexed data depends on the number of vectors, the dimension of the vectors, and the amount of metadata indexed alongside the vectors. The number of vectors depends on the index settings, the vector dimension depends on the model, and the metadata size depends on what else is indexed. The exact amount varies from use case to use case, but the examples below help estimate and understand the storage requirements. For the most accurate estimate, index a small but representative sample of data (e.g. 1,000 documents) and extrapolate the per-document storage from that.
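The extrapolation step can be sketched as plain arithmetic. This is a minimal illustration, assuming you can measure the index size after adding the sample (the function name and the 15 MB sample figure are hypothetical, not from the original):

```python
def estimate_total_storage_gb(sample_index_bytes: int,
                              sample_doc_count: int,
                              target_doc_count: int) -> float:
    """Extrapolate total storage from a representative sample.

    sample_index_bytes: measured index size after adding the sample.
    """
    per_doc_bytes = sample_index_bytes / sample_doc_count
    return per_doc_bytes * target_doc_count / 1e9

# e.g. a 1,000-document sample occupying 15 MB projects to 15 GB at 1M documents
print(estimate_total_storage_gb(15_000_000, 1_000, 1_000_000))  # → 15.0
```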

Examples

Below are some prototypical examples to help understand the storage requirements.

Example 1. Indexing text with the default model

In this example, each document has a single text field ("text") containing a short amount of text. For example:

document = {"text":"this is an example. here is some more text."}

Each of these occupies between 5-10 kB depending on the model used. The default strategy includes 1 replica to ensure that the service is always available, which doubles this to a final per-document size of 10-20 kB. If 1M documents are indexed, they would occupy ~10-20 GB.
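The arithmetic behind this and the following examples is simple: per-document size, times the number of copies (the primary plus each replica), times the document count. A sketch (the helper name is illustrative, not part of any API):

```python
def projected_storage_gb(per_doc_kb: float, doc_count: int, replicas: int = 1) -> float:
    # Each replica stores a full copy, so total copies = 1 + replicas.
    return per_doc_kb * (1 + replicas) * doc_count / 1e6

# Upper bound of Example 1: 10 kB/doc, 1M docs, 1 replica
print(projected_storage_gb(10, 1_000_000))  # → 20.0 (GB)
```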

Example 2. Indexing images with CLIP ViT-B/32

In this example, each document has a single image field ("image") that contains a URI for the image. For example:

document = {"image":"https://some.domain/an.image/image.webp"}

Each of these occupies between 15-20 kB depending on the model used. The default strategy includes 1 replica to ensure that the service is always available, which doubles this to a final per-document size of 30-40 kB. If 1M documents are indexed, they would occupy ~30-40 GB.

Example 3. Indexing text with the default model and multiple fields

In this example, each document has multiple text fields ("text1" and "text2"), each containing a short amount of text. For example:

document = {"text1":"this is an example. here is some more text.", 
            "text2":"this is an example. here is some more text."}

Each of these occupies between 10-20 kB depending on the model used. The default strategy includes 1 replica to ensure that the service is always available, which doubles this to a final per-document size of 20-40 kB. If 1M documents are indexed, they would occupy ~20-40 GB.

Reducing storage

There are several strategies that can be used to reduce the amount of storage required:

  1. Exclude some fields from being turned into vector fields (see here for details). The data can still be stored, filtered, and searched with lexical search. The example below turns the field "Description" into vectors but excludes "Title" and "Genre". This reduces the amount of storage by ~3x.
mq.index("my-first-index").add_documents([
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing the travels of Polo",
        "Genre": "History"
    },
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection",
        "_id": "article_591",
        "Genre": "Science"
    }
], non_tensor_fields=["Title", "Genre"])
  2. Modify the internal segmentation settings for text (see here for details). By default, blocks of text two sentences long are turned into one vector. To create an index with modified settings, change the text preprocessing parameters. For fields longer than two sentences, the settings below would reduce storage by ~2x. Note that existing indexes cannot have their settings modified.
settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 4,
            "split_overlap": 0,
            "split_method": "sentence"
        },
    },
}

response = mq.create_index("my-multimodal-index", settings_dict=settings)