Choosing a model for Marqo

This guide explains the tradeoffs and differences between Marqo's supported embedding models. See our blog post, Benchmarking Models for Multimodal Search, and our Hugging Face space for more details.

The most fundamental component of any Marqo index is the embedding model used to represent the data. Marqo's embedding models take data like text or images as input and return an embedding (vector). This vector representation is indexed and searchable within Marqo using approximate nearest neighbour algorithms along with a similarity measure like L2 distance. You can use a variety of different models to generate these vectors, depending on modality, language, and performance requirements.


Text

The following models are supported by default (and are primarily based on the excellent sbert and Hugging Face libraries and models).

These models can be selected when creating the index and are illustrated by the example below:

# Import Marqo and create a client
import marqo

mq = marqo.Client(url="http://localhost:8882")

settings = {
    "treatUrlsAndPointersAsImages": False,
    "model": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)

The model field is the pertinent field for selecting the model to use. Note that once an index has been created and a model has been selected, the model cannot be changed. A new index would need to be created with the alternative model.
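As a minimal sketch of switching models (the delete call is only needed if you want to reuse the same index name):

# Switching models means creating a new index; delete the old one first
# only if you want to reuse its name.
mq.index("my-index").delete()

settings = {
    "treatUrlsAndPointersAsImages": False,
    "model": "flax-sentence-embeddings/all_datasets_v4_mpnet-base",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)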

The model will be applied to all relevant fields. Field-specific settings, which would allow different models to be applied to different fields, are not currently supported but are coming soon (and contributions are always welcome).

Currently, Marqo adds prefixes by default for e5 models. These models are trained on data with prefixes, so adding the same prefixes to queries and text chunks before embedding improves the quality of the embeddings. The default prefix for queries is "query: " and for documents, "passage: ". For more information, refer to the model card here.
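For example, an e5 model can be selected like any other model at index creation; the sketch below assumes hf/e5-base-v2 is available in your Marqo version's model registry, and the prefixes are then added automatically:

settings = {
    "treatUrlsAndPointersAsImages": False,
    "model": "hf/e5-base-v2",  # assumed to be in the model registry
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-e5-index", settings_dict=settings)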

Although the best choice is use-case specific, a good starting point is the model flax-sentence-embeddings/all_datasets_v4_MiniLM-L6. It provides a good compromise between speed and relevancy. The model flax-sentence-embeddings/all_datasets_v4_mpnet-base provides the best relevancy (in general).

Images

The models used for vectorizing images come from CLIP. We support several implementations, including models from OpenAI, OpenCLIP, and our state-of-the-art Marqo FashionCLIP models.

Marqo FashionCLIP

Marqo-FashionCLIP and Marqo-FashionSigLIP are two new state-of-the-art multimodal models for search and recommendations in the fashion domain. Both models produce embeddings for text and images that can then be used in downstream search and recommendation applications. See the model release article on the Marqo blog for more details.

index_settings = {
    "model": "Marqo/marqo-fashionCLIP",
    "treatUrlsAndPointersAsImages": True,
    "type": "unstructured",
}

mq.create_index("my", settings_dict=index_settings)

OpenAI

  • RN50
  • RN101
  • RN50x4
  • RN50x16
  • RN50x64
  • ViT-B/32
  • ViT-B/16
  • ViT-L/14
  • ViT-L/14@336px

Although the best choice is use-case specific, a good starting point is the model ViT-B/16. It provides a good compromise between speed and relevancy. The models open_clip/ViT-B-32/laion2b_s34b_b79k and ViT-L/14@336px provide the best relevancy (in general) but are typically slower.

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)

OpenAI-float16

Some OpenAI CLIP models can be loaded in float16, but ONLY when a CUDA device is available. This can significantly increase speed with a minor loss of accuracy. In our tests, inference latency is reduced by 50% (device dependent). The available models are:

  • fp16/ViT-L/14
  • fp16/ViT-B/32
  • fp16/ViT-B/16

You can load the model with:

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "fp16/ViT-L/14",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)

OpenCLIP

  • open_clip/RN101-quickgelu/openai
  • open_clip/RN101-quickgelu/yfcc15m
  • open_clip/RN101/openai
  • open_clip/RN101/yfcc15m
  • open_clip/RN50-quickgelu/cc12m
  • open_clip/RN50-quickgelu/openai
  • open_clip/RN50-quickgelu/yfcc15m
  • open_clip/RN50/cc12m
  • open_clip/RN50/openai
  • open_clip/RN50/yfcc15m
  • open_clip/RN50x16/openai
  • open_clip/RN50x4/openai
  • open_clip/RN50x64/openai

  • open_clip/ViT-B-16-plus-240/laion400m_e31
  • open_clip/ViT-B-16-plus-240/laion400m_e32
  • open_clip/ViT-B-16/laion2b_s34b_b88k
  • open_clip/ViT-B-16/laion400m_e31
  • open_clip/ViT-B-16/laion400m_e32
  • open_clip/ViT-B-16/openai
  • open_clip/ViT-B-16-SigLIP/webli
  • open_clip/ViT-B-16-SigLIP-256/webli
  • open_clip/ViT-B-16-SigLIP-384/webli
  • open_clip/ViT-B-16-SigLIP-512/webli
  • open_clip/ViT-B-16-quickgelu/metaclip_fullcc
  • open_clip/ViT-B-32-quickgelu/laion400m_e31
  • open_clip/ViT-B-32-quickgelu/laion400m_e32
  • open_clip/ViT-B-32-quickgelu/openai
  • open_clip/ViT-B-32/laion2b_e16
  • open_clip/ViT-B-32/laion2b_s34b_b79k
  • open_clip/ViT-B-32/laion400m_e31
  • open_clip/ViT-B-32/laion400m_e32
  • open_clip/ViT-B-32/openai
  • open_clip/ViT-B-32-256/datacomp_s34b_b86k

  • open_clip/ViT-H-14/laion2b_s32b_b79k
  • open_clip/ViT-H-14-quickgelu/dfn5b
  • open_clip/ViT-H-14-378-quickgelu/dfn5b

  • open_clip/ViT-L-14-336/openai
  • open_clip/ViT-L-14/laion2b_s32b_b82k
  • open_clip/ViT-L-14/laion400m_e31
  • open_clip/ViT-L-14/laion400m_e32
  • open_clip/ViT-L-14/openai
  • open_clip/ViT-L-14-quickgelu/dfn2b
  • open_clip/ViT-L-14-CLIPA-336/datacomp1b
  • open_clip/ViT-L-16-SigLIP-256/webli
  • open_clip/ViT-L-16-SigLIP-384/webli

  • open_clip/ViT-bigG-14/laion2b_s39b_b160k
  • open_clip/ViT-g-14/laion2b_s12b_b42k
  • open_clip/ViT-g-14/laion2b_s34b_b88k
  • open_clip/ViT-SO400M-14-SigLIP-384/webli

  • open_clip/coca_ViT-B-32/laion2b_s13b_b90k
  • open_clip/coca_ViT-B-32/mscoco_finetuned_laion2b_s13b_b90k
  • open_clip/coca_ViT-L-14/laion2b_s13b_b90k
  • open_clip/coca_ViT-L-14/mscoco_finetuned_laion2b_s13b_b90k

  • open_clip/convnext_base/laion400m_s13b_b51k
  • open_clip/convnext_base_w/laion2b_s13b_b82k
  • open_clip/convnext_base_w/laion2b_s13b_b82k_augreg
  • open_clip/convnext_base_w/laion_aesthetic_s13b_b82k
  • open_clip/convnext_base_w_320/laion_aesthetic_s13b_b82k
  • open_clip/convnext_base_w_320/laion_aesthetic_s13b_b82k_augreg
  • open_clip/convnext_large_d/laion2b_s26b_b102k_augreg
  • open_clip/convnext_large_d_320/laion2b_s29b_b131k_ft
  • open_clip/convnext_large_d_320/laion2b_s29b_b131k_ft_soup
  • open_clip/convnext_xxlarge/laion2b_s34b_b82k_augreg
  • open_clip/convnext_xxlarge/laion2b_s34b_b82k_augreg_rewind
  • open_clip/convnext_xxlarge/laion2b_s34b_b82k_augreg_soup

  • open_clip/roberta-ViT-B-32/laion2b_s12b_b32k
  • open_clip/xlm-roberta-base-ViT-B-32/laion5b_s13b_b90k
  • open_clip/xlm-roberta-large-ViT-H-14/frozen_laion5b_s13b_b90k

  • open_clip/EVA02-L-14-336/merged2b_s6b_b61k
  • open_clip/EVA02-L-14/merged2b_s4b_b131k
  • open_clip/EVA02-B-16/merged2b_s8b_b131k

Like the OpenAI-based models, the larger ViT-based models typically perform better. For example, open_clip/ViT-H-14/laion2b_s32b_b79k is the best model for relevancy (in general) and surpasses even the best models from OpenAI.

The names of the OpenCLIP models follow the format "implementation source/model name/pretrained dataset". The detailed configurations of the models can be found here.

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/ViT-H-14/laion2b_s32b_b79k",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)

Multilingual CLIP

Marqo supports multilingual CLIP models that cover up to 200 languages. You can use the following models to achieve multimodal search in your preferred language:

  • visheratin/nllb-clip-base-siglip
  • visheratin/nllb-siglip-mrl-base
  • visheratin/nllb-clip-large-siglip
  • visheratin/nllb-siglip-mrl-large
  • multilingual-clip/XLM-Roberta-Large-Vit-L-14
  • multilingual-clip/XLM-R Large Vit-B/16+
  • multilingual-clip/XLM-Roberta-Large-Vit-B-32
  • multilingual-clip/LABSE-Vit-L-14

These models can be specified at index creation time.

Note that multilingual CLIP models are very large (approximately 6 GB), so a CUDA device is highly recommended.

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "visheratin/nllb-siglip-mrl-base",
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-index", settings_dict=settings)

Video and Audio Models

Marqo supports multimodal models for video, audio, image, and text documents using LanguageBind (see the model card here). You can use the following modality combinations for your index:

Model Name | Supported Modalities | Required Memory
LanguageBind/Video_V1.5_FT_Audio_FT_Image | Video, Audio, Image, Text | 8 GB
LanguageBind/Video_V1.5_FT_Audio_FT | Video, Audio, Text | 5 GB
LanguageBind/Video_V1.5_FT_Image | Video, Image, Text | 5 GB
LanguageBind/Audio_FT_Image | Audio, Image, Text | 5 GB
LanguageBind/Audio_FT | Audio, Text | 2 GB
LanguageBind/Video_V1.5_FT | Video, Text | 2 GB

For models requiring more than 4 GB of memory, please set the environment variable MARQO_MAX_CPU_MODEL_MEMORY or MARQO_MAX_CUDA_MODEL_MEMORY (depending on your device) to an appropriate value in GB.

For these models, it is recommended to set MARQO_MEDIA_DOWNLOAD_THREAD_COUNT to 5 initially and increase it depending on your machine. Media files are preprocessed in parallel and in chunks immediately after downloading, so this is a CPU-heavy process.
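As a hedged sketch, a LanguageBind model is selected at index creation like any other model; the index name below is illustrative, and your Marqo version may require additional media-related index settings:

index_settings = {
    "model": "LanguageBind/Video_V1.5_FT_Audio_FT_Image",
    "type": "unstructured",
}
# For models needing more than 4 GB, remember to raise
# MARQO_MAX_CPU_MODEL_MEMORY / MARQO_MAX_CUDA_MODEL_MEMORY on the Marqo server.
response = mq.create_index("my-multimodal-index", settings_dict=index_settings)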

Generic CLIP Models

You can use your fine-tuned CLIP models with custom weights in Marqo. Depending on the framework you are using (we currently support model frameworks from OpenAI CLIP and open_clip), you can set up the index as follows:

Open_CLIP

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "generic-clip-test-model-1",
    "modelProperties": {
        "name": "ViT-B-32-quickgelu",
        "dimensions": 512,
        "url": "https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/vit_b_32-quickgelu-laion400m_avg-8a00ab3c.pt",
        "type": "open_clip",
    },
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

OpenAI CLIP

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "generic-clip-test-model-2",
    "modelProperties": {
        "name": "ViT-B/32",
        "dimensions": 512,
        "url": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
        "type": "clip",
    },
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

It is very important to set "treatUrlsAndPointersAsImages": True to enable multimodal search. The model field is required and acts as an identifying alias for the model specified through modelProperties.

In modelProperties, the name field identifies the model architecture. dimensions specifies the dimension of the output embeddings. type indicates the framework you are using. You should also provide your custom model (checkpoint) via the url field; you will need to serve your model so that it is accessible via a URL. For more detailed instructions, please check here.

Advanced usage: If Marqo is not running on Docker, models may be stored locally and referenced using a local file pointer. By default, Marqo running within Docker will not be able to access these local files.

Users should be conscious of the difference between the fields model and name. model acts as an identifying alias in Marqo (for generic models, you can choose your own). name, in this case, is used to identify the CLIP architecture from OpenAI or OpenCLIP.

A table of all the required fields is listed below.

Required Keys for modelProperties

Field | Type | Description
name | String | Name of the model in the library. If the model is specified by modelProperties.model_location, then this parameter refers to the model architecture, for example open_clip/ViT-B-32/laion2b_s34b_b79k
dimensions | Integer | Dimensions of the model
url | String | The URL of the custom model
type | String ("clip" or "open_clip") | The framework of the model

Optional fields provide further flexibility for generic models. These fields only work for models from open_clip, as this framework provides more flexibility.

Optional Keys for modelProperties

Field | Type | Default value | Description
jit | Bool | False | Whether to load the model in JIT mode.
precision | String | "fp32" | The precision of the model. Optional values: "fp32" or "fp16".
tokenizer | String | "clip" | The name of the tokenizer. Hugging Face tokenizers are supported.
mean | Tuple | (0.48145466, 0.4578275, 0.40821073) | The mean of the image for normalization.
std | Tuple | (0.26862954, 0.26130258, 0.27577711) | The standard deviation of the image for normalization.
model_location | Dictionary | "" | The location of the model if it is not easily reachable by URL (for example, a model hosted in a private Hugging Face or AWS S3 repo). See here for examples.
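As a hedged sketch of how these optional keys sit alongside the required ones (the alias and checkpoint URL below are placeholders, not a real hosted model):

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "my-custom-openclip-model",  # arbitrary alias for this generic model
    "modelProperties": {
        "name": "ViT-B-32-quickgelu",
        "dimensions": 512,
        "url": "https://example.com/path/to/your_finetuned_checkpoint.pt",  # placeholder URL
        "type": "open_clip",
        # Optional keys (open_clip models only):
        "precision": "fp16",
        "jit": False,
        "mean": (0.48145466, 0.4578275, 0.40821073),
        "std": (0.26862954, 0.26130258, 0.27577711),
    },
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-custom-clip-index", settings_dict=settings)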

Generic SBERT Models

You can also use models that are not supported by default.

settings = {
    "treatUrlsAndPointersAsImages": False,
    "model": "unique-model-alias",
    "modelProperties": {
        "name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
        "dimensions": 384,
        "tokens": 128,
        "type": "hf",
    },
    "normalizeEmbeddings": True,
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

The model field is required and acts as an identifying alias to the model specified through modelProperties. If a default model name is used in the name field, modelProperties will override the default model settings.

Currently, models hosted on the Hugging Face model hub are supported. These models need to output embeddings and conform to either the sbert API or the Hugging Face API. More options for custom models will be added shortly, including inference endpoints.

Required Keys for modelProperties

Name | Type | Description
name | String | Name of the model in the library. This is required unless modelProperties.model_location is specified.
dimensions | Integer | Dimensions of the model
type | String | Type of model loader. Must be set to "hf" for generic SBERT models.

Optional Keys for modelProperties

Field | Type | Default value | Description
tokens | Integer | 128 | Number of tokens
model_location | Dictionary | "" | The location of the model if it is not easily reachable by URL (for example, a model hosted in a private Hugging Face or AWS S3 repo). See here for examples.

No Model

You may want to use Marqo to store and search vectors that you have already generated. In this case, you can create your index with no model. To do this, set model to the string "no_model" and define modelProperties with "type": "no_model" and "dimensions" set to your desired vector size.

Note that with a no_model index, you will not be able to vectorise any documents or search queries. To add documents, use the custom_vector feature, and to search, use the context parameter with no q defined.

# Suppose you want to create an index with 384 dimensions
settings = {
    "treatUrlsAndPointersAsImages": False,
    "model": "no_model",
    "modelProperties": {
        "dimensions": 384,  # Set the dimensions of the vectors
        "type": "no_model",  # This is required
    },
}
response = mq.create_index("my-no-model-index", settings_dict=settings)

Required Keys for modelProperties

Name | Type | Description
dimensions | Integer | Dimensions of the index
type | String | Type of model loader. Must be set to "no_model"

Other media types

At the moment, text, images, video, and audio are supported. Other media types and custom media types will be supported soon.