Model Selection for Multimodal Search
Marqo supports a huge range of open-source models. Deciding which one to use can be difficult, and getting the absolute best results for a specific dataset will often require some experimentation. However, some models tend to be the best choice for certain scenarios, and we cover these here.
When using multimodal CLIP models, be sure to set treatUrlsAndPointersAsImages to true in the index settings. This ensures that URLs and pointers are treated as images and not text.
For example:
import marqo
mq = marqo.Client()
settings = {
"treatUrlsAndPointersAsImages": True,
"model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
}
mq.create_index("my-first-multimodal-index", settings_dict=settings)
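Continuing from the example above (the field names, example URL, and tensor_fields argument are illustrative, and whether tensor_fields is required depends on your Marqo client version), documents containing image URLs can then be added and searched with text:

mq.index("my-first-multimodal-index").add_documents(
    [
        {
            "_id": "example-1",
            "image": "https://example.com/photo-of-a-dog.jpg",  # placeholder URL, embedded as an image
            "caption": "A dog playing in the park",
        }
    ],
    tensor_fields=["image", "caption"],  # required by newer Marqo clients
)

# The text query is embedded by the CLIP text tower and compared against
# the image and caption embeddings.
results = mq.index("my-first-multimodal-index").search("a happy dog outdoors")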
OpenCLIP Model Naming Convention
OpenCLIP models follow a loose naming convention that can help you understand what they are. Here is a breakdown of the convention.
For a model such as open_clip/ViT-L-14/laion2b_s32b_b82k:
ViT: Vision Transformer (the image tower architecture).
L: Large (the size of the model). Sizes are B (base), L (large), H (huge), and g (gigantic).
14: 14x14 pixel patches for the vision tower. Other common values are 16 and 32.
laion2b: the dataset used (this one is 2 billion image-text pairs).
sXXb: the number of samples seen. For the 2 billion examples in laion2b, s32b means 16 epochs.
bXXk: the global batch size used in training, in this case 82 thousand.
Some models, such as open_clip/xlm-roberta-base-ViT-B-32/laion5b_s13b_b90k, also specify the text tower architecture in the name (xlm-roberta-base in this case).
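To make the convention concrete, here is a minimal, purely illustrative helper (not part of the Marqo or OpenCLIP APIs) that splits a model name into its components:

def describe_open_clip_name(name: str) -> dict:
    # e.g. "open_clip/ViT-L-14/laion2b_s32b_b82k"
    _, architecture, pretraining = name.split("/")
    return {
        "architecture": architecture,  # tower, size, and patch size, e.g. ViT-L-14
        "pretraining": pretraining,    # dataset, samples seen, batch size, e.g. laion2b_s32b_b82k
    }

print(describe_open_clip_name("open_clip/ViT-L-14/laion2b_s32b_b82k"))
# {'architecture': 'ViT-L-14', 'pretraining': 'laion2b_s32b_b82k'}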
I want a balanced model
For a good balance of speed and relevancy we recommend open_clip/ViT-L-14/laion2b_s32b_b82k as a starting point. This model uses a 224x224px image with 14x14px patches and has strong image and text understanding from the diverse laion2b dataset. It is best used with a GPU, although it can run on a CPU.
If compute is more constrained or latency is more critical, then the smaller open_clip/ViT-B-16/laion2b_s34b_b88k is a good choice. This model is very close in performance; however, it uses a smaller architecture and slightly larger patches. An example index configuration is sketched below.
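For example, an index using the balanced model can be created the same way as in the first example (the index name here is illustrative; swap in open_clip/ViT-B-16/laion2b_s34b_b88k if latency is the priority):

import marqo

mq = marqo.Client()

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/ViT-L-14/laion2b_s32b_b82k",  # balanced speed and relevancy
}
mq.create_index("my-balanced-index", settings_dict=settings)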
I want the best image understanding
As a general rule, models with smaller patches and/or larger input images tend to exhibit better image understanding; however, image understanding is also heavily influenced by data and training, so this is not a hard and fast rule.
The following models are strong choices for image understanding:
Vision Transformer Models
The vision towers for these models are ViT architectures.
open_clip/ViT-L-14/laion2b_s32b_b82k: A large model with 14x14px patches and a 224x224px image size.
open_clip/ViT-H-14/laion2b_s32b_b79k: A huge model with 14x14px patches and a 224x224px image size. (GPU strongly recommended)
open_clip/ViT-g-14/laion2b_s34b_b88k: A gigantic model with 14x14px patches and a 224x224px image size. (GPU strongly recommended)
ConvNeXT Models
The vision towers for these models are ConvNeXT architectures.
open_clip/convnext_base_w_320/laion_aesthetic_s13b_b82k_augreg: A base model with 320x320px images.
open_clip/convnext_large_d_320/laion2b_s29b_b131k_ft_soup: A large model with 320x320px images. (GPU strongly recommended)
ResNet Models
The vision towers for these models are ResNet architectures. While they are often not as strong in image-text search tasks as other models, these models can be quite good at image-image search tasks.
open_clip/RN50x64/openai: A ResNet model with 448x448px images. This model is quite large, so a GPU is recommended.
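As a sketch (the index name and image URL below are placeholders), an index built for image-heavy search can use one of the larger models above; because treatUrlsAndPointersAsImages is enabled, an image URL can also be passed as the query itself for image-to-image search:

import marqo

mq = marqo.Client()

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/ViT-H-14/laion2b_s32b_b79k",  # GPU strongly recommended
}
mq.create_index("my-image-heavy-index", settings_dict=settings)

# With URLs treated as images, the query below is embedded by the image
# tower and compared against indexed image embeddings (image-to-image search).
results = mq.index("my-image-heavy-index").search(
    "https://example.com/query-image.jpg"  # placeholder URL
)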
I want the best text understanding
In use cases where the text is more important than the image, it is important to select a model with a strong text tower. Typically, models whose text tower was pretrained independently of the image tower are a good choice.
XLM-RoBERTa text towers are pretrained on a large amount of multilingual data and have strong text understanding. These models are significantly better at multilingual search than English-only models; however, the dedicated multilingual CLIP models perform better.
open_clip/xlm-roberta-base-ViT-B-32/laion5b_s13b_b90k: A pretrained base XLM-RoBERTa which was then further trained alongside an untrained ViT-B-32 vision tower.
open_clip/xlm-roberta-large-ViT-H-14/frozen_laion5b_s13b_b90k: A pretrained large XLM-RoBERTa which was then further trained alongside a frozen ViT-H-14 vision tower from open_clip/ViT-H-14/laion2b_s32b_b79k.
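As a sketch (the index name and query are illustrative), an index built around the stronger text tower can be created and queried with longer, more descriptive text:

import marqo

mq = marqo.Client()

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/xlm-roberta-base-ViT-B-32/laion5b_s13b_b90k",
}
mq.create_index("my-text-heavy-index", settings_dict=settings)

# Longer, descriptive queries benefit from the pretrained XLM-RoBERTa text tower.
results = mq.index("my-text-heavy-index").search(
    "a vintage red leather armchair with brass studs in a sunlit living room"
)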
I want fast inference
For fast inference times, smaller models with larger patches are typically a better choice.
open_clip/ViT-B-32/laion2b_s34b_b79k: A base model with 32x32px patches and a 224x224px image size.
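As a rough sketch (the index name, query, and timing code are illustrative, and real latency depends on your hardware and deployment), you can time a search against the smaller model to check that it meets your latency budget:

import time

import marqo

mq = marqo.Client()

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",  # small model, larger patches
}
mq.create_index("my-low-latency-index", settings_dict=settings)

start = time.perf_counter()
mq.index("my-low-latency-index").search("a red bicycle")
print(f"Search took {time.perf_counter() - start:.3f}s")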
I want multilingual image-text search
Marqo also supports a selection of multilingual CLIP models.
multilingual-clip/XLM-Roberta-Large-Vit-L-14: A large XLM-RoBERTa text tower trained alongside a large ViT-L-14 vision tower. This model is quite large, so ensure you have sufficient RAM/VRAM (>6GB).
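As a sketch (the index name and the German query are illustrative), a multilingual index can be created with this model and queried in languages other than English:

import marqo

mq = marqo.Client()

settings = {
    "treatUrlsAndPointersAsImages": True,
    "model": "multilingual-clip/XLM-Roberta-Large-Vit-L-14",
}
mq.create_index("my-multilingual-index", settings_dict=settings)

# The multilingual text tower maps non-English queries into the same
# embedding space as the images.
results = mq.index("my-multilingual-index").search("ein roter Apfel auf einem Tisch")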