Multimodal Combination Field Best Practices

Multimodal combination fields are a powerful way to incorporate both text and image understanding into your Marqo index. Combining text and image into a single vector is recommended for most use cases where your text contains useful information that will help with search.

A good example of a situation where text is helpful in complementing an image would be implementing search for an e-commerce site.

Situation:

You are building e-commerce search and all your images are "dressed" images, meaning they are images of models wearing the clothes in context, or furniture in situ with complementary items. The text is very useful in this situation because it guides the vector towards the specific product of interest. A sofa pictured in a living room with a coffee table and a rug will likely appear in searches for "coffee table", whereas a vector composed of the image and the product title "Three Seater Slipcover Modern Sofa in Green" will steer the vector away from the other items in the image.

How to Choose Weight and Fields

The goal of a multimodal combination field is to help bring the product closer to the searches you expect to return it. The text should contain information which complements the image. Good examples of the types of metadata which can work well with images are:

  • Product Title
  • Product Description
  • Brand
  • Product Tags

We recommend that most of the weight be placed on the image(s). As a rule of thumb, a good starting point is to place 90% of the weight on the image(s) and 10% on the text(s).

For example:

{
    "mappings": {
        "product_mm_field": {
            "type": "multimodal_combination",
            "weights": {
                "image_url": 0.9, 
                "product_title": 0.1
            }
        }
    }
}

This 90% image and 10% text weight is a good starting point for most use cases and works well with our recommended multimodal models for balanced search and image understanding.
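Conceptually, a multimodal combination field can be thought of as a weighted sum of the individual field embeddings, renormalized to unit length. The sketch below illustrates this idea with toy 3-dimensional vectors; the vector values, dimensionality, and `combine` helper are illustrative only and do not reflect Marqo's internal implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def combine(vectors, weights):
    """Weighted sum of normalized field vectors, renormalized to unit length."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for field, vec in vectors.items():
        unit = normalize(vec)
        for i in range(dim):
            out[i] += weights[field] * unit[i]
    return normalize(out)

# Toy stand-ins for an image embedding and a product-title embedding.
image_vec = [0.2, 0.9, 0.4]
title_vec = [0.7, 0.1, 0.7]

combined = combine(
    {"image_url": image_vec, "product_title": title_vec},
    {"image_url": 0.9, "product_title": 0.1},
)
```

With a 0.9/0.1 split, the combined vector stays close to the image embedding while being nudged towards the product title, which is exactly the "steering" effect described above.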

Higher Text Weight Applications

In rarer use cases the text may be more important to the search. In these cases it can make sense to give the text more weight in the mapping, such as 60% image and 40% text, or even 50% image and 50% text.
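A text-heavier mapping follows the same structure as the earlier example, only with the weights shifted. The field names below are illustrative:

```json
{
    "mappings": {
        "product_mm_field": {
            "type": "multimodal_combination",
            "weights": {
                "image_url": 0.6,
                "product_title": 0.4
            }
        }
    }
}
```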

There are two caveats to keep in mind when increasing the text weight:

  • Text Length: CLIP models typically have a context length of 77 tokens for text. This is short and equates to approximately 58 words. If your text is longer than this it will be truncated.
  • Model Selection: We provide recommendations for CLIP models with good text understanding. It is strongly recommended that one of these is used when applying a higher text weight.
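Because CLIP uses a BPE tokenizer, the only exact way to count tokens is to run the tokenizer itself; as a rough pre-check, though, you can flag text that is likely to be truncated from a word count. The ~1.3 tokens-per-word ratio below is an assumed approximation for English text, not a CLIP constant:

```python
def likely_truncated(text, context_length=77, tokens_per_word=1.3):
    """Rough estimate of whether text exceeds CLIP's context window.

    Two of the 77 positions are reserved for start/end tokens, and the
    tokens-per-word ratio is an approximation for English text.
    """
    budget = context_length - 2
    estimated_tokens = len(text.split()) * tokens_per_word
    return estimated_tokens > budget

# A typical product title fits comfortably:
likely_truncated("Three Seater Slipcover Modern Sofa in Green")  # False

# A long product description is likely to be cut off:
likely_truncated("lorem " * 100)  # True
```

If your descriptions routinely exceed this budget, consider putting only the title in the combination field, or trimming the description to its first sentence before indexing.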