Images

Marqo supports several methods for pre-processing images. Pre-processing breaks an image into sub-images (or patches) that represent regions of interest. Each patch is produced by cropping the image to a proposed region and is indexed along with the original image. An image can have multiple patches associated with it, and the patches are searched alongside the original image. This can improve search results and also provides localisation, which is akin to highlighting in text-based search.

By default no pre-processing is performed, but a method can easily be specified when setting up the index. The method must be specified at indexing time and cannot be changed afterwards. If another method is required, create a new index.

Heuristic based patching

Heuristic-based patching uses a simple, fixed scheme to propose the regions of interest, rather than a learned model.

Simple

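import marqo

# Connect to a running Marqo instance (this URL is an assumption: a local default deployment).
mq = marqo.Client(url="http://localhost:8882")

settings = {
    "index_defaults": {
        # Treat URLs and file pointers in documents as images to embed.
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "simple"
        },
        "model": "ViT-B/32",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)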

The settings above use the 'simple' patching scheme, which splits the image into smaller patches. At indexing time, the original image is indexed and searchable as usual, and the sub-image patches are generated and indexed as well.

The sub-image patches are generated by dividing the image into a 3x3 grid, so the original image gains child images made up of the grid cells. At search time, both the original image and all of its sub-images are searched.
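
To make the 3x3 grid concrete, the following is a minimal sketch of how such patch boxes could be computed. It is not Marqo's implementation: the function name `grid_patches`, the `(left, top, right, bottom)` box format, and the use of integer division for cell boundaries are all assumptions made for illustration.

```python
def grid_patches(width, height, rows=3, cols=3):
    """Return (left, top, right, bottom) pixel boxes for a rows x cols grid.

    Hypothetical helper: cell boundaries use integer division, so the
    cells tile the image exactly with no gaps or overlap.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


patches = grid_patches(640, 480)
print(len(patches))  # 9
print(patches[0])    # (0, 0, 213, 160)
```

Each box could then be passed to an image library's crop function to obtain the sub-image that is embedded and indexed next to the original.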

Overlap

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "overlap"
        },
        "model":"ViT-B/32",
        "normalize_embeddings":True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)

The settings above use an extension of the 'simple' patching scheme that also includes patches which overlap the grid.
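
The text above does not say how the overlapping patches are placed, so the following sketch is only one plausible reading. It assumes a 3x3 base grid plus extra windows of the same size shifted by half a cell, so that each extra window straddles the grid lines. The function name `overlap_patches` and the half-cell offset are hypothetical and may differ from what Marqo actually does.

```python
def overlap_patches(width, height, rows=3, cols=3):
    """Return grid boxes plus half-cell-shifted boxes as (left, top, right, bottom).

    Assumption for illustration only: the overlapping windows are the same
    size as a grid cell and are offset by half a cell in both directions.
    """
    cell_w = width // cols
    cell_h = height // rows
    boxes = []
    # Base grid, as in the non-overlapping scheme.
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
    # Extra windows centred on the interior grid intersections.
    for r in range(rows - 1):
        for c in range(cols - 1):
            left = c * cell_w + cell_w // 2
            top = r * cell_h + cell_h // 2
            boxes.append((left, top, left + cell_w, top + cell_h))
    return boxes


boxes = overlap_patches(600, 300)
print(len(boxes))  # 13
print(boxes[9])    # (100, 50, 300, 150)
```

Under this assumption an object that is cut in half by a grid line can still appear whole inside one of the shifted windows.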

Advanced methods

The advanced methods use learned models to propose the regions of the image to patch. Both supervised and unsupervised region proposal methods are supported. These methods will continue to evolve, and suggested improvements or feature requests can be raised as an issue or PR on GitHub.

Faster-rcnn

Setting the patch method to frcnn invokes PyTorch's pretrained Faster R-CNN model.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "frcnn"
        },
        "model":"ViT-B/32",
        "normalize_embeddings":True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)

Marqo-yolo

Setting the patch method to marqo-yolo invokes a pretrained YOLOX model that was trained on the LVIS dataset. The class-agnostic scores ("objectness") are used for non-maximum suppression (NMS).

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "marqo-yolo"
        },
        "model":"ViT-B/32",
        "normalize_embeddings":True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)
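
The marqo-yolo description mentions that objectness scores drive the NMS step. As a rough illustration of what that step does, here is a minimal greedy non-maximum suppression sketch in plain Python. The function names, the `(left, top, right, bottom)` box format, and the 0.5 IoU threshold are assumptions for this example, not values taken from Marqo.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep  # indices of the surviving boxes, best score first


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
objectness = [0.9, 0.8, 0.7]  # class-agnostic scores, one per box
print(nms(boxes, objectness))  # [0, 2]
```

Because the scores are class-agnostic, two heavily overlapping boxes are treated as duplicates regardless of which class a detector would assign to them, which suits region proposal where only the location matters.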

DINO v1

Setting the patch method to dino-v1 invokes a pretrained DINO vision transformer model (currently ViT small 16). The bounding boxes are determined from the summed attention maps using contours and are then passed through non-maximum suppression (NMS). This method produces fewer boxes than the dino-v2 method described below.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "dino-v1"
        },
        "model":"ViT-B/32",
        "normalize_embeddings":True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)

DINO v2

Setting the patch method to dino-v2 invokes a pretrained DINO vision transformer model (currently ViT small 16). The bounding boxes are determined from the individual attention maps using contours and are then passed through non-maximum suppression (NMS).

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "image_preprocessing": {
            "patch_method": "dino-v2"
        },
        "model":"ViT-B/32",
        "normalize_embeddings":True,
    },
}
response = mq.create_index("my-multimodal-index", settings_dict=settings)
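
The difference between the two DINO methods (summed versus individual attention maps) can be sketched with a toy example. This is a simplification under stated assumptions: attention maps are small nested lists, a single bounding box around all above-threshold cells stands in for contour extraction, and the helper names are hypothetical. It is meant only to show why summing the maps first tends to yield fewer boxes.

```python
def box_from_map(attn, threshold):
    """Bounding box (left, top, right, bottom) around cells above threshold, or None."""
    cells = [(x, y) for y, row in enumerate(attn) for x, v in enumerate(row) if v > threshold]
    if not cells:
        return None
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def summed_boxes(head_maps, threshold):
    """dino-v1 style (assumed): sum the maps over heads, then extract boxes once."""
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*head_maps)]
    box = box_from_map(summed, threshold)
    return [box] if box is not None else []


def per_head_boxes(head_maps, threshold):
    """dino-v2 style (assumed): extract boxes from each head's map separately."""
    boxes = [box_from_map(m, threshold) for m in head_maps]
    return [b for b in boxes if b is not None]


# Two toy 3x3 attention maps, each attending to a different corner.
head_a = [[0.9, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
head_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.8]]

print(summed_boxes([head_a, head_b], 0.5))    # [(0, 0, 3, 3)]
print(per_head_boxes([head_a, head_b], 0.5))  # [(0, 0, 1, 1), (2, 2, 3, 3)]
```

In both cases the resulting boxes would then go through NMS before being used as patches.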