Documents

Add or replace documents

POST /indexes/{index_name}/documents

Add an array of documents or replace them if they already exist. If the provided index does not exist, it will be created.

If you send a document with an _id that corresponds to an existing document, the new document will overwrite the existing document.

This endpoint accepts the application/json content type.

Path parameters

Name Type Description
index_name String name of the index

Query parameters

Query Parameter Type Default Value Description
refresh Boolean true Forces a refresh after adding documents. This makes the documents available for searching. If you are happy to wait for the system to refresh, you can set this to false for better performance. In the Python client, this parameter is called auto_refresh: my_index.add_documents(..., auto_refresh=False)
device String null The device used to index the documents. If device is not specified and CUDA devices are available to Marqo (see here for more info), Marqo will speed up the indexing process by using available CUDA devices. Otherwise, the CPU will be used. Options include cpu and cuda, cuda1, cuda2 etc. The cuda option tells Marqo to use any available cuda devices.
telemetry Boolean False If true, the telemetry object is returned in the add documents response body. This includes information like latency metrics. This is set at client instantiation time in the Python client: mq = marqo.Client(return_telemetry=True)
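
As a rough sketch of how these query parameters end up on the request URL, the snippet below builds the endpoint URL using only the Python standard library. The helper name build_add_documents_url and the base URL are illustrative assumptions, not part of any Marqo client.

```python
from urllib.parse import quote, urlencode


def build_add_documents_url(base_url, index_name, refresh=True, device=None, telemetry=False):
    """Return the add-documents URL with its query string (hypothetical helper)."""
    # Booleans are written in lowercase, like expose_facets=true in the cURL examples.
    params = {"refresh": str(refresh).lower(), "telemetry": str(telemetry).lower()}
    if device is not None:
        # Leave `device` out entirely to let Marqo choose a device itself.
        params["device"] = device
    path = f"/indexes/{quote(index_name, safe='')}/documents"
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


url = build_add_documents_url(
    "http://localhost:8882", "my-first-index", refresh=False, device="cuda1"
)
print(url)
```

Setting refresh=False here mirrors auto_refresh=False in the Python client.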

Body

In the REST API and for cURL users, these parameters are in lowerCamelCase, as presented in the following table. The Python client uses the Pythonic snake_case equivalents.

Add documents parameters Value Type Default Value Description
documents Array of objects An array of documents. Each document is represented as a JSON object. You can optionally set a document's ID with the special _id field. The _id must be a string. If an ID is not specified, Marqo will generate one.
tensorFields Array of Strings [] The fields within these documents which will be tensor fields, and therefore will have vectors generated for them. Tensor search can only be performed on these fields for these documents. Pre-filtering and lexical search are still viable on text fields which are not included in the tensorFields parameter. For the best recall and speed performance, we recommend minimising the number of different tensor fields for your index. For production use cases where speed and recall are critical, we recommend only a single tensor field for the entire index.
useExistingTensors Boolean false Setting this to true will get existing tensors for unchanged fields in documents that are indexed with an id. Note: Marqo analyses the field string for updates, so Marqo can't detect a change if a URL points to a different image.
imageDownloadHeaders Dict null An object that consists of key-value pair headers for image download. Can be used to authenticate the images for download.
mappings Dict null An object to handle object fields in documents. Check mappings for more information. Mappings are required to create multimodal tensor combination fields; see here for more information.
modelAuth Dict null An object that consists of authorisation details used by Marqo to download non-publicly available models. Check here for more information.
client_batch_size Integer null A helper parameter, available in the Python client only, that splits very large lists of documents into batches of a more manageable size for Marqo.
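
To make the naming convention concrete, here is a minimal sketch that converts Python-style keyword names into the lowerCamelCase keys used in the REST body. Both helper names (to_lower_camel and to_rest_body) are hypothetical and exist only for this illustration; the sketch covers body parameters only, not query parameters such as refresh.

```python
def to_lower_camel(name):
    """Convert a snake_case parameter name to lowerCamelCase."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_rest_body(**python_kwargs):
    """Build a REST-style body from Python-style keyword arguments (illustrative only)."""
    # client_batch_size is listed as Python-client-only, so it is dropped here.
    python_kwargs.pop("client_batch_size", None)
    return {to_lower_camel(key): value for key, value in python_kwargs.items()}


body = to_rest_body(
    documents=[{"_id": "article_591", "Title": "Extravehicular Mobility Unit (EMU)"}],
    tensor_fields=["Title"],
    use_existing_tensors=True,
    client_batch_size=24,
)
print(sorted(body))
```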

Example

curl -XPOST 'http://localhost:8882/indexes/my-first-index/documents' \
-H 'Content-type:application/json' -d '
{
  "documents": [ 
      {
           "Title": "The Travels of Marco Polo",
           "Description": "A 13th-century travelogue describing the travels of Polo",
           "Genre": "History"
        }, 
      {
          "Title": "Extravehicular Mobility Unit (EMU)",
          "Description": "The EMU is a spacesuit that provides environmental protection",
          "_id": "article_591",
          "Genre": "Science"
      }
  ],
  "tensorFields": ["Description"]
}'

import marqo

# Assumes a Marqo instance is running locally on the default port.
mq = marqo.Client(url="http://localhost:8882")

mq.index("my-first-index").add_documents(
    [
        {
            "Title": "The Travels of Marco Polo",
            "Description": "A 13th-century travelogue describing the travels of Polo",
            "Genre": "History"
        },
        {
            "Title": "Extravehicular Mobility Unit (EMU)",
            "Description": "The EMU is a spacesuit that provides environmental protection",
            "_id": "article_591",
            "Genre": "Science"
        }
    ],
    tensor_fields=["Description"]
)

Response: 200 OK

{
   "errors":false,
   "items":[
      {
         "_id":"5aed93eb-3878-4f12-bc92-0fda01c7d23d",
         "result":"created",
         "status":201
      },
      {
         "_id":"article_591",
         "result":"updated",
         "status":200
      }
   ],
   "processingTimeMs":6,
   "index_name":"my-first-index"
}

The first document in this example had its _id generated by Marqo. A document with _id = article_591 already existed in the index, so it was updated rather than created. We want Description to be searchable with tensor search (Marqo's default search), so we explicitly declare it as a tensor field. Tensor fields are stored alongside a vector representation of the data, which allows for multimodal and semantic search.
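
One way to act on such a response is to group the returned IDs by their result value. The sketch below does this for the response shown above; summarize_add_response is a hypothetical helper written for this page, not a client method.

```python
def summarize_add_response(response):
    """Group the IDs in an add-documents response by their `result` value."""
    summary = {}
    for item in response["items"]:
        summary.setdefault(item["result"], []).append(item["_id"])
    return summary


# The response body from the example above, as a Python dict.
response = {
    "errors": False,
    "items": [
        {"_id": "5aed93eb-3878-4f12-bc92-0fda01c7d23d", "result": "created", "status": 201},
        {"_id": "article_591", "result": "updated", "status": 200},
    ],
    "processingTimeMs": 6,
    "index_name": "my-first-index",
}

summary = summarize_add_response(response)
print(summary)
```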

Documents

Parameter: documents

Expected value: An array of documents. Each document is a JSON object that is to be added to the index. Each key is the name of a document's field and its value is the content for that field. See here for the allowed field data types. The optional _id key can be used to specify a string as the document's ID.

[
  {
    "Title": "The Travels of Marco Polo",
    "Description": "A 13th-century travelogue describing Polo's travels"
  }, 
  {
    "Title": "Extravehicular Mobility Unit (EMU)",
    "Description": "The EMU is a spacesuit that provides environmental protection",
    "_id": "article_591"
  }
]
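
The rules above can be checked on the client side before sending a request. The following is a minimal sketch under the assumption that you want to fail early; check_documents is a hypothetical helper, and Marqo performs its own validation regardless.

```python
def check_documents(documents):
    """Raise TypeError if `documents` breaks the rules described above."""
    if not isinstance(documents, list):
        raise TypeError("documents must be an array (a Python list)")
    for position, doc in enumerate(documents):
        if not isinstance(doc, dict):
            raise TypeError(f"document {position} must be a JSON object (a Python dict)")
        # _id is optional, but when present it must be a string.
        if "_id" in doc and not isinstance(doc["_id"], str):
            raise TypeError(f"document {position}: _id must be a string")
    return documents


documents = check_documents([
    {"Title": "The Travels of Marco Polo"},
    {"Title": "Extravehicular Mobility Unit (EMU)", "_id": "article_591"},
])
print(len(documents))
```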

Image auth

Parameter: imageDownloadHeaders

Expected value: An object that consists of key-value pair headers for image download. If set, Marqo will use this to authenticate the images for download.

Default value: null

Example

mq.create_index("my-first-index", treat_urls_and_pointers_as_images=True, model="ViT-L/14")
mq.index("my-first-index").add_documents(
  [
    {
      "img": "https://my-image-store.com/image_1.png",
      "title": "A lion roaming around..."
    },
    {
      "img": "https://my-image-store.com/image_2.png",
      "title": "Astronauts playing football"
    }
  ],
  image_download_headers={
    "my-image-store-api-key": "some-super-secret-image-store-key"
  },
  tensor_fields=["img", "title"]
)

Mappings

Parameter: mappings

Expected value: JSON object with field names as keys, mapped to objects with type (currently only multimodal_combination is supported) and weights which is an object that maps each nested field to a relative weight.

Default value: null

The mappings object allows adding nested fields, such as multimodal fields. With these fields, child fields are vectorised and combined into a single tensor via a weighted-sum approach using the weights object.

The combined tensor will be used for tensor search. The multimodal combination field must be in tensor_fields.

Child fields can be used for lexical search or tensor search with filtering. All child fields and their contents must be str.

Read more about using mappings and multimodal combination fields here.
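
To illustrate the shape of a mappings object and the weighted-sum idea, here is a small sketch. The field names (my_combination_field, my_image, some_text) and the two-dimensional vectors are made up for the example, and the code is not Marqo's implementation; for instance, whether the combined vector is normalised afterwards is not covered here.

```python
# Hypothetical mappings object: one multimodal combination field with two child fields.
mappings = {
    "my_combination_field": {
        "type": "multimodal_combination",
        "weights": {"my_image": 0.5, "some_text": 0.5},
    }
}


def weighted_sum(child_vectors, weights):
    """Combine per-child vectors into one vector, scaling each by its weight."""
    dims = len(next(iter(child_vectors.values())))
    combined = [0.0] * dims
    for field, vector in child_vectors.items():
        for i, value in enumerate(vector):
            combined[i] += weights[field] * value
    return combined


# Toy stand-ins for the vectors a model would produce for each child field.
child_vectors = {"my_image": [1.0, 0.0], "some_text": [0.0, 1.0]}
combined = weighted_sum(child_vectors, mappings["my_combination_field"]["weights"])
print(combined)
```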

Model auth

Parameter: modelAuth

Expected value: JSON object with either an s3 or an hf model store authorisation object.

Default value: null

The model_auth object allows searching on indexes that use OpenCLIP and CLIP models from private Hugging Face and AWS S3 stores.

The model_auth object contains either an s3 or an hf model store authorisation object. The model store authorisation object contains the credentials needed to access the index's non-publicly accessible model. See the examples below for details.

The index's settings must specify the location of the non-publicly accessible model in the settings' model_properties object.

model_auth is used to initially download the model. After downloading, Marqo caches the model so that it doesn't need to be redownloaded.
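
Because model_auth holds either an s3 or an hf object, a quick client-side check can catch malformed input early. This is only a sketch; model_store_of is a hypothetical helper and does not replace the validation Marqo performs itself.

```python
def model_store_of(model_auth):
    """Return "s3" or "hf", whichever store the model_auth object targets."""
    stores = [key for key in ("s3", "hf") if key in model_auth]
    # Exactly one store object is expected, with no other top-level keys.
    if len(stores) != 1 or len(model_auth) != 1:
        raise ValueError("model_auth must contain exactly one of 's3' or 'hf'")
    return stores[0]


s3_auth = {
    "s3": {
        "aws_access_key_id": "<SOME ACCESS KEY ID>",
        "aws_secret_access_key": "<SOME SECRET ACCESS KEY>",
    }
}
hf_auth = {"hf": {"token": "<SOME HF TOKEN>"}}

print(model_store_of(s3_auth), model_store_of(hf_auth))
```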

Example: AWS S3

# Create an index that specifies the non-public location of the model.
# Note the `auth_required` field in `model_properties` which tells Marqo to use
# the modelAuth it finds during add_documents to download the model
mq.create_index(
    index_name="my-cool-index",
    settings_dict={
        "index_defaults": {
            "treat_urls_and_pointers_as_images": True,
            "model": "my_s3_model",
            "normalize_embeddings": True,
            "model_properties": {
                "name": "ViT-B/32",
                "dimensions": 512,
                "model_location": {
                    "s3": {
                        "Bucket": "<SOME BUCKET>",
                        "Key": "<KEY TO IDENTIFY MODEL>",
                    },
                    "auth_required": True
                },
                "type": "open_clip",
            }
        }
    }
)

# Specify the authorisation needed to access the private model during add_documents.
# We recommend setting up the credential's AWS user so that it has only the
# minimal access needed to retrieve the model.
mq.index("my-cool-index").add_documents(
    documents=[
        {"Title": "The coolest moon walks"}
    ],
    auto_refresh=True,
    model_auth={
        "s3": {
            "aws_access_key_id": "<SOME ACCESS KEY ID>",
            "aws_secret_access_key": "<SOME SECRET ACCESS KEY>"
        }
    },
    tensor_fields=["Title"]
)

Example: Hugging Face (HF)

# Create an index that specifies the non-public location of the model.
# Note the `auth_required` field in `model_properties` which tells Marqo to use
# the modelAuth it finds during add_documents to download the model
mq.create_index(
    index_name="my-cool-index",
    settings_dict={
        "index_defaults": {
            "treat_urls_and_pointers_as_images": True,
            "model": "my_hf_model",
            "normalize_embeddings": True,
            "model_properties": {
                "name": "ViT-B/32",
                "dimensions": 512,
                "model_location": {
                    "hf": {
                        "repo_id": "<SOME HF REPO NAME>",
                        "filename": "<THE FILENAME TO DOWNLOAD>",
                    },
                    "auth_required": True
                },
                "type": "open_clip",
            }
        }
    }
)

# Specify the authorisation needed to access the private model during add_documents:
mq.index("my-cool-index").add_documents(
    documents=[
        {'Title': 'The coolest moon walks'}
    ],
    tensor_fields=['Title'],
    model_auth={
        'hf': {
            "token" : "<SOME HF TOKEN>", 
        }
    }
)

Client batch size (Python client only)

Parameter: client_batch_size

Expected value: An Integer greater than 0.

Default value: None

A helper parameter, available in the Python client only, that splits very large lists of documents into batches of a more manageable size for Marqo. If very large documents are being indexed, we recommend setting this lower. client_batch_size=24 is a good place to start; adjust it for your use case as necessary.

Example

many_documents = [{"_id": f"doc_{i}", "Title": f"This is document number {i}"} for i in range(10000)]
mq.index("my-first-index").add_documents(
  many_documents, client_batch_size=24, tensor_fields=['Title']
)
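
Conceptually, batching just slices the document list into consecutive chunks of at most client_batch_size documents. The sketch below shows that idea in plain Python; split_into_batches is a hypothetical helper and not the client's actual implementation.

```python
def split_into_batches(documents, client_batch_size):
    """Split `documents` into consecutive lists of at most `client_batch_size` items."""
    if client_batch_size <= 0:
        raise ValueError("client_batch_size must be an integer greater than 0")
    return [
        documents[start:start + client_batch_size]
        for start in range(0, len(documents), client_batch_size)
    ]


many_documents = [
    {"_id": f"doc_{i}", "Title": f"This is document number {i}"} for i in range(10000)
]
batches = split_into_batches(many_documents, client_batch_size=24)
print(len(batches), len(batches[-1]))
```

With 10000 documents and a batch size of 24, this produces 417 batches, the last of which holds the remaining 16 documents.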

Get one document

GET /indexes/{index_name}/documents/{document_id}
Gets a document using its ID.

Path parameters

Name Type Description
index_name String name of the index
document_id String ID of the document

Query parameters

Search parameter Type Default value Description
expose_facets Boolean False If true, the document's tensor facets are returned. This is a list of objects. Each facet object contains document data and its associated embedding (found in the facet's _embedding field)

Example

curl -XGET 'http://localhost:8882/indexes/my-first-index/documents/article_152?expose_facets=true'

mq.index("my-first-index").get_document(
    document_id="article_152",
    expose_facets=True
)

Response: 200 OK

{'Blurb': 'A rocket car is a car powered by a rocket engine. This treatise '
          'proposes that rocket cars are the inevitable future of land-based '
          'transport.',
 'Title': 'Treatise on the viability of rocket cars',
 '_id': 'article_152',
 '_tensor_facets': [{'Title': 'Treatise on the viability of rocket cars',
                     '_embedding': [-0.10393160581588745,
                                    0.0465407557785511,
                                    -0.01760256476700306,
                                    ...]},
                    {'Blurb': 'A rocket car is a car powered by a rocket '
                              'engine. This treatise proposes that rocket cars '
                              'are the inevitable future of land-based '
                              'transport.',
                     '_embedding': [-0.045681700110435486,
                                    0.056278493255376816,
                                    0.022254955023527145,
                                    ...]}]
}

In this example, the GET document request was sent with the expose_facets parameter set to true, so the _tensor_facets field is returned. Each facet contains a key-value pair that holds the content of the facet, and an _embedding field, which is the vector representation of that content.
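
A returned document can be unpacked into its embeddings by walking the _tensor_facets list. The sketch below assumes the structure shown in the response above; embeddings_by_field is a hypothetical helper, and the three-element vectors are shortened placeholders rather than real embeddings.

```python
def embeddings_by_field(document):
    """Map each field name to the list of embeddings found in the document's facets."""
    embeddings = {}
    for facet in document.get("_tensor_facets", []):
        for key in facet:
            if key != "_embedding":
                # A field may appear in several facets, so collect a list per field.
                embeddings.setdefault(key, []).append(facet["_embedding"])
    return embeddings


document = {
    "_id": "article_152",
    "Title": "Treatise on the viability of rocket cars",
    "_tensor_facets": [
        {"Title": "Treatise on the viability of rocket cars", "_embedding": [0.1, 0.2, 0.3]},
        {"Blurb": "A rocket car is a car powered by a rocket engine.", "_embedding": [0.4, 0.5, 0.6]},
    ],
}

by_field = embeddings_by_field(document)
print(sorted(by_field))
```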

Get multiple documents

GET /indexes/{index_name}/documents
Gets a selection of documents based on their IDs.

This endpoint accepts the application/json content type.

Path parameters

Name Type Description
index_name String name of the index

Query parameters

Search parameter Type Default value Description
expose_facets Boolean False If true, the documents' tensor facets are returned. This is a list of objects. Each facet object contains document data and its associated embedding (found in the facet's _embedding field)

Body

An array of IDs. Each ID is a string.

["article_152", "article_490", "article_985"]

Example

curl -XGET http://localhost:8882/indexes/my-first-index/documents -H 'Content-Type: application/json' -d '
    ["article_152", "article_490", "article_985"]
'

mq.index("my-first-index").get_documents(
    document_ids=["article_152", "article_490", "article_985"]
)

Response: 200 OK

{'results': [{'Blurb': 'A rocket car is a car powered by a rocket engine. This '
                       'treatise proposes that rocket cars are the inevitable '
                       'future of land-based transport.',
              'Title': 'Treatise on the viability of rocket cars',
              '_found': True,
              '_id': 'article_152'},
             {'_found': False, '_id': 'article_490'},
             {'Blurb': "One must maintain one's space suit. It is, after all, "
                       'the tool that will help you explore distant galaxies.',
              'Title': 'Your space suit and you',
              '_found': True,
              '_id': 'article_985'}]}

In this response, the index has no document with an ID of article_490. As a result, its _found field is false.
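
Since missing documents are reported in place rather than raising an error, callers usually need to separate the two cases themselves. A minimal sketch, using a trimmed copy of the response above and a hypothetical helper name:

```python
def split_found_and_missing(response):
    """Return (found documents, IDs of missing documents) from a get-documents response."""
    found = [doc for doc in response["results"] if doc["_found"]]
    missing_ids = [doc["_id"] for doc in response["results"] if not doc["_found"]]
    return found, missing_ids


response = {
    "results": [
        {"Title": "Treatise on the viability of rocket cars", "_found": True, "_id": "article_152"},
        {"_found": False, "_id": "article_490"},
        {"Title": "Your space suit and you", "_found": True, "_id": "article_985"},
    ]
}

found, missing_ids = split_found_and_missing(response)
print([doc["_id"] for doc in found], missing_ids)
```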

Delete documents

Delete documents identified by an array of their IDs.

POST /indexes/{index_name}/documents/delete-batch

Path parameters

Name Type Description
index_name String name of the index

Body

An array of the IDs of the documents to delete.

[ "article_591", "article_602" ]

Example

curl -XPOST http://localhost:8882/indexes/my-first-index/documents/delete-batch -H 'Content-type:application/json' -d '[
  "article_591", "article_602"
]'

mq.index("my-first-index").delete_documents(ids=["article_591", "article_602"])

Response: 200 OK

{
  "index_name":"my-first-index",
  "status":"succeeded",
  "type":"documentDeletion",
  "details":{
    "receivedDocumentIds":2,
    "deletedDocuments":1
  },
  "duration":"PT0.084367S",
  "startedAt":"2022-09-01T05:11:31.790986Z",
  "finishedAt":"2022-09-01T05:11:31.875353Z"
}

In this example, one of the articles didn't exist in the index. Therefore, only one document was deleted.
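
The details object makes it easy to detect IDs that were not deleted, and the two timestamps can be used to compute the elapsed time. The sketch below works on the response shown above; summarize_deletion is a hypothetical helper, and the timestamp format string is an assumption based on this one example.

```python
from datetime import datetime

# Assumed format of startedAt / finishedAt, inferred from the example response.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize_deletion(response):
    """Return how many requested IDs were not deleted and how long the deletion took."""
    details = response["details"]
    started = datetime.strptime(response["startedAt"], TIMESTAMP_FORMAT)
    finished = datetime.strptime(response["finishedAt"], TIMESTAMP_FORMAT)
    return {
        "not_deleted": details["receivedDocumentIds"] - details["deletedDocuments"],
        "elapsed": finished - started,
    }


response = {
    "index_name": "my-first-index",
    "status": "succeeded",
    "type": "documentDeletion",
    "details": {"receivedDocumentIds": 2, "deletedDocuments": 1},
    "duration": "PT0.084367S",
    "startedAt": "2022-09-01T05:11:31.790986Z",
    "finishedAt": "2022-09-01T05:11:31.875353Z",
}

summary = summarize_deletion(response)
print(summary["not_deleted"], summary["elapsed"])
```

The elapsed time computed from the two timestamps matches the duration field of the response (0.084367 seconds).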