Using Datadog to monitor your Marqo Cloud indexes
Marqo Cloud can publish performance and error metrics for your indexes to your Datadog account. To set this up, go to the integrations page in the Marqo Cloud console, click the Edit button, paste your Datadog API key, and select the Datadog site you want to send the metrics to. Then click Apply Changes.
Datadog sites
Datadog operates multiple sites. To set up the integration between Marqo Cloud and Datadog, you need to know which Datadog site your account uses. You can tell from the URL of your Datadog website. For instance, if your Datadog website URL is https://app.datadoghq.com, select the Datadog site datadoghq.com in the Marqo console when setting up your integration.
You can read more about Datadog sites, including the mapping between website URLs and Datadog sites, here.
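The URL-to-site mapping described above can be sketched in Python. This is a heuristic of our own, not a Datadog or Marqo API: it assumes the site is the host name of your Datadog website URL with any leading "app." label removed, which matches the datadoghq.com example. If your URL looks different, check the mapping on the Datadog sites page. The function name datadog_site_from_url is hypothetical.

```python
from urllib.parse import urlparse


def datadog_site_from_url(website_url: str) -> str:
    """Guess the Datadog site from the URL you use to open Datadog.

    Assumption (heuristic): strip a leading "app." label from the host;
    regional hosts such as us3.datadoghq.com are already the site name.
    """
    # Accept a bare host name as well as a full URL.
    if "://" not in website_url:
        website_url = "https://" + website_url
    host = (urlparse(website_url).hostname or "").lower()
    if not host:
        raise ValueError(f"cannot find a host name in {website_url!r}")
    if host.startswith("app."):
        host = host[len("app."):]
    return host


print(datadog_site_from_url("https://app.datadoghq.com"))  # datadoghq.com
```

Paths after the host (for example /dashboard/lists) are ignored, so you can paste any URL copied from your browser.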
API Key
You will need a Datadog API key to set up the Marqo Cloud to Datadog integration. To generate one in your Datadog account, navigate to Profile > Organization settings > API Keys and click New Key, or copy one of the existing keys. An API key only allows Marqo Cloud to push metrics to your Datadog account; it does not allow Marqo Cloud to access or read any data from your Datadog account.
You can read more about Datadog API keys here.
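Before pasting a key into the Marqo console, you may want to confirm that it is valid for your Datadog site. The sketch below assumes Datadog's key validation endpoint (GET https://api.{site}/api/v1/validate, with the key sent in a DD-API-KEY header, as listed in Datadog's API reference). The helper names are hypothetical and are not part of Marqo Cloud.

```python
import json
import urllib.error
import urllib.request


def build_validate_request(site: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for Datadog's API key validation endpoint."""
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key}, method="GET")


def api_key_is_valid(site: str, api_key: str) -> bool:
    """Send the request; a usable key is answered with {"valid": true}."""
    request = build_validate_request(site, api_key)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response).get("valid") is True
    except urllib.error.HTTPError:
        # Datadog rejects an unknown or revoked key with an HTTP error status.
        return False


# Usage (makes a network call):
# api_key_is_valid("datadoghq.com", "<your Datadog API key>")
```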
Metrics published to Datadog
Request Duration
A distribution metric that reports how long Marqo takes to process each request. While the index receives traffic, a new datapoint is published at least once every 30 seconds; no datapoint is published when there is no traffic.
Metric name: marqo.request.duration
Assigned Datadog tags
- index_name: Name of the index that the request was made against.
- request_path: API path of the request. This is /indexes/{index_name}/documents for add documents requests and /indexes/{index_name}/search for search requests.
- request_method: HTTP request method. This is one of POST, GET, PUT, PATCH, DELETE.
- resp_code: HTTP response status code returned by Marqo. This is a value between 200 and 599.
Percentile calculations
marqo.request.duration is published as a distribution metric in Datadog. If you need percentile calculations for this metric, enable them in Datadog: navigate to Metrics > Summary in the Datadog console and select marqo.request.duration. In the side panel that opens, toggle the Enable percentiles and threshold queries switch and click Save.
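Once percentiles are enabled, Datadog accepts a percentile aggregator such as p95 in place of avg on a distribution metric. A minimal Python sketch that builds such a latency query for one index, assuming Datadog's standard percentile aggregators p50, p75, p90, p95 and p99; percentile_query is a hypothetical helper.

```python
# Percentile aggregators assumed to be available once percentiles are enabled.
PERCENTILES = ("p50", "p75", "p90", "p95", "p99")


def percentile_query(percentile: str, index_name: str) -> str:
    """Build a Datadog query for a latency percentile of one Marqo index."""
    if percentile not in PERCENTILES:
        raise ValueError(f"unsupported percentile {percentile!r}; use one of {PERCENTILES}")
    # Doubled braces produce literal { } around the tag filter.
    return f"{percentile}:marqo.request.duration{{index_name:{index_name}}}"


print(percentile_query("p95", "A"))  # p95:marqo.request.duration{index_name:A}
```

Until percentiles are enabled for the metric, such a query returns no data.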
Example usage
To monitor the average duration of all search requests against index A that returned a 200 response code:
avg:marqo.request.duration{index_name:A,request_path:indexes/A/search,resp_code:200,request_method:post}
To monitor the count of all search requests against index A that returned a 200 response code:
count:marqo.request.duration{index_name:A,request_path:indexes/A/search,resp_code:200,request_method:post}.as_count()
To monitor the average duration of all requests against index A:
avg:marqo.request.duration{index_name:A}
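The example queries above share one pattern: an aggregator, the metric name, a comma-separated list of tag:value filters in braces, and optionally the .as_count() modifier. A hypothetical Python helper that assembles such query strings (a sketch, not part of any Marqo or Datadog SDK):

```python
def metric_query(aggregator: str, metric: str, tags: dict, as_count: bool = False) -> str:
    """Assemble a Datadog metric query of the form aggregator:metric{tag:value,...}."""
    # An empty tag set selects every source, written {*} in Datadog.
    scope = ",".join(f"{key}:{value}" for key, value in tags.items()) or "*"
    query = f"{aggregator}:{metric}{{{scope}}}"
    if as_count:
        # Append the same .as_count() modifier used in the count example above.
        query += ".as_count()"
    return query


search_tags = {
    "index_name": "A",
    "request_path": "indexes/A/search",
    "resp_code": 200,
    "request_method": "post",
}
print(metric_query("count", "marqo.request.duration", search_tags, as_count=True))
```

Because Python dictionaries keep insertion order, the tags appear in the query in the order you list them.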
Vector Count
Number of vectors currently stored in the index. A new datapoint is published every 30 seconds.
Metric Name: marqo.vector_count
Assigned Datadog tags
- index_name: Name of the index that the metric is for.
Example usage
To monitor the number of vectors in index A:
avg:marqo.vector_count{index_name:A}
Doc Count
Number of documents currently stored in the index. A new datapoint is published every 30 seconds.
Metric Name: marqo.doc_count
Assigned Datadog tags
- index_name: Name of the index that the metric is for.
Example usage
To monitor the number of documents in index A:
avg:marqo.doc_count{index_name:A}
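Because marqo.vector_count and marqo.doc_count are reported for the same index, dividing one by the other gives the average number of vectors stored per document. The Python sketch below shows only that arithmetic, with hypothetical readings; in a Datadog graph you would express the same thing as a formula over the two queries. vectors_per_document is a hypothetical helper.

```python
from typing import Optional


def vectors_per_document(vector_count: float, doc_count: float) -> Optional[float]:
    """Average vectors per document from marqo.vector_count and marqo.doc_count readings."""
    if doc_count <= 0:
        return None  # empty index: the ratio is undefined
    return vector_count / doc_count


# Hypothetical readings: 1,200 vectors across 400 documents.
print(vectors_per_document(1200, 400))  # 3.0
```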
Disk Used Percentage
Every index is provisioned with a fixed disk size based on the number of shards and the storage shard type you configure. This metric reports the percentage of that disk currently in use. For optimal performance, aim to keep it below 70%. A new datapoint is published every 30 seconds.
Metric Name: marqo.disk_used_percent
Assigned Datadog tags
- index_name: Name of the index that the metric is for.
Example usage
To monitor the percentage of disk used by index A:
avg:marqo.disk_used_percent{index_name:A}
Memory Used Percentage
Every index is provisioned with a fixed amount of memory based on the number of shards and the storage shard type you configure. This metric reports the percentage of that memory currently in use. For optimal performance, aim to keep it below 70%. A new datapoint is published every 30 seconds.
Metric Name: marqo.memory_used_percent
Assigned Datadog tags
- index_name: Name of the index that the metric is for.
Example usage
To monitor the percentage of memory used by index A:
avg:marqo.memory_used_percent{index_name:A}
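The guidance to keep disk and memory usage below 70% can be turned into a simple status check. This sketch assumes a warning level of 70% and a critical level of 80%, the same values used in the recommended monitors later on this page; utilisation_status is a hypothetical helper.

```python
def utilisation_status(used_percent: float, warning: float = 70.0, critical: float = 80.0) -> str:
    """Classify a disk or memory used-percentage reading for a Marqo index."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError(f"expected a percentage between 0 and 100, got {used_percent}")
    # Like a Datadog '>' monitor, a level counts only when strictly exceeded.
    if used_percent > critical:
        return "critical"
    if used_percent > warning:
        return "warning"
    return "ok"


for reading in (55.0, 72.5, 91.0):
    print(reading, utilisation_status(reading))
```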
Datadog dashboard setup example
To get started with your first Datadog dashboard for Marqo Cloud, you can copy this example Datadog dashboard config from here. Then go to your Datadog website (https://{your_datadog_website}/dashboard/lists) and create a new dashboard by clicking the New Dashboard button in the top-right corner of the Dashboards page.
Once your dashboard is created and open, click the Configure button in the top-right corner of the dashboard and select Import dashboard JSON from the menu. When the pop-up dialog opens, paste the JSON that you copied from the link above.
You have now set up a dashboard for all indexes in your Marqo Cloud account.
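If the import fails, a common cause is JSON that was truncated or altered while copying. A minimal Python check, assuming the copied config is a standard Datadog dashboard definition with the top-level title, layout_type and widgets keys that Datadog's dashboard API requires; check_dashboard_json is hypothetical.

```python
import json

# Top-level keys assumed to be present in a Datadog dashboard definition.
REQUIRED_KEYS = ("title", "layout_type", "widgets")


def check_dashboard_json(text: str) -> dict:
    """Parse copied dashboard JSON and confirm its required top-level keys."""
    try:
        dashboard = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON (was the copy truncated?): {exc}") from exc
    if not isinstance(dashboard, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in dashboard]
    if missing:
        raise ValueError(f"missing top-level keys: {', '.join(missing)}")
    return dashboard


# Usage: check_dashboard_json(copied_text) before pasting it into Import dashboard JSON.
```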
Datadog monitors setup
To get alerted on metrics reported by Marqo Cloud, you can set up monitors by navigating to Monitors > New Monitor > Metric Monitor on your Datadog website (https://{your_datadog_website}/monitors/create/metric) and configuring a monitor by following the documentation here.
We recommend that you set up at least the following 2 monitors. To import each one, copy its JSON configuration below, click New from JSON in the top-right corner of the screen, paste the copied JSON, and save.
Disk utilisation too high
{
"name": "Marqo Cloud index disk utilization too high",
"type": "query alert",
"query": "max(last_5m):avg:marqo.disk_used_percent{*} > 80",
"message": "Marqo Cloud index disk utilisation too high. @all",
"tags": [],
"options": {
"thresholds": {
"critical": 80,
"warning": 70,
"critical_recovery": 79,
"warning_recovery": 69
},
"notify_audit": false,
"include_tags": false,
"timeout_h": 1,
"renotify_interval": 0,
"escalation_message": ""
},
"priority": 2
}
Memory utilisation too high
{
"name": "Marqo Cloud index memory utilization too high",
"type": "query alert",
"query": "max(last_5m):avg:marqo.memory_used_percent{*} > 80",
"message": "Marqo Cloud index memory utilisation too high. @all",
"tags": [],
"options": {
"thresholds": {
"critical": 80,
"warning": 70,
"critical_recovery": 79,
"warning_recovery": 69
},
"notify_audit": false,
"include_tags": false,
"timeout_h": 1,
"renotify_interval": 0,
"escalation_message": ""
},
"priority": 2
}
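If you adjust the thresholds in these monitors, keep them consistent: Datadog expects the number in the query to match options.thresholds.critical, and for a > comparison the warning and recovery thresholds to sit below the levels they relate to. A hypothetical Python check for these rules, written as a sketch rather than a reimplementation of Datadog's own validation:

```python
import re

# Matches a trailing "> <number>" comparison, e.g. "... {*} > 80".
THRESHOLD_RE = re.compile(r">\s*(\d+(?:\.\d+)?)\s*$")


def check_monitor_thresholds(monitor: dict) -> list[str]:
    """Return problems with a '>' metric monitor's thresholds (empty list if none)."""
    problems = []
    thresholds = monitor["options"]["thresholds"]
    critical = thresholds["critical"]

    match = THRESHOLD_RE.search(monitor["query"])
    if match is None:
        problems.append("query does not end with a '> threshold' comparison")
    elif float(match.group(1)) != float(critical):
        problems.append("threshold in query differs from options.thresholds.critical")

    warning = thresholds.get("warning")
    if warning is not None and not warning < critical:
        problems.append("warning must be below critical")
    critical_recovery = thresholds.get("critical_recovery")
    if critical_recovery is not None and not critical_recovery < critical:
        problems.append("critical_recovery must be below critical")
    warning_recovery = thresholds.get("warning_recovery")
    if warning is not None and warning_recovery is not None and not warning_recovery < warning:
        problems.append("warning_recovery must be below warning")
    return problems


# Query and thresholds from the disk utilisation monitor above.
disk_monitor = {
    "query": "max(last_5m):avg:marqo.disk_used_percent{*} > 80",
    "options": {"thresholds": {"critical": 80, "warning": 70,
                               "critical_recovery": 79, "warning_recovery": 69}},
}
print(check_monitor_thresholds(disk_monitor))  # []
```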