
Sizing, Storage and Inference on Marqo Cloud

Sizing Your Index

Marqo allows you to scale your index in the following ways:

Shards: increasing the number of shards allows you to store more vectors.

Replicas: increasing the number of replicas allows for more requests per second. Having at least 1 replica enables high availability. Replicas do not increase the total number of vectors you can store.

Inference nodes: inference nodes run the machine learning models. You'll need to scale the number of inference nodes depending on the size of the machine learning model you're running and the expected RPS.
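
All three scaling dimensions are chosen when the index is created. Below is a minimal sketch using the Python client; the keyword names (number_of_shards, number_of_replicas, number_of_inferences) mirror the Marqo Cloud settings but are assumptions here, so verify them against the current create_index reference:

```python
import marqo

# Connect to Marqo Cloud; replace the key with your own.
mq = marqo.Client(url="https://api.marqo.ai", api_key="<your-api-key>")

# The keyword names below are assumptions based on the Marqo Cloud settings
# (numberOfShards / numberOfReplicas / numberOfInferences) -- check the
# create_index API reference for the exact spelling.
mq.create_index(
    "my-index",
    number_of_shards=2,      # more shards -> more vectors stored
    number_of_replicas=1,    # >= 1 replica -> high availability, more RPS
    number_of_inferences=2,  # more inference pods -> faster vectorisation
)
```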

| Name | Description | Approximate number of 768 dim. vectors | Hourly pricing (USD) |
|---|---|---|---|
| marqo.basic | Low cost, recommended for dev/testing workloads | Up to 2,000,000 | $0.0593 |
| marqo.balanced | Storage optimised | Up to 16,000,000 | $0.8708 |
| marqo.performance | RPS optimised | Up to 16,000,000 | $2.1808 |

For example, if you created an index with 3 balanced storage shards, the index would be able to store up to 3*16,000,000 = 48,000,000 vectors.

Similarly, if you created an index with 2 performance storage shards, the index would be able to store up to 2*16,000,000 = 32,000,000 vectors.

Please note that the maximum number of vectors may vary based on the content of other document fields being stored in the index.
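
Since capacity scales linearly with shards (and replicas add none), the arithmetic is easy to sanity-check locally. This short, self-contained snippet uses the per-shard figures from the table above and makes no API calls:

```python
# Approximate per-shard capacity for 768 dim. vectors, from the table above.
SHARD_CAPACITY = {
    "marqo.basic": 2_000_000,
    "marqo.balanced": 16_000_000,
    "marqo.performance": 16_000_000,
}

def max_vectors(shard_type: str, num_shards: int) -> int:
    """Upper bound on stored vectors; replicas do not add capacity."""
    return SHARD_CAPACITY[shard_type] * num_shards

print(max_vectors("marqo.balanced", 3))     # 48000000
print(max_vectors("marqo.performance", 2))  # 32000000
```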


Choosing Storage Shard Types

marqo.basic Marqo basic is the cheapest of the shard types. It is good for proof-of-concept applications and development work. These shards have higher search latency than the other options, and each shard has a lower capacity of approximately 2 million 768 dim. vectors. They also cannot have any replicas, so they are not recommended for production applications where high availability is a requirement.

marqo.balanced Marqo balanced is the middle tier of the shard types. These shards have lower search latency than marqo.basic, and each shard has a higher capacity of approximately 16 million 768 dim. vectors. They can have replicas, making them suitable for production applications where high availability is a requirement.

marqo.performance Marqo performance is the highest tier of the shard types. It is best for production applications that require high availability and the lowest search latency, especially for indexes with tens or hundreds of millions of vectors. These shards have the lowest search latency of all the shard types, and each shard has a capacity of approximately 16 million 768 dim. vectors. They can have replicas, so they are suitable for highly available production applications with millions of users.

For the tutorials in this getting started guide we will only use marqo.basic shards. However, if you are deploying production search with many users, we recommend marqo.balanced; for larger enterprises with large numbers of concurrent searches, marqo.performance is likely more suitable.
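
As a rough decision helper distilled from the descriptions above (the branches reflect this guide's qualitative guidance, not hard limits):

```python
# Rough guidance distilled from the shard-type descriptions above.
def recommend_storage_class(production: bool, high_concurrency: bool) -> str:
    """Pick a shard type; adjust for your own latency and scale needs."""
    if not production:
        return "marqo.basic"        # dev / proof-of-concept; no replicas
    if high_concurrency:
        return "marqo.performance"  # lowest latency at very large scale
    return "marqo.balanced"         # HA-capable default for production

print(recommend_storage_class(production=True, high_concurrency=False))
# -> marqo.balanced
```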


Choosing Inference Pod Types

| Name | Description | Hourly pricing (USD) |
|---|---|---|
| marqo.CPU.large | CPU-based inference, sufficient for most text workloads | $0.3187 |
| marqo.GPU | GPU-accelerated inference for image and high-concurrency workloads | $0.9717 |

The inference pod type determines the infrastructure used for inference, the process of creating vectors from your data. A more powerful inference pod reduces latency for indexing and search by creating vectors faster. The pod type makes a particularly big difference to latency when indexing or searching with images. There are two inference pod types available:

marqo.CPU.large Marqo CPU Large is the lower of the two inference pod tiers. It is suitable for production applications with low-latency search. For many applications a marqo.CPU.large pod will be sufficient when searching with text; however, when searching or indexing images, or under very high request concurrency, these pods may become too slow.

marqo.GPU Marqo GPU is the highest tier of the inference pod types. It is suitable for production applications that need low-latency search and high request concurrency. These pods are significantly faster than marqo.CPU pods when indexing or searching with text and/or images.

A common usage pattern is to mix these pod types across different stages of a workload. For example, you can accelerate the indexing of images with marqo.GPU pods and then swap to marqo.CPU.large pods for text-only search. You can change your inference configuration at any time by editing the index, as sketched below.
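
A minimal sketch of that pattern, assuming the Python client. The inference_type keyword mirrors the Marqo Cloud inference setting but is an assumption; the stage-2 swap is performed by editing the index (for example in the Marqo Cloud console) rather than by a confirmed client method:

```python
import marqo

mq = marqo.Client(url="https://api.marqo.ai", api_key="<your-api-key>")

# Stage 1: create the index with GPU inference for fast image vectorisation.
# `inference_type` is an assumed keyword; image-specific settings such as
# the model choice are omitted for brevity.
mq.create_index("my-image-index", inference_type="marqo.GPU")

# Bulk-index image documents while GPU pods are attached.
mq.index("my-image-index").add_documents(
    [{"image_url": "https://example.com/photo.jpg", "caption": "a photo"}],
    tensor_fields=["image_url", "caption"],
)

# Stage 2: once bulk indexing is done, edit the index (e.g. in the Marqo
# Cloud console) to switch inference to marqo.CPU.large for cheaper
# text-only search. Stored vectors are unaffected by the swap.
```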