Solution Best Practices
Pre-Filtering Accuracy in Vector Searches
Users of our vector database sometimes face challenges with pre-filtering, particularly when trying to distinguish between closely related search terms. For example, a search for "Northern Africa" might erroneously return results for "Southern Africa": the two phrases produce very similar embeddings, so vector similarity alone does not reliably separate them.
To refine your search, consider an external fuzzy filtering solution such as the rapidfuzz library to improve the precision of results. By creating a synonym list relevant to your domain, you can map each user term to the exact value you want to filter on, so that searches are both comprehensive and contextually accurate. See this link for more information.
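As a minimal sketch of this idea, the snippet below resolves a user-supplied term to a canonical filter value before the search runs. It uses Python's standard-library difflib as a stand-in for rapidfuzz so that it runs without extra dependencies. The region list, the synonym map, the function name, and the cutoff are all hypothetical and would need tuning for a real domain.

```python
import difflib

# Hypothetical canonical values for a "region" field, plus domain synonyms.
CANONICAL_REGIONS = ["Northern Africa", "Southern Africa", "Western Africa", "Eastern Africa"]
SYNONYMS = {"north africa": "Northern Africa", "maghreb": "Northern Africa"}


def resolve_region(user_term, cutoff=0.85):
    """Map free-text input to one canonical region, or None if nothing is close."""
    term = user_term.strip().lower()
    # Exact synonym hits take priority over fuzzy matching.
    if term in SYNONYMS:
        return SYNONYMS[term]
    by_lower = {name.lower(): name for name in CANONICAL_REGIONS}
    # Keep only the single best candidate at or above the similarity cutoff.
    matches = difflib.get_close_matches(term, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None
```

The resolved value can then be applied as an exact-match filter, so that a misspelled or synonymous term still restricts the search to the intended region. When nothing resolves, it is probably safer to run the search unfiltered than to guess.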
Selecting the Right AWS EC2 Instance for Marqo Docker Containers
Running Marqo Docker containers on AWS EC2 instances may fail due to insufficient resources or compatibility issues with certain Ubuntu versions and cgroup configurations.
Ensure your AWS EC2 instance is appropriately sized and your Ubuntu version is compatible with Marqo's requirements. Start with an instance that has at least 2 vCPUs and 8 GB of RAM, such as the t2.large. Marqo can struggle on smaller instances, such as the t2.micro, because of their limited memory and processing power.
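A quick preflight check on the instance can catch an undersized host before the container is started. The sketch below is illustrative only: the function names and the 0.5 GiB tolerance (the operating system usually reports slightly less than the nominal RAM) are assumptions, and the memory lookup relies on `os.sysconf` names that are available on Linux.

```python
import os

MIN_VCPUS = 2        # minimum suggested above
MIN_MEMORY_GIB = 8   # minimum suggested above


def meets_minimum(vcpus, memory_gib, tolerance_gib=0.5):
    """Return True if the host matches the suggested starting size."""
    # The tolerance is an assumption: reported RAM is a little below nominal.
    return vcpus >= MIN_VCPUS and memory_gib >= MIN_MEMORY_GIB - tolerance_gib


def host_resources():
    """Read vCPU count and physical memory (GiB) from the running Linux host."""
    vcpus = os.cpu_count() or 0
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return vcpus, total_bytes / 2**30


if __name__ == "__main__":
    vcpus, memory_gib = host_resources()
    verdict = "OK" if meets_minimum(vcpus, memory_gib) else "too small"
    print(f"{vcpus} vCPUs, {memory_gib:.1f} GiB RAM: {verdict}")
```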
Selecting Appropriate Models and Optimal Weights for Multimodal Search with Marqo
Choosing the right models and determining the optimal balance between text and image weights for multimodal search indexing is crucial for performance but can be complex and nuanced.
Follow these guidelines to streamline the selection of models and weights for multimodal searches using Marqo:
- Understand Your Data: Let the nature of your data guide the choice of model and weights.
  - For text-rich content, give text more weight than you would for image-heavy data. An xlm b-32 model with a balanced distribution, or one that keeps a substantial text share (e.g., 50/50 or 60/40 image/text), often yields the best results.
  - For data where images carry more information, like stock images, increase the image weight. Use the default model with a higher image weight proportion (80/20, 90/10, or 95/5 image/text).
- Experiment with Weights: Start with sensible weight distributions based on past empirical results.
  - For image-focused data, begin with a 90/10 image/text ratio.
  - For text-focused data, a 60/40 image/text ratio is a good starting point.
- Iterate and Optimize: Inspect the results manually and adjust the model and weights accordingly.
  - Judge your current setup by the clarity, relevance, and completeness of its search results.
  - Be prepared to swap models and adjust weights several times to home in on the most effective configuration.
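To make the ratios above concrete, the sketch below turns an image/text ratio into a mapping of the kind Marqo uses for multimodal combination fields. The helper name and the field names (`image`, `caption`, `captioned_image`) are hypothetical, and normalising the weights so they sum to 1 is a choice made for this sketch. Check the mapping shape against the documentation for your Marqo version.

```python
def multimodal_mapping(image_share, text_share,
                       image_field="image", text_field="caption",
                       combo_field="captioned_image"):
    """Build a multimodal-combination mapping from an image/text ratio such as 90/10."""
    total = image_share + text_share
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        combo_field: {
            "type": "multimodal_combination",
            "weights": {
                image_field: image_share / total,  # e.g. 0.9
                text_field: text_share / total,    # e.g. 0.1
            },
        }
    }


# Starting points taken from the guidelines above.
image_heavy = multimodal_mapping(90, 10)
text_heavy = multimodal_mapping(60, 40)
```

The resulting dictionary would typically be passed as the `mappings` argument when adding documents. Re-indexing the same sample of documents with each candidate ratio makes it easier to compare results side by side.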
Customizing Search Result Presentation
Search results may not look the way you want, or may not match your brand or platform.
One of the best ways to tailor your results is with prompting. Just as you can prompt an LLM, you can add a prompt to your queries and keep it hidden from the end user.
For example, a large dataset may contain thousands of relevant items, many of which have poor photos. To surface the higher-quality images, you could prefix every search with “a high quality aesthetic photo of”. Alternatively, you could add a negative query term such as “low quality photo, blurry, jpeg artefacts” with a weight of -0.5 to push such results down the ranking.
If you want results to match a particular style, you can ask for it explicitly with a query prompt such as “a stock photo of”, “an oil painting of”, or “a cyberpunk neon depiction of”.
The same concepts apply to text search as well.
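The prefix and the negative term can be combined in a single weighted query. The sketch below builds a dictionary that maps each query string to its weight, which is the shape assumed here for a Marqo weighted query. The helper name and its defaults are hypothetical, with the prompt texts and the -0.5 weight taken from the example above.

```python
def styled_query(user_query,
                 prefix="a high quality aesthetic photo of",
                 negative="low quality photo, blurry, jpeg artefacts",
                 negative_weight=-0.5):
    """Wrap the user's query in a hidden style prompt and add a negative term."""
    positive = f"{prefix} {user_query}".strip()
    # Positive prompt keeps full weight; the negative term pushes matches away.
    return {positive: 1.0, negative: negative_weight}


query = styled_query("red sneakers")
```

The end user only ever types "red sneakers". The application would pass the resulting dictionary as the query in place of the raw string.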
Optimizing Document Tokenization for Search and Summarization
It can be difficult to find the right balance when tokenizing lengthy documents (30-50 pages) for efficient search and summarization with Marqo and a large language model (LLM).
For precise search results, consider tokenizing your documents into smaller chunks, such as 1-3 sentences, and then add those as separate documents to Marqo. This increases the specificity of the embeddings. However, when broader context is needed, larger segments can capture the essence of concepts spread over multiple sentences.
A practical approach is to store one paragraph (4-6 sentences) per document. Use Marqo's text_preprocessing settings to control how much text goes into each vector, starting with a split length of 3 and an overlap of 1 so that each vector captures a wider context. For more exact matches, a split length of 2 or 1 may be more effective.
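To see what a split length of 3 with an overlap of 1 means in practice, the following sketch groups a paragraph's sentences into overlapping windows, each of which would receive its own vector. It is an illustration of the sliding-window idea only. The function is hypothetical, and Marqo's own splitter may treat edge cases such as the final partial window differently.

```python
def sentence_windows(sentences, split_length=3, split_overlap=1):
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_length < 1 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length >= 1 and 0 <= split_overlap < split_length")
    step = split_length - split_overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + split_length]))
        # Stop once a window has reached the end of the paragraph.
        if start + split_length >= len(sentences):
            break
    return windows


paragraph = ["A.", "B.", "C.", "D.", "E."]
chunks = sentence_windows(paragraph)  # two windows that share sentence "C."
```

Smaller windows give each vector a narrower meaning, which helps exact matches. The overlap keeps a sentence that bridges two ideas from being cut off from its neighbours.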
Experiment with different models, such as the E5 family, which can embed longer pieces of text (up to 512 tokens). Remember to start queries with "query: " when using these models.
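A small helper can apply the "query: " prefix consistently, so that callers do not have to remember it. This is a minimal sketch, and the helper name is hypothetical.

```python
E5_QUERY_PREFIX = "query: "


def e5_query(text):
    """Prepend the E5 query prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(E5_QUERY_PREFIX) else E5_QUERY_PREFIX + text
```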
When working with LLMs for summarization, consider sending them only the _highlights from Marqo, both to stay within the model's token limit and to keep the input relevant. You can also expand the context around the highlights before passing them to the LLM, balancing search specificity against the need for context in summarization.
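The sketch below collects the highlighted text from a list of search hits and joins it into a single context string under a size budget. Several things here are assumptions: the function names, the use of a character count as a rough proxy for tokens, and the handling of `_highlights` as either a single dictionary or a list of dictionaries mapping field name to text (the shape can differ between Marqo versions).

```python
def highlight_texts(hit):
    """Return the highlighted text snippets of one search hit."""
    raw = hit.get("_highlights") or []
    # Accept both a single {field: text} dict and a list of such dicts.
    entries = [raw] if isinstance(raw, dict) else raw
    return [text for entry in entries for text in entry.values()]


def build_context(hits, max_chars=2000):
    """Join highlights in ranking order until the character budget is reached."""
    parts, used = [], 0
    for hit in hits:
        for text in highlight_texts(hit):
            if used + len(text) > max_chars:
                return "\n".join(parts)
            parts.append(text)
            used += len(text)
    return "\n".join(parts)


# Hypothetical hits, shaped like the "hits" list of a search response.
sample_hits = [
    {"_id": "a", "_highlights": {"body": "First highlight."}},
    {"_id": "b", "_highlights": [{"body": "Second highlight."}]},
]
context = build_context(sample_hits)
```

To add surrounding context, the same loop could look up each hit's neighbouring chunks by document ID and append them before the budget check.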