Query Prompt Engineering

Queries sent to multimodal models can be rewritten as prompts to guide search behaviour. This can be used to control result style, suppress low quality content, or implement semantic filtering.

The goal of query prompt engineering is to design a generic modification to a query that will steer search results in a desired direction. This works best with CLIP models, as longer caption style prompts resemble the data on which they are trained. The approach is similar to the techniques used to repurpose CLIP for zero-shot classification; an example of prompt templates can be seen in this notebook.

Typically, queries are best modified either with natural language prefixes or by adding tags to the query. The goal is to make the query resemble a caption that might be found in the training data of the multimodal model. Those familiar with text-to-image models like Stable Diffusion will recognize the comma separated tag style of prompts; Stable Diffusion uses CLIP as a text encoder to produce the embeddings which condition the generation.

Controlling Style

Modifying queries in your backend implementation can help tailor results to a specific style without sacrificing relevance. Keep in mind that these models were trained on captioned images. As a result, they have an understanding of various styles, quality descriptions, brands, and other caption information that can be leveraged to control the style of search results.

Prefixing Examples

Prefixes work well as they resemble image captions.

An image of <QUERY> - A generic prompt popular in zero-shot tasks
A high resolution photo of <QUERY> - For high quality images
A stock image of <QUERY> - Gives the results a stock image style
A colorful vibrant image of <QUERY> - For colorful images
An Amazon product image of <QUERY> - Higher scores to images that look like ones you would find on Amazon

Tag Style Examples

Comma separated tag style modifications are also effective.

<QUERY>, high resolution, high quality
<QUERY>, colorful, vibrant
<QUERY>, stock image
<QUERY>, e-commerce listing, Amazon product image
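The prefix and tag style modifications above could be applied in a backend with a small lookup of templates. The following is a minimal sketch rather than a prescribed implementation: the names STYLE_TEMPLATES and apply_style and the dictionary keys are hypothetical, while the template strings are taken from the examples above.

```python
# Hypothetical template table. The keys are illustrative names chosen for this
# sketch; the template strings mirror the prefix and tag examples above.
STYLE_TEMPLATES = {
    "generic": "An image of {query}",
    "high_resolution": "A high resolution photo of {query}",
    "stock": "A stock image of {query}",
    "colorful_tags": "{query}, colorful, vibrant",
    "ecommerce_tags": "{query}, e-commerce listing, Amazon product image",
}


def apply_style(query: str, style: str) -> str:
    """Wrap the raw user query in the chosen prefix or tag template."""
    template = STYLE_TEMPLATES[style]  # raises KeyError for an unknown style
    return template.format(query=query.strip())


if __name__ == "__main__":
    print(apply_style("red shoes", "stock"))
    print(apply_style("red shoes", "colorful_tags"))
```

The modified string is what would be sent to the search engine in place of the raw user query.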

Suppressing Low Quality Content

There are two approaches to suppressing low quality content. One is to modify the query to steer the model towards high quality content. The other is to add an additional weighted term to the query that will penalize low quality content.

Query Modification

This is similar to the style control example above but with a focus on quality. Examples might include:

A high resolution photo of <QUERY> - For high quality images
A high quality image of <QUERY> - Steers results towards high quality images
A professional photo of <QUERY> - For professional quality images

Penalizing Low Quality Content

Marqo supports multi-term queries with weighted terms. This can be used to inject an additional, negatively weighted query term which penalizes low quality content. Examples might include:

{"<QUERY>": 1.0, "low resolution, blurry, jpeg artifacts": -0.4} - Penalizes low quality images
{"<QUERY>": 1.0, "NSFW, nudity": -0.4} - Penalizes NSFW content
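A weighted query like the ones above could be assembled in the backend before it is sent to the search engine. This is a minimal sketch: build_weighted_query and DEFAULT_PENALTIES are hypothetical names, and the -0.4 weight is simply the value used in the examples above, not a tuned recommendation.

```python
from typing import Dict, Optional

# Penalty term and weight copied from the example above; treat the weight as a
# starting point to tune against your own data.
DEFAULT_PENALTIES: Dict[str, float] = {
    "low resolution, blurry, jpeg artifacts": -0.4,
}


def build_weighted_query(
    query: str, penalties: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Combine the user query (weight 1.0) with negatively weighted terms."""
    if penalties is None:
        penalties = DEFAULT_PENALTIES
    weighted = {query: 1.0}
    for term, weight in penalties.items():
        if weight >= 0:
            raise ValueError(f"penalty weight for {term!r} must be negative")
        weighted[term] = weight
    return weighted


if __name__ == "__main__":
    # The resulting dictionary is the multi-term query; with the Marqo Python
    # client it would be passed as the query argument of a search call.
    print(build_weighted_query("red shoes"))
    print(build_weighted_query("beach holiday", {"NSFW, nudity": -0.4}))
```

Keeping the penalty terms in one place makes it easier to adjust the weights without touching the code that issues the search.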

Semantic Filtering

For detailed examples and more information we provide a selection of useful templates in the Semantic Filtering Recipe.

An example use case where semantic filtering is powerful is stock image search, although it can be applied effectively in almost any domain. If you have an index of stock images, it can be powerful to filter on styles and aesthetics, things that typically do not have metadata. A template for outline art styles might look like "An artwork of a <QUERY> in a clean black and white outline style", where <QUERY> is the search term entered by an end user. Semantic filtering can also be done with tag style prompts such as "<QUERY>, outline, black and white, clean".

Recommended UI for Semantic Filtering

Typically the user interface is implemented as a dropdown selector, which makes it appear to a user as if it were a traditional filter. The dropdown contains the various styles or aesthetics that can be applied to the search query. When a user selects an option, the query is modified without the user seeing what is happening behind the scenes.
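One way to back such a dropdown is a mapping from the option label shown to the user to the template applied behind the scenes. The sketch below is illustrative only: STYLE_FILTERS, dropdown_options, query_for_selection and the option labels are hypothetical, and the two outline templates are the ones given in the Semantic Filtering section.

```python
# Hypothetical mapping from dropdown label to query template. "Any style"
# leaves the query unchanged, which acts as "no filter selected".
STYLE_FILTERS = {
    "Any style": "{query}",
    "Outline art": "An artwork of a {query} in a clean black and white outline style",
    "Outline art (tags)": "{query}, outline, black and white, clean",
}


def dropdown_options() -> list:
    """Labels to render in the dropdown, in display order."""
    return list(STYLE_FILTERS)


def query_for_selection(user_query: str, selected_option: str) -> str:
    """Rewrite the user's query according to the selected dropdown option."""
    # Unknown options fall back to the unmodified query.
    template = STYLE_FILTERS.get(selected_option, "{query}")
    return template.format(query=user_query.strip())


if __name__ == "__main__":
    print(dropdown_options())
    print(query_for_selection("cat", "Outline art"))
```

Falling back to the unmodified query for an unrecognised option keeps search working even if the frontend and backend option lists drift apart.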