Offline Evaluation
Offline evaluation lets you replay logs, perform counterfactual evaluation, and use relevance judgments to evaluate changes without affecting live traffic.
Overview
Offline evaluation allows you to test ranking strategies, recommendations, and other changes on historical data before deploying to production, reducing risk and enabling faster iteration. Historical data is automatically collected by the Marqo pixel, so you can replay past user interactions and queries without any manual data preparation.
Log Replay
Replay historical queries:
{
  "evaluation_method": "log_replay",
  "data_period": "30_days",
  "queries": "all",
  "metrics": [
    "ndcg@10",
    "mrr",
    "precision@10"
  ]
}
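To make the metric names above concrete, here is a minimal pure-Python sketch of how `ndcg@10`, `mrr`, and `precision@10` can be computed for a single replayed query. The product IDs and the use of historical clicks as binary relevance labels are illustrative assumptions, not Marqo's internal implementation.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain NDCG@k: DCG of this ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# One replayed query: the candidate strategy's ranking, scored against
# the products the user actually clicked in the historical log.
ranked = ["prod_7", "prod_123", "prod_9", "prod_42"]
clicked = {"prod_123", "prod_42"}
print(precision_at_k(ranked, clicked, k=10))  # 0.2
print(mrr(ranked, clicked))                   # 0.5
print(ndcg_at_k(ranked, clicked, k=10))       # ≈ 0.651
```

Averaging these per-query scores over every replayed query in the data period yields the aggregate metrics reported by the evaluation.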
Counterfactual Evaluation
Estimate how a candidate ranking would have performed on logged traffic:
{
  "evaluation_method": "counterfactual",
  "baseline": "current_ranking",
  "variant": "new_ranking",
  "metrics": [
    "estimated_conversions",
    "estimated_revenue",
    "estimated_clicks"
  ]
}
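Counterfactual estimates like `estimated_clicks` are typically produced by reweighting logged outcomes. The sketch below shows one standard technique, inverse propensity scoring (IPS); the log schema (`context`, `action`, `reward`, `propensity`) and the toy policy are assumptions for illustration, and Marqo's estimator may differ.

```python
def ips_estimate(logs, new_policy):
    """Inverse propensity scoring: reweight each logged reward by how
    likely the logging policy was to take the action the new policy
    would also take. Unbiased under full logging support, but can have
    high variance when propensities are small."""
    total = 0.0
    for event in logs:
        if new_policy(event["context"]) == event["action"]:
            total += event["reward"] / event["propensity"]
    return total / len(logs)

# Toy historical log: which product was shown, with what probability,
# and whether it was clicked (reward 1.0) or not (0.0).
logs = [
    {"context": "running shoes", "action": "prod_1", "reward": 1.0, "propensity": 0.5},
    {"context": "running shoes", "action": "prod_2", "reward": 0.0, "propensity": 0.5},
    {"context": "hiking boots",  "action": "prod_3", "reward": 1.0, "propensity": 0.25},
]

# Hypothetical new ranking policy: deterministically picks one product per query.
def new_policy(context):
    return {"running shoes": "prod_1", "hiking boots": "prod_3"}[context]

print(ips_estimate(logs, new_policy))  # 2.0
```

The estimate for the variant is compared against the same computation for the baseline; the inflated value here (2.0 clicks per event) illustrates IPS variance on tiny logs, which is why large data periods and validation against online results matter.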
Relevance Judgments
Score rankings against human-assigned relevance judgments:
{
  "evaluation_method": "relevance_judgments",
  "judgments": [
    {
      "query": "running shoes",
      "product_id": "prod_123",
      "relevance": 4,
      "judge": "expert"
    }
  ],
  "metrics": [
    "ndcg",
    "map",
    "mrr"
  ]
}
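With graded judgments (the `relevance: 4` field above suggests a 0–4 scale), NDCG uses the graded score directly as the gain at each rank. A minimal sketch, assuming a 0–4 scale and illustrative product IDs:

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a list of graded gains, by rank."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """Graded NDCG: DCG of the actual ranking over the DCG of the
    ideal (descending-gain) ordering of the same judgments."""
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal) if any(gains) else 0.0

# Judged relevance (0-4) for results returned for "running shoes".
judged = {"prod_123": 4, "prod_9": 2, "prod_7": 0, "prod_42": 3}

# The order the ranker actually produced; unjudged items default to 0.
ranking = ["prod_7", "prod_123", "prod_42", "prod_9"]
gains = [judged.get(p, 0) for p in ranking]
print(round(ndcg(gains), 3))  # ≈ 0.709
```

The score is below 1.0 because the most relevant product (`prod_123`, grade 4) is ranked second behind an irrelevant one; an ideal ordering of the same four items would score exactly 1.0.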
Best Practices
- Use representative data: Ensure evaluation data matches the production traffic distribution
- Validate offline metrics: Confirm that offline metrics correlate with online performance before relying on them
- Use multiple methods: Combine log replay, counterfactual evaluation, and relevance judgments; each has blind spots
- Evaluate regularly: Re-run offline evaluation on a schedule so results keep pace with changing data
- Document results: Keep records of offline evaluation outcomes so iterations can be compared over time
Related Topics
- A/B & Multivariate Testing - Validate offline results online
- Analytics & Reporting - Measure performance