Hello, I’m @otter, a software engineer working in the search domain at Mercari.
This article is the entry for Day 9 of the Mercari Advent Calendar 2025.
Mercari’s Product Search and Its Quality Management
Mercari’s product search plays a crucial role in accurately understanding our customers’ intent and surfacing the exact items they are looking for from among a massive number of listings. To maintain and improve quality, it is therefore essential to continuously check the relevance and validity of search results with respect to the search keywords.
In this article, I will introduce how we have leveraged LLMs (large language models) to improve the quality check flow for search results.
Challenges and Requirements in Search Results Quality Review
Until recently, product managers and engineers had to visually check each search result item sampled for different keywords and calculate the proportion of irrelevant items. This manual process was extremely time-consuming, and because evaluation criteria varied from person to person, it also produced inconsistent and unstable results when multiple people were involved.
In light of these challenges, our quality review process now needs to run automatically on a daily or weekly basis, be monitored through a dashboard, ensure a consistent and sufficient volume of reviews, apply clear evaluation criteria, and accurately capture the context and intent behind users’ searches.
Achieving Objective and Stable Monitoring with LLMs and Evaluation Criteria
To meet these requirements, we implemented several LLM-based quality reviewers for search results.
After comparing several LLMs, we decided to use Gemini 2.5 Pro, as it understood users’ intent best during our experimentation phase.
At first, we evaluated search results by providing only screenshots of the results pages to the LLM, simulating the user’s perspective. However, with this approach it was difficult for the LLM to make judgments that accounted for detailed product information, leading to misclassifications caused, for example, by differences in product specifications or categories. To improve accuracy, we changed the process to also provide the LLM with detailed information for each item, such as the product name, type, price, category, and thumbnail image.
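To make this concrete, here is a minimal sketch of what such an item-level review call could look like, assuming the google-generativeai Python SDK; the item fields, prompt wording, and helper names are illustrative assumptions rather than our production implementation.

```python
# Minimal sketch of an item-level relevance review call (illustrative only).
# Assumes the google-generativeai SDK; field names and prompt are examples.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")


def build_item_context(item: dict) -> str:
    """Format the structured item fields passed to the LLM alongside the image."""
    return (
        f"Product name: {item['name']}\n"
        f"Type: {item['type']}\n"
        f"Price: {item['price']} JPY\n"
        f"Category: {item['category']}"
    )


def review_item(query: str, item: dict, thumbnail_path: str) -> str:
    prompt = (
        f"Search query: {query}\n\n"
        f"{build_item_context(item)}\n\n"
        "Rate how relevant this item is to the query and explain why."
    )
    # Text and the thumbnail image are sent together as one multimodal request.
    response = model.generate_content([prompt, Image.open(thumbnail_path)])
    return response.text
```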
Evaluation Criteria
We instructed the LLM to return a "Relevance Score (0.0–1.0)" and a rationale for each item. The scoring is based on Amazon’s ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant), with scores assigned to each class:
- Exact (1.0): Products that perfectly match the specified query (e.g., "iPhone 14 Pro Max 256GB" → the exact model and specification)
- Substitute (0.75): Products that are functionally usable as substitutes (e.g., "iPhone 14" → iPhone 13; similar specification but different generation)
- Complement (0.5): Accessories or complementary products (e.g., "iPhone" → iPhone case, charger)
- Irrelevant (0.0): Completely unrelated or not meeting the requirements (e.g., "telescope" → socks)
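For illustration, the label-to-score mapping above and the per-item output (score plus rationale) could be represented roughly as follows; the JSON response shape is an assumption for this sketch, not our exact output format.

```python
# Sketch of the ESCI-style scoring scheme described above.
import json
from dataclasses import dataclass

ESCI_SCORES = {
    "Exact": 1.0,        # perfectly matches the query
    "Substitute": 0.75,  # functionally usable as a substitute
    "Complement": 0.5,   # accessory or complementary product
    "Irrelevant": 0.0,   # unrelated or not meeting the requirements
}


@dataclass
class ItemJudgement:
    item_id: str
    label: str      # one of the ESCI classes
    rationale: str  # the LLM's explanation for the label

    @property
    def relevance_score(self) -> float:
        return ESCI_SCORES[self.label]


def parse_judgement(raw_response: str) -> ItemJudgement:
    """Parse a response like {"item_id": "m123", "label": "Exact", "rationale": "..."}."""
    data = json.loads(raw_response)
    return ItemJudgement(data["item_id"], data["label"], data["rationale"])
```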
With our previous manual evaluations, the assessment criteria tended to be subjective, often resulting in inconsistent outcomes. However, by introducing clear scoring definitions and leveraging LLMs, we have significantly improved the stability and objectivity of our evaluation results.
How the Quality Monitoring Tools Work
Our search team currently has two major use cases for search relevance quality checks.
Online Monitoring
We randomly extract search keywords from production search query logs and evaluate the relevancy of their results. Every week, about 1,000 keywords are sampled, and for each, the top 120 items in the search results are reviewed.
Review results are written to a BigQuery table and can be routinely checked through a monitoring dashboard and other tools. When conducting A/B tests for search quality improvements or releasing new features, we can monitor changes in metrics such as the Average Relevance Score or the Irrelevant Items Rate.
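As an illustration, the dashboard metrics mentioned above could be derived from the per-item review rows roughly as sketched below; in practice this aggregation runs as queries over the BigQuery table, and the row fields here are assumed names.

```python
# Sketch of deriving keyword-level metrics from per-item review rows.
from collections import defaultdict


def aggregate_metrics(rows: list[dict]) -> dict[str, dict[str, float]]:
    """rows: one dict per reviewed item, e.g. {"keyword": "iphone 14", "relevance_score": 0.75}."""
    scores_by_keyword = defaultdict(list)
    for row in rows:
        scores_by_keyword[row["keyword"]].append(row["relevance_score"])

    metrics = {}
    for keyword, scores in scores_by_keyword.items():
        metrics[keyword] = {
            "average_relevance_score": sum(scores) / len(scores),
            # Share of items scored 0.0 (the "Irrelevant" class).
            "irrelevant_items_rate": sum(s == 0.0 for s in scores) / len(scores),
        }
    return metrics
```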
Offline Evaluation
We also use the LLM reviewer for offline evaluation before running A/B tests on new features or for validating improvements. By entering the keywords to be examined, engineers and product managers can instantly see the search results, category/brand/price distributions, and LLM-based evaluation results via a dedicated tool. It is also possible to run large-scale batch reviews using pre-determined keyword sets.
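A batch review over a pre-determined keyword set could be structured roughly like the sketch below; `search_top_items` and `review_items` are hypothetical stand-ins for the internal search API and the LLM reviewer, not real interfaces.

```python
# Sketch of a batch offline review over a fixed keyword set (illustrative only).
from typing import Callable


def run_offline_batch(
    keywords: list[str],
    search_top_items: Callable[[str, int], list[dict]],      # hypothetical search API
    review_items: Callable[[str, list[dict]], list[float]],  # hypothetical LLM reviewer
    top_k: int = 120,
) -> dict[str, float]:
    """Return the average relevance score for each keyword in the set."""
    averages = {}
    for keyword in keywords:
        items = search_top_items(keyword, top_k)
        scores = review_items(keyword, items)
        averages[keyword] = sum(scores) / len(scores) if scores else 0.0
    return averages
```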
Although these two use cases run on different systems, by unifying the LLM prompts, we ensure consistency in the evaluation criteria and results.
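Conceptually, sharing the prompt amounts to keeping a single template that both systems build their requests from; the wording below is illustrative, not our production prompt.

```python
# A single template shared by the online-monitoring job and the offline tool,
# so both apply exactly the same evaluation criteria.
RELEVANCE_PROMPT_TEMPLATE = """\
You are reviewing Mercari search results.
Search query: {query}

Item details:
{item_context}

Classify the item as Exact, Substitute, Complement, or Irrelevant,
and return JSON: {{"label": "...", "rationale": "..."}}.
"""


def build_prompt(query: str, item_context: str) -> str:
    # Any change to the criteria made here propagates to both systems.
    return RELEVANCE_PROMPT_TEMPLATE.format(query=query, item_context=item_context)
```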

Possibilities for Further Expansion
Combining image data with text data has improved evaluation accuracy. However, there are still challenging cases that require human judgment. That said, model accuracy continues to improve rapidly every year, and we expect even further automation in the future.
Additionally, beyond evaluation and monitoring, we are also considering using LLM-generated evaluation data itself as training data to improve the underlying search models.
Conclusion
In this article, I introduced our efforts at Mercari to use LLMs to automate, stabilize, and streamline the evaluation of search result relevance, which had previously relied solely on human review.
The introduction of LLMs has not only made our review operations more efficient but also enabled continuous quality monitoring based on more objective evaluation criteria.
Going forward, we plan to further improve our search features by leveraging evaluation data and addressing even more difficult cases.
I hope this article proves useful to those struggling with quality evaluation in search or recommendation systems, as well as those interested in utilizing LLMs.
Tomorrow’s article will be written by @task. Please look forward to it!




