This post is for Day 22 of Mercari Advent Calendar 2023, brought to you by @pakio from the Mercari US ML/Search team.
Query Understanding is one of the most challenging but rewarding tasks for search engineers, and it is a never-ending challenge for our team. It involves various tasks, such as query categorization, query expansion, and query reformulation. Among these, query categorization plays a pivotal role: it classifies queries into a target taxonomy, enabling search engines to retrieve results more efficiently.
In this article, we focus on Query Categorization. We examine both rule-based and ML-based methods, discuss their respective strengths and challenges, and share insights gleaned from our own experiments on this task.
Rule-based Method
The rule-based method is a simple yet powerful approach to query categorization. Search engineers can implement the logic with a map data structure, which keeps the results highly explainable. The fact that popular search engines like Algolia and Vespa offer this feature by default highlights its importance.
The following diagram illustrates an example of applying rule-based query categorization in a search system. Here we use a simple category ID filter, but it could be replaced with something more complex, such as boosting scores or changing the search logic itself.
At a glance, this method seems simple and attractive, but the rules are costly to maintain and cannot feasibly cover every query. While some automation is possible by generating rules from master data, human intervention is often necessary to handle synonyms, resolve conflicts between names, and address irregular cases. Our team has operated this method for several years, and the rules need periodic review as query patterns and listing trends change and new products are introduced.
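As a rough illustration of the map-based lookup and category filter described above, here is a minimal Python sketch. The rule table, the category IDs, and the shape of the search request are all hypothetical and are not taken from our production system.

```python
from typing import Optional

# Hypothetical rule table: normalized query -> category id.
# In practice this would be generated from master data and curated by hand.
CATEGORY_RULES: dict[str, int] = {
    "iphone 15": 101,
    "nintendo switch": 202,
    "air jordan": 303,
}


def normalize(query: str) -> str:
    """Lowercase the query and collapse repeated whitespace."""
    return " ".join(query.lower().split())


def categorize(query: str) -> Optional[int]:
    """Return the category id for a query, or None when no rule matches."""
    return CATEGORY_RULES.get(normalize(query))


def build_search_request(query: str) -> dict:
    """Attach a category filter only when a rule matches the query."""
    request = {"keyword": query, "filters": {}}
    category_id = categorize(query)
    if category_id is not None:
        request["filters"]["category_id"] = category_id
    return request
```

Queries without a matching rule fall through to a plain keyword search, which is why the coverage of the rule table matters so much in practice.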
Machine Learning (ML)-based Method
There have been proposals for more automated methods that use query logs, the accompanying click logs, and statistics on the documents displayed in search results. Given the volume of data involved, these methods are usually paired with machine learning approaches rather than relying solely on rules.
A 2018 paper by Lin et al. introduced a method that uses click logs for Query Categorization in e-commerce product search. For approximately 40 million queries, the authors collected the categories of items that appeared in the search results and received an action (click, add to cart, or purchase). They then framed the problem as a text classification task, trained multiple ML models to predict categories from queries, and compared their performance.
The categories used here are hierarchical, and the best model has a micro-F1 score of 0.78 for the 36 level-one categories and about 0.58 for the leaf-level categories. This result indicates that ML models can categorize queries with reasonable performance.
Although the conditions and model structure differ, our team also trained a multi-class classification model on query and click logs to predict the probability that a search query belongs to a given leaf category. It achieved a micro-F1 score of 0.72 on our test data.
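To make the idea of deriving training labels from click logs more concrete, here is a minimal sketch. It is not the procedure from the paper: the action weights, the `min_weight` threshold, and the choice to keep only the single top category per query are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical weights: stronger actions count more toward a category.
ACTION_WEIGHTS = {"click": 1.0, "add_to_cart": 2.0, "purchase": 3.0}


def build_training_labels(events, min_weight=2.0):
    """Turn (query, category_id, action) log events into query -> category labels.

    Each query is labeled with the category that accumulated the most weighted
    actions. Queries with too little evidence are dropped (an assumed
    noise filter, controlled by min_weight).
    """
    scores = defaultdict(Counter)
    for query, category_id, action in events:
        scores[query][category_id] += ACTION_WEIGHTS.get(action, 0.0)

    labels = {}
    for query, counter in scores.items():
        category_id, weight = counter.most_common(1)[0]
        if weight >= min_weight:
            labels[query] = category_id
    return labels
```

The resulting query-to-category pairs can then be used as training examples for an ordinary text classifier.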
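For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all categories before computing precision and recall. The sketch below assumes each query has exactly one gold category and at most one predicted category (`None` meaning the model abstains); under that assumption, and with no abstentions, micro-F1 equals plain accuracy.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label predictions.

    gold: sequence of gold category labels.
    predicted: sequence of predicted labels, where None means no prediction.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == g:
            tp += 1
        else:
            fn += 1  # the gold category was missed
            if p is not None:
                fp += 1  # a wrong category was predicted

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled, frequent categories dominate the score, which is worth keeping in mind when the category distribution is skewed.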
Language Model (LM)-based Method
As you are probably aware, the language model BERT, which was also published at the end of 2018, has shown excellent performance in various fields. Its architecture makes it more context-sensitive than conventional models such as ACNN, one of the models compared in the paper above, and a variety of pre-trained models are available, which makes it easy to evaluate. Another characteristic of publicly available pre-trained BERT models is that they use a general vocabulary, unlike models trained on a company's own query logs. This brings advantages, such as robustness to unseen queries and versatility, but also a disadvantage: weaker handling of domain-specific terms.
Here, we would like to introduce a method implemented by our team using DistilBERT, a derivative model of BERT, for the task of Query Categorization.
We fine-tuned the DistilBERT model on our data. In this experiment, only the classification layer was trained, using query and click logs as in the machine learning approach described above. The micro-F1 score was 0.80 on our test data.
In an online test against the ML model described in the previous section, this model doubled the coverage of converted keywords, which confirmed the merits of building further improvements on BERT, a highly versatile language model.
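Training only the classification layer can be sketched with the Hugging Face Transformers library as follows. This is an illustration under assumptions, not our actual training code: the number of categories is hypothetical, the model here is a tiny randomly initialized one so that the sketch runs without downloading weights, and treating both `pre_classifier` and `classifier` as "the classification layer" is our reading for this example.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification


def freeze_encoder(model):
    """Freeze the DistilBERT encoder so that only the classification head trains.

    Returns the names of the parameters that remain trainable.
    """
    for param in model.distilbert.parameters():
        param.requires_grad = False
    return [name for name, p in model.named_parameters() if p.requires_grad]


NUM_CATEGORIES = 5  # hypothetical number of leaf categories

# Tiny, randomly initialized model for demonstration purposes only.
# In practice one would load pre-trained weights instead, for example:
#   DistilBertForSequenceClassification.from_pretrained(
#       "distilbert-base-uncased", num_labels=NUM_CATEGORIES)
config = DistilBertConfig(
    n_layers=1, n_heads=2, dim=32, hidden_dim=64, num_labels=NUM_CATEGORIES
)
model = DistilBertForSequenceClassification(config)

trainable = freeze_encoder(model)
print(trainable)
```

With the encoder frozen, an optimizer built from the parameters that still require gradients updates only the head, which keeps training cheap compared with fine-tuning the whole network.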
Conclusion
In this article, we discussed various approaches to Query Categorization, a crucial task in Query Understanding for search systems. We explored the rule-based method, which is simple and powerful but incurs ongoing maintenance costs. We then looked at the machine learning-based method, which leverages user logs to categorize queries with reasonable accuracy. Finally, we introduced the Language Model-based method, specifically using DistilBERT, which provides reliable results while keeping the training effort small.
As a search engineer, I find this field fascinating, and I look forward to seeing how existing Query Understanding technology will be applied and evolve as vector-based search becomes mainstream.
Tomorrow’s article will be by @mtsuka. Look forward to it!
—
Special thanks to @Vamshi for helping me with summarizing the experiment result and reviewing this post.