The Journey to Machine-Learned Re-ranking

The Journey to Machine-Learned Re-ranking

Search is the most fundamental way users discover what Mercari marketplace has to offer; our users rely on search to find the items they want. This core functionality is powered by a traditional text matching information retrieval system.

But recently, we asked ourselves: Is there a reasonable machine-learning based approach that could improve the search experience for users? This led to more questions: Could we capture Mercari users’ interactions as a hint about what search results are more relevant to them? Can we build a more informative context for the model to learn from by labeling our training data and keeping in mind the limitations of standalone user clicks? How can we use data labeling to help us capture the relationship with our business KPIs?

It has been a long, challenging, and interesting journey, with a lot of lessons along the way. Through the efforts/developments described in this article, we achieved a statistically significant shift in the business KPIs and now our model is powering 100% of production traffic for the “best-match” (おすすめ) search flow. But that’s just the beginning, and we are just getting started!

The state of search as of April 2021

Search at Mercari is powered by an information retrieval system based on Elasticsearch’s text matching. By default, Elasticsearch leverages a ranking algorithm called Okapi BM25. The BM25 algorithm considers how many words in a given query match the fields being searched over in the index, where rarer matches and matches over shorter fields contribute to a higher relevance score.

However, Mercari uses a custom ranking approach: Mercari Search does not leverage out-of-the-box BM25; instead, it uses its own definition of a ranking function that puts emphasis on item freshness (i.e.: recency boosting hereafter). Depending on the search flow type at Mercari, the query matches are then ordered in a variety of different ways, in order to provide a search experience which is important to our users. For example, two common search flows are the “newly posted” (新しい) and the "best-match" (おすすめ) items. The "best-match" ranking also incorporates recency boosting (i.e.: when was the item/listing created), as we want to avoid including older items that have a higher similarity score at the top of the search result set.

Historically, there were also a number of machine learning ranking attempts to provide a better Search user experience to our Mercari end-users. While some of these attempts were more successful than others (as shown by the past A/B experiments), these successes did not translate to statistically significant improvements in business KPIs.

The reader should bear in mind that the current blog post focuses on one particular past effort to improve accuracy of search results via the application of machine learning techniques. The focus of this effort was to improve a search behavior that we refer to at Mercari as "best-match" (おすすめ) flow. Here, "best match" means a higher similarity score between a query and a document, as determined by the ranking function of the text-matching information retrieval system.

Building upon regular text-matching system capabilities

Text-matching information retrieval systems these days are fast and powerful when providing full-text search capabilities. Some may even leverage natural language processing (NLP) capabilities which employ text embeddings represented as numeric (dense) vectors to capture the linguistic content of the text, and can be used to assess semantic similarity between a query and a document.

Although powerful, such systems are still limited in terms of what signals can be captured, in addition to the indexed content. Our users tend to use Mercari Search in various ways, based on their search behavioral habits, for example:

  • What products they like to shop for and look at
  • How deep into the search results they like to scroll
  • How many results from the presented result set they actually preview
  • How long they dwell when viewing item details
  • and other business interactions, etc.

In addition to buyers, sellers also use search to determine the reference price of items they want to sell. They do this by checking for sold out items in the search results.

Regular text-matching systems normally do not capture such implicit and explicit user signals, which can serve as potential signals to improve relevance or personalization of the search results, as perceived by our users.

If we want to improve the relevance of search results that the information retrieval system gives us, we need to apply some sort of post-retrieval processing to optimize the order of the items shown to the end user. An approach to this, which is used widely in the industry by information retrieval / recommender systems, is to leverage machine learning for post-retrieval processing, i.e., re-ranking.

This is actually a fairly common information retrieval system architecture: there is a first phase retrieval where we get an initial result set from the text matching system (e.g.: Elasticsearch, Solr, etc.). This is the “recall” phase, which can be expressed using the following question: “Have we retrieved all the relevant documents for our query”? Then, there is a second phase which does post-retrieval processing of the search results through the application of machine learning (i.e.: for the purpose of re-ranking). This is our “precision” phase, which can be expressed using the following question: “From the documents we have retrieved, which ones are relevant”? In other words, precision is the subset of the retrieved documents that are relevant to the query.

Learning-to-Rank – Theory and challenges

We decided to leverage a Learning-to-Rank (LTR) machine learning approach to add a post-retrieval processing to the search results retrieved from the index, which leverages various signals from Mercari user behavior. LTR is a class of machine learning algorithms that learns how to rank (i.e.: reranks) search results automatically given a query & a set of documents using additional signals about the search.

Liu et al. (2009) defines LTR as “… a task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” [1]

Traditionally, it is a supervised machine-learning task which produces a scoring model, where the training data consists of queries and their respective document results. During a training phase where some sort of ranking loss is minimized, the model learns to predict a relevance score for each input x = (q, d), where q is a query and d is a document. The relevance score represents the importance of the document’s relevance with respect to the query.

As mentioned earlier, we constructed a model to predict a relevance score for each document. Motivated by inference speed, we settled on an architecture based on a small multi-layer perceptron that maps document features to a single score value. To maximize the use of our data, we trained this model using a listwise loss function that approximates nDCG [5]. At inference time, we execute the model over each query-document pair and sort the documents based on the scores given by the model. This effort was not without any challenges though, which we had to address along the way.

Challenge 1: Dataset labeling

For our learning to rank approach, we require labeled data to train our model. One option is to explicitly annotate a set of search queries and results from our logs, where each item is judged by a human annotator and assigned a relevance score. However as one can imagine this is quite a costly and time consuming process, and not guaranteed to produce sufficiently good labels (of course, annotated datasets are valuable in their own right and have an important place in research!).

For example, some of the limitations of annotated datasets are that they are stationary (i.e.: once created, a dataset does not change) and won’t include any future changes in relevancy by integrating a fresher intent (Lefortier et al., 2014) [2]. Another drawback is they may not necessarily align with the actual user preferences, i.e.: annotators and users often disagree (Sanderson, 2010) [3].

Another more scalable approach is to learn from Mercari user interactions (i.e., behavior) with the search results which can be indicative of their preferences. This brings its own difficulties which are discussed further.

Challenge 2: Using clicks to automatically generate good labels

User clicks (i.e.: human interactions with the search results) can be one of the obvious signals to use as labels, but unfortunately it is not always the best/right choice, depending on the business domain and the data corpus.

Clicks are noisy

Unfortunately, in a C2C marketplace, clicks alone are not a strong enough signal to help the model learn what’s important. The problem here is that in general, human users create a lot of “noise” when browsing via clicks – humans often click for unexpected reasons! Often users do not know what they want and they spend a lot of time looking and browsing products. As a result, many items get clicked on, but not that many get purchased. When a click occurs, it does not mean that it happened because of relevance. The opposite is also true: often clicks do not occur despite items being relevant, thus, the model cannot learn much from such a noisy signal — or whatever it does learn can detrimentally affect various business KPIs, such as those tied to sales!

In addition to these points, there is another factor: bots. In any web service, the possibility of automated scraping / bot activity is present. There are many cases of user activity in our client logs being attributed to bots. However, we are fortunate at Mercari to have a huge user base, and consequently a vast collection of client logs to use. Therefore, we take the conservative approach of filtering out search sessions that only contain clicks from our training data. We found this improves model quality by allowing the model to learn from users who had strong intent on searching for and purchasing items on our marketplace.

Clicks are biased

Often, human interactions with the search results are affected by factors other than their relevance. There are a number of problems here, like position bias, where higher ranked results get more attention (thus during user interaction data logging, those results are considered as more relevant); interaction bias, where user interaction is limited to the presented search results; presentation bias, where users interact differently with results presented differently, etc.

We would like to segue for a moment here and touch more on position bias, as we feel often this problem gets overlooked. Position bias leads to another problem called selection bias, where models are effectively trained on biased historical data, which causes bias to be perpetuated via a self-reinforcing feedback loop.

To elaborate on this more, imagine a scenario where most of the time, clicks occur in the top 20 search results. After historical click log collection and curation, the model is trained on a dataset using those clicks as labels, evaluated online via an A/B test, and then retrained using newer data that contain the same click position bias; thus, the loop. This may cause items which are more relevant but shown in lower positions to not improve their rank, due to getting a lower user engagement.

Clicks are a weak signal vis-à-vis e-commerce business KPIs

Clicks alone may not be an informative enough signal that the model can learn from. Of course, this depends on what ranking problem we are solving, the information retrieval system business domain, and the data corpus. For example, in a web search, clicks would be one of the better signals to use when judging what can be relevant to the user, as clicks (and their derived metric click-through-rate) can also be used to measure user engagement. But, in a C2C marketplace, clicks alone won’t provide enough context to model why users buy what they buy.

Challenge 3: Capturing rich context about item relevance from user activity in a label

To provide a rich context for the label, we came up with the concept of a user-item engagement journey. The item engagement journey starts when the user clicks (previews the product details) on an item. But, what happens after the click? Does the user put a “like” on the product? Does the user make a comment? Does a user initiate a purchase process? With this in mind, we came up with the notion of a graded relevance label, instead of the binary label “clicked” or “not clicked”. Our “holy grail” final business event that would represent the notion that the “item engagement journey has finished” is an item purchase event.

We assigned a score to different item activity events in Mercari, where click had the lowest score and purchase had the highest. Thus, if an item had a click, a “like”, and then was purchased, in our training dataset, the label for that item was the score of the purchase event.

Of course, there are different ways to compute the “graded” value, and by no means is our approach exhaustive. For example, as a different approach, we could assign a cumulative sum of different event scores as the final label score and experiment with that. Instead, we decided to start simple and used the highest event score within the events that occurred on the item in the item engagement journey as the label value (which may or may not be an item purchase event).

This has also allowed us to better capture the relationship with business KPIs when labeling the data, thus making our labels more meaningful and informative to provide a much richer context for the model to learn from.

We would like to point out here that the “graded relevance” concept is not a novel thing that we came up with. It is fairly common practice in the information retrieval industry to leverage multiple signals when labeling the data.

Challenge 4: Quality data features are essential

Naturally we want to give the model as much data as possible for it to score each document. We leverage features readily available in our Elasticsearch index, and we also store additional features in a feature store (i.e.: Redis) that we update daily. Namely, these dynamic features are item click-through rate (CTR) and impression probability (an "impression" is an item that was rendered on the user’s device visible pane), and they provide strong signals for our ranking model. For each item, its impression probability is defined as the item’s number of impressions divided by the total number of item impressions over all items for a particular time period, and is designed this way to take into account the fact that user activity on our service is not constant.

Recency boosting is particularly important and is incorporated in model training. It is defined as the time span between the item listing event and the search query. The reason we place so much importance on recency is that in a C2C marketplace, the item price is determined by the lister. Therefore, any item that is underpriced (and therefore an attractive purchase) attracts a lot of attention and will get sold quickly. Conversely, items that are not well priced tend to go unsold for long periods of time. In other words, newly listed items tend to contain good deals, and so our customers want to see them first.

In addition to recency boosting, the text of the query and item titles themselves are important to gain semantic understanding of the query and item. To this end, we tokenize both the query and item title, and learn text embeddings for a fixed length vocabulary of words in our training data. These text embeddings are averaged over a particular query or item title, and the aggregate embedding is passed to the model as an input feature.

Neural networks are susceptible to the scale of their inputs. We opted to normalize our inputs to log-space (specifically, log1p [6]). Many of our inputs such as price and freshness follow a skewed long-tailed distribution, and we found empirically that optimizing the network on log inputs helped to stabilize training and improve the overall performance by reducing the impact of outlier values in our data.

Challenge 5: “Show me the money” ©

When applying re-ranking to the search results, we must be also mindful of the business domain of the application when assessing the accuracy of the re-ranking. Often, relying purely on the statistically significant improvement in metrics like nDCG or CTR (click-through rate) may not necessarily translate to business value. In other words, re-ranking problems in an e-commerce business domain are tricky because we need to ensure that improvement in search relevancy must also be tied to the business KPIs, while also providing a better end-user experience.

A common business goal is to increase sales, so revenue at first seems like an obvious metric. But revenue is not typically a good metric, because its high variance makes it difficult to detect statistically significant changes. Instead, for our re-ranking efforts, we focused on buyer metrics with lower variance such as Average Transactions per User and Buyer Conversion Rate (ATPU/BCR). Specifically, ATPU just looks at the average number of transactions per active user. It has lower variance, because unlike revenue, the impact of price differences is removed. Thus, detecting statistically significant changes is more feasible.

BCR goes further than ATPU by just looking at how many active users actually bought anything, thus removing the variance that comes from heavy vs. light buyers. This is even easier to use than ATPU for measuring statistical significance. In the end, ATPU/BCR are leading indicators of our sales performance, and they show how many users get value out of our service.

Our takeaways

1. Invest your time in quality label defining strategy

When we started labeling the data, we tried to better understand and define the specifics of our business case – what does the model need to accomplish. First, we started small and simple by experimenting with binary (item clicked or not) labels. Unfortunately, this has not been an informative enough signal that the model could learn from, and thus we could not observe increases in nDCG that translated to improved business KPIs. As we kept iterating and hypothesizing how our training data needed to be labeled, the user-item engagement journey that we have built helped us to define a much stronger label with a much more rich context for the model.

2. Keep your eye on nDCG’s “blind” side

The quality of a ranking is commonly evaluated using ranking metrics, e.g., the normalized discounted cumulative gain (nDCG). What it means is that an LTR model often is trained with an nDCG loss, which means that the model attempts to maximize nDCG when learning to rank the results on the training dataset.

The accuracy of the reranking LTR model (i.e.: the 2nd phase which does post-retrieval processing of the search results) hugely depends on how well the 1st phase (retrieval of the search results) performs. If we get bad matches from the index, even when a model maximizes nDCG and it seems that the model performs well, from the users’ perspective we are still providing a poor search experience, because we keep re-ranking bad results — as it pertains to the user!

3. Clean your data

Along our journey to develop a successful ranking model, we ran into problems with our data time and time again. The biggest issues we faced were:

  1. The presence of extreme outliers. For example, some people list items for the maximum allowed price of 9,999,999 yen! Another example is when the user searches for an uncommon item, resulting in items being retrieved that are very old (and thus the recency boosting score is extremely high) compared to most items on Mercari. Sudden appearances of extreme feature values causes instability in gradient-based learning methods, and so we resorted to both excluding these outlier data, along with robust normalization choices.

  2. We found that for re-ranking, not all clicks are created equally. From looking at our client logs, we found that some users have a tendency to examine a majority of the items shown to them. In the extreme, we found cases where the users tap on basically everything in the search result. Obviously, this does not generate a good relevance label for machine learning. We also found evidence of bot-like behavior, where users were sending the same query every few seconds, all day every day.

The key takeaway here is, you need to curate a data sample that represents genuine user intent, otherwise your ability to train, and indeed evaluate, a model will become difficult. A model evaluation using noisy data will not tell you anything about your model.

Closing thoughts

We still have a range of interesting problems to address in the context of application of Learning-to-Rank techniques:

  • Leveraging IPS (inverse propensity scoring), first introduced by Wang et al. (2016)[4] to address user click bias, which contributes to selection & training biases
  • Leveraging additional interaction signals which will improve personalization of search results even more
  • Experimenting with exponentially decaying graded relevance labels
  • Exploring embeddings-based approaches for reranking and dense-vector retrieval.

Tune in to future posts and join us on our ML journey as we tackle these problems, learn, and grow!

References

  1. Liu, T.. (2009). Rank learning for information retrieval. Foundations and Trends. Information Retrieval. 3. 225-331.
  2. Lefortier, P. Serdyukov, and M. de Rijke. Online exploration for detecting shifts in fresh intent. In CIKM 2014: 23rd ACM Conference on Information and Knowledge Management. ACM, November 2014.
  3. Sanderson, Mark. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval. 4. 247-375. 10.1561/1500000009.
  4. Wang, M. Bendersky, D. Metzler, and M. Najork. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 115–124. ACM, 2016.
  5. Tao Qin, Tie-Yan Liu, Hang Li. A General Approximation Framework for Direct Optimization of Information Retrieval Measures. October, 2008
  6. Wang, M. Bendersky, et. al, (2019). TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank. 2970-2978. 10.1145/3292500.3330677.