2024/05/31

LLM-based Approach to Large-scale Item Category Classification

Author: UchidaNaotaka

Hello, I’m ML_Bear, an ML Engineer on Mercari’s Generative AI team.

In a previous article [1], I talked about improving Mercari’s item recommendations. In this article, I present a case study on categorizing over 3 billion items using large language models (LLMs) and related technologies.

Because the LLM boom was sparked by ChatGPT, many people think of LLMs mainly as conversational tools. However, their strong reasoning ability also makes them extremely useful for solving a wide variety of tasks. On the other hand, their slow processing speed and high cost can be a barrier to adopting them in large-scale projects.

This article describes how we overcame these challenges with a number of techniques, made the most of LLMs and their peripheral technologies, and solved the problem of categorizing item data at scale.

Challenge

Let me begin with a brief background of this project and the technical issues involved.
In 2024, Mercari overhauled its category structure, revamping the hierarchy and significantly increasing the number of item categories. When the categories and their hierarchy change, the category assigned to each existing item has to be updated as well.

Normally, item categorization uses machine learning models or rule-based models. In this case, however, we could not train a machine learning classifier because the "correct category in the new category structure" for past items was unknown, so there was no labeled data to train on. In addition, because the number of categories was very large, it was also difficult to construct a rule-based model. This prompted us to see whether we could use LLMs to address this issue.

Solution: Prediction algorithm in two-stage configuration with LLM and kNN

We responded to this issue by constructing a two-stage algorithm as follows.

1. Correctly predict the categories of some past items with ChatGPT 3.5 turbo (OpenAI API[2])
2. Create a category prediction model for past items using 1. as training data

Things would have been simpler had it been possible to predict everything with ChatGPT, but since Mercari’s past items exceed 3 billion [3], doing so was infeasible in terms of both processing time and API cost. Therefore, after some trial and error, we settled on this two-stage model configuration. (Classifying all items with ChatGPT 3.5 turbo would have cost approximately 1 million USD and taken an estimated 1.9 years, which is unrealistic.)
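
To make those figures easier to reason about, the short sketch below derives the per-item cost and the throughput they imply. Only the three input constants come from the text above; the sample size of 5 million is a hypothetical stand-in for "several million", and the sketch assumes the same per-item cost and throughput apply to the sample.

```python
# Back-of-the-envelope figures implied by the estimate above.
# The three inputs come from the article; everything else is derived.
TOTAL_ITEMS = 3_000_000_000   # "over 3 billion" past items
TOTAL_COST_USD = 1_000_000    # approximate cost of classifying every item
TOTAL_YEARS = 1.9             # estimated processing time for every item

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

cost_per_item_usd = TOTAL_COST_USD / TOTAL_ITEMS
items_per_second = TOTAL_ITEMS / (TOTAL_YEARS * SECONDS_PER_YEAR)

# Hypothetical sample size: the article only says "several million".
SAMPLE_ITEMS = 5_000_000
sample_cost_usd = SAMPLE_ITEMS * cost_per_item_usd
sample_days = SAMPLE_ITEMS / items_per_second / SECONDS_PER_DAY

print(f"cost per item : ${cost_per_item_usd:.5f}")
print(f"throughput    : {items_per_second:.0f} items/s")
print(f"sample of {SAMPLE_ITEMS:,}: ${sample_cost_usd:,.0f}, {sample_days:.1f} days")
```

Under these assumptions, labeling a few million items costs on the order of thousands of dollars and takes about a day, which is why sampling first and training a cheaper model on the result is attractive.
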
The following is a brief description of the model. Details will be described in the "Points of Innovation" section, so we will keep the explanations simple here.

1. Predict some correct categories of past items with ChatGPT 3.5 turbo (OpenAI API)

First, we sampled several million previously listed items and asked ChatGPT 3.5 turbo to predict the "correct category in the new category structure" for each of them. Specifically, we created about 10 candidates for the new category based on each item’s item name, item description, and original category name, and asked it to choose the correct answer from among those candidates.
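
As a rough illustration of what such a request could contain, the sketch below formats one item and its candidate categories into a single prompt string. The function name `build_item_info`, the field labels, and the sample data are all hypothetical, and the sketch does not cover how the roughly 10 candidates are chosen.

```python
def build_item_info(name, description, original_category, candidates):
    """Format one item and its candidate categories as prompt text.

    `candidates` is a list of (category_id, category_name) pairs.
    Candidate generation itself is outside the scope of this sketch.
    """
    lines = [
        f"Item name: {name}",
        f"Item description: {description}",
        f"Original category: {original_category}",
        "Candidate categories:",
    ]
    lines += [f"- {cat_id}: {cat_name}" for cat_id, cat_name in candidates]
    return "\n".join(lines)


# Hypothetical example data, for illustration only.
item_info = build_item_info(
    name="Teddy bear, 30 cm",
    description="Soft stuffed animal, barely used.",
    original_category="Toys",
    candidates=[(1, "Stuffed animals"), (2, "Dolls"), (3, "Board games")],
)
print(item_info)
```

Listing the candidates with their IDs lets the model answer with a single ID, which keeps the response short and easy to validate.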

2. Create a category prediction model for past items using 1. as training data

Next, we created a simple kNN model[4] using the dataset created in 1. as ground-truth data.
Specifically, we first stored the embedding and the predicted category of each item labeled in 1. in a vector database. Then, for each item to be classified, we retrieved the X most similar items from the vector database based on its embedding and used the most frequent category among those X items as the predicted category.

Embeddings were calculated from a concatenated string of each item’s item name, item description, metadata, and original category name. We also considered a more complex machine learning model, but adopted the simple one because it performed satisfactorily.
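
The voting step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's code: it scans every labeled item instead of using a vector database, the two-dimensional vectors and category names are made up, and `k` plays the role of X.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_category(query_embedding, labeled_items, k=3):
    """Return the most frequent category among the k most similar items.

    `labeled_items` is a list of (embedding, category) pairs, i.e. the
    items labeled in stage 1.
    """
    ranked = sorted(
        labeled_items,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    # On a tie, most_common keeps first-seen order, so the nearer item wins.
    return votes.most_common(1)[0][0]


# Toy data, for illustration only.
labeled = [
    ([1.0, 0.0], "toys"),
    ([0.9, 0.1], "toys"),
    ([0.0, 1.0], "books"),
    ([0.1, 0.9], "books"),
]
print(predict_category([0.8, 0.2], labeled, k=3))
```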

Points of Innovation

Here are some of the techniques we devised for this project. I will explain the following points one by one.

1. Usage of OSS Embedding model
2. Usage of Multi-GPU with the Sentence Transformers library
3. Voyager Vector DB for fast neighborhood search on CPU
4. Accelerated LLM prediction by using max_tokens and CoT
5. Usage of Numba/cuDF

1. Usage of OSS Embedding model

The second-stage model (kNN) required computing embeddings for items. Although we could have built a neural network on our own, we confirmed that the OpenAI Embeddings API (text-embedding-ada-002) [5] provided sufficient accuracy, so we initially decided to use this API.

However, when we made an estimate, we quickly realized that using the OpenAI Embeddings API for all items would be challenging in terms of processing time and cost.
While looking at MTEB[6] and JapaneseEmbeddingEval[7], we noticed that many OSS models supporting languages other than English were comparable to the OpenAI Embeddings API. We created our own evaluation dataset, tried them out, and found them to be as accurate as the OpenAI Embeddings API, so we decided to use an OSS model.
According to the data as of October 2023, in the midst of this project, the following models were rated as highly accurate, and we ended up using intfloat/multilingual-e5-base for its good balance of computational cost and accuracy. (MTEB rankings change constantly, so stronger models may well be available as of April 2024.)

intfloat/multilingual-e5-large [8]
intfloat/multilingual-e5-base [9]
intfloat/multilingual-e5-small [10]
cl-nagoya/sup-simcse-ja-large [11]

Since very high-performance OSS embedding models are available, we recommend that, for any project that uses embeddings, you create a simple evaluation task and check whether an OSS model performs well enough.
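
One simple way to set up such a check is sketched below: for each labeled text, find its nearest neighbor among the other texts and count how often the labels match. This is an assumed evaluation scheme, not necessarily the one used in the project. The two embedders are toy stand-ins so that the sketch runs on its own; in practice each would wrap a real model's encoding function.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leave_one_out_accuracy(embeddings, labels):
    """Share of items whose nearest other item carries the same label."""
    hits = 0
    for i, query in enumerate(embeddings):
        nearest = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: _cosine(query, embeddings[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)


def compare_models(embedders, texts, labels):
    """`embedders` maps a model name to a function: list of texts -> vectors."""
    return {
        name: leave_one_out_accuracy(embed(texts), labels)
        for name, embed in embedders.items()
    }


# Toy stand-ins for real embedding models (hypothetical, for illustration).
def keyword_embedder(texts):
    return [[float("toy" in t), float("novel" in t)] for t in texts]


def constant_embedder(texts):
    return [[1.0, 1.0] for _ in texts]


texts = ["toy car", "toy robot", "mystery novel", "romance novel"]
labels = ["toys", "toys", "books", "books"]
scores = compare_models(
    {"keyword": keyword_embedder, "constant": constant_embedder}, texts, labels
)
print(scores)
```

A few hundred labeled examples from your own domain are usually more informative than a public leaderboard, because they reflect the language and item types you actually need to handle.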

2. Usage of Multi-GPU with Sentence Transformers library

Although using the OSS model dramatically increased processing speed compared to the OpenAI Embeddings API, more improvements were needed to process billions of items.
Our issues would have been solved much more quickly if we had access to a powerful GPU such as the A100, but it was quite difficult to acquire such a powerful GPU as of November-December 2023 back when the project was launched, possibly due to the global GPU shortage. (It’s doubtful that the situation has changed much even now.)
We therefore decided to use multiple GPUs such as the V100 and L4 in tandem to handle this problem. Fortunately, the Sentence-Transformers[12] library was very helpful because we could easily parallelize across multiple GPUs with the following simple code.

from sentence_transformers import SentenceTransformer

def embed_multi_process(sentences, model_name="intfloat/multilingual-e5-base"):
    # E5 models expect a prefix such as "query: " on every input text
    if "intfloat" in model_name:
        sentences = ["query: " + s for s in sentences]
    model = SentenceTransformer(model_name)
    # Start one worker process per available GPU
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)
    return embeddings

It would have been ideal if we could use as many powerful GPUs as we needed, but even in situations where this isn’t possible, we can speed up processing by making use of creative ideas. That’s why it is important to make the most of limited resources by utilizing libraries such as Sentence-Transformers.

3. Voyager Vector DB for fast neighborhood search on CPU

A vector database was required when using kNN. Although sampled, the training data held several million items, so it could not fit in the GPU’s memory. While this may have been solved by using a GPU with a large memory, such as an A100 80GB, the difficulty in obtaining such a powerful GPU hindered us from trying that option.
Around that time, we learned that Spotify’s Voyager[13] can run at high speed even on a CPU, so we tried it and easily achieved a speed that was sufficient for practical use. Neighborhood search accounted for far less of the total time than embedding computation, so we did not rigorously compare Voyager with other options; we were satisfied with having achieved sufficient speed.
Voyager did not have metadata management capabilities, so we had to write our own client, but we still believe it was a good choice overall.
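
The kind of client this implies can be sketched as a thin wrapper that remembers the category for every vector ID. This is an illustration under assumptions, not the client written for the project: `CategoryIndex` and `BruteForceIndex` are hypothetical names, and the stand-in index exists only so the sketch runs without Voyager installed. The wrapper relies on just two methods, `add_item(vector)` returning an integer ID and `query(vector, k=...)` returning IDs and distances, which is the shape of Voyager's `Index` as I understand it.

```python
import math


class CategoryIndex:
    """Stores a category per vector ID, since the index holds vectors only."""

    def __init__(self, index):
        # `index` must offer add_item(vector) -> int and
        # query(vector, k=...) -> (ids, distances).
        self._index = index
        self._category_by_id = {}

    def add(self, embedding, category):
        item_id = int(self._index.add_item(embedding))
        self._category_by_id[item_id] = category
        return item_id

    def neighbor_categories(self, embedding, k):
        ids, _distances = self._index.query(embedding, k=k)
        return [self._category_by_id[int(i)] for i in ids]


class BruteForceIndex:
    """Stand-in index (hypothetical) with the same two methods."""

    def __init__(self):
        self._vectors = []

    def add_item(self, vector):
        self._vectors.append(list(vector))
        return len(self._vectors) - 1

    def query(self, vector, k=1):
        def distance(other):
            dot = sum(x * y for x, y in zip(vector, other))
            norm = math.hypot(*vector) * math.hypot(*other)
            return 1.0 - (dot / norm if norm else 0.0)

        order = sorted(range(len(self._vectors)),
                       key=lambda i: distance(self._vectors[i]))[:k]
        return order, [distance(self._vectors[i]) for i in order]


# Assumption: with Voyager installed, BruteForceIndex() would be replaced by
# something like voyager.Index(voyager.Space.Cosine, num_dimensions=768).
store = CategoryIndex(BruteForceIndex())
store.add([1.0, 0.0], "toys")
store.add([0.9, 0.1], "toys")
store.add([0.0, 1.0], "books")
print(store.neighbor_categories([0.8, 0.2], k=2))
```

Keeping the metadata in a plain dictionary keyed by ID means it has to be saved and loaded alongside the index file, which is the main cost of this design.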

4. Accelerated LLM prediction by using max_tokens and CoT

For this project, using ChatGPT 4 was not feasible due to cost, so we had to use ChatGPT 3.5 turbo. ChatGPT 3.5 turbo is rather clever for the cost, but we were a little concerned about its accuracy. Therefore, we used Chain of Thought[14] prompting to improve accuracy by having it generate explanations.
As you may already know, ChatGPT sometimes produces very long answers when asked to provide an explanation, which prolongs processing time. Therefore, we shortened the processing time by using the max_tokens parameter to cut off a long answer midway.

Since the JSON (of Function Calling) is left incomplete when the answer is cut off, it is necessary to either use LangChain’s[15] llm.stream() or restore and parse the JSON yourself, which is a bit time-consuming. Although we have not done an exact comparison, we feel that the method we used strikes a good balance between reducing processing time and improving accuracy.

The following is sample code that uses LangChain’s llm.stream().

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class ItemCategory(BaseModel):
    item_category_id: int = Field(None, description="Category ID predicted from product description")
    reason: Optional[str] = Field(None, description="Explain in detail why you selected this category ID")

system_prompt = """
Based on the product information given, predict the category of the product.
Please choose a product category from the list of candidates. Explain why you chose it.
"""
item_info = "(Include product data and potential new categories, etc.) "

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    max_tokens=25,
)
structured_llm = llm.with_structured_output(ItemCategory)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{item_info}"),
    ]
)
chain = prompt | structured_llm

# Keep only the last element of the stream.
# - Normally, when an answer is cut off by max_tokens, the JSON is left
#   incomplete and has to be repaired before it can be parsed.
# - With LangChain's stream, the JSON is always completed when the answer is
#   cut off, so no manual repair or parsing is needed.
for res in chain.stream({"item_info": item_info}):
    pass

print(res.json(ensure_ascii=False))  # res: ItemCategory
# {"item_category_id": 1, "reason": "The product name contains 'stuffed animal' "}
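
For comparison, the other option mentioned above, restoring the truncated JSON yourself, could look roughly like the following. This is a minimal sketch with a hypothetical function name, not what the project used. It only handles the common case where the cut happens inside a string value or right after a complete value; a cut directly after a comma, a colon, or in the middle of a key still raises `json.JSONDecodeError`.

```python
import json


def repair_truncated_json(text):
    """Close an unterminated string and any open brackets, then parse."""
    closers = []        # closing characters still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    repaired = text
    if escaped:
        repaired = repaired[:-1]    # drop a dangling backslash
    if in_string:
        repaired += '"'             # terminate the cut-off string
    repaired += "".join(reversed(closers))
    return json.loads(repaired)


# Example of an answer cut off in the middle of the "reason" string.
truncated = '{"item_category_id": 1, "reason": "The product name contains \'stuffed'
result = repair_truncated_json(truncated)
print(result["item_category_id"], "|", result["reason"])
```

Because the category ID comes before the explanation in the output, the ID usually survives even when the explanation is cut short.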

5. Usage of Numba/cuDF

Since processing speed is a concern even for minor processes when processing billions of items, all processing was accelerated with cuDF[16] and Numba[17] whenever possible.
Although I am not very good at writing Numba, when I showed the raw Python code to ChatGPT 4, it rewrote it for me, which greatly reduced my coding time.

Conclusion

ChatGPT has attracted a lot of attention as a conversational tool, but its advanced reasoning ability also makes it possible to solve tasks that were previously tedious or deemed impossible. In our project, ChatGPT helped us solve the tedious task of reclassifying a huge amount of item data into new categories within a short period of time.

We were also able to maximize results even with limited time and resources by making use of OSS Embedding models and multiple GPUs, adopting a vector database that enables fast neighborhood search, speeding up LLM prediction with max_tokens, and using Numba and cuDF to accelerate processing.
I hope that this case study demonstrates the potential of ChatGPT and other large language models and will be helpful in future projects. We encourage you to utilize LLMs in a variety of situations and take on challenges that have been difficult to solve in the past.