Leveraging LLMs in Production: Looking Back, Going Forward

This post is for Day 19 of Mercari Advent Calendar 2023, brought to you by @andre from the Mercari Generative AI/LLM team.

Remember when ChatGPT was first released to the public? It reshaped the boundaries of what was possible and elevated the discourse around artificial intelligence. Yet the innovation was not without its mysteries, presenting as much potential as it did new frontiers to explore.

We have come a long way since then. Earlier this month, for example, many researchers and practitioners shone a light on the capabilities and limitations of current Large Language Model (LLM) technologies at the EMNLP 2023 conference, which Mercari sponsored.

In this article, we’re excited to share the strides our team at Mercari has made in utilizing LLMs to enhance our beloved application. We focus primarily on our initial work with Mercari AI Assist (メルカリAIアシスト), a project at the intersection of innovation and practical application.

We hope that this article will serve as a resource that is not only informative but also offers tangible benefits to readers interested in the practical applications of LLMs.

Some key takeaways

  • Clear and frequent communication across different roles is critical for aligning expectations and making quick progress.
  • Begin with simple prompt engineering and leverage commercial APIs.
  • Rigorous pre- and post-processing are required to address LLM output inconsistencies.
  • Closely following new updates from both academia and industry helps us navigate the rapidly changing field of large language models.

The Team

The Generative AI/LLM team is on a mission to generate impactful business improvements by integrating LLM technologies into our products and enhancing productivity. Generally, our efforts are twofold: building and enabling. Speed is a crucial aspect of our work—on one hand, we strive to improve the user experience for our customers by developing high-quality products; on the other, we also aim to quickly acquire knowledge and expertise to empower more teams to understand and implement LLMs in production environments.

We work in a relatively small team, as close and effective communication between PM, designer, and engineers is crucial to be able to work fast and ship our product. As LLMs are a relatively new concept for many people, it is important to maintain a constant dialogue about what is achievable and what lies beyond their current scope.

Additionally, the engineers regularly conduct experiments to assess technical feasibility. With the field of LLMs evolving at a breakneck pace, it’s imperative to stay abreast of the latest findings and updates. Social media and news outlets are invaluable for acquiring the most immediate updates, while research papers offer a deeper dive, providing a comprehensive understanding and empirical observations of the latest advancements.

The Product

Mercari AI Assist is envisioned as an assistant feature that guides our customers in using our app effectively, according to their preferences.

There is still a lot of work to be done; however, in the initial version, our focus is on the sellers—Mercari customers who use the platform to list and sell items. Through this feature, we utilize LLMs to assist sellers by offering suggestions to enhance their listing information. Below are illustrations that depict what the Title Suggestion feature looks like within the application.

You can read more about Mercari AI Assist in the press release article. Meanwhile, this article focuses on the technical side of how we use LLMs to bring the two types of suggestions into production.

Choosing the Right Models and Techniques for Our Case

Firstly, it’s important to emphasize that while this article focuses on the use of LLMs, not everything requires the use of LLMs. Some tasks may be more effectively addressed without them, depending on factors such as cost, objectives, and the development team’s expertise. Knowing when and how to deploy LLMs is crucial.

One of the most challenging tasks in our case is processing and understanding unstructured, user-generated text. Within a Mercari listing, the item’s title and description contain a lot of useful information; however, distilling key information and determining how to utilize it has always been difficult. For example, identifying which category had the most listings in the past month might be straightforward, but discerning which factors differentiate listings that sell quickly from those that do not is complex. This is especially true given the varied and unique styles people use when writing an item’s title or description. We believed that, given the breadth of data on which a large language model has been pre-trained, it would be adept at meeting such challenges.

Once we identify tasks that LLMs can address, there are several other things we need to decide. Two of the most commonly considered factors are:

  1. Which models to use, e.g. commercially available models or open-source models
  2. Fine-tuning or prompt engineering (or training our own LLMs)

In general, fine-tuning often yields better results for specialized tasks within a fixed model size, as it allows the entire network to specialize in solving a specific problem. Conversely, prompting or in-context learning (ICL) can be seen as a method to enable a general LLM to perform specialized downstream tasks.

In the case of Mercari AI Assist, we utilized prompt engineering and simple retrieval-augmented generation to enable the use of commercially available LLMs—specifically, OpenAI’s GPT-4 and GPT-3.5-turbo—for executing a variety of specific tasks. Our objective at the moment is to design an optimal user experience and establish a sustainable and effective workflow for incorporating LLMs into our production environment.

The figure below illustrates the streamlined design of how we implement the Title Suggestion feature within Mercari AI Assist. After experimenting with several methods of leveraging LLMs and taking both cost and performance into account, we determined that this approach best fits our requirements. Generally, the feature is split into two main parts. The first part, highlighted in blue, involves defining “what makes a good title” for a Mercari listing. This is accomplished with assistance from other teams that possess diverse domain expertise. We then collect existing title data aligned with our criteria and utilize GPT-4 to distill the key attributes of an effective title. These key attributes are subsequently stored in a database. The second part of the process, indicated in red, occurs in real-time. We employ GPT-3.5-turbo to identify key attributes (defined by the previous step) from a specific listing as it is created, and then we generate suggestions for refining the listing’s title as necessary.

Through our experiments, we observed that GPT-4 outperforms GPT-3.5-turbo in terms of quality, but it incurs greater costs and latency. Consequently, we found an optimal balance between quality and cost-efficiency by utilizing GPT-4 exclusively for the initial, offline extraction of key attributes, and employing GPT-3.5-turbo for real-time, online operations.
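To make the two-stage flow concrete, below is a minimal Python sketch of how such a pipeline could be wired up against OpenAI’s chat completions REST endpoint. The function names, prompt wording, and attribute handling are illustrative assumptions for this article, not the production implementation:

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def chat(model: str, prompt: str) -> str:
    """Minimal chat-completion call via the REST API (no SDK dependency)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def extract_key_attributes(good_titles: list[str]) -> str:
    """Offline step (blue): distill key attributes from known-good titles with GPT-4.
    The result would be stored in a database for later use."""
    prompt = (
        "List the key attributes that make the following listing titles effective:\n"
        + "\n".join(f"- {t}" for t in good_titles)
    )
    return chat("gpt-4", prompt)


def build_suggestion_prompt(title: str, description: str, attributes: list[str]) -> str:
    """Compose the online prompt from a new listing and the stored key attributes."""
    bullets = "\n".join(f"- {a}" for a in attributes)
    return (
        "A good listing title should cover these attributes:\n"
        f"{bullets}\n\n"
        f"Current title: {title}\n"
        f"Description: {description}\n\n"
        "If the title is missing any attribute, suggest an improved title; "
        "otherwise reply UNCHANGED."
    )


def suggest_title(title: str, description: str, attributes: list[str]) -> str:
    """Online step (red): generate a real-time suggestion with GPT-3.5-turbo."""
    return chat("gpt-3.5-turbo", build_suggestion_prompt(title, description, attributes))
```

The key design point is that the expensive GPT-4 call happens once, offline, while the cheaper GPT-3.5-turbo call is the only one on the real-time request path.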

Continuous Evaluation and Mitigating Unexpected Responses

We primarily conduct two types of evaluations to ensure that the quality of outputs returned by the models meets our expectations: offline and online evaluations. Both are carried out before the product’s release and continue thereafter to guarantee that our quality standards are upheld.

Offline evaluation serves several purposes, but it mostly helps us to determine the most effective prompt for the task at hand. We focus on two main aspects: token usage (length) and response quality. Striking the right balance between these two aspects is crucial. Through a combination of manual review and automated evaluation, we ensure that the model’s responses meet our requirements. This step also allows us to estimate the total cost of deploying the feature to all of our users.
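As a simple illustration of that cost-estimation step, one can multiply average token usage by per-token prices. The sketch below uses placeholder per-1K-token prices; actual figures should come from the provider’s current pricing page:

```python
# Placeholder per-1K-token prices for illustration only; not current OpenAI pricing.
PRICE_PER_1K_TOKENS = {
    "gpt-4":         {"prompt": 0.03,   "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
}


def estimate_rollout_cost(model: str, avg_prompt_tokens: float,
                          avg_completion_tokens: float, requests: int) -> float:
    """Rough USD cost of serving `requests` calls at the given average token usage."""
    price = PRICE_PER_1K_TOKENS[model]
    per_request = (avg_prompt_tokens * price["prompt"]
                   + avg_completion_tokens * price["completion"]) / 1000
    return per_request * requests
```

For example, one million requests per month at an average of 500 prompt tokens and 100 completion tokens on gpt-3.5-turbo comes to about $950 under these placeholder prices, which also makes the incentive to keep prompts short very concrete.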

Online evaluation, on the other hand, ensures that the feature performs as expected in a live environment—this is particularly significant because we are dealing with user-generated content and substantial traffic in real-time. We conducted a partial release, only implementing a small segment of Mercari AI Assist that calls the LLM API, to assess performance and confirm that the complete feature is ready for our customer base. In this preliminary online test period, we tasked GPT with extracting a single key attribute from an item’s description and responding simply with “YES” if the attribute is present, or “NO” if it is not.
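The prompt for such a check can be very small. The wording below is an illustrative assumption rather than the prompt we actually used, but it shows the shape of the task:

```python
def build_attribute_check_prompt(attribute: str, item_description: str) -> str:
    """Ask the model whether a single key attribute appears in a listing description.
    The strict one-word instruction makes the output easy to parse downstream."""
    return (
        f'Does the item description below mention the item\'s "{attribute}"?\n'
        "Answer with exactly one word: YES or NO.\n\n"
        f"Item description:\n{item_description}"
    )
```

Even with an instruction this strict, the model does not always comply, as the results below show.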

We found that performing these kinds of partial preliminary releases is very useful for teams not yet familiar with using LLMs in production, especially when relying on commercially available APIs from third-party services.

During the preliminary online test period, we observed that even though we instructed GPT to provide outputs in a straightforward format (YES or NO), the number of inconsistently formatted responses increased along with the number of requests. The table below presents a sampled result from this experiment.

LLM Output                                                   Count
NO                                                         311,813
No                                                          22,948
Yes                                                         17,236
Sorry, but I can’t provide the answer you’re looking for.        5
Sorry, but I can’t assist with that request.                     4
The provided text does not contain the information.              4

Being aware of such inconsistencies is crucial for production systems. In the sampled use case above, a wrongly formatted answer might be non-critical and relatively easy to fix (e.g. with regular expressions). However, as we require more complex outputs from LLMs, detecting inconsistencies—as well as hallucinations, a well-known issue with large language models—becomes increasingly challenging.

It’s essential to preprocess prompts that contain user-generated content to minimize the likelihood of GPT generating incorrect responses. Additionally, post-processing logic should be implemented to ensure that only the expected output format is relayed to the client application.

Additional Things to Keep in Mind

Since we’re utilizing an LLM provided by a third-party service, it’s critical to understand how the API functions and what sorts of errors may occur. In addition to common API error types such as authentication and timeout errors, which we might already know how to handle, we need to give special attention to errors more closely related to LLMs. For instance, depending on the API you use, some calls might inadvertently trigger a content violation error. At Mercari, we have our own content moderation system; however, the filtering policy of a third-party API might differ. It is important to be aware of this so that we can prepare our prompts accordingly and avoid undesired outcomes.
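A useful first step is simply to separate retryable errors from ones that should surface immediately. The sketch below makes an illustrative assumption about that split (rate limits and server errors are retried; auth, validation, and content-policy rejections are not) and applies exponential backoff:

```python
import json
import time
import urllib.error
import urllib.request


def is_retryable(status: int) -> bool:
    """Rate limits (429) and server-side errors (5xx) are worth retrying;
    auth, validation, and content-policy rejections (other 4xx) are not."""
    return status == 429 or status >= 500


def post_with_retries(url: str, payload: dict, headers: dict,
                      max_attempts: int = 3) -> dict:
    """POST JSON with exponential backoff on retryable HTTP errors."""
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            url, data=json.dumps(payload).encode("utf-8"), headers=headers
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if is_retryable(err.code) and attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, ...
            else:
                raise  # e.g. a content-policy violation should surface immediately
```

Which status codes and error payloads actually signal a content violation varies by provider, so the classification should be adjusted to the specific API in use.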

Another consideration is the token count. The number of tokens used can vary depending on the language sent to the model. For instance, an experiment presented at EMNLP 2023 indicated that, using ChatGPT, the average cost of prompt and generated tokens in Japanese can be more than double that of English. This certainly depends on the task at hand and sometimes there’s no alternative, but it is one thing to keep in mind.

Lastly, in this rapidly evolving field, what is considered the best tool can change in just a short span of time. Libraries are updated constantly—with the occasional breaking change—and many of us are constantly looking for ways to optimally integrate LLMs into production systems. This might sound obvious, but we argue that it is important to closely follow new updates regarding LLM research and best practices.

Looking Back, Going Forward

The design and development of Mercari AI Assist has offered us valuable perspectives on working with prompt engineering and integrating commercially available LLMs into production. Looking back, I feel that I gained substantial knowledge and experience from the practical aspects of working with LLMs, and I am enthusiastic about further advancing my skills alongside the team.

Among the key lessons learned are the significance of cultivating a team equipped with the right mindset and fostering effective communication. I have also experienced and learned about the intricacies of choosing the right models and techniques, finding the right balance between cost and performance, dealing with the stability of LLMs in a live environment, and addressing challenges unique to LLMs, such as hallucination and content moderation. Additionally, I believe it is advisable to have team members with a background in machine learning and natural language processing when working with LLMs. Having the appropriate expertise can speed up various research and experimental processes. For instance, it can enable the team to swiftly determine the suitability of LLMs for a specific task and also decide on the most suitable evaluation metrics.

Going forward, we are focusing on improvements such as LLM operations and the development of automated workflows. We are also exploring the use of LLMs for more complex and specialized tasks, which may require the adoption of parameter-efficient fine-tuning techniques. With a rapidly growing field, our team is continuously experimenting and learning, and we understand that our implementation is far from perfect. As with many other practitioners in the field, we are constantly following updates from the field, sharing, listening, and looking for best practices most suitable for our use cases.

I look forward to yet another exciting year filled with obstacles and successes, and to sharing these experiences with the incredible members at Mercari.

Tomorrow’s article will be by @ayaneko. Look forward to it!
