Migrating a monolithic service under the bed (part 1 of 3)

This is the 23rd entry in Mercari Advent Calendar 2021, by @greg.weng from Metadata Ecosystem team.

Brief Overview

  1. In 2019, Mercari decided to close a four-year-old monolithic service (“Kauru”) and migrate its data and features to microservices.
  2. Before it was closed, the Kauru monolithic service cost ¥1,500,000 per month on the Google App Engine 1st generation platform, where Google had already deprecated many vital features that the Kauru service relied on.
  3. Because the Kauru service provided product information for the top-selling category, every new Mercari product feature had to work around its legacy design, which increased each feature’s development and maintenance cost.
  4. Furthermore, since none of the core developers who understood the whole monolithic architecture remained, all Kauru-dependent services were at risk of interruption whenever an issue arose.
  5. The whole migration project was executed by a small Product Catalog team (~3 members on average), and it was estimated to be completed within one year.
  6. In the end, all migration tasks were successfully carried out in about two years, despite some organizational growing pains during the project.
  7. During the most difficult period, the team and all project participants pivoted the migration strategy, and the new PoC-first approach proved to work well for the Kauru migration. Eventually, this approach became mandatory for all engineering tasks of the Product Catalog team.
  8. The whole migration was completed in November 2021. As a result, Mercari saved the Kauru infrastructure and maintenance costs, and microservices now provide its features and data with better security, flexibility, and quality assurance.
  9. Also, the Product Catalog team has become more productive and better at cross-team collaboration. Experiences from the Kauru migration also helped each team member significantly grow their soft skills.
  10. Finally, since the Product Catalog team is just one part of the whole Metadata Ecosystem team, the vision of a unified and solid Metadata platform has become more realistic now that the migration is finished. It also means all product services within Mercari, and their users, benefit from better metadata features.

Background: roles of the “Kauru” monolithic service in Mercari

Originally, the “Kauru” service was a standalone marketplace app for Books, CDs, and DVDs. After it was acquired by Mercari, it kept running for years on a tech stack unlike that of any other Mercari service. The most significant difference was that Kauru ran on Google App Engine 1st generation, while other Mercari services run on a Kubernetes cluster.

After the acquisition of the Kauru app, Mercari used it as an internal service to improve Mercari users’ listing and buying experience. The Kauru service was responsible for providing product information for the entertainment category throughout Mercari. For example, when a seller listed a book by scanning its barcode in the app, the app called Kauru APIs instead of Mercari APIs. Likewise, when a buyer searched for a game after specifying the Game category, the app invoked the Kauru search API instead of Mercari’s generic search API.

This meant the Mercari app needed to adapt to two heterogeneous services for different categories of products. If the user wanted to search for, buy, or sell entertainment products, the app switched to the Kauru logic. Otherwise, it followed the Mercari flow.

Fig 1. Simplified flows that co-existed inside the Mercari app
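
The category-based switch between the two flows can be pictured with a minimal sketch. It is only an illustration under assumptions: the category names, the `Backend` enum, and the `pick_backend` function are hypothetical and do not come from the Mercari or Kauru code.

```python
from enum import Enum


class Backend(Enum):
    KAURU = "kauru"
    MERCARI = "mercari"


# Hypothetical category names; the real category identifiers are not given here.
ENTERTAINMENT_CATEGORIES = {"books", "cds", "dvds", "games"}


def pick_backend(category: str) -> Backend:
    """Return the backend that would serve a request for this category."""
    if category.lower() in ENTERTAINMENT_CATEGORIES:
        return Backend.KAURU
    return Backend.MERCARI
```

Every client feature that touched products had to carry a branch like this one, which is why the two co-existing flows raised the cost of new features.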

Features and roles of Kauru in Mercari

As an e-commerce platform focused mainly on the C2C market, Mercari keeps improving customer experience in several ways. Kauru introduced many new features to improve the listing and buying experience:

Listing Experience improvement

Compared to searching and buying things, listing an item is a long and tedious process. Users need to fill in all the required fields for listing an item, and that’s why Mercari has improved this process significantly with several backend internal services. For this improvement in the Mercari app, Kauru supported Barcode Scan and Product Suggestion listing features for entertainment products:

Barcode Scan

Barcode Scan is a feature that helps our customers list an item more quickly. When the customer scans a product’s barcode, the product information is loaded from our systems and the category, brand, and description are filled in automatically.

Barcode scan feature of the Mercari app
Fig 2. Left: all blank fields waiting for the seller to fill
Right: all fields are automatically filled after scanning the barcode

Product Suggestion

This feature predicts similar items from images. With the help of the Kauru service’s data and machine learning models, the app suggests products that match the item a user is trying to list, based only on a photo of that item.

Buying Experience improvements

Product Page

The Product Page service provides a page that helps customers decide what to buy on the Mercari client app/website. It includes specs and features that allow product comparison, provide reviews, and facilitate the buying process.

Product Search

This feature links to the Product Page service. It lets the user search for a particular product by specifying the title, the author, and so on. If the product is not an entertainment product, this feature switches from Kauru search to the general Mercari search service.

Category Ranking

Kauru also played an important role in creating the rankings for Games, Books, CDs, and DVDs/Blu-ray Discs. The ranking was based on the most-sold items in Mercari and was adjusted automatically by counting and storing selling events.
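
As a rough illustration of ranking by counted selling events, here is a minimal sketch. The event shape (a plain product ID per sale) and the `rank_products` function are assumptions made for this example, not Kauru’s actual implementation.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_products(sold_product_ids: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Rank products by how many selling events were recorded for each."""
    counts = Counter(sold_product_ids)
    # Sort by count (descending), then by product ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

For example, calling `rank_products` on the events `["book-a", "game-b", "book-a"]` puts `book-a` first because it has two recorded sales.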

Other improvements

Dropshipping

Dropshipping in Kauru means dealing with new books, new CDs, and new DVDs/Blu-ray Discs. When the user asks to buy a new entertainment product instead of a second-hand item listed by another user, Kauru places an order with Tohan, Japan’s largest distributor between bookstores and publishers.

Impact of Kauru in Mercari

In 2021, entertainment products accounted for 25% of all items sold on Mercari, meaning Kauru played an essential role in serving the top-selling category.

Fig 3. Entertainment products are the top sale category in Mercari JP
(although it is translated to “Toys” in English)

As the only service gathering entertainment product information, the Kauru service was an unavoidable dependency for many other Mercari services:

Fig 4. A much-simplified dependency diagram of the Kauru and related services. Arrows mean calling an API or reading data from a data store

The Kauru Architecture

Kauru was a monolithic service from the beginning. It was designed to support APIs and data across multiple domains, so its internal and external complexity unavoidably grew exponentially. However, this doesn’t mean it was doomed to be too complicated, or that it lacked an architecture to put features into their proper places. Although it had grown much larger than the original design over the years, new code still followed the same architecture and worked together as a monolith.

For example, at v1.0.0, Kauru provided these APIs:

Fig 5. APIs the Kauru supported at version 1.0.0

It served them with this architecture:

Fig 6. The Kauru architecture at version 1.0.0

This architecture did not change drastically before the final migration.

Motivation: Why did Mercari decide to migrate the Kauru service?

Lack of maintenance

Nevertheless, such a monolithic architecture required knowledgeable developers to stay on the team to maintain it regularly and develop new features on it as business requirements came in.

Unfortunately, in the days right before the migration, none of those knowledgeable developers of the Kauru service were on the related team anymore. Anyone who wanted to use the data and services that Kauru provided had to pile new code onto the system without the knowledge to do it properly.

Outdated tech

While Kauru had been accumulating tech debt by piling up more and more code to satisfy business requirements, many improvements to its basic quality were left incomplete.

For example, some plain-text secrets were hard-coded in the code and were not removed until a company-wide security improvement took place. There was also error-handling code that merely logged errors without interrupting the regular operation flow, so the system could behave in undefined ways after an error occurred.
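
The log-and-continue pattern described above can be shown with a small hypothetical sketch. None of these names (`OrderError`, `save_order`, and the two `place_order_*` functions) come from the Kauru code; they only contrast swallowing an error with propagating it.

```python
import logging

logger = logging.getLogger("orders")


class OrderError(Exception):
    """Hypothetical error raised when an order cannot be saved."""


def save_order(order: dict, store: dict) -> None:
    if "id" not in order:
        raise OrderError("order has no id")
    store[order["id"]] = order


def place_order_swallowing(order: dict, store: dict) -> str:
    """Anti-pattern: the error is logged, but the flow carries on."""
    try:
        save_order(order, store)
    except OrderError as exc:
        logger.error("failed to save order: %s", exc)
    # Reached even when nothing was saved, so the caller sees a false success.
    return "confirmed"


def place_order_propagating(order: dict, store: dict) -> str:
    """Safer variant: log the error, then let it interrupt the flow."""
    try:
        save_order(order, store)
    except OrderError:
        logger.exception("failed to save order")
        raise
    return "confirmed"
```

With the first variant the caller receives `"confirmed"` although the store is still empty, which is the kind of undefined state the paragraph above warns about.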

Due to the issues above, any attempt to develop new features on Kauru for new business requirements was complicated and slow. On the other hand, splitting the service into several services maintained by different domain teams would be much simpler, faster, and more flexible.

Another reason Mercari decided to migrate the Kauru service was that it was built on Google App Engine 1st generation, and Google had planned to deprecate some key features that Kauru relied upon. This gave Mercari a hard migration deadline that could not be extended or negotiated. For this reason, the “if it ain’t broke, don’t fix it” principle could not be applied to Kauru and its related services.

Fig 7. Services like App Engine Search API, App Engine Mail API, and application data cache have no simple replacements after Google deprecated them (source: Google App Engine Document: Migrating from bundled services)

Economic feasibility

Finally, the two primary user-facing services, Mercari Books and Product Pages, had been contributing less and less Gross Merchandise Value (GMV) per month. Compared with this shrinking GMV contribution, Kauru itself cost about ¥1,500,000 per month on Google infrastructure. In addition, Mercari had to keep providing customer support and assigning engineers to the project to keep the service running.

The Migration

Mercari decided to end the Kauru service and migrate its features to several microservices so that it could move more safely and quickly on new business opportunities. The team that took on this migration project was the Product Catalog team, a small engineering team that focuses on storing and providing product data as a service and forms a major part of the larger Metadata Ecosystem platform team.

Like all estimates based on guesses, the one-year estimate had no solid evidence suggesting that the Product Catalog team would complete the migration in time. Although one of Mercari’s company values is Go Bold, looking back on the project’s history, the project was too bold and went far beyond the team’s technical, project management, and organizational management capacities. However, after many difficulties and much trial and error, the team and all related stakeholders grew and worked through the issues to complete the project.

The following milestones break the migration plan down for easier understanding.

Milestone 1: Static data migration

After taking on the project, the team decided to migrate the static data Kauru held first. This meant migrating about 8 million product data records, including books, CDs/DVDs, and games, to the product-catalog microservice. After migrating this static data, the team also needed to migrate the file uploaders to the product-catalog microservice to handle incoming vendor data. These uploaders were responsible for converting product data in files into SQL records.
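
To make “converting product data in files into SQL records” concrete, here is a minimal sketch. It assumes a CSV vendor file and a three-column `products` table; the column names (`jan_code`, `title`, `category`), the schema, and the use of SQLite are all hypothetical, since the real vendor formats and the product-catalog schema are not described here.

```python
import csv
import io
import sqlite3
from typing import TextIO


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema, chosen only for this example.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "jan_code TEXT PRIMARY KEY, title TEXT NOT NULL, category TEXT NOT NULL)"
    )


def load_vendor_file(conn: sqlite3.Connection, vendor_file: TextIO) -> int:
    """Convert each row of a vendor CSV file into one SQL record."""
    reader = csv.DictReader(vendor_file)
    rows = [(r["jan_code"], r["title"], r["category"]) for r in reader]
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO products (jan_code, title, category) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


if __name__ == "__main__":
    sample = io.StringIO(
        "jan_code,title,category\n"
        "4900000000001,Sample Book,books\n"
    )
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    print(load_vendor_file(connection, sample))
```

A real uploader would need one parser per vendor format, which is where most of the effort described in this milestone went.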

Although this sounds like the safest step of the whole migration, it turned out to be a challenging one. Since the product-catalog service was never designed to take such a large amount of data at once, the existing Admin APIs failed when the team tried to import all the static data. This forced the team to write scripts for each batch of data and execute them manually while handling every unexpected issue from the product-catalog service and Google Cloud Platform.
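
The batch-by-batch approach can be sketched as follows. This is only an illustration under assumptions: `chunked`, `import_in_batches`, `BatchImportError`, and the `import_batch` callback are hypothetical names, and the team’s actual scripts are not shown in this article.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class BatchImportError(Exception):
    """Hypothetical error raised when one batch fails to import."""


def chunked(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def import_in_batches(
    records: Iterable[T],
    batch_size: int,
    import_batch: Callable[[List[T]], None],
) -> List[int]:
    """Import records batch by batch; return indexes of failed batches."""
    failed: List[int] = []
    for index, batch in enumerate(chunked(records, batch_size)):
        try:
            import_batch(batch)
        except BatchImportError:
            # Record the failure so this batch can be inspected and retried by hand.
            failed.append(index)
    return failed
```

Keeping a list of failed batch indexes mirrors the manual workflow described above: a failed batch does not stop the run, but it is never silently reported as done.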

Beyond that, even migrating the vendor data uploaders became a challenge. First, the team needed to figure out the vendors’ file formats and how Kauru dealt with them, and then migrate only the necessary parts to the product-catalog service. This meant re-implementing similar but still different file uploader functions on the product-catalog side and then testing them with the vendor data previously fed to Kauru. Since Kauru dealt with new entertainment product data from several vendors, with various formats and content, every day, the team spent a lot of time debugging problems with real-world data. Mainly because the Kauru code had few or no tests and documents for this file uploading part, the team worked in the dark and kept up a trial-and-error approach for months.

Fig 8. The 1st phase, the data migration phase of the Kauru migration project

After migrating millions of product data records and the related product/item mapping data, and re-implementing all vendor data uploaders with many new tests and documents, the team moved on to the next phase: migrating the Kauru APIs.

Links

Part 2 of 3:
Migrating a monolithic service under the bed (part 2 of 3)

Part 3 of 3:
Migrating a monolithic service under the bed (part 3 of 3)