Migrating a monolithic service under the bed (part 3 of 3)

In the previous article, we covered the most challenging milestone of the Kauru migration project, introduced the PoC approach that the Product Catalog team adopted, and described how the team finally solved its scheduling and management issues. This final part covers the remaining milestones of the migration project and concludes with the lessons we learned from it.

Milestone 3: Better and broader cross-team collaboration

In this phase, the primary role of the Product Catalog team was to close those services that can’t or won’t migrate away from Kauru APIs and to help client teams migrate to the new upstream APIs (product-catalog APIs). For example, Mercari cannot forcibly upgrade old versions of the Mercari iOS/Android apps already installed on users’ phones so that they stop calling Kauru APIs. Decision-makers therefore chose to leave those clients alone until users upgrade the app themselves. The only thing the Product Catalog team could do was test what would happen to the old clients if the Kauru service was closed.

Fig 18. The 3rd phase of Kauru migration: the client migration phase

Since client teams handled the actual migration work in this phase, the Product Catalog team learned to collaborate better with other teams. As in every cross-team collaboration, the different priorities among teams were the most important thing to deal with. Resolving them usually involved engineers, managers, and stakeholders sitting together to find a workable schedule.

As a result, engineers from different teams became more accustomed in this phase to discussing together the possible steps to complete the migration. When bugs appeared on the client side because of the migration, the Product Catalog team learned how to debug actively and quickly and to provide evidence for the fix. The team didn’t cross boundaries and start writing Swift/Kotlin/JavaScript to debug for client teams, but it checked and summarized everything related to the client bugs to help those teams debug. Usually, this summary included more reliable steps to reproduce, related logs from backend services, intercepted API requests and responses on the client apps, and, where the team could understand it, the client code related to the bugs.

For other backend services that relied on the Kauru service, the Product Catalog team actively discussed issues and sent patches to help them migrate away from Kauru. Because the tech stack is the same, it was more straightforward and more feasible for the Product Catalog team to do the actual work and let the backend teams review and release the changes than to send requests asking those teams to migrate.

Fig 19. Different collaborations with different Kauru client teams, due to the various tech stacks

To conclude, the completion of this phase was not only a milestone achieved by the Product Catalog team but also a valuable lesson in cross-team collaboration. Completing this migration phase meant that managers, stakeholders, and developers all worked together to make it happen.

Milestone 4: Closing non-profitable features

For this milestone, the Product Page and drop shipping features were closed. The team again collaborated with different teams to close these features. This milestone was straightforward, as it was mostly client-side work to remove features.

Milestone 5: Terminate Kauru and refactor

The team had to wait a few months before it could finally delete all the Kauru service resources, because older versions of the app that were still in use kept calling the Kauru service. After the volume of requests to the Kauru service became insignificant, we finally deleted all of its resources, which saved ¥1,500,000 per month in infrastructure costs.
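The "wait until request volume becomes insignificant" criterion can be written down as a simple check. The sketch below is only an illustration under stated assumptions: the function name, the threshold, the number of quiet days, and the sample request counts are all hypothetical and are not taken from the actual Kauru shutdown process.

```python
from typing import Sequence


def ready_to_decommission(
    daily_requests: Sequence[int],
    threshold: int,
    quiet_days: int,
) -> bool:
    """Return True if each of the most recent `quiet_days` days stayed below `threshold`.

    All parameters are hypothetical; a real decision would also involve
    stakeholders, not just a number.
    """
    if quiet_days <= 0:
        raise ValueError("quiet_days must be positive")
    if len(daily_requests) < quiet_days:
        # Not enough history yet to judge.
        return False
    recent = daily_requests[-quiet_days:]
    return all(count < threshold for count in recent)


# Hypothetical daily request counts to the legacy service, oldest first.
history = [12000, 8000, 900, 400, 350, 300]
print(ready_to_decommission(history, threshold=1000, quiet_days=4))
```

Requiring several consecutive quiet days, rather than a single low reading, avoids shutting the service down because of one unusually calm day.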

The rest of the refactoring, which mostly involves removing dead code, is being done gradually as a low-priority task.

Result of Migration: A new microservice world without the legacy Kauru service

In November, the Product Catalog team finally removed all Kauru-related resources from Google App Engine. This means no Kauru features will run in the future, and all of its data and APIs have been split up and are now provided by the product-catalog service and other related microservices. They all run on the Mercari microservice platform, follow the same security and auditability rules, and share the flexibility of that platform.

For the whole Metadata Ecosystem team and Product Catalog team, this means a giant leap toward the metadata vision of Implosion and Explosion. The teams also learned from the Kauru experience, where a single service provided many unrelated features: the new microservices now provide much simpler yet more generic features to their clients, rather than ad hoc features that serve only specific business requirements.

Fig 20. The Implosion and Explosion vision of the Metadata and Product Catalog team

For Mercari as a whole, completing the Kauru migration means that every new business plan that needs entertainment product data no longer has to consider how to adapt to a legacy service. For each kind of data requirement, there will be one and only one source of truth. Compared to the previous implementations, which required the same data from two different services with different APIs, the new implementation unifies the flow for fetching and updating product data.

Furthermore, this improvement is not limited to listing and selling features; it also applies to all potential features that need to fetch product data. In fact, right before the final phase was done, there was already a totally new business feature that required data previously provided by both the Kauru and product-catalog services. From its very beginning, this new service and its team only need to contact the Product Catalog team about all data and API issues, with no more legacy Kauru issues to handle.

Fig 21. Before and after the Kauru migration: new client features that use product information no longer need to take the risk of adapting to a legacy service that has no dedicated team to maintain it

Finally, for Mercari customers, new features will now arrive much faster and will be built on the more reliable microservice platform, so we expect customers to have a better experience with Mercari on our iOS/Android/Web clients. As always, product information plays an essential role in helping customers have a better experience at every step of listing, selling, and buying. That is why, even after the Kauru migration project, the Metadata Ecosystem team and Product Catalog team continue to migrate other related data and services to provide a more powerful, generic product data backend for all Mercari features.

(Re)building a small product engineering team during a huge migration project

The Kauru migration project was not only a tech project for the Product Catalog team. It was also a boot camp in which each team member learned and unlearned many things about collaboration and other soft skills. As a result, the Product Catalog team became a totally different team after the Kauru migration. Many of these changes had a positive impact on productivity, project management, and business-engineering relationships.

The complexity of the Kauru service and the lack of active maintainers were the main reasons why the migration project became so difficult. However, other factors rooted in the organizational culture pushed every participant to develop soft skills and (re)built the team.

For example, Mercari’s working culture doesn’t ask each team to have an architect or a system designer who pushes development work according to clear design principles. One result is that, although the migration team had the classic roles of tech lead, senior engineer, and junior engineer, no one had the absolute authority to push things in a firm direction.

For each improvement or system change in the migration, the task assignee needed to bring all the other migration team members on board. Therefore, if a migration task needed multiple improvements or changes to the current Product Catalog service, there was no progress until all members were fully convinced.

Fig 22. The cost of convincing everyone to migrate for every migration task

This model asks engineers to develop their communication skills and to find, by themselves, a workable way to collaborate with other participants. The team managers decided to keep this no-authority model throughout the migration, even when communication problems caused some trouble.

However, the team managers also pushed some changes publicly by introducing better tools and suggesting better strategies to tackle the project’s problems. In private, they suggested that individuals change their attitudes or the way they expressed themselves to make things smoother. These efforts were not in vain, although the team still experienced some tough moments before this adventure finally bore fruit.

After the Kauru migration, the Product Catalog team members had learned to work in a way that fits the company culture better. For example:

  1. The tech lead learned that he doesn’t need to push and monitor every task, and that he doesn’t need to worry too much about schedule and priority, because taking care of them is the responsibility of the PM and the manager. The tech lead also learned that sometimes it is NOT the migration team that needs to debug all the issues caused by the migration changes, especially when solving them collaboratively requires other teams’ domain knowledge (All for One, Be a Pro).

  2. Senior engineers learned that sometimes leaving workarounds and ad hoc solutions in new or old services is okay, as long as all the team members fully understand why these short-term solutions may cause more problems in the future and have a long-term solution that everyone has agreed upon (All for One).

  3. Junior engineers learned how critical it is to keep things easy to test, track, and debug: anything that cannot be logged or debugged will eventually take much more time to trace and fix. They also fully experienced how difficult any production release is, no matter how tiny. Through all this and more, they came to understand the fragility and cost of software systems (Be a Pro).

  4. All team members learned that, even in a migration (tech improvement) project like this one, continuing to deliver is the most critical thing in an organization like Mercari (All for One, Be a Pro).

There is always more to migrate…

Even those who benefit from the project don’t want to go through another migration like Kauru. However, like any other successful company, Mercari has experienced high-speed growth, and other data is still managed by different domain teams and scattered across various services.

Therefore, the ultimate goal for the Metadata platform is to migrate this data out of each domain microservice, centralize it, and provide a single yet generic query interface for all clients. This means there is always more to migrate to achieve the vision of the Metadata platform and a unified Mercari data gateway.

Although the Kauru migration was the toughest migration project yet, it was also the one that brought the most benefits. When the team starts another migration plan, the migration goals, target data, and services may be different, but the team members will do it better by running the project sprints with the PoC approach and by collaborating more actively and effectively with all related teams.

In fact, there is already a plan to migrate the next batch of product-related data. In 2022, this subsequent migration is expected to be the team’s first chance to check whether it learned enough lessons from the Kauru migration project. There will be another article with a retrospective on these two migration projects at that time. The new migration will surely make the Metadata and Mercari platforms stronger and help them move faster.

Links

Part 1 of 3: Migrating a monolithic service under the bed (part 1 of 3)

Part 2 of 3: Migrating a monolithic service under the bed (part 2 of 3)