Current Microservices Status, Challenges, and the Golden Path

This post is for Day 14 of Mercari Advent Calendar 2023, brought to you by @ayman from Mercari Backend Architects team.

Introduction

This article explores Mercari’s journey from a monolithic PHP architecture to a microservices landscape, a transition that began in 2018. It covers the challenges, successes, and key learnings encountered along the way, from the initial ease and simplicity of the monolithic setup through the complexities of migrating to a distributed microservices system.

Background

Mercari started its microservices migration project in 2018, moving away from a PHP monolith that all teams collaborated on.

Working within this PHP monolith was convenient for engineers because:

  • They were not responsible for the monolith’s maintenance, which was managed by the SRE team.
  • There was no need to incorporate the extensive boilerplate code required for building new services.
  • Direct access to classes, functions, and the database was readily available.

For project managers (PMs), this setup also had its benefits. If a specific project was underway, PMs could directly assign teams to work on any part of the monolith.

However, despite these advantages, it wasn’t all unicorns and rainbows; we faced numerous challenges as well:

  • We couldn’t do parallel releases. Because we had only one release pipeline, releases were organized via a release calendar, and each team needed to reserve a suitable time slot beforehand. If your release failed or you needed to do extra work, this could impact the team that wanted to release after you.
  • Incidents had wider impacts. We had some severity 1 or 2 incidents because either the Mercari API (our PHP monolith) timed out, or our core DB (the main DB used by the monolith) got so busy that it stopped responding.
  • There was no governance in the code: any team could call any function or class written inside the monolith.
  • There were different styles when defining models, services, and other logical components.

These issues limited our scalability, both in terms of team growth and workload management.

The decision to migrate to microservices then came as a strategic move aimed at creating a strong technical organization that can scale globally, that is, to have a scalable and resilient team.

The transition began with services that could be decoupled from the Mercari API and did not require direct database access. This involved initially developing the gateway service, authority services, and the listing-time suggestions service.

Subsequently, each team started planning migrations for their respective components within the Mercari API.

For example, the buyer domain team took on migrating buyer-related domains (such as likes, comments, page views, etc.), while the Listing domain team focused on migrating services like listing service, photo service, and so on.

To ensure a successful migration, our platform teams embarked on constructing the necessary platforms and establishing protocols for other teams to create and deploy their microservices. This included the creation and maintenance of Kubernetes (k8s) clusters, the development of pipelines for rolling out infrastructure via Terraform, and pipelines for deploying microservices to production.

Simultaneously, the architecture team implemented a set of guidelines to assist teams in adopting best practices. These covered aspects like API design, database selection decisions, error handling, pagination, and monitoring. A crucial part of these guidelines was the Production Readiness Checks (PRC), a checklist ensuring that services meet specific criteria before their production deployment.
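As a rough illustration of how a readiness checklist like the PRC could be expressed in code, here is a minimal sketch. The check names (`has_owner_team`, `has_alerts`, `has_runbook`) and the dictionary shape of a service are hypothetical, chosen only for illustration; the actual PRC items are not listed in this article.

```python
from typing import Callable

# A check takes a service description and returns True when it passes.
Check = Callable[[dict], bool]

# Hypothetical check items; the real PRC checklist is not reproduced here.
CHECKS: dict[str, Check] = {
    "has_owner_team": lambda svc: bool(svc.get("owner_team")),
    "has_alerts": lambda svc: len(svc.get("alerts", [])) > 0,
    "has_runbook": lambda svc: bool(svc.get("runbook_url")),
}


def failed_checks(service: dict) -> list[str]:
    """Return the names of checks the service does not yet satisfy."""
    return [name for name, check in CHECKS.items() if not check(service)]


def ready_for_production(service: dict) -> bool:
    """A service is ready only when every check passes."""
    return not failed_checks(service)


# Hypothetical service that has an owner but no alerts or runbook yet.
draft = {"owner_team": "listing", "alerts": []}
missing = failed_checks(draft)
```

Returning the list of failed checks, rather than a single yes/no, lets a team see exactly what remains before a production release.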

Despite having a ready platform and comprehensive guidelines, governance remained somewhat relaxed. This approach granted teams considerable autonomy in decision-making, adhering to the principle of "you build it, you own it." While architects and platform teams could offer recommendations, the final decisions and responsibilities lay with the individual teams.

This dual setup of a robust platform and clear guidelines, coupled with a flexible governance model, initially facilitated a smooth start to the migration project for the pioneering teams. However, as the project progressed, it became apparent that this approach alone was not sufficient for the evolving demands of migration or business growth.

In the following section, we will delve deeper into the current state of our microservices, the challenges we face, and the strategies that constitute our ‘golden path’ forward.

Current Status

Microservices Status

The graph below shows the microservices and batch jobs released from July 2019 until December 2023, based on the production readiness checks closed each month for marketplace, merpay, and mercoin. A few hundred microservices were released in total during this period.

Fig.1 – Released microservices count

To dive a little deeper, we ran another analysis to find how many microservices teams were still actively working on in the marketplace (mercari microservices and batch jobs). Only 62% of the marketplace microservices were active once we excluded deprecated microservices and services that had one deployment per month or less for six months (highlighted with the red ellipse in Fig.2 below).

Fig.2 – Number of deployments for each service
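The activity filter used in this analysis can be sketched as follows. This is a minimal sketch under stated assumptions: "one deployment per month or less for six months" is interpreted as every one of the six observed months having at most one deployment, and the `ServiceStats` type and the sample data are hypothetical, not Mercari's real numbers.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    name: str
    deprecated: bool
    # Deployments per month for the last six months, oldest first.
    monthly_deployments: list[int]


def is_active(svc: ServiceStats, max_low_activity: int = 1) -> bool:
    """Return True unless the service is deprecated or low-activity.

    Assumption: "low-activity" means every observed month had
    `max_low_activity` deployments or fewer.
    """
    if svc.deprecated:
        return False
    return any(count > max_low_activity for count in svc.monthly_deployments)


def active_ratio(services: list[ServiceStats]) -> float:
    """Share of services that are still active, in the range 0.0 to 1.0."""
    if not services:
        return 0.0
    return sum(is_active(s) for s in services) / len(services)


# Hypothetical sample data for illustration only.
sample = [
    ServiceStats("listing", False, [12, 9, 15, 11, 8, 10]),
    ServiceStats("legacy-batch", False, [0, 1, 0, 0, 1, 0]),
    ServiceStats("old-gateway", True, [3, 2, 0, 0, 0, 0]),
    ServiceStats("photo", False, [1, 1, 4, 1, 0, 1]),
]
ratio = active_ratio(sample)
```

With this sample, two of the four services count as active, so `ratio` is 0.5. A different reading of the threshold (for example, six or fewer deployments in total over six months) would change only the body of `is_active`.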

One important observation is that Fig.1 also shows the trendline of released microservices for mercari in blue, merpay in red, and mercoin in yellow. While the trendline for new microservices in merpay and mercoin is going up, the trendline for the mercari marketplace is going down, especially from the end of 2021 (highlighted with the purple ellipse in Fig.3 below).

Fig.3 – Released microservices count – mercari trendline highlighted

In 2021, microservice migration projects slowed down considerably for several reasons, which are covered in the challenges section below. As a result, teams started to step back, consider how long the migration would take to finish, and recognize that it was taking too much time and effort.

Then teams started to be more conservative in bringing the domain logic out of Mercari API and migrating it to microservices. The new microservices that were released after this period were mainly for the new business features.

Domains

Our marketplace is structured into nine main domains, each encompassing between 2 and 9 sub-domains. The primary domains are:

  • Growth Products
  • Product Engagement
  • Matching
  • Category Growth
  • CBO Product
  • Cross Border
  • Logistics
  • Platform
  • Foundation

In our marketplace, domains can be categorized into two logical types: Stable domains and Frequently changing domains.

The stable domains are the domains/teams that were stable enough to correctly migrate, maintain, and introduce new features and improvements to their services.

These domains/teams had existed for a couple of years with minimal changes and reorganizations. This allowed them to own not only a clear feature development roadmap but also a clear engineering roadmap.

They solved their technical debts, provided better DX for their customers (other engineering teams that depend on their services), and provided better UX to Mercari’s customers as well.

Examples of those teams are the Matching domain teams and some of the Foundation domain teams (e.g., TnS, CS Tool, IDP).

On the other hand, the frequently changing domains are marked by constant evolution. These domains are characterized by teams that frequently undergo changes, including shuffling of team members, splitting into smaller groups, or merging with other teams. This dynamic nature often results in a few distinct challenges and characteristics:

Adaptive Roadmaps: Unlike stable domains with clear and long-term roadmaps, these domains often have to adapt their roadmaps rapidly in response to changing team dynamics and business needs. This can lead to shifts in focus and priorities, requiring a more agile and flexible approach to project management, and makes it hard for them to commit to a long-term engineering roadmap.

Technical and Organizational Fluctuations: Frequent changes can lead to a state of continuous fluctuation, both technically and organizationally. This might result in temporary delays as new team configurations find their footing, especially when they take over microservices that they didn’t originally create and need to establish effective development and on-call lifecycles.

Dependency Management Challenges: With teams often changing, managing dependencies between various sub-domains and external teams becomes more complex. This can lead to challenges in coordination and increased risks of delays or misalignments.

Variable Quality and Performance: The quality and performance of the services in these domains may vary more than in stable domains. New team compositions might take time to adjust and optimize their approaches, which can temporarily affect the quality of output and service performance.

Examples of such domains include some of the growth products teams, and some of the product engagement teams as well, where the focus is on introducing more features in the marketplace, and usually the business demand for these teams is much higher than the stable domain teams.

Mercari API

The Mercari API, our PHP monolith, has been the focus of our migration efforts since early 2018. As Fig.2 shows, development on the Mercari API remains highly active, with the highest number of deployments (the first service on the left, approximately 1200) over six months.

This continued activity can be attributed to several key factors:

Exceptions to Code Freeze: Initially, management implemented a code freeze on the Mercari API to facilitate the microservices migration. However, due to the necessity of maintaining existing logic and the demand for new feature releases, exceptions were granted. This allowed teams to continue feature development during migration. Between February 2019 and March 2021, about 150 exceptions were approved for the Mercari API.

Shift in Migration Focus: Around March 2021, there was a noticeable deceleration in the migration pace. Some teams even halted their migration efforts, choosing instead to concentrate on developing business features and growth. This shift led to renewed active development within the Mercari API.

Robust Foundation for Speed Initiative: The Engineering division launched the Robust Foundation for Speed (RFfS) initiative, aiming, in part, to enhance the modularity of the C2C transactions area within the Mercari API. The RFfS initiative enabled us to refactor various sections of the monolith, improving its usability and collaboration potential for different teams.

Reintegration Considerations: After the RFfS initiative, teams faced a mixed landscape where some of their domain logic was embedded in the Mercari API, while other parts operated within separate microservices. This sparked discussions about the best approach moving forward: whether to consolidate logic back into the Mercari API, or to develop new features directly within it instead of creating additional microservices. Compounding this decision was a new policy aimed at reducing the total number of microservices. This policy, driven by the need to lower maintenance costs, influenced teams to reconsider expanding the microservices architecture and to evaluate the benefits of a more integrated approach within the Mercari API.

The current state of the Mercari API is such that it has a dedicated team responsible for its management and on-call duties. While this team oversees the overall operation of the API, other teams are actively collaborating and integrating new features and domain logic into it. These collaborating teams are also accountable for maintaining their specific contributions to the API. In the event of an incident within a particular domain, the Mercari API team takes the initial response action and then escalates the issue to the relevant domain team for further resolution.

Challenges

The Marketplace Backend Architects team organized workshops with all backend teams to identify the daily challenges they encountered. These challenges were primarily categorized into four groups: platform challenges, architecture challenges, common challenges, and organizational challenges.

The following chart shows the share of challenges in each category as a percentage of all the issues we collected.

Fig.4 – Number of issues per each category

In the upcoming sections, we will delve into some of these challenges in more detail.

Platform Challenges

Fig.5 – Number of teams who reported each challenge in the platform category

The above chart shows how many teams reported each challenge. For example:

  • Discoverability of the current platform and microservices documentation reported by 7 teams (blue area).
  • Lack of documentation for platform tools reported by 5 teams (red area).
  • The need to reduce the manual work every team does to keep maintaining their services (CI/CD migration, k8s-kit, Istio, Dependabot, etc.), reported by 5 teams (yellow area).

Architecture Challenges

Fig.6 – Number of teams who reported each challenge in the architecture category

The above chart shows how many teams reported each challenge. For example:

  • More standardization in different areas including endpoint management, E2E testing, PII deletion, etc. This issue was reported by 14 teams, but every team reported it from their perspective. (orange area)

New Business Challenges

While recent workshops primarily focused on platform and architecture challenges, it’s essential to acknowledge the significance of new business challenges in Mercari’s growth. As we explore innovative ideas and ventures, our approach typically involves two key strategies:

  • Proof of Concept (PoC) for Business Validation: We initiate a PoC to test new ideas, ensuring that we don’t overcommit resources before confirming the viability of the business concept.
  • Rapid Time to Market: Our goal is to launch new ventures as swiftly as possible, minimizing delays in bringing them to our customers.

In pursuing these new opportunities, teams often prefer two approaches:

  • Independent Development from Marketplace Services: To avoid delays associated with integration and coordination with existing marketplace teams, new business teams may develop services separately. This includes creating their own versions of existing services, like a new authority service for the new business, to expedite development.
  • Flexibility in Architecture Guidelines: Sometimes, in the interest of speed and innovation, teams might deviate from the established architectural guidelines.

While these approaches can pose integration challenges when reintegrating with the marketplace later, they also offer invaluable benefits. Exploring new technologies and landscapes not only fosters innovation but also enriches the team’s experience and skill set.

For instance, some of our new business ventures have introduced progressive concepts such as monorepos and modular monolithic architectures, or the utilization of previously unexplored services in GCP. These experiences contribute significantly to our technological and strategic arsenal.

Learning Opportunities

In reflecting on Mercari’s transition to microservices, and also on the previous challenges, we can identify some key learning opportunities:

Challenges of Maintaining Backward Compatibility: One of our initial strategies was to ensure backward compatibility for migrated endpoints. This approach was intended to streamline the migration process and minimize client-side disruptions by allowing a simple switch from old to new endpoints. While this expedited migration and reduced immediate client-side impact, it inadvertently led to the transfer of some technical debt and legacy issues into the new microservices environment. This sometimes amplified the challenges, as these issues became more complex within a distributed system.

Stability of Domain Teams: As previously discussed, the stability of certain domain teams posed a challenge. Some teams, due to their fluctuating compositions and focus, found it difficult to establish and follow through with robust, long-term migration plans for their respective domains.

Adapting Business Processes to Microservices: The transition to a microservices architecture did not significantly alter our approach to business growth and feature development. Previously, it was feasible for a single team to implement features spanning multiple areas of the monolith. However, in a microservices environment, such an approach necessitated increased inter-team collaboration and coordination due to the interconnected nature of services. This shift highlighted the need for adapting our feature development strategies to better suit the nature of a microservices-based ecosystem.

Enhanced Investment in Platform Infrastructure: Investing more significantly in our platform infrastructure, particularly in Platform as a Service (PaaS), can help reduce manual work. This investment is essential for supporting scalability and efficiency.

Governance and Standardization at Scale: As operations scale, the initially relaxed governance model may become less effective. Therefore, implementing more stringent governance and standardization is crucial to manage growth effectively and maintain system integrity.

Framework for New Business Initiatives: Establishing a comprehensive framework for new business ventures is critical. This framework should balance the need for speed in launching new projects with the requirement for smooth integration into the marketplace or seamless termination if necessary. It aims to minimize friction and ensure alignment with broader business objectives.

Golden Path

Given the above learning opportunities, now is the time for Mercari to define its Golden Path. The term Golden Path refers to an opinionated, well-defined set of recommended practices, tools, and architectural patterns that an organization advocates to achieve optimal results. These practices need a stricter governance model, enforced via the platform tools.

From the architects’ team’s point of view, the key to a successful golden path is a single, properly sized DX team that owns, has full authority over, and is responsible for the whole interface surface between the platform teams (MSP, data platform, experimentation, IDP, search platform, etc.) and the domain/feature teams, so that these teams can focus almost exclusively on business logic.

To mention some examples of what the golden path needs to provide for backend teams:

Teams can deploy a simple service in production from scratch in at most half a day. Teams can either deploy it using an application model or with a serverless model. Unless overridden via manifest, the service is automatically deployed in all appropriate regions.

Teams can safely expose a standard endpoint to web/app or other external clients, as well as to other internal services, with at most one line of configuration in the manifest.

As long as teams follow the golden path, they need to maintain only a minimal set of scaffolding code and add a single, config-less middleware to inbound/outbound traffic, and all configuration for a service is kept together in a single manifest alongside the service’s source code. The golden path automatically provides managed user-service and service-service authn/z, managed observability, and managed reliability.
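To make the manifest-driven idea above more concrete, here is a minimal sketch of how a single manifest with sensible defaults might be expanded into a deployment plan. Everything in it is an assumption for illustration: the `Manifest` fields, the `resolve` function, the region names, and the endpoint path are hypothetical and do not describe an existing Mercari tool.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical region list; the real set of "appropriate regions" is not given.
DEFAULT_REGIONS = ("region-a", "region-b")
# The two deployment models mentioned above.
MODELS = ("application", "serverless")


@dataclass
class Manifest:
    name: str
    model: str = "application"
    # None means "not overridden": deploy to every default region.
    regions: Optional[tuple[str, ...]] = None
    # One entry (one line of configuration) per exposed endpoint.
    expose: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.model not in MODELS:
            raise ValueError(f"unknown deployment model: {self.model}")


def resolve(manifest: Manifest) -> dict:
    """Expand a minimal manifest into the full deployment plan."""
    return {
        "service": manifest.name,
        "model": manifest.model,
        "regions": list(manifest.regions or DEFAULT_REGIONS),
        "endpoints": list(manifest.expose),
        # Capabilities the golden path is expected to provide automatically.
        "managed": ["authn/z", "observability", "reliability"],
    }


# A team only names the service and the endpoint it wants to expose.
plan = resolve(Manifest(name="likes", expose=("/v1/likes",)))
```

The design choice illustrated here is that teams state only what differs from the defaults, while the platform fills in regions and managed capabilities, which keeps the scaffolding each team maintains small.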

Conclusion

As we reflect on Mercari’s journey from a PHP monolith to a dynamic microservices architecture, it’s clear that this path has been marked by both triumphs and challenges. The migration, initiated in 2018, was more than just a technical improvement; it represented a pivotal shift in our approach to software development, team collaboration, and business strategy. Throughout this journey, we’ve encountered a range of experiences – from the ease of collaboration within the PHP monolith to the complexities of managing a distributed, microservices environment.

Our transition to microservices was not just a matter of technological change but also a learning curve in organizational adaptability and strategic foresight. The challenges we faced, such as maintaining backward compatibility and adapting business processes to fit a new architectural paradigm, were not merely obstacles but opportunities for growth and innovation. They compelled us to think critically about how we build, maintain, and evolve our software and how our teams collaborate and drive the company forward.

Looking ahead, we’re poised at a crucial juncture. The insights gained from our experiences have been invaluable in shaping our Golden Path – a set of practices, tools, and architectural patterns tailored to optimize our outcomes.

In collaboration with various stakeholders, we started to define and plan this path, ensuring that it aligns with our evolving business needs and technological advancements.

We envision a unified platform where engineers can easily access documentation, submit design documents for reviews, manage Architectural Decision Records (ADRs), and create new services and applications. This platform will alleviate the burden of scaffolding work, allowing our teams to focus on innovation and efficiency.

Our ambition is to forge a path that not only embodies best practices for high software quality and efficiency but also accelerates the time-to-market for new business initiatives. This Golden Path is more than a guideline; it’s a commitment to continual improvement and a testament to our journey from a PHP monolith to a dynamic and flexible architecture.
