Introduction
This post is for Day 9 of Mercari Advent Calendar 2024, brought to you by @mshibuya, a Tech Lead of the Mercari Marketplace Site Reliability Engineering (SRE) team.
My team, Marketplace SRE, is part of the Platform Division, which provides the Platform for the Mercari Group as a whole. This article discusses improvements we made to a process called the Production Readiness Check, which supports the reliability of our services, and how those improvements changed the developer experience.
The importance of services having adequate reliability is widely recognized. However, the effort required to achieve it can be tedious and labor-intensive, and a production readiness process can itself slow down development. I will describe what aspects of the Production Readiness Check process were improved and what kind of developer experience we aimed to create as a result. I hope this will be useful for those who are undertaking similar initiatives.
About Production Readiness Check
At Mercari, there is a process called Production Readiness Check (PRC). This is a checklist of criteria that newly developed products or microservices must meet, and without passing this they cannot be operationally launched in the production environment.
In addition to an introductory blog article, the checklist items themselves are available on GitHub, although the published version is not the latest.
Mercari broadly adopts the microservice architecture. In large-scale services such as the Mercari marketplace app and the mobile payment service Merpay, many feature additions are made in the form of newly-developed microservices. New products like "Mercoin" and "Mercari Hallo" also take the form of a microservice on the same infrastructure as "Mercari" and "Merpay." Hence, the launch of new microservices happens frequently. Following the DevOps principle of "You build it, you run it," the individual microservice developer teams are responsible for ensuring reliability in the production operations.
Microservice development teams may not always be familiar with launching new services or ensuring reliability. The purpose of the Production Readiness Check process is for developer teams to autonomously launch microservices while ensuring necessary reliability.
Challenges to Solve
The Production Readiness Check has played an indispensable role in ensuring that services developed at Mercari have sufficient reliability (i.e. production-ready) to operate under real user traffic. However, this process of checking for production readiness comes at a cost to developers’ time.
The Production Readiness Check process at Mercari begins with creating an issue that includes the checklist and ends with the closing of the issue. Over the last 5 years, it has taken an average of 35.5 days to complete the PRC, although this is only a reference value, since actual work does not occur throughout the entire period from issue open to close.
Developer interviews conducted by the Platform Division revealed that there were many complaints about the Production Readiness Check process. Examples include:
Did PRC as well, lots of “copy this, paste this, take a screenshot of this…”
Overall straightforward, just PRC was a pain
PRC, takes about 4 weeks
Takes a lot of time
Personal opinion is that 1-2 sprints could be cut by simplifying the PRC process
Too many things to check, some things are hard to understand how to verify
One of the least desirable tasks. I understand it’s necessary.
At the Mercari Group, speed in launching new products and adding features to existing products is more important than ever. Therefore, speeding up this Production Readiness Check process and reducing the delivery time was an urgent task.
Developer Experience with the Existing Process
Here I will present a typical experience before the improvements in the Production Readiness Check process, using the launch of a new product as an example. This example is fictional, so please consider it as a possible worst-case scenario a developer could have experienced.
Let’s say that the Mercari Group decides to launch a hypothetical new product. This is a high-criticality product integrated with the Mercari marketplace app.
A development team is formed with a goal of launching this new service within six months. The team first clarifies the product requirements and designs the system implementation, compiling it in the form of a Design Doc. Based on the completed design, they proceed with the implementation of the actual application code. They are able to finish implementing almost all the functions by the fifth month, just before the public launch.
While the team prepares for the actual product release, setting up the infrastructure for production use, they realize that they need to go through the Production Readiness Check process. The team, recognizing that meeting these requirements is mandatory for releasing the product, does their best to finish, but due to the sheer number of requirements and aspects that were not included in the initial design, they struggle.
As a result, the team takes two months to complete the Production Readiness Check, which delays the product launch and costs them the opportunity to release the product early and gain feedback from users.
Solution
Check Automation
One primary factor contributing to the labor intensity of the process is the sheer number of items to be checked, which is steadily increasing due to learnings from past incidents.
The number of checklist items for typical services has increased from 62 in the publicized version, to 71 in the latest internal version, an increase of nearly 15% over approximately three years.
Moreover, while the items included in the checklist define the desired end state, they rarely guide teams on how to get there, further slowing developers down as they investigate.
To solve this problem, we introduced automated verification of checks in the Production Readiness Check process, including scanning application code and infrastructure configuration. We have automated almost half, about 45%, of the checklist items, and plan on growing this number in the future.
Not only has this made it easier for developers to conduct checks for their service, but these automated checks also make it easier for developers to understand how to fulfill the requirements, facilitating faster and easier mitigation actions.
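To make the idea concrete, here is a minimal sketch of how an automated check paired with a remediation hint could look. It is not Mercari's actual implementation: the `Check` class, the `run_checks` function, the configuration keys, and the two example items are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical service configuration; the keys are illustrative only.
ServiceConfig = dict


@dataclass(frozen=True)
class Check:
    """One automatable checklist item plus a hint on how to satisfy it."""
    item_id: str
    description: str
    passes: Callable[[ServiceConfig], bool]
    remediation: str


# Illustrative items only; these are NOT the actual PRC checklist entries.
CHECKS = [
    Check(
        item_id="replicas",
        description="Service runs at least 3 replicas",
        passes=lambda cfg: cfg.get("replicas", 0) >= 3,
        remediation="Set replicas to 3 or more in the deployment config.",
    ),
    Check(
        item_id="alerting",
        description="At least one alert rule is defined",
        passes=lambda cfg: len(cfg.get("alert_rules", [])) > 0,
        remediation="Define an alert rule for the service's key metric.",
    ),
]


def run_checks(config, checks=CHECKS):
    """Return (item_id, passed, hint) tuples; hint is empty when the item passes."""
    results = []
    for check in checks:
        passed = check.passes(config)
        results.append((check.item_id, passed, "" if passed else check.remediation))
    return results


if __name__ == "__main__":
    # Hypothetical sample input: too few replicas, one alert rule defined.
    sample = {"replicas": 2, "alert_rules": ["high_error_rate"]}
    for item_id, passed, hint in run_checks(sample):
        status = "PASS" if passed else "FAIL"
        print(f"{item_id}: {status} {hint}".rstrip())
```

Keeping the remediation text next to the check itself is one way a failing item can tell the developer what to do next, not just that something is missing.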
Enhancement of Existing Platform Components with Production Readiness Check Compliance
As has been presented on past occasions, Platform Engineering is widely practiced at Mercari. Under the concept of enhancing developer productivity through self-service-focused Platforms, the Platform Division has built and provided many components.
During the process of identifying the reasons for the high burden of the Production Readiness Check process, we realized there was a gap between the requirements and the functions of the components actually provided by the Platform.
Mercari’s Platform offers various components throughout all stages of the software development life cycle (SDLC), allowing developers to efficiently achieve their necessary objectives. We identified ways to improve the platform offerings themselves, such as tools for automated Continuous Integration / Continuous Delivery (CI/CD), to fill in the gaps.
Additionally, as a more important and cost-effective improvement, we enhanced documentation to clarify the Production Readiness Check requirements that can be met by these components.
An insight gained through these efforts is the importance of integrating such components into a comprehensive developer experience around the Production Readiness Check process, which is unavoidable when building microservices. We believe that by not only providing components but also improving the check process itself, we have created a situation where a bi-directional feedback loop can function.
"Shift-Left" Approach
"Shift-Left" is a concept often used in software testing and security, referring to moving activities like test execution to an earlier stage (i.e., the "left side" of a timeline diagram).
In the aforementioned new product development example, the team attempted to complete the Production Readiness Check process in a short period just before releasing the product, encountering difficulties due to the high labor intensity. I personally refer to these situations as "the last-minute summer homework problem," but I believe this is due to structural issues more than the fault of any individual team member. Launching a new product involves various challenges and difficulties, and, while the team focuses on these, it is inevitable that things known to be important but not immediately needed get postponed.
To address this problem, I thought improvements at a systemic level were necessary. Now, with automation achieved, the team can perform the checks for automated items repeatedly to incrementally meet the requirements. Also, by adopting the expanded Production Readiness Check compliance through existing components, they can start fulfilling the requirements in advance without much effort. Then finally, by ensuring the team is aware of these measures from the early development stage, we can prevent work being concentrated in a short period just before release.
However, simply announcing the existence of such new processes and solutions has its limits. Therefore, we embedded them into another established process that is guaranteed to occur at the start of every development effort, so that teams in the early development stage cannot miss them. At Mercari, it is customary to create a Design Document for new services to be reviewed by stakeholders. To ensure that Production Readiness is considered earlier in the SDLC, the Design Document template was expanded to include details about these production checks.
As a result of these "Shift-Left" measures, developers can become aware of these requirements from the design stage, long before actual development or infrastructure setup happens, and take meaningful actions toward the Production Readiness Check process earlier.
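The incremental use of automated checks described in this section can be sketched as a comparison between two runs. This is only an illustration under assumed data shapes: the `summarize_progress` function and the mapping from checklist item IDs to pass/fail booleans are hypothetical, not part of Mercari's tooling.

```python
def summarize_progress(previous, current):
    """Compare two runs of the automated checks.

    Both arguments map a checklist item id to True (passing) or False (failing).
    Items missing from the previous run are treated as previously failing.
    """
    newly_passing = sorted(
        item for item, ok in current.items() if ok and not previous.get(item, False)
    )
    regressed = sorted(
        item for item, ok in current.items() if not ok and previous.get(item, False)
    )
    remaining = sorted(item for item, ok in current.items() if not ok)
    return {
        "newly_passing": newly_passing,
        "regressed": regressed,
        "remaining": remaining,
        # Fraction of items passing in the current run (0.0 if there are none).
        "pass_ratio": sum(current.values()) / len(current) if current else 0.0,
    }
```

Running such a comparison after each change lets a team see which items remain well before release, and notice if a previously satisfied item regresses.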
Developer Experience with the New Process
The following illustrates what sort of experience we want to achieve with the improved Production Readiness Check process, incorporating automation.
Let’s go back to the hypothetical development of a new product, but with the new process in mind.
First, as a result of Shift-Left, the team becomes aware of the Production Readiness Check process at the earliest stage of a six-month development period, while designing the system and creating the Design Doc. Understanding early which requirements need attention allows them to consider options from the design stage, such as discussing changes to product requirements with stakeholders in order to meet the Production Readiness Check requirements.
By the fifth month, with the product launch coming closer, the team begins preparations for the Production Readiness Check process. Having selected appropriate Platform components to meet requirements, the team minimizes additional changes or efforts required to meet them.
The automated checks significantly reduce the labor needed to verify and fix compliance with Production Readiness Check items. Consequently, the team completes the Production Readiness Check process within a month and is able to deliver value to users early and refine the product through feedback.
Future Plans
As outlined above, the Production Readiness Check process has been improved and is starting to be used for checks before actual microservice releases. However, there is still room to improve existing components so that they satisfy more Production Readiness Check requirements, and to extend automation to more cases.
To achieve a better developer experience, both of these aspects are expected to be areas of focus for the foreseeable future.
What lies ahead as these improvements advance?
Personally, I consider it ideal to eliminate the idea of "conducting checks" altogether. In a world where almost all requirements are inherently met through the functionalities and components provided by the Platform, developers could build and operate reliable services without having to think about it.
I want to consider how we can achieve the ideal Platform where we don’t need to care about such reliability requirements, even though the journey may be a long one.
Conclusion
In this article, I gave an overview of the Production Readiness Check process at Mercari, detailed the improvements made to the process, and illustrated the kind of developer experience we aim to create as a result.
Tomorrow’s article will be by sintario_2nd. Please continue to enjoy!