What the SRE team wants to achieve with the development team

* This article is a translation of the Japanese article published on Jan. 29th, 2021.

Hello, I’m Shibuya (@m4buya), an Engineering Manager in the Mercari Microservices SRE team.

In June last year, Mercari made a minor update to a part of the SRE team and launched the Microservices SRE team. This team contributes to reliability by working closely with the product team and using our expertise as SREs. This article will introduce the motivation behind the launch, what we are aiming to achieve, what we have and have not yet been able to do, and our plans.

Background story

For customers to use our service safely and securely, Mercari has been working to maintain and improve the reliability of the system by establishing an SRE team since 2015. In addition, the Mercari SRE team has been playing a significant role in improving scalability, which is essential as the Mercari service continues to grow and traffic increases year by year. On the other hand, we have also identified emerging areas that have proven difficult for the SRE team to handle alone.

One of the biggest reasons for this is the shift to microservices, which started in earnest around 2018. Since the beginning of the service, the server-side of Mercari has consisted of monolithic applications written in PHP. However, we decided to launch cross-functional microservice teams to expand our engineering organization and increase development speed. Then, each team develops, tests, and operates their services, and they can deliver value to customers on their own.

While this gave each team more ownership over the development and operation of the microservices, it also created areas that were difficult for the SRE team to keep track of. Failures that occurred in these hard-to-grasp areas sometimes caused the entire service to stop functioning.

Merpay started its development after the transition to microservices began, so it has adopted a microservices architecture from the beginning. Merpay requires a higher level of reliability and confidentiality than Mercari because it is a financial service. Therefore, the SRE team was deployed to work along with the microservices system, and the microservices team can get operational support from the SRE team as needed.

This organizational architecture is where the situation differs between Merpay and Mercari. At Mercari, the point of contact for the microservices team has always been the Platform team, which develops and provides the common infrastructure for microservices. However, the Platform team was merely the provider of the underlying infrastructure and tools and was not in a position to provide support tailored to the unique circumstances of microservice teams.

Due to this situation, we decided to reorganize the infrastructure teams across Mercari and Merpay. In line with this, we launched a new team that could solve the current issues.

Problems to be solved

We analyzed the situation further and concluded that our issues could be boiled down to the following three.

Microservices teams did not have resources to handle the operational aspects

As mentioned above, the microservices teams are organized as cross-functional teams that develop, test, and operate services themselves. This is essential to reduce dependency on external teams and to increase agility. However, microservices teams organized by existing backend application developers tend to be overwhelmed with the development of the required functions of the product. As a result, they are not able to invest sufficiently in the mid- to long-term operation of the system.

Much like systems development, stable operation of systems also requires a high level of expertise. We realized in order to increase the reliability of our product as a whole, it was essential to create an organizational architecture that could involve SREs (who typically have far more experience in systems operation and knowledge about lower layers of the application) to support individual microservices.

The hurdles for receiving support are high, and they needed specific case-by-case support

In a survey of the microservices team, we received comments such as "When we ask for support, we need to explain our background of the system and/or the problem. We could significantly shorten this step if we could communicate with the person that would be providing us support on a more regular basis." and "It would be easier to ask minor questions if the SREs were already in the team." This led us to believe that it would be more effective to have support provided by someone who worked closely with the microservices team on a daily basis.

In addition, each of Mercari’s microservices is built on the cloud infrastructure and toolkit built by the Platform team, and the microservices team can get technical support from the Platform team as needed.

However, the Platform team is only in a position to develop and provide a common infrastructure for many microservices. The Platform team tends to solve problems in a more generalized way. We thought it would be good to continuously follow the status of the microservices team and provide support relative to their context and domain knowledge.

More support for using microservices infrastructure effectively

The core mission of the microservices teams is to provide value to customers through microservices. The Platform team’s job is to help achieve that goal by creating an easy-to-use infrastructure that can be used by developers who are not familiar with the microservices infrastructure and its components. However, microservices teams sometimes face difficult problems in their operations, or are unsure whether they are following best practices. Sometimes they may be wondering whether they have overlooked something important that could cause problems. In the above survey, there were many comments that asked for support in using microservice infrastructure components such as Kubernetes and Datadog more effectively.

In order to solve the above issues, we came to the conclusion that it would be effective to organize a new SRE team that could improve the reliability of services together with the microservices team as an "Embedded SRE".

Embedded SRE

This is a style of SRE team where individual SREs are "embedded" in the product team and work together as one team to develop and operate the product. The article "The Many Shapes of Site Reliability Engineering" gives a very clear explanation of the different types of SRE teams, and I will summarize them according to this classification.

Google Model SRE

A dedicated SRE team is responsible for the operation of the product. There will be a separate functional development team, but if reliability is insufficient, the SRE team will try to balance reliability by participating in the functional development team as on-call.

SRE as a Center of Practice

A single centralized SRE team will drive reliability improvement and develop and deliver the tools for that. The scope of the on-call is limited to the area of such tools.

Embedded SRE

The SRE will be assigned as a member of the cross-functional product team and will help the product team keep the necessary reliability and scalability for the service. The SRE is responsible for being on-call with the product team.

At Mercari, the Platform team acts as the "SRE as a Center of Practice". This position is essential for Mercari to leverage the latest technology and set best practices as we take on the technical challenge of microservices. It is also clear that it will continue to be significant in the future. By having an embedded SRE (who can cover the areas that the Platform team cannot) inside the product team, we can gain a deeper understanding of the inner workings of the operation side. We believe that this architecture can create a cycle of feedback for the Platform team to make improvements.

Goal

When we start a new team, it is very important to have a consensus on what we want the team to be and what we want it to do. From the advice that we got, we developed an inception deck, which is often used in agile development. This is a form of document that summarizes the entire picture required to proceed with the project, including items such as:

  • Background of the project
  • The goal the project wants to achieve
  • The value that the project has for the customer

The Commander’s Intent was defined as follows.

At Mercari JP, the microservices team has been entrusted with a large degree of operational ownership. However, it is thought that greater value can be provided by solving issues that require expertise in actual operational situations, sharing common knowledge, and working across teams to improve reliability throughout the product. Microservices SRE solves this situation and leads the microservices team to practice the SRE role as a process. The SREs also contribute in a scalable way to the organization beyond the team by building a healthy culture of trust. SRE is an activity that aims to achieve optimization of the entire software development lifecycle by applying software engineering methodologies to system operations. It also involves engineering the process and culture. The Microservices SRE understands the best practices of SRE, but is flexible enough to experiment and change according to the situation in order to explore the best SRE for microservices development at Mercari.

In order to achieve this goal, we started by getting deeply involved with specific teams as Embedded SREs. Gradually, we moved to different teams and rotated through many teams in order to spread the culture.

What we actually did

The "SRE Customer" team, which was originally one of the sub-teams of the Mercari SRE, was renamed the "Microservices SRE" team and started to shift toward becoming an Embedded SRE organization. The SRE Customer team had been responsible for the operation of the application side of the Mercari core system, and their strength is their extensive knowledge based on the history of the Mercari system. However, since they have been working with monolithic PHP applications and their infrastructure for a long time, we knew that it would be challenging for them to catch up with the technology stack used in our current platform.

The selection process for deciding which microservices team the new Embedded SRE would join was conceived based on the following policies.

  • Risk factors should be reduced as much as possible since this is a completely new approach
    • Teams that already have a relationship with the SRE team will have priority.
  • Teams must be willing to help the SRE get acquainted with the technical elements required for the microservices
  • Teams should be facing reliability and operational issues and must have a strong motivation to put effort to solve these problems

Those in the Engineering Manager position including myself have also joined the microservices team as an Embedded SRE. While I have less time to devote to the microservices team because I am a manager, I felt it would be important for me to gain first-hand experience and give feedback on how the teams worked.

Takeaways

We started the team in June of last year, and have been gradually expanding the scale of our activities for about half a year now.

When we started the Embedded SRE initiative, one question lingered on our minds: "Can we really join another team and suddenly make a difference?" We were experienced SREs, but we were dealing with an unknown system with a different technology stack. We thought it would be quite difficult for us to make a meaningful contribution to the microservices. However, as it turned out, our fears were unfounded. From a relatively early stage, we were able to make steady progress, especially in areas such as monitoring and data storage. As a result, after three months, we obtained an average rating of 4.63 out of 5 in a survey measuring satisfaction with Microservices SRE activities.

However, there are some issues that we feel need to be addressed in the future. One such issue is the fact that it is quite difficult to determine when to leave a team after joining as an embedded SRE. In this regard, we are considering changing the process and first agreeing on the desired state (the goal) before starting the collaboration with the team that the SRE will be joining.

Looking to the future

The goal of the Microservices SRE team is to positively impact the user experience by improving the overall reliability of the product, and we are still on our way.

We have already established how we can support individual teams to some extent. As the next step, we would like to focus on how we can find and solve common issues among multiple microservice teams. We would like to develop tools to solve these common problems, as well as take the solutions that we applied to a specific team and deploy the same solution across other teams. These efforts would also be beneficial in that they would produce a greater impact with a small amount of effort, as they will scale to multiple teams.

However, our team is still small, and we don’t yet have enough capacity to invest our efforts there. The Mercari Microservices SRE team is actively looking for people who share this mission and will work with us to create the best microservices operation team in the world.

If you are interested in joining us, we would be happy to have a casual conversation with you, even if you have no intention to change your job. Please feel free to contact me (@m4buya). I look forward to hearing from you!