2022/02/21

Embedded SRE at Mercari

Author: deeeet

Author: @deeeeeeet, Engineering Manager of Microservices SRE

The Microservices SRE team is one of the teams in the Developer Productivity Engineering Camp. We provide an embedded SRE service to product teams. By working with and inside a product team, we improve the reliability of its services and share SRE practices so that the team can maintain that reliability without SRE members. We rotate between teams to spread these practices across the Mercari organization.

We introduced this team in a previous post; in this post, I would like to give a recent update on the team and share where we think it is heading.

Platform Engineering vs. SRE

Before going into detail, I would like to clarify the difference between the platform engineering team and SRE. At Mercari, we clearly separate the two.

Platform engineering is about building the platform for the internal product teams so that they can work on the entire software development lifecycle (SDLC) by themselves.

SRE is the set of practices for production operation (more precisely, I like the expression ROAD (Response, Observability, Availability, and Delivery) as defined by Bruce Dominguez).

In some organizations, one team is responsible for both, but we separate the teams. The platform team works on platform engineering, and the Microservices SRE team works on SRE. The following diagram describes the relationship between the Platform, the SRE, and the product teams at Mercari.

The SRE team sits between the platform team and the product teams. While the platform team provides DevOps tooling to the product teams, the SRE team joins a product team and does reliability work with it using that tooling. From this perspective, the SRE team is a customer of the platform team.

The platform team solves the productivity and reliability issues that are common to about 80% of services and teams, and the SRE team works on the remaining 20%, the service- or domain-specific issues. The SRE team can reach problems that the platform team cannot handle directly, but it does not need to maintain the tooling. The SRE team gives feedback on the tooling to the platform team, and the platform team responds with improvements. The two teams complement each other’s weak points.

Ivan Velichko gives an excellent general summary of this in DevOps, SRE, and Platform Engineering.

SRE as practice, not as a dedicated role

As I described in Developer Experience at Mercari, the shift from a centralized ops team to DevOps and the “you build it, you run it” practice increased the burden on the product teams. That post mainly discusses productivity, but the same is true for production operations. Operational practice is still immature in many teams, and they are facing lots of difficulties.

To solve this problem, it’s very important that the product teams make effective use of SRE practice on their services. In my words, SRE practice is “the way to handle reliability as a product based on data and to balance feature development agility and reliability work.” The Microservices SRE team was born to make this happen in the product teams.

The mission of the SRE team is to spread SRE practice across the organization and help the product teams do reliability work by themselves. We do not aim to hire SRE members for each product team (as concluded in The Third Age of SRE by Björn “Beorn” Rabenstein, it’s really hard). Instead, we assign SRE members to a product team “temporarily” as embedded SRE members, and then rotate them to the next team. The SRE member works on reliability improvements in the target team and shares their knowledge. We expect that knowledge to spread gradually through rotation and become one of the “SWE practices”.

In the next section, I will describe how we do this.

The System

So how do we do the embedded SRE at Mercari? This section describes the system and process: assignment, work, evaluation, and rotation.

Assignment

The embedding process starts with assignment: which team should the SRE member join? Since there are many product teams and the number of SRE members is limited, this decision must be made carefully. I would even go as far as saying that it is the most crucial decision, similar to how a product team decides which features to build. If we do this well, I think the embedding is 80% of the way to success.

Currently, we make this decision using both quantitative and qualitative approaches. In the quantitative approach, we build monitoring dashboards that show the health of all services and observe, for example, which teams face many incidents, receive many alerts, or have high error rates, and how far each service has adopted the latest recommended platform features. In the qualitative approach, we conduct developer surveys and directly ask whether the product team wants an embedded SRE. We also try to collect information on large or critical production feature launches (where reliability is likely to drop).

In both approaches, after we list and prioritize the target services, we talk directly with the engineering manager (EM) and tech lead (TL) of the team and discuss whether embedding can help. Even if we observe issues in a service, the team may already know about them and be able to fix them without our help, so talking with the team is an essential step. In this session, it’s also very important to agree on the expectations for the embedding: what kind of issues we should solve.

Once we reach an agreement, the SRE member joins the team.
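To make the quantitative side of this prioritization more concrete, here is a minimal sketch of how dashboard signals could be combined into a single ranking. It is not our actual tooling: the `ServiceHealth` fields, the team names, the sample numbers, and the weights are all assumptions made up for illustration, and the result is only an input to the conversation with the EM and TL, not a replacement for it.

```python
from dataclasses import dataclass


@dataclass
class ServiceHealth:
    """Per-team health signals; every field name here is hypothetical."""
    team: str
    incidents: int            # incidents in the observation window
    alerts: int               # alerts fired in the same window
    error_rate: float         # fraction of failed requests, 0.0-1.0
    platform_adoption: float  # share of recommended platform features in use, 0.0-1.0


def normalize(values):
    """Scale values to 0.0-1.0 so signals with different units are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(services, weights=(0.4, 0.2, 0.3, 0.1)):
    """Return (team, score) pairs, highest need first. Weights are illustrative."""
    incidents = normalize([s.incidents for s in services])
    alerts = normalize([s.alerts for s in services])
    errors = normalize([s.error_rate for s in services])
    # Low adoption of platform features raises the score, so invert it.
    adoption_gap = normalize([1.0 - s.platform_adoption for s in services])
    w_inc, w_alert, w_err, w_gap = weights
    scored = [
        (s.team, w_inc * i + w_alert * a + w_err * e + w_gap * g)
        for s, i, a, e, g in zip(services, incidents, alerts, errors, adoption_gap)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample data for three hypothetical teams.
services = [
    ServiceHealth("team-a", incidents=5, alerts=120, error_rate=0.02, platform_adoption=0.4),
    ServiceHealth("team-b", incidents=1, alerts=30, error_rate=0.001, platform_adoption=0.9),
    ServiceHealth("team-c", incidents=3, alerts=200, error_rate=0.01, platform_adoption=0.6),
]
ranking = rank_candidates(services)
```

With this sample data, team-a ranks first and team-b last. Min-max normalization is used here only because it is simple; a real dashboard would likely need per-signal thresholds and a longer observation window.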

Work

While embedded, the SRE member works as a member of the product team: they join the team meetings, share the backlog, and join the on-call rotation. But their main work is improving the reliability of the services the team manages. In other words, they don’t work on product feature development. We have a common checklist that the embedded SRE should work through; some of the items are:

Make sure the service meets the Production Readiness Checklist
Check and implement data backup and recovery
Ensure the observability of the service
Improve on-call rotation and incident handling
Write playbooks
…

The SRE team shares its goals with the product team. At Mercari, we use the OKR system for goal management. The SRE team and the product team each set OKRs, and the SRE member sets a personal OKR that contributes to both. The SRE team’s OKR is about improving overall service reliability and is not specific to a particular service (e.g., writing a standard playbook that most product teams can use). The product team is asked to include service reliability KRs in its OKR (e.g., setting SLIs/SLOs and using them in product decisions). With this, we can improve reliability from both a micro and a macro perspective.
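As an illustration of what “using SLI/SLO for product decisions” can look like, the sketch below computes an error budget for a request-based availability SLO. This is one common approach and not necessarily the exact calculation any given team at Mercari uses; the function name, the 99.9% target, the request counts, and the release-freeze rule are all assumptions chosen for the example.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO.

    slo_target is the target success ratio, e.g. 0.999 for 99.9%.
    remaining_fraction is 1.0 when no budget is used and <= 0.0 when it is spent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # No traffic in the window, so there is no budget to spend.
        return 0.0, 0.0
    remaining = 1.0 - failed_requests / allowed_failures
    return allowed_failures, remaining


# Hypothetical numbers: 1M requests in the window, 400 of them failed.
allowed, remaining = error_budget(
    slo_target=0.999, total_requests=1_000_000, failed_requests=400
)
# A hypothetical team rule: slow down feature releases once the budget is spent.
freeze_releases = remaining <= 0.0
```

In this example roughly 1,000 failures are allowed and about 60% of the budget remains, so the hypothetical rule would not freeze releases. The point of such a number is to give the product team a shared, data-based signal for trading feature work against reliability work.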

Evaluation

Evaluation is one of the difficult parts of embedded SRE. We have common results that we expect from SRE members, such as improving SLIs/SLOs, reducing toil, and improving the on-call system. But since the situation differs from team to team, we need direct feedback from the product team to evaluate the results properly.

We’re still considering whether this is the best way, but currently we ask the EM and TL of the target product team to fill out a survey about the SRE member’s work once a quarter. The following are sample questions:

What is your overall evaluation of MS SRE support? (score 1-5, higher is better)
Did SRE support improve your services’ reliability? (score 1-5, higher is better)
Does your team have the confidence to operate service reliably by yourself even after SRE is rotated to different teams? (score 1-5, higher is better)
Do you have any suggestions to improve SRE support?

We summarize them and use them for the member’s evaluation and improvement of the embedding process itself.
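A minimal sketch of this summarizing step is shown below. The response data, the short question keys, and the 3.5 threshold are invented for illustration; they only mirror the three scored sample questions above, and free-text answers would still need to be read by a person.

```python
from statistics import mean

# Hypothetical responses: one dict per respondent (EM or TL), scores 1-5.
responses = [
    {"overall": 4, "reliability_improved": 5, "confidence_after_rotation": 3},
    {"overall": 5, "reliability_improved": 4, "confidence_after_rotation": 4},
    {"overall": 3, "reliability_improved": 4, "confidence_after_rotation": 2},
]


def summarize(responses):
    """Average each scored question across respondents, rounded to 2 decimals."""
    questions = responses[0].keys()
    return {q: round(mean(r[q] for r in responses), 2) for q in questions}


summary = summarize(responses)
# Flag questions whose average falls below an illustrative threshold of 3.5.
needs_attention = [q for q, avg in summary.items() if avg < 3.5]
```

With these made-up numbers, only the question about operating the service confidently after rotation falls below the threshold, which would suggest focusing the rest of the embedding on knowledge transfer.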

Rotation

Currently, we rotate teams every two quarters. We still don’t know whether this is the ideal rotation term, but we believe that two quarters give the SRE member enough time to onboard to the domain and solve the product’s reliability problems. We could make the term longer, but one concern is that, given too much time, the team may become dependent on the SRE always being present and be unable to work without them. Since both the product team and the SRE know that the embedding ends after two quarters and the SRE will leave the team, we can make sure all the work is done with this assumption in mind. (Of course, it’s not a mandatory rule; if we think it’s important to extend the term, we do.)

In addition, we have experimentally started a special period between embeddings for working on overall reliability improvements. In this period, which is currently designed to be one quarter, the SRE member can focus not on a specific service but on reliability work that benefits all services. One example is contributing to the playbook management system to add features they wished they had while working as an embedded SRE. This practice is still optional, and the member can choose whether to do it.

Case Studies

Some of the work SRE members have done in their embedded teams over the last one to two years includes:

Search infra team: search is one of the critical functionalities of the Mercari app, where we connect sellers and buyers. It receives the largest traffic, and (because of this) the search infra team frequently adds and releases features, so the service has high reliability demands. The SRE member joined the team and worked on supporting production releases, building a benchmarking system to avoid latency degradation, optimizing cost, reviewing and improving the SLIs/SLOs, and so on.
Monolith API team: although we’ve been working on microservice migration, we still have the original monolith API, and it’s critical for our product. It was running on-prem, but we are currently migrating it to Google Cloud, where the microservices run. Since we have the knowledge to run services reliably in a cloud-native way, we joined the team and supported the migration. Since the monolith API uses a different monitoring stack, we are also helping it migrate to the same stack as the other microservices so that we can have seamless tracing between them.

Detailed case studies will be covered by SRE members in follow-up posts in this blog series.

Future

In this post, I have mostly covered the current situation. Finally, let me share some thoughts on the future of embedded SRE.

Embedded to the division, instead of the team

Until now, SRE members have mostly been embedded in a single team. This works well, but we sometimes found the scope too narrow, and we think we can take a broader view. We still have too few members compared to the number of product teams, which means spreading SRE practice by rotation takes time. Not only that, when rotating, a drastic change of service domain puts a high cognitive load on the SRE member because they need to onboard from the beginning.

To solve this problem, we are thinking of changing the embedding unit from the team to the division. When we embed in a division whose teams share a similar domain and systems, the onboarding cost will increase, but once onboarded, the SRE member can apply that knowledge across multiple teams (that is to say, the long-term cost will decrease). From the domain level, the SRE member can observe reliability issues, and when they find a problem, they can go down to the team level. We can also share knowledge at the division level, which can be more efficient.

We started experimenting with this with one member and so far it’s going well. We would like to share more detailed results in the future.

The long term direction of embedded SREs

Our mission is to spread SRE practice across the organization so that the product teams can work on their product reliability by themselves. Once we reach this goal and SRE practice becomes the norm for general SWEs, what should the SRE team do? Will we no longer need the team? I can think of three directions.

If we use the terminology from Team Topologies, the embedded SRE team at Mercari can be categorized as an “Enabling team”. Enabling SRE practices in a “Stream-aligned team” (product team) is one of the critical tasks of the team, but, at the same time, enabling the use of the latest platform features is also an important task. If we think about continuous platform evolution in the future, this enabling function will remain essential. The term SRE may no longer be applicable if we go this route, but continuing to be the “Enabling team” for the platform tooling is definitely a possibility.

Another direction I can think of is to specialize in embedding in “Complicated Subsystem teams” (from Team Topologies). Compared with normal services managed by a “Stream-aligned team”, these complicated subsystems normally need to handle large traffic or huge data, or have demands for really high performance or reliability. Even if SRE practice becomes the norm, such services will require the specialized expertise of an SRE.

The final possible direction is to be a part of the Platform team and provide SRE functionality as a service. By transforming SRE knowledge into the tooling or service, we can enable the product team to operate the service more reliably.

Conclusion

In this blog post, I introduced the embedded SRE team at Mercari. If this is something that interests you and you would like to know more about us, feel free to DM me on Twitter (or we can have a casual chat on Google Meet or similar).

Please also check our JDs: