I’m @deeeeeeeet, from the Platform Team.
We started the Platform Team two years ago, when Mercari first began its microservice migration. The team is responsible for providing DevOps toolsets for creating the infrastructure necessary to operate microservices, as well as for developing and operating those microservices. Although there were only 2–3 people on the team including myself in the beginning, that number grew to more than 10 people in the first 2 years.
As the number of members continued to grow, we reached a limit on what we could do as a single team. The scope of the team’s responsibility had also continued to expand as microservices became more and more heavily utilized within Mercari. This led to even greater cognitive load on our members. In order to solve these issues, we reorganized the Platform Team, splitting it into several sub-teams with different specializations.
In this article, I want to talk about team design, the mechanisms I introduced to maintain each team’s independence while aligning its members, and my current assessment of these reorganization measures (as well as how we conduct that assessment).
We solved the following issues in this reorganization:
- The increasing cognitive load on the team
- Misalignment between teams regarding developer productivity
As discussed in the book Team Topologies, an organization’s platform team often suffers from high cognitive load because of the wide range of responsibilities it takes on. Mercari’s Platform Team is no different. The team is no longer responsible simply for operating the system’s primary infrastructure, Kubernetes and Istio; it also provides the toolsets that allow the service teams to develop, test, deploy, and operate the services themselves. For example, it provides Spinnaker for continuous delivery and a Terraform module that allows developers to bootstrap microservice infrastructure. Its scope is very wide and it is involved with a variety of components, which leads to a considerably heavy cognitive load.
A good example of this was the team offsite we held at the end of last year. We asked every team member to raise the issues they saw; the results are shown below.
They named more issues than a single team should usually face, and we were unable to move forward to the discussion phase.
High cognitive load can impact a team in a number of ways. First, it makes it difficult for a team to prioritize. Mistakes in prioritization can require members to pivot to new tasks each quarter, and that context switching creates even more cognitive load for the team. Furthermore, even if there are a lot of things that each team member wants to improve, those improvements aren’t prioritized, and members don’t get the chance to work on them. This can impact motivation. It also forces members’ involvement with each component to be much shallower. The team needs to cover such a wide scope that it becomes difficult to deeply understand each component and continuously work to improve it. A number of components were created and then left untouched. This inability to go in-depth can also impact members’ mastery of skills.
The Platform Team’s main mission is to improve developer productivity for the service teams that develop our microservices. However, this is not the sole responsibility of the Platform Team; some of the Merpay SRE Team’s members are also working on developer productivity, along with many Merpay architects.
Both Mercari and Merpay’s microservices run on the infrastructure provided by the Platform Team. The Platform Team itself belongs to the Mercari side, however, while the Merpay SRE members and architects belong to Merpay. It may seem like only a slight difference, but this has caused its own problems.
First of all, the two sides have their desks on different floors. This drives up the cost of communication and decreases the frequency of interaction. Additionally, different teams prioritize tasks differently. Although the Platform Team did not intend to prioritize tasks for the Mercari side per se, it certainly looked that way from the outside. This would result in misalignment, with the Platform Team blocking Merpay SRE tasks from going ahead, the two sides heading in different directions, and multiple members working to solve the same problem in different ways.
We reorganized the Platform Team to solve these problems. Specifically, this amounted to two major changes:
- Breaking the Platform Team down into several sub-teams with different specialties
- Assigning members from both Mercari and Merpay to the Platform Team, to create a team incorporating both companies’ perspectives
Writing our design doc
Since this kind of change to teams and organizations involves HR-related information, I think it is normally not discussed openly. However, I had long felt that we could simply remove the personal information involved so that the team could discuss the design and arrive at one that incorporates everyone’s opinions to some extent. I felt that we should obtain a wide range of opinions on whether the design is appropriate and whether it clearly explains what problem it solves, how it solves that problem, and how we can assess whether the problem is fixed. Since the changes would closely involve organizational structure and architecture in particular, I believed that we needed opinions from both the members and the architects actually working on development.
We have a culture at Mercari and Merpay in which members write a design doc whenever they create a new microservice or make a big change to an existing component. We decided to do the same thing with this reorganization: write a design doc and finalize its contents with feedback obtained broadly from other members. (Incidentally, this article uses text largely lifted directly from the design doc.)
Even a small organizational change can have a big impact on team productivity. It can represent considerable cost to the organization, since it takes time for the reorganized team to gel and be able to proactively innovate again. That’s why it is important to create a team with long-term stability. In designing those organizational changes, we focused on making the design something to use over the next 1–3 years rather than a full transformation over only 6 months.
We split the Platform Team according to the following three strategies:
- A team for each stage of the microservice development life cycle
- A platform for platforms
- Teams that ensure accessibility for both the Platform Team and the development teams
As I just described, the Platform Team supports the entire software development life cycle (SDLC), from microservices development to operation. Specifically, the team covers four main processes: building, testing, deploying, and operating. Improving each of these processes requires different knowledge for each specialization, so we decided to set up sub-teams covering each phase.
For example, we set up the Runtime Team to support the “build” phase of the life cycle. Currently, to support this phase, the Platform Team is working to provide a template for use when starting to write a microservice in Go. However, that’s all they have been able to offer thus far. We hope that in the future, this Runtime Team will provide frameworks for both Go and other languages, as well as RPC frameworks linked to protocol buffers.
Then there’s the Delivery Team, which we established to support the “deploy” phase. The Delivery Team works on setting up the delivery infrastructure, focused on Spinnaker. They are responsible for setting up the mechanisms for post-production testing, as described in Testing in Production, the safe way.
A platform for platforms
From the perspective of the dev teams, the Platform Team appears to be the team that provides the shared infrastructure necessary for microservice development and operation. You could also say that the Platform Team provides solutions to the cross-cutting concerns shared by all service development teams. It occurred to us that this same compartmentalized structure could be applied to the Platform Team itself, like a fractal. The sub-teams under the Platform Team, established for each phase of the SDLC, have their own cross-cutting concerns. For example, every team needs cloud resource provisioning. We decided to set up the Cloud Infra Team and Network Team to provide this “platform for platforms.”
Even if we have multiple sub-teams, we believe they should still be viewed as a single platform, since their primary stakeholders are all dev teams, after all. It’s extremely important that even if each team manages its own components, these components are appropriately integrated so that UI/UX is consistent throughout the entire SDLC. I think cloud providers like GCP and AWS are a good example of this. Although these providers are composed of so many different teams and components that our platform teams can scarcely compare, from the user’s perspective, the API, documentation, support systems, and other components are wholly consistent.
We designed the Developer Experience (DX) Team to solve this issue. The DX Team manages shared documentation and support systems between the teams while continuously monitoring the API and tools provided by these teams. This way, they can check that there are no major discrepancies in UX between them. The DX Team is also expected to function as a point of contact with the dev teams.
In the long term, we would ideally like to split the Platform Team into the seven sub-teams I’ve outlined above. But given the number of people we currently have, suddenly splitting Platform into seven sub-teams is unrealistic. As part of an effort to migrate to this new team structure in phases, we decided to create teams that bundle several of the most closely related functions. For example, we’ve set up the CI/CD Team, which merges the testing-focused SETI Team with the deployment-focused Delivery Team.
Many have said that the ideal number of people for a single team is 5–8. This is based on communication cost and on-call systems. (See How Twilio scaled its engineering structure, How to build a startup engineering team, and Dunbar’s number.) If organizational expansion and hiring put our sub-teams over these numbers, we plan to gradually break them down into the seven teams from the original design. Since the sub-teams as they stand are composed of related functions, we feel that breaking them down will incur no significant costs.
In order to solve the misalignment between teams, any additional members assigned to the new teams at this time will be selected from both Merpay and Mercari. These new teams will be “virtual” teams that bridge the gap between our two organizations. We decided to give individuals assigned to these new teams the option of continuing their original assignment and reporting line. For example, some members in the CI/CD Team are still part of Merpay SRE and still use that reporting line.
Alongside designing this reorganization plan, we also tackled working style design to ensure that even if the Platform Team were split into multiple sub-teams, each sub-team would be able to work in alignment as part of the same platform. We referenced Spotify’s “squad” model in designing our working style, focusing on the principle of “be autonomous, but don’t suboptimize.”
The diagram below shows the Platform Team’s release structure up until now.
We have a team-wide mission and long-term roadmap based on that mission, with releases decided based on that roadmap. We looked at Basecamp’s 6-week release cycle for how to decide and push forward releases. See How We Structure Our Work at Mercari Microservices Platform Team for details.
The following diagram shows our release structure under the new team system.
Fundamentally, this is not a big change from the previous structure. Each team decides its own mission, roadmap, and releases. What’s new are the “2-year goal” and “bets” we introduced. These were introduced to bring everyone across Platform into alignment and prevent the sub-teams from going off in different directions.
The “2-year goal” is a big, ambitious goal that the team wants Platform to achieve over the next two years. One example is creating an “Ephemeral Cluster,” as described in Kubernetes Cluster Migration on the Mercari Microservices Platform. This 2-year goal also focuses on long-term business goals and assignments for both Mercari and Merpay. “Bets,” on the other hand, are collaborative projects between multiple teams that aim to achieve the 2-year goal.
Each team decides its roadmap based on the 2-year goal. When deciding releases, they refer to the “bets.” If one of them relates to the team’s work, they put it under release targets. If not, they select one of the other projects from the roadmap that they had considered working on. These two mechanisms aim to ensure a balance between independence and alignment for the teams.
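The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not tooling the team actually uses: the `Bet` class, the `pick_release_target` function, the bet name, the roadmap items, and the assignment of teams to the bet are all hypothetical, and picking the first related bet is an assumption made to keep the sketch small.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Bet:
    """A cross-team project aimed at the 2-year goal (hypothetical model)."""
    name: str
    teams: set[str] = field(default_factory=set)


def pick_release_target(team: str, bets: list[Bet], roadmap: list[str]) -> str | None:
    """Pick one release target for a team for the upcoming release."""
    # A bet that relates to this team's work goes under release targets first.
    for bet in bets:
        if team in bet.teams:
            return bet.name
    # Otherwise fall back to the team's own roadmap, in priority order.
    return roadmap[0] if roadmap else None


# Hypothetical data: one bet shared by two sub-teams.
bets = [Bet("cluster-migration-bet", {"cloud-infra", "network"})]

print(pick_release_target("network", bets, ["service-mesh-upgrade"]))
print(pick_release_target("ci-cd", bets, ["pipeline-templates"]))
```

With this data, the network team picks up the shared bet, while the CI/CD team, which is not part of any bet, continues with the first item on its own roadmap.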
Efforts to address silos
(Added July 20, 2020)
We received several comments citing concerns about silos, so I’d like to introduce what we’re doing to address them.
“Silo” refers to teams becoming isolated, to the point where they no longer collaborate or communicate. This prevents the dev process from working effectively, and it is one of the most critical issues to address in organization design. There is a trade-off between the risk of silos and splitting off sub-teams as the organization expands. Choosing not to break the teams down for fear of silos will cause the problems I’ve already described above. That’s why it is so important to consider how we will split the teams. (This is also why we spent most of our time on this reorganization plan looking at design.)
What we should avoid most of all is splitting the teams exclusively according to the SDLC: in other words, breaking them down into a dev team that only handles development, a QA team that only handles testing, and an infrastructure team that only handles operation. The biggest issue with this kind of silo is the risk of lowering release speed. QA would be asked to start its task only when development finished, the infrastructure team would be asked to start only once QA completed, and so on. Since this handoff would need to happen every time, releases would ultimately take more time. Another issue is that silos can impede the feedback loop for each phase. For example, even if a software bug causes an error in the production environment, the dev team only receives feedback indirectly, after the infrastructure team has resolved the issue. The bigger the disconnect between the two teams, the less likely it becomes that the lessons learned from this failure will be utilized in the next development project. This is why it’s said that microservice organizations should be composed not just of functional teams but of cross-functional teams that hold responsibility end-to-end.
This might be confusing, since it appears similar to the design we adopted this time, but we split the Platform Team into smaller teams based on the SDLC in order to limit the scope of each team’s responsibility to the phase it supports. For microservices, the dev teams take responsibility for every phase of service development. The Platform Team, meanwhile, provides the toolsets dev teams need to fulfill that responsibility. For example, the SETI Team is responsible for the testing phase of the SDLC, but they aren’t responsible for testing all microservices. Instead, the SETI Team provides the tools and environment enabling microservice developers to test the microservices themselves. Similarly, that means the SETI Team’s SDLC begins and ends with the SETI Team.
Regardless, it’s a big problem when each team doesn’t fully understand what the others are doing. To fix this problem, we decided to hold a Platform All Hands around the time of each release (every six weeks). It’s an opportunity for the TLs from every Platform team to introduce what they’ve done for the release (sharing based on the release notes for developers), for members to conduct demos of new features, and for specialists to hold tech talks regarding any new technologies employed leading up to the release, if applicable.
It creates an opportunity for communication, where members can learn about the current situation facing each separate team and bounce questions off one another. In the long term, members may transfer between teams as well, so this knowledge sharing should make those transfers easier.
We’ve actually been working to implement this new team structure since April. Although the design is finished, we must continue to evaluate whether it’s working properly and make adjustments as necessary. As a means of assessing its efficacy, we have been surveying members from each team. The survey results can be found below.
Let’s look at the overall evaluation first. We asked members to evaluate the entirety of the organizational changes.
We saw that all members have a positive opinion of the organizational changes.
Next, we have the detailed evaluation. We asked members to score each item on a scale of 1–5 (with 5 being the highest).
The questions were as follows:
- Release: Are you able to release features with minimum adjustments and reliance on other teams?
- Focus: Are you better able to focus on your specific area?
- Mission: Is your team’s mission clear? Has everyone internalized that mission?
- Roadmap: Is your team’s roadmap clear?
- Teamwork: Are you able to work as a team? Do you help each other out?
- Influencing work: Are you able to participate in release discussions?
- Preference: Are you able to do what you want?
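A survey like this can be summarized by averaging each item across respondents and flagging the items that score low. The following Python sketch shows one way to do that. The `summarize` function and the 3.5 threshold are assumptions made for illustration, and the two sample responses are invented; they are not the actual survey data.

```python
from statistics import mean

# The seven survey items listed above.
ITEMS = [
    "Release", "Focus", "Mission", "Roadmap",
    "Teamwork", "Influencing work", "Preference",
]


def summarize(responses, threshold=3.5):
    """Average each item and list the items whose average is below threshold.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    for response in responses:
        for item in ITEMS:
            # Scores are given on a scale of 1-5.
            if not 1 <= response[item] <= 5:
                raise ValueError(f"{item} score must be between 1 and 5")
    averages = {item: mean(r[item] for r in responses) for item in ITEMS}
    needs_attention = [item for item in ITEMS if averages[item] < threshold]
    return averages, needs_attention


# Invented sample responses, for illustration only.
responses = [
    {"Release": 4, "Focus": 5, "Mission": 3, "Roadmap": 2,
     "Teamwork": 4, "Influencing work": 4, "Preference": 4},
    {"Release": 5, "Focus": 4, "Mission": 3, "Roadmap": 3,
     "Teamwork": 5, "Influencing work": 4, "Preference": 3},
]

averages, needs_attention = summarize(responses)
print(needs_attention)  # ['Mission', 'Roadmap']
```

Running the same summary on each survey round makes it easy to compare scores per item over time.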
As I said, the problem we most wanted to solve through this reorganization was cognitive load. Looking at “Focus,” many members said that they can concentrate better on their respective focus areas. We also see that with “Release,” each team is able to work more independently. I think we can conclude that this reorganization was successful in solving those problems we wanted to fix.
We see that issues still remain with “Mission” and “Roadmap,” however. We asked these questions knowing full well that neither the missions nor the roadmaps were complete yet. I think that each team’s tech lead is probably preparing their team’s roadmap right now, so I expect we will see these scores improve in the next survey.
I feel that overall, this was a good score for the first survey. (We had intended to abandon the reorganization and roll back if the scores had been bad.) We plan to continue conducting this survey regularly, and to adjust the scope of each team and make other changes based on the results.
The next challenge will be collaboration between the Mercari SRE Team and the Platform Teams. Our current outlook assumes two directions. The first is that SRE members will join dev teams developing critical microservices as embedded SREs and work to improve service reliability there, as shown in the diagram below. The second is that the SRE members responsible for the network and more infrastructure-like components will join the Platform Team and work to improve microservices and other infrastructure. Both of these will be carried out in phases.
In this article, I shared how the Platform Team has expanded, from its design to evaluation. I hope this information will prove useful to others who are thinking about making the same kind of organizational changes.
This kind of organizational reform is frankly outside my specialization, so I needed to do a lot of research. These are some of the sources that I found most helpful. I especially learned a lot from Will Larson.
- Team Topologies: Organizing Business and Technology Teams for Fast Flow
- Competing with Unicorns: How the World’s Best Companies Ship Software and Work Differently
- An Elegant Puzzle: Systems of Engineering Management
- Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations
- Scaling Agile @ Spotify
- Spotify Rhythm
- Mistakes and Discoveries While Cultivating Ownership
- Shape Up: Stop Running in Circles and Ship Work that Matters
- How Twilio scaled its engineering structure
- How to build a startup engineering team
- The human scalability of “DevOps”
- SREcon19 Europe/Middle East/Africa: How Stripe Invests in Technical Infrastructure
- Cloud Next 19: Optimizing SRE Effectiveness at The New York Times
- QCon SF 2018: Service Ownership @Slack
- Breaking Hierarchy – How Spotify Enables Engineer Decision Making
- Building A Platform for Internal Developers
- Practices as a platform engineer (2020)