This post is for Day 3 of Mercari Advent Calendar 2024, brought to you by @yakenji from the Mercari Site Reliability Engineering (SRE) team.
At Mercari, our SRE team is dedicated to maintaining and enhancing the reliability of our core product, the Mercari marketplace app, by measuring its availability and latency. We establish Service Level Objectives (SLOs) for these metrics and monitor their adherence, as well as whether availability and latency are degrading due to temporary outages or other issues.
To achieve this, our SLOs are based on Critical User Journeys (CUJs). We recently revamped these SLOs, redefining them as "User Journey SLOs" to achieve the following:
- Clarify the definition of CUJs.
- Establish a one-to-one relationship between each CUJ and its corresponding Service Level Indicator (SLI).
- Automate the maintenance of CUJs and SLOs.
- Visualize the behavior of each CUJ during incidents through dashboards.
This initiative resulted in a 99% reduction in SLO maintenance time and enabled near-zero time triage, meaning we can now start assessing impact within seconds of incident detection.
This article details the rationale behind revising our CUJ-based SLOs and explains each of the four objectives mentioned above, focusing on how we achieved continuous updates using end-to-end (E2E) tests and leveraged them effectively.
Current Challenges
Before delving into the main topic, let’s examine the two types of SLOs used at Mercari and the challenges they presented. This section explains the motivation and goals behind the User Journey SLO initiative.
Microservice SLOs and Their Challenges
At Mercari, our backend architecture utilizes microservices. For example, user data is handled by the User service, and item data by the Item service. Each domain has its own independent microservice (these are simplified examples and may not reflect the actual implementation). Each service is managed by a dedicated team responsible for its development and operation. Each team sets SLOs for their services and is responsible for meeting these objectives. These SLOs also drive monitoring and alerting, enabling development teams to respond to service incidents.
While defining SLOs for individual services is crucial for teams operating and developing independently, relying solely on these microservice SLOs presents challenges. One of the major challenges is the difficulty of evaluating the product’s overall reliability from the user’s perspective.
Microservices handle specific domain functions. For simple scenarios confined to a single domain, like "editing user information," only one service (e.g., the User service) might be involved. In these cases, assessing SLO attainment is straightforward. However, more complex scenarios like "shipping a purchased item" involve multiple services, making it difficult to evaluate the overall reliability of the user journey.
Furthermore, not all APIs within each service are used in each scenario. Development teams may not have a complete understanding of which APIs are used where, as APIs are generally designed for flexibility and reusability. Conversely, frontend developers typically aren’t overly concerned with which service is being accessed.
For these reasons, assessing end-user experience, such as successfully shipping purchased items, becomes difficult using only microservice-specific SLOs. Even if services A, B, and C individually meet their availability targets, the user-perceived availability might be lower. During incident response, an alert from Service A doesn’t necessarily indicate the user impact, hindering prioritization and mitigation efforts.
SRE and SLOs
To address the challenges posed by microservice SLOs, our SRE team monitors our overall marketplace service based on Critical User Journeys (CUJs), independently of the microservice-specific SLOs. CUJs represent the most critical sequences of actions frequently performed by users. However, this approach also presented challenges:
- Unclear Definition: The definition of CUJs and the rationale for selecting associated APIs were undocumented, making it difficult to add or maintain CUJs.
- Multiple SLOs per CUJ: Directly monitoring the SLOs of each related API resulted in multiple SLOs for a single CUJ, hindering accurate assessment of user-perceived reliability.
- Cumbersome Updates: Frequent functional developments and API changes led to high maintenance costs and difficulty in keeping CUJ definitions and their corresponding SLOs up-to-date.
- Opaque Impact of SLO Degradation: When SLOs were not met, the impact on users was unclear, making it difficult to prioritize responses and hindering broader utilization of CUJ-based SLOs across Mercari.
The third challenge, cumbersome updates, in particular resulted in a lack of comprehensive maintenance since the initial implementation around 2021, potentially leading to gaps in monitored APIs. To address these issues and enable effective use of CUJ-based SLOs across Mercari for reliability improvements and incident response, we decided on a complete rebuild.
Overview of the User Journey SLO
To address the first two challenges—unclear CUJ definitions and multiple SLOs per CUJ—I’ll explain how we defined and managed CUJs within our User Journey SLO framework and how we established corresponding Service Level Indicators (SLIs).
Defining Critical User Journeys (CUJs)
For User Journey SLOs, we maintained a similar level of granularity to our previously defined CUJs, encompassing tasks like product listing, purchasing, and searching. We revisited and redefined approximately 40 CUJs, covering both major and minor user flows. To address the unclear definition challenge, we documented each CUJ using screen operation transition diagrams, explicitly outlining the expected screen transitions resulting from user actions. We also defined the available states for each screen. A CUJ is considered available if these states are met and unavailable if not. Generally, if the core functions of a CUJ are available, the CUJ is considered available. Secondary features, such as suggestions, that don’t impact core functionality are not considered in the availability calculation.
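To make this concrete, a CUJ definition of this kind can be modeled as a small data structure: a journey is an ordered sequence of screens, each with the states required for it to count as available. This is an illustrative sketch only; the names, fields, and example states below are hypothetical, not our actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Screen:
    """A screen in the journey and the states required for it to count as available."""
    name: str
    available_states: list[str] = field(default_factory=list)


@dataclass
class CriticalUserJourney:
    """One CUJ: the expected sequence of screen transitions for a user task."""
    name: str
    screens: list[Screen]

    def is_available(self, observed_states: dict[str, set[str]]) -> bool:
        # The CUJ is available only if every screen reached all of its required
        # states; secondary features outside these states are ignored.
        return all(
            set(screen.available_states) <= observed_states.get(screen.name, set())
            for screen in self.screens
        )


# Hypothetical example: a simplified "purchase an item" journey.
purchase = CriticalUserJourney(
    name="purchase_item",
    screens=[
        Screen("item_detail", ["item_rendered", "buy_button_enabled"]),
        Screen("checkout", ["payment_methods_listed"]),
        Screen("purchase_complete", ["confirmation_shown"]),
    ],
)
```

With a definition like this, availability follows directly from the rule in the text: if all required states are observed, the CUJ is available; if any core state is missing, it is not.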
Defining the SLI
To address the multiple SLOs per CUJ challenge, we defined SLIs to establish a one-to-one relationship between each CUJ and its availability and latency metrics. These SLIs are measurable using our existing observability tools. At Mercari, a single customer operation typically involves multiple API calls, as we generally don’t utilize a Backend for Frontend (BFF) architecture.
Ideally, we would directly measure the success of each screen transition within a CUJ. However, we currently lack the infrastructure for such granular measurement. While we considered implementing new mechanisms, the engineering cost of covering approximately 40 CUJs across all clients (iOS, Android, and web) was prohibitive. We also explored leveraging Real User Monitoring (RUM) data from our Application Performance Management (APM) tools, but sampling rates, cost, and feasibility concerns made this approach impractical.
Therefore, we opted to associate the critical APIs called during a CUJ with the CUJ’s SLI. We categorized API calls within a CUJ into two types: (1) those whose failure directly results in CUJ unavailability, and (2) those whose failure does not. To create more accurate and robust SLIs, we focused solely on those in the first category—the critical APIs—for our SLI calculations.
Using metrics from these critical APIs, we uniquely defined the availability and latency SLIs for each CUJ as follows:
- Availability: The CUJ’s success rate is the product of the success rates of its critical APIs. For example, if critical APIs A and B have success rates S_A and S_B, respectively, the CUJ success rate S_CUJ is calculated as:
S_CUJ = S_A × S_B
- Latency: The CUJ’s achievement rate for its latency target is the lowest target achievement rate among its critical APIs. For example, if critical APIs A and B have achievement rates A_A and A_B for their respective latency targets, the CUJ achievement rate A_CUJ is calculated as:
A_CUJ = min(A_A, A_B)
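In code form, these two definitions amount to a product and a minimum over the CUJ's critical APIs. The following is a minimal sketch with illustrative metric values, not our production computation:

```python
import math


def cuj_availability(api_success_rates: list[float]) -> float:
    """Availability SLI: the product of the success rates of the CUJ's critical APIs."""
    return math.prod(api_success_rates)


def cuj_latency_achievement(api_achievement_rates: list[float]) -> float:
    """Latency SLI: the worst latency-target achievement rate among the critical APIs."""
    return min(api_achievement_rates)


# Example: a CUJ with two critical APIs at 99.9% and 99.5% success rates.
availability = cuj_availability([0.999, 0.995])    # ≈ 0.994
latency = cuj_latency_achievement([0.998, 0.990])  # 0.990, the worse of the two
```

Taking the product captures the fact that a user must get through every critical API for the journey to succeed, while taking the minimum makes the slowest critical API the latency bottleneck for the whole CUJ.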
Identifying Critical APIs
To implement the SLI calculations described above, we needed to identify the critical APIs for each CUJ. We considered various methods, including static code analysis, but ultimately chose a hands-on approach using a real application to balance practicality, feasibility, and accuracy. This process involved the following steps:
- Proxy and Record: We placed a proxy between a development build of our iOS app and a development environment. We then executed each CUJ, recording all API calls made during the process.
- Fault Injection and Validation: Using the proxy, we injected faults by forcing specific APIs to return 500 errors. We then re-executed the CUJ to determine whether the failure of each API resulted in the CUJ becoming unavailable according to our defined criteria.
We chose the iOS app for this process because iOS is our most frequently used client.
Communication between our client apps and servers is typically encrypted. Therefore, we selected a proxy capable of inspecting and modifying encrypted traffic. We chose the open-source tool mitmproxy for its interactive web interface and extensibility through add-on development.
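The record-and-inject workflow above can be modeled roughly as follows. This is a simplified pure-Python sketch of the logic; in practice it lives in a mitmproxy add-on's request hook, and the API paths are hypothetical.

```python
from typing import Optional


class CUJProxyModel:
    """Simplified model of the proxy workflow: record every API call made during
    a CUJ run, then force chosen APIs to fail on a re-run of the same CUJ."""

    def __init__(self) -> None:
        self.recorded_calls: list[str] = []  # "Proxy and Record" step
        self.fault_targets: set[str] = set()  # APIs to break in the next run

    def handle_request(self, path: str) -> Optional[int]:
        """Record the call; return a forced status code when injecting a fault,
        or None to let the request pass through to the real backend."""
        self.recorded_calls.append(path)
        if path in self.fault_targets:
            return 500  # "Fault Injection" step: simulate the API failing
        return None
```

A first pass through the CUJ populates `recorded_calls`; subsequent passes with one path at a time added to `fault_targets` reveal whether that API's failure makes the CUJ unavailable under the defined criteria.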
The User Journey SLO framework, established with the approach described above, enables us to detect incidents affecting specific CUJs, allowing for immediate identification of the impact scope and faster prioritization of incident response efforts.
Continuous Updates and Visualization Using E2E Tests
Next, to address the third challenge—cumbersome updates—I’ll explain how we maintain critical API information using iOS end-to-end (E2E) tests. I’ll also describe our dashboard visualization approach, which resolves the fourth challenge—opaque impact of SLO degradation.
The Need for Automation
The Mercari client app undergoes multiple releases each month. Additionally, trunk-based development and feature flags allow us to release new features without requiring app store updates. Tracking all these changes manually is impractical for the SRE team, and manually investigating frequent changes to critical APIs is also infeasible. Undetected changes could lead to monitoring gaps or unnecessary monitoring of deprecated APIs. Therefore, automating the update process for critical APIs is essential to keep up with the pace of application changes.
Automating with iOS E2E Tests
We leveraged our existing iOS app E2E test suite, built using the XCTest framework, to automate the extraction of critical APIs.
Specifically, we implemented each CUJ as an XCTest test case, executable on simulators. Each test case includes assertions to verify the availability of the CUJ according to our defined criteria. This setup automatically distinguishes between available and unavailable CUJs. Furthermore, the test cases are version-controlled alongside the app’s source code.
We developed a mitmproxy add-on to retrieve the list of APIs called during each test and to inject failures into specific APIs. This add-on provides an API to control the proxy, allowing us to manage it directly from our test cases and scripts.
We automated the critical API identification process by scripting the execution of these XCTest tests and controlling the proxy through the add-on. The results, including whether each called API is critical to the CUJ, are logged to BigQuery. Screenshots of the app’s behavior during fault injection are stored in Google Cloud Storage (GCS).
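The identification loop that this automation performs for each CUJ can be sketched as follows. This is a simplified pure-Python version: the real pipeline drives XCTest on a simulator and controls mitmproxy through the add-on, so the two callables here are stand-ins for those steps, and the paths are hypothetical.

```python
from typing import Callable


def identify_critical_apis(
    record_run: Callable[[], list[str]],
    run_with_fault: Callable[[str], bool],
) -> dict[str, bool]:
    """Identify a CUJ's critical APIs.

    record_run executes the CUJ test once and returns the API paths observed
    through the proxy; run_with_fault re-runs the test with one API forced to
    return 500 and reports whether the CUJ assertions still passed.
    An API is critical when its failure makes the CUJ unavailable.
    """
    results: dict[str, bool] = {}
    for api in record_run():
        cuj_still_available = run_with_fault(api)
        results[api] = not cuj_still_available  # critical if the CUJ broke
    return results
```

Records shaped like this per-API criticality map, tagged with a run ID, are what the pipeline writes to BigQuery alongside the GCS screenshots.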
Test results logged in BigQuery are identified by unique IDs, allowing for efficient comparison with previous test runs. We also use Terraform modules, specifically designed for User Journey SLOs, to define and manage SLOs, monitors, and dashboards in our APM system. This allows us to seamlessly integrate changes and easily add new CUJs.
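The ID-based comparison between runs boils down to a per-CUJ set difference over critical APIs. A minimal sketch, assuming the two runs have already been fetched from BigQuery into plain dicts (the CUJ and API names are hypothetical):

```python
def diff_critical_apis(
    previous: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Compare critical-API sets per CUJ between two test runs, returning the
    added and removed APIs so SLO and monitor definitions can be updated."""
    changes: dict[str, dict[str, set[str]]] = {}
    for cuj in previous.keys() | current.keys():
        prev, curr = previous.get(cuj, set()), current.get(cuj, set())
        if prev != curr:
            changes[cuj] = {"added": curr - prev, "removed": prev - curr}
    return changes
```

A non-empty diff is the signal that monitoring definitions have drifted from the app's behavior, e.g. when an API is replaced by a newer version between releases.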
This automation provides several key benefits:
- Reduced Maintenance: The process is almost entirely automated, aside from code maintenance for the tests themselves.
- Version Control: Both the test cases and the app code are version-controlled in the same repository, ensuring consistency.
- Efficient Integration: ID-based management of test results facilitates seamless integration with our APM system.
Ultimately, we created approximately 60 test cases covering around 40 CUJs. This automation drastically reduced the manual effort required, achieving a 99% reduction in maintenance time compared to manual SLO management.
Dashboard Visualization
A key goal of the User Journey SLO framework is to empower teams beyond SRE, such as incident response and customer support, with actionable insights. To achieve this, we needed to present up-to-date information about critical APIs and CUJ behavior during outages in an easily accessible format. We used Looker Studio to visualize this data, providing dashboards that display the list of API calls for each CUJ and screenshots of the app’s behavior during API failures.
Current Status and Future Directions
Through the initiatives described above, we successfully implemented the following for our User Journey SLOs:
- Clarifying the definition of CUJs
- Establishing a one-to-one relationship between each CUJ and its corresponding Service Level Indicator (SLI)
- Automating the maintenance of CUJs and SLOs
- Visualizing the behavior of each CUJ during incidents through dashboards
We currently operate SLOs for approximately 40 CUJs, utilizing around 60 test cases. While the new SLOs are still in trial use within the SRE team, they have already significantly improved:
- Incident detection speed and accuracy
- Accuracy of impact assessment
- Speed of root cause identification
- Overall quality visibility
Quantitatively, we’ve observed the following improvements:
- Immediate impact assessment: Achieved near-zero time triage, meaning we can now start assessing impact within seconds of an incident being detected
- Reduced maintenance overhead: Achieved a 99% reduction in SLO maintenance time.
Building on these positive results, we plan to expand the use of User Journey SLOs beyond the SRE team, focusing on:
- Integrating SLOs into our internal incident management criteria
- Leveraging User Journey SLOs to improve customer support responses
Conclusion
This article explored how Mercari implements and operates User Journey SLOs based on CUJs, detailing the specifics of our SLI/SLO definitions and our automated maintenance process using iOS end-to-end testing. We hope this provides valuable insights into managing SLIs and SLOs for complex systems.
Tomorrow’s article will be by ….rina…. . Look forward to it!