Hello, this is @waiting.lau and I’m a member of the Cross Border (XB) Operations (Ops) Engineering team.
Introduction: Turning the Hidden Half into a First-Class Product
When we build a product for millions of users, we often focus on the customer-facing experience: the slick UI, the smooth checkout flow, and the powerful search. But behind every great product is another critical component: the "hidden half". These are the internal tools that empower our Customer Service (CS) and Trust & Safety (TnS) teams to support users and ensure a secure marketplace.
For the new Mercari Global service, we faced a fundamental question: As we build a new global platform from scratch, how do we treat these essential internal operations as a first-class part of the product itself and not as an afterthought?
This article explores our journey to answer that question, detailing the pragmatic, phased approach we took: leveraging Mercari’s mature Japan assets for a rapid launch, while simultaneously building a new, future-proof foundation for our technology and our teams.
Learning to Decouple, Not Discard
To understand our approach, it helps to know what a complete CS operation entails. A few key components must work together:
- A Help Center for user self-service.
- A Contact Tool (or ticketing system) for agents to manage incoming inquiries.
- An Operation Tool for CS agents to access data and perform actions (like order cancellations).
- An Authorization (Authz) system to control permissions.
Mercari’s Japan business has a mature ecosystem of in-house tools covering all these areas, while the US Marketplace uses a mix of in-house solutions and third-party vendors.
To ensure consistent decision-making across this complex landscape, we followed the "Global Engineering Tenets" established for the entire Global Platform project, which were featured in a previous article.
To honor our tenet to "Learn and unlearn from past experience", we first analyzed this mature ecosystem. The initial thought was to extend all the existing tools in the Japan business. But this led us to a crucial realization, guided by another tenet: to "Keep each country’s business isolated". To achieve the velocity needed for global expansion, we had to decouple from the established JP infrastructure, not to discard its strengths, but to avoid dependencies that could slow down future rollouts.
This analysis led us to our pragmatic, hybrid strategy, which was defined by a clear distinction between what to reuse and what to build.
We chose to reuse the Help Center and Contact Tool because they are mature, modern, and most importantly, already designed with multi-tenant support. They served as stable, high-level interfaces that could be adapted for global use with minimal changes.
In contrast, we decided to build the Operation Tool, the "Global Platform Ops Tool", from scratch. The existing tool for the Japan business, while powerful, is deeply integrated with numerous Japan-specific backend services. This was the key issue: deploying the tool in a new region outside of Japan would have required either migrating a large number of these dependent services or undertaking a massive decoupling effort, both of which were impractical for our timeline.
Building a new, independent tool allowed us to create a clean foundation, free from these dependencies. This gives us the autonomy to develop and deploy features for our global users quickly.

The Starting Point: A Focused Team for a Fast Launch
To execute our initial launch, we made a deliberate decision to follow a conventional model: a new dedicated engineering team responsible for the development of "Global Platform Ops Tool", referred to here as the "Ops Tool Dev Team". This was a strategic choice guided by our primary goal – speed. For a project with a tight timeline, a single, focused team with clear ownership can move much faster than a distributed model that requires extensive coordination.
This approach was a proven method for getting a new product off the ground, just as it was in the early days of the Japan business. We knew this centralized model had long-term scaling limitations, but it was the most effective way to reduce initial complexity and ensure we could deliver the essential features needed for day-one operations.
This initial workflow, while intentionally siloed, was the pragmatic choice to get us started. It was always intended to be phase one – a bridge to a more scalable and collaborative future.

Building the Foundation: The Global Platform Ops Tool Architecture
While our day-one operations relied on existing JP tools, the primary mission of our dedicated team was to build the new technical foundation in the background: the "Global Platform Ops Tool".
A New Home in the Monorepo
A monorepo is a software development strategy where the source code for many different projects is stored in a single repository. Our global platform is built on this model, and for a deeper dive into its core design, we recommend reading the previous article from our architect published earlier in this series.
With this foundation already in place, our first major architectural decision was to build the Global Platform Ops Tool from scratch within the existing monorepo. This was a strategic choice aimed at one primary goal: aggressively reducing developer friction.
To understand our reasoning, let’s first consider the multi-repository alternative. In that model, the frontend application would live in one repository and the backend modules in multiple repositories. An engineer working on a single feature would have to make changes in multiple codebases. This creates a cascade of slowdowns: they must manage separate pull requests, reviewers must track changes across multiple repositories, and simple dependency updates, such as picking up an updated Protobuf client, become a complex task of publishing and consuming packages. This model also creates deployment dependencies, forcing teams to coordinate separate release schedules.
Placing Ops Tool in the global platform monorepo directly solves these problems. By housing both backend and frontend code together, we create a unified developer experience. An engineer can handle everything for a single feature in one codebase and one local development environment, which eliminates context-switching and simplifies dependency management. This also ensures consistent deployment, as we leverage the same modern CI/CD pipeline as the rest of the Global Platform, removing the need to coordinate separate release schedules. Finally, it gives our team full ownership. We can iterate quickly without being a "guest" in another team’s ecosystem, subject to their schedule and tooling choices.
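To make this concrete, here is a simplified, hypothetical view of how the Ops Tool could sit inside the monorepo. This layout is illustrative only; the actual repository structure differs:

global-platform/          # single repository, illustrative layout
├── proto/                # shared Protobuf API contracts
├── backend/              # modular-monolith modules and BFFs
└── web/
    ├── marketplace/      # customer-facing Next.js app
    └── ops-tool/         # Global Platform Ops Tool frontend

Because the API contracts, backend modules, and frontend apps live side by side, a single pull request can update a Protobuf definition and every consumer of it at once.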
This unified monorepo strategy defined where our code would live. The next critical challenge was defining how it would be structured, and we’ll begin with our backend architecture.
Backend Architecture: Extending a Modular Monolith for Operations
Our backend is built on the Modular Monolith architecture, which our architect detailed in a previous post in this series. For a deep dive into the core concepts of our multi-tiered design, we highly recommend reading that article first.
Our challenge wasn’t to invent a new architecture, but to adapt this powerful foundation for the specific needs of internal operations. The core question we had to answer was: "How do we add sensitive, complex operational features without compromising the integrity of the core customer-facing logic?"
Our solution involved two key extensions to this foundation. The first was a dedicated Ops BFF (Backend for Frontend), introduced exclusively for the Ops Tool. This acts as a secure gateway that completely isolates internal traffic from the customer-facing BFF. Its primary job is to handle authentication for our employees and tailor data specifically for our admin UIs.
The second extension was the use of isolated operational endpoints. To keep operational logic separate from customer-facing logic, we often create dedicated "gRPC for Ops" servers within a module. However, this is not a strict rule. Our guiding principle is a clean separation of concerns, applied pragmatically. For modules where operational needs are simple, such as a straightforward data fetch or a logic flow that mirrors the customer-facing one, we reuse the existing customer-facing gRPC server to avoid unnecessary complexity. A separate server is only introduced when the operational logic becomes complex or requires different security considerations.
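To illustrate the idea, here is a minimal sketch of how a module might register a dedicated ops server alongside its customer-facing one. The generated packages (orderpb, opsorderpb) and service names are hypothetical, for illustration only:

// file: backend/order_management/server.go (illustrative)
package ordermanagement

import (
	"google.golang.org/grpc"

	opsorderpb "example.com/proto/ops/order/v1" // hypothetical ops-only API
	orderpb "example.com/proto/order/v1"        // hypothetical customer-facing API
)

// NewGRPCServer registers both services in one module. The customer-facing
// service stays untouched, while operational endpoints live behind their own
// service definition and can carry stricter authorization requirements.
func NewGRPCServer(
	customer orderpb.OrderManagementServiceServer,
	ops opsorderpb.OpsOrderManagementServiceServer,
) *grpc.Server {
	s := grpc.NewServer()
	orderpb.RegisterOrderManagementServiceServer(s, customer)
	opsorderpb.RegisterOpsOrderManagementServiceServer(s, ops)
	return s
}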

A typical workflow, such as a "Cancel Order" operation, illustrates this approach:
- A request from Ops Tool UI is first handled by the Ops BFF.
- The BFF calls the gRPC method served by a dedicated "gRPC for Ops Server" on the Tier 1 Order Management module, which orchestrates specific workflows for order cancellation initiated by CS agents.
- This orchestrator then calls the core Tier 2 or Tier 3 domain modules, like Order and Notification, to handle the actual state changes.
This layered design ensures that operational logic is securely isolated and properly owned.
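In code, the first steps of that workflow might look like the following sketch. All names here (opsorderpb, CancelOrderByAgent, the request fields) are hypothetical and only illustrate the call chain, not our production implementation:

// file: backend/bff/ops/cancel_order.go (illustrative)
package opsbff

import (
	"context"

	opsorderpb "example.com/proto/ops/order/v1" // hypothetical generated client
)

// CancelOrderHandler is the Ops BFF entry point for CS-initiated
// cancellations. Employee authentication happens before this handler runs.
type CancelOrderHandler struct {
	orderMgmt opsorderpb.OpsOrderManagementServiceClient
}

func (h *CancelOrderHandler) CancelOrder(ctx context.Context, orderID, reason string) error {
	// The Tier 1 Order Management module orchestrates the cancellation,
	// calling the Tier 2/3 Order and Notification modules to apply the
	// state changes and notify the buyer and seller.
	_, err := h.orderMgmt.CancelOrderByAgent(ctx, &opsorderpb.CancelOrderByAgentRequest{
		OrderId: orderID,
		Reason:  reason,
	})
	return err
}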
While our rule is to place business logic in its corresponding domain, this doesn’t eliminate the need for a generic module to handle cross-cutting concerns. This can be thought of as a shared toolbox for our operations teams, providing features that don’t belong to a single business domain. Its responsibilities may include:
- Managing bookmarks of flagged users, products, or orders.
- Handling templates for private messages and moderation actions.
- Orchestrating automation for complex internal operations.
We call this the "Ops" module. The key distinction is that it doesn’t own core business logic. The Order module still defines what it means to cancel an order, but the Ops module might provide the automation script that calls the Order module as part of a larger workflow.
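As a sketch of that division of labor, consider a hypothetical bulk-cancellation automation living in the Ops module. Everything here (orderpb, CancelOrder, the request shape) is illustrative; the point is that the Ops module orchestrates while the Order module owns the business rules:

// file: backend/ops/automation.go (illustrative)
package ops

import (
	"context"

	orderpb "example.com/proto/order/v1" // hypothetical generated client
)

// BulkCancelAutomation cancels a batch of flagged orders. It owns the
// workflow, not the business rules: what "cancel" means is still decided
// by the Order module it calls.
type BulkCancelAutomation struct {
	orders orderpb.OrderServiceClient
}

func (a *BulkCancelAutomation) Run(ctx context.Context, orderIDs []string) error {
	for _, id := range orderIDs {
		if _, err := a.orders.CancelOrder(ctx, &orderpb.CancelOrderRequest{OrderId: id}); err != nil {
			return err // stop on first failure; retries are omitted for brevity
		}
	}
	return nil
}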
With our backend’s logical structure defined, the next critical challenge was to secure it with a proper authorization framework.
Secure by Design: A Declarative Authorization Model
Security is a top priority. However, in a large development project like the Global Platform, this creates a significant challenge: implementing security correctly requires a solid understanding of our internal authentication and authorization systems, and expecting every product engineer to get those checks right is error-prone.
Our guiding principle, therefore, was to abstract this complexity. We should provide a paved road that makes it easy for engineers to do the right thing by separating the "what" from the "how". The "what" is the simple, declarative fact of which permissions an endpoint requires, defined right alongside the API contract. The "how" is the complex logic that enforces that check: a standardized process handled by the platform, not by individual engineers writing if/else statements in every function.
To illustrate why this is so important, let’s look at the common alternative: manually checking permissions in every API handler.
// file: service.proto
// The API contract has NO permissions defined.
rpc Greet(GreetRequest) returns (GreetResponse);
// file: service.go
// The security check is hidden in the implementation,
// easy to forget and hard to find.
func (s *myService) Greet(ctx context.Context, req *GreetRequest) (*GreetResponse, error) {
	// This check is manual and disconnected from the .proto.
	// HasPermission is defined in a shared auth package.
	has, err := auth.HasPermission(ctx, "data:user:read")
	if err != nil {
		return nil, status.Error(codes.Internal, "auth failed")
	}
	if !has {
		return nil, status.Error(codes.PermissionDenied, "missing permission")
	}
	// Finally, run the actual business logic...
	// ...
}
This manual approach has three flaws. First, it’s boilerplate: engineers must add the same few lines of code to every single gRPC method handler before the business logic. Second, it’s entirely optional: it relies on every engineer remembering to add the check, and a single omission can lead to a data leak. This problem gets worse with granular, field-level permissions. Finally, the API contract in the .proto file and its security policy in the .go file live in separate locations, which makes the configuration a nightmare to maintain and the system difficult to audit.
Our declarative model solves all three problems. We achieve the "what" with custom Protobuf options. Here is a sample:
// file: proto/framework/v1/authz.proto
syntax = "proto3";

package proto.framework.v1;

import "google/protobuf/descriptor.proto";

message Authorization {
  repeated string allows = 1;
}

// Adds custom method-level options.
extend google.protobuf.MethodOptions {
  optional Authorization authz = 51003;
}
// file: proto/gateway/v1/dummy.proto
syntax = "proto3";

package proto.gateway.v1;

import "proto/framework/v1/authz.proto";

service DummyService {
  ...

  // Greet is the RPC to greet the user.
  rpc Greet(GreetRequest) returns (GreetResponse) {
    // Product engineers must declare this option when adding new endpoints.
    // A lint rule can be set up to catch missing declarations automatically.
    option (proto.framework.v1.authz) = {
      allows: ["data:user:greet"]
    };
  }
}
The (proto.framework.v1.authz) option is automatically enforced by a shared authorization interceptor that runs on every request to a module before it reaches the gRPC method handler. The interceptor reads the required permissions from the proto definition and validates them against the user’s permissions. If the validation fails, the interceptor immediately rejects the request, ensuring that no unauthorized business logic is ever executed.
This design removes the burden and risk of error from the developer, eliminates the boilerplate, and creates a single source of truth. It also makes our services easily auditable: anyone can understand the security posture of an entire service just by reading its API contract.
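For the curious, here is a minimal sketch of what such an interceptor can look like in Go. It is not our production code, just an illustration of the mechanism: it assumes the generated framework package exposes the custom option as frameworkpb.E_Authz and that the caller’s permissions are resolved by a hasPermission callback.

// file: middleware/authz.go (illustrative)
package middleware

import (
	"context"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/reflect/protoregistry"
	"google.golang.org/protobuf/types/descriptorpb"

	frameworkpb "example.com/proto/framework/v1" // hypothetical generated package
)

// AuthzInterceptor enforces the declarative authz option on every unary call.
func AuthzInterceptor(hasPermission func(ctx context.Context, perm string) bool) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		// info.FullMethod is "/proto.gateway.v1.DummyService/Greet"; convert it
		// to the protobuf full name "proto.gateway.v1.DummyService.Greet".
		name := strings.ReplaceAll(strings.TrimPrefix(info.FullMethod, "/"), "/", ".")
		desc, err := protoregistry.GlobalFiles.FindDescriptorByName(protoreflect.FullName(name))
		if err != nil {
			return nil, status.Error(codes.Internal, "method descriptor not found")
		}
		opts := desc.(protoreflect.MethodDescriptor).Options().(*descriptorpb.MethodOptions)
		// Deny by default in this sketch: a method without a declared policy is rejected.
		if !proto.HasExtension(opts, frameworkpb.E_Authz) {
			return nil, status.Error(codes.PermissionDenied, "no authz policy declared")
		}
		authz := proto.GetExtension(opts, frameworkpb.E_Authz).(*frameworkpb.Authorization)
		for _, perm := range authz.GetAllows() {
			if !hasPermission(ctx, perm) {
				return nil, status.Error(codes.PermissionDenied, "missing permission: "+perm)
			}
		}
		// All required permissions are present; run the actual handler.
		return handler(ctx, req)
	}
}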
This platform-level authorization enforcement is enabled by default, as illustrated by the configuration below.
components:
  application:
    http:
      enabled: false # Whether the HTTP server is enabled
      port: 50000 # Listening port
    middleware:
      authorization: # Newly added authorization module
        authz_api:
          enabled: true # Enable internal authorization
          service_endpoint:
            address: "xxxx:10001" # Address of the authz service
            timeout: "1s" # Fail fast (deny access by default)
The key takeaway is that our product engineers are completely abstracted from this complexity. They don’t need to know how the interceptor works or how the permission check is performed. Their only responsibility is to add the correct proto.framework.v1.authz option to their .proto file. The framework takes care of the rest, guaranteeing security is enforced by default.
This secure, modular backend provides the power, but it’s only half the story. All this logic needs to be presented to our CS and TnS agents through an intuitive user interface. That’s where our frontend architecture comes in.
A User-First Frontend: Our Architectural Choices
On the frontend, our philosophy was guided by a single question: How do we make the tool both easy for our engineers to build and intuitive for our agents to use?
To solve the "easy to build" part, we chose a familiar and modern stack by aligning with the company’s "Web Golden Path", a recommended set of frameworks and libraries. The Global Platform Ops Tool is built on Next.js, a React framework for building full-stack web applications, allowing us to leverage the latest features of React, including React Server Components, for a fast and efficient experience.
This is the same modern stack used by our customer-facing Global Platform Web Product, which also utilizes Next.js with the App Router, as detailed in a previous article in this series. This alignment was a critical decision for velocity. It ensures that any web engineer at Mercari can be productive in the Ops Tool codebase with a minimal learning curve, as the core technologies are identical.
However, our most important user isn’t the developer; it’s the CS agent. This led to a crucial, deliberate exception to the Golden Path. When it came to the UI components, we had a choice: use Design System 4.0, our company’s new, modern standard for all customer-facing products, or use the in-house admin component library that our CS agents already know from using other internal Mercari tools.
We chose the latter. This decision prioritized the user experience of our CS agents over pure technical consistency. The rationale was simple: the small cost of a developer adapting to a familiar component library is insignificant compared to the cost of having hundreds of CS agents learn a completely new interface. This pragmatic choice ensured that when our agents switched to the new global system, the tool felt instantly intuitive – even if the backend had been completely rebuilt.

The Next Challenge: Scaling Ownership
While the dedicated team model was perfect for a focused launch, we knew from experience that it didn’t scale organizationally. The core issue isn’t just about workload; it’s about the friction of context. In a siloed model, the feature team must constantly teach the ops team the product specifications, while the ops team must teach the feature team the nuances of the tool’s codebase. This constant, two-way knowledge transfer is what ultimately becomes the bottleneck, slowing everyone down.
Our vision for the next phase was to solve this and to evolve towards a co-ownership model. The principle is simple: the team that builds a client-facing feature also builds its corresponding operational components.

Our rationale here was to eliminate the handoffs and knowledge gaps entirely. By empowering feature teams to own their operational UIs, we are not just distributing work – we are building empathy. When a product engineer sees firsthand how a CS agent interacts with their feature to solve a real user’s problem, the feedback loop becomes immediate, connecting their code directly to the people using the tool. It turns ‘internal tooling’ into an integral and respected part of the product experience.
This future model, where operational development is a shared responsibility, is only made possible by the robust and flexible technical foundation we are building today. The monorepo, the modular architecture, and the declarative security are all designed to create a "paved road" that makes it easy for any engineer to contribute effectively.
Conclusion: A Foundation for Technology and Teamwork
Our journey began by making pragmatic decisions: we leveraged existing assets with a focused, dedicated team to ensure a stable and rapid launch. This gave us the runway to build a modern, scalable technical foundation in the background.
With this foundation now in place, we are looking ahead to evolving our team structures to create a truly holistic and collaborative development culture. We believe this approach, where every engineer has a stake in the operational health of their domain, will ultimately lead to a better, safer, and more supportive experience for our users around the world.
Thanks for reading, and we’re excited to continue sharing our progress on this journey.


