2025/12/16

The Cost of Speed: A Battle against Cost, Debt, and Diverging Systems

Author: ShindeSnehal

This post is for Day 16 of the Mercari Advent Calendar 2025.

Introduction

Hello, my name is Sneha. I am a Director in Product Engineering, managing the Ads and Shops product engineering teams. I want to share a personal journey—not just of systems and code, but of a “perfect squad” known as the Shops Enabling Team.

What follows is a journey of resilience.

Three engineers.
Two incompatible systems.
One year to fix spiraling costs.

It is the story of the Enabling Team and its massive challenge: merging two heterogeneous systems to reduce operating costs and stabilize the Mercari Shops systems. The work continues, but the most challenging part is behind us.

The Origin: Going Bold and Drifting Apart

It all started with a notification that is very common within engineering organizations in any company: “We don’t have enough people to maintain the current systems, and our systems are becoming ‘too expensive’ to run”.

To understand why, we have to look back a bit. When Mercari decided to “Go Bold” and invest in growing our B2C business, we launched Mercari Shops (aka Souzoh Inc.). The directive was clear: Validate the new business hypothesis fast.

To unlock this velocity, we made the strategic choice to break away from our core foundation and platform services. We chose a stack optimized for speed and also improved the system design by learning where our existing architecture failed to deliver. We bet on Cloud Run (Serverless) to keep ops overhead near zero and used Bazel to tame our monorepo of 80+ microservices. With gRPC for backend traffic and Next.js on the frontend, we built a system optimized purely for speed, allowing us to focus on product features rather than platform maintenance.

It worked. We shipped fast, operated as a single small unit, and the business numbers climbed.

Then the product direction shifted, marking a new chapter in our journey!
We wanted to provide a seamless, unified experience, effectively erasing the boundaries between Business (B) and Consumer (C) sellers for the users.

This shift confirmed my core philosophy: “a system is a living ecosystem”. If the business evolves, the architecture must evolve with it.

Engineering found itself in the middle of a massive reconciliation problem. We were maintaining two heterogeneous systems that were similar in many respects. We built “bridges”—glue code—to force them to work together. As the years went by, the system grew bulky. Latency spiked, customer UX deteriorated, and complexity soared. And finally, the Cost Per Transaction (CPT) hit a breaking point.

The Breaking Point: The “Shops Enabling Team”

We needed to reduce costs and complexity, but we were stuck. Internal discussions revealed that a standard “fix” would require a major refactor, which would stop feature development work for almost 2 years. For a growth business, that was impossible!

For over a year, we debated. The debate centered on a crucial question: “Should we align Mercari Shops systems with our core services environment?” While the answer seemed to be ‘yes,’ the execution required such immense effort that we struggled to commit to a unified vision. We needed a strategy that would unblock business growth while handling years of accumulated debt.

The breakthrough came in July 2024. We moved from meetings, offsites, and discussions to focused execution by establishing the “Shops Enabling Team”.

The team’s goal was simple yet critical: “dismantle the obstacles holding us back, one by one”.

It was small—just three engineers—yet they formed the essential bridge spanning across various engineering organizations within Mercari.

The Operational Blueprint: Strategy, Synergy, and Speed

The formation of this team marked a cultural shift. To succeed, we had to change how we operated fundamentally:

Strategic Architecture: The Principal Architect in the team devised a strategy rooted in reality, not theory. We accepted that the ‘perfect world’ solution is a myth; real progress happens in iterations. This approach helped us avoid numerous discussions that weren’t addressing the problem.
Embedded Synergy: We embedded engineers from different platform domains into the team, cutting through the organizational ‘telephone game’ to align priorities instantly.
Strategic Rituals: We changed our standups. They were no longer about the usual one-line status updates (“I did X yesterday”). Instead, they became strategic war rooms where the team discussed how to solve the day’s blockers and architectural hurdles. As an EM, I found these the most insightful and productive meetings of my day. I learned a lot!
The Feedback Loop: The Shops Enabling Team sat right between the Platform Engineering and Product Engineering orgs. We created a continuous feedback loop to identify the strengths and weaknesses of each side. It didn’t just help Mercari Shops systems; the feedback we collected fueled improvements back into the core Platform, benefiting the entire company.

The Audit: The Low-Hanging Fruit

We started looking at our GCP bills. A deep dive into our GCP components revealed the usual suspects: duplicate data pipelines running in parallel, unoptimized services burning unnecessary CPU cycles, and so on.

We fixed these quickly, feeling a momentary sense of victory as the Cost Per Transaction (CPT) dropped by 20%. But the celebration was short-lived.

The data made one thing clear: we had exhausted the easy fixes. To reach our goals, we would have to stop avoiding the ‘tricky and hard bits’—the messy, complicated architectural debt that we had been too afraid to refactor.

The First Tricky Bit: Convergence through Unification

Our users don’t distinguish between ‘B’ and ‘C’ items on their screen, so why should our backend?
Recognizing that the C systems already offered a mature, feature-rich Search & Recommendation engine, we initiated a strategic merger of our Search and Recommendation systems rather than reinventing the wheel.

We decommissioned the entire Mercari Shops-specific search and recommendation infrastructure, including shutting down costly Vertex AI and Elastic Cloud instances.
We adapted the common search components to support the logic for B items within a unified “B & C Search and Recommendation” framework.
This consolidation enabled new search features to launch simultaneously across both Mercari and Mercari Shops.
It was a win-win situation in product and engineering!!

It wasn’t an easy win. We underestimated the depth of the cleanup required, discovering layers of technical debt that needed to be tackled before we could move forward.

[Note: For a deep dive into the code-level challenges and how we solved them, check out this article by one of the Enabling Team engineers.]

The architectural cleanup delivered immediate cost efficiencies, cutting the Cost Per Transaction (CPT) by a further 12.5%, for a cumulative reduction of 30%.

Our dramatic drop in system costs didn’t go unnoticed. It triggered a spotlight moment: our internal FinOps team reached out, not to audit us, but to collaborate. Their recommendations unlocked a further 9.5% improvement, culminating in a total Cost Per Transaction (CPT) reduction of 36.7%.

But the true victory wasn’t just the number. It was the decoupling of cost from growth. Even as the Shops business surged (an increase in the number of transactions), our costs remained flat. The ‘fixes’ held firm, proving we had finally broken the cycle of linear cost scaling.
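The cumulative figures quoted above compound rather than add: each percentage applies to the CPT that remains after the previous step. Here is a minimal sketch of that arithmetic, using only the percentages quoted so far in this post (the function name `cumulative_reduction` is ours, for illustration):

```python
def cumulative_reduction(step_reductions):
    """Combine successive percentage reductions into one overall reduction.

    Each step shrinks whatever cost is left after the previous steps,
    so the remaining fractions multiply.
    """
    remaining = 1.0
    for pct in step_reductions:
        remaining *= 1.0 - pct / 100.0
    return (1.0 - remaining) * 100.0


# Audit fixes (20%) followed by the search unification (12.5%).
after_search = cumulative_reduction([20.0, 12.5])
# Adding the FinOps recommendations (9.5%) on top.
after_finops = cumulative_reduction([20.0, 12.5, 9.5])

print(f"{after_search:.1f}%")  # 30.0%
print(f"{after_finops:.2f}%")  # 36.65%
```

Simply adding the three percentages would give 42%, which overstates the saving. Multiplying the remaining fractions gives the 30% figure and the roughly 36.7% figure.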

The Second Tricky Bit: Moving to GKE without breaking DX

Then we turned our attention to the infrastructure layer, i.e., the serverless architecture we had chosen for Mercari Shops. We were not able to scale it effectively as the business grew, so we needed to move from Cloud Run onto our unified GKE cluster. The dilemma was how to scale the systems for exponential growth without hitting the brakes on feature development.

This migration required us to protect Developer Experience (DX) while changing the infrastructure layer. To understand the current developer experience in depth, the team interviewed members of the feature teams and aligned with them on what was essential and what we could change.

We kept the monorepo and toolchain (Go/TypeScript/React) intact. We only shifted the operational “under the hood” components: specifically, we moved logging from GCL to Datadog and deployment to WarpSpeed CD (an internal CI/CD tool). This minimized disruption for engineers accustomed to the existing workflow.

Instead of separate Kubernetes kits for each service, we built a single starter kit (config) for all Mercari Shops services. It gave us custom networking controls to bridge the old Cloud Run and new GKE environments. To prepare for worst-case scenarios, we needed a flexible traffic flow, so our principal architect designed an architecture that allowed requests to flow back and forth between the Cloud Run and GKE environments. This avoided a “Big Bang” cutover and provided an immediate, safe rollback mechanism should any unforeseen issue arise.
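One way to picture the gradual cutover is as a weighted split between the two environments. The sketch below is purely illustrative and is not the team's actual networking layer: `TrafficSplitter`, the backend labels, and the hash-based bucketing are all assumptions. It shows how a single weight could move traffic to GKE step by step and be set back to zero for an immediate rollback.

```python
import hashlib


class TrafficSplitter:
    """Hypothetical router that sends a percentage of requests to GKE."""

    def __init__(self, gke_percent=0):
        self.set_gke_percent(gke_percent)

    def set_gke_percent(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.gke_percent = percent

    def rollback(self):
        # Sending everything back to Cloud Run is a one-line change.
        self.set_gke_percent(0)

    def backend_for(self, request_id):
        # Hash the request ID into a stable bucket in [0, 100) so the same
        # request always lands on the same side for a given weight.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return "gke" if bucket < self.gke_percent else "cloud-run"


splitter = TrafficSplitter(gke_percent=10)
requests = [f"req-{i}" for i in range(1000)]
on_gke = sum(splitter.backend_for(r) == "gke" for r in requests)
print(f"{on_gke} of {len(requests)} requests routed to GKE")

splitter.rollback()
assert all(splitter.backend_for(r) == "cloud-run" for r in requests)
```

Because the bucket depends only on the request ID, raising the weight only ever moves additional requests to GKE. Requests already routed there stay there, which keeps a staged rollout predictable.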

Behind the Scenes: Managing the Migration Chaos

The Reality of Migration – It wasn’t a smooth ride. We faced incidents, rollbacks, hidden traps, and dependencies. As we went deeper into the various service layers, we discovered inefficiencies and anti-patterns buried deep in the legacy code. We also had to familiarize feature engineers with the current infra layer. But the challenge wasn’t just technical; we were fixing context as much as code. Maintaining feature velocity required proactive knowledge transfer. We organized targeted enablement sessions to address specific blockers that feature engineering teams faced.

The Shops Enabling team was effectively triaging the flood of notifications from feature teams. This hands-on support model allowed us to onboard teams quickly and resolve DX issues before they slowed feature delivery.

The “AI” Factor – Since the Shops Enabling team was new to the Mercari Shops system and did not yet understand its features and services, we adopted AI tools like Cursor early on to fill the knowledge gap. We used them to analyze old documentation, Slack threads, and legacy code to get the historical context.
During development, they sped up the generation of migration scripts that would have taken a week to write manually. AI became our force multiplier.

The “Perfect” Dashboard – You cannot fix what you cannot see. We realized early on that our existing monitoring was insufficient for the complexity of this migration. We took time to build the ‘Perfect Dashboard’ in Datadog—a single pane of glass that revealed the system’s heartbeat❤️. But metrics weren’t enough; we needed context. We implemented end-to-end distributed tracing, enabling us to trace every request across the heterogeneous stack and ensure nothing was lost in the transition.
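Tracing a request across both environments depends on every hop forwarding the same trace identifier. The sketch below assumes the W3C Trace Context `traceparent` header format; this post does not say which propagation format was actually used, and the helper names are ours.

```python
import secrets


def new_traceparent():
    """Start a trace: version 00, random trace ID and span ID, sampled flag."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID but record a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent):
    return traceparent.split("-")[1]


# A request enters at the edge, passes through a Cloud Run service,
# and is then forwarded to a service on GKE.
edge = new_traceparent()
cloud_run_hop = child_traceparent(edge)
gke_hop = child_traceparent(cloud_run_hop)

# All three hops share one trace ID, so they show up as a single trace.
assert trace_id_of(edge) == trace_id_of(cloud_run_hop) == trace_id_of(gke_hop)
print(gke_hop)
```

If any service in the chain drops the header, the trace splits in two at that hop. That is exactly the kind of gap end-to-end tracing is meant to expose during a migration.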

However, what kept the team going, despite these challenges, was seeing the traffic graph slowly shift until it hit 100% on GKE.

In July 2025, we crossed the finish line: 100% traffic migration to GKE. YAY!!

System stability improved, and Cost Per Transaction (CPT) dropped by 33.3% (a cumulative reduction of 53%). We then addressed a logging inefficiency with a quick fix, which drove costs down further and brought the total reduction in Cost Per Transaction to a massive 67%.

As Mercari Shops was Mercari’s first large-scale Monorepo, we encountered multiple edge cases that no other engineering team had faced. These challenges generated a lot of insights. We funneled these learnings directly back to the Platform teams, catalyzing major upgrades to our CI/CD infrastructure and developer tooling.

The Final Tricky Bit: Breaking down inter-service walls

With the infrastructure settled, we wanted to resolve the last tricky bit: the Identity platform for Mercari Shops.
The Mercari Shops custom token was a wall, not a bridge. Shops relied on a custom token that required maintaining years of accumulated ‘cold’ code: forgotten custom logic. It isolated Mercari Shops services and made communication with the core services outside the Shops system painful and inefficient. As our product aimed for a unified UX, this distinction made it challenging to communicate with other services, leading to messy ‘glue code’.

We decided to stop maintaining a parallel identity stack. By adopting the Mercari PAT (Private Access Token), we not only simplified our architecture but also unlocked true interoperability with the broader Mercari backend ecosystem.

We couldn’t fix identity in a single go, so we broke the migration into two phases: Internal and External usage. We prioritized the internal cleanup first.

While migrating to the Mercari PAT, we identified two critical blockers. First, the Mercari PAT didn’t support the Google Identity Platform used by our B-Sellers. Second, Shops tokens carried custom claims that the Mercari PAT didn’t support.
We engineered a bridge in our internal auth service to convert Shops Tokens to PATs, preserving the external user experience. Simultaneously, we re-architected the dataflows to fetch custom claim data via gRPC rather than relying on the token.
It wasn’t a quick fix; it required modifying 80+ microservices.
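The two fixes above can be sketched roughly as follows. Everything here is hypothetical: the token shapes, `exchange_for_pat`, and `ClaimsService` are stand-ins invented for illustration, the in-memory lookup stands in for the gRPC call, and real tokens would be signed and verified rather than passed around as plain objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ShopsToken:
    """Hypothetical legacy token: identity plus custom claims."""
    subject: str
    custom_claims: dict = field(default_factory=dict)


@dataclass(frozen=True)
class PAT:
    """Hypothetical unified token: identity only, no custom claims."""
    subject: str


def exchange_for_pat(token: ShopsToken) -> PAT:
    # The bridge keeps the caller's identity and drops the custom claims,
    # so downstream services only ever see the unified token type.
    return PAT(subject=token.subject)


class ClaimsService:
    """Stand-in for a service that would be called over gRPC."""

    def __init__(self, claims_by_subject):
        self._claims = claims_by_subject

    def get_claims(self, subject):
        return dict(self._claims.get(subject, {}))


def handle_request(pat: PAT, claims: ClaimsService) -> str:
    # Data that used to travel inside the token is fetched on demand.
    shop_id = claims.get_claims(pat.subject).get("shop_id", "unknown")
    return f"user={pat.subject} shop={shop_id}"


legacy = ShopsToken(subject="seller-1", custom_claims={"shop_id": "shop-42"})
service = ClaimsService({"seller-1": {"shop_id": "shop-42"}})
print(handle_request(exchange_for_pat(legacy), service))
```

In this shape, a downstream handler never parses the legacy token. It depends only on the unified token plus a lookup, which is what makes direct service-to-service calls possible.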

While AI accelerated the code generation, the real battle was rigorous testing to ensure zero regressions. After a long journey of testing every use case, we decided to release.
The moment we enabled direct service-to-service calls, the benefits were undeniable.

We didn’t just simplify Mercari Shops’ system architecture; we unlocked true interoperability.

We received many requests from various engineering teams to switch to direct calls, simplifying integration across heterogeneous systems.

We are still in the second phase of the migration work. I hope we can wrap it up soon.

The Leader’s Playbook: Leading Through Legacy

As engineering leaders, we often agonize over how to rewrite our systems—which architecture to pick, which stack to use. But the truth is, the biggest challenge is rarely the system itself; it is the inertia of operating within a large organization.
Based on our journey through these migrations, here are some recommendations for leaders looking to move the needle in complex, brownfield (legacy) environments:

Design the Organization, Not Just the Architecture
In large organizations, systems tend to reflect communication structures, and organizational fragmentation becomes the main bottleneck for modernization. Silos prevent the cross-functional collaboration required to fix systemic debt.
The Strategy – Don’t rely on existing teams to do new tricks. We explicitly formed the "Shops Enabling Team"—a small, dedicated squad sitting across different engineering verticals.
The Takeaway – If your architecture is stuck, look at your org chart. You may need to spin up a temporary, specialized unit whose only KPI is to break silos and unblock flow.

Cultivate an “Evolutionary” Mindset
The “perfect squad” isn’t necessarily made up of the deepest experts in the legacy or latest tech stack. It is made up of engineers who are open to learning and evolving as they go.
The Strategy – The Shops Enabling team succeeded not because they knew everything from day one, but because they were resilient enough to learn many things on the fly.
The Takeaway – When staffing a modernization team, prioritize adaptability over tenure. You need people who view the system as a living ecosystem, not a static monument.

AI is the Bridge from Brownfield to Greenfield
We are entering a new era of software development where the economics of refactoring have changed. The cost of transforming ‘brownfield’ legacy systems into ‘greenfield’ modern architectures has dropped: the work is no longer purely manual but AI-assisted.
The Strategy – We used AI tools not just to write code, but also for “software archaeology”—analyzing legacy documentation and running various simulations to assess risks.
The Takeaway – Stop treating AI as just a coding assistant. Use it as a force multiplier to de-risk the most dangerous part of migrations: the knowledge gap.

The Hidden Wins and Personal Reflection

The impact of the work scaled far beyond the three core migrations. We optimized our caching layers, resolved critical database inefficiencies, and slashed onboarding costs by standardizing infrastructure.

Overall, it was a huge win:

We halved our system costs, reducing Cost Per Transaction (CPT) by a massive 67%, even amid rapid business growth.

Yet, the real victory was the journey itself. It reignited a spark I hadn’t realized was dimming. Reconnecting with the roots of engineering — not just managing it, but feeling the daily reality of it — ultimately made me a better leader.

None of this would have been possible without the Shops Enabling Team and the cross-divisional trust the team built among other engineering teams.

With the right strategy, people, and organizational setup, you can do the impossible: rebuilding your core infrastructure in mid-air, making it cheaper, faster, and better without ever touching the ground! 🚀

Tomorrow’s article will be by mariz about Building a Learning Culture with DevDojo. Stay tuned!