How we reduced response latency by over 80%

This post is for Day 3 of Mercari Advent Calendar 2023, brought to you by @rclarey from the Mercari Web Architect team.

Today we’ll continue on the topic of how we migrated from our dynamic rendering service (using Google’s rendertron) to server-side rendering (SSR), this time addressing how we planned, executed, and evaluated the success of the migration on the frontend side. This migration was a huge undertaking that required collaboration from almost every web team, and I’ll discuss some of the bumps and unexpected difficulties we encountered along the way. In the end we achieved huge performance improvements, reducing response latency by roughly 80% in P50, P75, P90, and P95, which made all of our hard work along the way very much worth it.

If you haven’t already, you should read yesterday’s article by the Web Platform team which explains the infrastructure and FinOps side of this migration.

Why migrate?

We rewrote Mercari Web starting in 2019, and at the time we chose to implement the new “ground up web” as a client-side rendered Gatsby app. This allowed our infrastructure to be simple, and enabled us to focus on quickly building out and launching the new app. Since then however, we realized that the compromises we made to make our app accessible to search engine bots were no longer worth it, and that we should consider a different approach moving forward. For a more detailed history of Mercari Web you should check out this article: https://engineering.mercari.com/en/blog/entry/20220830-15d4e8480e/

Before deciding to do a big migration it’s important to make sure that the large amount of work required is justified, and that your current solution is truly not serving your needs. In our case there were two main reasons why we decided to migrate:

Previously, server-side rendering Mercari Web was considered blocked because our design system used web components, and at the time SSR solutions for web components were experimental and did not fully support our use case. However, we realized that in practice only React projects were using our design system, so it wasn’t worth continuing to use web components if it meant blocking our ability to do SSR. Combined with the two main reasons above, this reinforced our motivation to migrate to SSR.

Having decided to do the migration, the next step was to plan how we would actually go about it.

Deciding on a framework

When planning this migration we had two main alternatives in mind: use Gatsby’s newly released (at the time) SSR feature, or change frameworks to Next.js which is known for SSR. Staying with Gatsby was appealing because it would be easier to incrementally add SSR support to our existing codebase.

Next.js, on the other hand, was a much more mature solution for SSR and was already used in Mercari for other web projects, but adopting it would require changing our framework first.

To fairly judge these two options we did a proof-of-concept implementation of our item page with both. In the end there were not many differences between implementing SSR with Gatsby and with Next. However, because SSR was so new for Gatsby at the time, documentation was lacking, and most importantly it was not officially supported outside of Gatsby Cloud (which we would not use).

This convinced us that Next’s maturity as an SSR framework, and official support for self-hosting, would make it a better choice for us long term. Lastly, since Next 13’s app router was still in beta at the time, and we also weren’t using React 18 yet, we opted to migrate to Next 12 in order to keep the migration scope manageable.

Incremental development

Since our design system was built with web components, which were not well supported for server-side rendering, we also needed to migrate it to React alongside our migration to SSR. We wrote a dedicated article about the design system migration in last year’s Mercari Advent Calendar, so please check it out for more details: https://engineering.mercari.com/en/blog/entry/20221207-web-design-system-migrating-web-components-to-react/

To make the migration easier, and to achieve partial improvements sooner, we planned to do the migration incrementally wherever possible. The main way we could deliver incrementally was to implement and release SSR page-by-page instead of all at once.

To achieve the biggest improvements as soon as possible, we prioritized pages in order of the requests-per-second they receive. Since more complex pages like /item and /search were at the beginning of this list, this ordering had the added benefit of allowing us to identify early on most of the big issues we’d have during the migration.

Once we had the list of pages, we worked backwards and created batches of design system components, based on usage within the pages, that we could also migrate incrementally. For example, the first batch was all components relevant to SEO (i.e. not decorative) used on the highest priority page, the second batch was all components used on the next highest priority page, and so on. Thankfully the design system has contributors from outside of the web architect team, so they were able to work on migrating the design system batches in parallel to my team working on migrating to SSR.

Unfortunately the one place we couldn’t easily break up the work incrementally was changing from Gatsby to Next. Since there are several web teams all working on the same Mercari Web app it would be too disruptive to pause feature development so that we could gradually move from one framework to the other. This meant that we needed to do the migration from Gatsby to Next in a feature branch, then change over all at once when it was ready.

With a solid plan in place, and a proof-of-concept under our belt, all that was left was to actually do the migration.

Going from Gatsby to Next

The first and largest step was to change from Gatsby to client-side rendering with Next. A lot of the APIs we used with Gatsby were actually from other companion packages, and luckily there were analogous Next APIs for almost all of the ones we used, for example:

  • pages/_app.tsx in Next is roughly the same as src/App.tsx and gatsby-browser.tsx in Gatsby
  • useRouter in Next is analogous to useLocation from Reach Router
  • next/dynamic is analogous to loadable-components
  • next/head is analogous to react-helmet

To handle changing to the Next version of all of these APIs across our thousands of source files we made heavy use of automated refactoring tools to do most of the tedious work. After making the changes we validated them first using type checking (thank you TypeScript), then our existing unit and UI tests, and finally with our E2E regression tests. This was able to catch the vast majority of bugs introduced by the changes.

The most apparent change when moving to Next was the routing paradigm. With Gatsby we used client-only routes with Reach Router, whereas Next uses file-system based routing. This difference turned out to be mostly superficial, and it was simple enough to create a list of all pages and then write a small script to generate the files Next expects under the pages directory.
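As a rough illustration of what such a script can look like, the sketch below maps Reach Router style paths to the files Next expects under pages. The route list, the component import specifiers, and the function names are all hypothetical and are not our actual script; writing the generated files to disk is left out.

```typescript
// Hypothetical route definition: a Reach Router style path plus the module
// that exports the page component (both are illustrative assumptions).
interface RouteDef {
  path: string; // e.g. "/item/:itemId"
  component: string; // e.g. "@/routes/Item"
}

// Convert a Reach Router style path into a file path under pages/.
// ":param" segments become "[param]" and "/" becomes the index page.
function toPageFile(path: string): string {
  const segments = path
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => (segment[0] === ":" ? `[${segment.slice(1)}]` : segment));
  return segments.length === 0
    ? "pages/index.tsx"
    : `pages/${segments.join("/")}.tsx`;
}

// Each generated page file simply re-exports the existing page component.
function toPageSource(route: RouteDef): string {
  return `export { default } from "${route.component}";\n`;
}

// Build a map of file path to file contents for every route.
function generatePages(routes: RouteDef[]): Record<string, string> {
  const files: Record<string, string> = {};
  routes.forEach((route) => {
    files[toPageFile(route.path)] = toPageSource(route);
  });
  return files;
}

const generated = generatePages([
  { path: "/", component: "@/routes/Top" },
  { path: "/item/:itemId", component: "@/routes/Item" },
]);
console.log(Object.keys(generated));
// [ 'pages/index.tsx', 'pages/item/[itemId].tsx' ]
```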

The only other issue we had with file-system routing was that Next 12 does not as easily support different layouts for different subsets of routes, and we had a handful of these different layouts throughout our app. While Next 12 does support per-page layouts, those increase the amount of boilerplate code needed in page files and are prone to errors where developers forget to add the layout to new pages. We instead opted to implement a simple solution using a custom app that suited our needs:

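import type { ReactNode } from "react";
import type { AppProps } from "next/app";
import { useRouter } from "next/router";

// MypageLayout and PurchaseLayout are layout components defined elsewhere in the app
function DomainLayout({ children }: { children: ReactNode }) {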
  const { pathname } = useRouter();
  if (pathname.startsWith("/mypage")) {
    return <MypageLayout>{children}</MypageLayout>;
  }
  if (pathname.startsWith("/purchase")) {
    return <PurchaseLayout>{children}</PurchaseLayout>;
  }
  // and so on
}

export function CustomApp({ Component, /* ... */ }: AppProps) {
  return (
    <>
      {/* global layout things */}
      <DomainLayout>
        <Component />
      </DomainLayout>
    </>
  );
}

There were also several subtle differences between Next’s Link component and useRouter hook compared to the analogous Link component and useLocation hook from Reach Router. Most of these differences were either very particular to our codebase or trivial to fix, and so they are not really worth talking about. The one difference that I will mention is that useRouter().pathname is not what it sounds like, and it caused us repeated issues until we implemented our own version that does what we expect. For those unaware, the pathname field on the object returned from useRouter() is the current path without parameter placeholders substituted (i.e. /item/[itemId] instead of /item/m123456789).

The correct way to get the current path with parameters substituted is useRouter().asPath, however that has the downside of also including query parameters which we don’t often want. In the end we wrote a helper to do the thing we expect, and we actively discourage the use of useRouter().pathname directly.

import { useRouter } from "next/router";

export function useCurrentURL() {
  const { asPath } = useRouter();
  // NEXT_PUBLIC_SITE_URL is our SSR-safe equivalent to location.origin
  return new URL(asPath, process.env.NEXT_PUBLIC_SITE_URL);
}
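To make the difference concrete, here is a framework-free sketch of what the helper above computes for an item page that was requested with a query string. The origin https://example.com, the query parameter, and the toCurrentURL name are placeholders chosen for illustration.

```typescript
// Hypothetical stand-in for NEXT_PUBLIC_SITE_URL; the real value differs per environment.
const SITE_URL = "https://example.com";

// Same idea as useCurrentURL, minus the hook: resolve asPath against the site origin.
function toCurrentURL(asPath: string, siteUrl: string = SITE_URL): URL {
  return new URL(asPath, siteUrl);
}

// For this request useRouter().pathname would be "/item/[itemId]",
// while useRouter().asPath would be "/item/m123456789?page=2".
const currentURL = toCurrentURL("/item/m123456789?page=2");

console.log(currentURL.pathname); // "/item/m123456789"
console.log(currentURL.search); // "?page=2"
console.log(currentURL.href); // "https://example.com/item/m123456789?page=2"
```

Returning a URL object rather than a string lets callers pick exactly the part they need: pathname for the path with parameters substituted, or search when the query parameters matter.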

The Big PR

Although most of the changes at this point were fairly trivial, they touched almost every file in our code base. To make it possible to do this migration in parallel with other teams’ normal feature development, we leveraged automated tools as much as possible to do the bulk of the refactoring quickly. This was especially useful for catching and fixing new unmigrated code when we synced our feature branch with the main branch every few days.

When all of the refactoring was done, and we had our app fully working with Next, we collaborated with all of the other web teams to “freeze” our main branch for a few days so that we had time to do a final code review and very thorough testing. The review period kicked off with an online code review session to introduce developers to the high level API changes, and from there the responsibility to do code review and fix failing tests was delegated to individual teams based on code ownership.

Over the course of the next few days we identified several bugs either through code review or testing (both automated and manual), and slowly worked through fixing them all. After all the regression tests were passing, and all developers were convinced that the new Next app was working as expected, I clicked the merge button.

GitHub UI showing 1751 changed files, 49405 additions, 48870 deletions

Of course we still had to release the change, and to do so we modified our usual staged release process to make the rollout happen in smaller increments over a longer period of time. Instead of the usual flow of 33% of sessions for 30 minutes → 100% of sessions, we doubled the number of stages and doubled the time at each stage, so it became 1% → 10% → 33% → 100% with 1 hour at each stage. This worked out well, and in the end we released the Next app without any major issues.
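One common way to implement percentage-based stages like these is deterministic bucketing by session ID, so that a session included at 1% stays included at 10%, 33%, and 100%. The sketch below illustrates that idea only; it is an assumption for the sake of example and not a description of our release tooling, and the hash function, the session IDs, and all names are hypothetical.

```typescript
// Illustrative, non-cryptographic string hash that maps a session ID
// to a stable bucket in the range 0..99.
function sessionBucket(sessionId: string): number {
  let hash = 0;
  for (let i = 0; i < sessionId.length; i++) {
    // ">>> 0" keeps the running value within unsigned 32-bit range
    hash = (hash * 31 + sessionId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// Rollout stages from the release described above, as percentages of sessions.
const ROLLOUT_STAGES = [1, 10, 33, 100];

// A session gets the new release when its bucket falls below the current percentage.
// Because the bucket never changes, raising the percentage only ever adds sessions.
function isInRollout(sessionId: string, percent: number): boolean {
  return sessionBucket(sessionId) < percent;
}

ROLLOUT_STAGES.forEach((percent) => {
  console.log(`${percent}%:`, isInRollout("session-abc", percent));
});
```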

Implementing server-side rendering

With our app moved over to Next client-side rendering, the next step was to move over to server-side rendering. This stage of the migration was highly collaborative between three main teams:

  • The design system contributors, migrating the required components for a given page in preparation for that page’s migration to SSR
  • The web architect team, implementing server-side data fetching, error handling, and server-side monitoring
  • The web platform team, load testing newly migrated pages, updating infrastructure configurations as the SSR server began handling more load, and handling the CDN routing to move requests for a migrated page from the dynamic rendering service over to the new SSR service (again the web platform team’s article from yesterday covers this in more depth)

On the frontend side the biggest issue in this stage of the migration was finding and replacing usages of DOM APIs with SSR-safe replacements. These replacements generally fell into one of two categories:

  • APIs where we need some meaningful value during SSR, e.g. replacing location.origin with process.env.NEXT_PUBLIC_SITE_URL, which we set manually for each environment in .env files
  • APIs where we don’t need a value during SSR, e.g. interaction related APIs like ResizeObserver that don’t contribute to the server response

For the latter group, we implemented SSR-safe helpers for the window and document globals and introduced eslint-plugin-ssr-friendly to enforce that those helpers are used instead of accessing the globals directly.

// TypeScript forces us to handle the `undefined` case that happens during SSR
function getWindow() {
  return typeof window !== "undefined" ? window : undefined;
}
function getDocument() {
  return typeof document !== "undefined" ? document : undefined;
}
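As an example of how a helper like this gets used, the sketch below wraps ResizeObserver so that it does nothing during SSR. The observeWidth function is hypothetical and written only for illustration; getWindow is repeated so the snippet is self-contained.

```typescript
// Same helper as above, repeated so this sketch stands alone.
function getWindow() {
  return typeof window !== "undefined" ? window : undefined;
}

// Hypothetical helper: reports width changes of an element in the browser.
// During SSR (no window) or without a target it returns a no-op cleanup function.
function observeWidth(
  target: Element | undefined,
  onWidth: (width: number) => void
): () => void {
  const win = getWindow();
  if (!win || !target || typeof ResizeObserver === "undefined") {
    return () => {};
  }
  const observer = new ResizeObserver((entries) => {
    entries.forEach((entry) => onWidth(entry.contentRect.width));
  });
  observer.observe(target);
  // The caller invokes this to stop observing, e.g. when a component unmounts.
  return () => observer.disconnect();
}
```

In a React component this would typically be called inside useEffect, returning the cleanup function, so the observer is only ever created in the browser.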

Impact

Below are graphs of the P50, P75, P90, and P95 response latency for the dynamic rendering service just before we began the SSR migration, and for the Next SSR service just after we moved 100% of requests to it.

[Figure: line graph of response latency for the dynamic rendering service]
[Figure: line graph of response latency for the SSR service]

I think the fact that the scale on the second graph is nearly an order of magnitude smaller than the first speaks for itself, but to also give some hard numbers:

  • P50 decreased 88%
  • P75 decreased 84%
  • P90 decreased 83%
  • P95 decreased 79%
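To relate these percentages to the difference in scale between the two graphs: a latency decrease of d% means responses take (100 - d)% of the time they used to, which is a speedup factor of 100 / (100 - d). A quick sketch of that arithmetic (the function name is just for illustration):

```typescript
// Convert a percentage decrease in latency into a "times faster" factor.
function speedupFactor(percentDecrease: number): number {
  return 100 / (100 - percentDecrease);
}

const latencyDecreases: Array<[string, number]> = [
  ["P50", 88],
  ["P75", 84],
  ["P90", 83],
  ["P95", 79],
];

latencyDecreases.forEach(([label, decrease]) => {
  console.log(`${label}: ${speedupFactor(decrease).toFixed(2)}x`);
});
// P50: 8.33x
// P75: 6.25x
// P90: 5.88x
// P95: 4.76x
```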

Overall the migration project was a huge success in every measurable way, and I can’t give enough thanks to everybody who helped make it possible 💖

Conclusion

Looking back, the simplicity of a client-side rendered app helped us quickly build out and launch our rewrite of the web in 2019. Eventually, however, we realized that this architecture was no longer serving us well, so we needed to move to SSR instead. Focussing on migrating incrementally where possible, keeping the migration scope contained, and collaborating across teams as much as possible allowed this migration to be the success that it was.

While there were many expected and unexpected issues that arose during the migration, a healthy reliance on automated tools for refactoring, testing, linting, and type checking meant that we were able to confidently deliver the migration without any major incident.

Tomorrow’s article will be by @fp from the Mercari mobile architects team. Look forward to it!
