This article was written as part of "Series: The Present and Future of the RFS Project for Strengthening the Technical Infrastructure".
Hello, I’m a member of the ID Platform team at Mercari. In this article, I will describe how we are applying the OIDC / OAuth 2.0 standard specifications to build our identity platform for our first-party services.
Introduction
Mercari is known as the biggest C2C flea market platform in Japan. But it also has multiple subsidiary companies such as Merpay, Souzoh, and Mercoin, … Because of different business requirements and development styles, each company’s systems run on different server clusters. They still need to call each other to process requests from clients, so how we perform cross-cluster communication is an important system design decision.
As with any other system, authentication and authorization are essential parts of our system. At Mercari, we have a single ID Platform team whose main responsibility is to oversee authentication and authorization for the group-wide system. To support business growth, we have to build an ID platform that is strong and reliable, yet easy to use.
Instead of making in-house specifications, we always try to apply industry-standard specifications that are well defined and proven. Such specifications are usually well thought out, and following them helps us do the Right Thing and reduces the possibility of introducing security risks. OAuth 2.0 and OIDC are well-known protocols in the digital identity field. Although they were designed mainly for delegating access and providing user identity to third parties, we applied them to our first-party services because we believe they will help us build a clean, secure and sustainable IDP. In this blog post, we would like to share part of that story.
There are many documents about OAuth 2.0 and OIDC. You can find some links in the reference section.
Overview of the Mercari system
Here is a very brief chart about the Mercari system.
The Mercari core system provides the core features of our service, such as buying and selling items and paying for them. It has been built up over a long time. With the establishment of subsidiaries, we are adding more and more subsystems. There are two kinds of subsystems, each running on its own cluster:
- A subsystem (subsystem 1 in the above diagram) that interacts bidirectionally with the core system and is also called directly from the Mercari mobile apps and web. Requests to this subsystem are authenticated by IDP.
- A subsystem (subsystem 2 in the above diagram) that only calls the core system and has its own clients. Technically it can be considered a third-party service, but it’s owned and managed by the Mercari group. Requests to this subsystem are not authenticated by IDP. The subsystem has its own authentication mechanism, and its clients do not use tokens issued by IDP to access its servers.
Considerations
Decide responsibility
The three main components in the OAuth 2.0 / OIDC specs are the Authorization Server (AS), the Resource Server, and the Relying Party. The Relying Party gets tokens (an access token and an ID token) from the Authorization Server and accesses the Resource Server if needed. To apply those specs, we need to map our components to the ones defined in the standards. We have only one authorization server in our system, so that part of the mapping is fixed. The authorization server also issues only one type of token (although legacy token types created before the current authorization server still exist). This simplifies authentication and the application of restrictions.
To decide which components should take the roles of Resource Server and Relying Party, we followed the principles below:
- The Resource Server holds the resources
- The Relying Party is granted access to the resources from the Resource Server
- A subsystem can be both a Resource Server and a Relying Party, because it can hold some resources and request other resources from other subsystems. But in the context of a single request, the role of the subsystem (Relying Party or Resource Server) should be clear.
Grant type
After mapping the components, we have to think about how the Relying Party obtains the access token, or in other words, we need to choose which grant type should be used for each scenario. There are 2 cases:
Relying Party works on behalf of the users (so-called user context)
There are 2 sub-cases:
- Relying Party receives instructions from the users directly (mobile app/web in the above diagram)
  - The authorization code grant should be used. One thing to note is that because the Relying Party is a first-party service, we don’t need to show the authorization screen to the users.
- Relying Party doesn’t receive instructions from the users directly (subsystem 1 in the above diagram)
  - This case happens when a subsystem is a Relying Party and a Resource Server at the same time. As a Resource Server, it receives a call from a Relying Party. To process that request, it then becomes a Relying Party itself and calls another Resource Server. For this case, we use the token exchange grant to exchange the token from the initial Relying Party for a new token that is used to call the other Resource Server.
Relying Party works on behalf of itself (so-called service context)
One example is when a batch job is run. The Relying Party doesn’t receive any requests from the users but some processes still need to be done. In this case, the client credentials grant should be used.
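The decision described in this section can be summarized as a small function. This is only a sketch: the `Context` enum, the constant names, and `choose_grant_type` are hypothetical names introduced for illustration. The grant type strings themselves come from RFC 6749 and RFC 8693.

```python
from enum import Enum


class Context(Enum):
    USER = "user"        # the Relying Party works on behalf of a user
    SERVICE = "service"  # the Relying Party works on behalf of itself


# grant_type values as defined in RFC 6749 and RFC 8693
AUTHORIZATION_CODE = "authorization_code"
CLIENT_CREDENTIALS = "client_credentials"
TOKEN_EXCHANGE = "urn:ietf:params:oauth:grant-type:token-exchange"


def choose_grant_type(context: Context, direct_user_instruction: bool = False) -> str:
    """Pick a grant type following the cases listed in this section."""
    if context is Context.SERVICE:
        # e.g. a batch job: no user request is involved
        return CLIENT_CREDENTIALS
    if direct_user_instruction:
        # the user drives the Relying Party directly (mobile app/web)
        return AUTHORIZATION_CODE
    # the Relying Party is itself serving a request from another Relying Party
    return TOKEN_EXCHANGE
```

For example, `choose_grant_type(Context.USER)` selects the token exchange grant, which matches the case where a subsystem acts as Resource Server and Relying Party within the same request.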
Scenarios
Based on the above principles, we divided our system into 3 scenarios (※1, ※2, ※3 in the diagram below).
First-party app/web (※1)
In this case, the first-party mobile applications and websites are the relying parties. The resource server includes the core system and the subsystem which is being called by the Relying Party.
In this scenario, the mobile apps and web obtain the access token (AT) from the authorization server by using the authorization code grant type (the grant flow has been omitted). But since users don’t delegate their access to third-party services, the authorization screen is skipped.
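As a rough illustration of the omitted grant flow, the sketch below builds the three pieces a client handles in the authorization code grant from RFC 6749: the authorization request, the callback check, and the token request body. All endpoint URLs, client IDs, and function names are placeholders, not Mercari’s actual values. Skipping the authorization screen is a decision made on the authorization server side, so it does not change these requests. PKCE (RFC 7636) is left out for brevity.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope):
    """Return the URL to send the user agent to, plus the state to remember."""
    state = secrets.token_urlsafe(16)  # opaque value used to detect CSRF
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{authorize_endpoint}?{urlencode(params)}", state


def parse_callback(callback_url, expected_state):
    """Extract the authorization code after checking that the state matches."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    return query["code"][0]


def build_code_token_request(code, client_id, redirect_uri):
    """Form parameters for the token endpoint (RFC 6749, section 4.1.3)."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
    }
```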
We don’t have the service-context access token for this scenario.
The subsystem acts as a relying party only (※2)
This scenario is for a special kind of subsystem (subsystem 2 in the above diagram). It’s outside of the IDP-protected area and has its own clients and authentication mechanism. It calls the core system, which is protected by IDP, but the core system does not call it. From an IDP point of view, it works like a third-party application but is managed by the Mercari group. The difference from scenario ※1 is that the subsystem runs on a cluster of servers, not on clients such as mobile apps or the web.
In this case, the subsystem becomes the Relying Party, and the core system is the only resource server. To get a user-context access token, the authorization code grant type, initiated by the backend, is used. The authorization screen is also skipped in this case. For a service-context access token, the client credentials grant type is used.
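For the service-context case, a client credentials token request could look roughly like the sketch below. It follows RFC 6749 (sections 2.3.1 and 4.4.2) and assumes the client authenticates with HTTP Basic. The article does not say which client authentication method is used, and the client ID, secret, and scope in the usage note are invented for illustration.

```python
import base64
from urllib.parse import quote_plus, urlencode


def build_client_credentials_request(client_id, client_secret, scope):
    """Return (headers, body) for a client credentials token request."""
    # RFC 6749 section 2.3.1: form-encode the credentials before Basic encoding.
    credentials = f"{quote_plus(client_id)}:{quote_plus(client_secret)}".encode()
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode(),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return headers, body
```

Calling `build_client_credentials_request("subsystem-2", "s3cret", "items.read")` yields a body of `grant_type=client_credentials&scope=items.read`, which would be sent in a POST to the token endpoint.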
The subsystem acts as a relying party and a resource server (※3)
There is another kind of subsystem (subsystem 1 in the above diagram). It’s inside the IDP-protected area, which means the authentication is being done by using the tokens that were issued by Mercari IDP. Its callers can be both first-party clients (mobile app/web) and other subsystems. It also needs to interact with other subsystems (including the core system) inside the IDP-protected area, which means it needs to get the token from IDP and call others. Thus the subsystem acts as Resource Server when it provides resources, and acts as Relying Party when it uses resources from other systems.
To get a user-context access token, the subsystem needs to perform a token exchange flow. The flow will create a new access token based on the original one.
(More details will be provided in another article from our team)
To get a service-context access token, the client credentials grant is used.
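The user-context token exchange mentioned above is defined by RFC 8693. A minimal sketch of the request parameters follows. The parameter names and URN values are taken from the RFC, while the function name and the choice to pass `audience` and `scope` are assumptions, since the article leaves the details of Mercari’s flow to a later post.

```python
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_request(subject_token, audience, scope=None):
    """Form parameters for exchanging an incoming access token for a new one."""
    params = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        # the token received from the initial Relying Party
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "requested_token_type": ACCESS_TOKEN_TYPE,
        # the Resource Server the new token is meant for (assumed usage)
        "audience": audience,
    }
    if scope:
        params["scope"] = scope
    return params
```

The subsystem would send these parameters to the token endpoint while authenticating as its own OIDC client, then use the returned token to call the next Resource Server.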
UserID
The UserID is another interesting point to consider. The UserID is referenced everywhere in our system. It exists as part of request parameters. It is also the primary key of many tables in our DBs. At Mercari, in order to minimize the effect of data leakage and for better privacy (data unlinkability), we use separate UserIDs in each subsystem.
As you can see in the diagram below, UID1 is used inside the core system, and UID2 is used inside subsystem 1. Subsystem 2 (outside of the IDP-protected area) can also have its own internal UserID structure.
But then how do subsystems call each other? Here, we applied another concept from the OIDC specification: pairwise pseudonymous identifiers (PPIDs).
This means that given a user, different identifiers are issued by the authorization server to each OIDC client. In order to interact with the authorization server, each Relying Party is issued an OIDC client, and PPIDs are issued to these OIDC clients. Then Relying Parties can use these PPIDs to call Resource Servers.
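The article does not describe how the authorization server generates PPIDs. As one possibility, OpenID Connect Core (section 8.1) gives an example method that hashes a sector identifier, the local account ID, and a secret salt. The sketch below follows that example, and the function and argument names are illustrative only.

```python
import hashlib


def derive_ppid(sector_identifier: str, local_account_id: str, salt: bytes) -> str:
    """Example pairwise identifier: SHA-256(sector_identifier || local_account_id || salt)."""
    digest = hashlib.sha256(
        sector_identifier.encode() + local_account_id.encode() + salt
    )
    return digest.hexdigest()
```

With this approach the same user gets a different, stable identifier for each client, and one client’s identifier cannot be linked to another’s without the salt. Because a hash is one-way, converting between PPIDs would still need a lookup on the authorization server side.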
On the Resource Server side, we have to convert the PPIDs from the Relying Party to the internal UserID when receiving a request, and convert the internal UserID back to PPIDs when returning the response. To do that, each subsystem maintains a mapping between its internal UserID and its own PPID (each subsystem is also a Relying Party, so it has its own OIDC client and its own PPID). Conversions between PPIDs are done by the authorization server.
A Relying Party-only subsystem (subsystem 2) doesn’t have the ability to convert PPIDs. It only needs to know its own PPIDs. But a subsystem that serves as both Relying Party and Resource Server (subsystem 1) must be able to convert not only its own PPIDs but also PPIDs that were issued to others. So when calling other subsystems, this type of subsystem could technically use multiple types of PPIDs. To keep things clear and consistent across subsystems, there is an important rule that we have to follow: always use the PPID issued to the Relying Party for communication between a Relying Party and a Resource Server. As you can see in the above diagram, instead of using the same PPID, subsystem 1 uses PPID 4 (issued to subsystem 1) to call the core system, and the core system uses PPID 5 (issued to the core system) to call subsystem 1.
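The conversion steps on the Resource Server side can be sketched as follows. Both classes are hypothetical: `FakeAuthorizationServer` is an in-memory stand-in for whatever conversion interface the real authorization server offers, and `ResourceServer` only models the mapping between the internal UserID and the subsystem’s own PPID.

```python
class FakeAuthorizationServer:
    """In-memory stand-in for PPID conversion by the authorization server."""

    def __init__(self):
        self._ppid_by_user_and_client = {}
        self._user_by_ppid = {}

    def register(self, user_key, client_id, ppid):
        self._ppid_by_user_and_client[(user_key, client_id)] = ppid
        self._user_by_ppid[ppid] = user_key

    def convert(self, ppid, target_client_id):
        """Translate a PPID into the PPID of the same user for another client."""
        user_key = self._user_by_ppid[ppid]
        return self._ppid_by_user_and_client[(user_key, target_client_id)]


class ResourceServer:
    def __init__(self, client_id, auth_server):
        self.client_id = client_id
        self.auth_server = auth_server
        self._uid_by_own_ppid = {}
        self._own_ppid_by_uid = {}

    def link(self, internal_uid, own_ppid):
        """Record the mapping between the internal UserID and this subsystem's PPID."""
        self._uid_by_own_ppid[own_ppid] = internal_uid
        self._own_ppid_by_uid[internal_uid] = own_ppid

    def to_internal(self, caller_ppid):
        """Incoming request: caller's PPID -> own PPID -> internal UserID."""
        own_ppid = self.auth_server.convert(caller_ppid, self.client_id)
        return self._uid_by_own_ppid[own_ppid]

    def to_caller_ppid(self, internal_uid, caller_client_id):
        """Outgoing response: internal UserID -> own PPID -> caller's PPID."""
        own_ppid = self._own_ppid_by_uid[internal_uid]
        return self.auth_server.convert(own_ppid, caller_client_id)
```

In terms of the diagram, when subsystem 1 calls the core system with PPID 4, the core system would resolve it to PPID 5 and then to UID1, and would translate UID1 back to PPID 4 in the response.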
Asynchronous communication
Up to this point, we’ve only talked about synchronous communication between Relying Party and Resource Server, in which Relying Party acts on the resources from Resource Server. But we also have another communication paradigm in which the resources are pushed from Resource Server to Relying Party when an event occurs.
Normally, in order to receive events from a Resource Server, the Relying Party needs to register a webhook URL with the Resource Server. When the webhook is called, the Relying Party needs to verify the caller to ensure that the Resource Server is a legitimate publisher. This means that some sort of credential must be attached to the call. Also, the event is likely to be triggered by user actions, so a UserID is likely to be included in the event message.
Given the above, we decided to use the UserID inside these messages as the PPID issued to the Relying Party. This means the Resource Server has to convert its internal UserID to the Relying Party-PPID before publishing the event.
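A publisher following this rule might build its event message along these lines. The message fields, the event type in the usage note, and the `to_subscriber_ppid` callback (which stands in for the PPID conversion described earlier) are all assumptions made for illustration.

```python
import json
import time


def build_event(event_type, internal_uid, subscriber_client_id,
                to_subscriber_ppid, occurred_at=None):
    """Serialize an event whose user_id is the PPID issued to the subscriber."""
    payload = {
        "type": event_type,
        # never publish the internal UserID; convert it for this subscriber
        "user_id": to_subscriber_ppid(internal_uid, subscriber_client_id),
        "occurred_at": int(time.time()) if occurred_at is None else occurred_at,
    }
    return json.dumps(payload)
```

For instance, publishing a hypothetical `item.sold` event about UID1 to subsystem 1 would carry PPID 4 in `user_id`, so the receiver can map it to its own internal UserID.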
For the caller verification, we considered what kind of token should be used. In theory, only a Resource Server has the ability to verify an access token, so a Relying Party would not be able to perform this task. Subsystems that are both Relying Party and Resource Server can technically verify access tokens, but we shouldn’t use access tokens for sending events, because other Relying Parties can’t verify them and we would end up with an inconsistent system. In our opinion, the event receiver should only need to verify that the event came from a legitimate publisher. Therefore a verifiable ID token was the way to go.
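On the receiving side, part of verifying such a token is checking its claims. The sketch below assumes the token’s signature has already been verified against the issuer’s published keys by a JWT library, and only checks the standard `iss`, `aud`, and `exp` claims in the way OpenID Connect Core describes for ID tokens. The article does not list the claims Mercari actually uses, so the function name and arguments are illustrative.

```python
import time


def validate_event_token_claims(claims, expected_issuer, own_client_id,
                                now=None, leeway=60):
    """Check issuer, audience, and expiry of already signature-verified claims."""
    now = time.time() if now is None else now
    if claims.get("iss") != expected_issuer:
        raise ValueError("unexpected issuer")
    aud = claims.get("aud")
    # "aud" may be a single string or a list of strings
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if own_client_id not in audiences:
        raise ValueError("token not intended for this receiver")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now > exp + leeway:
        raise ValueError("token expired")
    return claims
```

A webhook handler would reject the event if this raises, and only then process the message body.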
After making these decisions, we were able not only to perform asynchronous communication between first party subsystems, but also to do the same for third party services as well, e.g. providing security events to the services that use Mercari ID Login.
Conclusion
By examining our system from the OAuth 2.0 / OIDC perspective and mapping our system components to the roles those specifications define, we gained a clearer understanding of what we should do, which helped us design the system and prevent incorrect usage. It’s tempting to create in-house specifications, especially for first-party services, but over time custom specifications tend to become complicated, even to the point of being out of control, and they have a high chance of containing security holes.
We have just started a long journey, and our ID platform is still immature. But with its current features, it can already support our current business growth. If a new subsidiary company joins our group, the auth mechanism is already in place: we only need to apply one of the above scenarios to the new business. This scalability was one of the main reasons why we started to build our ID platform, much like our other infrastructure-related projects.
This article skimmed over many parts, but if you found this article interesting and are interested in working together on authentication and authorization for the whole Mercari group, please take a look at our careers page!
Software Engineer, Backend (ID Platform) – Mercari
References
- https://www.rfc-editor.org/rfc/rfc6749
- https://openid.net/specs/openid-connect-core-1_0.html
- https://www.rfc-editor.org/rfc/rfc8693.html
- https://www.manning.com/books/oauth-2-in-action
- https://www.manning.com/books/openid-connect-in-action
- https://qiita.com/kokukuma/items/34b5aacad9fd9a894730 (in Japanese)