"By design, by default and at scale" are the driving values of the Security & Privacy division.
The Platform Security team was asked to lead the review of user access permissions in Okta. During this project, we had to deal with our legacy configurations and practices, which meant that access management was far from ideal on the "by design" and "by default" fronts. Regardless of the current state, we had to conduct our assessment "at scale" and cover the whole organisation.
This article describes how Mercari’s Security team approached this challenge.
Technologies:
- Neo4j: https://neo4j.com/
- Okta: https://www.okta.com/
- Slack: https://slack.com
TL;DR
We use Okta to grant most of our employees access to SaaS. Granting access is easy, but revoking it is harder.
To clean up unnecessary access, we used Neo4j to build a graph representation of our organisation and access to apps, then used Slack as our user interface to conduct assessments.
- We asked employees to tell us if they needed all the access they had, excluding company-wide applications.
- We then asked managers to confirm that these accesses made sense given job responsibilities.
- Once collected, we could revoke self-reported unnecessary access directly through the Okta API.
Conducting this as code allowed us to scale the assessment to the whole organisation.
Introduction: How Did We Get Here?
Mercari is 11 years old tomorrow (February 1st 2024). While it is now a well-established company, it had to go through some growing pains like many teenagers of that age. The needs related to access management evolved over time as new employees joined and left while the company expanded. New internal services were introduced and decommissioned. Reasons for some past decisions were lost along the way.
Because we rely heavily on SaaS solutions, Okta and Google Workspace are our solutions of choice to manage identities. When we started to work on this access review project, in Okta alone, we had around 8000 users, 500 active applications and 1400 groups. Deprovisioning access is relatively easy when someone is leaving. However, it is still a delicate operation during internal transfers. For newer employees, keeping things tidy is easier, but for longer-serving employees, reviewing accumulated access isn’t always easy. As a result, entropy increased, and the resulting complexity made it hard to clean things up.
Terminal Goal of the Project
The ultimate goal for the security team is to reduce, as much as possible, the potential damage that could be caused by the abuse of system access.
Accessory Goals
Cleaning up accesses helps achieve a multitude of accessory goals:
- Reduce the amount of entropy in our authentication systems.
- Be able to present a clearer picture of which systems are used by each employee/team.
- Reduce the stress on security team members who have to ask system owners whether Mr K’s or Mrs W’s access is still necessary, and then document the findings.
- Reduce the amount of time spent trying to understand how things are managed and why.
- Identify SaaS that we have that might not be necessary anymore.
- Create better account life cycle management patterns, based on a clean state.
- etc.
Possible Strategies
The Principle of Least Privilege is still one of the best ways to reduce the risk of accidents or incidents, but it requires effort to apply and maintain.
Applying the principle of least privilege and achieving the Terminal Goal implies that we (should) know:
- what are the systems we have,
- who are the owners and administrators of these systems,
- who has access to these systems and with which access rights,
- the type of data each system processes and stores,
- the potential business processes that these systems are used for,
- that we can draw a direct path between each employee, system, action they can take and the consequence of each possible action.
Doing so requires a monumental amount of work to establish and maintain.
Let’s do some quick maths based on the Okta numbers: 8000 users and 500 apps, with access assigned directly or through one of the 1400 groups. There are multiple users per app, multiple users per group and sometimes multiple groups per app, all linked to the organisation structure and its teams. In our case, this adds up to over 200,000 relations. And at this stage, we don’t even know the access level of each user, the type of data processed or stored by each system, or the actions available to users.
Starting with only what we know from Okta: if I were to spend 1 second per relation, assuming that I had all the information needed to make a judgement within that second, I would still have to spend more than 55 hours straight reviewing these 200k relations. Obviously, having a single person review everyone’s access isn’t a reasonable approach.
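The estimate above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the relation count and the 1 second per relation come from the text, while the eight-hour working day is an assumption added for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted above.
RELATIONS = 200_000          # user/group/app relations found in Okta
SECONDS_PER_RELATION = 1     # optimistic time to judge a single relation


def review_hours(relations: int, seconds_each: float) -> float:
    """Hours a single reviewer needs to look at every relation once."""
    return relations * seconds_each / 3600


hours = review_hours(RELATIONS, SECONDS_PER_RELATION)
print(f"{hours:.1f} hours non-stop")               # 55.6 hours non-stop
# Assumption for illustration: a working day of eight hours.
print(f"{hours / 8:.1f} eight-hour working days")  # 6.9 eight-hour working days
```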
Let’s go through some of the other possible strategies we could use.
Strategy 1: Reduce The Scope to Critical Systems Only
What is a critical system? Based on what criteria? Anyone who tries to define these criteria knows that it’s easy to get lost in all the possible parameters. There is no magic: the complexity needs to be somewhere. If we chose the strategy of identifying critical systems or systems containing sensitive information, someone (or a team) would still need to go through all systems and classify them by understanding what they are used for and what kind of users should have access.
At the same time, we have a good idea of what our systems are. Starting somewhere makes more sense than collecting everything, and then getting depressed looking at the unclimbable mountain ahead. Once at the top, everyone would be tired or would have resigned already.
Another issue is that the environment will not stop moving while this assessment is being done. Before it is finished, new systems will have been introduced, users will have been added, and systems will be used for new use cases. We can’t freeze a flowing river to count all the fish in it.
Strategy 2: Full Scope, Asking System Owners
What if we asked system owners? 500 apps, with a number of users ranging from 1 to all employees + contractors. If each system owner has an average of 10 systems, this means that there are still 50 people who would each have to look at around 4000 accesses and make a judgement if these users should have access or not, based on job descriptions, the nature of the service or data accessed. At one point, this might be necessary, at least for some critical systems, but this is not a viable approach in our initial state of high entropy.
Additionally, system owners tend to be managers or directors. Their time is precious. Anyone with limited time will prioritise, and this task is likely to be pushed back to later, no matter how important it is.
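The workload per system owner quoted above follows from the same Okta figures. The sketch below assumes, purely for illustration, that the 200,000 relations are spread evenly across owners, which is unlikely to hold in practice.

```python
APPS = 500             # active applications in Okta
RELATIONS = 200_000    # relations counted earlier in this article
APPS_PER_OWNER = 10    # average number of systems per owner (from the text)


def owner_workload(apps: int, relations: int, apps_per_owner: int) -> tuple[int, int]:
    """Return (number of owners, accesses each owner has to review)."""
    owners = apps // apps_per_owner
    # Assumption: relations are split evenly between owners.
    return owners, relations // owners


owners, accesses_each = owner_workload(APPS, RELATIONS, APPS_PER_OWNER)
print(owners, accesses_each)  # 50 4000
```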
Strategy 3: Ask Users First, Then Managers to Confirm Answers
We can ask the users if they still need access to systems before asking anyone else.
The approach we decided to take is exactly that: ask employees first.
Do you still need access to all these systems? Yes/No/Not Sure.
Once answers are collected (or the deadline expired), we ask their manager:
Given the roles and responsibilities of your members, can you review their answers and confirm that their access makes sense?
We didn’t go that far, but a third level of review would be to then ask System Owners:
These teams are using your system. Given what this system is used for, are you ok with them accessing it?
This strategy brings down the decision to keep or revoke access to the people who will actually use these accesses. It also has the advantage of distributing the assessment to all employees. Sadly for the managers, they will also have to review all of the apps that their members say that they need access to, but that assessment can go relatively quickly since they just need to confirm. Doing a quick sanity check normally takes under 5 minutes per person. Some cases might take more, but can be clarified through direct messages.
Through this process, we want to catch outlier cases like "Mr. Y in Security has access to the Payroll system". Even if Mr. Y says "I need it", we at least want the manager to do a sanity check.
Many times, comments we got from members while running this campaign were more "I didn’t even know I had access to that" or "What is this service in the first place?".
Because of how Okta is used, we know that the chosen strategy isn’t perfect yet: Okta is granting access to the app. In our case, it is rarely used to assign rights within the application. This is delegated to the system owners. Removing access in the first place already makes a significant difference and clean-up can be done later. At that time, we can prioritise a few critical systems.
How A Campaign Is Conducted
We now know WHY we are doing the assessment, we know WHAT systems will be covered, and WHO will answer and review. Now, HOW will we ask everyone and collect their answers?
The Spreadsheet Assessment Strategy (nope)
- 200,000 rows with all the users/groups/apps don’t fit in a Google Spreadsheet, and it would be ridiculous to ask everyone to open and review such a sheet. Ensuring that the integrity of the sheet is preserved is possible, but requires more work.
Web Based Assessment (maybe later)
- While it would work, we also decided not to create a web page to conduct the assessment, at least not at this stage.
Okta Identity Governance Access Certification Campaign Feature (won’t work)
- Okta does offer an identity governance access certification feature. I can see this working well if Okta is configured from the ground up knowing that it will be used to perform access reviews in the future. Owners would need to be assigned to groups, these groups would be assigned to applications. While conducting the campaign, group owners would be requested to confirm that group members should have access. This assumes that the group owner is able to judge if the user should have access. A group would then likely represent a team, and the administration of the members would be delegated to the managers. That team group would need to be assigned to the needed applications by an App Owner. However, Okta doesn’t have attributes to define App Owners (at this time).
- This approach would be fine for normal cases, but exceptions would need to be managed through other groups, assigned to someone who would be aware of these exceptions.
- In our current state, that was not a viable solution since groups are generally (but not always) used to grant access to apps, not to represent teams. This also means that we don’t have owners assigned to these groups, which would be hard to fix since our documentation of system owners requires some improvements.
Slack + Backend + Neo4j (selected)
- We decided to use Slack as our user interface, and Neo4j as the backend database. Using a graph database as the backend actually allowed us to (relatively) easily query team members, their managers, and all access they had and through what group. For now, we also decided to exclude from our scope the review of access granted within the application.
The rest of this blog post will be dedicated to describing our process.
We had to go through a certain number of steps to proceed with our assessment:
1. Recover the organisational structure.
2. Recover Okta Apps, Groups and Users, as well as all memberships and direct accesses.
3. Create our Graph representation of the organisation and access.
4. For each team and employee: produce a Slack form requesting them to confirm which access is still needed.
5. Collect answers from Users.
6. For each manager: produce a Slack form and ask if they agree with the apps needed by their members. In the absence of an answer from the user, the Manager has to make the call.
7. Collect answers from Managers.
8. Sanity check: review answers to spot absurdities.
9. Revoke app access and group membership through the Okta API.
10. Document all the changes.
All the operations above, with the exception of the sanity check (Step 8), are conducted through code. This allows us to reliably reproduce the process at will.
Representing The Organisation Structure And Access Rights In A Database
Okta users can be configured with attributes describing their team and manager, but because of some inconsistencies, we ended up having to extract the full structure from a different source, and then had to link that structure with the users in Okta. Having the organisation structure available in the graph allowed us to conduct assessments based on a higher level of the hierarchy, which was quite convenient.
We could then extract from Okta the relations between apps, groups and users for a given organisation unit or team.
Image 1: Integrating Okta and HR data into Neo4j graph database, visualised with Mermaid.js.
Schema: Relations between Org Units, teams, managers, members, groups and apps
In an effort to prevent over-engineering, at least initially, we decided to take some shortcuts and use the OktaUser node as our unit for each employee. The reality is more complex and requires identifying principals differently, but at this stage it was sufficient.
Image 2: Schematic representation of the relationships within the database, visualised using Mermaid.js.
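As a rough illustration of how HR and Okta data could be written into such a graph, the sketch below generates parameterised Cypher `MERGE` statements. The node labels and relationship types are the ones used in this article; the `id` property and the function name are hypothetical, and the article does not describe the actual import code.

```python
from typing import Iterable

# A Cypher statement together with its parameters.
Statement = tuple[str, dict]


def membership_statements(team: str, manager_id: str,
                          member_ids: Iterable[str]) -> list[Statement]:
    """Build MERGE statements linking members to their team and manager.

    Assumption: users are keyed by a hypothetical `id` property.
    """
    statements: list[Statement] = []
    for member_id in member_ids:
        # Every member belongs to the team (OrgUnit).
        statements.append((
            "MERGE (o:OrgUnit {name: $team}) "
            "MERGE (u:OktaUser {id: $user_id}) "
            "MERGE (u)-[:IS_MEMBER_OF]->(o)",
            {"team": team, "user_id": member_id},
        ))
        # The manager does not report to themselves.
        if member_id != manager_id:
            statements.append((
                "MERGE (u:OktaUser {id: $user_id}) "
                "MERGE (m:OktaUser {id: $manager_id}) "
                "MERGE (u)-[:IS_REPORTING_TO]->(m)",
                {"user_id": member_id, "manager_id": manager_id},
            ))
    return statements


stmts = membership_statements("Platform Security", "u-manager", ["u-manager", "u-1"])
for cypher, params in stmts:
    print(params)
```

Because `MERGE` only creates what is missing, statements like these could be replayed on each import without duplicating nodes or relations.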
Once the data was written into our Neo4j database, we had a queryable representation of our organisation, the teams, and the apps used by each of them. Here is what the graph looks like for the organisation structure:
Image 3: Visual depiction of Mercari’s organisational structure, created using Neo4j’s web interface.
The queries below translate to:
- For all direct members of the "Platform Security" team with access to active Okta Apps:
  - Get the manager
  - Check whether they used these applications in the last 90 days
  - Return the Org node, Manager node, Properties of the relation between the user and the app, Properties of the last use, and the App node
The second query does the same, this time taking into consideration access to apps through group membership.
// Team: Platform Security, apps assigned directly to users
MATCH (o:OrgUnit {name: "Platform Security"})<-[:IS_MEMBER_OF]-(u:OktaUser)-[r:HAS_ACCESS_TO]->(a:OktaApp {status: "ACTIVE"})
WITH o, u, r, a
MATCH (u)-[:IS_REPORTING_TO]->(m:OktaUser)
WITH o, m, u, r, a
OPTIONAL MATCH (u)-[p:HAS_USED]->(a)
RETURN o, m, u, PROPERTIES(r) AS r, PROPERTIES(p) AS p, a

// Team: Platform Security, apps assigned through group membership
MATCH (o:OrgUnit {name: "Platform Security"})<-[:IS_MEMBER_OF]-(u:OktaUser)-[r:IS_MEMBER_OF]-(g:OktaGroup)-[:HAS_ACCESS_TO]->(a:OktaApp {status: "ACTIVE"})
WITH o, u, r, g, a
MATCH (u)-[:IS_REPORTING_TO]->(m:OktaUser)
WITH o, m, u, r, g, a
OPTIONAL MATCH (u)-[p:HAS_USED]->(a)
RETURN o, m, u, PROPERTIES(r) AS r, PROPERTIES(p) AS p, g, a
Query 1: Retrieving application and group access listings for specific teams using Neo4j Cypher.
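The two result sets need to be combined per user before a form can be produced. A minimal sketch of that step follows, assuming the query results have already been flattened into dictionaries with hypothetical `user`, `app` and `group` keys. Keeping track of how each access was granted matters later, since revocation targets either an app assignment or a group membership.

```python
from collections import defaultdict


def apps_per_user(direct_rows: list[dict], group_rows: list[dict]) -> dict:
    """Map each user to their apps and the way each app access was granted."""
    access = defaultdict(lambda: defaultdict(set))
    for row in direct_rows:
        access[row["user"]][row["app"]].add("direct")
    for row in group_rows:
        access[row["user"]][row["app"]].add(f"group:{row['group']}")
    # Convert to plain dicts with sorted paths for stable output.
    return {
        user: {app: sorted(paths) for app, paths in apps.items()}
        for user, apps in access.items()
    }


# Hypothetical rows, for illustration only.
direct = [{"user": "alice", "app": "Jira"}]
via_groups = [
    {"user": "alice", "app": "Jira", "group": "eng"},
    {"user": "alice", "app": "Figma", "group": "design"},
]
merged = apps_per_user(direct, via_groups)
print(merged)
```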
Launching a Campaign
The campaign Controller (app) relies on a list of teams to identify the users to target. The recursive list of teams can easily be extracted from the Neo4j database with a query like this:
MATCH (t:OrgUnit)-[:IS_PART_OF*]->(o:OrgUnit) WHERE o.name = "Security & Privacy" AND t.status = "active"
RETURN t.name AS team, t.orgId AS orgId, o.name AS orgName
Query 2: Recovering a recursive team hierarchy under ‘Security & Privacy’ category with Neo4j Cypher.
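For readers less familiar with Cypher, the variable-length pattern `[:IS_PART_OF*]` walks the hierarchy through any number of levels. The sketch below mimics that behaviour in plain Python on a hypothetical child-to-parent mapping. Apart from "Security & Privacy" and "Platform Security", the team names are invented, and a single parent per unit is assumed.

```python
from collections import defaultdict, deque


def teams_under(root: str, parent_of: dict[str, str], active: set[str]) -> list[str]:
    """Active org units that are part of `root`, at any depth."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    found, seen, queue = [], {root}, deque([root])
    while queue:
        unit = queue.popleft()
        for child in children[unit]:
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)  # keep walking, even through inactive units
            if child in active:  # like the query, only active teams are returned
                found.append(child)
    return sorted(found)


parent_of = {
    "Platform Security": "Security & Privacy",
    "Privacy Office": "Security & Privacy",
    "Red Team": "Platform Security",
    "Old Team": "Security & Privacy",
    "Finance": "Corporate",
}
active = {"Platform Security", "Privacy Office", "Red Team", "Finance"}
print(teams_under("Security & Privacy", parent_of, active))
```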
Based on the list of teams in scope, the Controller notifies managers that an assessment is starting, creates the assessment for each team member and sends forms through Slack direct messages.
Sending Member Assessments
Image 4: Sequential flow chart detailing the member campaign process, illustrated with Mermaid.js.
The assessment form sent to members is kept simple and is meant to be quick to fill. A user can click on the application name to connect to the app and confirm if they still need access to it, then select “Access needed” or “No need anymore”.
Image 5: Example of a member evaluation form, as displayed in Slack.
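A form like this can be described with Slack Block Kit. The sketch below is not the form used in the campaign: it uses buttons inside `actions` blocks, and the `action_id` values, `block_id` format and app fields (`id`, `label`, `url`) are hypothetical. The resulting list could be passed as the `blocks` argument of Slack’s `chat.postMessage` method.

```python
def button(label: str, action_id: str, value: str) -> dict:
    """A Block Kit button element."""
    return {
        "type": "button",
        "text": {"type": "plain_text", "text": label},
        "action_id": action_id,
        "value": value,
    }


def member_form_blocks(apps: list[dict]) -> list[dict]:
    """One clickable app name and two answer buttons per app."""
    blocks = [{
        "type": "section",
        "text": {"type": "mrkdwn", "text": "*Do you still need access to these apps?*"},
    }]
    for app in apps:
        # Link to the app so the member can check what it is.
        blocks.append({
            "type": "section",
            "block_id": f"app:{app['id']}",
            "text": {"type": "mrkdwn", "text": f"<{app['url']}|{app['label']}>"},
        })
        # The app id travels in the button value so the backend can map the answer.
        blocks.append({
            "type": "actions",
            "block_id": f"answer:{app['id']}",
            "elements": [
                button("Access needed", "access_needed", app["id"]),
                button("No need anymore", "access_not_needed", app["id"]),
            ],
        })
    return blocks


form = member_form_blocks([{"id": "0oa1", "label": "Example App", "url": "https://example.com"}])
print(len(form))  # 3
```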
Answer Collection Backend
Once the assessment forms are sent, we only need to wait for answers. We have a backend ready to receive them and update the Neo4j database with the answers.
Image 6: Flowchart illustrating the procedure for gathering responses from the evaluation form, visualised using Mermaid.js.
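As an illustration of what such a backend has to do, the sketch below turns a Slack `block_actions` payload into a parameterised Cypher statement. The `ANSWERED` relationship, the `slackId` and `id` properties and the `action_id` values are hypothetical. A real endpoint would also need to verify Slack’s request signature before trusting the payload.

```python
from typing import Optional

# Hypothetical mapping from button action_id to the stored answer.
ANSWERS = {"access_needed": "needed", "access_not_needed": "not_needed"}


def parse_answer(payload: dict) -> Optional[tuple[str, dict]]:
    """Return (cypher, params) for an answer, or None for unrelated payloads."""
    if payload.get("type") != "block_actions":
        return None
    action = payload["actions"][0]
    answer = ANSWERS.get(action["action_id"])
    if answer is None:
        return None
    cypher = (
        "MATCH (u:OktaUser {slackId: $slack_id}), (a:OktaApp {id: $app_id}) "
        "MERGE (u)-[r:ANSWERED]->(a) "
        "SET r.answer = $answer"
    )
    params = {
        "slack_id": payload["user"]["id"],
        "app_id": action["value"],
        "answer": answer,
    }
    return cypher, params


sample = {
    "type": "block_actions",
    "user": {"id": "U123"},
    "actions": [{"action_id": "access_not_needed", "value": "0oa1"}],
}
statement, parameters = parse_answer(sample)
print(parameters)
```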
During the assessment, we can manually send progress updates to the managers, asking them to check with their team members if they haven’t answered yet.
Manager Answer Review
Once we have collected answers, even if a member didn’t answer or complete the assessment, we request Managers to review accesses. This step normally goes quickly since answers from members are visible, and applications related to the teams should be well known.
In the case where a manager isn’t responding, we can then report to their managers the lack of progress.
The review flow for managers looks like this:
Image 7: Sequence diagram outlining the managers’ review workflow, visualised using Mermaid.js.
The form sent to the manager is similar to the one sent to the user but only contains apps marked as needed. The manager can see the member’s answer and choose to keep or remove each access, depending on whether they judge it necessary.
Image 8: A glimpse into the manager review form interface within Slack.
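Putting the two levels of answers together, the outcome for one user and one app could be resolved with a small function like the one below. This is one interpretation of the rules described in this article, not the actual implementation, and the answer values are hypothetical. Access is only revoked on an explicit answer, and anything unanswered stays pending.

```python
from typing import Optional


def decide(member_answer: Optional[str], manager_answer: Optional[str]) -> str:
    """Resolve one user/app pair to 'revoke', 'keep' or 'pending'.

    member_answer: 'needed', 'not_needed', 'not_sure' or None (no answer)
    manager_answer: 'keep', 'remove' or None (no review yet)
    """
    # Self-reported unnecessary access can be revoked directly.
    if member_answer == "not_needed":
        return "revoke"
    # Otherwise the manager makes the call, including when the member
    # did not answer or was not sure.
    if manager_answer == "remove":
        return "revoke"
    if manager_answer == "keep":
        return "keep"
    # No explicit answer from anyone: leave the access in place for now.
    return "pending"


print(decide("needed", "keep"))      # keep
print(decide("needed", "remove"))    # revoke
print(decide(None, None))            # pending
```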
Unnecessary Access Clean-Up
At this stage, we have collected answers from members, as well as collected confirmations from managers. We could request system owners to confirm that they agree that teams should have access (as opposed to individual access review), but we decided to push this to a later assessment.
The access revocation flow through the Okta API is relatively simple:
Image 9: Flowchart depicting the steps involved in the access revocation mechanism, visualised with Mermaid.js.
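For reference, Okta’s API exposes `DELETE /api/v1/apps/{appId}/users/{userId}` to unassign a user from an app and `DELETE /api/v1/groups/{groupId}/users/{userId}` to remove a user from a group. The sketch below only builds the requests with the standard library and does not send them. The domain, token and IDs are placeholders, and this is not the code used in the campaign.

```python
import urllib.request
from typing import Optional


def revocation_request(okta_domain: str, api_token: str, user_id: str,
                       app_id: Optional[str] = None,
                       group_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the DELETE request revoking one access."""
    if (app_id is None) == (group_id is None):
        raise ValueError("Provide exactly one of app_id or group_id")
    if app_id is not None:
        path = f"/api/v1/apps/{app_id}/users/{user_id}"      # direct app assignment
    else:
        path = f"/api/v1/groups/{group_id}/users/{user_id}"  # group membership
    return urllib.request.Request(
        url=f"https://{okta_domain}{path}",
        method="DELETE",
        headers={"Authorization": f"SSWS {api_token}", "Accept": "application/json"},
    )


# Placeholder values, for illustration only.
req = revocation_request("example.okta.com", "dummy-token", "00u1", app_id="0oa1")
print(req.get_method(), req.full_url)
```

Building the requests separately from sending them makes it easy to log every planned change before anything is revoked, which helps with the documentation step.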
Conclusion
Through this project, we could review what access employees had and said they needed by trusting our employees and managers to answer truthfully. Most standards, frameworks, regulations and best practices require companies to do this kind of review on a regular basis. Such reviews can quickly get out of hand in a complex environment. This is where moving the complexity of handling relations between employees and applications into a graph database, and asking employees first whether they needed the access, helped us scale the assessment to the size of the company. We were also able to conduct this assessment without going through a lengthy system classification exercise. Because we rely so much on Okta, focusing on it allowed us to cover a majority of systems.
There are still improvements possible to this flow and expansion to other systems. Tighter access granting rules and checks could be implemented into the provisioning process.
Meanwhile, we could already remove a significant amount of access that wasn’t needed anymore without any risk of access interruption, since removal is based on employees’ and managers’ answers instead of predetermined rules deciding whether access should be suspended or not…