2024/12/08

Streamlining Security Incident Response with Automation and Large Language Models

Author: Florencio González

Background

Effective security incident response is a crucial aspect of any organization’s cybersecurity strategy. The security incident response lifecycle provides a structured approach for handling security incidents methodically and efficiently. By following this approach, organizations can minimize the impact of incidents, recover operations swiftly, and implement measures to prevent future occurrences.

The incident response lifecycle typically comprises the following phases:

Preparation: Establishing policies, procedures, tools, and communication strategies to ensure readiness for potential security incidents.
Detection & classification: Identifying potential security events through monitoring systems and classifying them based on severity and impact.
Triaging: Assessing the incident’s scope, gathering additional information, and analyzing data to understand the incident’s nature and implications.
Remediation & response: Implementing actions to contain and mitigate the security incident, eradicate threats, and prevent further damage.
Recovery, reporting, & learning: Restoring affected systems and services, documenting the incident and actions taken, and learning from the experience to improve future responses through a retrospective analysis.

Understanding each phase enables incident responders to act promptly and effectively. By integrating automation and leveraging Large Language Models (LLMs), the Threat Detection and Response (TDR) team at Mercari enhanced these phases, reducing manual effort and increasing the speed and accuracy of our responses. In this article, we explain what we improved and how we achieved it.

Key security incident handling tasks ideal for automation

Manual processes in security incident handling can be time-consuming and prone to errors. To address these challenges, the TDR team developed a security incident response Slackbot that automates repetitive tasks and leverages Large Language Models (LLMs) for tasks requiring contextual analysis (as shown in Figure 1). This automation not only reduces the time spent on routine activities but also enhances the accuracy and consistency of security incident documentation. In this blog post, we explore the functionalities of our Slackbot, the integration of LLMs, and the significant time savings achieved: between 160 and 250 minutes for a small security incident.

In the rapidly evolving digital landscape, organizations are encountering a growing frequency of security incidents. As a consequence, incident responders are tasked with swiftly setting up investigation environments, coordinating with team members, and meticulously documenting every step of the process. These tasks, while essential, often involve repetitive actions and consume valuable time and resources.

When a security incident occurs, the incident responder has to set up a proper environment to start handling the incident, for example:

Establish Communication Channels: Set up a dedicated platform for real-time collaboration.
Create Documentation Structures: Organize folders and documents to store investigation results.
Assign Tasks: Delegate responsibilities and track progress through task management systems.
Manage Access Rights: Ensure all relevant team members have the necessary permissions.

Throughout the security incident handling process, additional team members may join, requiring further administrative actions. Moreover, documenting investigation results, root causes, impacts, and countermeasures demands careful attention to detail. These manual processes are not only time-consuming but also susceptible to human error. To enhance efficiency and accuracy, TDR developed a security incident response Slackbot that automates many of these tasks. By incorporating LLMs, TDR also automated tasks that traditionally require human analysis.

Figure 1. Security Incident Response Automation.

Automating Security Incident Response Tasks

Our security incident response Slackbot automates several key tasks across different stages of security incident management. In Table 1, we detail these tasks and the time savings achieved.

| Phase | Task | Steps | Time |
|---|---|---|---|
| Security incident creation | Create folders to store the incident report and artifacts. | Locate the correct folder structure. Create new folders. | 3-5 min |
| Security incident creation | Create a document for the incident report. | Find the correct template for the incident report. Copy the template to the correct folder. Update the document with the initial incident-specific details. | 5-10 min |
| Security incident creation | Create tasks in Jira for the incident. | Find the correct project. Create the initial tasks. | 5-10 min |
| Security incident creation | Create a private channel in Slack and pin the relevant documents. | Navigate to Slack. Create a new channel. Pin the relevant documents, such as the incident report and the Jira issue. | 3-5 min |
| Security incident creation | Add relevant members to the channel from a previous initial discussion thread. | Find the correct team members. Add them to the channel. | 2-3 min |
| Security incident investigation | Give access to the folders and documents to members joining the Slack channel. | Monitor the Slack channel for new members. Manually give access to relevant resources. | 1-3 min per person |
| Security incident investigation | Document relevant Slack messages in the incident report. | Navigate to the relevant Slack conversation to find the message. Copy and paste the message to the incident report. Copy and paste the message link to the incident report. Format the message properly. | 3-5 min per message |
| Security incident postmortem | Create a post-mortem retrospective document. | Find the correct template for the post-mortem retrospective document. Copy the template to the correct folder. Update the document with the incident-specific details. | 5-10 min |
| Total time | | | 27-51 min |

Table 1. Security incident tasks and the time saved by implementing automation and LLMs.
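As an illustration of the "give access to members joining the Slack channel" row, the sketch below reacts to Slack's `member_joined_channel` event and shares the incident resources with the new member. The channel ID, resource IDs, the `INCIDENT_RESOURCES` mapping, and the `grant_access` placeholder are all hypothetical; the article does not describe how the Slackbot stores this mapping or which document API it calls.

```python
# Hypothetical mapping from an incident channel ID to the resources its members need.
INCIDENT_RESOURCES = {
    "C0INCIDENT1": ["incident-folder-id", "incident-report-doc-id"],
}


def grant_access(user_id, resource_id):
    """Placeholder for a real document-sharing call; it only records the grant."""
    return {"user": user_id, "resource": resource_id, "role": "writer"}


def handle_event(event, resources=INCIDENT_RESOURCES, grant=grant_access):
    """Share every incident resource with a user who joins an incident channel."""
    if event.get("type") != "member_joined_channel":
        return []  # ignore unrelated events
    channel_resources = resources.get(event["channel"], [])
    return [grant(event["user"], resource) for resource in channel_resources]


grants = handle_event(
    {"type": "member_joined_channel", "user": "U123", "channel": "C0INCIDENT1"}
)
print(len(grants))
```

Passing the mapping and the grant function as parameters keeps the handler easy to test without touching a real workspace.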

By summing the time saved across tasks, we can observe substantial efficiency gains:

Per incident: Up to about 50 minutes saved on repetitive tasks alone, allowing responders to focus on critical decision-making and response activities.
Cumulative: Over time, these savings significantly enhance team productivity and security incident handling capabilities.
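The total in Table 1 can be checked with a short calculation. The sketch below copies the per-task ranges from the table and assumes one joining member and one documented message for the per-person and per-message rows; the `people` and `messages` parameters are illustrative additions that show how the total grows for larger incidents.

```python
# Per-task time ranges in minutes (low, high), copied from Table 1.
TASKS = {
    "Create folders": (3, 5),
    "Create incident report document": (5, 10),
    "Create Jira tasks": (5, 10),
    "Create private Slack channel and pin documents": (3, 5),
    "Add members from the initial thread": (2, 3),
    "Grant access to a joining member (per person)": (1, 3),
    "Document a Slack message (per message)": (3, 5),
    "Create post-mortem document": (5, 10),
}


def total_saved(tasks, people=1, messages=1):
    """Sum the low and high estimates, scaling the per-person and per-message rows."""
    low = high = 0
    for name, (lo, hi) in tasks.items():
        factor = 1
        if "(per person)" in name:
            factor = people
        elif "(per message)" in name:
            factor = messages
        low += lo * factor
        high += hi * factor
    return low, high


print(total_saved(TASKS))                        # (27, 51), the table's total row
print(total_saved(TASKS, people=3, messages=5))  # (41, 77)
```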

Leveraging Large Language Models (LLMs)

Automation significantly reduces the time spent on repetitive tasks. However, certain tasks require contextual understanding and analysis, and have traditionally needed human intervention. By integrating LLMs into our Slackbot, TDR automated these complex tasks, further enhancing efficiency.

LLMs are AI models trained on large amounts of data. They can understand context, interpret nuances in language, and generate coherent and relevant text responses. By leveraging LLMs, our Slackbot can perform tasks such as summarizing lengthy discussions, translating between languages, and generating detailed reports, all of which take incident responders a large amount of time.

Challenges

Understanding the security incident context.
Accuracy and reliability of outputs.
Handling bilingual communication.
Integration with existing systems.
Computational resource requirements.

Security incident declaration

Before declaring a security incident, responders need to analyze the initial information, understand the context, and determine the appropriate course of action. Crafting a clear and concise description and title for the incident is crucial for effective communication. Finally, determining security incident type, category, severity, and affected assets requires careful consideration.

To address this challenge, TDR leveraged LLMs to:

Perform contextual analysis: The LLM processes initial messages and data related to the potential security incident, extracting key information and understanding the situation’s nuances.
Automate description generation: Based on its analysis, the LLM generates a detailed incident description and a descriptive title that accurately reflect the situation.
Assist with security incident classification: It suggests a security incident type and category by comparing the incident characteristics with known patterns and categories.
Estimate impact and severity: The LLM assesses potential impact and severity levels, aiding responders in prioritizing the security incident.
Identify affected assets: It identifies and lists the affected systems or assets by cross-referencing mentioned resources with the organization’s asset inventories.
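The steps above could be wired together roughly as follows. This is a minimal sketch, not the Slackbot's actual implementation: the prompt wording, the JSON field names, the severity values, and the `llm` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical values; a real deployment would use its own taxonomy.
SEVERITIES = {"low", "medium", "high", "critical"}
REQUIRED = ("title", "description", "type", "category",
            "impact", "severity", "affected_assets")


@dataclass
class Declaration:
    title: str
    description: str
    type: str
    category: str
    impact: str
    severity: str
    affected_assets: list


def build_prompt(messages: list, known_assets: list) -> str:
    """Combine the initial messages and the asset inventory into one prompt."""
    return (
        "You assist a security incident responder.\n"
        f"Reply with JSON containing the keys: {', '.join(REQUIRED)}.\n"
        f"Known assets: {', '.join(known_assets)}\n"
        "Initial messages:\n" + "\n".join(f"- {m}" for m in messages)
    )


def draft_declaration(messages, known_assets, llm: Callable[[str], str]) -> Declaration:
    """Ask the model for a draft and validate it before showing it to a responder."""
    raw = json.loads(llm(build_prompt(messages, known_assets)))
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"LLM reply is missing fields: {missing}")
    if raw["severity"] not in SEVERITIES:
        raise ValueError(f"Unexpected severity: {raw['severity']!r}")
    # Cross-reference with the inventory so asset names the model invented are dropped.
    raw["affected_assets"] = [a for a in raw["affected_assets"] if a in known_assets]
    return Declaration(**{key: raw[key] for key in REQUIRED})
```

Validating the reply and filtering assets against the inventory is one way to approach the accuracy and reliability challenge listed earlier; the responder would still review the draft before declaring the incident.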

Done manually, this could take between 5 and 10 minutes, based on the following steps:

Read the initial information of the security incident.
Analyze the context of the security incident.
Write a description of the incident.
Write a descriptive title.
Set a security incident type.
Set a security incident category.
Set an initial impact.
Set an initial severity.
Identify the affected assets.

Security incident reporting and status updates (daily, weekly, and monthly reports)

Collecting and organizing information about a security incident, or about the incidents that occurred over a period, takes a large amount of time. Each incident has to be summarized uniformly, with its key details highlighted. Responders also have to clearly document the actions taken, impact changes, countermeasures, and recommendations that will later become part of a daily, weekly, or monthly report.

To address this challenge, TDR leveraged LLMs to:

Automate security incident collection: The Slackbot gathers incident data from our database for the specified period of time to be sent to the LLM.
Standardize summaries: The LLM creates concise summaries for each incident, ensuring consistency in format and content.
Generate insights: The LLM identifies common patterns, frequently affected assets, and recurring issues.
Generate actionable recommendations: The LLM suggests countermeasures and preventive actions based on the analysis. All of these are useful during post-incident activities such as retrospectives.

Manually, this could take between 60 and 90 minutes based on the following steps:

Collect security incidents for a given period of time.
Analyze each incident:
- Specify a summary for each incident.
- Specify the impact for each security incident.
- Specify the actions taken.
- Specify countermeasures to prevent the incident from happening again.
- Specify recommendations.
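The collection step and the uniform per-incident structure can be sketched as below. The incident record fields (`id`, `title`, `declared_on`, `assets`, `notes`) and the prompt wording are hypothetical, since the article does not describe the database schema or the prompts used. Counting affected assets is shown as plain code because it needs no model at all.

```python
from collections import Counter
from datetime import date


def incidents_in_period(incidents, start, end):
    """Return incidents declared within [start, end], oldest first."""
    selected = [i for i in incidents if start <= i["declared_on"] <= end]
    return sorted(selected, key=lambda i: i["declared_on"])


def most_affected_assets(incidents, top=3):
    """Count how often each asset appears across the selected incidents."""
    counts = Counter(asset for i in incidents for asset in i["assets"])
    return counts.most_common(top)


def build_report_prompt(incidents, start, end):
    """Ask for the same five headings for every incident to keep summaries uniform."""
    header = (
        f"Summarize the security incidents declared between {start} and {end}.\n"
        "For each incident give: summary, impact, actions taken, "
        "countermeasures, recommendations.\n"
        "Use the same headings for every incident.\n"
    )
    body = "\n".join(f"- [{i['id']}] {i['title']}: {i['notes']}" for i in incidents)
    return header + body


incidents = [
    {"id": "INC-2", "title": "Leaked token", "declared_on": date(2024, 11, 20),
     "assets": ["ci-runner", "repo-a"], "notes": "Token rotated."},
    {"id": "INC-1", "title": "Phishing email", "declared_on": date(2024, 11, 5),
     "assets": ["mailbox"], "notes": "Mail quarantined."},
    {"id": "INC-3", "title": "Exposed bucket", "declared_on": date(2024, 12, 2),
     "assets": ["repo-a"], "notes": "Access restricted."},
]
november = incidents_in_period(incidents, date(2024, 11, 1), date(2024, 11, 30))
print(build_report_prompt(november, date(2024, 11, 1), date(2024, 11, 30)))
```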

Slack channel and Thread Summarization

Reviewing how a security incident has progressed means following many threads in a Slack channel each time an update is needed, and the same effort is required to onboard new members quickly. A tool that provides an overview without overwhelming detail is therefore important.

The main challenges addressed were:

Volume of communication: A high volume of messages can make it difficult to extract key points.
Contextual continuity: Maintaining the storyline of the security incident as it unfolded.
Identifying critical decisions and actions: Highlighting pivotal moments in the response.

To address these challenges, TDR leveraged LLMs for:

Conversation summarization: The LLM scans through Slack channels and threads, summarizing discussions chronologically.
Key point extraction: The LLM identifies significant messages, decisions, and action items.
Contextual linking: The summary maintains the flow of events, showing how one action led to another.
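Before a model can summarize a channel chronologically, the messages have to be put in order. The sketch below is one possible pre-processing step, assuming messages shaped like those the Slack API returns, with a `ts` timestamp string and, for threaded messages, a `thread_ts` pointing at the parent. The transcript format and the choice to order threads by when they started are illustrative, not the article's method.

```python
def ts_key(ts):
    """Turn a Slack timestamp such as '1733650000.000100' into a sortable tuple."""
    seconds, _, micros = ts.partition(".")
    return int(seconds), int(micros or 0)


def build_transcript(messages):
    """Group replies under their parent and order threads by when they started."""
    threads = {}
    for message in messages:
        root = message.get("thread_ts", message["ts"])
        threads.setdefault(root, []).append(message)

    lines = []
    for root in sorted(threads, key=ts_key):
        for message in sorted(threads[root], key=lambda m: ts_key(m["ts"])):
            indent = "" if message["ts"] == root else "  "  # indent replies
            lines.append(f"{indent}[{message['ts']}] {message['user']}: {message['text']}")
    return "\n".join(lines)


messages = [
    {"ts": "1733650300.000200", "user": "U2", "text": "Blocking the IP now",
     "thread_ts": "1733650000.000100"},
    {"ts": "1733650100.000000", "user": "U3", "text": "Checking the logs"},
    {"ts": "1733650000.000100", "user": "U1", "text": "Alert fired",
     "thread_ts": "1733650000.000100"},
]
print(build_transcript(messages))
```

Parsing the timestamp into integers avoids the rounding that can occur when a 16-digit value is converted to a float. Keeping replies next to their parent is intended to help the model follow how one action led to another.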

The result is a summary of Slack channels and threads that presents the key discussions in chronological order. This function is useful for different purposes, such as:

Security incident retrospective.
Executive summary.
Catching up with the security incident.

Depending on the phase of the security incident and the number of threads and messages, it can be hard for the incident commander to keep track of them. Because the time saved depends on the amount of information to analyze, it is hard to give a specific number, but the average for a small security incident is between 5 and 10 minutes. It can exceed 1 hour as the number of people and tasks involved increases.

Language interpretation

In bilingual environments, teams can face delays due to language differences. Ensuring that translated messages keep the original intent and nuance is therefore important for the functions described above.

For an analyst who does not know the language, translating manually across the functions described above could take between 60 and 90 minutes in total, based on the following steps:

Identify Japanese messages.
Translate Japanese messages to English based on the context.
Format the messages properly based on the flow of the events.
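The first step, identifying Japanese messages, can be approximated without a model by checking Unicode ranges. This is a simple heuristic sketch and not necessarily what the Slackbot does: kanji share their code points with Chinese characters, so a kanji-only message would also match. The message dictionaries are hypothetical.

```python
# Unicode blocks commonly used in Japanese text.
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji, shared with Chinese)
    (0xFF66, 0xFF9F),  # Halfwidth Katakana
)


def contains_japanese(text):
    """Return True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in JAPANESE_RANGES)


def messages_to_translate(messages):
    """Keep only the messages that need translation, preserving their order."""
    return [m for m in messages if contains_japanese(m["text"])]


messages = [
    {"user": "U1", "text": "Alert fired on the payments API"},
    {"user": "U2", "text": "ログを確認します"},
    {"user": "U3", "text": "Rolling back the deploy"},
]
print([m["user"] for m in messages_to_translate(messages)])  # ['U2']
```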

Integrating Large Language Models into our security incident response processes has changed the way TDR handles tasks that traditionally require significant human effort and time. Through the use of LLMs, TDR saved between 130 and 200 minutes for a small security incident.
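The 130 to 200 minute figure follows from the manual estimates given in each LLM section, and adding the automation total from Table 1 gives the overall range quoted earlier in the article. The short calculation below uses only numbers stated in the text; the dictionary keys are just labels.

```python
# Manual time estimates in minutes (low, high) from the sections above.
LLM_TASKS = {
    "Security incident declaration": (5, 10),
    "Reporting and status updates": (60, 90),
    "Slack channel and thread summarization": (5, 10),
    "Language interpretation": (60, 90),
}
AUTOMATION_TOTAL = (27, 51)  # total row of Table 1


def add_ranges(ranges):
    """Add (low, high) ranges component by component."""
    ranges = list(ranges)
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


llm_total = add_ranges(LLM_TASKS.values())
overall = add_ranges([llm_total, AUTOMATION_TOTAL])
print(llm_total)  # (130, 200)
print(overall)    # (157, 251), roughly the 160 to 250 minutes quoted for a small incident
```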

Conclusion

The use of Large Language Models frees human incident responders to focus on strategic decisions rather than administrative tasks. It also provides rapid analyses and outputs, accelerating the security incident response process. This is a great benefit when handling large volumes of data and communication, which otherwise add delays to the process.

Our incident response Slackbot demonstrates the significant benefits of automating routine tasks and integrating LLMs for tasks requiring analysis. By reducing manual effort, TDR enables security incident responders to focus on critical thinking and decision-making, improving both efficiency and effectiveness.

However, the potential applications of LLMs in security incident response extend beyond our current implementation. As TDR continues to refine our Slackbot, we plan to:

Enhance LLM capabilities: Explore more advanced models for deeper analysis and better accuracy.
Implement agent-based incident response roles: Implement agents that take on security incident response roles, such as incident commander, handler, and analyst, to support security incident response notifications.
Automate task tracking: Leverage LLMs to monitor threads where high-impact tasks are happening, to support the incident commander and keep them up to date.
Introduce real-time collaboration: Allow LLMs to participate in discussions by providing suggestions or alerts during live incident handling.
