Making Sense of Unstructured Intelligence Data Using NLP

The push toward structuring threat intelligence data has gained momentum with the proliferation of intelligence-sharing standards like TAXII, STIX and CybOX. Most recently, we’ve seen the MITRE ATT&CK framework take off among intelligence analysts and SOC teams due to its ability to represent relationships between the Tactics, Techniques and Procedures (TTPs) that Threat Actor Groups use to target individuals and organizations.

The ATT&CK framework has been massively useful for large organizations, which can now pinpoint where their strengths and weaknesses lie in terms of the techniques that can be leveraged against them. While the introduction of this framework has been revolutionary, the question remains of how to automatically integrate the ATT&CK framework into an organization’s investigation workflow.

The data science team at TruSTAR set out to find a way to automatically ingest and structure ATT&CK framework data by using a simple yet extraordinarily useful Natural Language Processing (NLP) tool called Doc2Vec, an extension of Word2Vec.

Here’s a summary of our research and results, which we presented at BSides SF 2019.

About Our Sample Datasets

To train our model we used unstructured data from two different sources: the MITRE ATT&CK framework repository and the NIST National Vulnerability Database (NVD). Let’s go into the intricacies of each.

The MITRE ATT&CK framework has various important concepts but the two most important ones to keep in mind for this exercise are those of “Tactic” and “Technique”. The Tactic represents the overarching objective for a threat actor group and asks the important question of “Why?” or “What?” On the other hand, the Techniques represent a myriad of ways to accomplish an overarching Tactic. The relationships between the two become clear in Figure 1.0.1. The rich and verbose descriptions of these Techniques form a minor but essential part of the inputs into Doc2Vec.

The NIST NVD is a collection of CVEs (Common Vulnerabilities and Exposures) that occur in application software. Each record describes a vulnerability and includes a sentence or two about the impact that vulnerability may have if exploited. These brief, succinct descriptions form the major chunk of our inputs into Doc2Vec.

The idea behind connecting these two diverse data sources was to help threat intelligence analysts prioritize patching the vulnerabilities that link to their weakest areas in the MITRE ATT&CK framework.

Figures 1.0.1 and 1.0.2 show concretely where we believe the idea of a vulnerability fits into this framework.

[Figures 1.0.1 and 1.0.2]

Our NLP Data Model and Research Process

We used Doc2Vec, an NLP algorithm that converts documents into vectors. There is a lot of unstructured threat intelligence data and the only way to make sense of it mathematically is to represent it numerically.

Before we understand Doc2Vec, let’s try to understand its more basic counterpart, Word2Vec. The objective of Word2Vec is to convert words into vectors. The idea is fairly intuitive: words that repeatedly appear in similar contexts have a high probability of being very similar to each other. Doc2Vec leverages the same idea of using context to form numerical relationships, except it learns vectors for entire paragraphs or documents.
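To make the "similar contexts" intuition concrete, here is a minimal sketch using plain co-occurrence counts. This is not Word2Vec or Doc2Vec (those learn dense vectors with a neural network); it is a simpler stand-in that shows the same underlying idea. The sentences, the window size and the function names are assumptions made up for illustration.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (hypothetical): two words share a context, one does not.
sentences = [
    "attacker exploits overflow to execute code",
    "attacker exploits injection to execute code",
    "analyst reads the weekly report",
]
vecs = context_vectors(sentences)

# "overflow" and "injection" are surrounded by the same words, so they score high.
print(cosine(vecs["overflow"], vecs["injection"]))
# "overflow" and "report" share no context words, so they score zero.
print(cosine(vecs["overflow"], vecs["report"]))
```

Word2Vec replaces these raw counts with learned, low-dimensional vectors, but the signal it learns from is the same: which words keep showing up near each other.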

After many training iterations, words with similar meanings will have vectors very close to each other, while unrelated words will have vectors that point in very different (close to orthogonal) directions. For example, the model will learn “King is to Queen, as Man is to Woman”: the difference between the vectors for King and Queen is very similar to the difference between the vectors for other male and female word pairs.
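The analogy can be worked through with a few hand-made vectors. The sketch below is only an illustration: the two-dimensional vectors are invented (one axis loosely standing for "royalty", the other for "gender"), whereas a trained model would learn vectors with many more dimensions from data. The `analogy` helper is a hypothetical name, not part of any library.

```python
import math

# Hand-made toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = add(sub(vectors[b], vectors[a]), vectors[c])
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# The King-Queen offset equals the Man-Woman offset in this toy space.
print(sub(toy["king"], toy["queen"]) == sub(toy["man"], toy["woman"]))  # True

# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman", toy))  # queen
```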

After obtaining vectors for each document, CVEs and Technique descriptions alike, we performed various operations including clustering to understand relationships between these documents.
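One simple operation on document vectors is a nearest-neighbour lookup: for each CVE vector, find the Technique vector with the highest cosine similarity. The sketch below assumes the vectors have already been produced by a model; the three-dimensional vectors and the IDs (`TECHNIQUE-A`, `CVE-EXAMPLE-1`, and so on) are placeholders invented for illustration, and this is not necessarily the exact operation used in the research.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_technique(cve_vec, technique_vecs):
    """Return (technique_id, similarity) for the most similar technique vector."""
    scored = ((tid, cosine(cve_vec, vec)) for tid, vec in technique_vecs.items())
    return max(scored, key=lambda pair: pair[1])

# Placeholder document vectors (assumption): real ones would come from the model.
technique_vecs = {
    "TECHNIQUE-A": [0.9, 0.1, 0.0],
    "TECHNIQUE-B": [0.0, 0.2, 0.9],
}
cve_vecs = {
    "CVE-EXAMPLE-1": [0.8, 0.2, 0.1],
    "CVE-EXAMPLE-2": [0.1, 0.1, 0.7],
}

# Link every CVE to its closest technique.
links = {cve: nearest_technique(vec, technique_vecs)[0] for cve, vec in cve_vecs.items()}
print(links)
```

The same similarity function can feed a clustering step, since most clustering algorithms only need a way to compare pairs of vectors.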

To get a full technical understanding of this process, you can refer to the technical papers by Mikolov et al. (Word2Vec) and by Le and Mikolov (Doc2Vec). A simple walk-through of our research codebase is available on GitHub.

Research Findings

The main objective of this exercise was to establish links between ATT&CK framework TTPs and CVEs. Our NLP model achieved this with 50% accuracy, measured against manually labelled ATT&CK-CVE pairs. While this may seem somewhat low, it’s important to note that the two datasets (ATT&CK and NIST’s NVD) are very different. One is centered on techniques and the other on vulnerabilities and their potential impact. Beyond that, the technique descriptions from MITRE are long and detailed and touch on various concepts along the way, while the vulnerability descriptions from NVD are short and convey their meaning within about two sentences.
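For readers who want to reproduce this kind of measurement, here is one plausible way to score predicted links against a manually labelled set. It is a sketch under assumptions: the post does not spell out the exact scoring procedure, and the IDs and labels below are placeholders, not data from the research.

```python
def link_accuracy(predicted, labelled):
    """Fraction of manually labelled CVEs whose predicted technique matches the label."""
    if not labelled:
        raise ValueError("no labelled pairs to score")
    correct = sum(
        1 for cve, technique in labelled.items() if predicted.get(cve) == technique
    )
    return correct / len(labelled)

# Placeholder predictions and labels (assumption, for illustration only).
predicted = {
    "CVE-EXAMPLE-1": "TECHNIQUE-A",
    "CVE-EXAMPLE-2": "TECHNIQUE-B",
    "CVE-EXAMPLE-3": "TECHNIQUE-A",
    "CVE-EXAMPLE-4": "TECHNIQUE-C",
}
labelled = {
    "CVE-EXAMPLE-1": "TECHNIQUE-A",
    "CVE-EXAMPLE-2": "TECHNIQUE-B",
    "CVE-EXAMPLE-3": "TECHNIQUE-B",
    "CVE-EXAMPLE-4": "TECHNIQUE-A",
}

print(link_accuracy(predicted, labelled))  # 0.5
```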

Another important result from this research is that similar vulnerabilities are now more easily clustered together, which means that any given ATT&CK technique has links to multiple CVEs in a cluster.

An unexpected but very important finding was the overlap between techniques within the ATT&CK framework. We identified numerous cases of the exact same technique being given multiple IDs in the ATT&CK framework database. There were also cases where a technique conveyed the same meaning in slightly different words and was likewise assigned multiple IDs. This algorithm can help identify TTP overlaps and clusters within the ATT&CK framework to make it more accurate and efficient.
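Overlap detection of this kind can be sketched as a pairwise comparison: flag any two technique vectors whose cosine similarity exceeds a threshold. The vectors, the placeholder IDs and the 0.95 threshold below are all assumptions chosen for illustration; a real threshold would need tuning against manually reviewed pairs.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def overlapping_pairs(vectors, threshold=0.95):
    """Return ID pairs whose vectors are at least `threshold` similar."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if cosine(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs

# Placeholder technique vectors (assumption): A and B are nearly identical.
technique_vecs = {
    "TECHNIQUE-A": [1.0, 0.0, 0.2],
    "TECHNIQUE-B": [0.98, 0.05, 0.21],
    "TECHNIQUE-C": [0.0, 1.0, 0.0],
}

print(overlapping_pairs(technique_vecs))  # [('TECHNIQUE-A', 'TECHNIQUE-B')]
```

Flagged pairs are candidates for review rather than confirmed duplicates, since two descriptions can be worded alike and still describe distinct techniques.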

What This Means For the Future of Threat Intelligence

A security analyst can now identify the areas of the ATT&CK framework where his or her organization is most vulnerable and connect those TTPs to CVEs. This is extremely helpful for prioritizing and patching vulnerabilities in a timely manner. Typically, it would take an analyst hours to identify and patch a vulnerability. With this new data model, this mitigation time could be dramatically reduced. Conversely, an analyst could begin an investigation with a CVE and identify which techniques are linked to that CVE, thereby also gaining access to the associated detection and mitigation strategies.

For analysts who prioritize speed, sifting through CVEs via cluster is much faster than reading through individual descriptions of multiple CVEs.

In spite of the limited scope of this exercise, we managed to uncover the strength of NLP as a means of adding structure to completely unstructured data. The next big step is to employ techniques that map largely varying pieces of unstructured data to a similar space so meaningful links between them can be established.

The use of NLP and deep learning techniques is essential if we are to make sense of unstructured threat intelligence data. The ATT&CK framework and this NLP research are an important step forward.

Interested in learning more?

Download our presentation slides, check out our codebase on GitHub, or reach out at hello@trustar.co.
