By Nicolas Ksieb and Ivan Aguilar
Machine learning (ML) has several interesting applications for cybersecurity, but getting started on your own can seem daunting at first. After all, you are teaching machines how to do the work of humans! But you don’t need an advanced degree in statistics to get started with ML. Once you roll up your sleeves and see it working for yourself, a lot of the complexity fades away.
In this blog, we’ll show you what can be achieved by tackling a simple, but relevant, threat intelligence question:
Which URLs are malicious, and which should be whitelisted?
This blog is definitely not the first to talk about an effective solution to the problem (check out this excellent KDNuggets post). What we have tried to do here is to demystify ML for cybersecurity analysts and provide step-by-step instructions on how to develop an ML application in Python and use statistical experiments to improve upon it.
A substantial number of cyber attacks occur when users click on malicious URLs. Security analysts devote a lot of time to investigating URLs and domains to determine whether they should be blocked or whitelisted. The task of identifying a few bad URLs in a large list can quickly become overwhelming when the majority should be whitelisted. It turns into the problem of finding a needle in a haystack.
Most people follow a brute force whitelisting approach to solve this. They create a repository of URLs known to be safe and perform a direct string match to detect them. This is how some ad-blockers prevent you from accessing malicious sites.
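The direct string match described above can be sketched in a few lines of Python. This is only an illustration: the whitelist entries and the `is_whitelisted` helper are made up for this example, and a real repository would be far larger and loaded from a maintained source.

```python
# Hypothetical whitelist; in practice this would be loaded from a
# maintained repository of known-safe URLs.
WHITELIST = {
    "www.example.com",
    "www.python.org",
}


def is_whitelisted(url, whitelist=WHITELIST):
    """Return True only if the URL exactly matches a whitelist entry."""
    return url in whitelist


print(is_whitelisted("www.python.org"))            # True
print(is_whitelisted("www.python.org/downloads"))  # False
```

The second call shows the weakness of exact matching: a URL that differs from a known entry by even a path segment is not recognized.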
Unfortunately, this approach can only get us so far given the large number of URLs security analysts have to process on a daily basis. We can do better with a machine learning approach, but first let’s look under the hood of ML.
A machine learning task can be performed using a number of procedures. In this example, we are going to be working with supervised learning (check out this excellent primer on ML). Supervised learning relies on having a ground truth that can be used to teach a model to predict the correct outcome. Within supervised learning, we will be looking at a classification problem. In classification, we are trying to group our data into bins and teach the model to put data we have not seen before into the correct bin.
This is where the power of machine learning comes into play: we basically use patterns learned from our historical dataset to build our model and then use this model to make predictions on previously unseen data.
Simply put, we are building an ML model that predicts whether a URL belongs to the whitelist bin or the non-whitelist bin. Since we are working within supervised learning, we need a trusted ground truth of labeled data. We use the label “1” to represent whitelist URLs and the label “0” to represent non-whitelist URLs. We have attached a small dataset of labeled URLs in a CSV file that you can use as we step through this blog.
Let’s Dive In
To follow along and reproduce this test, there are a few requirements. First, you need to have Python installed on your machine. Second, you need to install a few additional Python dependencies.
You can use pip to install the additional requirements:
pip install jupyter
pip install scikit-learn
pip install pandas
pip install numpy
pip install matplotlib
We include a Jupyter notebook, which is available for download here, where you can follow along. Our code uses Python 2.6, but with a few tweaks you can run it in Python 3.x. To start the Jupyter notebook, type this command in your terminal:
jupyter notebook
If this doesn’t work, make sure you have followed the instructions above to install the Python packages needed and try again.
Do the Heavy Lifting with Sklearn
Sklearn (scikit-learn) is an ML library written in Python. This library contains a variety of algorithms that can be used to build your models without requiring a PhD in statistics and computer science. Sklearn abstracts away a lot of the complexity required to get started with ML, and it’s easy to experiment with different models as you get more familiar with the techniques. For our example we used the Logistic Regression model.
How well your model performs depends largely on the quality of the data used for training, a reality of ML that is often overlooked. For the purpose of this blog, the data we have included is in CSV format. One row of data corresponds to one data point. There are two columns:
1. The text of the URL, which is a string.
2. The label of the data, either a 1 or a 0.
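As a rough sketch, a two-column file like this can be loaded with pandas. The column names `url` and `label` and the sample rows below are assumptions made for illustration; check the header of the attached CSV and adjust accordingly.

```python
import io

import pandas as pd

# Stand-in for the CSV file. The rows and the column names
# ("url", "label") are made up for illustration.
SAMPLE_CSV = """url,label
www.example.com,1
www.python.org,1
www.youaregettinghacked.ru,0
"""

# With a real file you would pass its path to pd.read_csv instead.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)                    # (rows, columns)
print(df["label"].value_counts())  # how many 1s and 0s we have
```

Checking the label counts early is worthwhile, since a dataset that is heavily skewed toward one label affects how the accuracy score should be read later on.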
So now that we have a dataset, how do we go about building a prediction model?
The first step is to decide on the features that will be used in our prediction model. The features are basically the inputs to the model that are used to predict the label. To obtain a good generalizable model (a model that works well on new data), a feature should correlate well with the label we are predicting. This is where domain knowledge is extremely helpful, so pull in your best security analysts to help you here. Ideally, we want to try and use information and patterns that we would manually look for when making the determination of whether a URL is a threat or not and embed them as features in our model.
A simple feature of a URL that most people take into consideration is its length. Another thing that most people look for when clicking on a link is its top-level domain (TLD). So one thing to check for is whether a specific TLD is contained in the URL. In this example we pick .ru as the substring we will be looking for in a URL.
With these two simple features we can form a vector, or a group of numbers that serves as a summary of our URL. For example, the URL ‘www.youaregettinghacked.ru’ would produce the vector [26, 1]. The 26 corresponds to the length of the URL, and the 1 tells us that ‘.ru’ exists in the URL.
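A minimal sketch of this featurization step might look like the following. The function name `featurize` is our own choice for illustration and is not taken from the notebook.

```python
def featurize(url):
    """Turn a URL string into the vector [length, contains '.ru']."""
    has_ru = 1 if ".ru" in url else 0
    return [len(url), has_ru]


print(featurize("www.youaregettinghacked.ru"))  # [26, 1]
print(featurize("www.example.com"))             # [15, 0]
```

Note that a plain substring check flags ‘.ru’ anywhere in the string, not only at the end of the hostname. That matches the simple feature described above, but a stricter TLD check would be a natural refinement.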
Now that we have featurized our data and have a vector representation of our URLs, we can train the model and start seeing results. To get an idea of how our model will generalize and perform with new data, we split our data into two sets: a training set and a test set.
The training set is used to train the model. The test set is then stripped of its labels, and we use our trained model to predict the outcome for these new URLs. These predictions are then compared to the true labels in the test data and used to determine the generalization potential of our model.
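The split-and-train step could be sketched with scikit-learn as follows. The URLs, the labels, the 75/25 split, and the `featurize` helper are all assumptions for illustration, not the contents of the attached dataset or notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def featurize(url):
    """Vector of [URL length, 1 if '.ru' appears in the URL else 0]."""
    return [len(url), 1 if ".ru" in url else 0]


# Made-up labeled URLs: 1 = whitelist, 0 = non-whitelist.
urls = [
    "www.example.com",
    "www.python.org",
    "www.wikipedia.org",
    "docs.example.org",
    "www.youaregettinghacked.ru",
    "free-prizes-now.example.ru",
    "login-verify-account.example.ru",
    "www.claim-your-reward.example.ru",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = [featurize(u) for u in urls]

# Hold out 25% of the data as a test set; stratify keeps both labels
# represented in each split, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

model = LogisticRegression()
model.fit(X_train, y_train)

# The model only sees the test features; the test labels stay hidden
# until we compare them with the predictions.
predictions = model.predict(X_test)
print(list(predictions), y_test)
```

A dataset this small only demonstrates the mechanics. Any score computed from two test samples says very little about how the model generalizes.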
Performance is measured using the accuracy score, which is the number of correctly predicted labels divided by the total number of test samples.
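To make the definition concrete, here is a small hand-written version of the calculation. The `accuracy` function is an illustrative helper; scikit-learn provides `accuracy_score` in `sklearn.metrics`, which computes the same quantity.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one test sample")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


# Three of the four predictions match the true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```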
Recall that we set aside a test set so we could see how the model behaves outside of training. Now that our model is trained, we can use that test set to see how it works with new data.
Here we can compare the predictions with the expected labels. If our model works well with the test data, its generalizability is looking good. If we are pleased with the test error rate, we can use the model to make predictions for URLs we haven’t seen before. To do this, we featurize the new data and process it exactly as we processed our training data, then feed it into the trained model. If the predictions match the labels we expect, we have a successful model on our hands.
Given that we are only extracting two features, we don’t expect to get the best performance, but it is an interesting place to begin. Given your domain knowledge, and following the methodology presented here, you could come up with additional features to increase the predictive capabilities of the model.
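Scoring previously unseen URLs could look roughly like this. Everything here is an assumption for illustration: the training URLs, the new URLs, and the `featurize` helper are made up, and the point is only that new data goes through the same featurization as the training data.

```python
from sklearn.linear_model import LogisticRegression


def featurize(url):
    """Same featurization used for training: [length, contains '.ru']."""
    return [len(url), 1 if ".ru" in url else 0]


# Made-up training data: 1 = whitelist, 0 = non-whitelist.
train_urls = [
    "www.example.com",
    "www.python.org",
    "docs.example.org",
    "www.youaregettinghacked.ru",
    "free-prizes-now.example.ru",
    "login-verify-account.example.ru",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = LogisticRegression()
model.fit([featurize(u) for u in train_urls], train_labels)

# New, unlabeled URLs are featurized exactly like the training data.
new_urls = ["www.wikipedia.org", "www.claim-your-reward.example.ru"]
new_X = [featurize(u) for u in new_urls]

predicted = model.predict(new_X)            # predicted label per URL
probabilities = model.predict_proba(new_X)  # columns follow model.classes_

for url, label, proba in zip(new_urls, predicted, probabilities):
    # proba[1] is the estimated probability of label 1 (whitelist).
    print(url, int(label), round(float(proba[1]), 3))
```

Looking at the probabilities rather than only the hard labels can help an analyst decide which URLs deserve a manual review.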
We were quickly able to take URLs and convert them into feature vectors. From here we trained a model and tested its predictive ability. We hope you were able to follow along and found this example to be informative.