*In this project summary, we walk you through how we built the machine learning algorithms behind our RAT prediction tool Project Splinter.*

**Problem statement**

The goal of this tool is to predict the RAT malware family that is most likely associated with the malware configuration settings and indicators of compromise (IoCs) that a user queries. The predictions are presented as a collection of probability assignments to different RAT malware families. In our approach, we cast this as a machine learning (ML) classification problem. I.e., given a set of input features, how accurately can we estimate the probabilities of the different classes (the malware families)? For the first version of this model, we used logistic regression as our ML model.

**Definitions**

*What are the types of RAT classes?*

In the Fidelis Barncat Intelligence Database, there exist 29 different class types: DarkComet, Gh0st, hw0rm, dridex, LuxNet, JBifros, NanoCore, jRAT, ShadowTech, Bozok, CyberGate, SpyGate, SmallNet, Xtreme, Poseidon, LostDoor, AlienSpy, njRat, PredatorPain, jSpy, JSocketv2.0, Pony, JSocketv1.2, sakula, VirusRat, JSocket, Trickbot, Gootkit, PoisonIvy.

*What are the features?*

The number of features is the product of the number of indicator types considered and the number of RAT classes: there is one feature per (indicator type, RAT class) pair. Each feature is a count of the correlations, through that indicator type, to samples classified into that RAT family.

For example, Figure 1 depicts a hypothetical group of correlated samples. The blue nodes in the graph are the malware samples, and the orange nodes are the IoCs. The relationship between the nodes is labeled “contains”: a sample contains IoCs. When two samples contain the same IoC, we say they are correlated. Likewise, we can say that an IoC is correlated to x samples. The correlation count and the indicator type are the cornerstone of the current Project Splinter ML model.

From the graph in Figure 1 we can extract the following. Sample 1 correlates to Samples 2, 3, and 4. Due to these correlations, the feature “imphash-njrat” will be set to 2, and “ip-darkcomet” will be set to 1. All other features of the form “ioc_type-RAT_name” will be set to 0.
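The feature construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the shortened lists `INDICATOR_TYPES` and `RAT_CLASSES`, the function `build_features`, and the IoC values are all hypothetical, and the example assumes that Sample 1 shares an imphash with two njRat samples and an IP address with one DarkComet sample.

```python
from collections import Counter
from itertools import product

# Hypothetical, shortened lists; the real model covers more indicator
# types and all RAT classes.
INDICATOR_TYPES = ["imphash", "ip", "mutex"]
RAT_CLASSES = ["darkcomet", "njrat", "nanocore"]

# One feature per (indicator type, RAT class) pair.
FEATURE_NAMES = [f"{t}-{r}" for t, r in product(INDICATOR_TYPES, RAT_CLASSES)]


def build_features(query_iocs, labeled_samples):
    """Count, per feature, the labeled samples correlated with the query.

    query_iocs: set of (ioc_type, value) pairs for the queried sample.
    labeled_samples: list of (family, set of (ioc_type, value)) pairs.
    """
    counts = Counter()
    for family, iocs in labeled_samples:
        # Indicator types through which this sample shares an IoC with the query.
        shared_types = {ioc_type for ioc_type, _value in query_iocs & iocs}
        for ioc_type in shared_types:
            counts[f"{ioc_type}-{family}"] += 1
    return [counts[name] for name in FEATURE_NAMES]


# Made-up IoC values for illustration only.
sample_1 = {("imphash", "abc123"), ("ip", "203.0.113.7")}
labeled = [
    ("njrat", {("imphash", "abc123")}),                   # Sample 2
    ("njrat", {("imphash", "abc123"), ("mutex", "m1")}),  # Sample 3
    ("darkcomet", {("ip", "203.0.113.7")}),               # Sample 4
]

features = dict(zip(FEATURE_NAMES, build_features(sample_1, labeled)))
print(features)
```

With these inputs, “imphash-njrat” is 2, “ip-darkcomet” is 1, and every other feature is 0.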

**Methodology**

**Establishing the Ground Truth**

The first step is to collect the data used for training. We use a dataset of around 200,000 malware samples that Fidelis has classified into the different RAT classes. These samples also contain the IoCs required to generate the features used in the ML model.

**Models**

We use Logistic Regression with L1 regularization as our model. For Logistic Regression, we need to tune the regularization parameter (in the sklearn ML library this parameter is called C and is the inverse of the regularization strength). We compare the results of two different models, one with fewer features and one with more. For simplicity, we call them the “simple” and the “complex” model.
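As a rough sketch of what such a model looks like in sklearn, the snippet below fits an L1-regularized Logistic Regression on synthetic count features and prints the per-family probabilities for one sample. The data, the three family labels, and the choice of the `saga` solver are assumptions made for illustration; the source does not say which solver or settings were actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 families, 6 count features (assumption).
families = np.array(["DarkComet", "njRat", "NanoCore"])
y = rng.integers(0, 3, size=300)

# Each family tends to have high counts in its own two feature columns.
rates = np.full((3, 6), 0.2)
for k in range(3):
    rates[k, 2 * k : 2 * k + 2] = 3.0
X = rng.poisson(rates[y])

# L1-penalized Logistic Regression; C is the inverse regularization strength.
model = LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=5000)
model.fit(X, families[y])

# Probability assigned to each family for the first sample, highest first.
probs = model.predict_proba(X[:1])[0]
ranking = sorted(zip(model.classes_, probs), key=lambda pair: -pair[1])
for family, prob in ranking:
    print(f"{family}: {prob:.1%}")
```

The output of `predict_proba` has the same shape as the tool's predictions: one probability per RAT family, summing to 1.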

**•** Simple Model: The model has 174 features and leverages correlation information from eight different indicator types: SHA1, CAMPAIGN, URL, IP, IMPHASH_MD5, MUTEX, SHA256, and MD5.

**•** Complex Model: The model has 303 features and leverages correlation information from 18 different indicator types: SHA1, ITEXT_MD5, RSRC_MD5, CAMPAIGN, IDATA_MD5, URL, IP, CODE_MD5, RELOC_MD5, BSS_MD5, RDATA_MD5, IMPHASH_MD5, MUTEX, DATA_MD5, TLS_MD5, SHA256, TEXT_MD5, and MD5.

Note that devising the simpler model with fewer features relied on security experts’ opinion on which indicator types are most valuable in correlating different samples.

**Model Selection**

The goal is to decide which model, the simple or the complex one, is better at predicting the RAT family. To do that, we draw a training and a test dataset from the data. The training dataset consists of 15,000 samples chosen at random from the 200,000-sample dataset, and the test dataset consists of 5,000 samples. We perform 10-fold cross-validation (CV) over the training dataset for both the simple and the complex model and obtain the following misclassification error rates (a misclassification is predicting RAT A when the true RAT is B).

**•** Simple Model: C is varied from 0.5 to 1.5, and the average CV misclassification error rate obtained is 0.07%.

**•** Complex Model: C is varied from 0.5 to 1.5, and the average CV misclassification error rate obtained is 0.13%.
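The selection procedure above can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic and much smaller than the real dataset, the helper `cv_error_rates` is hypothetical, and the grid of three values between 0.5 and 1.5 is a guess, since only the range of the regularization parameter is given.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def cv_error_rates(X, y, c_values, folds=10):
    """Return the mean k-fold CV misclassification rate for each value of C."""
    errors = {}
    for c in c_values:
        model = LogisticRegression(penalty="l1", C=c, solver="saga", max_iter=5000)
        accuracy = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        errors[c] = 1.0 - accuracy.mean()
    return errors


# Synthetic stand-in for the labeled dataset (assumption): 3 classes,
# 3 count features, each class with high counts in its own column.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=400)
rates = np.full((3, 3), 0.2)
rates[np.arange(3), np.arange(3)] = 3.0
X = rng.poisson(rates[y])

# Draw disjoint training and test sets (scaled down from 15,000 / 5,000).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=300, test_size=100, random_state=0
)

errors = cv_error_rates(X_train, y_train, c_values=[0.5, 1.0, 1.5])
average_error = sum(errors.values()) / len(errors)
print(f"average CV misclassification rate: {average_error:.2%}")
```

Running the same function on the feature matrices of two candidate models and comparing the averages mirrors the comparison reported here.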

The lower CV error rate of the simple model suggests that correlations coming from the subset of indicator types SHA1, CAMPAIGN, URL, IP, IMPHASH_MD5, MUTEX, SHA256, and MD5 are more valuable than having the complete set of indicator types in our model. Now, let’s look at the test error rate to see how well both models generalize to unseen data.

**Test prediction errors**

We compute the test results for the models with regularization parameter C = 1:

Simple Model: the test misclassification error was 0.12%; the model mispredicted 6 of the 5,000 samples in the test dataset. Below, we show the misclassified samples:

Predicted RAT: LuxNet with probability (w.p.) 49.9410463689, True RAT: NanoCore

Predicted RAT: NanoCore w.p. 50.4084835459, True RAT: njRat

Predicted RAT: njRat w.p. 10.1316077237, True RAT: SmallNet

Predicted RAT: DarkComet w.p. 16.7329389289, True RAT: sakula

Predicted RAT: DarkComet w.p. 16.7329389289, True RAT: sakula

Predicted RAT: DarkComet w.p. 16.7329389289, True RAT: Gootkit

Complex Model: the test misclassification error was 0.2%; the model mispredicted 10 of the 5,000 samples in the test dataset. Below, we show the misclassified samples:

Predicted RAT: NanoCore w.p. 31.0054752792, True RAT: VirusRat

Predicted RAT: njRat w.p. 3.29093799649, True RAT: sakula

Predicted RAT: njRat w.p. 3.29093799649, True RAT: Gootkit

Predicted RAT: DarkComet w.p. 47.4328208435, True RAT: PoisonIvy

Predicted RAT: DarkComet w.p. 52.4183094918, True RAT: njRat

Predicted RAT: NanoCore w.p. 63.5635649383, True RAT: njRat

Predicted RAT: njRat w.p. 12.792934929, True RAT: SmallNet

Predicted RAT: LuxNet w.p. 46.4253343922, True RAT: NanoCore

Predicted RAT: CyberGate w.p. 91.0936173985, True RAT: Xtreme

Predicted RAT: njRat w.p. 3.29093799649, True RAT: sakula

Again, the simple model “beats” the complex one in terms of accuracy, as it achieves a lower test misclassification error rate.
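The test error rates follow directly from the counts of mispredicted samples. The sketch below shows one way to compute the rate and collect the misclassified samples. The helper `misclassification_report` is hypothetical, the probabilities are rounded from the simple model's list above, and the 4,994 correctly predicted samples are placeholders.

```python
def misclassification_report(true_labels, predicted_labels, predicted_probs):
    """Return the error rate and the (predicted, probability, true) misses."""
    misses = [
        (pred, prob, true)
        for true, pred, prob in zip(true_labels, predicted_labels, predicted_probs)
        if pred != true
    ]
    return len(misses) / len(true_labels), misses


# The six misses reported for the simple model (probabilities rounded).
simple_misses = [
    ("LuxNet", 49.94, "NanoCore"),
    ("NanoCore", 50.41, "njRat"),
    ("njRat", 10.13, "SmallNet"),
    ("DarkComet", 16.73, "sakula"),
    ("DarkComet", 16.73, "sakula"),
    ("DarkComet", 16.73, "Gootkit"),
]

# Placeholder entries standing in for the 4,994 correct predictions.
n_correct = 5000 - len(simple_misses)
true_labels = ["njRat"] * n_correct + [true for _, _, true in simple_misses]
predicted_labels = ["njRat"] * n_correct + [pred for pred, _, _ in simple_misses]
predicted_probs = [100.0] * n_correct + [prob for _, prob, _ in simple_misses]

rate, misses = misclassification_report(true_labels, predicted_labels, predicted_probs)
print(f"{rate:.2%}")  # 0.12%
for pred, prob, true in misses:
    print(f"Predicted RAT: {pred} w.p. {prob}, True RAT: {true}")
```

The same calculation with 10 misses out of 5,000 gives the 0.2% reported for the complex model.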

**Conclusion**

We see that both models, simple and complex, using Logistic Regression with C = 1 achieve very accurate results. Since the simple model is faster and easier to use in production, we chose the simple one. An interesting conclusion from this comparison is that the intuition of security experts passes the statistical test. Indeed, the correlation information coming from a subset of “more important” indicator types (Simple Model) had higher predictive power than correlations from a much larger number of indicator types (Complex Model).