Review: Intrusion Detection & Prediction using Sequence Modeling

Gaurav Sarraf
16 min read · Sep 7, 2021
Photo by Cloudflare

Hey folks! This article is about a paper I published on a project I was working on in early 2020. The primary objective of the project was to develop a host-based Intrusion Detection System (IDS) that analyzes the system calls made by the kernel of an Ubuntu server. The system uses a Recurrent Neural Network (RNN) with autoencoders to predict the system calls the kernel will make next, effectively not just detecting but also predicting an intrusion attempt. Intrusion detection involves heavy analysis and pattern identification of log files, which is essentially the definition of Machine Learning (ML) and Deep Learning (DL); hence ML/DL has been used in IDS and IPS systems for quite some time now. To develop the system, I used the Google TensorFlow framework with the ADFA intrusion dataset. Let's get our hands dirty!

Securing network infrastructure is integral because of the vast amounts of financial transactions and personal information at stake. An IDS should encompass availability, integrity, confidentiality, accountability, and assurance of data as its core ideology; many of today's implementations lack two or more of these properties and thus fail to fully protect their assets. Almost all intrusions fall into four classes, namely probing, Denial of Service (DoS), User to Root (U2R), and Remote to Local (R2L) attacks, which can be stopped by static firewalls or by deploying more dynamic solutions such as network-based or host-based intrusion detection systems.

IDSs are of two types. First are network-based systems, which analyze all the traffic flowing to and from the different hosts on the network; they are usually installed on network borders such as routers and managed switches. Second are host-based systems, which are installed on the user's computer itself, usually on every computer of a network, and can flag peculiar network packets that originate from inside the local network. These systems are further divided into signature-based and anomaly-based detection systems. Signature-based detection systems have a database of attack patterns; every packet of network traffic is compared against the saved patterns to detect abnormal behavior. These systems, though easy to use, are useless when they encounter a zero-day attack, and their databases require detailed knowledge of each intrusion and frequent updates. Anomaly-based detection systems analyze network behavior, which is either defined by the network admins or learned automatically by the system during its training phase from datasets. These systems have precise rules of normal/abnormal behavior. Anomaly-based systems do a better job of detecting unknown attacks, but the quality of the rule set determines how well the system performs, which depends on the expertise of the network admins and often leads to high false-alarm rates.


Traditional kernel system-call (SC) analysis for intrusion detection was less stable and did not capture the meaning of the calls, since only the frequency of calls was studied. A possible solution is to build a prediction model using an end-to-end neural network that predicts SCs from the requests made during an attack. Human language and modern computer SC sequences have a surprising amount in common. Natural Language Processing (NLP) inherently requires semantic understanding of language; learning words together with their meanings leads to fruitful results, and this process has been tried and tested in many models. The industry's choice for this kind of processing is invariably some variant of RNN, as they are quite efficient at sequential problems. NLP problems such as question answering, translation, and word-to-vector embedding make use of sequence-to-sequence (seq2seq) auto-encoding frameworks. The advantage of this method is that it not only identifies malicious sequences in real time but can also predict the sequence of future SCs likely to be executed during an attack. This project shows how a language model can be used to predict intrusions of all kinds, including zero-day attacks, with the aim of a lower false-alarm rate than other AI-based IDSs.

Related Work

Recurrent Neural Networks (RNN):

While dealing with sequential learning problems, RNNs are an obvious choice. They have generated exceptional results in problems such as image captioning, speech synthesis, music generation, video analysis, and musical information retrieval. They are exceptionally well suited where the raw underlying features are not individually interpretable, as in machine perception tasks. An RNN model takes a sequence as input and produces a sequence as its target; this makes it a seq2seq model. Although initially designed for temporal sequence structures, the model works equally well on non-temporal data, making it even more powerful.


All of these features are due to the introduction of cycles in the RNN's computation graph, giving it "memory" to store information from previous states. In theory, a traditional RNN trained with Backpropagation Through Time (BPTT) can hold information in memory for arbitrarily long sequences. In practice, the model cannot train on long inputs because of exploding or vanishing gradients. This is called the long-term dependency problem, and it is solved to an extent by Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells, which can retain the important information of much longer sequences. Finally, this project takes a traditional language model and replaces words and sentences with kernel system calls.

Intrusion Detection:

Kernel system-call modeling has been studied by a few researchers, opening new scopes of research in the detection space. Recently, neural networks have been the way forward, showing significant advances in these systems. The introduction of the RNN language model outperforms all the other methodologies. However, all previous implementations used the KDD Cup 99 dataset, which restricts a model's capabilities significantly.

Intrusion Prediction:

There has been a considerable amount of research on prediction systems; Hidden Markov Models (HMMs) are most researchers' choice for observing previous information and predicting the next probable stage of an intrusion. All methodologies using HMMs share a fundamental problem: dependence on limited sequences. The probability of the next event is computed from only a small number of short steps, which eventually leads to loss of critical information and significantly affects the prediction result. HMMs also tend to converge to a local optimum, which makes it challenging to obtain high accuracy.

System-Call Modelling

Approach Outline:

SC modeling is the approach proposed by this project: we use the seq2seq architecture from machine question-answering models to generate the SCs of a Linux server, treating previously invoked sequences as questions and generated sequences as answers. The model is divided into two distinct components for the sake of simplicity.

The first component is a recurrent auto-encoding seq2seq model, designed once with LSTM units and once with GRU units; results from both are compared in the coming sections. These models are auto-encoding in nature, which helps denoise the data, and they incorporate an attention mechanism to improve accuracy. The second component aims to improve the performance of multiple classification algorithms by extending the input information: the classifiers see not only the invoked SCs but also the predicted ones.

To test the system, attacks were carried out from a computer on the local network running Kali Linux, which ships with the tools needed to carry out attacks similar to those in the dataset. To test unknown or zero-day attacks, tools exploiting vulnerabilities discovered after 2017 were also used; since the dataset was released in 2014, this simulates a zero-day scenario.

Prediction Model:

Prediction model

Sequence-to-Sequence Model: Question-and-answer models are the inspiration for this approach: we treat the sequence of invoked SCs as the question and the predicted sequence as the answer. These models deploy an RNN language model, which is what we use here as the generative model. The language model lets us generate semantically correct sequences of SCs based on the input sequence. Every sequence-to-one or sequence-to-many model is essentially an elaborate implementation of the encoder-decoder framework, and like any question-and-answer system, this is a many-to-many mapping between input and output. Just as in human conversation, we first understand the input sentence (the 'source' question) and only then frame an output sentence (the 'target' answer). The framework does the same: it takes the SC sequence as the source and generates a target SC sequence. Obviously, we first need to know the words of a language before making sentences; likewise, our model first needs a vocabulary of SCs before it can generate sequences of calls.

sequence-to-sequence prediction model

We create three sets, as shown above. Let S = {1, 2, 3, …, n} be the vocabulary of all SCs of the OS, X = {x_1, x_2, x_3, …, x_n} the source SC sequence, and Y = {y_1, y_2, y_3, …, y_m} the target SC sequence, where every x_i, y_i ∈ S. The encoder uses these sets to generate a hidden state at each instance of computation, given by the formula:

h_t = f(h_{t−1}, x_t), where f is the recurrent unit and h_t is the hidden state at step t.

The decoder is technically a symmetrical copy of the encoder, making it an RNN as well, which helps it generate a target sequence based on the context. The generation of each SC is guided by the encoder's hidden states and the decoder's state at that instance.
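To make the encoder-decoder idea concrete, here is a minimal sketch in TensorFlow/Keras (the framework the project uses) of a seq2seq model over system-call IDs. The vocabulary size, embedding width, and layer names are my illustrative assumptions, not values from the paper; only the 256 hidden nodes match the configuration described later.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB = 340      # assumed size of the system-call vocabulary (plus specials)
EMBED = 64       # assumed embedding width
HIDDEN = 256     # hidden nodes per layer, as stated in the training section

# Encoder: reads the source SC sequence and summarizes it in its final state.
enc_in = layers.Input(shape=(None,), dtype="int32", name="source_calls")
enc_emb = layers.Embedding(VOCAB, EMBED)(enc_in)
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(enc_emb)

# Decoder: an RNN mirroring the encoder, generating the target SC sequence
# conditioned on the encoder's final state (the "context").
dec_in = layers.Input(shape=(None,), dtype="int32", name="target_calls")
dec_emb = layers.Embedding(VOCAB, EMBED)(dec_in)
dec_seq = layers.LSTM(HIDDEN, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(VOCAB)(dec_seq)   # a score for every possible next call

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```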

LSTM and GRU: LSTM and GRU units were introduced to solve the vanishing gradient problem in RNNs and effectively increase the learning capacity of these models, helping them remember information about the training sequence over very long spans. There is no concrete evidence that either LSTM or GRU is better than the other, hence I compare both. Both replace the simple activation function of an RNN with a unit called a cell. These cells generate an output at every time step, which is usually fed as input to the next step. An LSTM unit consists of three basic gates: the input gate (i), output gate (o), and forget gate (f). They perform element-wise multiplication operations; the gates are sigmoid functions that help the cell decide how much incoming information should be held and how much should be forwarded to the next cell.

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c · [h_{t−1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)

These parameters are set for the first cell, and all the following cells' parameters are computed accordingly. GRU, on the other hand, has just two gates: a reset gate (r) and an update gate (z). The reset gate operates between the previous cell and the next cell, while the update gate decides how much information should be learned or updated at that time step. The likelihood of vanishing gradients is reduced because learning uses backpropagation through multiple bounded nonlinearities. LSTM exposes only a gated portion of its cell memory to other cells, while GRU exposes its whole cell state. LSTM has separate input and output gates, while GRU performs both operations with the help of its reset gate.
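Since the two model families differ only in the recurrent cell, a small factory makes the comparison easy to run. This is a sketch under the same assumptions as the previous snippet (VOCAB and the embedding width are illustrative); it builds a language-model-style network that predicts the next call at each step.

```python
from tensorflow.keras import layers, Sequential

VOCAB = 340   # assumed system-call vocabulary size

def build_model(unit_type="lstm", hidden_layers=3, hidden=256):
    """Identical stacks, with only the recurrent cell swapped."""
    cell = layers.LSTM if unit_type == "lstm" else layers.GRU
    model = Sequential([layers.Embedding(VOCAB, 64)])
    for _ in range(hidden_layers):
        model.add(cell(hidden, return_sequences=True))
    model.add(layers.Dense(VOCAB))   # next-call logits at every time step
    return model

lstm_model = build_model("lstm", hidden_layers=3)
gru_model = build_model("gru", hidden_layers=3)
```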

Variational Autoencoders: Now that we have a basic idea of the encoder-decoder framework: autoencoders try to recreate the input data at the output after performing the encoder-decoder operations. The encoder compresses the input sequence into a smaller set of data bits, and the decoder uses these bits to recreate the input sequence. The critical part of this process is the hidden layer, which represents the same information at a lower density; this is well suited to operations such as dimensionality reduction, which can shrink the set of features the model has to learn.

VAE Model

Autoencoders are a popular choice for anomaly detection: the model is trained on normal data, making it easier to detect anomalous deviations. An autoencoding network is a boon in our case, as we have significantly fewer anomalous instances than normal ones. To give the autoencoder generative properties, we use a Variational Autoencoder (VAE). This model learns latent variables of the input sequence: instead of making the model learn some function, we make it learn the probability distribution of the parameters of the training data. The only difference is that instead of encoding the data into a single vector, we train it to produce two vectors, one for the mean µ and one for the standard deviation σ of the latent distribution. VAEs can produce results comparable to or even better than Generative Adversarial Networks (GANs). A VAE with LSTM units may sometimes bypass the latent vector z (the "bypass phenomenon"), which can leave z encoding no valuable information.
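The µ/σ idea boils down to the reparameterization trick. Below is a minimal sketch of the sampling step; the latent dimension and the use of log-variance (for numerical stability) are my assumptions, not details given in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Draws z ~ N(mu, sigma^2) with the reparameterization trick, so the
    sampling step stays differentiable during training."""
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

LATENT = 32                                  # assumed latent dimension
h = layers.Input(shape=(256,))               # final encoder hidden state
mu = layers.Dense(LATENT)(h)                 # mean of the latent distribution
log_var = layers.Dense(LATENT)(h)            # log of the variance
z = Sampling()([mu, log_var])                # latent vector fed to the decoder
```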

VAE attention model

System Evaluation

The proposed model is validated on the dataset described below; this project uses three different techniques to verify the obtained results. We consider the Bilingual Evaluation Understudy (BLEU) score, the Term Frequency–Inverse Document Frequency (TF-IDF) score, and Cosine Similarity. This is done to make sure the predicted SCs are both syntactically and semantically correct, and to ensure the model does not merely reproduce the statistical occurrence of SCs but predicts SCs that actually make sense to the operating system. The prediction is usually consistent with attack-specific sequences; hence, we deploy multiple anomaly-detection classification algorithms on the predicted sequences, reassuring us that the proposed model produces sensible results.
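As a rough sketch of how these scores can be computed in Python: BLEU via NLTK, and a TF-IDF representation with cosine similarity via scikit-learn. In the paper the cosine similarity is taken between encoder hidden-state vectors; here, purely for illustration, I take it between the TF-IDF vectors. The two example sequences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical call sequences, rendered as space-separated call numbers.
target = "6 174 11 45 33 192 33".split()
predicted = "6 174 11 45 33 192 5".split()

# BLEU: weighted n-gram overlap between predicted and target sequences.
bleu = sentence_bleu([target], predicted,
                     smoothing_function=SmoothingFunction().method1)

# TF-IDF vectors of both sequences, then their cosine similarity.
vec = TfidfVectorizer(token_pattern=r"\S+")
tfidf = vec.fit_transform([" ".join(target), " ".join(predicted)])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"BLEU={bleu:.3f}  cosine(TF-IDF)={cos:.3f}")
```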

Datasets:

Many datasets were considered for this research, among them UNM, DARPA, TUIDS, CIDDS, and ADFA. ADFA-LD, published by the Australian Defence Force Academy, was chosen over the others because it is the most in line with the latest attacks and consistent with real-world network scenarios. The dataset is large enough for neural network training, with 6 types of attacks, 833 normal training traces, and 4372 validation traces; the table below gives a summary. We propose a host-based anomaly intrusion detection system on this dataset.
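For context, ADFA-LD traces are plain text files of space-separated system-call numbers, one trace per file. Here is a minimal loading sketch; the directory names follow the dataset's published layout, but treat them as assumptions.

```python
from pathlib import Path

def load_traces(folder):
    """Read every trace file under `folder` as a list of integer call IDs."""
    return [[int(tok) for tok in path.read_text().split()]
            for path in sorted(Path(folder).glob("**/*.txt"))]

normal = load_traces("ADFA-LD/Training_Data_Master")       # 833 normal traces
validation = load_traces("ADFA-LD/Validation_Data_Master")
attacks = load_traces("ADFA-LD/Attack_Data_Master")        # 6 attack types
```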

ADFA-LD summary

Learning Process:

Determining the learning rate is difficult and basically trial and error; it can be set to anything between 0 and 1. The lower the rate, the better the model learns, but training is very slow with underfitting as a demerit; the higher the rate, the faster it trains but the less accurate it is, with overfitting as a demerit. The hyperparameters include the learning rate, the dropout of the fitted model, the number of nodes in the hidden layers, and the initialization of the training model. In this experiment, 8 models are trained for prediction. Two types of RNN units were considered, LSTM and GRU, and each was trained in 4 different variants. The first variant has two hidden layers and a learning rate of 0.1; variants two, three, and four all have three hidden layers and learning rates of 0.1, 0.01, and 0.001 respectively. Training is stopped and assumed complete when the training loss no longer declines. The input dimension of the network is the same as the output and remains so due to the auto-encoding property. The model has 256 nodes in its hidden layers and a training batch size of 32.
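Put together, the eight configurations and the stopping rule map naturally onto a small loop with Keras EarlyStopping. This is a sketch: build_model is the hypothetical factory from earlier, the Adam optimizer is my assumption (the article does not name one), and the fit call is left commented since the training arrays are not defined here.

```python
import tensorflow as tf

# (unit type, hidden layers, learning rate) for the 8 trained variants.
CONFIGS = [(u, hl, lr) for u in ("lstm", "gru")
           for hl, lr in ((2, 0.1), (3, 0.1), (3, 0.01), (3, 0.001))]

# "Stop when training loss no longer declines" as an EarlyStopping callback.
stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=3,
                                        restore_best_weights=True)

for unit, hl, lr in CONFIGS:
    model = build_model(unit, hidden_layers=hl)       # factory sketched above
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(
                      from_logits=True))
    # model.fit(x_train, y_train, batch_size=32, epochs=100, callbacks=[stop])
```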

The ADFA dataset is marginally inadequate: for example, with a sequence length of 30 we have just 4000 input sequences, while an RNN needs longer sequences and a relatively large dataset. To compensate, we break the traces into smaller pieces and create datasets with sequence lengths of 20, 22, 25, and 30. This yields about 21,000 training sequences and 4,200 testing sequences. The data is now adequate to take advantage of longer sequences and extract attack features, which helps keep the system's false-alarm count low.
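A sketch of the splitting step: slide fixed-length windows over each trace for the four chosen lengths. The stride is an assumption; the article only says the sequences are broken into smaller pieces.

```python
def windows(trace, length, stride=1):
    """Yield every contiguous sub-sequence of `length` calls from one trace."""
    for start in range(0, len(trace) - length + 1, stride):
        yield trace[start:start + length]

# `normal` comes from the loading sketch above.
train_sequences = [w for trace in normal
                   for length in (20, 22, 25, 30)
                   for w in windows(trace, length)]
```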

Result Analysis:

BLEU is considered first; it is a comparative analysis of phrases generated by a model, counting the number of matching words at specific positions in a weighted fashion. The score lies between 0 and 1: the higher the value, the higher the similarity of the sequences. While this technique provides some advantages over older techniques, it has the big disadvantage of not accounting for the semantic information of the predicted sequence.

BLEU results when sequence length is 25

The figure above shows the BLEU score of all 8 models when the sequence length is 25. It clearly shows that increasing the number of hidden layers (HL) leads to better performance; we can also see that changing the learning rate (LR) does not yield significant improvements. We can hence conclude that a fairly high learning rate can be used to speed up learning and reduce processing time. Any further increase in HL or LR does not significantly improve results once other factors such as processing time are considered. We can also see that LSTM performs better than GRU, possibly because of the more precise gating of data in its units, which helps it remember more critical information. It is worth noting that GRU performs better with an LR of 0.001, but this configuration is not pursued since LSTM performs better with an LR of 0.01.

BLEU results for various sequence lengths. The figure on the left shows results of the LSTM RNN, and the figure on the right shows the GRU RNN

The figure above shows how the models respond to various sequence lengths: as the length increases, performance improves. This is an example of how an RNN improves as its input sequences grow longer. The effect does not hold for even longer sequences, however; the models stop improving beyond a length of 35. The drop in growth may be due to the memory limitations of RNN models: the model gets overwhelmed and forgets critical information, causing a fall in performance. It is therefore important to choose the length appropriately for best performance. The sequence length could be increased further if the number of HL were increased, but that comes at the cost of much higher processing time.

The TF-IDF scoring system extracts the keywords and compares them with the target sequence in a weighted manner. The algorithm is fed two sequences: the target sequence and the predicted sequence. The score lies between 0 and 1, where 1 indicates identical sequences. The figure below clearly shows that the predicted sequence is syntactically very close to the actual target sequence. TF-IDF is a well-known technique in data mining and large-scale information retrieval, and it assures us that the sequence similarities are genuine.

TF-IDF similarity score between predicted sequence and target sequence. The figure on the left shows results of the LSTM RNN, and the figure on the right shows the GRU RNN

A major demerit of the BLEU and TF-IDF systems is that they score based only on statistical similarities. The sequences may make sense statistically yet make no sense to the OS. For this, we need a correlational analysis of the target sequence and the predicted sequence, which is done with the Cosine Similarity score. It is a score between 0 and 1 computed over two encoding vectors: the hidden-state vector v1 of the prediction model and the target-sequence encoding vector v2. The figure below shows the cosine similarity between v1 and v2. It is now evident that the predicted and target sequences are both syntactically and semantically similar.

Cosine similarity score between predicted sequence and target sequence. The figure on the left shows results of the LSTM RNN, and the figure on the right shows the GRU RNN

The predicted sequence should be consistent and should make sense to the operating system. We tested this using various classifiers trained on the ADFA training data; this assures that the predicted sequences are functional and that the model has learned the critical information correctly. We test the sequences on CNN and Random Forest classifiers, chosen because they performed best among nine different classifiers trained on the same dataset. The Receiver Operating Characteristic (ROC) curve is used to measure performance, and the Area Under the Curve (AUC) gives the exact precision. A significant advantage of having a predicted sequence is that detection can be improved by combining the target sequence with the predicted one.
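A sketch of the classifier check with scikit-learn, using Random Forest and AUC. The feature representation here (per-trace call-frequency vectors) and the `traces`/`labels` variables are illustrative assumptions; the article does not spell out the features fed to the classifiers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

VOCAB = 340   # assumed system-call vocabulary size

def frequency_vector(trace):
    """Bag-of-calls feature: how often each call ID occurs in the trace."""
    vec = np.zeros(VOCAB)
    for call in trace:
        vec[call] += 1
    return vec

# `traces` is a list of call sequences and `labels` marks 0 for normal and
# 1 for attack (assumed to come from the loading sketch earlier).
X = np.array([frequency_vector(t) for t in traces])
y = np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)

clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```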

LSTM model HL-3, LR-0.01. The figure on the left shows the CNN classifier, and the figure on the right shows the Random Forest classifier.

This combination is hugely beneficial when we want to know the exact attack type. The target sequence alone may not carry enough information to classify the exact attack, but with the help of the predicted sequence we can amplify the characteristic features of an attack. For this test, only the LSTM model with three HL and an LR of 0.01 was considered, as it outperformed the other models in all the previous tests. In the figure above, the labelled ROC, predicted ROC, and extended ROC are compared; the extended ROC performs remarkably better than the target sequence alone, as the predicted sequence helps the classifier make a more precise decision.
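The "extended" input can then be as simple as concatenating the observed sequence with the model's predicted continuation before featurizing it. A sketch reusing the hypothetical frequency_vector from the previous snippet:

```python
def extended_features(observed, predicted):
    """Feature vector over the observed calls plus the predicted continuation,
    amplifying attack-specific patterns for the classifier."""
    return frequency_vector(list(observed) + list(predicted))

# The extended AUC is then computed exactly as above, but on
# X = [extended_features(t, predict(t)) for t in traces]   # predict() assumed
```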

Conclusion

To solve the problems faced by other researchers stated above, specifically the inability to predict intrusions, an RNN seq2seq language-model framework was adopted in this project. The proposed model can effectively predict SCs during an attack in real time. The results were validated via several techniques, assuring improvements over various anomaly-detection techniques. The project also shows how attack classification can be carried out by various algorithms in parallel with the SCs generated by the prediction model.

This project is one of the hardest I have ever worked on. The amount of learning I had to do to pull it off was insane; from start to finish it took me about 5 months to fine-tune every parameter and get it working. This project also helped me bag an internship position at the Indian Institute of Science in the network security space. Feel free to reach out to me on LinkedIn, or check out my profile on Google Scholar or ResearchGate.

Please refer to the list of references in the official publication link above. The rights of publication of this article, the research paper, and its contents lie with Gaurav Sarraf and Swetha MS only.

