Review: Spam Email and Malware Elimination

Gaurav Sarraf
15 min readSep 1, 2021

Greetings fellow cyber nerds! This article will be about a research paper I published in the first quarter of 2019. As an academic, it is imperative to not just work on improving existing technologies but to also publish your finds. This not helps validate findings, more importantly, it pushes the field of study forward, no matter how insignificant the contribution it may be. I have published three research papers as of September 2021 all in international conferences. I will be writing articles for all three of these papers. So stay tuned for that. Let's get to the paper now.

As the name of the paper already suggests, it talks about my findings on the bot I built to classify emails into Spam or Ham; if an email has an attachment, it also checks if the attachment is safe or not. All-in-all this is a classification problem. The bot's sole purpose is to detect and get rid of Spam in emails and malicious files as well. We all know, Google’s proprietary Spam detection algorithm using AI (Artificial Intelligence) is one of the best in the world. But, as we will see later in this article it has multiple problems, and its accuracy can be further improved.

To solve the problem of spam and malicious files classification, multiple techniques have been deployed over the years like pattern matching and statistical analysis. This did ok, but by today’s standards, it was mediocre at best. With the advancements in computer hardware, lowering of data storage costs, and faster processors; AI and more specifically ML (Machine Learning) was now a reality. Frameworks like TensorFlow and PyTorch have made it even easier for us hobbyists to play with these algorithms. So, let’s get technical.

A study in 2018 found that 59.6% of the emails sent over the internet were declared spam. The scale is so massive that with 74 Trillion emails sent, it contributed to about 4% of all global internet bandwidth at one point in time. This field of study is of great importance for security researchers as almost 45% of cyber attacks are carried out via emails. Needless to say, the benefits (resources as well) of improving this technology clearly out weight the amount of bandwidth and financial loss this causes on a daily basis. With such huge amounts of data available for scientists to analyze, ML is the right fit for such a tedious task. These algorithms do not just analyze the data but also extract features from the data to generate the model, hence helping computers to make educated guesses on unseen data with a significant amount of accuracy. In this paper, I have compared results from over ten different classification techniques. All were supervised learning algorithms, trained on labeled datasets.

Survey

A. k-Nearest Neighbor (kNN):

kNN is one of the most fundamental pattern recognition algorithms, used for classification and regression problems. The kNN algorithm trains its classifier by labeled datasets, therefore it is a Supervised Learning process. An object is
classified by a plurality vote of its neighbors, with the object being assigned a label most common among ‘k’ nearest neighbors. If k = 1, then the object is simply assigned a label of the nearest neighbor. This is a great example of instance-based learning or better known as Lazy-Learning, it deferrers all the computing until the classification. The distance between two data points is calculated by (not the only method) Euclidean Distance. the algorithm runs through the whole dataset calculating the distance ‘ d ‘ between the unknown object ‘x’, ‘k‘ stands for the closest points in the training data to ‘x‘ hence computing a set ‘A‘. ‘k‘ is usually an odd number to avoid tie situations. kNN works best with large and noisy datasets. The conditional probability is computed for each class, therefore when a new object is provided it classifies to 1 if ‘x’ is true and 0 otherwise.

Resources for kNN:

B. Decision Tree Learning:

  1. Decision Tree (DT): A decision tree classifier is a simple algorithm that asks simple carefully crafted questions about features of the test data. When the answer is computed, it is followed by another question. This is repeated with different data points over the computing cycle until the test object is given a specific class. We use a recursive approach to build the decision tree, employing a greedy strategy, based on the training data. When the tree is ready it is relatively computationally inexpensive to classify any test object. To determine the accuracy of test conditions we test our cases against the Gini Impurity Index.

Resources for Decision Trees:

2. Adaptive Boost (ADA): The concept of Adaptive Boost (a.k.a. Ada Boost) is quite simple, it is an ensemble classifier. It makes use of multiple weak classification algorithms to build a strong and robust classification model trained by noisy data. The idea is that a weak classifier might have a bad accuracy, but many of these misclassified objects are designated right weights in the next iteration assuring better results after every try. Therefore, it retrains a model iteratively by choosing accurate results from previous training. A random subset of data is chosen to train the classifier, the randomness reduces in later iterations.

Resources for ADA:

3. Random Forest (RT): Like Ada Boost, Random Forest is an ensemble classifier, they combine many weaker classifiers to get to the solution. Random Forest is essentially a combination of Ada Boost and simple Decision Tree, it creates a set of Decision Trees that randomly select data to train the classifier. The final class is computed by calculating the aggregates of these different trees. This is implemented in such a manner because a single tree might have noisy results but the combination of many trees helps us to reduce the noise significantly. Usually, a few hundred to a couple of thousands of trees are used to generate the final tree. The final class tree is derived from cross-validation and out-of-bag error or bagging process.

Resources for RT:

4. Extra Tree (ET): Extra Tree classifier is an extension of Random Forest, where extremely randomized trees are created. This is also an ensemble algorithm. There is essentially two difference when compared to Random Forest, first instead of taking a part of the training data, extra trees take the whole training data to train the classifier. Second, the splits or branches of the trees are not done randomly and from top to bottom. The cut points are generated randomly first and in a later iteration, the randomness reduces, and computed weights are used to calculate the final tree. The number of features to be considered are provided by us, by default this value is set at √𝑛.

Resources for ET:

C. Naïve Bayes Classifier:

Naïve Bayes classifiers are a part of “Probabilistic Classifiers” which is inspired by Bayes’ Theorem. These are the most widely used classifiers, they are highly scalable and can take a few to a couple of hundred variables or features. The major advantages of Naïve Bayes are that they require relatively small training datasets. This uses Bayesian Probability to compute the weights. Naïve Bayes is a collection of algorithms, we will be discussing them in detail
below.

  1. Gaussian Naïve Bayes (GNB): Say we have a dataset of continuous data attributes ‘x’, we first classify these values to a class based on Gaussian distribution, and then we calculate the mean and the variance of ‘x’.

Resources for GNB:

2. Multinomial Naïve Bayes (MNB): Multinomial Naïve Bayes considers feature vectors and the frequency of occurrence of these vectors over the computation cycle creating a histogram. This is mostly used for document classification. The generated model is regularized by Laplace Smoothing and a histogram is generated.

Resources for MNB:

3. Bernoulli Naïve Bayes (BNB): Bernoulli Naïve Bayes is similar to Multinomial Naïve Bayes but instead of computing the histogram of each word, this just look as the occurrence of a specific feature set in binary manner. This is especially famous in small text classification like SMS.

Resources for BNB:

D. Support Vector Machine-Linear Kernel (SVM):

Support Vector Machine (SVM) algorithm is a discriminative, supervised learning non-probabilistic binary linear classifier which is defined by a separating hyper-lapse in a 2D space. It is one of the most famous classifiers out there. To make an optimal classifier this algorithm considers four main characteristics which are Regularization, Gamma, Margin, and Linear Kernel. Regularization is a factor taking into consideration the amount of misclassification to be considered while computing the model. Gamma talks about how far the hyper-lapse created should be influenced by the training data. Margin is the measure of how far the separation line of the classifier model should be from the data.

Resources for SVM:

E. Gradient Boosting (GB):

Gradient Boosting is another example of an ensemble algorithm where instead of bagging here the process is called boosting. This is a combination of many different types of Decision Trees. When a part of the tree is computed, the correctly classified tree gets boosted in the next iteration. The goal of any supervised learning algorithm is to define a loss function.

F. Others:

Resource Allocating Network with Locality Sensitive Hashing is used as a classifier model to classify Spam, Malicious URLs, and Malicious Files. Feature Vectors are being computed by the Bag-Of-Words approach for all the parts of the email which is further processed by an outlier.

Proposed System

The bot will be deployed in users' Mail Server, services of which are enabled by default. We first check the text, ’ n ’ number of features are extracted which is then fed into a pattern analyzer. The pattern analyzer algorithm then compares the extracted text with the spam feature dataset generated by the ML classifier. A score (spam rating) is computed by the analyzer, if the ’spam rating’ is below the threshold it is classified as Ham otherwise it is classified as Spam. Before sending, it to the user's inbox, it is also checked if the email has an attachment. If no, then it is sent to the user's inbox else it is sent to the file analyzer.

The File analyzer first extracts the parameters and signature of each file. A similar process to text analyzer is applied here, the pattern matching algorithm compares the signatures with the feature set created by the malicious file ML classifier. The pattern matching is just a comparison of the selected features of the file under test and previously trained data by the classifier. This data is stored in a pickle file, where data is stored in a binary stream making processing and comparisons faster. If the file has similar features it is declared as malicious otherwise it is classified as legitimate. The malicious file is deleted and all the new features of the file are saved to our binary signature dataset improving our accuracy. If the email received passes the test, it is sent to the user's inbox. All the actions taken by the automated system are informed to the user as a mute notification.

Methodology

A. Obtaining Data:

The datasets mentioned in Table 1 were found to be the most comprehensive as per our research. There are approximately 26,000 spam and 19,000 ham emails in the test dataset. This dataset is partially processed, it does not have any stop words and lemmatization has already been done. The malicious file dataset has approximately 16,000 and 9,000 malicious and legitimate files respectively. They consist of files with over nine different file extensions. Each file has thirty-two parameters. In total, about 11,000 Mb of files and texts were processed to train our model.

Datasets

B. Pre-processing Data:

  1. Text Processing: Though our data has labels for each of its features, it still requires further processing and cleaning making it noise-free. Pre-processing is a must to retain only the most relevant features. In the dataset used we create dictionaries of words from the whole emails. After analyzing the frequency of occurrence of words, it was noted that approximately 3,000 words appeared 2,80,000 times. After further analysis, it was noted that some of the words from the dictionary were irrelevant and if removed would lead to much better results and lesser processing time. After the removal of these words, the frequency was approximately 7,650, which is a much better processed raw data to work on. After performing the above steps about 3,000 features were identified.
  2. File Processing: File processing is relatively straightforward. We first generate the parameters of the file under test. Then a parameter file is generated which has over 32 parameters. In the dataset used in this paper, these parameters are already computed and labeled as legitimate or malicious. Using all the 32 parameters does not result in any significant overall improvement. Needless to day identifying the right parameters is again a classification problem. Hence we deploy an Extra Tree classifier to extract all the useful features. The feature selection classifier identifies over fifteen different parameters, which carried the most weightage while gauging if the file is malicious or not. The table lists the fifteen parameters considered.

C. Training & Testing the Classifier:

The dataset is divided into train and test datasets which are in a 60:40 ratio respectively. Hyperparameters are tweaked to arrive at the most accurate model with the least false negatives and false positives. The training was completed as fast as 21.2 seconds to as slow as 23.4 minutes while comparing all ten classification techniques.

Results

After multiple runs of all the classifiers, we have achieved the following results. Testing reveals the false positives and false negatives of each algorithm, there were varying levels of success. This helps us understand how often the algorithm classified a ham email or a legitimate file as Spam or malicious respectively and how many spam email or malicious files were classified as ham and legitimate respectively. This itself gives us a very good understanding of the performance, but the data is huge and to make it easier to perceive we process the resulted father. We compute the precision and recall score of each algorithm. These figures give us an idea of the overall performance of the techniques used by using the false negatives and true positives. Quantifying these results becomes complicated with large datasets, precision and recall is the best-known method to do so.

Text Classification, Precision vs Recall: Higher is better
File Classification, Precision vs Recall: Higher is better

Precision and Recall are computed by the formulae respectively:

Finally, to get a better understanding of all the results we blend the results of Precision and Recall, this is an important figure as in the Data Mining field the F1 score has always been the aptest method to judge the performance of the model being studied. This scoring method consolidates all the results into one quantifiable value. The F1 score all of the algorithms computed by the formula:

F-Score: Higher is better

Conclusion

By conducting this comparative study we can successfully conclude by saying, classification can be carried out by any of the above-listed algorithms with varying levels of success. Our choice would be to use SVM for both text and file classification. The accuracy achieved by our model is about 99%, this assures less number of False Positive and False Negatives. By the deployment of just ML models, we can reduce the threat of E-mail related attacks significantly. This still does not assure all — round from the attacks on the internet, but will definitely make the jobs of malicious entities much more difficult. Attacks can still be conducted by files that are not passed to the victim via E-mail but through direct file download from various websites. Such a system can be for all files being downloaded from the browser. This bot can further be improved by adding a malicious URL classifier. Which would be able to classify a malicious link just by looking at the text of the URL adding an extra level of security. Since there is a huge amount of emails being transferred daily, we can also add MapReduce concepts enabling parallel processing on a distributed computing system which is the case with most Mail Servers in existence today.

That’s about it guys, if you reached the end congratulations, this was a long one! I will be posting writeups for my other papers soon. Go check out my Google Scholar profile here.

Please refer to the list of references in the official publication link above. Rights of publication of this article, research paper and it’s contents lies with Gaurav Sarraf and Swetha MS only.

--

--

Gaurav Sarraf

Security Engineer cum Researcher | Graduate Student @ Syracuse University | Space Enthusiast | bit.ly/gs-LinkedIn | bit.ly/gs-GitHub | thinkrobotics.in