Build a Spam Email Detection Program with ML

In this tutorial, we build a spam email detection program based on Machine Learning (ML). We use the Naïve Bayes Classification Algorithm in this program. 

Machine learning, loosely defined, is the process through which a computer system learns how to perform a task without being explicitly programmed to perform it. Its three main kinds are supervised learning, unsupervised learning and reinforcement learning. This tutorial uses the supervised learning approach.

In machine learning, the computer is fed examples of data. It then works out the relationships within the dataset and how to use them to accomplish its objective.

Specifically, we will use supervised learning to teach a computer how to distinguish between spam email and authentic email.

Supervised Approach of Machine Learning

In supervised learning, we start by providing the computer system with numerous examples of data. Each example contains selected features and a corresponding target value. We then expect the computer to find a mathematical model that represents the given dataset well. When we give it the features of a new example that is not in the dataset, it can use that model to predict the example's target value.

For example, suppose you provide a computer system with data on the sizes of houses and their corresponding prices. The expectation is that the computer finds a mathematical model that represents this data well. Then, when you provide it with just the size of a house, it can predict the corresponding price of that house.
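The house-price example can be sketched in a few lines. This is only an illustration: the sizes and prices below are made up, and a straight-line (least-squares) model is an assumption chosen for simplicity, not something the tutorial prescribes.

```python
# Toy data invented for illustration: sizes in square metres, prices in thousands.
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [100.0, 160.0, 200.0, 240.0]  # in this toy set, price = 2 * size


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": find the model that best represents the examples.
slope, intercept = fit_line(sizes, prices)

# "Prediction": estimate the price of a house size that is not in the data.
predicted_price = slope * 90.0 + intercept
```

The features here are the house sizes and the target values are the prices; the spam detector follows the same pattern with words as features and the spam/legitimate label as the target.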

Format for Data Organisation

For the task of building a spam email detection program, the data required will be ‘labelled’ emails: the data with which we will train the computer. In the supervised learning method we provide data that contains examples of known spam emails and known legitimate emails. We label each spam email in the dataset with a ‘1’ and each legitimate email with a ‘0’. You can see this demonstrated in Fig. 1 below:

Fig. 1: Labelled emails for spam email detection
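The training code later in this tutorial splits each line on whitespace and removes the last token, which suggests one email per line with the label as the final token. Assuming that layout (the sample emails and the helper name `parse_labelled_line` are made up for illustration), a line can be separated into words and label like this:

```python
def parse_labelled_line(line):
    """Split 'word word ... label' into (list_of_words, integer_label)."""
    tokens = line.split()
    return tokens[:-1], int(tokens[-1])


# Invented sample lines: 1 marks spam, 0 marks legitimate email.
sample_lines = [
    "congrats you won a free prize 1",
    "meeting moved to 3 pm tomorrow 0",
]

dataset = [parse_labelled_line(line) for line in sample_lines]
```

Reading the label from the last position only keeps digits inside the email body, such as the `3` in the second line, from being mistaken for a label.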

Supervised Machine Learning Algorithm

The supervised machine learning algorithm used in this tutorial is the Naïve Bayes Classification Algorithm. We use it to discriminate between spam emails and legitimate emails.

Naïve Bayes Classification

Naïve Bayes Classification is a probabilistic classification method. It applies Bayes' theorem with strong independence assumptions between the features of each example in the input data.

Bayes' theorem is a result in probability theory that describes the probability of an event occurring based on prior knowledge of conditions that might be related to that event. Fig. 2 shows the mathematical representation of Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Fig. 2
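A small numeric example may make Bayes' theorem more concrete. The probabilities below are invented for illustration and do not come from any real dataset: suppose 40% of emails are spam, and the word ‘free’ appears in 50% of spam emails but in only 5% of legitimate ones.

```python
# Invented probabilities, for illustration only.
p_spam = 0.4                # P(spam)
p_legit = 1.0 - p_spam      # P(legitimate)
p_word_given_spam = 0.5     # P('free' | spam)
p_word_given_legit = 0.05   # P('free' | legitimate)

# Total probability of seeing the word in any email.
p_word = p_word_given_spam * p_spam + p_word_given_legit * p_legit

# Bayes' theorem: P(spam | 'free') = P('free' | spam) * P(spam) / P('free')
p_spam_given_word = p_word_given_spam * p_spam / p_word
```

With these numbers, seeing the word raises the probability of spam from 0.4 to 0.2 / 0.23, or roughly 0.87.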

The term ‘strong independence assumptions’ means we assume that, within a given class, all the features of each example in the input data are independent of each other: the value of one feature tells us nothing about the value of another.

How to Apply Naïve Bayes Classification to Spam Email Detection

We have to determine whether a given email is a spam email or a legitimate email. To do this, we compute the probability of the email being a spam email. We then compare that to the probability of the email being a legitimate email.

The email is predicted to be spam if the probability that it is spam exceeds the probability that it is legitimate. If instead the probability that it is legitimate is higher, the email is predicted to be legitimate. Fig. 3 expresses this premise mathematically:

$$\hat{c} = \arg\max_{c} P(c \mid e)$$

Fig. 3

We can read and interpret the mathematical expression in Fig. 3 as ‘the class c of email that yields the highest probability given an email e’.

To compute the expression in Fig. 3, however, we have to apply Bayes' theorem. Doing so yields the expression in Fig. 4 below:

$$\hat{c} = \arg\max_{c} \frac{P(e \mid c)\, P(c)}{P(e)} = \arg\max_{c} P(e \mid c)\, P(c)$$

Fig. 4

Note that the features of each email example in the given dataset are the words in that email. Since Naïve Bayes classification assumes that the features of each example are independent of each other, the probability of the whole email given a class becomes the product of the probabilities of its individual words given that class. We can therefore rewrite the expression in Fig. 4 as follows:

 

$$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(W_i \mid c)$$

Fig. 5
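The expression in Fig. 5 can be evaluated directly, but multiplying many small probabilities tends to underflow, so implementations usually add logarithms instead; the code later in this tutorial also works with log priors. Below is a minimal sketch of that idea. The function name `class_score` and all the probabilities are assumptions made up for illustration.

```python
import math


def class_score(words, prior, word_probs):
    """Return log P(c) + sum of log P(W_i | c) over the words of one email."""
    score = math.log(prior)
    for word in words:
        score += math.log(word_probs[word])
    return score


# Invented per-class word probabilities.
spam_probs = {"free": 0.20, "prize": 0.10, "meeting": 0.01}
legit_probs = {"free": 0.02, "prize": 0.01, "meeting": 0.15}

email = ["free", "prize"]
spam_score = class_score(email, 0.4, spam_probs)
legit_score = class_score(email, 0.6, legit_probs)

# The class with the higher score is the prediction (1 = spam, 0 = legitimate).
predicted = 1 if spam_score > legit_score else 0
```

Because the logarithm is an increasing function, comparing the log scores picks the same class as comparing the original products.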

The P(c) in the expression in Fig. 5 is the prior probability of a particular class of email, estimated from the input data. That is, it is either the probability that an email is spam or the probability that an email is legitimate.

We estimate the probability of spam emails from the input data as follows:

$$P(\text{spam}) = \frac{\text{number of spam emails}}{\text{total number of emails}}$$

Fig. 6

Similarly, we estimate the probability of legitimate emails from the input data as follows:

$$P(\text{legitimate}) = \frac{\text{number of legitimate emails}}{\text{total number of emails}}$$

Fig. 7
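Both class probabilities come straight from counting labels. A minimal sketch, using a made-up list of five labels (1 = spam, 0 = legitimate):

```python
# Invented labels for five training emails.
labels = [1, 0, 0, 1, 0]

num_emails = len(labels)
num_spam = sum(1 for label in labels if label == 1)
num_legit = num_emails - num_spam

p_spam = num_spam / num_emails    # spam emails over total emails
p_legit = num_legit / num_emails  # legitimate emails over total emails
```

The two values always add up to 1, since every email in the dataset belongs to exactly one of the two classes.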

The P(Wi | c) in the expression in Fig. 5 is the probability of a particular word appearing in a particular class of emails. As an example, suppose we want to obtain the probability of the word ‘congrats’ in the class of spam emails. We first group all of the words from the emails in the input dataset that are marked as ‘spam’, count the number of times the word ‘congrats’ appears in this group, and then divide that count by the total number of words in the group.
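Turning the word count into a probability means dividing by the total number of words in the group. The sketch below shows this on two invented spam emails. It also shows add-one (Laplace) smoothing, which is an assumption on my part rather than something the tutorial specifies: without it, a word that never appears in a class gets probability zero and wipes out the whole product. The function name `word_likelihood` is hypothetical.

```python
from collections import Counter

# Invented spam emails, already split into words.
spam_emails = [["congrats", "you", "won"], ["congrats", "free", "prize"]]

# Group all words from the spam class and count them.
spam_words = [word for email in spam_emails for word in email]
spam_counts = Counter(spam_words)

# Plain estimate: occurrences of the word over all words in the class.
p_congrats = spam_counts["congrats"] / len(spam_words)


def word_likelihood(word, counts, total_words, vocab_size, alpha=1):
    """Smoothed estimate of P(word | class); alpha=1 is add-one smoothing."""
    return (counts[word] + alpha) / (total_words + alpha * vocab_size)


# Assume a vocabulary of 6 distinct words: the 5 above plus 'meeting',
# which appears only in legitimate emails.
vocab_size = 6
p_congrats_smoothed = word_likelihood("congrats", spam_counts, len(spam_words), vocab_size)
p_meeting_smoothed = word_likelihood("meeting", spam_counts, len(spam_words), vocab_size)
```

Here ‘congrats’ appears 2 times among 6 spam words, giving 1/3 unsmoothed and 3/12 smoothed, while the unseen word ‘meeting’ gets a small but non-zero 1/12.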

Your Spam Email Detection Program

For a robust spam email detection program:

1)  Ensure that all of the emails in the input dataset are labelled and that there is a clear distinction between ‘spam’ emails and ‘legitimate’ emails. In this tutorial, we use 1 to represent ‘spam’ emails and 0 to represent ‘legitimate’ emails.

2)  Compute the probability of each distinct class of emails, given the input dataset. That is, compute the probability of spam emails and compute the probability of legitimate emails. This satisfies the P(c) requirement of the mathematical expression in Fig. 5.

3)  Create separate data groups to contain all of the words of each distinct class of emails. That is, there should be one data group containing all of the words of the spam emails and a different data group containing all of the words of the legitimate emails. This helps us to satisfy the P(Wi | c) requirement of the mathematical expression in Fig. 5.

4)  When given a test email example, use the mathematical expression in Fig. 5 to compute the probability of the test email belonging to each of the ‘spam’ and ‘legitimate’ email classes. Compare the two probabilities obtained and then make your prediction.

 

Code for Training the Algorithm

import math

def train_naive_bayes(files_to_be_trained_on):
    # The count_words_* and likelihood_for_class* helpers are defined
    # elsewhere in the program and are not shown in this tutorial.
    count_for_class1 = {}   # word counts over spam emails
    count_for_class0 = {}   # word counts over legitimate emails
    vocab_all = {}          # word counts over all emails
    logprior = []
    likelihood_given0 = {}
    likelihood_given1 = {}
    num_of_documents = 0
    num_of_documents_for_class1 = 0
    num_of_documents_for_class0 = 0
    for path in files_to_be_trained_on:
        # Each line holds one email followed by its label (1 = spam, 0 = legitimate).
        with open(path, "r") as current_file:
            for line in current_file:
                line_list = line.split()
                # Skip blank lines and lines whose last token is not a label.
                if not line_list or line_list[-1] not in ('0', '1'):
                    continue
                # Read the label from the last token only, so that a '0' or '1'
                # inside the email body is not mistaken for it.
                label = line_list.pop()
                num_of_documents += 1
                if label == '1':
                    num_of_documents_for_class1 += 1
                    count_for_class1 = count_words_class1(line_list, count_for_class1)
                else:
                    num_of_documents_for_class0 += 1
                    count_for_class0 = count_words_class0(line_list, count_for_class0)
                vocab_all = count_words_all(line_list, vocab_all)
    # Log prior of each class, stored at the index equal to the class label.
    logprior.append(math.log(num_of_documents_for_class0 / float(num_of_documents)))
    logprior.append(math.log(num_of_documents_for_class1 / float(num_of_documents)))
    # Per-word likelihoods for each class.
    likelihood_given0 = likelihood_for_class0(vocab_all, likelihood_given0, count_for_class0)
    likelihood_given1 = likelihood_for_class1(vocab_all, likelihood_given1, count_for_class1)
    return logprior, vocab_all, likelihood_given0, likelihood_given1
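The training function calls several helpers (`count_words_class0`, `count_words_class1`, `count_words_all`, `likelihood_for_class0`, `likelihood_for_class1`) whose definitions are not shown in this tutorial. The following is a hypothetical sketch of what they might look like, inferred only from how they are called. Two assumptions to note: the likelihoods are returned as logarithms, because the testing code adds them to log priors, and add-one smoothing is used so that unseen words do not produce a logarithm of zero.

```python
import math


def count_words(words, counts):
    """Add each word in `words` to the running tally in `counts`."""
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# In this sketch the three counters share one implementation.
count_words_class0 = count_words
count_words_class1 = count_words
count_words_all = count_words


def log_likelihoods(vocab, likelihoods, class_counts):
    """Fill `likelihoods` with log P(word | class) for every word in `vocab`."""
    total_words = sum(class_counts.values())
    vocab_size = len(vocab)
    for word in vocab:
        # Add-one smoothing (an assumption, not confirmed by the tutorial).
        probability = (class_counts.get(word, 0) + 1) / (total_words + vocab_size)
        likelihoods[word] = math.log(probability)
    return likelihoods


likelihood_for_class0 = log_likelihoods
likelihood_for_class1 = log_likelihoods

# Small demonstration with invented counts.
demo_counts = count_words_all(["a", "b", "a"], {})
demo_loglik = likelihood_for_class1({"a": 2, "b": 1, "c": 1}, {}, {"a": 2})
```

The original helpers may well differ, for example by using separate logic per class or a different smoothing constant.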

 

Code for Testing the Naïve Bayes Classifier

def test_naive_bayes(testdocument, logpriors, classes, vocabulary, likelihood_given0, likelihood_given1):
    probabilities = []
    test = testdocument.split()
    for i in range(len(classes)):
        # Start each class score from its log prior.
        probabilities.append(logpriors[i])
        for word in test:
            # Words that never appeared in the training data are ignored.
            if word in vocabulary:
                if i == 0:
                    probabilities[i] += likelihood_given0[word]
                elif i == 1:
                    probabilities[i] += likelihood_given1[word]
    # Predict spam (1) only when its score is strictly higher;
    # treating a tie as legitimate (0) is an assumption.
    if probabilities[1] > probabilities[0]:
        return 1
    return 0
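To show how the pieces fit together at prediction time, here is a self-contained sketch with hand-made inputs in the same shape that the training function returns: a list of two log priors, a vocabulary dictionary, and one likelihood dictionary per class. The function `predict` is a condensed stand-in written for this example, not the tutorial's own function, and every number is invented.

```python
import math


def predict(document, logpriors, vocabulary, loglik0, loglik1):
    """Return 1 (spam) or 0 (legitimate) for a whitespace-separated document."""
    scores = list(logpriors)  # index 0 = legitimate, index 1 = spam
    for word in document.split():
        if word in vocabulary:  # words never seen in training are skipped
            scores[0] += loglik0[word]
            scores[1] += loglik1[word]
    return 1 if scores[1] > scores[0] else 0


# Invented model parameters.
logpriors = [math.log(0.6), math.log(0.4)]
vocabulary = {"free": 3, "prize": 2, "meeting": 4}
loglik0 = {"free": math.log(0.1), "prize": math.log(0.1), "meeting": math.log(0.8)}
loglik1 = {"free": math.log(0.5), "prize": math.log(0.4), "meeting": math.log(0.1)}

label_a = predict("free prize inside", logpriors, vocabulary, loglik0, loglik1)
label_b = predict("meeting tomorrow", logpriors, vocabulary, loglik0, loglik1)
label_c = predict("hello there", logpriors, vocabulary, loglik0, loglik1)
```

The first document scores higher as spam, the second as legitimate. The third contains only unknown words, so the decision falls back on the priors alone.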

 


David Sasu is a junior studying Computer Science at Ashesi University. He is passionate about understanding technology and using it to solve important problems. He is currently working on the creation of information systems for under-funded orphanages in his country, Ghana. He hopes to specialize in the fields of Artificial Intelligence and cybersecurity to enable him to create systems to help safeguard and improve the African continent.

