Many machine learning tasks require predicting the class of a given data point. You can make simple predictions of this kind with the Random Forest Algorithm, one of the most popular machine learning algorithms for classification.
The Random Forest Algorithm predicts the class of a given observation by building and using a ‘Random Forest’: a collection of several ‘decision trees’ that work together to produce a collective output.
A decision tree is a tree data structure that separates a given data sample into its distinct classes.
Consider the following example that further explains the concept of a decision tree:
In Fig. 1 there are seven characters. Some of these characters are red in colour while others are blue. Also note that some of the characters are underlined while others are not.
The colour of the characters in the data identifies the two distinct child nodes. So, we can observe in Fig. 2 that the red characters are placed in the left child node and the blue characters are placed in the right child node.
Although the data is split into two categories, certain features in the left child node differ. Some of the characters are underlined and others are not underlined.
Therefore, the data in the left child node of the root node are further split into two child nodes. The underlined characters are placed in the resulting left child node and the characters that are not underlined are placed in the resulting right child node. This can be seen below in Fig. 3:
In the formulation of decision trees, we always ask the following question at any given node in the tree:
Which feature in the data of this node is going to allow me to:
1) Split the data into groups that are as distinct from each other as possible?
2) Split the data into groups such that the data points within each resulting group are as similar to each other as possible?
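The splits shown in Figs. 2 and 3 can be expressed as a minimal Python sketch. The data below is hypothetical: each of the seven characters is represented as a (colour, underlined) tuple, and the `split` helper is an assumed name, not part of any library.

```python
# Seven hypothetical characters: (colour, underlined) tuples.
characters = [
    ("red", True), ("red", True), ("red", False), ("red", False),
    ("blue", False), ("blue", False), ("blue", False),
]

def split(data, feature_index, value):
    """Partition data into a group matching `value` and a group that doesn't."""
    left = [d for d in data if d[feature_index] == value]
    right = [d for d in data if d[feature_index] != value]
    return left, right

# First split on colour (Fig. 2)...
left, right = split(characters, 0, "red")
# ...then split the left child again on the underline feature (Fig. 3).
left_left, left_right = split(left, 1, True)

print(len(left), len(right), len(left_left))  # -> 4 3 2
```

Each call answers the question above for one node: the chosen feature and value are the ones that leave the resulting groups as pure as possible.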
Random Forest Classifier
The Random Forest Classifier makes simple predictions by performing the task of classification. It is based on the idea that multiple decision trees can operate as an ‘ensemble’: each tree in the Random Forest spits out a class prediction, and the class with the most votes becomes the Random Forest Classifier’s prediction.
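The voting step itself is simple to sketch in Python. The class labels below are made up for illustration; in practice each vote would come from a trained decision tree.

```python
from collections import Counter

# Hypothetical votes, one per decision tree in the forest.
tree_predictions = ["cat", "dog", "cat", "cat", "dog"]

# The forest's prediction is the most common class among the votes.
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(forest_prediction)  # -> cat
```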
Below in Fig.4 is an image showcasing how the Random Forest Classifier works:
Prerequisites for the Random Forest Classifier to perform well
To enable a Random Forest Classifier to perform at high levels of accuracy, two main prerequisites should be satisfied. They are:
1) The features of the dataset that are selected to discriminate between the data values should have real predictive power, so that the trees can actually separate the classes.
2) The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
You can achieve these two prerequisites through the application of two processes: “bagging” and “feature randomness.”
The Random Forest Algorithm uses “bagging” to make simple predictions. Bagging is the process of training each decision tree in the random forest on a random selection of data samples drawn from the given training dataset with replacement.
In the process of bagging, we are not drawing subsets from the training dataset and training each decision tree on a different subset. Rather, if we have a training dataset of size N, we train each decision tree on a dataset of size N. That dataset consists of data samples drawn at random from the training dataset with replacement.
Consider the following example: if our training dataset is [1,2,3,4,5,6], then we might train one of the decision trees in the random forest on the list [1,2,2,3,6,6]. Notice that this list is the same size as the original training dataset. What is the difference? It contains two ‘2’s and two ‘6’s, because the data samples used to form a decision tree’s training dataset are chosen at random and with replacement.
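Bagging maps directly onto Python’s standard library: `random.choices` draws with replacement. The `bootstrap_sample` name below is an assumption for illustration.

```python
import random

def bootstrap_sample(training_data):
    """Draw len(training_data) samples at random, with replacement."""
    return random.choices(training_data, k=len(training_data))

training_data = [1, 2, 3, 4, 5, 6]
sample = bootstrap_sample(training_data)

# The sample is always the same size as the original dataset, but may
# contain duplicates, e.g. [1, 2, 2, 3, 6, 6].
print(len(sample))  # -> 6
```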
In a normal decision tree, when you want to split a parent node into child nodes, you evaluate every possible feature in the training set and apply the feature that produces the most separation between the left and right child nodes.
However, in the random forest classifier, each decision tree can only pick from a random subset of the features of the training dataset. This brings about more variation among the decision trees in the model, which ultimately results in lower correlation across the decision trees in the random forest.
The image in Fig. 5, below, illustrates this idea of feature randomness.
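Feature randomness can also be sketched with the standard library. The feature names below are hypothetical, and the subset size of sqrt(number of features) is one common choice, not a requirement.

```python
import math
import random

all_features = ["colour", "underline", "font", "size", "weight", "style"]

def candidate_features(features):
    """Return a random subset of features for a tree to consider at a split.
    A common default size is the square root of the number of features."""
    k = max(1, int(math.sqrt(len(features))))
    return random.sample(features, k)

subset = candidate_features(all_features)
print(subset)  # e.g. ['size', 'colour'] -- differs on each call
```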
Putting It All Together
1) Obtain all of the training data that is going to be used to build decision trees in the random forest.
2) Determine the number of decision trees that you would like to have in the random forest.
3) For each decision tree in the random forest, apply the concept of bagging to obtain the dataset that would be used to train it.
4) Train the decision trees in the random forest with their respective datasets.
5) Once the decision trees are done training, obtain the test data sample and supply it to all of the decision trees in the random forest.
6) Count the number of occurrences of each distinct prediction.
7) The prediction with the greatest number of occurrences is provided as the final prediction of the random forest.
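The seven steps above can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions, not a full implementation: each “tree” here is a one-level stump that splits on a single randomly chosen feature (a stand-in for a real decision tree), and all names and the toy dataset are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Train one simplified 'tree': a single split on one randomly chosen
    feature (feature randomness). `sample` is a list of
    (features_dict, label) pairs produced by bagging."""
    feature = random.choice(list(sample[0][0]))
    value = sample[0][0][feature]

    def majority(labels, fallback):
        return Counter(labels).most_common(1)[0][0] if labels else fallback

    left = [label for feats, label in sample if feats[feature] == value]
    right = [label for feats, label in sample if feats[feature] != value]
    default = majority([label for _, label in sample], None)
    return feature, value, majority(left, default), majority(right, default)

def train_forest(training_data, n_trees):
    """Steps 1-4: bag a dataset of size N for each tree, then train it."""
    return [train_stump(random.choices(training_data, k=len(training_data)))
            for _ in range(n_trees)]

def predict(forest, features):
    """Steps 5-7: collect one vote per tree and return the most common."""
    votes = [left if features[feature] == value else right
             for feature, value, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy dataset: red characters belong to class 'A', blue ones to class 'B'.
training_data = (
    [({"colour": "red", "underlined": u}, "A") for u in (True, False)] * 3
    + [({"colour": "blue", "underlined": False}, "B")] * 3
)
forest = train_forest(training_data, n_trees=25)
print(predict(forest, {"colour": "red", "underlined": True}))
```

With a real decision-tree learner substituted for `train_stump`, this is the whole algorithm: bagging supplies each tree’s dataset, feature randomness decorrelates the trees, and majority voting produces the final prediction.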
Code for the Random Forest Classifier
The code for building a Random Forest classifier to make simple predictions can be written by following the code snippets in this Sweetcode post.
I hope that through this tutorial, you can now build your own random forest classifier to make simple predictions.