One of the most commonly used machine learning algorithms is decision tree learning. Compared to other machine learning algorithms, decision tree learning provides a simple and versatile tool for building both classification and regression predictive models. For a given set of features, the algorithm constructs a “tree”: each branch corresponds to a feature, and the leaves represent the target variable. The tree could look something like this:
This tree uses two features (Alcohol and OD280/OD315) to predict the class of each observation (0,1 or 2). As you can see, the algorithm is simple and easy to visualize, yet can build a powerful predictive model. The model can be scaled to large datasets without any performance loss. It requires little data cleaning before training, and is transparent to the data scientist (you can peek inside and interpret the inner workings). Despite the advantages, decision trees struggle to produce models with low variance. They tend to overfit the data (low bias and high variance). Fortunately, data scientists have developed ensemble techniques to get around this. In this article, I demonstrate how to create a decision tree using Python, and also discuss an extension to decision tree learning, known as random decision forests.
The general technique of random forest predictors was first introduced in the mid-1990s by Tin Kam Ho in this paper. The algorithm involves growing an ensemble of decision trees (hence the name, forests) on a dataset, and using the mode (for classification) or mean (for regression) of all of the trees to predict the target variable. In the paper, Ho implements a random selection of a subset of the features for each tree. In other words, the algorithm picks only some of the features (similar to how I picked Alcohol and OD280/OD315 out of 13 total features). It creates a tree using only these, and obtains a predicted class for each observation. This is repeated with a different set of features (picked randomly) n times. After n times, each observation has n values of our target variable (Class). These values are then either averaged (if numerical) or the mode is taken (if categorical).
Several years later, Leo Breiman improved on this algorithm. He published a paper that combines Ho’s approach and the concept of bootstrap aggregation, or “bagging,” and produced an algorithm that produces much better accuracy than traditional decision trees. Bootstrap aggregation is similar to Ho’s approach with random feature selection, except with the observation selection. For a given size of a training set m (number of observations), bagging creates k additional training sets, each of the size m by sampling uniformly and with replacement from the original training set. Sampling with replacement means that some of the observations within each additional set will be duplicates. The results of the k models are then averaged (for numerical) or the mode is taken (if categorical). The combination of Ho and Breiman’s approaches is what is known as the Random Forest learning algorithm.
Creating a decision tree
Before we create a decision tree, we need data. The data I’ll use to demonstrate the algorithm is from the UCI Machine Learning Repository. We will use the wine dataset. It contains 178 observations of wine grown in the same region in Italy. Each observation is from one of three cultivars (the ‘Class’ feature), with 13 constituent features that are the result of a chemical analysis.
import pandas as pd import matplotlib.pyplot as plt wine_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', \ 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315',\ 'Proline'] wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names = wine_names) wine_df = pd.DataFrame(wine_data) wine_df.Class = wine_df.Class - 1
We first import the pandas library and the wine dataset, then convert the dataset to a pandas DataFrame. The scikit-learn package has a nice set of decision tree functions that we can take advantage of (refer to the documentation for help with any of the functions that I use in my code). We also split the dataset into a test and training set, to properly determine the efficacy of the model.
from sklearn.model_selection import train_test_split Y = wine_df.loc[:,'Class'].values X = wine_df.loc[:,'Alcohol':'Proline'].values #we split the dataset into a test and training set train_x, test_x, train_y, test_y = train_test_split(X,Y , test_size=0.3, random_state=0) from sklearn import tree clf = tree.DecisionTreeClassifier(max_depth=2) clf = clf.fit(train_x, train_y)
The tree created with the training set looks like this:
The bottom row represents the “leaves” of the decision tree. The Gini index is a measure of impurity of the class — The closer to 0, the purer the leaf. For example, the first leaf correctly predicts the class of 44 out of 45 samples that satisfies the inequalities of the first and second branches (Color intensity and Proline). We then run our test set through the decision tree. In the example, I limit the tree depth to two levels for simplicity, but even with this limitation, the model correctly predicts the correct class ~87% of the time in the test set.
test_score = clf.score(test_x, test_y) test_score
This result is not entirely terrible, but we can improve it with the techniques proposed by Ho and Breiman.
Feature & tree bagging
One of the advantages of using the scikit-learn package is the ease with which we can implement Feature and Tree Bagging. The package has a built-in function that allows us to specify the number of additional training sets to create, as well as the number of random features to sample. To do this, run:
from sklearn.ensemble import RandomForestClassifier RF_clf = RandomForestClassifier(n_estimators=100, max_depth=2, max_features = 'sqrt') RF_clf.fit(train_x, train_y)
I specify 100 additional training sets (which equates to 100 decision trees, which is the default value) and the maximum number of features in each tree to the square root of the total number of features. You can refer to Breiman’s paper for his reasoning for using the square root, but it generally is the best choice in all cases to minimize the variance. I also keep the max depth at 2 for consistency. We can now run the test set through the trained model.
test_score = RF_clf.score(test_x, test_y) test_score
It is clear that these techniques drastically improve the test score, increasing it by nearly 10%. If you would like to improve the model even further, try increasing the max depth to a value higher than 2. Alternatively, you can reduce the number of estimators and observe how the test score changes.