In many real-life systems, the state of the system is strictly binary: a team wins or loses, a stock goes up or down, a patient has a disease or does not. A data scientist often encounters target variables that obey this kind of duality. For instance, individual health metrics can be used to diagnose certain medical conditions, such as heart disease, or to confirm a pregnancy. In these cases the answer is either 'yes' or 'no', 'true' or 'false'. Mathematically, this can be represented as a '1' or '0' and incorporated into a predictive model alongside other numerical health metrics. Several statistical models are built to predict binary variables like these; the logistic model is one of them.
The logistic model is built on the logistic function:

f(x) = L / (1 + e^(-k(x - x0)))

where e is the base of the natural logarithm, x0 is the x value of the sigmoid's midpoint, L is the curve's maximum y value, and k is the logistic growth rate. Plotted, the function traces an S-shaped (sigmoid) curve that rises smoothly from 0 toward L.
Note that the y values range from 0 to 1 (here, L is 1). In the examples listed above, the target variable is either 0 or 1, and the logistic function outputs the probability of being in either state. That probability is determined by a linear combination of all the predictor variables. This is binary logistic regression, but the model extends to targets with more than two categories: a probability is produced for each category, and the most probable one is chosen. This is known as multinomial logistic regression. The logistic function is not unique to logistic regression; neural networks have traditionally used it as an activation function as well.
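As a quick sketch of how these pieces fit together (the function names below are my own, not from the article's code), the logistic function squashes a score into a probability, and the multinomial case generalizes it with the softmax, which yields one probability per class:

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic function: L / (1 + e^(-k(x - x0)))."""
    return L / (1.0 + np.exp(-k * (x - x0)))

def softmax(scores):
    """One probability per class; the probabilities sum to 1."""
    exps = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exps / exps.sum()

# The standard sigmoid (L=1, k=1, x0=0) maps the midpoint to exactly 0.5
print(logistic(0.0))  # 0.5

# Softmax over three hypothetical class scores: the predicted class
# is simply the one with the largest probability (the argmax)
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())           # sums to 1 (up to floating point)
print(int(np.argmax(probs))) # 0
```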
When performing multinomial logistic regression on a dataset, the target categories should be nominal, not ordinal or ranked. The algorithm also typically produces the best results when the target has a small number of categories; with many categories, another machine learning algorithm may be better suited. For this example, we'll pick a dataset with three categories.
Implementing the Model
The data I’ll use to demonstrate the algorithm is from the UCI Machine Learning Repository. We will use the wine dataset. It contains 178 observations of wine grown in the same region in Italy. Each observation is from one of three cultivars (the ‘Class’ feature), with 13 constituent features that are the result of a chemical analysis. We will try to predict the correct cultivar of each wine based on a linear model created with the 13 features.
import pandas as pd
from sklearn.linear_model import LogisticRegression

wine_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
              'Magnesium', 'Total phenols', 'Flavanoids',
              'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity',
              'Hue', 'OD280/OD315', 'Proline']
wine_df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
    names=wine_names)
# Shift the class labels from 1-3 to 0-2
wine_df.Class = wine_df.Class - 1
We first import the pandas library (Python Data Analysis Library) and read the wine dataset from the UCI repository; read_csv loads it directly into a pandas DataFrame. I'll use the logistic regression implementation from the scikit-learn package (refer to its documentation for help with any of the functions used here). To properly determine the efficacy of the model, we split the dataset into a training set and a test set. This lets us train the model on one sample of the data and then evaluate it on data it has never seen. Since we are using a linear model, training converges the coefficients of each term toward the values that assign the highest probability to the correct class.
from sklearn.model_selection import train_test_split

Y = wine_df.loc[:, 'Class'].values
X = wine_df.loc[:, 'Alcohol':'Proline'].values

# Split the dataset into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.3, random_state=0)

clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
clf.fit(train_x, train_y)
clf.score(test_x, test_y)
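To make the "one probability per class" idea concrete, here is a self-contained sketch using predict_proba. I load scikit-learn's bundled copy of the same UCI wine dataset rather than the CSV URL, and omit the multi_class argument since multinomial is the default behaviour for the lbfgs solver in current scikit-learn, so the exact numbers may differ slightly from the article's:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_iter raised so lbfgs fully converges on the unscaled features
clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(train_x, train_y)

# One probability per class for each test observation;
# the predicted class is simply the most probable one.
probs = clf.predict_proba(test_x)
print(probs.shape)      # (54, 3): 54 test wines, 3 cultivars
print(probs[0].sum())   # each row sums to 1 (up to floating point)
print((probs.argmax(axis=1) == clf.predict(test_x)).all())  # True
```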
The score is the fraction of correct predictions on the test set. Ninety-four percent is pretty good for a first attempt. Feel free to change the solver argument of LogisticRegression; different solvers converge to different coefficients and can produce better scores. The full list of available solvers is in the scikit-learn LogisticRegression documentation.
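If you want to experiment with solvers as suggested above, one way is to loop over the solvers scikit-learn's LogisticRegression accepts. This sketch again uses the bundled wine data; I also standardize the features first, which helps the iterative solvers (sag, saga) converge, so the scores will not exactly match the article's:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Solvers in scikit-learn that support the multinomial loss;
# scaling the features first speeds up convergence for all of them.
for solver in ['lbfgs', 'newton-cg', 'sag', 'saga']:
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(solver=solver, max_iter=1000))
    clf.fit(train_x, train_y)
    print(f'{solver:>9}: {clf.score(test_x, test_y):.3f}')
```

Because the wine classes are well separated once the features are standardized, all four solvers should land on very similar accuracies here.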