Whether you are fairly new to data science techniques or a seasoned veteran, interpreting results from a machine learning algorithm can be a trying experience. The problem lies in having to make sense of the output of a given model, and to what degree the output can tell you how well the model performs against the training and test data. There are often a great many indicators, and they can often disagree. The biggest problem some of us have is just trying to remember what they mean. Some indicators refer to characteristics of the model while others refer to characteristics of the underlying data.
In this post, we will examine some particular indicators as a way to see if the data is appropriate to the model we chose. Why do we care about the characteristics of the data? What’s wrong with just stuffing the data into our algorithm and seeing what comes out? The problem is that there are literally hundreds of different machine learning algorithms designed to exploit certain tendencies in the underlying data. Just as differing weather conditions might call for different outfits, differing patterns in your data might call for different algorithms for model building.
In particular, certain models make assumptions about the data. These assumptions are key to knowing whether a particular technique is suitable for analysis. One commonly used technique in Python is linear regression. Despite its relatively simple mathematical foundation, linear regression is a surprisingly good technique and often a useful first choice in modeling. However, linear regression works best with a certain class of data. It is then incumbent upon us to ensure the data meets the required class criteria.
In this particular case, we’ll use the Ordinary Least Squares (OLS) method that comes with the statsmodels.api module. We are going to explore the mtcars dataset, a small, simple dataset containing observations of various makes and models of cars.
You can download the mtcars.csv here.
I’ll use this Python snippet to generate the results:
import os
import pandas as pd
import statsmodels.api as sm
## Setting working directory (adjust the path to wherever mtcars.csv is saved)
path = "C:\\Temp"
os.chdir(path)
## load mtcars from the working directory
mtcars = pd.read_csv("mtcars.csv")
## Linear Regression with One predictor
## Fit regression model
mtcars["constant"] = 1
## create an artificial value to add a dimension/independent variable
## this takes the form of a constant term so that we fit the intercept
## of our linear model
## we can then use the intercept-only model as our null hypothesis
X = mtcars.loc[:,["constant","am"]]
## mpg is our dependent variable
Y = mtcars.mpg
# create the model
mod1res = sm.OLS(Y, X).fit()
## Inspect the results
print(mod1res.summary())
Assuming everything works, the last line of code will generate a summary that looks like this:
The section we are interested in is at the bottom. The summary provides a number of measures to give you an idea of the data distribution and behavior in order to see if the data has the right characteristics to give us better confidence in the resulting model. Now, in a sense, we aren’t testing the data so much as the model’s interpretation of it. If the data is good for modeling, then our residuals will have certain characteristics. These characteristics are:
1. The data is “linear.” That is, the dependent variable is a linear function of independent variables and an error term, and is largely dependent on characteristics 2-4.
2. Errors (residuals) are normally distributed.
3. There is homoscedasticity: the errors have a constant variance across the range of the independent variables.
4. Finally, the independent variables are actually independent and not collinear.
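Of these, characteristic 3 can be checked directly from the residuals. The following is a minimal sketch of a Breusch-Pagan style test (the n times R-squared form), written with NumPy only. It assumes a model with a constant and a single predictor, so the statistic is compared against a chi-squared distribution with one degree of freedom. The function name and the toy residuals are made up for illustration and are not part of the mtcars analysis; statsmodels provides its own version as het_breuschpagan in statsmodels.stats.diagnostic.

```python
import math
import numpy as np

def breusch_pagan_one_predictor(resid, x):
    """Breusch-Pagan LM test (n * R^2 form) for a constant plus one predictor."""
    resid = np.asarray(resid, dtype=float)
    x = np.asarray(x, dtype=float)
    n = resid.size
    sq = resid ** 2                                  # squared residuals stand in for the error variance
    X = np.column_stack([np.ones(n), x])             # auxiliary design matrix: constant + predictor
    beta, *_ = np.linalg.lstsq(X, sq, rcond=None)    # regress squared residuals on X
    fitted = X @ beta
    ss_tot = np.sum((sq - sq.mean()) ** 2)
    ss_res = np.sum((sq - fitted) ** 2)
    r2 = 1.0 - ss_res / ss_tot                       # R^2 of the auxiliary regression
    lm = n * r2                                      # LM statistic
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(max(lm, 0.0) / 2.0))
    return lm, p_value

# Toy residuals (made up): the spread is three times larger when x == 1
x = [0, 0, 0, 0, 1, 1, 1, 1]
resid = [1, -1, 1, -1, 3, -3, 3, -3]
lm, p = breusch_pagan_one_predictor(resid, x)
print(f"LM = {lm:.3f}, p-value = {p:.4f}")
```

A small p-value here is evidence against constant variance; a large one means the test found no sign of heteroscedasticity.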
The results listed at the bottom specifically address those characteristics. Let’s look at each in turn.
Omnibus/Prob(Omnibus) – is a test of the skewness and kurtosis of the residuals (characteristic #2). We hope to see a value close to zero, which would indicate normality. Prob(Omnibus) is the p-value of that test: the probability of seeing an Omnibus statistic at least this large if the residuals really were normally distributed. We hope to see a large value here (the closer to 1, the less evidence against normality), while a small value, say below 0.05, suggests the residuals are not normal. In this case, Omnibus is relatively low and Prob(Omnibus) is relatively high, so the residuals are somewhat normal, but not altogether ideal. A linear regression approach would probably be better than random guessing, but may not be as good as a nonlinear approach.
Skew – is a measure of data symmetry. We want to see something close to zero, indicating that the residual distribution is symmetric, as a normal distribution would be. Note that this value also drives the Omnibus. This result has a small (and therefore good) skew.
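Kurtosis – is a measure of how heavy the tails of the residual distribution are, often loosely described as “peakiness.” The summary reports it on a scale where a normal distribution has a kurtosis of 3. Values well above 3 indicate heavier tails, that is, more extreme residuals (outliers) than a normal distribution would produce, while values below 3 indicate lighter tails. We want to see something close to 3. Like skew, this value also drives the Omnibus.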
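Durbin-Watson – tests for autocorrelation in the residuals, that is, whether each error is correlated with the one before it. Strictly speaking, this checks that the errors are independent of one another rather than homoscedastic (characteristic #3), so constant variance needs a separate check. The statistic ranges from 0 to 4: a value near 2 indicates no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. We hope for a value not far from 2; as a rough rule of thumb, values below 1 or above 3 are cause for concern. In this case, the value is within acceptable limits.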
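The Durbin-Watson statistic itself is simple to compute: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. Here is a minimal NumPy sketch with made-up residual sequences; the function name is illustrative, and statsmodels offers the same calculation as durbin_watson in statsmodels.stats.stattools.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Made-up residual sequences showing the two ends of the scale
print(durbin_watson_stat([1, 1, 1, 1]))     # 0.0: each residual repeats the previous one
print(durbin_watson_stat([1, -1, 1, -1]))   # 3.0: residuals flip sign at every step
```

Because the statistic compares neighbouring residuals, it depends on the order of the rows, which is why it is most meaningful for data with a natural ordering such as time.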
Jarque-Bera (JB)/Prob(JB) – is similar to the Omnibus test in that it tests both skew and kurtosis. We hope to see in this test a confirmation of the Omnibus test. In this case, we do.
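To make the link between skew, kurtosis, and Jarque-Bera concrete, here is a small sketch that computes all three from a residual vector using the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4). The function name and the sample values are invented for illustration. The p-value uses the fact that a chi-squared distribution with two degrees of freedom has the survival function exp(-x/2).

```python
import math
import numpy as np

def jarque_bera_by_hand(resid):
    """Return (skew, kurtosis, JB statistic, p-value) for a residual vector."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered ** 2)              # second central moment (variance, biased)
    m3 = np.mean(centered ** 3)              # third central moment
    m4 = np.mean(centered ** 4)              # fourth central moment
    skew = m3 / m2 ** 1.5                    # 0 for a symmetric distribution
    kurtosis = m4 / m2 ** 2                  # 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)            # chi-squared(2) survival function
    return skew, kurtosis, jb, p_value

# Made-up, perfectly symmetric residuals
skew, kurt, jb, p = jarque_bera_by_hand([-2, -1, 0, 1, 2])
print(f"skew={skew:.3f} kurtosis={kurt:.3f} JB={jb:.3f} Prob(JB)={p:.3f}")
```

Both departures from normality, asymmetry and tails that are too heavy or too light, push the JB statistic up and its p-value down.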
Condition Number – measures the sensitivity of a function’s output to small changes in its input (characteristic #4). When we have multicollinearity, small changes in the data can produce much larger fluctuations in the fitted coefficients. Hence, we hope to see a relatively small number, something below 30. In this case we are well below 30, which we would expect given our model only has two variables and one is a constant.
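As a rough illustration of why collinearity inflates this number, the sketch below computes the condition number of a design matrix as the ratio of its largest to smallest singular value. Both design matrices are made up for the example: one mirrors the constant-plus-binary layout used in this post, and the other adds a column that is almost an exact copy of another.

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest singular value of the design matrix."""
    singular_values = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(singular_values[0] / singular_values[-1])

# Constant plus a binary column, like the "constant" and "am" layout above
X_simple = np.column_stack([np.ones(4), [0, 0, 1, 1]])

# Constant, a predictor, and a near-duplicate of that predictor (made-up data)
x = np.arange(10.0)
x_near_copy = x + 1e-3 * np.tile([1.0, -1.0], 5)
X_collinear = np.column_stack([np.ones(10), x, x_near_copy])

print(f"simple design:    {condition_number(X_simple):.3f}")
print(f"collinear design: {condition_number(X_collinear):.1f}")
```

The near-duplicate column makes the smallest singular value tiny, which is what drives the ratio far past the rule-of-thumb threshold of 30.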
In looking at the results we see an okay (though not great) set of characteristics. This would indicate that the OLS approach has some validity, but we may be able to do better with a different model, such as a nonlinear one.
Data Science is somewhat of a misnomer because there is a great deal of art involved in creating the right model (so to speak). Understanding how your data “behaves” is a solid first step in that direction, and can often make the difference between a good model and a much better one.