The 10,000 Hour Rule claims that after 10,000 hours of practicing data science, you’re a Nate Silver-level genius. Whether it was complex problems or cool digital products that drew you to the field, the salary of a data scientist is definitely a plus. So, how many hours have you spent cleaning and modeling data? Is there a relationship between hours practiced and salary? (In other words, is time spent practicing a predictor of wage for data scientists?) Enter linear regression, the foundation of many supervised learning methods.

“The importance of having a good understanding of linear regression before studying more complex learning methods cannot be overstated.”

– James, Witten, Hastie & Tibshirani in An Introduction to Statistical Learning

Truth be told, if you’re interested in all the mathematical details of linear regression (which I strongly recommend), work on your linear algebra and get a good econometrics book. However, if you’re simply interested in applying regression to predict a quantitative variable, this intro will get you on your way.

Simple linear regression is pretty straightforward. We assume a linear relationship between the quantitative response Y and the predictor variable X. There are two coefficients in this model: the intercept and the slope. The intercept is the value of your prediction when the predictor X is zero. The slope is the marginal effect of increasing X by one unit.
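In other words, the model assumes the response is roughly intercept + slope × X, plus noise. A minimal sketch with hypothetical simulated data (not the dataset we use below) shows R’s lm function recovering both coefficients:

```r
# Hypothetical toy data: the true relationship is y = 2 + 3*x plus a little noise
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.1)

fit <- lm(y ~ x)
coef(fit)  # intercept close to 2, slope close to 3
```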

How is this relevant to the field of data science and machine learning? Well, in a real-life data science problem, you are often presented with a dataset that contains both the predictor and the response. Let’s say you work in marketing and have data on advertising spend and revenue, or you’re a scientist with a dataset containing car counts and particulate levels at any given moment. By using simple linear regression, you can quickly check whether there’s a linear relationship between the predictor and the response variable. In essence, predicting a variable based on other information is what supervised machine learning is all about.

Let’s get started. Although it’s not strictly necessary, you’ll have an edge if you know the fundamentals of linear regression. Since linear regression is considered the basis of a whole universe of statistical methods, any statistics, econometrics or data science MOOC or introductory book should suffice.

First, download the “countries of the world” dataset from Kaggle. This dataset contains some general facts about all the countries in the world. It is an aggregation of several datasets on multiple topics, all published by the US government.

```r
library(data.table)

# Load the data and give the columns short, workable names
world <- fread('countries of the world.csv', stringsAsFactors = T)
setnames(world, c('country','region','population','area','population_density',
                  'coastline','net_migration','infant_mortality','gdp','literacy',
                  'phones','arable','crops','other','climate','birthrate',
                  'deathrate','agriculture','industry','service'))

# Drop the region column; we won't use it
world[, region := NULL]

# The numbers use decimal commas; replace them with points, then convert to numeric
world <- world[, lapply(.SD, function(x) gsub(',', '.', x, fixed = T)), by = 'country']
world <- world[, lapply(.SD, as.double), by = 'country']
```

In the code above, we load the dataset into R, shorten the column names, remove an irrelevant column, replace the decimal commas with points, and convert the character columns to numeric. Most of these operations are done using the very efficient data.table package.

## Simple linear regression

```r
simple_model <- with(world, birthrate ~ infant_mortality)
lm_simple <- lm(formula = simple_model, data = world)
summary(lm_simple)
deviance(lm_simple)
```

In the chunk of code above, we define our linear model. Assume we want to predict the birth rate of every country in our list, expressed in births per 1,000 inhabitants per year. In case you don’t have a background in development economics: it is widely accepted that the birth rate of a country declines as infant mortality declines. As you can see from the first line of code, that’s how we define our model, which we store in the variable ‘simple_model.’ Next, we run the regression using the lm function and store the result in the variable ‘lm_simple.’

By using the summary function, we quickly get to see some popular statistics such as the F-statistic and R². To refresh your memory: the F-statistic tests whether the model as a whole explains a significant share of the variation in the response (and is mainly used when comparing models), while R² describes the proportion of variance in the dependent variable explained by the independent variable. Even more importantly, you will see whether your model parameters are statistically significant at the p = 0.1, 0.05, 0.01 and 0.001 levels.
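The summary is also an object you can work with programmatically. A self-contained sketch on R’s built-in mtcars data (the same calls work on lm_simple above):

```r
# Fit a simple regression on built-in data, then pull statistics out of the summary
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

s$r.squared   # proportion of variance explained
s$fstatistic  # F-statistic plus its degrees of freedom
coef(s)       # estimates, standard errors, t-values and p-values
```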

The deviance function prints the residual sum of squares (RSS) of the model: the total squared deviation of the observed data from the fitted values.
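For a linear model, deviance() is exactly that sum of squared residuals; a quick self-contained check on R’s built-in mtcars data:

```r
# deviance() on an lm fit equals the residual sum of squares
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(deviance(fit), sum(residuals(fit)^2))  # TRUE
```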

Since we’re doing inference, the confint function is useful too: it prints the confidence intervals of the model parameters.
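On our model the call would simply be confint(lm_simple); a self-contained sketch on built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)
confint(fit)                # 95% intervals by default
confint(fit, level = 0.99)  # wider 99% intervals via the level parameter
```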

```r
predict(lm_simple, data.table(infant_mortality = c(50, 100, 150)), interval = 'confidence')
predict(lm_simple, data.table(infant_mortality = c(50, 100, 150)), interval = 'prediction')
```

With the predict function, you can use the model to predict the birth rate for a given infant mortality. Pass the new values in a data frame or data.table whose column name matches the predictor used in the model definition. In the code chunk we use the function twice, but with different interval parameters: the first line prints the confidence interval for our infant mortality rates, and the second prints the prediction interval. (If you don’t know the difference, Stack Exchange to the rescue.)
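The gist of the difference: a confidence interval is for the average response at a given predictor value, while a prediction interval is for a single new observation, so the latter is always wider. A self-contained sketch on built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)
nd <- data.frame(wt = 3)

ci <- predict(fit, nd, interval = 'confidence')
pi <- predict(fit, nd, interval = 'prediction')

# The prediction interval is wider than the confidence interval
(pi[, 'upr'] - pi[, 'lwr']) > (ci[, 'upr'] - ci[, 'lwr'])  # TRUE
```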

```r
plot(simple_model)
abline(lm_simple)
```

We can visualize our regression model with a scatter plot and a trend line using R’s base graphics: the plot function and the abline function. The first takes the model definition (the formula), while the second takes the fitted model.

As we already knew from the summary of our model and the academic consensus, infant mortality is a very good predictor for birth rate. It doesn’t take a team of statistics PhDs to see that there is some non-linearity in our model. We can test this by plotting the distribution of the residuals using the hist function, or by looking at the diagnostic plots. Using the par function, you can see all four diagnostic plots at the same time.

```r
hist(lm_simple$residuals)  # distribution of the residuals
par(mfrow = c(2, 2))       # arrange the four diagnostic plots in a grid
plot(lm_simple)
par(mfrow = c(1, 1))       # restore the default layout
```

## Multiple linear regression

Simple linear regression models are, well, simple. However, nothing stops you from building more complex regression models. The following code generates a model that predicts the birth rate based on infant mortality, the death rate, and the share of agriculture in the economy. Unsurprisingly, there is some collinearity between infant mortality and the death rate, but let’s ignore this for educational purposes.

```r
multiple_model <- with(world, birthrate ~ infant_mortality + deathrate + agriculture)
lm_multi <- lm(formula = multiple_model, data = world)
summary(lm_multi)
deviance(lm_multi)
```

By using the + sign, you can add more predictors to your model; however, you should always check whether the added predictors are statistically significant. It’s also possible to add interaction terms by using the * sign.

```r
# Multiple linear regression with an interaction term
interaction_model <- with(world, birthrate ~ infant_mortality * agriculture + deathrate)
lm_interaction <- lm(formula = interaction_model, data = world)
summary(lm_interaction)
deviance(lm_interaction)
```
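In R’s formula syntax, a * b is shorthand for the main effects plus the interaction, i.e. a + b + a:b. A quick self-contained check on built-in mtcars data:

```r
# Both formulas fit the same model: main effects plus the interaction term
f1 <- lm(mpg ~ wt * hp, data = mtcars)
f2 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
all.equal(coef(f1), coef(f2))  # TRUE
```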

There are many more interesting facets of linear regression. Nevertheless, this small tutorial should be able to get you started. If you would like to know more about linear regression or tried and tested machine learning algorithms, please consult An Introduction to Statistical Learning, one of the most popular free machine learning introduction books out there.