Over the years, R has established itself as one of the world’s most widely used statistical programming languages and environments. It is used across many domains, including healthcare and sports, and has been widely adopted in academia. It is open source and runs on the three major platforms: Linux, macOS, and Windows.
With every second that passes in the digital age, an enormous amount of data is generated. When analyzed, that data is often the key to solving many of the problems we face. This makes predictive analysis vital, because it lets us use statistics to predict the outcome of events. Guess what: R has a plethora of built-in functions and packages that make building various predictive models seem effortless.
Now we will explore the basics of building a simple predictive model in R, with the help of RStudio. RStudio makes R easier to use by providing a clean, intuitive user interface, but everything can be done in plain R if you are a devout command-line junkie. You will need to download and install R and RStudio for your platform from their official download pages. We will use simple linear regression to find the relationship between age and vital capacity (the breathing capacity of the lungs).
Of course, we will need some data to work with. For this example, the dataset is found in the ISwR package, which can be installed in RStudio through the Install Packages option under the Tools menu. (An alternative is to use the console in RStudio.) To use the package, we’ll need to load and attach it.
# install package
install.packages("ISwR")
# load and attach package
library(ISwR)
View and clean data
Not all data comes in a clean, ready-to-use form. Some datasets contain values that, if not cleaned, yield wrong results. We will view a summary (summary(vitcap2)) and a tabular form (View(vitcap2)) of our dataset, vitcap2. A plot of the two variables (age and vital.capacity) may also help. Since our dataset contains no “unwanted” values, no cleanup is needed. (For larger datasets, subsetting may be necessary.)
# plot -- the dollar sign is used for extraction
plot(vitcap2$age, vitcap2$vital.capacity)
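Because vitcap2 needs no cleanup, the check itself is easy to skip over. As a rough sketch of what it could look like on messier data, the snippet below uses a small made-up data frame (`toy`, with invented values that are not from ISwR) containing one missing age and one impossible age. It counts the missing entries per column and then subsets the usable rows. The 0 to 120 age range is an assumption chosen for illustration.

```r
# Made-up illustration data, NOT taken from vitcap2
toy <- data.frame(
  age            = c(25, 40, NA, 310, 58),
  vital.capacity = c(5.1, 4.6, 4.9, 4.2, 3.8)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep rows that are complete and have a plausible age
# (the 0-120 range is an assumption for this example)
clean <- subset(toy, complete.cases(toy) & age > 0 & age < 120)
print(clean)
```

The same two calls, `colSums(is.na(...))` and `subset(...)`, can be pointed at vitcap2 to confirm that nothing needs removing.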
Fit the model
Fitting the model in R takes little more than entering a few commands. The summary of the fitted model gives us what we need to predict a person’s vital capacity from their age. To get our linear equation, the intercept and age coefficients are the most important values in the summary output. Here the intercept is about 6.03 and the age coefficient about -0.04, so predicted vital capacity drops by roughly 0.04 for each additional year of age. This is our prediction equation: it takes age as input and returns a predicted vital capacity.
# attach variables
attach(vitcap2)
# fit model
pred = lm(vital.capacity ~ age)
# model summary
summary(pred)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.033316   0.247487  24.378  < 2e-16 ***
age         -0.040478   0.005881  -6.883 1.08e-09 ***
# abline helps see pattern; line color is red
abline(pred, col = "red")
VitalCapacity = 6.033316 - 0.040478(Age)
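Reading the intercept and slope off the summary by hand works, but the numbers can also be pulled out of the fitted object. The sketch below is illustrative rather than this article's own code. To stay runnable without ISwR, it fits R's built-in `cars` data (stopping distance against speed), passes the data frame through the `data` argument of `lm()` instead of using `attach()`, and reads the coefficients with `coef()`. The names `fit`, `intercept`, and `slope` are made up for the example.

```r
# Built-in dataset, used only so the snippet runs without extra packages
fit <- lm(dist ~ speed, data = cars)

# coef() returns a named vector with entries "(Intercept)" and "speed"
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["speed"]]

# Print the fitted line as an equation
cat(sprintf("dist = %.4f + %.4f * speed\n", intercept, slope))
```

For the vital capacity model, the equivalent would be `lm(vital.capacity ~ age, data = vitcap2)` followed by `coef()`, which avoids leaving attached variables on the search path.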
Easy! Now go ahead and test the equation against some known age values. If entering the commands one at a time is not convenient, you can use R scripts, which let you save all the commands in a file and run them at once.
R can produce far more complex models. Even in such cases, you will probably find packages that handle almost all your needs and make your life easier. You can always turn to the various online R communities for help with anything from data subsetting to building apps with R.
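One way to run that check is to wrap the prediction equation in a small function and call it for a few ages. This is only a sketch: the function name `predict_vital_capacity` and the example ages are made up, while the two coefficients are the estimates from the summary output shown earlier.

```r
# Coefficients copied from the model summary in this article
intercept <- 6.033316
slope     <- -0.040478

# Hypothetical helper: predicted vital capacity for a given age
predict_vital_capacity <- function(age) {
  intercept + slope * age
}

# Example ages chosen arbitrarily for illustration
ages <- c(20, 40, 60)
predictions <- predict_vital_capacity(ages)
print(data.frame(age = ages, predicted = round(predictions, 3)))
```

If the fitted model `pred` is still in your session, `predict(pred, newdata = data.frame(age = ages))` should return the same values up to rounding. Saving these lines in a script file lets you rerun them in one go.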