Basics of Building Predictive Models in R



Over the years, R has established itself as one of the world’s most widely used statistical programming languages and environments. It is used in fields as varied as healthcare and sports, and it has been widely adopted in academia. It is open source and runs on the three major platforms: Linux, macOS, and Windows.

Doing statistics, machine learning, and general data analysis in R is straightforward. Thousands of packages are available to assist you, and the list keeps growing. You are not limited to these packages, either: you can create your own to better suit your needs and share them with the community. One R package worth exploring is Shiny, which makes it possible to create interactive web applications in R without working knowledge of JavaScript, HTML, or CSS.

Every second, enormous amounts of data are generated. When analyzed, that data often holds the key to solving many of the problems we face. This makes predictive analysis vital, because it lets us use statistics to predict the outcome of events. R has a plethora of built-in functions and packages that make building various predictive models far less work.

Now we will explore the basics of building a simple predictive model in R with the help of RStudio. RStudio makes R easier to use by providing a clean, intuitive user interface, but everything here can be done in plain R if you are a devout command-line junkie. You will need to download and install R and RStudio for your platform from their official websites. We will use regression analysis to find the relationship between age and vital capacity (the breathing capacity of the lungs).

Load data

Of course, we will need some data to work with. The dataset for this example comes from the ISwR package, which can be installed in RStudio through the Install Packages option under the Tools menu. (An alternative is to use the console in RStudio.) To use the package, we’ll need to load and attach it.
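As a rough sketch of the console alternative, the commands below install the package if it is missing, attach it, and load the dataset. The CRAN mirror URL is an assumption; any mirror should work.

```r
# Install ISwR only if it is not already available.
# The mirror URL is an assumption; substitute your preferred CRAN mirror.
if (!requireNamespace("ISwR", quietly = TRUE)) {
  install.packages("ISwR", repos = "https://cloud.r-project.org")
}

# Load and attach the package, then load the dataset into the workspace.
library(ISwR)
data(vitcap2)
```

Once this has run, vitcap2 is available as a regular data frame.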

View and clean data

Not all data comes in a clean, ready-to-use form. Some datasets contain values, such as missing or invalid entries, that yield wrong results if they are not cleaned first. We will view a summary and a tabular view of our dataset, vitcap2. A plot of the two variables (age and vital.capacity) may also help. Since our dataset contains no “unwanted” values, no cleanup is needed. (For larger datasets, subsetting may be necessary.)
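One possible way to carry out these checks is sketched below. It assumes the ISwR package is already installed; the missing-value count and the axis labels are illustrative choices, not the only way to inspect the data.

```r
library(ISwR)   # assumes the ISwR package is already installed
data(vitcap2)

summary(vitcap2)   # per-column summary statistics
head(vitcap2)      # first six rows in tabular form
# View(vitcap2)    # spreadsheet-style viewer in RStudio (interactive only)

# Count missing values per column; all zeros would mean nothing to clean here.
missing_per_column <- colSums(is.na(vitcap2))
print(missing_per_column)

# Scatter plot of vital capacity against age.
plot(vital.capacity ~ age, data = vitcap2,
     xlab = "Age", ylab = "Vital capacity")
```

If the counts were not all zero, a call such as na.omit(vitcap2) would be one simple way to drop the incomplete rows before modeling.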

Fit the model

Fitting the model in R takes little more than entering a few commands. The summary of the fitted model gives us what we need to predict a person’s vital capacity from their age. The most important values in the summary output are the intercept and the age coefficient, which together give our linear equation: vital.capacity = intercept + (age coefficient × age). This is our prediction equation: given an age as input, it returns an estimate of vital capacity.
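A minimal sketch of the fitting step follows, assuming the ISwR package is installed. The variable names fit, intercept, and slope are chosen here for illustration.

```r
library(ISwR)   # assumes the ISwR package is already installed
data(vitcap2)

# Simple linear regression: vital capacity as a function of age.
fit <- lm(vital.capacity ~ age, data = vitcap2)

summary(fit)   # coefficients, standard errors, R-squared, and more

# Extract the two numbers that define the prediction equation.
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["age"]]

cat("vital.capacity =", intercept, "+", slope, "* age\n")
```

The formula vital.capacity ~ age reads as "vital capacity explained by age"; lm() estimates the intercept and the age coefficient by ordinary least squares.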

Easy! Now go on and test the equation against some known age values. If entering the commands one at a time is not convenient, you can use an R script, which lets you enter all the commands and run them at once.
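Testing the equation might look like the sketch below. The ages are hypothetical values picked for illustration, and the block refits the model so that it can run on its own. It compares R's predict() with the equation applied by hand.

```r
library(ISwR)   # assumes the ISwR package is already installed
data(vitcap2)
fit <- lm(vital.capacity ~ age, data = vitcap2)

# Hypothetical ages, chosen only for illustration.
new_ages <- data.frame(age = c(25, 40, 55))

# Predictions from the fitted model.
predicted <- predict(fit, newdata = new_ages)

# The same predictions from the equation: intercept + slope * age.
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["age"]] * new_ages$age

print(cbind(new_ages, predicted, manual))
```

Saved to a file (for example a hypothetical vitcap_model.R), these commands can be run in one go with source("vitcap_model.R") or with the Source button in RStudio.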

R can produce far more complex models. Even in such cases, you are probably going to find packages that can handle almost all your needs and make your life easier. You can always check out various R online communities for help with anything from data subsetting to building apps from R.


Bruno is a junior at Ashesi University College studying Computer Science. He is interested in leveraging the power of technology to increase productivity. As a big fan of open source technology, he is currently exploring the possibility of using the Bitcoin Blockchain to fight corruption in government.

