Have you ever been to Norway’s capital, Oslo? I have. I’ve also visited Asia. After seeing Oslo, and after traveling through Asia a couple of times, the one thing that stands out is the level of tidiness I witnessed in Norway. No litter in the streets. The air is fresh, because the streets are either car-free or only electric vehicles are zooming by. And the consistent spatial planning makes you feel you have arrived in More’s Utopia.
But that’s not the kind of “tidy” this article is about. When you’re a data scientist and you hear the word “tidy,” the first thing that should pop into your mind is “tidyverse”—a mystical yet very useful concept that makes your job a lot easier. In this article I explain the concept of “tidy data” and how the data.table library relates to it.
If I lived in Norway, I would stare out at the sea and count boats all day.
If I were a bad data scientist, my data on those boats would look like what you will see below. Let’s load the required libraries for this example, set the seed, and generate some example data.
library(data.table)
library(ggplot2)
library(dplyr)
library(microbenchmark)

set.seed(5)
date <- seq.Date(as.Date('2001-08-01'), as.Date('2001-08-14'), by = 'day')
dayname <- weekdays(date)
red <- sample(50:100, 14)
green <- sample(20:40, 14)
blue <- sample(40:80, 14)
boats <- data.table(date, dayname, red, green, blue)
Whether a dataset is tidy is determined by three rules, defined by Hadley Wickham:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
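Applied to the boats table generated above, the violations are easy to spot. A quick check (a sketch that assumes the boats table from the code above):

```r
# Rule one is violated: the single variable "count of boats" is
# spread over the red, green and blue columns
names(boats)
# "date" "dayname" "red" "green" "blue"

# Rule two is violated: each row bundles three observations
# (one per color), so 14 rows actually hold 14 * 3 = 42 observations
nrow(boats)
# 14
```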
The data above violates rule one and rule two: the variable “count of boats” is spread over three columns, and there is no column for boat color. There are multiple libraries that can transform your sloppy data into tidy data, and most of them live in an ecosystem called the tidyverse. How to transform data with these libraries is described very well in Wickham’s book, R for Data Science. However, I want to show you how the popular data.table package can do the work for you, and faster.
Transforming the above dataset can be done by converting the data table from “wide format” to “long format.”
tidy_boats <- melt(boats,
                   id.vars = c('date', 'dayname'),
                   measure.vars = c('red', 'green', 'blue'),
                   variable.name = 'color',
                   value.name = 'count')
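Should you ever need it, the reverse transformation, from long back to wide, is handled by data.table’s dcast() function. A small sketch (the name wide_again is mine, not from the original):

```r
# Long back to wide: one column per color again,
# reconstructing the original wide-format table
wide_again <- dcast(tidy_boats, date + dayname ~ color, value.var = 'count')
```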
Why is this relevant? Because many packages reward tidy data. A popular plotting package is ggplot2, also part of the tidyverse. If we were to plot the data with three line plots using the sloppy data set, we would have to write quite a lot of code.
ggplot(boats, aes(x = date)) +
  geom_line(aes(y = red), color = 'red') +
  geom_line(aes(y = green), color = 'green') +
  geom_line(aes(y = blue), color = 'blue') +
  labs(x = '', y = 'boats', title = 'Boats counted in Oslo')
Each boat color needs its own geom_line() call, for which we have to specify the column individually. Worse, ggplot2 cannot assign the line colors automatically here, so each one has to be set by hand. Finally, the plot has no legend, so the reader does not know what the colors represent: the red line could just as well stand for a type of boat, or for the flag under which it sails.
However, if you consistently use ggplot2 with tidy data, your life will become a lot easier. The same graph, with a legend, can be created with three lines of code.
ggplot(tidy_boats, aes(x = date, y = count, col = color, group = color)) +
  geom_line() +
  labs(x = '', y = 'boats', title = 'Boats counted in Oslo')
Or even better, split it into three different plots with only one line of code:
ggplot(tidy_boats, aes(x = date, y = count)) +
  geom_line() +
  labs(x = '', y = 'boats', title = 'Boats counted in Oslo') +
  facet_grid(. ~ color)
By now it should be clear why the tidyverse matters: coding speed. Follow the tidyverse rules and you are rewarded with more efficient coding. Is this always true? Unfortunately, no. As another well-known data scientist, Jeff Leek, points out: (1) some computations are more efficient if the data is represented in another format, and (2) some functions and libraries may require a different format. But in general, working with tidy data will earn you a lot of data science street cred.
There is another way to earn that street cred: computing speed. Coding speed matters when you’re working with small datasets; computing speed matters when you’re working with huge amounts of data. That’s where data.table comes in.
Let’s benchmark data.table against dplyr for some basic operations. In the first operation, we count the total number of boats per day. In the second, we filter the counts by color. In the third, we select three of the four columns.
microbenchmark(tidy_boats[, .(total = sum(count)), by = date])
microbenchmark(tidy_boats %>% group_by(date) %>% summarise(total = sum(count)))

microbenchmark(tidy_boats[color == 'green'])
microbenchmark(tidy_boats %>% filter(color == 'green'))

microbenchmark(tidy_boats[, .(date, color, count)])
microbenchmark(tidy_boats %>% select(date, color, count))
The data.table syntax differs strongly from the dplyr syntax, and if you’re unfamiliar with it, it can be difficult to understand at first glance. However, as you can see from the benchmarks, data.table is twice as fast as its standard tidyverse ecosystem counterparts. In the long run, coding in R in the data.table way is rewarding. If you’re having a hard time working with the syntax, there is a very handy cheat sheet by DataCamp that should get you through your first weeks.
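If the syntax feels opaque, it helps to know that the general shape of a data.table query is DT[i, j, by]: filter rows with i, compute or select with j, and group with by. A small sketch on the boats data (the name avg_green is mine):

```r
# DT[i, j, by]: filter rows (i), compute (j), group (by)
avg_green <- tidy_boats[color == 'green',       # i : keep only green boats
                        .(avg = mean(count)),   # j : average daily count
                        by = dayname]           # by: one row per weekday
```

Read left to right, it says: of the green boats, compute the average count per weekday, all in one bracket.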
If you want to be a fast data scientist, you should make the tidy data standards your own. If you want to be the fastest data scientist, use the data.table library. It’s like driving a Tesla in Norway.
You can find the images and code used in this example here.