G’day, a fictional Australian content marketing firm, recently released a new content marketing platform named “Koala.” After six successful months, the CEO, Christian O’Riley, gives every employee a gift: a week in the bush, without any Wi-Fi. Back to nature! Geordie, one of the junior content marketers, can’t join the adventure because he broke his leg skateboarding a couple of weeks earlier, so he has to spend the week indoors, keeping an eye on Koala’s performance.
Koala is running smoothly, and Geordie has time for an online course: R for beginners.
R is a language and environment for statistical computing and graphics, and Geordie wants to see if it is helpful for his content marketing activities.
Suddenly, Geordie’s email inbox explodes: clients are complaining that they can’t reach Koala because of an HTTP-500 error.
Crikey! What’s going on?
Geordie is not a real techie, but he knows HTTP errors are logged on the Koala web server, and the web server is based on Apache—open source software.
Where can he find the log files he needs to view?
Apache stores two kinds of logs:
- Access logs: Contain information about requests coming in to the web server. This information can include which pages people are viewing, whether each request succeeded, and how long each request took to process.
- Error logs: Contain information about errors that the web server encountered when processing requests.
Aha! Error logs, that’s the place to go, he thinks. But not quite. HTTP-500 is a status code that the web server sends back in response to a client’s request, and every request, together with the status code returned, is recorded in the access log. So Geordie starts with the access logs instead.
First, Geordie has to make an extract of the access logs for the last 12 hours. That is when the errors first occurred.
Where Apache stores its access and error logs depends on the operating system and the server configuration. For simplicity, just assume Geordie found the log file and made an extraction to a text file as shown below.
Geordie is now in possession of a log file, but it is huge and not really “structured” as a table.
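Before any structuring, it can help to peek at the raw lines. The sketch below is only an illustration: the three log lines are made up, they assume Apache's Common Log Format, and the names `sample_log`, `log_file`, and `raw_lines` are hypothetical, not part of Geordie's setup.

```r
# Hypothetical sample entries in Common Log Format (not Geordie's real data)
sample_log <- c(
  '192.0.2.10 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326',
  '192.0.2.11 - - [10/Oct/2023:13:56:02 +1000] "GET /koala/dashboard HTTP/1.1" 500 512',
  '192.0.2.12 - - [10/Oct/2023:13:57:40 +1000] "POST /koala/login HTTP/1.1" 500 512'
)

# Write the sample to a temporary file so it can be read like a real extract
log_file <- tempfile(fileext = ".log")
writeLines(sample_log, log_file)

# Read the file as plain text, one element per log entry
raw_lines <- readLines(log_file)

length(raw_lines)    # how many entries are in the extract
head(raw_lines, 2)   # the first two raw entries
```

Each entry is one long string, which is why a data frame with proper columns is the next step.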
Then, the thought strikes Geordie—Why not use R to analyze the log file? But where to start?
Building the data frame in R
As every tutor in data science and R says: It’s 80% preparation (sometimes 95%) and 20% analysis. Raw data is certainly not ready for analysis in R. Preparation has to be done!
Geordie starts up RStudio, a toolset for working with R.
(If you’re new to RStudio and want to follow the included example, the best way to get started with it is with this tutorial.)
So now Geordie can let R “read” the access log. This can be done by reading it into a data frame as seen in the following code:
    # read the log
    df <- read.table('access_log')
Geordie now has a data frame (df), where the information is structured in rows and columns.
Each row is an entry of the log, but the columns are not yet named.
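To see what "not yet named" means in practice, here is a small sketch on made-up data. It assumes the log follows Apache's Common Log Format; `sample_log` and `df_demo` are hypothetical names used only for illustration.

```r
# Hypothetical sample entries in Common Log Format (not Geordie's real data)
sample_log <- c(
  '192.0.2.10 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326',
  '192.0.2.11 - - [10/Oct/2023:13:56:02 +1000] "GET /koala/dashboard HTTP/1.1" 500 512'
)

# read.table() splits on whitespace but keeps the quoted request as one field
df_demo <- read.table(text = sample_log, stringsAsFactors = FALSE)

ncol(df_demo)       # number of columns read.table() produced
colnames(df_demo)   # default names assigned by R: V1, V2, ...
df_demo$V4          # first part of the timestamp, with the opening bracket
df_demo$V5          # time zone offset, with the closing bracket
```

Because the timestamp contains a space, it ends up in two columns. With this format, that makes eight columns in total.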
An access log can include the IP address of the client making the request (host); the identity of the user (ident); the username (authuser); the date and time of the request; the HTTP method and URL path (request); the HTTP status code returned (status); the number of bytes returned (bytes); and the time required to process the request.
Geordie examines the file and concludes the required processing time is not part of the log, so he can exclude this from his analysis. Now, Geordie has to name the columns:
    # add column names
    colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time', 'request', 'status', 'bytes')
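A small safeguard worth considering: check that the number of names matches the number of columns before assigning them. This is a sketch on made-up data, again assuming the Common Log Format; `sample_log`, `df_demo`, and `col_names` are hypothetical names.

```r
# Hypothetical sample entry in Common Log Format (not Geordie's real data)
sample_log <- c(
  '192.0.2.10 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326'
)
df_demo <- read.table(text = sample_log, stringsAsFactors = FALSE)

col_names <- c('host', 'ident', 'authuser', 'date', 'time', 'request', 'status', 'bytes')

# Stop early if the log has a different number of fields than expected
stopifnot(length(col_names) == ncol(df_demo))

colnames(df_demo) <- col_names
```

If the server uses a custom log format with extra fields, the check fails right away instead of leaving columns mislabeled.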
Great! What does Geordie’s data frame (df) now look like?
To see the column names and first few rows of the data frame, Geordie uses the data frame function head().
    # To see the column names and first few rows of our data frame
    head(df)
This gives the following output in the RStudio console:
Great! Now Geordie has a structured data frame with a lot of information. Geordie can even see at what times which status codes occurred, so he can pinpoint where and when the HTTP-500 error occurred.
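One way Geordie could narrow things down is to count the status codes and keep only the rows with status 500. The sketch below uses made-up entries and assumes the Common Log Format; `sample_log`, `df_demo`, `status_counts`, and `errors_500` are hypothetical names, and parsing the timestamp is an optional extra, not a step from the walkthrough above.

```r
# Hypothetical sample entries in Common Log Format (not Geordie's real data)
sample_log <- c(
  '192.0.2.10 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326',
  '192.0.2.11 - - [10/Oct/2023:13:56:02 +1000] "GET /koala/dashboard HTTP/1.1" 500 512',
  '192.0.2.12 - - [10/Oct/2023:13:57:40 +1000] "POST /koala/login HTTP/1.1" 500 512'
)
df_demo <- read.table(text = sample_log, stringsAsFactors = FALSE)
colnames(df_demo) <- c('host', 'ident', 'authuser', 'date', 'time', 'request', 'status', 'bytes')

# How often does each status code occur?
status_counts <- table(df_demo$status)

# Keep only the requests that ended in an HTTP 500
errors_500 <- subset(df_demo, status == 500)

# Join the two timestamp pieces and strip the square brackets
stamp <- paste(sub('^\\[', '', errors_500$date), sub('\\]$', '', errors_500$time))

# %b expects English month abbreviations, so this assumes an English or C locale
errors_500$timestamp <- as.POSIXct(stamp, format = '%d/%b/%Y:%H:%M:%S %z', tz = 'UTC')

errors_500[, c('timestamp', 'host', 'request', 'status')]
```

The result lists when each failing request arrived (converted to UTC here), where it came from, and which URL it asked for.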
What have we learned?
In this blog, I have shown you how you can use R to make your Apache access log human-readable and ready for analysis.
Here are the steps repeated.
    # read the log
    df <- read.table('access_log')
    # add column names
    colnames(df) <- c('host', 'ident', 'authuser', 'date', 'time', 'request', 'status', 'bytes')
    # to see the column names and first few rows of our data frame
    head(df)