This post is part of a content series coming out of a partnership between Sweetcode.io and Holberton School.
Recently, as part of our SRE/DevOps training, Guillaume (one of our mentors) gave us an exercise that was particularly interesting to me. I liked this question because it illustrates the importance of good thought process.
The exercise is as follows:
By using this cron log:
https://s3.amazonaws.com/holbertonintranet/files/sandbox/cron
Provide the list of all cron commands with the delay (“The cron task
‘Holberton’ is executed each 50 seconds”)
In my opinion, good thought process is being aware of the assumptions you are making. If you are not aware of your assumptions, you will have a hard time coming up with clarifying questions in an interview setting. It is important to ask clarifying questions when something is not clear because they let your interviewer know two things: 1) that you are a good communicator, by showing you are able to articulate your thought process; 2) that you understand things quickly, since you can tell that the solution to the problem changes depending on a certain parameter.
In a simple example, say you are asked write a program to solve the modulus of ‘a’ and ‘b’. If you decide to dive to the solution without being aware of your assumptions, you may not get the best solution. A good clarifying question would be: Are the inputs ‘a’ and ‘b’ always going to be whole numbers, or do I need to handle negative integers as well?
Back to the CRON log exercise:
Let’s pretend that there is a very complex server for a hypothetical developer team, with thousands of CRON jobs and many unique commands. No one individual knows all the jobs this server runs, and yet, everyone relies on the server to run their command at a set frequency.
Part of my strategy is not to overlook any special cases; this enables me to learn the most from each problem. I like to imagine the bigger picture of a problem before breaking it down into smaller pieces. In this problem, it occurred to me that this problem must not have a solution – or, at least, not a unique solution – unless I made an important assumption.
The Assumption / Clarifying Question
The assumption that I made was the following: Each command that appears in the log is only run by one CRON rule.
The clarifying question that jumped out at me was: “Can I assume that each command is run by exactly one CRON rule?”
In other words, are there two people on the team that are unknowingly running the same command, perhaps with different delays?
Since Guillaume is not generally in the habit of asking trick questions (like our low-level coach – Julien B. – is ) I found it safe to assume, that we are not expected to handle this edge case; this would make the problem unusually hard to solve for a daily morning exercise. For better or worse, I assumed the following:
Assumption: each unique command is run by one and only one CRON job.
I’m happy that I was able to identify this assumption, because in the future I will be more aware of this constraint.. The structure of the problem is now clearer, and our task is simplified. In addition, I could later brainstorm some ideas to extend the problem after I solve it.
Parsing the file
The lines on the log file look like this:
Sep 21 17:30:01 ip-172-31-27-229 CROND[process ID]: (root) CMD some_command
Using python’s regular expressions module, I parsed the file with the following search:
regexp = "CROND\[[0-9]*\]: \(root\) CMD (.*)" s = re.search(regexp, line)
You can see that I am capturing the command with (.*) and access it with s.group(1).
How Many Commands
I first decided to count how many unique CRON commands there are in this particular log file. Also, I decided to count how many CRON commands there are in total just in case I need to make a sanity check later on.
This was the result:
uniqueCmdCount: 10 count by CROND: 1999
I now know I have 10 unique commands in total, for a total of 10 CRON jobs.
The Lazy Solution
I knew, at first, that I could get the first two times a specific command has run, and subtract the difference to get the delay. But this discards the rest of the information in the log, and doesn’t take into account variations in the run times. This would be a lazy solution.
The Almost Correct Solution
This particular log spans a few days of activity. For my solution, I decided to count how many times the command had run, and then divide by the stretch of time between the first and last times this command ran. This was much better than the lazy solution, since any approximations would be averaged out and it was not needlessly complicated.
Here is my coded solution to obtain the delay, where dtdelta represents the timespan between the first and last instances of the command, and where cmdDict[“count”] is the number of occurrences for this command in the log file:
delay = dtdelta / (cmdDict["count"])
Here are some of the results:
command: (su -s /bin/bash -c ‘if [[ ! -z ${HOLBERTON_INTRANET_AUTO_CORRECTI…) count: 1042 first timestamp: 1900-09-18 03:10:01 last timestamp: 1900-09-21 17:55:01 difference: 3 days, 14:45:00 delay: 0:04:59.712092
5 minutes!
Another result:
command: (su -s /bin/bash -c ‘source /opt/elasticbeanstalk/support/envvars … count: 86 first timestamp: 1900-09-18 04:05:01 last timestamp: 1900-09-21 17:05:01 difference: 3 days, 13:00:00 delay: 0:59:18.139534
This looks like it’s one hour. But with almost 42-seconds of imprecision.
The Correct Solution
I had chalked up the imprecision to the variation in the runtimes by CRON, but didn’t verify this. I decided to take another look at the problem the following day.
It appears that some results are closer than others. But I wasn’t sure why. (The first result is off by a quarter of a second; but the second result, almost by 42 seconds.)
Something was bugging me. Those almost-42 seconds in the second result were suspect. Looking inside the log with my favorite text editor, I searched for a discrepancy in the numbers. Only once was the delay different out of the 86 times that particular command ran, and it was off by only one second. This proved that I had a problem, since the 42 seconds on mine were way above what the log would suggest.
And then it hit me. My formula to obtain the delay is re-pasted here:
delay = dtdelta / (cmdDict[“count”])
I had neglected to subtract one from the count.
.-.-.-.-.-.-.-.-.-.
There’s 10 dots, but only 9 dashes (trust me). If dots are timestamps, and dashes represent the timespan between each dot, I was dividing by 10 when I should have been dividing by one less than the number of dots.
And so, upon changing this…
delay = dtdelta / (cmdDict[“count”]) – 1
I obtained nicer (and correct) results.
Result 1:
command: (su -s /bin/bash -c ‘if [[ ! -z ${HOLBERTON_INTRANET_AUTO_CORRECTI…) count: 1042 first timestamp: 1900-09-18 03:10:01 last timestamp: 1900-09-21 17:55:01 difference: 3 days, 14:45:00 delay: 0:05:00
Result 2:
command: (su -s /bin/bash -c ‘source /opt/elasticbeanstalk/support/envvars … count: 86 first timestamp: 1900-09-18 04:05:01 last timestamp: 1900-09-21 17:05:01 difference: 3 days, 13:00:00 delay: 1:00:00
P.S.: Parsing the datetimes
For the sake of completeness, here is how I parsed the times. To get the time string I took the first 15 characters of each line via the Python slice ([:15])
I used the datetime module of Python 2.7 to interpret the times (using strptime() function which allows you to specify the format).
I appended all of these timestamps to a list for each command.
Conclusion
I liked solving this problem because it sharpened my critical thinking skills. On further reflection, there is also something else I learned: the importance of coming back to the same problem the next day. As a software developer, it is important not to burn out; I like to think of it as the developer’s equivalent of writer’s block. By getting a good night’s sleep this “block” was removed. This experience highlighted, to me, the importance that getting some rest or taking a break can have on your ability to see something that you have been staring at for too long.