On January 31, 2017, GitLab suffered a major outage. A developer who was dealing with a spam problem and moving data around accidentally deleted the production database. To make matters worse, no usable backup was readily available, because the backup job had not been running for some time by that point. About six hours of production data was lost.
Luckily, a team member had taken a snapshot earlier that day, so GitLab did not lose as much data as it otherwise would have. But things could have been much, much worse.
What do you do when you face a major outage like GitLab’s? How do you respond when your site crashes or gets hacked? Or when you accidentally drop the database or enable email delivery on staging servers?
This article discusses mistakes made by companies like GitHub, GitLab, and others, and how you and your team can address them when they occur. Some of this information is drawn from a talk Zach Holman gave at Twilio’s Signal Conference.
Recovery Process and Response
Have parallel responses set up:
- Short Term: Fix the immediate issue as soon as possible, for example by getting the service back up.
- Long Term: Analyze the problem and ensure that the same vulnerability is not introduced again in the future.
- External: Deal with the public. Post updates and have a disclosure policy. GitHub introduced the GitHub Security Bug Bounty, and it has been a success for them.
- Retroactive: Look at your logs and make sure that no one exploited the same vulnerability while it went unnoticed. This means sifting through a lot of logs, and the process can take weeks, or even months.
Give your team members autonomy in tackling the issue at different priority levels.
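The retroactive step is mostly log sifting, and a small script can narrow down what a person has to read. The following is a minimal sketch, not a definitive tool. It assumes a hypothetical log format in which every line starts with an ISO 8601 timestamp followed by a space, and the function name `find_matches` and the pattern passed to it are illustrative only.

```python
import re
from datetime import datetime


def find_matches(lines, pattern, start, end):
    """Return log lines inside [start, end] whose text matches `pattern`.

    Assumes (hypothetically) that each line looks like:
    "2017-01-31T23:00:00 <rest of the log entry>"
    """
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        # Keep only entries from the window in which the hole was open.
        if start <= when <= end and regex.search(rest):
            hits.append(line.rstrip("\n"))
    return hits
```

You would call it with the window during which the vulnerability was live and a pattern that describes the exploit, then review the returned lines by hand. Real logs are rarely this uniform, so expect to adapt the parsing to your own format.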
Dealing with the Public
1) Be Transparent
First, notify the public and any stakeholders in a timely manner. Nobody likes downtime. However, your users will generally be more understanding if you keep them up to date on the issue.
Second, demonstrate understanding. Show that you understand the inconvenience the incident has caused, and that you are doing everything you can to fix the system.
Third, be timely. Continually provide updates, even if you have no new information. You can simply state, “We are still working on this issue. At this point in time, we do not have any new information. We will provide an update as soon as there’s a change.” Acknowledgements like this can mean a lot to people.
These are the steps that GitLab took to be transparent:
- Updating users in real time through social media, such as Twitter
- Updating a Google Doc with new technical information, with timestamps
- Streaming the recovery live on YouTube the following day
2) Assemble a Council
Assemble your council with people from different departments or teams. Even though the problem may be a technical one, you’ll still want input from different teams. Certain incidents can be tricky and sensitive, and you want different points of view on the matter.
You are not done with the public at this stage. Next, you’ll want to write your postmortem describing what happened. Detail the incident, tell the public what went wrong, and explain how you plan to improve the system to prevent similar incidents in the future.
Test Your Backups
Once you have stabilized the situation, test your backups often, and test that they work in isolation. These tests should be as far removed from your main system as possible. If an attacker takes hold of your system, you want to be able to reach a backup that sits well away from it.
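GitLab's backup had silently stopped running, which is the kind of failure a scheduled check can catch early. Here is a minimal sketch of such a check, under stated assumptions: the function name `check_backup`, the 24-hour age limit, and the minimum size are all hypothetical defaults, not values from GitLab's setup. It only confirms that a backup file exists, is recent, and is not suspiciously small. It does not replace actually restoring the backup on a separate machine.

```python
import time
from pathlib import Path


def check_backup(path, max_age_hours=24, min_bytes=1, now=None):
    """Return a list of problems found with the backup file at `path`.

    An empty list means the file exists, is large enough, and is recent.
    `now` (seconds since the epoch) can be passed in to make testing easier.
    """
    now = time.time() if now is None else now
    backup = Path(path)
    if not backup.is_file():
        return [f"{backup} does not exist"]

    problems = []
    stat = backup.stat()
    # A backup job that fails quietly often leaves an empty or tiny file.
    if stat.st_size < min_bytes:
        problems.append(f"{backup} is only {stat.st_size} bytes")
    # A stale file suggests the backup job has stopped running.
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{backup} is {age_hours:.1f} hours old")
    return problems
```

Run a check like this on a schedule and alert a human whenever the returned list is not empty. The thresholds are a design choice: set `min_bytes` close to the size of a typical healthy backup so that a truncated dump is flagged too.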
Compare your analytics from before and after the incident. Maybe users are not overtly angry, but usage has dropped. Depending on the data, you may be able to develop better business processes around it. You may need to develop a campaign of generosity toward users and customers and offer free or discounted products to win them back. The short term may look bleak, but keep your eyes on the long-term prospects.
Bad things are going to happen. Smart people are capable of dumb mistakes. Choose to be honest and open, and the pain will be greatly reduced. Mistakes will inevitably happen, but if you make “better mistakes” each time, you will recover better each time.
Postmortem of Database Outage of January 31
GitLab.com Database Incident – 2017/01/31 [Google Doc]