Incident Management and Response: Myth Busting Edition
Site Reliability Engineering – or SRE – is what happens when you ask software engineers to design an operations function. That is how Google ..
Site Reliability Engineering – or SRE – is what happens when you ask software engineers to design an operations function. That is how Google ..
Poorly implemented postmortems can be painful for everyone involved; they cost money, and worse yet, they can fail to address the root cause ..
If you work in IT, development, or DevOps, it can be easy to feel like everything you do is incident response. You constantly face issues ..
It’s easy to talk about the importance of real-time communication when responding to an incident. It’s harder to implement real-time ..
Cloud has given way to cloud-native computing over the past decade. The sweeping changes at every layer of the application stack have had a bearing ..
Every year I get on some technical kick. These fascinations usually end up being some sort of design pattern or process. In 2020, I’m ..
Alerting has come a long way from the days of paging an on-call administrator in the middle of the night, to multiple on-call teams that ..
“Prophecies and predictions they are not…the term forecast is strictly applicable to such an opinion as is the result of scientific ..