As incidents continue to occur, teams generally respond by tracking more metrics, such as “Mean Time To Detect” (MTTD), the gap in time between the issue beginning and an alert getting triggered, and “Mean Time To Mitigation” (MTTM), the time between that first alert and when you’ve contained the user impact. These metrics are useful for evaluating incident response effectiveness, but they often fail to show you where you should be improving. The answer is to extend your incident response program to also include incident analysis: a meeting where the group reviews groupings of incidents to identify improvements.
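To make the two definitions above concrete, here is a minimal sketch of how they could be computed from incident timestamps. The `Incident` record, its field names, and the sample data are assumptions made for illustration; they do not come from the source or from any particular incident-tracking tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    started_at: datetime    # when the issue began
    alerted_at: datetime    # when the first alert triggered
    mitigated_at: datetime  # when user impact was contained


def mean_time_to_detect(incidents):
    """Average gap between the issue beginning and the first alert."""
    gaps = [(i.alerted_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def mean_time_to_mitigation(incidents):
    """Average gap between the first alert and contained user impact."""
    gaps = [(i.mitigated_at - i.alerted_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


# Made-up sample data, not real incidents.
incidents = [
    Incident(
        started_at=datetime(2024, 1, 8, 10, 0),
        alerted_at=datetime(2024, 1, 8, 10, 5),
        mitigated_at=datetime(2024, 1, 8, 10, 35),
    ),
    Incident(
        started_at=datetime(2024, 1, 9, 14, 0),
        alerted_at=datetime(2024, 1, 9, 14, 15),
        mitigated_at=datetime(2024, 1, 9, 14, 25),
    ),
]

print(mean_time_to_detect(incidents))      # 0:10:00
print(mean_time_to_mitigation(incidents))  # 0:20:00
```

Both numbers summarize how fast the response was. Neither says which incidents share a cause or where to invest, which is the gap a review of grouped incidents is meant to fill.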
Move past incident response to reliability
from GitHub