Incident Management - Postmortems, Blameless Culture - SRE & Observability

ב-8 בפברואר 2017, CloudFlare הוציא שינוי לparser שלהם. בתוך דקות, 1.8 מיליון domains הפסיקו לעבוד. ה-on-call engineer קיבל alert, הבין שיש בעיה רצינית, הכריז על incident.

מה שקרה אחר כך הוא הסיפור החשוב: בתוך פחות מ-שעה הם עשו rollback ו-service חזר לנורמה. לאחר מכן הם פרסמו postmortem מפורט - כולל ה-Regex שגרם לבעיה, ה-timeline המדויק, ו-5 action items ספציפיים. הם לא הסתירו מהלקוחות. הם לא חיפשו מי לאשים. הם שאלו: מה בתהליך שלנו אפשר לזה לקרות, ואיך מונעים את זה?

זה ה-incident management שעובד.

ה-contrast המוחלט לזה הוא איך incident management עובד ב-majority של חברות שאינן SRE-mature. ב-2018, British Airways עברה IT failure שבוטלו בגינה 726 טיסות בסוף שבוע אחד, עם 75,000 passengers מושפעים. MTTD: לא ידוע בדיוק כי לא פרסמו גלוי. MTTR: ~5 ימים לחלק מהsystems. הנזק הוא ביחסי ציבור, תביעות, ו-operational costs: מוערך ב-$80 מיליון.

מה שפורסם בתחקירים עיתונאיים: "IT contractor accidentally powered down systems during routine maintenance." כלומר - לא היה ה-incident management process שיאפשר לzidentify ולmitigate quickly. לא היה מנגנון שיעצור את הנזק תוך דקות. כשsystems גדולים קורסים בלי incident management process, MTTR נמדד בימים - לא בדקות.

הפרדוקס: incident management process לא מונע incidents. הוא מקצר בצורה דרמטית את הזמן שבו הincident גורם לנזק. 5 ימים MTTR לעומת 47 דקות MTTR על incident בחומרה זהה - ההבדל הוא Process, לא Technology.

Estimated cost of the 2018 British Airways IT failure - 726 cancelled flights, 75K passengers

MTTR נמדד בימים, לא בדקות. Process, not technology - incident management is the difference between $80M and a 47-minute postmortem.

The Five Phases of an Incident

Each phase has a clear handoff - IC keeps the team focused on the right one

Detect

Alert fires, on-call ack within 5 min. Start of MTTD clock - measured from real customer impact, not from alert.

Triage

Severity declared, IC assigned, comms lead identified. Decide: solo, war room, or executive page.

Mitigate

Stop the bleeding - rollback, circuit break, traffic shed. Investigation comes later. End of customer impact.

Resolve

System back to steady state, monitoring confirms no regression. Status page updated.

Learn

Blameless postmortem within 5 days. Action items in tracker, owners assigned, dates set.