Capacity Planning - SRE & Observability

On Prime Day 2018, Amazon itself suffered an outage of about 60 minutes at the start of the event. The cause: the traffic spike exceeded what capacity planning had predicted. They knew a spike was coming; they did not know how big it would be. They thought they were ready. They were not.

The irony: Amazon is the cloud provider. They can add capacity at the click of a button. The problem was not capacity; the problem was that auto-scaling could not react in time to a sharp spike. Traffic that grows 3x within a few seconds requires pre-provisioned capacity, not reactive scaling.
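To make the gap between reactive scaling and pre-provisioning concrete, here is a minimal sketch of a second-by-second simulation. It is not a model of Amazon's systems: the function name `unserved_requests`, the 1,000 rps baseline, the 1,200 rps starting capacity, and the 120-second provisioning delay are all assumptions chosen for illustration. Only the 3x jump comes from the story above.

```python
def unserved_requests(demand_rps, initial_capacity_rps, scale_delay_s):
    """Count requests that exceed capacity while a reactive scaler catches up.

    Simplified model (an assumption, not a real autoscaler): the scaler
    notices a shortfall immediately, orders exactly the missing capacity,
    and that capacity becomes usable scale_delay_s seconds later.
    """
    capacity = initial_capacity_rps
    pending = {}  # arrival second -> capacity (rps) still being provisioned
    unserved = 0
    for t, demand in enumerate(demand_rps):
        capacity += pending.pop(t, 0)          # capacity ordered earlier comes online
        in_flight = sum(pending.values())      # already ordered, not yet usable
        shortfall = demand - capacity - in_flight
        if shortfall > 0:
            pending[t + scale_delay_s] = shortfall
        unserved += max(0, demand - capacity)  # requests we cannot serve this second
    return unserved


# Hypothetical ramp: 1,000 rps baseline, jumping 3x to 3,000 rps at second 5.
demand = [1000] * 5 + [3000] * 295

reactive = unserved_requests(demand, initial_capacity_rps=1200, scale_delay_s=120)
pre_provisioned = unserved_requests(demand, initial_capacity_rps=3000, scale_delay_s=120)

print(reactive)         # 216000 (1,800 rps short for 120 seconds)
print(pre_provisioned)  # 0
```

Under these assumed numbers the scaler reacts in the very first second of the spike and still leaves 216,000 requests unserved, because the loss is set by the provisioning delay rather than by the detection time.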

Capacity planning means knowing how many resources you need, knowing when you will need them, and being ready before the demand arrives - not in reaction to it.

Amazon Prime Day 2018 outage at event start - auto-scaling could not catch up to a 3x traffic spike in seconds

A cloud provider has effectively unlimited capacity, but reactive scaling is too slow. Pre-provisioned capacity is the only defense against near-vertical traffic ramps.

In 2023, an Israeli HR tech company with ~800 business customers added a "mass payroll processing" feature: one button that computes payroll for 10,000 employees at once. An impressive feature. The problem: the CPU budget had been set from historical traffic, not from what the new feature would demand.

On the 25th of the month, the day most customers run payroll, the system crashed at 11:23 in the morning. MTTD: 4 minutes (alert on CPU at 100%). MTTR: 3 hours. The cause: the payroll computation service was not configured to scale horizontally; it ran as a singleton instance. After the rollout of the new feature, the 10 clients that ran payroll at the same time saturated it immediately.

The postmortem revealed that no capacity planning had been done for this feature. Nobody asked "what happens if 10 customers run this at the same time?" The load test ran with a single customer. Everything looked great.
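The question the postmortem says nobody asked can be answered with back-of-the-envelope arithmetic before any load test runs. The sketch below is one rough way to do it, under stated assumptions: the helper `instances_needed`, the 50 ms of CPU per employee, the 300-second completion target, the 4-core instances, and the 70% utilization ceiling are all hypothetical. Only the 10,000 employees per job and the 10 concurrent customers come from the incident.

```python
import math


def instances_needed(concurrent_jobs, cpu_seconds_per_job, target_completion_s,
                     cores_per_instance, target_utilization=0.7):
    """Estimate instances required to finish all concurrent jobs on time.

    Assumes the work is CPU-bound and spreads evenly across cores, which
    a real payroll service may not satisfy.
    """
    required_cores = concurrent_jobs * cpu_seconds_per_job / target_completion_s
    usable_cores_per_instance = cores_per_instance * target_utilization
    return math.ceil(required_cores / usable_cores_per_instance)


# Hypothetical cost: 10,000 employees x 0.05 CPU-seconds each = 500 CPU-seconds per job.
cpu_seconds_per_job = 10_000 * 0.05

# What the load test exercised: a single customer.
single = instances_needed(1, cpu_seconds_per_job, target_completion_s=300,
                          cores_per_instance=4)

# What the 25th of the month looked like: 10 customers at once.
peak = instances_needed(10, cpu_seconds_per_job, target_completion_s=300,
                        cores_per_instance=4)

print(single)  # 1
print(peak)    # 6
```

With these assumed numbers, one customer fits comfortably on a single instance, which matches a load test where everything looks great, while ten concurrent customers need roughly six instances that a singleton service cannot provide.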
