# Kubernetes Liveness Probe Failures
Post-mortem analysis of cascading failures due to aggressive probe timeouts.

## Incident Summary
On November 3rd, 2023, our primary API service experienced a cascading failure that took down 80% of our infrastructure. The root cause was overly aggressive Kubernetes liveness probes.
## Timeline
- 14:23 - Database latency spikes to 500ms
- 14:24 - API pods start failing liveness probes
- 14:25 - Kubernetes begins terminating "unhealthy" pods
- 14:26 - Remaining pods overwhelmed, latency increases
- 14:27 - Full cascade, all pods marked unhealthy
## Root Cause
The liveness probe was configured with a 1-second timeout. During the database latency spike, all API calls slowed down, causing probes to fail. Kubernetes killed the pods, but the issue was transient database load, not application health.
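As a sketch of the fix (values are illustrative, not our exact production manifest; port and paths are assumptions), a more forgiving configuration separates liveness from readiness and gives the liveness probe generous margins:

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # checks process health only, no downstream dependencies
    port: 8080
  timeoutSeconds: 5       # ~10x expected p99, instead of the 1s that caused the incident
  periodSeconds: 10
  failureThreshold: 3     # several consecutive failures before Kubernetes restarts the pod
readinessProbe:
  httpGet:
    path: /ready          # may check dependencies; failure removes the pod from
    port: 8080            # the Service without killing it
  timeoutSeconds: 2
  periodSeconds: 5
```

With this split, a transient database slowdown fails readiness (shedding traffic) rather than liveness (killing pods).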
## Lessons Learned
- Liveness != readiness - liveness probes should only detect deadlocks and panics, not slow responses
- Timeout margins - probe timeouts should be at least 10x the expected p99 latency
- Circuit breakers - application-level circuit breakers prevent cascading failures
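To illustrate the last point, here is a minimal circuit-breaker sketch (class name, thresholds, and states are illustrative, not our production implementation): after a run of consecutive failures it opens and fails fast instead of piling more load onto a struggling dependency, then allows a trial call after a cooldown.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls while open, and lets one trial call through
    after `reset_timeout` seconds (half-open)."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast without touching the downstream dependency.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        return result
```

Wrapping database calls this way means that during a latency spike the API sheds load quickly instead of queuing requests until every pod looks unhealthy.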
## Remediation
We relaxed the probe settings (longer timeouts, higher failure thresholds) and added application-level circuit breakers to prevent future cascading failures.