# frontmatter
date: 2023-11-05
author: Vedant Ghodekar
type: architecture
status: COMPLETED
tags: [k8s, incident]
summary: Post-mortem analysis of cascading failures due to aggressive probe timeouts.

# Kubernetes Liveness Probe Failures

Post-mortem analysis of cascading failures due to aggressive probe timeouts.

## Incident Summary

On November 3rd, 2023, our primary API service experienced a cascading failure that took down 80% of our infrastructure. The root cause was an overly aggressive Kubernetes liveness probe configuration.

## Timeline

  • 14:23 - Database latency spikes to 500ms
  • 14:24 - API pods start failing liveness probes
  • 14:25 - Kubernetes begins terminating "unhealthy" pods
  • 14:26 - Remaining pods overwhelmed, latency increases
  • 14:27 - Full cascade, all pods marked unhealthy

## Root Cause

The liveness probe was configured with a 1-second timeout. During the database latency spike, all API responses slowed past that threshold, so healthy pods began failing their probes. Kubernetes killed them, even though the underlying issue was transient database load, not application health. Each restart shrank the pool of serving pods, pushed more traffic onto the survivors, and made their probes fail in turn.
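For reference, the failing probe looked roughly like the sketch below. Only the 1-second timeout is taken from the incident itself; the path, port, period, and failure threshold are illustrative reconstructions, not the exact production values.

```yaml
livenessProbe:
  httpGet:
    path: /healthz       # illustrative; this endpoint made calls that touched the database
    port: 8080
  timeoutSeconds: 1      # root cause: any response slower than 1s counted as a failure
  periodSeconds: 5
  failureThreshold: 3    # three slow responses in ~15s and Kubernetes restarts the pod
```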

## Lessons Learned

  1. Liveness != Readiness - Liveness should only detect deadlocks/panics, not slow responses
  2. Timeout margins - Probe timeouts should be 10x the expected p99 latency
  3. Circuit breakers - Application-level circuit breakers prevent cascading failures
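Applied to the probe configuration, the first two lessons translate into something like the following. All concrete values are illustrative: the liveness timeout follows the 10x rule against an assumed ~500ms p99, and the endpoint paths are assumptions.

```yaml
# Liveness: only detects a wedged process; generous timeout (10x an assumed ~500ms p99)
livenessProbe:
  httpGet:
    path: /healthz       # should answer from memory, never touch the database
    port: 8080
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6    # ~1 minute of sustained failure before a restart
# Readiness: removes the pod from load balancing without killing it
readinessProbe:
  httpGet:
    path: /ready         # may check downstream dependencies such as the database
    port: 8080
  timeoutSeconds: 2
  periodSeconds: 5
  failureThreshold: 3
```

The key design choice is the split: slow dependencies now take a pod out of rotation via readiness, while liveness restarts are reserved for genuinely dead processes.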

We relaxed the probe settings and added application-level circuit breakers to prevent future cascading failures.
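The application-level circuit breaker mentioned above can be sketched as follows. This is a minimal illustration, not our production implementation; the `max_failures` and `reset_timeout` values are arbitrary placeholders.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, fails fast while open, and lets one trial call through
    after `reset_timeout` seconds (half-open)."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of piling load
                # onto an already-struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

During an incident like this one, the breaker would trip on repeated database errors and shed load immediately, giving the database room to recover instead of every pod retrying at full rate.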

## Recommended Logs