# frontmatter
date: 2023-11-05
author: Vedant Ghodekar
type: architecture
status: COMPLETED
tags: [k8s, incident]
summary: Post-mortem analysis of cascading failures due to aggressive probe timeouts.

# Kubernetes Liveness Probe Failures

Post-mortem analysis of cascading failures due to aggressive probe timeouts.

## Incident Summary

On November 3rd, 2023, our primary API service experienced a cascading failure that took down 80% of our infrastructure. The root cause was an overly aggressive Kubernetes liveness probe configuration.

## Timeline

  • 14:23 - Database latency spikes to 500ms
  • 14:24 - API pods start failing liveness probes
  • 14:25 - Kubernetes begins terminating "unhealthy" pods
  • 14:26 - Remaining pods overwhelmed, latency increases
  • 14:27 - Full cascade, all pods marked unhealthy

## Root Cause

The liveness probe was configured with a 1-second timeout. During the database latency spike, all API responses slowed past that threshold, so healthy pods began failing their probes. Kubernetes killed them, even though the underlying issue was transient database load, not application health. Each restart shrank the pool of serving pods, pushed more traffic onto the survivors, and made their probes fail in turn.
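For reference, the failing probe looked roughly like the sketch below. Only the 1-second timeout is taken from the incident itself; the path, port, period, and failure threshold are illustrative reconstructions, not the exact production values.

```yaml
livenessProbe:
  httpGet:
    path: /healthz       # illustrative; this endpoint made calls that touched the database
    port: 8080
  timeoutSeconds: 1      # root cause: any response slower than 1s counted as a failure
  periodSeconds: 5
  failureThreshold: 3    # three slow responses in ~15s and Kubernetes restarts the pod
```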

## Lessons Learned

  1. Liveness != Readiness - Liveness should only detect deadlocks/panics, not slow responses
  2. Timeout margins - Probe timeouts should be 10x the expected p99 latency
  3. Circuit breakers - Application-level circuit breakers prevent cascading failures
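Applied to the probe configuration, the first two lessons translate into something like the following. All concrete values are illustrative: the liveness timeout follows the 10x rule against an assumed ~500ms p99, and the endpoint paths are assumptions.

```yaml
# Liveness: only detects a wedged process; generous timeout (10x an assumed ~500ms p99)
livenessProbe:
  httpGet:
    path: /healthz       # should answer from memory, never touch the database
    port: 8080
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6    # ~1 minute of sustained failure before a restart
# Readiness: removes the pod from load balancing without killing it
readinessProbe:
  httpGet:
    path: /ready         # may check downstream dependencies such as the database
    port: 8080
  timeoutSeconds: 2
  periodSeconds: 5
  failureThreshold: 3
```

The key design choice is the split: slow dependencies now take a pod out of rotation via readiness, while liveness restarts are reserved for genuinely dead processes.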

We relaxed the probe settings and added application-level circuit breakers to prevent future cascading failures.
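The application-level circuit breaker mentioned above can be sketched as follows. This is a minimal illustration, not our production implementation; the `max_failures` and `reset_timeout` values are arbitrary placeholders.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, fails fast while open, and lets one trial call through
    after `reset_timeout` seconds (half-open)."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of piling load
                # onto an already-struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

During an incident like this one, the breaker would trip on repeated database errors and shed load immediately, giving the database room to recover instead of every pod retrying at full rate.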

## Recommended Logs