
Feature flags service outage

Internal post-mortem: <https://github.com/PostHog/incidents-analysis/pull/120>

On September 29, 2025, the PostHog Feature Flags service experienced an outage lasting 1 hour and 48 minutes, from 16:58 to 18:46 UTC. During this period, approximately 78% of flag evaluation requests in the US region failed with HTTP 504 errors.

Summary

A database connection timeout reduction from 1 second to 300 milliseconds coincided with elevated load on our writer database from person ingestion. This combination triggered cascading failures in our connection retry logic, resulting in a service-wide outage. Recovery was significantly delayed by hardcoded configuration values and procedural failures in our incident response.
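The retry amplification described here is the kind of failure that exponential backoff and a circuit breaker are designed to prevent. A minimal sketch of both mechanisms, with assumed names and parameters (this is illustrative, not the actual feature flags service code):

```rust
use std::time::Duration;

/// Exponential backoff: each attempt waits twice as long as the last,
/// capped so delays don't grow unbounded.
struct Backoff {
    base: Duration,
    cap: Duration,
}

impl Backoff {
    fn delay(&self, attempt: u32) -> Duration {
        // Shift is clamped so the multiplier can't overflow.
        let exp = self.base.saturating_mul(1u32 << attempt.min(16));
        exp.min(self.cap)
    }
}

/// Circuit breaker: after `threshold` consecutive failures, stop
/// issuing requests entirely instead of hammering the database.
struct CircuitBreaker {
    failures: u32,
    threshold: u32,
}

impl CircuitBreaker {
    fn new(threshold: u32) -> Self {
        Self { failures: 0, threshold }
    }
    fn record_failure(&mut self) {
        self.failures += 1;
    }
    fn record_success(&mut self) {
        self.failures = 0;
    }
    fn is_open(&self) -> bool {
        self.failures >= self.threshold
    }
}

fn main() {
    let backoff = Backoff {
        base: Duration::from_millis(100),
        cap: Duration::from_secs(5),
    };
    for attempt in 0..6 {
        println!("attempt {attempt}: wait {:?}", backoff.delay(attempt));
    }

    let mut breaker = CircuitBreaker::new(3);
    for _ in 0..3 {
        breaker.record_failure();
    }
    println!("circuit open: {}", breaker.is_open());
}
```

Without the cap and the breaker, every crashing pod retries immediately and in lockstep, which is how a manageable load spike becomes a cascade.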

Timeline

{"timestamp":"2025-09-29T16:54:16.154136Z","level":"ERROR","fields":{"message":"Failed to create database pools","error":"Database error: pool timed out while waiting for an open connection"},"target":"feature_flags::server","threadId":"ThreadId(1)"}
thread 'main' panicked at feature-flags/src/main.rs:119:5:
internal error: entered unreachable code: Server exited unexpectedly
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Full service degradation timeline

Root cause analysis

The outage resulted from three compounding factors:

  1. Configuration change timing: A connection timeout reduction deployed during a period of database stress created conditions where pods could not establish connections within the new timeout window.
  2. Retry amplification: Our retry logic lacked circuit breakers and exponential backoff, causing failed connection attempts to multiply rapidly. This transformed a manageable database load issue into complete service unavailability.
  3. Health check configuration: Kubernetes continued routing traffic to pods in crash loops for up to 45 minutes due to improperly configured liveness and readiness probes.
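For the third factor, the relevant Kubernetes knobs are the readiness probe (which gates traffic) and the liveness probe (which restarts the container). A hedged illustration of probes that would have removed crash-looping pods from rotation within seconds rather than 45 minutes (paths and port are assumptions, not the service's actual endpoints):

```yaml
# failureThreshold * periodSeconds bounds how long a failing pod
# can keep receiving traffic (here: 3 * 5s = 15s).
readinessProbe:
  httpGet:
    path: /_readiness   # assumed endpoint
    port: 3001          # assumed port
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /_liveness    # assumed endpoint
    port: 3001
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```

The key property is that a pod which cannot reach its database fails readiness and is pulled from the Service endpoints, even if the process itself has not yet crashed.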

The incident duration was extended by operational failures: timeout values were hardcoded in the application rather than externalized as configuration, requiring a full deployment cycle to modify. Additionally, our standard ArgoCD rollback procedure failed due to misconfigured permissions.
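Externalizing the timeout means a bad value can be reverted by changing an environment variable instead of rebuilding and redeploying. A minimal sketch, with an assumed variable name and the pre-incident 1-second default as fallback:

```rust
use std::env;
use std::time::Duration;

/// Parse a connection timeout from a raw env-var value, falling back to
/// the pre-incident default of 1 second on absence or a bad value.
/// The variable name below is illustrative, not the service's actual one.
fn connect_timeout(raw: Option<&str>) -> Duration {
    raw.and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_secs(1))
}

fn main() {
    let raw = env::var("FLAGS_DB_CONNECT_TIMEOUT_MS").ok();
    println!("connect timeout: {:?}", connect_timeout(raw.as_deref()));
}
```

Invalid input falls back to the default rather than panicking, so a typo in the variable cannot itself take the service down.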

Impact

Remediation

Immediate actions (completed)

Short-term improvements

Long-term improvements (Q4 2025 – Q1 2026)

Lessons learned

This incident highlighted critical gaps in our defensive architecture and operational procedures. The coupling of read and write operations created unnecessary failure domains, while our retry logic lacked basic protective mechanisms against amplification. Most significantly, our incident response was hampered by inflexible configuration management and untested rollback procedures.

The architectural improvements underway will provide proper isolation between different operational modes of the feature flags service. This separation, combined with improved circuit breaking and configuration management, will prevent similar cascading failures in the future.
