
Feature flags recurring outages

Between October 21 and October 30, 2025, the PostHog Feature Flags service experienced four separate incidents, exposing systemic architectural weaknesses that required comprehensive remediation. This post-mortem documents all four incidents and our path to stability.

Summary

Over a 10-day period in October 2025, the feature flags service experienced four separate incidents totaling over 14 hours of cumulative major impact (errors or severe latency). While each incident had different surface-level symptoms, three of the four incidents shared the same root cause: improper CPU resource sizing. Our nodes were too small relative to pod resource requests, causing Kubernetes to pack too many pods per node and saturate CPU capacity. This CPU saturation led to connection pool exhaustion, excessive parallelism (too many concurrent operations), and ultimately cascading failures. The fourth incident was a rate limiting misconfiguration unrelated to resource sizing.

Incidents:

  1. October 21, 2025 – Redis overload
  2. October 24, 2025 – Rate limiting misconfiguration
  3. October 28, 2025 – Connection pool exhaustion and excessive parallelism
  4. October 29-30, 2025 – CPU-bound latency

Incident timeline

October 21, 2025 – Redis overload

Duration: 21:45 to 23:28 UTC (103 minutes)

Impact: ~38% of evaluation requests returning errors in US datacenter

A deployment intended to reduce timeout errors (PR #39821) addressed symptoms rather than the root cause. Although it was rolled back within 2 minutes, it triggered excessive parallelism and connection pool exhaustion, which manifested as massive data transfer from Postgres to Redis and a surge in concurrent connections that overwhelmed our cache layer. Redis memory exhaustion followed, leading to prolonged service degradation.

What "excessive parallelism" means: Under CPU pressure, degraded requests triggered Envoy retries between the load balancer and service. Each retry spawned new concurrent requests, and each request performed multiple concurrent Redis reads. A single degraded request could fan out to dozens of concurrent Redis operations. Combined with cache misses (on cache miss, we synchronously loaded full flag and team state from Postgres and wrote it into Redis), this created bursty write storms that overwhelmed Redis.

Connection pool mechanics: Each pod maintains its own Postgres connection pool. Creating a pool involves TLS handshakes, authentication, and initial connection establishment—operations that are computationally expensive, especially when pods are CPU-bound. Under CPU pressure exceeding 90%, new pods struggled to initialize these pools within the 20-second startup timeout, leading to crash loops and reduced healthy pod capacity.
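
To make the startup budget concrete, here is a hedged sketch of pool initialization guarded by a deadline, retrying with backoff instead of exiting on the first failure. It assumes a tokio runtime; `build_postgres_pool`, the per-attempt timeout, and the backoff values are hypothetical and only show the shape of the problem, not how the service actually constructs its pool.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical pool type and constructor standing in for whatever Postgres
// pool the service uses. Creating a pool is expensive: TLS handshakes,
// authentication, and initial connections all burn CPU.
struct PgPool;

async fn build_postgres_pool() -> Result<PgPool, String> {
    // Simulate handshake/auth work that slows down badly on a CPU-saturated node.
    sleep(Duration::from_millis(500)).await;
    Ok(PgPool)
}

// Try to bring up the pool within an overall startup budget, retrying with a
// short backoff instead of exiting (and crash-looping) on the first failure.
async fn init_pool_with_deadline(budget: Duration) -> Result<PgPool, String> {
    let attempt_timeout = Duration::from_secs(5); // hypothetical per-attempt cap
    let started = tokio::time::Instant::now();
    let mut backoff = Duration::from_millis(250);

    loop {
        match timeout(attempt_timeout, build_postgres_pool()).await {
            Ok(Ok(pool)) => return Ok(pool),
            Ok(Err(err)) => eprintln!("pool init failed: {err}"),
            Err(_) => eprintln!("pool init attempt timed out"),
        }

        if started.elapsed() + backoff >= budget {
            return Err("could not initialize Postgres pool within startup budget".into());
        }
        sleep(backoff).await;
        backoff *= 2; // simple exponential backoff between attempts
    }
}

#[tokio::main]
async fn main() {
    // 20 seconds mirrors the startup timeout described above.
    match init_pool_with_deadline(Duration::from_secs(20)).await {
        Ok(_pool) => println!("pool ready"),
        Err(err) => eprintln!("startup failed: {err}"),
    }
}
```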

Critical issue: The Redis overload from the flags service also impacted the main PostHog application, demonstrating dangerous coupling through shared infrastructure. The flags service can operate without Redis but falls back to heavier database queries, making responses slower.
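
The degradation path looks roughly like the sketch below: try the cache under a short time budget and, on a miss, error, or timeout, fall back to the heavier database read. `read_flags_from_redis`, `read_flags_from_postgres`, and the 50 ms budget are hypothetical names and numbers chosen to show the shape of the fallback, not the service's real API.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-ins for the cache and database read paths.
async fn read_flags_from_redis(team_id: u64) -> Result<Option<String>, String> {
    let _ = team_id;
    Err("redis unavailable".into()) // pretend the cache layer is overloaded
}

async fn read_flags_from_postgres(team_id: u64) -> Result<String, String> {
    Ok(format!("flag definitions for team {team_id} from Postgres"))
}

// Prefer the cache, but never let a slow or unavailable Redis take the whole
// request down: fall back to the (heavier, slower) database query instead.
async fn load_team_flags(team_id: u64) -> Result<String, String> {
    let cache_budget = Duration::from_millis(50); // hypothetical budget
    match timeout(cache_budget, read_flags_from_redis(team_id)).await {
        Ok(Ok(Some(cached))) => Ok(cached),
        // Cache miss, cache error, or cache too slow: degrade to Postgres.
        _ => read_flags_from_postgres(team_id).await,
    }
}

#[tokio::main]
async fn main() {
    match load_team_flags(42).await {
        Ok(flags) => println!("{flags}"),
        Err(err) => eprintln!("both cache and database failed: {err}"),
    }
}
```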

Root causes:

Timeline:

October 24, 2025 – Rate limiting misconfiguration

Duration: 18:00 to 19:12 UTC (72 minutes)

Impact: ~97% of evaluation requests returning 429 (rate limit) errors worldwide

We deployed IP-based rate limiting (PR #40074) as a protective measure following the October 21 incident. The tower-governor library (our Rust rate limiting middleware) saw all traffic as coming from a single IP address (our load balancer) rather than from actual client IPs, immediately triggering rate limits for all legitimate traffic.
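
Behind a load balancer, the peer address a service sees is the balancer's own, so a rate limiter has to key on the client address carried in a forwarding header instead. The hedged sketch below parses the first entry of `X-Forwarded-For` from a request's headers; it assumes the `http` crate and illustrates the keying fix in principle rather than reproducing tower-governor's actual extractor API.

```rust
use std::net::IpAddr;

use http::HeaderMap; // assumes the `http` crate

// Pull the original client IP out of X-Forwarded-For. Behind a load
// balancer the socket peer address is always the balancer itself, so keying
// rate limits on it lumps every customer into a single bucket.
fn client_ip(headers: &HeaderMap) -> Option<IpAddr> {
    headers
        .get("x-forwarded-for")?
        .to_str()
        .ok()?
        // X-Forwarded-For is a comma-separated chain; the first entry is the
        // address the edge proxy recorded for the original client.
        .split(',')
        .next()?
        .trim()
        .parse()
        .ok()
}

fn main() {
    let mut headers = HeaderMap::new();
    headers.insert("x-forwarded-for", "203.0.113.7, 10.0.0.2".parse().unwrap());

    // Without a header-aware key, every request appears to come from the
    // load balancer's address and shares one rate-limit bucket.
    match client_ip(&headers) {
        Some(ip) => println!("rate-limit key: {ip}"),
        None => println!("no forwarded client address; fall back to peer IP"),
    }
}
```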

Root causes:

Timeline:

October 28, 2025 – Connection pool exhaustion and excessive parallelism

Duration: 19:28 to 21:31 UTC (123 minutes)

Impact: ~34% of evaluation requests failing in US datacenter

A routine deployment with no changes directly related to the flags service triggered a rollout of feature flag pods in the US region. New pods couldn't connect to Postgres within the 20-second startup timeout, entering crash loops due to excessive parallelism and connection pool exhaustion—the same root cause as October 21. Under CPU pressure, pods couldn't initialize Postgres connection pools (TLS handshakes, authentication, connection establishment) within the timeout. Simultaneously, a massive spike in Redis writes caused key evictions, effectively making the cache unavailable. While the flags service can operate without Redis (falling back to heavier database queries), with both cache unavailable and database under pressure, a significant portion of US traffic failed.
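
One standard way to bound this kind of refill-and-write storm is request coalescing (sometimes called single-flight): when many requests miss on the same key, only one performs the Postgres read and Redis write while the others wait for its result. The sketch below is a minimal version of that idea, assuming a tokio runtime; `refill_from_postgres` and the key names are hypothetical, and it illustrates the technique rather than the service's implementation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use tokio::sync::OnceCell;

// Hypothetical stand-in for the expensive refill path: read full team/flag
// state from Postgres and write it back into Redis.
async fn refill_from_postgres(team_key: &str) -> String {
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    format!("refilled state for {team_key}")
}

// One OnceCell per cache key: the first request to miss performs the refill,
// and every other concurrent miss for the same key awaits that single refill
// instead of issuing its own Postgres read and Redis write.
#[derive(Default, Clone)]
struct SingleFlight {
    in_flight: Arc<Mutex<HashMap<String, Arc<OnceCell<String>>>>>,
}

impl SingleFlight {
    async fn get_or_refill(&self, team_key: &str) -> String {
        let cell = {
            let mut map = self.in_flight.lock().expect("lock poisoned");
            map.entry(team_key.to_string())
                .or_insert_with(|| Arc::new(OnceCell::new()))
                .clone()
        };
        // A production version would also evict the entry once the refill
        // lands in Redis and handle errors; this sketch keeps only the
        // coalescing idea.
        cell.get_or_init(|| refill_from_postgres(team_key))
            .await
            .clone()
    }
}

#[tokio::main]
async fn main() {
    let flights = SingleFlight::default();

    // Forty concurrent misses for the same key trigger one refill, not forty.
    let tasks: Vec<_> = (0..40)
        .map(|_| {
            let flights = flights.clone();
            tokio::spawn(async move { flights.get_or_refill("team-2").await })
        })
        .collect();

    for task in tasks {
        println!("{}", task.await.expect("task panicked"));
    }
}
```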

Critical issue: The Redis overload from the flags service also impacted the main PostHog application, highlighting dangerous infrastructure coupling. Unrelated deployments shouldn't trigger feature flags rollouts.

Root causes:

Timeline:

Note: We initially reapplied the October 21 remediation before implementing additional measures to decrease parallelism.

October 29-30, 2025 – CPU-bound latency

Duration: 22:30 UTC on October 29 to 05:39 UTC on October 30 (7 hours 9 minutes)

Impact: Slow queries and degraded performance due to node CPU pressure

Query performance was degraded for over 7 hours. Queries to both Redis and Postgres were slow, yet metrics for both dependencies confirmed they were healthy. The slowness came from CPU pressure on the nodes, which exceeded 90%, impairing connections and pushing service response times to several times their usual level.

Root causes:

Timeline:

Resolution: After tracing the connectivity issues to resource exhaustion on the feature flags nodes, we increased pod resource requests for the flags service. This produced a healthier distribution of pods per node, brought per-node CPU usage down, and returned the service to a healthy state.

Root cause analysis

While each incident had specific triggers, three of the four incidents shared the same fundamental root cause:

  1. CPU resource undersizing (primary root cause): Our nodes were too small relative to pod resource requests, causing Kubernetes to pack too many pods per node and saturate CPU capacity (exceeding 90%). This CPU saturation was the root cause of the October 21, 28, and 29-30 incidents (see the sketch after this list for the packing arithmetic).
  2. Connection pool management complexity: Each pod maintains its own Postgres connection pool. Creating a pool involves TLS handshakes, authentication, and connection establishment—operations that are computationally expensive, especially when pods are CPU-bound. This complexity, combined with CPU saturation, exacerbated connection pool exhaustion.
  3. Shared Redis is a critical single point of failure: Redis overload from the flags service impacted the main PostHog application, demonstrating dangerous coupling through shared infrastructure. Isolation is critical despite the implementation complexity.
  4. Missing CPU alerting: CPU alerting was completely absent throughout these incidents, preventing early detection of the CPU saturation that caused three of the four outages. This was a fundamental gap in our monitoring strategy that allowed CPU pressure to escalate unnoticed.
  5. Unbounded retries: Unbounded retries in Envoy (between the load balancer and the endpoint) amplified failures; retry limits are now in place.
  6. Rate limiting misconfiguration (October 24 only): The October 24 incident was unrelated to CPU sizing—it was caused by a rate limiting configuration that didn't account for our load balancer architecture.
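
For intuition about the sizing failure, here is a small worked example of the packing arithmetic with hypothetical numbers (they are not our real node or pod shapes): the scheduler packs pods by their CPU requests, but the node pays for their real peak usage.

```rust
// Hypothetical numbers to illustrate the packing arithmetic; they are not
// the real node or pod shapes.
fn main() {
    let node_allocatable_millicores = 4000.0; // a 4-vCPU node
    let pod_request_millicores = 250.0;       // what each pod asks for
    let pod_peak_usage_millicores = 600.0;    // what each pod actually burns at peak

    // The scheduler packs pods based on requests, not on real peak usage.
    let pods_per_node = (node_allocatable_millicores / pod_request_millicores).floor();
    let peak_demand = pods_per_node * pod_peak_usage_millicores;
    let utilization = peak_demand / node_allocatable_millicores * 100.0;

    println!("pods packed per node: {pods_per_node}");
    println!("peak CPU demand: {peak_demand} millicores");
    println!("peak utilization: {utilization:.0}% of the node");
    // With requests far below real peak usage, the node is oversubscribed
    // (240% with these numbers), which is exactly the >90% CPU saturation
    // pattern described above.
}
```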

Impact

Remediation

Immediate actions (completed)

Short-term improvements (Tracked in GitHub Issue #40885)

In progress (next 2 weeks):

To complete before re-enabling ArgoCD sync:

Medium-term improvements

Incident response and monitoring:

Architectural improvements:

Long-term improvements

Lessons learned

What went well

What didn't go well

Key takeaways

  1. CPU right-sizing is fundamental – The biggest takeaway: nodes were too small relative to pod resource requests, causing Kubernetes to pack too many pods per node and saturate CPU capacity. This CPU saturation led to excessive parallelism (Envoy retries → concurrent requests → concurrent Redis reads), connection pool exhaustion (pods couldn't initialize Postgres pools under CPU pressure), and slow queries. Right-sizing (fewer pods per node, better-resourced pods) addressed the underlying issues behind the October 21, 28, and 29-30 incidents. This must be a primary consideration for any service deployment.
  2. Connection pool management architecture matters – Each pod maintains its own Postgres connection pool. Creating a pool involves TLS handshakes, authentication, and connection establishment—operations that are computationally expensive, especially when pods are CPU-bound. This complexity, combined with CPU saturation, exacerbated connection pool exhaustion. The better approach: reduce concurrency and run smaller fleets with better-resourced pods rather than larger fleets of CPU-bound pods.
  3. Shared Redis is a critical single point of failure – When the flags service overloads Redis, it takes down the main app too, as the October 21 and 28 incidents showed. Isolation is critical despite the implementation complexity.
  4. CPU alerting was completely missing – CPU alerting was absent throughout these incidents, preventing early detection of the CPU saturation that caused three of the outages. This was a fundamental gap in our monitoring strategy. CPU metrics must be monitored and alertable from day one.
  5. Monitor data flow patterns – Postgres-to-Redis transfer spikes should trigger alerts. Watch for unusual data movement.
  6. Test under load – The overload patterns only appeared under production traffic. Load testing is non-negotiable.
  7. Progressive rollouts save lives – Gradual deployments limit blast radius and enable rapid detection. We're implementing rollout/annotation controls to disable staged rollouts and enable "force-merge" for rolling changes.
  8. Configuration must be flexible – Critical settings must be adjustable without full deployment cycles.
  9. Unbounded retries amplify failures – Retries without bounds in Envoy (between the load balancer and the endpoint) can cascade failures. We've implemented retry limits to prevent this; a sketch of the same principle at the application level follows this list.
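
The Envoy fix for the last takeaway is a proxy-level retry limit; the same principle applied in application code looks like the hedged sketch below, with a hard attempt cap and exponential backoff so a struggling dependency sees a bounded number of requests. The function names and constants are hypothetical, and this is an analogue of the idea, not our Envoy configuration.

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical upstream call that fails transiently on the first attempts.
async fn call_flags_endpoint(attempt: u32) -> Result<String, String> {
    if attempt < 2 {
        Err("upstream overloaded".into())
    } else {
        Ok("flag payload".into())
    }
}

// Retry with a hard cap and exponential backoff plus jitter, so a degraded
// dependency sees at most `max_attempts` requests instead of an unbounded
// stream of retries stacking on top of each other.
async fn call_with_bounded_retries(max_attempts: u32) -> Result<String, String> {
    let mut backoff = Duration::from_millis(100);
    for attempt in 0..max_attempts {
        match call_flags_endpoint(attempt).await {
            Ok(body) => return Ok(body),
            Err(err) if attempt + 1 == max_attempts => return Err(err),
            Err(err) => eprintln!("attempt {attempt} failed: {err}; backing off"),
        }
        // In practice the jitter would be random; a deterministic stand-in
        // keeps this sketch dependency-free.
        let jitter = Duration::from_millis(u64::from(attempt) * 13 % 50);
        sleep(backoff + jitter).await;
        backoff *= 2;
    }
    Err("retry budget exhausted".into())
}

#[tokio::main]
async fn main() {
    match call_with_bounded_retries(3).await {
        Ok(body) => println!("success: {body}"),
        Err(err) => eprintln!("gave up: {err}"),
    }
}
```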

Moving forward

These four incidents highlighted critical gaps in our defensive architecture and operational procedures. The compounding failures demonstrated that our service needed fundamental improvements, not just quick fixes. The primary root cause—CPU resource undersizing (nodes too small relative to pod requests, causing too many pods per node)—manifested differently across three incidents (October 21, 28, and 29-30), requiring us to recognize that excessive parallelism, connection pool exhaustion, and slow queries were all symptoms of the same underlying issue. The recurrence of these symptoms between October 21 and 28 showed that we needed to address the root cause (CPU sizing) rather than the symptoms. We initially attempted the same remediation approach from October 21 before implementing CPU right-sizing, which resolved the underlying issues.

We've implemented immediate remediations and are executing a comprehensive review of the entire service architecture. Our strike team is systematically identifying and addressing remaining bottlenecks. Once we complete the short-term improvements tracked in GitHub Issue #40885, we'll have confidence that the service is durable against future outages.

The architectural improvements underway—including Redis isolation, connection pool management, and comprehensive monitoring—will prevent similar cascading failures in the future. We're committed to ensuring the feature flags service meets the reliability standards our customers expect.
