
Logs data loss

On February 19th, PostHog's Logs product experienced a major incident that caused the loss of data collected more than 3 days before the incident in our US region. This data loss only impacted the Logs product; all other PostHog data is intact.

Summary

As with most queryable data in PostHog, we store data for Logs in a ClickHouse cluster. When we started building Logs, we decided to use a new, dedicated cluster rather than building it on top of our main ClickHouse cluster, which is shared across most other PostHog products. This had a few advantages.

This new cluster uses S3 disks in ClickHouse, with data parts automatically uploaded to S3 after 24 hours. This is what enables us to handle the significant data volume Logs requires: at PostHog alone, we produce about 500 MB/s of logs from across our systems, or about 1 PB/month uncompressed.
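
For illustration, here's a minimal sketch of how a table can be tiered this way using ClickHouse's TTL ... TO VOLUME mechanism, via the clickhouse-connect Python client. The connection details, the schema, and the 'tiered' storage policy with an 's3' volume are all hypothetical stand-ins; the real Logs configuration isn't shown here.

```python
import clickhouse_connect

# Hypothetical connection details; the real cluster endpoint isn't public.
client = clickhouse_connect.get_client(host="logs-clickhouse.internal")

# A sketch of a log table that keeps fresh parts on local disk and moves
# parts older than 24 hours to an S3-backed volume. It assumes the server's
# storage_configuration already defines a policy named 'tiered' containing
# a local 'hot' volume and an S3-backed 's3' volume.
client.command("""
    CREATE TABLE IF NOT EXISTS logs.entries
    (
        timestamp DateTime64(9),
        team_id   UInt64,
        body      String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs_entries', '{replica}')
    ORDER BY (team_id, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 24 HOUR TO VOLUME 's3'
    SETTINGS storage_policy = 'tiered'
""")
```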

A bug in ClickHouse caused it to unexpectedly attempt to delete almost all of the data parts in S3. The Logs database is replicated across two replicas; however, very early on in the project we had enabled "Zero-Copy Replication" on the Logs cluster nodes. This is an experimental feature that ClickHouse does not recommend for production, for exactly this reason: a bug that should only have deleted a single replica's data instead deleted the data everywhere.

Timeline

All times in UTC.

Root cause analysis

Zero Copy replication bug

The decision to use zero-copy replication was taken extremely early in the Logs product's development, when it was an experimental, internal-only tool.

Once Logs was released to external users, this decision should have been revisited, but it wasn't. Because we experienced no issues at all during several months of internal usage, the settings chosen at the beginning went largely unexamined and unchanged.

Zero-copy replication has been largely unmaintained for the last four years and still contains critical bugs, including the one we hit here. Because zero-copy replication uses a shared storage medium (S3) for multiple replicas, when the logic on one node failed and issued delete commands for the underlying S3 objects, those files were removed for the entire cluster immediately. There was no redundancy layer between the database application logic and the storage layer.
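
As a quick way to check whether a cluster is exposed to this failure mode, you can inspect the relevant MergeTree setting on each node. A minimal sketch, reusing the hypothetical client from the earlier example; allow_remote_fs_zero_copy_replication is the setting ClickHouse uses to control zero-copy replication:

```python
# Zero-copy replication is controlled by a MergeTree setting; on an exposed
# cluster this reports value = '1' on each node you run it against.
rows = client.query(
    "SELECT name, value, changed FROM system.merge_tree_settings "
    "WHERE name = 'allow_remote_fs_zero_copy_replication'"
).result_rows
for name, value, changed in rows:
    print(name, value, "explicitly set" if changed else "server default")
```

Disabling it (by removing the setting from the server config, or setting it to 0) means each replica keeps its own copy of the data in S3, trading higher storage cost for the isolation we were missing here.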

Lack of detection

We lacked specific monitoring for the integrity of "cold" data stored in S3. Our alerts are optimized for ingestion lag, query latency, and error rates on active queries. Since users rarely query logs older than 24 hours, and the deletion process happened silently in the background without throwing application-level errors, the system remained "green" on our dashboards until the node restart forced a consistency check.
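
One relatively cheap check that would have caught this: track active parts and bytes per disk over time, independent of query traffic, and alert on any sudden drop on the S3 tier. A minimal sketch against ClickHouse's system.parts table; load_baseline and page_oncall are hypothetical helpers, and the 'logs' database name and 20% drop threshold are illustrative assumptions:

```python
# Count active parts and bytes per disk. A large unexplained drop on the
# S3-backed disk is a red flag even while ingestion and queries look healthy.
rows = client.query("""
    SELECT disk_name, count() AS parts, sum(bytes_on_disk) AS bytes
    FROM system.parts
    WHERE active AND database = 'logs'
    GROUP BY disk_name
""").result_rows

baseline = load_baseline()  # hypothetical helper: last run's numbers from a metrics store
for disk_name, parts, bytes_on_disk in rows:
    prev = baseline.get(disk_name)
    if prev and bytes_on_disk < 0.8 * prev:  # illustrative 20% drop threshold
        page_oncall(f"{disk_name}: bytes dropped from {prev} to {bytes_on_disk}")  # hypothetical helper
```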

Lessons learned

What went well

What went poorly

Key takeaways

  1. Immediate Configuration Audit: Disable zero-copy replication on all clusters immediately. Conduct a full audit of the Logs ClickHouse configuration and ensure no experimental features are used in production.
  2. Implement S3 Object Protection: Enable S3 Versioning on the underlying storage buckets. This ensures that even if the database application issues a destructive command due to a bug, the underlying data objects can be recovered (see the sketch after this list).
  3. Review Before GA: Before a product is made generally available, spot-check its configuration and data integrity strategy to find and correct potential single points of failure.
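
A minimal sketch of takeaway 2 using boto3. The bucket name is hypothetical, and the 14-day noncurrent-version expiration is an illustrative cost-control choice rather than a stated part of the plan:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "posthog-logs-clickhouse"  # hypothetical bucket name

# Keep prior object versions so a buggy DELETE from the database layer
# becomes recoverable instead of permanent.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Bound the storage cost of versioning by expiring noncurrent versions
# after a recovery window (14 days here, purely illustrative).
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 14},
            }
        ]
    },
)
```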
