
Persons database migration


Between November 11 and November 15, 2025 we hit a Postgres limit that required us to migrate our Persons database for US Cloud. This led to ingestion delays which had a knock-on effect for products relying on person data, including feature flags and experiments.

This post-mortem document examines the root cause of the issue, steps taken, and our future plans derived from the lessons learned.

Incident summary

From November 11, 4:02 PM UTC to November 14, 3:02 PM UTC, the performance of PostHog's ingestion pipeline became severely degraded, resulting in event processing delays of up to 2 days for all customers in the US region.

The root cause of the performance degradation was that the Postgres database responsible for storing our Person information hit a previously unseen Postgres limit related to the JSONB field we use to store person properties. This left writes to the database waiting for OIDs in the TOAST table to become available, which could take multiple seconds per update, slowing the ingestion pipeline to the point where no standard scaling options were available. See the Root cause analysis below for more technical details.

The root cause was not identified until November 12, 10:17 PM UTC, a day and a half after the issue arose. We enlisted engineers from the AWS RDS team and external consultants to help identify the cause. Diagnosis proved difficult even with specialist support, but we eventually determined that we were left with only one option: migrate to a new partitioned table.

By November 14, 2025, 15:02 UTC, ingestion was healthy again and we shifted focus to the accumulated backlog. During this recovery phase, we hit a secondary issue with AWS MSK (Kafka): local disk usage reached 85% because tiered storage keeps the most recent 4 hours of data on disk before offloading to S3. The backlog catchup created an unusually dense last-4-hours window, driving up local disk usage. We temporarily paused ingestion, reduced the topic's local retention window, confirmed disk headroom, and then resumed ingestion.

After that, catchup progressed smoothly. We moved our backfill into Dagster to gain better visibility and stability for long-running backfill jobs, knowing remediation would take at least the weekend. By the morning of November 15, all events since the start of the incident had been processed, all systems were fully operational, and we began a background backfill of older Person data purely for housekeeping. No data was lost.

By November 15, 6:20 AM UTC we had worked through the backlog of events and fully recovered.

Incident impact

Scope

Customers experienced ingestion delays ranging from 10 minutes up to 2 days. During this period, recent events sent to PostHog did not appear, affecting products that rely on fresh event and person data, including analytics, feature flags, experiments, session replay filters, and CDP.

Data integrity

No data was lost.

Once the backlog was cleared, all reporting tools indicate accurate values were processed for the delayed period.

Timeline

A full timeline of updates is available on the PostHog status page.

Root cause analysis

Primary issue: Postgres TOAST OID exhaustion

PostgreSQL stores large column values (over 2KB after compression) in a separate, out-of-line table called the TOAST table. PostHog's Persons table has a properties column that frequently exceeds this 2KB threshold, so many TOASTed values are associated with the Persons table.
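To make the threshold concrete, here is a rough sketch of the decision. The 2000-byte constant and the `would_be_toasted` helper are illustrative approximations, not Postgres internals; the real threshold (`TOAST_TUPLE_THRESHOLD`) is applied inside the storage engine after its own compression.

```python
import json
import secrets
import zlib

# Approximation of Postgres's default TOAST threshold (~2KB after compression).
TOAST_THRESHOLD_BYTES = 2000

def would_be_toasted(properties: dict) -> bool:
    """Estimate whether a person-properties blob would be stored out of line:
    serialize it, compress it, and compare against the threshold."""
    compressed = zlib.compress(json.dumps(properties).encode("utf-8"))
    return len(compressed) > TOAST_THRESHOLD_BYTES

# A handful of small properties stays inline in the main table...
small = {"email": "user@example.com", "plan": "free"}
print(would_be_toasted(small))   # False

# ...while hundreds of high-entropy properties exceed the threshold even
# after compression, producing one TOASTed value (and one OID) per update.
large = {f"prop_{i}": secrets.token_hex(16) for i in range(400)}
print(would_be_toasted(large))   # True
```

The key point is that every update to a large properties blob allocates a fresh out-of-line value, and each such value consumes an OID.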

Each value that is moved out of line is assigned a unique OID (object identifier) from a finite 32-bit space (~4 billion values), which the main table uses to reference the value in its associated TOAST table.

From the PostgreSQL wiki:

> The OIDs used for this purpose are generated from a global counter that wraps around every 4 billion values, so that from time to time an already-used value will be generated again. Postgres detects that, and tries again with the next OID.

When the space of used OIDs approaches the limit, there are longer and longer sequential runs of used OIDs. This forces the database engine to do an enormous number of reads, checking each candidate OID the counter produces to see whether it is free, just to complete a single INSERT or UPDATE of a TOASTed row.

It is important to note that write performance for TOASTed rows degrades severely well before the table hits the hard limit of 4 billion OIDs, because free OIDs become so sparse. If only a single free OID were left, the engine would on average have to read through billions of used OIDs before finding it. In our case, OID exhaustion increased the disk reads per write query from roughly 10 KB to 15 MB, increasing latency for those queries by 100x and grinding event ingestion to a halt.
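The dynamic above can be illustrated with a toy simulation. This is a simplification: it models the wrapping counter probing a small in-memory OID space, whereas in real Postgres each probe is an index lookup on the TOAST table that can cost actual disk reads.

```python
import random

random.seed(42)  # reproducible occupancy pattern

def avg_probes_to_free_oid(oid_space: int, used_fraction: float,
                           trials: int = 2000) -> float:
    """Average number of candidate OIDs the wrapping counter must check,
    starting from a random position, before landing on a free one."""
    used = set(random.sample(range(oid_space), int(oid_space * used_fraction)))
    total_probes = 0
    for _ in range(trials):
        candidate = random.randrange(oid_space)
        probes = 1
        while candidate in used:            # occupied: retry with the next OID
            candidate = (candidate + 1) % oid_space
            probes += 1
        total_probes += probes
    return total_probes / trials

# As occupancy climbs, the average scan per allocation explodes (~1/(1 - f)).
for frac in (0.10, 0.90, 0.999):
    print(f"{frac:7.1%} used -> ~{avg_probes_to_free_oid(100_000, frac):8.1f} probes/allocation")
```

At 10% occupancy an allocation finds a free OID almost immediately; at 99.9% it scans on the order of a thousand candidates, which is the read amplification described above.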

Secondary issue: AWS MSK disk pressure during catch-up

During backlog processing, we hit a separate but related operational issue: MSK local disk usage climbed to the 85% alarm threshold, because tiered storage keeps the most recent 4 hours of data on local disk before offloading to S3, and the unusually dense catch-up window inflated that local footprint.

To mitigate this we paused ingestion, reduced the local retention configuration for the relevant topic, and resumed ingestion once disk usage returned to a safe level.
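Back-of-the-envelope numbers show why a dense catch-up window creates this pressure. All broker sizing below is hypothetical; the only figures taken from the incident are the 4-hour local retention window and the 85% threshold.

```python
def local_disk_pct(ingest_mb_per_s: float, retention_hours: float,
                   replication_factor: int, broker_disk_gb: float,
                   broker_count: int) -> float:
    """Percent of total cluster disk held by the local tiered-storage window."""
    window_gb = ingest_mb_per_s * 3600 * retention_hours / 1024
    return 100 * window_gb * replication_factor / (broker_disk_gb * broker_count)

# Hypothetical steady state: the 4-hour window uses ~35% of disk.
print(local_disk_pct(150, 4, 3, 2000, 9))    # ~35.2

# Catch-up pushes throughput up; the same 4-hour window now breaches 85%...
print(local_disk_pct(400, 4, 3, 2000, 9))    # ~93.8

# ...while shrinking local retention to 1 hour restores headroom.
print(local_disk_pct(400, 1, 3, 2000, 9))    # ~23.4
```

Because local usage scales linearly with recent throughput, a backlog replay at several times the normal ingest rate can exhaust headroom that is ample in steady state, which is why reducing the topic's local retention was an effective mitigation.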

Appendix: OID exhaustion diagrams

Stage 1: Healthy database state

```
OID Space (each block = used OID, each dash = free OID)

[----------------------------------------------------]  0-1M OIDs
[-----XX-----------------------XX---------------------]  1M-2M OIDs
[---------XX--------------------------XX--------------]  2M-3M OIDs
[----------------------------------------------------]  3M-4M OIDs
 ↑
 Next OID counter (finds a free OID immediately)
```

Stage 2: OID exhaustion

```
OID Space (each block = used OID, each dash = free OID)

[XXXXXXXXXXXXXXX--XXXXXXXXX--XXXXXXXXXXXXXX--XXXXXXX]  0-1M OIDs
[XXXXXXXXXXXXXXXXXXXXXXXX----XXXXXXXXXXXXXXXXXXXXXX--XX]  1M-2M OIDs
[XXXXXXXXXX--XXXXXXXXXXXXXXXXXXXXXXXXXXX--XXXXXXXXXXXXX]  2M-3M OIDs
[XXXXXXXXXXXXXXXXXXXX----XXXXXXXXXXXXXXXXXXXXXXXXXXX--X]  3M-4M OIDs
 ↑
 Next OID counter (must skip many used OIDs, each skip requiring disk reads)
```

Why was this hard to detect?

PostgreSQL's OID exhaustion behavior is rare and not commonly encountered even at large scale. Additionally, standard dashboards (CPU, memory, IOPS, lock contention) did not immediately point at OID exhaustion.

Diagnosis of the issue was also frustrated by a lack of dedicated observability on TOAST growth, OID usage, and disk read amplification per write.

Eventual diagnosis was only possible due to the dedicated effort of our engineers working in tandem with external experts and AWS engineers to connect these low-level engine symptoms to the write slowdown we were observing.

Remediation

Immediate actions (completed)

  1. Root cause discovery and isolation

We identified TOAST OID exhaustion as the root cause by engaging internal teams, external consultants, and AWS engineers to analyze the problem at the storage-engine level.

  2. Migration to a new partitioned Persons table

  3. Careful deployment of application changes

Given the risk of introducing new issues during an incident, we chose a manually controlled production deploy for the web app rather than a fully automated rollout. Multiple engineers worked in shifts throughout the weekend to land the required changes.

  4. Scaling ingestion to clear the backlog

Once ingestion performance was restored on the new Persons table, we scaled ingestion workers to process the accumulated backlog while closely monitoring database and broker health.

  5. MSK disk pressure mitigation

  6. Dagster-based backfill

We moved the backfill process into Dagster to gain better visibility and stability for long-running backfill jobs.

  7. Final cleanup and confirmation

We communicated the final resolution and announced a small upcoming maintenance window to consolidate on the new tables. We verified that all delayed events had been processed and no data was lost.

Planned actions (in progress)

  1. Deeper Postgres engine monitoring

We plan to add metrics and alerts around TOAST growth, OID usage, and disk read amplification per write.

  2. Improved runbooks for engine-level limits

We plan to document symptoms and diagnostics for similar Postgres engine-level issues, including clear decision trees for when to migrate versus repair in place.

  3. Improved and new runbooks for customer comms

We have begun creating customer communication runbooks that clarify how and when to communicate with customers about an incident, and that provide a clear escalation path and redundancies.

  4. Exploring other data stores

We've been exploring other data stores for the Persons database and will continue to evaluate them.
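As an illustration of the engine-level monitoring described above, here is a hedged sketch of an alert rule on TOAST OID occupancy. The thresholds and the `oid_alert_level` helper are hypothetical, and the live-OID count would come from a periodic query such as a `count(DISTINCT chunk_id)` over the relevant TOAST table.

```python
OID_SPACE = 2**32  # TOAST OIDs come from a 32-bit counter (~4 billion values)

WARN_PCT = 50.0    # hypothetical thresholds: page the on-call well before
CRIT_PCT = 80.0    # write latency degrades measurably

def oid_alert_level(live_toast_oids: int) -> str:
    """Map the estimated number of live TOAST OIDs to an alert severity."""
    pct = 100 * live_toast_oids / OID_SPACE
    if pct >= CRIT_PCT:
        return "critical"
    if pct >= WARN_PCT:
        return "warning"
    return "ok"

print(oid_alert_level(1_000_000_000))   # ~23% used -> "ok"
print(oid_alert_level(3_000_000_000))   # ~70% used -> "warning"
print(oid_alert_level(3_600_000_000))   # ~84% used -> "critical"
```

Whatever the exact thresholds, the point is to alert on occupancy long before the sparse-OID regime begins, since write latency degrades well ahead of the hard 4-billion limit.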

Lessons learned

What went well?

We preserved all incoming events and persons data. While delayed, data was not lost.

Engineers across ingestion, infrastructure, and application teams, plus external consultants and AWS engineers, collaborated effectively to diagnose a rare engine-level problem.

We successfully migrated to a new Persons table using triggers and backfill while the system remained live, minimizing additional downtime.
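The trigger-and-backfill pattern can be sketched in miniature. This is an in-memory analogy, not our actual migration code: dicts stand in for tables, and a dual-write stands in for the database trigger.

```python
old_table: dict[int, dict] = {1: {"name": "a"}, 2: {"name": "b"}}
new_table: dict[int, dict] = {}

def write_person(person_id: int, props: dict) -> None:
    """Live writes land in the old table; the 'trigger' mirrors them."""
    old_table[person_id] = props
    new_table[person_id] = props  # trigger: keeps the new table current

def backfill() -> None:
    """Copy historical rows, without clobbering rows the trigger already wrote."""
    for person_id, props in old_table.items():
        new_table.setdefault(person_id, props)

write_person(3, {"name": "c"})   # live traffic continues during the migration
backfill()                       # rows 1 and 2 are copied in the background
print(new_table == old_table)    # True: safe to cut reads over
```

Because the trigger wins over the backfill for any row it has already mirrored, the new table converges on the old one while the system stays live, which is what allowed the cutover without additional downtime.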

We provided regular engineer-led status updates and committed to a public post-mortem.

What could have gone better?

It took roughly a day and a half to conclusively identify OID exhaustion as the root cause. We had no dedicated monitoring for TOAST growth, OID usage, or disk read amplification per write.

Many core features (analytics, flags, replay filters, CDP) rely heavily on timely updates to the persons table. When that became unhealthy, a wide surface area of the product was affected.

Our initial backfill approach lacked the visibility and robustness needed for a prolonged, large-scale migration. We had to move this logic into Dagster during the incident.

While secondary to the main cause, disk pressure on MSK during catch-up highlighted that our tiered storage configuration and alerting were not tuned for large backlog scenarios.

All of the team members who are normally responsible for customer communications were unavailable for the duration of this incident and we had to scramble to identify fallbacks.

Moving forward

This incident surfaced a rare but serious interaction between our data model and a low-level PostgreSQL engine limit. It also highlighted how central the Persons data model is to the rest of PostHog: when the persons table slowed down, a wide range of features, from analytics and feature flags to replay filtering and CDP, were indirectly impacted.

We've taken immediate steps to recover by migrating to a new partitioned Persons table, stabilizing ingestion, and clearing the backlog of events. We are now focused on the planned actions above: deeper engine-level monitoring, improved runbooks, and evaluating alternative data stores for person data.

We're committed to continuing to invest in the resilience of our ingestion and Persons infrastructure so that incidents like this become less likely, easier to detect early, and faster to remediate when they do occur.

Canonical URL: https://posthog.com/handbook/company/post-mortems/2025-11-15-persons-db-migration

GitHub source: contents/handbook/company/post-mortems/2025-11-15-persons-db-migration.md
