Workflow "Wait until condition" steps silently failing

Between March 30 and April 22, 2026, a bug in our workflow engine caused workflows using "Wait until condition" steps to silently stop resuming. Affected workflows appeared to complete normally in the UI but never executed their downstream actions, such as delivering emails or sending Slack notifications. In total, 48 workflows across 33 customer organizations were impacted, with 11,920 invocations silently blocked. The issue has been fully resolved. Affected customers have been contacted and will see a banner on each impacted workflow with a self-serve option to review and replay the silently blocked runs. Importantly, 99.7% of all workflows triggered during this period executed normally.

Summary

PostHog's workflow engine allows customers to build multi-step automations. Some steps, like "Wait until condition," pause execution and periodically re-check whether a condition has been met before continuing.

On March 30, we deployed a deduplication mechanism to fix an earlier incident where ghost workflow runs were causing customers to receive duplicate emails and notifications. The dedup logic worked by comparing the invocation ID of a workflow when it first entered a step against the ID it carried when it resumed. If the IDs didn't match, the resume was treated as a duplicate.
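The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: `StepEntry`, `isDuplicateResume`, and the ID strings are hypothetical names chosen for the example.

```typescript
// Hypothetical shape of the state recorded when an invocation first enters a step.
interface StepEntry {
  actionId: string;
  invocationId: string;
}

// Returns true when the resume should be dropped as a duplicate:
// the ID carried on resume differs from the ID stored at step entry.
function isDuplicateResume(entry: StepEntry, resumedInvocationId: string): boolean {
  return entry.invocationId !== resumedInvocationId;
}

const entry: StepEntry = { actionId: "wait_until_condition", invocationId: "inv-A" };

const sameRun = isDuplicateResume(entry, "inv-A"); // false: the resume proceeds
const ghostRun = isDuplicateResume(entry, "inv-B"); // true: the resume is dropped
```

A check like this is only safe if nothing between pause and resume can change the invocation ID of a legitimate run.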

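The dedup check misfired only on "hold-state" actions: steps such as "Wait until condition" that pause and later re-enter themselves. For these steps, the invocation ID could change between pause and resume (see Root Cause Analysis), so legitimate resumes were treated as duplicates and dropped. Steps like "Delay" advance to the next action before pausing, so the dedup check on the next action started fresh and never hit the mismatch.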

The issue went undetected far longer than usual because we lacked observability on the dedup code path.

Timeline

All times in UTC.

Root Cause Analysis

Invocation ID format mismatch across subsystem boundary

The workflow engine generates invocation IDs using PostHog's UUIDT format. The V1 job queue (job-queue-postgres.ts) validates incoming IDs using the npm uuid package's isUuid check, which rejects UUIDT-format IDs and silently substitutes a fresh UUIDv7.

When a "Wait until condition" step paused and was re-queued through the Postgres V1 path, the invocation ID was rewritten. On resume, the dedup logic compared the stored UUIDT against the new UUIDv7, saw different IDs, and concluded the resume was a duplicate — silently terminating the workflow.
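The failure sequence can be reproduced in miniature. The sketch below uses stand-ins throughout: `STRICT_UUID` is an illustrative strict validator rather than the uuid package's implementation, `freshId` is a deterministic placeholder for the substituted UUIDv7, and `engineId` is a made-up value that merely resembles an ID a strict validator would reject.

```typescript
// Stand-in for a strict UUID validator: requires a version nibble of 1-8
// and a variant nibble of 8, 9, a, or b.
const STRICT_UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

let counter = 0;

// Hypothetical replacement-ID generator (production substituted a UUIDv7).
function freshId(): string {
  counter += 1;
  // Adding 2^48 and dropping the leading "1" zero-pads the counter to 12 hex digits.
  const suffix = (0x1000000000000 + counter).toString(16).slice(1);
  return "00000000-0000-7000-8000-" + suffix;
}

// Mimics the queue behaviour described above: IDs that fail validation
// are silently replaced instead of being rejected with an error.
function enqueue(invocationId: string): string {
  return STRICT_UUID.test(invocationId) ? invocationId : freshId();
}

// Made-up engine ID whose version nibble is 0, so the strict validator rejects it.
const engineId = "018f3a2c-9b1e-0000-4c5d-1234567890ab";

const storedAtEntry = engineId;          // what the dedup logic remembers
const idAfterRequeue = enqueue(engineId); // what comes back on resume
const treatedAsDuplicate = storedAtEntry !== idAfterRequeue;

// An ID that passes validation survives the queue unchanged.
const validId = "00000000-0000-4000-8000-000000000abc";
const validSurvives = enqueue(validId) === validId;
```

In this sketch the silent substitution is the critical design choice: had `enqueue` thrown on an invalid ID, the mismatch would have surfaced as an error at the first re-queue.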

Both sides of this boundary were tested in isolation: the dedup tests called the executor directly (never round-tripping through the queue), and the queue tests used uuidv4() instead of the UUIDT generator that production actually uses. Both passed, but neither caught the mismatch that only surfaces when the two subsystems interact.
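A test that crosses the boundary might take the shape of a contract check parameterised by the ID generator, so that it can be run with the generator production actually uses. The following is a sketch under assumptions: `queuePreservesId`, `lossyQueue`, and the `std-` prefix convention are hypothetical and stand in for the real queue and ID formats.

```typescript
// A queue is modelled as a function returning the ID carried on resume.
type Queue = (invocationId: string) => string;

// Contract check: whatever ID the engine generates must survive the queue unchanged.
function queuePreservesId(generateId: () => string, queue: Queue): boolean {
  const id = generateId();
  return queue(id) === id;
}

// Stand-in lossy queue: replaces any ID that does not start with "std-".
const lossyQueue: Queue = (id) => (id.indexOf("std-") === 0 ? id : "std-replacement");

// With a test-only generator the check passes, hiding the problem.
const withTestOnlyIds = queuePreservesId(() => "std-1234", lossyQueue);

// With a production-style generator the same check exposes the mismatch.
const withProductionIds = queuePreservesId(() => "engine-1234", lossyQueue);
```

The point of passing the generator in is that the same assertion can be exercised against every ID source the system uses, rather than only the one that was convenient in the test file.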

Missing observability on a critical code path

The dedup logic was deployed without metrics tracking how many invocations were being filtered. Although legitimate deduplications were expected — thousands of ghost runs were still being correctly blocked — having a baseline would have made the anomalous spike in filtered invocations visible and drawn attention to the issue much sooner.
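As an illustration of the kind of baseline that was missing, the sketch below counts filtered resumes and compares the filtered ratio against an expected level. Everything here is assumed for the example: `DedupMetrics`, `isAnomalous`, the 5% baseline, and the 3x threshold are hypothetical, and a real deployment would export the counters to its metrics backend instead of holding them in memory.

```typescript
// Hypothetical in-memory counters for the dedup code path.
class DedupMetrics {
  private seen = 0;
  private filtered = 0;

  // Record one resume attempt and whether the dedup check filtered it.
  record(wasFiltered: boolean): void {
    this.seen += 1;
    if (wasFiltered) this.filtered += 1;
  }

  // Fraction of resume attempts that were filtered (0 when nothing was seen).
  filteredRatio(): number {
    return this.seen === 0 ? 0 : this.filtered / this.seen;
  }
}

// Flags a window whose filtered ratio exceeds the baseline by more than the given factor.
function isAnomalous(current: number, baseline: number, factor: number): boolean {
  return current > baseline * factor;
}

const metrics = new DedupMetrics();
for (let i = 0; i < 100; i++) {
  metrics.record(i < 30); // 30 of 100 resumes filtered in this made-up window
}

const ratio = metrics.filteredRatio();
const alert = isAnomalous(ratio, 0.05, 3); // assumed 5% baseline, 3x threshold
```

Breaking the counters down by action type would make a spike confined to hold-state steps easier to spot against the background of legitimate deduplications.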

Lessons Learned

What went well

What went poorly

Key takeaways

  1. We've reverted the dedup logic and are investing in building a solution that fully mitigates this class of problems. The new architecture will also allow us to write more robust end-to-end tests to prevent issues like this from happening again.
  2. We have deployed additional alerting that will notify the responsible teams immediately if this class of failure occurs again.

Canonical URL: https://posthog.com/handbook/company/post-mortems/2026-04-27-workflow-wait-until-condition

GitHub source: contents/handbook/company/post-mortems/2026-04-27-workflow-wait-until-condition.md