Handling an incident

The TL;DR

Raising an incident

Image: alert-example

Incidents are going to happen. If you'd rather watch a Loom, check out an incident drill Loom recording.

Anyone can declare an incident and, when in doubt, you should always raise an incident. We'd much rather have declared an incident which turned out not to be an incident. Many incidents take too long to get called, or are missed completely because someone didn't ring the alarm when they had a suspicion something was wrong. It's _always_ better to sound an alarm than not.

To declare an incident, type /incident anywhere in Slack. This creates a new dedicated channel for the incident and adds a few stakeholders. It also triggers an alert in the #incidents channel so everyone else is aware. Declaring an incident doesn't trigger any external notifications.

Once an incident is raised, an automatic workflow begins that helps you summarize the issue and escalate it appropriately.

Some things that should definitely be an incident

Things that _shouldn’t_ be an incident

Planning some maintenance? Check the announcements section instead.

Security-specific guidance

Security incidents can have far-reaching consequences and should always be treated with urgency. Some examples of security-related issues that warrant raising an incident include:

When in doubt, err on the side of caution: raise the incident and escalate early! Better to be safe than sorry.

Need to make a security advisory? We have a page for that with more detail on the process for security vulnerabilities.

Incident severity

Please refer to the following guidance when choosing the severity for your incident. If you are unsure, it's usually better to over-estimate than under-estimate!
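As a rough summary of the three levels described in the sections that follow, the choice can be sketched in Python. This is only an illustration: `Signals`, `choose_severity`, and `usually_pages` are hypothetical names, not part of any PostHog tooling, and treating critical incidents as paging events is an assumption extrapolated from the guidance for major incidents. The final call always rests on human judgment.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Signals:
    """Hypothetical inputs for illustration; not an official checklist."""
    key_functionality_broken: bool = False
    threatens_company_or_revenue: bool = False
    unsure: bool = False


def choose_severity(signals: Signals) -> Severity:
    """Map the signals to a severity, over-estimating when unsure."""
    if signals.threatens_company_or_revenue:
        level = Severity.CRITICAL
    elif signals.key_functionality_broken:
        level = Severity.MAJOR
    else:
        level = Severity.MINOR
    # When unsure, bump one level up rather than risk under-estimating.
    if signals.unsure and level < Severity.CRITICAL:
        level = Severity(level + 1)
    return level


def usually_pages(severity: Severity) -> bool:
    """Minor incidents can wait for working hours.

    Assumption: critical incidents page people just as major ones do.
    """
    return severity >= Severity.MAJOR
```

For example, `choose_severity(Signals(unsure=True))` yields `Severity.MAJOR`, reflecting the advice to over-estimate when in doubt.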

Minor

A minor-severity incident does not usually require paging people and can be addressed within normal working hours. However, it is higher priority than any bug and should come before sprint work.

Examples

If not dealt with, minor incidents can often become major incidents. Minor incidents are usually OK to have open for a few days, whereas anything more severe we would be trying to resolve ASAP.

Major

A major incident usually requires paging people and should be dealt with _immediately_. Major incidents are usually opened when key or critical functionality is not working as expected.

Major incidents often become critical incidents if not resolved in a timely manner.

Examples

Critical

A critical incident has very high impact on customers and the potential to existentially affect the company or reduce revenue.

Examples

What happens during an incident?

When an incident is declared, the person who raised the incident is the incident lead. It’s their responsibility to:

The incident lead is not responsible for fixing the incident; they're responsible for managing it. Sometimes the same person will do both. But if it is too much work for one person, hand over the incident lead role to someone who is not actively working on the fix.

Sometimes, customer communication is required. In this case, the incident lead can ask for a comms lead to support the responding team. The best way to do this is to ask for support in the incident channel and use the @all-marketers group tag. Don't be shy.

You can find further production runbooks and specific strategies for debugging outages here (internal).

The PostHog status page

Our status page is the central hub for all incident communication. You can update it easily using the /incident statuspage (/inc sp) Slack command.

When updating the status page, make sure to mark the affected component appropriately (for example, during an ingestion delay, set US Cloud 🇺🇸 / Event and Data Ingestion to Degraded Performance). This allows PostHog's UI to gently surface incidents with a "System status" warning on the right. Only users in the affected region will see the warning.

Getting help from a comms lead

Significant incidents, such as the app being partially or fully non-operational or ingestion delays of 30 minutes or longer, should be clearly communicated to our customers. Customers should know what is going on and what we are doing to resolve it. If the incident is minor, this can usually be done by updating the status page, but it may be desirable to do additional customer communications, such as sending an email to impacted customers. When this is required, you should involve a Comms Lead and ensure the Sales team is aware.

The best way to ask for support from a Comms Lead is to post in the incident channel and use the @all-marketers group tag. This will alert all the relevant marketing teams.
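The communication threshold described above can be expressed as a small check. This is a minimal sketch: the function and constant names are hypothetical, and the only number taken from this page is the 30-minute ingestion delay.

```python
# Threshold taken from the guidance above: ingestion delays of 30 minutes or longer.
INGESTION_DELAY_THRESHOLD_MINUTES = 30


def needs_customer_comms(app_partially_or_fully_down: bool,
                         ingestion_delay_minutes: float) -> bool:
    """Return True when the incident should be clearly communicated to customers.

    Hypothetical helper for illustration; the real decision sits with the
    incident lead and, where involved, the Comms Lead.
    """
    return (
        app_partially_or_fully_down
        or ingestion_delay_minutes >= INGESTION_DELAY_THRESHOLD_MINUTES
    )
```

A delay of exactly 30 minutes counts as significant here, matching the "30 minutes or longer" wording.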

When handling a security incident, please align with the incident responder team in the incident Slack channel about public communication of security issues. For example, it may not make sense to immediately communicate an attack publicly, as this could make the attacker aware that we are already investigating. This could make it harder for us to stop the attack for good.

When a customer is causing an incident

In the case that we need to update a _specific_ customer, such as when an individual org is causing an incident, we should let them know as soon as possible. Use the following guidelines to ensure smooth communication:

If we need to temporarily limit a _specific_ customer's access to any functionality (e.g. temporarily prevent them from using an endpoint) because their usage is causing an incident, we need to put an alert on their Zendesk tickets. This ensures that anyone working on a ticket from the org knows what's happening with the org before replying. Even if we've already reached out to the org, some folks there may not be aware and may open a support ticket.

You'll just need to set the name of the org in an existing trigger in Zendesk, then reverse that change when the org's full access has been restored:

  1. After logging in to Zendesk, go to the admin center
  2. In the left column, expand Objects and rules and click on Triggers (under "Business rules")
  3. On the Triggers page, expand Housekeeping and click on Add alert for org with special-handling
  4. Under Conditions, the last condition is: Organization > Organization Is PostHog. Change PostHog to the name of the organization that has had its access limited as a result of the incident. (Click on "PostHog", start typing to filter and find the org name, then click on it)
  5. Scroll to the bottom of the page and click the Save button

Once the org has had their full access restored, repeat the steps above, but this time put PostHog back in the last condition, and remember to Save the change.
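The set-then-revert pattern in the steps above can be modeled as a context manager that guarantees the placeholder is restored. This is purely an in-memory illustration: the `trigger` dictionary, its `organization_is` key, and the `special_handling_alert` helper are hypothetical and do not call or mirror the Zendesk API. The actual change is made by hand in the Zendesk admin center.

```python
from contextlib import contextmanager

# Placeholder value the trigger condition holds when no org is limited.
DEFAULT_ORG = "PostHog"


@contextmanager
def special_handling_alert(trigger: dict, org_name: str):
    """Point the (hypothetical, in-memory) trigger at org_name, then restore it."""
    trigger["organization_is"] = org_name
    try:
        yield trigger
    finally:
        # Always put the placeholder back once full access is restored.
        trigger["organization_is"] = DEFAULT_ORG


# Usage with a made-up organization name:
trigger = {
    "name": "Add alert for org with special-handling",
    "organization_is": DEFAULT_ORG,
}
with special_handling_alert(trigger, "Example Org") as active:
    during = active["organization_is"]  # "Example Org" while access is limited
after = trigger["organization_is"]      # back to "PostHog" afterwards
```

The `finally` clause mirrors the reminder to reverse the change: the placeholder goes back even if something fails while the org's access is limited.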

When does an incident end?

An incident ends when we've identified the root cause, implemented a fix, and confirmed that all customer-facing services have returned to normal. End the incident by typing /inc close in the incident channel. Make sure to also mark the incident as resolved on the status page.
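The closing criteria above can be read as a checklist. The sketch below is a hypothetical helper, not part of the /inc tooling, that reports which items are still outstanding before closing.

```python
def outstanding_before_close(root_cause_identified: bool,
                             fix_implemented: bool,
                             services_back_to_normal: bool,
                             status_page_resolved: bool) -> list[str]:
    """Return the closing criteria that are not yet met (empty means ready)."""
    checks = [
        (root_cause_identified, "identify the root cause"),
        (fix_implemented, "implement a fix"),
        (services_back_to_normal, "confirm customer-facing services are back to normal"),
        (status_page_resolved, "mark the incident as resolved on the status page"),
    ]
    return [label for done, label in checks if not done]
```

An empty list suggests the incident is ready to be closed with /inc close.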

What happens after an incident?

Once the incident is resolved, the incident lead should step away. Take a walk, go to the gym, have some tea, take a shower. The longer the incident took to resolve, and the more directly customer-impacting it was, the more important this is. Bring another team member up to speed, hand off outstanding customer communications, and get your head clear for the post-mortem and follow-up actions. Anyone else heavily involved in the response should do the same.

In almost all cases, a valid incident will have a post-mortem. Check out Post-mortems for more details.

Canonical URL: https://posthog.com/handbook/engineering/operations/incidents

GitHub source: contents/handbook/engineering/operations/incidents.md
