PostHog Handbook Library / Engineering

1,613 words. Estimated reading time: 8 min.

ClickHouse Clusters

Auto TL;DR

At a Glance

This long page covers these main areas. The list is generated from the article headings, so it updates with every handbook rebuild.

  1. Common features
  2. ZooKeeper
  3. US
  4. EU
  5. Dev
  6. Backup Policy
  7. HouseWatch
  8. CH Version

We have three different ClickHouse clusters here at PostHog:

  1. Prod-US: Our main production cluster for the US.
  2. Prod-EU: Our main production cluster for the EU.
  3. Dev: Our development cluster.

Common features

All clusters have these features in common:

ZooKeeper

We use ZooKeeper for ClickHouse replicated MergeTree tables. It is responsible for managing the replication of data between replicas and ensuring consistency.

Eventually we want to migrate to ClickHouseKeeper.

US
EU
Dev

Backup Policy

We backup all production tables every day. Once a week we take a full backup and then incremental backups for the rest of the week. We keep backups for 8 days.

Backups are managed through HouseWatch

We are able to do point in time recovery between daily backups and replay from Kafka topics. We have retention on the events to ClickHouse topic set to 5 days.

HouseWatch

https://github.com/PostHog/HouseWatch

HouseWatch is our internal tool (that is also open source!) that we use to manage ClickHouse maintenance tasks such as backups. We also use it to benchmark queries, access logs, and other tasks the help to operate the cluster.

You can find HouseWatch deployed on both Prod US and Prod EU Kubernetes clusters here:

CH Version

Currently we run 23.12.5.81

We run the same version in:

We do this because ClickHouse is notorious for breaking changes between versions. We've seen issues with query syntax compatibility, data result consistency, and other unexpected issues between upgrades.

In order to upgrade ClickHouse, we need to bump the version on CI to test for regressions and compatibility issues. We do this by adding the ClickHouse version to the CI Matrix of ClickHouse versions. This has the issue of slowing down CI because we then run two tests for everything on the current and desired version on ClickHouse, but it works nicely because it shows the discrepancies between the versions clearly.

Once we have resolved all issues on CI, we can then upgrade the ClickHouse version on:

Prod-US

Prod US is our main production cluster for the US.

It is made up of the following topology:

Online Cluster

2 out of the 3 replicas are what we call the 'Online cluster'. It serves all traffic coming in from us.posthog.com. We do this to guard against background tasks consuming resources and slowing down query times on the app. We've seen query time variability otherwise.

Offline Cluster

The third replica is what we call the 'Offline cluster'. It serves all background tasks and other non-essential traffic.

Traffic that it serves:

Load Balancing

We use AWS Network Load Balancers to route traffic to the correct replica. We have a separate load balancer for the Online and Offline clusters. We also have another Load Balancer that hits all nodes (Online and Offline) for tasks that don't need to be separated.

Each of these have a target group for each node targeting ports :8443 and :9440.

Data Retention Policy

We currently keep all data for all time with no TTL when it comes to Events.

We have a TTL for Session Replay data, which is 30 days.

Instance types

The original nodes of the cluster are using i3en.12xlarge instances. We are currently in the process of migrating to im4gn.16xlarge instances.

Online cluster
Offline cluster

Old nodes are using r6i.16xlarge instances. These are being retired due to IO throughput constraints.

Tiered storage

One of the nice features of our data is that recent data (data < 30 days old and generally hitting the 2 most recent active partitions) is the hottest both for reads and writes. This is a perfect fit for tiered storage. We can basically read and write to local ephemeral NVMe (with solid backup strategies) and then move the data to cheaper EBS volumes as it ages out. We currently do this using tiered storage configured simply by setting the storage configs in ClickHouse, but we eventually will want to move to setting TTLs on tables and having ClickHouse manage the tiering for us consistently.

https://altinity.com/blog/2019-11-29-amplifying-clickhouse-capacity-with-multi-volume-storage-part-2

Monitoring

https://grafana.prod-us.posthog.dev/d/vm-clickhouse-cluster-overview/clickhouse-cluster-overview?from=now-3h&to=now&timezone=utc&refresh=30s

Prod-EU

Prod EU is our main production cluster for the EU.

It is made up of the following setup:

We hit a problem with having smaller shards on EU that had a significant performance impact. We were running out of memory for larger queries which was also impacting our page cache hit rate. This was mainly due to limiting query size restrictions to protect other users of the cluster. We had two solutions for this. Increase the size of the nodes...meaning double the size for each instance x 16, very expensive. Or, we could setup a coordinator node. The coordinator node is a topology that allows us to effectively split the storage and compute tiers of the cluster into two pools of resources. We treat the current cluster of small nodes with many shards as the storage tier, they effectively are the mappers of the cluster and quickly fetch the data that we want for queries. We then send that relatively small data back to the coordinator and do the heavy lifting there which includes joins and aggregates.

For the EU the coordinator node is:

This is more than anything we can get with any combination of EBS volumes alone (within reason $$$). A nice bonus on top of this is this does not impinge on the EBS throughput limits of the node.

Coordinator schema

The coordinator has distributed tables (events, session_replay) tables that point to the EU Prod cluster

All non-sharded tables are replicated so that they are local to the coordinator.

Coordinator future

We should probably consider moving this to a m7g.metal if we hit any memory constraints with this, but so far we have not because of the dedicated nature of this node.

We will also want to create new coodinators for multi-tenant workloads in the future. This will allow us to scale up and down easily over time, and even potentially throughout the day as the workload rises and falls.

Monitoring

https://grafana.prod-eu.posthog.dev/d/vm-clickhouse-cluster-overview/clickhouse-cluster-overview

Dev

Dev is a relatively basic setup for development and testing.

We have a single shard with 2 replicas. This is to mimic the production setup as closely as possible. We have a single shard because we don't have the same volume of data as production. We have 2 replicas because we want to test failover scenarios.

Problems

The biggest pain points on our ClickHouse clusters is Disk Throughput. We still are using mutations too frequently. Every mutation rewrites large portions of our data on disk. This requires reading, and writing huge amounts of data which robs normal queries and inserts of resources. The best solution that we've found to support the current high utilization of mutations is to move to nodes that have local NVMe storage. This, along with RAID 10 far 2 configs provides us with roughly 1000 MB/s writes and 4000 MB/s reads at the same time on a node. This is than anything we can get with any combination of EBS volumes alone (within reason $$$). A nice bonus on top of this is this does not impinge on the EBS throughput limits of the node. Meaning that on top of the baseline speed to NVMe disk we can tier out to EBS and have full instance EBS throughput available for that JBOD disk pack.

Currently US is entirely on NVMe backed nodes. EU will need to be migrated to this setup as well.

Canonical URL: https://posthog.com/handbook/engineering/clickhouse/clusters

GitHub source: contents/handbook/engineering/clickhouse/clusters.mdx

Content hash: b7c71a753490d3ef