Elevated API Errors

Incident Report for PostHog

Resolved

We think the impact from the issue is over.

Posted Jul 04, 2023 - 07:08 UTC

Monitoring

We have identified the problem as a bug in the version of ClickHouse we were using. The bug, which relates to projections and column changes, caused replication issues with event data within our cluster. We've updated our ClickHouse cluster to a version with a fix. We've also restored all data for the ClickHouse shard that was impacted. We are now waiting for that data to replicate to the rest of our cluster. This should take ~6-8 hours depending on traffic in the meantime.

We have also disabled all worker jobs responsible for caching dashboards and insights, as well as those responsible for emailing dashboards. Until the replicas have caught up, there is a chance that the data reported will be partial. We will re-enable these jobs as soon as the replicas have caught up.

We know you depend on PostHog for making decisions and are working as fast as we can to get you the data that you depend on. We apologize for the inconvenience here.

Posted Jul 04, 2023 - 03:51 UTC

Update

We're restarting ingestion. Heads up! We're missing roughly 50% of all data at the moment while we restore from a backup, but that data should reappear once that process is done (a couple of hours). Figures might be off in the meantime.

No data is permanently lost.

Posted Jul 03, 2023 - 21:13 UTC

Update

We're still investigating the ongoing issues. We're hoping to restart event ingestion within the next hour or so.

Posted Jul 03, 2023 - 17:16 UTC

Identified

We've identified some issues with our cluster. Querying data should be fine, though we are holding off on ingesting events that have been added to the queue since this morning.

No data has been lost.

Posted Jul 03, 2023 - 13:27 UTC

Update

We are still investigating the cause of the elevated level of errors. Apologies for the continued disruption.

Posted Jul 03, 2023 - 10:08 UTC

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Posted Jul 03, 2023 - 08:08 UTC

This incident affected: US Cloud 🇺🇸 (App).