We have identified the problem as a bug in the version of ClickHouse we were using. The bug, which relates to projections and column changes, caused replication issues with event data within our cluster. We've updated our ClickHouse cluster to a version with a fix and restored all data for the impacted ClickHouse shard. We are now waiting for that data to replicate to the rest of our cluster, which should take ~6-8 hours depending on traffic in the meantime.
We have also disabled all worker jobs responsible for caching dashboards and insights, as well as those responsible for emailing dashboards. This is because, until the replicas have caught up, there is a chance that the reported data will be partial. We will re-enable these jobs as soon as the replicas have caught up.
We know you depend on PostHog for making decisions and are working as fast as we can to get you the data that you depend on. We apologize for the inconvenience here.
Posted Jul 04, 2023 - 03:51 UTC
We're restarting ingestion. Heads up! We're missing roughly 50% of all data at the moment while we restore from a backup, but that data should reappear once that process is done (a couple of hours). Figures might be off in the meantime.
No data is permanently lost.
Posted Jul 03, 2023 - 21:13 UTC
We're still investigating the ongoing issues. We're hoping to restart event ingestion within the next hour or so.
Posted Jul 03, 2023 - 17:16 UTC
We've identified some issues with our cluster. Querying data should be fine, though we are holding off on ingesting events that have been added to the queue since this morning.
No data has been lost.
Posted Jul 03, 2023 - 13:27 UTC
We are still investigating the cause of the elevated level of errors. Apologies for the continued disruption.
Posted Jul 03, 2023 - 10:08 UTC
We're experiencing an elevated level of API errors and are currently looking into the issue.