Event counts lower than expected

Incident Report for PostHog

Resolved

We are caught up on all replicas, and all events are accounted for. We've also identified the exact root cause of the per-partition lag and have mitigations in place to prevent this issue from recurring. We apologize for any inconvenience this may have caused.

Posted Jul 23, 2024 - 01:15 UTC

Update

We've recovered our backlog on all but one instance, which is going slower than anticipated. All events that have come in since Saturday are up to date, but a small percentage of events that came in on Thursday and Friday of last week are still in flight. As long as events continue to be processed on this node at the current rate, we will be fully up to date by the end of the day.

Posted Jul 22, 2024 - 17:33 UTC

Update

We will be 100% caught up on all events by the end of the day today. We'll send out another status update as soon as the backfill is complete.

Posted Jul 21, 2024 - 14:33 UTC

Monitoring

We've identified the root cause of the issue and have mitigated it. We have also kicked off a backfill that will run over the weekend. We are shooting to have all events back in order and up to date by Monday morning. Expect updates over the weekend on the progress of the backfill of missing events. Thank you all for your patience and we hope you enjoy the rest of your Friday and the weekend!

Posted Jul 19, 2024 - 22:53 UTC

Update

We are still investigating and are close to understanding the cause of the event ingestion problem. It seems the root cause is not in the Kafka table engine but in our write path to the distributed tables.

Event ingestion has resumed, but we are running it slowly to avoid those events disappearing, so ingestion will lag for some hours. We are working on pushing another patch to fix the lag.

After that is solved, we'll start the event backfill for the missing dates.

Posted Jul 19, 2024 - 11:36 UTC

Update

We are investigating an issue with our Kafka table engines and have purposely induced lag in our pipeline. All events are safe and will show up after this investigation, but for the moment we will fall behind on processing events, and you will notice the last few hours missing in your reporting.

Posted Jul 18, 2024 - 16:51 UTC

Update

We have started event recovery.

Data since 2024-07-17 at 21:00 UTC may be missing. The missing events will eventually be available for querying.

We are now working on pushing a fix to prevent this from happening again.

Posted Jul 18, 2024 - 13:29 UTC

Investigating

We've spotted that the number of events ingested is lower than expected. We are working to identify the root cause of the issue.

No data has been lost. We are already drawing up a plan to recover it and are identifying the impacted dates.

Posted Jul 18, 2024 - 12:34 UTC

This incident affected: US Cloud 🇺🇸 (Event and Data Ingestion Lag).