Session Replay Capture Endpoint Erroring

Incident Report for PostHog

Resolved

We have fully recovered and are ingesting session replay data in real time. Apologies for the disruption. We will be improving our failover strategy so that we don't face this kind of data loss in the future.

Posted Dec 27, 2024 - 18:23 UTC

Update

We are recovering session events from both sources now. There may be some lag, but we are safely recording all session events sent to us. For a period of 1 hour and 36 minutes, we were unable to capture session events successfully. We will publish a full postmortem on this in the near future.

To be clear, this incident impacted only Session Replay events. Normal events are flowing as usual and were not affected.

Posted Dec 27, 2024 - 17:20 UTC

Monitoring

We've mitigated the problem and are monitoring recovery. We have failed over our Kafka infrastructure to a failsafe and are safely consuming sessions at this point. We will update once more progress has been made.

Posted Dec 27, 2024 - 17:12 UTC

Update

Our provider has estimated about 20 more minutes of downtime. In the meantime, we are working to mitigate the issue ourselves by failing over to an alternate provider so that the sessions being sent are saved. We will update as things progress.

Posted Dec 27, 2024 - 16:14 UTC

Identified

An upstream provider is having an incident that is causing our Session Replay Capture endpoint to return elevated errors and any sessions currently being sent to be lost. We are working with our provider to resolve this issue as soon as possible and will update as soon as we have any further details.

Posted Dec 27, 2024 - 15:55 UTC

This incident affected: EU Cloud 🇪🇺 (Event and Data Ingestion Lag, Feature Flags and Experiments).