All Systems Operational
app.posthog.com: Operational
99.89% uptime over the past 90 days
Past Incidents
Nov 24, 2020

No incidents reported today.

Nov 23, 2020

No incidents reported.

Nov 22, 2020

No incidents reported.

Nov 21, 2020

No incidents reported.

Nov 20, 2020

No incidents reported.

Nov 19, 2020

No incidents reported.

Nov 18, 2020

No incidents reported.

Nov 17, 2020

No incidents reported.

Nov 16, 2020

No incidents reported.

Nov 15, 2020

No incidents reported.

Nov 14, 2020

No incidents reported.

Nov 13, 2020
Resolved

Problem
Our app metrics started showing signs of failure: high event processing times and high latency on the decide endpoint. The failure appeared without an obvious trigger and did not immediately follow a feature deploy.

Cause (Tentative)
The Manage events view (#2319) was deployed with a Celery task that runs every hour, on the hour. Because the deployment happened shortly after the hour, the task did not trigger until the start of the next hour. The task updates items related to each team. However, ClickHouse is configured to handle 100 concurrent queries, and the task appears to have queued up ~1400 queries at once: there are ~700 teams and each team update required 2 queries. These were long-running queries, so ClickHouse did not work through them quickly.
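The arithmetic behind this tentative cause can be sketched in a few lines of Python. This is only an illustration using the approximate figures from the report; the function and constant names are hypothetical and are not taken from the PostHog codebase.

```python
from datetime import datetime, timedelta

# Approximate figures taken from the incident report.
TEAMS = 700
QUERIES_PER_TEAM = 2
MAX_CONCURRENT_QUERIES = 100  # ClickHouse concurrency limit cited in the report


def queued_queries(teams: int, queries_per_team: int) -> int:
    """Total queries issued by one run of the hourly task."""
    return teams * queries_per_team


def next_top_of_hour(moment: datetime) -> datetime:
    """First on-the-hour instant strictly after `moment`."""
    return moment.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


total = queued_queries(TEAMS, QUERIES_PER_TEAM)
oversubscription = total / MAX_CONCURRENT_QUERIES

# Deploy time from the timeline; the date is only there to build a datetime.
deployed_at = datetime(2020, 11, 13, 9, 5)
first_run = next_top_of_hour(deployed_at)

print(total)             # 1400
print(oversubscription)  # 14.0
print(first_run.time())  # 10:00:00
```

Under these assumptions, a single run asks for roughly 14 times the configured concurrency, and the first run after a 9:05 deploy falls at 10:00, which lines up with the overload first being noticed at 10:05.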

Timeline
9:05 Manage Events view is deployed with several other large PRs
10:05 Evidence of overload grows
10:10-10:30 Event processing time goes vertical (the app is down too)
11:41 Heroku deployment rolled back to the previous night's release, v764
11:48 Event processing time returns to normal levels
Impact
app.posthog.com was down for over an hour. Events remained intact.

Lessons
Task-related queries are unforgiving. They can pile up far more easily than a one-off long-running query.
(Tentative) The backed-up queries quietly degraded the performance of the rest of the app, which made the problem harder to pinpoint. The app client was struggling, and so was the Celery queue. We should aim for more visibility into metrics, especially around ClickHouse query load.
Solution
Where possible, turn repetitive queries into a single query.
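As a rough illustration of this idea, the sketch below contrasts a per-team loop with one grouped query. It uses an in-memory SQLite database purely as a stand-in for ClickHouse, and the `events` table, its columns, and both function names are hypothetical; the actual task and schema are not described in this report.

```python
import sqlite3

# Hypothetical schema: one row per event, tagged with its team.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (team_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events (team_id, event) VALUES (?, ?)",
    [
        (1, "pageview"), (1, "click"),
        (2, "pageview"),
        (3, "signup"), (3, "click"), (3, "click"),
    ],
)
team_ids = [1, 2, 3]


def counts_per_team_looped(conn, team_ids):
    """Repetitive pattern: one query per team, so N teams means N queries."""
    result = {}
    for team_id in team_ids:
        row = conn.execute(
            "SELECT COUNT(*) FROM events WHERE team_id = ?", (team_id,)
        ).fetchone()
        result[team_id] = row[0]
    return result


def counts_per_team_single(conn):
    """Single-query pattern: one GROUP BY covers every team at once."""
    rows = conn.execute(
        "SELECT team_id, COUNT(*) FROM events GROUP BY team_id"
    ).fetchall()
    return dict(rows)


looped = counts_per_team_looped(conn, team_ids)
single = counts_per_team_single(conn)
print(looped == single)  # True
```

Both functions return the same counts here, but the grouped version issues one query regardless of the number of teams. One caveat of this sketch: a team with no rows simply does not appear in the grouped result, so callers would need to default missing teams to zero.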
Nov 13, 15:00 UTC
Nov 12, 2020

No incidents reported.

Nov 11, 2020

No incidents reported.

Nov 10, 2020

No incidents reported.