Celery backlog due to too many queries to clickhouse and excessive tasks
Incident Report for PostHog
Resolved
Problem
Our app metrics started showing signs of failure (high event processing times, high decide endpoint latency). This happened randomly and didn't follow immediately from a feature deploy.

Cause (Tentative)
Manage events view (#2319) was deployed with a celery task that runs on the hour every hour. Because the deployment happened earlier that hour, the task didn't trigger until the next hour came up. The task itself was responsible for updating items related to each team. However, Clickhouse is configured to handle 100 concurrent queries. This task seemed to have queued up ~1400 queries because there are ~700 teams and each team update required 2 queries. These were long running queries so they weren't consumed by clickhouse quickly.

Timeline
9:05 Manage Events view is deployed with several other large PRs
10:05 Evidence of overload grows
10:10-10:30 Event processing time goes vertical (app is down too)
11:41 Rollback heroku deployment to previous night v764
11:48 Event processing time returns to normal levels
Impact
app.posthog down for over an hour. Events remained in tact

Lessons
Task related queries are unforgiving. They can pile up really easily compared to a one off long running query.
(Tentative) The backed up queries quietly deteriorated the rest of the app performance and made it more difficult to pinpoint the problem. The app client was struggling. The celery queue was struggling. We should aim to have even more visuals on metrics especially surrounding clickhouse query load.
Solution
Always try to turn repetitive queries into a single query if possible.
Posted Nov 13, 2020 - 15:00 UTC