Problem Our app metrics started showing signs of failure (high event processing times, high decide endpoint latency). This happened randomly and didn't follow immediately from a feature deploy.
Cause (Tentative) Manage events view (#2319) was deployed with a celery task that runs on the hour every hour. Because the deployment happened earlier that hour, the task didn't trigger until the next hour came up. The task itself was responsible for updating items related to each team. However, Clickhouse is configured to handle 100 concurrent queries. This task seemed to have queued up ~1400 queries because there are ~700 teams and each team update required 2 queries. These were long running queries so they weren't consumed by clickhouse quickly.
Timeline 9:05 Manage Events view is deployed with several other large PRs 10:05 Evidence of overload grows 10:10-10:30 Event processing time goes vertical (app is down too) 11:41 Rollback heroku deployment to previous night v764 11:48 Event processing time returns to normal levels Impact app.posthog down for over an hour. Events remained in tact
Lessons Task related queries are unforgiving. They can pile up really easily compared to a one off long running query. (Tentative) The backed up queries quietly deteriorated the rest of the app performance and made it more difficult to pinpoint the problem. The app client was struggling. The celery queue was struggling. We should aim to have even more visuals on metrics especially surrounding clickhouse query load. Solution Always try to turn repetitive queries into a single query if possible.