- At 3:00 PM UTC we began to receive 13x our typical amount of event traffic.
- At 3:10 PM UTC alerts fired for an elevated number of errors returned by the events pipeline and the PostHog app
- Our event pipeline workers were inundated and unable to keep up with the rate of events coming into the system
- Because events arrived faster than our event processing workers could handle them, in-flight events began to be staged in Redis, eventually exceeding the memory capacity of our Redis cluster
- Because we hadn't set the "allkeys-lru" eviction policy, the Redis server simply gave up instead of evicting old keys
- At this point Redis, which we lean on heavily for cached views, returned out-of-memory errors for the majority of our requests
- At 3:22 PM UTC we increased the number of web workers and the size of our Redis cluster. We also updated the Redis configuration to evict the least recently used keys once it reaches its memory limit. We began to see things flush through and return to normal.
- At 3:40 PM UTC the root cause was mitigated and no further errors were seen in the system. The events pipeline and app were both green.
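The Redis side of the mitigation above can be sketched as follows; the memory limit shown is an illustrative value, not our actual instance size:

```shell
# Switch Redis to LRU eviction so it drops least recently used keys instead
# of returning out-of-memory errors once the memory limit is reached.
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
# To persist across restarts, mirror the same settings in redis.conf:
#   maxmemory 2gb
#   maxmemory-policy allkeys-lru
```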
Future work to prevent recurrence:
- Automatic scaling of web and event processing workers would go a long way toward mitigating unexpected spikes in traffic: https://github.com/PostHog/posthog/issues/1355
- Replacing Redis with Kafka. If there is a spike in traffic and we are flushing in-flight events to disk, we have a much longer runway to spin up event processing workers without the risk of dropping events: https://github.com/PostHog/posthog/issues/1356
- Done (short term): set the "allkeys-lru" eviction policy on the Redis instance
- Instrument alerts on events queue size to catch the system falling behind, and use this as a trigger for paging and for auto-scaling our infrastructure: https://github.com/PostHog/posthog/issues/1357
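A minimal sketch of the queue-size check from the last item; the queue key, threshold, and client interface are assumptions for illustration, not our production configuration:

```python
# Hypothetical sketch: decide whether to page (and scale up event processing
# workers) based on the depth of the in-flight events queue in Redis.
BACKLOG_THRESHOLD = 10_000  # assumed paging threshold, in queued events

def should_page(client, queue_key: str = "events") -> bool:
    """Return True once the backlog exceeds the threshold.

    `client` is any Redis-like object exposing llen(key), e.g. a redis-py
    redis.Redis() instance; `queue_key` is an assumed queue name.
    """
    return client.llen(queue_key) > BACKLOG_THRESHOLD
```

Polling this on a short interval and wiring the result into the pager and the worker autoscaler would catch the system getting behind well before Redis fills up.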