Turning a major failure into a series of improvements
Being a developer is like driving: you focus on the final destination and, to overcome obstacles, you take shortcuts or draw your own route if necessary. Debugging and solving incidents on production is a key part of our job. It takes a slightly different skill set than our everyday work — you need to act quickly and stay calm at the same time. Reverse engineering is a skill in itself, one you improve by taking lessons from what went wrong.
Quite often it’s a simple matter of debugging your latest production deployment. You may literally drive around the block and take the correct direction next time. Or, if the problem seems more serious, you may stop and think about the directions again. You simply roll it all back and run a test in a safe testing environment without the added pressure of customer impact. It gets really serious when rollback is not possible, when the path to find the mistake leads beyond your latest changes. Here is an example of such an incident and how we tackled it.
Prepare for impact
It started on the evening of October 30, 2019, just 3 days before the biggest event we had on Showmax that year - a live stream of the Rugby World Cup final with South Africa in it. We had put a lot of effort into the stability of our platform, even postponing a few updates to avoid unnecessary bugs that could bite us hard.
Suddenly, we noticed a spike in latencies in one of our services. It was not hard to notice, because the spike was about 10x normal. When your latency suddenly rises tenfold, you know you have about a minute before things go REALLY bad. So you scream “Somebody make me a coffee!“ and get down to it.
50th percentile of the API (catalogue) latencies in milliseconds. It stops at 15 seconds only because, after 15 seconds, the request is killed by our infrastructure.
Sure enough, after about 50 seconds, Elasticsearch, which we use for storing asset metadata for APIs, started to scream in pain, dropping requests. Well, that would explain the latency.
Who is hurting you, my dear elastic? Probably me, I just don’t know how. Yet.
Our database clusters were under extreme load, which led to our services hanging while waiting for a response, and then failing when they did not get one. Only about 20% of requests to our services actually returned successfully, and it even appeared that we were dropping some logs due to the huge number of errors coming from everywhere.
This looks like an invitation to a great party already!
So had we finally reached a number of unique users that could kill our Elasticsearch, and were we just kicking it with tons of new requests coming from a huge number of customers?
Sure, there was a user peak, but it should not have been that high. We started digging into what we were actually fetching. Strangely, the absolute majority of the requests were fetching the configuration of the countries where we have launched. It is used to look up which country belongs to which territory, so our Content team does not need to mark content for every single country where it should be available. But this does not change very often, which is why we cache it. Why so many requests then?
OH. Could it be that at this peak of users the cache had also expired, so every request was trying to fetch this and cache it?
Almost, but the number of requests was still way too high for the number of users we have. What was generating the rest of them?
F***, I think I know!
Even health checks can make your service unhealthy
A bit of an intermezzo… We use the list of countries and territories in all of our APIs communicating with Elasticsearch, so this bit of code is implemented as a middleware for the Falcon REST framework. Falcon middlewares are usually triggered for every request. And in the case of one of our middlewares, that is exactly what it did — get the data from the cache, or fetch it and cache it. And it happens for every request. EVERY REQUEST. It made us realise that the culprit might be in some middleware. So, we immediately jumped to another middleware running for every request we have — the health middleware.
So what does the health middleware do? If the instance of a service where this middleware is running is healthy, it returns an empty 200, otherwise a 500. Each of our Varnishes (the caching proxy we use for all our APIs) pings this endpoint on each container every 2 seconds, and if the response is not a 200, we stop routing to that instance because it is not healthy. EVERY REQUEST!
We’re getting really close to the root cause now. It seems that customers put our Elasticsearch cluster under some load, then the cache for countries expired, the sick Elasticsearch clusters were not able to respond, and we kicked them with “len(instances of service) * len(Varnishes)” requests every 2 seconds to make sure they did not come back. At the time, we had 15 Varnishes and 12 containers using this middleware. That leads to 180 queries to our Elasticsearch clusters (we had 2, so 90 per cluster) every other second. Not great. Not great at all. To make it even worse, we were actively planning to deploy more Varnishes and API containers for the weekend, as we expected a huge spike of users coming for the rugby final. So we would actually have made the situation a lot worse.
It had been broken like this for the past three years, but the content was always cached after the first request, so it never hit the database. Luckily, there was a very simple fix: just do not fetch countries for health check requests.
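A minimal sketch of what such a middleware might look like with the fix applied — the cache wrapper, health check path, and index name are hypothetical, not our actual code:

```python
import falcon


class CountriesMiddleware:
    """Resolves the country/territory list for every incoming request."""

    def __init__(self, cache, es_client):
        self.cache = cache   # hypothetical cache wrapper (e.g. around Redis)
        self.es = es_client  # Elasticsearch client

    def process_request(self, req, resp):
        # The fix: health check requests do not need the country list,
        # so skip the lookup entirely and never touch Elasticsearch here.
        if req.path == '/health':  # hypothetical health check path
            return

        countries = self.cache.get('countries')
        if countries is None:
            # Cache miss: fetch from Elasticsearch and cache the result.
            # Before the fix, this branch also ran for health checks, so an
            # expired cache plus a struggling cluster meant a query storm.
            countries = self.es.search(index='countries')  # hypothetical index
            self.cache.set('countries', countries, ttl=3600)

        req.context['countries'] = countries


# Hypothetical wiring: the middleware runs for every request the app serves.
# app = falcon.App(middleware=[CountriesMiddleware(cache, es_client)])
```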
Time to go home, right? Please?
Far from it.
Past me is my least favorite person
So, we stopped our APIs (they were not working anyway) to give Elasticsearch some breathing space without user requests…but it still couldn’t breathe. So we thought that there might be a lot of queries already queued and waiting for their turn. Time to look into the queued requests. About a million, sure, why the hell not. We had a request timeout set, so the requests should have been discarded from the queue after 5 seconds, but they were still being resolved very (very) slowly. We decided to just restart the Elasticsearch clusters and sort it out later. After restarting the clusters and emptying the queues, we turned our APIs back on and everything was back on track — all the cache misses returned 200. And since we had been returning mostly 500s, it should all have been cache misses. Should.
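As an aside, this is roughly how such a queue can be inspected, assuming the elasticsearch-py client (the address is just a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# The _cat/thread_pool API shows, per node, how many search requests are
# active, queued, and rejected -- a quick way to spot a backed-up cluster.
print(es.cat.thread_pool(
    thread_pool_patterns="search",
    v=True,
    h="node_name,name,active,queue,rejected",
))
```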
How quickly can things go to hell.
It would have been, had it not been for the surrogate control we set. Or, better put, if it weren’t for bug number two — not setting surrogate control in some cases.
We use two types of cache controls for our APIs: a standard cache-control header that indicates how often the client should re-request the resource, and a surrogate-control header that indicates how long Varnish should cache the response. This is best explained with a simple example.
Let’s say we have sports with some live streams (as we do). These can change at any time, like when a football match goes into extra time. Therefore, clients should check for changes regularly (cache-control header set to 60 seconds), but unless the response has changed, Varnish should keep it (surrogate-control set to 1800 seconds).
After we implemented this, it did wonders. Clients check for updates every minute and, if nothing has changed, we return a cached response from Varnish. Except we forgot to set the surrogate control to 0 when an API response is a 500, so the 500s would stay there for the whole 1800 seconds, or until we exported new data to Elasticsearch (which clears the cache on the Varnish side). That was not going to happen, because those clusters were not even able to return data, never mind reindex new states.
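In Falcon terms, the fixed behaviour could look something like this middleware sketch — the exact values and the middleware itself are illustrative, not our production code:

```python
class CacheHeadersMiddleware:
    """Sets client-facing and Varnish-facing cache headers on every response."""

    def process_response(self, req, resp, resource, req_succeeded):
        # Clients should re-check roughly every minute.
        resp.cache_control = ['max-age=60']

        if str(resp.status).startswith(('4', '5')):
            # The fix: never let Varnish hold on to an error response.
            resp.set_header('Surrogate-Control', 'max-age=0')
        else:
            # Varnish may serve a good response for up to 30 minutes.
            resp.set_header('Surrogate-Control', 'max-age=1800')
```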
Restarting the Varnishes purged the cache and solved this. Again, this had been broken for about a year (since we introduced Sport, and surrogate control with it).
OK, everything looked cool now. All of the investigation and the first round of fixes took about 30 minutes. Add another 45 minutes before all the errors were gone (watching the slow queue of Elasticsearch requests, figuring out what was happening with the Varnishes, etc.).
So, we had 75 minutes of the platform being broken. All of that because of 3 issues that had been sitting in our code for an eternity. Yes, of course, there was a third error.
The juiciest bugs wait for just the right moment to show up…
We did not understand why this number of users put our platform under so much pressure that we were not even able to fetch a really small index of countries.
About four months earlier, in a completely different context, there was a slight update to our OAuth and Varnish. In that update, we unintentionally added user_id to the Varnish cache key globally — even for services that do not need user_id to generate a response (e.g. our APIs in the CMS team). That cache key is used to decide whether a response will be generated on the backend by our APIs, or returned by Varnish from cache. It meant that we were no longer caching our responses per country (and a few other things), but per user. Even a quick look in Grafana showed that, for the last four months, the cache-hit ratio for those APIs had indeed taken a serious hit. It was even visible in the load of the services, yet it went unnoticed by the human eye, as it did not cause actual 500s or bigger latencies.
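Our actual configuration lives in Varnish, but the effect is easy to illustrate with a hypothetical sketch of how a cache key is composed:

```python
def cache_key_per_country(req):
    # What we intended: everyone in the same country asking for the same
    # URL shares one cached response.
    return (req["path"], req["query"], req["country"])


def cache_key_per_user(req):
    # What we accidentally shipped: user_id in the key means every user
    # gets a private cache entry, so most requests miss the cache and
    # fall through to the catalogue API and Elasticsearch.
    return (req["path"], req["query"], req["country"], req["user_id"])
```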
Summary of fuck ups
So what actually happened, in a nutshell? All of our caches expired at the same time as Elasticsearch was under pressure from users. There were 3 major bugs in total, all of them sitting in our codebase for at least a few months. Here is a really quick summary of those bugs:
- When the response of one of our middlewares was not cached, we fetched the data and cached the response even for requests to the health check endpoint, leading to a huge spike in requests to Elasticsearch.
- We were caching some of the 500s at the Varnish level due to an incorrectly set surrogate-control header, so the platform recovered more slowly than expected.
- After investigating why the Elasticsearch clusters were under pressure from our users (there were not that many of them to justify it), we found out that, at the Varnish level, we were using user_id as part of the cache key by mistake, effectively caching per user instead of per country. That led to many more cache misses at the Varnish level, and therefore many more requests to our catalogue API and the aforementioned pressure on Elasticsearch.
Lessons learned
This brings us to the end of this madness. Everything was working, and we “just” needed to make sure it would not happen again. There are quite a few things that needed to be addressed beyond the obvious fixes mentioned above. Let’s go through some of the things we absolutely needed to learn from this:
If you find a possible bug and fixing it does not improve the issue, or improves it less than expected, look further. Do not get stuck debugging a single thing. It might not be just that. Here, after we fixed the middlewares and stopped caching 500s, the platform would have worked and we would never have learned about Varnish using user_id as part of the cache key.
Similarly, if you are staring at your latest deployment, wondering why something is happening and not seeing it, maybe it is just not there. Maybe you did not mess all of this up yesterday. Maybe it was on a random Wednesday three years ago.
Regression is a vicious beast, and it exists solely to bite us when we do not expect it. The only way to protect ourselves is to say “ok, our unit tests might have some value, but where are my end-to-end tests, where are my stress tests, where is my I do not know what else?” Especially with stress testing, we improved our efforts immediately, which led to many more optimisations over the upcoming weeks.
Finally, when you think you monitor something enough, you should really think again.
Fixed the issues? Great, you’re halfway there
The next morning, we huddled together and decided that we would spend the next few working days locked up in a conference room until we were satisfied with all the fixes we wanted to make.
So for the next two days, we were just screaming from our seats in the room “I’ll take this”, “pls code review that one”, … And so we:
- Fixed our caching, prolonged the cache duration and started to cache more, leading to quicker APIs.
- Introduced new metrics to our monitoring. We are finally monitoring the cache-hit ratio, and soon implemented the first alerts based on the behaviour we saw that week.
- Found out why Elasticsearch did not want to recover — request_timeout had been deprecated without any warning in a library we used at the time, so we prepared an upgrade of the Elasticsearch clusters to the latest version, where this had already been fixed — and proved it by doing proper stress tests.
- We also improved a lot of queries to make our Elasticsearch quicker, and reinvented how we treat instances with a sick Elasticsearch. We already used “_source” filtering in Elasticsearch in some places; with it, only specific fields from the requested documents are returned, which leads to better performance from the Elasticsearch cluster and less data transferred. A rough sketch follows below.
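Here is what that _source trick looks like, assuming the elasticsearch-py client with its 7.x-style API (the index and field names are made up for the example):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# Ask Elasticsearch to return only the fields the API actually needs,
# instead of whole documents: less work for the cluster, less data on the wire.
response = es.search(
    index="assets",  # hypothetical index name
    body={
        "_source": ["id", "title", "territories"],  # hypothetical fields
        "query": {"term": {"territories": "ZA"}},
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"])
```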
The funny thing is that more than half of the changes we made were not EXACTLY tied to the outage itself; they were making our Elasticsearch more stable, responses quicker, and the issues easier to find. There will never be a better opportunity to make things just a bit neater than after discovering how ugly it can get. “Always leave the campground cleaner than you found it,” as they say.
The best thing about it? Those three days we spent fixing this mess, locked in a conference room, were just about the best days I’ve ever had in any workplace. When you have a good team that cares about their services, there is so much adrenaline after outages, so much focus, and so many ideas — not only how to prevent this bug, but how to prevent anything resembling this from happening again.
There was so much energy that the work basically did itself. Because we all knew that the way to handle the first day after an outage is not to beat ourselves up — “Hey, you really screwed that up 3 years ago!” — but rather to make sure that this monster does not come back.
Minor note — the biggest event of the year, the rugby final, went absolutely smoothly.