Failure Hunting - The quest for a better Quality of Experience
Keeping our users happy requires both new releases and extensive problem-solving, and both need to happen quickly and effectively. To measure user satisfaction (or, more precisely, its theoretical equivalent), we created the Quality of Experience metric. It combines a number of metrics, failure rate among them. Simply put, the lower the failure rate, the happier our users are.
When the dashboard is blinking red and alerts are firing everywhere, the failures that require immediate action are obvious. But when the fire is put out, the quest to improve Quality of Experience (QoE), and reduce failures, continues. With so many different failure codes to check - at Showmax we have more than 100 - it is important to have a system to prioritize them, so that the more relevant ones are tackled first. In this article, we will go through a few different ways to perform the prioritization of failure codes.
By number of failure events
Sort the failure codes in descending order by either the number of sessions where they occurred or the number of affected users in a specific period of time.
This is quite possibly the most straightforward approach. The issue is that raw counts scale with traffic: the number of failures grows with the number of sessions. You need solid knowledge of trends and peaks in platform usage to make any reasonable assumptions, and it gets even more complicated when streaming live events, as we do here at Showmax. The peaks follow not only the leisure time of our customers, but also the schedule of live sporting events like the Olympic Games.
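As a minimal sketch of this ranking, assuming a hypothetical list of session records (the field names and failure codes below are illustrative, not Showmax's actual schema):

```python
from collections import Counter

# Hypothetical session records: each holds the failure code observed
# (None for a clean session) and the user who played it.
sessions = [
    {"user": "u1", "failure_code": "DRM_LICENSE_ERROR"},
    {"user": "u2", "failure_code": "DRM_LICENSE_ERROR"},
    {"user": "u2", "failure_code": None},
    {"user": "u3", "failure_code": "MANIFEST_TIMEOUT"},
]

# Count failed sessions per failure code...
by_sessions = Counter(
    s["failure_code"] for s in sessions if s["failure_code"] is not None
)

# ...or count distinct affected users per failure code.
affected = {}
for s in sessions:
    if s["failure_code"] is not None:
        affected.setdefault(s["failure_code"], set()).add(s["user"])
by_users = {code: len(users) for code, users in affected.items()}

# Sort descending to get the priority list.
ranking = sorted(by_sessions.items(), key=lambda kv: kv[1], reverse=True)
```

Either count works as a sort key; counting distinct users avoids over-weighting a single user retrying in a loop.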
By rate of failure events
Calculate the failure rate - the percentage of sessions that experienced a particular failure event. This tends to be more stable over a given period of time and allows for comparisons between platforms that might have a different user base (Android, iOS, Smart TVs, etc.).
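A sketch of the rate calculation, with made-up per-platform counts (the numbers and platform names are purely illustrative):

```python
# Hypothetical counts for one time window, per platform.
stats = {
    "android": {"sessions": 200_000, "failed": {"DRM_LICENSE_ERROR": 1_200}},
    "ios":     {"sessions":  50_000, "failed": {"DRM_LICENSE_ERROR":   400}},
}

def failure_rate(platform, code):
    """Percentage of sessions on a platform that hit a given failure code."""
    s = stats[platform]
    return 100.0 * s["failed"].get(code, 0) / s["sessions"]

# Rates are comparable even though the user bases differ by 4x.
android_rate = failure_rate("android", "DRM_LICENSE_ERROR")  # 0.6 %
ios_rate = failure_rate("ios", "DRM_LICENSE_ERROR")          # 0.8 %
```

Here the smaller platform has the higher rate, something raw counts would have hidden.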
The failure rate brings one extra advantage. By calculating the correlation between the failure rate and the number of sessions, you can identify failures that are more likely to happen when the number of sessions grows. This can indicate a bottleneck somewhere in the delivery pipeline, such as the CDN, backend services, or even the Internet Service Provider (ISP).
You can learn more about correlation and how to implement it in PostgreSQL here.
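The same correlation can be sketched outside the database. Below is a plain Pearson coefficient over hypothetical hourly data; a coefficient close to 1 suggests the failure worsens with load:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical hourly session counts and the matching failure rate (%)
# for one failure code.
sessions_per_hour = [10_000, 20_000, 40_000, 80_000]
failure_rate_pct  = [0.2,    0.3,    0.6,    1.1]

# A value near 1 hints at a bottleneck somewhere in the delivery
# pipeline (CDN, backend, ISP) rather than a client-side bug.
r = pearson(sessions_per_hour, failure_rate_pct)
```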
By cost of failure events
Do all errors have the same impact on user experience, or are some more annoying than others? We conducted an experiment with more than one million sessions to understand user behavior in the five minutes immediately after a failure happened, asking ourselves whether the user would try to watch the same asset again, go for another one, or give up watching entirely.
Intuitively, the last option is the worst, as it negatively impacts the amount of content watched. The chart below displays the result, with the y-axis representing the percentage of users in each category, and the x-axis the progress in the asset when the failure event occurred.
It is easy to see that the amount of time already invested in the asset plays a major role in the user’s decision to either try again after the failure or leave (the first 1% of completion shows a slight deviation). We could conclude that the higher the completion rate when the failure happens, the higher the “cost”. However, sometimes it is just too late at night to finish a horror movie, and a great many users would have interrupted the session anyway, regardless of the failure event. So we decided to add to the equation the probability that the user would continue watching the asset had no failure occurred.
The chart below displays the probability that the user will give up after a failure (G), and the probability that they will continue watching in case of no failure (W). We can then calculate the cost of the failure event as COST(COMPLETION) = G(COMPLETION) * W(COMPLETION).
Failures that happen in the beginning have a higher impact, as users who gave up due to the failure were very likely to continue watching the asset if no failure had happened. The curve then smooths out in the middle, balanced by both G() and W(), and drops at the end, as there was not much left to watch anyway. Finally, we decided to adjust this equation by adding a constant, so that even failures in the last second have a cost higher than zero, and only sessions with no failures have zero cost.
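The cost formula can be sketched as follows. The G() and W() curves below are hypothetical stand-ins (in practice they come from the measured session data), and the floor constant is an assumed value, not the one Showmax uses:

```python
def cost(completion, g, w, floor=0.01):
    """Cost of a failure at a given completion rate (0.0 to 1.0).

    g(completion): probability the user gives up after a failure.
    w(completion): probability the user would have kept watching
                   had no failure occurred.
    The constant floor keeps even last-second failures above zero;
    only failure-free sessions cost nothing.
    """
    return g(completion) * w(completion) + floor

# Illustrative curves only: giving up is likeliest early on, and the
# chance of continuing falls as the credits approach.
g = lambda c: 0.6 - 0.4 * c
w = lambda c: 1.0 - c

early = cost(0.05, g, w)  # high impact near the start
late = cost(0.95, g, w)   # small but non-zero impact at the end
```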
This analysis makes it possible to sort the failure codes by the sum of the cost over all the sessions where they occurred. The ones on top will not necessarily be the most frequent, but they will have the biggest impact on the user experience.
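A sketch of this cost-weighted ranking, again over hypothetical failed sessions with an already-computed per-session cost:

```python
from collections import defaultdict

# Hypothetical failed sessions, each with a precomputed cost.
failures = [
    {"code": "DRM_LICENSE_ERROR", "cost": 0.55},
    {"code": "DRM_LICENSE_ERROR", "cost": 0.40},
    {"code": "MANIFEST_TIMEOUT",  "cost": 0.02},
    {"code": "MANIFEST_TIMEOUT",  "cost": 0.03},
    {"code": "MANIFEST_TIMEOUT",  "cost": 0.04},
]

total_cost = defaultdict(float)
for f in failures:
    total_cost[f["code"]] += f["cost"]

# The rarer DRM failure outranks the more frequent but cheaper timeout.
priority = sorted(total_cost.items(), key=lambda kv: kv[1], reverse=True)
```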
By statistical significance
In the “By rate of failure events” section, we described how correlation can be used to highlight specific types of failures. You can also perform a simple comparison with previous periods, or employ the now very popular moving average to highlight trends in a compact and meaningful way.
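A trailing moving average can be sketched in a few lines; the daily rates below are made up for illustration:

```python
def moving_average(series, window):
    """Trailing moving average; smooths out short-term noise."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical daily failure rates (%) for one code; the smoothed
# curve exposes the upward trend hidden by day-to-day noise.
daily = [0.5, 0.7, 0.4, 0.6, 0.9, 1.1, 1.0]
smoothed = moving_average(daily, window=3)
```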
Another option is to use anomaly detection techniques that can automatically define thresholds and generate alerts when they are crossed. These usually require some fine-tuning to avoid generating either too many or too few alerts, as both extremes lead to the alerts being ignored.
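One of the simplest such techniques is a z-score check against recent history. This is a minimal sketch, not the detector Showmax runs, and the threshold value is exactly the knob that needs the fine-tuning described above:

```python
from statistics import mean, stdev

def is_anomalous(history, value, threshold=3.0):
    """Flag a value whose z-score against recent history is too large.

    The threshold needs tuning: too low floods the on-call channel,
    too high misses real incidents, and both end up being ignored.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical recent failure rates (%) for one code.
history = [0.5, 0.6, 0.4, 0.55, 0.5, 0.45, 0.6, 0.5]

alert = is_anomalous(history, 2.4)   # far outside the usual range
quiet = is_anomalous(history, 0.52)  # within normal variation
```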
After reading this far you may ask: So, what is the best option?
Well… there is no single “best” one. The options presented here view the data from different perspectives, and each might be useful in different circumstances. At Showmax, we have dashboards displaying a combination of them as time series and tables, and they generate alerts that are constantly fine-tuned based on changes on both the user and server sides.
As an example, the chart below shows a simplified view of our QoE dashboard for Playback Failures, and its state during the global DRM issue (accidental revocation of a subset of Widevine devices) that happened on June 23rd, 2021.