How to Aggregate Percentiles - Tdigest to the rescue
One of the most important sub-metrics of our Quality of Experience metric is buffering time. Nobody wants to see the infamous loading circle icon! Here at Showmax, we invest a lot of time and effort helping our customers avoid it.
We created several dashboards and charts to monitor user sessions and visualize and analyze the dynamics of buffering time metrics. Our goal is to pinpoint the possible root cause(s) of problems we detect. We monitor and search for correlation with the country of origin, the CDN the customer was routed to, the platform used (Android, iOS, web, etc), type of event (live or VOD), and more.
These dashboards need to be responsive — specifically, the querying and grouping performed over several millions of sessions that feed each chart should be completed in less than 15 seconds. If that doesn’t happen, we get a timeout. To make this possible, we pre-aggregate some of our data into smaller tables containing only the metrics and dimensions that we need. The aggregated data can be easily represented with percentiles.
If you aren’t familiar with the term, here’s an example.
A situation in which 95% of the website users are served within 1 second can be described using percentiles like this: the 95th percentile of the website response time is 1s. The most commonly-used percentiles are 99th, 95th, 75th (also called third quartile), and the 50th percentile, which is also the median. The applications are vast, and it has a few advantages over the average, which only behaves well under normal distributions and is easily skewed by outliers. Specifically, we use percentiles to remove outliers from our dataset, and to define goals for all metrics.
The problem with percentiles
We want to use percentiles to describe aggregated data, but there is a small issue with percentiles — they need the full dataset to be calculated precisely. As noted earlier, we need pre-aggregations to have a responsive dashboard. To see why percentiles don’t work with pre-aggregated data check out the table below. It displays a dataset of hypothetical buffering times, followed by another table with data grouped by country, with the 95th percentile as metric.
|Session ID||Country||Buffering Time (s)|
|Country||Buffering Time 95th Percentile|
Now we want to further aggregate the country table to have a general overview of all the sessions. To do so, we calculate the percentile again, getting the value of 17.34. Compared to the 95th percentile calculated from the original table, 15.49, we are off by more than 10%. The error rate is not only high, but its magnitude is unpredictable.
The same problem happens with average, but the solution is much simpler. You store the count of sessions and the sum of buffering time for each level of grouping, with the actual average calculated only at the last step when displaying the data in the dashboard.
How to find a solution
One possible solution to this problem is to drop the percentiles altogether and define buckets with fixed ranges. For instance, you could have a field named buffering_time_zero, counting the sessions with no buffering, buffering_time_two, counting sessions with buffering time up to two seconds, and so on. When displaying the result, it’s possible to divide the value in each bucket by the number of all sessions to get a percentage of sessions in each range.
This solution is easy to implement, but it’s less flexible. Any change in your goals requires changing the database schema and recreating all the aggregated tables. The alternative we found was the PostgreSQL extension called TDigest, based on a paper by Ted Dunning from 2013, that implements t-digest histograms and provides approximations for percentiles. Tdigest is fast and simple to use, and the extension adds a new data type and a few functions. To create a Tdigest structure you need to specify only the number of buckets. More buckets mean more accuracy, but also more storage space and lower performance. We did a few experiments to decide which value would be the most suitable for us.
First, we prepared a dataset containing buffering time and timestamp, that included roughly 1 million sessions. As most of the sessions don’t present buffering, the data distribution is very unbalanced, as you can see in the figure below. The same experiment could have different results on more uniform datasets.
To analyse the error, we grouped the sessions by hour to compare the estimated percentiles (as calculated by Tdigest) to the real percentile values (as calculated by the percentile_cont
function from PostgreSQL). The errors were averaged out for the 10 hours in the dataset, and the following table displays how off target the percentiles were:
Precision / # of bucketsPctl 99thPctl 95thPctl 90thPctl 75th104.53%41.41%84.79%366.52%500.90%8.21%10.43%39.08%1000.76%3.13%4.79%12.61%2500.35%0.87%1.16%2.85%5000.21%0.31%0.32%1.27%10000.19%0.17%0.20%1.06%
As described in the paper by Ted Dunning, and as evident in the above table, Tdigest is more precise closer to the extremes (the 99th percentile shows minor error compared to the 95th).To get an error below 1% on 95th percentile, you need at least 250 buckets. To check the 75th percentile of your data with high precision, you need at least 1000 buckets.
Next comes performance and storage. For storage we used the same data as before, but buckets ranging from 100 to 1000. The figure below shows the growth in disk space needed per record as we increased the number of buckets. Fortunately, the growth is not linear because of compression, and 500 buckets uses only 3.3x more space than 100 buckets.
Last but not least, the performance comparison with PostgreSQL percentile_cont function shows a clear advantage for Tdigest — 3x faster with 100 buckets, and 2.5x faster with 10000 buckets.
There are a few drawbacks to using Tdigest. The limited precision when calculating percentiles that are far from the extremes, were already mentioned. In fact, anything between the 10th and 90th percentile will yield poor results with fewer than 1000 buckets.
Although the extension can be easily installed from source, or via package manager (after setting up the PostgreSQL repository), it’s worth checking with your cloud provider before settling on the solution. Azure currently supports it on their Citus database, but Amazon Redshift doesn’t.
The extension has a function named tdigest_percentile_of, which should get a float number as input and return the percentile where the value first appears as output. So, if 70% of the users experience zero buffering time, tdigest_percentile_of(0) should return 0.70. The issue is that we don’t know beforehand in which percentile a given value will be, and the limitation that percentiles far from extremes are not reliable also applies here. That all means that we can’t really trust the result.
In the end, our workaround was a hybrid solution using Tdigest to get the 90th and 95th percentiles, and another field counting the number of sessions with no buffering to know the percentage of our happiest customers.