How We Stopped Being an Image Processing Company
A few months ago, we published a post on image processing at Showmax. It’s an interesting story about a service that had been implemented long before we actually wrote that blog post.
The post sparked some debate among our colleagues, and we started to reevaluate some of the ideas and choices that shaped the image service. This post is about that journey.
Something’s fishy
When we started scaling our platform to accommodate the new sports feature, we needed to take a second look at our architecture, resource usage, etc. (one interesting finding has already been written up here).
One of the other interesting findings was the following:
This didn’t include the heavy-lifting services like video encoders or USP machines, but it did include pretty much every microservice that makes Showmax possible. A quick check of total RAM usage showed a similar proportion: 40% of all memory was consumed by the image service.
Needless to say, the latencies of the service were also quite high, though that’s to be expected for an image processing service. To be honest, high resource usage on a service that’s processing ~2,000,000 images daily is also to be expected, right? Nevertheless, we decided to take a look and figure out if there was anything we could do to save some CPU and RAM.
Where is all that energy going?
The first step to figuring out what can be optimized is usually to understand where resources are being wasted. At Showmax, we use the ELK stack for logging requests, responses, and additional data. Because of that, we decided to write a simple Python decorator that would log the duration of each executed method in a passed-around falcon.Request object.
The Request object, along with its parameters, is then logged to ElasticSearch using one of our middlewares. A quick investigation showed that some of the seemingly simple operations performed by the program were taking a very long time. For example, the operation described in this step of the previous blog post, which was just resizing an empty canvas, was using up to 50% of the whole image processing time!
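To make the idea concrete, here is a minimal sketch of such a timing decorator. The names (log_duration, method_durations) are illustrative rather than taken from the actual Showmax code, and it assumes Falcon 2+ where req.context supports attribute access:

import time
from functools import wraps


def log_duration(method):
    """Record how long a method took into the falcon.Request context,
    so a logging middleware can later ship the durations to Elasticsearch.

    Illustrative sketch only; not the actual Showmax implementation.
    """
    @wraps(method)
    def wrapper(self, req, *args, **kwargs):
        start = time.perf_counter()
        try:
            return method(self, req, *args, **kwargs)
        finally:
            # Collect per-method durations on the request itself.
            durations = getattr(req.context, 'method_durations', None)
            if durations is None:
                durations = {}
                req.context.method_durations = durations
            durations[method.__name__] = time.perf_counter() - start
    return wrapper

Any resource or helper method that receives the request can then simply be decorated with @log_duration, and the collected timings travel with the request until the middleware logs them.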
After we started logging the durations of executed methods, it was easy to spot where the bottlenecks were and how to rework them. Instead of just doing a patchwork solution, we decided to take a moment to look around and decide whether we wanted to keep fixing the existing implementation or rewrite the image processing part of the service (surprisingly, a lot of the code is about parsing URL arguments, but that’s a different story).
After consulting with our tech leads, we decided to rewrite the service to use libvips instead of continuing with GraphicsMagick. We looked at other alternatives like Pillow and ImageMagick, but we chose libvips for its performance and its newly added support for the HEIF image format, which compresses images much better than JPEG. And libvips comes with a nice Python wrapper, pyvips, so the code should be clear, understandable, and more maintainable.
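To give an idea of what the new code looks like, here is a minimal, illustrative pyvips resize. The real service does much more, and the function name is made up:

import pyvips


def resize_image(source_path, target_path, width, height):
    # thumbnail() uses shrink-on-load where the format allows it, so the
    # full-resolution image never has to be fully decoded into memory.
    image = pyvips.Image.thumbnail(source_path, width, height=height)
    # The output format is picked from the target extension; with a libvips
    # built against libheif, a .heic/.heif target produces HEIF output.
    image.write_to_file(target_path)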
The RAM consumption was a different story. We were surprised to find that the service was using so much memory, because we could clearly see on our Grafana dashboard that each of the application instances was using roughly 500MB of memory. After another quick investigation, we noticed that each of the containers had a cgroup setting limiting its maximum memory usage to 16GB of RAM, which, as it turned out, the containers were actually using without us realizing it.
The surprise faded once we noticed that our dashboard was showing RSS (Resident Set Size) memory, which excludes buffers and caches. It turned out that the GlusterFS client was using as much memory as it could to cache the image files it pulled from our GlusterFS storages. This caching doesn’t make much sense in our case, as the images are usually cached and served from our CDNs after they’re processed. So, out of the ~50,000,000 requests for images we receive every day, only 4% get passed to the image service, and each of them is rather unique.
The most frequently requested image - the main carousel visible at the top of the page - is requested around 2,500 times every 24 hours; the 100th most-popular image roughly 750 times, and the 1,000th most-popular roughly 275 times. These values are inconsequential compared to the ~2,000,000 requests the service handles per day. In other words, if an image has already been processed, there’s a very low chance that it will have to be processed again anytime soon.
We decided to check how the service (and our Gluster) would behave if we limited the available memory per container.
vips for the win
After a few days of work, we rewrote the service to use vips. After confirming that it actually performed the functions we required, we started doing performance tests comparing the new and old solutions. From the plots below, it’s clearly visible that the new solution is at least 50% faster than the ‘old’ one.
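The benchmarks behind those plots are internal, but a rough local comparison can be put together in a few lines, assuming the GraphicsMagick gm CLI and pyvips are installed. File names and the target width here are arbitrary, and this is not the actual Showmax test harness:

import subprocess
import time

import pyvips

SRC = 'poster.jpg'   # any reasonably large test image
WIDTH = 720


def time_gm():
    # Resize via the GraphicsMagick CLI, roughly what the old path did.
    start = time.perf_counter()
    subprocess.run(['gm', 'convert', SRC, '-resize', str(WIDTH), 'out_gm.jpg'],
                   check=True)
    return time.perf_counter() - start


def time_vips():
    # Resize via pyvips, as in the rewritten service.
    start = time.perf_counter()
    pyvips.Image.thumbnail(SRC, WIDTH).write_to_file('out_vips.jpg')
    return time.perf_counter() - start


if __name__ == '__main__':
    # Take the best of a few runs to smooth out cold caches.
    print('gm:  ', min(time_gm() for _ in range(5)))
    print('vips:', min(time_vips() for _ in range(5)))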
The next tests, on our staging environment, showed that the new implementation was using significantly less CPU than the old one. Unfortunately, it was using significantly more RAM. Some tweaks of the vips parameters, namely limiting the concurrency by setting the environment variable VIPS_CONCURRENCY to 1 and disabling the vips caches with cache_set_max(0), cache_set_max_files(0), and cache_set_max_mem(0), helped reduce the memory usage to roughly 25% of the initial consumption. It was still more than twice the consumption of the ‘old’ service, but we had a plan to free up much more memory and decided that it was acceptable.
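For reference, this is roughly what those knobs look like from Python; in practice VIPS_CONCURRENCY would more likely be set in the container environment than in code:

import os

# Limit libvips' worker thread pool to a single thread. The variable is read
# when libvips initializes, so it must be set before pyvips is imported
# (or, more typically, in the container environment).
os.environ.setdefault('VIPS_CONCURRENCY', '1')

import pyvips

# Disable the libvips operation cache so intermediate results are freed
# immediately instead of being kept around for possible reuse.
pyvips.cache_set_max(0)
pyvips.cache_set_max_files(0)
pyvips.cache_set_max_mem(0)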
Simply reducing the allowed memory per container from 16GB to 4GB seemed to have done the trick by freeing additional RAM - the penalty being an 8% increase in average latency. We also noticed that GlusterFS network transmit traffic increased by ~100Mbps per container during peak hours, which confirmed our theory that the memory had been used for caching images retrieved from Gluster.
All the glory
After an extensive test run on our staging environment, and some minor fixes, we were ready to deploy the new Image API to production. Since the service had been completely rewritten, we decided to roll it out slowly. We notified customer support that any image-related issues should be immediately forwarded to the engineering team, and we started the deployment.
Initially, we deployed it to 1/16 of our containers and monitored for errors and issues. To know which responses were coming from the new service, we marked each response with a tag (a sketch of such a middleware is shown below). This allowed us to split all of our monitoring between the new and old service versions. We also needed it in case we observed any issues, i.e. if the images were corrupted, we’d need to purge the caches from our CDNs, and it’s better to do that only for the potentially-corrupted responses from the new service. Since we didn’t observe any issues, we deployed to 50% of our production containers the next day, and to 100% the day after.
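The tag itself can be as simple as a response header added by a Falcon middleware. The header name and value below are made up for illustration:

import falcon

IMAGE_API_VERSION = 'vips-rewrite'  # illustrative tag value


class VersionTagMiddleware:
    """Tag every response so dashboards (and CDN purges, if ever needed)
    can tell the new and old service versions apart."""

    def process_response(self, req, resp, resource, req_succeeded):
        resp.set_header('X-Image-Api-Version', IMAGE_API_VERSION)


# falcon.App in Falcon 3+, falcon.API in older releases
app = falcon.App(middleware=[VersionTagMiddleware()])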
In the end, we managed to drop the CPU and memory usage of the service to a more reasonable level. The cool thing is that the latency of the service also went down significantly. It’s still among the most resource-hungry services in Showmax, but it’s a start. We have some more ideas for how to make it even more efficient, so stay tuned for more blog posts!