May 23, 2023

How We Migrated the Whole Content Delivery and Encoding Infrastructure from Debian to Ubuntu

If you maintain any infrastructure, you know it: every now and then there’s a new major release of all the software deployed on your server infrastructure, and you have to start planning the upgrade process. This is a story about how we found ourselves in that very same situation and gradually upgraded the whole fleet of our bare metal (and some virtual) servers from Debian 10 (Buster) to Ubuntu 22.04 (Jammy Jellyfish).

The streaming business often demands lots of compute and network resources – we at Showmax run our own fleet of parallel encoders, have dozens of storage servers, and our own CDN servers in two points of presence in South Africa. In addition to the Showmax infrastructure, we are also responsible for the DStv origin and CDN bare metal infrastructure. We also maintain and operate multiple supporting and management servers, for example to ensure we can remotely access and network boot the servers, and we collect and process logs from Akamai and other external CDNs on bare metal nodes. In total, we had to upgrade about 20 server roles covering hundreds of servers in five datacenters.

Before we start digging into the migration, you may wonder: why Ubuntu? Why Jammy Jellyfish? Showmax spent a lot of effort on the AWS migration, and we base our container images on Ubuntu (Focal Fossa and now Jammy Jellyfish). The idea was to have a shared package repository and package build pipeline, and to share some parts of the infrastructure between the AWS and bare metal deployments. Ubuntu was a good fit there, and Jammy was a fresh release with five years of support. The software bundled with Ubuntu (Linux kernel, OpenSSL, etc.) was also newer compared to then-stable Debian. We did consider then-stable Debian Bullseye, but its support window ends mid-2024 and the Debian Bookworm release was planned for mid-2023: assuming we would complete our migration in the first half of 2023, we would only have a few months before starting the preparations for yet-another-migration. The single package base (for AWS and bare metal environments), long support period, and freshness of Ubuntu Jammy were the major reasons why we decided to use it.

Before we could even start, we had to prepare the whole Showmax environment for Ubuntu. As part of this preparation, we worked with our Platform Engineering team to switch from an internal legacy Debian package repository running in Hetzner, Germany, to a new one using AWS S3 (and the apt-transport-s3 Debian/Ubuntu package). Then we had to repackage dozens of internal packages for Ubuntu Jammy, sometimes resolving compatibility issues. We maintain our infrastructure using Puppet, so we had to update it to support Debian as well as Ubuntu Jammy – we had to split several configuration sections like the external repository configuration, and pick the right package names for components that differ (e.g. firmware-linux-nonfree on Debian and linux-image-generic/linux-firmware on Ubuntu). There were also unexpected surprises, like ifupdown (yes, we have ditched Netplan) demanding the bond-master keyword as part of physical interface configuration (Debian Buster worked fine without it):

auto showmax1
iface showmax1 inet manual
    bond-master bond0
auto showmax2
iface showmax2 inet manual
    bond-master bond0
auto bond0
iface bond0 inet static
    slaves showmax1 showmax2

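If you hit the same issue, a quick way to confirm that the kernel actually enslaved both interfaces (the interface names here are illustrative, not our real ones) is to inspect the bonding status in procfs:

```shell
# Print the bonding mode and enslaved interfaces for bond0.
# On a machine without the bonding driver loaded, fall back to a
# message instead of failing, so the check can run anywhere.
if [ -r /proc/net/bonding/bond0 ]; then
    grep -E 'Bonding Mode|Slave Interface|MII Status' /proc/net/bonding/bond0
else
    echo "bond0 not configured on this machine"
fi
```

If an interface is missing from the output after ifup, the bond-master stanza for it is the first thing to check.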
We also rebuilt our network boot image to be based on Ubuntu instead of Debian. The live boot image pipeline now generates an .iso file which includes the .squashfs file (instead of just the .squashfs file), so we had to update our network boot configuration in boot.ipxe as follows (note the boot=casper and url=http://bootserver/filesystem.iso parameters):

kernel http://bootserver/vmlinuz initrd=initrd.img boot=casper noprompt noeject live-config.username=root BOOTIF=01-${netX/mac:hexhyp} showmounts url=http://bootserver/filesystem.iso
initrd http://bootserver/initrd.img
boot && goto menu || goto failed

When we had completed the initial environment preparation, we installed the first bare metal node with Ubuntu Jammy and verified that all monitoring, alerting, and log processing worked as expected. The first phase, which started at the end of April 2022, was complete. At the beginning of September 2022 it was just about time to migrate the first production server role to Ubuntu!

We tend to have a staging environment even for bare metal infrastructure, but this server role didn’t have any staging environment ready yet. We took several servers that were used for another role, which had just been removed from our infrastructure, and installed them using the refreshed tooling, packages, and configuration files. It was a huge success – in under a week the servers were integrated, tested, and verified. We could then move on with other servers doing the same job: we gradually reinstalled the servers of this role in one datacenter, waited two weeks (observing performance and compatibility), then reinstalled the servers of the same role in another datacenter. This was shortly before the FIFA World Cup 2022, and this role is responsible for packaging and DRM protection of the live streams available on both Showmax and DStv. Yes, the whole World Cup was packaged and DRM-protected on those servers running on the Ubuntu infrastructure!

During the FIFA World Cup we were in a freeze period, so we couldn’t migrate more server roles. When the World Cup ended, we immediately transitioned into a Christmas freeze period, so the whole migration process was on hold for almost two months.

When the freeze periods were finally over in January 2023, we resumed work on the migration: for each server role, we reinstalled the staging nodes first (fixing any and all configuration incompatibilities between Debian and Ubuntu), then did the same with the production nodes. Thanks to our infrastructure automation, we were able to reinstall dozens of storage servers without losing data and without production service interruptions in just three days.

Similarly, the whole fleet of 60 encoders was reinstalled in just two work days. Both the DStv and Showmax CDNs were migrated over two weeks, as we wanted to avoid hammering our origin servers when freshly reinstalled (and therefore cold) CDN nodes suddenly had to fetch uncached content from them. We were a bit more careful here.

And there was a good reason to be careful: because of the variety of devices used on the African continent, we still have to support legacy protocols like TLS 1.0 and 1.1, which we had to explicitly re-enable. This is not a straightforward task in Ubuntu Jammy because of the bundled OpenSSL 3.0 (which disables those legacy protocols by default), but in the end it came down to deploying the right nginx configuration.

If you find yourself in the same position, this may save you some time. Below are the nginx configuration lines needed. The most important piece is the @SECLEVEL=0 at the end of ssl_ciphers, which tells OpenSSL 3.0 to permit the legacy protocols again; the DEFAULT cipher list here is illustrative – use whatever cipher list fits your environment. We then used the sslscan tool to verify that we had properly migrated the TLS configuration from the old environment to the new one.

ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
ssl_ciphers DEFAULT:@SECLEVEL=0;

During the whole migration, we only had to schedule two short internal maintenance windows (each one hour long) because some of the components had to be completely switched off. Both of those were during regular work hours, and neither affected our customers' streaming. None of the reinstallations demanded work outside of regular working hours, which we consider important for work-life balance. The last server of the whole fleet (of about 300 nodes) was migrated in the third week of February 2023, exactly one year and three days after the Epic migration ticket was created.

Thanks to the up-to-date software, we could finally test and enable kernel TLS on our nginx fleet, which we had been eager to do for a long time. Besides that, we also have a fully supported distribution and software with its best years ahead of it. Compared to previous operating system migrations at Showmax, this was the fastest migration to a new distribution in Showmax Engineering history. And it was worth it!
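For reference, enabling kernel TLS in nginx (1.21.4 or newer) built against OpenSSL 3.x comes down to a single SSL_CONF option – a minimal sketch, not our exact production configuration:

```nginx
# Minimal sketch: kernel TLS offload in nginx 1.21.4+ with OpenSSL 3.x.
# Requires the 'tls' kernel module to be loaded (modprobe tls);
# sendfile enables the zero-copy send path that makes kTLS worthwhile.
server {
    listen 443 ssl;
    sendfile on;
    ssl_conf_command Options KTLS;
    # ... certificates and the rest of the vhost configuration ...
}
```

Whether kTLS is actually in use can be confirmed via the nginx debug log, which records the kernel TLS handoff for each connection.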
