Recently, we discussed our move to Monitoring with a modern tech stack that revolved around our switch to Prometheus and Alertmanager. Since early 2020, we have continuously made changes to both the monitoring and alert management systems.
In this interview with Angus Dippenaar, we get into where we stand now, the changes we’ve made, and how we are getting ready for the transition to cloud.
Angus works for MCG and is actively helping the Showmax Engineering team with the evolution of the monitoring ecosystem.
While working with the Showmax DevOps team, you saw a great deal of fine-tuning done to the monitoring and alert management ecosystem built around Prometheus. Take us back to the beginning. What were the main motivations for the change to Prometheus and Alertmanager?
The old way, using Nagios/Icinga and black-box monitoring for our alerting setup, was not ideal. It worked, but there was room for improvement and we all just knew that we could do better. All we had to work with was the current value in the alerting system. We would see “Disk Usage > 90%,” and then we had to investigate.
Using a time series database, as Prometheus offers, means that we can see metrics over time. We can see a trend that could have caused the issue, or we can see how a different metric has contributed to the problem that we’re currently experiencing.
Having metrics over time means that we can create an alert based on a trend. For example, we can count all the values above a threshold over a certain period of time and create an alert based on that. You could have an alert based on how many HTTP errors you had over 5 minutes, and if it’s greater than a defined threshold, then you probably know that something is wrong.
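To make the HTTP error example concrete, here is a minimal Python sketch of a windowed count alert. It is not how Prometheus or our alert rules are implemented; the class name `ErrorWindowAlert`, the 5-minute window, and the threshold value are assumptions chosen purely for illustration.

```python
from collections import deque


class ErrorWindowAlert:
    """Hypothetical sketch: fire when too many errors fall inside a time window."""

    def __init__(self, window_seconds=300, threshold=50):
        # Both defaults are illustrative assumptions, not values from our setup.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record_error(self, timestamp):
        # Timestamps (in seconds) are assumed to arrive in increasing order.
        self._timestamps.append(timestamp)

    def is_firing(self, now):
        # Drop errors that have fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        # Fire only when the count inside the window exceeds the threshold.
        return len(self._timestamps) > self.threshold


alert = ErrorWindowAlert(window_seconds=300, threshold=3)
for ts in (0, 10, 20, 30):
    alert.record_error(ts)
firing_during_burst = alert.is_firing(now=40)
firing_after_quiet_period = alert.is_firing(now=400)
```

The point of the sketch is that the decision depends on a count over a period of time rather than on a single current value, which is the difference from the old value-based alerts.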
What does the alert ecosystem look like now? Is it already 100% based on Prometheus data?
Right now, we are still converting Icinga-based alerts into the new metrics-based approach. It’s not the highest priority, so some Icinga-based alerts remain. Still, we want to move completely to Prometheus and white-box monitoring, and we are working towards that. Another advantage that might be important in the near future is that Prometheus is cloud-native and we can count on it when moving to the cloud.
Can you share some highlights from your future plans? What is the state you target?
One of the advantages we are seeing with metric-based alerts is that we can fine-tune them much more than we could with value-based alerts. Recently, while browsing through the metrics, I found an issue that had not triggered any alerts yet, but I could see that the metric was trending towards a problem. We were able to solve it before any alert fired. I would guess that the alert would have gone off around 2am, not exactly a time when you want to be solving issues. The state we would like to achieve is to consistently be alerted that something is not right before something has actually gone wrong. This comes over time: we review how the metrics behaved before something bad happened, and then create a trend-based alert using those findings.
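As a rough illustration of a trend-based alert, the Python sketch below fits a straight line to recent samples and checks whether the extrapolated value would cross a threshold. It is similar in spirit to the `predict_linear` function in PromQL, but it is not that function and not our actual alert rule; the helper names, sample values, and the 90% threshold are assumptions.

```python
def predict_value(samples, seconds_ahead):
    """Least-squares linear fit over (timestamp, value) pairs.

    Returns the value extrapolated to `seconds_ahead` after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must not all share one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    target_t = samples[-1][0] + seconds_ahead
    return slope * target_t + intercept


def will_breach(samples, seconds_ahead, threshold):
    # Hypothetical alert condition: the trend crosses the threshold in time.
    return predict_value(samples, seconds_ahead) > threshold


# Illustrative disk usage percentages sampled once an hour (made-up data).
disk_usage = [(0, 70.0), (3600, 72.0), (7200, 74.0)]
breach_in_4h = will_breach(disk_usage, 4 * 3600, threshold=90.0)
breach_in_12h = will_breach(disk_usage, 12 * 3600, threshold=90.0)
```

With data like this, a value-based alert at 90% would stay silent for hours, while the trend check already indicates that the threshold will be crossed within the longer horizon.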
That sounds like a very modern approach, very helpful and user friendly. Is this what brought Showmax Engineering to your attention? In February 2020, we introduced the new ecosystem that revolved around Prometheus at the P2d2 conference. You were there and met Roman Fišer, Showmax Head of Infrastructure, who gave the presentation. What happened?
I saw that Showmax was one of the sponsors of the conference, and I knew them as a brand in South Africa (where I’m from). I learned at the conference that Showmax actually had their engineering operations in Prague, and I found this very interesting. I thought it would be cool to visit the office and see what it’s like. I visited on Friday, and there were very few people at the office. But I got to look around and meet the few people who were there. I wasn’t expecting a job interview, but then we had one.
And the rest is history, as they say. As an active contributor to the team, you helped to evolve the system. Focusing on alert management, what does the logging pipeline look like currently?
Our logging pipeline has been working alright for a long time and is probably not going to change much in the near future. Logs generated from applications and scripts are sent to RabbitMQ, processed by Logstash into Elasticsearch, and queried by Kibana. We had a very simple proof-of-concept setup for some of our system logs, sending them into Prometheus, the same database as our metrics.
This is quite cool, because we can display the logs on the same screen as the metrics, and we can select a time range and filter to see the logs around that time. This is very useful for debugging, because we can see in the logs the moment that an issue occurred and how it affected the metrics. Getting this logging system fully in place is something that we would like to focus on in the future.
You have also improved on-call management a lot recently to cut response time. What have you changed exactly?
In our old on-call system, we put the schedule into a CSV file, and then a few Python scripts took care of informing the next person who was on call. An internal web page was available to anyone in the support team, where they could look up the phone number of the person who was on call for the week. Now, with Opsgenie, there’s one phone number that can be saved on the phones of every member of the support team. If they need to phone the person on call, Opsgenie will route the call to that person’s phone number.
We also use Opsgenie’s schedules to manage the on-call rotation (this is the only way to do it in Opsgenie). At the end of the month, we import the times from the schedule into our tracking system, Sloneek, to correctly compensate everyone who was on call. If someone is suddenly not available to be on call, overriding the schedule in Opsgenie is much easier than what we had before with the CSV file. It’s also something we can do from our phones if we need to.
We have also set up Prometheus Alertmanager to send the alerts to Opsgenie so that Opsgenie can alert the person on call directly.
Because of our planned transition to cloud-managed services, some of our improvements will become obsolete. Monitoring, however, still plays a major role in our future plans, and we will keep the best elements of our current approach.