May 5, 2021

Monitoring PostgreSQL with a modern software stack

Roman Fišer
Roman Fišer
Monitoring PostgreSQL with a modern software stack

Intended audience: Beginner to intermediate

Last February, Roman Fiser, Showmax Head of Infrastructure, gave a presentation at the P2D2 conference - “Monitoring PostgreSQL - monitoring with a modern tech stack”.
We’ve decided to make the presentation free to the public - including all of the Alertmanager rules and Grafana Dashboards that we are currently using in production. And to tease you to continue reading, in an upcoming post, we will take a step beyond the current status and will share our vision of the future monitoring ecosystem.

As presented at P2D2 last year, our monitoring revolved around Prometheus working with PostgreSQL. Prometheus is an open-source solution that implements a highly dimensional data model. We chose to use it because it fundamentally stores all data as time series, the streams of timestamped values belonging to the same metric, and thanks to Prometheus we could implement a monitoring based on metrics with metadata (CPU, RAM, disk IO, disk utilization, etc.).
To be able to address different kinds of data, all of them have to have a name and a set of optional labels that are key/value pairs. Prometheus is written in Go, and, because of that, it consists of only one executable binary that can be run on any supported architecture that you set as your compile target.

We already published the Prometheus introduction where you may learn more. And in the shared p2d2 presentation you may find the main benefits for monitoring, here is a quick recap:

  • Functions to filter, change, remove …. metadata while fetching them
  • Multiple exporters - expose data via HTTP API
  • Effective data fetching:
  • Notifications can be reported via Email, Slack, etc.
Post image

We set alert manager rules based on four SRE golden signals. The most important metrics which are latency, traffic, saturation and errors.

Metric Definition What we monitor
Latency The time it takes to service a request pg_stat_statements - Average call time
Traffic A measure of how much demand is being placed on your system pg_stat_statements - Calls
pg_stat_statements - Returned rows
Network traffic
System IO Statistics (IOPS, traffic)
Errors The rate of requests that fail pg_stat_statements - Calls
Saturation How “full” your service is num_backends / max_connections
High IO utilization (iowait, await)
High CPU utilization
Tempfile usage
Disk usage

Within the logging pipeline we placed Elastic as the search and analytics engine, RabbitMQ is used for log messages delivery, Logstash for server‐side data processing and Kibana for visualizing data with charts and graphs in Elasticsearch. You may see the logging pipeline in detail on the below scheme.

Post image

Within the presentation you may find a comprehensive summary of the tech stack we use including:

  • Specific examples of Prometheus Alertmanager rules for alerting
  • A comprehensive walkthrough of Grafana dashboards
  • Details on our logging pipeline, including our ELK stack and a few other
    Showmax specialties

You can find the presentation on Slideshare and the configuration for all the rules/dashboard on Github.

Share article via: