Alerting

Doing something with those metrics

Recording rules and alerts

Prometheus splits alerting across three components:

  • the Prometheus server, which evaluates the alerting rules
  • Alertmanager, which dispatches the alerts
  • webhook receivers, which handle the alerts

Exercise

Create, in Prometheus, an alert when a target is down.
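One possible starting point for this exercise is sketched below. It writes a rules file with a single alerting rule on the built-in `up` metric, which Prometheus sets to 0 when a scrape fails. The file name `target-down.rules.yml`, the group name, the alert name `TargetDown` and the `for: 1m` delay are assumptions chosen for illustration.

```shell
# Write a rules file containing one alerting rule (file name is an assumption).
cat > target-down.rules.yml <<'EOF'
groups:
  - name: availability            # hypothetical group name
    rules:
      - alert: TargetDown         # hypothetical alert name
        expr: up == 0             # "up" is 0 when the last scrape failed
        for: 1m                   # remain pending for 1m before firing
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} is down"
EOF
```

The file still has to be listed under `rule_files` in `prometheus.yml`. Running `promtool check rules target-down.rules.yml` validates it before Prometheus is reloaded.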

Exercise

Create, in Prometheus, an alert when a Grafana server is down, with an extra label: priority=high.
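A minimal sketch of such a rule follows. It assumes Grafana is scraped under a job named `grafana`. The file name, group name and alert name are hypothetical. The `labels` block is what attaches the extra `priority=high` label to the alert.

```shell
# Write a rules file with an alert restricted to the (assumed) "grafana" job.
cat > grafana-down.rules.yml <<'EOF'
groups:
  - name: grafana                 # hypothetical group name
    rules:
      - alert: GrafanaDown        # hypothetical alert name
        expr: up{job="grafana"} == 0
        for: 1m
        labels:
          priority: high          # extra label, usable later for routing
EOF
```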

Exercise

Create a recording rule that computes the percentage of disk space used, then alert when more than 50% of the disk space is used.
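As a rough sketch, the recording rule and the alert can live in the same group. This assumes the disk metrics come from node_exporter (`node_filesystem_avail_bytes` and `node_filesystem_size_bytes`). The recorded series name `instance:node_filesystem_used:percent`, the file name and the alert name are hypothetical.

```shell
# Write a rules file with one recording rule and one alert that reuses it.
cat > disk.rules.yml <<'EOF'
groups:
  - name: disk                    # hypothetical group name
    rules:
      # Recording rule: precompute the percentage of used disk space.
      - record: instance:node_filesystem_used:percent
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
      # Alerting rule: fire when the recorded value exceeds 50%.
      - alert: DiskHalfFull       # hypothetical alert name
        expr: instance:node_filesystem_used:percent > 50
        for: 5m
EOF
```

Rules in a group are evaluated in order, so the alert can rely on the series recorded just above it.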

What is the difference between recording rules and alerting rules?

What is an annotation?

What is a “group” of recording rules?

How can you see the rules and the alerts in the UI?

What is a pending alert?

Bonus: unit tests for alerts (if there is enough time)
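A unit test for an alert can be sketched as follows. The block writes a small rule file and a matching test file. All file names, the alert name `TargetDown` and the sample series are assumptions made for this example. The test feeds a synthetic `up` series that drops to 0 after the first minute and expects the alert to be firing at the 4 minute mark.

```shell
# Rule under test: fires when a target has been down for 1 minute.
cat > down.rules.yml <<'EOF'
groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 1m
EOF

# Unit test: "up" is 1 at t=0, then 0 from t=1m to t=4m.
cat > down.test.yml <<'EOF'
rule_files:
  - down.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="localhost:9100"}'
        values: '1 0 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: localhost:9100
EOF
```

The test is run with `promtool test rules down.test.yml`.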

Alertmanager

  1. Download Alertmanager 0.24.0.

  2. Extract it

    $ tar xvf Downloads/alertmanager-0.24.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls alertmanager-0.24.0.linux-amd64
    
  4. Launch Alertmanager

    $ cd alertmanager-0.24.0.linux-amd64
    $ ./alertmanager
    
  5. Open your browser at http://127.0.0.1:9093

  6. Add your Alertmanager and your neighbors' Alertmanagers as scrape targets in Prometheus

  7. Connect Prometheus and Alertmanager together, so that Prometheus sends its alerts to Alertmanager

  8. Watch for incoming alerts in the Alertmanager UI.
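Steps 6 and 7 could look roughly like the configuration sketched below. It is written to a separate example file so that an existing `prometheus.yml` is not overwritten, and the relevant sections are meant to be merged by hand. The file name, the job name `alertmanager`, the rule file name and the neighbor address `192.0.2.11` are assumptions.

```shell
# Example fragment only; merge the sections into your own prometheus.yml.
cat > prometheus-alerting-example.yml <<'EOF'
rule_files:
  - alerts.yml                    # hypothetical rules file

# Step 7: tell Prometheus where to send its alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

# Step 6: scrape the metrics of the Alertmanagers themselves.
scrape_configs:
  - job_name: alertmanager        # hypothetical job name
    static_configs:
      - targets:
          - 127.0.0.1:9093
          - 192.0.2.11:9093       # hypothetical neighbor address
EOF
```

Once the sections are merged into the real configuration, `promtool check config prometheus.yml` can validate the result.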

  • What are the 4 roles of Alertmanager?
  • What are the different timers in Alertmanager?

Exercise

Use https://webhook.site/ to get a webhook URL.

Send alerts to that webhook URL.

For the priority=high alerts, send an email instead of a webhook.
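One way to approach this exercise is a routing tree with a default webhook receiver and a child route for `priority="high"`. The sketch below is an assumption-laden example: the SMTP host, the sender and recipient addresses and the webhook URL are placeholders to replace, and a real SMTP server will usually need authentication settings as well.

```shell
# Example Alertmanager configuration (all hosts and addresses are placeholders).
cat > alertmanager-example.yml <<'EOF'
global:
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'

route:
  receiver: webhook               # default: everything goes to the webhook
  routes:
    - matchers:
        - priority="high"         # high priority alerts go to email instead
      receiver: email

receivers:
  - name: webhook
    webhook_configs:
      - url: 'https://webhook.site/REPLACE-WITH-YOUR-ID'
  - name: email
    email_configs:
      - to: 'oncall@example.org'
EOF
```

The file can be validated with `./amtool check-config alertmanager-example.yml` and loaded with `./alertmanager --config.file=alertmanager-example.yml`.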

  • Can you explain the HA model of Prometheus?
  • How can I send an alert to multiple targets?

Exercise

How can you check that the configurations of two Alertmanagers are in sync?
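One possible approach relies on the `alertmanager_config_hash` metric that every Alertmanager exposes about its loaded configuration. The sketch below assumes Prometheus scrapes all the Alertmanagers. The file name and alert name are hypothetical. `count_values` produces one series per distinct hash, so a count above 1 means at least two instances run different configurations.

```shell
# Alert when the scraped Alertmanagers report more than one distinct config hash.
cat > config-sync.rules.yml <<'EOF'
groups:
  - name: alertmanager
    rules:
      - alert: AlertmanagerConfigOutOfSync   # hypothetical alert name
        expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
        for: 5m
EOF
```

The same expression can be pasted into the Prometheus expression browser for a one-off check.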

Exercise

Build a large Alertmanager cluster.
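Alertmanager instances join a cluster through the `--cluster.listen-address` and `--cluster.peer` flags. The sketch below only builds and prints a command line, so it can be reviewed before it is run. The peer addresses are hypothetical placeholders to replace with your neighbors' addresses.

```shell
# Hypothetical neighbor addresses (documentation range); replace with real ones.
PEERS="192.0.2.11:9094 192.0.2.12:9094"

# Listen for cluster traffic on the default cluster port.
ARGS="--cluster.listen-address=0.0.0.0:9094"

# Add one --cluster.peer flag per neighbor.
for peer in $PEERS; do
  ARGS="$ARGS --cluster.peer=$peer"
done

# Print the resulting command line for review before running it.
CMD="./alertmanager $ARGS"
echo "$CMD"
```

After starting Alertmanager with the printed command, the Status page of its UI lists the peers that joined the cluster.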

Amtool

amtool is the command-line tool for Alertmanager.

You can use it to e.g. create silences.

$ ./amtool silence --alertmanager.url=http://127.0.0.1:9093 add job=grafana priority=high -d 15m -c "we redeploy grafana" -a Julien

The command prints the ID of the new silence, which you can use later to expire it.

Karma

Karma is a dashboard for Alertmanager.

Last modified December 2, 2022: Update versions (195c9ea)