Alerting

Doing something with those metrics

Recording rules and alerts

Note

Both Prometheus and Alertmanager use the Go templating system for alerting.

Prometheus splits the alerting role into three components:

Note

Alerting rules and recording rules are closely related. Both are queries that Prometheus runs at regular intervals, and both write new metrics into the TSDB.
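As a minimal sketch of what such a rule file could look like, the group below holds one recording rule and one alerting rule that builds on it. The file name, group name, rule names, threshold, and label values are hypothetical and chosen only for illustration. The annotation uses the Go templating mentioned earlier.

```yaml
# rules.yml (hypothetical name), referenced from rule_files in prometheus.yml
groups:
  - name: example              # rules inside a group are evaluated in order
    interval: 30s              # optional; defaults to the global evaluation_interval
    rules:
      # Recording rule: the result of expr is stored as a new series.
      - record: job:up:avg
        expr: avg by (job) (up)
      # Alerting rule: every series returned by expr becomes an alert.
      - alert: JobMostlyDown
        expr: job:up:avg < 0.5
        for: 5m                # pending for 5 minutes before it fires
        labels:
          severity: warning    # extra label attached to the alert
        annotations:
          summary: 'Only {{ $value | humanizePercentage }} of {{ $labels.job }} targets are up'
```

The file can be validated with `promtool check rules rules.yml` before Prometheus is reloaded.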

Exercise

Create, in Prometheus, an alert when a target is down.

Exercise

Create, in Prometheus, an alert when a Grafana server is down, with an extra label: priority=high.

Exercise

Create a recording rule to get the % of disk space used and alert on > 50% of disk space used.

What is the difference between recording and alerting?

What is an annotation?

What is a “group” of recording rules?

How can you see the rules and the alerts in the UI?

What is a pending alert?

Bonus: Alerts unit test (if there is enough time)
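For the bonus, a unit test file for `promtool` might look like the sketch below. It assumes a hypothetical rule file `rules.yml` containing an alert named `TargetDown` with the expression `up == 0`, `for: 5m`, and no extra labels or annotations. The job and instance values are made up as well.

```yaml
# test.yml (hypothetical name)
rule_files:
  - rules.yml                  # assumed to define the TargetDown alert

evaluation_interval: 1m

tests:
  - interval: 1m               # spacing between the samples in values
    input_series:
      - series: 'up{job="node", instance="localhost:9100"}'
        # three samples at 1 (t=0m..2m), then eleven samples at 0 (t=3m..13m)
        values: '1+0x2 0+0x10'
    alert_rule_test:
      - eval_time: 10m         # down since 3m, so the 5m "for" has elapsed
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: localhost:9100
```

The test is run with `promtool test rules test.yml`. If the real rule adds labels or annotations, they have to be listed under `exp_labels` and `exp_annotations` for the test to pass.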

Tip

Prometheus generates an ALERTS metric for the pending and firing alerts.

Alertmanager

  1. Download Alertmanager 0.23.0.
  2. Extract it

    $ tar xvf Downloads/alertmanager-0.23.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls alertmanager-0.23.0.linux-amd64
    
  4. Launch Alertmanager

    $ cd alertmanager-0.23.0.linux-amd64
    $ ./alertmanager
    
  5. Open your browser at http://127.0.0.1:9093

  6. Add your Alertmanager and your neighbors' Alertmanagers to Prometheus

  7. Connect Prometheus and Alertmanager together

  8. Watch for incoming alerts.
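Steps 6 and 7 both happen in the Prometheus configuration. The excerpt below is a sketch that reads step 6 as scraping the Alertmanagers' own metrics, which is an assumption. The neighbor address is a placeholder, and the `scrape_configs` entry should be merged into the existing list instead of replacing it.

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  # Step 6: collect metrics from the Alertmanagers themselves.
  - job_name: alertmanager
    static_configs:
      - targets:
          - 127.0.0.1:9093     # your Alertmanager
          - 192.0.2.10:9093    # hypothetical neighbor address

# Step 7: tell Prometheus where to send its alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093
```

Prometheus has to reload its configuration before the change takes effect.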

Exercise

Use https://webhook.site/ to get a webhook URL.

Send alerts to that https://webhook.site/ URL.

For the priority=high alerts, send an email instead of a webhook.
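One possible shape for this routing is sketched below. The SMTP relay, the e-mail addresses, and the receiver names are hypothetical, and the webhook URL must be replaced by the one obtained from webhook.site. A real SMTP relay will usually need authentication settings as well.

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.org:587'   # hypothetical SMTP relay
  smtp_from: 'alertmanager@example.org'    # hypothetical sender address

route:
  receiver: webhook            # default receiver for every alert
  routes:
    # Alerts carrying priority=high go to the e-mail receiver instead.
    - matchers:
        - priority="high"
      receiver: email

receivers:
  - name: webhook
    webhook_configs:
      - url: 'https://webhook.site/<your-unique-id>'   # placeholder
  - name: email
    email_configs:
      - to: 'oncall@example.org'           # hypothetical recipient
```

The file can be checked with `./amtool check-config alertmanager.yml` before Alertmanager is restarted or reloaded.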

Exercise

How can you check that two Alertmanager configurations are in sync?

Note

There is an alertmanager_config_hash metric.
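One way to use that metric, sketched here as a Prometheus alerting rule, is to count how many distinct hash values are reported within a job. This assumes that Prometheus scrapes all the Alertmanagers under the same job. The group name, alert name, and duration are hypothetical.

```yaml
groups:
  - name: alertmanager-config
    rules:
      - alert: AlertmanagerConfigInconsistent
        # count_values yields one series per distinct hash value;
        # more than one such series per job means the configurations differ.
        expr: count by (job) (count_values by (job) ("config_hash", alertmanager_config_hash)) > 1
        for: 5m
        annotations:
          summary: 'Alertmanagers in job {{ $labels.job }} run different configurations'
```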

Exercise

Make a big cluster of Alertmanagers.

Amtool

amtool is the CLI tool for Alertmanager.

You can use it, for example, to create silences.

    $ ./amtool silence --alertmanager.url=http://127.0.0.1:9093 add job=grafana priority=high -d 15m -c "we redeploy grafana" -a Julien

That returns the ID of the silence, which you can then use to expire it.

Karma

Karma is a dashboard for Alertmanager.