Workshop

Schedule, setup & technical

Schedule

  • 10:00 am: Start
  • 11:30 am: Coffee Break
  • 01:00 pm: Lunch at the Restaurant
  • 03:30 pm: Coffee break
  • 05:00 pm: End
  • 07:00 pm: Dinner / get together at the Restaurant

Online questionnaire

This afternoon you will receive an online questionnaire.

Please use it to provide feedback.

Network setup

Because clients are isolated by default, we will add a secondary IP address on your laptop:

$ sudo -i
# service firewalld stop
# ip addr add 192.168.28.x/24 dev enp3s0

Where x will be given by the trainer.

Local downloads

You can download the items locally at http://192.168.28.1:3000/.

Code of Conduct

This workshop is subject to the OSMC Code of Conduct.

1 - Prometheus

Install & Setup Prometheus

Prometheus is an open source monitoring system designed around metrics. It is a large ecosystem, with plenty of different components.

The Prometheus documentation provides an overview of those components:

Prometheus Architecture, CC-BY-SA 4.0, from the Prometheus Authors 2014-2019

How Prometheus works

Prometheus monitoring is based on metrics exposed on HTTP endpoints. The Prometheus server is “active”: it initiates the polling. That polling (called “scraping”) happens at a short, regular interval (usually 15s or 30s).

Each monitored target must expose a metrics endpoint. That endpoint exposes metrics in the Prometheus text format or in the OpenMetrics format.
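The text format mentioned above is plain text, one sample per line. A hypothetical example of what a scrape returns:

```
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.34
# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 42
```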

Once collected, those metrics are mutated by Prometheus, which adds an instance and a job label. Optionally, extra user-configured relabeling occurs.

The Prometheus server

  1. Download the prometheus server 2.37.4.

  2. Extract it

    $ tar xvf Downloads/prometheus-2.37.4.linux-amd64.tar.gz
    
  3. List the files

    $ ls prometheus-2.37.4.linux-amd64
    
  4. Launch prometheus

    $ cd prometheus-2.37.4.linux-amd64
    $ ./prometheus
    
  5. Open your browser at http://127.0.0.1:9090

  6. Look at the TSDB data


The web UI

There is a lot of information to be found in the Prometheus server web UI.

Try to find:

  • The version of prometheus
  • The duration of data retention
  • The “targets” that are scraped by default
  • The “scrape” interval

promtool

promtool is a command line tool provided with Prometheus.

With promtool you can:

  • Validate Prometheus configuration

    $ ./promtool check config prometheus.yml
    
  • Query Prometheus

    $ ./promtool query instant http://127.0.0.1:9090 up
    
  • Create blocks from OpenMetrics files or recording rules, aka backfill.

Adding targets

exercise

  • Open prometheus.yml
  • Add your neighbors’ Prometheus servers as targets to your own Prometheus server.
  • Check their status (using the up metric or the Targets page)
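One way to do this (a sketch; the neighbor IP is an example) is to extend the default job's static_configs in prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'localhost:9090'
          - '192.168.28.2:9090'   # a neighbor's laptop (example IP)
```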

What is a job? What is an instance?


Admin commands

  1. Enable admin commands

    $ ./prometheus --web.enable-admin-api
    
  2. Take a snapshot

    $ curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    

    Look in the data directory.

  3. Delete a time series

    $ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=process_start_time_seconds{job="prometheus"}'
    

Federation

File SD (if workshop is on site)

Now, let’s move to file_sd.

Create a file:

- targets:
    - 192.168.28.1
  labels:
    laptop_user: julien
- targets:
    - 192.168.28.2
  labels:
    laptop_user: john

With your IP + your neighbors.

Name it users.yml.

Adapt Prometheus configuration:

  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
      - files:
        - users.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"

Duplicate the job, but with the following instructions:

  • The new job should be called “federation”
  • The new job should query http://127.0.0.1:9090/federate?match[]=up
  • The “up” metric fetched should be renamed to external_up

Solution


DigitalOcean SD (if workshop is virtual)

Now, let’s move to digitalocean_sd.

In your VM, there is a /etc/do_read file with a digitalocean token.

The version of Prometheus you have has native integration with DigitalOcean.

Adapt Prometheus configuration:

  - job_name: 'prometheus'
    digitalocean_sd_configs:
      - bearer_token_file: /etc/do_read
        port: 9090
    relabel_configs:
      - source_labels: [__meta_digitalocean_tags]
        regex: '.*,prometheus_workshop,.*'
        action: keep

Reload Prometheus:

killall -HUP prometheus
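Alternatively, if Prometheus was started with --web.enable-lifecycle, you can trigger a reload over HTTP:

```
$ curl -X POST http://127.0.0.1:9090/-/reload
```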

You should see the 10 prometheus servers.

Duplicate the job, but with the following instructions:

  • The new job should be called “federation”
  • The new job should query http://127.0.0.1:9090/federate?match[]=up
  • The “up” metric fetched should be renamed to external_up

Solution


Last exercise

Prometheus fetches metrics over HTTP.

Metrics have a name and labels.

As an exercise, let’s build on top of our previous example:

In a new directory, create a file called “metrics”

Add some metrics:

company{name="inuits"} 1
favorite_color{name="red"} 1
random_number 10
workshop_step 1

then, run python3 -m http.server 5678 (or python -m SimpleHTTPServer 5678 on Python 2) and add it to Prometheus (and your neighbors’ too).

2 - Metrics Monitoring

Querying Prometheus

Metrics monitoring is different because it does not assume that a situation can be described by fixed states. Instead, it brings you inside your system and provides dozens of metrics that you can analyze and understand.

Even if you do not need all the metrics right away, it is better to collect them so you can explore what they look like later.

What is a metric

  • Name
  • Timestamp
  • Labels
  • Value
  • Type

Labels

Labels are used to add metadata to metrics. They can be used to differentiate them, e.g. by adding a status code, a URI handler, a function name, …

Types of metrics

Gauge
Metric that can go up and down
Counters
Metric that can only go up and starts at 0
Histograms
Metrics that put data into buckets. A histogram is composed of 3 metrics: sum, buckets and count.
Summaries
Metrics that calculate quantiles over observed values. A summary is composed of 3 metrics: sum, quantiles and count.
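To make the histogram type concrete, here is a tiny sketch (plain Python, hypothetical observed values, not a Prometheus client library) of how observations land in cumulative buckets alongside the sum and count series:

```python
def to_histogram(observations, buckets):
    """Return cumulative bucket counts, sum and count, Prometheus-style."""
    # Each bucket with upper bound `le` counts ALL observations <= le (cumulative).
    result = {le: sum(1 for o in observations if o <= le) for le in buckets}
    result["+Inf"] = len(observations)  # the +Inf bucket always counts everything
    return result, sum(observations), len(observations)

# Example: request durations in seconds, bucket bounds 0.1 / 0.5 / 1.0
buckets, total, count = to_histogram([0.05, 0.3, 0.7, 2.0], [0.1, 0.5, 1.0])
print(buckets)        # {0.1: 1, 0.5: 2, 1.0: 3, '+Inf': 4}
print(total, count)   # 3.05 4
```

Note how the buckets are cumulative: the 2.0s observation only appears in the +Inf bucket, while the count and sum series track all observations.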

Upstream documentation

Exercises

The number of http requests is expressed in …

The duration in …

The number of active sessions is a …

What are the labels added by Prometheus?

What are the labels Prometheus knows but does not add?

promlens

Promlens helps you build and understand queries.

  1. Download promlens 0.2.0.

  2. Extract it

    $ tar xvf Downloads/promlens-0.2.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls promlens-0.2.0.linux-amd64
    
  4. Launch promlens

    $ cd promlens-0.2.0.linux-amd64
    $ ./promlens --web.default-prometheus-url="http://127.0.0.1:9090"
    
  5. Open your browser at http://127.0.0.1:8080

  6. Add your promlens and your neighbors to prometheus.


Labels matching

For Prometheus to do calculations and comparisons on metrics, labels must match on each side (except __name__, the name of the metric).
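For example, the following division works because both series carry identical label sets on a given target (metric names from the node exporter):

```promql
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```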


Maths

Operators like +, -, *, /, …

Aggregators

  • count()
  • sum()
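Aggregators take an optional by (inclusion) or without (exclusion) grouping clause, e.g.:

```promql
count by (job) (up)           # number of targets per job
sum without (instance) (up)   # drop the instance label, keep the rest
```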

Functions

Some important functions:

  • rate()
  • deriv()
  • delta()
  • increase()
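All of these take a range vector; a couple of syntax sketches, using metrics Prometheus exposes about itself:

```promql
rate(prometheus_http_requests_total[5m])    # per-second increase over 5 minutes
delta(process_resident_memory_bytes[1h])    # absolute change over one hour
```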

In the list above, which ones should be used with counters, and which ones with gauges?

What is the difference between irate and rate? idelta and delta?


Aggregation

How can I get the sum of scraped metrics by job (2 ways: exclusion and inclusion)?

Solution


How can I get the % of handler="/federate" over the other prometheus_http_request_duration_seconds_count ?

Solution


Over Time

What is the difference between max and max_over_time?


Max, Min, Bottomk, Topk

What is the difference between max(x) and topk(1, x)?


Time functions

day()

day_of_week()

How to use the optional argument of day_of_week?

What is the timestamp() function? How can it be useful?

And/Or

Can you think of any usecases for and/or/unless?

3 - Grafana

Create beautiful dashboards

Grafana

Grafana aims to be a one-stop shop for observability users. From the Grafana interface, you can access your data, wherever it lives. Grafana has first-class integrations with Prometheus, Loki, Jaeger and many others.

It is famous for its dashboarding solution, but more features have been added over time, such as trace views, manual queries, and a logs explorer.

Download and run grafana

  1. Go to the Grafana website and download the latest stable release.

  2. For this exercise we will use the standalone Linux binaries (64-bit):

    $ wget https://dl.grafana.com/oss/release/grafana-9.3.2.linux-amd64.tar.gz
    $ tar -zxvf grafana-9.3.2.linux-amd64.tar.gz
    $ cd grafana-9.3.2
    $ ./bin/grafana-server

  3. Open Grafana in your browser: http://127.0.0.1:3000 (username: admin; password: admin)

Setup prometheus

Add your Prometheus server as a datasource and import the Prometheus dashboards.

Monitor grafana in prometheus (add it as a target).

Look at the grafana dashboards.

Create a new dashboard

Create a new dashboard which enables you to pick someone’s Prometheus server and gather info (samples scraped, scrape duration, …) using variables.

Your dashboard should contain at least a singlestat panel, a variable and a graph panel.

4 - Exporters

Expose metrics for Prometheus

Exporters are HTTP servers that expose metrics. They translate Prometheus scrapes into domain-specific queries, then turn the results into Prometheus metrics.

There are hundreds of known exporters, most of them coming from the community. A few exporters are maintained by the Prometheus team.

node_exporter

The node exporter enables basic monitoring of Linux machines (and other Unix-like systems).

  1. Download the node_exporter 1.5.0.

  2. Extract it

    $ tar xvf Downloads/node_exporter-1.5.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls node_exporter-1.5.0.linux-amd64
    
  4. Launch the node_exporter

    $ cd node_exporter-1.5.0.linux-amd64
    $ ./node_exporter
    
  5. Open your browser at http://127.0.0.1:9100

  6. Add your node_exporter and your neighbors to prometheus.

collectors

The Node Exporter has multiple collectors, some of them disabled by default.

Exercise

  1. Enable the systemd collector
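Collectors are toggled with command-line flags; for instance (assuming you run the binary from its directory):

```
$ ./node_exporter --collector.systemd
```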

textfile collector

Exercise

Move the metrics created before (company name, random number, …) from the web server on port 5678 to the node exporter, so they are collected through its textfile collector.

Do you see use cases for this feature?
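A possible setup (a sketch; the directory name is our choice, and the collector only reads *.prom files):

```
$ mkdir textfile
$ cp metrics textfile/workshop.prom
$ ./node_exporter --collector.textfile.directory=./textfile
```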

Dashboards

Exercise

Create two dashboards: a dashboard that will show the network bandwidth of a server by interface, and a dashboard that will show the disk space available per disk.


JMX exporter

The JMX exporter is useful to monitor Java applications. It can be loaded as a “sidecar” (Java agent) in the same JVM as the application.

  1. Download Jenkins
  2. Download the JMX exporter 0.17.2.
  3. Run Jenkins with the JMX exporter and add it to Prometheus
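A sketch of step 3, loading the exporter as a Java agent (the jar/war file names are assumptions; port 8081 matches the jenkins job in the config files section):

```
$ java -javaagent:./jmx_prometheus_javaagent-0.17.2.jar=8081:config.yml -jar jenkins.war
```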

solution


config.yml


exercise

  1. Create a dashboard with:
    • JVM version
    • Uptime
    • Threads
    • Heap Size
    • Memory Pool size

Grok exporter

  1. Download grok exporter 0.2.8

  2. Extract it

    $ unzip Downloads/grok_exporter-0.2.8.linux-amd64.zip
    
  3. List the files

    $ ls grok_exporter-0.2.8.linux-amd64
    
  4. Create a simple job in Jenkins

  5. Re-run Jenkins with its output redirected to a file (append &> jenkins.log)

exercise

  • Create a job named “test” with the command “sleep 10”
  • Run the job and look for “INFO: test #2 main build action completed: SUCCESS” in the logs
  • Create a counter and a gauge from those lines: job_build_total and job_last_build_number. The name of the job should be a label, and for job_build_total the status should also be a label.

solution


Blackbox exporter

  1. Download the blackbox_exporter 0.22.0.

  2. Extract it

    $ tar xvf Downloads/blackbox_exporter-0.22.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls blackbox_exporter-0.22.0.linux-amd64
    
  4. Launch the blackbox_exporter

    $ cd blackbox_exporter-0.22.0.linux-amd64
    $ ./blackbox_exporter
    
  5. Open your browser at http://127.0.0.1:9115

  6. Add your blackbox_exporter and your neighbors to prometheus

Exercise

  • Monitor the Inuits website (DNS + HTTP) using the blackbox exporter
  • Check with prometheus blackbox exporter when the SSL certificate will expire in days
  • Create a dashboard with the detailed time it takes to get the OSMC website.
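For reference, the usual blackbox scrape pattern rewrites each target into a URL parameter and points __address__ at the exporter itself (a sketch; the http_2xx module name comes from the exporter's example config):

```yaml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://inuits.eu
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115   # the blackbox exporter's address
```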

5 - Alerting

Doing something with those metrics

Recording rules and alerts

Prometheus splits the alerting role across 3 components:

  • the Prometheus server, which evaluates the alerts
  • the Alertmanager, which dispatches the alerts
  • webhook receivers, which handle the alerts

Exercise

Create, in Prometheus, an alert when a target is down.

Exercise

Create, in Prometheus, an alert when a grafana server is down, with an extra label: priority=high.

Exercise

Create a recording rule to compute the % of disk space used, and alert when more than 50% of disk space is used.

What is the difference between recording and alerting?

What is an annotation?

What is a “group” of recording rules?

How to see the rules and the alerts in the UI?

What is a pending alert?

Bonus: Alerts unit test (if there is enough time)

Alertmanager

  1. Download the alertmanager 0.24.0.

  2. Extract it

    $ tar xvf Downloads/alertmanager-0.24.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls alertmanager-0.24.0.linux-amd64
    
  4. Launch the alertmanager

    $ cd alertmanager-0.24.0.linux-amd64
    $ ./alertmanager
    
  5. Open your browser at http://127.0.0.1:9093

  6. Add your alertmanager and your neighbors to prometheus

  7. Connect Prometheus and Alertmanager together

  8. Look for the alerts coming.

  • What are the 4 roles of alertmanager?
  • What are the different timers in alertmanager?
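Step 7 is done in prometheus.yml (a minimal sketch; use your own and your neighbors’ addresses):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '127.0.0.1:9093'
```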

Exercise

Use https://webhook.site/ to get a webhook URL.

Send alerts to that https://webhook.site/ URL.

For the priority=high alerts, send an email instead of a webhook.

  • Can you explain the HA model of prometheus?
  • How can I send an alert to multiple targets?

Exercise

How can you check that two Alertmanager configs are in sync?

Solution


Exercise

Make a big cluster of alert managers

Amtool

Amtool is the CLI tool for Alertmanager.

You can use it to e.g. create silences.

$ ./amtool silence --alertmanager.url=http://127.0.0.1:9093 add job=grafana priority=high -d 15m -c "we redeploy grafana" -a Julien

That will return the UID of the silence that you can use to expire it.

Karma

Karma is a dashboard for Alertmanager.

6 - Bonus

If there is more time…

If there is time left…

  • Pushgateway
  • LTS Remote Read / Remote Write
  • Alert Rules unit testing
  • Service Discovery
  • TLS overview
  • Console templates
  • Graphite Exporter
  • Collectd
  • Grafonnet + Monitoring Mixins

7 - Config files

Some configuration files for the workshop

prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.26.1:9093
      - 192.168.26.2:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'node'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9100"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'grafana'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:3000"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'federation'
    metrics_path: /federate
    params:
        "match[]": [up]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    metric_relabel_configs:
      - source_labels: [__name__]
        target_label: __name__
        regex: up
        replacement: federate_up

    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'jenkins'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:8081"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'alertmanager'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9093"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'grok'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9144"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s

workshop.yml

- targets: ['192.168.26.1']
  labels:
      name: me
- targets:
  - '192.168.26.3'
  - '192.168.26.4'
  - '192.168.26.5'
  - '192.168.26.6'
  - '192.168.26.7'
  labels:
      name: right
- targets:
  - '192.168.26.8'
  - '192.168.26.9'
  - '192.168.26.10'
  - '192.168.26.11'
  - '192.168.26.12'
  - '192.168.26.13'
  - '192.168.26.14'
  - '192.168.26.15'
  labels:
      name: left

first_rules.yml

groups:
  - name: example
    rules:
    - alert: TargetDown
      for: 5m
      expr: up == 0
      labels:
        priority: high
      annotations:
        text: "{{$labels.job}} is down!"

grok_exporter.yml

global:
    config_version: 2
input:
    type: file
    path: ../jenkins.log
    readall: true
grok:
    patterns_dir: ./patterns
metrics:
    - type: counter
      name: job_build_total
      help: Counter for the job runs.
      match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
      labels:
          jobname: '{{.jobname}}'
          status: '{{.status}}'
    - type: gauge
      name: job_last_build_number
      help: Number of the last build
      match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
      labels:
        jobname: '{{.jobname}}'
      value: '{{.jobid}}'
server:
    port: 9144

alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        priority: high
      continue: true
      receiver: 'sms.hook'
receivers:
- name: 'sms.hook'
  webhook_configs:
  - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
- name: 'web.hook'
  webhook_configs:
  - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']