Workshop

Schedule, setup & technical

Schedule

  • 10:00 am: Start
  • 11:30 am: Coffee Break
  • 01:00 pm: Lunch at the Restaurant
  • 03:30 pm: Coffee break
  • 05:00 pm: End
  • 07:00 pm: Dinner / get together at the Restaurant

Online questionnaire

This afternoon you will receive an online questionnaire.

Please use it to provide feedback.

Network setup

Because clients are isolated by default, we will add a secondary IP address on your laptop:

$ sudo -i
# service firewalld stop
# ip addr add 192.168.28.x/24 dev enp3s0

Where x will be given by the trainer.

Local downloads

You can download the items locally at http://192.168.28.1:3000/.

Code of Conduct

This workshop is subject to the OSMC Code of Conduct.

1 - Prometheus

Install & Setup Prometheus

Prometheus is an open source monitoring system designed around metrics. It is a large ecosystem, with plenty of different components.

The Prometheus documentation provides an overview of those components:

Prometheus Architecture, CC-BY-SA 4.0, from the Prometheus Authors 2014-2019

How Prometheus works

Prometheus monitoring is based on metrics exposed on HTTP endpoints. The Prometheus server is “active”: it initiates the polling. That polling (called “scraping”) happens at a short, regular interval (usually 15s or 30s).

Each monitored target must expose a metrics endpoint. That endpoint exposes metrics in the Prometheus text format or in the OpenMetrics format.
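The text format mentioned above is plain text, one sample per line. A hypothetical example of what a scrape returns:

```
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.34
# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 42
```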

Once collected, those metrics are mutated by Prometheus, which adds an instance and a job label. Optionally, extra user-configured relabeling occurs.

The Prometheus server

  1. Download the prometheus server 2.37.4.

  2. Extract it

    $ tar xvf Downloads/prometheus-2.37.4.linux-amd64.tar.gz
    
  3. List the files

    $ ls prometheus-2.37.4.linux-amd64
    
  4. Launch prometheus

    $ cd prometheus-2.37.4.linux-amd64
    $ ./prometheus
    
  5. Open your browser at http://127.0.0.1:9090

  6. Look at the TSDB data


The web UI

There is a lot of information to be found in the Prometheus server web UI.

Try to find:

  • The version of prometheus
  • The duration of data retention
  • The “targets” that are scraped by default
  • The “scrape” interval

promtool

promtool is a command line tool provided with Prometheus.

With promtool you can:

  • Validate Prometheus configuration

    $ ./promtool check config prometheus.yml
    
  • Query Prometheus

    $ ./promtool query instant http://127.0.0.1:9090 up
    
  • Create blocks from OpenMetrics files or recording rules, aka backfill.

Adding targets

exercise

  • Open prometheus.yml
  • Add your neighbors’ Prometheus servers as targets to your own Prometheus server.
  • Check their status (using the up metric or the Targets page)
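One way to do this (a sketch; the neighbor IP is an example) is to extend the default job's static_configs in prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'localhost:9090'
          - '192.168.28.2:9090'   # a neighbor's laptop (example IP)
```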

What is a job? What is an instance?


Admin commands

  1. Enable admin commands

    $ ./prometheus --web.enable-admin-api
    
  2. Take a snapshot

    $ curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    

    Look in the data directory.

  3. Delete a time series

    $ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=process_start_time_seconds{job="prometheus"}'
    

Federation

File SD (if workshop is on site)

Now, let’s move to file_sd.

Create a file:

- targets:
    - 192.168.28.1
  labels:
    laptop_user: julien
- targets:
    - 192.168.28.2
  labels:
    laptop_user: john

With your IP + your neighbors.

Name it users.yml.

Adapt Prometheus configuration:

  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
      - files:
        - users.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"

Duplicate the job, but with the following instructions:

  • The new job should be called “federation”
  • The new job should query http://127.0.0.1:9090/federate?match[]=up
  • The “up” metric fetched should be renamed to external_up

Solution


DigitalOcean SD (if workshop is virtual)

Now, let’s move to digitalocean_sd.

In your VM, there is a /etc/do_read file with a digitalocean token.

The version of Prometheus you have has native integration with DigitalOcean.

Adapt Prometheus configuration:

  - job_name: 'prometheus'
    digitalocean_sd_configs:
      - bearer_token_file: /etc/do_read
        port: 9090
    relabel_configs:
      - source_labels: [__meta_digitalocean_tags]
        regex: '.*,prometheus_workshop,.*'
        action: keep

Reload Prometheus:

killall -HUP prometheus
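Alternatively, if Prometheus was started with --web.enable-lifecycle, you can trigger a reload over HTTP:

```
$ curl -X POST http://127.0.0.1:9090/-/reload
```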

You should see the 10 prometheus servers.

Duplicate the job, but with the following instructions:

  • The new job should be called “federation”
  • The new job should query http://127.0.0.1:9090/federate?match[]=up
  • The “up” metric fetched should be renamed to external_up

Solution


Last exercise

Prometheus fetches metrics over HTTP.

Metrics have a name and labels.

As an exercise, let’s build on top of our previous example:

In a new directory, create a file called “metrics”

Add some metrics:

company{name="inuits"} 1
favorite_color{name="red"} 1
random_number 10
workshop_step 1

then, run python3 -m http.server 5678 (or python -m SimpleHTTPServer 5678 on Python 2) and add it to Prometheus (and your neighbors’ too).

2 - Metrics Monitoring

Querying Prometheus

Metrics monitoring is different because it does not assume that a situation can be described by fixed states. Instead, it brings you inside your system and provides dozens of metrics that you can analyze and understand.

Even if you do not need all the metrics right away, it is better to collect them so you can explore what they look like later.

What is a metric

  • Name
  • Timestamp
  • Labels
  • Value
  • Type

Labels

Labels are used to add metadata to metrics. They can be used to differentiate them, e.g. by adding a status code, a URI handler, a function name, …

Types of metrics

Gauge
Metric that can go up and down
Counters
Metric that can only go up and starts at 0
Histograms
Metrics that put data into buckets. A histogram is composed of 3 metrics: sum, buckets and count.
Summaries
Metrics that calculate quantiles over observed values. A summary is composed of 3 metrics: sum, quantiles and count.
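To make the histogram type concrete, here is a tiny sketch (plain Python, hypothetical observed values, not a Prometheus client library) of how observations land in cumulative buckets alongside the sum and count series:

```python
def to_histogram(observations, buckets):
    """Return cumulative bucket counts, sum and count, Prometheus-style."""
    # Each bucket with upper bound `le` counts ALL observations <= le (cumulative).
    result = {le: sum(1 for o in observations if o <= le) for le in buckets}
    result["+Inf"] = len(observations)  # the +Inf bucket always counts everything
    return result, sum(observations), len(observations)

# Example: request durations in seconds, bucket bounds 0.1 / 0.5 / 1.0
buckets, total, count = to_histogram([0.05, 0.3, 0.7, 2.0], [0.1, 0.5, 1.0])
print(buckets)        # {0.1: 1, 0.5: 2, 1.0: 3, '+Inf': 4}
print(total, count)   # 3.05 4
```

Note how the buckets are cumulative: the 2.0s observation only appears in the +Inf bucket, while the count and sum series track all observations.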

Upstream documentation

Exercises

The number of http requests is expressed in …

The duration in …

The number of active sessions is a …

What are the labels added by Prometheus?

What are the labels Prometheus knows but does not add?

promlens

Promlens helps you build and understand queries.

  1. Download promlens 0.2.0.

  2. Extract it

    $ tar xvf Downloads/promlens-0.2.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls promlens-0.2.0.linux-amd64
    
  4. Launch promlens

    $ cd promlens-0.2.0.linux-amd64
    $ ./promlens --web.default-prometheus-url="http://127.0.0.1:9090"
    
  5. Open your browser at http://127.0.0.1:8080

  6. Add your promlens and your neighbors to prometheus.


Labels matching

For Prometheus to do calculations and comparisons on metrics, labels must match on each side (except __name__, the name of the metric).
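For example, the following division works because both series carry identical label sets on a given target (metric names from the node exporter):

```promql
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```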


Maths

Operators like +, -, *, /, …

Aggregators

  • count()
  • sum()
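Aggregators take an optional by (inclusion) or without (exclusion) grouping clause, e.g.:

```promql
count by (job) (up)           # number of targets per job
sum without (instance) (up)   # drop the instance label, keep the rest
```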

Functions

Some important functions:

  • rate()
  • deriv()
  • delta()
  • increase()
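All of these take a range vector; a couple of syntax sketches, using metrics Prometheus exposes about itself:

```promql
rate(prometheus_http_requests_total[5m])    # per-second increase over 5 minutes
delta(process_resident_memory_bytes[1h])    # absolute change over one hour
```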

In the list above, which ones should be used with counters, and which ones with gauges?

What is the difference between irate and rate? idelta and delta?


Aggregation

How can I get the sum of scraped metrics by job (2 ways: exclusion and inclusion)?

Solution


How can I get the % of handler="/federate" over the other prometheus_http_request_duration_seconds_count ?

Solution


Over Time

What is the difference between max and max_over_time?


Max, Min, Bottomk, Topk

What is the difference between max(x) and topk(1, x)?


Time functions

day()

day_of_week()

How to use the optional argument of day_of_week?

What is the timestamp() function? How can it be useful?

And/Or

Can you think of any usecases for and/or/unless?

3 - Grafana

Create beautiful dashboards

Grafana

Grafana aims to be a one-stop shop for observability users. From the Grafana interface, you can access your data, wherever it lives. Grafana has first-class integrations with Prometheus, Loki, Jaeger and many others.

It is famous for its dashboarding solution, but more features have been added over time, such as trace views, manual queries, and a logs explorer.

Download and run grafana

  1. Go to the Grafana website and download the latest stable release.

  2. For this exercise we will use the standalone Linux binaries (64-bit):

    $ wget https://dl.grafana.com/oss/release/grafana-9.3.2.linux-amd64.tar.gz
    $ tar -zxvf grafana-9.3.2.linux-amd64.tar.gz
    $ cd grafana-9.3.2
    $ ./bin/grafana-server

  3. Open Grafana in your browser: http://127.0.0.1:3000 (username: admin; password: admin)

Setup prometheus

Add your Prometheus server as a datasource and import the Prometheus dashboards.

Monitor grafana in prometheus (add it as a target).

Look at the grafana dashboards.

Create a new dashboard

Create a new dashboard which enables you to pick someone’s Prometheus server and gather info (samples scraped, scrape duration, …) using variables.

Your dashboard should contain at least a singlestat panel, a variable and a graph panel.

4 - Exporters

Expose metrics for Prometheus

Exporters are HTTP servers that expose metrics. They translate Prometheus scrapes into domain-specific queries, then turn the results into Prometheus metrics.

There are hundreds of known exporters, most of them coming from the community. A few exporters are maintained by the Prometheus team.

node_exporter

The node exporter enables basic monitoring of Linux machines (and other Unix-like systems).

  1. Download the node_exporter 1.5.0.

  2. Extract it

    $ tar xvf Downloads/node_exporter-1.5.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls node_exporter-1.5.0.linux-amd64
    
  4. Launch the node_exporter

    $ cd node_exporter-1.5.0.linux-amd64
    $ ./node_exporter
    
  5. Open your browser at http://127.0.0.1:9100

  6. Add your node_exporter and your neighbors to prometheus.

collectors

The Node Exporter has multiple collectors, some of them disabled by default.

Exercise

  1. Enable the systemd collector
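Collectors are toggled with command-line flags; for instance (assuming you run the binary from its directory):

```
$ ./node_exporter --collector.systemd
```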

textfile collector

Exercise

Move the metrics created before (company name, random number, …) from the web server on port 5678 to the node exporter, so they are collected through its textfile collector.

Do you see use cases for this feature?
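A possible setup (a sketch; the directory name is our choice, and the collector only reads *.prom files):

```
$ mkdir textfile
$ cp metrics textfile/workshop.prom
$ ./node_exporter --collector.textfile.directory=./textfile
```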

Dashboards

Exercise

Create two dashboards: a dashboard that will show the network bandwidth of a server by interface, and a dashboard that will show the disk space available per disk.


JMX exporter

The JMX exporter is useful to monitor Java applications. It can be loaded as a “sidecar” (Java agent) in the same JVM as the application.

  1. Download Jenkins
  2. Download the JMX exporter 0.17.2.
  3. Run Jenkins with the JMX exporter and add it to Prometheus
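A sketch of step 3, loading the exporter as a Java agent (the jar/war file names are assumptions; port 8081 matches the jenkins job in the config files section):

```
$ java -javaagent:./jmx_prometheus_javaagent-0.17.2.jar=8081:config.yml -jar jenkins.war
```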

solution


config.yml


exercise

  1. Create a dashboard with:
    • JVM version
    • Uptime
    • Threads
    • Heap Size
    • Memory Pool size

Grok exporter

  1. Download grok exporter 0.2.8

  2. Extract it

    $ unzip Downloads/grok_exporter-0.2.8.linux-amd64.zip
    
  3. List the files

    $ ls grok_exporter-0.2.8.linux-amd64
    
  4. Create a simple job in Jenkins

  5. Re-run Jenkins with its output redirected to a file (append &> jenkins.log)

exercise

  • Create a job named “test” with the command “sleep 10”
  • Run the job and look for “INFO: test #2 main build action completed: SUCCESS” in the logs
  • Create a counter and a gauge from those lines: job_build_total and job_last_build_number. The name of the job should be a label, and for job_build_total the status should also be a label.

solution


Blackbox exporter

  1. Download the blackbox_exporter 0.22.0.

  2. Extract it

    $ tar xvf Downloads/blackbox_exporter-0.22.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls blackbox_exporter-0.22.0.linux-amd64
    
  4. Launch the blackbox_exporter

    $ cd blackbox_exporter-0.22.0.linux-amd64
    $ ./blackbox_exporter
    
  5. Open your browser at http://127.0.0.1:9115

  6. Add your blackbox_exporter and your neighbors to prometheus

Exercise

  • Monitor the Inuits website (DNS + HTTP) using the blackbox exporter
  • Check with prometheus blackbox exporter when the SSL certificate will expire in days
  • Create a dashboard with the detailed time it takes to get the OSMC website.
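For reference, the usual blackbox scrape pattern rewrites each target into a URL parameter and points __address__ at the exporter itself (a sketch; the http_2xx module name comes from the exporter's example config):

```yaml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://inuits.eu
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115   # the blackbox exporter's address
```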

5 - Alerting

Doing something with those metrics

Recording rules and alerts

Prometheus splits the alerting role across 3 components:

  • the Prometheus server, which evaluates the alerts
  • the Alertmanager, which dispatches the alerts
  • webhook receivers, which handle the alerts

Exercise

Create, in Prometheus, an alert when a target is down.

Exercise

Create, in Prometheus, an alert when a grafana server is down, with an extra label: priority=high.

Exercise

Create a recording rule to compute the % of disk space used, and alert when more than 50% of disk space is used.

What is the difference between recording and alerting?

What is an annotation?

What is a “group” of recording rules?

How to see the rules and the alerts in the UI?

What is a pending alert?

Bonus: Alerts unit test (if there is enough time)

Alertmanager

  1. Download the alertmanager 0.24.0.

  2. Extract it

    $ tar xvf Downloads/alertmanager-0.24.0.linux-amd64.tar.gz
    
  3. List the files

    $ ls alertmanager-0.24.0.linux-amd64
    
  4. Launch the alertmanager

    $ cd alertmanager-0.24.0.linux-amd64
    $ ./alertmanager
    
  5. Open your browser at http://127.0.0.1:9093

  6. Add your alertmanager and your neighbors to prometheus

  7. Connect Prometheus and Alertmanager together

  8. Look for the alerts coming.

  • What are the 4 roles of alertmanager?
  • What are the different timers in alertmanager?
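Step 7 is done in prometheus.yml (a minimal sketch; use your own and your neighbors’ addresses):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '127.0.0.1:9093'
```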

Exercise

Use https://webhook.site/ to get a webhook URL.

Send alerts to that https://webhook.site/ URL.

For the priority=high alerts, send an email instead of a webhook.

  • Can you explain the HA model of prometheus?
  • How can I send an alert to multiple targets?

Exercise

How can you check that two Alertmanager configs are in sync?

Solution


Exercise

Make a big cluster of alert managers

Amtool

Amtool is the CLI tool for Alertmanager.

You can use it to e.g. create silences.

$ ./amtool silence --alertmanager.url=http://127.0.0.1:9093 add job=grafana priority=high -d 15m -c "we redeploy grafana" -a Julien

That will return the UID of the silence that you can use to expire it.

Karma

Karma is a dashboard for Alertmanager.

6 - Bonus

If there is more time…

If there is time left…

  • Pushgateway
  • LTS Remote Read / Remote Write
  • Alert Rules unit testing
  • Service Discovery
  • TLS overview
  • Console templates
  • Graphite Exporter
  • Collectd
  • Grafonnet + Monitoring Mixins

7 - Config files

Some configuration files for the workshop

prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.26.1:9093
      - 192.168.26.2:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'node'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9100"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'grafana'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:3000"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'federation'
    metrics_path: /federate
    params:
        "match[]": [up]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    metric_relabel_configs:
      - source_labels: [__name__]
        target_label: __name__
        regex: up
        replacement: federate_up

    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'jenkins'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:8081"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'alertmanager'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9093"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s
  - job_name: 'grok'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9144"
    file_sd_configs:
      - files:
        - workshop.yml
        refresh_interval: 10s

workshop.yml

- targets: ['192.168.26.1']
  labels:
      name: me
- targets:
  - '192.168.26.3'
  - '192.168.26.4'
  - '192.168.26.5'
  - '192.168.26.6'
  - '192.168.26.7'
  labels:
      name: right
- targets:
  - '192.168.26.8'
  - '192.168.26.9'
  - '192.168.26.10'
  - '192.168.26.11'
  - '192.168.26.12'
  - '192.168.26.13'
  - '192.168.26.14'
  - '192.168.26.15'
  labels:
      name: left

first_rules.yml

groups:
  - name: example
    rules:
    - alert: TargetDown
      for: 5m
      expr: up == 0
      labels:
        priority: high
      annotations:
        text: "{{$labels.job}} is down!"

grok_exporter.yml

global:
    config_version: 2
input:
    type: file
    path: ../jenkins.log
    readall: true
grok:
    patterns_dir: ./patterns
metrics:
    - type: counter
      name: job_build_total
      help: Counter for the job runs.
      match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
      labels:
          jobname: '{{.jobname}}'
          status: '{{.status}}'
    - type: gauge
      name: job_last_build_number
      help: Number of the last build
      match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
      labels:
        jobname: '{{.jobname}}'
      value: '{{.jobid}}'
server:
    port: 9144

alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        priority: high
      continue: true
      receiver: 'sms.hook'
receivers:
- name: 'sms.hook'
  webhook_configs:
  - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
- name: 'web.hook'
  webhook_configs:
  - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']