Workshop
Schedule, setup & technical
Schedule
- 10:00 am: Start
- 11:30 am: Coffee Break
- 01:00 pm: Lunch at the Restaurant
- 03:30 pm: Coffee break
- 05:00 pm: End
- 07:00 pm: Dinner / get together at the Restaurant
Online questionnaire
This afternoon you will receive an online questionnaire.
Please use it to provide feedback.
Network setup
Because clients are isolated by default, we will add a secondary interface on
your laptop:
$ sudo -i
# service firewalld stop
# ip addr add 192.168.28.x/24 dev enp3s0
Where x will be given by the trainer.
Local downloads
You can download the workshop files locally from http://192.168.28.1:3000/.
Code of Conduct
This workshop is subject to the OSMC Code of
Conduct.
1 - Prometheus
Install & Setup Prometheus
Prometheus is an open source monitoring system designed around metrics. It is
a large ecosystem, with plenty of different components.
The Prometheus documentation provides an overview of those components:

How Prometheus works
Prometheus monitoring is based on metrics, exposed on HTTP endpoints. The
Prometheus server is “active”: it initiates the polling. That polling (called
“scraping”) happens at a short, regular interval (usually 15s or 30s).
Each monitored target must expose a metrics endpoint. That endpoint exposes
metrics in the Prometheus HTTP format or in the OpenMetrics format.
Once collected, those metrics are enriched by Prometheus, which adds an instance
and a job label. Optionally, extra user-configured relabeling occurs.
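As an illustration, a minimal scrape configuration could look like this (a sketch: the job name and target are placeholders, not part of the workshop setup):

```yaml
# Hypothetical minimal scrape configuration.
scrape_configs:
  - job_name: 'example'               # becomes the job="example" label
    static_configs:
      - targets: ['127.0.0.1:9100']   # becomes instance="127.0.0.1:9100"
```

Every sample scraped through this job carries job="example" and instance="127.0.0.1:9100", unless relabeling overrides them.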
The Prometheus server
- Download the prometheus server 2.37.4.
- Extract it:
$ tar xvf Downloads/prometheus-2.37.4.linux-amd64.tar.gz
- List the files:
$ ls prometheus-2.37.4.linux-amd64
- Launch prometheus:
$ cd prometheus-2.37.4.linux-amd64
$ ./prometheus
- Open your browser at http://127.0.0.1:9090
- Look at the TSDB data
tsdb
Prometheus stores its data in a database called
tsdb. The TSDB is
self-maintained by the server, which manages the data lifecycle.
The web UI
There is a lot of information that can be found in the prometheus server web ui.
Try to find:
- The version of prometheus
- The duration of data retention
- The “targets” that are scraped by default
- The “scrape” interval
Prometheus UI
The Prometheus UI underwent a major refactoring in 2020. It is now React-based,
with powerful autocomplete features. There is still a link to access the
“classic” UI.
promtool is a command line tool provided with Prometheus.
With promtool you can:
- Create blocks from OpenMetrics files or recording rules, aka backfill.
Info
The up metric is added by prometheus on each scrape. Its value is 1 if the
scrape has succeeded, 0 otherwise.
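Backfilling could look like this (a sketch: the file name and sample data are made up, and promtool must be on your PATH):

```shell
# Create a tiny OpenMetrics file with one hypothetical sample.
cat > metrics.om <<'EOF'
# TYPE workshop_step gauge
workshop_step 1 1700000000
# EOF
EOF
# Turn it into TSDB blocks under ./data (not run here):
# promtool tsdb create-blocks-from openmetrics metrics.om ./data
```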
Adding targets
Note
At this point, make sure you understand the basics of YAML.
exercise
- Open prometheus.yml
- Add your neighbors’ prometheus servers as targets to your own prometheus server.
- Look at the status (using up or the Targets page)
What is a job? What is an instance?
Tip
You do not need to restart prometheus to reload the configuration: just send a
SIGHUP signal:
$ killall -HUP prometheus
Admin commands
- Enable admin commands:
$ ./prometheus --web.enable-admin-api
- Take a snapshot:
$ curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
Look in the data directory.
Note
This is snapshotting the TSDB. There is another kind of snapshot, Memory
Snapshot on Shutdown, which is a different feature.
- Delete a time series:
$ curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=process_start_time_seconds{job="prometheus"}'
Federation
File SD (if workshop is on site)
Now, let’s move to file_sd.
Create a file:
- targets:
    - 192.168.28.1
  labels:
    laptop_user: julien
- targets:
    - 192.168.28.2
  labels:
    laptop_user: john
With your IP and your neighbors’. Name it users.yml.
Adapt the Prometheus configuration:
- job_name: 'prometheus'
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  file_sd_configs:
    - files:
        - users.yml
  relabel_configs:
    - source_labels: [__address__]
      target_label: __address__
      replacement: "${1}:9090"
Duplicate the job, but with the following instructions:
- The new job should be called “federation”
- The new job should query http://127.0.0.1:9090/federate?match[]=up
- The “up” metric fetched should be renamed to federate_up
Tip
The name of a metric is a label too! It is the __name__ label.
Solution
- job_name: 'federation'
  metrics_path: '/federate'
  params:
    'match[]':
      - up
  file_sd_configs:
    - files:
        - users.yml
  relabel_configs:
    - source_labels: [__address__]
      target_label: __address__
      replacement: "${1}:9090"
  metric_relabel_configs:
    - source_labels: [__name__]
      target_label: __name__
      regex: up
      replacement: federate_up
DigitalOcean SD (if workshop is virtual)
Now, let’s move to digitalocean_sd.
In your VM, there is a /etc/do_read file with a digitalocean token.
The version of Prometheus you have has native integration with DigitalOcean.
Adapt Prometheus configuration:
- job_name: 'prometheus'
  digitalocean_sd_configs:
    - bearer_token_file: /etc/do_read
      port: 9090
  relabel_configs:
    - source_labels: [__meta_digitalocean_tags]
      regex: '.*,prometheus_workshop,.*'
      action: keep
Reload Prometheus. You should see the 10 prometheus servers.
Duplicate the job, but with the following instructions:
- The new job should be called “federation”
- The new job should query http://127.0.0.1:9090/federate?match[]=up
- The “up” metric fetched should be renamed to federate_up
Tip
The name of a metric is a label too! It is the __name__ label.
Solution
- job_name: 'prometheus'
  digitalocean_sd_configs:
    - bearer_token_file: /etc/do_read
  relabel_configs:
    - source_labels: [__meta_digitalocean_tags]
      regex: '.*,prometheus_workshop,.*'
      action: keep
    - source_labels: [__meta_digitalocean_droplet_name]
      target_label: instance
    - source_labels: [__meta_digitalocean_public_ipv4]
      target_label: __address__
      replacement: '$1:9090'
- job_name: 'federation'
  metrics_path: '/federate'
  digitalocean_sd_configs:
    - bearer_token_file: /etc/do_read
  params:
    'match[]':
      - up
  relabel_configs:
    - source_labels: [__meta_digitalocean_tags]
      regex: '.*,prometheus_workshop,.*'
      action: keep
    - source_labels: [__meta_digitalocean_droplet_name]
      target_label: instance
    - source_labels: [__meta_digitalocean_public_ipv4]
      target_label: __address__
      replacement: '$1:9090'
  metric_relabel_configs:
    - source_labels: [__name__]
      target_label: __name__
      regex: up
      replacement: federate_up
Last exercise
Prometheus fetches Metrics over HTTP.
Metrics have a name and labels.
As an exercise, let’s build on top of our previous example:
In a new directory, create a file called “metrics”.
Add some metrics:
company{name="inuits"} 1
favorite_color{name="red"} 1
random_number 10
workshop_step 1
Then run python3 -m http.server 5678 (or python -m SimpleHTTPServer 5678 with
Python 2) and add it to prometheus (and your neighbors too).
2 - Metrics Monitoring
Querying Prometheus
Metrics monitoring is different because it does not assume that a situation can
be described by a fixed set of states. Instead, it brings you inside your system
and provides dozens of metrics that you can analyze and understand.
Even if you do not need all the metrics right away, it is better to collect them
so you can explore them later.
What is a metric
- Name
- Timestamp
- Labels
- Value
- Type
Labels
Labels are used to add metadata to metrics. They can be used to differentiate
them, e.g. by adding a status code, a URI handler, a function name, …
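In the exposition format, labels appear between braces. A made-up example:

```
http_requests_total{handler="/api", code="200"} 1027
http_requests_total{handler="/api", code="500"} 3
```

Each distinct label combination is its own time series.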
Types of metrics
- Gauge
- Metric that can go up and down
- Counters
- Metric that can only go up and starts at 0
- Histograms
- Metrics that put data into buckets. A histogram is composed of 3 metrics:
sum, buckets and count.
- Summaries
- Metrics that calculate quantiles over observed values. A summary is composed
of 3 metrics: sum, quantiles and count.
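For instance, a (hypothetical) histogram exposes one bucket series per le bound, plus the sum and count:

```
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="+Inf"} 320
http_request_duration_seconds_sum 54.2
http_request_duration_seconds_count 320
```

Note that buckets are cumulative: the le="+Inf" bucket always equals the count.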
Upstream documentation
Exercises
The number of http requests is expressed in …
The duration in …
The number of active sessions is a …
What are the labels added by prometheus?
What are the labels prometheus knows but does not add?
promlens
Promlens helps you build and understand queries.
- Download promlens 0.2.0.
- Extract it:
$ tar xvf Downloads/promlens-0.2.0.linux-amd64.tar.gz
- List the files:
$ ls promlens-0.2.0.linux-amd64
- Launch promlens:
$ cd promlens-0.2.0.linux-amd64
$ ./promlens --web.default-prometheus-url="http://127.0.0.1:9090"
- Open your browser at http://127.0.0.1:8080
- Add your promlens and your neighbors’ to prometheus.
Labels matching
For prometheus to do calculations and comparisons of metrics, labels must match
on each side (except __name__, the name of the metric).
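For example (the metric names below are only illustrative), a binary operation matches series whose label sets are identical; ignoring() relaxes that:

```promql
# Matches only where both sides carry the same labels:
node_network_receive_bytes_total / node_network_transmit_bytes_total

# Drop a non-matching label explicitly (metric_a and metric_b are hypothetical):
metric_a / ignoring(code) metric_b
```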
Maths
Operators like +, -, *, /, …
Aggregators
Functions
Some important functions:
rate()
deriv()
delta()
increase()
In the list above, which ones should be used with counters, and which ones with
gauges?
What is the difference between irate and rate? idelta and delta?
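As a rule of thumb, sketched with example series names (http_requests_total as a counter, node_memory_MemFree_bytes as a gauge):

```promql
rate(http_requests_total[5m])         # counters: per-second rate, handles resets
increase(http_requests_total[1h])     # counters: total increase over the window
delta(node_memory_MemFree_bytes[1h])  # gauges: difference over the window
deriv(node_memory_MemFree_bytes[1h])  # gauges: per-second derivative
```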
Aggregation
How can I get the sum of scraped metrics by job (2 ways)? – exclusion and
inclusion.
Tip
The scrape_samples_scraped metric is added by prometheus after scraping a
target and indicates the number of samples collected at that scrape.
Solution
Using by:
sum(scrape_samples_scraped) by (job)
Using without:
sum(scrape_samples_scraped) without (instance)
How can I get the % of handler="/federate" over the other
prometheus_http_request_duration_seconds_count?
Solution
100 *
rate(prometheus_http_request_duration_seconds_count{handler="/federate"}[5m])
/ ignoring(handler) group_right sum without (handler)
(rate(prometheus_http_request_duration_seconds_count[5m]))
Over Time
What is the difference between max and max_over_time?
Max, Min, Bottomk, Topk
What is the difference between max(x) and topk(1, x)?
Time functions
day()
day_of_week()
How to use the optional argument of day_of_week?
What is the timestamp() function? How can it be useful?
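A couple of sketches (the series names are only examples):

```promql
# Age, in seconds, of the most recent sample of a series:
time() - timestamp(up)

# Only true on weekdays; without an argument, day_of_week() uses the
# evaluation time (0 = Sunday):
day_of_week() > 0 and day_of_week() < 6
```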
Tip
You can use the Grafana Explore feature to get help and autocomplete on
Prometheus queries (currently still a “beta” feature). That feature will likely
come in the next release of Prometheus!
And/Or
Can you think of any use cases for and/or/unless?
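One possible sketch (the maintenance metric is hypothetical, e.g. pushed from a deployment script):

```promql
# Down targets, unless the instance is flagged for maintenance:
up == 0 unless on(instance) maintenance == 1
```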
3 - Grafana
Create beautiful dashboards
Grafana
Grafana aims to be a one-stop shop for observability users. From the Grafana
interface, you can access your data, wherever it lives. Grafana has first-class
integrations with Prometheus, Loki, Jaeger and many others.
It is famous for its dashboarding solution, but more features have been added
recently, such as traces views, manual queries, and logs explorer.
License
Grafana is licensed under the AGPL license. It is an Open Source license, with
the requirement that the modifications you make must be published under the same
license, even when you provide Grafana as a service.
Download and run grafana
- Go to the Grafana website and download the latest stable release
- For this exercise we will use the Standalone Linux Binaries (64 Bit)
$ wget https://dl.grafana.com/oss/release/grafana-9.3.2.linux-amd64.tar.gz
$ tar -zxvf grafana-9.3.2.linux-amd64.tar.gz
$ cd grafana-9.3.2
$ ./bin/grafana-server
- Open Grafana in your browser: http://127.0.0.1:3000 (username: admin;
password: admin)
Setup prometheus
Add your prometheus server as a datasource + import the prometheus dashboards.
Monitor grafana in prometheus (add it as a target).
Look at the grafana dashboards.
Create a new dashboard
Create a new dashboard which enables you to pick someone’s prometheus and gather
info (samples scraped, scrape duration, …) using variables.
Your dashboard should contain at least a singlestat panel, a variable and a graph
panel.
Info
Grafana also supports tables and heatmaps for Prometheus
4 - Exporters
Expose metrics for Prometheus
Exporters are HTTP servers that expose metrics. They can translate a Prometheus
scrape into domain-specific queries, then turn the results into Prometheus
metrics.
There are hundreds of known exporters, most of them coming from the community. A
few exporters are maintained by the Prometheus team.
node_exporter
The node exporter enables basic monitoring of Linux machines (and other
Unix-like systems).
- Download the node_exporter 1.5.0.
- Extract it:
$ tar xvf Downloads/node_exporter-1.5.0.linux-amd64.tar.gz
- List the files:
$ ls node_exporter-1.5.0.linux-amd64
- Launch the node_exporter:
$ cd node_exporter-1.5.0.linux-amd64
$ ./node_exporter
- Open your browser at http://127.0.0.1:9100
- Add your node_exporter and your neighbors’ to prometheus.
collectors
The Node Exporter has multiple collectors, some of them disabled by default.
Exercise
- Enable the systemd collector
textfile collector
Exercise
Move the metrics created before (company name, random number, …) from the web
server on port 5678 to the node exporter’s textfile collector.
Do you see use cases for this feature?
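One way to do it (a sketch: the directory path is arbitrary, and the flag is passed at node_exporter startup):

```shell
# Write the metrics into a *.prom file inside a textfile directory.
mkdir -p /tmp/node_textfile
cat > /tmp/node_textfile/workshop.prom <<'EOF'
company{name="inuits"} 1
favorite_color{name="red"} 1
random_number 10
workshop_step 1
EOF
# Then (not run here):
# ./node_exporter --collector.textfile.directory=/tmp/node_textfile
```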
Dashboards
Exercise
Create two dashboards: a dashboard that will show the network bandwidth of a
server by interface, and a dashboard that will show the disk space available per
disk.
Tip
You can use {job="node_exporter"} in prometheus to see the metrics, or you can
directly open the /metrics of the node_exporter in your browser.
JMX exporter
The JMX exporter is useful to monitor Java applications. It can be loaded as a
“sidecar” (Java agent) in the same JVM as the application.
- Download Jenkins
- Download the JMX exporter 0.17.2.
- Run Jenkins with the JMX exporter and add it to Prometheus
solution
$ java -javaagent:./jmx_prometheus_javaagent-0.17.2.jar=8081:config.yml -jar jenkins.war
config.yml
---
startDelaySeconds: 10
exercise
- Create a dashboard with:
- JVM version
- Uptime
- Threads
- Heap Size
- Memory Pool size
Grok exporter
- Download grok exporter 0.2.8
- Extract it:
$ unzip Downloads/grok_exporter-0.2.8.linux-amd64.zip
- List the files:
$ ls grok_exporter-0.2.8.linux-amd64
- Create a simple job in Jenkins
- Re-run Jenkins with its output redirected to a file (add &> jenkins.log)
exercise
- Create a job with name “test” and command “sleep 10”
- Run the job and look for “INFO: test #2 main build action completed: SUCCESS”
in the logs
- Create a counter and a gauge for those: job_build_total and
job_last_build_number. The name of the job should be a label, and for
job_build_total the status should be a label too.
solution
global:
  config_version: 2
input:
  type: file
  path: ../jenkins.log
  readall: true
grok:
  patterns_dir: ./patterns
metrics:
  - type: counter
    name: job_build_total
    help: Counter for the job runs.
    match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
    labels:
      jobname: '{{.jobname}}'
      status: '{{.status}}'
  - type: gauge
    name: job_last_build_number
    help: Number of the last build
    match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
    labels:
      jobname: '{{.jobname}}'
    value: '{{.jobid}}'
server:
  port: 9144
Blackbox exporter
- Download the blackbox_exporter 0.22.0.
- Extract it:
$ tar xvf Downloads/blackbox_exporter-0.22.0.linux-amd64.tar.gz
- List the files:
$ ls blackbox_exporter-0.22.0.linux-amd64
- Launch the blackbox_exporter:
$ cd blackbox_exporter-0.22.0.linux-amd64
$ ./blackbox_exporter
- Open your browser at http://127.0.0.1:9115
- Add your blackbox_exporter and your neighbors’ to prometheus
Exercise
- Monitor the Inuits website (DNS + HTTP) using the blackbox exporter
- Check with the blackbox exporter, in days, when the SSL certificate will
expire
- Create a dashboard with the detailed time it takes to get the OSMC website.
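For the certificate expiry, a query along these lines can help (assuming the standard probe_ssl_earliest_cert_expiry metric exposed by the blackbox exporter's http probe):

```promql
(probe_ssl_earliest_cert_expiry - time()) / 86400
```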
5 - Alerting
Doing something with those metrics
Recording rules and alerts
Note
Alerting uses the Go Templating System, in both Prometheus and Alertmanager.
Prometheus splits the alerting role into 3 components:
- prometheus server which will calculate the alerts
- alertmanager which will dispatch the alerts
- webhook receivers that will handle the alerts
Note
Alerts and recording rules are close to each other: both are queries run at a
regular interval by prometheus, and both write new metrics into the tsdb.
Exercise
Create, in Prometheus, an alert when a target is down.
Exercise
Create, in Prometheus, an alert when a grafana server is down, with an extra label:
priority=high.
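A possible rule for this exercise (a sketch: it assumes your scrape job is named grafana):

```yaml
groups:
  - name: workshop
    rules:
      - alert: GrafanaDown
        expr: up{job="grafana"} == 0
        for: 5m
        labels:
          priority: high
```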
Exercise
Create a recording rule to get the % of disk space used
and alert on > 50% of disk space used.
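One possible shape for this (a sketch: the recorded metric name is a naming choice, and the node_filesystem_* metrics come from the node exporter):

```yaml
groups:
  - name: disk
    rules:
      - record: instance:node_filesystem_used:ratio
        expr: 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes
      - alert: DiskSpaceUsedHigh
        expr: instance:node_filesystem_used:ratio > 0.5
        for: 10m
```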
What is the difference between recording and alerting?
What is an annotation?
What is a “group” of recording rules?
How to see the rules and the alerts in the UI?
What is a pending alert?
Bonus: Alerts unit test (if there is enough time)
Note
Prometheus generates an ALERTS metric with the active/pending alerts.
Alertmanager
- Download the alertmanager 0.24.0.
- Extract it:
$ tar xvf Downloads/alertmanager-0.24.0.linux-amd64.tar.gz
- List the files:
$ ls alertmanager-0.24.0.linux-amd64
- Launch the alertmanager:
$ cd alertmanager-0.24.0.linux-amd64
$ ./alertmanager
- Open your browser at http://127.0.0.1:9093
- Add your alertmanager and your neighbors’ to prometheus
- Connect Prometheus and Alertmanager together
- Look for the incoming alerts.
- What are the 4 roles of alertmanager?
- What are the different timers in alertmanager?
Exercise
Use https://webhook.site/ to get a webhook URL.
Send alerts to that https://webhook.site/ URL.
For the priority=high alerts, send an email instead of a webhook.
- Can you explain the HA model of prometheus?
- How can I send an alert to multiple targets?
Exercise
How can you check that two alertmanager configs are in sync?
Note
There is an alertmanager_config_hash metric.
Solution
count(count_values("config", alertmanager_config_hash)) != 1
Exercise
Make a big cluster of alertmanagers
amtool is the CLI tool for alertmanager.
You can use it to, e.g., create silences.
$ ./amtool silence --alertmanager.url=http://127.0.0.1:9093 add job=grafana priority=high -d 15m -c "we redeploy grafana" -a Julien
That will return the UID of the silence that you can use to expire it.
Karma
karma is a dashboard for alertmanager
6 - Bonus
If there is more time…
If there is time left:
- Pushgateway
- LTS Remote Read / Remote Write
- Alert Rules unit testing
- Service Discovery
- TLS overview
- Console templates
- Graphite Exporter
- Collectd
- Grafonnet + Monitoring Mixins
7 - Config files
Some configuration files for the workshop
prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.26.1:9093
            - 192.168.26.2:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'node'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9100"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'grafana'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:3000"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'federation'
    metrics_path: /federate
    params:
      "match[]": [up]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9090"
    metric_relabel_configs:
      - source_labels: [__name__]
        target_label: __name__
        regex: up
        replacement: federate_up
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'jenkins'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:8081"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'alertmanager'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9093"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
  - job_name: 'grok'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __address__
        replacement: "${1}:9144"
    file_sd_configs:
      - files:
          - workshop.yml
        refresh_interval: 10s
workshop.yml
- targets: ['192.168.26.1']
  labels:
    name: me
- targets:
    - '192.168.26.3'
    - '192.168.26.4'
    - '192.168.26.5'
    - '192.168.26.6'
    - '192.168.26.7'
  labels:
    name: right
- targets:
    - '192.168.26.8'
    - '192.168.26.9'
    - '192.168.26.10'
    - '192.168.26.11'
    - '192.168.26.12'
    - '192.168.26.13'
    - '192.168.26.14'
    - '192.168.26.15'
  labels:
    name: left
first_rules.yml
groups:
  - name: example
    rules:
      - alert: TargetDown
        for: 5m
        expr: up == 0
        labels:
          priority: high
        annotations:
          text: "{{$labels.job}} is down!"
grok_exporter.yml
global:
  config_version: 2
input:
  type: file
  path: ../jenkins.log
  readall: true
grok:
  patterns_dir: ./patterns
metrics:
  - type: counter
    name: job_build_total
    help: Counter for the job runs.
    match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
    labels:
      jobname: '{{.jobname}}'
      status: '{{.status}}'
  - type: gauge
    name: job_last_build_number
    help: Number of the last build
    match: 'INFO: %{WORD:jobname} #%{NUMBER:jobid} main build action completed: %{WORD:status}'
    labels:
      jobname: '{{.jobname}}'
    value: '{{.jobid}}'
server:
  port: 9144
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        priority: high
      continue: true
      receiver: 'sms.hook'
receivers:
  - name: 'sms.hook'
    webhook_configs:
      - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
  - name: 'web.hook'
    webhook_configs:
      - url: 'https://webhook.site/5c702a0d-2c02-4f70-a8ee-9ac45d2ce2b9'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']