# Monitoring with Prometheus
We use Prometheus, Alertmanager, Grafana, and related tools to monitor our systems and services.
## Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit. In our setup:
- Prometheus pulls metrics from exporters or from the Pushgateway (see the curl example below).
- Prometheus stores the data in its built-in TSDB (time series database).
- Prometheus pushes alerts to Alertmanager, which decides how to deliver them (email, WeChat, etc.).
- Grafana queries Prometheus to visualize the data and build dashboards.
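Every exporter simply serves plain-text metrics over HTTP, and Prometheus scrapes them on its `scrape_interval`. You can see exactly what Prometheus pulls with curl (a quick check, assuming a node_exporter is running locally on its default port 9100):
```
# Dump the first few metrics the exporter exposes for Prometheus to scrape.
curl -s http://localhost:9100/metrics | head
```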
Deploy Prometheus
[Download Prometheus](https://prometheus.io/download); Alertmanager, blackbox_exporter, and the other exporters are available from the same page.
Unpack the tarball and install the binaries
```
tar xvfz prometheus-*.tar.gz
cd prometheus-*
mkdir -p /etc/prometheus /var/lib/prometheus
cp prometheus.yml /etc/prometheus/
cp -r consoles console_libraries /etc/prometheus/
cp prometheus promtool /usr/local/bin/
useradd prometheus -s /usr/sbin/nologin
chown -R prometheus:prometheus /var/lib/prometheus
```
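A quick check that the binary runs on this host before wiring it into systemd:
```
/usr/local/bin/prometheus --version
```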
Configure Prometheus as a systemd service
```
vim /etc/systemd/system/prometheus.service
#
# Ansible managed
#
[Unit]
Description=Prometheus
After=network-online.target
Requires=local-fs.target
After=local-fs.target
[Service]
Type=simple
Environment="GOMAXPROCS=2"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=0 \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.console.templates=/etc/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.external-url=
CapabilityBoundingSet=CAP_SET_UID
LimitNOFILE=65000
LockPersonality=true
NoNewPrivileges=true
MemoryDenyWriteExecute=true
PrivateDevices=true
PrivateTmp=true
ProtectHome=true
RemoveIPC=true
RestrictSUIDSGID=true
#SystemCallFilter=@signal @timer
ReadWriteDirectories=/var/lib/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
```
Configure Prometheus
```
vim /etc/prometheus/prometheus.yml
#
# Ansible managed
#
# http://prometheus.io/docs/operating/configuration/
global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    environment: grafana.cclinux.org.cn
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
rule_files:
- /etc/prometheus/rules/*.rules
scrape_configs:
- job_name: prometheus
  metrics_path: /metrics
  static_configs:
  - targets:
    - grafana.cclinux.org.cn:9090
#- job_name: node
#  file_sd_configs:
#  - files:
#    - /etc/prometheus/file_sd/node.yml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://example.com:8080  # Target to probe with http on port 8080.
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
- job_name: 'ipa'
  static_configs:
  - targets: ['192.168.69.186:9100', '192.168.69.110:9100', '192.168.69.198:9100']
- job_name: 'idp001'
  static_configs:
  - targets: ['192.168.69.126:9100']
- job_name: 'idp002'
  static_configs:
  - targets: ['192.168.69.161:9100']
- job_name: 'gitlab'
  static_configs:
  - targets: ['192.168.69.125:9100']
- job_name: gitlab-workhorse
  static_configs:
  - targets:
    - 192.168.69.125:9229
- job_name: gitlab-rails
  metrics_path: "/-/metrics"
  static_configs:
  - targets:
    - 192.168.69.125:8080
- job_name: gitlab-sidekiq
  static_configs:
  - targets:
    - 192.168.69.125:8082
- job_name: gitlab_exporter_process
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - 192.168.69.125:9168
- job_name: gitaly
  static_configs:
  - targets:
    - 192.168.69.125:9236
```
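Before (re)starting Prometheus, the configuration can be validated with promtool (copied to /usr/local/bin earlier):
```
promtool check config /etc/prometheus/prometheus.yml
```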
Configure rules
```
vim /etc/prometheus/rules/ansible_managed.rules
#
# Ansible managed
#
groups:
- name: ansible managed alert rules
  rules:
  - alert: Watchdog
    annotations:
      description: 'This is an alert meant to ensure that the entire alerting pipeline
        is functional. This alert is always firing, therefore it should always be firing
        in Alertmanager and always fire against a receiver. There are integrations with
        various notification mechanisms that send a notification when this alert is not
        firing. For example the "DeadMansSnitch" integration in PagerDuty.'
      summary: Ensure entire alerting pipeline is functional
    expr: vector(1)
    for: 10m
    labels:
      severity: warning
  - alert: InstanceDown
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for
        more than 5 minutes.'
      summary: Instance {{ $labels.instance }} down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
  - alert: RebootRequired
    annotations:
      description: '{{ $labels.instance }} requires a reboot.'
      summary: Instance {{ $labels.instance }} - reboot required
    expr: node_reboot_required > 0
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up.
      summary: Filesystem is predicted to run out of space within the next 24 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_avail_bytes{job="node",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up fast.
      summary: Filesystem is predicted to run out of space within the next 4 hours.
    expr: |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_avail_bytes{job="node",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 5% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 3% space left.
    expr: |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up.
      summary: Filesystem is predicted to run out of inodes within the next 24 hours.
    expr: |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_files_free{job="node",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
      summary: Filesystem is predicted to run out of inodes within the next 4 hours.
    expr: |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_files_free{job="node",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 5% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 3% inodes left.
    expr: |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: critical
  - alert: NodeNetworkReceiveErrs
    annotations:
      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
        {{ printf "%.0f" $value }} receive errors in the last two minutes.'
      summary: Network interface is reporting many receive errors.
    expr: increase(node_network_receive_errs_total[2m]) > 10
    for: 1h
    labels:
      severity: warning
  - alert: NodeNetworkTransmitErrs
    annotations:
      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
        {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
      summary: Network interface is reporting many transmit errors.
    expr: increase(node_network_transmit_errs_total[2m]) > 10
    for: 1h
    labels:
      severity: warning
  - alert: NodeHighNumberConntrackEntriesUsed
    annotations:
      description: '{{ $value | humanizePercentage }} of conntrack entries are used'
      summary: Number of conntrack entries are getting close to the limit
    expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
    labels:
      severity: warning
  - alert: NodeClockSkewDetected
    annotations:
      message: Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure
        NTP is configured correctly on this host.
      summary: Clock skew detected.
    expr: |
      (
        node_timex_offset_seconds > 0.05
      and
        deriv(node_timex_offset_seconds[5m]) >= 0
      )
      or
      (
        node_timex_offset_seconds < -0.05
      and
        deriv(node_timex_offset_seconds[5m]) <= 0
      )
    for: 10m
    labels:
      severity: warning
  - alert: NodeClockNotSynchronising
    annotations:
      message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is
        configured on this host.
      summary: Clock not synchronising.
    expr: min_over_time(node_timex_sync_status[5m]) == 0
    for: 10m
    labels:
      severity: warning
```
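Rule files can be validated the same way:
```
promtool check rules /etc/prometheus/rules/ansible_managed.rules
```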
Start Prometheus
```
systemctl start prometheus
systemctl enable prometheus
```
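Once Prometheus is up, you can confirm that targets are being scraped and data is landing in the TSDB through the HTTP API (assuming the listen address from the unit file above):
```
# Every healthy scrape target reports up == 1.
curl -s 'http://localhost:9090/api/v1/query?query=up'
# List all configured scrape targets and their last scrape status.
curl -s 'http://localhost:9090/api/v1/targets'
```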
## Alertmanager
Deploy Alertmanager the same way as Prometheus (unpack the tarball from the downloads page and copy the alertmanager binary), then configure it
```
vim /etc/alertmanager/alertmanager.yml
#
# Ansible managed
#
global:
  resolve_timeout: 3m
  smtp_smarthost: 'smtp.cclinux.org:465'
  smtp_from: 'monitor@cclinux.org'
  smtp_auth_username: 'monitor@cclinux.org'
  smtp_auth_password: 'ThisIsPassword'
  smtp_require_tls: false
templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: 'email'
  email_configs:
  - to: 'awsomehan@cclinux.org'
route:
  group_by: [alertname]
  group_interval: 1m
  group_wait: 30s
  receiver: 'email'
  repeat_interval: 4h
```
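The configuration can be checked with amtool, which ships in the Alertmanager tarball (assuming it was also copied to /usr/local/bin):
```
amtool check-config /etc/alertmanager/alertmanager.yml
```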
## Grafana
Deploy Grafana
```
vim /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
yum install grafana -y
vim /etc/grafana/grafana.ini
app_mode = production
instance_name = grafana.cclinux.org.cn
[paths]
data = /var/lib/grafana
logs = /var/log/grafana
plugins = /var/lib/grafana/plugins
; datasources = conf/datasources
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.cclinux.org.cn
root_url = http://0.0.0.0:3000
protocol = http
enforce_domain = False
socket =
cert_key =
cert_file =
enable_gzip = False
static_root_path = public
router_logging = False
serve_from_sub_path = False
[database]
type = sqlite3
[remote_cache]
[security]
admin_user = admin
admin_password = circlelinux
[users]
allow_sign_up = False
auto_assign_org_role = Viewer
default_theme = dark
[emails]
welcome_email_on_sign_up = False
[analytics]
reporting_enabled = "True"
[dashboards]
versions_to_keep = 20
[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards
[alerting]
enabled = true
execute_alerts = True
[log]
mode = console, file
level = info
[grafana_com]
url = https://grafana.com
systemctl daemon-reload
systemctl start grafana-server
systemctl status grafana-server
systemctl enable grafana-server
```
Log in to Grafana at http://grafana.cclinux.org.cn:3000 and add Prometheus as a data source.
[Download Grafana dashboards](https://grafana.com/grafana/dashboards) and import them into Grafana.
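Alternatively, the Prometheus data source can be provisioned from a file instead of the UI; a minimal sketch, assuming Prometheus is reachable at grafana.cclinux.org.cn:9090:
```
vim /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://grafana.cclinux.org.cn:9090
  isDefault: true
```
Grafana loads provisioning files at startup, so restart grafana-server after adding the file.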
## Blackbox exporter
In this setup the blackbox exporter runs on the Prometheus node (the relabel rule in prometheus.yml points at 127.0.0.1:9115); it can probe HTTP, HTTPS, DNS, TCP, and ICMP endpoints. Install it as shown below.
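A minimal install, assuming the upstream release tarball layout (downloaded from the same page as Prometheus):
```
tar xvfz blackbox_exporter-*.tar.gz
cd blackbox_exporter-*
mkdir /etc/blackbox_exporter
cp blackbox.yml /etc/blackbox_exporter/
cp blackbox_exporter /usr/local/bin/blackbox_exporter
useradd blackbox-exp -s /usr/sbin/nologin
```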
Configure the blackbox exporter as a systemd service
```
vim /etc/systemd/system/blackbox_exporter.service
[Unit]
Description=Prometheus Blackbox Exporter
After=network-online.target
[Service]
Type=simple
User=blackbox-exp
Group=blackbox-exp
ExecStart=/usr/local/bin/blackbox_exporter \
--config.file=/etc/blackbox_exporter/blackbox.yml \
--web.listen-address=0.0.0.0:9115
SyslogIdentifier=blackbox_exporter
Restart=always
RestartSec=1
StartLimitInterval=0
ProtectHome=yes
NoNewPrivileges=yes
ProtectSystem=strict
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=yes
[Install]
WantedBy=multi-user.target
```
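The unit file points at /etc/blackbox_exporter/blackbox.yml. A minimal module definition covering the http_2xx module referenced in prometheus.yml (a sketch based on the blackbox_exporter example config) could look like this:
```
vim /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
  icmp:
    prober: icmp
```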
Start the blackbox exporter
```
systemctl start blackbox_exporter
systemctl enable blackbox_exporter
```
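A probe can be tested by hand before wiring it into Prometheus:
```
curl -s 'http://127.0.0.1:9115/probe?module=http_2xx&target=http://example.com:8080'
```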
Configure a Prometheus alerting rule for the probes
```
vim /etc/prometheus/rules/blackbox.rules
groups:
- name: blackbox.rules
  rules:
  - alert: EndpointDown
    expr: probe_success == 0
    for: 10s
    labels:
      severity: "critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
chown root:prometheus /etc/prometheus/rules/blackbox.rules
```
Restart Prometheus
```
systemctl restart prometheus
```
## References
- https://prometheus.io/docs/introduction/overview/
- https://grafana.com/docs/