Basic usage of Prometheus

# Monitoring with Prometheus
We use Prometheus, Alertmanager, Grafana, and other tools to monitor our systems and services.
## Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit.

- Prometheus pulls metrics from exporters, or from the Pushgateway for jobs that push their metrics.
- Prometheus stores the data in its TSDB, a time series database.
- Prometheus pushes alerts to Alertmanager, and Alertmanager decides how to notify: by email, WeChat, and so on.
- Grafana reads data from Prometheus for visualization and export.
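
To make the pull model concrete: each exporter serves plain-text samples over HTTP, and Prometheus parses them on every scrape. The following is a minimal sketch of that parsing step, not Prometheus's own parser. It assumes simple lines without escaped quotes or trailing timestamps, and the scrape body (`EXAMPLE_BODY`) is a hypothetical example shaped like node_exporter output.

```python
import re

# Matches simple samples such as: metric_name{label="value"} 1.5
# Assumption: no escaped quotes in label values and no trailing timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def parse_exposition(text):
    """Parse a scrape body, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        samples.append(parse_sample(line))
    return samples


# Hypothetical scrape body, shaped like node_exporter output.
EXAMPLE_BODY = """\
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 2.5e+10
up 1
"""

for name, labels, value in parse_exposition(EXAMPLE_BODY):
    print(name, labels, value)
```

Each parsed sample, together with the scrape timestamp and the target's labels, becomes one point in a time series in the TSDB.
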

Deploy Prometheus

[Download Prometheus](https://prometheus.io/download); Alertmanager, blackbox_exporter, and other components can be downloaded from the same page.

Unpack the archive, then install the binary and the configuration file:
```
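tar xvfz prometheus-*.tar.gz
# The trailing slash matches only the extracted directory, not the tarball
cd prometheus-*/
# Create the service account first so the data directory can be handed over to it
useradd prometheus -s /usr/sbin/nologin
mkdir -p /etc/prometheus /var/lib/prometheus
cp prometheus.yml /etc/prometheus
cp prometheus /usr/local/bin/prometheus
# The unit file below stores the TSDB in /var/lib/prometheus
chown prometheus:prometheus /var/lib/prometheus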
```
Configure Prometheus as a systemd service
```
vim /etc/systemd/system/prometheus.service 
#
# Ansible managed
#

[Unit]
Description=Prometheus
After=network-online.target
Requires=local-fs.target
After=local-fs.target

[Service]
Type=simple
Environment="GOMAXPROCS=2"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=0 \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.console.templates=/etc/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=
CapabilityBoundingSet=CAP_SET_UID
LimitNOFILE=65000
LockPersonality=true
NoNewPrivileges=true
MemoryDenyWriteExecute=true
PrivateDevices=true
PrivateTmp=true
ProtectHome=true
RemoveIPC=true
RestrictSUIDSGID=true
#SystemCallFilter=@signal @timer

ReadWriteDirectories=/var/lib/prometheus

ProtectSystem=full


SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target
```

Configure Prometheus
```
vim /etc/prometheus/prometheus.yml
#
# Ansible managed
#
# http://prometheus.io/docs/operating/configuration/

global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s

  external_labels:
    environment: grafana.cclinux.org.cn

alerting:
 alertmanagers:
 - static_configs:
   - targets: ["localhost:9093"]


rule_files:
  - /etc/prometheus/rules/*.rules


scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    static_configs:
    - targets:
      - grafana.cclinux.org.cn:9090
  #- file_sd_configs:
  #  - files:
  #    - /etc/prometheus/file_sd/node.yml
  #  job_name: node
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://example.com:8080 # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
  - job_name: 'ipa'
    static_configs:
    - targets: ['192.168.69.186:9100', '192.168.69.110:9100', '192.168.69.198:9100']
  - job_name: 'idp001'
    static_configs:
    - targets: ['192.168.69.126:9100']
  - job_name: 'idp002'
    static_configs:
    - targets: ['192.168.69.161:9100']
  - job_name: 'gitlab'
    static_configs:
    - targets: ['192.168.69.125:9100']
  - job_name: gitlab-workhorse
    static_configs:
      - targets:
        - 192.168.69.125:9229
  - job_name: gitlab-rails
    metrics_path: "/-/metrics"
    static_configs:
      - targets:
        - 192.168.69.125:8080
  - job_name: gitlab-sidekiq
    static_configs:
      - targets:
        - 192.168.69.125:8082
  - job_name: gitlab_exporter_process
    metrics_path: "/metrics"
    static_configs:
      - targets:
        - 192.168.69.125:9168
  - job_name: gitaly
    static_configs:
      - targets:
        - 192.168.69.125:9236
```
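
The `relabel_configs` of the `blackbox` job are the least obvious part of this file: they move the probed URL into the `target` query parameter, keep it as the `instance` label, and then point the scrape at the exporter itself. The sketch below simulates only the default `replace` action with the default regex, which is all that job uses. The helper names (`apply_replace`, `relabel_blackbox_target`) are hypothetical and this is not Prometheus's implementation.

```python
def apply_replace(labels, target_label, source_labels=(), replacement=None, separator=";"):
    """Apply one relabel rule using the default action (replace) and default regex (.*).

    With the defaults, the joined source label values are copied to target_label.
    An explicit replacement without capture references is used literally.
    """
    joined = separator.join(labels.get(name, "") for name in source_labels)
    updated = dict(labels)
    updated[target_label] = joined if replacement is None else replacement
    return updated


def relabel_blackbox_target(target, exporter_address="127.0.0.1:9115"):
    """Replay the three relabel rules of the 'blackbox' job for one static target."""
    # Before relabeling, __address__ holds the configured target and
    # __param_module holds the value of the 'module' URL parameter.
    labels = {"__address__": target, "__param_module": "http_2xx", "job": "blackbox"}
    # 1. Copy the probed URL into the 'target' query parameter.
    labels = apply_replace(labels, "__param_target", source_labels=["__address__"])
    # 2. Keep the probed URL as the instance label shown in queries and alerts.
    labels = apply_replace(labels, "instance", source_labels=["__param_target"])
    # 3. Scrape the blackbox exporter instead of the probed URL.
    labels = apply_replace(labels, "__address__", replacement=exporter_address)
    return labels


result = relabel_blackbox_target("http://example.com:8080")
for name in sorted(result):
    print(f"{name} = {result[name]}")
```

After relabeling, Prometheus scrapes `127.0.0.1:9115/probe` with `module` and `target` as query parameters, while the stored series carry `instance="http://example.com:8080"`.
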
Configure alert rules
```
vim /etc/prometheus/rules/ansible_managed.rules 
#
# Ansible managed
#

groups:
- name: ansible managed alert rules
  rules:
  - alert: Watchdog
    annotations:
      description: 'This is an alert meant to ensure that the entire alerting pipeline
        is functional.

        This alert is always firing, therefore it should always be firing in Alertmanager

        and always fire against a receiver. There are integrations with various notification

        mechanisms that send a notification when this alert is not firing. For example
        the

        "DeadMansSnitch" integration in PagerDuty.'
      summary: Ensure entire alerting pipeline is functional
    expr: vector(1)
    for: 10m
    labels:
      severity: warning
  - alert: InstanceDown
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for
        more than 1 minute.'
      summary: Instance {{ $labels.instance }} down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
  - alert: RebootRequired
    annotations:
      description: '{{ $labels.instance }} requires a reboot.'
      summary: Instance {{ $labels.instance }} - reboot required
    expr: node_reboot_required > 0
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up.
      summary: Filesystem is predicted to run out of space within the next 24 hours.
    expr: "(\n  node_filesystem_avail_bytes{job=\"node\",fstype!=\"\"} / node_filesystem_size_bytes{job=\"\
      node\",fstype!=\"\"} * 100 < 40\nand\n  predict_linear(node_filesystem_avail_bytes{job=\"\
      node\",fstype!=\"\"}[6h], 24*60*60) < 0\nand\n  node_filesystem_readonly{job=\"\
      node\",fstype!=\"\"} == 0\n)\n"
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left and is filling up fast.
      summary: Filesystem is predicted to run out of space within the next 4 hours.
    expr: "(\n  node_filesystem_avail_bytes{job=\"node\",fstype!=\"\"} / node_filesystem_size_bytes{job=\"\
      node\",fstype!=\"\"} * 100 < 20\nand\n  predict_linear(node_filesystem_avail_bytes{job=\"\
      node\",fstype!=\"\"}[6h], 4*60*60) < 0\nand\n  node_filesystem_readonly{job=\"\
      node\",fstype!=\"\"} == 0\n)\n"
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 5% space left.
    expr: "(\n  node_filesystem_avail_bytes{job=\"node\",fstype!=\"\"} / node_filesystem_size_bytes{job=\"\
      node\",fstype!=\"\"} * 100 < 5\nand\n  node_filesystem_readonly{job=\"node\",fstype!=\"\
      \"} == 0\n)\n"
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfSpace
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available space left.
      summary: Filesystem has less than 3% space left.
    expr: "(\n  node_filesystem_avail_bytes{job=\"node\",fstype!=\"\"} / node_filesystem_size_bytes{job=\"\
      node\",fstype!=\"\"} * 100 < 3\nand\n  node_filesystem_readonly{job=\"node\",fstype!=\"\
      \"} == 0\n)\n"
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up.
      summary: Filesystem is predicted to run out of inodes within the next 24 hours.
    expr: "(\n  node_filesystem_files_free{job=\"node\",fstype!=\"\"} / node_filesystem_files{job=\"\
      node\",fstype!=\"\"} * 100 < 40\nand\n  predict_linear(node_filesystem_files_free{job=\"\
      node\",fstype!=\"\"}[6h], 24*60*60) < 0\nand\n  node_filesystem_readonly{job=\"\
      node\",fstype!=\"\"} == 0\n)\n"
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemFilesFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.
      summary: Filesystem is predicted to run out of inodes within the next 4 hours.
    expr: "(\n  node_filesystem_files_free{job=\"node\",fstype!=\"\"} / node_filesystem_files{job=\"\
      node\",fstype!=\"\"} * 100 < 20\nand\n  predict_linear(node_filesystem_files_free{job=\"\
      node\",fstype!=\"\"}[6h], 4*60*60) < 0\nand\n  node_filesystem_readonly{job=\"\
      node\",fstype!=\"\"} == 0\n)\n"
    for: 1h
    labels:
      severity: critical
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 5% inodes left.
    expr: "(\n  node_filesystem_files_free{job=\"node\",fstype!=\"\"} / node_filesystem_files{job=\"\
      node\",fstype!=\"\"} * 100 < 5\nand\n  node_filesystem_readonly{job=\"node\",fstype!=\"\
      \"} == 0\n)\n"
    for: 1h
    labels:
      severity: warning
  - alert: NodeFilesystemAlmostOutOfFiles
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
        only {{ printf "%.2f" $value }}% available inodes left.
      summary: Filesystem has less than 3% inodes left.
    expr: "(\n  node_filesystem_files_free{job=\"node\",fstype!=\"\"} / node_filesystem_files{job=\"\
      node\",fstype!=\"\"} * 100 < 3\nand\n  node_filesystem_readonly{job=\"node\",fstype!=\"\
      \"} == 0\n)\n"
    for: 1h
    labels:
      severity: critical
  - alert: NodeNetworkReceiveErrs
    annotations:
      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
        {{ printf "%.0f" $value }} receive errors in the last two minutes.'
      summary: Network interface is reporting many receive errors.
    expr: 'increase(node_network_receive_errs_total[2m]) > 10

      '
    for: 1h
    labels:
      severity: warning
  - alert: NodeNetworkTransmitErrs
    annotations:
      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
        {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
      summary: Network interface is reporting many transmit errors.
    expr: 'increase(node_network_transmit_errs_total[2m]) > 10

      '
    for: 1h
    labels:
      severity: warning
  - alert: NodeHighNumberConntrackEntriesUsed
    annotations:
      description: '{{ $value | humanizePercentage }} of conntrack entries are used'
      summary: Number of conntrack entries is getting close to the limit
    expr: '(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75

      '
    labels:
      severity: warning
  - alert: NodeClockSkewDetected
    annotations:
      message: Clock on {{ $labels.instance }} is out of sync by more than 0.05s. Ensure
        NTP is configured correctly on this host.
      summary: Clock skew detected.
    expr: "(\n  node_timex_offset_seconds > 0.05\nand\n  deriv(node_timex_offset_seconds[5m])\
      \ >= 0\n)\nor\n(\n  node_timex_offset_seconds < -0.05\nand\n  deriv(node_timex_offset_seconds[5m])\
      \ <= 0\n)\n"
    for: 10m
    labels:
      severity: warning
  - alert: NodeClockNotSynchronising
    annotations:
      message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured
        on this host.
      summary: Clock not synchronising.
    expr: 'min_over_time(node_timex_sync_status[5m]) == 0

      '
    for: 10m
    labels:
      severity: warning
```      
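
The `NodeFilesystemSpaceFillingUp` and `NodeFilesystemFilesFillingUp` rules rely on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it. Here is a rough sketch of that calculation, assuming an ordinary least-squares fit with the intercept taken at the evaluation time. The function and the sample data are illustrative, not Prometheus's code.

```python
def predict_linear(samples, eval_time, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the fitted value seconds_ahead after eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Shift timestamps so that eval_time is 0; the intercept is then the fitted value "now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead


# Hypothetical filesystem: 10 GiB free six hours ago, losing 1 GiB per hour,
# sampled hourly over the 6h range used in the rules.
HOUR = 3600
samples = [(h * HOUR, 10.0 - h) for h in range(7)]
now = 6 * HOUR

in_2h = predict_linear(samples, now, 2 * HOUR)
in_24h = predict_linear(samples, now, 24 * HOUR)
print(f"predicted free space in 2h:  {in_2h:.2f} GiB")
print(f"predicted free space in 24h: {in_24h:.2f} GiB")
print("24h prediction below zero:", in_24h < 0)
```

In the rules, a negative prediction is only one of three conditions: the filesystem must also be below the free-space threshold and must not be read-only, and the whole expression has to hold for the `for:` duration before the alert fires.
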
Start Prometheus
```
systemctl daemon-reload
systemctl start prometheus
```

## Alertmanager
Deploy Alertmanager
```
vim /etc/alertmanager/alertmanager.yml 

#
# Ansible managed
#

global:
  resolve_timeout: 3m
  smtp_smarthost: 'smtp.cclinux.org:465'
  smtp_from: 'monitor@cclinux.org'
  smtp_auth_username: 'monitor@cclinux.org'
  smtp_auth_password: 'ThisIsPassword'
  smtp_require_tls: false
templates:
- '/etc/alertmanager/templates/*.tmpl'

receivers:
- name: 'email'
  email_configs:
  - to: 'awsomehan@cclinux.org'

route:
  group_by: [alertname]
  group_interval: 1m
  group_wait: 30s
  receiver: 'email'
  repeat_interval: 4h
```
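
The `route` block above batches alerts before notifying: alerts that share the same values for the `group_by` labels end up in one notification. The sketch below illustrates only that grouping step, using alert names from the rule files in this document. The function name `group_alerts` is hypothetical, and timing (`group_wait`, `group_interval`, `repeat_interval`) is deliberately left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Each bucket corresponds to one notification sent to the receiver.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical set of firing alerts.
firing = [
    {"labels": {"alertname": "InstanceDown", "instance": "192.168.69.126:9100", "job": "idp001"}},
    {"labels": {"alertname": "InstanceDown", "instance": "192.168.69.161:9100", "job": "idp002"}},
    {"labels": {"alertname": "EndpointDown", "instance": "http://example.com:8080", "job": "blackbox"}},
]

for key, members in group_alerts(firing, ["alertname"]).items():
    instances = [a["labels"]["instance"] for a in members]
    print(dict(key), "->", instances)
```

With `group_by: [alertname]`, the two `InstanceDown` alerts arrive in a single email, and `EndpointDown` arrives in its own.
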

## Grafana
Deploy Grafana
```
vim /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

yum install grafana -y

vim /etc/grafana/grafana.ini
app_mode = production
instance_name = grafana.cclinux.org.cn
[paths]
data = /var/lib/grafana
logs = /var/log/grafana
plugins = /var/lib/grafana/plugins
; datasources = conf/datasources
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.cclinux.org.cn
root_url = http://0.0.0.0:3000
protocol = http
enforce_domain = False
socket = 
cert_key = 
cert_file = 
enable_gzip = False
static_root_path = public
router_logging = False
serve_from_sub_path = False
[database]
type = sqlite3
[remote_cache]
[security]
admin_user = admin
admin_password = circlelinux
[users]
allow_sign_up = False
auto_assign_org_role = Viewer
default_theme = dark
[emails]
welcome_email_on_sign_up = False
[analytics]
reporting_enabled = "True"
[dashboards]
versions_to_keep = 20
[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards
[alerting]
enabled = true
execute_alerts = True
[log]
mode = console, file
level = info
[grafana_com]
url = https://grafana.com


systemctl daemon-reload
systemctl start grafana-server
systemctl status grafana-server
systemctl enable grafana-server
```
Log in to Grafana at http://grafana.cclinux.org:3000

[Download a Grafana dashboard](https://grafana.com/grafana/dashboards) and import it into Grafana.
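
Dashboards get their data through the Prometheus HTTP API, so it helps to know what a query and its response look like when a panel stays empty. The sketch below builds an instant-query URL and parses a response without touching the network. The base URL `http://localhost:9090` is an assumption, and `SAMPLE_RESPONSE` is a hypothetical payload in the shape the API returns for an instant vector.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(body):
    """Extract (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each value is a [timestamp, "string value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical response for the query 'up'.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "job": "prometheus",
                        "instance": "grafana.cclinux.org.cn:9090"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "gitlab",
                        "instance": "192.168.69.125:9100"},
             "value": [1700000000.0, "0"]}
          ]}}
"""

print(instant_query_url("http://localhost:9090", "up"))
for labels, value in parse_vector(SAMPLE_RESPONSE):
    state = "up" if value == 1 else "down"
    print(labels["job"], labels["instance"], state)
```
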
## Blackbox exporter
In this setup the Blackbox exporter is installed on the Prometheus node, since the scrape configuration points at 127.0.0.1:9115. It can probe endpoints over HTTP, HTTPS, DNS, and ICMP.

Configure the Blackbox exporter
```
vim /etc/systemd/system/blackbox_exporter.service 
[Unit]
Description=Prometheus Blackbox Exporter
After=network-online.target

[Service]
Type=simple
User=blackbox-exp
Group=blackbox-exp
ExecStart=/usr/local/bin/blackbox_exporter \
    --config.file=/etc/blackbox_exporter/blackbox.yml \
    --web.listen-address=0.0.0.0:9115

SyslogIdentifier=blackbox_exporter
Restart=always
RestartSec=1
StartLimitInterval=0

ProtectHome=yes
NoNewPrivileges=yes

ProtectSystem=strict
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=yes

[Install]
WantedBy=multi-user.target
```
Start the Blackbox exporter
```
systemctl start blackbox_exporter
systemctl enable blackbox_exporter
```
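
Once the exporter is running, a probe can be triggered by hand by requesting the same URL Prometheus would scrape. As a small sketch, the helper below (a hypothetical name, `probe_url`) builds that URL for the exporter address and target used in this document. Fetching it with a browser or an HTTP client should return metrics including `probe_success`.

```python
from urllib.parse import urlencode


def probe_url(exporter_address, target, module="http_2xx"):
    """Build the /probe URL that asks the Blackbox exporter to check one target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}/probe?{query}"


print(probe_url("127.0.0.1:9115", "http://example.com:8080"))
```
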
Configure the Prometheus rule
```
vim /etc/prometheus/rules/blackbox.rules 
groups:
- name: blackbox.rules
  rules:
  - alert: EndpointDown
    expr: probe_success == 0
    for: 10s
    labels:
      severity: "critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"

chown root:prometheus /etc/prometheus/rules/blackbox.rules
```
Restart Prometheus
```
systemctl restart prometheus
```
## References
- https://prometheus.io/docs/introduction/overview/
- https://grafana.com/docs/