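Preface
My company has recently been running an internal skills-stretch initiative. I had read articles about SkyWalking before but had never set it up locally, so I took this opportunity to try it out as a beginner's exercise.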
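What is SkyWalking
SkyWalking is an open-source APM (application performance monitoring) tool that originated in China, designed for microservice, cloud-native, and container-based architectures.
It provides an integrated solution covering distributed tracing, application and service dependency analysis, service mesh telemetry analysis, metric aggregation, and visualization.
Main features
- Metric visualization
- Application dependency topology
- Distributed call tracing
- Metric calculation and analysis
- Trace log querying
- Service alerting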
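The architecture diagram on the official site is fairly abstract, so after working through it I drew my own version.
It is not pretty, but it is clear. A SkyWalking deployment really has only four parts:
1-Attach the agent (probe) to the application
2-The agent pushes the application's monitoring data to the OAP service
3-The OAP service processes and analyzes the incoming data, then persists it to storage
4-The web UI visualizes the data for analysis
That is the overall background; for details, see the official SkyWalking site.
Now let's get it running on Windows!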
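Downloading and starting Elasticsearch
This exercise needs a data store that both the demo application and SkyWalking support, so I chose Elasticsearch.
Download Elasticsearch
Download the Windows build. The latest version at the time of writing is 7.14.1. I like running the latest release, so that is what I used here (Elasticsearch has plenty of version-compatibility pitfalls, so if you do not share my obsession, pick whichever version suits you).
Start Elasticsearch
Open PowerShell and run bin\elasticsearch.bat (on Linux or macOS, bin/elasticsearch).
Once the console shows no errors, open http://localhost:9200 in a browser.
That takes care of storage!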
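The browser check at http://localhost:9200 can also be scripted. The following is a minimal sketch, assuming the root endpoint returns the usual JSON document with `cluster_name` and `version.number` fields and that no authentication is enabled; the function names are my own and not part of any Elasticsearch client.

```python
import json
import urllib.request


def parse_es_info(body: str) -> dict:
    """Extract the cluster name and version from the JSON served at the root URL."""
    info = json.loads(body)
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_elasticsearch(url: str = "http://localhost:9200", timeout: float = 5.0) -> dict:
    """Fetch the root endpoint and summarize it; raises if the node is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_es_info(resp.read().decode("utf-8"))
```

Calling `check_elasticsearch()` while the node is running should return a small dictionary; a connection error suggests Elasticsearch has not finished starting or is listening elsewhere.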
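Downloading and starting SkyWalking
Download SkyWalking
Download it from a mirror site.
As before, I prefer the latest version, which is currently 8.7.0; for other versions, see the archived releases download page.
After downloading, unpack the archive (files are omitted because there are too many; only directories are shown).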
├─agent
│ ├─activations
│ ├─bootstrap-plugins
│ ├─config
│ ├─logs
│ ├─optional-plugins
│ ├─optional-reporter-plugins
│ └─plugins
├─bin
├─config
│ ├─envoy-metrics-rules
│ ├─fetcher-prom-rules
│ ├─lal
│ ├─log-mal-rules
│ ├─meter-analyzer-config
│ ├─oal
│ ├─otel-oc-rules
│ ├─ui-initialized-templates
│ └─zabbix-rules
├─config-examples
├─licenses
│ └─ui-licenses
├─oap-libs
├─tools
│ └─profile-exporter
└─webapp
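Directory structure
bin holds the startup scripts, such as oapService.sh and webappService.sh (on Windows, oapService.bat and webappService.bat).
config holds the OAP service configuration, including application.yml.
agent is the SkyWalking agent, which is attached to the business application and collects its monitoring data.
webapp holds the configuration for the SkyWalking web UI service.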
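Before going further it can help to confirm that the archive unpacked completely. This is a small sketch that checks for the top-level directories used in this walkthrough; the helper name and the choice of which directories to require are my own assumptions.

```python
from pathlib import Path

# Top-level directories this walkthrough relies on (taken from the listing above).
EXPECTED_DIRS = ("agent", "bin", "config", "oap-libs", "webapp")


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent under root."""
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]
```

An empty list means every expected directory is present under the unpacked SkyWalking folder.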
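Starting SkyWalking
Start the SkyWalking OAP service
Before starting, adjust the configuration file.
The config directory contains application.yml; the main change is to switch the storage selector to elasticsearch7.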
cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
  ...
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"my-application"}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    connectTimeout: ${SW_STORAGE_ES_CONNECT_TIMEOUT:500}
    socketTimeout: ${SW_STORAGE_ES_SOCKET_TIMEOUT:30000}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:1} # Represent the number of days in the one minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replicas number of new indexes
    # Super data set has been defined in the codes, such as trace segments. The following 3 configs improve ES performance when storing super size data in ES.
    superDatasetDayStep: ${SW_SUPERDATASET_STORAGE_DAY_STEP:-1} # Represent the number of days in the super size dataset record index, the default value is the same as dayStep when the value is less than 0
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor affects Zipkin and Jaeger traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # Represent the replicas number in the super size dataset record index, the default value is 0.
    indexTemplateOrder: ${SW_STORAGE_ES_INDEX_TEMPLATE_ORDER:0} # the order of index template
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:5000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    # flush the bulk every flushInterval seconds (15 by default) whatever the number of requests
    # INT(flushInterval * 2/3) would be used for index refresh period.
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:15}
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # the oap analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # the oap log analyzer. It could be customized by the ES analyzer configuration to support more language log formats, such as Chinese log, Japanese log and etc.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
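Most values in application.yml use the form `${NAME:default}`: the environment variable `NAME` wins if it is set, otherwise the text after the first colon is used. The sketch below illustrates that rule for simple values only. It is my own approximation, not SkyWalking's actual resolver, and it deliberately does not handle defaults that themselves contain a closing brace (such as the `oapAnalyzer` JSON).

```python
import os
import re

# Matches ${NAME:default} where the default contains no closing brace.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+):([^}]*)\}")


def resolve(value, env=None):
    """Replace each ${NAME:default} with env[NAME] if set, otherwise with the default."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)
```

For example, `resolve("${SW_STORAGE:elasticsearch7}", {})` yields `elasticsearch7`, which is why editing the default in the file is enough when no environment variable is set.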
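Open PowerShell and change to SkyWalking's bin directory.
Run .\oapService.bat
A successful start looks like the screenshot below.
Start the SkyWalking webapp
As before, adjust the configuration first; the file is webapp.yml under the webapp directory.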
server:
  port: 8080
spring:
  cloud:
    gateway:
      routes:
        - id: oap-route
          uri: lb://oap-service
          predicates:
            - Path=/graphql/**
    discovery:
      client:
        simple:
          instances:
            oap-service:
              - uri: http://127.0.0.1:12800
            # - uri: http://<oap-host-1>:<oap-port1>
            # - uri: http://<oap-host-2>:<oap-port2>
  mvc:
    throw-exception-if-no-handler-found: true
  web:
    resources:
      add-mappings: true
management:
  server:
    base-path: /manage
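Open another PowerShell window, again in the bin directory.
Run .\webappService.bat
A successful start looks like the screenshot below.
Open http://localhost:8080/ (the SkyWalking UI was just configured to listen on port 8080, so make sure the application services started later do not use the same port).
Because no application has been started yet, the page shows no registered services.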
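Since several components share the machine (Elasticsearch on 9200, the OAP service on 11800 and 12800, the UI on 8080, and later the demo services), a quick port probe helps rule out conflicts before starting anything. This is a minimal sketch using only the standard library; the function name is my own.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

For instance, `port_in_use(8080)` should be true once the webapp is running, and a port intended for a demo service should report false before that service starts.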
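Attaching the SkyWalking agent to the application
Copy the agent directory from the SkyWalking distribution into the demo application (the demo project is provided here directly),
then edit agent/config/agent.config: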
# The service name in UI
agent.service_name=${SW_AGENT_NAME:skyWalking-demo}
# Backend service addresses.
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}
# Logging file_name
logging.file_name=${SW_LOGGING_FILE_NAME:skywalking-api.log}
# Logging level
logging.level=${SW_LOGGING_LEVEL:DEBUG}
# Mount the specific folders of the plugins. Plugins in mounted folders would work.
plugin.mount=${SW_MOUNT_FOLDERS:plugins,activations}
Demo application directory structure
Configuration file
server:
  port: 8500
spring:
  swagger:
    enabled: true
    title: elasticsearch-study\u7CFB\u7EDF
    description: skywalking-demo\u7CFB\u7EDF
    version: v1.0
    host: http://localhost:8500/swagger-ui.html
    terms-of-service-url: http://qrainly.top/
    contact:
      name: bj
  auto:
    openurl: true
  web:
    loginurl: http://localhost:8500/swagger-ui.html
    googleexcute: C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe
  elasticsearch:
    rest:
      uris: localhost:9200
      connection-timeout: 10s
      #username:
      #password:
logging:
  level:
    org.springframework.data.elasticsearch.core: debug
# -javaagent:D:\v_liuwen\code\skywalking-demo\agent\agent\skywalking-agent.jar
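The rest of the code will be posted later.
Add the agent argument to the JVM options of the project's main class launch configuration: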
-javaagent:D:\v_liuwen\code\skywalking-demo\agent\skywalking-agent.jar
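Start two instances of the demo service locally, one on port 8500 and the other on 8501.
Call the [query all data] endpoint several times, then check the SkyWalking UI.
The services are now registered with SkyWalking.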
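When launching from a terminal instead of the IDE, the same -javaagent argument can be assembled into a command line per instance. The sketch below is only illustrative: the jar name `skywalking-demo.jar` is hypothetical, and it relies on the agent's documented convention that a system property named `skywalking.` plus a key from agent.config overrides that key.

```python
def agent_launch_command(agent_jar, app_jar, service_name, port,
                         backend="127.0.0.1:11800"):
    """Build a java command line that attaches the SkyWalking agent to one instance."""
    return [
        "java",
        f"-javaagent:{agent_jar}",
        # System properties prefixed with "skywalking." override agent.config keys.
        f"-Dskywalking.agent.service_name={service_name}",
        f"-Dskywalking.collector.backend_service={backend}",
        "-jar", app_jar,              # hypothetical jar name supplied by the caller
        f"--server.port={port}",      # Spring Boot command-line property
    ]
```

Calling it twice with ports 8500 and 8501 gives the two commands for the two local instances; each list can be passed to `subprocess.run` or joined into a single shell line.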
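Dashboard
Topology
The topology view shows the dependencies between services.
Trace
The full call chain of the /all endpoint called earlier is displayed, which makes it easy to analyze the trace visually.
Profiling
This module requires creating a profiling task first, so I will not demo it here.
Log
I only ran a single service locally with no cross-service calls, so I did not produce any logs here either.
Alarm
Alarms require a configuration file:
config/alarm-settings.yml under the SkyWalking directory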
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 20
    period: 1
    count: 3
    silence-period: 1
    message: Response time of service {name} is more than 20ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
  - http://localhost:8031/skywalking/alarm/pushData
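To get a feel for how `op`, `threshold`, `period`, and `count` interact, here is a simplified model of one rule check: look at the last `period` per-minute values and fire when at least `count` of them satisfy the comparison. This is my own approximation for illustration; it ignores `silence-period` and multi-value thresholds, and it is not SkyWalking's actual implementation.

```python
import operator

# Comparison operators used by the rules above.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def should_alarm(values, op, threshold, period, count):
    """Decide whether a rule fires; values holds one metric per minute, oldest first."""
    window = values[-period:]  # only the most recent `period` minutes count
    matches = sum(1 for v in window if _OPS[op](v, threshold))
    return matches >= count
```

With the settings of `service_instance_resp_time_rule` (threshold 1000, period 10, count 2), two slow minutes within the last ten are enough to fire, while a single slow minute is not. For `service_sla_rule` the metric is expressed in units of 1/10000, so a threshold of 8000 corresponds to the 80% in its message.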
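To trigger an alarm, I deliberately delayed a request by pausing at a breakpoint.
Alarms can also be configured to push directly to DingTalk and other platforms.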
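The webhook configured in alarm-settings.yml receives a JSON array of alarm objects via HTTP POST. The sketch below turns such a body into one readable line per alarm, which could then be forwarded to a chat tool. The field names `ruleName`, `name`, and `alarmMessage` are what I recall from the webhook documentation and should be verified against the version you deploy; the function name is my own.

```python
import json


def summarize_alarms(body):
    """Turn a webhook request body (JSON array of alarms) into one text line per alarm."""
    lines = []
    for alarm in json.loads(body):
        rule = alarm.get("ruleName", "?")
        target = alarm.get("name", "?")
        message = alarm.get("alarmMessage", "")
        lines.append(f"[{rule}] {target}: {message}")
    return lines
```

A receiving endpoint such as the /skywalking/alarm/pushData path above would read the request body and pass it to this function before logging or relaying the result.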
That is all for this exercise; I will share more as I explore further.