Quickly Implementing Anomaly Inspection in SLS

2021-10-03 21:29:15

I. Survey of Related Algorithms

1.1 Common open-source algorithms

Yahoo: EGADS
Facebook: Prophet
Baidu: Opprentice
Twitter: Anomaly Detection
Red Hat: Hawkular
Alibaba + Tsinghua: Donut
Tencent: Metis
Numenta: HTM
CMU: SPIRIT
Microsoft: YADING
LinkedIn: an improved version of SAX
Netflix: Argos
NEC: CloudSeer
NEC + Ant: LogLens
Moogsoft: a startup whose work in this area is quite good; listed here for reference

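1.2 Anomaly detection based on statistical methods

Statistical methods evaluate time series data against different indicators (mean, variance, divergence, kurtosis, and so on) and raise alerts against thresholds set from operational experience. Historical data can also be brought in through period-over-period and same-period comparisons, again with experience-based thresholds for alerting.
Different statistical indicators cover different anomaly shapes: changes in the window mean and window variance handle the anomalies shown in panels (1, 2, 5) of the figure below; local extrema detect the spike in panel (4); and a time series forecasting model captures the trend in panels (3, 6), so points that break the pattern can be flagged.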
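As a minimal sketch of the window-mean check described above, the following Python compares the mean of the points just before each index with the mean of the points just after it. The function name `window_mean_shifts`, the window size, and the threshold rule (a multiple of the left window's standard deviation) are illustrative assumptions; the original does not prescribe them.

```python
from statistics import mean, pstdev


def window_mean_shifts(values, window=5, threshold=3.0):
    """Return indices where the mean of the next `window` points differs from
    the mean of the previous `window` points by more than `threshold` times
    the standard deviation of the previous window."""
    hits = []
    for i in range(window, len(values) - window + 1):
        left = values[i - window:i]
        right = values[i:i + window]
        # Fall back to a tiny spread so a perfectly flat window does not
        # produce a zero threshold.
        spread = pstdev(left) or 1e-9
        if abs(mean(right) - mean(left)) > threshold * spread:
            hits.append(i)
    return hits


# Hypothetical metric that jumps from about 10 to about 20 at index 7.
series = [10, 11, 10, 9, 10, 11, 10, 20, 21, 20, 19, 20, 21, 20]
print(window_mean_shifts(series))
```

Because the right-hand window already overlaps the jump a few points early, indices shortly before the true change can be reported as well; in practice the hits would be merged into one alert.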

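How do we decide whether a point is anomalous?

N-sigma
Boxplot
Grubbs' test
Extreme Studentized Deviate (ESD) test

Notes:

N-sigma: in a normal distribution, 99.73% of the data lies within three standard deviations of the mean. If the data follows a known distribution, the probability of observing the current value can be inferred from the distribution curve.
Grubbs' test: a hypothesis test commonly used to detect a single outlier in a normally distributed data set.
ESD test: a hypothesis test that extends Grubbs' test to the detection of k outliers.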
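The first two rules in the list can be sketched in a few lines of standard-library Python. This is an illustration under assumptions: the function names, the default `n=3`, and the conventional boxplot factor `k=1.5` are choices made here, and the sample data is made up.

```python
from statistics import mean, pstdev, quantiles


def n_sigma_outliers(values, n=3.0):
    """Indices of points farther than n standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > n * sigma]


def boxplot_outliers(values, k=1.5):
    """Indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical metric: 20 ordinary points followed by one extreme value.
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
        11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
print(n_sigma_outliers(data), boxplot_outliers(data))
```

Note that the mean and standard deviation are themselves pulled by the outlier, so N-sigma can miss anomalies in short windows; the quartile-based boxplot rule is less affected by this.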

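1.3 Anomaly detection based on unsupervised methods

What is an unsupervised method? Whether learning is supervised depends on whether the data being modeled carries labels. If the input data is labeled, it is supervised learning; if not, it is unsupervised learning.
Why introduce unsupervised methods? In the early stage of building a monitoring system, user feedback is scarce and valuable. Unsupervised methods make it possible to establish a reliable monitoring strategy quickly without that feedback.
For single-dimensional metrics

Regression methods (Holt-Winters, ARMA) learn a predicted series from the original observed series; anomalies are then found by analyzing the residual between the two.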
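The predict-then-inspect-residuals idea can be sketched as follows. To stay short, this example uses Holt's linear (double) exponential smoothing, which is Holt-Winters without the seasonal term, and it is not ARMA; the smoothing parameters and the 3-sigma rule on the residuals are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def holt_forecasts(values, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from Holt's linear exponential smoothing."""
    level = values[0]
    trend = values[1] - values[0]
    forecasts = [values[0]]  # no history exists for the first point
    for v in values[1:]:
        forecasts.append(level + trend)  # predict before observing v
        new_level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def residual_anomalies(values, n=3.0, alpha=0.5, beta=0.3):
    """Indices whose forecast residual deviates more than n sigma."""
    forecasts = holt_forecasts(values, alpha, beta)
    residuals = [v - f for v, f in zip(values, forecasts)]
    mu, sigma = mean(residuals), pstdev(residuals)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r - mu) > n * sigma]


# Hypothetical metric: a steady upward trend with one spike at index 20.
observed = [float(i) for i in range(30)]
observed[20] += 15.0
print(residual_anomalies(observed))
```

A plain threshold on the raw values would not work here because the series trends upward; working on residuals removes the trend first.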
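For multi-dimensional metrics
- What multi-dimensional means: each record carries several fields (time, cpu, iops, flow)
- iForest (Isolation Forest) is an ensemble-based anomaly detection method
  - Suited to continuous data, with linear time complexity and high accuracy
  - Definition of an anomaly: an outlier that is easy to isolate, that is, a point in a sparse region far from high-density clusters
- Some remarks
  - The more trees, the more stable the result; the trees are mutually independent, so the method can be deployed on large-scale distributed systems
  - The algorithm is not well suited to very high-dimensional data, because noise dimensions and sensitive dimensions cannot be actively filtered out
  - The original iForest algorithm is sensitive only to global outliers and is less sensitive to points that are only locally sparse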
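To make the isolation idea concrete, here is a compact, from-scratch sketch of an isolation forest in standard-library Python. It follows the usual recipe (random split dimension, random split value, score from the average path length), but it is a teaching sketch with assumed defaults (`n_trees=100`, `sample_size=64`) and made-up (cpu, iops, flow) data, not a production implementation.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful search in a BST of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, point, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def isolation_scores(points, n_trees=100, sample_size=64, seed=0):
    """Anomaly score in (0, 1); higher means easier to isolate."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(t, p) for t in trees) / n_trees / norm)
            for p in points]


# Hypothetical (cpu, iops, flow) records: a dense cluster plus one far point.
gen = random.Random(42)
data = [(gen.gauss(50, 5), gen.gauss(200, 20), gen.gauss(30, 3))
        for _ in range(100)]
data.append((95.0, 800.0, 90.0))
scores = isolation_scores(data)
print(max(range(len(data)), key=scores.__getitem__))
```

Each tree is built from its own random sample and never looks at the others, which is why the trees can be built and evaluated in parallel.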

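1.4 Anomaly detection based on deep learning

Paper: "Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications" (WWW 2018)

Problem addressed: seasonal time series monitoring data that contains some missing points and anomalous points
The training architecture of the model is shown below
At detection time, MCMC imputation is used to handle the known missing points in the observation window. The core idea is to use the already trained model to iteratively approximate the marginal distribution (the figure below illustrates one iteration of MCMC imputation)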
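The control flow of the imputation loop can be sketched without a real model. In the sketch below, `reconstruct` stands for "encode the window and decode it again with the trained model"; the `neighbor_mean_reconstruct` function is a hypothetical, deterministic stand-in used only so the example runs, whereas the paper's procedure samples from a trained VAE. Only the missing positions are overwritten on each iteration; observed points keep their original values.

```python
def mcmc_impute(window, missing, reconstruct, iterations=10):
    """Iteratively replace the missing positions with the model's
    reconstruction while leaving the observed positions untouched."""
    x = list(window)
    for _ in range(iterations):
        x_rec = reconstruct(x)
        for i in missing:
            x[i] = x_rec[i]
    return x


def neighbor_mean_reconstruct(x):
    """Hypothetical stand-in for a trained model: each point is rebuilt
    as the mean of itself and its immediate neighbors."""
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - 1), min(len(x), i + 2)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


# Index 2 is a known missing point, filled with a placeholder of 0.0.
window = [1.0, 1.0, 0.0, 1.0, 1.0]
print(mcmc_impute(window, missing=[2], reconstruct=neighbor_mean_reconstruct))
```

With each pass the placeholder moves closer to a value consistent with its observed neighbors, which is the intuition behind iterating rather than reconstructing once.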

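1.5 Anomaly detection using supervised methods

Why is labeling anomalies itself so complicated?
- Users usually define anomalies from the system or service perspective when labeling data. The underlying metrics and trace metrics involved are numerous and tangled, so a label cannot be tied to just a few dimensions (it is closer to a snapshot of the whole system)
- Architectures are designed with service self-healing, so an anomaly at a lower layer often does not affect the business layer above
- Tracing an anomaly to its source is complicated; in many cases a single monitoring metric only reflects the result of the anomaly, not the anomaly itself
- Labeled samples are few and anomaly types are diverse; learning from such small samples still needs improvement
Commonly used supervised machine learning methods
- XGBoost, GBDT, LightGBM, and similar
- DNN-based classification networks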

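II. Algorithm Capabilities Provided by SLS

Time series analysis
- Forecasting: fit a baseline from historical data
- Anomaly detection, change point detection, turning point detection: find anomalous points
- Multi-period detection: discover periodic patterns in data access
- Time series clustering: find series whose shape differs from the others
Pattern analysis
- Frequent pattern mining
- Differential pattern mining
Intelligent clustering of massive text
- Supports logs in any format: Log4J, JSON, single-line (syslog)
- Logs can be filtered by arbitrary conditions before Reduce; for a pattern produced by Reduce, the raw data can be looked up by its signature
- Comparison of patterns across different time ranges
- Dynamically adjustable Reduce precision
- Results in seconds over hundreds of millions of records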

III. Hands-on Analysis of a Traffic Scenario

3.1 Visualizing multi-dimensional monitoring metrics

The SQL logic is as follows:

* | 
select
   time,
   buffer_cnt,
   log_cnt,
   buffer_rate,
   failed_cnt,
   first_play_cnt,
   fail_rate 
from
   (
      select
         date_trunc('minute', time) as time,
         sum(buffer_cnt) as buffer_cnt,
         sum(log_cnt) as log_cnt,
         case
            when
               is_nan(sum(buffer_cnt)*1.0 / sum(log_cnt)) 
            then
               0.0 
            else
               sum(buffer_cnt)*1.0 / sum(log_cnt) 
         end as buffer_rate, 
         sum(failed_cnt) as failed_cnt,
         sum(first_play_cnt) as first_play_cnt,
         case
            when
               is_nan(sum(failed_cnt)*1.0 / sum(first_play_cnt)) 
            then
               0.0 
            else
               sum(failed_cnt)*1.0 / sum(first_play_cnt) 
         end as fail_rate 
      from
         log 
      group by
         time 
      order by
         time
   )
   limit 100000

3.2 Period-over-period time series charts for each metric

The SQL logic is as follows:

* |
select 
    time,
    log_cnt_cmp[1] as log_cnt_now,
    log_cnt_cmp[2] as log_cnt_old,
    case when is_nan(buffer_rate_cmp[1]) then 0.0 else buffer_rate_cmp[1] end as buf_rate_now,
    case when is_nan(buffer_rate_cmp[2]) then 0.0 else buffer_rate_cmp[2] end as buf_rate_old,
    case when is_nan(fail_rate_cmp[1]) then 0.0 else fail_rate_cmp[1] end as fail_rate_now,
    case when is_nan(fail_rate_cmp[2]) then 0.0 else fail_rate_cmp[2] end as fail_rate_old
from
(
    select 
        time, 
        ts_compare(log_cnt, 86400) as log_cnt_cmp,
        ts_compare(buffer_rate, 86400) as buffer_rate_cmp,
        ts_compare(fail_rate, 86400) as fail_rate_cmp
    from (
        select 
            date_trunc('minute', time - time % 120) as time, 
            sum(buffer_cnt) as buffer_cnt, 
            sum(log_cnt) as log_cnt, 
            sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate, 
            sum(failed_cnt) as failed_cnt,  
            sum(first_play_cnt) as first_play_cnt,
            sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate
        from log 
        group by time 
        order by time
    ) 
    group by time
)
where time is not null 
limit 1000000
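For readers who want to reason about this comparison outside SLS, here is a rough Python sketch of the same idea. It assumes only what the query itself shows: each point is paired with the value observed 86400 seconds earlier (the query reads element [1] as the current value and [2] as the earlier one), and undefined ratios are replaced with 0.0. The function names and sample data are hypothetical, and the sketch does not claim to reproduce the full behavior of `ts_compare`.

```python
def safe_rate(numerator, denominator):
    """Ratio that falls back to 0.0 when the denominator is zero,
    mirroring the is_nan(...) guard used in the queries."""
    return numerator / denominator if denominator else 0.0


def compare_with_offset(series, offset_seconds=86400):
    """Pair each timestamp with the value seen offset_seconds earlier.

    series maps a unix timestamp to a value; timestamps without an
    earlier counterpart are skipped."""
    rows = []
    for ts in sorted(series):
        earlier = series.get(ts - offset_seconds)
        if earlier is not None:
            rows.append((ts, series[ts], earlier))
    return rows


# Hypothetical fail_rate values in 120-second buckets across two days.
fail_rate = {
    0: safe_rate(2, 100),
    120: safe_rate(0, 0),      # no plays in this bucket
    86400: safe_rate(9, 100),
    86520: safe_rate(1, 100),
}
for ts, now, old in compare_with_offset(fail_rate):
    print(ts, now, old)
```

Plotting `now` against `old` on the same time axis gives the kind of overlay chart this section describes.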

3.3 Dynamic visualization of each metric

The SQL logic is as follows:

* | 
select 
    time, 
    case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index,
    isp as index
from
(select 
    date_trunc('minute', time) as time, 
    sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
    sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate,
    sum(log_cnt) as log_cnt,
    sum(failed_cnt) as failed_cnt,
    sum(first_play_cnt) as first_play_cnt,
    isp
from log group by time, isp order by time) limit 200000

3.4 Monitoring dashboard page for the anomaly set

The SQL logic behind the charts of the anomaly monitoring project

* | 
select 
    res.name 
from ( 
    select 
        ts_anomaly_filter(province, res[1], res[2], res[3], res[6], 100, 0) as res 
    from ( 
        select 
            t1.province as province, 
            array_transpose( ts_predicate_arma(t1.time, t1.show_index, 5, 1, 1) ) as res 
        from ( 
            select
                province,
                time,
                case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index
            from (
                select 
                    province, 
                    time, 
                    sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate, 
                    sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate, 
                    sum(log_cnt) as log_cnt, 
                    sum(failed_cnt) as failed_cnt, 
                    sum(first_play_cnt) as first_play_cnt
                from log 
                group by province, time) ) t1 
            inner join ( 
                select 
                    DISTINCT province 
                from  ( 
                    select 
                        province, time, sum(log_cnt) as total 
                    from log 
                    group by province, time ) 
                where total > 200 ) t2 on t1.province = t2.province  
        group by t1.province ) ) limit 100000

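Detailed analysis of the SQL logic above

For a detailed walkthrough of the SQL syntax, see the earlier article: SLS Machine Learning Best Practices: Batch Time Series Anomaly Detection (SLS机器学习最佳实战：批量时序异常检测)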
