Concept Drift: a Python Summary

Concept Drift

For my master's thesis I have so far skimmed roughly a dozen papers, which gave me a basic understanding of concept drift (hereafter CD).
A few points of my understanding first:

  1. CD means that the concept behind the dataset, i.e. the cause generating the data, has changed.
  • For example, users suddenly stop buying one of your products because the product ran into a reputation problem.
  • Or, a more concrete example using the synthetic STAGGER data:
    • there are 3 attributes: size, color and shape,
    • and 3 corresponding rules: red ∧ small = true (rule1), green ∨ circular = true (rule2), medium ∨ large = true (rule3);
    • from these 3 rules we can construct a synthetic dataset, e.g. 300 instances under rule1, then 300 under rule2 and 300 under rule3, where the transitions 1→2 and 2→3 are abrupt changes.
  2. CD comes in 5 flavors: abrupt, gradual, incremental, recurrent and blip; see the figure below:
    (figure omitted: illustration of the five drift types)

  3. Once CD occurs, what we want to do is: detect it + adapt our model.

  • Many papers focus on detection; for adaptation, simply retraining the model is usually considered enough.
    • Detection family 1: watch the distribution of the data itself, i.e. monitor parameters such as the mean and std of the inputs, and set thresholds (warning level, decision level) to decide whether the distribution really changed. ADWIN, for example, uses an adaptive sliding window to control the size of the input it compares.
    • Detection family 2: watch the estimator's error instead; if the error suddenly spikes, a CD is likely, and again thresholds decide when to raise the alarm (a minimal DDM-style sketch of this two-threshold idea follows this list).
  • Some papers combine detection and adaptation:
    • Ensembles: train one estimator per incoming data chunk and store it; each estimator carries a weight that is dynamically re-adjusted whenever a new estimator arrives, so an old estimator can regain influence later, which handles the recurrent case (Learn++; a toy sketch follows this list).
    • Decision trees (DT): thanks to the tree structure a DT can be pruned and is well interpretable, so modifying the tree is a comparatively simple way to adapt the model (Hoeffding Tree).
  4. CD matters mainly in incremental-training scenarios.
  • My rough understanding of online/incremental training is chunk-wise training with "batch size = 1" at the chunk level: the model is updated once per arriving data chunk, and each chunk contains many instances (say 1000) so that a single update is actually worthwhile (the prequential sketch in the STAGGER section below shows this with sklearn's partial_fit).
  5. Main Python framework: River (the merger of scikit-multiflow and creme; a minimal usage sketch follows this list).

  6. Main Java framework: MOA (from the University of Waikato) --> something to study properly later.
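
To make the two-threshold idea of detection family 2 concrete, here is a minimal DDM-style sketch (after Gama et al.'s Drift Detection Method). The class name and the 30-sample burn-in are my own choices; the 2σ warning and 3σ drift levels are the standard DDM settings. Treat it as an illustration, not a reference implementation.

import math

class DDMSketch:
  """Track the running error rate p and its std s = sqrt(p*(1-p)/n);
  remember the smallest p + s seen so far, then warn at p_min + 2*s_min
  and signal drift at p_min + 3*s_min (the two threshold levels)."""

  def __init__(self):
    self.n = 0
    self.p = 0.0                   # running error rate
    self.p_min = float("inf")
    self.s_min = float("inf")

  def update(self, error):
    # error: 1 if the model just misclassified an instance, else 0
    self.n += 1
    self.p += (error - self.p) / self.n           # incremental mean
    s = math.sqrt(self.p * (1 - self.p) / self.n)
    if self.n < 30:                               # burn-in, as in the original DDM
      return "stable"
    if self.p + s < self.p_min + self.s_min:      # new best operating point
      self.p_min, self.s_min = self.p, s
    if self.p + s > self.p_min + 3 * self.s_min:
      return "drift"
    if self.p + s > self.p_min + 2 * self.s_min:
      return "warning"
    return "stable"

# usage: state = detector.update(int(y_pred != y_true)) after every prediction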
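
And a toy version of the ensemble idea from item 3. This is not the full Learn++ algorithm: it just trains one GaussianNB per chunk, re-weights every stored member by its accuracy on the newest chunk (so an old member that matches a recurring concept automatically regains influence), and predicts by weighted vote over binary labels. Class and method names are illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

class ChunkEnsemble:
  """One estimator per data chunk; weights follow accuracy on the newest chunk."""

  def __init__(self):
    self.members = []   # fitted estimators, one per seen chunk
    self.weights = []   # one weight per member

  def add_chunk(self, X, y):
    self.members.append(GaussianNB().fit(X, y))
    # re-score every member on the newest chunk: a stored estimator that
    # fits a recurring concept gets its weight back automatically
    self.weights = [m.score(X, y) for m in self.members]

  def predict(self, X):
    votes = np.zeros(len(X))
    for m, w in zip(self.members, self.weights):
      votes += w * m.predict(X)   # binary labels in {0, 1}
    return (votes >= 0.5 * sum(self.weights)).astype(int)

# usage per chunk: ens.add_chunk(X_chunk, y_chunk); y_hat = ens.predict(X_next)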
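
Finally, a minimal River sketch for item 5. I am assuming a recent River release here: current versions expose drift detectors under river.drift with an update() method and a drift_detected flag, but the API has changed between versions, so check the docs of your installed release.

from river import drift

# feed a univariate stream into ADWIN, e.g. per-sample 0/1 errors
detector = drift.ADWIN()
stream = [0] * 500 + [1] * 500   # crude abrupt change at index 500
for i, x in enumerate(stream):
  detector.update(x)
  if detector.drift_detected:
    print(f"drift detected around index {i}")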

A simple Python implementation + training with sklearn

STAGGER: introduction

This dataset contains three nominal attributes, namely size = {small, medium, large}, color = {red, green}, and shape = {circular, non-circular}. Before the first drift point, instances are labeled positive if (color = red) ∧ (size = small). After this point and before the second drift, instances are classified positive if (color = green) ∨ (shape = circular), and finally after this second drift point, instances are classified positive only if (size = medium) ∨ (size = large).

from enum import Enum
import numpy as np

class Size(Enum):
  small = 0
  medium = 1
  large = 2

class Color(Enum):
  red = 0
  green = 1

class Shape(Enum):
  circular = 0
  noncircular = 1

def generate_stagger(size=1000):
  """Generate a STAGGER stream with two abrupt drift points.

  One data row looks like (0, 1, 1) with label 1, i.e.
  (Size.small, Color.green, Shape.noncircular) -> true.
  (With a small change the rows could carry the enum names instead of
  ints, but that only adds workload.)
  """
  rule1 = lambda si, co, sh: co == Color.red.value and si == Size.small.value
  rule2 = lambda si, co, sh: co == Color.green.value or sh == Shape.circular.value
  rule3 = lambda si, co, sh: si == Size.medium.value or si == Size.large.value

  np.random.seed(10)

  stagger_ = []
  ans_ = []
  # one chunk per rule: the concept changes abruptly between chunks
  for rule in [rule1, rule2, rule3]:
    size_ = np.random.randint(0, 3, size=size // 3)   # Size attribute
    color_ = np.random.randint(0, 2, size=size // 3)  # Color attribute
    shape_ = np.random.randint(0, 2, size=size // 3)  # Shape attribute

    # label is 1 where the current rule holds, 0 otherwise
    ans = [1 if rule(si, co, sh) else 0 for si, co, sh in zip(size_, color_, shape_)]

    stagger_.extend(zip(size_, color_, shape_))
    ans_.extend(ans)

  # bug fix: return the full label list, not just the last chunk's labels
  return stagger_, ans_

Test result with GaussianNB under plain batch training: the accuracy visibly drops at around instance 300.
(figure omitted: GaussianNB accuracy curve on the STAGGER stream)
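
For reference, here is a minimal sketch of how such a curve can be produced with prequential (test-then-train) evaluation, using generate_stagger from above and sklearn's partial_fit; the size=900 choice (drifts at ~300 and ~600) and the window size are my own assumptions.

import numpy as np
from sklearn.naive_bayes import GaussianNB

X, y = generate_stagger(size=900)            # drifts at ~300 and ~600
X, y = np.array(X), np.array(y)

model = GaussianNB()
window, correct = 50, []
for i in range(len(X)):
  xi, yi = X[i].reshape(1, -1), y[i:i + 1]
  if i > 0:                                  # test-then-train: predict first
    correct.append(int(model.predict(xi)[0] == yi[0]))
  model.partial_fit(xi, yi, classes=[0, 1])

# sliding-window accuracy: expect visible dips near the drift points
acc = [np.mean(correct[max(0, j - window):j]) for j in range(1, len(correct) + 1)]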

SIN1: introduction

It consists of two attributes x and y uniformly distributed in [0, 1]. The classification function is y = sin(x). Instances are classified as positive if they are under the curve; otherwise they are classified as negative. At a drift point, the class labels are reversed.

# create a synthetic SIN1-style dataset
def generate_sin(size=1000):
  """Points scattered around y = sin(x); label 1 if the point lies under
  the curve (noise < 0), with labels reversed at the halfway drift point.
  Note: the quoted description samples x and y uniformly in [0, 1]; this
  variant scatters points around the curve instead, which keeps the same
  'under the curve' labelling idea.
  """
  random_state = np.random.RandomState(42)
  x = random_state.normal(0.5, 0.5, size)
  noise = random_state.normal(0, 0.2, size)  # same draws as normal(0.3, 0.2) - 0.3
  y = np.sin(x) + noise
  # a point lies under the curve exactly when its noise term is negative
  target1 = [1 if n < 0 else 0 for n in noise[:size // 2]]
  # drift point at half -> class labels inverted
  target2 = [0 if n < 0 else 1 for n in noise[size // 2:]]
  target1.extend(target2)
  return list(zip(x, y)), target1  # materialize the zip so it can be reused

Test result: I keep feeling that NB never really learned this one; the accuracy just keeps dropping.
(figure omitted: GaussianNB accuracy curve on the SIN1 stream)
