Concept Drift: a Python Summary

Concept Drift

For my master's thesis I have so far skimmed roughly a dozen papers, which gave me a basic understanding of concept drift (hereafter CD).
A few points of my understanding first:

  1. CD means that the concept behind the dataset, i.e. the cause generating the data, has changed.
  • For example, users suddenly stop buying one of your products because the product ran into a reputation problem.
  • Or, a more concrete example using the synthetic STAGGER data:
    • there are 3 attributes: size, color and shape,
    • and 3 corresponding rules: red ∧ small = true (rule1), green ∨ circular = true (rule2), medium ∨ large = true (rule3);
    • from these 3 rules we can construct a synthetic dataset, e.g. 300 instances under rule1, then 300 under rule2 and 300 under rule3, where the transitions 1→2 and 2→3 are abrupt changes.
  2. CD comes in 5 flavors: abrupt, gradual, incremental, recurrent and blip; see the figure below:
    (figure omitted: illustration of the five drift types)

  3. Once CD occurs, what we want to do is: detect it + adapt our model.

  • Many papers focus on detection; for adaptation, simply retraining the model is usually considered enough.
    • Detection family 1: watch the distribution of the data itself, i.e. monitor parameters such as the mean and std of the inputs, and set thresholds (warning level, decision level) to decide whether the distribution really changed. ADWIN, for example, uses an adaptive sliding window to control the size of the input it compares.
    • Detection family 2: watch the estimator's error instead; if the error suddenly spikes, a CD is likely, and again thresholds decide when to raise the alarm (a minimal DDM-style sketch of this two-threshold idea follows this list).
  • Some papers combine detection and adaptation:
    • Ensembles: train one estimator per incoming data chunk and store it; each estimator carries a weight that is dynamically re-adjusted whenever a new estimator arrives, so an old estimator can regain influence later, which handles the recurrent case (Learn++; a toy sketch follows this list).
    • Decision trees (DT): thanks to the tree structure a DT can be pruned and is well interpretable, so modifying the tree is a comparatively simple way to adapt the model (Hoeffding Tree).
  4. CD matters mainly in incremental-training scenarios.
  • My rough understanding of online/incremental training is chunk-wise training with "batch size = 1" at the chunk level: the model is updated once per arriving data chunk, and each chunk contains many instances (say 1000) so that a single update is actually worthwhile (the prequential sketch in the STAGGER section below shows this with sklearn's partial_fit).
  5. Main Python framework: River (the merger of scikit-multiflow and creme; a minimal usage sketch follows this list).

  6. Main Java framework: MOA (from the University of Waikato) --> something to study properly later.
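
To make the two-threshold idea of detection family 2 concrete, here is a minimal DDM-style sketch (after Gama et al.'s Drift Detection Method). The class name and the 30-sample burn-in are my own choices; the 2σ warning and 3σ drift levels are the standard DDM settings. Treat it as an illustration, not a reference implementation.

import math

class DDMSketch:
  """Track the running error rate p and its std s = sqrt(p*(1-p)/n);
  remember the smallest p + s seen so far, then warn at p_min + 2*s_min
  and signal drift at p_min + 3*s_min (the two threshold levels)."""

  def __init__(self):
    self.n = 0
    self.p = 0.0                   # running error rate
    self.p_min = float("inf")
    self.s_min = float("inf")

  def update(self, error):
    # error: 1 if the model just misclassified an instance, else 0
    self.n += 1
    self.p += (error - self.p) / self.n           # incremental mean
    s = math.sqrt(self.p * (1 - self.p) / self.n)
    if self.n < 30:                               # burn-in, as in the original DDM
      return "stable"
    if self.p + s < self.p_min + self.s_min:      # new best operating point
      self.p_min, self.s_min = self.p, s
    if self.p + s > self.p_min + 3 * self.s_min:
      return "drift"
    if self.p + s > self.p_min + 2 * self.s_min:
      return "warning"
    return "stable"

# usage: state = detector.update(int(y_pred != y_true)) after every prediction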
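
And a toy version of the ensemble idea from item 3. This is not the full Learn++ algorithm: it just trains one GaussianNB per chunk, re-weights every stored member by its accuracy on the newest chunk (so an old member that matches a recurring concept automatically regains influence), and predicts by weighted vote over binary labels. Class and method names are illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

class ChunkEnsemble:
  """One estimator per data chunk; weights follow accuracy on the newest chunk."""

  def __init__(self):
    self.members = []   # fitted estimators, one per seen chunk
    self.weights = []   # one weight per member

  def add_chunk(self, X, y):
    self.members.append(GaussianNB().fit(X, y))
    # re-score every member on the newest chunk: a stored estimator that
    # fits a recurring concept gets its weight back automatically
    self.weights = [m.score(X, y) for m in self.members]

  def predict(self, X):
    votes = np.zeros(len(X))
    for m, w in zip(self.members, self.weights):
      votes += w * m.predict(X)   # binary labels in {0, 1}
    return (votes >= 0.5 * sum(self.weights)).astype(int)

# usage per chunk: ens.add_chunk(X_chunk, y_chunk); y_hat = ens.predict(X_next)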
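
Finally, a minimal River sketch for item 5. I am assuming a recent River release here: current versions expose drift detectors under river.drift with an update() method and a drift_detected flag, but the API has changed between versions, so check the docs of your installed release.

from river import drift

# feed a univariate stream into ADWIN, e.g. per-sample 0/1 errors
detector = drift.ADWIN()
stream = [0] * 500 + [1] * 500   # crude abrupt change at index 500
for i, x in enumerate(stream):
  detector.update(x)
  if detector.drift_detected:
    print(f"drift detected around index {i}")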

A simple Python implementation + training with sklearn

STAGGER: introduction

This dataset contains three nominal attributes, namely size = {small, medium, large}, color = {red, green}, and shape = {circular, non-circular}. Before the first drift point, instances are labeled positive if (color = red) ∧ (size = small). After this point and before the second drift, instances are classified positive if (color = green) ∨ (shape = circular), and finally after this second drift point, instances are classified positive only if (size = medium) ∨ (size = large).

from enum import Enum
import numpy as np

class Size(Enum):
  small = 0
  medium = 1
  large = 2

class Color(Enum):
  red = 0
  green = 1

class Shape(Enum):
  circular = 0
  noncircular = 1

def generate_stagger(size=1000):
  """Generate a STAGGER stream with two abrupt drift points.

  One data row looks like (0, 1, 1) with label 1, i.e.
  (Size.small, Color.green, Shape.noncircular) -> true.
  (With a small change the rows could carry the enum names instead of
  ints, but that only adds workload.)
  """
  rule1 = lambda si, co, sh: co == Color.red.value and si == Size.small.value
  rule2 = lambda si, co, sh: co == Color.green.value or sh == Shape.circular.value
  rule3 = lambda si, co, sh: si == Size.medium.value or si == Size.large.value

  np.random.seed(10)

  stagger_ = []
  ans_ = []
  # one chunk per rule: the concept changes abruptly between chunks
  for rule in [rule1, rule2, rule3]:
    size_ = np.random.randint(0, 3, size=size // 3)   # Size attribute
    color_ = np.random.randint(0, 2, size=size // 3)  # Color attribute
    shape_ = np.random.randint(0, 2, size=size // 3)  # Shape attribute

    # label is 1 where the current rule holds, 0 otherwise
    ans = [1 if rule(si, co, sh) else 0 for si, co, sh in zip(size_, color_, shape_)]

    stagger_.extend(zip(size_, color_, shape_))
    ans_.extend(ans)

  # bug fix: return the full label list, not just the last chunk's labels
  return stagger_, ans_

Test result with GaussianNB under plain batch training: the accuracy visibly drops at around instance 300.
(figure omitted: GaussianNB accuracy curve on the STAGGER stream)
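
For reference, here is a minimal sketch of how such a curve can be produced with prequential (test-then-train) evaluation, using generate_stagger from above and sklearn's partial_fit; the size=900 choice (drifts at ~300 and ~600) and the window size are my own assumptions.

import numpy as np
from sklearn.naive_bayes import GaussianNB

X, y = generate_stagger(size=900)            # drifts at ~300 and ~600
X, y = np.array(X), np.array(y)

model = GaussianNB()
window, correct = 50, []
for i in range(len(X)):
  xi, yi = X[i].reshape(1, -1), y[i:i + 1]
  if i > 0:                                  # test-then-train: predict first
    correct.append(int(model.predict(xi)[0] == yi[0]))
  model.partial_fit(xi, yi, classes=[0, 1])

# sliding-window accuracy: expect visible dips near the drift points
acc = [np.mean(correct[max(0, j - window):j]) for j in range(1, len(correct) + 1)]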

SIN1: introduction

It consists of two attributes x and y uniformly distributed in [0, 1]. The classification function is y = sin(x). Instances are classified as positive if they are under the curve; otherwise they are classified as negative. At a drift point, the class labels are reversed.

# create a synthetic SIN1-style dataset
def generate_sin(size=1000):
  """Points scattered around y = sin(x); label 1 if the point lies under
  the curve (noise < 0), with labels reversed at the halfway drift point.
  Note: the quoted description samples x and y uniformly in [0, 1]; this
  variant scatters points around the curve instead, which keeps the same
  'under the curve' labelling idea.
  """
  random_state = np.random.RandomState(42)
  x = random_state.normal(0.5, 0.5, size)
  noise = random_state.normal(0, 0.2, size)  # same draws as normal(0.3, 0.2) - 0.3
  y = np.sin(x) + noise
  # a point lies under the curve exactly when its noise term is negative
  target1 = [1 if n < 0 else 0 for n in noise[:size // 2]]
  # drift point at half -> class labels inverted
  target2 = [0 if n < 0 else 1 for n in noise[size // 2:]]
  target1.extend(target2)
  return list(zip(x, y)), target1  # materialize the zip so it can be reused

Test result: I keep feeling that NB never really learned this one; the accuracy just keeps dropping.
(figure omitted: GaussianNB accuracy curve on the SIN1 stream)
