Concept Drift
So far I have skimmed roughly a dozen papers for my master's thesis, so I have a basic understanding of concept drift (abbreviated CD below).
First, a few points from my own understanding:
- CD means that the concept behind the data set, i.e. the underlying cause/relationship, has changed.
- For example, users suddenly stop buying one of your products because the product ran into a reputation problem.
- Or, as a more tangible example, take the synthetic STAGGER data set:
- There are 3 attributes: color, size and shape,
- and correspondingly 3 rules: red AND small = true (rule 1), green OR circular = true (rule 2), medium OR large = true (rule 3).
- From these 3 rules we can construct a synthetic dataset, e.g. 300 instances for rule 1, then 300 for rule 2 and 300 for rule 3, where the switches 1-2-3 are all sudden (abrupt change).
There are 5 kinds of CD: abrupt, gradual, incremental, recurrent and blip; their meaning is illustrated in the figure below:
(figure: the five types of concept drift)
Once CD occurs, what we want to do is: detect it + adapt our model.
- Many papers focus on detection; for adaptation, simply retraining the model is often considered enough.
- Detection family 1: based on changes in the internal data distribution, i.e. monitor statistics of the input such as mean and std, and set thresholds (warning level, decision level) to judge whether the distribution really changed. ADWIN, for example, controls how much of the input is compared via an adaptive sliding window.
- Detection family 2: based on the estimator's error, i.e. if the error suddenly spikes there is very likely a CD; again thresholds are set for the decision (a rough sketch of this error-rate idea follows right after this list).
- Some papers also combine detection and adaptation.
- Ensemble: for every incoming data chunk we train an estimator and store it; each estimator has a weight that is dynamically adjusted whenever a new estimator arrives, which gives old estimators a chance to become influential again (this handles recurrent drift) (Learn++; a small chunk-ensemble sketch is also given below).
- Using a DT (decision tree): thanks to the tree structure it can be pruned and has good interpretability, so modifying the DT as the adaptation step is relatively simple (Hoeffding Tree).
- CD is mainly relevant in incremental-training scenarios.
- online training / incremental training: my rough personal understanding is "batch size = 1" at the chunk level, i.e. we train once for each incoming data chunk, where each chunk contains many instances (e.g. 1000) so that training on it is actually worthwhile.
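To make the error-rate family from the list above a bit more concrete, here is a rough, untested sketch in the spirit of DDM (warning/drift thresholds over a running error rate); it is my own simplification, not a faithful reimplementation of any particular paper:

import math

class SimpleErrorDriftDetector:
    """Track the running error rate and flag warning/drift levels (DDM-like sketch)."""
    def __init__(self, warning_factor=2.0, drift_factor=3.0):
        self.n = 0
        self.p = 1.0                      # running error rate
        self.p_min = float("inf")         # lowest error rate seen so far
        self.s_min = float("inf")         # its standard deviation
        self.warning_factor = warning_factor
        self.drift_factor = drift_factor

    def update(self, error):              # error: 0 = correct prediction, 1 = wrong
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_factor * self.s_min:
            return "drift"
        if self.p + s > self.p_min + self.warning_factor * self.s_min:
            return "warning"
        return "ok"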
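And a very small sketch of the chunk-wise weighted-ensemble idea (loosely in the spirit of Learn++ / weighted majority, not the exact algorithm); the choice of GaussianNB members and the accuracy-based re-weighting on the newest chunk are my own simplifications:

import numpy as np
from sklearn.naive_bayes import GaussianNB

class ChunkEnsemble:
    """Train one estimator per data chunk, keep all of them, re-weight on every new chunk."""
    def __init__(self):
        self.members = []                         # list of [estimator, weight]

    def fit_chunk(self, X_chunk, y_chunk):
        self.members.append([GaussianNB().fit(X_chunk, y_chunk), 1.0])
        # re-weight every member by its accuracy on the newest chunk, so that
        # an old member can regain weight when an old concept recurs
        for member in self.members:
            member[1] = member[0].score(X_chunk, y_chunk)

    def predict(self, X):
        total = sum(w for _, w in self.members) or 1.0
        votes = sum(w * est.predict(X) for est, w in self.members)
        return (votes / total >= 0.5).astype(int)  # weighted majority vote for binary labels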
Main Python framework: River (the merger of scikit-multiflow and creme)
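Just to illustrate how River exposes these pieces, here is a minimal, untested sketch that runs an incremental GaussianNB together with ADWIN over the STAGGER stream generated further below; treating the nominal codes as plain numbers is a simplification, and the exact detector attribute names (e.g. drift_detected) have changed between River versions, so check the docs of the installed release:

from river import naive_bayes, drift

X, Y = generate_stagger(900)                         # generate_stagger is defined later in this post
model = naive_bayes.GaussianNB()                     # incremental NB: learn_one / predict_one
detector = drift.ADWIN()                             # adaptive-window drift detector

for i, (row, y) in enumerate(zip(X, Y)):
    x = {"size": row[0], "color": row[1], "shape": row[2]}
    y_pred = model.predict_one(x)                    # None before the first learn_one call
    detector.update(0 if y_pred == y else 1)         # feed the 0/1 error stream to ADWIN
    if getattr(detector, "drift_detected", False):   # attribute name differs across River versions
        print(f"drift suspected at instance {i}")
        model = naive_bayes.GaussianNB()             # naive reaction: start over with a fresh model
    model.learn_one(x, y)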
Main Java framework: MOA (from the University of Waikato) --> something I still need to study properly
Simple Python implementation + sklearn training
STAGGER: description
This dataset contains three nominal attributes, namely size = {small, medium, large}, color = {red, green}, and shape = {circular, non-circular}. Before the first drift point, instances are labeled positive if (color = red) ∧ (size = small). After this point and before the second drift, instances are classified positive if (color = green) ∨ (shape = circular), and finally after this second drift point, instances are classified positive only if (size = medium) ∨ (size = large).
import numpy as np
from enum import Enum

class Size(Enum):
    small = 0
    medium = 1
    large = 2

class Color(Enum):
    red = 0
    green = 1

class Shape(Enum):
    circular = 0
    noncircular = 1

def generate_stagger(size=1000):
    """
    One data row: (0, 1, 1) with label 1 == (Size.small, Color.green, Shape.noncircular, true).
    With a small change we could also return rows like (small, green, noncircular, true),
    but that would just add workload.
    """
    # the three STAGGER concepts
    rule1 = lambda si, co, sh: co == Color.red.value and si == Size.small.value
    rule2 = lambda si, co, sh: co == Color.green.value or sh == Shape.circular.value
    rule3 = lambda si, co, sh: si == Size.medium.value or si == Size.large.value

    np.random.seed(10)
    stagger_ = []
    ans_ = []
    # one third of the stream per concept, with abrupt switches in between
    for i, rule in enumerate([rule1, rule2, rule3]):
        size_ = np.random.randint(0, 3, size=int(size / 3))    # size attribute
        color_ = np.random.randint(0, 2, size=int(size / 3))   # color attribute
        shape_ = np.random.randint(0, 2, size=int(size / 3))   # shape attribute
        ans = list(map(rule, size_, color_, shape_))
        ans = [1 if a else 0 for a in ans]                     # true -> 1, false -> 0
        stagger_.extend(zip(size_, color_, shape_))
        ans_.extend(ans)
    return stagger_, ans_                                      # return labels for all chunks, not just the last one
Test result with ordinary batch training and GaussianNB: the accuracy can be seen dropping at around instance 300.
SIN1: description
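The exact test setup is not shown here, so the following is only a hedged sketch of how such a check might look: fit GaussianNB on the first concept only, then track accuracy over a sliding window across the whole stream; the window accuracy should fall once the concept switches (around index 300 for the first drift, 600 for the second):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X_raw, y_raw = generate_stagger(900)                 # 300 instances per concept
X, y = np.array(X_raw), np.array(y_raw)

clf = GaussianNB().fit(X[:300], y[:300])             # plain batch training on concept 1 only

window = 50
for start in range(0, len(X) - window + 1, window):
    acc = clf.score(X[start:start + window], y[start:start + window])
    print(f"instances {start:3d}-{start + window:3d}: acc = {acc:.2f}")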
It consists of two attributes x and y uniformly distributed in [0, 1]. The classification function is y = sin(x). Instances are classified as positive if they are under the curve; otherwise they are classified as negative. At a drift point, the class labels are reversed.
# create a synthetic SIN1-like dataset
# note: the canonical SIN1 draws x uniformly in [0, 1]; here x and the noise are
# drawn from normal distributions instead, so this is only an approximation
import numpy as np

def generate_sin(size=1000):
    random_state = np.random.RandomState(42)
    x = random_state.normal(0.5, 0.5, size)
    noise = random_state.normal(0.3, 0.2, size) - 0.3             # zero-mean noise
    y = np.sin(x) + noise                                         # y is below sin(x) exactly when noise < 0
    # positive (1) if the point lies under the curve y = sin(x)
    target1 = [1 if n < 0 else 0 for n in noise[:int(size / 2)]]
    # drift point at the half-way mark -> class labels are inverted
    target2 = [0 if n < 0 else 1 for n in noise[int(size / 2):]]
    target1.extend(target2)
    return list(zip(x, y)), target1
Test result: I have the feeling NB never really learned this one; the accuracy just keeps dropping.