但是,有一个问题就是,真的购买顺序与下次购买无关吗?拿这个例子来说:买耳机,先买了个200块的AKG,又买了个200块的阿斯翠,这是很可能买手会发现国产性价比很不错,转而第三个仍然是用阿斯翠;如果先阿斯翠后AKG,可能会发现动圈的听感要比动铁舒服,可能下一个就是另一个牌子的动圈耳机了。
为了验证这个猜想,我使用有序版的Apriori算法进行测试,数据集是阿里推荐大赛的数据。使用python,很慢,所以只能用数据集的一部分,看一个结果。
算法过程结合了Apriori和skip-k-n-gram,使用Apriori的思想进行协同过滤,而过滤的关键字使用skip-k-n-gram产生,即abc,形成ab,ac,bc(skip-1-2-gram)的组合。
预处理数据:
#/usr/bin/python import csv import re def load(filename): print ‘loading‘ reader = csv.reader(file(filename, ‘rb‘)) pattern = re.compile(r‘\d+‘) maps = {} for line in reader: date = pattern.findall(line[-1]) time = (int)(date[0]) * 100 + (int)(date[1]) uid = (int)(line[0]) op = line[2] + ‘-‘ + line[1] if(uid in maps): maps[uid].append(op) else: maps[uid] = [op] return maps
返回数据的格式是:{id:[op...]...},其中op是按时间排序的。
使用Apriori的变种进行验证我的想法:
#!/usr/bin/python import dld #data = {2:[‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘,‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘, ‘fa‘, ‘af‘],# 3:[‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘,‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘fa‘, ‘dd‘, ‘af‘]} data = dld.load(‘data.csv‘) threshold = 10 def patterns(old, uid, i, k): """skip k n gram algorithm for user action""" #for first call if(not isinstance(old, list)): old = [old] #the whole action list line = data[uid] result = []#pattern, uid, endpoint in action list j = 0 while (j < k and (i + j + 1) < len(line)): result.append([old+[line[i + j + 1]], uid, i + j + 1]) j = j + 1 return result def passes(pattern, t): """reform the result from patterns and filter the results using threshold t part of Apriori""" pattern.sort(key=lambda record : record[0]) print ‘sorted %d‘ %(len(pattern)) passed = []#[pattern, [[uid, endpoint], [uid, endpoint],...] last = [‘‘, 0, 0]#merge same actions for item in pattern: if (item[0] == last[0]):#same passed[-1][1].append(item[1:]) else:#new passed.append([item[0], [item[1:]]]) last = item passed = [i for i in passed if len(i[1]) > t]#filter return passed """" main """ print ‘first round‘ pattern = []; for uid in data: i = 0 line = data[uid] for item in line: pattern[len(pattern):] = patterns(item, uid, i ,3) i = i + 1 #print pattern passed = passes(pattern, threshold) print ‘passed %d‘ %(len(passed)) #print passed j = 0 while (j < 2): print ‘round %d‘ %(j) pattern = [] for record in passed: old = record[0] for item in record[1]: pattern[len(pattern):] = patterns(old, item[0], item[1], 3) #print pattern passed = passes(pattern, threshold) print ‘passed %d‘ %(len(passed)) #print passed j = j + 1 #print passed print ‘statics‘ result = [[p[0], len(p[1])] for p in passed]#pattern, count print ‘result %d‘ %(len(result)) #print result statics = {} print ‘analyse‘ for item in result: key = sorted(item[0][:-1]) hashkey = ‘+‘.join(key) if(hashkey in statics): statics[hashkey].append([item[0], item[1]]) else: statics[hashkey] = [[item[0], item[1]]] sop = 0#same action src = 0#same recommand for static in statics: out = statics[static] before = len(out) statics[static] = [] last = [[0],0] maps = {} for item in out: if(cmp(item[0][0], item[0][1]) != 0 or cmp(item[0][0], item[0][2]) != 0):#all same action if(last[0][:-1] == item[0][:-1]): sop = sop + 1 if(last[0][1] < item[0][1]): statics[static][-1] = item[0][1] last = item else: if(item[0][-1] not in maps): statics[static].append(item) maps[item[0][-1]] = 0 last = item else: src = src + 1 #print ‘ya filter %d/%d‘ %(len(statics[static]), before) for static in statics: out = statics[static] if(len(out) > 2): #print out pass print ‘final\n positive:%d\n same action:%d\n same reco:%d‘ %(len(statics), sop, src)
结果呢,基本可以看出有旗鼓相当的一些人,因为之前的顺序,在下次购买时选择了不同的物品。至于为什么没人用这个算法,或者这类算法,请见下文。