时间与推荐 python验证

在我了解到的所有推荐系统的结构里(都是课本,课堂上的水版)没有购买顺序的考虑,即所有的基础推荐系统都是把购买行为使用bag of words的方法进行过滤和推荐的。

但是,有一个问题就是,真的购买顺序与下次购买无关吗?拿这个例子来说:买耳机,先买了个200块的AKG,又买了个200块的阿斯翠,这是很可能买手会发现国产性价比很不错,转而第三个仍然是用阿斯翠;如果先阿斯翠后AKG,可能会发现动圈的听感要比动铁舒服,可能下一个就是另一个牌子的动圈耳机了。

为了验证这个猜想,我使用有序版的Apriori算法进行测试,数据集是阿里推荐大赛的数据。使用python,很慢,所以只能用数据集的一部分,看一个结果。

算法过程结合了Apriori和skip-k-n-gram,使用Apriori的思想进行协同过滤,而过滤的关键字使用skip-k-n-gram产生,即abc,形成ab,ac,bc(skip-1-2-gram)的组合。

预处理数据:

#/usr/bin/python
import csv
import re
def load(filename):
    print ‘loading‘
    reader = csv.reader(file(filename, ‘rb‘))

    pattern = re.compile(r‘\d+‘)
    maps = {}
    for line in reader:
        date = pattern.findall(line[-1])
        time = (int)(date[0]) * 100 + (int)(date[1])
        uid = (int)(line[0])
        op = line[2] + ‘-‘ + line[1]
        if(uid in maps):
            maps[uid].append(op)
        else:
            maps[uid] = [op]
    return maps

返回数据的格式是:{id:[op...]...},其中op是按时间排序的。

使用Apriori的变种进行验证我的想法:

#!/usr/bin/python
import dld
#data = {2:[‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘,‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘, ‘fa‘, ‘af‘],#        3:[‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘dd‘,‘ab‘, ‘ab‘, ‘ab‘, ‘af‘, ‘ab‘, ‘fa‘, ‘dd‘, ‘af‘]}
data = dld.load(‘data.csv‘)
threshold = 10


def patterns(old, uid, i, k):
    """skip k n gram algorithm for user action"""
    #for first call
    if(not isinstance(old, list)):
        old = [old]
    #the whole action list
    line = data[uid]
    result = []#pattern, uid, endpoint in action list
    j = 0
    while (j < k and (i + j + 1) < len(line)):
        result.append([old+[line[i + j + 1]], uid, i + j + 1])
        j = j + 1
    
    return result


def passes(pattern, t):
    """reform the result from patterns
    and filter the results using threshold t
    part of Apriori"""
    pattern.sort(key=lambda record : record[0])
    print ‘sorted %d‘ %(len(pattern))
    passed = []#[pattern, [[uid, endpoint], [uid, endpoint],...]
    last = [‘‘, 0, 0]#merge same actions
    for item in pattern:
        if (item[0] == last[0]):#same
            passed[-1][1].append(item[1:])
        else:#new
            passed.append([item[0], [item[1:]]])
        last = item


    passed = [i for i in passed if len(i[1]) > t]#filter
    return passed


"""" main """
print ‘first round‘
pattern = [];
for uid in data:
    i = 0
    line = data[uid]
    for item in line:
        pattern[len(pattern):] = patterns(item, uid, i ,3)
        i = i + 1


#print pattern
passed = passes(pattern, threshold)
print ‘passed %d‘ %(len(passed))
#print passed


j = 0
while (j < 2):
    print ‘round %d‘ %(j)
    pattern = []
    for record in passed:
        old = record[0]
        for item in record[1]:
            pattern[len(pattern):] = patterns(old, item[0], item[1], 3)
    #print pattern
    passed = passes(pattern, threshold)
    print ‘passed %d‘ %(len(passed))
    #print passed
    j = j + 1
    
#print passed
print ‘statics‘
result = [[p[0], len(p[1])] for p in passed]#pattern, count
print ‘result %d‘ %(len(result))
#print result
statics = {}


print ‘analyse‘
for item in result:
    key = sorted(item[0][:-1])
    hashkey = ‘+‘.join(key)
    if(hashkey in statics):
        statics[hashkey].append([item[0], item[1]])
    else:
        statics[hashkey] = [[item[0], item[1]]]


sop = 0#same action
src = 0#same recommand
for static in statics:
    out = statics[static]
    before = len(out)
    statics[static] = []
    last = [[0],0]
    maps = {}
    for item in out:
        if(cmp(item[0][0], item[0][1]) != 0 or cmp(item[0][0], item[0][2]) != 0):#all same action
            if(last[0][:-1] == item[0][:-1]):
                sop = sop + 1
                if(last[0][1] < item[0][1]):
                    statics[static][-1] = item[0][1]
                    last = item
            else:
                if(item[0][-1] not in maps):
                    statics[static].append(item)
                    maps[item[0][-1]] = 0
                    last = item
                else:
                    src = src + 1
    #print ‘ya filter %d/%d‘ %(len(statics[static]), before)
                    
for static in statics:
    out = statics[static]
    if(len(out) > 2):
        #print out
        pass
print ‘final\n positive:%d\n same action:%d\n same reco:%d‘ %(len(statics), sop, src)

结果呢,基本可以看出有旗鼓相当的一些人,因为之前的顺序,在下次购买时选择了不同的物品。至于为什么没人用这个算法,或者这类算法,请见下文

时间与推荐 python验证,布布扣,bubuko.com

时间与推荐 python验证

上一篇:ECS使用体验


下一篇:Python 方法中变量加self和不加的区别