Building a Recommender System

 

1. Content-based recommendation

Usage examples for the gensim framework

https://blog.csdn.net/tianbwin2995/article/details/51768574

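LDA is essentially a form of dimensionality reduction and grouping: each document is described by a small number of topics instead of its full vocabulary.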

 

1.1. Clustering with the LDA topic model

LDA example

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
   
# create sample documents
doc_a = "Broccoli is good to eat. My brother likes to eat good broccoli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that broccoli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
   
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

# show the top words of each topic
print(ldamodel.print_topics(num_topics=2, num_words=4))

 

1.2. Similarity comparison with TF-IDF

TF is term frequency: how often a term occurs in a document. IDF is inverse document frequency: a weight that shrinks as the term appears in more documents.
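
As a minimal sketch of how the two quantities combine, the snippet below computes TF-IDF by hand for a toy corpus. It assumes TF is the raw count divided by document length and IDF is log(N / df); gensim's TfidfModel uses its own default weighting and normalization, so its numbers will generally differ. The corpus and the function name `tf_idf` are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF = count / document length, IDF = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["snow", "mountain", "snow"],
    ["snow", "lake"],
    ["lake", "forest"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document gets IDF log(1) = 0, which is why very common words contribute nothing to the similarity.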

 

 

 

 

 

TF-IDF example
# coding=utf-8
'''
Created on 2018-1-24

Pros: the computed results are good
Cons: computing TF-IDF values needs several documents as a background corpus
'''
import jieba
from gensim import corpora, models, similarities
# gensim's models module can process a corpus further, e.g. with TF-IDF, LSI, or LDA models
wordstest_model = ["我去玉龙雪山并且喜欢玉龙雪山玉龙雪山","我在玉龙雪山并且喜欢玉龙雪山","我在九寨沟"]
test_model = [[word for word in jieba.cut(words)] for words in wordstest_model]
dictionary = corpora.Dictionary(test_model, prune_at=2000000)
# for key in dictionary.keys():
#     print(key, dictionary.get(key), dictionary.dfs[key])
corpus_model = [dictionary.doc2bow(test) for test in test_model]
print(corpus_model)
# Output recorded by the author (ids may differ between versions):
# [[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1)], [(0, 1), (1, 2), (3, 1), (4, 1), (5, 1)], [(0, 1), (5, 1), (6, 1)]]

# This only builds the model; it is not yet the transformed corpus. The model stores
# statistics such as document frequencies for each term.
tfidf_model = models.TfidfModel(corpus_model)
# Apply the model to the corpus to get TF-IDF vectors
corpus_tfidf = tfidf_model[corpus_model]

# Test the model on a new text: test_bow supplies the term counts, tfidf_model supplies the IDF
testword = "我在九寨沟,很喜欢"
test_bow = dictionary.doc2bow([word for word in jieba.cut(testword)])
test_tfidf = tfidf_model[test_bow]
print(test_tfidf)
# (term id, TF-IDF value); output recorded by the author:
# [(4, 0.32718457421365993), (5, 0.32718457421365993), (6, 0.8865102981879297)]

# Compute similarity
# Build an index over all corpus documents; num_features pins the vector width to the dictionary size
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = index[test_tfidf]  # similarity between the test text and each corpus document
print(sims)
# [ 0.07639694 0.2473283   0.94496047]

 

Storing article similarities

id  origin_id  target_id  sim
1   1          2          0.89
2   1          3          0.72
3   1          4          0.18
4   2          1          0.89
5   2          3          0.21
6   2          4          0.37
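
Rows in this shape could be generated from any pairwise similarity function. The sketch below shows one hypothetical way to do it: `pairwise_rows` and the toy `jaccard` measure are illustrative stand-ins, not the author's pipeline, which would plug in the TF-IDF similarities instead.

```python
from itertools import permutations

def jaccard(a, b):
    """Toy similarity: overlap of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pairwise_rows(articles, sim=jaccard):
    """articles: {article_id: tokens} -> list of (id, origin_id, target_id, sim)."""
    rows = []
    ordered_pairs = permutations(sorted(articles), 2)  # every (origin, target), origin != target
    for row_id, (origin, target) in enumerate(ordered_pairs, start=1):
        score = round(sim(articles[origin], articles[target]), 2)
        rows.append((row_id, origin, target, score))
    return rows

articles = {1: ["a", "b"], 2: ["a", "b", "c"], 3: ["x"]}
for row in pairwise_rows(articles):
    print(row)
```

Storing both directions (1 to 2 and 2 to 1) doubles the row count but lets a detail page fetch its neighbours with a single lookup on origin_id.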

 

1.3. Similarity comparison with doc2vec

 

 

1.4. Similarity at scale with MinHash (using datasketch)

 

 

 

 

1.5. Similarity at scale with LSH (using datasketch)

LSH example
from datasketch import MinHash, MinHashLSH
data1 = ['这个', '程序', '代码', '太乱', '那个', '代码', '规范']
data2 = ['这个', '程序', '代码', '不', '规范', '那个', '更', '规范']
data3 = ['这个', '程序', '代码', '不', '规范', '那个', '规范', '些']

m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
for d in data3:
    m3.update(d.encode('utf8'))

# num_perm must match the value used for the MinHash objects
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours (Jaccard similarity > 0.5):", result)

 

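LSH requires the threshold to be set in advance; threshold=0.5 targets items whose Jaccard similarity is roughly 0.5 or higher. If the dataset is small and the threshold is set very high, the result can come back empty, which is not ideal. In addition, LSH cannot rank its results: it works from an estimate, not an exact similarity value.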
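
To illustrate why the value is an estimate, here is a small pure-Python MinHash sketch that does not use datasketch. It assumes salted SHA-1 hashes as a stand-in for random permutations; the function names are illustrative, and datasketch's internals differ. The estimate is the fraction of signature positions on which two sets agree.

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """Keep the minimum hash value of the set under each of num_perm salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = f"{seed}:".encode("utf8")
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode("utf8")).digest()[:8], "big")
            for t in set(tokens)
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures hold the same minimum."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

data1 = ['这个', '程序', '代码', '太乱', '那个', '代码', '规范']
data2 = ['这个', '程序', '代码', '不', '规范', '那个', '更', '规范']

print("exact:   ", exact_jaccard(data1, data2))
print("estimate:", estimate_jaccard(minhash_signature(data1), minhash_signature(data2)))
```

Once LSH has returned its candidates, ranking them by an estimate like this (or by the exact Jaccard value) is one way around the lack of ordering.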

 

2. Collaborative filtering

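Collaborative filtering needs neither user attributes nor item attributes. It only needs the interactions between users and items, such as ratings or clicks, to make recommendations. That is the core idea of collaborative filtering.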
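
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering that works from nothing but a user-item rating dictionary. The ratings, the choice of cosine similarity, and the function names are all illustrative assumptions.

```python
import math

ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 3, "item4": 5},
    "carol": {"item2": 1, "item3": 2, "item4": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, r in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((s / weights[i], i) for i, s in scores.items()), reverse=True)
    return [(item, round(score, 2)) for score, item in ranked[:top_n]]

print(recommend("alice", ratings))
```

Item-based collaborative filtering follows the same pattern with the roles swapped: similarities are computed between item columns instead of user rows.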

 

2.1. User-based collaborative filtering

2.2. Item-based collaborative filtering

 

 

3. Matrix factorization

Usage examples for the surprise framework

3.1. SVD decomposition

 

 

 

 

 

 

SVD code example
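import numpy as np

def svd(data, k):
    # decompose, then keep only the top-k singular values and vectors
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    u = u[:, 0:k]
    s = np.diag(s[0:k])
    vt = vt[0:k, :]

    return u, s, vt

def predictSingle(u_index, i_index, u, s, vt):
    # predicted rating = (row of U) * Sigma * (column of V^T)
    return u[u_index].dot(s).dot(vt[:, i_index])

def play():
    k = 4  # a 4x5 matrix has rank at most 4, so k=4 reconstructs it exactly
    data = np.array([[1,2,3,1,1],[1,3,3,1,2],[3,1,1,2,1],[1,2,3,3,1]])
    u, s, vt = svd(data, k)
    print(u.dot(s).dot(vt))
    print(predictSingle(0, 0, u, s, vt))

if __name__ == '__main__':
    play()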

 

 

3.2. LFM: latent factor model decomposition

 

 

 

LFM code example

import numpy as np

def prediction(pu, qi):
    return np.dot(pu, qi.T)

def getError(r, pu, qi):
    return r - prediction(pu, qi)

def tryTrain():
    # rating matrix; 0 means "not rated"
    real = np.array([[1,2,3,0,3],[3,0,3,1,3],[3,2,0,3,1]])
    print(real)

    ul, il = real.shape
    factors = 3
    p = np.random.randn(ul, factors)  # user factors
    q = np.random.randn(il, factors)  # item factors

    lr = 0.05    # learning rate
    lamda = 0.1  # L2 regularization strength

    for e in range(30):
        for u in range(ul):
            for i in range(il):
                r = real[u, i]
                if r != 0:
                    error = getError(r, p[u], q[i])
                    pu_old = p[u].copy()  # use the pre-update p[u] in the q[i] step
                    p[u] -= lr * (-2 * error * q[i] + 2 * lamda * p[u])
                    q[i] -= lr * (-2 * error * pu_old + 2 * lamda * q[i])

    print(prediction(p, q))

if __name__ == '__main__':
    tryTrain()
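
The update lines in the training loop can be read as stochastic gradient descent on a regularized squared error. The derivation below is a sketch in notation chosen here for illustration: p_u and q_i are the rows p[u] and q[i], eta is lr, lambda is lamda, and L_ui is the loss term of one observed rating.

```latex
\begin{aligned}
L &= \sum_{(u,i)\,:\,r_{ui}\neq 0} L_{ui}, \qquad
L_{ui} = (r_{ui} - p_u^{\top} q_i)^2 + \lambda\bigl(\lVert p_u\rVert^2 + \lVert q_i\rVert^2\bigr) \\
e_{ui} &= r_{ui} - p_u^{\top} q_i \\
\frac{\partial L_{ui}}{\partial p_u} &= -2\,e_{ui}\,q_i + 2\lambda\,p_u, \qquad
\frac{\partial L_{ui}}{\partial q_i} = -2\,e_{ui}\,p_u + 2\lambda\,q_i \\
p_u &\leftarrow p_u - \eta\,\bigl(-2\,e_{ui}\,q_i + 2\lambda\,p_u\bigr) \\
q_i &\leftarrow q_i - \eta\,\bigl(-2\,e_{ui}\,p_u + 2\lambda\,q_i\bigr)
\end{aligned}
```

Only the non-zero entries of the rating matrix enter the sum, which is why the loop skips r == 0 and why no matrix filling is needed.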

 

 

4. FM: factorization machines

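Eigendecomposition, SVD, MF, SVD++, FM, and ALS all perform some form of matrix factorization. The difference is that classical SVD needs the missing entries of the matrix to be filled in first, whereas Funk-SVD (and SVD++, which extends it), FM, and ALS follow the MF approach and learn latent factors from the observed entries only. ALS is mainly used for collaborative filtering, while FM is mainly used for CTR and CVR prediction. FM also handles interactions between features, by taking inner products of their latent vectors.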
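
The inner-product idea can be sketched in a few lines. The snippet below evaluates the second-order FM model, y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, in two ways: a naive double loop and the standard linear-time reformulation of the pairwise term. All parameter values are made up for illustration; a real model learns them from data.

```python
def fm_predict_naive(x, w0, w, v):
    """Second-order FM score with an explicit loop over feature pairs."""
    n = len(x)
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))  # <v_i, v_j>
            y += dot * x[i] * x[j]
    return y

def fm_predict_fast(x, w0, w, v):
    """Same score using 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(v[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        y += 0.5 * (s * s - s_sq)
    return y

x = [1.0, 0.0, 1.0, 1.0]          # feature vector (illustrative)
w0 = 0.1                          # global bias
w = [0.2, -0.1, 0.05, 0.3]        # linear weights
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]  # one latent vector per feature

print(fm_predict_naive(x, w0, w, v))
print(fm_predict_fast(x, w0, w, v))
```

Because every feature has its own latent vector, the model can score a pair of features that never co-occurred in training, which is what makes FM suitable for sparse CTR data.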

 

 

 

 

 

Usage examples for the xlearn framework

FM algorithm

FM example: input and output are both txt files

import xlearn as xl

# Training task
fm_model = xl.create_fm()  # Use factorization machine
fm_model.setTrain("./titanic_train.txt")  # Training data
fm_model.setValidate("./titanic_test.txt")  # Validation data

# param:
# 0. Binary classification task
# 1. learning rate: 0.2
# 2. lambda: 0.002
# 3. metric: accuracy
param = {'task':'binary', 'lr':0.2,
        'lambda':0.002, 'metric':'acc'}

# Start to train
# The trained model will be stored in model.out
fm_model.fit(param, './model.out')

# Prediction task
fm_model.setTest("./titanic_test.txt")  # Test data
fm_model.setSigmoid()  # Convert output to 0-1

# Start to predict
# The output result will be stored in output.txt
fm_model.predict("./model.out", "./output.txt")

 

FM example: input and output are converted through pandas, which is more flexible
import xlearn as xl
import numpy as np
import pandas as pd

# read data from file
titanic_train = pd.read_csv("titanic_train.txt", header=None, sep="\t")
titanic_test = pd.read_csv("titanic_test.txt", header=None, sep="\t")

# get train X, y
X_train = titanic_train[titanic_train.columns[1:]]
y_train = titanic_train[0]

# get test X, y
X_test = titanic_test[titanic_test.columns[1:]]
y_test = titanic_test[0]

# DMatrix transition
xdm_train = xl.DMatrix(X_train, y_train)
xdm_test = xl.DMatrix(X_test, y_test)

# Training task
fm_model = xl.create_fm()  # Use factorization machine
# we use the same API for train from file
# that is, you can also pass xl.DMatrix for this API now
fm_model.setTrain(xdm_train)    # Training data
fm_model.setValidate(xdm_test)  # Validation data

# param:
# 0. binary classification task
# 1. learning rate: 0.2
# 2. regular lambda: 0.002
# 3. evaluation metric: acc
param = {'task':'binary', 'lr':0.2,
        'lambda':0.002, 'metric':'acc'}

# Start to train
# The trained model will be stored in model_dm.out
fm_model.fit(param, './model_dm.out')

# Prediction task
# we use the same API for test from file
# that is, you can also pass xl.DMatrix for this API now
fm_model.setTest(xdm_test)  # Test data
fm_model.setSigmoid()  # Convert output to 0-1

# Start to predict
# No output path is given here, so the result is returned as a numpy.ndarray
res = fm_model.predict("./model_dm.out")

print(res)

 

 

Format of the training and prediction data

 

 

Format of the result data

 

 

FFM algorithm

FFM example

import xlearn as xl

# Training task
ffm_model = xl.create_ffm() # Use field-aware factorization machine
ffm_model.setTrain("./small_train.txt")  # Training data
ffm_model.setValidate("./small_test.txt")  # Validation data

# param:
# 0. binary classification
# 1. learning rate: 0.2
# 2. regular lambda: 0.002
# 3. evaluation metric: accuracy
param = {'task':'binary', 'lr':0.2,
        'lambda':0.002, 'metric':'acc'}

# Start to train
# The trained model will be stored in model.out
ffm_model.fit(param, './model.out')

# Prediction task
ffm_model.setTest("./small_test.txt")  # Test data
ffm_model.setSigmoid()  # Convert output to 0-1

# Start to predict
# The output result will be stored in output.txt
ffm_model.predict("./model.out", "./output.txt")

 

5. Building the recommender system

 

5.1. Detail page

Use the TF-IDF algorithm to compute similarity directly.

 

5.2. The "Guess you like" section

 

5.2.1. Feature selection

Features: clicked or not, user id, item id, item category

click user_id item_id cat_id
1 3304808 29830 8
1 3304808 29831 8
1 3304808 29832 8
0 3304808 38292 9
1 3393049 5844 7
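
For an FFM model, rows like these are commonly written in libffm text format, where the label is followed by `field:index:value` triples. The sketch below shows one possible conversion. The field numbering (0 = user, 1 = item, 2 = category) and the index dictionary built on the fly are assumptions for illustration; the original does not specify the encoding.

```python
def to_libffm(rows):
    """rows: iterable of (click, user_id, item_id, cat_id) -> libffm-format lines."""
    index = {}  # (field, raw value) -> feature index, assigned on first sight
    lines = []
    for click, user_id, item_id, cat_id in rows:
        parts = [str(click)]
        for field, raw in enumerate((user_id, item_id, cat_id)):
            feat = index.setdefault((field, raw), len(index))
            parts.append(f"{field}:{feat}:1")  # categorical feature, so the value is 1
        lines.append(" ".join(parts))
    return lines

rows = [
    (1, 3304808, 29830, 8),
    (1, 3304808, 29831, 8),
    (1, 3304808, 29832, 8),
    (0, 3304808, 38292, 9),
    (1, 3393049, 5844, 7),
]
for line in to_libffm(rows):
    print(line)
```

In practice the index dictionary would need to be saved alongside the model, so that prediction data is encoded with the same feature indices as the training data.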

 

5.2.2. Data collection

Data is collected in the format shown above and stored in MongoDB.

 

 

5.2.3. Model update

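1. Once the model has been trained, use it to score all candidate items for a user in one pass, sort them by predicted score, and store them in a Redis queue.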

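2. When the number of items left in the queue drops below a set threshold, call the prediction module again and refill the Redis queue after removing duplicates.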
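
The two steps might look like the sketch below. To keep it self-contained, a `collections.deque` stands in for the Redis list, and `predict_scores` is a hypothetical placeholder for the model's prediction module; the threshold value is arbitrary.

```python
from collections import deque

LOW_WATERMARK = 3  # refill when fewer than this many items remain (arbitrary)

def predict_scores(user_id, candidates):
    """Hypothetical stand-in for the trained model: returns {item_id: score}."""
    return {item: 1.0 / (1 + item % 7) for item in candidates}

def refill(queue, user_id, candidates, already_served):
    """Score candidates, drop duplicates, append the rest in descending score order."""
    scores = predict_scores(user_id, candidates)
    queued = set(queue)
    fresh = [i for i in scores if i not in queued and i not in already_served]
    fresh.sort(key=lambda i: scores[i], reverse=True)
    queue.extend(fresh)

def next_item(queue, user_id, candidates, already_served):
    """Pop the next recommendation, refilling first if the queue is running low."""
    if len(queue) < LOW_WATERMARK:
        refill(queue, user_id, candidates, already_served)
    if not queue:
        return None  # nothing left to recommend
    item = queue.popleft()
    already_served.add(item)
    return item

queue, served = deque(), set()
served_items = [next_item(queue, 3304808, range(10), served) for _ in range(12)]
print(served_items)
```

With a real Redis list the same flow would use a length check, a push of the fresh items, and a pop from the head; the deduplication against already-served items is what keeps a refill from repeating earlier recommendations.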

 

5.2.4. Prediction data

Note: when the data volume is small and the recommendations are poor, some popular items can be mixed into the results.
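
One simple way to mix in popular items is to interleave them at a fixed interval. The sketch below is only an illustration: the function name `blend_with_popular` and the ratio of one popular item after every three recommendations are assumptions, not the author's design.

```python
def blend_with_popular(recommended, popular, every=3):
    """Insert one unseen popular item after every `every` recommended items."""
    result, seen = [], set()
    hot_items = iter(popular)
    since_last_hot = 0
    for item in recommended:
        if item in seen:
            continue  # already placed, possibly as a popular item
        result.append(item)
        seen.add(item)
        since_last_hot += 1
        if since_last_hot == every:
            since_last_hot = 0
            for hot in hot_items:
                if hot not in seen:
                    result.append(hot)
                    seen.add(hot)
                    break
    return result

print(blend_with_popular([1, 2, 3, 4, 5, 6], popular=[3, 99, 98]))
```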

 

5.2.5. Designing the popular-item data

 

 

5.3. Detecting what a user has already read, to avoid repeat recommendations

Use a Redis Bloom filter to check whether an item has already been seen.
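
To show what the filter does, here is a minimal in-memory Bloom filter in pure Python. It is a sketch of the data structure only, not of the Redis integration: the class, the key format `user:<id>:item:<id>`, and the sizes are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-1 digests of the key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{key}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

seen = BloomFilter()
seen.add("user:3304808:item:29830")
print("user:3304808:item:29830" in seen)  # an added key is always reported as present
print("user:3304808:item:38292" in seen)  # a key never added is usually reported as absent
```

A Bloom filter never misses an item that was added, but it can occasionally report an unseen item as seen. For this use case the cost of a false positive is only that one item is never recommended to that user, which is usually acceptable.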

 

Appendix

1. The BPR algorithm; pointwise, pairwise, and listwise ranking

 

 
