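1. Content-based recommendation
gensim usage examples
https://blog.csdn.net/tianbwin2995/article/details/51768574
LDA is essentially a form of dimensionality reduction and grouping: each document is mapped onto a small number of topics.
1.1 Clustering with the LDA topic model
LDA example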
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for doc in doc_set:
    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [t for t in tokens if t not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
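1.2 Similarity comparison with TF-IDF
TF is term frequency (how often a term occurs in a document); IDF is inverse document frequency (a weight that shrinks for terms occurring in many documents).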
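To make the two terms concrete, here is a minimal pure-Python sketch of one common TF-IDF variant: raw term count multiplied by log2(N / df). The helper name `tf_idf` and the toy documents are assumptions for illustration. gensim's `TfidfModel` additionally normalizes each vector to unit length by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        weights.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weights

# toy corpus (illustrative only)
docs = [["snow", "mountain", "snow"], ["snow", "lake"], ["lake", "forest"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets weight 0 under this variant, which is why a single document on its own cannot produce useful TF-IDF values.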
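TF-IDF example
# coding=utf-8
'''
Created on 2018-1-24
Pros: the computed results are good
Cons: computing TF-IDF values requires a corpus of several documents as a basis
'''
import jieba
from gensim import corpora, models, similarities
# gensim's models module can further process a corpus, e.g. with the TF-IDF, LSI or LDA models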
wordstest_model = ["我去玉龙雪山并且喜欢玉龙雪山玉龙雪山","我在玉龙雪山并且喜欢玉龙雪山","我在九寨沟"]
test_model = [[word for word in jieba.cut(words)] for words in wordstest_model]
dictionary = corpora.Dictionary(test_model,prune_at=2000000)
# for key in dictionary.keys():
#     print(key, dictionary.get(key), dictionary.dfs[key])
corpus_model = [dictionary.doc2bow(test) for test in test_model]
print(corpus_model)
# [[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1)], [(0, 1), (1, 2), (3, 1), (4, 1), (5, 1)], [(0, 1), (5, 1), (6, 1)]]
# This only builds the model; it is not yet the transformed corpus. The model stores each term's frequency, document frequency, and similar statistics.
tfidf_model = models.TfidfModel(corpus_model)
# apply the model to the corpus to get its TF-IDF representation
corpus_tfidf = tfidf_model[corpus_model]
# Test the model on a new text and extract keywords: test_bow supplies the term frequencies of the text, tfidf_model supplies the IDF values
testword = "我在九寨沟,很喜欢"
test_bow = dictionary.doc2bow([word for word in jieba.cut(testword)])
test_tfidf = tfidf_model[test_bow]
print(test_tfidf)
# (term id, TF-IDF value)
# [(4, 0.32718457421365993), (5, 0.32718457421365993), (6, 0.8865102981879297)]
# compute similarity
index = similarities.MatrixSimilarity(corpus_tfidf)  # build an index over all documents in the corpus
sims = index[test_tfidf]  # use the index to compare the test text against every indexed document
print(sims)
# [ 0.07639694 0.2473283 0.94496047]
Store the computed article similarities in a table:
id | origin_id | target_id | sim |
---|---|---|---|
1 | 1 | 2 | 0.89 |
2 | 1 | 3 | 0.72 |
3 | 1 | 4 | 0.18 |
4 | 2 | 1 | 0.89 |
5 | 2 | 3 | 0.21 |
6 | 2 | 4 | 0.37 |
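One way to produce rows like these is to walk a pairwise similarity matrix and keep every ordered pair of distinct articles. The sketch below uses an in-memory SQLite database purely as a stand-in. The table name `article_sim`, the article ids, and the hard-coded `sims` matrix are assumptions for illustration, not the original storage layer.

```python
import sqlite3

article_ids = [1, 2, 3]
# sims[i][j]: similarity between article_ids[i] and article_ids[j] (illustrative values)
sims = [
    [1.00, 0.89, 0.72],
    [0.89, 1.00, 0.21],
    [0.72, 0.21, 1.00],
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_sim ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "origin_id INTEGER, target_id INTEGER, sim REAL)"
)
rows = [
    (origin, target, sims[i][j])
    for i, origin in enumerate(article_ids)
    for j, target in enumerate(article_ids)
    if i != j  # skip self-similarity
]
conn.executemany(
    "INSERT INTO article_sim (origin_id, target_id, sim) VALUES (?, ?, ?)", rows
)

# Most similar articles for article 1, best first
top = conn.execute(
    "SELECT target_id, sim FROM article_sim WHERE origin_id = ? ORDER BY sim DESC",
    (1,),
).fetchall()
print(top)
```

Storing both directions (1 to 2 and 2 to 1) doubles the row count but lets a detail page fetch its neighbours with a single lookup on `origin_id`.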
1.3 Similarity comparison with doc2vec
1.4 Similarity on large data sets with MinHash (using datasketch)
1.5 Similarity on large data sets with LSH (using datasketch)
LSH example
from datasketch import MinHash, MinHashLSH
data1 = ['这个', '程序', '代码', '太乱', '那个', '代码', '规范']
data2 = ['这个', '程序', '代码', '不', '规范', '那个', '更', '规范']
data3 = ['这个', '程序', '代码', '不', '规范', '那个', '规范', '些']
m1 = MinHash()
m2 = MinHash()
m3 = MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
for d in data3:
    m3.update(d.encode('utf8'))
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours (Jaccard similarity > 0.5)", result)
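LSH requires a threshold to be chosen in advance; threshold=0.5 asks for items whose Jaccard similarity is roughly 0.5 or higher. If the data set is small and the threshold is set very high, the query may return nothing, which is not ideal. In addition, LSH cannot rank its results: it returns candidates based on an estimate, not exact similarity values.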
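Since the query only returns candidate keys, a common follow-up is to re-score those candidates with exact Jaccard similarity and sort them. This is a minimal sketch, assuming the original token lists are still available. The `candidates` dict is a stand-in for whatever `lsh.query(m1)` returns, and `jaccard` is a helper written for this example.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two token lists (duplicates ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

data1 = ['这个', '程序', '代码', '太乱', '那个', '代码', '规范']
data2 = ['这个', '程序', '代码', '不', '规范', '那个', '更', '规范']
data3 = ['这个', '程序', '代码', '不', '规范', '那个', '规范', '些']

# Assumed: these keys came back from the LSH query
candidates = {"m2": data2, "m3": data3}

ranked = sorted(
    ((key, jaccard(data1, tokens)) for key, tokens in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

With this toy data both candidates share 5 of 8 distinct tokens with `data1`, so both score 0.625.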
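2. Collaborative filtering
You do not need to know any attributes of the users or of the items. Knowing only the interactions between users and items, such as ratings or clicks, is enough to make recommendations. This is the idea behind collaborative filtering.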
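As an illustration of recommending from interactions alone, here is a small user-based sketch: a missing rating is predicted as the similarity-weighted average of other users' ratings. The `ratings` data, the helper names and the choice of cosine similarity over co-rated items are all assumptions made for this example.

```python
import math

# user -> {item: rating}; toy data, not from the original notes
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "d": 2},
}

def cosine(r1, r2):
    """Cosine similarity over the items both users rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    norm2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (norm1 * norm2)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[item]
        den += w
    return num / den if den else None

score = predict("alice", "d")
print(score)
```

No user or item attributes appear anywhere in the code; only the rating table is used.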
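2.1 User-based collaborative filtering
2.2 Item-based collaborative filtering
3. Matrix factorization
surprise usage examples
3.1 SVD decomposition (matrix factorization)
SVD code example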
import numpy as np

def svd(data, k):
    # truncated SVD: keep only the k largest singular values
    u, s, v = np.linalg.svd(data)
    u = u[:, 0:k]
    s = np.diag(s[0:k])
    v = v[0:k, :]
    return u, s, v

def predictSingle(u_index, i_index, u, s, v):
    # reconstructed value of a single (user, item) cell
    return u[u_index].dot(s).dot(v[:, i_index])

def play():
    k = 4
    # np.array instead of np.mat, which newer NumPy releases no longer provide
    data = np.array([[1,2,3,1,1],[1,3,3,1,2],[3,1,1,2,1],[1,2,3,3,1]])
    u, s, v = svd(data, k)
    print(u.dot(s).dot(v))
    print(predictSingle(0, 0, u, s, v))

if __name__ == '__main__':
    play()
3.2 LFM decomposition (latent factor model)
LFM code example
import numpy as np

def prediction(pu, qi):
    return np.dot(pu, qi.T)

def getError(r, pu, qi):
    return r - prediction(pu, qi)

def tryTrain():
    # 0 marks a missing rating and is skipped during training
    real = np.array([[1,2,3,0,3],[3,0,3,1,3],[3,2,0,3,1]])
    print(real)
    factors = 3
    ul, il = real.shape
    p = np.random.randn(ul, factors)
    q = np.random.randn(il, factors)
    lr = 0.05
    lamda = 0.1
    for e in range(30):
        for u in range(ul):
            for i in range(il):
                r = real[u, i]
                if r != 0:
                    error = getError(r, p[u], q[i])
                    pu_old = p[u].copy()  # update q[i] with the pre-update p[u]
                    p[u] -= lr * (-2 * error * q[i] + 2 * lamda * p[u])
                    q[i] -= lr * (-2 * error * pu_old + 2 * lamda * q[i])
    print(prediction(p, q))

if __name__ == '__main__':
    tryTrain()
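4. FM (factorization machines)
Eigendecomposition, SVD, MF, SVD++, FM and ALS all perform some form of matrix factorization. The difference is that SVD and SVD++ both require the matrix to be filled in first, whereas funk-SVD, FM and ALS follow the MF pattern and learn a latent-factor decomposition. ALS is mainly used for collaborative filtering, while FM is mainly used for CTR and CVR estimation. FM also handles feature interactions, by modelling each pair of features through the inner product of their latent vectors.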
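The inner-product idea can be sketched with the standard second-order FM prediction, y = w0 + Σ w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j. The code below is a plain-Python illustration, not xlearn's implementation. The names `w0`, `w`, `v` and all numeric values are assumptions chosen for the example. The second function uses the well-known sum-of-squares identity to avoid the explicit double loop.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fm_predict_naive(x, w0, w, v):
    """w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, computed pair by pair."""
    linear = w0 + dot(w, x)
    pairwise = sum(
        dot(v[i], v[j]) * x[i] * x[j]
        for i, j in combinations(range(len(x)), 2)
    )
    return linear + pairwise

def fm_predict_fast(x, w0, w, v):
    """Same value in O(k * n): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    k = len(v[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + dot(w, x) + pairwise

x = [1.0, 0.0, 1.0, 2.0]                                  # feature vector
w0 = 0.1                                                  # global bias
w = [0.2, -0.1, 0.3, 0.05]                                # linear weights
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]    # one latent vector per feature

print(fm_predict_naive(x, w0, w, v), fm_predict_fast(x, w0, w, v))
```

Because every feature has its own latent vector, a pair of features that never co-occurred in training still gets an interaction weight, which is what makes FM usable on sparse CTR data.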
xlearn usage examples
FM algorithm
FM example: input and output are both txt files
import xlearn as xl
# Training task
fm_model = xl.create_fm() # Use factorization machine
fm_model.setTrain("./titanic_train.txt") # Training data
fm_model.setValidate("./titanic_test.txt") # Validation data
# param:
# 0. Binary classification task
# 1. learning rate: 0.2
# 2. lambda: 0.002
# 3. metric: accuracy
param = {'task':'binary', 'lr':0.2,
'lambda':0.002, 'metric':'acc'}
# Start to train
# The trained model will be stored in model.out
fm_model.fit(param, './model.out')
# Prediction task
fm_model.setTest("./titanic_test.txt") # Test data
fm_model.setSigmoid() # Convert output to 0-1
# Start to predict
# The output result will be stored in output.txt
fm_model.predict("./model.out", "./output.txt")
FM example: input and output go through pandas, which is more flexible
import xlearn as xl
import numpy as np
import pandas as pd
# read data from file
titanic_train = pd.read_csv("titanic_train.txt", header=None, sep="\t")
titanic_test = pd.read_csv("titanic_test.txt", header=None, sep="\t")
# get train X, y
X_train = titanic_train[titanic_train.columns[1:]]
y_train = titanic_train[0]
# get test X, y
X_test = titanic_test[titanic_test.columns[1:]]
y_test = titanic_test[0]
# DMatrix transition
xdm_train = xl.DMatrix(X_train, y_train)
xdm_test = xl.DMatrix(X_test, y_test)
# Training task
fm_model = xl.create_fm() # Use factorization machine
# we use the same API for train from file
# that is, you can also pass xl.DMatrix for this API now
fm_model.setTrain(xdm_train) # Training data
fm_model.setValidate(xdm_test) # Validation data
# param:
# 0. binary classification task
# 1. learning rate: 0.2
# 2. regular lambda: 0.002
# 3. evaluation metric: acc
param = {'task':'binary', 'lr':0.2,
'lambda':0.002, 'metric':'acc'}
# Start to train
# The trained model will be stored in model_dm.out
fm_model.fit(param, './model_dm.out')
# Prediction task
# we use the same API for test from file
# that is, you can also pass xl.DMatrix for this API now
fm_model.setTest(xdm_test) # Test data
fm_model.setSigmoid() # Convert output to 0-1
# Start to predict
# No output path is passed here, so instead of writing output.txt
# predict() returns the result as a numpy.ndarray
res = fm_model.predict("./model_dm.out")
print(res)
Training and prediction data format
Result data format
FFM algorithm
FFM example
import xlearn as xl
# Training task
ffm_model = xl.create_ffm() # Use field-aware factorization machine
ffm_model.setTrain("./small_train.txt") # Training data
ffm_model.setValidate("./small_test.txt") # Validation data
# param:
# 0. binary classification
# 1. learning rate: 0.2
# 2. regular lambda: 0.002
# 3. evaluation metric: accuracy
param = {'task':'binary', 'lr':0.2,
'lambda':0.002, 'metric':'acc'}
# Start to train
# The trained model will be stored in model.out
ffm_model.fit(param, './model.out')
# Prediction task
ffm_model.setTest("./small_test.txt") # Test data
ffm_model.setSigmoid() # Convert output to 0-1
# Start to predict
# The output result will be stored in output.txt
ffm_model.predict("./model.out", "./output.txt")
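5. Building the recommender system
5.1 Detail page
Use the TF-IDF algorithm and compute similarity directly.
5.2 "Guess you like" section
5.2.1 Feature selection
Clicked or not, user id, item id, item category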
click | user_id | item_id | cat_id |
---|---|---|---|
1 | 3304808 | 29830 | 8 |
1 | 3304808 | 29831 | 8 |
1 | 3304808 | 29832 | 8 |
0 | 3304808 | 38292 | 9 |
1 | 3393049 | 5844 | 7 |
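If rows like these are to be fed to the FFM example above, they need to be turned into libffm-style text, where each line reads `label field:feature:value`. The encoder below is only a sketch of one possible mapping: the field order, the `feature_index` dictionary and the constant value 1 for every categorical feature are assumptions, not the original pipeline.

```python
rows = [
    (1, 3304808, 29830, 8),
    (1, 3304808, 29831, 8),
    (0, 3304808, 38292, 9),
    (1, 3393049, 5844, 7),
]

FIELDS = ("user_id", "item_id", "cat_id")
feature_index = {}  # (field name, raw value) -> consecutive feature id

def to_libffm(row):
    """Encode one (click, user_id, item_id, cat_id) row as 'label field:feature:1 ...'."""
    label, *values = row
    parts = [str(label)]
    for field_id, (name, value) in enumerate(zip(FIELDS, values)):
        # assign the next free id the first time a (field, value) pair is seen
        feat = feature_index.setdefault((name, value), len(feature_index))
        parts.append(f"{field_id}:{feat}:1")
    return " ".join(parts)

lines = [to_libffm(r) for r in rows]
print("\n".join(lines))
```

The same `feature_index` must be reused at prediction time so that a given user, item or category always maps to the same feature id.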
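5.2.2 Data collection
Data is collected in the format shown above and stored in MongoDB.
5.2.3 Model update
1. Once the model has been built, use it to score all of a user's candidate items in one pass, sort them by predicted score, and store them in a Redis queue.
2. When the queue holds fewer than a set number of items, call the prediction module again, deduplicate the results, and refill the Redis queue.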
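The two steps above can be sketched as follows. A `collections.deque` stands in for the Redis queue so the example runs without a server. `predict_scores` is a hypothetical placeholder for the trained model, and `MIN_QUEUE_LEN` is an assumed threshold.

```python
from collections import deque

MIN_QUEUE_LEN = 3  # refill threshold (assumed value)

def predict_scores(user_id, item_ids):
    """Hypothetical stand-in for the trained model: returns {item_id: score}."""
    return {item: 1.0 / (1 + item % 7) for item in item_ids}

def refill(queue, user_id, candidate_items, already_queued):
    """Score candidates, drop ones queued before, append the rest best-first."""
    fresh = [i for i in candidate_items if i not in already_queued]
    scores = predict_scores(user_id, fresh)
    for item in sorted(fresh, key=scores.get, reverse=True):
        queue.append(item)
        already_queued.add(item)

def next_items(queue, user_id, candidate_items, already_queued, n):
    """Pop up to n items; top the queue up first when it is running low."""
    if len(queue) < max(n, MIN_QUEUE_LEN):
        refill(queue, user_id, candidate_items, already_queued)
    return [queue.popleft() for _ in range(min(n, len(queue)))]

queue, seen = deque(), set()
catalog = [29830, 29831, 29832, 38292, 5844]
first = next_items(queue, 3304808, catalog, seen, 2)
second = next_items(queue, 3304808, catalog, seen, 2)
print(first, second)
```

With Redis the deque operations would map onto list commands such as RPUSH, LPOP and LLEN, with one list per user.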
5.2.4 Prediction data
Note: when there is not much data and the recommendations are poor, it helps to mix in some popular items.
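One simple way to mix in popular items is to interleave them at a fixed interval. The function name `blend_with_popular`, the interval of 3 and the item ids below are assumptions for illustration. The notes do not specify a blending rule.

```python
def blend_with_popular(personalized, popular, every=3):
    """Insert one not-yet-listed popular item after every `every` personalized items."""
    result, used = [], set(personalized)
    hot = (item for item in popular if item not in used)  # skip items already recommended
    for pos, item in enumerate(personalized, start=1):
        result.append(item)
        if pos % every == 0:
            extra = next(hot, None)
            if extra is not None:
                result.append(extra)
    return result

mixed = blend_with_popular([11, 12, 13, 14, 15, 16], popular=[12, 90, 91])
print(mixed)
```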
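5.2.5 Popular-item design
5.3 How to tell what a user has already read and avoid recommending it again
Use a Redis Bloom filter to check whether an item has already been recorded for the user.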
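To show what the Bloom filter check does, here is a small in-process sketch. It is not the Redis implementation. The class, the key format `user:<id>:item:<id>` and the sizes are assumptions for illustration. A Bloom filter never misses an item that was added, but it can occasionally report an unseen item as seen, so a small share of unread items may be filtered out by mistake.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # derive num_hashes bit positions by salting the key with the hash index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )

read = BloomFilter()
read.add("user:3304808:item:29830")  # mark the item as read

candidates = [29830, 29831]
unread = [i for i in candidates if f"user:3304808:item:{i}" not in read]
print(unread)
```

In Redis the equivalent add and membership checks are typically issued through the Bloom filter module's own commands, with the filter living server-side so all application instances share it.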
Appendix
1. BPR algorithm; pointwise, pairwise and listwise ranking
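BPR is a pairwise approach: for each user it compares an item the user interacted with against one they did not, and penalizes the model when the first is not scored higher. A minimal sketch of the per-triple loss, -ln σ(s_pos - s_neg), with the regularization term left out. The function name and the example scores are assumptions.

```python
import math

def bpr_loss(score_pos, score_neg):
    """-ln(sigmoid(score_pos - score_neg)) for one (user, clicked, not-clicked) triple."""
    diff = score_pos - score_neg
    return math.log(1.0 + math.exp(-diff))

good = bpr_loss(2.0, 0.5)  # clicked item already ranked higher: small loss
bad = bpr_loss(0.5, 2.0)   # clicked item ranked lower: large loss
print(good, bad)
```

A pointwise loss would instead score each (user, item) pair against its own label, and a listwise loss would look at the whole ranked list at once.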