Notes on Machine Learning Recommender Systems

A record of my machine learning study.

git clone problems

git config --global http.lowSpeedLimit 0
git config --global http.lowSpeedTime 999999
git config --global http.postBuffer 50024288000

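January 15, 2020: to prepare for the Xidian University postgraduate re-examination, I am doing a machine learning graduation project. I never studied this as an undergraduate and am starting from zero, so I am keeping notes here. I have also forgotten the projects I did as an undergraduate, so this is a way to build good habits as well.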

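Machine learning pipeline

Data -> machine learning algorithm -> intelligence

Learning goals

Install Python, IPython Notebook, and GraphLab Create
Start IPython Notebook
Write variables, functions, and loops in Python
Use SFrame for basic data manipulation in Python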

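Installing GraphLab Create from the command line

The final step is to install GraphLab Create from the Anaconda Prompt.

Basic Python syntax

In the notebook: import graphlab

GraphLab Canvas

Visualization
After loading a dataset, call sf.show() to visualize the data.
Canvas redirection, to render the visualization inside the IPython notebook:
graphlab.canvas.set_target('ipynb')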

The apply transform function

column.apply(custom_function)
sf['Country'].apply(transform_country)
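As a rough illustration of what such a transform function might look like, here is a plain-Python sketch. The body of `transform_country`, the alias table, and the sample values are all assumptions, since the notes do not show them. A plain list stands in for the SFrame column; with GraphLab Create the same function would be passed to `sf['Country'].apply(...)`.

```python
# Hypothetical clean-up function; the real transform_country is not shown in these notes.
def transform_country(country):
    """Normalise a country name: strip whitespace and map a known alias."""
    country = country.strip()
    # Assumed alias table, for illustration only.
    aliases = {'USA': 'United States'}
    return aliases.get(country, country)


# A plain list stands in for the SFrame column sf['Country'].
countries = [' USA', 'China ', 'Brazil']
cleaned = [transform_country(c) for c in countries]
print(cleaned)
```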

Recommender systems

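Very popular items drown out every other signal. For example, the fact that everyone buys diapers does not mean I need them. Such recommendations lack personalization, so the matrix entries for overly popular items need to be normalized.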
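One common way to do this normalization is Jaccard similarity, which divides the number of shared users by the number of users who chose either item. The sketch below assumes the matrix in question is an item co-occurrence matrix; the purchase data and item names are made up for illustration.

```python
def jaccard_similarity(users_a, users_b):
    """Shared users divided by users who chose either item."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / float(len(union))


# Made-up purchase data: item -> set of users who bought it.
purchases = {
    'diapers': {'u1', 'u2', 'u3', 'u4', 'u5', 'u6'},
    'guitar_strings': {'u5', 'u6'},
    'guitar_picks': {'u5', 'u6'},
}

strings = purchases['guitar_strings']
# Raw co-occurrence counts: the popular item ties with the truly related one.
raw_diapers = len(strings & purchases['diapers'])
raw_picks = len(strings & purchases['guitar_picks'])
# Normalized scores: the popular item is pushed down.
norm_diapers = jaccard_similarity(strings, purchases['diapers'])
norm_picks = jaccard_similarity(strings, purchases['guitar_picks'])
print(raw_diapers, raw_picks)
print(norm_diapers, norm_picks)
```

With raw counts, diapers look as related to guitar strings as guitar picks do; after normalization, the picks score higher because almost everyone bought diapers.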

Putting it together = features + matrix factorization

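Performance metrics for recommender systems: recall and precision

Recall = liked items that were recommended / all liked items
Precision = liked items that were recommended / all recommended items
When recall is maximized (for example, by recommending everything), precision is poor
Optimal recommender: recall = precision = 1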
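The two definitions above can be sketched in a few lines of Python. The function name `precision_recall` and the item ids are illustrative assumptions, not part of GraphLab Create.

```python
def precision_recall(recommended, liked):
    """Return (precision, recall) for one user's recommendations."""
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)  # liked items that were recommended
    precision = hits / float(len(recommended)) if recommended else 0.0
    recall = hits / float(len(liked)) if liked else 0.0
    return precision, recall


liked = {'a', 'b', 'c', 'd'}

# A short recommendation list: 2 of 5 recommendations are liked.
p, r = precision_recall(['a', 'b', 'x', 'y', 'z'], liked)

# Recommending the whole (made-up) catalog maximizes recall but hurts precision.
catalog = ['a', 'b', 'c', 'd', 'w', 'x', 'y', 'z']
p_all, r_all = precision_recall(catalog, liked)
print(p, r, p_all, r_all)
```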

import graphlab
song_data = graphlab.SFrame('song_data.gl/')
song_data.head()
song_data['song'].show()
len(song_data)
users=song_data['user_id'].unique()
len(users)
train_data,test_data=song_data.random_split(.8,seed=0)
popularity_model=graphlab.popularity_recommender.create(train_data,user_id='user_id',item_id='song')
popularity_model.recommend(users=[users[0]])
personalized_model=graphlab.item_similarity_recommender.create(train_data,user_id='user_id',item_id='song')
personalized_model.recommend(users=[users[0]])
model_performance=graphlab.compare(test_data,[popularity_model,personalized_model],user_sample=0.05)
graphlab.show_comparison(model_performance,[popularity_model,personalized_model])

NetEase Cloud Music UID

http://192.168.3.2:3000/v1/likelist?uid=645954254

The response contains the music IDs

{"ids":[18638059,26217117],"checkPoint":1582996860453,"code":200}
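A minimal sketch of pulling the song ids out of a response like the one above, using only the standard `json` module. The variable names are assumptions; the response text is the sample from these notes, written with straight quotes so that it is valid JSON.

```python
import json

# Sample response body copied from the notes above.
response_text = '{"ids":[18638059,26217117],"checkPoint":1582996860453,"code":200}'

payload = json.loads(response_text)
# Only trust the ids when the API reports success.
if payload.get('code') == 200:
    song_ids = payload['ids']
else:
    song_ids = []
print(song_ids)
```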

Fetching NetEase Cloud Music's daily recommended songs

Log in first:
http://192.168.3.2:3000/v1/login/cellphone?phone=17772002134&password=614919799
Then fetch:
http://192.168.3.2:3000/v1/recommend/songs

Fetching song details has problems, but similar songs can be fetched

http://192.168.3.2:3000/v1/simi/song?id=26217117

Backup API for fetching song details:

https://api.imjad.cn/cloudmusic.md
https://api.imjad.cn/cloudmusic/?type=detail&id=26217117

Building the music recommender system

Recommender system build 1

Data acquisition

# -*- coding:utf-8 -*-
"""
Crawler that fetches NetEase Cloud Music playlist data and saves each playlist as a JSON file.
Python 2.7 environment.
"""
import sys
reload(sys)
# Work around character-encoding problems (Python 2 only)
sys.setdefaultencoding('utf-8')
import os
os.environ['NLS_LANG'] = 'Simplified Chinese_CHINA.ZHS16GBK'
import requests
import json
import base64
import binascii
import urllib
import urllib2
from Crypto.Cipher import AES
from bs4 import BeautifulSoup


class NetEaseAPI:
    def __init__(self):
        self.header = {
            'Host': 'music.163.com',
            'Origin': 'https://music.163.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
            'Accept': 'application/json, text/javascript',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded',
        }
        self.cookies = {'appver': '1.5.2'}
        self.playlist_class_dict = {}
        self.session = requests.Session()

    def _http_request(self, method, action, query=None, urlencoded=None, callback=None, timeout=None):
        connection = json.loads(self._raw_http_request(method, action, query, urlencoded, callback, timeout))
        return connection

    def _raw_http_request(self, method, action, query=None, urlencoded=None, callback=None, timeout=None):
        if method == 'GET':
            # Pass the headers by keyword; the second positional argument is the POST body
            request = urllib2.Request(action, headers=self.header)
            response = urllib2.urlopen(request)
            connection = response.read()
        elif method == 'POST':
            data = urllib.urlencode(query)
            request = urllib2.Request(action, data, self.header)
            response = urllib2.urlopen(request)
            connection = response.read()
        return connection

    @staticmethod
    def _aes_encrypt(text, secKey):
        pad = 16 - len(text) % 16
        text = text + chr(pad) * pad
        encryptor = AES.new(secKey, 2, '0102030405060708')
        ciphertext = encryptor.encrypt(text)
        ciphertext = base64.b64encode(ciphertext).decode('utf-8')
        return ciphertext

    @staticmethod
    def _rsa_encrypt(text, pubKey, modulus):
        text = text[::-1]
        rs = pow(int(binascii.hexlify(text), 16), int(pubKey, 16), int(modulus, 16))
        return format(rs, 'x').zfill(256)

    @staticmethod
    def _create_secret_key(size):
        return (''.join(map(lambda xx: (hex(ord(xx))[2:]), os.urandom(size))))[0:16]

    def get_playlist_id(self, action):
        request = urllib2.Request(action, headers=self.header)
        response = urllib2.urlopen(request)
        html = response.read().decode('utf-8')
        response.close()
        soup = BeautifulSoup(html, 'lxml')
        list_url = soup.select('ul#m-pl-container li div a.msk')
        for k, v in enumerate(list_url):
            list_url[k] = v['href'][13:]
        return list_url

    def get_playlist_detail(self, id):
        text = {
            'id': id,
            'limit': '100',
            'total': 'true'
        }
        text = json.dumps(text)
        nonce = '0CoJUm6Qyw8W8jud'
        pubKey = '010001'
        modulus = ('00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7'
                   'b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280'
                   '104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932'
                   '575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b'
                   '3ece0462db0a22b8e7')
        secKey = self._create_secret_key(16)
        encText = self._aes_encrypt(self._aes_encrypt(text, nonce), secKey)
        encSecKey = self._rsa_encrypt(secKey, pubKey, modulus)

        data = {
            'params': encText,
            'encSecKey': encSecKey
        }
        action = 'http://music.163.com/weapi/v3/playlist/detail'
        playlist_detail = self._http_request('POST', action, data)

        return playlist_detail


if __name__ == '__main__':
    nn = NetEaseAPI()

    index = 1
    for flag in range(1, 38):
        if flag > 1:
            page = (flag - 1) * 35
            url = 'http://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=' + str(
                page)
        else:
            url = 'http://music.163.com/discover/playlist'
        playlist_id = nn.get_playlist_id(url)
        for item_id in playlist_id:
            playlist_detail = nn.get_playlist_detail(item_id)

            with open('data/{0}.json'.format(index), 'w') as file_obj:
                json.dump(playlist_detail, file_obj, ensure_ascii=False)
                index += 1
                print("写入json文件:", item_id)
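The main loop above walks 37 listing pages of 35 playlists each, using an `offset` query parameter for every page after the first. The same URL construction can be sketched as a small standalone helper. The names `playlist_page_urls`, `BASE_URL`, and `PAGE_SIZE` are hypothetical; the URL pattern is copied from the script.

```python
BASE_URL = 'http://music.163.com/discover/playlist'
PAGE_SIZE = 35  # playlists per listing page, as in the script above


def playlist_page_urls(num_pages):
    """Build the listing-page URLs that the crawler loop visits."""
    urls = [BASE_URL]  # the first page needs no offset
    for page in range(1, num_pages):
        offset = page * PAGE_SIZE
        urls.append('{0}/?order=hot&cat=%E5%85%A8%E9%83%A8&limit={1}&offset={2}'.format(
            BASE_URL, PAGE_SIZE, offset))
    return urls


urls = playlist_page_urls(37)
print(len(urls), urls[1])
```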

Feature engineering and data preprocessing: extracting the feature information that is useful for this recommender system

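# -*- coding:utf-8-*-
"""
Preprocess the crawled JSON files for all NetEase Cloud Music playlists into CSV files.
Python 3.6 environment.
"""
# __future__ imports must come before any other import
from __future__ import (absolute_import, division, print_function, unicode_literals)
import io
import json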


def parse_playlist_item():
    """
    :return: None; each track is written as a userid,itemid,rating,timestamp row
    """
    file = io.open("neteasy_playlist_recommend_data.csv", 'a', encoding='utf8')
    for i in range(1, 1292):
        with io.open("{0}.json".format(i), 'r', encoding='UTF-8') as load_f:
            load_dict = json.load(load_f)
            try:
                for item in load_dict['playlist']['tracks']:
                    # playlist id # song id # score # datetime
                    line_result = [load_dict['playlist']['id'], item['id'], item['pop'], item['publishTime']]
                    for k, v in enumerate(line_result):
                        if k == len(line_result) - 1:
                            file.write(str(v))
                        else:
                            file.write(str(v) + ',')
                    file.write('\n')
            except Exception:
                print(i)
                continue
    file.close()


def parse_playlist_id_to_name():
    file = io.open("neteasy_playlist_id_to_name_data.csv", 'a', encoding='utf8')
    for i in range(1, 1292):
        with io.open("{0}.json".format(i), 'r', encoding='UTF-8') as load_f:
            load_dict = json.load(load_f)
            try:
                line_result = [load_dict['playlist']['id'], load_dict['playlist']['name']]
                for k, v in enumerate(line_result):
                    if k == len(line_result) - 1:
                        file.write(str(v))
                    else:
                        file.write(str(v) + ',')
                file.write('\n')
            except Exception:
                print(i)
                continue
    file.close()


def parse_song_id_to_name():
    file = io.open("neteasy_song_id_to_name_data.csv", 'a', encoding='utf8')
    for i in range(1, 1292):
        with io.open("{0}.json".format(i), 'r', encoding='UTF-8') as load_f:
            load_dict = json.load(load_f)
            try:
                for item in load_dict['playlist']['tracks']:
                    # song id # song name - artist name
                    line_result = [item['id'], item['name'] + '-' + item['ar'][0]['name']]
                    for k, v in enumerate(line_result):
                        if k == len(line_result) - 1:
                            file.write(str(v))
                        else:
                            file.write(str(v) + ',')
                    file.write('\n')
            except Exception:
                print(i)
                continue
    file.close()

parse_playlist_item()
parse_playlist_id_to_name()
parse_song_id_to_name()
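One caveat with joining fields by hand as above: a playlist or song name that itself contains a comma ends up split across extra columns when the file is read back. A possible alternative, sketched here for Python 3 with a made-up row, is to let the standard `csv` module handle the quoting. An in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up row: the name contains a comma on purpose.
row = ['26217117', 'Hello, World-Some Artist']

# Joining by hand yields three fields when split again.
naive_fields = ','.join(row).split(',')

# csv.writer quotes the field, so it survives a round trip.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(row)
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(naive_fields)
print(parsed)
```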

The Surprise recommendation library (playlist recommendation)

# -*- coding:utf-8-*-
"""
Recommend NetEase Cloud Music playlists with the KNN collaborative filtering algorithm from the Surprise library.
Python 2.7 environment.
"""
from __future__ import (absolute_import, division, print_function, unicode_literals)
import os
import csv
from surprise import KNNBaseline, Reader, KNNBasic, KNNWithMeans
from surprise import Dataset


def recommend_model():
    file_path = os.path.expanduser('neteasy_playlist_recommend_data.csv')
    # Specify the file format
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Load the data from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    # Compute similarities (between playlists, which play the "user" role here)

    train_set = music_data.build_full_trainset()
    print('开始使用协同过滤算法训练推荐模型...')
    algo = KNNBasic()
    algo.fit(train_set)
    return algo


def playlist_data_preprocessing():
    csv_reader = csv.reader(open('neteasy_playlist_id_to_name_data.csv'))
    id_name_dic = {}
    name_id_dic = {}
    for row in csv_reader:
        id_name_dic[row[0]] = row[1]
        name_id_dic[row[1]] = row[0]
    return id_name_dic, name_id_dic


def song_data_preprocessing():
    csv_reader = csv.reader(open('neteasy_song_id_to_name_data.csv'))
    id_name_dic = {}
    name_id_dic = {}
    for row in csv_reader:
        id_name_dic[row[0]] = row[1]
        name_id_dic[row[1]] = row[0]
    return id_name_dic, name_id_dic


def playlist_recommend_main():
    print("加载歌单id到歌单名的字典映射...")
    print("加载歌单名到歌单id的字典映射...")
    id_name_dic, name_id_dic = playlist_data_preprocessing()
    print("字典映射成功...")
    print('构建数据集...')
    algo = recommend_model()
    print('模型训练结束...')

    current_playlist_id = id_name_dic.keys()[102]  # playlist id (Python 2: keys() returns a list)
    print('当前的歌单id:' + current_playlist_id)

    current_playlist_name = id_name_dic[current_playlist_id]
    print("当前的歌单名字:")
    print(current_playlist_name)

    playlist_inner_id = algo.trainset.to_inner_uid(current_playlist_id)
    print('当前的歌单内部id:' + str(playlist_inner_id))

    playlist_neighbors = algo.get_neighbors(playlist_inner_id, k=10)
    playlist_neighbors_id = (algo.trainset.to_raw_uid(inner_id) for inner_id in playlist_neighbors)
    # Map playlist ids to playlist names
    playlist_neighbors_name = (id_name_dic[playlist_id] for playlist_id in playlist_neighbors_id)
    print("和歌单<", current_playlist_name, '> 最接近的10个歌单为:\n')
    for playlist_name in playlist_neighbors_name:
        print(playlist_name, name_id_dic[playlist_name])
playlist_recommend_main()

Output

加载歌单id到歌单名的字典映射...
加载歌单名到歌单id的字典映射...
字典映射成功...
构建数据集...
开始使用协同过滤算法训练推荐模型...
Computing the msd similarity matrix...
Done computing similarity matrix.
模型训练结束...
当前的歌单id:4879924824
当前的歌单名字:
【美剧】良医插曲BGM 第二季
当前的歌单内部id:812
和歌单< 【美剧】良医插曲BGM 第二季 > 最接近的10个歌单为:

良医BGM 3195822488
4869100193 4875075726
追逐繁星的孩子,梦里总有无尽星辰 4869100193
City pop ‖ 都市乐享主义 3133725493
【中世纪民谣】吟游诗人与时代挽歌 89963967
春日初告白 | 温暖男声,流进心底的阳光 3186322538
私人雷达|根据听歌记录为你打造 3136952023
起来 3079182188
[欧美私人订制] 最懂你的欧美推荐 每日更新35首 2829816518
「纯音」觅得一隅清净,花自尘埃出 3066614455
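The log line "Computing the msd similarity matrix..." refers to mean squared difference (MSD), the similarity measure Surprise uses by default for its KNN algorithms. As a rough sketch of the idea, not the library's own implementation, MSD averages the squared rating differences over the items two users have in common and converts that to a similarity as 1 / (msd + 1). The function name and the ratings below are made up.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two users given as {item_id: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # nothing in common, so no evidence of similarity
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / float(len(common))
    return 1.0 / (msd + 1.0)


# Here the "users" are playlists and the items are songs (made-up ratings).
playlist_a = {'s1': 4.0, 's2': 2.0}
playlist_b = {'s1': 5.0, 's2': 4.0, 's3': 1.0}
playlist_c = {'s9': 3.0}

sim_ab = msd_similarity(playlist_a, playlist_b)  # msd = (1 + 4) / 2 = 2.5
sim_ac = msd_similarity(playlist_a, playlist_c)  # no shared songs
print(sim_ab, sim_ac)
```

Playlists that agree exactly on their shared songs get similarity 1, and the score falls toward 0 as their ratings diverge.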

Recommender system build 2

A NetEase Cloud Music song recommender based on Word2Vec

# -*- coding:utf-8-*-
import os
import json
from random import shuffle
import multiprocessing
import gensim
import csv


def train_song2vec():
    """
    :return: None; trains a song2vec model on all playlists and saves it
    """
    songlist_sequence = []
    # Read the raw NetEase Cloud Music data
    for i in range(1, 1292):
        with open("{0}.json".format(i), 'r') as load_f:
            load_dict = json.load(load_f)
            parse_songlist_get_sequence(load_dict, songlist_sequence)

    # Use all CPU cores for training
    cores = multiprocessing.cpu_count()
    print('Using all {cores} cores'.format(cores=cores))
    print('Training word2vec model...')
    model = gensim.models.Word2Vec(sentences=songlist_sequence, size=150, min_count=3, window=7, workers=cores)
    print('Save model..')
    model.save('songVec.model')


def parse_songlist_get_sequence(load_dict, songlist_sequence):
    """
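    Parse the song ids out of one playlist.
    :param load_dict: raw dict for one playlist, including all of its tracks
    :param songlist_sequence: list that the shuffled song-id sequences are appended to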
    :return:
    """
    song_sequence = []
    for item in load_dict['playlist']['tracks']:
        try:
            song = [item['id'], item['name'], item['ar'][0]['name'], item['pop']]
            song_id, song_name, artist, pop = song
            song_sequence.append(str(song_id))
        except:
            print('song format error')

    for i in range(len(song_sequence)):
        shuffle(song_sequence)
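        # list() is required here: without the copy, every appended entry would be the same list object, so all sequences would be identical instead of independently shuffled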
        songlist_sequence.append(list(song_sequence))


def song_data_preprocessing():
    """
    Mapping between song ids and song names.
    :return: a dict from song id to song name, and a dict from song name to song id
    """
    csv_reader = csv.reader(open('neteasy_song_id_to_name_data.csv'))
    id_name_dic = {}
    name_id_dic = {}
    for row in csv_reader:
        id_name_dic[row[0]] = row[1]
        name_id_dic[row[1]] = row[0]
    return id_name_dic, name_id_dic


train_song2vec()

model_str = 'songVec.model'
# Load the word2vec model
model = gensim.models.Word2Vec.load(model_str)
id_name_dic, name_id_dic = song_data_preprocessing()

# song_id_list = list(id_name_dic.keys())[4000:5000:200]
song_id_list = id_name_dic.keys()[1000:1500:50]  # sample every 50th song id (Python 2: keys() returns a list)
for song_id in song_id_list:
    result_song_list = model.most_similar(song_id)
    print(song_id)
    print(json.dumps(id_name_dic[song_id],encoding='UTF-8',ensure_ascii=False))
    print('\n相似歌曲和相似度分别为:')
    for song in result_song_list:
        print(json.dumps(id_name_dic[song[0]],encoding='UTF-8', ensure_ascii=False))
        print(song[1])
        #print('\t' + id_name_dic[song[0]].encode('utf-8'), song[1])
    print('\n')
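The `list(...)` copy in `parse_songlist_get_sequence` above matters because `shuffle` reorders a list in place. A small standalone sketch with made-up song ids shows the difference between appending the list itself and appending a copy.

```python
from random import shuffle

songs = ['1', '2', '3', '4']

# Without a copy: the same list object is appended every time,
# so every entry shows whatever the final shuffle produced.
aliased = []
for _ in range(3):
    shuffle(songs)
    aliased.append(songs)

# With a copy: each entry is an independent snapshot of one shuffle.
copied = []
for _ in range(3):
    shuffle(songs)
    copied.append(list(songs))

print(all(seq is songs for seq in aliased))  # every entry is the same object
print(copied[0] is copied[1])                # snapshots are distinct objects
```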

Output

Using all 4 cores
Training word2vec model...
Save model..
420513125
"レイディ・ブルース-LUCKY TAPES"

相似歌曲和相似度分别为:
"关键词-林俊杰"
0.626468360424
"水星记-郭顶"
0.621081233025
"鱼仔(Cover:卢广仲)-是你的垚"
0.619691371918
"嚣张-en"
0.617167830467
"世间美好与你环环相扣-柏松"
0.615520179272
"嗜好-颜人中"
0.614804267883
"大眠 (完整版)-小乐哥"
0.613656818867
"全部都是你-DP龙猪"
0.612156033516
"蓝-石白其"
0.612105309963
"I Know You Know I Love You-落日飞车"
0.61181396246


1422705673
"Past Lives(Cover:BØRNS)-孙圳翰"

相似歌曲和相似度分别为:
"我还想她(Cover:林俊杰)-Uu"
0.590772628784
"所念皆星河-CMJ"
0.578336238861
"蓝-石白其"
0.576055586338
"零几年听的情歌-AY楊佬叁"
0.575863420963
"你要相信这不是最后一天-华晨宇"
0.56902128458
"七日seven days-小野道ono"
0.567467391491
"The truth that you leave-Pianoboy高至豪"
0.562936365604
"椿-沈以诚"
0.560849130154
"世间美好与你环环相扣-柏松"
0.559100747108
"克卜勒-孙燕姿"
0.55870193243


18790760
"Wildest Moments-Jessie Ware"

相似歌曲和相似度分别为:
"第三人称-Todd Li"
0.542654037476
"MELANCHOLY-White Cherry"
0.536370813847
"10%-SynBlazer"
0.53133648634
"Roundabout-Yes"
0.531289100647
"1%-Oscar Scheller"
0.528999328613
"后来-刘若英"
0.526810228825
"Creep-Gamper & Dadoni"
0.523189365864
"Kitarman-Ghulamjan Yakup"
0.521900355816
"大眠 (完整版)-小乐哥"
0.521808922291
"我想以世纪和你在一起-棱镜"
0.520630300045


4898223
"Universe Song-土岐麻子"

相似歌曲和相似度分别为:
"Adagio for Summer Wind-清水準一"
0.671902596951
"无人之岛-任然"
0.653309166431
"DJ DJ给我一条K (DJ抖音版)-安筱冷"
0.653192460537
"Something Just Like This (Megamix)-AnDyWuMUSICLAND"
0.650882899761
"My Heart Will Go On-满舒克"
0.649816930294
"Summer-久石譲"
0.649288952351
"冬眠-司南"
0.645800054073
"I Want You To Know (Hella x Pegato Remix) -Pegato"
0.641312420368
"所念皆星河-CMJ"
0.641280949116
"Neon Rainbow (feat. Anna Yvette)-Rameses B"
0.641224443913


1342678507
"Flying Saucer-Shlump"

相似歌曲和相似度分别为:
"The Lost Ballerina (Radio Edit)-Fiona Joy Hawkins"
0.587948739529
"水星记-郭顶"
0.582286775112
"Cyka Blyat-DJ Blyatman"
0.579136252403
"Crusade-Marshmello"
0.575079441071
"蓝-石白其"
0.570292174816
"愿你余生漫长-王贰浪"
0.570254147053
"你要相信这不是最后一天-华晨宇"
0.566699206829
"大课间跑步音乐-群星"
0.566686093807
"See You Again-Wiz Khalifa"
0.565815865993
"想*-林宥嘉"
0.563404619694


558572724
"刘若英-后来(水潇 Remix)-二狗村高富帅"

相似歌曲和相似度分别为:
"Little Girl (As Featured in \"Unbroken: Path to Redemption\" Film)-Andrea Litkei"
0.561440587044
"大城小爱-王力宏"
0.548691391945
"You Look Lovely-音乐治疗"
0.544121265411
"关山酒-等什么君"
0.543048799038
"囍(Chinese Wedding)-葛东琪"
0.53617054224
"红色高跟鞋-蔡健雅"
0.534194111824
"世间美好与你环环相扣-柏松"
0.532548666
"Dancing With Your Ghost-Sasha Sloan"
0.532440066338
"The Way I Still Love You-Reynard Silva"
0.532300829887
"Be My Mistake-The 1975"
0.532189667225


574274427
"Whip Blow-Yuji Kondo"

相似歌曲和相似度分别为:
"那女孩对我说 (完整版)-Uu"
0.466711193323
"Che m'importa del mondo-Rita Pavone"
0.460101604462
"荒野魂斗罗 (Live)-华晨宇"
0.444359987974
"Monsters (Live)-周深"
0.440917819738
"寒鸦少年 (Live)-华晨宇"
0.43233910203
"Reality-Lost Frequencies"
0.418823868036
"寒鸦少年-华晨宇"
0.418192714453
"吹梦到西洲(四合院版本)(Cover:恋恋故人难)-四只烤翅"
0.4164057374
"Sayama Rain 2?(?Demo)-The Nature Sounds Society Japan"
0.415658026934
"神树 (Live)-华晨宇"
0.410130620003


22637718
"どうか届きますように-SMAP"

相似歌曲和相似度分别为:
"星屑ビーナス-Aimer"
0.52750056982
"There For You-Martin Garrix"
0.522041022778
"Blanc-Sylvain Chauveau"
0.503010869026
"Pyro-Chester Young"
0.50101774931
"21 Miles-MY FIRST STORY"
0.499602258205
"“露を吸う群”-増田俊郎"
0.498453527689
"我-张国荣"
0.495759695768
"Manta-刘柏辛Lexie"
0.495319634676
"ᐇ-Seto"
0.495067447424
"Cyka Blyat-DJ Blyatman"
0.493984639645


1376148033
"爱要坦荡荡(Cover:萧潇)-小天才鸭"

相似歌曲和相似度分别为:
"那女孩对我说 (完整版)-Uu"
0.548079669476
"intro (w rook1e)-barnes blvd."
0.540413379669
"无人之岛 (Cover:任然)-是你的垚"
0.517935693264
"You Look Lovely-音乐治疗"
0.51715862751
"pure imagination-ROOK1E"
0.515536487103
"I Want You To Know (Hella x Pegato Remix) -Pegato"
0.514691889286
"Wonderful World-ChakYoun9"
0.51469117403
"GOOD NIGHT-Lil Ghost小鬼"
0.514522790909
"好几年-刘心"
0.512091457844
"7 %-XMASwu"
0.511670172215


1407561335
"去追一只鹿-万象凡音"

相似歌曲和相似度分别为:
"Late summer-周涵"
0.60853689909
"星茶会-灰澈"
0.604638457298
"嚣张-en"
0.602224886417
"最甜情歌-红人馆"
0.601751744747
"蓝-石白其"
0.60148859024
"Monody (Radio Edit)-TheFatRat"
0.601218402386
"My Heart Will Go On-满舒克"
0.600651681423
"只是太爱你-丁芙妮"
0.599411010742
"The rain-Vsun"
0.598190009594
"Frisbee-Ahxello"
0.59788608551
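The scores printed above are cosine similarities between song vectors, which is what gensim's `most_similar` ranks by. The sketch below reimplements the idea in plain Python on made-up two-dimensional vectors. The function names and vectors are illustrative assumptions, not the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, topn=2):
    """Rank every other id by cosine similarity to the query vector."""
    scores = [(other, cosine_similarity(vectors[query_id], vec))
              for other, vec in vectors.items() if other != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up song vectors; real ones have 150 dimensions in the script above.
vectors = {'a': [1.0, 0.0], 'b': [1.0, 1.0], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}
result = most_similar('a', vectors)
print(result)
```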
