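The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from roughly 6,000 users. It is split into three tables: ratings, user information, and movie information. Each table is stored as a .dat file.
Reading the three tables: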
#coding=gbk
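# The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from roughly 6,000 users.
# It is split into three tables: ratings, user information, and movie information, all stored as .dat files,
# so each table can be read into its own pandas DataFrame with pandas.read_table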
import pandas as pd
import time
start = time.perf_counter()  # time.clock() was removed in Python 3.8
filename1 = r'D:\datasets\users.dat'
filename2 = r'D:\datasets\ratings.dat'
filename3 = r'D:\datasets\movies.dat'
pd.options.display.max_rows = 10
uname = ['user_id','gender','age','occupation','zip']
users = pd.read_table(filename1, sep='::', header=None, names=uname, engine='python')
print(users.head()) # age and occupation are given as codes, not as literal values
# user_id gender age occupation zip
# 0 1 F 1 10 48067
# 1 2 M 56 16 70072
# 2 3 M 25 15 55117
# 3 4 M 45 7 02460
# 4 5 M 25 20 55455
print(users.shape) # (6040, 5)
rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table(filename2, header =None, sep='::',names=rnames, engine= 'python')
print(ratings.head())
# user_id movie_id rating timestamp
# 0 1 1193 5 978300760
# 1 1 661 3 978302109
# 2 1 914 3 978301968
# 3 1 3408 4 978300275
# 4 1 2355 5 978824291
# print(ratings.shape) #(1000209, 4)
mnames = ['movie_id','title','genres'] # genres lists the genres a movie belongs to, separated by |
movies = pd.read_table(filename3, header = None, sep='::', names = mnames, engine='python')
# print(movies.head())
# movie_id title genres
# 0 1 Toy Story (1995) Animation|Children's|Comedy
# 1 2 Jumanji (1995) Adventure|Children's|Fantasy
# 2 3 Grumpier Old Men (1995) Comedy|Romance
# 3 4 Waiting to Exhale (1995) Comedy|Drama
# 4 5 Father of the Bride Part II (1995) Comedy
# print(movies.shape) #(3883, 3)
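The reading step above can be tried without the real files. The following is a minimal sketch, assuming an in-memory string stands in for users.dat; the three rows are copied from the sample output above, and the names `sample` and `users_sample` are hypothetical. It uses `pd.read_csv`, which accepts the same `sep`, `header`, `names`, and `engine` arguments as `pd.read_table`. The `dtype` argument is an addition of this sketch: in such a tiny sample every ZIP code is all digits, so without it pandas would parse the column as integers and drop the leading zero of 02460.

```python
import io

import pandas as pd

# Three lines in the users.dat layout: fields are separated by '::'
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"
columns = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator needs the Python parsing engine
users_sample = pd.read_csv(
    io.StringIO(sample),
    sep='::',
    header=None,          # the file has no header row
    names=columns,        # so column names are supplied explicitly
    engine='python',
    dtype={'zip': str},   # assumption of this sketch: keep ZIP codes as text
)

print(users_sample)
print(users_sample.shape)
```

If reading one of the real files raises a `UnicodeDecodeError`, passing an explicit `encoding` argument (for example `encoding='latin-1'`) may help. That is an assumption about the files, not something the original run shows.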
Age and occupation are given as codes:
- Age is chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
- Occupation is chosen from the following choices:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
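The code tables above can be turned into readable labels with `Series.map`. This is a minimal sketch, not part of the original script: the dictionaries are transcribed from the lists above, the five rows repeat the `users.head()` output, and the names `age_labels`, `occupation_labels`, `users_sample`, `age_group`, and `occupation_name` are hypothetical.

```python
import pandas as pd

# Code-to-label tables transcribed from the dataset description above
age_labels = {
    1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
    45: '45-49', 50: '50-55', 56: '56+',
}
occupation_labels = {
    0: 'other or not specified', 1: 'academic/educator', 2: 'artist',
    3: 'clerical/admin', 4: 'college/grad student', 5: 'customer service',
    6: 'doctor/health care', 7: 'executive/managerial', 8: 'farmer',
    9: 'homemaker', 10: 'K-12 student', 11: 'lawyer', 12: 'programmer',
    13: 'retired', 14: 'sales/marketing', 15: 'scientist',
    16: 'self-employed', 17: 'technician/engineer',
    18: 'tradesman/craftsman', 19: 'unemployed', 20: 'writer',
}

# The first five users, as shown by users.head()
users_sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [1, 56, 25, 45, 25],
    'occupation': [10, 16, 15, 7, 20],
})

# map() looks up each code in the dictionary; unknown codes would become NaN
users_sample['age_group'] = users_sample['age'].map(age_labels)
users_sample['occupation_name'] = users_sample['occupation'].map(occupation_labels)
print(users_sample)
```

Keeping the original code columns next to the labels leaves the numeric codes available for grouping and sorting.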
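Merging the three tables with the merge function
# Merge the three tables; with no `on` argument, pandas joins on the column names the two tables share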
data = pd.merge(pd.merge(ratings, users), movies)
# print(data.head())
# user_id movie_id rating timestamp gender age occupation zip \..
# 0 1 1193 5 978300760 F 1 10 48067
# 1 2 1193 5 978298413 M 56 16 70072
# 2 12 1193 4 978220179 M 25 12 32793
# 3 15 1193 4 978199279 M 25 7 22903
# 4 17 1193 5 978158471 M 50 1 95350
# print(data.iloc[0])
# user_id 1
# movie_id 1193
# rating 5
# timestamp 978300760
# gender F
# age 1
# occupation 10
# zip 48067
# title One Flew Over the Cuckoo's Nest (1975)
# genres Drama
# Name: 0, dtype: object
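The nested `pd.merge` call relies on pandas picking the join keys from overlapping column names. The sketch below, on a few rows taken from the output above, compares that implicit form with one that names the keys explicitly. The frame names (`ratings_s`, `users_s`, `movies_s`) and the explicit `on=` / `how=` spelling are assumptions of this sketch, not the author's code.

```python
import pandas as pd

# Tiny stand-ins for the three tables, using rows shown in the output above
ratings_s = pd.DataFrame({
    'user_id': [1, 2],
    'movie_id': [1193, 1193],
    'rating': [5, 5],
})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({
    'movie_id': [1, 1193],
    'title': ['Toy Story (1995)', "One Flew Over the Cuckoo's Nest (1975)"],
    'genres': ["Animation|Children's|Comedy", 'Drama'],
})

# Implicit keys: ratings/users share user_id, the result and movies share movie_id
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys and join type spelled out
explicit = (
    ratings_s
    .merge(users_s, on='user_id', how='inner')
    .merge(movies_s, on='movie_id', how='inner')
)

print(implicit)
print(implicit.equals(explicit))
```

Because the default join is an inner join, Toy Story drops out of the result here: no row in `ratings_s` refers to it. Naming the keys explicitly also guards against an unintended join if two tables ever share an extra column name.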
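Using a pivot table to compute each movie's mean rating by gender
# index sets the row labels, values is the column being aggregated, and columns takes one or more columns whose distinct values become the output columns (the groups)
mean_ratings = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')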
# print(mean_ratings.head())
# gender F M
# title
# $1,000,000 Duck (1971) 3.375000 2.761905
# 'Night Mother (1986) 3.388889 3.352941
# 'Til There Was You (1997) 2.675676 2.733333
# 'burbs, The (1989) 2.793478 2.962085
# ...And Justice for All (1979) 3.828571 3.689024
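To make the roles of `index`, `columns`, and `values` concrete, here is a minimal sketch on made-up toy data (the titles `A` and `B` and their ratings are illustrative only, not taken from MovieLens). It also shows the same table built with `groupby` plus `unstack`, which is one common way to think about what the pivot does.

```python
import pandas as pd

# Made-up ratings: movie A is rated by both genders, movie B only by women
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, each cell the mean rating
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Equivalent construction: group by both keys, then move gender into columns
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)    # A: F 4.0, M 3.0; B: F 4.0, M NaN
print(by_groupby)
```

The `NaN` for movie B under `M` is what a title with no ratings from one gender looks like, which is one reason to filter out rarely rated movies before comparing the two columns.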
Analyzing a filtered subset of the data
# Filter out movies with fewer than 250 ratings
ratings_by_title = data.groupby('title').size()
print(ratings_by_title[:3])
# title
# $1,000,000 Duck (1971) 37
# 'Night Mother (1986) 70
# 'Til There Was You (1997) 52
# dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250] # index labels of the titles with at least 250 ratings
print(active_titles[:3])
# Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
# '101 Dalmatians (1961)'],
# dtype='object', name='title')
# Use the titles in active_titles as row labels to select the matching rows of mean_ratings
mean_ratings = mean_ratings.loc[active_titles]
print(mean_ratings[:5])
# gender F M
# title
# 'burbs, The (1989) 2.793478 2.962085
# 10 Things I Hate About You (1999) 3.646552 3.311966
# 101 Dalmatians (1961) 3.791444 3.500000
# 101 Dalmatians (1996) 3.240000 2.911215
# 12 Angry Men (1957) 4.184397 4.328421
# To see the movies female viewers rated highest, sort by the F column in descending order
top_ratings = mean_ratings.sort_values(by="F", ascending=False)
print(top_ratings[:3])
# gender F M
# title
# Close Shave, A (1995) 4.644444 4.473795
# Wrong Trousers, The (1993) 4.588235 4.478261
# Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
# Find the movies on which male and female viewers disagree most (diff = M mean minus F mean)
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sort_by_diff = mean_ratings.sort_values(by='diff')
print(sort_by_diff[:3])
# gender F M diff
# title
# Dirty Dancing (1987) 3.790378 2.959596 -0.830782
# Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
# Grease (1978) 3.975265 3.367041 -0.608224
# Reverse the row order and take the first 3 rows: the movies that men rated higher than women did
print(sort_by_diff[::-1][:3])
# gender F M diff
# title
# Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
# Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
# Dumb & Dumber (1994) 2.697987 3.336595 0.638608
# Compute the standard deviation of each movie's ratings to find the most divisive movies, regardless of gender
rating_std = data.groupby('title')['rating'].std()
rating_std = rating_std.loc[active_titles]
print(rating_std.sort_values(ascending=False)[:3])
# title
# Dumb & Dumber (1994) 1.321333
# Blair Witch Project, The (1999) 1.316368
# Natural Born Killers (1994) 1.307198
# Name: rating, dtype: float64
end = time.perf_counter()  # time.clock() was removed in Python 3.8
spending_time = end - start
print('Elapsed time: %.2fs' % spending_time)
# Elapsed time: 11.13s
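The gender-gap ranking above is read from both ends of one sorted table. As a variation that is not in the original script, the sketch below ranks by the absolute value of the gap, so the largest disagreements in either direction come out in a single list. The F and M means are copied from the output above; the names `gap_sample` and `largest_gaps` are hypothetical.

```python
import pandas as pd

# Mean ratings copied from the sorted output above
gap_sample = pd.DataFrame(
    {
        'F': [3.790378, 3.975265, 3.494949, 2.697987],
        'M': [2.959596, 3.367041, 4.221300, 3.336595],
    },
    index=[
        'Dirty Dancing (1987)',
        'Grease (1978)',
        'Good, The Bad and The Ugly, The (1966)',
        'Dumb & Dumber (1994)',
    ],
)

# Same definition as in the script: positive means men rated the movie higher
gap_sample['diff'] = gap_sample['M'] - gap_sample['F']

# Rank by the size of the gap, ignoring its direction
largest_gaps = gap_sample['diff'].abs().sort_values(ascending=False)
print(largest_gaps)
```

The sign of `diff` is still available in `gap_sample` for telling which group preferred each movie.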