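Discretizing data with pandas
Task: given a movie dataset in which each movie can belong to several genres, count the number of movies in each genre.
Data format: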
Rank Title Genre \
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi
1 2 Prometheus Adventure,Mystery,Sci-Fi
2 3 Split Horror,Thriller
3 4 Sing Animation,Comedy,Family
4 5 Suicide Squad Action,Adventure,Fantasy
Description Director \
0 A group of intergalactic criminals are forced ... James Gunn
1 Following clues to the origin of mankind, a te... Ridley Scott
2 Three girls are kidnapped by a man with a diag... M. Night Shyamalan
3 In a city of humanoid animals, a hustling thea... Christophe Lourdelet
4 A secret government agency recruits some of th... David Ayer
Actors Year Runtime (Minutes) \
0 Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121
1 Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124
2 James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117
3 Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108
4 Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123
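Each Genre cell holds several comma-separated genres, so counting the raw column counts genre combinations rather than genres. The sketch below shows this on the five sample rows above. The `sample` frame is a hand-built stand-in for the real file, and only the Genre column matters here.

```python
import pandas as pd

# Hand-built stand-in for the real data: the five sample rows shown above.
sample = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Genre": [
        "Action,Adventure,Sci-Fi",
        "Adventure,Mystery,Sci-Fi",
        "Horror,Thriller",
        "Animation,Comedy,Family",
        "Action,Adventure,Fantasy",
    ],
})

# Counting the raw strings treats every combination as its own category.
combo_counts = sample["Genre"].value_counts()
print(combo_counts)

# Splitting first lets us ask how many movies mention a single genre.
genre_lists = sample["Genre"].str.split(",")
action_movies = genre_lists.apply(lambda genres: "Action" in genres).sum()
print(action_movies)
```

On these rows every combination occurs exactly once, even though two of the movies are Action movies. That is why the code that follows splits the strings and builds one 0/1 column per genre.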
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
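# File path (placeholder: point this at the movie CSV file)
file_path = 'data_file'
# Display settings: show up to 20 columns when printing
pd.set_option('display.max_columns', 20)
# Read the data
df = pd.read_csv(file_path)
# Split each Genre string on commas to get a list of lists: [[...], [...], ...]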
temp_list = df['Genre'].str.split(',').tolist()
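# print(temp_list)
# De-duplicate the genres to get a list of distinct genre names
genre_list = list(set([i for j in temp_list for i in j]))
# Build an all-zero DataFrame: one row per movie, one column per genre
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)
# print(zeros_df)
# Go through every row and set the columns of that movie's genres to 1
for i in range(df.shape[0]):
    # temp_list[i] holds the genres of movie i, which are exactly column labels of zeros_df,
    # so loc can set all of those positions to 1 at once
    zeros_df.loc[i, temp_list[i]] = 1
# print(zeros_df.head(3))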
# Sum each column to get the number of movies in each genre
genre_count = zeros_df.sum(axis=0)
print(genre_count.sort_values())
# Sort the counts (the commented-out lines below sort them and draw a bar chart)
# genre_count = genre_count.sort_values()
# _x = genre_count.index
# _y = genre_count.values
# # print(_x)
# # print("*"*100)
# # print(_y)
#
#
# # Plot a bar chart
# plt.figure(figsize=(20, 8), dpi=80)
# plt.bar(range(len(_x)), _y)
# plt.xticks(range(len(_x)), _x)
# plt.show()
Output:
Musical 5.0
Western 7.0
War 13.0
Music 16.0
Sport 18.0
History 29.0
Animation 49.0
Family 51.0
Biography 81.0
Fantasy 101.0
Mystery 106.0
Horror 119.0
Sci-Fi 120.0
Romance 141.0
Crime 150.0
Thriller 195.0
Adventure 259.0
Comedy 279.0
Action 303.0
Drama 513.0
dtype: float64
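The counts print as floats because the indicator frame was created with `np.zeros`. As an alternative to the manual loop, pandas can build the same 0/1 columns with `Series.str.get_dummies`, or count directly with `explode` followed by `value_counts`. The sketch below is not the method used above. The helper names `count_genres` and `count_genres_explode` are hypothetical, and the data is the five sample rows rather than the full file.

```python
import pandas as pd


def count_genres(genre_col: pd.Series, sep: str = ",") -> pd.Series:
    """Count how many rows mention each genre (hypothetical helper)."""
    # str.get_dummies splits on sep and returns one 0/1 column per distinct value.
    indicators = genre_col.str.get_dummies(sep=sep)
    return indicators.sum(axis=0).sort_values()


def count_genres_explode(genre_col: pd.Series, sep: str = ",") -> pd.Series:
    """Same counts via split + explode + value_counts (hypothetical helper)."""
    # explode turns each list element into its own row before counting.
    return genre_col.str.split(sep).explode().value_counts().sort_values()


# Genre column of the five sample rows shown in the data format section.
sample_genres = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
])

counts = count_genres(sample_genres)
alt_counts = count_genres_explode(sample_genres)
print(counts)
```

Both helpers return integer totals, so the result prints without the trailing `.0`. On the real data this would be called as `count_genres(df['Genre'])`. The two variants differ only if a single row repeats a genre: the indicator version counts that row once, while the `explode` version counts every occurrence.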