013 Discretizing Data

Discretizing data with pandas

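Task: given a movie dataset, count how many films fall into each genre. A film can list several genres in its Genre column, so it counts once toward each of them.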

Data format

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121   
1  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124   
2  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016                117   
3  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016                108   
4  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016                123   

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

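# Path to the data file (placeholder: replace with the actual CSV path)
file_path = '数据文件'
# Display option: show up to 20 columns when printing
pd.set_option('display.max_columns', 20)
# Read the data
df = pd.read_csv(file_path)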

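# Split each film's Genre string into a list, giving a list of lists: [[...], [...], ...]
temp_list = df['Genre'].str.split(',').tolist()
# print(temp_list)
# Flatten and deduplicate to get a list of the distinct genres
genre_list = list(set([i for j in temp_list for i in j]))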

# Build an all-zero DataFrame: one row per film, one column per genre
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)
# print(zeros_df)

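# For each film, set the columns of the genres it belongs to to 1
for i in range(df.shape[0]):
    # temp_list[i] holds the genres of film i, which are exactly column labels of zeros_df,
    # so loc can set all of those positions to 1 in a single assignment
    zeros_df.loc[i, temp_list[i]] = 1
# print(zeros_df.head(3))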

# Sum each column to get the number of films per genre
genre_count = zeros_df.sum(axis=0)
print(genre_count.sort_values())
# Sort before plotting
# genre_count = genre_count.sort_values()
# _x = genre_count.index
# _y = genre_count.values
# # print(_x)
# # print("*"*100)
# # print(_y)
#
#
# # Plot a bar chart
# plt.figure(figsize=(20, 8), dpi=80)
# plt.bar(range(len(_x)), _y)
# plt.xticks(range(len(_x)), _x)
# plt.show()
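
The zero matrix and the loop above build a 0/1 indicator column for every genre by hand. As a rough sketch of the same idea, pandas can produce that indicator matrix directly with Series.str.get_dummies. The toy DataFrame below is an assumption for illustration: it reuses only the first five rows of the sample data, so its counts are not the counts of the full dataset.

```python
import pandas as pd

# Toy data: the first five rows of the sample shown earlier (illustration only).
toy = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Genre": [
        "Action,Adventure,Sci-Fi",
        "Adventure,Mystery,Sci-Fi",
        "Horror,Thriller",
        "Animation,Comedy,Family",
        "Action,Adventure,Fantasy",
    ],
})

# One indicator column per genre: 1 if the film lists that genre, otherwise 0.
indicators = toy["Genre"].str.get_dummies(sep=",")

# Column sums give the number of films per genre.
genre_count = indicators.sum(axis=0).sort_values()
print(genre_count)
```

The explicit loop is still useful for seeing how the indicator matrix is filled in; get_dummies simply does the splitting, deduplication and assignment in one call.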

Result:

Musical        5.0
Western        7.0
War           13.0
Music         16.0
Sport         18.0
History       29.0
Animation     49.0
Family        51.0
Biography     81.0
Fantasy      101.0
Mystery      106.0
Horror       119.0
Sci-Fi       120.0
Romance      141.0
Crime        150.0
Thriller     195.0
Adventure    259.0
Comedy       279.0
Action       303.0
Drama        513.0
dtype: float64
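
The counts print as floats (5.0, 7.0, ...) because the indicator matrix was created with np.zeros, whose default dtype is float64. If only the counts are needed and not the 0/1 matrix itself, one possible shortcut is to split the genre strings, put each genre on its own row with explode, and count the values, which yields integers. The small Series below is an assumed toy input taken from the first five sample rows, not the full dataset.

```python
import pandas as pd

# Toy input: Genre values from the first five sample rows (illustration only).
genres = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
])

# Split into lists, give every genre its own row, then count occurrences.
counts = genres.str.split(",").explode().value_counts()
print(counts.sort_values())
```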
