Data Analysis Day 03
4. Data Cleaning with pandas
Handling missing data
- There are two kinds of missing values:
- None
- np.nan (NaN)
In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts  # financial data API package
import matplotlib.pyplot as plt
- The difference between the two kinds of missing values
In [2]:
type(np.nan)
Out[2]:
float
In [5]:
np.nan + 3
Out[5]:
nan
In [3]:
type(None)
Out[3]:
NoneType
- None and NaN in pandas
In [10]:
df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df
Out[10]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91 | 92 | 51 | 55 |
1 | 23 | 22 | 92 | 35 | 83 |
2 | 21 | 52 | 40 | 63 | 29 |
3 | 94 | 51 | 24 | 70 | 59 |
4 | 27 | 78 | 1 | 21 | 17 |
5 | 94 | 57 | 5 | 43 | 22 |
6 | 87 | 31 | 58 | 30 | 82 |
7 | 93 | 28 | 54 | 7 | 93 |
In [12]:
df.iloc[1,2] = None
df.iloc[3,4] = None
df.iloc[4,1] = None
df.iloc[7,4] = np.nan
In [13]:
df
Out[13]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
1 | 23 | 22.0 | NaN | 35 | 83.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
3 | 94 | 51.0 | 24.0 | 70 | NaN |
4 | 27 | NaN | 1.0 | 21 | 17.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
7 | 93 | 28.0 | 54.0 | 7 | NaN |
Null-value handling operations in pandas
- isnull
- notnull
- any
- all
- dropna
- fillna
In [16]:
df.isnull()
Out[16]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | True | False | False |
2 | False | False | False | False | False |
3 | False | False | False | False | True |
4 | False | True | False | False | False |
5 | False | False | False | False | False |
6 | False | False | False | False | False |
7 | False | False | False | False | True |
- Detect which rows of the original data contain null values
- any and all help us detect which rows/columns of the df contain null values
- isnull -> any(axis=1)
- notnull -> all(axis=1)
In [24]:
~df.isnull().any(axis=1)
df.loc[~df.isnull().any(axis=1)]
Out[24]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
In [28]:
df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]
Out[28]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
In [29]:
df.dropna(axis=0)  # drop the rows that contain null values
Out[29]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
In [32]:
df
Out[32]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
1 | 23 | 22.0 | NaN | 35 | 83.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
3 | 94 | 51.0 | 24.0 | 70 | NaN |
4 | 27 | NaN | 1.0 | 21 | 17.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
7 | 93 | 28.0 | 54.0 | 7 | NaN |
In [33]:
# fillna overwrites the null values
df.fillna(method='bfill',axis=0)  # fill each null with the adjacent (next-row) value
Out[33]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 91.0 | 92.0 | 51 | 55.0 |
1 | 23 | 22.0 | 92.0 | 35 | 83.0 |
2 | 21 | 52.0 | 40.0 | 63 | 29.0 |
3 | 94 | 51.0 | 24.0 | 70 | 29.0 |
4 | 27 | 51.0 | 1.0 | 21 | 17.0 |
5 | 94 | 57.0 | 5.0 | 43 | 22.0 |
6 | 87 | 31.0 | 58.0 | 30 | 82.0 |
7 | 93 | 28.0 | 54.0 | 7 | 82.0 |
Interview exercise
- Data description:
- The data are temperature readings from one cold-storage warehouse; columns 1-7 correspond to 7 temperature sensors, sampled once per minute.
- Processing goal:
- Using the 4 required sensors (columns 1-4), build a model of the warehouse's temperature field and estimate the readings of sensors 5-7.
- In the end each warehouse only needs 4 sensors installed instead of 7.
- f(1-4) --> y(5-7)
- Processing steps:
- 1. The raw data contain dropped frames, so they need preprocessing;
- 2. Plot with matplotlib;
- 3. Build a (logistic) regression model; a hedged sketch of steps 2-3 follows the preprocessing cells below.
- There is no single correct answer; work through it as you see fit and briefly describe your steps in writing.
- The test data are in testData.xlsx
In [5]:
df = pd.read_excel('./testData.xlsx')
df.drop(labels=['none','none1'],axis=1,inplace=True)
In [7]:
df.dropna(axis=0)
In [10]:
#isnull notnull any all
df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]
In [15]:
df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
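The original cells stop after the preprocessing step, so here is a minimal sketch of steps 2-3, assuming scikit-learn is available and that the seven device readings sit in columns labeled 1-7 after dropping the 'none' columns (the column labels are an assumption; the prompt asks for logistic regression, but since the targets are continuous temperatures this sketch swaps in LinearRegression).
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

clean = df.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0)

# Step 2: plot the temperature curves of all seven devices
clean.plot(figsize=(12, 4))
plt.show()

# Step 3: fit a model that predicts devices 5-7 from devices 1-4
X = clean[[1, 2, 3, 4]]   # required devices (assumed column labels)
y = clean[[5, 6, 7]]      # devices to be estimated (assumed column labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out 20%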
Handling duplicate data
In [20]:
df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df.iloc[1] = [6,6,6,6,6]
df.iloc[3] = [6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6]
df
Out[20]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 68 | 53 | 32 | 24 |
1 | 6 | 6 | 6 | 6 | 6 |
2 | 79 | 86 | 73 | 14 | 25 |
3 | 6 | 6 | 6 | 6 | 6 |
4 | 72 | 1 | 73 | 67 | 89 |
5 | 6 | 6 | 6 | 6 | 6 |
6 | 69 | 32 | 94 | 91 | 18 |
7 | 47 | 7 | 77 | 11 | 67 |
In [25]:
df.drop_duplicates(keep='first')
Out[25]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 44 | 68 | 53 | 32 | 24 |
1 | 6 | 6 | 6 | 6 | 6 |
2 | 79 | 86 | 73 | 14 | 25 |
4 | 72 | 1 | 73 | 67 | 89 |
6 | 69 | 32 | 94 | 91 | 18 |
7 | 47 | 7 | 77 | 11 | 67 |
Handling outlier data
- Create a 1000-row, 3-column (A, B, C) data source with values in the range 0-1, then clean out the outliers in column C, i.e. the values greater than twice the column's standard deviation
In [27]:
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
Out[27]:
A | B | C | |
---|---|---|---|
0 | 0.886005 | 0.831529 | 0.822956 |
1 | 0.391742 | 0.104039 | 0.434260 |
2 | 0.539745 | 0.950540 | 0.948526 |
3 | 0.392029 | 0.904831 | 0.680343 |
4 | 0.513764 | 0.979957 | 0.600518 |
In [30]:
# define the condition that flags a value as an outlier
twice_std = df['C'].std() * 2
twice_std
Out[30]:
0.570731429850527
In [34]:
# find which values in column C are outliers
df['C'] > twice_std
df.loc[df['C'] > twice_std]
indexs = df.loc[df['C'] > twice_std].index  # row index of the outliers
In [35]:
df.drop(labels=indexs)
Out[35]:
A | B | C | |
---|---|---|---|
1 | 0.391742 | 0.104039 | 0.434260 |
5 | 0.512951 | 0.897948 | 0.245320 |
6 | 0.473968 | 0.979213 | 0.271424 |
7 | 0.843319 | 0.038657 | 0.184559 |
8 | 0.982931 | 0.881284 | 0.208147 |
10 | 0.784656 | 0.314898 | 0.089802 |
13 | 0.624428 | 0.252411 | 0.327818 |
15 | 0.213042 | 0.969693 | 0.567275 |
18 | 0.710010 | 0.534330 | 0.559441 |
19 | 0.026479 | 0.736259 | 0.120431 |
20 | 0.990039 | 0.982449 | 0.017151 |
23 | 0.158157 | 0.183678 | 0.428155 |
25 | 0.604838 | 0.950466 | 0.294742 |
26 | 0.304136 | 0.822809 | 0.388579 |
28 | 0.671559 | 0.726631 | 0.196907 |
30 | 0.811249 | 0.751182 | 0.467697 |
31 | 0.376243 | 0.805516 | 0.287484 |
33 | 0.570442 | 0.797945 | 0.026182 |
35 | 0.467125 | 0.062123 | 0.439725 |
36 | 0.861741 | 0.413997 | 0.543973 |
38 | 0.955328 | 0.817003 | 0.293787 |
47 | 0.458014 | 0.228608 | 0.285172 |
49 | 0.931513 | 0.403981 | 0.239329 |
51 | 0.008178 | 0.484172 | 0.021373 |
53 | 0.253882 | 0.300069 | 0.561118 |
55 | 0.752559 | 0.685649 | 0.451692 |
56 | 0.003363 | 0.486893 | 0.154598 |
57 | 0.859653 | 0.569252 | 0.007432 |
58 | 0.327716 | 0.419704 | 0.452710 |
59 | 0.068403 | 0.029346 | 0.226587 |
... | ... | ... | ... |
953 | 0.247954 | 0.072558 | 0.038834 |
954 | 0.199553 | 0.193049 | 0.027725 |
956 | 0.513195 | 0.175896 | 0.254432 |
957 | 0.080261 | 0.476756 | 0.521142 |
958 | 0.944795 | 0.550317 | 0.336043 |
961 | 0.464895 | 0.592027 | 0.195383 |
962 | 0.127469 | 0.300982 | 0.309427 |
963 | 0.595242 | 0.139702 | 0.450026 |
964 | 0.520342 | 0.639537 | 0.209403 |
965 | 0.372687 | 0.117984 | 0.262849 |
966 | 0.007270 | 0.044250 | 0.533105 |
967 | 0.854830 | 0.512720 | 0.173844 |
968 | 0.247666 | 0.972284 | 0.227422 |
970 | 0.047074 | 0.714412 | 0.392280 |
974 | 0.112649 | 0.483324 | 0.125105 |
975 | 0.307405 | 0.875641 | 0.432340 |
978 | 0.520662 | 0.003040 | 0.412422 |
979 | 0.337178 | 0.540283 | 0.257443 |
981 | 0.877978 | 0.842195 | 0.448030 |
982 | 0.273752 | 0.063285 | 0.291012 |
985 | 0.765849 | 0.974933 | 0.253099 |
988 | 0.139305 | 0.570496 | 0.535778 |
989 | 0.597190 | 0.973190 | 0.177517 |
990 | 0.817945 | 0.183825 | 0.330112 |
991 | 0.738457 | 0.578425 | 0.032489 |
992 | 0.159229 | 0.544980 | 0.242586 |
994 | 0.300998 | 0.352331 | 0.434336 |
996 | 0.609123 | 0.491735 | 0.045738 |
998 | 0.839935 | 0.181189 | 0.121180 |
999 | 0.798840 | 0.939869 | 0.150332 |
5. Concatenating and Merging DataFrames
Concatenation
- pd.concat, pd.append
pandas provides pd.concat, which works like np.concatenate but takes a few extra parameters:
objs
axis=0
keys
join='outer' / 'inner': the concatenation mode; 'outer' keeps every label, matched or not, while 'inner' keeps only the labels shared by all objects
ignore_index=False
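The keys and ignore_index parameters are not demonstrated in the cells below, so here is a minimal, self-contained sketch (the small frames a and b are made up for illustration):
import pandas as pd
from pandas import DataFrame

a = DataFrame({'x': [1, 2]})
b = DataFrame({'x': [3, 4]})
pd.concat((a, b), axis=0, keys=['first', 'second'])  # keys adds an outer level to the row index
pd.concat((a, b), axis=0, ignore_index=True)         # ignore_index discards the old row labels: 0,1,2,3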
- Aligned (matching) concatenation
In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
In [2]:
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering'],
})
df1
Out[2]:
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
In [6]:
pd.concat((df1,df1,df1),axis=1)
Out[6]:
employee | group | employee | group | employee | group | |
---|---|---|---|---|---|---|
0 | Bob | Accounting | Bob | Accounting | Bob | Accounting |
1 | Jake | Engineering | Jake | Engineering | Jake | Engineering |
2 | Lisa | Engineering | Lisa | Engineering | Lisa | Engineering |
- Mismatched concatenation
- "Mismatched" means the indexes along the other dimension differ, e.g. the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally
- There are 2 join modes:
- Outer join: fill the gaps with NaN (the default)
- Inner join: keep only the matching labels
In [7]:
df1
Out[7]:
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
In [12]:
df2 = df1.copy()
df2.columns = ['employee','groups']
df2
Out[12]:
employee | groups | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
In [14]:
pd.concat((df1,df2),axis=0)
Out[14]:
employee | group | groups | |
---|---|---|---|
0 | Bob | Accounting | NaN |
1 | Jake | Engineering | NaN |
2 | Lisa | Engineering | NaN |
0 | Bob | NaN | Accounting |
1 | Jake | NaN | Engineering |
2 | Lisa | NaN | Engineering |
In [16]:
pd.concat((df1,df2),axis=0,join='inner')
Out[16]:
employee | |
---|---|
0 | Bob |
1 | Jake |
2 | Lisa |
0 | Bob |
1 | Jake |
2 | Lisa |
- Using the append function
In [17]:
df1.append(df2)
Out[17]:
employee | group | groups | |
---|---|---|---|
0 | Bob | Accounting | NaN |
1 | Jake | Engineering | NaN |
2 | Lisa | Engineering | NaN |
0 | Bob | NaN | Accounting |
1 | Jake | NaN | Engineering |
2 | Lisa | NaN | Engineering |
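Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the cell above would be written with pd.concat instead:
pd.concat((df1, df2), axis=0)  # same result as df1.append(df2)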
Merging: concatenation stitches tables together, while merging combines the data in the tables
- merge differs from concat in that merge joins the tables on a shared column
- When merging with pd.merge(), the column that has the same name in both tables is used as the key automatically.
- The values of the key column do not have to appear in the same order in both tables
One-to-one merge
In [18]:
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering'],
})
df1
Out[18]:
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
In [19]:
df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
'hire_date':[2004,2008,2012],
})
df2
Out[19]:
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
In [20]:
pd.merge(df1,df2)
Out[20]:
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
One-to-many merge
In [21]:
df3 = DataFrame({
'employee':['Lisa','Jake'],
'group':['Accounting','Engineering'],
'hire_date':[2004,2016]})
df3
Out[21]:
employee | group | hire_date | |
---|---|---|---|
0 | Lisa | Accounting | 2004 |
1 | Jake | Engineering | 2016 |
In [22]:
df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
'supervisor':['Carly','Guido','Steve']
})
df4
Out[22]:
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | Engineering | Steve |
In [23]:
pd.merge(df3,df4)
Out[23]:
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Lisa | Accounting | 2004 | Carly |
1 | Jake | Engineering | 2016 | Guido |
2 | Jake | Engineering | 2016 | Steve |
Many-to-many merge
In [24]:
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering']})
df1
Out[24]:
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
In [25]:
df5 = DataFrame({'group':['Engineering','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']
})
df5
Out[25]:
group | supervisor | |
---|---|---|
0 | Engineering | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
In [28]:
pd.merge(df1,df5)
Out[28]:
employee | group | supervisor | |
---|---|---|---|
0 | Jake | Engineering | Carly |
1 | Jake | Engineering | Guido |
2 | Lisa | Engineering | Carly |
3 | Lisa | Engineering | Guido |
Normalizing the key
- When columns clash, i.e. both tables share more than one column name, use on= to specify which column acts as the key, and suffixes to name the clashing columns (a short suffixes sketch follows the merge output below)
In [29]:
df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
'group':['Accounting','Finance','Marketing']})
df1
Out[29]:
employee | group | |
---|---|---|
0 | Jack | Accounting |
1 | Summer | Finance |
2 | Steve | Marketing |
In [30]:
df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
'hire_date':[2003,2009,2012],
'group':['Accounting','sell','ceo']})
df2
Out[30]:
employee | group | hire_date | |
---|---|---|---|
0 | Jack | Accounting | 2003 |
1 | Bob | sell | 2009 |
2 | Jake | ceo | 2012 |
In [32]:
pd.merge(df1,df2,on='group')
Out[32]:
employee_x | group | employee_y | hire_date | |
---|---|---|---|---|
0 | Jack | Accounting | Jack | 2003 |
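The employee_x / employee_y names above are pandas' default suffixes for the clashing employee column; the suffixes parameter mentioned earlier lets you choose your own, for example:
pd.merge(df1, df2, on='group', suffixes=('_left', '_right'))
# -> columns: employee_left | group | employee_right | hire_date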
- When the two tables have no shared column to join on, use left_on and right_on to specify which column on each side should act as the join key
In [33]:
df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
'group':['Accounting','Product','Marketing'],
'hire_date':[1998,2017,2018]})
df1
Out[33]:
employee | group | hire_date | |
---|---|---|---|
0 | Bobs | Accounting | 1998 |
1 | Linda | Product | 2017 |
2 | Bill | Marketing | 2018 |
In [34]:
df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
'hire_dates':[1998,2016,2007]})
df5
Out[34]:
hire_dates | name | |
---|---|---|
0 | 1998 | Lisa |
1 | 2016 | Bobs |
2 | 2007 | Bill |
In [35]:
pd.merge(df1,df5,left_on='employee',right_on='name')
Out[35]:
employee | group | hire_date | hire_dates | name | |
---|---|---|---|---|---|
0 | Bobs | Accounting | 1998 | 2016 | Bobs |
1 | Bill | Marketing | 2018 | 2007 | Bill |
Inner and outer merges: outer takes the union of the keys, inner takes the intersection (a sketch of both follows the how='right' example below)
In [37]:
df6 = DataFrame({'name':['Peter','Paul','Mary'],
'food':['fish','beans','bread']}
)
df7 = DataFrame({'name':['Mary','Joseph'],
'drink':['wine','beer']})
In [38]:
df6
Out[38]:
food | name | |
---|---|---|
0 | fish | Peter |
1 | beans | Paul |
2 | bread | Mary |
In [39]:
df7
Out[39]:
drink | name | |
---|---|---|
0 | wine | Mary |
1 | beer | Joseph |
In [43]:
pd.merge(df6,df7,how='right')
Out[43]:
food | name | drink | |
---|---|---|---|
0 | bread | Mary | wine |
1 | NaN | Joseph | beer |
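For completeness, a sketch of the outer and inner variants on the same df6/df7 pair (how='inner' is also the default):
pd.merge(df6, df7, how='outer')  # union of keys: Peter, Paul, Mary, Joseph (missing values become NaN)
pd.merge(df6, df7, how='inner')  # intersection of keys: only Mary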
In [ ]:
# Exercise: merge df1 and df2
dic1={
'name':['tom','jay','helly'],
'age':[11,12,33],
'classRoom':[1,2,3]
}
df1=DataFrame(data=dic1)
df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
index=['jay','tom','helly'],
columns=['java','python','c'])
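The exercise above leaves the merge itself open. One possible approach (a sketch, not the only answer), given that df2 carries the names in its row index rather than in a column, is to join df1's name column against df2's index:
pd.merge(df1, df2, left_on='name', right_index=True)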
6. Population Analysis Case Study
- Requirements:
- Load the files and inspect the raw data
- Merge the population data with the state-abbreviation data
- Drop the redundant abbreviation column from the merged data
- Find the columns that contain missing data
- Find which state/region values have a NaN state, and deduplicate them
- Fill in the correct state value for those state/region entries so the state column no longer contains NaN
- Merge in the state area data (areas)
- The area (sq. mi) column has missing data; find which rows
- Drop the rows that contain missing data
- Select the total population data for the year 2010
- Compute each state's population density
- Sort, and find the five states with the highest population density
In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
In [10]:
abb = pd.read_csv('./data/state-abbrevs.csv')  # full state names and their abbreviations
area = pd.read_csv('./data/state-areas.csv')  # full state names and their areas
pop = pd.read_csv('./data/state-population.csv')  # population data
In [7]:
pop.head(2)
Out[7]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
In [8]:
abb.head(2)
Out[8]:
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
In [12]:
# combine (merge) the data sets
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head()
Out[12]:
state | abbreviation | state/region | ages | year | population | |
---|---|---|---|---|---|---|
0 | Alabama | AL | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | AL | under18 | 2011 | 1125763.0 |
In [13]:
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
In [15]:
# inspect which columns contain missing data
abb_pop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
state 2448 non-null object
state/region 2544 non-null object
ages 2544 non-null object
year 2544 non-null int64
population 2524 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB
In [16]:
# check which columns contain missing data
abb_pop.isnull().any(axis=0)
Out[16]:
state True
state/region False
ages False
year False
population True
dtype: bool
In [19]:
# find which state/region values (abbreviations) have a NaN state (full name), then deduplicate the abbreviations
abb_pop.head()
Out[19]:
state | state/region | ages | year | population | |
---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012 | 1117489.0 |
1 | Alabama | AL | total | 2012 | 4817528.0 |
2 | Alabama | AL | under18 | 2010 | 1130966.0 |
3 | Alabama | AL | total | 2010 | 4785570.0 |
4 | Alabama | AL | under18 | 2011 | 1125763.0 |
In [24]:
# 1. find the null values in the full-name column
abb_pop['state'].isnull()
# 2. pull out the rows corresponding to those nulls
abb_pop.loc[abb_pop['state'].isnull()]
# 3. pull out the abbreviations of those rows
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
# 4. deduplicate the resulting Series of abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()  # unique() removes duplicate elements from a Series
Out[24]:
array(['PR', 'USA'], dtype=object)
In [ ]:
# fill in the correct state value for these state/region entries so the state column no longer contains NaN
# locate the null full names whose abbreviation is PR and assign them the full name PUERTO RICO
In [29]:
abb_pop['state/region'] == 'PR'
# pull out the rows whose abbreviation is PR
abb_pop.loc[abb_pop['state/region'] == 'PR']
# assign the full name to the null state values in those rows
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index  # row index of the null full names for PR
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'
In [31]:
# likewise, assign United States as the full name for the USA rows
abb_pop['state/region'] == 'USA'
abb_pop.loc[abb_pop['state/region'] == 'USA']
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[indexs,'state'] = 'United States'
In [39]:
# merge in the state area data (areas)
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()
Out[39]:
state | state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012.0 | 1117489.0 | 52423.0 |
1 | Alabama | AL | total | 2012.0 | 4817528.0 | 52423.0 |
2 | Alabama | AL | under18 | 2010.0 | 1130966.0 | 52423.0 |
3 | Alabama | AL | total | 2010.0 | 4785570.0 | 52423.0 |
4 | Alabama | AL | under18 | 2011.0 | 1125763.0 | 52423.0 |
In [45]:
# the area (sq. mi) column has missing data; find which rows
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs
In [43]:
abb_pop_area.drop(labels=indexs,inplace=True)
In [48]:
# select the total population for 2010, using query('condition')
abb_pop_area.query('year==2010 & ages == "total"')
In [ ]:
# compute each state's population density ('midu' below means density)
In [51]:
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head()
Out[51]:
state | state/region | ages | year | population | area (sq. mi) | midu | |
---|---|---|---|---|---|---|---|
0 | Alabama | AL | under18 | 2012.0 | 1117489.0 | 52423.0 | 21.316769 |
1 | Alabama | AL | total | 2012.0 | 4817528.0 | 52423.0 | 91.897221 |
2 | Alabama | AL | under18 | 2010.0 | 1130966.0 | 52423.0 | 21.573851 |
3 | Alabama | AL | total | 2010.0 | 4785570.0 | 52423.0 | 91.287603 |
4 | Alabama | AL | under18 | 2011.0 | 1125763.0 | 52423.0 | 21.474601 |
In [53]:
# sort by density to find the five states with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)
Out[53]:
state | state/region | ages | year | population | area (sq. mi) | midu | |
---|---|---|---|---|---|---|---|
391 | District of Columbia | DC | total | 2013.0 | 646449.0 | 68.0 | 9506.602941 |
385 | District of Columbia | DC | total | 2012.0 | 633427.0 | 68.0 | 9315.102941 |
387 | District of Columbia | DC | total | 2011.0 | 619624.0 | 68.0 | 9112.117647 |
431 | District of Columbia | DC | total | 1990.0 | 605321.0 | 68.0 | 8901.779412 |
389 | District of Columbia | DC | total | 2010.0 | 605125.0 | 68.0 | 8898.897059 |
426 | District of Columbia | DC | total | 1991.0 | 600870.0 | 68.0 | 8836.323529 |
429 | District of Columbia | DC | total | 1992.0 | 597567.0 | 68.0 | 8787.750000 |
422 | District of Columbia | DC | total | 1993.0 | 595302.0 | 68.0 | 8754.441176 |
392 | District of Columbia | DC | total | 2009.0 | 592228.0 | 68.0 | 8709.235294 |
425 | District of Columbia | DC | total | 1994.0 | 589240.0 | 68.0 | 8665.294118 |
420 | District of Columbia | DC | total | 1995.0 | 580519.0 | 68.0 | 8537.044118 |
396 | District of Columbia | DC | total | 2008.0 | 580236.0 | 68.0 | 8532.882353 |
406 | District of Columbia | DC | total | 2001.0 | 574504.0 | 68.0 | 8448.588235 |
394 | District of Columbia | DC | total | 2007.0 | 574404.0 | 68.0 | 8447.117647 |
408 | District of Columbia | DC | total | 2002.0 | 573158.0 | 68.0 | 8428.794118 |
419 | District of Columbia | DC | total | 1996.0 | 572379.0 | 68.0 | 8417.338235 |
412 | District of Columbia | DC | total | 2000.0 | 572046.0 | 68.0 | 8412.441176 |
400 | District of Columbia | DC | total | 2006.0 | 570681.0 | 68.0 | 8392.367647 |
410 | District of Columbia | DC | total | 1999.0 | 570220.0 | 68.0 | 8385.588235 |
402 | District of Columbia | DC | total | 2003.0 | 568502.0 | 68.0 | 8360.323529 |
404 | District of Columbia | DC | total | 2004.0 | 567754.0 | 68.0 | 8349.323529 |
417 | District of Columbia | DC | total | 1997.0 | 567739.0 | 68.0 | 8349.102941 |
398 | District of Columbia | DC | total | 2005.0 | 567136.0 | 68.0 | 8340.235294 |
415 | District of Columbia | DC | total | 1998.0 | 565232.0 | 68.0 | 8312.235294 |
421 | District of Columbia | DC | under18 | 1995.0 | 123620.0 | 68.0 | 1817.941176 |
424 | District of Columbia | DC | under18 | 1994.0 | 122170.0 | 68.0 | 1796.617647 |
418 | District of Columbia | DC | under18 | 1996.0 | 121210.0 | 68.0 | 1782.500000 |
423 | District of Columbia | DC | under18 | 1993.0 | 120471.0 | 68.0 | 1771.632353 |
416 | District of Columbia | DC | under18 | 1997.0 | 119531.0 | 68.0 | 1757.808824 |
428 | District of Columbia | DC | under18 | 1992.0 | 118636.0 | 68.0 | 1744.647059 |
... | ... | ... | ... | ... | ... | ... | ... |
53 | Alaska | AK | total | 1994.0 | 603308.0 | 656425.0 | 0.919081 |
56 | Alaska | AK | total | 1993.0 | 599434.0 | 656425.0 | 0.913180 |
50 | Alaska | AK | total | 1992.0 | 588736.0 | 656425.0 | 0.896882 |
55 | Alaska | AK | total | 1991.0 | 570193.0 | 656425.0 | 0.868634 |
48 | Alaska | AK | total | 1990.0 | 553290.0 | 656425.0 | 0.842884 |
63 | Alaska | AK | under18 | 1998.0 | 192636.0 | 656425.0 | 0.293462 |
66 | Alaska | AK | under18 | 1999.0 | 191422.0 | 656425.0 | 0.291613 |
69 | Alaska | AK | under18 | 2000.0 | 190615.0 | 656425.0 | 0.290384 |
71 | Alaska | AK | under18 | 2001.0 | 188771.0 | 656425.0 | 0.287574 |
73 | Alaska | AK | under18 | 2002.0 | 188482.0 | 656425.0 | 0.287134 |
92 | Alaska | AK | under18 | 2011.0 | 188329.0 | 656425.0 | 0.286901 |
62 | Alaska | AK | under18 | 1997.0 | 188280.0 | 656425.0 | 0.286826 |
94 | Alaska | AK | under18 | 2012.0 | 188162.0 | 656425.0 | 0.286647 |
86 | Alaska | AK | under18 | 2013.0 | 188132.0 | 656425.0 | 0.286601 |
90 | Alaska | AK | under18 | 2010.0 | 187902.0 | 656425.0 | 0.286251 |
54 | Alaska | AK | under18 | 1994.0 | 187439.0 | 656425.0 | 0.285545 |
57 | Alaska | AK | under18 | 1993.0 | 187190.0 | 656425.0 | 0.285166 |
75 | Alaska | AK | under18 | 2003.0 | 186843.0 | 656425.0 | 0.284637 |
89 | Alaska | AK | under18 | 2009.0 | 186351.0 | 656425.0 | 0.283888 |
77 | Alaska | AK | under18 | 2004.0 | 186335.0 | 656425.0 | 0.283863 |
81 | Alaska | AK | under18 | 2006.0 | 185580.0 | 656425.0 | 0.282713 |
61 | Alaska | AK | under18 | 1996.0 | 185360.0 | 656425.0 | 0.282378 |
79 | Alaska | AK | under18 | 2005.0 | 185304.0 | 656425.0 | 0.282293 |
59 | Alaska | AK | under18 | 1995.0 | 184990.0 | 656425.0 | 0.281814 |
52 | Alaska | AK | under18 | 1992.0 | 184878.0 | 656425.0 | 0.281644 |
83 | Alaska | AK | under18 | 2007.0 | 184344.0 | 656425.0 | 0.280830 |
85 | Alaska | AK | under18 | 2008.0 | 183124.0 | 656425.0 | 0.278972 |
51 | Alaska | AK | under18 | 1991.0 | 182180.0 | 656425.0 | 0.277534 |
49 | Alaska | AK | under18 | 1990.0 | 177502.0 | 656425.0 | 0.270407 |
2544 | Puerto Rico | NaN | NaN | NaN | NaN | 3515.0 | NaN |
7. Advanced pandas Operations
In [3]:
import pandas as pd
from pandas import DataFrame
import numpy as np
Replacement operations
- Replacement works on both Series and DataFrame
- Single-value replacement
- Plain replacement: replace every matching element: to_replace=15, value='e'
- Per-column single-value replacement: to_replace={column label: value to replace}, value='new value'
- Multi-value replacement
- List replacement: to_replace=[] value=[]
- Dict replacement (recommended): to_replace={old: new, old: new}
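The two multi-value forms listed above can be sketched on a throwaway Series before moving on to the DataFrame examples below (the Series here is made up for illustration):
s = pd.Series([0, 1, 2, 3])
s.replace(to_replace=[1, 2], value=['one', 'two'])  # list replacement: old and new values are paired by position
s.replace(to_replace={1: 'one', 2: 'two'})          # dict replacement (recommended)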
In [4]:
df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df
Out[4]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | 3 | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [5]:
df.replace(to_replace=3,value='Three')
Out[5]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | Three | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [6]:
df.replace(to_replace={3:'aaa'})
Out[6]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | aaa | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [8]:
# replace a value only in the specified column
df.replace(to_replace={5:77},value='6666666')
Out[8]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | 3 | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 6666666 | 49 |
Mapping operations
Concept: build a mapping table that binds values to particular labels or strings (i.e. gives a value an alternative representation)
Create a df with name and salary columns, then map each name to a corresponding Chinese name
In [10]:
dic = {
'name':['jay','tom','jay'],
'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df
Out[10]:
name | salary | |
---|---|---|
0 | jay | 10000 |
1 | tom | 20000 |
2 | jay | 10000 |
In [14]:
# mapping table
dic = {
'jay':'张三',
'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df
Out[14]:
name | salary | c_name | |
---|---|---|---|
0 | jay | 10000 | 张三 |
1 | tom | 20000 | 李四 |
2 | jay | 10000 | 张三 |
Computation tool
- The portion of the salary above 3000 is taxed at 50%; compute everyone's after-tax salary
In [16]:
def after_sal(s):
return s - (s-3000)*0.5
In [18]:
df['after_salary'] = df['salary'].map(after_sal)
df
Out[18]:
name | salary | c_name | after_salary | |
---|---|---|---|---|
0 | jay | 10000 | 张三 | 6500.0 |
1 | tom | 20000 | 李四 | 11500.0 |
2 | jay | 10000 | 张三 | 6500.0 |
Mapping the index
- Use rename() to replace row and column index labels
- Parameters:
- index: replaces the row index
- columns: replaces the column index
In [19]:
df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4
Out[19]:
color | value | |
---|---|---|
0 | white | 2 |
1 | gray | 5 |
2 | purple | 9 |
3 | blue | 0 |
4 | green | 1 |
In [20]:
new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)
Out[20]:
cc | vv | |
---|---|---|
first | white | 2 |
two | gray | 5 |
three | purple | 9 |
four | blue | 0 |
five | green | 1 |
Random sampling via permutation
- take()
- np.random.permutation()
In [22]:
df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df
In [24]:
# df.take(['B','A','C'],axis=1)  # take() expects positional indices, so label-based selection like this would not work
df.take([1,0,2],axis=1)
In [32]:
np.random.permutation(3)  # returns a random ordering of 0..2
Out[32]:
array([0, 1, 2])
In [31]:
# shuffle both the row order and the column order
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)
In [35]:
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]
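The same random sampling can be expressed more directly with DataFrame.sample (not used in the original notes), which avoids building a full permutation by hand:
df.sample(n=50)    # 50 random rows
df.sample(frac=1)  # all rows, in shuffled order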
Grouped processing of data
- The core of grouped processing:
- the groupby() function
- the groups attribute, which shows how the rows were grouped
In [36]:
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
Out[36]:
color | item | price | weight | |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
In [37]:
# group by fruit type (item)
df.groupby(by='item')
Out[37]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>
In [38]:
# call groups to inspect the grouping
df.groupby(by='item').groups
Out[38]:
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
In [40]:
# compute the average price of each fruit
df.groupby(by='item').mean()['price']
Out[40]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [41]:
df.groupby(by='item')['price'].mean()  # recommended: select the column before aggregating
Out[41]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [42]:
# compute the average weight of fruit of each color
df.groupby(by='color')['weight'].mean()
Out[42]:
color
green 31.333333
red 12.000000
yellow 35.000000
Name: weight, dtype: float64
In [44]:
# compute the average price of each fruit and merge it back into the original data
df
Out[44]:
color | item | price | weight | |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
In [47]:
series_price = df.groupby(by='item')['price'].mean()
dic = series_price.to_dict()
dic  # mapping table
Out[47]:
{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}
In [49]:
df['mean_price'] = df['item'].map(dic)
df
Out[49]:
color | item | price | weight | mean_price | |
---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 |
1 | yellow | Banana | 3.0 | 20 | 2.75 |
2 | yellow | Orange | 3.0 | 50 | 3.50 |
3 | green | Banana | 2.5 | 30 | 2.75 |
4 | green | Orange | 4.0 | 20 | 3.50 |
5 | green | Apple | 2.0 | 44 | 3.00 |
Advanced aggregation
- After grouping with groupby, transform and apply can run custom functions for richer computations
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- Both transform and apply perform the computation; just pass a function to them
- Both also accept a lambda expression
In [56]:
def myMean(s):
sum = 0
for i in s:
sum += i
return sum/len(s)
In [57]:
df.groupby(by='item')['price'].apply(myMean)  # apply acts as the aggregation tool: one value per group
Out[57]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [58]:
df.groupby(by='item')['price'].transform(myMean)  # transform broadcasts the group result back to every original row
Out[58]:
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
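Because transform broadcasts the group result back to the original shape, it can also add the per-item average price as a column in one step, as an alternative to the map-based approach used earlier:
df['mean_price'] = df.groupby(by='item')['price'].transform('mean')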
Loading data
- Read the data in the type-.txt file
In [1]:
import pandas as pd
from pandas import DataFrame,Series
data=pd.read_csv('./data/type-.txt')
data
Out[1]:
你好-我好-他也好 | |
---|---|
0 | 也许-大概-有可能 |
1 | 然而-未必-不见得 |
In [2]:
data.shape
Out[2]:
(2, 1)
- Store each word in the file as an individual element of the DataFrame
In [4]:
data=pd.read_csv('./data/type-.txt',sep='-',header=None)
data
Out[4]:
0 | 1 | 2 | |
---|---|---|---|
0 | 你好 | 我好 | 他也好 |
1 | 也许 | 大概 | 有可能 |
2 | 然而 | 未必 | 不见得 |
- Read data from a database
In [6]:
# connect to the database and obtain a connection object
import sqlite3 as sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')
In [7]:
# read the rows of the table
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df
In [ ]:
# write a DataFrame's values into the database (data_1 here refers to some DataFrame defined elsewhere)
data_1.to_sql('sql_data',conn)
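As a hedged round-trip sketch using the sql_df read above (the table name sql_data_copy is just an example):
sql_df.to_sql('sql_data_copy', conn, index=False, if_exists='replace')  # write the frame to a new table
pd.read_sql('select * from sql_data_copy', conn).head()                 # read it back to verify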
Pivot tables
A pivot table is a table format that lets you dynamically rearrange and summarize data by category. Most people have used pivot tables in Excel and know how powerful they are; in pandas the equivalent is pivot_table.
- Advantages of pivot tables:
- Highly flexible: the analysis can be customized at will
- The structure makes the data easy to understand
- Very practical: a reporting workhorse
In [16]:
import pandas as pd
import numpy as np
df = pd.read_csv('./data/透视表-篮球赛.csv',encoding='utf8')
df.head()
Out[16]:
对手 | 胜负 | 主客场 | 命中 | 投篮数 | 投篮命中率 | 3分命中率 | 篮板 | 助攻 | 得分 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 勇士 | 胜 | 客 | 10 | 23 | 0.435 | 0.444 | 6 | 11 | 27 |
1 | 国王 | 胜 | 客 | 8 | 21 | 0.381 | 0.286 | 3 | 9 | 27 |
2 | 小牛 | 胜 | 主 | 10 | 19 | 0.526 | 0.462 | 3 | 7 | 29 |
3 | 灰熊 | 负 | 主 | 8 | 20 | 0.400 | 0.250 | 5 | 8 | 22 |
4 | 76人 | 胜 | 客 | 10 | 20 | 0.500 | 0.250 | 3 | 13 | 27 |
pivot_table has four key parameters: index, values, columns and aggfunc
- index: the grouping condition for the summary
- Every pivot_table needs an index. To see Harden's numbers against each opponent, group by opponent and compute the average of each category:
In [14]:
df.pivot_table(index='对手')
- To see the numbers against the same opponent split by home/away games, group by both 对手 (opponent) and 主客场 (home/away)
In [17]:
df.pivot_table(index=['对手','主客场'])
- values: selects which columns to compute
- If we only need Harden's points, rebounds and assists split by home/away and win/loss:
In [18]:
df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'])
Out[18]:
助攻 | 得分 | 篮板 | ||
---|---|---|---|---|
主客场 | 胜负 | |||
主 | 胜 | 10.555556 | 34.222222 | 5.444444 |
负 | 8.666667 | 29.666667 | 5.000000 | |
客 | 胜 | 9.000000 | 32.000000 | 4.916667 |
负 | 8.000000 | 20.000000 | 4.000000 |
- aggfunc: the aggregation function applied to the data
- When aggfunc is not set, it defaults to aggfunc='mean', i.e. the average.
- To also get James Harden's total points, rebounds and assists by home/away and win/loss:
In [20]:
df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'],aggfunc=['sum','mean'])
Out[20]:
sum | mean | ||||||
---|---|---|---|---|---|---|---|
助攻 | 得分 | 篮板 | 助攻 | 得分 | 篮板 | ||
主客场 | 胜负 | ||||||
主 | 胜 | 95 | 308 | 49 | 10.555556 | 34.222222 | 5.444444 |
负 | 26 | 89 | 15 | 8.666667 | 29.666667 | 5.000000 | |
客 | 胜 | 108 | 384 | 59 | 9.000000 | 32.000000 | 4.916667 |
负 | 8 | 20 | 4 | 8.000000 | 20.000000 | 4.000000 |
- columns: adds a column-level field
- It splits the values columns into further categories
In [33]:
# total points at home vs. away across all opponents
df.pivot_table(index='主客场',values='得分',aggfunc='sum',fill_value=0)
Out[33]:
得分 | |
---|---|
主客场 | |
主 | 397 |
客 | 404 |
In [34]:
# total points at home vs. away, further split by opponent
df.pivot_table(index='主客场',values='得分',columns='对手',aggfunc='sum',fill_value=0)
Out[34]:
对手 | 76人 | 勇士 | 国王 | 太阳 | 小牛 | 尼克斯 | 开拓者 | 掘金 | 步行者 | 湖人 | 灰熊 | 爵士 | 猛龙 | 篮网 | 老鹰 | 骑士 | 鹈鹕 | 黄蜂 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
主客场 | ||||||||||||||||||
主 | 29 | 0 | 0 | 0 | 29 | 37 | 0 | 21 | 29 | 0 | 60 | 56 | 38 | 37 | 0 | 35 | 26 | 0 |
客 | 27 | 27 | 27 | 48 | 0 | 31 | 48 | 0 | 26 | 36 | 49 | 29 | 0 | 0 | 29 | 0 | 0 | 27 |
Crosstabs
- A special kind of pivot table for counting group frequencies and summarizing data
- pd.crosstab(index, columns)
- index: the grouping data, used as the row index of the crosstab
- columns: the column index of the crosstab
In [35]:
df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
'age':[15,23,25,17,35,57,24,31,22],
'smoke':[True,False,False,True,True,False,False,True,False],
'height':[168,179,181,166,173,178,188,190,160]})
- Count the number of smokers and non-smokers by sex
In [36]:
pd.crosstab(df.smoke,df.sex)
Out[36]:
sex | man | women |
---|---|---|
smoke | ||
False | 2 | 3 |
True | 2 | 2 |
- Show smoking status broken down by age
In [14]:
pd.crosstab(df.age,df.smoke)
Out[14]:
smoke | False | True |
---|---|---|
age | ||
15 | 0 | 1 |
17 | 0 | 1 |
22 | 1 | 0 |
23 | 1 | 0 |
24 | 1 | 0 |
25 | 1 | 0 |
31 | 0 | 1 |
35 | 0 | 1 |
57 | 1 | 0 |