Data Analysis Day 04
7. Advanced pandas Operations
In [53]:
import pandas as pd
from pandas import DataFrame
import numpy as np
Replace operations
- Replacement works on both Series and DataFrame
- Single-value replacement
  - Plain replacement: replaces every matching element: to_replace=15, value='e'
  - Column-scoped single-value replacement: to_replace={column_label: value_to_replace}, value='value'
- Multi-value replacement
  - List replacement: to_replace=[], value=[]
  - Dict replacement (recommended): to_replace={to_replace: value, to_replace: value}
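The variants above can be sketched on a small frame (a minimal sketch; the column labels and values are made up for illustration):

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'a': [1, 2, 3], 'b': [2, 5, 2]})

# single-value replacement: hits every matching cell in every column
r1 = df.replace(to_replace=2, value='two')

# column-scoped replacement: only replaces 2 where it appears in column 'b'
r2 = df.replace(to_replace={'b': 2}, value='TWO')

# dict replacement (recommended): {value_to_replace: replacement, ...}
r3 = df.replace(to_replace={1: 'one', 3: 'three'})
```

Note that replace returns a new frame; pass inplace=True to modify df in place.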
In [4]:
df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df
Out[4]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | 3 | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [5]:
df.replace(to_replace=3,value='Three')
Out[5]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | Three | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [6]:
df.replace(to_replace={3:'aaa'})
Out[6]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | aaa | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 77 | 49 |
In [8]:
#replace a value in a specific column: {column_label: value_to_replace}
df.replace(to_replace={5:77},value='6666666')
Out[8]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 44 | 62 | 3 | 85 | 26 | 47 | 14 |
1 | 15 | 78 | 32 | 98 | 85 | 4 | 51 |
2 | 53 | 75 | 87 | 21 | 45 | 8 | 18 |
3 | 54 | 31 | 67 | 49 | 77 | 25 | 49 |
4 | 18 | 21 | 18 | 31 | 93 | 11 | 0 |
5 | 21 | 54 | 76 | 95 | 70 | 6666666 | 49 |
Mapping operations
Concept: build a mapping table that binds values to particular labels or strings (giving a value an alternative representation)
Create a df with two columns, name and salary, then give each name a corresponding Chinese name
In [10]:
dic = {
'name':['jay','tom','jay'],
'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df
Out[10]:
name | salary | |
---|---|---|
0 | jay | 10000 |
1 | tom | 20000 |
2 | jay | 10000 |
In [14]:
#mapping table
dic = {
'jay':'张三',
'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df
Out[14]:
name | salary | c_name | |
---|---|---|---|
0 | jay | 10000 | 张三 |
1 | tom | 20000 | 李四 |
2 | jay | 10000 | 张三 |
Computation tools
- Income above 3000 is taxed at 50%; compute everyone's after-tax salary
In [16]:
def after_sal(s):
    return s - (s - 3000) * 0.5
In [18]:
df['after_salary'] = df['salary'].map(after_sal)
df
Out[18]:
name | salary | c_name | after_salary | |
---|---|---|---|---|
0 | jay | 10000 | 张三 | 6500.0 |
1 | tom | 20000 | 李四 | 11500.0 |
2 | jay | 10000 | 张三 | 6500.0 |
Mapping indexes
- Use the rename() function to replace row and column index labels
- Parameters:
  - index: replaces the row index
  - columns: replaces the column index
In [19]:
df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4
Out[19]:
color | value | |
---|---|---|
0 | white | 2 |
1 | gray | 5 |
2 | purple | 9 |
3 | blue | 0 |
4 | green | 1 |
In [20]:
new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)
Out[20]:
cc | vv | |
---|---|---|
first | white | 2 |
two | gray | 5 |
three | purple | 9 |
four | blue | 0 |
five | green | 1 |
Random sampling via reordering
- take()
- np.random.permutation()
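On a small frame the two pieces combine like this (a minimal sketch; the frame and slice size are arbitrary):

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(data=np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# permutation(n) returns the integers 0..n-1 in random order
row_order = np.random.permutation(df.shape[0])
col_order = np.random.permutation(df.shape[1])

# take() reorders by positional index; shuffling rows (axis=0) and
# columns (axis=1) and then slicing the head gives a cheap random sample
shuffled = df.take(row_order, axis=0).take(col_order, axis=1)
sample = shuffled[0:2]
```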
In [22]:
df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df
In [24]:
# df.take(['B','A','C'],axis=1)
df.take([1,0,2],axis=1)
In [32]:
np.random.permutation(3) #returns a random permutation of 0..2
Out[32]:
array([0, 1, 2])
In [31]:
#shuffle the row and column indexes
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)
In [35]:
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]
Grouped data processing
- The core of grouped data processing:
  - the groupby() function
  - the groups attribute, for inspecting the grouping
In [36]:
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
Out[36]:
color | item | price | weight | |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
In [37]:
#group by fruit type
df.groupby(by='item')
Out[37]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>
In [38]:
#inspect the grouping via the groups attribute
df.groupby(by='item').groups
Out[38]:
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
In [40]:
#compute the average price of each fruit
df.groupby(by='item').mean()['price']
Out[40]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [41]:
df.groupby(by='item')['price'].mean() #recommended: select the column first, then aggregate
Out[41]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [42]:
#compute the average weight of fruits of each color
df.groupby(by='color')['weight'].mean()
Out[42]:
color
green 31.333333
red 12.000000
yellow 35.000000
Name: weight, dtype: float64
In [44]:
#compute each fruit's average price and merge it back into the original data
df
Out[44]:
color | item | price | weight | |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
In [47]:
series_price = df.groupby(by='item')['price'].mean()
dic = series_price.to_dict()
dic #mapping table
Out[47]:
{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}
In [49]:
df['mean_price'] = df['item'].map(dic)
df
Out[49]:
color | item | price | weight | mean_price | |
---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 |
1 | yellow | Banana | 3.0 | 20 | 2.75 |
2 | yellow | Orange | 3.0 | 50 | 3.50 |
3 | green | Banana | 2.5 | 30 | 2.75 |
4 | green | Orange | 4.0 | 20 | 3.50 |
5 | green | Apple | 2.0 | 44 | 3.00 |
Advanced aggregation
- After grouping with groupby, transform and apply accept custom functions for further computation
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- Both transform and apply perform the computation; simply pass a function to them
- transform and apply also accept a lambda expression
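The key difference is the shape of the result: apply aggregates to one value per group, while transform broadcasts the group value back to every original row. A minimal sketch:

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame({'item': ['a', 'a', 'b'], 'price': [1.0, 3.0, 10.0]})

# apply aggregates: one value per group, indexed by the group key
agg = df.groupby('item')['price'].apply(lambda s: s.mean())

# transform broadcasts: one value per original row
broadcast = df.groupby('item')['price'].transform(lambda s: s.mean())
```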
In [56]:
def myMean(s):
    total = 0  # avoid shadowing the built-in sum
    for i in s:
        total += i
    return total / len(s)
In [57]:
df.groupby(by='item')['price'].apply(myMean) #apply serves as the aggregation tool
Out[57]:
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
In [58]:
df.groupby(by='item')['price'].transform(myMean) #transform broadcasts the group result back to the original rows
Out[58]:
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
Data loading
- Read the data in the type-.txt file
In [50]:
data_1 = pd.read_csv('./data/type-.txt',sep='-',header=None)
- Store each word in the file as an element of the DataFrame
- Read data from a database
In [46]:
#connect to the database and get a connection object
import sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')
In [47]:
#read the values from a table in the database
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df
In [51]:
#write a df's values into the db
data_1.to_sql('sql_data123',conn)
In [52]:
pd.read_sql('select * from sql_data123',conn)
Out[52]:
index | 0 | 1 | 2 | |
---|---|---|---|---|
0 | 0 | 你好 | 我好 | 他也好 |
1 | 1 | 也许 | 大概 | 有可能 |
2 | 2 | 然而 | 未必 | 不见得 |
Pivot tables
A pivot table is a table format that dynamically arranges and summarizes data by category. Most people have used pivot tables in Excel and know how powerful they are; in pandas the equivalent is called pivot_table.
- Advantages of pivot tables:
  - Highly flexible: analysis and computation can be customized at will
  - Clear structure that makes the data easy to understand
  - Easy to operate; a workhorse for reporting
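Before turning to the real dataset, the mechanics can be sketched on a toy frame (hypothetical team/venue/score columns, not the games.csv data):

```python
import pandas as pd

games = pd.DataFrame({
    'team':  ['A', 'A', 'B', 'B'],
    'venue': ['home', 'away', 'home', 'away'],
    'score': [10, 20, 30, 40],
})

# index groups the rows, values picks the columns to compute,
# aggfunc chooses the aggregation (mean by default)
pt = games.pivot_table(index='team', values='score', aggfunc='sum')
```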
In [6]:
import pandas as pd
import numpy as np
In [15]:
df = pd.read_csv('./data/games.csv',encoding='utf-8')
df.head()
Out[15]:
对手 | 胜负 | 主客场 | 命中 | 投篮数 | 投篮命中率 | 3分命中率 | 篮板 | 助攻 | 得分 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 勇士 | 胜 | 客 | 10 | 23 | 0.435 | 0.444 | 6 | 11 | 27 |
1 | 国王 | 胜 | 客 | 8 | 21 | 0.381 | 0.286 | 3 | 9 | 27 |
2 | 小牛 | 胜 | 主 | 10 | 19 | 0.526 | 0.462 | 3 | 7 | 29 |
3 | 灰熊 | 负 | 主 | 8 | 20 | 0.400 | 0.250 | 5 | 8 | 22 |
4 | 76人 | 胜 | 客 | 10 | 20 | 0.500 | 0.250 | 3 | 13 | 27 |
pivot_table has four key parameters: index, values, columns, and aggfunc
- index: the grouping condition for the summary
  - Every pivot_table needs an index. To see Harden's scoring against each opposing team, group by team and compute the averages per team:
In [16]:
df.pivot_table(index='对手')
- To compare stats against the same opponent at home versus away, group by both 对手 (opponent) and 主客场 (home/away)
In [17]:
df.pivot_table(index=['对手','主客场'])
- values: selects which data to compute
  - If we only need Harden's points (得分), rebounds (篮板), and assists (助攻) by venue and win/loss:
In [19]:
df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'])
Out[19]:
助攻 | 得分 | 篮板 | ||
---|---|---|---|---|
主客场 | 胜负 | |||
主 | 胜 | 10.555556 | 34.222222 | 5.444444 |
负 | 8.666667 | 29.666667 | 5.000000 | |
客 | 胜 | 9.000000 | 32.000000 | 4.916667 |
负 | 8.000000 | 20.000000 | 4.000000 |
- aggfunc: the function used when aggregating the data
  - When aggfunc is not set, it defaults to aggfunc='mean', the average.
  - To get James Harden's total points, total rebounds, and total assists by venue and win/loss:
In [23]:
df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'],aggfunc='sum')
Out[23]:
助攻 | 得分 | 篮板 | ||
---|---|---|---|---|
主客场 | 胜负 | |||
主 | 胜 | 95 | 308 | 49 |
负 | 26 | 89 | 15 | |
客 | 胜 | 108 | 384 | 59 |
负 | 8 | 20 | 4 |
In [24]:
#to get James Harden's total points, average rebounds, and maximum assists by venue and win/loss
df.pivot_table(index=['主客场','胜负'],aggfunc={'得分':'sum','篮板':'mean','助攻':'max'})
Out[24]:
助攻 | 得分 | 篮板 | ||
---|---|---|---|---|
主客场 | 胜负 | |||
主 | 胜 | 17 | 308 | 5.444444 |
负 | 11 | 89 | 5.000000 | |
客 | 胜 | 15 | 384 | 4.916667 |
负 | 8 | 20 | 4.000000 |
- columns: sets a column-level field
  - subdivides the values field
In [35]:
df.pivot_table(index='主客场',values='得分',aggfunc='sum',columns='对手')
Out[35]:
对手 | 76人 | 勇士 | 国王 | 太阳 | 小牛 | 尼克斯 | 开拓者 | 掘金 | 步行者 | 湖人 | 灰熊 | 爵士 | 猛龙 | 篮网 | 老鹰 | 骑士 | 鹈鹕 | 黄蜂 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
主客场 | ||||||||||||||||||
主 | 29.0 | NaN | NaN | NaN | 29.0 | 37.0 | NaN | 21.0 | 29.0 | NaN | 60.0 | 56.0 | 38.0 | 37.0 | NaN | 35.0 | 26.0 | NaN |
客 | 27.0 | 27.0 | 27.0 | 48.0 | NaN | 31.0 | 48.0 | NaN | 26.0 | 36.0 | 49.0 | 29.0 | NaN | NaN | 29.0 | NaN | NaN | 27.0 |
Cross tabulation
- A special pivot table for counting groups, summarizing the data
- pd.crosstab(index, columns)
  - index: grouping data; the row index of the crosstab
  - columns: the column index of the crosstab
In [36]:
df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
'age':[15,23,25,17,35,57,24,31,22],
'smoke':[True,False,False,True,True,False,False,True,False],
'height':[168,179,181,166,173,178,188,190,160]})
df
Out[36]:
age | height | sex | smoke | |
---|---|---|---|---|
0 | 15 | 168 | man | True |
1 | 23 | 179 | man | False |
2 | 25 | 181 | women | False |
3 | 17 | 166 | women | True |
4 | 35 | 173 | man | True |
5 | 57 | 178 | women | False |
6 | 24 | 188 | man | False |
7 | 31 | 190 | women | True |
8 | 22 | 160 | women | False |
- Count how many people of each sex smoke
In [37]:
pd.crosstab(df.smoke,df.sex)
Out[37]:
sex | man | women |
---|---|---|
smoke | ||
False | 2 | 3 |
True | 2 | 2 |
In [38]:
pd.crosstab(df.sex,df.smoke)
Out[38]:
smoke | False | True |
---|---|---|
sex | ||
man | 2 | 2 |
women | 3 | 2 |
- Break down smoking by age
In [41]:
pd.crosstab(df.age,df.smoke)
Out[41]:
smoke | False | True |
---|---|---|
age | ||
15 | 0 | 1 |
17 | 0 | 1 |
22 | 1 | 0 |
23 | 1 | 0 |
24 | 1 | 0 |
25 | 1 | 0 |
31 | 0 | 1 |
35 | 0 | 1 |
57 | 1 | 0 |
8. Data Analysis of 2012 US Presidential Campaign Contributions
In [51]:
import pandas as pd
from pandas import DataFrame
import numpy as np
In [52]:
#for convenience, define the months and the candidates with their parties:
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
parties = {
'Bachmann, Michelle': 'Republican',
'Romney, Mitt': 'Republican',
'Obama, Barack': 'Democrat',
"Roemer, Charles E. 'Buddy' III": 'Reform',
'Pawlenty, Timothy': 'Republican',
'Johnson, Gary Earl': 'Libertarian',
'Paul, Ron': 'Republican',
'Santorum, Rick': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Huntsman, Jon': 'Republican',
'Perry, Rick': 'Republican'
}
Requirements
- Load the data
- Inspect the basic information of the data
- Extract only the following fields and discard the rest:
  - cand_nm: candidate name
  - contbr_nm: contributor name
  - contbr_st: contributor's state
  - contbr_employer: contributor's employer
  - contbr_occupation: contributor's occupation
  - contb_receipt_amt: contribution amount (USD)
  - contb_receipt_dt: contribution date
- Overview the new data with df.info() and check for missing values
- Quickly describe the numeric columns with df.describe()
- Missing values: fields may be empty (left blank or withheld); fill them with NOT PROVIDE
- Outliers: drop rows whose contribution amount is <= 0
- Add a party column holding each candidate's party
- List the distinct values in the party column
- Count how often each value occurs in the party column
- Total contributions (contb_receipt_amt) received by each party
- Total contributions (contb_receipt_amt) received by each party on each day
- Convert the dates in the table to 'yyyy-mm-dd' format
- Find which candidate disabled veterans (occupation DISABLED VETERAN) mainly supported
- For each candidate's contributors, find the occupation and amount of the largest contribution
In [53]:
#load the data and inspect its basic information
df = pd.read_csv('./data/usa_election.txt')
df.head()
C:\Users\laonanhai\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2728: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Out[53]:
cmte_id | cand_id | cand_nm | contbr_nm | contbr_city | contbr_st | contbr_zip | contbr_employer | contbr_occupation | contb_receipt_amt | contb_receipt_dt | receipt_desc | memo_cd | memo_text | form_tp | file_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 250.0 | 20-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
1 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 50.0 | 23-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
2 | C00410118 | P20002978 | Bachmann, Michelle | SMITH, LANIER | LANETT | AL | 3.68633e+08 | INFORMATION REQUESTED | INFORMATION REQUESTED | 250.0 | 05-JUL-11 | NaN | NaN | NaN | SA17A | 749073 |
3 | C00410118 | P20002978 | Bachmann, Michelle | BLEVINS, DARONDA | PIGGOTT | AR | 7.24548e+08 | NONE | RETIRED | 250.0 | 01-AUG-11 | NaN | NaN | NaN | SA17A | 749073 |
4 | C00410118 | P20002978 | Bachmann, Michelle | WARDENBURG, HAROLD | HOT SPRINGS NATION | AR | 7.19016e+08 | NONE | RETIRED | 300.0 | 20-JUN-11 | NaN | NaN | NaN | SA17A | 736166 |
In [54]:
#check the raw data for missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id 536041 non-null object
cand_id 536041 non-null object
cand_nm 536041 non-null object
contbr_nm 536041 non-null object
contbr_city 536026 non-null object
contbr_st 536040 non-null object
contbr_zip 535973 non-null object
contbr_employer 525088 non-null object
contbr_occupation 530520 non-null object
contb_receipt_amt 536041 non-null float64
contb_receipt_dt 536041 non-null object
receipt_desc 8479 non-null object
memo_cd 49718 non-null object
memo_text 52740 non-null object
form_tp 536041 non-null object
file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
In [55]:
df.describe()
Out[55]:
contb_receipt_amt | file_num | |
---|---|---|
count | 5.360410e+05 | 536041.000000 |
mean | 3.750373e+02 | 761472.107800 |
std | 3.564436e+03 | 5148.893508 |
min | -3.080000e+04 | 723511.000000 |
25% | 5.000000e+01 | 756218.000000 |
50% | 1.000000e+02 | 763233.000000 |
75% | 2.500000e+02 | 763621.000000 |
max | 1.944042e+06 | 767394.000000 |
In [56]:
#missing values: fields may be empty (left blank or withheld); fill them with NOT PROVIDE
df.fillna(value='NOT PROVIDE',inplace=True)
In [57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id 536041 non-null object
cand_id 536041 non-null object
cand_nm 536041 non-null object
contbr_nm 536041 non-null object
contbr_city 536041 non-null object
contbr_st 536041 non-null object
contbr_zip 536041 non-null object
contbr_employer 536041 non-null object
contbr_occupation 536041 non-null object
contb_receipt_amt 536041 non-null float64
contb_receipt_dt 536041 non-null object
receipt_desc 536041 non-null object
memo_cd 536041 non-null object
memo_text 536041 non-null object
form_tp 536041 non-null object
file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
In [58]:
#outliers: drop rows whose contribution amount is <= 0
df = df.loc[~(df['contb_receipt_amt'] <= 0)]
In [59]:
#check how many candidates took part in the race
df['cand_nm'].unique()
Out[59]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
'Huntsman, Jon', 'Perry, Rick'], dtype=object)
In [60]:
df['cand_nm'].nunique()
Out[60]:
13
In [61]:
#add a party column holding each candidate's party
df['party'] = df['cand_nm'].map(parties)
df.head()
Out[61]:
cmte_id | cand_id | cand_nm | contbr_nm | contbr_city | contbr_st | contbr_zip | contbr_employer | contbr_occupation | contb_receipt_amt | contb_receipt_dt | receipt_desc | memo_cd | memo_text | form_tp | file_num | party | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 250.0 | 20-JUN-11 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
1 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 50.0 | 23-JUN-11 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
2 | C00410118 | P20002978 | Bachmann, Michelle | SMITH, LANIER | LANETT | AL | 3.68633e+08 | INFORMATION REQUESTED | INFORMATION REQUESTED | 250.0 | 05-JUL-11 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 749073 | Republican |
3 | C00410118 | P20002978 | Bachmann, Michelle | BLEVINS, DARONDA | PIGGOTT | AR | 7.24548e+08 | NONE | RETIRED | 250.0 | 01-AUG-11 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 749073 | Republican |
4 | C00410118 | P20002978 | Bachmann, Michelle | WARDENBURG, HAROLD | HOT SPRINGS NATION | AR | 7.19016e+08 | NONE | RETIRED | 300.0 | 20-JUN-11 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
In [62]:
#list the distinct values in the party column and count how often each occurs
df['party'].unique()
Out[62]:
array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)
In [63]:
df['party'].value_counts() #value_counts() counts the occurrences of each element in the Series
Out[63]:
Democrat 289999
Republican 234300
Reform 5313
Libertarian 702
Name: party, dtype: int64
In [64]:
#total contributions contb_receipt_amt received by each party
df.groupby(by='party',axis=0)['contb_receipt_amt'].sum()
Out[64]:
party
Democrat 8.259441e+07
Libertarian 4.132769e+05
Reform 3.429658e+05
Republican 1.251181e+08
Name: contb_receipt_amt, dtype: float64
In [65]:
#total contributions contb_receipt_amt received by each party on each day
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()
Out[65]:
contb_receipt_dt party
01-APR-11 Reform 50.00
Republican 12635.00
01-AUG-11 Democrat 182198.00
Libertarian 1000.00
Reform 1847.00
Republican 268903.02
01-DEC-11 Democrat 651982.82
Libertarian 725.00
Reform 875.00
Republican 505255.96
01-FEB-11 Republican 250.00
01-JAN-11 Republican 8600.00
01-JAN-12 Democrat 74303.80
Reform 515.00
Republican 76804.72
01-JUL-11 Democrat 175364.00
Libertarian 2000.00
Reform 100.00
Republican 125973.72
01-JUN-11 Democrat 148409.00
Libertarian 500.00
Reform 50.00
Republican 435609.20
01-MAR-11 Republican 1000.00
01-MAY-11 Democrat 82644.00
Reform 480.00
Republican 28663.87
01-NOV-11 Democrat 129309.87
Libertarian 3000.00
Reform 1792.00
...
30-OCT-11 Reform 3910.00
Republican 46413.16
30-SEP-11 Democrat 3409587.24
Libertarian 550.00
Reform 2050.00
Republican 5094824.20
31-AUG-11 Democrat 375487.44
Libertarian 10750.00
Reform 450.00
Republican 1038330.90
31-DEC-11 Democrat 3571793.57
Reform 695.00
Republican 1165777.72
31-JAN-11 Republican 6000.00
31-JAN-12 Democrat 1421887.31
Reform 150.00
Republican 963681.41
31-JUL-11 Democrat 20305.00
Reform 1066.00
Republican 12781.02
31-MAR-11 Reform 200.00
Republican 74575.00
31-MAY-11 Democrat 352005.66
Libertarian 250.00
Reform 100.00
Republican 313839.80
31-OCT-11 Democrat 216971.87
Libertarian 4250.00
Reform 3205.00
Republican 751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64
In [66]:
df.columns
Out[66]:
Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
'memo_text', 'form_tp', 'file_num', 'party'],
dtype='object')
In [67]:
#convert the dates in the table to 'yyyy-mm-dd' format
def transformDate(d):
    day, month, year = d.split('-')
    month = months[month]
    return '20' + year + '-' + str(month) + '-' + day
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(transformDate)
df.head()
Out[67]:
cmte_id | cand_id | cand_nm | contbr_nm | contbr_city | contbr_st | contbr_zip | contbr_employer | contbr_occupation | contb_receipt_amt | contb_receipt_dt | receipt_desc | memo_cd | memo_text | form_tp | file_num | party | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 250.0 | 2011-6-20 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
1 | C00410118 | P20002978 | Bachmann, Michelle | HARVEY, WILLIAM | MOBILE | AL | 3.6601e+08 | RETIRED | RETIRED | 50.0 | 2011-6-23 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
2 | C00410118 | P20002978 | Bachmann, Michelle | SMITH, LANIER | LANETT | AL | 3.68633e+08 | INFORMATION REQUESTED | INFORMATION REQUESTED | 250.0 | 2011-7-05 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 749073 | Republican |
3 | C00410118 | P20002978 | Bachmann, Michelle | BLEVINS, DARONDA | PIGGOTT | AR | 7.24548e+08 | NONE | RETIRED | 250.0 | 2011-8-01 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 749073 | Republican |
4 | C00410118 | P20002978 | Bachmann, Michelle | WARDENBURG, HAROLD | HOT SPRINGS NATION | AR | 7.19016e+08 | NONE | RETIRED | 300.0 | 2011-6-20 | NOT PROVIDE | NOT PROVIDE | NOT PROVIDE | SA17A | 736166 | Republican |
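As a sketch of an alternative, pd.to_datetime can parse the 'DD-MON-YY' strings directly with a format code, avoiding the hand-written month table (assumes an English locale for the month abbreviations):

```python
import pandas as pd

# '%d-%b-%y' matches strings like '20-JUN-11'; the month abbreviation
# is matched case-insensitively under an English locale
ts = pd.to_datetime('20-JUN-11', format='%d-%b-%y')
formatted = ts.strftime('%Y-%m-%d')
```

On a whole column this would be pd.to_datetime(df['contb_receipt_dt'], format='%d-%b-%y'), which yields true datetime values rather than strings.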
In [68]:
#find which candidate disabled veterans (contributor occupation DISABLED VETERAN) mainly supported (by amount donated)
#1. extract the rows for disabled veterans
veteran_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']
#2. group by candidate and sum the donation amounts
veteran_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()
Out[68]:
cand_nm
Cain, Herman 300.00
Obama, Barack 4205.00
Paul, Ron 2425.49
Santorum, Rick 250.00
Name: contb_receipt_amt, dtype: float64
In [69]:
#for each candidate's contributors, find the occupation and amount of the largest contribution
s = df.groupby(by=['cand_nm'])['contb_receipt_amt'].max()
s
Out[69]:
cand_nm
Bachmann, Michelle 3022.00
Cain, Herman 10000.00
Gingrich, Newt 5100.00
Huntsman, Jon 5000.00
Johnson, Gary Earl 2500.00
McCotter, Thaddeus G 4000.00
Obama, Barack 1944042.43
Paul, Ron 5000.00
Pawlenty, Timothy 10000.00
Perry, Rick 10000.00
Roemer, Charles E. 'Buddy' III 200.00
Romney, Mitt 12700.00
Santorum, Rick 5000.00
Name: contb_receipt_amt, dtype: float64
In [70]:
s.index[0]
Out[70]:
'Bachmann, Michelle'
In [71]:
for i in range(len(s)):
    q_str = 'cand_nm == "%s" & contb_receipt_amt == %d' % (s.index[i], s.values[i])
    display(df.query(q_str))
9. Plotting with matplotlib
plt.plot() draws line charts
- Draw a single line
- Draw multiple lines
- Draw several plots in one figure area (the subplot() function)
- Set the figure size: plt.figure(figsize=(a,b))
- Set the legend: legend()
- Label the axes
- Save the figure:
  - fig = plt.figure()
  - plt.plot(x,y)
  - fig.savefig()
- Line styles and appearance (self-study)
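The save-figure bullets can be sketched as follows (a minimal sketch using the non-interactive Agg backend so it also runs without a display; the output path is arbitrary):

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # headless backend: render to a file, no window needed
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])

# savefig is a method of the figure object
out_path = os.path.join(tempfile.mkdtemp(), 'line.png')
fig.savefig(out_path)
```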
In [6]:
import numpy as np
In [4]:
import matplotlib.pyplot as plt
#ensure plots render inline in the notebook; the comment must sit on its own line,
#since a trailing comment on the magic line raises UsageError
%matplotlib inline
In [5]:
x = [1,2,3,4,5]
y = [5,4,3,2,1]
plt.plot(x,y)
Out[5]:
[<matplotlib.lines.Line2D at 0x1a72b583cf8>]
In [8]:
#draw two lines in one coordinate system
xx = np.linspace(-np.pi,np.pi,num=20)
yy = xx ** 2
plt.plot(x,y) #calling plot repeatedly draws multiple lines
plt.plot(xx,yy)
Out[8]:
[<matplotlib.lines.Line2D at 0x1a72b870ef0>]
In [15]:
#place several plots in one grid
ax1 = plt.subplot(2,2,1) #grid shape and this subplot's position in it
ax1.plot(x,y)
ax2 = plt.subplot(2,2,2)
ax2.plot(xx,yy)
ax3 = plt.subplot(2,2,3)
ax3.plot(xx,yy)
ax4 = plt.subplot(2,2,4)
ax4.plot(x,y)
Out[15]:
[<matplotlib.lines.Line2D at 0x1a72bd5fb70>]
In [17]:
#plt.figure(figsize=(a,b))
plt.figure(figsize=(4,8))
plt.plot(x,y)
In [23]:
#set the legend
plt.plot(xx,yy,label='aaa')
plt.plot(xx-1,yy+1,label='bbb')
plt.legend(loc=1)
Out[23]:
<matplotlib.legend.Legend at 0x1a72d8d4198>
In [25]:
#label the axes
plt.plot(xx-1,yy+1,label='bbb')
plt.xlabel('distance')
plt.ylabel('temp')
plt.title('aaa')
Out[25]:
Text(0.5,1,'aaa')
Bar charts: plt.bar()
- Parameters: the first is the x positions, the second the bar heights, the third the bar width
In [33]:
plt.bar(x,y)
Out[33]:
<Container object of 5 artists>
In [29]:
plt.barh(x,y)
Out[29]:
<Container object of 5 artists>
Histograms
- A special kind of bar chart, also called a density plot
- Parameters of plt.hist():
  - bins: either an integer number of bins or a sequence giving the bin edges; defaults to 10
  - normed: if True, the histogram is normalized into a probability density; defaults to False
  - color: the histogram color, a single color value or a sequence of colors. When several datasets are given (e.g. a DataFrame), the color sequence is applied in the same order. If unspecified, a default line color is used
  - orientation: set to horizontal for a horizontal histogram; defaults to vertical
In [40]:
x = [1,1,2,3,4,5,5,5,6,7,7,7,7,7,7,7,8]
plt.hist(x,bins=20)
Out[40]:
(array([2., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 3., 0., 0., 1., 0., 0.,
7., 0., 1.]),
array([1. , 1.35, 1.7 , 2.05, 2.4 , 2.75, 3.1 , 3.45, 3.8 , 4.15, 4.5 ,
4.85, 5.2 , 5.55, 5.9 , 6.25, 6.6 , 6.95, 7.3 , 7.65, 8. ]),
<a list of 20 Patch objects>)
Pie charts
- pie(); a pie chart's only required parameter is x
- Pie charts suit showing each part's share of the whole; bar charts suit comparing the sizes of parts
In [41]:
arr=[11,22,31,15]
plt.pie(arr)
Out[41]:
([<matplotlib.patches.Wedge at 0x1a72dae13c8>,
<matplotlib.patches.Wedge at 0x1a72dae1748>,
<matplotlib.patches.Wedge at 0x1a72c37ea58>,
<matplotlib.patches.Wedge at 0x1a72bc76ac8>],
[Text(0.996424,0.465981,''),
Text(-0.195798,1.08243,''),
Text(-0.830021,-0.721848,''),
Text(0.910034,-0.61793,'')])
In [42]:
arr=[0.2,0.3,0.1]
plt.pie(arr)
Out[42]:
([<matplotlib.patches.Wedge at 0x1a72ef21630>,
<matplotlib.patches.Wedge at 0x1a72ef21b00>,
<matplotlib.patches.Wedge at 0x1a72ef28080>],
[Text(0.889919,0.646564,''),
Text(-0.646564,0.889919,''),
Text(-1.04616,-0.339919,'')])
In [43]:
arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'])
Out[43]:
([<matplotlib.patches.Wedge at 0x1a72ef61d68>,
<matplotlib.patches.Wedge at 0x1a72ef69278>,
<matplotlib.patches.Wedge at 0x1a72ef697b8>,
<matplotlib.patches.Wedge at 0x1a72ef69cf8>],
[Text(0.996424,0.465981,'a'),
Text(-0.195798,1.08243,'b'),
Text(-0.830021,-0.721848,'c'),
Text(0.910034,-0.61793,'d')])
In [44]:
arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3)
Out[44]:
([<matplotlib.patches.Wedge at 0x1a72efb18d0>,
<matplotlib.patches.Wedge at 0x1a72efb1da0>,
<matplotlib.patches.Wedge at 0x1a72efbb320>,
<matplotlib.patches.Wedge at 0x1a72efbb860>],
[Text(0.271752,0.127086,'a'),
Text(-0.0533994,0.295209,'b'),
Text(-0.226369,-0.196868,'c'),
Text(0.248191,-0.168526,'d')])
In [45]:
arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,autopct='%.6f%%')
Out[45]:
([<matplotlib.patches.Wedge at 0x1a72f0024a8>,
<matplotlib.patches.Wedge at 0x1a72f002ba8>,
<matplotlib.patches.Wedge at 0x1a72f00a358>,
<matplotlib.patches.Wedge at 0x1a72f00aac8>],
[Text(0.271752,0.127086,'a'),
Text(-0.0533994,0.295209,'b'),
Text(-0.226369,-0.196868,'c'),
Text(0.248191,-0.168526,'d')],
[Text(0.543504,0.254171,'13.924050%'),
Text(-0.106799,0.590419,'27.848101%'),
Text(-0.452739,-0.393735,'39.240506%'),
Text(0.496382,-0.337053,'18.987341%')])
In [46]:
arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,shadow=True,explode=[0.2,0.3,0.2,0.4])
Out[46]:
([<matplotlib.patches.Wedge at 0x1a72f04e940>,
<matplotlib.patches.Wedge at 0x1a72f056128>,
<matplotlib.patches.Wedge at 0x1a72f056940>,
<matplotlib.patches.Wedge at 0x1a72f062198>],
[Text(0.45292,0.21181,'a'),
Text(-0.106799,0.590419,'b'),
Text(-0.377282,-0.328113,'c'),
Text(0.579113,-0.393228,'d')])
Scatter plots: scatter()
- Show the general trend of the dependent variable as the independent variable changes
In [49]:
x = np.array([1,2,3,4,5])
y = x ** 2
plt.scatter(x,y)
Out[49]:
<matplotlib.collections.PathCollection at 0x1a72f089438>
In [51]:
x = np.random.random(size=(20,))
y = np.random.random(size=(20,))
plt.scatter(x,y)
Out[51]:
<matplotlib.collections.PathCollection at 0x1a72fe59e10>
Project requirements
Part 1: data type handling
- Load the data
  - Field meanings:
    - user_id: user ID
    - order_dt: purchase date
    - order_product: number of products purchased
    - order_amount: purchase amount
- Observe the data
  - Check the data types of the columns
  - Check whether the data contains missing values
  - Convert order_dt to a datetime type
  - Look at the statistical description of the data
    - Compute the average number of products purchased per order
    - Compute the average amount spent per order
  - Add a month column to the source data
Part 2: monthly analysis
- Total amount purchased per month
  - Plot it as a line chart
- Total number of purchases per month across all users
- Total products purchased per month across all users
- Number of purchasing users per month
Part 3: per-user consumption analysis
- Statistical description of each user's total spend and total number of purchases
- Scatter plot of user spend versus number of purchases
- Distribution of user spend
- Distribution of user purchase counts (for counts within 100)
Part 4: user behavior analysis
- Distribution of users' first purchase dates, with counts
  - Plot a line chart
- Distribution of users' last purchase dates, with counts
  - Plot a line chart
- Share of new versus returning customers
  - A single purchase marks a new user
  - Multiple purchases mark a returning user
- User segmentation
  - Derive each user's total purchase quantity, total spend, and most recent purchase date
- User lifecycle
  - Split users into active users and other users
  - Look at the monthly share of active versus other users
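Parts 1 and 2 above can be sketched on a few hypothetical rows shaped like the listed fields (the real dataset is not shown here, so the values are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    'user_id':       [1, 1, 2],
    'order_dt':      ['19970101', '19970215', '19970120'],
    'order_product': [1, 3, 2],
    'order_amount':  [11.77, 36.00, 20.50],
})

# part 1: convert order_dt to a datetime, then derive a month column
orders['order_dt'] = pd.to_datetime(orders['order_dt'], format='%Y%m%d')
orders['month'] = orders['order_dt'].dt.to_period('M')

# part 2: total amount spent per month across all users
monthly_amount = orders.groupby('month')['order_amount'].sum()
```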