Data Analysis Day 03

4. Data Cleaning with pandas

Handling Missing Data

  • There are two kinds of missing data:
    • None
    • np.nan (NaN)

In [1]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts  # financial data API package
import matplotlib.pyplot as plt
  • The difference between the two kinds of missing data

In [2]:

type(np.nan)

Out[2]:

float

In [5]:

np.nan + 3

Out[5]:

nan

In [3]:

type(None)

Out[3]:

NoneType
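  • The practical difference: arithmetic on np.nan silently propagates, while arithmetic on None raises. A quick sketch restating the cells above:

# np.nan participates in arithmetic and yields nan
np.nan + 3    # -> nan
# None does not; uncommenting the next line raises
# None + 3    # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'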
  • None and NaN in pandas (a None assigned into a numeric DataFrame is stored as NaN, so the affected columns become float)

In [10]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df

Out[10]:

0 1 2 3 4
0 44 91 92 51 55
1 23 22 92 35 83
2 21 52 40 63 29
3 94 51 24 70 59
4 27 78 1 21 17
5 94 57 5 43 22
6 87 31 58 30 82
7 93 28 54 7 93

In [12]:

df.iloc[1,2] = None
df.iloc[3,4] = None
df.iloc[4,1] = None
df.iloc[7,4] = np.nan

In [13]:

df

Out[13]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 NaN 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 NaN
4 27 NaN 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 NaN

pandas operations on null values

  • isnull
  • notnull
  • any
  • all
  • dropna
  • fillna

In [16]:

df.isnull()

Out[16]:

0 1 2 3 4
0 False False False False False
1 False False True False False
2 False False False False False
3 False False False False True
4 False True False False False
5 False False False False False
6 False False False False False
7 False False False False True
  • Detect which rows of the original data contain null values

  • any and all detect which rows or columns of the df contain null values
  • isnull -> any(axis=1)
  • notnull -> all(axis=1)

In [24]:

~df.isnull().any(axis=1)          # True for rows with no nulls
df.loc[~df.isnull().any(axis=1)]  # keep only those rows

Out[24]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [28]:

df.notnull().all(axis=1)          # True for rows in which every value is non-null
df.loc[df.notnull().all(axis=1)]  # keep only those rows

Out[28]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [29]:

df.dropna(axis=0)  # drop the rows that contain null values

Out[29]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [32]:

df

Out[32]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 NaN 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 NaN
4 27 NaN 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 NaN

In [33]:

# fillna overwrites the null values
df.fillna(method='bfill',axis=0)  # fill each null with its neighboring value (here the next value below, 'backfill')

Out[33]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 92.0 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 29.0
4 27 51.0 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 82.0
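fillna is not limited to neighbor filling; it also accepts a fixed value or a per-column statistic. A minimal sketch (not part of the original cells):

df.fillna(value=0)                # replace every NaN with 0
df.fillna(value=df.mean(axis=0))  # replace NaN in each column with that column's mean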

Interview question

  • Data description:
    • The data is temperature data from one cold-storage warehouse; columns 1-7 correspond to 7 temperature-collection devices, sampled once per minute.
  • Processing goal:
    • Using the 4 required devices corresponding to columns 1-4, build a model of the warehouse's temperature field and estimate the data for devices 5-7.
    • In the end, each warehouse only needs 4 devices installed, replacing the 7.
    • f(1-4) --> y(5-7)
  • Processing steps:
    • 1. The raw data has dropped frames and needs preprocessing;
    • 2. Plot with matplotlib;
    • 3. Build a logistic regression model.
  • There is no standard answer; work from your own understanding, and briefly describe your process in writing. Thanks for your cooperation.
  • The test data is testData.xlsx

In [5]:

df = pd.read_excel('./testData.xlsx')
df.drop(labels=['none','none1'],axis=1,inplace=True)  # drop the two empty 'none' columns

In [7]:

df.dropna(axis=0)  # one option: drop the rows that contain dropped frames

In [10]:

# another option: isnull/notnull combined with any/all
df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]

In [15]:

df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)  # or fill each gap from its neighbors: forward-fill, then back-fill any leading gap
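The lesson's step 3 says "logistic regression", but the targets (devices 5-7) are continuous temperatures, so a linear regression is sketched here instead. A minimal sketch of the modeling step, assuming the cleaned df's first four columns are devices 1-4 and the last three are devices 5-7; sklearn and the column layout are assumptions, not given by the lesson:

from sklearn.linear_model import LinearRegression

data = df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
X = data.iloc[:, 0:4]  # the 4 required devices (assumed column positions)
y = data.iloc[:, 4:7]  # the 3 devices to estimate (assumed column positions)
model = LinearRegression().fit(X, y)  # multi-output linear regression
predictions = model.predict(X)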

Handling Duplicate Data

In [20]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df.iloc[1] = [6,6,6,6,6]
df.iloc[3] = [6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6]
df

Out[20]:

0 1 2 3 4
0 44 68 53 32 24
1 6 6 6 6 6
2 79 86 73 14 25
3 6 6 6 6 6
4 72 1 73 67 89
5 6 6 6 6 6
6 69 32 94 91 18
7 47 7 77 11 67

In [25]:

df.drop_duplicates(keep='first')

Out[25]:

0 1 2 3 4
0 44 68 53 32 24
1 6 6 6 6 6
2 79 86 73 14 25
4 72 1 73 67 89
6 69 32 94 91 18
7 47 7 77 11 67
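drop_duplicates takes a few other useful arguments besides keep='first'; a short sketch:

df.drop_duplicates(keep='last')    # keep the last occurrence of each duplicate instead
df.drop_duplicates(keep=False)     # drop every row that appears more than once
df.drop_duplicates(subset=[0, 1])  # compare only columns 0 and 1 when testing for duplicates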

Handling Outliers

  • Create a data source of 1000 rows and 3 columns (A, B, C) with values in the range 0-1, then clean out the values in column C that are greater than twice the column's standard deviation

In [27]:

df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()

Out[27]:

A B C
0 0.886005 0.831529 0.822956
1 0.391742 0.104039 0.434260
2 0.539745 0.950540 0.948526
3 0.392029 0.904831 0.680343
4 0.513764 0.979957 0.600518

In [30]:

# set the condition that flags a value as an outlier
twice_std = df['C'].std() * 2
twice_std

Out[30]:

0.570731429850527

In [34]:

# find which values in column C are outliers
df['C'] > twice_std
df.loc[df['C'] > twice_std]
indexs = df.loc[df['C'] > twice_std].index  # row indexes of the outliers

In [35]:

df.drop(labels=indexs)

Out[35]:

A B C
1 0.391742 0.104039 0.434260
5 0.512951 0.897948 0.245320
6 0.473968 0.979213 0.271424
7 0.843319 0.038657 0.184559
8 0.982931 0.881284 0.208147
10 0.784656 0.314898 0.089802
13 0.624428 0.252411 0.327818
15 0.213042 0.969693 0.567275
18 0.710010 0.534330 0.559441
19 0.026479 0.736259 0.120431
20 0.990039 0.982449 0.017151
23 0.158157 0.183678 0.428155
25 0.604838 0.950466 0.294742
26 0.304136 0.822809 0.388579
28 0.671559 0.726631 0.196907
30 0.811249 0.751182 0.467697
31 0.376243 0.805516 0.287484
33 0.570442 0.797945 0.026182
35 0.467125 0.062123 0.439725
36 0.861741 0.413997 0.543973
38 0.955328 0.817003 0.293787
47 0.458014 0.228608 0.285172
49 0.931513 0.403981 0.239329
51 0.008178 0.484172 0.021373
53 0.253882 0.300069 0.561118
55 0.752559 0.685649 0.451692
56 0.003363 0.486893 0.154598
57 0.859653 0.569252 0.007432
58 0.327716 0.419704 0.452710
59 0.068403 0.029346 0.226587
... ... ... ...
953 0.247954 0.072558 0.038834
954 0.199553 0.193049 0.027725
956 0.513195 0.175896 0.254432
957 0.080261 0.476756 0.521142
958 0.944795 0.550317 0.336043
961 0.464895 0.592027 0.195383
962 0.127469 0.300982 0.309427
963 0.595242 0.139702 0.450026
964 0.520342 0.639537 0.209403
965 0.372687 0.117984 0.262849
966 0.007270 0.044250 0.533105
967 0.854830 0.512720 0.173844
968 0.247666 0.972284 0.227422
970 0.047074 0.714412 0.392280
974 0.112649 0.483324 0.125105
975 0.307405 0.875641 0.432340
978 0.520662 0.003040 0.412422
979 0.337178 0.540283 0.257443
981 0.877978 0.842195 0.448030
982 0.273752 0.063285 0.291012
985 0.765849 0.974933 0.253099
988 0.139305 0.570496 0.535778
989 0.597190 0.973190 0.177517
990 0.817945 0.183825 0.330112
991 0.738457 0.578425 0.032489
992 0.159229 0.544980 0.242586
994 0.300998 0.352331 0.434336
996 0.609123 0.491735 0.045738
998 0.839935 0.181189 0.121180
999 0.798840 0.939869 0.150332
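Equivalently, the normal rows can be kept directly with boolean indexing, without collecting the outlier indexes first; a one-line sketch:

df.loc[df['C'] <= twice_std]  # keep only the rows whose C value is within twice the std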

5. DataFrame Concatenation and Merging

Concatenation

  • pd.concat, pd.append

pandas provides pd.concat, which works like np.concatenate but takes a few more parameters:

objs
axis=0
keys
join='outer' / 'inner': the concatenation mode; outer concatenates every item (matching or not), while inner concatenates only the items that match
ignore_index=False
  • Matched concatenation

In [1]:

import numpy as np
import pandas as pd
from pandas import DataFrame

In [2]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df1

Out[2]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [6]:

pd.concat((df1,df1,df1),axis=1)

Out[6]:

employee group employee group employee group
0 Bob Accounting Bob Accounting Bob Accounting
1 Jake Engineering Jake Engineering Jake Engineering
2 Lisa Engineering Lisa Engineering Lisa Engineering
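The keys and ignore_index parameters listed above can be sketched on the same df1 (not shown in the original cells):

pd.concat((df1,df1),axis=0,ignore_index=True)  # renumber the rows 0..5 instead of repeating 0..2
pd.concat((df1,df1),axis=0,keys=['a','b'])     # label each source block with an outer index level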
  • Mismatched concatenation
    • Mismatched means the indexes along the concatenation axis differ, e.g. the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally
    • There are 2 join modes:
      • Outer join: fill with NaN (the default)
      • Inner join: keep only the matching items

In [7]:

df1

Out[7]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [12]:

df2 = df1.copy()
df2.columns = ['employee','groups']
df2

Out[12]:

employee groups
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [14]:

pd.concat((df1,df2),axis=0)

Out[14]:

employee group groups
0 Bob Accounting NaN
1 Jake Engineering NaN
2 Lisa Engineering NaN
0 Bob NaN Accounting
1 Jake NaN Engineering
2 Lisa NaN Engineering

In [16]:

pd.concat((df1,df2),axis=0,join='inner')

Out[16]:

employee
0 Bob
1 Jake
2 Lisa
0 Bob
1 Jake
2 Lisa
  • Using the append function

In [17]:

df1.append(df2)

Out[17]:

employee group groups
0 Bob Accounting NaN
1 Jake Engineering NaN
2 Lisa Engineering NaN
0 Bob NaN Accounting
1 Jake NaN Engineering
2 Lisa NaN Engineering
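Note: DataFrame.append is deprecated in recent pandas releases (and removed in pandas 2.0); pd.concat produces the same result:

pd.concat((df1,df2),axis=0)  # equivalent to df1.append(df2)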

Merge operations: concatenation splices tables together, while merging combines tables based on the data they share

  • The difference between merge and concat is that merge joins tables on a common column
  • When merging with pd.merge(), the column that both tables share by name is automatically used as the key
  • Note that the order of the elements within each column does not need to match

One-to-one merge

In [18]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df1

Out[18]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [19]:

df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
df2

Out[19]:

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012

In [20]:

pd.merge(df1,df2)

Out[20]:

employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004

One-to-many merge

In [21]:

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df3

Out[21]:

employee group hire_date
0 Lisa Accounting 2004
1 Jake Engineering 2016

In [22]:

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })
df4

Out[22]:

group supervisor
0 Accounting Carly
1 Engineering Guido
2 Engineering Steve

In [23]:

pd.merge(df3,df4)

Out[23]:

employee group hire_date supervisor
0 Lisa Accounting 2004 Carly
1 Jake Engineering 2016 Guido
2 Jake Engineering 2016 Steve

Many-to-many merge

In [24]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df1

Out[24]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [25]:

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })
df5

Out[25]:

group supervisor
0 Engineering Carly
1 Engineering Guido
2 HR Steve

In [28]:

pd.merge(df1,df5)

Out[28]:

employee group supervisor
0 Jake Engineering Carly
1 Jake Engineering Guido
2 Lisa Engineering Carly
3 Lisa Engineering Guido

Normalizing the key

  • When columns conflict, i.e. several columns have the same name, use on= to specify which column serves as the key, together with suffixes to rename the conflicting columns

In [29]:

df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                 'group':['Accounting','Finance','Marketing']})
df1

Out[29]:

employee group
0 Jack Accounting
1 Summer Finance
2 Steve Marketing

In [30]:

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
df2

Out[30]:

employee group hire_date
0 Jack Accounting 2003
1 Bob sell 2009
2 Jake ceo 2012

In [32]:

pd.merge(df1,df2,on='group')

Out[32]:

employee_x group employee_y hire_date
0 Jack Accounting Jack 2003
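The suffixes parameter mentioned above controls the _x / _y endings; a short sketch:

pd.merge(df1,df2,on='group',suffixes=('_left','_right'))  # columns become employee_left / employee_right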
  • When the two tables have no column that can be joined directly, use left_on and right_on to specify manually which column on the left and right serves as the join key

In [33]:

df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})
df1

Out[33]:

employee group hire_date
0 Bobs Accounting 1998
1 Linda Product 2017
2 Bill Marketing 2018

In [34]:

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})
df5

Out[34]:

hire_dates name
0 1998 Lisa
1 2016 Bobs
2 2007 Bill

In [35]:

pd.merge(df1,df5,left_on='employee',right_on='name')

Out[35]:

employee group hire_date hire_dates name
0 Bobs Accounting 1998 2016 Bobs
1 Bill Marketing 2018 2007 Bill

Inner merge vs. outer merge: outer takes the union, inner takes the intersection

In [37]:

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})

In [38]:

df6

Out[38]:

food name
0 fish Peter
1 beans Paul
2 bread Mary

In [39]:

df7

Out[39]:

drink name
0 wine Mary
1 beer Joseph

In [43]:

pd.merge(df6,df7,how='right')

Out[43]:

food name drink
0 bread Mary wine
1 NaN Joseph beer
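The other how modes follow the same pattern; a quick sketch on df6 and df7:

pd.merge(df6,df7,how='inner')  # the default: only names present in both tables (Mary)
pd.merge(df6,df7,how='outer')  # the union: Peter, Paul, Mary and Joseph, with NaN where data is missing
pd.merge(df6,df7,how='left')   # all names from df6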

In [ ]:

# exercise: merge df1 and df2 (a sketch of one solution follows this cell)
dic1={
    
    'name':['tom','jay','helly'],
    'age':[11,12,33],
    'classRoom':[1,2,3]
}
df1=DataFrame(data=dic1)
df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
              index=['jay','tom','helly'],
             columns=['java','python','c'])
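df2's names live in its row index rather than in a column, so the join needs right_index=True. A minimal sketch of one solution (the cell above leaves this step open):

pd.merge(df1,df2,left_on='name',right_index=True)  # match df1's name column against df2's row index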

6. Population Analysis Case Study

  • Requirements:
    • Load the files and inspect the raw data
    • Merge the population data with the state abbreviation data
    • Drop the duplicated abbreviation column from the merged data
    • Check which columns contain missing data
    • Find which state/region values leave state as NaN, and deduplicate them
    • Fill in the correct state value for those state/region entries, removing every NaN from the state column
    • Merge in the state area data (areas)
    • We will find that the area (sq. mi) column has missing data; find which rows
    • Drop the rows containing missing data
    • Find the total population data for 2010
    • Compute each state's population density
    • Sort, and find the five states with the highest population density

In [1]:

import pandas as pd
from pandas import DataFrame
import numpy as np

In [10]:

abb = pd.read_csv('./data/state-abbrevs.csv')     # full name and abbreviation of each state
area = pd.read_csv('./data/state-areas.csv')      # full name and area of each state
pop = pd.read_csv('./data/state-population.csv')  # population data

In [7]:

pop.head(2)

Out[7]:

state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0

In [8]:

abb.head(2)

Out[8]:

state abbreviation
0 Alabama AL
1 Alaska AK

In [12]:

# merge (combine) the data
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head()

Out[12]:

state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
2 Alabama AL AL under18 2010 1130966.0
3 Alabama AL AL total 2010 4785570.0
4 Alabama AL AL under18 2011 1125763.0

In [13]:

abb_pop.drop(labels='abbreviation',axis=1,inplace=True)

In [15]:

# check which columns contain missing data
abb_pop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
state           2448 non-null object
state/region    2544 non-null object
ages            2544 non-null object
year            2544 non-null int64
population      2524 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB

In [16]:

## check which columns contain missing data
abb_pop.isnull().any(axis=0)

Out[16]:

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

In [19]:

# find which state/region values (abbreviations) leave state (the full name) as NaN, then deduplicate the abbreviations
abb_pop.head()

Out[19]:

state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
2 Alabama AL under18 2010 1130966.0
3 Alabama AL total 2010 4785570.0
4 Alabama AL under18 2011 1125763.0

In [24]:

# 1. find where the full names are null
abb_pop['state'].isnull()
# 2. take the rows where they are null
abb_pop.loc[abb_pop['state'].isnull()]
# 3. take the abbreviations of those rows
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
# 4. deduplicate the resulting Series of abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()  # unique() deduplicates the elements of a Series

Out[24]:

array(['PR', 'USA'], dtype=object)

In [ ]:

# fill in the correct state value for these state/region entries, removing every NaN from the state column
# locate the null full names belonging to the PR abbreviation and assign them PR's full name, PUERTO RICO

In [29]:

abb_pop['state/region'] == 'PR'
# take the rows whose abbreviation is PR
abb_pop.loc[abb_pop['state/region'] == 'PR']
# assign PR's full name to the null state values in the rows we located
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index  # row indexes where PR's full name is null
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'

In [31]:

# likewise, assign United States as the full name for USA
abb_pop['state/region'] == 'USA'
abb_pop.loc[abb_pop['state/region'] == 'USA']
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[indexs,'state'] = 'United States'

In [39]:

# merge in the state area data
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()

Out[39]:

state state/region ages year population area (sq. mi)
0 Alabama AL under18 2012.0 1117489.0 52423.0
1 Alabama AL total 2012.0 4817528.0 52423.0
2 Alabama AL under18 2010.0 1130966.0 52423.0
3 Alabama AL total 2010.0 4785570.0 52423.0
4 Alabama AL under18 2011.0 1125763.0 52423.0

In [45]:

# we find that the area (sq. mi) column has missing data; find which rows
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs

In [43]:

abb_pop_area.drop(labels=indexs,inplace=True)

In [48]:

# find the total population data for 2010: query('condition expression')
abb_pop_area.query('year==2010 & ages == "total"')
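query('...') is shorthand for ordinary boolean indexing; the equivalent spelled out:

abb_pop_area.loc[(abb_pop_area['year'] == 2010) & (abb_pop_area['ages'] == 'total')]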

In [ ]:

# compute each state's population density

In [51]:

abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']  # 'midu' means density
abb_pop_area.head()

Out[51]:

state state/region ages year population area (sq. mi) midu
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
2 Alabama AL under18 2010.0 1130966.0 52423.0 21.573851
3 Alabama AL total 2010.0 4785570.0 52423.0 91.287603
4 Alabama AL under18 2011.0 1125763.0 52423.0 21.474601

In [53]:

# sort, and find the five states with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)

Out[53]:

state state/region ages year population area (sq. mi) midu
391 District of Columbia DC total 2013.0 646449.0 68.0 9506.602941
385 District of Columbia DC total 2012.0 633427.0 68.0 9315.102941
387 District of Columbia DC total 2011.0 619624.0 68.0 9112.117647
431 District of Columbia DC total 1990.0 605321.0 68.0 8901.779412
389 District of Columbia DC total 2010.0 605125.0 68.0 8898.897059
426 District of Columbia DC total 1991.0 600870.0 68.0 8836.323529
429 District of Columbia DC total 1992.0 597567.0 68.0 8787.750000
422 District of Columbia DC total 1993.0 595302.0 68.0 8754.441176
392 District of Columbia DC total 2009.0 592228.0 68.0 8709.235294
425 District of Columbia DC total 1994.0 589240.0 68.0 8665.294118
420 District of Columbia DC total 1995.0 580519.0 68.0 8537.044118
396 District of Columbia DC total 2008.0 580236.0 68.0 8532.882353
406 District of Columbia DC total 2001.0 574504.0 68.0 8448.588235
394 District of Columbia DC total 2007.0 574404.0 68.0 8447.117647
408 District of Columbia DC total 2002.0 573158.0 68.0 8428.794118
419 District of Columbia DC total 1996.0 572379.0 68.0 8417.338235
412 District of Columbia DC total 2000.0 572046.0 68.0 8412.441176
400 District of Columbia DC total 2006.0 570681.0 68.0 8392.367647
410 District of Columbia DC total 1999.0 570220.0 68.0 8385.588235
402 District of Columbia DC total 2003.0 568502.0 68.0 8360.323529
404 District of Columbia DC total 2004.0 567754.0 68.0 8349.323529
417 District of Columbia DC total 1997.0 567739.0 68.0 8349.102941
398 District of Columbia DC total 2005.0 567136.0 68.0 8340.235294
415 District of Columbia DC total 1998.0 565232.0 68.0 8312.235294
421 District of Columbia DC under18 1995.0 123620.0 68.0 1817.941176
424 District of Columbia DC under18 1994.0 122170.0 68.0 1796.617647
418 District of Columbia DC under18 1996.0 121210.0 68.0 1782.500000
423 District of Columbia DC under18 1993.0 120471.0 68.0 1771.632353
416 District of Columbia DC under18 1997.0 119531.0 68.0 1757.808824
428 District of Columbia DC under18 1992.0 118636.0 68.0 1744.647059
... ... ... ... ... ... ... ...
53 Alaska AK total 1994.0 603308.0 656425.0 0.919081
56 Alaska AK total 1993.0 599434.0 656425.0 0.913180
50 Alaska AK total 1992.0 588736.0 656425.0 0.896882
55 Alaska AK total 1991.0 570193.0 656425.0 0.868634
48 Alaska AK total 1990.0 553290.0 656425.0 0.842884
63 Alaska AK under18 1998.0 192636.0 656425.0 0.293462
66 Alaska AK under18 1999.0 191422.0 656425.0 0.291613
69 Alaska AK under18 2000.0 190615.0 656425.0 0.290384
71 Alaska AK under18 2001.0 188771.0 656425.0 0.287574
73 Alaska AK under18 2002.0 188482.0 656425.0 0.287134
92 Alaska AK under18 2011.0 188329.0 656425.0 0.286901
62 Alaska AK under18 1997.0 188280.0 656425.0 0.286826
94 Alaska AK under18 2012.0 188162.0 656425.0 0.286647
86 Alaska AK under18 2013.0 188132.0 656425.0 0.286601
90 Alaska AK under18 2010.0 187902.0 656425.0 0.286251
54 Alaska AK under18 1994.0 187439.0 656425.0 0.285545
57 Alaska AK under18 1993.0 187190.0 656425.0 0.285166
75 Alaska AK under18 2003.0 186843.0 656425.0 0.284637
89 Alaska AK under18 2009.0 186351.0 656425.0 0.283888
77 Alaska AK under18 2004.0 186335.0 656425.0 0.283863
81 Alaska AK under18 2006.0 185580.0 656425.0 0.282713
61 Alaska AK under18 1996.0 185360.0 656425.0 0.282378
79 Alaska AK under18 2005.0 185304.0 656425.0 0.282293
59 Alaska AK under18 1995.0 184990.0 656425.0 0.281814
52 Alaska AK under18 1992.0 184878.0 656425.0 0.281644
83 Alaska AK under18 2007.0 184344.0 656425.0 0.280830
85 Alaska AK under18 2008.0 183124.0 656425.0 0.278972
51 Alaska AK under18 1991.0 182180.0 656425.0 0.277534
49 Alaska AK under18 1990.0 177502.0 656425.0 0.270407
2544 Puerto Rico NaN NaN NaN NaN 3515.0 NaN
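sort_values only orders the rows; to actually pull out the top five, slice the result. A short sketch:

abb_pop_area.sort_values(by='midu',axis=0,ascending=False).head(5)
# or, without sorting the whole frame:
abb_pop_area.nlargest(5,'midu')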

7. Advanced pandas Operations

In [3]:

import pandas as pd
from pandas import DataFrame
import numpy as np

Replace operations

  • replace works on both Series and DataFrame
  • Single-value replacement
    • Plain replacement: replace every matching element: to_replace=15, value='e'
    • Per-column single-value replacement: to_replace={column label: value to be replaced}, value='replacement'
  • Multi-value replacement
    • List replacement: to_replace=[] value=[]
    • Dict replacement (recommended): to_replace={old: new, old: new}

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df

Out[4]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [5]:

df.replace(to_replace=3,value='Three')

Out[5]:

0 1 2 3 4 5 6
0 44 62 Three 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [6]:

df.replace(to_replace={3:'aaa'})

Out[6]:

0 1 2 3 4 5 6
0 44 62 aaa 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [8]:

# replace a value only within the specified column (here: value 77 in column 5)
df.replace(to_replace={5:77},value='6666666')

Out[8]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 6666666 49
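The list form of multi-value replacement mentioned above pairs the two lists element-wise; a short sketch (values taken from this df):

df.replace(to_replace=[15, 18], value=['fifteen', 'eighteen'])  # 15 -> 'fifteen', 18 -> 'eighteen'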

Mapping operations

  • Concept: create a mapping table that binds values to a particular label or string (gives an element's value an alternative representation)

  • Create a df with two columns, name and salary, then give each name its corresponding Chinese name

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df

Out[10]:

name salary
0 jay 10000
1 tom 20000
2 jay 10000

In [14]:

# mapping table
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df

Out[14]:

name salary c_name
0 jay 10000 张三
1 tom 20000 李四
2 jay 10000 张三

Computation tool

  • The portion of a salary above 3000 is taxed at 50%; compute everyone's after-tax salary

In [16]:

def after_sal(s):
    # tax 50% of the portion above 3000 (every salary in this df exceeds 3000)
    return s - (s-3000)*0.5

In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df

Out[18]:

name salary c_name after_salary
0 jay 10000 张三 6500.0
1 tom 20000 李四 11500.0
2 jay 10000 张三 6500.0

Mapping indexes

  • Use the rename() function to replace row and column indexes
  • Parameters:
    • index replaces the row index
    • columns replaces the column index

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4

Out[19]:

color value
0 white 2
1 gray 5
2 purple 9
3 blue 0
4 green 1

In [20]:

new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)

Out[20]:

cc vv
first white 2
two gray 5
three purple 9
four blue 0
five green 1
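The same call with the parameters named explicitly, matching the bullet list above:

df4.rename(index=new_index,columns=new_col)  # identical result; index= replaces the row index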

Random sampling via reordering

  • take()
  • np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df

In [24]:

# df.take(['B','A','C'],axis=1)  # take does not accept labels, only positional indexes
df.take([1,0,2],axis=1)

In [32]:

np.random.permutation(3)  # returns a shuffled sequence 0..2

Out[32]:

array([0, 1, 2])

In [31]:

# shuffle both the rows and the columns
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]  # shuffle, then take the first 50 rows as a random sample
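pandas also ships a dedicated sampler that avoids the shuffle-then-slice detour; a short sketch:

df.sample(n=50)      # 50 randomly chosen rows
df.sample(frac=0.5)  # or sample by fraction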

Grouped data processing

  • The core of grouped data processing:
    • the groupby() function
    • the groups attribute, for inspecting the grouping

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df

Out[36]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [37]:

# group by fruit type
df.groupby(by='item')

Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>

In [38]:

# call groups to inspect the grouping
df.groupby(by='item').groups

Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}

In [40]:

# compute the average price of each kind of fruit
df.groupby(by='item').mean()['price']

Out[40]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [41]:

df.groupby(by='item')['price'].mean()  # recommended: select the column before aggregating

Out[41]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [42]:

# compute the average weight of each color of fruit
df.groupby(by='color')['weight'].mean()

Out[42]:

color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64

In [44]:

# compute each fruit's average price and merge it back into the original data
df

Out[44]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [47]:

series_price = df.groupby(by='item')['price'].mean() 
dic = series_price.to_dict()
dic  # the mapping table

Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}

In [49]:

df['mean_price'] = df['item'].map(dic)
df

Out[49]:

color item price weight mean_price
0 red Apple 4.0 12 3.00
1 yellow Banana 3.0 20 2.75
2 yellow Orange 3.0 50 3.50
3 green Banana 2.5 30 2.75
4 green Orange 4.0 20 3.50
5 green Apple 2.0 44 3.00

Advanced data aggregation

  • After grouping with groupby, you can pass custom functions to transform and apply for additional computations
  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • Both transform and apply perform the computation; just pass the function in
  • transform and apply can also take a lambda expression (see the sketch after the cells below)

In [56]:

def myMean(s):
    total = 0          # avoid shadowing the built-in sum
    for i in s:
        total += i
    return total/len(s)

In [57]:

df.groupby(by='item')['price'].apply(myMean)  # apply acts as the aggregation tool: one value per group

Out[57]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [58]:

df.groupby(by='item')['price'].transform(myMean)  # transform broadcasts each group's result back onto every original row

Out[58]:

0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64
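As noted in the bullet list above, a lambda works in place of the named function; a short sketch:

df.groupby(by='item')['price'].apply(lambda s: s.mean())      # one value per group
df.groupby(by='item')['price'].transform(lambda s: s.mean())  # broadcast back to every row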

Data Loading

  • Read the data in the file type-.txt

In [1]:

import pandas as pd
from pandas import DataFrame,Series
data=pd.read_csv('./data/type-.txt')
data

Out[1]:

你好-我好-他也好
0 也许-大概-有可能
1 然而-未必-不见得

In [2]:

data.shape

Out[2]:

(2, 1)
  • Store each word from the file as its own element in the DataFrame

In [4]:

data=pd.read_csv('./data/type-.txt',sep='-',header=None)
data

Out[4]:

0 1 2
0 你好 我好 他也好
1 也许 大概 有可能
2 然而 未必 不见得
  • Read data from a database

In [6]:

# connect to the database and obtain a connection object
import sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')

In [7]:

# read the values from the table
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df

In [ ]:

# write the values of a df into the database
data_1.to_sql('sql_data',conn)

Pivot Tables

  • A pivot table is a table format that dynamically rearranges and summarizes data by category. Most people have used pivot tables in Excel and experienced how powerful they are; in pandas the equivalent is called pivot_table.

  • Advantages of pivot tables:
    • Highly flexible; the analysis can be customized at will
    • Clear structure that is easy to read
    • Highly practical; a reporting power tool

In [16]:

import pandas as pd
import numpy as np
df = pd.read_csv('./data/透视表-篮球赛.csv',encoding='utf8')
df.head()

Out[16]:

对手 胜负 主客场 命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
0 勇士 … … 10 23 0.435 0.444 6 11 27
1 国王 … … 8 21 0.381 0.286 3 9 27
2 小牛 … … 10 19 0.526 0.462 3 7 29
3 灰熊 … … 8 20 0.400 0.250 5 8 22
4 76人 … … 10 20 0.500 0.250 3 13 27

pivot_table has four most important parameters: index, values, columns and aggfunc

  • The index parameter: the grouping criteria for the summary
    • Every pivot_table must have an index. To see Harden's numbers against each opponent, group by opponent and compute the mean of every stat:

In [14]:

df.pivot_table(index='对手')
  • To see the numbers against the same opponents split by home/away games, group by both 对手 and 主客场

In [17]:

df.pivot_table(index=['对手','主客场'])
  • The values parameter: selects the columns to compute
    • If we only need Harden's points, rebounds and assists by home/away and by win/loss:

In [18]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'])

Out[18]:

助攻 得分 篮板
主客场 胜负
主 胜 10.555556 34.222222 5.444444
主 负 8.666667 29.666667 5.000000
客 胜 9.000000 32.000000 4.916667
客 负 8.000000 20.000000 4.000000
  • The aggfunc parameter: sets the function applied when aggregating the data
    • When aggfunc is not set, it defaults to aggfunc='mean' and computes the average.
  • If we also want James Harden's total points, total rebounds and total assists by home/away and by win/loss:

In [20]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'],aggfunc=['sum','mean'])

Out[20]:

sum mean
助攻 得分 篮板 助攻 得分 篮板
主客场 胜负
主 胜 95 308 49 10.555556 34.222222 5.444444
主 负 26 89 15 8.666667 29.666667 5.000000
客 胜 108 384 59 9.000000 32.000000 4.916667
客 负 8 20 4 8.000000 20.000000 4.000000
  • The columns parameter: sets column-level fields
    • It splits the values fields into categories

In [33]:

# total points scored at home and away, across all opponents
df.pivot_table(index='主客场',values='得分',aggfunc='sum',fill_value=0)

Out[33]:

得分
主客场
主 397
客 404

In [34]:

# total points per opponent for home and away games (the previous totals, further split by opponent)
df.pivot_table(index='主客场',values='得分',columns='对手',aggfunc='sum',fill_value=0)

Out[34]:

对手 76人 勇士 国王 太阳 小牛 尼克斯 开拓者 掘金 步行者 湖人 灰熊 爵士 猛龙 篮网 老鹰 骑士 鹈鹕 黄蜂
主客场
主 29 0 0 0 29 37 0 21 29 0 60 56 38 37 0 35 26 0
客 27 27 27 48 0 31 48 0 26 36 49 29 0 0 29 0 0 27

Cross tabulation

  • A special pivot used to compute grouped frequencies, summarizing the data
  • pd.crosstab(index, columns)
    • index: the data to group by; the row index of the crosstab
    • columns: the column index of the crosstab

In [35]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
  • Count the number of smokers of each sex

In [36]:

pd.crosstab(df.smoke,df.sex)

Out[36]:

sex man women
smoke
False 2 3
True 2 2
  • Break down the smokers by age

In [14]:

pd.crosstab(df.age,df.smoke)

Out[14]:

smoke False True
age
15 0 1
17 0 1
22 1 0
23 1 0
24 1 0
25 1 0
31 0 1
35 0 1
57 1 0
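crosstab can also append row and column totals via its margins parameter; a short sketch:

pd.crosstab(df.smoke,df.sex,margins=True)  # adds an 'All' row and column of totals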