第七章 数据清洗与准备
7.1 处理缺失值
pandas对象的所有描述性统计信息默认情况下都会排除缺失值。
对于数值型数据,pandas使用浮点值NaN(Not a Number,非数字)来表示缺失值。
在pandas中,我们沿用了R语言中的编程惯例,将缺失值称为NA,意思是not available(不可用)。
string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
#当清洗数据用于分析时,对缺失数据本身进行分析以确定数据收集问题或数据丢失导致的数据偏差通常很重要。
#Python内建的None值在对象数组中也被当作NA处理
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
NA处理方法
函数名     描述
dropna    根据每个标签的值是否是缺失数据来筛选轴标签,并根据允许丢失的数据量来确定阈值
fillna    用某些值填充缺失的数据或使用插值方法(如'ffill'或'bfill')
isnull    返回表明哪些值是缺失值的布尔值
notnull   isnull的反函数
7.1.1 过滤缺失值
#在Series上使用dropna,它会返回Series中所有的非空数据及其索引值
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
#dropna默认情况下会删除包含缺失值的行
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
#传入how='all'时,将删除所有值均为NA的行
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
#如果要用同样的方式去删除列,传入参数axis=1
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1,how = 'all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
df = pd.DataFrame(np.random.randn(7,3))
df
          0         1         2
0 -0.100288  0.117081  0.629897
1  0.145224  0.827820 -0.197561
2 -1.372610 -0.521075  0.783224
3 -0.679339  0.355698 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
          0         1         2
0 -0.100288       NaN       NaN
1  0.145224       NaN       NaN
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
df.dropna()
          0         1         2
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#假设你只想保留包含一定数量观察值的行,可以用thresh参数来实现
#thresh=n表示:保留下来的每一行,其非NA值的数目大于等于n;thresh=2即保留至少有2个非NA值的行
df.dropna(thresh = 2)
          0         1         2
2 -1.372610       NaN  0.783224
3 -0.679339       NaN -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
7.1.2 补全缺失值
主要使用fillna方法来补全缺失值。调用fillna时,可以使用一个常数来替代缺失值
fillna函数参数
参数      描述
value    标量值或字典型对象,用于填充缺失值
method   插值方法,如果没有其他参数,默认是'ffill'
axis     需要填充的轴,默认axis=0
inplace  修改被调用的对象,而不是生成一个备份
limit    用于前向或后向填充时最大的填充范围
df.fillna(0)
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#在调用fillna时使用字典,你可以为不同列设定不同的填充值
df.fillna({1:0.5,2:0})
          0         1         2
0 -0.100288  0.500000  0.000000
1  0.145224  0.500000  0.000000
2 -1.372610  0.500000  0.783224
3 -0.679339  0.500000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#fillna返回的是一个新的对象,但你也可以修改已经存在的对象
_ = df.fillna(0,inplace = True)
df
          0         1         2
0 -0.100288  0.000000  0.000000
1  0.145224  0.000000  0.000000
2 -1.372610  0.000000  0.783224
3 -0.679339  0.000000 -1.283404
4 -1.587708  0.254616  0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821  1.618943
#用于重建索引的相同的插值方法也可以用于fillna
df = pd.DataFrame(np.random.randn(6,3))
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.390681 -2.386976
3 -0.076475 -0.034995  1.635065
4  0.528814  0.711717  0.696243
5 -0.193577  0.162206 -0.520191
df.iloc[2:,1] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN  0.696243
5 -0.193577       NaN -0.520191
df.iloc[4:,2] = NA
df
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814  0.921389  1.635065
5 -0.193577  0.921389  1.635065
df.fillna(method='backfill')
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158       NaN -2.386976
3 -0.076475       NaN  1.635065
4  0.528814       NaN       NaN
5 -0.193577       NaN       NaN
df.fillna(method='ffill',limit=2)
          0         1         2
0 -0.428405  0.199383  0.354342
1  0.019782  0.921389  0.534736
2 -0.583158  0.921389 -2.386976
3 -0.076475  0.921389  1.635065
4  0.528814       NaN  1.635065
5 -0.193577       NaN  1.635065
data = pd.Series([5,NA,3,NA,7])
data.fillna(data.mean())
0 5.0
1 5.0
2 3.0
3 5.0
4 7.0
dtype: float64
data.mean()
5.0
7.2 数据转换
7.2.1 删除重复值
data = pd.DataFrame({'k1':['one','two']*3+['two'],
'k2':[1,1,2,3,3,4,4,]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
#DataFrame的duplicated方法返回的是一个布尔值Series,
#这个Series反映的是每一行是否存在重复(与之前出现过的行相同)情况
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
#drop_duplicates返回的是DataFrame,内容是duplicated返回数组中为False的部分
#删除重复值,只保留唯一值
#这些方法默认会考察所有的列来判断一行是否重复
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
#duplicated和drop_duplicates默认都是保留第一个观测到的值。传入参数keep='last'将会返回最后一个
data.drop_duplicates(['k1','k2'],keep = 'last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
7.2.2 使用函数或映射进行数据转换
data = pd.DataFrame({'food':['bacon','pulled pork','bacon','pastrami','corned beef',
'bacon','pastrami','honey ham','nova lox'],
'ounces':[4.0,3.0,12.0,6.0,7.5,8.0,3.0,5.0,6.0]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
'bacon':'pig',
'pulled pork':'pig',
'pastrami':'cow',
'corned beef':'cow',
'honey ham':'pig',
'nova lox':'salmon',
}
#Series的map方法接收一个函数或一个包含映射关系的字典型对象
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
#使用map是一种可以便捷执行按元素转换及其他清洗相关操作的方法
data['food'].map(lambda x :meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
7.2.3 替代值
data.replace方法与data.str.replace方法是不同的,data.str.replace是对字符串进行按元素替代的。
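#下面用一个小例子说明二者的区别(示例数据为笔者补充,非原书内容):
s = pd.Series(['cat','dog','catfish'])
s.replace('cat','bird')       #按整值替换:只有整体等于'cat'的元素被替换,'catfish'不变
s.str.replace('cat','bird')   #按元素做子串替换:'catfish'会变成'birdfish'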
data = pd.Series([1,-999,2,-999,-1000,3])
data
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
data.replace(-999,np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
#如果你想要一次替代多个值,你可以传入一个列表和替代值
data.replace([-999,-1000],np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
#要将不同的值替换为不同的值,可以传入替代值的列表
data.replace([-999,-1000],[np.nan,0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
#参数也可以通过字典传递
data.replace({-999:np.nan,-1000:0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
7.2.4 重命名轴索引
data = pd.DataFrame(np.arange(12).reshape(3,4),
index = ['Ohio','Colorado','New York'],
columns = ['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x :x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index = str.title,columns = str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
#rename可以结合字典型对象使用,为轴标签的子集提供新的值
data.rename(index = {'OHIO':'INDIANA'},
columns = {'three':'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
7.2.5 离散化和分箱
ages = [20,22,24,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
#pd.value_counts(cats)对pandas.cut结果中每个箱内的数据量进行计数
pd.value_counts(cats)
(18, 25] 5
(25, 35] 3
(35, 60] 3
(60, 100] 1
dtype: int64
#与区间的数学符号一致,小括号表示边是开放的,中括号表示它是封闭的(包括边)。
#你可以通过传递right=False来改变哪一边是封闭的
pd.cut(ages,[18,26,36,61,100],right = False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
#你也可以通过向labels选项传递一个列表或数组来传入自定义的箱名
group_names = ['youth','youngadult','middleaged','senior']
a = pd.cut(ages,bins,labels = group_names)
pd.value_counts(a)
youth 5
youngadult 3
middleaged 3
senior 1
dtype: int64
#如果你传给cut整数个的箱来代替显式的箱边,pandas将根据数据中的最小值和最大值计算出等长的箱
#precision=2的选项将十进制精度限制在两位
data = np.random.rand(20)
pd.cut(data,4,precision=2)
[(0.51, 0.74], (0.29, 0.51], (0.74, 0.97], (0.29, 0.51], (0.06, 0.29], ..., (0.06, 0.29], (0.29, 0.51], (0.74, 0.97], (0.51, 0.74], (0.74, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.06, 0.29] < (0.29, 0.51] < (0.51, 0.74] < (0.74, 0.97]]
#qcut是一个与分箱密切相关的函数,它基于样本分位数进行分箱。取决于数据的分布,使用cut通常不会使每个箱具有相同数据量的数据点。
#由于qcut使用样本的分位数,你可以通过qcut获得等长的箱
data = np.random.randn(1000)
cats = pd.qcut(data,4)#切成4份
cats
[(-0.00707, 0.65], (-0.00707, 0.65], (-2.936, -0.626], (-0.626, -0.00707], (-2.936, -0.626], ..., (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.00707, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.626] < (-0.626, -0.00707] < (-0.00707, 0.65] < (0.65, 3.139]]
pd.value_counts(cats)
(-2.936, -0.626] 250
(-0.626, -0.00707] 250
(-0.00707, 0.65] 250
(0.65, 3.139] 250
dtype: int64
#与cut类似,qcut也可以传入自定义的分位数(0和1之间的数,包含端点)
pd.cut(data,[0,0.1,0.5,0.9,1])
[(0.5, 0.9], (0.1, 0.5], NaN, NaN, NaN, ..., NaN, NaN, NaN, NaN, (0.1, 0.5]]
Length: 1000
Categories (4, interval[float64]): [(0.0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1.0]]
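#注意:上面这行实际调用的是cut而不是qcut,因此randn数据中落在[0,1]之外的值都变成了NaN。
#按原书意图,此处应当使用qcut传入自定义分位数(写法示意):
pd.qcut(data,[0,0.1,0.5,0.9,1.])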
7.2.6 检测和过滤异常值
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013343     0.030142     0.020312     0.042330
std       1.012528     0.984443     0.999869     0.982124
min      -2.942920    -3.799121    -3.412855    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.274496     3.998493     3.264216     2.907744
#假设你想要找出一列中绝对值大于三的值
col = data[2]
col[np.abs(col) > 3]
91 -3.044972
711 3.264216
858 -3.412855
Name: 2, dtype: float64
data[(np.abs(data)>3).any(1)]
            0         1         2         3
91  -0.341046 -0.555910 -3.044972  0.474512
325  2.233400 -3.027404  0.845704  1.441757
332 -0.460361 -3.799121 -0.312931  0.478548
457  0.011004  3.998493  0.977419  0.577620
711 -0.603762 -1.650901  3.264216 -0.803395
746  1.455624 -3.178085 -0.387140  0.859193
858 -2.127923  0.163924 -3.412855 -0.073186
946  3.274496 -0.699596 -1.016879  0.358252
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.013069     0.030148     0.020506     0.042330
std       1.011680     0.977459     0.997573     0.982124
min      -2.942920    -3.000000    -3.000000    -2.632107
25%      -0.668303    -0.629645    -0.654843    -0.643005
50%       0.010349     0.040064     0.026197     0.028003
75%       0.701525     0.679371     0.706170     0.714993
max       3.000000     3.000000     3.000000     2.907744
data
            0         1         2         3
0    0.997285  0.352539 -0.158277 -0.069519
1   -1.144523 -0.173312 -0.651227  0.686972
2    0.650131  0.271325 -0.304344 -0.281217
3    0.527442 -2.023765  0.827982 -1.855424
4   -0.578451 -0.949705 -0.582701 -1.725697
..        ...       ...       ...       ...
995  0.494311  0.528862 -0.191097  0.118121
996 -0.582154  1.251247 -1.622055 -0.436563
997  0.687732 -1.670059 -0.272708 -0.369290
998 -0.443230  0.984728 -0.283506 -1.473420
999 -0.276277 -0.597256  1.269391 -0.704337

1000 rows × 4 columns
#语句np.sign(data)根据数据中的值的正负分别生成1和-1的数值
np.sign(data).head()
     0    1    2    3
0  1.0  1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0  1.0
2  1.0  1.0 -1.0 -1.0
3  1.0 -1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
7.2.7 置换和随机抽样
使用numpy.random.permutation对DataFrame中的Series或行进行置换(随机重排序)是非常方便的。
sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
df = pd.DataFrame(np.arange(5*4).reshape(5,4))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
#整数数组可以用在基于iloc的索引或等价的take函数中
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
#要选出一个无放回(不重复选取)的随机子集,可以使用Series和DataFrame的sample方法
df.sample(n=3)
    0   1   2   3
0   0   1   2   3
4  16  17  18  19
3  12  13  14  15
#要生成一个带有替代值的样本(允许有重复选择),将replace=True传入sample方法
choices = pd.Series([5,6,-1,6,4])
draws = choices.sample(n=10,replace = True)
draws
2 -1
0 5
2 -1
3 6
0 5
1 6
1 6
4 4
3 6
1 6
dtype: int64
7.2.8 计算指标/虚拟变量
将分类变量转换为“虚拟”或“指标”矩阵是另一种用于统计建模或机器学习的转换操作
pandas有一个get_dummies函数用于实现该功能
df = pd.DataFrame({'key':['b','b','a','c','a','b'],
'data1':range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
#在某些情况下,你可能想在指标DataFrame的列上加入前缀,然后与其他数据合并。
#在get_dummies方法中有一个前缀参数用于实现该功能
dummies = pd.get_dummies(df['key'],prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames = ['movie_id','title','genres']
movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
,sep='::',header=None,names = mnames)
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
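#按照警告的提示,显式传入engine='python'即可避免该警告(写法示意,路径同上,此处用相对路径代替):
movies = pd.read_table('datasets/movielens/movies.dat',sep='::',
                       header=None,names=mnames,engine='python')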
movies[::10]
      movie_id                                              title                       genres
0            1                                   Toy Story (1995)  Animation|Children's|Comedy
10          11                     American President, The (1995)         Comedy|Drama|Romance
20          21                                  Get Shorty (1995)          Action|Comedy|Drama
30          31                             Dangerous Minds (1995)                        Drama
40          41                                 Richard III (1995)                    Drama|War
...        ...                                                ...                          ...
3840      3910                          Dancer in the Dark (2000)                Drama|Musical
3850      3920  Faraway, So Close (In Weiter Ferne, So Nah!) (...                Drama|Fantasy
3860      3930         Creature From the Black Lagoon, The (1954)                       Horror
3870      3940             Slumber Party Massacre III, The (1990)                       Horror
3880      3950                                   Tigerland (2000)                        Drama

389 rows × 3 columns
#为每个电影流派添加指标变量需要进行一些数据处理。首先,我们从数据集中提取出所有不同的流派的列表
all_genres = []
for x in movies.genres:
all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies),len(genres)))
dummies = pd.DataFrame(zero_matrix,columns=genres)
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
dummies
      Animation  Children's  Comedy  Adventure  Fantasy  ...  War  Musical  Mystery  Film-Noir  Western
0           0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
1           0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
2           0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
3           0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
4           0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
...         ...         ...     ...        ...      ...  ...  ...      ...      ...        ...      ...
3878        0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
3879        0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
3880        0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
3881        0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0
3882        0.0         0.0     0.0        0.0      0.0  ...  0.0      0.0      0.0        0.0      0.0

3883 rows × 18 columns
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split("|"))
array([0, 1, 2], dtype=int64)
for i,gen in enumerate(movies.genres):
indices = dummies.columns.get_indexer(gen.split("|"))
dummies.iloc[i,indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1.0
Genre_Children's 1.0
Genre_Comedy 1.0
Genre_Adventure 0.0
Genre_Fantasy 0.0
Genre_Romance 0.0
Genre_Drama 0.0
Genre_Action 0.0
Genre_Crime 0.0
Genre_Thriller 0.0
Genre_Horror 0.0
Genre_Sci-Fi 0.0
Genre_Documentary 0.0
Genre_War 0.0
Genre_Musical 0.0
Genre_Mystery 0.0
Genre_Film-Noir 0.0
Genre_Western 0.0
Name: 0, dtype: object
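#补充(笔者注,非原书内容):对这类以分隔符连接的多值字符串,pandas的Series.str.get_dummies可以一步得到同样的指标矩阵:
dummies2 = movies.genres.str.get_dummies('|')
movies_windic2 = movies.join(dummies2.add_prefix('Genre_'))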
#将get_dummies与cut等离散化函数结合使用是统计应用的一个有用方法
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
7.3 字符串操作
7.3.1 字符串对象方法
#一个逗号分隔的字符串可以使用split方法拆分成多块:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
#split常和strip一起使用,用于清除空格(包括换行)
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
#这些子字符串可以使用加法与两个冒号分隔符连接在一起
first,second,third = pieces
first+"::"+second+"::"+third
'a::b::guido'
#在字符串'::'的join方法中传入一个列表或元组,是一种更快且更加Pythonic(Python风格化)的方法
"::".join(pieces)
'a::b::guido'
#使用Python的in关键字是检测子字符串的最佳方法,尽管index和find也能实现同样的功能
'guido' in val
True
#请注意find和index的区别在于index在字符串没有找到时会抛出一个异常(而find是返回-1)
val.index('guido')
5
val.find('guido')
5
#count返回的是某个特定的子字符串在字符串中出现的次数
val.count('guido')
1
#replace将用一种模式替代另一种模式。它通常也用于传入空字符串来删除某个模式
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'
Python内建字符串方法
方法                 描述
count               返回子字符串在字符串中的非重叠出现次数
endswith            如果字符串以后缀结尾则返回True
startswith          如果字符串以前缀开始则返回True
join                使用字符串作为间隔符,用于粘合其他字符串的序列
index               如果在字符串中找到,则返回子字符串中第一个字符的位置;如果找不到则引发ValueError
find                返回字符串中第一次出现子字符串时第一个字符的位置;类似index,但如果没有找到则返回-1
rfind               返回子字符串在字符串中最后一次出现时第一个字符的位置;如果没有找到,则返回-1
replace             使用一个字符串替代另一个字符串
strip,rstrip,lstrip 修剪空白(包括换行符),相当于对每个元素进行x.strip()(以及rstrip、lstrip)
split               使用分隔符将字符串拆分为子字符串的列表
lower               将大写字母转换为小写字母
upper               将小写字母转换为大写字母
casefold            将字符转换为小写,并将任何特定于区域的变量字符组合转换为常见的可比较形式
ljust,rjust         左对齐或右对齐;用空格(或其他字符)填充字符串的相反侧,以返回具有最小宽度的字符串
7.3.2 正则表达式
re模块主要有三个主题:模式匹配、替代、拆分
描述一个或多个空白字符的正则表达式是\s+
如果你需要将相同的表达式应用到多个字符串上,推荐使用re.compile创建一个正则表达式对象,这样做有利于节约CPU周期。
为了在正则表达式中避免转义符\的影响,可以使用原生字符串语法,比如r'C:\x',它等价于'C:\\x'。
正则表达式方法
方法       描述
findall   将字符串中所有的非重叠匹配模式以列表形式返回
finditer  与findall类似,但返回的是迭代器
match     在字符串起始位置匹配模式,也可以将模式组件匹配到分组中;如果模式匹配上了,返回一个匹配对象,否则返回None
search    扫描字符串以匹配模式,如果扫描到了就返回匹配对象;与match方法不同的是,search方法的匹配可以是字符串的任意位置,而不仅仅是字符串的起始位置
split     根据模式,将字符串拆分为多个部分
sub,subn  用替换表达式替换字符串中所有的匹配(sub)或第n个出现的匹配串(subn);使用符号\1、\2...来引用替换字符串中的匹配组元素
import re
text = 'foo bar\t baz \tqux'
re.split('\s+',text)
['foo', 'bar', 'baz', 'qux']
#你可以使用re.compile自行编译,形成一个可复用的正则表达式对象
regex = re.compile('\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
#如果你想获得的是一个所有匹配正则表达式的模式的列表,你可以使用findall方法
regex.findall(text)
[' ', '\t ', ' \t']
text = """
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern,flags = re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
#search返回的是文本中第一个匹配到的电子邮件地址
m = regex.search(text)
m
<re.Match object; span=(6, 21), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
#regex.match只在模式出现于字符串起始位置时进行匹配,如果没有匹配到,返回None
print(regex.match(text))
None
#sub会返回一个新的字符串,原字符串中的模式会被一个新的字符串替代
print(regex.sub('ABC',text))
Dave ABC
Steve ABC
Rob ABC
Ryan ABC
#假设您想查找电子邮件地址,并将每个地址分为三个部分:用户名,域名和域名后缀。
#要实现这一点,可以用括号将模式包起来
pattern = r'([A-Z0-9.%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern,flags = re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
#当模式可以分组时,findall返回的是包含元组的列表
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com
7.3.3 pandas中的向量化字符串函数
部分向量化字符串方法
方法       描述
cat       根据可选的分隔符按元素黏合字符串
contains  返回是否含有某个模式/正则表达式的布尔值数组
count     模式出现次数的计数
extract   使用正则表达式从字符串Series中分组抽取一个或多个字符串;返回的结果是每个分组形成一列的DataFrame
endswith  等价于对每个元素使用x.endswith(模式)
data = {'Dave':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Ryan':np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Ryan True
dtype: bool
#你可以使用data.map将字符串和有效的正则表达式方法(以lambda或其他函数的方式传递)应用到每个值上,
#但是在NA(null)值上会失败,Series有面向数组的方法用于跳过NA值的字符串操作。这些方法通过Series的str属性进行调用
data.str.contains('gmail')
Dave False
Steve True
Rob True
Ryan NaN
dtype: object
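#作为对照,直接用map把普通的字符串操作应用到每个值上,会在NaN处抛出TypeError(示意,非原书内容):
#data.map(lambda x:'gmail' in x)   #TypeError: argument of type 'float' is not iterable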
#正则表达式也可以结合任意的re模块选项使用
data.str.findall(pattern,flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Ryan NaN
dtype: object
#可以使用str.get或在str属性内部索引
matches = data.str.match(pattern,flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Ryan NaN
dtype: object
#要访问嵌入式列表中的元素,我们可以将索引传递给这些函数中的任意一个
matches.str.get(l)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
...(中间调用栈从略)...
AttributeError: Can only use .str accessor with string values!
matches.str[0]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
...(中间调用栈从略)...
AttributeError: Can only use .str accessor with string values!
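#上面两处报错的原因是:在较新版本的pandas中,str.match返回的是布尔值Series,不再携带分组内容,
#而布尔值无法使用.str访问器。若要取出分组元素,可以改用findall的结果(写法示意,笔者补充):
data.str.findall(pattern,flags=re.IGNORECASE).str.get(0)   #取每行第一个匹配到的(用户名,域名,后缀)元组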
#你可以使用字符串切片的类似语法进行向量化切片
data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Ryan NaN
dtype: object
第八章 数据规整:连接、联合与重塑
8.1 分层索引
#你看到的是一个以MultiIndex作为索引的Series的美化视图。索引中的“间隙”表示“直接使用上面的标签”
data = pd.Series(np.random.randn(9),
index = [['a','a','a','b','b','c','c','d','d'],
[1,2,3,1,3,1,2,2,3]])
data
a 1 1.007189
2 -1.296221
3 0.274992
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
d 2 -0.371843
3 1.669025
dtype: float64
data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
data['b']
1 0.228913
3 1.352917
dtype: float64
data['b':'c']
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
dtype: float64
data.loc[['b','c']]
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
dtype: float64
#在“内部”层级中进行选择也是可以的
data.loc[:,3]
a 0.274992
b 1.352917
d 1.669025
dtype: float64
#分层索引在重塑数据和数组透视表等分组操作中扮演了重要角色。
#可以使用unstack方法将数据在DataFrame中重新排列
data.unstack()
          1         2         3
a  1.007189 -1.296221  0.274992
b  0.228913       NaN  1.352917
c  0.886429 -2.001637       NaN
d       NaN -0.371843  1.669025
#unstack的反操作是stack
data.unstack().stack()
a 1 1.007189
2 -1.296221
3 0.274992
b 1 0.228913
3 1.352917
c 1 0.886429
2 -2.001637
d 2 -0.371843
3 1.669025
dtype: float64
#在DataFrame中,每个轴都可以拥有分层索引
frame = pd.DataFrame(np.arange(12).reshape(4,3),
index = [['a','a','b','b'],[1,2,1,2]],
columns = [['ohio','ohio','colorado'],['green','red','green']])
frame
      ohio     colorado
     green red    green
a 1      0   1        2
  2      3   4        5
b 1      6   7        8
  2      9  10       11
frame.index.names = ['key1','key2']
frame
           ohio     colorado
          green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.columns.names = ['state','color']
frame
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['ohio']
color     green  red
key1 key2
a    1        0    1
     2        3    4
b    1        6    7
     2        9   10
#一个MultiIndex对象可以使用其自身的构造函数创建并复用
pd.MultiIndex.from_arrays([['ohio','ohio','colorado'],['green','red','green']],
names = ['state','color'])
MultiIndex([( 'ohio', 'green'),
( 'ohio', 'red'),
('colorado', 'green')],
names=['state', 'color'])
8.1.1 重排序和层级排序
#swaplevel接收两个层级序号或层级名称,返回一个进行了层级变更的新对象(但是数据是不变的)
frame.swaplevel('key1','key2')
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
#sort_index只能在单一层级上对数据进行排序。
#在进行层级变换时,使用sort_index以使得结果按照层级进行字典排序也很常见
frame.sort_index(level=1)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.sort_index(level=0)
state      ohio     colorado
color     green red    green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.swaplevel(0,1).sort_index(level=0)
state      ohio     colorado
color     green red    green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
8.1.2 按层级进行汇总统计
DataFrame和Series中很多描述性和汇总性统计有一个level选项,通过level选项你可以指定你想要在某个特定的轴上进行聚合
frame.sum(level='key2')
state  ohio     colorado
color green red    green
key2
1         6   8       10
2        12  14       16
frame.sum(level='color',axis=1)
color      green  red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
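#补充(笔者注,非原书内容):sum(level=...)这种写法在较新版本的pandas中已被弃用(2.0中移除),
#它在内部使用的正是groupby机制,等价写法示意:
frame.groupby(level='key2').sum()          #等价于frame.sum(level='key2')
frame.T.groupby(level='color').sum().T     #对列层级聚合的一种替代写法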
8.1.3 使用DataFrame的列进行索引
通常我们不会使用DataFrame中一个或多个列作为行索引;反而你可能想要将行索引移动到DataFrame的列中。
how参数的不同连接类型
选项      行为
'inner'  只对两张表都有的键的交集进行联合
'left'   对所有左表的键进行联合
'right'  对所有右表的键进行联合
'outer'  对两张表都有的键的并集进行联合
frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
'c':['one','one','one','two','two','two','two'],
'd':[0,1,2,0,1,2,3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
#DataFrame的set_index函数会生成一个新的DataFrame,新的DataFrame使用一个或多个列作为索引
frame2 = frame.set_index(['c','d'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
#默认情况下这些列会从DataFrame中移除,你也可以将它们留在DataFrame中
frame.set_index(['c','d'],drop = False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
#reset_index是set_index的反操作,分层索引的索引层级会被移动到列中
frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
8.2 联合与合并数据集
pandas.merge根据一个或多个键将行进行连接。对于SQL或其他关系型数据库的用户来说,这种方式比较熟悉,它实现的是数据库的连接操作。
pandas.concat使对象在轴向上进行黏合或“堆叠”。
combine_first实例方法允许将重叠的数据拼接在一起,以使用一个对象中的值填充另一个对象中的缺失值。
8.2.1 数据库风格的DataFrame连接
合并或连接操作通过一个或多个键连接行来联合数据集
merge函数参数
参数          描述
left         合并操作中左边的DataFrame
right        合并操作中右边的DataFrame
how          'inner'、'outer'、'left'、'right'之一;默认是'inner'
on           需要连接的列名。必须是在两边的DataFrame对象都有的列名,并以left和right中的列名的交集作为连接键
left_on      left DataFrame中用作连接键的列
right_on     right DataFrame中用作连接键的列
left_index   使用left的行索引作为它的连接键(如果是MultiIndex,则是多个键)
right_index  使用right的行索引作为它的连接键(如果是MultiIndex,则是多个键)
sort         通过连接键按字母顺序对合并的数据进行排序;默认为True(在大数据集上某些情况下禁用该功能可以获得更好的性能)
suffixes     在重叠情况下,添加到列名后的字符串元组;默认是('_x','_y')(例如如果待合并的DataFrame中都含有'data'列,那么结果中会出现'data_x'、'data_y')
copy         如果为False,则在某些特殊情况下避免将数据复制到结果数据结构中;默认情况下总是复制
indicator    添加一个特殊的列_merge,指示每一行的来源;值将根据每行中连接数据的来源分别为'left_only'、'right_only'或'both'
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],
'data1':range(7)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'key':['a','b','d'],
'data2':range(3)})
df2
  key  data2
0   a      0
1   b      1
2   d      2
pd.merge(df1,df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
#请注意,我并没有指定在哪一列上进行连接。如果连接的键信息没有指定,merge会自动将重叠列名作为连接的键。
#但是,显式地指定连接键才是好的实现
pd.merge(df1,df2,on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
#如果每个对象的列名是不同的,你可以分别为它们指定列名
df3 = pd.DataFrame({'Lkey':['b','b','a','c','a','a','b'],
'data1':range(7)})
df3
  Lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
df4 = pd.DataFrame({'Rkey':['a','b','d'],
'data2':range(3)})
df4
  Rkey  data2
0    a      0
1    b      1
2    d      2
pd.merge(df3,df4,left_on='Lkey',right_on='Rkey')
  Lkey  data1 Rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
#其他可选的选项有'left'、'right'和'outer'。
#外连接(outer join)是键的并集,联合了左连接和右连接的效果
pd.merge(df1,df2,how='outer')
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],
'data1':range(6)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
'data2':range(5)})
df2
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4
pd.merge(df1,df2,on='key',how='left')
   key  data1  data2
0    b      0    1.0
1    b      0    3.0
2    b      1    1.0
3    b      1    3.0
4    a      2    0.0
5    a      2    2.0
6    c      3    NaN
7    a      4    0.0
8    a      4    2.0
9    b      5    1.0
10   b      5    3.0
pd.merge(df1,df2,on='key',how='inner')
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
#使用多个键进行合并时,传入一个列名的列表
left = pd.DataFrame({'key1':['foo','foo','bar'],
'key2':['one','two','one'],
'lval':[1,2,3]})
left
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
'key2':['one','one','one','two'],
'rval':[4,5,6,7]})
right
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
pd.merge(left,right,on=['key1','key2'],how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
#merge有一个suffixes后缀选项,
#用于在左右两边DataFrame对象的重叠列名后指定需要添加的字符串
pd.merge(left,right,on=['key1'])
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left,right,on=['key1'],suffixes=('_left','_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
8.2.2 根据索引合并
在某些情况下,DataFrame中用于合并的键是它的索引。在这种情况下,你可以传递left_index=True或right_index=True(或者都传)来表示索引需要用来作为合并的键
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
'value':range(6)})
left1
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
right1 = pd.DataFrame({'group_val':[3.5,7]},index = ['a','b'])
right1
   group_val
a        3.5
b        7.0
pd.merge(left1,right1,left_on = 'key',right_index=True)
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
pd.merge(left1,right1,left_on = 'key',right_index=True,how='outer')
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
#在多层索引数据的情况下,事情会更复杂,在索引上连接是一个隐式的多键合并
lefth = pd.DataFrame({'key1':['ohio','ohio','ohio','Nevada','Nevada'],
'key2':[2000,2001,2002,2001,2002],
'data':np.arange(5.)})
lefth
     key1  key2  data
0    ohio  2000   0.0
1    ohio  2001   1.0
2    ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
righth = pd.DataFrame(np.arange(12).reshape(6,2),
index=[['nevada','nevada','ohio','ohio','ohio','ohio'],[2001,2000,2000,2000,2001,2002]],
columns = ['event1','event2'])
righth
             event1  event2
nevada 2001       0       1
       2000       2       3
ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
     key1  key2  data  event1  event2
0    ohio  2000   0.0     4.0     5.0
0    ohio  2000   0.0     6.0     7.0
1    ohio  2001   1.0     8.0     9.0
2    ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     NaN     NaN
4  Nevada  2002   4.0     NaN     NaN
4  nevada  2001   NaN     0.0     1.0
4  nevada  2000   NaN     2.0     3.0
left2 = pd.DataFrame([[1,2],[3,4],[5,6]],
index = ['a','c','e'],
columns = ['ohio','nevada'])
left2
   ohio  nevada
a     1       2
c     3       4
e     5       6
right2 = pd.DataFrame([[7,8],[9,10],[11,12],[13,14]],
index = ['b','c','d','e'],
columns = ['missouri','alabama'])
right2
   missouri  alabama
b         7        8
c         9       10
d        11       12
e        13       14
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
#join实例方法,用于按照索引合并
left2.join(right2,how='outer')
   ohio  nevada  missouri  alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
left1.join(right1,on='key')
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
another = pd.DataFrame([[7,8],[9,10],[11,12],[16,17]],
index = ['a','c','e','f'],
columns = ['new york','oregon'])
another
   new york  oregon
a         7       8
c         9      10
e        11      12
f        16      17
left2.join([right2,another])
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
left2.join([right2,another],how='outer')
   ohio  nevada  missouri  alabama  new york  oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0
b   NaN     NaN       7.0      8.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
f   NaN     NaN       NaN      NaN      16.0    17.0
8.2.3 沿轴向连接
另一种数据组合操作可互换地称为拼接、绑定或堆叠。
NumPy的concatenate函数可以在NumPy数组上实现该功能
concat函数的参数
参数        描述
objs       需要连接的pandas对象列表或字典,这是必选参数
axis       连接的轴向;默认是0(沿着行方向)
join       可以是'inner'或'outer'(默认是'outer');用于指定连接方式是内连接(inner)还是外连接(outer)
join_axes  用于指定其他n-1轴的特定索引,可以替代内/外连接的逻辑
keys       与要连接的对象关联的值,沿着连接轴形成分层索引;可以是任意值的列表或数组,也可以是元组的数组,也可以是数组的列表(如果向levels参数传入多层数组)
levels     在键值传递时,该参数用于指定多层索引的层级
arr = np.arange(12).reshape(3,4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
s1 = pd.Series([0,1],index=['a','b'])
s1
a 0
b 1
dtype: int64
s2 = pd.Series([2,3,4],index=['c','d','e'])
s2
c 2
d 3
e 4
dtype: int64
s3 = pd.Series([5,6],index=['f','g'])
s3
f 5
g 6
dtype: int64
pd.concat([s1,s2,s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
#concat方法是沿着axis=0的轴向生效的,生成另一个Series。
#如果你传递axis=1,返回的结果则是一个DataFrame(axis=1时是列)
pd.concat([s1,s2,s3],axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
s4 = pd.concat([s1,s3])
s4
a 0
b 1
f 5
g 6
dtype: int64
pd.concat([s1,s4],axis=1)
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6
pd.concat([s1,s4],axis=1,join='inner')
   0  1
a  0  0
b  1  1
#可以使用join_axes来指定用于连接其他轴向的轴
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN
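#注意(笔者注,非原书内容):join_axes参数在pandas 1.0中已被移除;
#在新版本中可以先连接、再用reindex指定行顺序,效果相同(写法示意):
pd.concat([s1,s4],axis=1).reindex(['a','c','b','e'])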
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result.unstack()
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
pd.concat([s1,s2,s3],axis = 1,keys=['one','two','three'])
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
df1 = pd.DataFrame(np.arange(6).reshape(3,2),
index = ['a','b','c'],
columns = ['one','two'])
df1
   one  two
a    0    1
b    2    3
c    4    5
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,
index = ['a','c'],
columns = ['three','four'])
df2
   three  four
a      5     6
c      7     8
pd.concat([df1,df2],axis=1,keys=['lever1','level2'])
  lever1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
#如果你传递的是对象的字典而不是列表的话,则字典的键会用于keys选项
pd.concat({'level1':df1,'level2':df2},axis=1)
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
pd.concat([df1,df2],axis=1,keys=['lever1','level2'],names=['upper','lower'])
upper lever1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
df1 = pd.DataFrame(np.random.randn(3,4),columns = ['a','b','c','d'])
df1
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
df2
          b         d         a
0  1.638046 -0.850112  1.895532
1 -1.175952  1.370474 -0.992356
pd.concat([df1,df2],ignore_index=True)
          a         b         c         d
0 -1.119593  1.953114 -1.514807 -1.054782
1  0.543393  1.172903  0.945829  0.656643
2  1.012695  1.481920 -0.413033 -1.280521
3  1.895532  1.638046       NaN -0.850112
4 -0.992356 -1.175952       NaN  1.370474
8.2.4 联合重叠数据
a = pd.Series([np.nan,2.5,0,3.5,4.5,np.nan],
index=['f','e','d','c','b','a'])
a
f NaN
e 2.5
d 0.0
c 3.5
b 4.5
a NaN
dtype: float64
b = pd.Series([0,np.nan,2,np.nan,np.nan,5],
index=['a','b','c','d','e','f'])
b
a 0.0
b NaN
c 2.0
d NaN
e NaN
f 5.0
dtype: float64
#考虑NumPy的where函数,这个函数可以进行面向数组的if-else等价操作
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
#Series有一个combine_first方法,该方法可以等价于下面这种使用pandas常见数据对齐逻辑的轴向操作
b.combine_first(a)
a 0.0
b 4.5
c 2.0
d 0.0
e 2.5
f 5.0
dtype: float64
df1 = pd.DataFrame({'a':[1,np.nan,5,np.nan],
'b':[np.nan,2,np.nan,6],
'c':range(2,18,4)})
df1
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2 = pd.DataFrame({'a':[5,4,np.nan,3,7],
'b':[np.nan,3,4,6,8]})
df2
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
df1.combine_first(df2)
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
8.3 重塑和透视
重排列表格型数据有多种基础操作。这些操作被称为重塑或透视。
8.3.1 使用多层索引进行重塑
stack(堆叠)该操作会“旋转”或将列中的数据透视到行。
unstack(拆堆)该操作会将行中的数据透视到列。
data = pd.DataFrame(np.arange(6).reshape(2,3),
index = pd.Index(['ohio','colorado'],name='state'),
columns = pd.Index(['one','two','three'],name='number'))
data
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result = data.stack()
result
state number
ohio one 0
two 1
three 2
colorado one 3
two 4
three 5
dtype: int32
result.unstack()
number    one  two  three
state
ohio        0    1      2
colorado    3    4      5
result.unstack(0)
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
result.unstack('state')
state   ohio  colorado
number
one        0         3
two        1         4
three      2         5
#如果层级中的所有值并未包含于每个子分组中时,拆分可能会引入缺失值
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s1
a 0
b 1
c 2
d 3
dtype: int64
s2 = pd.Series([4,5,6],index=['c','d','e'])
s2
c 4
d 5
e 6
dtype: int64
data = pd.concat([s1,s2],keys=['one','two'])
data
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
#默认情况下,堆叠会过滤出缺失值,因此堆叠拆堆的操作是可逆的
data.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data.unstack().stack(dropna = False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
#当你在DataFrame中拆堆时,被拆堆的层级会变为结果中最低的层级
df = pd.DataFrame({'left':result,'right':result+5},
columns=pd.Index(['left','right'],name='side'))
df
side             left  right
state    number
ohio     one        0      5
         two        1      6
         three      2      7
colorado one        3      8
         two        4      9
         three      5     10
df.unstack('state')
side   left          right
state  ohio colorado  ohio colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
#在调用stack方法时,我们可以指明需要堆叠的轴向名称
df.unstack('state').stack('side')
state         colorado  ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
df.unstack('state').stack()
side            left  right
number state
one    ohio        0      5
       colorado    3      8
two    ohio        1      6
       colorado    4      9
three  ohio        2      7
       colorado    5     10
8.3.2 将“长”透视为“宽”
data = pd.read_csv('examples/macrodata.csv')
data.head()
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
item   realgdp  infl  unemp
0     2710.349   NaN    5.8
1     2778.801   NaN    5.1
2     2775.488   NaN    5.3
3     2785.204   NaN    5.6
4     2847.699   NaN    5.2
data.index = periods.to_timestamp('D','end')
data.head()
item                            realgdp  infl  unemp
date
1959-03-31 23:59:59.999999999  2710.349   NaN    5.8
1959-06-30 23:59:59.999999999  2778.801   NaN    5.1
1959-09-30 23:59:59.999999999  2775.488   NaN    5.3
1959-12-31 23:59:59.999999999  2785.204   NaN    5.6
1960-03-31 23:59:59.999999999  2847.699   NaN    5.2
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
                           date     item     value
0 1959-03-31 23:59:59.999999999  realgdp  2710.349
1 1959-03-31 23:59:59.999999999    unemp     5.800
2 1959-06-30 23:59:59.999999999  realgdp  2778.801
3 1959-06-30 23:59:59.999999999    unemp     5.100
4 1959-09-30 23:59:59.999999999  realgdp  2775.488
5 1959-09-30 23:59:59.999999999    unemp     5.300
6 1959-12-31 23:59:59.999999999  realgdp  2785.204
7 1959-12-31 23:59:59.999999999    unemp     5.600
8 1960-03-31 23:59:59.999999999  realgdp  2847.699
9 1960-03-31 23:59:59.999999999    unemp     5.200
pivoted = ldata.pivot('date','item','value')
pivoted
item                             realgdp  unemp
date
1959-03-31 23:59:59.999999999   2710.349    5.8
1959-06-30 23:59:59.999999999   2778.801    5.1
1959-09-30 23:59:59.999999999   2775.488    5.3
1959-12-31 23:59:59.999999999   2785.204    5.6
1960-03-31 23:59:59.999999999   2847.699    5.2
...                                  ...    ...
2008-09-30 23:59:59.999999999  13324.600    6.0
2008-12-31 23:59:59.999999999  13141.920    6.9
2009-03-31 23:59:59.999999999  12925.410    8.1
2009-06-30 23:59:59.999999999  12901.504    9.2
2009-09-30 23:59:59.999999999  12990.341    9.6

203 rows × 2 columns
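#补充(笔者注,非原书内容):pandas 2.0起pivot的参数改为仅限关键字传递,上述调用需要写成:
pivoted = ldata.pivot(index='date',columns='item',values='value')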
8.3.3 将“宽”透视为“长”
在DataFrame中,pivot方法的反操作是pandas.melt
df = pd.DataFrame({'key':['foo','bar','baz'],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
df
   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9
#当使用pandas.melt时,我们必须指明哪些列是分组指标(如果有的话)
melted = pd.melt(df)
melted
   variable value
0       key   foo
1       key   bar
2       key   baz
3         A     1
4         A     2
5         A     3
6         B     4
7         B     5
8         B     6
9         C     7
10        C     8
11        C     9
melted = pd.melt(df,['key'])
melted
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
reshaped = melted.pivot('key','variable','value')
reshaped
variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7
#使用reset_index来将数据回移一列
reshaped.reset_index()
variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
pd.melt(df,id_vars=['key'],value_vars=['A','B'])
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
pd.melt(df,value_vars=['A','B','C'])
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
pd.melt(df,value_vars=['A','B','key'])
  variable value
0        A     1
1        A     2
2        A     3
3        B     4
4        B     5
5        B     6
6      key   foo
7      key   bar
8      key   baz
第九章 绘图与可视化
%matplotlib notebook
9.1 简明matplotlib API入门
使用Jupyter notebook时有个细节需要注意,在每个单元格运行后,图表被重置,因此对于更复杂的图表,你必须将所有的绘图命令放在单个的notebook单元格中
import matplotlib.pyplot as plt
import numpy as np
data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)
9.1.1 图片与子图
matplotlib所绘制的图位于图片(Figure)对象中
pyplot.subplots选项
参数          描述
nrows        子图的行数
ncols        子图的列数
sharex       所有子图使用相同的x轴刻度(调整xlim会影响所有子图)
sharey       所有子图使用相同的y轴刻度(调整ylim会影响所有子图)
subplot_kw   传入add_subplot的关键字参数字典,用于生成子图
**fig_kw     在生成图片时使用的额外关键字参数,例如plt.subplots(2,2,figsize=(8,6))
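表中参数对应plt.subplots的典型用法如下(补充示意,非原书此处的代码):
fig,axes = plt.subplots(2,2,sharex=True,sharey=True,figsize=(8,6))
#axes是一个2x2的子图对象数组,可以按位置索引,例如:
axes[0,1].plot(np.random.randn(50).cumsum(),'k--')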
#你可以使用plt.figure生成一个新的图片
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)