Python for Data Analysis (2)

Chapter 7: Data Cleaning and Preparation

7.1 Handling Missing Data

  • All of the descriptive statistics on pandas objects exclude missing values by default (see the quick check below).
  • For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
  • In pandas, we adopt a convention from the R language and refer to missing data as NA, which stands for not available.
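#(Added quick check, not from the original notes; assumes the usual pd/np imports from earlier parts)
#Descriptive statistics skip NaN by default, so the mean of [1, NaN, 3] is 2.0
pd.Series([1, np.nan, 3]).mean()
2.0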
string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
#When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by missing data.
#Python's built-in None value is also treated as NA in object arrays
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
  • NA handling methods
Method  Description
dropna  Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna  Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull  Return boolean values indicating which values are missing
notnull  Negation of isnull

7.1.1 Filtering Out Missing Data

#Using dropna on a Series returns the Series with only the non-null data and index values
from numpy import nan as  NA
data = pd.Series([1,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
#By default, dropna drops any row containing a missing value
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
#Passing how='all' drops only rows where every value is NA
data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
#To drop columns in the same way, pass axis=1
data[4] = NA
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
data.dropna(axis=1,how = 'all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = pd.DataFrame(np.random.randn(7,3))
df
0 1 2
0 -0.100288 0.117081 0.629897
1 0.145224 0.827820 -0.197561
2 -1.372610 -0.521075 0.783224
3 -0.679339 0.355698 -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
0 1 2
0 -0.100288 NaN NaN
1 0.145224 NaN NaN
2 -1.372610 NaN 0.783224
3 -0.679339 NaN -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
df.dropna()
0 1 2
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
#Suppose you want to keep only rows containing at least a certain number of observations. You can indicate this with the thresh argument
#thresh=n keeps every row that has at least n non-NA values; thresh=2 keeps rows with at least 2 non-NA values
df.dropna(thresh = 2)
0 1 2
2 -1.372610 NaN 0.783224
3 -0.679339 NaN -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943

7.1.2 Filling In Missing Data

  • fillna is the workhorse method for filling in missing data. Calling fillna with a constant replaces missing values with that value
  • fillna function arguments
Argument  Description
value  Scalar value or dict-like object to use to fill missing values
method  Interpolation method: 'ffill' or 'bfill' ('ffill' by default if no other arguments are given)
axis  Axis to fill on; default axis=0
inplace  Modify the calling object without producing a copy
limit  For forward and backward filling, maximum number of consecutive periods to fill
df.fillna(0)
0 1 2
0 -0.100288 0.000000 0.000000
1 0.145224 0.000000 0.000000
2 -1.372610 0.000000 0.783224
3 -0.679339 0.000000 -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
#Calling fillna with a dict lets you use a different fill value for each column
df.fillna({1:0.5,2:0})
0 1 2
0 -0.100288 0.500000 0.000000
1 0.145224 0.500000 0.000000
2 -1.372610 0.500000 0.783224
3 -0.679339 0.500000 -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
#fillna returns a new object, but you can modify the existing object in place
_ = df.fillna(0,inplace = True)
df
0 1 2
0 -0.100288 0.000000 0.000000
1 0.145224 0.000000 0.000000
2 -1.372610 0.000000 0.783224
3 -0.679339 0.000000 -1.283404
4 -1.587708 0.254616 0.149215
5 -0.323276 -0.393636 -1.828212
6 -0.639610 -1.677821 1.618943
#The same interpolation methods available for reindexing can also be used with fillna
df = pd.DataFrame(np.random.randn(6,3))
df
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 0.390681 -2.386976
3 -0.076475 -0.034995 1.635065
4 0.528814 0.711717 0.696243
5 -0.193577 0.162206 -0.520191
df.iloc[2:,1] = NA
df
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 NaN -2.386976
3 -0.076475 NaN 1.635065
4 0.528814 NaN 0.696243
5 -0.193577 NaN -0.520191
df.iloc[4:,2] = NA
df
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 NaN -2.386976
3 -0.076475 NaN 1.635065
4 0.528814 NaN NaN
5 -0.193577 NaN NaN
df.fillna(method='ffill')
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 0.921389 -2.386976
3 -0.076475 0.921389 1.635065
4 0.528814 0.921389 1.635065
5 -0.193577 0.921389 1.635065
df.fillna(method='backfill')
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 NaN -2.386976
3 -0.076475 NaN 1.635065
4 0.528814 NaN NaN
5 -0.193577 NaN NaN
df.fillna(method='ffill',limit=2)
0 1 2
0 -0.428405 0.199383 0.354342
1 0.019782 0.921389 0.534736
2 -0.583158 0.921389 -2.386976
3 -0.076475 0.921389 1.635065
4 0.528814 NaN 1.635065
5 -0.193577 NaN 1.635065
data = pd.Series([5,NA,3,NA,7])
data.fillna(data.mean())
0    5.0
1    5.0
2    3.0
3    5.0
4    7.0
dtype: float64
data.mean()
5.0
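#(Added sketch) The same idea extends to a DataFrame: filling each column's missing values
#with that column's own mean (df_demo is a made-up example frame, not from the original notes)
df_demo = pd.DataFrame({'a':[1.0,np.nan,3.0],'b':[np.nan,2.0,4.0]})
df_demo.fillna(df_demo.mean())
a b
0 1.0 3.0
1 2.0 2.0
2 3.0 4.0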

7.2 Data Transformation

7.2.1 Removing Duplicates

data = pd.DataFrame({'k1':['one','two']*3+['two'],
                    'k2':[1,1,2,3,3,4,4,]})
data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
#The DataFrame method duplicated returns a boolean Series
#indicating whether each row is a duplicate (i.e., identical to a row observed earlier)
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
#drop_duplicates returns a DataFrame containing the rows where duplicated returned False
#i.e., duplicates are dropped and only unique rows are kept
#By default, both methods consider all of the columns when identifying duplicates
data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
data['v1'] = range(7)
data
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
#duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' returns the last one
data.drop_duplicates(['k1','k2'],keep = 'last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

7.2.2 Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food':['bacon','pulled pork','bacon','pastrami','corned beef',
                             'bacon','pastrami','honey ham','nova lox'],
                     'ounces':[4.0,3.0,12.0,6.0,7.5,8.0,3.0,5.0,6.0]})

data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon':'pig',
'pulled pork':'pig',  
'pastrami':'cow',
'corned beef':'cow',  
'honey ham':'pig', 
'nova lox':'salmon',
}
#The map method on a Series accepts a function or dict-like object containing a mapping
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
#Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations
data['food'].map(lambda x :meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

7.2.3 Replacing Values

  • The data.replace method is distinct from data.str.replace, which performs element-wise string substitution (see the small illustration below).
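#(Added illustration) replace matches whole values, while str.replace edits inside each string
s = pd.Series(['cat','dog'])
s.replace('cat','lion')    # -> ['lion', 'dog']
s.str.replace('a','o')     # -> ['cot', 'dog']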
data = pd.Series([1,-999,2,-999,-1000,3])
data
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
#If you want to replace multiple values at once, you can pass a list of values and a single substitute value
data.replace([-999,-1000],np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
#To use a different replacement for each value, pass a list of substitutes
data.replace([-999,-1000],[np.nan,0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
#The argument can also be passed as a dict
data.replace({-999:np.nan,-1000:0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

7.2.4 Renaming Axis Indexes

data = pd.DataFrame(np.arange(12).reshape(3,4),
                   index = ['Ohio','Colorado','New York'],
                   columns = ['one','two','three','four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
transform = lambda x :x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
data.rename(index = str.title,columns = str.upper)
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
#rename can be used in conjunction with a dict-like object, providing new values for a subset of the axis labels
data.rename(index = {'OHIO':'INDIANA'},
           columns = {'three':'peekaboo'})
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

7.2.5 Discretization and Binning

ages = [20,22,24,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
#pd.value_counts(cats) counts how many values fall into each bin produced by pandas.cut
pd.value_counts(cats)
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
#Consistent with mathematical interval notation, a parenthesis means the side is open and a square bracket means it is closed (inclusive).
#You can change which side is closed by passing right=False
pd.cut(ages,[18,26,36,61,100],right = False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
#You can also pass your own bin names by passing a list or array to the labels option
group_names = ['youth','youngadult','middleaged','senior']
a = pd.cut(ages,bins,labels = group_names)
pd.value_counts(a)
youth         5
youngadult    3
middleaged    3
senior        1
dtype: int64
#If you pass an integer number of bins to cut instead of explicit bin edges, pandas computes equal-length bins based on the minimum and maximum values in the data
#The precision=2 option limits the decimal precision to two digits
data = np.random.rand(20)
pd.cut(data,4,precision=2)
[(0.51, 0.74], (0.29, 0.51], (0.74, 0.97], (0.29, 0.51], (0.06, 0.29], ..., (0.06, 0.29], (0.29, 0.51], (0.74, 0.97], (0.51, 0.74], (0.74, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.06, 0.29] < (0.29, 0.51] < (0.51, 0.74] < (0.74, 0.97]]
#qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points.
#Since qcut uses sample quantiles instead, you get roughly equal-size bins
data = np.random.randn(1000)
cats = pd.qcut(data,4)#cut into quartiles
cats
[(-0.00707, 0.65], (-0.00707, 0.65], (-2.936, -0.626], (-0.626, -0.00707], (-2.936, -0.626], ..., (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.626, -0.00707], (-0.00707, 0.65]]
Length: 1000
Categories (4, interval[float64]): [(-2.936, -0.626] < (-0.626, -0.00707] < (-0.00707, 0.65] < (0.65, 3.139]]
pd.value_counts(cats)
(-2.936, -0.626]      250
(-0.626, -0.00707]    250
(-0.00707, 0.65]      250
(0.65, 3.139]         250
dtype: int64
#Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive); note the call below mistakenly uses cut, so the values are treated as bin edges and data outside [0, 1] becomes NaN (the intended qcut call is sketched after the output)
pd.cut(data,[0,0.1,0.5,0.9,1])
[(0.5, 0.9], (0.1, 0.5], NaN, NaN, NaN, ..., NaN, NaN, NaN, NaN, (0.1, 0.5]]
Length: 1000
Categories (4, interval[float64]): [(0.0, 0.1] < (0.1, 0.5] < (0.5, 0.9] < (0.9, 1.0]]
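#(Added sketch) The intended call from the book uses qcut, which treats the numbers as
#quantiles rather than bin edges; the cutoffs depend on the random sample, so output is not shown
pd.qcut(data,[0,0.1,0.5,0.9,1.])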

7.2.6 Detecting and Filtering Outliers

data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.013343 0.030142 0.020312 0.042330
std 1.012528 0.984443 0.999869 0.982124
min -2.942920 -3.799121 -3.412855 -2.632107
25% -0.668303 -0.629645 -0.654843 -0.643005
50% 0.010349 0.040064 0.026197 0.028003
75% 0.701525 0.679371 0.706170 0.714993
max 3.274496 3.998493 3.264216 2.907744
#Suppose you wanted to find values in one of the columns exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]
91    -3.044972
711    3.264216
858   -3.412855
Name: 2, dtype: float64
data[(np.abs(data)>3).any(axis=1)]
0 1 2 3
91 -0.341046 -0.555910 -3.044972 0.474512
325 2.233400 -3.027404 0.845704 1.441757
332 -0.460361 -3.799121 -0.312931 0.478548
457 0.011004 3.998493 0.977419 0.577620
711 -0.603762 -1.650901 3.264216 -0.803395
746 1.455624 -3.178085 -0.387140 0.859193
858 -2.127923 0.163924 -3.412855 -0.073186
946 3.274496 -0.699596 -1.016879 0.358252
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.013069 0.030148 0.020506 0.042330
std 1.011680 0.977459 0.997573 0.982124
min -2.942920 -3.000000 -3.000000 -2.632107
25% -0.668303 -0.629645 -0.654843 -0.643005
50% 0.010349 0.040064 0.026197 0.028003
75% 0.701525 0.679371 0.706170 0.714993
max 3.000000 3.000000 3.000000 2.907744
data
0 1 2 3
0 0.997285 0.352539 -0.158277 -0.069519
1 -1.144523 -0.173312 -0.651227 0.686972
2 0.650131 0.271325 -0.304344 -0.281217
3 0.527442 -2.023765 0.827982 -1.855424
4 -0.578451 -0.949705 -0.582701 -1.725697
... ... ... ... ...
995 0.494311 0.528862 -0.191097 0.118121
996 -0.582154 1.251247 -1.622055 -0.436563
997 0.687732 -1.670059 -0.272708 -0.369290
998 -0.443230 0.984728 -0.283506 -1.473420
999 -0.276277 -0.597256 1.269391 -0.704337

1000 rows × 4 columns

#The statement np.sign(data) produces 1 and -1 values depending on whether the values in data are positive or negative
np.sign(data).head()
0 1 2 3
0 1.0 1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0 1.0
2 1.0 1.0 -1.0 -1.0
3 1.0 -1.0 1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0

7.2.7 Permutation and Random Sampling

  • Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function.
sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
df = pd.DataFrame(np.arange(5*4).reshape(5,4))
df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
#The integer array can then be used in iloc-based indexing or the equivalent take function
df.take(sampler)
0 1 2 3
3 12 13 14 15
2 8 9 10 11
0 0 1 2 3
4 16 17 18 19
1 4 5 6 7
#To select a random subset without replacement, you can use the sample method on Series and DataFrame
df.sample(n=3)
0 1 2 3
0 0 1 2 3
4 16 17 18 19
3 12 13 14 15
#To generate a sample with replacement (allowing repeat choices), pass replace=True to sample
choices = pd.Series([5,6,-1,6,4])
draws = choices.sample(n=10,replace = True)
draws
2   -1
0    5
2   -1
3    6
0    5
1    6
1    6
4    4
3    6
1    6
dtype: int64

7.2.8 Computing Indicator/Dummy Variables

  • Converting a categorical variable into a “dummy” or “indicator” matrix is another type of transformation for statistical modeling or machine learning applications
  • pandas has a get_dummies function for doing this
df = pd.DataFrame({'key':['b','b','a','c','a','b'],
                  'data1':range(6)})
df
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
pd.get_dummies(df['key'])
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
#In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data.
#get_dummies has a prefix argument for doing this
dummies = pd.get_dummies(df['key'],prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
mnames = ['movie_id','title','genres']
movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
                       ,sep='::',header=None,names = mnames)
<ipython-input-188-960ac40c2eea>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table(r'D:\PythonFlie\python\利用python进行数据分析(书籍笔记)\pydata-book-2nd-edition\datasets\movielens\movies.dat'
movies[::10]
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
10 11 American President, The (1995) Comedy|Drama|Romance
20 21 Get Shorty (1995) Action|Comedy|Drama
30 31 Dangerous Minds (1995) Drama
40 41 Richard III (1995) Drama|War
... ... ... ...
3840 3910 Dancer in the Dark (2000) Drama|Musical
3850 3920 Faraway, So Close (In Weiter Ferne, So Nah!) (... Drama|Fantasy
3860 3930 Creature From the Black Lagoon, The (1954) Horror
3870 3940 Slumber Party Massacre III, The (1990) Horror
3880 3950 Tigerland (2000) Drama

389 rows × 3 columns

#Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies),len(genres)))
dummies = pd.DataFrame(zero_matrix,columns=genres)
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
dummies
Animation Children's Comedy Adventure Fantasy Romance Drama Action Crime Thriller Horror Sci-Fi Documentary War Musical Mystery Film-Noir Western
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3878 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3881 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3882 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3883 rows × 18 columns

gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split("|"))
array([0, 1, 2], dtype=int64)
for i,gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i,indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                              1.0
Genre_Children's                             1.0
Genre_Comedy                                 1.0
Genre_Adventure                              0.0
Genre_Fantasy                                0.0
Genre_Romance                                0.0
Genre_Drama                                  0.0
Genre_Action                                 0.0
Genre_Crime                                  0.0
Genre_Thriller                               0.0
Genre_Horror                                 0.0
Genre_Sci-Fi                                 0.0
Genre_Documentary                            0.0
Genre_War                                    0.0
Genre_Musical                                0.0
Genre_Mystery                                0.0
Genre_Film-Noir                              0.0
Genre_Western                                0.0
Name: 0, dtype: object
#A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 0 0 0 0 1
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0

7.3 String Manipulation

7.3.1 String Object Methods

#A comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
#split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
#These substrings could be concatenated together with a two-colon delimiter using addition
first,second,third = pieces
first+"::"+second+"::"+third
'a::b::guido'
#Passing a list or tuple to the join method on the string '::' is a faster and more Pythonic way
"::".join(pieces)
'a::b::guido'
#Using Python's in keyword is the best way to detect a substring, though index and find can also be used
'guido' in val
True
#Note the difference between find and index: index raises an exception if the string isn't found (versus find returning -1)
val.index('guido')
5
val.find('guido')
5
#count returns the number of occurrences of a particular substring in the string

val.count('guido')
1
#replace substitutes occurrences of one pattern for another. It is also commonly used to delete patterns by passing an empty string
val.replace(',','::')
'a::b:: guido'
val.replace(',','')
'ab guido'
  • Python built-in string methods
Method  Description
count  Return the number of non-overlapping occurrences of the substring in the string
endswith  Return True if the string ends with the suffix
startswith  Return True if the string starts with the prefix
join  Use the string as a delimiter for concatenating a sequence of other strings
index  Return the position of the first character of the first occurrence of the substring; raises ValueError if not found
find  Return the position of the first character of the first occurrence of the substring; like index, but returns -1 if not found
rfind  Return the position of the first character of the last occurrence of the substring; returns -1 if not found
replace  Replace occurrences of one string with another string
strip, rstrip, lstrip  Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip) for each element
split  Break the string into a list of substrings using the passed delimiter
lower  Convert uppercase characters to lowercase
upper  Convert lowercase characters to uppercase
casefold  Convert characters to lowercase, converting any region-specific variable character combinations to a common comparable form
ljust, rjust  Left-justify or right-justify; pad the opposite side of the string with spaces (or some other fill character) to return a string with a minimum width

7.3.2 Regular Expressions

  • The re module functions fall into three categories: pattern matching, substitution, and splitting
  • The regex describing one or more whitespace characters is \s+
  • If you need to apply the same expression to many strings, creating a regex object with re.compile is recommended, as it saves CPU cycles
  • To avoid the effect of the escape character \ in a regular expression, use raw string literals such as r'C:\x' instead of the equivalent 'C:\\x'
  • Regular expression methods
Method  Description
findall  Return all non-overlapping matching patterns in a string as a list
finditer  Like findall, but returns an iterator
match  Match the pattern at the start of the string and optionally segment the pattern components into groups; returns a match object if the pattern matches, otherwise None
search  Scan the string for a match to the pattern and return a match object if so; unlike match, the match can be anywhere in the string, not just at the beginning
split  Break the string into pieces at each occurrence of the pattern
sub, subn  Replace all (sub) or the first n occurrences (subn) of the pattern in the string with a replacement expression; use the symbols \1, \2, ... to refer to matched group elements in the replacement string
import re
text = 'foo bar\t baz \tqux'
re.split('\s+',text)
['foo', 'bar', 'baz', 'qux']
#你可以使用re.compile自行编译,形成一个可复用的正则表达式对象
regex = re.compile('\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
#如果你想获得的是一个所有匹配正则表达式的模式的列表,你可以使用findall方法
regex.findall(text)
[' ', '\t ', ' \t']
text = """
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9.%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern,flags = re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
#search返回的是文本中第一个匹配到的电子邮件地址
m = regex.search(text)
m
<re.Match object; span=(6, 21), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
#regex.match only matches if the pattern occurs at the start of the string; it returns None otherwise
print(regex.match(text))
None
#sub returns a new string with occurrences of the pattern replaced by a new string
print(regex.sub('ABC',text))
Dave ABC
Steve ABC
Rob ABC
Ryan ABC

#Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix.
#To do this, put parentheses around the parts of the pattern to segment
pattern = r'([A-Z0-9.%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern,flags = re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
#When the pattern has groups, findall returns a list of tuples
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username:\1,Domain:\2,Suffix:\3',text))
Dave Username:dave,Domain:google,Suffix:com
Steve Username:steve,Domain:gmail,Suffix:com
Rob Username:rob,Domain:gmail,Suffix:com
Ryan Username:ryan,Domain:yahoo,Suffix:com

7.3.3 Vectorized String Functions in pandas

  • Partial listing of vectorized string methods
Method  Description
cat  Concatenate strings element-wise with optional delimiter
contains  Return boolean array indicating whether each string contains a pattern/regex
count  Count occurrences of the pattern
extract  Use a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group
endswith  Equivalent to x.endswith(pattern) for each element
data = {'Dave':'dave@google.com','Steve':'steve@gmail.com','Rob':'rob@gmail.com','Ryan':np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Ryan                 NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Ryan      True
dtype: bool
#You can apply string and regular expression methods to each value (passing a lambda or other function) using data.map,
#but it will fail on the NA (null) values. Series has array-oriented methods for string operations that skip NA values; these are accessed through Series's str attribute
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Ryan       NaN
dtype: object
#Regular expressions can be used, too, along with any re module options
data.str.findall(pattern,flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Ryan                       NaN
dtype: object
#You can use str.get or index into the str attribute
matches = data.str.match(pattern,flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Ryan      NaN
dtype: object
#To access elements in the embedded lists, we can pass an index to either of these functions
matches.str.get(1)
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-245-8d76f9329d2a> in <module>
      1 #To access elements in the embedded lists, we can pass an index to either of these functions
----> 2 matches.str.get(1)


D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5459             or name in self._accessors
   5460         ):
-> 5461             return object.__getattribute__(self, name)
   5462         else:
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):


D:\Anaconda3\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    178             # we're accessing the attribute of the class, i.e., Dataset.geo
    179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
    181         # Replace the property with the accessor object. Inspired by:
    182         # https://www.pydanny.com/cached-property.html


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)


D:\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 


AttributeError: Can only use .str accessor with string values!
matches.str[0]
---------------------------------------------------------------------------

AttributeError: Can only use .str accessor with string values!
(the full traceback is identical to the one above)
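#(Added workaround sketch) In this pandas version str.match returns booleans, so element
#retrieval has to go through findall's result instead; str[0] takes the first tuple per row
data.str.findall(pattern,flags=re.IGNORECASE).str[0]
Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Ryan                     NaN
dtype: object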
#You can similarly slice strings using this syntax for vectorized slicing
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Ryan       NaN
dtype: object

Chapter 8: Data Wrangling: Join, Combine, and Reshape

8.1 Hierarchical Indexing

#What you're seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”
data = pd.Series(np.random.randn(9),
                index = [['a','a','a','b','b','c','c','d','d'],
                        [1,2,3,1,3,1,2,2,3]])
data
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
data['b']
1    0.228913
3    1.352917
dtype: float64
data['b':'c']
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
data.loc[['b','c']]
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
dtype: float64
#Selection is even possible from an “inner” level
data.loc[:,3]
a    0.274992
b    1.352917
d    1.669025
dtype: float64
#Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table.
#You can rearrange the data into a DataFrame using the unstack method
data.unstack()
1 2 3
a 1.007189 -1.296221 0.274992
b 0.228913 NaN 1.352917
c 0.886429 -2.001637 NaN
d NaN -0.371843 1.669025
#The inverse operation of unstack is stack
data.unstack().stack()
a  1    1.007189
   2   -1.296221
   3    0.274992
b  1    0.228913
   3    1.352917
c  1    0.886429
   2   -2.001637
d  2   -0.371843
   3    1.669025
dtype: float64
#With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape(4,3),
                    index = [['a','a','b','b'],[1,2,1,2]],
                    columns = [['ohio','ohio','colorado'],['green','red','green']])
frame
ohio colorado
green red green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['key1','key2']
frame
ohio colorado
green red green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.columns.names = ['state','color']
frame
state ohio colorado
color green red green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['ohio']
color green red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
#A MultiIndex can be created by itself with its own constructor and then reused
pd.MultiIndex.from_arrays([['ohio','ohio','colorado'],['green','red','green']],
                      names = ['state','color'])
MultiIndex([(    'ohio', 'green'),
            (    'ohio',   'red'),
            ('colorado', 'green')],
           names=['state', 'color'])
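#(Added sketch) Reusing the constructed MultiIndex, e.g. as DataFrame columns (mi is a throwaway name)
mi = pd.MultiIndex.from_arrays([['ohio','ohio','colorado'],['green','red','green']],
                      names = ['state','color'])
pd.DataFrame(np.arange(12).reshape(4,3),columns=mi)
state ohio colorado
color green red green
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11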

8.1.1 Reordering and Sorting Levels

#swaplevel takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered)
frame.swaplevel('key1','key2')
state ohio colorado
color green red green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
#sort_index sorts the data using only the values in a single level.
#When swapping levels, it's common to also use sort_index so that the result is lexicographically sorted by the indicated level
frame.sort_index(level=1)
state ohio colorado
color green red green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
frame.sort_index(level=0)
state ohio colorado
color green red green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.swaplevel(0,1).sort_index(level=0)
state ohio colorado
color green red green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11

8.1.2 Summary Statistics by Level

  • Many descriptive and summary statistics on DataFrame and Series have a level option, with which you can specify the level you want to aggregate on a particular axis (a modern groupby equivalent is sketched after the examples)
frame.sum(level='key2')
state ohio colorado
color green red green
key2
1 6 8 10
2 12 14 16
frame.sum(level='color',axis=1)
color green red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
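#(Added note) Newer pandas deprecates the level keyword on reductions; the equivalent form,
#which is what the level option uses under the hood, is a groupby
frame.groupby(level='key2').sum()
state ohio colorado
color green red green
key2
1 6 8 10
2 12 14 16
#(in older versions the column-wise case is frame.groupby(level='color',axis=1).sum())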

8.1.3 Indexing with a DataFrame's columns

  • It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns.
  • Different join types with the how argument
Option  Behavior
'inner'  Use only the key combinations observed in both tables
'left'  Use all key combinations found in the left table
'right'  Use all key combinations found in the right table
'outer'  Use all key combinations observed in both tables together
frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
                     'c':['one','one','one','two','two','two','two'],
                     'd':[0,1,2,0,1,2,3]})
frame
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
#DataFrame's set_index function will create a new DataFrame using one or more of its columns as the index
frame2 = frame.set_index(['c','d'])
frame2
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
#By default the columns are removed from the DataFrame, though you can leave them in
frame.set_index(['c','d'],drop = False)
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
#reset_index does the opposite of set_index; the hierarchical index levels are moved into the columns
frame2.reset_index()
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

8.2 Combining and Merging Datasets

  • pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
  • pandas.concat glues or stacks objects together along an axis.
  • The combine_first instance method enables splicing together overlapping data, using values from one object to fill in missing values in another.

8.2.1 Database-Style DataFrame Joins

  • Merge or join operations combine datasets by linking rows using one or more keys
  • merge function arguments
Argument  Description
left  DataFrame to be merged on the left side
right  DataFrame to be merged on the right side
how  One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'
on  Column names to join on. Must be found in both DataFrame objects; the intersection of the column names in left and right is used as the join keys if not otherwise specified
left_on  Columns in the left DataFrame to use as join keys
right_on  Columns in the right DataFrame to use as join keys
left_index  Use the row index in left as its join key (or keys, if a MultiIndex)
right_index  Use the row index in right as its join key (or keys, if a MultiIndex)
sort  Sort merged data lexicographically by join keys; True by default (disabling this can give better performance on large datasets in some cases)
suffixes  Tuple of strings to append to overlapping column names; defaults to ('_x', '_y') (e.g., if 'data' is in both DataFrames, it appears as 'data_x' and 'data_y' in the result)
copy  If False, avoid copying data into the resulting data structure in some exceptional cases; by default always copies
indicator  Adds a special column _merge that indicates the source of each row; values are 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row (see the indicator example after the outer join below)
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2 = pd.DataFrame({'key':['a','b','d'],
                   'data2':range(3)})
df2
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1,df2)
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#Note that I didn't specify which column to join on. If that information is not specified, merge uses the overlapping column names as the keys.
#It's a good practice to specify the join key explicitly, though
pd.merge(df1,df2,on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#If the column names are different in each object, you can specify them separately
df3 = pd.DataFrame({'Lkey':['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df3
Lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df4 = pd.DataFrame({'Rkey':['a','b','d'],
                   'data2':range(3)})
df4
Rkey data2
0 a 0
1 b 1
2 d 2
pd.merge(df3,df4,left_on='Lkey',right_on='Rkey')
Lkey data1 Rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
#Other possible options are 'left', 'right', and 'outer'.
#The outer join takes the union of the keys, combining the effect of applying both left and right joins
pd.merge(df1,df2,how='outer')
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
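#(Added illustration) indicator=True tags each row's origin in an extra _merge column
pd.merge(df1,df2,how='outer',indicator=True)
key data1 data2 _merge
0 b 0.0 1.0 both
1 b 1.0 1.0 both
2 b 6.0 1.0 both
3 a 2.0 0.0 both
4 a 4.0 0.0 both
5 a 5.0 0.0 both
6 c 3.0 NaN left_only
7 d NaN 2.0 right_only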
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],
                   'data1':range(6)})
df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
                   'data2':range(5)})
df2
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
pd.merge(df1,df2,on='key',how='left')
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
pd.merge(df1,df2,on='key',how='inner')
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
#To merge with multiple keys, pass a list of column names
left = pd.DataFrame({'key1':['foo','foo','bar'],
                    'key2':['one','two','one'],
                    'lval':[1,2,3]})
left
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
                    'key2':['one','one','one','two'],
                    'rval':[4,5,6,7]})
right
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
pd.merge(left,right,on=['key1','key2'],how='outer')
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
#merge has a suffixes option for specifying strings to append to overlapping column names
#in the left and right DataFrame objects
pd.merge(left,right,on=['key1'])
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
pd.merge(left,right,on=['key1'],suffixes=('_left','_right'))
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

8.2.2 Merging on Index

  • In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
                     'value':range(6)})
left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1 = pd.DataFrame({'group_val':[3.5,7]},index = ['a','b'])
right1
group_val
a 3.5
b 7.0
pd.merge(left1,right1,left_on = 'key',right_index=True)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
pd.merge(left1,right1,left_on = 'key',right_index=True,how='outer')
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
#With hierarchically indexed data, things are more complicated, as joining on index is implicitly a multiple-key merge
lefth = pd.DataFrame({'key1':['ohio','ohio','ohio','Nevada','Nevada'],
                     'key2':[2000,2001,2002,2001,2002],
                     'data':np.arange(5.)})
lefth
key1 key2 data
0 ohio 2000 0.0
1 ohio 2001 1.0
2 ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
righth = pd.DataFrame(np.arange(12).reshape(6,2),
                     index=[['nevada','nevada','ohio','ohio','ohio','ohio'],[2001,2000,2000,2000,2001,2002]],
                     columns = ['event1','event2'])
righth
event1 event2
nevada 2001 0 1
2000 2 3
ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
key1 key2 data event1 event2
0 ohio 2000 0.0 4.0 5.0
0 ohio 2000 0.0 6.0 7.0
1 ohio 2001 1.0 8.0 9.0
2 ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 NaN NaN
4 Nevada 2002 4.0 NaN NaN
4 nevada 2001 NaN 0.0 1.0
4 nevada 2000 NaN 2.0 3.0
left2 = pd.DataFrame([[1,2],[3,4],[5,6]],
                    index = ['a','c','e'],
                    columns = ['ohio','nevada'])
left2
ohio nevada
a 1 2
c 3 4
e 5 6
right2 = pd.DataFrame([[7,8],[9,10],[11,12],[13,14]],
                    index = ['b','c','d','e'],
                    columns = ['missouri','alabama'])
right2
missouri alabama
b 7 8
c 9 10
d 11 12
e 13 14
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
ohio nevada missouri alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
#DataFrame has a convenient join instance method for merging by index
left2.join(right2,how='outer')
ohio nevada missouri alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
left1.join(right1,on='key')
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
another = pd.DataFrame([[7,8],[9,10],[11,12],[16,17]],
                      index = ['a','c','e','f'],
                      columns = ['new york','oregon'])
another
new york oregon
a 7 8
c 9 10
e 11 12
f 16 17
left2.join([right2,another])
ohio nevada missouri alabama new york oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
left2.join([right2,another],how='outer')
ohio nevada missouri alabama new york oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
b NaN NaN 7.0 8.0 NaN NaN
d NaN NaN 11.0 12.0 NaN NaN
f NaN NaN NaN NaN 16.0 17.0

8.2.3 Concatenating Along an Axis

  • Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking.
  • NumPy's concatenate function can do this with NumPy arrays
  • concat function arguments
Argument  Description
objs  List or dict of pandas objects to be concatenated; this is the only required argument
axis  Axis to concatenate along; defaults to 0 (along rows)
join  Either 'inner' or 'outer' ('outer' by default); whether to intersect (inner) or union (outer) indexes along the other axes
join_axes  Specific indexes to use for the other n-1 axes instead of performing inner/outer set logic
keys  Values to associate with the objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays are passed in levels)
levels  Specific indexes to use as hierarchical index level or levels if keys are passed
arr = np.arange(12).reshape(3,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
s1 = pd.Series([0,1],index=['a','b'])
s1
a    0
b    1
dtype: int64
s2 = pd.Series([2,3,4],index=['c','d','e'])
s2
c    2
d    3
e    4
dtype: int64
s3 = pd.Series([5,6],index=['f','g'])
s3
f    5
g    6
dtype: int64
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
#By default, concat works along axis=0, producing another Series.
#If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns)
pd.concat([s1,s2,s3],axis=1)
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
s4 = pd.concat([s1,s3])
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1,s4],axis=1)
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
pd.concat([s1,s4],axis=1,join='inner')
0 1
a 0 0
b 1 1
#join_axes could be used to specify the indexes to be used on the other axes (it has since been removed from pandas; note the output below reflects the default union, and a reindex-based sketch follows)
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
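#(Added sketch) In pandas 1.0+ the same effect is obtained by reindexing the concatenated result
pd.concat([s1,s4],axis=1).reindex(['a','c','b','e'])
0 1
a 0.0 0.0
c NaN NaN
b 1.0 1.0
e NaN NaN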
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
pd.concat([s1,s2,s3],axis = 1,keys=['one','two','three'])
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
df1 = pd.DataFrame(np.arange(6).reshape(3,2),
                  index = ['a','b','c'],
                  columns = ['one','two'])
df1
one two
a 0 1
b 2 3
c 4 5
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,
                  index = ['a','c'],
                  columns = ['three','four'])
df2
three four
a 5 6
c 7 8
pd.concat([df1,df2],axis=1,keys=['lever1','level2'])
lever1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
#If you pass a dict of objects instead of a list, the dict's keys will be used for the keys option
pd.concat({'level1':df1,'level2':df2},axis=1)
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
pd.concat([df1,df2],axis=1,keys=['lever1','level2'],names=['upper','lower'])
upper lever1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
df1 = pd.DataFrame(np.random.randn(3,4),columns = ['a','b','c','d'])
df1
a b c d
0 -1.119593 1.953114 -1.514807 -1.054782
1 0.543393 1.172903 0.945829 0.656643
2 1.012695 1.481920 -0.413033 -1.280521
df2 = pd.DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
df2
b d a
0 1.638046 -0.850112 1.895532
1 -1.175952 1.370474 -0.992356
pd.concat([df1,df2],ignore_index=True)
a b c d
0 -1.119593 1.953114 -1.514807 -1.054782
1 0.543393 1.172903 0.945829 0.656643
2 1.012695 1.481920 -0.413033 -1.280521
3 1.895532 1.638046 NaN -0.850112
4 -0.992356 -1.175952 NaN 1.370474

8.2.4 Combining Data with Overlap

a = pd.Series([np.nan,2.5,0,3.5,4.5,np.nan],
             index=['f','e','d','c','b','a'])
a
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64
b = pd.Series([0,np.nan,2,np.nan,np.nan,5],
             index=['a','b','c','d','e','f'])
b
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64
#Consider NumPy's where function, which performs the array-oriented equivalent of an if-else expression
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
#Series has a combine_first method, which performs the equivalent of this operation along with pandas's usual data alignment logic
b.combine_first(a)
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64
df1 = pd.DataFrame({'a':[1,np.nan,5,np.nan],
                   'b':[np.nan,2,np.nan,6],
                   'c':range(2,18,4)})
df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2 = pd.DataFrame({'a':[5,4,np.nan,3,7],
                   'b':[np.nan,3,4,6,8]})
df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN

8.3 Reshaping and Pivoting

  • There are a number of basic operations for rearranging tabular data. These are referred to as reshape or pivot operations.

8.3.1 Reshaping with Hierarchical Indexing

  • stack: this “rotates” or pivots from the columns in the data to the rows.
  • unstack: this pivots from the rows into the columns.
data = pd.DataFrame(np.arange(6).reshape(2,3),
                    index = pd.Index(['ohio','colorado'],name='state'),
                    columns = pd.Index(['one','two','three'],name='number'))
data
number one two three
state
ohio 0 1 2
colorado 3 4 5
result = data.stack()
result
state     number
ohio      one       0
          two       1
          three     2
colorado  one       3
          two       4
          three     5
dtype: int32
result.unstack()
number one two three
state
ohio 0 1 2
colorado 3 4 5
result.unstack(0)
state ohio colorado
number
one 0 3
two 1 4
three 2 5
result.unstack('state')
state ohio colorado
number
one 0 3
two 1 4
three 2 5
#Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s1
a    0
b    1
c    2
d    3
dtype: int64
s2 = pd.Series([4,5,6],index=['c','d','e'])
s2
c    4
d    5
e    6
dtype: int64
data = pd.concat([s1,s2],keys=['one','two'])
data
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data.unstack()
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
#Stacking filters out missing data by default, so the stack/unstack operations are invertible
data.unstack().stack()
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
data.unstack().stack(dropna = False)
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
#When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result
df = pd.DataFrame({'left':result,'right':result+5},
                 columns=pd.Index(['left','right'],name='side'))
df
side left right
state number
ohio one 0 5
two 1 6
three 2 7
colorado one 3 8
two 4 9
three 5 10
df.unstack('state')
side left right
state ohio colorado ohio colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
#When calling stack, we can indicate the name of the axis to stack
df.unstack('state').stack('side')
state colorado ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
df.unstack('state').stack()
side left right
number state
one ohio 0 5
colorado 3 8
two ohio 1 6
colorado 4 9
three ohio 2 7
colorado 5 10

8.3.2 Pivoting “Long” to “Wide” Format

data = pd.read_csv('examples/macrodata.csv')
data.head()
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='date')
periods
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
columns
Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')
data = data.reindex(columns=columns)
data.head()
item realgdp infl unemp
0 2710.349 NaN 5.8
1 2778.801 NaN 5.1
2 2775.488 NaN 5.3
3 2785.204 NaN 5.6
4 2847.699 NaN 5.2
data.index = periods.to_timestamp('D','end')
data.head()
item realgdp infl unemp
date
1959-03-31 23:59:59.999999999 2710.349 NaN 5.8
1959-06-30 23:59:59.999999999 2778.801 NaN 5.1
1959-09-30 23:59:59.999999999 2775.488 NaN 5.3
1959-12-31 23:59:59.999999999 2785.204 NaN 5.6
1960-03-31 23:59:59.999999999 2847.699 NaN 5.2
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
date item value
0 1959-03-31 23:59:59.999999999 realgdp 2710.349
1 1959-03-31 23:59:59.999999999 unemp 5.800
2 1959-06-30 23:59:59.999999999 realgdp 2778.801
3 1959-06-30 23:59:59.999999999 unemp 5.100
4 1959-09-30 23:59:59.999999999 realgdp 2775.488
5 1959-09-30 23:59:59.999999999 unemp 5.300
6 1959-12-31 23:59:59.999999999 realgdp 2785.204
7 1959-12-31 23:59:59.999999999 unemp 5.600
8 1960-03-31 23:59:59.999999999 realgdp 2847.699
9 1960-03-31 23:59:59.999999999 unemp 5.200
pivoted = ldata.pivot('date','item','value')
pivoted
item realgdp unemp
date
1959-03-31 23:59:59.999999999 2710.349 5.8
1959-06-30 23:59:59.999999999 2778.801 5.1
1959-09-30 23:59:59.999999999 2775.488 5.3
1959-12-31 23:59:59.999999999 2785.204 5.6
1960-03-31 23:59:59.999999999 2847.699 5.2
... ... ...
2008-09-30 23:59:59.999999999 13324.600 6.0
2008-12-31 23:59:59.999999999 13141.920 6.9
2009-03-31 23:59:59.999999999 12925.410 8.1
2009-06-30 23:59:59.999999999 12901.504 9.2
2009-09-30 23:59:59.999999999 12990.341 9.6

203 rows × 2 columns
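#(Added note) pivot is equivalent to creating a hierarchical index with set_index and then
#unstacking; here the result carries the remaining 'value' column as an extra column level above item
ldata.set_index(['date','item']).unstack('item')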

8.3.3 Pivoting “Wide” to “Long” Format

  • The inverse operation of pivot for DataFrames is pandas.melt
df = pd.DataFrame({'key':['foo','bar','baz'],
                  'A':[1,2,3],
                  'B':[4,5,6],
                  'C':[7,8,9]})
df
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9
#When using pandas.melt, we must indicate which columns (if any) are group indicators
melted = pd.melt(df)
melted
variable value
0 key foo
1 key bar
2 key baz
3 A 1
4 A 2
5 A 3
6 B 4
7 B 5
8 B 6
9 C 7
10 C 8
11 C 9
melted = pd.melt(df,['key'])
melted
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9
reshaped = melted.pivot('key','variable','value')
reshaped
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
#Using reset_index moves the key back into a column
reshaped.reset_index()
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7
pd.melt(df,id_vars=['key'],value_vars=['A','B'])
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
pd.melt(df,value_vars=['A','B','C'])
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
pd.melt(df,value_vars=['A','B','key'])
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 key foo
7 key bar
8 key baz

Chapter 9: Plotting and Visualization

%matplotlib notebook

9.1 A Brief matplotlib API Primer

  • One nuance of using Jupyter notebooks is that plots are reset after each cell is evaluated, so for more complex plots you must put all of the plotting commands in a single notebook cell
import matplotlib.pyplot as plt
import numpy as np
data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)

9.1.1 Figures and Subplots

  • Plots in matplotlib reside within a Figure object
  • pyplot.subplots options
Argument  Description
nrows  Number of rows of subplots
ncols  Number of columns of subplots
sharex  All subplots should use the same x-axis ticks (adjusting xlim will affect all subplots)
sharey  All subplots should use the same y-axis ticks (adjusting ylim will affect all subplots)
subplot_kw  Dict of keyword arguments passed to add_subplot, used to create each subplot
**fig_kw  Additional keyword arguments used when creating the figure, such as plt.subplots(2,2,figsize=(8,6))
#You can create a new figure with plt.figure
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
<IPython.core.display.Javascript object>
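#(Added continuation sketch, following the book's flow) draw into the axes created above
ax1.hist(np.random.randn(100),bins=20,color='k',alpha=0.3)
ax2.scatter(np.arange(30),np.arange(30)+3*np.random.randn(30))
ax3.plot(np.random.randn(50).cumsum(),'k--')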