Pandas DataFrame: Part 1

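'''
[Lesson 2.] The pandas DataFrame data structure: basic concepts and creation

A DataFrame is a "two-dimensional array": a tabular data structure made of an ordered set of columns, where each column can hold its own value type (numeric, string, boolean, and so on).
Internally a DataFrame stores its data in one or more two-dimensional blocks, not as a list, a dict, or a one-dimensional array.
'''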
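# DataFrame data structure
# A DataFrame is a tabular data structure: a "labeled two-dimensional array".
# A DataFrame has an index (row labels) and columns (column labels).

import numpy as np
import pandas as pd

data = {'name':['Jack','Tom','Mary'],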
'age':[,,],
'gender':['m','m','w']}
frame = pd.DataFrame(data)
print(frame)
print(type(frame))
print(frame.index,'\nType:',type(frame.index))
print(frame.columns,'\nType:',type(frame.columns))
print(frame.values,'\nType:',type(frame.values))
# Inspect the data; the object itself is a DataFrame
# .index returns the row labels
# .columns returns the column labels
# .values returns the values as an ndarray

  Output:

   age gender  name
m Jack
m Tom
w Mary
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1)
Type: <class 'pandas.indexes.range.RangeIndex'>
Index(['age', 'gender', 'name'], dtype='object')
Type: <class 'pandas.indexes.base.Index'>
[[ 'm' 'Jack']
[ 'm' 'Tom']
[ 'w' 'Mary']]
Type: <class 'numpy.ndarray'>
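The numeric values of the listing above did not survive, so here is a minimal self-contained sketch of the same idea. The ages are hypothetical, chosen only for illustration. Note that recent pandas versions keep the dict's insertion order for the columns, whereas the output above (from an older version) shows them sorted alphabetically.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration.
data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['m', 'm', 'w']}
frame = pd.DataFrame(data)

print(frame.index)    # row labels: a default RangeIndex
print(frame.columns)  # column labels: taken from the dict keys
print(frame.values)   # the underlying values as a NumPy ndarray
```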
# DataFrame creation method 1: a dict of arrays/lists
# Constructor: pandas.DataFrame()

data1 = {'a':[,,],
'b':[,,],
'c':[,,]}
data2 = {'one':np.random.rand(3),
'two':np.random.rand(3)} # What happens if the lengths differ, e.g. 'two':np.random.rand(4)?
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
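# Creating a DataFrame from a dict of arrays/lists: the dict keys become the columns and the index defaults to integer labels
# All values in the dict must have the same length!

df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])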
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
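# columns parameter: a list that sets the column order; a column not present in the data (such as 'd') is filled with NaN
# The list passed to columns may also contain fewer columns than the original data

df2 = pd.DataFrame(data2, index = ['f1','f2','f3']) # What happens with index = ['f1','f2','f3','f4']?
print(df2)
# index parameter: a list that sets the row labels; its length must match the data

  Output: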

{'a': [, , ], 'c': [, , ], 'b': [, , ]}
{'one': array([ 0.00101091, 0.08807153, 0.58345056]), 'two': array([ 0.49774634, 0.16782565, 0.76443489])}
a b c one two
0.001011 0.497746
0.088072 0.167826
0.583451 0.764435
b c a d
NaN
NaN
NaN
b c one two
f1 0.001011 0.497746
f2 0.088072 0.167826
f3 0.583451 0.764435
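The two questions asked in the comments above (unequal value lengths, and an index list of the wrong length) can be tried directly. The sketch below uses illustrative integer values for `data1`, which are an assumption, not the original data; both mismatches are expected to raise `ValueError`.

```python
import numpy as np
import pandas as pd

data1 = {'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [5, 6, 7]}  # illustrative values

# Reordering columns; 'd' does not exist in the data, so it becomes NaN.
df = pd.DataFrame(data1, columns=['b', 'c', 'a', 'd'])
print(df)

# Values of different lengths cannot be combined into one DataFrame.
try:
    pd.DataFrame({'one': np.random.rand(3), 'two': np.random.rand(4)})
    unequal_lengths_raised = False
except ValueError as err:
    unequal_lengths_raised = True
    print('ValueError:', err)

# An index list whose length differs from the data is rejected as well.
try:
    pd.DataFrame({'one': np.random.rand(3)}, index=['f1', 'f2', 'f3', 'f4'])
    bad_index_raised = False
except ValueError as err:
    bad_index_raised = True
    print('ValueError:', err)
```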
# DataFrame creation method 2: a dict of Series

data1 = {'one':pd.Series(np.random.rand(2)),
'two':pd.Series(np.random.rand(3))} # Series without an explicit index
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
'two':pd.Series(np.random.rand(3),index = ['a','b','c'])} # Series with an explicit index
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
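# Creating a DataFrame from a dict of Series: the dict keys become the columns and the Series labels become the index (default integer labels if the Series has no explicit index)
# The Series may differ in length; the resulting DataFrame then contains NaN values

  Output: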

{'one':     0.892580
0.834076
dtype: float64, 'two': 0.301309
0.977709
0.489000
dtype: float64}
{'one': a 0.470947
b 0.584577
dtype: float64, 'two': a 0.122659
b 0.136429
c 0.396825
dtype: float64}
one two
0.892580 0.301309
0.834076 0.977709
NaN 0.489000
one two
a 0.470947 0.122659
b 0.584577 0.136429
c NaN 0.396825
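To make the alignment rule easier to see than with random numbers, here is a small sketch with fixed, made-up values: the shorter Series has no label `'c'`, so that cell becomes NaN.

```python
import pandas as pd

# Fixed illustrative values instead of random numbers.
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

df = pd.DataFrame({'one': s1, 'two': s2})
print(df)  # row 'c' has NaN in column 'one'
```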
# DataFrame creation method 3: directly from a two-dimensional array

ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three']) # Try an index or columns whose length differs from the array
print(df1)
print(df2)
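# Creating a DataFrame directly from a two-dimensional array gives data of the same shape; if index and columns are not specified, both default to integer labels
# The lengths of index and columns must match the shape of the array

  Output: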

[[ 0.54492282  0.28956161  0.46592269]
 [ 0.30480674  0.12917132  0.38757672]
 [ 0.2518185   0.13544544  0.13930429]]
          0         1         2
0  0.544923  0.289562  0.465923
1  0.304807  0.129171  0.387577
2  0.251819  0.135445  0.139304
one two three
a 0.544923 0.289562 0.465923
b 0.304807 0.129171 0.387577
c 0.251819 0.135445 0.139304
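The comment above suggests trying labels whose length does not match the array. A minimal sketch of that experiment, using `np.arange` instead of random numbers so the result is reproducible (an illustrative choice):

```python
import numpy as np
import pandas as pd

ar = np.arange(9).reshape(3, 3)

# Matching lengths: 3 row labels and 3 column labels for a 3x3 array.
df = pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
print(df)

# Only 2 row labels for 3 rows: pandas refuses to build the DataFrame.
try:
    pd.DataFrame(ar, index=['a', 'b'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```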
# DataFrame creation method 4: a list of dicts

data = [{'one': , 'two': }, {'one': , 'two': , 'three': }]
print(data)
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print(df1)
print(df2)
print(df3)
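# Creating a DataFrame from a list of dicts: the dict keys become the columns; without an explicit index, the default integer labels are used
# The columns and index parameters set the column and row labels respectively

  Output: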

[{'one': , 'two': }, {'one': , 'three': , 'two': }]
one three two
NaN
20.0
one three two
a NaN
b 20.0
one two
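Since the numbers in the listing above were lost, the following sketch shows the same construction with illustrative values (only the `20` for `'three'` is visible in the surviving output; the rest are assumptions). Each dict becomes one row, and a key missing from a dict yields NaN in that row.

```python
import pandas as pd

# Illustrative values; each dict is one row.
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]

df1 = pd.DataFrame(data)                         # default integer index
df2 = pd.DataFrame(data, index=['a', 'b'])       # custom row labels
df3 = pd.DataFrame(data, columns=['one', 'two']) # keep only two columns
print(df1)
print(df2)
print(df3)
```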
# DataFrame creation method 5: a dict of dicts

data = {'Jack':{'math':,'english':,'art':},
'Marry':{'math':,'english':,'art':},
'Tom':{'math':,'english':}}
df1 = pd.DataFrame(data)
print(df1)
# Creating a DataFrame from a dict of dicts: the outer keys become the columns and the inner keys become the index

df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)
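# The columns parameter can add or drop columns; a new column is filled with NaN
# Unlike the earlier methods, index here cannot relabel the existing rows; labels that are not among the inner keys give rows of NaN (very important!)

  Output: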

         Jack  Marry   Tom
art NaN
english 67.0
math 78.0
Jack Tom Bob
art NaN NaN
english 67.0 NaN
math 78.0 NaN
Jack Marry Tom
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
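The "very important" point above is that, for a dict of dicts, `index` selects among the inner keys rather than renaming rows. A small sketch with hypothetical scores illustrates both cases: labels that do not exist give all-NaN rows, while labels that do exist pick out those rows.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Tom': {'math': 78, 'english': 67}}

df1 = pd.DataFrame(data)                           # rows come from the inner keys
df3 = pd.DataFrame(data, index=['a', 'b', 'c'])    # unknown labels: all NaN
df4 = pd.DataFrame(data, index=['math', 'art'])    # known labels: rows are selected
print(df1)
print(df3)
print(df4)
```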
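'''
[Lesson 2.] The pandas DataFrame data structure: indexing

A DataFrame has both a row index and a column index, and can be viewed as a dict of Series that share one index.
Selecting columns / selecting rows / slicing / boolean tests
'''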

  

# Selecting rows and columns

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
index = ['one','two','three'],
columns = ['a','b','c','d'])
print(df)

data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# Select columns by name: a single column returns a Series, several columns return a DataFrame

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# Select rows by index label: a single row returns a Series, several rows return a DataFrame

  Output:

             a          b          c          d
one 72.615321 49.816987 57.485645 84.226944
two 46.295674 34.480439 92.267989 17.111412
three 14.699591 92.754997 39.683577 93.255880
one 72.615321
two 46.295674
three 14.699591
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
a c
one 72.615321 57.485645
two 46.295674 92.267989
three 14.699591 39.683577 <class 'pandas.core.frame.DataFrame'>
-----
a 72.615321
b 49.816987
c 57.485645
d 84.226944
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
a b c d
one 72.615321 49.816987 57.485645 84.226944
two 46.295674 34.480439 92.267989 17.111412 <class 'pandas.core.frame.DataFrame'>
# df[] - select columns
# Generally used to select columns; it can also select rows

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
index = ['one','two','three'],
columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['b','c']] # Try data2 = df[['b','c','e']]
print(data1)
print(data2)
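# df[] selects columns by default; put column names inside [] (so columns are usually named explicitly rather than left as default integers, to avoid confusion with the index)
# Selecting one column returns a Series, printed as a Series
# Selecting several columns returns a DataFrame, printed as a DataFrame

data3 = df[:1]
#data3 = df[0]
#data3 = df['one']
print(data3,type(data3))
# With integers inside df[], rows are selected by default, and only slices are allowed; a single position (df[0]) does not work
# The result is a DataFrame even when only one row is selected
# df[] cannot select a row by its index label (df['one'])

# Key note: df[col] is generally used to select columns; put column names inside []

  Output: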

              a          b          c          d
one 88.490183 93.588825 1.605172 74.610087
two 45.905361 49.257001 87.852426 97.490521
three 95.801001 97.991028 74.451954 64.290587
-----
one 88.490183
two 45.905361
three 95.801001
Name: a, dtype: float64
b c
one 93.588825 1.605172
two 49.257001 87.852426
three 97.991028 74.451954
a b c d
one 88.490183 93.588825 1.605172 74.610087 <class 'pandas.core.frame.DataFrame'>
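The three behaviours of `df[]` described above (one column, several columns, a row slice) and the failing case `df['one']` can be checked with a short sketch. It uses `np.arange` rather than random data, which is only a choice made here for reproducibility.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

col = df['a']          # one column name -> Series
cols = df[['b', 'c']]  # list of column names -> DataFrame
rows = df[:1]          # slice -> rows, still a DataFrame even for one row

# A row label is not a column name, so df['one'] fails with KeyError.
try:
    df['one']
    row_label_raised = False
except KeyError:
    row_label_raised = True

print(type(col), type(cols), type(rows), row_label_raised)
```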
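# df.loc[] - select rows by index label

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df1)
print(df2)
print('-----')

data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)
print('Single-label indexing\n-----')
# Indexing with a single label returns a Series

data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)
print('Multi-label indexing\n-----')
# Indexing with several labels; the labels can be given in any order
# Old pandas returned a NaN row for a missing label such as 'five'; since pandas 1.0,
# df1.loc[['two','three','five']] raises KeyError, so reindex() is used here to get the NaN row

data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print(data5)
print(data6)
print('Slice indexing')
# Slices are supported
# The end label is included

# Key note: df.loc[label] selects rows by index label; it works with both a custom index and the default integer index

  Output: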

              a          b          c          d
one 73.070679 7.169884 80.820532 62.299367
two 34.025462 77.849955 96.160170 55.159017
three 27.897582 39.595687 69.280955 49.477429
four 76.723039 44.995970 22.408450 23.273089
a b c d
93.871055 28.031989 57.093181 34.695293
22.882809 47.499852 86.466393 86.140909
80.840336 98.120735 84.495414 8.413039
59.695834 1.478707 15.069485 48.775008
-----
a 73.070679
b 7.169884
c 80.820532
d 62.299367
Name: one, dtype: float64
a 22.882809
b 47.499852
c 86.466393
d 86.140909
Name: 1, dtype: float64
Single-label indexing
-----
a b c d
two 34.025462 77.849955 96.160170 55.159017
three 27.897582 39.595687 69.280955 49.477429
five NaN NaN NaN NaN
a b c d
59.695834 1.478707 15.069485 48.775008
80.840336 98.120735 84.495414 8.413039
22.882809 47.499852 86.466393 86.140909
Multi-label indexing
-----
a b c d
one 73.070679 7.169884 80.820532 62.299367
two 34.025462 77.849955 96.160170 55.159017
three 27.897582 39.595687 69.280955 49.477429
a b c d
22.882809 47.499852 86.466393 86.140909
80.840336 98.120735 84.495414 8.413039
59.695834 1.478707 15.069485 48.775008
Slice indexing
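Two details of `.loc` are worth checking by hand: label slices include the end label, and a list containing a missing label behaves differently across pandas versions. The sketch below assumes pandas 1.0 or later, where the missing label raises `KeyError` and `reindex()` is the way to get a NaN-filled row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Label slices include both endpoints.
sliced = df.loc['one':'three']

# Assumption: pandas >= 1.0, where a missing label in a list raises KeyError.
try:
    df.loc[['two', 'five']]
    missing_raised = False
except KeyError:
    missing_raised = True

# reindex() keeps the requested labels and fills the unknown one with NaN.
filled = df.reindex(['two', 'five'])
print(sliced)
print(filled)
```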
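# df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
# Works like list indexing: positions follow the row order of the DataFrame, starting at 0

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')

print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])
print('Single-position indexing\n-----')
# Single-position indexing
# A position beyond the number of rows cannot be used (df.iloc[4] raises IndexError)

print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
print('Multi-position indexing\n-----')
# Multi-position indexing
# The positions can be given in any order

print(df.iloc[1:3])
print(df.iloc[::2])
print('Slice indexing')
# Slice indexing
# The end position is excluded

  Output: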

              a          b          c          d
one 21.848926 2.482328 17.338355 73.014166
two 99.092794 0.601173 18.598736 61.166478
three 87.183015 85.973426 48.839267 99.930097
four 75.007726 84.208576 69.445779 75.546038
------
a 21.848926
b 2.482328
c 17.338355
d 73.014166
Name: one, dtype: float64
a 75.007726
b 84.208576
c 69.445779
d 75.546038
Name: four, dtype: float64
Single-position indexing
-----
a b c d
one 21.848926 2.482328 17.338355 73.014166
three 87.183015 85.973426 48.839267 99.930097
a b c d
four 75.007726 84.208576 69.445779 75.546038
three 87.183015 85.973426 48.839267 99.930097
two 99.092794 0.601173 18.598736 61.166478
Multi-position indexing
-----
a b c d
two 99.092794 0.601173 18.598736 61.166478
three 87.183015 85.973426 48.839267 99.930097
a b c d
one 21.848926 2.482328 17.338355 73.014166
three 87.183015 85.973426 48.839267 99.930097
Slice indexing
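As a compact check of the `.iloc` rules above (positions start at 0, negative positions count from the end, slices exclude the end, out-of-range positions fail), here is a sketch on a small deterministic DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

first = df.iloc[0]           # first row, as a Series
last = df.iloc[-1]           # last row, as a Series
picked = df.iloc[[3, 2, 1]]  # several positions, in the given order
sliced = df.iloc[1:3]        # positions 1 and 2; position 3 is excluded

# Position 4 does not exist in a 4-row DataFrame.
try:
    df.iloc[4]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True

print(first.name, last.name, list(picked.index), list(sliced.index))
```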
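# Boolean indexing
# Works the same way as for Series

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')

b1 = df < 20
print(b1,type(b1))
print(df[b1]) # can also be written as df[df < 20]
print('------')
# Without selecting anything first, every value is tested
# The result keeps the full shape: True keeps the original value, False becomes NaN

b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2]) # can also be written as df[df['a'] > 50]
print('------')
# Test on a single column
# The result keeps only the rows where the test is True, including their other columns

b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3]) # can also be written as df[df[['a','b']] > 50]
print('------')
# Test on several columns
# The result keeps the full shape: True keeps the original value, False becomes NaN

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4]) # can also be written as df[df.loc[['one','three']] < 50]
print('------')
# Test on several rows
# The result keeps the full shape: True keeps the original value, False becomes NaN

  Output: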

             a          b          c          d
one 19.185849 20.303217 21.800384 45.189534
two 50.105112 28.478878 93.669529 90.029489
three 35.496053 19.248457 74.811841 20.711431
four 24.604478 57.731456 49.682717 82.132866
------
a b c d
one True False False False
two False False False False
three False True False False
four False False False False <class 'pandas.core.frame.DataFrame'>
a b c d
one 19.185849 NaN NaN NaN
two NaN NaN NaN NaN
three NaN 19.248457 NaN NaN
four NaN NaN NaN NaN
------
one False
two True
three False
four False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
a b c d
two 50.105112 28.478878 93.669529 90.029489
------
a b
one False False
two True False
three False False
four False True <class 'pandas.core.frame.DataFrame'>
a b c d
one NaN NaN NaN NaN
two 50.105112 NaN NaN NaN
three NaN NaN NaN NaN
four NaN 57.731456 NaN NaN
------
a b c d
one True True True True
three True True False True <class 'pandas.core.frame.DataFrame'>
a b c d
one 19.185849 20.303217 21.800384 45.189534
two NaN NaN NaN NaN
three 35.496053 19.248457 NaN 20.711431
four NaN NaN NaN NaN
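The difference between a boolean DataFrame mask (shape preserved, NaN where False) and a boolean Series mask (rows filtered) is the key point of this section. A small sketch with made-up integers shows both side by side.

```python
import pandas as pd

# Made-up values so the result is easy to verify by eye.
df = pd.DataFrame({'a': [10, 60, 30], 'b': [70, 20, 80]},
                  index=['one', 'two', 'three'])

# DataFrame mask: same shape as df, values failing the test become NaN.
masked = df[df > 50]

# Series mask: keeps whole rows where the condition on column 'a' holds.
rows = df[df['a'] > 50]

print(masked)
print(rows)
```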
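# Combined indexing: for example selecting rows and columns together
# Select columns first, then rows: in effect, pick the fields first, then pick the records

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['one','two','three','four'],
columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']]) # rows one and three of column a
print(df[['b','c','d']].iloc[::2]) # rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2]) # first two rows that satisfy the condition

  Output: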

             a          b          c          d
one 50.660904 89.827374 51.096827 3.844736
two 70.699721 78.750014 52.988276 48.833037
three 33.653032 27.225202 24.864712 29.662736
four 21.792339 26.450939 6.122134 52.323963
------
one 50.660904
three 33.653032
Name: a, dtype: float64
b c d
one 89.827374 51.096827 3.844736
three 27.225202 24.864712 29.662736
a b c d
three 33.653032 27.225202 24.864712 29.662736
four 21.792339 26.450939 6.122134 52.323963
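Besides chaining two selections as above, `.loc` also accepts a row selector and a column selector in a single call, which is often easier to read. This is a complementary sketch, not something the listing above uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Rows 'one' and 'three' of column 'a' in one call.
sub = df.loc[['one', 'three'], 'a']

# A boolean row condition combined with a list of columns.
block = df.loc[df['a'] < 8, ['b', 'c']]

print(sub)
print(block)
```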
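'''
[Lesson 2.] The pandas DataFrame data structure: basic techniques

Viewing data, transposing / adding, modifying, deleting values / alignment / sorting
'''
# Viewing data and transposing

df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,
columns = ['a','b'])
print(df.head(2))
print(df.tail())
# .head() shows the first rows
# .tail() shows the last rows
# Both show 5 rows by default

print(df.T)
# .T transposes the DataFrame

  Output: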

          a          b
5.777208 18.374283
85.961515 55.120036
a b
21.236577 15.902872
46.137564 29.350647
70.157709 58.972728
8.368292 42.011356
29.824574 87.062295
\
a 5.777208 85.961515 11.005284 21.236577 46.137564 70.157709
b 18.374283 55.120036 35.595598 15.902872 29.350647 58.972728 a 8.368292 29.824574
b 42.011356 87.062295
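# Adding and modifying

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df)

df['e'] =
df.loc[] =
print(df)
# Add a new column / row and assign values to it

df['e'] =
df[['a','c']] =
print(df)
# Select existing data and assign directly to modify it

  Output: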

           a          b          c          d
17.148791 73.833921 39.069417 5.675815
91.572695 66.851601 60.320698 92.071097
79.377105 24.314520 44.406357 57.313429
84.599206 61.310945 3.916679 30.076458
a b c d e
17.148791 73.833921 39.069417 5.675815
91.572695 66.851601 60.320698 92.071097
79.377105 24.314520 44.406357 57.313429
84.599206 61.310945 3.916679 30.076458
20.000000 20.000000 20.000000 20.000000
a b c d e
73.833921 5.675815
66.851601 92.071097
24.314520 57.313429
61.310945 30.076458
20.000000 20.000000
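The assigned values in the listing above were lost, so the sketch below repeats the same steps with hypothetical values (10, 20, and 100 are assumptions chosen for illustration). Assigning to a label that does not exist yet creates the column or row; assigning to an existing selection overwrites it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])

# Hypothetical values for illustration.
df['e'] = 10              # new column, the scalar is broadcast to every row
df.loc[2] = 20            # new row with label 2, the scalar fills every column
df[['a', 'c']] = 100      # overwrite two existing columns

print(df)
```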
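# Deleting: del / drop()

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df)

del df['a']
print(df)
print('-----')
# del statement: deletes a column in place

print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')
# drop() deletes rows; with the default inplace=False it returns a new object and leaves the original unchanged

print(df.drop(['d'], axis = 1))
print(df)
# drop() deletes columns when axis = 1; with the default inplace=False it returns a new object and leaves the original unchanged

  Output: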

          a          b          c          d
91.866806 88.753655 18.469852 71.651277
64.835568 33.844967 6.391246 54.916094
75.930985 19.169862 91.042457 43.648258
15.863853 24.788866 10.625684 82.135316
b c d
88.753655 18.469852 71.651277
33.844967 6.391246 54.916094
19.169862 91.042457 43.648258
24.788866 10.625684 82.135316
-----
b c d
33.844967 6.391246 54.916094
19.169862 91.042457 43.648258
24.788866 10.625684 82.135316
b c d
88.753655 18.469852 71.651277
24.788866 10.625684 82.135316
b c d
88.753655 18.469852 71.651277
33.844967 6.391246 54.916094
19.169862 91.042457 43.648258
24.788866 10.625684 82.135316
-----
b c
88.753655 18.469852
33.844967 6.391246
19.169862 91.042457
24.788866 10.625684
b c d
88.753655 18.469852 71.651277
33.844967 6.391246 54.916094
19.169862 91.042457 43.648258
24.788866 10.625684 82.135316
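A short sketch contrasting the two deletion styles: `drop()` returns a new object and leaves the original alone, while `del` modifies the DataFrame in place. The `columns=` keyword used here is an equivalent spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

no_row = df.drop(0)              # new DataFrame without row 0
no_col = df.drop(columns=['d'])  # new DataFrame without column 'd'
shape_after_drop = df.shape      # df itself is unchanged so far

del df['a']                      # in place: df loses column 'a'

print(no_row)
print(no_col)
print(df)
```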
# Alignment

df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df1 + df2)
# Operations between DataFrame objects align automatically on both the columns and the index (row labels)

  Output:

          A         B         C   D
-0.281123 -2.529461 1.325663 NaN
-0.310514 -0.408225 -0.760986 NaN
-0.172169 -2.355042 1.521342 NaN
1.113505 0.325933 3.689586 NaN
0.107513 -0.503907 -1.010349 NaN
-0.845676 -2.410537 -1.406071 NaN
1.682854 -0.576620 -0.981622 NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
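The NaN rows and the NaN column in the output above come from labels that exist in only one operand. If that is not wanted, `DataFrame.add` with `fill_value` treats a value missing on one side as the given number. The sketch below uses small frames of ones so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 2)), columns=['A', 'B'])
df2 = pd.DataFrame(np.ones((2, 1)), columns=['A'])

# Plain addition: any label missing from one side gives NaN.
total = df1 + df2

# add() with fill_value: a value missing on one side is treated as 0.
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```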
# Sorting 1 - sort by value: .sort_values
# Also works for Series

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a'], ascending = True)) # ascending
print(df1.sort_values(['a'], ascending = False)) # descending
print('------')
# ascending parameter: sets ascending or descending order; ascending by default
# Sorting by a single column

df2 = pd.DataFrame({'a':[,,,,,,,],
'b':list(range()),
'c':list(range(,,-))})
print(df2)
print(df2.sort_values(['a','c']))
# Sorting by several columns: the columns are applied in the order given

  Output:

           a          b          c          d
16.519099 19.601879 35.464189 58.866972
34.506472 97.106578 96.308244 54.049359
87.177828 47.253416 92.098847 19.672678
66.673226 51.969534 71.789055 14.504191
a b c d
16.519099 19.601879 35.464189 58.866972
34.506472 97.106578 96.308244 54.049359
66.673226 51.969534 71.789055 14.504191
87.177828 47.253416 92.098847 19.672678
a b c d
87.177828 47.253416 92.098847 19.672678
66.673226 51.969534 71.789055 14.504191
34.506472 97.106578 96.308244 54.049359
16.519099 19.601879 35.464189 58.866972
------
a b c a b c
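Because the multi-column output above did not survive, here is a small sketch with made-up values showing how ties in the first column are broken by the second. It also shows that `ascending` accepts one flag per column.

```python
import pandas as pd

# Made-up values: column 'a' has ties, column 'c' breaks them.
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [0, 1, 2, 3],
                   'c': [4, 3, 2, 1]})

# Sort by 'a' first, then by 'c' within equal values of 'a'.
by_a_then_c = df.sort_values(['a', 'c'])

# One ascending flag per column: 'a' ascending, 'c' descending.
mixed = df.sort_values(['a', 'c'], ascending=[True, False])

print(by_a_then_c)
print(mixed)
```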
# Sorting 2 - sort by index: .sort_index

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = [,,,],
columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
index = ['h','s','x','g'],
columns = ['a','b','c','d'])
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
# Sort by index
# Defaults: ascending=True, inplace=False

  Output:

           a          b          c          d
57.327269 87.623119 93.655538 5.859571
69.739134 80.084366 89.005538 56.825475
88.148296 6.211556 68.938504 41.542563
29.248036 72.005306 57.855365 45.931715
a b c d
29.248036 72.005306 57.855365 45.931715
88.148296 6.211556 68.938504 41.542563
69.739134 80.084366 89.005538 56.825475
57.327269 87.623119 93.655538 5.859571
a b c d
h 50.579469 80.239138 24.085110 39.443600
s 30.906725 39.175302 11.161542 81.010205
x 19.900056 18.421110 4.995141 12.605395
g 67.760755 72.573568 33.507090 69.854906
a b c d
g 67.760755 72.573568 33.507090 69.854906
h 50.579469 80.239138 24.085110 39.443600
s 30.906725 39.175302 11.161542 81.010205
x 19.900056 18.421110 4.995141 12.605395
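A minimal sketch of `sort_index` on a string index, with made-up data. Like `sort_values`, it returns a new object by default and leaves the original order untouched.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['h', 's', 'g'])

asc = df.sort_index()                  # g, h, s
desc = df.sort_index(ascending=False)  # s, h, g

print(asc)
print(desc)
print(df)  # the original keeps its order: h, s, g
```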