Learn pandas in 30 minutes

Reference: https://pandas.pydata.org/docs/user_guide/10min.html

 

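Creating objects

Create a Series object (all sessions below assume import numpy as np and import pandas as pd have already been run)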

In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8])                                                            

In [151]: s                                                                                                 
Out[151]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
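
The session above leaves out the imports, so here is a minimal self-contained sketch of the same Series. It shows why the result is float64: the list contains np.nan, and a plain integer dtype cannot hold a missing value. The extra checks (isna, len) are illustrative additions and are not part of the original session.

```python
import numpy as np
import pandas as pd

# A list containing np.nan: pandas falls back to float64 so NaN can be stored.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)          # dtype chosen by pandas for the mixed int/NaN input
print(s.isna().sum())   # number of missing values
print(len(s))           # number of elements, including the NaN
```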

  

Create a DataFrame from a NumPy array, with a datetime index and labeled columns

In [152]: dates = pd.date_range('20130101', periods=6)                                                      

In [153]: dates                                                                                             
Out[153]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD'))                         

In [155]: df                                                                                                
Out[155]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954  
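
Because np.random.randn draws new numbers on every run, the values above cannot be reproduced exactly. A hedged variant that is repeatable is sketched below; the seeded generator (np.random.default_rng with seed 0) is an assumption of this sketch, not something the original session uses.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Assumption: a seeded generator so the same numbers appear on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)     # (rows, columns)
print(df.index[0])  # first label of the DatetimeIndex
```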

 

Create a DataFrame from a dictionary; scalar values are broadcast to every row

In [179]: df2 = pd.DataFrame(
     ...:     {
     ...:         "A": 1.0,
     ...:         "B": pd.Timestamp("20130102"),
     ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
     ...:         "D": np.array([3] * 4, dtype="int32"),
     ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
     ...:         "F": "foo",
     ...:     }
     ...: )

In [180]: df2                                                                                               
Out[180]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

In [181]: df2.dtypes                                                                                        
Out[181]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


  The dtypes attribute shows the data type of each column.
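
As a small illustration of per-column dtypes, the sketch below builds a tiny frame and looks up the dtype of individual columns. The column names (x, n, label) are hypothetical, and the use of select_dtypes to keep only the numeric columns is an addition that the original does not cover.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; the scalar 1.0 is broadcast to all four rows.
frame = pd.DataFrame(
    {
        "x": 1.0,
        "n": np.array([3] * 4, dtype="int32"),
        "label": pd.Categorical(["test", "train", "test", "train"]),
    }
)

print(frame.dtypes)  # one dtype per column

# Keep only the numeric columns (the categorical column is dropped).
numeric = frame.select_dtypes(include="number")
print(list(numeric.columns))
```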

 

Viewing data

head and tail show the first and last rows (five by default); index and columns show the row labels and the column labels.

In [183]: df.head()                                                                                         
Out[183]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109

In [184]: df.tail(2)                                                                                        
Out[184]: 
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954

In [185]: df.head(1)                                                                                        
Out[185]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [186]: df.index                                                                                          
Out[186]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [187]: df.columns                                                                                        
Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object')

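
  DataFrame.to_numpy() returns the data as a NumPy array. This can be an expensive operation when the columns have different dtypes, because pandas keeps one dtype per column while a NumPy array has a single dtype for the whole array.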

In [194]: df.to_numpy()                                                                                     
Out[194]: 
array([[ 0.9119509 ,  0.11907694,  2.24459767, -1.52416844],
       [-0.71159066,  1.81432742,  0.85962346, -0.24911614],
       [-0.0414173 , -1.15847237,  1.03775241,  1.12435552],
       [ 1.22224697, -0.65168145,  1.76462966,  0.50050719],
       [-0.33219183, -1.42487132, -0.68025439,  0.37010889],
       [-0.72452176, -1.23899146,  0.22342519, -1.77595409]])

In [195]: df2.to_numpy()                                                                                    
Out[195]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

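  The two calls above show both cases. The second is considerably more expensive: df2 has columns of different types, so every value has to be converted to a Python object and the result has dtype=object.

describe() shows a quick statistical summary of each column: count, mean, standard deviation, minimum, quartiles, and maximum.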
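
The difference between the two to_numpy() results can be checked directly from the dtype of the returned array. This is a minimal sketch with two hypothetical frames (homogeneous and mixed) that are not part of the original session.

```python
import numpy as np
import pandas as pd

# All columns are float64, so the array keeps a compact float64 dtype.
homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# A float column next to a string column: no common numeric dtype exists,
# so NumPy has to fall back to generic Python objects.
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(homogeneous.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```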

In [202]: df.describe()                                                                                     
Out[202]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.054079 -0.423435  0.908296 -0.259045
std    0.830825  1.229820  1.051730  1.165178
min   -0.724522 -1.424871 -0.680254 -1.775954
25%   -0.616741 -1.218862  0.382475 -1.205405
50%   -0.186805 -0.905077  0.948688  0.060496
75%    0.673609 -0.073613  1.582910  0.467908
max    1.222247  1.814327  2.244598  1.124356
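
To make the rows of describe() easier to read, here is a sketch on a tiny hypothetical column whose statistics can be checked by hand. Individual entries are read with loc using the row label shown in the summary.

```python
import pandas as pd

# Hypothetical data: four values whose mean and median are both 2.5.
small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

print(summary.loc["count", "A"])  # number of non-missing values
print(summary.loc["mean", "A"])   # arithmetic mean
print(summary.loc["50%", "A"])    # median
```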

  The T attribute transposes the data, swapping the row index and the columns.

In [204]: df.T                                                                                              
Out[204]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.911951   -0.711591   -0.041417    1.222247   -0.332192   -0.724522
B    0.119077    1.814327   -1.158472   -0.651681   -1.424871   -1.238991
C    2.244598    0.859623    1.037752    1.764630   -0.680254    0.223425
D   -1.524168   -0.249116    1.124356    0.500507    0.370109   -1.775954
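
A short sketch of what transposition does to labels and shape, using a hypothetical 2x3 frame: the row labels become column labels and vice versa, and each value keeps its (row, column) pairing in swapped order.

```python
import pandas as pd

# Hypothetical frame with 2 rows and 3 columns.
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], index=["r1", "r2"], columns=["a", "b", "c"]
)
t = frame.T

print(t.shape)            # rows and columns are swapped
print(t.loc["b", "r2"])   # same value as frame.loc["r2", "b"]
```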

  Sorting by index

sort_index sorts by label: the row index by default (axis=0), or the column labels with axis=1.

In [207]: df.sort_index(ascending=False)                                                                    
Out[207]: 
                   A         B         C         D
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [208]: df.sort_index?                                                                                    

In [209]: df.sort_index(axis=1, ascending=False)                                                            
Out[209]: 
                   D         C         B         A
2013-01-01 -1.524168  2.244598  0.119077  0.911951
2013-01-02 -0.249116  0.859623  1.814327 -0.711591
2013-01-03  1.124356  1.037752 -1.158472 -0.041417
2013-01-04  0.500507  1.764630 -0.651681  1.222247
2013-01-05  0.370109 -0.680254 -1.424871 -0.332192
2013-01-06 -1.775954  0.223425 -1.238991 -0.724522

  Finally, sort_values sorts the rows by the values of a given column.

In [212]: df.sort_values(by='B')                                                                            
Out[212]: 
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
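
The three sorting calls can be compared side by side on a small hypothetical frame whose order is easy to verify by eye. Note that each call returns a new object; the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical frame with unsorted row labels and unsorted column labels.
frame = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=["y", "z", "x"])

by_rows = frame.sort_index()                           # sort row labels
by_cols = frame.sort_index(axis=1)                     # sort column labels
by_value = frame.sort_values(by="B", ascending=False)  # sort rows by column B

print(list(by_rows.index))
print(list(by_cols.columns))
print(list(by_value.index))
```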

  

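Selection

pandas recommends the four optimized data access methods .at, .iat, .loc and .iloc for selecting values.

Passing a single column label to [] returns that column as a Series.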

In [226]: df['A']                                                                                           
Out[226]: 
2013-01-01    0.911951
2013-01-02   -0.711591
2013-01-03   -0.041417
2013-01-04    1.222247
2013-01-05   -0.332192
2013-01-06   -0.724522
Freq: D, Name: A, dtype: float64

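  Slicing with [] selects rows and returns a DataFrame (think of it as a slice of the original df). An integer slice such as df[0:3] excludes the end position, while a label slice such as df["20130102":"20130103"] includes both endpoints.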

In [229]: df[0:3]                                                                                           
Out[229]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [230]: df["20130102":'20130103']                                                                         
Out[230]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
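
The different endpoint behaviour of integer slices and label slices is easy to get wrong, so here is a minimal sketch that counts the rows returned by each. The frame and its single column A are hypothetical.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)  # hypothetical data

# Integer slice: positions 0, 1, 2 (end position 3 is excluded).
by_position = frame[0:3]

# Label slice: both 2013-01-02 and 2013-01-03 are included.
by_label = frame["20130102":"20130103"]

print(len(by_position), len(by_label))
```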

  

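Selection by label, which some Chinese-language books call explicit indexing, uses loc. Note that loc is an indexer rather than a method: follow it directly with [] and not with ().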

In [238]: df.loc[dates[0]]                                                                                  
Out[238]: 
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

  

Selecting all rows for several columns by label

In [239]: df.loc[:,['A','B']]                                                                               
Out[239]: 
                   A         B
2013-01-01  0.911951  0.119077
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
2013-01-06 -0.724522 -1.238991

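 Selecting a range of rows with a label slice together with specific columns (both endpoints of the slice are included)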

In [240]: df.loc["20130102":"20130104",["A","B"]]                                                           
Out[240]: 
                   A         B
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681

  Selecting a single row and specific columns returns a Series

In [241]: df.loc["20130102",["A","B"]]                                                                      
Out[241]: 
A   -0.711591
B    1.814327
Name: 2013-01-02 00:00:00, dtype: float64

 Finally, two ways to get a single value, that is, a scalar; at is the faster accessor for scalar lookups

In [242]: df.loc[dates[0],"A"]                                                                              
Out[242]: 0.91195089904327

In [243]: df.at[dates[0],"A"]                                                                               
Out[243]: 0.91195089904327
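
The loc and at patterns from this part can be collected into one self-contained sketch. The frame below uses hypothetical integer data so the selected values are predictable; the variable names (row, block, value, same) are illustrative only.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=dates)

row = frame.loc[dates[0]]                             # one row, returned as a Series
block = frame.loc["20130102":"20130104", ["A", "B"]]  # label slice, endpoints included
value = frame.loc[dates[0], "A"]                      # scalar via loc
same = frame.at[dates[0], "A"]                        # scalar via at

print(type(row).__name__, block.shape, value, same)
```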


  

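Selection by position, sometimes called implicit indexing, uses iloc

It works much like loc and accepts a single integer, a list of integers, or a slice.

A single integer returns that row as a Series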

In [244]: df.iloc[0]                                                                                        
Out[244]: 
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

In [245]: df.iloc[3]                                                                                        
Out[245]: 
A    1.222247
B   -0.651681
C    1.764630
D    0.500507
Name: 2013-01-04 00:00:00, dtype: float64

  Slices also work. As in plain Python, the start is included and the end is excluded.

In [246]: df.iloc[3:5,0:2]                                                                                  
Out[246]: 
                   A         B
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871

  You can also pass lists of integer positions, with the row list and the column list separated by a comma

In [247]: df.iloc[[1,2,4],[0,2]]                                                                            
Out[247]: 
                   A         C
2013-01-02 -0.711591  0.859623
2013-01-03 -0.041417  1.037752
2013-01-05 -0.332192 -0.680254

  A bare : selects everything along that axis; omitting the column part selects all columns

In [248]: df.iloc[1:3]                                                                                      
Out[248]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [249]: df.iloc[1:3,:]                                                                                    
Out[249]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [250]: df.iloc[:,1:3]                                                                                    
Out[250]: 
                   B         C
2013-01-01  0.119077  2.244598
2013-01-02  1.814327  0.859623
2013-01-03 -1.158472  1.037752
2013-01-04 -0.651681  1.764630
2013-01-05 -1.424871 -0.680254
2013-01-06 -1.238991  0.223425

  Passing two integer positions returns a scalar; iat is the faster equivalent for scalar access.

In [251]: df.iloc[1,1]                                                                                      
Out[251]: 1.8143274155708045

In [252]: df.iat[1,1]                                                                                       
Out[252]: 1.8143274155708045
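
The iloc and iat patterns above, gathered into one sketch. The frame is filled with np.arange(24) so that the value at row i, column j is simply 4*i + j, which makes each selection easy to check; this data is an assumption of the sketch and not the random df from the session.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame holding 0..23 in row-major order.
frame = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

row = frame.iloc[3]                      # fourth row as a Series
block = frame.iloc[3:5, 0:2]             # rows 3-4, columns 0-1 (ends excluded)
picked = frame.iloc[[1, 2, 4], [0, 2]]   # explicit row and column positions
all_rows = frame.iloc[:, 1:3]            # every row, columns 1-2
value = frame.iloc[1, 1]                 # scalar via iloc
same = frame.iat[1, 1]                   # scalar via iat

print(row.tolist(), block.shape, picked.shape, all_rows.shape, value, same)
```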

  

Boolean indexing, which some books call mask selection
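
The original gives no example for this heading. As a hedged illustration only, the sketch below shows two common forms of boolean indexing on a small hypothetical frame: filtering rows with a boolean Series, and masking a whole frame with a boolean DataFrame.

```python
import pandas as pd

# Hypothetical data with a mix of positive and negative values.
frame = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

# Boolean Series as a row filter: keep rows where column A is positive.
mask = frame["A"] > 0
positive_rows = frame[mask]

# Boolean DataFrame as a mask: values failing the condition become NaN.
masked = frame[frame > 0]

print(list(positive_rows.index))
print(int(masked.isna().sum().sum()))
```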

 
