python学习笔记(一)matplotlib、numpy、Pandas

整合自网络与https://space.bilibili.com/243821484?from=search&seid=8124768530697300938

 

3.Pandas

  如果用 python 的列表和字典来作比较, 那么可以说 Numpy 是列表形式的,没有数值标签,而 Pandas 就是字典形式

1 import pandas as pd
2 import numpy as np
3 s = pd.Series([1,3,6,np.nan,44,1])
5 print(s)
###################
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
###################


  Series的字符串表现形式为:索引在左边,值在右边。由于我们没有为数据指定索引。

  • 3.1DataFrame

1 dates = pd.date_range('20160101',periods=6)
2 print(dates)
3 df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d']) # 行 列
5 print(df)
#############################################################################
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
                   a         b         c         d
2016-01-01 -0.362729  0.025856 -0.453970  0.521317
2016-01-02 -0.694964 -0.418078 -0.034875 -0.382649
2016-01-03 -1.308891 -0.465486 -0.892237 -0.094203
2016-01-04  0.331540  0.621307  0.033407 -1.490113
2016-01-05 -1.770037  1.443139 -0.465179 -1.571931
2016-01-06  0.017418 -0.007310  1.151194 -0.043637
#############################################################################

  

  DataFrame是一个表格型的数据结构,它包含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值等)。DataFrame既有行索引也有列索引, 它可以被看做由Series组成的大字典。

 

  选择显示pd其中一行

1 print(df['b'])
########################
2016-01-01    0.743081
2016-01-02   -0.558816
2016-01-03    0.287229
2016-01-04    1.850405
2016-01-05    0.619291
2016-01-06    0.847188
Freq: D, Name: b, dtype: float64
########################

 

   

  不选择显示列索引,默认从零开始

1 df1 = pd.DataFrame(np.arange(12).reshape((3,4)))
2 print(df1)
##########
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
########## 

  

 

  显示列的序号、数据的名称、所有的值

 

 1 df2 = pd.DataFrame({'A' : 1.,
 2                     'B' : pd.Timestamp('20130102'),
 3                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
 4                     'D' : np.array([3] * 4,dtype='int32'),
 5                     'E' : pd.Categorical(["test","train","test","train"]),
 6                     'F' : 'foo'})
 7                     
 8 print(df2)
"""
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
"""
1 print(df2.index)
"""
Int64Index([0, 1, 2, 3], dtype='int64')
"""
1 print(df2.columns)
"""
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
"""
1 print(df2.values)
"""
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
"""

  

 

  显示行索引信息

1 print(df2.dtypes)
"""
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
"""

 

 

  数据的总结。只针对数值类型

1 df2.describe()
        A      C      D
count    4.0    4.0    4.0
mean    1.0    1.0    3.0
std    0.0    0.0    0.0
min    1.0    1.0    3.0
25%    1.0    1.0    3.0
50%    1.0    1.0    3.0
75%    1.0    1.0    3.0
max    1.0    1.0    3.0

 

  翻转数据

1 print(df2.T)
                     0                    1                    2  \
A                    1                    1                    1   
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00   
C                    1                    1                    1   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                    1  
B  2013-01-02 00:00:00  
C                    1  
D                    3  
E                train  
F                  foo  

 

  对数据的 index 进行排序并输出

1 print(df2.sort_index(axis=0, ascending=True))      #axis=0为选择列索引,axis=1为选择行索引
2 print(df2.sort_index(axis=1, ascending=False))   #ascending=True为正序,False为倒序
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0

 

  

  对数据 值 某一列 排序输出:

1 print(df2.sort_values(by='E'))
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
2  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
3  1.0 2013-01-02  1.0  3  train  foo

 

  • 3.2  选择数据

  

1 dates = pd.date_range('20130101', periods=6)
2 df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
3 df
"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

 

  选择某一列

1 print(df['A'])
2 或者
3 print(df.A)
"""
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
"""

 

  选择跨越多行或多列

1 print(df[0:3])
"""
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11
"""
1 print(df[0:3]["A"])
"""
2013-01-01    0
2013-01-02    4
2013-01-03    8
"""
1 print(df['20130102':'20130104'])
"""
A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
"""

 

  loc 使用标签来选择数据 

1 print(df.loc['20130102'])
"""
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int64
"""
1 print(df.loc[:,['A','B']]) 
1 """
2              A   B
3 2013-01-01   0   1
4 2013-01-02   4   5
5 2013-01-03   8   9
6 2013-01-04  12  13
7 2013-01-05  16  17
8 2013-01-06  20  21
9 """
1 print(df.loc['20130102',['A','B']])
"""
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int64
"""

 

   iloc 根据序列来选择数据 

1 print(df)
2 print(df.iloc[3,1])    #第4行第2列
'''
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23


13

'''
1 print(df.iloc[3:5,1:3])    # 第三行到第五行,第一列到第三列
"""
             B   C
2013-01-04  13  14
2013-01-05  17  18
"""
1 print(df.iloc[[1,3,5],1:3])
"""
             B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22

"""

 

  通过判断的筛选

1 print(df[df.A>8])
"""
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

 

  当有条件筛选时,如下图筛选出所有C列PM2.5在I列1006A处的值:

python学习笔记(一)matplotlib、numpy、Pandas

 

 

1 path = 'D:\python\站点_20190101-20191231\china_sites_20190101.csv'
2 csv_data = pd.read_csv(path)
3 aa=csv_data[csv_data['type'] == 'PM2.5'][['type', '1006A']]
4 aa
######
    1006A
1    47.0
16    44.0
31    43.0
46    40.0
61    42.0
76    46.0
91    47.0
106    49.0
121    47.0
136    53.0
151    46.0
166    34.0
。。。。。。
#######

 

 

上一篇:成员函数和友元函数对比完成同一个实例


下一篇:【机器学习与数据挖掘】统计回归模型(Market intraday momentum)