Compiled from online sources and https://space.bilibili.com/243821484?from=search&seid=8124768530697300938
3. Pandas
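If we compare them with Python's lists and dictionaries, NumPy is like a list, with no labels attached to its values, while Pandas is like a dictionary, where every value has a label.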
import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])

print(s)
###################
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
###################
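The string representation of a Series shows the index on the left and the values on the right. Because we did not specify an index for the data, pandas automatically creates an integer index from 0 to N-1 (where N is the length of the data).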
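As a small sketch of the alternative, a Series can also be given explicit labels through the index argument. The labels 'a' to 'f' below are made up for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# Default index: integers 0..N-1 are generated automatically.
s_default = pd.Series([1, 3, 6, np.nan, 44, 1])

# Explicit index: the labels 'a'..'f' are illustrative only.
s_labeled = pd.Series([1, 3, 6, np.nan, 44, 1], index=list("abcdef"))

print(list(s_default.index))   # the automatically generated integer labels
print(list(s_labeled.index))   # the labels that were passed in
print(s_labeled["e"])          # look up a value by its label
```

With labels in place, a value is looked up by name much like a key in a dictionary.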
-
3.1 DataFrame
dates = pd.date_range('20160101',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])  # index = row labels, columns = column labels

print(df)
#############################################################################
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
                   a         b         c         d
2016-01-01 -0.362729  0.025856 -0.453970  0.521317
2016-01-02 -0.694964 -0.418078 -0.034875 -0.382649
2016-01-03 -1.308891 -0.465486 -0.892237 -0.094203
2016-01-04  0.331540  0.621307  0.033407 -1.490113
2016-01-05 -1.770037  1.443139 -0.465179 -1.571931
2016-01-06  0.017418 -0.007310  1.151194 -0.043637
#############################################################################
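A DataFrame is a tabular data structure. It contains an ordered set of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a large dictionary made up of Series.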
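The "dictionary of Series" view can be sketched directly: a dict whose values are Series becomes a DataFrame whose columns are those Series. The names price and label and their values are invented for this illustration.

```python
import pandas as pd

# Two Series sharing the same index; names and values are illustrative.
price = pd.Series([1.5, 2.0, 3.25], index=["x", "y", "z"])
label = pd.Series(["low", "mid", "high"], index=["x", "y", "z"])

# A dict of Series becomes a DataFrame: the keys turn into column names.
frame = pd.DataFrame({"price": price, "label": label})

col = frame["price"]           # each column is itself a Series
print(type(col).__name__)
print(frame.dtypes)            # the two columns hold different value types
```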
Select and display one column of the DataFrame:
print(df['b'])
########################
2016-01-01    0.743081
2016-01-02   -0.558816
2016-01-03    0.287229
2016-01-04    1.850405
2016-01-05    0.619291
2016-01-06    0.847188
Freq: D, Name: b, dtype: float64
########################
If no row or column labels are given, both default to integers starting from zero:
df1 = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df1)
##########
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
##########
Display the row index, the column names, and all the values:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})

print(df2)
"""
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
"""
print(df2.index)
"""
Int64Index([0, 1, 2, 3], dtype='int64')
"""
print(df2.columns)
"""
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
"""
print(df2.values)
"""
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
"""
Show the data type of each column:
print(df2.dtypes)
"""
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
"""
Summary statistics of the data. By default only numeric columns are included:
df2.describe()
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
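describe() leaves non-numeric columns out of its default summary. As a hedged sketch, the include parameter can widen it to every column. The small frame df_mixed below is a made-up stand-in, not the df2 from these notes.

```python
import pandas as pd

# Illustrative frame with one numeric and one text column.
df_mixed = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "E": ["test", "train", "test", "train"],
})

numeric_only = df_mixed.describe()               # default: numeric columns only
everything = df_mixed.describe(include="all")    # also summarizes the text column

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```

For the text column the summary reports rows such as unique, top and freq, while the numeric statistics are left empty for it.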
Transpose the data (swap rows and columns):
print(df2.T)
                     0                    1                    2  \
A                    1                    1                    1
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00
C                    1                    1                    1
D                    3                    3                    3
E                 test                train                 test
F                  foo                  foo                  foo

                     3
A                    1
B  2013-01-02 00:00:00
C                    1
D                    3
E                train
F                  foo
Sort the data by its index and print the result:
print(df2.sort_index(axis=0, ascending=True))   # axis=0 sorts by the row index; axis=1 sorts by the column labels
print(df2.sort_index(axis=1, ascending=False))  # ascending=True sorts in ascending order, False in descending order
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0
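The axis argument of sort_index is easy to mix up, so here is a minimal sketch on a deliberately shuffled frame. The frame df_s and its labels are invented for illustration.

```python
import pandas as pd

# Rows and columns are intentionally out of order.
df_s = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=["b", "c", "a"],
    columns=["y", "x"],
)

by_rows = df_s.sort_index(axis=0, ascending=True)   # reorders rows by row label
by_cols = df_s.sort_index(axis=1, ascending=True)   # reorders columns by column label

print(by_rows)
print(by_cols)
```

In both calls the data stays attached to its labels; only the order of rows or columns changes.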
Sort the data by the values of one column and print the result:
print(df2.sort_values(by='E'))
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
2  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
3  1.0 2013-01-02  1.0  3  train  foo
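sort_values also accepts several columns at once, each with its own direction. A minimal sketch, assuming a small made-up frame with a text column E and a numeric column A:

```python
import pandas as pd

# Illustrative data: E is the primary sort key, A breaks ties.
df_v = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "E": ["test", "train", "test", "train"],
})

# Sort by E ascending, then by A descending within each E group.
result = df_v.sort_values(by=["E", "A"], ascending=[True, False])
print(result)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.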
-
3.2 Selecting data
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
df
"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""
Select a single column:
print(df['A'])
# or
print(df.A)
"""
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
"""
Select a range spanning several rows or columns:
print(df[0:3])
"""
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11
"""
print(df[0:3]["A"])
"""
2013-01-01    0
2013-01-02    4
2013-01-03    8
"""
print(df['20130102':'20130104'])
"""
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
"""
loc selects data by label:
print(df.loc['20130102'])
"""
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int64
"""
print(df.loc[:,['A','B']])
"""
             A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21
"""
print(df.loc['20130102',['A','B']])
"""
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int64
"""
iloc selects data by integer position:
print(df)
print(df.iloc[3,1])  # 4th row, 2nd column
'''
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
13
'''
print(df.iloc[3:5,1:3])  # positions 3-4 (4th and 5th rows), positions 1-2 (2nd and 3rd columns); the end of a slice is excluded
"""
             B   C
2013-01-04  13  14
2013-01-05  17  18
"""
print(df.iloc[[1,3,5],1:3])
"""
             B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22
"""
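One difference between loc and iloc that is worth a quick check: a label slice with loc includes its end point, while a positional slice with iloc excludes it. The sketch below rebuilds the same 6x4 frame used in this section and compares the two.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# Label slices: both '20130104' and column 'C' are included.
by_label = df.loc['20130102':'20130104', 'A':'C']

# Position slices: stop values 4 and 3 are excluded.
by_pos = df.iloc[1:4, 0:3]

print(by_label)
print(by_label.equals(by_pos))
```

So the label range 02 to 04 corresponds to positions 1:4, not 1:3.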
Filter rows with a boolean condition:
print(df[df.A>8])
"""
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""
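Conditions can also be combined. A minimal sketch on the same frame: use & for "and", | for "or", and wrap each comparison in parentheses, since Python's and/or keywords do not work element-wise on a Series. isin tests membership in a list of values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

both = df[(df['A'] > 8) & (df['B'] < 21)]     # rows meeting both conditions
either = df[(df['A'] < 4) | (df['A'] > 16)]   # rows meeting at least one condition
in_set = df[df['C'].isin([2, 10])]            # rows whose C value is in the list

print(both)
print(either)
print(in_set)
```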
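Conditional filtering works the same way on data read from a file. Here, for every row whose type column (column C in the spreadsheet) is PM2.5, we select the value in the 1006A column (column I):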
path = r'D:\python\站点_20190101-20191231\china_sites_20190101.csv'  # raw string, so the backslashes are kept literally
csv_data = pd.read_csv(path)
aa = csv_data[csv_data['type'] == 'PM2.5'][['type', '1006A']]  # filter the rows, then keep two columns
aa
######
     1006A
1     47.0
16    44.0
31    43.0
46    40.0
61    42.0
76    46.0
91    47.0
106   49.0
121   47.0
136   53.0
151   46.0
166   34.0
......
#######
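The same filter can be written as a single .loc call that takes the row condition and the column list together. The sketch below is self-contained: it reads a tiny made-up CSV from a string instead of the file on disk. Only the column names type and 1006A come from the notes above; the hour column and all values are invented.

```python
import io
import pandas as pd

# Made-up stand-in for the CSV file; the values are invented.
csv_text = """hour,type,1006A
0,AQI,60
0,PM2.5,47
1,AQI,58
1,PM2.5,44
"""
data = pd.read_csv(io.StringIO(csv_text))

# Row condition and column selection in one .loc call.
pm25 = data.loc[data["type"] == "PM2.5", ["type", "1006A"]]
print(pm25)
```

Doing both steps in one .loc call avoids chaining two [] selections, which matters mostly when the result is later assigned to.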