Reference: https://pandas.pydata.org/docs/user_guide/10min.html
Object creation
Create a Series object
In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [151]: s
Out[151]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame from a NumPy array
In [152]: dates = pd.date_range('20130101', periods=6)

In [153]: dates
Out[153]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD'))

In [155]: df
Out[155]:
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
Create a DataFrame from a dictionary
In [179]: df2 = pd.DataFrame(
     ...:     {
     ...:         "A": 1.0,
     ...:         "B": pd.Timestamp("20130102"),
     ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
     ...:         "D": np.array([3] * 4, dtype="int32"),
     ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
     ...:         "F": "foo",
     ...:     },
     ...: )

In [180]: df2
Out[180]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

In [181]: df2.dtypes
Out[181]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
dtypes shows the data type of each column; the columns of a DataFrame can have different dtypes.
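As a small, self-contained illustration of building a DataFrame from a dictionary, the sketch below uses a made-up frame (the variable name frame and its three-row size are assumptions, not part of the session above). Scalar values are broadcast to the length of the array-like columns, and each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Scalars ("A" and "F") are repeated to match the length of the
# array-like columns ("C" and "D"), which have three rows here.
frame = pd.DataFrame(
    {
        "A": 1.0,
        "C": pd.Series(1, index=range(3), dtype="float32"),
        "D": np.array([3] * 3, dtype="int32"),
        "F": "foo",
    }
)

print(frame.shape)   # three rows, four columns
print(frame.dtypes)  # one dtype per column
```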
Viewing data
Use head() and tail() to view the first and last rows, and index and columns to inspect the row index and the column labels.
In [183]: df.head()
Out[183]:
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109

In [184]: df.tail(2)
Out[184]:
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954

In [185]: df.head(1)
Out[185]:
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [186]: df.index
Out[186]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [187]: df.columns
Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object')
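DataFrame.to_numpy() returns the underlying data as a NumPy array. This can be an expensive operation when the columns have different dtypes, because pandas keeps one dtype per column while a NumPy array has a single dtype for the whole array.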
In [194]: df.to_numpy()
Out[194]:
array([[ 0.9119509 ,  0.11907694,  2.24459767, -1.52416844],
       [-0.71159066,  1.81432742,  0.85962346, -0.24911614],
       [-0.0414173 , -1.15847237,  1.03775241,  1.12435552],
       [ 1.22224697, -0.65168145,  1.76462966,  0.50050719],
       [-0.33219183, -1.42487132, -0.68025439,  0.37010889],
       [-0.72452176, -1.23899146,  0.22342519, -1.77595409]])

In [195]: df2.to_numpy()
Out[195]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
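The two calls above show both cases. The second is much more expensive: df2 has mixed column dtypes, so the result must use a single dtype that can hold every column, which here is object.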
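To make the dtype difference concrete, here is a minimal sketch with two tiny made-up frames (numeric and mixed are hypothetical names, not from the session above). It only inspects the dtype of the resulting arrays and does not measure cost.

```python
import pandas as pd

# All columns are float64, so the array can stay float64.
numeric = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# A float column next to a string column forces a common dtype: object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

print(numeric_array.dtype)
print(mixed_array.dtype)
```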
describe() shows a quick statistical summary of each column: count, mean, standard deviation, minimum, quartiles, and maximum.
In [202]: df.describe()
Out[202]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.054079 -0.423435  0.908296 -0.259045
std    0.830825  1.229820  1.051730  1.165178
min   -0.724522 -1.424871 -0.680254 -1.775954
25%   -0.616741 -1.218862  0.382475 -1.205405
50%   -0.186805 -0.905077  0.948688  0.060496
75%    0.673609 -0.073613  1.582910  0.467908
max    1.222247  1.814327  2.244598  1.124356
The T attribute transposes the data, swapping the row index and the column labels.
In [204]: df.T
Out[204]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.911951   -0.711591   -0.041417    1.222247   -0.332192   -0.724522
B    0.119077    1.814327   -1.158472   -0.651681   -1.424871   -1.238991
C    2.244598    0.859623    1.037752    1.764630   -0.680254    0.223425
D   -1.524168   -0.249116    1.124356    0.500507    0.370109   -1.775954
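Sorting by index
sort_index() sorts along an axis by its labels: axis=0 (the default) sorts by the row index, and axis=1 sorts by the column labels.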
In [207]: df.sort_index(ascending=False)
Out[207]:
                   A         B         C         D
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [208]: df.sort_index?

In [209]: df.sort_index(axis=1, ascending=False)
Out[209]:
                   D         C         B         A
2013-01-01 -1.524168  2.244598  0.119077  0.911951
2013-01-02 -0.249116  0.859623  1.814327 -0.711591
2013-01-03  1.124356  1.037752 -1.158472 -0.041417
2013-01-04  0.500507  1.764630 -0.651681  1.222247
2013-01-05  0.370109 -0.680254 -1.424871 -0.332192
2013-01-06 -1.775954  0.223425 -1.238991 -0.724522
Finally, sort_values() sorts the rows by the values in a given column.
In [212]: df.sort_values(by='B')
Out[212]:
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
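Because the frame above holds random numbers, the three sorting calls are easier to check on a small deterministic frame. The sketch below uses made-up labels and values (frame, r1 to r3, and the numbers are assumptions for illustration only). Each call returns a new sorted object and leaves the original unchanged.

```python
import pandas as pd

# Rows and columns are deliberately out of order.
frame = pd.DataFrame(
    {"B": [3, 1, 2], "A": [30, 10, 20]},
    index=["r2", "r3", "r1"],
)

# Sort rows by their index labels (axis=0 is the default).
by_row_labels = frame.sort_index()

# Sort columns by their labels.
by_col_labels = frame.sort_index(axis=1)

# Sort rows by the values in column "B", largest first.
by_values = frame.sort_values(by="B", ascending=False)

print(by_row_labels)
print(by_col_labels)
print(by_values)
```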
Selection
pandas recommends the four optimized data access methods .at, .iat, .loc and .iloc for selecting data.
Passing a single column label to [] selects that column and returns it as a Series.
In [226]: df['A']
Out[226]:
2013-01-01    0.911951
2013-01-02   -0.711591
2013-01-03   -0.041417
2013-01-04    1.222247
2013-01-05   -0.332192
2013-01-06   -0.724522
Freq: D, Name: A, dtype: float64
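Slicing with [] selects rows and returns a DataFrame (think of it as a slice of the original frame). A positional slice such as df[0:3] excludes the end position, while a label slice such as df["20130102":"20130103"] includes both endpoints.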
In [229]: df[0:3]
Out[229]:
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [230]: df["20130102":'20130103']
Out[230]:
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
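The difference between the two kinds of slice is easy to miss, so here is a minimal sketch that checks it on a frame with the same date index but made-up integer values (frame, by_position and by_label are hypothetical names).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(
    np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD")
)

# Integer slice: positions 0, 1 and 2; position 3 is excluded.
by_position = frame[0:3]

# Label slice: both the start date and the end date are included.
by_label = frame["20130102":"20130103"]

print(by_position.index)
print(by_label.index)
```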
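Selection by label, which some Chinese-language books call explicit indexing, uses loc. Note that loc is indexed with square brackets [] rather than called with parentheses ().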
In [238]: df.loc[dates[0]]
Out[238]:
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64
Select all rows for several columns by label
In [239]: df.loc[:,['A','B']]
Out[239]:
                   A         B
2013-01-01  0.911951  0.119077
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
2013-01-06 -0.724522 -1.238991
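Select a range of rows with a label slice together with specific columns; both endpoints of the label slice are included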
In [240]: df.loc["20130102":"20130104",["A","B"]]
Out[240]:
                   A         B
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681
Select a single row and specific columns; the result is a Series
In [241]: df.loc["20130102",["A","B"]]
Out[241]:
A   -0.711591
B    1.814327
Name: 2013-01-02 00:00:00, dtype: float64
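Finally, there are two ways to get a single value, that is, a scalar: loc with a row label and a column label, or at, which is the faster accessor for this case.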
In [242]: df.loc[dates[0],"A"]
Out[242]: 0.91195089904327

In [243]: df.at[dates[0],"A"]
Out[243]: 0.91195089904327
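The label-based forms can be summarized in one short sketch. It uses a small frame with known values (frame and the result names are hypothetical), so the shape and type of each result can be checked directly.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
frame = pd.DataFrame(
    np.arange(8.0).reshape(4, 2), index=dates, columns=["A", "B"]
)

# One row label: the row comes back as a Series indexed by column name.
row = frame.loc[dates[0]]

# Label slice plus a list of columns: a DataFrame, both dates included.
block = frame.loc["20130102":"20130103", ["A"]]

# Row label and column label: a scalar, via loc or the faster at.
value_loc = frame.loc[dates[0], "A"]
value_at = frame.at[dates[0], "A"]

print(type(row).__name__, block.shape, value_loc, value_at)
```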
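Selection by position, sometimes called implicit indexing, uses iloc
It is used much like loc and accepts a single integer, a list of integers, or a slice.
A single integer returns that row as a Series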
In [244]: df.iloc[0]
Out[244]:
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

In [245]: df.iloc[3]
Out[245]:
A    1.222247
B   -0.651681
C    1.764630
D    0.500507
Name: 2013-01-04 00:00:00, dtype: float64
Slices also work and behave as in Python: the start is included and the end is excluded
In [246]: df.iloc[3:5,0:2]
Out[246]:
                   A         B
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
You can also pass lists of integer positions, one list for the rows and one for the columns, separated by a comma
In [247]: df.iloc[[1,2,4],[0,2]]
Out[247]:
                   A         C
2013-01-02 -0.711591  0.859623
2013-01-03 -0.041417  1.037752
2013-01-05 -0.332192 -0.680254
A bare : selects everything along that axis
In [248]: df.iloc[1:3]
Out[248]:
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [249]: df.iloc[1:3,:]
Out[249]:
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [250]: df.iloc[:,1:3]
Out[250]:
                   B         C
2013-01-01  0.119077  2.244598
2013-01-02  1.814327  0.859623
2013-01-03 -1.158472  1.037752
2013-01-04 -0.651681  1.764630
2013-01-05 -1.424871 -0.680254
2013-01-06 -1.238991  0.223425
Passing two integer positions returns a scalar; iat is the faster accessor for this case.
In [251]: df.iloc[1,1]
Out[251]: 1.8143274155708045

In [252]: df.iat[1,1]
Out[252]: 1.8143274155708045
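The positional forms can be checked the same way. The sketch below fills a 6x4 frame with the integers 0 to 23 (frame and the result names are hypothetical), so the value at row i, column j is 4*i + j and every selection is easy to verify by hand.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Slices exclude the end: rows 3 and 4, columns 0 and 1.
rows_slice = frame.iloc[3:5, 0:2]

# Lists of positions: rows 1, 2, 4 and columns 0, 2.
picked = frame.iloc[[1, 2, 4], [0, 2]]

# A bare ":" keeps every row; only columns 1 and 2 are selected.
all_rows = frame.iloc[:, 1:3]

# Two integers give a scalar, via iloc or the faster iat.
scalar_iloc = frame.iloc[1, 1]
scalar_iat = frame.iat[1, 1]

print(rows_slice)
print(picked)
print(scalar_iloc, scalar_iat)
```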
Boolean indexing, which some books call mask selection