Pandas进阶大神!从0到100你只差这篇文章!(二)

2. 查询


查看每列的数据类型


# 查看df数据类型df.dtypesdate       objectopen        float64close        float64high         float64low          float64volume    float64code       objectdtype: object


查看指定指定数量的行:head函数默认查看前5行,tail函数默认查看后5行,可以传递指定的数值用于查看指定行数。


查看前5行df.head()date    open    close   high    low volume  code0   2015-12-23  9.927   9.935   10.174  9.871   1039018.0   0000011   2015-12-24  9.919   9.823   9.998   9.744   640229.0    0000012   2015-12-25  9.855   9.879   9.927   9.815   399845.0    0000013   2015-12-28  9.895   9.537   9.919   9.537   822408.0    0000014   2015-12-29  9.545   9.624   9.632   9.529   619802.0    000001# 查看后5行df.tail()date    open    close   high    low volume  code636 2018-08-01  9.42    9.15    9.50    9.11    814081.0    000001637 2018-08-02  9.13    8.94    9.15    8.88    931401.0    000001638 2018-08-03  8.93    8.91    9.10    8.91    476546.0    000001639 2018-08-06  8.94    8.94    9.11    8.89    554010.0    000001640 2018-08-07  8.96    9.17    9.17    8.88    690423.0    000001# 查看前10行df.head(10)date    open    close   high    low volume  code0   2015-12-23  9.927   9.935   10.174  9.871   1039018.0   0000011   2015-12-24  9.919   9.823   9.998   9.744   640229.0    0000012   2015-12-25  9.855   9.879   9.927   9.815   399845.0    0000013   2015-12-28  9.895   9.537   9.919   9.537   822408.0    0000014   2015-12-29  9.545   9.624   9.632   9.529   619802.0    0000015   2015-12-30  9.624   9.632   9.640   9.513   532667.0    0000016   2015-12-31  9.632   9.545   9.656   9.537   491258.0    0000017   2016-01-04  9.553   8.995   9.577   8.940   563497.0    0000018   2016-01-05  8.972   9.075   9.210   8.876   663269.0    0000019   2016-01-06  9.091   9.179   9.202   9.067   515706.0    000001


查看某一行或多行,某一列或多列


# 查看第一行df[0:1]    date    open    close   high    low volume  code0   2015-12-23  9.927   9.935   10.174  9.871   1039018.0   000001
# 查看 10到20行df[10:21]    date    open    close   high    low volume  code10  2016-01-07  9.083   8.709   9.083   8.685   174761.0    00000111  2016-01-08  8.924   8.852   8.987   8.677   747527.0    00000112  2016-01-11  8.757   8.566   8.820   8.502   732013.0    00000113  2016-01-12  8.621   8.605   8.685   8.470   561642.0    00000114  2016-01-13  8.669   8.526   8.709   8.518   391709.0    00000115  2016-01-14  8.430   8.574   8.597   8.343   666314.0    00000116  2016-01-15  8.486   8.327   8.597   8.295   448202.0    00000117  2016-01-18  8.231   8.287   8.406   8.199   421040.0    00000118  2016-01-19  8.319   8.526   8.582   8.287   501109.0    00000119  2016-01-20  8.518   8.390   8.597   8.311   603752.0    00000120  2016-01-21  8.343   8.215   8.558   8.215   606145.0    000001
# 查看看Date列前5个数据df["date"].head() # 或者df.date.head()0    2015-12-231    2015-12-242    2015-12-253    2015-12-284    2015-12-29Name: date, dtype: object
# 查看看Date列,code列, open列前5个数据df[["date","code", "open"]].head()    date    code    open0   2015-12-23  000001  9.9271   2015-12-24  000001  9.9192   2015-12-25  000001  9.8553   2015-12-28  000001  9.8954   2015-12-29  000001  9.545


使用行列组合条件查询


# 查看date, code列的第10行df.loc[10, ["date", "code"]]
date    2016-01-07code        000001Name: 10, dtype: object# 查看date, code列的第10行到20行df.loc[10:20, ["date", "code"]]
    date    code10  2016-01-07  00000111  2016-01-08  00000112  2016-01-11  00000113  2016-01-12  00000114  2016-01-13  00000115  2016-01-14  00000116  2016-01-15  00000117  2016-01-18  00000118  2016-01-19  00000119  2016-01-20  00000120  2016-01-21  000001
# 查看第一行,open列的数据df.loc[0, "open"]9.9269999999999996


通过位置查询:值得注意的是上面的索引值就是特定的位置。


# 查看第1行()df.iloc[0]date      2015-12-24open           9.919close          9.823high           9.998low            9.744volume        640229code          000001Name: 0, dtype: object# 查看最后一行df.iloc[-1]date      2018-08-08open            9.16close           9.12high            9.16low              9.1volume         29985code          000001Name: 640, dtype: object# 查看第一列,前5个数值df.iloc[:,0].head()0    2015-12-241    2015-12-252    2015-12-283    2015-12-294    2015-12-30Name: date, dtype: object
# 查看前2到4行,第1,3列df.iloc[2:4,[0,2]]
date    close2   2015-12-28  9.5373   2015-12-29  9.624


通过条件筛选:


查看open列大于10的前5行df[df.open > 10].head()
    date    open    close   high    low volume  code378 2017-07-14  10.483  10.570  10.609  10.337  1722570.0   000001379 2017-07-17  10.619  10.483  10.987  10.396  3273123.0   000001380 2017-07-18  10.425  10.716  10.803  10.299  2349431.0   000001381 2017-07-19  10.657  10.754  10.851  10.551  1933075.0   000001382 2017-07-20  10.745  10.638  10.880  10.580  1537338.0   000001
# 查看open列大于10且open列小于10.6的前五行df[(df.open > 10) & (df.open < 10.6)].head()    date    open    close   high    low volume  code378 2017-07-14  10.483  10.570  10.609  10.337  1722570.0   000001380 2017-07-18  10.425  10.716  10.803  10.299  2349431.0   000001387 2017-07-27  10.550  10.422  10.599  10.363  1194490.0   000001388 2017-07-28  10.441  10.569  10.638  10.412  819195.0    000001390 2017-08-01  10.471  10.865  10.904  10.432  2035709.0   000001 
# 查看open列大于10或open列小于10.6的前五行df[(df.open > 10) | (df.open < 10.6)].head()    date    open    close   high    low volume  code0   2015-12-24  9.919   9.823   9.998   9.744   640229.0    0000011   2015-12-25  9.855   9.879   9.927   9.815   399845.0    0000012   2015-12-28  9.895   9.537   9.919   9.537   822408.0    0000013   2015-12-29  9.545   9.624   9.632   9.529   619802.0    0000014   2015-12-30  9.624   9.632   9.640   9.513   532667.0    000001


3. 增加


在前面已经简单的说明Series, DataFrame的创建,这里说一些常用有用的创建方式。


# 创建2018-08-08到2018-08-15的时间序列,默认时间间隔为Days2 = pd.date_range("20180808", periods=7)print(s2)
DatetimeIndex(['2018-08-08', '2018-08-09', '2018-08-10', '2018-08-11',               '2018-08-12', '2018-08-13', '2018-08-14'],                                              dtype='datetime64[ns]', freq='D')# 指定2018-08-08 00:00 到2018-08-09 00:00 时间间隔为小时# freq参数可使用参数, 参考: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases s3 = pd.date_range("20180808", "20180809", freq="H")print(s2)
DatetimeIndex(['2018-08-08 00:00:00', '2018-08-08 01:00:00',               '2018-08-08 02:00:00', '2018-08-08 03:00:00',               '2018-08-08 04:00:00', '2018-08-08 05:00:00',               '2018-08-08 06:00:00', '2018-08-08 07:00:00',               '2018-08-08 08:00:00', '2018-08-08 09:00:00',               '2018-08-08 10:00:00', '2018-08-08 11:00:00',               '2018-08-08 12:00:00', '2018-08-08 13:00:00',               '2018-08-08 14:00:00', '2018-08-08 15:00:00',               '2018-08-08 16:00:00', '2018-08-08 17:00:00',               '2018-08-08 18:00:00', '2018-08-08 19:00:00',               '2018-08-08 20:00:00', '2018-08-08 21:00:00',               '2018-08-08 22:00:00', '2018-08-08 23:00:00',               '2018-08-09 00:00:00'],               dtype='datetime64[ns]', freq='H')# 通过已有序列创建时间序列s4 = pd.to_datetime(df.date.head())print(s4)
0   2015-12-241   2015-12-252   2015-12-283   2015-12-294   2015-12-30Name: date, dtype: datetime64[ns]


4. 修改


# 将df 的索引修改为date列的数据,并且将类型转换为datetime类型df.index = pd.to_datetime(df.date)df.head()
    date    open    close   high    low volume  code     date 2015-12-24  2015-12-24  9.919   9.823   9.998   9.744   640229.0    0000012015-12-25  2015-12-25  9.855   9.879   9.927   9.815   399845.0    0000012015-12-28  2015-12-28  9.895   9.537   9.919   9.537   822408.0    0000012015-12-29  2015-12-29  9.545   9.624   9.632   9.529   619802.0    0000012015-12-30  2015-12-30  9.624   9.632   9.640   9.513   532667.0    000001# 修改列的字段df.columns = ["Date", "Open","Close","High","Low","Volume","Code"]print(df.head())
 Date   Open  Close   High    Low    Volume    Code     date2015-12-24  2015-12-24  9.919  9.823  9.998  9.744   640229.0  0000012015-12-25  2015-12-25  9.855  9.879  9.927  9.815   399845.0  0000012015-12-28  2015-12-28  9.895  9.537  9.919  9.537  822408.0  0000012015-12-29  2015-12-29  9.545  9.624  9.632  9.529  619802.0  0000012015-12-30  2015-12-30  9.624  9.632  9.640  9.513  532667.0  000001# 将Open列每个数值加1, apply方法并不直接修改源数据,所以需要将新值复制给dfdf.Open = df.Open.apply(lambda x: x+1)df.head()
  Date    Open    Close   High    Low Volume   Code    date2015-12-24  2015-12-24  10.919  9.823   9.998   9.744   640229.0    0000012015-12-25  2015-12-25  10.855  9.879   9.927   9.815   399845.0    0000012015-12-28  2015-12-28  10.895  9.537   9.919   9.537   822408.0    0000012015-12-29  2015-12-29  10.545  9.624   9.632   9.529   619802.0    0000012015-12-30  2015-12-30  10.624  9.632   9.640   9.513   532667.0    000001# 将Open,Close列都数值上加1,如果多列,apply接收的对象是整个列df[["Open", "Close"]].head().apply(lambda x: x.apply(lambda x: x+1))
            Open    Closedate        2015-12-24  11.919  10.8232015-12-25  11.855  10.8792015-12-28  11.895  10.5372015-12-29  11.545  10.6242015-12-30  11.624  10.632


5. 删除


通过drop方法drop指定的行或者列。


注意: drop方法并不直接修改源数据,如果需要使源dataframe对象被修改,需要传入inplace=True,通过之前的axis图解,知道行的值(或者说label)在axis=0,列的值(或者说label)在axis=1。


# 删除指定列,删除Open列df.drop("Open", axis=1).head() #或者df.drop(df.columns[1]) 
   Date    Close   High      Low Volume     Code       date        
2015-12-24  2015-12-24  9.823   9.998   9.744   640229.0    0000012015-12-25  2015-12-25  9.879   9.927   9.815   399845.0    0000012015-12-28  2015-12-28  9.537   9.919   9.537   822408.0    0000012015-12-29  2015-12-29  9.624   9.632   9.529   619802.0    0000012015-12-30  2015-12-30  9.632   9.640   9.513   532667.0    000001# 删除第1,3列. 即Open,High列df.drop(df.columns[[1,3]], axis=1).head() # 或df.drop(["Open", "High], axis=1).head()        Date    Close      Low Volume       Code         date 2015-12-24  2015-12-24  9.823   9.744   640229.0    000001 2015-12-25  2015-12-25  9.879   9.815   399845.0    000001 2015-12-28  2015-12-28  9.537   9.537   822408.0    000001 2015-12-29  2015-12-29  9.624   9.529   619802.0    000001 2015-12-30  2015-12-30  9.632   9.513   532667.0    000001


上一篇:分享3个提升效率的方法


下一篇:与苹果分享营收凸显谷歌移动业务软肋