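Problem: data processing can take a very long time, especially when a large dataset has to be handled in a for loop, as in Code 1. Accessing elements as data['trip_time'][i] is the main cost: this is chained indexing, which first selects the whole column and then the element, and pandas may also emit a SettingWithCopyWarning when it is used for assignment.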
Code 1
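import time
import pandas as pd

# Assumes `data` is an existing DataFrame with a default integer index.
t0 = time.time()
for i in range(len(data.index)):
    # Chained indexing: selects the column first, then the element (slow).
    data['trip_time'][i] = pd.Timestamp(data['lpep_dropoff_datetime'][i]) - pd.Timestamp(data['lpep_pickup_datetime'][i])
t1 = time.time()
print(t1 - t0)  # elapsed seconds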
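Solution: use the .at accessor to address each cell directly by row label and column name, i.e. data.at[i, 'trip_time']. Because .at is label-based, using the loop counter i as the row label assumes the DataFrame has the default integer index.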
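import time
import pandas as pd

# Assumes `data` is an existing DataFrame with a default integer index.
t0 = time.time()
for i in range(len(data.index)):
    # .at reads and writes a single cell by (row label, column name).
    data.at[i, 'trip_time'] = pd.Timestamp(data.at[i, 'lpep_dropoff_datetime']) - pd.Timestamp(data.at[i, 'lpep_pickup_datetime'])
t1 = time.time()
print(t1 - t0)  # elapsed seconds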
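The loop above depends on a DataFrame that is not shown. As a minimal sketch, the same .at pattern can be run on a tiny made-up table: the three rows of timestamps here are hypothetical sample values, and only the column names are taken from the post. The trip_time column is created up front with a timedelta value so that each assignment writes into a column of the matching type.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
data = pd.DataFrame({
    "lpep_pickup_datetime": [
        "2016-01-01 00:00:00",
        "2016-01-01 08:15:00",
        "2016-01-01 23:50:00",
    ],
    "lpep_dropoff_datetime": [
        "2016-01-01 00:12:30",
        "2016-01-01 08:45:00",
        "2016-01-02 00:05:00",
    ],
})

# Create the target column first so every .at write hits an existing
# timedelta column instead of creating one on the fly.
data["trip_time"] = pd.Timedelta(0)

# Iterate over the row labels and read/write one cell at a time with .at.
for i in data.index:
    dropoff = pd.Timestamp(data.at[i, "lpep_dropoff_datetime"])
    pickup = pd.Timestamp(data.at[i, "lpep_pickup_datetime"])
    data.at[i, "trip_time"] = dropoff - pickup

print(data["trip_time"])
```

Iterating over data.index rather than range(len(data.index)) keeps the loop correct even when the row labels are not 0, 1, 2, ..., since .at looks rows up by label.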
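Evaluation: in the timings below, indexing with at is roughly 450 times faster than loc, iloc, or ix (about 11.5 ms versus 25.3 µs per loop). Note that the loc, iloc, and ix lines assign a whole row while the at line assigns a single cell, so part of the gap comes from the amount of data moved. Also note that ix was deprecated and has been removed from pandas since version 1.0.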
%timeit outdf.loc[0] = indf.loc[0]
100 loops, best of 3: 11.7 ms per loop

%timeit outdf.iloc[0] = indf.iloc[0]
100 loops, best of 3: 11.4 ms per loop

%timeit outdf.ix[0] = indf.ix[0]
100 loops, best of 3: 11.6 ms per loop

%timeit outdf.at[0,'time'] = indf.at[0,'time']
10000 loops, best of 3: 25.3 µs per loop
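The %timeit lines above only run in IPython or Jupyter. A rough sketch of a similar measurement with the standard timeit module follows, here comparing a single-cell write through loc with the same write through at. The DataFrame, its size, and the helper function names are assumptions made for illustration, and the actual numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test frame; the 'time' column name follows the timings above.
df = pd.DataFrame({"time": range(1000)})


def set_with_loc():
    # Single-cell write through the general-purpose label indexer.
    df.loc[0, "time"] = 1


def set_with_at():
    # Single-cell write through the scalar accessor.
    df.at[0, "time"] = 1


runs = 1000
loc_seconds = timeit.timeit(set_with_loc, number=runs)
at_seconds = timeit.timeit(set_with_at, number=runs)

# Report the average cost of one call in microseconds.
print(f"loc: {loc_seconds / runs * 1e6:.1f} µs per call")
print(f"at:  {at_seconds / runs * 1e6:.1f} µs per call")
```

Writing the same single cell in both functions keeps the comparison like for like, so any remaining difference reflects the accessor overhead rather than the amount of data assigned.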