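Competition link: https://data.xm.gov.cn/contest-series-api/promote/register/3/UrnA69nb.
Problem description
Shared bikes extend the reach of urban public transport and solve the "last mile" problem for residents. However, as more and more residents accept the sharing-economy model and make it part of their travel habits, a tidal phenomenon has emerged. People work during the day and rest at night, and commuting is concentrated in narrow time windows, so the morning and evening peaks produce a supply-demand mismatch: "no bike to be found" and "nowhere to park". This problem asks participants to analyze the vehicle data, locate the tidal points on Xiamen Island during the morning peak, and then design a crowd-intelligence optimization scheme for peak hours that eases supply and demand at those points, so as to give city authorities and bike-sharing operators data support for planning further measures.
Tasks
Task 1: To better understand how the morning-peak tidal phenomenon changes and trends, participants analyze the data provided by the organizer and build a computational model to identify the 40 areas where the tidal phenomenon is most pronounced during the weekday morning peak (07:00-09:00), list the IDs and names of the shared-bike parking spots in each area, and describe the method and the model, as support for subsequent optimization measures.
Task 2: Based on the Top 40 areas from Task 1, participants design an optimization scheme for tidal points during peak hours: actively guide users who are parking to nearby parking spots, shaving peaks and filling valleys, to relieve congestion at tidal parking spots (such as metro entrances). Participants may bring their own training data, but the submission must state its source and how it is used, and the data must be legal and compliant. (Practitioners in urban public bicycle services call the peak-hour problem of bikes that "cannot be borrowed or returned" the "tidal" phenomenon. In this problem, the "tidal phenomenon" focuses on the "cannot be returned" side: identify the 40 areas where shared bikes pile up most during the morning peak.)
Code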
import os, codecs
import pandas as pd
import numpy as np
PATH = './data/'
# Shared-bike trajectory data, one file per day
bike_track = pd.concat([
    pd.read_csv(PATH + 'gxdc_gj20201221.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201222.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201223.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201224.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201225.csv')
])
# Sort by bike ID and time
bike_track = bike_track.sort_values(['BICYCLE_ID', 'LOCATING_TIME'])
import folium
# Draw the trajectory of one bike on a map centered on Xiamen
m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)
my_PolyLine = folium.PolyLine(
    locations=bike_track[bike_track['BICYCLE_ID'] == '000152773681a23a7f2d9af8e8902703'][['LATITUDE', 'LONGITUDE']].values,
    weight=5)
# add_child is the current name of the deprecated add_children
m.add_child(my_PolyLine)
def bike_fence_format(s):
    # Strip the brackets, split on commas, and reshape into 5 rows of [lon, lat]
    s = s.replace('[', '').replace(']', '').split(',')
    s = np.array(s).astype(float).reshape(5, -1)
    return s
# Shared-bike parking spot (electronic fence) data
bike_fence = pd.read_csv(PATH + 'gxdc_tcd.csv')
bike_fence['FENCE_LOC'] = bike_fence['FENCE_LOC'].apply(bike_fence_format)
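To make the parsing step concrete, here is a minimal sketch on a made-up fence string. The sample coordinates and the name `sample` are hypothetical, and the bracketed `[lon, lat]` layout is an assumption inferred from how the columns are used later in this post.

```python
import numpy as np

def bike_fence_format(s):
    # Strip the brackets, split on commas, and reshape into 5 rows of [lon, lat]
    s = s.replace('[', '').replace(']', '').split(',')
    return np.array(s).astype(float).reshape(5, -1)

# Hypothetical fence: four corners plus the first corner repeated at the end
sample = '[[118.10,24.50],[118.11,24.50],[118.11,24.51],[118.10,24.51],[118.10,24.50]]'
fence = bike_fence_format(sample)
print(fence.shape)  # (5, 2)
print(fence[0])     # first vertex as [lon, lat]
```

Note that `reshape(5, -1)` assumes exactly five vertices per fence; a fence with a different vertex count would raise an error or be reshaped incorrectly.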
# Mark the first vertex of the first 100 fences on the map
m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)
for data in bike_fence['FENCE_LOC'].values[:100]:
    # Vertices are stored as [lon, lat]; folium expects [lat, lon]
    folium.Marker(list(data[0, ::-1])).add_to(m)
m
# Shared-bike order data
bike_order = pd.read_csv(PATH + 'gxdc_dd.csv')
bike_order = bike_order.sort_values(['BICYCLE_ID', 'UPDATE_TIME'])
import folium
# Draw the order locations of one bike as a polyline
m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)
my_PolyLine = folium.PolyLine(
    locations=bike_order[bike_order['BICYCLE_ID'] == '0000ff105fd5f9099b866bccd157dc50'][['LATITUDE', 'LONGITUDE']].values,
    weight=5)
# add_child is the current name of the deprecated add_children
m.add_child(my_PolyLine)
# Rail station inbound passenger-flow data
# (the file has a .csv extension but is loaded with read_excel here)
rail_inflow = pd.read_excel(PATH + 'gdzdtjsj_jzkl.csv')
# Drop the first row
rail_inflow = rail_inflow.drop(0)
# Rail station outbound passenger-flow data
rail_outflow = pd.read_excel(PATH + 'gdzdtjsj_czkl.csv')
rail_outflow = rail_outflow.drop(0)
# Rail station gate device codes
rail_device = pd.read_excel(PATH + 'gdzdkltj_zjbh.csv')
rail_device.columns = [
    'LINE_NO', 'STATION_NO', 'STATION_NAME',
    'A_IN_MANCHINE', 'A_OUT_MANCHINE',
    'B_IN_MANCHINE', 'B_OUT_MANCHINE'
]
rail_device = rail_device.drop(0)
# Latitude range of each parking spot
bike_fence['MIN_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 1]))
bike_fence['MAX_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 1]))
# Longitude range of each parking spot
bike_fence['MIN_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 0]))
bike_fence['MAX_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 0]))
from geopy.distance import geodesic
# Size of each parking spot: the geodesic length (in meters) of the bounding-box
# diagonal. This is a length, used here as a proxy for area.
bike_fence['FENCE_AREA'] = bike_fence.apply(lambda x: geodesic(
    (x['MIN_LATITUDE'], x['MIN_LONGITUDE']), (x['MAX_LATITUDE'], x['MAX_LONGITUDE'])
).meters, axis=1)
# Center of each parking spot as [lat, lon]: mean of the vertices, last row excluded
bike_fence['FENCE_CENTER'] = bike_fence['FENCE_LOC'].apply(
    lambda x: np.mean(x[:-1, ::-1], 0)
)
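The bounding box, center, and size computed above can be checked by hand on a single made-up fence. This sketch swaps geopy's `geodesic` for a plain haversine formula on a spherical Earth (radius 6371 km, an assumption), so its result will differ slightly from `FENCE_AREA`; the vertices and the helper `haversine_m` are hypothetical.

```python
import math
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    # Great-circle distance in meters between two points given in degrees
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Hypothetical fence as rows of [lon, lat]; the last row repeats the first
fence = np.array([
    [118.1000, 24.5000],
    [118.1001, 24.5000],
    [118.1001, 24.5002],
    [118.1000, 24.5002],
    [118.1000, 24.5000],
])
min_lon, min_lat = fence.min(axis=0)
max_lon, max_lat = fence.max(axis=0)
# Length of the bounding-box diagonal, the quantity stored as FENCE_AREA
diagonal_m = haversine_m(min_lat, min_lon, max_lat, max_lon)
# Center as [lat, lon], excluding the repeated last vertex
center = fence[:-1, ::-1].mean(axis=0)
print(round(diagonal_m, 1), center)
```

Because the diagonal grows linearly with fence size while true area grows quadratically, dividing by it penalizes large fences less than dividing by a real area would.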
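Geohash encodes a latitude/longitude pair as a short string, and all points inside the same cell get the same string. Both the order locations and the fence centers are encoded at precision 6:
import geohash
bike_order['geohash'] = bike_order.apply(
    lambda x: geohash.encode(x['LATITUDE'], x['LONGITUDE'], precision=6),
    axis=1)
# FENCE_CENTER is stored as [lat, lon]
bike_fence['geohash'] = bike_fence['FENCE_CENTER'].apply(
    lambda x: geohash.encode(x[0], x[1], precision=6)
)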
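For readers who want to see what an encoder does internally, here is a small pure-Python sketch of the standard Geohash algorithm. The function name `geohash_encode` is hypothetical and is not the `geohash` package used in this post; it is meant only to show the bit interleaving.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

def geohash_encode(lat, lon, precision=6):
    # Repeatedly halve the longitude and latitude ranges, alternating between
    # them (longitude first), and pack every 5 bits into one base-32 character
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value, use_lon = 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = value * 2 + 1, mid
            else:
                value, lon_hi = value * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = value * 2 + 1, mid
            else:
                value, lat_hi = value * 2, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(BASE32[value])
            bits, value = 0, 0
    return ''.join(chars)

print(geohash_encode(42.605, -5.603, precision=5))  # ezs42
xiamen = geohash_encode(24.482426, 118.157606)
print(xiamen)
```

Each extra character narrows the cell by 5 bits, so precision 6 trades spatial detail for having enough orders per cell to count.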
# Inspect the orders that fall into one Geohash cell
bike_order[bike_order['geohash'] == 'ws7gx9']
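Regional flow and tidal statistics
With the coordinates matched to areas, the next step is to compute per-area flow, that is, the inflow and outflow within a given area at different times.
First extract the time fields from the order data: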
bike_order['UPDATE_TIME'] = pd.to_datetime(bike_order['UPDATE_TIME'])
# Day of month as a string, e.g. '21'
bike_order['DAY'] = bike_order['UPDATE_TIME'].dt.day.astype(str)
# Hour as a zero-padded two-character string, e.g. '07'
bike_order['HOUR'] = bike_order['UPDATE_TIME'].dt.hour.astype(str).str.zfill(2)
# Concatenate day and hour into one key, e.g. '2107'
bike_order['DAY_HOUR'] = bike_order['DAY'] + bike_order['HOUR']
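The time-key construction is easy to verify on a few made-up timestamps. The last step, keeping only hours `'07'` and `'08'`, is an assumption about how one might restrict the data to the 07:00-09:00 morning peak named in the task; the code in this post does not apply such a filter.

```python
import pandas as pd

# Hypothetical order timestamps
orders = pd.DataFrame({
    'UPDATE_TIME': ['2020-12-21 07:15:00', '2020-12-21 18:40:00', '2020-12-22 08:05:00']
})
orders['UPDATE_TIME'] = pd.to_datetime(orders['UPDATE_TIME'])
orders['DAY'] = orders['UPDATE_TIME'].dt.day.astype(str)
orders['HOUR'] = orders['UPDATE_TIME'].dt.hour.astype(str).str.zfill(2)
orders['DAY_HOUR'] = orders['DAY'] + orders['HOUR']
print(orders['DAY_HOUR'].tolist())  # ['2107', '2118', '2208']

# Assumed morning-peak filter: hours 07 and 08 cover 07:00-08:59
morning = orders[orders['HOUR'].isin(['07', '08'])]
print(len(morning))  # 2
```

Zero-padding the hour keeps the keys sortable as strings, which matters when they become pivot-table columns.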
Use pivot tables to count the inflow and outflow of each area at each time:
# Rows with LOCK_STATUS == 1 are counted as inflow
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
    values='LOCK_STATUS', index=['geohash'],
    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)
# Rows with LOCK_STATUS == 0 are counted as outflow
bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
    values='LOCK_STATUS', index=['geohash'],
    columns=['DAY_HOUR'], aggfunc='count', fill_value=0
)
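A toy table shows what these pivot tables contain. The cell labels `cell_a` and `cell_b` are hypothetical placeholders rather than real Geohash strings.

```python
import pandas as pd

# Hypothetical orders: two cells, two hourly slots
orders = pd.DataFrame({
    'geohash': ['cell_a', 'cell_a', 'cell_a', 'cell_b', 'cell_b'],
    'DAY_HOUR': ['2107', '2107', '2108', '2107', '2108'],
    'LOCK_STATUS': [1, 1, 0, 0, 1],
})
# One row per cell, one column per time slot, values are order counts
inflow = pd.pivot_table(orders[orders['LOCK_STATUS'] == 1],
                        values='LOCK_STATUS', index=['geohash'],
                        columns=['DAY_HOUR'], aggfunc='count', fill_value=0)
outflow = pd.pivot_table(orders[orders['LOCK_STATUS'] == 0],
                         values='LOCK_STATUS', index=['geohash'],
                         columns=['DAY_HOUR'], aggfunc='count', fill_value=0)
print(inflow)
print(outflow)
```

A cell or time slot that never appears in the filtered rows is absent from the table altogether, so the inflow and outflow tables do not necessarily share the same index.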
import matplotlib.pyplot as plt
# Hourly inflow and outflow of the cell 'wsk593'
plt.figure()
bike_inflow.loc['wsk593'].plot()
bike_outflow.loc['wsk593'].plot()
plt.xticks(list(range(bike_inflow.shape[1])), bike_inflow.columns, rotation=40)
plt.legend(['Inflow', 'Outflow'])
# Hourly inflow and outflow of the cell 'wsk52r', on a separate figure
plt.figure()
bike_inflow.loc['wsk52r'].plot()
bike_outflow.loc['wsk52r'].plot()
plt.xticks(list(range(bike_inflow.shape[1])), bike_inflow.columns, rotation=40)
plt.legend(['Inflow', 'Outflow'])
Method 1: computing tides by Geohash matching
Because the task is about the tidal phenomenon during the weekday morning peak, we can aggregate the bike flow by day:
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
    values='LOCK_STATUS', index=['geohash'],
    columns=['DAY'], aggfunc='count', fill_value=0
)
bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
    values='LOCK_STATUS', index=['geohash'],
    columns=['DAY'], aggfunc='count', fill_value=0
)
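From the inflow and outflow we can compute the number of bikes retained at each location:
bike_remain = (bike_inflow - bike_outflow).fillna(0)
# Where more bikes left than arrived, clip the negative value to 0
bike_remain[bike_remain < 0] = 0
# Sum over the days (this ranks the cells the same way a per-day average would)
bike_remain = bike_remain.sum(1)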
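One detail of the subtraction is worth illustrating: when a cell appears in only one of the two tables, plain `-` yields NaN for it, and `fillna(0)` then discards its count. The sketch below compares that with `DataFrame.sub(..., fill_value=0)`, which treats a missing count as zero. The tables are made up, and whether this matters on the real data depends on how many cells have only inflow or only outflow, which is not checked here.

```python
import pandas as pd

# Hypothetical per-day counts; cell_a has no outflow rows, cell_c has no inflow rows
inflow = pd.DataFrame({'21': [5, 3], '22': [4, 1]}, index=['cell_a', 'cell_b'])
outflow = pd.DataFrame({'21': [2, 6], '22': [1, 0]}, index=['cell_b', 'cell_c'])

# Plain subtraction: cells missing on either side become NaN, then 0
plain = (inflow - outflow).fillna(0)
# Alternative: treat a missing count as 0 before subtracting
filled = inflow.sub(outflow, fill_value=0).fillna(0)

# Clip negatives to 0 and sum over days, as in the post
plain_remain = plain.clip(lower=0).sum(axis=1)
filled_remain = filled.clip(lower=0).sum(axis=1)
print(plain_remain)
print(filled_remain)
```

In this toy case `cell_a` keeps its 9 arrivals only in the second variant, which is exactly the kind of cell a "cannot be returned" analysis cares about.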
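This gives the streets with the most severe tidal effect; export the result and submit it for scoring:
# 993 streets in total
bike_fence['STREET'] = bike_fence['FENCE_ID'].apply(lambda x: x.split('_')[0])
# Density = retained bikes / total parking size of the street
bike_density = bike_fence.groupby(['STREET'])['geohash'].unique().apply(
    # .get(x, 0) treats a cell without any order as zero retained bikes
    lambda hs: np.sum([bike_remain.get(x, 0) for x in hs])
) / bike_fence.groupby(['STREET'])['FENCE_AREA'].sum()
# Sort by density in descending order
bike_density = bike_density.sort_values(ascending=False).reset_index()
bike_density.to_csv('./result.txt', index=False, sep='|')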
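The street-level aggregation can be traced on three made-up fences. The `FENCE_ID` values below are hypothetical and only assume that the street name is the part before the first underscore, as the split in the post implies.

```python
import pandas as pd

# Hypothetical fences: two on StreetA sharing one cell, one on StreetB
fences = pd.DataFrame({
    'FENCE_ID': ['StreetA_0_L_001', 'StreetA_0_L_002', 'StreetB_0_R_001'],
    'geohash': ['cell_a', 'cell_a', 'cell_b'],
    'FENCE_AREA': [10.0, 30.0, 20.0],
})
# Hypothetical retained-bike counts per cell
remain = pd.Series({'cell_a': 8, 'cell_b': 2})

fences['STREET'] = fences['FENCE_ID'].str.split('_').str[0]
# Count each cell once per street, then divide by the street's total fence size
retained = fences.groupby('STREET')['geohash'].unique().apply(
    lambda hs: sum(remain.get(h, 0) for h in hs))
size = fences.groupby('STREET')['FENCE_AREA'].sum()
density = (retained / size).sort_values(ascending=False)
print(density)
```

Using `unique()` avoids counting a cell twice when several fences of one street fall into it, but a cell shared by two different streets is still credited in full to both.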
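Method 2: computing tides by distance matching
Geohash-based counting has a drawback: the statistics are imprecise and can only be resolved to street level. This section instead matches by latitude/longitude distance: find the parking spot nearest to each order, then compute the tidal flow per parking spot.
For the distance computation we can use NearestNeighbors from sklearn; setting the metric to haversine makes the nearest-parking-spot search straightforward.
from sklearn.neighbors import NearestNeighbors
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
knn = NearestNeighbors(metric="haversine", n_jobs=-1, algorithm='brute')
# The haversine metric expects [lat, lon] in radians, so convert from degrees
knn.fit(np.radians(np.stack(bike_fence['FENCE_CENTER'].values)))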
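To show what a brute-force haversine search computes, here is a NumPy-only sketch that does not depend on sklearn. The helper `haversine_matrix` and all coordinates are hypothetical; the Earth radius of 6371 km used to convert the angular distance to meters is an assumption.

```python
import numpy as np

def haversine_matrix(a_deg, b_deg):
    # Pairwise great-circle distances, as angles in radians,
    # between rows of [lat, lon] given in degrees
    a = np.radians(a_deg)[:, None, :]
    b = np.radians(b_deg)[None, :, :]
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = np.sin(dlat / 2) ** 2 + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(h))

# Hypothetical fence centers and order locations as [lat, lon]
centers = np.array([[24.480, 118.150], [24.500, 118.100], [24.460, 118.080]])
orders = np.array([[24.481, 118.151], [24.461, 118.082]])

dist = haversine_matrix(orders, centers)  # shape (n_orders, n_centers)
nearest = dist.argmin(axis=1)             # index of the closest center per order
meters = dist.min(axis=1) * 6371000.0     # angular distance times Earth radius
print(nearest, meters.round(1))
```

The full distance matrix costs memory proportional to orders times fences, which is why an exact brute-force search becomes slow on the complete order table.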
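However, NearestNeighbors is very slow when used directly, and the full order data may take a long time. An alternative is approximate search with hnsw, which is much faster at the cost of some accuracy.
import hnswlib
# Approximate nearest-neighbor index over the fence centers
# ([lat, lon] in degrees, Euclidean 'l2' space)
p = hnswlib.Index(space='l2', dim=2)
p.init_index(max_elements=300000, ef_construction=1000, M=32)
p.set_ef(1024)
p.set_num_threads(14)
p.add_items(np.stack(bike_fence['FENCE_CENTER'].values))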
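One thing to keep in mind with `space='l2'`: Euclidean distance on raw degrees treats a degree of longitude like a degree of latitude, while near Xiamen (about 24.5° N) a degree of longitude is shorter by a factor of cos(latitude). A possible adjustment, which is not part of the original post, is to scale longitude before indexing. The sketch below uses only the standard library; the helper names and coordinates are made up.

```python
import math

LAT0 = 24.48  # reference latitude near Xiamen in degrees (assumption)

def project(lat, lon, lat0=LAT0):
    # Equirectangular scaling: shrink longitude so both axes have comparable length
    return (lat, lon * math.cos(math.radians(lat0)))

def sq_l2(p, q):
    # Squared Euclidean distance between two 2-D points
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

order = (24.4800, 118.1000)
fence_east = (24.4800, 118.10105)   # 0.00105 degrees to the east
fence_north = (24.4810, 118.1000)   # 0.00100 degrees to the north

# On raw degrees the northern fence looks closer
raw_pick = min([fence_east, fence_north], key=lambda f: sq_l2(order, f))
# After scaling longitude the eastern fence is closer, matching ground distance
scaled_pick = min([fence_east, fence_north],
                  key=lambda f: sq_l2(project(*order), project(*f)))
print(raw_pick == fence_north, scaled_pick == fence_east)
```

The two choices differ only when candidates are almost equidistant, so on real data the effect is likely limited to orders that sit between neighboring fences.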
Find the nearest parking spot for every order:
index, dist = p.knn_query(bike_order[['LATITUDE', 'LONGITUDE']].values, k=1)
Compute the tidal flow of every parking spot:
# Assign each order the FENCE_ID of its nearest parking spot
bike_order['fence'] = bike_fence.iloc[index.flatten()]['FENCE_ID'].values
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
    values='LOCK_STATUS', index=['fence'],
    columns=['DAY'], aggfunc='count', fill_value=0
)
bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
    values='LOCK_STATUS', index=['fence'],
    columns=['DAY'], aggfunc='count', fill_value=0
)
bike_remain = (bike_inflow - bike_outflow).fillna(0)
# Clip negative values (more bikes left than arrived) to 0
bike_remain[bike_remain < 0] = 0
# Sum over the days
bike_remain = bike_remain.sum(1)
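Compute the density of each parking spot:
# Retained bikes / parking spot size, aligned on FENCE_ID
bike_density = bike_remain / bike_fence.set_index('FENCE_ID')['FENCE_AREA']
bike_density = bike_density.sort_values(ascending=False).reset_index()
# Parking spots without any matched order have no density; set them to 0
bike_density = bike_density.fillna(0)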
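The division above relies on pandas aligning the two Series by their index. A short sketch with made-up fence IDs shows the alignment and how a ranked shortlist could then be taken; using `head(40)` for the task's Top 40 is an assumption about the final step, which this post does not show.

```python
import pandas as pd

# Hypothetical retained counts (fence_3 received no orders) and fence sizes
remain = pd.Series({'fence_1': 12, 'fence_2': 3})
size = pd.Series({'fence_1': 24.0, 'fence_2': 30.0, 'fence_3': 15.0})

# Division aligns on the index; fence_3 has no count and becomes NaN
density = remain / size
# NaN sorts last, then is replaced by 0
density = density.sort_values(ascending=False).fillna(0)
top = density.head(2)  # on the real data this would be head(40)
print(top)
```

Sorting before `fillna` keeps fences without data at the bottom of the ranking either way, since their density is 0 after filling.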
Final submission: