Action Recognition Paper Notes | TSM | TSM: Temporal Shift Module for Efficient Video Understanding
Lin, Ji, Chuang Gan, and Song Han. “TSM: Temporal Shift Module for Efficient Video Understanding.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019.
Motivations
- The Temporal Shift Module (TSM) can achieve the performance of 3D CNNs while maintaining the complexity of 2D CNNs.
- Address shift, a hardware-friendly primitive, has also been exploited for compact 2D CNN design in image recognition tasks:
Chen, Weijie, et al. “All you need is a few shifts: Designing efficient convolutional neural networks for image classification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Wu, Bichen, et al. “Shift: A zero flop, zero parameter alternative to spatial convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
He, Yihui, et al. “Addressnet: Shift-based primitives for efficient convolutional neural networks.” 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.
Solutions
- partial shift:
- residual shift:
- Online video understanding
1/8 of the feature maps (red) of each residual block are cached in memory.
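The partial shift and residual shift ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the `(T, C, H, W)` layout, and `fold_div=8` (1/8 of the channels per direction) are assumptions made for the example, and `conv` stands in for whatever 2D convolution the residual branch uses.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Bidirectional partial shift on x of shape (T, C, H, W).

    The first C // fold_div channels take their values from frame t+1,
    the next C // fold_div channels take them from frame t-1, and the
    remaining channels are left untouched. Positions without a source
    frame are zero-padded.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # values from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # values from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out

def residual_shift_block(x, conv, fold_div=8):
    """Residual shift: the shift lives only in the residual branch,
    so the identity path keeps the unshifted features of each frame."""
    return x + conv(temporal_shift(x, fold_div))

# Toy input: 4 frames, 8 channels, 1x1 maps, with x[t, c] = 8 * t + c.
x = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
y = temporal_shift(x)
print(y[0, 0, 0, 0])  # channel 0 of frame 0 now holds frame 1's value: 8.0
print(y[3, 0, 0, 0])  # the last frame has no successor, so it is zero: 0.0

# Identity "conv" (hypothetical placeholder) just to show the residual form.
z = residual_shift_block(x, conv=lambda a: a)
```

Because the identity path adds `x` back, the information of the current frame in the shifted channels is still available after the block, which is the point of placing the shift inside the residual branch.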
Experiments
- Something-Something-V1 SOTA
Compared to 2D CNNs: TSN and TRN (late temporal fusion) have low FLOPs and parameter counts but lack temporal modeling.
Zhou, Bolei, et al. “Temporal relational reasoning in videos.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Compared to early 2D -> late 3D: ECO (mid-level temporal fusion).
Compared to Non-local I3D + GCN (all-level temporal fusion): the GCN needs a Region Proposal Network [34] trained on the MSCOCO object detection dataset [30], which makes the comparison unfair.
The TVL1 optical flow algorithm [55] is implemented in OpenCV with CUDA.
- Cost vs. Accuracy
- Online Recognition with TSM
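For the online setting, only past frames are available, so the shift is uni-directional: the first 1/8 of the feature maps of a residual block is cached and swapped into the next frame. The sketch below shows this for a single block. The names `online_shift` and `fold_div` and the `(C, H, W)` layout are assumptions for illustration; a real model would keep one such cache per residual block.

```python
import numpy as np

def online_shift(frame_feat, cache, fold_div=8):
    """Uni-directional shift for one frame.

    frame_feat: (C, H, W) features of the current frame.
    cache: (C // fold_div, H, W) features saved from the previous
           frame, or None for the first frame of the stream.
    Returns the shifted features and the cache for the next frame.
    """
    fold = frame_feat.shape[0] // fold_div
    new_cache = frame_feat[:fold].copy()          # save before overwriting
    out = frame_feat.copy()
    out[:fold] = 0 if cache is None else cache    # zero-pad on the first frame
    return out, new_cache

# Toy stream: frame t is filled with the value t + 1.
frames = [np.full((8, 2, 2), float(t + 1)) for t in range(3)]
cache, outs = None, []
for feat in frames:
    out, cache = online_shift(feat, cache)
    outs.append(out)

print(outs[1][0, 0, 0])  # channel 0 of frame 1 comes from frame 0: 1.0
print(outs[1][1, 0, 0])  # the other channels keep frame 1's own value: 2.0
```

Only `C // fold_div` feature maps per block are kept between frames, so the extra memory stays small compared with the full activations.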
English Expression
- There are works to trade off (balance) between temporal modeling and computation, such as post-hoc fusion [13, 9, 58, 7] and mid-level temporal fusion [61, 53, 46].
- Data movement increases the memory footprint (memory usage) and inference latency on hardware (hardware delay). Worse still, such effect is exacerbated (worsened) in the video understanding networks due to large activation size (5D tensor).
Advantages and Drawbacks
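- The shift operation adds no computational cost. While keeping the 2D convolutions, it exchanges part of the channels among frames within the same temporal window. The shift kernel is a translation filter.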
- The experiments on different mobile devices are also thorough.
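- The part that is shifted out is discarded and the vacated part is zero-padded. What if useful channels of the current frame are shifted away?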
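- Shift in 2D convolution is just a special case of depthwise convolution. Compared with a fully connected (dense) operation, it saves parameters on the same order of magnitude, and performance does not drop much. Compared with ordinary depthwise convolution, shift has another advantage: its run time does not depend on the kernel size, whereas the parameter count of an ordinary depthwise convolution is proportional to the square of the kernel side length; shift, however, can only be used with 3x3 convolution. Hikvision has a paper, "All you need is a few shifts", that proposes a sparsified shift operation (the loss function penalizes useless shifts). It visualizes all the shift operations in one block and counts the proportion of shifts in each direction; channels that stay unshifted are the most common.
Difference between 2D shift and temporal shift: the 2D shift works in the spatial domain, shifting within a single feature map by means of a translation kernel; the temporal shift takes one channel, i.e. one whole feature map, and moves it into the same-layer channel at another time step.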
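The difference between the two kinds of shift can be made concrete with a small NumPy sketch. Both helper functions are hypothetical names written for this note, and both assume zero padding and offsets smaller than the shifted axis.

```python
import numpy as np

def spatial_shift(fmap, dy, dx):
    """2D shift: move the pixels of ONE feature map (H, W) by (dy, dx)
    inside that map. Assumes |dy| < H and |dx| < W."""
    h, w = fmap.shape
    out = np.zeros_like(fmap)
    dst = (slice(max(0, dy), min(h, h + dy)), slice(max(0, dx), min(w, w + dx)))
    src = (slice(max(0, -dy), min(h, h - dy)), slice(max(0, -dx), min(w, w - dx)))
    out[dst] = fmap[src]
    return out

def temporal_shift_channel(x, c, dt):
    """Temporal shift: move the WHOLE feature map of channel c by dt
    time steps in x of shape (T, C, H, W). Assumes |dt| < T."""
    t = x.shape[0]
    out = x.copy()
    out[:, c] = 0
    out[max(0, dt):min(t, t + dt), c] = x[max(0, -dt):min(t, t - dt), c]
    return out

fmap = np.arange(9).reshape(3, 3)
print(spatial_shift(fmap, 0, 1))
# [[0 0 1]
#  [0 3 4]
#  [0 6 7]]

x = np.arange(2 * 1 * 2 * 2).reshape(2, 1, 2, 2)
xs = temporal_shift_channel(x, c=0, dt=1)
# xs[1, 0] is now the untouched 2x2 map of frame 0; xs[0, 0] is all zeros.
```

In the spatial case the content of a map is rearranged; in the temporal case the map is left intact and only its time index changes.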
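- The figure in the paper seems misleading: $y_t$ is not the final prediction, and surprisingly the paper gives no explanation of $y_t$. The figure below is the right way to read it: one prediction is given every N frames.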