- CVPR 2020
- https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_Spatio-Temporal_Graph_for_Video_Captioning_With_Knowledge_Distillation_CVPR_2020_paper.pdf
- spatio-temporal graph model for video captioning that exploits object interactions in space and time
- two-branch, knowledge distillation
2 Related Work
General Video Classification
- 3D conv
- two-stream, optical flow
- wider range
- SlowFast, multiple time scales, two pathways
- feature bank, long-term, correlated, short-term
- raw pixels, in contrast, objects within scenes
3
- two-branch, distill
- scene, 2D, resnet, 3D, I3D
- object features: \(N_T\) objects, each \(o_t^j\) has the same dimension
3.2 Spatio-Temporal Graph
- decompose our graph into two components: the spatial graph and the temporal graph
- Spatial: normalized Intersection over Union (IoU) value, explicitly
- Temporal: object transformations, semantic similarities, \(\cos\)
- imagine: # - % = $ x @ structure
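
A minimal sketch of the two graph components noted above, not the paper's implementation. Spatial edges come from pairwise IoU between boxes in one frame; temporal edges come from cosine similarity between object features in adjacent frames. Assumptions not in these notes: boxes are `(x1, y1, x2, y2)` tuples, "normalized" is taken to mean a row-wise softmax, and the helper names (`iou`, `cosine`, `row_softmax`, `spatial_graph`, `temporal_graph`) are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); box format is an assumption."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def row_softmax(rows):
    """Normalize each row so it sums to 1 (softmax is an assumed choice)."""
    out = []
    for row in rows:
        if not row:
            out.append([])
            continue
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def spatial_graph(boxes):
    """Adjacency among the objects of one frame, from normalized IoU."""
    return row_softmax([[iou(a, b) for b in boxes] for a in boxes])


def temporal_graph(feats_t, feats_next):
    """Adjacency from objects in frame t to objects in frame t+1, from cosine."""
    return row_softmax([[cosine(u, v) for v in feats_next] for u in feats_t])
```

With \(N_t\) objects in frame \(t\), `spatial_graph` returns an \(N_t \times N_t\) matrix and `temporal_graph` an \(N_t \times N_{t+1}\) matrix; in both, each row is a distribution over the neighbours of one object.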