[Paper Reading, CVPR 2020, Object Tracking] Siamese Box Adaptive Network for Visual Tracking

Introduction

paper: Siamese Box Adaptive Network for Visual Tracking

code: hqucv/siamban

This paper's idea overlaps somewhat with SiamCAR: both observe that SiamRPN-family trackers depend on pre-set anchor box parameters, and tuning those parameters takes considerable effort. Motivated by this, the paper trains a fully convolutional network (FCN) end to end to directly regress the target's bounding box, with no anchors at all.


Main Content

[Figure: SiamBAN network architecture]

The figure above shows the SiamBAN network architecture. Unlike SiamRPN-family trackers, the model splits tracking into a classification task (the Cls Module branch outputs scores indicating whether each location is the target) and a regression task (the Reg Module branch outputs the offsets that locate the target's box).

Siamese Network Backbone

SiamBAN uses ResNet-50 as its backbone network, with the following modifications:

We remove the downsampling operations from the last two convolution blocks. In order to improve the receptive field, we use atrous convolution. In addition, inspired by multi-grid methods, we adopt different atrous rates in our model. Specifically, we set the stride to 1 in the conv4 and conv5 blocks, the atrous rate to 2 in the conv4 block, and the atrous rate to 4 in the conv5 block. We add a 1 × 1 convolution to reduce the output feature channels to 256, and use only the features of the template branch center 7 × 7 regions.
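
As a rough illustration, here is a minimal PyTorch sketch of this kind of modification using torchvision's ResNet-50. The actual hqucv/siamban repo has its own backbone code; `extract` and `reduce_conv` are names made up for this example, and only the conv5 output is reduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# torchvision can replace the stride-2 downsampling in conv4/conv5
# (layer3/layer4) with atrous convolution, matching the paper's
# stride-1, dilation-2 (conv4) and dilation-4 (conv5) setup.
backbone = resnet50(replace_stride_with_dilation=[False, True, True])

# 1x1 conv reducing the 2048-channel conv5 output to 256 channels.
reduce_conv = nn.Conv2d(2048, 256, kernel_size=1)

def extract(img):
    x = backbone.conv1(img)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    x = backbone.layer2(x)
    x = backbone.layer3(x)   # conv4: stride 1, dilation 2
    x = backbone.layer4(x)   # conv5: stride 1, dilation 4
    return reduce_conv(x)

zf = extract(torch.randn(1, 3, 127, 127))   # template branch features
c = zf.size(-1) // 2
zf = zf[:, :, c - 3:c + 4, c - 3:c + 4]      # keep only the center 7x7
```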

Box Adaptive Head

In the Cls Module output, each location holds a 2-dimensional foreground-background classification score; in the Reg Module output, each location holds a 4-dimensional vector of position offsets. In formulas:

$$
\begin{array}{l}
P_{w \times h \times 2}^{cls} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls} \\
P_{w \times h \times 4}^{reg} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}
\end{array}
$$

where $\star$ denotes the convolution operation with $[\varphi(z)]_{cls}$ or $[\varphi(z)]_{reg}$ as the convolution kernel, $P^{cls}_{w \times h \times 2}$ denotes the classification map, and $P^{reg}_{w \times h \times 4}$ denotes the regression map.
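
The $\star$ is implemented as a depthwise cross-correlation (as in SiamRPN++): each channel of the search features is correlated with the matching channel of the template features. A self-contained sketch follows; the 1 × 1 heads at the end are a simplification, since the repo's actual heads are deeper conv stacks.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(x, kernel):
    """Depthwise cross-correlation: treat the template features `kernel`
    as per-channel conv weights and slide them over the search features x."""
    batch, channel = kernel.size(0), kernel.size(1)
    x = x.view(1, batch * channel, x.size(2), x.size(3))
    kernel = kernel.view(batch * channel, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch * channel)
    return out.view(batch, channel, out.size(2), out.size(3))

xf = torch.randn(1, 256, 31, 31)     # [phi(x)], search branch features
zf = torch.randn(1, 256, 7, 7)       # [phi(z)], template center 7x7
resp = xcorr_depthwise(xf, zf)       # (1, 256, 25, 25) response map
cls_head = torch.nn.Conv2d(256, 2, kernel_size=1)
reg_head = torch.nn.Conv2d(256, 4, kernel_size=1)
p_cls = cls_head(resp)               # P^cls: (1, 2, 25, 25)
p_reg = reg_head(resp)               # P^reg: (1, 4, 25, 25)
```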

Meanwhile, a location $(i, j)$ on the resulting classification map and regression map is mapped back to the location $(p_i, p_j)$ in the input search region via:

$$
[p_i, p_j] = \left[ \left\lfloor \frac{w_{im}}{2} \right\rfloor + \left( i - \left\lfloor \frac{w}{2} \right\rfloor \right) \times s,\; \left\lfloor \frac{h_{im}}{2} \right\rfloor + \left( j - \left\lfloor \frac{h}{2} \right\rfloor \right) \times s \right]
$$

where $w_{im}$ and $h_{im}$ represent the width and height of the input search patch, and $s$ represents the total stride of the network.
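
As a sanity check, here is the mapping as a small function; $w_{im} = h_{im} = 255$ and total stride $s = 8$ (giving a 25 × 25 map) are assumed values typical of SiamBAN-style trackers, not something stated in this section.

```python
def map_to_search(i, j, w=25, h=25, w_im=255, h_im=255, s=8):
    """Map a score-map cell (i, j) to its pixel location (p_i, p_j)
    in the search patch, per the formula above."""
    p_i = w_im // 2 + (i - w // 2) * s
    p_j = h_im // 2 + (j - h // 2) * s
    return p_i, p_j

map_to_search(12, 12)   # -> (127, 127), the center of the search patch
map_to_search(0, 0)     # -> (31, 31)
```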

Multi-level Prediction

Features from shallow layers are better suited to localizing the target, while features from deep layers carry richer semantics, which makes the model more robust to changes in the target's appearance.

The paper therefore runs prediction heads on the features of the last three stages (conv3 to conv5) of ResNet-50 separately, then "fuses" the resulting maps with weights, described by:

$$
P_{w \times h \times 2}^{cls\text{-}all} = \sum_{l=3}^{5} \alpha_{l} P_{l}^{cls}
$$

$$
P_{w \times h \times 4}^{reg\text{-}all} = \sum_{l=3}^{5} \beta_{l} P_{l}^{reg}
$$

where $\alpha_l$ and $\beta_l$ are the weights corresponding to each map and are optimized together with the network.
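
A sketch of this fusion as a module with learnable per-level weights; normalizing them with a softmax is an implementation choice assumed here, since the text only says the weights are optimized together with the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse per-level maps with learnable scalar weights (alpha_l or beta_l)."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_levels))

    def forward(self, maps):               # maps: list of (B, C, H, W) tensors
        w = F.softmax(self.weight, dim=0)  # keep the weights positive, sum to 1
        return sum(wi * m for wi, m in zip(w, maps))

fuse_cls = WeightedFusion()                # alpha_3 .. alpha_5
fuse_reg = WeightedFusion()                # beta_3 .. beta_5
p_cls_all = fuse_cls([torch.randn(1, 2, 25, 25) for _ in range(3)])
```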

Ground-truth and Loss

In this part the author mainly borrows ideas from anchor-free object detection algorithms; for more background, see 目标检测:Anchor-Free时代.

The author denotes the ground-truth bounding box's width, height, top-left corner, center, and bottom-right corner as $g_w$, $g_h$, $(g_{x_1}, g_{y_1})$, $(g_{x_c}, g_{y_c})$, and $(g_{x_2}, g_{y_2})$ respectively.

Two ellipses $E_1$ and $E_2$ are then defined:

$$
E_1:\; \frac{(p_i - g_{x_c})^2}{\left(\frac{g_w}{2}\right)^2} + \frac{(p_j - g_{y_c})^2}{\left(\frac{g_h}{2}\right)^2} = 1
$$

$$
E_2:\; \frac{(p_i - g_{x_c})^2}{\left(\frac{g_w}{4}\right)^2} + \frac{(p_j - g_{y_c})^2}{\left(\frac{g_h}{4}\right)^2} = 1
$$

If a location $(p_i, p_j)$ falls inside $E_2$, it is labeled as a positive sample; if it falls outside $E_1$, it is labeled as a negative sample; if it falls between $E_1$ and $E_2$, it is ignored.
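
A minimal sketch of this ellipse assignment for a single location; the dict `g` is a hypothetical container for the ground-truth geometry.

```python
def assign_label(p_i, p_j, g):
    """Ellipse-based label for one location. `g` holds the ground-truth
    center ('xc', 'yc') and size ('w', 'h')."""
    e1 = ((p_i - g['xc']) / (g['w'] / 2)) ** 2 + ((p_j - g['yc']) / (g['h'] / 2)) ** 2
    e2 = ((p_i - g['xc']) / (g['w'] / 4)) ** 2 + ((p_j - g['yc']) / (g['h'] / 4)) ** 2
    if e2 < 1:
        return 1    # inside E2: positive
    if e1 > 1:
        return 0    # outside E1: negative
    return -1       # between E1 and E2: ignored
```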

The bounding-box regression labels are computed with:

$$
\begin{array}{l}
d_l = p_i - g_{x_1} \\
d_t = p_j - g_{y_1} \\
d_r = g_{x_2} - p_i \\
d_b = g_{y_2} - p_j
\end{array}
$$

where $d_l$, $d_t$, $d_r$, $d_b$ are the distances from the location to the four sides of the ground-truth bounding box.
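
And, for positive locations, the regression targets in code (same hypothetical `g` dict, extended with the corner coordinates):

```python
def regression_targets(p_i, p_j, g):
    """Distances from (p_i, p_j) to the four sides of the ground-truth box;
    `g` additionally holds the corners ('x1', 'y1', 'x2', 'y2')."""
    return (p_i - g['x1'],   # d_l
            p_j - g['y1'],   # d_t
            g['x2'] - p_i,   # d_r
            g['y2'] - p_j)   # d_b
```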

The final loss has two parts: a cross-entropy loss for classification and an IoU loss for box regression:

$$
L = \lambda_1 L_{cls} + \lambda_2 L_{reg}
$$

$$
L_{IoU} = 1 - IoU
$$
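
A sketch of an IoU loss on this offset parameterization (in the style of UnitBox/FCOS; the exact form in the repo may differ). Since predicted and ground-truth boxes are anchored at the same location, the intersection follows from taking the per-side minimum of the distances.

```python
import torch

def iou_loss(pred, target, eps=1e-6):
    """1 - IoU on (d_l, d_t, d_r, d_b) offsets, both of shape (N, 4)."""
    pl, pt, pr, pb = pred.unbind(dim=1)
    tl, tt, tr, tb = target.unbind(dim=1)
    area_p = (pl + pr) * (pt + pb)
    area_t = (tl + tr) * (tt + tb)
    # Both boxes share the location, so overlap is min-ed side by side.
    iw = (torch.min(pl, tl) + torch.min(pr, tr)).clamp(min=0)
    ih = (torch.min(pt, tt) + torch.min(pb, tb)).clamp(min=0)
    inter = iw * ih
    union = area_p + area_t - inter
    return (1 - inter / (union + eps)).mean()
```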

Training and Inference

Training uses the ImageNet VID, YouTube-BoundingBoxes, COCO, ImageNet DET, GOT10k, and LaSOT datasets.

At inference time, the target's box is computed from:

$$
\begin{array}{l}
p_{x_1} = p_i - d_l^{reg} \\
p_{y_1} = p_j - d_t^{reg} \\
p_{x_2} = p_i + d_r^{reg} \\
p_{y_2} = p_j + d_b^{reg}
\end{array}
$$

where $d_l^{reg}$, $d_t^{reg}$, $d_r^{reg}$, and $d_b^{reg}$ denote the prediction values of the regression map, and $(p_{x_1}, p_{y_1})$ and $(p_{x_2}, p_{y_2})$ are the top-left and bottom-right corners of the predicted box.
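
Decoding is just the inverse of the target computation; a sketch for a single location:

```python
def decode_box(p_i, p_j, d_l, d_t, d_r, d_b):
    """Recover box corners from the predicted offsets at (p_i, p_j)."""
    return (p_i - d_l, p_j - d_t), (p_i + d_r, p_j + d_b)

top_left, bottom_right = decode_box(127, 127, 30.0, 40.0, 30.0, 40.0)
# -> (97.0, 87.0), (157.0, 167.0)
```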

Experimental Results

[Figure: experimental results]

Summary

Overall, the paper's biggest selling point is predicting the bounding box by direct regression, which does away with SiamRPN's complicated anchor hyperparameter setup. You can also clearly see the influence of object detection algorithms in this paper; detection papers are well worth reading and learning from!
