Introduction
Paper: Siamese Box Adaptive Network for Visual Tracking
Code: hqucv/siamban
This paper's idea overlaps somewhat with SiamCAR: both observe that SiamRPN-series trackers require anchor bbox parameters to be set in advance, and tuning those parameters takes considerable effort. Motivated by this, the paper trains a fully convolutional network (FCN) to regress the target's bbox end to end.
Main Content
The figure above shows the network architecture of SiamBAN. Unlike SiamRPN-series trackers, the model divides tracking into a classification task (the Cls Module branch produces scores indicating the target) and a regression task (the Reg Module branch produces the coefficients describing the target's location).
Siamese Network Backbone
SiamBAN adopts ResNet-50 as its backbone network, with the following modifications:
We remove the downsampling operations from the last two convolution blocks. In order to improve the receptive field, we use atrous convolution. In addition, inspired by multi-grid methods, we adopt different atrous rates in our model. Specifically, we set the stride to 1 in the conv4 and conv5 blocks, the atrous rate to 2 in the conv4 block, and the atrous rate to 4 in the conv5 block. We add a $1 \times 1$ convolution to reduce the output feature channels to 256, and use only the features of the template branch's center $7 \times 7$ regions.
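These modifications map fairly directly onto torchvision's ResNet-50 options. Below is a minimal PyTorch sketch, not the authors' code: the `replace_stride_with_dilation` flags, the 256-channel 1×1 reductions, and the center-crop helper are my assumptions about one reasonable implementation (the paper's backbone differs in padding details).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Stride 1 with atrous rate 2 in conv4 (layer3) and rate 4 in conv5 (layer4):
# torchvision implements exactly this via replace_stride_with_dilation.
backbone = resnet50(replace_stride_with_dilation=[False, True, True])

# 1x1 convolutions reduce the conv3/conv4/conv5 outputs to 256 channels.
reduce = nn.ModuleList([
    nn.Conv2d(c, 256, kernel_size=1) for c in (512, 1024, 2048)
])

def extract_features(x):
    """Return the reduced conv3, conv4 and conv5 feature maps."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c3 = backbone.layer2(backbone.layer1(x))
    c4 = backbone.layer3(c3)
    c5 = backbone.layer4(c4)
    return [r(c) for r, c in zip(reduce, (c3, c4, c5))]

def center_crop7(f):
    """Keep only the center 7x7 region (applied to the template branch)."""
    h, w = f.shape[-2:]
    top, left = (h - 7) // 2, (w - 7) // 2
    return f[..., top:top + 7, left:left + 7]

# 127x127 is the usual template patch size in this line of trackers.
z_feats = [center_crop7(f) for f in extract_features(torch.randn(1, 3, 127, 127))]
```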
Box Adaptive Head
In the Cls Module output, each location holds a two-dimensional classification score representing foreground vs. background; in the Reg Module output, each location holds a four-dimensional vector representing position offsets. Formally:
$$
\begin{aligned}
P_{w \times h \times 2}^{cls} &= [\varphi(x)]_{cls} \star [\varphi(z)]_{cls} \\
P_{w \times h \times 4}^{reg} &= [\varphi(x)]_{reg} \star [\varphi(z)]_{reg}
\end{aligned}
$$
where $\star$ denotes the convolution operation with $[\varphi(z)]_{cls}$ or $[\varphi(z)]_{reg}$ as the convolution kernel, $P^{cls}_{w \times h \times 2}$ denotes the classification map, and $P^{reg}_{w \times h \times 4}$ denotes the regression map.
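In SiamRPN++-style trackers (including the siamban repo), this $\star$ is implemented as a depthwise cross-correlation: each channel of the search feature is correlated with the matching channel of the template feature. A minimal sketch, with illustrative function name and shapes:

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(x, kernel):
    """Depthwise cross-correlation of search feature x with template kernel."""
    batch, channels = kernel.shape[:2]
    x = x.reshape(1, batch * channels, x.size(2), x.size(3))
    kernel = kernel.reshape(batch * channels, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch * channels)
    return out.reshape(batch, channels, out.size(2), out.size(3))

# Illustrative shapes: template (1, 256, 7, 7) vs. search (1, 256, 31, 31)
# -> response (1, 256, 25, 25); small conv heads then map it to 2 (cls) / 4 (reg) channels.
response = xcorr_depthwise(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 7, 7))
```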
Each location $(i, j)$ on the resulting classification map and regression map is mapped back to a location $(p_i, p_j)$ in the input search region by:
$$[p_i, p_j] = \left[\left\lfloor\frac{w_{im}}{2}\right\rfloor + \left(i - \left\lfloor\frac{w}{2}\right\rfloor\right) \times s,\ \left\lfloor\frac{h_{im}}{2}\right\rfloor + \left(j - \left\lfloor\frac{h}{2}\right\rfloor\right) \times s\right]$$
where $w_{im}$ and $h_{im}$ represent the width and height of the input search patch, and $s$ represents the total stride of the network.
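As a quick sanity check, here is the mapping in NumPy. The values 255/25/8 are the commonly used SiamBAN settings (255×255 search patch, 25×25 output map, total stride 8); treat them as assumptions:

```python
import numpy as np

w_im = h_im = 255   # search patch size (assumed)
w = h = 25          # output map size (assumed)
s = 8               # total network stride (assumed)

i, j = np.meshgrid(np.arange(w), np.arange(h), indexing="ij")
p_i = w_im // 2 + (i - w // 2) * s
p_j = h_im // 2 + (j - h // 2) * s
# The map center (12, 12) lands on the patch center (127, 127),
# and neighboring cells step by 8 pixels in each direction.
```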
Multi-level Prediction
Features from shallow layers are better for localizing the target, while features from deep layers carry rich semantic information about the target, which makes the model more robust to appearance changes. The paper therefore runs prediction on the features of ResNet-50's last three stages separately and then fuses the resulting maps with a weighted sum:
$$
\begin{aligned}
P_{w \times h \times 2}^{cls\text{-}all} &= \sum_{l=3}^{5} \alpha_{l} P_{l}^{cls} \\
P_{w \times h \times 4}^{reg\text{-}all} &= \sum_{l=3}^{5} \beta_{l} P_{l}^{reg}
\end{aligned}
$$
where $\alpha_l$ and $\beta_l$ are the weights corresponding to each map and are optimized together with the network.
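A minimal PyTorch sketch of this learnable weighted fusion; the softmax normalization is an assumption (the paper only says the weights are optimized with the network):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse per-level maps with learnable scalar weights."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_levels))

    def forward(self, maps):
        # maps: list of (B, C, H, W) tensors, one per backbone level
        w = torch.softmax(self.weights, dim=0)  # assumption: normalized weights
        return sum(wi * m for wi, m in zip(w, maps))

fuse_cls = WeightedFusion()  # the alpha_l for the cls maps
fuse_reg = WeightedFusion()  # the beta_l for the reg maps
```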
Ground-truth and Loss
In this section the authors mainly borrow practices from anchor-free detectors in object detection; for more background, see the post 目标检测:Anchor-Free时代.
The authors denote the ground-truth bounding box's width, height, top-left corner, center, and bottom-right corner as $g_w$, $g_h$, $(g_{x_1}, g_{y_1})$, $(g_{x_c}, g_{y_c})$, and $(g_{x_2}, g_{y_2})$, respectively.
Two ellipses $E_1$ and $E_2$ are then defined:
$$E_1:\ \frac{(p_i - g_{x_c})^2}{\left(\frac{g_w}{2}\right)^2} + \frac{(p_j - g_{y_c})^2}{\left(\frac{g_h}{2}\right)^2} = 1$$
$$E_2:\ \frac{(p_i - g_{x_c})^2}{\left(\frac{g_w}{4}\right)^2} + \frac{(p_j - g_{y_c})^2}{\left(\frac{g_h}{4}\right)^2} = 1$$
If a location $(p_i, p_j)$ falls inside $E_2$, it is labeled as a positive sample; if it falls outside $E_1$, it is labeled as a negative sample; locations between $E_1$ and $E_2$ are ignored.
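A NumPy sketch of this assignment rule (the +1/0/-1 label convention is my assumption):

```python
import numpy as np

def assign_labels(p_i, p_j, gxc, gyc, gw, gh):
    """Ellipse-based label assignment: 1 = positive, 0 = negative, -1 = ignore."""
    e1 = ((p_i - gxc) / (gw / 2)) ** 2 + ((p_j - gyc) / (gh / 2)) ** 2  # outer ellipse
    e2 = ((p_i - gxc) / (gw / 4)) ** 2 + ((p_j - gyc) / (gh / 4)) ** 2  # inner ellipse
    labels = np.full(p_i.shape, -1, dtype=np.int64)  # between E1 and E2: ignored
    labels[e2 < 1] = 1   # inside E2: positive
    labels[e1 > 1] = 0   # outside E1: negative
    return labels
```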
The bbox regression targets are computed as:
$$
\begin{aligned}
d_l &= p_i - g_{x_1} \\
d_t &= p_j - g_{y_1} \\
d_r &= g_{x_2} - p_i \\
d_b &= g_{y_2} - p_j
\end{aligned}
$$
where $d_l$, $d_t$, $d_r$, $d_b$ are the distances from the location to the four sides of the bounding box.
The final loss consists of two parts: a cross entropy loss for classification and an IoU loss for the regressed box:
$$L = \lambda_1 L_{cls} + \lambda_2 L_{reg}$$
$$L_{IoU} = 1 - IoU$$
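Because the predicted and ground-truth boxes share the same anchor location, the IoU can be computed directly from the four side distances. A PyTorch sketch (function name is illustrative):

```python
import torch

def iou_loss(pred, target, eps=1e-6):
    """IoU loss over (d_l, d_t, d_r, d_b) offsets; pred/target: (N, 4), non-negative."""
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    # Boxes share the same reference point, so the intersection's width/height
    # are the element-wise minima of the opposing side distances.
    iw = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    ih = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = iw * ih
    iou = inter / (area_p + area_t - inter + eps)
    return (1 - iou).mean()
```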
Training and Inference
The model is trained on ImageNet VID, YouTube-BoundingBoxes, COCO, ImageNet DET, GOT10k, and LaSOT.
At inference time, the target's box is recovered via:
$$
\begin{aligned}
p_{x_1} &= p_i - d_l^{reg} \\
p_{y_1} &= p_j - d_t^{reg} \\
p_{x_2} &= p_i + d_r^{reg} \\
p_{y_2} &= p_j + d_b^{reg}
\end{aligned}
$$
where $d_l^{reg}$, $d_t^{reg}$, $d_r^{reg}$, and $d_b^{reg}$ denote the prediction values of the regression map, and $(p_{x_1}, p_{y_1})$ and $(p_{x_2}, p_{y_2})$ are the top-left and bottom-right corners of the predicted box.
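Putting the pieces together, a simplified decoding step (greedy argmax over the fused classification map; the penalty and window re-weighting used by real Siamese trackers are omitted here):

```python
import numpy as np

def decode_box(cls_map, reg_map, p_i, p_j):
    """cls_map: (H, W) foreground scores; reg_map: (4, H, W) = (d_l, d_t, d_r, d_b).
    p_i, p_j: (H, W) search-patch coordinates of each map cell (see the mapping above)."""
    y, x = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    d_l, d_t, d_r, d_b = reg_map[:, y, x]
    cx, cy = p_i[y, x], p_j[y, x]
    return cx - d_l, cy - d_t, cx + d_r, cy + d_b  # (x1, y1, x2, y2)
```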
Experimental Results
Summary
Overall, the biggest highlight of this paper is predicting the bbox by direct regression, which does away with SiamRPN's complicated anchor hyperparameter tuning. The paper also shows clear traces of object detection algorithms; detection papers are well worth reading and learning from!