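0. Background knowledge needed for this paper
- Object detection: know the basic technical routes of conventional detectors (anchor-based methods, non-maximum suppression, one-stage and two-stage pipelines), and have a rough idea of established strong baselines such as Faster R-CNN
- Transformer: understand how the Transformer works, in particular the self-attention mechanism
- Bipartite matching: know bipartite matching from graph theory and the Hungarian algorithm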
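I. Key points of the abstract
1. Compared with the conventional pipeline: many hand-designed components are removed, such as non-maximum suppression and anchor design
Each of these hand-designed components explicitly encodes, to some extent, human prior knowledge about the task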
2. Core of DETR, two main ingredients:
(a) a set-based global loss that forces unique predictions via bipartite matching, and
(b) a transformer encoder-decoder architecture. (The Transformer used in this paper is non-autoregressive.)
For an introduction to non-autoregressive models, see https://zhuanlan.zhihu.com/p/82892975
3. What DETR does:
· Input: a fixed small set of learned object queries
· DETR reasons about:
(a) the relations of the objects
(b) the global image context
· Output: the final set of predictions, produced directly and in parallel
4. Architecture diagram: [figure: overall DETR pipeline]
[figure: a more detailed view of the DETR architecture]
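II. Main text
1. First, object detection is framed as a direct set prediction problem
2. The whole network is designed end-to-end and trained with a set loss function, which performs bipartite matching between predicted and ground-truth objects
3. DETR is an innovation at the architecture level only; it introduces no new kind of layer (in contrast to, say, ResNet, whose contribution was the skip connection, DETR does not innovate at the layer level)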
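4. The matching loss function used by DETR assigns predictions to ground-truth boxes one to one (uniquely assigns a prediction to a ground truth object; this one-to-one assignment is exactly what bipartite matching means). It is also invariant to the order of the predicted objects (is invariant to a permutation of predicted objects), which is why the problem is modeled as bipartite matching, here specifically on an undirected bipartite graph → this is one reason the predictions can be emitted in parallel
"Matching" here is the graph-theory concept; see https://www.renfei.org/blog/bipartite-matching.html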
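The one-to-one assignment, and a set loss built on top of it, can be illustrated with a small plain-Python sketch. Everything in it is an assumption made for illustration: the helper names (`pair_cost`, `match`, `set_loss`), the toy boxes and class probabilities, and the use of brute-force search over permutations in place of the Hungarian algorithm (fine for a handful of boxes, far too slow for real use). The box term is plain L1 only, which is a simplification.

```python
import math
from itertools import permutations

NO_OBJECT = -1  # assumption: the last entry of "probs" is the no-object class


def pair_cost(pred, gt):
    """Cost of giving one prediction to one ground-truth object:
    low when the class probability is high and the boxes are close."""
    class_term = -pred["probs"][gt["label"]]
    box_term = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    return class_term + box_term


def match(preds, gts):
    """Brute-force minimum-cost one-to-one assignment.
    Returns a tuple m where m[j] is the index of the prediction given to gts[j]."""
    candidates = permutations(range(len(preds)), len(gts))
    return min(candidates,
               key=lambda m: sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(m)))


def set_loss(preds, gts):
    """Negative log-likelihood of the matched class plus L1 box error;
    unmatched predictions are pushed towards the no-object class."""
    m = match(preds, gts)
    loss = 0.0
    for j, i in enumerate(m):
        loss += -math.log(preds[i]["probs"][gts[j]["label"]])
        loss += sum(abs(p - g) for p, g in zip(preds[i]["box"], gts[j]["box"]))
    for i in range(len(preds)):
        if i not in m:
            loss += -math.log(preds[i]["probs"][NO_OBJECT])
    return m, loss


# Three predictions (two classes + no-object), two ground-truth objects.
preds = [
    {"probs": [0.1, 0.1, 0.8], "box": (0.5, 0.5, 0.2, 0.2)},    # mostly "no object"
    {"probs": [0.9, 0.05, 0.05], "box": (0.1, 0.1, 0.1, 0.1)},
    {"probs": [0.1, 0.8, 0.1], "box": (0.8, 0.8, 0.3, 0.3)},
]
gts = [
    {"label": 1, "box": (0.78, 0.8, 0.3, 0.3)},
    {"label": 0, "box": (0.12, 0.1, 0.1, 0.1)},
]

matching, loss = set_loss(preds, gts)
print(matching, round(loss, 4))
```

Each ground-truth object receives exactly one prediction, and the leftover prediction is only asked to say "no object", so there is nothing for a later de-duplication step to clean up.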
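5. Why the problem is modeled as set prediction:
Set prediction in its basic form is a multi-label classification problem. The usual baseline for multi-label classification is one-vs-rest (also called one-vs-all: each label class in turn is treated as the "one", and all remaining classes are lumped together as the "rest" for training). This approach does not apply when there is an underlying structure between the elements, for example near-identical boxes (does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes)).
Applied to detection, it produces large numbers of near-duplicates. Conventional detectors remove these piles of near-identical predictions with post-processing such as non-maximum suppression, whereas a model formulated as direct set prediction needs no such post-processing. Instead, set prediction needs a global inference scheme that models the interactions between all predicted elements, so that redundant, duplicated predictions are avoided.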
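For reference, the post-processing step that set prediction makes unnecessary looks roughly like this. It is a minimal plain-Python sketch of greedy non-maximum suppression; the function names, the `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions chosen for the example, not taken from any particular detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap too much with any box already kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes on one object plus one box on another object.
boxes = [(10, 10, 50, 50), (12, 11, 51, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The threshold is exactly the kind of hand-tuned prior knowledge that the notes above describe: too low and neighbouring objects are merged, too high and duplicates survive.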
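6. Why bipartite matching is used as the loss between predictions and ground truth:
In a set prediction problem the loss must be invariant by a permutation of the predictions, i.e. the order of the predicted values/boxes must not affect the loss value. Bipartite matching, here specifically matching on an undirected bipartite graph, models the "prediction → ground truth" relation as an undirected bipartite graph, and a matching on such a graph has no notion of order. The matching itself is solved with the Hungarian algorithm.
· Bipartite matching (1) guarantees invariance to the order of the predictions; (2) guarantees a one-to-one match between predictions and ground truth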
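Both properties can be checked on a toy example. The sketch below is a plain-Python version of the Hungarian (Kuhn-Munkres) algorithm in its row/column-potentials form; the function names, the toy boxes and the L1-only cost are assumptions for illustration, not DETR's code. In practice a library routine such as `scipy.optimize.linear_sum_assignment` would normally be used instead of hand-written code.

```python
import math
import random


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.
    Returns a list a where row r is assigned to column a[r]; no column is reused."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: augmenting path found
                break
        while j0 != 0:          # flip the matches along that path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def matched_loss(preds, gts):
    """Total L1 cost of the optimal matching (rows: ground truth, columns: predictions)."""
    cost = [[l1(g, p) for p in preds] for g in gts]
    a = hungarian(cost)
    return sum(cost[r][c] for r, c in enumerate(a)), a


preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1),
         (0.8, 0.8, 0.3, 0.3), (0.3, 0.7, 0.2, 0.1)]
gts = [(0.12, 0.1, 0.1, 0.1), (0.78, 0.8, 0.3, 0.3)]

loss, assignment = matched_loss(preds, gts)

shuffled = preds[:]
random.Random(0).shuffle(shuffled)   # same predictions, different order
loss_shuffled, assignment_shuffled = matched_loss(shuffled, gts)

print(math.isclose(loss, loss_shuffled))
```

The indices in the assignment change when the predictions are shuffled, but the matched boxes and the loss do not, which is the permutation invariance required of a set loss.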
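7. Better accuracy on large objects:
The paper calls this "a result likely enabled by the non-local computations of the transformer". I read "non-local computations" here in the sense of the non-local concept from Non-local Neural Networks (https://arxiv.org/pdf/1711.07971.pdf).
Non-local computations aggregate information over a non-local receptive field, i.e. across all positions rather than a local neighborhood; see https://zhuanlan.zhihu.com/p/33345791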
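To make "non-local" concrete, here is a minimal plain-Python sketch of scaled dot-product self-attention over a short list of feature vectors. It assumes Q = K = V = x, with no learned projections and no positional encoding, so it is a simplification rather than the Transformer layer used by DETR; the function names and the toy input are made up for the example.

```python
import math


def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """x: list of n feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] says how much position j
    contributes to the output at position i."""
    d = len(x[0])
    weights = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    outputs = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
               for row in weights]
    return outputs, weights


# Four "positions" of a flattened feature map, two channels each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is strictly positive and sums to 1, so each output position mixes information from all positions in a single step, whereas a small convolution kernel only sees its immediate neighbourhood.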
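III. Results
IV. Source code discussion
To guard against later changes to the code base, the source notes below are pinned to the latest commit at the time of writing (2020.06.18), 1fcfc65
DETR network structure at a glance: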
import torch
from torch import nn

# NestedTensor, nested_tensor_from_tensor_list and MLP are defined elsewhere in the DETR repository.


class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss

    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxiliary losses are activated. It is a list of
                                dictionaries containing the two above keys for each decoder layer.
        """
        if not isinstance(samples, NestedTensor):
            samples = nested_tensor_from_tensor_list(samples)
        # The backbone is a CNN used for feature extraction; it also returns positional encodings.
        features, pos = self.backbone(samples)
        # Take the last feature map and split the NestedTensor into its tensor and padding mask.
        src, mask = features[-1].decompose()
        assert mask is not None
        # The last feature map is passed to the Transformer as src. input_proj is a 1x1 convolution
        # that reduces the channel dimension to d_model (the model dimension).
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
        # To apply the Transformer to detection, the authors add a class head ("class embedding")
        # and a box head ("box embedding") on top of the decoder output.
        outputs_class = self.class_embed(hs)
        # A sigmoid after the box head yields four box values per query, normalized to [0, 1]
        # relative to the size of the image.
        outputs_coord = self.bbox_embed(hs).sigmoid()
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
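To show how the two outputs might be turned into detections for one image, here is a plain-Python sketch (no torch, so it runs anywhere). It is not the repository's PostProcess code. The function name `decode`, the 0.7 score threshold and the toy numbers are hypothetical, and the box layout is assumed to be (center_x, center_y, width, height); the docstring above lists height before width, so the order should be checked against the repository before relying on it.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode(pred_logits, pred_boxes, img_w, img_h, threshold=0.7):
    """pred_logits: one list of (num_classes + 1) logits per query, last entry = no-object.
    pred_boxes: one normalized (cx, cy, w, h) tuple per query (assumed layout).
    Returns a list of (label, score, (x1, y1, x2, y2)) in pixel coordinates."""
    detections = []
    for logits, (cx, cy, w, h) in zip(pred_logits, pred_boxes):
        probs = softmax(logits)[:-1]          # drop the no-object class
        score = max(probs)
        if score < threshold:                 # hypothetical confidence cut-off
            continue
        label = probs.index(score)
        box = ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
               (cx + w / 2) * img_w, (cy + h / 2) * img_h)
        detections.append((label, score, box))
    return detections


# Three queries, two object classes plus no-object.
pred_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 3.0, 0.0]]
pred_boxes = [(0.5, 0.5, 0.2, 0.4), (0.3, 0.3, 0.1, 0.1), (0.25, 0.75, 0.5, 0.5)]

detections = decode(pred_logits, pred_boxes, img_w=640, img_h=480)
for label, score, box in detections:
    print(label, round(score, 3), [round(v, 1) for v in box])
```

The second query puts most of its probability on the no-object class and is simply dropped; no overlap-based filtering is involved.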
TBC. (The unfinished parts will be filled in soon; I am writing these notes down as I read and learn...)