EfficientDet: Scalable and Efficient Object Detection

Motivation:

Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs)?
【CC】Straight to the point: build a family of networks for different compute budgets.

We systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
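【CC】The motivation is very pure: improve network efficiency as much as possible (without losing accuracy, or even while improving it). First, the BiFPN structure is proposed for the multi-scale feature fusion stage (this is also the paper's biggest contribution!); second, building on the authors' own EfficientNet, a family of networks is provided to cover different compute budgets.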

Approach:

Challenge 1: efficient multi-scale feature fusion
Since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. We propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features.
【CC】The observation is that features at different scales contribute unequally to the final output; based on this, the authors design a bidirectional pyramid with learnable weights for feature fusion. Could an MLP learn the weights instead? Likewise, could self-attention work? Others have already done this.

Challenge 2: model scaling
Recently, [36] demonstrates remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. We observe that scaling up feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. We propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, box/class prediction network.
【CC】This actually builds on prior work: jointly scaling backbone/head/resolution has a considerable impact on final accuracy. Based on this idea, the authors scale EfficientNet + BiFPN + head + resolution to different sizes, forming their own network family called EfficientDet.

BiFPN

Multi-scale feature fusion aims to aggregate features at different resolutions. Formally, given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \dots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, our goal is to find a transformation $f$ that can effectively aggregate different features and output a list of new features: $\vec{P}^{out} = f(\vec{P}^{in})$.
【CC】A formal statement of the FPN fusion problem: given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \dots)$, find an efficient function $f$ such that $\vec{P}^{out} = f(\vec{P}^{in})$.

  • Cross-Scale Connections
    Conventional top-down FPN is inherently limited by the one-way information flow. To address this issue, PANet [23] adds an extra bottom-up path aggregation network, as shown in Figure 2(b). Recently, NAS-FPN [8] employs neural architecture search to search for better cross-scale feature network topology, but it requires thousands of GPU hours during search and the found network is irregular and difficult to interpret or modify, as shown in Figure 2(c).
    【CC】The inherent flaw of FPN is its one-way information flow; PANet adds a bottom-up flow to overcome it. NAS-FPN, as the name suggests, uses NAS to search the FPN connections automatically, but it is compute-hungry (this is arguably misleading: the cost is in the NAS search stage, while inference should be fairly fast).
    Figure 2: Feature network design
    (a) FPN introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3 - P7);
    (b) PANet adds an additional bottom-up pathway on top of FPN;
    (c) NAS-FPN uses neural architecture search to find an irregular feature network topology and then repeatedly applies the same block;
    (d) is our BiFPN with better accuracy and efficiency trade-offs.

This paper proposes several optimizations for cross-scale connections:
First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will have less contribution to feature network that aims at fusing different features.
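【CC】BiFPN is an optimization on top of PANet, so look at the change from Figure 2(b) to Figure 2(d): a node with only one incoming edge clearly performs no fusion, contributes little, and can be removed.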

Second, we add an extra edge from the original input to output node if they are at the same level, in order to fuse more features without adding much cost
【CC】A skip connection similar to a ResBlock is added within the same level (the purple lines), and it adds little extra compute.

Third, unlike PANet [23] that only has one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion
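【CC】The bidirectional connection itself is very close to PANet's idea; what the authors emphasize is stacking the BiFPN block multiple times to fuse multi-scale features.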
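
The three optimizations above can be sketched as a small wiring table. This is only an illustration of the connectivity described for Figure 2(d), not the authors' code: the function name `bifpn_layer_inputs` and the string labels `up(...)` / `down(...)` are made up here, and the sketch lists which inputs each node fuses without computing anything.

```python
def bifpn_layer_inputs(min_level=3, max_level=7):
    """Return {node: [inputs]} for one BiFPN layer (hypothetical helper).

    Names ending in _in are layer inputs, _td are intermediate top-down
    nodes, _out are layer outputs. up()/down() mark a resize step.
    """
    nodes = {}

    # Top-down pass. The top and bottom levels get no intermediate node:
    # such a node would have a single input edge, so it is removed.
    for lvl in range(max_level - 1, min_level, -1):
        above = f"P{lvl + 1}_in" if lvl + 1 == max_level else f"P{lvl + 1}_td"
        nodes[f"P{lvl}_td"] = [f"P{lvl}_in", f"up({above})"]

    # Bottom-up pass. Inner levels also receive the extra same-level edge
    # from the original input (the skip connection).
    for lvl in range(min_level, max_level + 1):
        inputs = [f"P{lvl}_in"]
        if min_level < lvl < max_level:
            inputs.append(f"P{lvl}_td")
        if lvl == min_level:
            above = f"P{lvl + 1}_td" if lvl + 1 < max_level else f"P{lvl + 1}_in"
            inputs.append(f"up({above})")
        else:
            inputs.append(f"down(P{lvl - 1}_out)")
        nodes[f"P{lvl}_out"] = inputs
    return nodes


if __name__ == "__main__":
    for name, inputs in bifpn_layer_inputs().items():
        print(name, "<-", inputs)
```

Stacking then means feeding the `_out` features of one layer in as the `_in` features of the next, which is the "repeat the same layer multiple times" step.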

  • Weighted Feature Fusion

When fusing features with different resolutions, a common way is to first resize them to the same resolution and then sum them up.
We observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input, and let the network learn the importance of each input feature.
【CC】The usual way to fuse features of different scales: up/downsample the other scales to the current scale, then add them. Because features at different scales contribute unequally to the final output, learnable weights are added to improve performance.

Unbounded fusion:
$$O = \sum_i w_i \cdot I_i$$
where $w_i$ is a learnable weight that can be a scalar. However, since the scalar weight is unbounded, it could potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.
【CC】The naive approach: put a learnable weight $w_i$ directly on each feature. But $w_i$ is unconstrained, which easily makes the model unstable and hard to train, so the natural fix is to normalize $w_i$.

Softmax-based fusion:
$$O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$$
An intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with value range from 0 to 1, representing the importance of each input. However, the extra softmax leads to significant slowdown on GPU hardware.
【CC】The naive normalization is softmax, i.e. this formula; but the softmax brings extra runtime overhead.
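
As a minimal sketch of softmax-based fusion, the snippet below fuses already-resized features represented as plain Python lists. The function name `softmax_fusion` is hypothetical, and subtracting the maximum weight before exponentiating is a standard numerical-stability trick added here, not something the paper describes.

```python
import math


def softmax_fusion(weights, features):
    """Fuse same-shaped feature lists with softmax-normalized weights.

    weights:  one learnable scalar per input feature
    features: list of equally long lists (already resized to one scale)
    """
    # Shift by the max weight so exp() cannot overflow (assumed extra step;
    # it does not change the softmax result).
    shift = max(weights)
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    norm = [e / total for e in exps]

    # Weighted sum, element by element.
    length = len(features[0])
    fused = [sum(n * f[k] for n, f in zip(norm, features)) for k in range(length)]
    return norm, fused


if __name__ == "__main__":
    norm, fused = softmax_fusion([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
    print(norm, fused)
```

Each input costs one exponential per forward pass, which is the overhead the comment above refers to.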

Fast normalized fusion:
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$
where $w_i \ge 0$ is ensured by applying a ReLU after each $w_i$, and $\epsilon = 0.0001$ is a small value to avoid numerical instability. Similarly, the value of each normalized weight also falls between 0 and 1, but since there is no softmax operation here, it is much more efficient.
【CC】Avoids computing the exponential e; the weights are normalized directly by their sum.
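
A minimal sketch of fast normalized fusion under the same simplification (features as plain Python lists, already resized to one scale). The function name `fast_normalized_fusion` is hypothetical; the ReLU and the 0.0001 stabilizer follow the description above.

```python
def fast_normalized_fusion(weights, features, eps=1e-4):
    """Fuse same-shaped feature lists with ReLU + sum normalization."""
    # ReLU keeps every weight non-negative.
    relu = [max(0.0, w) for w in weights]
    # eps keeps the denominator away from zero when all weights are clipped.
    denom = eps + sum(relu)
    norm = [w / denom for w in relu]

    length = len(features[0])
    fused = [sum(n * f[k] for n, f in zip(norm, features)) for k in range(length)]
    return norm, fused


if __name__ == "__main__":
    norm, fused = fast_normalized_fusion([1.0, -2.0, 3.0], [[4.0], [100.0], [8.0]])
    print(norm, fused)  # the negative weight is clipped, so the 100.0 input is ignored
```

The normalized weights sum to slightly less than 1 because of `eps`, and no exponential is evaluated.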

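Our final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
As a concrete example, here we describe the two fused features at level 6 for BiFPN shown in Figure 2(d):
$$P_6^{td} = Conv\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot Resize(P_7^{in})}{w_1 + w_2 + \epsilon}\right)$$
$$P_6^{out} = Conv\left(\frac{w'_1 \cdot P_6^{in} + w'_2 \cdot P_6^{td} + w'_3 \cdot Resize(P_5^{out})}{w'_1 + w'_2 + w'_3 + \epsilon}\right)$$
where $P_6^{td}$ is the intermediate feature at level 6 on the top-down pathway, and $P_6^{out}$ is the output feature at level 6 on the bottom-up pathway.
【CC】This example walks through the P6 computation. $P_6^{td}$ is the output of the first blue node from the left: it fuses this level's input $P_6^{in}$ with the resized (upsampled) input of the level above, $Resize(P_7^{in})$. The result goes to the second blue node on the right and is also passed down to level P5. The second blue node takes three inputs: $P_6^{td}$, the resized (downsampled) level-5 output $Resize(P_5^{out})$, and this level's skip connection $P_6^{in}$.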
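
The level-6 example can be traced numerically with a toy sketch. Everything here is a simplification chosen for illustration: features are 1-D lists, `conv` is an identity placeholder for the convolution, upsampling is nearest-neighbour, and downsampling is max-pooling over pairs. None of these helper names come from the paper.

```python
def upsample2(x):
    """Nearest-neighbour upsampling by 2 (assumed resize for coarser levels)."""
    return [v for v in x for _ in range(2)]


def downsample2(x):
    """Max-pool over pairs (assumed resize for finer levels)."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def conv(x):
    """Identity placeholder standing in for the convolution."""
    return x


def fuse(weights, inputs, eps=1e-4):
    """Fast normalized fusion of equally long 1-D features."""
    w = [max(0.0, wi) for wi in weights]  # ReLU on each weight
    denom = sum(w) + eps
    return [
        sum(wi * inp[k] for wi, inp in zip(w, inputs)) / denom
        for k in range(len(inputs[0]))
    ]


def level6(p6_in, p7_in, p5_out, w_td, w_out):
    """Compute the two level-6 nodes: the top-down and the output feature."""
    p6_td = conv(fuse(w_td, [p6_in, upsample2(p7_in)]))
    p6_out = conv(fuse(w_out, [p6_in, p6_td, downsample2(p5_out)]))
    return p6_td, p6_out


if __name__ == "__main__":
    td, out = level6(
        p6_in=[1.0, 2.0],
        p7_in=[4.0],                 # half the resolution of P6
        p5_out=[0.0, 2.0, 4.0, 6.0], # twice the resolution of P6
        w_td=[1.0, 1.0],
        w_out=[1.0, 1.0, 1.0],
    )
    print(td, out)
```

Note that `p6_in` enters both nodes, which is the extra same-level edge from the input to the output node.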

EfficientDet

We employ ImageNet-pretrained EfficientNets as the backbone network. Our proposed BiFPN serves as the feature network, which takes level 3-7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively.
【CC】EfficientNet is the backbone and BiFPN is the feature fusion layer: feature levels P3-P7 are fed to BiFPN, and the fused features are fed to the head.
Figure 3: EfficientDet architecture – It employs EfficientNet [36] as the backbone network, BiFPN as the feature network, and shared class/box prediction network. Both BiFPN layers and class/box net layers are repeated multiple times based on different resource constraints as shown in Table 1.

Compound Scaling
Aiming at optimizing both accuracy and efficiency, we would like to develop a family of models that can meet a wide spectrum of resource constraints.
【CC】Would a RegNet-style NAS search be a better choice later on? It would also be more theoretically grounded.

We propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of backbone network, BiFPN network, class/box network, and resolution.
【CC】φ is the scaling factor and affects every dimension of the network: backbone + BiFPN + head + resolution.

  1. Backbone network

We reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6.
【CC】Directly reuses the EfficientNet backbone family.

  2. BiFPN network

Formally, BiFPN depth $D_{bifpn}$ (#layers) and BiFPN width $W_{bifpn}$ are scaled with the following equation:
$$W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi \qquad (1)$$
【CC】Just an empirical formula, nothing much to say; as the authors themselves write below, it was found by experiment.

  3. Box/class prediction network

We fix their width to be always the same as BiFPN (i.e., $W_{pred} = W_{bifpn}$), but linearly increase the depth (#layers) using the equation:
$$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor \qquad (2)$$

  4. Input image resolution

In BiFPN, the input resolution must be divisible by $2^7 = 128$, so we linearly increase resolutions using the equation:
$$R_{input} = 512 + \phi \cdot 128 \qquad (3)$$
Notably, our scaling is heuristic-based and might not be optimal, but we will show that this simple scaling method can significantly improve efficiency over other single-dimension scaling methods.
【CC】Again just an empirical formula, nothing much to say.
Table 1: Scaling configs for EfficientDet D0-D6 – φ is the compound coefficient that controls all other scaling dimensions; BiFPN, box/class net, and input size are scaled up using Equations 1, 2, and 3 respectively.
【CC】This is simply the table summarized from Equations 1-3.
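
To see how a single φ drives all dimensions at once, the three scaling rules can be evaluated side by side. This is a sketch under stated assumptions: the constants (base width 64 with growth factor 1.35, base depth 3, base resolution 512 with step 128) are the values as recalled from the published paper, the function name `efficientdet_scaling` is hypothetical, and the printed widths are raw formula values that the published table may round or adjust.

```python
def efficientdet_scaling(phi):
    """Evaluate the compound-scaling heuristics for coefficient phi.

    All constants are assumptions taken from the paper as recalled,
    not read from the (missing) equation images in this post.
    """
    return {
        # BiFPN width grows exponentially, depth linearly.
        "bifpn_width": 64 * (1.35 ** phi),
        "bifpn_depth": 3 + phi,
        # Box/class head depth grows by one layer every three steps.
        "head_depth": 3 + phi // 3,
        # Resolution moves in steps of 128 = 2**7 so it stays divisible
        # by the stride of the coarsest feature level (P7).
        "input_resolution": 512 + phi * 128,
    }


if __name__ == "__main__":
    for phi in range(7):
        cfg = efficientdet_scaling(phi)
        print(f"phi={phi}: {cfg}")
```

Because the resolution rule only ever adds multiples of 128 to 512, every generated input size satisfies the divisibility constraint automatically.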
Figure 1: Model FLOPs vs. COCO accuracy – All numbers are for single-model single-scale. Our EfficientDet achieves new state-of-the-art 52.2% COCO AP with much fewer parameters and FLOPs than previous detectors. More studies on different backbones and FPN/NAS-FPN/BiFPN are in Table 4 and 5. Complete results are in Table 2.
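【CC】This is the comparison shown at the very start of the paper, and it is quite striking: faster than the others, and more accurate as well.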
