Abstract
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model, called OverFeat.
1 Introduction
Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]. The accuracy of ConvNets on small datasets such as Caltech-101, while decent, has not been record-breaking. However, the advent of larger datasets has enabled ConvNets to significantly advance the state of the art on datasets such as the 1000-category ImageNet [5].
The main advantage of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training samples.
The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes. We suggest that by combining many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy. Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets and establish state of the art results on the ILSVRC 2013 localization and detection tasks.
While images from the ImageNet classification dataset are largely chosen to contain a roughly-centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image. The first idea in addressing this is to apply a ConvNet at multiple locations in the image, in a sliding window fashion, and over multiple scales. Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection. Thus, the second idea is to train the system to not only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object relative to the window. The third idea is to accumulate the evidence for each category at each location and size.
Many authors have proposed to use ConvNets for detection and localization with a sliding window over multiple scales, going back to the early 1990s for multi-character strings [20], faces [30], and hands [22]. More recently, ConvNets have been shown to yield state of the art performance on text detection in natural images [4], face detection [8, 23] and pedestrian detection [25].
Several authors have also proposed to train ConvNets to directly predict the instantiation parameters of the objects to be located, such as the position relative to the viewing window, or the pose of the object. For example, Osadchy et al. [23] describe a ConvNet for simultaneous face detection and pose estimation. Faces are represented by a 3D manifold in the nine-dimensional output space. Positions on the manifold indicate the pose (pitch, yaw, and roll). When the training image is a face, the network is trained to produce a point on the manifold at the location of the known pose. If the image is not a face, the output is pushed away from the manifold. At test time, the distance to the manifold indicates whether the image contains a face, and the position of the closest point on the manifold indicates the pose. Taylor et al. [27, 28] use a ConvNet to estimate the location of body parts (hands, head, etc.) so as to derive the human body pose. They use a metric learning criterion to train the network to produce points on a body pose manifold. Hinton et al. have also proposed to train networks to compute explicit instantiation parameters of features as part of a recognition process [12].
Other authors have proposed to perform object localization via ConvNet-based segmentation. The simplest approach consists in training the ConvNet to classify the central pixel (or voxel for volumetric images) of its viewing window as a boundary between regions or not [13]. But when the regions must be categorized, it is preferable to perform semantic segmentation. The main idea is to train the ConvNet to classify the central pixel of the viewing window with the category of the object it belongs to, using the window as context for the decision. Applications range from biological image analysis [21], to obstacle tagging for mobile robots [10], to tagging of photos [7]. The advantage of this approach is that the bounding contours need not be rectangles, and the regions need not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for training. This segmentation pre-processing or object proposal step has recently gained popularity in traditional computer vision to reduce the search space of position, scale and aspect ratio for detection [19, 2, 6, 29]. Hence an expensive classification method can be applied at the optimal location in the search space, thus increasing recognition accuracy. Additionally, [29, 1] suggest that these methods improve accuracy by drastically reducing unlikely object regions, hence reducing potential false positives. Our dense sliding window method, however, is able to outperform object proposal methods on the ILSVRC13 detection dataset.
Krizhevsky et al. [15] recently demonstrated impressive classification performance using a large ConvNet. The authors also entered the ImageNet 2012 competition, winning both the classification and localization challenges. Although they demonstrated impressive localization performance, no published work has described how their approach works. Our paper is thus the first to provide a clear explanation of how ConvNets can be used for localization and detection on ImageNet data.
In this paper we use the terms localization and detection in a way that is consistent with their use in the ImageNet 2013 competition, namely that the only difference is the evaluation criterion used and both involve predicting the bounding box for each object in the image.
Figure 1: Localization (top) and detection tasks (bottom). The left images contain our predictions (ordered by decreasing confidence) while the right images show the ground-truth labels. The detection image (bottom) illustrates the higher difficulty of the detection dataset, which can contain many small objects, while the classification and localization images typically contain a single large object.
2 Vision Tasks
In this paper, we explore three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection. Each task is a sub-task of the next. While all tasks are addressed using a single framework and a shared feature learning base, we will describe them separately in the following sections.
Throughout the paper, we report results on the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013). In the classification task of this challenge, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (because images can also contain multiple unlabeled objects). The localization task is similar in that 5 guesses are allowed per image, but in addition, a bounding box for the predicted object must be returned with each guess. To be considered correct, the predicted box must match the ground truth by at least 50% (using the PASCAL criterion of intersection over union), as well as be labeled with the correct class (i.e. each prediction is a label and bounding box that are associated together). The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision (mAP) measure. The localization task is a convenient intermediate step between classification and detection, and allows us to evaluate our localization method independently of challenges specific to detection (such as learning a background class). In Fig. 1, we show examples of images with our localization/detection predictions as well as the corresponding ground truth. Note that classification and localization share the same dataset, while detection also has additional data where objects can be smaller. The detection data also contain a set of images where certain objects are absent. This can be used for bootstrapping, but we have not made use of it in this work.
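To make the 50% overlap criterion concrete, here is a minimal sketch (our own illustration, not code from the paper) of checking a localization guess under the PASCAL intersection-over-union rule, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def guess_is_correct(pred_label, pred_box, gt_label, gt_box, threshold=0.5):
    """A localization guess counts only if the label matches and IoU >= 50%."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold
```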
3 Classification
Our classification architecture is similar to the best ILSVRC12 architecture by Krizhevsky et al. [15]. However, we improve on the network design and the inference step. Because of time constraints, some of the training features in Krizhevsky's model were not explored, so we expect our results can be improved even further. These are discussed in the future work section 6.
3.1 Model Design and Training
We train the network on the ImageNet 2012 training set (1.2 million images and C = 1000 classes) [5]. Our model uses the same fixed input size approach proposed by Krizhevsky et al. [15] during training but turns to multi-scale for classification as described in the next section. Each image is downsampled so that the smallest dimension is 256 pixels. We then extract 5 random crops (and their horizontal flips) of size 221x221 pixels and present these to the network in mini-batches of size 128. The weights in the network are initialized randomly with (µ, σ) = (0, 1×10⁻²). They are then updated by stochastic gradient descent, accompanied by a momentum term of 0.6 and an ℓ2 weight decay of 1×10⁻⁵. The learning rate is initially 5×10⁻² and is successively decreased by a factor of 0.5 after (30, 50, 60, 70, 80) epochs. DropOut [11] with a rate of 0.5 is employed on the fully connected layers (6th and 7th) in the classifier.
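As a hedged sketch of this training setup (PyTorch is our choice for illustration; the placeholder model, the epoch count and the empty data loader are assumptions, only the hyperparameters come from the text above), the optimizer and schedule might look like this:

```python
import torch
import torch.nn.functional as F

# Placeholder model; the real network is described in Tables 1 and 3,
# with DropOut(0.5) on the fully connected layers 6 and 7.
model = torch.nn.Linear(3 * 221 * 221, 1000)
torch.nn.init.normal_(model.weight, mean=0.0, std=1e-2)  # (mu, sigma) = (0, 0.01) init

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=5e-2,            # initial learning rate
    momentum=0.6,       # momentum term
    weight_decay=1e-5,  # L2 weight decay
)
# Learning rate is halved after epochs 30, 50, 60, 70 and 80.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 50, 60, 70, 80], gamma=0.5)

train_loader = []  # stand-in for 221x221 random crops (+ flips), mini-batches of 128

for epoch in range(90):  # total epoch count is an assumption
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images.flatten(1)), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```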
We detail the architecture sizes in Tables 1 and 3. Note that during training, we treat this architecture as non-spatial (output maps of size 1x1), as opposed to the inference step, which produces spatial outputs. Layers 1-5 are similar to Krizhevsky et al. [15], using rectification ("relu") non-linearities and max pooling, but with the following differences: (i) no contrast normalization is used; (ii) pooling regions are non-overlapping; and (iii) our model has larger 1st and 2nd layer feature maps, thanks to a smaller stride (2 instead of 4). A larger stride is beneficial for speed but will hurt accuracy.
Table 1: Architecture specifics for the fast model. The spatial size of the feature maps depends on the input image size, which varies during our inference step (see Table 5 in the Appendix). Here we show training spatial sizes. Layer 5 is the top convolutional layer. Subsequent layers are fully connected, and applied in sliding window fashion at test time. The fully-connected layers can also be seen as 1x1 convolutions in a spatial setting. Similar sizes for the accurate model can be found in the Appendix.
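Since Table 1 itself is not reproduced here, the sketch below only mirrors the structural description above: five conv+relu stages with non-overlapping max pooling and no contrast normalization, followed by fully connected layers 6-8 that can be read as convolutions over a 5x5 window. The channel counts, kernel sizes and strides are illustrative placeholders, not the values of Table 1.

```python
import torch.nn as nn

# Placeholder sizes; consult Table 1 / the released OverFeat weights for the real values.
# The total subsampling (stride 2, pool 3, pool 2, pool 3) matches the 2x3x2x3 = 36 ratio
# discussed in section 3.3; with these placeholders a 221x221 crop gives a 5x5 layer-5 map.
features = nn.Sequential(                       # layers 1-5
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=3),      # non-overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=3),
)
classifier = nn.Sequential(                     # layers 6-8, seen as convs over a 5x5 window
    nn.Conv2d(256, 4096, kernel_size=5), nn.ReLU(), nn.Dropout(0.5),   # layer 6
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(), nn.Dropout(0.5),  # layer 7
    nn.Conv2d(4096, 1000, kernel_size=1),                              # layer 8 (C = 1000)
)
```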
3.2 Feature Extractor
Along with this paper, we release a feature extractor named "OverFeat" in order to provide powerful features for computer vision research. Two models are provided: a fast one and an accurate one. Each architecture is described in Tables 1 and 3. We also compare their sizes in Table 4 in terms of parameters and connections. The accurate model is more accurate than the fast one (14.18% classification error as opposed to 16.39% in Table 2), but it requires nearly twice as many connections. Using a committee of 7 accurate models reaches 13.6% classification error, as shown in Fig. 4.
3.3 Multi-Scale Classification
In [15], multi-view voting is used to boost performance: a fixed set of 10 views (4 corners and center, with horizontal flip) is averaged. However, this approach can ignore many regions of the image, and is computationally redundant when views overlap. Additionally, it is only applied at a single scale, which may not be the scale at which the ConvNet will respond with optimal confidence.
Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.
However, the total subsampling ratio in the network described above is 2x3x2x3, or 36. Hence when applied densely, this architecture can only produce a classification vector every 36 pixels in the input dimension along each axis. This coarse distribution of outputs decreases performance compared to the 10-view scheme because the network windows are not well aligned with the objects in the images. The better aligned the network window and the object, the stronger the confidence of the network response. To circumvent this problem, we take an approach similar to that introduced by Giusti et al. [9], and apply the last subsampling operation at every offset. This removes the loss of resolution from this layer, yielding a total subsampling ratio of x12 instead of x36.
We now explain in detail how the resolution augmentation is performed. We use 6 scales of input which result in unpooled layer 5 maps of varying resolution (see Table 5 for details). These are then pooled and presented to the classifier using the following procedure, illustrated in Fig. 3: (a) For a single image, at a given scale, we start with the unpooled layer 5 feature maps. (b) Each of the unpooled maps undergoes a 3x3 max pooling operation (non-overlapping regions), repeated 3x3 times for (∆x, ∆y) pixel offsets of {0, 1, 2}. (c) This produces a set of pooled feature maps, replicated (3x3) times for the different (∆x, ∆y) combinations. (d) The classifier (layers 6, 7, 8) has a fixed input size of 5x5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding-window fashion to the pooled maps, yielding C-dimensional output maps for a given (∆x, ∆y) combination. (e) The output maps for different (∆x, ∆y) combinations are reshaped into a single 3D output map (two spatial dimensions x C classes).
Figure 3: 1D illustration (to scale) of output map computation for classification, using the y-dimension from scale 2 as an example (see Table 5). (a): 20 pixel unpooled layer 5 feature map. (b): max pooling over non-overlapping 3 pixel groups, using offsets of ∆ = {0, 1, 2} pixels (red, green, blue respectively). (c): The resulting 6 pixel pooled maps, for different ∆. (d): 5 pixel classifier (layers 6, 7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each ∆. (e): reshaped into 6 pixel by C output maps.
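A minimal sketch of the offset (fine-stride) pooling of step (b), using PyTorch tensor ops; the function name and the example map sizes are our own, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def offset_max_pool(layer5, pool=3, offsets=(0, 1, 2)):
    """Pool an unpooled layer-5 map (C, H, W) with non-overlapping pool x pool windows,
    once per (dx, dy) pixel offset, as in step (b) of the procedure above."""
    pooled = {}
    for dy in offsets:
        for dx in offsets:
            shifted = layer5[:, dy:, dx:]
            # Crop so height/width are exact multiples of the pooling size.
            h = (shifted.shape[1] // pool) * pool
            w = (shifted.shape[2] // pool) * pool
            pooled[(dx, dy)] = F.max_pool2d(shifted[None, :, :h, :w], pool).squeeze(0)
    return pooled  # dict of (3x3) pooled maps, one per offset combination

# Example: a 256-channel, 20x23 unpooled layer-5 map (sizes are arbitrary here).
maps = offset_max_pool(torch.randn(256, 20, 23))
print({k: tuple(v.shape) for k, v in maps.items()})
```

The 5x5 classifier (layers 6-8, expressed as convolutions) would then be slid over each of these pooled maps, and the (3x3) resulting output maps interleaved into one finer map, as in steps (d)-(e).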
These operations can be viewed as shifting the classifier’s viewing window by 1 pixel through pooling layers without subsampling and using skip-kernels in the following layer (where values in the neighborhood are non-adjacent). Or equivalently, as applying the final pooling layer and fully-connected stack at every possible offset, and assembling the results by interleaving the outputs.
The procedure above is repeated for the horizontally flipped version of each image. We then produce the final classification by (i) taking the spatial max for each class, at each scale and flip; (ii) averaging the resulting C-dimensional vectors from different scales and flips; and (iii) taking the top-1 or top-5 elements (depending on the evaluation criterion) from the mean class vector.
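A small numpy sketch of this voting step (the array layout is an assumption of ours): given one C-dimensional output map per scale and per flip, take the spatial max, average across all maps, then read off the top-k classes.

```python
import numpy as np

def aggregate_predictions(output_maps, k=5):
    """output_maps: list of arrays of shape (C, H_i, W_i), one per (scale, flip).
    Returns the indices of the top-k classes from the mean class vector."""
    class_vectors = [m.max(axis=(1, 2)) for m in output_maps]   # (i) spatial max per class
    mean_vector = np.mean(class_vectors, axis=0)                # (ii) average over scales/flips
    return np.argsort(mean_vector)[::-1][:k]                    # (iii) top-k classes

# Example with C = 1000 classes, 6 scales and their horizontal flips (12 maps).
maps = [np.random.rand(1000, 5 + i, 7 + i) for i in range(12)]
print(aggregate_predictions(maps, k=5))
```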
At an intuitive level, the two halves of the network, i.e. the feature extraction layers (1-5) and the classifier layers (6-output), are used in opposite ways. In the feature extraction portion, the filters are convolved across the entire image in one pass. From a computational perspective, this is far more efficient than sliding a fixed-size feature extractor over the image and then aggregating the results from different locations. However, these principles are reversed for the classifier portion of the network. Here, we want to hunt for a fixed-size representation in the layer 5 feature maps across different positions and scales. Thus the classifier has a fixed-size 5x5 input and is exhaustively applied to the layer 5 maps. The exhaustive pooling scheme (with single pixel shifts (∆x, ∆y)) ensures that we can obtain fine alignment between the classifier and the representation of the object in the feature map.
3.4 Results
In Table 2, we experiment with different approaches, and compare them to the single network model of Krizhevsky et al. [15] for reference. The approach described above, with 6 scales, achieves a top-5 error rate of 13.6%. As might be expected, using fewer scales hurts performance: the single-scale model is worse, with 16.97% top-5 error. The fine stride technique illustrated in Fig. 3 brings a relatively small improvement in the single-scale regime, but is also of importance for the multi-scale gains shown here.
Table 2: Classification experiments on the validation set. Fine/coarse stride refers to the number of ∆ values used when applying the classifier. Fine: ∆ = 0, 1, 2; coarse: ∆ = 0.
Figure 4: Test set classification results. During the competition, OverFeat yielded a 14.2% top-5 error rate using an average of 7 fast models. In post-competition work, OverFeat ranks fifth with 13.6% error using bigger models (more features and more layers).
We report the test set results of the 2013 competition in Fig. 4, where our model (OverFeat) obtained 14.2% top-5 error by voting of 7 ConvNets (each trained with different initializations) and ranked 5th out of 18 teams. The best result using only ILSVRC13 data was 11.7%. Pre-training with extra data from the ImageNet Fall11 dataset improved this number to 11.2%. In post-competition work, we improve the OverFeat results down to 13.6% error by using bigger models (more features and more layers). Due to time constraints, these bigger models are not fully trained; more improvements are expected to appear in time.
3.5 ConvNets and Sliding Window Efficiency
In contrast to many sliding-window approaches that compute an entire pipeline for each window of the input one at a time, ConvNets are inherently efficient when applied in a sliding fashion because they naturally share computations common to overlapping regions. When applying our network to larger images at test time, we simply apply each convolution over the extent of the full image. This extends the output of each layer to cover the new image size, eventually producing a map of output class predictions, with one spatial location for each “window” (field of view) of input. This is diagrammed in Fig. 5. Convolutions are applied bottom-up, so that the computations common to neighboring windows need only be done once.
Note that the last layers of our architecture are fully connected linear layers. At test time, these layers are effectively replaced by convolution operations with kernels of 1x1 spatial extent. The entire ConvNet is then simply a sequence of convolutions, max-pooling and thresholding operations exclusively.
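The following sketch illustrates the idea (our own minimal example, not the released code): a fully connected layer trained on a fixed-size layer-5 window can be re-expressed as a convolution with the same weights, after which the whole network can ingest larger images and emit a spatial map of class scores. The channel and window sizes here are illustrative.

```python
import torch
import torch.nn as nn

# A fully connected layer trained on a 256-channel 5x5 layer-5 window...
fc6 = nn.Linear(256 * 5 * 5, 4096)

# ...is equivalent to a 5x5 convolution with the same weights; subsequent FC layers
# become 1x1 convolutions.
conv6 = nn.Conv2d(256, 4096, kernel_size=5)
conv6.weight.data = fc6.weight.data.view(4096, 256, 5, 5)
conv6.bias.data = fc6.bias.data

x_small = torch.randn(1, 256, 5, 5)    # training-size input: 1x1 spatial output
x_large = torch.randn(1, 256, 12, 15)  # larger test image: 8x11 spatial output map
assert torch.allclose(fc6(x_small.flatten(1)), conv6(x_small).flatten(1), atol=1e-4)
print(conv6(x_large).shape)            # torch.Size([1, 4096, 8, 11])
```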
4 Localization
Starting from our classification-trained network, we replace the classifier layers by a regression network and train it to predict object bounding boxes at each spatial location and scale. We then combine the regression predictions together, along with the classification results at each location, as we now describe.
4.1 Generating Predictions
To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network. The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present (though not necessarily fully contained) in the corresponding field of view. Thus we can assign a confidence to each bounding box.
4.2 Regressor Training
The regression network takes as input the pooled feature maps from layer 5. It has 2 fully-connected hidden layers of 4096 and 1024 channels, respectively. The final output layer has 4 units which specify the coordinates of the bounding box edges. As with classification, there are (3x3) copies throughout, resulting from the (∆x, ∆y) shifts. The architecture is shown in Fig. 8.
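Expressed as convolutions over the pooled layer-5 maps (so that it can be slid spatially like the classifier), a sketch of this regression head could look as follows; only the 4096/1024/4 sizes and the 5x5, 256-channel input come from the text and Fig. 8, the rest is our phrasing.

```python
import torch
import torch.nn as nn

# Regression head: two hidden layers (4096 and 1024 channels) and a 4-unit output
# giving the bounding box edge coordinates. The first layer looks at a 5x5 window
# of the 256-channel pooled layer-5 maps; as convolutions it can be applied at
# every spatial location of the map.
regressor = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=5), nn.ReLU(),
    nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(),
    nn.Conv2d(1024, 4, kernel_size=1),   # one 4-vector per location (per class in practice)
)

pooled = torch.randn(1, 256, 6, 7)       # e.g. scale-2 input for one (dx, dy) offset
print(regressor(pooled).shape)           # torch.Size([1, 4, 2, 3]): a 2x3 map of boxes
```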
Figure 6: Localization/detection pipeline. The raw classifier/detector outputs a class and a confidence for each location (1st diagram). The resolution of these predictions can be increased using the method described in section 3.3 (2nd diagram). The regression then predicts the location and scale of the object with respect to each window (3rd diagram). These bounding boxes are then merged and accumulated into a small number of objects (4th diagram).
Figure 7: Examples of bounding boxes produced by the regression network, before being combined into final predictions. The examples shown here are at a single scale. Predictions may be more optimal at other scales depending on the objects. Here, most of the bounding boxes, which are initially organized as a grid, converge to a single location and scale. This indicates that the network is very confident in the location of the object, as opposed to being spread out randomly. The top left image shows that it can also correctly identify multiple locations if several objects are present. The various aspect ratios of the predicted bounding boxes show that the network is able to cope with various object poses.
We fix the feature extraction layers (1-5) from the classification network and train the regression network using an ℓ2 loss between the predicted and true bounding box for each example. The final regressor layer is class-specific, having 1000 different versions, one for each class. We train this network using the same set of scales as described in Section 3. We compare the prediction of the regressor net at each spatial location with the ground-truth bounding box, shifted into the frame of reference of the regressor's translation offset within the convolution (see Fig. 8). However, we do not train the regressor on bounding boxes with less than 50% overlap with the input field of view: since the object is mostly outside of these locations, it will be better handled by regression windows that do contain the object.
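A hedged sketch of this training criterion: an ℓ2 loss on the predicted box, masked so that windows whose field of view overlaps the ground-truth box by less than 50% contribute nothing. How exactly "overlap" is measured is not specified above; IoU is used here as an assumption, and the function name is our own.

```python
import torch

def masked_box_loss(pred_boxes, gt_box, fields_of_view, min_overlap=0.5):
    """pred_boxes, fields_of_view: (N, 4) tensors of (x1, y1, x2, y2); gt_box: (4,).
    L2 loss on boxes, keeping only windows whose view overlaps the ground truth enough."""
    x1 = torch.maximum(fields_of_view[:, 0], gt_box[0])
    y1 = torch.maximum(fields_of_view[:, 1], gt_box[1])
    x2 = torch.minimum(fields_of_view[:, 2], gt_box[2])
    y2 = torch.minimum(fields_of_view[:, 3], gt_box[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_fov = (fields_of_view[:, 2] - fields_of_view[:, 0]) * (fields_of_view[:, 3] - fields_of_view[:, 1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    overlap = inter / (area_fov + area_gt - inter)          # IoU with the viewing window
    mask = (overlap >= min_overlap).float()                 # drop windows with < 50% overlap
    per_window = ((pred_boxes - gt_box) ** 2).sum(dim=1)    # squared L2 per window
    return (per_window * mask).sum() / mask.sum().clamp(min=1)
```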
Training the regressors in a multi-scale manner is important for the across-scale prediction combination. Training on a single scale will perform well on that scale and still perform reasonably on other scales. However, multi-scale training will make predictions match correctly across scales and exponentially increase the confidence of the merged predictions. In turn, this allows us to perform well with only a few scales, rather than the many scales typically used in detection. The typical ratio from one scale to another in pedestrian detection [25] is about 1.05 to 1.1; here, however, we use a large ratio of approximately 1.4 (this number differs for each scale since dimensions are adjusted to fit exactly the stride of our network), which allows us to run our system faster.
4.3 Combining Predictions
We combine the individual predictions (see Fig. 7) via a greedy merge strategy applied to the regressor bounding boxes, using the following algorithm. (a) Assign to Cs the set of classes in the top k for each scale s ∈ 1 . . . 6, found by taking the maximum detection class outputs across spatial locations for that scale. (b) Assign to Bs the set of bounding boxes predicted by the regressor network for each class in Cs, across all spatial locations at scale s.
Figure 8: Application of the regression network to layer 5 features, at scale 2, for example. (a) The input to the regressor at this scale is 6x7 pixels spatially by 256 channels for each of the (3x3) (∆x, ∆y) shifts. (b) Each unit in the 1st layer of the regression net is connected to a 5x5 spatial neighborhood in the layer 5 maps, as well as all 256 channels. Shifting the 5x5 neighborhood around results in a map of 2x3 spatial extent, for each of the 4096 channels in the layer, and for each of the (3x3) (∆x, ∆y) shifts. (c) The 2nd regression layer has 1024 units and is fully connected (i.e. the purple element only connects to the purple element in (b), across all 4096 channels). (d) The output of the regression network is a 4-vector (specifying the edges of the bounding box) for each location in the 2x3 map, and for each of the (3x3) (∆x, ∆y) shifts.
In the above, we compute match_score using the sum of the distance between the centers of the two bounding boxes and the intersection area of the boxes. box_merge computes the average of the bounding boxes' coordinates.
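The merging loop itself is not spelled out above, so the sketch below fills it in under an assumption: boxes are repeatedly merged in order of best (lowest) match_score until no remaining pair matches well enough. The stopping threshold and the equal weighting of the two match_score terms are placeholders of ours, not values from the paper.

```python
import numpy as np

def center_distance(a, b):
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return float(np.hypot(ca[0] - cb[0], ca[1] - cb[1]))

def intersection_area(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih

def match_score(a, b):
    # As described above: a combination of center distance and intersection area.
    # The relative weighting of the two terms is not given in the text; equal
    # weights are a placeholder.
    return center_distance(a, b) + intersection_area(a, b)

def box_merge(a, b):
    return tuple((x + y) / 2 for x, y in zip(a, b))  # average of the coordinates

def greedy_merge(boxes, threshold=50.0):
    """Repeatedly merge the best-matching pair of boxes until no pair matches well
    enough (the stopping rule and threshold value are assumptions)."""
    boxes = [tuple(map(float, b)) for b in boxes]
    while len(boxes) > 1:
        pairs = [(match_score(boxes[i], boxes[j]), i, j)
                 for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
        score, i, j = min(pairs)
        if score > threshold:
            break
        merged = box_merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes
```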
The final prediction is given by taking the merged bounding boxes with maximum class scores. This is computed by cumulatively adding the detection class outputs associated with the input windows from which each bounding box was predicted. See Fig. 6 for an example of bounding boxes merged into a single high-confidence bounding box. In that example, some turtle and whale bounding boxes appear in the intermediate multi-scale steps, but disappear in the final detection image. Not only do these bounding boxes have low classification confidence (at most 0.11 and 0.12 respectively), their collection is not as coherent as the bear bounding boxes and so does not get a significant confidence boost. The bear boxes have a strong confidence (approximately 0.5 on average per scale) and high matching scores. Hence after merging, many bear bounding boxes are fused into a single very high confidence box, while false positives disappear below the detection threshold due to their lack of bounding box coherence and confidence. This analysis suggests that our approach is naturally more robust to false positives coming from the pure-classification model than traditional non-maximum suppression, by rewarding bounding box coherence.
Figure 9: Localization experiments on the ILSVRC12 validation set. We experiment with different numbers of scales and with the use of single-class regression (SCR) or per-class regression (PCR).
4.4 Experiments
We apply our network to the Imagenet 2012 validation set using the localization criterion specified for the competition. The results for this are shown in Fig. 9. Fig. 10 shows the results of the 2012 and 2013 localization competitions (the train and test data are the same for both of these years). Our method is the winner of the 2013 competition with 29.9% error.
Our multiscale and multi-view approach was critical to obtaining good performance, as can be seen in Fig. 9: Using only a single centered crop, our regressor network achieves an error rate of 40%. By combining regressor predictions from all spatial locations at two scales, we achieve a vastly better error rate of 31.5%. Adding a third and fourth scale further improves performance to 30.0% error.
Using a different top layer for each class in the regressor network (Per-Class Regressor (PCR) in Fig. 9) surprisingly did not outperform using a single network shared among all classes (44.1% vs. 31.3%). This may be because there are relatively few examples per class annotated with bounding boxes in the training set, while the network has 1000 times more top-layer parameters, resulting in insufficient training. It is possible this approach may be improved by sharing parameters only among similar classes (e.g. training one network for all classes of dogs, another for vehicles, etc.).
5 Detection
Detection training is similar to classification training, but in a spatial manner. Multiple locations of an image may be trained simultaneously. Since the model is convolutional, all weights are shared among all locations. The main difference with the localization task is the necessity to predict a background class when no object is present. Traditionally, negative examples are initially taken at random for training. Then the most offending negative errors are added to the training set in bootstrapping passes. Independent bootstrapping passes render training complicated and risk potential mismatches between the negative example collection and training times. Additionally, the size of the bootstrapping passes needs to be tuned to make sure training does not overfit on a small set. To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image, such as random ones or the most offending ones. This approach is more computationally expensive, but renders the procedure much simpler. And since the feature extraction is initially trained with the classification task, the detection fine-tuning does not take as long anyway.
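A rough sketch of this on-the-fly negative selection, under our own assumptions about the data layout (the function and its parameters are hypothetical): from all spatial locations that do not overlap any ground-truth object, keep a few random ones plus the ones the current model scores most confidently as objects (the "most offending" negatives).

```python
import numpy as np

def select_negatives(object_scores, positive_mask, n_random=2, n_offending=2, rng=None):
    """object_scores: (H, W) max non-background class score at each location.
    positive_mask: (H, W) boolean, True where the location overlaps a ground-truth box.
    Returns a small list of (y, x) locations to use as negatives for this image."""
    rng = rng or np.random.default_rng()
    neg_locs = np.argwhere(~positive_mask)
    if len(neg_locs) == 0:
        return []
    # Most offending negatives: background locations the model currently scores highest.
    neg_scores = object_scores[~positive_mask]
    offending = neg_locs[np.argsort(neg_scores)[::-1][:n_offending]]
    # Plus a few random negatives.
    random_idx = rng.choice(len(neg_locs), size=min(n_random, len(neg_locs)), replace=False)
    picked = np.concatenate([offending, neg_locs[random_idx]])
    return [tuple(p) for p in picked]

# Example: an 8x10 grid of scores with a 3x3 positive region.
scores = np.random.rand(8, 10)
pos = np.zeros((8, 10), dtype=bool); pos[2:5, 3:6] = True
print(select_negatives(scores, pos))
```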
In Fig. 11, we report the results of the ILSVRC 2013 competition, where our detection system ranked 3rd with 19.4% mean average precision (mAP). We later established a new detection state of the art with 24.3% mAP. Note that there is a large gap between the top 3 methods and the other teams (the 4th method yields 11.5% mAP). Additionally, our approach is considerably different from the top 2 other systems, which use an initial segmentation step to reduce candidate windows from approximately 200,000 to 2,000. This technique speeds up inference and substantially reduces the number of potential false positives. [29, 1] suggest that detection accuracy drops when using dense sliding windows as opposed to selective search, which discards unlikely object locations and hence reduces false positives. Combined with our method, we may observe similar improvements as seen here between traditional dense methods and segmentation-based methods. It should also be noted that we did not fine-tune on the detection validation set as NEC and UvA did. The validation and test set distributions differ significantly enough from the training set that this alone improves results by approximately 1 point. The improvement between the two OverFeat results in Fig. 11 is due to longer training times and the use of context, i.e. each scale also uses lower resolution scales as input.
Figure 10: ILSVRC12 and ILSVRC13 competition results (test set). Our entry is the winner of the ILSVRC13 localization competition with 29.9% error (top 5). Note that training and testing data are the same for both years. The OverFeat entry uses 4 scales and a single-class regression approach.
Figure 11: ILSVRC13 test set Detection results. During the competition, UvA ranked first with 22.6% mAP. In post competition work, we establish a new state of the art with 24.3% mAP. Systems marked with * were pre-trained with the ILSVRC12 classification data.
6 Discussion
We have presented a multi-scale, sliding window approach that can be used for classification, localization and detection. We applied it to the ILSVRC 2013 datasets, and it currently ranks 4th in classification, 1st in localization and 1st in detection. A second important contribution of our paper is explaining how ConvNets can be effectively used for detection and localization tasks. These were never addressed in [15] and thus we are the first to explain how this can be done in the context of ImageNet 2012. The scheme we propose involves substantial modifications to networks designed for classification, but clearly demonstrates that ConvNets are capable of these more challenging tasks. Our localization approach won the 2013 ILSVRC competition and significantly outperformed all 2012 and 2013 approaches. The detection model was among the top performers during the competition, and ranks first in post-competition results. We have proposed an integrated pipeline that can perform different tasks while sharing a common feature extraction base, learned entirely from the pixels.
Our approach might still be improved in several ways. (i) For localization, we are not currently back-propagating through the whole network; doing so is likely to improve performance. (ii) We are using an ℓ2 loss, rather than directly optimizing the intersection-over-union (IOU) criterion on which performance is measured. Swapping the loss to this should be possible since IOU is still differentiable, provided there is some overlap. (iii) Alternate parameterizations of the bounding box may help to decorrelate the outputs, which will aid network training.