Abstract
Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Executing the two models separately may lead to efficiency problems, as the running time is simply the sum of the two steps, without exploring structures that could be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods rather than real-time MOT systems. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to state-of-the-art trackers embodying separate detection and embedding (SDE) learning (64.4% MOTA vs. 66.1% MOTA on the MOT-16 challenge). Code and models are available at https://github.com/Zhongdao/Towards-Realtime-MOT
Introduction
Multiple object tracking (MOT), which aims at predicting the trajectories of multiple targets in video sequences, underpins applications of critical significance ranging from autonomous driving to smart video analysis.
The dominant strategy for this problem, i.e., the tracking-by-detection paradigm [24,40,6], breaks MOT down into two steps: 1) the detection step, in which targets in single video frames are localized; and 2) the association step, in which detected targets are assigned and connected to existing trajectories. This means the system requires at least two compute-intensive components: a detector and an appearance embedding (re-ID) model. For convenience, we term such methods Separate Detection and Embedding (SDE) methods. The overall inference time is therefore roughly the sum of the two components, and it increases as the number of targets grows. These characteristics of SDE methods pose critical challenges to building a real-time MOT system, an essential demand in practice.
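To make the efficiency argument concrete, here is a minimal sketch of a hypothetical SDE pipeline. The `detector`, `reid_model`, and `tracker` callables are placeholders for illustration, not APIs from the paper or its code release; the point is that the detector runs once per frame, while the re-ID model runs once per detected target, so the per-frame cost grows with the number of targets.

```python
def sde_track_frame(frame, detector, reid_model, tracker):
    """Hypothetical SDE loop; detector, reid_model and tracker are placeholder callables."""
    boxes = detector(frame)  # one detector forward pass per frame
    # one re-ID forward pass per detected target -> cost grows with the target count
    embeddings = [reid_model(frame[y1:y2, x1:x2]) for (x1, y1, x2, y2) in boxes]
    # total runtime per frame is roughly t_det + N * t_emb for N targets
    return tracker.associate(boxes, embeddings)
```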
In order to save computation, a feasible idea is to integrate the detector and the embedding model into a single network. The two tasks can thus share the same set of low-level features, and re-computation is avoided. One choice for joint detector and embedding learning is to adopt the Faster R-CNN framework [28], a type of two-stage detector. Specifically, the first stage, the region proposal network (RPN), remains the same as in Faster R-CNN and outputs detected bounding boxes; the second stage, Fast R-CNN [11], can be converted into an embedding model by replacing the classification supervision with metric learning supervision [39,36]. In spite of saving some computation, this method is still limited in speed due to its two-stage design, usually running at fewer than 10 frames per second (FPS), far from real-time. Moreover, the runtime of the second stage also increases with the number of targets, as in SDE methods.
This paper is dedicated to improving the efficiency of an MOT system. We introduce an early attempt that Jointly learns the Detector and Embedding model (JDE) in a single-shot deep network. In other words, the proposed JDE employs a single network to simultaneously output detection results and the corresponding appearance embeddings of the detected boxes. In comparison, SDE methods and two-stage methods are characterized by re-sampled pixels (bounding boxes) and feature maps, respectively; both the bounding boxes and the feature maps are fed into a separate re-ID model for appearance feature extraction. Figure 1 briefly illustrates the difference between the SDE methods, the two-stage methods, and the proposed JDE. Our method is near real-time while being almost as accurate as the SDE methods. For example, we obtain a running speed of 20.2 FPS with MOTA = 64.4% on the MOT-16 test set. In comparison, Faster R-CNN + QAN embedding [40] runs at less than 6 FPS with MOTA = 66.1% on the MOT-16 test set.
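As an illustration of what "simultaneously output detections and the corresponding embeddings" means in practice, the sketch below shows a prediction head that emits detection outputs and a dense embedding map from the same shared feature map in a single forward pass. The layer names, channel counts, and embedding dimension are assumptions for illustration only, not the exact JDE configuration (the real head also carries a classification term and anchors per FPN scale).

```python
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Illustrative joint prediction head: box/objectness outputs and appearance
    embeddings are computed from the same backbone feature map."""
    def __init__(self, in_ch=256, num_anchors=4, emb_dim=512):
        super().__init__()
        # 4 box-regression values + 1 objectness score per anchor (simplified)
        self.det = nn.Conv2d(in_ch, num_anchors * (4 + 1), kernel_size=3, padding=1)
        # one embedding vector per spatial location
        self.emb = nn.Conv2d(in_ch, emb_dim, kernel_size=3, padding=1)

    def forward(self, feat):          # feat: B x C x H x W from the shared backbone
        det_out = self.det(feat)      # detection branch
        emb_out = self.emb(feat)      # appearance embedding branch
        return det_out, emb_out

head = JointHead()
det, emb = head(torch.randn(1, 256, 19, 34))  # one forward pass yields both outputs
```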
To build a joint learning framework with high efficiency and accuracy, we explore and deliberately design the following fundamental aspects: training data, network architecture, learning objectives, optimization strategies, and validation metrics. First, we collect six publicly available datasets on pedestrian detection and person search to form a unified large-scale multi-label dataset. In this unified dataset, all the pedestrian bounding boxes are labeled, and a portion of the pedestrian identities are labeled. Second, we choose the Feature Pyramid Network (FPN) [21] as our base architecture and discuss with which type of loss function the network learns the best embeddings. Then, we model the training process as a multi-task learning problem with anchor classification, box regression, and embedding learning. To balance the importance of each individual task, we employ task-dependent uncertainty [16] to dynamically weight the heterogeneous losses. A simple and fast association algorithm is proposed to further improve efficiency. Finally, we employ the following evaluation metrics. Average precision (AP) is employed to evaluate the performance of the detector. The retrieval metric true accept rate (TAR) at a given false alarm rate (FAR) is adopted to evaluate the quality of the embeddings. The overall MOT accuracy is evaluated by the CLEAR metrics [2], in particular the MOTA score. This paper also provides new settings and baselines for joint detection and embedding learning, which we believe will facilitate research towards real-time MOT.
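The task-dependent uncertainty weighting of [16] can be sketched as follows: each task loss is scaled by a learnable log-variance term, so noisier tasks are automatically down-weighted, while an additive regularizer prevents the weights from collapsing to zero. This is a common parameterization in the spirit of [16]; the exact formulation used in the paper may differ.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Uncertainty-based multi-task loss weighting (sketch following [16])."""
    def __init__(self, num_tasks=3):
        super().__init__()
        # s_i = log(sigma_i^2), one learnable scalar per task
        # (e.g. anchor classification, box regression, embedding learning)
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for loss, s in zip(losses, self.log_vars):
            # exp(-s) down-weights high-variance (noisy) tasks; +s regularizes s
            total = total + 0.5 * (torch.exp(-s) * loss + s)
        return total
```

During training, `losses` would be the list of per-task loss tensors, and the `log_vars` are optimized jointly with the network parameters.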
The contributions of our work are summarized as follows:
- We introduce JDE, a single-shot framework for joint detection and embedding learning. It runs in (near) real-time and is comparable in accuracy to state-of-the-art separate detection + embedding (SDE) methods.
- We conduct thorough analysis and experiments on how to build such a joint learning framework from multiple aspects, including training data, network architecture, learning objectives, and optimization strategy.
- Experiments with the same training data show that JDE performs as well as a range of strong SDE model combinations while achieving the fastest speed.
- Experiments on MOT-16 demonstrate the advantage of our method over state-of-the-art MOT systems considering the amount of training data, accuracy, and speed.
"Embedding" comes from topology and, in deep learning, is often used together with the notion of a manifold. (Source: Zhihu)
- A sphere in three-dimensional space is a 2D manifold embedded in 3D space, because any point on the sphere can be specified by just two coordinates, latitude and longitude.
- A rotation in two-dimensional space is a 2×2 matrix, yet it can be fully described by a single angle; this is a one-dimensional manifold embedded in the space of 2×2 matrices.
The task of deep learning is to map high-dimensional raw data (images, sentences) onto a low-dimensional manifold, so that the data become separable after the mapping; this mapping is called an embedding. Word embedding, for example, maps sentences composed of words into representation vectors. Later, for some reason, people began calling the representation vectors on the low-dimensional manifold themselves "embeddings", which is actually a misuse of the term...
Under the understanding now common in the deep learning community (which deviates from the original meaning), an embedding is simply a feature extracted from the raw data, i.e., the low-dimensional vector obtained by mapping the data through a neural network.
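To make the "embedding = low-dimensional feature vector" reading concrete, here is a toy example (the network is a stand-in, not a real re-ID model) that maps an image crop to a 64-dimensional vector and compares two crops with cosine similarity, which is how appearance embeddings are typically used during data association.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an appearance embedding network: maps an image crop
# (high-dimensional raw data) to a low-dimensional representation vector.
embed_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64),
)

crop_a = torch.randn(1, 3, 128, 64)            # two pedestrian crops (random tensors here)
crop_b = torch.randn(1, 3, 128, 64)
emb_a = F.normalize(embed_net(crop_a), dim=1)  # 64-d embeddings on the unit sphere
emb_b = F.normalize(embed_net(crop_b), dim=1)
similarity = (emb_a * emb_b).sum(dim=1)        # cosine similarity used for association
```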