ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes
Abstract
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available: current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we design an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
1. Introduction
Since the introduction of commodity RGB-D sensors, such as the Microsoft Kinect, the field of 3D geometry capture has gained significant attention and opened up a wide range of new applications. Although there has been significant effort on 3D reconstruction algorithms, general 3D scene understanding with RGB-D data has only very recently started to become popular. Research on semantic understanding is also heavily facilitated by the rapid progress of modern machine learning methods, such as neural models. One key to successfully applying these approaches is the availability of large, labeled datasets. While much effort has been made on 2D datasets [17, 44, 47], where images can be downloaded from the web and directly annotated, the situation for 3D data is more challenging. Thus, many of the current RGB-D datasets [74, 92, 77, 32] are orders of magnitude smaller than their 2D counterparts. Typically, 3D deep learning methods use synthetic data to mitigate this lack of real-world data [91, 6].
One of the reasons that current 3D datasets are small is that their capture requires much more effort, and efficiently providing (dense) annotations in 3D is non-trivial. Thus, existing work on 3D datasets often falls back to polygon or bounding box annotations on 2.5D RGB-D images [74, 92, 77], rather than directly annotating in 3D. In the latter case, labels are added manually by expert users (typically by the paper authors) [32, 71], which limits their overall size and scalability.
In this paper, we introduce ScanNet, a dataset of richly annotated RGB-D scans of real-world environments containing 2.5M RGB-D images in 1513 scans acquired in 707 distinct spaces. The sheer magnitude of this dataset is larger than any other [58, 81, 92, 75, 3, 71, 32]. However, what makes it particularly valuable for research in scene understanding is its annotation with estimated calibration parameters, camera poses, 3D surface reconstructions, textured meshes, dense object-level semantic segmentations, and aligned CAD models (see Fig. 2). The semantic segmentations are more than an order of magnitude larger than any previous RGB-D dataset.
In the collection of this dataset, we have considered two main research questions: 1) how can we design a framework that allows many people to collect and annotate large amounts of RGB-D data, and 2) can we use the rich annotations and data quantity provided in ScanNet to learn better 3D models for scene understanding?
To investigate the first question, we built a capture pipeline to help novices acquire semantically labeled 3D models of scenes. A person uses an app on an iPad mounted with a depth camera to acquire RGB-D video, and then we process the data offline and return a complete semantically labeled 3D reconstruction of the scene. The challenges in developing such a framework are numerous, including how to perform 3D surface reconstruction robustly in a scalable pipeline and how to crowdsource semantic labeling. The paper discusses our study of these issues and documents our experience with scaling up RGB-D scan collection (20 people) and annotation (500 crowd workers).
To investigate the second question, we trained 3D deep networks with the data provided by ScanNet and tested their performance on several scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. For the semantic voxel labeling task, we introduce a new volumetric CNN architecture.
Overall, the contributions of this paper are:
- A large 3D dataset containing 1513 RGB-D scans of over 707 unique indoor environments with estimated camera parameters, surface reconstructions, textured meshes, and semantic segmentations. We also provide CAD model placements for a subset of the scans.
- A design for efficient 3D data capture and annotation suitable for novice users.
- New RGB-D benchmarks and improved results for state-of-the-art machine learning methods on 3D object classification, semantic voxel labeling, and CAD model retrieval.
- A complete open source acquisition and annotation framework for dense RGB-D reconstructions.
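To make the dataset description above concrete, the following is a minimal sketch of how one annotated scan could be represented and how the headline counts relate to each other. The class `ScanRecord`, its field names, and the helper `summarize` are illustrative assumptions, not the released ScanNet format or API. The only figures taken from the text are 2.5M views, 1513 scans, and 707 spaces, and 2.5M is a rounded count.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ScanRecord:
    """One annotated scan. All field names here are illustrative assumptions."""
    scan_id: str
    space_id: str            # several scans can cover the same physical space
    num_frames: int          # RGB-D frames in the captured video
    has_camera_poses: bool = True
    has_textured_mesh: bool = True
    has_semantic_segmentation: bool = True
    cad_model_ids: List[str] = field(default_factory=list)  # empty for most scans


def summarize(records: Iterable[ScanRecord]) -> Dict[str, float]:
    """Aggregate counts comparable to the headline figures quoted in the text."""
    records = list(records)
    spaces = {r.space_id for r in records}
    frames = sum(r.num_frames for r in records)
    return {
        "scans": len(records),
        "spaces": len(spaces),
        "frames": frames,
        "frames_per_scan": frames / len(records) if records else 0.0,
        "scans_per_space": len(records) / len(spaces) if spaces else 0.0,
    }


# Ratios implied by the figures quoted above (2.5M is a rounded count).
print(round(2_500_000 / 1513))   # average frames per scan, about 1652
print(round(1513 / 707, 2))      # average scans per space, about 2.14
```

Read this way, the quoted totals imply roughly 1650 frames per scan and a little over two scans per distinct space on average.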
Figure 1. Example reconstructed spaces in ScanNet annotated with instance-level object category labels through our crowdsourced annotation framework.
Figure 2. Overview of our RGB-D reconstruction and semantic annotation framework. Left: a novice user uses a handheld RGB-D device with our scanning interface to scan an environment. Middle: RGB-D sequences are uploaded to a processing server, which produces 3D surface mesh reconstructions and their surface segmentations. Right: semantic annotation tasks are issued for crowdsourcing to obtain instance-level object category annotations and 3D CAD model alignments to the reconstruction.
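The three panels of Figure 2 can be read as three stages applied to each scan in order. The sketch below only illustrates that ordering, under stated assumptions: `ScanJob`, the stage functions, the artifact keys, and the scan identifier `scan_0001` are all hypothetical, and each stage merely records a placeholder path instead of doing any real capture, reconstruction, or annotation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ScanJob:
    """State of one scan as it moves through the pipeline (hypothetical)."""
    scan_id: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


def capture(job: ScanJob) -> None:
    # Left panel: a novice records an RGB-D sequence with a handheld device.
    job.artifacts["rgbd_sequence"] = f"{job.scan_id}/sequence"


def reconstruct(job: ScanJob) -> None:
    # Middle panel: the server builds a surface mesh and segments its surface.
    job.artifacts["mesh"] = f"{job.scan_id}/mesh"
    job.artifacts["surface_segments"] = f"{job.scan_id}/segments"


def annotate(job: ScanJob) -> None:
    # Right panel: crowd workers add instance labels and align CAD models.
    job.artifacts["instance_labels"] = f"{job.scan_id}/labels"
    job.artifacts["cad_alignments"] = f"{job.scan_id}/cad"


# Stage order follows the left-to-right reading of the figure.
STAGES: List[Callable[[ScanJob], None]] = [capture, reconstruct, annotate]


def run_pipeline(scan_id: str) -> ScanJob:
    """Run every stage in order and record which stages finished."""
    job = ScanJob(scan_id)
    for stage in STAGES:
        stage(job)
        job.completed.append(stage.__name__)
    return job


job = run_pipeline("scan_0001")
print(job.completed)
```

Keeping the stages as a plain ordered list reflects the offline, batch-style processing described in the introduction: capture happens on the device, and everything after it can run later on a server.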