Joint 3D Layout and Depth Prediction from a Single Indoor Panorama Image
Abstract
In this paper, we propose a method which jointly learns the layout prediction and depth estimation from a single indoor panorama image. Previous methods have considered layout prediction and depth estimation from a single panorama image separately. However, these two tasks are tightly intertwined. Leveraging the layout depth map as an intermediate representation, our proposed method outperforms existing methods for both panorama layout prediction and depth estimation. Experiments on the challenging real-world dataset of Stanford 2D-3D demonstrate that our approach obtains superior performance for both the layout prediction task (3D IoU: 85.81% vs. 79.79%) and the depth estimation task (Abs Rel: 0.068 vs. 0.079).
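What this paper does: learns layout prediction and depth estimation jointly from a single panorama image.
Why: layout prediction and depth estimation are tightly coupled.
How: the layout depth map links the two tasks; it serves as an intermediate representation that benefits both layout prediction and depth estimation.
Results: good; the method outperforms prior work on both tasks.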
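For reference, the Abs Rel figure quoted in the abstract is a depth error metric. The sketch below assumes the commonly used definition, the mean of |d − g| / g over pixels with valid ground truth; the function name `abs_rel` and the toy values are illustrative and not taken from the paper's code.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy example: depths in metres, flattened to a list.
pred = [2.1, 3.8, 1.0]
gt = [2.0, 4.0, 0.0]  # 0.0 marks a missing measurement and is skipped
print(abs_rel(pred, gt))
```

Lower is better, so 0.068 vs. 0.079 means the relative depth error drops by roughly one percentage point.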
Fig. 1: Given (a) an indoor panorama as input, our proposed method utilizes the (b) coarse depth estimation to compute the (c) layout depth map. Leveraging the estimated layout depth map, our method improves the (d) 3D layout prediction and (e) refines the depth estimation (e.g. the ambiguous window depth is inferred correctly compared to the coarse depth estimation)
Fig. 2: Illustration of the layout depth maps. From left to right: the panorama input image, the original layout corner map and the layout depth map
Introduction
......
Although scene layout and depth can both be used for 3D scene understanding, previous methods focus on solving these two problems separately. For 3D layout prediction, methods mostly use 2D geometrical cues such as edges [20, 25, 35], corners [16, 25, 35], 2D floor-plans [19, 30] or they make assumptions about the 3D scene geometry such that rooms are modelled by cuboids or by a Manhattan World. For depth estimation, different features are used such as normals [17], planar surfaces [21] and semantic cues [22]. Hence, existing methods impose geometric assumptions but fail to exploit the complementary characteristics of layout and depth information. [motivation of this paper]
In this paper, a different approach is taken. We propose a method that, from a single panorama, jointly exploits the 3D layout and depth cues via an intermediate layout depth map, as shown in Fig. 1.
The intermediate layout depth map represents the distances from the camera to the room layout components (e.g. ceiling, floor and walls) and excludes all objects in the room (e.g. furniture), as illustrated in Fig. 2. Estimating the layout depth as an intermediate representation of the network encompasses the geometric information needed for both tasks.
The use of depth information is beneficial to produce room layouts by reducing the complexity of object clutter and occlusion. Likewise, the use of room layout information diminishes the ambiguity of depth estimation and interposes planar information for the room layout parts (e.g. ceiling, floor and walls).
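The gap identified: layout and depth can both be used for 3D scene understanding, yet previous work studies the two separately.
For example, 3D layout prediction usually relies on 2D geometric cues (such as edges, corners and 2D floor plans), or makes assumptions about the 3D scene geometry, e.g. that rooms are cuboids or follow a Manhattan world.
For example, depth estimation uses various features (such as normals, planar surfaces and semantic cues).
Existing methods therefore impose geometric assumptions but overlook the complementary nature of layout and depth information.
This paper combines the two.
The layout depth map represents the distance from the camera to the room layout components and excludes all objects in the room.
Estimating layout depth as an intermediate representation of the network captures the geometric information both tasks need.
How depth estimation helps layout prediction: depth information reduces the complexity introduced by object clutter and occlusion, which makes room layouts easier to produce.
How layout prediction helps depth estimation: room layout information reduces the ambiguity of depth estimation and supplies planar information for the room layout parts.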
The proposed method estimates the 3D layout and detailed depth information from a single panorama image. To combine the depth and layout information, the proposed method predicts the layout depth map to relate these two tightly intertwined tasks.
Previous methods on layout prediction provide proper reconstruction by predicting the layout edges and corners on the input panorama and by post-processing them to match the (Manhattan) 3D layout [16, 25, 35]. However, object clutter in the room poses a challenge to extract occluded edges and corners. In addition, estimating the 3D layout from 2D edge and corner maps is an ill-posed problem. Therefore, extra constraints are essential to perform 2D to 3D conversion in the optimization. [problems with layout prediction alone]
In contrast, our method estimates the layout depth map by using more structural information to become less influenced by occlusions. Furthermore, the predicted layout depth map serves as a coarse 3D layout as it can be converted to the 3D point cloud of the scene layout. Thus the proposed method does not require extra constraints for the 2D to 3D conversion. This makes the proposed method more generic for parameterizing a 3D layout. After computing the estimated layout depth maps, the proposed method further enables the refinement of a detailed depth map. [advantages of this paper's layout prediction]
Monocular depth estimation methods usually have problems with planar room parts (ceiling, floor and walls) being rugged after the 3D reconstruction process. [problems with depth estimation alone]
The layout depth map preserves the planar nature of the room layout components, yielding robustness to these errors. [advantages of this paper's depth estimation]
Empirical results on the challenging Stanford 2D-3D indoor dataset show that jointly estimating 3D layout and depth outperforms previous methods for both tasks. The proposed method achieves state-of-the-art performance for both layout prediction and depth estimation from a single panorama image on the Stanford 2D-3D dataset. Our method also obtains state-of-the-art performance for 3D layout prediction on the PanoContext dataset. [experimental conclusions]
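Problems with layout prediction alone: object clutter makes occluded edges and corners hard to extract; estimating a 3D layout from 2D edge and corner maps is ill-posed, so extra constraints are essential for the 2D-to-3D conversion during optimization.
Advantages of this paper's layout prediction: it uses more structural information to estimate the layout depth map, which reduces the influence of occlusions; the 2D-to-3D conversion needs no extra constraints; and the estimated layout depth map can be used to further refine the depth estimation.
Problems with depth estimation alone: planar surfaces become rugged after 3D reconstruction.
Advantages of this paper's depth estimation: the layout depth map preserves the planarity of the room layout components, which makes it robust to these errors.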
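The claim that a layout depth map can be converted to a 3D point cloud without extra constraints can be illustrated with a small back-projection. This is a minimal sketch, assuming an equirectangular panorama, depth stored as distance from the camera along each viewing ray, and a z-up axis convention; the function names and the axis convention are assumptions for illustration, not the paper's implementation.

```python
import math


def pixel_to_point(u, v, depth, width, height):
    """Back-project one equirectangular pixel, given its distance from the camera."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # latitude, +pi/2 at the top row
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.cos(lat) * math.cos(lon)
    z = depth * math.sin(lat)                            # assumed convention: z points up
    return (x, y, z)


def depth_map_to_points(depth_map):
    """Convert a depth map (list of rows) into a list of 3D points."""
    height, width = len(depth_map), len(depth_map[0])
    return [pixel_to_point(u, v, depth_map[v][u], width, height)
            for v in range(height) for u in range(width)]


# A constant depth map back-projects to points on a sphere of that radius.
points = depth_map_to_points([[2.0] * 4 for _ in range(2)])
print(len(points))
```

Because every pixel already carries a distance, each one maps to a single 3D point; no Manhattan or cuboid prior is needed for this step.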
In summary, our contributions are as follows:
– We propose a novel neural network pipeline which jointly learns layout prediction and depth estimation from a single indoor panorama image. We show that layout and depth estimation tasks are highly correlated and joint learning improves the performance for both tasks.
– We show that leveraging the layout depth map as an intermediate representation improves the layout prediction performance and refines the depth estimation.
– The proposed method outperforms the state-of-the-art methods for both layout prediction and depth estimation on the challenging real-world dataset Stanford 2D-3D and PanoContext dataset for layout prediction.
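Contributions:
A neural network pipeline that learns the two tasks jointly;
The layout depth map as the bridge between the two tasks;
Strong results on both tasks.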
Related Work
Depth Estimation
Single-view depth estimation refers to the problem of estimating depth from a single 2D image.
Eigen et al. [9] show that it is possible to produce pixel depth estimations using a two scale deep network which is trained on images with their corresponding depth values. Several methods extend this approach by introducing new components such as CRFs to increase the accuracy [17], changing the loss from regression to classification [2], using other more robust loss functions [15], and by incorporating scene priors [29].
Zioulis et al. [34] propose a learning framework to estimate the depth of a scene from a single 360° panorama image. Eder et al. [7] present a method to train a plane-aware convolutional network for dense depth and surface normal estimation from panoramas. There are some other methods [6, 27] to regress the layered depth image (LDI) to capture the occluded texture and depth. In our work, we demonstrate that the layout prediction and depth estimation are tightly coupled and can benefit from each other. Leveraging the estimated layout depth map, our method refines the depth estimation.
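Depth estimation work covers two settings, ordinary 2D images and panoramas:
2D images: the earliest is Eigen et al. [9]; later methods add new components such as CRFs [17], change the loss from regression to classification [2], use more robust loss functions [15], or incorporate scene priors [29].
Panoramas: Zioulis et al. [34] estimate depth from a single 360° panorama; Eder et al. [7] train a plane-aware convolutional network; other methods [6, 27] regress layered depth images (LDI).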
Method
Inferring high-quality 3D room layout from an indoor panorama image relies on the understanding of both the 3D geometry and the semantics of the indoor scene. Therefore, the proposed method uses the predicted coarse depth map and semantic segmentation of the input panorama to predict the layout depth map. The proposed method enables the refinement of depth estimation by integrating the coarse depth and layout depth with semantic information as a guidance.
Fig. 3: Overview of the proposed pipeline. Our method first leverages the coarse depth and semantic prediction to enforce the layout depth prediction, and then uses the estimated layout depth map to recover the 3D layout and refine the depth estimation.
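Overview of the algorithm: inferring a high-quality 3D room layout from an indoor panorama requires understanding both the 3D geometry and the semantics of the scene. The method therefore predicts the layout depth map from the predicted coarse depth map and the semantic segmentation of the input panorama, then refines the depth estimation by combining coarse depth and layout depth under semantic guidance.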
Input and Pre-processing
Following [35], the first step of our method is to align the input panorama image to match the horizontal floor plane. The floor plane direction under equirectangular projection is estimated by first selecting the long line segments using the Line Segment Detector (LSD) [28] in overlapping perspective views and then voting for three mutually orthogonal vanishing directions [33]. This alignment ensures that wall-wall boundaries are vertical lines. The input of our network is the concatenation of the panorama image and the corresponding Manhattan line feature map provided by the alignment.
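In short: the panorama is first aligned to the horizontal floor plane. Long line segments are detected with LSD in overlapping perspective views and vote for three mutually orthogonal vanishing directions, which gives the floor plane direction under equirectangular projection. After alignment, wall-wall boundaries are vertical lines. The network input is the panorama concatenated with the corresponding Manhattan line feature map.
What does the Manhattan line feature map look like? See the figure below: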
[35] CVPR 2018: LayoutNet: Reconstructing the 3D room layout from a single RGB image
Coarse Depth and Semantics
Our approach receives the concatenation of a single RGB panorama and the Manhattan line feature map as input. The output of this module is the coarse depth estimation and semantic segmentation of the 2D panorama image.
First, the first module:
Input: the panorama image and the Manhattan line map;
Output: coarse depth estimation and semantic segmentation;
An encoder-decoder architecture is used for the joint learning of the coarse depth information and semantic segmentation. The input panorama images suffer from horizontal distortions. To reduce the distortion effect, the encoder uses a modified input block in front of the ResNet-18 architecture. As shown by [34], the input block uses rectangle filters and varies the resolution to account for different distortion levels. The encoder is shared for both the depth estimation and semantic segmentation. The decoders restore the original input resolution by means of up-sampling operators followed by 3 × 3 convolutions. Skip connections are also added to link to the corresponding resolution in the encoder. The two decoders do not share weights and are trained to minimize the coarse depth estimation loss and semantic segmentation loss, respectively.
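Network structure:
Encoder: depth estimation and semantic segmentation share one encoder; the encoder places a modified input block in front of the ResNet-18 architecture, whose filters vary in shape and resolution in order to account for the different distortion levels of the panorama;
Decoders: depth estimation and semantic segmentation use two separate decoders, each optimized under its own loss;
Skip connections: link each decoder stage to the encoder features of the corresponding resolution.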
[34] ECCV 2018: OmniDepth: Dense depth estimation for indoors spherical panoramas
- Loss Function
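where:
the depth symbols denote the predicted and ground-truth depth maps, and n is the total number of pixels;
the gradient symbol denotes the spatial derivative of e_i with respect to x, computed at the i-th pixel;
the normal symbols denote the surface normals of the estimated depth map and of the ground truth;
p and p̂ are the ground-truth and predicted semantic labels.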
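To make these terms concrete, here is a minimal sketch of how losses of this kind are commonly computed. It assumes a log-scaled L1 depth term, mean(ln(e_i + 1)) with e_i = |d_i − g_i|, the same log penalty applied to finite differences of e_i in x and y for the gradient term, and a per-pixel cross-entropy for the semantic labels. These forms, the function names and the forward-difference scheme are assumptions for illustration; the normal term and the weights between terms are left out.

```python
import math


def log_l1_depth_loss(pred, gt):
    """Assumed depth term: mean of ln(e_i + 1), with e_i = |pred_i - gt_i|."""
    h, w = len(gt), len(gt[0])
    return sum(math.log(abs(pred[r][c] - gt[r][c]) + 1.0)
               for r in range(h) for c in range(w)) / (h * w)


def log_gradient_loss(pred, gt):
    """Assumed gradient term: log penalty on forward differences of e in x and y."""
    h, w = len(gt), len(gt[0])
    e = [[abs(pred[r][c] - gt[r][c]) for c in range(w)] for r in range(h)]
    total = 0.0
    for r in range(h):
        for c in range(w):
            dx = e[r][c + 1] - e[r][c] if c + 1 < w else 0.0  # no wrap-around at the border
            dy = e[r + 1][c] - e[r][c] if r + 1 < h else 0.0
            total += math.log(abs(dx) + 1.0) + math.log(abs(dy) + 1.0)
    return total / (h * w)


def cross_entropy(probs, label):
    """Cross-entropy for one pixel: predicted class probabilities vs. true class index."""
    return -math.log(probs[label])


gt = [[1.0, 2.0], [3.0, 4.0]]
shifted = [[v + 1.0 for v in row] for row in gt]  # constant offset of one metre
print(log_l1_depth_loss(shifted, gt), log_gradient_loss(shifted, gt))
```

A constant offset is penalized by the depth term but not by the gradient term, which is why the two are used together: one measures pixel-wise accuracy, the other spatial consistency of the error.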
Layout Prediction
To obtain the global geometric structure of the scene, the proposed approach predicts the 3D layout of the scene. Instead of predicting 2D representations, our method directly predicts the layout depth maps of the input panoramas.
The input of this proposed module is an 8-channel feature map: the concatenation of RGB panorama, the corresponding Manhattan line feature map, and the predicted depth and semantics obtained by the previous modules of the pipeline. A ResNet-18 is used to build our encoder for the layout depth prediction network. The decoder architecture is similar to the previous ones for depth estimation and semantic segmentation, with nearest neighbor up-sampling operations followed by 3×3 convolutions. The skip connections are also added to prevent shifting of the prediction results during the up-sampling step. The output is the estimated layout depth map with the same resolution as the input panorama.
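In short: to capture the global geometric structure of the scene, the method predicts the 3D layout, and it does so by predicting the layout depth map of the input panorama directly rather than a 2D representation.
The module takes an 8-channel feature map: the RGB panorama, the corresponding Manhattan line feature map, and the depth and semantics predicted by the earlier modules. The encoder is a ResNet-18; the decoder mirrors the earlier depth and segmentation decoders (nearest-neighbor up-sampling followed by 3×3 convolutions), with skip connections to keep the prediction from shifting during up-sampling. The output is a layout depth map at the resolution of the input panorama.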
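Two details of this module can be sketched in a few lines: the channel count of the input and the nearest-neighbor up-sampling used in the decoder. The channel split below (3 + 3 + 1 + 1) is an assumption that is merely consistent with the stated total of 8; the text does not spell it out. The helper `upsample_nearest` is a plain-Python illustration, not the network code.

```python
# Assumed channel split; only the total of 8 is stated in the text.
INPUT_CHANNELS = {"rgb": 3, "manhattan_lines": 3, "coarse_depth": 1, "semantics": 1}


def upsample_nearest(feature, factor=2):
    """Nearest-neighbor up-sampling of a 2D map (list of rows) by an integer factor."""
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(factor)]  # repeat each column
        out.extend(list(wide) for _ in range(factor))           # repeat each row
    return out


print(sum(INPUT_CHANNELS.values()))
print(upsample_nearest([[1, 2], [3, 4]]))
```

Nearest-neighbor up-sampling only copies values, so the 3×3 convolution that follows it is what smooths the blocky result; the skip connections supply the fine spatial detail that copying alone cannot recover.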
- Loss Function
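Here the first three terms are the same as above, and the fourth term is built on N, the normal vector of a virtual plane. N is defined from P_i = (P_a, P_b, P_c), three points in space.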
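The normal of a plane through three points is the normalized cross product of two edge vectors. The sketch below shows that computation; the function name `plane_normal` and the handling of collinear points are illustrative choices, and how the triplets are sampled or how the normals enter the loss is not shown.

```python
import math


def plane_normal(pa, pb, pc):
    """Unit normal of the plane through three 3D points, via the cross product."""
    ab = [b - a for a, b in zip(pa, pb)]
    ac = [c - a for a, c in zip(pa, pc)]
    n = [ab[1] * ac[2] - ab[2] * ac[1],
         ab[2] * ac[0] - ab[0] * ac[2],
         ab[0] * ac[1] - ab[1] * ac[0]]
    length = math.sqrt(sum(v * v for v in n))
    if length == 0.0:
        raise ValueError("points are collinear, so no plane is defined")
    return [v / length for v in n]


# Three points on the floor plane z = 0.
print(plane_normal((0, 0, 0), (1, 0, 0), (0, 1, 0)))
```

Because the three points may lie far apart in the scene, a normal computed this way captures large-scale planarity, which matches the aim of keeping walls, floor and ceiling flat.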
Depth Refinement
A straightforward way is to concatenate all the data representations as input and use an encoder-decoder network to predict the final depth estimation. This approach is denoted by direct refinement. The semantic approach is to use the semantic information as a guidance to dynamically fuse the two depth maps. This approach is denoted by semantic-guided refinement. The semantic-guided refinement step produces an attention map incorporating the coarse depth map and the layout depth map. For a structural background representing the scene layout components (ceiling, floor and wall), the network focuses more on the layout depth map. While for objects in the room (furniture), the network switches the attention to the coarse depth estimation. Therefore, in this paper, we combine these two concepts as shown in Fig. 3. First, an encoder-decoder network, taking the concatenation of the coarse depth, layout depth and semantic segmentation prediction as inputs, combines the previous depth maps with the semantic-guided attention map. This semantic-guided depth fusion maximizes the exploitation of the coarse depth and layout depth. Then, the depth refinement module takes the fused depth as input to predict the final refined depth. The encoder-decoder architecture of the depth refinement module is similar to the previous coarse depth estimation network.
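In short: direct refinement concatenates all representations and lets an encoder-decoder predict the final depth; semantic-guided refinement uses the semantics to produce an attention map that dynamically fuses the two depth maps. For structural background (ceiling, floor and walls) the network attends more to the layout depth map; for objects in the room (furniture) it shifts attention to the coarse depth. The paper combines the two ideas (Fig. 3): an encoder-decoder first takes the concatenated coarse depth, layout depth and semantic segmentation and fuses the depth maps with the semantic-guided attention map; the depth refinement module, whose encoder-decoder resembles the coarse depth network, then takes the fused depth and predicts the final refined depth.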
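The fusion step can be pictured as a per-pixel blend of the two depth maps. The sketch below assumes the attention map holds one weight in [0, 1] per pixel and that fusion is a convex combination; in the method the attention map is produced by a network, and the exact fusion formula is an assumption here, as is the function name `fuse_depth`.

```python
def fuse_depth(coarse, layout, attention):
    """Per-pixel blend: weights near 1 trust the layout depth, near 0 the coarse depth."""
    return [[a * l + (1.0 - a) * c for c, l, a in zip(row_c, row_l, row_a)]
            for row_c, row_l, row_a in zip(coarse, layout, attention)]


coarse = [[2.0, 1.5]]     # second pixel lies on a piece of furniture
layout = [[3.0, 3.0]]     # the wall behind both pixels
attention = [[1.0, 0.0]]  # wall pixel -> layout depth, furniture pixel -> coarse depth
print(fuse_depth(coarse, layout, attention))
```

The fused map keeps planar structure where the layout is visible and object detail where furniture occludes it; the refinement network then works from this fused depth.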
- Loss Function
The loss function for depth refinement is the same as the layout depth estimation loss.
Finally, a few sets of experimental results:
Fig. 4: Qualitative comparison on layout prediction. Results are shown of testing the baseline LayoutNet [35] (blue), our proposed method (green) and the ground truth (orange) on the Stanford 2D-3D dataset and PanoContext dataset
Fig. 5: Qualitative results of non-cuboid layout prediction. It can be derived that our proposed method also works well for non-cuboid layouts
Fig. 6: Qualitative comparison on depth estimation. Results are shown for testing the baseline RectNet [34], Plane-aware network [7] and our proposed method on the Stanford 2D-3D dataset
Fig. 7: Comparison of the derived surface normal from the depth estimation. Our proposed method produces smoother surfaces for planar regions