Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation
Abstract
This paper addresses the problem of depth estimation from a single still image. Inspired by recent work on multi-scale convolutional neural networks (CNNs), we propose a deep model that fuses complementary information derived from multiple CNN side outputs. Unlike previous methods, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two variants, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish new state-of-the-art results on publicly available datasets.
- 1. Introduction
While estimating the depth of a scene from a single image is a natural ability for humans, devising computational models that accurately predict depth information from RGB data is a challenging task. Many attempts have been made to address this problem in the past. In particular, recent works have achieved remarkable performance thanks to powerful deep learning models [8, 9, 20, 24]. Assuming the availability of a large training set of RGB-depth pairs, monocular depth prediction is cast as a pixel-level regression problem, and Convolutional Neural Network (CNN) architectures are typically employed.
In the last few years, significant efforts have been made in the research community to improve the performance of CNN models for pixel-level prediction tasks (e.g., semantic segmentation, contour detection). Previous works have shown that, for depth estimation as well as for other pixel-level classification/regression problems, more accurate estimates can be obtained by combining information from multiple scales [8, 33, 6]. This can be achieved in different ways, e.g., by fusing feature maps corresponding to different network layers or by designing an architecture with multiple inputs corresponding to images at different resolutions. Other works have demonstrated that, by adding a Conditional Random Field (CRF) in cascade to a convolutional neural architecture, the performance can be greatly enhanced, and that the CRF can be fully integrated within the deep model, enabling end-to-end training with back-propagation [36]. However, these works mainly focus on pixel-level prediction problems in the discrete domain (e.g., semantic segmentation). Although the two strategies are complementary, so far they have only been considered in isolation, and no previous work has exploited multi-scale information within a CRF inference framework.
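As a rough illustration of the first strategy mentioned above (fusing maps taken from different network layers), the NumPy sketch below brings coarse side outputs to a common resolution and combines them either by averaging or by stacking them along a channel axis. The function names, the nearest-neighbor upsampling, and the integer scale factors are assumptions made for this example; they are not taken from [8, 33, 6].

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a 2-D map by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


def fuse_by_averaging(side_outputs, factors):
    """Upsample every side output to a common size and average them pixel-wise."""
    upsampled = [upsample_nearest(m, f) for m, f in zip(side_outputs, factors)]
    return np.mean(np.stack(upsampled, axis=0), axis=0)


def fuse_by_concatenation(side_outputs, factors):
    """Upsample every side output and stack them as channels (shape: S x H x W).

    In a real network a learned layer would then map the S channels to one
    depth map; that layer is deliberately left out of this sketch.
    """
    upsampled = [upsample_nearest(m, f) for m, f in zip(side_outputs, factors)]
    return np.stack(upsampled, axis=0)


# Hypothetical side outputs at 1/4, 1/2 and full resolution of an 8 x 8 map.
side_outputs = [np.full((2, 2), 1.0), np.full((4, 4), 2.0), np.full((8, 8), 3.0)]
factors = [4, 2, 1]
averaged = fuse_by_averaging(side_outputs, factors)
stacked = fuse_by_concatenation(side_outputs, factors)
```

Both baselines treat every pixel independently, which is the limitation that a CRF-based fusion is meant to address.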
In this paper we argue that, by benefiting from the flexibility and the representational power of graphical models, we can optimally fuse representations derived from multiple CNN side output layers, improving performance over traditional multi-scale strategies. By exploiting this idea, we introduce a novel framework to estimate depth maps from single still images. In contrast to previous work that fuses multi-scale features by averaging or concatenation, we propose to integrate multi-layer side output information by devising a novel approach based on continuous CRFs. Specifically, we present two different methods. The first approach is based on a single multi-scale CRF model, while the other considers a cascade of scale-specific CRFs. We also show that, by introducing a common CNN implementation for mean-field updates in continuous CRFs, both models are equivalent to sequential deep networks and an end-to-end approach can be devised for training. Through extensive experimental evaluation, we demonstrate that the proposed CRF-based method produces more accurate depth maps than traditional multi-scale approaches for pixel-level prediction tasks [10, 33] (Fig. 1). Moreover, by performing experiments on the publicly available NYU Depth V2 [30] and Make3D [29] datasets, we show that our approach outperforms state-of-the-art methods for monocular depth estimation.
Figure 1. (a) Original RGB image. (b) Ground truth. Depth maps obtained by considering a pre-trained CNN (e.g., VGG Convolution-Deconvolution [23]) and fusing multi-layer representations (c) with the approach in [33] and (d) with the proposed multi-scale CRF.
To summarize, the contributions of this paper are threefold. First, we propose a novel approach for predicting depth maps from RGB inputs, which exploits multi-scale estimations derived from CNN inner layers by fusing them within a CRF framework. Second, as the task of pixel-level depth prediction implies inferring a set of continuous values, we show how mean-field (MF) updates can be implemented as sequential deep models, enabling end-to-end training of the whole network. We believe that our MF implementation will be useful not only to researchers working on depth prediction, but also to those interested in other problems involving continuous variables. Therefore, our code is made publicly available. Third, our experiments demonstrate that the proposed multi-scale CRF framework is superior to previous methods that integrate information from intermediate network layers by combining multiple losses [33] or by adopting feature concatenation [10]. We also show that our approach outperforms state-of-the-art depth estimation methods on public benchmarks and that the proposed CRF-based models can be employed in combination with different pre-trained CNN architectures, consistently enhancing their performance.
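The model itself is not spelled out in this section, so the following is only a minimal sketch of a mean-field update for a generic continuous CRF, not the formulation used in the paper. It assumes a single scale, a quadratic unary term, and uniform pairwise weights on the 4-connected pixel grid, i.e. the energy E(y) = Σ_i (y_i − o_i)² + β Σ_{i,j} (y_i − y_j)², where o is an observed depth map (for instance a CNN side output) and the second sum runs over unordered pairs of neighboring pixels. Under these assumptions the mean-field marginals are Gaussian and their means satisfy μ_i = (o_i + β Σ_{j∈N(i)} μ_j) / (1 + β |N(i)|). The names `neighbor_sum`, `mean_field_fuse`, `beta`, and `n_iters` are hypothetical.

```python
import numpy as np


def neighbor_sum(x):
    """Sum over the 4-connected neighbors of every pixel (zero padding).

    This is a convolution with a fixed cross-shaped kernel, which is why
    the update below can be written as a network layer.
    """
    p = np.pad(x, 1, mode="constant")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]


def mean_field_fuse(observed, beta=1.0, n_iters=5):
    """Run n_iters mean-field updates of the assumed Gaussian CRF.

    Each iteration is one 'layer': a fixed convolution (message passing)
    followed by an element-wise normalization. Unrolling the loop yields
    a sequential, differentiable computation.
    """
    observed = np.asarray(observed, dtype=np.float64)
    # Number of valid neighbors per pixel (2 at corners, 3 on edges, 4 inside).
    degree = neighbor_sum(np.ones_like(observed))
    mu = observed.copy()
    for _ in range(n_iters):
        mu = (observed + beta * neighbor_sum(mu)) / (1.0 + beta * degree)
    return mu


# Hypothetical noisy observation standing in for a CNN side output.
rng = np.random.default_rng(0)
noisy_depth = 2.0 + 0.1 * rng.normal(size=(6, 6))
refined_depth = mean_field_fuse(noisy_depth, beta=1.0, n_iters=5)
```

Because every step is a fixed convolution followed by element-wise arithmetic, gradients can flow through the unrolled iterations, which is the property that allows such updates to be trained jointly with the CNN. A learned version would replace the uniform weights and `beta` with trainable parameters.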