Motivation:
In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task’s loss. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task
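【CC】This paper is about optimising the multi-task objective function. It is a classic because its mathematical derivation is both thorough and simple, so it contains many formulas, and the derivation carries over readily to engineering practice. The authors observe that the performance of a multi-task neural network depends on the weight given to each subtask, and they look for a theoretically grounded way to determine those weights, using what is called homoscedastic uncertainty.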
Multi-task learning aims to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation. It can be considered an approach to inductive knowledge transfer which improves generalisation by sharing the domain information between complementary tasks. It does this by using a shared representation to learn multiple tasks – what is learned from one task can help learn other tasks.
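【CC】Put simply, multi-task learning is a shared representation layer plus a set of task heads, which is now a fairly common design. Sharing the representation can be seen as sharing knowledge across tasks, which improves generalisation, so each individual task tends to perform better as well. This is only an intuitive explanation; there are papers that study mathematically what this kind of sharing actually optimises.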
Scene understanding algorithms must understand both the geometry and semantics of the scene at the same time. This forms an interesting multi-task learning problem because scene understanding involves joint learning of various regression and classification tasks with different units and scales.
【CC】Scene understanding serves as the example of a classic multi-task problem: it mixes regression and classification, with outputs in different units and scales. It is introduced here to set up the loss function derivation and the network architecture design that follow.
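Approach:
We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses.
【CC】The key idea up front: use the homoscedastic uncertainty of the data as the subtask weights. I change the order here: the theoretical description is set aside for now and the derivation comes first, because the theory is easier to understand on a second pass.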
Formal derivation:
Multi-task learning concerns the problem of optimising a model with respect to multiple objectives. The naive approach to combining multi objective losses would be to simply perform a weighted linear sum of the losses for each individual task:
【CC】The total loss is just a linear combination of the subtask losses. That said, could the weights w be learned by an MLP instead?
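The weighted linear sum that the notes refer to as equation (1) presumably has the following form. The symbols w_i (task weight) and L_i (task loss) are assumed notation, chosen to match the w1/w2 used later in the commentary.

```latex
\begin{equation}
L_{\text{total}} = \sum_{i} w_i \, L_i \tag{1}
\end{equation}
```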
In this section we derive a multi-task loss function based on maximising the Gaussian likelihood with homoscedastic uncertainty. Let fW(x) be the output of a neural network with weights W on input x. We define the following probabilistic model. For regression tasks we define our likelihood as a Gaussian with mean given by the model output with an observation noise scalar σ.
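【CC】This paragraph is the starting point of the regression derivation and says three things. First, fW(x) is defined as the output of the network with weights W on input x. Second, p(y | fW(x)) is a conditional probability: the probability of observing y given the network's estimate fW(x) for input x. Third, this conditional probability follows a Gaussian with mean fW(x) and variance σ² (a simplifying assumption by the authors; call it constraint 1). The σ here can be estimated from the data, which is why the paper calls it observation noise.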
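From the description above, the regression likelihood referred to as equation (2) can be written as follows, with f^W(x) standing for the fW(x) of the text. This is a reconstruction from the prose, so the exact typesetting is an assumption.

```latex
\begin{equation}
p\left(y \mid f^{W}(x)\right) = \mathcal{N}\left(f^{W}(x),\, \sigma^{2}\right) \tag{2}
\end{equation}
```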
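For classification we often squash the model output through a softmax function, and sample from the resulting probability vector:
【CC】This is the formal description of the classification task. It is the standard approach: the softmax output is treated directly as a probability, so no further explanation is needed.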
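The classification likelihood referred to as equation (3) then plausibly reads as below, again reconstructed from the sentence that introduces it rather than copied from a surviving formula.

```latex
\begin{equation}
p\left(y \mid f^{W}(x)\right) = \operatorname{Softmax}\left(f^{W}(x)\right) \tag{3}
\end{equation}
```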
In the case of multiple model outputs, we often define the likelihood to factorise over the outputs, given some sufficient statistics. We define fW(x) as our sufficient statistics, and obtain the following multi-task likelihood with model outputs y1, …, yK (such as semantic segmentation, depth regression, etc).
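【CC】A standard probability rule is applied here: under an independence assumption given fW(x) (the authors do not state it explicitly, but using this formula implies it; call it constraint 2), the joint probability over the tasks equals the product of the individual probabilities.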
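Under that independence assumption, the factorised multi-task likelihood referred to as equation (4) would take this form, with K the number of outputs as in the quoted text.

```latex
\begin{equation}
p\left(y_1, \dots, y_K \mid f^{W}(x)\right)
  = p\left(y_1 \mid f^{W}(x)\right) \cdots p\left(y_K \mid f^{W}(x)\right) \tag{4}
\end{equation}
```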
In maximum likelihood inference, we maximise the log likelihood of the model. In regression, for example, the log likelihood can be written as
for a Gaussian likelihood (or similarly for a Laplace likelihood) with σ the model’s observation noise parameter – capturing how much noise we have in the outputs.
【CC】This is the log likelihood of equation (2), obtained by taking the log of the Gaussian density. Nothing more to say.
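Taking the log of the Gaussian density in equation (2) and dropping the constant term gives the expression referred to as equation (5). The proportionality sign marks that dropped constant; the layout is a reconstruction.

```latex
\begin{equation}
\log p\left(y \mid f^{W}(x)\right) \propto
  -\frac{1}{2\sigma^{2}} \left\lVert y - f^{W}(x) \right\rVert^{2} - \log \sigma \tag{5}
\end{equation}
```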
Let us now assume that our model output is composed of two vectors y1 and y2, each following a Gaussian distribution:
【CC】This derives the joint likelihood of two Gaussian outputs, using equations (2) and (4). Nothing more to say.
This leads to the minimisation objective, L(W, σ1, σ2), (our loss) for our multi-output model:
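where we wrote L1(W) = ||y1 − fW(x)||² for the loss of the first output variable, and similarly for L2(W).
【CC】Equation (5) is substituted into equation (6); nothing more to say there. The rearrangement introduces L1 and L2 as the loss functions of the subtasks, which requires each subtask objective to be a squared L2 norm (call it constraint 3). With that in place, the later conclusion follows quite naturally.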
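For reference, the two steps discussed here can be written out as below: the two-output Gaussian likelihood (equation 6) and the resulting minimisation objective (equation 7). Both are reconstructed from equations (2), (4) and (5) and the definitions of L1(W) and L2(W) in the text.

```latex
\begin{align}
p\left(y_1, y_2 \mid f^{W}(x)\right)
  &= \mathcal{N}\left(y_1;\, f^{W}(x),\, \sigma_1^{2}\right) \cdot
     \mathcal{N}\left(y_2;\, f^{W}(x),\, \sigma_2^{2}\right) \tag{6} \\
L(W, \sigma_1, \sigma_2)
  &= -\log p\left(y_1, y_2 \mid f^{W}(x)\right) \notag \\
  &\propto \frac{1}{2\sigma_1^{2}} \left\lVert y_1 - f^{W}(x) \right\rVert^{2}
     + \frac{1}{2\sigma_2^{2}} \left\lVert y_2 - f^{W}(x) \right\rVert^{2}
     + \log \sigma_1 \sigma_2 \notag \\
  &= \frac{1}{2\sigma_1^{2}} L_1(W) + \frac{1}{2\sigma_2^{2}} L_2(W)
     + \log \sigma_1 \sigma_2 \tag{7}
\end{align}
```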
We interpret minimising this last objective with respect to σ1 and σ2 as learning the relative weight of the losses L1(W) and L2(W) adaptively, based on the data. As σ1 – the noise parameter for the variable y1 – increases, we have that the weight of L1(W) decreases. On the other hand, as the noise decreases, we have that the weight of the respective objective increases.
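【CC】The end goal is to minimise the loss in equation (7). Comparing equation (1) with equation (7), the factors 1/(2σ1²) and 1/(2σ2²) line up exactly with the weights w1 and w2 in front of the subtask losses, so the variances determine the weights. Put simply: the larger a subtask's variance, the smaller its weight in the total, which matches intuition. Note that everything up to this point is a strict mathematical derivation, and I have flagged the assumptions and constraints along the way.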
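A small numerical sketch may make the balancing effect concrete. For a single task with a fixed loss value L, the term L / (2σ²) + log σ from equation (7) is minimised at σ² = L, so a task with a larger loss settles on a larger σ and hence a smaller weight. The function names and the grid search below are illustrative choices, not anything taken from the paper.

```python
import math

def weighted_term(loss, sigma):
    """One task's contribution to equation (7): loss / (2 sigma^2) + log sigma."""
    return loss / (2.0 * sigma ** 2) + math.log(sigma)

def best_sigma(loss, grid):
    """Brute-force search for the sigma on the grid that minimises the term."""
    return min(grid, key=lambda s: weighted_term(loss, s))

# Candidate sigmas from 0.01 to 10.00 (hypothetical search range).
grid = [0.01 * k for k in range(1, 1001)]

for loss in (0.25, 1.0, 4.0):
    sigma = best_sigma(loss, grid)
    # Setting the derivative -loss / sigma^3 + 1 / sigma to zero gives sigma = sqrt(loss).
    print(loss, sigma, math.sqrt(loss))
```

In a real network L changes as W is trained, so σ is learned jointly with W rather than solved for in closed form; the sketch only shows where the log σ term pushes it.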
We adapt the classification likelihood to squash a scaled version of the model output through a softmax function:
with a positive scalar σ. This can be interpreted as a Boltzmann distribution (also called Gibbs distribution) where the input is scaled by σ² (often referred to as temperature). This scalar is either fixed or can be learnt, where the parameter’s magnitude determines how ‘uniform’ (flat) the discrete distribution is.
【CC】The multi-objective derivation for regression is now complete; next comes the classification objective. Equation (8) is a slight variation of equation (3) that introduces the scaling factor σ, which makes it exactly a Boltzmann distribution. Going by the shape of that distribution, σ controls how peaked or flat it is.
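The effect of the temperature can be checked with a few lines of plain Python. The sketch below divides the logits by σ² before the softmax and reports the entropy of the result; the logit values and helper names are made up for illustration.

```python
import math

def scaled_softmax(logits, sigma):
    """Softmax of logits / sigma^2, i.e. a Boltzmann distribution with temperature sigma^2."""
    scaled = [z / sigma ** 2 for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.1]  # hypothetical model outputs for three classes
for sigma in (0.5, 1.0, 2.0):
    probs = scaled_softmax(logits, sigma)
    print(sigma, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

A larger σ flattens the distribution and raises its entropy, while a smaller σ sharpens it toward the largest logit.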
This relates to its uncertainty, as measured in entropy. The log likelihood for this output can then be written as
with f_c^W(x) the c’th element of the vector fW(x).
【CC】The uncertainty of a Boltzmann distribution is measured by its entropy. The log likelihood of equation (8) then becomes equation (9), where c indexes the classes.
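Written out, the scaled softmax likelihood (equation 8) and its log likelihood (equation 9) would look as follows. The summation index c' over classes is assumed notation; the rest follows from the definition of the softmax.

```latex
\begin{align}
p\left(y \mid f^{W}(x), \sigma\right)
  &= \operatorname{Softmax}\left(\frac{1}{\sigma^{2}} f^{W}(x)\right) \tag{8} \\
\log p\left(y = c \mid f^{W}(x), \sigma\right)
  &= \frac{1}{\sigma^{2}} f_{c}^{W}(x)
     - \log \sum_{c'} \exp\left(\frac{1}{\sigma^{2}} f_{c'}^{W}(x)\right) \tag{9}
\end{align}
```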
Next, assume that a model’s multiple outputs are composed of a continuous output y1 and a discrete output y2, modelled with a Gaussian likelihood and a softmax likelihood, respectively. Like before, the joint loss, L(W, σ1, σ2), is given as
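where again we write L1(W) = ||y1 − fW(x)||² for the Euclidean loss of y1, write L2(W) = −log Softmax(y2, fW(x)) for the cross entropy loss of y2 (with fW(x) not scaled), and optimise with respect to W as well as σ1, σ2.
【CC】Now comes the derivation of what the joint loss looks like when a continuous (regression) loss and a discrete (classification) loss are combined. In the derivation above, step 1 uses equations (2), (3) and (4); step 2 uses equation (5) and the definition of the softmax log likelihood; step 3 is only shorthand, where L1(W) denotes a squared L2 norm and L2(W) denotes a cross entropy.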
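The joint loss referred to as equation (10) can be reconstructed from equations (5) and (9) as shown below. The intermediate lines are worked out here from those two equations, so the grouping of terms is an assumption; the final approximate line matches the form the notes describe.

```latex
\begin{align}
L(W, \sigma_1, \sigma_2)
  &= -\log \left[ \mathcal{N}\left(y_1;\, f^{W}(x),\, \sigma_1^{2}\right) \cdot
     \operatorname{Softmax}\left(y_2 = c;\, f^{W}(x),\, \sigma_2\right) \right] \notag \\
  &= \frac{1}{2\sigma_1^{2}} \left\lVert y_1 - f^{W}(x) \right\rVert^{2} + \log \sigma_1
     - \log p\left(y_2 = c \mid f^{W}(x), \sigma_2\right) \notag \\
  &= \frac{1}{2\sigma_1^{2}} L_1(W) + \frac{1}{\sigma_2^{2}} L_2(W) + \log \sigma_1
     + \log \frac{\sum_{c'} \exp\left(\frac{1}{\sigma_2^{2}} f_{c'}^{W}(x)\right)}
                 {\left(\sum_{c'} \exp\left(f_{c'}^{W}(x)\right)\right)^{1/\sigma_2^{2}}} \notag \\
  &\approx \frac{1}{2\sigma_1^{2}} L_1(W) + \frac{1}{\sigma_2^{2}} L_2(W)
     + \log \sigma_1 + \log \sigma_2 \tag{10}
\end{align}
```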
In the last transition we introduced the explicit simplifying assumption which becomes an equality when σ2 → 1.
【CC】The final step applies a simplifying approximation, which yields the last term, log σ2.
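The approximation can be probed numerically. Assuming the term being replaced is log[ Σ exp(f/σ2²) / (Σ exp(f))^(1/σ2²) ], as follows from the scaled and unscaled softmax definitions, the sketch below compares it with log σ2 for a few values of σ2. The logits are arbitrary illustrative numbers.

```python
import math

def exact_residual(logits, sigma):
    """log( sum_c exp(f_c / sigma^2) / (sum_c exp(f_c))^(1 / sigma^2) )."""
    inv = 1.0 / sigma ** 2
    lse_scaled = math.log(sum(math.exp(inv * z) for z in logits))
    lse_plain = math.log(sum(math.exp(z) for z in logits))
    return lse_scaled - inv * lse_plain

logits = [2.0, 1.0, 0.1]  # hypothetical class scores
for sigma in (0.8, 1.0, 1.2):
    # Exact residual next to the log(sigma) term that replaces it.
    print(sigma, round(exact_residual(logits, sigma), 4), round(math.log(sigma), 4))
```

The two columns agree exactly at σ2 = 1 and drift apart away from it, which is consistent with the statement that the assumption becomes an equality as σ2 → 1.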
This last objective can be seen as learning the relative weights of the losses for each output. The scale is regulated by the last term in the equation. This construction can be trivially extended to arbitrary combinations of discrete and continuous loss functions, allowing us to learn the relative weights of each loss in a principled and well-founded way. In practice, we train the network to predict the log variance s := log σ². This is because it is more numerically stable than regressing the variance σ², as the loss avoids any division by zero.
【CC】The final approximate form of equation (10) is thus the paper's answer to the linear weighting in equation (1). The trailing log σ terms act as a regulariser that penalises the loss. The result applies to both continuous and discrete objectives, so it generalises well. In practice s := log σ² is used instead of σ² itself, because σ² is often very small, close to 0, and dividing by it easily causes problems.
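As a minimal sketch of the log-variance trick, the function below evaluates the approximate joint loss of equation (10) with s_i = log σ_i², so 1/σ_i² becomes exp(-s_i) and log σ_i becomes s_i / 2. The function names are hypothetical, and the plain-float version only illustrates the arithmetic; in a training framework s1 and s2 would be learnable parameters.

```python
import math

def loss_from_log_variance(reg_loss, cls_loss, s1, s2):
    """Approximate joint loss with s_i = log(sigma_i^2); no division is needed."""
    return (0.5 * math.exp(-s1) * reg_loss + 0.5 * s1
            + math.exp(-s2) * cls_loss + 0.5 * s2)

def loss_from_sigma(reg_loss, cls_loss, sigma1, sigma2):
    """Same loss written directly in sigma, for comparison only."""
    return (reg_loss / (2.0 * sigma1 ** 2) + math.log(sigma1)
            + cls_loss / sigma2 ** 2 + math.log(sigma2))

reg_loss, cls_loss = 2.0, 3.0  # hypothetical per-task loss values
sigma1, sigma2 = 2.0, 0.5      # hypothetical noise scales

a = loss_from_log_variance(reg_loss, cls_loss,
                           math.log(sigma1 ** 2), math.log(sigma2 ** 2))
b = loss_from_sigma(reg_loss, cls_loss, sigma1, sigma2)
print(a, b)  # the two parameterisations should agree up to rounding
```

Because exp(-s) is defined for every real s, the optimiser can move s freely without ever dividing by a variance that has collapsed to zero.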
Network architecture & experiments:
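Figure 1: Multi-task deep learning. We derive a principled way of combining multiple regression and classification loss functions for multi-task learning. Our architecture takes a single monocular RGB image as input and produces a pixel-wise classification, an instance semantic segmentation and an estimate of per pixel depth. Multi-task learning can improve accuracy over separately trained models because cues from one task, such as depth, are used to regularize and improve the generalization of another domain, such as segmentation.
【CC】The authors use three tasks as the multi-task example: semantic segmentation, instance segmentation and depth estimation. The multi-task loss uses the method above, and the corresponding experimental conclusion follows below. The paper describes the full network architecture in detail; I do not list it here because it does not seem to be the focus. What may be more valuable is the loss function design in the instance subtask: the authors went to considerable effort to turn this loss into a squared L2 norm (presumably to match the earlier assumption about subtask objectives; I am not very familiar with instance segmentation, so I do not discuss it further).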
We observe that at some optimal weighting, the joint network performs better than separate networks trained on each task individually
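【CC】This looks almost magical, but it makes sense once you have followed the derivation: in plain terms, it finds an optimal linear weighting.
Back to the theoretical description: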
In Bayesian modelling, there are two main types of uncertainty one can model
• Epistemic uncertainty is uncertainty in the model, which captures what our model does not know due to lack of training data. It can be explained away with increased training data.
【CC】Put simply: not enough data.
• Aleatoric uncertainty captures our uncertainty with respect to information which our data cannot explain.
【CC】Put simply: the various sources of randomness.
Aleatoric uncertainty can be explained away with the ability to observe all explanatory variables with increasing precision.
• Data-dependent or Heteroscedastic uncertainty is aleatoric uncertainty which depends on the input data and is predicted as a model output.
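【CC】Put simply, this is uncertainty determined by the characteristics of the data itself. In depth estimation, for example, observations of a smooth, glossy wall are certainly more uncertain than observations of a regular geometric object.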
• Task-dependent or Homoscedastic uncertainty is aleatoric uncertainty which is not dependent on the input data. It is not a model output, rather it is a quantity which stays constant for all input data and varies between different tasks.
【CC】All data share the same uncertainty; for example, samples drawn from the same Gaussian distribution all have uncertainty σ².
It can therefore be described as task-dependent uncertainty. In a multi-task setting, we show that the task uncertainty captures the relative confidence between tasks, reflecting the uncertainty inherent to the regression or classification task. We propose that we can use homoscedastic uncertainty as a basis for weighting losses in a multi-task learning problem.
【CC】The paper insists on putting this passage up front. It would be more intuitive and easier to follow to tell the reader directly that each subtask's σ² is used for the weighting, and then simply lay out the derivation.