Paper: Translation and Interpretation of "Understanding the difficulty of training deep feedforward neural networks" (Xavier Parameter Initialization)
Table of Contents
Understanding the difficulty of training deep feedforward neural networks
5 Error Curves and Conclusions
Related Articles
Paper: Translation and Interpretation of "Understanding the difficulty of training deep feedforward neural networks" (Xavier Parameter Initialization)
Paper: Translation and Interpretation of "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (He Parameter Initialization)
DL/DNN Optimization Techniques: Introduction to Parameter Initialization in DNNs (LeCun, He, and Xavier Initialization) and a Detailed Usage Guide
Original paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi
http://proceedings.mlr.press/v9/glorot10a.html
Authors: Xavier Glorot, Yoshua Bengio (DIRO, Université de Montréal, Montréal, Québec, Canada)
Citation: [1] Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Abstract
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
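Interpretation note: the "new initialization scheme" mentioned at the end of the abstract is what is now commonly called Xavier (Glorot) initialization. As a minimal NumPy sketch (the function names standard_init and normalized_init are my own, not the paper's), the two schemes compared throughout the paper are:

```python
import numpy as np

def standard_init(fan_in, fan_out):
    # The heuristic the paper calls "standard" initialization:
    # W_ij ~ U[-1/sqrt(n), 1/sqrt(n)], with n the fan-in of the layer.
    limit = 1.0 / np.sqrt(fan_in)
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

def normalized_init(fan_in, fan_out):
    # The "normalized" (Xavier/Glorot) initialization proposed in the paper:
    # W_ij ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})],
    # chosen so that activation and back-propagated gradient variances stay
    # roughly constant from layer to layer for tanh-like units.
    limit = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: a weight matrix for a layer with 500 inputs and 300 outputs.
W = normalized_init(500, 300)
```

Keeping these layer-to-layer variances roughly constant is the property that Figures 8 and 9 below illustrate.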
5 Error Curves and Conclusions
The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3 × 2, while Table 1 gives final test error for all the datasets studied (Shapeset-3 × 2, MNIST, CIFAR-10, and SmallImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
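For concreteness, a depth-five hyperbolic tangent network with normalized initialization, of the kind behind the 50.47% result, could be sketched as follows. This is my own illustrative PyTorch version, not the paper's code; the input size, the 1000-unit hidden width, and the 9-class output are assumptions about the Shapeset-3×2 setup.

```python
import torch.nn as nn

def make_tanh_net(n_in=32 * 32, n_hidden=1000, n_out=9, depth=5):
    # Depth-`depth` tanh network with normalized ("Xavier") initialization.
    # Input size, hidden width, and the 9-class output are assumptions, not
    # values quoted from the paper.
    layers, fan_in = [], n_in
    for _ in range(depth):
        linear = nn.Linear(fan_in, n_hidden)
        nn.init.xavier_uniform_(linear.weight)  # U[-sqrt(6)/sqrt(n_j+n_{j+1}), +sqrt(6)/sqrt(n_j+n_{j+1})]
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]
        fan_in = n_hidden
    out = nn.Linear(fan_in, n_out)
    nn.init.xavier_uniform_(out.weight)
    nn.init.zeros_(out.bias)
    layers.append(out)
    return nn.Sequential(*layers)

net = make_tanh_net()
```

With its default gain of 1, torch.nn.init.xavier_uniform_ draws from the same uniform range as the paper's normalized-initialization formula.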
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!
Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangents, with standard initialization (top) and normalized initialization (bottom), during training. We see that normalization keeps the variance of the weight gradients the same across layers during training (top: smaller variance for higher layers).
Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.
Figure 10: 98th percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the hyperbolic tangent with normalized initialization during learning.
These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is chosen separately to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning, which might explain why the effects of normalized initialization or of the softsign are more visible.
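The "important saturations during learning" mentioned above are the kind of per-layer activation behavior that Figures 9 and 10 monitor. A rough sketch of how one might collect such statistics with forward hooks is shown below; the 0.99 saturation threshold and the layer sizes are my own illustrative choices, whereas the paper reports 98th percentiles and standard deviations.

```python
import torch
import torch.nn as nn

def saturation_report(model: nn.Sequential, x: torch.Tensor, threshold: float = 0.99):
    # Collect (layer index, activation std, fraction of units with |a| > threshold)
    # for every Tanh module in the model. The threshold is an illustrative choice.
    stats, hooks = [], []
    for idx, module in enumerate(model):
        if isinstance(module, nn.Tanh):
            def hook(_module, _inputs, out, idx=idx):
                stats.append((idx, out.std().item(),
                              (out.abs() > threshold).float().mean().item()))
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return stats

# Example with random data; sizes are assumptions, not the paper's setup.
model = nn.Sequential(nn.Linear(1024, 1000), nn.Tanh(),
                      nn.Linear(1000, 1000), nn.Tanh(),
                      nn.Linear(1000, 9))
for idx, std, frac in saturation_report(model, torch.randn(64, 1024)):
    print(f"tanh at index {idx}: activation std {std:.3f}, saturated {frac:.1%}")
```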
Several conclusions can be drawn from these error curves:
Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both methods were applied to Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but it did not reach the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.
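The paragraph above does not spell out the exact per-parameter update. As a rough sketch under my own assumptions (class name, decay constant, epsilon), scaling each parameter's step size by a running estimate of the squared gradient could look like the following; this amounts to an RMSProp-style rule that post-dates the paper, and the Hessian-diagonal variant of LeCun et al. (1998b) is not shown.

```python
import numpy as np

class GradVarianceSGD:
    # Sketch of SGD with per-parameter learning rates scaled by a running
    # estimate of the squared gradient. Decay and eps are illustrative values,
    # not taken from the paper.
    def __init__(self, params, lr=0.01, decay=0.99, eps=1e-8):
        self.params = params                                  # list of numpy arrays
        self.lr, self.decay, self.eps = lr, decay, eps
        self.second_moment = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for p, g, v in zip(self.params, grads, self.second_moment):
            v *= self.decay
            v += (1.0 - self.decay) * g * g                   # running E[g^2]
            p -= self.lr * g / (np.sqrt(v) + self.eps)        # per-parameter step size
```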
Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.
Figure 12: Test error curves during training on MNIST and CIFAR10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number. The other conclusions from this study are the following: