Visualizing the Loss Landscape of Neural Nets

 

 

[Paper Reading] Loss Function Visualization and Its Guidance for Neural Networks

The paper is Visualizing the Loss Landscape of Neural Nets, which I came across in "SGD在两层神经网络上是怎么收敛的?". While writing [论文阅读]How Does Batch Normalization Help Optimizati I got completely stuck, and decided I should first go back and cover the basics in this paper.


0 Abstract

The abstract gives a one-line characterization of neural network training:

Neural Network training relies on our ability to find "good" minimizers of highly non-convex loss function.

This is easy to understand: a convex function has no spurious local minima (any local minimum is the global minimum), whereas a neural network has to search a highly non-convex loss for the global minimum while avoiding getting trapped in poor local minima.


1 Introduction

Neural networks are still black boxes: they have achieved enormous success in practice, but in theory they remain an open question:

Training neural networks requires minimizing a high-dimensional non-convex loss function - a task that is hard in theory, but sometimes easy in practice.

This paper aims to give theoretically grounded guidance for neural network design. To appreciate why that matters, I think it helps to take a numerical-analysis view: how do we find the optimum over the entire parameter space with limited compute and limited time?

Because of the prohibitive cost of loss function evaluations (which requires looping over all the data points in the training set), studies in this field have remained predominantly theoretical.

Let me analyze the simplest possible case: finding the root of a one-variable equation $f(x) = 0$ (mapped onto this paper, this plays the same role as finding the minimizer of the loss function $L(\theta)$).

The simplest approach is uniform sampling: take 201 evenly spaced points (a step of 0.01), evaluate the function at each of them, and declare the sample whose value is closest to 0 the solution:

(figure: the sampled function values)

The sampled values near zero are:

(figure: the sampled values closest to zero)

Simple, right? After 201 evaluations we can take the best sample as the root of the function, with a precision of 0.01.
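
A minimal sketch of this brute-force search (the specific function and interval used in the original figures are not recoverable, so f(x) = x**2 - 2 over [1, 3] is a stand-in of my own):

```python
import numpy as np

def f(x):
    # Stand-in example function; the post's original function is not recoverable.
    return x ** 2 - 2

# 201 evenly spaced samples with step 0.01 (here over [1, 3]).
xs = np.linspace(1.0, 3.0, 201)
ys = f(xs)

best = np.argmin(np.abs(ys))          # sample whose value is closest to 0
print(f"approximate root: x = {xs[best]:.2f}, f(x) = {ys[best]:.4f}")
# -> approximate root: x = 1.41, f(x) = -0.0119
```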

The first problem is the sampling step: if the step is too coarse the precision is low, and making it finer multiplies the cost. The second problem is that this approach does not scale to more parameters. Its time complexity is roughly $O(n^k)$, where $n$ is the number of samples per dimension and $k$ is the number of parameters; with $n = 201$ the complexity is $O(201^k)$. Referring to 经典神经网络参数的计算【不定期更新】, even the relatively small Inception V1 has 6,990,272 parameters, giving a complexity of $O(201^{6990272})$. This kind of time cost makes direct search impossible, which is why the paper speaks of the "prohibitive cost of loss function evaluations".

This is where the optimization methods of numerical analysis come in. A classic one is Newton's method: starting from an initial value $x_0$, iterate $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$.

(figure: the Newton iterates from the chosen starting point)

After about 5 iterations we can stop, and the accuracy is far higher than with the first method. That is the appeal of Newton's method.
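
The same stand-in function solved with Newton's method, as a minimal sketch:

```python
def f(x):
    return x ** 2 - 2

def f_prime(x):
    return 2 * x

x = 3.0                                # arbitrary starting point x_0
for step in range(1, 20):
    x_new = x - f(x) / f_prime(x)      # x_{n+1} = x_n - f(x_n) / f'(x_n)
    if abs(x_new - x) < 1e-10:         # stop once the update is negligible
        break
    x = x_new

print(f"converged after {step} steps: x = {x:.10f}")
# Quadratic convergence: a handful of steps instead of 201 evaluations.
```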

I think gradient descent shares the same spirit as Newton's method: the whole point of an optimization method is to find the solution quickly. But such methods are clearly less intuitive than the first, exhaustive approach. What this paper pursues is using the first approach to obtain an intuitive picture of the loss function, as shown below:

(figure: example loss-surface visualization from the paper)

Visualization greatly helps us understand how neural networks work: why we are able to find minima of a non-convex function, and why those minima generalize so well:

Visualizations have the potential to help us answer several important questions about why neural networks work. In particular, why are we able to minimize highly non-convex neural loss functions? And why do the resulting minima generalize?

1.1 Contributions

Go read the paper itself for these.


2 Theoretical Background

Nothing much to add here.


3 The Basics of Loss Function Visualization

The section first explains why visualizing a neural network's loss is hard: the objects we can actually draw are at most three-dimensional, while the loss function lives in a very high-dimensional space, so some trick is needed to project it into a low-dimensional space (the other difficulty is the computational cost discussed at the start of this post):

Neural nets contain many parameters, and so their loss functions live in a very high-dimensional space. Unfortunately, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots.

The paper uses the following two techniques to deal with this:

1-Dimensional Linear Interpolation

First pick two sets of parameters $\theta$ and $\theta'$ (for example a random initialization and a nearby minimizer). Their weighted linear combination is $(1-\alpha)\theta + \alpha\theta'$, and evaluating the loss along this line gives the 1D function $f(\alpha) = L\big((1-\alpha)\theta + \alpha\theta'\big)$.
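
A minimal PyTorch sketch of this 1D interpolation; the tiny model, the data, and the second parameter set below are placeholders of my own, whereas the paper interpolates between real trained solutions:

```python
import copy
import torch
import torch.nn as nn

def interpolate_loss(model, theta1, theta2, loss_fn, data, target, alphas):
    """Evaluate f(alpha) = L((1 - alpha) * theta1 + alpha * theta2)."""
    probe = copy.deepcopy(model)
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            for name, p in probe.named_parameters():
                p.copy_((1 - alpha) * theta1[name] + alpha * theta2[name])
            losses.append(loss_fn(probe(data), target).item())
    return losses

# Toy example: stand-ins for a random init and a trained minimizer.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
theta1 = {n: p.detach().clone() for n, p in model.named_parameters()}
theta2 = {n: p + 0.1 * torch.randn_like(p) for n, p in theta1.items()}

data, target = torch.randn(64, 10), torch.randint(0, 2, (64,))
alphas = torch.linspace(-0.5, 1.5, 41)
print(interpolate_loss(model, theta1, theta2, nn.CrossEntropyLoss(), data, target, alphas)[:3])
```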

The paper also lists several drawbacks of this approach: first, a 1D curve has a hard time conveying non-convexity; second, it does not account for batch normalization or the network's invariance symmetries. This comes up again in Section 5.

Contour Plots & Random Directions

Compared with Section 3.1, the random parameter set is replaced by a center point $\theta^*$ plus two direction vectors $\delta$ and $\eta$ (with the same dimension as $\theta^*$, since they get added to it), and the function becomes $f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta)$.
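
A sketch of the 2D version $f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta)$, here with plain (un-normalized) Gaussian directions; the toy model has no BN parameters, which the footnote quoted just below says would be held constant:

```python
import copy
import torch
import torch.nn as nn

def loss_surface_2d(model, loss_fn, data, target, alphas, betas):
    """f(alpha, beta) = L(theta* + alpha * delta + beta * eta) on a grid."""
    theta = {n: p.detach().clone() for n, p in model.named_parameters()}
    delta = {n: torch.randn_like(p) for n, p in theta.items()}   # random direction 1
    eta   = {n: torch.randn_like(p) for n, p in theta.items()}   # random direction 2

    probe = copy.deepcopy(model)
    surface = torch.zeros(len(alphas), len(betas))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                for n, p in probe.named_parameters():
                    p.copy_(theta[n] + a * delta[n] + b * eta[n])
                surface[i, j] = loss_fn(probe(data), target)
    return surface

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
data, target = torch.randn(64, 10), torch.randint(0, 2, (64,))
grid = torch.linspace(-1, 1, 11)
surface = loss_surface_2d(model, nn.CrossEntropyLoss(), data, target, grid, grid)
print(surface.shape)   # torch.Size([11, 11]); plot with plt.contour(...)
```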

Note the footnote here:

When making 2D plots in this paper, batch normalization parameters are held constant, i.e., random directions are not applied to batch normalization parameters.

Because the BN layer carries the normalization parameters mean and std, if these also changed along $\delta$ and $\eta$, BN's defining property of keeping its input at mean 0 and std 1 would be broken and the layer would lose its purpose; hence the paper holds the BN parameters constant.

One question remains, though: how are these two direction vectors chosen? Keep that in mind while reading on.


4 Proposed Visualization: Filter-Wise Normalization

The question left over from Section 3.2 is how to choose the two direction vectors. The simplest option is to sample them at random (Gaussian-distributed, with appropriate scaling). The paper argues this fails to capture the essential features of the loss surface, and cannot be used to compare two different optimizers or two different networks:

While the "random directions" approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks.

The reason is the scale invariance of network weights (the situation is even more pronounced with BN, because BN re-normalizes the activations).

Let me verify this with the paper's example. With ReLU activations a two-layer network computes $\mathrm{relu}(W_2\,\mathrm{relu}(W_1 x))$. Multiplying the first-layer weights by 10 and dividing the second-layer weights by 10 gives $\mathrm{relu}\big(\tfrac{W_2}{10}\,\mathrm{relu}(10\,W_1 x)\big)$, and by the positive homogeneity of ReLU this is the same function.

The output is unchanged, although the hidden activations do get scaled, so something does change a little along the way. As an example of scale invariance I do not find it all that satisfying; the BN case below fits my understanding better.
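
A quick numerical check of the claim (a toy two-layer network of my own; biases are omitted because non-zero biases would break the exact equality):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10)

net = nn.Sequential(nn.Linear(10, 16, bias=False), nn.ReLU(),
                    nn.Linear(16, 3, bias=False))
out_before = net(x)

with torch.no_grad():
    net[0].weight.mul_(10.0)   # first-layer weights x10
    net[2].weight.div_(10.0)   # second-layer weights /10
out_after = net(x)

# relu(10 * W1 x) = 10 * relu(W1 x), so (W2 / 10) * 10 * relu(W1 x) is unchanged,
# even though the hidden activations themselves are scaled by 10.
print(torch.allclose(out_before, out_after, atol=1e-5))   # True
```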

With BN layers the scale invariance becomes even stronger; see the proof of Eq. (1) in Section 3.3 of 深入解读Inception V2之Batch Normalization(附源码):

$$\mathrm{BN}(aWx) = \mathrm{BN}(Wx), \quad a > 0$$

After BN, the scale factor $a$ simply disappears.

What scale invariance does to the plotted function $f(\alpha, \beta)$ is hard to pin down mathematically: because of the added terms $\alpha\delta + \beta\eta$, the perturbation is not merely a change of scale but looks more like a linear change, and the paper does not seem to say anything about this linear aspect.

The paper therefore normalizes the direction vectors $\delta$ and $\eta$. First generate a random Gaussian vector $d$ with the same dimensions as $\theta$; then normalize each filter of $d$ so that it has the same norm as the corresponding filter of $\theta$. The normalization is:

$$d_{i,j} \leftarrow \frac{d_{i,j}}{\lVert d_{i,j} \rVert}\,\lVert \theta_{i,j} \rVert$$

Note that $\frac{d_{i,j}}{\lVert d_{i,j} \rVert}$ is just the direction (unit vector) of $d_{i,j}$. Also, the paper stresses that $d_{i,j}$ denotes the $j$th filter of the $i$th layer, not the $j$th weight. The difference: for a convolution layer, the $j$th filter is the entire block of weights producing the $j$th output channel (e.g. a $3 \times 3 \times C_{in}$ kernel), not a single scalar weight.

The Frobenius norm (see Frobenius Norm -- from Wolfram MathWorld) is just the Euclidean norm of all the entries.
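
A sketch of filter-wise normalization as I read the formula above; the helper below and its treatment of 1D parameters (biases are simply zeroed) are my own choices, not taken from the paper's released code:

```python
import torch
import torch.nn as nn

def filter_normalized_direction(model):
    """Random Gaussian direction d with each filter rescaled to match
    the Frobenius norm of the corresponding filter of theta."""
    direction = {}
    for name, theta in model.named_parameters():
        d = torch.randn_like(theta)
        if theta.dim() <= 1:
            d.zero_()   # my choice: do not perturb biases / 1D parameters
        else:
            # For a conv weight of shape (out_ch, in_ch, k, k), each d[j] / theta[j]
            # is one filter; for a linear weight each row plays that role.
            for d_f, t_f in zip(d, theta):
                d_f.mul_(t_f.norm() / (d_f.norm() + 1e-10))
        direction[name] = d
    return direction

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 4, 3))
d = filter_normalized_direction(model)
w = model[0].weight
print(d["0.weight"][0].norm().item(), w[0].norm().item())   # per-filter norms match
```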


5 The Sharp vs Flat Dilemma

The point of this section is to examine whether flat minimizers really generalize better than sharp ones:

In this section, we address the issue of whether sharp minimizers generalize better than flat minimizers.

The conventional wisdom has been that the flatness or sharpness of the loss around a minimizer is tied to generalization:

It is widely thought that small-batch SGD produces "flat" minimizers that generalize well, while large batches produce "sharp" minima with poor generalization.

The paper's experiments, however, push back on this view:

(Figure 1 of the paper: 1D linear interpolation between the small-batch and large-batch solutions; top row without weight decay, bottom row with weight decay)

The top three plots use no weight decay and agree with the popular view that flatter means better generalization: at $\alpha = 0$ the parameters are the small-batch solution $\theta^s$, where the curve is comparatively flat, while at $\alpha = 1$ the parameters are the large-batch solution $\theta^l$, where the curve varies much more sharply.

The bottom three plots, however, slap that view in the face. Just adding a weight decay of $5\times10^{-4}$ flips the situation: the curve around the small-batch solution now varies sharply while the large-batch side is flat. But, and note this "but", the test accuracy is still better for small batches in every case, so small batches still generalize better than large batches, which means the sharpness of this curve has no direct relation to generalization:

However, we see that small batches generalize better in all experiments; there is no apparent correlation between sharpness and generalization.

5.1 Filter Normalized Plots

This subsection redraws the plots to check whether the filter-normalized method of Section 4 can recover the relation between the sharpness of a minimizer and generalization:

(Figure 3 of the paper: filter-normalized plots around the small-batch and large-batch minimizers)

With filter normalization applied, Figure 3 reflects the relation between sharpness and generalization nicely: the wider the contours, the better the generalization:

We see that now sharpness correlates well with generalization error.

It is also observed that the small-batch solutions with weight decay have wider contours, i.e. better generalization:

The weights obtained with small batch size and non-zero weight decay have wider contours than the sharper large batch minimizers.
Large batches produced visually sharper minima (although not dramatically so) with higher test error.

6 What Makes Neural Networks Trainable? Insights on the (Non)Convexity Structure of Loss Surfaces

This section studies how network architecture affects the non-convexity of the loss, and how that relates to generalization:

We will see that different architectures have extreme differences in non-convexity structure that answer these questions, and that these differences correlate with generalization error.

The tool used throughout is the filter-normalized random-direction method of Section 4 for plotting the loss landscape, combined with the lesson from Section 5.1 that wider contours mean better generalization:

To understand the effects of network architecture on non-convexity, we trained a number of networks, and plotted the landscape around the obtained minimizers using the filter-normalized random direction method described in Section 4.

6.1 The Effect of Network Depth

(Figure 5 of the paper: loss surfaces of ResNets and of the corresponding networks without skip connections, at several depths)

This subsection is mainly about the bottom three panels of Figure 5: without residual connections, depth has a dramatic effect on the loss surface, and the deeper the network, the more non-convex it becomes:

From Figure 5, we see that network depth has a dramatic effect on the loss surfaces of neural networks when skip connections are not used.

6.2 Shortcut Connections to the Rescue

As the top three panels of Figure 5 show, once residual connections are added, the loss surface is far less likely to descend into chaos as the depth increases:

Shortcut connections have a dramatic effect of the geometry of the loss functions. In Figure 5, we see that residual connections prevent the transition to chaotic behavior as depth increases.

6.3 Wide Models vs Thin Models

(Figure 6 of the paper: loss surfaces of wider and narrower variants of the networks)

The more convolution filters per layer (the wider the network), the flatter the loss surface and the better the generalization:

From Figure 6, we see that the wider models have loss landscapes with no noticeable chaotic behavior. Increased network width resulted in flat minima and wide regions of apparent convexity. We see that increased width prevents chaotic behavior, and skip connections dramatically widen minimizers. Finally, note that sharpness correlates extremely well with test error.

Keep in mind, though, the sharp increase in computation this brings; as discussed in 卷积神经网络的复杂度分析, a convolution layer's compute is proportional to the product of its input and output channel counts, so scaling every layer's filter count by a factor grows the compute with the square of that factor.

6.4 Implications for Network Initialization

If the initialization lands in the "well-behaved" region, training is likely to end up at a good minimizer; conversely, if it lands on the high-loss chaotic plateaus, it will not get back down to a minimizer:

For such landscapes, a random initialization will likely lie in the "well-behaved" loss region, and the optimization algorithm might never "see" the pathological non-convexities that occur on the high-loss chaotic plateaus.

6.5 Landscape Geometry Affects Generalization

Landscape geometry affects generalization in two ways. First, the flatter the loss landscape looks, the lower the test error:

1. Visually flatter minimizers consistently correspond to lower test error, which further strengthens our assertion that filter normalization is a natural way to visualize loss function geometry.

Second, chaotic landscapes hurt both training and test error:

2. Chaotic landscapes (deep networks without skip connections) result in worse training and test error, while more convex landscapes have lower error values.

6.6 A note of caution: Are we really seeing convexity?

The degree of convexity can be measured with the principal curvatures, which mathematically are the eigenvalues of the Hessian:

One way to measure the level of convexity in a loss function is to compute the principle curvatures, which are simply eigenvalues of the Hessian.

But the 2D loss plots in the paper cannot simply be extrapolated to the full parameter space, whose dimension is vastly larger than 2: apparent convexity in the low-dimensional plot does not imply convexity of the high-dimensional surface (although non-convexity in the plot does imply non-convexity of the full surface):

If non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex.
(Figure 7 of the paper: maps of the ratio of the minimum to maximum Hessian eigenvalue across the plotted surfaces)

The values in Figure 7 are the ratio $|\lambda_{\min}| / \lambda_{\max}$ of the minimum and maximum eigenvalues of the Hessian. Low values (blue) mean the region is close to convex; high values (yellow) mean significant negative curvature. The paper notes that the chaotic regions contain large amounts of negative curvature (more yellow), which is consistent with the 2D plots drawn above:

We see that the convex-looking regions in our surface plots do indeed correspond to regions with insignificant negative eigenvalues (i.e., there are not major non-convex features that the plot missed).
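
To make the eigenvalue ratio concrete, here is a toy sketch; the paper computes these maps with scalable eigensolvers on real networks, while this just takes the Hessian of a hand-written three-parameter "loss":

```python
import torch
from torch.autograd.functional import hessian

# Toy non-convex "loss" over a 3-parameter "model" (my own example).
def loss(theta):
    return (theta[0] ** 2 + 0.5 * theta[1] ** 2 - 0.1 * theta[2] ** 2
            + 0.3 * theta[0] * theta[1])

theta = torch.tensor([0.5, -0.3, 0.2])
H = hessian(loss, theta)                  # 3x3 Hessian at theta
eigvals = torch.linalg.eigvalsh(H)        # principal curvatures

ratio = eigvals.min().abs() / eigvals.max()
print(eigvals, ratio.item())
# A large |lambda_min| / lambda_max (yellow in Figure 7) means strong
# negative curvature, i.e. the point is far from locally convex.
```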

7 Visualizing Optimization Paths

This section turns to visualizing how the optimizer's trajectory evolves during training:

Finally, we explore methods for visualizing the trajectories of different optimizers.

Random direction vectors cannot be used to visualize the training trajectory:

For this application, random directions are ineffective. We will provide a theoretical explanation for why random directions fail, and explore methods for effectively plotting trajectories on top of loss function contours.
(Figure 8 of the paper: an SGD trajectory projected onto random directions and onto the initialization-to-minimizer direction)

As Figure 8(a) shows, projecting the iterates of SGD (stochastic gradient descent) onto a plane spanned by two random directions fails to capture the training trajectory. In Figure 8(b) the x-axis is replaced by the vector pointing from the random initialization to the minimizer; strangely the path now looks almost like a straight line, meaning the random direction picks up almost none of the motion. Figure 8(c) zooms in on 8(b), and the conclusion is the same: the random direction captures essentially nothing:

As seen in Figure 8(c), the random axis captures almost no variation, leading to the (misleading) appearance of a straight line path.

7.1 Why Random Directions Fail: Low-Dimensional Optimization Trajectories

Two random Gaussian vectors in a high-dimensional space are almost orthogonal with high probability:

It is well-known that two random vectors in a high dimensional space will be nearly orthogonal with high probability. In fact, the expected cosine similarity between Gaussian random vectors in $n$ dimensions is roughly $\sqrt{2/(\pi n)}$.

When $n$ is large enough, $\sqrt{2/(\pi n)} \to 0$, meaning the two vectors are essentially orthogonal. This is exactly what causes the problem seen in Figure 8(b).
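
A quick numerical check of the $\sqrt{2/(\pi n)}$ estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    a, b = rng.standard_normal(n), rng.standard_normal(n)
    cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"n={n:>6}: |cos| = {cos:.4f}, sqrt(2/(pi*n)) = {np.sqrt(2 / (np.pi * n)):.4f}")
# The cosine similarity shrinks toward zero as n grows: random directions
# are almost orthogonal to the (low-dimensional) optimization path.
```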

7.2 Effective Trajectory Plotting using PCA Directions

This subsection uses PCA to extract the direction vectors instead. The recipe: take the parameters $\theta_0, \theta_1, \dots, \theta_n$ recorded during training, build the matrix $M = [\theta_0 - \theta_n; \dots; \theta_{n-1} - \theta_n]$, and apply PCA to it, keeping the two directions that explain the most variance. This is the result:

(figure: training trajectories plotted along the two leading PCA directions, on top of loss contours)

It also shows directly that the descent path lives in a very low-dimensional subspace of the parameter space:

Finally, we can directly observe that the descent path is very low dimensional: between 40% and 90% of the variation in the descent paths lies in a space of only 2 dimensions.
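
To make the construction in Section 7.2 concrete, a minimal sketch of the PCA directions; the toy model, data and training loop are stand-ins of my own:

```python
import numpy as np
import torch
import torch.nn as nn

# Toy model and data; stand-ins for the paper's real training runs.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
data, target = torch.randn(256, 10), torch.randint(0, 2, (256,))

def flat_params(m):
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()]).numpy()

checkpoints = [flat_params(model)]                 # theta_0, theta_1, ...
for epoch in range(30):
    opt.zero_grad()
    loss_fn(model(data), target).backward()
    opt.step()
    checkpoints.append(flat_params(model))

theta_final = checkpoints[-1]                      # theta_n
M = np.stack([t - theta_final for t in checkpoints[:-1]])   # rows: theta_i - theta_n

# PCA via SVD: rows of Vt are the principal directions in parameter space.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)
print("variance explained by the top-2 directions:", explained[:2].sum())
coords = M @ Vt[:2].T                              # trajectory projected onto them
print(coords.shape)                                # (n, 2): plot over loss contours
```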

8 Conclusion

Nothing much to add here.

 

[Complete]
