Visualizing the Loss Landscape of Neural Nets




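I came across this paper in "Visualizing the Loss Landscape of Neural Nets" and "How does SGD converge on a two-layer neural network?". While writing "[论文阅读]How Does Batch Normalization Help Optimizati", I got stuck, so I decided to fill in the background from this paper first.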

0 Abstract


Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions.


1 Introduction


Training neural networks requires minimizing a high-dimensional non-convex loss function - a task that is hard in theory, but sometimes easy in practice.


Because of the prohibitive cost of loss function evaluations (which requires looping over all the data points in the training set), studies in this field have remained predominantly theoretical.

Consider the simplest possible case: finding the root of [formula missing] (in the setting of this paper, this corresponds to finding the minimum of [formula missing]).

The simplest approach is uniform sampling: take 201 points, i.e. [formula missing], evaluate the function at each of them, and take the point whose value is closest to 0 as the solution.

[Figures not preserved]

Simple, right? With 201 samples we can take [value missing] as the solution, to a precision of 0.01.
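The sampling procedure can be sketched in a few lines. The function in the original post was not preserved, so the function `g` and the interval [0, 2] below are stand-ins chosen only for illustration (201 points on an interval of length 2 give the step 0.01 mentioned above).

```python
def grid_search_root(f, lo, hi, num_points=201):
    """Evaluate f on a uniform grid and return the grid point where |f| is smallest."""
    step = (hi - lo) / (num_points - 1)
    xs = [lo + i * step for i in range(num_points)]
    best = min(xs, key=lambda x: abs(f(x)))
    return best, step


# Hypothetical example function (NOT the one from the original post).
# Its positive root is sqrt(0.5), roughly 0.7071.
def g(x):
    return 2 * x * x - 1


x_best, step = grid_search_root(g, 0.0, 2.0)
print(f"approximate root: {x_best:.2f}, grid step: {step:.2f}")
```

The answer can only be as accurate as the grid: the returned point is at most one step away from the true root.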

This approach has two problems. The first is the sampling step: if the step is too large, the precision is poor. The second is that it does not scale to more parameters. Its time complexity is roughly $O(n^m)$, where $n$ is the number of samples per parameter and $m$ is the number of parameters. With $n = 201$ as in the example above, the complexity is $O(201^m)$. According to "Parameter Counts of Classic Neural Networks [updated irregularly]", even Inception V1, which has comparatively few parameters, has 6,990,272 of them, so the complexity becomes $O(201^{6990272})$. A problem of this cost clearly cannot be solved by direct sampling, which is why the paper speaks of the "prohibitive cost of loss function evaluations".
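To get a feeling for how large such a grid would be, the sketch below counts the decimal digits of the number of grid points instead of the number itself. It assumes 201 samples per parameter, as in the one-dimensional example, and the Inception V1 parameter count quoted above; the helper name is made up for this illustration.

```python
import math


def grid_size_digits(samples_per_param, num_params):
    """Number of decimal digits of samples_per_param ** num_params, via logarithms."""
    return math.floor(num_params * math.log10(samples_per_param)) + 1


digits_1d = grid_size_digits(201, 1)                # 201 grid points: a 3-digit number
digits_inception = grid_size_digits(201, 6_990_272)
print(digits_1d, digits_inception)
```

The second number printed is the length of the evaluation count, not the count itself: the count has more than sixteen million digits.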

This is where optimization methods from numerical analysis come in. A classic one is Newton's method. Taking the initial value [value missing] as an example, the iteration formula is $x_{k+1} = x_k - f(x_k) / f'(x_k)$.

[Figure not preserved]
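A minimal sketch of Newton's root-finding iteration follows. The function, its derivative, and the starting point `x0 = 1.0` are illustrative assumptions, since the function and initial value used in the original post were not preserved.

```python
def newton(f, df, x0, tol=1e-10, max_iter=50):
    """Newton's method for a root of f: x <- x - f(x) / f'(x)."""
    x = x0
    for i in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x, i + 1  # converged after i + 1 iterations
    return x, max_iter


# Hypothetical example: f(x) = 2x^2 - 1, with derivative f'(x) = 4x.
root, iterations = newton(lambda x: 2 * x * x - 1, lambda x: 4 * x, x0=1.0)
print(root, iterations)
```

Compared with the 201 evaluations of uniform sampling, the iteration reaches a far more accurate root in a handful of steps, at the price of needing the derivative.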


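In my view, gradient descent is based on the same idea as Newton's method: the goal of an optimization method is to find the solution quickly. Such methods are clearly less intuitive than the first (sampling) approach, however. The goal of this paper is to use the first approach to obtain an intuitive picture of the loss function, as shown below:

[Figure: visualization of a loss surface; image not preserved]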


Visualizations have the potential to help us answer several important questions about why neural networks work. In particular, why are we able to minimize highly non-convex neural loss functions? And why do the resulting minima generalize?

1.1 Contributions


2 Theoretical Background


3 The Basis of Loss Function Visualization


Neural nets contain many parameters, and so their loss functions live in a very high-dimensional space. Unfortunately, visualizations are only possible using low-dimensional 1D (line) or 2D (surface) plots.


3.1 1-Dimensional Linear Interpolation

First choose two sets of parameters $\theta$ and $\theta'$ (for example, one a random initialization and the other a nearby minimizer). Their weighted average gives a new set of parameters $\theta(\alpha) = (1 - \alpha)\theta + \alpha\theta'$, and evaluating the loss at this new set gives the function to plot, $f(\alpha) = L(\theta(\alpha))$.
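The interpolation can be sketched with NumPy as follows. It evaluates the loss at the weighted average $(1 - \alpha)\theta + \alpha\theta'$ for a range of $\alpha$; the quadratic `toy_loss` stands in for a real network loss and is purely an assumption for illustration.

```python
import numpy as np


def interpolate_loss(loss_fn, theta_a, theta_b, alphas):
    """Loss along the line theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b."""
    return np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])


# Toy stand-in for a network loss, with its minimum at theta = (1, 1, 1).
def toy_loss(theta):
    return float(np.sum((theta - 1.0) ** 2))


theta_init = np.zeros(3)   # plays the role of a random initialization
theta_min = np.ones(3)     # plays the role of the minimizer
alphas = np.linspace(-0.5, 1.5, 5)
losses = interpolate_loss(toy_loss, theta_init, theta_min, alphas)
print(dict(zip(alphas.tolist(), losses.tolist())))
```

Plotting `losses` against `alphas` gives the 1D curve; $\alpha = 0$ and $\alpha = 1$ correspond to the two chosen parameter sets.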


3.2 Contour Plots & Random Directions

Compared with Section 3.1, the random initial parameter set is replaced by two direction vectors $\delta$ and $\eta$ (with the same dimension as $\theta$, because they are added to it), centered at a point $\theta^*$. The function to plot is $f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta)$.
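A sketch of the 2D version: the loss is evaluated on a grid of $(\alpha, \beta)$ values around a center point along two random Gaussian directions. The toy loss, the dimension, and the grid range are assumptions for illustration only.

```python
import numpy as np


def loss_surface(loss_fn, theta_star, delta, eta, alphas, betas):
    """Evaluate f(alpha, beta) = L(theta_star + alpha * delta + beta * eta) on a grid."""
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta_star + a * delta + b * eta)
    return surface


# Toy stand-in for a network loss, with its minimum at theta = (1, 1, 1, 1).
def toy_loss(theta):
    return float(np.sum((theta - 1.0) ** 2))


rng = np.random.default_rng(0)
theta_star = np.ones(4)                                   # center point (the minimizer)
delta, eta = rng.standard_normal(4), rng.standard_normal(4)
grid = np.linspace(-1.0, 1.0, 21)
surface = loss_surface(toy_loss, theta_star, delta, eta, grid, grid)
print(surface.shape, surface.min())
```

The resulting array can be passed to a contour or surface plotting routine. Every grid cell costs one full loss evaluation, which is why the resolution of such plots is usually modest.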


When making 2D plots in this paper, batch normalization parameters are held constant, i.e., random directions are not applied to batch normalization parameters.

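The reason is that a BN layer carries the parameters mean and std. If these two parameters changed along with $\delta$ and $\eta$, the defining property of BN, namely normalizing the data to mean 0 and std 1, would no longer hold and the BN layer would lose its purpose. The paper therefore keeps the BN parameters fixed.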


4 Proposed Visualization: Filter-Wise Normalization

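The open question in Section 3.2 is how to choose the two direction vectors. A simple option is to pick them at random (Gaussian entries with an appropriate scaling). The paper argues that this fails to capture the intrinsic geometry of the loss surface and cannot be used to compare two different minimizers or two different networks: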

While the "random directions" approach to plotting is simple, it fails to capture the intrinsic geometry of loss surfaces, and cannot be used to compare the geometry of two different minimizers or two different networks.


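Let us check this with the example from the paper. With ReLU as the activation function we have [formula missing]; multiplying the first-layer weights by 10 and dividing the second-layer weights by 10 gives [formula missing].

[formula missing] still changes slightly. This is one example of scale invariance, though I do not find it a very good one; the BN case below matches my understanding better.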
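The ReLU rescaling can be checked numerically. The sketch below assumes a bias-free two-layer network $W_2\,\mathrm{ReLU}(W_1 x)$, which is not necessarily the exact form used in the original post; the shapes are arbitrary.

```python
import numpy as np


def two_layer_relu(x, w1, w2):
    """Bias-free two-layer network: w2 @ relu(w1 @ x)."""
    return w2 @ np.maximum(w1 @ x, 0.0)


rng = np.random.default_rng(0)
x = rng.standard_normal(5)
w1 = rng.standard_normal((8, 5))
w2 = rng.standard_normal((3, 8))

y = two_layer_relu(x, w1, w2)
# Multiply the first layer by 10 and divide the second layer by 10.
y_scaled = two_layer_relu(x, 10 * w1, w2 / 10)
print(np.max(np.abs(y - y_scaled)))
```

The invariance follows from $\mathrm{ReLU}(cz) = c\,\mathrm{ReLU}(z)$ for $c > 0$. If the first layer had a bias, the output would stay the same only if that bias were multiplied by 10 as well.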

With BN layers the scale invariance is even stronger; see the proof of Formula 1 in Section 3.3 of "In-Depth Reading of Inception V2: Batch Normalization (with source code)":

[Formula not preserved]
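The scale invariance under batch normalization can also be checked numerically. The sketch below uses a simplified BN without learnable scale and shift and with the stabilizing epsilon set to 0, so it is an idealized assumption rather than a framework implementation.

```python
import numpy as np


def batch_norm(z, eps=0.0):
    """Simplified BN: normalize each feature over the batch axis (no scale or shift)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)


rng = np.random.default_rng(0)
u = rng.standard_normal((64, 5))   # a batch of 64 inputs
w = rng.standard_normal((5, 3))    # layer weights

out = batch_norm(u @ w)
out_scaled = batch_norm(u @ (10 * w))   # same weights, scaled by 10
print(np.max(np.abs(out - out_scaled)))
```

Scaling the weights scales both the batch mean and the batch standard deviation by the same factor, so the factor cancels. With a nonzero `eps` the cancellation is only approximate.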


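The effect of scale invariance on the function [formula missing] cannot be determined mathematically. For example, in [formula missing] the term [formula missing] is present, so it is not only the scale that changes; it looks more like a linear relationship. The paper does not seem to discuss this linear relationship, however.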

The paper then normalizes the direction vectors $\delta$ and $\eta$. First, a random Gaussian vector $d$ with the same dimension as $\theta$ is generated. Then each filter in $d$ is normalized to have the same norm as the corresponding filter in $\theta$:

$$d_{i,j} \leftarrow \frac{d_{i,j}}{\|d_{i,j}\|} \, \|\theta_{i,j}\|$$

Note that $d_{i,j} / \|d_{i,j}\|$ is the direction (unit vector) of $d_{i,j}$. The paper also states that $d_{i,j}$ is the $j$-th filter of the $i$-th layer, not the $j$-th weight. To illustrate the difference, take a [formula missing] kernel as an example: the $j$-th filter is [formula missing], [formula missing].

For the Frobenius norm, see "Frobenius Norm -- from Wolfram MathWorld"; it is simply the Euclidean norm of all entries of the filter taken as one vector.
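A minimal NumPy sketch of filter-wise normalization follows. It assumes each layer's weights are stored as an array whose first axis indexes the filters (for example `(num_filters, in_channels, k, k)`); that layout, the small constant in the denominator, and the function name are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np


def filter_normalize(direction, weights):
    """Rescale every filter of `direction` to the norm of the matching filter in `weights`.

    Both arguments are lists with one array per layer; axis 0 indexes the filters.
    """
    normalized = []
    for d, w in zip(direction, weights):
        d_flat = d.reshape(d.shape[0], -1)
        w_flat = w.reshape(w.shape[0], -1)
        d_norm = np.linalg.norm(d_flat, axis=1, keepdims=True)  # Frobenius norm per filter
        w_norm = np.linalg.norm(w_flat, axis=1, keepdims=True)
        scaled = d_flat / (d_norm + 1e-10) * w_norm  # small constant avoids division by zero
        normalized.append(scaled.reshape(d.shape))
    return normalized


rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3, 3, 3)), rng.standard_normal((2, 4, 3, 3))]
direction = [rng.standard_normal(w.shape) for w in weights]
normalized = filter_normalize(direction, weights)
print([n.shape for n in normalized])
```

After this step each filter of the random direction keeps its random orientation but has the same norm as the corresponding trained filter, so a unit step along the direction is comparable across networks with differently scaled weights.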

5 The Sharp vs Flat Dilemma


In this section, we address the issue of whether sharp minimizers generalize better than flat minimizers.


It is widely thought that small-batch SGD produces "flat" minimizers that generalize well, while large batches produce "sharp" minima with poor generalization.


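[Figure: 1D linear interpolation between the small-batch and large-batch solutions; top three plots without weight decay, bottom three plots with weight decay]

The top three plots use no weight decay and agree with the common view that a flatter curve means better generalization. (At $\alpha = 0$, $\theta(\alpha) = \theta^s$, i.e. only the small-batch parameters, and the curve is flat there; at $\alpha = 1$, $\theta(\alpha) = \theta^l$, i.e. only the large-batch parameters, and the curve is sharp there.)

The bottom three plots, however, flatly contradict this view. Merely adding a weight decay of $5 \times 10^{-4}$ reverses the picture: the curve is now sharper at the small-batch solution and flatter at the large-batch solution. But, and this is the important part, test accuracy is still better for the small batch in every case. Small-batch training therefore generalizes better than large-batch training, and the sharpness of these curves has no direct relationship to generalization: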

However, we see that small batches generalize better in all experiments; there is no apparent correlation between sharpness and generalization.

5.1 Filter Normalized Plots


[Figure: filter-normalized plots of the minimizers; image not preserved]


We see that now sharpness correlates well with generalization error.

The paper also finds that the small-batch solution trained with weight decay has wider contours, i.e. better generalization:

The weights obtained with small batch size and non-zero weight decay have wider contours than the sharper large batch minimizers.
Large batches produced visually sharper minima (although not dramatically so) with higher test error.

6 What Makes Neural Networks Trainable? Insights on the (Non)Convexity Structure of Loss Surfaces


We will see that different architectures have extreme differences in non-convexity structure that answer these questions, and that these differences correlate with generalization error.

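This section plots loss landscapes with the filter-normalized random direction method introduced in Section 4, and relies on the observation from Section 5.1 that wider contours go with better generalization: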

To understand the effects of network architecture on non-convexity, we trained a number of networks, and plotted the landscape around the obtained minimizers using the filter-normalized random direction method described in Section 4.

6.1 The Effect of Network Depth

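[Figure 5: loss surfaces for networks of increasing depth; top three plots with residual connections, bottom three plots without]

This section is about the bottom three plots of Figure 5. Without residual connections, network depth has a strong effect on the loss surface: the deeper the network, the more non-convex the surface: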

From Figure 5, we see that network depth has a dramatic effect on the loss surfaces of neural networks when skip connections are not used.

6.2 Shortcut Connections to the Rescue

As the top three plots of Figure 5 show, once residual connections are added, the loss surface is far less likely to become chaotic as depth increases:

Shortcut connections have a dramatic effect on the geometry of the loss functions. In Figure 5, we see that residual connections prevent the transition to chaotic behavior as depth increases.

6.3 Wide Models vs Thin Models

[Figure 6: loss surfaces of wide and thin models; image not preserved]

The more filters per layer (the wider the network), the flatter the loss surface and the better the generalization:

From Figure 6, we see that the wider models have loss landscapes with no noticeable chaotic behavior. Increased network width resulted in flat minima and wide regions of apparent convexity. We see that increased width prevents chaotic behavior, and skip connections dramatically widen minimizers. Finally, note that sharpness correlates extremely well with test error.


6.4 Implications for Network Initialization

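If the initialization lands in a "well-behaved" region, training will very likely end up at a minimizer. Conversely, if the initialization lands on one of the high-loss chaotic plateaus, training may never reach a minimizer: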

For such landscapes, a random initialization will likely lie in the "well-behaved" loss region, and the optimization algorithm might never "see" the pathological non-convexities that occur on the high-loss chaotic plateaus.

6.5 Landscape Geometry Affects Generalization

Landscape geometry affects generalization in two ways. The first is that the flatter the loss landscape, the lower the test error:

1. Visually flatter minimizers consistently correspond to lower test error, which further strengthens our assertion that filter normalization is a natural way to visualize loss function geometry.


2. Chaotic landscapes (deep networks without skip connections) result in worse training and test error, while more convex landscapes have lower error values.

6.6 A note of caution: Are we really seeing convexity?


One way to measure the level of convexity in a loss function is to compute the principal curvatures, which are simply eigenvalues of the Hessian.


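If non-convexity is present in the dimensionality reduced plot, then non-convexity must be present in the full-dimensional surface as well. However, apparent convexity in the low-dimensional surface does not mean the high-dimensional function is truly convex.
[Figure 7: map of $|\lambda_{\min} / \lambda_{\max}|$ over the plotted region; image not preserved]

The values in Figure 7 are $|\lambda_{\min} / \lambda_{\max}|$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of the Hessian. Lower values (bluer) indicate a more convex region and higher values (yellower) a less convex one. The paper argues that the chaotic regions contain a large amount of negative curvature (yellow), which agrees with the 2D plots above: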

We see that the convex-looking regions in our surface plots do indeed correspond to regions with insignificant negative eigenvalues (i.e., there are not major non-convex features that the plot missed).
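The eigenvalue ratio can be illustrated on a toy function. The sketch below approximates the Hessian by central finite differences and reports $|\lambda_{\min} / \lambda_{\max}|$ together with $\lambda_{\min}$ itself; the two test functions and the step size are assumptions for illustration, and for a real network the Hessian is far too large to form explicitly like this.

```python
import numpy as np


def numerical_hessian(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    n = x.size
    basis = np.eye(n) * h
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            hess[i, j] = (f(x + basis[i] + basis[j]) - f(x + basis[i] - basis[j])
                          - f(x - basis[i] + basis[j]) + f(x - basis[i] - basis[j])) / (4 * h * h)
    return hess


def curvature_ratio(f, x):
    """Return (|lambda_min / lambda_max|, lambda_min) of the Hessian at x."""
    eigvals = np.linalg.eigvalsh(numerical_hessian(f, x))
    return abs(eigvals.min() / eigvals.max()), eigvals.min()


def valley(x):   # convex but flat along x[1]: Hessian eigenvalues 2 and 0
    return x[0] ** 2


def saddle(x):   # non-convex: Hessian eigenvalues 2 and -1
    return x[0] ** 2 - 0.5 * x[1] ** 2


point = np.array([0.3, -0.2])
ratio_valley, min_valley = curvature_ratio(valley, point)
ratio_saddle, min_saddle = curvature_ratio(saddle, point)
print(ratio_valley, ratio_saddle)
```

The ratio is only a proxy: a strictly convex function with $\lambda_{\min} > 0$ also gives a nonzero value, so the sign of $\lambda_{\min}$ is returned alongside it.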

7 Visualizing Optimization Paths


Finally, we explore methods for visualizing the trajectories of different optimizers.


For this application, random directions are ineffective. We will provide a theoretical explanation for why random directions fail, and explore methods for effectively plotting trajectories on top of loss function contours.
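[Figure 8: SGD trajectories projected onto (a) two random directions, (b) one random direction and the direction from initialization to the minimizer, (c) an enlarged view of (b)]

As Figure 8(a) shows, projecting the parameters produced by SGD (Stochastic Gradient Descent) onto the plane spanned by two random directions fails to capture the training trajectory. Figure 8(b) replaces the x direction with the vector pointing from the random initialization to the minimizer. Surprisingly, the path becomes almost a straight line, which means the random direction picks up almost none of the motion. Figure 8(c) is an enlarged view of Figure 8(b) and leads to the same conclusion: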

As seen in Figure 8(c), the random axis captures almost no variation, leading to the (misleading) appearance of a straight line path.

7.1 Why Random Directions Fail: Low-Dimensional Optimization Trajectories


It is well-known that two random vectors in a high dimensional space will be nearly orthogonal with high probability. In fact, the expected cosine similarity between Gaussian random vectors in $n$ dimensions is roughly $\sqrt{2/(\pi n)}$.

When $n$ is large enough, $\sqrt{2/(\pi n)} \approx 0$, so the two vectors are nearly orthogonal. This causes the problem seen in Figure 8(b).
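The near-orthogonality claim is easy to check empirically. The sketch below compares the average absolute cosine similarity of random Gaussian vector pairs with $\sqrt{2/(\pi n)}$; the dimensions and the number of trials are arbitrary choices, and the absolute value is taken because the signed cosine averages to zero by symmetry.

```python
import numpy as np


def mean_abs_cosine(n, trials=2000, seed=0):
    """Average |cosine similarity| between pairs of Gaussian random vectors in n dimensions."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((trials, n))
    b = rng.standard_normal((trials, n))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))


results = {}
for n in (10, 100, 1000):
    results[n] = (mean_abs_cosine(n), float(np.sqrt(2 / (np.pi * n))))
    print(n, results[n])
```

For networks with millions of parameters the value is tiny, so a random direction has almost no component along any fixed direction, including the ones the optimizer actually moves in.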

7.2 Effective Trajectory Plotting using PCA Directions

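This section uses PCA to extract the direction vectors. The procedure described in the paper is: let $\theta_i$ be the model parameters at epoch $i$ and $\theta_n$ the final parameters, build the matrix $M = [\theta_0 - \theta_n; \cdots; \theta_{n-1} - \theta_n]$, apply PCA to it, and take the two most explanatory directions. The result:

[Figure: optimization trajectories plotted along the PCA directions; image not preserved]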


Finally, we can directly observe that the descent path is very low dimensional: between 40% and 90% of the variation in the descent paths lies in a space of only 2 dimensions.
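The PCA step can be sketched with an SVD in NumPy. The synthetic trajectory below (a path lying close to a 2D plane in a 50-dimensional space, plus small noise) is an assumption made so the example is self-contained; real checkpoints would be flattened network weights saved during training.

```python
import numpy as np


def pca_directions(checkpoints, k=2):
    """checkpoints: array of shape (n + 1, num_params) with rows theta_0 .. theta_n.

    Returns the top-k principal directions of the rows theta_i - theta_n
    and the fraction of variance each one explains.
    """
    m = checkpoints[:-1] - checkpoints[-1]
    m_centered = m - m.mean(axis=0)            # PCA works on centered data
    _, s, vt = np.linalg.svd(m_centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return vt[:k], explained[:k]


# Synthetic trajectory (illustrative only): a curve near a 2D plane in 50 dimensions.
rng = np.random.default_rng(0)
t = np.linspace(1.0, 0.0, 30)[:, None]
u, v = rng.standard_normal(50), rng.standard_normal(50)
checkpoints = t * u + t ** 2 * v + 1e-3 * rng.standard_normal((30, 50))

directions, explained = pca_directions(checkpoints)
# 2D coordinates of every checkpoint relative to the final one, for plotting the path.
coords = (checkpoints - checkpoints[-1]) @ directions.T
print(explained, coords.shape)
```

The `explained` values correspond to the kind of percentage quoted in the passage above: the share of the trajectory's variation captured by each plotted axis.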

8 Conclusion




Edited on 2019-08-30