It's not a problem of underfitting; the network is simply not trained well.
Why does a bigger network not perform as well as a smaller one? It may well have the capacity, but it is not trained well.
Dropout: use it when results are good on training data but bad on testing data (overfitting).
If the results on the training data are already bad, applying dropout is useless.
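A minimal numpy sketch of inverted dropout (the keep probability and toy activations are arbitrary assumptions for illustration): units are randomly zeroed only during training, and the survivors are rescaled so the expected activation matches what the full network produces at test time.

```python
import numpy as np

def dropout_forward(x, keep_prob=0.8, training=True):
    """Inverted dropout: randomly zero units during training and
    rescale the survivors so the expected activation stays the same."""
    if not training:
        return x                      # at test time, use the full network
    mask = np.random.rand(*x.shape) < keep_prob
    return x * mask / keep_prob       # rescale so E[output] == x

# toy activations from some hidden layer
h = np.random.randn(4, 8)
h_train = dropout_forward(h, keep_prob=0.8, training=True)   # some units zeroed
h_test  = dropout_forward(h, training=False)                 # unchanged
```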
Deeper usually does not imply better: with the sigmoid function, deep networks hit the vanishing gradient problem.
Layers near the output are trained faster than layers near the input, so the output layers may converge based on what are still essentially random features from the earlier layers.
A big change to a weight near the input has only a little impact on the output.
So ReLU is used instead.
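A rough numeric sketch of the vanishing gradient (the layer count and pre-activation values are assumptions, and weight factors are omitted): backpropagation multiplies one activation-derivative factor per layer; the sigmoid's derivative is at most 0.25, so the product shrinks toward zero, while ReLU's derivative is exactly 1 on active units.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)              # at most 0.25, reached at z = 0

np.random.seed(0)
z = np.random.randn(20)               # pre-activations, one per layer

# backprop multiplies one activation-derivative factor per layer
print(np.prod(sigmoid_grad(z)))       # roughly 1e-14: the gradient vanishes
# ReLU's derivative is exactly 1 on active units, so the same product
# of factors stays 1 along an active path instead of shrinking
```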
Q: Is ReLU a linear function, then?
No, it's a nonlinear (piecewise linear) function. Different inputs activate different units, so each input corresponds to a different linear system (recall the definition of a linear system in signals and systems).
If particular parameters are picked, maxout has the same effect as ReLU.
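A small numpy sketch of a maxout unit; if one of its two linear pieces is fixed to w = 0, b = 0 (a choice made here purely for illustration), the unit reproduces ReLU exactly.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: element-wise max over k linear pieces.
    W has shape (k, d), b has shape (k,), x has shape (d,)."""
    return np.max(W @ x + b)

x = np.array([2.0, -1.0])
w = np.array([1.5, 0.5])          # weights of the "real" linear piece

# two pieces: (w, 0) and (0, 0)  ->  max(w.x, 0) == ReLU(w.x)
W = np.stack([w, np.zeros_like(w)])
b = np.zeros(2)
print(maxout(x, W, b))            # 2.5
print(max(w @ x, 0.0))            # same value: ReLU of the linear output
```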
learning rate
RMSProp: give a larger weight to the new gradient and a smaller weight to old gradients when accumulating the squared gradients that scale the learning rate.
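A minimal sketch of the RMSProp update (the decay rate, learning rate, and toy quadratic objective are assumptions for illustration): the running average of squared gradients gives the newest gradient weight (1 - decay), while older contributions decay geometrically.

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients;
    the new gradient gets weight (1 - decay), older ones decay geometrically."""
    cache = decay * cache + (1.0 - decay) * grad**2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# toy objective: f(theta) = theta^2, gradient = 2 * theta
theta, cache = 5.0, 0.0
for _ in range(1000):
    theta, cache = rmsprop_step(theta, 2.0 * theta, cache)
print(theta)   # close to 0, up to a small oscillation around the minimum
```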
local minima
Actually, you don't need to worry much about local minima:
Yoshua Bengio's group found experimentally that when training high-dimensional (in parameters) neural networks, you almost never encounter local minima (contrary to our usual intuition), but you do encounter saddle points, which are locally minimal only along some dimensions. Saddle points significantly slow down training until the optimizer finds the right direction to escape; at each saddle point the training "oscillates" several times before finally escaping.
Bengio offers an intuitive explanation: suppose that along any single dimension, a point is a local minimum with probability p. Then the probability that the point is a local minimum in all 1000 dimensions is p^1000, a typically tiny probability, while the probability that it is locally minimal along only a few dimensions is comparatively high. During parameter optimization, training slows down noticeably when reaching such points, until the correct direction is found.
Moreover, the probability p grows as the loss approaches the global optimum. This means that when the network does converge to a genuine local minimum, that point can usually be considered close enough to the global optimum.
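A one-line illustration of the argument (p = 0.5 and d = 1000 are assumed values): even a large per-dimension probability becomes astronomically small when it must hold in every dimension at once.

```python
p, d = 0.5, 1000
print(p ** d)   # ~9.3e-302: essentially zero, so true local minima are rare
```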
A big network's error surface may be fairly smooth, so there are not that many (bad) local minima.
Q: Why can regularization ease overfitting?
It keeps the parameters small: if lambda is large, the corresponding theta values are pushed toward zero, which makes the model effectively simpler and reduces the chance of overfitting.
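A minimal sketch of L2 regularization on a toy linear model (the lambda values, learning rate, and random data are assumptions): the penalty lambda * ||theta||^2 adds a gradient term that pulls every weight toward zero, and a larger lambda yields noticeably smaller weights.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(50, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(50)

def l2_regularized_step(theta, X, y, lam=0.1, lr=0.01):
    """Gradient step on MSE loss + lam * ||theta||^2 (weight decay)."""
    grad_data = 2.0 * X.T @ (X @ theta - y) / len(y)
    grad_reg  = 2.0 * lam * theta            # pulls every weight toward 0
    return theta - lr * (grad_data + grad_reg)

theta_small_lam = np.zeros(10)
theta_big_lam   = np.zeros(10)
for _ in range(500):
    theta_small_lam = l2_regularized_step(theta_small_lam, X, y, lam=0.01)
    theta_big_lam   = l2_regularized_step(theta_big_lam,   X, y, lam=10.0)

# larger lambda -> smaller weights -> a simpler, smoother model
print(np.linalg.norm(theta_small_lam), np.linalg.norm(theta_big_lam))
```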