Techniques Related to Learning
Updating Parameters
SGD
- \(W \leftarrow W-\eta \frac{\partial L}{\partial W}\)
- Drawback: when the function is anisotropic (its shape differs by direction), the search path is very inefficient. The gradient generally does not point toward the minimum, so the parameters move toward it in a zigzag pattern.
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        '''
        params and grads are dictionaries that hold the weights and their
        gradients under keys such as params['W1'] and grads['W1'].
        '''
        for key in params.keys():
            params[key] -= self.lr * grads[key]
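To see the zigzag concretely, here is a small sketch (my own toy example, not the book's code) that runs plain gradient descent on the anisotropic function \(f(x,y)=\frac{x^2}{20}+y^2\):

```python
import numpy as np

# A hypothetical anisotropic bowl f(x, y) = x**2/20 + y**2: the gradient
# (x/10, 2y) rarely points at the minimum (0, 0), so plain SGD zigzags.
def grad(p):
    return np.array([p[0] / 10.0, 2.0 * p[1]])

p = np.array([-7.0, 2.0])
lr = 0.95
path = [p.copy()]
for _ in range(30):
    p = p - lr * grad(p)
    path.append(p.copy())

ys = [q[1] for q in path]
# The y-coordinate flips sign on every step (the zigzag), while x
# creeps toward 0 only slowly.
```

Along the steep y-direction the step overshoots and alternates sign; along the shallow x-direction progress is slow, which is exactly the inefficient "之"-shaped path described above.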
Momentum
- \[ \upsilon \leftarrow \alpha \upsilon - \eta \frac{\partial L}{\partial W}\\ W \leftarrow W + \upsilon \]
- \(\eta\) is the learning rate; \(\upsilon\) corresponds to velocity in physics.
- The \(\alpha \upsilon\) term plays the role of friction: it gradually decelerates the object when no force acts on it.
- \(\upsilon\) stores the object's velocity.
- Compared with SGD, the zigzagging is reduced: the \(\alpha \upsilon\) term keeps the update moving in a consistent direction.
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
AdaGrad
- "Learning rate decay": gradually reduce the learning rate as training progresses.
- AdaGrad (Adaptive Gradient) adapts the learning rate for each individual element of the parameters as learning proceeds.
- \[ h \leftarrow h + \frac{\partial L}{\partial W}\odot \frac{\partial L}{\partial W}\\ W \leftarrow W - \eta \frac{1}{\sqrt{h}}\frac{\partial L}{\partial W} \]
- \(h\) accumulates the sum of squares of all previous gradients (\(\odot\) denotes element-wise multiplication).
- When updating, the learning scale is adjusted by multiplying by \(\frac{1}{\sqrt{h}}\): elements that have been updated by large amounts receive a smaller learning rate.
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)  # avoid division by zero
Adam
- A method that fuses the ideas of Momentum and AdaGrad.
- It also performs "bias correction" of its estimates.
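The two bullets above can be sketched in the same style as the optimizer classes in this note. This is an illustrative implementation assuming common hyperparameter names (beta1, beta2), not the book's source code:

```python
import numpy as np

class Adam:
    """Sketch of Adam: a Momentum-style moving average of the gradient (m)
    plus an AdaGrad-style moving average of its square (v), with bias
    correction of both estimates folded into the step size."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None  # first-moment estimate (like Momentum's velocity)
        self.v = None  # second-moment estimate (like AdaGrad's h)

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.iter += 1
        # Bias correction, expressed as an effective learning rate
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
```

For example, minimizing \(f(x)=x^2\) from \(x=5\) with this sketch drives \(x\) close to 0 within a few hundred updates.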
Initial Weight Values
Initializing the weights to 0
- If the initial weights are all 0, then under backpropagation every weight receives the same update, so the weights stay equal and keep symmetric (duplicated) values. To prevent this "weight uniformity", the initial values must be generated randomly.
Distribution of Hidden-Layer Activations
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.randn(1000, 100)
node_num = 100          # number of neurons in each hidden layer
hidden_layer_size = 5   # five hidden layers
activations = {}        # activation results per layer

for i in range(hidden_layer_size):
    if i != 0:
        x = activations[i - 1]
    w = np.random.randn(node_num, node_num) / np.sqrt(node_num)
    z = np.dot(x, w)
    a = sigmoid(z)  # sigmoid activation
    activations[i] = a

for i, a in activations.items():
    plt.subplot(1, len(activations), i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(a.flatten(), 30, range=(0, 1))
plt.show()
Xavier initial value
- Aims to give the activations of every layer a distribution with a similar spread.
- If the previous layer has \(n\) nodes, use a distribution with standard deviation \(\frac{1}{\sqrt{n}}\) for the initial values.
Initial weight values for ReLU
- The Xavier initial value is derived under the assumption that the activation function is linear. Because sigmoid and tanh are symmetric and can be regarded as linear near 0, the Xavier initial value suits them.
- The initial value specialized for ReLU is called the "He initial value".
- When the previous layer has \(n\) nodes, the He initial value uses a Gaussian distribution with standard deviation \(\sqrt{\frac{2}{n}}\).
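A quick numerical check (layer sizes here are hypothetical) of the two initial values, and of why ReLU needs the extra factor \(\sqrt{2}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # nodes in the previous layer

w_xavier = rng.standard_normal((n, n)) / np.sqrt(n)    # std 1/sqrt(n)
w_he = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # std sqrt(2/n)

# ReLU zeroes roughly half of its inputs; the extra factor sqrt(2) in the
# He initial value compensates, keeping the activation scale roughly
# constant from layer to layer.
x = rng.standard_normal((1000, n))
a = np.maximum(0, x @ w_he)    # one linear layer followed by ReLU
mean_square = np.mean(a ** 2)  # stays near 1, matching the input scale
```

With the Xavier value instead, each ReLU layer would shrink the activation scale by about half, and deep networks would see the activations fade toward 0.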
Batch Normalization
- Advantages
  - Learning can proceed quickly (a larger learning rate can be used).
  - Learning depends less on the initial weight values (no need to be so careful about them).
  - Overfitting is suppressed (the need for Dropout etc. is reduced).
- Idea: adjust the distribution of each layer's activations so that it has an appropriate spread.
- Approach: insert a layer that normalizes the data distribution, i.e. a Batch Normalization layer, into the network.
- Normalization is done per mini-batch during learning, transforming the data to mean 0 and variance 1:
\[ \begin{aligned} \mu_B &\leftarrow \frac{1}{m}\sum_{i=1}^m x_i\\ \sigma^2_B &\leftarrow \frac{1}{m}\sum_{i=1}^m(x_i-\mu_B)^2\\ \hat{x_i}&\leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma^2_B+\varepsilon}} \end{aligned} \]
- For the mini-batch of \(m\) inputs \(B=\{x_1,x_2,...,x_m\}\), compute the mean \(\mu_B\) and variance \(\sigma_B^2\), then normalize the inputs to mean 0 and variance 1. \(\varepsilon\) is a tiny value that prevents division by zero.
- The Batch Norm layer then applies a scale-and-shift transformation to the normalized data (initially \(\gamma=1,\beta=0\); both are adjusted to suitable values through learning):
\[ y_i \leftarrow \gamma \hat{x_i}+\beta \]
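The formulas above combine into a short forward pass. A minimal training-time sketch (the function name and shapes are my own; the moving averages used at inference time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    # Per-feature statistics over the mini-batch (axis 0)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(100, 5) * 3 + 10  # a batch far from mean 0 / variance 1
y = batch_norm_forward(x, gamma=1.0, beta=0.0)
# y now has per-feature mean ~0 and variance ~1
```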
# Implementing Batch Norm
import sys, os
import numpy as np
import matplotlib.pyplot as plt

path = os.getcwd() + '\\sourcecode'
sys.path.append(path)
from sourcecode.dataset.mnist import load_mnist
from sourcecode.common.multi_layer_net_extend import MultiLayerNetExtend
from sourcecode.common.optimizer import SGD, Adam

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
x_train = x_train[:1000]  # use only 1000 samples to speed up the experiment
t_train = t_train[:1000]

max_epochs = 20
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.01

def __train(weight_init_std):
    bn_network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100], output_size=10,
                                     weight_init_std=weight_init_std, use_batchnorm=True)
    network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100], output_size=10,
                                  weight_init_std=weight_init_std)
    optimizer = SGD(lr=learning_rate)
    train_acc_list = []
    bn_train_acc_list = []
    iter_per_epoch = max(train_size / batch_size, 1)
    epoch_cnt = 0
    for i in range(1000000000):
        batch_mask = np.random.choice(train_size, batch_size)  # pick batch_size random samples
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
        for _network in (bn_network, network):
            grads = _network.gradient(x_batch, t_batch)
            optimizer.update(_network.params, grads)
        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, t_train)
            bn_train_acc = bn_network.accuracy(x_train, t_train)
            train_acc_list.append(train_acc)
            bn_train_acc_list.append(bn_train_acc)
            print("epoch:" + str(epoch_cnt) + "|" + str(train_acc) + "-" + str(bn_train_acc))
            epoch_cnt += 1
            if epoch_cnt >= max_epochs:
                break
    return train_acc_list, bn_train_acc_list

# Plot the results
weight_scale_list = np.logspace(0, -4, num=16)  # start and end points are powers of 10
x = np.arange(max_epochs)
for i, w in enumerate(weight_scale_list):
    print("======" + str(i + 1) + "/16" + "========")
    train_acc_list, bn_train_acc_list = __train(w)
    plt.subplot(4, 4, i + 1)
    plt.title("W:" + str(w))
    if i == 15:
        plt.plot(x, bn_train_acc_list, label='Batch Normalization', markevery=2)
        plt.plot(x, train_acc_list, linestyle='--', label='Normal (without BatchNorm)', markevery=2)
    else:
        plt.plot(x, bn_train_acc_list, markevery=2)
        plt.plot(x, train_acc_list, linestyle='--', markevery=2)
    plt.ylim(0, 1.0)
    if i % 4:
        plt.yticks([])
    else:
        plt.ylabel("accuracy")
    if i < 12:
        plt.xticks([])
    else:
        plt.xlabel("epochs")
    plt.legend(loc='lower right')
plt.show()
Output (abridged; the full per-epoch log is omitted): each "epoch:E|a-b" line reports the training accuracy without Batch Norm (a) and with it (b), for 16 initial weight scales from 10^0 down to 10^-4. For most scales the plain network stalls near 0.1 accuracy while the Batch Norm network keeps improving (reaching ~1.0 for mid-range scales); only for a few mid-range scales (runs 3/16 and 4/16) does the plain network also train well. The largest scale additionally triggers RuntimeWarnings about overflow in np.sum(W**2).
Regularization
Overfitting
- The state in which the model fits the training data but does not fit other data outside the training set well.
- Causes
  - The model has many parameters and high expressive power.
  - The training data is small.
Weight decay
- Suppresses overfitting by penalizing large weights during learning.
- Writing the weights as \(W\), L2-norm weight decay adds \(\frac{1}{2}\lambda W^2\) to the loss function, where \(\lambda\) controls the strength of the penalty.
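As a concrete check (the value of \(\lambda\) here is hypothetical), the penalty term and the extra gradient it contributes:

```python
import numpy as np

lam = 0.1  # hypothetical weight-decay strength lambda
W = np.array([[1.0, -2.0],
              [0.5, 3.0]])

# Term added to the loss, and the term it adds to the gradient dL/dW:
penalty = 0.5 * lam * np.sum(W ** 2)  # (1/2) * lambda * ||W||^2
grad_extra = lam * W                  # d/dW of the penalty

# Each update therefore shrinks every weight toward 0 in proportion
# to its size, so large weights are penalized the most.
```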
Dropout
- A method that randomly deletes neurons during learning: at training time, neurons in the hidden layers are randomly selected and dropped.
- Ensemble learning: train multiple models separately, then average their outputs at inference time.
- Dropout approximately realizes the effect of ensemble learning within a single network.
class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # False entries in the mask delete the corresponding neurons
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # At inference, scale outputs by the fraction of neurons kept
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # Gradients flow only through the neurons kept in forward
        return dout * self.mask
Validating Hyperparameters
- Hyperparameters include the number of neurons in each layer, the batch size, the learning rate, the weight-decay strength, and so on.
Validation data
- Test data must not be used to evaluate hyperparameter performance: tuning hyperparameters on the test data makes their values overfit to it. Hyperparameter tuning requires its own dedicated data, generally called validation data.
Optimizing hyperparameters
- When optimizing hyperparameters, gradually narrow the range in which "good" values exist.
- Steps
  - Step 0: Set a range for each hyperparameter.
  - Step 1: Sample randomly from the set ranges.
  - Step 2: Train with the sampled values and evaluate recognition accuracy on the validation data (with the number of epochs set small).
  - Step 3: Repeat steps 1 and 2 (e.g. 100 times), and narrow the hyperparameter ranges based on the resulting accuracies.
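Steps 0 and 1 above can be sketched as log-uniform random sampling (the ranges below are illustrative, e.g. \(10^{-8}\) to \(10^{-4}\) for weight decay and \(10^{-6}\) to \(10^{-2}\) for the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams():
    # Sample the exponent uniformly, so values are spread evenly
    # across orders of magnitude (log-uniform sampling).
    weight_decay = 10 ** rng.uniform(-8, -4)
    lr = 10 ** rng.uniform(-6, -2)
    return lr, weight_decay

trials = [sample_hyperparams() for _ in range(100)]
# Step 2 would train briefly with each pair and record validation accuracy;
# step 3 keeps the best trials and narrows the sampling ranges around them.
```

Sampling the exponent rather than the value itself matters: uniform sampling of the raw value would almost never produce small candidates like \(10^{-6}\).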