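A limitation of gradient descent is that it uses the same step size (learning rate) for every input variable. AdaGrad and RMSProp are extensions of gradient descent that add a self-adaptive learning rate for each parameter of the objective function.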





This tutorial is divided into three parts.







x = x - step_size * f'(x)
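As a rough illustration of this update rule, the sketch below applies it repeatedly to the one-dimensional function f(x) = x^2. The function names, starting point, step size, and iteration count are assumptions chosen for the illustration and are not part of the tutorial's own code.

```python
# minimal sketch of plain gradient descent on f(x) = x^2 (illustrative values only)

def f_prime(x):
    # derivative of f(x) = x^2
    return 2.0 * x

def gradient_descent(x0, step_size, n_steps):
    # repeatedly apply x = x - step_size * f'(x)
    x = x0
    for _ in range(n_steps):
        x = x - step_size * f_prime(x)
    return x

# hypothetical starting point and step size
x_final = gradient_descent(x0=1.0, step_size=0.1, n_steps=50)
print(x_final)
```

Every variable shares the single step_size here, which is the limitation that the adaptive methods discussed next try to address.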




Adadelta (or "ADADELTA") is an extension of the gradient descent optimization algorithm. It was described by Matthew Zeiler in the 2012 paper titled "ADADELTA: An Adaptive Learning Rate Method."




cust_step_size(t + 1) = step_size / (1e-8 + sqrt(s(t)))

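Where cust_step_size(t + 1) is the calculated step size for an input variable at a given point in the search, step_size is the initial step size, sqrt() is the square root operation, and s(t) is the sum of the squared partial derivatives for the input variable seen so far in the search (including the current iteration).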
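The AdaGrad step size calculation above could be sketched as follows. The helper name adagrad_step_size and the list of partial derivatives are hypothetical and only serve to show how the step size shrinks as the sum of squared partial derivatives grows.

```python
# sketch of the AdaGrad custom step size for a single input variable
from math import sqrt

def adagrad_step_size(step_size, sq_grad_sum, eps=1e-8):
    # step_size / (1e-8 + sqrt(s(t)))
    return step_size / (eps + sqrt(sq_grad_sum))

# hypothetical partial derivatives observed over three iterations
grads = [2.0, 1.0, 0.5]
s = 0.0
step_sizes = []
for g in grads:
    # accumulate the sum of squared partial derivatives
    s += g ** 2.0
    step_sizes.append(adagrad_step_size(0.1, s))
print(step_sizes)
```

Because s(t) is a running sum that never decreases, the step size can only get smaller over the course of the search.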

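RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying average, or moving average, of the partial derivatives instead of the sum when calculating the step size for each parameter. This is achieved by adding a new hyperparameter, "rho", that acts like momentum for the partial derivatives.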


s(t + 1) = (s(t) * rho) + (f'(x(t))^2 * (1.0 - rho))

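Where s(t + 1) is the decaying moving average of the squared partial derivative for one parameter at the current iteration of the algorithm, s(t) is the decaying moving average from the previous iteration, f'(x(t))^2 is the squared partial derivative for the current parameter, and rho is a hyperparameter that, like momentum, typically has a value of 0.9.

Adadelta is a further extension of RMSProp designed to improve the convergence of the algorithm and to remove the need for a manually specified initial learning rate.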
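To make the decaying average update for s(t + 1) more concrete, here is a small sketch. The function name decaying_average and the sequence of partial derivatives are assumptions made for the illustration.

```python
# sketch of the decaying moving average of squared partial derivatives

def decaying_average(s_prev, grad, rho=0.9):
    # s(t + 1) = (s(t) * rho) + (f'(x(t))^2 * (1.0 - rho))
    return (s_prev * rho) + (grad ** 2.0 * (1.0 - rho))

# hypothetical partial derivatives: three non-zero values, then two zeros
s = 0.0
history = []
for g in [1.0, 1.0, 1.0, 0.0, 0.0]:
    s = decaying_average(s, g)
    history.append(s)
print(history)
```

Unlike the AdaGrad sum, the average rises while the gradients are large and falls again once they vanish, so old gradients are gradually forgotten.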

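As with RMSProp, a decaying moving average of the squared partial derivative is calculated for each parameter. The key difference is in the calculation of the step size for a parameter, which uses the decaying average of the delta, or change, in the parameter. This choice of numerator ensures that both parts of the calculation have the same units.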


cust_step_size(t + 1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))

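Where cust_step_size(t + 1) is the custom step size for a parameter for a given update, ep is a hyperparameter added to the numerator and denominator to avoid a divide-by-zero error, delta(t) is the decaying moving average of the squared change to the parameter (calculated in the last iteration), and s(t) is the decaying moving average of the squared partial derivative (calculated in the current iteration).

The ep hyperparameter is set to a small value such as 1e-3 or 1e-8. In addition to avoiding a divide-by-zero error, it also helps with the first step of the algorithm, when the decaying moving average of the squared change and the decaying moving average of the squared gradient are both zero.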
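The role of ep on the first step can be seen in a short sketch. The helper name adadelta_step_size and the example gradient value are assumptions made for the illustration.

```python
# sketch of the Adadelta custom step size on the very first iteration
from math import sqrt

def adadelta_step_size(delta, s, ep=1e-3):
    # (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))
    return (ep + sqrt(delta)) / (ep + sqrt(s))

rho = 0.99
grad = 2.0  # hypothetical partial derivative at the starting point
# the squared-gradient average after its first update, starting from zero
s = (0.0 * rho) + (grad ** 2.0 * (1.0 - rho))
# no parameter change has been observed yet, so delta is still zero
first_step = adadelta_step_size(0.0, s)
tiny_ep_step = adadelta_step_size(0.0, s, ep=1e-8)
print(first_step, tiny_ep_step)
```

With delta at zero the numerator is just ep, so ep effectively sets the size of the first move; a much smaller ep gives a much smaller first step.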


change(t + 1) = cust_step_size(t + 1) * f'(x(t))


delta(t + 1) = (delta(t) * rho) + (change(t + 1)^2 * (1.0 - rho))

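Where delta(t + 1) is the decaying average of the change to the variable to be used in the next iteration, change(t + 1) was calculated in the previous step, and rho is a hyperparameter that acts like momentum and has a value such as 0.9.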


x(t + 1) = x(t) - change(t + 1)
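Putting the equations in this section together for a single variable might look like the sketch below. It assumes the one-dimensional function f(x) = x^2, and the function name adadelta_1d, the starting point, and the iteration count are hypothetical choices made for the illustration.

```python
# sketch of the Adadelta update equations for one variable on f(x) = x^2
from math import sqrt

def adadelta_1d(x, n_iter=200, rho=0.99, ep=1e-3):
    s = 0.0      # decaying average of squared partial derivatives
    delta = 0.0  # decaying average of squared parameter changes
    for _ in range(n_iter):
        # partial derivative of x^2 at the current point
        grad = 2.0 * x
        # s(t + 1) = (s(t) * rho) + (f'(x(t))^2 * (1.0 - rho))
        s = (s * rho) + (grad ** 2.0 * (1.0 - rho))
        # cust_step_size(t + 1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))
        step = (ep + sqrt(delta)) / (ep + sqrt(s))
        # change(t + 1) = cust_step_size(t + 1) * f'(x(t))
        change = step * grad
        # delta(t + 1) = (delta(t) * rho) + (change(t + 1)^2 * (1.0 - rho))
        delta = (delta * rho) + (change ** 2.0 * (1.0 - rho))
        # x(t + 1) = x(t) - change(t + 1)
        x = x - change
    return x

x_start = 0.5
x_end = adadelta_1d(x_start)
print(x_start, x_end)
```

No initial learning rate appears anywhere in the loop; only rho and ep need to be chosen.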






# objective function
def objective(x, y):
	return x**2.0 + y**2.0


# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

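Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.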


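We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.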

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

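Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed into contours shown with a color gradient. We will use this plot to show the specific points explored during the search.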





f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function below implements this.

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])
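One way to sanity-check an analytic derivative like this is to compare it against a central finite difference. The sketch below is an optional check and not part of the tutorial's algorithm; the helper name numerical_gradient, the step h, and the test point are assumptions.

```python
# sketch: compare the analytic derivative with a central finite difference
from numpy import asarray, allclose

# objective function (as defined in the tutorial)
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function (as defined in the tutorial)
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical helper: central-difference estimate of the gradient
def numerical_gradient(func, x, y, h=1e-6):
    dx = (func(x + h, y) - func(x - h, y)) / (2.0 * h)
    dy = (func(x, y + h) - func(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

# arbitrary test point inside the [-1, 1] bounds
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
matches = allclose(analytic, numeric, atol=1e-6)
print(analytic, numeric, matches)
```

If the two gradients disagree beyond a small tolerance, the hand-written derivative is the first place to look for a mistake.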


# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])


# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
# list of the average parameter updates
sq_para_avg = [0.0 for _ in range(bounds.shape[0])]

Next, we can enumerate the fixed number of iterations of the search optimization algorithm, defined by the "n_iter" hyperparameter.

# run the gradient descent
for it in range(n_iter):


# calculate gradient
gradient = derivative(solution[0], solution[1])

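We then need to calculate the square of each partial derivative and update the decaying moving average of the squared partial derivatives using the "rho" hyperparameter.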

# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

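We can then use the decaying moving average of the squared partial derivatives, together with the gradient, to calculate the step size for the next point. We will do this one variable at a time.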

# build solution
new_solution = list()
for i in range(solution.shape[0]):

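First, we calculate the custom step size for this variable on this iteration using the decaying moving averages of the squared changes and the squared partial derivatives, as well as the "ep" hyperparameter.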

# calculate the step size for this variable
alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))


# calculate the change
change = alpha * gradient[i]

We can then use the change to update the decaying moving average of the squared changes using the "rho" hyperparameter.

# update the moving average of squared parameter changes
sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))


# calculate the new position in this variable
value = solution[i] - change
# store this variable


# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))


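We can tie all of this together into a function named adadelta() that takes the names of the objective and derivative functions, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and rho, and returns the final solution and its evaluation. The ep hyperparameter can also be passed as an argument, although it defaults to 1e-3. The complete function is listed below.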

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

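Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.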
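One possible vectorized adaptation is sketched below. The function name adadelta_vectorized is hypothetical, and the sketch assumes the same two-dimensional objective and derivative functions used throughout the tutorial; it is meant as a starting point rather than a definitive implementation.

```python
# sketch of a vectorized Adadelta using NumPy arrays instead of per-variable loops
from numpy import asarray, sqrt, zeros
from numpy.random import rand, seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical vectorized variant of the adadelta() function
def adadelta_vectorized(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # arrays of the average squared gradients and squared parameter changes
    sq_grad_avg = zeros(len(bounds))
    sq_para_avg = zeros(len(bounds))
    for _ in range(n_iter):
        # calculate gradient for all variables at once
        gradient = derivative(solution[0], solution[1])
        # update the moving average of the squared gradients
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        # calculate the step size for every variable
        alpha = (ep + sqrt(sq_para_avg)) / (ep + sqrt(sq_grad_avg))
        # calculate the change for every variable
        change = alpha * gradient
        # update the moving average of squared parameter changes
        sq_para_avg = (sq_para_avg * rho) + (change**2.0 * (1.0 - rho))
        # move to the new position
        solution = solution - change
    return solution, objective(solution[0], solution[1])

# seed the pseudo random number generator
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adadelta_vectorized(objective, derivative, bounds, 120, 0.99)
print('f(%s) = %f' % (best, score))
```

The arithmetic per variable is the same as in the loop version; NumPy simply applies each equation to all variables in one expression.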



# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('f(%s) = %f' % (best, score))


# gradient descent optimization with adadelta for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('f(%s) = %f' % (best, score))


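Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.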


>100 f([-1.45142626e-07 2.71163181e-03]) = 0.00001
>101 f([-1.24898699e-07 2.56875692e-03]) = 0.00001
>102 f([-1.07454197e-07 2.43328237e-03]) = 0.00001
>103 f([-9.24253035e-08 2.30483111e-03]) = 0.00001
>104 f([-7.94803792e-08 2.18304501e-03]) = 0.00000
>105 f([-6.83329263e-08 2.06758392e-03]) = 0.00000
>106 f([-5.87354975e-08 1.95812477e-03]) = 0.00000
>107 f([-5.04744185e-08 1.85436071e-03]) = 0.00000
>108 f([-4.33652179e-08 1.75600036e-03]) = 0.00000
>109 f([-3.72486699e-08 1.66276699e-03]) = 0.00000
>110 f([-3.19873691e-08 1.57439783e-03]) = 0.00000
>111 f([-2.74627662e-08 1.49064334e-03]) = 0.00000
>112 f([-2.3572602e-08 1.4112666e-03]) = 0.00000
>113 f([-2.02286891e-08 1.33604264e-03]) = 0.00000
>114 f([-1.73549914e-08 1.26475787e-03]) = 0.00000
>115 f([-1.48859650e-08 1.19720951e-03]) = 0.00000
>116 f([-1.27651224e-08 1.13320504e-03]) = 0.00000
>117 f([-1.09437923e-08 1.07256172e-03]) = 0.00000
>118 f([-9.38004754e-09 1.01510604e-03]) = 0.00000
>119 f([-8.03777865e-09 9.60673346e-04]) = 0.00000
f([-8.03777865e-09 9.60673346e-04]) = 0.000001



# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions


# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)


# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')


# plot the search path as white dots joined by lines
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')


# example of plotting the adadelta search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of the average square gradients for each variable
	sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
	# list of the average parameter updates
	sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate gradient
		gradient = derivative(solution[0], solution[1])
		# update the average of the squared partial derivatives
		for i in range(gradient.shape[0]):
			# calculate the squared gradient
			sg = gradient[i]**2.0
			# update the moving average of the squared gradient
			sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
		# build solution
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the step size for this variable
			alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
			# calculate the change
			change = alpha * gradient[i]
			# update the moving average of squared parameter changes
			sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
			# calculate the new position in this variable
			value = solution[i] - change
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the search path as white dots joined by lines
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()





