6.1 Gradient Descent
Gradient
- Derivative
- Partial derivative
- Gradient:
  ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
Meaning
The gradient points in the direction in which the function value increases fastest; stepping against it decreases the function value.
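As an illustrative example (not from the original): for f(x1, x2) = x1^2 + x2^2, the gradient is ∇f = (2x1, 2x2). At the point (3, 4) it evaluates to (6, 8), which points away from the minimum at the origin, so moving along −∇f moves toward the minimum.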
Gradient Descent
- ∇f(θ) points toward larger function values, so to minimize we step in the opposite direction.
- Search for minima:
  θ_{t+1} = θ_t − α_t ∇f(θ_t)
- The learning rate (lr) is usually denoted α or η.
Example
θ_{t+1} = θ_t − α_t ∇f(θ_t)  (a code sketch follows below)
Optimization process 1 (figure)
Optimization process 2 (figure)
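A minimal sketch of this update rule, assuming TensorFlow 2.x; the quadratic objective f(θ) = (θ − 5)^2 and the fixed learning rate 0.1 are illustrative choices, not from the original:

import tensorflow as tf

theta = tf.Variable(0.0)                   # initial parameter θ_0
lr = 0.1                                   # learning rate α

for step in range(50):
    with tf.GradientTape() as tape:
        loss = (theta - 5.0) ** 2          # f(θ) = (θ − 5)^2, minimum at θ = 5
    grad = tape.gradient(loss, theta)      # ∇f(θ_t)
    theta.assign_sub(lr * grad)            # θ_{t+1} = θ_t − α ∇f(θ_t)

theta  # converges toward 5.0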
Automatic Differentiation
- with tf.GradientTape() as tape:
  - build the computation graph
  - loss = f_θ(x)
- [w_grad] = tape.gradient(loss, [w])
import tensorflow as tf

w = tf.constant(1.)
x = tf.constant(2.)
y = x * w                        # computed outside any tape: not recorded

with tf.GradientTape() as tape:
    tape.watch([w])              # constants must be watched explicitly
    y2 = x * w                   # recorded on the tape
grad1 = tape.gradient(y, [w])    # y was not recorded, so no gradient is available
grad1  # [None]

with tf.GradientTape() as tape:
    tape.watch([w])
    y2 = x * w
grad2 = tape.gradient(y2, [w])   # dy2/dw = x = 2.0
grad2  # [<tf.Tensor: id=7, shape=(), dtype=float32, numpy=2.0>]
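As a side note, trainable tf.Variable objects are watched automatically, so tape.watch is only needed for plain tensors such as tf.constant; a small sketch:

w = tf.Variable(1.)              # variables are watched automatically
x = tf.constant(2.)
with tf.GradientTape() as tape:
    y = x * w
grad = tape.gradient(y, [w])
grad  # [<tf.Tensor: shape=(), dtype=float32, numpy=2.0>]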
Using a persistent GradientTape to allow multiple gradient calls
w = tf.constant(1.)
x = tf.constant(2.)
y = x * w
with tf.GradientTape() as tape:
    tape.watch([w])
    y2 = x * w
grad = tape.gradient(y2, [w])    # first call succeeds
grad  # [<tf.Tensor: id=6, shape=(), dtype=float32, numpy=2.0>]
grad = tape.gradient(y2, [w])    # second call on a non-persistent tape fails
# RuntimeError: GradientTape.gradient can only be called once on non-persistent tapes.
Setting persistent=True
w = tf.constant(1.)
x = tf.constant(2.)
y = x * w
with tf.GradientTape(persistent=True) as tape:
    tape.watch([w])
    y2 = x * w
grad = tape.gradient(y2, [w])
grad  # [<tf.Tensor: id=6, shape=(), dtype=float32, numpy=2.0>]
grad = tape.gradient(y2, [w])    # a persistent tape can be queried multiple times
grad  # [<tf.Tensor: id=10, shape=(), dtype=float32, numpy=2.0>]
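Because a persistent tape keeps its recorded operations alive until the reference is dropped, it is good practice to delete it once you are done (a usage note, not from the original):

del tape  # release the resources held by the persistent tape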
Second-Order Gradients
- y = xw + b
- ∂y/∂w = x
- ∂²y/∂w² = ∂(∂y/∂w)/∂w = ∂x/∂w = None
w = tf.Variable(1.0)
b = tf.Variable(2.0)
x = tf.Variable(3.0)

with tf.GradientTape() as t1:
    with tf.GradientTape() as t2:
        y = x * w + b
    dy_dw, dy_db = t2.gradient(y, [w, b])   # first-order gradients, recorded by t1
d2y_dw2 = t1.gradient(dy_dw, w)             # second-order gradient

dy_dw    # tf.Tensor(3.0, shape=(), dtype=float32)
dy_db    # tf.Tensor(1.0, shape=(), dtype=float32)
d2y_dw2  # None: dy_dw = x does not depend on w
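For contrast, a hypothetical sketch (not from the original) where the second derivative is nonzero, using y = w^3 so that ∂y/∂w = 3w² and ∂²y/∂w² = 6w:

w = tf.Variable(2.0)
with tf.GradientTape() as t1:
    with tf.GradientTape() as t2:
        y = w ** 3
    dy_dw = t2.gradient(y, w)     # 3 * w^2 = 12.0
d2y_dw2 = t1.gradient(dy_dw, w)   # 6 * w = 12.0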