Recurrent neural networks borrow ideas such as weight sharing from convolutional neural networks in order to handle sequential data, which is why the two have much in common.
The main difference between RNNs and CNNs lies in the form of the input:
A recurrent neural network is a class of neural networks for processing sequential data, while a convolutional neural network is a class of neural networks for processing grid-structured data (such as an image).
A recurrent network can scale to much longer sequences, and most recurrent networks can also handle sequences of variable length; a convolutional network scales easily to images of large width and height, and can handle images of variable size.
Recurrent graphs
The unrolled graph makes the computation flow explicit. It also illustrates, via explicit information-flow paths, how information moves forward in time (computing outputs and losses) and backward in time (computing gradients).
(Image source: the Deep Learning book ("花书"), page 321)
The simplest form of RNN
In the recurrent graph on the left, x is the input to the network, U is the weight matrix from the input layer to the hidden layer, W is the weight matrix from the memory cell to the hidden layer, V is the weight matrix from the hidden layer to the output layer, S is the output of the hidden layer, which is also saved to the memory cell and fed in together with the input at the next time step, and O is the output of the network. [The W, V, and U here play the same role as the parameter matrices in a fully connected network.]
From the unrolled graph on the right, we can see that the hidden-layer output at each time step is passed on to the next time step, so the network at each step retains some history from earlier steps and passes it, combined with the current network state, on to the next step.
Forward propagation
Suppose we have a 1000 ms speech sample containing "早上好". Sampling one feature vector every 10 ms gives 100 sample vectors. Resampling each vector to 160 dimensions, we get a $100 \times 160$ matrix, with $t = 100$: say $t = 1$ to $30$ corresponds to "zao", $t = 31$ to $70$ to "shang", and $t = 71$ to $100$ to "hao". Suppose further that the label vocabulary has 6000 entries; then the output vector $O$ is 6000-dimensional. The figure above shows the forward propagation as successive sample vectors enter the RNN for training.
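As a quick sanity check of these shapes, here is a minimal NumPy sketch; the random matrix is just a stand-in for real acoustic features, and all sizes come from the text:

```python
import numpy as np

# Toy dimensions from the example above: 100 frames of 160-dim features,
# a 1000-unit hidden layer, and 6000 candidate labels.
T, D, H, K = 100, 160, 1000, 6000
x = np.random.randn(T, D)   # one utterance: a 100 x 160 matrix
print(x.shape)              # (100, 160); row t is the frame at time step t
```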
Suppose the hidden layer has 1000 neurons. Then $h$ is a 1000-dimensional vector and $U$ has dimensions $160 \times 1000$. Hence, for $t = 1, 2$, we have

$$\begin{array}{ll} h_{1}=x_{1} U+b_{1} & h_{2}=x_{2} U+S_{1} W+b_{1} \\ S_{1}=f\left(h_{1}\right) & S_{2}=f\left(h_{2}\right) \\ O_{1}=S_{1} V+b_{2} & O_{2}=S_{2} V+b_{2} \end{array}$$
For general time steps $t-1$ and $t$, we have

$$\begin{array}{ll} h_{t-1}=x_{t-1} U+S_{t-2} W+b_{1} & h_{t}=x_{t} U+S_{t-1} W+b_{1} \\ S_{t-1}=f\left(h_{t-1}\right) & S_{t}=f\left(h_{t}\right) \\ O_{t-1}=S_{t-1} V+b_{2} & O_{t}=S_{t} V+b_{2} \end{array}$$
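As a sketch, the whole forward pass in NumPy looks like this, assuming row vectors and a tanh activation (the text leaves $f$ unspecified; a sigmoid would work the same way). Initializing the state to zero reproduces the missing $W$ term at $t = 1$:

```python
import numpy as np

# Minimal single-sequence RNN forward pass (a sketch, not a full training loop).
T, D, H, K = 100, 160, 1000, 6000
rng = np.random.default_rng(0)
U = rng.standard_normal((D, H)) * 0.01   # input  -> hidden
W = rng.standard_normal((H, H)) * 0.01   # hidden -> hidden (recurrent)
V = rng.standard_normal((H, K)) * 0.01   # hidden -> output
b1, b2 = np.zeros(H), np.zeros(K)

x = rng.standard_normal((T, D))          # one utterance, one frame per row
S_prev = np.zeros(H)                     # no history before t = 1
S, O = np.zeros((T, H)), np.zeros((T, K))
for t in range(T):
    h = x[t] @ U + S_prev @ W + b1       # h_t = x_t U + S_{t-1} W + b_1
    S[t] = np.tanh(h)                    # S_t = f(h_t)
    O[t] = S[t] @ V + b2                 # O_t = S_t V + b_2
    S_prev = S[t]
```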
In the equations above, $x_i$ is a 160-dimensional vector, $S_i$ and $h_i$ are 1000-dimensional vectors, and $O_i$ is a 6000-dimensional vector. As we can see, the RNN trains by sharing the three parameter matrices $W, U, V$ across time. Without weight sharing, each time step would carry its own matrices, so the parameter count would grow with the length of the sequence. One advantage of weight sharing here is a 100-fold reduction in the number of parameters; another is that the same network can handle sequence sets of different lengths.
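A back-of-the-envelope check of the 100-fold claim (biases ignored; dimensions as assumed above):

```python
# With weight sharing there is one (U, W, V) triple; without it, every one
# of the T = 100 time steps would carry its own copy.
D, H, K, T = 160, 1000, 6000, 100
shared = D * H + H * H + H * K               # U + W + V
unshared = T * shared                        # one triple per time step
print(shared, unshared, unshared // shared)  # ratio is exactly T = 100
```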
Backward propagation
For the outputs, we must sum the discrepancies between every predicted output and its label to obtain the loss function:
$$J=\sum_{i=1}^{t}\left\|O_{i}-\widetilde{O}_{i}\right\|=J_{1}+J_{2}+\cdots+J_{t}$$
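In code, the total loss is simply a sum of per-step losses; a sketch using the Euclidean norm of each prediction error (the text writes a generic norm):

```python
import numpy as np

# Random stand-ins for the predictions O_i and the labels \tilde{O}_i.
O_pred = np.random.randn(100, 6000)                # one row per time step
O_true = np.random.randn(100, 6000)
J_steps = np.linalg.norm(O_pred - O_true, axis=1)  # J_1, ..., J_t
J = J_steps.sum()                                  # J = J_1 + J_2 + ... + J_t
```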
We therefore differentiate with respect to each output node:
$$\frac{\partial J}{\partial o_{i}}=\frac{\partial\left(J_{1}+J_{2}+\cdots+J_{t}\right)}{\partial o_{i}}=\frac{\partial J_{i}}{\partial o_{i}}$$
Each of these gradients with respect to an output is a 6000-dimensional vector.
Recall the fully connected (FC) case first: for a hidden layer $y=XW$, we have

$$\dfrac{\partial J}{\partial X}=\dfrac{\partial J}{\partial y}W^{T},\qquad \dfrac{\partial J}{\partial W}=X^{T}\dfrac{\partial J}{\partial y}$$
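These two identities are easy to spot-check numerically; a sketch using $J = \sum y$, so that $\partial J/\partial y$ is a matrix of ones:

```python
import numpy as np

# Finite-difference check of dJ/dX = dJ/dy W^T for y = X W, J = sum(y).
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
dJ_dy = np.ones((4, 5))            # J = sum(y)  =>  dJ/dy is all ones
dJ_dX = dJ_dy @ W.T                # dJ/dX = dJ/dy W^T
dJ_dW = X.T @ dJ_dy                # dJ/dW = X^T dJ/dy

eps = 1e-6
Xp = X.copy()
Xp[0, 0] += eps                    # perturb one entry of X
numeric = ((Xp @ W).sum() - (X @ W).sum()) / eps
print(np.isclose(dJ_dX[0, 0], numeric))   # True
```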
The situation here is similar. We first differentiate with respect to the inputs and outputs of each hidden layer. For the last two time steps, we have
$$\begin{array}{ll} \dfrac{\partial J}{\partial S_{t}}=\dfrac{\partial J}{\partial O_{t}} V^{T} & \dfrac{\partial J}{\partial S_{t-1}}=\dfrac{\partial J}{\partial O_{t-1}} V^{T}+\dfrac{\partial J}{\partial h_{t}} W^{T} \\ \dfrac{\partial J}{\partial h_{t}}=\dfrac{\partial J}{\partial S_{t}} \dfrac{d S_{t}}{d h_{t}} & \dfrac{\partial J}{\partial h_{t-1}}=\dfrac{\partial J}{\partial S_{t-1}} \dfrac{d S_{t-1}}{d h_{t-1}} \\ \dfrac{\partial J}{\partial x_{t}}=\dfrac{\partial J}{\partial h_{t}} U^{T} & \dfrac{\partial J}{\partial x_{t-1}}=\dfrac{\partial J}{\partial h_{t-1}} U^{T} \end{array}$$
and so on, all the way back to the first two time steps:
$$\begin{array}{ll} \dfrac{\partial J}{\partial S_{2}}=\dfrac{\partial J}{\partial O_{2}} V^{T}+\dfrac{\partial J}{\partial h_{3}} W^{T} & \dfrac{\partial J}{\partial S_{1}}=\dfrac{\partial J}{\partial O_{1}} V^{T}+\dfrac{\partial J}{\partial h_{2}} W^{T} \\ \dfrac{\partial J}{\partial h_{2}}=\dfrac{\partial J}{\partial S_{2}} \dfrac{d S_{2}}{d h_{2}} & \dfrac{\partial J}{\partial h_{1}}=\dfrac{\partial J}{\partial S_{1}} \dfrac{d S_{1}}{d h_{1}} \\ \dfrac{\partial J}{\partial x_{2}}=\dfrac{\partial J}{\partial h_{2}} U^{T} & \dfrac{\partial J}{\partial x_{1}}=\dfrac{\partial J}{\partial h_{1}} U^{T} \end{array}$$
Note that $\dfrac{dS_t}{dh_t}=S_{t}(1-S_{t})$ for a sigmoid activation, or $1-S_{t}^{2}$ for tanh.
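Putting the recursion into code: a sketch of the backward sweep over the states for one sequence, assuming tanh (so $dS_t/dh_t = 1 - S_t^2$) and taking the per-step output gradients `dO` as given; `S`, `x`, `U`, `W`, `V` are as in the forward-pass sketch above:

```python
import numpy as np

def bptt_states(dO, S, x, U, W, V):
    """Backward sweep for dJ/dS_t, dJ/dh_t, dJ/dx_t (tanh assumed)."""
    T = S.shape[0]
    dS = np.zeros_like(S)
    dh = np.zeros_like(S)
    dx = np.zeros_like(x)
    dh_next = np.zeros(S.shape[1])            # dJ/dh_{t+1}; zero past the end
    for t in range(T - 1, -1, -1):            # walk backwards in time
        dS[t] = dO[t] @ V.T + dh_next @ W.T   # dJ/dS_t (no W term at t = T)
        dh[t] = dS[t] * (1.0 - S[t] ** 2)     # elementwise tanh derivative
        dx[t] = dh[t] @ U.T                   # dJ/dx_t
        dh_next = dh[t]
    return dS, dh, dx
```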
The next step is to differentiate with respect to the parameters. The RNN has three parameter matrices. Start with $V$: since the RNN has multiple outputs $O_{i}$, for $J(O_1, O_2, \cdots, O_t)$ we have
$$\begin{array}{l} \dfrac{\partial J_t}{\partial V}=S_{t}^{T} \dfrac{\partial J}{\partial o_{t}} \\ \dfrac{\partial J_{t-1}}{\partial V}=S_{t-1}^{T}\dfrac{\partial J}{\partial o_{t-1}} \\ \vdots \\ \dfrac{\partial J_{1}}{\partial V}=S_{1}^{T} \dfrac{\partial J}{\partial o_{1}} \end{array}$$
[As in the FC case, when differentiating with respect to a parameter matrix, the transposed input comes first in the product.]
$$\frac{\partial J}{\partial V}=\sum_{i=1}^{t} S_{i}^{T} \frac{\partial J}{\partial o_{i}}$$
Similarly, for $U$ and $W$ we have
$$\frac{\partial J}{\partial U}=\sum_{i=1}^{t} x_{i}^{T} \frac{\partial J}{\partial h_{i}},\qquad \frac{\partial J}{\partial W}=\sum_{i=1}^{t-1} S_{i}^{T} \frac{\partial J}{\partial h_{i+1}}$$
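And a matching sketch for the parameter gradients; note the $W$ sum stops at $t-1$, since $S_t$ only feeds $h_{t+1}$ (`dO`, `dh`, `S`, `x` as in the previous sketch):

```python
import numpy as np

def bptt_params(dO, dh, S, x):
    """Accumulate dJ/dV, dJ/dU, dJ/dW from the per-step gradients."""
    dV = sum(np.outer(S[i], dO[i]) for i in range(len(S)))        # S_i^T dJ/do_i
    dU = sum(np.outer(x[i], dh[i]) for i in range(len(x)))        # x_i^T dJ/dh_i
    dW = sum(np.outer(S[i], dh[i + 1]) for i in range(len(S) - 1))
    return dV, dU, dW
```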
[Note that, as the graph shows, $W$ sits between consecutive time steps, so its sum has only $t-1$ terms. The graph also suggests a useful rule of thumb. For gradients with respect to an input or output vector, count the arrows leaving that vector: $S_{t-1}$, for example, has two outgoing arrows, pointing to $O_{t-1}$ and $h_t$, so we take partial derivatives through both, whereas $h_{t-1}$ and $S_{t}$ each have only one outgoing arrow. For gradients with respect to a parameter, look at the two vectors its arrow connects: $W$, for example, connects $S_{i}$ to $h_{i+1}$, so we first differentiate with respect to the arrow's head and then multiply by the (transposed) vector at its tail.]
Hence
$$\frac{\partial J}{\partial V}=\sum_{i=1}^{t} S_{i}^{T} \frac{\partial J}{\partial o_{i}} =\left(S_{1}^{T}, S_{2}^{T}, \ldots, S_{t}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial o_{1}} \\ \vdots \\ \frac{\partial J}{\partial o_{t}} \end{array}\right)$$
$$\frac{\partial J}{\partial W}=\sum_{i=1}^{t-1} S_{i}^{T} \frac{\partial J}{\partial h_{i+1}} =\left(S_{1}^{T}, S_{2}^{T}, \ldots, S_{t-1}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial h_{2}} \\ \vdots \\ \frac{\partial J}{\partial h_{t}} \end{array}\right)$$
$$\frac{\partial J}{\partial U}=\sum_{i=1}^{t} x_{i}^{T} \frac{\partial J}{\partial h_{i}} =\left(x_{1}^{T}, x_{2}^{T}, \ldots, x_{t}^{T}\right)\left(\begin{array}{c} \frac{\partial J}{\partial h_{1}} \\ \vdots \\ \frac{\partial J}{\partial h_{t}} \end{array}\right)$$
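These stacked forms collapse into single matrix products. A sketch with random stand-ins for the per-step quantities (shapes as assumed throughout):

```python
import numpy as np

# Stacked parameter gradients as one matmul each.
T, D, H, K = 100, 160, 1000, 6000
rng = np.random.default_rng(2)
S  = rng.standard_normal((T, H))   # rows S_1 ... S_t
x  = rng.standard_normal((T, D))   # rows x_1 ... x_t
dO = rng.standard_normal((T, K))   # rows dJ/dO_1 ... dJ/dO_t
dh = rng.standard_normal((T, H))   # rows dJ/dh_1 ... dJ/dh_t

dV = S.T @ dO              # == sum_i S_i^T dJ/dO_i
dU = x.T @ dh              # == sum_i x_i^T dJ/dh_i
dW = S[:-1].T @ dh[1:]     # W only spans steps 1 .. t-1
```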
But here we run into a problem: apart from the parameter gradients, which are matrix-matrix products and can therefore run in parallel, both the forward pass and the gradients with respect to the input and output vectors are vector-matrix products, and there is no way to turn them into matrix-matrix products. As a result, within a single sequence the RNN cannot be trained as a batch on a GPU, and the GPU's throughput goes unused.
For example, in the forward pass $S_2 = f(h_2)$: $S_2$ depends on $h_2$, $h_2$ depends on $S_1$, and $S_1$ depends on $h_1$. That is, when $x_2$ arrives, $S_2$ and $S_1$ cannot be computed at the same time, so there is no parallelism.
Likewise, in the backward pass the gradient of $h_{t-1}$ depends on the gradient of $S_{t-1}$, which in turn depends on the gradient of $h_{t}$, so the gradients of $h_{t-1}$ and $h_{t}$ cannot be computed at the same time either.
Since we cannot parallelize within one sentence, we can only try to parallelize across sentences. Let $x_{1}^{N}, x_{2}^{N}, x_{3}^{N}, \cdots, x_{t}^{N}$ denote the $t$ sample vectors of the $N$-th sentence (taking the longest sequence as the reference length and zero-padding the rest). Then for the first word of each of the $N$ sentences we have
$$\begin{array}{l} h_{1}^{1}=x_{1}^{1} U+b_{1} \\ S_{1}^{1}=f\left(h_{1}^{1}\right) \\ O_{1}^{1}=S_{1}^{1} V+b_{2} \\ h_{1}^{2}=x_{1}^{2} U+b_{1} \\ S_{1}^{2}=f\left(h_{1}^{2}\right) \\ O_{1}^{2}=S_{1}^{2} V+b_{2} \\ \ldots \\ h_{1}^{N}=x_{1}^{N} U+b_{1} \\ S_{1}^{N}=f\left(h_{1}^{N}\right) \\ O_{1}^{N}=S_{1}^{N} V+b_{2} \end{array}$$
Hence
$$\begin{array}{c} \left(\begin{array}{c} h_{1}^{1} \\ \vdots \\ h_{1}^{N} \end{array}\right)=\left(\begin{array}{c} x_{1}^{1} \\ \vdots \\ x_{1}^{N} \end{array}\right) U+\left(\begin{array}{c} b_{1} \\ \vdots \\ b_{1} \end{array}\right) \\ \left(\begin{array}{c} S_{1}^{1} \\ \vdots \\ S_{1}^{N} \end{array}\right)=f\left(\begin{array}{c} h_{1}^{1} \\ \vdots \\ h_{1}^{N} \end{array}\right) \\ \left(\begin{array}{c} O_{1}^{1} \\ \vdots \\ O_{1}^{N} \end{array}\right)=\left(\begin{array}{c} S_{1}^{1} \\ \vdots \\ S_{1}^{N} \end{array}\right) V+\left(\begin{array}{c} b_{2} \\ \vdots \\ b_{2} \end{array}\right) \end{array}$$
For time step $t-1$, we have
$$\begin{array}{l} h_{t-1}^{1}=x_{t-1}^{1} U+S_{t-2}^{1} W+b_{1} \\ S_{t-1}^{1}=f\left(h_{t-1}^{1}\right) \\ O_{t-1}^{1}=S_{t-1}^{1} V+b_{2} \\ h_{t-1}^{2}=x_{t-1}^{2} U+S_{t-2}^{2} W+b_{1} \\ S_{t-1}^{2}=f\left(h_{t-1}^{2}\right) \\ O_{t-1}^{2}=S_{t-1}^{2} V+b_{2} \\ \cdots \\ h_{t-1}^{N}=x_{t-1}^{N} U+S_{t-2}^{N} W+b_{1} \\ S_{t-1}^{N}=f\left(h_{t-1}^{N}\right) \\ O_{t-1}^{N}=S_{t-1}^{N} V+b_{2} \end{array}$$
Hence
$$\begin{array}{l} \left(\begin{array}{c} h_{t-1}^{1} \\ \vdots \\ h_{t-1}^{N} \end{array}\right)=\left(\begin{array}{c} x_{t-1}^{1} \\ \vdots \\ x_{t-1}^{N} \end{array}\right) U+\left(\begin{array}{c} S_{t-2}^{1} \\ \vdots \\ S_{t-2}^{N} \end{array}\right) W+\left(\begin{array}{c} b_{1} \\ \vdots \\ b_{1} \end{array}\right) \\ \left(\begin{array}{c} S_{t-1}^{1} \\ \vdots \\ S_{t-1}^{N} \end{array}\right)=f\left(\begin{array}{c} h_{t-1}^{1} \\ \vdots \\ h_{t-1}^{N} \end{array}\right) \\ \left(\begin{array}{c} O_{t-1}^{1} \\ \vdots \\ O_{t-1}^{N} \end{array}\right)=\left(\begin{array}{c} S_{t-1}^{1} \\ \vdots \\ S_{t-1}^{N} \end{array}\right) V+\left(\begin{array}{c} b_{2} \\ \vdots \\ b_{2} \end{array}\right) \end{array}$$
In this way we can build matrix operations over the words (or, more generally, the sample vectors) that occupy the same time step across different sentences.
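As a sketch, the whole batched forward pass then becomes one matrix product per time step, with the $N$ sentences stacked row-wise (padding to a common length $T$ is assumed, as above):

```python
import numpy as np

# Batched forward pass: X[t] stacks x_t^1 ... x_t^N as rows, so each line
# of the stacked equations above is a single (N, *) matrix product.
N, T, D, H, K = 32, 100, 160, 1000, 6000
rng = np.random.default_rng(3)
U = rng.standard_normal((D, H)) * 0.01
W = rng.standard_normal((H, H)) * 0.01
V = rng.standard_normal((H, K)) * 0.01
b1, b2 = np.zeros(H), np.zeros(K)

X = rng.standard_normal((T, N, D))    # N zero-padded sentences, stacked
S_prev = np.zeros((N, H))             # no history at t = 1
for t in range(T):
    h = X[t] @ U + S_prev @ W + b1    # (N, H): all sentences at once
    S_prev = np.tanh(h)               # S_t for every sentence
    O_t = S_prev @ V + b2             # (N, K) outputs at time step t
```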
Similarly, for the gradients with respect to the input and output vectors, we have
$$\begin{aligned} \frac{\partial J}{\partial S_{t-1}^{1}} &=\frac{\partial J}{\partial O_{t-1}^{1}} V^{T}+\frac{\partial J}{\partial h_{t}^{1}} W^{T} \\ \frac{\partial J}{\partial h_{t-1}^{1}} &=\frac{\partial J}{\partial S_{t-1}^{1}} \frac{\partial S_{t-1}^{1}}{\partial h_{t-1}^{1}} \\ \frac{\partial J}{\partial x_{t-1}^{1}} &=\frac{\partial J}{\partial h_{t-1}^{1}} U^{T} \\ & \vdots \\ \frac{\partial J}{\partial S_{t-1}^{N}} &=\frac{\partial J}{\partial O_{t-1}^{N}} V^{T}+\frac{\partial J}{\partial h_{t}^{N}} W^{T} \\ \frac{\partial J}{\partial h_{t-1}^{N}} &=\frac{\partial J}{\partial S_{t-1}^{N}} \frac{\partial S_{t-1}^{N}}{\partial h_{t-1}^{N}} \\ \frac{\partial J}{\partial x_{t-1}^{N}} &=\frac{\partial J}{\partial h_{t-1}^{N}} U^{T} \end{aligned}$$
Hence
$$\left(\begin{array}{c} \dfrac{\partial J}{\partial S_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial S_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial O_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial O_{t-1}^{N}} \end{array}\right) V^{T}+\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t}^{N}} \end{array}\right) W^{T}$$
$$\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial S_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial S_{t-1}^{N}} \end{array}\right) \odot\left(\begin{array}{c} \dfrac{\partial S_{t-1}^{1}}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial S_{t-1}^{N}}{\partial h_{t-1}^{N}} \end{array}\right)$$
$$\left(\begin{array}{c} \dfrac{\partial J}{\partial x_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial x_{t-1}^{N}} \end{array}\right)=\left(\begin{array}{c} \dfrac{\partial J}{\partial h_{t-1}^{1}} \\ \vdots \\ \dfrac{\partial J}{\partial h_{t-1}^{N}} \end{array}\right) U^{T}$$
With this, the gradients with respect to the input and output vectors can be computed in parallel as well.
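One batched backward step then looks as follows (a sketch with random stand-ins; tanh assumed, and `dh_next` holds the stacked $\partial J/\partial h_t^n$ coming from the step after this one):

```python
import numpy as np

# One backward step at time t-1 for all N sentences at once.
N, D, H, K = 32, 160, 1000, 6000
rng = np.random.default_rng(4)
U = rng.standard_normal((D, H))
W = rng.standard_normal((H, H))
V = rng.standard_normal((H, K))
S_tm1 = np.tanh(rng.standard_normal((N, H)))  # stacked S_{t-1}^1 ... S_{t-1}^N
dO_tm1 = rng.standard_normal((N, K))          # stacked dJ/dO_{t-1}^n
dh_next = rng.standard_normal((N, H))         # stacked dJ/dh_t^n

dS_tm1 = dO_tm1 @ V.T + dh_next @ W.T         # (N, H): one matmul per term
dh_tm1 = dS_tm1 * (1.0 - S_tm1 ** 2)          # elementwise tanh derivative
dx_tm1 = dh_tm1 @ U.T                         # (N, D)
```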