A Brute-Force Understanding of the Backpropagation Algorithm

1 Backpropagation

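When learning and implementing backpropagation, we often just plug values into formula templates mechanically, because the computation is complicated and its meaning is abstract. Used that way, the algorithm is less convenient than simply calling an existing framework's implementation.

The point of implementing backpropagation ourselves is to understand why the formulas are written the way they are and why the computation goes the way it does. This matters a great deal!

Some tutorials abstract the sequence of steps into a "backward propagation" process and turn the computation into diagrams or animations. But until you really know why the computation works, these have no roots to hold on to.

To find out why, our approach is to tackle it head-on.

Head-on means that the only thing we need to understand is the derivative; every other interpretation is set aside here. We only compute, and we use computation to derive the formulas.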

There are no meanings. There are just laws of arithmetic.

The rest of this article is heavy on math formulas, so viewing it on a large screen is recommended.

2 Terminology[1]


  • L = total number of layers in the network

  • \(s_l\) = number of units (not counting bias unit) in layer l

  • K = number of output units/classes

  • Binary classification: y = 0 or y = 1, K=1;

  • Multi-class classification: K>=3;

\[\begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*} \]

\[\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, } \\ \text{ then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.} \]

\[z^{(j+1)} = \Theta^{(j)}a^{(j)} \]

Example

\[\begin{align*} z_1^{(3)}=\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} \newline a_1^{(3)} = g(z_1^{(3)}) \newline z_2^{(3)}=\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} \newline a_2^{(3)} = g(z_2^{(3)}) \newline \text{add}\; a_0^{(3)} = 1 \newline \end{align*}\]

\(\Theta_{10}^{(1)}a_0^{(1)}\)

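  • (1): the weights mapping layer 1 to layer 2
  • 10:
    • 1: the target unit, i.e. the first activation unit of layer 2, \(a_1^{(2)}\)
    • 0: the source unit, i.e. unit 0 (the bias unit) of layer 1, \(a_0^{(1)}\)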
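The indexing convention and the dimension rule \(s_{j+1} \times (s_j + 1)\) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `layer_forward` and the numeric weights are made up for the example, and g is taken to be the sigmoid.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(theta, a_prev):
    """Compute one layer: z = Theta * [1; a_prev], a = g(z).

    theta  : list of s_next rows, each with s_prev + 1 entries;
             theta[i - 1][k] plays the role of Theta_{ik}, and
             column 0 multiplies the bias unit a_0 = 1.
    a_prev : activations of the previous layer WITHOUT the bias unit.
    """
    a_with_bias = [1.0] + list(a_prev)      # add a_0 = 1
    for row in theta:
        # Dimension rule: every row needs s_prev + 1 weights.
        assert len(row) == len(a_with_bias)
    z = [sum(w * a for w, a in zip(row, a_with_bias)) for row in theta]
    a = [sigmoid(v) for v in z]
    return z, a

# Hypothetical 2 x 3 weight matrix (s_{j+1} = 2, s_j = 2) and input.
theta1 = [[0.1, 0.2, -0.3],
          [-0.4, 0.5, 0.6]]
z2, a2 = layer_forward(theta1, [1.0, 2.0])
print(z2, a2)
```

Here `theta1[0][0]` corresponds to \(\Theta_{10}^{(1)}\): row 1 (target unit 1 in layer 2), column 0 (the bias unit of layer 1).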

3 Feedforward computation

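For the 4-layer network used in this article (2 input units, two hidden layers of 2 units each, and 1 output unit), the detailed computation is written out below. To implement backpropagation, we first need forward propagation.

The essence of forward propagation is to compute one layer's output and use it as the next layer's input.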

3.1 Layer 1

\[\begin{align*} a_1^{(1)} = x_1 \newline a_2^{(1)} = x_2 \newline \text{add}\; a_0^{(1)} = 1 \newline \end{align*}\]

3.2 Layer 2

\[\begin{align*} z_1^{(2)}=\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} + \Theta_{12}^{(1)}a_2^{(1)} \newline a_1^{(2)} = g(z_1^{(2)}) \newline z_2^{(2)}=\Theta_{20}^{(1)}a_0^{(1)} + \Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)} \newline a_2^{(2)} = g(z_2^{(2)}) \newline \text{add}\; a_0^{(2)} = 1 \newline \end{align*}\]

3.3 Layer 3

\[\begin{align*} z_1^{(3)}=\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} \newline a_1^{(3)} = g(z_1^{(3)}) \newline z_2^{(3)}=\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} \newline a_2^{(3)} = g(z_2^{(3)}) \newline \text{add}\; a_0^{(3)} = 1 \newline \end{align*}\]

3.4 Layer 4

\[\begin{align*} z_1^{(4)}=\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}a_1^{(3)} + \Theta_{12}^{(3)}a_2^{(3)} \newline a_1^{(4)} = g(z_1^{(4)}) \newline h_\Theta(x) = a_1^{(4)} \end{align*}\]
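The four layers above can be chained in a short Python sketch. The weights are hypothetical values chosen only to match the 2x3, 2x3 and 1x3 shapes, the function name `forward` is an assumption of this example, and g is taken to be the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Feed x through every layer and return the list of activations.

    thetas[j] is Theta^{(j+1)} as a list of rows; column 0 of each row
    multiplies the bias unit. Every returned activation vector has the
    bias unit 1.0 prepended, except the output layer.
    """
    a = [1.0] + list(x)              # layer 1: a_0 = 1, a_1 = x_1, a_2 = x_2
    activations = [a]
    for idx, theta in enumerate(thetas):
        z = [sum(w * v for w, v in zip(row, a)) for row in theta]
        a = [sigmoid(v) for v in z]
        if idx < len(thetas) - 1:    # hidden layers get a bias unit
            a = [1.0] + a
        activations.append(a)
    return activations

# Hypothetical weights with the shapes used in this article.
theta1 = [[0.1, 0.2, -0.3], [-0.4, 0.5, 0.6]]   # 2 x 3
theta2 = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.5]]   # 2 x 3
theta3 = [[-0.1, 0.7, 0.2]]                     # 1 x 3

acts = forward([theta1, theta2, theta3], [1.0, 2.0])
h = acts[-1][0]                      # h_Theta(x) = a_1^{(4)}
print(h)
```

Each pass through the loop is one of the "Layer" steps: the output of one layer, with the bias unit added, becomes the input of the next.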

4 Hypothesis expansion

\[\begin{align*} h_\Theta(x) & = a_1^{(4)} \newline &= g(z_1^{(4)}) \newline &= g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}a_1^{(3)} + \Theta_{12}^{(3)}a_2^{(3)}) \newline &= g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(z_1^{(3)}) + \Theta_{12}^{(3)}g(z_2^{(3)})) \newline &=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} ) \\ & \quad + \Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} )) \newline &=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}g(z_1^{(2)}) + \Theta_{12}^{(2)}g(z_2^{(2)}) ) \\ & \quad + \Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}g(z_1^{(2)}) + \Theta_{22}^{(2)}g(z_2^{(2)}) )) \newline &=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}g(\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} \\ & \quad + \Theta_{12}^{(1)}a_2^{(1)}) + \Theta_{12}^{(2)}g(\Theta_{20}^{(1)}a_0^{(1)} + \Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)}) ) \\ & \quad +\Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}g(\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} + \Theta_{12}^{(1)}a_2^{(1)}) \\ & \quad + \Theta_{22}^{(2)}g(\Theta_{20}^{(1)}a_0^{(1)}+ \Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)}) )) \end{align*}\]

5 The three corresponding weight matrices

\[\Theta^{(1)} = \begin{pmatrix} \Theta_{10}^{(1)}& \Theta_{11}^{(1)}&\Theta_{12}^{(1)}\\ \Theta_{20}^{(1)}& \Theta_{21}^{(1)}&\Theta_{22}^{(1)}\\ \end{pmatrix}_{2 \times 3} \]

\[\Theta^{(2)} = \begin{pmatrix} \Theta_{10}^{(2)}& \Theta_{11}^{(2)}&\Theta_{12}^{(2)}\\ \Theta_{20}^{(2)}& \Theta_{21}^{(2)}&\Theta_{22}^{(2)}\\ \end{pmatrix}_{2 \times 3} \]

\[\Theta^{(3)} = \begin{pmatrix} \Theta_{10}^{(3)}& \Theta_{11}^{(3)}&\Theta_{12}^{(3)}\\ \end{pmatrix}_{1 \times 3} \]

6 Derivatives

Binary classification (a single output unit):

\[J(\Theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)}\ \log (h_\Theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\Theta(x^{(i)}))] \]

Multi-class classification:

\[\begin{gather*}J(\Theta) = - \frac{1}{m} \sum_{t=1}^m\sum_{k=1}^K \left[ y^{(t)}_k \ \log (h_\Theta (x^{(t)}))_k + (1 - y^{(t)}_k)\ \log (1 - h_\Theta(x^{(t)})_k)\right]\end{gather*} \]

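Here we work out the derivatives for the binary case. For simplicity we assume a single training example, m = 1; multiple examples are no different, they are just a vectorized version of the single-example computation. We also leave out regularization to keep the derivation simple.

Backpropagation means computing the derivative with respect to every \(\Theta\), and every \(\Theta\) sits somewhere inside \(J(\Theta)\). So we apply the chain rule directly and trace each dependency back until we reach the \(\Theta\) we want.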

1 Binary classification: \(\Theta^{(3)}\), 1x3

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{10}^{(3)}}\\ &=(a_{1}^{(4)} - y)a_{0}^{(3)} \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{11}^{(3)}}\\ &=(a_{1}^{(4)} - y)a_{1}^{(3)} \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{12}^{(3)}}\\ &=(a_{1}^{(4)} - y)a_{2}^{(3)} \end{align*}\]
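The step from the three-factor chain rule to \((a_{1}^{(4)} - y)\) is worth spelling out once. The lines below assume that g is the sigmoid (which the later factors of the form \(a(1-a)\) imply) and use the cost for m = 1 without regularization:

\[\begin{align*} J(\Theta) &= -\left[ y \log a_{1}^{(4)} + (1 - y) \log (1 - a_{1}^{(4)}) \right] \\ \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} &= -\dfrac{y}{a_{1}^{(4)}} + \dfrac{1 - y}{1 - a_{1}^{(4)}} = \dfrac{a_{1}^{(4)} - y}{a_{1}^{(4)}(1 - a_{1}^{(4)})} \\ \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} &= a_{1}^{(4)}(1 - a_{1}^{(4)}) \\ \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} &= a_{1}^{(4)} - y, \qquad \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{1k}^{(3)}} = a_{k}^{(3)} \end{align*}\]

The sigmoid derivative cancels the denominator of the cost derivative, which is why no \(a(1-a)\) factor appears for the output layer.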

2 Binary classification: \(\Theta^{(2)}\), 2x3

\[\begin{align*} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} &= \Theta_{11}^{(3)} \\ \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} &= a_{1}^{(3)}(1 - a_{1}^{(3)}) \\ \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{10}^{(2)}} &= a_{0}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{10}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{0}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{11}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{1}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{12}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{2}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} &= \Theta_{12}^{(3)} \\ \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} &= a_{2}^{(3)}(1 - a_{2}^{(3)}) \\ \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{20}^{(2)}} &= a_{0}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{20}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{20}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{0}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{21}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{21}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{1}^{(2)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{22}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{22}^{(2)}}\\ &=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{2}^{(2)}\\ \end{align*}\]

3 Binary classification: \(\Theta^{(1)}\), 2x3

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} ) \dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{10}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{0}^{(1)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} ) \dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{11}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{1}^{(1)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} ) \dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{12}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{2}^{(1)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{20}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} ) \dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{20}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{0}^{(1)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{21}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} ) \dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{21}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{1}^{(1)}\\ \end{align*}\]

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta_{22}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} ( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}} + \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} ) \dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{22}^{(1)}}\\ &=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{2}^{(1)}\\ \end{align*}\]
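One way to gain confidence in a hand-derived formula like these is to compare it with a numerical derivative. The sketch below does this for \(\partial J(\Theta) / \partial \Theta_{11}^{(1)}\). The finite-difference check is an addition for illustration, not part of the original derivation, and the weights, the example (x, y) and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(t1, t2, t3, x):
    """Return (a1, a2, a3, a4); a1..a3 carry the bias unit at index 0."""
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * v for w, v in zip(r, a1))) for r in t1]
    a3 = [1.0] + [sigmoid(sum(w * v for w, v in zip(r, a2))) for r in t2]
    a4 = sigmoid(sum(w * v for w, v in zip(t3[0], a3)))
    return a1, a2, a3, a4

def cost(t1, t2, t3, x, y):
    """Unregularized cross-entropy cost for a single example (m = 1)."""
    a4 = forward(t1, t2, t3, x)[3]
    return -(y * math.log(a4) + (1 - y) * math.log(1 - a4))

# Hypothetical weights and one training example.
t1 = [[0.1, 0.2, -0.3], [-0.4, 0.5, 0.6]]
t2 = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.5]]
t3 = [[-0.1, 0.7, 0.2]]
x, y = [1.0, 2.0], 1.0

# Hand-derived formula for dJ/dTheta_{11}^{(1)} from this section.
a1, a2, a3, a4 = forward(t1, t2, t3, x)
bracket = (t3[0][1] * a3[1] * (1 - a3[1]) * t2[0][1]
           + t3[0][2] * a3[2] * (1 - a3[2]) * t2[1][1])
analytic = (a4 - y) * bracket * a2[1] * (1 - a2[1]) * a1[1]

# Central finite difference on the same weight.
eps = 1e-6
t1[0][1] += eps
j_plus = cost(t1, t2, t3, x, y)
t1[0][1] -= 2 * eps
j_minus = cost(t1, t2, t3, x, y)
t1[0][1] += eps                      # restore the weight (up to rounding)
numeric = (j_plus - j_minus) / (2 * eps)

print(analytic, numeric)
```

The two printed numbers should agree to several decimal places; a large gap would point to an index slip in the hand-written formula.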

Tips


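As we have seen in the formulas for \(\dfrac{\partial J(\Theta)}{\partial \Theta^{(1)}}\) (binary classification, \(\Theta^{(1)}\), 2x3), the function nesting gets deeper with every layer, so starting from scratch each time makes the computation very expensive.

\(\dfrac{\partial J(\Theta)}{\partial z^{(j+1)}} = \delta^{(j+1)}\) is an interface. On one side it leads to \(\Theta^{(j)}\), which is the derivative we ultimately want (the chain can stop here). On the other side it leads to \(a^{(j)}\) (the chain sets off again): expanding \(a^{(j)}\) into \(z^{(j)}\) produces the next interface, \(\dfrac{\partial J(\Theta)}{\partial z^{(j)}} = \delta^{(j)}\), which in turn gives the derivative of the next \(\Theta\).

So δ is a checkpoint in the formula. All we have to do, and the only thing we have to do, is first expand \(\dfrac{\partial z^{(j+1)}}{\partial a^{(j)}}\) and then expand \(\dfrac{\partial a^{(j)}}{\partial z^{(j)}}\). These correspond to \(\Theta\) and to the sigmoid derivative respectively. We save the result as the next checkpoint and repeat.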

\[\begin{align*} \delta^{(4)} &= a^{(4)} - y \\ \delta^{(3)} &= (\Theta^{(3)})^{T} \delta^{(4)} \;.*\; a^{(3)} \;.*\; (1 - a^{(3)}) \quad \text{then remove } \delta_0^{(3)}\\ \delta^{(2)} &= (\Theta^{(2)})^{T} \delta^{(3)} \;.*\; a^{(2)} \;.*\; (1 - a^{(2)}) \quad \text{then remove } \delta_0^{(2)}\\ \end{align*}\]

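A hidden layer's bias unit is a constant, so it has no interface leading further back. The formula says the same thing: the \(\delta\) of a bias unit is zero, because \(a_0 = 1\) makes the factor \(a_0(1 - a_0)\) vanish. So we remove these entries, which also makes the dimensions match for the matrix operation that computes the next \(\delta\).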

\[\begin{align*} \dfrac{\partial J(\Theta)}{\partial \Theta^{(3)}} &= \delta^{(4)} (a^{(3)})^{T} = (a^{(4)} - y) (a^{(3)})^{T} \\ \dfrac{\partial J(\Theta)}{\partial \Theta^{(2)}} &= \delta^{(3)} (a^{(2)})^{T} \\ \dfrac{\partial J(\Theta)}{\partial \Theta^{(1)}} &= \delta^{(2)} (a^{(1)})^{T} \\ \end{align*}\]

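For a single example, \(\delta^{(l)}\) and \(a^{(l-1)}\) are both column vectors, and their product has the same shape as \(\Theta^{(l-1)}\). With m examples stacked together, they become matrices, as in the code example later in this article.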
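The checkpoint idea translates almost directly into a loop. The following is a minimal single-example sketch in plain Python, assuming sigmoid activations and no regularization; the function name `backprop` and the weights are hypothetical, not taken from the original code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(thetas, x, y):
    """Return [dJ/dTheta^{(1)}, dJ/dTheta^{(2)}, ...] for one example.

    thetas[j] is a list of rows; column 0 multiplies the bias unit.
    y is a list with one target per output unit.
    """
    # Forward pass, storing activations (bias unit prepended).
    a = [1.0] + list(x)
    acts = [a]
    for idx, theta in enumerate(thetas):
        out = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
        a = out if idx == len(thetas) - 1 else [1.0] + out
        acts.append(a)

    # Output layer checkpoint: delta = a - y.
    delta = [ai - yi for ai, yi in zip(acts[-1], y)]
    grads = [None] * len(thetas)
    for j in range(len(thetas) - 1, -1, -1):
        a_prev = acts[j]
        # Gradient of this layer: outer product of delta and a_prev.
        grads[j] = [[d * v for v in a_prev] for d in delta]
        if j > 0:
            # Next checkpoint: (Theta^T delta) .* a .* (1 - a),
            # skipping k = 0, i.e. the bias entry is dropped.
            theta = thetas[j]
            delta = [
                sum(theta[i][k] * delta[i] for i in range(len(delta)))
                * a_prev[k] * (1 - a_prev[k])
                for k in range(1, len(a_prev))
            ]
    return grads

# Hypothetical weights with the shapes used in this article.
t1 = [[0.1, 0.2, -0.3], [-0.4, 0.5, 0.6]]
t2 = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.5]]
t3 = [[-0.1, 0.7, 0.2]]
thetas = [t1, t2, t3]
x, y = [1.0, 2.0], [1.0]

grads = backprop(thetas, x, y)
print([(len(g), len(g[0])) for g in grads])   # [(2, 3), (2, 3), (1, 3)]
```

Each gradient has the same shape as its weight matrix, and each δ is computed once and reused, instead of re-expanding the whole chain for every weight.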

How would the multi-class case be computed? Try working it out yourself.

Code example


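This is backpropagation in practice on a handwritten-digit dataset. X holds 5000 grayscale images of handwritten digits, 400 pixels each. There are two weight matrices, Theta1 and Theta2.

Forward propagation computes the cost \(J\); backpropagation computes \(\dfrac{\partial J(\Theta)}{\partial \Theta^{(j)}}\).

10 classes, 3-layer neural network.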


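% ----X = 5000x400; y = 5000x1; Theta1 = 25x401; Theta2 = 10x26----
% Assumes X, y, Theta1, Theta2, num_labels, lambda, sigmoid() and
% sigmoidGradient() are already defined in the enclosing scope.
m = size(X, 1);		% number of training examples (5000)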
% ----Feedforward----
a1 = X;
a1 = [ones(m,1) a1];	% 5000x401

z2 = a1 * Theta1';		% 5000x401 401x25
a2 = sigmoid(z2);
a2 = [ones(size(a2,1),1) a2];	%5000x26

z3 = a2 * Theta2';		% 5000x26 26x10
a3 = sigmoid(z3);

% ----Cost By For Loop and Not Regularized----
%for k=1:num_labels,
%	y_k = (y==k);
%	J = J -(1/m) * (y_k' * log(a3(:,k)) + (1 - y_k') * log(1 - a3(:,k)));
%end

% ----Cost By Matrix and Not Regularized----
% y_K = 5000x10. Notice that " .* ";
y_K = zeros(m, num_labels);
for k=1:num_labels,
	y_K(:,k) = (y==k);
end

% Unregularized cost
%J = -(1/m) * sum(sum((y_K .* log(a3) + (1 - y_K) .* log(1 - a3))));

Theta1_fix = [zeros(size(Theta1,1),1) Theta1(:,2:end)];
Theta2_fix = [zeros(size(Theta2,1),1) Theta2(:,2:end)];
Theta_fix = [Theta1_fix(:);Theta2_fix(:)];

% Regularized cost (bias columns are excluded via Theta_fix)
J = -(1/m) * sum(sum((y_K .* log(a3) + (1 - y_K) .* log(1 - a3)))) + (lambda/(2*m)) * sum(Theta_fix .^2);

gDz2 = sigmoidGradient(z2);		% 5000x25
deltaL3 = (a3 - y_K)';			% (5000x10)' = 10x5000
deltaL2 = Theta2' * deltaL3 .*  [zeros(size(gDz2,1),1) gDz2]';	% 26*5000 .* 26*5000

% Unregularized gradients
%Theta2_grad = (1/m) * deltaL3 * a2 ;
%Theta1_grad = (1/m) * deltaL2(2:end,:) * a1 ;

Theta2_grad = (1/m) * deltaL3 * a2 + (lambda/m) * Theta2_fix;
Theta1_grad = (1/m) * deltaL2(2:end,:) * a1 + (lambda/m) * Theta1_fix;
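To make the role of `y_K` concrete, here is a small plain-Python sketch of the one-hot encoding and the unregularized double sum over examples and classes. It is not a port of the Octave code: the function names `one_hot` and `multiclass_cost` are hypothetical, the regularization term is left out, and the network outputs are made-up numbers.

```python
import math

def one_hot(labels, num_labels):
    """Row t has a 1.0 in column labels[t] - 1 (labels are 1-based)."""
    return [[1.0 if lab == k else 0.0 for k in range(1, num_labels + 1)]
            for lab in labels]

def multiclass_cost(outputs, labels, num_labels):
    """Unregularized cost: -(1/m) * sum over examples t and classes k.

    outputs[t][k] is the network output for example t and class k + 1,
    assumed to lie strictly between 0 and 1.
    """
    y_k = one_hot(labels, num_labels)
    m = len(labels)
    total = 0.0
    for a_row, y_row in zip(outputs, y_k):
        for a, y in zip(a_row, y_row):
            total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / m

# Two examples, three classes, every output equal to 0.5.
outputs = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
labels = [1, 3]
print(one_hot(labels, 3))                    # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(multiclass_cost(outputs, labels, 3))   # 3 * ln(2), about 2.079
```

With every output at 0.5, each of the six (example, class) terms contributes ln(0.5), so the cost is 6 ln(2) / 2 = 3 ln(2) regardless of the labels.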

Reference

[1] Andrew Ng. Coursera Machine Learning Deep Learning. Cost Function and Backpropagation.

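This article may be revised at any time, so please read it on cnblogs. Some sites scrape this article, and their copies may differ from the original.
If you repost it, please credit the source.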
https://www.cnblogs.com/asmurmur/
