Given: \(a^l = \sigma(z^l) = \sigma(W^la^{l-1} + b^l)\)
Define the quadratic loss function (other loss functions could be used as well):
\(J(W,b) = \frac{1}{2}||a^L-y||^2\)
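Goal: solve for \(W\) and \(b\) at every layer, which requires the gradient of \(J\) with respect to each of them.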
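The setup above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the layer sizes `[3, 4, 2]`, the random initialization, and the helper names `sigmoid`, `forward`, and `quadratic_loss` are all assumptions made for the example, and \(\sigma\) is assumed to be the logistic sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Compute z^l = W^l a^{l-1} + b^l and a^l = sigma(z^l) for every layer.

    Returns the list of pre-activations zs and the list of activations,
    where activations[0] is the input x.
    """
    activations = [x]
    zs = []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def quadratic_loss(a_L, y):
    """J(W, b) = 1/2 * ||a^L - y||^2 for a single sample."""
    return 0.5 * np.sum((a_L - y) ** 2)

# Hypothetical network: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))   # one input sample as a column vector
y = np.array([[0.0], [1.0]])      # its target

zs, activations = forward(weights, biases, x)
loss = quadratic_loss(activations[-1], y)
print("loss:", loss)
```

Column vectors of shape `(n, 1)` are used so that `W @ a + b` matches the matrix form of the equations directly.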
First, for the output layer (layer \(L\)) we have:
\[a^L = \sigma(z^L) = \sigma(W^La^{L-1} + b^L)\\ J(W,b) = \frac{1}{2}||a^L-y||^2 = \frac{1}{2}|| \sigma(W^La^{L-1} + b^L)-y||^2 \]
Taking the gradient with respect to \(W\) and \(b\) separately:
\[\frac{\partial J(W,b)}{\partial W^L} = \frac{\partial J(W,b)}{\partial z^L}\frac{\partial z^L}{\partial W^L} , \frac{\partial J(W,b)}{\partial b^L} = \frac{\partial J(W,b)}{\partial z^L}\frac{\partial z^L}{\partial b^L} \]
Both expressions share the factor \(\frac{\partial J(W,b)}{\partial z^L}\), so let \(\delta^L = \frac{\partial J(W,b)}{\partial z^L}\).
To make this easier to follow, first define \(\delta^L_j = \frac{\partial J(W,b)}{\partial z^L_j}\) (the error at the \(j\)-th neuron of layer \(L\)). Then:
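\[ \delta^L_j = \frac{\partial J(W,b)}{\partial z^L_j} \\ = \sum_k \frac{\partial J(W,b)}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \\ = \frac{\partial J(W,b)}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \]
The sum collapses to the single term \(k = j\) because \(a^L_k = \sigma(z^L_k)\) depends only on \(z^L_k\). This gives: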
\[ \delta^L_j = \frac{\partial J(W,b)}{\partial a^L_j} \sigma'(z^L_j)= (a^L_j-y_j) \sigma'(z^L_j)\\ \delta^L = \frac{\partial J(W,b)}{\partial z^L} = (a^L-y)\odot \sigma'(z^L) \]
Therefore, the gradients of \(W\) and \(b\) at layer \(L\) are:
\[\frac{\partial J(W,b)}{\partial W^L} = \frac{\partial J(W,b)}{\partial z^L}\frac{\partial z^L}{\partial W^L} = \left[(a^L-y) \odot \sigma'(z^L)\right](a^{L-1})^T\\ \frac{\partial J(W,b)}{\partial b^L} = \frac{\partial J(W,b)}{\partial z^L}\frac{\partial z^L}{\partial b^L} =(a^L-y)\odot \sigma'(z^L) \]
Next, we need to recurse backwards to obtain the gradients for layers \(L-1, L-2, \ldots\). From the forward-propagation relation of the network:
\[z^{l+1}= W^{l+1}a^{l} + b^{l+1} = W^{l+1}\sigma(z^l) + b^{l+1} \]
we obtain:
\[\delta^{l} = \frac{\partial J(W,b)}{\partial z^l} = \left(\frac{\partial z^{l+1}}{\partial z^{l}}\right)^T\frac{\partial J(W,b)}{\partial z^{l+1}} = \left(\frac{\partial z^{l+1}}{\partial z^{l}}\right)^T\delta^{l+1} = ((W^{l+1})^T\delta^{l+1})\odot \sigma'(z^l) \]
This follows from a component-wise derivation. By the chain rule:
\[ \delta^l_j = \frac{\partial J(W,b)}{\partial z^l_j} = \sum_k \frac{\partial J(W,b)}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \]
Since
\[ z^{l+1}_k = \sum_i w^{l+1}_{ki} a^l_i +b^{l+1}_k = \sum_i w^{l+1}_{ki} \sigma(z^l_i) +b^{l+1}_k \]
only the term \(i = j\) depends on \(z^l_j\), so
\[ \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j) \]
and therefore
\[ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j) \]
which in vector form is
\[ \delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \]
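The output-layer formulas and the backward recursion can be put together in a short NumPy sketch and checked against a finite-difference estimate. Several things here are assumptions made for illustration: \(\sigma\) is taken to be the logistic sigmoid, the layer sizes and random values are arbitrary, and the helper names (`backprop`, `numeric_grad_W`) are hypothetical. The hidden-layer gradients use \(\partial J/\partial W^l = \delta^l (a^{l-1})^T\) and \(\partial J/\partial b^l = \delta^l\), which follow from the same argument as for layer \(L\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(weights, biases, x):
    """Return pre-activations zs and activations (activations[0] is x)."""
    zs, activations = [], [x]
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def loss(weights, biases, x, y):
    """Quadratic loss J = 1/2 * ||a^L - y||^2 for one sample."""
    _, activations = forward(weights, biases, x)
    return 0.5 * np.sum((activations[-1] - y) ** 2)

def backprop(weights, biases, x, y):
    """Gradients of J with respect to every W^l and b^l."""
    zs, activations = forward(weights, biases, x)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)

    # Output layer: delta^L = (a^L - y) ⊙ sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W[-1] = delta @ activations[-2].T
    grads_b[-1] = delta

    # Hidden layers: delta^l = ((W^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
    # List index i holds the parameters of layer i + 1 in the text's numbering.
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ delta) * sigmoid_prime(zs[i])
        grads_W[i] = delta @ activations[i].T
        grads_b[i] = delta
    return grads_W, grads_b

# Hypothetical network: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((3, 1))
y = np.array([[0.0], [1.0]])

grads_W, grads_b = backprop(weights, biases, x, y)

def numeric_grad_W(layer, row, col, eps=1e-6):
    """Central finite-difference estimate of dJ/dW[layer][row, col]."""
    w_plus = [W.copy() for W in weights]
    w_minus = [W.copy() for W in weights]
    w_plus[layer][row, col] += eps
    w_minus[layer][row, col] -= eps
    return (loss(w_plus, biases, x, y) - loss(w_minus, biases, x, y)) / (2 * eps)

# Compare one entry per layer; the difference should be close to zero.
max_error = max(abs(numeric_grad_W(i, 0, 0) - grads_W[i][0, 0])
                for i in range(len(weights)))
print("max |numeric - analytic|:", max_error)
```

The finite-difference comparison is only a sanity check on a few entries. In practice the analytic gradients are what one would feed to a gradient-descent update.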