Mathematical Foundations of Machine Learning

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 数学基础\n",
"\n",
"本节总结了本书中涉及的有关线性代数、微分和概率的基础知识。为避免赘述本书未涉及的数学背景知识,本节中的少数定义稍有简化。\n",
"\n",
"\n",
"## 线性代数\n",
"\n",
"下面分别概括了向量、矩阵、运算、范数、特征向量和特征值的概念。\n",
"\n",
"### 向量\n",
"\n",
"本书中的向量指的是列向量。一个\(n\)维向量\(\\boldsymbol{x}\)的表达式可写成\n",
"\n",
"\[\n", "\\boldsymbol{x} = \n", "\\begin{bmatrix}\n", " x_{1} \\\\\n", " x_{2} \\\\\n", " \\vdots \\\\\n", " x_{n} \n", "\\end{bmatrix},\n", "\]\n",
"\n",
"其中\(x_1, \\ldots, x_n\)是向量的元素。我们将各元素均为实数的\(n\)维向量\(\\boldsymbol{x}\)记作\(\\boldsymbol{x} \\in \\mathbb{R}^{n}\)或\(\\boldsymbol{x} \\in \\mathbb{R}^{n \\times 1}\)。\n",
"\n",
"\n",
"### 矩阵\n",
"\n",
"一个\(m\)行\(n\)列矩阵的表达式可写成\n",
"\n",
"\[\n", "\\boldsymbol{X} = \n", "\\begin{bmatrix}\n", " x_{11} & x_{12} & \\dots & x_{1n} \\\\\n", " x_{21} & x_{22} & \\dots & x_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " x_{m1} & x_{m2} & \\dots & x_{mn}\n", "\\end{bmatrix},\n", "\]\n",
"\n",
"其中\(x_{ij}\)是矩阵\(\\boldsymbol{X}\)中第\(i\)行第\(j\)列的元素(\(1 \\leq i \\leq m, 1 \\leq j \\leq n\))。我们将各元素均为实数的\(m\)行\(n\)列矩阵\(\\boldsymbol{X}\)记作\(\\boldsymbol{X} \\in \\mathbb{R}^{m \\times n}\)。不难发现,向量是特殊的矩阵。\n",
"\n",
"\n",
"### 运算\n",
"\n",
"设\(n\)维向量\(\\boldsymbol{a}\)中的元素为\(a_1, \\ldots, a_n\),\(n\)维向量\(\\boldsymbol{b}\)中的元素为\(b_1, \\ldots, b_n\)。向量\(\\boldsymbol{a}\)与\(\\boldsymbol{b}\)的点乘(内积)是一个标量:\n",
"\n",
"\[\\boldsymbol{a} \\cdot \\boldsymbol{b} = a_1 b_1 + \\ldots + a_n b_n.\]\n",
"\n",
"\n",
"设两个\(m\)行\(n\)列矩阵\n",
"\n",
"\[\n", "\\boldsymbol{A} = \n", "\\begin{bmatrix}\n", " a_{11} & a_{12} & \\dots & a_{1n} \\\\\n", " a_{21} & a_{22} & \\dots & a_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{m1} & a_{m2} & \\dots & a_{mn}\n", "\\end{bmatrix},\\quad\n", "\\boldsymbol{B} = \n", "\\begin{bmatrix}\n", " b_{11} & b_{12} & \\dots & b_{1n} \\\\\n", " b_{21} & b_{22} & \\dots & b_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " b_{m1} & b_{m2} & \\dots & b_{mn}\n", "\\end{bmatrix}.\n", "\]\n",
"\n",
"矩阵\(\\boldsymbol{A}\)的转置是一个\(n\)行\(m\)列矩阵,它的每一行其实是原矩阵的每一列:\n",
"\[\n", "\\boldsymbol{A}^\\top = \n", "\\begin{bmatrix}\n", " a_{11} & a_{21} & \\dots & a_{m1} \\\\\n", " a_{12} & a_{22} & \\dots & a_{m2} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{1n} & a_{2n} & \\dots & a_{mn}\n", "\\end{bmatrix}.\n", "\]\n",
"\n",
"\n",
"两个相同形状的矩阵的加法是将两个矩阵按元素做加法:\n",
"\n",
"\[\n", "\\boldsymbol{A} + \\boldsymbol{B} = \n", "\\begin{bmatrix}\n", " a_{11} + b_{11} & a_{12} + b_{12} & \\dots & a_{1n} + b_{1n} \\\\\n", " a_{21} + b_{21} & a_{22} + b_{22} & \\dots & a_{2n} + b_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{m1} + b_{m1} & a_{m2} + b_{m2} & \\dots & a_{mn} + b_{mn}\n", "\\end{bmatrix}.\n", "\]\n",
"\n",
"我们使用符号\(\\odot\)表示两个矩阵按元素做乘法的运算:\n",
"\n",
"\[\n", "\\boldsymbol{A} \\odot \\boldsymbol{B} = \n", "\\begin{bmatrix}\n", " a_{11} b_{11} & a_{12} b_{12} & \\dots & a_{1n} b_{1n} \\\\\n", " a_{21} b_{21} & a_{22} b_{22} & \\dots & a_{2n} b_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{m1} b_{m1} & a_{m2} b_{m2} & \\dots & a_{mn} b_{mn}\n", "\\end{bmatrix}.\n", "\]\n",
"\n",
"定义一个标量\(k\)。标量与矩阵的乘法也是按元素做乘法的运算:\n",
"\n",
"\n",
"\[\n", "k\\boldsymbol{A} = \n", "\\begin{bmatrix}\n", " ka_{11} & ka_{12} & \\dots & ka_{1n} \\\\\n", " ka_{21} & ka_{22} & \\dots & ka_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " ka_{m1} & ka_{m2} & \\dots & ka_{mn}\n", "\\end{bmatrix}.\n", "\]\n",
"\n",
"其他诸如标量与矩阵按元素相加、相除等运算与上式中的相乘运算类似。矩阵按元素开根号、取对数等运算也就是对矩阵每个元素开根号、取对数等,并得到和原矩阵形状相同的矩阵。\n",
"\n",
"矩阵乘法和按元素的乘法不同。设\(\\boldsymbol{A}\)为\(m\)行\(p\)列的矩阵,\(\\boldsymbol{B}\)为\(p\)行\(n\)列的矩阵。两个矩阵相乘的结果\n",
"\n",
"\[\n", "\\boldsymbol{A} \\boldsymbol{B} = \n", "\\begin{bmatrix}\n", " a_{11} & a_{12} & \\dots & a_{1p} \\\\\n", " a_{21} & a_{22} & \\dots & a_{2p} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{i1} & a_{i2} & \\dots & a_{ip} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " a_{m1} & a_{m2} & \\dots & a_{mp}\n", "\\end{bmatrix}\n", "\\begin{bmatrix}\n", " b_{11} & b_{12} & \\dots & b_{1j} & \\dots & b_{1n} \\\\\n", " b_{21} & b_{22} & \\dots & b_{2j} & \\dots & b_{2n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots & \\ddots & \\vdots \\\\\n", " b_{p1} & b_{p2} & \\dots & b_{pj} & \\dots & b_{pn}\n", "\\end{bmatrix}\n", "\]\n",
"\n",
"是一个\(m\)行\(n\)列的矩阵,其中第\(i\)行第\(j\)列(\(1 \\leq i \\leq m, 1 \\leq j \\leq n\))的元素为\n",
"\n",
"\[a_{i1}b_{1j} + a_{i2}b_{2j} + \\ldots + a_{ip}b_{pj} = \\sum_{k=1}^p a_{ik}b_{kj}. \]\n",
"\n",
"\n",
"### 范数\n",
"\n",
"设\(n\)维向量\(\\boldsymbol{x}\)中的元素为\(x_1, \\ldots, x_n\)。向量\(\\boldsymbol{x}\)的\(L_p\)范数为\n",
"\n",
"\[\\|\\boldsymbol{x}\\|_p = \\left(\\sum_{i=1}^n \\left|x_i \\right|^p \\right)^{1/p}.\]\n",
"\n",
"例如,\(\\boldsymbol{x}\)的\(L_1\)范数是该向量元素绝对值之和:\n",
"\n",
"\[\\|\\boldsymbol{x}\\|_1 = \\sum_{i=1}^n \\left|x_i \\right|.\]\n",
"\n",
"而\(\\boldsymbol{x}\)的\(L_2\)范数是该向量元素平方和的平方根:\n",
"\n",
"\[\\|\\boldsymbol{x}\\|_2 = \\sqrt{\\sum_{i=1}^n x_i^2}.\]\n",
"\n",
"我们通常用\(\\|\\boldsymbol{x}\\|\)指代\(\\|\\boldsymbol{x}\\|_2\)。\n",
"\n",
"设\(\\boldsymbol{X}\)是一个\(m\)行\(n\)列矩阵。矩阵\(\\boldsymbol{X}\)的Frobenius范数为该矩阵元素平方和的平方根:\n",
"\n",
"\[\\|\\boldsymbol{X}\\|_F = \\sqrt{\\sum_{i=1}^m \\sum_{j=1}^n x_{ij}^2},\]\n",
"\n",
"其中\(x_{ij}\)为矩阵\(\\boldsymbol{X}\)在第\(i\)行第\(j\)列的元素。\n",
"\n",
"\n",
"### 特征向量和特征值\n",
"\n",
"\n",
"对于一个\(n\)行\(n\)列的矩阵\(\\boldsymbol{A}\),假设有标量\(\\lambda\)和非零的\(n\)维向量\(\\boldsymbol{v}\)使\n",
"\n",
"\[\\boldsymbol{A} \\boldsymbol{v} = \\lambda \\boldsymbol{v},\]\n",
"\n",
"那么\(\\boldsymbol{v}\)是矩阵\(\\boldsymbol{A}\)的一个特征向量,标量\(\\lambda\)是\(\\boldsymbol{v}\)对应的特征值。\n",
"\n",
"\n",
"\n",
"## 微分\n",
"\n",
"我们在这里简要介绍微分的一些基本概念和演算。\n",
"\n",
"\n",
"### 导数和微分\n",
"\n",
"假设函数\(f: \\mathbb{R} \\rightarrow \\mathbb{R}\)的输入和输出都是标量。函数\(f\)的导数\n",
"\n",
"\[f'(x) = \\lim_{h \\rightarrow 0} \\frac{f(x+h) - f(x)}{h},\]\n",
"\n",
"且假定该极限存在。给定\(y = f(x)\),其中\(x\)和\(y\)分别是函数\(f\)的自变量和因变量。以下有关导数和微分的表达式等价:\n",
"\n",
"\[f'(x) = y' = \\frac{\\text{d}y}{\\text{d}x} = \\frac{\\text{d}f}{\\text{d}x} = \\frac{\\text{d}}{\\text{d}x} f(x) = \\text{D}f(x) = \\text{D}_x f(x),\]\n",
"\n",
"其中符号\(\\text{D}\)和\(\\text{d}/\\text{d}x\)也叫微分运算符。常见的微分演算有\(\\text{D}C = 0\)(\(C\)为常数)、\(\\text{D}x^n = nx^{n-1}\)(\(n\)为常数)、\(\\text{D}e^x = e^x\)、\(\\text{D}\\ln(x) = 1/x\)等。\n",
"\n",
"如果函数\(f\)和\(g\)都可导,设\(C\)为常数,那么\n",
"\n",
"\[\n", "\\begin{aligned}\n", "\\frac{\\text{d}}{\\text{d}x} [Cf(x)] &= C \\frac{\\text{d}}{\\text{d}x} f(x),\\\\\n", "\\frac{\\text{d}}{\\text{d}x} [f(x) + g(x)] &= \\frac{\\text{d}}{\\text{d}x} f(x) + \\frac{\\text{d}}{\\text{d}x} g(x),\\\\ \n", "\\frac{\\text{d}}{\\text{d}x} [f(x)g(x)] &= f(x) \\frac{\\text{d}}{\\text{d}x} [g(x)] + g(x) \\frac{\\text{d}}{\\text{d}x} [f(x)],\\\\\n", "\\frac{\\text{d}}{\\text{d}x} \\left[\\frac{f(x)}{g(x)}\\right] &= \\frac{g(x) \\frac{\\text{d}}{\\text{d}x} [f(x)] - f(x) \\frac{\\text{d}}{\\text{d}x} [g(x)]}{[g(x)]^2}.\n", "\\end{aligned}\n", "\]\n",
"\n",
"\n",
"如果\(y=f(u)\)和\(u=g(x)\)都是可导函数,依据链式法则,\n",
"\n",
"\[\\frac{\\text{d}y}{\\text{d}x} = \\frac{\\text{d}y}{\\text{d}u} \\frac{\\text{d}u}{\\text{d}x}.\]\n",
"\n",
"\n",
"### 泰勒展开\n",
"\n",
"函数\(f\)的泰勒展开式是\n",
"\n",
"\[f(x) = \\sum_{n=0}^\\infty \\frac{f^{(n)}(a)}{n!} (x-a)^n,\]\n",
"\n",
"其中\(f^{(n)}\)为函数\(f\)的\(n\)阶导数(求\(n\)次导数),\(n!\)为\(n\)的阶乘。假设\(\\epsilon\)是一个足够小的数,如果将上式中\(x\)和\(a\)分别替换成\(x+\\epsilon\)和\(x\),可以得到\n",
"\n",
"\[f(x + \\epsilon) \\approx f(x) + f'(x) \\epsilon + \\mathcal{O}(\\epsilon^2).\]\n",
"\n",
"由于\(\\epsilon\)足够小,上式也可以简化成\n",
"\n",
"\[f(x + \\epsilon) \\approx f(x) + f'(x) \\epsilon.\]\n",
"\n",
"\n",
"\n",
"### 偏导数\n",
"\n",
"设\(u\)为一个有\(n\)个自变量的函数,\(u = f(x_1, x_2, \\ldots, x_n)\),它有关第\(i\)个变量\(x_i\)的偏导数为\n",
"\n",
"\[ \\frac{\\partial u}{\\partial x_i} = \\lim_{h \\rightarrow 0} \\frac{f(x_1, \\ldots, x_{i-1}, x_i+h, x_{i+1}, \\ldots, x_n) - f(x_1, \\ldots, x_i, \\ldots, x_n)}{h}.\]\n",
"\n",
"\n",
"以下有关偏导数的表达式等价:\n",
"\n",
"\[\\frac{\\partial u}{\\partial x_i} = \\frac{\\partial f}{\\partial x_i} = f_{x_i} = f_i = \\text{D}_i f = \\text{D}_{x_i} f.\]\n",
"\n",
"为了计算\(\\partial u/\\partial x_i\),只需将\(x_1, \\ldots, x_{i-1}, x_{i+1}, \\ldots, x_n\)视为常数并求\(u\)有关\(x_i\)的导数。\n",
"\n",
"\n",
"\n",
"### 梯度\n",
"\n",
"\n",
"假设函数\(f: \\mathbb{R}^n \\rightarrow \\mathbb{R}\)的输入是一个\(n\)维向量\(\\boldsymbol{x} = [x_1, x_2, \\ldots, x_n]^\\top\),输出是标量。函数\(f(\\boldsymbol{x})\)有关\(\\boldsymbol{x}\)的梯度是一个由\(n\)个偏导数组成的向量:\n",
"\n",
"\[\\nabla_{\\boldsymbol{x}} f(\\boldsymbol{x}) = \\bigg[\\frac{\\partial f(\\boldsymbol{x})}{\\partial x_1}, \\frac{\\partial f(\\boldsymbol{x})}{\\partial x_2}, \\ldots, \\frac{\\partial f(\\boldsymbol{x})}{\\partial x_n}\\bigg]^\\top.\]\n",
"\n",
"\n",
"为表示简洁,我们有时用\(\\nabla f(\\boldsymbol{x})\)代替\(\\nabla_{\\boldsymbol{x}} f(\\boldsymbol{x})\)。\n",
"\n",
"假设\(\\boldsymbol{x}\)是一个向量,常见的梯度演算包括\n",
"\n",
"\[\n", "\\begin{aligned}\n", "\\nabla_{\\boldsymbol{x}} \\boldsymbol{A}^\\top \\boldsymbol{x} &= \\boldsymbol{A}, \\\\\n", "\\nabla_{\\boldsymbol{x}} \\boldsymbol{x}^\\top \\boldsymbol{A} &= \\boldsymbol{A}, \\\\\n", "\\nabla_{\\boldsymbol{x}} \\boldsymbol{x}^\\top \\boldsymbol{A} \\boldsymbol{x} &= (\\boldsymbol{A} + \\boldsymbol{A}^\\top)\\boldsymbol{x},\\\\\n", "\\nabla_{\\boldsymbol{x}} \\|\\boldsymbol{x} \\|^2 &= \\nabla_{\\boldsymbol{x}} \\boldsymbol{x}^\\top \\boldsymbol{x} = 2\\boldsymbol{x}.\n", "\\end{aligned}\n", "\]\n",
"\n",
"类似地,假设\(\\boldsymbol{X}\)是一个矩阵,那么\n",
"\[\\nabla_{\\boldsymbol{X}} \\|\\boldsymbol{X} \\|_F^2 = 2\\boldsymbol{X}.\]\n",
"\n",
"\n",
"\n",
"\n",
"### 海森矩阵\n",
"\n",
"假设函数\(f: \\mathbb{R}^n \\rightarrow \\mathbb{R}\)的输入是一个\(n\)维向量\(\\boldsymbol{x} = [x_1, x_2, \\ldots, x_n]^\\top\),输出是标量。假定函数\(f\)所有的二阶偏导数都存在,\(f\)的海森矩阵\(\\boldsymbol{H}\)是一个\(n\)行\(n\)列的矩阵:\n",
"\n",
"\[\n", "\\boldsymbol{H} = \n", "\\begin{bmatrix}\n", " \\frac{\\partial^2 f}{\\partial x_1^2} & \\frac{\\partial^2 f}{\\partial x_1 \\partial x_2} & \\dots & \\frac{\\partial^2 f}{\\partial x_1 \\partial x_n} \\\\\n", " \\frac{\\partial^2 f}{\\partial x_2 \\partial x_1} & \\frac{\\partial^2 f}{\\partial x_2^2} & \\dots & \\frac{\\partial^2 f}{\\partial x_2 \\partial x_n} \\\\\n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " \\frac{\\partial^2 f}{\\partial x_n \\partial x_1} & \\frac{\\partial^2 f}{\\partial x_n \\partial x_2} & \\dots & \\frac{\\partial^2 f}{\\partial x_n^2}\n", "\\end{bmatrix},\n", "\]\n",
"\n",
"其中二阶偏导数\n",
"\n",
"\[\\frac{\\partial^2 f}{\\partial x_i \\partial x_j} = \\frac{\\partial }{\\partial x_j} \\left(\\frac{\\partial f}{ \\partial x_i}\\right).\]\n",
"\n",
"\n",
"\n",
"## 概率\n",
"\n",
"最后,我们简要介绍条件概率、期望和均匀分布。\n",
"\n",
"### 条件概率\n",
"\n",
"假设事件\(A\)和事件\(B\)的概率分别为\(P(A)\)和\(P(B)\),两个事件同时发生的概率记作\(P(A \\cap B)\)或\(P(A, B)\)。给定事件\(B\),事件\(A\)的条件概率\n",
"\n",
"\[P(A \\mid B) = \\frac{P(A \\cap B)}{P(B)}.\]\n",
"\n",
"也就是说,\n",
"\n",
"\[P(A \\cap B) = P(B) P(A \\mid B) = P(A) P(B \\mid A).\]\n",
"\n",
"当满足\n",
"\n",
"\[P(A \\cap B) = P(A) P(B)\]\n",
"\n",
"时,事件\(A\)和事件\(B\)相互独立。\n",
"\n",
"\n",
"### 期望\n",
"\n",
"离散的随机变量\(X\)的期望(或平均值)为\n",
"\n",
"\[E(X) = \\sum_{x} x P(X = x).\]\n",
"\n",
"\n",
"\n",
"### 均匀分布\n",
"\n",
"假设随机变量\(X\)服从\([a, b]\)上的均匀分布,即\(X \\sim U(a, b)\)。随机变量\(X\)取\(a\)和\(b\)之间任意一个数的概率相等。\n",
"\n",
"\n",
"\n",
"\n",
"## 小结\n",
"\n",
"* 本节总结了本书中涉及的有关线性代数、微分和概率的基础知识。\n",
"\n",
"\n",
"## 练习\n",
"\n",
"* 求函数\(f(\\boldsymbol{x}) = 3x_1^2 + 5e^{x_2}\)的梯度。\n",
"\n",
"\n",
"\n",
"\n",
"## 扫码直达讨论区\n",
"\n",
"机器学习中的数学基础"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
