Kernel Methods: PCA, LDA, GDA, SVM, LR


1. Kernel Method

A feature map $\phi:\mathbb{R}^d\to\mathbb{R}^n$ sends points of the original space into a higher-dimensional feature space, e.g. $(x_1,x_2)\mapsto(z_1,z_2,z_3)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$. The purpose is to turn complicated relationships in the original space into simpler ones in the high-dimensional space.

A kernel function computes, for points of the original space, their inner product in feature space: $\kappa(\boldsymbol x,\boldsymbol x')=\langle\phi(\boldsymbol x),\phi(\boldsymbol x')\rangle$. Inner products determine distances and angles, i.e. the geometry of the feature space:

Distance:
$$\begin{aligned} \|\phi(\boldsymbol x)-\phi(\boldsymbol x')\|^2 &= (\phi(\boldsymbol x)-\phi(\boldsymbol x'))^T(\phi(\boldsymbol x)-\phi(\boldsymbol x')) \\ &= \phi(\boldsymbol x)^T\phi(\boldsymbol x)-2\phi(\boldsymbol x)^T\phi(\boldsymbol x')+\phi(\boldsymbol x')^T\phi(\boldsymbol x') \\ &= \langle\phi(\boldsymbol x),\phi(\boldsymbol x)\rangle-2\langle\phi(\boldsymbol x),\phi(\boldsymbol x')\rangle+\langle\phi(\boldsymbol x'),\phi(\boldsymbol x')\rangle \\ &= \kappa(\boldsymbol x,\boldsymbol x)-2\kappa(\boldsymbol x,\boldsymbol x')+\kappa(\boldsymbol x',\boldsymbol x') \end{aligned}$$
Angle:
$$\langle\phi(\boldsymbol x),\phi(\boldsymbol x')\rangle=\|\phi(\boldsymbol x)\|\cdot\|\phi(\boldsymbol x')\|\cos\theta$$
$$\Rightarrow \cos\theta = \frac{\langle\phi(\boldsymbol x),\phi(\boldsymbol x')\rangle}{\|\phi(\boldsymbol x)\|\cdot\|\phi(\boldsymbol x')\|} = \frac{\langle\phi(\boldsymbol x),\phi(\boldsymbol x')\rangle}{\sqrt{\langle\phi(\boldsymbol x),\phi(\boldsymbol x)\rangle}\sqrt{\langle\phi(\boldsymbol x'),\phi(\boldsymbol x')\rangle}} = \frac{\kappa(\boldsymbol x,\boldsymbol x')}{\sqrt{\kappa(\boldsymbol x,\boldsymbol x)}\sqrt{\kappa(\boldsymbol x',\boldsymbol x')}}$$
Inner-product matrix (Gram matrix, kernel matrix):
$$\begin{aligned} K &= \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N) \end{bmatrix} = \begin{bmatrix} \langle\phi(\boldsymbol x_1),\phi(\boldsymbol x_1)\rangle & \cdots & \langle\phi(\boldsymbol x_1),\phi(\boldsymbol x_N)\rangle \\ \vdots & \ddots & \vdots \\ \langle\phi(\boldsymbol x_N),\phi(\boldsymbol x_1)\rangle & \cdots & \langle\phi(\boldsymbol x_N),\phi(\boldsymbol x_N)\rangle \end{bmatrix} \\ &= \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix} \end{aligned}$$
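
As a quick sanity check, here is a minimal sketch (assuming NumPy; all names are illustrative) showing that the Gram matrix built from the explicit quadratic map above equals the matrix of kernel values $\kappa(\boldsymbol x,\boldsymbol x')=\langle\boldsymbol x,\boldsymbol x'\rangle^2$:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # five 2-d samples

Phi = np.array([phi(x) for x in X])          # N x 3 feature matrix
K_explicit = Phi @ Phi.T                     # Gram matrix via the feature map
K_kernel = (X @ X.T) ** 2                    # same matrix via kappa(x, x') = <x, x'>^2

print(np.allclose(K_explicit, K_kernel))     # True: phi is never needed
```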
Example 1: $\{(\boldsymbol x_1,y_1),\cdots,(\boldsymbol x_n,y_n)\}\subset\mathbb{R}^d\times\{1,-1\}\ \xrightarrow{\phi}\ \{(\phi(\boldsymbol x_1),y_1),\cdots,(\phi(\boldsymbol x_n),y_n)\}\subset H\times\{1,-1\}$


Use as the decision boundary the hyperplane through the midpoint of the two class centroids, perpendicular to the line joining them.


For a new sample: $(\boldsymbol x,y)\to(\phi(\boldsymbol x),y)\to y=?$

When $y=1$, the angle between $\phi(\boldsymbol x)-\boldsymbol c$ (the vector from the midpoint to the sample) and $\boldsymbol w$ (the vector joining the centroids) is at most 90 degrees, i.e.,
$$0\le\theta\le\frac{\pi}{2} \;\Leftrightarrow\; \cos\theta=\frac{\langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle}{\|\phi(\boldsymbol x)-\boldsymbol c\|\cdot\|\boldsymbol w\|}\ge0 \;\Leftrightarrow\; \langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle\ge0$$

Similarly,
$$y=-1 \;\Leftrightarrow\; \langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle<0$$
Combining the two cases:
$$y=\mathrm{sgn}(\langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle)$$
Compute $\langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle$, where $\boldsymbol c=\frac{1}{2}(\boldsymbol c_++\boldsymbol c_-)$ is the midpoint of the class centroids, $\boldsymbol w=\boldsymbol c_+-\boldsymbol c_-$, and $b=\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-)$:
$$\begin{aligned} \langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle &= \boldsymbol w^T(\phi(\boldsymbol x)-\boldsymbol c) \\ &= (\boldsymbol c_+-\boldsymbol c_-)^T\phi(\boldsymbol x)-\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-) \\ &= \left(\frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)\right)^T\phi(\boldsymbol x)-\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-) \\ &= \frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)^T\phi(\boldsymbol x)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)^T\phi(\boldsymbol x) - b \\ &= \frac{1}{m_+}\sum_{y_i=1}\kappa(\boldsymbol x_i,\boldsymbol x)-\frac{1}{m_-}\sum_{y_i=-1}\kappa(\boldsymbol x_i,\boldsymbol x) - b \end{aligned}$$
From the above, the feature map $\phi$ itself is not actually needed; the kernel $\kappa$ alone suffices.
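
A hedged sketch of this centroid rule using only kernel evaluations (assuming NumPy; `kappa`, `X_pos`, `X_neg` are illustrative names, with an RBF kernel standing in for $\kappa$):

```python
import numpy as np

def kappa(a, b):
    # Example kernel: RBF with sigma = 1 (any valid kernel works here)
    return np.exp(-np.sum((a - b) ** 2) / 2.0)

def predict(x, X_pos, X_neg):
    """Classify x by the sign of <phi(x) - c, w>, written via kernels only."""
    k_pos = np.mean([kappa(xi, x) for xi in X_pos])   # (1/m+) sum_i kappa(x_i, x)
    k_neg = np.mean([kappa(xi, x) for xi in X_neg])   # (1/m-) sum_i kappa(x_i, x)
    # b = (1/2)(||c+||^2 - ||c-||^2), expanded into pairwise kernel sums
    b = 0.5 * (np.mean([kappa(u, v) for u in X_pos for v in X_pos])
             - np.mean([kappa(u, v) for u in X_neg for v in X_neg]))
    return np.sign(k_pos - k_neg - b)
```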

  • Finitely positive semidefinite function: a function $\kappa:X\times X\to\mathbb{R}$ is finitely positive semidefinite if it is symmetric and the matrix $K$ it induces on any finite subset of $X$ is positive semidefinite.

  • Characterization of kernels: a function $\kappa:X\times X\to\mathbb{R}$ that is continuous or has a finite domain can be decomposed as $\kappa(\boldsymbol x,\boldsymbol z)=\langle\phi(\boldsymbol x),\phi(\boldsymbol z)\rangle$, for a feature map $\phi$ from the original space into a Hilbert space, if and only if it is finitely positive semidefinite.

Common kernels:

  • Linear kernel:
    $\kappa(\boldsymbol x, \boldsymbol z)=\langle\boldsymbol x,\boldsymbol z\rangle$

  • Polynomial kernel:
    $\kappa(\boldsymbol x,\boldsymbol z)=(\langle\boldsymbol x,\boldsymbol z\rangle+1)^r,\ r\in\mathbb{Z}^+$

  • Gaussian (RBF) kernel:
    $\kappa(\boldsymbol x,\boldsymbol z)=\exp\left(-\frac{\|\boldsymbol x-\boldsymbol z\|^2}{2\sigma^2}\right),\ \sigma\in\mathbb{R}\setminus\{0\}$
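
Minimal sketches of these three kernels (assuming NumPy; `x` and `z` are 1-D arrays of equal length):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, r=2):
    # r is a positive integer
    return (x @ z + 1.0) ** r

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```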

A linear function in feature space has the form $f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)+b$.

For Example 1,
$$y=\mathrm{sgn}(\langle\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w\rangle) \;\Leftrightarrow\; y=\mathrm{sgn}(f(\boldsymbol x)),\quad f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)-\boldsymbol w^T\boldsymbol c,\quad \boldsymbol w=\frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)$$
Dual representation: $\boldsymbol w$ can be written as a linear combination of the $\phi(\boldsymbol x_i)$,
$$\boldsymbol w = \sum_{i=1}^N \alpha_i\phi(\boldsymbol x_i)$$
Therefore,
$$f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)+b=\left(\sum_{i=1}^N \alpha_i\phi(\boldsymbol x_i)\right)^T\phi(\boldsymbol x)+b=\sum_{i=1}^N\alpha_i\kappa(\boldsymbol x_i,\boldsymbol x)+b$$

  • Use the feature map $\phi$ to send samples from the original space into a feature space $H$, a high-dimensional Hilbert space

  • In $H$, the original (nonlinear) relationships can be treated as linear

  • Compute inner products in feature space directly with the kernel (no explicit $\phi$ required)

  • Assume quantities in feature space admit a dual representation, i.e. combinations of the training samples.


2. PCA and Kernel PCA

2.1 Principal Component Analysis

Assume $\boldsymbol x_1,\boldsymbol x_2,\cdots,\boldsymbol x_N\in\mathbb{R}^d$ are zero-mean training samples. PCA seeks a subspace $\mathbb{R}^p,\ p\le d$, that captures the largest total variance of the data, by projecting the data onto suitable directions.

The goal is to find the direction of maximal projected variance.

Project the samples onto a unit direction $\boldsymbol v$: $\boldsymbol v^T\boldsymbol x_1,\cdots,\boldsymbol v^T\boldsymbol x_N$.

The projected variance is:
$$\begin{aligned} \sigma^2 &= \frac{1}{N}\sum^N_{i=1}(\boldsymbol v^T\boldsymbol x_i-0)^2 = \frac{1}{N}\sum^N_{i=1}(\boldsymbol v^T\boldsymbol x_i)(\boldsymbol v^T\boldsymbol x_i)^T \\ &= \boldsymbol v^T\left(\frac{1}{N}\sum^N_{i=1}\boldsymbol x_i\boldsymbol x_i^T\right)\boldsymbol v = \boldsymbol v^TC\boldsymbol v \end{aligned}$$
where
$$C = \frac{1}{N}\sum^N_{i=1}\boldsymbol x_i\boldsymbol x_i^T$$
The first principal direction is
$$\color{red}{\boldsymbol v = \mathop{\arg\max}\limits_{\boldsymbol v\in\mathbb{R}^d,\|\boldsymbol v\|=1}\boldsymbol v^TC\boldsymbol v}$$
Using the Lagrangian method,
$$\begin{aligned} f(\boldsymbol v,\lambda) &= \boldsymbol v^TC\boldsymbol v-\lambda(\boldsymbol v^T\boldsymbol v-1) \\ \frac{\partial f}{\partial \boldsymbol v} &= 2C\boldsymbol v-2\lambda \boldsymbol v = 0 \Leftrightarrow C\boldsymbol v=\lambda \boldsymbol v \\ \frac{\partial f}{\partial \lambda} &= \boldsymbol v^T\boldsymbol v-1=0\Rightarrow \boldsymbol v^T\boldsymbol v=1 \end{aligned}$$
So the problem is equivalent to finding the eigenvector of the largest eigenvalue:
$$\begin{cases} C\boldsymbol v=\lambda \boldsymbol v \\ \|\boldsymbol v\|=1 \end{cases}$$
Substituting back into the earlier formula, the maximal variance is $\sigma^2=\lambda$.

Examining $C$,
$$C=\frac{1}{N}\sum_{i=1}^N\boldsymbol x_i\boldsymbol x_i^T=\frac{1}{N} \begin{bmatrix} \boldsymbol x_1 & \cdots & \boldsymbol x_N \end{bmatrix} \begin{bmatrix} \boldsymbol x_1^T \\ \vdots \\ \boldsymbol x_N^T \end{bmatrix}$$
Let $X^T=[\boldsymbol x_1,\cdots,\boldsymbol x_N]$; then
$$C=\frac{1}{N}X^TX$$
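
A minimal PCA sketch along these lines (assuming NumPy and zero-mean data, as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                       # enforce the zero-mean assumption

C = X.T @ X / len(X)                      # covariance matrix C = X^T X / N
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
v1 = eigvecs[:, -1]                       # first principal direction
print(eigvals[-1], v1 @ C @ v1)           # max variance: sigma^2 = lambda_max
```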

2.2 Kernel Principal Component Analysis

For samples mapped into feature space, $\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)$, let $X^T=[\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)]$ (unknown); then
$$C=\frac{1}{N}\sum_{i=1}^N\phi(\boldsymbol x_i)\phi(\boldsymbol x_i)^T=\frac{1}{N}X^TX \quad (\dim = p\times p)$$
In practice the kernel $\kappa$ is known but the map $\phi$ is not, so $C$ cannot be formed directly. The kernel matrix, however, can:
$$\begin{aligned} K = XX^T &= \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N) \end{bmatrix} = \begin{bmatrix} \phi(\boldsymbol x_1)^T\phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_1)^T\phi(\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \phi(\boldsymbol x_N)^T\phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N)^T\phi(\boldsymbol x_N) \end{bmatrix} \\ &= \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix} \quad (\dim = N\times N) \end{aligned}$$
Use $K$ to obtain the largest eigenvalue of $C$ and its eigenvector:

Let $\boldsymbol u$ be an eigenvector of $K=XX^T$, i.e. $(XX^T)\boldsymbol u=\lambda \boldsymbol u$. Left-multiplying by $X^T$,
$$X^T(XX^T)\boldsymbol u=\lambda X^T\boldsymbol u \;\Leftrightarrow\; (X^TX)(X^T\boldsymbol u)=\lambda(X^T\boldsymbol u)$$
so $X^T\boldsymbol u$ is an eigenvector of $X^TX\ (=NC)$ with eigenvalue $\lambda$.

Hence the eigenvalues and eigenvectors $\boldsymbol v$ of $X^TX$ can be computed from $K=XX^T$,
$$\boldsymbol v=\frac{X^T\boldsymbol u}{\|X^T\boldsymbol u\|}=\frac{X^T\boldsymbol u}{\sqrt{\boldsymbol u^TXX^T\boldsymbol u}}=\frac{X^T\boldsymbol u}{\sqrt{\boldsymbol u^T(\lambda \boldsymbol u)}}=\frac{1}{\sqrt{\lambda}}X^T\boldsymbol u$$
Projecting a test sample onto $\boldsymbol v$,
$$\begin{aligned} \boldsymbol v^T\phi(\boldsymbol x') &= \left(\frac{1}{\sqrt{\lambda}}X^T\boldsymbol u\right)^T\phi(\boldsymbol x') = \frac{1}{\sqrt{\lambda}}\boldsymbol u^TX\phi(\boldsymbol x') \\ &= \frac{1}{\sqrt{\lambda}}\boldsymbol u^T \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \phi(\boldsymbol x') = \frac{1}{\sqrt{\lambda}}\boldsymbol u^T \begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x') \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x') \end{bmatrix} \end{aligned}$$
In summary:

  1. Solve the eigenproblem of $K=XX^T$,
    $$K\boldsymbol u_i=\lambda_i\boldsymbol u_i,\quad \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_N$$

  2. Project a sample onto the $i$-th eigenvector direction,
    $$\boldsymbol v^T_i\phi(\boldsymbol x')=\frac{1}{\sqrt{\lambda_i}}\boldsymbol u_i^T \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x') \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x') \end{bmatrix}$$
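
A sketch of these two steps (assuming NumPy; `kernel` is any kernel function such as the `rbf_kernel` above; centering of $K$ is deferred to the exercise below):

```python
import numpy as np

def kernel_pca_project(X_train, X_test, kernel, n_components=2):
    """Project test samples onto the leading kernel-PCA directions."""
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    lam, U = np.linalg.eigh(K)                 # step 1: K u_i = lambda_i u_i
    lam, U = lam[::-1], U[:, ::-1]             # sort eigenpairs descending
    # kappa(x_j, x') for every test sample x' and training sample x_j
    k_test = np.array([[kernel(a, b) for a in X_train] for b in X_test])
    # step 2: v_i^T phi(x') = (1/sqrt(lambda_i)) u_i^T [kappa(x_j, x')]_j
    return k_test @ U[:, :n_components] / np.sqrt(lam[:n_components])
```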

Exercise: let $\phi$ be a feature map with kernel $\kappa$, and suppose $\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)$ do not have zero mean, i.e. $\bar\phi=\frac{1}{N}\sum_{i=1}^N\phi(\boldsymbol x_i)\ne0$. Let $\tilde{\phi}=\phi-\bar\phi$, so $\tilde{\phi}(\boldsymbol x_i)=\phi(\boldsymbol x_i)-\bar\phi$. Find a matrix $M$ such that the kernel matrix $\tilde{K}$ of $\tilde{\phi}$ satisfies $\tilde{K}=M^TKM$.
$$\begin{aligned} \tilde{\kappa}(\boldsymbol x_i,\boldsymbol x_j) &= \langle\tilde{\phi}(\boldsymbol x_i),\tilde{\phi}(\boldsymbol x_j)\rangle = \langle\phi(\boldsymbol x_i)-\bar\phi,\phi(\boldsymbol x_j)-\bar\phi\rangle \\ &= \langle\phi(\boldsymbol x_i),\phi(\boldsymbol x_j)\rangle-\langle\bar\phi,\phi(\boldsymbol x_j)\rangle-\langle\phi(\boldsymbol x_i),\bar\phi\rangle+\langle\bar\phi,\bar\phi\rangle \\ &= \kappa(\boldsymbol x_i,\boldsymbol x_j)-\frac{1}{N}\sum_{k=1}^N\kappa(\boldsymbol x_k,\boldsymbol x_j)-\frac{1}{N}\sum_{k=1}^N\kappa(\boldsymbol x_i,\boldsymbol x_k)+\frac{1}{N^2}\sum_{s=1}^N\sum_{t=1}^N\kappa(\boldsymbol x_s,\boldsymbol x_t) \end{aligned}$$

$$K= \begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N)\\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix},\qquad \tilde{K}= \begin{bmatrix} \langle\tilde{\phi}(\boldsymbol x_1),\tilde{\phi}(\boldsymbol x_1)\rangle & \cdots & \langle\tilde{\phi}(\boldsymbol x_1),\tilde{\phi}(\boldsymbol x_N)\rangle\\ \vdots & \ddots & \vdots \\ \langle\tilde{\phi}(\boldsymbol x_N),\tilde{\phi}(\boldsymbol x_1)\rangle & \cdots & \langle\tilde{\phi}(\boldsymbol x_N),\tilde{\phi}(\boldsymbol x_N)\rangle \end{bmatrix}$$

In matrix form, with $X$ the matrix whose rows are $\phi(\boldsymbol x_i)^T$ and $\boldsymbol 1=[1,\cdots,1]^T$,
$$\tilde{K}= \begin{bmatrix} \tilde{\phi}(\boldsymbol x_1)^T \\ \vdots \\ \tilde{\phi}(\boldsymbol x_N)^T \end{bmatrix} \begin{bmatrix} \tilde{\phi}(\boldsymbol x_1) & \cdots & \tilde{\phi}(\boldsymbol x_N) \end{bmatrix} = \left(X-\boldsymbol 1\bar\phi^T\right)\left(X-\boldsymbol 1\bar\phi^T\right)^T$$
Since $\bar\phi^T=\frac{1}{N}\boldsymbol 1^TX$, we have $X-\boldsymbol 1\bar\phi^T=\left(I_N-\frac{1}{N}\boldsymbol 1\boldsymbol 1^T\right)X=HX$, so
$$\tilde{K}=HXX^TH^T=HKH,\qquad H=I_N-\frac{1}{N}\boldsymbol 1\boldsymbol 1^T$$
and since the centering matrix $H$ is symmetric, $M=H$ answers the exercise.
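
A small numeric check of this result (assuming NumPy; `Phi` is a stand-in for the unknown feature vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 4))                 # stand-in for the phi(x_i)
K = Phi @ Phi.T                               # kernel matrix of phi

Phi_c = Phi - Phi.mean(axis=0)                # phi_tilde(x_i) = phi(x_i) - phi_bar
K_tilde = Phi_c @ Phi_c.T                     # kernel matrix of phi_tilde

N = len(Phi)
H = np.eye(N) - np.ones((N, N)) / N           # centering matrix H = I - 11^T/N
print(np.allclose(K_tilde, H @ K @ H))        # True
```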


3. LDA and GDA

3.1 Linear Discriminant Analysis

Find a direction vector such that:

  • the projected class means are as far apart as possible
  • within each class, the projected samples are as close to their mean as possible

That is, increase the distance between class means while tightening each class around its own mean, so that the projected classes overlap less and become more separable.


$L$: number of classes; $N_i$: number of samples in class $i$; $N$: total number of samples; $\boldsymbol x^{(i)}_j$: the $j$-th sample of class $i$.

Project all samples onto the direction vector $\boldsymbol v$: $\boldsymbol v^T\boldsymbol x^{(1)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(1)}_{N_1};\ \boldsymbol v^T\boldsymbol x^{(2)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(2)}_{N_2};\ \cdots;\ \boldsymbol v^T\boldsymbol x^{(L)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(L)}_{N_L}$.

The projected mean of each class is
$$\overline{\boldsymbol m}_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\boldsymbol v^T\boldsymbol x^{(i)}_j=\boldsymbol v^T\left(\frac{1}{N_i}\sum_{j=1}^{N_i}\boldsymbol x^{(i)}_j\right)=\boldsymbol v^T\boldsymbol m_i$$
where $\boldsymbol m_i$ is the mean of class $i$ in the original space.

Next compute the weighted sum of squared distances between the projected class means,
$$\begin{aligned} \sum^{L-1}_{i=1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\overline{\boldsymbol m}_i-\overline{\boldsymbol m}_j)^2 &= \sum^{L-1}_{i=1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol v^T\boldsymbol m_i-\boldsymbol v^T\boldsymbol m_j)(\boldsymbol v^T\boldsymbol m_i-\boldsymbol v^T\boldsymbol m_j)^T \\ &= \boldsymbol v^T\left(\sum^{L-1}_{i=1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T\right)\boldsymbol v \\ &= \boldsymbol v^TS^{LDA}_{b}\boldsymbol v \end{aligned}$$

$$\begin{aligned} S^{LDA}_b &= \sum^{L-1}_{i=1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T \\ &= \frac{1}{2}\sum^{L}_{i=1}\sum^{L}_{j=1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T \\ &= \frac{1}{2}\sum^{L}_{i=1}\sum^{L}_{j=1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_i\boldsymbol m_j^T-\boldsymbol m_j\boldsymbol m_i^T+\boldsymbol m_j\boldsymbol m_j^T) \\ &= \frac{1}{2}\left(\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T\sum_{j=1}^{L}\frac{N_j}{N}-\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j^T-\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i^T+\sum_{i=1}^{L}\frac{N_i}{N}\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\boldsymbol m_j^T\right) \\ &= \frac{1}{2}\left(\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_0\boldsymbol m_0^T-\boldsymbol m_0\boldsymbol m_0^T+\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\boldsymbol m_j^T\right) \\ &= \sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_0\boldsymbol m_0^T \\ &= \sum_{i=1}^{L}\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T \quad (\text{analogous to } E[(x-\bar x)^2]=E[x^2]-\bar x^2) \end{aligned}$$

where
$$\boldsymbol m_0=\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i=\sum_{i=1}^{L}\frac{N_i}{N}\sum_{k=1}^{N_i}\frac{1}{N_i}\boldsymbol x_k^{(i)}=\sum_{i=1}^{L}\sum_{k=1}^{N_i}\frac{1}{N}\boldsymbol x_k^{(i)}$$
In summary, the between-class scatter matrix is:
$$S^{LDA}_b=\sum_{i=1}^{L-1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T=\sum_{i=1}^{L}\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T$$
This amounts to the mass-weighted squared distances from each cluster's centroid to the centroid of all the data.


The sum of within-class variances is
$$\begin{aligned} \sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol v^T\boldsymbol x_j^{(i)}-\overline{\boldsymbol m}_i)^2 &= \sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol v^T\boldsymbol x_j^{(i)}-\boldsymbol v^T\boldsymbol m_i)(\boldsymbol v^T\boldsymbol x_j^{(i)}-\boldsymbol v^T\boldsymbol m_i)^T \\ &= \boldsymbol v^T\left(\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol x_j^{(i)}-\boldsymbol m_i)(\boldsymbol x_j^{(i)}-\boldsymbol m_i)^T\right)\boldsymbol v \\ &= \boldsymbol v^TS^{LDA}_w\boldsymbol v \end{aligned}$$
So the within-class scatter matrix is:
$$S^{LDA}_w=\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol x_j^{(i)}-\boldsymbol m_i)(\boldsymbol x_j^{(i)}-\boldsymbol m_i)^T$$
The first discriminant direction is obtained from:
$$\color{red}{\boldsymbol v=\mathop{\arg\max}_{\boldsymbol v\in\mathbb{R}^d}\frac{\boldsymbol v^TS^{LDA}_b\boldsymbol v}{\boldsymbol v^TS^{LDA}_w\boldsymbol v}=\mathop{\arg\max}_{\boldsymbol v^TS_w^{LDA}\boldsymbol v=1}\boldsymbol v^TS_b^{LDA}\boldsymbol v}$$
By the Lagrangian method,
$$f(\boldsymbol v,\lambda)=\boldsymbol v^TS_b^{LDA}\boldsymbol v-\lambda(\boldsymbol v^TS_w^{LDA}\boldsymbol v-1)$$

$$\begin{aligned} \frac{\partial f}{\partial \boldsymbol v}&=2S_b^{LDA}\boldsymbol v-2\lambda S_w^{LDA}\boldsymbol v=0 \Leftrightarrow {\color{red}(S_w^{LDA})^{-1}S_b^{LDA}\boldsymbol v=\lambda \boldsymbol v}\\ \frac{\partial f}{\partial \lambda}&=\boldsymbol v^TS_w^{LDA}\boldsymbol v-1=0 \Leftrightarrow \boldsymbol v^TS_w^{LDA}\boldsymbol v=1 \end{aligned}$$

At such a point, $\boldsymbol v^TS^{LDA}_b\boldsymbol v=\lambda\, \boldsymbol v^TS^{LDA}_w\boldsymbol v=\lambda$.

Thus finding the first discriminant direction is equivalent to the largest generalized eigenvalue problem,
$$S^{LDA}_b\boldsymbol u=\lambda S^{LDA}_w\boldsymbol u,\qquad \boldsymbol v=\frac{1}{\sqrt{\boldsymbol u^TS^{LDA}_w\boldsymbol u}}\boldsymbol u$$
where the scaling ensures $\boldsymbol v^TS_w^{LDA}\boldsymbol v=1$.
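
A sketch of this computation (assuming NumPy/SciPy and a nonsingular $S^{LDA}_w$; `scipy.linalg.eigh` solves the symmetric generalized eigenproblem):

```python
import numpy as np
from scipy.linalg import eigh

def lda_direction(X, y):
    """First LDA direction for samples X (rows) with integer labels y."""
    d = X.shape[1]
    m0 = X.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) / len(X) * np.outer(mc - m0, mc - m0)
        D = Xc - mc
        Sw += D.T @ D / len(X)
    lam, V = eigh(Sb, Sw)        # generalized problem Sb v = lambda Sw v
    return V[:, -1]              # eigenvector of the largest eigenvalue
```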

3.2 Generalized Discriminant Analysis

$L$: number of classes;

$N_i$: number of samples in class $i$;

$N$: total number of samples;

$\phi(\boldsymbol x^{(i)}_j)$: the mapped $j$-th sample of class $i$;

$X^T_i=[\phi(\boldsymbol x^{(i)}_1),\cdots,\phi(\boldsymbol x^{(i)}_{N_i})]$;

$X^T=[X^T_1,\cdots,X^T_L]$.

Assume the samples have zero mean in $H$: $\boldsymbol m_0=0$.

Then the between-class scatter matrix is:
$$S^{GDA}_b=\sum_{i=1}^L\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T=\sum_{i=1}^L\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T$$
The within-class scatter matrix is:
$$S^{GDA}_w=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T$$

$$\boldsymbol m_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\phi(\boldsymbol x^{(i)}_j)=\frac{1}{N_i}[\phi(\boldsymbol x^{(i)}_1),\cdots,\phi(\boldsymbol x^{(i)}_{N_i})]\begin{bmatrix}1\\\vdots\\1\end{bmatrix}=\frac{1}{N_i}X^T_i\boldsymbol 1_{N_i\times1}$$

$$\boldsymbol m_i\boldsymbol m_i^T=\frac{1}{N_i^2}X^T_i\boldsymbol 1_{N_i\times1}\boldsymbol 1_{1\times N_i}X_i=\frac{1}{N_i}X^T_iB_iX_i$$

where $B_i=\frac{1}{N_i}\boldsymbol 1_{N_i\times N_i}$. The between-class scatter matrix becomes:
$${\color{red}S^{GDA}_b}=\sum_{i=1}^L\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T=\frac{1}{N}\sum_{i=1}^LX^T_iB_iX_i=\frac{1}{N} \begin{bmatrix} X^T_1 & \cdots & X^T_L \end{bmatrix} \begin{bmatrix} B_1 & & 0\\ & \ddots &\\ 0 & & B_L \end{bmatrix} \begin{bmatrix} X_1 \\ \vdots \\ X_L \end{bmatrix} =\frac{1}{N}X^TBX$$
The within-class scatter matrix becomes:
$$\begin{aligned} {\color{red}S^{GDA}_w}&=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T \\ &=\frac{1}{N}\sum_{i=1}^L \begin{bmatrix} \phi(\boldsymbol x^{(i)}_1) & \cdots & \phi(\boldsymbol x^{(i)}_{N_i}) \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x^{(i)}_1)^T \\ \vdots \\ \phi(\boldsymbol x^{(i)}_{N_i})^T \end{bmatrix} =\frac{1}{N}\sum_{i=1}^LX^T_iX_i \\ &=\frac{1}{N} \begin{bmatrix} X_1^T & \cdots & X^T_L \end{bmatrix} \begin{bmatrix} X_1 \\ \vdots \\X_L \end{bmatrix} =\frac{1}{N}X^TX \end{aligned}$$
As before,
$$S^{GDA}_b\boldsymbol v=\lambda S^{GDA}_w \boldsymbol v,\quad \text{i.e.}\quad \left(\frac{1}{N}X^TBX\right)\boldsymbol v=\lambda \left(\frac{1}{N}X^TX\right)\boldsymbol v \quad (X \text{ unknown})$$
Assume $\boldsymbol v$ can be expressed as a linear combination of the samples, i.e.
$$\boldsymbol v=\sum_{i=1}^L\sum_{j=1}^{N_i}\alpha_j^{(i)}\phi(\boldsymbol x^{(i)}_j)=X^T\boldsymbol \alpha.$$
Substituting into the equation above,
$$\begin{aligned} &\Rightarrow X^TBXX^T\boldsymbol \alpha=\lambda X^TXX^T\boldsymbol \alpha \\ &\Rightarrow XX^TBXX^T\boldsymbol \alpha=\lambda XX^TXX^T\boldsymbol \alpha \\ &\Rightarrow (KBK)\boldsymbol \alpha=\lambda(KK)\boldsymbol \alpha \end{aligned}$$
Solving this yields $\boldsymbol \alpha$; projecting a test sample onto $\boldsymbol v=X^T\boldsymbol \alpha$ gives
$$\boldsymbol v^T\phi(\boldsymbol x)=(X^T\boldsymbol \alpha)^T\phi(\boldsymbol x)=\boldsymbol \alpha^T \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix}\phi(\boldsymbol x) =\boldsymbol \alpha^T \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x) \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x) \end{bmatrix}$$
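
A hedged sketch of this dual eigenproblem (assuming NumPy/SciPy; the small ridge term is an assumption beyond the text, added because $KK$ is often singular in practice):

```python
import numpy as np
from scipy.linalg import eigh

def gda_alpha(K, y, ridge=1e-8):
    """Leading dual vector alpha from kernel matrix K and labels y."""
    N = len(y)
    B = np.zeros((N, N))
    for c in np.unique(y):                 # block matrix B of 1/N_i entries,
        idx = (y == c)                     # handling any sample ordering
        B[np.ix_(idx, idx)] = 1.0 / idx.sum()
    KBK = K @ B @ K
    KK = K @ K + ridge * np.eye(N)         # regularize for invertibility
    lam, A = eigh(KBK, KK)                 # (KBK) a = lambda (KK) a
    return A[:, -1]                        # alpha of the largest eigenvalue
```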
Exercise: in GDA the within-class scatter matrix is
$$S^{GDA}_w=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T$$
whereas in LDA,
$$S^{LDA}_w=\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\phi(\boldsymbol x_j^{(i)})-\boldsymbol m_i)(\phi(\boldsymbol x_j^{(i)})-\boldsymbol m_i)^T,\qquad \boldsymbol m_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\phi(\boldsymbol x^{(i)}_j)$$
Can the GDA within-class scatter matrix be derived along the lines of LDA, i.e. can one find a matrix $W$ such that $S^{LDA}_w=X^TWX$?


4. Hard-Margin Support Vector Machine

$$\min f(\boldsymbol x)\quad \text{s.t.}\ h(\boldsymbol x)=0$$

Suppose this problem has a local minimizer $\boldsymbol x^*$ with Lagrange multiplier $\boldsymbol \alpha^*$, and that the Hessian of the Lagrangian is positive definite at $\boldsymbol x^*$. Then the dual problem is:
$$\max W(\boldsymbol \alpha)=\max L(\boldsymbol x(\boldsymbol \alpha),\boldsymbol \alpha)$$
where $W(\boldsymbol \alpha)$ is obtained by setting the derivative of $L(\boldsymbol x,\boldsymbol \alpha)$ with respect to $\boldsymbol x$ to zero.

Example: $\min x^2_1+x_2^2\ \ \text{s.t.}\ x_1+x_2=1$

The Lagrangian is
$$L(x_1,x_2,\alpha)=x_1^2+x_2^2+\alpha(x_1+x_2-1)$$
$$\frac{\partial L}{\partial x_1}=2x_1+\alpha=0 \Rightarrow x_1=-\frac{\alpha}{2},\qquad \frac{\partial L}{\partial x_2}=2x_2+\alpha=0 \Rightarrow x_2=-\frac{\alpha}{2}$$
Therefore the dual objective is:
$$W(\alpha)=\min_{x_1,x_2}L(x_1,x_2,\alpha)=-\frac{\alpha^2}{2}-\alpha$$
Solving the dual problem $\max_\alpha W(\alpha)$,
$$\frac{\partial W}{\partial \alpha}=-\alpha-1=0 \Rightarrow \alpha^*=-1$$
The solution of the primal problem is $x_1^*=x_2^*=\frac{1}{2}$.
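
A quick numeric check of this toy problem (assuming SciPy; `scipy.optimize.minimize` handles the equality constraint):

```python
import numpy as np
from scipy.optimize import minimize

res = minimize(
    fun=lambda x: x[0] ** 2 + x[1] ** 2,              # primal objective
    x0=np.zeros(2),                                    # arbitrary start
    constraints=[{"type": "eq",
                  "fun": lambda x: x[0] + x[1] - 1}],  # h(x) = x1 + x2 - 1 = 0
)
print(res.x)  # ~ [0.5, 0.5], matching the dual solution alpha* = -1
```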


Karush-Kuhn-Tucker Theorem (KKT):
$$\min f(\boldsymbol x)\quad \text{s.t.}\ \boldsymbol{h}(\boldsymbol x)=\boldsymbol 0,\ \ \boldsymbol{g}(\boldsymbol x)\le\boldsymbol 0$$
Let $\boldsymbol{x}^*$ be an optimal point with Lagrange multipliers $\boldsymbol{\lambda}^*$ and $\boldsymbol{\alpha}^*\ge0$. Then
$$\begin{aligned} \boldsymbol{\alpha}^* &\ge0\\ \nabla{f(\boldsymbol x^*)}+{\boldsymbol{\lambda}^*}^T\nabla\boldsymbol{h}(\boldsymbol x^*)+{\boldsymbol{\alpha}^*}^T\nabla\boldsymbol{g}(\boldsymbol x^*) &= \boldsymbol{0}\\ {\boldsymbol{\alpha}^*}^T\boldsymbol{g}(\boldsymbol x^*)&=0 \end{aligned}$$
The last condition (complementary slackness) requires that, for each constraint, at least one of $\alpha_i$ and $g_i(\boldsymbol x)$ be zero.

For a linearly separable binary classification problem:
$$y_i=\begin{cases} +1 & \boldsymbol x_i\in \text{class } 1\\ -1 & \boldsymbol x_i\in \text{class } 2 \end{cases},\qquad (\boldsymbol x_i,y_i),\ i=1,\cdots,l,\ \boldsymbol x_i\in \mathbb{R}^n$$

$$\begin{cases} \boldsymbol w^T\boldsymbol x_i+b\ge1 & y_i=+1\\ \boldsymbol w^T\boldsymbol x_i+b\le-1 & y_i=-1 \end{cases} \;\Rightarrow\; y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1,\ \forall i=1,\cdots,l$$

$$\text{margin}=(d_++d_-)=\frac{2}{\|\boldsymbol w\|}=\frac{2}{\sqrt{\boldsymbol w^T\boldsymbol w}}$$

The optimization problem is:
$$\color{red}{\text{Minimize}\ \Phi(\boldsymbol w)=\frac{1}{2}\boldsymbol w^T\boldsymbol w \quad \text{Subject to}\ y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1,\ \forall i=1,\cdots,l}$$
Form the Lagrangian:
$$\text{Minimize}\ L(\boldsymbol w,b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\boldsymbol x_i+b)-1] \quad \text{Subject to}\ \alpha_i\ge0\ \forall i=1,\cdots,l$$

$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_iy_i\boldsymbol x_i,\qquad \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0$$

Let $\phi$ be the feature map of a kernel $\kappa(\boldsymbol x,\boldsymbol z)$; the problem becomes
$$\text{Minimize}\ L(\boldsymbol w,b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1] \quad \text{Subject to}\ \alpha_i\ge0\ \forall i=1,\cdots,l$$

$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i),\qquad \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0$$

Substituting the stationarity conditions back into the Lagrangian,
$$\begin{aligned} \boldsymbol w^T\boldsymbol w&=\left(\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)\right)^T\left(\sum_{j=1}^l\alpha_jy_j\phi(\boldsymbol x_j)\right) =\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j) \end{aligned}$$
$$\boldsymbol w^T\phi(\boldsymbol x_i)=\left(\sum_{j=1}^l\alpha_jy_j\phi(\boldsymbol x_j)\right)^T\phi(\boldsymbol x_i)=\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_j,\boldsymbol x_i)$$

$$\begin{aligned} W(\boldsymbol \alpha)&=L(\boldsymbol w(\boldsymbol \alpha),b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1]\\ &=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)-b\sum_{i=1}^l\alpha_iy_i+\sum_{i=1}^l\alpha_i\\ &=\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)-\sum_{i=1}^l\alpha_iy_i\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_j,\boldsymbol x_i)+\sum_{i=1}^l\alpha_i\\ &=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j) \end{aligned}$$

In summary, the problem is equivalent to
$$\max\ W(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)$$
$$\text{s.t.}\ \sum_{i=1}^l\alpha_iy_i=0,\qquad \alpha_i\ge0\ \forall i=1,\cdots,l$$
$W(\boldsymbol \alpha)$ is a quadratic form. With $Y=\mathrm{diag}(y_1,\cdots,y_l)$,
$$\begin{aligned} W(\boldsymbol \alpha)&=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\\ &= \boldsymbol 1_{1\times l}\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_l \end{bmatrix}-\frac{1}{2} \begin{bmatrix} \alpha_1 & \cdots & \alpha_l \end{bmatrix} \begin{bmatrix} y_1y_1\kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & y_1y_l\kappa(\boldsymbol x_1,\boldsymbol x_l)\\ \vdots & \ddots & \vdots\\ y_ly_1\kappa(\boldsymbol x_l,\boldsymbol x_1) & \cdots & y_ly_l\kappa(\boldsymbol x_l,\boldsymbol x_l) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_l \end{bmatrix}\\ &=\boldsymbol 1_{1\times l}\boldsymbol \alpha-\frac{1}{2}\boldsymbol \alpha^TYKY\boldsymbol \alpha \end{aligned}$$
and the quadratic term is nonnegative because $K$ is positive semidefinite:
$$\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)=\begin{bmatrix} \alpha_1y_1 & \cdots & \alpha_ly_l \end{bmatrix} \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_l)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_l,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_l,\boldsymbol x_l) \end{bmatrix} \begin{bmatrix} \alpha_1y_1 \\ \vdots \\ \alpha_ly_l \end{bmatrix}\ge0$$
Hence the quadratic part of $W(\boldsymbol \alpha)$ is negative semidefinite, so $W$ is concave and has a global maximum. Solving for $\boldsymbol \alpha^*$ then gives $\boldsymbol w^*=\sum\alpha^*_iy_i\phi(\boldsymbol x_i)$.

The KKT conditions state that a multiplier can be nonzero only when its inequality constraint is active. Let $\boldsymbol \alpha^*,\boldsymbol w^*,b^*$ be the optimal solution; then
$$\alpha^*_i\left(y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1\right)=0,\ \forall i=1,\cdots,l$$
This condition splits into two cases:

  • $y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)>1\Rightarrow \alpha^*_i=0$.
  • $\alpha^*_i\ne0\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)=1$, meaning $\phi(\boldsymbol x_i)$ lies on the margin; such points are the support vectors (they alone determine the solution).

Hence most $\alpha_i^*=0$. For the support vectors (SV):

  • a nonzero $\alpha^*_i\Rightarrow \phi(\boldsymbol x_i)$ lies on the margin

  • the decision boundary depends only on the support vectors,
    $$\boldsymbol w^*=\sum_{i=1}^l\alpha^*_iy_i\phi(\boldsymbol x_i)=\sum_{\alpha^*_i\ne0}\alpha^*_iy_i\phi(\boldsymbol x_i)$$

Procedure:

  1. Solve the dual problem to obtain the $\alpha^*_t>0$

  2. $\boldsymbol w^*=\sum_{\alpha^*_t>0}\alpha^*_ty_t\phi(\boldsymbol x_t)$

  3. By the KKT conditions, $y_t({\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*)-1=0\Rightarrow {\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=\pm1\Rightarrow {\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=y_t$, so $b^*$ can be computed from any support vector:
    $$b^*=y_t-{\boldsymbol w^*}^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha^*_iy_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha^*_iy_i\kappa(\boldsymbol x_i,\boldsymbol x_t)$$

  4. For a new sample $\phi(\boldsymbol x')$,
    $$\boldsymbol w^T\phi(\boldsymbol x')+b=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x')+b= \sum_{i=1}^l\alpha_iy_i\kappa(\boldsymbol x_i,\boldsymbol x')+b$$

  5. If the result is positive, $\boldsymbol x'$ belongs to class 1; otherwise to class 2:
    $$y'=\mathrm{sign}\left(\sum_{i=1}^l\alpha_iy_i\kappa(\boldsymbol x_i,\boldsymbol x')+b\right)$$
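
A usage sketch of this classifier (assuming scikit-learn and NumPy; `SVC(kernel="precomputed")` consumes a Gram matrix directly, and the very large `C` here approximates the hard margin; data and names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2))
y_train = np.where(X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0, 1, -1)

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix G[i, j] = kappa(A[i], B[j]) for the RBF kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

clf = SVC(kernel="precomputed", C=1e6)         # large C ~ hard margin
clf.fit(rbf_gram(X_train, X_train), y_train)

X_test = rng.normal(size=(5, 2))
print(clf.predict(rbf_gram(X_test, X_train)))  # needs kappa(test, train)
```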

5. Soft-Margin Support Vector Machine

$$\text{Minimize}\ \Phi(\boldsymbol w,\boldsymbol \xi)=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i,\ C>0$$
$$\text{Subject to}\ \xi_i\ge0,\ y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1-\xi_i,\ \forall i=1,\cdots,l$$

Here $C$ controls the size of the margin region: as $C$ increases, the slacks $\xi_i$ shrink and the margin region narrows.

The Lagrangian is
$$\color{red}{L(\boldsymbol w,b,\boldsymbol \xi,\boldsymbol \alpha,\boldsymbol \beta)=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\boldsymbol x_i+b)-1+\xi_i]-\sum_{i=1}^l\beta_i\xi_i}$$
$$\color{red}{\text{s.t.}\ \alpha_i,\beta_i\ge0,\ \forall i=1,\cdots,l}$$

$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w=\sum_{i=1}^l\alpha_iy_i\boldsymbol x_i,\qquad \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0,\qquad \frac{\partial L}{\partial \xi_i}=0\Rightarrow C-\alpha_i-\beta_i=0$$

Replacing $\boldsymbol x_i$ by $\phi(\boldsymbol x_i)$ as before,
$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)={\color{red}X^TY\boldsymbol \alpha},\quad Y=\mathrm{diag}(y_1,\cdots,y_l)$$
$$\frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=\boldsymbol y^T\boldsymbol \alpha=0,\qquad \frac{\partial L}{\partial \xi_i}=0\Rightarrow C-\alpha_i-\beta_i=0$$
Substituting these stationarity conditions into the Lagrangian,
$$\boldsymbol w^T\boldsymbol w=(X^TY\boldsymbol \alpha)^T(X^TY\boldsymbol \alpha)=\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha\ \left(=\sum_{i=1}^l\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\right)$$
$$\boldsymbol w^T\phi(\boldsymbol x_i)=(X^TY\boldsymbol \alpha)^T\phi(\boldsymbol x_i)=\boldsymbol \alpha^TYX\phi(\boldsymbol x_i)\ \left(=\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\right)$$

$$\begin{aligned} W(\boldsymbol \alpha)&=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1+\xi_i]-\sum_{i=1}^l\beta_i\xi_i\\ &= \frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)-b\sum_{i=1}^l\alpha_iy_i+\sum_{i=1}^l\alpha_i-\sum_{i=1}^l(\alpha_i+\beta_i)\xi_i\\ &= \frac{1}{2}\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha+C\sum_{i=1}^l\xi_i-\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha+\sum_{i=1}^l\alpha_i-\sum_{i=1}^lC\xi_i\\ &= \sum_{i=1}^l\alpha_i-\frac{1}{2}\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha\\ &= \boldsymbol \alpha^T\boldsymbol 1_{l\times 1}-\frac{1}{2}\boldsymbol \alpha^TYKY\boldsymbol \alpha \end{aligned}$$

The problem becomes
$$\begin{aligned} \text{Maximize}\ &\color{red}{W(\boldsymbol \alpha)=\boldsymbol \alpha^T\boldsymbol 1_{l\times 1}-\frac{1}{2}\boldsymbol \alpha^TYKY\boldsymbol \alpha}\\ \text{Subject to}\ &\color{red}{\boldsymbol \alpha^T\boldsymbol y=0}\\ &\color{red}{0\le\alpha_i\le C\ \forall i=1,\cdots,l} \end{aligned}$$
Let $\boldsymbol \alpha^*,\boldsymbol w^*,b^*,\boldsymbol \xi^*$ be the optimal solution. By the KKT conditions,
$$\begin{cases} \alpha^*_i\left(y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i\right)=0\\ \beta^*_i\xi^*_i=0 \end{cases}\ \ i=1,\cdots,l,\qquad \alpha^*_i+\beta^*_i=C$$

  • $\alpha^*_i=0\Rightarrow \beta^*_i=C \Rightarrow \xi^*_i=0 \Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)\ge1$
  • $0<\alpha^*_i<C\Rightarrow \begin{cases} \beta^*_i = C-\alpha^*_i>0\Rightarrow \xi^*_i=0 \\ y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i=0 \end{cases}\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)=1$
  • $\alpha^*_i=C\Rightarrow \begin{cases} \beta^*_i=C-\alpha^*_i=0\Rightarrow \xi^*_i\ge0\\ y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i=0 \end{cases}\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)\le1$

Now the points with $\alpha_i\ne0$ may lie on a margin or between the two margins; they are called support vectors.


  • between the two margins: $\xi^*_i>0,\ \alpha_i^*=C$
  • on the margin: $\xi_i^*=0,\ 0<\alpha_i^*<C$
  • beyond the margin on the correct side: $\xi_i^*=0,\ \alpha^*_i=0$

Procedure:

  1. Solve for the $\alpha^*_t$; pick any $t$ with $0<\alpha^*_t<C$ to compute $b^*$

  2. By the KKT condition $\alpha^*_t\left(y_t({\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*)-1\right)=0$, we get ${\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=y_t$, hence
    $$b^*=y_t-{\boldsymbol w^*}^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha_i^*y_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha_i^*y_i\kappa(\boldsymbol x_i,\boldsymbol x_t)$$
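
A usage sketch of this soft-margin dual via scikit-learn (assuming scikit-learn; `C` is the penalty above and, in scikit-learn's convention, `gamma` equals $1/(2\sigma^2)$ for the RBF kernel; data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # simple nonlinear labels

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)      # gamma = 1/(2 sigma^2), sigma = 1
clf.fit(X, y)
print(clf.support_)      # indices i with alpha_i* != 0 (the support vectors)
print(clf.dual_coef_)    # alpha_i* y_i for those support vectors
print(clf.intercept_)    # b*
```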

Multi-class strategies:

  • One-Against-One: train a classifier for every pair of classes and decide by voting
  • One-Against-All: train each class against all the others; assign the class whose decision function gives the largest value (farthest from its boundary): $k^*=\mathop{\arg\max}_k(\boldsymbol w^T_k\phi(\boldsymbol x)+b_k)$

6. An Automatic Method for Selecting the Parameter of the RBF Kernel Function to SVM

RBF Kernel:
$$\kappa(\boldsymbol x,\boldsymbol z,\sigma)=\exp\left(-\frac{\|\boldsymbol x-\boldsymbol z\|^2}{2\sigma^2}\right)$$
Since $\kappa(\boldsymbol x,\boldsymbol x)=1$, every sample has unit norm in the feature space induced by the RBF kernel; that is, all samples are mapped onto the unit sphere.

Hence cosine similarity can measure how similar two samples are:
$$\cos\theta=\frac{<\phi(\boldsymbol x),\phi(\boldsymbol z)>}{\|\phi(\boldsymbol x)\|\cdot\|\phi(\boldsymbol z)\|}=<\phi(\boldsymbol x),\phi(\boldsymbol z)>=\kappa(\boldsymbol x,\boldsymbol z)$$
If the cosine tends to 1, the two samples are similar in the feature space; if it tends to 0, they are dissimilar.
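A quick numeric check of these two facts (a sketch; the sample points are arbitrary):

```python
import numpy as np

def rbf(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf(x, x, 1.0))   # always 1.0: every sample lies on the unit sphere
print(rbf(x, z, 1.0))   # in (0, 1]: the cosine similarity in feature space
```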

In classification, the goal is to find a suitable $\sigma$ such that:

  • For samples of the same class, the RBF kernel values should tend to 1 (compactness):
    $$\kappa(\boldsymbol x_l^{(i)},\boldsymbol x_k^{(i)},\sigma)\approx1,\qquad w(\sigma)=\frac{1}{\sum_{i=1}^LN_i^2}\sum_{i=1}^L\sum_{l=1}^{N_i}\sum_{k=1}^{N_i}\exp\left(-\frac{\|\boldsymbol x_l^{(i)}-\boldsymbol x_k^{(i)}\|^2}{2\sigma^2}\right)\approx1$$

  • For samples of different classes, the RBF kernel values should tend to 0 (separation):
    $$\kappa(\boldsymbol x_l^{(i)},\boldsymbol x_k^{(j)},\sigma)\approx0,\ i\ne j,\qquad b(\sigma)=\frac{1}{\sum_{i=1}^L\sum_{j=1,j\ne i}^LN_iN_j}\sum_{i=1}^L\sum_{j=1,j\ne i}^L\sum_{l=1}^{N_i}\sum_{k=1}^{N_j}\exp\left(-\frac{\|\boldsymbol x_l^{(i)}-\boldsymbol x_k^{(j)}\|^2}{2\sigma^2}\right)\approx0$$

As $\sigma\to0$, even samples of the same class become mutually orthogonal in the feature space; as $\sigma\to\infty$, all samples collapse together in the feature space. Neither extreme is separable.

We therefore look for a $\sigma$ such that
$$w(\sigma)\to1^-,\qquad b(\sigma)\to0^+$$
A selection criterion can then be defined as
$$J(\sigma)=(1-w(\sigma))+(b(\sigma)-0)=1-w(\sigma)+b(\sigma)$$
and the optimal $\sigma$ can be found with a gradient-based method (sketched below with a simple search).
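A sketch of the criterion in code, with a grid search standing in for the gradient-based method; the names `J` and `select_sigma` and the log-spaced grid are my choices, with `X` the $N\times d$ data matrix and `t` the integer class labels:

```python
import numpy as np

def J(sigma, X, t):
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    K = np.exp(-D2 / (2.0 * sigma ** 2))
    same = t[:, None] == t[None, :]
    w = K[same].mean()    # within-class kernel values, should approach 1
    b = K[~same].mean()   # between-class kernel values, should approach 0
    return 1.0 - w + b

def select_sigma(X, t, grid=np.logspace(-2, 2, 50)):
    return grid[np.argmin([J(s, X, t) for s in grid])]
```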


7. Linear Regression Model and Kernel-based Linear Regression Model

7.1 Linear Regression Model

Assumption: the random variables $\varepsilon_n$ are i.i.d. Gaussian with mean 0 and variance $\sigma^2$, and the random variable $y_n=f(\boldsymbol x_n)+\varepsilon_n$. Then
$$E(y_n|\boldsymbol x_n)=f(\boldsymbol x_n)+E(\varepsilon_n)=f(\boldsymbol x_n)$$
For every $n$, assume the conditional expectation satisfies
$$
\begin{aligned}
E(y_n|\boldsymbol x_n)&=w_0+w_1x_{n1}+\cdots+w_dx_{nd}\\
&=\begin{bmatrix}1 & x_{n1} & \cdots & x_{nd}\end{bmatrix}\begin{bmatrix}w_0\\ \vdots\\ w_d\end{bmatrix}\\
&=[1,\boldsymbol x_n^T]\boldsymbol w=\bar{\boldsymbol x}_n^T\boldsymbol w
\end{aligned}
$$
where $\boldsymbol x_n=[x_{n1},\cdots,x_{nd}]^T$, $\bar{\boldsymbol x}_n=[1,\boldsymbol x_n^T]^T$, $\boldsymbol w=[w_0,\cdots,w_d]^T$, and $w_0$ is the intercept. Hence
$$y_n=f(\boldsymbol x_n)+\varepsilon_n=E(y_n|\boldsymbol x_n)+\varepsilon_n=\bar{\boldsymbol x}_n^T\boldsymbol w+\varepsilon_n$$
Given $N$ samples $(\boldsymbol x_1,y_1),\cdots,(\boldsymbol x_N,y_N)$, write
$$X=\begin{bmatrix}\boldsymbol x_1^T\\ \vdots\\ \boldsymbol x_N^T\end{bmatrix},\quad \bar X=\begin{bmatrix}\bar{\boldsymbol x}_1^T\\ \vdots\\ \bar{\boldsymbol x}_N^T\end{bmatrix}=\begin{bmatrix}1 & \boldsymbol x_1^T\\ \vdots & \vdots\\ 1 & \boldsymbol x_N^T\end{bmatrix},\quad \boldsymbol y=\begin{bmatrix}y_1\\ \vdots\\ y_N\end{bmatrix}$$
The linear regression model describes the relation between $X$ and $\boldsymbol y$:
$$\begin{cases}y_1\approx\bar{\boldsymbol x}_1^T\boldsymbol w+\varepsilon_1\\ \quad\vdots\\ y_N\approx\bar{\boldsymbol x}_N^T\boldsymbol w+\varepsilon_N\end{cases}\Rightarrow\begin{bmatrix}y_1\\ \vdots\\ y_N\end{bmatrix}\approx\begin{bmatrix}\bar{\boldsymbol x}_1^T\\ \vdots\\ \bar{\boldsymbol x}_N^T\end{bmatrix}\boldsymbol w+\begin{bmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_N\end{bmatrix}\Rightarrow\boldsymbol y\approx\bar X\boldsymbol w+\boldsymbol\varepsilon$$
The optimal $\boldsymbol w$ is estimated by minimizing the residual sum of squares (RSS):
$$\hat{\boldsymbol w}=\mathop{\arg\min}_{\boldsymbol w}||\boldsymbol y-\bar X\boldsymbol w||^2$$

$$||\boldsymbol y-\bar X\boldsymbol w||^2=(\boldsymbol y-\bar X\boldsymbol w)^T(\boldsymbol y-\bar X\boldsymbol w)=\boldsymbol y^T\boldsymbol y-2\boldsymbol w^T\bar X^T\boldsymbol y+\boldsymbol w^T\bar X^T\bar X\boldsymbol w$$

$$\frac{\partial}{\partial\boldsymbol w}||\boldsymbol y-\bar X\boldsymbol w||^2=-2\bar X^T\boldsymbol y+2\bar X^T\bar X\boldsymbol w=0$$

The normal equation is
$$\bar X^T\bar X\boldsymbol w=\bar X^T\boldsymbol y\ \Rightarrow\ \hat{\boldsymbol w}=(\bar X^T\bar X)^{-1}\bar X^T\boldsymbol y\ \Rightarrow\ \hat{\boldsymbol y}=\bar X\hat{\boldsymbol w}=\bar X(\bar X^T\bar X)^{-1}\bar X^T\boldsymbol y$$
For an unseen sample $\boldsymbol x$, $f(\boldsymbol x)=[1,\ \boldsymbol x^T]\hat{\boldsymbol w}$.
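A minimal sketch of this closed-form fit (function names are illustrative); `np.linalg.solve` is applied to the normal equation rather than forming the matrix inverse explicitly, which is the numerically preferable route:

```python
import numpy as np

def fit_ols(X, y):
    X_bar = np.hstack([np.ones((X.shape[0], 1)), X])       # bar X = [1, X]
    return np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)   # solves bar X^T bar X w = bar X^T y

def predict_ols(w_hat, x):
    return np.concatenate([[1.0], x]) @ w_hat              # f(x) = [1, x^T] w_hat
```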


$$
\begin{aligned}
\boldsymbol y\approx\bar X\boldsymbol w&=\begin{bmatrix}1 & x_{11} & \cdots & x_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{N1} & \cdots & x_{Nd}\end{bmatrix}\begin{bmatrix}w_0\\ w_1\\ \vdots\\ w_d\end{bmatrix}\\
&=w_0\begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix}+w_1\begin{bmatrix}x_{11}\\ \vdots\\ x_{N1}\end{bmatrix}+\cdots+w_d\begin{bmatrix}x_{1d}\\ \vdots\\ x_{Nd}\end{bmatrix}
\end{aligned}
$$

Thus the approximation of $\boldsymbol y$ is a linear combination of the columns of $\bar X$. The residual is the difference between these two vectors, and the approximation is optimal exactly when the residual is orthogonal to the linear space spanned by the columns of $\bar X$.

  1. $0=[1,\cdots,1]\boldsymbol e=[1,\cdots,1](\boldsymbol y-\hat{\boldsymbol y})=\sum_{n=1}^Ny_n-\sum_{n=1}^N\hat y_n\Rightarrow\sum_{n=1}^Ny_n=\sum_{n=1}^N\hat y_n\Rightarrow{\color{red}\bar y=\bar{\hat y}}$

  2. Decompose the total variation:
    $$
    \begin{aligned}
    ||\boldsymbol y-\bar y\boldsymbol 1_{N\times1}||^2&=||\hat{\boldsymbol y}-\bar y\boldsymbol 1_{N\times1}+\boldsymbol y-\hat{\boldsymbol y}||^2\\
    &=||\hat{\boldsymbol y}-\bar y\boldsymbol 1_{N\times1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2-2(\hat{\boldsymbol y}-\bar y\boldsymbol 1_{N\times1})^T(\boldsymbol y-\hat{\boldsymbol y})\\
    &=||\hat{\boldsymbol y}-\bar y\boldsymbol 1_{N\times1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2-2(\hat{\boldsymbol y}-\bar y\boldsymbol 1_{N\times1})^T\boldsymbol e\\
    &=||\hat{\boldsymbol y}-\bar{\hat y}\boldsymbol 1_{N\times1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2
    \end{aligned}
    $$
    The cross term vanishes because the residual $\boldsymbol e$ is orthogonal to the column space of $\bar X$, which contains both $\hat{\boldsymbol y}$ and $\boldsymbol 1_{N\times1}$. The total variation therefore exceeds the variation of the fitted values by exactly the residual sum of squares.

$$\sum_{n=1}^N(y_n-\bar y)^2=\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2+\sum_{n=1}^N(y_n-\hat y_n)^2,\qquad TSS=ESS+RSS$$

  • TSS: Total sum of squares
  • ESS: Explained sum of squares
  • RSS: Residual (unexplained) sum of squares

The ratio of ESS to TSS is used to evaluate the goodness of fit:
$$R^2=\frac{ESS}{TSS}=\frac{\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2}{\sum_{n=1}^N(y_n-\bar y)^2}=\frac{\frac{1}{N}\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2}{\frac{1}{N}\sum_{n=1}^N(y_n-\bar y)^2}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}$$
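A numeric check of the decomposition and of $R^2$, assuming `y_hat` holds fitted values from an intercept-including fit such as `fit_ols` above:

```python
import numpy as np

def r_squared(y, y_hat):
    tss = np.sum((y - y.mean()) ** 2)
    ess = np.sum((y_hat - y_hat.mean()) ** 2)  # note y_hat.mean() == y.mean() here
    rss = np.sum((y - y_hat) ** 2)
    assert np.isclose(tss, ess + rss)          # TSS = ESS + RSS (needs the bias column)
    return 1.0 - rss / tss
```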

7.2 Kernel-based Linear Regression

$$X=\begin{bmatrix}\phi(\boldsymbol x_1)^T\\ \vdots\\ \phi(\boldsymbol x_N)^T\end{bmatrix},\quad \bar X=\begin{bmatrix}1 & \phi(\boldsymbol x_1)^T\\ \vdots & \vdots\\ 1 & \phi(\boldsymbol x_N)^T\end{bmatrix}=[\boldsymbol 1_{N\times1},X],\quad \boldsymbol y=\begin{bmatrix}y_1\\ \vdots\\ y_N\end{bmatrix}$$

$$\hat{\boldsymbol w}=(\bar X^T\bar X)^{-1}\bar X^T\boldsymbol y=(\bar X^T\bar X)(\bar X^T\bar X)^{-2}\bar X^T\boldsymbol y=\bar X^T\hat{\boldsymbol\alpha},\quad\text{where}\ \hat{\boldsymbol\alpha}=\bar X(\bar X^T\bar X)^{-2}\bar X^T\boldsymbol y$$

Substituting into the normal equation:
$$
\begin{aligned}
&\bar X^T\bar X\boldsymbol w=\bar X^T\boldsymbol y\\
\Rightarrow\ &\bar X\bar X^T\bar X\bar X^T\hat{\boldsymbol\alpha}=\bar X\bar X^T\boldsymbol y\\
\Rightarrow\ &\bar K^2\hat{\boldsymbol\alpha}=\bar K\boldsymbol y,\ \text{where}\ \bar K=\bar X\bar X^T\\
\Rightarrow\ &\bar K\hat{\boldsymbol\alpha}=\boldsymbol y,\ \text{if}\ \det(\bar K)\ne0\\
\Rightarrow\ &\hat{\boldsymbol\alpha}=\bar K^{-1}\boldsymbol y
\end{aligned}
$$

$$\bar K=\bar X\bar X^T=[\boldsymbol 1_{N\times1},X]\begin{bmatrix}\boldsymbol 1_{1\times N}\\ X^T\end{bmatrix}=\boldsymbol 1_{N\times N}+XX^T=\boldsymbol 1_{N\times N}+K$$

Therefore,
$$\color{red}
\begin{aligned}
\hat{\boldsymbol\alpha}&=(\boldsymbol 1_{N\times N}+K)^{-1}\boldsymbol y\\
\hat{\boldsymbol y}&=\bar X\hat{\boldsymbol w}=\bar X\bar X^T\hat{\boldsymbol\alpha}=(\boldsymbol 1_{N\times N}+K)\hat{\boldsymbol\alpha}
\end{aligned}
$$
For an unseen $\boldsymbol x$,
$$f(\boldsymbol x)=[1,\ \phi(\boldsymbol x)^T]\hat{\boldsymbol w}=[1,\ \phi(\boldsymbol x)^T]\bar X^T\hat{\boldsymbol\alpha}=\sum_{n=1}^N\hat\alpha_n\big(1+\kappa(\boldsymbol x_n,\boldsymbol x)\big)$$
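A sketch of the kernelized fit and prediction (names illustrative; `kernel` is any positive semidefinite kernel). With an invertible $\bar K$ this interpolates the training targets exactly, as derived above; in practice a small ridge term is often added to $\bar K$ for stability, which is a common variant rather than part of this derivation:

```python
import numpy as np

def fit_kernel_lr(X, y, kernel):
    N = X.shape[0]
    K = np.array([[kernel(x_i, x_j) for x_j in X] for x_i in X])
    K_bar = np.ones((N, N)) + K             # bar K = 1_{N x N} + K
    return np.linalg.solve(K_bar, y)        # alpha_hat, assuming det(bar K) != 0

def predict_kernel_lr(alpha_hat, X, x, kernel):
    k = np.array([1.0 + kernel(x_n, x) for x_n in X])
    return k @ alpha_hat                    # f(x) = sum_n alpha_hat_n (1 + kappa(x_n, x))
```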


8. Reproducing kernel Hilbert Space

An $m\times m$ real symmetric matrix $K$ is called positive semidefinite if
$$\boldsymbol a^TK\boldsymbol a\ge0$$
holds for every $\boldsymbol a\in\mathbb{R}^m$.

  • A real symmetric matrix is diagonalizable.
  • A real symmetric matrix is positive semidefinite if and only if all of its eigenvalues are nonnegative.

A function $\kappa:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is called a positive semidefinite kernel if
$$K=\begin{bmatrix}\kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_m)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_m,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_m,\boldsymbol x_m)\end{bmatrix}$$
is a positive semidefinite matrix for every finite set $\{\boldsymbol x_1,\cdots,\boldsymbol x_m\}\subset\mathbb{R}^d$.
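An empirical check of this definition (a sketch; a finite random sample can only refute positive semidefiniteness, never prove it):

```python
import numpy as np

def psd_on_sample(kernel, d=3, m=20, tol=1e-10, seed=0):
    """Build the Gram matrix on m random points and test eigenvalue nonnegativity."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, d))
    K = np.array([[kernel(x_i, x_j) for x_j in X] for x_i in X])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))  # K is symmetric, so eigvalsh applies
```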

A positive semidefinite kernel also satisfies the Cauchy–Schwarz inequality:
$$|\kappa(\boldsymbol x,\boldsymbol z)|^2\le\kappa(\boldsymbol x,\boldsymbol x)\kappa(\boldsymbol z,\boldsymbol z)$$
Proof: since $\kappa$ is a positive semidefinite kernel, the determinant of the corresponding $2\times2$ Gram matrix is nonnegative:
$$\begin{vmatrix}\kappa(\boldsymbol x,\boldsymbol x) & \kappa(\boldsymbol x,\boldsymbol z)\\ \kappa(\boldsymbol z,\boldsymbol x) & \kappa(\boldsymbol z,\boldsymbol z)\end{vmatrix}=\kappa(\boldsymbol x,\boldsymbol x)\kappa(\boldsymbol z,\boldsymbol z)-\kappa(\boldsymbol x,\boldsymbol z)^2\ge0$$

Reproducing kernel map: $\Phi:\mathbb{R}^d\to\mathbb{R}^{\mathbb{R}^d}$, $\Phi(\boldsymbol z)=\phi_{\boldsymbol z}=\kappa(\cdot,\boldsymbol z)$. This means
$$\Phi(\boldsymbol z)(\boldsymbol x)=\phi_{\boldsymbol z}(\boldsymbol x)=\kappa(\boldsymbol x,\boldsymbol z)$$
$\Phi$ maps a vector to a function: fixing the second argument of $\kappa$ at $\boldsymbol z$ yields the function $\phi_{\boldsymbol z}=\kappa(\cdot,\boldsymbol z)$ of the first argument.

Suppose $\kappa:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is a kernel and $\Phi$ is the corresponding reproducing kernel map. Then
$$\mathcal{F}_\kappa=\left\{f=\sum_{i=1}^m\alpha_i\Phi(\boldsymbol x_i)\,\middle|\,m\in\mathbb{N},\ \boldsymbol x_1,\cdots,\boldsymbol x_m\in\mathbb{R}^d,\ \alpha_1,\cdots,\alpha_m\in\mathbb{R}\right\}$$
is a linear space.

Define an inner product on this linear space: given $f=\sum_{i=1}^m\alpha_i\Phi(\boldsymbol x_i)=\sum_{i=1}^m\alpha_i\kappa(\cdot,\boldsymbol x_i)$ and $g=\sum_{j=1}^{m'}\beta_j\Phi(\boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j\kappa(\cdot,\boldsymbol x_j')$, the inner product of $f$ and $g$ is
$$<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')$$
Properties:

  • $<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')=\sum_{i=1}^m\alpha_i\sum_{j=1}^{m'}\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')=\sum_{i=1}^m\alpha_ig(\boldsymbol x_i)$, where $g=\sum_{j=1}^{m'}\beta_j\kappa(\cdot,\boldsymbol x_j')$

  • $<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j\sum_{i=1}^m\alpha_i\kappa(\boldsymbol x_j',\boldsymbol x_i)=\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j')$, where $f=\sum_{i=1}^m\alpha_i\kappa(\cdot,\boldsymbol x_i)$

From these two properties we obtain the reproducing property:
$$<\kappa(\cdot,\boldsymbol x),f>=f(\boldsymbol x)=\sum_{i=1}^m\alpha_i\kappa(\boldsymbol x_i,\boldsymbol x),\qquad <\kappa(\cdot,\boldsymbol x),\kappa(\cdot,\boldsymbol z)>=\kappa(\boldsymbol x,\boldsymbol z)=\kappa(\boldsymbol z,\boldsymbol x)$$
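A numeric illustration of this inner product and the reproducing property (a sketch; the coefficient/point representation of $f$, the RBF choice, and all names are illustrative):

```python
import numpy as np

def inner(alphas, Xs, betas, Zs, kernel):
    """<f, g> for f = sum_i alphas[i] kappa(., Xs[i]), g = sum_j betas[j] kappa(., Zs[j])."""
    return sum(a * b * kernel(x, z)
               for a, x in zip(alphas, Xs) for b, z in zip(betas, Zs))

def f_eval(alphas, Xs, x, kernel):
    return sum(a * kernel(x_i, x) for a, x_i in zip(alphas, Xs))

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
alphas, Xs = [0.5, -1.2], [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
x = np.array([0.3, -0.7])
# <kappa(., x), f> equals f(x):
print(np.isclose(inner([1.0], [x], alphas, Xs, rbf), f_eval(alphas, Xs, x, rbf)))
```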
We now verify that this definition satisfies the axioms of an inner product: symmetry, linearity, and positive definiteness.

  1. Symmetry
    $$<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')=\sum_{j=1}^{m'}\sum_{i=1}^m\beta_j\alpha_i\kappa(\boldsymbol x_j',\boldsymbol x_i)=<g,f>$$

  2. Linearity
    $$<af,g>=\sum_{j=1}^{m'}\beta_j(af)(\boldsymbol x_j')=a\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j')=a<f,g>$$

    $$<f+h,g>=\sum_{j=1}^{m'}\beta_j(f+h)(\boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j\big(f(\boldsymbol x_j')+h(\boldsymbol x_j')\big)=\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j')+\sum_{j=1}^{m'}\beta_jh(\boldsymbol x_j')=<f,g>+<h,g>$$

  3. Positive definiteness
    $$<f,f>=\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j\kappa(\boldsymbol x_i,\boldsymbol x_j)=\begin{bmatrix}\alpha_1 & \cdots & \alpha_m\end{bmatrix}\begin{bmatrix}\kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_m)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_m,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_m,\boldsymbol x_m)\end{bmatrix}\begin{bmatrix}\alpha_1\\ \vdots\\ \alpha_m\end{bmatrix}=\boldsymbol\alpha^TK\boldsymbol\alpha\ge0$$

    Moreover, if $<f,f>=0$, then for every $\boldsymbol x$, by the Cauchy–Schwarz inequality,
    $$|f(\boldsymbol x)|^2=|<\kappa(\cdot,\boldsymbol x),f>|^2\le<\kappa(\cdot,\boldsymbol x),\kappa(\cdot,\boldsymbol x)><f,f>=\kappa(\boldsymbol x,\boldsymbol x)<f,f>=0$$
    so $f(\boldsymbol x)=0$ for all $\boldsymbol x$; conversely, $f=0$ clearly gives $<f,f>=0$. Hence $<f,f>=0$ if and only if $f=0$.

In summary,
$$\kappa(\boldsymbol x,\boldsymbol z)=<\kappa(\cdot,\boldsymbol x),\kappa(\cdot,\boldsymbol z)>=<\Phi(\boldsymbol x),\Phi(\boldsymbol z)>$$
which means that for any kernel $\kappa(\boldsymbol x,\boldsymbol z)$ we can find a feature map $\Phi$ from $\mathbb{R}^d$ to $\mathcal{H}$.

Metric space: an ordered pair $(M,d)$, where $M$ is a nonempty set and $d$ is a metric on $M$, i.e., a function $M\times M\to\mathbb{R}$ such that for any three elements $\boldsymbol x,\boldsymbol y,\boldsymbol z\in M$:

  1. $d(\boldsymbol x,\boldsymbol y)\ge0$
  2. $d(\boldsymbol x,\boldsymbol y)=0\Leftrightarrow\boldsymbol x=\boldsymbol y$
  3. $d(\boldsymbol x,\boldsymbol y)=d(\boldsymbol y,\boldsymbol x)$
  4. $d(\boldsymbol x,\boldsymbol z)\le d(\boldsymbol x,\boldsymbol y)+d(\boldsymbol y,\boldsymbol z)$

Cauchy sequence: for a metric space $(M,d)$ and a sequence $\boldsymbol x_1,\boldsymbol x_2,\cdots$ in $M$, the sequence is Cauchy if for every real $\epsilon>0$ there exists a positive integer $N$ such that for all natural numbers $m,n>N$,
$$d(\boldsymbol x_m,\boldsymbol x_n)\le\epsilon$$
i.e., the terms of the sequence eventually become arbitrarily close to one another.

A metric space $(M,d)$ is complete if and only if every Cauchy sequence in $M$ converges to a point of $M$.

A Hilbert space $H$ is a real inner product space that is also a complete metric space, with the distance induced by the inner product:
$$d(\boldsymbol x,\boldsymbol y)=\|\boldsymbol x-\boldsymbol y\|$$
where $\|\boldsymbol x\|=\sqrt{<\boldsymbol x,\boldsymbol x>}$.


