文章目录
- 1. Kernel Method
- 2. PCA and Kernel PCA
- 3. LDA and GDA
- 4. Hard-Margin Support Vector Machine
- 5. Soft-Margin Support Vector Machine
- 6. An Automatic Method for Selecting the Parameter of the RBF Kernel Function to SVM
- 7. Linear Regression Model and Kernel-based Linear Regression Model
- 8. Reproducing kernel Hilbert Space
1. Kernel Method
特征映射 $\phi:\mathbb{R}^d\to\mathbb{R}^n$ 将原空间下的变量映射到高维特征空间,例如 $(x_1,x_2)\mapsto(z_1,z_2,z_3)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$,其目的是将原空间下的复杂关系在高维空间下简化。
核函数计算原空间内的变量在特征空间下的内积:$\kappa(\boldsymbol x,\boldsymbol x')=<\phi(\boldsymbol x),\phi(\boldsymbol x')>$。内积可以用来计算该空间下的距离与角度,衡量特征空间下的性质。
距离:
$$\begin{aligned} ||\phi(\boldsymbol x)-\phi(\boldsymbol x')||^2 &= (\phi(\boldsymbol x)-\phi(\boldsymbol x'))^T(\phi(\boldsymbol x)-\phi(\boldsymbol x')) \\ &= \phi(\boldsymbol x)^T\phi(\boldsymbol x)-2\phi(\boldsymbol x)^T\phi(\boldsymbol x')+\phi(\boldsymbol x')^T\phi(\boldsymbol x') \\ &= <\phi(\boldsymbol x), \phi(\boldsymbol x)>-2<\phi(\boldsymbol x), \phi(\boldsymbol x')>+<\phi(\boldsymbol x'),\phi(\boldsymbol x')> \\ &= \kappa(\boldsymbol x,\boldsymbol x)-2\kappa(\boldsymbol x,\boldsymbol x')+\kappa(\boldsymbol x',\boldsymbol x') \end{aligned}$$
角度:
$$<\phi(\boldsymbol x),\phi(\boldsymbol x')>=||\phi(\boldsymbol x)||\cdot||\phi(\boldsymbol x')||\cos\theta \\ \begin{aligned} \Rightarrow \cos\theta &= \frac{<\phi(\boldsymbol x),\phi(\boldsymbol x')>}{||\phi(\boldsymbol x)||\cdot||\phi(\boldsymbol x')||} \\ &= \frac{<\phi(\boldsymbol x),\phi(\boldsymbol x')>}{\sqrt{<\phi(\boldsymbol x),\phi(\boldsymbol x)>}\sqrt{<\phi(\boldsymbol x'),\phi(\boldsymbol x')>}} \\ &= \frac{\kappa(\boldsymbol x,\boldsymbol x')}{\sqrt{\kappa(\boldsymbol x,\boldsymbol x)}\sqrt{\kappa(\boldsymbol x',\boldsymbol x')}} \end{aligned}$$
内积矩阵,Gram矩阵,Kernel矩阵:
$$\begin{aligned} K &= \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \cdot \begin{bmatrix} \phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N) \end{bmatrix} \\ &= \begin{bmatrix} <\phi(\boldsymbol x_1),\phi(\boldsymbol x_1)> & \cdots & <\phi(\boldsymbol x_1),\phi(\boldsymbol x_N)> \\ \vdots & \ddots & \vdots \\ <\phi(\boldsymbol x_N),\phi(\boldsymbol x_1)> & \cdots & <\phi(\boldsymbol x_N),\phi(\boldsymbol x_N)> \end{bmatrix} \\ &= \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix} \end{aligned}$$
例1:
$$\{(\boldsymbol x_1,y_1),\cdots,(\boldsymbol x_n,y_n)\}\subset\mathbb{R}^d\times\{1,-1\}\ \overset{\phi}{\longrightarrow}\ \{(\phi(\boldsymbol x_1),y_1),\cdots,(\phi(\boldsymbol x_n),y_n)\}\subset H\times\{1,-1\}$$
利用垂直于两类质心连线、且通过连线中点的超平面作为分界面。
对于新的样本 $(\boldsymbol x,y)\to(\phi(\boldsymbol x),y)\to y=?$
当 $y=1$ 时,样本点与中点的连线和质心连线之间的夹角小于90度,即,
$$0\le\theta<\pi/2 \Leftrightarrow \cos\theta=\frac{<\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w>}{||\phi(\boldsymbol x)-\boldsymbol c||\cdot||\boldsymbol w||}>0 \Leftrightarrow\ <\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w>\ >0$$
同理,
$$y=-1 \Leftrightarrow\ <\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w>\ <0$$
合并以上两项为:
$$y=sgn(<\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w>)$$
计算 $<\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w>$:
其中 $\boldsymbol c_+,\boldsymbol c_-$ 为两类质心,$m_+,m_-$ 为两类样本数,$\boldsymbol w=\boldsymbol c_+-\boldsymbol c_-$,$\boldsymbol c=\frac{1}{2}(\boldsymbol c_++\boldsymbol c_-)$,$b=\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-)$:
$$\begin{aligned} <\phi(\boldsymbol x)-\boldsymbol c, \boldsymbol w> &= \boldsymbol w^T(\phi(\boldsymbol x)-\boldsymbol c) \\ &= (\boldsymbol c_+-\boldsymbol c_-)^T\phi(\boldsymbol x)-\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-) \\ &= \left(\frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)\right)^T\phi(\boldsymbol x)-\frac{1}{2}(\boldsymbol c_+-\boldsymbol c_-)^T(\boldsymbol c_++\boldsymbol c_-) \\ &= \frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)^T\phi(\boldsymbol x)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)^T\phi(\boldsymbol x) - b \\ &= \frac{1}{m_+}\sum_{y_i=1}\kappa(\boldsymbol x_i,\boldsymbol x)-\frac{1}{m_-}\sum_{y_i=-1}\kappa(\boldsymbol x_i,\boldsymbol x) - b \end{aligned}$$
由以上可知,不一定需要显式的特征映射 $\phi$,只利用核函数 $\kappa$ 即可完成计算。
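下面给出例1的一个最小代码草图(Python/NumPy,仅作演示;其中 `rbf_kernel`、`centroid_predict` 等函数名为自拟,并非文中给定):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """RBF核:kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def centroid_predict(X, y, x_new, kernel=rbf_kernel):
    """核质心分类器:y = sgn(<phi(x) - c, w>),全部通过核函数计算。"""
    pos, neg = X[y == 1], X[y == -1]
    # b = 1/2 (<c+, c+> - <c-, c->),其中 <c±, c±> 为类内核值的均值
    k_pp = np.mean([kernel(a, b_) for a in pos for b_ in pos])
    k_nn = np.mean([kernel(a, b_) for a in neg for b_ in neg])
    b = 0.5 * (k_pp - k_nn)
    # <phi(x) - c, w> = (1/m+) sum kappa(xi, x) - (1/m-) sum kappa(xi, x) - b
    f = (np.mean([kernel(xi, x_new) for xi in pos])
         - np.mean([kernel(xi, x_new) for xi in neg]) - b)
    return np.sign(f)
```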
- 有限半正定函数:一个函数 $\kappa:X\times X\to\mathbb{R}$,如果它是对称函数,且作用在空间 $X$ 的任意有限子集上得到的矩阵 $K$ 都是半正定矩阵,则该函数是有限半正定函数。
- Characterization of Kernels:如果函数 $\kappa:X\times X\to\mathbb{R}$ 连续或有一个有限定义域,且是有限半正定函数,则它能够分解为 $\kappa(\boldsymbol x,\boldsymbol z)=<\phi(\boldsymbol x),\phi(\boldsymbol z)>$,其中特征映射 $\phi$ 将原空间映射到Hilbert空间。
常用核函数:
- 线性核函数:$\kappa(\boldsymbol x, \boldsymbol z)=<\boldsymbol x,\boldsymbol z>$
- 多项式核函数:$\kappa(\boldsymbol x,\boldsymbol z)=(<\boldsymbol x,\boldsymbol z>+1)^r,\ r\in\mathbb{Z}^+$
- 高斯核函数(RBF):$\kappa(\boldsymbol x,\boldsymbol z)=\exp\left(\frac{-||\boldsymbol x-\boldsymbol z||^2}{2\sigma^2}\right),\ \sigma\in\mathbb{R}-\{0\}$
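以下是按上述定义计算核矩阵(Gram矩阵)的示意代码(NumPy草图,`gram_matrix` 为自拟函数名):

```python
import numpy as np

def gram_matrix(X, Z, kind="rbf", r=2, sigma=1.0):
    """K[i, j] = kappa(x_i, z_j),X、Z 的每一行是一个样本。"""
    inner = X @ Z.T                                  # 所有样本对的内积
    if kind == "linear":                             # <x, z>
        return inner
    if kind == "poly":                               # (<x, z> + 1)^r
        return (inner + 1.0) ** r
    if kind == "rbf":                                # exp(-||x-z||^2 / (2 sigma^2))
        sq = np.sum(X**2, 1)[:, None] - 2 * inner + np.sum(Z**2, 1)[None, :]
        return np.exp(-sq / (2 * sigma**2))
    raise ValueError(f"unknown kernel: {kind}")
```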
在特征空间内的线性函数为:$f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)+b$。
对于例1,
$$y=sgn(<\phi(\boldsymbol x)-\boldsymbol c,\boldsymbol w>) \Leftrightarrow y=sgn(f(\boldsymbol x)),\quad f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)-\boldsymbol w^T\boldsymbol c,\quad \boldsymbol w=\frac{1}{m_+}\sum_{y_i=1}\phi(\boldsymbol x_i)-\frac{1}{m_-}\sum_{y_i=-1}\phi(\boldsymbol x_i)$$
对偶表述:$\boldsymbol w$ 可用 $\{\phi(\boldsymbol x_i)\}$ 的线性组合表示,
$$\boldsymbol w = \sum_{i=1}^N \alpha_i\phi(\boldsymbol x_i)$$
因此,
$$f(\boldsymbol x)=\boldsymbol w^T\phi(\boldsymbol x)+b=\left(\sum_{i=1}^N \alpha_i\phi(\boldsymbol x_i)\right)^T\phi(\boldsymbol x)+b=\sum_{i=1}^N\alpha_i\kappa(\boldsymbol x_i,\boldsymbol x)+b$$
- 利用特征映射 $\phi$ 将样本从原空间映射到特征空间 $H$,其是一个高维Hilbert空间
- 在 $H$ 中,原有的复杂关系可转化为线性关系
- 直接利用核函数来计算特征空间内样本的内积(不需要显式的 $\phi$)
- 假设特征空间内的解可以用对偶形式表述,即训练样本的线性组合
2. PCA and Kernel PCA
2.1 Principal Component Analysis
假设 $\boldsymbol x_1,\boldsymbol x_2,\cdots,\boldsymbol x_N\in\mathbb{R}^d$ 是零均值训练样本,PCA的目的是找到一个 $\mathbb{R}^p\ (p\le d)$ 的子空间,使其保留原数据的最大总方差,即将数据往某些方向投影。
目的是找到最大投影方差的方向。
将样本投影到归一化方向 $\boldsymbol v$ 上:$\boldsymbol v^T\boldsymbol x_1,\cdots,\boldsymbol v^T\boldsymbol x_N$。
投影方差为:
$$\begin{aligned} \sigma^2 &= \frac{1}{N}\sum^N_{i=1}(\boldsymbol v^T\boldsymbol x_i-0)^2 \\ &= \frac{1}{N}\sum^N_{i=1}(\boldsymbol v^T\boldsymbol x_i)(\boldsymbol v^T\boldsymbol x_i) \\ &= \frac{1}{N}\sum^N_{i=1}(\boldsymbol v^T\boldsymbol x_i)(\boldsymbol v^T\boldsymbol x_i)^T \\ &= \boldsymbol v^T\left(\frac{1}{N}\sum^N_{i=1}\boldsymbol x_i\boldsymbol x_i^T\right)\boldsymbol v \\ &= \boldsymbol v^TC\boldsymbol v \end{aligned}$$
其中,
$$C = \frac{1}{N}\sum^N_{i=1}\boldsymbol x_i\boldsymbol x_i^T$$
第一主元向量为
$$\color{red}{\boldsymbol v = \mathop{\arg \max}\limits_{\boldsymbol v\in\mathbb{R}^d,||\boldsymbol v||=1}\boldsymbol v^TC\boldsymbol v}$$
利用拉格朗日方法,
$$\begin{aligned} f(\boldsymbol v,\lambda) &= \boldsymbol v^TC\boldsymbol v-\lambda(\boldsymbol v^T\boldsymbol v-1) \\ \frac{\partial f}{\partial \boldsymbol v} &= 2C\boldsymbol v-2\lambda \boldsymbol v = 0 \Leftrightarrow C\boldsymbol v=\lambda \boldsymbol v \\ \frac{\partial f}{\partial \lambda} &= \boldsymbol v^T\boldsymbol v-1=0\Rightarrow \boldsymbol v^T\boldsymbol v=1 \end{aligned}$$
所以问题等价于寻找最大特征值对应的特征向量:
$$\begin{cases} C\boldsymbol v=\lambda \boldsymbol v \\ ||\boldsymbol v||=1 \end{cases}$$
将上式代入之前的公式可知,最大方差为 $\sigma^2=\lambda$。
分析 $C$:
$$C=\frac{1}{N}\sum_{i=1}^N\boldsymbol x_i\boldsymbol x_i^T=\frac{1}{N} \begin{bmatrix} \boldsymbol x_1, \cdots, \boldsymbol x_N \end{bmatrix} \begin{bmatrix} \boldsymbol x_1^T \\ \vdots \\ \boldsymbol x_N^T \end{bmatrix}$$
假设 $X^T=[\boldsymbol x_1,\cdots,\boldsymbol x_N]$,则
$$C=\frac{1}{N}X^TX$$
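据此,PCA可以写成如下草图(NumPy;假设每一行是一个已零均值化的样本,`pca_first_component` 为自拟函数名):

```python
import numpy as np

def pca_first_component(X):
    """返回第一主元方向 v 与对应的最大方差 sigma^2 = lambda。"""
    N = X.shape[0]
    C = X.T @ X / N                       # C = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)  # 对称矩阵特征分解,特征值升序排列
    return eigvecs[:, -1], eigvals[-1]    # 最大特征值对应的单位特征向量
```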
2.2 Kernel Principal Component Analysis
对于映射到特征空间内的样本 $\{\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)\}$,记 $X^T=[\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)]$(未知),则
$$C=\frac{1}{N}\sum_{i=1}^N\phi(\boldsymbol x_i)\phi(\boldsymbol x_i)^T=\frac{1}{N}X^TX \ (dim=p\times p)$$
在实际使用中,我们已知核函数 $\kappa$,而映射函数 $\phi$ 未知,无法直接求解 $C$。
$$\begin{aligned} K = XX^T &= \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N) \end{bmatrix} \\ &= \begin{bmatrix} \phi(\boldsymbol x_1)^T\phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_1)^T\phi(\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \phi(\boldsymbol x_N)^T\phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N)^T\phi(\boldsymbol x_N) \end{bmatrix} \\ &= \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N) \\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix} \ (dim=N\times N) \end{aligned}$$
利用 $K$ 来求解 $C$ 的最大特征值及其对应的特征向量:设 $K=XX^T$ 的特征向量满足 $(XX^T)\boldsymbol u=\lambda \boldsymbol u$,两边左乘 $X^T$,
$$X^T(XX^T)\boldsymbol u=\lambda X^T\boldsymbol u \Leftrightarrow (X^TX)(X^T\boldsymbol u)=\lambda(X^T\boldsymbol u)$$
即 $X^T\boldsymbol u$ 是 $X^TX\ (=NC)$ 的特征向量,对应特征值为 $\lambda$。
由此可知,$X^TX$ 的特征值和特征向量 $\boldsymbol v$ 可由 $K=XX^T$ 计算,
$$\boldsymbol v=\frac{X^T\boldsymbol u}{||X^T\boldsymbol u||}=\frac{X^T\boldsymbol u}{\sqrt{\boldsymbol u^TXX^T\boldsymbol u}}=\frac{X^T\boldsymbol u}{\sqrt{\boldsymbol u^T(\lambda \boldsymbol u)}}=\frac{1}{\sqrt{\lambda}}X^T\boldsymbol u$$
将测试样本投影到 $\boldsymbol v$ 上,
$$\begin{aligned} \boldsymbol v^T\phi(\boldsymbol x') &= \left(\frac{1}{\sqrt{\lambda}}X^T\boldsymbol u\right)^T\phi(\boldsymbol x') \\ &= \frac{1}{\sqrt \lambda}\boldsymbol u^TX\phi(\boldsymbol x') \\ &= \frac{1}{\sqrt \lambda}\boldsymbol u^T \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix} \phi(\boldsymbol x') \\ &= \frac{1}{\sqrt \lambda}\boldsymbol u^T \begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x') \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x') \end{bmatrix} \end{aligned}$$
综上,
- 求解 $K=XX^T$ 的特征值及特征向量,$K\boldsymbol u_i=\lambda_i\boldsymbol u_i,\ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_N$
- 将样本投影到第 $i$ 个特征向量方向上,
$$\boldsymbol v^T_i\phi(\boldsymbol x')=\frac{1}{\sqrt{\lambda_i}}\boldsymbol u_i^T \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x') \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x') \end{bmatrix}$$
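对应的Kernel PCA投影草图如下(NumPy;假设核矩阵已中心化,中心化方法见下面的Ex;`kpca_project` 为自拟函数名):

```python
import numpy as np

def kpca_project(K, k_test, n_components=2):
    """K: 训练样本的 N x N 核矩阵;k_test: 向量 [kappa(x_1, x'), ..., kappa(x_N, x')]。
    返回 x' 在前 n_components 个主元方向上的投影。"""
    eigvals, eigvecs = np.linalg.eigh(K)            # 特征值升序排列
    idx = np.argsort(eigvals)[::-1][:n_components]  # 取最大的几个特征值
    lam, U = eigvals[idx], eigvecs[:, idx]
    # v_i^T phi(x') = (1 / sqrt(lambda_i)) u_i^T k_test
    return (U.T @ k_test) / np.sqrt(lam)
```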
Ex:假设 $\phi$ 为特征映射,其对应的核函数为 $\kappa$。假设 $\phi(\boldsymbol x_1),\cdots,\phi(\boldsymbol x_N)$ 是非零均值的,即 $\bar\phi=\frac{1}{N}\sum_{i=1}^N\phi(\boldsymbol x_i)\ne0$。令 $\tilde{\phi}=\phi-\bar\phi$,那么 $\tilde{\phi}(\boldsymbol x_i)=\phi(\boldsymbol x_i)-\bar\phi$。寻找一个矩阵 $M$,使得关于 $\tilde{\phi}$ 的核函数矩阵 $\tilde{K}$ 满足 $\tilde{K}=M^TKM$。
$$\begin{aligned} \tilde{\kappa}(\boldsymbol x_i,\boldsymbol x_j) &= <\tilde{\phi}(\boldsymbol x_i), \tilde{\phi}(\boldsymbol x_j)>\\ &= <\phi(\boldsymbol x_i)-\bar\phi,\phi(\boldsymbol x_j)-\bar\phi>\\ &= <\phi(\boldsymbol x_i),\phi(\boldsymbol x_j)>-<\bar\phi,\phi(\boldsymbol x_j)>-<\phi(\boldsymbol x_i),\bar\phi>+<\bar\phi,\bar\phi>\\ &= \kappa(\boldsymbol x_i,\boldsymbol x_j)-\frac{1}{N}\sum_{k=1}^N\kappa(\boldsymbol x_k,\boldsymbol x_j)-\frac{1}{N}\sum_{k=1}^N\kappa(\boldsymbol x_i,\boldsymbol x_k)+\frac{1}{N^2}\sum_{s=1}^N\sum_{t=1}^N\kappa(\boldsymbol x_s,\boldsymbol x_t) \end{aligned}$$
$$K= \begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_N)\\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol x_N, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_N,\boldsymbol x_N) \end{bmatrix},\quad \tilde{K}= \begin{bmatrix} <\tilde{\phi}(\boldsymbol x_1),\tilde{\phi}(\boldsymbol x_1)> & \cdots & <\tilde{\phi}(\boldsymbol x_1),\tilde{\phi}(\boldsymbol x_N)>\\ \vdots & \ddots & \vdots \\ <\tilde{\phi}(\boldsymbol x_N),\tilde{\phi}(\boldsymbol x_1)> & \cdots & <\tilde{\phi}(\boldsymbol x_N),\tilde{\phi}(\boldsymbol x_N)> \end{bmatrix}$$
$$\begin{aligned} \tilde{K}&= \begin{bmatrix} \tilde{\phi}(\boldsymbol x_1)^T \\ \vdots \\ \tilde{\phi}(\boldsymbol x_N)^T \end{bmatrix} \begin{bmatrix} \tilde{\phi}(\boldsymbol x_1) & \cdots & \tilde{\phi}(\boldsymbol x_N) \end{bmatrix}\\ &=\begin{bmatrix} \phi(\boldsymbol x_1)^T-\bar\phi^T \\ \vdots \\ \phi(\boldsymbol x_N)^T-\bar\phi^T \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x_1)-\bar\phi & \cdots & \phi(\boldsymbol x_N)-\bar\phi \end{bmatrix}\\ &=\left(\begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix}-1_{N\times1}\bar\phi^T\right) \left(\begin{bmatrix} \phi(\boldsymbol x_1) & \cdots & \phi(\boldsymbol x_N) \end{bmatrix}-\bar\phi 1_{1\times N}\right) \end{aligned}$$
将 $\bar\phi=\frac{1}{N}X^T1_{N\times1}$ 代入并展开,可得 $\tilde{K}=\left(I-\frac{1}{N}1_{N\times N}\right)K\left(I-\frac{1}{N}1_{N\times N}\right)$,即 $M=I-\frac{1}{N}1_{N\times N}$(对称,故 $M^TKM=MKM$)。
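核矩阵中心化可以直接按 $M=I-\frac{1}{N}1_{N\times N}$ 实现(NumPy草图,`center_kernel` 为自拟函数名):

```python
import numpy as np

def center_kernel(K):
    """K~ = M K M,M = I - (1/N) * ones,对应 phi~ = phi - phi_bar;M 对称。"""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    return M @ K @ M
```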
3. LDA and GDA
3.1 Linear Discriminant Analysis
寻找一个方向向量满足:
- 投影后的各类均值距离最大
- 投影后每一类的样本与均值的距离最小
即增大类均值之间的距离,同时增大每一类样本的聚集程度,目的是降低样本投影之间的重叠部分,增大可分性。
$L$:样本类别数目;$N_i$:第 $i$ 类样本的数目;$N$:全部样本数目;$\boldsymbol x^{(i)}_j$:第 $i$ 类中的第 $j$ 个样本
将所有的样本投影到方向向量 $\boldsymbol v$ 上:$\boldsymbol v^T\boldsymbol x^{(1)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(1)}_{N_1};\ \boldsymbol v^T\boldsymbol x^{(2)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(2)}_{N_2};\ \cdots;\ \boldsymbol v^T\boldsymbol x^{(L)}_1,\cdots,\boldsymbol v^T\boldsymbol x^{(L)}_{N_L}$。
各类的均值为
$$\overline{\boldsymbol m}_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\boldsymbol v^T\boldsymbol x^{(i)}_j=\boldsymbol v^T\left(\frac{1}{N_i}\sum_{j=1}^{N_i}\boldsymbol x^{(i)}_j\right)=\boldsymbol v^T\boldsymbol m_i$$
其中 $\boldsymbol m_i$ 为原空间内第 $i$ 类的均值。
然后计算每一类均值之间距离的权重平方和为
$$\begin{aligned} \sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}(\overline{\boldsymbol m}_i-\overline{\boldsymbol m}_j)^2}} &= \sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}(\overline{\boldsymbol m}_i-\overline{\boldsymbol m}_j)(\overline{\boldsymbol m}_i-\overline{\boldsymbol m}_j)^T}} \\ &= \sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol v^T\boldsymbol m_i-\boldsymbol v^T\boldsymbol m_j)(\boldsymbol v^T\boldsymbol m_i-\boldsymbol v^T\boldsymbol m_j)^T}} \\ &= \sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}\boldsymbol v^T(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T\boldsymbol v}} \\ &= \boldsymbol v^T\left(\sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T}\right)\boldsymbol v \\ &= \boldsymbol v^TS^{LDA}_{b}\boldsymbol v \end{aligned}$$
$$\begin{aligned} S^{LDA}_b &= \sum^{L-1}_{i=1}{\sum^{L}_{j=i+1}{\frac{N_i}{N}\frac{N_j}{N}}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T}\\ &= \frac{1}{2}\sum^{L}_{i=1}{\sum^{L}_{j=1}{\frac{N_i}{N}\frac{N_j}{N}}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T}\\ &= \frac{1}{2}\sum^{L}_{i=1}{\sum^{L}_{j=1}{\frac{N_i}{N}\frac{N_j}{N}}(\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_i\boldsymbol m_j^T-\boldsymbol m_j\boldsymbol m_i^T+\boldsymbol m_j\boldsymbol m_j^T)}\\ &= \frac{1}{2}\left( \sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T\sum_{j=1}^{L}\frac{N_j}{N}- \sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j^T- \sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i^T+ \sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\boldsymbol m_j^T \right)\\ &= \frac{1}{2}\left( \sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_0\boldsymbol m_0^T-\boldsymbol m_0\boldsymbol m_0^T+\sum_{j=1}^{L}\frac{N_j}{N}\boldsymbol m_j\boldsymbol m_j^T \right)\\ &= \sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T-\boldsymbol m_0\boldsymbol m_0^T\\ &= \sum_{i=1}^{L}\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T\quad(与\ E[(x-\bar x)^2]=E[x^2]-\bar x^2\ 相似) \end{aligned}$$
其中,
$$\boldsymbol m_0=\sum_{i=1}^{L}\frac{N_i}{N}\boldsymbol m_i=\sum_{i=1}^{L}\frac{N_i}{N}\sum_{k=1}^{N_i}\frac{1}{N_i}\boldsymbol x_k^{(i)}=\sum_{i=1}^{L}\sum_{k=1}^{N_i}\frac{1}{N}\boldsymbol x_k^{(i)}$$
综上,组间分散矩阵为:
$$S^{LDA}_b=\sum_{i=1}^{L-1}\sum^{L}_{j=i+1}\frac{N_i}{N}\frac{N_j}{N}(\boldsymbol m_i-\boldsymbol m_j)(\boldsymbol m_i-\boldsymbol m_j)^T=\sum_{i=1}^{L}\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T$$
相当于每个集群的形心到整体形心之间的距离乘上质量权重。
类方差和为
$$\begin{aligned} \sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol v^T\boldsymbol x_j^{(i)}-\overline{\boldsymbol m}_i)^2 &= \sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol v^T\boldsymbol x_j^{(i)}-\boldsymbol v^T\boldsymbol m_i)(\boldsymbol v^T\boldsymbol x_j^{(i)}-\boldsymbol v^T\boldsymbol m_i)^T\\ &= \boldsymbol v^T\left(\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol x_j^{(i)}-\boldsymbol m_i)(\boldsymbol x_j^{(i)}-\boldsymbol m_i)^T\right)\boldsymbol v\\ &= \boldsymbol v^TS^{LDA}_w\boldsymbol v \end{aligned}$$
所以,组内分散矩阵为:
$$S^{LDA}_w=\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\boldsymbol x_j^{(i)}-\boldsymbol m_i)(\boldsymbol x_j^{(i)}-\boldsymbol m_i)^T$$
第一主元向量可以由以下计算:
$${\color{red} \boldsymbol v=\mathop{\arg\max}_{\boldsymbol v\in\mathbb{R}^d}\frac{\boldsymbol v^TS^{LDA}_b\boldsymbol v}{\boldsymbol v^TS^{LDA}_w\boldsymbol v}=\mathop{\arg\max}_{\boldsymbol v^TS_w^{LDA}\boldsymbol v=1}\boldsymbol v^TS_b^{LDA}\boldsymbol v}$$
由Lagrangian方法可得,
$$f(\boldsymbol v,\lambda)=\boldsymbol v^TS_b^{LDA}\boldsymbol v-\lambda(\boldsymbol v^TS_w^{LDA}\boldsymbol v-1)$$
$$\begin{aligned} \frac{\partial f}{\partial \boldsymbol v}&=2S_b^{LDA}\boldsymbol v-2\lambda S_w^{LDA}\boldsymbol v=0 \Leftrightarrow {\color{red}(S_w^{LDA})^{-1}S_b^{LDA}\boldsymbol v=\lambda \boldsymbol v}\\ \frac{\partial f}{\partial \lambda}&=\boldsymbol v^TS_w^{LDA}\boldsymbol v-1=0 \Leftrightarrow \boldsymbol v^TS_w^{LDA}\boldsymbol v=1 \end{aligned}$$
当满足以上条件时,$\boldsymbol v^TS^{LDA}_b\boldsymbol v=\lambda \boldsymbol v^TS^{LDA}_w\boldsymbol v=\lambda$。
综上,求解第一主元等价于求解下列最大广义特征值问题,
$$S^{LDA}_b\boldsymbol u=\lambda S^{LDA}_w\boldsymbol u,\quad \boldsymbol v=\frac{1}{\sqrt{\boldsymbol u^TS^{LDA}_w\boldsymbol u}}\boldsymbol u$$
其中后一项保证 $\boldsymbol v^TS_w^{LDA}\boldsymbol v=1$。
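第一投影方向可以用广义特征值分解直接求解,下面是一个草图(NumPy/SciPy;假设 $S_w^{LDA}$ 正定可逆,`lda_first_direction` 为自拟函数名):

```python
import numpy as np
from scipy.linalg import eigh

def lda_first_direction(X, y):
    """构造 S_b、S_w,求解 S_b u = lambda S_w u 的最大广义特征值方向。"""
    N, d = X.shape
    m0 = X.mean(axis=0)
    S_b, S_w = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xi = X[y == c]
        mi = Xi.mean(axis=0)
        S_b += (len(Xi) / N) * np.outer(mi - m0, mi - m0)
        D = Xi - mi
        S_w += D.T @ D / N
    lam, U = eigh(S_b, S_w)            # 广义特征值问题,特征值升序
    u = U[:, -1]                       # 最大特征值对应的方向
    return u / np.sqrt(u @ S_w @ u)    # 归一化使 v^T S_w v = 1
```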
3.2 Generalized Discriminant Analysis
$L$:样本类别数目;
$N_i$:第 $i$ 类样本的数目;
$N$:全部样本数目;
$\phi(\boldsymbol x^{(i)}_j)$:第 $i$ 类中的第 $j$ 个样本在特征空间内的像;
$X^T_i=[\phi(\boldsymbol x^{(i)}_1),\cdots,\phi(\boldsymbol x^{(i)}_{N_i})]$;
$X^T=[X^T_1,\cdots,X^T_L]$。
假设在空间 $H$ 内样本均值为零:$\boldsymbol m_0=0$。
则组间分散矩阵为:
$$S^{GDA}_b=\sum_{i=1}^L\frac{N_i}{N}(\boldsymbol m_i-\boldsymbol m_0)(\boldsymbol m_i-\boldsymbol m_0)^T=\sum_{i=1}^L\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T$$
组内分散矩阵为:
$$S^{GDA}_w=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T$$
$$\boldsymbol m_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\phi(\boldsymbol x^{(i)}_j)=\frac{1}{N_i}[\phi(\boldsymbol x^{(i)}_1),\cdots,\phi(\boldsymbol x^{(i)}_{N_i})]\begin{bmatrix}1\\\vdots\\1\end{bmatrix}=\frac{1}{N_i}X^T_i1_{N_i\times1}$$
$$\boldsymbol m_i\boldsymbol m_i^T=\frac{1}{N_i^2}X^T_i1_{N_i\times1}1_{1\times N_i}X_i=\frac{1}{N_i}X^T_iB_iX_i$$
其中,$B_i=\frac{1}{N_i}1_{N_i\times N_i}$。组间分散矩阵为:
$${\color{red}S^{GDA}_b}=\sum_{i=1}^L\frac{N_i}{N}\boldsymbol m_i\boldsymbol m_i^T=\frac{1}{N}\sum_{i=1}^LX^T_iB_iX_i=\frac{1}{N} \begin{bmatrix} X^T_1 & \cdots & X^T_L \end{bmatrix} \begin{bmatrix} B_1 & & 0\\ & \ddots &\\ 0 & & B_L \end{bmatrix} \begin{bmatrix} X_1 \\ \vdots \\ X_L \end{bmatrix} =\frac{1}{N}X^TBX$$
组内分散矩阵为:
$$\begin{aligned} {\color{red}S^{GDA}_w}&=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T\\ &=\frac{1}{N} \sum_{i=1}^L \begin{bmatrix} \phi(\boldsymbol x^{(i)}_1) & \cdots & \phi(\boldsymbol x^{(i)}_{N_i}) \end{bmatrix} \begin{bmatrix} \phi(\boldsymbol x^{(i)}_1)^T \\ \vdots \\ \phi(\boldsymbol x^{(i)}_{N_i})^T \end{bmatrix}\\ &=\frac{1}{N}\sum_{i=1}^LX^T_iX_i\\ &=\frac{1}{N} \begin{bmatrix} X_1^T & \cdots & X^T_L \end{bmatrix} \begin{bmatrix} X_1 \\ \vdots \\X_L \end{bmatrix}\\ &=\frac{1}{N}X^TX \end{aligned}$$
同理,
$$S^{GDA}_b\boldsymbol v=\lambda S^{GDA}_w \boldsymbol v,\quad i.e.\ \left(\frac{1}{N}X^TBX\right)\boldsymbol v=\lambda \left(\frac{1}{N}X^TX\right)\boldsymbol v\ (X未知)$$
假设 $\boldsymbol v$ 可以由样本的线性组合表示,即
$$\boldsymbol v=\sum_{i=1}^L\sum_{j=1}^{N_i}\alpha_j^{(i)}\phi(\boldsymbol x^{(i)}_j)=X^T\boldsymbol \alpha$$
将假设代入上式,
$$\begin{aligned} &\Rightarrow X^TBXX^T\boldsymbol \alpha=\lambda X^TXX^T\boldsymbol \alpha\\ &\Rightarrow XX^TBXX^T\boldsymbol \alpha=\lambda XX^TXX^T\boldsymbol \alpha\\ &\Rightarrow (KBK)\boldsymbol \alpha=\lambda(KK)\boldsymbol \alpha \end{aligned}$$
计算上式可获得 $\boldsymbol \alpha$,将测试样本投影到 $\boldsymbol v=X^T\boldsymbol \alpha$ 上,
$$\boldsymbol v^T\phi(\boldsymbol x)=(X^T\boldsymbol \alpha)^T\phi(\boldsymbol x)=\boldsymbol \alpha^T \begin{bmatrix} \phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T \end{bmatrix}\phi(\boldsymbol x) =\boldsymbol \alpha^T \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x) \\ \vdots \\ \kappa(\boldsymbol x_N,\boldsymbol x) \end{bmatrix}$$
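GDA的求解可以写成如下草图(NumPy/SciPy;为避免 $KK$ 奇异加入了小正则项,这是实现上的假设,并非推导的一部分;`gda_directions` 为自拟函数名):

```python
import numpy as np
from scipy.linalg import eigh

def gda_directions(K, y, n_components=1, reg=1e-8):
    """求解 (K B K) alpha = lambda (K K) alpha,返回前 n_components 个 alpha。"""
    N = K.shape[0]
    B = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        B[np.ix_(idx, idx)] = 1.0 / len(idx)   # 对角块 B_i = (1/N_i) * ones
    A = K @ B @ K
    M = K @ K + reg * np.eye(N)                # 正则化保证右端正定(实现假设)
    lam, V = eigh(A, M)                        # 广义特征值问题,特征值升序
    return V[:, ::-1][:, :n_components]        # 最大特征值对应的 alpha 在前
```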
Ex: 在GDA中,组内分散矩阵为:
$$S^{GDA}_w=\sum_{i=1}^L\sum_{j=1}^{N_i}\frac{1}{N}\phi(\boldsymbol x^{(i)}_j)\phi(\boldsymbol x^{(i)}_j)^T$$
而在LDA中,
$$S^{LDA}_w=\sum_{i=1}^{L}\sum_{j=1}^{N_i}\frac{1}{N}(\phi(\boldsymbol x_j^{(i)})-\boldsymbol m_i)(\phi(\boldsymbol x_j^{(i)})-\boldsymbol m_i)^T,\quad \boldsymbol m_i=\frac{1}{N_i}\sum_{j=1}^{N_i}\phi(\boldsymbol x^{(i)}_j)$$
能否由LDA推导GDA的组内分散矩阵,并且找到 $W$,使得 $S^{LDA}_w=X^TWX$?
4. Hard-Margin Support Vector Machine
$$\min f(\boldsymbol x)\quad s.t.\ h(\boldsymbol x)=0$$
假设该问题有一个局部最优解 $\boldsymbol x^*$ 与对应的拉格朗日乘子 $\boldsymbol \alpha^*$,且拉格朗日函数的Hessian矩阵在 $\boldsymbol x^*$ 处正定,则该问题的对偶问题为:
$$\max W(\boldsymbol \alpha)=\max L(\boldsymbol x(\boldsymbol \alpha),\boldsymbol \alpha)$$
其中,$W(\boldsymbol \alpha)$ 由 $L(\boldsymbol x,\boldsymbol \alpha)$ 对 $\boldsymbol x$ 的导数为0解出 $\boldsymbol x(\boldsymbol \alpha)$ 后代入得到。
Ex: $\min x^2_1+x_2^2\quad s.t.\ x_1+x_2=1$
Lagrangian 方程为
$$L(x_1,x_2,\alpha)=x_1^2+x_2^2+\alpha(x_1+x_2-1)\\ \frac{\partial L}{\partial x_1}=2x_1+\alpha=0 \Rightarrow x_1=-\frac{\alpha}{2}\\ \frac{\partial L}{\partial x_2}=2x_2+\alpha=0 \Rightarrow x_2=-\frac{\alpha}{2}$$
因此,对偶问题为:
$$W(\alpha)=\min_{x_1,x_2}L(x_1,x_2,\alpha)=-\frac{\alpha^2}{2}-\alpha$$
求解对偶问题 $\max_\alpha W(\alpha)$,
$$\frac{\partial W}{\partial \alpha}=-\alpha-1=0 \Rightarrow \alpha^*=-1$$
原问题的解为 $x_1^*=x_2^*=\frac{1}{2}$。
Karush-Kuhn-Tucker Theorem (KKT):
$$\min f(\boldsymbol x)\quad s.t.\ \boldsymbol{h}(\boldsymbol x)=0,\ \boldsymbol{g}(\boldsymbol x)\le0$$
假设 $\boldsymbol{x}^*$ 是最优点,对应的拉格朗日乘子为 $\boldsymbol{\lambda}^*$ 和 $\boldsymbol{\alpha}^*\ge0$,有
$$\begin{aligned} \boldsymbol{\alpha}^* &\ge0\\ \nabla{f(\boldsymbol x^*)}+{\boldsymbol{\lambda^*}}^T\nabla\boldsymbol{h}(\boldsymbol x^*)+{\boldsymbol{\alpha^*}}^T\nabla\boldsymbol{g}(\boldsymbol x^*) &= \boldsymbol{0}\\ {\boldsymbol{\alpha^*}}^T\boldsymbol{g}(\boldsymbol x^*)&=0 \end{aligned}$$
其中最后一项(互补松弛)要求对每个 $i$,$\alpha_i$ 与 $g_i(\boldsymbol x)$ 至少有一个为0。
对于二分类线性可分问题:
$$y_i=\begin{cases} +1 & \boldsymbol x_i\in class\ 1\\ -1 & \boldsymbol x_i\in class\ 2 \end{cases} \quad (\boldsymbol x_i,y_i),\ i=1,\cdots,l,\ \boldsymbol x_i\in \mathbb{R}^n$$
$$\begin{cases} \boldsymbol w^T\boldsymbol x_i+b\ge1 & y_i=+1\\ \boldsymbol w^T\boldsymbol x_i+b\le-1 & y_i=-1 \end{cases} \Rightarrow y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1,\ \forall i=1,\cdots,l$$
$$margin=(d_++d_-)=\frac{2}{||\boldsymbol w||}=\frac{2}{\sqrt{\boldsymbol w^T\boldsymbol w}}$$
优化问题为:
$$\color{red}{Minimize\ \Phi(\boldsymbol w)=\frac{1}{2}\boldsymbol w^T\boldsymbol w\\ Subject\ to\ y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1,\ \forall i=1,\cdots,l}$$
利用Lagrangian函数计算
$$Minimize\ L(\boldsymbol w,b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\boldsymbol x_i+b)-1]\\ Subject\ to\ \alpha_i\ge0\ \forall i=1,\cdots,l$$
$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_iy_i\boldsymbol x_i\\ \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0$$
假设 $\phi$ 为关于核函数 $\kappa(\boldsymbol x,\boldsymbol z)$ 的特征映射,则上述问题可转化为
$$Minimize\ L(\boldsymbol w,b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1]\\ Subject\ to\ \alpha_i\ge0\ \forall i=1,\cdots,l$$
$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)\\ \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0$$
将偏导数代入拉格朗日函数中得到,
$$\begin{aligned} \boldsymbol w^T\boldsymbol w&=\left({\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)}\right)^T\left({\sum_{j=1}^l\alpha_jy_j\phi(\boldsymbol x_j)}\right)\\ &=\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\phi(\boldsymbol x_i)^T\phi(\boldsymbol x_j)\\ &=\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j) \end{aligned}\\ \boldsymbol w^T\phi(\boldsymbol x_i)=\left({\sum_{j=1}^l\alpha_jy_j\phi(\boldsymbol x_j)}\right)^T\phi(\boldsymbol x_i)=\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_j,\boldsymbol x_i)$$
$$\begin{aligned} W(\boldsymbol \alpha)&=L(\boldsymbol w(\boldsymbol \alpha),b,\boldsymbol \alpha)=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1]\\ &=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)-b\sum_{i=1}^l\alpha_iy_i+\sum_{i=1}^l\alpha_i\\ &=\frac{1}{2}\boldsymbol w^T\boldsymbol w-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)+\sum_{i=1}^l\alpha_i\\ &=\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)-\sum_{i=1}^l\alpha_iy_i\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_j,\boldsymbol x_i)+\sum_{i=1}^l\alpha_i\\ &=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j) \end{aligned}$$
综上,问题等价于
$$\max\ W(\boldsymbol \alpha)=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\\ \begin{aligned} s.t.\ &\sum_{i=1}^l\alpha_iy_i=0\\ &\alpha_i\ge0\ \forall i=1,\cdots,l \end{aligned}$$
$W(\boldsymbol \alpha)$ 是二次型
$$\begin{aligned} W(\boldsymbol \alpha)&=\sum_{i=1}^l\alpha_i-\frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\\ &=1_{1\times l}\boldsymbol \alpha-\frac{1}{2}\boldsymbol \alpha^T\begin{bmatrix} y_1y_1\kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & y_1y_l\kappa(\boldsymbol x_1,\boldsymbol x_l)\\ \vdots & \ddots & \vdots\\ y_ly_1\kappa(\boldsymbol x_l,\boldsymbol x_1) & \cdots & y_ly_l\kappa(\boldsymbol x_l,\boldsymbol x_l) \end{bmatrix}\boldsymbol \alpha \end{aligned}$$
其中二次项满足
$$\sum_{i=1}^l\sum_{j=1}^l\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)=\begin{bmatrix} \alpha_1y_1 & \cdots & \alpha_ly_l \end{bmatrix} \begin{bmatrix} \kappa(\boldsymbol x_1,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1,\boldsymbol x_l)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_l,\boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_l,\boldsymbol x_l) \end{bmatrix} \begin{bmatrix} \alpha_1y_1 \\ \vdots \\ \alpha_ly_l \end{bmatrix}\ge0$$
因此 $W(\boldsymbol \alpha)$ 的二次项是负半定的,$W$ 是凹函数,存在全局最大值。求解得 $\boldsymbol \alpha^*$,进而 $\boldsymbol w^*=\sum\alpha^*_iy_i\phi(\boldsymbol x_i)$。
KKT条件表明只有当不等式约束取等号时,拉格朗日乘子才能取非0值。假设 $\boldsymbol \alpha^*,\boldsymbol w^*,b^*$ 是最优解,则有
$$\alpha^*_i(y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1)=0,\ \forall i=1,\cdots,l$$
所以该条件分为两种情况,
- $y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)>1\Rightarrow \alpha^*_i=0$。
- $\alpha^*_i\ne0\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)=1$。这意味着 $\phi(\boldsymbol x_i)$ 位于边界上,该点即为支持向量(起决定性作用)。
所以,大多数 $\alpha_i^*=0$。对于支持向量(Support Vectors, SV):
- 非0的 $\alpha^*_i\Rightarrow \phi(\boldsymbol x_i)$ 在边界上
- 决策边界只取决于支持向量,
$$\boldsymbol w^*=\sum_{i=1}^l\alpha^*_iy_i\phi(\boldsymbol x_i)=\sum_{\alpha^*_i\ne0}\alpha^*_iy_i\phi(\boldsymbol x_i)$$
流程:
- 计算获得 $\alpha^*_t>0$
- $\boldsymbol w^*=\sum_{\alpha^*_t}\alpha^*_ty_t\phi(\boldsymbol x_t)$
- 由KKT条件,$y_t({\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*)-1=0\Rightarrow {\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=\pm1\Rightarrow {\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=y_t$,因此由任意一个支持向量可计算 $b^*$:
$$b^*=y_t-{\boldsymbol w^*}^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha^*_iy_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha^*_iy_i\kappa(\boldsymbol x_i,\boldsymbol x_t)$$
- 对于新的样本点 $\phi(\boldsymbol x')$,
$$\boldsymbol w^T\phi(\boldsymbol x')+b=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x')+b=\sum_{i=1}^l\alpha_iy_i\kappa(\boldsymbol x_i,\boldsymbol x')+b$$
- 结果为正,$\boldsymbol x'$ 位于第一类,否则位于第二类,
$$y'=sign\left(\sum_{i=1}^l\alpha_iy_i\kappa(\boldsymbol x_i,\boldsymbol x')+b\right)$$
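上述对偶问题是一个标准的二次规划,下面是一个求解草图(假设安装了 cvxopt;`svm_dual_hard` 为自拟函数名,K 为核矩阵):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard(K, y):
    """max sum(a) - 1/2 a^T (yy^T * K) a, s.t. a >= 0, y^T a = 0。
    cvxopt 求解 min 1/2 a^T P a + q^T a,故取 P = yy^T * K,q = -1。"""
    l = len(y)
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(l))
    G = matrix(-np.eye(l))                 # -a <= 0 即 a >= 0
    h = matrix(np.zeros(l))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol["x"])
    t = int(np.argmax(alpha))              # 取一个支持向量计算 b*
    b_star = y[t] - np.sum(alpha * y * K[:, t])
    return alpha, b_star
```

新样本的预测即 $y'=sign(\sum_i\alpha_iy_i\kappa(\boldsymbol x_i,\boldsymbol x')+b^*)$。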
5. Soft-Margin Support Vector Machine
$$Minimize\ \Phi(\boldsymbol w,\boldsymbol \xi)=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i,\ C>0\\ Subject\ to\ \xi_i\ge0,\ y_i(\boldsymbol w^T\boldsymbol x_i+b)\ge1-\xi_i,\ \forall i=1,\cdots,l$$
其中 $C$ 控制边界区域大小,当 $C$ 增大时,$\xi_i$ 减小,导致边界区域减小。
拉格朗日函数为
$$\color{red} L(\boldsymbol w,b,\boldsymbol \xi,\boldsymbol \alpha,\boldsymbol \beta)=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\boldsymbol x_i+b)-1+\xi_i]-\sum_{i=1}^l\beta_i\xi_i\\ \color{red}s.t.\ \alpha_i,\beta_i\ge0,\ \forall i=1,\cdots,l$$
$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w=\sum_{i=1}^l\alpha_iy_i\boldsymbol x_i\\ \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=0\\ \frac{\partial L}{\partial \xi_i}=0\Rightarrow C-\alpha_i-\beta_i=0$$
同理将 $\boldsymbol x_i$ 替换为 $\phi(\boldsymbol x_i)$,
$$\frac{\partial L}{\partial \boldsymbol w}=0\Rightarrow \boldsymbol w=\sum_{i=1}^l\alpha_iy_i\phi(\boldsymbol x_i)={\color{red}X^TY\boldsymbol \alpha},\ Y=diag(y_1,\cdots,y_l)\\ \frac{\partial L}{\partial b}=0\Rightarrow \sum_{i=1}^l\alpha_iy_i=\boldsymbol y^T\boldsymbol \alpha=0\\ \frac{\partial L}{\partial \xi_i}=0\Rightarrow C-\alpha_i-\beta_i=0$$
将以上偏导数代入拉格朗日函数中可得,
$$\boldsymbol w^T\boldsymbol w=(X^TY\boldsymbol \alpha)^T(X^TY\boldsymbol \alpha)=\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha\ \left(=\sum_{i=1}^l\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\right)\\ \boldsymbol w^T\phi(\boldsymbol x_i)=(X^TY\boldsymbol \alpha)^T\phi(\boldsymbol x_i)=\boldsymbol \alpha^TYX\phi(\boldsymbol x_i)\ \left(=\sum_{j=1}^l\alpha_jy_j\kappa(\boldsymbol x_i,\boldsymbol x_j)\right)$$
$$\begin{aligned} W(\boldsymbol \alpha)&=\frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_i[y_i(\boldsymbol w^T\phi(\boldsymbol x_i)+b)-1+\xi_i]-\sum_{i=1}^l\beta_i\xi_i\\ &= \frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)-\sum_{i=1}^l\alpha_iy_ib+\sum_{i=1}^l\alpha_i-\sum_{i=1}^l\alpha_i\xi_i-\sum_{i=1}^l\beta_i\xi_i\\ &= \frac{1}{2}\boldsymbol w^T\boldsymbol w+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_iy_i\boldsymbol w^T\phi(\boldsymbol x_i)+\sum_{i=1}^l\alpha_i-\sum_{i=1}^l(\alpha_i+\beta_i)\xi_i\\ &= \frac{1}{2}\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha+C\sum_{i=1}^l\xi_i-\sum_{i=1}^l\alpha_iy_i(\boldsymbol \alpha^TYX\phi(\boldsymbol x_i))+\sum_{i=1}^l\alpha_i-\sum_{i=1}^lC\xi_i\\ &= \sum_{i=1}^l\alpha_i+\frac{1}{2}\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha-\boldsymbol \alpha^TYXX^TY\boldsymbol \alpha\\ &= \boldsymbol \alpha^T1_{l\times 1}-\frac{1}{2}\boldsymbol \alpha^TYKY\boldsymbol \alpha \end{aligned}$$
原问题转化为
$$\begin{aligned} Maximize\ &\color{red}W(\boldsymbol \alpha)=\boldsymbol \alpha^T1_{l\times 1}-\frac{1}{2}\boldsymbol \alpha^TYKY\boldsymbol \alpha\\ Subject\ to\ &\color{red}\boldsymbol \alpha^T\boldsymbol y=0\\ &\color{red}0\le\alpha_i\le C\ \forall i=1,\cdots,l \end{aligned}$$
假设最优解为 $\boldsymbol \alpha^*,\boldsymbol w^*,b^*,\boldsymbol \xi^*$,由KKT条件可得
$$\begin{cases} \alpha^*_i(y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i)=0\\ \beta^*_i\xi^*_i=0 \end{cases}\ i=1,\cdots,l,\quad \alpha^*_i+\beta^*_i=C$$
- $\alpha^*_i=0\Rightarrow \beta^*_i=C \Rightarrow \xi^*_i=0 \Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)\ge1$
- $0<\alpha^*_i<C\Rightarrow \begin{cases} \beta^*_i = C-\alpha^*_i>0\Rightarrow \xi^*_i=0 \\ y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i=0 \end{cases}\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)=1$
- $\alpha^*_i=C\Rightarrow \begin{cases} \beta^*_i=C-\alpha^*_i=0\Rightarrow \xi^*_i\ge0\\ y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)-1+\xi^*_i=0 \end{cases}\Rightarrow y_i({\boldsymbol w^*}^T\phi(\boldsymbol x_i)+b^*)\le1$
此时,$\alpha_i\ne0$ 的点可能在边界上或两个边界之间,被称为支持向量。
- 在两个边界之间:$\xi^*_i>0,\ \alpha_i^*=C$
- 在边界上:$\xi_i^*=0,\ 0<\alpha_i^*<C$
- 在边界内:$\xi_i^*=0,\ \alpha^*_i=0$(非支持向量)
流程:
- 计算 $\alpha^*_t$,取满足 $0<\alpha^*_t<C$ 的支持向量计算 $b^*$
- 由KKT条件 $\alpha^*_t(y_t({\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*)-1)=0$ 知,${\boldsymbol w^*}^T\phi(\boldsymbol x_t)+b^*=y_t$,因此
$$b^*=y_t-{\boldsymbol w^*}^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha_i^*y_i\phi(\boldsymbol x_i)^T\phi(\boldsymbol x_t)=y_t-\sum_{i=1}^l\alpha_i^*y_i\kappa(\boldsymbol x_i,\boldsymbol x_t)$$
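软间隔的对偶问题与硬间隔只差盒约束 $0\le\alpha_i\le C$,在前面的QP草图中只需替换不等式约束部分(同样假设使用 cvxopt,`box_constraints` 为自拟函数名):

```python
import numpy as np
from cvxopt import matrix

def box_constraints(l, C):
    """构造 0 <= alpha_i <= C 的 G, h:[-I; I] alpha <= [0; C*1]。"""
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    return G, h
```

计算 $b^*$ 时应只从满足 $0<\alpha_t^*<C$ 的支持向量中选取。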
多分类方法:
- One-Against-One:两两分类,投票决定类别
- One-Against-All:每一类与剩下所有类进行二分类,计算与决策边界的距离,取距离大的决定类别:$k^*=\mathop{\arg\max}_k(\boldsymbol w^T_k\phi(\boldsymbol x)+b_k)$
6. An Automatic Method for Selecting the Parameter of the RBF Kernel Function to SVM
RBF Kernel:
$$\kappa(\boldsymbol x, \boldsymbol z,\sigma)=\exp\left(-\frac{\|\boldsymbol x-\boldsymbol z\|^2}{2\sigma^2}\right)$$
$\kappa(\boldsymbol x, \boldsymbol x)=1$,即在RBF Kernel决定的特征空间内,样本的范数恒为1,因此,样本会被映射到单位球面上。
因此可以用余弦距离衡量两个样本之间的相似度:
$$\cos\theta=\frac{<\phi(\boldsymbol x),\phi(\boldsymbol z)>}{\|\phi(\boldsymbol x)\|\cdot\|\phi(\boldsymbol z)\|}=<\phi(\boldsymbol x),\phi(\boldsymbol z)>=\kappa(\boldsymbol x, \boldsymbol z)$$
如果 $\cos\theta$ 趋于1,则两样本在特征空间内相似;如果趋于0,则不相似。
在分类问题中,目标是找到合适的 $\sigma$ 使得
- 对于同一类的样本,RBF核函数的值应当趋于1(聚集):
$$\kappa(\boldsymbol x^{(i)}_l, \boldsymbol x^{(i)}_k, \sigma)\approx1,\quad w(\sigma)=\frac{1}{\sum_{i=1}^LN_i^2}\sum_{i=1}^L\sum_{l=1}^{N_i}\sum_{k=1}^{N_i}\exp\left(-\frac{\|\boldsymbol x^{(i)}_l-\boldsymbol x^{(i)}_k\|^2}{2\sigma^2}\right)\approx 1$$
- 对于不同类的样本,RBF核函数的值应当趋于0(分散):
$$\kappa(\boldsymbol x^{(i)}_l, \boldsymbol x^{(j)}_k, \sigma)\approx0\ (i\ne j),\quad b(\sigma)=\frac{1}{\sum_{i=1}^L\sum_{j=1,j\ne i}^LN_iN_j}\sum_{i=1}^L\sum_{j=1,j\ne i}^L\sum_{l=1}^{N_i}\sum_{k=1}^{N_j}\exp\left(-\frac{\|\boldsymbol x^{(i)}_l-\boldsymbol x_k^{(j)}\|^2}{2\sigma^2}\right)\approx 0$$
As \sigma\to 0, even samples of the same class become mutually orthogonal in feature space; as \sigma\to\infty, all samples collapse towards a single point. In both extremes the classes are inseparable.
We therefore look for a \sigma such that
w(\sigma)\to1^-,\qquad b(\sigma)\to0^+
The selection criterion can thus be defined as
J(\sigma)=(1-w(\sigma))+(b(\sigma)-0)=1-w(\sigma)+b(\sigma)
A gradient-based method can then be used to find the \sigma that minimises J(\sigma); a simple grid search over \sigma works as well (see the sketch below).
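A minimal NumPy sketch of this selection rule (all function names are illustrative): compute w(σ) and b(σ) from the pairwise kernel matrix, then minimise J(σ) over a grid as a simple stand-in for the gradient-based search:

```python
import numpy as np

def within_between(X, labels, sigma):
    """w(sigma): mean RBF kernel value over same-class pairs (incl. l = k,
    as in the sums above); b(sigma): mean over different-class pairs."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    same = labels[:, None] == labels[None, :]
    return K[same].mean(), K[~same].mean()

def select_sigma(X, labels, grid):
    """Pick the sigma on the grid minimising J(sigma) = 1 - w + b."""
    scores = []
    for s in grid:
        w, b = within_between(X, labels, s)
        scores.append(1 - w + b)
    return grid[int(np.argmin(scores))]

# usage (synthetic data):
# X = np.random.randn(100, 2); labels = (X[:, 0] > 0).astype(int)
# sigma = select_sigma(X, labels, np.logspace(-2, 2, 50))
```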
7. Linear Regression Model and Kernel-based Linear Regression Model
7.1 Linear Regression Model
Assumption: the random variables \varepsilon_n are i.i.d. Gaussian with mean 0 and variance \sigma^2, and the response is y_n=f(\boldsymbol{x_n})+\varepsilon_n.
E(y_n|\boldsymbol x_n)=f(\boldsymbol x_n)+E(\varepsilon_n)=f(\boldsymbol x_n)
For every n, assume the conditional expectation is linear in the features:
\begin{aligned} E(y_n|\boldsymbol x_n)&=w_0+w_1x_{n1}+\cdots+w_dx_{nd}\\ &= \begin{bmatrix}1 & x_{n1} & \cdots & x_{nd}\end{bmatrix} \begin{bmatrix}w_0 \\ \vdots \\ w_d\end{bmatrix}\\ &=[1, \boldsymbol x^T_n]\boldsymbol w= \bar{\boldsymbol x}^T_n \boldsymbol w \end{aligned}
where \boldsymbol x_n=[x_{n1},\cdots,x_{nd}]^T,\ \bar{\boldsymbol x}_n=[1,\boldsymbol x^T_n]^T,\ \boldsymbol w=[w_0,\cdots,w_d]^T, and w_0 is the intercept.
y_n=f(\boldsymbol x_n)+\varepsilon_n=E(y_n|\boldsymbol x_n)+\varepsilon_n=\bar{\boldsymbol x}^T_n \boldsymbol w+\varepsilon_n
Given N samples (\boldsymbol x_1, y_1),\cdots,(\boldsymbol x_N, y_N), write
X=\begin{bmatrix}\boldsymbol x^T_1 \\ \vdots \\ \boldsymbol x^T_N\end{bmatrix},\ \bar X=\begin{bmatrix}\bar{\boldsymbol x}^T_1 \\ \vdots \\ \bar{\boldsymbol x}^T_N\end{bmatrix}=\begin{bmatrix}1 & \boldsymbol x^T_1 \\ \vdots & \vdots \\ 1 & \boldsymbol x^T_N\end{bmatrix},\ \boldsymbol y=\begin{bmatrix}y_1 \\ \vdots \\ y_N\end{bmatrix}
The linear regression model describes the relation between X and \boldsymbol y:
\begin{cases} y_1\approx \bar{\boldsymbol x}^T_1 \boldsymbol w + \varepsilon_1\\ \vdots\\ y_N \approx \bar{\boldsymbol x}^T_N \boldsymbol w + \varepsilon_N \end{cases}\Rightarrow \begin{bmatrix}y_1 \\ \vdots \\ y_N\end{bmatrix}\approx\begin{bmatrix}\bar{\boldsymbol x}^T_1 \\ \vdots \\ \bar{\boldsymbol x}^T_N\end{bmatrix}\boldsymbol w+\begin{bmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_N\end{bmatrix}\Rightarrow \boldsymbol y \approx \bar X \boldsymbol w + \boldsymbol \varepsilon
The optimal \boldsymbol w is estimated by minimising the residual sum of squares (RSS):
\hat{\boldsymbol w}=\mathop{\arg \min}_{\boldsymbol w}||\boldsymbol y-\bar X\boldsymbol w||^2
||\boldsymbol y-\bar X\boldsymbol w||^2=(\boldsymbol y-\bar X\boldsymbol w)^T(\boldsymbol y-\bar X\boldsymbol w)=\boldsymbol y^T \boldsymbol y-2\boldsymbol w^T \bar X^T \boldsymbol y + \boldsymbol w^T \bar X^T \bar X \boldsymbol w
\frac{\partial}{\partial \boldsymbol w}||\boldsymbol y-\bar X\boldsymbol w||^2=-2\bar X^T \boldsymbol y + 2 \bar X^T \bar X \boldsymbol w=0
The normal equation is
\bar X^T \bar X \boldsymbol w=\bar X^T \boldsymbol y\\ \begin{aligned} &\Rightarrow \hat{\boldsymbol w}=(\bar X^T \bar X)^{-1}\bar X^T \boldsymbol y\\ &\Rightarrow \hat{\boldsymbol y}=\bar X \hat{\boldsymbol w}=\bar X (\bar X^T \bar X)^{-1}\bar X^T \boldsymbol y \end{aligned}
For an unseen sample \boldsymbol x, the prediction is f(\boldsymbol x)=[1,\ \boldsymbol x^T]\hat{\boldsymbol w}.
\begin{aligned} \boldsymbol y \approx \bar X \boldsymbol w &= \begin{bmatrix}1 & x_{11} & \cdots & x_{1d}\\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd}\end{bmatrix}\begin{bmatrix}w_0 \\ w_1 \\ \vdots \\ w_d\end{bmatrix}\\ &=w_0\begin{bmatrix}1 \\ \vdots \\ 1\end{bmatrix}+w_1\begin{bmatrix}x_{11} \\ \vdots \\ x_{N1}\end{bmatrix}+\cdots+w_d\begin{bmatrix}x_{1d} \\ \vdots \\ x_{Nd}\end{bmatrix} \end{aligned}
Thus the fitted values are a linear combination of the columns of \bar X. The residual is the difference between the two vectors, and the approximation is optimal when the residual is orthogonal to the linear space spanned by the columns of \bar X.
- Since the residual \boldsymbol e=\boldsymbol y-\hat{\boldsymbol y} is orthogonal to the all-ones column of \bar X:
0=[1,\cdots,1]\boldsymbol e=[1,\cdots,1](\boldsymbol y-\hat{\boldsymbol y})=\sum_{n=1}^N y_n -\sum_{n=1}^N \hat{y}_n \Rightarrow \sum_{n=1}^N y_n =\sum_{n=1}^N \hat{y}_n \Rightarrow \color{red}\bar y=\bar{\hat y}
- Decomposing the total variation (the cross term vanishes because \hat{\boldsymbol y}-\bar y \boldsymbol 1_{N\times 1} lies in the column space of \bar X, to which \boldsymbol e is orthogonal):
\begin{aligned} ||\boldsymbol y - \bar y \boldsymbol 1_{N\times 1}||^2 &= ||\hat{\boldsymbol y}-\bar y \boldsymbol 1_{N\times 1}+\boldsymbol y - \hat{\boldsymbol y}||^2\\ &= ||\hat{\boldsymbol y}-\bar{y}\boldsymbol 1_{N\times 1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2-2(\hat{\boldsymbol y}-\bar y \boldsymbol 1_{N\times 1})^T(\boldsymbol y-\hat{\boldsymbol y})\\ &= ||\hat{\boldsymbol y}-\bar{y}\boldsymbol 1_{N\times 1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2-2(\hat{\boldsymbol y}-\bar y \boldsymbol 1_{N\times 1})^T\boldsymbol e \\ &= ||\hat{\boldsymbol y}-\bar{\hat y}\boldsymbol 1_{N\times 1}||^2+||\boldsymbol y-\hat{\boldsymbol y}||^2 \end{aligned}
The total variation therefore exceeds the variation of the fitted values by exactly the residual sum of squares.
\sum_{n=1}^N(y_n-\bar y)^2=\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2+\sum_{n=1}^N(y_n-\hat y_n)^2\\ TSS=ESS+RSS
- TSS: Total sum of squares
- ESS: Explained sum of squares
- RSS: Residual (unexplained) sum of squares
The ratio of ESS to TSS measures goodness of fit (a sketch follows):
R^2=\frac{ESS}{TSS}=\frac{\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2}{\sum_{n=1}^N(y_n-\bar y)^2}=\frac{\frac{1}{N}\sum_{n=1}^N(\hat y_n-\bar{\hat y})^2}{\frac{1}{N}\sum_{n=1}^N(y_n-\bar y)^2}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}
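A minimal NumPy sketch of the normal-equation fit and the R² computation above (function names are illustrative):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equation:
    Xbar = [1, X]; w_hat = (Xbar^T Xbar)^{-1} Xbar^T y."""
    Xbar = np.hstack([np.ones((X.shape[0], 1)), X])
    w_hat = np.linalg.solve(Xbar.T @ Xbar, Xbar.T @ y)
    return w_hat

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS; for OLS with an intercept, TSS = ESS + RSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# usage:
# X = np.random.randn(200, 3)
# y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * np.random.randn(200)
# w = fit_linear_regression(X, y)
# y_hat = np.hstack([np.ones((len(X), 1)), X]) @ w
# print(r_squared(y, y_hat))
```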
7.2 Kernel-based Linear Regression
X=\begin{bmatrix}\phi(\boldsymbol x_1)^T \\ \vdots \\ \phi(\boldsymbol x_N)^T\end{bmatrix},\ \bar X=\begin{bmatrix}1 & \phi(\boldsymbol x_1)^T \\ \vdots & \vdots \\ 1 & \phi(\boldsymbol x_N)^T\end{bmatrix}=[\boldsymbol 1_{N\times 1}, X],\ \boldsymbol y=\begin{bmatrix}y_1 \\ \vdots \\ y_N\end{bmatrix}
\hat{\boldsymbol w}=(\bar X^T \bar X)^{-1}\bar X^T \boldsymbol y=(\bar X^T \bar X)(\bar X^T \bar X)^{-2}\bar X^T \boldsymbol y=\bar X^T \hat{\boldsymbol \alpha},\quad \text{where}\ \hat{\boldsymbol \alpha}=\bar X(\bar X^T \bar X)^{-2}\bar X^T \boldsymbol y
Substituting into the normal equation:
\begin{aligned} & \bar X^T \bar X \boldsymbol w=\bar X^T \boldsymbol y\\ \Rightarrow & \bar X \bar X^T \bar X \bar X^T \hat{\boldsymbol \alpha}=\bar X\bar X^T \boldsymbol y\\ \Rightarrow & \bar K^2\hat{\boldsymbol \alpha}=\bar K \boldsymbol y,\ \text{where}\ \bar K=\bar X \bar X^T\\ \Rightarrow & \bar K \hat{\boldsymbol \alpha}=\boldsymbol y,\ \text{if}\ \det(\bar K)\ne0\\ \Rightarrow & \hat{\boldsymbol \alpha}=\bar K^{-1}\boldsymbol y \end{aligned}
\bar K=\bar X\bar X^T=[\boldsymbol 1_{N\times 1}, X]\begin{bmatrix}\boldsymbol 1_{1\times N}\\ X^T\end{bmatrix}=\boldsymbol 1_{N\times N}+XX^T=\boldsymbol 1_{N\times N}+K
where \boldsymbol 1_{N\times N} denotes the N\times N all-ones matrix.
Therefore,
\color{red} \begin{aligned} \hat{\boldsymbol \alpha}&=(\boldsymbol 1_{N\times N}+K)^{-1}\boldsymbol y\\ \hat{\boldsymbol y}&=\bar X \hat{\boldsymbol w}=\bar X \bar X^T \hat{\boldsymbol \alpha}=(\boldsymbol 1_{N\times N}+K)\hat{\boldsymbol \alpha} \end{aligned}
For an unseen \boldsymbol x (a minimal sketch follows),
f(\boldsymbol x)=[1,\ \phi(\boldsymbol x)^T]\hat{\boldsymbol w}=[1,\ \phi(\boldsymbol x)^T]\bar X^T\hat{\boldsymbol \alpha}=\sum_{n=1}^N\hat\alpha_n\left(1+\kappa(\boldsymbol x,\boldsymbol x_n)\right)
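A minimal sketch of this fit/predict cycle, assuming a precomputed Gram matrix K and a kernel function (names are illustrative); a small diagonal jitter is added because the derivation above requires K̄ to be invertible:

```python
import numpy as np

def fit_kernel_regression(K, y, jitter=1e-8):
    """alpha_hat = (1_{NxN} + K)^{-1} y, where K is the Gram matrix.
    The jitter guards against a (numerically) singular Kbar."""
    N = K.shape[0]
    Kbar = np.ones((N, N)) + K
    return np.linalg.solve(Kbar + jitter * np.eye(N), y)

def predict(x, X_train, alpha_hat, kernel):
    """f(x) = sum_n alpha_n (1 + kappa(x, x_n))."""
    k = np.array([kernel(x, xn) for xn in X_train])
    return np.sum(alpha_hat * (1.0 + k))
```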
8. Reproducing kernel Hilbert Space
An m\times m real symmetric matrix K is called positive semidefinite if
\boldsymbol{a}^TK\boldsymbol a\ge0
holds for every \boldsymbol{a}\in\mathbb{R}^m.
- A real symmetric matrix is diagonalizable.
- A real symmetric matrix is positive semidefinite if and only if all of its eigenvalues are nonnegative.
A function \kappa:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R} is called a positive semidefinite kernel if the matrix
K=\begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1, \boldsymbol x_m)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_m, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_m, \boldsymbol x_m) \end{bmatrix}
is positive semidefinite for every finite set \{\boldsymbol x_1, \cdots, \boldsymbol x_m\}\subset\mathbb{R}^d. A sketch of this check follows.
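The eigenvalue characterisation above suggests a simple numerical check. A minimal sketch, assuming NumPy and a kernel given as a plain Python function (the function names and the RBF example are illustrative, not from the original notes):

```python
import numpy as np

def gram_matrix(X, kernel):
    # K_ij = kappa(x_i, x_j)
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def is_psd(K, tol=1e-10):
    """A real symmetric matrix is PSD iff all eigenvalues are >= 0
    (up to numerical tolerance)."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# usage: the RBF kernel should pass this check for any sample set
# rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
# X = np.random.randn(20, 3)
# assert is_psd(gram_matrix(X, rbf))
```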
Such a kernel satisfies the Cauchy-Schwarz inequality:
|\kappa(\boldsymbol x, \boldsymbol z)|^2\le \kappa(\boldsymbol x, \boldsymbol x)\kappa(\boldsymbol z, \boldsymbol z)
Proof: since \kappa is a positive semidefinite kernel, the determinant of the 2\times2 Gram matrix of \{\boldsymbol x,\boldsymbol z\} is nonnegative, i.e.
\begin{vmatrix} \kappa(\boldsymbol x,\boldsymbol x) & \kappa(\boldsymbol x,\boldsymbol z)\\ \kappa(\boldsymbol z,\boldsymbol x) & \kappa(\boldsymbol z,\boldsymbol z) \end{vmatrix}=\kappa(\boldsymbol x,\boldsymbol x)\kappa(\boldsymbol z,\boldsymbol z)-|\kappa(\boldsymbol x,\boldsymbol z)|^2\ge0
Reproducing kernel map: \Phi:\mathbb{R}^d \to \mathbb{R}^{\mathbb{R}^d},\ \Phi(\boldsymbol z)=\phi_{\boldsymbol z}=\kappa(\cdot,\boldsymbol z). This means
\Phi(\boldsymbol z)(\boldsymbol x)=\phi_{\boldsymbol z}(\boldsymbol x)=\kappa(\boldsymbol x, \boldsymbol z)
\Phi maps a vector to a function: fixing the second argument at \boldsymbol z yields the function \phi_{\boldsymbol z}=\kappa(\cdot,\boldsymbol z), i.e. \Phi:\boldsymbol z \mapsto\kappa(\cdot,\boldsymbol z).
Suppose \kappa:\mathbb{R}^d\times \mathbb{R}^d\to\mathbb{R} is a kernel and \Phi is the corresponding reproducing kernel map. Then
\mathcal{F}_\kappa=\{f=\sum_{i=1}^m\alpha_i\Phi(\boldsymbol x_i)\mid m\in\mathbb{N},\boldsymbol x_1,\cdots,\boldsymbol x_m\in\mathbb{R}^d,\alpha_1,\cdots,\alpha_m\in\mathbb{R}\}
is a linear space.
Define an inner product on this linear space: given f=\sum_{i=1}^m\alpha_i\Phi(\boldsymbol x_i)=\sum_{i=1}^m\alpha_i\kappa(\cdot,\boldsymbol x_i) and g=\sum_{j=1}^{m'}\beta_j\Phi(\boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j\kappa(\cdot,\boldsymbol x_j'),
the inner product of f and g is
<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i,\boldsymbol x_j')
Properties:
- <f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i, \boldsymbol x_j')=\sum_{i=1}^m\alpha_i\sum_{j=1}^{m'}\beta_j\kappa(\boldsymbol x_i, \boldsymbol x_j')=\sum_{i=1}^m\alpha_ig(\boldsymbol x_i),\ \text{where}\ g=\sum_{j=1}^{m'}\beta_j\kappa(\cdot,\boldsymbol x_j')
- <f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i, \boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j\sum_{i=1}^{m}\alpha_i\kappa(\boldsymbol x_j', \boldsymbol x_i)=\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j'),\ \text{where}\ f=\sum_{i=1}^{m}\alpha_i\kappa(\cdot,\boldsymbol x_i)
From these two properties (a numerical check follows):
<\kappa(\cdot, \boldsymbol x),f>=f(\boldsymbol x)=\sum_{i=1}^m\alpha_i\kappa(\boldsymbol x_i, \boldsymbol x)\\ <\kappa(\cdot,\boldsymbol x), \kappa(\cdot, \boldsymbol z)>=\kappa(\boldsymbol x, \boldsymbol z)=\kappa(\boldsymbol z, \boldsymbol x)
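A small numerical illustration of the reproducing property (all names are hypothetical; the RBF kernel with σ = 1 is an assumed choice): build f = Σᵢ αᵢ κ(·, xᵢ), evaluate ⟨κ(·, x), f⟩ via the inner-product definition, and check that it equals f(x):

```python
import numpy as np

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)  # assumed kernel

# f = sum_i alpha_i kappa(., x_i)
X = np.random.randn(5, 2)
alpha = np.random.randn(5)
f = lambda x: np.sum([a * rbf(x, xi) for a, xi in zip(alpha, X)])

# By the inner-product definition, <kappa(., x), f> = sum_i alpha_i kappa(x_i, x);
# the reproducing property says this equals f(x).
x = np.random.randn(2)
inner = np.sum([a * rbf(xi, x) for a, xi in zip(alpha, X)])
assert np.isclose(inner, f(x))
```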
We now verify that this construction satisfies the definition of an inner product: symmetry, linearity, and positive definiteness.
- Symmetry:
<f,g>=\sum_{i=1}^m\sum_{j=1}^{m'}\alpha_i\beta_j\kappa(\boldsymbol x_i, \boldsymbol x_j')=\sum_{j=1}^{m'}\sum_{i=1}^m\beta_j\alpha_i\kappa(\boldsymbol x_j',\boldsymbol x_i)=<g, f>
- Linearity:
<af,g>=\sum_{j=1}^{m'}\beta_j(af)(\boldsymbol x_j')=a\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j')=a<f,g>\\ <f+h, g>=\sum_{j=1}^{m'}\beta_j(f+h)(\boldsymbol x_j')=\sum_{j=1}^{m'}\beta_j(f(\boldsymbol x_j')+h(\boldsymbol x_j'))=\sum_{j=1}^{m'}\beta_jf(\boldsymbol x_j')+\sum_{j=1}^{m'}\beta_jh(\boldsymbol x_j')=<f,g>+<h,g>
- Positive definiteness:
<f,f>=\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j\kappa(\boldsymbol x_i, \boldsymbol x_j)= \begin{bmatrix} \alpha_1 & \cdots & \alpha_m \end{bmatrix} \begin{bmatrix} \kappa(\boldsymbol x_1, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_1, \boldsymbol x_m)\\ \vdots & \ddots & \vdots\\ \kappa(\boldsymbol x_m, \boldsymbol x_1) & \cdots & \kappa(\boldsymbol x_m, \boldsymbol x_m) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_m \end{bmatrix}=\boldsymbol\alpha^TK\boldsymbol\alpha \ge 0
Moreover, if <f,f>=0, then by the Cauchy-Schwarz inequality
|f(\boldsymbol x)|^2=|<\kappa(\cdot,\boldsymbol x),f>|^2\le<\kappa(\cdot,\boldsymbol x),\kappa(\cdot, \boldsymbol x)><f, f>=\kappa(\boldsymbol x,\boldsymbol x)<f,f>=0
so f(\boldsymbol x)=0 for all \boldsymbol x; conversely, f\equiv0 clearly gives <f,f>=0. Hence <f,f>=0 if and only if f\equiv0.
In summary,
\kappa(\boldsymbol x, \boldsymbol z)=<\kappa(\cdot, \boldsymbol x), \kappa(\cdot, \boldsymbol z)>=<\Phi(\boldsymbol x), \Phi(\boldsymbol z)>
which means that for any kernel \kappa(\boldsymbol x, \boldsymbol z) we can find a feature map \Phi from \mathbb{R}^d to \mathcal{H}.
Metric space: an ordered pair (M,d), where M is a nonempty set and d is a metric on M, d:M\times M\to \mathbb{R}, satisfying for any three elements \boldsymbol x,\boldsymbol y,\boldsymbol z of M:
- d(\boldsymbol x, \boldsymbol y)\ge0
- d(\boldsymbol x,\boldsymbol y)=0\Leftrightarrow \boldsymbol x=\boldsymbol y
- d(\boldsymbol x,\boldsymbol y)=d(\boldsymbol y, \boldsymbol x)
- d(\boldsymbol x,\boldsymbol z)\le d(\boldsymbol x, \boldsymbol y)+d(\boldsymbol y, \boldsymbol z)
Cauchy sequence: in a metric space (M,d), a sequence \boldsymbol x_1,\boldsymbol x_2,\cdots is a Cauchy sequence if for every real \epsilon>0 there exists a positive integer N such that for all m,n>N,
d(\boldsymbol x_m,\boldsymbol x_n)\le\epsilon
That is, the terms become arbitrarily close to each other; a Cauchy sequence need not converge within M.
A metric space (M,d) is complete if and only if every Cauchy sequence in M converges to a point of M.
A Hilbert space H is a real inner product space that is also a complete metric space with the distance induced by the inner product:
d(\boldsymbol x,\boldsymbol y)=\|\boldsymbol x-\boldsymbol y\|,\qquad \|\boldsymbol x\|=\sqrt{<\boldsymbol x,\boldsymbol x>}