ML Note 1.3 - SVM

Support vector machines are arguably among the best off-the-shelf supervised learning algorithms. Consider a linear binary classification problem with $y \in \{-1,1\}$ and $x \in \mathbb{R}^n$. Note that we do not use the augmented feature vector here; instead, the linear classifier is parameterized by $w$ and $b$:

$$h_{w,b}(x) = g(w^Tx+b)$$

where

$$g(z) = \begin{cases} 1 & \text{if } z \ge 0\\ -1 & \text{otherwise} \end{cases}$$
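
As a tiny illustrative sketch (the function name and the NumPy usage are mine, not part of the note), the hypothesis can be evaluated as:

```python
import numpy as np

def predict(w, b, x):
    """Evaluate h_{w,b}(x) = g(w^T x + b), returning a label in {-1, +1}."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```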

Optimal Margin Classifier

Suppose the training set $S$ is linearly separable, i.e., there exist $w, b$ under which every sample is classified correctly. The magnitude of $w^Tx+b$ can be viewed as the confidence of an individual prediction. Define the functional margin

$$\hat\gamma^{(i)} = y^{(i)}(w^Tx^{(i)}+b)$$

If $\hat\gamma^{(i)} > 0$, the prediction is correct, and the larger $\hat\gamma^{(i)}$ is, the more confident the prediction. Define the functional margin of the whole training set as

$$\hat\gamma = \min\limits_{i=1,\dots,m}\hat\gamma^{(i)}$$
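
A minimal sketch of these two definitions, assuming `X` is an $m \times n$ NumPy array of samples and `y` a vector of $\pm 1$ labels (the helper name is hypothetical):

```python
import numpy as np

def functional_margin(w, b, X, y):
    """Per-sample functional margins y^(i)(w^T x^(i) + b) and their minimum over the set."""
    gamma_hat = y * (X @ w + b)   # shape (m,)
    return gamma_hat, gamma_hat.min()
```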

Our goal can then be written as

$$\max\limits_{w, b}{\hat\gamma}$$

However, by scaling $w$ and $b$ up proportionally, $\hat\gamma$ can be made arbitrarily large, so we also need the constraint

$$||w|| = 1$$

Unfortunately this constraint makes the problem non-convex, so we define the geometric margin instead:

$$\begin{array}{lcl} \gamma^{(i)} &=& y^{(i)}\left(\left(\frac{w}{||w||}\right)^Tx^{(i)}+\frac{b}{||w||}\right)\\ \gamma &=& \min\limits_{i=1,\dots,m}\gamma^{(i)} \end{array}$$

Geometrically, $\gamma^{(i)}$ is the distance from the $i$-th sample to the decision boundary. Since points farther from the boundary are classified with higher confidence, the original problem becomes

$$\max\limits_{w, b}{\gamma}$$

The objective is still not convex, so convex optimization does not apply directly. However, the length of $w$ can now be chosen freely without changing the objective value, and there is always a rescaling under which $\hat\gamma = 1$. The problem can therefore be restated as

$$\begin{array}{rll} \min\limits_{w, b} & \frac{1}{2}||w||^2 &\\ \text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1, & i=1,\dots,m \end{array}$$
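
Since this is now a convex quadratic program, it can be handed to a generic solver. A sketch using CVXPY (a tooling choice of mine, not something the note prescribes; it assumes the data really is separable so the problem is feasible):

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y^(i)(w^T x^(i) + b) >= 1 for all i."""
    n = X.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```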

Soft Margin Classifier

For a training set $S$ that is not linearly separable, restate the problem with $l_1$ regularization (slack variables):

$$\begin{array}{rll} \min\limits_{w, b, \xi} & \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i&\\ \text{s.t.} & y^{(i)}(w^Tx^{(i)}+b) \ge 1 - \xi_i, & i=1,\dots,m\\ & \xi_i \ge 0, & i=1,\dots,m \end{array}$$
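
The same CVXPY sketch extends to the soft-margin version by adding the slack variables $\xi$ (again only an illustration; `C` matches the constant in the objective above):

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve min 1/2 ||w||^2 + C * sum(xi) with the relaxed margin constraints."""
    m, n = X.shape
    w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```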

Construct the generalized Lagrangian

$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}||w||^2 + C\sum\limits_{i=1}^m\xi_i - \sum\limits_{i=1}^m \alpha_i\big[y^{(i)}(w^Tx^{(i)}+b)-1+\xi_i\big] - \sum\limits_{i=1}^m\beta_i\xi_i$$

It is easy to verify that

$$\begin{array}{rcl} w &=& \vec 0\\ b &=& 0\\ \xi_i &=& 2 \end{array}$$

is a strictly feasible point, so Slater's condition holds and it suffices to solve the dual problem

$$\max\limits_{\alpha \ge 0,\, \beta \ge 0}\ \theta_D(\alpha, \beta)$$

Taking derivatives of the Lagrangian,

$$\begin{array}{rcl} \nabla_wL &=& w-\sum\limits_{i=1}^m \alpha_iy^{(i)}x^{(i)}\\ \frac{\partial}{\partial b}L &=& -\sum\limits_{i=1}^m \alpha_iy^{(i)}\\ \frac{\partial}{\partial\xi_i}L &=& C-\alpha_i-\beta_i \end{array}$$

and setting them to zero, we find that the optimal point satisfies

$$\begin{array}{rcl} w^* &=& \sum\limits_{i=1}^m\alpha_i^*y^{(i)}x^{(i)}\\ b^* &=& -\frac{\max\limits_{i: y^{(i)} = -1}w^{*T}x^{(i)} + \min\limits_{i: y^{(i)} = 1}w^{*T}x^{(i)}}{2}\\ w^{*T}x + b^* &=& \sum\limits_{i=1}^m\alpha_i^*y^{(i)}\langle x^{(i)}, x\rangle + b^* \end{array}$$
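
Given an optimal dual vector $\alpha^*$, the primal solution follows directly from these expressions. A NumPy sketch (variable names are mine; the intercept formula mirrors the one above, which assumes the separable case):

```python
import numpy as np

def recover_primal(alpha, X, y):
    """Recover w* = sum_i alpha_i y^(i) x^(i) and the intercept b* from the dual solution."""
    w = X.T @ (alpha * y)
    b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2
    return w, b
```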
Plugging these optimality conditions back into the Lagrangian, the dual problem becomes

$$\begin{array}{rll} \max\limits_{\alpha} & \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle\\ \text{s.t.} & 0 \le \alpha_i \le C,\quad i=1,\dots,m\\ & \sum\limits_{i=1}^m \alpha_iy^{(i)} = 0 \end{array}$$

The KKT complementary slackness conditions require that

$$\begin{array}{rcl} \alpha_i^*\big[y^{(i)}(w^{*T}x^{(i)}+b^*)-1+\xi_i^*\big] &=& 0\\ \beta_i^*\xi_i^* &=& 0 \end{array}$$

which can be summarized as

$$\begin{array}{rcl} \alpha_i^* = 0 &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* \ge 1\\ 0 < \alpha_i^* < C &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* = 1\\ \alpha_i^* = C &\Rightarrow& \big(\hat\gamma^{(i)}\big)^* \le 1 \end{array}$$
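
For illustration, the boxed dual can also be solved numerically and the resulting $\alpha^*$ bucketed according to these conditions. A sketch with CVXPY for the linear kernel (the tolerance and helper names are my own):

```python
import cvxpy as cp
import numpy as np

def solve_dual(X, y, C=1.0, tol=1e-6):
    """Solve the soft-margin dual (linear kernel) and classify each alpha_i by the KKT cases."""
    m = X.shape[0]
    alpha = cp.Variable(m)
    # For the linear kernel, sum_{i,j} y_i y_j a_i a_j <x_i, x_j> = ||X^T (a * y)||^2.
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    on_margin = (a > tol) & (a < C - tol)   # 0 < alpha_i* < C  =>  functional margin exactly 1
    at_bound = a >= C - tol                 # alpha_i* = C      =>  functional margin <= 1
    return a, on_margin, at_bound
```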

SMO [1]

To optimize over $(\alpha_1, \alpha_2, \dots, \alpha_m)$ jointly, the most natural idea is coordinate ascent. In this problem, however, the equality constraint

$$\sum\limits_{i=1}^m \alpha_iy^{(i)} = 0$$

cannot remain satisfied if only one variable is changed. The SMO algorithm therefore updates two variables at a time, so that the equality constraint continues to hold:

$$\begin{aligned} &\text{repeat}\ \{\\ &\qquad \text{Select some pair }(\alpha_i,\alpha_j)\text{ to update by a heuristic}\\ &\qquad\text{Optimize }W\text{ w.r.t. }(\alpha_i,\alpha_j)\\ &\} \end{aligned}$$

where

$$W = \sum\limits_{i=1}^m \alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$$

is the objective function being optimized.

Here, the selection heuristic in step 1 is beyond our scope, and we will focus on step 2. Suppose the pair chosen in step 1 is $(\alpha_1,\alpha_2)$, and define the constant

$$\zeta \equiv -\sum\limits_{i=3}^m\alpha_iy^{(i)} = \alpha_1y^{(1)} + \alpha_2y^{(2)}$$

Then the two optimization variables effectively reduce to one:

$$\alpha_1 = (\zeta-\alpha_2y^{(2)})y^{(1)}$$

This turns the problem into a single-variable optimization:

$$\alpha_2 := \arg\max_{\hat\alpha_2} W\big((\zeta-\hat\alpha_2y^{(2)})y^{(1)},\hat\alpha_2,\alpha_3,\dots,\alpha_m\big)$$

If the unconstrained optimum of this variable violates the box constraint and cannot be attained, clip it to the endpoint of its feasible interval closest to that optimum.
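
The note stops at this high-level description; for concreteness, here is a sketch of the standard analytic update for one pair with the clipping step made explicit (the closed-form expressions come from the usual SMO derivation, e.g. the reference below, and are not derived here):

```python
import numpy as np

def smo_pair_update(i, j, alpha, X, y, b, C, tol=1e-8):
    """One analytic SMO step on the pair (alpha_i, alpha_j), clipped to the feasible box."""
    K = lambda p, q: float(X[p] @ X[q])                 # linear kernel <x_p, x_q>
    f = lambda p: float((alpha * y) @ (X @ X[p])) + b   # current decision value w^T x_p + b
    E_i, E_j = f(i) - y[i], f(j) - y[j]                 # prediction errors

    # Feasible interval [L, H] for alpha_j implied by 0 <= alpha <= C and the equality constraint.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K(i, i) + K(j, j) - 2.0 * K(i, j)             # curvature of the one-variable objective
    if eta < tol or L >= H:
        return alpha                                    # no progress possible on this pair

    a_j = alpha[j] + y[j] * (E_i - E_j) / eta           # unconstrained one-variable optimum
    a_j = float(np.clip(a_j, L, H))                     # clip to the nearest feasible endpoint
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)     # keep sum_k alpha_k y^(k) unchanged

    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha
```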

Kernels

Data that is not linearly separable in the original feature space may become linearly separable after being mapped into a higher-dimensional space. For instance, one-dimensional points labeled $+1$ when $|x| < 1$ and $-1$ when $|x| > 1$ cannot be separated on the line, but become separable under the map $x \mapsto (x, x^2)$.

For a function $K: \mathbb{R}^n \times\mathbb{R}^n\rightarrow \mathbb{R}$ and any finite set of points $\{x^{(1)},\dots,x^{(m)}\}$, define its corresponding matrix $K = (K_{ij}) \in \mathbb{R}^{m\times m}$ by

$$K_{ij} = K(x^{(i)},x^{(j)})$$

Mercer's theorem states that

there exists some $\phi$ such that $K(x,z) = \langle\phi(x),\phi(z)\rangle$ if and only if the matrix $K$ is symmetric positive semi-definite for every such finite set of points.

A function $K$ of this kind is called a kernel, and the matrix $K$ is called the kernel matrix. Commonly used kernels include the following (a small implementation sketch follows the list):

  • Polynomial: $K(x,z) = (x^Tz+c)^d$
  • Gaussian: $K(x,z) = \exp\Big(-\frac{||x-z||^2}{2\sigma^2}\Big)$
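
Both are straightforward to implement; a minimal sketch (parameter names `c`, `d`, `sigma` mirror the formulas above):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=3):
    """K(x, z) = (x^T z + c)^d."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```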

Suppose we have an input attribute $x$ but we want to run the SVM on the input features $\phi(x)$, where $\phi$ is a feature mapping. All we have to do is replace every inner product with the kernel. This is useful because in many cases computing $K(x^{(i)}, x^{(j)})$ is much cheaper than computing $\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle$ directly (and if $K$ corresponds to an infinite-dimensional feature space, as the Gaussian kernel does, there may be no easy way to work out the inner product of the $\phi$'s at all).
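
Concretely, the kernel trick only changes where the inner product appears in the dual decision function. A sketch (assuming the dual variables `alpha` and the intercept `b` have already been computed, and `kernel` is any of the functions above):

```python
import numpy as np

def kernel_predict(alpha, b, X, y, x_new, kernel):
    """Evaluate sign( sum_i alpha_i y^(i) K(x^(i), x_new) + b ) without ever forming phi(x)."""
    k = np.array([kernel(x_i, x_new) for x_i in X])
    return 1 if alpha @ (y * k) + b >= 0 else -1
```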


  [1] 【机器学习详解】SMO算法剖析
