Comparing Linear/Logistic/Softmax Regression

Linear, Logistic, and Softmax Regression are common machine learning models, and all three are instances of the generalized linear model, so they share many similarities. This post compares them in detail.

Overview

Linear Regression is a regression model, Logistic Regression is a binary classifier, and Softmax Regression is a multi-class classifier, but all three belong to the family of generalized linear models (GLMs), i.e., models built on a linear combination of the inputs.

Softmax Regression can be seen as the extension of Logistic Regression to multiple classes.

Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive).

Notation

  • Sample: $(x^{(i)}, y^{(i)})$
  • Number of samples: $m$
  • Feature dimension: $n$
  • Linear Regression output: $y^{(i)}$
  • Logistic Regression label: $y^{(i)} \in \{0, 1\}$
  • Softmax Regression label: $y^{(i)} \in \{1, \ldots, K\}$
  • Number of Softmax Regression classes: $K$
  • Loss function: $J(\theta)$
  • Indicator function: $I\{\text{boolean}\}$

Comparison of model parameters

Linear Regression: a vector of dimension $n \times 1$

$$\theta = \begin{bmatrix} \vert \\ \theta \\ \vert \end{bmatrix}$$

Logistic Regression: a vector of dimension $n \times 1$

$$\theta = \begin{bmatrix} \vert \\ \theta \\ \vert \end{bmatrix}$$

Softmax Regression: a matrix of dimension $n \times K$

$$\theta = \begin{bmatrix} \vert & \vert & & \vert \\ \theta^{(1)} & \theta^{(2)} & \dots & \theta^{(K)} \\ \vert & \vert & & \vert \end{bmatrix}$$

Comparison of model outputs

Linear Regression outputs the sample's score (a scalar).

$$h_\theta(x) = \theta^T x$$

Logistic Regression outputs the probability that the sample is positive (a scalar).

$$h_\theta(x) = P(y = 1 \mid x; \theta) = \frac{1}{1 + \exp(-\theta^T x)}$$

Softmax Regression outputs the probabilities of the $K$ classes (a vector).

$$h_\theta(x) = \begin{bmatrix} P(y = 1 \mid x; \theta) \\ P(y = 2 \mid x; \theta) \\ \vdots \\ P(y = K \mid x; \theta) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} \exp(\theta^{(k)T} x)} \begin{bmatrix} \exp(\theta^{(1)T} x) \\ \exp(\theta^{(2)T} x) \\ \vdots \\ \exp(\theta^{(K)T} x) \end{bmatrix}$$
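
As a concrete illustration, here is a minimal NumPy sketch of the three hypothesis functions (the function names and array layout are my own, following the parameter shapes from the previous section):

```python
import numpy as np

def h_linear(theta, x):
    # Linear Regression: a scalar score; theta has shape (n,)
    return theta @ x

def h_logistic(theta, x):
    # Logistic Regression: probability of the positive class, a scalar
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def h_softmax(theta, x):
    # Softmax Regression: probabilities of the K classes, a vector;
    # theta has shape (n, K), one column theta[:, k] per class
    scores = theta.T @ x                 # shape (K,)
    scores -= scores.max()               # shift by the max; cancels in the ratio,
    exp_scores = np.exp(scores)          # but prevents overflow in np.exp
    return exp_scores / exp_scores.sum()
```

The max-shift in `h_softmax` leaves the output unchanged but is the usual guard against numerical overflow.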

Comparison of loss functions

Linear Regression is a regression problem, so its loss function is usually the squared error; Logistic/Softmax Regression are classification problems, which usually use the cross-entropy loss.

For a classification problem, given a sample $(x, y)$, the model outputs a probability distribution over the classes, which can be written uniformly as the conditional probability $P(y \mid x)$. The cross-entropy expression can either be written down directly or derived from the maximum-likelihood principle; both lead to the same result.

Linear Regression:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Logistic Regression. The conditional probability can be written as

$$P(y \mid x) = P(y = 1 \mid x; \theta)^{y} \, P(y = 0 \mid x; \theta)^{1 - y} = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}$$

Over all training samples, the loss function is

$$J(\theta) = -\sum_{i=1}^m \log P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]$$

Softmax Regression. The conditional probability can be written as

$$P(y \mid x) = I\{y = 1\} P(y = 1 \mid x; \theta) + \dots + I\{y = K\} P(y = K \mid x; \theta)$$

Over all training samples, the loss function is

$$J(\theta) = -\sum_{i=1}^m \log P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^m \sum_{k=1}^K I\{y^{(i)} = k\} \log P(y^{(i)} = k \mid x^{(i)}; \theta)$$

Comparing the two expressions, the Logistic and Softmax Regression losses have exactly the same form: the cross-entropy loss. The cross-entropy between the true distribution $p$ and the predicted distribution $q$ is

$$H(p, q) = -\sum_x p(x) \log q(x)$$

  • For Logistic Regression, the true distribution is $[1, 0]$ or $[0, 1]$
  • For Softmax Regression, the true distribution is $[1, 0, 0]$, $[0, 1, 0]$, or $[0, 0, 1]$ (taking $K = 3$ as an example)
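
For concreteness, here is a minimal NumPy sketch of the three losses over a batch `X` of shape `(m, n)`; the names are mine, and for the softmax case the labels `y` are assumed to be zero-indexed integers in `{0, ..., K-1}`:

```python
import numpy as np

def loss_linear(theta, X, y):
    # squared error: 1/2 * sum_i (h(x_i) - y_i)^2
    return 0.5 * np.sum((X @ theta - y) ** 2)

def loss_logistic(theta, X, y):
    # binary cross-entropy: -sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def loss_softmax(theta, X, y):
    # multi-class cross-entropy: -sum_i log P(y_i | x_i)
    scores = X @ theta                            # shape (m, K)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.sum(np.log(probs[np.arange(len(y)), y]))
```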

Comparison of gradients

Linear/Logistic/Softmax Regression are all generalized linear models, so their forms are strikingly similar, and that includes their gradients.

Linear Regression gradient

$$\nabla_\theta J(\theta) = \sum_{i=1}^m x^{(i)} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

where $h_\theta(x) = \theta^T x$.

Logistic Regression gradient

$$\nabla_\theta J(\theta) = \sum_{i=1}^m x^{(i)} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

where $h_\theta(x) = \sigma(\theta^T x)$.

Softmax Regression gradient

$$\nabla_{\theta^{(k)}} J(\theta) = \sum_{i=1}^m x^{(i)} \left[ P(y^{(i)} = k \mid x^{(i)}; \theta) - I\{y^{(i)} = k\} \right]$$

where the predicted probabilities are as given in the model-output section above; for notational convenience, the gradient is taken with respect to each $\theta^{(k)}$ separately.
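
To see where the shared error form comes from, here is the standard single-sample derivation for the logistic case (my addition, not from the original post), writing $h = \sigma(\theta^T x)$ and using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\nabla_\theta \Big[ -y \log h - (1 - y) \log(1 - h) \Big] = \left( -\frac{y}{h} + \frac{1 - y}{1 - h} \right) h (1 - h) \, x = (h - y) \, x$$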

The form of the gradient is quite intuitive: the size of the update is proportional to the error term.

The magnitude of the update is proportional to the error term $h_\theta(x^{(i)}) - y^{(i)}$; thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of $y^{(i)}$, then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).
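
To make the shared structure explicit, here is a minimal NumPy sketch of the three batch gradients and one gradient-descent step (names, learning rate, and toy data are illustrative; softmax labels are assumed zero-indexed, as in the loss sketch above):

```python
import numpy as np

def grad_linear(theta, X, y):
    # sum_i x_i (h(x_i) - y_i), with h(x) = theta^T x
    return X.T @ (X @ theta - y)

def grad_logistic(theta, X, y):
    # identical form, with h(x) = sigma(theta^T x)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y)

def grad_softmax(theta, X, y):
    # column k: sum_i x_i [ P(y_i = k | x_i) - I(y_i = k) ]
    scores = X @ theta
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # shape (m, K)
    onehot = np.eye(theta.shape[1])[y]            # I(y_i = k) as one-hot rows
    return X.T @ (probs - onehot)

# one batch gradient-descent step on toy data (learning rate 0.01)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
theta = np.zeros(3)
theta -= 0.01 * grad_logistic(theta, X, y)
```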
