Random Walk 001 | What is a good penalty function?

Question

A good penalty function should result in an estimator with the following three properties (Fan and Li [1]):

  • Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large, so as to avoid unnecessary modeling bias.

  • Sparsity: The resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.

  • Continuity: The resulting estimator is continuous in the data \(z\) to avoid instability in model prediction.

Now we verify whether OLS, ridge regression, the LASSO, and SCAD satisfy these properties.

Answer

Conditions

Linear model:

\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\qquad y_i=\beta_0+\sum\limits_{j=1}^p\beta_jx_{ij}+\varepsilon_i,\quad i=1,\dots,n, \]

where \(\mathbf{y}=(y_1,\dots,y_n)^\top\), \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix whose \(i\)-th row is \((1,x_{i1},\dots,x_{ip})\), \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\) with \(E(\boldsymbol{\varepsilon})=\mathbf{0}\), and \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)^\top\).

Now we first consider the ordinary least squares (OLS) estimator:

\[\widehat{\boldsymbol{\beta}}^{\text{ols}}=\arg\min\limits_{\boldsymbol{\beta}}\sum_{i=1}^n\bigg(y_i-\beta_0-\sum\limits_{j=1}^p\beta_jx_{ij}\bigg)^2=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \]

we know that \(\widehat{\boldsymbol{\beta}}^\text{ols}\) is unbiased, since

\[E(\widehat{\boldsymbol{\beta}}^\text{ols}-\boldsymbol{\beta})=E\big((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\big)=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top E(\boldsymbol{\varepsilon})=\boldsymbol{0}. \]

Of course, \(\widehat{\boldsymbol{\beta}}^{\text{ols}}\) is continuous in the data, but it has no sparsity, since no coefficient is ever set exactly to zero.
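As a quick numerical sanity check, here is a minimal Python sketch (the design, sample size \(n=200\), dimension \(p=5\), and number of replications are hypothetical choices of mine, not from the post) that simulates the linear model above, refits OLS many times, and confirms that the average estimate is close to the true \(\boldsymbol{\beta}\) while no coefficient is ever estimated as exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 200, 5, 500                                    # hypothetical sizes
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.5, 0.5])            # true (beta_0, beta_1, ..., beta_p)

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column

est = np.zeros((reps, p + 1))
for r in range(reps):
    y = X @ beta + rng.normal(size=n)                       # y = X beta + eps
    est[r] = np.linalg.lstsq(X, y, rcond=None)[0]           # OLS fit

print("mean OLS estimate:", est.mean(axis=0).round(3))      # close to the true beta (unbiasedness)
print("any exact zeros?  ", bool((est == 0).any()))         # False: OLS never produces sparsity
```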

Now we consider the penalized least squares regression model, whose objective function is

\[\begin{align*} Q(\boldsymbol{\beta})&=\frac{1}{2}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|)\\ &=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}+\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}\sum_{j=1}^p(z_j-\beta_j)^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|). \end{align*} \]

Here we denote \(\mathbf{z}=\mathbf{X}^\top\mathbf{y}\) and assume that the columns of \(\mathbf{X}\) are orthonormal, i.e. \(\mathbf{X}^\top\mathbf{X}=\mathbf{I}\) (for simplicity we drop the unpenalized intercept from this reduction), so that \(\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^\top\mathbf{y}=\mathbf{z}\) and \(\hat{\mathbf{y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{Xz}\) is the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). In particular, \(\mathbf{y}-\hat{\mathbf{y}}\) is orthogonal to \(\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}\), so the cross term in the expansion above vanishes, and

\[||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2=(\mathbf{z}-\boldsymbol{\beta})^\top\mathbf{X}^\top\mathbf{X}(\mathbf{z}-\boldsymbol{\beta})=||\mathbf{z}-\boldsymbol{\beta}||^2. \]

Thus, the minimization problem of penalized least squares is equivalent to minimizing componentwise

\[Q(\theta)=\frac{1}{2}(z-\theta)^2+p_\lambda(|\theta|). \]
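Since the componentwise problem is one-dimensional, the behaviour of each penalty can also be inspected numerically before doing the analysis below. The following sketch (grid-search minimizer; the function names, \(\lambda=1\), \(a=3.7\), and the grid are my own illustrative choices) computes \(\hat\theta(z)\) for the four penalties at a few values of \(z\); plotting \(\hat\theta(z)\) against \(z\) reproduces the thresholding, shrinkage, and (dis)continuity patterns derived next.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) evaluated at t = |theta| >= 0."""
    t = np.abs(t)
    return np.where(t < lam, lam * t,
           np.where(t < a * lam,
                    -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

penalties = {
    "ols":   lambda t, lam: 0.0 * t,            # no penalty
    "ridge": lambda t, lam: lam * t**2,
    "lasso": lambda t, lam: lam * np.abs(t),
    "scad":  scad_penalty,
}

def pls_estimate(z, penalty, lam=1.0, grid=np.linspace(-10, 10, 200001)):
    """Minimize Q(theta) = (z - theta)^2 / 2 + p_lambda(|theta|) by grid search."""
    q = 0.5 * (z - grid) ** 2 + penalty(np.abs(grid), lam)
    return grid[np.argmin(q)]

for name, pen in penalties.items():
    path = [round(float(pls_estimate(z, pen)), 2) for z in (0.5, 1.5, 3.0, 6.0)]
    print(f"{name:5s}: theta_hat at z = 0.5, 1.5, 3.0, 6.0 -> {path}")
```

For instance, the LASSO row shows exact zeros for small \(z\) but a constant downward shift of \(\lambda\) for large \(z\), while the SCAD row shows exact zeros for small \(z\) and \(\hat\theta=z\) once \(|z|>a\lambda\).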

In order to find the minimizer of \(Q(\theta)\), we set \(\frac{dQ(\theta)}{d\theta}=0\) for \(\theta\neq0\) and obtain

\[(\theta-z)+\text{sgn}(\theta)p_\lambda^\prime(|\theta|)=\text{sgn}(\theta)\{|\theta|+p_\lambda^\prime(|\theta|)\}-z=0. \]

Here are some observations based on this equation:

  1. When \(p^\prime_\lambda(|\theta|)=0\) for large \(|\theta|\), the resulting estimator is \(\hat{\theta}=z\) when \(|z|\) is sufficiently large, which is nearly unbiased.
  2. In order to get sparsity, we want \(\hat{\theta}=0\) when \(|z|\) is small, that is, \(0\) should be the minimizer of \(Q(\theta)\), which requires

\[\begin{equation*} \begin{cases} \frac{dQ(\theta)}{d\theta}>0,& \text{when } \theta>0,\\ \frac{dQ(\theta)}{d\theta}<0,& \text{when } \theta<0, \end{cases} \iff \begin{cases} |\theta|+p^\prime_\lambda(|\theta|)>z,& \text{when } \theta>0,\\ -\big(|\theta|+p^\prime_\lambda(|\theta|)\big)<z,& \text{when } \theta<0, \end{cases} \end{equation*} \]

and these two conditions can be summarized as

\[\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>|z|. \]

  3. From sparsity, \(\hat{\theta}=0\) whenever \(|z|<\min_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}\). Once \(|z|\) exceeds this threshold, the minimizer jumps to a nonzero root \(\theta_0\) of \(|\theta|+p^\prime_\lambda(|\theta|)=|z|\). For \(\hat{\theta}\) to be continuous in \(z\), \(\theta_0\) must tend to zero as \(|z|\) decreases to the threshold, which holds exactly when \(\arg\min_\theta\{|\theta|+p^\prime_\lambda(|\theta|)\}=0.\)

In conclusion, the conditions on the penalty function for the three properties are as follows (a numerical check is sketched after this list):

  1. Unbiasedness condition: \(p_\lambda^\prime(|\theta|)=0\), for large \(|\theta|\);
  2. Sparsity condition: \(\min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\);
  3. Continuity condition: \(\arg\min\limits_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=0.\)
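These three conditions involve only the derivative \(p_\lambda^\prime\), so they can be checked directly on a grid. Below is a small sketch (my own helper with hypothetical \(\lambda=1\) and \(a=3.7\)) that evaluates \(|\theta|+p_\lambda^\prime(|\theta|)\) for each penalty and reports whether \(p_\lambda^\prime\) vanishes for large \(|\theta|\), whether the minimum is positive, and whether the minimum is attained at zero.

```python
import numpy as np

lam, a = 1.0, 3.7                       # hypothetical tuning parameters

# p'_lambda(t) evaluated at t = |theta| >= 0, for each penalty
derivs = {
    "ols":   lambda t: np.zeros_like(t),
    "ridge": lambda t: 2 * lam * t,
    "lasso": lambda t: lam * np.ones_like(t),
    "scad":  lambda t: lam * np.where(t <= lam, 1.0,
                       np.maximum(a * lam - t, 0.0) / ((a - 1) * lam)),
}

t = np.linspace(1e-6, 10, 100001)       # grid for theta > 0 (use symmetry in theta)
for name, dp in derivs.items():
    h = t + dp(t)                                           # |theta| + p'_lambda(|theta|)
    unbiasedness = bool(np.isclose(dp(t[-1:]), 0.0).all())  # p' = 0 for large |theta|?
    sparsity     = bool(h.min() > 1e-3)                     # minimum of h positive?
    continuity   = bool(t[h.argmin()] < 1e-2)               # minimum attained (essentially) at 0?
    print(f"{name:5s}: unbiasedness={unbiasedness}, sparsity={sparsity}, continuity={continuity}")
```

Running this reproduces the table in the conclusion below.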

Examples

Now we review the OLS estimator, which corresponds to \(p_\lambda(|\theta|)=0\). It is obvious that

\[p_\lambda^\prime(|\theta|)\equiv0,\quad\text{and}\quad \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=\min_\theta|\theta|=0,\ \text{attained at }\theta=0. \]

Therefore, OLS satisfies unbiasedness and continuity while it does not satisfy sparsity.
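Concretely, in the componentwise form the zero penalty gives the identity rule,

\[\hat{\theta}=\arg\min_\theta\ \tfrac{1}{2}(z-\theta)^2=z\quad\text{for every }z, \]

so every component is kept unchanged: no bias and no jumps, but also never an exact zero.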

Secondly, we consider ridge regression with \(p_\lambda(|\theta|)=\lambda|\theta|^2\). We can see that

\[\begin{align*} p_\lambda^\prime(|\theta|)&=2\lambda|\theta|\neq0 \quad \text{for large }|\theta|,\\ \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}&=\min_\theta\{(1+2\lambda)|\theta|\}=0,\ \text{attained at }\theta=0. \end{align*} \]

Therefore, the ridge estimator satisfies continuity, while it satisfies neither unbiasedness nor sparsity.
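To see both failures explicitly, the componentwise ridge problem has a closed-form solution: setting the derivative of \(\frac{1}{2}(z-\theta)^2+\lambda\theta^2\) to zero gives

\[(\theta-z)+2\lambda\theta=0\quad\Longrightarrow\quad\hat{\theta}^{\text{ridge}}=\frac{z}{1+2\lambda}, \]

which is continuous in \(z\), but is never exactly zero for \(z\neq0\) (no sparsity) and shrinks even very large coefficients by the factor \(1/(1+2\lambda)\) (bias).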

Next, we consider the LASSO with \(p_\lambda(|\theta|)=\lambda|\theta|\). Since \(\lambda>0\), we have \(p_\lambda^\prime(|\theta|)=\lambda\neq0\) for large \(|\theta|\), so the unbiasedness condition fails. For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\) (\(\theta\neq0\)),

\[\begin{equation*} H^\prime(\theta)=\begin{cases} 1>0,& \text{when } \theta>0,\\ -1<0,& \text{when } \theta<0, \end{cases} \end{equation*} \]

so that \(\arg\min\limits_\theta H(\theta)=0\), and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the LASSO estimator satisfies sparsity and continuity while it does not satisfy unbiasedness.
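For comparison, solving the componentwise LASSO problem yields the well-known soft-thresholding rule

\[\hat{\theta}^{\text{lasso}}=\text{sgn}(z)\,(|z|-\lambda)_+, \]

which equals zero whenever \(|z|\leq\lambda\) (sparsity) and leaves zero continuously as \(|z|\) crosses \(\lambda\) (continuity), but shifts every large coefficient by the constant \(\lambda\), consistent with the failed unbiasedness condition.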

Finally, we consider the SCAD penalty [1], defined as

\[ \begin{equation*} p_\lambda(|\theta|;a)=\begin{cases} \lambda|\theta|,& \text{if } 0\leq|\theta|<\lambda,\\ -\frac{\theta^2-2a\lambda|\theta|+\lambda^2}{2(a-1)},& \text{if } \lambda\leq|\theta|<a\lambda,\\ (a+1)\lambda^2/2,&\text{if } |\theta|\geq a\lambda, \end{cases} \end{equation*} \]

where \(a>2\). Hence, for \(\theta>0\),

\[\begin{align*} p_\lambda^\prime(\theta)&=\lambda\Big\{I(\theta\leq\lambda)+\frac{(a\lambda-\theta)_+}{(a-1)\lambda}I(\theta>\lambda)\Big\},\\ p^\prime_\lambda(\theta)&=\big((a+1)\lambda^2/2\big)^\prime=0\quad \text{for } \theta>a\lambda,\ \text{i.e., for large } |\theta|. \end{align*} \]

For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\Big\{I(|\theta|\leq\lambda)+\frac{(a\lambda-|\theta|)_+}{(a-1)\lambda}I(|\theta|>\lambda)\Big\}\), we have, for \(\theta>0\),

\[\begin{equation*} H^\prime(\theta)= \begin{cases} 1>0,& \text{when } 0<\theta \leq\lambda,\\ 1-\frac{1}{a-1}>0,&\text{when } \lambda<\theta\leq a\lambda\ (\text{since } a>2),\\ 1>0,&\text{when } \theta>a\lambda, \end{cases} \end{equation*} \]

and \(H(\theta)=H(-\theta)\), so \(H^\prime(\theta)<0\) for \(\theta<0\). Hence \(\arg\min_\theta H(\theta)=0\), and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the SCAD estimator satisfies all three properties.
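For completeness, Fan and Li [1] also derive the resulting SCAD thresholding rule in closed form,

\[\hat{\theta}^{\text{scad}}=\begin{cases}\text{sgn}(z)\,(|z|-\lambda)_+,&\text{when } |z|\leq2\lambda,\\ \dfrac{(a-1)z-\text{sgn}(z)a\lambda}{a-2},&\text{when } 2\lambda<|z|\leq a\lambda,\\ z,&\text{when } |z|>a\lambda,\end{cases} \]

which shows the three properties at a glance: small \(|z|\) is thresholded exactly to zero, the rule is continuous in \(z\), and large \(|z|\) is left untouched.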

Conclusion

|              | OLS        | Ridge      | LASSO      | SCAD       |
|--------------|------------|------------|------------|------------|
| Unbiasedness | \(\surd\)  | \(\times\) | \(\times\) | \(\surd\)  |
| Sparsity     | \(\times\) | \(\times\) | \(\surd\)  | \(\surd\)  |
| Continuity   | \(\surd\)  | \(\surd\)  | \(\surd\)  | \(\surd\)  |

Reference

[1] Fan, J. & Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 2001, 96, 1348-1360.
