Question
A good penalty function should result in an estimator with three properties:
- Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
- Sparsity: The resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
- Continuity: The resulting estimator is continuous in the data \(z\) to avoid instability in model prediction.
Now verify whether OLS, Ridge, LASSO, and SCAD satisfy these properties or not.
Answer
Conditions
Linear model:
\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\quad y_i=\beta_0+\sum\limits_{j=1}^p\beta_jx_{ij}+\varepsilon_i,\quad i=1,\dots,n, \]where \(\mathbf{y}=(y_1,\dots,y_n)^\top\), \(\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_n)^\top\) with rows \(\mathbf{x}_i=(1,x_{i1},\dots,x_{ip})^\top,i=1,\dots,n\), \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\), and \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)^\top\).
Now we first consider the ordinary least squares (OLS) estimator:
\[\widehat{\boldsymbol{\beta}}^{\text{ols}}=\arg\min\limits_{\boldsymbol{\beta}}\sum_{i=1}^n\bigg(y_i-\beta_0-\sum\limits_{j=1}^p\beta_jx_{ij}\bigg)^2=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \]we know that \(\widehat{\boldsymbol{\beta}}^\text{ols}\) is unbiased, since
\[E(\widehat{\boldsymbol{\beta}}^\text{ols}-\boldsymbol{\beta})=E\big((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\big)=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top E(\boldsymbol{\varepsilon})=\boldsymbol{0}. \]Clearly, \(\widehat{\boldsymbol{\beta}}^{\text{ols}}\) is also continuous in the data, but it does not have sparsity, since no coefficient is ever set exactly to zero.
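As a quick numerical illustration (a minimal NumPy sketch; the dimensions, coefficients, and noise level below are made up for demonstration), the closed-form OLS estimator can be computed directly from the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column plus p covariates
beta_true = np.array([1.0, 2.0, 0.0, -1.5])                 # (beta_0, beta_1, ..., beta_p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)  # close to beta_true (unbiased), continuous in the data, but no exact zeros
```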
Now we consider the penalized least squares regression model whose objective function is
\[\begin{align*} Q(\boldsymbol{\beta})&=\frac{1}{2}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|)\\ &=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}+\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}\sum_{j=1}^p(z_j-\beta_j)^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|). \end{align*} \]Here we denote \(\mathbf{z}=\mathbf{X}^\top\mathbf{y}\) and assume that the columns of \(\mathbf{X}\) are orthonormal, i.e., \(\mathbf{X}^\top\mathbf{X}=\mathbf{I}\), so that \(\widehat{\boldsymbol{\beta}}^{\text{ols}}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}=\mathbf{z}\) and \(\hat{\mathbf{y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}\mathbf{z}\). The cross term vanishes because the residual \(\mathbf{y}-\hat{\mathbf{y}}\) is orthogonal to the column space of \(\mathbf{X}\), and
\[||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2=(\mathbf{z}-\boldsymbol{\beta})^\top\mathbf{X}^\top\mathbf{X}(\mathbf{z}-\boldsymbol{\beta})=||\mathbf{z}-\boldsymbol{\beta}||^2. \]Thus, minimizing the penalized least squares objective is equivalent to minimizing, componentwise,
\[Q(\theta)=\frac{1}{2}(z-\theta)^2+p_\lambda(|\theta|). \]In order to find the minimizer of \(Q(\theta)\), we set \(\frac{dQ(\theta)}{d\theta}=0\) for \(\theta\neq0\) and obtain
\[(\theta-z)+\text{sgn}(\theta)p_\lambda^\prime(|\theta|)=\text{sgn}(\theta)\{|\theta|+p_\lambda^\prime(|\theta|)\}-z=0. \]Here are some observations based on this equation:
- When \(p^\prime_\lambda(|\theta|)=0\) for large \(|\theta|\), the resulting estimator is \(\hat{\theta}=z\) whenever \(|z|\) is sufficiently large, so the estimator is nearly unbiased for large parameters.
- In order to get sparsity, we need \(\hat{\theta}=0\) when \(|z|\) is small, that is, \(0\) should be the minimizer of \(Q(\theta)\). From the equation above, this happens whenever
\[|z|<\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}, \]so sparsity requires this threshold to be positive.
- When \(|z|\) exceeds the threshold, the resulting estimator is some \(\hat{\theta}=\theta_0\) satisfying \(|\theta_0|+p^\prime_\lambda(|\theta_0|)=|z|\). For continuity in \(z\), \(\theta_0\) must go to zero as \(|z|\) decreases to the threshold, which requires \(\arg\min\limits_\theta\{|\theta|+p^\prime_\lambda(|\theta|)\}=0.\)
In conclusion, the conditions on the penalty function for the three properties are:
- Unbiasedness condition: \(p_\lambda^\prime(|\theta|)=0\), for large \(|\theta|\);
- Sparsity condition: \(\min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\);
- Continuity condition: \(\arg\min\limits_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=0.\)
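To see how these conditions shape the resulting estimator, here is a minimal sketch (Python/NumPy, using the SCAD penalty introduced later as an example, with illustrative values \(\lambda=1\), \(a=3.7\)) that minimizes the componentwise objective \(Q(\theta)\) on a grid for a range of \(z\):

```python
import numpy as np

def scad_penalty(theta, lam=1.0, a=3.7):
    """SCAD penalty p_lambda(|theta|), pieced together as in the text (illustrative lam, a)."""
    t = np.abs(theta)
    return np.where(t < lam, lam * t,
           np.where(t < a * lam,
                    -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def threshold_map(z_values, penalty, grid=np.linspace(-10, 10, 20001)):
    """For each z, minimize Q(theta) = (z - theta)^2 / 2 + p_lambda(|theta|) over a fine grid."""
    pen = penalty(grid)
    return np.array([grid[np.argmin(0.5 * (z - grid) ** 2 + pen)] for z in z_values])

z = np.linspace(-6, 6, 25)
print(np.round(threshold_map(z, scad_penalty), 2))
```

The printed map \(z\mapsto\hat{\theta}(z)\) sets small \(|z|\) exactly to zero (sparsity), has no jumps (continuity), and leaves large \(|z|\) essentially untouched (near-unbiasedness), matching the three conditions above.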
Examples
Now we revisit the OLS estimator, which corresponds to \(p_\lambda(|\theta|)=0\). It is obvious that
\[p_\lambda^\prime(|\theta|)\equiv0,\quad \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=\min_\theta|\theta|=0, \]attained at \(\theta=0\). Therefore, OLS satisfies unbiasedness and continuity while it does not satisfy sparsity.
Secondly, we consider ridge regression with \(p_\lambda(|\theta|)=\lambda|\theta|^2\). We can see that
\[\begin{align*} p_\lambda^\prime(|\theta|)&=2\lambda|\theta|\neq0 \,\, \text{for large }|\theta|,\\ \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}&=\min_\theta\{(1+2\lambda)|\theta|\}=0,\,\,\text{attained at }\theta=0. \end{align*} \]Therefore, the ridge regression estimator satisfies continuity while it does not satisfy unbiasedness and sparsity.
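Under the orthonormal design used above, the componentwise ridge problem \(\frac{1}{2}(z-\theta)^2+\lambda\theta^2\) has the closed-form solution \(\hat{\theta}=z/(1+2\lambda)\), obtained by setting its derivative to zero. A minimal sketch with an illustrative \(\lambda\):

```python
import numpy as np

lam = 1.0                        # illustrative value
z = np.linspace(-5, 5, 11)
theta_ridge = z / (1 + 2 * lam)  # proportional shrinkage: continuous, never exactly zero,
print(theta_ridge)               # and biased for large |z| (everything is shrunk by the same factor)
```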
Next, we consider the LASSO with \(p_\lambda(|\theta|)=\lambda|\theta|\). For large \(|\theta|\), we have
\[p_\lambda^\prime(|\theta|)=\lambda\neq0,\,\, \text{since } \lambda>0. \]For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\),
\[\begin{equation*} \begin{cases} H^\prime(\theta)=1>0,& \text{when } \theta>0,\\ H^\prime(\theta)=-1<0,& \text{when } \theta<0, \end{cases} \end{equation*} \]so that \(\arg\min\limits_\theta H(\theta)=0\) and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the LASSO estimator satisfies sparsity and continuity while it does not satisfy unbiasedness.
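For the LASSO, the componentwise minimizer of \(\frac{1}{2}(z-\theta)^2+\lambda|\theta|\) is the well-known soft-thresholding rule \(\hat{\theta}=\text{sgn}(z)(|z|-\lambda)_+\). A minimal sketch with an illustrative \(\lambda\):

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise LASSO solution: exact zeros for |z| <= lam, otherwise shrink by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.linspace(-5, 5, 11)
print(soft_threshold(z, lam=1.0))
# exact zeros for |z| <= 1 (sparsity), no jumps (continuity),
# but large |z| are shifted by lam, so the estimator remains biased
```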
Finally, we consider SCAD with penalty function
\[ \begin{equation*} p_\lambda(|\theta|)=\begin{cases} \lambda|\theta|,& \text{if } 0\leq|\theta|<\lambda,\\ -\dfrac{|\theta|^2-2a\lambda|\theta|+\lambda^2}{2(a-1)},& \text{if } \lambda\leq|\theta|<a\lambda,\\ (a+1)\lambda^2/2,&\text{if } |\theta|\geq a\lambda, \end{cases} \end{equation*} \]where \(a>2\). Its derivative is
\[p_\lambda^\prime(|\theta|)=\lambda\bigg\{I(|\theta|\leq\lambda)+\frac{(a\lambda-|\theta|)_+}{(a-1)\lambda}I(|\theta|>\lambda)\bigg\}, \]which equals \(0\) for \(|\theta|>a\lambda\), since the penalty is constant there; hence \(p^\prime_\lambda(|\theta|)=0\) for large \(|\theta|\). For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)\), we have
\[\begin{equation*} H^\prime(\theta)= \begin{cases} 1>0,& \text{when } 0<\theta \leq\lambda,\\ 1-\frac{1}{a-1}>0,&\text{when } \lambda<\theta\leq a\lambda \text{ (since } a>2\text{)},\\ 1>0,&\text{when } \theta>a\lambda, \end{cases} \end{equation*} \]and, by the symmetry \(H(-\theta)=H(\theta)\), \(H^\prime(\theta)<0\) for \(\theta<0\), so that \(\arg\min_\theta H(\theta)=0\) and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the SCAD estimator satisfies all three properties.
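For completeness, the componentwise SCAD problem also has a closed-form thresholding rule (given in Fan and Li, 2001, for the orthonormal design); the values \(\lambda=1\), \(a=3.7\) below are illustrative:

```python
import numpy as np

def scad_threshold(z, lam=1.0, a=3.7):
    """Closed-form SCAD thresholding rule for the componentwise problem (Fan & Li, 2001)."""
    z = np.asarray(z, dtype=float)
    az = np.abs(z)
    soft = np.sign(z) * np.maximum(az - lam, 0.0)            # soft-thresholding near zero
    middle = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)  # linear transition region
    return np.where(az <= 2 * lam, soft,
           np.where(az <= a * lam, middle, z))               # identity for large |z|: nearly unbiased

z = np.linspace(-6, 6, 25)
print(np.round(scad_threshold(z), 2))
# exact zeros for |z| <= 1, a continuous transition, and scad_threshold(z) == z for |z| > 3.7
```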
Conclusion
| | OLS | Ridge | LASSO | SCAD |
|---|---|---|---|---|
| Unbiasedness | \(\surd\) | \(\times\) | \(\times\) | \(\surd\) |
| Sparsity | \(\times\) | \(\times\) | \(\surd\) | \(\surd\) |
| Continuity | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\) |