Chapter 9 (Classical Statistical Inference): Classical Parameter Estimation

These are reading notes on Introduction to Probability.


Classical Statistical Inference

  • In the preceding chapter, we developed the Bayesian approach to inference, where unknown parameters are modeled as random variables. In all cases we worked within a single, fully-specified probabilistic model, and we based most of our derivations and calculations on judicious application of Bayes’ rule.

  • By contrast, in the present chapter we adopt a fundamentally different philosophy: we view the unknown parameter $\theta$ as a deterministic (not random) but unknown quantity. The observation $X$ is random, and its distribution $p_X(x;\theta)$ [if $X$ is discrete] or $f_X(x;\theta)$ [if $X$ is continuous] depends on the value of $\theta$.
  • Thus, instead of working within a single probabilistic model, we will be dealing simultaneously with multiple candidate models, one model for each possible value of $\theta$.
  • In this context, a “good” hypothesis testing or estimation procedure will be one that possesses certain desirable properties under every candidate model, that is, for every possible value of $\theta$. In some cases, this may be considered a worst-case viewpoint: a procedure is not considered to fulfill our specifications unless it does so against the worst possible value that $\theta$ can take (only a procedure that meets the requirement even in the worst case counts as a good one).
    • For example, we may require that the expected value of the estimation error be zero, or that the estimation error be small with high probability, for all possible values of the unknown parameter.

  • Our notation will generally indicate the dependence of probabilities and expected values on $\theta$.
    • For example, we will denote by $E_\theta[h(X)]$ the expected value of a random variable $h(X)$ as a function of $\theta$. Similarly, we will use the notation $P_\theta(A)$ to denote the probability of an event $A$.
    • Note that this only indicates a functional dependence, not conditioning in the probabilistic sense.

Classical Parameter Estimation

Properties of Estimators

  • Given observations $X = (X_1, \ldots, X_n)$, an estimator is a random variable of the form $\hat\Theta = g(X)$, for some function $g$.
  • Note that since the distribution of $X$ depends on $\theta$, the same is true for the distribution of $\hat\Theta$. We use the term estimate to refer to an actual realized value of $\hat\Theta$.

  • Sometimes, particularly when we are interested in the role of the number of observations $n$, we use the notation $\hat\Theta_n$ for an estimator. It is then also appropriate to view $\hat\Theta_n$ as a sequence of estimators (one for each value of $n$). The mean and variance of $\hat\Theta_n$ are denoted $E_\theta[\hat\Theta_n]$ and $\mathrm{var}_\theta(\hat\Theta_n)$, respectively. Both are numerical functions of $\theta$, but for simplicity, when the context is clear, we sometimes do not show this dependence.

Terminology Regarding Estimators

Let $\hat\Theta$ be an estimator of an unknown parameter $\theta$, that is, a function of $n$ observations $X_1, \ldots, X_n$ whose distribution depends on $\theta$.

  • The estimation error, denoted by $\tilde\Theta_n$, is defined by $\tilde\Theta_n = \hat\Theta_n - \theta$.
  • The bias of the estimator, denoted by $b_\theta(\hat\Theta_n)$, is the expected value of the estimation error:
    $$b_\theta(\hat\Theta_n) = E_\theta[\hat\Theta_n] - \theta$$
  • The expected value, the variance, and the bias of $\hat\Theta_n$ depend on $\theta$, while the estimation error depends in addition on the observations $X_1, \ldots, X_n$.
  • We call $\hat\Theta_n$ unbiased if $E_\theta[\hat\Theta_n] = \theta$, for every possible value of $\theta$.
  • We call $\hat\Theta_n$ asymptotically unbiased if $\lim_{n\rightarrow\infty} E_\theta[\hat\Theta_n] = \theta$, for every possible value of $\theta$.
  • We call $\hat\Theta_n$ consistent if the sequence $\hat\Theta_n$ converges to the true value of the parameter $\theta$, in probability, for every possible value of $\theta$.

  • Besides the bias $b_\theta(\hat\Theta_n)$, we are usually interested in the size of the estimation error. This is captured by the mean squared error $E_\theta[\tilde\Theta_n^2]$, which is related to the bias and the variance of $\hat\Theta_n$ according to the following formula:
    $$E_\theta[\tilde\Theta_n^2] = b^2_\theta(\hat\Theta_n) + \mathrm{var}_\theta(\hat\Theta_n)$$
  • This formula is important because in many statistical problems there is a tradeoff between the two terms on the right-hand side: often a reduction in the variance is accompanied by an increase in the bias. Of course, a good estimator is one that manages to keep both terms small. A small simulation check of this decomposition is sketched below.
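The following is a minimal simulation sketch (not from the text; the sample size, true variance, and number of trials are arbitrary choices) that checks the decomposition $E_\theta[\tilde\Theta_n^2] = b_\theta^2(\hat\Theta_n) + \mathrm{var}_\theta(\hat\Theta_n)$ numerically, using the biased $1/n$ variance estimator of normal data as the estimator.

```python
# Sketch: verify MSE = bias^2 + variance for the 1/n sample variance of normal data.
import numpy as np

rng = np.random.default_rng(0)
v_true = 4.0              # true variance (the quantity being estimated)
n, trials = 10, 200_000   # sample size and number of Monte Carlo repetitions

samples = rng.normal(loc=0.0, scale=np.sqrt(v_true), size=(trials, n))
s_bar2 = samples.var(axis=1)              # S̄_n^2 with the 1/n convention (ddof=0)

mse = np.mean((s_bar2 - v_true) ** 2)     # direct mean squared error
bias = s_bar2.mean() - v_true             # b(S̄_n^2) ≈ -v/n
decomposed = bias ** 2 + s_bar2.var()     # bias^2 + variance

print(f"MSE directly:      {mse:.4f}")
print(f"bias^2 + variance: {decomposed:.4f}")  # the two numbers should agree closely
```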

Maximum Likelihood Estimation

This is a general method that bears similarity to MAP estimation.

  • Let the vector of observations $X = (X_1, \ldots, X_n)$ be described by a joint PMF $p_X(x;\theta)$ whose form depends on an unknown (scalar or vector) parameter $\theta$. Suppose we observe a particular value $x = (x_1, \ldots, x_n)$ of $X$. Then, a maximum likelihood (ML) estimate is a value of the parameter that maximizes the numerical function $p_X(x_1, \ldots, x_n; \theta)$ over all $\theta$:
    $$\hat\theta_n = \arg\max_\theta p_X(x_1, \ldots, x_n; \theta)$$
    For the case where $X$ is continuous,
    $$\hat\theta_n = \arg\max_\theta f_X(x_1, \ldots, x_n; \theta)$$
  • We refer to $p_X(x;\theta)$ [or $f_X(x;\theta)$ if $X$ is continuous] as the likelihood function.

  • In many applications, the observations $X_i$ are assumed to be independent, in which case the likelihood function is of the form
    $$p_X(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n p_{X_i}(x_i; \theta)$$
    (for discrete $X_i$). In this case, it is often analytically or computationally convenient to maximize its logarithm, called the log-likelihood function,
    $$\log p_X(x_1, \ldots, x_n; \theta) = \sum_{i=1}^n \log p_{X_i}(x_i; \theta)$$
    over $\theta$. When $X$ is continuous, there is a similar possibility, with PMFs replaced by PDFs: we maximize over $\theta$ the expression
    $$\log f_X(x_1, \ldots, x_n; \theta) = \sum_{i=1}^n \log f_{X_i}(x_i; \theta)$$
    A small numerical sketch of this maximization follows after this list.

  • Recall that in Bayesian MAP estimation, the estimate is chosen to maximize the expression $p_\Theta(\theta)\, p_{X|\Theta}(x|\theta)$ over all $\theta$, where $p_\Theta(\theta)$ is the prior PMF of an unknown discrete parameter $\theta$. Thus, if we view $p_X(x;\theta)$ as a conditional PMF, we may interpret ML estimation as MAP estimation with a flat prior, i.e., a prior which is the same for all $\theta$, indicating the absence of any useful prior knowledge.
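As a concrete numerical sketch (an assumed Bernoulli setup, not from the text), the ML estimate can be found by evaluating the log-likelihood $\sum_i \log p_{X_i}(x_i;\theta)$ over a grid of candidate $\theta$ values; the grid maximizer essentially coincides with the sample mean, which is the closed-form ML estimate for Bernoulli data.

```python
# Sketch: grid-based maximization of the Bernoulli log-likelihood.
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.3
x = rng.binomial(1, true_theta, size=200)   # observed 0/1 data (simulated)

thetas = np.linspace(0.001, 0.999, 999)     # candidate parameter values
# log p(x_i; theta) = x_i*log(theta) + (1 - x_i)*log(1 - theta), summed over i
log_lik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)

theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml, x.mean())   # grid maximizer vs. the closed-form sample mean
```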

Example 9.1.

  • Let us revisit Example 8.2, in which Juliet is always late by an amount $X$ that is uniformly distributed over the interval $[0, \theta]$, and $\theta$ is an unknown parameter. In that example, we used a random variable $\Theta$ with flat prior PDF $f_\Theta(\theta)$ (uniform over the interval $[0, 1]$) to model the parameter, and we showed that the MAP estimate is the value $x$ of $X$.
  • In the classical context of this section, there is no prior, and $\theta$ is treated as a constant, but the ML estimate is again $\hat\theta = x$. The resulting estimator is $\hat\Theta = X$.

Example 9.4. Estimating the Mean and Variance of a Normal.

  • Consider the problem of estimating the mean $\mu$ and variance $v$ of a normal distribution using $n$ independent observations $X_1, \ldots, X_n$. The parameter vector here is $\theta = (\mu, v)$. The corresponding likelihood function is
    $$f_X(x; \mu, v) = \prod_{i=1}^n f_{X_i}(x_i; \mu, v) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi v}} e^{-(x_i-\mu)^2/2v} = \frac{1}{(2\pi v)^{n/2}} \prod_{i=1}^n e^{-(x_i-\mu)^2/2v}$$
    After some calculation it can be written as
    $$f_X(x; \mu, v) = \frac{1}{(2\pi v)^{n/2}} \cdot \exp\bigg\{-\frac{n s_n^2}{2v}\bigg\} \cdot \exp\bigg\{-\frac{n(m_n-\mu)^2}{2v}\bigg\}$$
    where $m_n$ is the realized value of the random variable
    $$M_n = \frac{1}{n} \sum_{i=1}^n X_i$$
    and $s_n^2$ is the realized value of the random variable
    $$\overline S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - M_n)^2$$
    • To verify this, write, for $i = 1, \ldots, n$,
      $$(x_i-\mu)^2 = (x_i - m_n + m_n - \mu)^2 = (x_i - m_n)^2 + (m_n - \mu)^2 + 2(x_i - m_n)(m_n - \mu)$$
      sum over $i$, and note that
      $$\sum_{i=1}^n (x_i - m_n)(m_n - \mu) = (m_n - \mu) \sum_{i=1}^n (x_i - m_n) = 0$$
  • The log-likelihood function is
    $$\log f_X(x; \mu, v) = -\frac{n}{2} \cdot \log(2\pi) - \frac{n}{2} \cdot \log(v) - \frac{n s_n^2}{2v} - \frac{n(m_n-\mu)^2}{2v}$$
    Setting to zero the derivatives of this function with respect to $\mu$ and $v$, we obtain the estimate and estimator, respectively,
    $$\hat\theta_n = (m_n, s_n^2), \qquad \hat\Theta_n = (M_n, \overline S_n^2)$$
    Note that $M_n$ is the sample mean, while $\overline S_n^2$ may be viewed as a “sample variance.” As will be shown shortly, $E_\theta[\overline S_n^2]$ converges to $v$ as $n$ increases, so that $\overline S_n^2$ is asymptotically unbiased. Using also the weak law of large numbers, it can be shown that $M_n$ and $\overline S_n^2$ are consistent estimators of $\mu$ and $v$, respectively. A short numerical check of these closed-form ML estimates appears below.
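The following minimal check (not from the text; the data and grid ranges are arbitrary) evaluates the log-likelihood of Example 9.4 on a grid of $(\mu, v)$ pairs and confirms that the grid maximizer matches the closed-form $(m_n, s_n^2)$.

```python
# Sketch: the normal log-likelihood is maximized at (m_n, s̄_n^2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=50)   # simulated observations
n = len(x)

m_n = x.mean()
s_bar2 = x.var()          # ddof=0, i.e. the 1/n "sample variance"

mus = np.linspace(m_n - 1.0, m_n + 1.0, 201)
vs = np.linspace(0.5 * s_bar2, 2.0 * s_bar2, 201)
MU, V = np.meshgrid(mus, vs)

# log f_X(x; mu, v) = -(n/2)log(2*pi) - (n/2)log(v) - n*s̄_n^2/(2v) - n*(m_n - mu)^2/(2v)
log_lik = (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(V)
           - n * s_bar2 / (2 * V) - n * (m_n - MU) ** 2 / (2 * V))

i, j = np.unravel_index(np.argmax(log_lik), log_lik.shape)
print("grid maximizer:", MU[i, j], V[i, j])
print("closed form:   ", m_n, s_bar2)
```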

  • Maximum likelihood estimation has some appealing properties.
    • For example, it obeys the invariance principle: if $\hat\Theta_n$ is the ML estimate of $\theta$, then for any one-to-one function $h$ of $\theta$, the ML estimate of the parameter $\zeta = h(\theta)$ is $h(\hat\Theta_n)$.
    • Also, when the observations are i.i.d. (independent, identically distributed), and under some mild additional assumptions, it can be shown that the ML estimator is consistent.
    • Another interesting property is that when $\theta$ is a scalar parameter, then under some mild conditions, the ML estimator has an asymptotic normality property. In particular, it can be shown that the distribution of $(\hat\Theta_n - \theta)/\sigma(\hat\Theta_n)$, where $\sigma^2(\hat\Theta_n)$ is the variance of $\hat\Theta_n$, approaches a standard normal distribution. Thus, if we are able to also estimate $\sigma(\hat\Theta_n)$, we can use it to derive an error variance estimate based on a normal approximation. When $\theta$ is a vector parameter, a similar statement applies to each of its components.

Estimation of the Mean and Variance of a Random Variable

  • Suppose that the observations $X_1, \ldots, X_n$ are i.i.d., with an unknown common mean $\theta$. The most natural estimator of $\theta$ is the sample mean:
    $$M_n = \frac{X_1 + \cdots + X_n}{n}$$
    This estimator is unbiased. Its mean squared error is equal to its variance, which is $v/n$, where $v$ is the common variance of the $X_i$. Furthermore, by the weak law of large numbers, this estimator converges to $\theta$ in probability, and is therefore consistent.
  • Suppose that we are interested in an estimator of the variance $v$. A natural one is
    $$\overline S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - M_n)^2$$
    which coincides with the ML estimator derived in Example 9.4 under a normality assumption. We have
    $$\begin{aligned}E_{(\theta,v)}[\overline S_n^2] &= \frac{1}{n} E_{(\theta,v)}\bigg[\sum_{i=1}^n X_i^2 - 2 M_n \sum_{i=1}^n X_i + n M_n^2\bigg] \\ &= E_{(\theta,v)}\bigg[\frac{1}{n}\sum_{i=1}^n X_i^2 - 2 M_n^2 + M_n^2\bigg] \\ &= E_{(\theta,v)}\bigg[\frac{1}{n}\sum_{i=1}^n X_i^2 - M_n^2\bigg] \\ &= E_{(\theta,v)}[X^2] - E_{(\theta,v)}[M_n^2] \\ &= (\theta^2 + v) - (\theta^2 + v/n) \\ &= \frac{n-1}{n} v\end{aligned}$$
    Thus, $\overline S_n^2$ is not an unbiased estimator of $v$, although it is asymptotically unbiased. We can obtain an unbiased variance estimator after suitable scaling. This is the estimator
    $$\hat S_n^2 = \frac{n}{n-1} \overline S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M_n)^2$$
    A simulation sketch contrasting these two variance estimators follows below.
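The following quick simulation sketch (simulated normal data with an assumed true variance, not from the text) contrasts the biased $1/n$ estimator $\overline S_n^2$ with the unbiased $1/(n-1)$ estimator $\hat S_n^2$, reproducing the factor $(n-1)/n$ derived above.

```python
# Sketch: E[S̄_n^2] ≈ (n-1)/n * v, while E[Ŝ_n^2] ≈ v.
import numpy as np

rng = np.random.default_rng(3)
v_true, n, trials = 9.0, 5, 500_000
samples = rng.normal(0.0, np.sqrt(v_true), size=(trials, n))

s_bar2 = samples.var(axis=1, ddof=0)   # divides by n
s_hat2 = samples.var(axis=1, ddof=1)   # divides by n - 1

print(np.mean(s_bar2), (n - 1) / n * v_true)   # biased: ≈ (n-1)/n * v
print(np.mean(s_hat2), v_true)                 # unbiased: ≈ v
```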

Confidence Intervals

  • Consider an estimator $\hat\Theta$ of an unknown parameter $\theta$. Besides the numerical value provided by an estimate, we are often interested in constructing a so-called confidence interval. Roughly speaking, this is an interval that contains $\theta$ with a certain high probability, for every possible value of $\theta$.
  • For a precise definition, let us first fix a desired confidence level, $1-\alpha$, where $\alpha$ is typically a small number. We then replace the point estimator $\hat\Theta_n$ by a lower estimator $\hat\Theta_n^-$ and an upper estimator $\hat\Theta_n^+$, designed so that $\hat\Theta_n^- \leq \hat\Theta_n^+$, and
    $$P_\theta(\hat\Theta_n^- \leq \theta \leq \hat\Theta_n^+) \geq 1 - \alpha$$
    for every possible value of $\theta$. Note that, similar to estimators, $\hat\Theta_n^-$ and $\hat\Theta_n^+$ are functions of the observations, and hence random variables whose distributions depend on $\theta$. We call $[\hat\Theta_n^-, \hat\Theta_n^+]$ a $\mathbf{1-\alpha}$ confidence interval.

Example 9.6.

  • Suppose that the observations $X_i$ are i.i.d. normal, with unknown mean $\theta$ and known variance $v$. Then, the sample mean estimator
    $$\hat\Theta_n = \frac{X_1 + \cdots + X_n}{n}$$
    is normal, with mean $\theta$ and variance $v/n$.
  • Let $\alpha = 0.05$. Using the CDF $\Phi(z)$ of the standard normal (available in the normal tables), we have $\Phi(1.96) = 0.975 = 1 - \alpha/2$, and we obtain
    $$P_\theta\bigg(\frac{|\hat\Theta_n - \theta|}{\sqrt{v/n}} \leq 1.96\bigg) = 1 - \alpha = 0.95$$
    We can rewrite this statement in the form
    $$P_\theta\bigg(\hat\Theta_n - 1.96\sqrt{\frac{v}{n}} \leq \theta \leq \hat\Theta_n + 1.96\sqrt{\frac{v}{n}}\bigg) = 0.95$$
    which implies that
    $$\bigg[\hat\Theta_n - 1.96\sqrt{\frac{v}{n}},\ \hat\Theta_n + 1.96\sqrt{\frac{v}{n}}\bigg]$$
    is a 95% confidence interval. A coverage-check sketch for this interval is given below.
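The sketch below (an assumed setup with arbitrary $\theta$, $v$, $n$, and trial count, not from the text) builds the 95% interval of Example 9.6 repeatedly and counts how often it contains the fixed true mean; the empirical coverage should be close to 0.95.

```python
# Sketch: coverage of the known-variance 95% interval for a normal mean.
import numpy as np

rng = np.random.default_rng(4)
theta, v, n, trials = 2.0, 4.0, 25, 100_000
half_width = 1.96 * np.sqrt(v / n)

x = rng.normal(theta, np.sqrt(v), size=(trials, n))
theta_hat = x.mean(axis=1)                         # sample mean for each repetition
covered = np.abs(theta_hat - theta) <= half_width  # does the interval contain theta?
print(covered.mean())                              # ≈ 0.95 for every fixed theta
```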

Out of a variety of possible confidence intervals, one with the smallest possible width is usually desirable.


  • In the preceding example, we may be tempted to describe the concept of a 95% confidence interval by a statement such as “the true parameter lies in the confidence interval with probability 0.95.” Such statements, however, can be ambiguous. For example, suppose that after the observations are obtained, the confidence interval turns out to be $[-2.3, 4.1]$. We cannot say that “$\theta$ lies in $[-2.3, 4.1]$ with probability 0.95,” because the latter statement does not involve any random variables; after all, in the classical approach, $\theta$ is a constant.
  • For a concrete interpretation, suppose that θ \theta θ is fixed. We construct a confidence interval many times, using the same statistical procedure, i.e., each time, we obtain an independent collection of n n n observations and construct the corresponding 95% confidence interval. We then expect that about 95% of these confidence intervals will include θ \theta θ. This should be true regardless of what the value of θ \theta θ is.

  • The construction of confidence intervals is sometimes hard. Fortunately, for many important models, $\hat\Theta_n - \theta$ is asymptotically normal and asymptotically unbiased. By this we mean that the CDF of the random variable
    $$\frac{\hat\Theta_n - \theta}{\sqrt{\mathrm{var}_\theta(\hat\Theta_n)}}$$
    approaches the standard normal CDF as $n$ increases, for every value of $\theta$. We may then proceed exactly as in Example 9.6, provided that $\mathrm{var}_\theta(\hat\Theta_n)$ is known or can be approximated.

Confidence Intervals Based on Estimator Variance Approximations

  • Suppose that the observations $X_i$ are i.i.d. with mean $\theta$ and variance $v$ that are unknown. We may estimate $\theta$ with the sample mean
    $$\hat\Theta_n = \frac{X_1 + \cdots + X_n}{n}$$
    and estimate $v$ with the unbiased estimator
    $$\hat S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M_n)^2$$
  • In particular, we may estimate the variance $v/n$ of the sample mean by $\hat S_n^2/n$. Then, for a given $\alpha$, we may use these estimates and the central limit theorem to construct an (approximate) $1-\alpha$ confidence interval. This is the interval
    $$\bigg[\hat\Theta_n - z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n + z\frac{\hat S_n}{\sqrt n}\bigg]$$
    where $z$ is obtained from the relation
    $$\Phi(z) = 1 - \frac{\alpha}{2}$$
    and the normal tables. A small sketch of this construction follows below.
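A minimal sketch of this construction (the helper name and the made-up data are assumptions, not from the text): the estimated standard error $\hat S_n/\sqrt n$ is combined with the normal quantile $z = \Phi^{-1}(1-\alpha/2)$.

```python
# Sketch: approximate 1 - alpha interval based on the estimated variance.
import numpy as np
from scipy.stats import norm

def approx_confidence_interval(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    s_hat = x.std(ddof=1)          # Ŝ_n, based on the unbiased variance estimator
    z = norm.ppf(1 - alpha / 2)    # e.g. 1.96 for alpha = 0.05
    half = z * s_hat / np.sqrt(n)
    return theta_hat - half, theta_hat + half

# usage on made-up (non-normal) data; the CLT justifies the approximation
rng = np.random.default_rng(5)
print(approx_confidence_interval(rng.exponential(scale=3.0, size=200)))
```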

  • Note that in this approach, there are two different approximations in effect. First, we are treating $\hat\Theta_n$ as if it were a normal random variable; second, we are replacing the true variance $v/n$ of $\hat\Theta_n$ by its estimate $\hat S_n^2/n$.
  • Even in the special case where the $X_i$ are normal random variables, the confidence interval produced by the preceding procedure is still approximate. The reason is that $\hat S_n^2$ is only an approximation to the true variance $v$, and the random variable
    $$T_n = \frac{\sqrt n (\hat\Theta_n - \theta)}{\hat S_n}$$
    is not normal. However, for normal $X_i$, it can be shown that the PDF of $T_n$ does not depend on $\theta$ and $v$, and can be computed explicitly. It is called the $t$-distribution with $n-1$ degrees of freedom. Like the standard normal PDF, it is symmetric and bell-shaped, but it is a little more spread out and has heavier tails. The probabilities of various intervals of interest are available in tables, similar to the normal tables.
  • Thus, when the $X_i$ are normal (or nearly normal) and $n$ is relatively small, a more accurate confidence interval is of the form
    $$\bigg[\hat\Theta_n - z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n + z\frac{\hat S_n}{\sqrt n}\bigg]$$
    where $z$ is obtained from the relation
    $$\Psi_{n-1}(z) = 1 - \frac{\alpha}{2}$$
    and $\Psi_{n-1}$ is the CDF of the $t$-distribution with $n-1$ degrees of freedom; see the sketch below.
  • On the other hand, when $n$ is moderately large (e.g., $n \geq 50$), the $t$-distribution is very close to the normal distribution, and the normal tables can be used.
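For small samples of (nearly) normal data, the same construction with the $t$-quantile $\Psi_{n-1}^{-1}(1-\alpha/2)$ in place of the normal quantile can be sketched as follows (the function name is an assumption, not from the text).

```python
# Sketch: t-based confidence interval for the mean with n - 1 degrees of freedom.
import numpy as np
from scipy.stats import t

def t_confidence_interval(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)        # Ŝ_n / sqrt(n)
    z = t.ppf(1 - alpha / 2, df=n - 1)     # Ψ_{n-1}^{-1}(1 - α/2)
    return theta_hat - z * se, theta_hat + z * se
```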

Example 9.7.

  • The weight of an object is measured eight times using an electronic scale that reports the true weight plus a random error that is normally distributed with zero mean and unknown variance. Assume that the errors in the observations are independent. The following results are obtained:
    (table of the eight measured values not reproduced here)
  • We compute a 95% confidence interval ($\alpha = 0.05$) using the $t$-distribution. The value of the sample mean $\hat\Theta_n$ is 0.5747, and $\hat S_n/\sqrt n$ is 0.0182. From the $t$-distribution tables (with $n - 1 = 7$ degrees of freedom), we obtain $1 - \Psi_7(2.365) = 0.025 = \alpha/2$, so that
    $$\bigg[\hat\Theta_n - z\frac{\hat S_n}{\sqrt n},\ \hat\Theta_n + z\frac{\hat S_n}{\sqrt n}\bigg] = [0.531, 0.618]$$
    is a 95% confidence interval. The arithmetic is reproduced below.
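The interval arithmetic of Example 9.7 can be reproduced directly from the reported summary values (sample mean 0.5747, estimated standard error 0.0182, and $t$-quantile 2.365).

```python
# Sketch: Example 9.7 interval from its reported summary statistics.
theta_hat, se, z = 0.5747, 0.0182, 2.365
print(theta_hat - z * se, theta_hat + z * se)   # ≈ (0.531, 0.618)
```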

  • The approximate confidence intervals constructed so far relied on the particular estimator $\hat S_n^2$ for the unknown variance $v$. However, different estimators or approximations of the variance are possible.
    • For example, suppose that the observations $X_1, \ldots, X_n$ are i.i.d. Bernoulli with unknown mean $\theta$ and variance $v = \theta(1-\theta)$. Then, instead of $\hat S_n^2$, the variance could be approximated by $\hat\Theta(1-\hat\Theta)$. Another possibility is to just observe that $\theta(1-\theta) \leq 1/4$ for all $\theta \in [0,1]$, and use $1/4$ as a conservative estimate of the variance; both options are compared in the sketch below.
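The short sketch below (assumed simulated Bernoulli data, not from the text) compares the resulting interval half-widths from the plug-in approximation $\hat\Theta(1-\hat\Theta)$ and the conservative bound $1/4$.

```python
# Sketch: two variance approximations for a Bernoulli confidence interval.
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.2, size=400)      # i.i.d. Bernoulli observations (simulated)
n, theta_hat = len(x), x.mean()

se_plugin = np.sqrt(theta_hat * (1 - theta_hat) / n)   # uses Θ̂(1 - Θ̂)
se_conservative = np.sqrt(0.25 / n)                    # uses the bound θ(1-θ) ≤ 1/4
print(1.96 * se_plugin, 1.96 * se_conservative)        # conservative half-width is wider
```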