PRML Reading Notes, Chapter 2: Probability Distributions

2.1. Binary Variables

1. Bernoulli distribution, p(x = 1|µ) = µ

  Bern(x|µ) = µ^x (1 − µ)^(1−x)

with mean and variance

  E[x] = µ,  var[x] = µ(1 − µ)

2. Binomial distribution

  Bin(m|N, µ) = C(N, m) µ^m (1 − µ)^(N−m),  where C(N, m) = N!/((N − m)! m!)

  E[m] = Nµ,  var[m] = Nµ(1 − µ)

3. Beta distribution (Conjugate Prior of the Bernoulli distribution)

  Beta(µ|a, b) = Γ(a + b)/(Γ(a)Γ(b)) µ^(a−1) (1 − µ)^(b−1)

  E[µ] = a/(a + b)

  var[µ] = ab/((a + b)² (a + b + 1))

Multiplying the beta prior by the Bernoulli likelihood and normalizing gives the posterior

  p(µ|m, l, a, b) = Γ(m + a + l + b)/(Γ(m + a) Γ(l + b)) µ^(m+a−1) (1 − µ)^(l+b−1)

The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ.

Here m denotes the number of observations of x = 1 and l the number of observations of x = 0, so the effect of the data is simply to increment a by m and b by l.

the variance goes to zero for a → ∞ or b → ∞. It is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.
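A minimal sketch of this conjugate update in Python (the function name and toy data are illustrative, not from the book):

  # Sequential Bayesian updating of a Beta(a, b) prior from Bernoulli data.
  # After m observations of x = 1 and l observations of x = 0, the
  # posterior is Beta(a + m, b + l).
  def beta_update(a, b, observations):
      for x in observations:
          if x == 1:
              a += 1  # one more observation of x = 1
          else:
              b += 1  # one more observation of x = 0
      return a, b

  a, b = 2.0, 2.0                              # prior hyperparameters
  a, b = beta_update(a, b, [1, 0, 1, 1, 0, 1])
  mean = a / (a + b)                           # posterior mean E[mu]
  var = a * b / ((a + b) ** 2 * (a + b + 1))   # posterior variance
  print(a, b, mean, var)

Note how the variance shrinks as a and b grow, matching the observation above that the posterior uncertainty decreases as more data arrive.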

2.2. Multinomial Variables

1. Consider discrete variables that can take on one of K possible mutually exclusive states.

One of the elements xk equals 1, and all remaining elements equal 0. For example, if K = 6 and x3 = 1:

  x = (0, 0, 1, 0, 0, 0)T

The distribution of x is then

  p(x|µ) = ∏k µk^(xk)

where the parameters satisfy µk ≥ 0 and Σk µk = 1. The distribution is normalized, and its mean is

  E[x|µ] = Σx p(x|µ) x = µ

Consider a data set D of N independent observations x1, . . . , xN. The corresponding likelihood function takes the form:

  p(D|µ) = ∏n ∏k µk^(xnk) = ∏k µk^(mk)

where

  mk = Σn xnk

is the number of observations of xk = 1.

To find the maximum likelihood solution for µ, we maximize the log likelihood subject to the constraint Σk µk = 1, using a Lagrange multiplier λ:

  Σk mk ln µk + λ(Σk µk − 1)

Setting the derivative with respect to µk to zero, we obtain:

  µk = −mk/λ

Substituting into the constraint Σk µk = 1 gives λ = −N, so the maximum likelihood solution takes the form:

  µk^ML = mk/N

which is the fraction of the N observations for which xk = 1.

Consider the joint distribution of the quantities m1, . . . , mK, conditioned on the parameters µ and on the total number N of observations:

  Mult(m1, . . . , mK|µ, N) = (N!/(m1! m2! · · · mK!)) ∏k µk^(mk)

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m1, . . . , mK, and the counts are subject to the constraint

  Σk mk = N

2. The Dirichlet distribution (Conjugate Prior of the Multinomial Distribution)

  Dir(µ|α) = Γ(α0)/(Γ(α1) · · · Γ(αK)) ∏k µk^(αk−1),  where α0 = Σk αk

Multiplying the prior by the multinomial likelihood and normalizing, the posterior is again a Dirichlet:

  p(µ|D, α) = Dir(µ|α + m) = Γ(α0 + N)/(Γ(α1 + m1) · · · Γ(αK + mK)) ∏k µk^(αk+mk−1)

where

m = (m1, . . . , mK)T

we can interpret the parameters αk of the Dirichlet prior as an effective number of observations of xk = 1.

The binomial distribution is recovered as the special case of the multinomial distribution with K = 2, and likewise the beta distribution is the special case of the Dirichlet with K = 2.
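A small sketch of the corresponding Dirichlet-multinomial update, assuming NumPy (variable names and toy data are illustrative):

  import numpy as np

  # Dirichlet posterior for multinomial data: alpha_post = alpha + m,
  # where m_k counts how many observations fell in state k.
  alpha = np.array([1.0, 1.0, 1.0])             # symmetric prior, K = 3
  data = np.array([0, 2, 1, 2, 2, 0])           # observed states as indices
  m = np.bincount(data, minlength=alpha.size)   # counts m_k
  alpha_post = alpha + m                        # conjugate update
  post_mean = alpha_post / alpha_post.sum()     # E[mu_k] = alpha_k / alpha_0
  print(alpha_post, post_mean)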

2.3. The Gaussian Distribution

For a single variable x, the Gaussian distribution is

  N(x|µ, σ²) = (1/(2πσ²)^(1/2)) exp{−(x − µ)²/(2σ²)}

For a D-dimensional vector x:

  N(x|µ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp{−(1/2)(x − µ)^T Σ⁻¹ (x − µ)}

where µ is a D-dimensional mean vector and

Σ is a D × D covariance matrix.

eigenvector equation for the covariance matrix:

  Σui = λiui

Σ can be expressed as an expansion in terms of its eigenvectors in the form:

  Σ = Σi λi ui ui^T

and similarly for the inverse covariance:

  Σ⁻¹ = Σi (1/λi) ui ui^T

Define:

  yi = ui^T (x − µ)

so that the quadratic form in the exponent becomes

  Δ² = (x − µ)^T Σ⁻¹ (x − µ) = Σi yi²/λi

in the yj coordinate system, the Gaussian distribution takes the form

  p(y) = ∏j (1/(2πλj)^(1/2)) exp{−yj²/(2λj)}

This confirms that the multivariate Gaussian is indeed normalized.

The first- and second-order moments of the Gaussian distribution are:

  E[x] = µ

  E[xx^T] = µµ^T + Σ

  cov[x] = E[(x − E[x])(x − E[x])^T] = Σ
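The eigendecomposition above also gives a convenient way to draw samples from N(µ, Σ): generate independent unit Gaussians and map them through U diag(λi^(1/2)). A sketch assuming NumPy, with made-up values of µ and Σ:

  import numpy as np

  rng = np.random.default_rng(0)
  mu = np.array([1.0, -1.0])
  Sigma = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

  lam, U = np.linalg.eigh(Sigma)       # eigenvalues lambda_i, eigenvectors u_i
  A = U @ np.diag(np.sqrt(lam))        # A A^T = Sigma
  z = rng.standard_normal((10000, 2))  # independent unit-variance samples
  x = mu + z @ A.T

  print(x.mean(axis=0))                # approximately mu    (E[x] = mu)
  print(np.cov(x.T))                   # approximately Sigma (cov[x] = Sigma)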

2.3.1 Conditional Gaussian distributions

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.

Partitioning x into two disjoint subsets xa and xb, with corresponding partitions of µ and Σ, the mean and covariance of the conditional distribution p(xa|xb) are:

  µa|b = µa + Σab Σbb⁻¹ (xb − µb)

  Σa|b = Σaa − Σab Σbb⁻¹ Σba

Note that the conditional mean is a linear function of xb, while the conditional covariance is independent of xb.
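A sketch of these formulas, assuming NumPy; the partition indices and numerical values are illustrative:

  import numpy as np

  # Conditional p(x_a | x_b) for a partitioned joint Gaussian:
  #   mu_{a|b}    = mu_a + Sigma_ab inv(Sigma_bb) (x_b - mu_b)
  #   Sigma_{a|b} = Sigma_aa - Sigma_ab inv(Sigma_bb) Sigma_ba
  def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
      mu_a, mu_b = mu[idx_a], mu[idx_b]
      S_aa = Sigma[np.ix_(idx_a, idx_a)]
      S_ab = Sigma[np.ix_(idx_a, idx_b)]
      S_bb = Sigma[np.ix_(idx_b, idx_b)]
      gain = S_ab @ np.linalg.inv(S_bb)
      mu_cond = mu_a + gain @ (x_b - mu_b)
      Sigma_cond = S_aa - gain @ S_ab.T   # Sigma_ba = Sigma_ab^T by symmetry
      return mu_cond, Sigma_cond

  mu = np.array([0.0, 1.0])
  Sigma = np.array([[1.0, 0.5],
                    [0.5, 2.0]])
  print(conditional_gaussian(mu, Sigma, [0], [1], np.array([2.0])))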

2.3.2 Marginal Gaussian distributions

  p(xa) = ∫ p(xa, xb) dxb

The marginal distribution p(xa) is also Gaussian, with mean and covariance given by:

  E[xa] = µa,  cov[xa] = Σaa

2.3.3 Bayes’ theorem for Gaussian variables

Here we shall suppose that we are given a Gaussian marginal distribution p(x) and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a linear function of x, and a covariance which is independent of x. We wish to find the marginal distribution p(y) and the conditional distribution p(x|y).

Given

  p(x) = N(x|µ, Λ⁻¹)
  p(y|x) = N(y|Ax + b, L⁻¹)

where Λ and L are precision matrices, the required distributions are

  p(y) = N(y|Aµ + b, L⁻¹ + AΛ⁻¹A^T)
  p(x|y) = N(x|Σ{A^T L(y − b) + Λµ}, Σ)

where

  Σ = (Λ + A^T L A)⁻¹
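These results translate directly into code. A sketch assuming NumPy, with made-up one-dimensional parameters:

  import numpy as np

  # Linear-Gaussian model: p(x) = N(x | mu, inv(Lam)),
  # p(y|x) = N(y | A x + b, inv(L)). Returns the moments of
  # p(y) and of p(x|y) as given above.
  def marginal_and_posterior(mu, Lam, A, b, L, y):
      y_mean = A @ mu + b
      y_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
      S = np.linalg.inv(Lam + A.T @ L @ A)
      x_mean = S @ (A.T @ L @ (y - b) + Lam @ mu)
      return (y_mean, y_cov), (x_mean, S)

  mu = np.array([0.0]); Lam = np.array([[1.0]])
  A = np.array([[2.0]]); b = np.array([1.0]); L = np.array([[4.0]])
  print(marginal_and_posterior(mu, Lam, A, b, L, y=np.array([3.0])))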

2.3.4 Maximum likelihood for the Gaussian

The log likelihood function is given by

  ln p(X|µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σn (xn − µ)^T Σ⁻¹ (xn − µ)

we see that the likelihood function depends on the data set only through the two quantities

  Σn xn,  Σn xn xn^T

The maximum likelihood estimates of the mean and covariance matrix are given by

  µML = (1/N) Σn xn

  ΣML = (1/N) Σn (xn − µML)(xn − µML)^T

Evaluating the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results:

  E[µML] = µ

  E[ΣML] = ((N − 1)/N) Σ

We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased. We can correct this bias by defining a different estimator Σ̃ given by

  Σ̃ = (1/(N − 1)) Σn (xn − µML)(xn − µML)^T
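A quick numerical check of the two covariance estimators, assuming NumPy; the toy data are arbitrary:

  import numpy as np

  X = np.random.default_rng(1).standard_normal((500, 3))  # N x D data
  N = X.shape[0]

  mu_ml = X.mean(axis=0)                 # (1/N) sum_n x_n
  diff = X - mu_ml
  Sigma_ml = diff.T @ diff / N           # biased ML estimate
  Sigma_tilde = diff.T @ diff / (N - 1)  # bias-corrected estimator
  print(mu_ml, Sigma_ml, Sigma_tilde, sep="\n")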

2.3.5 Sequential estimation

1. Sequential methods allow data points to be processed one at a time and then discarded. They are important for on-line applications, and also where large data sets are involved so that batch processing of all data points at once is infeasible.

The sequential update for the maximum likelihood estimate of the mean is

  µML^(N) = µML^(N−1) + (1/N)(xN − µML^(N−1))

This result has a nice interpretation, as follows. After observing N − 1 data points we have estimated µ by µML^(N−1). We now observe data point xN, and we obtain our revised estimate µML^(N) by moving the old estimate a small amount, proportional to 1/N, in the direction of the 'error signal' (xN − µML^(N−1)). Note that, as N increases, the contribution from successive data points gets smaller.
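A minimal sketch of this update rule in plain Python (names are illustrative):

  # Online estimate of the mean: each new point moves the estimate by an
  # amount proportional to 1/N in the direction of the error signal.
  def sequential_mean(stream):
      mu, n = 0.0, 0
      for x in stream:
          n += 1
          mu += (x - mu) / n  # mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1})
      return mu

  print(sequential_mean([2.0, 4.0, 6.0, 8.0]))  # 5.0, same as the batch mean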

2. Robbins-Monro algorithm

Consider a pair of random variables θ and z governed by a joint distribution p(z, θ). The conditional expectation of z given θ defines a deterministic function f(θ) that is given by

  f(θ) ≡ E[z|θ] = ∫ z p(z|θ) dz

Our goal is to find the root θ⋆ at which f(θ⋆) = 0.

We shall assume that the conditional variance of z is finite so that

  E[(z − f)²|θ] < ∞

The Robbins-Monro procedure then defines a sequence of successive estimates of the root θ given by

  θ^(N) = θ^(N−1) + aN−1 z(θ^(N−1))

where z(θ(N)) is an observed value of z when θ takes the value θ(N). The coefficients {aN} represent a sequence of positive numbers that satisfy the conditions

  lim N→∞ aN = 0,   Σn aN = ∞,   Σn aN² < ∞

The first condition ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition is required to ensure that the algorithm does not converge short of the root, and the third condition is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.
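A sketch of the procedure on a toy problem where the regression function is f(θ) = θ⋆ − θ, using step sizes aN = 1/N, which satisfy all three conditions (everything here is illustrative):

  import random

  random.seed(0)
  THETA_STAR = 3.0

  # z is a noisy observation whose conditional mean is
  # f(theta) = THETA_STAR - theta, so the root of f is at THETA_STAR.
  def noisy_z(theta):
      return (THETA_STAR - theta) + random.gauss(0.0, 1.0)

  theta = 0.0
  for n in range(1, 10001):
      a = 1.0 / n                  # step sizes a_N = 1/N
      theta += a * noisy_z(theta)

  print(theta)                     # close to the root THETA_STAR = 3.0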

2.3.6 Bayesian inference for the Gaussian

The conjugate prior for the precision λ of a Gaussian is the gamma distribution:

  Gam(λ|a, b) = (1/Γ(a)) b^a λ^(a−1) exp(−bλ)

The mean and variance of the gamma distribution are given by

  E[λ] = a/b

  var[λ] = a/b²

2.3.7 Student’s t-distribution

If we have a univariate Gaussian N(x|µ, τ −1) together with a Gamma prior Gam(τ|a, b) and we integrate out the precision, we obtain the marginal distribution of x in the form

  p(x|µ, a, b) = ∫0^∞ N(x|µ, τ⁻¹) Gam(τ|a, b) dτ

Setting ν = 2a and λ = a/b, this takes the form

  St(x|µ, λ, ν) = (Γ((ν + 1)/2)/Γ(ν/2)) (λ/(πν))^(1/2) [1 + λ(x − µ)²/ν]^(−(ν+1)/2)

which is known as Student’s t-distribution. The parameter λ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter ν is called the degrees of freedom.

[Figure: St(x|µ, λ, ν) for various values of ν, showing heavier tails as ν decreases.]

For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution, while in the limit ν → ∞ the t-distribution St(x|µ, λ, ν) becomes a Gaussian N(x|µ, λ⁻¹) with mean µ and precision λ.

The result is a distribution that in general has longer 'tails' than a Gaussian, as seen in the figure above. This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers.


2.3.8 Periodic variables

1 Periodic quantities can conveniently be represented using an angular (polar) coordinate 0 ≤ θ < 2π.

We might be tempted to treat periodic variables by choosing some direction as the origin and then applying a conventional distribution such as the Gaussian. Such an approach, however, would give results that were strongly dependent on the arbitrary choice of origin.

To find an invariant measure of the mean, we note that the observations can be viewed as points on the unit circle and can therefore be described instead by two-dimensional unit vectors x1, . . . , xN, where ‖xn‖ = 1 for n = 1, . . . , N. We can then average these vectors:

  x̄ = (1/N) Σn xn

The Cartesian coordinates of the observations are given by xn = (cos θn, sin θn), and we can write the Cartesian coordinates of the sample mean in the form x̄ = (r̄ cos θ̄, r̄ sin θ̄).

Equating the two components gives r̄ cos θ̄ = (1/N) Σn cos θn and r̄ sin θ̄ = (1/N) Σn sin θn. Taking their ratio, the mean direction is

  θ̄ = tan⁻¹{(Σn sin θn)/(Σn cos θn)}
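A minimal sketch in plain Python showing why this measure is preferable to a naive average of the angles:

  import math

  # Mean direction: theta_bar = atan2(sum of sines, sum of cosines).
  def circular_mean(angles):
      s = sum(math.sin(t) for t in angles)
      c = sum(math.cos(t) for t in angles)
      return math.atan2(s, c)

  # Two angles just either side of 0: a naive average of 0.1 and
  # 2*pi - 0.1 gives about pi, but the circular mean gives 0.
  print(circular_mean([0.1, 2 * math.pi - 0.1]))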

2 von Mises distribution

we will consider distributions p(θ) that have period 2π. Any probability density p(θ) defined over θ must not only be nonnegative and integrate to one, but it must also be periodic. Thus p(θ) must satisfy the three conditions

  p(θ) ≥ 0
  ∫0^2π p(θ) dθ = 1
  p(θ + 2π) = p(θ)

it follows that p(θ + M2π) = p(θ) for any integer M.

Consider a Gaussian distribution over two variables x = (x1, x2) having mean µ = (µ1, µ2) and a covariance matrix Σ = σ²I, where I is the 2 × 2 identity matrix, so that

  p(x1, x2) = (1/(2πσ²)) exp{−[(x1 − µ1)² + (x2 − µ2)²]/(2σ²)}

Conditioning this distribution on the unit circle x1² + x2² = 1 and transforming to polar coordinates leads to the von Mises distribution

  p(θ|θ0, m) = (1/(2πI0(m))) exp{m cos(θ − θ0)}

where θ0 is the mean direction, m is the concentration parameter (analogous to the precision of a Gaussian), and I0(m) is the zeroth-order modified Bessel function of the first kind.
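A small sketch evaluating this density with NumPy, whose np.i0 provides the required Bessel function (parameter values are arbitrary):

  import numpy as np

  # von Mises density p(theta | theta0, m).
  def von_mises_pdf(theta, theta0, m):
      return np.exp(m * np.cos(theta - theta0)) / (2 * np.pi * np.i0(m))

  theta = np.linspace(0, 2 * np.pi, 5)
  print(von_mises_pdf(theta, theta0=np.pi, m=2.0))

  # Sanity check: the density integrates to one over [0, 2*pi).
  grid = np.linspace(0, 2 * np.pi, 100001)
  print(np.trapz(von_mises_pdf(grid, np.pi, 2.0), grid))  # ~1.0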

2.3.9 Mixtures of Gaussians

Consider a superposition of K Gaussian densities of the form

  p(x) = Σk πk N(x|µk, Σk)

  0 ≤ πk ≤ 1,  Σk πk = 1

The superposition is called a mixture of Gaussians, and the parameters πk, which satisfy the constraints above, are called mixing coefficients.

One example for K = 3:

[Figure: contours of the individual components and of the mixture density p(x) for a mixture of three Gaussians.]
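A minimal sketch of evaluating a univariate three-component mixture density, assuming NumPy (all parameter values are made up):

  import numpy as np

  def gaussian_pdf(x, mu, sigma):
      return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

  # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), with the pi_k summing to one.
  def mixture_pdf(x, pis, mus, sigmas):
      return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

  pis = [0.5, 0.3, 0.2]          # mixing coefficients
  mus = [-2.0, 0.0, 3.0]
  sigmas = [0.5, 1.0, 0.8]
  x = np.linspace(-5, 6, 7)
  print(mixture_pdf(x, pis, mus, sigmas))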

2.4. The Exponential Family

1 The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith,1994).

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

  p(x|η) = h(x) g(η) exp{η^T u(x)}

where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x. The function g(η) can be interpreted as the coefficient that ensures that the distribution is normalized and therefore satisfies

  g(η) ∫ h(x) exp{η^T u(x)} dx = 1

where the integration is replaced by summation if x is a discrete variable.
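As a worked example (following the book's own treatment of the Bernoulli case), writing the Bernoulli distribution in this form gives

  Bern(x|µ) = µ^x (1 − µ)^(1−x) = exp{x ln µ + (1 − x) ln(1 − µ)} = (1 − µ) exp{ln(µ/(1 − µ)) x}

so that η = ln(µ/(1 − µ)), u(x) = x, h(x) = 1, and g(η) = σ(−η), where σ(η) = 1/(1 + exp(−η)) is the logistic sigmoid, giving µ = σ(η).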

2 Conjugate priors

In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.

2.5. Nonparametric Methods

First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point.

Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results (compare the degree M of the polynomial and the value α of the regularization parameter in the curve-fitting example).

2.5.1 Kernel density estimators

Consider some small region R containing the point x. We obtain our density estimate in the form

  p(x) = K/(NV)

where K is the total number of points that lie inside R, and V is the volume of R.

The kernel function k((x − xn)/h) equals 1 if the data point xn lies within a hypercube of side h centred on x, and 0 otherwise, so the total count inside the cube is K = Σn k((x − xn)/h). Thus the estimated density at x is

  p(x) = (1/N) Σn (1/h^D) k((x − xn)/h)

where we have used h^D for the volume of a hypercube of side h in D dimensions.

The kernel can also be chosen to be a Gaussian, giving the density model

  p(x) = (1/N) Σn (1/(2πh²)^(D/2)) exp{−‖x − xn‖²/(2h²)}

h represents the standard deviation of the Gaussian components and plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h.

[Figure: Gaussian kernel density estimates for several values of h, showing noisy estimates for small h and over-smoothing for large h.]
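A sketch of the Gaussian kernel estimator in one dimension (D = 1), assuming NumPy; the sample data and bandwidths are illustrative:

  import numpy as np

  # p(x) = (1/N) sum_n N(x | x_n, h^2); h controls the smoothing.
  def kde(x, data, h):
      diffs = (x[:, None] - data[None, :]) / h
      kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi * h ** 2)
      return kernels.mean(axis=1)

  rng = np.random.default_rng(0)
  data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
  x = np.linspace(-4, 4, 9)
  for h in (0.05, 0.3, 1.5):   # too small -> noisy; too large -> over-smooth
      print(h, kde(x, data, h))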

2.5.2 Nearest-neighbour methods

One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels. In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller.

Thus the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.

Instead of fixing V and determining the count K from the data, we consider a fixed value of K and use the data to find an appropriate value for V.

  p(x) = K/(NV)

where V is the volume of the smallest sphere centred on x that contains exactly K data points.

Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges.

Suppose we have a data set comprising Nk points in class Ck, with N points in total. If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class. Suppose this sphere has volume V and contains Kk points from class Ck.

Each class-conditional density, the unconditional density, and the class priors are then estimated as

  p(x|Ck) = Kk/(Nk V),  p(x) = K/(NV),  p(Ck) = Nk/N

and combining these with Bayes' theorem gives the posterior probability of class membership:

  p(Ck|x) = p(x|Ck) p(Ck)/p(x) = Kk/K

To classify the new point, we assign it to the class with the largest posterior probability, corresponding to the largest value of Kk/K.

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier.

Both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large.

This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set.
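A minimal sketch of the resulting classification rule, assuming NumPy and a brute-force distance computation rather than a tree-based search; the training set is illustrative:

  import numpy as np

  # Assign x to the class with the most members among its K nearest
  # training points, i.e. maximize p(C_k|x) = K_k / K.
  def knn_classify(x, X_train, y_train, K):
      dists = np.linalg.norm(X_train - x, axis=1)
      nearest = np.argsort(dists)[:K]         # indices of the K nearest points
      counts = np.bincount(y_train[nearest])  # K_k for each class
      return counts.argmax()

  X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
  y_train = np.array([0, 0, 1, 1])
  print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, K=3))  # class 0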
