-
There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled.
2. Incomplete observability: we cannot observe all of the variables that drive the behavior of the system.
3. Incomplete modeling: when a model must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions.
-
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.
-
Repeatable events: frequentist probability.
-
Events that are not repeatable: Bayesian probability (degree of belief).
-
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters.
-
For example, $x_1$ and $x_2$ are both possible values that the random variable $\mathrm{x}$ can take on. For vector-valued variables, we would write the random variable as $\mathbf{x}$ and one of its values as $\boldsymbol{x}$. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.
-
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.
-
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital $P$.
-
The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state.
-
Sometimes, to disambiguate which PMF to use, we write the name of the random variable explicitly: $P(\mathrm{x}=x)$. Sometimes we define a variable first, then use $\sim$ notation to specify which distribution it follows: $\mathrm{x} \sim P(\mathrm{x})$.
-
Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. $P(\mathrm{x}=x, \mathrm{y}=y)$ denotes the probability that $\mathrm{x}=x$ and $\mathrm{y}=y$ simultaneously. We may also write $P(x, y)$ for brevity.
-
Uniform distribution over $k$ discrete states: $P(\mathrm{x}=x_i) = \frac{1}{k}$
-
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function.
-
For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function $u(x; a, b)$, where $a$ and $b$ are the endpoints of the interval, with $b > a$. The ";" notation means "parametrized by"; we consider $x$ to be the argument of the function, while $a$ and $b$ are parameters that define the function. To ensure that there is no probability mass outside the interval, we say $u(x; a, b) = 0$ for all $x \notin [a, b]$. Within $[a, b]$, $u(x; a, b) = \frac{1}{b-a}$. We can see that this is nonnegative everywhere.
-
Additionally, it integrates to 1. We often denote that $x$ follows the uniform distribution on $[a, b]$ by writing $\mathrm{x} \sim U(a, b)$.
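As a quick check of these properties, here is a minimal numpy sketch (the function name `uniform_pdf` and the endpoint values are illustrative assumptions) that evaluates $u(x; a, b)$ and verifies numerically that it is nonnegative and integrates to 1:

```python
import numpy as np

def uniform_pdf(x, a, b):
    """Density u(x; a, b): 1/(b-a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

a, b = 2.0, 5.0
xs = np.linspace(a - 1, b + 1, 100001)   # grid covering the interval and beyond
density = uniform_pdf(xs, a, b)

print(density.min() >= 0)                # nonnegative everywhere -> True
print(np.trapz(density, xs))             # numerical integral, approximately 1.0
```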
-
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution. For discrete variables it is obtained with the sum rule:
$\forall x \in \mathrm{x}, \; P(\mathrm{x}=x) = \sum_{y} P(\mathrm{x}=x, \mathrm{y}=y)$
For continuous variables, we use integration instead of summation:
$p(x) = \int p(x, y) \, dy$
-
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that $\mathrm{y}=y$ given $\mathrm{x}=x$ as $P(\mathrm{y}=y \mid \mathrm{x}=x)$. This conditional probability can be computed with the formula
$P(\mathrm{y}=y \mid \mathrm{x}=x) = \frac{P(\mathrm{y}=y, \mathrm{x}=x)}{P(\mathrm{x}=x)}$
The conditional probability is only defined when $P(\mathrm{x}=x) > 0$. We cannot compute the conditional probability conditioned on an event that never happens.
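The marginalization and conditioning formulas above can be sanity-checked on a tiny discrete joint table. This is a minimal sketch with a made-up $2 \times 3$ joint distribution over $\mathrm{x}$ and $\mathrm{y}$ (the entries are arbitrary, chosen only so they sum to 1):

```python
import numpy as np

# joint[i, j] = P(x = i, y = j); an arbitrary valid joint distribution
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

# Marginal: P(x = i) = sum_j P(x = i, y = j)
p_x = joint.sum(axis=1)
print("P(x):", p_x)

# Conditional: P(y = j | x = i) = P(x = i, y = j) / P(x = i), defined when P(x = i) > 0
p_y_given_x = joint / p_x[:, None]
print("P(y | x = 0):", p_y_given_x[0])
print(np.allclose(p_y_given_x.sum(axis=1), 1.0))   # each conditional row sums to 1
```
-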
It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
-
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each:
$P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)})$
This observation is known as the chain rule or product rule of probability.
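As a numerical illustration, the chain rule can be verified on a small joint table over three binary variables; the table below is an arbitrary assumed joint distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))                    # P(x1, x2, x3), arbitrary positive table
joint /= joint.sum()                             # normalize to a valid distribution

p1 = joint.sum(axis=(1, 2))                      # P(x1)
p2_given_1 = joint.sum(axis=2) / p1[:, None]     # P(x2 | x1)
p3_given_12 = joint / joint.sum(axis=2, keepdims=True)   # P(x3 | x1, x2)

# Chain rule: P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
print(np.allclose(reconstructed, joint))         # True
```
-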
The expectation or expected value of some function $f(x)$ with respect to a probability distribution $P(\mathrm{x})$ is the average or mean value that $f$ takes on when $x$ is drawn from $P$. For discrete variables this can be computed with a summation:
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_{x} P(x) f(x)$
while for continuous variables, it is computed with an integral:
$\mathbb{E}_{\mathrm{x} \sim p}[f(x)] = \int p(x) f(x) \, dx$
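Both forms can be illustrated numerically: an exact sum for a discrete distribution, and a Monte Carlo average as a stand-in for the integral in the continuous case. The distributions and $f(x) = x^2$ below are assumptions chosen for illustration:

```python
import numpy as np

# Discrete case: E[f(x)] = sum_x P(x) f(x), with f(x) = x**2
values = np.array([0, 1, 2, 3])
probs = np.array([0.1, 0.2, 0.3, 0.4])
print("discrete E[x^2]:", np.sum(probs * values**2))

# Continuous case: approximate E[f(x)] = integral p(x) f(x) dx by averaging f
# over samples drawn from p (here p is a standard normal, so E[x^2] ~ 1)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print("Monte Carlo E[x^2]:", np.mean(samples**2))
```
-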
Expectations are linear, for example,
$\mathbb{E}_{\mathrm{x}}[\alpha f(x) + \beta g(x)] = \alpha \mathbb{E}_{\mathrm{x}}[f(x)] + \beta \mathbb{E}_{\mathrm{x}}[g(x)]$
when $\alpha$ and $\beta$ do not depend on $x$.
-
The variance gives a measure of how much the values of a function of a random variable $\mathrm{x}$ vary as we sample different values of $x$ from its probability distribution:
$\operatorname{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^{2}\right]$
When the variance is low, the values of $f(x)$ cluster near their expected value. The square root of the variance is known as the standard deviation.
-
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
$\operatorname{Cov}(f(x), g(y)) = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])]$
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
-
The covariance matrix of a random vector $\mathbf{x} \in \mathbb{R}^{n}$ is an $n \times n$ matrix, such that
$\operatorname{Cov}(\mathbf{x})_{i,j} = \operatorname{Cov}(\mathrm{x}_{i}, \mathrm{x}_{j})$
The diagonal elements of the covariance matrix give the variance:
$\operatorname{Cov}(\mathrm{x}_{i}, \mathrm{x}_{i}) = \operatorname{Var}(\mathrm{x}_{i})$
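A minimal numpy sketch of these definitions: draw correlated samples and compare the sample covariance matrix (whose diagonal holds the variances) with the covariance used to generate the data. The particular mean and covariance are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.8],
                     [0.8, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=200_000)  # shape (m, 2)

sample_cov = np.cov(samples, rowvar=False)       # Cov(x)_{i,j} = Cov(x_i, x_j)
print(sample_cov)                                # close to true_cov
print(np.allclose(np.diag(sample_cov), samples.var(axis=0, ddof=1)))  # diagonal = variances
```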
Common Distributions
-
Bernoulli distribution:
$P(\mathrm{x}=1) = \phi$
$P(\mathrm{x}=0) = 1 - \phi$
$P(\mathrm{x}=x) = \phi^{x}(1-\phi)^{1-x}$
$\mathbb{E}_{\mathrm{x}}[\mathrm{x}] = \phi$
$\operatorname{Var}_{\mathrm{x}}(\mathrm{x}) = \phi(1-\phi)$
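These moments are easy to confirm by simulation; a quick sketch with an assumed $\phi = 0.3$:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.3
x = rng.binomial(n=1, p=phi, size=1_000_000)     # Bernoulli(phi) samples

print(x.mean())                                  # ~ phi
print(x.var())                                   # ~ phi * (1 - phi) = 0.21
```
-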
Gaussian distribution:
$\mathcal{N}(x; \mu, \sigma^{2}) = \sqrt{\frac{1}{2\pi\sigma^{2}}} \exp\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)$
The two parameters $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ control the normal distribution. The parameter $\mu$ gives the coordinate of the central peak. This is also the mean of the distribution: $\mathbb{E}[\mathrm{x}] = \mu$. The standard deviation of the distribution is given by $\sigma$, and the variance by $\sigma^{2}$. When we evaluate the PDF, we need to square and invert $\sigma$. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter $\beta \in (0, \infty)$ to control the precision, or inverse variance, of the distribution:
$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x-\mu)^{2}\right)$
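The two parametrizations describe the same density, which a short sketch can confirm ($\mu$, $\sigma$ and the evaluation points are arbitrary assumptions):

```python
import numpy as np

def normal_pdf_sigma(x, mu, sigma):
    """N(x; mu, sigma^2) in the standard-deviation parametrization."""
    return np.sqrt(1.0 / (2 * np.pi * sigma**2)) * np.exp(-0.5 * (x - mu)**2 / sigma**2)

def normal_pdf_beta(x, mu, beta):
    """N(x; mu, beta^{-1}) in the precision parametrization, beta = 1 / sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu)**2)

xs = np.linspace(-4, 6, 101)
mu, sigma = 1.0, 1.5
print(np.allclose(normal_pdf_sigma(xs, mu, sigma),
                  normal_pdf_beta(xs, mu, 1.0 / sigma**2)))   # True
```
-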
In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.
First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed.
Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. -
The normal distribution generalizes to $\mathbb{R}^{n}$, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix $\boldsymbol{\Sigma}$:
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^{n} \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu})\right)$
The parameter $\boldsymbol{\mu}$ still gives the mean of the distribution, though now it is vector-valued. The parameter $\boldsymbol{\Sigma}$ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert $\boldsymbol{\Sigma}$ to evaluate the PDF. We can instead use a precision matrix $\boldsymbol{\beta}$:
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^{n}}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\beta} (\boldsymbol{x}-\boldsymbol{\mu})\right)$
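The same equivalence holds in the multivariate case; below is a minimal sketch that evaluates both forms with numpy (the mean, covariance, and query point are made-up values):

```python
import numpy as np

def mvn_pdf_cov(x, mu, Sigma):
    """Multivariate normal density parametrized by the covariance matrix Sigma."""
    n = mu.shape[0]
    diff = x - mu
    norm = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def mvn_pdf_precision(x, mu, Beta):
    """Multivariate normal density parametrized by the precision matrix Beta = Sigma^{-1}."""
    n = mu.shape[0]
    diff = x - mu
    norm = np.sqrt(np.linalg.det(Beta) / (2 * np.pi) ** n)
    return norm * np.exp(-0.5 * diff @ Beta @ diff)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.7])
print(np.isclose(mvn_pdf_cov(x, mu, Sigma),
                 mvn_pdf_precision(x, mu, np.linalg.inv(Sigma))))   # True
```
-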
In the context of deep learning, we often want to have a probability distribution with a sharp point at $x = 0$. To accomplish this, we can use the exponential distribution:
$p(x; \lambda) = \lambda \mathbf{1}_{x \geq 0} \exp(-\lambda x)$
The exponential distribution uses the indicator function $\mathbf{1}_{x \geq 0}$ to assign probability zero to all negative values of $x$. A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point $\mu$ is the Laplace distribution:
$\operatorname{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x-\mu|}{\gamma}\right)$
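A small sketch of both densities ($\lambda$, $\mu$, $\gamma$ are arbitrary assumed values); the exponential puts all of its mass at $x \geq 0$, while the Laplace peaks at $\mu$, and both integrate to roughly 1 over a wide grid:

```python
import numpy as np

def exponential_pdf(x, lam):
    """p(x; lambda) = lambda * 1_{x >= 0} * exp(-lambda * x)."""
    x = np.asarray(x, dtype=float)
    return lam * (x >= 0) * np.exp(-lam * np.clip(x, 0, None))

def laplace_pdf(x, mu, gamma):
    """Laplace(x; mu, gamma) = exp(-|x - mu| / gamma) / (2 * gamma)."""
    return np.exp(-np.abs(x - mu) / gamma) / (2 * gamma)

xs = np.linspace(-5, 15, 300001)
print(np.trapz(exponential_pdf(xs, lam=1.5), xs))        # ~1.0
print(np.trapz(laplace_pdf(xs, mu=2.0, gamma=1.0), xs))  # ~1.0
```
-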
In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, $\delta(x)$:
$p(x) = \delta(x - \mu)$
We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than zero.
-
A common use of the Dirac delta distribution is as a component of an empirical distribution,
$\hat{p}(\boldsymbol{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta\left(\boldsymbol{x} - \boldsymbol{x}^{(i)}\right)$
which puts probability mass $\frac{1}{m}$ on each of the $m$ points $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}$ forming a given dataset or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset.
Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data.
-
It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:
$P(\mathrm{x}) = \sum_{i} P(\mathrm{c}=i) P(\mathrm{x} \mid \mathrm{c}=i)$
where $P(\mathrm{c})$ is the multinoulli distribution over component identities.
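Sampling from a mixture follows the definition directly: first draw a component identity $\mathrm{c}$ from the multinoulli, then draw $x$ from that component. A minimal sketch for a three-component Gaussian mixture in one dimension (all parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.2, 0.5, 0.3])              # P(c = i), a multinoulli over components
means = np.array([-3.0, 0.0, 4.0])               # component means mu^(i)
stds = np.array([0.5, 1.0, 1.5])                 # component standard deviations

m = 100_000
c = rng.choice(len(weights), size=m, p=weights)  # sample component identities
x = rng.normal(loc=means[c], scale=stds[c])      # sample x | c from the chosen Gaussian

# The sample mean matches the mixture mean sum_i P(c=i) * mu^(i)
print(x.mean(), np.sum(weights * means))
```
-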
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable $\mathrm{c}$ of the mixture model provides an example. Latent variables may be related to $\mathrm{x}$ through the joint distribution, in this case $P(\mathrm{x}, \mathrm{c}) = P(\mathrm{x} \mid \mathrm{c}) P(\mathrm{c})$. The distribution $P(\mathrm{c})$ over the latent variable and the distribution $P(\mathrm{x} \mid \mathrm{c})$ relating the latent variable to the visible variables determine the shape of the distribution $P(\mathrm{x})$, even though it is possible to describe $P(\mathrm{x})$ without reference to the latent variable. (To revisit later with concrete examples.)
A very powerful and common type of mixture model is the Gaussian mixture model, in which the components $p(\mathbf{x} \mid \mathrm{c}=i)$ are Gaussians. Each component has a separately parametrized mean $\boldsymbol{\mu}^{(i)}$ and covariance $\boldsymbol{\Sigma}^{(i)}$. Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint $\boldsymbol{\Sigma}^{(i)} = \boldsymbol{\Sigma}, \forall i$. As with a single Gaussian distribution, a mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.
In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability $\alpha_{i} = P(\mathrm{c}=i)$ given to each component $i$. The word "prior" indicates that it expresses the model's beliefs about $\mathrm{c}$ before it has observed $\mathrm{x}$. By comparison, $P(\mathrm{c} \mid \boldsymbol{x})$ is a posterior probability, because it is computed after observation of $\mathbf{x}$. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, nonzero amount of error by a Gaussian mixture model with enough components.
-
Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models. One of these functions is the logistic sigmoid:
$\sigma(x) = \frac{1}{1 + \exp(-x)}$
The logistic sigmoid is commonly used to produce the $\phi$ parameter of a Bernoulli distribution because its range is $(0, 1)$, which lies within the valid range of values for the $\phi$ parameter. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
-
Another commonly encountered function is the softplus function:
$\zeta(x) = \log(1 + \exp(x))$
The softplus function can be useful for producing the $\beta$ or $\sigma$ parameter of a normal distribution because its range is $(0, \infty)$. It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed or "softened" version of
$x^{+} = \max(0, x)$
The following properties are all useful enough that you may wish to memorize them:
$$\begin{gathered}
\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)} \\
\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) \\
1 - \sigma(x) = \sigma(-x) \\
\log \sigma(x) = -\zeta(-x) \\
\frac{d}{dx}\zeta(x) = \sigma(x) \\
\forall x \in (0, 1), \; \sigma^{-1}(x) = \log\left(\frac{x}{1-x}\right) \\
\forall x > 0, \; \zeta^{-1}(x) = \log(\exp(x) - 1) \\
\zeta(x) = \int_{-\infty}^{x} \sigma(y) \, dy \\
\zeta(x) - \zeta(-x) = x
\end{gathered}$$
The function $\sigma^{-1}(x)$ is called the logit in statistics, but this term is more rarely used in machine learning.
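Several of these identities can be verified numerically; a minimal sketch (the grid of test points is an arbitrary assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):                                 # zeta(x) = log(1 + exp(x))
    return np.log1p(np.exp(x))

xs = np.linspace(-5, 5, 1001)
print(np.allclose(1 - sigmoid(xs), sigmoid(-xs)))            # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(xs)), -softplus(-xs)))      # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(xs) - softplus(-xs), xs))         # zeta(x) - zeta(-x) = x

p = np.linspace(0.01, 0.99, 99)
logit = np.log(p / (1 - p))                      # sigma^{-1}(p)
print(np.allclose(sigmoid(logit), p))            # the logit inverts the sigmoid
```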
Bayes’ Rule
- We often find ourselves in a situation where we know $P(\mathrm{y} \mid \mathrm{x})$ and need to know $P(\mathrm{x} \mid \mathrm{y})$. Fortunately, if we also know $P(\mathrm{x})$, we can compute the desired quantity using Bayes' rule:
$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x}) P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}$
Note that while $P(\mathrm{y})$ appears in the formula, it is usually feasible to compute $P(\mathrm{y}) = \sum_{x} P(\mathrm{y} \mid x) P(x)$, so we do not need to begin with knowledge of $P(\mathrm{y})$.
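A classic numerical illustration of Bayes' rule is a diagnostic-test calculation; all of the numbers below are assumed for the example:

```python
# Suppose x is "has the disease" and y is "test comes back positive".
p_x = 0.01                    # prior P(x = 1): disease prevalence (assumed)
p_y_given_x1 = 0.95           # P(y = 1 | x = 1): test sensitivity (assumed)
p_y_given_x0 = 0.05           # P(y = 1 | x = 0): false positive rate (assumed)

# P(y = 1) = sum_x P(y = 1 | x) P(x)
p_y = p_y_given_x1 * p_x + p_y_given_x0 * (1 - p_x)

# Bayes' rule: P(x = 1 | y = 1) = P(x = 1) P(y = 1 | x = 1) / P(y = 1)
p_x_given_y = p_x * p_y_given_x1 / p_y
print(p_x_given_y)            # ~0.16: at this prevalence, most positives are false positives
```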
Information Theory
-
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. We would like to quantify information in a way that formalizes this intuition. Specifically,
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once. -
In order to satisfy all three of these properties, we define the self-information of an event $\mathrm{x} = x$ to be
$I(x) = -\log P(x)$
In this book, we always use $\log$ to mean the natural logarithm, with base $e$. Our definition of $I(x)$ is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability $\frac{1}{e}$. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
-
When $\mathrm{x}$ is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:
$H(\mathrm{x}) = \mathbb{E}_{\mathrm{x} \sim P}[I(x)] = -\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
also denoted $H(P)$. In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution $P$. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When $\mathrm{x}$ is continuous, the Shannon entropy is known as the differential entropy.
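A minimal sketch computing the Shannon entropy (in nats) of a few discrete distributions, showing that near-deterministic distributions have low entropy while the uniform distribution has the highest:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats; 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))             # 0.0: deterministic outcome
print(entropy([0.97, 0.01, 0.01, 0.01]))         # low entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))         # log(4) ~ 1.386: maximal for 4 outcomes
```
-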
If we have two separate probability distributions $P(\mathrm{x})$ and $Q(\mathrm{x})$ over the same random variable $\mathrm{x}$, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence (also known as relative entropy):
$D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{\mathrm{x} \sim P}[\log P(x) - \log Q(x)]$
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base-2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution $P$, when we use a code that was designed to minimize the length of messages drawn from probability distribution $Q$.
-
The KL divergence has many useful properties, most notably that it is nonnegative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is nonnegative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for some $P$ and $Q$. This asymmetry means that there are important consequences to the choice of whether to use $D_{\mathrm{KL}}(P \| Q)$ or $D_{\mathrm{KL}}(Q \| P)$.
-
A quantity that is closely related to the KL divergence is the cross-entropy $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$, which is similar to the KL divergence but lacking the term on the left:
$H(P, Q) = -\mathbb{E}_{\mathrm{x} \sim P} \log Q(x)$
Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not participate in the omitted term. When computing many of these quantities, it is common to encounter expressions of the form $0 \log 0$. By convention, in the context of information theory, we treat these expressions as $\lim_{x \rightarrow 0} x \log x = 0$.
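The KL divergence, its asymmetry, and the identity $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$ can all be checked on small discrete distributions (the two distributions below are arbitrary assumptions with strictly positive entries):

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))                            # nonnegative, and unequal: KL is asymmetric
entropy_p = -np.sum(p * np.log(p))
print(np.isclose(cross_entropy(p, q), entropy_p + kl(p, q)))   # H(P, Q) = H(P) + D_KL(P || Q)
```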
Structured Probabilistic Models
-
Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: $\mathrm{a}$, $\mathrm{b}$ and $\mathrm{c}$. Suppose that $\mathrm{a}$ influences the value of $\mathrm{b}$ and $\mathrm{b}$ influences the value of $\mathrm{c}$, but that $\mathrm{a}$ and $\mathrm{c}$ are independent given $\mathrm{b}$. We can represent the probability distribution over all three variables as a product of probability distributions over two variables:
$p(\mathrm{a}, \mathrm{b}, \mathrm{c}) = p(\mathrm{a})\, p(\mathrm{b} \mid \mathrm{a})\, p(\mathrm{c} \mid \mathrm{b})$
These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.
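The parameter savings are easy to quantify for binary variables: a full joint table over $\mathrm{a}, \mathrm{b}, \mathrm{c}$ needs $2^3 - 1 = 7$ free parameters, while the factorization $p(\mathrm{a})\, p(\mathrm{b} \mid \mathrm{a})\, p(\mathrm{c} \mid \mathrm{b})$ needs only $1 + 2 + 2 = 5$, and the gap grows rapidly with more variables. A minimal counting sketch for a chain of $n$ binary variables:

```python
# Number of free parameters for n binary variables in a chain x1 -> x2 -> ... -> xn
def full_joint_params(n):
    return 2**n - 1                      # one probability per configuration, minus normalization

def chain_factor_params(n):
    # p(x1) needs 1 parameter; each p(x_i | x_{i-1}) needs 2 (one per parent value)
    return 1 + 2 * (n - 1)

for n in (3, 10, 30):
    print(n, full_joint_params(n), chain_factor_params(n))
# 3 -> 7 vs 5;  10 -> 1023 vs 19;  30 -> ~1.07e9 vs 59
```
-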
We can describe these kinds of factorizations using graphs. Here we use the word “graph” in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.
-
Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable $\mathrm{x}_{i}$ in the distribution, and that factor consists of the conditional distribution over $\mathrm{x}_{i}$ given the parents of $\mathrm{x}_{i}$, denoted $Pa_{\mathcal{G}}(\mathrm{x}_{i})$:
$p(\mathbf{x}) = \prod_{i} p\left(\mathrm{x}_{i} \mid Pa_{\mathcal{G}}(\mathrm{x}_{i})\right)$
-
Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in $\mathcal{G}$ is called a clique. Each clique $\mathcal{C}^{(i)}$ in an undirected model is associated with a factor $\phi^{(i)}(\mathcal{C}^{(i)})$. These factors are just functions, not probability distributions. The output of each factor must be nonnegative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.
The probability of a configuration of random variables is proportional to the product of all of these factors: assignments that result in larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant $Z$, defined to be the sum or integral over all states of the product of the $\phi$ functions, in order to obtain a normalized probability distribution:
$p(\mathbf{x}) = \frac{1}{Z} \prod_{i} \phi^{(i)}\left(\mathcal{C}^{(i)}\right)$
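For a small undirected model the normalizing constant $Z$ can be computed by brute force, summing the product of clique factors over every joint state. A minimal sketch with two binary variables and a single arbitrary (assumed) pairwise factor:

```python
import numpy as np
from itertools import product

# One clique {a, b} over binary variables, with an arbitrary nonnegative factor phi(a, b)
phi = np.array([[4.0, 1.0],
                [1.0, 2.0]])

# Z = sum over all states of the product of factors (here there is just one factor)
Z = sum(phi[a, b] for a, b in product([0, 1], repeat=2))

p = phi / Z                              # normalized distribution p(a, b) = phi(a, b) / Z
print(Z, p, np.isclose(p.sum(), 1.0))
```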
Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.