CCJ PRML Study Note - Chapter 1.6 : Information Theory

Christopher M. Bishop, PRML, Chapter 1 Introduction

1. Information h(x)

Given a random variable $x$, we ask how much information is received when we observe a specific value of this variable.

  • The amount of information can be viewed as the “degree of surprise” on learning the value of $x$.
  • information $h(x)$: $h(x) = -\log_2 p(x)$  (1.92), where the negative sign ensures that information is positive or zero; a low-probability event conveys more information than a highly probable one (see the short sketch after this list).
  • the units of $h(x)$:
    • using logarithms to the base of 2: the units of $h(x)$ are bits (‘binary digits’).
    • using logarithms to the base of $e$, i.e., natural logarithms: the units of $h(x)$ are nats.
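
A minimal numerical sketch of (1.92), using only the Python standard library; the probability values and the helper name `information` are mine, chosen purely for illustration.

```python
import math

def information(p, base=2.0):
    """Information content h(x) = -log_base p(x), from (1.92)."""
    assert 0.0 < p <= 1.0, "p(x) must be a valid, nonzero probability"
    return -math.log(p, base)

for p in (0.5, 0.125, 0.01):                 # illustrative probabilities
    bits = information(p, base=2.0)          # units: bits
    nats = information(p, base=math.e)       # units: nats
    print(f"p(x)={p}: h(x) = {bits:.3f} bits = {nats:.3f} nats")
```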

2. Entropy H(x): average amount of information

2.1 Entropy H(x)

First, we interpret the concept of entropy in terms of the average amount of information needed to specify the state of a random variable.

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution $p(x)$, and is given by

  • discrete entropy, for a discrete random variable: $H[x] = -\sum_{x} p(x)\log_2 p(x)$  (1.93)
  • or differential/continuous entropy, for a continuous random variable: $H[x] = -\int p(x)\ln p(x)\,dx$
  • Note that $\lim_{p \to 0}\, p\ln p = 0$, and so we shall take $p(x)\ln p(x) = 0$ whenever we encounter a value for $x$ such that $p(x) = 0$.
  • A nonuniform distribution has a smaller entropy than the uniform one (compare the two in the sketch after this list).
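
The following sketch, using only the Python standard library, evaluates the discrete entropy (1.93) for a uniform and a nonuniform distribution over the same eight states; the particular probability values are chosen only for illustration.

```python
import math

def discrete_entropy(probs, base=2.0):
    """Discrete entropy H[x] = -sum_x p(x) log p(x), from (1.93).
    Terms with p(x) = 0 contribute 0, since lim_{p->0} p log p = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

uniform    = [1/8] * 8                                     # 8 equally likely states
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(discrete_entropy(uniform))     # 3.0 bits (maximum for 8 states)
print(discrete_entropy(nonuniform))  # 2.0 bits (smaller, as expected)
```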

2.2 Noiseless coding theorem (Shannon, 1948)

The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
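As a small hedged illustration of this bound, the sketch below reuses the 8-state nonuniform distribution from the previous example together with one valid prefix code for it (codeword lengths 1, 2, 3, 4, 6, 6, 6, 6, e.g. 0, 10, 110, 1110, 111100, ...); the average code length comes out equal to the 2.0-bit entropy, so the lower bound is attained here.

```python
# Average code length of a prefix code vs. the entropy lower bound.
probs        = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of one assumed prefix code

avg_length = sum(p * l for p, l in zip(probs, code_lengths))
print(avg_length)   # 2.0 bits per symbol, matching the 2.0-bit entropy above
```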

2.3 Alternative view of entropy H(x)

Second, we introduce the concept of entropy as it first arose in physics, in the context of equilibrium thermodynamics, where it was later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.

Consider a set of $N$ identical objects that are to be divided amongst a set of bins, such that there are $n_i$ objects in the $i$th bin. Consider the number of different ways of allocating the objects to the bins.

  • There are $N$ ways to choose the first object, $(N-1)$ ways to choose the second object, and so on, leading to a total of $N!$ ways to allocate all $N$ objects to the bins.
  • However, we don’t wish to distinguish between rearrangements of objects within each bin. In the $i$th bin there are $n_i!$ ways of reordering the objects, and so the total number of ways of allocating the $N$ objects to the bins is given by $W = \frac{N!}{\prod_i n_i!}$  (1.94), which is called the multiplicity.
  • The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant: $H = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!$  (1.95)
  • We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and apply Stirling’s approximation $\ln N! \simeq N\ln N - N$  (1.96)
  • which gives
    $H = -\lim_{N \to \infty} \sum_i \left(\frac{n_i}{N}\right)\ln\left(\frac{n_i}{N}\right)$

    i.e., $H = -\sum_i p_i \ln p_i$  (1.97)

  • where we have used $\sum_i n_i = N$, and $p_i = \lim_{N \to \infty}(n_i/N)$ is the probability of an object being assigned to the $i$th bin.
  • microstate: in physics terminology, a specific arrangement of the objects in the bins is called a microstate,
  • macrostate: while the overall distribution of occupation numbers, expressed through the ratios $n_i/N$, is called a macrostate.
  • The multiplicity $W$ is also known as the weight of the macrostate.
  • We can interpret the bins as the states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$. The entropy of the random variable $X$ is then $H[p] = -\sum_i p(x_i)\ln p(x_i)$  (1.98). A numerical check of the limit follows this list.
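
A small sketch of the limiting argument, assuming only the Python standard library: for a fixed set of fractions $p_i = n_i/N$ it compares $\frac{1}{N}\ln W$ (computed exactly via log-factorials) with $-\sum_i p_i \ln p_i$ as $N$ grows; the fractions used are illustrative.

```python
import math

def entropy_nats(probs):
    """-sum_i p_i ln p_i, the limit in (1.97)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def scaled_log_multiplicity(counts):
    """(1/N) ln W with W = N! / prod_i n_i!, from (1.94)-(1.95).
    math.lgamma(n + 1) gives ln(n!) without overflow."""
    N = sum(counts)
    log_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_W / N

fractions = [0.5, 0.25, 0.125, 0.125]           # illustrative p_i, held fixed
for N in (8, 80, 800, 8000, 80000):
    counts = [round(f * N) for f in fractions]   # n_i objects in each bin
    print(N, scaled_log_multiplicity(counts))
print("limit:", entropy_nats(fractions))         # (1/N) ln W approaches this value
```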

2.4 Comparison between discrete entropy and continuous entropy

| H(x)    | Discrete Distribution X | Continuous Distribution X |
| ------- | ----------------------- | ------------------------- |
| Maximum | Uniform X               | Gaussian X                |
| +/-     | $H[x] \geq 0$           | could be negative         |
  • Maximum entropy H(x):
    • In the case of discrete distributions, the maximum-entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable.
    • For a continuous variable, the distribution that maximizes the differential entropy is the Gaussian [see Page 54 in PRML].
  • Is it negative or positive?
    • The discrete entropy in (1.93) is always $H[x] \geq 0$, because $0 \leq p(x_i) \leq 1$. It equals its minimum value of $0$ when one of the $p(x_i) = 1$ and all other $p(x_{j \neq i}) = 0$.
    • The differential entropy, unlike the discrete entropy, can be negative. If we evaluate the differential entropy of the Gaussian, we obtain $H[x] = \frac{1}{2}\left\{1 + \ln(2\pi\sigma^2)\right\}$  (1.110), which is negative for $\sigma^2 < 1/(2\pi e)$ (see the sketch below).
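
A quick check of (1.110), assuming only the Python standard library: it evaluates the closed-form Gaussian differential entropy for a few variances (chosen for illustration) and shows it turning negative once $\sigma^2 < 1/(2\pi e)$.

```python
import math

def gaussian_differential_entropy(sigma2):
    """H[x] = 0.5 * (1 + ln(2*pi*sigma^2)), from (1.110), in nats."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * sigma2))

threshold = 1.0 / (2.0 * math.pi * math.e)        # ~0.0585
for sigma2 in (1.0, 0.1, threshold, 0.01):
    print(f"sigma^2 = {sigma2:.4f}: H[x] = {gaussian_differential_entropy(sigma2):+.4f} nats")
# H[x] is exactly 0 at sigma^2 = 1/(2*pi*e) and negative below it.
```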

2.5 Conditional entropy H(y|x)

  • Conditional entropy:
    Suppose we have a joint distribution $p(x, y)$ from which we draw pairs of values of $x$ and $y$. If a value of $x$ is already known, then the additional information needed to specify the corresponding value of $y$ is given by $-\ln p(y|x)$. Thus the average additional information needed to specify $y$ can be written as $H[y|x] = -\iint p(y, x)\ln p(y|x)\,dy\,dx$  (1.111)
    which is called the conditional entropy of y given x.
  • It is easily seen, using the product rule, that the conditional entropy satisfies the relation $H[x, y] = H[y|x] + H[x]$  (1.112), where $H[x, y]$ is the differential entropy (i.e., continuous entropy) of $p(x, y)$, and $H[x]$ is the differential entropy of the marginal distribution $p(x)$.
  • From (1.112) we see that

    the information needed to describe $x$ and $y$ is given by the sum of the information needed to describe $x$ alone plus the additional information required to specify $y$ given $x$. The sketch below checks this relation numerically for a small discrete joint distribution.
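
A minimal sketch of the discrete analogue of (1.111)-(1.112), using only the Python standard library; the 2x2 joint distribution is made up for illustration.

```python
import math

# Illustrative joint distribution p(x, y) over x in {0,1}, y in {0,1}.
p_xy = {(0, 0): 0.30, (0, 1): 0.30, (1, 0): 0.10, (1, 1): 0.30}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}

H_xy  = -sum(p * math.log(p) for p in p_xy.values())                  # joint entropy H[x, y]
H_x   = -sum(p * math.log(p) for p in p_x.values())                   # marginal entropy H[x]
H_y_x = -sum(p * math.log(p / p_x[x]) for (x, _), p in p_xy.items())  # conditional entropy H[y|x]

print(H_xy, H_y_x + H_x)   # the two numbers agree: H[x, y] = H[y|x] + H[x]  (1.112)
```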

3. Relative entropy or KL divergence

3.1 Relative entropy or KL divergence

Problem: How to relate the notion of entropy to pattern recognition?
Description: Consider some unknown distribution $p(x)$, and suppose that we have modeled this using an approximating distribution $q(x)$.

  • Relative entropy, or Kullback-Leibler divergence (KL divergence), between the distributions $p(x)$ and $q(x)$: If we use $q(x)$ to construct a coding scheme for the purpose of transmitting values of $x$ to a receiver, then the average additional amount of information (in nats) required to specify the value of $x$ (assuming we choose an efficient coding scheme) as a result of using $q(x)$ instead of the true distribution $p(x)$ is given by
    $\mathrm{KL}(p\|q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx$  (1.113)

    and, for a discrete variable,

    $\mathrm{KL}(p\|q) = \sum_{k} p_k \ln\frac{p_k}{q_k}$  (2.110)

  • It can be rewritten as [see Ref-1]: $\mathrm{KL}(p\|q) = \sum_k p_k \ln p_k - \sum_k p_k \ln q_k = -H(p) + H(p, q)$  (2.111)

  • Cross Entropy: where $H(p, q)$ is called the cross entropy, $H(p, q) = -\sum_k p_k \ln q_k$  (2.112)

    • Understanding of Cross Entropy: One can show that the cross entropy $H(p, q)$ is the average number of bits (or nats) needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.
    • Understanding of (“Regular”) Entropy: Hence the “regular” entropy $H(p) = H(p, p)$ is the expected number of bits if we use the true model.
    • Understanding of Relative Entropy: So the KL divergence is the difference between these (shown in 2.111). In other words, the KL divergence is the average number of extra bits (or nats) needed to encode the data, due to the fact that we used the approximating distribution $q$ to encode the data instead of the true distribution $p$.
  • Asymmetric: Note that the KL divergence is not a symmetric quantity, that is to say $\mathrm{KL}(p\|q) \not\equiv \mathrm{KL}(q\|p)$.

  • KL divergence is a way to measure the dissimilarity of two probability distributions, $p$ and $q$ [see Ref-1]; the sketch after this list computes these quantities for a small discrete example.
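
A small sketch of (2.110)-(2.112), assuming only the Python standard library: it computes $H(p)$, $H(p, q)$, and $\mathrm{KL}(p\|q)$ for two illustrative discrete distributions, and shows both the identity $\mathrm{KL}(p\|q) = H(p, q) - H(p)$ and the asymmetry of the KL divergence.

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k ln p_k, in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k ln q_k, from (2.112)."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

def kl_divergence(p, q):
    """KL(p||q) = sum_k p_k ln(p_k / q_k), from (2.110)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [0.5, 0.3, 0.2]        # "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]        # approximating distribution (illustrative)

print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))  # equal, per (2.111)
print(kl_divergence(p, q), kl_divergence(q, p))               # not equal: KL is asymmetric
```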

3.2 Information inequality [see Ref-1]

The “extra number of bits” interpretation should make it clear that $\mathrm{KL}(p\|q) \geq 0$, and that the KL divergence equals zero if, and only if, $p = q$. We now give a proof of this important result (the information inequality): $\mathrm{KL}(p\|q) \geq 0$, with equality iff $p(x) = q(x)$ for all $x$.

Proof:

  • 1) Convex functions: To do this we first introduce the concept of convex functions. A function $f(x)$ is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31 of PRML. Any value of $x$ in the interval from $x = a$ to $x = b$ can be written in the form $\lambda a + (1-\lambda)b$, where $0 \leq \lambda \leq 1$; the corresponding point on the chord is $\lambda f(a) + (1-\lambda)f(b)$, and the corresponding value of the function is $f(\lambda a + (1-\lambda)b)$.
    • Convexity then implies $f(\lambda a + (1-\lambda)b) \leq \lambda f(a) + (1-\lambda)f(b)$  (1.114)
  • 2) Jensen’s inequality:
    • Using the technique of proof by induction, we can show from (1.114) that a convex function $f$ satisfies $f\left(\sum_{i=1}^{M}\lambda_i x_i\right) \leq \sum_{i=1}^{M}\lambda_i f(x_i)$  (1.115), where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$, for any set of points $\{x_i\}$. The result (1.115) is known as Jensen’s inequality.
    • If we interpret the $\lambda_i$ as the probability distribution over a discrete variable $x$ taking the values $\{x_i\}$, then (1.115) can be written $f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)]$  (1.116). For continuous variables, Jensen’s inequality takes the form $f\left(\int x\, p(x)\,dx\right) \leq \int f(x)\, p(x)\,dx$  (1.117)
  • 3) Apply Jensen’s inequality in the form (1.117) to the KL divergence (1.113) to give $\mathrm{KL}(p\|q) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \geq -\ln\int q(x)\,dx = 0$  (1.118), where we have used the fact that $-\ln x$ is a convex function (in fact, $-\ln x$ is strictly convex, so equality holds if, and only if, $q(x) = p(x)$ for all $x$), together with the normalization condition $\int q(x)\,dx = 1$.
  • 4) Similarly, let $A = \{x : p(x) > 0\}$ be the support of $p(x)$, and apply Jensen’s inequality in the form (1.115) to the discrete form of the KL divergence (2.110) to get [see Ref-1]
    $-\mathrm{KL}(p\|q) = -\sum_{x \in A} p(x)\ln\frac{p(x)}{q(x)} = \sum_{x \in A} p(x)\ln\frac{q(x)}{p(x)} \leq \ln\sum_{x \in A} p(x)\frac{q(x)}{p(x)} = \ln\sum_{x \in A} q(x)$  (2.115)
    $\leq \ln\sum_{x} q(x) = \ln 1 = 0$  (2.116)
    where the first inequality follows from Jensen’s inequality. Since $\ln x$ is a strictly concave function (the negative of the convex function $-\ln x$), we have equality in Equation (2.115) iff $q(x) = c\,p(x)$ for some constant $c$. We have equality in Equation (2.116) iff $\sum_{x \in A} q(x) = \sum_{x} q(x) = 1$, which implies $c = 1$.
  • 5) Hence $\mathrm{KL}(p\|q) = 0$ iff $p(x) = q(x)$ for all $x$. A numerical illustration of the inequality follows this list.
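
A minimal numerical illustration (not a proof) of the information inequality, assuming only the Python standard library: it draws random pairs of discrete distributions, confirms $\mathrm{KL}(p\|q) \geq 0$ in every case, and checks that $\mathrm{KL}(p\|p) = 0$.

```python
import math
import random

def kl_divergence(p, q):
    """KL(p||q) = sum_k p_k ln(p_k / q_k), from (2.110)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def random_distribution(k):
    """A random distribution over k states (illustrative, bounded away from zero)."""
    w = [random.uniform(0.1, 1.0) for _ in range(k)]
    total = sum(w)
    return [wi / total for wi in w]

random.seed(0)
divergences = [kl_divergence(random_distribution(5), random_distribution(5))
               for _ in range(10000)]
print(min(divergences) >= 0.0)              # True: KL(p||q) >= 0 in every trial
p = random_distribution(5)
print(abs(kl_divergence(p, p)) < 1e-12)     # True: KL(p||p) = 0
```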

3.3 How to use KL divergence

Note that:

  • we can interpret the KL divergence as a measure of the dissimilarity of the two distributions $p(x)$ and $q(x)$.
  • If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.

Problem description:

  • Suppose that data is being generated from an unknown distribution $p(x)$ that we wish to model.
  • We can try to approximate this distribution using some parametric distribution $q(x|\theta)$, governed by a set of adjustable parameters $\theta$, for example a multivariate Gaussian.
  • One way to determine $\theta$ is to minimize the KL divergence between $p(x)$ and $q(x|\theta)$ with respect to $\theta$.
  • We cannot do this directly because we don’t know $p(x)$. Suppose, however, that we have observed a finite set of training points $x_n$, for $n = 1, \dots, N$, drawn from $p(x)$. Then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using $\mathbb{E}[f] \simeq \frac{1}{N}\sum_{n=1}^{N} f(x_n)$ (1.35). Writing $\mathrm{KL}(p\|q) = \mathbb{E}_{p}\left[-\ln q(x|\theta) + \ln p(x)\right]$ and applying (1.35) gives $\mathrm{KL}(p\|q) \simeq \frac{1}{N}\sum_{n=1}^{N}\left\{-\ln q(x_n|\theta) + \ln p(x_n)\right\}$  (1.119)
  • The second term on the right-hand side is independent of $\theta$, and the first term is the negative log likelihood function for $\theta$ under the distribution $q(x|\theta)$ evaluated using the training set.
  • Thus we see that minimizing this KL divergence is equivalent to maximizing the likelihood function (see the sketch after this list).
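
A minimal sketch of this equivalence under simple assumptions of my own (a 1-D Gaussian model $q(x|\mu)$ with fixed unit variance, synthetic data, and a crude grid search): the $\mu$ that minimizes the sample-based KL estimate (1.119), up to the $\theta$-independent term, is the value that minimizes the negative log likelihood, i.e. essentially the sample mean.

```python
import math
import random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]   # synthetic draws from the "unknown" p(x)

def negative_log_likelihood(mu, xs):
    """-sum_n ln q(x_n | mu) for a unit-variance Gaussian model q."""
    return sum(0.5 * math.log(2.0 * math.pi) + 0.5 * (x - mu) ** 2 for x in xs)

# Minimizing the finite-sample KL estimate (1.119) over mu amounts to minimizing
# the negative log likelihood, since the ln p(x_n) term does not depend on mu.
candidates = [i / 100.0 for i in range(0, 401)]         # grid search over mu in [0, 4]
best_mu = min(candidates, key=lambda mu: negative_log_likelihood(mu, data))

print(best_mu, sum(data) / len(data))   # grid minimizer ~ sample mean (the MLE)
```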

3.4 Mutual information

Now consider the joint distribution between two sets of variables $x$ and $y$ given by $p(x, y)$.

Mutual information between the variables $x$ and $y$:

  • If $x$ and $y$ are independent, then $p(x, y) = p(x)p(y)$.
  • If $x$ and $y$ are not independent, we can gain some idea of whether they are “close” to being independent by considering the KL divergence between the joint distribution and the product of the marginals, given by
    $I[x, y] \equiv \mathrm{KL}\big(p(x, y)\,\|\,p(x)p(y)\big) = -\iint p(x, y)\ln\left\{\frac{p(x)\,p(y)}{p(x, y)}\right\}dx\,dy$  (1.120)
    which is called the mutual information between the variables $x$ and $y$.
  • Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy
    through $I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$  (1.121)

Understanding of Mutual information:

  • Thus we can view the mutual information as the reduction in the uncertainty about $x$ by virtue of being told the value of $y$ (or vice versa).
  • From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x|y)$ as the posterior distribution after we have observed new data $y$. The mutual information therefore represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$ (a small numerical check follows).
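
A short sketch of the discrete analogue of (1.120)-(1.121), using only the Python standard library and the same kind of illustrative 2x2 joint distribution as in Section 2.5: it computes $I[x, y]$ as the KL divergence between $p(x, y)$ and $p(x)p(y)$ and checks that it equals $H[x] - H[x|y]$.

```python
import math

# Illustrative joint distribution p(x, y) over x in {0,1}, y in {0,1}.
p_xy = {(0, 0): 0.30, (0, 1): 0.30, (1, 0): 0.10, (1, 1): 0.30}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I[x, y] = KL(p(x, y) || p(x) p(y)), discrete analogue of (1.120).
mutual_info = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

H_x   = -sum(p * math.log(p) for p in p_x.values())                   # H[x]
H_x_y = -sum(p * math.log(p / p_y[y]) for (x, y), p in p_xy.items())  # H[x|y]

print(mutual_info, H_x - H_x_y)   # equal, per (1.121); zero only if x and y are independent
```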

Reference

[1]: Section 2.8.2, Page 57, Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

 