Reposted from: https://colah.github.io/posts/2015-09-Visual-Information/
Recap of the previous parts: if a codeword is used 50% of the time, we can spend 50% of the code space to make it short; if a codeword is used only 1% of the time, we spend only 1% of the code space on it.
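Concretely, under this rule a codeword that gets 50% of the code space has length $\log_2\frac{1}{1/2} = 1$ bit, while one that gets only 1% of the space has length $\log_2\frac{1}{0.01} \approx 6.6$ bits.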
Calculating Entropy
Recall that the cost of a message of length $L$ is $\frac{1}{2^L}$. We can invert this to get the length of a message that costs a given amount: $\log_2\left(\frac{1}{\text{cost}}\right)$. Since we spend $p(x)$ on a codeword for $x$, the length is $\log_2\left(\frac{1}{p(x)}\right)$. Those are the best choices of lengths.
Earlier, we discussed how there is a fundamental limit to how short one can get the average message to communicate events from a particular probability distribution, p. This limit, the average message length using the best possible code, is called the entropy of p, H(p). Now that we know the optimal lengths of the codewords, we can actually calculate it!
$$H(p) = \sum_x p(x)\log_2\left(\frac{1}{p(x)}\right)$$
(People often write entropy as $H(p) = -\sum_x p(x)\log_2(p(x))$, using the identity $\log(1/a) = -\log(a)$. I think the first version is more intuitive, and will continue to use it in this essay.)
No matter what I do, on average I need to send at least that number of bits if I want to communicate which event occurred.
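As a quick sanity check of this formula, here is a minimal Python sketch that computes the entropy of a discrete distribution; the coin distributions below are illustrative examples, not ones from the text:

```python
import math

def entropy(p):
    """Average optimal message length: H(p) = sum_x p(x) * log2(1 / p(x))."""
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

# A fair coin needs 1 bit on average; a heavily biased coin needs much less.
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))   # ~0.469
```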
The average amount of information needed to communicate something has clear implications for compression. But are there other reasons we should care about it? Yes! It describes how uncertain I am and gives a way to quantify information.
If I knew for sure what was going to happen, I wouldn’t have to send a message at all! If there’s two things that could happen with 50% probability, I only need to send 1 bit. But if there’s 64 different things that could happen with equal probability, I’d have to send 6 bits. The more concentrated the probability, the more I can craft a clever code with short average messages. The more diffuse the probability, the longer my messages have to be.
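The 1-bit and 6-bit figures follow directly from the entropy formula: for a uniform distribution over $n$ equally likely outcomes,

$$H(p) = \sum_{i=1}^{n} \frac{1}{n}\log_2\left(\frac{1}{1/n}\right) = \log_2 n, \qquad \log_2 2 = 1 \text{ bit}, \quad \log_2 64 = 6 \text{ bits}.$$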
The more uncertain the outcome, the more I learn, on average, when I find out what happened.
Cross-Entropy
Shortly before his move to Australia, Bob married Alice, another figment of my imagination. To the surprise of myself, and also the other characters in my head, Alice was not a dog lover. She was a cat lover. Despite this, the two of them were able to find common ground in their shared obsession with animals and very limited vocabulary size.
The two of them say the same words, just at different frequencies. Bob talks about dogs all the time, Alice talks about cats all the time.
Initially, Alice sent me messages using Bob’s code. Unfortunately, her messages were longer than they needed to be. Bob’s code was optimized to his probability distribution. Alice has a different probability distribution, and the code is suboptimal for it. While the average length of a codeword when Bob uses his own code is 1.75 bits, when Alice uses his code it’s 2.25. It would be worse if the two weren’t so similar!
$$H_p(q) = \sum_x q(x)\log_2\left(\frac{1}{p(x)}\right) = 2.25 \text{ bits}$$
This length – the average length of communicating an event from one distribution with the optimal code for another distribution – is called the cross-entropy. Formally, we can define cross-entropy as:
$$H_p(q) = \sum_x q(x)\log_2\left(\frac{1}{p(x)}\right)$$
In this case, it's the cross-entropy of Alice the cat-lover's word frequency with respect to Bob the dog-lover's word frequency.
(Translator's note: "the cross-entropy of A with respect to B" means A's events sent using B's code.)
To keep the cost of our communications down, I asked Alice to use her own code. To my relief, this pushed down her average message length. But it introduced a new problem: sometimes Bob would accidentally use Alice’s code. Surprisingly, it’s worse for Bob to accidentally use Alice’s code than for Alice to use his!
So, now we have four possibilities:
- Bob using his own code (H(p) = 1.75 bits)
- Alice using Bob's code (Hp(q) = 2.25 bits)
- Alice using her own code (H(q) = 1.75 bits)
- Bob using Alice's code (Hq(p) = 2.375 bits)
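The four numbers above can be reproduced with a short Python sketch. The word frequencies below are assumptions chosen to be consistent with the values quoted in the text (dog/cat/fish/bird, with optimal codeword lengths $\log_2\frac{1}{p}$); they are illustrative rather than taken verbatim from this excerpt:

```python
import math

bob   = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}   # assumed distribution p
alice = {"dog": 1/8, "cat": 1/2, "fish": 1/4, "bird": 1/8}   # assumed distribution q

def cross_entropy(events, code):
    """Average length when events drawn from `events` are sent with the code optimized for `code`."""
    return sum(events[x] * math.log2(1 / code[x]) for x in events)

print(cross_entropy(bob, bob))      # H(p)  = 1.75 bits
print(cross_entropy(alice, bob))    # Hp(q) = 2.25 bits  (Alice's events, Bob's code)
print(cross_entropy(alice, alice))  # H(q)  = 1.75 bits
print(cross_entropy(bob, alice))    # Hq(p) = 2.375 bits (Bob's events, Alice's code)
```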
This isn’t necessarily as intuitive as one might think. For example, we can see that Hp(q)≠Hq§. Is there some way we can see how these four values relate to each other?
这不一定像人们想象的那样直观。 例如,我们可以看到Hp(q)≠Hq(p)。有什么办法可以看到这四个值之间的相互关系?
In the following diagram, each subplot represents one of these 4 possibilities. Each subplot visualizes average message length the same way our previous diagrams did. They are organized in a square, so that if the messages are coming from the same distribution the plots are beside each other, and if they use the same codes they are on top of each other. This allows you to kind of visually slide the distributions and codes together.
Can you see why Hp(q) ≠ Hq(p)? Hq(p) is large because there is an event (blue) which is very common under p but gets a long code because it is very uncommon under q. On the other hand, common events under q are less common under p, but the difference is less drastic, so Hp(q) isn't as high.
Cross-entropy isn’t symmetric.
So, why should you care about cross-entropy? Well, cross-entropy gives us a way to express how different two probability distributions are. The more different the distributions p and q are, the more the cross-entropy of p with respect to q will be bigger than the entropy of p.
Similarly, the more different p is from q, the more the cross-entropy of q with respect to p will be bigger than the entropy of q.
(Translator's note: these two figures appear to be swapped.)
The really interesting thing is the difference between the entropy and the cross-entropy. That difference is how much longer our messages are because we used a code optimized for a different distribution. If the distributions are the same, this difference will be zero. As the difference grows, it will get bigger.
We call this difference the Kullback–Leibler divergence, or just the KL divergence. The KL divergence of p with respect to q, Dq(p), is defined:
$$D_q(p) = H_q(p) - H(p)$$
The really neat thing about KL divergence is that it’s like a distance between two distributions. It measures how different they are! (If you take that idea seriously, you end up with information geometry.)
Cross-Entropy and KL divergence are incredibly useful in machine learning. Often, we want one distribution to be close to another. For example, we might want a predicted distribution to be close to the ground truth. KL divergence gives us a natural way to do this, and so it shows up everywhere.
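Continuing the sketch above (reusing the assumed `bob` and `alice` distributions and the `cross_entropy` helper), the KL divergence is just the extra cost of using a code optimized for the wrong distribution:

```python
def kl_divergence(p, q):
    """D_q(p) = H_q(p) - H(p): extra bits paid when p's events are sent with q's code."""
    return cross_entropy(p, q) - cross_entropy(p, p)

print(kl_divergence(bob, alice))   # D_q(p) = 2.375 - 1.75 = 0.625 bits
print(kl_divergence(alice, bob))   # D_p(q) = 2.25  - 1.75 = 0.5   bits (not symmetric)
```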
Entropy and Multiple Variables
Let’s return to our weather and clothing example from earlier:
My mother, like many parents, sometimes worries that I don’t dress appropriately for the weather. (She has reasonable cause for suspicion – I have often failed to wear coats in winter.) So, she often wants to know both the weather and what clothing I’m wearing. How many bits do I have to send her to communicate this?
Well, the easy way to think about this is to flatten the probability distribution:
Now we can figure out the optimal codewords for events of these probabilities and compute the average message length:
(Translator's note: in each rectangle, one side is the probability $p$ and the other is the code length $\log_2\frac{1}{p}$.)
We call this the joint entropy of X and Y, defined
$$H(X,Y) = \sum_{x,y} p(x,y)\log_2\left(\frac{1}{p(x,y)}\right)$$
This is the exact same as our normal definition, except with two variables instead of one.
A slightly nicer way to think about this is to avoid flattening the distribution, and just think of the code lengths as a third dimension. Now the entropy is the volume!
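As a sketch, the joint entropy can be computed directly from the flattened joint distribution. The weather/clothing numbers below are a hypothetical joint distribution for illustration only; the actual values from the earlier example are not reproduced in this excerpt:

```python
import math

# Hypothetical joint distribution p(weather, clothing), for illustration only.
joint = {
    ("sunny", "t-shirt"): 0.60,
    ("sunny", "coat"):    0.15,
    ("rain",  "t-shirt"): 0.05,
    ("rain",  "coat"):    0.20,
}

def joint_entropy(p_xy):
    """H(X,Y): entropy of the flattened joint distribution."""
    return sum(p * math.log2(1 / p) for p in p_xy.values() if p > 0)

print(joint_entropy(joint))   # ~1.53 bits
```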
But suppose my mom already knows the weather. She can check it on the news. Now how much information do I need to provide?
It seems like I need to send however much information I need to communicate the clothes I’m wearing. But I actually need to send less, because the weather strongly implies what clothing I’ll wear! Let’s consider the case where it’s raining and where it’s sunny separately.
In both cases, I don’t need to send very much information on average, because the weather gives me a good guess at what the right answer will be. When it’s sunny, I can use a special sunny-optimized code, and when it’s raining I can use a raining optimized code. In both cases, I send less information than if I used a generic code for both. To get the average amount of information I need to send my mother, I just put these two cases together…
(Translator's note: that is, in the two cases we use different codes, and then average the two results.)
We call this the conditional entropy. If you formalize it into an equation, you get:
$$H(X|Y) = \sum_y p(y)\sum_x p(x|y)\log_2\left(\frac{1}{p(x|y)}\right) = \sum_{x,y} p(x,y)\log_2\left(\frac{1}{p(x|y)}\right)$$
(Here y is the weather and x is the clothing.)
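Continuing with the hypothetical `joint` distribution from the sketch above, the conditional entropy can be computed by marginalizing out the weather; note that $\log_2\frac{1}{p(x|y)} = \log_2\frac{p(y)}{p(x,y)}$:

```python
def conditional_entropy(p_xy):
    """H(X|Y) = sum_{x,y} p(x,y) * log2(p(y) / p(x,y)), where y (the weather) is the
    first key component and x (the clothing) is the second."""
    p_y = {}
    for (y, _), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p_y[y] / p) for (y, _), p in p_xy.items() if p > 0)

print(conditional_entropy(joint))   # H(clothing | weather) ≈ 0.72 bits
```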
Mutual Information
In the previous section, we observed that knowing one variable can mean that communicating another variable requires less information.
One nice way to think about this is to imagine amounts of information as bars. These bars overlap if there's shared information between them. For example, some of the information in X and Y is shared between them, so H(X) and H(Y) are overlapping bars. And since H(X,Y) is the information in both, it's the union of the bars H(X) and H(Y).
Once we think about things this way, a lot of things become easier to see.
For example, we previously noted it takes more information to communicate both X and Y (the “joint entropy,” H(X,Y)) than it takes to just communicate X (the “marginal entropy,” H(X)). But if you already know Y, then it takes less information to communicate X (the “conditional entropy,” H(X|Y)) than it would if you didn’t!
That sounds a bit complicated, but it’s very simple when we think about it from the bar perspective. H(X|Y) is the information we need to send to communicate X to someone who already knows Y, the information in X which isn’t also in Y. Visually, that means H(X|Y) is the part of H(X) bar which doesn’t overlap with H(Y).
You can now read the inequality H(X,Y)≥H(X)≥H(X|Y) right off the following diagram.
Another identity is that H(X,Y)=H(Y)+H(X|Y). That is, the information in X and Y is the information in Y plus the information in X which is not in Y.
Again, it’s difficult to see in the equations, but easy to see if you’re thinking in terms of these overlapping bars of information.
At this point, we've broken the information in X and Y up in several ways. We have the information in each variable, H(X) and H(Y). We have the union of the information in both, H(X,Y). We have the information which is in one but not the other, H(X|Y) and H(Y|X). A lot of this seems to revolve around the information shared between the variables, the intersection of their information. We call this "mutual information," I(X,Y), defined as:
$$I(X,Y) = H(X) + H(Y) - H(X,Y)$$
This definition works because H(X)+H(Y) has two copies of the mutual information, since it’s in both X and Y, while H(X,Y) only has one. (Consider the previous bar diagram.)
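Reusing the hypothetical `joint` distribution and the `entropy` and `joint_entropy` helpers from the sketches above:

```python
def marginal(p_xy, axis):
    """Marginal distribution over one component of the joint (0 = weather, 1 = clothing)."""
    out = {}
    for key, p in p_xy.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def mutual_information(p_xy):
    """I(X,Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(marginal(p_xy, 1)) + entropy(marginal(p_xy, 0)) - joint_entropy(p_xy)

print(mutual_information(joint))   # ≈ 0.21 bits shared between weather and clothing
```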
Closely related to the mutual information is the variation of information. The variation of information is the information which isn’t shared between the variables. We can define it like so:
$$V(X,Y) = H(X,Y) - I(X,Y)$$
(Translator's note: it follows that the mutual information is exactly the information shared between the variables.)
Variation of information is interesting because it gives us a metric, a notion of distance, between different variables. The variation of information between two variables is zero if knowing the value of one tells you the value of the other and increases as they become more independent.
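A sketch of that metric property, still with the helpers above: V(X,Y) is zero when each variable determines the other, and grows as the variables become more independent. The deterministic distribution below is a made-up example:

```python
def variation_of_information(p_xy):
    """V(X,Y) = H(X,Y) - I(X,Y): the information not shared between X and Y."""
    return joint_entropy(p_xy) - mutual_information(p_xy)

print(variation_of_information(joint))           # ≈ 1.32 bits

# If the clothing completely determines the weather and vice versa, nothing is unshared:
deterministic = {("sunny", "t-shirt"): 0.75, ("rain", "coat"): 0.25}
print(variation_of_information(deterministic))   # 0.0
```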
How does this relate to KL divergence, which also gave us a notion of distance? Well, KL divergence gives us a distance between two distributions over the same variable or set of variables. In contrast, variation of information gives us distance between two jointly distributed variables. KL divergence is between distributions, variation of information within a distribution.
We can bring this all together into a single diagram relating all these different kinds of information:
Summary (translator's note):

$$V(X,Y) = H(X,Y) - I(X,Y)$$

$$I(X,Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}$$

The variation of information is the distance between two jointly distributed variables: the information that is not shared between them.

$$D_q(p) = H_q(p) - H(p) = \sum_x p(x)\log_2\frac{1}{q(x)} - \sum_x p(x)\log_2\frac{1}{p(x)} = \sum_x p(x)\log_2\frac{p(x)}{q(x)}$$

The really neat thing about KL divergence is that it's like a distance between two distributions: it measures how different they are.
Fractional Bits
A very unintuitive thing about information theory is that we can have fractional numbers of bits. That’s pretty weird. What does it mean to have half a bit?
Here’s the easy answer: often, we’re interested in the average length of a message rather than any particular message length. If half the time one sends a single bit, and half the time one sends two bits, on average one sends one and a half bits. There’s nothing strange about averages being fractional.
But that answer is really dodging the issue. Often, the optimal lengths of codewords are fractional. What do those mean?
To be concrete, let’s consider a probability distribution where one event, a, happens 71% of the time and another event, b, occurs 29% of the time.
The optimal code would use 0.5 bits to represent a, and 1.7 bits to represent b. Well, if we want to send a single one of these codewords, it simply isn’t possible. We’re forced to round to a whole number of bits, and send on average 1 bit.
… But if we’re sending multiple messages at once, it turns out that we can do better. Let’s consider communicating two events from this distribution. If we sent them independently, we’d need to send two bits. Can we do better?
Half the time, we need to communicate aa, 21% of the time we need to send ab or ba, and 8% of the time we need to communicate bb. Again, the ideal code involves fractional numbers of bits.
If we round the codeword lengths, we’ll get something like this:
This code gives us an average message length of 1.8 bits. That's less than the 2 bits when we send them independently. Another way of thinking of this is that we're sending 0.9 bits on average for each event. If we were to send more events at once, it would become smaller still. As n tends to infinity, the overhead due to rounding our code would vanish, and the number of bits per codeword would approach the entropy.
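A minimal sketch of this calculation; the specific rounded codewords are an assumption (any prefix code with lengths 1, 2, 3, 3 for aa, ab, ba, bb works):

```python
import math

p = {"a": 0.71, "b": 0.29}

# Ideal (fractional) codeword lengths: log2(1/p)
ideal = {x: math.log2(1 / px) for x, px in p.items()}        # a: ~0.49, b: ~1.79

# Probabilities of pairs of independent events, and rounded codeword lengths
# for one possible prefix code: aa -> "0", ab -> "10", ba -> "110", bb -> "111".
pairs   = {x + y: p[x] * p[y] for x in p for y in p}         # aa: 0.504, ab/ba: 0.206, bb: 0.084
lengths = {"aa": 1, "ab": 2, "ba": 3, "bb": 3}

avg_pair = sum(pairs[s] * lengths[s] for s in pairs)
print(avg_pair)                                              # ~1.79 bits per pair of events
print(avg_pair / 2)                                          # ~0.89 bits per event, below 1 bit
print(sum(px * math.log2(1 / px) for px in p.values()))      # entropy limit: ~0.87 bits
```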
Further, notice that the ideal codeword length for a was 0.5 bits, and the ideal codeword length for aa was 1 bit. Ideal codeword lengths add, even when they’re fractional! So, if we communicate a lot of events at once, the lengths will add.
There is a very real sense in which one can have fractional numbers of bits of information, even if actual codes can only use whole numbers.
(In practice, people use particular coding schemes which are efficient to different extents. Huffman coding, which is basically the kind of code we’ve sketched out here, doesn’t handle fractional bits very gracefully – you have to group symbols, like we did above, or use more complicated tricks to approach the entropy limit. Arithmetic coding is a bit different, but elegantly handles fractional bits to be asymptotically optimal.)