往期文章链接目录
Before reading this post, make sure you are familiar with the EM Algorithm and decent among of knowledge of convex optimization. If not, please check out my previous post
Let’s get started!
Conditional independence
A and B are conditionally independent given C if and only if, given knowledge that C occurs, knowledge of whether A occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of A occurring.
Formally, if we denote conditional independence of A and B given C by (A⊥⊥B)∣C, then by definition, we have
(A⊥⊥B)∣C⟺P(A,B∣C)=P(A∣C)⋅P(B∣C)
Given the knowledge that C occurs, to show the knowledge of whether B occurs provides no information on the likelihood of A occurring, we have
P(A∣B,C)=P(B,C)P(A,B,C)=P(B,C)P(A,B∣C)⋅P(C)=P(B∣C)⋅P(C)P(A∣C)⋅P(B∣C)⋅P(C)=P(A∣C)
Two classical cases where X and Z are conditionally independent
Case 1 :
From the above directed graph, we have P(X,Y,Z)=P(X)⋅P(Y∣X)⋅P(Z∣Y). Hence we have
P(Z∣X,Y)=P(X,Y)P(X,Y,Z)=P(X)⋅P(Y∣X)P(X)⋅P(Y∣X)⋅P(Z∣Y)=P(Z∣Y)
Therefore, X and Z are conditionally independent.
Case 2 :
From the above directed graph, we have P(X,Y,Z)=P(Y)⋅P(X∣Y)⋅P(Z∣Y). Hence we have
P(Z∣X,Y)=P(X,Y)P(X,Y,Z)=P(Y)⋅P(X∣Y)P(Y)⋅P(X∣Y)⋅P(Z∣Y)=P(Z∣Y)
Therefore, X and Z are conditionally independent.
Settings of the Hidden Markov Model (HMM)
The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us something about the probabilities of sequences of random variables, states, each of which can take on values from some set. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state.
To put it formally, suppose we have a sequence of state variables z1,z2,...,zn. Then the Markov assumption is
p(zn∣z1z2...zn−1)=p(zn∣zn−1)
A Markov chain is useful when we need to compute a probability for a sequence of observable events. However, in many cases the events we are interested in are hidden. For example we don’t normally observe part-of-speech (POS) tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed.
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:
-
A sequence of hidden states z, where zk takes values from all possible hidden states Z={1,2,..,m}.
-
A sequence of observations x, where x=(x1,x2,...,xn). Each one is drawn from a vocabulary V.
-
A transition probability matrix A, where A is an m×m matrix. Aij represents the probability of moving from state i to state j: Aij=p(zt+1=j∣zt=i), and ∑j=1mAij=1 for all i.
-
An emission probability matrix B, where B is an m×∣V∣ matrix. Bij represents the probability of an observation xj being generated from a state i: Bij=P(xt=Vj∣zt=i)
-
An initial probability distribution π over states, where π=(π1,π2,...,πm). πi is the probability that the Markov chain will start in state i. ∑i=1mπi=1.
Given a sequence x and the corresponding hidden states z (like one in the picture above), we have
P(x,z∣θ)=p(z1)⋅[p(z2∣z1)⋅p(z3∣z2)⋅...⋅(zn∣zn−1)]⋅[p(x1∣z1)⋅p(x2∣z2)⋅...⋅p(xn∣zn)](0)
We get p(z1) from π, p(zk+1∣zk) from A, and p(xk∣zk) from B.
Useful probabilities p(zk∣x) and p(zk+1,zk∣x)
p(zk∣x) and p(zk+1,zk∣x) are useful probabilities and we are going to use them later.
Intuition: Once we have a sequence x, we might be interested in find the probability of any hidden state zk, i.e., find probabilities p(zk=1∣x),p(zk=2∣x),...,p(zk=m∣x). we have the following
p(zk∣x)=p(x)p(zk,x)∝p(zk,x)(1)(2)
Note that from (1) to (2), since p(x) doesn’t change for all values of zk, p(zk∣x) is proportional to p(zk,x).
p(zk=i,x)=p(zk=i,x1:k,xk+1:n)=p(zk=i,x1:k)⋅p(xk+1:n∣zk=i,x1:k)=p(zk=i,x1:k)⋅p(xk+1:n∣zk=i)=αk(zk=i)⋅βk(zk=i)(3)(4.1)(4.11)
From the above graph, we see that the second term (3) is the 2nd classical cases. So xk+1:n and x1:k are conditionally independent. This is why we can go from (3) to (4.1). We are going to use the Forward Algorithm to compute p(zk,x1:k), and Backward Algorithm to compute p(xk+1:n∣zk) later.
We denote p(zk,x1:k) by αk(zk) and p(xk+1:n∣zk) by βk(zk).
After we know how to calculate these two terms separately, we can calculate p(zk∣x) easily by introducing a normalization term. That is,
p(zk=i∣x)=∑j=1mp(zk=j,x)p(zk=i,x)=∑j=1mαk(zk=j)βk(zk=j)αk(zk=i)βk(zk=i)(4.2)
where ∑jmp(zk=j,x) is the normalization term which makes p(zk=1,x) take values between 0 and 1 for all zk.
Similarly, we are also interested in finding p(zk+1,zk∣x), where
p(zk+1,zk∣x)∝p(zk+1,zk,x)
By using the property of conditional independence, we have
p(zk+1=j,zk=i,x)=p(zk=i,zk+1=j,x1:k,xk+1,xk+2:n)=p(zk=i,x1:k)⋅p(xk+2:n)∣zk+1=j)⋅p(zk+1=j∣zk=i)⋅p(xk+1∣zk+1=j)=αk(zk=i)⋅βk+1(zk+1=j)⋅p(zk+1=j∣zk=i)⋅p(xk+1∣zk+1=j)(4.3)(4.4)
Note that we can find the third and the forth term from the transition probability matrix and the emission probability matrix. Again, we can calculate p(zk+1,zk∣x) simply by introducing a normalization term. That is,
p(zk+1=s,zk=r∣x)=∑i=1m∑j=1mp(zk+1=j,zk=i,x)p(zk+1=s,zk=r,x)=∑i=1mαk(zk=i)⋅βk+1(zk+1=j)⋅p(zk+1=j∣zk=i)⋅p(xk+1∣zk+1=j)αk(zk=r)⋅βk+1(zk+1=s)⋅p(zk+1=s∣zk=r)⋅p(xk+1∣zk+1=s)(4.42)
Remark
-
We denote
γk(i)=p(zk=i∣x)=p(x)p(zk=i,x)(4.43)
-
We denote
ξk(i,j)=p(zk+1=j,zk=i∣x)=p(x)p(zk+1=j,zk=i,x)(4.44)
Three fundamental problems of HMM
Problem 1 (Likelihood): Given an observation sequence x and parameters θ=(A,B,π), determine the likelihood p(x∣θ).
Problem 2 (Learning): Given an observation sequence x, learn the parameters θ=(A,B,π).
Problem 3 (Inference): Given an observation sequence x and parameters θ=(A,B,π), discover the best hidden state sequence z.
Problem 1 (Likelihood)
Goal: Given an observation sequence x and parameters θ=(A,B,π), determine the likelihood p(x∣θ).
Naive Way:
From (0), we have already know how to compute P(x,z∣θ), so we can compute p(x∣θ) by summing all possible sequence z:
p(x∣θ)=z∑P(x,z∣θ)⋅p(z∣θ)
This method is not applicable since there are mn ways of combinations of sequence z. So we introduce the following two algorithm: Forward Algorithm and Backward Algorithm.
Forward Algorithm
- Goal: Compute p(zk,x1:k), given θ=(A,B,π).
From the picture above, it’s natural to compute p(zk,x1:k) by dynamic programming (DP). That is, to calculate it in terms of p(zk−1,x1:k−1):
p(zk,x1:k)=zk−1∑p(zk,zk−1,x1:k−1,xk)=zk−1∑p(zk−1,x1:k−1)⋅p(zk,xk∣zk−1,x1:k−1)=zk−1∑p(zk−1,x1:k−1)⋅p(zk∣zk−1,x1:k−1)⋅p(xk∣zk,zk−1,x1:k−1)=zk−1∑p(zk−1,x1:k−1)⋅p(zk∣zk−1)⋅p(xk∣zk)(5)(6)(7)(8)
Ramark:
-
From (6) to (7), we use the fact that p(b,c∣a)=p(b∣a)⋅p(c∣a,b).
-
From (7) to (8), we use the conditional independence, which is visualized in the picture above.
-
We denote p(zk,x1:k) by αk(zk), so
αk(zk)=p(zk,x1:k)=zk−1∑αk−1(zk−1)⋅p(zk∣zk−1)⋅p(xk∣zk)(9)
-
In equation (9), the term p(zk∣zk−1) is the transition probability from state zk−1 to state zk; the term p(xk∣zk) is the emission probability of observing xk given state zk.
-
α1(z1=q)=p(z1=q,x1)=πq⋅p(x1∣z1=q), where p(x1∣z1=q) is an emmission probability.
Knowing how to compute p(zk,x1:k) recurssively, we have
p(x∣θ)=p(x1:n∣θ)=zn∑p(zn,x1:n)=zn∑αn(zn)=q=1∑mαn(zn=q)
Backward Algorithm
- Goal: Compute p(xk+1:n∣zk), given θ=(A,B,π).
Again, we are going to use DP to compute p(xk+1:n∣zk) in terms of p(xk+2:n∣zk+1):
p(xk+1:n∣zk)=zk+1∑p(xk+1,xk+2:n,zk+1∣zk)=zk+1∑p(xk+2:n,zk+1∣zk)⋅p(xk+1∣zk,xk+2:n,zk+1)=zk+1∑p(zk+1∣zk)⋅p(xk+2:n∣zk+1,zk)⋅p(xk+1∣zk,xk+2:n,zk+1)=zk+1∑p(xk+2:n∣zk+1)⋅p(zk+1∣zk)⋅p(xk+1∣zk+1)(10)(11)
Ramark:
-
From (10) to (11), we use the conditional independece similar to the one in forward algorithm.
-
We denote p(xk+1:n∣zk) by βk(zk), so
βk(zk)=p(xk+1:n∣zk)=zk+1∑p(xk+2:n∣zk+1)⋅p(zk+1∣zk)⋅p(xk+1∣zk+1)(12)
-
In equation (12), the term p(zk+1∣zk) is the transition probability from state zk to state zk+1; the term p(xk+1∣zk+1) is the emission probability of observing xk+1 given state zk+1.
-
βn(zn)=1.
Knowing how to compute p(xk+1:n∣zk) recursively, we have
p(x∣θ)=z1∑p(x,z1)=z1∑p(x∣z1)⋅p(z1)=z1∑p(x1,x1+1:n∣z1)⋅p(z1)=z1∑p(x1∣z1)⋅p(x1+1:n∣z1,x1)⋅p(z1)=z1∑p(x1∣z1)⋅p(x1+1:n∣z1)⋅p(z1)=z1∑p(x1∣z1)⋅β1(z1)⋅p(z1)=q=1∑mβ1(z1=q)⋅p(x1∣z1=q)⋅πq(13)(14)
From (13) to (14), we use the conditional independence. To make it clean, I didn’t include θ in the above derivation, but keep in mind x is conditioned on θ.
Problem 2 (Learning)
Goal: Given an observation sequence x, learn the parameters θ=(A,B,π).
Given that the hidden states are unknown, it’s natural to use the EM Algorithm to solve parameters. Remind that the EM Algorithm consists of two steps:
- An expectation (E) step, which creates a function Q(θ,θi) for the expectation of the log-likelihood logp(x,z∣θ) evaluated using the current conditional distribution of z given x and the current estimate of the parameters θi, where
Q(θ,θi)=Ez∼P(z∣x,θi)[logp(x,z∣θ)]=z∑P(z∣x,θi)⋅logp(x,z∣θ)
- A maximization (M) step, which computes parameters maximizing the expected log-likelihood Q(θ,θi) found on the E step and then update parameters to θi+1.
We fist initialize parameters θ0=(A0,B0,π0)
E Step:
We are going to construct Q(θ,θi).
Q(θ,θi)=Ez∼P(z∣x,θi)[logp(x,z∣θ)]=z∑P(z∣x,θi)⋅logp(x,z∣θ)=logp(x,z∣θ)⋅p(x∣θi)p(x,z∣θi)∝logp(x,z∣θ)⋅p(x,z∣θi)(15)(16)
Since we know x and θi, p(x∣θi) is a constant and therefore we can write from (15) to (16). In the earlier section “Settings of the Hidden Markov Model” of the post, we deduce that
P(x,z∣θ)=p(z1)⋅[p(z2∣z1)⋅p(z3∣z2)⋅...⋅(zn∣zn−1)]⋅[p(x1∣z1)⋅p(x2∣z2)⋅...⋅p(xn∣zn)]
So we can formulate Q(θ,θi) as
Q(θ,θi)=z∑(logπzi+t=1∑n−1logp(zt+1∣zt)+t=1∑nlogp(xn∣zn))⋅p(x,z∣θi)=z∑logπzi⋅p(x,z∣θi)+z∑t=1∑n−1logp(zt+1∣zt)⋅p(x,z∣θi)+z∑t=1∑nlogp(xt∣zt)⋅p(x,z∣θi)
M Step:
We are going to maximize Q(θ,θi) and update θi+1.
Note that we write Q(θ,θi) as the sum of three terms. The first therm is related to π, the second term is related to A, and the third term is related to B. Therefore we can maximize each term separately.
We can write the first term as
z∑logπzi⋅p(x,z∣θi)=j=1∑mlogπj⋅p(x,z1=j∣θi)
under the constraint ∑j=1mπj=1. Clearly, this is a convex optimization problem:
The Lagrangian L associated with the problem is
L(π,v)=j=1∑mlogπj⋅p(x,z1=j∣θi)+v⋅(j=1∑mπj−1)
Note that any pair of primal and dual optimal points must satisfy the KKT conditions. So we use one KKT property that the gradient must vanish at the optimal point to find π. This might not be the optimal π since “any pair of primal and dual optimal points must satisfy the KKT conditions” doesn’t imply that a point satisfying the KKT conditions is the optimal.
∂πj∂L=p(x,z1=j∣θi)⋅πj1+vp(x,z1=j∣θi)+v⋅πjπj=0=0=v−p(x,z1=j∣θi)(17)
By setting ∂πj∂L=0 for all j, we have
j=1∑mp(x,z1=j∣θi)+v⋅j=1∑mπjp(x∣θi)+vv=0=0=−p(x∣θi)(18)
By plugging (18) into (17), we have
πj=v−p(x,z1=j∣θi)=p(x∣θi)p(x,z1=j∣θi)=γ1(j).
In the similar way, we can write the second term as
z∑t=1∑n−1logp(zt+1∣zt)⋅p(x,z∣θi)=j=1∑mk=1∑mt=1∑n−1logp(zt+1=k∣zt=j)⋅p(x,zt=j,zt+1=k∣θi)z∑t=1∑nlogp(xt∣zt)⋅p(x,z∣θi)=j=1∑mt=1∑nlogp(xt∣zt=j)⋅p(x,zt=j∣θi)
with seperate constraints
k=1∑mp(zt+1=k∣zt=j)=1,j=1∑mp(xt∣zt=j)=1
We can solve for optimal parameters similar to solving for π. After we set the gradient of corresponding Lagrangian to 0, we have
AjkBjk=p(zt+1=k∣zt=j)=∑t=1n−1p(x,zt=j∣θi)∑t=1n−1p(x,zt=j,zt+1=k∣θi)=∑t=1n−1p(x,zt=j∣θi)/p(x∣θi)∑t=1n−1p(x,zt=j,zt+1=k∣θi)/p(x∣θi)=∑t=1n−1γt(j)∑t=1n−1ξt(jk)=p(xt=Vk∣zt=j)=∑t=1n−1p(x,zt=j∣θi)∑t=1n−1p(x,zt=j∣θi)⋅I(xt=Vk)=∑t=1n−1p(x,zt=j∣θi)/p(x∣θi)∑t=1n−1p(x,zt=j∣θi)/p(x∣θi)⋅I(xt=Vk)=∑t=1n−1γt(j)∑t=1n−1γt(j)⋅I(xt=Vk)(19)(20)
Remark: I(xt=Vk) is an indicator function. If xt=Vk, then I(xt=Vk)=1, and 0 otherwise. We get the results (19) and (20) from (4.43) and (4.44).
We also call this algorithm Baum-Welch algorithm.
Problem 3 (Inference)
Goal: Given an observation sequence x and parameters θ=(A,B,π), discover the best hidden state sequence z.
Method 1:Brute force.
This is not applicable. Every hidden state has m choices and the sequence has length n. So there are mn possible combinations.
Method 2: Use Forward/ Backward Algorithm.
Given a sequence x, we know how to compute p(zk∣x) from the top of the post. Therefore, at each time k, we can compute p(zk=i∣x) for all i∈{1,2,...,m} and choose the one with the highest probability. In the way, at every time, we chose the most possible hidden state. However, there is still a problem. It only finds the most possible hidden state locally and doesn’t take the whole sequence into account. Even if we chose the most possible hidden state at each time k. The combination of them might not be the best one and even doesn’t make sense. Foe example, if Aij=0, then if zk=i, zk+1 cannot take state j. But the Forward/ Backward Algorithm doesn’t take it into account. We can think of it as a greedy approach to approximate the best result.
Method 3: Viterbi algorithm:
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states that results in a sequence of observed events in HMM.
We define δk(i) to be the maximum probability among all paths which are at state i at time k. That is,
δk(i)=i1:ik−1maxp(ik=i,i1:ik−1,x1:k∣θ),i=1,2,...,m
We can construct a recurssion formula δk(i). That is,
δk+1(i)=i1:ikmaxp(ik+1=i,i1:ik,x1:k+1∣θ)=1≤j≤mmaxδk(j)⋅Aji⋅p(xk+1∣zk+1=i)i=1,2,...,m;k=1,2,...,n−1
And the base case is δ1(i)=πi⋅p(x1∣z1=i).
Therefore, we compute the optimal probability P∗
P∗=1≤i≤mmaxδn(i)
by recursion. During the process of finding the highest probability of a path, we keep recording the hidden states associate with the path. So after we find the the highest probability, we also record the path associated with it and therefore the best sequence z.
Reference:
- https://towardsdatascience.com/conditional-independence-the-backbone-of-bayesian-networks-85710f1b35b
- https://courses.cs.washington.edu/courses/cse473/16au/slides-16au/25-bn.pdf
- https://en.wikipedia.org/wiki/Conditional_independence
- https://web.stanford.edu/~jurafsky/slp3/A.pdf
- https://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/em-hmm.pdf
- https://people.eecs.berkeley.edu/~stephentu/writeups/hmm-baum-welch-derivation.pdf
- 统计学习方法,李航著,清华大学出版社