Hidden Markov Model (HMM): Detailed Derivation and Analysis


Before reading this post, make sure you are familiar with the EM Algorithm and have a decent amount of knowledge of convex optimization. If not, please check out my previous posts.

Let’s get started!

Conditional independence

$A$ and $B$ are conditionally independent given $C$ if and only if, given knowledge that $C$ occurs, knowledge of whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring.

Formally, if we denote conditional independence of $A$ and $B$ given $C$ by $(A \perp\!\!\!\perp B)\mid C$, then by definition, we have

$$(A \perp\!\!\!\perp B)\mid C \quad \iff \quad P(A, B\mid C) = P(A\mid C) \cdot P(B\mid C)$$

Given the knowledge that $C$ occurs, to show that knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring, we have

$$
\begin{aligned}
P(A \mid B, C) &= \frac{P(A, B, C)}{P(B, C)} \\
&= \frac{P(A, B \mid C) \cdot P(C)}{P(B, C)} \\
&= \frac{P(A \mid C) \cdot P(B \mid C) \cdot P(C)}{P(B \mid C) \cdot P(C)} \\
&= P(A \mid C)
\end{aligned}
$$

Two classical cases where $X$ and $Z$ are conditionally independent

Case 1:

[Figure: directed graph $X \rightarrow Y \rightarrow Z$]

From the above directed graph, we have $P(X,Y,Z) = P(X)\cdot P(Y\mid X)\cdot P(Z\mid Y)$. Hence we have

$$
\begin{aligned}
P(Z\mid X,Y) &= \frac{P(X,Y,Z)}{P(X,Y)} \\
&= \frac{P(X)\cdot P(Y\mid X)\cdot P(Z\mid Y)}{P(X)\cdot P(Y\mid X)} \\
&= P(Z\mid Y)
\end{aligned}
$$

Therefore, $X$ and $Z$ are conditionally independent given $Y$.

Case 2:

[Figure: directed graph $X \leftarrow Y \rightarrow Z$]

From the above directed graph, we have $P(X,Y,Z) = P(Y)\cdot P(X\mid Y) \cdot P(Z\mid Y)$. Hence we have

$$
\begin{aligned}
P(Z\mid X,Y) &= \frac{P(X,Y,Z)}{P(X,Y)} \\
&= \frac{P(Y)\cdot P(X\mid Y) \cdot P(Z\mid Y)}{P(Y)\cdot P(X\mid Y)} \\
&= P(Z\mid Y)
\end{aligned}
$$

Therefore, $X$ and $Z$ are conditionally independent given $Y$.
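
The algebra above can also be sanity-checked numerically. Below is a minimal NumPy sketch (the distributions are arbitrary toy values and the variable names are my own) that builds a joint distribution factorized as in Case 1 and verifies that $P(Z \mid X, Y) = P(Z \mid Y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distributions (arbitrary values) for a chain X -> Y -> Z, as in Case 1.
p_x = rng.dirichlet(np.ones(3))             # P(X),   shape (3,)
p_y_given_x = rng.dirichlet(np.ones(4), 3)  # P(Y|X), shape (3, 4), rows sum to 1
p_z_given_y = rng.dirichlet(np.ones(2), 4)  # P(Z|Y), shape (4, 2), rows sum to 1

# Joint P(X, Y, Z) = P(X) * P(Y|X) * P(Z|Y), shape (3, 4, 2)
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

# P(Z | X, Y): normalize the joint over Z for each (X, Y) pair
p_z_given_xy = joint / joint.sum(axis=2, keepdims=True)

# P(Z | Y): marginalize out X, then normalize over Z
p_yz = joint.sum(axis=0)
p_z_given_y_from_joint = p_yz / p_yz.sum(axis=1, keepdims=True)

# Conditional independence: P(Z | X, Y) equals P(Z | Y) for every value of X
assert np.allclose(p_z_given_xy, p_z_given_y_from_joint[None, :, :])
print("X and Z are conditionally independent given Y")
```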

Settings of the Hidden Markov Model (HMM)

The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us something about the probabilities of sequences of random variables, states, each of which can take on values from some set. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state.

To put it formally, suppose we have a sequence of state variables $z_1, z_2, \ldots, z_n$. Then the Markov assumption is

$$p(z_n \mid z_1 z_2 \ldots z_{n-1}) = p(z_n \mid z_{n-1})$$

A Markov chain is useful when we need to compute a probability for a sequence of observable events. However, in many cases the events we are interested in are hidden. For example, we don't normally observe part-of-speech (POS) tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed.

[Figure: an HMM with hidden states $z_1, \ldots, z_n$ and observations $x_1, \ldots, x_n$]

A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:

  • A sequence of hidden states $z$, where each $z_k$ takes values from the set of all possible hidden states $Z = \{1, 2, \ldots, m\}$.

  • A sequence of observations $x$, where $x = (x_1, x_2, \ldots, x_n)$. Each observation is drawn from a vocabulary $V$.

  • A transition probability matrix $A$, where $A$ is an $m \times m$ matrix. $A_{ij}$ represents the probability of moving from state $i$ to state $j$: $A_{ij} = p(z_{t+1}=j \mid z_t=i)$, with $\sum_{j=1}^{m} A_{ij} = 1$ for all $i$.

  • An emission probability matrix $B$, where $B$ is an $m \times |V|$ matrix. $B_{ij}$ represents the probability of observing symbol $V_j$ from state $i$: $B_{ij} = p(x_t = V_j \mid z_t = i)$.

  • An initial probability distribution $\pi$ over states, where $\pi = (\pi_1, \pi_2, \ldots, \pi_m)$. $\pi_i$ is the probability that the Markov chain starts in state $i$, and $\sum_{i=1}^{m} \pi_i = 1$.

Given a sequence $x$ and the corresponding hidden states $z$ (like the one in the picture above), we have

$$P(x, z\mid\theta) = p(z_1) \cdot \left[\, p(z_2\mid z_1)\cdot p(z_3\mid z_2) \cdots p(z_n\mid z_{n-1}) \,\right] \cdot \left[\, p(x_1\mid z_1)\cdot p(x_2\mid z_2) \cdots p(x_n\mid z_n) \,\right] \tag{0}$$

We get $p(z_1)$ from $\pi$, $p(z_{k+1}\mid z_k)$ from $A$, and $p(x_k\mid z_k)$ from $B$.
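
As a quick illustration of equation $(0)$, here is a minimal NumPy sketch (the parameters below are made-up toy values, and the function name is my own) that multiplies the initial, transition, and emission probabilities for one $(x, z)$ pair, working in log space for numerical stability:

```python
import numpy as np

def joint_log_prob(x, z, pi, A, B):
    """Log of equation (0): log P(x, z | theta) for one observation sequence x
    and one hidden-state sequence z, both given as integer index arrays."""
    logp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])   # p(z_1) * p(x_1 | z_1)
    for t in range(1, len(x)):
        logp += np.log(A[z[t - 1], z[t]])             # transition p(z_t | z_{t-1})
        logp += np.log(B[z[t], x[t]])                 # emission   p(x_t | z_t)
    return logp

# Toy parameters: m = 2 hidden states, |V| = 3 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

x = np.array([0, 2, 1])   # observed symbols (indices into V)
z = np.array([0, 1, 1])   # one candidate hidden-state sequence
print(np.exp(joint_log_prob(x, z, pi, A, B)))   # P(x, z | theta)
```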

Useful probabilities $p(z_k \mid x)$ and $p(z_{k+1}, z_k \mid x)$

$p(z_k \mid x)$ and $p(z_{k+1}, z_k \mid x)$ are useful probabilities, and we are going to use them later.

Intuition: Once we have a sequence $x$, we might be interested in finding the probability of any hidden state $z_k$, i.e., finding the probabilities $p(z_k = 1 \mid x), p(z_k = 2 \mid x), \ldots, p(z_k = m \mid x)$. We have the following:

$$
\begin{aligned}
p(z_k \mid x) &= \frac{p(z_k, x)}{p(x)} && (1)\\
&\propto p(z_k, x) && (2)
\end{aligned}
$$

Note that from $(1)$ to $(2)$, since $p(x)$ doesn't change for all values of $z_k$, $p(z_k \mid x)$ is proportional to $p(z_k, x)$.

$$
\begin{aligned}
p(z_k=i, x) &= p(z_k=i, x_{1:k}, x_{k+1:n}) \\
&= p(z_k=i, x_{1:k}) \cdot p(x_{k+1:n}\mid z_k=i, x_{1:k}) && (3)\\
&= p(z_k=i, x_{1:k}) \cdot p(x_{k+1:n}\mid z_k=i) && (4.1) \\
&= \alpha_k(z_k=i) \cdot \beta_k(z_k=i) && (4.11)
\end{aligned}
$$

[Figure: the HMM graph around $z_k$, showing that $x_{1:k}$ and $x_{k+1:n}$ are connected only through $z_k$]

From the above graph, we see that the second term in $(3)$ matches the 2nd classical case, so $x_{k+1:n}$ and $x_{1:k}$ are conditionally independent given $z_k$. This is why we can go from $(3)$ to $(4.1)$. We are going to use the Forward Algorithm to compute $p(z_k, x_{1:k})$, and the Backward Algorithm to compute $p(x_{k+1:n}\mid z_k)$ later.

We denote $p(z_k, x_{1:k})$ by $\alpha_k(z_k)$ and $p(x_{k+1:n}\mid z_k)$ by $\beta_k(z_k)$.

After we know how to calculate these two terms separately, we can calculate $p(z_k \mid x)$ easily by introducing a normalization term. That is,

$$
\begin{aligned}
p(z_k = i \mid x) &= \frac{p(z_k = i, x)}{\sum_{j=1}^m p(z_k = j, x)} \\
&= \frac{\alpha_k(z_k=i)\,\beta_k(z_k=i)}{\sum_{j=1}^{m} \alpha_k(z_k=j)\,\beta_k(z_k=j)} && (4.2)
\end{aligned}
$$

where $\sum_{j=1}^m p(z_k = j, x)$ is the normalization term, which makes $p(z_k = i \mid x)$ take values between $0$ and $1$ and sum to $1$ over all $i$.

Similarly, we are also interested in finding $p(z_{k+1}, z_k \mid x)$, where

$$p(z_{k+1}, z_k \mid x) \propto p(z_{k+1}, z_k, x)$$

By using the property of conditional independence, we have

$$
\begin{aligned}
p(z_{k+1}=j, z_k=i, x) &= p(z_k=i, z_{k+1}=j, x_{1:k}, x_{k+1}, x_{k+2:n}) \\
&= p(z_k=i, x_{1:k}) \cdot p(x_{k+2:n}\mid z_{k+1}=j) \cdot p(z_{k+1}=j\mid z_{k}=i) \cdot p(x_{k+1}\mid z_{k+1}=j) && (4.3)\\
&= \alpha_k(z_k=i) \cdot \beta_{k+1}(z_{k+1}=j) \cdot p(z_{k+1}=j\mid z_{k}=i) \cdot p(x_{k+1}\mid z_{k+1}=j) && (4.4)
\end{aligned}
$$

Note that we can find the third and the fourth terms in the transition probability matrix and the emission probability matrix, respectively. Again, we can calculate $p(z_{k+1}, z_k \mid x)$ simply by introducing a normalization term. That is,

$$
\begin{aligned}
p(z_{k+1}=s, z_k=r \mid x) &= \frac{p(z_{k+1}=s, z_k=r, x)}{\sum_{i=1}^{m} \sum_{j=1}^{m} p(z_{k+1}=j, z_k=i, x)} \\
&= \frac{\alpha_k(z_k=r) \cdot \beta_{k+1}(z_{k+1}=s) \cdot p(z_{k+1}=s\mid z_{k}=r) \cdot p(x_{k+1}\mid z_{k+1}=s)}{\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_k(z_k=i) \cdot \beta_{k+1}(z_{k+1}=j) \cdot p(z_{k+1}=j\mid z_{k}=i) \cdot p(x_{k+1}\mid z_{k+1}=j)} && (4.42)
\end{aligned}
$$

Remark

  • We denote

    $$\gamma_k(i) = p(z_k = i \mid x) = \frac{p(z_k=i, x)}{p(x)} \tag{4.43}$$

  • We denote

    $$\xi_k(i,j) = p(z_{k+1}=j, z_k=i \mid x) = \frac{p(z_{k+1}=j, z_k=i, x)}{p(x)} \tag{4.44}$$
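
Assuming the forward messages $\alpha_k(\cdot)$ and backward messages $\beta_k(\cdot)$ (computed by the Forward and Backward Algorithms introduced under Problem 1 below) are already stored as arrays, a minimal NumPy sketch of $(4.2)$/$(4.43)$ and $(4.42)$/$(4.44)$ might look like the following; `posteriors` is a hypothetical helper name:

```python
import numpy as np

def posteriors(alpha, beta, A, B, x):
    """Compute gamma[k, i] = p(z_k = i | x) and xi[k, i, j] = p(z_k = i, z_{k+1} = j | x)
    from forward messages alpha (n, m) and backward messages beta (n, m),
    following (4.2)/(4.43) and (4.42)/(4.44)."""
    n, m = alpha.shape
    px = (alpha[-1] * beta[-1]).sum()        # p(x | theta): the normalization term

    gamma = alpha * beta / px                # shape (n, m)

    xi = np.zeros((n - 1, m, m))
    for k in range(n - 1):
        # alpha_k(i) * A_{ij} * p(x_{k+1} | z_{k+1} = j) * beta_{k+1}(j), normalized by p(x)
        xi[k] = alpha[k][:, None] * A * (B[:, x[k + 1]] * beta[k + 1])[None, :] / px
    return gamma, xi
```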

Three fundamental problems of HMM

Problem 1 (Likelihood): Given an observation sequence $x$ and parameters $\theta = (A, B, \pi)$, determine the likelihood $p(x\mid\theta)$.

Problem 2 (Learning): Given an observation sequence $x$, learn the parameters $\theta = (A, B, \pi)$.

Problem 3 (Inference): Given an observation sequence $x$ and parameters $\theta = (A, B, \pi)$, discover the best hidden state sequence $z$.

Problem 1 (Likelihood)

Goal: Given an observation sequence $x$ and parameters $\theta = (A, B, \pi)$, determine the likelihood $p(x\mid\theta)$.

Naive Way:

From $(0)$, we already know how to compute $P(x, z\mid\theta)$, so we can compute $p(x\mid\theta)$ by summing over all possible sequences $z$:

$$p(x\mid\theta) = \sum_z P(x, z\mid\theta)$$

This method is not practical, since there are $m^n$ possible combinations of the sequence $z$. So we introduce the following two algorithms: the Forward Algorithm and the Backward Algorithm.

Forward Algorithm

[Figure: the HMM graphical model used for the forward recursion]
  • Goal: Compute $p(z_k, x_{1:k})$, given $\theta = (A, B, \pi)$.

From the picture above, it's natural to compute $p(z_k, x_{1:k})$ by dynamic programming (DP), that is, to calculate it in terms of $p(z_{k-1}, x_{1:k-1})$:

$$
\begin{aligned}
p(z_k , x_{1:k}) &= \sum_{z_{k-1}} p(z_k, z_{k-1}, x_{1:k-1}, x_k) && (5)\\
&= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k, x_k\mid z_{k-1}, x_{1:k-1}) && (6)\\
&= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k\mid z_{k-1}, x_{1:k-1}) \cdot p(x_k\mid z_k, z_{k-1}, x_{1:k-1}) && (7)\\
&= \sum_{z_{k-1}} p(z_{k-1}, x_{1:k-1}) \cdot p(z_k\mid z_{k-1}) \cdot p(x_k\mid z_k) && (8)
\end{aligned}
$$

Remark:

  • From $(6)$ to $(7)$, we use the fact that $p(b,c\mid a) = p(b\mid a) \cdot p(c\mid a,b)$.

  • From $(7)$ to $(8)$, we use conditional independence, which is visualized in the picture above.

  • We denote $p(z_k, x_{1:k})$ by $\alpha_k(z_k)$, so

    $$\alpha_k(z_k) = p(z_k , x_{1:k}) = \sum_{z_{k-1}} \alpha_{k-1}(z_{k-1}) \cdot p(z_k\mid z_{k-1}) \cdot p(x_k\mid z_k) \tag{9}$$

  • In equation $(9)$, the term $p(z_k\mid z_{k-1})$ is the transition probability from state $z_{k-1}$ to state $z_k$; the term $p(x_k\mid z_k)$ is the emission probability of observing $x_k$ given state $z_k$.

  • $\alpha_1(z_1=q) = p(z_1=q, x_1) = \pi_q \cdot p(x_1 \mid z_1 = q)$, where $p(x_1 \mid z_1 = q)$ is an emission probability.

Knowing how to compute $p(z_k, x_{1:k})$ recursively, we have

$$p(x\mid\theta) = p(x_{1:n}\mid\theta) = \sum_{z_n} p(z_n, x_{1:n}) = \sum_{z_n} \alpha_n(z_n) = \sum_{q=1}^{m} \alpha_n(z_n=q)$$
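
Here is a minimal NumPy sketch of the forward recursion $(9)$ together with its base case; the toy parameters reuse the made-up values from the earlier sketch:

```python
import numpy as np

def forward(x, pi, A, B):
    """Forward algorithm: row t (0-based) of the result stores
    alpha_{t+1}(i) = p(z_{t+1} = i, x_{1:t+1}), computed with the recursion (9)."""
    n, m = len(x), len(pi)
    alpha = np.zeros((n, m))
    alpha[0] = pi * B[:, x[0]]                      # base case: alpha_1(q) = pi_q * p(x_1 | z_1 = q)
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ A) * B[:, x[k]]  # sum over z_{k-1}, then multiply the emission
    return alpha

# Same toy parameters as before
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = np.array([0, 2, 1])

alpha = forward(x, pi, A, B)
print("p(x | theta) =", alpha[-1].sum())            # sum_q alpha_n(z_n = q)
```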

Backward Algorithm

  • Goal: Compute $p(x_{k+1:n} \mid z_k)$, given $\theta = (A, B, \pi)$.

Again, we are going to use DP to compute $p(x_{k+1:n} \mid z_k)$ in terms of $p(x_{k+2:n} \mid z_{k+1})$:

$$
\begin{aligned}
p(x_{k+1:n} \mid z_k) &= \sum_{z_{k+1}} p(x_{k+1}, x_{k+2:n}, z_{k+1} \mid z_k) \\
&= \sum_{z_{k+1}} p(x_{k+2:n}, z_{k+1}\mid z_k) \cdot p(x_{k+1}\mid z_k, x_{k+2:n}, z_{k+1}) \\
&= \sum_{z_{k+1}} p(z_{k+1}\mid z_k) \cdot p(x_{k+2:n}\mid z_{k+1}, z_k) \cdot p(x_{k+1} \mid z_k, x_{k+2:n}, z_{k+1}) && (10)\\
&= \sum_{z_{k+1}} p(x_{k+2:n}\mid z_{k+1}) \cdot p(z_{k+1}\mid z_k) \cdot p(x_{k+1}\mid z_{k+1}) && (11)
\end{aligned}
$$

Remark:

  • From $(10)$ to $(11)$, we use conditional independence, similar to the forward algorithm.

  • We denote $p(x_{k+1:n} \mid z_k)$ by $\beta_k(z_k)$, so

    $$\beta_k(z_k) = p(x_{k+1:n} \mid z_k) = \sum_{z_{k+1}} p(x_{k+2:n}\mid z_{k+1}) \cdot p(z_{k+1}\mid z_k) \cdot p(x_{k+1}\mid z_{k+1}) \tag{12}$$

  • In equation $(12)$, the term $p(z_{k+1}\mid z_k)$ is the transition probability from state $z_k$ to state $z_{k+1}$; the term $p(x_{k+1}\mid z_{k+1})$ is the emission probability of observing $x_{k+1}$ given state $z_{k+1}$.

  • $\beta_n(z_n) = 1$.

Knowing how to compute $p(x_{k+1:n} \mid z_k)$ recursively, we have

$$
\begin{aligned}
p(x\mid\theta) &= \sum_{z_1} p(x, z_1) = \sum_{z_1} p(x \mid z_1) \cdot p(z_1) \\
&= \sum_{z_1} p(x_1, x_{2:n} \mid z_1) \cdot p(z_1) \\
&= \sum_{z_1} p(x_1 \mid z_1) \cdot p(x_{2:n}\mid z_1, x_1) \cdot p(z_1) &&(13)\\
&= \sum_{z_1} p(x_1 \mid z_1) \cdot p(x_{2:n}\mid z_1) \cdot p(z_1) &&(14)\\
&= \sum_{z_1} p(x_1 \mid z_1) \cdot \beta_1(z_1) \cdot p(z_1) \\
&= \sum_{q=1}^m \beta_1(z_1=q) \cdot p(x_1 \mid z_1=q) \cdot \pi_q
\end{aligned}
$$

From $(13)$ to $(14)$, we use conditional independence. To keep the derivation clean, I didn't include $\theta$ explicitly, but keep in mind that everything above is conditioned on $\theta$.
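
Correspondingly, here is a minimal NumPy sketch of the backward recursion $(12)$ with base case $\beta_n(z_n) = 1$, again with the same toy parameters; the final line evaluates $p(x \mid \theta)$ as in $(13)$ and $(14)$ and should match the forward result:

```python
import numpy as np

def backward(x, A, B):
    """Backward algorithm: row t (0-based) of the result stores
    beta_{t+1}(i) = p(x_{t+2:n} | z_{t+1} = i), computed with the recursion (12)."""
    n, m = len(x), A.shape[0]
    beta = np.ones((n, m))                            # base case: beta_n(z_n) = 1
    for k in range(n - 2, -1, -1):
        # sum over z_{k+1}: A_{ij} * p(x_{k+1} | z_{k+1} = j) * beta_{k+1}(j)
        beta[k] = A @ (B[:, x[k + 1]] * beta[k + 1])
    return beta

# Same toy parameters as before
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = np.array([0, 2, 1])

beta = backward(x, A, B)
print("p(x | theta) =", (pi * B[:, x[0]] * beta[0]).sum())  # matches the forward result
```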

Problem 2 (Learning)

Goal: Given an observation sequence $x$, learn the parameters $\theta = (A, B, \pi)$.

Given that the hidden states are unknown, it's natural to use the EM Algorithm to solve for the parameters. Recall that the EM Algorithm consists of two steps:

  • An expectation (E) step, which creates a function $Q(\theta, \theta_i)$ for the expectation of the log-likelihood $\log p(x,z\mid\theta)$, evaluated using the current conditional distribution of $z$ given $x$ and the current estimate of the parameters $\theta_i$, where

$$
\begin{aligned}
Q(\theta, \theta_i) &= E_{z \sim P(z\mid x,\theta_i)}[\log p(x,z\mid\theta)] \\
&= \sum_z P(z\mid x,\theta_i) \cdot \log p(x,z\mid\theta)
\end{aligned}
$$

  • A maximization (M) step, which computes the parameters maximizing the expected log-likelihood $Q(\theta, \theta_i)$ found in the E step, and then updates the parameters to $\theta_{i+1}$.

We first initialize the parameters $\theta_0 = (A_0, B_0, \pi_0)$.

E Step:

We are going to construct $Q(\theta, \theta_i)$.

$$
\begin{aligned}
Q(\theta, \theta_i) &= E_{z \sim P(z\mid x,\theta_i)}[\log p(x,z\mid\theta)] \\
&= \sum_z P(z\mid x,\theta_i) \cdot \log p(x,z\mid\theta) \\
&= \sum_z \log p(x,z\mid\theta) \cdot \frac{p(x,z\mid\theta_i)}{p(x\mid\theta_i)} &&(15)\\
&\propto \sum_z \log p(x,z\mid\theta) \cdot p(x,z\mid\theta_i) &&(16)
\end{aligned}
$$

Since we know $x$ and $\theta_i$, $p(x\mid\theta_i)$ is a constant, which is why we can go from $(15)$ to $(16)$. In the earlier section "Settings of the Hidden Markov Model (HMM)", we deduced that

$$P(x, z\mid\theta) = p(z_1) \cdot \left[\, p(z_2\mid z_1)\cdot p(z_3\mid z_2) \cdots p(z_n\mid z_{n-1}) \,\right] \cdot \left[\, p(x_1\mid z_1)\cdot p(x_2\mid z_2) \cdots p(x_n\mid z_n) \,\right]$$

So we can formulate $Q(\theta, \theta_i)$ as

$$
\begin{aligned}
Q(\theta, \theta_i) &= \sum_z \left( \log \pi_{z_1} + \sum_{t=1}^{n-1} \log p(z_{t+1}\mid z_t) + \sum_{t=1}^{n} \log p(x_t\mid z_t)\right) \cdot p(x,z\mid\theta_i) \\
&= \sum_z \log \pi_{z_1} \cdot p(x,z\mid\theta_i) + \sum_z \sum_{t=1}^{n-1} \log p(z_{t+1}\mid z_t) \cdot p(x,z\mid\theta_i) + \sum_z \sum_{t=1}^{n} \log p(x_t\mid z_t) \cdot p(x,z\mid\theta_i)
\end{aligned}
$$

M Step:

We are going to maximize $Q(\theta, \theta_i)$ and update the parameters to $\theta_{i+1}$.

Note that we wrote $Q(\theta, \theta_i)$ as the sum of three terms. The first term is related to $\pi$, the second term is related to $A$, and the third term is related to $B$. Therefore we can maximize each term separately.

We can write the first term as

$$\sum_z \log \pi_{z_1} \cdot p(x,z\mid\theta_i) = \sum_{j=1}^m \log \pi_j \cdot p(x, z_1 = j\mid\theta_i)$$

under the constraint $\sum_{j=1}^m \pi_j = 1$. Clearly, this is a convex optimization problem:

$$
\begin{aligned}
\max_{\pi}\quad & \sum_{j=1}^m \log \pi_j \cdot p(x, z_1 = j\mid\theta_i) \\
\text{s.t.}\quad & \sum_{j=1}^m \pi_j = 1
\end{aligned}
$$

The Lagrangian $L$ associated with the problem is

$$L(\pi, v) = \sum_{j=1}^m \log \pi_j \cdot p(x, z_1 = j\mid\theta_i) + v \cdot \left(\sum_{j=1}^m \pi_j - 1\right)$$

Note that any pair of primal and dual optimal points must satisfy the KKT conditions. So we use the KKT property that the gradient must vanish at the optimal point to find $\pi$. Strictly speaking, this might not give the optimal $\pi$, since "any pair of primal and dual optimal points must satisfy the KKT conditions" doesn't imply that a point satisfying the KKT conditions is optimal.

$$
\begin{aligned}
\frac{\partial L}{\partial \pi_j} = p(x, z_1=j\mid\theta_i) \cdot \frac{1}{\pi_j} + v &= 0 \\
p(x, z_1=j\mid\theta_i) + v \cdot \pi_j &= 0 \\
\pi_j &= \frac{-p(x, z_1=j\mid\theta_i)}{v} && (17)
\end{aligned}
$$

By setting $\frac{\partial L}{\partial \pi_j} = 0$ for all $j$ and summing the second line over $j$, we have

$$
\begin{aligned}
\sum_{j=1}^m p(x, z_1=j\mid\theta_i) + v \cdot \sum_{j=1}^m \pi_j &= 0 \\
p(x\mid\theta_i) + v &= 0\\
v &= - p(x\mid\theta_i) && (18)
\end{aligned}
$$

By plugging $(18)$ into $(17)$, we have

$$\pi_j = \frac{-p(x, z_1=j\mid\theta_i)}{v} = \frac{p(x, z_1=j\mid\theta_i)}{p(x\mid\theta_i)} = \gamma_1(j)$$

In a similar way, we can write the second and the third terms as

$$\sum_z \sum_{t=1}^{n-1} \log p(z_{t+1}\mid z_t) \cdot p(x,z\mid\theta_i) = \sum_{j=1}^m \sum_{k=1}^m \sum_{t=1}^{n-1} \log p(z_{t+1}=k\mid z_t=j) \cdot p(x, z_t=j, z_{t+1}=k\mid\theta_i)$$

$$\sum_z \sum_{t=1}^{n} \log p(x_t\mid z_t) \cdot p(x,z\mid\theta_i) = \sum_{j=1}^m \sum_{t=1}^n \log p(x_t\mid z_t=j) \cdot p(x, z_t=j\mid\theta_i)$$

with separate constraints

$$\sum_{k=1}^m p(z_{t+1}=k\mid z_t=j) = 1, \qquad \sum_{k=1}^{|V|} p(x_t = V_k\mid z_t=j) = 1$$

We can solve for the optimal parameters in the same way as we did for $\pi$. After setting the gradient of the corresponding Lagrangian to $0$, we have

$$
\begin{aligned}
A_{jk} &= p(z_{t+1}=k\mid z_t=j) \\
&= \frac{\sum_{t=1}^{n-1} p(x, z_t=j, z_{t+1}=k\mid\theta_i)}{\sum_{t=1}^{n-1} p(x, z_t=j\mid\theta_i)} \\
&= \frac{\sum_{t=1}^{n-1} p(x, z_t=j, z_{t+1}=k\mid\theta_i)/ p(x\mid\theta_i)}{\sum_{t=1}^{n-1} p(x, z_t=j\mid\theta_i)/p(x\mid\theta_i)} \\
&= \frac{\sum_{t=1}^{n-1} \xi_t(j, k)}{\sum_{t=1}^{n-1} \gamma_t(j)} &&(19)\\[2ex]
B_{jk} &= p(x_t= V_k\mid z_t=j) \\
&= \frac{\sum_{t=1}^{n} p(x, z_t=j\mid\theta_i)\cdot I(x_t= V_k)}{\sum_{t=1}^{n} p(x, z_t=j \mid \theta_i)} \\
&= \frac{\sum_{t=1}^{n} p(x, z_t=j\mid\theta_i)/ p(x\mid\theta_i) \cdot I(x_t= V_k)}{\sum_{t=1}^{n} p(x, z_t=j \mid \theta_i)/ p(x\mid\theta_i)} \\
&= \frac{\sum_{t=1}^{n} \gamma_t(j) \cdot I(x_t=V_k)}{\sum_{t=1}^{n} \gamma_t(j)} &&(20)
\end{aligned}
$$

Remark: $I(x_t = V_k)$ is an indicator function: if $x_t = V_k$, then $I(x_t = V_k) = 1$, and $0$ otherwise. We get the results $(19)$ and $(20)$ from $(4.43)$ and $(4.44)$.

This algorithm is also called the Baum-Welch algorithm.
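
To make the E step and M step concrete, here is a minimal, self-contained NumPy sketch of one Baum-Welch iteration for a single observation sequence (the function name and structure are my own; a practical implementation would iterate until the likelihood converges and rescale $\alpha$ and $\beta$ to avoid underflow):

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    """One EM (Baum-Welch) iteration for a single observation sequence x,
    using the closed-form updates: pi_j = gamma_1(j), (19) for A, and (20) for B."""
    x = np.asarray(x)
    n, m = len(x), len(pi)

    # E step: forward and backward messages, then gamma (4.43) and xi (4.44)
    alpha = np.zeros((n, m))
    alpha[0] = pi * B[:, x[0]]
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ A) * B[:, x[k]]

    beta = np.ones((n, m))
    for k in range(n - 2, -1, -1):
        beta[k] = A @ (B[:, x[k + 1]] * beta[k + 1])

    px = alpha[-1].sum()
    gamma = alpha * beta / px
    xi = np.array([alpha[k][:, None] * A * (B[:, x[k + 1]] * beta[k + 1])[None, :] / px
                   for k in range(n - 1)])

    # M step: re-estimate the parameters
    new_pi = gamma[0]                                           # pi_j = gamma_1(j)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # equation (19)
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):                                 # equation (20), one column per symbol V_v
        new_B[:, v] = gamma[x == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```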

Problem 3 (Inference)

Goal: Given an observation sequence $x$ and parameters $\theta = (A, B, \pi)$, discover the best hidden state sequence $z$.

Method 1: Brute force.

This is not practical. Every hidden state has $m$ choices and the sequence has length $n$, so there are $m^n$ possible combinations.

Method 2: Use the Forward/Backward Algorithm.
Given a sequence $x$, we know how to compute $p(z_k \mid x)$ from the earlier part of the post. Therefore, at each time $k$, we can compute $p(z_k = i \mid x)$ for all $i \in \{1, 2, \ldots, m\}$ and choose the state with the highest probability. In this way, at every time step we pick the most probable hidden state. However, there is still a problem: this only finds the most probable hidden state locally and doesn't take the whole sequence into account. Even if we choose the most probable hidden state at each time $k$, the resulting combination might not be the best one, and might not even make sense. For example, if $A_{ij} = 0$, then whenever $z_k = i$, $z_{k+1}$ cannot take state $j$; but this approach doesn't take that into account. We can think of it as a greedy approach that approximates the best result (see the short sketch below).
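
A minimal sketch of this greedy (posterior) decoding, assuming the $\gamma$ matrix from the earlier sketches is available; the function name is my own:

```python
import numpy as np

def posterior_decode(gamma):
    """Method 2 (greedy): independently pick, at each time k, the state with the
    highest marginal posterior p(z_k = i | x); gamma has shape (n, m).
    Note: the returned path ignores the transition matrix, so it may even contain
    transitions i -> j with A_ij = 0."""
    return gamma.argmax(axis=1)
```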

Method 3: Viterbi algorithm.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states that results in a given sequence of observed events in an HMM.

We define $\delta_k(i)$ to be the maximum probability among all paths that are in state $i$ at time $k$. That is,

$$\delta_k(i) = \max_{z_{1:k-1}} p(z_k=i,\, z_{1:k-1},\, x_{1:k}\mid\theta), \quad i= 1,2,\ldots,m$$

We can construct a recursion formula for $\delta_k(i)$. That is,

$$
\begin{aligned}
\delta_{k+1}(i) &= \max_{z_{1:k}} p(z_{k+1}=i,\, z_{1:k},\, x_{1:k+1}\mid\theta) \\
&= \max_{1\leq j \leq m} \delta_k(j) \cdot A_{ji} \cdot p(x_{k+1}\mid z_{k+1}=i), && i= 1,2,\ldots,m; \; k=1,2,\ldots,n-1
\end{aligned}
$$

And the base case is $\delta_{1}(i) = \pi_i \cdot p(x_1\mid z_1=i)$.

Therefore, we compute the optimal probability $P^{*}$

$$P^{*} = \max_{1 \leq i \leq m} \delta_n(i)$$

by recursion. During the process of finding the highest-probability path, we keep recording the hidden state chosen at each step. So after we find the highest probability, we also have the path associated with it, and therefore the best sequence $z$. A sketch is given below.
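
Here is a minimal NumPy sketch of the Viterbi recursion in log space (to avoid underflow), with backpointers to recover the path; the toy parameters reuse the made-up values from earlier:

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Viterbi algorithm: return the most likely hidden-state sequence for x
    and the optimal probability P*, using the delta recursion in log space."""
    n, m = len(x), len(pi)
    log_delta = np.zeros((n, m))
    backptr = np.zeros((n, m), dtype=int)

    log_delta[0] = np.log(pi) + np.log(B[:, x[0]])    # delta_1(i) = pi_i * p(x_1 | z_1 = i)
    for k in range(1, n):
        # scores[j, i] = log delta_{k-1}(j) + log A_{ji} + log p(x_k | z_k = i)
        scores = log_delta[k - 1][:, None] + np.log(A) + np.log(B[:, x[k]])[None, :]
        backptr[k] = scores.argmax(axis=0)            # best previous state j for each current state i
        log_delta[k] = scores.max(axis=0)

    # Backtrack from the best final state to recover the whole path
    path = np.zeros(n, dtype=int)
    path[-1] = log_delta[-1].argmax()
    for k in range(n - 2, -1, -1):
        path[k] = backptr[k + 1, path[k + 1]]
    return path, np.exp(log_delta[-1].max())          # (best z sequence, P*)

# Same toy parameters as before
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(np.array([0, 2, 1]), pi, A, B))
```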


