[NLP] Topic Models: LDA and ABAE

Topic Model: LDA

Principle

LDA, short for Latent Dirichlet Allocation, aims to discover topics: it decomposes the document–word matrix into a document–topic matrix (distribution) and a topic–word matrix (distribution).

 

Document Generation Process

  • Choose a document $d_{i}$ with prior probability $P(d_{i})$
  • Sample the topic distribution $\theta_{i}$ of document $i$ from the Dirichlet distribution with hyperparameter $\alpha$; in other words, $\theta_{i}$ is generated by a Dirichlet distribution parameterized by $\alpha$
  • Sample the topic $z_{i,j}$ of the $j$-th word in document $i$ from the multinomial topic distribution $\theta_{i}$
  • Sample the word distribution $\phi_{z_{i,j}}$ for topic $z_{i,j}$ from the Dirichlet distribution with hyperparameter $\beta$; in other words, $\phi_{z_{i,j}}$ is generated by a Dirichlet distribution parameterized by $\beta$
  • Sample the final word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$
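The five sampling steps above can be sketched directly with numpy (a toy illustration; the function name `generate_doc` and the symmetric hyperparameters are made up for this example, not part of any library):

```python
import numpy as np

def generate_doc(alpha, beta, n_words, rng):
    """Sample one document following the LDA generative process.
    alpha: (K,) Dirichlet hyperparameter for the topic distribution;
    beta:  (K, V) rows are Dirichlet hyperparameters for each topic's word distribution."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                       # topic distribution theta_i ~ Dir(alpha)
    phi = np.array([rng.dirichlet(b) for b in beta])   # word distributions phi_k ~ Dir(beta)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                     # topic z_{i,j} ~ Multinomial(theta_i)
        w = rng.choice(V, p=phi[z])                    # word  w_{i,j} ~ Multinomial(phi_z)
        words.append(w)
    return words

rng = np.random.default_rng(0)
doc = generate_doc(np.ones(3), np.ones((3, 5)), n_words=10, rng=rng)
```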

 

Conjugate Prior Distribution

The Dirichlet distribution is the conjugate prior of the multinomial distribution. If the posterior $P(\theta|x)$ and the prior $P(\theta)$ belong to the same family of distributions, then the prior and posterior are called conjugate distributions, and the prior is said to be the conjugate prior of the likelihood function.
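This conjugacy can be checked numerically: multiplying a Dirichlet prior $Dir(\alpha)$ by a multinomial likelihood with counts $n$ gives, up to the normalizing constant, exactly $Dir(\alpha+n)$. A minimal sketch with made-up $\alpha$ and counts:

```python
import numpy as np

def dirichlet_logpdf_unnorm(theta, alpha):
    # log of the unnormalized Dirichlet density: sum_k (alpha_k - 1) * log(theta_k)
    return np.sum((alpha - 1) * np.log(theta))

def multinomial_loglik(theta, counts):
    # log-likelihood of observed counts under a multinomial with parameter theta
    return np.sum(counts * np.log(theta))

alpha = np.array([2.0, 3.0, 1.0])   # prior Dir(alpha)
counts = np.array([5, 0, 2])        # observed word counts

rng = np.random.default_rng(0)
for theta in rng.dirichlet(np.ones(3), size=3):
    prior_times_lik = dirichlet_logpdf_unnorm(theta, alpha) + multinomial_loglik(theta, counts)
    posterior = dirichlet_logpdf_unnorm(theta, alpha + counts)
    # prior x likelihood equals Dir(alpha + counts) in the unnormalized log densities
    assert np.isclose(prior_times_lik, posterior)
```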

 

LDA Parameter Estimation

LDA's parameters are estimated with Gibbs sampling. Learning in LDA amounts to estimating the two unknown parameters: the topic distribution $\theta$ and the word distribution $\phi$. Since LDA is a generative model, the ultimate goal is, given the hyperparameters $\alpha$ and $\beta$, to obtain the joint distribution $p(w,z)$ through the latent variables $\theta$ and $\phi$:

$$p(z,w|\alpha, \beta)=p(w|z,\beta)p(z|\alpha)$$

$$p(w|z,\beta)= \int p(w|z, \phi)p(\phi|\beta)d \phi$$

$$p(z|\alpha)= \int p(z| \theta)p(\theta| \alpha)d \theta$$

Once the joint distribution is available, the topic distribution $\theta$ and the word distribution $\phi$ can be estimated from the current documents.
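Collapsed Gibbs sampling repeatedly resamples each token's topic from its full conditional $p(z=k\mid \cdot)\propto (n_{dk}+\alpha)\,\frac{n_{kw}+\beta}{n_{k}+V\beta}$, then reads $\theta$ and $\phi$ off the final counts. A minimal toy sampler (not gensim's implementation; the function name `lda_gibbs` and all defaults are illustrative):

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of word-id lists; V: vocabulary size; K: number of topics."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # document-topic counts
    n_kw = np.zeros((K, V))           # topic-word counts
    n_k = np.zeros(K)                 # tokens per topic
    z = []                            # topic assignment for every token
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]           # remove the current assignment from the counts
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][j] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    # posterior mean estimates of the topic and word distributions
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_kw.sum(1, keepdims=True) + V * beta)
    return theta, phi
```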

 

Code Implementation

Model Training

import json
from gensim import corpora, models
from gensim.corpora import Dictionary

with open(r'./data/data_specification/cn_software_data.json', 'r', encoding='utf8') as f:
    cn_software_data = json.load(f)

with open(r'./data/LDA_data/LDA_text.json', 'r', encoding='utf8') as f:
    LDA_texts = json.load(f)

LDA_dict = Dictionary(LDA_texts)
LDA_dict.save(r'./data/LDA_data/LDA_dict')
LDA_corpus = [LDA_dict.doc2bow(text) for text in LDA_texts]
# serialize the corpus so the perplexity script below can load it
corpora.MmCorpus.serialize(r'./data/LDA_data/LDA_corpus', LDA_corpus)

# LDA training parameters
num_topics = 500
iterations = 1000
workers = 3

# multi-process LDA training
lda = models.ldamulticore.LdaMulticore(LDA_corpus, id2word=LDA_dict, num_topics=num_topics,
                                       iterations=iterations, workers=workers, batch=True)
lda.save(r'./LDA_model/lda_%s_%s.model' % (num_topics, iterations))

 

Computing Perplexity

# -*- coding: utf-8 -*-
from gensim import corpora, models
import logging
import math
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s : ', level=logging.INFO)

def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Calculate the perplexity of an LDA model."""
    # dictionary : {7822: 'deferment', 1841: 'circuitry', 19202: 'f*ism', ...}
    print('the info of this ldamodel: \n')
    print('num of testset: %s; size_dictionary: %s; num of topics: %s'
          % (len(testset), size_dictionary, num_topics))
    prob_doc_sum = 0.0
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        dic = {}
        for word, probability in topic_word:
            dic[word] = probability
        topic_word_list.append(dic)
    doc_topics_list = []
    for doc in testset:
        doc_topics_list.append(ldamodel.get_document_topics(doc, minimum_probability=0))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # the log-probability of the doc
        doc = testset[i]
        doc_word_num = 0  # the number of word tokens in the doc
        for word_id, num in doc:
            prob_word = 0.0  # the probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z) * p(w|z)
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # weight by the token count: log p(d) = sum(num * log(p(w)))
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum(log p(d)) / sum(N_d))
    print("the perplexity of this ldamodel is : %s" % prep)

    return prep

if __name__ == '__main__':
    dictionary_path = r'./data/LDA_data/LDA_dict'
    corpus_path = r'./data/LDA_data/LDA_corpus'
    num_topics = 500
    ldamodel_path = './LDA_model/lda_{}_1000.model'.format(num_topics)
    dictionary = corpora.Dictionary.load(dictionary_path)
    corpus = corpora.MmCorpus(corpus_path)
    lda_multi = models.ldamodel.LdaModel.load(ldamodel_path)

    # sample 1/300 of the corpus as the test set
    testset = []
    for i in range(int(corpus.num_docs / 300)):
        testset.append(corpus[i * 300])
    prep = perplexity(lda_multi, testset, dictionary, len(dictionary.keys()), num_topics)
    with open('./LDA_model/lda_{}.txt'.format(num_topics), 'a', encoding='utf8') as f:
        f.write("the perplexity of K={} ldamodel is : {}".format(num_topics, prep))

 

Interview Questions

The Relationship Between pLSA and LDA

Both pLSA and LDA seek the topic distribution and the word distribution; they differ in how these two unknown parameters are estimated. pLSA finds the single set of parameter values (distributions) that best fits the text and treats those values as the true parameters. LDA, by contrast, holds that the topic and word distributions cannot be solved for exactly; they can only be treated as random variables, made more "definite" by reducing their variance. In other words, instead of solving for specific values of the topic and word distributions, we use the observations these distributions generate (the actual text) to infer which parameter ranges are plausible and which are not. This is essentially Bayesian analysis: although the true values cannot be pinned down, we can posit a reasonable prior over them based on experience, and then derive their posterior starting from that prior.
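The contrast can be written compactly: pLSA maximizes the likelihood to obtain point estimates, while LDA places Dirichlet priors on the distributions and infers their posterior:

$$\text{pLSA:}\quad (\hat{\theta},\hat{\phi})=\arg\max_{\theta,\phi}\prod_{i,j}\sum_{z} p(w_{i,j}|z,\phi)\,p(z|\theta_{i})$$

$$\text{LDA:}\quad p(\theta,\phi|w,\alpha,\beta)\propto p(w|\theta,\phi)\,p(\theta|\alpha)\,p(\phi|\beta)$$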

 

References

https://blog.csdn.net/v_july_v/article/details/41209515

https://www.jianshu.com/p/b7033e792718
