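The LDA Topic Model
Principle
LDA stands for Latent Dirichlet Allocation. Its goal is to identify topics: it factors the document-word matrix into a document-topic matrix (distribution) and a topic-word matrix (distribution).
Document generation process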
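- Select a document $d_{i}$ according to the prior probability $P(d_{i})$.
- Sample the topic distribution $\theta_{i}$ of document $i$ from a Dirichlet distribution with hyperparameter $\alpha$.
- Sample the topic $z_{i,j}$ of the $j$-th word of document $i$ from the multinomial topic distribution $\theta_{i}$.
- Sample the word distribution $\phi_{z_{i,j}}$ of topic $z_{i,j}$ from a Dirichlet distribution with hyperparameter $\beta$ (each topic's word distribution is drawn once and shared by all documents).
- Sample the word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$.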
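The generative steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not part of the original implementation: the helper names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`), the toy vocabulary, and the hyperparameter values are all hypothetical, and every document is given the same fixed length for simplicity.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution phi_k per topic, each drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Topic distribution theta_i of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic z_{i,j} of the j-th word
            w = sample_categorical(phi[z], rng)  # word w_{i,j} drawn from phi_z
            words.append(vocab[w])
        docs.append(words)
    return docs, phi


# Hypothetical toy vocabulary and hyperparameters.
vocab = ["gene", "cell", "dna", "ball", "team", "goal"]
docs, phi = generate_corpus(num_docs=3, doc_len=8, num_topics=2,
                            vocab=vocab, alpha=0.5, beta=0.5)
for doc in docs:
    print(" ".join(doc))
```

Small values of `alpha` and `beta` concentrate each document on few topics and each topic on few words, which is why values below 1 are a common choice.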
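Conjugate prior distributions
The Dirichlet distribution is the conjugate prior of the multinomial distribution. In general, if the posterior $P(\theta|x)$ and the prior $p(\theta)$ belong to the same family of distributions, the prior and the posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function.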
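For the Dirichlet-multinomial pair, conjugacy can be checked directly. The short derivation below is the standard textbook argument; the symbols $K$ (number of categories) and $n_k$ (observed count of category $k$ in the data $x$) are introduced here only for illustration.

$$p(\theta|\alpha)=\frac{\Gamma(\sum_{k=1}^{K}\alpha_k)}{\prod_{k=1}^{K}\Gamma(\alpha_k)}\prod_{k=1}^{K}\theta_k^{\alpha_k-1},\qquad p(x|\theta)\propto\prod_{k=1}^{K}\theta_k^{n_k}$$
$$p(\theta|x,\alpha)\propto p(x|\theta)p(\theta|\alpha)\propto\prod_{k=1}^{K}\theta_k^{n_k+\alpha_k-1}$$

The last expression is the kernel of a Dirichlet distribution with parameter $\alpha+n$, so the posterior stays in the Dirichlet family and observing data simply adds the counts to the hyperparameters.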
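LDA parameter estimation
LDA's parameters are estimated here with Gibbs sampling. Learning LDA amounts to estimating two unknowns: the topic distribution $\theta$ and the word distribution $\phi$. LDA is a generative model: conditioned on the hyperparameters $\alpha$ and $\beta$, integrating out the latent variables $\theta$ and $\phi$ yields the joint distribution $p(w,z)$: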
$$p(z,w|\alpha, \beta)=p(w|z,\beta)p(z|\alpha)$$
$$p(w|z,\beta)= \int p(w|z, \phi)p(\phi|\beta)d \phi$$
$$p(z|\alpha)= \int p(z| \theta)p(\theta| \alpha)d \theta$$
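Because the Dirichlet is conjugate to the multinomial, both integrals have closed forms. The expressions below are the standard result, written in notation that is an assumption introduced here rather than taken from the text above: $K$ topics, $M$ documents, $\vec{n}_k$ the vector counting how often each vocabulary word is assigned to topic $k$, $\vec{n}_d$ the vector counting how often each topic occurs in document $d$, and $B(\cdot)$ the multivariate Beta function (a scalar $\alpha$ or $\beta$ is read as a vector with that value repeated).

$$p(w|z,\beta)=\prod_{k=1}^{K}\frac{B(\vec{n}_k+\beta)}{B(\beta)},\qquad p(z|\alpha)=\prod_{d=1}^{M}\frac{B(\vec{n}_d+\alpha)}{B(\alpha)}$$
$$B(x)=\frac{\prod_{i}\Gamma(x_i)}{\Gamma(\sum_{i}x_i)}$$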
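Once the joint distribution is available, the topic distribution $\theta$ and the word distribution $\phi$ can be computed for the current documents.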
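As a rough illustration of how Gibbs sampling turns the joint distribution into estimates of $\theta$ and $\phi$, here is a minimal collapsed Gibbs sampler in standard-library Python. It is a sketch under assumptions, not the original author's code: the function name `gibbs_lda`, the toy documents (lists of word ids), and the hyperparameter values are hypothetical. Each token's topic is resampled from the usual full conditional, proportional to $(n_{d,k}+\alpha)(n_{k,w}+\beta)/(n_k+V\beta)$, where the counts exclude the current token and $V$ is the vocabulary size.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][j]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized full conditional over topics for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Add the token back under its newly sampled topic.
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of theta and phi from the final counts.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 2) for p in row])
```

This sketch uses only the last sample; practical samplers usually discard a burn-in period and may average several samples. Note also that gensim, used in the code below, is based on online variational Bayes rather than Gibbs sampling, so this sketch is independent of that code.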
Code implementation
Model training
```python
import json
from gensim import corpora, models
from gensim.corpora import Dictionary

# Loaded as in the original script; not used further below
with open(r'./data/data_specification/cn_software_data.json', 'r', encoding='utf8') as f:
    cn_software_data = json.load(f)

# Tokenized documents: a list of token lists
with open(r'./data/LDA_data/LDA_text.json', 'r', encoding='utf8') as f:
    LDA_texts = json.load(f)

# Build the dictionary and the bag-of-words corpus
LDA_dict = Dictionary(LDA_texts)
LDA_dict.save(r'./data/LDA_data/LDA_dict')
LDA_corpus = [LDA_dict.doc2bow(text) for text in LDA_texts]
# Save the corpus in Matrix Market format so the perplexity script can load it
corpora.MmCorpus.serialize(r'./data/LDA_data/LDA_corpus', LDA_corpus)

# LDA training parameters
num_topics = 500
iterations = 1000
workers = 3

# Multi-process LDA training
lda = models.ldamulticore.LdaMulticore(
    LDA_corpus, id2word=LDA_dict, num_topics=num_topics,
    iterations=iterations, workers=workers, batch=True)
# Saved as ./LDA_model/lda_500_1000.model, the path the perplexity script loads
lda.save(r'./LDA_model/lda_%s_%s.model' % (num_topics, iterations))
```
Computing perplexity
```python
# -*- coding: utf-8 -*-
import logging
import math

from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Calculate the perplexity of an LDA model on a bag-of-words test set."""
    # dictionary maps id -> word, e.g. {7822: 'deferment', 1841: 'circuitry', ...}
    print('the info of this ldamodel: \n')
    print('num of testset: %s; size_dictionary: %s; num of topics: %s'
          % (len(testset), size_dictionary, num_topics))
    prob_doc_sum = 0.0
    # p(w|z): one {word: probability} dict per topic
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        topic_word_list.append(dict(topic_word))
    # p(z|d): one {topic_id: probability} dict per test document
    doc_topics_list = []
    for doc in testset:
        doc_topics_list.append(dict(ldamodel.get_document_topics(doc, minimum_probability=0)))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of the document
        doc = testset[i]
        doc_word_num = 0  # number of word tokens in the document
        for word_id, num in doc:
            prob_word = 0.0  # probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z) * p(w|z)
                prob_topic = doc_topics_list[i].get(topic_id, 0.0)
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # log p(d) = sum_w count(w) * log p(w); weight by the word count
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    # perplexity = exp(-sum_d log p(d) / sum_d N_d)
    prep = math.exp(-prob_doc_sum / testset_word_num)
    print("the perplexity of this ldamodel is : %s" % prep)
    return prep


if __name__ == '__main__':
    dictionary_path = r'./data/LDA_data/LDA_dict'
    corpus_path = r'./data/LDA_data/LDA_corpus'
    num_topics = 500
    ldamodel_path = './LDA_model/lda_{}_1000.model'.format(num_topics)
    dictionary = corpora.Dictionary.load(dictionary_path)
    corpus = corpora.MmCorpus(corpus_path)
    lda_multi = models.ldamodel.LdaModel.load(ldamodel_path)

    # Use every 300th document as the test set
    testset = []
    for i in range(int(corpus.num_docs / 300)):
        testset.append(corpus[i * 300])
    prep = perplexity(lda_multi, testset, dictionary, len(dictionary.keys()), num_topics)
    with open('./LDA_model/lda_{}.txt'.format(num_topics), 'a', encoding='utf8') as f:
        f.write("the perplexity of K={} ldamodel is : {}\n".format(num_topics, prep))
```
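To make the perplexity formula concrete without a trained model, here is a small self-contained sketch of the same calculation on hand-written distributions. The function name `toy_perplexity` and the toy `theta`/`phi` values are hypothetical and chosen only for illustration; documents are given as `{word_id: count}` dictionaries.

```python
import math


def toy_perplexity(docs, theta, phi):
    """Perplexity = exp(-sum_d sum_w n_dw * log p(w|d) / total token count).

    docs:  list of {word_id: count}
    theta: theta[d][k] = p(topic k | document d)
    phi:   phi[k][w]   = p(word w | topic k)
    """
    log_likelihood = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        for word_id, count in doc.items():
            # p(w|d) = sum_k p(k|d) * p(w|k)
            p_word = sum(theta[d][k] * phi[k][word_id] for k in range(len(phi)))
            log_likelihood += count * math.log(p_word)
            num_tokens += count
    return math.exp(-log_likelihood / num_tokens)


docs = [{0: 2, 3: 1}]   # word 0 twice, word 3 once
theta = [[0.5, 0.5]]    # one document, two equally likely topics

# A model that is uniform over 4 words: perplexity equals the vocabulary size (about 4).
phi_uniform = [[0.25] * 4, [0.25] * 4]
print(toy_perplexity(docs, theta, phi_uniform))

# A model that concentrates on the observed words gets a lower perplexity.
phi_sharp = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
print(toy_perplexity(docs, theta, phi_sharp))
```

The uniform case is a useful sanity check: a model that knows nothing is as "perplexed" as a uniform guess over the whole vocabulary, and lower values mean the model assigns more probability to the held-out words.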
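Interview questions
The relationship between pLSA and LDA
pLSA and LDA both look for a topic distribution and a word distribution. They differ in how they approach these two unknowns. pLSA finds the single parameter value (distribution) that best fits the text and treats that value as the true parameter. LDA instead holds that we cannot fully determine what the topic and word distributions are; we can only treat them as random variables and make them more "certain" by reducing their variance. In other words, rather than solving for specific values of the topic and word distributions, we use the observations they generate (the actual text) to infer which ranges of the parameters are likely and which are not. This is the idea behind Bayesian analysis: although the true value cannot be stated exactly, we can assign it a reasonable prior distribution based on experience and then derive the posterior distribution from that prior.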
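The contrast between a point estimate and a posterior can be seen on a single multinomial with three categories. The sketch below is a toy analogy, not pLSA or LDA itself: the function names, the counts, and the symmetric prior value `alpha=1.0` are all hypothetical. It compares the maximum-likelihood estimate (the pLSA-style answer) with the Dirichlet posterior (the LDA-style answer) and shows the posterior variance shrinking as more data arrives.

```python
def mle(counts):
    """Maximum-likelihood point estimate of multinomial probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def dirichlet_posterior(counts, alpha):
    """Posterior mean and variance per component under a symmetric Dirichlet prior."""
    post = [c + alpha for c in counts]  # conjugacy: posterior is Dirichlet(alpha + counts)
    a0 = sum(post)
    mean = [a / a0 for a in post]
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in post]
    return mean, var


small = [3, 0, 1]      # few observations
large = [300, 0, 100]  # same proportions, 100 times more data

print(mle(small))      # the unseen category gets probability exactly 0

mean_small, var_small = dirichlet_posterior(small, alpha=1.0)
mean_large, var_large = dirichlet_posterior(large, alpha=1.0)
print(mean_small, var_small)
print(mean_large, var_large)
```

The point estimate commits to a single value and rules out the unseen category entirely, while the posterior keeps a small probability for it and reports how uncertain each component still is; with more data that uncertainty narrows.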
References
https://blog.csdn.net/v_july_v/article/details/41209515
https://www.jianshu.com/p/b7033e792718