How do I load an LDA-transformed corpus with Python's gensim? Here is what I have tried:
from gensim import corpora, models
import numpy.random
numpy.random.seed(10)
doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]
corpus = [doc0,doc1,doc2,doc3]
dictionary = corpora.Dictionary(corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_tfidf.save('x.corpus_tfidf')
# To access the tf-idf corpus I saved, I used corpora.MmCorpus.load()
corpus_tfidf = corpora.MmCorpus.load('x.corpus_tfidf')
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda[corpus]
corpus_lda.save('x.corpus_lda')
for i, j in enumerate(corpus_lda):
    print j, corpus[i]
The code above outputs:
[(0, 0.54259038344543631), (1, 0.45740961655456358)] [(0, 1), (1, 1)]
[(0, 0.56718063124157458), (1, 0.43281936875842542)] [(0, 1)]
[(0, 0.54255407573666647), (1, 0.45744592426333358)] [(0, 1), (1, 1)]
[(0, 0.75229707773868093), (1, 0.2477029222613191)] [(0, 3), (1, 1)]
# [(<topic_number_from x.corpus_lda model>,
# <probability of this topic for this document>),
# (<topic# from lda model>, <prob of this top for this doc>)] [<document[i] from corpus>]
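If I want to load the saved LDA-transformed corpus, which gensim class should I use to load it?
I tried corpora.MmCorpus.load(), but it does not give me the same corpus output as shown above: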
>>> lda_corpus = corpora.MmCorpus.load('x.corpus_lda')
>>> for i, j in enumerate(lda_corpus):
...     print j, corpus[i]
...
[(0, 0.55087839240547309), (1, 0.44912160759452685)] [(0, 1), (1, 1)]
[(0, 0.56715974584850259), (1, 0.43284025415149735)] [(0, 1)]
[(0, 0.54275680271070581), (1, 0.45724319728929413)] [(0, 1), (1, 1)]
[(0, 0.75233330695720912), (1, 0.24766669304279079)] [(0, 3), (1, 1)]
Answer:
There are several problems in your code.
To save a corpus in MatrixMarket format, you need
corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)
This is described in the gensim documentation for MmCorpus.serialize.
You train the model on corpus_tfidf, but later transform only lda[corpus] (without tf-idf). Use either tf-idf or plain bag-of-words, but use it consistently.