python - TopicModel: How do I query documents by topic-model "topic"?

Below I have created a fully reproducible example that computes a topic model for a given DataFrame.

import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Four short example documents
data = pd.DataFrame({'Body': ['Here goes one example sentence that is generic',
                              'My car drives really fast and I have no brakes',
                              'Your car is slow and needs no brakes',
                              'Your and my vehicle are both not as fast as the airplane']})

# Bag-of-words counts: one row per document, one column per word
vectorizer = CountVectorizer(lowercase=True, analyzer='word')
data_vectorized = vectorizer.fit_transform(data.Body)

# Fit LDA with 4 topics; fit_transform returns the document-topic matrix
lda_model = LatentDirichletAllocation(n_components=4,
                                      learning_method='online',
                                      random_state=0,
                                      verbose=1)
lda_topic_matrix = lda_model.fit_transform(data_vectorized)

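Question: How can I filter documents by topic? And if that is possible, can a document carry several topic labels, or is a threshold needed?

Ultimately, I would like to tag each document with "1" if it has a high loading on topic 2 and topic 3, and with "0" otherwise.

Solution: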

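lda_topic_matrix contains, for each document, the distribution of probabilities that the document belongs to each topic/tag. In plain terms, each row sums to 1, and the value at each index is the probability that the document belongs to that topic. So every document carries all topic tags, each to a different degree. With 4 topics, a document that has all tags equally will have a row in lda_topic_matrix similar to [0.25, 0.25, 0.25, 0.25]. The row of a document with only a single topic ("0") will look something like [0.97, 0.01, 0.01, 0.01], and a document with two topics ("1" and "2") will have a distribution like [0.01, 0.54, 0.44, 0.01].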
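To make the row interpretation concrete, here is a minimal sketch that uses the three illustrative rows from the paragraph above as a hand-written matrix. The name example_topic_matrix and the 0.3 cutoff are assumptions for illustration only; the values are not output from the fitted model.

```python
import numpy as np

# Hypothetical document-topic matrix built from the example rows in the text
# (not actual output of the LDA fit above).
example_topic_matrix = np.array([
    [0.25, 0.25, 0.25, 0.25],  # all topics to the same degree
    [0.97, 0.01, 0.01, 0.01],  # dominated by topic 0
    [0.01, 0.54, 0.44, 0.01],  # mix of topics 1 and 2
])

# Each row is a probability distribution, so it should sum to 1.
row_sums = example_topic_matrix.sum(axis=1)
rows_are_normalized = np.allclose(row_sums, 1.0)

# List the topics whose probability reaches an assumed cutoff.
threshold = 0.3  # assumption; tune this for your own data
topics_per_document = [np.flatnonzero(row >= threshold).tolist()
                       for row in example_topic_matrix]

print(rows_are_normalized)
print(topics_per_document)
```

With a cutoff like this a document can end up with zero, one, or several topic labels, which is one way to read the "multiple labels or threshold" part of the question.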

So the simplest approach is to pick the topic with the highest probability and check whether it is 2 or 3:

main_topic_of_document = np.argmax(lda_topic_matrix, axis=1)
tagged = ((main_topic_of_document==2) | (main_topic_of_document==3)).astype(np.int64)
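If a document should also count when topic 2 or 3 is strong but not the single most probable topic, a threshold on the probabilities is a possible alternative to argmax. The sketch below is one way to do that; the helper tag_by_threshold, the 0.4 cutoff, and the example matrix are all hypothetical and not part of the original answer.

```python
import numpy as np

def tag_by_threshold(topic_matrix, topics=(2, 3), threshold=0.4):
    """Return 1 for each row where any of `topics` has probability >= threshold."""
    selected = np.asarray(topic_matrix)[:, list(topics)]
    return (selected >= threshold).any(axis=1).astype(np.int64)

# Hypothetical document-topic matrix (each row sums to 1); in practice
# you would pass lda_topic_matrix instead.
example = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.54, 0.44, 0.01],
    [0.05, 0.05, 0.45, 0.45],
])

tagged_by_threshold = tag_by_threshold(example)
print(tagged_by_threshold)
```

Note the third row: its most probable topic is 1, so the argmax approach would tag it 0, while the threshold approach tags it 1 because topic 2 reaches 0.44. Which behavior is preferable depends on what "high loading" should mean for your data.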

This article gives a good explanation of the inner mechanics of LDA.
