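Pointwise mutual information (PMI) is a measure of how strongly two things, such as two words, are associated:

$$\mathrm{PMI}(x; y) = \log\frac{p(x, y)}{p(x)\,p(y)} = \log\frac{p(x \mid y)}{p(x)} = \log\frac{p(y \mid x)}{p(y)}$$

From probability theory we know that if x and y are independent, then p(x,y)=p(x)p(y), so the ratio is 1 and the PMI is 0.
The more strongly the two are associated, the larger p(x,y) is relative to p(x)p(y). The later forms of the expression may be easier to read: the conditional probability p(x|y) of x given that y occurs, divided by the probability p(x) of x on its own, naturally expresses how strongly x is associated with y.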
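To make the definition concrete, here is a minimal sketch that evaluates PMI for a few made-up probabilities. The helper name `pmi` and the base-2 logarithm are assumptions for illustration; another base only rescales the score.

```python
import math

def pmi(p_xy, p_x, p_y):
    """PMI in bits: log2 of p(x,y) / (p(x) * p(y))."""
    return math.log2(p_xy / (p_x * p_y))

# Illustrative probabilities, not taken from any real corpus.
independent = pmi(p_xy=0.02, p_x=0.1, p_y=0.2)   # p(x,y) equals p(x)p(y)
associated = pmi(p_xy=0.08, p_x=0.1, p_y=0.2)    # 4x more often than chance
avoiding = pmi(p_xy=0.005, p_x=0.1, p_y=0.2)     # 4x less often than chance

# The conditional form gives the same number: p(x|y) = p(x,y) / p(y).
p_x_given_y = 0.08 / 0.2
conditional_form = math.log2(p_x_given_y / 0.1)

print(independent, associated, avoiding, conditional_form)
```

Up to floating-point rounding, the three cases come out as roughly 0, 2 and -2 bits, and the conditional form agrees with the joint form.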
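Example:
To take an example from natural language processing, suppose we want to measure the polarity of the word like (positive or negative sentiment). We can pick some positive-sentiment words in advance, such as good, and then compute the PMI of like and good:

$$\mathrm{PMI}(\text{like}, \text{good}) = \log\frac{p(\text{like}, \text{good})}{p(\text{like})\,p(\text{good})}$$

where p(like) and p(good) are the probabilities of each word occurring in the corpus, and p(like, good) is the probability of the two occurring together.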
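The polarity idea can be sketched on a toy corpus. Everything here is hypothetical: the sentences are invented, `sentence_pmi` is an illustrative helper, and "occurring together" is assumed to mean "appearing in the same sentence", with each probability estimated as the fraction of sentences that contain the word or words.

```python
import math

# Hypothetical toy corpus; each inner list is one tokenized sentence.
sentences = [
    ["i", "like", "this", "good", "movie"],
    ["good", "food", "i", "like", "it"],
    ["bad", "service", "never", "again"],
    ["i", "like", "good", "coffee"],
    ["the", "weather", "is", "bad"],
    ["a", "good", "day"],
]

def sentence_pmi(word, seed, sentences):
    """PMI of `word` and `seed`, with probabilities estimated as the
    fraction of sentences that contain the word(s)."""
    n = len(sentences)
    n_word = sum(word in s for s in sentences)
    n_seed = sum(seed in s for s in sentences)
    n_both = sum(word in s and seed in s for s in sentences)
    if n_both == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((n_both / n) / ((n_word / n) * (n_seed / n)))

like_good = sentence_pmi("like", "good", sentences)
like_bad = sentence_pmi("like", "bad", sentences)
print(like_good, like_bad)
```

In this toy data like shares a sentence with good more often than chance would predict, so its PMI with good is positive, while it never shares a sentence with bad. A real polarity estimate would use a much larger corpus and several seed words.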
The following PMI implementation was found on *:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
words = word_tokenize(text)

# Count every adjacent word pair, then score each pair with PMI
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

# score_ngrams returns ((word1, word2), score) tuples, highest score first
for row in finder.score_ngrams(bigram_measures.pmi):
    print(row)
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sheep'), 2.5235619560570135)
(('black', 'sentence'), 2.523561956057013)
(('sheep', 'foo'), 2.3536369546147005)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
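The scores above are consistent with estimating each probability from raw counts, that is, PMI = log2(count(w1, w2) * N / (count(w1) * count(w2))), where N is the number of tokens. The sketch below recomputes two of them with the standard library only. `bigram_pmi` is a hypothetical helper, and `text.split()` stands in for `word_tokenize`, which yields the same tokens for this punctuation-free text.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
words = text.split()

word_counts = Counter(words)                    # unigram counts
bigram_counts = Counter(zip(words, words[1:]))  # adjacent-pair counts
n_words = len(words)                            # N, the number of tokens

def bigram_pmi(w1, w2):
    """log2( count(w1, w2) * N / (count(w1) * count(w2)) )"""
    return math.log2(bigram_counts[(w1, w2)] * n_words
                     / (word_counts[w1] * word_counts[w2]))

print(bigram_pmi("is", "a"))
print(bigram_pmi("bar", "bar"))
```

The two values agree with the listed scores for ('is', 'a') and ('bar', 'bar') up to floating-point rounding. Note that ('is', 'a') ranks first even though it occurs only once: both words are rare, so a single co-occurrence is already far more than chance would predict.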
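Now let's write a complete program that does the following:
- Reads data from txt, csv, xls and xlsx files (for tabular files, the text is stored in one column)
- Tokenizes the text, lowercases English, stems English words and removes stopwords
- Computes the PMI of every pair of adjacent words in the text, using a bigram model
- Saves the PMI values to a csv file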
Full code:
import re
import csv
import jieba
import nltk  # english() uses nltk.stem and nltk.tokenize through the package name
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

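def chinese(text):
    """
    Process Chinese text and save the computed PMI values to "中文pmi计算.csv".
    """
    # Keep only Chinese characters, then segment into words with jieba
    content = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    words = jieba.cut(content)
    # Drop single-character tokens
    words = [w for w in words if len(w) > 1]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # 'a+' appends, so running this again adds a second header row to the file
    with open('中文pmi计算.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                # Skip pairs containing characters that gbk cannot encode
                pass
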
def english(text):
    """
    Process English text and save the computed PMI values to "english_pmi_computer.csv".
    """
    stopwordss = set(stopwords.words('english'))
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    # Tokenize, drop pure numbers, lowercase, stem, then remove stopwords
    words = tokenizer.tokenize(text)
    words = [w for w in words if not w.isnumeric()]
    words = [w.lower() for w in words]
    words = [stemmer.stem(w) for w in words]
    words = [w for w in words if w not in stopwordss]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # 'a+' appends, so running this again adds a second header row to the file
    with open('english_pmi_computer.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                # Skip pairs containing characters that gbk cannot encode
                pass

def pmi_score(file, lang, column='数据列'):
    """
    Compute PMI.
    :param file: file containing the raw text data
    :param lang: language of the data, either 'chinese' or 'english'
    :param column: for csv and Excel files, the name of the column that holds the text
    """
    # Read the data
    if file.endswith('.csv'):
        df = pd.read_csv(file)
        text = ''.join(df[column].dropna().astype(str))
    elif file.endswith(('.xls', '.xlsx')):
        df = pd.read_excel(file)
        text = ''.join(df[column].dropna().astype(str))
    else:
        with open(file) as f:
            text = f.read()
    # Compute PMI with the function for the given language
    handlers = {'chinese': chinese, 'english': english}
    handlers[lang](text)


# Compute PMI
pmi_score(file='test.txt', lang='chinese')
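The data in test.txt comes from the descriptions of more than 4,000 Zhihu Live sessions; the screenshot shows part of the computed PMI values.
The PMI results are output from largest to smallest. We can see that the larger the PMI, the more strongly the two words are associated and the better they go together.
Toward the end of the list the PMI turns negative: those pairs co-occur less often than they would if the words were independent, so the two words have little to do with each other.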