Study Notes on NLTK Essentials (13)

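Summarization applications also rely on another line of reasoning: important sentences usually contain important words, and most words that discriminate across the corpus (discriminatory words) are important. A sentence that contains highly discriminatory words is therefore itself important. This gives a very simple measure: compute the TF-IDF (term frequency-inverse document frequency) score of each word, then derive a normalized average score for each sentence from the importance of its words. That score can serve as the criterion for selecting sentences for the summary.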
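TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus. Variants of TF-IDF weighting are often used by search engines to measure or rank the relevance of a document to a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to decide the order in which documents appear in the search results.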
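To make the weighting concrete, here is a minimal pure-Python sketch. It assumes one common variant (raw term count multiplied by ln(N / df)), which is not the exact formula scikit-learn applies, and the function name `tf_idf` and the three toy sentences are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (one common variant, not the only one):
    tf  = raw count of the term in the document
    idf = ln(N / df), with N documents in total and df documents
          containing the term
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

# Hypothetical toy corpus: each sentence is treated as one document.
docs = [
    "the king draws the sword",
    "the knight serves the king",
    "the saxons invade britain",
]
for doc, weights in zip(docs, tf_idf(docs)):
    print(doc, weights)
```

A word such as "the", which occurs in every sentence, gets weight 0 under this variant, while a word that occurs in only one sentence gets the largest idf.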
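The book does not work on the whole article but practices on just its first three sentences; I used the first paragraph of the article instead: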

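import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the article; the with-block closes the file automatically.
with open('news.txt') as f:
    news_content = f.read()

# Split the text into sentences (NLTK's sentence tokenizer data must be installed).
sentences = nltk.sent_tokenize(news_content)

# Each sentence is treated as one document. min_df=1 keeps every term, as
# min_df=0 did; newer scikit-learn versions may reject an integer 0 here.
vectorizer = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False, sublinear_tf=True)
sklearn_binary = vectorizer.fit_transform(sentences)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())
print(sklearn_binary.toarray())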

Output:

['accept', 'accepting', 'altria', 'and', 'announce', 'approaches', 'arthur', 'as', 'at', 'be', 'birth', 'britain', 'british', 'by', 'caliburn', 'ceremonial', 'character', 'decides', 'despite', 'destined', 'dies', 'draws', 'ector', 'eligible', 'embedded', 'enters', 'entrusted', 'explaining', 'fearing', 'fifteen', 'following', 'for', 'full', 'gender', 'growing', 'hardships', 'heir', 'her', 'hesitation', 'his', 'however', 'if', 'in', 'inspired', 'invasion', 'is', 'king', 'knight', 'known', 'large', 'leadership', 'leaving', 'legends', 'legitimate', 'loyal', 'mantle', 'merlin', 'monarch', 'name', 'nativity', 'never', 'no', 'not', 'of', 'or', 'pendragon', 'people', 'period', 'preserving', 'publicly', 'pulling', 'raises', 'recognize', 'responsible', 'ruler', 'saber', 'saxons', 'she', 'shoulders', 'sir', 'slab', 'son', 'soon', 'stone', 'subjects', 'surrogate', 'sword', 'symbolic', 'that', 'the', 'this', 'threat', 'throne', 'to', 'turmoil', 'uther', 'welfare', 'when', 'who', 'will', 'withdraws', 'without', 'woman']
[[ 0.          0.          0.15095332  0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.          0.
   0.20340954  0.          0.          0.31622502  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.31622502
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.31622502  0.          0.17386773
   0.24504638  0.          0.          0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.15095332
   0.          0.31622502  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.31622502  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.15095332  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.23250474  0.          0.11098857  0.          0.23250474  0.          0.
   0.14955705  0.23250474  0.          0.23250474  0.          0.          0.
   0.          0.          0.          0.23250474  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.23250474  0.          0.          0.          0.          0.18017058
   0.          0.          0.          0.11098857  0.          0.23250474
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.23250474  0.          0.          0.          0.          0.
   0.23250474  0.23250474  0.          0.23250474  0.          0.23250474
   0.          0.          0.          0.          0.23250474  0.          0.
   0.          0.          0.18017058  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.23250474
   0.          0.          0.          0.          0.          0.          0.
   0.          0.14955705  0.          0.18017058  0.          0.          0.
   0.14955705  0.          0.          0.23250474]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.          0.
   0.          0.          0.29128766  0.          0.          0.
   0.29128766  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.13904921  0.          0.
   0.          0.          0.          0.          0.          0.1601566
   0.          0.29128766  0.          0.          0.          0.          0.
   0.          0.29128766  0.          0.22572213  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.29128766  0.          0.
   0.          0.          0.          0.18736875  0.          0.29128766
   0.          0.29128766  0.          0.          0.          0.29128766
   0.          0.          0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.29128766
   0.          0.          0.          0.        ]
 [ 0.          0.          0.14155101  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.29652856  0.          0.          0.29652856  0.          0.          0.
   0.          0.          0.29652856  0.          0.          0.          0.
   0.          0.          0.29652856  0.          0.          0.          0.
   0.          0.          0.          0.          0.16303816  0.22978336
   0.          0.29652856  0.          0.          0.29652856  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.29652856  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.14155101  0.          0.          0.29652856  0.19073992  0.
   0.22978336  0.          0.29652856  0.          0.          0.          0.
   0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.24121053  0.
   0.20022545  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.31127497
   0.          0.          0.          0.          0.31127497  0.          0.
   0.          0.31127497  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.31127497  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.25158536  0.          0.          0.
   0.31127497  0.          0.          0.          0.          0.          0.
   0.          0.          0.31127497  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.25158536  0.          0.31127497  0.          0.
   0.31127497  0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.          0.          0.10632924  0.          0.          0.22274414
   0.          0.14327861  0.          0.          0.          0.
   0.22274414  0.          0.17260697  0.22274414  0.          0.          0.
   0.22274414  0.          0.          0.          0.          0.22274414
   0.          0.          0.22274414  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.10632924
   0.          0.          0.          0.22274414  0.22274414  0.          0.
   0.          0.          0.          0.          0.22274414  0.          0.
   0.          0.          0.          0.          0.17260697  0.          0.
   0.          0.          0.          0.          0.10632924  0.          0.
   0.17260697  0.          0.          0.          0.          0.
   0.22274414  0.          0.17260697  0.          0.          0.14327861
   0.          0.          0.22274414  0.          0.22274414  0.22274414
   0.          0.          0.17260697  0.          0.22274414  0.10632924
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.14327861  0.22274414  0.          0.        ]
 [ 0.          0.24521796  0.11705736  0.19002219  0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.24521796  0.          0.          0.
   0.24521796  0.          0.11705736  0.          0.          0.24521796
   0.          0.          0.          0.          0.13482643  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.          0.
   0.          0.24565801  0.          0.          0.19002219  0.
   0.24521796  0.          0.24521796  0.          0.          0.24521796
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.19002219
   0.24521796  0.          0.19819534  0.24521796  0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.15773474
   0.          0.          0.        ]
 [ 0.          0.          0.          0.38872173  0.          0.          0.
   0.          0.          0.          0.          0.22958532  0.          0.
   0.22958532  0.          0.          0.          0.29627299  0.          0.
   0.29627299  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.22958532
   0.          0.          0.          0.14142901  0.29627299  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.29627299  0.          0.          0.          0.
   0.29627299  0.          0.          0.          0.          0.          0.
   0.          0.14142901  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.19057553  0.29627299  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.29627299  0.        ]]
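Each row of the matrix holds the TF-IDF weights of one sentence, so the rows can be turned into the per-sentence score described at the start. The sketch below is one possible reading of "normalized average score": the mean of a row's non-zero weights. The function name `rank_sentences` and the toy matrix are hypothetical and are not taken from the book or from the output above.

```python
def rank_sentences(tfidf_rows, top_n=2):
    """Return the indices of the top_n rows, highest average score first.

    The score of a sentence is the mean of its non-zero TF-IDF weights
    (an assumed normalization; other choices are possible).
    """
    scores = []
    for index, row in enumerate(tfidf_rows):
        nonzero = [w for w in row if w > 0]
        score = sum(nonzero) / len(nonzero) if nonzero else 0.0
        scores.append((score, index))
    # Sort by score, descending; ties keep the earlier sentence first.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scores[:top_n]]

# Toy matrix: 3 sentences x 4 terms (made-up values for illustration).
toy_matrix = [
    [0.0, 0.5, 0.5, 0.0],
    [0.9, 0.0, 0.0, 0.3],
    [0.2, 0.2, 0.2, 0.2],
]
best = rank_sentences(toy_matrix, top_n=2)
# Restore document order so the summary reads naturally.
print(sorted(best))
```

With the matrix from the code above, passing `sklearn_binary.toarray()` as `tfidf_rows` should work the same way; the returned indices can then be used to pick entries from `sentences`.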
