Summarization applications rest on another piece of theoretical logic: important sentences usually contain important words, and words that discriminate between documents in a corpus (discriminatory words) are, for the most part, important words. A sentence that contains highly discriminative words is therefore itself important. This yields a very simple measure: compute the TF-IDF (term frequency-inverse document frequency) score of every word, then derive a normalized sentence score from the importance of the words it contains. That score can then serve as the criterion for selecting sentences for the summary.
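As a minimal sketch of that criterion (my own illustration, not code from the original; numpy and the helper name sentence_score are assumptions), a sentence's score can be taken as the mean TF-IDF weight of the terms that actually occur in it; a complete run over a TF-IDF matrix appears at the end of this post:

import numpy as np

def sentence_score(tfidf_row):
    # Keep only the weights of terms that occur in this sentence,
    # then average them; a sentence with no vocabulary terms scores zero.
    weights = tfidf_row[tfidf_row > 0]
    return weights.mean() if weights.size else 0.0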
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how often it appears across the corpus. Various forms of TF-IDF weighting are used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, web search engines also apply link-analysis-based ranking methods to determine the order in which documents appear in search results.
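To make the definition concrete, here is a tiny hand-rolled computation (a sketch using the textbook formulation tf x log(N/df); scikit-learn's TfidfVectorizer uses a slightly different variant with optional smoothing and normalization, and the toy corpus below is invented purely for illustration):

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # frequency within this document
    df = sum(term in toks for toks in tokenized)    # how many documents contain the term
    return tf * math.log(N / df)                    # rare across the corpus -> larger weight

print(tf_idf("cat", tokenized[0]))  # ~0.18: "cat" occurs in only 1 of 3 documents
print(tf_idf("the", tokenized[0]))  # ~0.14: "the" occurs in 2 of 3, so it is down-weighted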
Rather than running the example on the whole introduction, the original takes only its first three sentences; for my own experiment I used the preceding paragraph as input:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the text and split it into sentences (needs nltk's 'punkt' tokenizer data).
with open('news.txt') as f:
    news_content = f.read()
sentences = nltk.sent_tokenize(news_content)

# One TF-IDF vector per sentence: L2-normalized, IDF without smoothing,
# and sublinear (1 + log) term-frequency scaling.
vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True)
sklearn_binary = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names())  # the vocabulary (get_feature_names_out() on scikit-learn >= 1.2)
print(sklearn_binary.toarray())        # dense matrix: one row per sentence, one column per term
Result:
['accept', 'accepting', 'altria', 'and', 'announce', 'approaches', 'arthur', 'as', 'at', 'be', 'birth', 'britain', 'british', 'by', 'caliburn', 'ceremonial', 'character', 'decides', 'despite', 'destined', 'dies', 'draws', 'ector', 'eligible', 'embedded', 'enters', 'entrusted', 'explaining', 'fearing', 'fifteen', 'following', 'for', 'full', 'gender', 'growing', 'hardships', 'heir', 'her', 'hesitation', 'his', 'however', 'if', 'in', 'inspired', 'invasion', 'is', 'king', 'knight', 'known', 'large', 'leadership', 'leaving', 'legends', 'legitimate', 'loyal', 'mantle', 'merlin', 'monarch', 'name', 'nativity', 'never', 'no', 'not', 'of', 'or', 'pendragon', 'people', 'period', 'preserving', 'publicly', 'pulling', 'raises', 'recognize', 'responsible', 'ruler', 'saber', 'saxons', 'she', 'shoulders', 'sir', 'slab', 'son', 'soon', 'stone', 'subjects', 'surrogate', 'sword', 'symbolic', 'that', 'the', 'this', 'threat', 'throne', 'to', 'turmoil', 'uther', 'welfare', 'when', 'who', 'will', 'withdraws', 'without', 'woman']
[[0.         0.         0.15095332 0.         0.         0.
  0.31622502 0.         0.         0.         0.         0.
  0.         0.20340954 0.         0.         0.31622502 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.31622502 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.31622502 0.         0.17386773 0.24504638 0.
  0.         0.         0.         0.         0.31622502 0.
  0.         0.         0.         0.         0.31622502 0.
  0.         0.         0.         0.15095332 0.         0.31622502
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.31622502 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.15095332
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 ...]
(only the first of the eight sentence rows is shown; the full matrix is 8 sentences x 103 vocabulary terms, and most entries are zero)
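With the matrix in hand, the selection step described at the top can be carried out. The following continuation is my own sketch (top_n = 2 is an arbitrary choice for illustration; it reuses sentences and sklearn_binary from the code above):

import numpy as np

matrix = sklearn_binary.toarray()
# Score each sentence by the mean TF-IDF weight of the terms it contains.
scores = [row[row > 0].mean() if (row > 0).any() else 0.0 for row in matrix]

top_n = 2
best = np.argsort(scores)[::-1][:top_n]   # indices of the highest-scoring sentences
for i in sorted(best):                    # print them in their original order
    print(sentences[i])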