Learning TF-IDF (Python implementation)

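I first ran into TF-IDF in my freshman year and always thought it was really simple. As it turns out, that was "too young, too simple".

Even now, I'm not sure I have fully understood it.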

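Core idea:

  Compute the weight of a word within a document. The weight depends on how often the word occurs and on how valuable the word is.

  Occurrence count: repetition is emphasis, so the more often a word occurs, the more important it is.

  Word value: the more documents a word appears in, the more indiscriminate it is, and therefore the cheaper.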

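Formulas:

  Term frequency TF = number of times the word occurs in the document / total number of words in the document

  Inverse document frequency IDF = log( total number of documents / (number of documents containing the word + 1) )

  TF-IDF = TF * IDF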
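To make the formulas concrete, here is a minimal sketch that applies them to a tiny corpus. The documents and the `tf_idf` helper are made up for illustration, and the base-10 logarithm is an assumption: the formula itself does not fix the base, although the article's own code uses `math.log10`.

```python
import math

# Toy corpus (made up for illustration): each inner list is one tokenized document.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "durian"],
    ["durian", "elder"],
]


def tf_idf(word, doc, docs):
    """TF-IDF of `word` in `doc`, following the formulas above (hypothetical helper)."""
    tf = doc.count(word) / len(doc)          # occurrences / total words in the document
    df = sum(1 for d in docs if word in d)   # number of documents containing the word
    idf = math.log10(len(docs) / (df + 1))   # the +1 avoids division by zero
    return tf * idf


print(tf_idf("apple", docs[0], docs))   # appears in 1 of 4 documents: higher weight
print(tf_idf("banana", docs[0], docs))  # appears in 2 of 4 documents: lower weight
```

One consequence of the +1 in the denominator: a word that appears in all but one document gets an IDF of exactly 0, and a word that appears in every document gets a slightly negative IDF.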

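Computation:

1. My own code:

  # Since I'm computing this to get feature values, I used jieba, a lightweight and easy-to-use Chinese word segmentation package. See its GitHub page for details: https://github.com/hosiet/jieba

  # The final results are stored in a file as JSON

  At first, I wrote my own code for the calculation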

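# coding=utf-8
import jieba
import re
import math
import json

with open('stop_words.txt', 'r', encoding='utf-8') as f:
    # strip() drops the trailing newline; a set gives fast membership tests
    stopwords = {x.strip() for x in f}

data = []     # every non-stop-word occurrence in the corpus
tf = {}       # word -> raw occurrence count over the whole corpus
doc_num = {}  # word -> number of documents (lines) containing the word
tfidf = {}
TOTAL = 0     # total number of documents


def calcu_tf():
    '''Compute the TF values'''
    global TOTAL
    with open('exercise.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
    TOTAL = 0
    for l in lines:
        # Strip non-word characters, then segment with jieba
        lx = re.sub(r'\W', '', l)
        words = jieba.lcut(lx)
        # A word may occur several times in one line; count it once per document
        seen = set()
        for i in words:
            if i in stopwords:
                continue
            data.append(i)
            if i not in seen:
                # Count how many documents the word appears in
                seen.add(i)
                doc_num[i] = doc_num.get(i, 0) + 1
        # Count the total number of documents
        TOTAL += 1
    for i in set(data):
        tf[i] = data.count(i)


def calcu_tfidf():
    '''Compute the TF-IDF values'''
    for i in tf:
        tfidf[i] = tf[i] * math.log10(TOTAL / (doc_num[i] + 1))


if __name__ == '__main__':
    calcu_tf()
    calcu_tfidf()
    print(tfidf)
    with open('tfidf.json', 'w', encoding="utf-8") as file:
        # json.dumps needs ensure_ascii=False, otherwise the file is full of \uXXXX escapes
        file.write(json.dumps(tfidf, ensure_ascii=False, indent=2))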
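The comment on the last `json.dumps` call is worth a quick demonstration. The sketch below, using a made-up one-entry dictionary, shows what the output looks like with and without `ensure_ascii=False`.

```python
import json

scores = {"词语": 0.5}  # made-up value for illustration

escaped = json.dumps(scores)                                 # default: ensure_ascii=True
readable = json.dumps(scores, ensure_ascii=False, indent=2)  # keep non-ASCII text as-is

print(escaped)   # the Chinese key is written as \uXXXX escape sequences
print(readable)  # the Chinese key stays human-readable

# Both forms decode back to the same dictionary.
print(json.loads(escaped) == json.loads(readable) == scores)
```

Either form is valid JSON; `ensure_ascii=False` only matters for readability, and it requires the file to be opened with a suitable encoding such as UTF-8, as the listing does.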

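The input is a test document I put together myself. Part of the results are shown in the screenshots.

[Screenshots: the test document and part of the computed results]

Total running time: 1.54041444018928 seconds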
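The article does not show how the running time of this first version was measured. One possible way, as a sketch only: wrap the work between two `time.perf_counter()` calls. The `timed` helper and the stand-in workload are hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; for the script above this would be calcu_tf() followed by calcu_tfidf().
result, elapsed = timed(sum, range(100000))
print(f"took {elapsed:.6f} s")
```

`time.perf_counter()` is a monotonic, high-resolution clock, which makes it a reasonable choice for short measurements like this one.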

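2. Using the sklearn package

Later I figured that if something ready-made works, I should just use it. After all, it saves a lot of code.

And so the scikit-learn version of the TF-IDF calculation was born.

  # Installing the sklearn package is covered in another blog post: http://www.cnblogs.com/rucwxb/p/7297733.html

Calculation steps:

  CountVectorizer computes TF

  TfidfTransformer computes IDF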
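One thing to keep in mind when comparing the two versions: with its default settings (`smooth_idf=True`), `TfidfTransformer` does not use the IDF formula given earlier. It computes idf = ln((1 + n) / (1 + df)) + 1 with a natural logarithm, and its `transform` step additionally L2-normalizes each document row. The sketch below reproduces only the IDF part in plain Python so the two definitions can be compared side by side. The function names and the numbers are made up for illustration.

```python
import math


def idf_article(n_docs, df):
    """IDF as defined in this article: log10(N / (df + 1))."""
    return math.log10(n_docs / (df + 1))


def idf_sklearn_default(n_docs, df):
    """IDF as computed by TfidfTransformer with smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + df)) + 1


n_docs = 100
for df in (1, 10, 99, 100):
    print(df, round(idf_article(n_docs, df), 4), round(idf_sklearn_default(n_docs, df), 4))
```

The sklearn variant never drops below 1, whereas the article's formula reaches 0 at df = N - 1 and goes negative at df = N, so scores from the two versions are not directly comparable.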

Core code:

import json
import re
import time

import jieba
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


def calcu_tfidf(stopwords):
    corpus = []
    with open('exercise.txt', 'r', encoding='utf-8') as f:
        for x in f:
            lx = re.sub(r'\W', '', x)
            # Segment with jieba, drop stop words, and join the tokens with spaces
            tokens = [i for i in jieba.lcut(lx) if i not in stopwords]
            corpus.append(" ".join(tokens))
    # Turn the words in the corpus into a term-count matrix
    vectorizer = CountVectorizer(ngram_range=(1, 1), lowercase=False, token_pattern=r'\b\w+\b', min_df=1)
    transformer = TfidfTransformer()
    # Count how many times each word occurs
    tf_mat = vectorizer.fit_transform(corpus)
    # Fit the transformer only to obtain its idf_ values
    transformer.fit(tf_mat)
    # Get every word in the bag of words
    # (get_feature_names() was removed in scikit-learn 1.2)
    words = vectorizer.get_feature_names_out()
    # Total count of each word over the whole corpus, flattened to 1-D
    tfs = np.asarray(tf_mat.sum(axis=0)).ravel()
    # Compute TF-IDF as corpus-wide count * IDF
    tfidf = {}
    for i, word in enumerate(words):
        tfidf[str(word)] = float(tfs[i] * transformer.idf_[i])
    return tfidf


if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    startT = time.perf_counter()
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        stopwords = {x.strip() for x in f}
    tfidf = calcu_tfidf(stopwords)
    with open('tfidf2.json', 'w', encoding="utf-8") as file:
        # json.dumps needs ensure_ascii=False, otherwise the file is full of \uXXXX escapes
        file.write(json.dumps(tfidf, ensure_ascii=False, indent=2))
    endT = time.perf_counter()
    print(endT - startT)
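Both listings clean each line with a `\W` substitution before handing it to jieba. As a quick illustration of what that step does: in Python 3, `\W` applied to a `str` is Unicode-aware, so Chinese characters, Latin letters, digits, and underscores are kept, while punctuation and all whitespace (including the trailing newline) are removed. The sample line is made up.

```python
import re

line = "今天天气不错, let's go! 2017_08\n"  # made-up sample line

cleaned = re.sub(r"\W", "", line)
print(cleaned)  # → 今天天气不错letsgo2017_08
```

Because spaces are removed as well, English words on the same line get glued together ("letsgo"), which matters if the corpus mixes languages.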