I. TF-IDF
1. What is TF-IDF?
TF-IDF is a weighting technique commonly used in information retrieval and data mining.
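- TF stands for term frequency: the number of times a term appears in a document
- DF(t, D) is the number of documents that contain the term t
- |D| is the total number of documents
- IDF stands for inverse document frequency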
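The larger the ratio of |D| to DF(t, D), the rarer the term is across the corpus and the better it characterizes the documents that contain it. When every document contains the term, the ratio is 1 and its logarithm is 0. To avoid a zero denominator, 1 is added to the denominator; to keep the logarithm equal to 0 when DF(t, D) equals |D|, 1 is added to the numerator as well.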
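Written out, the smoothing described above gives the formula below. This is a reconstruction from the text rather than a copy of the original figure; the natural logarithm is an assumption, chosen because it is consistent with the values in the Spark output shown later in this article.

```latex
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1}
```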
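However, a document may contain many words that are repeated often but carry little meaning, such as a, an, and the. Term frequency alone would overstate their importance, so TF-IDF is used to measure how important a term is to a document.
TF-IDF multiplies the two quantities. A term that appears in many documents gets a small IDF no matter how high its term frequency is, so the product gives a good measure of how important the term is to a document.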
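As a worked illustration of this argument, take a word such as "the" that occurs in every document. The notation TF(t, d) for the count of term t in document d is an assumption added here for clarity, and the smoothed IDF is the one described in the text.

```latex
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D),
\qquad
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1}

% A word that occurs in every document has DF(t, D) = |D|, so
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{|D| + 1} = \log 1 = 0
\quad\Longrightarrow\quad
\mathrm{TFIDF}(t, d, D) = 0 \ \text{for any value of } \mathrm{TF}(t, d)
```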
2. The official Spark example code
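import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

def tfidf(): Unit = {
  // No master is set here, so it is expected to come from spark-submit
  val spark = SparkSession.builder().appName("TFIDF").getOrCreate()
  // Three example sentences, each with a label
  val sentenceData = spark.createDataFrame(Seq(
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
  )).toDF("label", "sentence")
  /**
   * Split each sentence into lowercase words
   */
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val wordsData = tokenizer.transform(sentenceData)
  wordsData.show(10, false)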
/*
+-----+-----------------------------------+------------------------------------------+
|label|sentence |words |
+-----+-----------------------------------+------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+
*/
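  /**
   * Build the raw term-frequency vectors with hashingTF.transform()
   */
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeature")
  val featurizedData = hashingTF.transform(wordsData)
  featurizedData.show(10, false)
  /*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+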
|label|sentence |words |rawFeature |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0]) |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
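Each word is hashed to an index in a 262144-dimensional sparse vector: the words [hi, i, heard, about, spark] produce the indices [18700,19036,33808,66273,173558]. The indices are listed in ascending order, not in word order; for example, "i" maps to 19036 in both of the first two sentences. The values [1.0,1.0,1.0,1.0,1.0] are the number of times each word occurs in the sentence.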
*/
  /**
   * Use IDF to rescale the feature vectors. IDF is an Estimator: calling its fit() method
   * on the term-frequency vectors produces an IDFModel
   */
  val idf = new IDF().setInputCol("rawFeature").setOutputCol("feature")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show(10, false)
/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|sentence |words |rawFeature |feature |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0]) |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0]) |(262144,[46243,58267,91006,160975,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453]) |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
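As the table shows, hi appears only in the first sentence while i appears in two sentences, so the TF-IDF value of hi (about 0.693) is larger than that of i (about 0.288): hi is more representative of the first sentence.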
*/
}
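The values in the last table can be checked by hand. The following is a minimal sketch in plain Scala, without Spark, that applies the smoothed IDF described in section 1 to the same three sentences. The object name `TfIdfByHand` and its helper functions are hypothetical names introduced for illustration, and lowercasing plus splitting on whitespace is used as a simplified stand-in for `Tokenizer`.

```scala
object TfIdfByHand {
  // The three example sentences, lowercased and split on whitespace
  // (a simplified stand-in for Spark's Tokenizer).
  val docs: Seq[Seq[String]] = Seq(
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic regression models are neat"
  ).map(_.toLowerCase.split("\\s+").toSeq)

  // DF(t, D): number of documents that contain the term.
  def df(term: String): Int = docs.count(_.contains(term))

  // Smoothed IDF with the natural logarithm: log((|D| + 1) / (DF + 1)).
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (df(term) + 1.0))

  // TF(t, d): raw count of the term in a single document.
  def tf(term: String, doc: Seq[String]): Double =
    doc.count(_ == term).toDouble

  // TF-IDF is the product of the two.
  def tfidf(term: String, doc: Seq[String]): Double =
    tf(term, doc) * idf(term)

  def main(args: Array[String]): Unit = {
    val first = docs.head
    // Print the TF-IDF value of every word in the first sentence.
    first.foreach(w => println(f"$w%-6s ${tfidf(w, first)}%.4f"))
  }
}
```

With |D| = 3, hi has DF = 1 and i has DF = 2, so the sketch gives ln(4/2) ≈ 0.6931 for hi and ln(4/3) ≈ 0.2877 for i, which agrees with the feature column above.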