TF-IDF

I. TF-IDF

1. What is TF-IDF?

TF-IDF is a weighting technique commonly used in information retrieval and data mining.

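  • TF (Term Frequency): TF(t, d) is the number of times term t appears in document d.
  • DF(t, D): the number of documents in corpus D that contain term t.
  • |D|: the total number of documents in the corpus.
  • IDF (Inverse Document Frequency):

    $$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}$$

    The larger the ratio of |D| to DF(t, D), the rarer the term is across the corpus and the better it characterizes the documents that contain it. When every document contains the term, the ratio is 1 and its logarithm is 0. Adding 1 to the denominator prevents division by zero, and adding 1 to the numerator as well keeps the IDF at 0 when |D| and DF(t, D) are equal.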
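
As a quick check of the smoothed formula, here is a small worked example for a hypothetical corpus of three documents (|D| = 3), assuming the natural logarithm. It evaluates the IDF of a term found in one, two, and all three documents.

```latex
\begin{aligned}
DF = 1:&\quad IDF = \log\frac{3 + 1}{1 + 1} = \log 2 \approx 0.693 \\
DF = 2:&\quad IDF = \log\frac{3 + 1}{2 + 1} = \log\tfrac{4}{3} \approx 0.288 \\
DF = 3:&\quad IDF = \log\frac{3 + 1}{3 + 1} = \log 1 = 0
\end{aligned}
```

The rarer the term, the larger its IDF, and a term present in every document contributes nothing.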

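However, a document may contain many frequently repeated words that carry little meaning, such as "a", "an", and "the". To measure how important a term really is to a document, TF-IDF multiplies the term frequency TF(t, d) by the inverse document frequency:

$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$$

As the formula shows, a term with a high frequency that also appears in many documents gets a small IDF, which pulls its weight down. Combining the two therefore gives a good measure of how important a term is to a document.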
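
To make the interaction between the two factors concrete, the following is a minimal plain-Scala sketch of TF-IDF, assuming raw counts for TF and the smoothed IDF with the natural logarithm. The object name `TfIdfSketch`, its helpers, and the toy corpus are made up for illustration and are not part of Spark.

```scala
object TfIdfSketch {
  // Smoothed IDF: log((|D| + 1) / (DF + 1)), natural logarithm.
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // Returns, for each document, a map from term to its TF-IDF weight.
  def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    // DF(t, D): number of documents that contain term t at least once.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

    docs.map { doc =>
      // TF(t, d): raw count of term t in document d.
      val tf = doc.groupBy(identity).map { case (t, occ) => t -> occ.size }
      tf.map { case (t, count) => t -> count * idf(docs.size, docFreq(t)) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus, already lower-cased.
    val docs = Seq(
      "the cat sat on the mat",
      "the dog sat",
      "the bird flew"
    ).map(_.split(" ").toSeq)

    tfidf(docs).zipWithIndex.foreach { case (weights, i) =>
      val ranked = weights.toSeq.sortBy(-_._2).map { case (t, w) => f"$t=$w%.3f" }
      println(s"doc $i: " + ranked.mkString(", "))
    }
  }
}
```

In this toy corpus "the" occurs twice in the first document but appears in every document, so its IDF, and hence its TF-IDF weight, is 0, while "cat" occurs once but in only one document and gets a positive weight.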

2. Implementation from the official Spark example

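import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

def tfidf(): Unit = {
    val spark = SparkSession.builder().appName("TFIDF").getOrCreate()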

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label","sentence")


    /**
     * Split each sentence into words
     */
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    /*
    +-----+-----------------------------------+------------------------------------------+
    |label|sentence                           |words                                     |
    +-----+-----------------------------------+------------------------------------------+
    |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
    |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
    |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
    +-----+-----------------------------------+------------------------------------------+
     */


    /**
     * Build term-frequency feature vectors with hashingTF.transform()
     */
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeature")
    val featurizedData =  hashingTF.transform(wordsData)
    featurizedData.show(10,false)
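/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+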
|label|sentence                           |words                                     |rawFeature                                                                          |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+

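From this table, the five words [hi, i, heard, about, spark] hash to the indices [18700,19036,33808,66273,173558] (a sparse vector lists its indices in ascending order, so positions need not line up with word order), and [1.0,1.0,1.0,1.0,1.0] gives the number of times each word occurs in the sentence.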
 */


    /**
     * Use IDF to rescale the feature vectors. IDF is an Estimator: calling its fit() method on the feature vectors produces an IDFModel
     */
    val idf = new IDF().setInputCol("rawFeature").setOutputCol("feature")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.show(10,false)

/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|sentence                           |words                                     |rawFeature                                                                          |feature                                                                                                                                                                                       |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |(262144,[46243,58267,91006,160975,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

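From the table above, hi appears only in the first sentence while i appears in two sentences, so hi has a higher TF-IDF value than i (about 0.693 versus 0.288) and is more representative of the first sentence.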
 */
  }
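
The word-to-index mapping seen in the rawFeature column can be sketched in plain Scala. This is only an illustration of the hashing trick: it uses `String.hashCode`, whereas Spark's HashingTF uses its own hash function (MurmurHash3, to my knowledge), so the indices printed here will not match the ones in the tables above. The object `HashingSketch` and its helpers are hypothetical names.

```scala
object HashingSketch {
  // Hypothetical helper: map a term to a bucket in [0, numFeatures).
  // floorMod keeps the result non-negative even for negative hash codes.
  def bucket(term: String, numFeatures: Int): Int =
    java.lang.Math.floorMod(term.hashCode, numFeatures)

  // Sparse term-frequency vector as (index, count) pairs sorted by index.
  // Terms that collide in the same bucket have their counts added together.
  def termFrequencies(words: Seq[String], numFeatures: Int): Seq[(Int, Double)] =
    words
      .groupBy(bucket(_, numFeatures))
      .map { case (idx, ws) => idx -> ws.size.toDouble }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val numFeatures = 1 << 18 // 262144, the vector size shown in the output above
    val words = Seq("hi", "i", "heard", "about", "spark")
    println(termFrequencies(words, numFeatures))
  }
}
```

Because the vector has a fixed size, no vocabulary needs to be built first. The trade-off is that two different words can land in the same bucket, and a larger `numFeatures` makes such collisions less likely.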
