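The bag-of-words model (词袋模型 in Chinese) splits a text into individual words, counts how many times each word occurs, and uses those counts as the feature representation of the text. Word order is discarded; only the frequencies are kept. We borrow a figure from the web to illustrate:
Converting raw text into a bag-of-words representation. Courtesy of Zheng & Casari (2018).
Below we construct our own data and work through a practical example. First, load the packages: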
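The counting idea can be sketched in a few lines of base R. This is only a minimal illustration: the sentence `doc` and the names `tokens` and `bow` are assumptions made up for this sketch, and splitting on whitespace is a deliberately naive stand-in for a real tokenizer.

```r
# Hypothetical one-sentence document, used only for illustration
doc <- "the sky is blue and the sky is beautiful"

# Naive tokenization: split on runs of whitespace
tokens <- strsplit(doc, "\\s+")[[1]]

# Count how often each distinct word occurs;
# the resulting named count vector is the bag-of-words representation
bow <- table(tokens)

print(bow)
```

Here "the", "sky", and "is" each get a count of 2 and the remaining words a count of 1, regardless of where they appear in the sentence.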
library(pacman)
p_load(tidyverse,tidytext)
Hands-on practice
Step one: we create a small dataset by hand:
corpus = c('The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!' )
labels = c('weather', 'weather', 'ani