Several alternative term weighting methods for text representation: 4. Experimental settings

“In this study, we use two public text classification datasets to validate the performance of our schemes, namely Reuters-21578 and 20 Newsgroups datasets [37]. Reuters-21578 dataset has 8 different categories including 5485 training texts and 2189 test texts. There are 20 categories in 20 Newsgroups corpus, including 11,293 training texts and 7528 test texts. Moreover, we omit those terms that length less than two characters and occurrence less than two times. In addition to that, we use porter stemmer for stemming purpose [38] and all terms are converted to lowercase letters. And then, punctuation, stop words, numbers and other symbols are deleted. Finally, the Reuters-21578 and 20 Newsgroups datasets have 8541 and 33,414 different features respectively that can be used to train the classifier.”
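In other words: the study validates its schemes on two public text classification datasets, Reuters-21578 and 20 Newsgroups [37]. The Reuters-21578 dataset has 8 categories, with 5485 training texts and 2189 test texts. The 20 Newsgroups corpus has 20 categories, with 11,293 training texts and 7528 test texts. Terms shorter than two characters and terms that occur fewer than two times are omitted. The Porter stemmer is used for stemming [38], and all terms are converted to lowercase. Punctuation, stop words, numbers and other symbols are then removed. After preprocessing, Reuters-21578 and 20 Newsgroups have 8541 and 33,414 distinct features respectively, which can be used to train the classifier.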
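The preprocessing steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the authors' actual pipeline. The stop list, the order of the steps, the reading of "occurrence less than two times" as total corpus frequency, and the names `tokenize` and `build_vocabulary` are all assumptions made for the example. Stemming is left as a pluggable `stem` callable that defaults to the identity function; a Porter stemmer such as `nltk.stem.PorterStemmer().stem` could be passed in instead.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; the paper does not name its list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for"}


def tokenize(text, stem=lambda t: t):
    """Lowercase, keep alphabetic runs only, drop stop words and short tokens, then stem."""
    # Matching only [a-z]+ discards punctuation, digits and other symbols.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS and len(t) >= 2]


def build_vocabulary(docs, min_count=2, stem=lambda t: t):
    """Return (vocabulary, tokenized docs), keeping terms seen at least min_count times."""
    tokenized = [tokenize(d, stem) for d in docs]
    # Assumption: "occurrence" means total occurrences across the whole corpus.
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    return vocab, tokenized


# Toy corpus, invented for this example only.
docs = [
    "The price of oil rose 3% in 1987.",
    "Oil exports and grain exports are rising.",
    "Grain price is a concern for traders!",
]
vocab, tokenized = build_vocabulary(docs)
print(vocab)  # ['exports', 'grain', 'oil', 'price']
```

On the real corpora, the length of `vocab` would correspond to the feature counts reported in the excerpt (8541 and 33,414), although the exact numbers depend on the stop list and stemmer used.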
