I. Data Feature Extraction
1. Install the dependencies
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple scikit-learn

Note: numpy and pandas must be installed before scikit-learn.
2. Dictionary feature extraction
from sklearn.feature_extraction import DictVectorizer

data = [{"name": "nick", "age": 12}, {"name": "mile", "age": 23}, {"name": "jack", "age": 34}]

def dictvec():
    # Set sparse=False to get a plain array back instead of a sparse matrix
    vec = DictVectorizer(sparse=False)
    trans_data = vec.fit_transform(data)
    print(vec.feature_names_)
    print(trans_data)

if __name__ == "__main__":
    dictvec()
The returned dataset uses one-hot encoding:
['age', 'name=jack', 'name=mile', 'name=nick']
[[12.  0.  0.  1.]   # 0 0 1 in the name columns encodes nick
 [23.  0.  1.  0.]
 [34.  1.  0.  0.]]
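By default DictVectorizer uses sparse=True, so fit_transform returns a scipy.sparse matrix, which you can densify with toarray() when needed. A minimal sketch of the sparse path (variable names here are illustrative, not from the example above):

from sklearn.feature_extraction import DictVectorizer

data = [{"name": "nick", "age": 12}, {"name": "mile", "age": 23}]

vec = DictVectorizer()                    # sparse defaults to True
sparse_result = vec.fit_transform(data)   # scipy.sparse matrix
print(sparse_result)                      # printed as (row, col)  value triples
print(sparse_result.toarray())            # same data as a dense numpy array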
3. Text feature extraction and handling Chinese text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cutword():
    text1 = "床前明月光,疑是地上霜"
    text2 = "一二三四五,上山打老虎"
    # jieba.cut returns a generator of segmented words
    cut_text1 = jieba.cut(text1)
    cut_text2 = jieba.cut(text2)
    l1 = list(cut_text1)
    l2 = list(cut_text2)
    # Join the words with spaces so the vectorizer can tokenize them
    return [" ".join(l1), " ".join(l2)]

if __name__ == "__main__":
    countvac = TfidfVectorizer()
    trans_data = countvac.fit_transform(cutword())
    # On scikit-learn >= 1.0, use get_feature_names_out() instead
    print(countvac.get_feature_names())
    print(trans_data.toarray())
Notes:
1. The text feature extractors have no sparse parameter, so convert the result to an array yourself:
trans_data.toarray()
2. Chinese text has no spaces between words, so the jieba segmentation library is needed to split the text into words first.
3. TfidfVectorizer extracts the words with high importance: a word scores higher the more often it appears in a document and the rarer it is across the corpus, so distinctive words get the largest weights (see the CountVectorizer sketch after the output below).
Returned result:
['一二三四五', '上山', '地上', '床前', '明月光', '疑是', '老虎']
[[0.         0.         0.5        0.5        0.5        0.5        0.        ]
 [0.57735027 0.57735027 0.         0.         0.         0.         0.57735027]]
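For comparison, CountVectorizer (imported above but not used) returns raw term counts rather than TF-IDF weights. A minimal sketch, assuming the cutword() helper defined in the example above is in scope:

from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
count_data = count_vec.fit_transform(cutword())  # cutword() from the example above
print(count_vec.get_feature_names())   # same vocabulary order as TfidfVectorizer
print(count_data.toarray())            # integer counts instead of TF-IDF weights

Each row here holds plain occurrence counts, whereas the TF-IDF rows above are normalized importance weights.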