Python Third-Party Modules: Machine Learning: Scikit-Learn: Feature Engineering

I. feature_extraction
1. Overview:

This module performs feature extraction on raw data.

2. Usage:

Convert lists of feature-value mappings to vectors: class sklearn.feature_extraction.DictVectorizer([dtype=<class 'numpy.float64'>,separator='=',sparse=True,sort=True])
Implement feature hashing (also known as the hashing trick): class sklearn.feature_extraction.FeatureHasher([n_features=1048576,input_type='dict',dtype=<class 'numpy.float64'>,alternate_sign=True])
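
The two classes above can be compared side by side. The following is a minimal sketch, assuming a reasonably recent scikit-learn; the sample records and the small n_features=8 are made up for illustration and are not from the original notes.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical sample records: each dict maps feature names to values.
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "Dubai", "temperature": 35.0},
]

# DictVectorizer one-hot encodes string values ("city=Dubai") and
# passes numeric values through unchanged.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)
feature_names = vectorizer.feature_names_  # sorted, because sort=True by default

# FeatureHasher needs no fitting: it hashes feature names directly into
# a fixed number of columns (8 here, chosen only to keep the demo small).
word_counts = [{"dog": 1, "cat": 2}, {"dog": 2, "run": 5}]
hasher = FeatureHasher(n_features=8, input_type="dict")
hashed = hasher.transform(word_counts)  # scipy sparse matrix of shape (2, 8)

print(feature_names)
print(X)
print(hashed.toarray())
```

DictVectorizer keeps a readable mapping from columns to feature names, while FeatureHasher trades that mapping for a fixed memory footprint, which is usually the deciding factor between the two.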

3. feature_extraction.image
(1) Overview:

This submodule extracts features from images.

(2) Functions:

Convert a 2D image into a collection of patches: [<patches>=]sklearn.feature_extraction.image.extract_patches_2d(<image>,<patch_size>[,max_patches=None,random_state=None])
Get the graph of the pixel-to-pixel connections: sklearn.feature_extraction.image.grid_to_graph(<n_x>,<n_y>[,n_z=1,mask=None,return_as=<class 'scipy.sparse.coo.coo_matrix'>,dtype=<class 'int'>])
Get the graph of the pixel-to-pixel gradient connections: sklearn.feature_extraction.image.img_to_graph(<img>[,mask=None,return_as=<class 'scipy.sparse.coo.coo_matrix'>,dtype=None])
Reconstruct an image from all of its patches: [<image>=]sklearn.feature_extraction.image.reconstruct_from_patches_2d(<patches>,<image_size>)
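
As a rough illustration of the four functions above, the sketch below runs them on a tiny synthetic 4x4 image. The image itself and the 2x2 patch size are assumptions chosen only to keep the output small.

```python
import numpy as np
from sklearn.feature_extraction import image

# Hypothetical 4x4 grayscale image with pixel values 0..15.
img = np.arange(16, dtype=np.float64).reshape(4, 4)

# All overlapping 2x2 patches: (4 - 2 + 1) * (4 - 2 + 1) = 9 of them.
patches = image.extract_patches_2d(img, (2, 2))

# Rebuilding averages the overlapping patches back into one image.
rebuilt = image.reconstruct_from_patches_2d(patches, (4, 4))

# Connectivity graph of a 4x4 pixel grid: one node per pixel.
graph = image.grid_to_graph(4, 4)

# Same connectivity, but edges are weighted by the intensity gradient.
grad_graph = image.img_to_graph(img)

print(patches.shape)     # (9, 2, 2)
print(graph.shape)       # (16, 16)
print(grad_graph.shape)  # (16, 16)
```

Because every patch is taken from the same unmodified image, averaging the overlaps reproduces the original pixels exactly (up to floating-point error).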

(3) Classes:

Extract patches from a collection of images: class sklearn.feature_extraction.image.PatchExtractor([patch_size=None,max_patches=None,random_state=None])
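
PatchExtractor is the estimator-style counterpart of extract_patches_2d and works on a whole batch of images. A minimal sketch, assuming five random 8x8 images (made-up data) and four random 3x3 patches per image:

```python
import numpy as np
from sklearn.feature_extraction.image import PatchExtractor

# Hypothetical batch of 5 random grayscale images, each 8x8.
rng = np.random.RandomState(0)
images = rng.rand(5, 8, 8)

# With an integer max_patches, that many patches are sampled per image.
extractor = PatchExtractor(patch_size=(3, 3), max_patches=4, random_state=0)

# fit() does nothing (the extractor is stateless) but keeps the usual
# estimator call pattern, so it can be placed in a Pipeline.
batch_patches = extractor.fit(images).transform(images)

print(batch_patches.shape)  # (20, 3, 3): 5 images * 4 patches each
```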

4. feature_extraction.text
(1) Overview:

This submodule extracts features from text documents.

(2) Usage:

Convert a collection of text documents to a matrix of token counts: class sklearn.feature_extraction.text.CountVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),analyzer='word',max_df=1.0,min_df=1,max_features=None,vocabulary=None,binary=False,dtype=<class 'numpy.int64'>])
Convert a collection of text documents to a matrix of token occurrences: class sklearn.feature_extraction.text.HashingVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),analyzer='word',n_features=1048576,binary=False,norm='l2',alternate_sign=True,dtype=<class 'numpy.float64'>])
Transform a count matrix to a normalized tf or tf-idf representation: class sklearn.feature_extraction.text.TfidfTransformer([norm='l2',use_idf=True,smooth_idf=True,sublinear_tf=False])
Convert a collection of raw documents to a matrix of TF-IDF features: class sklearn.feature_extraction.text.TfidfVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,analyzer='word',stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),max_df=1.0,min_df=1,max_features=None,vocabulary=None,binary=False,dtype=<class 'numpy.float64'>,norm='l2',use_idf=True,smooth_idf=True,sublinear_tf=False])
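
How the four classes above relate to each other is easiest to see on a toy corpus. The sketch below uses three made-up sentences and default parameters (apart from a small n_features for the hasher); it is an illustration, not a recommended configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: raw token counts. The default token_pattern keeps tokens of
# two or more word characters, giving 10 distinct tokens here.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Step 2: re-weight the counts by inverse document frequency and
# L2-normalize each row.
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer performs steps 1 and 2 in a single estimator.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# HashingVectorizer skips the vocabulary entirely and hashes tokens
# into a fixed number of columns (16 here, only for the demo).
hashed_docs = HashingVectorizer(n_features=16).transform(docs)

print(counts.shape)       # (3, 10)
print(hashed_docs.shape)  # (3, 16)
print(np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray()))
```

With default settings, CountVectorizer followed by TfidfTransformer gives the same matrix as TfidfVectorizer alone; the two-step form is mainly useful when the raw counts are also needed.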

II. feature_selection
1. Overview:

This module performs feature selection, including univariate filter selection methods and the recursive feature elimination algorithm.
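
The two families named above can be sketched as follows. The choice of the iris dataset, SelectKBest with f_classif, LogisticRegression as the base estimator, and keeping 2 features are all assumptions made for illustration; the original notes do not list specific classes for this module.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Iris has 150 samples and 4 features.
X_iris, y_iris = load_iris(return_X_y=True)

# Univariate filter: score each feature independently with an ANOVA
# F-test and keep the 2 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X_iris, y_iris)

# Recursive feature elimination: repeatedly fit the estimator and drop
# the weakest feature until only 2 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X_iris, y_iris)

print(selector.get_support())  # boolean mask of features kept by the filter
print(rfe.support_)            # boolean mask of features kept by RFE
```

The filter looks at each feature in isolation and is cheap, whereas RFE refits a model at every step and can account for interactions between features at a higher cost.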

2. Usage:

