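Feature selection is a fundamental task in machine learning. It is defined as choosing a meaningful subset of features from the available feature set in order to simplify the learning problem. Feature selection methods generally fall into three categories:
- Wrapper methods
- Filter methods
- Embedded methods
1. Wrapper Methods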
Wrapper methods treat feature selection as a search problem: they generate candidate feature subsets, train a model on each subset, score it on a held-out set, and keep the best-scoring subset.
The approach is independent of the prediction algorithm in use, since it treats the model as a black box, but it scales poorly for larger commercial systems with many features, because every candidate subset requires training a model.
Representative method: Recursive Feature Elimination (RFE)
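
The wrapper idea described above can be sketched in a few lines. This is a minimal illustration, not RFE itself: it searches every feature subset exhaustively and scores each one on a held-out split. The nearest-centroid classifier, the synthetic data, and all function names (`make_data`, `fit_centroids`, `accuracy`, `wrapper_select`) are assumptions made for the example. Only feature 0 carries signal; features 1 and 2 are noise.

```python
import itertools
import random


def make_data(n=40, seed=0):
    """Synthetic data (assumed for illustration): feature 0 separates the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([5.0 * label + rng.uniform(-1, 1),  # informative feature
                  rng.uniform(-3, 3),                # noise
                  rng.uniform(-3, 3)])               # noise
        y.append(label)
    return X, y


def fit_centroids(X, y, cols):
    """Train a nearest-centroid model using only the columns in `cols`."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for k, c in enumerate(cols):
            acc[k] += row[c]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def accuracy(centroids, X, y, cols):
    """Fraction of rows whose nearest centroid matches the true label."""
    hits = 0
    for row, label in zip(X, y):
        point = [row[c] for c in cols]
        pred = min(
            centroids,
            key=lambda lab: sum((p - q) ** 2 for p, q in zip(point, centroids[lab])),
        )
        if pred == label:
            hits += 1
    return hits / len(y)


def wrapper_select(X_train, y_train, X_val, y_val):
    """Try every non-empty subset; keep the first one with the best held-out score."""
    n_features = len(X_train[0])
    best_cols, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for cols in itertools.combinations(range(n_features), size):
            model = fit_centroids(X_train, y_train, cols)  # one model per subset
            score = accuracy(model, X_val, y_val, cols)
            if score > best_score:  # strict: ties keep the smaller, earlier subset
                best_cols, best_score = cols, score
    return best_cols, best_score


X, y = make_data()
best_cols, best_score = wrapper_select(X[:30], y[:30], X[30:], y[30:])
print(best_cols, best_score)
```

With d features the loop trains 2^d - 1 models, which is the scaling problem noted above. Greedy strategies such as RFE reduce the number of models trained at the cost of not exploring every subset.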
2. Filter Methods
Filter methods use heuristic measures such as mutual information or Pearson correlation to score each feature by its informative power with regard to the prediction target. The features are then ranked by score and the top-ranked ones form the selected subset.
These methods are more scalable than wrapper methods because they do not require training many models. However, they depend heavily on the specific heuristic metric used to score features, and there is no structured approach or clear guideline for preferring one metric over another.
Representative methods: Chi-Squared Test, Correlation Coefficient Scores, Information Gain
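
As a rough sketch of the filter approach, the snippet below scores each feature by the absolute value of its Pearson correlation with the target and keeps the top k. The toy data and the helper names (`pearson`, `select_top_k`) are assumptions for illustration. No model is trained at any point.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_top_k(features, target, k):
    """Rank features by |correlation with the target| and return the k best names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy data (assumed): one exactly linear feature, one noisy one, one weakly related.
target = [1, 2, 3, 4, 5, 6]
features = {
    "linear": [2, 4, 6, 8, 10, 12],
    "noisy": [1, 3, 2, 5, 4, 6],
    "unrelated": [3, 1, 3, 1, 3, 1],
}

print(select_top_k(features, target, k=2))
```

Swapping `pearson` for another scoring function (for example mutual information) can change the ranking, which is the metric dependence described above.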
3. Embedded Methods
Embedded methods are a family of algorithms in which feature selection is performed during model construction, that is, as part of training rather than as a separate step. Features are chosen based on their relative usefulness and informative power with regard to the prediction task at hand. Because embedded methods are not based on cross-validation, they scale well with data size.
Representative method: LASSO
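
To illustrate how an embedded method selects features while fitting the model, here is a minimal LASSO sketch using coordinate descent with soft-thresholding. It assumes centered features, no intercept, and the objective (1/(2n))·||y − Xw||² + alpha·||w||₁. The function names, the toy data, and the choice `alpha=0.5` are assumptions for the example, not taken from the referenced paper.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    Assumes every feature column has at least one non-zero entry.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, target in zip(X, y):
                partial = sum(row[k] * w[k] for k in range(d) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Toy data (assumed): y = 3*x0 + 0.2*x1, with centered, orthogonal columns.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [3.2, -2.8, 2.8, -3.2]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print(weights, selected)
```

The L1 penalty drives the weight of the weak feature to exactly zero during training, so the set of non-zero weights is the selected feature subset, with no separate search over subsets.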
Reference:
- Koenigstein N, Paquet U. Xbox movies recommendations: Variational Bayes matrix factorization with embedded feature selection[C]//Proceedings of the 7th ACM Conference on Recommender Systems. 2013: 129-136.