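Feature selection is a fundamental task in machine learning. It is defined as choosing a meaningful subset of the available features in order to simplify the learning problem. Feature selection methods generally fall into three categories:
- Wrapper methods
- Filter methods
- Embedded methods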
1. Wrapper Methods
Wrapper methods evaluate subsets of features by training a model with each subset and scoring on a held-out set.
Because the model is treated as a black box, this approach works with any prediction algorithm, but it scales poorly for larger commercial systems with many features, since every candidate subset requires training another model.
Representative method: Recursive Feature Elimination (RFE)
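The wrapper idea can be sketched in a few lines of plain Python. The example below is a minimal illustration, not Recursive Feature Elimination itself: it exhaustively trains a nearest-centroid classifier (chosen here only for brevity) on every non-empty feature subset and scores each one on a held-out set. The toy data, the helper names (`fit_centroids`, `predict`, `accuracy`, `wrapper_select`) and the choice of classifier are all assumptions made for illustration.

```python
from itertools import combinations

def fit_centroids(rows, labels, cols):
    """'Train' a nearest-centroid classifier using only the columns in cols."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for i, c in enumerate(cols):
            acc[i] += row[c]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row, cols):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((row[c] - m) ** 2 for c, m in zip(cols, center))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(train, val, cols):
    """Fit on the training split and return accuracy on the held-out split."""
    (x_tr, y_tr), (x_va, y_va) = train, val
    centroids = fit_centroids(x_tr, y_tr, cols)
    hits = sum(predict(centroids, row, cols) == y for row, y in zip(x_va, y_va))
    return hits / len(y_va)

def wrapper_select(train, val, n_features):
    """Score every non-empty feature subset; ties keep the earlier (smaller) subset."""
    best_cols, best_score, n_models = None, -1.0, 0
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            score = accuracy(train, val, cols)
            n_models += 1
            if score > best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score, n_models

# Hypothetical toy data. Columns 0 and 1 are informative; column 2 is a
# spurious feature whose pattern in the training rows does not hold on held-out rows.
train = (
    [[0.0, 0.2, 5.0], [0.2, 0.0, -5.0], [0.1, 0.1, 4.0],
     [1.0, 0.9, -4.0], [0.9, 1.0, 5.0], [1.1, 1.1, -5.0]],
    [0, 0, 0, 1, 1, 1],
)
val = (
    [[0.1, 0.2, -4.0], [0.7, 0.1, -3.0], [0.8, 1.0, 4.0], [1.0, 0.4, 3.0]],
    [0, 0, 1, 1],
)

best_cols, best_score, n_models = wrapper_select(train, val, n_features=3)
print(best_cols, best_score, n_models)
```

On this toy data the search fits 7 models for 3 features and keeps columns 0 and 1. The number of fits grows as 2^d - 1 for d features, which is the scaling problem noted above; greedy schemes such as RFE give up exhaustiveness in exchange for far fewer model fits.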
2. Filter Methods
Filter methods use heuristic measures such as mutual information or Pearson correlation to score features based on their informative power with regard to the prediction target.
These methods are more scalable than wrapper methods because they do not require training many models. However, they depend heavily on the specific heuristic metric used to score features, and there is no structured approach or clear guideline for preferring one metric over another.
Representative methods: Chi-Squared Test, Correlation Coefficient Scores, Information Gain
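A filter method can be illustrated with a short, model-free scoring pass. The sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k. The column names, the toy values and the helper names (`pearson`, `filter_select`) are hypothetical, and Pearson correlation is only one of the possible heuristics; it captures linear relationships only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_select(columns, target, k):
    """Rank features by |Pearson r| with the target and keep the top k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: "size" and "rooms" track the target, "noise" does not.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "rooms": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

selected, scores = filter_select(columns, target, k=2)
print(selected)
```

No model is trained at any point, which is why this kind of scoring scales well. Swapping `pearson` for a different heuristic may change the ranking, which reflects the metric dependence described above.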
3. Embedded Methods
These are a family of algorithms in which feature selection is performed during model construction. Embedded methods are not based on cross-validation and therefore scale well with data size. Features are chosen based on their relative usefulness and informative power with regard to the prediction task at hand.
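One familiar way to see selection happening during model construction is L1-regularized linear regression (Lasso), where the penalty drives some weights to exactly zero while the model is being fit. The sketch below uses Lasso purely as a stand-in example of an embedded method; it is not claimed to be the approach of any particular paper. The function names, the toy data (with deliberately orthogonal columns so the result is easy to check by hand) and the value of `alpha` are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(rows, targets, alpha, n_iters=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    Assumes centered data, so no intercept term is fitted.
    """
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for row, y in zip(rows, targets):
                # Prediction from every feature except j
                partial = sum(weights[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (y - partial)
                norm += row[j] ** 2
            weights[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm else 0.0
    return weights

# Hypothetical toy data: y = 2*x0 + 0.5*x1 + 0.05*x2 with orthogonal columns.
rows = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
targets = [2.55, 1.45, -1.55, -2.45]

weights = lasso_coordinate_descent(rows, targets, alpha=0.1)
# Features whose weight was shrunk to exactly zero are dropped.
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(weights, selected)
```

A single fit yields both the model and the selected features: the weakly relevant third column receives a weight of exactly zero and is dropped, with no separate search over subsets.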
- Koenigstein N, Paquet U. Xbox movies recommendations: Variational Bayes matrix factorization with embedded feature selection[C]//Proceedings of the 7th ACM Conference on Recommender Systems. 2013: 129-136.