**
Regression Algorithms
**
1.Linear Regression:
from sklearn.linear_model import LinearRegression
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Parameters:
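normalize: bool, default False. Whether to normalize the data before fitting.
copy_X: bool, default True. Whether to copy X. If False, the original data is overwritten in place (that is, the centered and normalized values replace the original X).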
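As a minimal sketch of how these parameters are used, the example below fits LinearRegression on synthetic data generated from y = 2x + 1. The data and variable names are made up for illustration. The normalize argument is left out because recent scikit-learn releases no longer accept it; scaling the features beforehand (for example with StandardScaler) is the usual substitute.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y = 2 * x + 1 with no noise.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# copy_X=True keeps the caller's X untouched during fitting.
model = LinearRegression(fit_intercept=True, copy_X=True)
model.fit(X, y)

slope = float(model.coef_[0])
intercept = float(model.intercept_)
print(slope, intercept)
```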
2.Ridge Regression(L2 norm regularization)
from sklearn.linear_model import Ridge
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
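normalize=False, random_state=None, solver='auto', tol=0.001)
solver: {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'}
Solver used in the computational routines:
'auto' chooses the solver automatically based on the type of data.
'svd' uses a singular value decomposition of X to compute the Ridge coefficients. It is more stable than 'cholesky' for singular matrices.
'cholesky' uses the standard scipy.linalg.solve function to obtain a closed-form solution.
'sparse_cg' uses the conjugate gradient solver found in scipy.sparse.linalg.cg. As an iterative algorithm, it is more appropriate than 'cholesky' for large-scale data (tol and max_iter can be set).
'lsqr' uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest, but may not be available in old scipy versions. It also uses an iterative procedure.
'sag' uses Stochastic Average Gradient descent. It also uses an iterative procedure, and is often faster than the other solvers when both n_samples and n_features are large. Note that fast convergence of 'sag' is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.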
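To make the solver choice concrete, here is a small sketch that fits the same Ridge model with the two closed-form solvers and checks that they agree. The synthetic data, the true_w vector, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.randn(200)

# Fit the same model with two direct (non-iterative) solvers.
coefs = {}
for solver in ("svd", "cholesky"):
    model = Ridge(alpha=1.0, solver=solver)
    model.fit(X, y)
    coefs[solver] = model.coef_

# Both solvers target the same objective, so the coefficients should match closely.
max_gap = float(np.max(np.abs(coefs["svd"] - coefs["cholesky"])))
print(max_gap)
```

The iterative solvers ('sparse_cg', 'lsqr', 'sag') can be swapped in the same way; their results depend on tol and max_iter, so a looser comparison would be appropriate for them.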
3.Lasso Regression(L1 norm regularization)
from sklearn.linear_model import Lasso
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
4.ElasticNet Regression(L1+L2 norm regularization)
SGDClassifier with penalty='elasticnet' (L1+L2 norm regularization) is the counterpart used for classification.
from sklearn.linear_model import ElasticNet
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
Minimizes the objective function:
1 / (2 * n_samples) * ||y - Xw||^2_2
+ alpha * l1_ratio * ||w||_1    # alpha is shared by both penalty terms; l1_ratio is the share given to the L1 norm
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
If you are interested in controlling the L1 and L2 penalty
separately, keep in mind that this is equivalent to:
a * L1 + b * L2
where:
alpha = a + b and l1_ratio = a / (a + b)
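The conversion between the two parametrizations can be sketched in a few lines of plain Python. The helper names to_alpha_l1_ratio and to_a_b are hypothetical, not part of scikit-learn; they only restate the relations alpha = a + b and l1_ratio = a / (a + b) given above.

```python
def to_alpha_l1_ratio(a, b):
    """Convert separate L1 weight a and L2 weight b into (alpha, l1_ratio)."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio


def to_a_b(alpha, l1_ratio):
    """Inverse mapping: recover the separate L1 and L2 weights."""
    return alpha * l1_ratio, alpha * (1 - l1_ratio)


# Example: an L1 weight of 0.3 and an L2 weight of 0.1.
alpha, l1_ratio = to_alpha_l1_ratio(0.3, 0.1)
a, b = to_a_b(alpha, l1_ratio)
print(alpha, l1_ratio, a, b)
```

The resulting pair can then be passed as ElasticNet(alpha=alpha, l1_ratio=l1_ratio).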
5.Logistic Regression
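from sklearn.linear_model import LogisticRegression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
C: the inverse of the regularization strength λ; float, default 1.0. It must be a positive float. As in SVMs, smaller values mean stronger regularization. It plays the role of 1/alpha in the algorithms above.
class_weight: the weight of each class in the classification model. It can be a dict or the string 'balanced'; the default is None, meaning no class weighting. You can choose 'balanced' to let the library compute the class weights, or supply the weight of each class yourself. For example, for a binary 0/1 model you can set class_weight={0: 0.9, 1: 0.1}, so that class 0 gets a weight of 90% and class 1 a weight of 10%. With 'balanced', the library computes the weights from the training sample counts: the more samples a class has, the lower its weight, and the fewer samples, the higher its weight. The weights are computed as n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples, n_classes is the number of classes, and np.bincount(y) gives the number of samples in each class. For example, for y=[1,0,0,1,1], np.bincount(y)=[2,3].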
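So what is class_weight for?
In classification we often run into two kinds of problems:
The first is that misclassification is very costly. For example, when classifying legitimate and fraudulent users, labeling a fraudulent user as legitimate is very costly. We would rather label a legitimate user as fraudulent, since that case can be reviewed manually, than let a fraudulent user pass as legitimate. Here we can raise the weight of the fraudulent class.
The second is that the samples are highly imbalanced. For example, suppose we have 10,000 binary samples of legitimate and fraudulent users, of which 9,995 are legitimate and only 5 are fraudulent. Ignoring weights, we could predict every test sample as legitimate and reach an accuracy of 99.95% in theory, which is meaningless. Here we can choose 'balanced' so that the library automatically raises the weight of the fraudulent samples. Raising the weight of a class means that, compared with the unweighted case, more samples are assigned to the high-weight class, which addresses both problems above.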
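The 'balanced' formula n_samples / (n_classes * np.bincount(y)) can be checked by hand with NumPy. The sketch below applies it to the small y=[1,0,0,1,1] example and to the 9,995 / 5 imbalanced case described above; the helper name balanced_weights is hypothetical and only wraps that formula.

```python
import numpy as np


def balanced_weights(y):
    """Class weights as n_samples / (n_classes * np.bincount(y))."""
    y = np.asarray(y)
    counts = np.bincount(y)          # samples per class, indexed by label
    n_samples = y.shape[0]
    n_classes = counts.shape[0]
    return n_samples / (n_classes * counts)


# Small example: two samples of class 0, three of class 1.
small = balanced_weights([1, 0, 0, 1, 1])

# Imbalanced example: 9995 legitimate users (label 0), 5 fraudulent (label 1).
y_imbalanced = np.array([0] * 9995 + [1] * 5)
imbalanced = balanced_weights(y_imbalanced)

print(small)        # weight for class 0, then class 1
print(imbalanced)   # the rare class gets a much larger weight
```

This assumes the labels are the integers 0..n_classes-1 and that every class appears at least once; otherwise np.bincount would produce zero counts.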
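penalty: the penalty term; str, either 'l1' or 'l2', default 'l2'. It specifies the norm used in the penalty. The newton-cg, sag, and lbfgs solvers support only the L2 norm. The L1 norm corresponds to assuming the model parameters follow a Laplace distribution, and the L2 norm to assuming they follow a Gaussian distribution. The penalty is a constraint on the parameters that keeps the model from overfitting. Whether adding a constraint always helps cannot be answered in general; what can be said is that, in theory, a constrained model should generalize better.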
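One practical difference between the two penalties is sparsity: L1 tends to push some coefficients to exactly zero, while L2 only shrinks them. The sketch below compares the two on synthetic data; the dataset settings and C=0.05 are arbitrary choices for illustration, and the exact counts will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only a few informative features (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# liblinear supports both the L1 and the L2 penalty.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print(n_zero_l1, n_zero_l2)
```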
6.K-nearest neighbors Regression
from sklearn.neighbors import KNeighborsRegressor
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
weights : str or callable
Weight function used in prediction. Possible values:
'uniform' : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
[callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use KDTree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
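The effect of the weights options can be seen on a tiny one-dimensional example. This is a sketch with made-up data; inverse_square is a hypothetical user-defined weight function, included only to show the callable form.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four training points on the line y = x (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])


def inverse_square(distances):
    """Hypothetical custom weighting: 1 / d^2, with a tiny offset to avoid division by zero."""
    return 1.0 / (distances ** 2 + 1e-12)


uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)
custom = KNeighborsRegressor(n_neighbors=2, weights=inverse_square).fit(X, y)

# The query's two nearest neighbors are x=0 (distance 0.25) and x=1 (distance 0.75).
query = np.array([[0.25]])
pred_uniform = float(uniform.predict(query)[0])    # plain average of the two targets
pred_distance = float(distance.predict(query)[0])  # pulled toward the closer neighbor
pred_custom = float(custom.predict(query)[0])      # pulled even closer with 1 / d^2
print(pred_uniform, pred_distance, pred_custom)
```

With uniform weights the prediction is the plain mean of the two neighbors; the distance-based variants move it toward the target of the nearer point.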
7.Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
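Parameters, with differences between DecisionTreeClassifier and DecisionTreeRegressor noted:
criterion (split quality criterion). Classifier: "gini" or "entropy"; the former is Gini impurity, the latter information gain. The default "gini" (that is, the CART algorithm) is usually fine, unless you prefer ID3/C4.5-style feature selection. Regressor: "mse" or "mae"; the former is the mean squared error, the latter the mean absolute error. The default "mse" is recommended; in general it tends to be more accurate than "mae". Use "mae" if you want to compare the effect of the two.
splitter (split point strategy). Both: "best" or "random". The former finds the best split among all candidate split points of the features; the latter finds a locally best split among a random subset of split points.
The default "best" suits smaller sample sizes; for very large datasets, "random" is recommended when building the tree.
max_features (maximum number of features considered per split). Both: many kinds of values are accepted. The default None means all features are considered at each split; "log2" means at most log2(N) features; "sqrt" or "auto" means at most sqrt(N) features; an integer is the absolute number of features; a float is a fraction, so int(fraction * N) features are considered. N is the total number of features.
In general, if there are not many features, say fewer than 50, the default None is fine. With very many features, the other values can be used to cap the number of features considered per split and so control tree-building time.
max_depth (maximum tree depth). Both: optional; if not given, the tree places no limit on subtree depth while growing. With little data or few features this can be left alone. With many samples and many features, limiting the depth is recommended; the right value depends on the data distribution. Values between 10 and 100 are common.
min_samples_split (minimum samples required to split an internal node). Both: limits when a subtree may keep splitting. If a node has fewer samples than min_samples_split, no further split is attempted. The default is 2. With a small dataset this can be left alone; with a very large dataset, increasing it is recommended. In one of my earlier projects with about 100,000 samples, I chose min_samples_split=10, which can serve as a reference.
min_samples_leaf (minimum samples in a leaf). Both: if a leaf has fewer samples than this value, it is pruned together with its sibling. The default is 1; you can pass an integer minimum count or a fraction of the total sample count. With a small dataset this can be left alone; with a very large dataset, increasing it is recommended. The 100,000-sample project mentioned above used min_samples_leaf=5, for reference only.
min_weight_fraction_leaf (minimum weighted fraction in a leaf). Both: limits the minimum sum of sample weights in a leaf; if a leaf falls below it, the leaf is pruned together with its sibling. The default 0 means weights are not considered. In general, if many samples have missing values, or the class distribution in a classification tree is heavily skewed, sample weights are introduced, and then this value deserves attention.
max_leaf_nodes (maximum number of leaf nodes). Both: limiting the number of leaves helps prevent overfitting. The default None means no limit. With a limit, the algorithm builds the best tree within that number of leaves. With few features this can be ignored; with many features a limit can help, and a concrete value can be found by cross-validation.
class_weight (class weights). Classifier: specifies the weight of each class, mainly to keep the tree from leaning too far toward classes that dominate the training set. You can specify the weights yourself or use "balanced", in which case the algorithm computes them and classes with fewer samples get higher weights. If the class distribution shows no clear skew, leave the default None. Regressor: not applicable to regression trees.
min_impurity_split (minimum impurity for splitting a node). Both: limits tree growth. If a node's impurity (Gini impurity, information gain, mean squared error, or absolute error) is below this threshold, the node produces no children and becomes a leaf.
presort (whether to presort the data). Both: boolean, default False (no presorting). With few samples or a very shallow tree, True can speed up split finding and tree building. With very many samples it brings no benefit. But with few samples training is already fast, so this value can usually be ignored.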
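To see how the size-limiting parameters above shape a tree, the sketch below fits an unrestricted and a restricted DecisionTreeRegressor on the same synthetic data. The data and the specific limits (max_depth=4, min_samples_split=10, min_samples_leaf=5) are illustrative assumptions; criterion and presort are left at their defaults because their accepted values differ across scikit-learn versions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X.ravel()) + 0.1 * rng.randn(500)

# No limits: the tree keeps splitting until leaves are pure.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X, y)

# Limits on depth and on the number of samples per split and per leaf.
restricted = DecisionTreeRegressor(max_depth=4, min_samples_split=10,
                                   min_samples_leaf=5, random_state=0).fit(X, y)

# Smallest number of training samples found in any leaf of the restricted tree.
is_leaf = restricted.tree_.children_left == -1
min_leaf_size = int(restricted.tree_.n_node_samples[is_leaf].min())

print(unrestricted.get_depth(), unrestricted.get_n_leaves())
print(restricted.get_depth(), restricted.get_n_leaves(), min_leaf_size)
```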
8.Support Vector Machine Regression
from sklearn.svm import SVR
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
9.XGBoost Regression
from xgboost.sklearn import XGBRegressor
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, importance_type='gain',
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=1)
10.RandomForest Regression
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
11.AdaBoostRegressor
from sklearn.ensemble import AdaBoostRegressor
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
n_estimators=50, random_state=None)
12.GradientBoosting Decision Tree Regression(GBDT)
from sklearn.ensemble import GradientBoostingRegressor
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, presort='auto', random_state=None,
subsample=1.0, verbose=0, warm_start=False)
13.XGBRegressor
import xgboost as xgb
xgb.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, importance_type='gain',
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=1)
14.LightGBM
import lightgbm as lgb
lgb.LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
15.CatBoost
import catboost as cb
cb.CatBoostRegressor(iterations=None,
learning_rate=None,
depth=None,
l2_leaf_reg=None,
model_size_reg=None,
rsm=None,
loss_function='RMSE',
border_count=None,
feature_border_type=None,
input_borders=None,
output_borders=None,
fold_permutation_block=None,
od_pval=None,
od_wait=None,
od_type=None,
nan_mode=None,
counter_calc_method=None,
leaf_estimation_iterations=None,
leaf_estimation_method=None,
thread_count=None,
random_seed=None,
use_best_model=None,
best_model_min_trees=None,
verbose=None,
silent=None,
logging_level=None,
metric_period=None,
ctr_leaf_count_limit=None,
store_all_simple_ctr=None,
max_ctr_complexity=None,
has_time=None,
allow_const_label=None,
one_hot_max_size=None,
random_strength=None,
name=None,
ignored_features=None,
train_dir=None,
custom_metric=None,
eval_metric=None,
bagging_temperature=None,
save_snapshot=None,
snapshot_file=None,
snapshot_interval=None,
fold_len_multiplier=None,
used_ram_limit=None,
gpu_ram_part=None,
pinned_memory_size=None,
allow_writing_files=None,
final_ctr_computation_mode=None,
approx_on_full_history=None,
boosting_type=None,
simple_ctr=None,
combinations_ctr=None,
per_feature_ctr=None,
ctr_target_border_count=None,
task_type=None,
device_config=None,
devices=None,
bootstrap_type=None,
subsample=None,
sampling_unit=None,
dev_score_calc_obj_block_size=None,
max_depth=None,
n_estimators=None,
num_boost_round=None,
num_trees=None,
colsample_bylevel=None,
random_state=None,
reg_lambda=None,
objective=None,
eta=None,
max_bin=None,
gpu_cat_features_storage=None,
data_partition=None,
metadata=None,
early_stopping_rounds=None,
cat_features=None,
grow_policy=None,
min_data_in_leaf=None,
max_leaves=None,
score_function=None,
leaf_estimation_backtracking=None)
16.Stacking Regression
from mlxtend.regressor import StackingRegressor
from mlxtend.regressor import StackingCVRegressor
StackingRegressor(regressors,meta_regressor,verbose=0,use_features_in_secondary=False,store_train_meta_features=False)
17.KerasRegressor
from keras.wrappers.scikit_learn import KerasRegressor
KerasRegressor(build_fn=None, **sk_params)
**
Classification Algorithms
**
Single estimator: LR / KNN / NB / DT / SVC / SGD
Ensemble methods: RF / ExtraTreesClassifier / AdaBoost / GBDT / XGBoost / LightGBM / CatBoost / Stacking
Deep Learning Classification Algorithms: NN / RNN / CNN
1.LogisticRegression
from sklearn.linear_model import LogisticRegression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
2.SVM
from sklearn.svm import SVC
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
3.KNN
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
4.DT
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
5.Stochastic Gradient Descent Classification
from sklearn.linear_model import SGDClassifier
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
shuffle=True, tol=None, verbose=0, warm_start=False)
penalty: the penalty type, str; default 'l2'. Other options are 'none', 'l1', and 'elasticnet' (the last corresponds to ElasticNet()).
6.Ensemble Learning Algorithm:BaggingClassifier
The example below bags KNN base estimators, but the most common base estimator is the decision tree; bagging decision trees leads to the Random Forest algorithm and the Extra-Trees method.
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)
RandomForest Classification(based on bagging Method)
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)  # bootstrap sampling
ExtraTreesClassifier (further reduces variance)
Extra-Trees uses the whole dataset for each tree (bootstrap=False), which is one way it differs from RF.
from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)  # full sample, no bootstrap
7.Ensemble Learning Algorithm:BoostingClassifier
AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,  # base estimator: DT or others
learning_rate=1.0, n_estimators=50, random_state=None)
base_estimator: the base classifier, a decision tree by default, on which boosting is performed. In theory it can be any classifier, but a classifier other than the default must support sample weights.
algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
If 'SAMME.R' then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier  # base estimator: DT
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False)
8.VotingClassifier(can also combine different base estimators)
from sklearn.ensemble import VotingClassifier
VotingClassifier(estimators, voting='hard', weights=None, n_jobs=1, flatten_transform=None)  # voting: 'hard' or 'soft'
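As a rough illustration of combining different base estimators, the sketch below builds a hard-voting and a soft-voting ensemble from three classifiers used elsewhere in these notes. The synthetic dataset, the estimator names ("lr", "knn", "dt"), and their settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# (name, estimator) pairs; VotingClassifier fits clones of these estimators.
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

# Hard voting: majority vote over the predicted class labels.
hard_vote = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)

# Soft voting: average the predicted class probabilities, then take the argmax.
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

hard_pred = hard_vote.predict(X[:5])
soft_proba = soft_vote.predict_proba(X[:5])
print(hard_pred)
print(soft_proba)
```

Soft voting requires every base estimator to implement predict_proba, which is why it is shown here with classifiers that all do.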
9.StackingCVClassifier(combine different base estimators, layer0+layer1)
from mlxtend.classifier import StackingCVClassifier,StackingClassifier
StackingCVClassifier(classifiers,meta_classifier,use_probas=False,cv=2,
use_features_in_secondary=False,stratify=True,shuffle=True,verbose=0,
store_train_meta_features=False,use_clones=True)
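# If use_probas is True, the meta-classifier is trained on the predicted probabilities instead of the class labels. cv=2 means StackingCVClassifier uses 2-fold cross-validation by default.
use_features_in_secondary: bool (default: False)
If True, the meta-classifier is trained on both the predictions of the original classifiers and the original dataset. If False, the meta-classifier is trained only on the predictions of the original classifiers.
StackingClassifier (no cross-validation)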
StackingClassifier(classifiers,meta_classifier,use_probas = False,average_probas = False,verbose = 0)
EnsembleVoteClassifier
from mlxtend.classifier import EnsembleVoteClassifier
EnsembleVoteClassifier(clfs, voting='hard', weights=None, verbose=0, refit=True)  # voting: 'hard' or 'soft'
10.XGBoost
import xgboost as xgb
xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
11.LightGBM
import lightgbm as lgb
lgb.LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
lgb.LGBMRanker(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
12.CatBoost
import catboost as cb
cb.CatBoostClassifier(iterations=None,
learning_rate=None,
depth=None,
l2_leaf_reg=None,  # L2 regularization coefficient
model_size_reg=None,
rsm=None,
loss_function='Logloss',
border_count=None,
feature_border_type=None,
input_borders=None,
output_borders=None,
fold_permutation_block=None,
od_pval=None,
od_wait=None,
od_type=None,
nan_mode=None,
counter_calc_method=None,
leaf_estimation_iterations=None,
leaf_estimation_method=None,
thread_count=None,
random_seed=None,
use_best_model=None,
verbose=None,
logging_level=None,
metric_period=None,
ctr_leaf_count_limit=None,
store_all_simple_ctr=None,
max_ctr_complexity=None,
has_time=None,
allow_const_label=None,
classes_count=None,
class_weights=None,
one_hot_max_size=None,  # apply one-hot encoding to some features
random_strength=None,
name=None,
ignored_features=None,
train_dir=None,
custom_loss=None,
custom_metric=None,
eval_metric=None,
bagging_temperature=None,
save_snapshot=None,
snapshot_file=None,
snapshot_interval=None,
fold_len_multiplier=None,
used_ram_limit=None,
gpu_ram_part=None,
allow_writing_files=None,
final_ctr_computation_mode=None,
approx_on_full_history=None,
boosting_type=None,
simple_ctr=None,
combinations_ctr=None,
per_feature_ctr=None,
task_type=None,
device_config=None,
devices=None,
bootstrap_type=None,
subsample=None,
sampling_unit=None,
dev_score_calc_obj_block_size=None,
max_depth=None,
n_estimators=None,
num_boost_round=None,
num_trees=None,
colsample_bylevel=None,
random_state=None,
reg_lambda=None,
objective=None,
eta=None,
max_bin=None,
scale_pos_weight=None,
gpu_cat_features_storage=None,
data_partition=None,
metadata=None,
early_stopping_rounds=None,
cat_features=None,
grow_policy=None,
min_data_in_leaf=None,
max_leaves=None,
score_function=None,
leaf_estimation_backtracking=None)
13.KerasClassifier
from keras.wrappers.scikit_learn import KerasClassifier
KerasClassifier(build_fn=None, **sk_params)
**
Clustering Algorithms
**
KMeans / DBSCAN / SpectralClustering / AgglomerativeClustering
1.KMeans
from sklearn.cluster import KMeans
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
2.DBSCAN
from sklearn.cluster import DBSCAN
DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
metric_params=None, min_samples=5, n_jobs=1, p=None)
3.SpectralClustering
from sklearn.cluster import SpectralClustering
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
n_clusters=8, n_init=10, n_jobs=1, n_neighbors=10,
random_state=None)
4.AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='ward', memory=None, n_clusters=2,
pooling_func=np.mean)
**
Personal Summary: Hyperparameter Tuning
**
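The point of GridSearchCV is automated tuning: give it the parameter grid and it returns the best result and the best parameters. But it only suits small datasets; once the data volume grows, it is hard to get a result in reasonable time. That is when you need to think. For larger datasets a quick tuning approach is coordinate descent. It is essentially a greedy algorithm: tune the parameter that currently has the greatest effect on the model until it is optimal, then tune the next most influential parameter, and so on until all parameters have been tuned. The drawback is that it may end at a local optimum rather than the global one, but it saves time and effort. Given that large advantage it is worth trying, and the result can be refined further with bagging afterwards.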
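The greedy one-parameter-at-a-time idea can be sketched in plain Python. Everything here is hypothetical and for illustration: coordinate_search is not a library function, and toy_score stands in for a real evaluation such as a cross-validated score. Parameters are tuned in the order they appear in grid, so the most influential one should be listed first.

```python
def coordinate_search(score_fn, grid, start, max_rounds=3):
    """Greedy search that varies one parameter at a time (higher score is better).

    It can stop at a local optimum, which is the trade-off for needing far
    fewer evaluations than a full grid search.
    """
    best = dict(start)
    best_score = score_fn(best)
    for _ in range(max_rounds):
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                trial = dict(best, **{name: value})  # change only one parameter
                score = score_fn(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # a full pass changed nothing, so stop early
    return best, best_score


def toy_score(params):
    """Stand-in for a model evaluation; peaks at max_depth=6, learning_rate=0.1."""
    return -((params["max_depth"] - 6) ** 2) - 10 * (params["learning_rate"] - 0.1) ** 2


grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
start = {"max_depth": 2, "learning_rate": 0.3}
best_params, best_score = coordinate_search(toy_score, grid, start)
print(best_params, best_score)
```

In practice score_fn would train and evaluate a model for the given parameters. Each pass evaluates the sum of the candidate counts per parameter instead of their product, which is where the time saving over an exhaustive grid comes from.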
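By plotting learning curves or running a grid search we can explore the limits of tuning (possibly at the cost of a single training run taking three days and nights), but in practice experts probably rely mostly on experience, which comes from: 1) a sound tuning approach and method, 2) an understanding of model evaluation metrics, 3) a feel for and experience with the data, 4) relentless trial and error.
We may not be able to acquire the experience experts have built up over the years, but we can learn their understanding of evaluation metrics and their approach to tuning.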