Machine Learning Camp -- Let's Mine Happiness Together

Preface

These study notes are part of the Alibaba Cloud Tianchi Dragon Ball Program Machine Learning Camp. Course link:
https://tianchi.aliyun.com/specials/promotion/aicampml

Competition link: Let's Mine Happiness Together!

Friendly reminder: I have only just started learning machine learning, so some parts may be rough. These notes draw on write-ups shared on various forums and only exist thanks to the generous sharing of others. I hope they help you!


1. Understanding the Problem

1.1 Experiment Environment

I use the Alibaba Cloud Tianchi experiment environment; first, import the dataset into DSW.


1.2 Background

The competition data comes from the official Chinese General Social Survey (CGSS) 2015 residents' questionnaire. It contains 139 feature dimensions, covering individual variables (gender, age, region, occupation, health, marriage, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services).

The task is to use the 139 features and the 8000-odd training samples to predict individual happiness. The target takes the values 1, 2, 3, 4, 5, where 1 is the lowest happiness and 5 the highest.

1.3 Data Files

happiness_index.xlsx: 142 rows, 5 columns. The field dictionary for the competition: the questionnaire item behind each variable and the meaning of each value. Only the 139 feature dimensions are actually used.
happiness_submit.csv: 2969 rows, 2 columns. Sample submission file for the platform.
happiness_survey_cgss2015.pdf: the Chinese General Social Survey (CGSS) 2015 questionnaire (residents' version).
happiness_test_abbr.csv: 2969 rows, 41 columns. Abridged test data.
happiness_test_complete.csv: 2969 rows, 139 columns. Complete test data.
happiness_train_abbr.csv: 8001 rows, 42 columns. Abridged training data.
happiness_train_complete.csv: 8001 rows, 140 columns. Complete training data.

I use the complete version of the data here.

1.4 Evaluation Metric

The final evaluation metric is mean squared error (MSE):

MSE = (1/n) * Σ (y_i - ŷ_i)²

where y_i is the true happiness score, ŷ_i the prediction, and n the number of samples. The smaller the submitted MSE, the better the result.
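For concreteness, here is the metric on a handful of made-up predictions (the values are purely illustrative):

```python
import numpy as np

# hypothetical labels and predictions, for illustration only
y_true = np.array([3, 4, 2, 5, 4])
y_pred = np.array([3.2, 3.8, 2.5, 4.6, 4.1])

# MSE = (1/n) * sum((y_i - y_hat_i)^2); lower is better
mse = np.mean((y_true - y_pred) ** 2)
print(round(mse, 3))  # → 0.1
```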

2. Exploratory Data Analysis (EDA) & Feature Engineering

2.1 Why do exploratory data analysis

  1. Understand the data
    – Data type and size (what hardware is needed; how costly is it to compete)
    – Is the data clean (obviously wrong values, e.g. a height of 5 m)
    – What type is the label; does it need format conversion? (DataFrame.info())
  2. Prepare for modeling
    – How to build an offline validation set; is there a risk of leakage? (inspect the data distributions)
    – Are there unusual patterns that suggest features, e.g. periodic behavior in time series

2.2 What to look at in EDA

1. Dataset size and field types: how big the data is and what type each field has
2. Missing values: how severe the missingness is, and whether missingness carries special meaning
3. Redundancy between features: e.g. storing height in both cm and m is redundant
4. Time information: a potential source of leakage
5. Label distribution: class imbalance, etc.
6. Train/test distributions: fields or feature values present in the test set but absent from training
7. Univariate/multivariate distributions: get familiar with how features are distributed and how they relate to the label
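The checklist above maps to a few one-liners in pandas; a minimal sketch on a toy frame (column names borrowed from the dataset, values made up):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the survey data (values are made up)
df = pd.DataFrame({
    "happiness": [4, 5, -8, 3, 4],
    "income": [20000, 50000, 12000, np.nan, 8000],
})

print(df.shape)                        # dataset size
print(df.dtypes)                       # field types
print(df.isnull().sum())               # missing counts per column
print(df["happiness"].value_counts())  # label distribution (note the -8 sentinel)
```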

2.3 Data Preprocessing

Import the packages:

import os
import time 
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.ensemble import ExtraTreesRegressor as etr
from sklearn.linear_model import BayesianRidge as br
from sklearn.ensemble import GradientBoostingRegressor as gbr
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression as lr
from sklearn.linear_model import ElasticNet as en
from sklearn.kernel_ridge import KernelRidge as kr
from sklearn.model_selection import  KFold, StratifiedKFold,GroupKFold, RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
import logging
import warnings

warnings.filterwarnings('ignore') # suppress warnings
# load the data
train = pd.read_csv("happiness_train_complete.csv", parse_dates=['survey_time'],encoding='latin-1')
test = pd.read_csv("happiness_test_complete.csv", parse_dates=['survey_time'],encoding='latin-1') # latin-1 is a superset of ASCII
# check the data size
train.shape


test.shape


#quick look at the data
train.head()


#check for missing values
train.info(verbose=True,show_counts=True) # on pandas < 1.2 use null_counts=True

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 140 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    8000 non-null   int64         
 1   happiness             8000 non-null   int64         
 2   survey_type           8000 non-null   int64         
 3   province              8000 non-null   int64         
 4   city                  8000 non-null   int64         
 5   county                8000 non-null   int64         
 6   survey_time           8000 non-null   datetime64[ns]
 7   gender                8000 non-null   int64         
 8   birth                 8000 non-null   int64         
 9   nationality           8000 non-null   int64         
 10  religion              8000 non-null   int64         
 11  religion_freq         8000 non-null   int64         
 12  edu                   8000 non-null   int64         
 13  edu_other             3 non-null      object        
 14  edu_status            6880 non-null   float64       
 15  edu_yr                6028 non-null   float64       
 16  income                8000 non-null   int64         
 17  political             8000 non-null   int64         
 18  join_party            824 non-null    float64       
 19  floor_area            8000 non-null   float64       
 20  property_0            8000 non-null   int64         
 21  property_1            8000 non-null   int64         
 22  property_2            8000 non-null   int64         
 23  property_3            8000 non-null   int64         
 24  property_4            8000 non-null   int64         
 25  property_5            8000 non-null   int64         
 26  property_6            8000 non-null   int64         
 27  property_7            8000 non-null   int64         
 28  property_8            8000 non-null   int64         
 29  property_other        66 non-null     object        
 30  height_cm             8000 non-null   int64         
 31  weight_jin            8000 non-null   int64         
 32  health                8000 non-null   int64         
 33  health_problem        8000 non-null   int64         
 34  depression            8000 non-null   int64         
 35  hukou                 8000 non-null   int64         
 36  hukou_loc             7996 non-null   float64       
 37  media_1               8000 non-null   int64         
 38  media_2               8000 non-null   int64         
 39  media_3               8000 non-null   int64         
 40  media_4               8000 non-null   int64         
 41  media_5               8000 non-null   int64         
 42  media_6               8000 non-null   int64         
 43  leisure_1             8000 non-null   int64         
 44  leisure_2             8000 non-null   int64         
 45  leisure_3             8000 non-null   int64         
 46  leisure_4             8000 non-null   int64         
 47  leisure_5             8000 non-null   int64         
 48  leisure_6             8000 non-null   int64         
 49  leisure_7             8000 non-null   int64         
 50  leisure_8             8000 non-null   int64         
 51  leisure_9             8000 non-null   int64         
 52  leisure_10            8000 non-null   int64         
 53  leisure_11            8000 non-null   int64         
 54  leisure_12            8000 non-null   int64         
 55  socialize             8000 non-null   int64         
 56  relax                 8000 non-null   int64         
 57  learn                 8000 non-null   int64         
 58  social_neighbor       7204 non-null   float64       
 59  social_friend         7204 non-null   float64       
 60  socia_outing          8000 non-null   int64         
 61  equity                8000 non-null   int64         
 62  class                 8000 non-null   int64         
 63  class_10_before       8000 non-null   int64         
 64  class_10_after        8000 non-null   int64         
 65  class_14              8000 non-null   int64         
 66  work_exper            8000 non-null   int64         
 67  work_status           2951 non-null   float64       
 68  work_yr               2951 non-null   float64       
 69  work_type             2951 non-null   float64       
 70  work_manage           2951 non-null   float64       
 71  insur_1               8000 non-null   int64         
 72  insur_2               8000 non-null   int64         
 73  insur_3               8000 non-null   int64         
 74  insur_4               8000 non-null   int64         
 75  family_income         7999 non-null   float64       
 76  family_m              8000 non-null   int64         
 77  family_status         8000 non-null   int64         
 78  house                 8000 non-null   int64         
 79  car                   8000 non-null   int64         
 80  invest_0              8000 non-null   int64         
 81  invest_1              8000 non-null   int64         
 82  invest_2              8000 non-null   int64         
 83  invest_3              8000 non-null   int64         
 84  invest_4              8000 non-null   int64         
 85  invest_5              8000 non-null   int64         
 86  invest_6              8000 non-null   int64         
 87  invest_7              8000 non-null   int64         
 88  invest_8              8000 non-null   int64         
 89  invest_other          29 non-null     object        
 90  son                   8000 non-null   int64         
 91  daughter              8000 non-null   int64         
 92  minor_child           6934 non-null   float64       
 93  marital               8000 non-null   int64         
 94  marital_1st           7172 non-null   float64       
 95  s_birth               6282 non-null   float64       
 96  marital_now           6230 non-null   float64       
 97  s_edu                 6282 non-null   float64       
 98  s_political           6282 non-null   float64       
 99  s_hukou               6282 non-null   float64       
 100 s_income              6282 non-null   float64       
 101 s_work_exper          6282 non-null   float64       
 102 s_work_status         2565 non-null   float64       
 103 s_work_type           2565 non-null   float64       
 104 f_birth               8000 non-null   int64         
 105 f_edu                 8000 non-null   int64         
 106 f_political           8000 non-null   int64         
 107 f_work_14             8000 non-null   int64         
 108 m_birth               8000 non-null   int64         
 109 m_edu                 8000 non-null   int64         
 110 m_political           8000 non-null   int64         
 111 m_work_14             8000 non-null   int64         
 112 status_peer           8000 non-null   int64         
 113 status_3_before       8000 non-null   int64         
 114 view                  8000 non-null   int64         
 115 inc_ability           8000 non-null   int64         
 116 inc_exp               8000 non-null   float64       
 117 trust_1               8000 non-null   int64         
 118 trust_2               8000 non-null   int64         
 119 trust_3               8000 non-null   int64         
 120 trust_4               8000 non-null   int64         
 121 trust_5               8000 non-null   int64         
 122 trust_6               8000 non-null   int64         
 123 trust_7               8000 non-null   int64         
 124 trust_8               8000 non-null   int64         
 125 trust_9               8000 non-null   int64         
 126 trust_10              8000 non-null   int64         
 127 trust_11              8000 non-null   int64         
 128 trust_12              8000 non-null   int64         
 129 trust_13              8000 non-null   int64         
 130 neighbor_familiarity  8000 non-null   int64         
 131 public_service_1      8000 non-null   int64         
 132 public_service_2      8000 non-null   int64         
 133 public_service_3      8000 non-null   int64         
 134 public_service_4      8000 non-null   int64         
 135 public_service_5      8000 non-null   float64       
 136 public_service_6      8000 non-null   int64         
 137 public_service_7      8000 non-null   int64         
 138 public_service_8      8000 non-null   int64         
 139 public_service_9      8000 non-null   int64         
dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
memory usage: 8.5+ MB

#inspect the label distribution
y_train_=train["happiness"]
y_train_.value_counts()


#map the -8 entries (invalid answers) to 3
y_train_=y_train_.map(lambda x:3 if x==-8 else x)

#re-check the label distribution
y_train_.value_counts()


#shift the label to start from 0
y_train_=y_train_.map(lambda x:x-1)
#concatenate train and test so feature engineering runs on both
data = pd.concat([train,test],axis=0,ignore_index=True)
#size of the combined data
data.shape

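Concatenating train and test means every transform below only has to be written once; the combined frame is split back by row position at the end. A toy sketch of the concat-then-split-back pattern (the stand-in frames here are invented for illustration):

```python
import pandas as pd

# toy stand-ins for the real train/test frames
train = pd.DataFrame({"f": [1, 2, 3]})
test = pd.DataFrame({"f": [4, 5]})

data = pd.concat([train, test], axis=0, ignore_index=True)
# ...shared feature engineering would happen here...
X_train_ = data[:train.shape[0]]  # first len(train) rows go back to train
X_test_ = data[train.shape[0]:]   # the rest go back to test
print(len(X_train_), len(X_test_))  # → 3 2
```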

#extract time features
data['survey_time'] = pd.to_datetime(data['survey_time'],format='%Y-%m-%d %H:%M:%S')
data["weekday"]=data["survey_time"].dt.weekday
data["year"]=data["survey_time"].dt.year
data["quarter"]=data["survey_time"].dt.quarter
data["hour"]=data["survey_time"].dt.hour
data["month"]=data["survey_time"].dt.month

#bucket the hour of day
def hour_cut(x):
    if 0<=x<6:
        return 0
    elif  6<=x<8:
        return 1
    elif  8<=x<12:
        return 2
    elif  12<=x<14:
        return 3
    elif  14<=x<18:
        return 4
    elif  18<=x<21:
        return 5
    elif  21<=x<24:
        return 6

    
data["hour_cut"]=data["hour"].map(hour_cut)
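As an alternative to the explicit if-chain, the same bucketing can be expressed with pd.cut; the bin edges below mirror hour_cut, and labels=False returns the bin index:

```python
import pandas as pd

hours = pd.Series([3, 7, 9, 13, 15, 19, 22])
# right=False makes bins left-closed: [0,6), [6,8), [8,12), [12,14), [14,18), [18,21), [21,24)
cut = pd.cut(hours, bins=[0, 6, 8, 12, 14, 18, 21, 24],
             labels=False, right=False)
print(cut.tolist())  # → [0, 1, 2, 3, 4, 5, 6]
```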

#age at the time of the survey
data["survey_age"]=data["year"]-data["birth"]

#shift the label to start from 0 as well (this column is dropped just below; y_train_ already holds the shifted labels)
data["happiness"]=data["happiness"].map(lambda x:x-1)

#drop edu_other (almost entirely missing), the label column, and the raw timestamp
data=data.drop(["edu_other"], axis=1)
data=data.drop(["happiness"], axis=1)
data=data.drop(["survey_time"], axis=1)

#party membership flag: join_party is non-null only for members, so binarize it
data["join_party"]=data["join_party"].map(lambda x:0 if pd.isnull(x)  else 1)

#birth decade
def birth_split(x):
    if 1920<=x<=1930:
        return 0
    elif  1930<x<=1940:
        return 1
    elif  1940<x<=1950:
        return 2
    elif  1950<x<=1960:
        return 3
    elif  1960<x<=1970:
        return 4
    elif  1970<x<=1980:
        return 5
    elif  1980<x<=1990:
        return 6
    elif  1990<x<=2000:
        return 7
    
data["birth_s"]=data["birth"].map(birth_split)

#income buckets (boundaries written as a plain elif chain so that values such as
#exactly 1200 or 24000 no longer fall through and map to NaN)
def income_cut(x):
    if x < 0:
        return 0
    elif x < 1200:
        return 1
    elif x <= 10000:
        return 2
    elif x < 24000:
        return 3
    elif x < 40000:
        return 4
    else:
        return 5

data["income_cut"]=data["income"].map(income_cut)

#fill missing values with sentinels
data["edu_status"]=data["edu_status"].fillna(5)
data["edu_yr"]=data["edu_yr"].fillna(-2)
data["property_other"]=data["property_other"].map(lambda x:0 if pd.isnull(x)  else 1)
data["hukou_loc"]=data["hukou_loc"].fillna(1)
data["social_neighbor"]=data["social_neighbor"].fillna(8)
data["social_friend"]=data["social_friend"].fillna(8)
data["work_status"]=data["work_status"].fillna(0)
data["work_yr"]=data["work_yr"].fillna(0)
data["work_type"]=data["work_type"].fillna(0)
data["work_manage"]=data["work_manage"].fillna(0)
data["family_income"]=data["family_income"].fillna(-2)
data["invest_other"]=data["invest_other"].map(lambda x:0 if pd.isnull(x)  else 1)

#fill missing values (marriage/spouse fields)
data["minor_child"]=data["minor_child"].fillna(0)
data["marital_1st"]=data["marital_1st"].fillna(0)
data["s_birth"]=data["s_birth"].fillna(0)
data["marital_now"]=data["marital_now"].fillna(0)
data["s_edu"]=data["s_edu"].fillna(0)
data["s_political"]=data["s_political"].fillna(0)
data["s_hukou"]=data["s_hukou"].fillna(0)
data["s_income"]=data["s_income"].fillna(0)
data["s_work_exper"]=data["s_work_exper"].fillna(0)
data["s_work_status"]=data["s_work_status"].fillna(0)
data["s_work_type"]=data["s_work_type"].fillna(0)
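The per-column fillna calls above can also be collapsed into a single call by passing a column-to-sentinel dict; a sketch on a toy frame reusing a few of the dataset's column names:

```python
import numpy as np
import pandas as pd

# toy frame with the same kinds of gaps (values are made up)
df = pd.DataFrame({
    "work_status": [1.0, np.nan],
    "s_income": [np.nan, 30000.0],
    "edu_yr": [np.nan, 1998.0],
})

# one fillna call, one sentinel per column
df = df.fillna({"work_status": 0, "s_income": 0, "edu_yr": -2})
print(df)
```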

data=data.drop(["id"], axis=1)

X_train_ = data[:train.shape[0]]
X_test_  = data[train.shape[0]:]
X_train_.shape


X_test_.shape


target_column = 'happiness'
feature_columns=list(X_test_.columns) 
feature_columns

output:

['survey_type',
 'province',
 'city',
 'county',
 'gender',
 'birth',
 'nationality',
 'religion',
 'religion_freq',
 'edu',
 'edu_status',
 'edu_yr',
 'income',
 'political',
 'join_party',
 'floor_area',
 'property_0',
 'property_1',
 'property_2',
 'property_3',
 'property_4',
 'property_5',
 'property_6',
 'property_7',
 'property_8',
 'property_other',
 'height_cm',
 'weight_jin',
 'health',
 'health_problem',
 'depression',
 'hukou',
 'hukou_loc',
 'media_1',
 'media_2',
 'media_3',
 'media_4',
 'media_5',
 'media_6',
 'leisure_1',
 'leisure_2',
 'leisure_3',
 'leisure_4',
 'leisure_5',
 'leisure_6',
 'leisure_7',
 'leisure_8',
 'leisure_9',
 'leisure_10',
 'leisure_11',
 'leisure_12',
 'socialize',
 'relax',
 'learn',
 'social_neighbor',
 'social_friend',
 'socia_outing',
 'equity',
 'class',
 'class_10_before',
 'class_10_after',
 'class_14',
 'work_exper',
 'work_status',
 'work_yr',
 'work_type',
 'work_manage',
 'insur_1',
 'insur_2',
 'insur_3',
 'insur_4',
 'family_income',
 'family_m',
 'family_status',
 'house',
 'car',
 'invest_0',
 'invest_1',
 'invest_2',
 'invest_3',
 'invest_4',
 'invest_5',
 'invest_6',
 'invest_7',
 'invest_8',
 'invest_other',
 'son',
 'daughter',
 'minor_child',
 'marital',
 'marital_1st',
 's_birth',
 'marital_now',
 's_edu',
 's_political',
 's_hukou',
 's_income',
 's_work_exper',
 's_work_status',
 's_work_type',
 'f_birth',
 'f_edu',
 'f_political',
 'f_work_14',
 'm_birth',
 'm_edu',
 'm_political',
 'm_work_14',
 'status_peer',
 'status_3_before',
 'view',
 'inc_ability',
 'inc_exp',
 'trust_1',
 'trust_2',
 'trust_3',
 'trust_4',
 'trust_5',
 'trust_6',
 'trust_7',
 'trust_8',
 'trust_9',
 'trust_10',
 'trust_11',
 'trust_12',
 'trust_13',
 'neighbor_familiarity',
 'public_service_1',
 'public_service_2',
 'public_service_3',
 'public_service_4',
 'public_service_5',
 'public_service_6',
 'public_service_7',
 'public_service_8',
 'public_service_9',
 'weekday',
 'year',
 'quarter',
 'hour',
 'month',
 'hour_cut',
 'survey_age',
 'birth_s',
 'income_cut']
X_train = np.array(X_train_)
y_train = np.array(y_train_)
X_test  = np.array(X_test_)
X_train.shape


y_train.shape


X_test.shape


#custom evaluation function (MSE) for xgboost
def myFeval(preds, xgbtrain):
    label = xgbtrain.get_label()
    score = mean_squared_error(label,preds)
    return 'myFeval',score
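myFeval simply wraps mean_squared_error in xgboost's (name, value) custom-metric convention. The sketch below checks that contract with a stand-in object; DummyDMatrix is an invented helper for illustration, not an xgboost class:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def myFeval(preds, xgbtrain):
    label = xgbtrain.get_label()
    score = mean_squared_error(label, preds)
    return 'myFeval', score

class DummyDMatrix:
    """Minimal stand-in exposing get_label() like xgb.DMatrix (illustration only)."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)
    def get_label(self):
        return self._label

name, score = myFeval(np.array([3.0, 4.0]), DummyDMatrix([3.0, 5.0]))
print(name, score)  # → myFeval 0.5
```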

3. Modeling, Parameter Tuning & Model Ensembling

##### xgb

xgb_params = {"booster":'gbtree','eta': 0.005, 'max_depth': 5, 'subsample': 0.7,
              'colsample_bytree': 0.8, 'objective': 'reg:linear',  # renamed 'reg:squarederror' in newer xgboost
              'eval_metric': 'rmse', 'silent': True, 'nthread': 8}  # 'silent' was replaced by 'verbosity' in newer xgboost
folds = KFold(n_splits=5, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(train))
predictions_xgb = np.zeros(len(test))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_+1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])
    
    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=100, params=xgb_params,feval = myFeval)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
    
print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train_)))
fold n°1
[0]	train-rmse:2.49563	valid_data-rmse:2.4813	train-myFeval:6.22818	valid_data-myFeval:6.15686
Multiple eval metrics have been passed: 'valid_data-myFeval' will be used for early stopping.

Will train until valid_data-myFeval hasn't improved in 200 rounds.
[100]	train-rmse:1.6126	valid_data-rmse:1.60259	train-myFeval:2.60047	valid_data-myFeval:2.5683
[200]	train-rmse:1.11478	valid_data-rmse:1.11408	train-myFeval:1.24274	valid_data-myFeval:1.24118
[300]	train-rmse:0.851318	valid_data-rmse:0.865196	train-myFeval:0.724743	valid_data-myFeval:0.748564
[400]	train-rmse:0.720533	valid_data-rmse:0.750967	train-myFeval:0.519168	valid_data-myFeval:0.563951
[500]	train-rmse:0.656659	valid_data-rmse:0.702841	train-myFeval:0.431201	valid_data-myFeval:0.493985
[600]	train-rmse:0.623453	valid_data-rmse:0.683129	train-myFeval:0.388694	valid_data-myFeval:0.466665
[700]	train-rmse:0.603769	valid_data-rmse:0.675099	train-myFeval:0.364538	valid_data-myFeval:0.455759
[800]	train-rmse:0.58938	valid_data-rmse:0.671326	train-myFeval:0.347369	valid_data-myFeval:0.450678
[900]	train-rmse:0.577791	valid_data-rmse:0.669284	train-myFeval:0.333843	valid_data-myFeval:0.447941
[1000]	train-rmse:0.567713	valid_data-rmse:0.668098	train-myFeval:0.322298	valid_data-myFeval:0.446355
[1100]	train-rmse:0.558195	valid_data-rmse:0.667073	train-myFeval:0.311582	valid_data-myFeval:0.444986
[1200]	train-rmse:0.549402	valid_data-rmse:0.666413	train-myFeval:0.301842	valid_data-myFeval:0.444107
[1300]	train-rmse:0.541053	valid_data-rmse:0.665955	train-myFeval:0.292738	valid_data-myFeval:0.443496
[1400]	train-rmse:0.533161	valid_data-rmse:0.665632	train-myFeval:0.28426	valid_data-myFeval:0.443066
[1500]	train-rmse:0.525618	valid_data-rmse:0.665304	train-myFeval:0.276275	valid_data-myFeval:0.442629
[1600]	train-rmse:0.518385	valid_data-rmse:0.665372	train-myFeval:0.268723	valid_data-myFeval:0.44272
[1700]	train-rmse:0.511254	valid_data-rmse:0.665176	train-myFeval:0.26138	valid_data-myFeval:0.44246
[1800]	train-rmse:0.504662	valid_data-rmse:0.664956	train-myFeval:0.254683	valid_data-myFeval:0.442167
[1900]	train-rmse:0.498012	valid_data-rmse:0.664776	train-myFeval:0.248016	valid_data-myFeval:0.441928
[2000]	train-rmse:0.49174	valid_data-rmse:0.664572	train-myFeval:0.241808	valid_data-myFeval:0.441656
[2100]	train-rmse:0.485493	valid_data-rmse:0.664355	train-myFeval:0.235703	valid_data-myFeval:0.441368
[2200]	train-rmse:0.479446	valid_data-rmse:0.664263	train-myFeval:0.229868	valid_data-myFeval:0.441245
[2300]	train-rmse:0.473532	valid_data-rmse:0.664077	train-myFeval:0.224232	valid_data-myFeval:0.440998
[2400]	train-rmse:0.46794	valid_data-rmse:0.663973	train-myFeval:0.218968	valid_data-myFeval:0.44086
[2500]	train-rmse:0.462211	valid_data-rmse:0.663841	train-myFeval:0.213639	valid_data-myFeval:0.440685
[2600]	train-rmse:0.45661	valid_data-rmse:0.663949	train-myFeval:0.208493	valid_data-myFeval:0.440828
Stopping. Best iteration:
[2492]	train-rmse:0.462626	valid_data-rmse:0.663821	train-myFeval:0.214022	valid_data-myFeval:0.440658

fold n°2
[0]	train-rmse:2.49853	valid_data-rmse:2.46955	train-myFeval:6.24265	valid_data-myFeval:6.09866
Multiple eval metrics have been passed: 'valid_data-myFeval' will be used for early stopping.

Will train until valid_data-myFeval hasn't improved in 200 rounds.
[100]	train-rmse:1.61339	valid_data-rmse:1.59864	train-myFeval:2.60302	valid_data-myFeval:2.55564
[200]	train-rmse:1.11383	valid_data-rmse:1.11804	train-myFeval:1.24062	valid_data-myFeval:1.25001
[300]	train-rmse:0.848462	valid_data-rmse:0.875119	train-myFeval:0.719888	valid_data-myFeval:0.765833
[400]	train-rmse:0.716857	valid_data-rmse:0.764725	train-myFeval:0.513884	valid_data-myFeval:0.584804
[500]	train-rmse:0.652761	valid_data-rmse:0.718626	train-myFeval:0.426097	valid_data-myFeval:0.516424
[600]	train-rmse:0.619343	valid_data-rmse:0.699559	train-myFeval:0.383586	valid_data-myFeval:0.489383
[700]	train-rmse:0.59878	valid_data-rmse:0.691164	train-myFeval:0.358538	valid_data-myFeval:0.477707
[800]	train-rmse:0.584406	valid_data-rmse:0.687036	train-myFeval:0.34153	valid_data-myFeval:0.472018
[900]	train-rmse:0.572886	valid_data-rmse:0.684788	train-myFeval:0.328199	valid_data-myFeval:0.468935
[1000]	train-rmse:0.562962	valid_data-rmse:0.683213	train-myFeval:0.316926	valid_data-myFeval:0.46678
[1100]	train-rmse:0.554569	valid_data-rmse:0.682218	train-myFeval:0.307547	valid_data-myFeval:0.465422
[1200]	train-rmse:0.546599	valid_data-rmse:0.681102	train-myFeval:0.298771	valid_data-myFeval:0.4639
[1300]	train-rmse:0.538384	valid_data-rmse:0.680288	train-myFeval:0.289857	valid_data-myFeval:0.462791
[1400]	train-rmse:0.530827	valid_data-rmse:0.679778	train-myFeval:0.281777	valid_data-myFeval:0.462099
[1500]	train-rmse:0.523566	valid_data-rmse:0.679006	train-myFeval:0.274121	valid_data-myFeval:0.46105
[1600]	train-rmse:0.516822	valid_data-rmse:0.678669	train-myFeval:0.267105	valid_data-myFeval:0.460592
[1700]	train-rmse:0.510059	valid_data-rmse:0.678479	train-myFeval:0.26016	valid_data-myFeval:0.460334
[1800]	train-rmse:0.503851	valid_data-rmse:0.678285	train-myFeval:0.253866	valid_data-myFeval:0.46007
[1900]	train-rmse:0.497297	valid_data-rmse:0.678069	train-myFeval:0.247305	valid_data-myFeval:0.459777
[2000]	train-rmse:0.491299	valid_data-rmse:0.677739	train-myFeval:0.241375	valid_data-myFeval:0.45933
[2100]	train-rmse:0.485227	valid_data-rmse:0.677723	train-myFeval:0.235445	valid_data-myFeval:0.459309
[2200]	train-rmse:0.479466	valid_data-rmse:0.677622	train-myFeval:0.229888	valid_data-myFeval:0.459172
[2300]	train-rmse:0.473802	valid_data-rmse:0.677815	train-myFeval:0.224488	valid_data-myFeval:0.459433
fold n°3
[0]	train-rmse:2.48824	valid_data-rmse:2.51066	train-myFeval:6.19132	valid_data-myFeval:6.30342
Multiple eval metrics have been passed: 'valid_data-myFeval' will be used for early stopping.

Will train until valid_data-myFeval hasn't improved in 200 rounds.
[100]	train-rmse:1.60686	valid_data-rmse:1.63402	train-myFeval:2.582	valid_data-myFeval:2.67001
[200]	train-rmse:1.10953	valid_data-rmse:1.14734	train-myFeval:1.23105	valid_data-myFeval:1.31639
[300]	train-rmse:0.845884	valid_data-rmse:0.897291	train-myFeval:0.71552	valid_data-myFeval:0.805131
[400]	train-rmse:0.715194	valid_data-rmse:0.780631	train-myFeval:0.511503	valid_data-myFeval:0.609386
[500]	train-rmse:0.651654	valid_data-rmse:0.729504	train-myFeval:0.424653	valid_data-myFeval:0.532176
[600]	train-rmse:0.618452	valid_data-rmse:0.707078	train-myFeval:0.382482	valid_data-myFeval:0.499959
[700]	train-rmse:0.598778	valid_data-rmse:0.696645	train-myFeval:0.358536	valid_data-myFeval:0.485314
[800]	train-rmse:0.584995	valid_data-rmse:0.691768	train-myFeval:0.34222	valid_data-myFeval:0.478543
[900]	train-rmse:0.573764	valid_data-rmse:0.688744	train-myFeval:0.329205	valid_data-myFeval:0.474368
[1000]	train-rmse:0.564022	valid_data-rmse:0.68689	train-myFeval:0.31812	valid_data-myFeval:0.471817
[1100]	train-rmse:0.554914	valid_data-rmse:0.685561	train-myFeval:0.30793	valid_data-myFeval:0.469994
[1200]	train-rmse:0.546831	valid_data-rmse:0.684609	train-myFeval:0.299024	valid_data-myFeval:0.46869
[1300]	train-rmse:0.538596	valid_data-rmse:0.683757	train-myFeval:0.290086	valid_data-myFeval:0.467524
[1400]	train-rmse:0.531141	valid_data-rmse:0.682961	train-myFeval:0.28211	valid_data-myFeval:0.466436
[1500]	train-rmse:0.523763	valid_data-rmse:0.682162	train-myFeval:0.274328	valid_data-myFeval:0.465345
[1600]	train-rmse:0.517292	valid_data-rmse:0.681895	train-myFeval:0.267591	valid_data-myFeval:0.46498
[1700]	train-rmse:0.510182	valid_data-rmse:0.681542	train-myFeval:0.260286	valid_data-myFeval:0.464499
[1800]	train-rmse:0.503402	valid_data-rmse:0.681202	train-myFeval:0.253413	valid_data-myFeval:0.464036
[1900]	train-rmse:0.496937	valid_data-rmse:0.681047	train-myFeval:0.246946	valid_data-myFeval:0.463825
[2000]	train-rmse:0.490995	valid_data-rmse:0.681031	train-myFeval:0.241076	valid_data-myFeval:0.463803
[2100]	train-rmse:0.484851	valid_data-rmse:0.680772	train-myFeval:0.23508	valid_data-myFeval:0.463451
[2200]	train-rmse:0.47916	valid_data-rmse:0.680598	train-myFeval:0.229595	valid_data-myFeval:0.463214
[2300]	train-rmse:0.473224	valid_data-rmse:0.680338	train-myFeval:0.223941	valid_data-myFeval:0.46286
[2400]	train-rmse:0.46759	valid_data-rmse:0.680437	train-myFeval:0.218641	valid_data-myFeval:0.462995
[2500]	train-rmse:0.461985	valid_data-rmse:0.680176	train-myFeval:0.213431	valid_data-myFeval:0.46264
[2600]	train-rmse:0.456638	valid_data-rmse:0.679895	train-myFeval:0.208518	valid_data-myFeval:0.462257
[2700]	train-rmse:0.451555	valid_data-rmse:0.679877	train-myFeval:0.203902	valid_data-myFeval:0.462233
[2800]	train-rmse:0.446265	valid_data-rmse:0.679654	train-myFeval:0.199153	valid_data-myFeval:0.46193
[2900]	train-rmse:0.440872	valid_data-rmse:0.679562	train-myFeval:0.194368	valid_data-myFeval:0.461804
[3000]	train-rmse:0.435686	valid_data-rmse:0.679548	train-myFeval:0.189822	valid_data-myFeval:0.461786
[3100]	train-rmse:0.430535	valid_data-rmse:0.679437	train-myFeval:0.18536	valid_data-myFeval:0.461635
[3200]	train-rmse:0.425839	valid_data-rmse:0.679546	train-myFeval:0.181339	valid_data-myFeval:0.461783
[3300]	train-rmse:0.421157	valid_data-rmse:0.679572	train-myFeval:0.177374	valid_data-myFeval:0.461818
Stopping. Best iteration:
[3100]	train-rmse:0.430535	valid_data-rmse:0.679437	train-myFeval:0.18536	valid_data-myFeval:0.461635

fold n°4
[0]	train-rmse:2.49336	valid_data-rmse:2.49067	train-myFeval:6.21684	valid_data-myFeval:6.20343
Multiple eval metrics have been passed: 'valid_data-myFeval' will be used for early stopping.

Will train until valid_data-myFeval hasn't improved in 200 rounds.
[100]	train-rmse:1.61098	valid_data-rmse:1.61922	train-myFeval:2.59525	valid_data-myFeval:2.62187
[200]	train-rmse:1.11289	valid_data-rmse:1.13498	train-myFeval:1.23853	valid_data-myFeval:1.28817
[300]	train-rmse:0.849092	valid_data-rmse:0.887377	train-myFeval:0.720957	valid_data-myFeval:0.787438
[400]	train-rmse:0.717979	valid_data-rmse:0.771117	train-myFeval:0.515493	valid_data-myFeval:0.594622
[500]	train-rmse:0.654382	valid_data-rmse:0.720297	train-myFeval:0.428216	valid_data-myFeval:0.518827
[600]	train-rmse:0.621261	valid_data-rmse:0.698244	train-myFeval:0.385966	valid_data-myFeval:0.487545
[700]	train-rmse:0.601387	valid_data-rmse:0.687932	train-myFeval:0.361667	valid_data-myFeval:0.473251
[800]	train-rmse:0.587205	valid_data-rmse:0.68274	train-myFeval:0.344809	valid_data-myFeval:0.466134
[900]	train-rmse:0.576164	valid_data-rmse:0.67993	train-myFeval:0.331965	valid_data-myFeval:0.462305
[1000]	train-rmse:0.565982	valid_data-rmse:0.67764	train-myFeval:0.320336	valid_data-myFeval:0.459196
[1100]	train-rmse:0.556975	valid_data-rmse:0.676599	train-myFeval:0.310221	valid_data-myFeval:0.457786
[1200]	train-rmse:0.548716	valid_data-rmse:0.675994	train-myFeval:0.301089	valid_data-myFeval:0.456967
[1300]	train-rmse:0.540704	valid_data-rmse:0.675275	train-myFeval:0.292361	valid_data-myFeval:0.455996
[1400]	train-rmse:0.533031	valid_data-rmse:0.67509	train-myFeval:0.284122	valid_data-myFeval:0.455747
[1600]	train-rmse:0.518829	valid_data-rmse:0.674387	train-myFeval:0.269184	valid_data-myFeval:0.454797
[1700]	train-rmse:0.512472	valid_data-rmse:0.674234	train-myFeval:0.262628	valid_data-myFeval:0.454592
[1800]	train-rmse:0.505854	valid_data-rmse:0.674212	train-myFeval:0.255889	valid_data-myFeval:0.454562
[1900]	train-rmse:0.499552	valid_data-rmse:0.673864	train-myFeval:0.249552	valid_data-myFeval:0.454093
[2000]	train-rmse:0.493428	valid_data-rmse:0.673896	train-myFeval:0.243471	valid_data-myFeval:0.454136
[2100]	train-rmse:0.487465	valid_data-rmse:0.673945	train-myFeval:0.237622	valid_data-myFeval:0.454202
Stopping. Best iteration:
[1982]	train-rmse:0.49453	valid_data-rmse:0.673799	train-myFeval:0.24456	valid_data-myFeval:0.454005

fold n°5
[0]	train-rmse:2.48807	valid_data-rmse:2.51175	train-myFeval:6.19053	valid_data-myFeval:6.30887
Multiple eval metrics have been passed: 'valid_data-myFeval' will be used for early stopping.

Will train until valid_data-myFeval hasn't improved in 200 rounds.
[200]	train-rmse:1.11019	valid_data-rmse:1.14899	train-myFeval:1.23253	valid_data-myFeval:1.32018
[300]	train-rmse:0.846003	valid_data-rmse:0.897965	train-myFeval:0.71572	valid_data-myFeval:0.806341
[400]	train-rmse:0.714992	valid_data-rmse:0.780389	train-myFeval:0.511214	valid_data-myFeval:0.609008
[500]	train-rmse:0.65098	valid_data-rmse:0.728968	train-myFeval:0.423775	valid_data-myFeval:0.531395
[600]	train-rmse:0.617539	valid_data-rmse:0.706148	train-myFeval:0.381354	valid_data-myFeval:0.498644
[700]	train-rmse:0.597487	valid_data-rmse:0.695606	train-myFeval:0.356991	valid_data-myFeval:0.483867
[800]	train-rmse:0.583142	valid_data-rmse:0.689927	train-myFeval:0.340054	valid_data-myFeval:0.475999
[900]	train-rmse:0.571824	valid_data-rmse:0.687029	train-myFeval:0.326983	valid_data-myFeval:0.472009
[1000]	train-rmse:0.562088	valid_data-rmse:0.685097	train-myFeval:0.315943	valid_data-myFeval:0.469358
[1100]	train-rmse:0.552812	valid_data-rmse:0.683917	train-myFeval:0.305601	valid_data-myFeval:0.467742
[1200]	train-rmse:0.544331	valid_data-rmse:0.682804	train-myFeval:0.296296	valid_data-myFeval:0.466221
[1300]	train-rmse:0.536364	valid_data-rmse:0.68213	train-myFeval:0.287687	valid_data-myFeval:0.465301
[1400]	train-rmse:0.528567	valid_data-rmse:0.681425	train-myFeval:0.279383	valid_data-myFeval:0.46434
[1500]	train-rmse:0.52093	valid_data-rmse:0.680725	train-myFeval:0.271368	valid_data-myFeval:0.463386
[1600]	train-rmse:0.514128	valid_data-rmse:0.680122	train-myFeval:0.264328	valid_data-myFeval:0.462566
[1700]	train-rmse:0.507027	valid_data-rmse:0.68001	train-myFeval:0.257076	valid_data-myFeval:0.462414
[1800]	train-rmse:0.500298	valid_data-rmse:0.679592	train-myFeval:0.250298	valid_data-myFeval:0.461846
[1900]	train-rmse:0.493881	valid_data-rmse:0.679473	train-myFeval:0.243919	valid_data-myFeval:0.461683
[2000]	train-rmse:0.487692	valid_data-rmse:0.679272	train-myFeval:0.237844	valid_data-myFeval:0.46141
[2100]	train-rmse:0.481702	valid_data-rmse:0.679074	train-myFeval:0.232036	valid_data-myFeval:0.461142
[2200]	train-rmse:0.475845	valid_data-rmse:0.678811	train-myFeval:0.226429	valid_data-myFeval:0.460784
[2300]	train-rmse:0.470158	valid_data-rmse:0.678583	train-myFeval:0.221049	valid_data-myFeval:0.460475
[2400]	train-rmse:0.464276	valid_data-rmse:0.678738	train-myFeval:0.215553	valid_data-myFeval:0.460686
[2500]	train-rmse:0.458573	valid_data-rmse:0.678521	train-myFeval:0.210289	valid_data-myFeval:0.460391
[2600]	train-rmse:0.453289	valid_data-rmse:0.678433	train-myFeval:0.205471	valid_data-myFeval:0.460271
[2700]	train-rmse:0.447749	valid_data-rmse:0.678131	train-myFeval:0.200479	valid_data-myFeval:0.459862
[2800]	train-rmse:0.442273	valid_data-rmse:0.678136	train-myFeval:0.195606	valid_data-myFeval:0.459869
[2900]	train-rmse:0.436974	valid_data-rmse:0.678275	train-myFeval:0.190947	valid_data-myFeval:0.460057
Stopping. Best iteration:
[2711]	train-rmse:0.447123	valid_data-rmse:0.678055	train-myFeval:0.199919	valid_data-myFeval:0.459758

CV score: 0.45503947
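The xgb CV score above is the MSE between the out-of-fold (OOF) predictions and the true training labels: each sample is scored by the one fold model that never saw it during training, so the number is an honest estimate of generalization error. The pattern, stripped down to a toy linear model (all names and data here are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)

oof = np.zeros(len(X))  # each entry is filled exactly once, by the fold that held it out
for trn_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=2018).split(X):
    model = LinearRegression().fit(X[trn_idx], y[trn_idx])
    oof[val_idx] = model.predict(X[val_idx])

# close to the injected noise variance (0.01), since no sample is scored by a model that saw it
print("CV MSE: {:.4f}".format(mean_squared_error(oof, y)))
```

This is exactly why the notebook keeps `oof_xgb` / `oof_lgb` / `oof_cb` arrays: they double as meta-features for the stacking step later.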
##### lgb

param = {'boosting_type': 'gbdt',
         'num_leaves': 20,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 6,
         'learning_rate': 0.01,
         'min_child_samples': 30,
         'feature_fraction': 0.8,
         'bagging_freq': 1,
         'bagging_fraction': 0.8,
         'bagging_seed': 11,
         'metric': 'mse',
         'lambda_l1': 0.1,
         'verbosity': -1}

folds = KFold(n_splits=5, shuffle=True, random_state=2018)
oof_lgb = np.zeros(len(X_train_))         # out-of-fold predictions on the training set
predictions_lgb = np.zeros(len(X_test_))  # test predictions averaged over the folds

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])
    num_round = 10000
    clf = lgb.train(param, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200, early_stopping_rounds=100)
    # predict with the best iteration found by early stopping
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
    predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train_)))
fold n°1
Training until validation scores don't improve for 100 rounds
[200]	training's l2: 0.437503	valid_1's l2: 0.469686
[400]	training's l2: 0.372168	valid_1's l2: 0.44976
[600]	training's l2: 0.33182	valid_1's l2: 0.443816
[800]	training's l2: 0.300597	valid_1's l2: 0.4413
Early stopping, best iteration is:
[852]	training's l2: 0.293435	valid_1's l2: 0.440755
fold n°2
Training until validation scores don't improve for 100 rounds
[200]	training's l2: 0.431328	valid_1's l2: 0.494627
[400]	training's l2: 0.366744	valid_1's l2: 0.46981
[600]	training's l2: 0.327688	valid_1's l2: 0.463121
[800]	training's l2: 0.297368	valid_1's l2: 0.459899
[1000]	training's l2: 0.272359	valid_1's l2: 0.458902
[1200]	training's l2: 0.251022	valid_1's l2: 0.457813
Early stopping, best iteration is:
[1175]	training's l2: 0.253627	valid_1's l2: 0.457521
fold n°3
Training until validation scores don't improve for 100 rounds
[200]	training's l2: 0.429379	valid_1's l2: 0.499227
[400]	training's l2: 0.3656	valid_1's l2: 0.475046
[600]	training's l2: 0.326419	valid_1's l2: 0.466977
[800]	training's l2: 0.296541	valid_1's l2: 0.464133
[1000]	training's l2: 0.271029	valid_1's l2: 0.462466
[1200]	training's l2: 0.249656	valid_1's l2: 0.462441
Early stopping, best iteration is:
[1108]	training's l2: 0.259331	valid_1's l2: 0.461866
fold n°4
Training until validation scores don't improve for 100 rounds
[200]	training's l2: 0.433149	valid_1's l2: 0.490838
[400]	training's l2: 0.368487	valid_1's l2: 0.461291
[600]	training's l2: 0.3288	valid_1's l2: 0.452724
[800]	training's l2: 0.298579	valid_1's l2: 0.450139
Early stopping, best iteration is:
[745]	training's l2: 0.306104	valid_1's l2: 0.449927
fold n°5
Training until validation scores don't improve for 100 rounds
[200]	training's l2: 0.431879	valid_1's l2: 0.488074
[400]	training's l2: 0.366806	valid_1's l2: 0.469409
[600]	training's l2: 0.326648	valid_1's l2: 0.464181
[800]	training's l2: 0.295898	valid_1's l2: 0.461481
[1000]	training's l2: 0.270621	valid_1's l2: 0.459628
Early stopping, best iteration is:
[1033]	training's l2: 0.266873	valid_1's l2: 0.459088
CV score: 0.45383135
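Note the line `predictions_lgb += clf.predict(X_test, ...) / folds.n_splits` inside the loop above: the final test prediction is the average of the five fold models, a light form of bagging that usually beats any single fold's model. Numerically the accumulation is just a column mean (the numbers below are made up for illustration):

```python
import numpy as np

# pretend predictions from 5 fold models on 2 test samples
fold_preds = np.array([[3.1, 3.9],
                       [3.3, 4.1],
                       [2.9, 4.0],
                       [3.2, 3.8],
                       [3.0, 4.2]])

# summing pred / n_splits inside the loop accumulates exactly this mean
ensemble = fold_preds.mean(axis=0)
print(ensemble)
```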
# install the catboost package (from the Tsinghua PyPI mirror)
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple catboost

from catboost import CatBoostRegressor

kfolder = KFold(n_splits=5, shuffle=True, random_state=2019)
oof_cb = np.zeros(len(X_train_))         # out-of-fold predictions on the training set
predictions_cb = np.zeros(len(X_test_))  # test predictions averaged over the folds

cb_params = {
     'n_estimators': 100000,
     'loss_function': 'RMSE',
     'eval_metric': 'RMSE',
     'learning_rate': 0.05,
     'depth': 5,
     'use_best_model': True,
     'subsample': 0.6,
     'bootstrap_type': 'Bernoulli',
     'reg_lambda': 3
}

for fold_, (train_index, vali_index) in enumerate(kfolder.split(X_train_, y_train_)):
    print("fold n°{}".format(fold_))
    k_x_train, k_y_train = X_train[train_index], y_train[train_index]
    k_x_vali, k_y_vali = X_train[vali_index], y_train[vali_index]
    model_cb = CatBoostRegressor(**cb_params)
    # train with early stopping; use_best_model shrinks the model to the best iteration
    model_cb.fit(k_x_train, k_y_train, eval_set=[(k_x_vali, k_y_vali)],
                 verbose=100, early_stopping_rounds=50)
    oof_cb[vali_index] = model_cb.predict(k_x_vali, ntree_end=model_cb.best_iteration_)
    predictions_cb += model_cb.predict(X_test_, ntree_end=model_cb.best_iteration_) / kfolder.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_cb, y_train_)))
fold n°0
0:	learn: 0.8175871	test: 0.7820939	best: 0.7820939 (0)	total: 49.9ms	remaining: 1h 23m 8s
100:	learn: 0.6711041	test: 0.6749289	best: 0.6749289 (100)	total: 372ms	remaining: 6m 7s
200:	learn: 0.6410910	test: 0.6688829	best: 0.6686703 (190)	total: 674ms	remaining: 5m 34s
300:	learn: 0.6130819	test: 0.6669464	best: 0.6668201 (282)	total: 988ms	remaining: 5m 27s
400:	learn: 0.5895197	test: 0.6666901	best: 0.6663658 (371)	total: 1.3s	remaining: 5m 23s
500:	learn: 0.5684832	test: 0.6657841	best: 0.6654600 (478)	total: 1.6s	remaining: 5m 18s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.6654599993
bestIteration = 478

Shrink model to first 479 iterations.
fold n°1
0:	learn: 0.8107754	test: 0.8172376	best: 0.8172376 (0)	total: 3.48ms	remaining: 5m 48s
100:	learn: 0.6715406	test: 0.6800052	best: 0.6800052 (100)	total: 323ms	remaining: 5m 19s
200:	learn: 0.6428284	test: 0.6699391	best: 0.6699391 (200)	total: 641ms	remaining: 5m 18s
300:	learn: 0.6144500	test: 0.6663790	best: 0.6662390 (298)	total: 964ms	remaining: 5m 19s
400:	learn: 0.5905343	test: 0.6643743	best: 0.6641256 (388)	total: 1.28s	remaining: 5m 18s
500:	learn: 0.5703917	test: 0.6632232	best: 0.6632137 (497)	total: 1.6s	remaining: 5m 17s
600:	learn: 0.5523517	test: 0.6626011	best: 0.6620170 (579)	total: 1.92s	remaining: 5m 17s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.6620170222
bestIteration = 579

Shrink model to first 580 iterations.
fold n°2
0:	learn: 0.8046145	test: 0.8370989	best: 0.8370989 (0)	total: 3.56ms	remaining: 5m 56s
100:	learn: 0.6652528	test: 0.7059731	best: 0.7059731 (100)	total: 314ms	remaining: 5m 10s
200:	learn: 0.6356395	test: 0.6958527	best: 0.6958527 (200)	total: 618ms	remaining: 5m 7s
300:	learn: 0.6079444	test: 0.6913800	best: 0.6913800 (300)	total: 927ms	remaining: 5m 6s
400:	learn: 0.5848883	test: 0.6900293	best: 0.6900293 (400)	total: 1.24s	remaining: 5m 8s
500:	learn: 0.5637398	test: 0.6896119	best: 0.6889243 (455)	total: 1.56s	remaining: 5m 10s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.6889243403
bestIteration = 455

Shrink model to first 456 iterations.
fold n°3
0:	learn: 0.8156897	test: 0.7928103	best: 0.7928103 (0)	total: 3.89ms	remaining: 6m 29s
100:	learn: 0.6666901	test: 0.6886018	best: 0.6886018 (100)	total: 325ms	remaining: 5m 21s
200:	learn: 0.6349422	test: 0.6834388	best: 0.6834388 (200)	total: 643ms	remaining: 5m 19s
300:	learn: 0.6054434	test: 0.6814056	best: 0.6806466 (259)	total: 954ms	remaining: 5m 15s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.680646584
bestIteration = 259

Shrink model to first 260 iterations.
fold n°4
0:	learn: 0.8073054	test: 0.8273646	best: 0.8273646 (0)	total: 3.34ms	remaining: 5m 34s
100:	learn: 0.6617636	test: 0.7072268	best: 0.7072268 (100)	total: 312ms	remaining: 5m 8s
200:	learn: 0.6326520	test: 0.6986823	best: 0.6985780 (193)	total: 614ms	remaining: 5m 5s
300:	learn: 0.6047984	test: 0.6949317	best: 0.6949112 (296)	total: 914ms	remaining: 5m 2s
400:	learn: 0.5809457	test: 0.6927416	best: 0.6925554 (375)	total: 1.22s	remaining: 5m 2s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.6925554216
bestIteration = 375

Shrink model to first 376 iterations.
CV score: 0.45983020
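Each CatBoost fold ends with "Stopped by overfitting detector (50 iterations wait)" and the model is shrunk to `best_iteration_ + 1` trees. The detector's logic is roughly "remember the iteration with the best validation metric; stop once no improvement has been seen for `early_stopping_rounds` iterations". A simplified sketch of that logic (not CatBoost's actual implementation):

```python
def early_stop_best_iter(metric_history, patience):
    """Return the index of the best (lowest) metric, stopping after
    `patience` rounds without improvement."""
    best, best_iter = float("inf"), -1
    for i, m in enumerate(metric_history):
        if m < best:
            best, best_iter = m, i
        elif i - best_iter >= patience:
            break  # overfitting detector fires
    return best_iter

# mimic fold n°0: validation RMSE improves until iteration 478, then plateaus
history = [1.0 - 0.001 * i for i in range(479)]
history += [history[-1]] * 60
print(early_stop_best_iter(history, 50))  # prints 478
```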
from sklearn import linear_model

# stack the out-of-fold predictions of lgb, xgb, and catboost as meta-features
train_stack = np.vstack([oof_lgb, oof_xgb, oof_cb]).transpose()
test_stack = np.vstack([predictions_lgb, predictions_xgb, predictions_cb]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2018)
oof_stack = np.zeros(train_stack.shape[0])
predictions = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]
    val_data, val_y = train_stack[val_idx], y_train[val_idx]

    clf_3 = linear_model.BayesianRidge()
    # clf_3 = linear_model.Ridge()  # alternative meta-model
    clf_3.fit(trn_data, trn_y)

    oof_stack[val_idx] = clf_3.predict(val_data)
    # 5 splits x 2 repeats = 10 meta-models, so average the test predictions over 10
    predictions += clf_3.predict(test_stack) / 10

print("CV score: {:<8.8f}".format(mean_squared_error(oof_stack, y_train_)))
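The hard-coded `/ 10` in the stacking loop works because `RepeatedKFold(n_splits=5, n_repeats=2)` yields 5 × 2 = 10 train/validation splits, hence 10 meta-models whose test predictions get averaged. Using `get_n_splits()` instead of the literal would make this explicit and robust to parameter changes:

```python
from sklearn.model_selection import RepeatedKFold

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2018)
n_models = folds_stack.get_n_splits()  # 5 splits x 2 repeats
print(n_models)  # prints 10
```

Then the accumulation line becomes `predictions += clf_3.predict(test_stack) / n_models`.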

result = list(predictions)
# add 1 back to the predictions (the happiness label appears to have been shifted down by 1 earlier)
result = list(map(lambda x: x + 1, result))
test_sub = pd.read_csv("happiness_submit.csv", encoding='ISO-8859-1')
test_sub["happiness"] = result
test_sub.to_csv("submit_20211122.csv", index=False)
# print the working directory so we know where the submission file was saved; download it from there and submit
print(os.path.abspath('.'))
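Before submitting, a quick sanity check on the file is cheap insurance: the stacked predictions are continuous regression outputs, so they should land roughly inside the 1–5 label range with no missing values. A sketch with a hypothetical mini-version of the submission frame (the real one has 2969 rows):

```python
import pandas as pd

# hypothetical stand-in for the real submission DataFrame
sub = pd.DataFrame({"id": [8001, 8002, 8003],
                    "happiness": [3.8, 4.1, 2.9]})

# predictions should stay within the 1-5 happiness range, with no NaNs
assert sub["happiness"].between(1, 5).all()
assert sub["happiness"].notna().all()
print("submission looks sane")
```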

四、References

集成学习案例(幸福感预测)
幸福感预测-线上0.471-排名18-思路分享-含xgb-lgb-ctb
快来一起挖掘幸福感!——阿里云天池项目实战(附完成实践过程+代码)
机器学习(四)幸福感数据分析+预测
