
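Today I'm working through Experiment 9 of Task 1: changing and tuning the XGBoost model parameters and comparing the online test results on the Alibaba Cloud Tianchi platform.

I started with the blog links given in the lab guide. One of the two links is dead; the other still works: 机器学习系列(12)_XGBoost参数调优完全指南（附Python代码） (Machine Learning Series (12): Complete Guide to XGBoost Parameter Tuning, with Python code).

Tuning the number of boosting rounds

Tuning is done on the dataset built from the features extracted in the previous experiment. The first parameter to tune is the number of boosting rounds.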

import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier


def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use cross-validation with early stopping to choose the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        # Build the DMatrix from the dtrain argument rather than the global train frame
        xgtrain = xgb.DMatrix(dtrain[predictors], label=dtrain['label'])
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        # One row per boosting round that was kept, so the row count is the tuned n_estimators
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['label'], eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['label'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['label'], dtrain_predprob))

    # Plot how often each feature is used for a split
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.show()

if __name__ == '__main__':

    train = pd.read_csv(r'./prepared_dataset/train.csv')
    target='label'

    predictors = [x for x in train.columns if x not in ['label', 'User_id', 'Coupon_id', 'Date_received']]

    xgb1 = XGBClassifier(
        learning_rate=0.1,
        n_estimators=10000,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)

    # Tune n_estimators
    modelfit(xgb1, train, predictors)

    params = xgb1.get_params()
    print(params)

    # Save the model
    xgb1.save_model('output_files/xgb_model1')

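The tuned result is n_estimators=563, which surprised me: I had always trained for more than 5000 rounds and did not expect the tuned value to be this low. Training with this value and submitting the predictions gave an online score of 0.7333, which is 0.017 higher than the previous 5000-round run with all other parameters unchanged.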
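
To make the mechanism concrete: with early_stopping_rounds=50, the cross-validation stops once the validation AUC has gone 50 consecutive rounds without improving, and, as far as I understand, the returned frame is cut back to the best round. That is why the tuned value can land far below the 10000-round cap. The block below is a minimal pure-Python sketch of such a patience rule, not XGBoost's own implementation; the helper name `best_round` and the toy AUC curve are made up for illustration.

```python
def best_round(scores, patience):
    """Return the 1-based round count that a patience-based early stop would keep.

    scores   -- validation metric per boosting round (higher is better, e.g. AUC)
    patience -- stop after this many consecutive rounds without improvement
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i  # new best round, the patience counter restarts here
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx + 1


# Toy AUC curve (made-up numbers): improves, peaks at round 4, then degrades.
toy_auc = [0.80, 0.84, 0.86, 0.87, 0.869, 0.868, 0.867, 0.866]
rounds = best_round(toy_auc, patience=3)
print(rounds)
```

With a learning rate of 0.1 the curve peaks fairly early, so a much smaller round count than 5000 is plausible.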

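Tuning max_depth and min_child_weight

Next, tune max_depth and min_child_weight.

# Tune max_depth and min_child_weight
# Note: iid= and grid_scores_ only exist in older scikit-learn releases
# (grid_scores_ was removed in 0.20 and iid in 0.24; newer releases expose cv_results_ instead)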
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=5,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1,
                                                seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
print(gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_)

Output:

[mean: 0.86405, std: 0.01783, params: {'max_depth': 3, 'min_child_weight': 1},
 mean: 0.87122, std: 0.00858, params: {'max_depth': 3, 'min_child_weight': 3},
 mean: 0.87233, std: 0.00796, params: {'max_depth': 3, 'min_child_weight': 5},
 mean: 0.84315, std: 0.04618, params: {'max_depth': 5, 'min_child_weight': 1},
 mean: 0.85586, std: 0.02888, params: {'max_depth': 5, 'min_child_weight': 3},
 mean: 0.86428, std: 0.01649, params: {'max_depth': 5, 'min_child_weight': 5},
 mean: 0.83507, std: 0.04100, params: {'max_depth': 7, 'min_child_weight': 1},
 mean: 0.84799, std: 0.02979, params: {'max_depth': 7, 'min_child_weight': 3},
 mean: 0.85674, std: 0.02130, params: {'max_depth': 7, 'min_child_weight': 5},
 mean: 0.83280, std: 0.03825, params: {'max_depth': 9, 'min_child_weight': 1},
 mean: 0.84614, std: 0.02922, params: {'max_depth': 9, 'min_child_weight': 3},
 mean: 0.85383, std: 0.02155, params: {'max_depth': 9, 'min_child_weight': 5}]
{'max_depth': 3, 'min_child_weight': 5}
0.8723303890244708

Narrow the range and step size and continue tuning:

param_test2 = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3,
                                                min_child_weight=5, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1,
                                                seed=27),
                        param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
print(gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_)

Output:

[mean: 0.86872, std: 0.00497, params: {'max_depth': 2, 'min_child_weight': 4},
 mean: 0.86826, std: 0.00534, params: {'max_depth': 2, 'min_child_weight': 5},
 mean: 0.87010, std: 0.00351, params: {'max_depth': 2, 'min_child_weight': 6},
 mean: 0.86739, std: 0.01412, params: {'max_depth': 3, 'min_child_weight': 4},
 mean: 0.87233, std: 0.00796, params: {'max_depth': 3, 'min_child_weight': 5},
 mean: 0.87362, std: 0.00682, params: {'max_depth': 3, 'min_child_weight': 6},
 mean: 0.86237, std: 0.02257, params: {'max_depth': 4, 'min_child_weight': 4},
 mean: 0.86814, std: 0.01341, params: {'max_depth': 4, 'min_child_weight': 5},
 mean: 0.87046, std: 0.01170, params: {'max_depth': 4, 'min_child_weight': 6}]
 {'max_depth': 3, 'min_child_weight': 6}
 0.8736185043976402

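So the best max_depth is 3. The best min_child_weight (6), however, sits at the upper edge of the searched range, so the range is shifted and the search continues: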

[mean: 0.87362, std: 0.00682, params: {'min_child_weight': 6},
 mean: 0.87386, std: 0.00697, params: {'min_child_weight': 8},
 mean: 0.87506, std: 0.00554, params: {'min_child_weight': 10},
 mean: 0.87607, std: 0.00478, params: {'min_child_weight': 12},
 mean: 0.87625, std: 0.00476, params: {'min_child_weight': 14},
 mean: 0.87597, std: 0.00503, params: {'min_child_weight': 16},
 mean: 0.87556, std: 0.00647, params: {'min_child_weight': 18}]
 {'min_child_weight': 14}
 0.8762503781866819

The best min_child_weight is 14.
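
The decision to keep searching or to stop comes down to whether the best value lies on the boundary of the grid. In the second search the best min_child_weight (6) was the largest value tried, so the optimum could lie beyond it; in the third search 14 sits between 12 and 16, so stopping there is reasonable. A small sketch of that check, where `on_grid_edge` is a hypothetical helper of my own and not part of scikit-learn:

```python
def on_grid_edge(grid, best):
    """Return True if `best` is the smallest or largest value that was searched."""
    values = sorted(grid)
    return best == values[0] or best == values[-1]


# Grids and best values taken from the searches above.
second_search = on_grid_edge([4, 5, 6], 6)
third_search = on_grid_edge([6, 8, 10, 12, 14, 16, 18], 14)

print(second_search)  # best value is on the upper edge: widen the range
print(third_search)   # best value is interior: the range covers the peak
```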

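Plugging this value back into the original model and retraining dropped the online score to 0.7175. Re-tuning the number of boosting rounds at this point gave 2346.
After training for 2346 rounds, the score rose back to 0.7320.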

Tuning gamma

param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0,
                            subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27), param_grid=param_test3, scoring='roc_auc', n_jobs=4,
    iid=False, cv=5)

gsearch3.fit(train[predictors], train[target])
print(gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_)

Output:

[mean: 0.87625, std: 0.00476, params: {'gamma': 0.0},
 mean: 0.87626, std: 0.00486, params: {'gamma': 0.1},
 mean: 0.87497, std: 0.00695, params: {'gamma': 0.2},
 mean: 0.87491, std: 0.00695, params: {'gamma': 0.3},
 mean: 0.87554, std: 0.00593, params: {'gamma': 0.4}]
 {'gamma': 0.1}
 0.8762571697239213

Re-tuning the number of boosting rounds gave 1916, and the submission scored 0.7324.

Tuning subsample and colsample_bytree

param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}

gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27), param_grid=param_test4, scoring='roc_auc', n_jobs=4,
    iid=False, cv=5)

gsearch4.fit(train[predictors], train[target])
print(gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_)

Output:

[mean: 0.87548, std: 0.00574, params: {'colsample_bytree': 0.6, 'subsample': 0.6},
 mean: 0.87438, std: 0.00674, params: {'colsample_bytree': 0.6, 'subsample': 0.7},
 mean: 0.87390, std: 0.00769, params: {'colsample_bytree': 0.6, 'subsample': 0.8},
 mean: 0.87328, std: 0.00734, params: {'colsample_bytree': 0.6, 'subsample': 0.9},
 mean: 0.87549, std: 0.00563, params: {'colsample_bytree': 0.7, 'subsample': 0.6},
 mean: 0.87440, std: 0.00597, params: {'colsample_bytree': 0.7, 'subsample': 0.7},
 mean: 0.87397, std: 0.00625, params: {'colsample_bytree': 0.7, 'subsample': 0.8},
 mean: 0.87527, std: 0.00504, params: {'colsample_bytree': 0.7, 'subsample': 0.9},
 mean: 0.87516, std: 0.00564, params: {'colsample_bytree': 0.8, 'subsample': 0.6},
 mean: 0.87490, std: 0.00677, params: {'colsample_bytree': 0.8, 'subsample': 0.7},
 mean: 0.87626, std: 0.00486, params: {'colsample_bytree': 0.8, 'subsample': 0.8},
 mean: 0.87391, std: 0.00615, params: {'colsample_bytree': 0.8, 'subsample': 0.9},
 mean: 0.87510, std: 0.00552, params: {'colsample_bytree': 0.9, 'subsample': 0.6},
 mean: 0.87683, std: 0.00423, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
 mean: 0.87554, std: 0.00568, params: {'colsample_bytree': 0.9, 'subsample': 0.8},
 mean: 0.87475, std: 0.00692, params: {'colsample_bytree': 0.9, 'subsample': 0.9}]
 {'colsample_bytree': 0.9, 'subsample': 0.7}
 0.8768305808825803

Narrow the range and step size and continue:

param_test5 = {
    'subsample': [i / 100.0 for i in range(65, 80, 5)],
    'colsample_bytree': [i / 100.0 for i in range(85, 100, 5)]
}

gsearch5 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27), param_grid=param_test5, scoring='roc_auc', n_jobs=4,
    iid=False, cv=5)

gsearch5.fit(train[predictors], train[target])
print(gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_)

Output:

[mean: 0.87546, std: 0.00545, params: {'colsample_bytree': 0.85, 'subsample': 0.65},
 mean: 0.87490, std: 0.00768, params: {'colsample_bytree': 0.85, 'subsample': 0.7},
 mean: 0.87638, std: 0.00432, params: {'colsample_bytree': 0.85, 'subsample': 0.75},
 mean: 0.87394, std: 0.00754, params: {'colsample_bytree': 0.9, 'subsample': 0.65},
 mean: 0.87683, std: 0.00423, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
 mean: 0.87501, std: 0.00628, params: {'colsample_bytree': 0.9, 'subsample': 0.75},
 mean: 0.87359, std: 0.00838, params: {'colsample_bytree': 0.95, 'subsample': 0.65},
 mean: 0.87566, std: 0.00544, params: {'colsample_bytree': 0.95, 'subsample': 0.7},
 mean: 0.87497, std: 0.00711, params: {'colsample_bytree': 0.95, 'subsample': 0.75}]
 {'colsample_bytree': 0.9, 'subsample': 0.7}
 0.8768305808825803

Judging from the output, 'colsample_bytree': 0.9, 'subsample': 0.7 is indeed the best combination.
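
For reference, here is what the list comprehensions in the two subsample / colsample_bytree grids expand to, and how many model fits each search costs with 5-fold cross-validation (not counting the final refit on the full data). This is a plain-Python sketch; `expand` is a helper I made up, and it only mimics how a parameter grid is enumerated.

```python
from itertools import product


def expand(grid):
    """List every parameter combination in a GridSearchCV-style dict."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


coarse = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)],
}
fine = {
    'subsample': [i / 100.0 for i in range(65, 80, 5)],
    'colsample_bytree': [i / 100.0 for i in range(85, 100, 5)],
}

cv_folds = 5
for name, grid in [('coarse', coarse), ('fine', fine)]:
    combos = expand(grid)
    print(name, grid['subsample'], grid['colsample_bytree'],
          len(combos), 'combinations,', len(combos) * cv_folds, 'fits')
```

The coarse grid is 4 x 4 = 16 combinations (80 fits) and the fine grid is 3 x 3 = 9 combinations (45 fits), which goes some way to explaining why the whole exercise takes so long on a modest machine.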

Tuning reg_alpha

param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27), param_grid=param_test6, scoring='roc_auc', n_jobs=4,
    iid=False, cv=5)

gsearch6.fit(train[predictors], train[target])
print(gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_)

Output:

[mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-05},
 mean: 0.87650, std: 0.00473, params: {'reg_alpha': 0.01},
 mean: 0.87489, std: 0.00610, params: {'reg_alpha': 0.1},
 mean: 0.87533, std: 0.00635, params: {'reg_alpha': 1},
 mean: 0.86921, std: 0.00367, params: {'reg_alpha': 100}]
 {'reg_alpha': 1e-05}
 0.8768305794218445

Adjust the range and continue:

[mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-06},
 mean: 0.87683, std: 0.00423, params: {'reg_alpha': 5e-06},
 mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-05},
 mean: 0.87683, std: 0.00423, params: {'reg_alpha': 5e-05},
 mean: 0.87683, std: 0.00423, params: {'reg_alpha': 0.0001},
 mean: 0.87689, std: 0.00421, params: {'reg_alpha': 0.0005},
 mean: 0.87689, std: 0.00421, params: {'reg_alpha': 0.001},
 mean: 0.87651, std: 0.00430, params: {'reg_alpha': 0.005}]
 {'reg_alpha': 0.0005}
 0.8768932960098768
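
Many of these candidates differ only in the fourth or fifth decimal place, while the standard deviation across folds is around 0.004. One rough way to judge whether a "best" value really stands out is to compare its lead over the runner-up with its own fold-to-fold standard deviation. The sketch below does that for the first reg_alpha search; `lead_vs_noise` is a hypothetical helper, and the one-standard-deviation threshold is a rule of thumb I am assuming, not a formal test.

```python
def lead_vs_noise(results):
    """Compare the best mean score's lead over the runner-up with its fold std.

    results -- list of (label, mean, std) tuples
    Returns (best_label, gap, within_noise).
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    gap = best[1] - runner_up[1]
    return best[0], gap, gap < best[2]


# Means and stds copied from the first reg_alpha search above.
reg_alpha_results = [
    (1e-05, 0.87683, 0.00423),
    (0.01, 0.87650, 0.00473),
    (0.1, 0.87489, 0.00610),
    (1, 0.87533, 0.00635),
    (100, 0.86921, 0.00367),
]
label, gap, within_noise = lead_vs_noise(reg_alpha_results)
print(label, round(gap, 5), within_noise)
```

Here the lead of the winner is more than ten times smaller than its standard deviation, so the ranking between the top candidates is not something I would rely on.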

Tuning colsample_bylevel

param_test8 = {
    'colsample_bylevel': [i / 100.0 for i in range(60, 100, 10)]
}

gsearch8 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1,reg_alpha=0.0005, seed=27), param_grid=param_test8, scoring='roc_auc', n_jobs=4,
    iid=False, cv=5)

gsearch8.fit(train[predictors], train[target])
print(gsearch8.grid_scores_, gsearch8.best_params_, gsearch8.best_score_)

Output:

[mean: 0.87460, std: 0.00497, params: {'colsample_bylevel': 0.6},
 mean: 0.87499, std: 0.00494, params: {'colsample_bylevel': 0.7},
 mean: 0.87338, std: 0.00823, params: {'colsample_bylevel': 0.8},
 mean: 0.87343, std: 0.00807, params: {'colsample_bylevel': 0.9}]
 {'colsample_bylevel': 0.7}
 0.874990904057588

After repeated testing, the highest-scoring model is still this set of parameters:

params = {'booster': 'gbtree',
              'objective': 'binary:logistic',
              'eval_metric': 'auc',
              'nthread': 4,
              'silent': 0,
              'eta': 0.01,
              'max_depth': 5,
              'min_child_weight': 1,
              'gamma': 0,
              'lambda': 1,
              'colsample_bylevel': 0.7,
              'colsample_bytree': 0.7,
              'subsample': 0.9,
              'scale_pos_weight': 1}
num_boost_round=563
score=0.7333
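
Note that this dict uses the native parameter names (eta, lambda) while the grid searches above use the scikit-learn wrapper names (learning_rate, reg_alpha). XGBoost documents these pairs as aliases, and n_estimators in the wrapper corresponds to num_boost_round in the native training call. The sketch below shows one way to translate between the two spellings; `to_native` and `SKLEARN_TO_NATIVE` are names I made up for illustration, and only the aliases I am sure of are listed.

```python
# Aliases between the scikit-learn wrapper names and the native parameter names.
SKLEARN_TO_NATIVE = {
    'learning_rate': 'eta',
    'reg_alpha': 'alpha',
    'reg_lambda': 'lambda',
}


def to_native(sklearn_params):
    """Split wrapper-style params into (native_params, num_boost_round)."""
    params = dict(sklearn_params)  # copy so the caller's dict is not modified
    num_boost_round = params.pop('n_estimators', None)
    native = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in params.items()}
    return native, num_boost_round


native, rounds = to_native({'learning_rate': 0.1, 'n_estimators': 563,
                            'max_depth': 3, 'reg_alpha': 0.0005})
print(native, rounds)
```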

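Conclusion

Parameter tuning is a laborious process. My machine is not very powerful, so it took two full days to get through all the parameters. Yet for many parameters, tuning raised the local AUC while lowering the online test score; the best-scoring model still comes from slight adjustments to the initial one.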