X-Data Data Engineering Basics Practice (Part 7)

Posted on 2020-02-05 | Updated on 2021-10-29 | Categories: Postgraduate Entrance Exam, Second-Round Exam
Length: 16k characters | Reading time ≈ 15 minutes

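Main Text

  Today I am completing Experiment 9 of Task 1: changing and optimizing the XGBoost model parameters and comparing the online test results on the Alibaba Cloud Tianchi platform.

  I started by studying the blog links given in the lab guide. One of the two links is dead, but the other still works: Machine Learning Series (12): A Complete Guide to XGBoost Parameter Tuning, with Python Code (机器学习系列(12)_XGBoost参数调优完全指南(附Python代码))

Tuning the number of training rounds

  Tuning is done on the dataset built from the features extracted in the previous experiment, starting with the number of training rounds.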

import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn import metrics
from xgboost import XGBClassifier


def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use cross-validation with early stopping to choose the number of boosting rounds.
        xgb_param = alg.get_xgb_params()
        # Build the DMatrix from the dtrain argument (the original read the global `train` here).
        xgtrain = xgb.DMatrix(dtrain[predictors], label=dtrain['label'])
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        # The number of rows in cvresult is the number of rounds kept after early stopping.
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data.
    # Passing eval_metric to fit() worked with the xgboost release used here;
    # newer releases may expect it on the constructor instead.
    alg.fit(dtrain[predictors], dtrain['label'], eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['label'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['label'], dtrain_predprob))

    # Plot how often each feature is used for a split.
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.show()


if __name__ == '__main__':

    train = pd.read_csv(r'./prepared_dataset/train.csv')
    target = 'label'

    # Every column except the label and the ID/date columns is a feature.
    predictors = [x for x in train.columns if x not in ['label', 'User_id', 'Coupon_id', 'Date_received']]

    xgb1 = XGBClassifier(
        learning_rate=0.1,
        n_estimators=10000,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)

    # Tune n_estimators
    modelfit(xgb1, train, predictors)

    params = xgb1.get_params()
    print(params)

    # Save the model
    xgb1.save_model('output_files/xgb_model1')

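  The tuned result is n_estimators=563, which surprised me: I had always trained for 5000 rounds or more and did not expect the tuned value to be this low. Training with this value and submitting the predictions gave an evaluation score of 0.7333, which is 0.017 higher than the previous 5000-round run with all other parameters unchanged.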
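  To illustrate why the tuned value can end up far below the 10000-round cap, here is a minimal sketch of an early-stopping rule of the kind that early_stopping_rounds applies. The function name rounds_with_early_stopping and the toy AUC values are hypothetical, and this is a simplified model of the behavior rather than xgboost's own implementation.

```python
def rounds_with_early_stopping(scores, patience):
    """Return how many rounds to keep, given one validation score per round.

    Scanning stops once `patience` consecutive rounds fail to beat the best
    score seen so far; the rounds up to and including the best one are kept.
    """
    best_score = float("-inf")
    best_round = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            break
    return best_round + 1


# Hypothetical per-round validation AUC values (not taken from the experiment).
toy_auc = [0.70, 0.74, 0.77, 0.78, 0.779, 0.778, 0.777, 0.79]

kept = rounds_with_early_stopping(toy_auc, patience=3)
print(kept)  # 4: the scan stops before it ever reaches the later 0.79
```

  With a larger patience the scan would reach the last value and keep all 8 rounds, which is why the chosen round count depends on both the learning curve and early_stopping_rounds.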

Tuning max_depth and min_child_weight

  Next, tune max_depth and min_child_weight.

from sklearn.model_selection import GridSearchCV

# Tune max_depth and min_child_weight.
# Note: the iid argument and the grid_scores_ attribute exist only in older scikit-learn
# releases; newer releases drop iid and expose the results as cv_results_.
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=5,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1,
                                                seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
print(gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_)

  The output is:

[mean: 0.86405, std: 0.01783, params: {'max_depth': 3, 'min_child_weight': 1},
mean: 0.87122, std: 0.00858, params: {'max_depth': 3, 'min_child_weight': 3},
mean: 0.87233, std: 0.00796, params: {'max_depth': 3, 'min_child_weight': 5},
mean: 0.84315, std: 0.04618, params: {'max_depth': 5, 'min_child_weight': 1},
mean: 0.85586, std: 0.02888, params: {'max_depth': 5, 'min_child_weight': 3},
mean: 0.86428, std: 0.01649, params: {'max_depth': 5, 'min_child_weight': 5},
mean: 0.83507, std: 0.04100, params: {'max_depth': 7, 'min_child_weight': 1},
mean: 0.84799, std: 0.02979, params: {'max_depth': 7, 'min_child_weight': 3},
mean: 0.85674, std: 0.02130, params: {'max_depth': 7, 'min_child_weight': 5},
mean: 0.83280, std: 0.03825, params: {'max_depth': 9, 'min_child_weight': 1},
mean: 0.84614, std: 0.02922, params: {'max_depth': 9, 'min_child_weight': 3},
mean: 0.85383, std: 0.02155, params: {'max_depth': 9, 'min_child_weight': 5}]
{'max_depth': 3, 'min_child_weight': 5}
0.8723303890244708
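  To get a feel for how much work a search like this involves, the sketch below enumerates the same grid with the standard library only. The helper expand_grid is a hypothetical name, and this mirrors only the idea of an exhaustive grid, not scikit-learn's internals.

```python
from itertools import product

param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2),
}


def expand_grid(param_grid):
    """Yield one dict per parameter combination, iterating keys in sorted order."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(expand_grid(param_test1))
cv_folds = 5
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 12 60
print(candidates[0])                # {'max_depth': 3, 'min_child_weight': 1}
```

  Each of those 60 fits trains up to 563 trees, and by default the search refits the best candidate once more on the full data, which goes some way toward explaining the long running time on a modest machine.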

  Narrow the step size and range and continue tuning:

# Finer search around the best point of the first grid (max_depth=3, min_child_weight=5).
param_test2 = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3,
                                                min_child_weight=5, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1,
                                                seed=27),
                        param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
print(gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_)

  The output is:

[mean: 0.86872, std: 0.00497, params: {'max_depth': 2, 'min_child_weight': 4},
mean: 0.86826, std: 0.00534, params: {'max_depth': 2, 'min_child_weight': 5},
mean: 0.87010, std: 0.00351, params: {'max_depth': 2, 'min_child_weight': 6},
mean: 0.86739, std: 0.01412, params: {'max_depth': 3, 'min_child_weight': 4},
mean: 0.87233, std: 0.00796, params: {'max_depth': 3, 'min_child_weight': 5},
mean: 0.87362, std: 0.00682, params: {'max_depth': 3, 'min_child_weight': 6},
mean: 0.86237, std: 0.02257, params: {'max_depth': 4, 'min_child_weight': 4},
mean: 0.86814, std: 0.01341, params: {'max_depth': 4, 'min_child_weight': 5},
mean: 0.87046, std: 0.01170, params: {'max_depth': 4, 'min_child_weight': 6}]
{'max_depth': 3, 'min_child_weight': 6}
0.8736185043976402

  So the best max_depth is 3. The best min_child_weight, however, sits at the upper edge of the searched range, so the range is adjusted and tuning continues:

[mean: 0.87362, std: 0.00682, params: {'min_child_weight': 6},
mean: 0.87386, std: 0.00697, params: {'min_child_weight': 8},
mean: 0.87506, std: 0.00554, params: {'min_child_weight': 10},
mean: 0.87607, std: 0.00478, params: {'min_child_weight': 12},
mean: 0.87625, std: 0.00476, params: {'min_child_weight': 14},
mean: 0.87597, std: 0.00503, params: {'min_child_weight': 16},
mean: 0.87556, std: 0.00647, params: {'min_child_weight': 18}]
{'min_child_weight': 14}
0.8762503781866819

  The best min_child_weight is 14.
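  The search code for this step is not shown, so the grid below is inferred from the printed output rather than copied from the author's script. As a small sketch, the hypothetical helper best_on_edge captures the rule of thumb used here: when the winner is the smallest or largest value tried, the range should be extended before trusting it.

```python
def best_on_edge(values, best):
    """Return True when `best` is the smallest or largest value that was searched."""
    ordered = sorted(values)
    return best in (ordered[0], ordered[-1])


second_grid = [4, 5, 6]             # min_child_weight values from param_test2
third_grid = list(range(6, 20, 2))  # inferred from the output: 6, 8, ..., 18

print(best_on_edge(second_grid, 6))   # True: 6 is the upper edge, so widen the range
print(best_on_edge(third_grid, 14))   # False: 14 has neighbours on both sides
```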

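  Plugging this value back into the original model and retraining dropped the score to 0.7175. Re-tuning the number of training rounds at this point gave 2346.
  After training for 2346 rounds, the score rose back to 0.7320.

Tuning gamma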

# Tune gamma over 0.0, 0.1, ..., 0.4 with max_depth and min_child_weight fixed.
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0,
                            subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)

gsearch3.fit(train[predictors], train[target])
print(gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_)

  The output is:

[mean: 0.87625, std: 0.00476, params: {'gamma': 0.0},
mean: 0.87626, std: 0.00486, params: {'gamma': 0.1},
mean: 0.87497, std: 0.00695, params: {'gamma': 0.2},
mean: 0.87491, std: 0.00695, params: {'gamma': 0.3},
mean: 0.87554, std: 0.00593, params: {'gamma': 0.4}]
{'gamma': 0.1}
0.8762571697239213

  Re-tuning the number of training rounds gave 1916, and the submission scored 0.7324.
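  It may help to put the gamma result in perspective: the gap between gamma=0.0 and gamma=0.1 is tiny compared with the fold-to-fold spread. The sketch below uses the means and standard deviations printed in the gamma output; the helper gap_in_std_units is a hypothetical name and the comparison is only a rough sanity check, not a formal significance test.

```python
# (mean AUC, std across folds) per gamma value, copied from the printed output.
gamma_scores = {
    0.0: (0.87625, 0.00476),
    0.1: (0.87626, 0.00486),
    0.2: (0.87497, 0.00695),
    0.3: (0.87491, 0.00695),
    0.4: (0.87554, 0.00593),
}


def gap_in_std_units(scores, a, b):
    """Return |mean_a - mean_b| divided by the larger of the two standard deviations."""
    mean_a, std_a = scores[a]
    mean_b, std_b = scores[b]
    return abs(mean_a - mean_b) / max(std_a, std_b)


best = max(gamma_scores, key=lambda g: gamma_scores[g][0])
ratio = gap_in_std_units(gamma_scores, 0.0, 0.1)

print(best)          # 0.1
print(ratio < 0.01)  # True: the gap is well under 1% of one standard deviation
```

  A difference this small gives little reason to expect gamma=0.1 to beat gamma=0 on unseen data, which fits the nearly unchanged online score.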

Tuning subsample and colsample_bytree

# Tune subsample and colsample_bytree together over 0.6, 0.7, 0.8, 0.9.
param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}

gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)

gsearch4.fit(train[predictors], train[target])
print(gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_)

  The output is:

[mean: 0.87548, std: 0.00574, params: {'colsample_bytree': 0.6, 'subsample': 0.6},
mean: 0.87438, std: 0.00674, params: {'colsample_bytree': 0.6, 'subsample': 0.7},
mean: 0.87390, std: 0.00769, params: {'colsample_bytree': 0.6, 'subsample': 0.8},
mean: 0.87328, std: 0.00734, params: {'colsample_bytree': 0.6, 'subsample': 0.9},
mean: 0.87549, std: 0.00563, params: {'colsample_bytree': 0.7, 'subsample': 0.6},
mean: 0.87440, std: 0.00597, params: {'colsample_bytree': 0.7, 'subsample': 0.7},
mean: 0.87397, std: 0.00625, params: {'colsample_bytree': 0.7, 'subsample': 0.8},
mean: 0.87527, std: 0.00504, params: {'colsample_bytree': 0.7, 'subsample': 0.9},
mean: 0.87516, std: 0.00564, params: {'colsample_bytree': 0.8, 'subsample': 0.6},
mean: 0.87490, std: 0.00677, params: {'colsample_bytree': 0.8, 'subsample': 0.7},
mean: 0.87626, std: 0.00486, params: {'colsample_bytree': 0.8, 'subsample': 0.8},
mean: 0.87391, std: 0.00615, params: {'colsample_bytree': 0.8, 'subsample': 0.9},
mean: 0.87510, std: 0.00552, params: {'colsample_bytree': 0.9, 'subsample': 0.6},
mean: 0.87683, std: 0.00423, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
mean: 0.87554, std: 0.00568, params: {'colsample_bytree': 0.9, 'subsample': 0.8},
mean: 0.87475, std: 0.00692, params: {'colsample_bytree': 0.9, 'subsample': 0.9}]
{'colsample_bytree': 0.9, 'subsample': 0.7}
0.8768305808825803

  Narrow the range and step size and continue tuning:

# Finer search in steps of 0.05 around subsample=0.7 and colsample_bytree=0.9.
param_test5 = {
    'subsample': [i / 100.0 for i in range(65, 80, 5)],
    'colsample_bytree': [i / 100.0 for i in range(85, 100, 5)]
}

gsearch5 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27),
    param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)

gsearch5.fit(train[predictors], train[target])
print(gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_)

  The output is:

[mean: 0.87546, std: 0.00545, params: {'colsample_bytree': 0.85, 'subsample': 0.65},
mean: 0.87490, std: 0.00768, params: {'colsample_bytree': 0.85, 'subsample': 0.7},
mean: 0.87638, std: 0.00432, params: {'colsample_bytree': 0.85, 'subsample': 0.75},
mean: 0.87394, std: 0.00754, params: {'colsample_bytree': 0.9, 'subsample': 0.65},
mean: 0.87683, std: 0.00423, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
mean: 0.87501, std: 0.00628, params: {'colsample_bytree': 0.9, 'subsample': 0.75},
mean: 0.87359, std: 0.00838, params: {'colsample_bytree': 0.95, 'subsample': 0.65},
mean: 0.87566, std: 0.00544, params: {'colsample_bytree': 0.95, 'subsample': 0.7},
mean: 0.87497, std: 0.00711, params: {'colsample_bytree': 0.95, 'subsample': 0.75}]
{'colsample_bytree': 0.9, 'subsample': 0.7}
0.8768305808825803

  Judging from the output, 'colsample_bytree': 0.9, 'subsample': 0.7 really is the best combination.
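  For readers unsure what the range-based list comprehensions in the two grids expand to, this short sketch evaluates them directly. The variable names are hypothetical; the expressions are the ones used in param_test4 and param_test5.

```python
# Coarse grid: steps of 0.1 (used for both parameters in param_test4).
coarse = [i / 10.0 for i in range(6, 10)]

# Fine grids: steps of 0.05 centred on the coarse winner (param_test5).
fine_subsample = [i / 100.0 for i in range(65, 80, 5)]
fine_colsample = [i / 100.0 for i in range(85, 100, 5)]

print(coarse)          # [0.6, 0.7, 0.8, 0.9]
print(fine_subsample)  # [0.65, 0.7, 0.75]
print(fine_colsample)  # [0.85, 0.9, 0.95]

# Number of parameter combinations in each search.
print(len(coarse) ** 2, len(fine_subsample) * len(fine_colsample))  # 16 9
```

  Because the fine grid contains the coarse winner (0.7, 0.9) at its centre, the same best score of 0.87683 reappears in the second search, which confirms the two runs are consistent.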

Tuning reg_alpha

# Tune the L1 regularization term over several orders of magnitude.
param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5)

gsearch6.fit(train[predictors], train[target])
print(gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_)

  The output is:

[mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-05},
mean: 0.87650, std: 0.00473, params: {'reg_alpha': 0.01},
mean: 0.87489, std: 0.00610, params: {'reg_alpha': 0.1},
mean: 0.87533, std: 0.00635, params: {'reg_alpha': 1},
mean: 0.86921, std: 0.00367, params: {'reg_alpha': 100}]
{'reg_alpha': 1e-05}
0.8768305794218445

  Adjust the range and continue tuning:

[mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-06},
mean: 0.87683, std: 0.00423, params: {'reg_alpha': 5e-06},
mean: 0.87683, std: 0.00423, params: {'reg_alpha': 1e-05},
mean: 0.87683, std: 0.00423, params: {'reg_alpha': 5e-05},
mean: 0.87683, std: 0.00423, params: {'reg_alpha': 0.0001},
mean: 0.87689, std: 0.00421, params: {'reg_alpha': 0.0005},
mean: 0.87689, std: 0.00421, params: {'reg_alpha': 0.001},
mean: 0.87651, std: 0.00430, params: {'reg_alpha': 0.005}]
{'reg_alpha': 0.0005}
0.8768932960098768
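  The code for this second reg_alpha search is not shown. As a sketch under that caveat, the values in the output follow a 1-and-5 pattern per decade, which could be generated as below. The helper one_five_grid is a hypothetical name, and the exponent range is inferred from the printed results.

```python
def one_five_grid(min_exp, max_exp):
    """Return 1eK and 5eK for every exponent K from min_exp to max_exp inclusive."""
    grid = []
    for exp in range(min_exp, max_exp + 1):
        for mantissa in (1, 5):
            # Parsing the literal avoids rounding noise from computing 5 * 10 ** exp.
            grid.append(float(f"{mantissa}e{exp}"))
    return grid


reg_alpha_grid = one_five_grid(-6, -3)
print(reg_alpha_grid)
# [1e-06, 5e-06, 1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005]
```

  Searching on a logarithmic scale like this suits regularization strengths, whose effect tends to change with the order of magnitude rather than with small absolute steps.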

Tuning colsample_bylevel

# Tune colsample_bylevel over 0.6, 0.7, 0.8, 0.9 with reg_alpha fixed at 0.0005.
param_test8 = {
    'colsample_bylevel': [i / 100.0 for i in range(60, 100, 10)]
}

gsearch8 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=563, max_depth=3, min_child_weight=14, gamma=0.1,
                            subsample=0.7, colsample_bytree=0.9, objective='binary:logistic', nthread=4,
                            scale_pos_weight=1, reg_alpha=0.0005, seed=27),
    param_grid=param_test8, scoring='roc_auc', n_jobs=4, iid=False, cv=5)

gsearch8.fit(train[predictors], train[target])
print(gsearch8.grid_scores_, gsearch8.best_params_, gsearch8.best_score_)

  The output is:

[mean: 0.87460, std: 0.00497, params: {'colsample_bylevel': 0.6},
mean: 0.87499, std: 0.00494, params: {'colsample_bylevel': 0.7},
mean: 0.87338, std: 0.00823, params: {'colsample_bylevel': 0.8},
mean: 0.87343, std: 0.00807, params: {'colsample_bylevel': 0.9}]
{'colsample_bylevel': 0.7}
0.874990904057588
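  One thing worth checking here is that the grid stops at 0.9, so it never includes the XGBoost default of colsample_bylevel=1. The sketch below compares these results with the best score from the reg_alpha step, on the assumption that the earlier search used the same remaining parameters with colsample_bylevel left at its default; the variable names are hypothetical.

```python
# Best CV AUC from the reg_alpha search (colsample_bylevel assumed to be at its default of 1).
baseline_auc = 0.8768932960098768

# Mean CV AUC per colsample_bylevel value, copied from the printed output.
bylevel_auc = {0.6: 0.87460, 0.7: 0.87499, 0.8: 0.87338, 0.9: 0.87343}

best_value = max(bylevel_auc, key=bylevel_auc.get)
improved = {value: auc for value, auc in bylevel_auc.items() if auc > baseline_auc}

print(best_value)  # 0.7
print(improved)    # {}: no searched value beats the baseline
```

  Under that assumption, 0.7 is only the best of the values searched, and leaving colsample_bylevel at its default would score higher in local cross-validation.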

  After repeated testing, the highest-scoring model is still the one with these parameters:

params = {'booster': 'gbtree',
          'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'nthread': 4,
          'silent': 0,
          'eta': 0.01,
          'max_depth': 5,
          'min_child_weight': 1,
          'gamma': 0,
          'lambda': 1,
          'colsample_bylevel': 0.7,
          'colsample_bytree': 0.7,
          'subsample': 0.9,
          'scale_pos_weight': 1}
num_boost_round = 563
score = 0.7333  # online evaluation score
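  Note that this final parameter set uses the native xgboost names (eta, lambda), while the grid searches use the scikit-learn wrapper names (learning_rate, reg_lambda). As a sketch for moving between the two, the hypothetical helper below renames only the aliases I am sure of and passes every other key through unchanged.

```python
# Native xgboost parameter names and their scikit-learn wrapper equivalents.
NATIVE_TO_SKLEARN = {
    'eta': 'learning_rate',
    'lambda': 'reg_lambda',
    'alpha': 'reg_alpha',
}


def to_sklearn_kwargs(native_params, num_boost_round):
    """Rename known native keys and express the round count as n_estimators."""
    kwargs = {NATIVE_TO_SKLEARN.get(key, key): value for key, value in native_params.items()}
    kwargs['n_estimators'] = num_boost_round
    return kwargs


# A subset of the final parameters, enough to show the renaming.
native = {'eta': 0.01, 'max_depth': 5, 'lambda': 1, 'subsample': 0.9}
sk_kwargs = to_sklearn_kwargs(native, num_boost_round=563)

print(sk_kwargs)
# {'learning_rate': 0.01, 'max_depth': 5, 'reg_lambda': 1, 'subsample': 0.9, 'n_estimators': 563}
```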

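Conclusion

  Parameter tuning is a truly laborious process. Because my computer is not very powerful, it took two full days of running to tune all the parameters. Many of the changes raised the local AUC, yet the online test score actually dropped; the highest-scoring parameter set was still the one obtained by slightly adjusting the initial model.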

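  • Title: X-Data Data Engineering Basics Practice (Part 7)
  • Author: SiriYang
  • Created: 2020-02-05, 12:02
  • Modified: 2021-10-29, 18:10
  • Link: https://blog.siriyang.cn/posts/20200205125426id.html
  • License: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting!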