SiriBlog

siriyang's personal blog


X-Data Data Engineering Fundamentals Practice (10)

Posted on 2020-02-12, updated on 2021-10-29. Categories: Postgraduate entrance exam, Second-round exam
Word count: 6.5k. Reading time ≈ 6 minutes

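Main text

  Today I am going to try one more algorithm: GBDT (GBM).

Training

  After working through the previous two models, picking up a new one feels much the same: many of the parameters and interfaces are identical, so the implementation goes quickly.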

# -*- coding:utf-8 -*-
"""
@author:SiriYang
@file:gbdt_train.py
@time:2020/2/11 17:09
"""

import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics


def monitor(i, self, locals):
    # Early-stopping callback: fit() calls it after every boosting round and
    # stops training as soon as it returns True.
    # train_score_[i] is the training loss after round i (lower is better),
    # so stop once the loss is no lower than it was 50 rounds earlier.
    if i > 50:
        return self.train_score_[i] >= self.train_score_[i - 50]
    return False


def model_gbdt(train, validate, test):
    y = train['label']
    X = train.drop(['User_id', 'Coupon_id', 'Date_received', 'label'], axis=1)

    val_y = validate['label']
    val_X = validate.drop(['User_id', 'Coupon_id', 'Date_received', 'label'], axis=1)

    test_X = test.drop(['User_id', 'Coupon_id', 'Date_received'], axis=1)

    # Train
    gbr = GradientBoostingClassifier(n_estimators=60, learning_rate=0.1, max_depth=6, min_samples_split=400,
                                     min_samples_leaf=18, verbose=2)
    gbr.fit(X, y, monitor=monitor)

    # Predict
    y_pred = gbr.predict(X)
    y_proba = gbr.predict_proba(X)[:, 1]
    val_proba = gbr.predict_proba(val_X)[:, 1]
    predict = gbr.predict_proba(test_X)[:, 1]
    print("Accuracy:%.4f" % metrics.accuracy_score(y, y_pred))
    print("AUC Score(Train):%f" % metrics.roc_auc_score(y, y_proba))
    # The validation set was loaded but never scored; report its AUC as well
    print("AUC Score(Validate):%f\n" % metrics.roc_auc_score(val_y, val_proba))

    # Assemble the submission: id columns plus the predicted probability
    predict = pd.DataFrame(predict, columns=['prob'])
    result = pd.concat([test[['User_id', 'Coupon_id', 'Date_received']], predict], axis=1)

    feat_imp = pd.Series(gbr.feature_importances_, X.columns).sort_values(ascending=False)

    return result, feat_imp


if __name__ == "__main__":
    start = datetime.now()
    print(start.strftime('%Y-%m-%d %H:%M:%S'))

    train = pd.read_csv(r'./prepared_dataset/train.csv')
    validate = pd.read_csv("./prepared_dataset/validate.csv")
    test = pd.read_csv("./prepared_dataset/test.csv")

    # Train
    result, feat_importance = model_gbdt(train, validate, test)

    # Save (the output directory must already exist)
    result.to_csv(r'./output_files/gbdt/' + datetime.now().strftime('%d_%H%M') + '_test.csv', index=False, header=False)
    feat_importance.to_csv(r'./output_files/gbdt/' + datetime.now().strftime('%d_%H%M') + '_feat_importance.csv')

    print(feat_importance)
    # feat_importance.plot(kind='bar')
    # plt.show()

    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    # total_seconds() also counts whole days, unlike the .seconds attribute
    print('time cost: %d min' % ((datetime.now() - start).total_seconds() // 60))

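Parameter tuning

Reference articles:

  • Machine Learning Series (11): A Detailed Guide to Tuning Gradient Boosting Machine (GBM) in Python
  • GBDT Parameter Tuning (Python 3.7)

Number of iterations

  Unlike the previous two models, whose training interface takes an early-stopping argument, fit() here only accepts a monitor callback that you have to implement yourself, as in the training code above.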
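  Besides the monitor callback, scikit-learn 0.20 and later also expose early stopping through the constructor parameters n_iter_no_change, validation_fraction and tol. The sketch below is a minimal illustration of that route, not the setup used in this post: the data comes from make_classification as a stand-in for the prepared training set, and the parameter values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the prepared training set (assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=10)

gbr = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=20,      # stop once the held-out loss stalls for 20 rounds
    tol=1e-4,                 # minimum improvement that still counts as progress
    random_state=10,
)
gbr.fit(X, y)

# n_estimators_ is the number of rounds that were actually fitted.
print("rounds used:", gbr.n_estimators_)
```

  Unlike a monitor that watches train_score_, this variant judges progress on held-out data, at the cost of training on fewer samples.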

import pandas as pd
from datetime import datetime
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":
    start = datetime.now()
    print(start.strftime('%Y-%m-%d %H:%M:%S'))

    train = pd.read_csv(r'./prepared_dataset/train.csv')

    y = train['label']
    X = train.drop(['User_id', 'Coupon_id', 'Date_received', 'label'], axis=1)

    # Number of iterations (n_estimators); the grid overrides the value set on the estimator
    param_test1 = {'n_estimators': range(20, 81, 10)}
    # The iid argument was removed in scikit-learn 0.24, so it is no longer passed
    grid_search1 = GridSearchCV(
        estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=100,
                                             min_samples_split=400, min_samples_leaf=18, max_depth=6,
                                             max_features='sqrt', subsample=0.8, random_state=10, verbose=2),
        param_grid=param_test1,
        scoring='roc_auc',
        cv=5
    )
    grid_result1 = grid_search1.fit(X, y)
    # Print the results
    print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
    means = grid_result1.cv_results_['mean_test_score']
    params = grid_result1.cv_results_['params']
    for mean, param in zip(means, params):
        print("mean: %f , params: %r" % (mean, param))

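  The grid search picked 70 as the best value, but 60 scored better when submitted online, so I will fine-tune this again once all the other parameters are settled.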
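  One way to look at candidate round counts without refitting the model for each value is staged_predict_proba, which yields the predictions after every boosting round. The sketch below is only an illustration under assumptions: it uses synthetic data and a hypothetical local hold-out split rather than the online leaderboard, so it would not necessarily reproduce the 60 versus 70 difference seen here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data, split into train and hold-out parts.
X, y = make_classification(n_samples=2000, n_features=20, random_state=10)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=10, stratify=y)

gbr = GradientBoostingClassifier(
    n_estimators=80, learning_rate=0.1, max_depth=3, random_state=10)
gbr.fit(X_tr, y_tr)

# staged_predict_proba yields the class probabilities after each boosting round,
# so one fit gives the hold-out AUC for every n_estimators from 1 to 80.
val_auc = [roc_auc_score(y_val, proba[:, 1])
           for proba in gbr.staged_predict_proba(X_val)]

best_round = int(np.argmax(val_auc)) + 1
print("best n_estimators on hold-out: %d (AUC %.4f)"
      % (best_round, val_auc[best_round - 1]))
```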

Maximum tree depth (max_depth) & minimum samples required to split an internal node (min_samples_split)

# Continues the tuning script above: X, y and the imports are reused.
# max_depth on the estimator is only a placeholder; the grid overrides it.
param_test2 = {'max_depth': range(3, 8, 2), 'min_samples_split': range(100, 801, 200)}
grid_search2 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1,
                                         n_estimators=100, min_samples_leaf=18, max_depth=6,
                                         max_features='sqrt', subsample=0.8, random_state=10, verbose=2),
    param_grid=param_test2,
    scoring='roc_auc',
    cv=5
)
grid_result2 = grid_search2.fit(X, y)
# Print the results
print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
means = grid_result2.cv_results_['mean_test_score']
params = grid_result2.cv_results_['params']
for mean, param in zip(means, params):
    print("mean: %f , params: %r" % (mean, param))

Minimum samples required to split an internal node (min_samples_split) & minimum samples per leaf (min_samples_leaf)

# Continues the tuning script above: X, y and the imports are reused.
param_test3 = {'min_samples_split': range(400, 801, 200), 'min_samples_leaf': range(10, 41, 10)}
grid_search3 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1,
                                         n_estimators=70, max_depth=6,
                                         max_features='sqrt', subsample=0.8, random_state=10, verbose=2),
    param_grid=param_test3,
    scoring='roc_auc',
    cv=5
)
grid_result3 = grid_search3.fit(X, y)
# Print the results
print("Best: %f using %s" % (grid_result3.best_score_, grid_result3.best_params_))
means = grid_result3.cv_results_['mean_test_score']
params = grid_result3.cv_results_['params']
for mean, param in zip(means, params):
    print("mean: %f , params: %r" % (mean, param))

Maximum number of features (max_features)

# Continues the tuning script above: X, y and the imports are reused.
param_test4 = {'max_features': range(7, 20, 2)}
# param_test4 = {'max_features': [1,2,3]}
grid_search4 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1,
                                         n_estimators=70, max_depth=6, min_samples_leaf=18,
                                         min_samples_split=400, subsample=0.8, random_state=10, verbose=2),
    param_grid=param_test4,
    scoring='roc_auc',
    cv=5
)
grid_result4 = grid_search4.fit(X, y)
# Print the results
print("Best: %f using %s" % (grid_result4.best_score_, grid_result4.best_params_))
means = grid_result4.cv_results_['mean_test_score']
params = grid_result4.cv_results_['params']
for mean, param in zip(means, params):
    print("mean: %f , params: %r" % (mean, param))
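  All four searches end with the same reporting loop. As a possible simplification, the sketch below collects cv_results_ into a ranked table. summarize_search is a hypothetical helper name, and the small synthetic dataset and grid are assumptions chosen only so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


def summarize_search(search):
    """Return one row per parameter combination, best mean CV score first."""
    results = pd.DataFrame(search.cv_results_)
    columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    return results[columns].sort_values("rank_test_score").reset_index(drop=True)


# Small synthetic problem and grid, used here only for illustration.
X, y = make_classification(n_samples=600, n_features=10, random_state=10)
search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=10),
    param_grid={"max_depth": [2, 3], "min_samples_leaf": [5, 20]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

summary = summarize_search(search)
print(summary)
```

  The std_test_score column shows how much the score varies across folds, which helps judge whether the gap between two settings is meaningful.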

  Finally, I lowered the learning rate and increased the number of training rounds; the highest score obtained was 0.7801.

gbr = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    min_samples_split=400,
    min_samples_leaf=18,
    random_state=10,
    verbose=2)

  I then used a loop to try blending the three models and found the best blend ratio to be LightGBM x 0.5 + GBDT x 0.5, for a final score of 0.7896.
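  The blending loop itself is not shown in the post. A minimal sketch of such a weight search might look like the following, assuming a labelled validation set is available for scoring. best_blend, the model names and the synthetic scores are all hypothetical stand-ins for the real model outputs.

```python
import itertools

import numpy as np
from sklearn.metrics import roc_auc_score


def best_blend(y_true, preds, step=0.1):
    """Try every weight vector on a grid that sums to 1; return (auc, weights)."""
    names = list(preds)
    n_steps = int(round(1 / step))
    best_auc, best_weights = -1.0, None
    for combo in itertools.product(range(n_steps + 1), repeat=len(names)):
        if sum(combo) != n_steps:  # keep only weights that sum to exactly 1
            continue
        weights = [c / n_steps for c in combo]
        blended = sum(w * preds[name] for w, name in zip(weights, names))
        auc = roc_auc_score(y_true, blended)
        if auc > best_auc:
            best_auc, best_weights = auc, dict(zip(names, weights))
    return best_auc, best_weights


# Synthetic validation labels and noisy scores stand in for real predictions.
rng = np.random.default_rng(10)
y_val = rng.integers(0, 2, size=1000)
preds = {
    "model_a": y_val + rng.normal(0, 1.0, size=1000),
    "model_b": y_val + rng.normal(0, 1.2, size=1000),
    "model_c": y_val + rng.normal(0, 2.0, size=1000),
}

best_auc, best_weights = best_blend(y_val, preds)
print(best_auc, best_weights)
```

  A weight of 0 drops a model from the blend entirely, which matches the outcome above where only two of the three models remain in the final ratio.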

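  • Title: X-Data Data Engineering Fundamentals Practice (10)
  • Author: SiriYang
  • Created: 2020-02-12, 10:02
  • Modified: 2021-10-29, 18:10
  • Link: https://blog.siriyang.cn/posts/20200212105301id.html
  • License: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting!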