
Next, we move on to the experiments for Task 2 of Phase 2.

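There are several directions for improving on the previous experiments:

Continue feature engineering: add new features, deliberately overfit to obtain feature-importance data, then keep the important features and drop the useless ones while keeping AUC unchanged.
Parameter tuning. This approach is limited, though: it only brings small gains, not a qualitative leap.
Train with other algorithms and ensemble the models.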
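
The first direction, pruning features by their importance, might look like the sketch below. It is a minimal illustration under assumptions: importances are given as a name-to-score mapping, and `select_features`, the `min_importance` threshold, and the sample feature names are all hypothetical rather than taken from the original code.

```python
def select_features(importance, min_importance=1):
    """Return feature names whose importance is at least min_importance,
    ordered from most to least important."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= min_importance]


# Hypothetical importances (for example, split counts from a trained model).
importance = {"discount_rate": 120, "distance": 45, "weekday": 0, "user_buy_count": 80}
kept = select_features(importance, min_importance=1)
print(kept)
```

After dropping features this way, the model would be retrained to check that the validation AUC has not fallen.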

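I decided to try model ensembling first, since it is the approach most likely to pay off immediately. After searching Baidu for material, I chose LightGBM as the second algorithm. Like XGBoost, it is a gradient-boosted decision tree implementation, with optimizations such as histogram-based split finding and leaf-wise tree growth. How much better it would be could only be settled by trying it.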

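Training

Chinese documentation: LightGBM 中文文档
The implementation follows this blog post: 【集成学习】lightgbm使用案例

The feature engineering was already done, so I could reuse the prepared data directly, and the implementation went quickly.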

# -*- coding:utf-8 -*-
"""
@author:SiriYang
@file:lgbm_train.py
@time:2020/2/10 13:59
"""

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb


def model_lgbm(train, validate, test):

    y = train['label']
    X = train.drop(['User_id', 'Coupon_id', 'Date_received', 'label'],axis=1)

    val_y=validate['label']
    val_X=validate.drop(['User_id', 'Coupon_id', 'Date_received', 'label'],axis=1)

    test_X=test.drop(['User_id', 'Coupon_id', 'Date_received'],axis=1)


    # Wrap the data in LightGBM's Dataset format
    lgb_train = lgb.Dataset(X, y, free_raw_data=False)
    lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train, free_raw_data=False)

    # Parameters as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',  # boosting type
        'objective': 'regression',  # objective function
        'metric': {'l2', 'auc'},  # evaluation metrics
        'num_leaves': 31,  # maximum number of leaves per tree
        'learning_rate': 0.05,  # learning rate
        'feature_fraction': 0.9,  # fraction of features sampled for each tree
        'bagging_fraction': 0.8,  # fraction of rows sampled for bagging
        'bagging_freq': 5,  # k means bagging is performed every k iterations
        'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
    }

    # Train with early stopping on the validation set.
    # (early_stopping_rounds is accepted by the LightGBM releases of early 2020;
    # LightGBM 4.x replaced it with callbacks=[lgb.early_stopping(50)].)
    print('Start training...')
    gbm = lgb.train(params, lgb_train, num_boost_round=3000, valid_sets=lgb_eval, early_stopping_rounds=50)

    # Save the model to a file
    print('Save model...')
    gbm.save_model('./model/lgbm/' + datetime.now().strftime('%d_%H%M') + '_model.txt')

    # Predict with the best iteration found by early stopping
    print('Start predicting...')
    predict = gbm.predict(test_X, num_iteration=gbm.best_iteration)
    # Attach the predictions to the identifying columns
    predict = pd.DataFrame(predict, columns=['prob'])
    result = pd.concat([test[['User_id', 'Coupon_id', 'Date_received']], predict], axis=1)

    # Feature importance, sorted from most to least important
    feat_importance = pd.DataFrame(X.columns.tolist(), columns=['feature_name'])
    feat_importance['importance'] = list(gbm.feature_importance())
    feat_importance = feat_importance.sort_values(by='importance', ascending=False)

    return result, feat_importance


if __name__ == "__main__":
    start = datetime.now()
    print(start.strftime('%Y-%m-%d %H:%M:%S'))

    train = pd.read_csv(r'./prepared_dataset/train.csv')
    validate = pd.read_csv("./prepared_dataset/validate.csv")
    test = pd.read_csv("./prepared_dataset/test.csv")

    # Train
    result, feat_importance = model_lgbm(train, validate, test)

    # Save
    result.to_csv(r'./output_files/lgbm/' + datetime.now().strftime('%d_%H%M') + '_test.csv', index=False, header=False)
    feat_importance.to_csv(r'./output_files/lgbm/' + datetime.now().strftime('%d_%H%M') + '_feat_importance.csv')

    print(feat_importance)
    feat_importance.plot(kind='bar')
    plt.show()

    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    print('time cost is: %d min' % ((datetime.now() - start).seconds // 60))

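To my surprise, training was extremely fast: it finished in under half a minute, stopping early after only 102 rounds. The parameters were copied as-is and not yet tuned, so I expected a poor submission, but it scored 0.7711. XGBoost needed about half an hour of training to reach that score, so LightGBM's reputation seems well earned.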

Tuning

Next comes parameter tuning, to see how far a single model can go.
Reference blog post: LightGBM 调参方法（具体操作）

Tuning the number of boosting rounds (n_estimators)

# -*- coding:utf-8 -*-
"""
@author:SiriYang
@file:lgbm_modify.py
@time:2020/2/10 15:28
"""

from datetime import datetime
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":
    start = datetime.now()
    print(start.strftime('%Y-%m-%d %H:%M:%S'))

    train = pd.read_csv(r'./prepared_dataset/train.csv')

    y = train['label']
    X = train.drop(['User_id', 'Coupon_id', 'Date_received', 'label'], axis=1)

    # Tune n_estimators
    params = {'boosting_type': 'gbdt',
              'objective': 'regression',
              'learning_rate': 0.1,
              'metric': {'l2', 'auc'},
              'num_leaves': 31,
              'max_depth': 6,
              'subsample': 0.8,  # alias of bagging_fraction
              'colsample_bytree': 0.8,  # alias of feature_fraction
              'feature_fraction': 0.9,  # fraction of features sampled for each tree
              'bagging_fraction': 0.8,  # fraction of rows sampled for bagging
              'bagging_freq': 5,  # k means bagging is performed every k iterations
              'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
              }

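    data_train = lgb.Dataset(X, y, free_raw_data=False)
    # 5-fold CV with early stopping; the returned history ends at the best round.
    # (Arguments as accepted by the LightGBM releases of early 2020.)
    cv_results = lgb.cv(
        params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True, metrics='auc',
        early_stopping_rounds=50, verbose_eval=50, show_stdv=True, seed=0)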

    print('best n_estimators:', len(cv_results['auc-mean']))
    print('best cv score:', cv_results['auc-mean'][-1])

    ...

    print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    print('time cost is: %d min' % ((datetime.now() - start).seconds // 60))

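First, set the learning rate somewhat high (0.1) and find the best number of boosting rounds:

best n_estimators: 340
best cv score: 0.8918884497963833

This value is then used in the tuning steps that follow.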
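
The reason `len(cv_results['auc-mean'])` gives the best number of rounds is that, with early stopping, the metric history returned by `lgb.cv` is cut off at the best round, so its length is the round count and its last entry is the best score. The pure-Python sketch below illustrates only the stopping rule. `best_round` and the toy AUC curve are hypothetical, and LightGBM's own implementation covers more cases (several metrics, metrics that are minimized).

```python
def best_round(scores, patience=50):
    """Return (best_round, best_score) for a metric where higher is better.

    Scanning stops once `patience` consecutive rounds fail to improve on the
    best score seen so far. Rounds are counted from 1.
    """
    best_idx, best_score = -1, float("-inf")
    for i, score in enumerate(scores):
        if score > best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break
    return best_idx + 1, best_score


# Toy AUC curve: improves for four rounds, then stays below its peak.
curve = [0.70, 0.80, 0.85, 0.89, 0.88, 0.88, 0.87, 0.88]
rounds, score = best_round(curve, patience=3)
print(rounds, score)
```

In the run above the same rule was applied with a patience of 50 rounds on the mean AUC across the five folds.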

Tuning max_depth and num_leaves

# Tune max_depth and num_leaves
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=31,
                              learning_rate=0.1, n_estimators=340, max_depth=6, subsample=0.8, colsample_bytree=0.8,
                              metric='auc', bagging_fraction=0.8, feature_fraction=0.9, bagging_freq=5, verbose=-1)

params_test1 = {'max_depth': range(3, 8, 2), 'num_leaves': range(20, 170, 30)
                }
gsearch1 = GridSearchCV(estimator=model_lgb, param_grid=params_test1, scoring='roc_auc', cv=5,
                        verbose=0, n_jobs=4)
gsearch1.fit(X, y)
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate scores
print(gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_)

The search was rerun with adjusted grids; the best parameters and CV scores of each run were:


{'max_depth': 7, 'num_leaves': 20}
0.8879466433288299


{'max_depth': 12, 'num_leaves': 15}
0.8890450909039307


{'max_depth': 13, 'num_leaves': 18}
0.8894765240894891

The online score rose to 0.7731.
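
To get a feel for what the first grid costs, and for how max_depth and num_leaves interact, the sketch below enumerates the grid `range(3, 8, 2)` x `range(20, 170, 30)` used in the first search. It relies on one general fact about binary trees (a tree of depth d has at most 2**d leaves); the variable names are illustrative only.

```python
from itertools import product

max_depths = range(3, 8, 2)       # 3, 5, 7
num_leaves = range(20, 170, 30)   # 20, 50, 80, 110, 140

grid = list(product(max_depths, num_leaves))
n_fits = len(grid) * 5  # 5-fold cross-validation

# A binary tree of depth d has at most 2**d leaves, so a larger num_leaves
# cannot take effect at that depth.
effective = {(d, min(n, 2 ** d)) for d, n in grid}

print(len(grid), n_fits, len(effective))
```

Of the 15 candidates, only 8 differ once the depth cap is taken into account, which suggests choosing num_leaves values below 2**max_depth when both are searched together. GridSearchCV also refits the best candidate on the full data by default, adding one more fit to the 75 cross-validation fits.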

Tuning min_data_in_leaf (min_child_samples) and min_sum_hessian_in_leaf (min_child_weight)

params_test3 = {'min_child_samples': [18, 19, 20, 21, 22], 'min_child_weight': [0.001, 0.002]
                }
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=18, learning_rate=0.1, n_estimators=340,
                              max_depth=13, subsample=0.8, colsample_bytree=0.8, metric='auc', bagging_fraction=0.8,
                              feature_fraction=0.9, bagging_freq=5, verbose=-1)
gsearch3 = GridSearchCV(estimator=model_lgb, param_grid=params_test3, scoring='roc_auc', cv=5,
                        verbose=0, n_jobs=4)
gsearch3.fit(X, y)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
print(gsearch3.cv_results_['mean_test_score'], gsearch3.best_params_, gsearch3.best_score_)



{'min_child_samples': 20, 'min_child_weight': 0.001}
 0.8894765240894891

Tuning feature_fraction and bagging_fraction

params_test4 = {'feature_fraction': [0.5, 0.6, 0.7, 0.8, 0.9], 'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0]
                }
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=18, learning_rate=0.1, n_estimators=340,
                              max_depth=13, subsample=0.8, colsample_bytree=0.8, metric='auc', bagging_fraction=0.8,
                              feature_fraction=0.9, min_child_samples=20, min_child_weight=0.001, bagging_freq=5, verbose=-1)
gsearch4 = GridSearchCV(estimator=model_lgb, param_grid=params_test4, scoring='roc_auc', cv=5,
                        verbose=1, n_jobs=4)
gsearch4.fit(X, y)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
print(gsearch4.cv_results_['mean_test_score'], gsearch4.best_params_, gsearch4.best_score_)



{'bagging_fraction': 1.0, 'feature_fraction': 0.8}
0.8895620932443743

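# Finer search over feature_fraction around the coarse optimum 0.8.
# Assumed setup: same model as the previous step with bagging_fraction=1.0;
# the grid values are illustrative and include the reported optimum 0.82.
params_test5 = {'feature_fraction': [0.72, 0.75, 0.78, 0.8, 0.82, 0.85, 0.88]
                }
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=18, learning_rate=0.1, n_estimators=340,
                              max_depth=13, subsample=0.8, colsample_bytree=0.8, metric='auc', bagging_fraction=1.0,
                              min_child_samples=20, min_child_weight=0.001, bagging_freq=5, verbose=-1)
gsearch5 = GridSearchCV(estimator=model_lgb, param_grid=params_test5, scoring='roc_auc', cv=5,
                        verbose=1, n_jobs=4)
gsearch5.fit(X, y)
print(gsearch5.cv_results_['mean_test_score'], gsearch5.best_params_, gsearch5.best_score_)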



{'feature_fraction': 0.82}
0.8895911006088235

Regularization parameters

params_test6 = {'reg_alpha': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5],
                'reg_lambda': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]
                }
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=18, learning_rate=0.1, n_estimators=340,
                              max_depth=13, subsample=0.8, colsample_bytree=0.8, metric='auc', bagging_fraction=1.0,
                              feature_fraction=0.82, min_child_samples=20, min_child_weight=0.001, bagging_freq=5,
                              verbose=-1)
gsearch6 = GridSearchCV(estimator=model_lgb, param_grid=params_test6, scoring='roc_auc', cv=5,
                        verbose=1, n_jobs=4)
gsearch6.fit(X, y)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
print(gsearch6.cv_results_['mean_test_score'], gsearch6.best_params_, gsearch6.best_score_)

The search was rerun with adjusted grids; the best parameters and CV scores of each run were:


{'reg_alpha': 0.5, 'reg_lambda': 0.5}
0.8904085279222402


{'reg_alpha': 0.6, 'reg_lambda': 0.6}
0.8904402510041289


{'reg_alpha': 0.58, 'reg_lambda': 0.56}
0.8905777734829009

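After all this lengthy tuning the result was disappointing: the best online score was only 0.7703.

Model ensembling

Next I tried model ensembling. I took the best predictions from each of the two models and blended them linearly, looping over the blend weights in steps of 0.1. The best submission scored 0.7770, with an equal-weight average of XGBoost x 0.5 + LightGBM x 0.5.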
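
The weight loop described above can be sketched in pure Python as follows. One assumption differs from the original procedure: here each blend is scored by AUC on a locally held labelled set, whereas the scores in this post come from online submissions. `auc`, `best_blend`, and the toy labels and predictions are hypothetical.

```python
def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_blend(labels, pred_a, pred_b, steps=10):
    """Try weights w = 0.0, 0.1, ..., 1.0 for pred_a (and 1 - w for pred_b);
    return (best_weight, best_auc), keeping the first weight on ties."""
    best = (None, -1.0)
    for k in range(steps + 1):
        w = k / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        score = auc(labels, blended)
        if score > best[1]:
            best = (w, score)
    return best


# Toy data: each model alone misranks one pair, but they err on different pairs.
labels = [1, 0, 1, 0]
pred_a = [0.9, 0.6, 0.4, 0.1]
pred_b = [0.3, 0.2, 0.8, 0.7]
w, score = best_blend(labels, pred_a, pred_b)
print(w, score)
```

Blending helps most when the two models make different mistakes, as in the toy data, where each model alone reaches an AUC of 0.75.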

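2020.2.11 update

Reading the code this morning, I realized I had made a silly mistake: I had copied someone else's code as-is, and it uses regression as the objective, whereas my task is binary classification (binary). After switching the objective and retraining, the single model scored 0.7772, although the best ensemble score was only 0.7753.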
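
To see why the objective matters, the sketch below compares the training signal of the two objectives for the same raw score and label. These are the textbook gradient and Hessian of squared error and of log loss with a sigmoid link; the function names are hypothetical and LightGBM's internal implementation may differ in scaling details.

```python
import math


def l2_grad_hess(raw, y):
    """Squared-error objective 0.5 * (raw - y)**2: the raw score is the prediction."""
    return raw - y, 1.0


def binary_grad_hess(raw, y):
    """Log-loss objective: the raw score is passed through a sigmoid first."""
    p = 1.0 / (1.0 + math.exp(-raw))
    return p - y, p * (1.0 - p)


# Same raw score and label, different training signal.
g_l2, h_l2 = l2_grad_hess(2.0, 1)
g_bin, h_bin = binary_grad_hess(2.0, 1)
print(g_l2, h_l2)    # positive gradient: the score is pushed back down towards 1
print(g_bin, h_bin)  # negative gradient: the score is still pushed up
```

With the regression objective a confident positive is penalized for overshooting the label, and predictions are not constrained to the range 0 to 1. Since AUC depends only on the ranking of the predictions, the regression model still scored reasonably, but the binary objective matches the task.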

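Conventional wisdom says a lower learning rate is better, but for me the lower it went the worse the model got, so I raised it instead, and the result improved to 0.7824. Blending XGBoost x 0.2 + LightGBM x 0.8 then gave a best score of 0.7863.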

params = {
    'task': 'train',
    'boosting_type': 'gbdt',  # boosting type
    'objective': 'binary',  # objective function
    'metric': {'l2', 'auc'},  # evaluation metrics
    'max_depth': 3,
    'num_leaves': 18,  # maximum number of leaves (a depth-3 tree has at most 8)
    'learning_rate': 0.1,  # learning rate
    'feature_fraction': 0.9,  # fraction of features sampled for each tree
    'bagging_fraction': 0.8,  # fraction of rows sampled for bagging
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'bagging_freq': 5,  # k means bagging is performed every k iterations
    'verbose': 1  # <0: fatal only, =0: errors (warnings), >0: info
}