『飞桨领航团AI达人创造营』基于LGBM的模型预测借款人是否能按期还款
本文介绍基于LGBM模型预测借款人还款情况的项目。用40000条训练集和15000条测试集数据,经数据加载、预处理和特征工程,构建如贷款金额与缴存额组合等特征。选择LGBM算法,通过多轮参数调优,采用分层K折交叉验证训练,最终模型AUC均值较高,可辅助银行判断借款人信用。
利用轻量级LGBM模型学习纯数据,从中归纳出优秀的特征,能够判断一个人是否会按期还款。
一、项目背景1、代替人为判断是否能贷款给某人 2、减少传统人为贷款,暗箱操作 3、帮助银行找出信用更优质的人群
二、数据集简介已收集的数据集数量: 训练集提供40000名,测试集提供15000名的缴存人基本信息、缴存信息,贷款信息 。
train_df = df[df['label'].isna() == False].reset_index(drop=True)test_df = df[df['label'].isna() == True].reset_index(drop=True)display(train_df.shape, test_df.shape)登录后复制登录后复制
训练集样本量: (40000, 1093),验证集样本量: (15000, 1093)
2.数据集查看def get_daikuanYE(df,col): df[col + '_genFeat1'] = (df[col] > 100000).astype(int) df[col + '_genFeat2'] = (df[col] > 120000).astype(int) df[col + '_genFeat3'] = (df[col] > 140000).astype(int) df[col + '_genFeat4'] = (df[col] > 180000).astype(int) df[col + '_genFeat5'] = (df[col] > 220000).astype(int) df[col + '_genFeat6'] = (df[col] > 260000).astype(int) df[col + '_genFeat7'] = (df[col] > 300000).astype(int) return df, [col + f'_genFeat{i}' for i in range(1, 8)]df, genFeats2 = get_daikuanYE(df, col = 'DKYE')df, genFeats3 = get_daikuanYE(df, col = 'DKFFE')plt.figure(figsize = (8, 2))plt.subplot(1,2,1)sns.distplot(df['DKYE'][df['label'] == 1])plt.subplot(1,2,2)sns.distplot(df['DKFFE'][df['label'] == 1])登录后复制登录后复制
三、模型选择和开发
详细说明你使用的算法。此处可细分,如下所示:
1.特征工程# df['GRYJCE_sum_DWYJCE']= (df['GRYJCE']+df['DWYJCE'])*12*(df['DKLL']+1) #贷款每年的还款额# df['GRZHDNGJYE_GRZHSNJZYE']=(df['GRZHDNGJYE']+df['GRZHSNJZYE']+df['GRZHYE'])-df['GRYJCE_sum_DWYJCE']# df['DWJJLX_DWYSSHY']=df['DWJJLX'] *df['DWSSHY'] #单位经济体行业*单位经济体行业# df['XINGBIEDKYE'] = df['XINGBIE'] * df['DKYE']# df['m2'] = (df['DKYE'] - ((df['GRYJCE'] + df['DWYJCE'] ) * 12) + df['GRZHDNGJYE'] ) / 12# df['KDKZGED'] = df['m2'] * (df['GRYJCE'] + df['DWYJCE'] )# gen_feats = ['DKFFE_multi_DKLL','DKFFE_DKYE_DKFFE','DWYSSHY2GRYJCE','DWYSSHY2DWYJCE','ZHIYE_GRZHZT','GRZHDNGJYE_GRZHSNJZYE']df['DWYSSHY2GRYJCE']=df['DWSSHY'] * df['DWSSHY']*df['GRYJCE'] #好gen_feats = ['DWYSSHY2GRYJCE']df.head()登录后复制
2.模型介绍xgboost的出现,让数据民工们告别了传统的机器学习算法们:RF、GBM、SVM、LASSO……..。 顾名思义,lightGBM包含两个关键点:light即轻量级,GBM 梯度提升机。LightGBM 是一个梯度 boosting 框架,使用基于学习算法的决策树。是分布式的,高效的,有以下优势:更快的训练效率低内存使用更高的准确率支持并行化学习可处理大规模数据
概括来说,lightGBM主要有以下特点:
基于Histogram的决策树算法带深度限制的Leaf-wise的叶子生长策略直方图做差加速直接支持类别特征(Categorical Feature)Cache命中率优化基于直方图的稀疏特征优化多线程优化3.模型训练oof = np.zeros(train_df.shape[0])# feat_imp_df = pd.DataFrame({'feat': cols, 'imp': 0})test_df['prob'] = 0clf = LGBMClassifier( # 0.05--0.1 learning_rate=0.07, # 1030 # 1300 n_estimators=1030, # 31 # 35 # 37 # 40 # (0.523177, 0.93799) 38 #(0.519115, 0.93587) 39 num_leaves=37, subsample=0.8, # 0.8 # 0.85 colsample_bytree=0.8, random_state=11, is_unbalace=True, sample_pos_weight=13 # learning_rate=0.066,#学习率 # n_estimators=1032,#拟合的树的棵树,相当于训练轮数 # num_leaves=38,#树的最大叶子数,对比xgboost一般为2^(max_depth) # subsample=0.85,#子样本频率 # colsample_bytree=0.85, #训练特征采样率列 # random_state=17, #随机种子数 # reg_lambda=1e-1, #L2正则化系数 # # min_split_gain=0.2#最小分割增益 # learning_rate=0.07,#学习率 # n_estimators=1032,#拟合的树的棵树,相当于训练轮数 # num_leaves=37,#树的最大叶子数,对比xgboost一般为2^(max_depth) # subsample=0.8,#子样本频率 # colsample_bytree=0.8, #训练特征采样率 列 # random_state=17, #随机种子数 # silent=True , #训练过程是否打印日志信息 # min_split_gain=0.05 ,#最小分割增益 # is_unbalace=True, # sample_pos_weight=13)登录后复制
--------------------- 0 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200] valid_0's auc: 0.944549 valid_0's binary_logloss: 0.110362
Early stopping, best iteration is:
[173] valid_0's auc: 0.944278 valid_0's binary_logloss: 0.1097
--------------------- 1 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200] valid_0's auc: 0.943315 valid_0's binary_logloss: 0.113508
Early stopping, best iteration is:
[161] valid_0's auc: 0.943045 valid_0's binary_logloss: 0.113012
4.模型预测使用model.predict接口来完成对大量数据集的批量预测。
val_aucs = []seeds = [11,22,33]for seed in seeds: skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed) for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df['label'])): print('--------------------- {} fold ---------------------'.format(i)) t = time.time() trn_x, trn_y = train_df[cols].iloc[trn_idx].reset_index(drop=True), train_df['label'].values[trn_idx] val_x, val_y = train_df[cols].iloc[val_idx].reset_index(drop=True), train_df['label'].values[val_idx] clf.fit( trn_x, trn_y, eval_set=[(val_x, val_y)], # categorical_feature=cate_cols, eval_metric='auc', early_stopping_rounds=200, verbose=200 ) # feat_imp_df['imp'] += clf.feature_importances_ / skf.n_splits oof[val_idx] = clf.predict_proba(val_x)[:, 1] test_df['prob'] += clf.predict_proba(test_df[cols])[:, 1] / skf.n_splits / len(seeds) cv_auc = roc_auc_score(train_df['label'], oof) val_aucs.append(cv_auc) print('\ncv_auc: ', cv_auc)print(val_aucs, np.mean(val_aucs))登录后复制
四、总结与升华
心得:特征工程非常重要,大部分提升都是在特征工程的特征组合上,然后就是参数调优。我尝试了所有的参数,调参需要一一做对比实验,千万别同时改两个参数,然后正则化是一般是防止过拟合的,没搞清楚是否过拟合时不要用,不然大概率降低,找到关键的一些参数,能够强迫模型有提升的,比如上述min_split_gain=0.05 ,#最小分割增益,学习率由大往小了调整,多尝试不同的参数组合会有意想不到的效果。
个人简介AI达人创造营作业:模型训练和调参我在AI Studio上获得钻石等级,点亮10个徽章,来互关呀~ Alchemist_W 点我互关
当数据集准备好后,就需要进行愉快的训练了,在这里 Paddle 为大家提供了很多方便的套件,大大缩短开发者的开发时间,提高了开发效率
综合套件:PaddleHub图像分类:PaddleClas目标检测:PaddleDetection图像分割:PaddleSeg文字识别:PaddleOCR更多的套件请访问 飞桨产品全景
下面详细介绍项目——基于LGBM模型预测借款人是否能按期还款
import pandas as pdimport numpy as npfrom tqdm import tqdmimport osimport matplotlib.pyplot as pltimport seaborn as snsimport paddle登录后复制 In [ ]
train = pd.read_csv('./work/train.csv')test = pd.read_csv('./work/test.csv')submit = pd.read_csv('./work/submit.csv')train.shape, test.shape, submit.shape登录后复制 1、观察数据结构In [ ]
train.head()登录后复制 In [ ]
cate_2_cols = ['XINGBIE', 'ZHIWU', 'XUELI']cate_cols = ['HYZK', 'ZHIYE', 'ZHICHEN', 'DWJJLX', 'DWSSHY', 'GRZHZT']train[cate_cols]登录后复制 In [ ]
num_cols = ['GRJCJS', 'GRZHYE', 'GRZHSNJZYE', 'GRZHDNGJYE', 'GRYJCE', 'DWYJCE','DKFFE', 'DKYE', 'DKLL']train[num_cols]登录后复制 2、特征工程In [ ]
df = pd.concat([train, test], axis = 0).reset_index(drop = True) #这一块把训练集和测试集连起来 观察数据分布特征 且方便后期做处理 最终还是会分开df.head(10)登录后复制 3、LightGBM简介xgboost的出现,让数据民工们告别了传统的机器学习算法们:RF、GBM、SVM、LASSO……..。 顾名思义,lightGBM包含两个关键点:light即轻量级,GBM 梯度提升机。LightGBM 是一个梯度 boosting 框架,使用基于学习算法的决策树。是分布式的,高效的,有以下优势:更快的训练效率低内存使用更高的准确率支持并行化学习可处理大规模数据
概括来说,lightGBM主要有以下特点:
基于Histogram的决策树算法带深度限制的Leaf-wise的叶子生长策略直方图做差加速直接支持类别特征(Categorical Feature)Cache命中率优化基于直方图的稀疏特征优化多线程优化4、皮尔逊系数4.1 Corr()函数保证行相同 可以使用 其中有三种相关函数 默认为皮尔逊相关系数('Pearson','Kendall','Spman')4.2当两个变量的标准差都不为零时,相关系数才有定义,Pearson相关系数适用于:(1)、两个变量之间是线性关系,都是连续数据。(2)、两个变量的总体是正态分布,或接近正态的单峰分布。(3)、两个变量的观测值是成对的,每对观测值之间相互独立。In [ ]# Find correlations with the target and sort correlations = df.corr()['label'].sort_values()# Display correlationsprint('Most Positive Correlations:\n', correlations.tail(15))print('\nMost Negative Correlations:\n', correlations.head(15))#Most Positive Correlations: 越大说明正相关性越强 可以用来做组合特征登录后复制 In [ ]
# Heatmap 使用热力图来分析两组特征存在的数学上的关系summary=pd.pivot_table(data=df, index='GRZHZT', columns='ZHIYE', values='label', aggfunc=np.sum)sns.heatmap(data=summary, cmap='rainbow', annot=True, # fmt='.2e', #科学计数法 保留2位小数 linewidth=0.5)plt.title('Label')plt.show()登录后复制 5、数据可视化-----使用条形图分析年龄段与还款的关系In [ ]
def get_age(df,col = 'age'): df[col+"_genFeat1"]=(df['age'] > 18).astype(int) df[col+"_genFeat2"]=(df['age'] > 25).astype(int) df[col+"_genFeat3"]=(df['age'] > 30).astype(int) df[col+"_genFeat4"]=(df['age'] > 35).astype(int) df[col+"_genFeat5"]=(df['age'] > 40).astype(int) df[col+"_genFeat6"]=(df['age'] > 45).astype(int) return df, [col + f'_genFeat{i}' for i in range(1, 7)]df['age'] = ((1609430399 - df['CSNY']) / (365 * 24 * 3600)).astype(int)df, genFeats1 = get_age(df, col = 'age')sns.distplot(df['age'][df['age'] > 0])登录后复制 In [ ]
def get_daikuanYE(df,col): df[col + '_genFeat1'] = (df[col] > 100000).astype(int) df[col + '_genFeat2'] = (df[col] > 120000).astype(int) df[col + '_genFeat3'] = (df[col] > 140000).astype(int) df[col + '_genFeat4'] = (df[col] > 180000).astype(int) df[col + '_genFeat5'] = (df[col] > 220000).astype(int) df[col + '_genFeat6'] = (df[col] > 260000).astype(int) df[col + '_genFeat7'] = (df[col] > 300000).astype(int) return df, [col + f'_genFeat{i}' for i in range(1, 8)]df, genFeats2 = get_daikuanYE(df, col = 'DKYE')df, genFeats3 = get_daikuanYE(df, col = 'DKFFE')plt.figure(figsize = (8, 2))plt.subplot(1,2,1)sns.distplot(df['DKYE'][df['label'] == 1])plt.subplot(1,2,2)sns.distplot(df['DKFFE'][df['label'] == 1])登录后复制登录后复制 In [ ]
train_df = df[df['label'].isna() == False].reset_index(drop=True)test_df = df[df['label'].isna() == True].reset_index(drop=True)display(train_df.shape, test_df.shape)plt.figure(figsize = (8, 2))plt.subplot(1,2,1)sns.distplot(train_df['age'][train_df['age'] > 0])plt.subplot(1,2,2)sns.distplot(test_df['age'][test_df['age'] > 0])登录后复制
Tips比较训练集与测试集中的各属性分布 发现(zhiwu)职务这个属性训练集与测试集分布不同 是不可用属性 给LGBM会使得分类效果变差
In [ ]gen_feats_fest = ['age','HYZK','ZHIYE','ZHICHEN','ZHIWU','XUELI','DWJJLX','DWSSHY','GRJCJS','GRZHZT','GRZHYE','GRZHSNJZYE','GRZHDNGJYE','GRYJCE','DWYJCE','DKFFE','DKYE','DKLL']for i in range(len(gen_feats_fest)): plt.figure(figsize = (8, 2)) plt.subplot(1,2,1) sns.distplot(train_df[gen_feats_fest[i]]) plt.subplot(1,2,2) sns.distplot(test_df[gen_feats_fest[i]])登录后复制
依据贷款公式组合特征,根据前面提到的皮尔曼系数组合特征
In [ ]#df['missing_rate'] = (df.shape[1] - df.count(axis = 1)) / df.shape[1]#差#df['DKFFE_DKYE'] = df['DKFFE'] + df['DKYE'] #一般差#df['DKFFE_DKY_multi_DKLL'] = (df['DKFFE'] + df['DKYE']) * df['DKLL']#一般好#df['DKFFE_multi_DKLL'] = df['DKFFE'] * df['DKLL']#一般好#df['DKYE_multi_DKLL'] = df['DKYE'] * df['DKLL']#一般差#df['GRYJCE_DWYJCE'] = df['GRYJCE'] + df['DWYJCE']#一般#df['GRZHDNGJYE_GRZHSNJZYE'] = df['GRZHDNGJYE'] + df['GRZHSNJZYE']#一般差#df['DKFFE_multi_DKLL_ratio'] = df['DKFFE'] * df['DKLL'] / df['DKFFE_DKY_multi_DKLL']#一般差#df['DKYE_multi_DKLL_ratio'] = df['DKYE'] * df['DKLL'] / df['DKFFE_DKY_multi_DKLL']#一般差#df['DKYE_DKFFE_ratio'] = df['DKYE'] / (df['DKFFE'] + df['DKYE'])#一般#df['DKFFE_DKYE_ratio'] = df['DKFFE'] /(df['DKFFE'] + df['DKYE'])#一般差#df['GRZHYE_diff_GRZHDNGJYE'] = df['GRZHYE'] - df['GRZHDNGJYE']#一般差#df['GRZHYE_diff_GRZHSNJZYE'] = df['GRZHYE'] - df['GRZHSNJZYE']#一般差#df['GRYJCE_DWYJCE_ratio'] = df['GRYJCE'] / (df['GRYJCE'] + df['DWYJCE'])#差#df['DWYJCE_GRYJCE_ratio'] = df['DWYJCE'] / (df['GRYJCE'] + df['DWYJCE'])#一般差#df['DWYSSHY2DKLL']=df['DWSSHY'] * df['DWSSHY']*df['DKLL']# df['DWYSSHY2GRJCJS2']=df['DWSSHY'] * df['DWSSHY']*df['GRYJCE']*df['GRYJCE']# df['ZHIYE_GRZHZT']= df['GRZHZT']/(df['ZHIYE']+0.00000001)# gen_feats = ['DWYSSHY2GRYJCE','ZHIYE_GRZHZT']# df['DKFFE_multi_DKLL'] = df['DKFFE'] * df['DKLL'] #发放贷款金额*贷款利率# df['DKFFE-DKYE']=df['DKFFE']-df['DKYE'] #贷款发放额-贷款余额=剩余未还款# df['DKFFE_DKYE_DKFFE']=df['DKFFE-DKYE']*df['DKFFE'] #(贷款发放额-贷款余额)*贷款利率# df['DWYSSHY2GRYJCE']=df['DWSSHY'] * df['DWSSHY']*df['GRYJCE'] #所属行业*所属行业*个人月缴存额 ***# df['DWYSSHY2DWYJCE']=df['DWSSHY'] * df['DWSSHY']*df['DWYJCE'] #所属行业*所属行业*单位月缴存额# df['ZHIYE_GRZHZT']=df['GRZHZT']/df['ZHIYE']# df['DWYSSHY3GRYJCE']=(df['DWSSHY']*df['DWSSHY']*df['DWSSHY']*df['GRYJCE'])*(df['GRZHZT']/df['ZHIYE'])# df['GRYJCE_sum_DWYJCE']= (df['GRYJCE']+df['DWYJCE'])*12*(df['DKLL']+1) #贷款每年的还款额# df['GRZHDNGJYE_GRZHSNJZYE']=(df['GRZHDNGJYE']+df['GRZHSNJZYE']+df['GRZHYE'])-df['GRYJCE_sum_DWYJCE']# df['DWJJLX_DWYSSHY']=df['DWJJLX'] *df['DWSSHY'] #单位经济体行业*单位经济体行业# df['XINGBIEDKYE'] = df['XINGBIE'] * df['DKYE']# df['m2'] = (df['DKYE'] - ((df['GRYJCE'] + df['DWYJCE'] ) * 12) + df['GRZHDNGJYE'] ) / 12# df['KDKZGED'] = df['m2'] * (df['GRYJCE'] + df['DWYJCE'] )# gen_feats = ['DKFFE_multi_DKLL','DKFFE_DKYE_DKFFE','DWYSSHY2GRYJCE','DWYSSHY2DWYJCE','ZHIYE_GRZHZT','GRZHDNGJYE_GRZHSNJZYE']df['DWYSSHY2GRYJCE']=df['DWSSHY'] * df['DWSSHY']*df['GRYJCE'] #好gen_feats = ['DWYSSHY2GRYJCE']df.head()登录后复制 In [ ]
for f in tqdm(cate_cols): df[f] = df[f].map(dict(zip(df[f].unique(), range(df[f].nunique())))) df[f + '_count'] = df[f].map(df[f].value_counts()) df = pd.concat([df,pd.get_dummies(df[f],prefix=f"{f}")],axis=1) cate_cols_combine = [[cate_cols[i], cate_cols[j]] for i in range(len(cate_cols)) \ for j in range(i + 1, len(cate_cols))]for f1, f2 in tqdm(cate_cols_combine): df['{}_{}_count'.format(f1, f2)] = df.groupby([f1, f2])['id'].transform('count') df['{}_in_{}_prop'.format(f1, f2)] = df['{}_{}_count'.format(f1, f2)] / df[f2 + '_count'] df['{}_in_{}_prop'.format(f2, f1)] = df['{}_{}_count'.format(f1, f2)] / df[f1 + '_count'] for f1 in tqdm(cate_cols): g = df.groupby(f1) for f2 in num_cols + gen_feats: for stat in ['sum', 'mean', 'std', 'max', 'min', 'std']: df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat) for f3 in genFeats2 + genFeats3: for stat in ['sum', 'mean']: df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat)num_cols_gen_feats = num_cols + gen_featsfor f1 in tqdm(num_cols_gen_feats): g = df.groupby(f1) for f2 in num_cols_gen_feats: if f1 != f2: for stat in ['sum', 'mean', 'std', 'max', 'min', 'std']: df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat)for i in tqdm(range(len(num_cols_gen_feats))): for j in range(i + 1, len(num_cols_gen_feats)): df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_add'] = df[num_cols_gen_feats[i]] + df[num_cols_gen_feats[j]] df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_diff'] = df[num_cols_gen_feats[i]] - df[num_cols_gen_feats[j]] df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_multi'] = df[num_cols_gen_feats[i]] * df[num_cols_gen_feats[j]] df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_div'] = df[num_cols_gen_feats[i]] / (df[num_cols_gen_feats[j]] + 0.0000000001)登录后复制 6、划分训练集与测试集In [ ]
train_df = df[df['label'].isna() == False].reset_index(drop=True)test_df = df[df['label'].isna() == True].reset_index(drop=True)display(train_df.shape, test_df.shape)登录后复制登录后复制
(40000, 1093)登录后复制
(15000, 1093)登录后复制 In [ ]
drop_feats = [f for f in train_df.columns if train_df[f].nunique() == 1 or train_df[f].nunique() == 0]len(drop_feats), drop_feats登录后复制
(4, ['DWSSHY_DKYE_min', 'GRYJCE_DWYJCE_std', 'DWYJCE_GRYJCE_std', 'numsOf_GRYJCE_DWYJCE_diff'])登录后复制 In [ ]
cols = [col for col in train_df.columns if col not in ['id', 'label'] + drop_feats]登录后复制 In [ ]
from sklearn.model_selection import StratifiedKFoldfrom lightgbm.sklearn import LGBMClassifierfrom sklearn.metrics import f1_score, roc_auc_scorefrom sklearn.ensemble import RandomForestClassifier,VotingClassifierfrom xgboost import XGBClassifierimport timeimport lightgbm as lgb登录后复制
<IPython.core.display.HTML object>登录后复制 In [ ]
# callback=paddle.callbacks.VisualDL(log_dir='visualdl_log_dir')# 本地oof = np.zeros(train_df.shape[0])# feat_imp_df = pd.DataFrame({'feat': cols, 'imp': 0})test_df['prob'] = 0clf = LGBMClassifier( # 0.05--0.1 learning_rate=0.07, # 1030 # 1300 n_estimators=1030, # 31 # 35 # 37 # 40 # (0.523177, 0.93799) 38 #(0.519115, 0.93587) 39 num_leaves=37, subsample=0.8, # 0.8 # 0.85 colsample_bytree=0.8, random_state=11, is_unbalace=True, sample_pos_weight=13 # learning_rate=0.066,#学习率 # n_estimators=1032,#拟合的树的棵树,相当于训练轮数 # num_leaves=38,#树的最大叶子数,对比xgboost一般为2^(max_depth) # subsample=0.85,#子样本频率 # colsample_bytree=0.85, #训练特征采样率列 # random_state=17, #随机种子数 # reg_lambda=1e-1, #L2正则化系数 # # min_split_gain=0.2#最小分割增益 # learning_rate=0.07,#学习率 # n_estimators=1032,#拟合的树的棵树,相当于训练轮数 # num_leaves=37,#树的最大叶子数,对比xgboost一般为2^(max_depth) # subsample=0.8,#子样本频率 # colsample_bytree=0.8, #训练特征采样率 列 # random_state=17, #随机种子数 # silent=True , #训练过程是否打印日志信息 # min_split_gain=0.05 ,#最小分割增益 # is_unbalace=True, # sample_pos_weight=13)val_aucs = []seeds = [11,22,33]for seed in seeds: skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed) for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df['label'])): print('--------------------- {} fold ---------------------'.format(i)) t = time.time() trn_x, trn_y = train_df[cols].iloc[trn_idx].reset_index(drop=True), train_df['label'].values[trn_idx] val_x, val_y = train_df[cols].iloc[val_idx].reset_index(drop=True), train_df['label'].values[val_idx] clf.fit( trn_x, trn_y, eval_set=[(val_x, val_y)], # categorical_feature=cate_cols, eval_metric='auc', early_stopping_rounds=200, verbose=200 ) # feat_imp_df['imp'] += clf.feature_importances_ / skf.n_splits oof[val_idx] = clf.predict_proba(val_x)[:, 1] test_df['prob'] += clf.predict_proba(test_df[cols])[:, 1] / skf.n_splits / len(seeds) cv_auc = roc_auc_score(train_df['label'], oof) val_aucs.append(cv_auc) print('\ncv_auc: ', cv_auc)print(val_aucs, np.mean(val_aucs))登录后复制
--------------------- 0 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.944549valid_0's binary_logloss: 0.110362Early stopping, best iteration is:[173]valid_0's auc: 0.944278valid_0's binary_logloss: 0.1097--------------------- 1 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.943315valid_0's binary_logloss: 0.113508Early stopping, best iteration is:[161]valid_0's auc: 0.943045valid_0's binary_logloss: 0.113012--------------------- 2 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.942585valid_0's binary_logloss: 0.119059Early stopping, best iteration is:[148]valid_0's auc: 0.942207valid_0's binary_logloss: 0.117848--------------------- 3 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.942192valid_0's binary_logloss: 0.115931Early stopping, best iteration is:[123]valid_0's auc: 0.942244valid_0's binary_logloss: 0.114857--------------------- 4 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.939505valid_0's binary_logloss: 0.113455Early stopping, best iteration is:[164]valid_0's auc: 0.939654valid_0's binary_logloss: 0.112933cv_auc: 0.9420797160267054--------------------- 0 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.937373valid_0's binary_logloss: 0.119639Early stopping, best iteration is:[140]valid_0's auc: 0.938125valid_0's binary_logloss: 0.117851--------------------- 1 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.942087valid_0's binary_logloss: 0.113331Early stopping, best iteration is:[182]valid_0's auc: 0.942311valid_0's binary_logloss: 0.112912--------------------- 2 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.93272valid_0's binary_logloss: 0.120388Early stopping, best iteration is:[138]valid_0's auc: 0.933033valid_0's binary_logloss: 0.118682--------------------- 3 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds[200]valid_0's auc: 0.951504valid_0's binary_logloss: 0.10742Early stopping, best iteration is:[178]valid_0's auc: 0.951198valid_0's binary_logloss: 0.107208--------------------- 4 fold ---------------------[LightGBM] [Warning] Unknown parameter: is_unbalace[LightGBM] [Warning] Unknown parameter: sample_pos_weightTraining until validation scores don't improve for 200 rounds登录后复制
print(val_aucs, np.mean(val_aucs))def tpr_weight_funtion(y_true,y_predict): d = pd.DataFrame() d['prob'] = list(y_predict) #训练集 d['y'] = list(y_true) #训练之后的label d = d.sort_values(['prob'], ascending=[0]) #对第一列排序 y = d.y #训练之后的label PosAll = pd.Series(y).value_counts()[1] #测试集所有为1结果相加 NegAll = pd.Series(y).value_counts()[0] #测试集所有为0结果相加 pCumsum = d['y'].cumsum() #真实的1的总数 nCumsum = np.arange(len(y)) - pCumsum + 1 #真实的0的总数 pCumsumPer = pCumsum / PosAll #覆盖率 nCumsumPer = nCumsum / NegAll #打扰率 TR1 = pCumsumPer[abs(nCumsumPer-0.001).idxmin()] #TPR:TPR1:FPR = 0.001 TR2 = pCumsumPer[abs(nCumsumPer-0.005).idxmin()] #TPR TPR2:FPR = 0.005 TR3 = pCumsumPer[abs(nCumsumPer-0.01).idxmin()] #TPR TPR3:FPR = 0.01 return 0.4 * TR1 + 0.3 * TR2 + 0.3 * TR3tpr = round(tpr_weight_funtion(train_df['label'], oof), 6)tpr, round(np.mean(val_aucs), 5)登录后复制 In [ ]
submit.head()登录后复制 In [ ]
submit['id'] = test_df['id']submit['label'] = test_df['prob']submit.to_csv('./work/Sub62 {}_{}.csv'.format(tpr, round(np.mean(val_aucs), 6)), index = False)submit.head()登录后复制
训练完感觉精度不是很满意,可以对配置文件进行调参,常见的调参是针对优化器和学习率
打开PaddleDetection/configs/yolov3/yolov3_mobilenet_v1_roadsign.yml文件
LearningRate: base_lr: 0.0001 schedulers: - !PiecewiseDecay gamma: 0.1 milestones: [32, 36] - !LinearWarmup start_factor: 0.3333333333333333 steps: 100OptimizerBuilder: optimizer: momentum: 0.9 type: Momentum regularizer: factor: 0.0005 type: L2登录后复制
我们可以根据数据集以及选择的模型来适当调整我们的参数
更多详细可见 配置文件改动和说明
优秀的项目参考PaddleHub 各种项目
Jetson Nano上部署PaddleDection
使用SSD-MobileNetv1完成一个项目--准备数据集到完成树莓派部署
PaddleClas 源码解析
PaddleSeg 2.0动态图:车道线图像分割任务简介
PaddleGAN 大合集
如果没有你感兴趣的,可以去寻找相应的项目作为参考,注意项目 paddle 版本最好是 2.0.2 及其以上
写出完整的训练代码,并说明使用的套件,使用的优化器,在训练过程调整了那些参数,以及简短的心得
Note :如果您打算新建项目完成本作业,可以在提交的作业附上链接
使用的套件:使用了 PaddlePaddle
使用了什么模型:使用了LGBM模型
调整了那些参数: learning_rate=0.066,#学习率
n_estimators=1032,#拟合的树的棵树,相当于训练轮数 num_leaves=38,#树的最大叶子数,对比xgboost一般为2^(max_depth) subsample=0.85,#子样本频率 colsample_bytree=0.85, #训练特征采样率列 random_state=17, #随机种子数 reg_lambda=1e-1, #L2正则化系数 # min_split_gain=0.2#最小分割增益 learning_rate=0.07,#学习率 n_estimators=1032,#拟合的树的棵树,相当于训练轮数 num_leaves=37,#树的最大叶子数,对比xgboost一般为2^(max_depth) subsample=0.8,#子样本频率 colsample_bytree=0.8, #训练特征采样率 列 random_state=17, #随机种子数 silent=True , #训练过程是否打印日志信息 min_split_gain=0.05 ,#最小分割增益 is_unbalace=True, sample_pos_weight=13登录后复制
以上就是『飞桨领航团AI达人创造营』基于LGBM的模型预测借款人是否能按期还款的详细内容,更多请关注乐哥常识网其它相关文章!