
[DACON] Credit Card Delinquency Level Prediction (LightGBM)

Dacon GitHub


Introduction

Credit card companies use the personal information and data submitted by applicants to calculate credit scores.
They use these scores to predict how likely an applicant is to default or fall behind on future credit card payments.
In this project, we develop an algorithm that predicts the delinquency level of a user's credit card payments from the personal information of credit card users.

Experiment

1. Data Preparation

1.1. Data Exploration

There are 20 columns, including index and credit (the dependent variable).

| Feature | Format | Description |
| --- | --- | --- |
| index | | |
| gender | object | M: male / F: female |
| car | object | owns a car or not (Y/N) |
| reality | object | owns real estate or not (Y/N) |
| child_num | integer | number of children |
| income_total | float | annual income |
| income_type | object | Commercial associate, Working, State servant, Pensioner, Student |
| edu_type | object | Higher education, Secondary / secondary special, Incomplete higher, Lower secondary, Academic degree |
| family_type | object | Married, Civil marriage, Separated, Single / not married, Widow |
| house_type | object | Municipal apartment, House / apartment, With parents, Co-op apartment, Rented apartment, Office apartment |
| DAYS_BIRTH | integer | days between data collection date and birth date (negative) |
| DAYS_EMPLOYED | integer | days between data collection date and first working date (negative) |
| FLAG_MOBIL | integer | has a cell phone or not (1/0) |
| work_phone | integer | has a business phone or not (1/0) |
| phone | integer | has a phone or not (1/0) |
| email | integer | has an email or not (1/0) |
| occyp_type | object | type of occupation (contains NaN) |
| family_size | float | number of family members |
| begin_month | float | months between data collection date and card issue date (negative) |
| credit | float | credit card delinquency level (lower credit means a better user) |
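
As a quick sanity check before preprocessing, the raw data can be loaded and inspected. A minimal sketch, assuming the DACON training file is named train.csv (the file name is an assumption):

```python
import pandas as pd

# Load the competition data (file name assumed)
df = pd.read_csv('train.csv')

print(df.shape)         # expect 20 columns
df.info()               # dtypes and non-null counts per column
print(df.isna().sum())  # the table above notes NaN in occyp_type
```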


1.2. Data Preprocessing

We first drop the unusable variable (index) and fill the NaN cells in occyp_type with the string 'None'.
We then simplify the dependent variable by collapsing the three levels of credit into two (float → category).

```python
df.drop(columns=['index'], inplace=True)  # df: pandas DataFrame of credit card data
df['occyp_type'] = df['occyp_type'].fillna('None')

# Collapse credit levels {0, 1, 2} into two risk classes
df['credit'] = df['credit'].map(lambda x: 'low_risk' if x <= 1.0 else 'high_risk').astype('category')
```
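
To verify the relabeling, the class balance of the new target can be checked (a quick sketch):

```python
# Share of each risk class after collapsing the three credit levels
print(df['credit'].value_counts(normalize=True))
```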


Since LightGBM cannot use object-type columns directly, the object columns must be converted to the category data type.

```python
# Convert all object columns to the category dtype so LightGBM can handle them
cat_vars = list(df.select_dtypes('object').columns)
df[cat_vars] = df[cat_vars].astype('category')
```
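
Because LightGBM handles pandas category columns natively, a quick dtype check confirms the conversion (sketch):

```python
# Each converted column should now report the 'category' dtype
print(df.dtypes[cat_vars])
```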


2. Model Fitting

2.1. Model Optimization

First, the data set is split into independent and dependent variables, and then into train and test sets at a 7:3 ratio with a fixed random state of 2021.

```python
from sklearn.model_selection import train_test_split

y = df['credit']
x = df.drop(columns=['credit'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2021)
```
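
The split can be verified quickly (sketch):

```python
print(x_train.shape, x_test.shape)           # roughly a 7:3 row split
print(y_train.value_counts(normalize=True))  # class shares in the training set
```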


Next, the model hyperparameters are tuned with randomized search using 5-fold cross-validation on the training data. The evaluation metric for this project is log loss.

```python
import warnings

from lightgbm import LGBMClassifier
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV

# Evaluation metric: log loss computed from predicted probabilities
logloss_ = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

# LightGBM
warnings.filterwarnings("ignore")
md_lgb = LGBMClassifier(random_state=2021)

# Parameter search space
params_lgb = {
    'learning_rate': [0.05, 0.1, 0.15],
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [5, 6, 7, 8, 9],
    'num_leaves': [31, 47, 63],
    'reg_alpha': [0, 1],
    'reg_lambda': [0, 1, 10]
}

grid_search_lgb = RandomizedSearchCV(estimator=md_lgb,
                                     param_distributions=params_lgb,
                                     n_iter=10,
                                     cv=5,
                                     scoring=logloss_,
                                     random_state=2021)

# categorical_feature is forwarded to LGBMClassifier.fit
grid_search_lgb.fit(x_train, y_train, categorical_feature=cat_vars)

# Show result
print(grid_search_lgb.best_params_)
print(grid_search_lgb.best_score_)
```

The result looks like this:

  • {'reg_lambda': 1, 'reg_alpha': 0, 'num_leaves': 47, 'n_estimators': 400, 'max_depth': 9, 'learning_rate': 0.05}
  • -0.5505740190402201
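
Note that RandomizedSearchCV also refits the best configuration on the full training set, so the tuned model is available directly as grid_search_lgb.best_estimator_; the next step rebuilds it by hand instead, which allows early stopping during the final fit. A minimal sketch of the shortcut:

```python
# Refit model selected by the search (same parameters, no early stopping)
md_from_search = grid_search_lgb.best_estimator_
```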


Using these parameters, the final model can be trained.

```python
# Final model with the tuned parameters
md_lgb_best = LGBMClassifier(random_state=2021,
                             reg_lambda=1,
                             reg_alpha=0,
                             num_leaves=47,
                             n_estimators=400,
                             max_depth=9,
                             learning_rate=0.05)

md_lgb_best.fit(x_train, y_train,
                categorical_feature=cat_vars,
                eval_set=[(x_test, y_test)],
                early_stopping_rounds=10)
```
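
Because early stopping monitors the eval_set, the number of boosting rounds actually kept can be checked after fitting (a quick sketch):

```python
# Best iteration found by early stopping
print(md_lgb_best.best_iteration_)
```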


2.2. Model Evaluation

```python
log_loss(y_test, md_lgb_best.predict_proba(x_test))
```

  • 0.552121430722158
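
Log loss is the competition metric, but for reference the same predictions can be summarized with plain accuracy (a sketch, not part of the original evaluation):

```python
from sklearn.metrics import accuracy_score

# Hard-label accuracy of the tuned model on the held-out test set
print(accuracy_score(y_test, md_lgb_best.predict(x_test)))
```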


3. Conclusion

The importance of variables in the model is as follows.

```python
from lightgbm import plot_importance
import matplotlib.pyplot as plt

plot_importance(md_lgb_best)
plt.show()
```

*Figure: variable importance*

The first tree (tree 0) of the fitted model can be visualized as follows.

```python
from lightgbm import plot_tree

plot_tree(md_lgb_best, tree_index=0, dpi=100, figsize=(20, 12))
plt.show()
```

*Figure: tree plot*

Next, we calculate SHAP values and look at the SHAP distribution for each variable.

```python
import shap

explainer = shap.TreeExplainer(md_lgb_best)
shap_values = explainer.shap_values(x_test)  # list of arrays, one per class, for binary LightGBM
shap.initjs()
shap.summary_plot(shap_values[1], x_test, feature_names=x_test.columns)
```
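
For a binary LightGBM classifier, shap.TreeExplainer.shap_values returns a list with one array per class (in the shap versions of that era), so shap_values[1] corresponds to the second entry of md_lgb_best.classes_. A single prediction can be inspected the same way (a sketch under that assumption):

```python
# Force plot for the first test-set row, same class index as the summary plot
shap.force_plot(explainer.expected_value[1], shap_values[1][0], x_test.iloc[0])
```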

*Figure: SHAP summary plot*

