1 Introduction

1.1 Problem Statement

Churn rate is the number of new customers compared to the number of churning customers. This is important because the cost of acquiring new customers is much greater than the cost of retaining existing ones. Therefore, if existing customers who are likely to churn can be identified and prevented in advance, the effect can be greater than simply acquiring more new customers. In addition, if the customer churn rate is accurately predicted, the customer lifetime value (CLV) and return on marketing investment can also be calculated more accurately, which in turn helps to make the right decision.

1.2 Goal

For the stability of the company’s financial problems, create a model to predict what customer has a high risk of leaving the bank’s credit card services.

1.3 Load libraries

1.4 Load the data

Data Frame Summary

data

Dimensions: 10127 x 21
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 CLIENTNUM
[integer]
Mean (sd) : 739177606 (36903784)
min < med < max:
708082083 < 717926358 < 828343083
IQR (CV) : 60106763 (0)
10127 distinct values 10127
(100.0%)
0
(0.0%)
2 Attrition_Flag
[character]
1. Attrited Customer
2. Existing Customer
1627 (16.1%)
8500 (83.9%)
10127
(100.0%)
0
(0.0%)
3 Customer_Age
[integer]
Mean (sd) : 46.3 (8)
min < med < max:
26 < 46 < 73
IQR (CV) : 11 (0.2)
45 distinct values 10127
(100.0%)
0
(0.0%)
4 Gender
[character]
1. F
2. M
5358 (52.9%)
4769 (47.1%)
10127
(100.0%)
0
(0.0%)
5 Dependent_count
[integer]
Mean (sd) : 2.3 (1.3)
min < med < max:
0 < 2 < 5
IQR (CV) : 2 (0.6)
0 : 904 ( 8.9%)
1 : 1838 (18.1%)
2 : 2655 (26.2%)
3 : 2732 (27.0%)
4 : 1574 (15.5%)
5 : 424 ( 4.2%)
10127
(100.0%)
0
(0.0%)
6 Education_Level
[character]
1. College
2. Doctorate
3. Graduate
4. High School
5. Post-Graduate
6. Uneducated
7. Unknown
1013 (10.0%)
451 ( 4.5%)
3128 (30.9%)
2013 (19.9%)
516 ( 5.1%)
1487 (14.7%)
1519 (15.0%)
10127
(100.0%)
0
(0.0%)
7 Marital_Status
[character]
1. Divorced
2. Married
3. Single
4. Unknown
748 ( 7.4%)
4687 (46.3%)
3943 (38.9%)
749 ( 7.4%)
10127
(100.0%)
0
(0.0%)
8 Income_Category
[character]
1. $120K +
2. $40K - $60K
3. $60K - $80K
4. $80K - $120K
5. Less than $40K
6. Unknown
727 ( 7.2%)
1790 (17.7%)
1402 (13.8%)
1535 (15.2%)
3561 (35.2%)
1112 (11.0%)
10127
(100.0%)
0
(0.0%)
9 Card_Category
[character]
1. Blue
2. Gold
3. Platinum
4. Silver
9436 (93.2%)
116 ( 1.1%)
20 ( 0.2%)
555 ( 5.5%)
10127
(100.0%)
0
(0.0%)
10 Months_on_book
[integer]
Mean (sd) : 35.9 (8)
min < med < max:
13 < 36 < 56
IQR (CV) : 9 (0.2)
44 distinct values 10127
(100.0%)
0
(0.0%)
11 Total_Relationship_Count
[integer]
Mean (sd) : 3.8 (1.6)
min < med < max:
1 < 4 < 6
IQR (CV) : 2 (0.4)
1 : 910 ( 9.0%)
2 : 1243 (12.3%)
3 : 2305 (22.8%)
4 : 1912 (18.9%)
5 : 1891 (18.7%)
6 : 1866 (18.4%)
10127
(100.0%)
0
(0.0%)
12 Months_Inactive_12_mon
[integer]
Mean (sd) : 2.3 (1)
min < med < max:
0 < 2 < 6
IQR (CV) : 1 (0.4)
0 : 29 ( 0.3%)
1 : 2233 (22.0%)
2 : 3282 (32.4%)
3 : 3846 (38.0%)
4 : 435 ( 4.3%)
5 : 178 ( 1.8%)
6 : 124 ( 1.2%)
10127
(100.0%)
0
(0.0%)
13 Contacts_Count_12_mon
[integer]
Mean (sd) : 2.5 (1.1)
min < med < max:
0 < 2 < 6
IQR (CV) : 1 (0.5)
0 : 399 ( 3.9%)
1 : 1499 (14.8%)
2 : 3227 (31.9%)
3 : 3380 (33.4%)
4 : 1392 (13.7%)
5 : 176 ( 1.7%)
6 : 54 ( 0.5%)
10127
(100.0%)
0
(0.0%)
14 Credit_Limit
[numeric]
Mean (sd) : 8632 (9088.8)
min < med < max:
1438.3 < 4549 < 34516
IQR (CV) : 8512.5 (1.1)
6205 distinct values 10127
(100.0%)
0
(0.0%)
15 Total_Revolving_Bal
[integer]
Mean (sd) : 1162.8 (815)
min < med < max:
0 < 1276 < 2517
IQR (CV) : 1425 (0.7)
1974 distinct values 10127
(100.0%)
0
(0.0%)
16 Avg_Open_To_Buy
[numeric]
Mean (sd) : 7469.1 (9090.7)
min < med < max:
3 < 3474 < 34516
IQR (CV) : 8534.5 (1.2)
6813 distinct values 10127
(100.0%)
0
(0.0%)
17 Total_Amt_Chng_Q4_Q1
[numeric]
Mean (sd) : 0.8 (0.2)
min < med < max:
0 < 0.7 < 3.4
IQR (CV) : 0.2 (0.3)
1158 distinct values 10127
(100.0%)
0
(0.0%)
18 Total_Trans_Amt
[integer]
Mean (sd) : 4404.1 (3397.1)
min < med < max:
510 < 3899 < 18484
IQR (CV) : 2585.5 (0.8)
5033 distinct values 10127
(100.0%)
0
(0.0%)
19 Total_Trans_Ct
[integer]
Mean (sd) : 64.9 (23.5)
min < med < max:
10 < 67 < 139
IQR (CV) : 36 (0.4)
126 distinct values 10127
(100.0%)
0
(0.0%)
20 Total_Ct_Chng_Q4_Q1
[numeric]
Mean (sd) : 0.7 (0.2)
min < med < max:
0 < 0.7 < 3.7
IQR (CV) : 0.2 (0.3)
830 distinct values 10127
(100.0%)
0
(0.0%)
21 Avg_Utilization_Ratio
[numeric]
Mean (sd) : 0.3 (0.3)
min < med < max:
0 < 0.2 < 1
IQR (CV) : 0.5 (1)
964 distinct values 10127
(100.0%)
0
(0.0%)

2 Exploratory Data Analysis(EDA)

2.1 Categorical Variables

For education level plot, we need to categorize college, doctorate, graduate and post-graduate to degree recipient. We notice that there are unknown variables on Education level, Matrital status, and Income category. So, we need to convert Unknown values to NA values.

2.1.1 Income relationship

In all education levels, Less than $40k shows a high percentage.

2.2 Numerical Variables

The higher they spend annually, the customers are more likely to remain. The interesting part is that existing customers are divided into three clusters.

After inactive of 2-3 month, customers are more likely to leave.

The higher the credit limit is, the customers are more likely to remain.

Credit_Limit is correlated with Avg_Open_To_Buy and Total_Trans_Amt is correlated with Total_Trans_Ct, respectively. Correlated features will increase the complexity of the algorithm, thus increasing the risk of errors.

3 Data Preprocessing

## [1] TRUE

There are no duplicate customer numbers

Characteristic 0, N = 8,5001 1, N = 1,6271 p-value2
Education_Level 0.6
    Degree recipient 4,280 (50%) 828 (51%)
    High School 1,707 (20%) 306 (19%)
    Uneducated 1,250 (15%) 237 (15%)
    Unknown 1,263 (15%) 256 (16%)
Marital_Status 0.11
    Divorced 627 (7.4%) 121 (7.4%)
    Married 3,978 (47%) 709 (44%)
    Single 3,275 (39%) 668 (41%)
    Unknown 620 (7.3%) 129 (7.9%)
Income_Category 0.025
    $120K + 601 (7.1%) 126 (7.7%)
    $40K - $60K 1,519 (18%) 271 (17%)
    $60K - $80K 1,213 (14%) 189 (12%)
    $80K - $120K 1,293 (15%) 242 (15%)
    Less than $40K 2,949 (35%) 612 (38%)
    Unknown 925 (11%) 187 (11%)
Card_Category 0.5
    Blue 7,917 (93%) 1,519 (93%)
    Gold 95 (1.1%) 21 (1.3%)
    Platinum 15 (0.2%) 5 (0.3%)
    Silver 473 (5.6%) 82 (5.0%)
1 n (%)
2 Pearson’s Chi-squared test; Fisher’s exact test

Education_Level p-value = 0.6, Marital_Status p-value = 0.11, Income_Category p-value = 0.025, Card_Category = 0.5

Characteristic 0, N = 5,9681 1, N = 1,1131 p-value2
Education_Level 0.2
    Degree recipient 3,534 (59%) 690 (62%)
    High School 1,416 (24%) 237 (21%)
    Uneducated 1,018 (17%) 186 (17%)
Marital_Status 0.12
    Divorced 477 (8.0%) 92 (8.3%)
    Married 3,035 (51%) 529 (48%)
    Single 2,456 (41%) 492 (44%)
Income_Category 0.015
    $120K + 470 (7.9%) 102 (9.2%)
    $40K - $60K 1,208 (20%) 204 (18%)
    $60K - $80K 958 (16%) 145 (13%)
    $80K - $120K 1,013 (17%) 189 (17%)
    Less than $40K 2,319 (39%) 473 (42%)
Card_Category 0.7
    Blue 5,564 (93%) 1,034 (93%)
    Gold 68 (1.1%) 13 (1.2%)
    Platinum 8 (0.1%) 3 (0.3%)
    Silver 328 (5.5%) 63 (5.7%)
1 n (%)
2 Pearson’s Chi-squared test; Fisher’s exact test

Education_Level p-value = 0.2, Marital_Status p-value = 0.12, Income_Category p-value = 0.015, Card_Category = 0.7

When comparing the p values from the two tables, the difference between the p-values of the columns with original data and the columns where the NA values are dropped is not large, and since the p values were meaningless in the first place, it is not a problem to drop them. Also, the NA values accounts for 30% of the whole data, but we have enough data to make models.

4 Modeling

4.1 Logistic Regression

4.1.1 Stepwise Logistic Regression

##    
##        0    1
##   0 0.81 0.07
##   1 0.03 0.09
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1718  140
##   1   72  193
##                                           
##                Accuracy : 0.9001          
##                  95% CI : (0.8866, 0.9126)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 1.372e-14       
##                                           
##                   Kappa : 0.5882          
##                                           
##  Mcnemar's Test P-Value : 4.193e-06       
##                                           
##             Sensitivity : 0.57958         
##             Specificity : 0.95978         
##          Pos Pred Value : 0.72830         
##          Neg Pred Value : 0.92465         
##              Prevalence : 0.15685         
##          Detection Rate : 0.09091         
##    Detection Prevalence : 0.12482         
##       Balanced Accuracy : 0.76968         
##                                           
##        'Positive' Class : 1               
## 

4.1.2 Logistic Regression with k-fold cross-validation

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1721  139
##   1   69  194
##                                           
##                Accuracy : 0.902           
##                  95% CI : (0.8886, 0.9143)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 1.628e-15       
##                                           
##                   Kappa : 0.5949          
##                                           
##  Mcnemar's Test P-Value : 1.716e-06       
##                                           
##             Sensitivity : 0.58258         
##             Specificity : 0.96145         
##          Pos Pred Value : 0.73764         
##          Neg Pred Value : 0.92527         
##              Prevalence : 0.15685         
##          Detection Rate : 0.09138         
##    Detection Prevalence : 0.12388         
##       Balanced Accuracy : 0.77202         
##                                           
##        'Positive' Class : 1               
## 

4.2 LASSO Regression

##   alpha lambda
## 5     1 0.0018

Tuning parameter ‘alpha’ was held constant at a value of 1 Accuracy was used to select the optimal model using the largest value. The final values used for the model were alpha = 1 and lambda = 0.0018.

## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                      5.191
## Customer_Age                    -0.006
## GenderF                          0.618
## Dependent_count                  0.098
## Education_LevelHigh_School      -0.012
## Education_LevelDegree_recipient  0.065
## Marital_StatusMarried           -0.504
## Marital_StatusSingle             .    
## Income_Category40k_60k          -0.068
## Income_Category60k_80k           .    
## Income_Category80k_120k          0.301
## Income_Categoryover_120k         0.534
## Card_CategorySilver              0.488
## Card_CategoryGold                0.303
## Card_CategoryPlatinum            1.259
## Months_on_book                  -0.009
## Total_Relationship_Count        -0.400
## Months_Inactive_12_mon           0.429
## Contacts_Count_12_mon            0.408
## Credit_Limit                     0.000
## Total_Revolving_Bal             -0.001
## Total_Amt_Chng_Q4_Q1            -0.399
## Total_Trans_Amt                  0.000
## Total_Trans_Ct                  -0.103
## Total_Ct_Chng_Q4_Q1             -2.681
## Avg_Utilization_Ratio            .

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1723  144
##   1   67  189
##                                          
##                Accuracy : 0.9006         
##                  95% CI : (0.8871, 0.913)
##     No Information Rate : 0.8431         
##     P-Value [Acc > NIR] : 8.115e-15      
##                                          
##                   Kappa : 0.5852         
##                                          
##  Mcnemar's Test P-Value : 1.676e-07      
##                                          
##             Sensitivity : 0.56757        
##             Specificity : 0.96257        
##          Pos Pred Value : 0.73828        
##          Neg Pred Value : 0.92287        
##              Prevalence : 0.15685        
##          Detection Rate : 0.08902        
##    Detection Prevalence : 0.12058        
##       Balanced Accuracy : 0.76507        
##                                          
##        'Positive' Class : 1              
## 

4.3 Decision tree

##    
##        0    1
##   0 0.81 0.07
##   1 0.03 0.09
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1719  144
##   1   71  189
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8851, 0.9112)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 6.424e-14       
##                                           
##                   Kappa : 0.5796          
##                                           
##  Mcnemar's Test P-Value : 9.091e-07       
##                                           
##             Sensitivity : 0.56757         
##             Specificity : 0.96034         
##          Pos Pred Value : 0.72692         
##          Neg Pred Value : 0.92271         
##              Prevalence : 0.15685         
##          Detection Rate : 0.08902         
##    Detection Prevalence : 0.12247         
##       Balanced Accuracy : 0.76395         
##                                           
##        'Positive' Class : 1               
## 

4.4 Random forest

##    
##        0    1
##   0 0.83 0.03
##   1 0.01 0.13
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1765   58
##   1   25  275
##                                           
##                Accuracy : 0.9609          
##                  95% CI : (0.9518, 0.9687)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.846           
##                                           
##  Mcnemar's Test P-Value : 0.000444        
##                                           
##             Sensitivity : 0.8258          
##             Specificity : 0.9860          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.9682          
##              Prevalence : 0.1569          
##          Detection Rate : 0.1295          
##    Detection Prevalence : 0.1413          
##       Balanced Accuracy : 0.9059          
##                                           
##        'Positive' Class : 1               
## 

4.5 Naive bayes

##    
##        0    1
##   0 0.80 0.07
##   1 0.05 0.08
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1692  153
##   1   98  180
##                                           
##                Accuracy : 0.8818          
##                  95% CI : (0.8673, 0.8952)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 2.341e-07       
##                                           
##                   Kappa : 0.5208          
##                                           
##  Mcnemar's Test P-Value : 0.0006533       
##                                           
##             Sensitivity : 0.54054         
##             Specificity : 0.94525         
##          Pos Pred Value : 0.64748         
##          Neg Pred Value : 0.91707         
##              Prevalence : 0.15685         
##          Detection Rate : 0.08479         
##    Detection Prevalence : 0.13095         
##       Balanced Accuracy : 0.74290         
##                                           
##        'Positive' Class : 1               
## 

4.6 SVM

##    
##        0    1
##   0 0.82 0.04
##   1 0.02 0.11
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1739   92
##   1   51  241
##                                           
##                Accuracy : 0.9326          
##                  95% CI : (0.9211, 0.9429)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7319          
##                                           
##  Mcnemar's Test P-Value : 0.0008229       
##                                           
##             Sensitivity : 0.7237          
##             Specificity : 0.9715          
##          Pos Pred Value : 0.8253          
##          Neg Pred Value : 0.9498          
##              Prevalence : 0.1569          
##          Detection Rate : 0.1135          
##    Detection Prevalence : 0.1375          
##       Balanced Accuracy : 0.8476          
##                                           
##        'Positive' Class : 1               
## 
##    
##        0    1
##   0 0.82 0.04
##   1 0.02 0.11

4.7 k-NN

##    
##        0    1
##   0 0.81 0.08
##   1 0.04 0.08
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1715  163
##   1   75  170
##                                          
##                Accuracy : 0.8879         
##                  95% CI : (0.8737, 0.901)
##     No Information Rate : 0.8431         
##     P-Value [Acc > NIR] : 2.018e-09      
##                                          
##                   Kappa : 0.5251         
##                                          
##  Mcnemar's Test P-Value : 1.707e-08      
##                                          
##             Sensitivity : 0.51051        
##             Specificity : 0.95810        
##          Pos Pred Value : 0.69388        
##          Neg Pred Value : 0.91321        
##              Prevalence : 0.15685        
##          Detection Rate : 0.08008        
##    Detection Prevalence : 0.11540        
##       Balanced Accuracy : 0.73431        
##                                          
##        'Positive' Class : 1              
## 

5. Summary

5.1 ROC curves and AUC values

5.2 Results

Method

AUC

Accuracy

Kappa

Sensitivity

Specificity

F1_score

Stepwise Logistic Regression

0.9138

0.9001

0.5882

0.5796

0.9598

0.6455

Logistic Regression with k-fold cv

0.9135

0.9020

0.5949

0.5826

0.9614

0.6510

LASSO Regression

0.9141

0.9006

0.5852

0.5676

0.9626

0.6418

Decision Tree

0.8197

0.8987

0.5796

0.5676

0.9603

0.6374

Random Forest

0.9859

0.9609

0.8460

0.8258

0.9860

0.8689

Naive Bayes

0.8501

0.8818

0.5208

0.5405

0.9452

0.5892

SVM

0.9595

0.9326

0.7319

0.7237

0.9715

0.7712

k-NN

0.8820

0.8879

0.5251

0.5105

0.9581

0.5882

This is a summary of our prediction models. Random Forest has best model compared to other with metrics score(Kappa, Sensitivity, Specificity and F1 score) has shown at least 96% accuracy. Random Forest have AUC score of 98% that shows the model is learning the data well enough.