1 Introduction

1.1 Problem Statement

Churn rate is the number of new customers compared to the number of churning customers. This is important because the cost of acquiring new customers is much greater than the cost of retaining existing ones. Therefore, if existing customers who are likely to churn can be identified and prevented in advance, the effect can be greater than simply acquiring more new customers. In addition, if the customer churn rate is accurately predicted, the customer lifetime value (CLV) and return on marketing investment can also be calculated more accurately, which in turn helps to make the right decision.

1.2 Goal

For the stability of the company’s financial problems, create a model to predict what customer has a high risk of leaving the bank’s credit card services.

1.3 Load libraries

1.4 Load the data

Data Frame Summary

data

Dimensions: 10127 x 21
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Valid
1	CLIENTNUM [integer]	Mean (sd) : 739177606 (36903784) min < med < max: 708082083 < 717926358 < 828343083 IQR (CV) : 60106763 (0)	10127 distinct values	10127 (100.0%)
2	Attrition_Flag [character]	1. Attrited Customer 2. Existing Customer	1627 (16.1%) 8500 (83.9%)	10127 (100.0%)
3	Customer_Age [integer]	Mean (sd) : 46.3 (8) min < med < max: 26 < 46 < 73 IQR (CV) : 11 (0.2)	45 distinct values	10127 (100.0%)
4	Gender [character]	1. F 2. M	5358 (52.9%) 4769 (47.1%)	10127 (100.0%)
5	Dependent_count [integer]	Mean (sd) : 2.3 (1.3) min < med < max: 0 < 2 < 5 IQR (CV) : 2 (0.6)	0 : 904 ( 8.9%) 1 : 1838 (18.1%) 2 : 2655 (26.2%) 3 : 2732 (27.0%) 4 : 1574 (15.5%) 5 : 424 ( 4.2%)	10127 (100.0%)
6	Education_Level [character]	1. College 2. Doctorate 3. Graduate 4. High School 5. Post-Graduate 6. Uneducated 7. Unknown	1013 (10.0%) 451 ( 4.5%) 3128 (30.9%) 2013 (19.9%) 516 ( 5.1%) 1487 (14.7%) 1519 (15.0%)	10127 (100.0%)
7	Marital_Status [character]	1. Divorced 2. Married 3. Single 4. Unknown	748 ( 7.4%) 4687 (46.3%) 3943 (38.9%) 749 ( 7.4%)	10127 (100.0%)
8	Income_Category [character]	1. $120K + 2. $40K - $60K 3. $60K - $80K 4. $80K - $120K 5. Less than $40K 6. Unknown	727 ( 7.2%) 1790 (17.7%) 1402 (13.8%) 1535 (15.2%) 3561 (35.2%) 1112 (11.0%)	10127 (100.0%)
9	Card_Category [character]	1. Blue 2. Gold 3. Platinum 4. Silver	9436 (93.2%) 116 ( 1.1%) 20 ( 0.2%) 555 ( 5.5%)	10127 (100.0%)
10	Months_on_book [integer]	Mean (sd) : 35.9 (8) min < med < max: 13 < 36 < 56 IQR (CV) : 9 (0.2)	44 distinct values	10127 (100.0%)
11	Total_Relationship_Count [integer]	Mean (sd) : 3.8 (1.6) min < med < max: 1 < 4 < 6 IQR (CV) : 2 (0.4)	1 : 910 ( 9.0%) 2 : 1243 (12.3%) 3 : 2305 (22.8%) 4 : 1912 (18.9%) 5 : 1891 (18.7%) 6 : 1866 (18.4%)	10127 (100.0%)
12	Months_Inactive_12_mon [integer]	Mean (sd) : 2.3 (1) min < med < max: 0 < 2 < 6 IQR (CV) : 1 (0.4)	0 : 29 ( 0.3%) 1 : 2233 (22.0%) 2 : 3282 (32.4%) 3 : 3846 (38.0%) 4 : 435 ( 4.3%) 5 : 178 ( 1.8%) 6 : 124 ( 1.2%)	10127 (100.0%)
13	Contacts_Count_12_mon [integer]	Mean (sd) : 2.5 (1.1) min < med < max: 0 < 2 < 6 IQR (CV) : 1 (0.5)	0 : 399 ( 3.9%) 1 : 1499 (14.8%) 2 : 3227 (31.9%) 3 : 3380 (33.4%) 4 : 1392 (13.7%) 5 : 176 ( 1.7%) 6 : 54 ( 0.5%)	10127 (100.0%)
14	Credit_Limit [numeric]	Mean (sd) : 8632 (9088.8) min < med < max: 1438.3 < 4549 < 34516 IQR (CV) : 8512.5 (1.1)	6205 distinct values	10127 (100.0%)
15	Total_Revolving_Bal [integer]	Mean (sd) : 1162.8 (815) min < med < max: 0 < 1276 < 2517 IQR (CV) : 1425 (0.7)	1974 distinct values	10127 (100.0%)
16	Avg_Open_To_Buy [numeric]	Mean (sd) : 7469.1 (9090.7) min < med < max: 3 < 3474 < 34516 IQR (CV) : 8534.5 (1.2)	6813 distinct values	10127 (100.0%)
17	Total_Amt_Chng_Q4_Q1 [numeric]	Mean (sd) : 0.8 (0.2) min < med < max: 0 < 0.7 < 3.4 IQR (CV) : 0.2 (0.3)	1158 distinct values	10127 (100.0%)
18	Total_Trans_Amt [integer]	Mean (sd) : 4404.1 (3397.1) min < med < max: 510 < 3899 < 18484 IQR (CV) : 2585.5 (0.8)	5033 distinct values	10127 (100.0%)
19	Total_Trans_Ct [integer]	Mean (sd) : 64.9 (23.5) min < med < max: 10 < 67 < 139 IQR (CV) : 36 (0.4)	126 distinct values	10127 (100.0%)
20	Total_Ct_Chng_Q4_Q1 [numeric]	Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.7 < 3.7 IQR (CV) : 0.2 (0.3)	830 distinct values	10127 (100.0%)
21	Avg_Utilization_Ratio [numeric]	Mean (sd) : 0.3 (0.3) min < med < max: 0 < 0.2 < 1 IQR (CV) : 0.5 (1)	964 distinct values	10127 (100.0%)

2 Exploratory Data Analysis(EDA)

2.1 Categorical Variables

For education level plot, we need to categorize college, doctorate, graduate and post-graduate to degree recipient. We notice that there are unknown variables on Education level, Matrital status, and Income category. So, we need to convert Unknown values to NA values.

2.1.1 Income relationship

In all education levels, Less than $40k shows a high percentage.

2.2 Numerical Variables

The higher they spend annually, the customers are more likely to remain. The interesting part is that existing customers are divided into three clusters.

After inactive of 2-3 month, customers are more likely to leave.

The higher the credit limit is, the customers are more likely to remain.

Credit_Limit is correlated with Avg_Open_To_Buy and Total_Trans_Amt is correlated with Total_Trans_Ct, respectively. Correlated features will increase the complexity of the algorithm, thus increasing the risk of errors.

3 Data Preprocessing

## [1] TRUE

There are no duplicate customer numbers

Characteristic	0, N = 8,500¹	1, N = 1,627¹	p-value²
Education_Level			0.6
Degree recipient	4,280 (50%)	828 (51%)
High School	1,707 (20%)	306 (19%)
Uneducated	1,250 (15%)	237 (15%)
Unknown	1,263 (15%)	256 (16%)
Marital_Status			0.11
Divorced	627 (7.4%)	121 (7.4%)
Married	3,978 (47%)	709 (44%)
Single	3,275 (39%)	668 (41%)
Unknown	620 (7.3%)	129 (7.9%)
Income_Category			0.025
$120K +	601 (7.1%)	126 (7.7%)
$40K - $60K	1,519 (18%)	271 (17%)
$60K - $80K	1,213 (14%)	189 (12%)
$80K - $120K	1,293 (15%)	242 (15%)
Less than $40K	2,949 (35%)	612 (38%)
Unknown	925 (11%)	187 (11%)
Card_Category			0.5
Blue	7,917 (93%)	1,519 (93%)
Gold	95 (1.1%)	21 (1.3%)
Platinum	15 (0.2%)	5 (0.3%)
Silver	473 (5.6%)	82 (5.0%)
¹ n (%)
² Pearson’s Chi-squared test; Fisher’s exact test

Education_Level p-value = 0.6, Marital_Status p-value = 0.11, Income_Category p-value = 0.025, Card_Category = 0.5

Characteristic	0, N = 5,968¹	1, N = 1,113¹	p-value²
Education_Level			0.2
Degree recipient	3,534 (59%)	690 (62%)
High School	1,416 (24%)	237 (21%)
Uneducated	1,018 (17%)	186 (17%)
Marital_Status			0.12
Divorced	477 (8.0%)	92 (8.3%)
Married	3,035 (51%)	529 (48%)
Single	2,456 (41%)	492 (44%)
Income_Category			0.015
$120K +	470 (7.9%)	102 (9.2%)
$40K - $60K	1,208 (20%)	204 (18%)
$60K - $80K	958 (16%)	145 (13%)
$80K - $120K	1,013 (17%)	189 (17%)
Less than $40K	2,319 (39%)	473 (42%)
Card_Category			0.7
Blue	5,564 (93%)	1,034 (93%)
Gold	68 (1.1%)	13 (1.2%)
Platinum	8 (0.1%)	3 (0.3%)
Silver	328 (5.5%)	63 (5.7%)
¹ n (%)
² Pearson’s Chi-squared test; Fisher’s exact test

Education_Level p-value = 0.2, Marital_Status p-value = 0.12, Income_Category p-value = 0.015, Card_Category = 0.7

When comparing the p values from the two tables, the difference between the p-values of the columns with original data and the columns where the NA values are dropped is not large, and since the p values were meaningless in the first place, it is not a problem to drop them. Also, the NA values accounts for 30% of the whole data, but we have enough data to make models.

4 Modeling

4.1 Logistic Regression

4.1.1 Stepwise Logistic Regression

##    
##        0    1
##   0 0.81 0.07
##   1 0.03 0.09

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1718  140
##   1   72  193
##                                           
##                Accuracy : 0.9001          
##                  95% CI : (0.8866, 0.9126)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 1.372e-14       
##                                           
##                   Kappa : 0.5882          
##                                           
##  Mcnemar's Test P-Value : 4.193e-06       
##                                           
##             Sensitivity : 0.57958         
##             Specificity : 0.95978         
##          Pos Pred Value : 0.72830         
##          Neg Pred Value : 0.92465         
##              Prevalence : 0.15685         
##          Detection Rate : 0.09091         
##    Detection Prevalence : 0.12482         
##       Balanced Accuracy : 0.76968         
##                                           
##        'Positive' Class : 1               
##

4.1.2 Logistic Regression with k-fold cross-validation

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1721  139
##   1   69  194
##                                           
##                Accuracy : 0.902           
##                  95% CI : (0.8886, 0.9143)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 1.628e-15       
##                                           
##                   Kappa : 0.5949          
##                                           
##  Mcnemar's Test P-Value : 1.716e-06       
##                                           
##             Sensitivity : 0.58258         
##             Specificity : 0.96145         
##          Pos Pred Value : 0.73764         
##          Neg Pred Value : 0.92527         
##              Prevalence : 0.15685         
##          Detection Rate : 0.09138         
##    Detection Prevalence : 0.12388         
##       Balanced Accuracy : 0.77202         
##                                           
##        'Positive' Class : 1               
##

4.2 LASSO Regression

##   alpha lambda
## 5     1 0.0018

Tuning parameter ‘alpha’ was held constant at a value of 1 Accuracy was used to select the optimal model using the largest value. The final values used for the model were alpha = 1 and lambda = 0.0018.

## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                      5.191
## Customer_Age                    -0.006
## GenderF                          0.618
## Dependent_count                  0.098
## Education_LevelHigh_School      -0.012
## Education_LevelDegree_recipient  0.065
## Marital_StatusMarried           -0.504
## Marital_StatusSingle             .    
## Income_Category40k_60k          -0.068
## Income_Category60k_80k           .    
## Income_Category80k_120k          0.301
## Income_Categoryover_120k         0.534
## Card_CategorySilver              0.488
## Card_CategoryGold                0.303
## Card_CategoryPlatinum            1.259
## Months_on_book                  -0.009
## Total_Relationship_Count        -0.400
## Months_Inactive_12_mon           0.429
## Contacts_Count_12_mon            0.408
## Credit_Limit                     0.000
## Total_Revolving_Bal             -0.001
## Total_Amt_Chng_Q4_Q1            -0.399
## Total_Trans_Amt                  0.000
## Total_Trans_Ct                  -0.103
## Total_Ct_Chng_Q4_Q1             -2.681
## Avg_Utilization_Ratio            .

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1723  144
##   1   67  189
##                                          
##                Accuracy : 0.9006         
##                  95% CI : (0.8871, 0.913)
##     No Information Rate : 0.8431         
##     P-Value [Acc > NIR] : 8.115e-15      
##                                          
##                   Kappa : 0.5852         
##                                          
##  Mcnemar's Test P-Value : 1.676e-07      
##                                          
##             Sensitivity : 0.56757        
##             Specificity : 0.96257        
##          Pos Pred Value : 0.73828        
##          Neg Pred Value : 0.92287        
##              Prevalence : 0.15685        
##          Detection Rate : 0.08902        
##    Detection Prevalence : 0.12058        
##       Balanced Accuracy : 0.76507        
##                                          
##        'Positive' Class : 1              
##

4.3 Decision tree

##    
##        0    1
##   0 0.81 0.07
##   1 0.03 0.09

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1719  144
##   1   71  189
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8851, 0.9112)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 6.424e-14       
##                                           
##                   Kappa : 0.5796          
##                                           
##  Mcnemar's Test P-Value : 9.091e-07       
##                                           
##             Sensitivity : 0.56757         
##             Specificity : 0.96034         
##          Pos Pred Value : 0.72692         
##          Neg Pred Value : 0.92271         
##              Prevalence : 0.15685         
##          Detection Rate : 0.08902         
##    Detection Prevalence : 0.12247         
##       Balanced Accuracy : 0.76395         
##                                           
##        'Positive' Class : 1               
##

4.4 Random forest

##    
##        0    1
##   0 0.83 0.03
##   1 0.01 0.13

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1765   58
##   1   25  275
##                                           
##                Accuracy : 0.9609          
##                  95% CI : (0.9518, 0.9687)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.846           
##                                           
##  Mcnemar's Test P-Value : 0.000444        
##                                           
##             Sensitivity : 0.8258          
##             Specificity : 0.9860          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.9682          
##              Prevalence : 0.1569          
##          Detection Rate : 0.1295          
##    Detection Prevalence : 0.1413          
##       Balanced Accuracy : 0.9059          
##                                           
##        'Positive' Class : 1               
##

4.5 Naive bayes

##    
##        0    1
##   0 0.80 0.07
##   1 0.05 0.08

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1692  153
##   1   98  180
##                                           
##                Accuracy : 0.8818          
##                  95% CI : (0.8673, 0.8952)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : 2.341e-07       
##                                           
##                   Kappa : 0.5208          
##                                           
##  Mcnemar's Test P-Value : 0.0006533       
##                                           
##             Sensitivity : 0.54054         
##             Specificity : 0.94525         
##          Pos Pred Value : 0.64748         
##          Neg Pred Value : 0.91707         
##              Prevalence : 0.15685         
##          Detection Rate : 0.08479         
##    Detection Prevalence : 0.13095         
##       Balanced Accuracy : 0.74290         
##                                           
##        'Positive' Class : 1               
##

4.6 SVM

##    
##        0    1
##   0 0.82 0.04
##   1 0.02 0.11

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1739   92
##   1   51  241
##                                           
##                Accuracy : 0.9326          
##                  95% CI : (0.9211, 0.9429)
##     No Information Rate : 0.8431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7319          
##                                           
##  Mcnemar's Test P-Value : 0.0008229       
##                                           
##             Sensitivity : 0.7237          
##             Specificity : 0.9715          
##          Pos Pred Value : 0.8253          
##          Neg Pred Value : 0.9498          
##              Prevalence : 0.1569          
##          Detection Rate : 0.1135          
##    Detection Prevalence : 0.1375          
##       Balanced Accuracy : 0.8476          
##                                           
##        'Positive' Class : 1               
##

##    
##        0    1
##   0 0.82 0.04
##   1 0.02 0.11

4.7 k-NN

##    
##        0    1
##   0 0.81 0.08
##   1 0.04 0.08

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1715  163
##   1   75  170
##                                          
##                Accuracy : 0.8879         
##                  95% CI : (0.8737, 0.901)
##     No Information Rate : 0.8431         
##     P-Value [Acc > NIR] : 2.018e-09      
##                                          
##                   Kappa : 0.5251         
##                                          
##  Mcnemar's Test P-Value : 1.707e-08      
##                                          
##             Sensitivity : 0.51051        
##             Specificity : 0.95810        
##          Pos Pred Value : 0.69388        
##          Neg Pred Value : 0.91321        
##              Prevalence : 0.15685        
##          Detection Rate : 0.08008        
##    Detection Prevalence : 0.11540        
##       Balanced Accuracy : 0.73431        
##                                          
##        'Positive' Class : 1              
##

5. Summary

5.1 ROC curves and AUC values

5.2 Results

Method	AUC	Accuracy	Kappa	Sensitivity	Specificity	F1_score
Stepwise Logistic Regression	0.9138	0.9001	0.5882	0.5796	0.9598	0.6455
Logistic Regression with k-fold cv	0.9135	0.9020	0.5949	0.5826	0.9614	0.6510
LASSO Regression	0.9141	0.9006	0.5852	0.5676	0.9626	0.6418
Decision Tree	0.8197	0.8987	0.5796	0.5676	0.9603	0.6374
Random Forest	0.9859	0.9609	0.8460	0.8258	0.9860	0.8689
Naive Bayes	0.8501	0.8818	0.5208	0.5405	0.9452	0.5892
SVM	0.9595	0.9326	0.7319	0.7237	0.9715	0.7712
k-NN	0.8820	0.8879	0.5251	0.5105	0.9581	0.5882

This is a summary of our prediction models. Random Forest has best model compared to other with metrics score(Kappa, Sensitivity, Specificity and F1 score) has shown at least 96% accuracy. Random Forest have AUC score of 98% that shows the model is learning the data well enough.

Customer Churn Prediction

Gyujin Seo

2022-08-07