Churn rate is the proportion of customers who stop doing business with a company over a given period. It matters because acquiring a new customer costs far more than retaining an existing one, so identifying customers who are likely to churn and intervening before they leave can pay off more than simply acquiring new customers. In addition, accurate churn prediction makes estimates of customer lifetime value (CLV) and return on marketing investment more accurate, which in turn supports better decisions.
To help stabilize the company's finances, we build a model that predicts which customers are at high risk of leaving the bank's credit card services.
Dimensions: 10127 x 21
Duplicates: 0
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
---|---|---|---|---|---|
1 | CLIENTNUM [integer] | Mean (sd): 739177606 (36903784); min < med < max: 708082083 < 717926358 < 828343083; IQR (CV): 60106763 (0) | 10127 distinct values | 10127 (100.0%) | 0 (0.0%) |
2 | Attrition_Flag [character] | 1. Attrited Customer, 2. Existing Customer | 1627 (16.1%), 8500 (83.9%) | 10127 (100.0%) | 0 (0.0%) |
3 | Customer_Age [integer] | Mean (sd): 46.3 (8); min < med < max: 26 < 46 < 73; IQR (CV): 11 (0.2) | 45 distinct values | 10127 (100.0%) | 0 (0.0%) |
4 | Gender [character] | 1. F, 2. M | 5358 (52.9%), 4769 (47.1%) | 10127 (100.0%) | 0 (0.0%) |
5 | Dependent_count [integer] | Mean (sd): 2.3 (1.3); min < med < max: 0 < 2 < 5; IQR (CV): 2 (0.6) | 0: 904 (8.9%), 1: 1838 (18.1%), 2: 2655 (26.2%), 3: 2732 (27.0%), 4: 1574 (15.5%), 5: 424 (4.2%) | 10127 (100.0%) | 0 (0.0%) |
6 | Education_Level [character] | 1. College, 2. Doctorate, 3. Graduate, 4. High School, 5. Post-Graduate, 6. Uneducated, 7. Unknown | 1013 (10.0%), 451 (4.5%), 3128 (30.9%), 2013 (19.9%), 516 (5.1%), 1487 (14.7%), 1519 (15.0%) | 10127 (100.0%) | 0 (0.0%) |
7 | Marital_Status [character] | 1. Divorced, 2. Married, 3. Single, 4. Unknown | 748 (7.4%), 4687 (46.3%), 3943 (38.9%), 749 (7.4%) | 10127 (100.0%) | 0 (0.0%) |
8 | Income_Category [character] | 1. $120K +, 2. $40K - $60K, 3. $60K - $80K, 4. $80K - $120K, 5. Less than $40K, 6. Unknown | 727 (7.2%), 1790 (17.7%), 1402 (13.8%), 1535 (15.2%), 3561 (35.2%), 1112 (11.0%) | 10127 (100.0%) | 0 (0.0%) |
9 | Card_Category [character] | 1. Blue, 2. Gold, 3. Platinum, 4. Silver | 9436 (93.2%), 116 (1.1%), 20 (0.2%), 555 (5.5%) | 10127 (100.0%) | 0 (0.0%) |
10 | Months_on_book [integer] | Mean (sd): 35.9 (8); min < med < max: 13 < 36 < 56; IQR (CV): 9 (0.2) | 44 distinct values | 10127 (100.0%) | 0 (0.0%) |
11 | Total_Relationship_Count [integer] | Mean (sd): 3.8 (1.6); min < med < max: 1 < 4 < 6; IQR (CV): 2 (0.4) | 1: 910 (9.0%), 2: 1243 (12.3%), 3: 2305 (22.8%), 4: 1912 (18.9%), 5: 1891 (18.7%), 6: 1866 (18.4%) | 10127 (100.0%) | 0 (0.0%) |
12 | Months_Inactive_12_mon [integer] | Mean (sd): 2.3 (1); min < med < max: 0 < 2 < 6; IQR (CV): 1 (0.4) | 0: 29 (0.3%), 1: 2233 (22.0%), 2: 3282 (32.4%), 3: 3846 (38.0%), 4: 435 (4.3%), 5: 178 (1.8%), 6: 124 (1.2%) | 10127 (100.0%) | 0 (0.0%) |
13 | Contacts_Count_12_mon [integer] | Mean (sd): 2.5 (1.1); min < med < max: 0 < 2 < 6; IQR (CV): 1 (0.5) | 0: 399 (3.9%), 1: 1499 (14.8%), 2: 3227 (31.9%), 3: 3380 (33.4%), 4: 1392 (13.7%), 5: 176 (1.7%), 6: 54 (0.5%) | 10127 (100.0%) | 0 (0.0%) |
14 | Credit_Limit [numeric] | Mean (sd): 8632 (9088.8); min < med < max: 1438.3 < 4549 < 34516; IQR (CV): 8512.5 (1.1) | 6205 distinct values | 10127 (100.0%) | 0 (0.0%) |
15 | Total_Revolving_Bal [integer] | Mean (sd): 1162.8 (815); min < med < max: 0 < 1276 < 2517; IQR (CV): 1425 (0.7) | 1974 distinct values | 10127 (100.0%) | 0 (0.0%) |
16 | Avg_Open_To_Buy [numeric] | Mean (sd): 7469.1 (9090.7); min < med < max: 3 < 3474 < 34516; IQR (CV): 8534.5 (1.2) | 6813 distinct values | 10127 (100.0%) | 0 (0.0%) |
17 | Total_Amt_Chng_Q4_Q1 [numeric] | Mean (sd): 0.8 (0.2); min < med < max: 0 < 0.7 < 3.4; IQR (CV): 0.2 (0.3) | 1158 distinct values | 10127 (100.0%) | 0 (0.0%) |
18 | Total_Trans_Amt [integer] | Mean (sd): 4404.1 (3397.1); min < med < max: 510 < 3899 < 18484; IQR (CV): 2585.5 (0.8) | 5033 distinct values | 10127 (100.0%) | 0 (0.0%) |
19 | Total_Trans_Ct [integer] | Mean (sd): 64.9 (23.5); min < med < max: 10 < 67 < 139; IQR (CV): 36 (0.4) | 126 distinct values | 10127 (100.0%) | 0 (0.0%) |
20 | Total_Ct_Chng_Q4_Q1 [numeric] | Mean (sd): 0.7 (0.2); min < med < max: 0 < 0.7 < 3.7; IQR (CV): 0.2 (0.3) | 830 distinct values | 10127 (100.0%) | 0 (0.0%) |
21 | Avg_Utilization_Ratio [numeric] | Mean (sd): 0.3 (0.3); min < med < max: 0 < 0.2 < 1; IQR (CV): 0.5 (1) | 964 distinct values | 10127 (100.0%) | 0 (0.0%) |
For the education level plot, we group College, Doctorate, Graduate, and Post-Graduate into a single "Degree recipient" category. We also notice `Unknown` values in `Education_Level`, `Marital_Status`, and `Income_Category`, so we convert `Unknown` values to `NA`.
Across all education levels, the `Less than $40K` income category shows a high percentage.
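The recoding step above can be sketched as follows. The report's analysis is in R (where `dplyr::na_if()` would typically be used); purely as an illustration, an equivalent step in Python/pandas, with a small hypothetical stand-in for the real data frame, looks like:

```python
import pandas as pd

# Hypothetical three-row stand-in for the BankChurners data frame
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School"],
    "Marital_Status": ["Married", "Single", "Unknown"],
    "Income_Category": ["Unknown", "$120K +", "Less than $40K"],
})

# Treat the literal string "Unknown" as a missing value in the three columns
cols = ["Education_Level", "Marital_Status", "Income_Category"]
df[cols] = df[cols].replace("Unknown", pd.NA)

print(df.isna().sum().sum())  # 3 "Unknown" entries became missing values
```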
The more customers spend annually, the more likely they are to remain. Interestingly, the existing customers split into three clusters.
After 2-3 months of inactivity, customers are more likely to leave.
The higher the credit limit, the more likely customers are to remain.
`Credit_Limit` is correlated with `Avg_Open_To_Buy`, and `Total_Trans_Amt` is correlated with `Total_Trans_Ct`. Correlated features increase the complexity of the model without adding information, thus increasing the risk of errors.
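The first correlation is structural: in this dataset `Avg_Open_To_Buy` appears to equal `Credit_Limit` minus `Total_Revolving_Bal` (consistent with the summary means above, 8632 − 1162.8 ≈ 7469.1), so the two are correlated by construction. A small pandas sketch with toy numbers illustrates the check:

```python
import pandas as pd

# Toy values in the ranges from the summary table above (not real rows)
df = pd.DataFrame({
    "Credit_Limit": [1438.3, 4549.0, 12000.0, 34516.0],
    "Total_Revolving_Bal": [0.0, 1276.0, 900.0, 2517.0],
})
# Open-to-buy is the unused part of the credit limit
df["Avg_Open_To_Buy"] = df["Credit_Limit"] - df["Total_Revolving_Bal"]

r = df["Credit_Limit"].corr(df["Avg_Open_To_Buy"])
print(r > 0.99)  # True: near-perfect correlation, as in the real data
```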
## [1] TRUE
There are no duplicate customer numbers.
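The `TRUE` above presumably comes from an R check along the lines of `n_distinct(df$CLIENTNUM) == nrow(df)`. For illustration, a pandas equivalent (with a hypothetical `df`) would be:

```python
import pandas as pd

# Hypothetical stand-in: three of the CLIENTNUM values from the summary
df = pd.DataFrame({"CLIENTNUM": [708082083, 717926358, 828343083]})

# TRUE/FALSE check: every client number occurs exactly once
no_dups = df["CLIENTNUM"].is_unique
print(no_dups)  # True
```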
Characteristic | 0, N = 8,500¹ | 1, N = 1,627¹ | p-value² |
---|---|---|---|
Education_Level | 0.6 | ||
    Degree recipient | 4,280 (50%) | 828 (51%) | |
    High School | 1,707 (20%) | 306 (19%) | |
    Uneducated | 1,250 (15%) | 237 (15%) | |
    Unknown | 1,263 (15%) | 256 (16%) | |
Marital_Status | 0.11 | ||
    Divorced | 627 (7.4%) | 121 (7.4%) | |
    Married | 3,978 (47%) | 709 (44%) | |
    Single | 3,275 (39%) | 668 (41%) | |
    Unknown | 620 (7.3%) | 129 (7.9%) | |
Income_Category | 0.025 | ||
    $120K + | 601 (7.1%) | 126 (7.7%) | |
    $40K - $60K | 1,519 (18%) | 271 (17%) | |
    $60K - $80K | 1,213 (14%) | 189 (12%) | |
    $80K - $120K | 1,293 (15%) | 242 (15%) | |
    Less than $40K | 2,949 (35%) | 612 (38%) | |
    Unknown | 925 (11%) | 187 (11%) | |
Card_Category | 0.5 | ||
    Blue | 7,917 (93%) | 1,519 (93%) | |
    Gold | 95 (1.1%) | 21 (1.3%) | |
    Platinum | 15 (0.2%) | 5 (0.3%) | |
    Silver | 473 (5.6%) | 82 (5.0%) | |
¹ n (%)
² Pearson's Chi-squared test; Fisher's exact test
`Education_Level` p-value = 0.6, `Marital_Status` p-value = 0.11, `Income_Category` p-value = 0.025, `Card_Category` p-value = 0.5.
Characteristic | 0, N = 5,968¹ | 1, N = 1,113¹ | p-value² |
---|---|---|---|
Education_Level | 0.2 | ||
    Degree recipient | 3,534 (59%) | 690 (62%) | |
    High School | 1,416 (24%) | 237 (21%) | |
    Uneducated | 1,018 (17%) | 186 (17%) | |
Marital_Status | 0.12 | ||
    Divorced | 477 (8.0%) | 92 (8.3%) | |
    Married | 3,035 (51%) | 529 (48%) | |
    Single | 2,456 (41%) | 492 (44%) | |
Income_Category | 0.015 | ||
    $120K + | 470 (7.9%) | 102 (9.2%) | |
    $40K - $60K | 1,208 (20%) | 204 (18%) | |
    $60K - $80K | 958 (16%) | 145 (13%) | |
    $80K - $120K | 1,013 (17%) | 189 (17%) | |
    Less than $40K | 2,319 (39%) | 473 (42%) | |
Card_Category | 0.7 | ||
    Blue | 5,564 (93%) | 1,034 (93%) | |
    Gold | 68 (1.1%) | 13 (1.2%) | |
    Platinum | 8 (0.1%) | 3 (0.3%) | |
    Silver | 328 (5.5%) | 63 (5.7%) | |
¹ n (%)
² Pearson's Chi-squared test; Fisher's exact test
`Education_Level` p-value = 0.2, `Marital_Status` p-value = 0.12, `Income_Category` p-value = 0.015, `Card_Category` p-value = 0.7.
Comparing the p-values from the two tables, the difference between the original data and the data with `NA` values dropped is small, and since most of the p-values were not significant in the first place, dropping those rows is not a problem. Although rows containing `NA` values account for about 30% of the data, enough data remains to build models.
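Dropping the rows that contain `NA` values corresponds to R's `tidyr::drop_na()`. A minimal pandas sketch of the same step, on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame; pd.NA marks the recoded "Unknown" entries
df = pd.DataFrame({
    "Education_Level": ["Graduate", pd.NA, "High School", "College"],
    "Income_Category": ["$120K +", "Less than $40K", pd.NA, "$40K - $60K"],
})

complete = df.dropna()  # keep only rows with no missing values
print(len(complete))  # 2 of the 4 rows survive
```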
##
## 0 1
## 0 0.81 0.07
## 1 0.03 0.09
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1718 140
## 1 72 193
##
## Accuracy : 0.9001
## 95% CI : (0.8866, 0.9126)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 1.372e-14
##
## Kappa : 0.5882
##
## Mcnemar's Test P-Value : 4.193e-06
##
## Sensitivity : 0.57958
## Specificity : 0.95978
## Pos Pred Value : 0.72830
## Neg Pred Value : 0.92465
## Prevalence : 0.15685
## Detection Rate : 0.09091
## Detection Prevalence : 0.12482
## Balanced Accuracy : 0.76968
##
## 'Positive' Class : 1
##
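The output above is caret's `confusionMatrix()` for the stepwise logistic regression. The analysis itself is in R; purely as an illustration, a rough Python analogue of stepwise selection (greedy forward selection with scikit-learn, on synthetic stand-in data) might look like:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the churn features (not the real data)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Greedy forward selection, a rough analogue of R's AIC-based step()
base = LogisticRegression(max_iter=1000)
sel = SequentialFeatureSelector(base, n_features_to_select=4).fit(X, y)
model = base.fit(X[:, sel.get_support()], y)

cm = confusion_matrix(y, model.predict(X[:, sel.get_support()]))
print(cm)  # 2x2 table analogous to the caret output above
```

Note that `step()` in R compares models by AIC, while `SequentialFeatureSelector` scores candidate feature sets by cross-validation; the sketch mirrors the idea, not the exact criterion.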
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1721 139
## 1 69 194
##
## Accuracy : 0.902
## 95% CI : (0.8886, 0.9143)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 1.628e-15
##
## Kappa : 0.5949
##
## Mcnemar's Test P-Value : 1.716e-06
##
## Sensitivity : 0.58258
## Specificity : 0.96145
## Pos Pred Value : 0.73764
## Neg Pred Value : 0.92527
## Prevalence : 0.15685
## Detection Rate : 0.09138
## Detection Prevalence : 0.12388
## Balanced Accuracy : 0.77202
##
## 'Positive' Class : 1
##
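The second model repeats the logistic regression with k-fold cross-validation (in R, caret's `trainControl(method = "cv")`). An illustrative scikit-learn equivalent on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the report uses the churn features instead
X, y = make_classification(n_samples=500, random_state=0)

# 10-fold cross-validated accuracy, like caret's resampling estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores))  # one accuracy value per fold: 10
```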
## alpha lambda
## 5 1 0.0018
Tuning parameter `alpha` was held constant at a value of 1. Accuracy was used to select the optimal model using the largest value. The final values used for the model were alpha = 1 and lambda = 0.0018.
## 26 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 5.191
## Customer_Age -0.006
## GenderF 0.618
## Dependent_count 0.098
## Education_LevelHigh_School -0.012
## Education_LevelDegree_recipient 0.065
## Marital_StatusMarried -0.504
## Marital_StatusSingle .
## Income_Category40k_60k -0.068
## Income_Category60k_80k .
## Income_Category80k_120k 0.301
## Income_Categoryover_120k 0.534
## Card_CategorySilver 0.488
## Card_CategoryGold 0.303
## Card_CategoryPlatinum 1.259
## Months_on_book -0.009
## Total_Relationship_Count -0.400
## Months_Inactive_12_mon 0.429
## Contacts_Count_12_mon 0.408
## Credit_Limit 0.000
## Total_Revolving_Bal -0.001
## Total_Amt_Chng_Q4_Q1 -0.399
## Total_Trans_Amt 0.000
## Total_Trans_Ct -0.103
## Total_Ct_Chng_Q4_Q1 -2.681
## Avg_Utilization_Ratio .
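The sparse coefficient matrix above comes from glmnet: with alpha = 1 (a pure L1 penalty) some coefficients are shrunk exactly to zero (the `.` entries). For illustration, a scikit-learn version on synthetic data, where `C` plays the role of 1/lambda:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (the report fits glmnet on the churn features)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# L1-penalized ("LASSO") logistic regression; small C = strong penalty
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)

# Like the dgCMatrix above, some coefficients can end up exactly zero
n_zero = int(np.sum(lasso.coef_ == 0))
print(n_zero, "of", lasso.coef_.size, "coefficients are zero")
```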
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1723 144
## 1 67 189
##
## Accuracy : 0.9006
## 95% CI : (0.8871, 0.913)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 8.115e-15
##
## Kappa : 0.5852
##
## Mcnemar's Test P-Value : 1.676e-07
##
## Sensitivity : 0.56757
## Specificity : 0.96257
## Pos Pred Value : 0.73828
## Neg Pred Value : 0.92287
## Prevalence : 0.15685
## Detection Rate : 0.08902
## Detection Prevalence : 0.12058
## Balanced Accuracy : 0.76507
##
## 'Positive' Class : 1
##
##
## 0 1
## 0 0.81 0.07
## 1 0.03 0.09
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1719 144
## 1 71 189
##
## Accuracy : 0.8987
## 95% CI : (0.8851, 0.9112)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 6.424e-14
##
## Kappa : 0.5796
##
## Mcnemar's Test P-Value : 9.091e-07
##
## Sensitivity : 0.56757
## Specificity : 0.96034
## Pos Pred Value : 0.72692
## Neg Pred Value : 0.92271
## Prevalence : 0.15685
## Detection Rate : 0.08902
## Detection Prevalence : 0.12247
## Balanced Accuracy : 0.76395
##
## 'Positive' Class : 1
##
##
## 0 1
## 0 0.83 0.03
## 1 0.01 0.13
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1765 58
## 1 25 275
##
## Accuracy : 0.9609
## 95% CI : (0.9518, 0.9687)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.846
##
## Mcnemar's Test P-Value : 0.000444
##
## Sensitivity : 0.8258
## Specificity : 0.9860
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9682
## Prevalence : 0.1569
## Detection Rate : 0.1295
## Detection Prevalence : 0.1413
## Balanced Accuracy : 0.9059
##
## 'Positive' Class : 1
##
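Random Forest gives the strongest confusion matrix above. An illustrative scikit-learn sketch on synthetic stand-in data, mirroring the report's train/test evaluation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the report evaluates on a held-out churn test set
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, rf.predict(X_te))
print(cm.sum())  # 150 held-out rows in total
```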
##
## 0 1
## 0 0.80 0.07
## 1 0.05 0.08
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1692 153
## 1 98 180
##
## Accuracy : 0.8818
## 95% CI : (0.8673, 0.8952)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 2.341e-07
##
## Kappa : 0.5208
##
## Mcnemar's Test P-Value : 0.0006533
##
## Sensitivity : 0.54054
## Specificity : 0.94525
## Pos Pred Value : 0.64748
## Neg Pred Value : 0.91707
## Prevalence : 0.15685
## Detection Rate : 0.08479
## Detection Prevalence : 0.13095
## Balanced Accuracy : 0.74290
##
## 'Positive' Class : 1
##
##
## 0 1
## 0 0.82 0.04
## 1 0.02 0.11
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1739 92
## 1 51 241
##
## Accuracy : 0.9326
## 95% CI : (0.9211, 0.9429)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7319
##
## Mcnemar's Test P-Value : 0.0008229
##
## Sensitivity : 0.7237
## Specificity : 0.9715
## Pos Pred Value : 0.8253
## Neg Pred Value : 0.9498
## Prevalence : 0.1569
## Detection Rate : 0.1135
## Detection Prevalence : 0.1375
## Balanced Accuracy : 0.8476
##
## 'Positive' Class : 1
##
##
## 0 1
## 0 0.81 0.08
## 1 0.04 0.08
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1715 163
## 1 75 170
##
## Accuracy : 0.8879
## 95% CI : (0.8737, 0.901)
## No Information Rate : 0.8431
## P-Value [Acc > NIR] : 2.018e-09
##
## Kappa : 0.5251
##
## Mcnemar's Test P-Value : 1.707e-08
##
## Sensitivity : 0.51051
## Specificity : 0.95810
## Pos Pred Value : 0.69388
## Neg Pred Value : 0.91321
## Prevalence : 0.15685
## Detection Rate : 0.08008
## Detection Prevalence : 0.11540
## Balanced Accuracy : 0.73431
##
## 'Positive' Class : 1
##
Method | AUC | Accuracy | Kappa | Sensitivity | Specificity | F1_score |
---|---|---|---|---|---|---|
Stepwise Logistic Regression | 0.9138 | 0.9001 | 0.5882 | 0.5796 | 0.9598 | 0.6455 |
Logistic Regression with k-fold cv | 0.9135 | 0.9020 | 0.5949 | 0.5826 | 0.9614 | 0.6510 |
LASSO Regression | 0.9141 | 0.9006 | 0.5852 | 0.5676 | 0.9626 | 0.6418 |
Decision Tree | 0.8197 | 0.8987 | 0.5796 | 0.5676 | 0.9603 | 0.6374 |
Random Forest | 0.9859 | 0.9609 | 0.8460 | 0.8258 | 0.9860 | 0.8689 |
Naive Bayes | 0.8501 | 0.8818 | 0.5208 | 0.5405 | 0.9452 | 0.5892 |
SVM | 0.9595 | 0.9326 | 0.7319 | 0.7237 | 0.9715 | 0.7712 |
k-NN | 0.8820 | 0.8879 | 0.5251 | 0.5105 | 0.9581 | 0.5882 |
This is a summary of our prediction models. Random Forest is the best model: it reaches about 96% accuracy and outperforms the others on every metric (Kappa, Sensitivity, Specificity, and F1 score). Its AUC of 0.986 shows that the model learns the data well.
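The F1 scores in the table follow from each model's sensitivity (recall) and positive predictive value (precision). For example, the Random Forest row can be recomputed from its confusion matrix above:

```python
# Random Forest test-set counts from the output above (positive class = 1)
tp, fn, fp = 275, 58, 25

precision = tp / (tp + fp)  # Pos Pred Value: 0.9167
recall = tp / (tp + fn)     # Sensitivity: 0.8258
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8689, matching the table
```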