Bank Customer Churn Modeling
The purpose of this analysis is to examine the data on bank customers from Halloween BI, Ltd. 🎃 and find out:
What are the most important factors for predicting customer churn in this bank? How much time does the bank have on average before a customer churns?
To do so, I first carried out an exploratory data analysis to get familiar with the data and identify possible predictors of customers leaving the bank. I then predicted churn with a logistic regression model, a decision tree and a random forest, comparing their performance and assessing their quality. For the logistic regression and the random forest, randomly selected cases were analyzed to understand when the model makes a correct prediction and when it makes a mistake. In the last part, time to churn was modeled.
Preparation
As a first step, let’s take a look at the dataset. The first column, RowNumber, was excluded because it is simply a row number and duplicates the existing index.
##
## -----------------------------------------------------------------------------------------------------------------------------------------------------------
## CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## ------------ ---------- ------------- ----------- -------- ----- -------- --------- --------------- ----------- ---------------- ----------------- --------
## 15634602 Hargrave 619 France Female 42 2 0 1 1 1 101349 1
##
## 15647311 Hill 608 Spain Female 41 1 83808 1 0 1 112543 0
##
## 15619304 Onio 502 France Female 42 8 159661 3 1 0 113932 1
##
## 15701354 Boni 699 France Female 39 1 0 2 0 0 93827 0
##
## 15737888 Mitchell 850 Spain Female 43 2 125511 1 1 1 79084 0
##
## 15574012 Chu 645 Spain Male 44 8 113756 2 1 0 149757 1
## -----------------------------------------------------------------------------------------------------------------------------------------------------------
Several variables take only two values (0 and 1); they represent categorical features but are stored as integers. Thus, I converted HasCrCard, IsActiveMember and Exited to factors so that R models treat them correctly. Mapping the resulting values to “No” and “Yes” also makes interpretation and visualization slightly easier.
## 'data.frame': 10000 obs. of 13 variables:
## $ CustomerId : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ Surname : Factor w/ 2932 levels "Abazu","Abbie",..: 1116 1178 2039 290 1823 538 178 1989 1147 1082 ...
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
The following output summarizes the information about the available features:
- No missing data
- Customers come from three countries, and about half of them are from France
- Age of the clients ranges from 18 to 92
- Tenure ranges from 0 to 10 years, with a mean of about 5
- Most of the customers (about 71%) have a credit card
- The range of estimated salary is very wide (from about 12 to about 200,000)
## The data contains 10000 observations of the following variables:
## - CreditScore: Mean = 650.53, SD = 96.65, range = [350, 850], 0 missing
## - Geography: 3 levels: France (n = 5014); Germany (n = 2509) and Spain (n = 2477)
## - Gender: 2 levels: Female (n = 4543) and Male (n = 5457)
## - Age: Mean = 38.92, SD = 10.49, range = [18, 92], 0 missing
## - Tenure: Mean = 5.01, SD = 2.89, range = [0, 10], 0 missing
## - Balance: Mean = 76485.89, SD = 62397.41, range = [0, 250898.09], 0 missing
## - NumOfProducts: Mean = 1.53, SD = 0.58, range = [1, 4], 0 missing
## - HasCrCard: 2 levels: No (n = 2945) and Yes (n = 7055)
## - IsActiveMember: 2 levels: No (n = 4849) and Yes (n = 5151)
## - EstimatedSalary: Mean = 100090.24, SD = 57510.49, range = [11.58, 199992.48], 0 missing
## - Exited: 2 levels: No (n = 7963) and Yes (n = 2037)
Exploratory Data Analysis 🚀
churn
The exploration starts by looking at how many customers churn at all.
The churn rate is \(20.4\%\): about one in five customers decided to leave the bank, while \(79.6\%\) have stayed.
Apart from the main target variable, there are 12 features. The first two, CustomerId and Surname, identify the customer and cannot be used in the predictive models.
So, let’s examine the other features and check whether churners share some characteristics, starting with how the churn rate differs by Gender and Geography (country).
- The overall churn rate is \(20.4\%\), but it is higher for customers in Germany: \(32.4\%\) of them have churned, compared with \(16.2\%\) and \(16.7\%\) in France and Spain respectively (Fig. 2)
- Among females, the share of those who left the bank was \(8.6\) percentage points higher than among males: \(25.1\%\) compared with \(16.5\%\) (Fig. 3)
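The group rates quoted above can be reproduced from raw counts. The sketch below is written in Python purely for illustration (the analysis itself was done in R) and uses the group sizes and churner counts that appear in the grouped summaries later in this report; the helper name `churn_rate` is hypothetical.

```python
# (customers, churners) per group, copied from the grouped outputs in this report.
groups = {
    "France": (5014, 810),
    "Germany": (2509, 814),
    "Spain": (2477, 413),
    "Female": (4543, 1139),
    "Male": (5457, 898),
}


def churn_rate(n_customers, n_churned):
    """Share of customers in a group who left the bank, in percent."""
    return 100.0 * n_churned / n_customers


rates = {name: round(churn_rate(n, k), 1) for name, (n, k) in groups.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")

# The gap between genders is a difference of two percentages,
# so it is expressed in percentage points rather than percent.
gender_gap_pp = round(rates["Female"] - rates["Male"], 1)
print(f"Female vs. male gap: {gender_gap_pp} percentage points")
```

Keeping the gap in percentage points avoids reading it as a relative increase, which would be a much larger number here.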
The Age density differs between those who stayed and those who churned: according to Fig. 4, the median age of churned customers is higher, and the retained group contains more younger clients. Consequently, Age may act as a predictor of exiting. Three vertical lines represent the median age: black for all customers, orange for exited clients, green for those who stayed.
## [1] "Median age of stayed customers: 36"
## [1] "Median age of exited customers: 45"
- The Balance and CreditScore of customers seem to have little or no visible relationship with Age
- Observations with the lowest credit score (< 400) are those who exited (Fig. 7)
Let’s look at the Balance (Fig. 8a & 8b). Clients with both low and high balances are present among the customers who stayed, and \(39\%\) of them have \(\$0\). Churned customers also still have some money on their closed bank accounts, while only \(24.5\%\) of them had a zero balance. #StrangerThings.
More information is needed on what exactly this variable means and whether it was recorded at the same time as the other labels. Perhaps there are other causes of account cancellation? Or was the Balance variable not updated after the customer left? (took money -> cancelled -> 0)
## [1] "Percentage of retained clients with 0 balance: 39.143539"
## [1] "Percentage of churned clients with 0 balance: 24.545901"
The median balance of churned customers was higher compared to the stayed ones. The company should check, for instance, whether current offers/plans suit such customers, and if not, introduce new ones, as exits of clients with high balance may negatively affect the bank. (can’t say more as sadly nothing is stated about the available plans)
Meanwhile, the density of the CreditScore distribution is more or less the same for both exited and retained customers.
Moving on to other attributes of the customer’s account: HasCrCard, NumOfProducts and IsActiveMember.
The first one seems to be less useful, as the churn rate does not differ depending on whether the customer holds a credit card with the bank or not (Fig. 10).
Let’s see how the churn rate changes as the number of bank products the customer is subscribed to (or uses) increases. This is a very interesting attribute, as clients with 3 or 4 products have a higher churn rate than those with 1 or 2: \(100\%\) of customers with 4 products have churned!
Unfortunately, it is unknown which products the clients were using, and which ones may cause an increase of churn rate. The bank should carefully review which services have such a high rate and consider introduction of changes.
According to another plot below (Fig. 12), the churn rate is lower among active members. The data description marks the variable IsActiveMember as subjective (“Subjective, but for the concept”), so it is unclear how it was constructed. It would be interesting to see how exactly this feature was obtained and to what extent its definition differs from Exited.
Based on the output below, there were 326 (~3%) customers with 3 or more products (only 60 of them had 4).
Compared with the description of the entire dataset:
- the mean age of these clients is higher (43.7 compared to 38.9)
- females now outnumber males: \(57\%\) in the subset compared to \(45\%\) in the full dataset
- about \(86\%\) of these clients churned (280 of 326)!
Unfortunately, it is impossible to see whether all of them were subscribed to/used the same product, but it may be so.
## [1] "------------ Description of full dataset -----------"
## The data contains 10000 observations of the following variables:
## - CreditScore: Mean = 650.53, SD = 96.65, range = [350, 850], 0 missing
## - Geography: 3 levels: France (n = 5014); Germany (n = 2509) and Spain (n = 2477)
## - Gender: 2 levels: Female (n = 4543) and Male (n = 5457)
## - Age: Mean = 38.92, SD = 10.49, range = [18, 92], 0 missing
## - Tenure: Mean = 5.01, SD = 2.89, range = [0, 10], 0 missing
## - Balance: Mean = 76485.89, SD = 62397.41, range = [0, 250898.09], 0 missing
## - NumOfProducts: Mean = 1.53, SD = 0.58, range = [1, 4], 0 missing
## - HasCrCard: 2 levels: No (n = 2945) and Yes (n = 7055)
## - IsActiveMember: 2 levels: No (n = 4849) and Yes (n = 5151)
## - EstimatedSalary: Mean = 100090.24, SD = 57510.49, range = [11.58, 199992.48], 0 missing
## - Exited: 2 levels: No (n = 7963) and Yes (n = 2037)
## [1] "------ Description of the subset of customers with 3-4 products ------"
## The data contains 326 observations of the following variables:
## - CreditScore: Mean = 649.11, SD = 96.20, range = [358, 850], 0 missing
## - Geography: 3 levels: France (n = 133); Germany (n = 120) and Spain (n = 73)
## - Gender: 2 levels: Female (n = 187) and Male (n = 139)
## - Age: Mean = 43.65, SD = 10.15, range = [21, 77], 0 missing
## - Tenure: Mean = 5.06, SD = 2.89, range = [0, 10], 0 missing
## - Balance: Mean = 78821.79, SD = 62283.19, range = [0, 250898.09], 0 missing
## - NumOfProducts: Mean = 3.18, SD = 0.39, range = [3, 4], 0 missing
## - HasCrCard: 2 levels: No (n = 95) and Yes (n = 231)
## - IsActiveMember: 2 levels: No (n = 184) and Yes (n = 142)
## - EstimatedSalary: Mean = 104400.14, SD = 55759.76, range = [3440.47, 199357.24], 0 missing
## - Exited: 2 levels: No (n = 46) and Yes (n = 280)
The next plot shows the associations between the numeric attributes. Almost all variables are uncorrelated with each other, apart from several pairs that were previously identified as potentially important for the predictive models: (1) Balance and NumOfProducts, (2) Exited and Age, (3) Exited and Balance. The correlation coefficients between \(0.1\) and \(0.3\) indicate that there may be a weak association between these variables.
tenure
Tenure is the second variable of interest in the to-do list. Let’s see how it is related to the other ones. Overall, the mean tenure in the dataset is about 5 years (SD 2.9), with values ranging from 0 to 10.
As shown in the plots below, the distribution of Tenure does not differ by HasCrCard, Geography, Gender or IsActiveMember (boxplots everywhere…).
The scatter plots present the relationship between tenure and the continuous attributes. Again, as previously shown in the correlation plot, there is little or no relationship between the variables.
Churn Prediction
Logistic Regression
For the first model, I used a logistic regression with 5-fold cross-validation as the train control. The preparation steps included removing the CustomerId and Surname variables and splitting the data into two parts: the first for training and cross-validation, the second as a held-out test set.
Features were selected using the stepAIC function, which compares a sequence of models by adding and dropping variables according to the Akaike information criterion (AIC) and stops when no further step lowers the AIC. Based on the results, the function dropped two features, HasCrCard and EstimatedSalary, which were marked as possibly not important at the EDA stage.
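For reference, the criterion is usually defined as follows, where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model. That the function was run with this default penalty of 2 per parameter is an assumption, since the call itself is not shown in this report.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

A variable is therefore kept only if the gain in log-likelihood it brings outweighs the penalty for the extra parameter.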
The resulting output shows that Geography (Germany), Gender (Male), Age, Balance, NumOfProducts, CreditScore and IsActiveMember (Yes) have a significant effect on whether a customer stays or exits: the p-values for these features are \(<.05\), meaning that, if the variable had no effect, the probability of finding results like this or more extreme would be below \(5\%\). Geography (Spain) and Tenure are not significant.
outcome | |||
---|---|---|---|
Predictors | Odds Ratios | CI | p |
(Intercept) | 0.04 | 0.03 – 0.07 | <0.001 |
Credit Score | 1.00 | 1.00 – 1.00 | 0.002 |
Geography Germany | 2.08 | 1.79 – 2.41 | <0.001 |
Geography Spain | 0.99 | 0.85 – 1.16 | 0.947 |
Gender Male | 0.59 | 0.52 – 0.66 | <0.001 |
Age | 1.07 | 1.07 – 1.08 | <0.001 |
Tenure | 0.99 | 0.97 – 1.01 | 0.180 |
Balance | 1.00 | 1.00 – 1.00 | <0.001 |
Num Of Products | 0.89 | 0.80 – 0.98 | 0.021 |
Is Active Member Yes | 0.36 | 0.32 – 0.41 | <0.001 |
Observations | 8001 | ||
Cox & Snell’s R2 / Nagelkerke’s R2 | 0.141 / 0.221 |
Exponentiating the model coefficients turns them into odds ratios, which are easier to interpret:
## (Intercept) CreditScore GeographyGermany GeographySpain
## 0.04 1.00 2.08 0.99
## GenderMale Age Tenure Balance
## 0.59 1.07 0.99 1.00
## NumOfProducts IsActiveMemberYes
## 0.89 0.36
feature | interpretation |
---|---|
Geography Germany | The odds of churning for customers from Germany are 2.08 times those of clients from France (an increase of 108%) |
Gender Male | Being male multiplies the odds of churning by 0.59 (a 41% decrease) compared to females |
Age | A one-year increase in age multiplies the odds of churning by 1.07 (a 7% increase) |
NumOfProducts | Each additional product multiplies the odds of churning by 0.89 (an 11% decrease) |
IsActiveMember Yes | For active members, the odds of churning are multiplied by 0.36 (a 64% decrease) |
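To make the reading of the table concrete, here is a small Python sketch (illustration only; the analysis itself was done in R). It converts an odds ratio into a percent change in odds and shows how an odds ratio moves a probability. The function names are hypothetical, and using the overall churn rate of 0.204 as a baseline probability is an assumption made for the example: the model's odds ratios are conditional on the other covariates.

```python
# Odds ratios copied from the model table above.
odds_ratios = {
    "Geography Germany": 2.08,
    "Gender Male": 0.59,
    "Age": 1.07,
    "NumOfProducts": 0.89,
    "IsActiveMember Yes": 0.36,
}


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def shifted_probability(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


for feature, ratio in odds_ratios.items():
    print(f"{feature}: {percent_change_in_odds(ratio):+.0f}% change in odds")

# Illustrative only: a baseline probability of 0.204 combined with the
# Germany odds ratio. Odds double, but the probability does not.
print(round(shifted_probability(0.204, 2.08), 3))
```

The last line is a reminder that an odds ratio of about 2 does not mean the probability of churn doubles.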
How well did the model perform? On the training set:
- “No” (stay) was predicted correctly 6149 times (TN) and incorrectly 1286 times (FN)
- “Yes” (churn) was predicted correctly in 344 cases (TP) and incorrectly in 222 (FP)
The overall accuracy is \(81\%\). This is better than simply predicting the majority class (“No”), which would give \(79.6\%\). Other important metrics are precision and recall:
- Precision is the ratio of correctly predicted “churned” to the total number predicted as “churned”, meaning that \(61\%\) of “Yes” predictions were actually correct (\(\frac{344}{344+222}\))
- Recall equals 0.21, so only \(21\%\) of churned customers were identified (\(\frac{344}{344+1286}\))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6149 1286
## Yes 222 344
##
## Accuracy : 0.8115
## 95% CI : (0.8028, 0.82)
## No Information Rate : 0.7963
## P-Value [Acc > NIR] : 0.0003344
##
## Kappa : 0.2327
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.60777
## Recall : 0.21104
## F1 : 0.31330
## Prevalence : 0.20372
## Detection Rate : 0.04299
## Detection Prevalence : 0.07074
## Balanced Accuracy : 0.58810
##
## 'Positive' Class : Yes
##
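The metrics in this output follow directly from the four cells of the confusion matrix. A minimal Python sketch (illustration only, the report itself used R; `classification_metrics` is a hypothetical helper) recomputes them from the training counts, with “Yes” as the positive class:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        # Accuracy of always predicting the larger class.
        "no_information_rate": max(tp + fn, tn + fp) / total,
    }


# Training-set counts from the confusion matrix above.
train = classification_metrics(tp=344, fp=222, fn=1286, tn=6149)
for name, value in train.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy is worth watching here: with an 80/20 class split, plain accuracy can look good even when most churners are missed.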
And how does the model perform on the test set?
- Accuracy equals \(81\%\), the same as on the training set, so the model does not overfit; note, however, that this is only marginally above the no-information rate of \(79.6\%\) (p = 0.06)
- Precision is \(0.6\), meaning that \(60\%\) of the customers predicted to churn actually churned
- Recall is \(0.2\), so the model identified \(20\%\) of those who churned
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1538 325
## Yes 54 82
##
## Accuracy : 0.8104
## 95% CI : (0.7925, 0.8274)
## No Information Rate : 0.7964
## P-Value [Acc > NIR] : 0.06238
##
## Kappa : 0.2228
##
## Mcnemar's Test P-Value : < 2e-16
##
## Precision : 0.60294
## Recall : 0.20147
## F1 : 0.30203
## Prevalence : 0.20360
## Detection Rate : 0.04102
## Detection Prevalence : 0.06803
## Balanced Accuracy : 0.58378
##
## 'Positive' Class : Yes
##
The last but important metric in this part is the area under the ROC curve (AUC), which is used later for model comparison. The higher the AUC, the better the model is at separating the classes (a perfect classifier has an AUC of 1, while 0.5 means no class separation capacity). Here it equals \(0.78\), which is quite good.
## Area under the curve: 0.7803
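One way to build intuition for the AUC: it equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner (ties count as half). The Python sketch below (illustration only, not the R code used for this report) computes it by brute force on four made-up observations; the labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = churned, 0 = stayed.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)
```

Because it depends only on the ranking of the scores, the AUC is not affected by the 0.5 classification threshold, which makes it a convenient metric for comparing models.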
Interpretation 🔮
Age is the most important feature for the model, followed by IsActiveMember (Yes), Geography (Germany) and Gender (Male). Overall, these are consistent with those identified at the EDA step. However, NumOfProducts is not as important as I expected. Let’s proceed with more in-depth interpretations using lime.
Understanding why a model makes a particular decision may be crucial. The plots below show several cases from the test set where the model prediction was either correct or wrong, with the top five features for each case. Green bars mark features that support the model’s decision, red bars mark features that contradict it.
correct predictions ✔️️
The test set contains 1999 observations. The distribution of the Exited variable (referred to as actual) is the same as in the full dataset: \(79.6\%\) stayed, \(20.4\%\) exited. The model predicted 1863 observations as retained and 136 as exited. In total there are 1620 correct answers and 379 wrong ones, so the accuracy is \(\frac{1620}{1999}=81\%\).
## The data contains 1999 observations of the following variables:
## - sample_id: Mean = 1000.00, SD = 577.21, range = [1, 1999], 0 missing
## - No: Mean = 0.80, SD = 0.16, range = [0.14, 0.99], 0 missing
## - Yes: Mean = 0.20, SD = 0.16, range = [0.01, 0.86], 0 missing
## - actual: 2 levels: No (n = 1592) and Yes (n = 407)
## - prediction: 2 entries: No, n = 1863; Yes, n = 136
## - correct: 2 entries: correct, n = 1620; wrong, n = 379
For instance: (1) test case 593 is a 44-year-old female from Germany who has been with the bank for 6 years, is subscribed to 1 product, has a credit card and a fairly high balance, but is not an active member. She was correctly predicted as churned because she is:
- not an active member
- from Germany
- female
This observation also falls into the age bin between 37 and 44, which is more typical of customers who stayed. In addition, the probability of the “Yes” label is \(0.57\), which is quite close to the threshold of 0.5.
## sample_id No Yes actual prediction correct CreditScore
## 3075 593 0.4295501 0.5704499 Yes Yes correct 614
## Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 3075 Germany Female 44 6 118715.9 1 Yes
## IsActiveMember EstimatedSalary Exited
## 3075 No 133591.1 Yes
(2) Another example, case 1616, shows why the model predicted that a person would stay. The probability of “No” here is \(0.96\), so the model was very sure about the prediction. So, why did it make this decision?
For “stay”: this person is male, younger than 32, and an active member from France (not Germany).
## sample_id No Yes actual prediction correct CreditScore
## 8162 1616 0.9554198 0.04458022 No No correct 588
## Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 8162 France Male 31 4 99607.37 2 No
## IsActiveMember EstimatedSalary Exited
## 8162 Yes 35877.03 No
(3) And case 372. The model was very sure that this person would stay because the customer is younger than 32, from Spain and male. However, he is also not an active member.
## sample_id No Yes actual prediction correct CreditScore
## 1864 372 0.9281113 0.0718887 No No correct 731
## Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 1864 Spain Male 25 8 96950.21 1 Yes
## IsActiveMember EstimatedSalary Exited
## 1864 No 97877.92 No
wrong predictions ❌
As shown on the plots below (scroll down a bit), in all three randomly selected cases the model made incorrect decisions mostly based on Age, IsActiveMember and Gender.
(1) The customer with test id 1984 was predicted as stayed, while he actually churned. The probability of “Yes” is \(0.34\), of “No” \(0.66\). Why did the model make a mistake? (“stay” instead of “churn”)
The person was predicted as stayed because he is male and an active member. At the same time, “churn” was supported by an age higher than 44 and being from Germany.
## sample_id No Yes actual prediction correct CreditScore
## 9918 1984 0.6634682 0.3365318 Yes No wrong 635
## Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 9918 Germany Male 53 8 117005.6 1 No
## IsActiveMember EstimatedSalary Exited
## 9918 Yes 123646.6 Yes
(2) Case 299: the model predicted stay instead of churn. The top features supported the “Yes” label, while being from France (Geography) supported “No”, and the model made a mistake because of that.
(3) A similar situation occurs in case 1359: a 50-year-old male from France who had been with the bank for 5 years, but had an even higher balance and no credit card. The model predicted “No” as well.
## sample_id No Yes actual prediction correct CreditScore
## 1479 299 0.6139743 0.3860257 Yes No wrong 649
## 6817 1359 0.5913829 0.4086171 Yes No wrong 559
## Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 1479 France Female 45 5 92786.66 1 Yes
## 6817 France Male 50 5 162702.35 1 No
## IsActiveMember EstimatedSalary Exited
## 1479 No 173365.9 Yes
## 6817 No 150548.5 Yes
Ok, let’s move on.
Trees
From experience, I know that no single method is always the best, so I used a decision tree as the next model to predict churners. Keeping in mind that single trees tend to overfit, I then ran a feature selection algorithm and a Random Forest classifier.
decision tree
Specification:
- all variables
- 5-fold cross-validation
- summary metric that will be used to select the optimal model - ROC
## Conditional Inference Tree
##
## 8001 samples
## 10 predictors
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 6401, 6400, 6401, 6401, 6401
## Resampling results across tuning parameters:
##
## mincriterion ROC Sens Spec
## 0.01 0.7561407 0.9303110 0.3975460
## 0.50 0.7804787 0.9604455 0.3368098
## 0.99 0.7688680 0.9678225 0.3049080
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.5.
The decision tree seems to perform better than logistic regression in terms of accuracy on the training set: \(84.7\%\) compared to \(81\%\). Both precision and recall have increased, from \(0.6\) to \(0.75\) and from \(0.2\) to \(0.37\) respectively, which is good. Now the model finds \(37\%\) of the churned class.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6174 1027
## Yes 197 603
##
## Accuracy : 0.847
## 95% CI : (0.8389, 0.8548)
## No Information Rate : 0.7963
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4183
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.75375
## Recall : 0.36994
## F1 : 0.49630
## Prevalence : 0.20372
## Detection Rate : 0.07537
## Detection Prevalence : 0.09999
## Balanced Accuracy : 0.66951
##
## 'Positive' Class : Yes
##
On the test set, the accuracy drops by about \(1.5\) percentage points to \(83\%\), and precision and recall drop as well. Still, it is better than logistic regression.
- Precision is \(0.68\), so \(68\%\) of those predicted to churn actually churned
- Recall is \(0.33\), meaning the model identified \(33\%\) of those who churned
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1528 272
## Yes 64 135
##
## Accuracy : 0.8319
## 95% CI : (0.8148, 0.8481)
## No Information Rate : 0.7964
## P-Value [Acc > NIR] : 3.141e-05
##
## Kappa : 0.36
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.67839
## Recall : 0.33170
## F1 : 0.44554
## Prevalence : 0.20360
## Detection Rate : 0.06753
## Detection Prevalence : 0.09955
## Balanced Accuracy : 0.64575
##
## 'Positive' Class : Yes
##
interpretation 🔮
This first lonely tree is quite hard to interpret visually, as it performed many splits. The first split was based on Age, followed by either Geography (Germany) or IsActiveMember.
So, let’s look at the (relative) variable importance. Similarly to logistic regression, Age is the most important variable, followed by IsActiveMember. But in third place we can see NumOfProducts, which looked a bit promising at the EDA stage: the churn rate among customers with \(\geq3\) products was quite high. Other important variables are Balance and Gender.
Let’s proceed to feature selection and random forest building.
feature selection (RFE)
The feature selection was performed using Recursive Feature Elimination.
RFE dropped Tenure and HasCrCard, keeping the 8 attributes listed below. This mostly supports the EDA part, where having a card and tenure did not show a relationship with the target variable Exited.
## [1] "NumOfProducts" "Age" "IsActiveMember" "Balance"
## [5] "Geography" "Gender" "CreditScore" "EstimatedSalary"
random forest
Let’s finally fit a random forest with the variables selected by RFE. What do we have here?
First, the accuracy on the training set is \(90\%\), which is better than the previous models. There is also good news for precision and recall:
- Precision of \(0.96\): most cases predicted as churned actually churned
- Recall of \(0.517\): now the model identifies \(52\%\) of the churners
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6336 787
## Yes 35 843
##
## Accuracy : 0.8973
## 95% CI : (0.8904, 0.9038)
## No Information Rate : 0.7963
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6177
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.9601
## Recall : 0.5172
## F1 : 0.6722
## Prevalence : 0.2037
## Detection Rate : 0.1054
## Detection Prevalence : 0.1097
## Balanced Accuracy : 0.7558
##
## 'Positive' Class : Yes
##
On the test part, the random forest performs well too:
- Accuracy: \(86\%\)
- Lower precision of \(0.80\)
- Recall equals \(0.40\)
The model identifies \(40\%\) of churners, and \(80\%\) of those predicted to leave the bank actually left. In sum, Random Forest is better than the logistic regression model, as it showed higher accuracy and better precision and recall.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1551 243
## Yes 41 164
##
## Accuracy : 0.8579
## 95% CI : (0.8418, 0.873)
## No Information Rate : 0.7964
## P-Value [Acc > NIR] : 6.399e-13
##
## Kappa : 0.4627
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.80000
## Recall : 0.40295
## F1 : 0.53595
## Prevalence : 0.20360
## Detection Rate : 0.08204
## Detection Prevalence : 0.10255
## Balanced Accuracy : 0.68860
##
## 'Positive' Class : Yes
##
And the last comparison: AUC. Logistic regression had \(0.78\), while random forest shows a clear improvement to \(0.84\)! So, this model is the best one for today.
## Area under the curve: 0.8445
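To see the three models side by side, the test-set confusion matrices reported above can be put into one small comparison. This is a Python sketch for illustration only (the models were fitted in R); the counts are copied from the outputs in this report and the helper `summarize` is hypothetical.

```python
# Test-set counts as (TP, FP, FN, TN), with "Yes" (churn) as the positive class.
test_matrices = {
    "logistic regression": (82, 54, 325, 1538),
    "decision tree": (135, 64, 272, 1528),
    "random forest": (164, 41, 243, 1551),
}


def summarize(tp, fp, fn, tn):
    """Accuracy, precision and recall for one confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


comparison = {name: summarize(*cells) for name, cells in test_matrices.items()}

print(f"{'model':<22}{'accuracy':>10}{'precision':>11}{'recall':>8}")
for name, m in comparison.items():
    print(f"{name:<22}{m['accuracy']:>10.3f}{m['precision']:>11.3f}{m['recall']:>8.3f}")

# Which model finds the largest share of actual churners?
best_by_recall = max(comparison, key=lambda name: comparison[name]["recall"])
print("Best recall:", best_by_recall)
```

Even the best model here still misses roughly six in ten churners, so recall is the metric with the most room for improvement.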
interpretation 🔮
For the random forest, Age is the most important feature, followed by NumOfProducts and the money-related variables. In contrast, the other models had IsActiveMember in the top 3!
correct & wrong predictions
Let’s take three correct and three wrong predictions, as it was done for logistic regression, to see the contribution of each feature to the prediction for an observation.
In all three randomly selected correct cases, the model took NumOfProducts into account: in cases 428 and 803, the customers had fewer than two products. However, the first one was an active member, while the second was not. In case 1334 (who churned), the person had more than two products, was older than 44 and was not an active member.
When did the model make a mistake? Actually, all selected observations are quite similar: the model predicted that they would stay, but they churned.
The main reason was having \(\leq2\) products, which makes sense, as EDA showed that the churn rate among those having more than 2 is higher. Other contributing features were whether the client was active, and the balance.
Tenure
In this section, the variables of interest are Exited and Tenure.
- If a customer churned, Exited equals “Yes”, and Tenure is the amount of time the person was a customer, in years
- If a client stayed, Exited is “No”, and Tenure shows the time under observation (until the date of data collection), in years
There are customers who have been with the bank for 10 years, and some for less than a year. The median Tenure in the dataset of both churned and stayed customers is around 5 years.
Kaplan-Meier
Now that we have briefly refreshed what the data contains, let’s run a Kaplan-Meier analysis and plot the survival curves, first without grouping, i.e. not considering covariates.
## Call: survfit(formula = survival ~ 1, data = X, type = "kaplan-meier")
##
## n events *rmean *se(rmean) median 0.95LCL
## 1.00e+04 2.04e+03 8.48e+00 3.04e-02 1.00e+01 1.00e+01
## 0.95UCL
## NA
## * restricted mean with upper limit = 10
The purple line shows the estimated share of surviving customers at each point in time, and the shaded area shows the 95% confidence interval. For instance, 99% of customers are expected to survive the first year and approximately 60% survive to year 9, while by year 10 the survival probability falls to 48%. Actually, not bad.
The dashed lines show the median survival time, which is approximately 10 years! (reminder: the maximum tenure is also 10, and there are no observations with a higher value)
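For readers unfamiliar with the estimator: at each time with at least one churn event, the Kaplan-Meier curve is multiplied by the share of at-risk customers who did not churn at that time, and the median is the first time the curve drops to 0.5 or below. The Python sketch below (illustration only, not the R code behind the output above) implements this on six made-up customers; the data and function names are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] for each distinct event time."""
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        # Customers still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers who churned exactly at time t.
        churned = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - churned / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First time the curve reaches 0.5 or below, or None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy data: tenure in years and whether the customer churned.
toy_tenure = [1, 2, 2, 3, 4, 5]
toy_churned = [True, False, True, True, False, False]
toy_curve = kaplan_meier(toy_tenure, toy_churned)
print([(t, round(s, 3)) for t, s in toy_curve])
print("Median survival time:", median_survival(toy_curve))
```

Customers who stayed (censored observations) never lower the curve themselves; they only leave the at-risk set once their observed tenure has passed. A median of `None` corresponds to the `NA` medians shown for some groups in the outputs below, where the curve never reaches 0.5.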
Now let’s add a categorical covariate and see whether there are differences in survival times between females and males.
## Call: survfit(formula = Surv(X$Tenure, X$Exited == "Yes") ~ Gender,
## data = X, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## Gender=Female 4543 1139 10 10 10
## Gender=Male 5457 898 NA NA NA
The following plot presents estimates of survival probability grouped by Gender, and the table shows the number of customers under observation at certain points in time.
The p-value in the left corner comes from a log-rank test (non-parametric: it makes no assumption about the shape of the survival distribution), where \(H_0\) is:
No difference between the populations in the probability of churn at any point (curves are identical)
The p-value is \(<0.0001\), so \(H_0\) is rejected: the survival probabilities differ between genders.
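The idea behind the log-rank test can be sketched in a few lines: at every event time, compare the number of churn events observed in one group with the number expected if both groups shared the same hazard, then sum the differences and standardize. The Python sketch below is an illustration under that textbook formulation, not the routine used to produce the plot; the toy data and the function name are hypothetical.

```python
import math


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(d, e, "a") for d, e in zip(durations_a, events_a)]
    data += [(d, e, "b") for d, e in zip(durations_b, events_b)]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({d for d, e, _ in data if e}):
        n_a = sum(1 for d, _, g in data if d >= t and g == "a")
        n_b = sum(1 for d, _, g in data if d >= t and g == "b")
        d_a = sum(1 for d, e, g in data if d == t and e and g == "a")
        d_b = sum(1 for d, e, g in data if d == t and e and g == "b")
        n, d_total = n_a + n_b, d_a + d_b
        # Expected events in group a if both groups had the same hazard.
        observed_minus_expected += d_a - d_total * n_a / n
        if n > 1:
            variance += d_total * (n_a / n) * (n_b / n) * (n - d_total) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data: group a churns early, group b churns late.
early = [1, 2, 3, 4, 5]
late = [6, 7, 8, 9, 10]
all_churned = [True] * 5
chi2, p_value = logrank_test(early, all_churned, late, all_churned)
print(round(chi2, 2), round(p_value, 4))
```

With identical groups the observed and expected counts cancel out, the statistic is 0 and the p-value is 1, which matches the null hypothesis stated above.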
The following output shows that customers from Germany churn faster than clients from France and Spain: for them, the median survival time equals 9 years, while for France and Spain it is not reached within the observation period.
## Call: survfit(formula = survival ~ Geography, data = X, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## Geography=France 5014 810 NA NA NA
## Geography=Germany 2509 814 9 9 9
## Geography=Spain 2477 413 NA NA NA
cox regression
Further, I fitted a Cox proportional hazards regression model, based on the variables identified as important in the previous steps, to examine how the selected covariates influence the hazard rate.
The most significant variables turn out to be Age, IsActiveMember, Geography (Germany) and Gender. Balance and CreditScore are also significant at the 0.05 level, although their hazard ratios per unit are very close to 1. Time for interpretation!
## Call:
## coxph(formula = survival ~ NumOfProducts + Age + IsActiveMember +
## Balance + Geography + Gender + CreditScore + EstimatedSalary,
## data = X)
##
## coef exp(coef) se(coef) z p
## NumOfProducts -6.635e-02 9.358e-01 3.871e-02 -1.714 0.0865
## Age 4.753e-02 1.049e+00 1.776e-03 26.767 < 2e-16
## IsActiveMemberYes -7.368e-01 4.786e-01 4.747e-02 -15.523 < 2e-16
## Balance 2.111e-06 1.000e+00 4.384e-07 4.816 1.47e-06
## GeographyGermany 4.865e-01 1.627e+00 5.437e-02 8.948 < 2e-16
## GeographySpain 4.893e-02 1.050e+00 6.050e-02 0.809 0.4187
## GenderMale -3.875e-01 6.788e-01 4.476e-02 -8.657 < 2e-16
## CreditScore -4.984e-04 9.995e-01 2.252e-04 -2.213 0.0269
## EstimatedSalary -1.294e-08 1.000e+00 3.834e-07 -0.034 0.9731
##
## Likelihood ratio test=1155 on 9 df, p=< 2.2e-16
## n= 10000, number of events= 2037
The following model output and forest plot show the hazard ratios for all selected covariates. Values \(>1\) indicate an increased risk of churn, while values \(<1\) indicate a decreased one.
- Male clients have a 32% lower hazard of churning than female clients (reference: female)
- One additional year of the customer’s Age multiplies the hazard rate by 1.049 (an increase of about 5%)
- Customers from Germany have a hazard of churning 1.63 times that of customers from France (63% higher; reference: France). Wow! As for Spain, the hazard does not differ significantly from France, as the p-value is \(0.4\)
- Active members have a lower risk of churn than inactive ones: the hazard decreases by 52% (reference: not active)
## NumOfProducts Age IsActiveMemberYes Balance
## 0.93580 1.04867 0.47863 1.00000
## GeographyGermany GeographySpain GenderMale CreditScore
## 1.62657 1.05014 0.67877 0.99950
## EstimatedSalary
## 1.00000
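The percentages quoted in the interpretation can be recomputed from the coef column of the model output: the hazard ratio is \(\exp(\text{coef})\), and the percentage change in hazard is \((\exp(\text{coef}) - 1) \times 100\). The sketch below is plain Python rather than the R session behind this report. The function names are illustrative, and the coefficients are copied by hand from the output.

```python
import math

# Coefficients copied from the coxph output in this report.
coefs = {
    "Age": 0.04753,
    "IsActiveMemberYes": -0.7368,
    "GeographyGermany": 0.4865,
    "GenderMale": -0.3875,
}


def hazard_ratio(coef):
    """Hazard ratio for a one-unit increase (or for the given level)."""
    return math.exp(coef)


def percent_change(coef):
    """Percentage change in the hazard relative to the reference."""
    return (math.exp(coef) - 1.0) * 100.0


for name, coef in coefs.items():
    print(f"{name:18s} HR = {hazard_ratio(coef):.3f}  "
          f"change = {percent_change(coef):+.1f}%")
```

Because the printed coefficients are rounded, the recomputed ratios can differ from the R output in the last digit.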
## Call:
## coxph(formula = survival ~ NumOfProducts + Age + IsActiveMember +
## Balance + Geography + Gender + CreditScore + EstimatedSalary,
## data = X)
##
## n= 10000, number of events= 2037
##
## coef exp(coef) se(coef) z Pr(>|z|)
## NumOfProducts -6.635e-02 9.358e-01 3.871e-02 -1.714 0.0865 .
## Age 4.753e-02 1.049e+00 1.776e-03 26.767 < 2e-16 ***
## IsActiveMemberYes -7.368e-01 4.786e-01 4.747e-02 -15.523 < 2e-16 ***
## Balance 2.111e-06 1.000e+00 4.384e-07 4.816 1.47e-06 ***
## GeographyGermany 4.865e-01 1.627e+00 5.437e-02 8.948 < 2e-16 ***
## GeographySpain 4.893e-02 1.050e+00 6.050e-02 0.809 0.4187
## GenderMale -3.875e-01 6.788e-01 4.476e-02 -8.657 < 2e-16 ***
## CreditScore -4.984e-04 9.995e-01 2.252e-04 -2.213 0.0269 *
## EstimatedSalary -1.294e-08 1.000e+00 3.834e-07 -0.034 0.9731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## NumOfProducts 0.9358 1.0686 0.8674 1.0096
## Age 1.0487 0.9536 1.0450 1.0523
## IsActiveMemberYes 0.4786 2.0893 0.4361 0.5253
## Balance 1.0000 1.0000 1.0000 1.0000
## GeographyGermany 1.6266 0.6148 1.4622 1.8095
## GeographySpain 1.0501 0.9523 0.9327 1.1823
## GenderMale 0.6788 1.4732 0.6218 0.7410
## CreditScore 0.9995 1.0005 0.9991 0.9999
## EstimatedSalary 1.0000 1.0000 1.0000 1.0000
##
## Concordance= 0.718 (se = 0.006 )
## Rsquare= 0.109 (max possible= 0.967 )
## Likelihood ratio test= 1155 on 9 df, p=<2e-16
## Wald test = 1202 on 9 df, p=<2e-16
## Score (logrank) test = 1244 on 9 df, p=<2e-16
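The z column and the confidence limits in the summary follow from coef and se(coef): \(z = \text{coef} / \text{se}\), and the 95% interval for the hazard ratio is \(\exp(\text{coef} \pm 1.96 \cdot \text{se})\). A small sketch in plain Python, with a hypothetical helper name and values copied from the output above, reproduces them for the two Geography levels.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Return (z, lower, upper): Wald z statistic and the 95% CI
    for the hazard ratio exp(coef)."""
    z = coef / se
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return z, lower, upper


# (coef, se(coef)) copied from the summary output in this report.
germany = wald_summary(0.4865, 0.05437)
spain = wald_summary(0.04893, 0.06050)

print("Germany: z = %.3f, CI = (%.4f, %.4f)" % germany)
print("Spain:   z = %.3f, CI = (%.4f, %.4f)" % spain)
```

The interval for Spain contains 1, which is another way of seeing that its hazard is not significantly different from the reference group, while the interval for Germany lies entirely above 1.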
oh, assumptions check: proportional hazards
But does the data violate the model assumptions, or is everything fine? Under the proportional hazards assumption, the effect of each covariate on the hazard (its hazard ratio) should be constant over time. Here statistical significance is a bad sign: it means that the feature violates the assumption (its coefficient is time-dependent).
As you can see, only the covariate Geography (Spain) has a p-value lower than \(0.05\). Balance is quite close to this level as well, but still above it. The global test is also not statistically significant. Therefore, overall we can assume proportional hazards, with some caution regarding GeographySpain.
## rho chisq p
## NumOfProducts 0.02648 2.8868 0.0893
## Age -0.01234 0.2082 0.6482
## IsActiveMemberYes -0.03042 2.0645 0.1508
## Balance 0.03723 3.4908 0.0617
## GeographyGermany -0.01792 0.6919 0.4055
## GeographySpain -0.04493 4.1179 0.0424
## GenderMale -0.00389 0.0311 0.8600
## CreditScore 0.01384 0.4053 0.5244
## EstimatedSalary 0.02875 1.6894 0.1937
## GLOBAL NA 15.0504 0.0896
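For the per-covariate rows, the p column is the upper tail of a chi-square distribution with 1 degree of freedom evaluated at chisq, which can be written as \(p = \operatorname{erfc}(\sqrt{\text{chisq}/2})\). The sketch below checks a few rows in plain Python. The helper name is illustrative, and the GLOBAL row is not covered because it uses 9 degrees of freedom.

```python
import math


def chisq1_pvalue(chisq):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chisq / 2.0))


# chisq values copied from the proportional hazards test output above.
rows = {
    "Balance": 3.4908,
    "GeographySpain": 4.1179,
    "GenderMale": 0.0311,
}

for name, chisq in rows.items():
    p = chisq1_pvalue(chisq)
    flag = "possible violation" if p < 0.05 else "ok"
    print(f"{name:15s} chisq = {chisq:.4f}  p = {p:.4f}  {flag}")
```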
To illustrate, let's also look at the following plots of the scaled Schoenfeld residuals against transformed time for each covariate (sounds a bit spooky?), in particular the one for GeographySpain (row 2, col 3). The line is close to horizontal in the beginning but later starts to go down, which is not good but not terrible.
Conclusion
Summing up this report!
What are the most important factors for predicting customer churn in this bank?
According to the models and cases presented in the analysis, the most important factors are:
- Age: as age increases, a customer becomes more likely to churn
- IsActiveMember: active members churn less
- NumOfProducts: customers with \(\geq3\) products churn more
- Geography: the churn rate among clients from Germany is higher than among clients from France and Spain
How much time does the bank have on average before a customer churns?
The median survival time (time without churning) was estimated at approximately 10 years, while the mean was between 8 and 9 years. Note that the maximum time under observation equals 10 years, so both estimates are limited by the observation window and may change with a longer follow-up.
Some references
- Log-rank test
- General description of survival analysis
- Refreshing ROC/AUC related stuff
- ROC plot code
- Cox diagnostics
- Nice note on churn rate. Main point: churn rate alone does not tell anything useful and rarely acts as a signal of a business's growth or state.