The aim of this work is to examine the dataset on the direct phone call marketing campaigns that a Portuguese bank ran from May 2008 to November 2010 to promote term deposits among existing customers [1], and to do the following:
- explain the current situation with the subscription rate
- create a predictive model (whether a client will subscribe to a term deposit)
- suggest a policy for increasing the subscription rate and analyze it using what-if modeling
In the first section, exploratory data analysis, I describe the current situation with the subscription rate and identify possible predictors of a client subscribing to a term deposit. The next one focuses on several predictive models (two decision trees & a random forest), comparing their performance, identifying the most important features, and then applying the best model to 6 randomly selected cases in order to analyze when the model makes a correct prediction and when it is mistaken. After that, I construct a Bayesian network summarizing the hypotheses proposed in the previous sections and propose a policy for improvement.
What is a term deposit, by the way?
A term deposit is a fixed-term investment that includes the deposit of money into an account at a financial institution. Term deposit investments usually carry short-term maturities ranging from one month to a few years and will have varying levels of required minimum deposits. [2]
In plainer words, a client locks his/her money for a certain period, and the bank pays them an interest rate. This money can be used by the bank for various purposes, including lending it to other clients [3]. Thus, it is highly important to get customers to subscribe to such offers so that the bank has a sufficient amount of capital.
The very first step is to load the packages & the dataset, and then describe the data.
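The package-loading chunk itself is not printed; judging by the functions called throughout the report, it presumably includes at least the following libraries (a sketch - the exact list is an assumption):
library(tidyverse)   # readr::read_delim(), dplyr, ggplot2
library(plyr)        # count(), mapvalues()
library(see)         # theme_modern(), scale_fill_material_d()
library(scales)      # percent()
library(sjlabelled)  # remove_attributes()
library(ggpubr)      # ggarrange(), desc_statby()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), cell_spec()
library(report)      # report() data summary
library(caret)       # model training
library(bnlearn)     # Bayesian network structure learning & fitting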
# Helper: bar chart of the subscription rate within the levels of a grouping variable
plot_rate_bar = function(data, gr_variable, x_label="", sub_label="", y_label="Percentage",
                         title="Percentage of subscribed customers", only_yes=TRUE,
                         label_size=3.5, tag=""){
  # share of yes/no within each level of the grouping variable
  tmp_df = data.frame(data$y, data[[gr_variable]])
  colnames(tmp_df) = c("y", "grp")
  tmp_df = plyr::count(tmp_df, vars=c("y", "grp")) %>%
    dplyr::group_by(grp) %>%
    dplyr::mutate(pct = freq/sum(freq))
  if (only_yes==TRUE) {
    tmp_df = tmp_df %>% dplyr::filter(y=="yes")
    rev = scale_fill_material_d(palette = "ice", reverse = TRUE)
  } else {
    rev = scale_fill_material_d(palette = "ice", reverse = FALSE)
  }
  p = ggplot(data=tmp_df, aes(x=grp, y=pct, fill=y)) +
    geom_col(position="dodge") +
    geom_text(aes(label=scales::percent(pct)), position = position_dodge(0.9),
              color="#909499", vjust=-0.75, size=label_size) +
    scale_y_continuous(labels = percent, limits=c(0, 1)) +
    labs(x=x_label, y=y_label, title=title, subtitle=sub_label, tag=tag) +
    theme_modern() +
    theme(plot.title = element_text(family = "Helvetica", color = "#191919",
                                    size = 15, face="bold", vjust = -2),
          plot.subtitle = element_text(family = "Helvetica", color = "#7d7d7d",
                                       size = 12, vjust = 2),
          legend.position = "none", plot.tag = element_text(size=10),
          panel.grid.major.y = element_line(color = "#fafafa"),
          axis.line = element_blank()) + rev
  return(p)
}
df = read_delim("bank-full.csv", ";", escape_double = FALSE, trim_ws = TRUE)
df = remove_attributes(df, "spec") # drop attributes w. col type specifications
df$y = factor(df$y, levels = c("no", "yes")) # target as a factor with levels no/yes
The dataset contains 17 variables that provide the following information about each customer. The table presents the name of each variable, the category it falls into, and its description.
vars_desc %>%
  dplyr::mutate(group = cell_spec(group, "html",
                                  color = ifelse(group=="target", "#34465d",
                                          ifelse(group=="campaign", "#ea4c89",
                                          ifelse(group=="account", "#0077B5",
                                                 "#410093"))))) %>%
  dplyr::select(variable, group, meaning) %>%
  kable(format = "html", escape = F) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
variable | group | meaning |
---|---|---|
age | socio-dem. | customer age |
job | socio-dem. | type of job |
marital | socio-dem. | marital status |
education | socio-dem. | education level |
default | account | has credit in default |
balance | account | balance level |
housing | account | has housing loan |
loan | account | has personal loan |
contact | campaign | contact communication type |
day | campaign | last contact day of the month |
month | campaign | last contact month of year |
duration | campaign | last contact duration, in seconds |
campaign | campaign | number of contacts performed during this campaign and for this client |
pdays | campaign | number of days that passed by after the client was last contacted from a previous campaign |
previous | campaign | number of contacts performed before this campaign and for this client |
poutcome | campaign | outcome of the previous marketing campaign |
y | target | has the client subscribed a term deposit |
According to the output, the variables are parsed either as character or numeric, and this is not quite correct: some of them are actually categorical. I fix that by transforming the following columns to factors (next step): job, marital, education, default, housing, loan, contact, month, poutcome.
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 45211 obs. of 17 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The following output shows the report of the data frame, including basic statistics for each column.
Some first small highlights:
- balance has a very wide range, from -8019 to 102127, including negative values
- pdays (number of days that passed after the client was last contacted from a previous campaign) has a negative value (-1); probably this is the encoding for those who are being contacted for the 1st time
- age range is quite big, covering clients from 18 to even 95 years old
factor_columns = c("job", "marital", "education", "default", "housing", "loan",
                   "contact", "month", "poutcome")
df[factor_columns] = lapply(df[factor_columns], factor)
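The call that produces the summary below is hidden; it looks like the output of the report package (an assumption on my side):
# Presumably something along these lines generated the data description that follows
report::report(df)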
## The data contains 45211 observations of the following variables:
## - age: Mean = 40.94, SD = 10.62, range = [18, 95], 0% missing
## - job: 12 levels: admin. (n = 5171); blue-collar (n = 9732); entrepreneur (n = 1487); housemaid (n = 1240); management (n = 9458); retired (n = 2264); self-employed (n = 1579); services (n = 4154); student (n = 938); technician (n = 7597); unemployed (n = 1303) and unknown (n = 288)
## - marital: 3 levels: divorced (n = 5207); married (n = 27214) and single (n = 12790)
## - education: 4 levels: primary (n = 6851); secondary (n = 23202); tertiary (n = 13301) and unknown (n = 1857)
## - default: 2 levels: no (n = 44396) and yes (n = 815)
## - balance: Mean = 1362.27, SD = 3044.77, range = [-8019, 102127], 0% missing
## - housing: 2 levels: no (n = 20081) and yes (n = 25130)
## - loan: 2 levels: no (n = 37967) and yes (n = 7244)
## - contact: 3 levels: cellular (n = 29285); telephone (n = 2906) and unknown (n = 13020)
## - day: Mean = 15.81, SD = 8.32, range = [1, 31], 0% missing
## - month: 12 levels: apr (n = 2932); aug (n = 6247); dec (n = 214); feb (n = 2649); jan (n = 1403); jul (n = 6895); jun (n = 5341); mar (n = 477); may (n = 13766); nov (n = 3970); oct (n = 738) and sep (n = 579)
## - duration: Mean = 258.16, SD = 257.53, range = [0, 4918], 0% missing
## - campaign: Mean = 2.76, SD = 3.10, range = [1, 63], 0% missing
## - pdays: Mean = 40.20, SD = 100.13, range = [-1, 871], 0% missing
## - previous: Mean = 0.58, SD = 2.30, range = [0, 275], 0% missing
## - poutcome: 4 levels: failure (n = 4901); other (n = 1840); success (n = 1511) and unknown (n = 36959)
## - y: 2 levels: no (n = 39922) and yes (n = 5289)
Now, as the data types are assigned properly, it's time for the visualizations!
Firstly, I begin with the distribution of the most important variable - the overall subscription rate (y) - and then proceed with the description of the socio-demographic, account-related and campaign-related variables. Then I summarize the main findings that are further used in the (1) model building and (2) Bayesian network stages.
What is the subscription rate of term deposits among the existing customers? According to the plot, ~12% of clients have agreed to a term deposit.
Let's move on to the next variables and see how the subscribed clients differ from those who did not subscribe.
The socio-demographic features are examined in Fig. 2-4, depending on whether the customer has subscribed.
Fig. 5 shows a half-violin, half-dot plot of the distribution and the sample size of age.
I further cut age into 6 age groups and calculated the subscription rate within them. As shown in Fig. 6 & in the table, the highest rate was in the 65-100 age group (the oldest client is 95), followed by 18-25. This is quite consistent with Fig. 2, where subscription rates were highest among students and retired people.
y | length | min | max | mean | sd |
---|---|---|---|---|---|
no | 39922 | 18 | 95 | 40.83899 | 10.17266 |
yes | 5289 | 18 | 95 | 41.67007 | 13.49778 |
Now let's move to the account-related attributes: balance, default, housing, loan.
So, perhaps balance may influence the decision to subscribe to a term deposit? (as a person needs to have some money to invest)
As shown in Fig. 7-9, most clients who subscribed to a deposit have a rather positive balance, and both the mean & median values are higher for this group.
desc_statby(df, measure.var = "balance", grps = "y")[, c("y", "length", "min", "max" ,"mean", "median", "sd")] %>%
kable(format = "html", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
y | length | min | max | mean | median | sd |
---|---|---|---|---|---|---|
no | 39922 | -8019 | 102127 | 1303.715 | 417 | 2974.195 |
yes | 5289 | -3058 | 81204 | 1804.268 | 733 | 3501.105 |
The next three are:
- default - has credit in default
- housing - has a housing loan
- loan - has a personal loan
Simple conclusion for Fig. 10-12: rates are higher among those who have no default, no personal loan and no housing loan. Probably this happens because they do not owe money to the bank and have more to invest.
bar_def = plot_rate_bar(data = df, gr_variable = "default", x_label = "has credit in default", sub_label = "by default", tag="Fig. 10")
bar_loan = plot_rate_bar(data = df, gr_variable = "loan", x_label = "personal loan", sub_label = "by loan", tag="Fig. 11")
bar_house = plot_rate_bar(data = df, gr_variable = "housing", x_label = "housing loan", sub_label = "by housing", tag="Fig. 12")
ggarrange(bar_def, bar_loan, bar_house, ncol=3, nrow = 1)
- contact - contact communication type (cellular (n = 29285); telephone (n = 2906) and unknown (n = 13020))
- duration - last contact duration, in seconds (Mean = 258.16, SD = 257.53, range = [0, 4918])
So, the clients were contacted mostly by cellular (n = 29285), another 2906 - by telephone. For the remaining 13020 clients the contact communication type is unknown.
The median duration of the last contact for subscribers was about 7 minutes, while for the others it was between 2 and 3 minutes. What is interesting: for someone it took just 8 seconds to subscribe!
desc_statby(df, measure.var = "duration", grps = "y")[, c("y", "length", "min", "max" ,"mean", "median", "sd")] %>%
kable(format = "html", escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
y | length | min | max | mean | median | sd |
---|---|---|---|---|---|---|
no | 39922 | 0 | 4918 | 221.1828 | 164 | 207.3832 |
yes | 5289 | 8 | 3881 | 537.2946 | 426 | 392.5253 |
During the campaign (Mean = 2.76, SD = 3.10, range = [1, 63]), most of the clients were contacted a few times, while some - more than 50. Perhaps such customers had additional questions or problems, or asked to be called back (e.g. parents aren't home). We'll never know. Fig. 16 shows the subscription rate depending on the number of contacts performed during this campaign for a given client. In general, the percentage decreases as the number of calls increases. However, for those who were contacted very many times (note: up to 32, not the maximum value) the subscription rate slightly increases; this may happen due to a few outlier cases. The table itself shows the campaign variable split by y.
In Fig. 17 we can finally see some separation: clients who were contacted fewer times had longer conversations and subscribed, so in general the duration of the contact was higher for those who agreed. Clients contacted very often had shorter conversations and mostly declined (purple dots at the bottom). Main point: do not contact your clients that often, otherwise the company may be seen as annoying, which in the long term may lead to the churn of these customers!
We also have data on month and day. Let's take a look at these two!
- day - last contact day of the month
- month - last contact month of year (apr (n = 2932); aug (n = 6247); dec (n = 214); feb (n = 2649); jan (n = 1403); jul (n = 6895); jun (n = 5341); mar (n = 477); may (n = 13766); nov (n = 3970); oct (n = 738) and sep (n = 579))
The highest subscription rates happened in March and in September, October and December, the lowest - in summer. No clear pattern is observed for day; however, it seems that most clients agree at the beginning and close to the end of the month. I wish we had the year in the data to detect the weekday! (perhaps people have slightly more time on holidays)
- pdays - number of days that passed after the client was last contacted from a previous campaign (Mean = 40.20, SD = 100.13, range = [-1, 871])
The most frequent value for pdays is -1, which is a bit strange: how can a number of days be negative? It is likely the encoding for those customers who were not contacted before at all. The others were mostly contacted ~90 and ~180 days ago.
df %>% dplyr::group_by(pdays) %>% dplyr::count(sort = T) %>% head() %>%
kable(format = "html", escape = F, align = c("l", "r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
pdays | n |
---|---|
-1 | 36954 |
182 | 167 |
92 | 147 |
91 | 126 |
183 | 126 |
181 | 117 |
From the following plots, we can detect some sort of "seasonality" in the number of days passed since the last contact of the previous campaign. It seems the previous campaigns were run every ~90 days (about 3 months), as the peaks are located at 90, 180, etc. days (note: -1 is excluded from Fig. 20).
It is also noticeable that a year ago only a very small number of customers were reached.
- previous - number of contacts performed before this campaign and for this client (Mean = 0.58, SD = 2.30, range = [0, 275])
- poutcome - outcome of the previous marketing campaign
As for the number of previous contacts, most of the clients were contacted 0 times, meaning that this is the first time they have been called.
The percentage of subscribed clients was the highest among those for whom the outcome of the previous campaign was successful: 64.7%! Perhaps the campaign was similar, and the customers were satisfied with the service and thus decided to continue.
That’s it for the plots. So, what are the main findings?
In this section, I create two decision trees and a random forest. I then compare their performance on various metrics, identify the most important variables, and provide an interpretation for several correctly and incorrectly predicted cases. After that, I move to the Bayesian networks section, where the main findings are summarized and tested.
Here I employ two methods of building decision trees that are supported by the caret package: ctree and rpart. In sum, the first one performs better than the second, which seemed a bit odd at first. Looking closer at the output, the accuracy of the two models is very similar; however, the recall of the first model is higher, as is the ROC-AUC.
A bit of googling showed that they are implementations of different algorithms, which makes the difference in scores quite explainable. Thus, as the metrics are higher for ctree, I conclude that this method performs better on the provided data and is the best of the two. [3]
Set up:
As we have 88% of non-subscribed users, the baseline accuracy should be \(\geq 88\%\).
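The training chunk is not shown in the output. Below is a minimal sketch of what it presumably looks like, reconstructed from the printed summary (5-fold CV, ROC as the selection metric, the mincriterion grid and the object name one_tree); the 80/20 split, the seed and the exact arguments are assumptions.
set.seed(17)                                   # assumed seed
train_idx = createDataPartition(df$y, p = 0.8, list = FALSE)
X_train = df[train_idx, ]                      # ~36170 rows, as in the summary below
X_test  = df[-train_idx, ]                     # held-out part, used later for testing
ctrl = trainControl(method = "cv", number = 5,
                    classProbs = TRUE, summaryFunction = twoClassSummary)
one_tree = caret::train(y ~ ., data = X_train, method = "ctree",
                        metric = "ROC", trControl = ctrl,
                        tuneGrid = data.frame(mincriterion = c(0.01, 0.50, 0.99)))
one_tree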
## Conditional Inference Tree
##
## 36170 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 28937, 28935, 28935, 28937, 28936
## Resampling results across tuning parameters:
##
## mincriterion ROC Sens Spec
## 0.01 0.8827731 0.9561025 0.4692817
## 0.50 0.8956743 0.9614564 0.4650353
## 0.99 0.8974755 0.9609553 0.4629263
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.99.
#plot(one_tree$finalModel, type="simple")
Performance on train:
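The chunk producing the confusion matrix is hidden; presumably it is caret's confusionMatrix() with mode = "everything", which is what prints the Precision/Recall/F1 block seen below (the exact call is an assumption):
cm_tree_train = confusionMatrix(data = predict(one_tree, newdata = X_train),
                                reference = X_train$y,
                                positive = "yes", mode = "everything")
cm_tree_train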
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 30852 2308
## yes 1086 1924
##
## Accuracy : 0.9062
## 95% CI : (0.9031, 0.9092)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4809
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.63920
## Recall : 0.45463
## F1 : 0.53134
## Prevalence : 0.11700
## Detection Rate : 0.05319
## Detection Prevalence : 0.08322
## Balanced Accuracy : 0.71031
##
## 'Positive' Class : yes
##
More importantly, the performance on test:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7687 596
## yes 297 461
##
## Accuracy : 0.9012
## 95% CI : (0.8949, 0.9073)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 2.178e-08
##
## Kappa : 0.4547
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.60818
## Recall : 0.43614
## F1 : 0.50799
## Prevalence : 0.11691
## Detection Rate : 0.05099
## Detection Prevalence : 0.08384
## Balanced Accuracy : 0.69947
##
## 'Positive' Class : yes
##
The main metric I use to compare the models is the ROC-AUC, which shows how well the model distinguishes one class from another. The higher the AUC (and the closer the curve to the upper-left corner), the better the model: a perfect classifier has an AUC of 1, while 0.5 means no class-separation capacity. Here it equals 90.39%, which is quite good!
## Area under the curve: 0.9039
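The AUC line above is presumably produced with the pROC package; a minimal sketch (the object name one_tree follows the report, while X_test and the exact arguments are assumptions):
library(pROC)
probs_test = predict(one_tree, newdata = X_test, type = "prob")[, "yes"]
roc_tree   = roc(response = X_test$y, predictor = probs_test, levels = c("no", "yes"))
auc(roc_tree)       # area under the curve
plot(roc_tree)      # ROC curve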
While making a decision, the most important factors for the tree were duration, contact and housing. The least important: month, day, age and default.
Another way of creating a tree is to use rpart. Again:
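Again, the training chunk is hidden; presumably an analogous call with method = "rpart" (the object name rpart_tree and the tuneLength value are my assumptions):
set.seed(17)
rpart_tree = caret::train(y ~ ., data = X_train, method = "rpart",
                          metric = "ROC", trControl = ctrl, tuneLength = 3)
rpart_tree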
## CART
##
## 36170 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 28937, 28935, 28935, 28937, 28936
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.02244802 0.7526214 0.9737930 0.3237202
## 0.02481096 0.7518272 0.9700356 0.3355364
## 0.03386894 0.6994867 0.9739805 0.2754843
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02244802.
Here comes an interesting thing. The accuracy and precision are quite similar to ctree - about 89% and 62% respectively. However, the recall on the training set is 0.33, which is definitely lower than 0.45.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 31091 2850
## yes 847 1382
##
## Accuracy : 0.8978
## 95% CI : (0.8946, 0.9009)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3775
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.62001
## Recall : 0.32656
## F1 : 0.42780
## Prevalence : 0.11700
## Detection Rate : 0.03821
## Detection Prevalence : 0.06163
## Balanced Accuracy : 0.65002
##
## 'Positive' Class : yes
##
On the test set the model performs in a similar way - an accuracy of 90% and a precision of 65%. But the percentage of correctly identified "subscribers" is just 34%, compared to 44% using ctree.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7787 695
## yes 197 362
##
## Accuracy : 0.9013
## 95% CI : (0.895, 0.9074)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.793e-08
##
## Kappa : 0.3994
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.64758
## Recall : 0.34248
## F1 : 0.44802
## Prevalence : 0.11691
## Detection Rate : 0.04004
## Detection Prevalence : 0.06183
## Balanced Accuracy : 0.65890
##
## 'Positive' Class : yes
##
And, consequently, the AUC is just 76.3%, which looks sad compared to the 90.39% achieved by ctree.
As a result, the conclusion about the decision trees is that the ctree model performed better on this data than rpart.
## Area under the curve: 0.763
The next plot shows the variable importance calculated for this model (only variables with importance above 0.0 are kept), and it differs from the previous one. The most important variables are poutcome (success) and duration. The least important are those related to months and days, plus the filtered-out ones, e.g. age, marital status other than married, etc. :)
A single tree built by ctree is good, but what if an entire forest is better?
Set up:
#set.seed(17)
#rf_model_b = caret::train(y ~ .,data=X_train, method="rf", ntrees=10,
# trControl = trainControl(method = "cv", number = 3))
#saveRDS(rf_model_b, "_rf_model.rds")
rf_model_b = readRDS("_rf_model.rds")
rf_model_b
## Random Forest
##
## 36170 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 24113, 24114, 24113
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8920100 0.1594414
## 22 0.9061653 0.4893659
## 42 0.9045065 0.4854582
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
The performance on the training data looks perfect: accuracy of 1, precision of 1, recall of 1 - a "perfect" classifier! Of course, this simply means the forest has memorized the training set, so the test metrics below are what really matter.
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 31938 0
## yes 0 4232
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Precision : 1.000
## Recall : 1.000
## F1 : 1.000
## Prevalence : 0.117
## Detection Rate : 0.117
## Detection Prevalence : 0.117
## Balanced Accuracy : 1.000
##
## 'Positive' Class : yes
##
Now, take a closer look at the confusion matrix for test data. We have the following:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7676 565
## yes 308 492
##
## Accuracy : 0.9034
## 95% CI : (0.8972, 0.9095)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 3.518e-10
##
## Kappa : 0.4772
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.61500
## Recall : 0.46547
## F1 : 0.52989
## Prevalence : 0.11691
## Detection Rate : 0.05442
## Detection Prevalence : 0.08849
## Balanced Accuracy : 0.71345
##
## 'Positive' Class : yes
##
We can also look at the rates computed from the test confusion matrix.
tn = cm_rf_test$table[1,1] ; fn = cm_rf_test$table[1,2] ; fp = cm_rf_test$table[2,1] ; tp = cm_rf_test$table[2,2]
tpr = tp/(tp+fn) ; tnr = tn/(tn+fp) ; fpr = fp/(fp+tn) ; fnr = fn/(fn+tp)
sprintf("TPR: %f", tpr)
## [1] "TPR: 0.465468"
sprintf("TNR: %f", tnr)
## [1] "TNR: 0.961423"
sprintf("FPR: %f", fpr)
## [1] "FPR: 0.038577"
sprintf("FNR: %f", fnr)
## [1] "FNR: 0.534532"
Finally, the AUC has increased to 92.57%. Looks like we have a winner here!
The Random Forest model is the best among those presented, based on almost every metric, including the AUC.
## Area under the curve: 0.9257
The most important variable, again, is duration, followed by balance, age, day and poutcome (success). This means that the call duration, balance, age, day of contact and success of the previous campaign strongly influence the model's decision. Let's go deeper in the next section!
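A minimal sketch of how these importances could be extracted from the fitted forest, assuming caret's standard helper is used:
rf_imp = varImp(rf_model_b)   # caret wrapper around the randomForest importance measure
plot(rf_imp, top = 15)        # plot the 15 most important variables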
This section is devoted to explaining the model's individual predictions using the lime package.
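The lime chunk itself is not shown; below is a rough sketch of the usual workflow with this package. The case ids mentioned in the text follow the report, but the row selection, n_features and other arguments are assumptions.
library(lime)
# build the explainer on the training features and the fitted caret model
explainer = lime(X_train %>% dplyr::select(-y), rf_model_b)
# hypothetical selection of a few test cases (e.g. the ones discussed below: 2809, 5169, 9027)
cases = X_test[c(1, 2, 3), ] %>% dplyr::select(-y)
explanation = explain(cases, explainer, n_labels = 1, n_features = 6)
plot_features(explanation)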
Here we have three cases where the random forest made correct predictions. In all three, the most important feature was duration, but split on different values. Case 2809 was predicted as not subscribed because the conversation lasted less than 103 seconds, the person was aged between 39 and 48, and the outcome of the previous campaign was unknown. Two features, though, contradicted the selected label: a balance higher than 1427 and the month (July).
Client 5169 did not subscribe. Why? Well, the outcome of the previous campaign was unknown, he/she was middle-aged, had a small balance and was contacted in November, when the overall subscription rate was rather low.
Another person, 9027, subscribed to a deposit! The model made this decision because the client had previously participated in a campaign successfully and the call duration was long enough to convince the customer. However, he or she had quite a small balance, and this is the variable that contradicts the "subscribed" prediction.
The main cause of the incorrect predictions for the 3 selected cases was a duration longer than 319 seconds. The wrong label was also supported by the month and poutcome features: the unknown result of the previous contact supports the "no" label.
library(Rgraphviz)
library(gRain)
So, it's time to find factors that may increase the clients' chances of subscribing. As shown in the previous sections, the top features differed between the presented models; however, most of them had duration, balance, age, poutcome, contact, day and housing on the list. Other socio-demographic characteristics of customers had zero or little importance.
In order to understand the situation and the possible ties more clearly, let's take a step back and specify what a «term deposit», a «personal loan» and a «home loan» are.
Okay, now it’s time to proceed to the next sub-section - preparation of the data.
In order to create the Bayesian networks, the numeric variables first have to be cut into factors.
Age can simply be divided into intervals:
It's clear from the plot that the subscription rate among people aged 65+ is the highest (42%), followed by the 18-25 age group. Clients from the middle age groups paid less attention to the campaign. This means that the bank should further target the youngest and the oldest age groups, as the others showed rather little interest. As can be seen from the table below the plot, the bank mostly contacted people aged 25-55 (assuming that this is only a part of the bank's entire database).
Balance may be cut into the following groups (without creating too many):
The conclusion here is pretty straightforward: clients with a higher balance subscribed more often than those with a low balance. Perhaps they have some extra money, and the interest rate is higher for those who can afford a bigger term deposit. So it's better to focus on clients whose balance is at least non-negative, or better.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019 72 448 1362 1428 102127
Duration is split based on the distribution & general knowledge.
This feature was one of the most important for the models. A short note, though: duration is actually not a very good variable for a real-life model, as the length of a conversation is not known before the call is performed.
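The chunk that builds the discretised data frame df_bn (used by the Bayesian-network code below) is not printed. A minimal sketch of how it could look: the age breaks match the groups printed later in the report, while the balance and duration breaks and labels are assumptions based on the quartiles discussed above.
df_bn = df %>%
  dplyr::mutate(
    age = cut(age, breaks = c(17, 25, 35, 45, 55, 65, 100)),
    balance = cut(balance, breaks = c(-Inf, 0, 448, 1428, 5000, Inf),
                  labels = c("negative", "below median", "above median", "high", "very high")),
    duration = cut(duration, breaks = c(-Inf, 120, 300, 600, Inf),
                   labels = c("short", "medium", "long", "very long"))
  ) %>%
  dplyr::select(age, balance, duration, default, housing, loan, poutcome, contact, y)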
Firstly, I started with a very simple model based only on four money-related variables. The automatically learned structure showed that the default and housing features are connected to balance. So, according to the automatic logic, (1) having credit in default and (2) having a housing loan influence the client's balance, which seems plausible: default means that a client did not have enough money, so it's likely that people with a debt will have a lower balance. However, I guess it's more likely that once a customer's balance turns negative, the account is marked yes in default; consequently, the tie should rather be balance -> default.
As for the connections to y: the link from balance seems correct, while housing and loan should probably be linked to y only through balance. Or, for instance, balance -> loan -> y, since if a person has a small amount of money, he/she may take a loan & then needs money to pay for it (i.e. no/less money for a term deposit). Also, the EDA stage showed that the subscription rate was higher among those having no default, no personal loan and no housing loan.
set.seed(17)
df_bn_first = data.frame(df_bn %>% dplyr::select(housing, loan, default, balance, y)) %>% na.omit()
bnStructure = bnlearn::hc(df_bn_first)
graphviz.plot(bnStructure, highlight = list(nodes="y", fill="#55acee", col="white"))
Furthermore, the link between housing and loan does not look very reliable. Well, I'm not a pro in banking, but I went to Google to check whether such a connection between the two loans is valid at all; as far as I understood, one does not imply the other. I then looked at the data, and some customers with a housing loan did not have a personal loan. It looks better to delete this connection, as it becomes too complex to interpret, at least.
head(df_bn %>% dplyr::select(default, housing, loan, y), 3)
## # A tibble: 3 x 4
## default housing loan y
## <fct> <fct> <fct> <fct>
## 1 no yes no no
## 2 no yes no no
## 3 no yes yes no
Now the situation is the following: balance can directly influence the decision to subscribe to a term deposit, since in order to do so a client needs some money. Also, the interest rate may be higher for bigger amounts, so the conditions may look more attractive to such clients. Balance is also related to default, housing and loan. In the case of default, having a negative balance mostly means that a client probably has no money to pay for something, e.g. a housing loan. However, having a default does not necessarily mean that a person has a loan or a housing loan, as shown in the table below. As far as I know, people usually take loans when they do not have enough money to buy something they want, so, perhaps, a lower balance makes it more likely that they will take a loan.
In sum, here I add these ties because balance may influence the decision to subscribe to a term deposit differently if a person has some loans or debts.
df_bn %>% dplyr::filter(default=="yes") %>% group_by(default, housing, loan) %>% dplyr::select(default, housing, loan) %>% dplyr::count()
## # A tibble: 4 x 4
## # Groups: default, housing, loan [4]
## default housing loan n
## <fct> <fct> <fct> <int>
## 1 yes no no 212
## 2 yes no yes 168
## 3 yes yes no 302
## 4 yes yes yes 133
#bn = model2network("[loan][housing][default][balance|housing:loan:default][y|balance]")
bn = model2network("[balance][loan|balance][housing|balance][default|balance][y|balance:housing:loan:default]")
graphviz.plot(bn, highlight = list(nodes="y", fill="#55acee", col="white"))
Let's calculate the probabilities on the data and see what we get!
net_first = bn.fit(bn, data = data.frame(df_bn %>% dplyr::select(housing, loan, default, balance, y)))
So, the probability of subscription according to this small network is 11.49%.
set.seed(17)
cpquery(net_first, event = (y == "yes"), evidence = TRUE)
## [1] 0.1149
What about different segments?
- having a loan decreases the probability of subscription from 0.1176 to 0.068
- not having a loan slightly increases it to 0.127
set.seed(17)
sprintf("P. if a personal has a loan: %f", cpquery(net_first, event = (y == "yes"), evidence = (loan == "yes")))
## [1] "P. if a personal has a loan: 0.068140"
sprintf("P. if a personal doesn't have a loan: %f", cpquery(net_first, event = (y == "yes"), evidence = (loan == "no")))
## [1] "P. if a personal doesn't have a loan: 0.126588"
The probability of subscription decreases to 0.079 if a person has a housing loan, and increases to 0.16 if a client doesn't have one. So, this confirms the suggestion that having loans negatively influences the decision to subscribe to a term deposit.
set.seed(17)
sprintf("P. if a personal has a housing loan: %f", cpquery(net_first, event = (y == "yes"), evidence = (housing == "yes")))
## [1] "P. if a personal has a housing loan: 0.078793"
sprintf("P. if a personal doesn't have a housing loan: %f", cpquery(net_first, event = (y == "yes"), evidence = (housing == "no")))
## [1] "P. if a personal doesn't have a housing loan: 0.163017"
And what if a person has a default, a personal loan and a housing loan?
set.seed(17)
sprintf("P. if has: %f", cpquery(net_first, event = (y == "yes"), evidence = (housing == "yes" & loan=="yes" & default=="yes")))
## [1] "P. if has: 0.071429"
sprintf("P. if not: %f", cpquery(net_first, event = (y == "yes"), evidence = (housing == "no" & loan=="no" & default=="no")))
## [1] "P. if not: 0.182203"
After more thinking and drawing, I came up with the following structure and added more variables that were identified as important during the EDA and model building.
The model has the following structure:
- balance can directly influence the decision to subscribe to a term deposit, since in order to do so the client needs some money. Also, the interest rate may be higher for bigger amounts, so the conditions may look more attractive to such clients. Balance is also related to default, housing and loan. In the case of default, having a negative balance mostly means that a client probably has no money to pay for something, e.g. a housing loan. However, having a default does not necessarily mean that a person has a loan or a housing loan, as shown in the table above. As far as I know, people usually take loans when they do not have enough money to buy something they want, so, perhaps, a lower balance makes it more likely that they will take a loan.
- age: young and old people had higher subscription rates compared to the middle age groups
- age and balance: younger people may have less money as they probably had less time to earn a large amount :). This is not very visible on the plots, but it seems that at least clients younger than 25 do not have super big numbers on their accounts
- poutcome (outcome of the previous marketing campaign): 64.7% of clients for whom the previous campaign was marked as «success» subscribed to a term deposit. This is a very good number compared to the rates for those with failure, other & unknown outcomes: 12.6%, 16.7% and 9.2% respectively
Now let's examine some variables and "what-if" scenarios.
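But first: the chunk that defines this extended structure is not shown in the output, so below is a plausible reconstruction based on the ties described above. The exact arc set is an assumption, not the author's string.
# Assumed extended structure (arcs follow the bullet points above)
bn = model2network(paste0(
  "[age][poutcome][balance|age]",
  "[default|balance][housing|balance][loan|balance]",
  "[y|age:balance:default:housing:loan:poutcome]"))
graphviz.plot(bn, highlight = list(nodes="y", fill="#55acee", col="white"))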
net_second = bn.fit(bn, data = data.frame(df_bn %>% dplyr::select(age, balance, default, housing, loan, poutcome, y)))
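The BBNnet object queried with querygrain() below is also created in a hidden chunk; presumably it is the fitted network converted to a gRain object, roughly like this (an assumption):
BBNnet = compile(as.grain(net_second))   # exact-inference engine used by querygrain() below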
Based on the calculated probabilities, and consistent with the EDA, the highest probability of subscribing to a term deposit is for those aged 18-25 & 65-100.
## y
## age no yes
## (17,25] 0.7822971 0.2177029
## (25,35] 0.8752513 0.1247487
## (35,45] 0.8982939 0.1017061
## (45,55] 0.8996323 0.1003677
## (55,65] 0.8731372 0.1268628
## (65,100] 0.5531044 0.4468956
The higher the balance, the higher the probability of subscribing to a term deposit.
querygrain(BBNnet, nodes=c("y", "balance"), type="conditional")
## y
## balance no yes
## negative 0.9248267 0.07517325
## below median 0.8948012 0.10519879
## above median 0.8774549 0.12254506
## high 0.8398880 0.16011196
## very high 0.8441843 0.15581571
It is better to contact those without a housing loan: 0.15 compared to 0.09.
querygrain(BBNnet, nodes=c("y", "housing"), type="conditional")
## y
## housing no yes
## no 0.8410387 0.15896135
## yes 0.9096405 0.09035946
And without a personal loan: the probability of subscription rises to 0.13, compared to 0.08 for those with a loan.
querygrain(BBNnet, nodes=c("y", "loan"), type="conditional")
## y
## loan no yes
## no 0.8723901 0.12760986
## yes 0.9147056 0.08529437
The starting probability of subscription for this model is 0.12.
The first scenario: a client aged 18-25 with a high balance. The probability is ~25% - not bad.
set.seed(17)
cpquery(net_second, event = (y == "yes"), evidence = (age == "(17,25]" & balance=="high"))
## [1] 0.2535211
Previous outcome is success? Then the probability is 0.8!
set.seed(42)
cpquery(net_second, event = (y == "yes"), evidence = (age == "(17,25]" & balance=="high" & poutcome=="success"))
## [1] 0.8
What if a person is middle-aged but still has an above-median balance? Then the probability decreases to 0.09.
set.seed(42)
cpquery(net_second, event = (y == "yes"), evidence = (age == "(35,45]" & balance=="above median"))
## [1] 0.09485816
However, if such a client has a high balance, the probability stays near 0.12.
set.seed(42)
cpquery(net_second, event = (y == "yes"), evidence = (age == "(35,45]" & balance=="high"))
## [1] 0.1196984
Overall, according to the models and tables presented above, the company could introduce some changes that may improve its performance.
In this report, I've described all the steps taken to complete the task. Firstly, I examined the dataset in the EDA section. Then I created 3 models, of which the random forest showed the best results, and examined a few individual cases using the lime package. The last part included the creation of Bayesian networks in order to conduct a what-if analysis and propose a policy.
Existing situation
The current subscription rate is 12%, meaning that 12% of the clients agreed to a term deposit. The main factors on which this rate differs are age, the duration of the call, and having loans. The rate is also higher for clients with a higher balance and for those with a previously successful campaign.
ML
The best model in the ML section was the random forest, with an AUC of almost 93%, which means that the model was close to being a perfect classifier (the maximum value is 1). The most important factors were: duration, balance, age, day and poutcome (success).
Policy of subscription rate improvement
The rate may be increased by shifting the focus from the middle age groups to clients aged 18-25 and 65+, as they mostly do not have loans. The latter also have a higher balance, which is more profitable for the bank. Clients for whom the previous campaign was successful subscribed more often as well, meaning that the "successful" customer database should definitely be targeted. Overall, such changes may improve the rate to at least 0.54, even for clients for whom the outcome of the previous campaign is unknown.
set.seed(42)
cpquery(net_second, event = (y == "yes"), evidence = (age == "(17,25]" & balance=="high" & loan=="no" & default=="no" & housing=="no"))
## [1] 0.5454545