---
title: "The Selection of a Model for German Credit Applications"
author: "Philippe De Brouwer"
date: "`r format(Sys.time(), '%d %B, %Y')`"
abstract: "We use the data of German Credit applications to build a credit model. We use two approaches: neural networks and logistic regression and conclude that the neural network is not preferable, because it over-fits too much and is black box. For the logistic regression we also build a challenger model that has one variable more and conclude that this model is the best so far."
documentclass: article
classoption:
- 12pt
output:
pdf_document:
includes:
in_header: preamble.tex
number_sections: yes
bibliography: bibliography.bib
fontsize: 12pt
header-includes:
- \usepackage{titling}
- \pretitle{\begin{center}
\thispagestyle{empty}
\includegraphics[height=37.5mm]{img/logoAGH.pdf}\\
\rule{30mm}{0pt}
{\large \textsf{WydziaĆ Applied Mathematics}}\\
\rule{\textwidth}{3pt}\\
\rule[2ex]{\textwidth}{1pt}\\
\vspace{7ex}
{\LARGE \bf \textsf{Project for the course Data Science}}\\
\vspace{13ex}
\rule[2ex]{\textwidth}{1pt}\\
}
- \posttitle{\end{center}}
---
\newpage
\tableofcontents
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(forcats)
library(InformationValue)
library(ggplot2)
library(ggcorrplot)
library(gridExtra)
library(ROCR)
library(modelr)
library(purrr)
d <- read_csv('http://www.de-brouwer.com/tmp/german_credit.csv')
d$isGood <- 1 - d$default
```
\newpage
# Introduction
This is a template that can be used for the report of the project that goes with the course _Data Science_ of \textbf{Philippe De Brouwer} for AGH. It is obviously not perfect, nor really finished. We invite you to use this as a starting point, and rethink all decisions yourself. Eventually we hope that you can build a better model.
Best is to take the models that is selected here (``logistic 2'') as a reference and take it as a challenge to build a better model. The quality of the model is both the performance and the robustness ("not over-fit"). Ideally it is better in at least one of those two areas while at least as good in the other criterion.
You are free to take another view anything. In particular we suggest you to consider:
1. the selection of variables,
2. interaction variables (matrix variables),
3. the choice of model (logistic, probit, decission tree, neural network, SVM, etc.), or eventually improve the models chosen (e.g. how could you make the neural network to be less over-fitted?),
4. of course you can also choose to consider other performance criteria and even use other cross validation methods, such as _k-fold cross validation_, and
5. the documentation, are the better ways of presenting it, what to present, etc.
_Good luck!_
The data is believed to be public domain.
The main reference is @debrouwer2020, and more in particular [@debrouwer2020, parts 1 to 6 with elements of part 8], the slides presented in the classroom, and the code that we walked through together.
# The Data
## Introduction
The data consists of `r nrow(d)` rows and `r ncol(d)` columns. In Table \ref{tab:colnames} we list all column names.
```{r datasummary, echo=FALSE}
tmp <- as.data.frame(colnames(d))
colnames(tmp) <- 'Column Names'
knitr::kable(tmp, caption = 'The column names of the data.\\label{tab:colnames}')
```
The breakdown of the dependent variable ``default'' is as presented in Table \ref{tab:tableDefault}.
```{r defaultsummary, echo=FALSE}
knitr::kable(table(d$default), caption = 'The Data consists of a high number of defaults. We assume that this is because the data has been prepared by dropping randomly "good" customers.\\label{tab:tableDefault}')
```
## The Data Quality
From the `r nrow(d)` observations, there are `r nrow(d) - length(complete.cases(d))` rows with missing values. Some of the values have one of the categorical variables that indicates the lack of information. We refer to the following sections where we will show how these "NA" values will be treated.
No suspicious outliers were found.
# Data Wrangling
## Categorical Variables
### The information Value
The first step is selecting the variables that we ultimately want to use in our model. First we will consider the categorical variables and study the information value.
```{r iv, echo=FALSE}
WOE_tbl <- data.frame(varName = character(0), IV = numeric(0)) # create empty data frame
continuous_scale_variables = c('duration_in_month', 'credit_amount', 'installment_as_income_perc', 'present_res_since', 'age', 'credits_this_bank', 'people_under_maintenance')
# For all categorical values we display the IV:
for (col in colnames(d)) {
if (!(col %in% c('isGood', 'default', continuous_scale_variables))) {
wt <- WOETable(d[[col]], d$isGood, valueOfGood = 1)
tmp <- data.frame(varName = col, IV = sum(wt$IV))
WOE_tbl <- rbind(WOE_tbl, tmp)
}
}
# show the best IVs:
knitr::kable(WOE_tbl[order(WOE_tbl$IV, decreasing = TRUE),],
caption = 'The table of all information values for each categorical variable ordered in decreasing order. We will work with the ones that have an information value above $0.1$.')
```
## The Continous Variables
The continuous variables are presented in Table \ref{tab:contVars}. Also their content follows from the naming convention.
```{r csv,echo=FALSE}
knitr::kable(continuous_scale_variables,
col.names = c('Continuous Scale Variables'),
caption = "List of the continuous scale variables in the data.\\label{tab:contVars}")
```
### Decide which Continuous Variable to Use
Note that we have introduced a variable $isGood$ which is one for good credits and zero for bad credits.
In order to select continuous scale variables we will investigate the correlation and study the plot of the average behaviour in function of the relevant varialbes in order to detect non-linear dependencies. Later (in next section) we will consider interactions between the different variables (relevant for the logistic regression for example).
```{r continousvars, fig.cap='The correlation matrix for all continous scale vriables. Note that $isGood$ is the binary variable that equals $1$ for all successful credits and $0$ for credits that defaulted.', echo=FALSE}
df_cont <- data.frame(isGood = d$isGood)
for (col in continuous_scale_variables) {
df_cont <- cbind(df_cont, d[[col]])
colnames(df_cont)[ncol(df_cont)] <- col
}
ggcorrplot(cor(df_cont),
type = "lower",
outline.col = "white",
lab = TRUE,
insig = "blank",
method = "circle")
```
Based on the correlation we can certainly consider $duration\_in\_month$, $credit\_amount$, $installment\_as\_income\_perc$ (which is usually a strong candidate, however in this data the dependency is not remarkable: `r cor(df_cont$isGood, df_cont$installment_as_income_perc)`), and $age$.
We also note the high correlation between $installment\_as\_income\_perc$, $duration\_in\_month$ and $credit\_amount$ and make note that we might need to combined them in order to avoid co-linearity if we would decide to keep both variables in a regression model.
We used a loess model and histogram to assess what binning is optimal for these variables. Here we present the loess with superimposed confidence level for the variables where the correlation with $isGood$ is above $0.05\%$.
```{r loess, fig.cap=c('The loess estimations and its confidence level for the variables with correlation above.\\label{fig:loess}')}
#t <- textGrob("Loess Estimations")
p1 <- qplot(data = df_cont, duration_in_month, isGood, xlab = "", ylab = "") +
geom_smooth(method = "loess", size = 1.5) +
ggtitle("duration_in_month") +
theme_bw()
p2 <- qplot(data = df_cont, credit_amount, isGood, xlab = "", ylab = "") +
geom_smooth(method = "loess", size = 1.5) +
ggtitle("credit_amount") +
theme_bw()
p3 <- qplot(data = df_cont, installment_as_income_perc,
isGood, xlab = "", ylab = "") +
geom_smooth(method = "loess", size = 1.5) +
ggtitle("installment_as_income_perc") +
theme_bw()
p4 <- qplot(data = df_cont, age, isGood, xlab = "", ylab = "") +
geom_smooth(method = "loess", size = 1.5) +
ggtitle("age") +
theme_bw()
grid.arrange(p1, p2, p3, p4, ncol=2)
```
We already knew that the variable $installment\_as\_income\_perc$ had a reasonably low correlation with the dependent varialbe. From Figure \ref{fig:loess} we learn also that the shape of the relationships is rather flat. Therefore we decide in a first approach not to use this variable (the models "logistic 1" and "neuralnet" will be built without this parameter) and build a challenger models that includes this variable ("logistic 2").
## Data Binning
### The Categorical Variables
The existing binning is probably directly taken from the production system. This means that we should at least check if it is optimal for a logistic regression and its cross validation. Ideally we want to avoid bins that are too small and we also want to make sure that the relationships observed are realistic, logical and consistent.
We decided to make sure all bins have at least 200 observations in them and will take bins that have similar WOE together.
```{r reviewBinningFactorsFunctions, echo=FALSE, warning=FALSE}
# Now review all binning
N <- nrow(d)
totGoods <- sum(d$isGood)
totBads <- N - totGoods
# returns the table of IV and WOE
var_display_fct <- function(.data, colName) {
#data = d
#colName = 'account_check_status'
colNm = rlang::sym(colName)
tmp <- .data %>%
select(all_of(colNm), isGood, default) %>%
group_by(get(colName)) %>%
summarise(
n = n(),
nbrBad = sum(default),
nbrGood = sum(isGood),
pctGood = nbrGood / n,
WOE = log(nbrGood/totGoods / (nbrBad/totBads)),
IV = (nbrGood/totGoods - nbrBad/totBads) * WOE
) %>%
arrange(WOE)
colnames(tmp)[1] <- colName
par(mar=c(11,4,4,4)) # increase margin for long names
barplot(tmp$WOE, names = tmp[[colName]], horiz = F, las = 2, col = "khaki3",
main = paste("WOE for", colName))
tmp$pctGood <- round(tmp$pctGood, 2)
tmp$WOE <- round(tmp$WOE, 2)
tmp$IV <- round(tmp$IV, 2)
tmp[[colName]] <- str_sub(tmp[[colName]], 1, 34)
tmp
}
# -----------------------------------------------------------------------------------
```
#### Account_check_status
To illustrate the thought process, we display the information for $account\_check\_status$. First we study the WOE of the existing bins (see Figure \ref{fig:accCheckStatus}) and their detailed information about histogram, weight of evidence and information value (see Table \ref{tab:accCheckStatus}).
```{r newData}
# start a new dataset with our re-organised data
df <- tibble(isGood = d$isGood) # our dependent variable
```
```{r account_check_status, fig.cap='The WOE for the variable $account\\_check\\_status$.\\label{fig:accCheckStatus}'}
### account_check_status
tbl <- var_display_fct(d, 'account_check_status')
```
```{r tblTbl}
knitr::kable(tbl,
caption = 'The WOE and IV for the variable account\\_check\\_status.\\label{tab:accCheckStatus}')
```
Then we decide on which bins to combine. In this case, we decide to take the second and third bin together. This is done by the following code:
```{r, echo=TRUE}
df$acc_chk_sts <- if_else(d$account_check_status == '< 0 DM', 'neg',
if_else((d$account_check_status == '0 <= ... < 200 DM') |
(d$account_check_status == '>= 200 DM / salary assignments for at least 1 year'),
'pos',
if_else(d$account_check_status == 'no checking account', 'none', 'ERROR')))
# Check for mistakes (they should appear as ERROR or an undefined category):
df$acc_chk_sts %>% table
```
#### Other variables
We work with similar logic through all other variables and make the following transformations:
```{r credit_history, echo=TRUE}
### credit_history
df$credit_hist <- if_else((d$credit_history == 'no credits taken/ all credits paid back duly') |
(d$credit_history == 'all credits at this bank paid back duly'),
'clear',
if_else((d$credit_history == 'existing credits paid back duly till now') |
(d$credit_history == 'delay in paying off in the past'), 'paid',
if_else(d$credit_history == 'critical account/ other credits existing (not at this bank)',
'crit', 'ERROR')))
### savings
df$savings <- if_else((d$savings == '... < 100 DM') | (d$savings == '100 <= ... < 500 DM'), 'small',
if_else((d$savings == 'unknown/ no savings account') | (d$savings == '500 <= ... < 1000 DM '), 'other',
if_else(d$savings == '.. >= 1000 DM ', 'large', 'ERROR')))
### Propoerty
df$property <- if_else(d$property == 'unknown / no property', 'none',
if_else((d$property == 'if not A121/A122 : car or other, not in attribute 6') |
(d$property == 'if not A121 : building society savings agreement/ life insurance'),
'movbl',
if_else(d$property == 'real estate', 'real', 'ERROR')))
### purpose
df$purpose <- if_else((d$purpose == '(vacation - does not exist?)') |
(d$purpose == 'furniture/equipment') |
(d$purpose == 'car (new)')
, 'friv',
if_else((d$purpose == 'education') |
(d$purpose == 'business') |
(d$purpose == 'repairs') |
(d$purpose == 'radio/television')
, 'invest',
if_else((d$purpose == 'domestic appliances') |
(d$purpose == 'car (used)') |
(d$purpose == 'retraining')
, 'useful', 'ERROR')))
```
Finally we check the information value of the new binning:
```{r newBinsIV}
tmp <- tibble(variable = character(0),
InformationValue = numeric(0),
Predictability = character(0))
for (k in 1:ncol(df)) {
if (colnames(df)[k] != 'isGood') {
x <- colnames(df)[k]
infVal <- IV(factor(df[[x]]), df$isGood)
tmp <- add_row(tmp,
variable = x,
InformationValue = infVal[1],
Predictability = attr(infVal, "howgood")
)
}
}
knitr::kable(tmp)
```
### The Continuous variables
Based on a similar reasoning^[We want to end up with bins that are not too small and we want behaviour in that one bin to be as similar as possible.], but also making sure that we capture any non-linear behaviour, we split the continuous variables as follows:
```{r, echo=TRUE}
#### duration_in_month
df$duration <- if_else(d$duration_in_month <= 12, 'L',
if_else(d$duration_in_month <= 24 & d$duration_in_month > 12, 'M',
if_else(d$duration_in_month > 24, 'H', 'ERROR')))
### credit_amount
df$credit_amount <- if_else(d$credit_amount <= 2500, 'L',
if_else(d$credit_amount <= 5000 & d$credit_amount > 2500, 'M',
if_else(d$credit_amount > 5000, 'H', 'ERROR')))
### age
df$age <- if_else(d$age <= 30, 'L',
if_else(d$age > 30, 'H', 'ERROR'))
```
# The Logisic Regresssion
The first model that we will study is a logistic regression. We name it for furture reference "logistic 1".
```{r echo=TRUE}
# Define the formula:
frm <- isGood ~ acc_chk_sts + credit_hist + savings + property +
purpose + duration + credit_amount + age +
savings * age + purpose * credit_amount
# Fit the model:
m <- glm(formula = frm, data = df, family = "binomial")
# Investigate the model:
summary(m)
```
We notice that none of the interactions is statistically significant and decide to leave them out in order to make the model more robust. Then we fit the model agai in the following code:
```{r, echo=TRUE}
frm <- isGood ~ acc_chk_sts + credit_hist + savings + property +
purpose + duration + credit_amount + age
m <- glm(formula = frm, data = df, family = "binomial")
summary(m)
```
# The performance of the Model
We will check the performance of the model with the aid of the ROCurve.
```{r rocr, fig.cap=c("The ROC (receiver operating curve) for our model", "The lift of the model (bottom): the cumulative percentage of responders (ones) captured by the model")}
# Re-use the model m and the dataset t2:
pred <- prediction(predict(m, type = "response"), df$isGood)
# Visualize the ROC curve:
#plot(performance(pred, "tpr", "fpr"), col="blue", lwd = 3)
#abline(0, 1, lty = 2)
AUC <- attr(performance(pred, "auc"), "y.values")[[1]]
#paste("AUC:", AUC)
perf <- performance(pred, "tpr", "fpr")
ks <- max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
#paste("KS:", ks)
#predScores <- modelr::add_predictions(df, m)$pred # not correct?
predScores <- predict(m, type = "response")
# Visualize the KS with the package InformationValue:
ks_plot(actuals = df$isGood, predictedScores = predScores)
cat('\n\n')
plotROC(actuals = df$isGood, predictedScores = predScores)
```
The model has a an AUC of `r AUC`, and a KS of `r ks`. These findings are not bad and we will proceed now to calculate the optimal cutoff. This cutoff is as follows:
```{r}
# The optimal cutof
# We introduce cost.fp to be understood as a the cost of a
# false positive, expressed as a multiple of the cost of a
# false negative.
# get_best_cutoff
# Finds a cutof for the score so that sensitivity and specificity
# are optimal.
# Arguments
# fpr -- numeric vector -- false positive rate
# tpr -- numeric vector -- true positive rate
# cutoff -- numeric vector -- the associated cutoff values
# cost.fp -- numeric -- cost of false positive divided
# by the cost of a false negative
# (default = 1)
# Returns:
# the cutoff value (numeric)
get_best_cutoff <- function(fpr, tpr, cutoff, cost.fp = 1){
cst <- (cost.fp * fpr - 0)^2 + (tpr - 1)^2
idx = which(cst == min(cst))
c(sensitivity = tpr[[idx]],
specificity = 1 - fpr[[idx]],
cutoff = cutoff[[idx]])
}
# opt_cut_off
# Wrapper for get_best_cutoff. Finds a cutof for the score so that
# sensitivity and specificity are optimal.
# Arguments:
# perf -- performance object (ROCR package)
# pred -- prediction object (ROCR package)
# cost.fp -- numeric -- cost of false positive divided by the
# cost of a false negative (default = 1)
# Returns:
# The optimal cutoff value (numeric)
opt_cut_off = function(perf, pred, cost.fp = 1){
mapply(FUN=get_best_cutoff,
perf@x.values,
perf@y.values,
pred@cutoffs,
cost.fp)
}
cat("The optimal cutoff if the false positives cost 10 times more than the false negatives:")
opt_cut_off(perf, pred, cost.fp = 10)
```
We can also use the package $InformationValue$ to calculate the optimal cutoff:
```{r optCutIV, echo = TRUE}
optimalCutoff(actuals = df$isGood, predictedScores = predScores, optimiseFor = "Zeros")
optimalCutoff(actuals = df$isGood, predictedScores = predScores, optimiseFor = "Both")
optimalCutoff(actuals = df$isGood, predictedScores = predScores, optimiseFor = "Ones")
optimalCutoff(actuals = df$isGood, predictedScores = predScores, optimiseFor = "misclasserror")
```
# Validation of the Model
To validate the model we will use the cross validation method, and opt for the Monte Carlo Cross Validation. [^While we could opt for the _k-fold_ validation (with $k$ not too high), we choose the Monte Carlo methods because it allows to draw more random selections.]
## Monte Carlo Cross Validation
```{r}
pctTrain <- 0.7
set.seed(18901229)
nRuns = 200
```
We use the Monte Carlo Cross Validation with a test dataset that spans 30% of our observations and 70% in the training data-set. We will draw `r nRuns` times a training data-set of `r pctTrain` and study the AUC on the testing data-set.
```{r fig.cap='The histogram for the AUC of the ramdomised data-sets with the Monte Carlo cross vailidation for the first logistic regression model.\\label{fig:MCxval1}'}
fAUC <- function(model, data) {
y <- all.vars(formula(model))[1]
pred1 <- predict(model, newdata = data, type = "response")
pred <- ROCR::prediction(pred1, data[[y]])
perf <- performance(pred, "auc")
AUC <- attr(perf, "y.values")[[1]]
AUC
}
#fvAUC <- Vectorize(fAUC)
cv_mc <- crossv_mc(df, n = nRuns, test = 1 - pctTrain)
mods <- map(cv_mc$train, ~ glm(frm, data = ., family = "binomial"))
#AUCs <- map2_dbl(mods, cv_mc$test, fAUC)
RMSE <- map2_dbl(mods, cv_mc$test, rmse)
AUCs_train <- numeric(0)
AUCs_test <- numeric(0)
for(k in 1:nRuns) {
AUCs_test[k] <- fAUC(mods[[k]], as.data.frame(cv_mc$test[[k]]))
AUCs_train[k] <- fAUC(mods[[k]], as.data.frame(cv_mc$train[[k]]))
}
allAUCs <- rbind(tibble(model = "train data", AUC = AUCs_train),
tibble(model = "test data", AUC = AUCs_test))
p1 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + geom_density(alpha=0.5)
p2 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + stat_ecdf()
grid.arrange(p1, p2, ncol = 1)
```
The results are in Figure \ref{fig:MCxval1}. We notice that the model is over-fit, but even for the test data-set the performance is acceptable.
All variables retained have at least one of the categories with a $p-value$ that is smaller than $0.01$. We could consider to leave out the categories that have a higher $p-value$ to make the model more robust. We choose, however, not to do that because all coefficients are somehow logical.
The median of the observed values for the AUC of the test data is `r median(AUCs_test)`, the average is `r mean(AUCs_test)` with a standard deviation of `r sd(AUCs_test)`.
# The Challenger Models
We will build two challenger models.
## Neural Network
As a challenger model, we us a neural network on the same binned data and with three hidden layers with resp 16, 8, and 4 hidden neurons.
First we fit the model on all the data:
```{r nnAUC, fig.cap="The ROC for the neural network with three hidden layers of 14, and 7 neurons respectively."}
M <- model.matrix( ~ ., data = df)
load("m_nn.RData")
pred <- predict(m_nn, M[, 3:ncol(M)])
pred <- ROCR::prediction(pred, M[,2]) # the function prediction is both in neuralnet and ROCR library
# Visualize the ROC curve:
plot(performance(pred, "tpr", "fpr"), col="blue", lwd = 3)
#abline(0, 1, lty = 2)
AUC_nn <- attr(performance(pred, "auc"), "y.values")[[1]]
paste("AUC:", AUC_nn)
perf <- performance(pred, "tpr", "fpr")
ks_nn <- max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
paste("KS:", ks_nn)
```
Now we perform the Monte Carlo Cross Validation for the neural network. The results are in Figure \ref{fig:nn:xv}.
```{r nn, fig.cap="The results of the cross vaildation for the neural network. We notice that the neural network witho nly three layers does not significantly overfit and hence is a valid model.\\label{fig:nn:xv}"}
setwd("/home/philippe/Documents/science/teaching/r-book/lectures/exam_rmd/")
load("AUCs_nn_test.RData")
load("AUCs_nn_train.RData")
allAUCs <- rbind(tibble(model = "train data", AUC = AUCs_nn_train),
tibble(model = "test data", AUC = AUCs_nn_test))
p1 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + geom_density(alpha=0.5)
p2 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + stat_ecdf()
grid.arrange(p1, p2, ncol = 1)
```
## Another logistic regression: logistic 2
We use also the installments as percentage of income, the correlation was small but it is usual a good variable to use for consumer lending. We also decide the parameter as it is. The variable is a number between 1 and 4.
```{r echo=TRUE}
summary(d$installment_as_income_perc)
```
The variable is encoded as a number between 1 and 4. We believe this is not too different from the other binary variables, and will fit the model with the variable unchanged.
```{r logistic2, echo=TRUE}
d2 <- cbind(df, installment_as_income_perc = d$installment_as_income_perc)
frm2 <- isGood ~ acc_chk_sts + credit_hist + savings + property +
purpose + duration + credit_amount + age + installment_as_income_perc
m2 <- glm(formula = frm2, data = d2, family = "binomial")
summary(m2)
```
```{r logist2AllDataAUC, fig.cap="The ROC for the logistic challenger logistic regression"}
pred <- prediction(predict(m2, type = "response"), d2$isGood)
# Visualize the ROC curve:
plot(performance(pred, "tpr", "fpr"), col="blue", lwd = 3)
abline(0, 1, lty = 2)
AUC2 <- attr(performance(pred, "auc"), "y.values")[[1]]
paste("AUC:", AUC2)
perf <- performance(pred, "tpr", "fpr")
ks2 <- max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
paste("KS:", ks2)
```
The coefficient of $installment\_as\_income\_perc$ is significant (the p-value is `r summary(m2)$coefficients[17,4]`). This is encouraging.
We will now subject this model to the same Monte Carlo cross validation and compare the results.
```{r challengerXval, echo=TRUE, fig.cap='The distribution of the AUC for the challenger model "logistic 2".'}
cv_mc2 <- crossv_mc(d2, n = nRuns, test = 1 - pctTrain)
mods2 <- map(cv_mc2$train, ~ glm(frm2, data = .))
AUCs2_test <- numeric(0)
AUCs2_train <- numeric(0)
for(k in 1:nRuns) {
AUCs2_test[k] <- fAUC(mods2[[k]], as.data.frame(cv_mc2$test[[k]]))
AUCs2_train[k] <- fAUC(mods2[[k]], as.data.frame(cv_mc2$train[[k]]))
}
allAUCs <- rbind(tibble(model = "train data", AUC = AUCs2_train),
tibble(model = "test data", AUC = AUCs2_test))
p1 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) +
geom_density(alpha=0.5)
p2 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) +
stat_ecdf()
grid.arrange(p1, p2, ncol = 1)
```
The median of the observed values for the AUC of the test data is `r median(AUCs2_test)`, the average is `r mean(AUCs2_test)` with a standard deviation of `r sd(AUCs2_test)`.
# Conclusion
```{r concl, fig.cap='The kernel density for the observed areas under the curve (top) and the cumulative probability density functions (bottom) for the base and challenger models. All AUCs shown are for the test data only.'}
allAUCs <- rbind(tibble(model = "logistic 1", AUC = AUCs_test),
tibble(model = "logistic 2", AUC = AUCs2_test),
tibble(model = "neural net", AUC = AUCs_nn_test)
)
p1 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + geom_density(alpha=0.5)
p2 <- ggplot(allAUCs, aes(AUC, fill = model, colour = model)) + stat_ecdf()
grid.arrange(p1, p2, ncol = 1)
```
Comparing the two logistic regressions it seems that the challenger model that uses the additional variable (debt installments as a percentage of income), but also prone to over-fitting. The neural network performs significantly better on the training data, but is even more over-fit: it preforms worse on the testing data. For that reason we have to reject this particular neural network.
In the following table we summarise these results:
Model | Quality | Mean AUC on Test Data | AUC on All Data
--------+----------+----------------+--------------------------+----------------
logistic 1 | slightly over-fit | `r mean(AUCs_test)` | `r AUC`
logistic 2 | slightly over-fit | `r mean(AUCs2_test)` | `r AUC2`
neural network | significantly over-fit | `r mean(AUCs_nn_test)` | `r AUC_nn`
The decisions of the logistic regression are transparent in that sense that they can be explained to the customer^[The explainability is important for the bank (who seeks assurance that he or she does the right thing, the customer and eventual courts. The neural networks seems to over-fit more -- so we would suggest to look into simpler neural networks with less neurons and/or layers.
The overall best models that we have is ``logistic 2'' and we recommend this for implementation.
# Bibliography