---
title: 'Car Insurance'
subtitle: 'Acceptance Model'
author: "Philippe De Brouwer"
date: "`r Sys.Date()`"
bibliography:
- ./car_insurance_library.bib
nocite: |
  @debrouwer2020rbook
output:
  pdf_document:
    #theme: cerulean
    highlight: haddock
    #smart: true
    number_sections: true
    toc: true
    toc_depth: 3
    #toc_float: true
    #collapsed: false
    #smooth_scroll: true
    fig_width: 7
    fig_height: 6
    fig_caption: true
toc-title: "Table of Contents"
always_allow_html: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,         # show the code chunks
                      message = FALSE,     # do not show messages
                      warning = FALSE,     # do not show warnings
                      fig.align = 'center' # center figures and captions
                      )
# Libraries:
library(tidyverse) # tibble, readr, etc.
#library(gridExtra) # put multiple ggplot graphs alongside in one box
library(ROCR)      # calculate AUC and visualise performance objects
#library(InformationValue) # deprecated
library(scorecard)
#library(rpart)    # fit the decision tree
library(modelr)    # cross-validation support in the tidyverse
library(broom)     # add predictions and residuals
library(magrittr)  # add more piping symbols
library(pROC)      # robust AUC calculation
library(knitr)     # beautiful tables
```

# Executive Summary

In this second part we build a binary acceptance model: do we accept the customer or not? In this file we disregard the question whether it is ethical to use certain features such as gender, marital status, income, etc. We only try to build the best possible model and refer to the file `car_insurance_ethics` for the ethical discussion.

# Introduction

The work-flow is inspired by [@debrouwer2020rbook], was prepared by Philippe De Brouwer to illustrate the concepts explained in the book, and follows the standard approach in stochastic modelling.

The first step in building a reliable predictive model is to split the original dataset into two parts: a training set and a test set.
The training set is used exclusively for fitting and tuning the model, while the test set is held completely aside for a final, unbiased evaluation. This separation ensures that the test data remains unseen during model development, providing a realistic assessment of how the model will perform on new data [@web48].

To tune and select the best model, the training data is further divided internally via cross-validation. For example, in k-fold cross-validation, the training set is split into *k* equally sized folds. Iteratively, the model is trained on *k-1* folds and validated on the remaining fold. This procedure is repeated so that each fold serves once as validation data. Cross-validation thus allows comprehensive model evaluation and tuning while maximizing the use of the available training data [@web49].

It is crucial to keep the test set separate and untouched until the very end. Using the test set for model tuning or validation can lead to overfitting, as the model might indirectly learn from this data. This would give overly optimistic performance estimates that will not hold on genuinely new data. By restricting validation to splits within the training data and reserving the test set for the final evaluation, the modelling process maintains an unbiased measure of performance.

In summary, the entire modelling pipeline follows these key steps:

- **Initial split** into training and test sets to separate model development from evaluation.
- **Cross-validation** inside the training data to tune and select models reliably.
- **Final training** of the chosen model on the full training set.
- **Unbiased evaluation** on the preserved test set to estimate real-world performance.

This disciplined workflow ensures trustworthy, generalizable models that perform well in practice.
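The steps above can be sketched in a few lines of base R. This is only a toy illustration on synthetic data; the names `toy`, `train_toy`, `test_toy` and `folds` are hypothetical and are not used in the modelling code later in this document.

```{r cvSketch}
set.seed(2025)
toy <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))

# 1. Initial split: hold 15% aside as test data
test_idx  <- sample(nrow(toy), size = round(0.15 * nrow(toy)))
train_toy <- toy[-test_idx, ]
test_toy  <- toy[test_idx, ]

# 2. Assign each training row to one of k folds (the test data is not involved)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train_toy)))

# 3. Cross-validation: each fold serves once as validation data
for (i in 1:k) {
  fit <- glm(y ~ x, data = train_toy[folds != i, ], family = binomial)
  val_pred <- predict(fit, newdata = train_toy[folds == i, ], type = "response")
}

# 4. Only at the very end: evaluate the chosen model once on test_toy
```

Note that the fold labels are shuffled with `sample()`, so the folds are random rather than contiguous blocks of rows.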
```{r testdataSChem, echo=FALSE, fig.cap="Schematic overview of the modelling cycle."}
library(DiagrammeR)
grViz("
digraph model_building {
  graph [rankdir=TB, fontsize=12]
  node [shape=box, style=filled, color=lightblue, fontname=Helvetica, fontcolor=black]

  data_collected   [label='Collect Data']
  split_data       [label='Set Aside Test Data']
  training_data    [label='Use Training Data Only']
  fit_models       [label='Fit Candidate Models']
  cross_validation [label='Perform Cross-Validation']
  select_model     [label='Select Best Model']
  evaluate         [label='Evaluate on Test Data']
  deploy           [label='Deploy Model']

  # Main forward flow
  data_collected -> split_data -> training_data -> fit_models ->
    cross_validation -> select_model -> evaluate -> deploy

  # Iteration loops (dotted arrows)
  cross_validation -> fit_models [style=dashed, label='Improvement Needed', fontcolor=gray]
  select_model -> fit_models     [style=dashed, label='Model Not Adequate', fontcolor=gray]
  evaluate -> select_model       [style=dashed, label='Reconsider Model', fontcolor=gray]
  deploy -> data_collected       [style=dashed, label='New Data Available', fontcolor=gray]
}
")
```

# Initial Data Processing

## Reading in the files from `car_insurance_dataprep.Rmd`

We need to load the data as prepared in the previous file.

```{r prep}
setwd("/home/philippe/Documents/courses/lectures/car_insurance")

# Import functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()   # create a temporary environment
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir = tmt.env)
rm(tmt.env)            # remove the temporary environment
```

```{r dataImport}
# Read in the binned binary data:
d_bin <- readRDS('./d_bin.R')

# Read in the factorial data:
d_fact <- readRDS('./d_fact.R')

# The normalised and numerical data:
d_norm <- readRDS('./d_norm.R')
```

## Common data manipulations

We replace `CLAIM_FLAG` with a variable `isGood` that is 1 when the outcome is positive.
The only purpose of this is to make the results easier to interpret: a 1 means a "better insurance risk."

```{r dataManip}
d_fact <- d_fact %>%
  mutate(isGood = 1 - as.numeric(as.character(CLAIM_FLAG))) %>%
  select(-c(CLAIM_FLAG))
d_bin <- d_bin %>%
  mutate(isGood = 1 - as.numeric(as.character(CLAIM_FLAG))) %>%
  select(-c(CLAIM_FLAG))
d_norm <- d_norm %>%
  mutate(isGood = 1 - as.numeric(as.character(CLAIM_FLAG))) %>%
  select(-c(CLAIM_FLAG))

transform_factors_to_binary <- function(df) {
  # Extract the numeric columns
  numeric_cols <- df %>% select(where(is.numeric))

  # Extract the factors and convert them to dummy variables using model.matrix
  factor_cols <- df %>% select(where(is.factor))

  if (ncol(factor_cols) > 0) {
    # Create dummy variables; remove the intercept column
    dummies <- model.matrix(~ . - 1, data = factor_cols) %>% as.data.frame()
    # Combine numeric and dummy columns
    result <- bind_cols(numeric_cols, dummies)
  } else {
    result <- numeric_cols
  }
  return(result)
}

d_num <- d_norm %>% transform_factors_to_binary
```

## Separating test data

```{r testData}
set.seed(2025)
sampled_rows <- d_fact %>%
  mutate(row_id = row_number()) %>% # record the original row numbers first
  slice_sample(prop = 0.85) %>%     # keep 85% for training, 15% as test data
  pull(row_id)                      # this does the same as '$', hence extracts that column

# We will keep all data in one list object (a list of lists of data-frames)
d <- list()

# Add train and test for the fact dataset
d$fact <- list(
  train = d_fact %>% slice(sampled_rows),
  test  = d_fact %>% slice(-sampled_rows)
)

# Add train and test for the bin dataset
d$bin <- list(
  train = d_bin %>% slice(sampled_rows),
  test  = d_bin %>% slice(-sampled_rows)
)

# Add train and test for the normalised dataset
d$norm <- list(
  train = d_norm %>% slice(sampled_rows),
  test  = d_norm %>% slice(-sampled_rows)
)

# Add train and test for the numerical dataset
d$num <- list(
  train = d_num %>% slice(sampled_rows),
  test  = d_num %>% slice(-sampled_rows)
)

# Note that when we write slice(d_bin, -sampled_rows), it results in the
# same columns; however, the class of the object is different.
```

# The Logistic Regression

First, we will fit a logistic regression using all variables that make sense from a mathematical point of view. For the logistic regression we use `d_fact`, where the data is categorical but carries meaningful labels. While we could use the other datasets, this one is a little easier to work with, since R will create binary dummy variables from the factors for us anyhow.

First, try all variables:

```{r}
m0 <- glm(isGood ~ ., d$fact$train, family = 'binomial')
summary(m0)
```

Certain variables have no stars (no significant coefficients), and hence can be removed from the model.

## Fit the Model With Only Relevant Variables

Leave out the least relevant ones (the coefficients without stars) and build a first usable model $m_1$. We will not use further:

- `KIDSDRV`
- `HOMEKIDS`
- `GENDER`
- `RED_CAR`
- `CAR_AGE`
- `CLM_FREQ` (too strongly correlated with `OLDCLAIM`)
- `GENDER.AGE`

```{r m1}
frm1 <- formula(isGood ~ MSTATUS + EDUCATION + OCCUPATION + TRAVTIME +
                  CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + REVOKED +
                  MVR_PTS + URBANICITY + AGE + INCOME + YOJ + HOME_VAL)
m1 <- glm(frm1, d$fact$train, family = 'binomial')
summary(m1)
```

```{r rocr1, fig.cap="The ROC (receiver operating characteristic) curve for model m1."}
# Build the ROCR prediction object on the training data:
pred1 <- prediction(predict(m1, type = "response"), d$fact$train$isGood)

# The AUC:
AUC1 <- attr(performance(pred1, "auc"), "y.values")[[1]]
paste("AUC:", AUC1)

# The KS statistic (maximum distance between TPR and FPR):
perf1 <- performance(pred1, "tpr", "fpr")
ks <- max(attr(perf1, 'y.values')[[1]] - attr(perf1, 'x.values')[[1]])
paste("KS:", ks)

# Visualise the ROC curve:
plot(perf1, main = "ROC Curve")
abline(0, 1, lty = 2)
```

## Cross Validation

### Storing the results

We explain more in the concluding section, but selecting a model based on
AUC (or any other performance measure) obtained via cross validation relies on selecting the model that has:

- a high AUC on the test data,
- a small difference between the AUC on the train and test data,
- a low variation (e.g. standard deviation) of the AUC on the test data,
- and preferably a small difference between the variation of the AUC on the train and test data.

To compare these models, we hence need to keep track of the means of the AUC on the train and test data as well as their standard deviations.^[Of course, one can choose other measures of centrality and deviation (e.g. the median and the mean absolute deviation (MAD)).] In the next code blocks we prepare the `tibble` to hold these results.

```{r prepAUCdata}
# Prepare a dataset to keep all AUC values:
d_AUC <- tibble(model    = character(0),
                mn_train = numeric(0),
                sd_train = numeric(0),
                mn_test  = numeric(0),
                sd_test  = numeric(0),
                delta_tt = numeric(0)
                )
```

We will use a k-fold cross validation and prepare the folds for all datasets. We want exactly the same choice of folds for each dataset and hence reset the seed to the same value each time.

```{r prepFolds}
seed <- 67

# Create the folds:
set.seed(seed = seed)
folds_fact <- crossv_kfold(d$fact$train, k = 5)
set.seed(seed = seed)
folds_norm <- crossv_kfold(d$norm$train, k = 5)
set.seed(seed = seed)
folds_bin  <- crossv_kfold(d$bin$train, k = 5)
set.seed(seed = seed)
folds_num  <- crossv_kfold(d$num$train, k = 5)
```

### Base R

A k-fold cross validation in base R is straightforward, but verbose.
```{r baseRcv, eval=FALSE}
# Manual k-fold cross-validation with base R
k <- 5
# Note: this assigns contiguous blocks of rows to the folds (no shuffling)
folds <- cut(seq(1, nrow(d$fact$train)), breaks = k, labels = FALSE)
auc_values <- numeric(k)  # here we will store the results

for (i in 1:k) {
  # Create the train/test splits
  test_indices <- which(folds == i)
  train_data <- d$fact$train[-test_indices, ]
  test_data  <- d$fact$train[test_indices, ]

  # Train the model
  model <- glm(frm1, data = train_data, family = binomial)

  # Predict and calculate the AUC
  predictions <- predict(model, newdata = test_data, type = "response")
  auc_values[i] <- pROC::auc(test_data$isGood, predictions)
}

auc_per_fold <- tibble(.id = 1:k, auc = auc_values)
print(auc_per_fold)
```

```{r funcBase}
# Note that this function uses library(pROC) and library(dplyr)
f_add_kfold_auc_base <- function(data, k, model_func, model_frm, model_desc, ...) {
  # Create the folds: a vector assigning a fold number to each row
  folds <- cut(seq(1, nrow(data)), breaks = k, labels = FALSE)
  auc_train <- numeric(k)
  auc_test  <- numeric(k)

  for (i in 1:k) {
    test_indices <- which(folds == i)
    train_data <- data[-test_indices, ]
    test_data  <- data[test_indices, ]

    # Train the model on the training fold
    model <- model_func(formula = model_frm, data = train_data, ...)
    # Predict on the train fold and calculate the train AUC
    preds_train <- predict(model, newdata = train_data, type = "response")
    auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

    # Predict on the test fold and calculate the test AUC
    preds_test <- predict(model, newdata = test_data, type = "response")
    auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
  }

  # Summarize the fold AUC results
  auc_per_fold <- tibble::tibble(.id = 1:k, auc_train = auc_train, auc_test = auc_test)
  print(auc_per_fold)

  # Initialize the global d_AUC if missing
  if (!exists("d_AUC", envir = .GlobalEnv)) {
    d_AUC <- tibble::tibble(
      model    = character(),
      mn_train = numeric(),
      sd_train = numeric(),
      mn_test  = numeric(),
      sd_test  = numeric(),
      delta_tt = numeric()
    )
    assign("d_AUC", d_AUC, envir = .GlobalEnv)
  }

  # Append the summary statistics to the global d_AUC
  d_AUC <<- dplyr::bind_rows(
    get("d_AUC", envir = .GlobalEnv),
    tibble::tibble(
      model    = model_desc,
      mn_train = mean(auc_train),
      sd_train = sd(auc_train),
      mn_test  = mean(auc_test),
      sd_test  = sd(auc_test),
      delta_tt = mean(auc_train) - mean(auc_test)
    )
  )
  invisible(d_AUC)
}
```

The problem with this approach is that both the function that fits the model and the function that calculates the predictions need special parameters and functions to be passed, and in R we can pass the `...` place-holder only once. Therefore, we will split this function into two parts:

- first fit the models on the folds;
- then calculate the performance measure of choice (AUC) and add the results to our results tibble `d_AUC`.

Here is how this can be done:

```{r baseRXValTwo}
library(dplyr)
library(purrr)
library(pROC)
library(tibble)

# Function 1: fit models on the folds -- returns the folds with a new 'model' column
fit_models_on_folds <- function(folds, model_func, model_frm, ...) {
  folds %>%
    mutate(
      model = map(train, function(df) {
        model_func(data = df, formula = model_frm, ...)
      })
    )
}

# Function 2: calculate the AUC on train/test, update the global d_AUC, and return it
calculate_and_update_auc <- function(folds, model_desc,
    train_pred_fun = function(model, data) predict(model, newdata = data, type = "response"),
    test_pred_fun  = function(model, data) predict(model, newdata = data, type = "response")) {
  auc_results <- folds %>%
    mutate(
      auc_train = map2_dbl(model, train, ~ {
        train_data <- as_tibble(.y)
        preds_train <- train_pred_fun(.x, train_data)
        pROC::auc(train_data$isGood, preds_train)
      }),
      auc_test = map2_dbl(model, test, ~ {
        test_data <- as_tibble(.y)
        preds_test <- test_pred_fun(.x, test_data)
        pROC::auc(test_data$isGood, preds_test)
      })
    ) %>%
    select(.id, auc_train, auc_test)

  # Initialize the global d_AUC if missing
  if (!exists("d_AUC", envir = .GlobalEnv)) {
    assign("d_AUC", tibble(
      model    = character(),
      mn_train = numeric(),
      sd_train = numeric(),
      mn_test  = numeric(),
      sd_test  = numeric(),
      delta_tt = numeric()
    ), envir = .GlobalEnv)
  }

  # Append the summary to the global d_AUC
  d_AUC <<- bind_rows(
    get("d_AUC", envir = .GlobalEnv),
    tibble(
      model    = model_desc,
      mn_train = mean(auc_results$auc_train),
      sd_train = sd(auc_results$auc_train),
      mn_test  = mean(auc_results$auc_test),
      sd_test  = sd(auc_results$auc_test),
      delta_tt = mean(auc_results$auc_train) - mean(auc_results$auc_test)
    )
  )
  invisible(d_AUC)
}
```

For the logistic regression this works very simply:

```{r exGLM}
# 'folds_norm' is prepared with columns train and test (the data subsets for each fold)
folds_with_models <- fit_models_on_folds(folds_norm, glm, frm1, family = binomial())

# AUC calculation and global update
d_AUC <- calculate_and_update_auc(folds_with_models, "Logistic Regression test")
print(d_AUC)
```

For a decision tree, we need some more work:

```{r exDT}
library(rpart)

# Define a prediction function for rpart that returns the positive-class probability
rpart_pred_fun <- function(model, data) {
  preds <- predict(model, newdata = data, type = "prob")
  preds[, 2]  # assuming the 2nd column is the positive-class probability
}

folds_with_models <- fit_models_on_folds(folds_norm, rpart, frm1,
                                         method = "class", cp = 0.01)
d_AUC <- calculate_and_update_auc(
  folds_with_models,
  "Decision Tree (rpart) test",
  train_pred_fun = rpart_pred_fun,
  test_pred_fun  = rpart_pred_fun
)
print(d_AUC)
```

### Cross-Validation in the `tidyverse`

In the next code block we demonstrate how the k-fold cross validation can be done in the `tidyverse` and immediately wrap it in a re-usable function.

```{r tidyverseCV}
f_add_kfold_auc <- function(data, folds, model_func, model_frm, model_desc, ...) {
  auc_per_fold <- folds %>%
    mutate(
      model = map(train, function(df) {
        model_func(data = df, formula = model_frm, ...)
      }),
      auc_train = map2_dbl(model, train, ~ {
        train_data <- as_tibble(.y)  # use the train fold data, not .x$model
        preds_train <- predict(.x, newdata = train_data, type = "response")
        pROC::auc(train_data$isGood, preds_train)
      }),
      auc_test = map2_dbl(model, test, ~ {
        test_data <- as_tibble(.y)
        preds_test <- predict(.x, newdata = test_data, type = "response")
        pROC::auc(test_data$isGood, preds_test)
      })
    ) %>%
    select(.id, auc_train, auc_test)

  # Append the results to the global d_AUC as before
  d_AUC <<- bind_rows(
    get("d_AUC", envir = .GlobalEnv),
    tibble(
      model    = model_desc,
      mn_train = mean(auc_per_fold$auc_train),
      sd_train = sd(auc_per_fold$auc_train),
      mn_test  = mean(auc_per_fold$auc_test),
      sd_test  = sd(auc_per_fold$auc_test),
      delta_tt = mean(auc_per_fold$auc_train) - mean(auc_per_fold$auc_test)
    )
  )
  invisible(d_AUC)
}

# Use the function for the logistic regression:
f_add_kfold_auc(data = d$fact$train,
                folds = folds_fact,
                model_func = glm,
                model_frm = frm1,
                model_desc = 'm1 - logistic regression',
                family = binomial()  # pass as a function call, not a string
                )
```

```{r demo1}
# Add the model with all variables too as m0
f_add_kfold_auc(data = d$fact$train,
                folds = folds_fact,
                model_func = glm,
                model_frm = formula(isGood ~ .),
                model_desc = 'm0 - logistic regr. all vars',
                family = binomial()  # pass as a function call, not a string
                )

# Demonstrate the content of d_AUC
knitr::kable(d_AUC)
```

# Decision Tree

## Decision Tree on binned data

```{r decisionTree}
library(rpart)
library(rpart.plot)

# Fit a large initial tree with minimal pruning (set cp = 0)
large_tree <- rpart(isGood ~ ., data = d$fact$train, method = "class",
                    control = rpart.control(cp = 0))

# Plot the cross-validation results to see the optimal cp
plotcp(large_tree)  # shows xerror, xstd, and cp values

# Print the cross-validation table
printcp(large_tree)

# Find the cp with the minimum xerror for pruning
best_cp <- large_tree$cptable[which.min(large_tree$cptable[, "xerror"]), "CP"]
# Alternatively, choose cp according to the "1-SE rule":
# best_cp <- large_tree$cptable[which.min(large_tree$cptable[, "xerror"] + large_tree$cptable[, "xstd"]), "CP"]

# Prune the tree with the selected cp
pruned_tree <- prune(large_tree, cp = best_cp)

# Plot the pruned tree
rpart.plot(pruned_tree, main = "Pruned Decision Tree")
```

In order to use our function `f_add_kfold_auc()`, we would have to modify it in such a way that it would fail to work for the regression. So we do the cross validation for the tree here without a function:

```{r treeXvalFACT}
frm0 <- isGood ~ .
results <- folds_fact %>%
  mutate(
    model = map(train, ~ rpart(frm0, data = as_tibble(.), method = "class", cp = 0.01)),
    auc_train = map2_dbl(model, train, ~ {
      train_data <- as_tibble(.y)  # use the train fold data, not .x$model
      preds_train <- predict(.x, newdata = train_data, type = "prob")[, 2]
      pROC::auc(train_data$isGood, preds_train)
    }),
    auc_test = map2_dbl(model, test, ~ {
      test_data <- as_tibble(.y)
      preds_test <- predict(.x, newdata = test_data, type = "prob")[, 2]
      pROC::auc(test_data$isGood, preds_test)
    })
  ) %>%
  select(.id, auc_train, auc_test)
print(results)

# Append the summary statistics to the global d_AUC
d_AUC <- d_AUC %>% add_row(
  model    = "Decision Tree (rpart)",
  mn_train = mean(results$auc_train),
  sd_train = sd(results$auc_train),
  mn_test  = mean(results$auc_test),
  sd_test  = sd(results$auc_test),
  delta_tt = mean(results$auc_train) - mean(results$auc_test)
)
```

## Decision Tree on numerical data

```{r decisionTreeNUM}
library(rpart)
library(rpart.plot)

# Fit a large initial tree with minimal pruning (set cp = 0)
large_tree <- rpart(isGood ~ ., data = d$norm$train, method = "class",
                    control = rpart.control(cp = 0))

# Plot the cross-validation results to see the optimal cp
plotcp(large_tree)  # shows xerror, xstd, and cp values

# Print the cross-validation table
printcp(large_tree)

# Find the cp with the minimum xerror for pruning
best_cp <- large_tree$cptable[which.min(large_tree$cptable[, "xerror"]), "CP"]
# Alternatively, choose cp according to the "1-SE rule":
# best_cp <- large_tree$cptable[which.min(large_tree$cptable[, "xerror"] + large_tree$cptable[, "xstd"]), "CP"]

# Prune the tree with the selected cp
pruned_tree <- prune(large_tree, cp = best_cp)

# Plot the pruned tree
rpart.plot(pruned_tree, main = "Pruned Decision Tree")
```

The cross validation for the tree:

```{r treeXval}
# Wrapper for rpart fitting to match the f_add_kfold_auc interface
rpart_model_func <- function(data, formula, ...) {
  rpart::rpart(formula, data = as_tibble(data), method = "class", ...)
}

frm0 <- isGood ~ .
results <- folds_norm %>%
  mutate(
    model = map(train, ~ rpart(frm0, data = as_tibble(.), method = "class", cp = 0.01)),
    auc_train = map2_dbl(model, train, ~ {
      train_data <- as_tibble(.y)  # use the train fold data, not .x$model
      preds_train <- predict(.x, newdata = train_data, type = "prob")[, 2]
      pROC::auc(train_data$isGood, preds_train)
    }),
    auc_test = map2_dbl(model, test, ~ {
      test_data <- as_tibble(.y)
      preds_test <- predict(.x, newdata = test_data, type = "prob")[, 2]
      pROC::auc(test_data$isGood, preds_test)
    })
  ) %>%
  select(.id, auc_train, auc_test)
print(results)

# Append the summary statistics to the global d_AUC
d_AUC <- d_AUC %>% add_row(
  model    = "Decision Tree (num)",
  mn_train = mean(results$auc_train),
  sd_train = sd(results$auc_train),
  mn_test  = mean(results$auc_test),
  sd_test  = sd(results$auc_test),
  delta_tt = mean(results$auc_train) - mean(results$auc_test)
)
```

# Random Forest

The function `randomForest()` requires a `data.frame` as input; this makes everything a little different, so we perform the cross validation manually (without our custom function).

```{r RF}
library(dplyr)
library(randomForest)
library(pROC)

k <- nrow(folds_norm)
auc_train <- numeric(k)
auc_test  <- numeric(k)

for (i in 1:k) {
  train_data <- as.data.frame(folds_norm$train[[i]])
  test_data  <- as.data.frame(folds_norm$test[[i]])

  # Convert the numeric isGood to a factor
  train_data$isGood <- as.factor(train_data$isGood)
  test_data$isGood  <- as.factor(test_data$isGood)

  # Fit the random forest model
  model_rf <- randomForest(isGood ~ ., data = train_data, ntree = 200,
                           mtry = floor(sqrt(ncol(train_data) - 1)))

  # Predictions on train
  preds_train <- predict(model_rf, newdata = train_data, type = "prob")[, 2]
  auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

  # Predictions on test
  preds_test <- predict(model_rf, newdata = test_data, type = "prob")[, 2]
  auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
}

# Append the summary statistics to the global d_AUC tibble
d_AUC <- d_AUC %>% add_row(
  model    = "Random Forest (ntree=200, mtry=sqrt)",
  mn_train = mean(auc_train),
  sd_train = sd(auc_train),
  mn_test  = mean(auc_test),
  sd_test  = sd(auc_test),
  delta_tt = mean(auc_train) - mean(auc_test)
)
```

We notice here that the model fits the training data perfectly, but loses on the test data. A training AUC of 1 (with zero standard deviation) combined with a lower test AUC (around 0.85) means the model perfectly fits the training data but not unseen data: typical **overfitting**. We can reduce the overfitting with `randomForest()` by:

- **reducing the number of trees:** `ntree = 100`;
- **limiting the tree size:** use a maximum number of nodes or a minimum node size: `maxnodes = 30`, `nodesize = 10`;
- **adjusting the number of features sampled at each split:** increase or tune `mtry`, e.g. `mtry = floor(ncol(train_data)/2)`, or use `tuneRF()` to find the best value;
- **feature selection or dimensionality reduction:** preprocess the data before modelling (e.g. with PCA).

```{r RF2ReduceOverFit}
for (i in 1:k) {
  train_data <- as.data.frame(folds_norm$train[[i]])
  test_data  <- as.data.frame(folds_norm$test[[i]])

  # Convert the numeric isGood to a factor
  train_data$isGood <- as.factor(train_data$isGood)
  test_data$isGood  <- as.factor(test_data$isGood)

  # Fit the random forest model with a constrained tree size
  model_rf <- randomForest(isGood ~ ., data = train_data,
                           ntree = 150,
                           mtry = floor(sqrt(ncol(train_data) - 1) / 2),
                           maxnodes = 30,
                           nodesize = 10
                           )

  # Predictions on train
  preds_train <- predict(model_rf, newdata = train_data, type = "prob")[, 2]
  auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

  # Predictions on test
  preds_test <- predict(model_rf, newdata = test_data, type = "prob")[, 2]
  auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
}

# Append the summary statistics to the global d_AUC tibble
d_AUC <- d_AUC %>% add_row(
  model    = "Random Forest (ntree=150, mtry=sqrt/2)",
  mn_train = mean(auc_train),
  sd_train = sd(auc_train),
  mn_test  = mean(auc_test),
  sd_test  = sd(auc_test),
  delta_tt = mean(auc_train) - mean(auc_test)
)
```

## Random Forest on the binned data

We now fit the random forest on the data that is completely
categorical (so also the numerical variables have been converted to multiple categories -- binned data).

```{r RFbinned}
# The only difference is that we use folds_fact
for (i in 1:k) {
  train_data <- as.data.frame(folds_fact$train[[i]])
  test_data  <- as.data.frame(folds_fact$test[[i]])

  # Convert the numeric isGood to a factor
  train_data$isGood <- as.factor(train_data$isGood)
  test_data$isGood  <- as.factor(test_data$isGood)

  # Fit the random forest model
  model_rf <- randomForest(isGood ~ ., data = train_data,
                           ntree = 150,
                           mtry = floor(sqrt(ncol(train_data) - 1)),
                           maxnodes = 60,
                           nodesize = 10
                           )

  # Predictions on train
  preds_train <- predict(model_rf, newdata = train_data, type = "prob")[, 2]
  auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

  # Predictions on test
  preds_test <- predict(model_rf, newdata = test_data, type = "prob")[, 2]
  auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
}

# Append the summary statistics to the global d_AUC tibble
d_AUC <- d_AUC %>% add_row(
  model    = "Random Forest on fact data (150/sqrt)",
  mn_train = mean(auc_train),
  sd_train = sd(auc_train),
  mn_test  = mean(auc_test),
  sd_test  = sd(auc_test),
  delta_tt = mean(auc_train) - mean(auc_test)
)
```

# A Neural Network

## Neural networks in R

When optimizing neural networks in R, several package options exist, each with advantages and disadvantages:

| Package       | Key Features                                    | Advantages                           | Disadvantages                         |
|---------------|-------------------------------------------------|--------------------------------------|---------------------------------------|
| **nnet**      | Classic single-hidden-layer MLP                 | Built-in R package, simple, fast     | Only one hidden layer                 |
| **neuralnet** | Flexible multiple hidden layers                 | Easy to use, visualize, customize    | Slower, limited optimization methods  |
| **RSNNS**     | Interface to Stuttgart Neural Network Simulator | Powerful, supports complex nets      | Steeper learning curve, less tidy     |
| **keras**     | Deep learning via TensorFlow backend            | Supports deep architectures, GPU acceleration, large community | More complex setup, heavier dependency |
| **caret / tidymodels** | Wrapper interfaces to many NN packages | Unified interface, cross-validation, tuning tools | Overhead, may be less flexible for custom nets |

For simple networks, `nnet` or `neuralnet` work well; for deeper or more complex models, `keras` is preferable. Wrappers like `caret`/`tidymodels` offer convenience but less direct control. Choose based on your architecture needs, compute resources, and familiarity.

## The package `nnet`

```{r nnet1}
library(dplyr)
library(nnet)
library(pROC)
library(tibble)

k <- nrow(folds_num)
auc_train <- numeric(k)
auc_test  <- numeric(k)

for (i in 1:k) {
  train_data <- as.data.frame(folds_num$train[[i]])
  test_data  <- as.data.frame(folds_num$test[[i]])

  # For nnet, the target needs to be numeric (0/1)
  train_data$isGood <- as.numeric(as.character(train_data$isGood))
  test_data$isGood  <- as.numeric(as.character(test_data$isGood))

  # Fit the neural network model (size = number of hidden units)
  model_nn <- nnet(isGood ~ ., data = train_data, size = 12, maxit = 250, trace = FALSE)

  # Predictions on train (type = "raw" gives the numeric output)
  preds_train <- predict(model_nn, newdata = train_data, type = "raw")
  auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

  # Predictions on test
  preds_test <- predict(model_nn, newdata = test_data, type = "raw")
  auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
}

# Append the summary statistics to the global d_AUC tibble
d_AUC <- d_AUC %>% add_row(
  model    = "aNN (nnet, size=12, maxit=250)",
  mn_train = mean(auc_train),
  sd_train = sd(auc_train),
  mn_test  = mean(auc_test),
  sd_test  = sd(auc_test),
  delta_tt = mean(auc_train) - mean(auc_test)
)
```

The neural network can be visualised as follows:

```{r visNN}
library(NeuralNetTools)

# After fitting model_nn (one of the folds or the final model), for example:
# model_nn <- nnet(isGood ~ ., data = train_data, size = 6, maxit = 250, trace = FALSE)

# Visualize the network structure
plotnet(model_nn)
```

## The package `neuralnet`

```{r neuralnet, eval=FALSE}
library(dplyr)
library(neuralnet)
library(pROC)
library(tibble)

k <- nrow(folds_num)
auc_train <- numeric(k)
auc_test  <- numeric(k)

for (i in 1:k) {
  # The function neuralnet() will not work with a fold (resample object):
  train_data <- as.data.frame(folds_num$train[[i]])
  test_data  <- as.data.frame(folds_num$test[[i]])

  # Ensure the target variable is numeric 0/1 for neuralnet
  train_data$isGood <- as.numeric(as.character(train_data$isGood))
  test_data$isGood  <- as.numeric(as.character(test_data$isGood))

  # Replace all special characters in the column names with a '.'
  names(train_data) <- make.names(names(train_data), unique = TRUE)
  names(test_data)  <- make.names(names(test_data), unique = TRUE)

  # This way we can make a formula for all x variables:
  all_vars  <- setdiff(names(train_data), "isGood")
  model_frm <- as.formula(paste("isGood ~", paste(all_vars, collapse = " + ")))

  # But we will use a shorter one, only including variables similar to the choice in frm1:
  model_frm <- isGood ~ MSTATUSNo + EDUCATIONz_High_School + EDUCATIONBachelors +
    EDUCATIONMasters + EDUCATIONPhD + TRAVTIME + CAR_USEPrivate + BLUEBOOK +
    TIF + OLDCLAIM + REVOKEDYes + MVR_PTS + `URBANICITYz_Highly_Rural..Rural` +
    AGE + INCOME + YOJ + HOME_VAL

  # Fit the neural network with two hidden layers: 4 and 2 neurons
  model_nn <- neuralnet(model_frm,
                        data = train_data,
                        hidden = c(4, 2),   # hidden layers
                        linear.output = FALSE,
                        stepmax = 1e6)

  # Prediction for neuralnet: compute on the feature columns only
  preds_train <- neuralnet::compute(model_nn,
    train_data[, setdiff(names(train_data), "isGood")])$net.result[, 1]
  auc_train[i] <- pROC::auc(train_data$isGood, preds_train)

  preds_test <- neuralnet::compute(model_nn,
    test_data[, setdiff(names(test_data), "isGood")])$net.result[, 1]
  auc_test[i] <- pROC::auc(test_data$isGood, preds_test)
}

# Save the results of the cross validation:
saveRDS(auc_train, './auc_train.R')
saveRDS(auc_test, './auc_test.R')
saveRDS(model_nn, './model_nn.R')
```

```{r aNNread}
# Read the previously saved results to save time:
# COMMENT OUT IF THE PREVIOUS CHUNK HAS eval=TRUE
auc_train <- readRDS('./auc_train.R')
auc_test  <- readRDS('./auc_test.R')
model_nn  <- readRDS('./model_nn.R')

# Append the summary statistics to the global d_AUC tibble
d_AUC <- d_AUC %>% add_row(
  model    = "aNN (neuralnet, hidden=c(4,2))",
  mn_train = mean(auc_train),
  sd_train = sd(auc_train),
  mn_test  = mean(auc_test),
  sd_test  = sd(auc_test),
  delta_tt = mean(auc_train) - mean(auc_test)
)
```

The model can be visualised with the library `neuralnet` as follows (or we can use the `NeuralNetTools` library):

```{r visNNneuralnet, eval=FALSE}
# We visualise the last NN that was fitted
plot(model_nn)
```

# Support Vector Machine (SVM)

The function `svm()` from `e1071` is a standard choice in R. However, it cannot be called with named `formula` and `data` arguments the way our generic `fit_models_on_folds()` calls its model function (that breaks the S3 dispatch of `svm()`), so we define a variant that passes the formula and the data positionally.

```{r svm1}
library(e1071)
library(pROC)
library(dplyr)

fit_models_on_folds_svm <- function(folds, model_func, model_frm, ...) {
  folds %>%
    mutate(
      model = map(train, function(df) {
        # formula first, data second (as a data.frame), no names:
        model_func(model_frm, as.data.frame(df), ...)
      })
    )
}

frm0 <- isGood ~ .

# Fit the SVM models on the folds with a radial kernel
models_f <- fit_models_on_folds_svm(folds = folds_bin,
                                    model_func = e1071::svm,
                                    model_frm = frm0,
                                    kernel = "radial",
                                    cost = 1)

# Calculate and update the AUC statistics globally
d_AUC <- calculate_and_update_auc(models_f, model_desc = "SVM radial kernel, cost=1")
```

# PCA

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of large datasets by transforming correlated variables into a set of uncorrelated principal components. In R, PCA can be performed with the `prcomp()` function, which centres the data by default and also scales it when `scale. = TRUE` is set, making the results easier to interpret. The principal components are ordered by the amount of variance they explain, with the first few components typically capturing most of the information in the original dataset.
This method is widely used for data visualization, noise reduction, and feature extraction in multivariate analysis.

```{r PCA1}
library(dplyr)
library(purrr)
library(pROC)
library(tibble)
library(modelr)   # for crossv_kfold and resample objects

# Function to run PCA on the training data and apply it to the test data:
run_pca_and_transform <- function(train_df, test_df, ncomp = 35) {
  # Select the features only (exclude the target):
  train_x <- train_df %>% select(-isGood)

  # PCA on the training data:
  pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

  # Project the training and test data into the PCA space:
  train_pca <- as_tibble(predict(pca, train_x)[, 1:ncomp])
  test_x    <- test_df %>% select(-isGood)
  test_pca  <- as_tibble(predict(pca, test_x)[, 1:ncomp])

  # Add the target variable back:
  train_pca$isGood <- train_df$isGood
  test_pca$isGood  <- test_df$isGood

  list(train = train_pca, test = test_pca)
}

# reminder: create 5 folds with crossv_kfold (from modelr)
#folds_bin <- crossv_kfold(mydata, k = 5)

# Extract the train/test data frames and run the PCA transform on each fold:
folds_pca <- map2(
  folds_bin$train, folds_bin$test,
  ~ run_pca_and_transform(
      as.data.frame(.x),  # train data frame from the resample object
      as.data.frame(.y),  # test data frame from the resample object
      ncomp = 35
  )
)

# Assemble a tibble of folds with the transformed data:
folds_pca_tibble <- tibble(
  .id   = seq_along(folds_pca),
  train = map(folds_pca, "train"),
  test  = map(folds_pca, "test")
)

# Logistic regression fitting function:
logistic_model_func <- function(data, formula, ...) {
  glm(formula, data = data, family = binomial, ...)
}

# Model formula using all PCA components:
formula_pca <- as.formula("isGood ~ .")

# Fit logistic regression models on the PCA-transformed folds:
folds_with_models <- fit_models_on_folds(folds_pca_tibble, logistic_model_func, formula_pca)

# Calculate and update the AUC scores for the train and test sets:
d_AUC <- calculate_and_update_auc(folds_with_models,
                                  "Logistic regression PCA (first 35 components)")

# Show the results:
#print(d_AUC)
```

# The Final Model Choice

## The interpretation of the AUC findings

When comparing multiple predictive models, the Area Under the ROC Curve (AUC) metric is often used to evaluate classification performance [@web208; @web209].

- The **test data AUC** is the most important indicator, since it reflects how well the model generalizes to new, unseen data.
- A model with a **high training AUC but a noticeably lower test AUC** is likely overfitting the training data and may not perform well in practice.
- Ideally, choose a model with a **high and similar AUC on both training and test data**, indicating both a good fit and good generalization.
- Also consider the variability of the AUC across the cross-validation folds, to assess the stability and robustness of the performance.
- For imbalanced datasets, supplement the AUC with other metrics or precision-recall curves for a more complete evaluation [@web212].

| **Criterion**                 | **Interpretation**                    | **Preferred Model**                 |
|-------------------------------|---------------------------------------|-------------------------------------|
| High test AUC                 | Good generalization                   | Higher test AUC                     |
| Large gap between train-test  | Overfitting                           | Smaller gap                         |
| Similar train & test AUC      | Good fit and generalization           | Models with minimal difference      |
| Low variability across folds  | Stable and reliable performance       | Models with consistent AUCs         |

This approach ensures that we select models that balance accuracy with robustness to new data, avoiding poor generalization [@web172].
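These criteria can be made concrete in a few lines of code. The sketch below (a standalone illustration with a hypothetical helper `summarise_auc()` and toy numbers, not part of the pipeline above) shows how per-fold AUC vectors reduce to the statistics used in the comparison table; the column names mirror the `d_AUC` tibble.

```{r aucSummarySketch}
# Hypothetical helper: summarise per-fold AUC vectors into the
# statistics that drive the model comparison.
summarise_auc <- function(model, auc_train, auc_test) {
  data.frame(
    model    = model,
    mn_train = mean(auc_train),
    sd_train = sd(auc_train),
    mn_test  = mean(auc_test),
    sd_test  = sd(auc_test),
    delta_tt = mean(auc_train) - mean(auc_test)  # train-test gap: smaller is better
  )
}

# Toy example: a large train-test gap flags overfitting.
s <- summarise_auc("toy model",
                   auc_train = c(0.95, 0.96, 0.94),
                   auc_test  = c(0.80, 0.78, 0.82))
round(s$delta_tt, 2)  # 0.15 -- a gap this size would favour a simpler model
```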
```{r d_AUC}
library(knitr)
library(kableExtra)

# Continuous colour scale for 'mn_test' (4th column); higher values are better:
colors_test <- spec_color(d_AUC$mn_test, option = "D")  # viridis default

# Continuous colour scale for 'sd_test' (5th column); lower is better, so reverse the colours:
colors_sd_test <- spec_color(d_AUC$sd_test, option = "D", direction = -1)

# Continuous colour scale for 'delta_tt' (6th column); lower is better, so reverse the colours:
colors_delta <- spec_color(d_AUC$delta_tt, option = "D", direction = -1)

kable(d_AUC, "latex",
      caption = "A summary of all cross validation results. Best results are in yellow, green is intermediate, purple are bad results.") %>%
  kable_styling() %>%
  column_spec(4, background = colors_test,    color = "black") %>%
  column_spec(5, background = colors_sd_test, color = "black") %>%
  column_spec(6, background = colors_delta,   color = "black")
```

## Model Selection and Unbiased Performance Estimation

The dataset presents a high-dimensional feature space with numerous independent variables exhibiting multicollinearity and varying predictive strengths, as evidenced by the Information Value (IV) analyses and the cross-validation outcomes. Among the evaluated candidates (logistic regressions `m0` with all variables and `m1` with refined features, decision trees, random forests, neural networks, SVMs, and PCA-preprocessed variants), the generalized linear model `m1` emerges as optimal. This selection adheres to established criteria prioritizing (i) a maximal mean test AUC, (ii) a minimal train-test delta, (iii) a low cross-fold standard deviation, and (iv) parsimony to mitigate the over-fitting risks inherent to high dimensionality (the Akaike Information Criterion favours fewer parameters when there is no substantial performance loss). While `m0` yields comparable cross-validation performance, its excess parameters elevate its over-fitting susceptibility, as indicated by a larger train-test discrepancy and an elevated variance.
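The parsimony argument can be illustrated with a toy sketch (simulated data, not the actual `m0`/`m1` models): adding a pure-noise predictor to a logistic regression can only lower the in-sample deviance, but the AIC adds a penalty of 2 per parameter, so the noise variable is typically penalised.

```{r aicToySketch}
# Toy illustration of the AIC parsimony penalty (simulated data):
set.seed(42)
n     <- 500
x     <- rnorm(n)
noise <- rnorm(n)                      # a predictor unrelated to y
y     <- rbinom(n, 1, plogis(0.5 * x))

m_small <- glm(y ~ x,         family = binomial)
m_big   <- glm(y ~ x + noise, family = binomial)

# The larger (nested) model never has a higher deviance ...
deviance(m_big) <= deviance(m_small)   # always TRUE for nested MLE fits
# ... but AIC = deviance + 2 * (number of parameters), so the
# noise variable usually raises the AIC of the bigger model:
AIC(m_small)
AIC(m_big)
```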
We choose `m1` as the model to be put in production. All we have to do now is retrain it on the complete training data (until now, the variable `m1` referred to models trained on subsets of the training data). Then we still check its performance on the held-out testing data.

```{r final_model_eval, echo=TRUE, message=FALSE, warning=FALSE}
library(pROC)
library(ROCR)

# Retrain the final m1 on the full training data:
m1_final <- glm(frm1, data = d$fact$train, family = binomial)

# Test predictions:
preds_test <- predict(m1_final, newdata = d$fact$test, type = "response")

# AUC and its confidence interval:
roc_obj  <- roc(d$fact$test$isGood, preds_test)
auc_test <- auc(roc_obj)
ci       <- ci.auc(roc_obj)  # returns c(lower, AUC, upper)
ci_lower <- ci[1]
ci_upper <- ci[3]

# KS statistic:
pred_final <- prediction(preds_test, d$fact$test$isGood)
perf_final <- performance(pred_final, "tpr", "fpr")
#ks_test <- max(attr(perf_final, "y.values")[[1]] - attr(perf_final, "x.values")[[1]])

# Optimal cutoff (3:1 FP:FN cost ratio) -- see the next section:
#cutoff_value <- optimal_cutoff(pred_final, optimal = "rec", cost.fp = 3, cost.fn = 1)
```

The AUC on the test data is `r round(auc_test, 2)` -- a decent result, keeping in mind the data of the table in the previous section. The model `m1_final` hence passes this test.

## Optimal Cut-Off

Finally, we determine an optimal cut-off and produce the confusion matrix for the model.

```{r}
# The optimal cut-off if a false positive costs 3 times more than a false negative:
cutoff_vals <- opt_cut_off(perf_final, pred_final, cost.fp = 3)
cutoff_vals  # print the values related to the cut-off
cutoff_final <- cutoff_vals[3, 1]

# Binary predictions based on that cut-off:
bin_pred_final <- if_else(pred_final@predictions[[1]] > cutoff_final, 1, 0)
```

The chosen cut-off is `r round(cutoff_final, 2)`. With this choice, we accept around `r round(sum(bin_pred_final) / length(bin_pred_final) * 100, 2)`% of the customers. However, note that the data is re-balanced (probably rows of good customers have been deleted at random).
So, it is impossible to say what percentage of customers we will actually reject; in any case, it should be much less than this calculation seems to imply.

The package `{scorecard}` provides the function `perf_eva()` to produce the confusion matrix:

```{r confusionFinal, fig.cap="m1_final Confusion Matrix"}
# Extract raw vectors (of equal length):
scores <- unlist(pred_final@predictions)  # flatten the ROCR predictions
labels <- d$fact$test$isGood              # 0/1 numeric labels

# Verify that the lengths match:
stopifnot(length(scores) == length(labels))

perf_eva(pred             = scores,
         label            = labels,
         title            = "m1 Model Confusion Matrix",
         confusion_matrix = TRUE,
         threshold        = cutoff_final)
```

# Appendices

## Acknowledgement

In this document we used the workflow and code from: De Brouwer, Philippe J. S. 2020. *The Big R-Book: From Data Science to Learning Machines and Big Data*. New York: John Wiley & Sons, Ltd.

We mainly used:

* Part V (p. 373--562): **Modelling**

## Notes on the templates

RMarkdown is very flexible and can use any template. For example, more templates can be found here: [AGH templates at Overleaf](https://www.overleaf.com/latex/templates?q=AGH)

## Bibliography