Table of Contents

1 Executive Summary

This part 2 we build a binary acceptance model: do we accept the customer or not. First with no protected features, then with protected features (such as age, marital status, gender, etc.)

2 Introduction

The car insurance data is provides a wealth of information and points to discuss. What is fair? What is ethical. We will explore questions like:

  • Is it fair to use gender?
  • No? Well can we use age then? That is a good indicator proxy for driving experience, maturity, etc. and indeed has a high information value.
  • Also not? Well, let’s just look at past track record then (did the person have an accident in the last 5 years)
  • But wait, that is somehow discriminatory for people that drive at least 5 years, because only they could build up such track record?
  • That leaves us wealth related parameters such as income, value of the car, etc. They do the same … but isn’t that discriminatory for poor people?
  • So, we have left things related to car use (private or commercial use, daily average travel time)
  • This model won’t be that strong. The insurance company now need to increase prices … which will impact the poor …

The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature will results in more desirable outcome or less desirable outcome. Leaving out the feature will make the model weaker and increase prices for all … impacting the hardest the group that we set out to protect.

Workflow

The work-flow is inspired by: (De Brouwer 2020) and is prepared by Philippe De Brouwer to illustrate the concepts explained in the book.

We will approach the problem in a few phases

  1. Prepare the data: previous file
  2. Build a naive model (this file)
  3. Build a classification model based with CLAIM_FLAG as dependent variable (this variable is 0 of no claim was filed)
  4. Now we need to find a price for the insurance policy. We can use a simple average or build a model. We will build a regression model to determine the fee that the customer should pay based on CLM_AMT (the amount claimed) of those customers that are expected to have an accident
  5. Decide on Protected Features
  6. Remove protected features and proxy features
  7. repeat both models
  8. Draw conclusions

3 Loading the Data

We need to load the data as prepared in previous file

setwd("/home/philippe/Documents/courses/lectures/bias_data")

# import functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)
## dollars_to_numeric : function (input)  
## expand_factor2bin : function (data, col)  
## fAUC : function (model, data, ...)  
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)  
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_WOE_table : function (df, y)  
## opt_cut_off : function (perf, pred, cost.fp = 1)  
## pct_table : function (x, y, round_digits = 2)  
## space_to_underscore : function (input)  
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)


# Read in the binned binary data:
d_bin  <- readRDS('./d_bin.R')

# Read in the factorial data:
d_fact <- readRDS('./d_fact.R')

# We replace CLAIM_FLAG with a variable that is 1 when the outcome is positive
d_fact <- d_fact                         %>%
         mutate(isGood = 1 - as.numeric(paste(d_fact$CLAIM_FLAG))) %>%
         select(-c(CLAIM_FLAG))

4 Naive Model Without Protected Features

First, we will fit a logistic regression using all variables that from a mathematical point of view make sense.

4.1 Test Model

First try all variables

m0 <- glm(isGood ~ ., d_fact, family = 'binomial')
summary(m0)
## 
## Call:
## glm(formula = isGood ~ ., family = "binomial", data = d_fact)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3400  -0.6272   0.3692   0.7039   2.4279  
## 
## Coefficients: (5 not defined because of singularities)
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.395711   0.313416  -1.263 0.206743    
## KIDSDRIV                        -0.363335   0.278051  -1.307 0.191308    
## HOMEKIDS1                       -0.140183   0.108144  -1.296 0.194886    
## HOMEKIDS>=2                     -0.038919   0.094897  -0.410 0.681719    
## PARENT1No                        0.232937   0.112631   2.068 0.038626 *  
## MSTATUSNo                       -0.547889   0.080533  -6.803 1.02e-11 ***
## GENDERF                          0.118503   0.262555   0.451 0.651740    
## EDUCATIONz_High_School          -0.003395   0.088557  -0.038 0.969422    
## EDUCATIONBachelors               0.426812   0.108692   3.927 8.61e-05 ***
## EDUCATIONMasters                 0.477329   0.169691   2.813 0.004909 ** 
## EDUCATIONPhD                     0.331784   0.185936   1.784 0.074358 .  
## OCCUPATIONjunior                -0.220583   0.120845  -1.825 0.067951 .  
## OCCUPATIONLawyer                 0.004206   0.179401   0.023 0.981295    
## OCCUPATIONProfessional          -0.073459   0.144009  -0.510 0.609983    
## OCCUPATIONsenior                 0.718145   0.152230   4.717 2.39e-06 ***
## OCCUPATIONz_Blue_Collar         -0.218401   0.145881  -1.497 0.134364    
## TRAVTIME25--40                  -0.306511   0.068309  -4.487 7.22e-06 ***
## TRAVTIME>40                     -0.621968   0.069355  -8.968  < 2e-16 ***
## CAR_USEPrivate                   0.727758   0.081388   8.942  < 2e-16 ***
## BLUEBOOK10--15K                  0.191409   0.075113   2.548 0.010826 *  
## BLUEBOOK15--20K                  0.230293   0.088199   2.611 0.009026 ** 
## BLUEBOOK>20K                     0.403673   0.094565   4.269 1.97e-05 ***
## TIF2--5                          0.254580   0.074264   3.428 0.000608 ***
## TIF6--8                          0.440816   0.072632   6.069 1.29e-09 ***
## TIF>8                            0.635556   0.076650   8.292  < 2e-16 ***
## CAR_TYPEPickup                  -0.488389   0.093274  -5.236 1.64e-07 ***
## CAR_TYPEPTruck_Van              -0.526232   0.113422  -4.640 3.49e-06 ***
## CAR_TYPESports_Car              -0.946210   0.115749  -8.175 2.97e-16 ***
## CAR_TYPEz_SUV                   -0.681845   0.100024  -6.817 9.31e-12 ***
## RED_CARyes                       0.062462   0.083032   0.752 0.451888    
## OLDCLAIM>0                      -0.552550   0.083395  -6.626 3.46e-11 ***
## CLM_FREQ1                        0.113506   0.095338   1.191 0.233825    
## CLM_FREQ2                        0.129216   0.092369   1.399 0.161839    
## CLM_FREQ>2                             NA         NA      NA       NA    
## REVOKEDYes                      -0.730897   0.074565  -9.802  < 2e-16 ***
## MVR_PTS1                        -0.160456   0.085488  -1.877 0.060525 .  
## MVR_PTS2                        -0.231751   0.090765  -2.553 0.010670 *  
## MVR_PTS3--4                     -0.303138   0.078661  -3.854 0.000116 ***
## MVR_PTS>4                       -0.486111   0.086909  -5.593 2.23e-08 ***
## URBANICITYz_Highly_Rural/ Rural  2.455352   0.104746  23.441  < 2e-16 ***
## AGE30--40                        0.484143   0.210297   2.302 0.021325 *  
## AGE40--55                        0.958097   0.206768   4.634 3.59e-06 ***
## AGE>55                           0.461342   0.230491   2.002 0.045332 *  
## INCOME25--55K                   -0.021786   0.092082  -0.237 0.812975    
## INCOME55--75K                   -0.088236   0.115748  -0.762 0.445876    
## INCOME>75K                       0.268401   0.123270   2.177 0.029455 *  
## YOJ>1                            0.427722   0.113624   3.764 0.000167 ***
## HOME_VAL100--200K                0.255601   0.080120   3.190 0.001422 ** 
## HOME_VAL200--300K                0.280117   0.092913   3.015 0.002571 ** 
## HOME_VAL>300K                    0.390850   0.139332   2.805 0.005029 ** 
## CAR_AGE2--7                      0.087812   0.078542   1.118 0.263559    
## CAR_AGE8--11                     0.056959   0.079792   0.714 0.475325    
## CAR_AGE12--15                    0.021582   0.106894   0.202 0.839993    
## CAR_AGE>=16                     -0.063907   0.134840  -0.474 0.635540    
## KIDSDRV1                        -0.334213   0.297440  -1.124 0.261169    
## KIDSDRV>=2                      -0.159423   0.628694  -0.254 0.799821    
## GENDER.AGEF.>55                 -0.304965   0.289877  -1.052 0.292776    
## GENDER.AGEF.30--40               0.209611   0.270226   0.776 0.437934    
## GENDER.AGEF.40--55              -0.058893   0.259330  -0.227 0.820348    
## GENDER.AGEM.<30                        NA         NA      NA       NA    
## GENDER.AGEM.>55                        NA         NA      NA       NA    
## GENDER.AGEM.30--40                     NA         NA      NA       NA    
## GENDER.AGEM.40--55                     NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11181.6  on 9636  degrees of freedom
## Residual deviance:  8431.6  on 9579  degrees of freedom
## AIC: 8547.6
## 
## Number of Fisher Scoring iterations: 5

4.2 Fit the Model With only Relevant Variables

Leave out the least relevant ones (the coefficients without stars) and build a first usable model \(m_1\).

We will not use further:

  • KIDSDRV
  • HOMEKIDS
  • GENDER
  • RED_CAR
  • CAR_AGE
  • CLM_FREQ (too much correlated to OLDCLAIM)
  • GENDER.AGE
frm <- formula(isGood ~ MSTATUS + EDUCATION + OCCUPATION + TRAVTIME +
                 CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + 
                 REVOKED + MVR_PTS + URBANICITY + AGE  + INCOME + 
                 YOJ + HOME_VAL)

m1 <- glm(frm, d_fact, family = 'binomial')
summary(m1)
## 
## Call:
## glm(formula = frm, family = "binomial", data = d_fact)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2743  -0.6660   0.3856   0.7128   2.4101  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.107127   0.224487  -0.477 0.633214    
## MSTATUSNo                       -0.593818   0.065057  -9.128  < 2e-16 ***
## EDUCATIONz_High_School          -0.007295   0.086237  -0.085 0.932584    
## EDUCATIONBachelors               0.438288   0.099852   4.389 1.14e-05 ***
## EDUCATIONMasters                 0.435534   0.149740   2.909 0.003630 ** 
## EDUCATIONPhD                     0.347441   0.168960   2.056 0.039749 *  
## OCCUPATIONjunior                -0.244750   0.118566  -2.064 0.038994 *  
## OCCUPATIONLawyer                 0.013269   0.176246   0.075 0.939985    
## OCCUPATIONProfessional          -0.073681   0.141143  -0.522 0.601649    
## OCCUPATIONsenior                 0.698354   0.149224   4.680 2.87e-06 ***
## OCCUPATIONz_Blue_Collar         -0.239132   0.143192  -1.670 0.094916 .  
## TRAVTIME25--40                  -0.296752   0.067463  -4.399 1.09e-05 ***
## TRAVTIME>40                     -0.596040   0.068454  -8.707  < 2e-16 ***
## CAR_USEPrivate                   0.693804   0.080302   8.640  < 2e-16 ***
## BLUEBOOK10--15K                  0.194169   0.074122   2.620 0.008804 ** 
## BLUEBOOK15--20K                  0.232693   0.085412   2.724 0.006443 ** 
## BLUEBOOK>20K                     0.389974   0.086922   4.487 7.24e-06 ***
## TIF2--5                          0.252219   0.073316   3.440 0.000581 ***
## TIF6--8                          0.450104   0.071800   6.269 3.64e-10 ***
## TIF>8                            0.626743   0.075750   8.274  < 2e-16 ***
## CAR_TYPEPickup                  -0.484687   0.091825  -5.278 1.30e-07 ***
## CAR_TYPEPTruck_Van              -0.508646   0.106474  -4.777 1.78e-06 ***
## CAR_TYPESports_Car              -0.898498   0.095765  -9.382  < 2e-16 ***
## CAR_TYPEz_SUV                   -0.639593   0.076944  -8.312  < 2e-16 ***
## OLDCLAIM>0                      -0.486383   0.060328  -8.062 7.48e-16 ***
## REVOKEDYes                      -0.746171   0.073641 -10.133  < 2e-16 ***
## MVR_PTS1                        -0.178206   0.084428  -2.111 0.034795 *  
## MVR_PTS2                        -0.231409   0.089502  -2.586 0.009724 ** 
## MVR_PTS3--4                     -0.327117   0.077671  -4.212 2.54e-05 ***
## MVR_PTS>4                       -0.514532   0.085991  -5.984 2.18e-09 ***
## URBANICITYz_Highly_Rural/ Rural  2.361003   0.102937  22.936  < 2e-16 ***
## AGE30--40                        0.513866   0.132334   3.883 0.000103 ***
## AGE40--55                        0.891984   0.128534   6.940 3.93e-12 ***
## AGE>55                           0.322565   0.145418   2.218 0.026541 *  
## INCOME25--55K                   -0.016250   0.090983  -0.179 0.858248    
## INCOME55--75K                   -0.060066   0.114426  -0.525 0.599628    
## INCOME>75K                       0.261078   0.121693   2.145 0.031922 *  
## YOJ>1                            0.409499   0.112193   3.650 0.000262 ***
## HOME_VAL100--200K                0.248116   0.079048   3.139 0.001696 ** 
## HOME_VAL200--300K                0.269736   0.091436   2.950 0.003178 ** 
## HOME_VAL>300K                    0.357513   0.137360   2.603 0.009248 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11181.6  on 9636  degrees of freedom
## Residual deviance:  8589.2  on 9596  degrees of freedom
## AIC: 8671.2
## 
## Number of Fisher Scoring iterations: 5
AUC1 <- fAUC(m1, d_fact, type="response")

The AUC of model \(m_1\) is: 0.817

# Re-use the model m and the dataset t2:
pred1 <- prediction(predict(m1, type = "response"), d_fact$isGood)

# Visualize the ROC curve:
#plot(performance(pred, "tpr", "fpr"), col="blue", lwd = 3)
#abline(0, 1, lty = 2)

AUC1 <- attr(performance(pred1, "auc"), "y.values")[[1]]
#paste("AUC:", AUC)

perf1 <- performance(pred1, "tpr", "fpr")
ks <- max(attr(perf1,'y.values')[[1]] - attr(perf1,'x.values')[[1]])
#paste("KS:", ks)


#predScores <- modelr::add_predictions(df, m)$pred  # not correct?
predScores1 <- predict(m1, type = "response")

# Visualize the KS with the package InformationValue:
InformationValue::ks_plot(actuals = d_fact$isGood, predictedScores = predScores1)
The ROC (receiver operating curve) for our model

The ROC (receiver operating curve) for our model

InformationValue::plotROC(actuals = d_fact$isGood, predictedScores = predScores1)