In part 2 we build a binary acceptance model: do we accept the customer or not? First without protected features, then with protected features (such as age, marital status, gender, etc.).
The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like these.
The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature results in a more desirable or a less desirable outcome. Leaving out the feature will make the model weaker and increase prices for all … hitting hardest the group that we set out to protect.
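To make the discussion concrete, one common way to formalize "fair" (one candidate definition among several, used here only for illustration) is statistical parity: the acceptance decision \(\hat{Y}\) should be independent of the protected feature, i.e. \(P(\hat{Y} = 1 \mid S = s_1) = P(\hat{Y} = 1 \mid S = s_2)\) for all groups \(s_1\) and \(s_2\). Note that simply dropping \(S\) from the model does not guarantee this, because other variables can act as proxies for \(S\).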
Workflow
The workflow is inspired by (De Brouwer 2020) and is prepared by Philippe De Brouwer to illustrate the concepts explained in the book.
We will approach the problem in a few phases:
- CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed)
- CLM_AMT (the amount claimed) of those customers that are expected to have an accident

We need to load the data as prepared in the previous file.
setwd("/home/philippe/Documents/courses/lectures/bias_data")
# import functions:
source("ethics_functions.R")
# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)
## dollars_to_numeric : function (input)
## expand_factor2bin : function (data, col)
## fAUC : function (model, data, ...)
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_WOE_table : function (df, y)
## opt_cut_off : function (perf, pred, cost.fp = 1)
## pct_table : function (x, y, round_digits = 2)
## space_to_underscore : function (input)
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)
# Read in the binned binary data:
d_bin <- readRDS('./d_bin.R')
# Read in the factorial data:
d_fact <- readRDS('./d_fact.R')
# We replace CLAIM_FLAG with a variable isGood that is 1 when the outcome
# is positive (i.e. no claim was filed):
library(dplyr)  # for %>%, mutate() and select() (possibly loaded already via the previous file)
d_fact <- d_fact %>%
  mutate(isGood = 1 - as.numeric(as.character(CLAIM_FLAG))) %>%
  select(-CLAIM_FLAG)
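As a quick sanity check (our addition; the class balance is not reported in the original), we can tabulate the new dependent variable:
# Counts of 1 (no claim filed) versus 0 (claim filed):
table(d_fact$isGood)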
First, we will fit a logistic regression using all variables that make sense from a mathematical point of view.
First try all variables
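The summary below was produced by a call of the following form (reconstructed from the Call: line in the output; only the object name m0 is our choice):
# Fit the full model with every explanatory variable:
m0 <- glm(isGood ~ ., family = "binomial", data = d_fact)
summary(m0)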
##
## Call:
## glm(formula = isGood ~ ., family = "binomial", data = d_fact)
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.395711 0.313416 -1.263 0.206743
## KIDSDRIV -0.363335 0.278051 -1.307 0.191308
## HOMEKIDS1 -0.140183 0.108144 -1.296 0.194886
## HOMEKIDS>=2 -0.038919 0.094897 -0.410 0.681719
## PARENT1No 0.232937 0.112631 2.068 0.038626 *
## MSTATUSNo -0.547889 0.080533 -6.803 1.02e-11 ***
## GENDERF 0.118503 0.262555 0.451 0.651740
## EDUCATIONz_High_School -0.003395 0.088557 -0.038 0.969422
## EDUCATIONBachelors 0.426812 0.108692 3.927 8.61e-05 ***
## EDUCATIONMasters 0.477329 0.169691 2.813 0.004909 **
## EDUCATIONPhD 0.331784 0.185936 1.784 0.074358 .
## OCCUPATIONjunior -0.220583 0.120845 -1.825 0.067951 .
## OCCUPATIONLawyer 0.004206 0.179401 0.023 0.981295
## OCCUPATIONProfessional -0.073459 0.144009 -0.510 0.609983
## OCCUPATIONsenior 0.718145 0.152230 4.717 2.39e-06 ***
## OCCUPATIONz_Blue_Collar -0.218401 0.145881 -1.497 0.134364
## TRAVTIME25--40 -0.306511 0.068309 -4.487 7.22e-06 ***
## TRAVTIME>40 -0.621968 0.069355 -8.968 < 2e-16 ***
## CAR_USEPrivate 0.727758 0.081388 8.942 < 2e-16 ***
## BLUEBOOK10--15K 0.191409 0.075113 2.548 0.010826 *
## BLUEBOOK15--20K 0.230293 0.088199 2.611 0.009026 **
## BLUEBOOK>20K 0.403673 0.094565 4.269 1.97e-05 ***
## TIF2--5 0.254580 0.074264 3.428 0.000608 ***
## TIF6--8 0.440816 0.072632 6.069 1.29e-09 ***
## TIF>8 0.635556 0.076650 8.292 < 2e-16 ***
## CAR_TYPEPickup -0.488389 0.093274 -5.236 1.64e-07 ***
## CAR_TYPEPTruck_Van -0.526232 0.113422 -4.640 3.49e-06 ***
## CAR_TYPESports_Car -0.946210 0.115749 -8.175 2.97e-16 ***
## CAR_TYPEz_SUV -0.681845 0.100024 -6.817 9.31e-12 ***
## RED_CARyes 0.062462 0.083032 0.752 0.451888
## OLDCLAIM>0 -0.552550 0.083395 -6.626 3.46e-11 ***
## CLM_FREQ1 0.113506 0.095338 1.191 0.233825
## CLM_FREQ2 0.129216 0.092369 1.399 0.161839
## CLM_FREQ>2 NA NA NA NA
## REVOKEDYes -0.730897 0.074565 -9.802 < 2e-16 ***
## MVR_PTS1 -0.160456 0.085488 -1.877 0.060525 .
## MVR_PTS2 -0.231751 0.090765 -2.553 0.010670 *
## MVR_PTS3--4 -0.303138 0.078661 -3.854 0.000116 ***
## MVR_PTS>4 -0.486111 0.086909 -5.593 2.23e-08 ***
## URBANICITYz_Highly_Rural/ Rural 2.455352 0.104746 23.441 < 2e-16 ***
## AGE30--40 0.484143 0.210297 2.302 0.021325 *
## AGE40--55 0.958097 0.206768 4.634 3.59e-06 ***
## AGE>55 0.461342 0.230491 2.002 0.045332 *
## INCOME25--55K -0.021786 0.092082 -0.237 0.812975
## INCOME55--75K -0.088236 0.115748 -0.762 0.445876
## INCOME>75K 0.268401 0.123270 2.177 0.029455 *
## YOJ>1 0.427722 0.113624 3.764 0.000167 ***
## HOME_VAL100--200K 0.255601 0.080120 3.190 0.001422 **
## HOME_VAL200--300K 0.280117 0.092913 3.015 0.002571 **
## HOME_VAL>300K 0.390850 0.139332 2.805 0.005029 **
## CAR_AGE2--7 0.087812 0.078542 1.118 0.263559
## CAR_AGE8--11 0.056959 0.079792 0.714 0.475325
## CAR_AGE12--15 0.021582 0.106894 0.202 0.839993
## CAR_AGE>=16 -0.063907 0.134840 -0.474 0.635540
## KIDSDRV1 -0.334213 0.297440 -1.124 0.261169
## KIDSDRV>=2 -0.159423 0.628694 -0.254 0.799821
## GENDER.AGEF.>55 -0.304965 0.289877 -1.052 0.292776
## GENDER.AGEF.30--40 0.209611 0.270226 0.776 0.437934
## GENDER.AGEF.40--55 -0.058893 0.259330 -0.227 0.820348
## GENDER.AGEM.<30 NA NA NA NA
## GENDER.AGEM.>55 NA NA NA NA
## GENDER.AGEM.30--40 NA NA NA NA
## GENDER.AGEM.40--55 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 8431.6 on 9579 degrees of freedom
## AIC: 8547.6
##
## Number of Fisher Scoring iterations: 5
Note that five coefficients are reported as NA ("not defined because of singularities"): the male GENDER.AGE levels are linear combinations of the GENDER and AGE dummies, and CLM_FREQ>2 is presumably implied by OLDCLAIM>0 together with the other CLM_FREQ dummies, so glm() drops them. Leave out the least relevant variables (the coefficients without stars) and build a first usable model \(m_1\).
We will not use further:
- KIDSDRV
- HOMEKIDS
- GENDER
- RED_CAR
- CAR_AGE
- CLM_FREQ (too strongly correlated with OLDCLAIM)
- GENDER.AGE
frm <- formula(isGood ~ MSTATUS + EDUCATION + OCCUPATION + TRAVTIME +
CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM +
REVOKED + MVR_PTS + URBANICITY + AGE + INCOME +
YOJ + HOME_VAL)
m1 <- glm(frm, data = d_fact, family = 'binomial')
summary(m1)
##
## Call:
## glm(formula = frm, family = "binomial", data = d_fact)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.107127 0.224487 -0.477 0.633214
## MSTATUSNo -0.593818 0.065057 -9.128 < 2e-16 ***
## EDUCATIONz_High_School -0.007295 0.086237 -0.085 0.932584
## EDUCATIONBachelors 0.438288 0.099852 4.389 1.14e-05 ***
## EDUCATIONMasters 0.435534 0.149740 2.909 0.003630 **
## EDUCATIONPhD 0.347441 0.168960 2.056 0.039749 *
## OCCUPATIONjunior -0.244750 0.118566 -2.064 0.038994 *
## OCCUPATIONLawyer 0.013269 0.176246 0.075 0.939985
## OCCUPATIONProfessional -0.073681 0.141143 -0.522 0.601649
## OCCUPATIONsenior 0.698354 0.149224 4.680 2.87e-06 ***
## OCCUPATIONz_Blue_Collar -0.239132 0.143192 -1.670 0.094916 .
## TRAVTIME25--40 -0.296752 0.067463 -4.399 1.09e-05 ***
## TRAVTIME>40 -0.596040 0.068454 -8.707 < 2e-16 ***
## CAR_USEPrivate 0.693804 0.080302 8.640 < 2e-16 ***
## BLUEBOOK10--15K 0.194169 0.074122 2.620 0.008804 **
## BLUEBOOK15--20K 0.232693 0.085412 2.724 0.006443 **
## BLUEBOOK>20K 0.389974 0.086922 4.487 7.24e-06 ***
## TIF2--5 0.252219 0.073316 3.440 0.000581 ***
## TIF6--8 0.450104 0.071800 6.269 3.64e-10 ***
## TIF>8 0.626743 0.075750 8.274 < 2e-16 ***
## CAR_TYPEPickup -0.484687 0.091825 -5.278 1.30e-07 ***
## CAR_TYPEPTruck_Van -0.508646 0.106474 -4.777 1.78e-06 ***
## CAR_TYPESports_Car -0.898498 0.095765 -9.382 < 2e-16 ***
## CAR_TYPEz_SUV -0.639593 0.076944 -8.312 < 2e-16 ***
## OLDCLAIM>0 -0.486383 0.060328 -8.062 7.48e-16 ***
## REVOKEDYes -0.746171 0.073641 -10.133 < 2e-16 ***
## MVR_PTS1 -0.178206 0.084428 -2.111 0.034795 *
## MVR_PTS2 -0.231409 0.089502 -2.586 0.009724 **
## MVR_PTS3--4 -0.327117 0.077671 -4.212 2.54e-05 ***
## MVR_PTS>4 -0.514532 0.085991 -5.984 2.18e-09 ***
## URBANICITYz_Highly_Rural/ Rural 2.361003 0.102937 22.936 < 2e-16 ***
## AGE30--40 0.513866 0.132334 3.883 0.000103 ***
## AGE40--55 0.891984 0.128534 6.940 3.93e-12 ***
## AGE>55 0.322565 0.145418 2.218 0.026541 *
## INCOME25--55K -0.016250 0.090983 -0.179 0.858248
## INCOME55--75K -0.060066 0.114426 -0.525 0.599628
## INCOME>75K 0.261078 0.121693 2.145 0.031922 *
## YOJ>1 0.409499 0.112193 3.650 0.000262 ***
## HOME_VAL100--200K 0.248116 0.079048 3.139 0.001696 **
## HOME_VAL200--300K 0.269736 0.091436 2.950 0.003178 **
## HOME_VAL>300K 0.357513 0.137360 2.603 0.009248 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 8589.2 on 9596 degrees of freedom
## AIC: 8671.2
##
## Number of Fisher Scoring iterations: 5
The AUC of model \(m_1\) is 0.817.
# The prediction() and performance() functions below come from the package
# ROCR (assumed to be loaded here, if not already loaded in the previous file):
library(ROCR)
# Score model m1 on the full dataset d_fact:
pred1 <- prediction(predict(m1, type = "response"), d_fact$isGood)
# The ROC curve could be visualized with:
# plot(performance(pred1, "tpr", "fpr"), col = "blue", lwd = 3)
# abline(0, 1, lty = 2)
AUC1 <- attr(performance(pred1, "auc"), "y.values")[[1]]
# paste("AUC:", AUC1)
# The Kolmogorov-Smirnov (KS) statistic is the maximum vertical distance
# between the TPR and FPR curves, i.e. the maximum separation between the
# cumulative score distributions of the two classes:
perf1 <- performance(pred1, "tpr", "fpr")
ks <- max(attr(perf1, 'y.values')[[1]] - attr(perf1, 'x.values')[[1]])
# paste("KS:", ks)
predScores1 <- predict(m1, type = "response")
# Visualize the KS with the package InformationValue:
InformationValue::ks_plot(actuals = d_fact$isGood, predictedScores = predScores1)
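With the scores in hand, we can already peek at a confusion matrix. This is a minimal sketch with an illustrative cutoff of 0.5 (an assumption on our part, not an optimized value; the helper functions opt_cut_off() and get_best_cutoff() sourced above could be used to choose a cost-sensitive cutoff instead):
# Confusion matrix at an illustrative (not optimized) cutoff of 0.5:
cutoff <- 0.5
pred_class <- as.integer(predScores1 > cutoff)
table(predicted = pred_class, actual = d_fact$isGood)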