In this part 2 we build a binary acceptance model: do we accept the customer or not? First without protected features, then with protected features (such as age, marital status, gender, etc.).
The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like:
The dilemma is that when we believe there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature results in a more desirable or a less desirable outcome. Leaving out the feature makes the model weaker and increases prices for all … hitting hardest the very group that we set out to protect.
Workflow
The workflow is inspired by (De Brouwer 2020) and was prepared by Philippe De Brouwer to illustrate the concepts explained in the book.
We will approach the problem in a few phases, using:
- CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed),
- CLM_AMT (the amount claimed) of those customers that are expected to have an accident.

We need to load the data as prepared in the previous file.
setwd("/home/philippe/Documents/courses/lectures/bias_data")
# import functions:
source("ethics_functions.R")
# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)
## dollars_to_numeric : function (input)
## expand_factor2bin : function (data, col)
## fAUC : function (model, data, ...)
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_WOE_table : function (df, y)
## opt_cut_off : function (perf, pred, cost.fp = 1)
## pct_table : function (x, y, round_digits = 2)
## space_to_underscore : function (input)
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)
# Read in the binned binary data:
d_bin <- readRDS('./d_bin.R')
# Read in the factorial data:
d_fact <- readRDS('./d_fact.R')
# We replace CLAIM_FLAG with a variable that is 1 when the outcome is positive
d_fact <- d_fact %>%
  mutate(isGood = 1 - as.numeric(paste(d_fact$CLAIM_FLAG))) %>%
  select(-c(CLAIM_FLAG))
First, we will fit a logistic regression using all variables that make sense from a mathematical point of view.
First try all variables
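The code chunk that produced the output below is not shown in this rendering; a minimal sketch of the call (the object name m0 is an assumption) is:
# Fit a logistic regression on all variables (the '.' expands to every column except isGood):
m0 <- glm(isGood ~ ., family = "binomial", data = d_fact)
summary(m0)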
##
## Call:
## glm(formula = isGood ~ ., family = "binomial", data = d_fact)
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.395711 0.313416 -1.263 0.206743
## KIDSDRIV -0.363335 0.278051 -1.307 0.191308
## HOMEKIDS1 -0.140183 0.108144 -1.296 0.194886
## HOMEKIDS>=2 -0.038919 0.094897 -0.410 0.681719
## PARENT1No 0.232937 0.112631 2.068 0.038626 *
## MSTATUSNo -0.547889 0.080533 -6.803 1.02e-11 ***
## GENDERF 0.118503 0.262555 0.451 0.651740
## EDUCATIONz_High_School -0.003395 0.088557 -0.038 0.969422
## EDUCATIONBachelors 0.426812 0.108692 3.927 8.61e-05 ***
## EDUCATIONMasters 0.477329 0.169691 2.813 0.004909 **
## EDUCATIONPhD 0.331784 0.185936 1.784 0.074358 .
## OCCUPATIONjunior -0.220583 0.120845 -1.825 0.067951 .
## OCCUPATIONLawyer 0.004206 0.179401 0.023 0.981295
## OCCUPATIONProfessional -0.073459 0.144009 -0.510 0.609983
## OCCUPATIONsenior 0.718145 0.152230 4.717 2.39e-06 ***
## OCCUPATIONz_Blue_Collar -0.218401 0.145881 -1.497 0.134364
## TRAVTIME25--40 -0.306511 0.068309 -4.487 7.22e-06 ***
## TRAVTIME>40 -0.621968 0.069355 -8.968 < 2e-16 ***
## CAR_USEPrivate 0.727758 0.081388 8.942 < 2e-16 ***
## BLUEBOOK10--15K 0.191409 0.075113 2.548 0.010826 *
## BLUEBOOK15--20K 0.230293 0.088199 2.611 0.009026 **
## BLUEBOOK>20K 0.403673 0.094565 4.269 1.97e-05 ***
## TIF2--5 0.254580 0.074264 3.428 0.000608 ***
## TIF6--8 0.440816 0.072632 6.069 1.29e-09 ***
## TIF>8 0.635556 0.076650 8.292 < 2e-16 ***
## CAR_TYPEPickup -0.488389 0.093274 -5.236 1.64e-07 ***
## CAR_TYPEPTruck_Van -0.526232 0.113422 -4.640 3.49e-06 ***
## CAR_TYPESports_Car -0.946210 0.115749 -8.175 2.97e-16 ***
## CAR_TYPEz_SUV -0.681845 0.100024 -6.817 9.31e-12 ***
## RED_CARyes 0.062462 0.083032 0.752 0.451888
## OLDCLAIM>0 -0.552550 0.083395 -6.626 3.46e-11 ***
## CLM_FREQ1 0.113506 0.095338 1.191 0.233825
## CLM_FREQ2 0.129216 0.092369 1.399 0.161839
## CLM_FREQ>2 NA NA NA NA
## REVOKEDYes -0.730897 0.074565 -9.802 < 2e-16 ***
## MVR_PTS1 -0.160456 0.085488 -1.877 0.060525 .
## MVR_PTS2 -0.231751 0.090765 -2.553 0.010670 *
## MVR_PTS3--4 -0.303138 0.078661 -3.854 0.000116 ***
## MVR_PTS>4 -0.486111 0.086909 -5.593 2.23e-08 ***
## URBANICITYz_Highly_Rural/ Rural 2.455352 0.104746 23.441 < 2e-16 ***
## AGE30--40 0.484143 0.210297 2.302 0.021325 *
## AGE40--55 0.958097 0.206768 4.634 3.59e-06 ***
## AGE>55 0.461342 0.230491 2.002 0.045332 *
## INCOME25--55K -0.021786 0.092082 -0.237 0.812975
## INCOME55--75K -0.088236 0.115748 -0.762 0.445876
## INCOME>75K 0.268401 0.123270 2.177 0.029455 *
## YOJ>1 0.427722 0.113624 3.764 0.000167 ***
## HOME_VAL100--200K 0.255601 0.080120 3.190 0.001422 **
## HOME_VAL200--300K 0.280117 0.092913 3.015 0.002571 **
## HOME_VAL>300K 0.390850 0.139332 2.805 0.005029 **
## CAR_AGE2--7 0.087812 0.078542 1.118 0.263559
## CAR_AGE8--11 0.056959 0.079792 0.714 0.475325
## CAR_AGE12--15 0.021582 0.106894 0.202 0.839993
## CAR_AGE>=16 -0.063907 0.134840 -0.474 0.635540
## KIDSDRV1 -0.334213 0.297440 -1.124 0.261169
## KIDSDRV>=2 -0.159423 0.628694 -0.254 0.799821
## GENDER.AGEF.>55 -0.304965 0.289877 -1.052 0.292776
## GENDER.AGEF.30--40 0.209611 0.270226 0.776 0.437934
## GENDER.AGEF.40--55 -0.058893 0.259330 -0.227 0.820348
## GENDER.AGEM.<30 NA NA NA NA
## GENDER.AGEM.>55 NA NA NA NA
## GENDER.AGEM.30--40 NA NA NA NA
## GENDER.AGEM.40--55 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 8431.6 on 9579 degrees of freedom
## AIC: 8547.6
##
## Number of Fisher Scoring iterations: 5
Leave out the least relevant ones (the coefficients without stars) and build a first usable model \(m_1\).
We will not use further:
- KIDSDRV
- HOMEKIDS
- GENDER
- RED_CAR
- CAR_AGE
- CLM_FREQ (too strongly correlated with OLDCLAIM)
- GENDER.AGE
frm <- formula(isGood ~ MSTATUS + EDUCATION + OCCUPATION + TRAVTIME +
CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM +
REVOKED + MVR_PTS + URBANICITY + AGE + INCOME +
YOJ + HOME_VAL)
m1 <- glm(frm, d_fact, family = 'binomial')
summary(m1)
##
## Call:
## glm(formula = frm, family = "binomial", data = d_fact)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.107127 0.224487 -0.477 0.633214
## MSTATUSNo -0.593818 0.065057 -9.128 < 2e-16 ***
## EDUCATIONz_High_School -0.007295 0.086237 -0.085 0.932584
## EDUCATIONBachelors 0.438288 0.099852 4.389 1.14e-05 ***
## EDUCATIONMasters 0.435534 0.149740 2.909 0.003630 **
## EDUCATIONPhD 0.347441 0.168960 2.056 0.039749 *
## OCCUPATIONjunior -0.244750 0.118566 -2.064 0.038994 *
## OCCUPATIONLawyer 0.013269 0.176246 0.075 0.939985
## OCCUPATIONProfessional -0.073681 0.141143 -0.522 0.601649
## OCCUPATIONsenior 0.698354 0.149224 4.680 2.87e-06 ***
## OCCUPATIONz_Blue_Collar -0.239132 0.143192 -1.670 0.094916 .
## TRAVTIME25--40 -0.296752 0.067463 -4.399 1.09e-05 ***
## TRAVTIME>40 -0.596040 0.068454 -8.707 < 2e-16 ***
## CAR_USEPrivate 0.693804 0.080302 8.640 < 2e-16 ***
## BLUEBOOK10--15K 0.194169 0.074122 2.620 0.008804 **
## BLUEBOOK15--20K 0.232693 0.085412 2.724 0.006443 **
## BLUEBOOK>20K 0.389974 0.086922 4.487 7.24e-06 ***
## TIF2--5 0.252219 0.073316 3.440 0.000581 ***
## TIF6--8 0.450104 0.071800 6.269 3.64e-10 ***
## TIF>8 0.626743 0.075750 8.274 < 2e-16 ***
## CAR_TYPEPickup -0.484687 0.091825 -5.278 1.30e-07 ***
## CAR_TYPEPTruck_Van -0.508646 0.106474 -4.777 1.78e-06 ***
## CAR_TYPESports_Car -0.898498 0.095765 -9.382 < 2e-16 ***
## CAR_TYPEz_SUV -0.639593 0.076944 -8.312 < 2e-16 ***
## OLDCLAIM>0 -0.486383 0.060328 -8.062 7.48e-16 ***
## REVOKEDYes -0.746171 0.073641 -10.133 < 2e-16 ***
## MVR_PTS1 -0.178206 0.084428 -2.111 0.034795 *
## MVR_PTS2 -0.231409 0.089502 -2.586 0.009724 **
## MVR_PTS3--4 -0.327117 0.077671 -4.212 2.54e-05 ***
## MVR_PTS>4 -0.514532 0.085991 -5.984 2.18e-09 ***
## URBANICITYz_Highly_Rural/ Rural 2.361003 0.102937 22.936 < 2e-16 ***
## AGE30--40 0.513866 0.132334 3.883 0.000103 ***
## AGE40--55 0.891984 0.128534 6.940 3.93e-12 ***
## AGE>55 0.322565 0.145418 2.218 0.026541 *
## INCOME25--55K -0.016250 0.090983 -0.179 0.858248
## INCOME55--75K -0.060066 0.114426 -0.525 0.599628
## INCOME>75K 0.261078 0.121693 2.145 0.031922 *
## YOJ>1 0.409499 0.112193 3.650 0.000262 ***
## HOME_VAL100--200K 0.248116 0.079048 3.139 0.001696 **
## HOME_VAL200--300K 0.269736 0.091436 2.950 0.003178 **
## HOME_VAL>300K 0.357513 0.137360 2.603 0.009248 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 8589.2 on 9596 degrees of freedom
## AIC: 8671.2
##
## Number of Fisher Scoring iterations: 5
The AUC of model \(m_1\) is: 0.817
# Build the prediction object for model m1 on the full dataset d_fact:
pred1 <- prediction(predict(m1, type = "response"), d_fact$isGood)
# Visualize the ROC curve:
#plot(performance(pred, "tpr", "fpr"), col="blue", lwd = 3)
#abline(0, 1, lty = 2)
AUC1 <- attr(performance(pred1, "auc"), "y.values")[[1]]
#paste("AUC:", AUC)
perf1 <- performance(pred1, "tpr", "fpr")
ks <- max(attr(perf1,'y.values')[[1]] - attr(perf1,'x.values')[[1]])
#paste("KS:", ks)
#predScores <- modelr::add_predictions(df, m)$pred # not correct?
predScores1 <- predict(m1, type = "response")
# Visualize the KS with the package InformationValue:
InformationValue::ks_plot(actuals = d_fact$isGood, predictedScores = predScores1)
The ROC (receiver operating characteristic) curve for our model.
The lift of the model (bottom): the cumulative percentage of responders (ones) captured by the model.
# The optimal cutoff if the false positives cost 3 times more than the false negatives:
cutoff_vals <- opt_cut_off(perf1, pred1, cost.fp = 3)
cutoff_vals # print the values related to the cutoff
## [,1]
## sensitivity 0.5695683
## specificity 0.8748056
## cutoff 0.8301915
cutoff1 <- cutoff_vals[3,1]
# Binary predictions based on that cutoff:
bin_pred1 <- if_else(pred1@predictions[[1]] > cutoff1, 1, 0)
We accept around 45.09% of the customers. However, note that the data has been re-balanced (probably rows of good customers were deleted at random), so it is impossible to say what percentage of customers we would actually reject. In any case it should be much less than this calculation seems to imply.
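This acceptance rate can be verified directly from the binary predictions (a simple check, not part of the original code):
# Proportion of customers accepted at cutoff1:
mean(bin_pred1)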
First we calculate false negatives and false positives for further use:
FN1 <- if_else(bin_pred1 == 0 & d_fact$isGood == 1, 1, 0)
FP1 <- if_else(bin_pred1 == 1 & d_fact$isGood == 0, 1, 0)
In the data men and women were roughly equally represented and in the model we did not use gender.
The following table illustrates Equal Opportunity: compare False Negative Rates:
GENDER | FN1 = 0 | FN1 = 1 | Total
---|---|---|---
M | 2937 (69.4%) | 1293 (30.6%) | 4230 (100.0%)
F | 3658 (67.7%) | 1749 (32.3%) | 5407 (100.0%)
Total | 6595 (68.4%) | 3042 (31.6%) | 9637 (100.0%)
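Tables like the one above were generated with the pct_table() helper loaded from ethics_functions.R; the exact call is not shown in this document, but given the function signature listed earlier a call of the following form (an assumption) reproduces the comparison:
# Cross-table of false negatives by gender (assumed call; signature pct_table(x, y, round_digits = 2)):
pct_table(factor(FN1), factor(d_fact$GENDER))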
Predictive Equality: compare False Positive Rates
GENDER | FP1 = 0 | FP1 = 1 | Total
---|---|---|---
M | 4081 (96.5%) | 149 (3.5%) | 4230 (100.0%)
F | 5234 (96.8%) | 173 (3.2%) | 5407 (100.0%)
Total | 9315 (96.7%) | 322 (3.3%) | 9637 (100.0%)
The variable AGE has, unlike GENDER, been used in our model. So, it would be normal to see a different demographic parity for younger and older people.
AGE | bin_pred1 = 0 | bin_pred1 = 1 | Total
---|---|---|---
<30 | 321 (85.8%) | 53 (14.2%) | 374 (100.0%)
30–40 | 1523 (66.0%) | 783 (34.0%) | 2306 (100.0%)
40–55 | 2723 (47.7%) | 2981 (52.3%) | 5704 (100.0%)
>55 | 725 (57.9%) | 528 (42.1%) | 1253 (100.0%)
Total | 5292 (54.9%) | 4345 (45.1%) | 9637 (100.0%)
Indeed, a young person has a 17.4% probability of being accepted, while an older person has around 50%. The difference (amounting to a 187% higher probability) is significant. Due to the model design, this is a desired outcome rather than unexpected bias.
Old and young people are under-represented in the data, which makes it harder for the model to correctly predict their behaviour. That could be a hidden bias (a bias that is not by design).
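The group sizes can be checked directly (a simple check, not part of the original code):
# Number of customers per age group:
table(d_fact$AGE)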
The following table illustrates Equal Opportunity: compare False Negative Rates:
AGE | FN1 = 0 | FN1 = 1 | Total
---|---|---|---
<30 | 242 (64.7%) | 132 (35.3%) | 374 (100.0%)
30–40 | 1482 (64.3%) | 824 (35.7%) | 2306 (100.0%)
40–55 | 4042 (70.9%) | 1662 (29.1%) | 5704 (100.0%)
>55 | 829 (66.2%) | 424 (33.8%) | 1253 (100.0%)
Total | 6595 (68.4%) | 3042 (31.6%) | 9637 (100.0%)
Younger people are discriminated against compared to older people: the false negative rates are about 5% higher for younger people.
Predictive Equality: compare False Positive Rates
AGE | FP1 = 0 | FP1 = 1 | Total
---|---|---|---
<30 | 371 (99.2%) | 3 (0.8%) | 374 (100.0%)
30–40 | 2232 (96.8%) | 74 (3.2%) | 2306 (100.0%)
40–55 | 5507 (96.5%) | 197 (3.5%) | 5704 (100.0%)
>55 | 1205 (96.2%) | 48 (3.8%) | 1253 (100.0%)
Total | 9315 (96.7%) | 322 (3.3%) | 9637 (100.0%)
This effect is even more pronounced when comparing the false positives: here the oldest group has a 162% higher probability of getting this advantage.
MSTATUS | FN1 = 0 | FN1 = 1 | Total
---|---|---|---
Yes | 4120 (70.8%) | 1700 (29.2%) | 5820 (100.0%)
No | 2475 (64.8%) | 1342 (35.2%) | 3817 (100.0%)
Total | 6595 (68.4%) | 3042 (31.6%) | 9637 (100.0%)

MSTATUS | FP1 = 0 | FP1 = 1 | Total
---|---|---|---
Yes | 5620 (96.6%) | 200 (3.4%) | 5820 (100.0%)
No | 3695 (96.8%) | 122 (3.2%) | 3817 (100.0%)
Total | 9315 (96.7%) | 322 (3.3%) | 9637 (100.0%)
While there are differences for marital status, they are not big and do not indicate significant bias.
The first consideration is which features we regard as protected. If we consider AGE, GENDER, MSTATUS, etc. as protected features, then we conclude that there are two types of cases:
The protected features \(S_j\) are chosen to be the ones mentioned above, or any variables that are too strongly correlated with those (we skip this last check here).
That leaves us the following \(X_i\):
frm <- formula(isGood ~ TRAVTIME + CAR_USE + TIF + CAR_TYPE + OLDCLAIM + REVOKED + MVR_PTS + URBANICITY + YOJ)
m2 <- glm(frm, d_fact, family = 'binomial')
summary(m2)
##
## Call:
## glm(formula = frm, family = "binomial", data = d_fact)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.53254 0.12742 4.179 2.92e-05 ***
## TRAVTIME25--40 -0.24938 0.06428 -3.880 0.000105 ***
## TRAVTIME>40 -0.55331 0.06525 -8.480 < 2e-16 ***
## CAR_USEPrivate 0.90615 0.06017 15.061 < 2e-16 ***
## TIF2--5 0.21861 0.06999 3.123 0.001788 **
## TIF6--8 0.39730 0.06845 5.804 6.46e-09 ***
## TIF>8 0.54335 0.07216 7.530 5.09e-14 ***
## CAR_TYPEPickup -0.52777 0.08414 -6.273 3.55e-10 ***
## CAR_TYPEPTruck_Van -0.10433 0.09246 -1.128 0.259193
## CAR_TYPESports_Car -0.97359 0.08881 -10.963 < 2e-16 ***
## CAR_TYPEz_SUV -0.72451 0.07177 -10.094 < 2e-16 ***
## OLDCLAIM>0 -0.57691 0.05754 -10.027 < 2e-16 ***
## REVOKEDYes -0.78603 0.07020 -11.197 < 2e-16 ***
## MVR_PTS1 -0.20309 0.08053 -2.522 0.011671 *
## MVR_PTS2 -0.22278 0.08485 -2.626 0.008649 **
## MVR_PTS3--4 -0.37063 0.07411 -5.001 5.69e-07 ***
## MVR_PTS>4 -0.65506 0.08222 -7.967 1.62e-15 ***
## URBANICITYz_Highly_Rural/ Rural 1.96552 0.10001 19.653 < 2e-16 ***
## YOJ>1 0.78966 0.08681 9.096 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 9286.4 on 9618 degrees of freedom
## AIC: 9324.4
##
## Number of Fisher Scoring iterations: 5
All coefficients are meaningful.
AUC2 <- fAUC(m2, d_fact, type="response")
pred2 <- prediction(predict(m2, type = "response"), d_fact$isGood)
perf2 <- performance(pred2, "tpr", "fpr")
#predScores <- modelr::add_predictions(df, m)$pred # not correct?
predScores <- predict(m2, type = "response")
The AUC of model \(m_2\) is: 0.775. Compare this to the AUC of \(m_1\): 0.817.
# The optimal cutoff if the false positives cost 3 times more than the false negatives:
cutoff_vals <- opt_cut_off(perf2, pred2, cost.fp = 3)
cutoff_vals
## [,1]
## sensitivity 0.4454352
## specificity 0.9016330
## cutoff 0.8443056
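Before computing false negatives and false positives we need binary predictions for \(m_2\); these are derived from the cutoff exactly as for \(m_1\):
cutoff2 <- cutoff_vals[3,1]
# Binary predictions based on that cutoff:
bin_pred2 <- if_else(pred2@predictions[[1]] > cutoff2, 1, 0)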
We accept around 35% of the customers. However, we remind ourselves that the data has been re-balanced (probably rows of good customers were deleted at random), so it is impossible to say what percentage of customers we would actually reject.
First we calculate false negatives and false positives for further use:
FN2 <- if_else(bin_pred2 == 0 & d_fact$isGood == 1, 1, 0)
FP2 <- if_else(bin_pred2 == 1 & d_fact$isGood == 0, 1, 0)
In the data men and women were roughly equally represented and in the model we did not use gender.
The following table illustrates Equal Opportunity: compare False Negative Rates:
GENDER | FN2 = 0 | FN2 = 1 | Total
---|---|---|---
M | 2612 (61.7%) | 1618 (38.3%) | 4230 (100.0%)
F | 3098 (57.3%) | 2309 (42.7%) | 5407 (100.0%)
Total | 5710 (59.3%) | 3927 (40.7%) | 9637 (100.0%)
Men now have roughly 10% fewer false negatives. That means that we will unjustly reject roughly 10% more women than men. Note that in model \(m_1\) this difference was only about 6%.
Predictive Equality: compare False Positive Rates
GENDER | FP2 = 0 | FP2 = 1 | Total
---|---|---|---
M | 4089 (96.7%) | 141 (3.3%) | 4230 (100.0%)
F | 5296 (97.9%) | 111 (2.1%) | 5407 (100.0%)
Total | 9385 (97.4%) | 252 (2.6%) | 9637 (100.0%)
The false positives for men are also significantly higher than for women now.
The variable AGE has, just like GENDER, not been used in this model. We nevertheless check the demographic parity for younger and older people.
AGE | bin_pred2 = 0 | bin_pred2 = 1 | Total
---|---|---|---
<30 | 286 (76.5%) | 88 (23.5%) | 374 (100.0%)
30–40 | 1549 (67.2%) | 757 (32.8%) | 2306 (100.0%)
40–55 | 3578 (62.7%) | 2126 (37.3%) | 5704 (100.0%)
>55 | 834 (66.6%) | 419 (33.4%) | 1253 (100.0%)
Total | 6247 (64.8%) | 3390 (35.2%) | 9637 (100.0%)
While we left out the variable AGE, we notice that there are still differences in the probability of being accepted for different age groups. This is due to the correlation between age group and the features that were used in the model.
Despite the fact that younger people have had less chance to build up a five-year negative track record, they still get lower acceptance rates.
The following table illustrates Equal Opportunity: compare False Negative Rates:
AGE | FN2 = 0 | FN2 = 1 | Total
---|---|---|---
<30 | 271 (72.5%) | 103 (27.5%) | 374 (100.0%)
30–40 | 1445 (62.7%) | 861 (37.3%) | 2306 (100.0%)
40–55 | 3260 (57.2%) | 2444 (42.8%) | 5704 (100.0%)
>55 | 734 (58.6%) | 519 (41.4%) | 1253 (100.0%)
Total | 5710 (59.3%) | 3927 (40.7%) | 9637 (100.0%)
Younger people are discriminated against compared to older people. The difference in the false negative rates is up from 5% to 13%.
Predictive Equality: compare False Positive Rates
AGE | FP2 = 0 | FP2 = 1 | Total
---|---|---|---
<30 | 365 (97.6%) | 9 (2.4%) | 374 (100.0%)
30–40 | 2221 (96.3%) | 85 (3.7%) | 2306 (100.0%)
40–55 | 5580 (97.8%) | 124 (2.2%) | 5704 (100.0%)
>55 | 1219 (97.3%) | 34 (2.7%) | 1253 (100.0%)
Total | 9385 (97.4%) | 252 (2.6%) | 9637 (100.0%)
Comparing the false positives, the oldest group has an 8% higher probability of getting this advantage. That difference is much lower than in model \(m_1\).
MSTATUS | FN2 = 0 | FN2 = 1 | Total
---|---|---|---
Yes | 3301 (56.7%) | 2519 (43.3%) | 5820 (100.0%)
No | 2409 (63.1%) | 1408 (36.9%) | 3817 (100.0%)
Total | 5710 (59.3%) | 3927 (40.7%) | 9637 (100.0%)

MSTATUS | FP2 = 0 | FP2 = 1 | Total
---|---|---|---
Yes | 5703 (98.0%) | 117 (2.0%) | 5820 (100.0%)
No | 3682 (96.5%) | 135 (3.5%) | 3817 (100.0%)
Total | 9385 (97.4%) | 252 (2.6%) | 9637 (100.0%)
While there are differences for marital status, they are not big and do not indicate significant bias.
Leaving out variables does not solve all problems.
We choose to address the unfair treatment of young people.
Young people are more likely to be misjudged by the model because there are fewer young people in the data. A simple and powerful solution could be to weight the data; note that this technique can only address one bias at a time.
Therefore we will duplicate the young-driver rows 5-fold (N in the code below).
N <- 5                                # 5-fold duplication of the young-driver rows
x <- d_fact %>% filter(AGE == "<30")  # the under-represented group
df <- d_fact
for (k in 1:N) {df <- add_row(df, x)} # append the young-driver rows N times
frm <- formula(isGood ~ TRAVTIME + CAR_USE + TIF + CAR_TYPE + OLDCLAIM + REVOKED + MVR_PTS + URBANICITY + YOJ)
m2_weighted <- glm(frm, df, family = 'binomial')
summary(m2_weighted)
##
## Call:
## glm(formula = frm, family = "binomial", data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.48141 0.10860 4.433 9.30e-06 ***
## TRAVTIME25--40 -0.22477 0.05635 -3.989 6.64e-05 ***
## TRAVTIME>40 -0.47744 0.05768 -8.277 < 2e-16 ***
## CAR_USEPrivate 0.94279 0.05323 17.712 < 2e-16 ***
## TIF2--5 0.20331 0.06292 3.231 0.00123 **
## TIF6--8 0.34347 0.06079 5.650 1.61e-08 ***
## TIF>8 0.46414 0.06382 7.272 3.53e-13 ***
## CAR_TYPEPickup -0.54513 0.07404 -7.363 1.80e-13 ***
## CAR_TYPEPTruck_Van -0.15890 0.08244 -1.928 0.05391 .
## CAR_TYPESports_Car -1.10351 0.07803 -14.142 < 2e-16 ***
## CAR_TYPEz_SUV -0.78235 0.06288 -12.441 < 2e-16 ***
## OLDCLAIM>0 -0.53347 0.05106 -10.448 < 2e-16 ***
## REVOKEDYes -0.71190 0.06336 -11.236 < 2e-16 ***
## MVR_PTS1 -0.28167 0.07200 -3.912 9.15e-05 ***
## MVR_PTS2 -0.27150 0.07594 -3.575 0.00035 ***
## MVR_PTS3--4 -0.48838 0.06540 -7.468 8.14e-14 ***
## MVR_PTS>4 -0.80067 0.07204 -11.114 < 2e-16 ***
## URBANICITYz_Highly_Rural/ Rural 2.11314 0.08884 23.785 < 2e-16 ***
## YOJ>1 0.69502 0.07444 9.337 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14191 on 11506 degrees of freedom
## Residual deviance: 11628 on 11488 degrees of freedom
## AIC: 11666
##
## Number of Fisher Scoring iterations: 5
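Duplicating rows is equivalent to giving each young-driver row an integer case weight. As a sketch (the object names w and m2_w_alt are assumptions), the same re-weighting could be obtained with the weights argument of glm():
# Weight the rows with AGE == "<30" by N + 1 instead of physically duplicating them:
w <- if_else(d_fact$AGE == "<30", N + 1, 1)
m2_w_alt <- glm(frm, d_fact, family = 'binomial', weights = w)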
AUC2_w <- fAUC(m2_weighted, d_fact, type="response")
pred2_w <- prediction(predict(m2_weighted, type = "response"), df$isGood)
perf2_w <- performance(pred2_w, "tpr", "fpr")
#predScores <- modelr::add_predictions(df, m)$pred # not correct?
predScores <- predict(m2_weighted, type = "response")
The AUC of model \(m2_{weighted}\) is: 0.774. Compare this to model 2: 0.775: this is hardly lower (remember also the AUC of \(m_1\): 0.817).
# The optimal cutoff if the false positives cost 3 times more than the false negatives:
cutoff_vals <- opt_cut_off(perf2_w, pred2_w, cost.fp = 3)
cutoff_vals
## [,1]
## sensitivity 0.4766144
## specificity 0.8972254
## cutoff 0.8067735
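Again we derive binary predictions from this cutoff (note that pred2_w, and hence bin_pred2_w, refers to the re-weighted dataset df):
cutoff2_w <- cutoff_vals[3,1]
# Binary predictions based on that cutoff:
bin_pred2_w <- if_else(pred2_w@predictions[[1]] > cutoff2_w, 1, 0)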
First we calculate false negatives and false positives for further use:
FN2_w <- if_else(bin_pred2_w == 0 & df$isGood == 1, 1, 0)
FP2_w <- if_else(bin_pred2_w == 1 & df$isGood == 0, 1, 0)
In the data men and women were roughly equally represented and in the model we did not use gender.
The following table illustrates Equal Opportunity: compare False Negative Rates:
GENDER | FN2_w = 0 | FN2_w = 1 | Total
---|---|---|---
M | 3326 (66.8%) | 1654 (33.2%) | 4980 (100.0%)
F | 4003 (61.3%) | 2524 (38.7%) | 6527 (100.0%)
Total | 7329 (63.7%) | 4178 (36.3%) | 11507 (100.0%)
Now females have about a 16.43% higher probability of getting a false negative than males. That is a significant difference.
Predictive Equality: compare False Positive Rates
GENDER | FP2_w = 0 | FP2_w = 1 | Total
---|---|---|---
M | 4778 (95.9%) | 202 (4.1%) | 4980 (100.0%)
F | 6366 (97.5%) | 161 (2.5%) | 6527 (100.0%)
Total | 11144 (96.8%) | 363 (3.2%) | 11507 (100.0%)
The false positives for men are now also significantly higher than for women: women are 39.19% less likely to get this advantage.
The variable AGE has, just like GENDER, not been used in this model. We nevertheless check the demographic parity for younger and older people.
AGE | bin_pred2_w = 0 | bin_pred2_w = 1 | Total
---|---|---|---
<30 | 1650 (73.5%) | 594 (26.5%) | 2244 (100.0%)
30–40 | 1497 (64.9%) | 809 (35.1%) | 2306 (100.0%)
40–55 | 3397 (59.6%) | 2307 (40.4%) | 5704 (100.0%)
>55 | 803 (64.1%) | 450 (35.9%) | 1253 (100.0%)
Total | 7347 (63.8%) | 4160 (36.2%) | 11507 (100.0%)
While we left out the variable AGE, we notice that there are still differences in the probability of being accepted for different age groups. This is due to the correlation between age group and the features that were used in the model.
Despite the fact that younger people have had less chance to build up a five-year negative track record, they still get lower acceptance rates.
The following table illustrates Equal Opportunity: compare False Negative Rates:
AGE | FN2_w = 0 | FN2_w = 1 | Total
---|---|---|---
<30 | 1668 (74.3%) | 576 (25.7%) | 2244 (100.0%)
30–40 | 1487 (64.5%) | 819 (35.5%) | 2306 (100.0%)
40–55 | 3414 (59.9%) | 2290 (40.1%) | 5704 (100.0%)
>55 | 760 (60.7%) | 493 (39.3%) | 1253 (100.0%)
Total | 7329 (63.7%) | 4178 (36.3%) | 11507 (100.0%)
Now, younger people are no longer discriminated against.
Predictive Equality: compare False Positive Rates
AGE | FP2_w = 0 | FP2_w = 1 | Total
---|---|---|---
<30 | 2166 (96.5%) | 78 (3.5%) | 2244 (100.0%)
30–40 | 2211 (95.9%) | 95 (4.1%) | 2306 (100.0%)
40–55 | 5553 (97.4%) | 151 (2.6%) | 5704 (100.0%)
>55 | 1214 (96.9%) | 39 (3.1%) | 1253 (100.0%)
Total | 11144 (96.8%) | 363 (3.2%) | 11507 (100.0%)
The false positives for young people are now in line with the other groups.
MSTATUS | FN2_w = 0 | FN2_w = 1 | Total
---|---|---|---
Yes | 4101 (61.0%) | 2624 (39.0%) | 6725 (100.0%)
No | 3228 (67.5%) | 1554 (32.5%) | 4782 (100.0%)
Total | 7329 (63.7%) | 4178 (36.3%) | 11507 (100.0%)

MSTATUS | FP2_w = 0 | FP2_w = 1 | Total
---|---|---|---
Yes | 6570 (97.7%) | 155 (2.3%) | 6725 (100.0%)
No | 4574 (95.7%) | 208 (4.3%) | 4782 (100.0%)
Total | 11144 (96.8%) | 363 (3.2%) | 11507 (100.0%)
The weighting also introduced a significant bias against married people.
ACF models have the following characteristics: the protected features are not used directly in the final model, and the remaining variables are first stripped of the information they share with the protected features.
An ACF (additive counterfactual) model is built as follows: for each allowed variable \(X_i\), regress \(X_i\) on the protected features \(S_j\), keep only the (Pearson) residuals, and fit the final acceptance model on those residuals instead of on the raw \(X_i\).
The chosen protected features \(S_j\) are those listed in the vector S below, or any variables that are too strongly correlated with those (we skip this last check here).
That leaves us the following \(X_i\):
d_bin$isGood <- 1 - d_bin$CLAIM_FLAG
d_bin <- select(d_bin, -c(CLAIM_FLAG))
S <- c("HOMEKIDS.1","HOMEKIDS.>=2",
"PARENT1.Yes",
"MSTATUS.No",
"GENDER.M",
"EDUCATION.<High_School", "EDUCATION.Bachelors", "EDUCATION.Masters", "EDUCATION.PhD",
"OCCUPATION.Home_Maker", "OCCUPATION.Lawyer", "OCCUPATION.Professional", "OCCUPATION.senior", "OCCUPATION.z_Blue_Collar",
"AGE.<30", "AGE.30--40", "AGE.>55",
"HOME_VAL.100--200K", "HOME_VAL.200--300K", "HOME_VAL.>300K", "BLUEBOOK.10--15K", "BLUEBOOK.15--20K", "BLUEBOOK.>20K" )
X <- c("TRAVTIME.<25", "TRAVTIME.>40",
"CAR_USE.Commercial",
"TIF.2--5", "TIF.6--8", "TIF.>8",
"CAR_TYPE.Minivan", "CAR_TYPE.Pickup", "CAR_TYPE.PTruck_Van", "CAR_TYPE.Sports_Car",
#"OLDCLAIM.>0", # <-- this is too much correlated to CLM_FREQ (and gets no stars)
"CLM_FREQ.1", "CLM_FREQ.2", "CLM_FREQ.>2",
"REVOKED.Yes",
"MVR_PTS.1", "MVR_PTS.2", "MVR_PTS.3--4", "MVR_PTS.>4",
"URBANICITY.z_Highly_Rural/ Rural", "YOJ.<=1")
#not_used_in_this_model <- c("RED_CAR.yes","KIDSDRV.1", "KIDSDRV.>=2","GENDER.AGE.F.<30", "GENDER.AGE.F.>55", "GENDER.AGE.F.30--40", "GENDER.AGE.M.<30", "GENDER.AGE.M.>55", "GENDER.AGE.M.30--40", "GENDER.AGE.M.40--55", "INCOME.<25K", "INCOME.55--75K", "INCOME.>75K","CAR_AGE.2--7", "CAR_AGE.8--11","CAR_AGE.12--15", "CAR_AGE.>=16","CLAIM_FLAG")
X <- str_clean(X)
S <- str_clean(S)
colnames(d_bin) <- str_clean(colnames(d_bin))
rhs <- paste(S, collapse="` + `")
rhs <- paste0("`", rhs, "`")
d_acf <- data.frame(isGood = d_bin$isGood) # the new dataset that will be used for the model
m_acf_vars <- list()
for (x in X) {
  # Model the allowed variable x as a function of the protected features S:
  frm <- formula(paste0("`", x, "` ~ ", rhs))
  m_acf_vars[[x]] <- glm(frm, d_bin, family = "binomial")
  x_predictions <- predict(m_acf_vars[[x]], d_bin, type = "response")
  # Keep only the part of x that is not explained by S (the Pearson residual):
  pearson_residual <- (d_bin[[x]] - x_predictions) / sqrt(x_predictions * (1 - x_predictions))
  d_acf <- cbind(d_acf, pearson_residual)
  colnames(d_acf)[ncol(d_acf)] <- x
}
# Fit the model based on the residuals (stored in d_acf):
frm <- paste("isGood ~", paste(X, collapse = " + "))
m_acf <- glm(frm, d_acf, family = "quasibinomial")
AUC_acf <- fAUC(m_acf, d_acf, type="response")
summary(m_acf)
##
## Call:
## glm(formula = frm, family = "quasibinomial", data = d_acf)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.34803 0.03500 38.517 < 2e-16 ***
## TRAVTIME.lt.25 0.11657 0.03102 3.758 0.000172 ***
## TRAVTIME.gt.40 -0.12425 0.03047 -4.077 4.59e-05 ***
## CAR_USE.Commercial -0.22571 0.02986 -7.560 4.41e-14 ***
## TIF.2.to.5 0.08461 0.03078 2.749 0.005990 **
## TIF.6.to.8 0.16851 0.03152 5.346 9.18e-08 ***
## TIF.gt.8 0.22848 0.03218 7.101 1.33e-12 ***
## CAR_TYPE.Minivan 0.24686 0.03405 7.249 4.53e-13 ***
## CAR_TYPE.Pickup 0.05850 0.02579 2.268 0.023351 *
## CAR_TYPE.PTruck_Van 0.07747 0.04786 1.619 0.105520
## CAR_TYPE.Sports_Car -0.04872 0.02922 -1.668 0.095417 .
## CLM_FREQ.1 -0.11806 0.02755 -4.285 1.85e-05 ***
## CLM_FREQ.2 -0.12679 0.02757 -4.599 4.29e-06 ***
## CLM_FREQ.gt.2 -0.15388 0.02703 -5.693 1.28e-08 ***
## REVOKED.Yes -0.20843 0.02469 -8.443 < 2e-16 ***
## MVR_PTS.1 -0.04941 0.02987 -1.654 0.098084 .
## MVR_PTS.2 -0.07424 0.02887 -2.572 0.010140 *
## MVR_PTS.3.to.4 -0.10812 0.02935 -3.684 0.000231 ***
## MVR_PTS.gt.4 -0.15361 0.02896 -5.303 1.16e-07 ***
## URBANICITY.z_Highly_Rural.or.Rural 1.14356 0.05592 20.449 < 2e-16 ***
## YOJ.le.1 -0.04118 0.02765 -1.489 0.136422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.186)
##
## Null deviance: 11181.6 on 9636 degrees of freedom
## Residual deviance: 9531.6 on 9616 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
The model performs somewhat worse (lower AUC), but is still acceptable.
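To score new applicants with \(m_{acf}\), the same residualisation must be applied to their data first. A minimal sketch (the function name score_acf is an assumption; newdata must contain the same cleaned binary columns as d_bin):
score_acf <- function(newdata) {
  res <- data.frame(row.names = seq_len(nrow(newdata)))
  for (x in X) {
    # Residualise each allowed variable against the protected features:
    p <- predict(m_acf_vars[[x]], newdata, type = "response")
    res[[x]] <- (newdata[[x]] - p) / sqrt(p * (1 - p))
  }
  # Apply the final model to the residuals:
  predict(m_acf, res, type = "response")
}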
model | description | AUC
---|---|---
m1 | naive model, no protected features | 0.8171151
m2 | protected features not used | 0.7746110
m2_w | data re-weighted for AGE.<.30 | 0.7739774
m_acf | additive counter-factual model | 0.7630629
Instead of defining a single hard cutoff \(\theta\), we can define a “grey zone” around \(\theta\). Within that zone we give the favourable outcome to members of the disadvantaged group (even if their score is below the threshold) and the unfavourable outcome to members of the advantaged group (even if their score is above the threshold). This is known as the Reject Option Classifier.
However, it may feel like discrimination to the individuals that are selected to be given an advantage or a disadvantage (compare the trolley problem). Obviously, such techniques can only be applied if there is a good reason to assume bias, and only to reduce that bias.
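A minimal sketch of such a reject-option rule (all names are illustrative: score is the model score, theta the cutoff, margin the half-width of the grey zone, and disadvantaged a logical flag per customer):
reject_option <- function(score, theta, margin, disadvantaged) {
  # Outside the grey zone: apply the normal cutoff.
  decision <- if_else(score > theta, 1, 0)
  # Inside the grey zone: favour the disadvantaged group, disfavour the advantaged one.
  grey <- abs(score - theta) <= margin
  decision[grey & disadvantaged] <- 1
  decision[grey & !disadvantaged] <- 0
  decision
}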
Summary of the conclusion: You decide what is fair in your domain for your model, sector, and use case. You are biased, so make it a team effort.
In this document we used the workflow and code from: De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
We mainly used in this document: