This is part 1 of an analysis that demonstrates bias in data and the possibilities to de-bias it. In this document we show how to clean and prepare the data.
The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like:
The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature will result in a more desirable or a less desirable outcome. Leaving out the feature will make the model weaker and increase prices for all … hitting hardest the group that we set out to protect.
Workflow
The work-flow is inspired by: (De Brouwer 2020) and is prepared by Philippe De Brouwer to illustrate the concepts explained in the book.
We will approach the problem in a few phases, using

- CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed)
- CLM_AMT (the amount claimed) for those customers that are expected to have an accident

First we need to read in the data and some custom functions.
setwd("/home/philippe/Documents/courses/lectures/bias_data")

# Read in the data:
d0 = read_csv("./car_insurance_claim_data.csv")

# Read in the functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir = tmt.env)
## dollars_to_numeric : function (input)
## expand_factor2bin : function (data, col)
## fAUC : function (model, data, ...)
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
## make_WOE_table : function (df, y)
## opt_cut_off : function (perf, pred, cost.fp = 1)
## pct_table : function (x, y, round_digits = 2)
## space_to_underscore : function (input)
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)
The data is imported from Kaggle; the licence is unknown, and it is believed to be "public domain."
The data contains 10302 rows with 27 columns.
Each record (row) represents a set of attributes of an individual customer of an insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable CLAIM_FLAG
is 1 if the customer’s car was in a crash, and 0 if not. The continuous response variable CLM_AMT
defines the cost related to the car crash.
The variables of the data are: ID, KIDSDRIV, BIRTH, AGE, HOMEKIDS, YOJ, INCOME, PARENT1, HOME_VAL, MSTATUS, GENDER, EDUCATION, OCCUPATION, TRAVTIME, CAR_USE, BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, CLM_AMT, CAR_AGE, CLAIM_FLAG, URBANICITY.
For the purpose of this exercise we added RACE, which is supposed to represent a proxy for a small privileged class, a large underprivileged class, and a reference “race.” The labels of this feature are chosen in this order: A, B, and C.
The data dictionary enriched with some additional variables is as follows:
variable name | definition | prejudice or expectation |
---|---|---|
ID | unique identifier | none |
KIDSDRIV | number of teenager drivers | teenagers cause more accidents |
BIRTH | birth date | used to derive age |
AGE | age in years | young people cause more accidents, and older people too |
HOMEKIDS | nbr kids at home | more kids might be more distraction in the car |
YOJ | years on job | people that job-hop might be more prone to risk taking and accidents |
INCOME | annual income in USD | income correlates to reliability and responsibility |
PARENT1 | yes or no single parent | not clear |
HOME_VAL | home value in USD if owner | similar to INCOME |
MSTATUS | marital status “yes” if married | marriage might be a sign of stability and risk aversion(?) |
GENDER | sex | men cause more and more expensive accidents |
EDUCATION | level of the diploma | higher education correlates to safer driving |
OCCUPATION | categories of employment | white collar workers might drive safer |
TRAVTIME | distance to work (probably in minutes) | a longer distance translates into a higher probability of being involved in an accident |
CAR_USE | commercial or private use | commercial use might be more risky |
BLUEBOOK | resale value of the car | not clear |
TIF | time in force = the time with the same insurer (numeric, probably in years, minimum is “1”) | longer should be better |
CAR_TYPE | categorical | sports cars are more prone to accidents than minivans |
RED_CAR | “yes” if the car is red | urban legend says that red cars are more prone to accidents |
OLDCLAIM | total amount in USD of claims in the last 5 years | your past performance might be indicative, but note how this should interact with TIF |
CLM_FREQ | not claim frequency, but the number of claims in the last 5 years | past performance might be indicative |
REVOKED | “yes” if main driver’s licence was revoked during the last 7 years | licence being revoked should be indicative for driving style |
MVR_PTS | motor vehicle record points (number) | traffic tickets should be indicative for the driving style and hence propensity to be involved in an accident |
CLM_AMT | if last year in a car accident, the dollar amount of the claim paid by insurer | target variable |
CAR_AGE | age of car in years | one might assume that drivers of older cars are less prudent |
CLAIM_FLAG | 0 = NO, 1 = YES | target variable |
URBANICITY | categorical, 2 options | cities should be more dangerous |
We were not able to contact the data provider, and close observation made us assume that some variables have been manipulated to eliminate missing values. The missing values have then been assigned to some “assumption”: for example, in the variable EDUCATION
we have “<High School
” (lower than high school) and “z_High School
”. The latter seems to be the collection of missing values.
summarytools::ctable(d0$EDUCATION, d0$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
EDUCATION | ||||
<High School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) | |
Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) | |
Masters | 1674 (80.6%) | 404 (19.4%) | 2078 (100.0%) | |
PhD | 780 (83.5%) | 154 (16.5%) | 934 (100.0%) | |
z_High School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) | |
Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |
The table above of the EDUCATION
variable shows that the probability of being involved in an accident decreases with education. Therefore we must conclude that z_High_School
and <High_School
are not really logical here. The people with lower than high school are involved in accidents in 32.5% of the cases, and the z_High_School
group has an even higher probability to be in accidents (34.7%).
For this particular variable we might take the two classes with the highest probability to be involved in an accident together as <=High_School
. This is logically consistent, and leaves us with similar bucket sizes.
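A minimal sketch of such a merge (assuming the raw labels shown above are still in d0, and using the forcats package; EDUCATION_MERGED is a hypothetical column name chosen for illustration):

```r
# Hypothetical sketch: collapse the two high-school classes into one bucket.
library(forcats)
d0$EDUCATION_MERGED <- fct_collapse(
  factor(d0$EDUCATION),
  `<=High_School` = c("<High School", "z_High School"))
summary(d0$EDUCATION_MERGED)
```

The combined bucket would then hold 1515 + 2952 = 4467 observations, comparable in size to the other levels.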
Therefore we assume that this unfortunate data manipulation was also done in other variables. For example, the variable GENDER
seems to have mixed the missing values with the females. Conclusions based on small differences hence will not be reliable.
summarytools::ctable(d0$GENDER, d0$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
GENDER | ||||
M | 3539 (74.4%) | 1218 (25.6%) | 4757 (100.0%) | |
z_F | 4017 (72.4%) | 1528 (27.6%) | 5545 (100.0%) | |
Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |
These considerations cast serious doubt on the reliability of the data.
Issues to consider for data quality:

- missing values appear to have been merged into catch-all levels (the “z_” labels),
- the provenance and licence of the data are unknown,
- some values are implausible (e.g. a negative CAR_AGE).
# Preprocess the variables into numeric / factors as necessary
# dollar values are in the format "$20,540" (string) -> convert this:
df = as.tbl(d0) %>%
  mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM", "CLM_AMT"),
            parse_number) %>%
  mutate_at(c("EDUCATION", "OCCUPATION", "CAR_TYPE", "URBANICITY"),
            space_to_underscore) %>%
  mutate_at(c("PARENT1", "MSTATUS", "GENDER", "EDUCATION", "OCCUPATION",
              "CAR_USE", "CAR_TYPE", "RED_CAR", "REVOKED", "URBANICITY"),
            as.factor) %>%
  mutate(CLAIM_FLAG = as.factor(CLAIM_FLAG))
We also notice that

- the ID column will never yield useful information,
- the BIRTH data is the same as AGE, but in a less usable format.

Therefore, we will remove those columns before moving forward.
df <- df %>% dplyr::select(-c(ID, BIRTH))
The univariate summaries for the individual variables (after some cleaning) are provided below.
summarytools::dfSummary(df,
                        plain.ascii  = FALSE,
                        style        = 'grid',
                        graph.magnif = 0.85,
                        varnumbers   = FALSE,
                        valid.col    = FALSE,
                        tmp.img.dir  = "/tmp")
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
---|---|---|---|---|
KIDSDRIV [numeric] |
Mean (sd) : 0.2 (0.5) min < med < max: 0 < 0 < 4 IQR (CV) : 0 (3) |
0 : 9069 (88.0%) 1 : 804 ( 7.8%) 2 : 351 ( 3.4%) 3 : 74 ( 0.7%) 4 : 4 ( 0.0%) |
0 (0.0%) |
|
AGE [numeric] |
Mean (sd) : 44.8 (8.6) min < med < max: 16 < 45 < 81 IQR (CV) : 12 (0.2) |
61 distinct values | 7 (0.1%) |
|
HOMEKIDS [numeric] |
Mean (sd) : 0.7 (1.1) min < med < max: 0 < 0 < 5 IQR (CV) : 1 (1.5) |
0 : 6694 (65.0%) 1 : 1106 (10.7%) 2 : 1427 (13.9%) 3 : 856 ( 8.3%) 4 : 201 ( 2.0%) 5 : 18 ( 0.2%) |
0 (0.0%) |
|
YOJ [numeric] |
Mean (sd) : 10.5 (4.1) min < med < max: 0 < 11 < 23 IQR (CV) : 4 (0.4) |
21 distinct values | 548 (5.3%) |
|
INCOME [numeric] |
Mean (sd) : 61572.1 (47457.2) min < med < max: 0 < 53529 < 367030 IQR (CV) : 58582 (0.8) |
8151 distinct values | 570 (5.5%) |
|
PARENT1 [factor] |
1. No 2. Yes |
8959 (87.0%) 1343 (13.0%) |
0 (0.0%) |
|
HOME_VAL [numeric] |
Mean (sd) : 154523 (129188.4) min < med < max: 0 < 160661 < 885282 IQR (CV) : 238256 (0.8) |
6334 distinct values | 575 (5.6%) |
|
MSTATUS [factor] |
1. Yes 2. z_No |
6188 (60.1%) 4114 (39.9%) |
0 (0.0%) |
|
GENDER [factor] |
1. M 2. z_F |
4757 (46.2%) 5545 (53.8%) |
0 (0.0%) |
|
EDUCATION [factor] |
1. <High_School 2. Bachelors 3. Masters 4. PhD 5. z_High_School |
1515 (14.7%) 2823 (27.4%) 2078 (20.2%) 934 ( 9.1%) 2952 (28.7%) |
0 (0.0%) |
|
OCCUPATION [factor] |
1. Clerical 2. Doctor 3. Home_Maker 4. Lawyer 5. Manager 6. Professional 7. Student 8. z_Blue_Collar |
1590 (16.5%) 321 ( 3.3%) 843 ( 8.7%) 1031 (10.7%) 1257 (13.0%) 1408 (14.6%) 899 ( 9.3%) 2288 (23.7%) |
665 (6.5%) |
|
TRAVTIME [numeric] |
Mean (sd) : 33.4 (15.9) min < med < max: 5 < 33 < 142 IQR (CV) : 22 (0.5) |
100 distinct values | 0 (0.0%) |
|
CAR_USE [factor] |
1. Commercial 2. Private |
3789 (36.8%) 6513 (63.2%) |
0 (0.0%) |
|
BLUEBOOK [numeric] |
Mean (sd) : 15659.9 (8428.8) min < med < max: 1500 < 14400 < 69740 IQR (CV) : 11690 (0.5) |
2985 distinct values | 0 (0.0%) |
|
TIF [numeric] |
Mean (sd) : 5.3 (4.1) min < med < max: 1 < 4 < 25 IQR (CV) : 6 (0.8) |
23 distinct values | 0 (0.0%) |
|
CAR_TYPE [factor] |
1. Minivan 2. Panel_Truck 3. Pickup 4. Sports_Car 5. Van 6. z_SUV |
2694 (26.2%) 853 ( 8.3%) 1772 (17.2%) 1179 (11.4%) 921 ( 8.9%) 2883 (28.0%) |
0 (0.0%) |
|
RED_CAR [factor] |
1. no 2. yes |
7326 (71.1%) 2976 (28.9%) |
0 (0.0%) |
|
OLDCLAIM [numeric] |
Mean (sd) : 4034 (8733.1) min < med < max: 0 < 0 < 57037 IQR (CV) : 4647.5 (2.2) |
3545 distinct values | 0 (0.0%) |
|
CLM_FREQ [numeric] |
Mean (sd) : 0.8 (1.2) min < med < max: 0 < 0 < 5 IQR (CV) : 2 (1.4) |
0 : 6292 (61.1%) 1 : 1279 (12.4%) 2 : 1492 (14.5%) 3 : 992 ( 9.6%) 4 : 225 ( 2.2%) 5 : 22 ( 0.2%) |
0 (0.0%) |
|
REVOKED [factor] |
1. No 2. Yes |
9041 (87.8%) 1261 (12.2%) |
0 (0.0%) |
|
MVR_PTS [numeric] |
Mean (sd) : 1.7 (2.2) min < med < max: 0 < 1 < 13 IQR (CV) : 3 (1.3) |
14 distinct values | 0 (0.0%) |
|
CLM_AMT [numeric] |
Mean (sd) : 1511.3 (4725.2) min < med < max: 0 < 0 < 123247 IQR (CV) : 1144.8 (3.1) |
2346 distinct values | 0 (0.0%) |
|
CAR_AGE [numeric] |
Mean (sd) : 8.3 (5.7) min < med < max: -3 < 8 < 28 IQR (CV) : 11 (0.7) |
30 distinct values | 639 (6.2%) |
|
CLAIM_FLAG [factor] |
1. 0 2. 1 |
7556 (73.3%) 2746 (26.7%) |
0 (0.0%) |
|
URBANICITY [factor] |
1. Highly_Urban/ Urban 2. z_Highly_Rural/ Rural |
8230 (79.9%) 2072 (20.1%) |
0 (0.0%) |
DataExplorer::plot_missing(d0, title = "Percentage of missing data per variable")
naniar::gg_miss_upset(d0)
The correlation structure and amplitude of the missing values are illustrated in Figure .
The number of missing values in the variable OCCUPATION
is 665. So, 6.46% is missing.
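This share can be verified directly on the cleaned data (a simple sketch, assuming df as prepared above):

```r
# Count and percentage of missing values in OCCUPATION:
sum(is.na(df$OCCUPATION))                   # 665
round(100 * mean(is.na(df$OCCUPATION)), 2)  # 6.46
```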
The figure also indicates that the data is not missing at random (MAR)1.
We prefer to keep rows with missing data: instead of removing all incomplete observations, we impute the missing values with the predictive mean matching approach implemented in the mice package (Multiple Imputation by Chained Equations; this method assumes missing at random). Other methods, such as a random forest (e.g. via the missForest
package in R
), might provide more convincing results, but they run too slowly.
# Impute the columns with missing values
# install.packages("mice")
#library(mice)
df_missing_cols = df[, c("AGE", "INCOME", "YOJ", "HOME_VAL", "CAR_AGE")]
df_imp_cols_tmp = mice::mice(data = df_missing_cols, m = 1, method = "pmm", maxit = 50, seed = 500)
df_imp_cols     = mice::complete(df_imp_cols_tmp)
df_imp          = bind_cols(df %>% select(-AGE, -INCOME, -YOJ, -HOME_VAL, -CAR_AGE), df_imp_cols)

# For the rest of the analysis we use the imputed data
df <- df_imp[complete.cases(df_imp), ]
Here is how we could use missForest:
# Impute missing values via random forest estimates:
# ---
library(missForest)  # impute missing values
# Note: this fails:
# d1_imp <- missForest(df, )$ximp
# While everything that worked for a data-frame will also work
# for a tibble, this is an exception, so we have 2 solutions:
# 1. coerce to data-frame
d_imp <- missForest(as.data.frame(df))$ximp

# 2. Manually fit the random forest and impute the missing data:
x_cc <- df %>% filter(!is.na(df$OCCUPATION))
x_nc <- df %>% filter(is.na(df$OCCUPATION))
rf   <- randomForest::randomForest(OCCUPATION ~ ., data = x_cc)
x_nc$OCCUPATION <- predict(rf, x_nc)
x    <- rbind(x_cc, x_nc)
d_cont <- data.frame()[1:nrow(df), ]
n      <- 1
for (col in colnames(df)) {
  if (!(is.factor(df[[col]]))) {
    d_cont <- cbind(d_cont, df[[col]])
    colnames(d_cont)[n] <- gsub("\"", "", deparse(col))
    n <- n + 1
  }
}
d_cont <- mutate(d_cont, CLAIM_FLAG = as.numeric(paste(df$CLAIM_FLAG)))
d_cont <- d_cont[complete.cases(d_cont), ]
# Use the library GGally to explore the correlation structure
# ---
GGally::ggcorr(d_cont, method = c("everything", "pearson"),
               label = TRUE, label_round = 2, label_size = 3,
               size = 2.5, geom = "circle")
WOE_tbl <- make_WOE_table(df, 'CLAIM_FLAG')
# show the best IVs:
knitr::kable(WOE_tbl[order(WOE_tbl$IV, decreasing = TRUE), ],
             caption = 'The table of all information values
                        for each categorical variable ordered in
                        decreasing order. We will work with the ones
                        that have an information value above $0.1$.')
varName | IV | |
---|---|---|
10 | URBANICITY | 0.4189352 |
5 | OCCUPATION | 0.2031089 |
4 | EDUCATION | 0.1477226 |
1 | PARENT1 | 0.1117093 |
7 | CAR_TYPE | 0.1110225 |
6 | CAR_USE | 0.1108089 |
9 | REVOKED | 0.1069475 |
2 | MSTATUS | 0.0814126 |
3 | GENDER | 0.0029738 |
8 | RED_CAR | 0.0007407 |
(De Brouwer 2020, p. 359) suggests the following rule of thumb for these information values:
IV | predictability |
---|---|
< 0.02 | not predictable |
0.02 – 0.1 | weak |
0.1 – 0.3 | medium |
0.3 – 0.5 | strong |
> 0.5 | suspicious |
This implies that RED_CAR
, MSTATUS
, and GENDER
probably are not very predictive. At least we don’t expect a strong relationship.
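For reference, the information value behind these rules of thumb is the sum, over all levels of a predictor, of (share of non-events − share of events) times the weight of evidence. A sketch of that calculation (iv_manual is our own illustrative helper, not a function from the toolbox loaded above):

```r
# Illustrative IV computation for a categorical x and a binary target y
# (factor with levels "0" and "1", 1 = claim):
iv_manual <- function(x, y) {
  tab  <- table(x, y)
  good <- tab[, "0"] / sum(tab[, "0"])  # distribution of non-claims per level
  bad  <- tab[, "1"] / sum(tab[, "1"])  # distribution of claims per level
  woe  <- log(good / bad)               # weight of evidence per level
  sum((good - bad) * woe)               # information value
}
# e.g. iv_manual(df$GENDER, df$CLAIM_FLAG) should come out close to 0.003,
# matching the table above.
```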
d_fact <- df
First we observe the existing binning or levels.
InformationValue::IV(factor(df$KIDSDRIV), df$CLAIM_FLAG)
## [1] 0.05209009
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(df$KIDSDRIV, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
KIDSDRIV | ||||
0 | 6341 (75.0%) | 2112 (25.0%) | 8453 (100.0%) | |
1 | 482 (62.5%) | 289 (37.5%) | 771 (100.0%) | |
2 | 205 (60.8%) | 132 (39.2%) | 337 (100.0%) | |
3 | 35 (48.6%) | 37 (51.4%) | 72 (100.0%) | |
4 | 2 (50.0%) | 2 (50.0%) | 4 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
This allows us to make a decision on what bins to use. We create the bins so that each bin contains enough observations and the accident probabilities of the bins differ meaningfully.
Note that in the following sections we will leave out this exploratory phase and refer to the previous chapters about data exploration instead.
d_fact$KIDSDRV <- if_else(df$KIDSDRIV == 0, 'none',
                          if_else(df$KIDSDRIV == 1, '1', '>=2')) %>%
  factor(levels = c('none', '1', '>=2'))
InformationValue::IV(d_fact$KIDSDRV, df$CLAIM_FLAG)
## [1] 0.05019181
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$KIDSDRV, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
KIDSDRV | ||||
none | 6341 (75.0%) | 2112 (25.0%) | 8453 (100.0%) | |
1 | 482 (62.5%) | 289 (37.5%) | 771 (100.0%) | |
>=2 | 242 (58.6%) | 171 (41.4%) | 413 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$AGE <- if_else(df$AGE < 30, '<30',
                      if_else(df$AGE < 40, '30--40',
                              if_else(df$AGE < 55, '40--55', '>55'))) %>%
  factor(levels = c('<30', '30--40', '40--55', '>55'))
InformationValue::IV(d_fact$AGE, df$CLAIM_FLAG)
## [1] 0.1150198
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$AGE, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
AGE | ||||
<30 | 182 (48.7%) | 192 (51.3%) | 374 (100.0%) | |
30–40 | 1533 (66.5%) | 773 (33.5%) | 2306 (100.0%) | |
40–55 | 4446 (77.9%) | 1258 (22.1%) | 5704 (100.0%) | |
>55 | 904 (72.1%) | 349 (27.9%) | 1253 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$HOMEKIDS <- if_else(df$HOMEKIDS == 0, '0',
                           if_else(df$HOMEKIDS == 1, '1', '>=2')) %>%
  factor(levels = c('0', '1', '>=2'))
InformationValue::IV(d_fact$HOMEKIDS, df$CLAIM_FLAG)
## [1] 0.08915518
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$HOMEKIDS, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
HOMEKIDS | ||||
0 | 4810 (77.7%) | 1378 (22.3%) | 6188 (100.0%) | |
1 | 691 (65.8%) | 359 (34.2%) | 1050 (100.0%) | |
>=2 | 1564 (65.2%) | 835 (34.8%) | 2399 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$YOJ <- if_else(df$YOJ <= 1, '<=1', '>1') %>%
  factor(levels = c('<=1', '>1'))
InformationValue::IV(d_fact$YOJ, df$CLAIM_FLAG)
## [1] 0.03611412
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$YOJ, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
YOJ | ||||
<=1 | 519 (60.8%) | 335 (39.2%) | 854 (100.0%) | |
>1 | 6546 (74.5%) | 2237 (25.5%) | 8783 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$INCOME <- if_else(df$INCOME < 25000, '<25K',
                         if_else(df$INCOME < 55000, '25--55K',
                                 if_else(df$INCOME < 75000, '55--75K', '>75K'))) %>%
  factor(levels = c('<25K', '25--55K', '55--75K', '>75K'))
InformationValue::IV(d_fact$INCOME, df$CLAIM_FLAG)
## [1] 0.1238587
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$INCOME, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
INCOME | ||||
<25K | 1539 (65.9%) | 798 (34.1%) | 2337 (100.0%) | |
25–55K | 1996 (69.8%) | 865 (30.2%) | 2861 (100.0%) | |
55–75K | 1228 (73.8%) | 437 (26.2%) | 1665 (100.0%) | |
>75K | 2302 (83.0%) | 472 (17.0%) | 2774 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$PARENT1 <- df$PARENT1 %>%
  factor(levels = c('Yes', 'No'))
InformationValue::IV(d_fact$PARENT1, df$CLAIM_FLAG)
## [1] 0.1117093
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$PARENT1, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
PARENT1 | ||||
Yes | 713 (55.6%) | 569 (44.4%) | 1282 (100.0%) | |
No | 6352 (76.0%) | 2003 (24.0%) | 8355 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$HOME_VAL <- if_else(df$HOME_VAL < 100000, '<100K',
                           if_else(df$HOME_VAL < 200000, '100--200K',
                                   if_else(df$HOME_VAL < 300000, '200--300K', '>300K'))) %>%
  factor(levels = c('<100K', '100--200K', '200--300K', '>300K'))
InformationValue::IV(d_fact$HOME_VAL, df$CLAIM_FLAG)
## [1] 0.176499
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$HOME_VAL, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
HOME_VAL | ||||
<100K | 2174 (64.2%) | 1210 (35.8%) | 3384 (100.0%) | |
100–200K | 2093 (73.3%) | 763 (26.7%) | 2856 (100.0%) | |
200–300K | 1895 (80.0%) | 473 (20.0%) | 2368 (100.0%) | |
>300K | 903 (87.8%) | 126 (12.2%) | 1029 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$MSTATUS <- if_else(df$MSTATUS == 'Yes', 'Yes', 'No') %>%
  factor(levels = c('Yes', 'No'))
InformationValue::IV(d_fact$MSTATUS, df$CLAIM_FLAG)
## [1] 0.08141263
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$MSTATUS, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
MSTATUS | ||||
Yes | 4532 (77.9%) | 1288 (22.1%) | 5820 (100.0%) | |
No | 2533 (66.4%) | 1284 (33.6%) | 3817 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$GENDER <- if_else(df$GENDER == 'M', 'M', 'F') %>%
  factor(levels = c('M', 'F'))
InformationValue::IV(d_fact$GENDER, df$CLAIM_FLAG)
## [1] 0.002973753
## attr(,"howgood")
## [1] "Not Predictive"
summarytools::ctable(d_fact$GENDER, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
GENDER | ||||
M | 3152 (74.5%) | 1078 (25.5%) | 4230 (100.0%) | |
F | 3913 (72.4%) | 1494 (27.6%) | 5407 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
Gender is not predictive for the probability to be involved in an accident.
d_fact$EDUCATION <- factor(df$EDUCATION,
                           levels = c('<High_School', 'z_High_School', 'Bachelors', 'Masters', 'PhD'))
InformationValue::IV(d_fact$EDUCATION, df$CLAIM_FLAG)
## [1] 0.1477226
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$EDUCATION, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
EDUCATION | ||||
<High_School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) | |
z_High_School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) | |
Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) | |
Masters | 1371 (82.9%) | 283 (17.1%) | 1654 (100.0%) | |
PhD | 592 (85.4%) | 101 (14.6%) | 693 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
x <- paste(df$OCCUPATION)
d_fact$OCCUPATION <- if_else(x == 'Doctor' | x == 'Manager',
                             'senior',
                             if_else(x == 'Student' | x == 'Clerical',
                                     'junior', x)) %>%
  factor()
InformationValue::IV(d_fact$EDUCATION, df$CLAIM_FLAG)
## [1] 0.1477226
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$OCCUPATION, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
OCCUPATION | ||||
Home_Maker | 614 (72.8%) | 229 (27.2%) | 843 (100.0%) | |
junior | 1667 (67.0%) | 822 (33.0%) | 2489 (100.0%) | |
Lawyer | 845 (82.0%) | 186 (18.0%) | 1031 (100.0%) | |
Professional | 1083 (76.9%) | 325 (23.1%) | 1408 (100.0%) | |
senior | 1371 (86.9%) | 207 (13.1%) | 1578 (100.0%) | |
z_Blue_Collar | 1485 (64.9%) | 803 (35.1%) | 2288 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$TRAVTIME <- if_else(df$TRAVTIME < 25, '<25',
                           if_else(df$TRAVTIME < 40, '25--40', '>40')) %>%
  factor(levels = c('<25', '25--40', '>40'))
InformationValue::IV(d_fact$EDUCATION, df$CLAIM_FLAG)
## [1] 0.1477226
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$TRAVTIME, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
TRAVTIME | ||||
<25 | 2183 (76.8%) | 658 (23.2%) | 2841 (100.0%) | |
25–40 | 2591 (73.5%) | 933 (26.5%) | 3524 (100.0%) | |
>40 | 2291 (70.0%) | 981 (30.0%) | 3272 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$CAR_USE <- paste(df$CAR_USE) %>%
  factor()
InformationValue::IV(d_fact$CAR_USE, df$CLAIM_FLAG)
## [1] 0.1108089
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CAR_USE, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
CAR_USE | ||||
Commercial | 2042 (63.9%) | 1154 (36.1%) | 3196 (100.0%) | |
Private | 5023 (78.0%) | 1418 (22.0%) | 6441 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$BLUEBOOK <- if_else(df$BLUEBOOK < 10000, '<10K',
                           if_else(df$BLUEBOOK < 15000, '10--15K',
                                   if_else(df$BLUEBOOK < 20000, '15--20K', '>20K'))) %>%
  factor(levels = c('<10K', '10--15K', '15--20K', '>20K'))
InformationValue::IV(d_fact$BLUEBOOK, df$CLAIM_FLAG)
## [1] 0.06523297
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$BLUEBOOK, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
BLUEBOOK | ||||
<10K | 1898 (65.9%) | 982 (34.1%) | 2880 (100.0%) | |
10–15K | 1814 (74.5%) | 622 (25.5%) | 2436 (100.0%) | |
15–20K | 1437 (76.3%) | 446 (23.7%) | 1883 (100.0%) | |
>20K | 1916 (78.6%) | 522 (21.4%) | 2438 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
Time in force.
d_fact$TIF <- if_else(df$TIF <= 1, '1',
                      if_else(df$TIF <= 5, '2--5',
                              if_else(df$TIF <= 8, '6--8', '>8'))) %>%
  factor(levels = c('1', '2--5', '6--8', '>8'))
InformationValue::IV(d_fact$TIF, df$CLAIM_FLAG)
## [1] 0.03728586
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$TIF, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
TIF | ||||
1 | 2042 (68.5%) | 938 (31.5%) | 2980 (100.0%) | |
2–5 | 1509 (72.4%) | 574 (27.6%) | 2083 (100.0%) | |
6–8 | 1808 (75.4%) | 590 (24.6%) | 2398 (100.0%) | |
>8 | 1706 (78.4%) | 470 (21.6%) | 2176 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
x <- paste(df$CAR_TYPE)
# We only merge Panel Truck and Van (too small bins + similar values)
d_fact$CAR_TYPE <- if_else(x == 'Panel_Truck' | x == 'Van',
                           'PTruck_Van', x) %>%
  factor()
InformationValue::IV(d_fact$CAR_TYPE, df$CLAIM_FLAG)
## [1] 0.1107071
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CAR_TYPE, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
CAR_TYPE | ||||
Minivan | 2207 (82.8%) | 457 (17.2%) | 2664 (100.0%) | |
Pickup | 1096 (68.2%) | 512 (31.8%) | 1608 (100.0%) | |
PTruck_Van | 972 (73.0%) | 360 (27.0%) | 1332 (100.0%) | |
Sports_Car | 766 (65.3%) | 407 (34.7%) | 1173 (100.0%) | |
z_SUV | 2024 (70.8%) | 836 (29.2%) | 2860 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$RED_CAR <- paste(df$RED_CAR) %>%
  factor()
InformationValue::IV(d_fact$RED_CAR, df$CLAIM_FLAG)
## [1] 0.0007407012
## attr(,"howgood")
## [1] "Not Predictive"
summarytools::ctable(d_fact$RED_CAR, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
RED_CAR | ||||
no | 5095 (73.0%) | 1886 (27.0%) | 6981 (100.0%) | |
yes | 1970 (74.2%) | 686 (25.8%) | 2656 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
We will not use the variable RED_CAR
. This variable might make sense for first buyers, but people who buy a red car second hand might have a very different behaviour. We don’t have this information and hence cannot use the colour of the car as a variable.
d_fact$OLDCLAIM <- if_else(df$OLDCLAIM == 0, '0', '>0') %>%
  factor(levels = c('0', '>0'))
InformationValue::IV(d_fact$OLDCLAIM, df$CLAIM_FLAG)
## [1] 0.3294791
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$OLDCLAIM, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
OLDCLAIM | ||||
0 | 4881 (82.3%) | 1053 (17.7%) | 5934 (100.0%) | |
>0 | 2184 (59.0%) | 1519 (41.0%) | 3703 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
The cumulative amount of claims in the last 5 years does not show a trend. The probability of having a claim is around 40% for people who did claim something, regardless of the amount (for those that didn’t claim it is only 17.7%). So, we can only create two meaningful bins here (at least assuming that we are targeting the variable CLAIM_FLAG
– not the amount).
d_fact$CLM_FREQ <- if_else(df$CLM_FREQ == 0, '0',
                           if_else(df$CLM_FREQ == 1, '1',
                                   if_else(df$CLM_FREQ == 2, '2', '>2'))) %>%
  factor(levels = c('0', '1', '2', '>2'))
InformationValue::IV(d_fact$CLM_FREQ, df$CLAIM_FLAG)
## [1] 0.3308363
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CLM_FREQ, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
CLM_FREQ | ||||
0 | 4881 (82.3%) | 1053 (17.7%) | 5934 (100.0%) | |
1 | 724 (60.2%) | 479 (39.8%) | 1203 (100.0%) | |
2 | 814 (59.5%) | 553 (40.5%) | 1367 (100.0%) | |
>2 | 646 (57.0%) | 487 (43.0%) | 1133 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$REVOKED <- paste(df$REVOKED) %>%
  factor()
InformationValue::IV(d_fact$REVOKED, df$CLAIM_FLAG)
## [1] 0.1069475
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$REVOKED, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
REVOKED | ||||
No | 6412 (75.9%) | 2041 (24.1%) | 8453 (100.0%) | |
Yes | 653 (55.2%) | 531 (44.8%) | 1184 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
People that have their licence revoked are about twice as likely to file a claim.
d_fact$MVR_PTS <- if_else(df$MVR_PTS == 0, '0',
                          if_else(df$MVR_PTS == 1, '1',
                                  if_else(df$MVR_PTS == 2, '2',
                                          if_else(df$MVR_PTS <= 4, '3--4', '>4')))) %>%
  factor(levels = c('0', '1', '2', '3--4', '>4'))
InformationValue::IV(d_fact$MVR_PTS, df$CLAIM_FLAG)
## [1] 0.2200071
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$MVR_PTS, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
MVR_PTS | ||||
0 | 3542 (81.0%) | 833 (19.0%) | 4375 (100.0%) | |
1 | 1033 (75.7%) | 331 (24.3%) | 1364 (100.0%) | |
2 | 822 (72.9%) | 306 (27.1%) | 1128 (100.0%) | |
3–4 | 1037 (66.0%) | 534 (34.0%) | 1571 (100.0%) | |
>4 | 631 (52.6%) | 568 (47.4%) | 1199 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
There is one negative car age, let’s ignore that for now.
d_fact$CAR_AGE <- if_else(df$CAR_AGE <= 1, '<=1',
                          if_else(df$CAR_AGE <= 7, '2--7',
                                  if_else(df$CAR_AGE <= 11, '8--11',
                                          if_else(df$CAR_AGE <= 15, '12--15', '>=16')))) %>%
  factor(levels = c('<=1', '2--7', '8--11', '12--15', '>=16'))
InformationValue::IV(d_fact$CAR_AGE, df$CLAIM_FLAG)
## [1] 0.06974132
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$CAR_AGE, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
CAR_AGE | ||||
<=1 | 1771 (67.3%) | 859 (32.7%) | 2630 (100.0%) | |
2–7 | 1356 (71.1%) | 552 (28.9%) | 1908 (100.0%) | |
8–11 | 1883 (73.8%) | 670 (26.2%) | 2553 (100.0%) | |
12–15 | 1231 (79.9%) | 310 (20.1%) | 1541 (100.0%) | |
>=16 | 824 (82.0%) | 181 (18.0%) | 1005 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
d_fact$URBANICITY <- paste(df$URBANICITY) %>%
  factor()
InformationValue::IV(d_fact$URBANICITY, df$CLAIM_FLAG)
## [1] 0.4189352
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$URBANICITY, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
URBANICITY | ||||
Highly_Urban/ Urban | 5155 (67.9%) | 2439 (32.1%) | 7594 (100.0%) | |
z_Highly_Rural/ Rural | 1910 (93.5%) | 133 ( 6.5%) | 2043 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
Next, we consider the interaction between GENDER
and AGE.
d_fact$GENDER.AGE <- paste0(d_fact$GENDER, ".", d_fact$AGE) %>% factor()
InformationValue::IV(d_fact$GENDER.AGE, df$CLAIM_FLAG)
## [1] 0.1199623
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$GENDER.AGE, df$CLAIM_FLAG)
CLAIM_FLAG | 0 | 1 | Total | |
GENDER.AGE | ||||
F.<30 | 106 (47.3%) | 118 (52.7%) | 224 (100.0%) | |
F.>55 | 474 (69.9%) | 204 (30.1%) | 678 (100.0%) | |
F.30–40 | 924 (67.2%) | 450 (32.8%) | 1374 (100.0%) | |
F.40–55 | 2409 (76.9%) | 722 (23.1%) | 3131 (100.0%) | |
M.<30 | 76 (50.7%) | 74 (49.3%) | 150 (100.0%) | |
M.>55 | 430 (74.8%) | 145 (25.2%) | 575 (100.0%) | |
M.30–40 | 609 (65.3%) | 323 (34.7%) | 932 (100.0%) | |
M.40–55 | 2037 (79.2%) | 536 (20.8%) | 2573 (100.0%) | |
Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |
Average cost of accidents:
tibble(GENDER_AGE = paste(d_fact$GENDER.AGE), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
filter(CLM_AMT > 0) %>%
group_by(GENDER_AGE) %>%
summarise_at(vars(CLM_AMT), funs(mean(., na.rm=TRUE)))
## # A tibble: 8 × 2
## GENDER_AGE CLM_AMT
## <chr> <dbl>
## 1 F.30--40 5586.
## 2 F.40--55 5415.
## 3 F.<30 4934.
## 4 F.>55 5282.
## 5 M.30--40 5706.
## 6 M.40--55 5709.
## 7 M.<30 6708.
## 8 M.>55 6545.
Overall averages per gender:
tibble(GENDER = paste(d_fact$GENDER), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
filter(CLM_AMT > 0) %>%
group_by(GENDER) %>%
summarise_at(vars(CLM_AMT), funs(mean(., na.rm=TRUE)))
## # A tibble: 2 × 2
## GENDER CLM_AMT
## <chr> <dbl>
## 1 F 5410.
## 2 M 5889.
While both males and females have higher claims for younger people, this effect is more pronounced in males. The average claim amount for a young male is considerably higher than for the same age group in females (6708 vs. 4934 USD, about 36% higher).
So, while the young female has about 8% more probability to be involved in an accident, the claims of the male are much higher (this effect would be even stronger if we considered only the group that was involved in an accident).
This binned data is useful for logistic regression. It is d_fact
, and we save it for later use.
saveRDS(d_fact %>% select(-c(CLM_AMT)), './d_fact.R')
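Later parts of this series can then restore the object with readRDS (using the file name chosen above):

```r
# Reload the binned data in a subsequent document:
d_fact <- readRDS('./d_fact.R')
```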
We choose to normalize so that all variables range from 0 to 1 (this is the method range
in the call below).
preProc_norm <- caret::preProcess(as.data.frame(df), method = "range")
d_norm <- predict(preProc_norm, as.data.frame(df)) %>%
  as.data.frame() %>%  # coerces text to factor
  as.tibble()          # actually we prefer tibbles, but want the factors here
# Now replace all factorial columns with the binned ones:
for (col in colnames(d_norm)) {
  if (!(col == 'CLM_AMT') & (is.factor(d_norm[[col]]))) {
    d_norm[[col]] <- d_fact[[col]]
  }
}
saveRDS(preProc_norm, './preProc_norm.R')
saveRDS(d_norm, './d_norm.R')
Some methods such as linear regressions, generalized linear regressions, neural networks, etc. will require all data to be numerical. So, we prepare an equivalent data set for those models.
d_bin <- d_fact[, FALSE]  # create empty data frame with N rows and 0 cols
for (col in colnames(d_fact)) {
  if (is.factor(d_fact[[col]])) {
    x <- expand_factor2bin(d_fact, col)  # get the expanded columns

    # find the largest bin and leave it out
    sums   <- colSums(x)
    theCol <- which(sums == max(sums))
    x      <- x[, -theCol]

    d_bin <- bind_cols(d_bin, x)  # add them to d_bin
  }
}
n <- which(colnames(d_bin) == "CLAIM_FLAG.1")
colnames(d_bin)[n] <- "CLAIM_FLAG"
saveRDS(d_bin, './d_bin.R')
In this document we used the workflow and code from: De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
In this document we used mainly:
Agarwal, Sray, and Shashin Mishra. 2021. Responsible AI: Implementing Ethical and Unbiased Algorithms. Springer.
Coeckelbergh, Mark. 2020. AI Ethics. MIT Press.
De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
Dubber, Markus Dirk, Frank Pasquale, and Sunit Das. 2020. The Oxford Handbook of Ethics of AI. Oxford Handbooks.
Hardt, Moritz, Solon Barocas, and Arvind Narayanan. 2018. “Fairness in Machine Learning: Limitations and Opportunities.”
We already argued that the missing values of education might indicate lower education; missing values in the job category might be indicative of the risk too.↩︎