
1 Executive Summary

This is part 1 of an analysis that demonstrates bias in data and explores the possibilities to de-bias it. In this document we show how to clean and prepare the data.

2 Introduction

The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like:

  • Is it fair to use gender?
  • No? Well, can we use age then? That is a good proxy for driving experience, maturity, etc., and indeed has a high information value.
  • Also not? Well, let’s just look at the past track record then (did the person have an accident in the last 5 years)?
  • But wait, isn’t that somehow discriminatory against people who have been driving for less than 5 years, because only those who have driven at least 5 years could build up such a track record?
  • That leaves us with wealth-related parameters such as income, the value of the car, etc. These would work too … but isn’t that discriminatory against poor people?
  • So, we are left with things related to car use (private or commercial use, daily average travel time).
  • This model won’t be that strong. The insurance company then needs to increase prices … which will hit the poor hardest …

The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature results in a more desirable or a less desirable outcome. Leaving out the feature makes the model weaker and increases prices for all … impacting hardest the group that we set out to protect.
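To make the dilemma concrete, below is a minimal simulated sketch (the data and the effect size of \(S\) are illustrative assumptions, not taken from the insurance set): a model that may use \(S\) can price each group at its own risk, while a model that is blind to \(S\) charges everyone the pooled rate.

# Minimal sketch of the dilemma on simulated data (all numbers are
# illustrative assumptions):
set.seed(42)
n     <- 10000
S     <- rbinom(n, 1, 0.5)                    # protected feature (e.g. low income)
claim <- rbinom(n, 1, plogis(-2 + 1.2 * S))   # S genuinely drives claim risk
d     <- data.frame(S = factor(S), claim = claim)

m_full  <- glm(claim ~ S, family = binomial, data = d)   # may use S
m_blind <- glm(claim ~ 1, family = binomial, data = d)   # blind to S

newd <- data.frame(S = factor(c(0, 1)))
predict(m_full,  newd, type = "response")   # group-specific risk (and price)
predict(m_blind, newd, type = "response")   # one pooled price for all

The blind model charges both groups the pooled rate, so the low-risk group ends up subsidising the high-risk group.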

Workflow

The workflow is inspired by (De Brouwer 2020) and was prepared by Philippe De Brouwer to illustrate the concepts explained in the book.

We will approach the problem in a few phases:

  1. Prepare the data: this file
  2. Build a naive model
  3. Build a classification model with CLAIM_FLAG as the dependent variable (this variable is 0 if no claim was filed)
  4. Find a price for the insurance policy. We can use a simple average or build a model; we will build a regression model that determines the fee the customer should pay, based on the CLM_AMT (the amount claimed) of those customers that are expected to have an accident
  5. Decide on Protected Features
  6. Remove protected features and proxy features
  7. Repeat both models
  8. Draw conclusions

3 Loading the Data

First we need to read in the data and some custom functions.

# Load the libraries that are used throughout this document:
library(tidyverse)    # read_csv, the dplyr verbs, tibbles, etc.

setwd("/home/philippe/Documents/teaching/courses/ethicsAI/3-bias_data/car_insurance")

# Read in the data:
d0 <- read_csv("./data/car_insurance_claim.csv")

# Read in the custom functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)
## dollars_to_numeric : function (input)  
## expand_factor2bin : function (data, col)  
## fAUC : function (model, data, ...)  
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)  
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_WOE_table : function (df, y)  
## opt_cut_off : function (perf, pred, cost.fp = 1)  
## pct_table : function (x, y, round_digits = 2)  
## space_to_underscore : function (input)  
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)
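The code of these helpers lives in ethics_functions.R and is not reproduced here. As an illustration, a minimal version of dollars_to_numeric() could look as follows (a sketch under the assumption that the function simply strips the currency formatting; the actual implementation may differ):

# Hypothetical sketch of dollars_to_numeric() (the real implementation is
# in ethics_functions.R): strip "$" and "," and coerce to numeric.
dollars_to_numeric_sketch <- function(input) {
  as.numeric(gsub("[$,]", "", input))
}
dollars_to_numeric_sketch(c("$20,540", "$1,500"))   # 20540 1500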

The data was imported from Kaggle; the licence is unknown and the data is believed to be public domain.

3.1 Data Dictionary

The data contains 10302 rows with 27 columns.

Each record (row) represents a set of attributes of an individual customer of an insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable CLAIM_FLAG is 1 if the customer’s car was in a crash, and 0 if not. The continuous response variable CLM_AMT gives the cost related to the car crash.

The variables of the data are: ID, KIDSDRIV, BIRTH, AGE, HOMEKIDS, YOJ, INCOME, PARENT1, HOME_VAL, MSTATUS, GENDER, EDUCATION, OCCUPATION, TRAVTIME, CAR_USE, BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, CLM_AMT, CAR_AGE, CLAIM_FLAG, URBANICITY.

For the purpose of this exercise we added RACE, which is supposed to represent a proxy for a small privileged class, a large underprivileged class, and a reference “race.” The labels of this feature are chosen in this order: A, B, and C.
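This document does not show how RACE was generated; conceptually it could be as simple as the following sketch (race_example and the class proportions are illustrative assumptions only):

# Hypothetical sketch of generating such a synthetic RACE proxy (a small
# privileged class A, a large underprivileged class B, and a reference
# class C; the proportions are assumptions for illustration):
set.seed(1879)
race_example <- factor(sample(c("A", "B", "C"), size = nrow(d0),
                              replace = TRUE, prob = c(0.10, 0.60, 0.30)))
table(race_example)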

The data dictionary enriched with some additional variables is as follows:

| variable name | definition | prejudice or expectation |
|---|---|---|
| ID | unique identifier | none |
| KIDSDRIV | number of teenage drivers | teenagers cause more accidents |
| BIRTH | birth date | used to derive age |
| AGE | age in years | young people cause more accidents, and older people too |
| HOMEKIDS | number of kids at home | more kids might mean more distraction in the car |
| YOJ | years on job | people that job-hop might be more prone to risk taking and accidents |
| INCOME | annual income in USD | income correlates with reliability and responsibility |
| PARENT1 | single parent, yes or no | not clear |
| HOME_VAL | home value in USD, if owner | similar to INCOME |
| MSTATUS | marital status, “yes” if married | marriage might be a sign of stability and risk aversion (?) |
| GENDER | sex | men cause more, and more expensive, accidents |
| EDUCATION | level of the diploma | higher education correlates with safer driving |
| OCCUPATION | categories of employment | white collar workers might drive more safely |
| TRAVTIME | distance to work (probably in minutes) | a longer distance translates into a higher probability of being involved in an accident |
| CAR_USE | commercial or private use | commercial use might be more risky |
| BLUEBOOK | resale value of the car | not clear |
| TIF | time in force = the time with the same insurer (numeric, probably in years, minimum is 1) | longer should be better |
| CAR_TYPE | categorical | sports cars are more prone to accidents than minivans |
| RED_CAR | “yes” if the car is red | urban legend says that red cars are more prone to accidents |
| OLDCLAIM | total amount in USD of claims in the last 5 years | past performance might be indicative, but note how this should interact with TIF |
| CLM_FREQ | not a claim frequency, but the number of claims in the last 5 years | past performance might be indicative |
| REVOKED | “yes” if the main driver’s licence was revoked during the last 7 years | a revoked licence should be indicative of driving style |
| MVR_PTS | motor vehicle record points (number) | traffic tickets should be indicative of driving style and hence the propensity to be involved in an accident |
| CLM_AMT | if in a car accident last year, the dollar amount of the claim paid by the insurer | target variable |
| CAR_AGE | age of the car in years | one might assume that drivers of older cars are less prudent |
| CLAIM_FLAG | 0 = no, 1 = yes | target variable |
| URBANICITY | categorical, 2 options | cities should be more dangerous |

3.2 Data Quality

We were not able to contact the data provider, and close observation leads us to assume that some variables have been manipulated to eliminate missing values. The missing values seem to have been assigned to some “assumption”: for example, in the variable EDUCATION we have both “<High School” (lower than high school) and “z_High School”. The latter seems to be the collection of the missing values.

summarytools::ctable(d0$EDUCATION, d0$CLAIM_FLAG)

| EDUCATION | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <High School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) |
| Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) |
| Masters | 1674 (80.6%) | 404 (19.4%) | 2078 (100.0%) |
| PhD | 780 (83.5%) | 154 (16.5%) | 934 (100.0%) |
| z_High School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |

The table above shows that the probability of being involved in an accident decreases with education. The two high-school classes do not fit this pattern: people with lower than high school are involved in accidents in 32.5% of the cases, and z_High_School has an even higher accident probability (34.7%), whereas genuine high-school graduates should fall between <High_School (32.5%) and Bachelors (23.8%). Therefore we must conclude that z_High_School and <High_School are not really logical here.

For this particular variable we might merge the two classes with the highest probability of being involved in an accident into one class <=High_School. This is logically consistent, and it leaves us with similar bucket sizes.
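A sketch of that merge (EDUCATION2 is a hypothetical name; the modelling sections below keep the original levels):

# Merge the two classes with the highest accident probability (a sketch):
d0 %>%
  mutate(EDUCATION2 = if_else(EDUCATION %in% c("<High School", "z_High School"),
                              "<=High_School", EDUCATION)) %>%
  count(EDUCATION2)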

We therefore assume that this unfortunate data manipulation was also applied to other variables. For example, the variable GENDER seems to have mixed the missing values into the females. Conclusions based on small differences will hence not be reliable.

summarytools::ctable(d0$GENDER, d0$CLAIM_FLAG)

| GENDER | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| M | 3539 (74.4%) | 1218 (25.6%) | 4757 (100.0%) |
| z_F | 4017 (72.4%) | 1528 (27.6%) | 5545 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |

These considerations cast serious doubt on the reliability of the data.

Issues to consider for data quality:

  • the partial treatment of missing values (see above)
  • the huge proportion of accidents: this data set is probably a subset that was designed to contain more accidents (one would expect roughly one claim in a thousand, not 26.7%)
  • the relatively large share of red cars (28.9%) indicates a similar issue (see the quick check after this list)
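Both prevalence concerns are easy to verify on the raw data:

# Quick check of the prevalence concerns listed above:
mean(d0$CLAIM_FLAG)           # ca. 0.267: far above a realistic claim rate
mean(d0$RED_CAR == "yes")     # ca. 0.289: a large share of red cars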

3.3 First Data Cleanup

# Preprocess the variables into numeric / factors as necessary

# dollar values are in the format "$20,540" (string) -> convert this:
df <- as_tibble(d0) %>%
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM", "CLM_AMT"),
            parse_number) %>% 
  mutate_at(c("EDUCATION","OCCUPATION","CAR_TYPE","URBANICITY"),
            space_to_underscore) %>% 
  mutate_at(c("PARENT1", "MSTATUS", "GENDER", "EDUCATION","OCCUPATION", "CAR_USE", "CAR_TYPE", "RED_CAR", "REVOKED", "URBANICITY"),
            as.factor) %>% 
  mutate(CLAIM_FLAG = as.factor(CLAIM_FLAG))

We also notice that

  • the ID column will never yield useful information
  • the BIRTH data is the same as AGE but in a less usable format.

Therefore, we will remove those columns before moving forward.

df <- df %>% dplyr::select(-c(ID, BIRTH))

4 Exploring the Data (Univariate Analysis)

The univariate summaries for the individual variables (after some cleaning) are provided below.

summarytools::dfSummary(df, 
          plain.ascii  = FALSE,
          style        = 'grid',
          graph.magnif = 0.85,
          varnumbers = FALSE,
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp")

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| KIDSDRIV [numeric] | Mean (sd): 0.2 (0.5); min < med < max: 0 < 0 < 4; IQR (CV): 0 (3) | 0: 9069 (88.0%), 1: 804 (7.8%), 2: 351 (3.4%), 3: 74 (0.7%), 4: 4 (0.0%) | 0 (0.0%) |
| AGE [numeric] | Mean (sd): 44.8 (8.6); min < med < max: 16 < 45 < 81; IQR (CV): 12 (0.2) | 61 distinct values | 7 (0.1%) |
| HOMEKIDS [numeric] | Mean (sd): 0.7 (1.1); min < med < max: 0 < 0 < 5; IQR (CV): 1 (1.5) | 0: 6694 (65.0%), 1: 1106 (10.7%), 2: 1427 (13.9%), 3: 856 (8.3%), 4: 201 (2.0%), 5: 18 (0.2%) | 0 (0.0%) |
| YOJ [numeric] | Mean (sd): 10.5 (4.1); min < med < max: 0 < 11 < 23; IQR (CV): 4 (0.4) | 21 distinct values | 548 (5.3%) |
| INCOME [numeric] | Mean (sd): 61572.1 (47457.2); min < med < max: 0 < 53529 < 367030; IQR (CV): 58582 (0.8) | 8151 distinct values | 570 (5.5%) |
| PARENT1 [factor] | 1. No, 2. Yes | 8959 (87.0%), 1343 (13.0%) | 0 (0.0%) |
| HOME_VAL [numeric] | Mean (sd): 154523 (129188.4); min < med < max: 0 < 160661 < 885282; IQR (CV): 238256 (0.8) | 6334 distinct values | 575 (5.6%) |
| MSTATUS [factor] | 1. Yes, 2. z_No | 6188 (60.1%), 4114 (39.9%) | 0 (0.0%) |
| GENDER [factor] | 1. M, 2. z_F | 4757 (46.2%), 5545 (53.8%) | 0 (0.0%) |
| EDUCATION [factor] | 1. <High_School, 2. Bachelors, 3. Masters, 4. PhD, 5. z_High_School | 1515 (14.7%), 2823 (27.4%), 2078 (20.2%), 934 (9.1%), 2952 (28.7%) | 0 (0.0%) |
| OCCUPATION [factor] | 1. Clerical, 2. Doctor, 3. Home_Maker, 4. Lawyer, 5. Manager, 6. Professional, 7. Student, 8. z_Blue_Collar | 1590 (16.5%), 321 (3.3%), 843 (8.7%), 1031 (10.7%), 1257 (13.0%), 1408 (14.6%), 899 (9.3%), 2288 (23.7%) | 665 (6.5%) |
| TRAVTIME [numeric] | Mean (sd): 33.4 (15.9); min < med < max: 5 < 33 < 142; IQR (CV): 22 (0.5) | 100 distinct values | 0 (0.0%) |
| CAR_USE [factor] | 1. Commercial, 2. Private | 3789 (36.8%), 6513 (63.2%) | 0 (0.0%) |
| BLUEBOOK [numeric] | Mean (sd): 15659.9 (8428.8); min < med < max: 1500 < 14400 < 69740; IQR (CV): 11690 (0.5) | 2985 distinct values | 0 (0.0%) |
| TIF [numeric] | Mean (sd): 5.3 (4.1); min < med < max: 1 < 4 < 25; IQR (CV): 6 (0.8) | 23 distinct values | 0 (0.0%) |
| CAR_TYPE [factor] | 1. Minivan, 2. Panel_Truck, 3. Pickup, 4. Sports_Car, 5. Van, 6. z_SUV | 2694 (26.2%), 853 (8.3%), 1772 (17.2%), 1179 (11.4%), 921 (8.9%), 2883 (28.0%) | 0 (0.0%) |
| RED_CAR [factor] | 1. no, 2. yes | 7326 (71.1%), 2976 (28.9%) | 0 (0.0%) |
| OLDCLAIM [numeric] | Mean (sd): 4034 (8733.1); min < med < max: 0 < 0 < 57037; IQR (CV): 4647.5 (2.2) | 3545 distinct values | 0 (0.0%) |
| CLM_FREQ [numeric] | Mean (sd): 0.8 (1.2); min < med < max: 0 < 0 < 5; IQR (CV): 2 (1.4) | 0: 6292 (61.1%), 1: 1279 (12.4%), 2: 1492 (14.5%), 3: 992 (9.6%), 4: 225 (2.2%), 5: 22 (0.2%) | 0 (0.0%) |
| REVOKED [factor] | 1. No, 2. Yes | 9041 (87.8%), 1261 (12.2%) | 0 (0.0%) |
| MVR_PTS [numeric] | Mean (sd): 1.7 (2.2); min < med < max: 0 < 1 < 13; IQR (CV): 3 (1.3) | 14 distinct values | 0 (0.0%) |
| CLM_AMT [numeric] | Mean (sd): 1511.3 (4725.2); min < med < max: 0 < 0 < 123247; IQR (CV): 1144.8 (3.1) | 2346 distinct values | 0 (0.0%) |
| CAR_AGE [numeric] | Mean (sd): 8.3 (5.7); min < med < max: -3 < 8 < 28; IQR (CV): 11 (0.7) | 30 distinct values | 639 (6.2%) |
| CLAIM_FLAG [factor] | 1. 0, 2. 1 | 7556 (73.3%), 2746 (26.7%) | 0 (0.0%) |
| URBANICITY [factor] | 1. Highly_Urban/ Urban, 2. z_Highly_Rural/ Rural | 8230 (79.9%), 2072 (20.1%) | 0 (0.0%) |

5 Handling the Missing Values

5.1 Exploring the Structure of Missing Values

DataExplorer::plot_missing(d0, title = "Percentage of missing data per variable")

naniar::gg_miss_upset(d0)

The plots above illustrate the amplitude and the correlation structure of the missing values.

The number of missing values in the variable OCCUPATION is 665, i.e. 6.46% of the observations.

The upset plot above also indicates that the data is not missing at random (MAR)¹.
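If the installed naniar is recent enough (version 0.6.0 or later provides mcar_test(), an implementation of Little’s test), this can also be checked formally; a small p-value argues against “missing completely at random”:

# Little's MCAR test on the columns with missing values (assumes
# naniar >= 0.6.0):
naniar::mcar_test(d0[, c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE")])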

5.2 Imputing Missing Values

We prefer to keep rows with missing data: instead of removing all incomplete observations, we impute the missing values with the predictive mean matching (PMM) approach implemented in the mice package (Multiple Imputation by Chained Equations; this method assumes missing at random). Other methods, such as a random forest (e.g. via the package missForest in R), might provide more convincing results, but they run too slowly.

# Impute the columns with missing values

# install.packages("mice")
#library(mice)
df_missing_cols <- df[, c("AGE", "INCOME", "YOJ", "HOME_VAL", "CAR_AGE")]
df_imp_cols_tmp <- mice::mice(data = df_missing_cols, m = 1, method = "pmm",
                              maxit = 50, seed = 500)
df_imp_cols <- mice::complete(df_imp_cols_tmp)
df_imp <- bind_cols(df %>% select(-AGE, -INCOME, -YOJ, -HOME_VAL, -CAR_AGE),
                    df_imp_cols)

# For the rest of the analysis we use the imputed data. Note that OCCUPATION
# was not imputed, so complete.cases() drops the rows where OCCUPATION is
# missing (10302 -> 9637 rows):
df <- df_imp[complete.cases(df_imp),]

Here is how we could use missForest:

# Impute missing values via random forest estimates:
# ---
library(missForest)   # impute missing values

# Note: this fails:
# d1_imp <- missForest(df)$ximp
# While almost everything that works for a data frame also works for a
# tibble, missForest() is an exception, so we have two solutions:

# 1. coerce to a data frame:
d_imp <- missForest(as.data.frame(df))$ximp

# 2. manually fit a random forest and impute the missing data:
x_cc <- df %>% filter(!is.na(OCCUPATION))
x_nc <- df %>% filter(is.na(OCCUPATION))
rf   <- randomForest::randomForest(OCCUPATION ~ ., data = x_cc)
x_nc$OCCUPATION <- predict(rf, x_nc)
x <- rbind(x_cc, x_nc)

6 Further Data Exploration

6.1 Numeric Variables

d_cont <- data.frame()[1:nrow(df), ]   # empty data frame with nrow(df) rows
n <- 1
for (col in colnames(df)) {
  if (!is.factor(df[[col]])) {
    d_cont <- cbind(d_cont, df[[col]])
    colnames(d_cont)[n] <- col
    n <- n + 1
  }
}
# Add the response as a numeric 0/1 variable:
d_cont$CLAIM_FLAG <- as.numeric(as.character(df$CLAIM_FLAG))
d_cont <- d_cont[complete.cases(d_cont), ]
# Use the library GGally to explore the correlation structure
# ---
GGally::ggcorr(d_cont, method = c("everything", "pearson"),
               label = TRUE, label_round = 2, label_size = 3,
               size = 2.5, geom = "circle")
The correlation structure of the data
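As an aside, with a recent dplyr (version 1.0 or later) the same selection of numeric columns can be written more compactly; the following is a sketch equivalent to the loop above (d_cont2 is a hypothetical name):

# Equivalent, more compact selection of the non-factor columns (dplyr >= 1.0):
d_cont2 <- df %>%
  select(where(~ !is.factor(.x))) %>%
  mutate(CLAIM_FLAG = as.numeric(as.character(df$CLAIM_FLAG)))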

6.2 Factorial Variables

WOE_tbl <- make_WOE_table(df, 'CLAIM_FLAG')
# show the best IVs:
knitr::kable(WOE_tbl[order(WOE_tbl$IV, decreasing = TRUE),],
             caption = 'The table of all information values 
              for each categorical variable ordered in 
              decreasing order. We will work with the ones 
              that have an information value above $0.1$.')

The table of all information values for each categorical variable, ordered decreasingly. We will work with the ones that have an information value above \(0.1\).

| | varName | IV |
|---|---|---|
| 10 | URBANICITY | 0.4189352 |
| 5 | OCCUPATION | 0.2031089 |
| 4 | EDUCATION | 0.1477226 |
| 1 | PARENT1 | 0.1117093 |
| 7 | CAR_TYPE | 0.1110225 |
| 6 | CAR_USE | 0.1108089 |
| 9 | REVOKED | 0.1069475 |
| 2 | MSTATUS | 0.0814126 |
| 3 | GENDER | 0.0029738 |
| 8 | RED_CAR | 0.0007407 |

(De Brouwer 2020) (p. 359) suggests the following rule of thumb for these information values:

| IV | predictability |
|---|---|
| < 0.02 | not predictive |
| 0.02 – 0.1 | weak |
| 0.1 – 0.3 | medium |
| 0.3 – 0.5 | strong |
| > 0.5 | suspicious |

This implies that GENDER and RED_CAR are probably not predictive at all, and that MSTATUS is at best weakly predictive; at least we do not expect a strong relationship.
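To make explicit what is behind these numbers: the weight of evidence of bin \(i\) is \(\mathrm{WOE}_i = \ln\left(\%\,\text{non-events}_i / \%\,\text{events}_i\right)\) and the information value is \(\mathrm{IV} = \sum_i \left(\%\,\text{non-events}_i - \%\,\text{events}_i\right) \mathrm{WOE}_i\). The following by-hand sketch is not the code of make_WOE_table(), but it should reproduce the GENDER value from the table above:

# Compute the information value of one categorical variable by hand:
iv_by_hand <- function(x, y) {
  tab      <- table(x, y)
  pct_good <- tab[, "0"] / sum(tab[, "0"])   # share of non-claims per bin
  pct_bad  <- tab[, "1"] / sum(tab[, "1"])   # share of claims per bin
  woe      <- log(pct_good / pct_bad)        # weight of evidence per bin
  sum((pct_good - pct_bad) * woe)            # information value
}
iv_by_hand(df$GENDER, df$CLAIM_FLAG)         # ca. 0.0030, as in the table above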

7 Data Binning

d_fact <- df

7.1 KIDSDRIV

First we observe the existing binning or levels.

InformationValue::IV(factor(df$KIDSDRIV), df$CLAIM_FLAG)
## [1] 0.05209009
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(df$KIDSDRIV, df$CLAIM_FLAG)

| KIDSDRIV | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 0 | 6341 (75.0%) | 2112 (25.0%) | 8453 (100.0%) |
| 1 | 482 (62.5%) | 289 (37.5%) | 771 (100.0%) |
| 2 | 205 (60.8%) | 132 (39.2%) | 337 (100.0%) |
| 3 | 35 (48.6%) | 37 (51.4%) | 72 (100.0%) |
| 4 | 2 (50.0%) | 2 (50.0%) | 4 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

This allows us to make a decision on what bins to use. We create the bins so that

  • no bin is too small (e.g. lower than 10% of the data);
  • there is a meaningful difference between the weight of evidence of the bins; and
  • if we combine levels of variables, we do so in a meaningful way (e.g. larger cars vs. small cars, or age bins that follow each other instead of grouping the youngest and the oldest bin, etc.).

Note that in the following sections we will leave out this exploratory phase and refer to the previous chapters about data exploration instead.

d_fact$KIDSDRIV <- if_else(df$KIDSDRIV == 0, 'none', 
                        if_else(df$KIDSDRIV == 1, '1', '>=2')) %>%
  factor(levels = c('none', '1', '>=2'))
InformationValue::IV(d_fact$KIDSDRIV, df$CLAIM_FLAG)
## [1] 0.05019181
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$KIDSDRIV, df$CLAIM_FLAG)

| KIDSDRIV | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| none | 6341 (75.0%) | 2112 (25.0%) | 8453 (100.0%) |
| 1 | 482 (62.5%) | 289 (37.5%) | 771 (100.0%) |
| >=2 | 242 (58.6%) | 171 (41.4%) | 413 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.2 AGE

d_fact$AGE <- if_else(df$AGE < 30 , '<30', 
                        if_else(df$AGE < 40, '30--40', 
                                if_else(df$AGE < 55, '40--55', '>55'))) %>%
    factor(levels = c('<30', '30--40', '40--55', '>55'))
InformationValue::IV(d_fact$AGE, df$CLAIM_FLAG)
## [1] 0.1150198
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$AGE, df$CLAIM_FLAG)

| AGE | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <30 | 182 (48.7%) | 192 (51.3%) | 374 (100.0%) |
| 30–40 | 1533 (66.5%) | 773 (33.5%) | 2306 (100.0%) |
| 40–55 | 4446 (77.9%) | 1258 (22.1%) | 5704 (100.0%) |
| >55 | 904 (72.1%) | 349 (27.9%) | 1253 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.3 HOMEKIDS

d_fact$HOMEKIDS <- if_else(df$HOMEKIDS == 0 , '0', 
                        if_else(df$HOMEKIDS == 1, '1', '>=2')) %>%
  factor(levels = c('0', '1', '>=2'))
InformationValue::IV(d_fact$HOMEKIDS, df$CLAIM_FLAG)
## [1] 0.08915518
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$HOMEKIDS, df$CLAIM_FLAG)

| HOMEKIDS | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 0 | 4810 (77.7%) | 1378 (22.3%) | 6188 (100.0%) |
| 1 | 691 (65.8%) | 359 (34.2%) | 1050 (100.0%) |
| >=2 | 1564 (65.2%) | 835 (34.8%) | 2399 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.4 YOJ

d_fact$YOJ <- if_else(df$YOJ <= 1 , '<=1', '>1') %>%
    factor(levels = c('<=1', '>1'))
InformationValue::IV(d_fact$YOJ, df$CLAIM_FLAG)
## [1] 0.03611412
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$YOJ, df$CLAIM_FLAG)

| YOJ | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <=1 | 519 (60.8%) | 335 (39.2%) | 854 (100.0%) |
| >1 | 6546 (74.5%) | 2237 (25.5%) | 8783 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.5 INCOME

d_fact$INCOME <- if_else(df$INCOME < 25000 , '<25K', 
                       if_else(df$INCOME < 55000, '25--55K',
                               if_else(df$INCOME < 75000, '55--75K', '>75K'))) %>%
  factor(levels = c('<25K', '25--55K', '55--75K', '>75K'))
InformationValue::IV(d_fact$INCOME, df$CLAIM_FLAG)
## [1] 0.1238587
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$INCOME, df$CLAIM_FLAG)

| INCOME | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <25K | 1539 (65.9%) | 798 (34.1%) | 2337 (100.0%) |
| 25–55K | 1996 (69.8%) | 865 (30.2%) | 2861 (100.0%) |
| 55–75K | 1228 (73.8%) | 437 (26.2%) | 1665 (100.0%) |
| >75K | 2302 (83.0%) | 472 (17.0%) | 2774 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.6 PARENT1

d_fact$PARENT1 <- df$PARENT1 %>%
  factor(levels = c('Yes', 'No'))
InformationValue::IV(d_fact$PARENT1, df$CLAIM_FLAG)
## [1] 0.1117093
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$PARENT1, df$CLAIM_FLAG)

| PARENT1 | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Yes | 713 (55.6%) | 569 (44.4%) | 1282 (100.0%) |
| No | 6352 (76.0%) | 2003 (24.0%) | 8355 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.7 HOME_VAL

d_fact$HOME_VAL <- if_else(df$HOME_VAL < 100000 , '<100K', 
                       if_else(df$HOME_VAL < 200000, '100--200K',
                               if_else(df$HOME_VAL < 300000, '200--300K', '>300K'))) %>%
  factor(levels = c('<100K', '100--200K', '200--300K', '>300K'))
InformationValue::IV(d_fact$HOME_VAL, df$CLAIM_FLAG)
## [1] 0.176499
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$HOME_VAL, df$CLAIM_FLAG)

| HOME_VAL | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <100K | 2174 (64.2%) | 1210 (35.8%) | 3384 (100.0%) |
| 100–200K | 2093 (73.3%) | 763 (26.7%) | 2856 (100.0%) |
| 200–300K | 1895 (80.0%) | 473 (20.0%) | 2368 (100.0%) |
| >300K | 903 (87.8%) | 126 (12.2%) | 1029 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.8 MSTATUS

d_fact$MSTATUS <- if_else(df$MSTATUS == 'Yes', 'Yes', 'No') %>%
  factor(levels = c('Yes', 'No'))
InformationValue::IV(d_fact$MSTATUS, df$CLAIM_FLAG)
## [1] 0.08141263
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$MSTATUS, df$CLAIM_FLAG)

| MSTATUS | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Yes | 4532 (77.9%) | 1288 (22.1%) | 5820 (100.0%) |
| No | 2533 (66.4%) | 1284 (33.6%) | 3817 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.9 GENDER

d_fact$GENDER <- if_else(df$GENDER == 'M', 'M', 'F') %>%
  factor(levels = c('M', 'F'))
InformationValue::IV(d_fact$GENDER, df$CLAIM_FLAG)
## [1] 0.002973753
## attr(,"howgood")
## [1] "Not Predictive"
summarytools::ctable(d_fact$GENDER, df$CLAIM_FLAG)

| GENDER | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| M | 3152 (74.5%) | 1078 (25.5%) | 4230 (100.0%) |
| F | 3913 (72.4%) | 1494 (27.6%) | 5407 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

Gender is not predictive of the probability of being involved in an accident.
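Note that “not predictive” refers to the low information value, not to statistical significance: with almost ten thousand observations even this small difference may well test as significant, it is simply too weak to be useful. A quick check (a sketch):

# Chi-square test of independence between GENDER and CLAIM_FLAG; a small
# p-value would only mean the small difference is unlikely to be pure
# chance, not that the variable is a useful predictor:
chisq.test(table(d_fact$GENDER, df$CLAIM_FLAG))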

7.10 EDUCATION

d_fact$EDUCATION <- factor(df$EDUCATION, 
                         levels = c('<High_School', 'z_High_School', 'Bachelors', 'Masters', 'PhD'))
InformationValue::IV(d_fact$EDUCATION, df$CLAIM_FLAG)
## [1] 0.1477226
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$EDUCATION, df$CLAIM_FLAG)

| EDUCATION | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <High_School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) |
| z_High_School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) |
| Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) |
| Masters | 1371 (82.9%) | 283 (17.1%) | 1654 (100.0%) |
| PhD | 592 (85.4%) | 101 (14.6%) | 693 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.11 OCCUPATION

x <- paste(df$OCCUPATION)
d_fact$OCCUPATION <- if_else(x == 'Doctor' | x == 'Manager' , 
                           'senior', 
                           if_else(x == 'Student' | x == 'Clerical',
                                   'junior', x)) %>%
  factor()
InformationValue::IV(d_fact$OCCUPATION, df$CLAIM_FLAG)
summarytools::ctable(d_fact$OCCUPATION, df$CLAIM_FLAG)

| OCCUPATION | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Home_Maker | 614 (72.8%) | 229 (27.2%) | 843 (100.0%) |
| junior | 1667 (67.0%) | 822 (33.0%) | 2489 (100.0%) |
| Lawyer | 845 (82.0%) | 186 (18.0%) | 1031 (100.0%) |
| Professional | 1083 (76.9%) | 325 (23.1%) | 1408 (100.0%) |
| senior | 1371 (86.9%) | 207 (13.1%) | 1578 (100.0%) |
| z_Blue_Collar | 1485 (64.9%) | 803 (35.1%) | 2288 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.12 TRAVTIME

d_fact$TRAVTIME <- if_else(df$TRAVTIME < 25, '<25',  
                           if_else(df$TRAVTIME < 40, '25--40', '>40')) %>%
     factor(levels = c('<25', '25--40', '>40'))
InformationValue::IV(d_fact$TRAVTIME, df$CLAIM_FLAG)
summarytools::ctable(d_fact$TRAVTIME, df$CLAIM_FLAG)

| TRAVTIME | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <25 | 2183 (76.8%) | 658 (23.2%) | 2841 (100.0%) |
| 25–40 | 2591 (73.5%) | 933 (26.5%) | 3524 (100.0%) |
| >40 | 2291 (70.0%) | 981 (30.0%) | 3272 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.13 CAR_USE

d_fact$CAR_USE <- paste(df$CAR_USE) %>%
     factor()
InformationValue::IV(d_fact$CAR_USE, df$CLAIM_FLAG)
## [1] 0.1108089
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CAR_USE, df$CLAIM_FLAG)

| CAR_USE | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Commercial | 2042 (63.9%) | 1154 (36.1%) | 3196 (100.0%) |
| Private | 5023 (78.0%) | 1418 (22.0%) | 6441 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.14 BLUEBOOK

d_fact$BLUEBOOK <- if_else(df$BLUEBOOK < 10000, '<10K',  
                           if_else(df$BLUEBOOK < 15000, '10--15K', 
                                   if_else(df$BLUEBOOK < 20000, '15--20K', '>20K'))) %>%
     factor(levels = c('<10K', '10--15K', '15--20K', '>20K'))
InformationValue::IV(d_fact$BLUEBOOK, df$CLAIM_FLAG)
## [1] 0.06523297
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$BLUEBOOK, df$CLAIM_FLAG)

| BLUEBOOK | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <10K | 1898 (65.9%) | 982 (34.1%) | 2880 (100.0%) |
| 10–15K | 1814 (74.5%) | 622 (25.5%) | 2436 (100.0%) |
| 15–20K | 1437 (76.3%) | 446 (23.7%) | 1883 (100.0%) |
| >20K | 1916 (78.6%) | 522 (21.4%) | 2438 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.15 TIF

Time in force.

d_fact$TIF <- if_else(df$TIF <= 1, '1',  
                           if_else(df$TIF <= 5, '2--5', 
                                   if_else(df$TIF <= 8, '6--8', '>8'))) %>%
     factor(levels = c('1', '2--5', '6--8', '>8'))
InformationValue::IV(d_fact$TIF, df$CLAIM_FLAG)
## [1] 0.03728586
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$TIF, df$CLAIM_FLAG)

| TIF | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 1 | 2042 (68.5%) | 938 (31.5%) | 2980 (100.0%) |
| 2–5 | 1509 (72.4%) | 574 (27.6%) | 2083 (100.0%) |
| 6–8 | 1808 (75.4%) | 590 (24.6%) | 2398 (100.0%) |
| >8 | 1706 (78.4%) | 470 (21.6%) | 2176 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.16 CAR_TYPE

x <- paste(df$CAR_TYPE)
# We only merge Panel Truck and Van (too small bins + similar values)
d_fact$CAR_TYPE <- if_else(x == 'Panel_Truck' | x == 'Van' , 
                           'PTruck_Van', x) %>%
  factor()
InformationValue::IV(d_fact$CAR_TYPE, df$CLAIM_FLAG)
## [1] 0.1107071
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CAR_TYPE, df$CLAIM_FLAG)

| CAR_TYPE | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Minivan | 2207 (82.8%) | 457 (17.2%) | 2664 (100.0%) |
| Pickup | 1096 (68.2%) | 512 (31.8%) | 1608 (100.0%) |
| PTruck_Van | 972 (73.0%) | 360 (27.0%) | 1332 (100.0%) |
| Sports_Car | 766 (65.3%) | 407 (34.7%) | 1173 (100.0%) |
| z_SUV | 2024 (70.8%) | 836 (29.2%) | 2860 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.17 RED_CAR

d_fact$RED_CAR <- paste(df$RED_CAR) %>%
     factor()
InformationValue::IV(d_fact$RED_CAR, df$CLAIM_FLAG)
## [1] 0.0007407012
## attr(,"howgood")
## [1] "Not Predictive"
summarytools::ctable(d_fact$RED_CAR, df$CLAIM_FLAG)

| RED_CAR | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| no | 5095 (73.0%) | 1886 (27.0%) | 6981 (100.0%) |
| yes | 1970 (74.2%) | 686 (25.8%) | 2656 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

We will not use the variable RED_CAR: it is not predictive. The urban legend might hold some truth for first buyers, but people who buy a red car second-hand might behave very differently. We do not have that information and hence cannot use the colour of the car as a variable.

7.18 OLDCLAIM

d_fact$OLDCLAIM <- if_else(df$OLDCLAIM == 0, '0', '>0') %>%
     factor(levels = c('0', '>0'))
InformationValue::IV(d_fact$OLDCLAIM, df$CLAIM_FLAG)
## [1] 0.3294791
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$OLDCLAIM, df$CLAIM_FLAG)

| OLDCLAIM | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 0 | 4881 (82.3%) | 1053 (17.7%) | 5934 (100.0%) |
| >0 | 2184 (59.0%) | 1519 (41.0%) | 3703 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

The cumulative amount of claims in the last 5 years does not show a trend beyond zero versus non-zero: the probability of having a claim is around 40% for people who did claim something, regardless of the amount (for those that did not claim it is only 17.7%). So, we can only create two meaningful bins here (at least assuming that we are targeting the variable CLAIM_FLAG, not the amount).

7.19 CLM_FREQ

d_fact$CLM_FREQ <- if_else(df$CLM_FREQ == 0, '0',  
                           if_else(df$CLM_FREQ == 1, '1', 
                                   if_else(df$CLM_FREQ == 2, '2', '>2'))) %>%
     factor(levels = c('0', '1', '2', '>2'))
InformationValue::IV(d_fact$CLM_FREQ, df$CLAIM_FLAG)
## [1] 0.3308363
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$CLM_FREQ, df$CLAIM_FLAG)

| CLM_FREQ | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 0 | 4881 (82.3%) | 1053 (17.7%) | 5934 (100.0%) |
| 1 | 724 (60.2%) | 479 (39.8%) | 1203 (100.0%) |
| 2 | 814 (59.5%) | 553 (40.5%) | 1367 (100.0%) |
| >2 | 646 (57.0%) | 487 (43.0%) | 1133 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.20 REVOKED

d_fact$REVOKED <- paste(df$REVOKED) %>%
     factor()
InformationValue::IV(d_fact$REVOKED, df$CLAIM_FLAG)
## [1] 0.1069475
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$REVOKED, df$CLAIM_FLAG)

| REVOKED | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| No | 6412 (75.9%) | 2041 (24.1%) | 8453 (100.0%) |
| Yes | 653 (55.2%) | 531 (44.8%) | 1184 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

People who have had their licence revoked are almost twice as likely to file a claim (44.8% vs. 24.1%).

7.21 MVR_PTS

d_fact$MVR_PTS <- if_else(df$MVR_PTS == 0, '0',  
                           if_else(df$MVR_PTS == 1, '1', 
                                   if_else(df$MVR_PTS == 2, '2', 
                                           if_else(df$MVR_PTS <= 4, '3--4', '>4')))) %>%
     factor(levels = c('0', '1', '2', '3--4', '>4'))
InformationValue::IV(d_fact$MVR_PTS, df$CLAIM_FLAG)
## [1] 0.2200071
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$MVR_PTS, df$CLAIM_FLAG)

| MVR_PTS | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| 0 | 3542 (81.0%) | 833 (19.0%) | 4375 (100.0%) |
| 1 | 1033 (75.7%) | 331 (24.3%) | 1364 (100.0%) |
| 2 | 822 (72.9%) | 306 (27.1%) | 1128 (100.0%) |
| 3–4 | 1037 (66.0%) | 534 (34.0%) | 1571 (100.0%) |
| >4 | 631 (52.6%) | 568 (47.4%) | 1199 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.22 CAR_AGE

There is one negative car age; we ignore that for now.

d_fact$CAR_AGE <- if_else(df$CAR_AGE <= 1, '<=1',  
                           if_else(df$CAR_AGE <= 7, '2--7', 
                                   if_else(df$CAR_AGE <= 11, '8--11', 
                                           if_else(df$CAR_AGE <= 15, '12--15', '>=16')))) %>%
     factor(levels = c('<=1', '2--7', '8--11', '12--15', '>=16'))
InformationValue::IV(d_fact$CAR_AGE, df$CLAIM_FLAG)
## [1] 0.06974132
## attr(,"howgood")
## [1] "Somewhat Predictive"
summarytools::ctable(d_fact$CAR_AGE, df$CLAIM_FLAG)

| CAR_AGE | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| <=1 | 1771 (67.3%) | 859 (32.7%) | 2630 (100.0%) |
| 2–7 | 1356 (71.1%) | 552 (28.9%) | 1908 (100.0%) |
| 8–11 | 1883 (73.8%) | 670 (26.2%) | 2553 (100.0%) |
| 12–15 | 1231 (79.9%) | 310 (20.1%) | 1541 (100.0%) |
| >=16 | 824 (82.0%) | 181 (18.0%) | 1005 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.23 URBANICITY

d_fact$URBANICITY <- paste(df$URBANICITY) %>%
     factor()
InformationValue::IV(d_fact$URBANICITY, df$CLAIM_FLAG)
## [1] 0.4189352
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$URBANICITY, df$CLAIM_FLAG)

| URBANICITY | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| Highly_Urban/ Urban | 5155 (67.9%) | 2439 (32.1%) | 7594 (100.0%) |
| z_Highly_Rural/ Rural | 1910 (93.5%) | 133 (6.5%) | 2043 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

7.24 Combination of Variables

7.24.1 GENDER and AGE

d_fact$GENDER.AGE <- paste0(d_fact$GENDER, ".", d_fact$AGE) %>% factor()
InformationValue::IV(d_fact$GENDER.AGE, df$CLAIM_FLAG)
## [1] 0.1199623
## attr(,"howgood")
## [1] "Highly Predictive"
summarytools::ctable(d_fact$GENDER.AGE, df$CLAIM_FLAG)

| GENDER.AGE | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|---|---|---|---|
| F.<30 | 106 (47.3%) | 118 (52.7%) | 224 (100.0%) |
| F.>55 | 474 (69.9%) | 204 (30.1%) | 678 (100.0%) |
| F.30–40 | 924 (67.2%) | 450 (32.8%) | 1374 (100.0%) |
| F.40–55 | 2409 (76.9%) | 722 (23.1%) | 3131 (100.0%) |
| M.<30 | 76 (50.7%) | 74 (49.3%) | 150 (100.0%) |
| M.>55 | 430 (74.8%) | 145 (25.2%) | 575 (100.0%) |
| M.30–40 | 609 (65.3%) | 323 (34.7%) | 932 (100.0%) |
| M.40–55 | 2037 (79.2%) | 536 (20.8%) | 2573 (100.0%) |
| Total | 7065 (73.3%) | 2572 (26.7%) | 9637 (100.0%) |

Average cost of accidents:

tibble(GENDER_AGE = paste(d_fact$GENDER.AGE), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
  filter(CLM_AMT > 0) %>%
  group_by(GENDER_AGE) %>%
  summarise(CLM_AMT = mean(CLM_AMT, na.rm = TRUE))
## # A tibble: 8 × 2
##   GENDER_AGE CLM_AMT
##   <chr>        <dbl>
## 1 F.<30        4934.
## 2 F.>55        5282.
## 3 F.30--40     5586.
## 4 F.40--55     5415.
## 5 M.<30        6708.
## 6 M.>55        6545.
## 7 M.30--40     5706.
## 8 M.40--55     5709.

Overall averages per gender:

tibble(GENDER = paste(d_fact$GENDER), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
  filter(CLM_AMT > 0) %>%
  group_by(GENDER) %>%
  summarise(CLM_AMT = mean(CLM_AMT, na.rm = TRUE))
## # A tibble: 2 × 2
##   GENDER CLM_AMT
##   <chr>    <dbl>
## 1 F        5410.
## 2 M        5889.

While both males and females have higher claims at younger ages, this effect is more pronounced in males: the average claim amount for a young male (<30) is about 36% higher than for the same age group in females (6708 vs. 4934 USD).

So, while the young female has a somewhat higher probability of being involved in an accident (52.7% vs. 49.3% for the under-30 group), the claims of males are much higher. (This effect would be even stronger if we considered only the group that was involved in an accident.)
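The two effects can be combined into an expected cost per policy, \(E[\text{cost}] = P(\text{claim}) \times E[\text{CLM\_AMT} \mid \text{claim}]\); a sketch:

# Expected cost per policy for each gender/age cell: the probability of a
# claim times the average claim amount of those who claimed.
tibble(GENDER_AGE = paste(d_fact$GENDER.AGE),
       claim      = as.numeric(as.character(df$CLAIM_FLAG)),
       amt        = as.numeric(df$CLM_AMT)) %>%
  group_by(GENDER_AGE) %>%
  summarise(p_claim       = mean(claim),
            avg_claim     = mean(amt[amt > 0]),
            expected_cost = p_claim * avg_claim)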

8 Output Data

8.1 A Completely Binned Dataset with Text Labels (All Factorial Data)

Useful for logistic regression.

This is d_fact, and we save it for later use.

saveRDS(d_fact %>% select(-c(CLM_AMT)), './data/d_fact.R')

8.2 Numerical Values Normalised

We choose to normalize so that all variables range from 0 to 1, i.e. \(x' = (x - \min x) / (\max x - \min x)\); this is the method “range” in the code below.

preProc_norm <- caret::preProcess(as.data.frame(df), method="range")
d_norm <- predict(preProc_norm, as.data.frame(df)) %>%
  as_tibble()   # we prefer tibbles; the factor columns are kept as they are

# Now replace all factorial columns with the binned ones:
for (col in colnames(d_norm)) {
    if (!(col == 'CLM_AMT') & (is.factor(d_norm[[col]]))) {
      d_norm[[col]] <- d_fact[[col]]
      }
  }
saveRDS(preProc_norm, './data/preProc_norm.R')
saveRDS(d_norm,       './data/d_norm.R')

8.3 A Binary Dataset Based on the Factorial One

Some methods such as linear regressions, generalized linear regressions, neural networks, etc. will require all data to be numerical. So, we prepare an equivalent data set for those models.
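The helper expand_factor2bin() is defined in ethics_functions.R; its code is not reproduced here, but a minimal sketch of what it presumably does (one 0/1 indicator column per factor level, named level-wise such as CLAIM_FLAG.1) is:

# Hypothetical sketch of expand_factor2bin(): one binary column per level
# of the factor (the actual implementation is in ethics_functions.R).
expand_factor2bin_sketch <- function(data, col) {
  f   <- data[[col]]
  out <- sapply(levels(f), function(lv) as.integer(f == lv))
  colnames(out) <- paste(col, levels(f), sep = ".")
  as.data.frame(out)
}
head(expand_factor2bin_sketch(d_fact, "CAR_USE"), 3)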

d_bin <- d_fact[, FALSE] # create empty data frame with N rows and 0 cols
for (col in colnames(d_fact)) {
    if (is.factor(d_fact[[col]])) {
      x <- expand_factor2bin(d_fact, col)  # get the expanded columns
      
      # find the largest bin and leave it out (it acts as the reference level)
      sums <- colSums(x)
      theCol <- which.max(sums)
      x <- x[, -theCol]
      
      d_bin <- bind_cols(d_bin, x)     # add them to d_bin
    } 
}
n <- which(colnames(d_bin) == "CLAIM_FLAG.1")
colnames(d_bin)[n] <- "CLAIM_FLAG"
saveRDS(d_bin, './data/d_bin.R')

9 References

9.1 Acknowledgement

In this document we used the workflow and code from: De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.

In this document we used mainly:

  • Part III (p. 213–256): Data Import
  • Part IV (p. 257–372): Data Wrangling

Bibliography

Agarwal, Sray, and Shashin Mishra. 2021. Responsible AI: Implementing Ethical and Unbiased Algorithms. Springer.
Coeckelbergh, Mark. 2020. AI Ethics. MIT Press.
De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
Dubber, Markus Dirk, Frank Pasquale, and Sunit Das. 2020. The Oxford Handbook of Ethics of AI. Oxford University Press.
Hardt, Moritz, Solon Barocas, and Arvind Narayanan. 2018. “Fairness and Machine Learning: Limitations and Opportunities.” fairmlbook.org.

  1. We already argued that the missing values in EDUCATION might indicate lower education; the missing values in the job category might be indicative of higher risk too.↩︎