
1 Executive Summary

This is part 1 of an analysis that demonstrates bias in data and explores the possibilities to de-bias it. In this document we show how to clean and prepare the data.

2 Introduction

The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like:

  • Is it fair to use gender?
  • No? Well, can we use age then? That is a good proxy for driving experience, maturity, etc., and indeed has a high information value.
  • Also not? Well, let’s just look at the past track record then (did the person have an accident in the last 5 years?).
  • But wait, isn’t that somehow discriminatory, because only people who have been driving for at least 5 years could build up such a track record?
  • That leaves us with wealth-related parameters such as income, the value of the car, etc. They can serve the same purpose … but isn’t that discriminatory against poor people?
  • So we are left with features related to car use (private or commercial use, daily average travel time).
  • Such a model won’t be that strong. The insurance company then needs to increase prices … which will again impact the poor …

The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature results in a more desirable or a less desirable outcome. Leaving out the feature will make the model weaker and increase prices for all … hitting hardest the very group that we set out to protect.
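
To make this tension explicit we can compare, for the model’s prediction \(\hat{Y}\) (1 if a claim is expected), the outcomes across the groups defined by \(S\); this formalization is added here only to fix ideas:

\[ P(\hat{Y} = 1 \mid S = a) \;\neq\; P(\hat{Y} = 1 \mid S = b) \]

for at least two groups \(a\) and \(b\). Using \(S\) (or a proxy of it) creates such a gap, while removing it weakens the model for everyone.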

Workflow

The workflow is inspired by (De Brouwer 2020) and was prepared by Philippe De Brouwer to illustrate the concepts explained in the book.

We will approach the problem in a few phases:

  1. Prepare the data: this file.
  2. Build a naive model.
  3. Build a classification model with CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed).
  4. Find a price for the insurance policy. We can use a simple average or build a model: we will build a regression model to determine the fee that the customer should pay, based on CLM_AMT (the amount claimed) of those customers that are expected to have an accident (see the note after this list).
  5. Decide on protected features.
  6. Remove the protected features and proxy features.
  7. Repeat both models.
  8. Draw conclusions.
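
For step 4, a helpful way to think about the fee (a standard decomposition stated here only for orientation, not something prescribed by the data) is as the expected claim cost per customer:

\[ \text{fee} \;\approx\; P(\text{CLAIM\_FLAG} = 1) \cdot E\big[\text{CLM\_AMT} \mid \text{CLAIM\_FLAG} = 1\big], \]

where the classification model of step 3 estimates the first factor and the regression model of step 4 estimates the second.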

3 Loading the Data

First we need to read in the data and some custom functions.

setwd("/home/philippe/Documents/courses/lectures/bias_data")

# Read in the data:
d0 = read_csv("./car_insurance_claim_data.csv") 

# Read in the functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)
## dollars_to_numeric : function (input)  
## expand_factor2bin : function (data, col)  
## fAUC : function (model, data, ...)  
## get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)  
## make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)  
## make_WOE_table : function (df, y)  
## opt_cut_off : function (perf, pred, cost.fp = 1)  
## pct_table : function (x, y, round_digits = 2)  
## space_to_underscore : function (input)  
## str_clean : function (x)
# Remove the temporary environment:
rm(tmt.env)

The data is imported from Kaggle; the licence is unknown and it is believed to be “public domain.”

3.1 Data Dictionary

The data contains 10302 rows with 27 columns.

Each record (row) represents a set of attributes of an individual customer of the insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable CLAIM_FLAG is 1 if the customer’s car was in a crash, and 0 if not. The continuous response variable CLM_AMT gives the cost related to the car crash.

The variables of the data are: ID, KIDSDRIV, BIRTH, AGE, HOMEKIDS, YOJ, INCOME, PARENT1, HOME_VAL, MSTATUS, GENDER, EDUCATION, OCCUPATION, TRAVTIME, CAR_USE, BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, CLM_AMT, CAR_AGE, CLAIM_FLAG, URBANICITY.

For the purpose of this exercise we added a variable RACE, which is supposed to represent a proxy for a small privileged class, a large underprivileged class, and a reference “race.” The labels of this feature are chosen in that order: A, B, and C.
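
A purely hypothetical sketch of how such a synthetic feature could be generated is shown below; the actual generation used for this data set is not documented here, and the group proportions, the seed, and the object name race_example are illustrative assumptions only.

# Hypothetical illustration only (not run, and not the actual generation used
# for this data set): assumed proportions for a small privileged class A,
# a large underprivileged class B, and a reference class C.
set.seed(1880)
race_example <- sample(c("A", "B", "C"), size = nrow(d0), replace = TRUE,
                       prob = c(0.10, 0.55, 0.35))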

The data dictionary enriched with some additional variables is as follows:

| variable name | definition | prejudice or expectation |
|---------------|------------|--------------------------|
| ID | unique identifier | none |
| KIDSDRIV | number of teenage drivers | teenagers cause more accidents |
| BIRTH | birth date | used to derive age |
| AGE | age in years | young people cause more accidents, and older people too |
| HOMEKIDS | number of kids at home | more kids might mean more distraction in the car |
| YOJ | years on the job | people that job-hop might be more prone to risk taking and accidents |
| INCOME | annual income in USD | income correlates with reliability and responsibility |
| PARENT1 | “yes” or “no”: single parent | not clear |
| HOME_VAL | home value in USD if owner | similar to INCOME |
| MSTATUS | marital status, “yes” if married | marriage might be a sign of stability and risk aversion(?) |
| GENDER | sex | men cause more and more expensive accidents |
| EDUCATION | level of the diploma | higher education correlates with safer driving |
| OCCUPATION | categories of employment | white collar workers might drive more safely |
| TRAVTIME | distance to work (probably in minutes) | a longer distance translates into a higher probability of being involved in an accident |
| CAR_USE | commercial or private use | commercial use might be more risky |
| BLUEBOOK | resale value of the car | not clear |
| TIF | time in force = the time with the same insurer (numeric, probably in years, minimum is “1”) | longer should be better |
| CAR_TYPE | categorical | sports cars are more prone to accidents than minivans |
| RED_CAR | “yes” if the car is red | urban legend says that red cars are more prone to accidents |
| OLDCLAIM | total amount in USD of claims in the last 5 years | past performance might be indicative, but note how this should interact with TIF |
| CLM_FREQ | not the claim frequency, but the number of claims in the last 5 years | past performance might be indicative |
| REVOKED | “yes” if the main driver’s licence was revoked during the last 7 years | a revoked licence should be indicative of driving style |
| MVR_PTS | motor vehicle record points (number) | traffic tickets should be indicative of driving style and hence the propensity to be involved in an accident |
| CLM_AMT | if in a car accident last year, the dollar amount of the claim paid by the insurer | target variable |
| CAR_AGE | age of the car in years | one might assume that drivers of older cars are less prudent |
| CLAIM_FLAG | 0 = NO, 1 = YES | target variable |
| URBANICITY | categorical, 2 options | cities should be more dangerous |

3.2 Data Quality

We were not able to contact the data provider, and close observation made us assume that some variables have been manipulated to eliminate missing values. The missing values seem to have been assigned to some “assumption”: for example, in the variable EDUCATION we have “<High School” (lower than high school) and “z_High School”. The latter seems to be the collection of missing values.

summarytools::ctable(factor(d0$EDUCATION), factor(d0$CLAIM_FLAG))
| factor(d0$EDUCATION) | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|----------------------|----------------|----------------|-------|
| <High School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) |
| Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) |
| Masters | 1674 (80.6%) | 404 (19.4%) | 2078 (100.0%) |
| PhD | 780 (83.5%) | 154 (16.5%) | 934 (100.0%) |
| z_High School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |

The table above for the EDUCATION variable shows that the probability of being involved in an accident decreases as the level of education increases. In that light the classes <High_School and z_High_School are not really logical: people with less than high school are involved in accidents in 32.5% of the cases, while z_High_School shows an even higher accident probability (34.7%), whereas a genuine “High School” class should sit between <High_School and Bachelors. This supports the suspicion that z_High_School also collects the missing values.

For this particular variable we might merge the two classes with the highest probability of being involved in an accident into one class <=High_School. This is logically consistent and leaves us with similar bucket sizes.
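
A minimal sketch of such a regrouping, assuming the raw labels shown in the table above and using forcats::fct_collapse() (part of the tidyverse); the hypothetical variable education_merged is not used further in this document:

# Sketch only, not applied to the data used below: merge the two classes with
# the highest accident probability into one level "<=High_School".
education_merged <- forcats::fct_collapse(factor(d0$EDUCATION),
                      "<=High_School" = c("<High School", "z_High School"))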

We therefore assume that this unfortunate data manipulation was also done in other variables. For example, the variable GENDER seems to have mixed the missing values into the female class. Conclusions based on small differences will hence not be reliable.

summarytools::ctable(factor(d0$GENDER), factor(d0$CLAIM_FLAG))
| factor(d0$GENDER) | CLAIM_FLAG = 0 | CLAIM_FLAG = 1 | Total |
|-------------------|----------------|----------------|-------|
| M | 3539 (74.4%) | 1218 (25.6%) | 4757 (100.0%) |
| z_F | 4017 (72.4%) | 1528 (27.6%) | 5545 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |

These considerations put serious question marks over the reliability of the data.

Issues to consider for data quality:

  • the partial treatment of missing values (see above);
  • the huge proportion of accidents: this data set is probably a subset that was designed to contain more accidents (one would expect one in a thousand, not 26%);
  • the relatively large share of red cars points to a similar issue (see the quick check below).
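
Both shares can be verified directly on the raw data; the same numbers appear in the univariate summary in the next section.

# Quick check of the proportions mentioned above:
prop.table(table(d0$CLAIM_FLAG))   # about 27% of the records have a claim
prop.table(table(d0$RED_CAR))      # about 29% of the cars are red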

3.3 First Data Cleanup

# Preprocess the variables into numeric / factors as necessary

# dollar values are in the format "$20,540" (string) -> convert this:
df <- as_tibble(d0) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM", "CLM_AMT"),
            parse_number) %>% 
  mutate_at(c("EDUCATION","OCCUPATION","CAR_TYPE","URBANICITY"),
            space_to_underscore) %>% 
  mutate_at(c("PARENT1", "MSTATUS", "GENDER", "EDUCATION","OCCUPATION", "CAR_USE", "CAR_TYPE", "RED_CAR", "REVOKED", "URBANICITY"),
            as.factor) %>% 
  mutate(CLAIM_FLAG = as.factor(CLAIM_FLAG))
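
As a quick illustration of the dollar conversion: readr::parse_number() strips the currency symbol and the thousands separator from strings such as the one mentioned in the comment above.

# parse_number() turns "$20,540" into a plain number:
readr::parse_number("$20,540")
## [1] 20540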

We also notice that

  • the ID column will never yield useful information
  • the BIRTH date carries the same information as AGE, but in a less usable format.

Therefore, we will remove those columns before moving forward.

df <- df %>% dplyr::select(-c(ID, BIRTH))

4 Exploring the Data (Univariate Analysis)

The univariate summaries for the individual variables (after some cleaning) are provided below.

summarytools::dfSummary(df, 
          plain.ascii  = FALSE,
          style        = 'grid',
          graph.magnif = 0.85,
          varnumbers = FALSE,
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp")
| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|----------|----------------|--------------------|---------|
| KIDSDRIV [numeric] | Mean (sd): 0.2 (0.5); min < med < max: 0 < 0 < 4; IQR (CV): 0 (3) | 0: 9069 (88.0%); 1: 804 (7.8%); 2: 351 (3.4%); 3: 74 (0.7%); 4: 4 (0.0%) | 0 (0.0%) |
| AGE [numeric] | Mean (sd): 44.8 (8.6); min < med < max: 16 < 45 < 81; IQR (CV): 12 (0.2) | 61 distinct values | 7 (0.1%) |
| HOMEKIDS [numeric] | Mean (sd): 0.7 (1.1); min < med < max: 0 < 0 < 5; IQR (CV): 1 (1.5) | 0: 6694 (65.0%); 1: 1106 (10.7%); 2: 1427 (13.9%); 3: 856 (8.3%); 4: 201 (2.0%); 5: 18 (0.2%) | 0 (0.0%) |
| YOJ [numeric] | Mean (sd): 10.5 (4.1); min < med < max: 0 < 11 < 23; IQR (CV): 4 (0.4) | 21 distinct values | 548 (5.3%) |
| INCOME [numeric] | Mean (sd): 61572.1 (47457.2); min < med < max: 0 < 53529 < 367030; IQR (CV): 58582 (0.8) | 8151 distinct values | 570 (5.5%) |
| PARENT1 [factor] | 1. No; 2. Yes | 8959 (87.0%); 1343 (13.0%) | 0 (0.0%) |
| HOME_VAL [numeric] | Mean (sd): 154523 (129188.4); min < med < max: 0 < 160661 < 885282; IQR (CV): 238256 (0.8) | 6334 distinct values | 575 (5.6%) |
| MSTATUS [factor] | 1. Yes; 2. z_No | 6188 (60.1%); 4114 (39.9%) | 0 (0.0%) |
| GENDER [factor] | 1. M; 2. z_F | 4757 (46.2%); 5545 (53.8%) | 0 (0.0%) |
| EDUCATION [factor] | 1. <High_School; 2. Bachelors; 3. Masters; 4. PhD; 5. z_High_School | 1515 (14.7%); 2823 (27.4%); 2078 (20.2%); 934 (9.1%); 2952 (28.7%) | 0 (0.0%) |
| OCCUPATION [factor] | 1. Clerical; 2. Doctor; 3. Home_Maker; 4. Lawyer; 5. Manager; 6. Professional; 7. Student; 8. z_Blue_Collar | 1590 (16.5%); 321 (3.3%); 843 (8.7%); 1031 (10.7%); 1257 (13.0%); 1408 (14.6%); 899 (9.3%); 2288 (23.7%) | 665 (6.5%) |
| TRAVTIME [numeric] | Mean (sd): 33.4 (15.9); min < med < max: 5 < 33 < 142; IQR (CV): 22 (0.5) | 100 distinct values | 0 (0.0%) |
| CAR_USE [factor] | 1. Commercial; 2. Private | 3789 (36.8%); 6513 (63.2%) | 0 (0.0%) |
| BLUEBOOK [numeric] | Mean (sd): 15659.9 (8428.8); min < med < max: 1500 < 14400 < 69740; IQR (CV): 11690 (0.5) | 2985 distinct values | 0 (0.0%) |
| TIF [numeric] | Mean (sd): 5.3 (4.1); min < med < max: 1 < 4 < 25; IQR (CV): 6 (0.8) | 23 distinct values | 0 (0.0%) |
| CAR_TYPE [factor] | 1. Minivan; 2. Panel_Truck; 3. Pickup; 4. Sports_Car; 5. Van; 6. z_SUV | 2694 (26.2%); 853 (8.3%); 1772 (17.2%); 1179 (11.4%); 921 (8.9%); 2883 (28.0%) | 0 (0.0%) |
| RED_CAR [factor] | 1. no; 2. yes | 7326 (71.1%); 2976 (28.9%) | 0 (0.0%) |
| OLDCLAIM [numeric] | Mean (sd): 4034 (8733.1); min < med < max: 0 < 0 < 57037; IQR (CV): 4647.5 (2.2) | 3545 distinct values | 0 (0.0%) |
| CLM_FREQ [numeric] | Mean (sd): 0.8 (1.2); min < med < max: 0 < 0 < 5; IQR (CV): 2 (1.4) | 0: 6292 (61.1%); 1: 1279 (12.4%); 2: 1492 (14.5%); 3: 992 (9.6%); 4: 225 (2.2%); 5: 22 (0.2%) | 0 (0.0%) |
| REVOKED [factor] | 1. No; 2. Yes | 9041 (87.8%); 1261 (12.2%) | 0 (0.0%) |
| MVR_PTS [numeric] | Mean (sd): 1.7 (2.2); min < med < max: 0 < 1 < 13; IQR (CV): 3 (1.3) | 14 distinct values | 0 (0.0%) |
| CLM_AMT [numeric] | Mean (sd): 1511.3 (4725.2); min < med < max: 0 < 0 < 123247; IQR (CV): 1144.8 (3.1) | 2346 distinct values | 0 (0.0%) |
| CAR_AGE [numeric] | Mean (sd): 8.3 (5.7); min < med < max: -3 < 8 < 28; IQR (CV): 11 (0.7) | 30 distinct values | 639 (6.2%) |
| CLAIM_FLAG [factor] | 1. 0; 2. 1 | 7556 (73.3%); 2746 (26.7%) | 0 (0.0%) |
| URBANICITY [factor] | 1. Highly_Urban/ Urban; 2. z_Highly_Rural/ Rural | 8230 (79.9%); 2072 (20.1%) | 0 (0.0%) |

5 Handling the Missing Values

5.1 Exploring the Structure of Missing Values

DataExplorer::plot_missing(d0, title = "Percentage of missing data per variable")