Table of Contents

1 Executive Summary

This is part 1 of the analysis that demonstrates bias in data and explores the possibilities to de-bias it. In this document we show how to clean the data and prepare it for modelling. We will prepare two data-sets:

  • one with binned data – for generalized linear models
  • one with un-binned data – for decision trees and other machine learning techniques that have their own binning mechanism

Then those data-frames are saved to disk, so that they can be used by the next file to build the models.

2 Introduction

The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? We will explore questions like:

  • Is it fair to use gender?
  • No? Well, can we use age then? That is a good proxy for driving experience, maturity, etc., and indeed has a high information value.
  • Also not? Well, let’s just look at the past track record then (did the person have an accident in the last 5 years?).
  • But wait, isn’t that discriminatory against people who have been driving for less than 5 years, since only drivers with at least 5 years of experience could build up such a track record?
  • That leaves us wealth-related parameters such as income, value of the car, etc. They do the same job … but isn’t that discriminatory for poor people?
  • So, we are left with things related to car use (private or commercial use, daily average travel time).
  • This model won’t be that strong. The insurance company now needs to increase prices … which will impact the poor …

The dilemma is that once we believe that there is a protected feature \(S\) (say income), there are at least two groups for which using the feature results in a more or a less desirable outcome. Leaving out the feature makes the model weaker and increases prices for all … hitting hardest the very group that we set out to protect.

Workflow

The work-flow is inspired by (De Brouwer 2020) and was prepared by Philippe De Brouwer to illustrate the concepts explained in the book.

We will approach the problem in a few phases:

  1. Prepare the data: this file
  2. Build a naive model
  3. Build a classification model with CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed)
  4. Now we need to find a price for the insurance policy. We can use a simple average or build a model. We will build a regression model that determines the fee the customer should pay, based on CLM_AMT (the amount claimed) of those customers that are expected to have an accident
  5. Decide on Protected Features
  6. Remove protected features and proxy features
  7. Repeat both models
  8. Draw conclusions

3 Loading the Data

First we need to read in data and some custom functions.

# Load the libraries used throughout this document:
library(tidyverse)   # provides read_csv, dplyr, etc.

setwd("/home/philippe/Documents/courses/lectures/bias_data")

# Read in the data:
d0 <- read_csv("./car_insurance_claim_data.csv")

# Read in the functions:
source("ethics_functions.R")

# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir=tmt.env)

dollars_to_numeric : function (input)
expand_factor2bin : function (data, col)
f_comment_iv : function (df, x, y)
fAUC : function (model, data, ...)
get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)
make_dependency_hist_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
make_dependency_plot : function (data, x, y, title, method = "loess", q_ymax = 1, q_xmax = 1)
make_WOE_table : function (df, y)
opt_cut_off : function (perf, pred, cost.fp = 1)
pct_table : function (x, y, round_digits = 2)
space_to_underscore : function (input)
str_clean : function (x)

# Remove the temporary environment:
rm(tmt.env)

The data was imported from Kaggle; the licence is unknown and it is believed to be in the public domain.

3.1 Data Dictionary

The data contains 10302 rows with 27 columns.

Each record (row) represents a set of attributes of an individual customer of an insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable CLAIM_FLAG is 1 if the customer’s car was in a crash, and 0 if not. The continuous response variable CLM_AMT gives the cost related to the car crash.

The variables of the data are: ID, KIDSDRIV, BIRTH, AGE, HOMEKIDS, YOJ, INCOME, PARENT1, HOME_VAL, MSTATUS, GENDER, EDUCATION, OCCUPATION, TRAVTIME, CAR_USE, BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, CLM_AMT, CAR_AGE, CLAIM_FLAG, URBANICITY.

The data dictionary is as follows:

variable name definition prejudice or expectation
ID unique identifier none
KIDSDRIV number of teenager drivers teenagers cause more accidents
BIRTH birth date used to derive age
AGE age in years young people cause more accidents, and older people too
HOMEKIDS nbr kids at home more kids might be more distraction in the car
YOJ years on job people that job-hop might be more prone to risk taking and accidents
INCOME annual income in USD income correlates to reliability and responsibility
PARENT1 yes or no single parent not clear
HOME_VAL home value in USD if owner similar to INCOME
MSTATUS marital status “yes” if married marriage might be a sign of stability and risk aversion(?)
GENDER sex men cause more and more expensive accidents
EDUCATION level of the diploma higher education correlates to safer driving
OCCUPATION categories of employment white collar workers might drive more safely
TRAVTIME distance to work (probably in minutes) a longer commute translates into a higher probability of being involved in an accident
CAR_USE commercial or private use commercial use might be more risky
BLUEBOOK resale value of the car not clear
TIF time in force = the time with the same insurer (numeric, probably in years, minimum is “1”) longer should be better
CAR_TYPE categorical sports cars are more prone to accidents than minivans
RED_CAR “yes” if the car is red urban legend says that red cars are more prone to accidents
OLDCLAIM total amount in USD of claims in the last 5 years your past performance might be indicative, but note how this should interact with TIF
CLM_FREQ not claim frequency, but the number of claims in the last 5 years past performance might be indicative
REVOKED “yes” if main driver’s licence was revoked during the last 7 years licence being revoked should be indicative for driving style
MVR_PTS motor vehicle record points (number) traffic tickets should be indicative for the driving style and hence propensity to be involved in an accident
CLM_AMT if last year in a car accident, the dollar amount of the claim paid by insurer target variable
CAR_AGE age of car in years one might assume that drivers of older cars are less prudent
CLAIM_FLAG 0 = NO, 1 = YES target variable
URBANICITY categorical, 2 options cities should be more dangerous

All variables are relevant. Therefore, we do not delete any columns at this point.

3.2 Data quality

We were not able to contact the data provider, and close observation leads us to assume that some variables have been manipulated to eliminate missing values. The missing values seem to have been assigned to some “assumption”; for example, in the variable EDUCATION we have “<High School” (lower than high school) and “z_High School”. The latter seems to be the collection of missing values.

EDUCATION <- factor(d0$EDUCATION);
CLAIM_FLAG <- factor(d0$CLAIM_FLAG);
summarytools::ctable(EDUCATION, CLAIM_FLAG) 
CLAIM_FLAG 0 1 Total
EDUCATION
<High School 1023 (67.5%) 492 (32.5%) 1515 (100.0%)
Bachelors 2150 (76.2%) 673 (23.8%) 2823 (100.0%)
Masters 1674 (80.6%) 404 (19.4%) 2078 (100.0%)
PhD 780 (83.5%) 154 (16.5%) 934 (100.0%)
z_High School 1929 (65.3%) 1023 (34.7%) 2952 (100.0%)
Total 7556 (73.3%) 2746 (26.7%) 10302 (100.0%)

The table above shows that the probability of being involved in an accident decreases with education. The positions of z_High_School and <High_School are therefore not really logical: people with less than high school are involved in accidents in 32.5% of the cases, while the z_High_School group shows an even higher probability (34.7%).

For this particular variable we might merge the two classes with the highest probability of being involved in an accident into one class <=High_School. This is logically consistent and leaves us with similar bucket sizes.
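As a sketch of such a merge (using the forcats package on a toy factor with the levels shown above, not on the real data):

```r
library(forcats)

# Illustrative sketch: collapse the regular and the "z_"-prefixed
# high-school levels into a single <=High_School bucket.
edu <- factor(c("<High School", "z_High School", "Bachelors", "Masters", "PhD"))
edu <- fct_collapse(edu,
                    `<=High_School` = c("<High School", "z_High School"))
levels(edu)  # the two high-school levels are now one bucket
```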

Therefore we assume that this unfortunate data manipulation was also done in other variables. For example, the variable GENDER seems to have the missing values mixed in with the females. Conclusions based on small differences will hence not be reliable.

GENDER <- factor(d0$GENDER)
summarytools::ctable(GENDER, CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
GENDER
M 3539 (74.4%) 1218 (25.6%) 4757 (100.0%)
z_F 4017 (72.4%) 1528 (27.6%) 5545 (100.0%)
Total 7556 (73.3%) 2746 (26.7%) 10302 (100.0%)

These considerations cast serious doubt on the reliability of the data.

Issues to consider for data quality:

  • the partial treatment of missing values (see above)
  • the huge number of accidents: this data set is probably a subset that has been designed to contain more accidents (one would expect roughly one in a thousand, not 26% accidents)
  • the relatively large proportion of red cars points to a similar issue

3.3 First Data Cleanup

# Preprocess the variables into numeric / factors as necessary

# dollar values are in the format "$20,540" (string) -> convert this:
df <- as_tibble(d0) %>%   # as.tbl() is deprecated in recent dplyr
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM", "CLM_AMT"),
            parse_number) %>% 
  mutate_at(c("EDUCATION","OCCUPATION","CAR_TYPE","URBANICITY"),
            space_to_underscore) %>% 
  mutate_at(c("PARENT1", "MSTATUS", "GENDER", "EDUCATION","OCCUPATION", "CAR_USE", "CAR_TYPE", "RED_CAR", "REVOKED", "URBANICITY"),
            as.factor) %>% 
  mutate(CLAIM_FLAG = as.factor(CLAIM_FLAG))

We also notice that

  • the ID column will never yield useful information
  • the BIRTH date encodes the same information as AGE, but in a less usable format.

Therefore, we will remove those columns before moving forward.

df <- df %>% dplyr::select(-c(ID, BIRTH))

4 Exploring the Data (Univariate Analysis)

The univariate summaries for the individual variables (after some cleaning) are provided below.

summarytools::dfSummary(df, 
          plain.ascii  = FALSE,
          style        = 'grid',
          graph.magnif = 0.85,
          varnumbers = FALSE,
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp")
Variable Stats / Values Freqs (% of Valid) Graph Missing
KIDSDRIV
[numeric]
Mean (sd) : 0.2 (0.5)
min < med < max:
0 < 0 < 4
IQR (CV) : 0 (3)
0 : 9069 (88.0%)
1 : 804 ( 7.8%)
2 : 351 ( 3.4%)
3 : 74 ( 0.7%)
4 : 4 ( 0.0%)
0
(0.0%)
AGE
[numeric]
Mean (sd) : 44.8 (8.6)
min < med < max:
16 < 45 < 81
IQR (CV) : 12 (0.2)
61 distinct values 7
(0.1%)
HOMEKIDS
[numeric]
Mean (sd) : 0.7 (1.1)
min < med < max:
0 < 0 < 5
IQR (CV) : 1 (1.5)
0 : 6694 (65.0%)
1 : 1106 (10.7%)
2 : 1427 (13.9%)
3 : 856 ( 8.3%)
4 : 201 ( 2.0%)
5 : 18 ( 0.2%)
0
(0.0%)
YOJ
[numeric]
Mean (sd) : 10.5 (4.1)
min < med < max:
0 < 11 < 23
IQR (CV) : 4 (0.4)
21 distinct values 548
(5.3%)
INCOME
[numeric]
Mean (sd) : 61572.1 (47457.2)
min < med < max:
0 < 53529 < 367030
IQR (CV) : 58582 (0.8)
8151 distinct values 570
(5.5%)
PARENT1
[factor]
1. No
2. Yes
8959 (87.0%)
1343 (13.0%)
0
(0.0%)
HOME_VAL
[numeric]
Mean (sd) : 154523 (129188.4)
min < med < max:
0 < 160661 < 885282
IQR (CV) : 238256 (0.8)
6334 distinct values 575
(5.6%)
MSTATUS
[factor]
1. Yes
2. z_No
6188 (60.1%)
4114 (39.9%)
0
(0.0%)
GENDER
[factor]
1. M
2. z_F
4757 (46.2%)
5545 (53.8%)
0
(0.0%)
EDUCATION
[factor]
1. <High_School
2. Bachelors
3. Masters
4. PhD
5. z_High_School
1515 (14.7%)
2823 (27.4%)
2078 (20.2%)
934 ( 9.1%)
2952 (28.7%)
0
(0.0%)
OCCUPATION
[factor]
1. Clerical
2. Doctor
3. Home_Maker
4. Lawyer
5. Manager
6. Professional
7. Student
8. z_Blue_Collar
1590 (16.5%)
321 ( 3.3%)
843 ( 8.7%)
1031 (10.7%)
1257 (13.0%)
1408 (14.6%)
899 ( 9.3%)
2288 (23.7%)
665
(6.5%)
TRAVTIME
[numeric]
Mean (sd) : 33.4 (15.9)
min < med < max:
5 < 33 < 142
IQR (CV) : 22 (0.5)
100 distinct values 0
(0.0%)
CAR_USE
[factor]
1. Commercial
2. Private
3789 (36.8%)
6513 (63.2%)
0
(0.0%)
BLUEBOOK
[numeric]
Mean (sd) : 15659.9 (8428.8)
min < med < max:
1500 < 14400 < 69740
IQR (CV) : 11690 (0.5)
2985 distinct values 0
(0.0%)
TIF
[numeric]
Mean (sd) : 5.3 (4.1)
min < med < max:
1 < 4 < 25
IQR (CV) : 6 (0.8)
23 distinct values 0
(0.0%)
CAR_TYPE
[factor]
1. Minivan
2. Panel_Truck
3. Pickup
4. Sports_Car
5. Van
6. z_SUV
2694 (26.2%)
853 ( 8.3%)
1772 (17.2%)
1179 (11.4%)
921 ( 8.9%)
2883 (28.0%)
0
(0.0%)
RED_CAR
[factor]
1. no
2. yes
7326 (71.1%)
2976 (28.9%)
0
(0.0%)
OLDCLAIM
[numeric]
Mean (sd) : 4034 (8733.1)
min < med < max:
0 < 0 < 57037
IQR (CV) : 4647.5 (2.2)
3545 distinct values 0
(0.0%)
CLM_FREQ
[numeric]
Mean (sd) : 0.8 (1.2)
min < med < max:
0 < 0 < 5
IQR (CV) : 2 (1.4)
0 : 6292 (61.1%)
1 : 1279 (12.4%)
2 : 1492 (14.5%)
3 : 992 ( 9.6%)
4 : 225 ( 2.2%)
5 : 22 ( 0.2%)
0
(0.0%)
REVOKED
[factor]
1. No
2. Yes
9041 (87.8%)
1261 (12.2%)
0
(0.0%)
MVR_PTS
[numeric]
Mean (sd) : 1.7 (2.2)
min < med < max:
0 < 1 < 13
IQR (CV) : 3 (1.3)
14 distinct values 0
(0.0%)
CLM_AMT
[numeric]
Mean (sd) : 1511.3 (4725.2)
min < med < max:
0 < 0 < 123247
IQR (CV) : 1144.8 (3.1)
2346 distinct values 0
(0.0%)
CAR_AGE
[numeric]
Mean (sd) : 8.3 (5.7)
min < med < max:
-3 < 8 < 28
IQR (CV) : 11 (0.7)
30 distinct values 639
(6.2%)
CLAIM_FLAG
[factor]
1. 0
2. 1
7556 (73.3%)
2746 (26.7%)
0
(0.0%)
URBANICITY
[factor]
1. Highly_Urban/ Urban
2. z_Highly_Rural/ Rural
8230 (79.9%)
2072 (20.1%)
0
(0.0%)

5 Missing Values

5.1 Exploring the Structure of Missing Values

The packages DataExplorer and naniar provide tools to investigate missing values.

DataExplorer::plot_missing(d0, title = "Percentage of missing data per variable")
plot_missing() from DataExplorer shows the percentage of missing values per variable.

naniar::gg_miss_upset(d0)
gg_miss_upset() shows patterns in missing data by exploring jointly missing values.\label{fig:missval1}

The correlation structure and amplitude of the missing values are illustrated in the figure above.

The variable OCCUPATION has the highest number of missing values: 665. So, 6.46% is missing.
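Such per-column figures are straightforward to compute; the sketch below runs the same computation on a small made-up data frame (the real call would take d0 instead of the toy tibble):

```r
# Count and rank missing values per column (toy data frame for illustration):
toy <- data.frame(A = c(1, NA, 3, NA),
                  B = c("x", "y", NA, "z"),
                  C = 1:4)
miss_n   <- colSums(is.na(toy))              # absolute counts per column
miss_pct <- round(100 * miss_n / nrow(toy), 2)  # as percentages
sort(miss_pct, decreasing = TRUE)            # A = 50, B = 25, C = 0
```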

# To illustrate the relevance of the NA with summarytools:
freq(d0$OCCUPATION) %>%
  kable(caption='The number of missing values is almost the same as the amount of Home makers or Students.')
The number of missing values is almost the same as the amount of Home makers or Students.
Freq % Valid % Valid Cum. % Total % Total Cum.
Clerical 1590 16.498910 16.49891 15.433896 15.43390
Doctor 321 3.330912 19.82982 3.115900 18.54980
Home Maker 843 8.747535 28.57736 8.182877 26.73267
Lawyer 1031 10.698350 39.27571 10.007765 36.74044
Manager 1257 13.043478 52.31919 12.201514 48.94195
Professional 1408 14.610356 66.92954 13.667249 62.60920
Student 899 9.328629 76.25817 8.726461 71.33566
z_Blue Collar 2288 23.741828 100.00000 22.209280 93.54494
<NA> 665 NA NA 6.455057 100.00000
Total 10302 100.000000 100.00000 100.000000 100.00000

The figure above also indicates that the data is not missing at random (MNAR)1.

We can consider the following solutions for missing values:

  1. Do not bother if the missing value is in a column that we will not use – in our case that could work for RED_CAR, which is not very useful, but that column has no missing values; most missing values are in OCCUPATION.
  2. Remove rows with missing values (in useful columns) – we have 6.46% missing in OCCUPATION.
  3. If the data is categorical, we can consider adding a specific category missing and study the correlation structure of the missing values.
  4. Impute the variable:
    1. if missing at random, we can use mice2
    2. if not missing at random, we can consider random-forest methods to impute values for the missing data. Good choices are: missForest, missRanger and Amelia.

5.2 Imputing Missing values

To make the computations lighter, we can first remove the rows that miss more than one value (these are small numbers in our case). About 3.22% of the rows have more than one data point missing (in total, 25.67% of the rows have at least one missing value).

# Delete the rows that have more than once field missing:
df <- df[rowSums(is.na(df)) <= 1, ]

# Use missRanger to impute missing values:
df <- missRanger::missRanger(df)

Here is how we could use mice. Note that this code chunk is not executed (eval=FALSE):

# Impute the columns with missing values

# install.packages("mice")
#library(mice)
df_missing_cols <- df[, c("AGE", "INCOME", "YOJ", "HOME_VAL", "CAR_AGE")]
df_imp_cols_tmp <- mice::mice(data = df_missing_cols, m = 1, method = "pmm", maxit = 50, seed = 500)
df_imp_cols <- mice::complete(df_imp_cols_tmp)
df_imp <- bind_cols(df %>% select(-AGE, -INCOME, -YOJ, -HOME_VAL, -CAR_AGE), df_imp_cols)

# For the rest of the analysis we use the imputed data
df <- df_imp[complete.cases(df_imp),]

Here is how we could use missForest. Note that this code chunk is not executed (eval=FALSE):

# Impute missing values via random forest estimates:
# ---
library(missForest)  # impute missing values

# Note: this fails:
# d1_imp <- missForest(df)$ximp
# While most code that works for a data-frame also works
# for a tibble, this is an exception, so we have 2 solutions:

# 1. coerce to data-frame
d_imp <- missForest(as.data.frame(df))$ximp

# 2. Manually fit the random forest and impute the missing data:
x_cc <- df %>% filter(!is.na(OCCUPATION))
x_nc <- df %>% filter(is.na(OCCUPATION))
rf <- randomForest::randomForest(OCCUPATION ~ ., data = x_cc)
x_nc$OCCUPATION <- predict(rf, x_nc)
x <- rbind(x_cc, x_nc)

6 Further Data Exploration

6.1 Numeric Variables

# Collect the numeric (non-factor) columns and add the target as a number:
d_cont <- df %>%
  select(where(~ !is.factor(.x))) %>%
  mutate(CLAIM_FLAG = as.numeric(as.character(df$CLAIM_FLAG)))
d_cont <- d_cont[complete.cases(d_cont), ]
# Use the library GGaly to explore the correlation structure
# ---
GGally::ggcorr(d_cont,  method = c("everything", "pearson"),
               label = TRUE, label_round = 2, label_size = 3,
               size = 2.5,  geom="circle")
The correlation structure of the data

6.2 Factorial Variables

df %>% 
  select_if(is.factor) %>% 
  iv(y = 'CLAIM_FLAG') %>%
  kable(
     caption = 'The table of all information values 
              for each categorical variable ordered in 
              decreasing order.')
The table of all information values for each categorical variable ordered in decreasing order.
variable info_value
URBANICITY 0.4031542
OCCUPATION 0.1916953
EDUCATION 0.1280311
PARENT1 0.1169597
REVOKED 0.1087880
CAR_TYPE 0.1051723
CAR_USE 0.0959599
MSTATUS 0.0856915
GENDER 0.0028151
RED_CAR 0.0004316

(De Brouwer 2020, p. 359) suggests the following rule of thumb for these information values:

IV predictability
< 0.02 not predictable
0.02 – 0.1 weak
0.1 – 0.3 medium
0.3 – 0.5 strong
> 0.5 suspicious

This implies that CAR_USE, MSTATUS, and GENDER are probably not very predictive. At least, we do not expect a linear relationship.

7 Data Binning

7.1 Introduction to binning

Data binning is a data preprocessing technique that groups continuous numerical values into a smaller number of discrete “bins” or “buckets” (or groups values of categorical variables). It simplifies data, reduces noise, and can improve the performance of analytical models by converting a continuous variable into a categorical one. For example, ages can be binned into groups like “0-17,” “18-64,” and “65+,” instead of using each individual age.

How data binning works

  • Define bins: Intervals are created to group the data. These can be of equal size, like equal age groups, or based on other criteria, such as quantiles.
  • Assign data points: Each individual data point is placed into the bin that corresponds to its value. For example, a person aged 25 would fall into the 20 to 30 bin.
  • Represent the bin: All values within a bin are then replaced by a single representative value for that bin, such as the midpoint or mean – unless we use the data as categorical.
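The steps above can be sketched with base R’s cut(); the break points and labels below are illustrative, not taken from the data:

```r
# Bin a handful of ages into half-open intervals [a, b):
ages <- c(16, 25, 42, 45, 60, 81)
age_bin <- cut(ages,
               breaks = c(-Inf, 31, 46, Inf),       # define the bins
               labels = c("young", "middle", "older"),
               right  = FALSE)                      # left-closed intervals
table(age_bin)   # each illustrative bin receives two observations
```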

Why data binning is used

  • Data smoothing: It reduces the impact of minor observation errors and noise by grouping values.
  • Outlier mitigation: Extreme values are grouped into bins, which lessens their influence on the analysis.
  • Simplifying analysis: It can make complex data easier to analyze, visualize, and understand by reducing the number of unique values.
  • Feature engineering: It can convert continuous variables into categorical ones, which can be more useful for certain machine learning algorithms.
  • Improving model performance: By reducing noise and handling outliers, binning can sometimes improve the accuracy of predictive models.

Considerations

The number and size of the bins can significantly affect the results. Too few bins can oversimplify the data, while too many can make it cluttered and fail to smooth out noise. Choosing the right binning method is important to avoid hiding important trends or creating misleading representations of the data.

Binning in R Note that the package {scorecard} provides a function woebin() that automatically proposes bins for all variables. The bins can then be fine-tuned with woebin_adj() and visualized with woebin_plot().

7.2 Information Value and Weight of evidence

Weight of Evidence (WoE) is a fundamental concept used primarily in credit scoring and binary classification modeling to quantify the strength and direction of the relationship between a predictor variable and the target outcome. WoE transforms categorical or binned continuous variables into continuous numeric values based on the distribution of “events” (e.g., defaults) and “non-events” (e.g., non-defaults). Mathematically, for a given bin \(i\), WoE is defined as:

\[ \text{WoE}_i = \log \left( \frac{\text{Distribution of non-events in bin } i}{\text{Distribution of events in bin } i} \right) = \log \left( \frac{P(\text{non-event}_i)}{P(\text{event}_i)} \right) \]

where \(P(\text{event}_i)\) and \(P(\text{non-event}_i)\) represent the proportions of events and non-events in bin \(i\) relative to the total events and non-events, respectively. A positive WoE indicates that the bin has a higher concentration of non-events than events (predicting a “good” outcome), while a negative WoE suggests the opposite.

One possible way to modify WoE so that each bin makes a non-negative contribution is the Information Value (IV) for bin \(i\): \[ \text{IV}_i = \left( P(\text{non-event}_i) - P(\text{event}_i) \right) \times \text{WoE}_i \]

The advantage of IV is that it can meaningfully be added over all bins of a given variable. That summary measure quantifies the variable’s overall predictive power and is calculated as:

\[ \text{IV} = \sum_{i} \left( P(\text{non-event}_i) - P(\text{event}_i) \right) \times \text{WoE}_i \]

Intuitively, IV measures how well the variable’s categories or bins differentiate between the two outcome groups. Variables with higher IV values are more predictive. Typical IV interpretation thresholds are:

  • IV < 0.02: Not predictive
  • 0.02 ≤ IV < 0.1: Weak predictive power
  • 0.1 ≤ IV < 0.3: Medium predictive power
  • 0.3 ≤ IV < 0.5: Strong predictive power
  • IV ≥ 0.5: Suspiciously high, possibly indicating data issues or over-fitting
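The two formulas above fit in a few lines of R. The helper woe_iv() below is a hypothetical sketch (it is not one of the functions sourced earlier) that assumes a factor x and a 0/1 target y:

```r
# Compute WoE and per-bin IV for a factor x against a binary target y:
woe_iv <- function(x, y) {
  tab   <- table(x, y)
  p_neg <- tab[, "0"] / sum(tab[, "0"])  # distribution of non-events per bin
  p_pos <- tab[, "1"] / sum(tab[, "1"])  # distribution of events per bin
  woe   <- log(p_neg / p_pos)
  data.frame(bin = rownames(tab),
             woe = woe,
             iv  = (p_neg - p_pos) * woe,  # per-bin IV contribution
             row.names = NULL)
}

# Toy example: bin "A" holds relatively more events, so its WoE is negative.
x <- factor(c("A", "A", "B", "B", "B", "B"))
y <- c(0, 1, 0, 0, 0, 1)
res <- woe_iv(x, y)
sum(res$iv)   # total IV of the variable
```

Note that the per-bin IV is never negative: \(P(\text{non-event}_i) - P(\text{event}_i)\) and \(\text{WoE}_i\) always share the same sign, so summing over bins cannot cancel anything out.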

7.3 Automated binning

The package scorecard provides tools to investigate existing bins, propose optimal bins, and implement or adjust them.

scorecard::woebin(df, 'CLAIM_FLAG')
✔ Binning on 9970 rows and 25 columns in 00:00:13
$KIDSDRIV
   variable      bin count count_distr   neg   pos   posprob         woe
     <char>   <char> <int>       <num> <int> <int>     <num>       <num>
1: KIDSDRIV [-Inf,1)  8769   0.8795386  6595  2174 0.2479188 -0.09420622
2: KIDSDRIV [1, Inf)  1201   0.1204614   724   477 0.3971690  0.59826188
        bin_iv   total_iv breaks is_special_values
         <num>      <num> <char>            <lgcl>
1: 0.007631785 0.05609787      1             FALSE
2: 0.048466082 0.05609787    Inf             FALSE

$AGE
   variable       bin count count_distr   neg   pos   posprob         woe
     <char>    <char> <int>       <num> <int> <int>     <num>       <num>
1:      AGE [-Inf,31)   504  0.05055165   270   234 0.4642857  0.87243594
2:      AGE   [31,40)  2218  0.22246740  1477   741 0.3340848  0.32576912
3:      AGE   [40,44)  1634  0.16389168  1196   438 0.2680539  0.01101776
4:      AGE   [44,46)   888  0.08906720   684   204 0.2297297 -0.19430114
5:      AGE   [46,57)  3873  0.38846540  3123   750 0.1936483 -0.41093937
6:      AGE [57, Inf)   853  0.08555667   569   284 0.3329426  0.32063058
         bin_iv  total_iv breaks is_special_values
          <num>     <num> <char>            <lgcl>
1: 4.482427e-02 0.1418769     31             FALSE
2: 2.531670e-02 0.1418769     40             FALSE
3: 1.994621e-05 0.1418769     44             FALSE
4: 3.206609e-03 0.1418769     46             FALSE
5: 5.908713e-02 0.1418769     57             FALSE
6: 9.422218e-03 0.1418769    Inf             FALSE

$HOMEKIDS
   variable      bin count count_distr   neg   pos   posprob        woe
     <char>   <char> <int>       <num> <int> <int>     <num>      <num>
1: HOMEKIDS [-Inf,1)  6456   0.6475426  5023  1433 0.2219641 -0.2387204
2: HOMEKIDS    [1,3)  2473   0.2480441  1621   852 0.3445208  0.3723248
3: HOMEKIDS [3, Inf)  1041   0.1044132   675   366 0.3515850  0.4034574
       bin_iv   total_iv breaks is_special_values
        <num>      <num> <char>            <lgcl>
1: 0.03479236 0.09048389      1             FALSE
2: 0.03719890 0.09048389      3             FALSE
3: 0.01849263 0.09048389    Inf             FALSE

$YOJ
   variable         bin count count_distr   neg   pos   posprob        woe
     <char>      <char> <int>       <num> <int> <int>     <num>      <num>
1:      YOJ    [-Inf,1)   818  0.08204614   496   322 0.3936430  0.5835124
2:      YOJ       [1,8)   852  0.08545637   658   194 0.2276995 -0.2058100
3:      YOJ      [8,11)  2160  0.21664995  1530   630 0.2916667  0.1282336
4:      YOJ   [11,14.5)  5154  0.51695085  3909  1245 0.2415600 -0.1286093
5:      YOJ [14.5, Inf)   986  0.09889669   726   260 0.2636917 -0.0113316
         bin_iv   total_iv breaks is_special_values
          <num>      <num> <char>            <lgcl>
1: 3.133157e-02 0.04674319      1             FALSE
2: 3.441773e-03 0.04674319      8             FALSE
3: 3.667637e-03 0.04674319     11             FALSE
4: 8.289544e-03 0.04674319   14.5             FALSE
5: 1.266512e-05 0.04674319    Inf             FALSE

$INCOME
   variable            bin count count_distr   neg   pos   posprob        woe
     <char>         <char> <int>       <num> <int> <int>     <num>      <num>
1:   INCOME   [-Inf,15000)  1517   0.1521565   953   564 0.3717864  0.4909761
2:   INCOME  [15000,70000)  4955   0.4969910  3507  1448 0.2922301  0.1309591
3:   INCOME  [70000,85000)  1032   0.1035105   794   238 0.2306202 -0.1892760
4:   INCOME [85000,120000)  1453   0.1457372  1198   255 0.1754990 -0.5316085
5:   INCOME  [120000, Inf)  1013   0.1016048   867   146 0.1441264 -0.7658956
        bin_iv  total_iv breaks is_special_values
         <num>     <num> <char>            <lgcl>
1: 0.040525592 0.1372732  15000             FALSE
2: 0.008780174 0.1372732  70000             FALSE
3: 0.003540847 0.1372732  85000             FALSE
4: 0.035880087 0.1372732 120000             FALSE
5: 0.048546482 0.1372732    Inf             FALSE

$PARENT1
   variable    bin count count_distr   neg   pos   posprob        woe
     <char> <char> <int>       <num> <int> <int>     <num>      <num>
1:  PARENT1     No  8669   0.8695085  6602  2067 0.2384358 -0.1457376
2:  PARENT1    Yes  1301   0.1304915   717   584 0.4488855  0.8103619
       bin_iv  total_iv breaks is_special_values
        <num>     <num> <char>            <lgcl>
1: 0.01782808 0.1169597     No             FALSE
2: 0.09913159 0.1169597    Yes             FALSE

$HOME_VAL
   variable             bin count count_distr   neg   pos   posprob         woe
     <char>          <char> <int>       <num> <int> <int>     <num>       <num>
1: HOME_VAL    [-Inf,80000)  3148   0.3157472  1974  1174 0.3729352  0.49589156
2: HOME_VAL  [80000,220000)  3851   0.3862588  2846  1005 0.2609712 -0.02539018
3: HOME_VAL [220000,260000)  1065   0.1068205   854   211 0.1981221 -0.38253628
4: HOME_VAL   [260000, Inf)  1906   0.1911735  1645   261 0.1369360 -0.82543848
         bin_iv  total_iv breaks is_special_values
          <num>     <num> <char>            <lgcl>
1: 0.0858600419 0.2045521  80000             FALSE
2: 0.0002475215 0.2045521 220000             FALSE
3: 0.0141882708 0.2045521 260000             FALSE
4: 0.1042562364 0.2045521    Inf             FALSE

$MSTATUS
   variable    bin count count_distr   neg   pos   posprob        woe
     <char> <char> <int>       <num> <int> <int>     <num>      <num>
1:  MSTATUS    Yes  5999   0.6017051  4685  1314 0.2190365 -0.2557532
2:  MSTATUS   z_No  3971   0.3982949  2634  1337 0.3366910  0.3374615
       bin_iv  total_iv breaks is_special_values
        <num>     <num> <char>            <lgcl>
1: 0.03694426 0.0856915    Yes             FALSE
2: 0.04874724 0.0856915   z_No             FALSE

$GENDER
   variable    bin count count_distr   neg   pos   posprob         woe
     <char> <char> <int>       <num> <int> <int>     <num>       <num>
1:   GENDER      M  4552   0.4565697  3393  1159 0.2546134 -0.05862014
2:   GENDER    z_F  5418   0.5434303  3926  1492 0.2753784  0.04803319
        bin_iv    total_iv breaks is_special_values
         <num>       <num> <char>            <lgcl>
1: 0.001547244 0.002815052      M             FALSE
2: 0.001267808 0.002815052    z_F             FALSE

$EDUCATION
    variable           bin count count_distr   neg   pos   posprob        woe
      <char>        <char> <int>       <num> <int> <int>     <num>      <num>
1: EDUCATION  <High_School  1488  0.14924774  1003   485 0.3259409  0.2889349
2: EDUCATION     Bachelors  2769  0.27773320  2111   658 0.2376309 -0.1501753
3: EDUCATION       Masters  1947  0.19528586  1578   369 0.1895223 -0.4375801
4: EDUCATION           PhD   867  0.08696088   729   138 0.1591696 -0.6488833
5: EDUCATION z_High_School  2899  0.29077232  1898  1001 0.3452915  0.3757356
        bin_iv  total_iv        breaks is_special_values
         <num>     <num>        <char>            <lgcl>
1: 0.013264784 0.1280311  <High_School             FALSE
2: 0.006039926 0.1280311     Bachelors             FALSE
3: 0.033435698 0.1280311       Masters             FALSE
4: 0.030853065 0.1280311           PhD             FALSE
5: 0.044437614 0.1280311 z_High_School             FALSE

$OCCUPATION
     variable                                    bin count count_distr   neg
       <char>                                 <char> <int>       <num> <int>
1: OCCUPATION                               Clerical  1578   0.1582748  1103
2: OCCUPATION Doctor%,%Home_Maker%,%Lawyer%,%Manager  3797   0.3808425  3103
3: OCCUPATION                           Professional  1429   0.1433300  1087
4: OCCUPATION                Student%,%z_Blue_Collar  3166   0.3175527  2026
     pos   posprob        woe      bin_iv  total_iv
   <int>     <num>      <num>       <num>     <num>
1:   475 0.3010139  0.1730626 0.004927787 0.1533539
2:   694 0.1827759 -0.4821159 0.078188123 0.1533539
3:   342 0.2393282 -0.1408294 0.002747530 0.1533539
4:  1140 0.3600758  0.4405016 0.067490415 0.1533539
                                   breaks is_special_values
                                   <char>            <lgcl>
1:                               Clerical             FALSE
2: Doctor%,%Home_Maker%,%Lawyer%,%Manager             FALSE
3:                           Professional             FALSE
4:                Student%,%z_Blue_Collar             FALSE

$TRAVTIME
   variable       bin count count_distr   neg   pos   posprob         woe
     <char>    <char> <int>       <num> <int> <int>     <num>       <num>
1: TRAVTIME [-Inf,12)   865  0.08676028   689   176 0.2034682 -0.34922050
2: TRAVTIME   [12,22)  1442  0.14463390  1101   341 0.2364771 -0.15655488
3: TRAVTIME   [22,38)  3849  0.38605817  2847  1002 0.2603274 -0.02873103
4: TRAVTIME [38, Inf)  3814  0.38254764  2682  1132 0.2968013  0.15295997
         bin_iv   total_iv breaks is_special_values
          <num>      <num> <char>            <lgcl>
1: 0.0096903457 0.02268377     12             FALSE
2: 0.0034128465 0.02268377     22             FALSE
3: 0.0003165294 0.02268377     38             FALSE
4: 0.0092640473 0.02268377    Inf             FALSE

$CAR_USE
   variable        bin count count_distr   neg   pos   posprob        woe
     <char>     <char> <int>       <num> <int> <int>     <num>      <num>
1:  CAR_USE Commercial  3601   0.3611836  2350  1251 0.3474035  0.3850647
2:  CAR_USE    Private  6369   0.6388164  4969  1400 0.2198147 -0.2512096
       bin_iv   total_iv     breaks is_special_values
        <num>      <num>     <char>            <lgcl>
1: 0.05807364 0.09595988 Commercial             FALSE
2: 0.03788625 0.09595988    Private             FALSE

$BLUEBOOK
   variable           bin count count_distr   neg   pos   posprob         woe
     <char>        <char> <int>       <num> <int> <int>     <num>       <num>
1: BLUEBOOK   [-Inf,6000)  1005   0.1008024   609   396 0.3940299  0.58513272
2: BLUEBOOK  [6000,12000)  2888   0.2896690  2016   872 0.3019391  0.17745557
3: BLUEBOOK [12000,17000)  2272   0.2278837  1750   522 0.2297535 -0.19416670
4: BLUEBOOK [17000,27000)  2775   0.2783350  2090   685 0.2468468 -0.09996373
5: BLUEBOOK  [27000, Inf)  1030   0.1033099   854   176 0.1708738 -0.56391042
        bin_iv   total_iv breaks is_special_values
         <num>      <num> <char>            <lgcl>
1: 0.038717943 0.08747841   6000             FALSE
2: 0.009491214 0.08747841  12000             FALSE
3: 0.008193228 0.08747841  17000             FALSE
4: 0.002715522 0.08747841  27000             FALSE
5: 0.028360502 0.08747841    Inf             FALSE

$TIF
   variable      bin count count_distr   neg   pos   posprob          woe
     <char>   <char> <int>       <num> <int> <int>     <num>        <num>
1:      TIF [-Inf,4)  3586   0.3596790  2468  1118 0.3117680  0.223670048
2:      TIF    [4,5)  1565   0.1569709  1151   414 0.2645367 -0.006983656
3:      TIF    [5,9)  2568   0.2575727  1937   631 0.2457165 -0.106053022
4:      TIF [9, Inf)  2251   0.2257773  1763   488 0.2167925 -0.268919997
         bin_iv   total_iv breaks is_special_values
          <num>      <num> <char>            <lgcl>
1: 1.890529e-02 0.03701138      4             FALSE
2: 7.643171e-06 0.03701138      5             FALSE
3: 2.824219e-03 0.03701138      9             FALSE
4: 1.527423e-02 0.03701138    Inf             FALSE

$CAR_TYPE
   variable                               bin count count_distr   neg   pos
     <char>                            <char> <int>       <num> <int> <int>
1: CAR_TYPE                           Minivan  2631   0.2638917  2182   449
2: CAR_TYPE Panel_Truck%,%Pickup%,%Sports_Car  3634   0.3644935  2496  1138
3: CAR_TYPE                       Van%,%z_SUV  3705   0.3716148  2641  1064
     posprob        woe      bin_iv   total_iv
       <num>      <num>       <num>      <num>
1: 0.1706575 -0.5654375 0.072804664 0.09742236
2: 0.3131535  0.2301197 0.020306168 0.09742236
3: 0.2871795  0.1064145 0.004311527 0.09742236
                              breaks is_special_values
                              <char>            <lgcl>
1:                           Minivan             FALSE
2: Panel_Truck%,%Pickup%,%Sports_Car             FALSE
3:                       Van%,%z_SUV             FALSE

$RED_CAR
   variable    bin count count_distr   neg   pos   posprob         woe
     <char> <char> <int>       <num> <int> <int>     <num>       <num>
1:  RED_CAR     no  7126   0.7147442  5213  1913 0.2684536  0.01305396
2:  RED_CAR    yes  2844   0.2852558  2106   738 0.2594937 -0.03306509
         bin_iv    total_iv breaks is_special_values
          <num>       <num> <char>            <lgcl>
1: 0.0001221683 0.000431615     no             FALSE
2: 0.0003094467 0.000431615    yes             FALSE

$OLDCLAIM
   variable         bin count count_distr   neg   pos   posprob        woe
     <char>      <char> <int>       <num> <int> <int>     <num>      <num>
1: OLDCLAIM  [-Inf,500)  6117  0.61354062  5027  1090 0.1781919 -0.5131089
2: OLDCLAIM  [500,3500)   910  0.09127382   553   357 0.3923077  0.5779146
3: OLDCLAIM [3500, Inf)  2943  0.29518556  1739  1204 0.4091064  0.6478759
       bin_iv  total_iv breaks is_special_values
        <num>     <num> <char>            <lgcl>
1: 0.14145226 0.3159213    500             FALSE
2: 0.03416018 0.3159213   3500             FALSE
3: 0.14030884 0.3159213    Inf             FALSE

$CLM_FREQ
   variable      bin count count_distr   neg   pos   posprob        woe
     <char>   <char> <int>       <num> <int> <int>     <num>      <num>
1: CLM_FREQ [-Inf,1)  6117   0.6135406  5027  1090 0.1781919 -0.5131089
2: CLM_FREQ    [1,3)  2662   0.2670010  1605  1057 0.3970699  0.5978477
3: CLM_FREQ [3, Inf)  1191   0.1194584   687   504 0.4231738  0.7057788
       bin_iv  total_iv breaks is_special_values
        <num>     <num> <char>            <lgcl>
1: 0.14145226 0.3166536      1             FALSE
2: 0.10726896 0.3166536      3             FALSE
3: 0.06793237 0.3166536    Inf             FALSE

$REVOKED
   variable    bin count count_distr   neg   pos   posprob        woe
     <char> <char> <int>       <num> <int> <int>     <num>      <num>
1:  REVOKED     No  8750   0.8776329  6647  2103 0.2403429 -0.1352639
2:  REVOKED    Yes  1220   0.1223671   672   548 0.4491803  0.8115537
       bin_iv total_iv breaks is_special_values
        <num>    <num> <char>            <lgcl>
1: 0.01554164 0.108788     No             FALSE
2: 0.09324641 0.108788    Yes             FALSE

$MVR_PTS
   variable      bin count count_distr   neg   pos   posprob         woe
     <char>   <char> <int>       <num> <int> <int>     <num>       <num>
1:  MVR_PTS [-Inf,1)  4507  0.45205617  3641   866 0.1921456 -0.42059196
2:  MVR_PTS    [1,3)  2589  0.25967904  1931   658 0.2541522 -0.06105157
3:  MVR_PTS    [3,4)   938  0.09408225   634   304 0.3240938  0.28051553
4:  MVR_PTS    [4,6)  1207  0.12106319   763   444 0.3678542  0.47410331
5:  MVR_PTS [6, Inf)   729  0.07311936   350   379 0.5198903  1.09513983
         bin_iv  total_iv breaks is_special_values
          <num>     <num> <char>            <lgcl>
1: 0.0718384320 0.2148368      1             FALSE
2: 0.0009539695 0.2148368      3             FALSE
3: 0.0078684191 0.2148368      4             FALSE
4: 0.0299798062 0.2148368      6             FALSE
5: 0.1041961793 0.2148368    Inf             FALSE

$CLM_AMT
   variable         bin count count_distr   neg   pos   posprob   woe bin_iv
     <char>      <char> <int>       <num> <int> <int>     <num> <num>  <num>
1:  CLM_AMT [-Inf, Inf)  9970           1  7319  2651 0.2658977     0      0
   total_iv breaks is_special_values
      <num> <char>            <lgcl>
1:        0    Inf             FALSE

$CAR_AGE
   variable         bin count count_distr   neg   pos   posprob         woe
     <char>      <char> <int>       <num> <int> <int>     <num>       <num>
1:  CAR_AGE    [-Inf,5)  2864   0.2872618  1921   943 0.3292598  0.30400190
2:  CAR_AGE    [5,10.5)  3706   0.3717151  2673  1033 0.2787372  0.06480253
3:  CAR_AGE [10.5,12.5)  1009   0.1012036   788   221 0.2190287 -0.25579861
4:  CAR_AGE [12.5, Inf)  2391   0.2398195  1937   454 0.1898787 -0.43526169
        bin_iv   total_iv breaks is_special_values
         <num>      <num> <char>            <lgcl>
1: 0.028347349 0.07680012      5             FALSE
2: 0.001584451 0.07680012   10.5             FALSE
3: 0.006215961 0.07680012   12.5             FALSE
4: 0.040652358 0.07680012    Inf             FALSE

$URBANICITY
     variable                   bin count count_distr   neg   pos    posprob
       <char>                <char> <int>       <num> <int> <int>      <num>
1: URBANICITY   Highly_Urban/ Urban  7948   0.7971916  5426  2522 0.31731253
2: URBANICITY z_Highly_Rural/ Rural  2022   0.2028084  1893   129 0.06379822
          woe     bin_iv  total_iv                breaks is_special_values
        <num>      <num>     <num>                <char>            <lgcl>
1:  0.2493868 0.05236649 0.4031542   Highly_Urban/ Urban             FALSE
2: -1.6705690 0.35078776 0.4031542 z_Highly_Rural/ Rural             FALSE

Two data-sets

For our purpose, we will create two data-sets:

  • d_fact: a fully categorical dataset (to be used for generalised linear models)
  • d: a dataset that holds both numerical and categorical data – to be used for decision trees, random forest, etc.
# We create a copy of the dataset that will hold the fully categorical
# data-set:
d_fact <- df

7.4 KIDSDRIV

The variable KIDSDRIV has an information value of 0.058294 with the existing binning. We use the function ctable() from {summarytools} to get more insight into the variable.

# KIDSDRIV is <dbl> and needs to be made categorical first
# Since there are only a limited number of possibilities, we can do this:
d_fact$KIDSDRIV <- factor(d_fact$KIDSDRIV)

# show the cross-tabulation:
summarytools::ctable(d_fact$KIDSDRIV, d_fact$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
KIDSDRIV
0 6595 (75.2%) 2174 (24.8%) 8769 (100.0%)
1 481 (61.4%) 303 (38.6%) 784 (100.0%)
2 207 (60.3%) 136 (39.7%) 343 (100.0%)
3 34 (48.6%) 36 (51.4%) 70 (100.0%)
4 2 (50.0%) 2 (50.0%) 4 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)
bins1 = woebin(df, y="CLAIM_FLAG", x="KIDSDRIV")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') #+ 
  #theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p)
$KIDSDRIV

This allows us to make a decision on what bins to use. We create the bins so that

  • no bin is too small (e.g. lower than 10% of the data);
  • there is a meaningful difference between the weight of evidence of the bins; and
  • if we combine levels, we do so in a meaningful way (e.g. larger cars vs. small cars, or age bins that are adjacent to each other instead of grouping the youngest and oldest bins, etc.).

Note that in the following sections we will leave out this exploratory phase and refer to the previous chapters about data exploration instead.
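The first of these criteria can be checked mechanically on the output of woebin(), since each list element is a table with a count_distr column. A minimal sketch (small_bins() is our own hypothetical helper, not a {scorecard} function):

```r
# Hypothetical helper: flag bins that hold less than `threshold` of the data
# in a woebin() result (a named list of tables with a count_distr column).
small_bins <- function(bins, threshold = 0.10) {
  lapply(bins, function(b) {
    b <- as.data.frame(b)
    b[b$count_distr < threshold, c('variable', 'bin', 'count_distr')]
  })
}

# toy illustration:
toy <- list(X = data.frame(variable = 'X',
                           bin = c('a', 'b', 'c'),
                           count_distr = c(0.05, 0.60, 0.35)))
small_bins(toy)$X   # only bin 'a' falls below the 10% threshold
```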

d_fact$KIDSDRIV <- if_else(df$KIDSDRIV == 0, 'none', 
                        if_else(df$KIDSDRIV == 1, '1', '>=2')) %>%
factor(levels = c('none', '1', '>=2'))
f_comment_iv(d_fact, y = 'CLAIM_FLAG', x = 'KIDSDRIV')

[1] "The information value for `KIDSDRIV` is: 0.058 (this is weakly predictable)."

summarytools::ctable(d_fact$KIDSDRIV, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
KIDSDRIV
none 6595 (75.2%) 2174 (24.8%) 8769 (100.0%)
1 481 (61.4%) 303 (38.6%) 784 (100.0%)
>=2 243 (58.3%) 174 (41.7%) 417 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)
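To make the WOE and IV figures concrete: both can be recomputed in a few lines of base R from the cross-table counts above. This is a sketch of the standard formulas, not the {scorecard} implementation:

```r
# Counts per bin, taken from the KIDSDRIV cross-table above:
neg <- c(none = 6595, `1` = 481, `>=2` = 243)   # CLAIM_FLAG == 0
pos <- c(none = 2174, `1` = 303, `>=2` = 174)   # CLAIM_FLAG == 1

woe_iv <- function(neg, pos) {
  dist_neg <- neg / sum(neg)               # distribution of non-claimers over bins
  dist_pos <- pos / sum(pos)               # distribution of claimers over bins
  woe <- log(dist_pos / dist_neg)          # weight of evidence per bin
  iv  <- sum((dist_pos - dist_neg) * woe)  # total information value
  list(woe = woe, iv = iv)
}

res <- woe_iv(neg, pos)
round(res$iv, 3)   # 0.057 (close to the 0.058 reported above)
```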

7.5 AGE

#d_fact$AGE <- if_else(df$AGE < 30 , '<30', 
#                        if_else(df$AGE < 40, '30--40', 
#                                if_else(df$AGE < 55, '40--55', '>55'))) %>%
#    factor(levels = c('<30', '30--40', '40--55', '>55'))
# f_comment_iv(d_fact, x = 'AGE', y = 'CLAIM_FLAG')


bins1 = woebin(d_fact, y="CLAIM_FLAG", x="AGE")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 
print(p)
$AGE

d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('AGE' = 'AGE_bin') %>%  # set the name back to what it was.
  dplyr::mutate(AGE = factor(AGE))      # make sure factor (required for automation)
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
summarytools::ctable(d_fact$AGE, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
AGE
[-Inf,31) 270 (53.6%) 234 (46.4%) 504 (100.0%)
[31,40) 1477 (66.6%) 741 (33.4%) 2218 (100.0%)
[40,44) 1196 (73.2%) 438 (26.8%) 1634 (100.0%)
[44,46) 684 (77.0%) 204 (23.0%) 888 (100.0%)
[46,57) 3123 (80.6%) 750 (19.4%) 3873 (100.0%)
[57, Inf) 569 (66.7%) 284 (33.3%) 853 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)
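The woebin_ply() / rename / mutate pattern above recurs for most variables below, so it could be factored out into a small helper. A sketch of the post-processing part only (restore_bin_name() is our own hypothetical name, not part of {scorecard}): after woebin_ply(..., to = 'bin') has replaced a variable by an `<var>_bin` column, it restores the original column name and coerces to factor.

```r
# Hypothetical helper for the recurring rename-and-factor step:
restore_bin_name <- function(data, varname) {
  bin_col <- paste0(varname, '_bin')
  data[[varname]] <- factor(data[[bin_col]])   # re-create under the old name
  data[[bin_col]] <- NULL                      # drop the suffixed column
  data
}

# toy illustration:
d <- data.frame(AGE_bin = c('[-Inf,31)', '[31,40)'))
str(restore_bin_name(d, 'AGE'))
```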

7.6 HOMEKIDS

bins1 = woebin(d_fact, y="CLAIM_FLAG", x="HOMEKIDS")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 
print(p)
$HOMEKIDS
The WOE per suggested bin for HOMEKIDS.


# woebin() suggests the same bins as our intuition guided us to
# We can now choose how to implement these bins

# (1) automatically via woebin_ply():
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('HOMEKIDS' = 'HOMEKIDS_bin')    %>%  # set the name back 
  dplyr::mutate(HOMEKIDS = factor(HOMEKIDS))      
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
# or (2) manually
d_fact$HOMEKIDS <- if_else(df$HOMEKIDS == 0 , '0', 
                        if_else(df$HOMEKIDS == 1, '1', '>=2')) %>%
  factor(levels = c('0', '1', '>=2'))


bins1 = woebin(d_fact, y="CLAIM_FLAG", x="HOMEKIDS")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 
print(p)
$HOMEKIDS
The WOE per suggested bin for HOMEKIDS.


# eventually check the IV of the binning:
f_comment_iv(d_fact, x = 'HOMEKIDS', y = 'CLAIM_FLAG')
[1] "The information value for `HOMEKIDS` is: 0.09 (this is weakly predictable)."
summarytools::ctable(d_fact$HOMEKIDS, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
HOMEKIDS
0 5023 (77.8%) 1433 (22.2%) 6456 (100.0%)
1 712 (65.8%) 370 (34.2%) 1082 (100.0%)
>=2 1584 (65.1%) 848 (34.9%) 2432 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.7 YOJ

“Years on job” refers to the number of years that the customer has been employed by the same company.

# Investigate the suggested binning:
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="YOJ")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 
print(p)
$YOJ
The suggested binning for YOJ by {scorecard}


We notice that the pattern is a little erratic and not completely in line with intuition. There is a sharp drop in claim probability for someone who has been one year in the same job, but then it increases again. This is most likely a pattern that is only present in our own data. Therefore, we will make our own binning.

d_fact$YOJ <- if_else(df$YOJ <= 1 , '<=1', '>1') %>%
    factor(levels = c('<=1', '>1'))

bins1 = woebin(d_fact, y="CLAIM_FLAG", x="YOJ") # extract the bin information
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe')      # prepare the plot
print(p)                                        # show it on screen
$YOJ
Our own binning for YOJ


The information value of our own binning is 0.0337333.

The cross-table for YOJ looks now as follows:

summarytools::ctable(d_fact$YOJ, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
YOJ
<=1 502 (60.8%) 323 (39.2%) 825 (100.0%)
>1 6817 (74.5%) 2328 (25.5%) 9145 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.8 INCOME

Income is one of those variables that could have some significance. One could expect that people who were able to study well, work hard, and work their way towards a good income are people who understand the risk of driving a car and drive more carefully.

# Investigate the suggested binning:
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="INCOME")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 
print(p)
$INCOME

While the proposed bins are rather small, the relationship with the dependent variable is almost perfectly linear and in line with intuition. Therefore, we will accept the binning as proposed by the package {scorecard}.

# Manual binning could look like this:
#d_fact$INCOME <- if_else(df$INCOME < 25000 , '<25K', 
#                       if_else(df$INCOME < 55000, '25--55K',
#                               if_else(df$INCOME < 75000, '55--75K', '>75K'))) %>%
#  factor(levels = c('<25K', '25--55K', '55--75K', '>75K'))

d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('INCOME' = 'INCOME_bin') %>%  # set the name back to what it was.
  dplyr::mutate(INCOME = factor(INCOME))  %>%    # create factor object
  dplyr::mutate(INCOME = fct_reorder(INCOME, df$INCOME)) # re-order the bins in order of increasing income

✔ Woe transformating on 9970 rows and 1 columns in 00:00:00

The information value of the INCOME variable is now: 0.1372732; and the cross-table with CLAIM_FLAG is as follows.

summarytools::ctable(d_fact$INCOME, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
INCOME
[-Inf,15000) 953 (62.8%) 564 (37.2%) 1517 (100.0%)
[15000,70000) 3507 (70.8%) 1448 (29.2%) 4955 (100.0%)
[70000,85000) 794 (76.9%) 238 (23.1%) 1032 (100.0%)
[85000,120000) 1198 (82.5%) 255 (17.5%) 1453 (100.0%)
[120000, Inf) 867 (85.6%) 146 (14.4%) 1013 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)
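The fct_reorder() step in the code above matters because the interval labels would otherwise sort lexicographically (so '[7000,12000)' would come after '[20000, Inf)'). Base R's reorder() behaves much like forcats::fct_reorder(); a minimal sketch of the problem and the fix:

```r
# Interval labels sort lexicographically by default:
labs <- c('[12000,20000)', '[20000, Inf)', '[7000,12000)')
vals <- c(15000, 25000, 8000)   # a representative value inside each interval

f <- factor(labs)
levels(f)        # lexicographic: '[12000,...' and '[20000,...' before '[7000,...'

f2 <- reorder(f, vals)          # base-R analogue of forcats::fct_reorder()
levels(f2)       # '[7000,12000)' '[12000,20000)' '[20000, Inf)'
```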

7.9 PARENT1

This variable is binary, and hence the binning is obvious. We expect to see that single parents have more accidents, as they will more often have distracting passengers in the car.

d_fact$PARENT1 <- df$PARENT1 %>%
  factor(levels = c('Yes', 'No'))
f_comment_iv(d_fact, x = 'PARENT1', y = 'CLAIM_FLAG')
[1] "The information value for `PARENT1` is: 0.117 (this is medium predictable)."
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="PARENT1")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins1, line_value = 'woe') 
print(p)

$PARENT1

summarytools::ctable(d_fact$PARENT1, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
PARENT1
Yes 717 (55.1%) 584 (44.9%) 1301 (100.0%)
No 6602 (76.2%) 2067 (23.8%) 8669 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.10 HOME_VAL

The value of the house in which people live is a proxy for wealth, foresight, and planning, and might be predictive of careful driving to some extent. The variable is numeric.

# First try our own approach:
d_fact$HOME_VALtmp <- if_else(df$HOME_VAL < 100000 , '<100K', 
                       if_else(df$HOME_VAL < 200000, '100--200K',
                            if_else(df$HOME_VAL < 300000, '200--300K', '>300K'))) %>%
  factor(levels = c('<100K', '100--200K', '200--300K', '>300K'))
f_comment_iv(d_fact, x = 'HOME_VALtmp', y = 'CLAIM_FLAG')
[1] "The information value for `HOME_VALtmp` is: 0.19 (this is medium predictable)."
# Try now the automated binning via woebin:
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="HOME_VAL")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins1, line_value = 'woe') 
print(p)

$HOME_VAL

# The IV of this binning is higher, the smallest bin is larger -> accept this one:
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('HOME_VAL' = 'HOME_VAL_bin') %>%  # set the name back to what it was.
  dplyr::mutate(HOME_VAL = factor(HOME_VAL))  %>%    # create factor object
  dplyr::mutate(HOME_VAL = fct_reorder(HOME_VAL, df$HOME_VAL)) %>% # re-order the bins in order of increasing home value
  dplyr::select (-c("HOME_VALtmp"))  # remove the previously created column

✔ Woe transformating on 9970 rows and 1 columns in 00:00:00

summarytools::ctable(d_fact$HOME_VAL, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
HOME_VAL
[-Inf,80000) 1974 (62.7%) 1174 (37.3%) 3148 (100.0%)
[80000,220000) 2846 (73.9%) 1005 (26.1%) 3851 (100.0%)
[220000,260000) 854 (80.2%) 211 (19.8%) 1065 (100.0%)
[260000, Inf) 1645 (86.3%) 261 (13.7%) 1906 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.11 MSTATUS

Is the insured person married or not? While obviously correlated with age, married people typically tend to be more risk-averse and to plan better. The variable is binary, hence binning is straightforward.

d_fact$MSTATUS <- if_else(df$MSTATUS == 'Yes', 'Yes', 'No') %>%
  factor(levels = c('Yes', 'No'))
f_comment_iv(d_fact, x = 'MSTATUS', y = 'CLAIM_FLAG')
[1] "The information value for `MSTATUS` is: 0.086 (this is weakly predictable)."
# Visualise the bins for MSTATUS:
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="MSTATUS")

✔ Binning on 9970 rows and 2 columns in 00:00:00

woebin_plot(bins1, line_value = 'woe') %>% print()

$MSTATUS

summarytools::ctable(d_fact$MSTATUS, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
MSTATUS
Yes 4685 (78.1%) 1314 (21.9%) 5999 (100.0%)
No 2634 (66.3%) 1337 (33.7%) 3971 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.12 GENDER

Would gender influence the propensity to claim insurance for a car accident?

d_fact$GENDER <- if_else(df$GENDER == 'M', 'M', 'F') %>%
  factor(levels = c('M', 'F'))
f_comment_iv(d_fact, x = 'GENDER', y = 'CLAIM_FLAG')
[1] "The information value for `GENDER` is: 0.003 (this is not predictable)."

We find that in this dataset women claim slightly more often; however, the difference is not useful for modelling – it is probably an example of spurious correlation.

summarytools::ctable(d_fact$GENDER, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
GENDER
M 3393 (74.5%) 1159 (25.5%) 4552 (100.0%)
F 3926 (72.5%) 1492 (27.5%) 5418 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

Gender is not predictive of the probability of being involved in an accident.

7.13 EDUCATION

d_fact$EDUCATION <- factor(df$EDUCATION, 
                         levels = c('<High_School', 'z_High_School', 'Bachelors', 'Masters', 'PhD'))

f_comment_iv(d_fact, x = 'EDUCATION', y = 'CLAIM_FLAG')
[1] "The information value for `EDUCATION` is: 0.128 (this is medium predictable)."
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="EDUCATION")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins1, line_value = 'woe') 
print(p)

$EDUCATION

People with a PhD are a smaller group, but since the relationship is stable and logical, we keep all groups as they are.

summarytools::ctable(d_fact$EDUCATION, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
EDUCATION
<High_School 1003 (67.4%) 485 (32.6%) 1488 (100.0%)
z_High_School 1898 (65.5%) 1001 (34.5%) 2899 (100.0%)
Bachelors 2111 (76.2%) 658 (23.8%) 2769 (100.0%)
Masters 1578 (81.0%) 369 (19.0%) 1947 (100.0%)
PhD 729 (84.1%) 138 (15.9%) 867 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.14 OCCUPATION

OCCUPATION is most likely predictive for the propensity to claim car insurance. It is nominal data (labels without a specific order), and hence we can group together those levels that have the most similar WOE. This allows us to trust the suggestion of woebin().

bins1 = woebin(d_fact, y="CLAIM_FLAG", x="OCCUPATION")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') 

# in order to turn the x-axis labels, we need to do this for each object in the list:
for (varname in names(p)) {
  print(p[[varname]] + 
        theme(axis.text.x = element_text(angle = 45, vjust = 1.0, hjust = 1)))
}

d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('OCCUPATION' = 'OCCUPATION_bin') %>%  # set the name back to what it was.
  dplyr::mutate(OCCUPATION = factor(OCCUPATION))      # make sure factor (required for automation)
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
f_comment_iv(d_fact, x = 'OCCUPATION', y = 'CLAIM_FLAG')
[1] "The information value for `OCCUPATION` is: 0.153 (this is medium predictable)."
summarytools::ctable(d_fact$OCCUPATION, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
OCCUPATION
Clerical 1103 (69.9%) 475 (30.1%) 1578 (100.0%)
Doctor%,%Home_Maker%,%Lawyer%,%Manager 3103 (81.7%) 694 (18.3%) 3797 (100.0%)
Professional 1087 (76.1%) 342 (23.9%) 1429 (100.0%)
Student%,%z_Blue_Collar 2026 (64.0%) 1140 (36.0%) 3166 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.15 TRAVTIME

TRAVTIME is numeric. One can expect a linear relationship between the commute time and the probability of claiming insurance.

bins1 = woebin(d_fact, y="CLAIM_FLAG", x="TRAVTIME")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins1, line_value = 'woe') %>% print

$TRAVTIME

d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('TRAVTIME' = 'TRAVTIME_bin') %>%  # set the name back to what it was.
  dplyr::mutate(TRAVTIME = factor(TRAVTIME))      # make sure factor (required for automation)

✔ Woe transformating on 9970 rows and 1 columns in 00:00:00

f_comment_iv(d_fact, x = 'TRAVTIME', y = 'CLAIM_FLAG')

[1] "The information value for `TRAVTIME` is: 0.023 (this is weakly predictable)."

d_fact$TRAVTIME <- if_else(df$TRAVTIME < 25, '<25',  
                           if_else(df$TRAVTIME < 40, '25--40', '>40')) %>%
     factor(levels = c('<25', '25--40', '>40'))
f_comment_iv(d_fact, x = 'TRAVTIME', y = 'CLAIM_FLAG')
summarytools::ctable(d_fact$TRAVTIME, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
TRAVTIME
<25 2262 (76.8%) 685 (23.2%) 2947 (100.0%)
25--40 2692 (73.8%) 958 (26.2%) 3650 (100.0%)
>40 2365 (70.1%) 1008 (29.9%) 3373 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.16 CAR_USE

This is a binary variable.

d_fact$CAR_USE <- paste(df$CAR_USE) %>%
     factor()
f_comment_iv(d_fact, x = 'CAR_USE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_USE` is: 0.096 (this is weakly predictable)."
summarytools::ctable(d_fact$CAR_USE, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
CAR_USE
Commercial 2350 (65.3%) 1251 (34.7%) 3601 (100.0%)
Private 4969 (78.0%) 1400 (22.0%) 6369 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.17 BLUEBOOK

A numeric variable. We first build a challenger binning based on data exploration.

d_fact$BLUEBOOK <- if_else(df$BLUEBOOK < 10000, '<10K',  
                           if_else(df$BLUEBOOK < 15000, '10--15K', 
                                   if_else(df$BLUEBOOK < 20000, '15--20K', '>20K'))) %>%
     factor(levels = c('<10K', '10--15K', '15--20K', '>20K'))
f_comment_iv(d_fact, x = 'BLUEBOOK', y = 'CLAIM_FLAG')
[1] "The information value for `BLUEBOOK` is: 0.061 (this is weakly predictable)."
d_fact$BLUEBOOK <- df$BLUEBOOK # reset back to the numeric variable
# /or/ d_fact <- d_fact %>% dplyr::mutate(BLUEBOOK = df$BLUEBOOK) 

bins = woebin(d_fact, y="CLAIM_FLAG", x="BLUEBOOK")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins, line_value = 'woe') %>% print

$BLUEBOOK

# Note that the 4th bin has a higher WOE than the 3rd - that is not logical.

# We decide on a bin less and all bins wider:
custom_breaks <- list(BLUEBOOK = c(7000, 12000, 20000)) # define custom breaks

# Generate bins using custom breakpoints:
bins <- woebin(d_fact, y = "CLAIM_FLAG", x="BLUEBOOK", breaks_list = custom_breaks)

✔ Binning on 9970 rows and 2 columns in 00:00:00

# The IV of this custom binning is higher than that of our manual challenger, hence we keep it:
d_fact <- d_fact                             %>%
  scorecard::woebin_ply(bins, to = 'bin')    %>% # re-apply the binning
  dplyr::rename('BLUEBOOK' = 'BLUEBOOK_bin') %>% # set the name back to what it was.
  dplyr::mutate(BLUEBOOK = factor(BLUEBOOK))     # coerce to factor

✔ Woe transformating on 9970 rows and 1 columns in 00:00:00

p = woebin_plot(bins, line_value = 'woe') %>% print # plot the new bins

$BLUEBOOK

f_comment_iv(d_fact, x = 'BLUEBOOK', y = 'CLAIM_FLAG')

[1] "The information value for `BLUEBOOK` is: 0.069 (this is weakly predictable)."

While the IV of this binning is lower than that of the automated suggestion, it does display a more intuitive pattern.

summarytools::ctable(d_fact$BLUEBOOK, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
BLUEBOOK
[-Inf,7000) 998 (63.6%) 571 (36.4%) 1569 (100.0%)
[12000,20000) 2570 (76.4%) 794 (23.6%) 3364 (100.0%)
[20000, Inf) 2124 (78.3%) 589 (21.7%) 2713 (100.0%)
[7000,12000) 1627 (70.0%) 697 (30.0%) 2324 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.18 TIF

Time in force: the time that the customer has been on the books of our insurance company. Typically, longer periods correlate with more stable behaviour, and hence a better insurance risk. This phenomenon also allows us to reward loyal customers.

bins = woebin(d_fact, y="CLAIM_FLAG", x="TIF")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins, line_value = 'woe') %>% print
$TIF

# We will use our own binning for more equal bins
d_fact$TIF <- if_else(df$TIF <= 1, '1',  
                           if_else(df$TIF <= 5, '2--5', 
                                   if_else(df$TIF <= 8, '6--8', '>8'))) %>%
     factor(levels = c('1', '2--5', '6--8', '>8'))
f_comment_iv(d_fact, x = 'TIF', y = 'CLAIM_FLAG')
[1] "The information value for `TIF` is: 0.035 (this is weakly predictable)."
summarytools::ctable(d_fact$TIF, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
TIF
1 2104 (68.8%) 956 (31.2%) 3060 (100.0%)
2--5 1569 (72.6%) 591 (27.4%) 2160 (100.0%)
6--8 1883 (75.4%) 616 (24.6%) 2499 (100.0%)
>8 1763 (78.3%) 488 (21.7%) 2251 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.19 CAR_TYPE

x <- paste(df$CAR_TYPE)
# We only merge Panel Truck and Van (too small bins + similar values)
d_fact$CAR_TYPE <- if_else(x == 'Panel_Truck' | x == 'Van' , 
                           'PTruck_Van', x) %>%
  factor()
f_comment_iv(d_fact, x = 'CAR_TYPE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_TYPE` is: 0.105 (this is medium predictable)."
summarytools::ctable(d_fact$CAR_TYPE, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
CAR_TYPE
Minivan 2182 (82.9%) 449 (17.1%) 2631 (100.0%)
Pickup 1178 (68.7%) 537 (31.3%) 1715 (100.0%)
PTruck_Van 1202 (73.2%) 440 (26.8%) 1642 (100.0%)
Sports_Car 752 (65.4%) 397 (34.6%) 1149 (100.0%)
z_SUV 2005 (70.8%) 828 (29.2%) 2833 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.20 RED_CAR

d_fact$RED_CAR <- paste(df$RED_CAR) %>%
     factor()
f_comment_iv(d_fact, x = 'RED_CAR', y = 'CLAIM_FLAG')
[1] "The information value for `RED_CAR` is: 0 (this is not predictable)."

The urban legend that red cars are bought by less careful drivers seems to be wrong. The IV is not significant (about \(4.3 \times 10^{-4}\)). In our data, red cars even have ever so slightly fewer insurance claims.

summarytools::ctable(d_fact$RED_CAR, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
RED_CAR
no 5213 (73.2%) 1913 (26.8%) 7126 (100.0%)
yes 2106 (74.1%) 738 (25.9%) 2844 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

We will not use the variable RED_CAR. The colour might make sense for first-time buyers, but people who buy a red car second hand might have very different behaviour. We don’t have this information and hence cannot use the colour of the car as a variable.

7.21 OLDCLAIM

The amount claimed over the last five years; this could be used together with the next variable (the number of claims in the last five years).

d_fact$OLDCLAIM <- if_else(df$OLDCLAIM == 0, '0', '>0') %>%
     factor(levels = c('0', '>0'))
f_comment_iv(d_fact, x = 'OLDCLAIM', y = 'CLAIM_FLAG')
[1] "The information value for `OLDCLAIM` is: 0.316 (this is strongly predictable)."
summarytools::ctable(d_fact$OLDCLAIM, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
OLDCLAIM
0 5027 (82.2%) 1090 (17.8%) 6117 (100.0%)
>0 2292 (59.5%) 1561 (40.5%) 3853 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

The cumulative amount of claims in the last 5 years does not show a trend. The probability of having a claim is around 40% for people who did claim something, regardless of the amount (for those that didn’t claim it is only about 18%). So, we can only create two meaningful bins here (at least assuming that we are targeting the variable CLAIM_FLAG – not the amount).
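The IV of 0.316 reported above can be re-derived directly from the two-by-two cross-table; a base-R sketch of the standard WOE/IV formula, not the {scorecard} implementation:

```r
# Counts per bin, taken from the OLDCLAIM cross-table above:
neg <- c(`0` = 5027, `>0` = 2292)   # CLAIM_FLAG == 0
pos <- c(`0` = 1090, `>0` = 1561)   # CLAIM_FLAG == 1

dist_neg <- neg / sum(neg)          # distribution of non-claimers over bins
dist_pos <- pos / sum(pos)          # distribution of claimers over bins
iv <- sum((dist_pos - dist_neg) * log(dist_pos / dist_neg))
round(iv, 3)   # 0.316
```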

7.22 CLM_FREQ

bins = woebin(d_fact, y="CLAIM_FLAG", x="CLM_FREQ")

✔ Binning on 9970 rows and 2 columns in 00:00:00

p = woebin_plot(bins, line_value = 'woe') %>% print

$CLM_FREQ

We prefer to keep the people apart that did not claim anything, and try to make the bins more similar in size.

d_fact$CLM_FREQ <- if_else(df$CLM_FREQ == 0, '0',  
                           if_else(df$CLM_FREQ == 1, '1', 
                                   if_else(df$CLM_FREQ == 2, '2', '>2'))) %>%
     factor(levels = c('0', '1', '2', '>2'))
f_comment_iv(d_fact, x = 'CLM_FREQ', y = 'CLAIM_FLAG')
[1] "The information value for `CLM_FREQ` is: 0.317 (this is strongly predictable)."
summarytools::ctable(d_fact$CLM_FREQ, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
CLM_FREQ
0 5027 (82.2%) 1090 (17.8%) 6117 (100.0%)
1 745 (60.5%) 486 (39.5%) 1231 (100.0%)
2 860 (60.1%) 571 (39.9%) 1431 (100.0%)
>2 687 (57.7%) 504 (42.3%) 1191 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

The information value is nearly the same as for the binning proposed by scorecard::woebin(), hence we keep our binning.

7.23 REVOKED

Has the driver’s licence been revoked? This is a binary variable and hence binning is obvious.

d_fact$REVOKED <- paste(df$REVOKED) %>%
     factor()
f_comment_iv(d_fact, x = 'REVOKED', y = 'CLAIM_FLAG')
[1] "The information value for `REVOKED` is: 0.109 (this is medium predictable)."
summarytools::ctable(d_fact$REVOKED, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
REVOKED
No 6647 (76.0%) 2103 (24.0%) 8750 (100.0%)
Yes 672 (55.1%) 548 (44.9%) 1220 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

People that have their licence revoked are about twice as likely to file a claim.
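The "about twice as likely" can be checked directly from the counts in the table above; a quick sketch:

```r
# Relative risk of a claim for drivers with a revoked licence
claims  <- c(No = 2103, Yes = 548)    # rows with CLAIM_FLAG == 1 per group
totals  <- c(No = 8750, Yes = 1220)
p_claim <- claims / totals            # 0.240 (No) vs 0.449 (Yes)
round(unname(p_claim["Yes"] / p_claim["No"]), 2)   # 1.87, i.e. close to twice
```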

7.24 MVR_PTS

bins = woebin(d_fact, y="CLAIM_FLAG", x="MVR_PTS")
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins, line_value = 'woe') %>% print
$MVR_PTS

We try our own bins, keeping them more similar in size and using one bin fewer.

d_fact$MVR_PTS <- if_else(df$MVR_PTS == 0, '0',  
                           if_else(df$MVR_PTS == 1, '1', 
                                   if_else(df$MVR_PTS == 2, '2', 
                                           if_else(df$MVR_PTS <= 4, '3--4', '>4')))) %>%
     factor(levels = c('0', '1', '2', '3--4', '>4'))
f_comment_iv(d_fact, x = 'MVR_PTS', y = 'CLAIM_FLAG')

[1] "The information value for `MVR_PTS` is: 0.206 (this is medium predictable)."

summarytools::ctable(d_fact$MVR_PTS, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
MVR_PTS
0 3641 (80.8%) 866 (19.2%) 4507 (100.0%)
1 1086 (76.2%) 340 (23.8%) 1426 (100.0%)
2 845 (72.7%) 318 (27.3%) 1163 (100.0%)
3–4 1085 (66.3%) 551 (33.7%) 1636 (100.0%)
>4 662 (53.5%) 576 (46.5%) 1238 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.25 CAR_AGE

There is one negative car age; we simply place it in the bucket of cars of at most one year old.

d_fact$CAR_AGE <- if_else(df$CAR_AGE <= 1, '<=1',  
                           if_else(df$CAR_AGE <= 7, '2--7', 
                                   if_else(df$CAR_AGE <= 11, '8--11', 
                                           if_else(df$CAR_AGE <= 15, '12--15', '>=16')))) %>%
     factor(levels = c('<=1', '2--7', '8--11', '12--15', '>=16'))
f_comment_iv(d_fact, x = 'CAR_AGE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_AGE` is: 0.069 (this is weakly predictable)."
summarytools::ctable(d_fact$CAR_AGE, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
CAR_AGE
<=1 1657 (67.1%) 814 (32.9%) 2471 (100.0%)
2–7 1428 (70.8%) 589 (29.2%) 2017 (100.0%)
8–11 1935 (73.8%) 687 (26.2%) 2622 (100.0%)
12–15 1398 (80.1%) 348 (19.9%) 1746 (100.0%)
>=16 901 (80.9%) 213 (19.1%) 1114 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)
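As an aside, the nested if_else() chain above can be written more compactly with base R's cut(); a sketch on toy values (right-closed intervals reproduce the same buckets, and a negative age lands in the '<=1' bin):

```r
# Same CAR_AGE binning with cut(); intervals are right-closed by default,
# so (-Inf, 1] corresponds to '<=1', (1, 7] to '2--7', and so on.
car_age <- c(-3, 1, 5, 10, 14, 20)    # toy values standing in for df$CAR_AGE
cut(car_age,
    breaks = c(-Inf, 1, 7, 11, 15, Inf),
    labels = c("<=1", "2--7", "8--11", "12--15", ">=16"))
```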

7.26 URBANICITY

A binary variable indicating where the car is primarily used.

d_fact$URBANICITY <- paste(df$URBANICITY) %>%
     factor()
f_comment_iv(d_fact, x = 'URBANICITY', y = 'CLAIM_FLAG')
[1] "The information value for `URBANICITY` is: 0.403 (this is strongly predictable)."
summarytools::ctable(d_fact$URBANICITY, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
URBANICITY
Highly_Urban/ Urban 5426 (68.3%) 2522 (31.7%) 7948 (100.0%)
z_Highly_Rural/ Rural 1893 (93.6%) 129 ( 6.4%) 2022 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

7.27 Combination of Variables

We illustrate the lack of differentiation between genders for the claim amount.

p <- ggplot(df, aes(x=OLDCLAIM, color = GENDER)) + 
  #geom_histogram(aes(y=..density..), position="identity", alpha=0.5) +
  geom_density(alpha=0.6) + 
  scale_y_continuous(trans='log2')
p

7.27.1 GENDER and AGE

d_fact$GENDER.AGE <- paste0(d_fact$GENDER, ".", d_fact$AGE) %>% factor()
f_comment_iv(d_fact, x = 'GENDER.AGE', y = 'CLAIM_FLAG')

[1] "The information value for `GENDER.AGE` is: 0.145 (this is medium predictable)."

summarytools::ctable(d_fact$GENDER.AGE, df$CLAIM_FLAG)
CLAIM_FLAG 0 1 Total
GENDER.AGE
F.[-Inf,31) 159 (53.2%) 140 (46.8%) 299 (100.0%)
F.[31,40) 876 (67.3%) 426 (32.7%) 1302 (100.0%)
F.[40,44) 663 (72.0%) 258 (28.0%) 921 (100.0%)
F.[44,46) 397 (76.8%) 120 (23.2%) 517 (100.0%)
F.[46,57) 1550 (79.9%) 391 (20.1%) 1941 (100.0%)
F.[57, Inf) 281 (64.2%) 157 (35.8%) 438 (100.0%)
M.[-Inf,31) 111 (54.1%) 94 (45.9%) 205 (100.0%)
M.[31,40) 601 (65.6%) 315 (34.4%) 916 (100.0%)
M.[40,44) 533 (74.8%) 180 (25.2%) 713 (100.0%)
M.[44,46) 287 (77.4%) 84 (22.6%) 371 (100.0%)
M.[46,57) 1573 (81.4%) 359 (18.6%) 1932 (100.0%)
M.[57, Inf) 288 (69.4%) 127 (30.6%) 415 (100.0%)
Total 7319 (73.4%) 2651 (26.6%) 9970 (100.0%)

Average cost of accidents:

tibble(GENDER_AGE = paste(d_fact$GENDER.AGE), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
  filter(CLM_AMT > 0) %>%
  group_by(GENDER_AGE) %>%
  summarise_at(vars(CLM_AMT), ~ mean(., na.rm = TRUE)) %>%
  kable
GENDER_AGE CLM_AMT
F.[-Inf,31) 4760.221
F.[31,40) 5681.056
F.[40,44) 5318.380
F.[44,46) 5241.250
F.[46,57) 5523.939
F.[57, Inf) 5427.739
M.[-Inf,31) 6107.564
M.[31,40) 5738.267
M.[40,44) 6118.789
M.[44,46) 5315.405
M.[46,57) 5621.451
M.[57, Inf) 7444.984
# prepare the data
data <- tibble(matrix(nrow = nrow(df), ncol = 0))
data$claim_amount <- df$CLM_AMT
data$gender       <- d_fact$GENDER
data$age          <- d_fact$AGE

We can first prepare the data with group-by() and then plot (option 1):

# -- option 1:
mean_data <- group_by(data, gender, age) %>%
             summarise(claim_amount = mean(claim_amount, na.rm = TRUE))
ggplot(na.omit(mean_data), aes(x = age, y = claim_amount, colour = gender)) +
  geom_point() + geom_line()

Or use stat_summary():

# -- option 2:
ggplot(data = data, aes(x = age, y = claim_amount, group = gender, color = gender)) + 
   stat_summary(geom = "line", fun = mean, linewidth = 3)

Overall averages per gender:

tibble(GENDER = paste(d_fact$GENDER), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
  filter(CLM_AMT > 0) %>%
  group_by(GENDER)    %>%
  summarise_at(vars(CLM_AMT), ~ mean(., na.rm = TRUE)) %>%
  kable
GENDER CLM_AMT
F 5428.732
M 5947.502

While both males and females have higher claims at younger ages, the effect is more pronounced for males: the average claim amount for a young male is about 28% higher than for the same age group in females.

So, while the young female has about 8% higher probability of being involved in an accident, the claims of males are much higher (this effect would be even stronger if we considered only the group that was involved in an accident).

8 Output Data

8.1 A Completely Binned Dataset (only categorical data)

This dataset can be used for logistic regression, and some machine learning techniques.

This is d_fact, and we save it for later use.

# leave out CLM_AMT because it is part of the target variable:
saveRDS(d_fact %>% select(-c(CLM_AMT)), './d_fact.R')

8.2 Numerical values normalised

We choose to normalise so that all variables range from 0 to 1 (this is method = "range" in the code below).

# remove CLM_AMT (because it is a 100% predictor)
df <- df %>% select(-c('CLM_AMT'))

# Normalise all numerical data
preProc_norm <- caret::preProcess(as.data.frame(df), method = "range")
d_norm <- predict(preProc_norm, as.data.frame(df)) %>%
  as.data.frame() %>%   # coerces text columns to factors
  as_tibble()           # we prefer tibbles, but needed the factors first

# Now replace all factorial columns with the binned ones:
for (col in colnames(d_norm)) {
    if (is.factor(d_norm[[col]])) {
      d_norm[[col]] <- d_fact[[col]]
      }
}
saveRDS(preProc_norm, './preProc_norm.R') # to convert data back and forth:
saveRDS(d_norm,       './d_norm.R')       # the data
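What method = "range" computes per numeric column is plain min-max scaling; a minimal sketch (range01 is a hypothetical helper, not part of caret):

```r
# Min-max scaling: x' = (x - min) / (max - min), mapping each column onto [0, 1]
range01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
range01(c(3, 18, 45, 60))   # 0.0000000 0.2631579 0.7368421 1.0000000
```

The fitted preProc_norm object stores the per-column minima and maxima, which is why we save it: it lets us apply the identical transformation to new data, or map predictions back.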

8.3 A Binary DataSet based on the factorial one

Some methods, such as linear regressions, generalized linear regressions, neural networks, etc., require all data to be numerical. So, we prepare an equivalent data set for those models.

d_bin = data.frame(matrix(nrow = nrow(d_fact), ncol = 0)) 

for (col in colnames(d_fact)) {
    if (is.factor(d_fact[[col]])) {
      x <- expand_factor2bin(d_fact, col)  # get the expanded columns
      
      # find the largest bin and leave it out
      sums <- colSums(x)
      theCol <- which(sums == max(sums))
      x <- x[,-theCol]
      
      d_bin <- bind_cols(d_bin, x)     # add them to d_bin
    } 
}
n <- which(colnames(d_bin) == "CLAIM_FLAG.1") # this process has renamed the target variable
colnames(d_bin)[n] <- "CLAIM_FLAG"            # set back to original name
saveRDS(d_bin, './d_bin.R')                   # save the binary data-set
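For comparison, base R's model.matrix() performs a similar factor expansion; the difference is that it drops the first level of each factor as reference, whereas the loop above drops the largest bin. A minimal sketch on toy data:

```r
# One 0/1 column per non-reference factor level; drop the intercept column
d <- data.frame(REVOKED = factor(c("No", "No", "Yes")))
model.matrix(~ REVOKED, d)[, -1, drop = FALSE]
#   REVOKEDYes
# 1          0
# 2          0
# 3          1
```

Dropping one level (or, as above, the largest bin) avoids the perfect collinearity that a full set of dummies would introduce in a linear model.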

9 References

9.1 Acknowledgement

In this document we used workflow and code from: De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.

In this document we used mainly:

  • Part III (p. 213–256): Data Import
  • Part IV (p. 257–372): Data Wrangling

Bibliography

Agarwal, Sray, and Shashin Mishra. 2021. Responsible AI: Implementing Ethical and Unbiased Algorithms. Springer.
Coeckelbergh, Mark. 2020. AI Ethics. MIT Press.
De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
Dubber, Markus Dirk, Frank Pasquale, and Sunit Das. 2020. The Oxford Handbook of Ethics of AI. Oxford Handbooks.
Hardt, Moritz, Solon Barocas, and Arvind Narayanan. 2018. Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org.

  1. We already argued that missing values in education might indicate lower education; missing values in job category might be indicative of the risk too.↩︎

  2. Multiple Imputation by Chained Equations (this method assumes missing at random).↩︎