This is part 1 of the analysis that demonstrates bias in data and explores the possibilities to de-bias data. In this document we show how to clean the data and prepare it in order to create models. We will prepare two data-sets: a fully categorical one and one that holds both numerical and categorical data.
Then those data-frames are saved to disk, to be used by the next file to build the models.
The car insurance data provides a wealth of information and points to discuss. What is fair? What is ethical? These are the questions that we will explore.
The dilemma is that when we believe that there is a protected feature \(S\) (say income), then there are at least two groups for which using the feature will result in a more desirable or a less desirable outcome. Leaving out the feature will make the model weaker and increase prices for all … impacting hardest the group that we set out to protect.
Workflow
The work-flow is inspired by: (De Brouwer 2020) and is prepared by Philippe De Brouwer to illustrate the concepts explained in the book.
We will approach the problem in a few phases and build models for:

- CLAIM_FLAG as dependent variable (this variable is 0 if no claim was filed)
- CLM_AMT (the amount claimed) for those customers that are expected to have an accident

First we need to read in the data and some custom functions.
# Load the tidyverse (for read_csv and the dplyr verbs used below):
library(tidyverse)
setwd("/home/philippe/Documents/courses/lectures/bias_data")
# Read in the data:
d0 <- read_csv("./car_insurance_claim_data.csv")
# Read in the functions:
source("ethics_functions.R")
# List the functions defined in this file:
tmt.env <- new.env()
sys.source("ethics_functions.R", envir = tmt.env)
utils::lsf.str(envir = tmt.env)

dollars_to_numeric : function (input)
expand_factor2bin : function (data, col)
f_comment_iv : function (df, x, y)
fAUC : function (model, data, …)
get_best_cutoff : function (fpr, tpr, cutoff, cost.fp = 1)
make_dependency_hist_plot : function (data, x, y, title, method =
    "loess", q_ymax = 1, q_xmax = 1)
make_dependency_plot : function (data, x, y, title, method = "loess",
    q_ymax = 1, q_xmax = 1)
make_WOE_table : function (df, y)
opt_cut_off : function (perf, pred, cost.fp = 1)
pct_table : function (x, y, round_digits = 2)
space_to_underscore : function (input)
str_clean : function (x)
The data is imported from Kaggle; the licence is unknown and it is believed to be in the public domain.
The data contains 10302 rows with 27 columns.
Each record (row) represents a set of attributes of an insurance
company individual customer that are related to their socio-demographic
profile and the insured vehicle. The binary response variable
TARGET_FLAG is 1 if the customer’s car was in a crash, and
0 if not. The continuous response variable TARGET_AMT
defines the cost related to the car crash.
The variables of the data are: ID, KIDSDRIV, BIRTH, AGE, HOMEKIDS, YOJ, INCOME, PARENT1, HOME_VAL, MSTATUS, GENDER, EDUCATION, OCCUPATION, TRAVTIME, CAR_USE, BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, CLM_AMT, CAR_AGE, CLAIM_FLAG, URBANICITY.
The data dictionary is as follows:
| variable name | definition | prejudice or expectation |
|---|---|---|
| ID | unique identifier | none |
| KIDSDRIV | number of teenager drivers | teenagers cause more accidents |
| BIRTH | birth date | used to derive age |
| AGE | age in years | young people cause more accidents, and older people too |
| HOMEKIDS | nbr kids at home | more kids might be more distraction in the car |
| YOJ | years on job | people that job-hop might be more prone to risk taking and accidents |
| INCOME | annual income in USD | income correlates to reliability and responsibility |
| PARENT1 | yes or no single parent | not clear |
| HOME_VAL | home value in USD if owner | similar to INCOME |
| MSTATUS | marital status “yes” if married | marriage might be a sign of stability and risk aversion(?) |
| GENDER | sex | men cause more and more expensive accidents |
| EDUCATION | level of the diploma | higher education correlates to safer driving |
| OCCUPATION | categories of employment | white color workers might drive safer |
| TRAVTIME | distance to work (probably in minutes) | longer distance translates in more probability to be involved in an accident |
| CAR_USE | commercial or private use | commercial use might be more risky |
| BLUEBOOK | resale value of the car | not clear |
| TIF | time in force = the time with the same insurer (numeric, probably in years, minimum is “1”) | longer should be better |
| CAR_TYPE | categorical | sports cars are more prone to accidents than minivans |
| RED_CAR | “yes” if the car is red | urban legend says that red cars are more prone to accidents |
| OLDCLAIM | total amount in USD of claims in the last 5 years | your past performance might be indicative, but note how this should interact with TIF |
| CLM_FREQ | not claim frequency, but the number of claims in the last 5 years | past performance might be indicative |
| REVOKED | “yes” if main driver’s licence was revoked during the last 7 years | licence being revoked should be indicative for driving style |
| MVR_PTS | motor vehicle record points (number) | traffic tickets should be indicative for the driving style and hence propensity to be involved in an accident |
| CLM_AMT | if last year in a car accident, the dollar amount of the claim paid by insurer | target variable |
| CAR_AGE | age of car in years | one might assume that drivers of older cars are less prudent |
| CLAIM_FLAG | 0 = NO, 1 = YES | target variable |
| URBANICITY | categorical, 2 options | cities should be more dangerous |
All variables are relevant; therefore we do not delete any columns at this point.
We were not able to contact the data provider, and close observation
made us assume that some variables have been manipulated to eliminate
missing values. The missing values have then been assigned to some
“assumption”; for example, in the variable EDUCATION we have
“<High School” (lower than high school) and
“z_High School”. The latter seems to be the collection of
missing values.
EDUCATION <- factor(d0$EDUCATION);
CLAIM_FLAG <- factor(d0$CLAIM_FLAG);
summarytools::ctable(EDUCATION, CLAIM_FLAG)

| EDUCATION \ CLAIM_FLAG | 0 | 1 | Total |
|---|---|---|---|
| <High School | 1023 (67.5%) | 492 (32.5%) | 1515 (100.0%) |
| Bachelors | 2150 (76.2%) | 673 (23.8%) | 2823 (100.0%) |
| Masters | 1674 (80.6%) | 404 (19.4%) | 2078 (100.0%) |
| PhD | 780 (83.5%) | 154 (16.5%) | 934 (100.0%) |
| z_High School | 1929 (65.3%) | 1023 (34.7%) | 2952 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |
The table above of the EDUCATION variable shows that the probability of
being involved in an accident decreases with education. The classes
z_High_School and <High_School, however, do not fit this pattern: the
people with lower than high school are involved in accidents in 32.5% of
the cases, while the z_High_School group has an even higher probability
to be in an accident (34.7%).
For this particular variable we might merge the two classes with the
highest probability to be involved in an accident into
<=High_School. This is logically consistent and leaves
us with similar bucket sizes.
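A minimal sketch of such a merge in R, shown on a toy factor with the same level labels (in the analysis it would be applied to df$EDUCATION):

```r
# Toy factor with the same level labels as EDUCATION in the data:
edu <- factor(c("<High School", "z_High School", "Bachelors",
                "Masters", "PhD", "z_High School"))

# Assigning one common label to both levels merges them into one bucket:
levels(edu)[levels(edu) %in% c("<High School", "z_High School")] <- "<=High_School"
table(edu)
```

After the assignment the two problematic classes form a single level <=High_School.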
Therefore we assume that this unfortunate data manipulation was also
done in other variables. For example, the variable GENDER
seems to have mixed the missing values with the females. Conclusions based
on small differences will hence not be reliable.
| GENDER \ CLAIM_FLAG | 0 | 1 | Total |
|---|---|---|---|
| M | 3539 (74.4%) | 1218 (25.6%) | 4757 (100.0%) |
| z_F | 4017 (72.4%) | 1528 (27.6%) | 5545 (100.0%) |
| Total | 7556 (73.3%) | 2746 (26.7%) | 10302 (100.0%) |
These considerations place serious question marks over the reliability of the data and must be kept in mind when assessing data quality.
# Preprocess the variables into numeric / factors as necessary
# dollar values are in the format "$20,540" (string) -> convert this:
df <- as_tibble(d0) %>%   # as.tbl() is deprecated in recent dplyr
mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM", "CLM_AMT"),
parse_number) %>%
mutate_at(c("EDUCATION","OCCUPATION","CAR_TYPE","URBANICITY"),
space_to_underscore) %>%
mutate_at(c("PARENT1", "MSTATUS", "GENDER", "EDUCATION","OCCUPATION", "CAR_USE", "CAR_TYPE", "RED_CAR", "REVOKED", "URBANICITY"),
as.factor) %>%
  mutate(CLAIM_FLAG = as.factor(CLAIM_FLAG))

We also notice that:

- the ID column will never yield useful information;
- the BIRTH data is the same as AGE but in a less usable format.

Therefore, we will remove those columns before moving forward.
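A sketch of that clean-up step, shown on a toy tibble with the same column names (the birth dates here are made up):

```r
library(dplyr)

# Toy stand-in for the real data frame df:
df_demo <- tibble(ID    = 1:3,
                  BIRTH = c("16MAR39", "21JAN56", "18NOV51"),  # hypothetical values
                  AGE   = c(60, 43, 48))

# Drop the identifier and the redundant birth date:
df_demo <- df_demo %>% select(-ID, -BIRTH)
names(df_demo)  # only AGE remains
```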
The univariate summaries for the individual variables (after some cleaning) are provided below.
summarytools::dfSummary(df,
plain.ascii = FALSE,
style = 'grid',
graph.magnif = 0.85,
varnumbers = FALSE,
valid.col = FALSE,
                       tmp.img.dir = "/tmp")

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| KIDSDRIV [numeric] | Mean (sd): 0.2 (0.5); min < med < max: 0 < 0 < 4; IQR (CV): 0 (3) | 0: 9069 (88.0%), 1: 804 (7.8%), 2: 351 (3.4%), 3: 74 (0.7%), 4: 4 (0.0%) | 0 (0.0%) |
| AGE [numeric] | Mean (sd): 44.8 (8.6); min < med < max: 16 < 45 < 81; IQR (CV): 12 (0.2) | 61 distinct values | 7 (0.1%) |
| HOMEKIDS [numeric] | Mean (sd): 0.7 (1.1); min < med < max: 0 < 0 < 5; IQR (CV): 1 (1.5) | 0: 6694 (65.0%), 1: 1106 (10.7%), 2: 1427 (13.9%), 3: 856 (8.3%), 4: 201 (2.0%), 5: 18 (0.2%) | 0 (0.0%) |
| YOJ [numeric] | Mean (sd): 10.5 (4.1); min < med < max: 0 < 11 < 23; IQR (CV): 4 (0.4) | 21 distinct values | 548 (5.3%) |
| INCOME [numeric] | Mean (sd): 61572.1 (47457.2); min < med < max: 0 < 53529 < 367030; IQR (CV): 58582 (0.8) | 8151 distinct values | 570 (5.5%) |
| PARENT1 [factor] | 1. No, 2. Yes | 8959 (87.0%), 1343 (13.0%) | 0 (0.0%) |
| HOME_VAL [numeric] | Mean (sd): 154523 (129188.4); min < med < max: 0 < 160661 < 885282; IQR (CV): 238256 (0.8) | 6334 distinct values | 575 (5.6%) |
| MSTATUS [factor] | 1. Yes, 2. z_No | 6188 (60.1%), 4114 (39.9%) | 0 (0.0%) |
| GENDER [factor] | 1. M, 2. z_F | 4757 (46.2%), 5545 (53.8%) | 0 (0.0%) |
| EDUCATION [factor] | 1. <High_School, 2. Bachelors, 3. Masters, 4. PhD, 5. z_High_School | 1515 (14.7%), 2823 (27.4%), 2078 (20.2%), 934 (9.1%), 2952 (28.7%) | 0 (0.0%) |
| OCCUPATION [factor] | 1. Clerical, 2. Doctor, 3. Home_Maker, 4. Lawyer, 5. Manager, 6. Professional, 7. Student, 8. z_Blue_Collar | 1590 (16.5%), 321 (3.3%), 843 (8.7%), 1031 (10.7%), 1257 (13.0%), 1408 (14.6%), 899 (9.3%), 2288 (23.7%) | 665 (6.5%) |
| TRAVTIME [numeric] | Mean (sd): 33.4 (15.9); min < med < max: 5 < 33 < 142; IQR (CV): 22 (0.5) | 100 distinct values | 0 (0.0%) |
| CAR_USE [factor] | 1. Commercial, 2. Private | 3789 (36.8%), 6513 (63.2%) | 0 (0.0%) |
| BLUEBOOK [numeric] | Mean (sd): 15659.9 (8428.8); min < med < max: 1500 < 14400 < 69740; IQR (CV): 11690 (0.5) | 2985 distinct values | 0 (0.0%) |
| TIF [numeric] | Mean (sd): 5.3 (4.1); min < med < max: 1 < 4 < 25; IQR (CV): 6 (0.8) | 23 distinct values | 0 (0.0%) |
| CAR_TYPE [factor] | 1. Minivan, 2. Panel_Truck, 3. Pickup, 4. Sports_Car, 5. Van, 6. z_SUV | 2694 (26.2%), 853 (8.3%), 1772 (17.2%), 1179 (11.4%), 921 (8.9%), 2883 (28.0%) | 0 (0.0%) |
| RED_CAR [factor] | 1. no, 2. yes | 7326 (71.1%), 2976 (28.9%) | 0 (0.0%) |
| OLDCLAIM [numeric] | Mean (sd): 4034 (8733.1); min < med < max: 0 < 0 < 57037; IQR (CV): 4647.5 (2.2) | 3545 distinct values | 0 (0.0%) |
| CLM_FREQ [numeric] | Mean (sd): 0.8 (1.2); min < med < max: 0 < 0 < 5; IQR (CV): 2 (1.4) | 0: 6292 (61.1%), 1: 1279 (12.4%), 2: 1492 (14.5%), 3: 992 (9.6%), 4: 225 (2.2%), 5: 22 (0.2%) | 0 (0.0%) |
| REVOKED [factor] | 1. No, 2. Yes | 9041 (87.8%), 1261 (12.2%) | 0 (0.0%) |
| MVR_PTS [numeric] | Mean (sd): 1.7 (2.2); min < med < max: 0 < 1 < 13; IQR (CV): 3 (1.3) | 14 distinct values | 0 (0.0%) |
| CLM_AMT [numeric] | Mean (sd): 1511.3 (4725.2); min < med < max: 0 < 0 < 123247; IQR (CV): 1144.8 (3.1) | 2346 distinct values | 0 (0.0%) |
| CAR_AGE [numeric] | Mean (sd): 8.3 (5.7); min < med < max: -3 < 8 < 28; IQR (CV): 11 (0.7) | 30 distinct values | 639 (6.2%) |
| CLAIM_FLAG [factor] | 1. 0, 2. 1 | 7556 (73.3%), 2746 (26.7%) | 0 (0.0%) |
| URBANICITY [factor] | 1. Highly_Urban/Urban, 2. z_Highly_Rural/Rural | 8230 (79.9%), 2072 (20.1%) | 0 (0.0%) |
The packages DataExplorer and naniar
provide tools to investigate missing values.
plot_missing() from DataExplorer can show the
percentages of missing values.
gg_miss_upset shows patterns in missing data by exploring
joint missing values.
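The per-column percentages behind plot_missing() can also be computed directly; here is a small base-R sketch on a toy data frame (the NA pattern is made up):

```r
# Toy data frame with some missing values:
d_demo <- data.frame(AGE      = c(45, NA, 52, 39),
                     YOJ      = c(11,  9, NA, NA),
                     TRAVTIME = c(33, 20, 41, 14))

# Percentage of missing values per column:
pct_missing <- colMeans(is.na(d_demo)) * 100
sort(pct_missing, decreasing = TRUE)  # YOJ: 50, AGE: 25, TRAVTIME: 0
```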
The correlation structure and amplitude of the missing values are illustrated in Figure .
The variable OCCUPATION has the highest number of
missing values: 665. So, 6.46% is missing.
# To illustrate the relevance of the NA with summarytools:
freq(d0$OCCUPATION) %>%
kable(caption='The number of missing values is almost the same as the amount of Home makers or Students.')| Freq | % Valid | % Valid Cum. | % Total | % Total Cum. | |
|---|---|---|---|---|---|
| Clerical | 1590 | 16.498910 | 16.49891 | 15.433896 | 15.43390 |
| Doctor | 321 | 3.330912 | 19.82982 | 3.115900 | 18.54980 |
| Home Maker | 843 | 8.747535 | 28.57736 | 8.182877 | 26.73267 |
| Lawyer | 1031 | 10.698350 | 39.27571 | 10.007765 | 36.74044 |
| Manager | 1257 | 13.043478 | 52.31919 | 12.201514 | 48.94195 |
| Professional | 1408 | 14.610356 | 66.92954 | 13.667249 | 62.60920 |
| Student | 899 | 9.328629 | 76.25817 | 8.726461 | 71.33566 |
| z_Blue Collar | 2288 | 23.741828 | 100.00000 | 22.209280 | 93.54494 |
| 665 | NA | NA | 6.455057 | 100.00000 | |
| Total | 10302 | 100.000000 | 100.00000 | 100.000000 | 100.00000 |
The figure also indicates that the data is not missing at random (MAR).
We can consider the following solutions for missing values:

- Delete columns: a candidate would be a column such as RED_CAR that is not very useful, but that one has no missing values. Most missing values are in OCCUPATION.
- Delete the rows with OCCUPATION missing.
- Treat the missing values as a separate level and study the correlation structure of the missing values.
- Impute the missing values, for example with mice, missForest, missRanger or Amelia.

To make the computations lighter we can first remove the rows that miss more than one value (these are small numbers in our case): 3.22% of the rows have more than one data point missing (in total 25.67% of the rows have missing data).
# Delete the rows that have more than once field missing:
df <- df[rowSums(is.na(df)) <= 1, ]
# Use missRanger to impute missing values:
df <- missRanger::missRanger(df)

Here is how we could use mice (note that this code chunk is not executed, eval=FALSE):
# Impute the columns with missing values
# install.packages("mice")
#library(mice)
df_missing_cols = df[,c("AGE","INCOME","YOJ","HOME_VAL","CAR_AGE")]
df_imp_cols_tmp = mice::mice(data = df_missing_cols, m = 1, method = "pmm", maxit = 50, seed = 500)
df_imp_cols = mice::complete(df_imp_cols_tmp)
df_imp = bind_cols(df %>% select(-AGE, -INCOME, -YOJ, -HOME_VAL, -CAR_AGE), df_imp_cols)
# For the rest of the analysis we use the imputed data
df <- df_imp[complete.cases(df_imp),]

Here is how we could use missForest (note that this code chunk is not executed, eval=FALSE):
# Input missing values via random forest estimates:
# ---
library(missForest)

# Impute missing values:
# Note: this fails:
# d1_imp <- missForest(df, )$ximp
# While everything that worked for a data-frame will also work
# for a tibble, this is an exception, so we have 2 solutions:
# 1. coerce to data-frame
d_imp <- missForest(as.data.frame(df), )$ximp
# 2. Manual fit the randomforest and impute the missing data:
x_cc <- df %>% filter(!is.na(df$OCCUPATION))
x_nc <- df %>% filter(is.na(df$OCCUPATION))
rf <- randomForest::randomForest(OCCUPATION ~ ., data = x_cc)
x_nc$OCCUPATION <- predict(rf, x_nc)
x <- rbind(x_cc, x_nc)

# Keep only the continuous (non-factor) variables:
d_cont <- data.frame()[1:nrow(df), ]
n <- 1
for (col in colnames(df)) {
if (!(is.factor(df[[col]]))) {
d_cont <- cbind(d_cont, df[[col]])
colnames(d_cont)[n] <- gsub("\"", "", deparse(col))
n <- n + 1
}
}
d_cont <- mutate(d_cont, CLAIM_FLAG = as.numeric(as.character(df$CLAIM_FLAG)))
d_cont <- d_cont[complete.cases(d_cont),]

# Use the library GGally to explore the correlation structure
# ---
# ---
GGally::ggcorr(d_cont, method = c("everything", "pearson"),
label = TRUE, label_round = 2, label_size = 3,
               size = 2.5, geom = "circle")

The correlation structure of the data.
df %>%
select_if(~class(.) == 'factor') %>%
iv(y = 'CLAIM_FLAG') %>%
kable(
caption = 'The table of all information values
for each categorical variable ordered in
    decreasing order.')

| variable | info_value |
|---|---|
| URBANICITY | 0.4031542 |
| OCCUPATION | 0.1875184 |
| EDUCATION | 0.1280311 |
| PARENT1 | 0.1169597 |
| REVOKED | 0.1087880 |
| CAR_TYPE | 0.1051723 |
| CAR_USE | 0.0959599 |
| MSTATUS | 0.0856915 |
| GENDER | 0.0028151 |
| RED_CAR | 0.0004316 |
(De Brouwer 2020, p. 359) suggests the following rule of thumb for these information values:
| IV | predictability |
|---|---|
| < 0.02 | not predictable |
| 0.02 – 0.1 | weak |
| 0.1 – 0.3 | medium |
| 0.3 – 0.5 | strong |
| > 0.5 | suspicious |
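This rule of thumb is easy to encode; the helper below is our own sketch, not a function from the book or from {scorecard}:

```r
# Map an information value to the rule-of-thumb label above:
iv_label <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
      labels = c("not predictable", "weak", "medium", "strong", "suspicious"))
}

# Applied to the IVs of URBANICITY, MSTATUS and GENDER from the table above:
iv_label(c(0.4031542, 0.0856915, 0.0028151))
```

URBANICITY comes out as strong, MSTATUS as weak and GENDER as not predictable, in line with the IV table.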
This implies that CAR_USE and MSTATUS are only weakly
predictive, and that GENDER (and RED_CAR) is
probably not predictive at all. At least we do not expect a strong relationship.
Data binning is a data preprocessing technique that groups continuous numerical values into a smaller number of discrete “bins” or “buckets” (or groups values of categorical variables). It simplifies data, reduces noise, and can improve the performance of analytical models by converting a continuous variable into a categorical one. For example, ages can be binned into groups like “0-17,” “18-64,” and “65+,” instead of using each individual age.
How data binning works
Why data binning is used
Considerations
The number and size of the bins can significantly affect the results. Too few bins can oversimplify the data, while too many can make it cluttered and fail to smooth out noise. Choosing the right binning method is important to avoid hiding important trends or creating misleading representations of the data.
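In R, a continuous variable can be binned with the base function cut(); the break points below are illustrative only:

```r
ages <- c(5, 17, 18, 30, 64, 65, 80)

# Right-closed intervals: (-Inf,17], (17,64], (64,Inf]
bins <- cut(ages, breaks = c(-Inf, 17, 64, Inf),
            labels = c("0-17", "18-64", "65+"))
table(bins)  # 0-17: 2, 18-64: 3, 65+: 2
```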
Binning in R

Note that the package {scorecard} provides a function
woebin() that will automatically propose bins for all
variables. The bins can then be fine-tuned with
woebin_adj() and visualized with woebin_plot().
Weight of Evidence (WoE) is a fundamental concept used primarily in credit scoring and binary classification modeling to quantify the strength and direction of the relationship between a predictor variable and the target outcome. WoE transforms categorical or binned continuous variables into continuous numeric values based on the distribution of “events” (e.g., defaults) and “non-events” (e.g., non-defaults). Mathematically, for a given bin \(i\), WoE is defined as:
\[ \text{WoE}_i = \log \left( \frac{\text{Distribution of non-events in bin } i}{\text{Distribution of events in bin } i} \right) = \log \left( \frac{P(\text{non-event}_i)}{P(\text{event}_i)} \right) \]
where \(P(\text{event}_i)\) and \(P(\text{non-event}_i)\) represent the proportions of events and non-events in bin \(i\) relative to the total events and non-events, respectively. A positive WoE indicates that the bin has a higher concentration of non-events than events (predicting a “good” outcome), while a negative WoE suggests the opposite.
One possible way to aggregate WoE per bin, so that each bin's contribution is non-negative, is the Information Value (IV) for bin \(i\): \[ \text{IV}_i = \left( P(\text{non-event}_i) - P(\text{event}_i) \right) \times \text{WoE}_i \]
The advantage of IV is that it can meaningfully be added over all bins of a given variable. That summary measure quantifies the variable’s overall predictive power and is calculated as:
\[ \text{IV} = \sum_{i} \left( P(\text{non-event}_i) - P(\text{event}_i) \right) \times \text{WoE}_i \]
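To make the formulas concrete, here is a hand computation using the two KIDSDRIV bins from the scorecard output further below (6595 and 724 non-events, 2174 and 477 events). Note that scorecard reports WoE with the opposite sign convention, log of events over non-events; the IV is the same under both conventions:

```r
neg <- c(6595, 724)   # non-events (no claim) per bin
pos <- c(2174, 477)   # events (claim) per bin

neg_dist <- neg / sum(neg)   # P(non-event_i)
pos_dist <- pos / sum(pos)   # P(event_i)

woe <- log(neg_dist / pos_dist)
iv  <- sum((neg_dist - pos_dist) * woe)
round(iv, 4)  # 0.0561, the total_iv that woebin() reports for KIDSDRIV
```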
Intuitively, IV measures how well the variable’s categories or bins differentiate between the two outcome groups. Variables with higher IV values are more predictive; typical IV interpretation thresholds were given in the rule-of-thumb table above.
The package scorecard provides tools to investigate
existing bins, propose optimal bins, and implement or change them. The
output below shows the bins that woebin() proposes for each variable:
✔ Binning on 9970 rows and 25 columns in 00:00:13
$KIDSDRIV
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: KIDSDRIV [-Inf,1) 8769 0.8795386 6595 2174 0.2479188 -0.09420622
2: KIDSDRIV [1, Inf) 1201 0.1204614 724 477 0.3971690 0.59826188
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.007631785 0.05609787 1 FALSE
2: 0.048466082 0.05609787 Inf FALSE
$AGE
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: AGE [-Inf,33) 770 0.07723170 432 338 0.4389610 0.77015709
2: AGE [33,40) 1952 0.19578736 1315 637 0.3263320 0.29071449
3: AGE [40,44) 1634 0.16389168 1196 438 0.2680539 0.01101776
4: AGE [44,46) 888 0.08906720 684 204 0.2297297 -0.19430114
5: AGE [46,57) 3873 0.38846540 3123 750 0.1936483 -0.41093937
6: AGE [57, Inf) 853 0.08555667 569 284 0.3329426 0.32063058
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 5.273620e-02 0.1420944 33 FALSE
2: 1.762234e-02 0.1420944 40 FALSE
3: 1.994621e-05 0.1420944 44 FALSE
4: 3.206609e-03 0.1420944 46 FALSE
5: 5.908713e-02 0.1420944 57 FALSE
6: 9.422218e-03 0.1420944 Inf FALSE
$HOMEKIDS
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: HOMEKIDS [-Inf,1) 6456 0.6475426 5023 1433 0.2219641 -0.2387204
2: HOMEKIDS [1,3) 2473 0.2480441 1621 852 0.3445208 0.3723248
3: HOMEKIDS [3, Inf) 1041 0.1044132 675 366 0.3515850 0.4034574
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.03479236 0.09048389 1 FALSE
2: 0.03719890 0.09048389 3 FALSE
3: 0.01849263 0.09048389 Inf FALSE
$YOJ
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: YOJ [-Inf,1) 820 0.08224674 497 323 0.3939024 0.5845991
2: YOJ [1,8) 849 0.08515547 656 193 0.2273263 -0.2079338
3: YOJ [8,11) 2165 0.21715145 1536 629 0.2905312 0.1227311
4: YOJ [11,14.5) 5150 0.51654965 3904 1246 0.2419417 -0.1265265
5: YOJ [14.5, Inf) 986 0.09889669 726 260 0.2636917 -0.0113316
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 3.153056e-02 0.04642662 1 FALSE
2: 3.498905e-03 0.04642662 8 FALSE
3: 3.363351e-03 0.04642662 11 FALSE
4: 8.021131e-03 0.04642662 14.5 FALSE
5: 1.266512e-05 0.04642662 Inf FALSE
$INCOME
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: INCOME [-Inf,15000) 1518 0.1522568 954 564 0.3715415 0.4899274
2: INCOME [15000,70000) 4952 0.4966901 3504 1448 0.2924071 0.1318149
3: INCOME [70000,85000) 1035 0.1038114 797 238 0.2299517 -0.1930472
4: INCOME [85000,120000) 1452 0.1456369 1197 255 0.1756198 -0.5307734
5: INCOME [120000, Inf) 1013 0.1016048 867 146 0.1441264 -0.7658956
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.040372087 0.1372519 15000 FALSE
2: 0.008891581 0.1372519 70000 FALSE
3: 0.003690525 0.1372519 85000 FALSE
4: 0.035751205 0.1372519 120000 FALSE
5: 0.048546482 0.1372519 Inf FALSE
$PARENT1
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: PARENT1 No 8669 0.8695085 6602 2067 0.2384358 -0.1457376
2: PARENT1 Yes 1301 0.1304915 717 584 0.4488855 0.8103619
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.01782808 0.1169597 No FALSE
2: 0.09913159 0.1169597 Yes FALSE
$HOME_VAL
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: HOME_VAL [-Inf,80000) 3145 0.3154463 1970 1175 0.3736089 0.49877138
2: HOME_VAL [80000,220000) 3852 0.3863591 2848 1004 0.2606438 -0.02708819
3: HOME_VAL [220000,260000) 1066 0.1069208 855 211 0.1979362 -0.38370656
4: HOME_VAL [260000, Inf) 1907 0.1912738 1646 261 0.1368642 -0.82604619
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.0868193973 0.2058311 80000 FALSE
2: 0.0002816952 0.2058311 220000 FALSE
3: 0.0142841024 0.2058311 260000 FALSE
4: 0.1044458569 0.2058311 Inf FALSE
$MSTATUS
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: MSTATUS Yes 5999 0.6017051 4685 1314 0.2190365 -0.2557532
2: MSTATUS z_No 3971 0.3982949 2634 1337 0.3366910 0.3374615
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.03694426 0.0856915 Yes FALSE
2: 0.04874724 0.0856915 z_No FALSE
$GENDER
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: GENDER M 4552 0.4565697 3393 1159 0.2546134 -0.05862014
2: GENDER z_F 5418 0.5434303 3926 1492 0.2753784 0.04803319
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.001547244 0.002815052 M FALSE
2: 0.001267808 0.002815052 z_F FALSE
$EDUCATION
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: EDUCATION <High_School 1488 0.14924774 1003 485 0.3259409 0.2889349
2: EDUCATION Bachelors 2769 0.27773320 2111 658 0.2376309 -0.1501753
3: EDUCATION Masters 1947 0.19528586 1578 369 0.1895223 -0.4375801
4: EDUCATION PhD 867 0.08696088 729 138 0.1591696 -0.6488833
5: EDUCATION z_High_School 2899 0.29077232 1898 1001 0.3452915 0.3757356
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.013264784 0.1280311 <High_School FALSE
2: 0.006039926 0.1280311 Bachelors FALSE
3: 0.033435698 0.1280311 Masters FALSE
4: 0.030853065 0.1280311 PhD FALSE
5: 0.044437614 0.1280311 z_High_School FALSE
$OCCUPATION
variable bin count count_distr neg
<char> <char> <int> <num> <int>
1: OCCUPATION Clerical 1578 0.1582748 1103
2: OCCUPATION Doctor%,%Home_Maker%,%Lawyer%,%Manager 3792 0.3803410 3097
3: OCCUPATION Professional 1441 0.1445336 1094
4: OCCUPATION Student%,%z_Blue_Collar 3159 0.3168506 2025
pos posprob woe bin_iv total_iv
<int> <num> <num> <num> <num>
1: 475 0.3010139 0.1730626 0.004927787 0.1502926
2: 695 0.1832806 -0.4787406 0.077067663 0.1502926
3: 347 0.2408050 -0.1327344 0.002466202 0.1502926
4: 1134 0.3589744 0.4357183 0.065830918 0.1502926
breaks is_special_values
<char> <lgcl>
1: Clerical FALSE
2: Doctor%,%Home_Maker%,%Lawyer%,%Manager FALSE
3: Professional FALSE
4: Student%,%z_Blue_Collar FALSE
$TRAVTIME
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: TRAVTIME [-Inf,12) 865 0.08676028 689 176 0.2034682 -0.34922050
2: TRAVTIME [12,22) 1442 0.14463390 1101 341 0.2364771 -0.15655488
3: TRAVTIME [22,38) 3849 0.38605817 2847 1002 0.2603274 -0.02873103
4: TRAVTIME [38, Inf) 3814 0.38254764 2682 1132 0.2968013 0.15295997
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.0096903457 0.02268377 12 FALSE
2: 0.0034128465 0.02268377 22 FALSE
3: 0.0003165294 0.02268377 38 FALSE
4: 0.0092640473 0.02268377 Inf FALSE
$CAR_USE
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: CAR_USE Commercial 3601 0.3611836 2350 1251 0.3474035 0.3850647
2: CAR_USE Private 6369 0.6388164 4969 1400 0.2198147 -0.2512096
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.05807364 0.09595988 Commercial FALSE
2: 0.03788625 0.09595988 Private FALSE
$BLUEBOOK
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: BLUEBOOK [-Inf,6000) 1005 0.1008024 609 396 0.3940299 0.58513272
2: BLUEBOOK [6000,12000) 2888 0.2896690 2016 872 0.3019391 0.17745557
3: BLUEBOOK [12000,17000) 2272 0.2278837 1750 522 0.2297535 -0.19416670
4: BLUEBOOK [17000,27000) 2775 0.2783350 2090 685 0.2468468 -0.09996373
5: BLUEBOOK [27000, Inf) 1030 0.1033099 854 176 0.1708738 -0.56391042
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.038717943 0.08747841 6000 FALSE
2: 0.009491214 0.08747841 12000 FALSE
3: 0.008193228 0.08747841 17000 FALSE
4: 0.002715522 0.08747841 27000 FALSE
5: 0.028360502 0.08747841 Inf FALSE
$TIF
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: TIF [-Inf,4) 3586 0.3596790 2468 1118 0.3117680 0.223670048
2: TIF [4,5) 1565 0.1569709 1151 414 0.2645367 -0.006983656
3: TIF [5,9) 2568 0.2575727 1937 631 0.2457165 -0.106053022
4: TIF [9, Inf) 2251 0.2257773 1763 488 0.2167925 -0.268919997
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 1.890529e-02 0.03701138 4 FALSE
2: 7.643171e-06 0.03701138 5 FALSE
3: 2.824219e-03 0.03701138 9 FALSE
4: 1.527423e-02 0.03701138 Inf FALSE
$CAR_TYPE
variable bin count count_distr neg pos
<char> <char> <int> <num> <int> <int>
1: CAR_TYPE Minivan 2631 0.2638917 2182 449
2: CAR_TYPE Panel_Truck%,%Pickup%,%Sports_Car 3634 0.3644935 2496 1138
3: CAR_TYPE Van%,%z_SUV 3705 0.3716148 2641 1064
posprob woe bin_iv total_iv
<num> <num> <num> <num>
1: 0.1706575 -0.5654375 0.072804664 0.09742236
2: 0.3131535 0.2301197 0.020306168 0.09742236
3: 0.2871795 0.1064145 0.004311527 0.09742236
breaks is_special_values
<char> <lgcl>
1: Minivan FALSE
2: Panel_Truck%,%Pickup%,%Sports_Car FALSE
3: Van%,%z_SUV FALSE
$RED_CAR
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: RED_CAR no 7126 0.7147442 5213 1913 0.2684536 0.01305396
2: RED_CAR yes 2844 0.2852558 2106 738 0.2594937 -0.03306509
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.0001221683 0.000431615 no FALSE
2: 0.0003094467 0.000431615 yes FALSE
$OLDCLAIM
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: OLDCLAIM [-Inf,500) 6117 0.61354062 5027 1090 0.1781919 -0.5131089
2: OLDCLAIM [500,3500) 910 0.09127382 553 357 0.3923077 0.5779146
3: OLDCLAIM [3500, Inf) 2943 0.29518556 1739 1204 0.4091064 0.6478759
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.14145226 0.3159213 500 FALSE
2: 0.03416018 0.3159213 3500 FALSE
3: 0.14030884 0.3159213 Inf FALSE
$CLM_FREQ
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: CLM_FREQ [-Inf,1) 6117 0.6135406 5027 1090 0.1781919 -0.5131089
2: CLM_FREQ [1,3) 2662 0.2670010 1605 1057 0.3970699 0.5978477
3: CLM_FREQ [3, Inf) 1191 0.1194584 687 504 0.4231738 0.7057788
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.14145226 0.3166536 1 FALSE
2: 0.10726896 0.3166536 3 FALSE
3: 0.06793237 0.3166536 Inf FALSE
$REVOKED
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: REVOKED No 8750 0.8776329 6647 2103 0.2403429 -0.1352639
2: REVOKED Yes 1220 0.1223671 672 548 0.4491803 0.8115537
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.01554164 0.108788 No FALSE
2: 0.09324641 0.108788 Yes FALSE
$MVR_PTS
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: MVR_PTS [-Inf,1) 4507 0.45205617 3641 866 0.1921456 -0.42059196
2: MVR_PTS [1,3) 2589 0.25967904 1931 658 0.2541522 -0.06105157
3: MVR_PTS [3,4) 938 0.09408225 634 304 0.3240938 0.28051553
4: MVR_PTS [4,6) 1207 0.12106319 763 444 0.3678542 0.47410331
5: MVR_PTS [6, Inf) 729 0.07311936 350 379 0.5198903 1.09513983
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.0718384320 0.2148368 1 FALSE
2: 0.0009539695 0.2148368 3 FALSE
3: 0.0078684191 0.2148368 4 FALSE
4: 0.0299798062 0.2148368 6 FALSE
5: 0.1041961793 0.2148368 Inf FALSE
$CLM_AMT
variable bin count count_distr neg pos posprob woe bin_iv
<char> <char> <int> <num> <int> <int> <num> <num> <num>
1: CLM_AMT [-Inf, Inf) 9970 1 7319 2651 0.2658977 0 0
total_iv breaks is_special_values
<num> <char> <lgcl>
1: 0 Inf FALSE
$CAR_AGE
variable bin count count_distr neg pos posprob woe
<char> <char> <int> <num> <int> <int> <num> <num>
1: CAR_AGE [-Inf,5) 2854 0.2862588 1916 938 0.3286615 0.30129177
2: CAR_AGE [5,11) 3717 0.3728185 2678 1039 0.2795265 0.06872524
3: CAR_AGE [11,13) 1028 0.1031093 802 226 0.2198444 -0.25103683
4: CAR_AGE [13, Inf) 2371 0.2378134 1923 448 0.1889498 -0.44131173
bin_iv total_iv breaks is_special_values
<num> <num> <char> <lgcl>
1: 0.027732204 0.07700021 5 FALSE
2: 0.001788959 0.07700021 11 FALSE
3: 0.006106964 0.07700021 13 FALSE
4: 0.041372083 0.07700021 Inf FALSE
$URBANICITY
variable bin count count_distr neg pos posprob
<char> <char> <int> <num> <int> <int> <num>
1: URBANICITY Highly_Urban/ Urban 7948 0.7971916 5426 2522 0.31731253
2: URBANICITY z_Highly_Rural/ Rural 2022 0.2028084 1893 129 0.06379822
woe bin_iv total_iv breaks is_special_values
<num> <num> <num> <char> <lgcl>
1: 0.2493868 0.05236649 0.4031542 Highly_Urban/ Urban FALSE
2: -1.6705690 0.35078776 0.4031542 z_Highly_Rural/ Rural FALSE
2 data-sets
For our purpose, we will create two datasets:
d_fact: a fully categorical dataset (to be used for generalised linear models)
d: a dataset that holds both numerical and categorical data – to be used for decision trees, random forests, etc.
The variable KIDSDRIV has an information value of 0.058294 with the existing binning. We use the function ctable() from {summarytools} to get more insight into the variable.
# KIDSDRIV is <dbl> and needs to be made categorical first
# Since there are only a limited number of possibilities, we can do this:
d_fact$KIDSDRIV <- factor(d_fact$KIDSDRIV)
# show the cross-tabulation:
summarytools::ctable(d_fact$KIDSDRIV, d_fact$CLAIM_FLAG)
| CLAIM_FLAG | 0 | 1 | Total | |
| KIDSDRIV | ||||
| 0 | 6595 (75.2%) | 2174 (24.8%) | 8769 (100.0%) | |
| 1 | 481 (61.4%) | 303 (38.6%) | 784 (100.0%) | |
| 2 | 207 (60.3%) | 136 (39.7%) | 343 (100.0%) | |
| 3 | 34 (48.6%) | 36 (51.4%) | 70 (100.0%) | |
| 4 | 2 (50.0%) | 2 (50.0%) | 4 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe') #+
#theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p)
$KIDSDRIV
This allows us to make a decision on what bins to use. We create the bins so that each bin is reasonably large and the bins differ clearly in claim probability.
Note that in the following sections we will leave out this exploratory phase and refer to the previous chapters about data exploration instead.
d_fact$KIDSDRIV <- if_else(df$KIDSDRIV == 0, 'none',
                   if_else(df$KIDSDRIV == 1, '1', '>=2')) %>%
  factor(levels = c('none', '1', '>=2'))
f_comment_iv(d_fact, y = 'CLAIM_FLAG', x = 'KIDSDRIV')
[1] "The information value for `KIDSDRIV` is: 0.058 (this is weakly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| KIDSDRIV | ||||
| none | 6595 (75.2%) | 2174 (24.8%) | 8769 (100.0%) | |
| 1 | 481 (61.4%) | 303 (38.6%) | 784 (100.0%) | |
| >=2 | 243 (58.3%) | 174 (41.7%) | 417 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
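The comments printed by f_comment_iv() follow the conventional rule-of-thumb IV thresholds (0.02 / 0.1 / 0.3). The actual function is defined in ethics_functions.R, which is not shown here; a minimal sketch of what it could look like, assuming it wraps scorecard::iv():

```r
# Hypothetical sketch of f_comment_iv() -- the real implementation is in
# ethics_functions.R. It computes the IV with scorecard::iv() and labels it
# with the usual rule-of-thumb thresholds.
f_comment_iv_sketch <- function(df, x, y) {
  iv_val <- scorecard::iv(df[, c(x, y)], y = y)$info_value
  label  <- if (iv_val < 0.02) "not predictable"      else
            if (iv_val < 0.10) "weakly predictable"   else
            if (iv_val < 0.30) "medium predictable"   else
                               "strongly predictable"
  paste0("The information value for `", x, "` is: ",
         round(iv_val, 3), " (this is ", label, ").")
}
```

The labels in the output throughout this document (e.g. 0.058 "weakly", 0.117 "medium", 0.316 "strongly") are consistent with these thresholds.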
# Manual binning can look like this:
#d_fact$AGE <- if_else(df$AGE < 30 , '<30',
#              if_else(df$AGE < 40, '30--40',
#              if_else(df$AGE < 55, '40--55', '>55'))) %>%
#  factor(levels = c('<30', '30--40', '40--55', '>55'))
# f_comment_iv(d_fact, x = 'AGE', y = 'CLAIM_FLAG')
# we propose automated binning:
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="AGE")
✔ Binning on 9970 rows and 2 columns in 00:00:01
$AGE
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('AGE' = 'AGE_bin') %>% # set the name back to what it was.
  dplyr::mutate(AGE = factor(AGE))     # make sure factor (required for automation)
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
| CLAIM_FLAG | 0 | 1 | Total | |
| AGE | ||||
| [-Inf,33) | 432 (56.1%) | 338 (43.9%) | 770 (100.0%) | |
| [33,40) | 1315 (67.4%) | 637 (32.6%) | 1952 (100.0%) | |
| [40,44) | 1196 (73.2%) | 438 (26.8%) | 1634 (100.0%) | |
| [44,46) | 684 (77.0%) | 204 (23.0%) | 888 (100.0%) | |
| [46,57) | 3123 (80.6%) | 750 (19.4%) | 3873 (100.0%) | |
| [57, Inf) | 569 (66.7%) | 284 (33.3%) | 853 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
✔ Binning on 9970 rows and 2 columns in 00:00:00
$HOMEKIDS
The WOE per suggested bin for HOMEKIDS.
# woebin() suggests the same bins as our intuition guided us to
# We can now choose how to implement these bins
# (1) automatically via woebin_ply():
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('HOMEKIDS' = 'HOMEKIDS_bin') %>% # set the name back
  dplyr::mutate(HOMEKIDS = factor(HOMEKIDS))
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
# or (2) manually
d_fact$HOMEKIDS <- if_else(df$HOMEKIDS == 0 , '0',
                   if_else(df$HOMEKIDS == 1, '1', '>=2')) %>%
  factor(levels = c('0', '1', '>=2'))
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="HOMEKIDS")
✔ Binning on 9970 rows and 2 columns in 00:00:00
$HOMEKIDS
The WOE per suggested bin for HOMEKIDS.
[1] "The information value for `HOMEKIDS` is: 0.09 (this is weakly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| HOMEKIDS | ||||
| 0 | 5023 (77.8%) | 1433 (22.2%) | 6456 (100.0%) | |
| 1 | 712 (65.8%) | 370 (34.2%) | 1082 (100.0%) | |
| >=2 | 1584 (65.1%) | 848 (34.9%) | 2432 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
“Years on job” refers to the number of years that the customer has been employed by the same company.
✔ Binning on 9970 rows and 2 columns in 00:00:00
$YOJ
The suggested binning for YOJ by {scorecard}
We notice that the pattern is a little erratic and not completely in line with intuition. There is a deep decline in claim probability for someone who has been in the same job for one year, but then it increases again. This is most likely a pattern that is only relevant in our own data. Therefore we will make our own binning.
d_fact$YOJ <- if_else(df$YOJ <= 1 , '<=1', '>1') %>%
  factor(levels = c('<=1', '>1'))
bins1 = woebin(d_fact, y="CLAIM_FLAG", x="YOJ") # extract the bin information
✔ Binning on 9970 rows and 2 columns in 00:00:00
$YOJ
Our own binning for YOJ
The information value of our own binning is 0.0339597.
The cross-table for YOJ looks now as follows:
| CLAIM_FLAG | 0 | 1 | Total | |
| YOJ | ||||
| <=1 | 503 (60.8%) | 324 (39.2%) | 827 (100.0%) | |
| >1 | 6816 (74.5%) | 2327 (25.5%) | 9143 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Income is one of those variables that could have some significance. One could expect that people who were able to study well, work hard, and work themselves towards a good income understand the risks of driving a car and drive more carefully.
✔ Binning on 9970 rows and 2 columns in 00:00:00
$INCOME
While the proposed bins are rather small, the relationship with the dependent variable is almost perfectly linear and in line with intuition. Therefore, we will accept the binning as proposed by the package scorecard.
# Manual binning could look like this:
#d_fact$INCOME <- if_else(df$INCOME < 25000 , '<25K',
#                 if_else(df$INCOME < 55000, '25--55K',
#                 if_else(df$INCOME < 75000, '55--75K', '>75K'))) %>%
#  factor(levels = c('<25K', '25--55K', '55--75K', '>75K'))
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('INCOME' = 'INCOME_bin') %>% # set the name back to what it was.
  dplyr::mutate(INCOME = factor(INCOME)) %>% # create factor object
  dplyr::mutate(INCOME = fct_reorder(INCOME, df$INCOME)) # re-order the bins in order of increasing income
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
The information value of the INCOME variable is now:
0.1372519; and the cross-table with CLAIM_FLAG is as
follows.
| CLAIM_FLAG | 0 | 1 | Total | |
| INCOME | ||||
| [-Inf,15000) | 954 (62.8%) | 564 (37.2%) | 1518 (100.0%) | |
| [15000,70000) | 3504 (70.8%) | 1448 (29.2%) | 4952 (100.0%) | |
| [70000,85000) | 797 (77.0%) | 238 (23.0%) | 1035 (100.0%) | |
| [85000,120000) | 1197 (82.4%) | 255 (17.6%) | 1452 (100.0%) | |
| [120000, Inf) | 867 (85.6%) | 146 (14.4%) | 1013 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
This variable is binary, and hence binning is obvious. We expect to see that people who are single parents have more accidents, as they will more often have distracting passengers in the car.
d_fact$PARENT1 <- df$PARENT1 %>%
  factor(levels = c('Yes', 'No'))
f_comment_iv(d_fact, x = 'PARENT1', y = 'CLAIM_FLAG')
[1] "The information value for `PARENT1` is: 0.117 (this is medium predictable)."
✔ Binning on 9970 rows and 2 columns in 00:00:00
$PARENT1
| CLAIM_FLAG | 0 | 1 | Total | |
| PARENT1 | ||||
| Yes | 717 (55.1%) | 584 (44.9%) | 1301 (100.0%) | |
| No | 6602 (76.2%) | 2067 (23.8%) | 8669 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
The value of the house in which people live is a proxy for wealth, foresight, and planning, and might be predictive of careful driving to some extent. The value is numeric.
# First try our own approach:
d_fact$HOME_VALtmp <- if_else(df$HOME_VAL < 100000 , '<100K',
                      if_else(df$HOME_VAL < 200000, '100--200K',
                      if_else(df$HOME_VAL < 300000, '200--300K', '>300K'))) %>%
  factor(levels = c('<100K', '100--200K', '200--300K', '>300K'))
f_comment_iv(d_fact, x = 'HOME_VALtmp', y = 'CLAIM_FLAG')
[1] "The information value for `HOME_VALtmp` is: 0.19 (this is medium predictable)."
✔ Binning on 9970 rows and 2 columns in 00:00:00
$HOME_VAL
# The IV of this binning is higher and the smallest bin is larger -> accept this one:
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('HOME_VAL' = 'HOME_VAL_bin') %>% # set the name back to what it was.
  dplyr::mutate(HOME_VAL = factor(HOME_VAL)) %>% # create factor object
  dplyr::mutate(HOME_VAL = fct_reorder(HOME_VAL, df$HOME_VAL)) %>% # re-order the bins by increasing home value
  dplyr::select(-c("HOME_VALtmp")) # remove the previously created column
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
| CLAIM_FLAG | 0 | 1 | Total | |
| HOME_VAL | ||||
| [-Inf,80000) | 1970 (62.6%) | 1175 (37.4%) | 3145 (100.0%) | |
| [80000,220000) | 2848 (73.9%) | 1004 (26.1%) | 3852 (100.0%) | |
| [220000,260000) | 855 (80.2%) | 211 (19.8%) | 1066 (100.0%) | |
| [260000, Inf) | 1646 (86.3%) | 261 (13.7%) | 1907 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Is the insured person married or not? While obviously correlated with age, people in a marriage typically tend to be more risk-averse and plan better. The variable is binary, hence binning is straightforward.
d_fact$MSTATUS <- if_else(df$MSTATUS == 'Yes', 'Yes', 'No') %>%
  factor(levels = c('Yes', 'No'))
f_comment_iv(d_fact, x = 'MSTATUS', y = 'CLAIM_FLAG')
[1] "The information value for `MSTATUS` is: 0.086 (this is weakly predictable)."
✔ Binning on 9970 rows and 2 columns in 00:00:00
$MSTATUS
| CLAIM_FLAG | 0 | 1 | Total | |
| MSTATUS | ||||
| Yes | 4685 (78.1%) | 1314 (21.9%) | 5999 (100.0%) | |
| No | 2634 (66.3%) | 1337 (33.7%) | 3971 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Would gender influence the propensity to claim insurance for a car accident?
d_fact$GENDER <- if_else(df$GENDER == 'M', 'M', 'F') %>%
  factor(levels = c('M', 'F'))
f_comment_iv(d_fact, x = 'GENDER', y = 'CLAIM_FLAG')
[1] "The information value for `GENDER` is: 0.003 (this is not predictable)."
We find that in this dataset women claim slightly more often; however, the difference is not useful for modelling – it is probably an example of spurious correlation.
| CLAIM_FLAG | 0 | 1 | Total | |
| GENDER | ||||
| M | 3393 (74.5%) | 1159 (25.5%) | 4552 (100.0%) | |
| F | 3926 (72.5%) | 1492 (27.5%) | 5418 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Gender is not predictive for the probability to be involved in an accident.
d_fact$EDUCATION <- factor(df$EDUCATION,
levels = c('<High_School', 'z_High_School', 'Bachelors', 'Masters', 'PhD'))
f_comment_iv(d_fact, x = 'EDUCATION', y = 'CLAIM_FLAG')
[1] "The information value for `EDUCATION` is: 0.128 (this is medium predictable)."
✔ Binning on 9970 rows and 2 columns in 00:00:00
$EDUCATION
People with a PhD are a smaller group, but since the relationship is stable and logical, we keep all groups as they are.
| CLAIM_FLAG | 0 | 1 | Total | |
| EDUCATION | ||||
| <High_School | 1003 (67.4%) | 485 (32.6%) | 1488 (100.0%) | |
| z_High_School | 1898 (65.5%) | 1001 (34.5%) | 2899 (100.0%) | |
| Bachelors | 2111 (76.2%) | 658 (23.8%) | 2769 (100.0%) | |
| Masters | 1578 (81.0%) | 369 (19.0%) | 1947 (100.0%) | |
| PhD | 729 (84.1%) | 138 (15.9%) | 867 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
OCCUPATION is most likely predictive for the propensity to claim car insurance. It is nominal data (labels without a specific order), and hence we can merge the labels that have the most similar WOE. This allows us to trust the suggestion of woebin().
✔ Binning on 9970 rows and 2 columns in 00:00:00
p = woebin_plot(bins1, line_value = 'woe')
# in order to turn the x-axis labels, we need to do this for each object in the list:
for (varname in names(p)) {
  print(p[[varname]] +
    theme(axis.text.x = element_text(angle = 45, vjust = 1.0, hjust = 1)))
}
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('OCCUPATION' = 'OCCUPATION_bin') %>% # set the name back to what it was.
  dplyr::mutate(OCCUPATION = factor(OCCUPATION))     # make sure factor (required for automation)
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
[1] "The information value for `OCCUPATION` is: 0.15 (this is medium predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| OCCUPATION | ||||
| Clerical | 1103 (69.9%) | 475 (30.1%) | 1578 (100.0%) | |
| Doctor%,%Home_Maker%,%Lawyer%,%Manager | 3097 (81.7%) | 695 (18.3%) | 3792 (100.0%) | |
| Professional | 1094 (75.9%) | 347 (24.1%) | 1441 (100.0%) | |
| Student%,%z_Blue_Collar | 2025 (64.1%) | 1134 (35.9%) | 3159 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
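If one prefers to apply the merge by hand rather than via woebin_ply(), the same grouping can be expressed with forcats::fct_collapse(); a sketch, taking the level names from the cross-table above:

```r
library(forcats)

# Manual alternative to woebin_ply() for nominal data: collapse the levels
# that have similar WOE into one level each (names taken from the table above).
occupation_merged <- fct_collapse(factor(df$OCCUPATION),
  Doctor_HomeMaker_Lawyer_Manager = c("Doctor", "Home_Maker", "Lawyer", "Manager"),
  Student_BlueCollar              = c("Student", "z_Blue_Collar")
  # the remaining levels (Clerical, Professional) are kept as they are
)
```

The automated route is preferred here because the merge decision is then reproducible from the woebin() output rather than hard-coded.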
TRAVTIME is numeric. One can expect a linear
relationship between the commute time and the probability to claim
insurance.
✔ Binning on 9970 rows and 2 columns in 00:00:00
$TRAVTIME
d_fact <- woebin_ply(d_fact, bins1, to = 'bin') %>%
  dplyr::rename('TRAVTIME' = 'TRAVTIME_bin') %>% # set the name back to what it was.
  dplyr::mutate(TRAVTIME = factor(TRAVTIME))     # make sure factor (required for automation)
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
[1] "The information value for `TRAVTIME` is: 0.023 (this is weakly predictable)."
d_fact$TRAVTIME <- if_else(df$TRAVTIME < 25, '<25',
                   if_else(df$TRAVTIME < 40, '25--40', '>40')) %>%
  factor(levels = c('<25', '25--40', '>40'))
f_comment_iv(d_fact, x = 'TRAVTIME', y = 'CLAIM_FLAG')
| CLAIM_FLAG | 0 | 1 | Total | |
| TRAVTIME | ||||
| <25 | 2262 (76.8%) | 685 (23.2%) | 2947 (100.0%) | |
| 25–40 | 2692 (73.8%) | 958 (26.2%) | 3650 (100.0%) | |
| >40 | 2365 (70.1%) | 1008 (29.9%) | 3373 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
This is a binary variable.
d_fact$CAR_USE <- paste(df$CAR_USE) %>%
factor()
f_comment_iv(d_fact, x = 'CAR_USE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_USE` is: 0.096 (this is weakly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| CAR_USE | ||||
| Commercial | 2350 (65.3%) | 1251 (34.7%) | 3601 (100.0%) | |
| Private | 4969 (78.0%) | 1400 (22.0%) | 6369 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
A numeric variable. We first build a challenger binning based on data exploration.
d_fact$BLUEBOOK <- if_else(df$BLUEBOOK < 10000, '<10K',
if_else(df$BLUEBOOK < 15000, '10--15K',
if_else(df$BLUEBOOK < 20000, '15--20K', '>20K'))) %>%
factor(levels = c('<10K', '10--15K', '15--20K', '>20K'))
f_comment_iv(d_fact, x = 'BLUEBOOK', y = 'CLAIM_FLAG')
[1] "The information value for `BLUEBOOK` is: 0.061 (this is weakly predictable)."
d_fact$BLUEBOOK <- df$BLUEBOOK # reset back to the numeric variable
# /or/ d_fact <- d_fact %>% dplyr::mutate(BLUEBOOK = df$BLUEBOOK)
bins = woebin(d_fact, y="CLAIM_FLAG", x="BLUEBOOK")
✔ Binning on 9970 rows and 2 columns in 00:00:00
$BLUEBOOK
# Note that the 4th bin has an increase of WOE compared to the 3rd - that is not logical.
# We decide on a bin less and all bins wider:
custom_breaks <- list(BLUEBOOK = c(7000, 12000, 20000)) # define custom breaks
# Generate bins using custom breakpoints:
bins <- woebin(d_fact, y = "CLAIM_FLAG", x="BLUEBOOK", breaks_list = custom_breaks)
✔ Binning on 9970 rows and 2 columns in 00:00:00
# The IV of this new binning is about 35% higher, hence we keep the new one:
d_fact <- d_fact %>%
scorecard::woebin_ply(bins, to = 'bin') %>% # re-apply the binning
dplyr::rename('BLUEBOOK' = 'BLUEBOOK_bin') %>% # set the name back to what it was.
  dplyr::mutate(BLUEBOOK = factor(BLUEBOOK)) # coerce to factor
✔ Woe transformating on 9970 rows and 1 columns in 00:00:00
$BLUEBOOK
[1] "The information value for `BLUEBOOK` is: 0.069 (this is weakly predictable)."
While the IV of this new binning is lower, it does display a more intuitive pattern.
| CLAIM_FLAG | 0 | 1 | Total | |
| BLUEBOOK | ||||
| [-Inf,7000) | 998 (63.6%) | 571 (36.4%) | 1569 (100.0%) | |
| [7000,12000) | 1627 (70.0%) | 697 (30.0%) | 2324 (100.0%) | |
| [12000,20000) | 2570 (76.4%) | 794 (23.6%) | 3364 (100.0%) | |
| [20000, Inf) | 2124 (78.3%) | 589 (21.7%) | 2713 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Time in force: this is the time that the customer has been on the books of our insurance company. Typically, longer periods correlate with more stable behaviour, and hence a better insurance risk. This phenomenon will allow us to reward loyal customers.
✔ Binning on 9970 rows and 2 columns in 00:00:00
$TIF
# We will use our own binning for more equal bins
d_fact$TIF <- if_else(df$TIF <= 1, '1',
if_else(df$TIF <= 5, '2--5',
if_else(df$TIF <= 8, '6--8', '>8'))) %>%
factor(levels = c('1', '2--5', '6--8', '>8'))
f_comment_iv(d_fact, x = 'TIF', y = 'CLAIM_FLAG')
[1] "The information value for `TIF` is: 0.035 (this is weakly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| TIF | ||||
| 1 | 2104 (68.8%) | 956 (31.2%) | 3060 (100.0%) | |
| 2–5 | 1569 (72.6%) | 591 (27.4%) | 2160 (100.0%) | |
| 6–8 | 1883 (75.4%) | 616 (24.6%) | 2499 (100.0%) | |
| >8 | 1763 (78.3%) | 488 (21.7%) | 2251 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
x <- paste(df$CAR_TYPE)
# We only merge Panel Truck and Van (too small bins + similar values)
d_fact$CAR_TYPE <- if_else(x == 'Panel_Truck' | x == 'Van' ,
'PTruck_Van', x) %>%
factor()
f_comment_iv(d_fact, x = 'CAR_TYPE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_TYPE` is: 0.105 (this is medium predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| CAR_TYPE | ||||
| Minivan | 2182 (82.9%) | 449 (17.1%) | 2631 (100.0%) | |
| Pickup | 1178 (68.7%) | 537 (31.3%) | 1715 (100.0%) | |
| PTruck_Van | 1202 (73.2%) | 440 (26.8%) | 1642 (100.0%) | |
| Sports_Car | 752 (65.4%) | 397 (34.6%) | 1149 (100.0%) | |
| z_SUV | 2005 (70.8%) | 828 (29.2%) | 2833 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
d_fact$RED_CAR <- paste(df$RED_CAR) %>%
factor()
f_comment_iv(d_fact, x = 'RED_CAR', y = 'CLAIM_FLAG')
[1] "The information value for `RED_CAR` is: 0 (this is not predictable)."
The urban legend that red cars are bought by less careful drivers seems to be wrong. The IV is not significant (about 4.3e-04). In our data, red cars even have ever so slightly fewer insurance claims.
| CLAIM_FLAG | 0 | 1 | Total | |
| RED_CAR | ||||
| no | 5213 (73.2%) | 1913 (26.8%) | 7126 (100.0%) | |
| yes | 2106 (74.1%) | 738 (25.9%) | 2844 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
We will not use the variable RED_CAR. The colour might make sense for first buyers, but people who buy a red car second-hand might behave very differently. We do not have this information and hence cannot use the colour of the car as a variable.
The amount claimed over the last five years could, together with the next variable (the number of claims in the last five years), be predictive for future claims.
d_fact$OLDCLAIM <- if_else(df$OLDCLAIM == 0, '0', '>0') %>%
factor(levels = c('0', '>0'))
f_comment_iv(d_fact, x = 'OLDCLAIM', y = 'CLAIM_FLAG')
[1] "The information value for `OLDCLAIM` is: 0.316 (this is strongly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| OLDCLAIM | ||||
| 0 | 5027 (82.2%) | 1090 (17.8%) | 6117 (100.0%) | |
| >0 | 2292 (59.5%) | 1561 (40.5%) | 3853 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
The cumulative amount of claims in the last 5 years does not show a trend. The probability of having a claim is around 40% for people who did claim something, regardless of the amount (for those that did not claim, it is only about 18%). So, we can only create two meaningful bins here (at least assuming that we are targeting the variable CLAIM_FLAG – not the amount).
✔ Binning on 9970 rows and 2 columns in 00:00:00
$CLM_FREQ
We prefer to keep the people who did not claim anything apart, and try to make the bins more similar in size.
d_fact$CLM_FREQ <- if_else(df$CLM_FREQ == 0, '0',
if_else(df$CLM_FREQ == 1, '1',
if_else(df$CLM_FREQ == 2, '2', '>2'))) %>%
factor(levels = c('0', '1', '2', '>2'))
f_comment_iv(d_fact, x = 'CLM_FREQ', y = 'CLAIM_FLAG')
[1] "The information value for `CLM_FREQ` is: 0.317 (this is strongly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| CLM_FREQ | ||||
| 0 | 5027 (82.2%) | 1090 (17.8%) | 6117 (100.0%) | |
| 1 | 745 (60.5%) | 486 (39.5%) | 1231 (100.0%) | |
| 2 | 860 (60.1%) | 571 (39.9%) | 1431 (100.0%) | |
| >2 | 687 (57.7%) | 504 (42.3%) | 1191 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
The information value is nearly the same as for the binning proposed by scorecard::woebin(), hence we keep our binning.
Has the driver’s licence been revoked? This is a binary variable and hence binning is obvious.
d_fact$REVOKED <- paste(df$REVOKED) %>%
factor()
f_comment_iv(d_fact, x = 'REVOKED', y = 'CLAIM_FLAG')
[1] "The information value for `REVOKED` is: 0.109 (this is medium predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| REVOKED | ||||
| No | 6647 (76.0%) | 2103 (24.0%) | 8750 (100.0%) | |
| Yes | 672 (55.1%) | 548 (44.9%) | 1220 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
People that have their licence revoked are about twice as likely to file a claim.
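To make these numbers tangible: the WOE and IV that woebin() reported for REVOKED at the top of this document can be reproduced by hand from the cross-table above. Per bin, the WOE compares the share of claimants to the share of non-claimants.

```r
# Reproduce by hand the WOE and IV that woebin() reported for REVOKED
# (counts taken from the cross-table above).
pos <- c(No = 2103, Yes = 548)   # customers who filed a claim
neg <- c(No = 6647, Yes = 672)   # customers without a claim

# WOE per bin: log of (share of claimants) over (share of non-claimants)
woe <- log((pos / sum(pos)) / (neg / sum(neg)))
# IV: sum over bins of (difference in shares) times the WOE
iv  <- sum((pos / sum(pos) - neg / sum(neg)) * woe)

round(woe, 4)  # No: -0.1353, Yes: 0.8116 -- matches the woebin() output
round(iv, 4)   # 0.1088 -- reported above as 0.109 (medium predictable)
```

A positive WOE (the Yes bin) means that bin is over-represented among claimants; the IV aggregates this separation over all bins.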
✔ Binning on 9970 rows and 2 columns in 00:00:00
$MVR_PTS
We try our own bins, keeping the bins of more similar size and using one bin less.
d_fact$MVR_PTS <- if_else(df$MVR_PTS == 0, '0',
if_else(df$MVR_PTS == 1, '1',
if_else(df$MVR_PTS == 2, '2',
if_else(df$MVR_PTS <= 4, '3--4', '>4')))) %>%
factor(levels = c('0', '1', '2', '3--4', '>4'))
f_comment_iv(d_fact, x = 'MVR_PTS', y = 'CLAIM_FLAG')
[1] "The information value for `MVR_PTS` is: 0.206 (this is medium predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| MVR_PTS | ||||
| 0 | 3641 (80.8%) | 866 (19.2%) | 4507 (100.0%) | |
| 1 | 1086 (76.2%) | 340 (23.8%) | 1426 (100.0%) | |
| 2 | 845 (72.7%) | 318 (27.3%) | 1163 (100.0%) | |
| 3–4 | 1085 (66.3%) | 551 (33.7%) | 1636 (100.0%) | |
| >4 | 662 (53.5%) | 576 (46.5%) | 1238 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
There is one negative car age; we will simply put it in the bucket of cars that are one year old or less.
d_fact$CAR_AGE <- if_else(df$CAR_AGE <= 1, '<=1',
if_else(df$CAR_AGE <= 7, '2--7',
if_else(df$CAR_AGE <= 11, '8--11',
if_else(df$CAR_AGE <= 15, '12--15', '>=16')))) %>%
factor(levels = c('<=1', '2--7', '8--11', '12--15', '>=16'))
f_comment_iv(d_fact, x = 'CAR_AGE', y = 'CLAIM_FLAG')
[1] "The information value for `CAR_AGE` is: 0.069 (this is weakly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| CAR_AGE | ||||
| <=1 | 1657 (67.1%) | 814 (32.9%) | 2471 (100.0%) | |
| 2–7 | 1431 (70.8%) | 589 (29.2%) | 2020 (100.0%) | |
| 8–11 | 1932 (73.7%) | 688 (26.3%) | 2620 (100.0%) | |
| 12–15 | 1401 (80.1%) | 347 (19.9%) | 1748 (100.0%) | |
| >=16 | 898 (80.8%) | 213 (19.2%) | 1111 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
A binary variable, indicating where the car is primarily used.
d_fact$URBANICITY <- paste(df$URBANICITY) %>%
factor()
f_comment_iv(d_fact, x = 'URBANICITY', y = 'CLAIM_FLAG')
[1] "The information value for `URBANICITY` is: 0.403 (this is strongly predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| URBANICITY | ||||
| Highly_Urban/ Urban | 5426 (68.3%) | 2522 (31.7%) | 7948 (100.0%) | |
| z_Highly_Rural/ Rural | 1893 (93.6%) | 129 ( 6.4%) | 2022 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
We illustrate the lack of differentiation between genders for the claim amount.
p <- ggplot(df, aes(x=OLDCLAIM, color = GENDER)) +
#geom_histogram(aes(y=..density..), position="identity", alpha=0.5) +
geom_density(alpha=0.6) +
scale_y_continuous(trans='log2')
p
GENDER and AGE
d_fact$GENDER.AGE <- paste0(d_fact$GENDER, ".", d_fact$AGE) %>% factor()
f_comment_iv(d_fact, x = 'GENDER.AGE', y = 'CLAIM_FLAG')
[1] "The information value for `GENDER.AGE` is: 0.146 (this is medium predictable)."
| CLAIM_FLAG | 0 | 1 | Total | |
| GENDER.AGE | ||||
| F.[-Inf,33) | 255 (55.4%) | 205 (44.6%) | 460 (100.0%) | |
| F.[33,40) | 780 (68.4%) | 361 (31.6%) | 1141 (100.0%) | |
| F.[40,44) | 663 (72.0%) | 258 (28.0%) | 921 (100.0%) | |
| F.[44,46) | 397 (76.8%) | 120 (23.2%) | 517 (100.0%) | |
| F.[46,57) | 1550 (79.9%) | 391 (20.1%) | 1941 (100.0%) | |
| F.[57, Inf) | 281 (64.2%) | 157 (35.8%) | 438 (100.0%) | |
| M.[-Inf,33) | 177 (57.1%) | 133 (42.9%) | 310 (100.0%) | |
| M.[33,40) | 535 (66.0%) | 276 (34.0%) | 811 (100.0%) | |
| M.[40,44) | 533 (74.8%) | 180 (25.2%) | 713 (100.0%) | |
| M.[44,46) | 287 (77.4%) | 84 (22.6%) | 371 (100.0%) | |
| M.[46,57) | 1573 (81.4%) | 359 (18.6%) | 1932 (100.0%) | |
| M.[57, Inf) | 288 (69.4%) | 127 (30.6%) | 415 (100.0%) | |
| Total | 7319 (73.4%) | 2651 (26.6%) | 9970 (100.0%) |
Average cost of accidents:
tibble(GENDER_AGE = paste(d_fact$GENDER.AGE), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
filter(CLM_AMT > 0) %>%
group_by(GENDER_AGE) %>%
  summarise_at(vars(CLM_AMT), list(~ mean(., na.rm = TRUE))) %>% # funs() is deprecated in dplyr
  kable
| GENDER_AGE | CLM_AMT |
|---|---|
| F.[-Inf,33) | 5056.702 |
| F.[33,40) | 5678.496 |
| F.[40,44) | 5318.380 |
| F.[44,46) | 5241.250 |
| F.[46,57) | 5523.939 |
| F.[57, Inf) | 5427.739 |
| M.[-Inf,33) | 5910.406 |
| M.[33,40) | 5781.091 |
| M.[40,44) | 6118.789 |
| M.[44,46) | 5315.405 |
| M.[46,57) | 5621.451 |
| M.[57, Inf) | 7444.984 |
# prepare the data
data <- tibble(matrix(nrow = nrow(df), ncol = 0))
data$claim_amount <- df$CLM_AMT
data$gender <- d_fact$GENDER
data$age <- d_fact$AGE
We can first prepare the data with group_by() and then plot (option 1):
# -- option 1:
mean_data <- group_by(data, gender, age) %>%
summarise(claim_amount = mean(claim_amount, na.rm = TRUE))
ggplot(na.omit(mean_data), aes(x = age, y = claim_amount, colour = gender)) +
  geom_point() + geom_line()
Or use stat_summary():
# -- option 2:
ggplot(data = data, aes(x = age, y = claim_amount, group = gender, color = gender)) +
  stat_summary(geom = "line", fun = mean, linewidth = 3) # fun.y and size are deprecated
Overall averages per gender:
tibble(GENDER = paste(d_fact$GENDER), CLM_AMT = as.numeric(df$CLM_AMT)) %>%
filter(CLM_AMT > 0) %>%
group_by(GENDER) %>%
  summarise_at(vars(CLM_AMT), list(~ mean(., na.rm = TRUE))) %>% # funs() is deprecated in dplyr
  kable
| GENDER | CLM_AMT |
|---|---|
| F | 5428.732 |
| M | 5947.502 |
While both males and females have higher claims for younger people, this effect is more pronounced in males. The average claim amount for a young male is 27.31% higher than for the same age group in females.
So, while the young female has about an 8% higher probability to be involved in an accident, the claims of the young male are much higher (this effect would be even stronger if we considered only the group that was involved in an accident).
This dataset can be used for logistic regression, and some machine learning techniques.
This is d_fact, and we save it for later use.
We choose to normalise so that all variables range from 0 to 1 (this is method range in the code below).
# remove CLM_AMT (because it is a 100% predictor)
df <- df %>% select(-c('CLM_AMT'))
# Normalise all numerical data
preProc_norm <- caret::preProcess(as.data.frame(df), method="range")
d_norm <- predict(preProc_norm, as.data.frame(df)) %>%
  as.data.frame() %>% # coerces text to factor
  as_tibble()         # we prefer tibbles, but want the factors from the step above
# Now replace all factorial columns with the binned ones:
for (col in colnames(d_norm)) {
if (is.factor(d_norm[[col]])) {
d_norm[[col]] <- d_fact[[col]]
}
}
saveRDS(preProc_norm, './preProc_norm.R') # to convert data back and forth
saveRDS(d_norm, './d_norm.R')             # the data
Some methods, such as linear regressions, generalized linear regressions, neural networks, etc., require all data to be numerical. So, we prepare an equivalent dataset for those models.
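The range method simply rescales each numeric column to [0, 1] with x' = (x − min) / (max − min). A minimal sketch of the equivalent transformation without caret:

```r
# Equivalent of caret::preProcess(..., method = "range") for a single column:
# rescale to [0, 1] using the column's observed minimum and maximum.
range01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
range01(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
```

This is also why the fitted preProc_norm object is saved alongside the data: new observations must be rescaled with the min and max learned here, not with their own.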
d_bin = data.frame(matrix(nrow = nrow(d_fact), ncol = 0))
for (col in colnames(d_fact)) {
if (is.factor(d_fact[[col]])) {
x <- expand_factor2bin(d_fact, col) # get the expanded columns
# find the largest bin and leave it out
sums <- colSums(x)
theCol <- which(sums == max(sums))
x <- x[,-theCol]
d_bin <- bind_cols(d_bin, x) # add them to d_bin
}
}
n <- which(colnames(d_bin) == "CLAIM_FLAG.1") # this process has renamed the target variable
colnames(d_bin)[n] <- "CLAIM_FLAG"            # set back to original name
saveRDS(d_bin, './d_bin.R')                   # save the binary data-set
In this document we used workflow and code from: De Brouwer, Philippe J. S. 2020. The Big R-Book: From Data Science to Learning Machines and Big Data. New York: John Wiley & Sons, Ltd.
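The loop above expands each factor into one dummy column per level and drops the largest bin as the reference level. A sketch of that idea for a single factor column, using base R's model.matrix() (the actual helper expand_factor2bin() is defined in ethics_functions.R):

```r
# Sketch of the expand-and-drop-largest idea for one factor column.
# model.matrix(~ f - 1) gives one 0/1 dummy column per level (no intercept);
# we then drop the column of the most frequent level as reference.
one_hot_drop_largest <- function(f) {
  m <- model.matrix(~ f - 1)
  colnames(m) <- levels(f)
  m[, -which.max(colSums(m)), drop = FALSE]
}
one_hot_drop_largest(factor(c("Yes", "No", "No", "Yes", "No")))
# keeps only the "Yes" column, since "No" is the largest bin
```

Dropping one level per factor avoids perfect multicollinearity (the "dummy variable trap") in the linear and generalised linear models that will consume d_bin.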
In this document we used mainly: