# 11 Predictive modeling II: Classification

In this chapter we will discuss how we can predict binary (categorical) dependent variables by performing basic classification analysis in R.

```
library(tidymodels)
library(tidyverse)
set.seed(7)
```

## 11.1 Data profiling

We will use a credit cards dataset. Our goal is to estimate the probability that a customer will pay their credit card debt given the customer’s observed characteristics.

```
= read_csv("../data/creditCards.csv")
d d
```

```
## # A tibble: 30,000 × 5
## TARGET GENDER MARRIAGE CREDIT_LIMIT AGE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 NOT_PAID 2 1 20000 24
## 2 NOT_PAID 2 2 120000 26
## 3 PAID 2 2 90000 34
## 4 PAID 2 1 50000 37
## 5 PAID 1 1 50000 57
## 6 PAID 1 2 50000 37
## 7 PAID 1 2 500000 29
## 8 PAID 2 2 100000 23
## 9 PAID 2 1 140000 28
## 10 PAID 1 2 20000 35
## # … with 29,990 more rows
```

### 11.1.1 Data types

The first thing to do is to verify that each column has the correct data type. Columns `Gender`

and `Marriage`

are listed as numeric whereas they should be categorical. Similarly, our dependent variable `TARGET`

is listed as a character column. We will need to transform these columns into categorical (`factors`

), in order for our models to work correctly.

```
= d %>% mutate(across(c(TARGET,GENDER,MARRIAGE), factor))
d d
```

```
## # A tibble: 30,000 × 5
## TARGET GENDER MARRIAGE CREDIT_LIMIT AGE
## <fct> <fct> <fct> <dbl> <dbl>
## 1 NOT_PAID 2 1 20000 24
## 2 NOT_PAID 2 2 120000 26
## 3 PAID 2 2 90000 34
## 4 PAID 2 1 50000 37
## 5 PAID 1 1 50000 57
## 6 PAID 1 2 50000 37
## 7 PAID 1 2 500000 29
## 8 PAID 2 2 100000 23
## 9 PAID 2 1 140000 28
## 10 PAID 1 2 20000 35
## # … with 29,990 more rows
```

Check out Section 7.3.1 to recall how to use the function

`across`

.

## 11.2 Feature engineering

This process will be similar to the one we introduced in the previous chapter, in Section 10.3.

### 11.2.1 Data spliting

Next, we will split our data into training and test sets:

```
= initial_split(d, prop=5/6)
ds = training(ds)
trs = testing(ds) ts
```

### 11.2.2 Recipes

Now we can use the training set to do some feature engineering. Recall that models do not understand categorical variables and that in order to use them we will need to transform them into binary columns which are called dummies. For our basic recipe in this example, we will use the function `step_dummy`

that transforms all categorical columns into binary.

```
= recipe(TARGET ~ GENDER + MARRIAGE + CREDIT_LIMIT + AGE, trs) %>%
br step_dummy(all_nominal(), -all_outcomes())
%>% prep %>% juice %>% summary br
```

```
## CREDIT_LIMIT AGE TARGET GENDER_X2
## Min. : 10000 Min. :21.00 NOT_PAID: 5568 Min. :0.000
## 1st Qu.: 50000 1st Qu.:28.00 PAID :19432 1st Qu.:0.000
## Median :140000 Median :34.00 Median :1.000
## Mean :167586 Mean :35.45 Mean :0.604
## 3rd Qu.:240000 3rd Qu.:41.00 3rd Qu.:1.000
## Max. :800000 Max. :75.00 Max. :1.000
## MARRIAGE_X1 MARRIAGE_X2 MARRIAGE_X3
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :1.0000 Median :0.00000
## Mean :0.4538 Mean :0.5337 Mean :0.01076
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
```

Furthermore, we can create an additional recipe which extends our basic recipe by log-transforming the credit limit:

```
= br %>% step_log(CREDIT_LIMIT)
er %>% prep %>% juice %>% head er
```

```
## # A tibble: 6 × 7
## CREDIT_LIMIT AGE TARGET GENDER_X2 MARRIAGE_X1 MARRIAGE_X2 MARRIAGE_X3
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 12.1 30 PAID 1 0 1 0
## 2 11.5 32 NOT_PAID 1 1 0 0
## 3 11.7 40 PAID 1 1 0 0
## 4 10.8 35 PAID 1 1 0 0
## 5 12.8 35 PAID 1 1 0 0
## 6 9.90 30 NOT_PAID 1 1 0 0
```

## 11.3 Classification models

Defining classification models is very similar to defining regression models (Section 10.4.2). Specifically, we will need to use the model’s special keyword, the function `set_engine`

which identify which package to use, and the function `set_mode`

which identifies whether this is a regression or a classification problem. Here we define five classification models:

```
= decision_tree() %>% set_engine("rpart") %>% set_mode("classification")
dt_class = boost_tree() %>% set_engine("xgboost") %>% set_mode("classification")
xg_class = logistic_reg() %>% set_engine("glm") %>% set_mode("classification")
lg_class = mlp() %>% set_engine("keras") %>% set_mode("classification")
mlp_class = rand_forest() %>% set_engine("ranger") %>% set_mode("classification") rf_class
```

Next, we define a basic workflow:

```
= workflow() %>% add_model(lg_class) %>% add_recipe(br)
bw bw
```

```
## ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
```

Finally, we can call function `fit`

on the training set to train these models, and the function `predict`

to get our predictions for the test set:

```
= bw %>% fit(trs)
logitFit = predict(logitFit, ts) %>% bind_cols(ts %>% select(TARGET))
p p
```

```
## # A tibble: 5,000 × 2
## .pred_class TARGET
## <fct> <fct>
## 1 PAID PAID
## 2 PAID PAID
## 3 PAID NOT_PAID
## 4 PAID PAID
## 5 PAID NOT_PAID
## 6 PAID PAID
## 7 PAID PAID
## 8 PAID PAID
## 9 PAID PAID
## 10 PAID NOT_PAID
## # … with 4,990 more rows
```

## 11.4 Evaluation

Now that we have built our classification models, how do we know which ones work and which ones do not work?

In classification, we care about whether or not we can correctly predict class membership. For instance, we want to know whether or not a customer is more likely to pay or not pay their credit card bills. Hence, for these types of models (i.e., **classification** models), we use the following evaluation metrics:

**Accuracy and confusion matrix**: The accuracy of a classifier quantifies the ratio of corrected classified instances. The accuracy is a direct result of a model’s confusion matrix, which captures the number of true positives, true negatives, false positives, and false negatives.**AUC, sensitivity, specificity**: Pretty often, the accuracy of a classifier can be misleading. In those cases, a metric such as the AUC (Area Under the Curve, or Area Under the ROC curve, or ROC)—which captures the correctness of our predicted membership probabilities and subsequent rankings are—is more appropriate. The AUC metric is a direct result of sensitivity and specificity, two metrics that capture the performance of an algorithm within each available class.

### 11.4.1 Accuracy and confusion matrix

A **True Positive** (TP) is an instance that was predicted positive and it is actually positive. A **True Negative** (TN) is an instance that was predictive negative and it is actually negative. **False Positives** (FP) and **False Negatives** (FN) capture errors that our classifiers make. A FP is an instance that was predicted positive but it is actually negative. Finally, a FN is an instance that was predicted to be negative but it is actually positive.

If we put the number of TP, TN, FP and FN in a table, we can get the confusion matrix by calling the function `conf_mat`

from the package `yardstick`

. The result looks like this:

`%>% conf_mat(TARGET,.pred_class) p `

```
## Truth
## Prediction NOT_PAID PAID
## NOT_PAID 0 0
## PAID 1068 3932
```

The confusion matrix has three columns: the first column lists the **predicted** classes. The other two columns list the negative and the positive observed instances in the test set. The elements in the confusion matrix identify the number that predictions and observations agree and disagree, and hence identify the TPs, TNs, FPs, and FNs.

In this example, negative instances are the

`NOT_PAID`

instances, while positive instances are the`PAID`

ones.

From the confusion matrix, we can get the accuracy by summing the diagonal and dividing by the total number of instances in our test set. Or simpler, by calling the function accuracy:

`%>% accuracy(TARGET,.pred_class) p `

```
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.786
```

### 11.4.2 AUC

Some times, the accuracy of a classifier can be misleading. In our example, the accuracy is close to ~80%. Yet, our classifier never predicts customers who will not pay their debts (never predicts `NOT_PAID`

). As a result, our model is **majority** class classifier.

Thankfully, we can use a different evaluation metric that is not affected by how skewed towards one class our input distribution is. This metric is caller **Area Under the ROC Curve** (AUC). The AUC describes the ability of a classifier to separate instances. Intuitively, the AUC score captures the probability that a randomly chosen positive instance from our dataset will rank higher than a randomly chosen negative one. Specifically, the AUC score is the area under a curve that captures the relationship between the **true positive rate (sensitivity)** and the **false positive rate (1-specificity)** of a classifier. To understand this, we first need to talk about thresholds and probabilities.

Most classifiers predict class membership probabilities. This means that they predict the likelihood that each point will be positive or negative. For instance in our example, a probabilistic classifier will predict the likelihood of a customer to pay their credit card bill. Hence, in order to map these probabilities to classes, we need to set up a threshold, such that a probability estimate above that threshold signals a fraudulent transaction, and a probability estimate below that threshold signals a legitimate transaction.

To get the probability estimates of our predictions, we need to call the function `predict`

with the option `prob`

:

```
= predict(logitFit, ts, "prob") %>% bind_cols(ts %>% select(TARGET))
p %>% head p
```

```
## # A tibble: 6 × 3
## .pred_NOT_PAID .pred_PAID TARGET
## <dbl> <dbl> <fct>
## 1 0.171 0.829 PAID
## 2 0.152 0.848 PAID
## 3 0.272 0.728 NOT_PAID
## 4 0.271 0.729 PAID
## 5 0.252 0.748 NOT_PAID
## 6 0.307 0.693 PAID
```

Now, we can call the function `roc_curve`

which will estimate the AUC score of our model, by adjusting the probability threshold from 0 to 1 and plotting the sensitivity and the 1-specificity curve. Our curve looks like this:

`%>% roc_curve(truth = TARGET, .pred_PAID, event_level = "second") %>% autoplot() p `

The area under this curve can be estimated by the function `roc_auc`

as follows:

`%>% roc_auc(truth = TARGET, .pred_PAID, event_level = "second") p `

```
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.622
```

`levels(p$TARGET)`

`## [1] "NOT_PAID" "PAID"`

`event_level = "second"`

. This basically identifies whether we want to rank the predicted instances according to their likelihood of “PAID” or “NOT_PAID”. The `event_level`

can be set to either `first`

or `second`

. In our case, we want to rank instances according to their likelihood of being `PAID`

. If we check the order of level `PAID`

in our dependent variable `TARGET`

(see previous chunk) we will notice that it comes `second`

. Hence, we set `event_level = "second"`

.

### 11.4.3 Comparison of multiple models

Next, we define a function `evalClass`

that streamlines the process of estimating the AUC score of different algorithms. The function takes as input a workflow, and returns the AUC score estimated on the test set of the model included in the workflow:

```
= function(aWorkFlow){
evalClass = aWorkFlow %>% fit(trs)
tmpFit = predict(tmpFit, ts, "prob") %>% bind_cols(ts %>% select(TARGET))
p %>% roc_auc(truth = TARGET, .pred_PAID, event_level = "second")
p }
```

We can call this function multiple times and create a final tibble that includes the results.

```
= bw %>% evalClass %>% mutate(model = "logit",recipe="b")
r1 = bw %>% update_recipe(er) %>% evalClass %>% mutate(model = "logit",recipe="e")
r2 = bw %>% update_model(rf_class) %>% evalClass %>%mutate(model = "rf",recipe="b")
r3 = bw %>% update_model(rf_class) %>% update_recipe(er) %>%evalClass %>% mutate(model = "rf",recipe="e")
r4 = bind_rows(r1,r2,r3,r4)
r r
```

```
## # A tibble: 4 × 5
## .metric .estimator .estimate model recipe
## <chr> <chr> <dbl> <chr> <chr>
## 1 roc_auc binary 0.622 logit b
## 2 roc_auc binary 0.628 logit e
## 3 roc_auc binary 0.626 rf b
## 4 roc_auc binary 0.626 rf e
```

Finally, we can plot the results in a bar plot:

`%>% ggplot(aes(x=model, y = .estimate, fill = model) )+geom_col() + facet_wrap(~recipe) r `

## 11.5 Threshold analysis

The previous discussion about thresholds and probabilities raises another question:

**How do we choose a threshold in practice?**

The answer is it depends: it depends on what we care for. Do we care about minimizing the number of FPs? Or do we care about minimize the number of FNs? Or do we equally care about minimizing both FPs and FNs at the same time? `Sensitivity`

and `Specificity`

are good measures to optimize depending on what our objective is:

Sensitivity is defined as the number of TPs over the number of positives, and it captures the probability of correctly classifying an instance given that it is positive. As a result sensitivity is a measure we want to maximize when we care about minimizing the number of FNs!!

On the other hand, specificity is defined as the number of TNs over the total number of negatives, and it captures the probability of correctly classifying an instance given that it is negative. As a result, specificity is a measure we want to maximize when we care about minimizing the number of FPs.

**As the threshold moves from 0 to 1, the specificity increases and the sensitivity decreases.**

To perform this analysis in R, we will use the package `probably`

:

`install.packages("probably")`

`library(probably)`

Let’s get the probability estimates of our model:

```
= predict(logitFit, ts, "prob") %>% bind_cols(ts %>% select(TARGET))
p %>% head p
```

```
## # A tibble: 6 × 3
## .pred_NOT_PAID .pred_PAID TARGET
## <dbl> <dbl> <fct>
## 1 0.171 0.829 PAID
## 2 0.152 0.848 PAID
## 3 0.272 0.728 NOT_PAID
## 4 0.271 0.729 PAID
## 5 0.252 0.748 NOT_PAID
## 6 0.307 0.693 PAID
```

The function `threshold_perf`

will use these probability estimates to compute the sensitivity, specificity, and j index for different thresholds:

```
= p %>% threshold_perf(TARGET, .pred_PAID, thresholds = seq(0.2,0.8, by=0.0025))
thresholdData %>% head thresholdData
```

```
## # A tibble: 6 × 4
## .threshold .metric .estimator .estimate
## <dbl> <chr> <chr> <dbl>
## 1 0.2 sens binary 1
## 2 0.202 sens binary 1
## 3 0.205 sens binary 1
## 4 0.208 sens binary 1
## 5 0.21 sens binary 1
## 6 0.213 sens binary 1
```

When we care about both the sensitivity and the specificity of a model, we can use the J-index. The J-index can provide a sort of best-threshold point where increasing the threshold results in no more of an increase in the specificity than a decrease in the sensitivity, and as a result the J-index is maximized. We can plot these metrics as follows:

`%>% ggplot(aes(x=.threshold, y =.estimate, color =.metric))+geom_line() thresholdData `

Then, we can get the thresholds that maximize the J index:

`%>% filter(.metric == "j_index") %>% filter(.estimate == max(.estimate)) %>% head thresholdData `

```
## # A tibble: 6 × 4
## .threshold .metric .estimator .estimate
## <dbl> <chr> <chr> <dbl>
## 1 0.2 j_index binary 0
## 2 0.202 j_index binary 0
## 3 0.205 j_index binary 0
## 4 0.208 j_index binary 0
## 5 0.21 j_index binary 0
## 6 0.213 j_index binary 0
```

*For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu*