Background

This example involves building a model to predict whether a traffic crash involves injuries. We will use data from the City of Chicago’s open data portal. (Some of this activity is adapted from a blog post by Julia Silge, one of the authors of our book, TMWR.)

library(tidyverse)   # dplyr, ggplot2, forcats, etc.; tidyverse >= 2.0 also loads lubridate (today(), years())
library(tidymodels)  # recipes, workflows, tune, yardstick, workflowsets, ...
library(RSocrata)    # read.socrata() for Socrata open data APIs

years_ago <- today() - years(1) # data from the last year. You can go back further, but compute time will be very long!
crash_url <- glue::glue("https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if?$where=CRASH_DATE > '{years_ago}'")
crash_raw <- as_tibble(read.socrata(crash_url)) # read.socrata() pulls directly from the API; don't worry about the details!

This dataset is pretty crazy! Take a look at it in the viewer, and then let’s do some data munging to get it into a nicer form.

We will:

- create a variable called injuries that indicates whether the crash involved injuries
- create an "UNKNOWN" category for missing report_types
- decide which other variables to keep
- omit rows with missing data (about 2,000 rows). Note that this can affect the model: missingness itself might reveal something important! For example, maybe all of the missing cases are alike in some way that could inform the model. But we have so much data here that it is unlikely to matter.

Take a minute to see if you understand each step. If you want more details on all the steps that are going into it, check out the blog post linked above.

crash <- crash_raw %>%
  arrange(desc(crash_date)) %>%
  transmute( # transmute() keeps only the variables listed here
    injuries = as.factor(if_else(injuries_total > 0, "injuries", "none")),
    crash_date,
    crash_hour,
    report_type = if_else(report_type == "", "UNKNOWN", report_type), # empty strings become an explicit category
    num_units,
    posted_speed_limit,
    weather_condition,
    lighting_condition,
    roadway_surface_cond,
    first_crash_type,
    trafficway_type,
    prim_contributory_cause,
    latitude, longitude
  ) %>%
  na.omit() # drop rows with any remaining missing values (~2,000 rows)

crash
## # A tibble: 106,898 × 14
##    injuries crash_date          crash_…¹ repor…² num_u…³ poste…⁴ weath…⁵ light…⁶
##    <fct>    <dttm>                 <int> <chr>     <int>   <int> <chr>   <chr>  
##  1 none     2023-02-25 02:00:00        2 NOT ON…       2      30 SNOW    DARKNE…
##  2 none     2023-02-25 01:40:00        1 NOT ON…       2      40 SNOW    DARKNE…
##  3 none     2023-02-25 01:20:00        1 NOT ON…       2      35 SNOW    DARKNE…
##  4 none     2023-02-25 01:10:00        1 NOT ON…       2      10 SNOW    DARKNE…
##  5 none     2023-02-25 01:10:00        1 ON SCE…       3      30 SNOW    DARKNE…
##  6 none     2023-02-25 01:00:00        1 NOT ON…       2      30 SNOW    DARKNE…
##  7 none     2023-02-25 01:00:00        1 ON SCE…       2      30 SNOW    DARKNE…
##  8 none     2023-02-25 00:40:00        0 NOT ON…       2      30 SNOW    DARKNE…
##  9 none     2023-02-25 00:19:00        0 ON SCE…       2      30 SNOW    DARKNE…
## 10 none     2023-02-25 00:18:00        0 ON SCE…       2      30 SNOW    DARKNE…
## # … with 106,888 more rows, 6 more variables: roadway_surface_cond <chr>,
## #   first_crash_type <chr>, trafficway_type <chr>,
## #   prim_contributory_cause <chr>, latitude <dbl>, longitude <dbl>, and
## #   abbreviated variable names ¹​crash_hour, ²​report_type, ³​num_units,
## #   ⁴​posted_speed_limit, ⁵​weather_condition, ⁶​lighting_condition
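
As a quick sanity check on the na.omit() step, you can compare row counts before and after (the exact number dropped will vary with when you pull the data):

# how many rows did na.omit() drop? (roughly 2,000 at the time of writing)
nrow(crash_raw) - nrow(crash)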

EDA

Let’s see how the number of crashes varied over time. (If you go back further, you’ll see the effect of the pandemic!)

crash %>%
  mutate(crash_date = floor_date(crash_date, unit = "week")) %>%
  count(crash_date, injuries) %>%
  filter( # drop the first and last (partial) weeks
    crash_date != last(crash_date),
    crash_date != first(crash_date)
  ) %>%
  ggplot(aes(crash_date, n, color = injuries)) +
  geom_line(linewidth = 1.5, alpha = 0.7) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    x = NULL, y = "Number of traffic crashes per week",
    color = "Injuries?"
  )

How does the proportion of crashes with injuries change over time?

crash %>%
  mutate(crash_date = floor_date(crash_date, unit = "week")) %>%
  count(crash_date, injuries) %>%
  filter(
    crash_date != last(crash_date),
    crash_date != first(crash_date)
  ) %>%
  group_by(crash_date) %>%
  mutate(percent_injury = n / sum(n)) %>%
  ungroup() %>%
  filter(injuries == "injuries") %>%
  ggplot(aes(crash_date, percent_injury)) +
  geom_line(linewidth = 1.5, alpha = 0.7, color = "midnightblue") +
  scale_y_continuous(limits = c(0, NA), labels = scales::percent_format()) +
  labs(x = NULL, y = "% of crashes that involve injuries")

Does the day of the week affect the number of crashes or whether injuries occurred?

crash %>%
  mutate(crash_date = wday(crash_date, label = TRUE)) %>% # replace dates with day-of-week labels
  count(crash_date, injuries) %>%
  group_by(injuries) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(percent, crash_date, fill = injuries)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of crashes", y = NULL, fill = "Injuries?")

Let’s look at some of the most common crash types to see if there is an association with injuries.

crash %>%
  count(first_crash_type, injuries) %>%
  mutate(first_crash_type = fct_reorder(first_crash_type, n)) %>%
  group_by(injuries) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  group_by(first_crash_type) %>%
  filter(sum(n) > 1e4) %>%
  ungroup() %>%
  ggplot(aes(percent, first_crash_type, fill = injuries)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of crashes", y = NULL, fill = "Injuries?")

Lastly, perhaps the location of the crash provides some extra information.

crash %>%
  filter(latitude > 0) %>% # drop records with missing (zero) coordinates
  ggplot(aes(longitude, latitude, color = injuries)) +
  geom_point(size = 0.5, alpha = 0.4) +
  labs(color = NULL) +
  scale_color_manual(values = c("deeppink4", "gray80")) +
  coord_fixed()

Model Building: Decision Tree

set.seed(2023)
crash_split <- initial_split(crash, strata = injuries, prop = 0.75)
crash_train <- training(crash_split)
crash_test <- testing(crash_split)

# 10-fold CV (vfold_cv()'s default); no repeats, since the training set is huge
crash_folds <- vfold_cv(crash_train, strata = injuries)
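
How imbalanced is the outcome? A quick check, which also confirms that the stratified split preserved the class proportions:

# class balance in the training set -- "injuries" is the minority class
crash_train %>%
  count(injuries) %>%
  mutate(prop = n / sum(n))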

Let’s create two recipes and test how they perform. We’ll also tune the parameters of the decision tree model. What does step_downsample() do, and why is that important here?

library(themis)

basic_recipe <- recipe(injuries ~ ., data = crash_train) %>%
  step_date(crash_date) %>%
  step_rm(crash_date) %>%
  step_other(weather_condition, first_crash_type,
    trafficway_type, prim_contributory_cause,
    other = "OTHER"
  )
  # Most tree methods can handle categorical predictors, so we don't need to dummy!
  #step_dummy(all_nominal_predictors()) %>%
  # Tree methods are also insensitive to scale, so we don't need to rescale

downsample_recipe <- basic_recipe %>%
  step_downsample(injuries) # from the themis package. What does this do?
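
To see what the downsampling does, you can prep the recipe and look at the training data it returns. A quick sketch (note that step_downsample() is only applied at training time, so data passed through bake() as new data is left alone):

# after prepping, the retained training data is balanced across the two classes
downsample_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(injuries)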

# decision_tree() comes from parsnip, so no extra package is needed here
dtree_spec <- decision_tree(tree_depth = tune(), cost_complexity = tune(), min_n = 30) %>%
  set_engine("rpart") %>%
  set_mode("classification")


crash_wf <- workflow_set(
  preproc = list(
    basic=basic_recipe,
    downsample=downsample_recipe
    ),
  models = list(
    decision_tree = dtree_spec
  )
)

Tuning the Model

# This will take a looooong time! Let's parallelize it.
all_cores <- parallel::detectCores(logical = FALSE) # This will check how many cores are available on your machine

library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

crash_res <- crash_wf %>%
  workflow_map(
    "tune_grid",
    verbose=TRUE,
    grid = 10,
    resamples = crash_folds,
    metrics = metric_set(accuracy, sens, spec, f_meas)
  )
## i 1 of 2 tuning:     basic_decision_tree
## ✔ 1 of 2 tuning:     basic_decision_tree (2m 21.4s)
## i 2 of 2 tuning:     downsample_decision_tree
## ✔ 2 of 2 tuning:     downsample_decision_tree (53s)

Once the model has been tuned on all the folds, we have some helper functions to let us get a look at the performance estimates.

rank_results(crash_res)
## # A tibble: 80 × 9
##    wflow_id            .config   .metric  mean std_err     n prepr…¹ model  rank
##    <chr>               <chr>     <chr>   <dbl>   <dbl> <int> <chr>   <chr> <int>
##  1 basic_decision_tree Preproce… accura… 0.862 7.50e-4    10 recipe  deci…     1
##  2 basic_decision_tree Preproce… f_meas  0.337 5.16e-3    10 recipe  deci…     1
##  3 basic_decision_tree Preproce… sens    0.233 4.40e-3    10 recipe  deci…     1
##  4 basic_decision_tree Preproce… spec    0.973 6.10e-4    10 recipe  deci…     1
##  5 basic_decision_tree Preproce… accura… 0.862 8.49e-4    10 recipe  deci…     2
##  6 basic_decision_tree Preproce… f_meas  0.342 6.73e-3    10 recipe  deci…     2
##  7 basic_decision_tree Preproce… sens    0.240 6.03e-3    10 recipe  deci…     2
##  8 basic_decision_tree Preproce… spec    0.971 6.76e-4    10 recipe  deci…     2
##  9 basic_decision_tree Preproce… accura… 0.861 8.01e-4    10 recipe  deci…     3
## 10 basic_decision_tree Preproce… f_meas  0.329 6.23e-3    10 recipe  deci…     3
## # … with 70 more rows, and abbreviated variable name ¹​preprocessor
rank_results(crash_res, rank_metric="f_meas")
## # A tibble: 80 × 9
##    wflow_id              .config .metric  mean std_err     n prepr…¹ model  rank
##    <chr>                 <chr>   <chr>   <dbl>   <dbl> <int> <chr>   <chr> <int>
##  1 downsample_decision_… Prepro… accura… 0.753 8.20e-4    10 recipe  deci…     1
##  2 downsample_decision_… Prepro… f_meas  0.465 1.75e-3    10 recipe  deci…     1
##  3 downsample_decision_… Prepro… sens    0.718 3.54e-3    10 recipe  deci…     1
##  4 downsample_decision_… Prepro… spec    0.759 8.94e-4    10 recipe  deci…     1
##  5 downsample_decision_… Prepro… accura… 0.753 8.20e-4    10 recipe  deci…     2
##  6 downsample_decision_… Prepro… f_meas  0.465 1.75e-3    10 recipe  deci…     2
##  7 downsample_decision_… Prepro… sens    0.718 3.54e-3    10 recipe  deci…     2
##  8 downsample_decision_… Prepro… spec    0.759 8.94e-4    10 recipe  deci…     2
##  9 downsample_decision_… Prepro… accura… 0.753 8.20e-4    10 recipe  deci…     3
## 10 downsample_decision_… Prepro… f_meas  0.465 1.75e-3    10 recipe  deci…     3
## # … with 70 more rows, and abbreviated variable name ¹​preprocessor
autoplot(crash_res)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

autoplot(crash_res, id="downsample_decision_tree")
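
Besides the plots, show_best() lists the top parameter combinations for a chosen metric, which is worth a look before committing to one. A quick sketch:

# top candidates for the downsampled workflow, ranked by F1 score
crash_res %>%
  extract_workflow_set_result("downsample_decision_tree") %>%
  show_best(metric = "f_meas")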

basic_best_param <- crash_res %>%
      extract_workflow_set_result("basic_decision_tree") %>%
      select_best(metric="accuracy")

basic_res <- crash_res %>%
  extract_workflow("basic_decision_tree") %>%
  finalize_workflow(
    basic_best_param
  ) %>%
  fit(crash_train) %>%
  augment(crash_test)
  
downsample_best_param <- crash_res %>%
      extract_workflow_set_result("downsample_decision_tree") %>%
      select_best(metric="accuracy")

downsample_final <- crash_res %>%
  extract_workflow("downsample_decision_tree") %>%
  finalize_workflow(
    downsample_best_param
  ) %>%
  fit(crash_train) 

downsample_res <- downsample_final %>%
  augment(crash_test)

Were the CV performance estimates accurate for these models? Let’s check against the test set.

my_metrics <- metric_set(accuracy, f_meas, sens, spec)
my_metrics(downsample_res, truth=injuries, estimate=.pred_class)
## # A tibble: 4 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.752
## 2 f_meas   binary         0.464
## 3 sens     binary         0.717
## 4 spec     binary         0.758
my_metrics(basic_res, truth=injuries, estimate=.pred_class)
## # A tibble: 4 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.862
## 2 f_meas   binary         0.332
## 3 sens     binary         0.229
## 4 spec     binary         0.973
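
A confusion matrix makes the trade-off concrete: the basic model almost never predicts injuries, while the downsampled model catches most of them at the cost of more false alarms. A quick sketch with yardstick's conf_mat():

# where do the errors fall for each model?
conf_mat(downsample_res, truth = injuries, estimate = .pred_class)
conf_mat(basic_res, truth = injuries, estimate = .pred_class)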

Displaying a Tree

library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
downsample_final %>%
  extract_fit_engine() %>%
  rpart.plot()

# This code looks complicated only because I didn't save the fitted basic model anywhere, like I did with the downsampled one
crash_res %>%
  extract_workflow("basic_decision_tree") %>%
  finalize_workflow(
    basic_best_param
  ) %>%
  fit(crash_train) %>%
  extract_fit_engine() %>%
  rpart.plot()
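
rpart fits also store variable importance scores. A quick sketch for peeking at them for the downsampled tree, using purrr's pluck():

# importance scores recorded by rpart (larger = more influential in the splits)
downsample_final %>%
  extract_fit_engine() %>%
  purrr::pluck("variable.importance")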

Ensemble-Based Tree Methods

library(baguette)
bag_spec <- bag_tree(
    class_cost = tune(), tree_depth = tune(),
    min_n = 20, cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

library(ranger)
# I'm not setting any tuning parameters because it takes forever to tune
rf_spec <- rand_forest(min_n = 20) %>%
  set_engine("ranger") %>%
  set_mode("classification")
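
If you did want to tune the forest, mtry and min_n are the usual candidates. A sketch of what that might look like (not run here; expect much longer tuning times):

# a tunable alternative -- tuning a forest on ~100k rows takes a while!
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")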

crash_wfs <- workflow_set(
  preproc = list(rec = downsample_recipe),
  models = list(bagging = bag_spec,
                rf = rf_spec)
  )
crash_res <- crash_wfs %>% # note: this overwrites the earlier decision-tree results
  workflow_map(
    "tune_grid",
    verbose=TRUE,
    grid = 10,
    resamples = crash_folds
  )
## i 1 of 2 tuning:     rec_bagging
## ✔ 1 of 2 tuning:     rec_bagging (4m 26.4s)
## i    No tuning parameters. `fit_resamples()` will be attempted
## i 2 of 2 resampling: rec_rf
## ✔ 2 of 2 resampling: rec_rf (1m 36s)
stopCluster(cl) # stop the parallel backend
collect_metrics(crash_res)
## # A tibble: 22 × 9
##    wflow_id    .config         preproc model .metric .esti…¹  mean     n std_err
##    <chr>       <chr>           <chr>   <chr> <chr>   <chr>   <dbl> <int>   <dbl>
##  1 rec_bagging Preprocessor1_… recipe  bag_… accura… binary  0.729    10 9.67e-4
##  2 rec_bagging Preprocessor1_… recipe  bag_… roc_auc binary  0.820    10 1.25e-3
##  3 rec_bagging Preprocessor1_… recipe  bag_… accura… binary  0.620    10 8.40e-4
##  4 rec_bagging Preprocessor1_… recipe  bag_… roc_auc binary  0.721    10 1.61e-3
##  5 rec_bagging Preprocessor1_… recipe  bag_… accura… binary  0.620    10 8.40e-4
##  6 rec_bagging Preprocessor1_… recipe  bag_… roc_auc binary  0.703    10 1.64e-3
##  7 rec_bagging Preprocessor1_… recipe  bag_… accura… binary  0.738    10 2.11e-3
##  8 rec_bagging Preprocessor1_… recipe  bag_… roc_auc binary  0.791    10 1.97e-3
##  9 rec_bagging Preprocessor1_… recipe  bag_… accura… binary  0.682    10 2.10e-2
## 10 rec_bagging Preprocessor1_… recipe  bag_… roc_auc binary  0.771    10 6.55e-3
## # … with 12 more rows, and abbreviated variable name ¹​.estimator
autoplot(crash_res)

Your turn

  1. Create a boosted tree model specification. Make sure to read the help page and see which parameters can be tuned. (A starter sketch follows this list.)

  2. Create a workflow with just this model specification and the recipe. Tune the model using the 10-fold CV.

  3. Step 2 probably failed! Can you figure out what you need to change to make this model run?

  4. Compare the performance of this model to the bagged tree and the random forest. Are they different, or about the same? Which model might you choose out of the ones we’ve tried so far?
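
As a starting point for step 1, a boosted tree specification might look like the sketch below; the parameters shown are just examples of what boost_tree() can tune (see its help page for the full list). Figuring out why tuning fails with the recipe as-is is the point of step 3.

# a possible starting spec for the exercise (parameter choices are examples)
boost_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")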