This example involves building a model to predict whether a traffic crash will involve injuries or not. We will use data from the City of Chicago's open data portal. (Some of this activity is derived from a blog post by Julia Silge, one of the authors of our textbook, Tidy Modeling with R (TMWR).)
library(RSocrata) # provides read.socrata() for querying Socrata open-data portals

years_ago <- today() - years(1) # data from the last year. You can go back further, but compute time will be very long!
crash_url <- glue::glue("https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if?$where=CRASH_DATE > '{years_ago}'")
crash_raw <- as_tibble(read.socrata(crash_url)) # a new way to read in data; don't worry about the details!
This dataset is pretty crazy! Take a look at it in the viewer, and then let’s do some data munging to get it into a nicer form.
- Create a variable called `injuries` which indicates whether or not the crash involved injuries.
- Create an "UNKNOWN" category for missing `report_type`s.
- Decide which other variables to keep.
- Omit rows with missing data (about 2000 rows). This could affect the model, because the missing data might reveal something important! For example, maybe all the missing cases are alike in some way that could inform the model. But we have so much data that in this case it is not likely to matter.
Take a minute to see if you understand each step. If you want more details on all the steps, check out the blog post linked above.
crash <- crash_raw %>%
  arrange(desc(crash_date)) %>%
  transmute(
    # binary outcome: did the crash involve any injuries?
    injuries = as.factor(if_else(injuries_total > 0, "injuries", "none")),
    crash_date,
    crash_hour,
    # recode missing report types as an explicit "UNKNOWN" level
    report_type = if_else(report_type == "", "UNKNOWN", report_type),
    num_units,
    posted_speed_limit,
    weather_condition,
    lighting_condition,
    roadway_surface_cond,
    first_crash_type,
    trafficway_type,
    prim_contributory_cause,
    latitude, longitude
  ) %>%
  na.omit() # drop the ~2000 rows with missing values
crash
## # A tibble: 106,898 × 14
## injuries crash_date crash_…¹ repor…² num_u…³ poste…⁴ weath…⁵ light…⁶
## <fct> <dttm> <int> <chr> <int> <int> <chr> <chr>
## 1 none 2023-02-25 02:00:00 2 NOT ON… 2 30 SNOW DARKNE…
## 2 none 2023-02-25 01:40:00 1 NOT ON… 2 40 SNOW DARKNE…
## 3 none 2023-02-25 01:20:00 1 NOT ON… 2 35 SNOW DARKNE…
## 4 none 2023-02-25 01:10:00 1 NOT ON… 2 10 SNOW DARKNE…
## 5 none 2023-02-25 01:10:00 1 ON SCE… 3 30 SNOW DARKNE…
## 6 none 2023-02-25 01:00:00 1 NOT ON… 2 30 SNOW DARKNE…
## 7 none 2023-02-25 01:00:00 1 ON SCE… 2 30 SNOW DARKNE…
## 8 none 2023-02-25 00:40:00 0 NOT ON… 2 30 SNOW DARKNE…
## 9 none 2023-02-25 00:19:00 0 ON SCE… 2 30 SNOW DARKNE…
## 10 none 2023-02-25 00:18:00 0 ON SCE… 2 30 SNOW DARKNE…
## # … with 106,888 more rows, 6 more variables: roadway_surface_cond <chr>,
## # first_crash_type <chr>, trafficway_type <chr>,
## # prim_contributory_cause <chr>, latitude <dbl>, longitude <dbl>, and
## # abbreviated variable names ¹crash_hour, ²report_type, ³num_units,
## # ⁴posted_speed_limit, ⁵weather_condition, ⁶lighting_condition
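Before going further, check the balance of the outcome classes: crashes without injuries vastly outnumber crashes with injuries. A quick sketch (this imbalance is exactly why we will try downsampling later):

# class balance of the outcome
crash %>%
  count(injuries) %>%
  mutate(prop = n / sum(n))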
Let’s see how the number of crashes varied over time. (If you go back further, you’ll see the effect of the pandemic!)
crash %>%
  mutate(crash_date = floor_date(crash_date, unit = "week")) %>%
  count(crash_date, injuries) %>%
  filter(
    # drop the last and first (partial) weeks
    crash_date != last(crash_date),
    crash_date != first(crash_date)
  ) %>%
  ggplot(aes(crash_date, n, color = injuries)) +
  geom_line(linewidth = 1.5, alpha = 0.7) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    x = NULL, y = "Number of traffic crashes per week",
    color = "Injuries?"
  )
How does the proportion of crashes with injuries change over time?
crash %>%
  mutate(crash_date = floor_date(crash_date, unit = "week")) %>%
  count(crash_date, injuries) %>%
  filter(
    crash_date != last(crash_date),
    crash_date != first(crash_date)
  ) %>%
  group_by(crash_date) %>%
  mutate(percent_injury = n / sum(n)) %>%
  ungroup() %>%
  filter(injuries == "injuries") %>%
  ggplot(aes(crash_date, percent_injury)) +
  geom_line(linewidth = 1.5, alpha = 0.7, color = "midnightblue") +
  scale_y_continuous(limits = c(0, NA), labels = scales::percent_format()) +
  labs(x = NULL, y = "% of crashes that involve injuries")
Does the day of the week affect the number of crashes, or whether injuries occurred?
crash %>%
  mutate(crash_date = wday(crash_date, label = TRUE)) %>%
  count(crash_date, injuries) %>%
  group_by(injuries) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(percent, crash_date, fill = injuries)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of crashes", y = NULL, fill = "Injuries?")
Let’s look at some of the most common crash types to see if there is an association with injuries.
crash %>%
  count(first_crash_type, injuries) %>%
  mutate(first_crash_type = fct_reorder(first_crash_type, n)) %>%
  group_by(injuries) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  group_by(first_crash_type) %>%
  filter(sum(n) > 1e4) %>% # keep only the most common crash types
  ungroup() %>%
  ggplot(aes(percent, first_crash_type, fill = injuries)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of crashes", y = NULL, fill = "Injuries?")
Lastly, perhaps the location of the crash provides some extra information.
crash %>%
  filter(latitude > 0) %>% # drop rows with erroneous zero coordinates
  ggplot(aes(longitude, latitude, color = injuries)) +
  geom_point(size = 0.5, alpha = 0.4) +
  labs(color = NULL) +
  scale_color_manual(values = c("deeppink4", "gray80")) +
  coord_fixed()
Time to build a model! First we split the data into training and test sets, then set up resampling folds.

set.seed(2023)
crash_split <- initial_split(crash, strata = injuries, prop = .75)
crash_train <- training(crash_split)
crash_test <- testing(crash_split)

# 10-fold CV, no repeats, since the training set is huge
crash_folds <- vfold_cv(crash_train, strata = injuries)
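Because we stratified on `injuries`, the class proportions should be nearly identical in the training and test sets. A quick sketch to check:

# compare the injury rate in the training and test sets
bind_rows(
  train = count(crash_train, injuries),
  test = count(crash_test, injuries),
  .id = "set"
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()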
Let's create two recipes and test how they perform. We'll also tune the parameters in the decision tree model. What does `step_downsample()` do, and why is that important here?
library(themis) # recipe steps for class imbalance, like step_downsample()

basic_recipe <- recipe(injuries ~ ., data = crash_train) %>%
  step_date(crash_date) %>% # pull out day-of-week, month, and year features
  step_rm(crash_date) %>%
  step_other(weather_condition, first_crash_type,
    trafficway_type, prim_contributory_cause,
    other = "OTHER"
  ) # collapse rare levels of these predictors into "OTHER"
# Most tree methods can handle categorical predictors, so we don't need to dummy!
# step_dummy(all_nominal_predictors()) %>%
# Tree methods are also insensitive to scale, so we don't need to rescale.

downsample_recipe <- basic_recipe %>%
  step_downsample(injuries) # from the themis package. What does this do?
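To see the effect of `step_downsample()` for yourself, prep the recipe and count the classes in the processed training data. A minimal sketch:

# the majority class ("none") is sampled down to match the minority class
downsample_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(injuries)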
dtree_spec <- decision_tree(tree_depth = tune(), cost_complexity = tune(), min_n = 30) %>%
  set_engine("rpart") %>%
  set_mode("classification")
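As a sanity check, you can list which parameters are marked for tuning. A quick sketch, assuming a recent tidymodels version that provides `extract_parameter_set_dials()`:

# two tunable parameters here: tree_depth and cost_complexity (min_n is fixed at 30)
dtree_spec %>%
  extract_parameter_set_dials()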
crash_wf <- workflow_set(
  preproc = list(
    basic = basic_recipe,
    downsample = downsample_recipe
  ),
  models = list(
    decision_tree = dtree_spec
  )
)
# This will take a looooong time! Let's parallelize it.
all_cores <- parallel::detectCores(logical = FALSE) # This will check how many cores are available on your machine
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
crash_res <- crash_wf %>%
  workflow_map(
    "tune_grid",
    verbose = TRUE,
    grid = 10, # try 10 candidate parameter combinations per workflow
    resamples = crash_folds,
    metrics = metric_set(accuracy, sens, spec, f_meas)
  )
## i 1 of 2 tuning: basic_decision_tree
## ✔ 1 of 2 tuning: basic_decision_tree (2m 21.4s)
## i 2 of 2 tuning: downsample_decision_tree
## ✔ 2 of 2 tuning: downsample_decision_tree (53s)
Once the models have been tuned on all the folds, we can use some helper functions to get a look at the performance estimates.
rank_results(crash_res)
## # A tibble: 80 × 9
## wflow_id .config .metric mean std_err n prepr…¹ model rank
## <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
## 1 basic_decision_tree Preproce… accura… 0.862 7.50e-4 10 recipe deci… 1
## 2 basic_decision_tree Preproce… f_meas 0.337 5.16e-3 10 recipe deci… 1
## 3 basic_decision_tree Preproce… sens 0.233 4.40e-3 10 recipe deci… 1
## 4 basic_decision_tree Preproce… spec 0.973 6.10e-4 10 recipe deci… 1
## 5 basic_decision_tree Preproce… accura… 0.862 8.49e-4 10 recipe deci… 2
## 6 basic_decision_tree Preproce… f_meas 0.342 6.73e-3 10 recipe deci… 2
## 7 basic_decision_tree Preproce… sens 0.240 6.03e-3 10 recipe deci… 2
## 8 basic_decision_tree Preproce… spec 0.971 6.76e-4 10 recipe deci… 2
## 9 basic_decision_tree Preproce… accura… 0.861 8.01e-4 10 recipe deci… 3
## 10 basic_decision_tree Preproce… f_meas 0.329 6.23e-3 10 recipe deci… 3
## # … with 70 more rows, and abbreviated variable name ¹preprocessor
rank_results(crash_res, rank_metric = "f_meas")
## # A tibble: 80 × 9
## wflow_id .config .metric mean std_err n prepr…¹ model rank
## <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
## 1 downsample_decision_… Prepro… accura… 0.753 8.20e-4 10 recipe deci… 1
## 2 downsample_decision_… Prepro… f_meas 0.465 1.75e-3 10 recipe deci… 1
## 3 downsample_decision_… Prepro… sens 0.718 3.54e-3 10 recipe deci… 1
## 4 downsample_decision_… Prepro… spec 0.759 8.94e-4 10 recipe deci… 1
## 5 downsample_decision_… Prepro… accura… 0.753 8.20e-4 10 recipe deci… 2
## 6 downsample_decision_… Prepro… f_meas 0.465 1.75e-3 10 recipe deci… 2
## 7 downsample_decision_… Prepro… sens 0.718 3.54e-3 10 recipe deci… 2
## 8 downsample_decision_… Prepro… spec 0.759 8.94e-4 10 recipe deci… 2
## 9 downsample_decision_… Prepro… accura… 0.753 8.20e-4 10 recipe deci… 3
## 10 downsample_decision_… Prepro… f_meas 0.465 1.75e-3 10 recipe deci… 3
## # … with 70 more rows, and abbreviated variable name ¹preprocessor
autoplot(crash_res)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
autoplot(crash_res, id = "downsample_decision_tree")
basic_best_param <- crash_res %>%
  extract_workflow_set_result("basic_decision_tree") %>%
  select_best(metric = "accuracy")

basic_res <- crash_res %>%
  extract_workflow("basic_decision_tree") %>%
  finalize_workflow(basic_best_param) %>%
  fit(crash_train) %>% # fit the finalized workflow on the full training set
  augment(crash_test) # attach test-set predictions
downsample_best_param <- crash_res %>%
  extract_workflow_set_result("downsample_decision_tree") %>%
  select_best(metric = "accuracy")

downsample_final <- crash_res %>%
  extract_workflow("downsample_decision_tree") %>%
  finalize_workflow(downsample_best_param) %>%
  fit(crash_train)

downsample_res <- downsample_final %>%
  augment(crash_test)
Was the CV estimate good for these? Compare the test-set metrics below to the resampling estimates above; they should be close.
my_metrics <- metric_set(accuracy, f_meas, sens, spec)
my_metrics(downsample_res, truth = injuries, estimate = .pred_class)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.752
## 2 f_meas binary 0.464
## 3 sens binary 0.717
## 4 spec binary 0.758
my_metrics(basic_res, truth = injuries, estimate = .pred_class)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.862
## 2 f_meas binary 0.332
## 3 sens binary 0.229
## 4 spec binary 0.973
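The test-set numbers land very close to the resampling estimates: the downsampled model trades accuracy and specificity for much better sensitivity. A confusion matrix makes that trade-off concrete (a quick sketch with yardstick's `conf_mat()`):

conf_mat(downsample_res, truth = injuries, estimate = .pred_class)
conf_mat(basic_res, truth = injuries, estimate = .pred_class)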
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
downsample_final %>%
  extract_fit_engine() %>%
  rpart.plot()
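If you want a numeric summary to go with the plot, the fitted rpart object also stores variable importance scores. A minimal sketch (using purrr's `pluck()` to extract the named vector; this assumes the tree found at least one split):

# the most important predictors in the downsampled tree
downsample_final %>%
  extract_fit_engine() %>%
  pluck("variable.importance") %>%
  head()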
# This looks more complicated only because, unlike the downsampled model,
# I never saved the finalized basic model anywhere.
crash_res %>%
  extract_workflow("basic_decision_tree") %>%
  finalize_workflow(basic_best_param) %>%
  fit(crash_train) %>%
  extract_fit_engine() %>%
  rpart.plot()
library(baguette) # provides bag_tree()

bag_spec <- bag_tree(
  class_cost = tune(), tree_depth = tune(),
  min_n = 20, cost_complexity = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")
library(ranger)

# I'm not marking any parameters with tune() because tuning takes forever
rf_spec <- rand_forest(min_n = 20) %>%
  set_engine("ranger") %>%
  set_mode("classification")
crash_wfs <- workflow_set(
  preproc = list(rec = downsample_recipe),
  models = list(
    bagging = bag_spec,
    rf = rf_spec
  )
)

# note: this overwrites the earlier crash_res
crash_res <- crash_wfs %>%
  workflow_map(
    "tune_grid",
    verbose = TRUE,
    grid = 10,
    resamples = crash_folds
  )
## i 1 of 2 tuning: rec_bagging
## ✔ 1 of 2 tuning: rec_bagging (4m 26.4s)
## i No tuning parameters. `fit_resamples()` will be attempted
## i 2 of 2 resampling: rec_rf
## ✔ 2 of 2 resampling: rec_rf (1m 36s)
stopCluster(cl) # stop the parallel backend now that the heavy tuning is done
collect_metrics(crash_res)
## # A tibble: 22 × 9
## wflow_id .config preproc model .metric .esti…¹ mean n std_err
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 rec_bagging Preprocessor1_… recipe bag_… accura… binary 0.729 10 9.67e-4
## 2 rec_bagging Preprocessor1_… recipe bag_… roc_auc binary 0.820 10 1.25e-3
## 3 rec_bagging Preprocessor1_… recipe bag_… accura… binary 0.620 10 8.40e-4
## 4 rec_bagging Preprocessor1_… recipe bag_… roc_auc binary 0.721 10 1.61e-3
## 5 rec_bagging Preprocessor1_… recipe bag_… accura… binary 0.620 10 8.40e-4
## 6 rec_bagging Preprocessor1_… recipe bag_… roc_auc binary 0.703 10 1.64e-3
## 7 rec_bagging Preprocessor1_… recipe bag_… accura… binary 0.738 10 2.11e-3
## 8 rec_bagging Preprocessor1_… recipe bag_… roc_auc binary 0.791 10 1.97e-3
## 9 rec_bagging Preprocessor1_… recipe bag_… accura… binary 0.682 10 2.10e-2
## 10 rec_bagging Preprocessor1_… recipe bag_… roc_auc binary 0.771 10 6.55e-3
## # … with 12 more rows, and abbreviated variable name ¹.estimator
autoplot(crash_res)
- Create a boosting tree model specification. Make sure to read the help page to see which parameters can be tuned.
- Create a workflow with just this model specification and the recipe. Tune the model using the 10-fold CV.
- You probably failed at the last step. Can you figure out what you need to change to make this model run? (One possible fix is sketched below.)
- Compare the performance of this model to the bagged tree and random forest. Are they different, or about the same? Which model might you choose out of the ones we've tried so far?
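If you get stuck on the third point: here is a minimal sketch of one possible approach, assuming the xgboost engine. The catch is that xgboost cannot handle factor predictors directly, so the recipe needs a `step_dummy()` that the tree models above could skip. (The parameters marked for tuning are illustrative choices, not the only ones.)

# boosting specification: tune a few of the available parameters
boost_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# xgboost needs numeric predictors, so convert the factors to dummies
boost_recipe <- downsample_recipe %>%
  step_dummy(all_nominal_predictors())

boost_wf <- workflow() %>%
  add_recipe(boost_recipe) %>%
  add_model(boost_spec)

set.seed(2023)
boost_res <- tune_grid(boost_wf, resamples = crash_folds, grid = 10)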