Nonlinear Regression Models
Background
This case study examines a large number of models that predict a property of concrete mixtures, as outlined in Chapter 10 of APM. The analysis here will be similar to the APM approach but will illustrate two techniques: using workflow sets to launch models and efficient triage of models using race methods, following TMWR Chapter 15. There are two versions of the data. Like APM, we’ll use the version where the concrete ingredients are represented as proportions in a mixture. There are some replicated mixtures so we create a distinct set of mixtures and average the outcome data across replicates.
data(concrete, package = "modeldata")
We split our data and set up a 10 fold CV set with 5 repeats.
set.seed(1001)
<- initial_split(concrete, strata = compressive_strength)
concrete_split <- training(concrete_split)
concrete_train <- testing(concrete_split)
concrete_test set.seed(1002)
<- vfold_cv(concrete_train, strata = compressive_strength,
concrete_folds repeats = 5)
Preprocessing and Model Specification
SVMs only require normalized parameters, so we won’t do much else here.
<-
normalized_rec recipe(compressive_strength ~ ., data = concrete_train) %>%
step_normalize(all_numeric_predictors())
Here, we specify our SVM models, plus a KNN for comparison.
library(kernlab)
## Warning: package 'kernlab' was built under R version 4.0.5
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:scales':
##
## alpha
<-
svm_r_spec svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
<-
svm_p_spec svm_poly(cost = tune(), degree = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
library(kknn)
<-
knn_spec nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
library("earth")
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
## Loading required package: TeachingDemos
<-
mars_spec mars(prod_degree = tune()) %>% #<- use GCV to choose terms
set_engine("earth") %>%
set_mode("regression")
library(ranger)
<-
rf_spec rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger") %>%
set_mode("regression")
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
<-
xgb_spec boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
min_n = tune(), sample_size = tune(), trees = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
<-
no_pre_proc workflow_set(
preproc = list(simple = compressive_strength ~ .),
models = list(MARS = mars_spec, RF = rf_spec,
boosting = xgb_spec)
)
<-
normalized workflow_set(
preproc = list(normalized = normalized_rec),
models = list(svm_radial = svm_r_spec, svm_poly = svm_p_spec,
KNN = knn_spec)
)
<-
all_workflows bind_rows(no_pre_proc, normalized)
When using the workflow_map()
function, the same
function is applied to the workflows of interest.
The seed
argument controls the random number stream so
that each model with is processed with this random number seed.
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.0.5
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.0.5
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.0.5
## Loading required package: parallel
library(finetune)
<- parallel::detectCores()
cores <- makePSOCKcluster(cores)
cl registerDoParallel(cl)
<-
workflow_fits %>%
all_workflows workflow_map(fn = "tune_grid", seed = 1003, verbose = TRUE,
resamples = concrete_folds, grid = 15,
control = control_grid(parallel_over = "everything")
)
## i 1 of 6 tuning: simple_MARS
## ✔ 1 of 6 tuning: simple_MARS (14.2s)
## i 2 of 6 tuning: simple_RF
## i Creating pre-processing data to finalize unknown parameter: mtry
## ✔ 2 of 6 tuning: simple_RF (2m 15.7s)
## i 3 of 6 tuning: simple_boosting
## ✔ 3 of 6 tuning: simple_boosting (3m 42s)
## i 4 of 6 tuning: normalized_svm_radial
## ✔ 4 of 6 tuning: normalized_svm_radial (1m 43.4s)
## i 5 of 6 tuning: normalized_svm_poly
## ✔ 5 of 6 tuning: normalized_svm_poly (11m 59.8s)
## i 6 of 6 tuning: normalized_KNN
## ✔ 6 of 6 tuning: normalized_KNN (1m 43.3s)
stopCluster(cl)
Alternatively, the finetune::tune_race_anova()
could be
used. Rather than performing the entire CV process for each parameter
combination, this tuning function eliminates some tuning parameter
combinations whenever it becomes clear that they are unlikely to produce
the best results. See TMWR Ch15.4. To get a quick assessment of
the results, the autoplot()
function can be used:
autoplot(workflow_fits, rank_metric = "rmse")
autoplot(workflow_fits, rank_metric = "rmse", select_best=TRUE)
Here it looks like the boosted tree model is performing the best, and it’s not particularly close.
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
%>% collect_metrics(select_best=TRUE) %>%
workflow_fits group_by(wflow_id) %>%
filter(.metric=="rmse") %>%
slice_min(order_by = mean) %>%
ungroup() %>%
arrange(mean) %>%
select(model, mean, n, std_err) %>%
kable()
model | mean | n | std_err |
---|---|---|---|
boost_tree | 4.133672 | 50 | 0.0978921 |
rand_forest | 5.062955 | 50 | 0.0932092 |
mars | 6.227203 | 50 | 0.1115937 |
svm_rbf | 6.338072 | 50 | 0.0878039 |
svm_poly | 7.025577 | 50 | 0.2169813 |
nearest_neighbor | 7.995286 | 50 | 0.1099770 |
<- workflow_fits %>%
boosting_res extract_workflow_set_result("simple_boosting")
%>%
boosting_res show_best(metric="rmse") %>%
select(-.config) %>%
kable() # for nice output, e.g., for a project
trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|---|---|---|---|
860 | 5 | 5 | 0.0169087 | 0.0002203 | 0.7646908 | rmse | standard | 4.133672 | 50 | 0.0978921 |
1924 | 22 | 12 | 0.1823157 | 0.0331872 | 0.9843885 | rmse | standard | 4.138362 | 50 | 0.0896586 |
1745 | 27 | 5 | 0.2490691 | 0.0015410 | 0.3977150 | rmse | standard | 4.184375 | 50 | 0.0902776 |
623 | 24 | 9 | 0.0256900 | 0.0000005 | 0.7522490 | rmse | standard | 4.312496 | 50 | 0.0929550 |
339 | 10 | 10 | 0.0476567 | 0.0167030 | 0.2984359 | rmse | standard | 4.347723 | 50 | 0.0958753 |
<- boosting_res %>% select_best(metric="rmse")
best_boost <- workflow_fits %>%
final_fit extract_workflow("simple_boosting") %>%
finalize_workflow(
parameters = best_boost
%>%
) fit(data=concrete_train)
<- final_fit %>%
final_test_res augment(new_data=concrete_test)
rmse(final_test_res, truth=compressive_strength, estimate=.pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.49