Background
This activity is modified from Emil Hvitfeldt’s work here,
where the author adapts the labs in our book, Introduction to
Statistical Learning, to work with the tidymodels
framework.
The data that we will use today comes from our textbook:
library(ISLR)
library(tidymodels)

Wage <- as_tibble(Wage)
Exercise 1. Take a look at the help page for the Wage data set. What are some of the variables in there? Describe the distributions of wage and age, as well as their relationship.
This lab will look at the various ways we can introduce non-linearity into our model by doing preprocessing. Methods include: polynomial expansion, step functions, and splines. We will also look briefly at GAMs (Generalized Additive Models).
Polynomial Features and Interactions
Polynomial regression can be thought of as doing polynomial expansion on a variable and passing that expansion into a linear regression model. We will be very explicit in this formulation in this chapter. step_poly() allows us to do a polynomial expansion on one or more variables.
The following step will take age and replace it with the variables age, age^2, age^3, and age^4, since we set degree = 4.
rec_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4)
This recipe is combined with a linear regression specification to create a workflow object.
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_poly)
This object can now be fit(), and we can pull the coefficients using tidy().
poly_fit <- fit(poly_wf, data = Wage)

tidy(poly_fit)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 112. 0.729 153. 0
## 2 age_poly_1 447. 39.9 11.2 1.48e-28
## 3 age_poly_2 -478. 39.9 -12.0 2.36e-32
## 4 age_poly_3 126. 39.9 3.14 1.68e- 3
## 5 age_poly_4 -77.9 39.9 -1.95 5.10e- 2
Let’s take a look at our fitted polynomial. To do this, I’ll create a new vector of ages, and make a prediction for each age.
age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))

regression_lines <- bind_cols(
  # Get the predicted values
  augment(poly_fit, new_data = age_range),
  # Get the confidence intervals as well
  predict(poly_fit, new_data = age_range, type = "conf_int")
)

Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines,
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines,
            linetype = "dashed", color = "blue")
The regression curve is now a curve instead of the straight line we would have gotten with a simple linear regression model. Notice furthermore that the confidence bands are tighter where there is a lot of data and widen towards the ends of the data.
Exercise 2. Bake your polynomial recipe and take a look at the new features it creates. You probably expected something like the data below. Is that what you get? Why or why not? What is an orthogonal polynomial basis? (Check out the help file and ask me if you’re not sure)
## # A tibble: 3,000 × 5
## wage age age_sq age.3 age.4
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 75.0 18 324 5832 104976
## 2 70.5 24 576 13824 331776
## 3 131. 45 2025 91125 4100625
## 4 155. 43 1849 79507 3418801
## 5 75.0 50 2500 125000 6250000
## 6 127. 54 2916 157464 8503056
## 7 170. 44 1936 85184 3748096
## 8 112. 30 900 27000 810000
## 9 119. 41 1681 68921 2825761
## 10 129. 52 2704 140608 7311616
## # … with 2,990 more rows
Let us take our model one step further and see what happens to the regression line once we go past the domain it was trained on. The previous plot showed individuals within the age range 18-80. Let us see what happens once we push this to 18-90. This is not an impossible range, even if making predictions outside the range of your data is not recommended.
wide_age_range <- tibble(age = seq(18, 90))

regression_lines <- bind_cols(
  augment(poly_fit, new_data = wide_age_range),
  predict(poly_fit, new_data = wide_age_range, type = "conf_int")
)

Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines,
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines,
            linetype = "dashed", color = "blue")
And we see that the curve starts diverging: once we get past the observed ages, the predicted wage drops off rapidly, eventually becoming negative. The confidence bands also get wider and wider as we get farther away from the data.
This is one of the main drawbacks of using high degree polynomials: they diverge to infinity very rapidly outside of the data domain, which can cause unstable behavior near the endpoints of our domain.
Polynomial features can also be used in classification methods.
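For instance, following the textbook's example of modeling whether a worker earns more than 250 (thousand dollars), a classification version might look like the sketch below. This is illustrative only and not analyzed further in this lab; the object names (Wage_high, rec_poly_cls, and so on) are just for this example.
# Sketch: logistic regression with the same polynomial expansion of age,
# predicting whether wage exceeds 250 (not analyzed further here)
Wage_high <- Wage %>%
  mutate(high = factor(wage > 250))

rec_poly_cls <- recipe(high ~ age, data = Wage_high) %>%
  step_poly(age, degree = 4)

lr_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

lr_poly_wf <- workflow() %>%
  add_model(lr_spec) %>%
  add_recipe(rec_poly_cls)

lr_poly_fit <- fit(lr_poly_wf, data = Wage_high)
predict(lr_poly_fit, new_data = Wage_high, type = "prob")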
Step Functions
Next, let us take a look at the step function and how to fit a model using it as a preprocessor. You can create step functions in a couple of different ways. step_discretize() will convert a numeric variable into a factor variable with n bins, where n is specified with num_breaks. These bins will have approximately the same number of points in them according to the training data set.
rec_discretize <- recipe(wage ~ age, data = Wage) %>%
  step_discretize(age, num_breaks = 4)

discretize_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_discretize)

discretize_fit <- fit(discretize_wf, data = Wage)
discretize_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_discretize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) agebin2 agebin3 agebin4
## 94.16 22.50 25.03 22.41
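To check the claim that the bins contain roughly equal numbers of observations, one option (a small sketch, not part of the original lab) is to pull the trained recipe out of the fitted workflow, bake it on the data, and count how many rows land in each bin.
# Count how many observations fall into each of the 4 learned bins
discretize_fit %>%
  extract_recipe() %>%
  bake(new_data = Wage) %>%
  count(age)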
If you already know where you want the step function to break, then you can use step_cut() and supply the breaks manually.
rec_cut <- recipe(wage ~ age, data = Wage) %>%
  step_cut(age, breaks = c(30, 50, 70))

cut_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_cut)

cut_fit <- fit(cut_wf, data = Wage)
cut_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_cut()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) age(30,50] age(50,70] age(70,80]
## 89.54 26.59 28.20 7.47
Exercise 3. Modify the code from earlier (to create a plot of poly_fit) to create a plot of your model discretize_fit. Describe the fit of the model.
Splines
In order to fit regression splines, or in other words, to use splines as preprocessors when fitting a linear model, we use step_bs() to construct the matrices of basis functions. The bs() function is used, and arguments such as knots can be passed to bs() by passing a named list to options.
rec_spline <- recipe(wage ~ age, data = Wage) %>%
  step_bs(age, deg_free = 3)
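If we instead wanted to place the interior knots ourselves rather than letting deg_free choose them, we could pass them through options as described above. A sketch (the knot locations 25, 40, and 60 and the name rec_spline_knots are purely illustrative):
# Illustrative only: place interior knots at ages 25, 40, and 60
rec_spline_knots <- recipe(wage ~ age, data = Wage) %>%
  step_bs(age, options = list(knots = c(25, 40, 60)))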
We already have the linear regression specification lm_spec, so we can create the workflow, fit the model, and predict with it like we have seen how to do in the previous chapters.
spline_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spline)

spline_fit <- fit(spline_wf, data = Wage)

predict(spline_fit, new_data = Wage)
## # A tibble: 3,000 × 1
## .pred
## <dbl>
## 1 58.7
## 2 84.3
## 3 120.
## 4 120.
## 5 120.
## 6 119.
## 7 120.
## 8 102.
## 9 119.
## 10 120.
## # … with 2,990 more rows
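If we wanted a quick numeric summary of how well this spline fits the training data (not part of the original lab, and not a substitute for proper resampling-based evaluation), we could combine augment() with a yardstick metric such as rmse():
# In-sample RMSE of the spline fit, for illustration only
augment(spline_fit, new_data = Wage) %>%
  rmse(truth = wage, estimate = .pred)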
Lastly, we can plot the fitted B-spline on top of the data.
age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))

regression_lines <- bind_cols(
  augment(spline_fit, new_data = age_range),
  predict(spline_fit, new_data = age_range, type = "conf_int")
)

Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), data = regression_lines, color = "darkgreen") +
  geom_line(aes(y = .pred_lower), data = regression_lines,
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines,
            linetype = "dashed", color = "blue")
Exercise 4. Change around the deg_free in the B-spline recipe and try a few different numbers. How does this change your model?
Exercise 5. Repeat the above analysis, but this time using natural splines. Plot your B-spline and natural spline approximation on the same axis (you can leave off the confidence bounds). Describe the similarities and differences between the two models visually from the graph.
Exercise 6. Mathematically, what is the difference between B-splines and natural splines? Are there advantages or disadvantages to each method? How are the knots chosen by the recipe when only deg_free is selected?
Multivariate Adaptive Regression Splines
Also called MARS. Similar to adding a step_bs or step_ns, this method is another way to incorporate nonlinear, continuous functions of the predictors into the model, though it works in a slightly different way. MARS creates a piecewise linear approximation of the response, only requiring that the approximation is continuous. In one dimension, this means that each of the approximating lines meets at the boundary of the interval breaks (also called knots).
In tidymodels, the model specification is just mars(), and you can also use an ensemble of these models with bag_mars().
mars_spec <- mars() %>%
  set_mode("regression")

mars_wf <- workflow() %>%
  add_model(mars_spec) %>%
  add_formula(wage ~ age) # the mars model cannot use recipes, only formulas

mars_fit <- mars_wf %>% fit(data = Wage)

mars_fit %>%
  augment(new_data = age_range) %>%
  ggplot(aes(age, .pred)) +
  geom_line(color = "blue") +
  geom_point(data = Wage, aes(age, wage), alpha = 0.2)
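To see which hinge functions (and hence which knots) MARS actually selected, one option is to pull out the underlying model object; mars() uses the earth package as its default engine, so printing the extracted fit shows the selected terms.
# Extract the underlying earth model to inspect the selected hinge functions
mars_fit %>%
  extract_fit_engine()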
A few assorted topics
GAMs
All of these methods can be seen in the context of linear models by considering the expression:
\[\hat f(x) = g_1(x_1) + g_2(x_2) + g_3(x_3) + \ldots + g_n(x_n)\]
Here, our approximation \(\hat f\) is a linear combination of functions of each of the variables. These functions could be splines, polynomials, step functions, etc. These models are more general than linear models, but still assume additivity, and hence miss out on interactions.
Generalized Additive Models are implemented in tidymodels, but we won’t explore them today.
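For reference only (we won't run or tune this today), a GAM specification in tidymodels might look something like the following sketch, assuming the mgcv engine is installed; note that the smooth terms are supplied through a model-specific formula in add_model(). The names gam_spec, gam_wf, and gam_fit are just for this example.
# Minimal GAM sketch (not explored further here), assuming the mgcv engine
gam_spec <- gen_additive_mod() %>%
  set_mode("regression") %>%
  set_engine("mgcv")

gam_wf <- workflow() %>%
  add_variables(outcomes = wage, predictors = age) %>%
  add_model(gam_spec, formula = wage ~ s(age))

gam_fit <- fit(gam_wf, data = Wage)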
Interactions
None of the models above considered any “interaction terms” between variables, such as \(x_1 x_2\). For example, if we knew the length (\(x_1\)) and width (\(x_2\)) of a building, we might want to create a new variable that was the area in square feet. We can add these new predictors to the model using step_interact(). Take a look at the help file for step_interact() for more information.