Beyond Linear Regression

Background

This activity is modified from Emil Hvitfeldt’s work here, where the author adapts the labs in our book, Introduction to Statistical Learning, to work with the tidymodels framework.

The data that we will use today comes from our textbook:

library(tidymodels)  # provides recipes, parsnip, workflows, and ggplot2 used below
library(ISLR)
Wage <- as_tibble(Wage)

Exercise 1. Take a look at the help page for the Wage data set. What are some of the variables in there? Describe the distributions of wage and age, as well as their relationship.

This lab will look at the various ways we can introduce non-linearity into our model through preprocessing. Methods include polynomial expansion, step functions, and splines. We will also look briefly at MARS and at GAMs (Generalized Additive Models).

Polynomial Features and Interactions

Polynomial regression can be thought of as doing a polynomial expansion on a variable and passing that expansion into a linear regression model. We will be very explicit about this formulation in this lab. step_poly() allows us to do a polynomial expansion on one or more variables.

The following step will take age and replace it with the variables age, age^2, age^3, and age^4 since we set degree = 4.
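
Up to a change of basis (see Exercise 2 below), this corresponds to fitting the model

\[wage = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot age^2 + \beta_3 \cdot age^3 + \beta_4 \cdot age^4 + \epsilon\]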

rec_poly <- recipe(wage ~ age, data = Wage) %>%
  step_poly(age, degree = 4)

This recipe is combined with a linear regression specification to create a workflow object.

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

poly_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_poly)

This object can now be fit(), and we can pull the coefficients using tidy().

poly_fit <- fit(poly_wf, data = Wage)

tidy(poly_fit)
## # A tibble: 5 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    112.      0.729    153.   0       
## 2 age_poly_1     447.     39.9       11.2  1.48e-28
## 3 age_poly_2    -478.     39.9      -12.0  2.36e-32
## 4 age_poly_3     126.     39.9        3.14 1.68e- 3
## 5 age_poly_4     -77.9    39.9       -1.95 5.10e- 2

Let’s take a look at our fitted polynomial. To do this, I’ll create a grid of new ages and make a prediction for each age.

age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))

regression_lines <- bind_cols(
  # Get the predicted values
  augment(poly_fit, new_data = age_range),
  # Get the confidence intervals as well
  predict(poly_fit, new_data = age_range, type = "conf_int")
)

Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines, 
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines, 
            linetype = "dashed", color = "blue")

The regression curve is now a curve instead of the line we would have gotten with a simple linear regression model. Notice furthermore that the confidence bands are tighter where there is a lot of data and that they widen towards the ends of the data.

Exercise 2. Bake your polynomial recipe and take a look at the new features it creates. You probably expected something like the data below. Is that what you get? Why or why not? What is an orthogonal polynomial basis? (Check out the help file and ask me if you’re not sure)

## # A tibble: 3,000 × 5
##     wage   age age_sq  age.3   age.4
##    <dbl> <int>  <dbl>  <dbl>   <dbl>
##  1  75.0    18    324   5832  104976
##  2  70.5    24    576  13824  331776
##  3 131.     45   2025  91125 4100625
##  4 155.     43   1849  79507 3418801
##  5  75.0    50   2500 125000 6250000
##  6 127.     54   2916 157464 8503056
##  7 170.     44   1936  85184 3748096
##  8 112.     30    900  27000  810000
##  9 119.     41   1681  68921 2825761
## 10 129.     52   2704 140608 7311616
## # … with 2,990 more rows
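
For Exercise 2, if you have not prepped and baked a recipe by hand before, the mechanics look roughly like this (a minimal sketch):

rec_poly %>%
  prep(training = Wage) %>%
  bake(new_data = NULL)   # returns the processed training data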

Let us take our model one step further and see what happens to the regression line once we go past the domain it was trained on. The previous plot showed individuals within the age range 18-80. Let us see what happens once we push this to 18-100. This is not an impossible range, even if making predictions outside the range of your data is not recommended.

wide_age_range <- tibble(age = seq(18, 100))
regression_lines <- bind_cols(
  augment(poly_fit, new_data = wide_age_range),
  predict(poly_fit, new_data = wide_age_range, type = "conf_int")
)
Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), color = "darkgreen",
            data = regression_lines) +
  geom_line(aes(y = .pred_lower), data = regression_lines,
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines,
            linetype = "dashed", color = "blue")

And we see that the curve starts diverging: far enough past the observed ages, the predicted wage even becomes negative. The confidence bands also get wider and wider as we get farther away from the data.

This is one of the main drawbacks of using high degree polynomials: they diverge to infinity very rapidly outside of the data domain, which can cause unstable behavior near the endpoints of our domain.

Polynomial features can also be used in classification methods.
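
For example, a minimal sketch (assuming we construct a binary outcome ourselves, here whether wage exceeds 250) might look like this:

Wage_cls <- Wage %>%
  mutate(high_wage = factor(wage > 250))   # illustrative binary outcome

rec_poly_cls <- recipe(high_wage ~ age, data = Wage_cls) %>%
  step_poly(age, degree = 4)

logistic_wf <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(rec_poly_cls)

logistic_fit <- fit(logistic_wf, data = Wage_cls)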

Step Functions

Next, let us take a look at step functions and how to fit a model using them as a preprocessor. You can create step functions in a couple of different ways. step_discretize() will convert a numeric variable into a factor variable with n bins, where n is specified with num_breaks. The bins will contain approximately the same number of observations, based on the training data set.

rec_discretize <- recipe(wage ~ age, data = Wage) %>%
  step_discretize(age, num_breaks = 4)
discretize_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_discretize)
discretize_fit <- fit(discretize_wf, data = Wage)
discretize_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_discretize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)      agebin2      agebin3      agebin4  
##       94.16        22.50        25.03        22.41

If you already know where you want the step function to break then you can use step_cut() and supply the breaks manually.

rec_cut <- recipe(wage ~ age, data = Wage) %>%
  step_cut(age, breaks = c(30, 50, 70))
cut_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_cut)
cut_fit <- fit(cut_wf, data = Wage)
cut_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_cut()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)   age(30,50]   age(50,70]   age(70,80]  
##       89.54        26.59        28.20         7.47

Exercise 3. Modify the code from earlier (to create a plot of poly_fit) to create a plot of your model discretize_fit. Describe the fit of the model.

Splines

In order to fit regression splines, or in other words, to use splines as preprocessors when fitting a linear model, we use step_bs() to construct the matrices of basis functions. The bs() function is used under the hood, and arguments such as knots can be passed to bs() by passing a named list to options.

rec_spline <- recipe(wage ~ age, data = Wage) %>%
  step_bs(age, deg_free = 3)
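
If you would rather place the knots yourself instead of relying on deg_free, they can be passed through options; the knot locations below are just an illustration.

rec_spline_knots <- recipe(wage ~ age, data = Wage) %>%
  step_bs(age, options = list(knots = c(25, 40, 60)))   # passed on to bs()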

We already have the linear regression specification lm_spec, so we can create the workflow, fit the model, and predict with it just as we have done previously.

spline_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spline)
spline_fit <- fit(spline_wf, data = Wage)
predict(spline_fit, new_data = Wage)
## # A tibble: 3,000 × 1
##    .pred
##    <dbl>
##  1  58.7
##  2  84.3
##  3 120. 
##  4 120. 
##  5 120. 
##  6 119. 
##  7 120. 
##  8 102. 
##  9 119. 
## 10 120. 
## # … with 2,990 more rows

Lastly, we can plot the fitted B-spline on top of the data.

age_range <- tibble(age = seq(min(Wage$age), max(Wage$age)))
regression_lines <- bind_cols(
  augment(spline_fit, new_data = age_range),
  predict(spline_fit, new_data = age_range, type = "conf_int")
)
Wage %>%
  ggplot(aes(age, wage)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = .pred), data = regression_lines, color = "darkgreen") +
  geom_line(aes(y = .pred_lower), data = regression_lines, 
            linetype = "dashed", color = "blue") +
  geom_line(aes(y = .pred_upper), data = regression_lines, 
            linetype = "dashed", color = "blue")

Exercise 4. Change around the deg_free in the B-spline recipe and try a few different numbers. How does this change your model?

Exercise 5. Repeat the above analysis, but this time using natural splines. Plot your B-spline and natural spline approximation on the same axis (you can leave off the confidence bounds). Describe the similarities and differences between the two models visually from the graph.

Exercise 6. Mathematically, what is the difference between B-splines and natural splines? Are there advantages or disadvantages to each method? How are the knots chosen by the recipe when only deg_free is selected?

Multivariate Adaptive Regression Splines

Also called MARS. Similar to adding a step_bs() or step_ns(), this method is another way to incorporate nonlinear, continuous functions of the predictors into the model, though it works in a slightly different way. MARS creates a piecewise-linear approximation of the response, requiring only that the approximation is continuous. In one dimension, this means that the approximating line segments meet at the interval boundaries (also called knots).
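
Concretely, the pieces are built from hinge functions of the form

\[h(x) = \max(0, x - c) \quad \text{and} \quad h(x) = \max(0, c - x)\]

where \(c\) is a knot. The fitted model is a linear combination of such terms (plus an intercept), which is what produces the continuous, piecewise-linear shape.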

In tidymodels, the model specification is just mars(), and you can also use an ensemble of these models with bag_mars().

mars_spec <- mars() %>%
  set_mode("regression")   # uses the "earth" engine by default, so the earth package must be installed

mars_wf <- workflow() %>%
  add_model(mars_spec) %>%
  add_formula(wage ~ age) # supply the model formula directly to the workflow

mars_fit <- mars_wf %>% fit(data=Wage)

mars_fit %>% 
  augment(new_data=age_range) %>%
  ggplot(aes(age,.pred)) +
  geom_line(color="blue") +
  geom_point(data=Wage, aes(age, wage), alpha = 0.2)

A few assorted topics

GAMs

All of these methods can be seen in the context of linear models by considering the expression:

\[\hat f(x) = g_1(x_1) + g_2(x_2) + g_3(x_3) + \ldots + g_n(x_n)\]

Here, our approximation \(\hat f\) is a linear combination of functions of each of the variables. These functions could be splines, polynomials, step functions, etc. These models are more general than linear models, but still assume additivity, and hence miss out on interactions.

Generalized Additive Models are implemented in tidymodels, but we won’t explore them today.
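
For reference only, a minimal sketch of a GAM specification could look like the following, using parsnip's gen_additive_mod() with the mgcv engine and a smooth term s(age); treat this as illustrative rather than something we will rely on today.

gam_spec <- gen_additive_mod() %>%
  set_mode("regression") %>%
  set_engine("mgcv")

gam_fit <- gam_spec %>%
  fit(wage ~ s(age), data = Wage)   # s() requests a smooth term for age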

Interactions

None of the models above considered any “interaction terms” between variables, such as \(x_1 x_2\). For example, if we knew the length (\(x_1\)) and width (\(x_2\)) of a building, we might want to create a new variable that was the area in sq feet. We can add these new predictors to the model using step_interact. Take a look at the help file for step_interact for more information.