Linear Regression Demo

2023-01-23

Read in the data

This data set contains a sample of information about the geyser Old Faithful. Imagine you are a visitor to Yellowstone National Park, and you’d like to predict when the next eruption will be. This data set includes the duration of eruptions and the wait time until the next eruption for a small sample of eruptions.
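The setup chunk isn't echoed here; a minimal sketch of what it likely contains, assuming the data lives in a file named faithful.csv (the file name is an assumption; the columns duration and wait match the output below):

library(tidyverse)

# Read the eruption data (file name assumed; columns: duration, wait)
faithful <- read_csv("faithful.csv")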

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Rows: 272 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): duration, wait
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Linear Association

By plotting the duration of each eruption versus the wait time, we can see that there seems to be a linear association between the eruption duration and the wait until the next eruption.

ggplot(data=faithful, aes(x=duration, y=wait)) +
  geom_point()

What line would fit this data best? Try to estimate the slope and intercept from the graph and fill in below.

# Fill in with your guesses
your_slope = NA
your_intercept = NA

ggplot(data=faithful, aes(x=duration, y=wait)) +
  geom_point() + 
  geom_abline(intercept=your_intercept, slope=your_slope, color="blue")

Remember that ggplot has a nice method to draw a line that “best” matches the data. Compare this with your line. How did you do?

ggplot(data=faithful, aes(x=duration, y=wait)) +
  geom_point() + 
  geom_abline(intercept=your_intercept, slope=your_slope, color="blue") + 
  geom_smooth(method="lm", se=FALSE, color="purple")
## `geom_smooth()` using formula = 'y ~ x'

This line matches what we think the “best” line should be. But where does it come from? Think about your intuition of the best line: you want as many points to be as close to the line as possible. You wouldn’t draw a line where all the points were above or below the line, so it should sit somewhere in the “middle” of all the points. We can make this more precise with the Method of Least Squares.
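Concretely, the least squares line is the one whose intercept \(\beta_0\) and slope \(\beta_1\) minimize the sum of squared vertical distances from the points to the line: \[ \min_{\beta_0, \beta_1} \; \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \]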

Basic least squares regression

Later we’ll learn the tidymodels way to fit a line to data, but here let’s just use base R’s lm() (plus the broom package for tidy output):

library(broom)
faithful_mod <- lm(wait ~ duration, data = faithful)
#summary(faithful_mod) # less tidy way to see model outputs
tidy(faithful_mod) # the tidy function comes from the broom package
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     33.5     1.15       29.0 7.14e- 85
## 2 duration        10.7     0.315      34.1 8.13e-100
glance(faithful_mod)
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.811       0.811  5.91   1162. 8.13e-100     1  -868. 1743. 1754.   9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Here we see that the lm method estimates the “least squares” line to be \[ \widehat{\texttt{wait}} = 33.47440 + 10.72964 \cdot \texttt{duration}. \]

We’ll talk about the rest of the information in the table later. But this seems to be a very good fit! The wait times are very predictable—now you know why the geyser is called Old Faithful!
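For example, we can plug a hypothetical eruption duration into the fitted model (the 4-minute value here is just illustrative):

# Predicted wait (in minutes) after a hypothetical 4-minute eruption
predict(faithful_mod, newdata = data.frame(duration = 4))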

Regression Diagnostics

How well does a linear model fit the data?

We have sample correlation as a measure of the strength of linear association between two variables:

faithful %>% summarize(cor = cor(wait, duration), rsq = cor(wait, duration)^2)
## # A tibble: 1 × 2
##     cor   rsq
##   <dbl> <dbl>
## 1 0.901 0.811

The value of \(.9\) indicates that there is a positive relationship, and the fact that it is close to 1 indicates that the linear relationship is strong. Typically in these circumstances we use \(r^2\), which can be more easily interpreted as the (estimated) “proportion of variation in \(Y\) that can be explained by knowing \(x\).”

A pattern in the residuals can indicate a problem with the model, perhaps due to nonlinear dependence or some confounding variable. (Note that the residuals and fitted values for each row are stored in faithful_mod$residuals and faithful_mod$fitted.values; when we hand the model to ggplot below, it is converted into a data frame whose columns include .resid and .fitted.)

ggplot(data = faithful_mod) +
  geom_point(aes(y = .resid, x = .fitted)) +
  labs(x = "Fitted Values", y = "Residuals") 

We should also make sure that the residuals are nearly normal. We won’t talk much about this: there are more sophisticated plots and statistical tests to assess whether data are consistent with a normal distribution, but we’ll content ourselves with the old “glance at a plot” method.

ggplot(data = faithful_mod) +
  geom_histogram(aes(x = .resid), bins=15)

Looks normal-ish. Good enough. (A boxplot could also be appropriate here.) Since we are more concerned with a model’s predictive power than with statistical inference, we won’t worry as much about assumptions as in 463. We are mostly using linear models as a jumping-off point for regression!
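A normal quantile-quantile plot is another quick visual check; here is a sketch using ggplot2’s QQ geoms:

ggplot(data = faithful_mod, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()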

Do the variables contribute to the model?

We can see this in the model summary (or tidy summary). The \(p\)-value in each row summarizes the evidence against the null hypothesis in the statistical hypothesis test \[ H_0: \, \beta_i = 0 \qquad \text{vs.} \qquad H_1: \, \beta_i \neq 0. \]

Small p-values indicate that there is evidence to reject the null hypothesis. So in the case below there is strong evidence to believe that the true coefficients really are different from 0.

# summary(faithful_mod)
tidy(faithful_mod)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     33.5     1.15       29.0 7.14e- 85
## 2 duration        10.7     0.315      34.1 8.13e-100
glance(faithful_mod)
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.811       0.811  5.91   1162. 8.13e-100     1  -868. 1743. 1754.   9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
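tidy() can also report confidence intervals for the coefficients, which tell the same story as the small p-values (neither interval comes anywhere near 0):

# Add 95% confidence intervals to the coefficient table
tidy(faithful_mod, conf.int = TRUE)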

Is a linear model a helpful predictor overall?

The \(F\)-statistic above (glance(faithful_mod)$statistic) provides a measure of model fit. Is the model overall helpful in predicting the output \(Y\), or is it more useful to just use \(\bar{Y}\)? The \(p\)-value assists in making the conclusion: here we see it is small, so we conclude that at least one of the variables is helpful in predicting \(Y\).
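A small sketch of pulling just the overall \(F\)-statistic and its \(p\)-value out of the glance() output:

glance(faithful_mod) %>% 
  select(statistic, p.value)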

The tidymodels way

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ dials        1.1.0     ✔ rsample      1.1.1
## ✔ infer        1.0.4     ✔ tune         1.0.1
## ✔ modeldata    1.0.1     ✔ workflows    1.1.2
## ✔ parsnip      1.0.3     ✔ workflowsets 1.0.0
## ✔ recipes      1.0.4     ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
faithful_tidymodel <- linear_reg() %>% 
  set_engine("lm") %>%
  fit(wait ~ duration, data=faithful)

# we can extract all the same diagnostic information that we know from "lm" by using extract_fit_engine
faithful_tidymodel %>% extract_fit_engine()
## 
## Call:
## stats::lm(formula = wait ~ duration, data = data)
## 
## Coefficients:
## (Intercept)     duration  
##       33.47        10.73
# For example:
faithful_tidymodel %>% 
  extract_fit_engine() %>%
  glance() # or tidy()!
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.811       0.811  5.91   1162. 8.13e-100     1  -868. 1743. 1754.   9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

The lm method gives the least squares regression line, but this is not the only way to do it! Later we will learn about penalized or weighted least squares, along with generalized linear models (GLM).

Your turn

Read in the file dugong.csv that contains information about the age and length of dugongs.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
dugong <- read_csv("dugong.csv") %>% clean_names()
## Rows: 27 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Length, Age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  1. Draw a scatter plot of age vs. length. Does there appear to be a linear relationship? Plot the best-fit line.

  2. What is the correlation between age and length? Add it to your graph above (you may put it in a caption or the title).

  3. Fit a linear model using length ~ age to find the best fit line.

  4. Provide some diagnostics to evaluate how good age is as a predictor of length. Are any assumptions violated?

  5. What ideas do you have to improve the model? How could you implement them?

Feature Engineering

We noticed before that it looks like there are two clusters, long eruptions and short eruptions. Maybe it is the case that the linear predictor should differ for each cluster. Let’s create a new variable and use it in our model:

faithful <- faithful %>%
  mutate(long_wait = if_else(duration > 3, TRUE, FALSE))
faithful %>% ggplot(aes(duration, wait)) + 
  geom_point() + 
  geom_smooth(aes(color=long_wait), method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Here is the model and summary:
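(The fitting chunk isn’t echoed; given the terms in the output it is essentially the following, where faithful_mod2 is just a placeholder name.)

# Model with the engineered long_wait indicator (object name is a placeholder)
faithful_mod2 <- lm(wait ~ duration + long_wait, data = faithful)
tidy(faithful_mod2)
glance(faithful_mod2)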

## # A tibble: 3 × 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      43.1      1.97      21.8  2.03e-61
## 2 duration          5.61     0.928      6.05 4.92e- 9
## 3 long_waitTRUE    12.9      2.21       5.82 1.64e- 8
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.833       0.831  5.58    669. 4.05e-105     2  -852. 1712. 1727.   8386.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

What is the equation for the estimated model now? How does it handle the categorical variable?

Here is an example using “interaction terms”, i.e. products of different variables.
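Again the chunk isn’t echoed; the formula wait ~ duration * long_wait expands to both main effects plus their product, which matches the duration:long_waitTRUE row below (faithful_mod3 is a placeholder name).

# Interaction model: duration, long_wait, and the product duration:long_wait
faithful_mod3 <- lm(wait ~ duration * long_wait, data = faithful)
tidy(faithful_mod3)
glance(faithful_mod3)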

## # A tibble: 4 × 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              41.6        4.40     9.45  1.74e-18
## 2 duration                  6.35       2.14     2.97  3.26e- 3
## 3 long_waitTRUE            15.1        6.25     2.41  1.64e- 2
## 4 duration:long_waitTRUE   -0.912      2.37    -0.384 7.01e- 1
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.833       0.831  5.59    445. 1.10e-103     3  -852. 1714. 1732.   8382.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Multiple Regression

The previous models are examples of using multiple predictors in a linear regression model.

mr.mod <- lm(mpg ~ ., data=mtcars)

tidy(mr.mod)
## # A tibble: 11 × 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)  12.3      18.7        0.657  0.518 
##  2 cyl          -0.111     1.05      -0.107  0.916 
##  3 disp          0.0133    0.0179     0.747  0.463 
##  4 hp           -0.0215    0.0218    -0.987  0.335 
##  5 drat          0.787     1.64       0.481  0.635 
##  6 wt           -3.72      1.89      -1.96   0.0633
##  7 qsec          0.821     0.731      1.12   0.274 
##  8 vs            0.318     2.10       0.151  0.881 
##  9 am            2.52      2.06       1.23   0.234 
## 10 gear          0.655     1.49       0.439  0.665 
## 11 carb         -0.199     0.829     -0.241  0.812
glance(mr.mod)
## # A tibble: 1 × 12
##   r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC   BIC devia…⁴ df.re…⁵
##     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <int>
## 1   0.869   0.807  2.65    13.9 3.79e-7    10  -69.9  164.  181.    147.      21
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## #   ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual

Multicollinearity is a problem because it makes it difficult to infer the effect of one variable while holding everything else fixed. It is much less of a problem if you only care about predictions, as is often the case in machine learning: the individual coefficients of highly collinear predictors don’t matter for that kind of analysis.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(mr.mod)
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487 
##      gear      carb 
##  5.357452  7.908747

Let’s use backward selection to get a better model. We will discuss better ways to do variable selection later in this course.
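Here is a minimal sketch of backward elimination by AIC using base R’s step(); treat it as one quick option rather than the recommended approach:

# Backward selection by AIC, starting from the full model fit above
mr.mod.back <- step(mr.mod, direction = "backward", trace = 0)
tidy(mr.mod.back)
glance(mr.mod.back)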


Regression with categorical variables

diamonds <- diamonds %>% 
  select(carat:price)

diamond.mod <- lm(price ~ ., data=diamonds)
tidy(diamond.mod)
## # A tibble: 21 × 5
##    term        estimate std.error statistic   p.value
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)   -970.      360.      -2.69 7.14e-  3
##  2 carat         8895.       12.1    736.   0        
##  3 cut.L          616.       23.0     26.8  5.35e-157
##  4 cut.Q         -327.       18.4    -17.8  2.22e- 70
##  5 cut.C          156.       15.8      9.89 5.01e- 23
##  6 cut^4          -16.0      12.6     -1.26 2.07e-  1
##  7 color.L      -1908.       17.7   -108.   0        
##  8 color.Q       -626.       16.1    -38.9  0        
##  9 color.C       -172.       15.1    -11.4  3.48e- 30
## 10 color^4         20.3      13.8      1.47 1.42e-  1
## # … with 11 more rows
glance(diamond.mod)
## # A tibble: 1 × 12
##   r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1     0.916      0.916 1156.  29419.       0    20 -4.57e5 9.14e5 9.14e5 7.21e10
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
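The .L, .Q, .C, and ^4 rows appear because cut, color, and clarity are ordered factors, so lm encodes each of them with polynomial contrasts rather than simple dummy variables. You can inspect the encoding directly:

# Ordered factors get polynomial contrasts by default
contrasts(diamonds$cut)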

Comparing models for prediction

We are going to consider using a linear model for prediction. Since we want to know if our model will work well on data we haven’t yet seen, we’ll randomly split the data up into two parts, a training and a test set.

In tidymodels, the split object stores how the split is performed.

library(tidymodels)

# Use a seed for reproducibility
# Otherwise your code will run differently than mine (which is ok!)
# set.seed(2023)

faithful_split <- initial_split(faithful, prop = 0.80)
faithful_split
## <Training/Testing/Total>
## <217/55/272>
faithful_train <- training(faithful_split)
faithful_test <- testing(faithful_split)
# Our first model, fit using training data
faithful_tidymodel1 <- linear_reg() %>% 
  set_engine("lm") %>%
  fit(wait ~ duration, data=faithful_train)

# Use the model to make predictions on the test set
faithful_results <- predict(
  faithful_tidymodel1, 
  new_data = faithful_test %>% select(-wait)
  ) %>% 
  bind_cols(faithful_test %>% select(wait))

faithful_results
## # A tibble: 55 × 2
##    .pred  wait
##    <dbl> <dbl>
##  1  71.9    79
##  2  83.7    88
##  3  51.9    47
##  4  50.3    52
##  5  74.6    78
##  6  76.5    80
##  7  54.8    52
##  8  53.2    48
##  9  84.8    75
## 10  52.8    54
## # … with 45 more rows

Let’s see how the model performed:

rmse(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.83
rsq(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.828
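yardstick can also bundle several metrics and compute them in one call; a small convenience sketch:

# Compute both metrics at once with a metric set
faithful_metrics <- metric_set(rmse, rsq)
faithful_metrics(faithful_results, truth = wait, estimate = .pred)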

Repeat these steps with our full model using the long_wait variable and compare the metrics.

# Our second model, fit using training data
faithful_tidymodel2 <- linear_reg() %>% 
  set_engine("lm") %>%
  fit(wait ~ duration + long_wait, data=faithful_train)

# Use the model to make predictions on the test set
faithful_results <- predict(
  faithful_tidymodel2, 
  new_data = faithful_test %>% select(-wait)
  ) %>% 
  bind_cols(faithful_test %>% select(wait))

glance(faithful_tidymodel2)
## # A tibble: 1 × 12
##   r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
##       <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.830        0.828  5.62    521. 6.15e-83     2  -681. 1370. 1384.   6760.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
rmse(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.49
rsq(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.855

What does overfitting look like? This model fits the training data very well, but won’t do well on the test data.

ggplot(faithful_train, aes(duration, wait)) +
  geom_point() + 
  geom_smooth(method="lm", se=FALSE, formula="y~poly(x,15)")