Read in the data
This data set contains a sample of information about the geyser Old Faithful. Imagine you are a visitor to Yellowstone National Park, and you’d like to predict when the next eruption will be. This data set includes the duration of eruptions and the wait time until the next eruption for a small sample of eruptions.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Rows: 272 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): duration, wait
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Linear Association
By plotting the duration of each eruption versus the wait time, we can see that there seems to be a linear association between the eruption duration and the wait until the next eruption.
ggplot(data=faithful, aes(x=duration, y=wait)) +
geom_point()
What line would fit this data best? Try to estimate the slope and intercept from the graph and fill in below.
# Fill in with your guesses
your_slope = NA
your_intercept = NA
ggplot(data=faithful, aes(x=duration, y=wait)) +
geom_point() +
geom_abline(intercept=your_intercept, slope=your_slope, color="blue")
Remember that ggplot has a nice method to draw a line that “best” matches the data. Compare this with your line. How did you do?
ggplot(data=faithful, aes(x=duration, y=wait)) +
geom_point() +
geom_abline(intercept=your_intercept, slope=your_slope, color="blue") +
geom_smooth(method="lm", se=FALSE, color="purple")
## `geom_smooth()` using formula = 'y ~ x'
This line matches what we think the “best” line should be. But where does it come from? Think about your intuition of the best line: you want as many points to be as close to the line as possible. You wouldn’t draw a line where all the points were above or below the line, so it should sit somewhere in the “middle” of all the points. We can make this more precise with the Method of Least Squares.
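Before we formalize the Method of Least Squares, it is worth seeing that the answer has a closed form: the least squares slope is \(r \cdot s_y / s_x\), and the line passes through the point of means \((\bar{x}, \bar{y})\). A minimal sketch, using R's built-in `faithful` dataset (columns `eruptions` and `waiting`, renamed here to match the data read in above):

```r
# Least squares has a closed form: slope = r * sd(y) / sd(x),
# and the line passes through (mean(x), mean(y)).
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
b1 <- cor(df$duration, df$wait) * sd(df$wait) / sd(df$duration)
b0 <- mean(df$wait) - b1 * mean(df$duration)
c(intercept = b0, slope = b1)  # about 33.47 and 10.73
```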
Basic least squares regression
Later we’ll learn the `tidymodels` way to approach fitting a line to data, but here let’s just use base R:
library(broom)
faithful_mod <- lm(wait ~ duration, data = faithful)
# summary(faithful_mod) # less tidy way to see model outputs
tidy(faithful_mod) # the tidy function comes from the broom package
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 33.5 1.15 29.0 7.14e- 85
## 2 duration 10.7 0.315 34.1 8.13e-100
glance(faithful_mod)
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.811 0.811 5.91 1162. 8.13e-100 1 -868. 1743. 1754. 9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
Here we see that the `lm` method is estimating the “least squares” line to be \[ 33.47440 + 10.72964 \cdot \texttt{duration}. \]
We’ll talk about the rest of the information in the table later. But this seems to be a very good fit! The wait times are very predictable—now you know why the geyser is called Old Faithful!
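To use the fitted line as a predictor, plug a duration into `predict()`. A quick sketch, rebuilding the model from R's built-in `faithful` data (which appears to match the CSV used above):

```r
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
mod <- lm(wait ~ duration, data = df)
# Expected wait after a 4-minute eruption: 33.47 + 10.73 * 4
predict(mod, newdata = data.frame(duration = 4))  # about 76.4 minutes
```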
Regression Diagnostics
How well does a linear model fit the data?
We have sample correlation as a measure of the strength of linear association between two variables:
faithful %>% summarize(cor = cor(wait, duration), rsq = cor(wait, duration)^2)
## # A tibble: 1 × 2
## cor rsq
## <dbl> <dbl>
## 1 0.901 0.811
The value of \(.9\) indicates that there is a positive relationship, and the fact that it is close to 1 indicates that the linear relationship is strong. Typically in these circumstances we use \(r^2\), which can be more easily interpreted as the (estimated) “proportion of variation in \(Y\) that can be explained by knowing \(x\).”
A pattern in the residuals can indicate some other structure in the data, perhaps due to nonlinear dependence or some confounding variable. (Note that the model residuals and model predictions for each row are stored in the `faithful_mod$resid` and `faithful_mod$fitted` variables. When we hand the model itself to `ggplot`, these appear as the `.resid` and `.fitted` columns used below.)
ggplot(data = faithful_mod) +
geom_point(aes(y = .resid, x = .fitted)) +
labs(x = "Fitted Values", y = "Residuals")
We should also make sure that the residuals are nearly normal. We won’t talk much about this: there are more sophisticated methods and statistical tests to precisely measure how likely it is that data came from a normal distribution, but we’ll content ourselves with the old “glance at a plot” method.
ggplot(data = faithful_mod) +
geom_histogram(aes(x = .resid), bins=15)
Looks normal-ish. Good enough. (A boxplot could also be appropriate here.) Since we are more concerned with a model’s predictive power than with statistical inference, we won’t worry as much about assumptions as in 463. We are mostly using linear models as a jumping-off point for regression!
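Another standard “glance at a plot” option is a normal Q-Q plot of the residuals: points hugging the reference line suggest near-normality. A sketch, rebuilding the model from the built-in data:

```r
library(ggplot2)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
mod <- lm(wait ~ duration, data = df)
ggplot(data.frame(r = resid(mod)), aes(sample = r)) +
  geom_qq() +       # sample quantiles vs. theoretical normal quantiles
  geom_qq_line()    # reference line through the quartiles
```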
Do the variables contribute to the model?
We can see this in the model summary (or tidy summary). The \(p\)-value in each row summarizes the evidence against the null hypothesis in a statistical hypothesis test \[ H_0: \, \beta_i = 0, \qquad H_1: \, \beta_i \neq 0. \]
Small p-values indicate that there is evidence to reject the null hypothesis. So in the case below there is strong evidence to believe that the true coefficients really are different from 0.
# summary(faithful_mod)
tidy(faithful_mod)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 33.5 1.15 29.0 7.14e- 85
## 2 duration 10.7 0.315 34.1 8.13e-100
glance(faithful_mod)
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.811 0.811 5.91 1162. 8.13e-100 1 -868. 1743. 1754. 9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
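The `statistic` column is nothing mysterious: each row’s \(t\) statistic is just the estimate divided by its standard error. A sketch verifying this, again rebuilding the model from the built-in data:

```r
library(broom)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
td <- tidy(lm(wait ~ duration, data = df))
# statistic = estimate / std.error for each coefficient
td$estimate / td$std.error  # about 29.0 and 34.1, matching the table
```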
Is a linear model a helpful predictor overall?
The \(F\)-statistic above (`glance(faithful_mod)$statistic`) provides a measure of model fit. Is the model overall helpful in predicting the output \(Y\), or is it more useful to just use \(\bar{Y}\)? The \(p\)-value assists in making the conclusion: here we see it is small, so we conclude that at least one of the variables is helpful in predicting \(Y\).
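With a single predictor there is a handy sanity check: the overall \(F\) statistic is exactly the square of the slope’s \(t\) statistic. A sketch:

```r
library(broom)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
mod <- lm(wait ~ duration, data = df)
# For simple linear regression, F = t^2 for the slope coefficient
c(F = glance(mod)$statistic, t_sq = tidy(mod)$statistic[2]^2)
```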
The tidymodels way
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ dials 1.1.0 ✔ rsample 1.1.1
## ✔ infer 1.0.4 ✔ tune 1.0.1
## ✔ modeldata 1.0.1 ✔ workflows 1.1.2
## ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
## ✔ recipes 1.0.4 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
faithful_tidymodel <- linear_reg() %>%
  set_engine("lm") %>%
  fit(wait ~ duration, data=faithful)

# we can extract all the same diagnostic information that we know from "lm" by using extract_fit_engine
faithful_tidymodel %>% extract_fit_engine()
##
## Call:
## stats::lm(formula = wait ~ duration, data = data)
##
## Coefficients:
## (Intercept) duration
## 33.47 10.73
# For example:
faithful_tidymodel %>%
  extract_fit_engine() %>%
  glance()  # or tidy!
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.811 0.811 5.91 1162. 8.13e-100 1 -868. 1743. 1754. 9443.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
The `lm` method gives the least squares regression line, but this is not the only way to do it! Later we will learn about penalized or weighted least squares, along with generalized linear models (GLM).
Your turn
Read in the file `dugong.csv` that contains information about the age and length of dugongs.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
dugong <- read_csv("dugong.csv") %>% clean_names()
## Rows: 27 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Length, Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1. Draw a scatter plot of `age` vs `length`. Does there appear to be a linear relationship? Plot the best fit line.
2. What is the correlation between `age` and `length`? Add this to your graph above (you may use a caption or put it in the title).
3. Fit a linear model using `length ~ age` to find the best fit line.
4. Provide some diagnostics to evaluate how good `age` is as a predictor of `length`. Are any assumptions violated?
5. What ideas do you have to improve the model? How could you implement them?
Feature Engineering
We noticed before that it looks like there are two clusters, long eruptions and short eruptions. Maybe it is the case that the linear predictor should differ for each cluster. Let’s create a new variable and use it in our model:
faithful <- faithful %>%
  mutate(long_wait = if_else(duration > 3, TRUE, FALSE))

faithful %>% ggplot(aes(duration, wait)) +
  geom_point() +
  geom_smooth(aes(color=long_wait), method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Here is the model and summary:
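The code that produced the output below is not shown in the rendered notes; it was presumably along these lines (a sketch, rebuilding the data and the `long_wait` flag from the built-in `faithful` dataset):

```r
library(broom)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
df$long_wait <- df$duration > 3   # same feature as above
mod2 <- lm(wait ~ duration + long_wait, data = df)
tidy(mod2)    # three rows: intercept, duration, long_waitTRUE
glance(mod2)
```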
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 43.1 1.97 21.8 2.03e-61
## 2 duration 5.61 0.928 6.05 4.92e- 9
## 3 long_waitTRUE 12.9 2.21 5.82 1.64e- 8
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.833 0.831 5.58 669. 4.05e-105 2 -852. 1712. 1727. 8386.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
What is the equation for the estimated model now? How does it handle the categorical variable?
Here is an example using “interaction terms”, i.e. products of different variables.
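Again the code is elided in the rendered notes; an interaction model like the one sketched below would produce this table:

```r
library(broom)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
df$long_wait <- df$duration > 3
# y ~ x * z expands to x + z + x:z (main effects plus their product)
mod3 <- lm(wait ~ duration * long_wait, data = df)
tidy(mod3)  # four rows, including duration:long_waitTRUE
```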
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 41.6 4.40 9.45 1.74e-18
## 2 duration 6.35 2.14 2.97 3.26e- 3
## 3 long_waitTRUE 15.1 6.25 2.41 1.64e- 2
## 4 duration:long_waitTRUE -0.912 2.37 -0.384 7.01e- 1
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.833 0.831 5.59 445. 1.10e-103 3 -852. 1714. 1732. 8382.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
Multiple Regression
The previous models are examples of using multiple predictors in a linear regression model.
mr.mod <- lm(mpg ~ ., data=mtcars)
tidy(mr.mod)
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.3 18.7 0.657 0.518
## 2 cyl -0.111 1.05 -0.107 0.916
## 3 disp 0.0133 0.0179 0.747 0.463
## 4 hp -0.0215 0.0218 -0.987 0.335
## 5 drat 0.787 1.64 0.481 0.635
## 6 wt -3.72 1.89 -1.96 0.0633
## 7 qsec 0.821 0.731 1.12 0.274
## 8 vs 0.318 2.10 0.151 0.881
## 9 am 2.52 2.06 1.23 0.234
## 10 gear 0.655 1.49 0.439 0.665
## 11 carb -0.199 0.829 -0.241 0.812
glance(mr.mod)
## # A tibble: 1 × 12
## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.869 0.807 2.65 13.9 3.79e-7 10 -69.9 164. 181. 147. 21
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
Multicollinearity is a problem because it makes it difficult to infer the effect of one variable while holding everything else fixed. It is less of a problem if you just care about the predictions, as is often the case in machine learning: factors with high collinearity don’t hurt a purely predictive analysis nearly as much.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(mr.mod)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
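The VIF for a predictor can be computed by hand: regress it on all the other predictors and take \(1 / (1 - R^2)\). A sketch checking `cyl`:

```r
# Regress cyl on the other predictors (the response mpg is excluded)
aux <- lm(cyl ~ . - mpg, data = mtcars)
1 / (1 - summary(aux)$r.squared)  # about 15.37, matching vif(mr.mod)["cyl"]
```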
Let’s use backward selection to get a better model. We will discuss better ways to do variable selection later in this course.
mr.mod <- lm(mpg ~ ., data=mtcars)
tidy(mr.mod)
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.3 18.7 0.657 0.518
## 2 cyl -0.111 1.05 -0.107 0.916
## 3 disp 0.0133 0.0179 0.747 0.463
## 4 hp -0.0215 0.0218 -0.987 0.335
## 5 drat 0.787 1.64 0.481 0.635
## 6 wt -3.72 1.89 -1.96 0.0633
## 7 qsec 0.821 0.731 1.12 0.274
## 8 vs 0.318 2.10 0.151 0.881
## 9 am 2.52 2.06 1.23 0.234
## 10 gear 0.655 1.49 0.439 0.665
## 11 carb -0.199 0.829 -0.241 0.812
glance(mr.mod)
## # A tibble: 1 × 12
## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.869 0.807 2.65 13.9 3.79e-7 10 -69.9 164. 181. 147. 21
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
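Base R’s `step()` can automate backward selection (using AIC rather than p-values, one common flavor); a sketch of what the elimination might look like, starting from the full model:

```r
mr.mod <- lm(mpg ~ ., data = mtcars)
# Drop one variable at a time while doing so keeps lowering AIC
mr.back <- step(mr.mod, direction = "backward", trace = 0)
formula(mr.back)  # on these data this typically ends at mpg ~ wt + qsec + am
```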
Regression with categorical variables
diamonds <- diamonds %>%
  select(carat:price)

diamond.mod <- lm(price ~ ., data=diamonds)
tidy(diamond.mod)
## # A tibble: 21 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -970. 360. -2.69 7.14e- 3
## 2 carat 8895. 12.1 736. 0
## 3 cut.L 616. 23.0 26.8 5.35e-157
## 4 cut.Q -327. 18.4 -17.8 2.22e- 70
## 5 cut.C 156. 15.8 9.89 5.01e- 23
## 6 cut^4 -16.0 12.6 -1.26 2.07e- 1
## 7 color.L -1908. 17.7 -108. 0
## 8 color.Q -626. 16.1 -38.9 0
## 9 color.C -172. 15.1 -11.4 3.48e- 30
## 10 color^4 20.3 13.8 1.47 1.42e- 1
## # … with 11 more rows
glance(diamond.mod)
## # A tibble: 1 × 12
## r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.916 0.916 1156. 29419. 0 20 -4.57e5 9.14e5 9.14e5 7.21e10
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
Comparing models for prediction
We are going to consider using a linear model for prediction. Since we want to know if our model will work well on data we haven’t yet seen, we’ll randomly split the data up into two parts, a training and a test set.
In `tidymodels`, the split object stores how the split is performed.
library(tidymodels)
# Use a seed for reproducibility
# Otherwise your code will run differently than mine (which is ok!)
# set.seed(2023)
faithful_split <- initial_split(faithful, prop = 0.80)
faithful_split
## <Training/Testing/Total>
## <217/55/272>
faithful_train <- training(faithful_split)
faithful_test <- testing(faithful_split)
# Our first model, fit using training data
faithful_tidymodel1 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(wait ~ duration, data=faithful_train)

# Use the model to make predictions on the test set
faithful_results <- predict(
  faithful_tidymodel1,
  new_data = faithful_test %>% select(-wait)
) %>%
  bind_cols(faithful_test %>% select(wait))
faithful_results
## # A tibble: 55 × 2
## .pred wait
## <dbl> <dbl>
## 1 71.9 79
## 2 83.7 88
## 3 51.9 47
## 4 50.3 52
## 5 74.6 78
## 6 76.5 80
## 7 54.8 52
## 8 53.2 48
## 9 84.8 75
## 10 52.8 54
## # … with 45 more rows
Let’s see how the model performed:
rmse(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.83
rsq(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.828
Repeat these steps with our full model using the `long_wait` variable and compare the metrics.
# Our second model, fit using training data
faithful_tidymodel2 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(wait ~ duration + long_wait, data=faithful_train)

# Use the model to make predictions on the test set
faithful_results <- predict(
  faithful_tidymodel2,
  new_data = faithful_test %>% select(-wait)
) %>%
  bind_cols(faithful_test %>% select(wait))
glance(faithful_tidymodel2)
## # A tibble: 1 × 12
## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.830 0.828 5.62 521. 6.15e-83 2 -681. 1370. 1384. 6760.
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
rmse(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.49
rsq(faithful_results, truth = wait, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.855
What does overfitting look like? This model fits the training data very well, but won’t do well on the test data.
ggplot(faithful_train, aes(duration, wait)) +
geom_point() +
geom_smooth(method="lm", se=FALSE, formula="y~poly(x,15)")
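One way to see the overfitting numerically: the degree-15 polynomial’s RMSE is small on the training data but tends to be much worse on held-out data. A sketch, rebuilding a split from the built-in data (so the exact numbers will vary with the seed):

```r
library(rsample)
library(yardstick)
set.seed(1)
df <- data.frame(duration = faithful$eruptions, wait = faithful$waiting)
split <- initial_split(df, prop = 0.80)
train <- training(split)
test  <- testing(split)
# Fit the wiggly degree-15 polynomial on the training set only
overfit <- lm(wait ~ poly(duration, 15), data = train)
rmse_vec(train$wait, predict(overfit, train))  # low: the model chases noise
rmse_vec(test$wait, predict(overfit, test))    # usually worse on unseen data
```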