Math 463 - Lab 6 - Simple Linear Regression

Batter up

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

In this lab, you’ll be looking at data from all 30 Major League Baseball teams and examining the relationship between runs scored in a season and a number of other player statistics. The aim is to summarize these relationships both graphically and numerically in order to find which variable, if any, helps you best predict a team’s runs scored in a season.

Loading a comma-separated value file

The data for the Major League Baseball’s 2021 season is in a file called mlb21.csv, which is in the comma-separated value format. The values of the different columns are separated by commas. This is a very common format for storing data.

You’ll need to download the file to your laptop and save it in the directory where you are keeping your Markdown file. The code below will look in your working directory for the file, load it into your environment, and call it mlb21. If you get an error, then your working directory is probably wrong. You can change the working directory to point to where your data is by going to Session|Set Working Directory|Choose Directory and navigating to where you downloaded the data.

The first thing you’ll want to do is look at the data. Since this dataset is not contained in an R package, there is no help file.

mlb21 <- read_csv("mlb21.csv")

In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the three newer variables on your own.

Below is a list of variables and their descriptions. If you’re not familiar with baseball, feel free to ask more questions! You can also read more detailed explanations of the stats here.

team: Team name.
runs: Number of runs.
at_bats: Number of at bats.
hits: Number of hits.
homeruns: Number of home runs.
bat_avg: Batting average (hits/at_bats).
strikeouts: Number of strikeouts.
stolen_bases: Number of stolen bases.
wins: Number of wins.
onbase_pct: On base percentage, measure of how often a batter reaches base for any reason other than a fielding error, fielder’s choice, dropped/uncaught third strike, fielder’s obstruction, or catcher’s interference.
slug_pct: Slugging percentage, popular measure of the power of a hitter calculated as the total bases divided by at bats.
onbase_plus_slug: On base plus slugging, calculated as the sum of these two variables

EDA

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient (the strength of the linear relationship between two random variables). With the code below, we can calculate the sample correlation between these two variables.

cor(mlb21$runs, mlb21$at_bats)

Think back to the way that we described the distribution of a single variable. Recall that we discussed characteristics such as center, spread, and shape. It’s also useful to be able to describe the relationship of two numerical variables, such as runs and at_bats.

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship (i.e. use the correlation) as well as any unusual observations.

The linear model

The most common way in R to fit the least-squares regression line is to use the lm function. This function will calculate the \(\hat{\beta_0}\) and \(\hat{\beta_1}\) estimators using the formulas we derived in class, as well as the standard errors using the variance formulas, where \(\sigma^2\) is replace with \[S^2 = \frac{SSE}{n-2}\].

The code below uses the formula interface; think of the ~ like an = in our linear regression model. Instead of simply pasting the output into the console, I am saving it as a list of model results.

m1 <- lm(runs ~ at_bats, data = mlb21)

The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb21 data frame to find the runs and at_bats variables.

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access some of this information using the summary function.

summary(m1)

With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = -2540.0987 + 0.6065 \cdot \mathrm{at\_bats} \]

This equations tells us that our model predicts that for an increase of at_bats of 1, we would expect the runs to increase (on average, remember that the line represents MEANS at the different values of \(x\)) by 0.6065.

One piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response (or outcome) variable that is explained by the model. For this model, 37.3% of the variability in runs is explained by at_bats. That should make sense from the plot. While there is an obvious linear relationship, there is still quite a bit of variability around the line that isn’t accounted for by at_bats.

Fit a new model that uses stolen_bases to predict runs. Call it m2. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and number of stolen bases? In other words, interpret the slope parameter. Which model does a better job fitting the data? Why?

Prediction and prediction errors

Below is code to overlay the data from the first model with the linear regression line (in green). Notice that adding the line in ggplot is literally “adding” another layer. The grey band is the estimated “prediction error” (see Section 11.13 in your text). The estimated prediction error is a function of both \(S^2\) as well as the predictor, \(x\). You can see this reflected in the width of the grey bands changing. There is less error near the middle of the dataset than near the ends.

ggplot(data = mlb21) +
  geom_point(aes(x = at_bats, y = runs)) +
  geom_smooth(method = "lm", aes(x = at_bats, y = runs), color = 'green')

This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals (the \(\epsilon\)’s).

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,357 at-bats? Is this an overestimate or an underestimate for the Chicago Whitesox, and by how much? In other words, what is the residual for this prediction?

Model diagnostics

To assess whether the linear model is reliable and appropriate, we need to check our assumptions:
linearity, normal residuals, and constant variability.

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. If the relationship is linear, we should NOT see a pattern in the residuals. The nice thing about storing the model results is that the residuals, as well as the predicted values, are stored in the model object (we named it m1 in this case).

ggplot(data = m1) +
  geom_point(aes(y = .resid, x = .fitted)) +
  labs(x = "Fitted Values", y = "Residuals")

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

Nearly normal residuals: Remember that we are assuming that the random error terms \(\varepsilon\) follow a normal distribution with mean 0 and standard deviation \(\sigma\). We also assume that the distribution of \(\varepsilon\) is independent of \(x\), which we check in the next section. To check the normality condition, you can look at a histogram with a normal distribution curve overlay:

ggplot(data = m1,aes(x = .resid)) +
  geom_histogram(aes(y = ..density..), bins=8) +
  stat_function(fun = dnorm,
                args = list(mean = mean(m1$residuals),
                            sd = sd(m1$residuals)),
                col = "blue",
                size = 1.5)

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Note: There are other ways to test for normality in a data set, including a QQ-plot, and hypothesis test approaches. We will not cover those in this class.

Constant variability: If variability in the residuals is constant, you should approximately even spread across the residual plot.

Based on the residual plot, does the constant variability condition appear to be met?

More exercises

Choose another one of the seven traditional variables from mlb21 besides at_bats or homeruns that you think might be a good predictor of runs (“traditional” means any except the last three columns). Produce a scatterplot of the two variables and fit a linear model. Also, plot the least squares line on top of the data. At a glance, does there seem to be a linear relationship?
How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the seven traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all seven): Include a scatterplot with an overlaid regression line, correlation, and model output. You don’t need to consider model diagnostics in this exercise.
Now examine the three newer variables (the last three columns). These are the statistics used by the author of Moneyball to predict a teams success. In general, are they more or less effective at predicting runs than the traditional variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

When you are finished with the lab, you need to upload your lab as a .pdf file, and include the .Rmd file as well. Make sure your final Markdown document Knits properly and is neatly organized. If your document doesn’t Knit, you will receive points off. Also remember that if you needed output (graphs, numeric output, etc.) to answer a question, the code to generate that output needs to be in the lab report..