The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In this lab, you’ll be looking at data from all 30 Major League Baseball teams and examining the relationship between runs scored in a season and a number of other player statistics. The aim is to summarize these relationships both graphically and numerically in order to find which variable, if any, helps you best predict a team’s runs scored in a season.
The data for Major League Baseball's 2021 season is in a file
called mlb21.csv
, which is in the comma-separated value
format. The values of the different columns are separated by commas.
This is a very common format for storing data.
You’ll need to download the file to your laptop and save it in the
directory where you are keeping your Markdown file. The code below will
look in your working directory for the file, load it into your
environment, and call it mlb21
. If you get an error, then
your working directory is probably wrong. You can change the working
directory to point to where your data is by going to Session|Set Working
Directory|Choose Directory and navigating to where you downloaded the
data.
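If you prefer code to the menu, you can also check and set the working directory directly. This is only a sketch: the path below is a placeholder that you would replace with wherever you saved the file.
getwd()                               # shows the folder R is currently using as the working directory
# setwd("~/path/to/your/lab/folder")  # uncomment and edit so it points at the folder containing mlb21.csv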
The first thing you’ll want to do is look at the data. Since this dataset is not contained in an R package, there is no help file.
mlb21 <- read_csv("mlb21.csv")
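A few quick ways to look at what you just loaded (a sketch; it assumes the tidyverse, which provides read_csv() and glimpse(), is loaded):
library(tidyverse)   # load once near the top of your Rmd; provides read_csv() and glimpse()
glimpse(mlb21)       # variable names, types, and the first few values of each column
head(mlb21)          # the first six rows of the data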
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the three newer variables on your own.
Below is a list of variables and their descriptions. If you’re not familiar with baseball, feel free to ask more questions! You can also read more detailed explanations of the stats here.
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient (a measure of the strength of the linear relationship between two random variables). With the code below, we can calculate the sample correlation between these two variables.
cor(mlb21$runs, mlb21$at_bats)
Think back to the way that we described the distribution of a single
variable. Recall that we discussed characteristics such as center,
spread, and shape. It’s also useful to be able to describe the
relationship of two numerical variables, such as runs
and
at_bats
.
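For example, a basic scatterplot of runs against at_bats takes one ggplot2 layer (a minimal sketch; the plot you made for the exercise above follows the same pattern):
ggplot(data = mlb21) +
  geom_point(aes(x = at_bats, y = runs)) +
  labs(x = "At-bats", y = "Runs")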
The most common way in R to fit the least-squares regression line is
to use the lm
function. This function will calculate the
\(\hat{\beta_0}\) and \(\hat{\beta_1}\) estimators using the
formulas we derived in class, as well as the standard errors using the
variance formulas, where \(\sigma^2\) is replaced with \(S^2 = \frac{SSE}{n-2}\).
The code below uses the formula interface; think of the
~
like an =
in our linear regression model.
Instead of simply printing the output to the console, I am saving it as a list of model results.
m1 <- lm(runs ~ at_bats, data = mlb21)
The first argument in the function lm
is a formula that
takes the form y ~ x
. Here it can be read that we want to
make a linear model of runs
as a function of
at_bats
. The second argument specifies that R should look
in the mlb21
data frame to find the runs
and
at_bats
variables.
The output of lm
is an object that contains all of the
information we need about the linear model that was just fit. We can
access some of this information using the summary
function.
summary(m1)
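If you would rather pull individual pieces out of the fitted model than read them off the summary table, these standard accessors work on any lm object (a small sketch):
coef(m1)                              # the estimates of the intercept and slope
summary(m1)$sigma^2                   # S^2, the estimate of sigma^2
sum(resid(m1)^2) / (nrow(mlb21) - 2)  # the same quantity computed directly as SSE / (n - 2)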
With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2540.0987 + 0.6065 \cdot \mathrm{at\_bats} \]
This equation tells us that our model predicts that for an increase in at_bats of 1, we would expect runs to increase (on average; remember that the line represents MEANS at the different values of \(x\)) by 0.6065.
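As a quick sanity check of that interpretation, you can compute the predicted change in mean runs for, say, 100 additional at-bats directly from the fitted slope (100 is just an illustrative number):
coef(m1)["at_bats"] * 100   # about 61 more runs, on average, per 100 additional at-bats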
One piece of information we will discuss from the summary output is
the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of
variability in the response (or outcome) variable that is explained by
the model. For this model, 37.3% of the variability in runs
is explained by at_bats
. That should make sense from the
plot. While there is an obvious linear relationship, there is still
quite a bit of variability around the line that isn’t accounted for by
at_bats
.
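If you want just this number rather than the full summary table, it is stored in the summary object:
summary(m1)$r.squared   # proportion of the variability in runs explained by at_bats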
Fit a new linear model that uses stolen_bases to predict runs. Call it m2. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between the success of a team and the number of stolen bases? In other words, interpret the slope parameter. Which model does a better job fitting the data? Why?

Below is code to overlay the data from the first model with the
linear regression line (in green). Notice that adding the line in
ggplot
is literally “adding” another layer. The grey band
is the estimated “prediction error” (see Section 11.13 in your text).
The estimated prediction error is a function of both \(S^2\) as well as the
predictor, \(x\). You can see this
reflected in the width of the grey bands changing. There is less error
near the middle of the dataset than near the ends.
ggplot(data = mlb21) +
geom_point(aes(x = at_bats, y = runs)) +
geom_smooth(method = "lm", aes(x = at_bats, y = runs), color = 'green')
This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, this is referred to as extrapolation and is not usually recommended. Predictions made within the range of the data are more reliable. The fitted values at the observed \(x\)'s are also used to compute the residuals (the estimates of the \(\epsilon\)'s).
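For example, predict() gives the predicted mean runs at a chosen number of at-bats; 5500 below is only an illustrative value, which you should check falls inside the range of at_bats in the data:
predict(m1, newdata = data.frame(at_bats = 5500))                            # predicted mean runs
predict(m1, newdata = data.frame(at_bats = 5500), interval = "prediction")   # adds a prediction interval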
To assess whether the linear model is reliable and appropriate, we
need to check our assumptions:
linearity, normal residuals, and constant variability.
Linearity: You already checked whether the relationship between runs and at-bats looks linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. the fitted values, which the code below produces. If the relationship is linear, we should NOT see a pattern in the residuals.
The nice thing about storing the model results is that the residuals, as
well as the predicted values, are stored in the model object (we named
it m1
in this case).
ggplot(data = m1) +
geom_point(aes(y = .resid, x = .fitted)) +
labs(x = "Fitted Values", y = "Residuals")
Nearly normal residuals: Remember that we are assuming that the random error terms \(\varepsilon\) follow a normal distribution with mean 0 and standard deviation \(\sigma\). We also assume that the distribution of \(\varepsilon\) is independent of \(x\), which we check in the next section. To check the normality condition, you can look at a histogram with a normal distribution curve overlay:
ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(aes(y = ..density..), bins = 8) +
  stat_function(fun = dnorm,
                args = list(mean = mean(m1$residuals),
                            sd = sd(m1$residuals)),
                col = "blue",
                size = 1.5)
Note: There are other ways to test for normality in a data set, including a QQ-plot, and hypothesis test approaches. We will not cover those in this class.
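For reference only (it is not required for this lab), a QQ-plot of the residuals takes just two more ggplot layers:
ggplot(data = m1, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()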
Constant variability: If the variability in the residuals is constant, you should see approximately even spread across the residual plot.
Choose another one of the seven traditional variables from
mlb21
besides at_bats
or homeruns
that you think might be a good predictor of runs
(“traditional” means any except the last three columns). Produce a
scatterplot of the two variables and fit a linear model. Also, plot the
least squares line on top of the data. At a glance, does there seem to
be a linear relationship?
How does this relationship compare to the relationship between
runs
and at_bats
? Use the \(R^2\) values from the two model summaries to
compare. Does your variable seem to predict runs
better
than at_bats
? How can you tell?
Now that you can summarize the linear relationship between two
variables, investigate the relationships between runs
and
each of the seven traditional variables. Which variable best predicts
runs
? Support your conclusion using the graphical and
numerical methods we’ve discussed (for the sake of conciseness, only
include output for the best variable, not all seven): Include a
scatterplot with an overlaid regression line, correlation, and model
output. You don’t need to consider model diagnostics in this
exercise.
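One way to screen the candidates quickly before making plots (a sketch that assumes the tidyverse is loaded; non-numeric columns such as the team name are dropped automatically):
cor(mlb21$runs, select(mlb21, where(is.numeric)))   # correlation of runs with every numeric column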
Now examine the three newer variables (the last three columns).
These are the statistics used by the author of Moneyball to
predict a team's success. In general, are they more or less effective at
predicting runs than the traditional variables? Explain using
appropriate graphical and numerical evidence. Of all ten variables we’ve
analyzed, which seems to be the best predictor of runs
?
Using the limited (or not so limited) information you know about these
baseball statistics, does your result make sense?
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
When you are finished with the lab, you need to upload your lab as a
.pdf
file, and include the .Rmd
file as well.
Make sure your final Markdown document Knits properly and is
neatly organized. If your document doesn’t Knit, you will receive points
off. Also remember that if you needed output (graphs, numeric output,
etc.) to answer a question, the code to generate that output needs to be
in the lab report.