We have been learning about how to make inferences about a population mean using hypothesis testing and confidence intervals. In your last lab, you considered data from 2013 flights that departed from New York City. We will consider the same dataset for this lab.
Remember to start a new Markdown document and name it appropriately. You will use a number of functions from your previous labs, so you may want to have them open. You will not have as much direction as you did in the last lab, so you should find a group of two or three to work with. Only one person in the group needs to submit a lab report. You will want to search the Help pages often (lower right window in RStudio); Google will also turn up many helpful resources quickly, once you learn how to search well.
You will focus today on the variables arr_delay, origin, and air_time.
Create a new dataset called nyc.sub that contains only the columns listed above. Use the select function with piping. (If you don’t remember the select function, type ?select into the console to open the help page.)
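As a reminder of the syntax, here is a minimal sketch of select with piping on a small made-up data frame. The data frame toy and its column names are hypothetical and are not the flights data; the sketch also assumes the dplyr package is installed.

```r
library(dplyr)

# A small made-up data frame (hypothetical; not the flights data)
toy <- data.frame(
  id     = 1:4,
  height = c(150, 162, 171, 180),
  weight = c(52, 60, 68, 77),
  group  = c("a", "b", "a", "b")
)

# Pipe the data frame into select() and keep only the named columns
toy.sub <- toy %>%
  select(height, group)

# Check which columns were kept
names(toy.sub)
```

The same pattern should carry over to any data frame: the object on the left of the pipe becomes the first argument of select.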
Use a new function called ggpairs to plot this new dataset. You’ll need to install/load the appropriate library (Google to find out which package you need). Type up a summary of all the information you are able to get from a ggpairs plot. (Note: the message about bins always shows up. Suppress the message in your knitted document. You may need to search for how to suppress messages in an R Markdown code chunk.)
Using the plot from #2, describe the distributions of arr_delay and air_time.
Use the summary function to get some basic statistics for your variables. Which key statistic is missing?
Another descriptive function is the skim function in the skimr package. Try this and comment on any differences with summary.
The ggpairs function can be nice when there are a small number of variables for which you want to quickly see bivariate relationships. But when you want to more deeply investigate an individual variable, you should create separate plots. Plot histograms of air_time separately for the different origins by faceting. (Note: Remove all the messages AND warnings for this chunk.)
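If faceting is unfamiliar, the idea can be sketched on a stand-in dataset. The example below uses R's built-in iris data rather than the flights data, and it assumes the ggplot2 package is the plotting tool in use; neither choice comes from this handout.

```r
library(ggplot2)

# iris is a built-in R dataset, used here only as a stand-in
p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20) +        # choose the number of bins explicitly
  facet_wrap(~ Species, ncol = 1)    # one panel per level of Species

# Printing the object draws the plot
p
```

Stacking the panels in one column (ncol = 1) puts them on a shared horizontal axis, which makes it easier to compare centers and spreads across groups.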
Now that we have explored and summarized our data, we will turn to performing inference. Again, there will be a number of functions that you haven’t yet seen, and you’ll have to figure out how to use them. This will help prepare you for the class project (coming after spring break!). Feel free to ask a lot of questions!
Run a \(t\)-test to determine if there is evidence that the mean arrival delay is greater than zero. You will use the function t.test and provide a 99% (two-sided) confidence interval for the mean. Give a conclusion for the above analysis in complete sentences. Comment on what you think about the assumptions of the \(t\)-test. (Note: You will have to run the function twice to get a one-sided \(p\)-value and a two-sided confidence interval. I haven’t given you example code for t.test. See the examples at the bottom of the help file, or Google something like “how to use t.test in R”.)
Do the same type of analysis for air time. Conduct a two-sided hypothesis test to determine if there is evidence that the mean air time is different from 150 minutes. Provide a 95% confidence interval for the mean. Note the defaults of the function! Also comment on the assumptions of the test for this variable.
Now run the same test with a different null hypothesis to determine if the true mean is different from 149. Why is the \(p\)-value so different? In other words, why do we fail to show a difference from 150 (not even close), but we have fairly strong evidence that it’s different from 149?
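As a starting point for thinking about this, it may help to recall the one-sample \(t\) statistic, written here in standard notation that this handout does not itself define:

\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\]

Here \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, \(n\) is the number of non-missing observations, and \(\mu_0\) is the null value. Only \(\mu_0\) changes between the two tests; the standard error \(s / \sqrt{n}\) is the same in both.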
Is there evidence of a difference in mean air time between LaGuardia and JFK airports, using this 2013 data? Include all necessary code and write your conclusions in complete sentences with a reported confidence interval. Also include an appropriate visualization to accompany your inference.
When you are finished with the lab, one person from your group will need to upload your .html file to Canvas. Please make sure everyone’s name is on the lab to get credit! Look over your report to make sure it is rendering properly. Also remember that if you needed output (graphs, numeric output, etc.) to answer a question, the code to generate that output needs to be in the lab report.