Resampling Activity

2023-01-20

library(tidyverse)
library(janitor)
library(tidymodels)

Background

On April 15, 1912, the Titanic, the largest passenger liner of her time, sank after colliding with an iceberg during her maiden voyage, killing 1502 of the 2224 passengers and crew aboard. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes of the person:

  - whether they survived
  - their age
  - their passenger class
  - their gender
  - the fare they paid
  - whether they had children, siblings, or a spouse on board

This data has been cleaned, and rows with missing data have been removed.

titanic <- read_csv("titanic.csv") %>% 
  clean_names() %>%
  mutate(survived = as.factor(survived))

EDA

Take a look at the distribution of the survived variable. How many observations are in each class? Does there seem to be any relationship between survived and the other variables that we could use for prediction?
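One possible way to start the EDA is sketched here. The `toy` tibble is a made-up stand-in so the snippet runs on its own, and the column names `survived` and `sex` are assumptions about what `clean_names()` produces for your file. Swap in `titanic` and your real column names.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical toy data standing in for `titanic`; the values are invented
# and the column names `survived` and `sex` are assumptions.
toy <- tibble(
  survived = factor(c(0, 0, 0, 1, 1, 0, 1, 0)),
  sex = c("male", "male", "male", "female", "female", "male", "male", "female")
)

# How many observations fall in each class, and what share is that?
class_counts <- toy %>%
  count(survived) %>%
  mutate(prop = n / sum(n))

# Survival rate within each sex
rate_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(n = n(), survival_rate = mean(survived == "1"))

# A stacked proportion bar chart shows the same relationship visually
sex_plot <- ggplot(toy, aes(x = sex, fill = survived)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

# Accuracy of two rule-of-thumb predictors, useful as baselines later:
# always predict "0", and predict "1" for females and "0" for males
acc_always_0 <- mean(toy$survived == "0")
acc_by_sex <- mean(if_else(toy$sex == "female", "1", "0") == toy$survived)
```

The same `count()` and `group_by()` pattern works for any other categorical column, such as passenger class.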

Modeling

  1. Create an initial_split object, and stratify over survived.

  2. Create a recipe to preprocess this data. Discuss in your groups what preprocessing steps should be done.

  3. Specify a K Nearest Neighbors model for classification. Add the model and recipe to a workflow.

  4. Fit the workflow on the training data and check out the accuracy, sensitivity, and specificity. Compare your results with your neighbors. Is there a lot of variation in the accuracy? (If you all set the same seed there won’t be, so you may need to remove the seed if you set one.)
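The four steps above could be sketched roughly as follows. This is one possible approach, not the intended answer: the simulated `toy` data, the 75/25 split, `neighbors = 5`, and the choice of `step_dummy()` plus `step_normalize()` are all assumptions made so the example runs on its own. It also assumes the kknn package is installed, since that is the engine used here. With the real data you would start from `titanic`, and you would likely need to drop or re-role any identifier-like columns (for example a name column, if your file has one) before using `survived ~ .`.

```r
library(tidymodels)

# Hypothetical simulated data standing in for `titanic`
set.seed(1)
n <- 300
toy <- tibble(
  age = runif(n, 1, 70),
  fare = rexp(n, rate = 1 / 30),
  sex = sample(c("female", "male"), n, replace = TRUE)
) %>%
  mutate(survived = factor(rbinom(n, 1, if_else(sex == "female", 0.75, 0.2))))

# 1. Stratified train/test split (prop = 0.75 is an arbitrary choice)
toy_split <- initial_split(toy, prop = 0.75, strata = survived)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 2. Recipe: KNN is distance based, so encode categorical predictors
#    as dummies and put numeric predictors on a common scale
knn_recipe <- recipe(survived ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 3. Model specification and workflow
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_wf <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

# 4. Fit on the training data, then score the held-out test data
knn_fit <- fit(knn_wf, data = toy_train)

class_metrics <- metric_set(accuracy, sens, spec)

test_results <- augment(knn_fit, new_data = toy_test) %>%
  class_metrics(truth = survived, estimate = .pred_class)
```

Rerunning the split with a different seed and comparing `test_results` is a quick way to see how much the estimates move.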

Resampling

Notice that the accuracy we get depends heavily on which observations land in the training set. I got accuracies from 0.79 to 0.88 depending on the split. We need a more stable estimate, and cross-validation gives us one!

  1. Use the function vfold_cv to create a cross validation set using 10 folds and 5 repeats. Stratify based on survived.

  2. Use the fit_resamples function to fit your model. Pass a metric set containing accuracy, sens and spec. This will take a long time!

  3. Use the collect_metrics function to view the results of your resampling.

  4. What are the sensitivity and specificity telling you? Which class is positive and which is negative?

  5. About how accurate would a predictor be that always predicted 0 for survived? About how accurate would a predictor be that predicted 1 for females and 0 for males? (Refer to EDA, and make a new graph up there if you didn’t make a graph that answers this question!)
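As a rough sketch of steps 1 to 3, the resampling calls might look like this. The `toy` data, the recipe steps, and `neighbors = 5` are illustrative assumptions so the block is self-contained, and the kknn package is assumed to be installed. In the activity you would pass your own workflow and build the folds from your training data.

```r
library(tidymodels)

# Hypothetical simulated data standing in for the Titanic training set
set.seed(2023)
n <- 300
toy <- tibble(
  age = runif(n, 1, 70),
  fare = rexp(n, rate = 1 / 30),
  sex = sample(c("female", "male"), n, replace = TRUE)
) %>%
  mutate(survived = factor(rbinom(n, 1, if_else(sex == "female", 0.75, 0.2))))

# Workflow: recipe plus KNN model (illustrative preprocessing choices)
knn_wf <- workflow() %>%
  add_recipe(
    recipe(survived ~ ., data = toy) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(
    nearest_neighbor(neighbors = 5) %>%
      set_engine("kknn") %>%
      set_mode("classification")
  )

# 10-fold cross-validation repeated 5 times, stratified on the outcome;
# this yields 50 resamples in total
folds <- vfold_cv(toy, v = 10, repeats = 5, strata = survived)

# Fit the workflow once per resample and compute the three metrics
cv_results <- fit_resamples(
  knn_wf,
  resamples = folds,
  metrics = metric_set(accuracy, sens, spec)
)

# One row per metric, with the mean and standard error across resamples
cv_summary <- collect_metrics(cv_results)

# yardstick treats the FIRST factor level as the event ("positive") class
# by default (event_level = "first"), so check the level order:
outcome_levels <- levels(toy$survived)
```

When reading `sens` and `spec`, look at `outcome_levels` first: whichever level is listed first is the one yardstick counts as positive unless you change `event_level`.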