Creating a new Project in RStudio

Projects are a great way to save your data, environment, models, and progress without interrupting other things you are doing. So far in this class, we’ve been able to get by without them, but our class work is starting to get unwieldy.

Create a new folder on your computer and call it Missing_Data_Activity or something similar. Save any files that are currently open, then find the project menu in the upper right of the RStudio window and click “New Project”.

Next, we’ll make a new project from an existing folder, and locate it in the folder you just made. Note that you can also create the new folder directly as part of this process. (When you open your new project, you will need to reopen this markdown file within the project environment.)

And that’s it! We’ve created a new project. While the project is open, RStudio uses the project folder as the working directory, so the files we create are saved there. We can close the project, work on other code, and when we come back to it, everything is the same as when we left!

Missing Data

In this activity, we are going to use the Palmer penguins data from the palmerpenguins package.

library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## The following object is masked from 'package:modeldata':
## 
##     penguins
?penguins
## Help on topic 'penguins' was found in the following packages:
## 
##   Package               Library
##   modeldata             /Library/Frameworks/R.framework/Versions/4.0/Resources/library
##   palmerpenguins        /Library/Frameworks/R.framework/Versions/4.0/Resources/library
## 
## 
## Using the first match ...

We’ve done a little bit of exploring with this data. Let’s check out what is going on with missing values:

library(skimr)
skim(penguins)
Data summary
Name                     penguins
Number of rows           344
Number of columns        8
_______________________
Column type frequency:
  factor                 3
  numeric                5
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
species                0           1.00  FALSE           3  Ade: 152, Gen: 124, Chi: 68
island                 0           1.00  FALSE           3  Bis: 168, Dre: 124, Tor: 52
sex                   11           0.97  FALSE           2  mal: 168, fem: 165

Variable type: numeric

skim_variable      n_missing  complete_rate     mean      sd      p0      p25      p50     p75    p100  hist
bill_length_mm             2           0.99    43.92    5.46    32.1    39.23    44.45    48.5    59.6  ▃▇▇▆▁
bill_depth_mm              2           0.99    17.15    1.97    13.1    15.60    17.30    18.7    21.5  ▅▅▇▇▂
flipper_length_mm          2           0.99   200.92   14.06   172.0   190.00   197.00   213.0   231.0  ▂▇▃▅▂
body_mass_g                2           0.99  4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0  ▃▇▆▃▂
year                       0           1.00  2008.03    0.82  2007.0  2007.00  2008.00  2009.0  2009.0  ▇▁▇▁▇

Let’s take a look at the rows with NA values:

penguins %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
##    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
##  1 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
##  2 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
##  3 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
##  4 Adelie  Torgersen           37.8          17.1        186    3300 <NA>   2007
##  5 Adelie  Torgersen           37.8          17.3        180    3700 <NA>   2007
##  6 Adelie  Dream               37.5          18.9        179    2975 <NA>   2007
##  7 Gentoo  Biscoe              44.5          14.3        216    4100 <NA>   2007
##  8 Gentoo  Biscoe              46.2          14.4        214    4650 <NA>   2008
##  9 Gentoo  Biscoe              47.3          13.8        216    4725 <NA>   2009
## 10 Gentoo  Biscoe              44.5          15.7        217    4875 <NA>   2009
## 11 Gentoo  Biscoe              NA            NA           NA      NA <NA>   2009
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

We have two rows that contain almost no information (every measurement is missing), and nine rows that are missing only the value of sex.
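As a quick cross-check of those counts, here is a minimal sketch in base R. It assumes the palmerpenguins package is installed, and it refers to the data as palmerpenguins::penguins so that another attached package cannot mask it. The names na_per_column and rows_with_na are made up for this example.

```r
# Refer to the data by namespace to sidestep the masking message seen earlier
penguins <- palmerpenguins::penguins

# Number of missing values in each column
na_per_column <- colSums(is.na(penguins))
print(na_per_column)

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(penguins))
print(rows_with_na)
```

The per-column counts should agree with the n_missing column from skim, and the row count with the filtered tibble.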

An easy option would be to simply drop all the rows with NA using step_naomit. But you should be careful with this, as the “missing-ness” might contain important information about a data point. Plus, more data is usually better, and throwing away potentially useful points isn’t great!
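To make the cost of dropping rows concrete, the sketch below applies step_naomit to every predictor and counts how many rows disappear. It assumes tidymodels and palmerpenguins are installed; naomit_recipe, penguins_complete, and rows_dropped are names invented for this illustration.

```r
library(tidymodels)

# Use the namespaced data so a masked copy from another package is not picked up
penguins <- palmerpenguins::penguins

# Drop any row that has a missing value in a predictor
naomit_recipe <- recipe(species ~ ., data = penguins) %>%
  step_naomit(all_predictors())

# new_data = NULL returns the processed training data
penguins_complete <- bake(prep(naomit_recipe), new_data = NULL)

# How many rows did we throw away?
rows_dropped <- nrow(penguins) - nrow(penguins_complete)
rows_dropped
```

step_naomit also has a skip argument that controls whether the step is applied when baking new data, so it is worth checking its default in your installed version of recipes before relying on it at prediction time.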

Imputing values

One way to deal with this issue is to impute missing values, meaning we fill them in with what we think should be there. For example, we are missing the bill_length_mm for an Adelie penguin, so we might fill in the average bill length for all penguins, or the average bill length for just Adelie penguins.
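The two options in that example can be sketched in base R on a tiny made-up data frame. The values below are invented for illustration and are not taken from the penguins data; toy, filled_overall, and filled_by_species are hypothetical names.

```r
# Made-up data: one Adelie bill length is missing
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(38, 40, NA, 46, 48)
)

missing <- is.na(toy$bill_length_mm)

# Option 1: fill with the mean over all penguins
overall_mean <- mean(toy$bill_length_mm, na.rm = TRUE)
filled_overall <- ifelse(missing, overall_mean, toy$bill_length_mm)

# Option 2: fill with the mean of the same species
species_mean <- ave(
  toy$bill_length_mm, toy$species,
  FUN = function(x) mean(x, na.rm = TRUE)
)
filled_by_species <- ifelse(missing, species_mean, toy$bill_length_mm)

filled_overall[3]     # overall mean of 38, 40, 46, 48
filled_by_species[3]  # mean of the two observed Adelie values
```

In this toy case the overall mean is pulled upward by the Gentoo rows, while the species mean stays close to the other Adelie values, which is why conditioning on related variables often gives a more plausible fill.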

This is just prediction, which is what we’ve been doing the whole semester! Except now we are predicting the values of the predictors.

Let’s see if we can impute the sex variable using KNN. Notice that I am normalizing in advance, since KNN relies on distances between observations, and those distances are more meaningful when the numeric predictors are on a common scale.

# Normalize the numeric predictors first, then impute sex from its nearest neighbors
penguins_recipe <- recipe(species ~ ., data = penguins) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_impute_knn(sex)

We could also impute using a linear model (step_impute_linear, for numeric variables) or a bagged tree (step_impute_bag). In fact, all we’re doing is predicting the predictors!
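As one possible variation, the sketch below swaps the KNN step for a bagged-tree imputation of sex. It assumes tidymodels and palmerpenguins are installed; bag_recipe and penguins_bag are hypothetical names, and the seed is arbitrary (bagging involves random resampling, so results can differ between runs).

```r
library(tidymodels)

# Namespaced data, to avoid a masked copy from another package
penguins <- palmerpenguins::penguins

set.seed(2023)  # arbitrary seed so the bootstrap samples are reproducible

# Impute sex with bagged trees built from the other predictors
bag_recipe <- recipe(species ~ ., data = penguins) %>%
  step_impute_bag(sex)

penguins_bag <- bake(prep(bag_recipe), new_data = NULL)

# Compare the number of missing sex values before and after
sum(is.na(penguins$sex))
sum(is.na(penguins_bag$sex))
```

Tree-based imputation does not depend on distances, so the normalization step is left out here; whether it outperforms KNN for this variable is something to check rather than assume.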

If we bake the recipe, we can see whether there are any missing values left:

penguins_baked <- bake(prep(penguins_recipe), new_data=NULL) 
penguins_baked %>%
  filter(if_any(everything(), is.na))
## # A tibble: 2 × 8
##   island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year species
##   <fct>              <dbl>         <dbl>       <dbl>   <dbl> <fct> <dbl> <fct>  
## 1 Torgersen             NA            NA          NA      NA fema… -1.26 Adelie 
## 2 Biscoe                NA            NA          NA      NA fema…  1.19 Gentoo 
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

We can even see the values that were imputed:

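# Baking keeps the original row order, so the NA positions in the raw data
# can be used to pick out the imputed rows
penguins_baked %>% filter(is.na(penguins$sex))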
## # A tibble: 11 × 8
##    island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex      year species
##    <fct>              <dbl>         <dbl>    <dbl>   <dbl> <fct>   <dbl> <fct>  
##  1 Torgersen         NA           NA        NA     NA      fema… -1.26   Adelie 
##  2 Torgersen         -1.80         0.480    -0.563 -0.906  fema… -1.26   Adelie 
##  3 Torgersen         -0.352        1.54     -0.776  0.0602 male  -1.26   Adelie 
##  4 Torgersen         -1.12        -0.0259   -1.06  -1.12   fema… -1.26   Adelie 
##  5 Torgersen         -1.12         0.0754   -1.49  -0.626  fema… -1.26   Adelie 
##  6 Dream             -1.18         0.886    -1.56  -1.53   fema… -1.26   Adelie 
##  7 Biscoe             0.106       -1.44      1.07  -0.127  fema… -1.26   Gentoo 
##  8 Biscoe             0.417       -1.39      0.931  0.559  fema… -0.0355 Gentoo 
##  9 Biscoe             0.619       -1.70      1.07   0.652  fema…  1.19   Gentoo 
## 10 Biscoe             0.106       -0.735     1.14   0.840  fema…  1.19   Gentoo 
## 11 Biscoe            NA           NA        NA     NA      fema…  1.19   Gentoo 
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

So what do we do with the remaining missing values? In this case, all we have to go on is the island, species, and year. This isn’t really enough to impute anything meaningful, so we would simply be filling in the mean, median, or another sample statistic of each variable. Here is an example with step_impute_median:

new_recipe <- penguins_recipe %>% step_impute_median(all_numeric_predictors())

# No more missing values:
bake(prep(new_recipe), new_data = NULL) %>%
  filter(if_any(everything(), is.na))
## # A tibble: 0 × 8
## # … with 8 variables: island <fct>, bill_length_mm <dbl>, bill_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <fct>, year <dbl>,
## #   species <fct>

Other options

There are a number of other things you could try! In our case, there isn’t much to go on to try to impute the values. If we were dealing with a categorical variable, we might use step_unknown to assign missing values to a new category.
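For illustration only, here is a sketch of step_unknown applied to sex, the one categorical variable with missing values in this data. It assumes tidymodels and palmerpenguins are installed; unknown_recipe and penguins_unknown are hypothetical names, and "unknown" is simply the label chosen here for the new level.

```r
library(tidymodels)

# Namespaced data, to avoid a masked copy from another package
penguins <- palmerpenguins::penguins

# Turn missing values of sex into their own factor level
unknown_recipe <- recipe(species ~ ., data = penguins) %>%
  step_unknown(sex, new_level = "unknown")

penguins_unknown <- bake(prep(unknown_recipe), new_data = NULL)

# The missing rows now appear as a third category
table(penguins_unknown$sex)
```

This keeps every row and lets a model treat the missingness itself as information, at the cost of a category that has no biological meaning.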

Saving model objects

You may have noticed that many of our recent models take a long time to tune, cross-validate, and/or train. One thing we can do to help our timing is to save the model object so that we can come back to it at a later time. Or once you have your finalized, fitted model for your project, you can just save it to import it into your final report—then you won’t have to re-fit/tune/cv each time you want to knit!

To do this, we will use the readr package. Here we are going to create a model, fit it, then save it to our system.

library(baguette)
penguins_wf <- workflow() %>%
  add_recipe(new_recipe) %>%
  add_model(
    bag_tree() %>%
      set_mode("classification") %>%
      set_engine("rpart")
  )

penguins_fit <- penguins_wf %>% fit(data=penguins)

library(readr)
write_rds(penguins_fit, "final_model.rds")

We can now remove everything from the environment, and still load the model:

model <- read_rds("final_model.rds")
model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: bag_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_normalize()
## • step_impute_knn()
## • step_impute_median()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Bagged CART (classification with 11 members)
## 
## Variable importance scores include:
## 
## # A tibble: 7 × 4
##   term               value std.error  used
##   <chr>              <dbl>     <dbl> <int>
## 1 bill_length_mm    134.       1.96     11
## 2 flipper_length_mm 133.       2.85     11
## 3 bill_depth_mm     120.       1.58     11
## 4 body_mass_g       105.       2.87     11
## 5 island             88.7      4.49     11
## 6 sex                 2.59     1.00      6
## 7 year                1.46     0.172     8
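One way to avoid refitting on every knit is a small "load it if it exists, otherwise fit and save it" helper. The sketch below uses base R's saveRDS and readRDS, which readr's write_rds and read_rds wrap. The helper fit_or_load, the cache location, and the lm stand-in for a slow model are all assumptions made for this example.

```r
# Hypothetical helper: reuse a saved model when available, otherwise fit and save
fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fun()
  saveRDS(model, path)
  model
}

# Hypothetical cache location; in a project this would be a file in the project folder
cache_path <- file.path(tempdir(), "cached_model.rds")
if (file.exists(cache_path)) file.remove(cache_path)  # start clean for the demo

# Stand-in for a slow fit (a real use would fit the workflow instead)
slow_fit <- function() lm(mpg ~ wt, data = mtcars)

first  <- fit_or_load(cache_path, slow_fit)  # fits the model and writes the file
second <- fit_or_load(cache_path, slow_fit)  # reads the file, no refit
```

If the data or the model specification changes, delete the saved file so that the next run fits a fresh model rather than loading a stale one.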