Projects are a great way to save your data, environment, models, and progress without interrupting other things you are doing. So far in this class, we’ve been able to get by without them, but our class work is starting to get unwieldy.
Create a new folder in your file directory and call it
`Missing_Data_Activity` or something similar. Make sure you save any
files that are currently open. Next, we’ll look for the project tab in
the upper right of the RStudio window, and click “New Project”.
Next, we’ll make a new project from an existing folder, and locate it in the folder you just made. Note that you can also create the new folder directly as part of this process. (When you open your new project, you will need to reopen this markdown file within the project environment.)
And that’s it! We’ve created a new project. Now any code we run while in the project environment gets saved there. We can close the project, work on other code, and when we come back to it, everything is the same as when we left!
In this activity, we are going to use the Palmer Penguins data.
library(tidymodels) # attaches dplyr, recipes, workflows, and the other packages used below
library(palmerpenguins)
##
## Attaching package: 'palmerpenguins'
## The following object is masked from 'package:modeldata':
##
## penguins
?penguins
## Help on topic 'penguins' was found in the following packages:
##
## Package Library
## modeldata /Library/Frameworks/R.framework/Versions/4.0/Resources/library
## palmerpenguins /Library/Frameworks/R.framework/Versions/4.0/Resources/library
##
##
## Using the first match ...
We’ve done a little bit of exploring with this data. Let’s check out what is going on with missing values:
library(skimr)
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Let’s take a look at the rows with `NA` values:
penguins %>%
filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen NA NA NA NA <NA> 2007
## 2 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 3 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## 4 Adelie Torgersen 37.8 17.1 186 3300 <NA> 2007
## 5 Adelie Torgersen 37.8 17.3 180 3700 <NA> 2007
## 6 Adelie Dream 37.5 18.9 179 2975 <NA> 2007
## 7 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 2007
## 8 Gentoo Biscoe 46.2 14.4 214 4650 <NA> 2008
## 9 Gentoo Biscoe 47.3 13.8 216 4725 <NA> 2009
## 10 Gentoo Biscoe 44.5 15.7 217 4875 <NA> 2009
## 11 Gentoo Biscoe NA NA NA NA <NA> 2009
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
We have two rows that contain almost no information, and nine rows that
are just missing the value of `sex`.
An easy option would be to simply drop all the rows with `NA` using
`step_naomit`. But you should be careful with this, as the
“missing-ness” might contain important information about a data point.
Plus, more data is usually better, and throwing away potentially useful
points isn’t great!
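Before dropping anything, it can help to count how many rows would be lost. Here is a minimal base R sketch of that check. The `toy` data frame and its values are made up for illustration and are not the penguins data:

```r
# Toy data frame (hypothetical values) with some missing entries
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7, NA),
  sex = factor(c("male", NA, "female", NA, "female"))
)

# TRUE for rows that have no missing value in any column
is_complete <- complete.cases(toy)

n_total <- nrow(toy)
n_dropped <- sum(!is_complete)

# Keep only the complete rows, which is what dropping NA rows amounts to
toy_complete <- toy[is_complete, ]

cat("Dropping rows with NA removes", n_dropped, "of", n_total, "rows\n")
```

If a large share of the rows would disappear, that is a good sign that imputation is worth trying instead.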
One way to deal with this issue is to impute missing values,
meaning we fill them in with what we think should be there. For example,
we are missing the `bill_length_mm` for an Adelie penguin, so
we might fill in the average bill length for all penguins, or the
average bill length for just Adelie penguins.
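The two options above can give quite different answers. The following base R sketch compares them on a small made-up data frame (the `toy` values are hypothetical, chosen so the arithmetic is easy to follow):

```r
# Toy data (hypothetical values): one Adelie bill length is missing
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(38, 40, NA, 46, 48)
)

missing <- is.na(toy$bill_length_mm)

# Option 1: the mean of all observed values, ignoring species
overall_mean <- mean(toy$bill_length_mm, na.rm = TRUE)
imputed_overall <- ifelse(missing, overall_mean, toy$bill_length_mm)

# Option 2: the mean within each species
group_mean <- ave(
  toy$bill_length_mm, toy$species,
  FUN = function(x) mean(x, na.rm = TRUE)
)
imputed_by_species <- ifelse(missing, group_mean, toy$bill_length_mm)

imputed_overall[missing]     # 43
imputed_by_species[missing]  # 39
```

The overall mean is pulled upward by the Gentoo rows, while the species mean stays close to the other Adelie penguins.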
This is just prediction, which is what we’ve been doing the whole semester! Except now we are predicting the values of the predictors.
Let’s see if we can impute the `sex` variable using KNN.
Notice that I am normalizing in advance, since KNN is distance-based and
distances are easier to compare when the numeric predictors are on a
common scale.
penguins_recipe <- recipe(species ~ ., data=penguins) %>%
step_normalize(all_numeric_predictors()) %>%
step_impute_knn(sex)
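To make the idea behind KNN imputation concrete, here is a simplified base R sketch. It is an illustration only: it uses Euclidean distance and a majority vote, which is not necessarily what `step_impute_knn` does internally, and the `toy` data and the `impute_knn_vote` helper are hypothetical.

```r
# Toy data (hypothetical values): the last row has an unknown sex
toy <- data.frame(
  body_mass_g = c(3300, 3400, 3500, 4300, 4400, 4500, 3450),
  bill_depth_mm = c(17.0, 17.2, 17.4, 19.0, 19.2, 19.4, 17.1),
  sex = factor(c("female", "female", "female", "male", "male", "male", NA))
)

# Put the numeric columns on a common scale so neither dominates the distance
num_cols <- c("body_mass_g", "bill_depth_mm")
scaled <- scale(toy[, num_cols])

# Hypothetical helper: fill each missing label by majority vote
# among the k nearest rows whose label was originally observed
impute_knn_vote <- function(scaled, labels, k = 3) {
  known <- which(!is.na(labels))
  for (i in which(is.na(labels))) {
    # Euclidean distance from row i to every row with a known label
    diffs <- sweep(scaled[known, , drop = FALSE], 2, scaled[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- known[order(dists)[seq_len(k)]]
    votes <- table(labels[nearest])
    labels[i] <- names(votes)[which.max(votes)]
  }
  labels
}

toy$sex_imputed <- impute_knn_vote(scaled, toy$sex, k = 3)
toy
```

The row with the missing label sits among the lighter, shallower-billed penguins in this toy data, so its three nearest neighbors all vote the same way.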
We could also impute using a linear model or a bagged tree. In fact, all we’re doing is predicting the predictors!
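As a rough illustration of imputing with a linear model, the base R sketch below fits `lm` on the rows where the value is observed and predicts the rest. The `toy` data are made up, and this is a hand-rolled sketch of the idea rather than the recipes implementation:

```r
# Toy data (hypothetical values): two body masses are missing
toy <- data.frame(
  flipper_length_mm = c(180, 190, 200, 210, 220, 195, 215),
  body_mass_g = c(3400, 3800, 4200, 4600, 5000, NA, NA)
)

missing <- is.na(toy$body_mass_g)

# Fit the model only on rows where the target predictor is observed
fit <- lm(body_mass_g ~ flipper_length_mm, data = toy[!missing, ])

# Fill the gaps with the model's predictions
toy$body_mass_g[missing] <- predict(fit, newdata = toy[missing, ])

toy
```

The structure is the same as any other supervised model in this class: train on the rows we know, predict the rows we don’t.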
If we bake the recipe, we can see whether there are any missing values left:
penguins_baked <- bake(prep(penguins_recipe), new_data=NULL)
penguins_baked %>%
filter(if_any(everything(), is.na))
## # A tibble: 2 × 8
## island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year species
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 Torgersen NA NA NA NA fema… -1.26 Adelie
## 2 Biscoe NA NA NA NA fema… 1.19 Gentoo
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
We can even see the values that were imputed:
penguins_baked %>% filter(is.na(penguins$sex))
## # A tibble: 11 × 8
## island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year species
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 Torgersen NA NA NA NA fema… -1.26 Adelie
## 2 Torgersen -1.80 0.480 -0.563 -0.906 fema… -1.26 Adelie
## 3 Torgersen -0.352 1.54 -0.776 0.0602 male -1.26 Adelie
## 4 Torgersen -1.12 -0.0259 -1.06 -1.12 fema… -1.26 Adelie
## 5 Torgersen -1.12 0.0754 -1.49 -0.626 fema… -1.26 Adelie
## 6 Dream -1.18 0.886 -1.56 -1.53 fema… -1.26 Adelie
## 7 Biscoe 0.106 -1.44 1.07 -0.127 fema… -1.26 Gentoo
## 8 Biscoe 0.417 -1.39 0.931 0.559 fema… -0.0355 Gentoo
## 9 Biscoe 0.619 -1.70 1.07 0.652 fema… 1.19 Gentoo
## 10 Biscoe 0.106 -0.735 1.14 0.840 fema… 1.19 Gentoo
## 11 Biscoe NA NA NA NA fema… 1.19 Gentoo
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
So what do we do with the remaining missing values? In this case, all
we have to go on is the species, island, and year. This isn’t really
enough to impute anything meaningful; we would simply be filling in the
mean, median, or another sample statistic of each variable. Here is an
example with `step_impute_median`:
new_recipe <- penguins_recipe %>% step_impute_median(all_numeric_predictors())
# No more missing values:
bake(
prep(new_recipe),
new_data=NULL
) %>%
filter(if_any(everything(), is.na))
## # A tibble: 0 × 8
## # … with 8 variables: island <fct>, bill_length_mm <dbl>, bill_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <fct>, year <dbl>,
## # species <fct>
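One detail worth seeing by hand: the median is estimated from the training data when the recipe is prepped, and that same stored value is used to fill gaps in any data baked later. A minimal base R sketch of this idea, with made-up `train` and `new_data` values and a hypothetical `fill_median` helper:

```r
# Hypothetical training and new data
train <- data.frame(body_mass_g = c(3500, 3700, NA, 4100, 4900))
new_data <- data.frame(body_mass_g = c(NA, 5200))

# "Prep": learn the median from the training data only
train_median <- median(train$body_mass_g, na.rm = TRUE)

# "Bake": fill missing values with the stored training median
fill_median <- function(x, value) {
  x[is.na(x)] <- value
  x
}

train$body_mass_g <- fill_median(train$body_mass_g, train_median)
new_data$body_mass_g <- fill_median(new_data$body_mass_g, train_median)

train_median  # 3900
```

Using the training median for new data keeps information from the new data from leaking into the preprocessing.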
There are a number of other things you could try! In our case, there
isn’t much to go on to try to impute the values. If we were dealing with
a categorical variable, we might use `step_unknown` to assign
missing values to a new category.
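The idea of a new category for missing values can be sketched in base R as follows. The `sex` vector here is made up, and the level name `"unknown"` is just an illustrative choice:

```r
# Hypothetical factor with missing entries
sex <- factor(c("male", NA, "female", NA, "female"))

# Add an explicit "unknown" level, then move the NAs into it
sex_filled <- sex
levels(sex_filled) <- c(levels(sex_filled), "unknown")
sex_filled[is.na(sex_filled)] <- "unknown"

table(sex_filled)
```

This keeps every row and lets a model treat “missing” as its own group, which is useful when the missing-ness itself might be informative.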
You may have noticed that many of our recent models take a long time to tune, cross-validate, and/or train. One way to save time is to save the model object so that we can come back to it later. Once you have your finalized, fitted model for your project, you can save it and import it into your final report. Then you won’t have to re-fit, re-tune, or re-cross-validate each time you want to knit!
To do this, we will use the `readr` package. Here we are
going to create a model, fit it, then save it to our system.
library(baguette)
penguins_wf <- workflow() %>%
add_recipe(new_recipe) %>%
add_model(
bag_tree() %>%
set_mode("classification") %>%
set_engine("rpart")
)
penguins_fit <- penguins_wf %>% fit(data=penguins)
library(readr)
write_rds(penguins_fit, "final_model.rds")
We can now remove everything from the environment, and still load the model:
model <- read_rds("final_model.rds")
model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: bag_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_normalize()
## • step_impute_knn()
## • step_impute_median()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Bagged CART (classification with 11 members)
##
## Variable importance scores include:
##
## # A tibble: 7 × 4
## term value std.error used
## <chr> <dbl> <dbl> <int>
## 1 bill_length_mm 134. 1.96 11
## 2 flipper_length_mm 133. 2.85 11
## 3 bill_depth_mm 120. 1.58 11
## 4 body_mass_g 105. 2.87 11
## 5 island 88.7 4.49 11
## 6 sex 2.59 1.00 6
## 7 year 1.46 0.172 8
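A common pattern for reports is to fit only when no saved copy exists. The sketch below uses base R’s `saveRDS` and `readRDS`, which `write_rds` and `read_rds` wrap. The `fit_or_load` helper, the file name, and the stand-in `lm` model on the built-in `mtcars` data are all hypothetical choices for illustration:

```r
# Hypothetical helper: load a saved model if it exists, otherwise fit and save it
fit_or_load <- function(path, fit_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fn()
  saveRDS(model, path)
  model
}

# A quick-to-fit stand-in model on a built-in data set
model_path <- file.path(tempdir(), "example_model.rds")
unlink(model_path)  # start clean so the first call really fits
fit_lm <- function() lm(mpg ~ wt, data = mtcars)

first <- fit_or_load(model_path, fit_lm)   # fits and saves
second <- fit_or_load(model_path, fit_lm)  # loads from disk

# The reloaded model gives the same predictions as the original
all.equal(predict(first, mtcars), predict(second, mtcars))
```

One caveat with this pattern: if you change the recipe or the data, delete the saved file so the model is re-fit rather than loaded stale.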