library(tidyverse)
library(janitor)
library(tidymodels)
Instructions
For this homework, you should create a new folder in your homework directory. Call it HW7 or something similar that you can keep track of. Download the homework markdown template file Student_HW_template.Rmd from the course webpage, and put a copy in this folder. Rename it something like HW7_YourName.Rmd. This markdown document will be where you will answer each of the questions below.
The Assignment
In this homework, you will use life expectancy data over multiple years from the gapminder dataset in the gapminder package.
In class, we used PCA to reduce the number of variables for a regression problem. Let’s do something similar for a classification problem. The task is to calculate the principal components of the predictors, and then visualize the data to see if the new components will show natural clusters by continent.
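As a quick reminder of the idea, here is a minimal sketch using base R's prcomp on the built-in iris data. Neither iris nor prcomp is part of this assignment (the homework uses gapminder and a tidymodels recipe); the names pca, var_explained, and scores are made up for illustration. It shows how the proportion of variance per component is computed and how the scores can be compared across known groups.

```r
# Illustration only: iris ships with R and stands in for a small
# classification dataset with four numeric predictors.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Scores: the coordinates of each observation on the components.
scores <- as.data.frame(pca$x)
scores$Species <- iris$Species

# Average PC1 score by species; well-separated means along PC1
# suggest that the first component alone already splits the groups.
aggregate(PC1 ~ Species, data = scores, FUN = mean)
```

If the group means differ a lot along PC1 but not along PC2, that is the kind of evidence the last question in this assignment asks you to look for.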
1. Load the gapminder data and familiarize yourself with it by looking at the raw data and the help file.
2. You need to tidy this dataset so that each year is a separate column. Use the pivot_wider function to create a dataset with columns for the country and continent, plus one column per year whose values are the life expectancies for that year. See R for Data Science Section 12.3.
3. Create a recipe to normalize the data and then perform a PCA (the order is important here!). Don’t forget that PCA can only be performed on numeric variables, so tell the recipe that in your PCA step. Set the number of components to 4. In the same code chunk, create a prepped recipe object.
4. Following the code from our PCA demo in class, create a scree plot to show how much variation is captured by each PC. What do you notice?
5. Use the bake function on your prepped recipe and you should get a tibble with country, continent, and four columns with the scores for the four principal components (each country’s coordinates on the new components). Use this tibble to create a scatterplot of the first two principal components; color the points by continent, and label them with the country names. Do you see reasonable clusters? Does the second component seem to be necessary for the identification of these clusters (in other words, is there separability just by looking along the horizontal axis)?
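The mechanics of the steps above can be sketched on a small made-up dataset. Everything named toy_* below, along with the columns unit, group, year, and value, is hypothetical and is not the gapminder data; the scree plot here uses tidy() with type = "variance", which may differ from the approach in the class demo. Treat it as a sketch of the workflow under those assumptions, not as the solution.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical long-format data: six units in two groups,
# each measured in three years.
toy_long <- tibble(
  unit  = rep(c("a", "b", "c", "d", "e", "f"), each = 3),
  group = rep(c("g1", "g1", "g1", "g2", "g2", "g2"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 6),
  value = c(50, 53, 55,  48, 52, 56,  51, 54, 58,
            70, 71, 73,  68, 72, 74,  69, 73, 75)
)

# Long -> wide: one row per unit, one column per year.
# names_prefix is optional; it just gives syntactic column names.
toy_wide <- toy_long %>%
  pivot_wider(names_from = year, values_from = value, names_prefix = "y")

# Normalize first, then PCA, both restricted to numeric predictors.
# unit and group are kept as identifiers, not predictors.
toy_rec <- recipe(~ ., data = toy_wide) %>%
  update_role(unit, group, new_role = "id") %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

toy_prep <- prep(toy_rec)

# Variance summary of the PCA step (step number 2) for a scree plot.
toy_var <- tidy(toy_prep, number = 2, type = "variance") %>%
  filter(terms == "percent variance")

scree_plot <- ggplot(toy_var, aes(x = component, y = value)) +
  geom_col() +
  labs(x = "Principal component", y = "Percent of variance")

# Scores for the training data: identifiers plus PC1 and PC2.
toy_scores <- bake(toy_prep, new_data = NULL)

score_plot <- ggplot(toy_scores,
                     aes(x = PC1, y = PC2, color = group, label = unit)) +
  geom_point() +
  geom_text(vjust = -0.8, show.legend = FALSE)
```

Printing scree_plot and score_plot draws the two figures. Note that pivot_wider only collapses to one row per unit when no other column varies within a unit, so with real data any extra year-varying columns would presumably need to be dropped before widening.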
Submitting HW
When you’ve successfully answered all the questions, knit your document to a PDF file. Look through it to make sure everything worked the way you expect it to. You will submit both your .Rmd and .pdf files to Schoology.