Homework 06: Principal Component Analysis

Due Friday, Feb 24

2023-02-13

library(tidyverse)
library(janitor)
library(tidymodels)

Instructions

For this homework, create a new folder in your homework directory. Call it HW6 or something similar that you can keep track of. Download the homework markdown template file Student_HW_template.Rmd from the course webpage, and put a copy in this folder. Rename it something like HW6_YourName.Rmd. You will answer each of the questions below in this markdown document.

The Assignment

In this homework, you will use life expectancy data over multiple years from the gapminder dataset in the gapminder package.

In class, we used PCA to reduce the number of variables for a regression problem. Let’s do something similar for a classification problem. The task is to calculate the principal components of the predictors, and then visualize the data to see whether the new components reveal natural clusters by continent.
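As a reminder of what "calculating the principal components" produces, here is a minimal base-R sketch. It uses the built-in iris data purely for illustration (it is not the homework data), and the object names are arbitrary. Note the difference between the scores (one row per observation) and the loadings (one row per original variable).

```r
# Illustration only: built-in iris data, not the gapminder homework data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scores: one row per observation, one column per principal component
scores <- as.data.frame(pca$x)
scores$Species <- iris$Species

# Loadings: how each original variable contributes to each component
loadings <- pca$rotation

# Proportion of variance captured by each component
# (these are the values a scree plot displays)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

head(scores)
round(var_explained, 3)
```

The scores are what you would plot and color by a known grouping to look for clusters.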

  1. Load the gapminder data and familiarize yourself with it by looking at the raw data and the help file.

  2. You need to reshape this dataset so that each year is a separate column. Use the pivot_wider function to create a data frame with columns for country and continent plus one column per year, where each year column holds the life expectancy for that year. Drop the pop and gdpPercap columns first so that each country ends up on a single row. See R for Data Science Section 12.3.

  3. Create a recipe that normalizes the data and then performs a PCA (the order is important here!). Don’t forget that PCA can only be performed on numeric variables, so select only those in your PCA step. Set the number of components to 4. In the same code chunk, create a prepped recipe object.

  4. Following the code from our PCA demo in class, create a scree plot to show how much variation is captured by each PC. What do you notice?

  5. Use the bake function on your prepped recipe. You should get a tibble with country, continent, and four columns holding the scores on the four principal components. Use this tibble to create a scatterplot of the first two principal components; color the points by continent, and label them with the country names. Do you see reasonable clusters? Does the second component seem to be necessary for identifying these clusters (in other words, is there separability just by looking along the horizontal axis)?
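The steps above can be sketched on a small made-up dataset. This is only a hedged illustration of the workflow, not the code from the class demo: the tibble toy_long, its columns (unit, group, year, value), and all object names are hypothetical, and only two components are kept because the toy data has three year columns.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical long-format data: one row per unit per year
toy_long <- tibble(
  unit  = rep(c("a", "b", "c", "d", "e", "f"), each = 3),
  group = rep(c("g1", "g1", "g1", "g2", "g2", "g2"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 6),
  value = c(50, 52, 55, 48, 51, 53, 49, 50, 54,
            70, 72, 73, 71, 74, 75, 69, 70, 74)
)

# Widen so that each year becomes its own column (y2000, y2005, y2010)
toy_wide <- toy_long %>%
  pivot_wider(names_from = year, values_from = value, names_prefix = "y")

# Normalize first, then PCA; identifier columns are kept out of both steps
toy_rec <- recipe(~ ., data = toy_wide) %>%
  update_role(unit, group, new_role = "id") %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2) %>%
  prep()

# Percent of variance per component (input for a scree plot);
# number = 2 refers to the PCA step, the second step in the recipe
toy_var <- tidy(toy_rec, number = 2, type = "variance") %>%
  filter(terms == "percent variance")

# Scores for the training data: id columns plus PC1 and PC2
toy_scores <- bake(toy_rec, new_data = NULL)

# Scatterplot of the first two components, colored and labeled
p <- ggplot(toy_scores, aes(x = PC1, y = PC2, color = group, label = unit)) +
  geom_point() +
  geom_text(vjust = -0.8)
```

Printing p draws the plot. For the homework itself you would substitute the gapminder columns and keep four components as instructed.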

Submitting HW

When you’ve successfully answered all the questions, knit your document to a PDF file. Look through it to make sure everything worked the way you expect it to. You will submit both your .Rmd and .pdf files to Schoology.