Homework 04

Instructions

For this homework, create a new folder in your homework directory. Call it HW4 or something similar that you can keep track of. Download the homework markdown template file Student_HW_template.Rmd from the course webpage, and put a copy in this folder. Rename it something like HW4_YourName.Rmd. You will answer each of the questions below in this markdown document.

The Assignment

Background

For this assignment, we will use a data set containing information about the daily percentage returns of the S&P 500 stock market index for dates in the years 2001-2005. The data are contained in the ISLR2 package associated with our textbook. Once you’ve installed and loaded the library, load the data into your environment using data("Smarket"). Take a look at the data and make sure you know what each variable means (using ?Smarket can help).
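As a minimal sketch of the loading step, assuming the ISLR2 package is already installed, the following chunk loads the data and prints a quick overview of its structure:

```r
# One-time install, if needed (left commented out on purpose):
# install.packages("ISLR2")
library(ISLR2)

# Load the data set into the environment
data("Smarket")

# Quick look at the variable names, types, and first few rows
str(Smarket)
head(Smarket)

# ?Smarket opens the help page that describes each variable
```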

Exploratory Data Analysis

  1. We want to predict the value of the Direction variable, a factor with levels Down and Up indicating whether the market had a negative or positive return on a given day. What percentage of observations are in each level?

  2. Does there seem to be any relationship between any of the Lag variables and Direction? Does the Volume seem to have any effect?
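One possible starting point for the two questions above is sketched here, assuming the dplyr package is installed. The object names direction_pct and lag_means are only illustrative, and comparing group means is just one of several reasonable ways to look for a relationship (boxplots or density plots by Direction would work too).

```r
library(ISLR2)
library(dplyr)

data("Smarket")

# Percentage of observations in each level of Direction
direction_pct <- Smarket %>%
  count(Direction) %>%
  mutate(percent = 100 * n / sum(n))
direction_pct

# Mean of each Lag variable and of Volume within each Direction level;
# large differences between the two rows would hint at a relationship
lag_means <- Smarket %>%
  group_by(Direction) %>%
  summarise(across(c(starts_with("Lag"), Volume), mean))
lag_means
```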

Model Definition

  1. Write a code chunk to specify your k-nearest neighbors model, using \(k=5\) neighbors and the epanechnikov weight function (this weight function looks like a \(Beta(2,2)\) density on the distances).
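A model specification along these lines might look like the sketch below. It assumes the tidymodels and kknn packages are installed and that you use the kknn engine; the name knn_spec is only illustrative.

```r
library(tidymodels)

# k-nearest neighbors classifier with 5 neighbors and the
# Epanechnikov kernel as the distance weight function
knn_spec <- nearest_neighbor(neighbors = 5, weight_func = "epanechnikov") %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_spec
```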

Data Splitting and Recipe Creation

  1. Note that this data has a time component. Check out TMWR Section 5.1 for some ideas on how to split data of this type, then split your data into training and test sets.

  2. Build a recipe to preprocess the data for k-nearest neighbors.

    1. Remove Today and Year, or assign them a new role.

    2. You need to normalize all the numeric variables so that they are on the same scale.

    3. Bake the recipe just so you can see whether the preprocessed data looks like what you expect. The glimpse command is nice here since there are so many variables.
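The splitting and recipe steps above could be sketched as follows, assuming tidymodels and ISLR2 are installed. The 80/20 proportion and all object names are assumptions made for illustration; initial_time_split() keeps the rows in their existing order, so the earliest observations go to training and the most recent ones to testing.

```r
library(ISLR2)
library(tidymodels)

data("Smarket")

# Time-ordered split: first 80% of rows for training, last 20% for testing
# (prop = 0.8 is an arbitrary choice for this sketch)
smarket_split <- initial_time_split(Smarket, prop = 0.8)
smarket_train <- training(smarket_split)
smarket_test  <- testing(smarket_split)

# Recipe: drop Today and Year, then center and scale the numeric predictors.
# An alternative to step_rm() is
#   update_role(Today, Year, new_role = "id")
# which keeps the columns but stops them from being used as predictors.
knn_recipe <- recipe(Direction ~ ., data = smarket_train) %>%
  step_rm(Today, Year) %>%
  step_normalize(all_numeric_predictors())

# Prep and bake on the training data to inspect the preprocessed result
baked_train <- knn_recipe %>%
  prep() %>%
  bake(new_data = NULL)

glimpse(baked_train)
```

After normalization, each numeric column in the baked training data should have mean close to 0 and standard deviation close to 1, which is a quick way to check that the recipe did what you expect.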

Model Fitting

  1. Fit the model and generate the class predictions.

  2. Using the predictions on the test data, calculate:

    1. Specificity

    2. Sensitivity

    3. Accuracy

  3. Explain the meaning of each of the metrics above, and comment on their values for this particular dataset. Which class (Up or Down) is treated as “positive” and which as “negative”?

  4. Does it seem that k-nearest neighbors does a satisfactory job? Think about this question in terms of a predictor that picked randomly, and a predictor that always chooses Up (see question 1 under Exploratory Data Analysis).
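One way to put the fitting and evaluation steps together is sketched below, assuming tidymodels, kknn, and ISLR2 are installed. The 80/20 time split and every object name are illustrative assumptions, not a required solution. Note that yardstick treats the first factor level as the event of interest by default; check levels(Smarket$Direction) and the event_level argument when you interpret sensitivity and specificity.

```r
library(ISLR2)
library(tidymodels)

data("Smarket")

# Time-ordered split (prop = 0.8 is an assumption for this sketch)
smarket_split <- initial_time_split(Smarket, prop = 0.8)
smarket_train <- training(smarket_split)
smarket_test  <- testing(smarket_split)

# Model specification and preprocessing recipe
knn_spec <- nearest_neighbor(neighbors = 5, weight_func = "epanechnikov") %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_recipe <- recipe(Direction ~ ., data = smarket_train) %>%
  step_rm(Today, Year) %>%
  step_normalize(all_numeric_predictors())

# Bundle recipe and model in a workflow, then fit on the training set
knn_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = smarket_train)

# Class predictions on the test set, next to the true labels
test_preds <- predict(knn_fit, new_data = smarket_test) %>%
  bind_cols(smarket_test %>% select(Direction))

# Sensitivity, specificity, and accuracy in one call
class_metrics <- metric_set(sensitivity, specificity, accuracy)
test_metrics <- class_metrics(test_preds, truth = Direction, estimate = .pred_class)
test_metrics

# Baseline for comparison: accuracy of always predicting "Up"
always_up_accuracy <- mean(smarket_test$Direction == "Up")
always_up_accuracy
```

Comparing the accuracy row of test_metrics with always_up_accuracy (and with the 0.5 you would expect from a coin flip) gives a concrete basis for the last question.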

Submitting HW

When you’ve successfully answered all the questions, knit your document to an HTML file. Look through it to make sure everything worked the way you expect it to.