Support Vector Machines for Classification

library(tidyverse)
library(tidymodels)
library(janitor)
tidymodels_prefer() # Resolves naming conflicts in favor of tidymodels functions

We’ll load the breast cancer classification data set again:

patients <- read.csv("breast-cancer.csv") %>% 
  clean_names() %>% 
  mutate(
    class = factor(class),
    bland_chromatin = as.double(bland_chromatin),
    single_epithelial_cell_size = as.double(single_epithelial_cell_size)
  )

Exercise 0. In your markdown file from the previous activity, create a new section called “SVMs for classification”.

Exercise 1. Split the data into training and test sets. Create a model specification for a polynomial support vector classifier using svm_poly. You will need to install and load the kernlab package. Set the degree parameter to tune().
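One possible way to set this up is sketched below. The object names patients_split, patients_test, and svm_spec, the seed, and the 75/25 split proportion are assumptions made for illustration; patients_train matches the name used in the plotting code later in this activity. Note that a degree of 1 gives a linear boundary, while higher degrees allow curved ones.

```r
library(tidymodels)
library(kernlab)

# Assumes `patients` has been loaded as shown at the top of this activity.
set.seed(123) # arbitrary seed, only so the split is reproducible

# Stratify on the outcome so both sets keep a similar class balance
patients_split <- initial_split(patients, prop = 0.75, strata = class)
patients_train <- training(patients_split)
patients_test  <- testing(patients_split)

# Polynomial-kernel SVM; the degree is left as a placeholder to be tuned
svm_spec <- svm_poly(degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```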

Exercise 2. Create a recipe that predicts class from bland_chromatin and single_epithelial_cell_size. Make sure to normalize the data. Name your recipe svm_rec.
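A minimal sketch of such a recipe follows, assuming the training set from Exercise 1 is stored as patients_train. Normalizing matters here because SVMs are sensitive to the scale of the predictors.

```r
library(tidymodels)

# Assumes `patients_train` exists from Exercise 1.
svm_rec <- recipe(
  class ~ bland_chromatin + single_epithelial_cell_size,
  data = patients_train
) %>%
  # Center and scale both numeric predictors
  step_normalize(all_numeric_predictors())
```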

Exercise 3. Add your recipe and model to a workflow. Use 10-fold cross-validation to tune the degree parameter.
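The tuning step could look roughly like the sketch below. It assumes the model specification is called svm_spec and the training data patients_train; the names svm_wf, patients_folds, and degree_grid, the seed, and the candidate degrees 1 through 4 are illustrative choices, not requirements. The result is stored as svm_res so that it can be passed to autoplot().

```r
library(tidymodels)
library(kernlab)

# Assumes `svm_rec`, `svm_spec`, and `patients_train` exist from Exercises 1-2.
svm_wf <- workflow() %>%
  add_recipe(svm_rec) %>%
  add_model(svm_spec)

set.seed(123) # arbitrary seed, only so the folds are reproducible
patients_folds <- vfold_cv(patients_train, v = 10, strata = class)

# Candidate polynomial degrees to compare (an illustrative choice)
degree_grid <- tibble(degree = 1:4)

# Fit every candidate degree on every fold and collect the metrics
svm_res <- tune_grid(svm_wf, resamples = patients_folds, grid = degree_grid)
```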

Exercise 4. Finalize the workflow with the best degree parameter, then fit the model and name it svm_fit. What was the best degree according to the AUC metric?
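As a sketch, assuming the tuning results are stored in svm_res and the workflow in svm_wf (both names are assumptions carried over from the earlier exercises), the best degree can be selected and the final model fitted like this. The name best_degree is also just illustrative.

```r
library(tidymodels)
library(kernlab)

# Assumes `svm_res`, `svm_wf`, and `patients_train` exist from Exercises 1-3.
# Pick the degree with the highest cross-validated ROC AUC
best_degree <- select_best(svm_res, metric = "roc_auc")
best_degree

# Plug the chosen degree into the workflow and fit on the training set
svm_fit <- svm_wf %>%
  finalize_workflow(best_degree) %>%
  fit(data = patients_train)
```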

Exercise 5. Create a plot like the one below to visualize the nonlinear decision boundary. (Make sure you load the data as I did above, so that R treats the variables as doubles rather than integers.)

autoplot(svm_res)

Here is a simple way to look at the decision boundary: I created a grid of points, used the model to predict the class of each point, and then plotted the points colored by predicted class. The red region is classified as 0 (no cancer), while the blue region is classified as 1 (cancer).

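# Build a fine grid running from 1 to 10 in both predictors
expand_grid(
  bland_chromatin = seq(1, 10, by = 0.1),
  single_epithelial_cell_size = seq(1, 10, by = 0.1)
) %>%
  # Predict a class for every grid point with the fitted workflow
  augment(svm_fit, new_data = .) %>%
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  # Faint grid points shade each predicted region
  geom_point(aes(color = .pred_class), alpha = 0.1) +
  # Overlay the jittered training observations; shape shows the true class
  geom_point(data = patients_train, aes(shape = class), position = "jitter")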