Linear Support Vector Classification

library(tidyverse)
library(tidymodels)
library(janitor)
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions

We’re going to load the breast cancer classification data set again:

# Read the data, clean the column names, and coerce columns to the right types
patients <- read.csv("breast-cancer.csv") %>% 
  clean_names() %>% 
  mutate(
    class = factor(class), # outcome as a factor
    bland_chromatin = as.double(bland_chromatin),
    single_epithelial_cell_size = as.double(single_epithelial_cell_size)
  )

Exercise 1. Split the data into training and test sets. Create a model specification of a linear support vector classifier using svm_linear. You will need to install and load the LiblineaR package.
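One possible sketch (the seed, the 80/20 split, and stratifying by class are my choices, not requirements):

library(LiblineaR)

set.seed(123)
patients_split <- initial_split(patients, prop = 0.8, strata = class)
patients_train <- training(patients_split)
patients_test <- testing(patients_split)

# Linear SVM specification using the LiblineaR engine
svm_spec <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")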

Exercise 2. Create a recipe that predicts class using only bland_chromatin and single_epithelial_cell_size. Make sure to normalize the data as the first step in the recipe. Name your recipe svm_rec.
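A minimal sketch, assuming the training set is named patients_train as above:

svm_rec <- recipe(class ~ bland_chromatin + single_epithelial_cell_size,
                  data = patients_train) %>%
  step_normalize(all_numeric_predictors()) # normalization is the first (and only) step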

Exercise 3. Add your recipe and model to a workflow. Fit the workflow to your training set and name it svm_fit.
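One way to write this, reusing the names from the sketches above:

svm_fit <- workflow() %>%
  add_recipe(svm_rec) %>%
  add_model(svm_spec) %>%
  fit(data = patients_train)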

Exercise 4. Plot the test set for the two variables bland_chromatin and single_epithelial_cell_size, coloring by predicted class. Use geom_abline to plot the decision boundary. (The slope and intercept of the decision boundary can be computed using the code below, as long as you named everything the same way I did; a sketch of the full plotting call follows that code.)

Exercise 5. Based on the plot, how did your model do? What drawbacks of this linear SVM do you see?

# Calculates the slope and intercept of the linear decision boundary.
# tidy() on step 1 of the prepped recipe recovers the training means
# and standard deviations stored by step_normalize.
means <- tidy(prep(svm_rec), 1)$value[1:2] # means of the two predictors
sds <- tidy(prep(svm_rec), 1)$value[3:4]   # standard deviations
coeff <- tidy(svm_fit)$estimate            # plane coefficients a, b and the bias B

slope <- -coeff[1] * sds[2] / ( coeff[2] * sds[1] )
intercept <- -coeff[3] * sds[2] / coeff[2] + means[2] +
  coeff[1] * sds[2] * means[1] / ( coeff[2] * sds[1] )
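With slope and intercept in hand, one possible Exercise 4 plot (patients_test is the name I assumed for the test set above; the dashed linetype is just a choice):

augment(svm_fit, patients_test) %>% # adds a .pred_class column
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  geom_point(aes(color = .pred_class)) +
  geom_abline(slope = slope, intercept = intercept, linetype = "dashed")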

Side Note: If you’re wondering what’s going on in these expressions, remember that we normalized our data, so we created new variables \[ x_{new} = \frac{x - \bar{x}}{s_x} \qquad \mbox{and} \qquad y_{new} = \frac{y - \bar{y}}{s_y} \] Subsequently, we found a separating plane in the new variables whose coefficients are stored in the fitted model: \[ a x_{new} + b y_{new} + B = 0 \] Run tidy(svm_fit) to see these (\(B\) is the Bias)! If you plug the expressions for \(x_{new}\) and \(y_{new}\) into the plane equation and solve for \(y\) in terms of \(x\), you get a linear expression with the slope and intercept given above.
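For the record, here is that substitution carried out: \[ a \, \frac{x - \bar{x}}{s_x} + b \, \frac{y - \bar{y}}{s_y} + B = 0 \quad \Longrightarrow \quad y = -\frac{a s_y}{b s_x} \, x + \left( \bar{y} + \frac{a s_y \bar{x}}{b s_x} - \frac{B s_y}{b} \right) \] In the code above, coeff[1], coeff[2], and coeff[3] play the roles of \(a\), \(b\), and \(B\), while means and sds hold \((\bar{x}, \bar{y})\) and \((s_x, s_y)\).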

Here is a less sophisticated way to get a look at the decision boundary: First, I created a grid of points, then used the model to make predictions for each point, then plotted each point, colored by predicted class. The red region is classified as 0 (no cancer) while the blue region is classified as 1 (cancer).

# Grid of points covering the range of both predictors
expand_grid(
  bland_chromatin = seq(1, 10, by = .1),
  single_epithelial_cell_size = seq(1, 10, by = .1)
) %>%
  augment(svm_fit, .) %>% # attach the predicted class for every grid point
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  geom_point(aes(color = .pred_class), alpha = .1) +
  # overlay the (jittered) training points, shaped by their true class
  geom_jitter(data = patients_train, aes(shape = class), width = 0.2, height = 0.2)