library(tidyverse)
library(tidymodels)
library(janitor)
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions
We’re going to load the breast cancer classification data set again:
patients <- read.csv("breast-cancer.csv") %>%
  clean_names() %>%   # tidy up the column names
  mutate(
    class = factor(class),   # the outcome needs to be a factor for classification
    bland_chromatin = as.double(bland_chromatin),
    single_epithelial_cell_size = as.double(single_epithelial_cell_size)
  )
Exercise 1. Split the data into training and test sets. Create a model specification of a linear support vector classifier using svm_linear. You will need to install and load the LiblineaR package.
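Here is a minimal sketch of one possible solution (the seed, the 80/20 split, and stratifying by class are my own choices, not requirements):

# install.packages("LiblineaR")   # once, if not already installed
library(LiblineaR)

set.seed(427)
patients_split <- initial_split(patients, prop = 0.8, strata = class)
patients_train <- training(patients_split)
patients_test <- testing(patients_split)

# Linear support vector classifier using the LiblineaR engine
svm_spec <- svm_linear(mode = "classification") %>%
  set_engine("LiblineaR")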
Exercise 2. Create a recipe that predicts class using only bland_chromatin and single_epithelial_cell_size. Make sure to normalize the data as the first step in the recipe. Name your recipe svm_rec.
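A minimal sketch, assuming the training set from the sketch above:

svm_rec <- recipe(class ~ bland_chromatin + single_epithelial_cell_size,
                  data = patients_train) %>%
  step_normalize(all_numeric_predictors())   # normalizing comes first (here, it is the only step)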
Exercise 3. Add your recipe and model to a workflow. Fit the workflow to your training set and name it svm_fit.
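A sketch, assuming svm_spec and svm_rec from the sketches above:

svm_fit <- workflow() %>%
  add_recipe(svm_rec) %>%
  add_model(svm_spec) %>%
  fit(data = patients_train)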
Exercise 4. Plot the test set for the two variables bland_chromatin and single_epithelial_cell_size. Color by predicted class. Use geom_abline to plot the decision boundary (the slope and intercept for the decision boundary can be computed using the code below, as long as you named everything the same way I did). A sketch of the plot itself appears after the side note below.
Exercise 5. Based on the plot, how did your model do? What drawbacks of this linear SVM do you see?
# Calculates the slope and intercept of the linear decision boundary
means <- tidy(prep(svm_rec), 1)$value[1:2]   # training means of the two predictors
sds <- tidy(prep(svm_rec), 1)$value[3:4]     # training standard deviations
coeff <- tidy(svm_fit)$estimate              # a, b, and the Bias B
slope <- -coeff[1] * sds[2] / (coeff[2] * sds[1])
intercept <- -coeff[3] * sds[2] / coeff[2] +
  means[2] + coeff[1] * sds[2] * means[1] / (coeff[2] * sds[1])
Side Note: If you’re wondering what’s going on in these expressions, remember that we normalized our data, so we created new variables
\[ x_{new} = \frac{x - \bar{x}}{s_x} \qquad \mbox{and} \qquad y_{new} = \frac{y - \bar{y}}{s_y} \]
Subsequently, we found a separating plane in the new variables whose coefficients are stored in the fitted model:
\[ a x_{new} + b y_{new} + B = 0 \]
Run tidy(svm_fit) to see these (\(B\) is the Bias)! If you plug the expressions for \(x_{new}\) and \(y_{new}\) into the plane equation and solve for \(y\) in terms of \(x\), you get the linear expression
\[ y = -\frac{a s_y}{b s_x}\, x + \left( \bar{y} + \frac{a s_y \bar{x}}{b s_x} - \frac{B s_y}{b} \right), \]
whose slope and intercept are exactly the quantities computed above.
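With slope and intercept in hand, a minimal sketch of the Exercise 4 plot (again assuming the names from the earlier sketches):

augment(svm_fit, patients_test) %>%   # appends the .pred_class column to the test set
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  geom_point(aes(color = .pred_class)) +
  geom_abline(slope = slope, intercept = intercept)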
Here is a less sophisticated way to get a look at the decision boundary: First, I created a grid of points, then used the model to make predictions for each point, then plotted each point, colored by predicted class. The red region is classified as 0 (no cancer) while the blue region is classified as 1 (cancer).
expand_grid(
  bland_chromatin = seq(1, 10, by = .1),
  single_epithelial_cell_size = seq(1, 10, by = .1)
) %>%
  augment(svm_fit, .) %>%   # predict a class for every point in the grid
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  geom_point(aes(color = .pred_class), alpha = .1) +
  geom_jitter(data = patients_train, aes(shape = class), width = 0.2, height = 0.2)