library(tidyverse)
library(tidymodels)
library(janitor)
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions
We’re going to load the breast cancer classification data set again:
patients <- read.csv("breast-cancer.csv") %>%
  clean_names() %>%
  mutate(
    class = factor(class),
    bland_chromatin = as.double(bland_chromatin),
    single_epithelial_cell_size = as.double(single_epithelial_cell_size)
  )
Exercise 0. Create a new section in your markdown file for this activity, called “SVMs for classification”.
Exercise 1. Split the data into training and test sets. Create a model specification of a support vector classifier using svm_poly(). You will need to install and load the kernlab package. Set the degree parameter to tune().
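If you get stuck, here is one possible sketch of the split and specification (the names patients_split and svm_spec are placeholders; patients_train matches the plotting code at the end of this activity):

set.seed(1) # any seed works; just make the split reproducible
patients_split <- initial_split(patients)
patients_train <- training(patients_split)
patients_test <- testing(patients_split)

svm_spec <- svm_poly(degree = tune()) %>% # degree left open for tuning
  set_mode("classification") %>%
  set_engine("kernlab")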
Exercise 2. Create a recipe that predicts class by bland_chromatin and single_epithelial_cell_size. Make sure to normalize the data. Name your recipe svm_rec.
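A minimal sketch, assuming the training set is named patients_train as above:

svm_rec <- recipe(class ~ bland_chromatin + single_epithelial_cell_size,
                  data = patients_train) %>%
  step_normalize(all_numeric_predictors()) # put both predictors on the same scale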
Exercise 3. Add your recipe and model to a workflow. Use 10-fold cross-validation to tune the degree parameter.
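One way this could look (svm_wf, folds, and the candidate degrees are my own choices; svm_res matches the autoplot() call below):

svm_wf <- workflow() %>%
  add_recipe(svm_rec) %>%
  add_model(svm_spec)

folds <- vfold_cv(patients_train, v = 10) # 10-fold cross-validation

svm_res <- tune_grid(
  svm_wf,
  resamples = folds,
  grid = tibble(degree = 1:4) # example grid; pick whatever range you like
)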
Exercise 4. Finalize the workflow with the best degree parameter, then fit the model and name it svm_fit. What was the best degree according to the AUC metric?
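For example, something along these lines (best_degree is a placeholder name):

best_degree <- select_best(svm_res, metric = "roc_auc") # top degree by AUC

svm_fit <- svm_wf %>%
  finalize_workflow(best_degree) %>% # plug the winning degree into the workflow
  fit(data = patients_train)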
Exercise 5. Create a plot like the one below to visualize the nonlinear decision boundary. (Make sure you load the data like I did above, so that R treats the variables as doubles and not integers.)
autoplot(svm_res) # plot the cross-validation metrics for each candidate degree
Here is an unsophisticated way to get a look at the decision boundary: First, I created a grid of points, then used the model to make predictions for each point, then plotted each point, colored by predicted class. The red region is classified as 0 (no cancer) while the blue region is classified as 1 (cancer).
expand_grid(
  bland_chromatin = seq(1, 10, by = .1),
  single_epithelial_cell_size = seq(1, 10, by = .1)
) %>%
  augment(svm_fit, .) %>%
  ggplot(aes(x = bland_chromatin, y = single_epithelial_cell_size)) +
  geom_point(aes(color = .pred_class), alpha = .1) +
  geom_point(data = patients_train, aes(shape = class), position = "jitter")