The basics
To get the idea of hierarchical clustering, we’re going to again create some artificial data and perform clustering on it. I reduced the number of points to make the dendrogram more readable later on. You can see that we’ve created three distinct clusters in a two-variable space.
set.seed(465)
centers <- tibble(
  cluster = factor(1:3),
  num_points = c(20, 15, 15), # number points in each cluster
  x1 = c(5, 0, -1),           # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)           # x2 coordinate of cluster center
)

labelled_points <- centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>%
  select(-num_points) %>%
  unnest(cols = c(x1, x2))
ggplot(labelled_points, aes(x1, x2, color = cluster)) +
geom_point(alpha = 0.5)
Hierarchical clustering using hier_clust
The hierarchical clustering model specification is in the tidyclust library. To specify a hierarchical clustering model in tidymodels, simply choose a value of num_clusters and a linkage_method:
library(tidyclust)
hclust_spec <- hier_clust(num_clusters = 3, linkage_method = "average")

hclust_spec
## Hierarchical Clustering Specification (partition)
##
## Main Arguments:
## num_clusters = 3
## linkage_method = average
##
## Computational engine: stats
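Other linkage methods can be specified the same way. Here is a minimal sketch (not part of the original example); since the stats engine passes linkage_method through to stats::hclust(), the usual method names such as "complete", "single", and "ward.D2" should be accepted:
# a sketch: the same specification with complete linkage instead of average
hclust_complete_spec <- hier_clust(num_clusters = 3, linkage_method = "complete")
hclust_complete_spec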
# note that you don't need to provide the outcome variable, because there isn't one!
hclust_rec <- recipe(~., data = labelled_points) %>%
  # we don't want to use the cluster variable, but I'm going to use it later, so just update the role
  update_role(cluster, new_role = "label") %>%
  # hierarchical clustering uses distances, so we'll normalize the predictors
  step_normalize(all_numeric())
hclust_wf <- workflow() %>%
  add_model(hclust_spec) %>%
  add_recipe(hclust_rec)
set.seed(465)
hclust_fit <- hclust_wf %>% fit(data = labelled_points)

summary(hclust_fit)
## Length Class Mode
## pre 3 stage_pre list
## fit 2 stage_fit list
## post 1 stage_post list
## trained 1 -none- logical
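Besides summary(), tidyclust also has tidier extractors. A small sketch (assuming tidyclust's extract_cluster_assignment() accepts a fitted workflow, as its documentation describes) to pull the training-set assignments as a tibble:
# a sketch: cluster assignments for the training data, one row per observation
hclust_fit %>%
  extract_cluster_assignment()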
hclust_fit %>%
  extract_fit_engine() %>%
  plot()
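By default the leaves are labelled with row numbers. As a small sketch (assuming the extracted engine fit is a base hclust object, whose plot method takes a labels argument), we can label the leaves with the true generated cluster of each point instead:
# a sketch: label the dendrogram leaves with the true cluster of each point
hclust_fit %>%
  extract_fit_engine() %>%
  plot(labels = labelled_points$cluster)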
To get the predicted labels, we can use the augment function. I’ll also plot them to see how the algorithm did.
# Try changing the number of clusters to see what happens!
clustered_points <- hclust_fit %>% augment(labelled_points)

clustered_points
## # A tibble: 50 × 4
## cluster x1 x2 .pred_cluster
## <fct> <dbl> <dbl> <fct>
## 1 1 6.12 -2.04 Cluster_1
## 2 1 6.73 -1.90 Cluster_1
## 3 1 7.15 -0.424 Cluster_1
## 4 1 3.62 0.0161 Cluster_1
## 5 1 6.23 0.237 Cluster_1
## 6 1 5.32 -1.09 Cluster_1
## 7 1 4.22 -2.52 Cluster_1
## 8 1 4.61 -0.919 Cluster_1
## 9 1 4.71 -2.92 Cluster_1
## 10 1 5.25 -1.00 Cluster_1
## # ℹ 40 more rows
plot <- clustered_points %>%
  ggplot(aes(x1, x2)) +
  geom_point(aes(color = .pred_cluster, shape = cluster), alpha = .5)

plot
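The plot is a visual check; for a quick numeric check (a small addition using dplyr::count(), which is already loaded with the tidyverse), we can cross-tabulate the true clusters against the predicted ones:
# a sketch: how do the predicted clusters line up with the generated ones?
clustered_points %>%
  count(cluster, .pred_cluster)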
Choosing the number of clusters
With hierarchical clustering, we choose the number of clusters based on the dendrogram. Of course, we set the number of clusters in advance in the specification, but the dendrogram is the same for any number we choose, so we should look at the chart and evaluate our choice.
hclust_fit %>%
  extract_fit_engine() %>%
  plot()
The vertical distances represent dissimilarity: the height at which two branches join is the distance between the clusters being fused. When the vertical distance between links in the tree gets longer, that means clusters are being fused that are not very similar.
Likewise, we might want to make the cut without leaving any leaves (points) as individual clusters. In this dendrogram, I might look a little closer at point 16 to see if there is anything special about it that would warrant leaving it in its own cluster. Otherwise, I would cut just above that point for three clusters.
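If the dendrogram suggests a different cut, we don't need to refit anything. As a sketch (not part of the original code, using stats::cutree() on the extracted hclust object), we can re-cut the same tree at any number of clusters or at a chosen height:
# a sketch: re-cut the existing tree without refitting the workflow
tree <- hclust_fit %>% extract_fit_engine()

# four clusters instead of three
cutree(tree, k = 4) %>% table()

# or cut at a specific height on the dendrogram
cutree(tree, h = 4) %>% table()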
Your turn
This will be both activity and homework for the week.
1. Choose another linkage method, refit the hierarchical clustering model above and plot the dendrogram. What differences do you see?
2. Try running a hierarchical clustering on the NBA data. Plot the dendrogram.
3. Just looking at the dendrogram, how many clusters would make sense to you? Remember, any number of clusters is ok if it leads to some insights into the patterns and structures in the data.
4. How would you describe your clusters in a qualitative sense? Ask a friend who knows about basketball if you need help. Are they similar to the clusters generated by K means?