Hierarchical Clustering

The basics

To get the idea of hierarchical clustering, we’re going to again create some artificial data and perform clustering on it. I reduced the number of points to make the dendrogram more readable later on. You can see that we’ve created three distinct clusters in a two-variable space.

set.seed(465)

centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(20, 15, 15),  # number points in each cluster
  x1 = c(5, 0, -1),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points) %>% 
  unnest(cols = c(x1, x2))

ggplot(labelled_points, aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.5)

Hierarchical clustering using hier_clust

The hierarchical clustering model specification, hier_clust, is in the tidyclust library. To specify a hierarchical clustering model in tidymodels, choose a value of num_clusters and a linkage_method:

library(tidyclust)

hclust_spec <- hier_clust(num_clusters = 3, linkage_method = "average")
hclust_spec
## Hierarchical Clustering Specification (partition)
## 
## Main Arguments:
##   num_clusters = 3
##   linkage_method = average
## 
## Computational engine: stats
# note that you don't need to provide the outcome variable, because there isn't one!
hclust_rec <- recipe(~., data=labelled_points) %>%
  # we don't want to cluster on the true label, but I'm going to use it later, so just update its role
  update_role(cluster, new_role="label") %>% 
  # hierarchical clustering uses distances, so we'll normalize the predictors
  step_normalize(all_numeric())
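
If you want a quick sanity check that the recipe is doing what we expect, you can prep and bake it outside the workflow. This is just a sketch (not part of the workflow we fit below); it should show x1 and x2 with mean roughly 0 and standard deviation 1.

hclust_rec %>% 
  prep() %>%                       # estimate the normalization from the data
  bake(new_data = NULL) %>%        # return the processed training data
  summarise(across(c(x1, x2), list(mean = mean, sd = sd)))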

hclust_wf <- workflow() %>%
  add_model(hclust_spec) %>%
  add_recipe(hclust_rec)
set.seed(465)
hclust_fit <- hclust_wf %>% fit(data=labelled_points)
summary(hclust_fit)
##         Length Class      Mode   
## pre     3      stage_pre  list   
## fit     2      stage_fit  list   
## post    1      stage_post list   
## trained 1      -none-     logical
hclust_fit %>% 
  extract_fit_engine() %>%
  plot()
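
Besides the base-R dendrogram plot, tidyclust can also return the training-set assignments directly. A minimal sketch, assuming the extract_cluster_assignment() helper works the way I expect here:

# one row per observation, with a .cluster column
hclust_fit %>% 
  extract_cluster_assignment()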

To get the predicted labels, we can use the augment function. I’ll also plot them to see how the algorithm did.

# Try changing the number of clusters to see what happens!
# e.g. hier_clust(num_clusters = 4, linkage_method = "average")
clustered_points <- hclust_fit %>% augment(labelled_points)
clustered_points
## # A tibble: 50 × 4
##    cluster    x1      x2 .pred_cluster
##    <fct>   <dbl>   <dbl> <fct>        
##  1 1        6.12 -2.04   Cluster_1    
##  2 1        6.73 -1.90   Cluster_1    
##  3 1        7.15 -0.424  Cluster_1    
##  4 1        3.62  0.0161 Cluster_1    
##  5 1        6.23  0.237  Cluster_1    
##  6 1        5.32 -1.09   Cluster_1    
##  7 1        4.22 -2.52   Cluster_1    
##  8 1        4.61 -0.919  Cluster_1    
##  9 1        4.71 -2.92   Cluster_1    
## 10 1        5.25 -1.00   Cluster_1    
## # ℹ 40 more rows
plot <- clustered_points %>% 
  ggplot(aes(x1, x2)) + 
  geom_point(aes(color=.pred_cluster, shape=cluster), alpha=.5)
plot
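
Since we kept the true cluster label around, we can also cross-tabulate it against the predictions. This is just a quick check (not part of the original output):

clustered_points %>% 
  count(cluster, .pred_cluster)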

Choosing the number of clusters

With hierarchical clustering, we choose the number of clusters based on the dendrogram. We did set num_clusters in advance when we specified the model, but the dendrogram itself is the same no matter what number we choose, so we should look at the chart and evaluate our choice.

hclust_fit %>% 
  extract_fit_engine() %>%
  plot()

The vertical axis represents dissimilarity: the height at which two branches join is the distance between the clusters being merged. When the vertical gap between successive links in the tree gets long, it means clusters are being fused that are not very similar.
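
To put numbers on those gaps, we can look at the merge heights stored in the underlying hclust object. A small sketch (not in the original code):

hclust_fit %>% 
  extract_fit_engine() %>%       # the stats::hclust object
  purrr::pluck("height") %>%     # merge heights, in increasing order
  diff() %>%                     # gaps between consecutive fusions
  tail(5)                        # the last few (largest) jumps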

Likewise, we might want to place the cut so that no leaves (individual points) are left as clusters of their own. In this dendrogram, I might look a little closer at point 16 to see if there is anything special about it that would warrant leaving it in its own cluster. Otherwise, I would cut just above that for three clusters.
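
Because the tree only has to be built once, we can re-cut it at any number of clusters without refitting. A minimal sketch using stats::cutree() and rect.hclust() on the extracted fit:

tree <- hclust_fit %>% extract_fit_engine()

# cluster sizes at a couple of candidate cuts
table(cutree(tree, k = 3))
table(cutree(tree, k = 4))

# draw boxes around the three-cluster solution on the dendrogram
plot(tree)
rect.hclust(tree, k = 3, border = "red")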

Your turn

This will be both the activity and the homework for the week.

1. Choose another linkage method, refit the hierarchical clustering model above and plot the dendrogram. What differences do you see?

2. Try running a hierarchical clustering on the NBA data. Plot the dendrogram.

3. Just looking at the dendrogram, how many clusters would make sense to you? Remember, any number of clusters is ok if it leads to some insights into the patterns and structures in the data.

4. How would you describe your clusters in a qualitative sense? Ask a friend who knows about basketball if you need help. Are they similar to the clusters generated by k-means?