
Hierarchical Clustering in R Programming

Last Updated : 02 Jul, 2025

Hierarchical clustering in R is an unsupervised, non-linear algorithm that builds clusters with a hierarchical structure. The method is often compared to organizing a family tree: in a family of three generations, the grandparents have children, and those children become parents to their own children. In hierarchical clustering, data points are grouped into a hierarchy of nested clusters in much the same way.


Working of Hierarchical Clustering

In hierarchical clustering, objects (data points) are categorized into a tree-like structure, known as a dendrogram. The process works as follows:

  1. Treat each data point as its own cluster, creating N clusters.
  2. Find the two closest clusters (based on distance) and merge them into one cluster, reducing the number of clusters to N-1.
  3. Repeat the process of merging the closest clusters until there is only one cluster left.

A dendrogram is a tree-like diagram that shows the hierarchy of clusters, with the height of the branches representing the distance between clusters. The leaves at the bottom represent individual data points.
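To make the merge sequence concrete, here is a minimal sketch on a made-up one-dimensional dataset (the points and the names toy, d and hc are ours, not part of the original walkthrough):

R
# Six points on a line: three near 1-2, three near 8-10
toy <- matrix(c(1, 2, 2, 8, 9, 10), ncol = 1)
rownames(toy) <- paste0("p", 1:6)

d  <- dist(toy)   # start with 6 singleton clusters
hc <- hclust(d)   # repeatedly merge the two closest clusters

hc$merge    # row i: which clusters were merged at step i (negative = original point)
hc$height   # distance at which each merge happened
plot(hc)    # the resulting dendrogram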

Thumb Rule for Choosing the Optimal Number of Clusters: Find the longest vertical line in the dendrogram that no horizontal merge line crosses; a horizontal cut through that segment gives a reasonable number of clusters, one for each branch the cut intersects.

Types of Hierarchical Clustering

There are mainly two types of hierarchical clustering:

1. Agglomerative Hierarchical Clustering:

  • Starts with each data point as its own cluster and merges the closest clusters.
  • This is a bottom-up approach.

2. Divisive Hierarchical Clustering:

  • Begins with a single cluster and recursively splits it into smaller clusters.
  • This is a top-down approach (see the sketch below).
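Divisive clustering is not implemented by hclust(); one common option is the diana() function from the cluster package. A minimal sketch, assuming the cluster package is installed:

R
library(cluster)

# DIANA: start from one all-inclusive cluster and split top-down
div_cl <- diana(mtcars, metric = "euclidean")
plot(div_cl, which.plots = 2)   # 2 = dendrogram (1 = banner plot)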

In this article, we will explore hierarchical clustering in R, focusing on the agglomerative approach.

Implementation of Hierarchical Clustering in R

We will use the hclust() function from the stats package (pre-installed with R) to perform hierarchical clustering. We will use the mtcars dataset, which records fuel consumption and other aspects of automobile design and performance for 32 cars. The dataset is built into base R (the datasets package), so no separate installation is needed for the data itself.

1. Installing the Required Package:

We will install and load the dplyr package for its data-manipulation helpers. The mtcars dataset itself ships with base R, so it is available as soon as R starts.

  • install.packages(): installs a package into the R library
  • library(): loads an installed package into the current session
  • head(): displays the first few rows of a dataset
R
install.packages("dplyr")
library(dplyr)
head(mtcars)

Output:

First six rows of the mtcars dataset

2. Calculating the Distance Matrix

We will calculate the pairwise distances between data points using the Euclidean method. The dist() function computes the distance matrix.

  • dist(): computes pairwise distances between the rows of a data matrix
  • method = 'euclidean': selects the Euclidean distance metric
R
distance_mat <- dist(mtcars, method = 'euclidean')   # pairwise distances between cars
distance_mat

Output:

Pairwise Euclidean distance matrix between the 32 cars

Each entry of the matrix is the Euclidean distance between one pair of cars (rows of mtcars), so the matrix summarizes how far apart all pairs of data points are.
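Note that the mtcars columns are measured on very different scales (e.g., disp runs in the hundreds while am is a 0/1 flag), so the largest-valued columns dominate the Euclidean distances. A common refinement, not part of this walkthrough's outputs, is to standardize the columns first with scale():

R
# Standardize each column to mean 0, sd 1 before computing distances
scaled_mtcars <- scale(mtcars)
distance_mat_scaled <- dist(scaled_mtcars, method = 'euclidean')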

3. Fitting the Hierarchical Clustering Model

We will apply the hierarchical clustering algorithm using the hclust() function to the distance matrix. This function uses different linkage methods to merge clusters.

  • set.seed(240): sets the random-number seed; hclust() is deterministic, so this is not strictly required, but it is a common habit in clustering workflows
  • hclust(): Performs hierarchical clustering using a specific linkage method.
  • method = 'average': Specifies the use of average linkage, where the distance between two clusters is the average of all pairwise distances between data points in the clusters.
R
set.seed(240)   # not strictly needed: hclust() is deterministic
Hierar_cl <- hclust(distance_mat, method = "average")   # average linkage
Hierar_cl

Output:

Call:
hclust(d = distance_mat, method = "average")

Cluster method   : average
Distance         : euclidean
Number of objects: 32

The printed model summary reports the call, the agglomeration method (average linkage), the distance metric (Euclidean) and the number of objects clustered (32).
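hclust() supports several linkage methods besides average linkage; a quick way to compare them on the same distance matrix (the hc_* object names are ours):

R
hc_complete <- hclust(distance_mat, method = "complete")  # farthest-pair distance
hc_single   <- hclust(distance_mat, method = "single")    # nearest-pair distance
hc_ward     <- hclust(distance_mat, method = "ward.D2")   # minimizes within-cluster variance

par(mfrow = c(1, 3))   # show the three dendrograms side by side
plot(hc_complete); plot(hc_single); plot(hc_ward)
par(mfrow = c(1, 1))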

4. Plotting the Dendrogram

We will plot the dendrogram to visualize the hierarchy of merges produced by the model.

  • plot(): Plots the hierarchical clustering object, showing the clustering process in a tree-like structure.
R
plot(Hierar_cl)

Output:

Dendrogram of the 32 car models produced by plot(Hierar_cl)
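plot() accepts a few extra arguments that often make the dendrogram easier to read; a sketch with arbitrary cosmetic values of our choosing:

R
plot(Hierar_cl,
     hang = -1,    # hang all labels at a common baseline
     cex  = 0.7,   # shrink the car-model labels
     main = "mtcars dendrogram (average linkage)",
     sub = "", xlab = "")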

5. Choosing the Number of Clusters

We will cut the dendrogram at a specific height or specify the number of clusters. Cutting the tree helps us decide how many clusters to form.

  • abline(h = 110, col = "darkgreen"): Adds a horizontal line to the dendrogram at height 110, indicating where to cut the tree.
  • cutree(): Cuts the dendrogram into k clusters, where k is the desired number of clusters.
R
plot(Hierar_cl)

abline(h = 110, col = "darkgreen")   # mark a candidate cut height

fit <- cutree(Hierar_cl, k = 3)      # assign each data point to one of 3 clusters

Output:

Dendrogram with a dark green cut line at height 110
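cutree() can also cut at a specific height instead of a fixed k, mirroring the line drawn above (fit_h is our own name; whether this particular height yields exactly 3 clusters should be checked on the actual dendrogram):

R
fit_h <- cutree(Hierar_cl, h = 110)   # cut the tree at height 110
table(fit_h)                          # cluster sizes implied by this cut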

6. Visualizing the Cut Tree

We will visualize the cut clusters on the dendrogram and display the count of data points in each cluster.

  • table(fit): Shows the number of data points in each cluster.
  • rect.hclust(): Draws rectangles around the clusters to highlight them in the dendrogram.
R
table(fit)   # number of data points in each of the 3 clusters
plot(Hierar_cl)
rect.hclust(Hierar_cl, k = 3, border = "darkgreen")   # outline the 3 clusters

Output:

Dendrogram with rectangles outlining the 3 clusters

The dendrogram visualizes the clusters and their relationships. The x-axis lists the data points, while the y-axis shows the distance (height) at which clusters merge. The dark green rectangles mark the 3 clusters formed by the cut, and table(fit) reports how many cars fall in each.
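With the cluster labels in hand, a natural next step is to profile the clusters, for example with base R's aggregate() (the mtcars_clustered name is ours; we make no claim about the resulting numbers):

R
# Attach the cluster label and compute per-cluster means of every variable
mtcars_clustered <- data.frame(mtcars, cluster = fit)
aggregate(. ~ cluster, data = mtcars_clustered, FUN = mean)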

