User:Manudouz/sandbox/Complete-linkage clustering

From Wikipedia, the free encyclopedia

Working example[edit]

This working example is based on a JC69 genetic distance matrix computed from the 5S ribosomal RNA sequence alignment of five bacteria: Bacillus subtilis (), Bacillus stearothermophilus (), Lactobacillus viridescens (), Acholeplasma modicum (), and Micrococcus luteus ().[1][2]

First step[edit]

  • First clustering

Let us assume that we have five elements and the following matrix of pairwise distances between them:

a b c d e
a 0 17 21 31 23
b 17 0 30 34 21
c 21 30 0 28 39
d 31 34 28 0 43
e 23 21 39 43 0

In this example, is the smallest value of , so we join elements and .

  • First branch length estimation

Let denote the node to which and are now connected. Setting ensures that elements and are equidistant from . This corresponds to the expectation of the ultrametricity hypothesis. The branches joining and to then have lengths (see the final dendrogram)

  • First distance matrix update

We then proceed to update the initial proximity matrix into a new proximity matrix (see below), reduced in size by one row and one column because of the clustering of with . Bold values in correspond to the new distances, calculated by retaining the maximum distance between each element of the first cluster and each of the remaining elements:

Italicized values in are not affected by the matrix update as they correspond to distances between elements not involved in the first cluster.

Second step[edit]

  • Second clustering

We now reiterate the three previous steps, starting from the new distance matrix  :

(a,b) c d e
(a,b) 0 30 34 23
c 30 0 28 39
d 34 28 0 43
e 23 39 43 0

Here, is the lowest value of , so we join cluster with element .

  • Second branch length estimation

Let denote the node to which and are now connected. Because of the ultrametricity constraint, the branches joining or to , and to , are equal and have the following total length:

We deduce the missing branch length: (see the final dendrogram)

  • Second distance matrix update

We then proceed to update the matrix into a new distance matrix (see below), reduced in size by one row and one column because of the clustering of with  :

Third step[edit]

  • Third clustering

We again reiterate the three previous steps, starting from the updated distance matrix .

((a,b),e) c d
((a,b),e) 0 39 43
c 39 0 28
d 43 28 0

Here, is the smallest value of , so we join elements and .

  • Third branch length estimation

Let denote the node to which and are now connected. The branches joining and to then have lengths (see the final dendrogram)

  • Third distance matrix update

There is a single entry to update:

Final step[edit]

The final matrix is:

((a,b),e) (c,d)
((a,b),e) 0 43
(c,d) 43 0

So we join clusters and .

Let denote the (root) node to which and are now connected. The branches joining and to then have lengths:

We deduce the two remaining branch lengths:

The complete-linkage dendrogram[edit]

WPGMA Dendrogram 5S data
WPGMA Dendrogram 5S data

The dendrogram is now complete. It is ultrametric because all tips ( to ) are equidistant from  :

The dendrogram is therefore rooted by , its deepest node.

Comparison of clustering methods[edit]

Comparison of dendrograms obtained under different clustering methods from the same distance matrix.
Single-linkage clustering Complete-linkage clustering WPGMA UPGMA
  1. ^ Erdmann, Volker A.; Wolters, Jörn (1986). "Collection of published 5S, 5.8S and 4.5S ribosomal RNA sequences". Nucleic Acids Research. 14 (Suppl): r1–r59. ISSN 0305-1048. PMC 341310. PMID 2422630.{{cite journal}}: CS1 maint: PMC format (link)
  2. ^ Olsen, G. J. (1988). "Phylogenetic analysis using ribosomal RNA". Methods in Enzymology. 164: 793–812. ISSN 0076-6879. PMID 3241556.