Seminar 6: Dimensionality reduction and Cluster Analysis
Contributors: Gabriela Cohen Freue, Jasleen Grewal, Alice Zhu.

Learning objectives
By the end of this tutorial, you should be able to
Differentiate between sample-wise clustering and gene-wise clustering, and identify when either may be appropriate.
Differentiate between agglomerative hierarchical clustering and partition clustering methods like K-means.
Outline the methods used to determine an optimal number of clusters in K-means clustering.
Undertake dimensionality reduction using PCA.
Appreciate the differences between PCA and t-SNE.
Recognize that the t-SNE approach is an intricate process requiring parametrization.
Add cluster annotations to t-SNE visualizations of your datasets.
Understand the limitations of the clustering methods, PCA, and t-SNE, as discussed in this seminar.

Introduction
In this seminar we'll explore clustering genes and samples using Affymetrix microarray data characterizing the gene
expression profile of nebulin knockout mice. This data was made available by Li F et al., with the GEO accession number
GSE70213. There are two genotypes - the wildtype mice, and the knockout mice. We will compare the results of different
clustering algorithms on the data, evaluate the effect of filtering/feature selection on the clusters, and lastly, try to assess
if different attributes help explain the clusters we see.

Load data and packages
Install required libraries: install the required packages if you haven't done so before. This seminar will require
GEOquery , pvclust , xtable , limma , cluster , RColorBrewer , and plyr .
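For the CRAN packages, something like the following should work (a sketch; pheatmap and Rtsne, which are loaded later in this seminar, are included too, while GEOquery and limma come from Bioconductor; see the note below):

install.packages(c("pvclust", "xtable", "cluster", "RColorBrewer", "plyr",
    "pheatmap", "Rtsne"))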

Remember you may need to edit the file paths below, to reflect your working directory and local file storage choices.


library(RColorBrewer)
library(cluster)
library(pvclust)
library(xtable)
library(limma)
library(plyr)
library(lattice)
library(RCurl)
options(download.file.method = "curl")
library(GEOquery)
library(knitr)
library(pheatmap)

If you don't have GEOquery installed, you will need to get it using biocLite ( install.packages won't work!).
source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")

Read the data into R

We will be reading in the data using the GEOquery library. This lets us read in the expression data
and phenotypic data for a particular analysis, using its GEO accession ID.
if (file.exists("GSE70213.Rdata")) {
# if previously downloaded
load("GSE70213.Rdata")
} else {
# Get geo object that contains our data and phenotype information
geo_obj <- getGEO("GSE70213", GSEMatrix = TRUE)
geo_obj <- geo_obj[[1]]
save(geo_obj, file = "GSE70213.Rdata")
}
# Get expression data
data <- exprs(geo_obj)
# Get covariate data
prDes <- pData(geo_obj)[, c("organism_ch1", "title", colnames(pData(geo_obj))[grep("characteristics",
colnames(pData(geo_obj)))])]
## Clean up covariate data
colnames(prDes) = c("organism", "sample_name", "tissue", "genotype", "sex", "age")
prDes$tissue = as.factor(gsub("tissue: ", "", prDes$tissue))
prDes$genotype = as.factor(gsub("genotype: ", "", prDes$genotype))
prDes$sex = as.factor(gsub("Sex: ", "", prDes$sex))
prDes$age = gsub("age: ", "", prDes$age)

You might get an error like Error in download.file..cannot download all files . This happens when R cannot
connect to the NCBI ftp servers from linux systems. We will need to set the 'curl' option to fix this.
options(download.file.method = "curl")

Exploratory analysis
Let us take a look at our data.


kable(head(data[, 1:5]))

|          | GSM1720833 | GSM1720834 | GSM1720835 | GSM1720836 | GSM1720837 |
|----------|------------|------------|------------|------------|------------|
| 10338001 | 2041.408000 | 2200.861000 | 2323.760000 | 3216.263000 | 2362.775000 |
| 10338002 | 63.780590 | 65.084380 | 58.308200 | 75.861450 | 66.956050 |
| 10338003 | 635.390400 | 687.393600 | 756.004000 | 1181.929000 | 759.099800 |
| 10338004 | 251.566800 | 316.997300 | 320.513200 | 592.806000 | 359.152500 |
| 10338005 | 2.808835 | 2.966376 | 2.985357 | 3.352954 | 3.155735 |
| 10338006 | 3.573085 | 3.816430 | 3.815323 | 4.690040 | 3.862684 |

dim(data)

## [1] 35557    24

We have 24 samples (columns) and 35,557 genes (rows) in our dataset. If we look in our covars object (called prDes), we
should have 24 rows, one for each sample.
kable(head(prDes))

|            | organism     | sample_name    | tissue     | genotype | sex  | age         |
|------------|--------------|----------------|------------|----------|------|-------------|
| GSM1720833 | Mus musculus | quad-control-1 | quadriceps | control  | male | 41 days old |
| GSM1720834 | Mus musculus | quad-control-2 | quadriceps | control  | male | 41 days old |
| GSM1720835 | Mus musculus | quad-control-3 | quadriceps | control  | male | 41 days old |
| GSM1720836 | Mus musculus | quad-control-4 | quadriceps | control  | male | 41 days old |
| GSM1720837 | Mus musculus | quad-control-5 | quadriceps | control  | male | 42 days old |
| GSM1720838 | Mus musculus | quad-control-6 | quadriceps | control  | male | 40 days old |

dim(prDes)

## [1] 24  6

Now let us see how the gene values are spread across our dataset, with a frequency histogram (using base R).
hist(data, col = "gray", main = "GSE70213 - Histogram")


It appears a lot of genes have values < 1000. What happens if we plot the frequency distribution after Log2
transformation?
Why might it be useful to log transform the data, prior to making any comparisons?
hist(log2(data + 1), col = "gray", main = "GSE70213 log transformed - Histogram")

Finally, as an additional step to make visualization easier later, we'll rescale the rows in our data object, since we're not
interested in absolute differences in expression between genes at the moment. Note that although one can do this step
within the pheatmap() function, it will not be available for other functions we will use. We can always go back to the
original data if we need to.
sprDat <- t(scale(t(data)))
str(sprDat, max.level = 0, give.attr = FALSE)

## num [1:35557, 1:24] -0.2042 0.9693 -0.0693 -0.3329 -0.7671 ...

round(data.frame(avgBefore = rowMeans(head(data)), avgAfter = rowMeans(head(sprDat)),
varBefore = apply(head(data), 1, var), varAfter = apply(head(sprDat), 1, var)),
2)

##          avgBefore avgAfter varBefore varAfter
## 10338001   2109.42        0 110944.28        1
## 10338002     55.62        0     70.82        1
## 10338003    645.76        0  22386.92        1
## 10338004    280.43        0   7513.48        1
## 10338005      2.92        0      0.02        1
## 10338006      3.64        0      0.07        1

The data for each row -- which is for one probeset -- now has mean 0 and variance 1.
Now, let us try and consider how the various samples cluster across all our genes. We will then try and do some feature
selection, and see the effect it has on the clustering of the samples. We will use the covars object to annotate our clusters
and identify interesting clusters. The second part of our analysis will focus on clustering the genes across all our samples.

Sample Clustering
In this part, we will use samples as objects to be clustered using gene attributes (i.e., vector variables of dimension ~35K).
First we will cluster the data using agglomerative hierarchical clustering. Here, the partitions can be visualized using a
dendrogram at various levels of granularity. We do not need to input the number of clusters, in this approach. Then, we
will find various clustering solutions using partitional clustering methods, specifically K-means and partition around
medoids (PAM). Here, the partitions are independent of each other, and the number of clusters is given as an input. As
part of your take-home exercise, you will pick a specific number of clusters, and compare the sample memberships in
these clusters across the various clustering methods.

Part I: Hierarchical Clustering

Hierarchical clustering for mice knockout data

In this section we will illustrate different hierarchical clustering methods. These plots were included in Lecture 16.
However, for most expression data applications, we suggest you standardize the data, use Euclidean as the
"distance" (on standardized data, Euclidean distance is closely tied to Pearson correlation; see the aside below) and use "average linkage".
data_to_plot = sprDat
# compute pairwise distances
pr.dis <- dist(t(data_to_plot), method = "euclidean")
# create a new factor representing the interaction of tissue type and genotype
prDes$grp <- with(prDes, interaction(tissue, genotype))
summary(prDes$grp)

##    quadriceps.control        soleus.control quadriceps.nebulin KO 
##                     6                     6                     6 
##     soleus.nebulin KO 
##                     6
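Aside: the relationship between Euclidean distance and Pearson correlation can be made precise. If two vectors each have mean 0 and variance 1 over n entries, their squared Euclidean distance equals 2(n - 1)(1 - r), where r is their Pearson correlation, so Euclidean distance is a decreasing function of correlation. A quick sanity check on the first two (standardized) gene rows of sprDat, as a sketch:

d2 <- sum((sprDat[1, ] - sprDat[2, ])^2)  # squared Euclidean distance
r <- cor(sprDat[1, ], sprDat[2, ])  # Pearson correlation
all.equal(d2, 2 * (ncol(sprDat) - 1) * (1 - r))  # should be TRUE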


# compute hierarchical clustering using different linkage types
pr.hc.s <- hclust(pr.dis, method = "single")
pr.hc.c <- hclust(pr.dis, method = "complete")
pr.hc.a <- hclust(pr.dis, method = "average")
pr.hc.w <- hclust(pr.dis, method = "ward.D")
# plot them
op <- par(mar = c(0, 4, 4, 2), mfrow = c(2, 2))
plot(pr.hc.s, labels = FALSE, main = "Single", xlab = "")
plot(pr.hc.c, labels = FALSE, main = "Complete", xlab = "")
plot(pr.hc.a, labels = FALSE, main = "Average", xlab = "")
plot(pr.hc.w, labels = FALSE, main = "Ward", xlab = "")

par(op)

We can look at the trees that are output from different clustering algorithms. However, it can also be visually helpful to
identify what sorts of trends in the data are associated with these clusters. We can look at this output using heatmaps. We
will be using the pheatmap package for this purpose.
When you call pheatmap() , it automatically performs hierarchical clustering for you and it reorders the rows and/or
columns of the data accordingly. Both the reordering and the dendrograms can be suppressed with cluster_rows = FALSE
and/or cluster_cols = FALSE .
Note that when you have a lot of genes, the tree is pretty ugly. Thus, the row clustering was suppressed for now.
By default, pheatmap() uses the hclust() function, which takes a distance matrix, calculated by the dist() function
(with default = 'euclidean' ). However, you can also write your own clustering and distance functions. In the examples
below, I used hclust() with the Ward linkage method and Euclidean distance.
Note that the dendrogram in the top margin of the heatmap is the same as that of the hclust() function.


Exercise: Play with the options of the pheatmap function and compare the different heatmaps. Note that one can also use
the original data object data and set the option scale = "row" . You will get the same heatmaps, although the columns may be
ordered differently (use cluster_cols = FALSE to suppress reordering).

# set pheatmap clustering parameters
clust_dist_col = "euclidean"  # 'correlation' for Pearson correlation; also 'euclidean', 'maximum', 'manhattan', 'canberra', ...
clust_method = "ward.D2"  # 'ward.D', 'ward.D2', 'single', 'complete', 'average' (= UPGMA), 'mcquitty' (= WPGMA), ...
clust_scale = "none"  # 'column', 'none', 'row'
## the annotation option uses the covariate object (prDes) we defined. It should
## have the same rownames as the colnames in our data object (data_to_plot).
pheatmap(data_to_plot, cluster_rows = FALSE, scale = clust_scale, clustering_method = clust_method,
    clustering_distance_cols = clust_dist_col, show_colnames = T, show_rownames = FALSE,
    main = "Clustering heatmap for GSE70213", annotation = prDes[, c("tissue", "genotype",
        "grp")])

We can also change the colours of the different covariates. As you see, this can help differentiate important variables and
the clustering trends.
## We can change the colours of the covariates
var1 = c("orange1", "darkred")
names(var1) = levels(prDes$tissue)
var2 = c("grey", "black")
names(var2) = levels(prDes$genotype)
var3 = c("pink1", "pink3", "lightblue1", "blue3")
names(var3) = levels(as.factor(prDes$grp))
covar_color = list(tissue = var1, genotype = var2, grp = var3)
my_heatmap_obj = pheatmap(data_to_plot, cluster_rows = FALSE, scale = clust_scale,
clustering_method = clust_method, clustering_distance_cols = clust_dist_col,
show_rownames = FALSE, main = "Clustering heatmap for GSE70213", annotation = prDes[,
c("tissue", "genotype", "grp")], annotation_colors = covar_color)


We can also get clusters from our pheatmap object. We will use the cutree function to extract the clusters. Note that we
can do this for samples (look at the tree_col ) or for genes (look at the tree_row ).
cluster_samples = cutree(my_heatmap_obj$tree_col, k = 10)
# cluster_genes = cutree(my_heatmap_obj$tree_row, k=100)
kable(cluster_samples)

|            |  x |
|------------|----|
| GSM1720833 |  1 |
| GSM1720834 |  1 |
| GSM1720835 |  1 |
| GSM1720836 |  2 |
| GSM1720837 |  1 |
| GSM1720838 |  3 |
| GSM1720839 |  4 |
| GSM1720840 |  4 |
| GSM1720841 |  5 |
| GSM1720842 |  4 |
| GSM1720843 |  4 |
| GSM1720844 |  4 |
| GSM1720845 |  6 |
| GSM1720846 |  7 |
| GSM1720847 |  7 |
| GSM1720848 |  7 |
| GSM1720849 |  8 |
| GSM1720850 |  9 |
| GSM1720851 | 10 |
| GSM1720852 | 10 |
| GSM1720853 | 10 |
| GSM1720854 | 10 |
| GSM1720855 | 10 |
| GSM1720856 | 10 |
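To see how these 10 clusters line up with the known sample groups, we can cross-tabulate them against the grp factor defined earlier (a quick check, not part of the original walkthrough):

# compare pheatmap-derived sample clusters with the tissue/genotype groups
table(cluster_samples, prDes$grp)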

Note you can do this with the base hclust method too, as shown here. We are using one of the hclust objects we defined
earlier in this document.
# identify 10 clusters
op <- par(mar = c(1, 4, 4, 1))
plot(pr.hc.w, labels = prDes$grp, cex = 0.6, main = "Ward showing 10 clusters")
rect.hclust(pr.hc.w, k = 10)

par(op)

We can save the heatmap we made, to a PDF file for reference. Remember to define the filename properly as the file will be
saved relative to where you are running the script in your directory structure.
# Save the heatmap to a PDF file
pdf("GSE70213_Heatmap.pdf")
pheatmap(data_to_plot, cluster_rows = F, scale = clust_scale, clustering_method = clust_method,
clustering_distance_cols = clust_dist_col, annotation = prDes[, c("tissue", "genotype",
"grp")], annotation_colors = covar_color)
dev.off()

Part II: Parametric and Alternative Non-Parametric Clustering with PCA and t-SNE
Partitioning methods for mice knockout data
We can build clusters bottom-up from our data, via agglomerative hierarchical clustering. This method produces a
dendrogram. As a different algorithmic approach, we can pre-determine the number of clusters (k), iteratively pick
different 'cluster representatives', called centroids, and assign the closest remaining samples to it, until the solution
converges to stable clusters. This way, we can find the best way to divide the data into the k clusters in this top-down
clustering approach.
The centroids can be determined by different means, as covered in lecture already. We will be covering two approaches: k-means (implemented in the kmeans function) and k-medoids (implemented in the pam function).
Note that the results depend on the initial values (randomly generated) used to create the first k clusters. To get
stable, reproducible results, use many initial configurations (see the parameter nstart ) together with a fixed seed.
K-means clustering
Keep in mind that k-means makes certain assumptions about the data that may not always hold:
The variance of the distribution of each variable (in our case, genes) is spherical
All variables have the same variance
The prior probability of each of the k clusters is the same, i.e., each cluster has roughly the same number of members
Often, we have to try different 'k' values before we identify the most suitable k-means decomposition. We can look at
how much information is lost as the number of clusters changes (see the AIC/BIC discussion later in this seminar) to determine the number of clusters to use.
Here we'll just do a k-means clustering of samples using all genes (~35K).
# Objects in columns
set.seed(31)
k <- 5
pr.km <- kmeans(t(data_to_plot), centers = k, nstart = 50)
# We can look at the within sum of squares of each cluster
pr.km$withinss

## [1] 131484.25  95226.22 107233.61 103083.34      0.00

# We can look at the composition of each cluster
pr.kmTable <- data.frame(exptStage = prDes$grp, cluster = pr.km$cluster)
kable(pr.kmTable)

|            | exptStage             | cluster |
|------------|-----------------------|---------|
| GSM1720833 | quadriceps.control    | 4 |
| GSM1720834 | quadriceps.control    | 4 |
| GSM1720835 | quadriceps.control    | 4 |
| GSM1720836 | quadriceps.control    | 5 |
| GSM1720837 | quadriceps.control    | 4 |
| GSM1720838 | quadriceps.control    | 4 |
| GSM1720839 | quadriceps.nebulin KO | 3 |
| GSM1720840 | quadriceps.nebulin KO | 3 |
| GSM1720841 | quadriceps.nebulin KO | 3 |
| GSM1720842 | quadriceps.nebulin KO | 3 |
| GSM1720843 | quadriceps.nebulin KO | 3 |
| GSM1720844 | quadriceps.nebulin KO | 3 |
| GSM1720845 | soleus.control        | 1 |
| GSM1720846 | soleus.control        | 1 |
| GSM1720847 | soleus.control        | 1 |
| GSM1720848 | soleus.control        | 1 |
| GSM1720849 | soleus.control        | 1 |
| GSM1720850 | soleus.control        | 1 |
| GSM1720851 | soleus.nebulin KO     | 2 |
| GSM1720852 | soleus.nebulin KO     | 2 |
| GSM1720853 | soleus.nebulin KO     | 2 |
| GSM1720854 | soleus.nebulin KO     | 2 |
| GSM1720855 | soleus.nebulin KO     | 2 |
| GSM1720856 | soleus.nebulin KO     | 2 |

Repeat the analysis using a different seed and check if you get the same clusters.
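For example, a minimal sketch (the alternative seed is an arbitrary choice). Because cluster labels are arbitrary, compare the two runs with a cross-tabulation rather than label by label:

set.seed(42)  # hypothetical alternative seed
pr.km2 <- kmeans(t(data_to_plot), centers = k, nstart = 50)
# identical groupings show up as one non-zero count per row and column
table(original = pr.km$cluster, reseeded = pr.km2$cluster)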
Helpful info and tips:
An aside on set.seed() : Normally you might not need to set this; R will pick one. But if you are doing a "real"
experiment and using methods that require random number generation, you should consider it when finalizing an
analysis. The reason is that your results might come out slightly different each time you run it. To ensure that you can
exactly reproduce the results later, you should set the seed (and record what you set it to). Of course if your results
are highly sensitive to the choice of seed, that indicates a problem. In the case above, we're just clustering samples for
an exercise so it doesn't matter much, but setting the seed makes sure all students are looking at the same results.
PAM algorithm


In K-medoids clustering, K representative objects (= medoids) are chosen as cluster centers and objects are assigned to
the center (= medoid = cluster) with which they have minimum dissimilarity (Kaufman and Rousseeuw, 1990). Nice
features of partitioning around medoids (PAM) are: (a) it accepts a dissimilarity matrix (use diss = TRUE ). (b) it is more
robust to outliers as the centroids of the clusters are data objects, unlike k-means.
We will determine the optimal number of clusters in this experiment by looking at the average silhouette value. This is a
statistic introduced with the PAM algorithm, which lets us identify a suitable k.
pr.pam <- pam(pr.dis, k = k)
pr.pamTable <- data.frame(exptStage = prDes$grp, cluster = pr.pam$clustering)
kable(pr.pamTable)

|            | exptStage             | cluster |
|------------|-----------------------|---------|
| GSM1720833 | quadriceps.control    | 1 |
| GSM1720834 | quadriceps.control    | 1 |
| GSM1720835 | quadriceps.control    | 1 |
| GSM1720836 | quadriceps.control    | 2 |
| GSM1720837 | quadriceps.control    | 1 |
| GSM1720838 | quadriceps.control    | 3 |
| GSM1720839 | quadriceps.nebulin KO | 3 |
| GSM1720840 | quadriceps.nebulin KO | 3 |
| GSM1720841 | quadriceps.nebulin KO | 3 |
| GSM1720842 | quadriceps.nebulin KO | 3 |
| GSM1720843 | quadriceps.nebulin KO | 3 |
| GSM1720844 | quadriceps.nebulin KO | 3 |
| GSM1720845 | soleus.control        | 4 |
| GSM1720846 | soleus.control        | 5 |
| GSM1720847 | soleus.control        | 5 |
| GSM1720848 | soleus.control        | 5 |
| GSM1720849 | soleus.control        | 5 |
| GSM1720850 | soleus.control        | 4 |
| GSM1720851 | soleus.nebulin KO     | 4 |
| GSM1720852 | soleus.nebulin KO     | 4 |
| GSM1720853 | soleus.nebulin KO     | 4 |
| GSM1720854 | soleus.nebulin KO     | 4 |
| GSM1720855 | soleus.nebulin KO     | 4 |
| GSM1720856 | soleus.nebulin KO     | 4 |


Additional information on the PAM result is available through summary(pr.pam).

The silhouette plot

The cluster package contains the function silhouette() that compares the minimum average
dissimilarity of each object to other clusters with the average dissimilarity to objects in its own cluster. The resulting
measure is called the "width of each object's silhouette". A value close to 1 indicates that the object is more similar to objects in
its own cluster than to those in other clusters. Thus, the average of all objects' silhouette widths gives an indication of how
well the clusters are defined.
op <- par(mar = c(5, 1, 4, 4))
plot(pr.pam, main = "Silhouette Plot for 5 clusters")

par(op)

Take-home problem

(1) Draw a plot with the number of clusters on the x-axis and the average silhouette width on the y-axis (a starter sketch follows below).
Use the information obtained to determine if 5 was the best choice for the number of clusters.

(2) For a common choice of k, compare the clustering across different methods, e.g. hierarchical (pruned to specific k,
obviously), k-means, PAM. You will re-discover the "label switching problem" for yourself. How does that manifest itself?
How concordant are the clusterings for different methods?
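As a starting point for (1), note that a pam fit stores its average silhouette width in silinfo$avg.width; a minimal sketch (the range 2 to 10 for k is an arbitrary choice):

# average silhouette width for a range of k values
avg_width <- sapply(2:10, function(k) pam(pr.dis, k = k)$silinfo$avg.width)
plot(2:10, avg_width, xlab = "Number of clusters", ylab = "Average silhouette width",
    pch = 20, type = "b")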

Gene clustering
A different view of the data can be obtained by clustering genes instead of samples. Since clustering genes is slow when
you have a lot of genes, for the sake of time we will work with a smaller subset of genes.
In many cases, analysts use cluster analysis to illustrate the results of a differential expression analysis. Sample clustering
following a differential expression (DE) analysis will probably show the separation of the groups identified by the DE
analysis. Thus, as mentioned in lectures, we need to be careful not to over-interpret these kinds of results. However,
note that it is valid to perform gene clustering to see if differentially expressed genes cluster according to their function,
subcellular localization, pathways, etc.


A smaller dataset
In Seminar 4: Differential Expression Analysis, you've learned how to use limma to fit a common linear model to a very
large number of genes and thus identify genes that show differential expression over the course of development.
cutoff <- 1e-05
DesMat <- model.matrix(~grp, prDes)
dsFit <- lmFit(sprDat, DesMat)
dsEbFit <- eBayes(dsFit)
dsHits <- topTable(dsEbFit, coef = grep("grp", colnames(coef(dsEbFit))), p.value = cutoff,
n = Inf)
numBHhits <- nrow(dsHits)
topGenes <- rownames(dsHits)
# Scaled data of topGenes
topDat <- sprDat[topGenes, ]

We start by using different clustering algorithms to cluster the top 897 genes that showed differential expression across
the different tissue/genotype groups (BH adjusted p value < 10^-5).
Agglomerative Hierarchical Clustering
We can plot the heatmap using the pheatmap function....
pheatmap(topDat, cluster_rows = TRUE, scale = "none", clustering_method = "average",
clustering_distance_cols = "euclidean", clustering_distance_rows = "euclidean",
annotation = prDes[, c("tissue", "genotype", "grp")], show_rownames = FALSE,
annotation_colors = covar_color)

Or we can plot the heatmap using the plot function, after we have made the hclust object....
geneC.dis <- dist(topDat, method = "euclidean")
geneC.hc.a <- hclust(geneC.dis, method = "average")


plot(geneC.hc.a, labels = FALSE, main = "Hierarchical with Average Linkage", xlab = "")

As you can see, when there are lots of objects to cluster, the dendrograms are in general not very informative as it is
difficult to identify any interesting pattern in the data.
Partitioning Methods
The most interesting thing to look at is the cluster centers (basically the "prototype" for the cluster) and membership sizes.
Then we can try to visualize the genes that are in each cluster.
Let's visualize a cluster (remember the data were rescaled) using line plots. This makes sense since we also want to be
able to see the cluster center.
set.seed(1234)
k <- 5
kmeans.genes <- kmeans(topDat, centers = k)
# choose which cluster we want
clusterNum <- 2
# Set up the axes without plotting; ylim set based on trial run.
plot(kmeans.genes$centers[clusterNum, ], ylim = c(0, 10), type = "n", xlab = "Samples",
ylab = "Relative expression")
# Plot the expression of all the genes in the selected cluster in grey.
matlines(y = t(topDat[kmeans.genes$cluster == clusterNum, ]), col = "grey")
# Add the cluster center. This is last so it isn't underneath the members
points(kmeans.genes$centers[clusterNum, ], type = "l")
# Optional: colored points to show which stage the samples are from.
points(kmeans.genes$centers[clusterNum, ], col = prDes$grp, pch = 20)


Improve the plot above by adding sample names to the x-axis (e.g., quad-control-1); one possible approach is sketched below.
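A sketch of one way to do this: suppress the default x-axis when setting up the plot, then draw the sample names from prDes yourself:

# redraw the plot with sample names instead of indices on the x-axis
plot(kmeans.genes$centers[clusterNum, ], ylim = c(0, 10), type = "n", xlab = "",
    ylab = "Relative expression", xaxt = "n")
axis(1, at = 1:ncol(topDat), labels = prDes$sample_name, las = 2, cex.axis = 0.5)
matlines(y = t(topDat[kmeans.genes$cluster == clusterNum, ]), col = "grey")
points(kmeans.genes$centers[clusterNum, ], type = "l")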

Evaluating clusters
Choosing the right k
As mentioned in Lecture 12, we need to find a balance between accurately grouping similar data into one representative
cluster and the “cost” of adding additional clusters. Sometimes we don't have any prior knowledge to tell us how many
clusters there are supposed to be in our data. In this case, we can use Akaike information criterion (AIC) and Bayesian
information criterion (BIC) to help us to choose a proper k.
First, we calculate the AIC for each choice of k. We are clustering the samples in this example:
set.seed(31)
k_max <- 10 # the max number of clusters to explore clustering with
km_fit <- list() # create empty list to store the kmeans object
for (i in 1:k_max) {
k_cluster <- kmeans(t(sprDat), centers = i, nstart = 50)
km_fit[[i]] <- k_cluster
}

# calculate AIC: approximate the model deviance by the total within-cluster
# sum of squares D, and penalize the m * k fitted center coordinates
km_AIC <- function(km_cluster) {
    m <- ncol(km_cluster$centers)  # number of variables (genes)
    n <- length(km_cluster$cluster)  # number of objects clustered (samples)
    k <- nrow(km_cluster$centers)  # number of clusters
    D <- km_cluster$tot.withinss  # total within-cluster sum of squares
    return(D + 2 * m * k)  # AIC = D + 2mk
}

Then, we plot the AIC vs. the number of clusters. We want to choose the k value that corresponds to the elbow point on
the AIC/BIC curve.


aic <- sapply(km_fit, km_AIC)
plot(seq(1, k_max), aic, xlab = "Number of clusters", ylab = "AIC", pch = 20, cex = 2,
main = "Clustering Samples")

Same for BIC
# calculate BIC: same deviance approximation, but with the stronger
# log(n) penalty per fitted parameter
km_BIC <- function(km_cluster) {
    m <- ncol(km_cluster$centers)  # number of variables (genes)
    n <- length(km_cluster$cluster)  # number of objects clustered (samples)
    k <- nrow(km_cluster$centers)  # number of clusters
    D <- km_cluster$tot.withinss  # total within-cluster sum of squares
    return(D + log(n) * m * k)  # BIC = D + log(n) * mk
}
bic <- sapply(km_fit, km_BIC)
plot(seq(1, k_max), bic, xlab = "Number of clusters", ylab = "BIC", pch = 20, cex = 2,
main = "Clustering Samples")


Can you eyeball the optimal 'k' by looking at these plots?
The code for the section "Choosing the right k" is based on Towers' blog and this thread

Statistical methods
An important issue for clustering is the question of certainty of the cluster membership. Clustering always gives you an
answer, even if there aren't really any underlying clusters. There are many ways to address this. Here we introduce an
approachable one offered in R, pvclust , which you can read about at http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/.
Important: pvclust clusters the columns. I don't recommend doing this for genes! The computation will take a very
long time. Even the following example, clustering the samples using the ~900 top genes, will take some time to run.
You control how many bootstrap iterations pvclust does with the nboot parameter. We've also noted that pvclust
causes problems on some machines, so if you have trouble with it, it's not critical.
Unlike picking the right clusters in the partition-based methods (like k-means and PAM), here we are identifying the most
stable clustering arising from hierarchical clustering.
pvc <- pvclust(topDat, nboot = 100)

## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.

plot(pvc, labels = prDes$grp, cex = 0.6)


pvrect(pvc, alpha = 0.95)
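To extract the members of the highlighted clusters programmatically, pvclust also provides pvpick (a sketch, using the same alpha threshold as pvrect above):

# list the clusters whose AU p-value exceeds 0.95
pvpick(pvc, alpha = 0.95)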

Feature reduction
There are various ways to reduce the number of features (variables) being used in our clustering analyses. We have
already shown how to subset the variables (genes) to those showing the strongest differential expression, using the limma package.

PCA plots
The other way we can do this is using PCA (principal components analysis). PCA assumes that the most important
characteristics of our data are the ones with the largest variance. It then reorganizes the data so that redundancy is
removed and the most important new variables come first. These new variables are linear
combinations of the original variables, with different weights.
In R, we can use prcomp() to do PCA. You can also use svd() .
Scaling is suppressed because we already scaled the rows. You can experiment with this to see what happens.
pcs <- prcomp(sprDat, center = FALSE, scale = FALSE)
# scree plot
plot(pcs)
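As an aside, because no centering or scaling is done here, the same decomposition can be recovered from svd(): the right singular vectors match pcs$rotation up to sign flips, and the singular values scale to pcs$sdev. A minimal sketch to verify this:

s <- svd(sprDat)
# right singular vectors = PCA rotation, up to sign
all.equal(abs(unname(s$v)), abs(unname(pcs$rotation)))
# singular values scale to the PC standard deviations
all.equal(s$d / sqrt(nrow(sprDat) - 1), pcs$sdev)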


# append the rotations for the first 10 PCs to the phenodata
prinComp <- cbind(prDes, pcs$rotation[rownames(prDes), 1:10])
# scatter plot showing us how the first few PCs relate to covariates
plot(prinComp[, c("genotype", "tissue", "PC1", "PC2", "PC3")], pch = 19, cex = 0.8)

What does the spread of the samples look like, as explained by their first 2 principal components?
plot(prinComp[, c("PC1", "PC2")], pch = 21, cex = 1.5)


Is the covariate tissue localized in the different clusters we see?
plot(prinComp[, c("PC1", "PC2")], bg = prDes$tissue, pch = 21, cex = 1.5)
legend(list(x = 0.2, y = 0.3), as.character(levels(prDes$tissue)), pch = 21, pt.bg = c(1,
2, 3, 4, 5))

Is the covariate genotype localized in the different clusters we see?
plot(prinComp[, c("PC1", "PC2")], bg = prDes$genotype, pch = 21, cex = 1.5)
legend(list(x = 0.2, y = 0.3), as.character(levels(prDes$genotype)), pch = 21, pt.bg = c(1,
2, 3, 4, 5))


PCA is a useful initial means of analysing any hidden structures in your data. We can also use it to determine how many
sources of variance are important, and how the different features interact to produce these sources.

First, let us assess how much of the total variance is captured by each principal component.
# Get the subset of PCs that capture the most variance in your predictors
summary(pcs)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.1235 1.8957 1.49067 1.30336 1.02878 0.92795
## Proportion of Variance 0.1961 0.1562 0.09661 0.07386 0.04602 0.03744
## Cumulative Proportion  0.1961 0.3523 0.44890 0.52276 0.56877 0.60621
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.86674 0.83534 0.81621 0.78678 0.77243 0.76143
## Proportion of Variance 0.03266 0.03034 0.02896 0.02691 0.02594 0.02521
## Cumulative Proportion  0.63887 0.66921 0.69817 0.72509 0.75103 0.77623
##                          PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.7477 0.73139 0.72127 0.70872 0.69406 0.67519
## Proportion of Variance 0.0243 0.02326 0.02262 0.02184 0.02094 0.01982
## Cumulative Proportion  0.8005 0.82379 0.84641 0.86825 0.88919 0.90901
##                           PC19    PC20    PC21    PC22    PC23      PC24
## Standard deviation     0.67423 0.65740 0.64933 0.62812 0.62433 1.687e-15
## Proportion of Variance 0.01976 0.01879 0.01833 0.01715 0.01695 0.000e+00
## Cumulative Proportion  0.92878 0.94757 0.96590 0.98305 1.00000 1.000e+00

We see that the first two principal components capture 35% of the total variance. If we include the first 11 principal
components, we capture 75% of the total variance. Depending on which subset you want to keep, select
the rotated data from the first n components. We can use the tol parameter in prcomp to remove trailing PCs.
pcs_2dim = prcomp(sprDat, center = FALSE, scale = FALSE, tol = 0.8)
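Since tol drops components whose standard deviation falls below tol times that of PC1 (here 0.8 * 2.1235, about 1.70), this keeps only the first two PCs, which we can confirm (a quick check):

dim(pcs_2dim$x)  # 35557 rows, 2 retained components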

It is common to see a cluster analysis on the first 3 principal components used to illustrate and explore the data.
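For example, a minimal sketch (k = 4 is a hypothetical choice, matching the number of tissue/genotype groups); recall that in this orientation the samples' coordinates sit in the rows of pcs$rotation:

pcs_3dim <- pcs$rotation[, c("PC1", "PC2", "PC3")]
set.seed(31)
km_pc <- kmeans(pcs_3dim, centers = 4, nstart = 50)
# compare the PC-space clusters with the known groups
table(km_pc$cluster, prDes$grp)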


t-SNE plots
When we are dealing with datasets that have thousands of variables, and we want to have an initial pass at identifying
hidden patterns in the data, another method we can use as an alternative to PCA is t-SNE. This method allows for non-linear
interactions between our features.
Importantly, there are certain caveats with using t-SNE.
1. Solutions are not deterministic: While PCA has a unique, well-defined solution, the t-SNE objective has
multiple local minima, so different runs can give different embeddings. It is hence non-deterministic. This may make it
challenging to generate reproducible results.
2. Clusters are not intuitive: t-SNE collapses points that are similar in high-dimensional space on top of each other in lower
dimensions. It preserves local neighbourhoods in a way that may warp global trends.
On the other hand, PCA always rotates our features in specific ways that can be extracted by considering the
covariance matrix of our initial dataset and the eigenvectors in the new coordinate space.
3. Applying our fit to new data: t-SNE embedding is generated by moving all our data to a lower dimensional state. It
does not give us eigenvectors (like PCA does) that can map new/unseen data to this lower dimensional state.
t-SNE is also computationally expensive, and finding an embedding in the lower-dimensional space that makes sense
may often require extensive fine-tuning of several hyperparameters.
We will be using the Rtsne package to visualize our data using t-SNE. In the two plots below, we change the perplexity
parameter. As you will see, the outputs are remarkably different.
# install.packages('Rtsne')
library(Rtsne)
colors = rainbow(length(unique(prDes$grp)))
names(colors) = unique(prDes$grp)
tsne <- Rtsne(unique(t(sprDat)), dims = 2, perplexity = 0.1, verbose = TRUE, max_iter = 100)

## Performing PCA
## Read the 24 x 24 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 0.100000, and theta = 0.500000
## Computing input similarities...
## Perplexity should be lower than K!
## Building tree...
## Done in 0.00 seconds (sparsity = 0.000000)!
## Learning embedding...
## Iteration 50: error is 0.000000 (50 iterations in 0.00 seconds)
## Iteration 100: error is 0.000000 (50 iterations in 0.00 seconds)
## Fitting performed in 0.01 seconds.

plot(tsne$Y, main = "tsne")
text(tsne$Y, labels = prDes$grp, col = colors[prDes$grp])


tsne_p1 <- Rtsne(unique(t(sprDat)), dims = 2, perplexity = 1, verbose = TRUE, max_iter = 100)

## Performing PCA
## Read the 24 x 24 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 1.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.00 seconds (sparsity = 0.163194)!
## Learning embedding...
## Iteration 50: error is 56.076765 (50 iterations in 0.00 seconds)
## Iteration 100: error is 57.961845 (50 iterations in 0.00 seconds)
## Fitting performed in 0.01 seconds.

plot(tsne_p1$Y, main = "tsne")
text(tsne_p1$Y, labels = prDes$grp, col = colors[prDes$grp])


Deliverables
Regenerate the pheatmap clustering plot for the top genes, selected from limma, using clustering distance:
correlation, and clustering method: mcquitty.
Regenerate the dendrogram on the samples of this heatmap using the hclust and dist functions.
Plot the data for these analyses along PCs 1 and 2 using ggplot instead of base plotting. Color the points by tissue.
