Robina Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 55

Download
Open PDF In Browser	View PDF

Robin and RobiNA
Users’ Manual
May 2012

Table of Contents
1

ROBINA – RNA-SEQ ANALYSIS ........................................................................................ 3
1.1 INTRODUCTION ................................................................................................................... 3
1.2 INPUT DATA FORMAT .......................................................................................................... 4
1.3 RNA-SEQ WALKTHROUGH ................................................................................................ 4
1.3.1 Quality checking .......................................................................................................... 5
1.3.2 Read filtering ............................................................................................................. 11
1.3.3 Sample definition ....................................................................................................... 14
1.3.4 Read mapping ............................................................................................................ 15
1.3.5 Statistical assessment of differential gene expression ............................................... 18
1.3.6 Result Browser........................................................................................................... 19

ANALYSING MICROARRAY DATA WITH ROBIN ..................................................... 22
2.1 INTRODUCTION ................................................................................................................. 22
2.2 IN BRIEF: WHAT CAN ROBIN DO FOR YOU? ...................................................................... 22

PRECONDITIONS AND GLOSSARY ............................................................................... 22
3.1 COMMONLY USED TERMS ................................................................................................. 22
3.2 AFFYMETRIX FILES ........................................................................................................... 23
3.3 OTHER SINGLE CHANNEL AND TWO COLOR DATA FILES .................................................. 23
3.4 ASSUMPTIONS ................................................................................................................... 24

WALKTHROUGHS .............................................................................................................. 25
4.1 USING ROBIN TO ANALYZE AFFYMETRIX MICROARRAY DATA........................................ 25
4.1.1 Quality Control .......................................................................................................... 27
4.1.2 Experiment design and statistical analysis................................................................ 28
4.2 ANALYSING TWO-COLOUR MICROARRAY DATA............................................................... 32
4.3 ANALYSIS OF GENERIC SINGLE CHANNEL ARRAYS (E.G. AGILENT) ................................. 34
4.4 FUNCTIONAL ANNOTATION OF THE RESULTS ................................................................... 36

CHIP QUALITY ASSESSMENT ........................................................................................ 37
5.1 AFFYMETRIX CHIP QUALITY CHECKS ............................................................................... 38
5.1.1 Analysis of signal intensity distribution .................................................................... 38
5.1.2 MA plots..................................................................................................................... 38
5.1.3 False color images of probe level model weights ..................................................... 39
5.1.4 Normalized unscaled standard error and relative logarithmic expression............... 40
5.1.5 RNA degradation ....................................................................................................... 41
5.1.6 Scatter plots ............................................................................................................... 42
5.1.7 Principal component analysis and hierarchical clustering....................................... 42
5.2 TWO COLOR MICROARRAY QUALITY CHECKS .................................................................. 43
5.2.1 Image plots of two-color background intensities and unnormalized M values ......... 44
5.2.2 Overview of two color signal intensity distribution .................................................. 45

DATA NORMALIZATION.................................................................................................. 46
6.1 SINGLE CHANNEL MICROARRAY NORMALIZATION .......................................................... 46
6.1.1 Normalization methods for Affymetrix arrays ........................................................... 46
6.1.2 Normalization of generic single channel and two color arrays ................................ 48

ANALYSIS OF DIFFERENTIAL GENE EXPRESSION................................................. 50

OUTPUT ................................................................................................................................. 51

RobiNA – RNA-Seq analysis

1.1 Introduction
Before going into detail on the newly introduced workflow for RNA-Seq based
analysis of differential gene expression, we want to point out that the RobiNA software
package also contains the microarray workflows described in section 2. If you installed
RobiNA you will thus have access to all the functionality that was until now provided by
Robin. If you want to use RobiNA to analyse microarray data, please continue reading in
section 2.
Analysis of differential gene expression using sequencing data is a relatively new
approach that takes advantage of the new high-throughput (HT) sequencing technologies
developed e.g. by Illumina/Solexa. The basic assumption of this approach is that the
frequency with which a certain RNA molecule in a given RNA mixture is sequenced is
proportional to the number of copies of this RNA molecule in the mixture. Using this
assumption it is possible to extract transcript profiles from HT RNA sequencing (RNASeq) data both at a high sensitivity (when using a sufficiently high number of sequence
reads) and specificity, because the reads can be directly mapped to a reference
transcriptome or genome. Of course, the amount of reads that can be uniquely mapped to
specific transcripts also depends on the length of the reads: The longer the reads are the
more likely it is to be able to unambiguously find the transcript they originate from.
Hence, a large amount of tens of millions of very long (e.g. 1kb) reads would be the ideal
data set for RNA-Seq. Unfortunately, current sequencing technology does not yet allow
the generation of such datasets (at affordable prices). To perform RNA-Seq based
transcriptomics, the “depth” of sequencing is more important than the length of the
individual sequence reads because it defines the dynamic range of the transcript profile
that can be extracted from the data, given the length is long enough to disambiguate
different isoforms in most of the cases, which is usually the case even with shorter reads
(as of 2012) Therefore, the most frequently used technique for RNA-Seq is currently
Illumina/Solexa sequencing that provides up to 200 million reads of a length of up to
100bp per RNA sample library. The Roche/454 sequencing technology usually yields
much longer reads of up to 1kb but a smaller total number of reads (around 1 million) per
sample and is thus much better suited for transcriptome and genome assembly.
Analogous to the microarray analysis workflows, the RNA-Seq workflow allows you to:
1)
2)
3)
4)
5)

Thoroughly quality check the raw sequencing data
Filter out low quality reads based on a range of freely combinable criteria
Map the reads to a user-provided genome- or transcriptome reference sequence
Detect differentially expressed genes using state of the art statistical methods
Generate graphical summary plots and detailed tabular results that can be browsed
directly inside the application and exported to downstream tools like MapMan
and PageMan (or any other tool that can read simple tab-separated tables)

1.2 Input data format
The raw data has to be provided in FASTQ format [1], which is the standard output
format for Illumina/Solexa data. It is also possible to use bzip2- or gzip-compressed
FASTQ files – however, RobiNA will extract the files during the workflow and write the
uncompressed data to the project directory (which will require extra disk space). So it is
recommended to use uncompressed data directly.
Alternatively, users who have already aligned their reads to a reference, using a tool that
generates BAM or SAM files, can directly import these by checking the “Alternative
inputs” box on the import panel (see Figure 2). Using BAM/SAM files will skip the
quality checking, filtering and mapping steps.

1.3 RNA-Seq Walkthrough
The following section will guide through all steps of a RNA-Seq analysis starting
from raw data input. When reanalysing data from third parties that was obtained from the
NCBI Short Read Archive (SRA), you’ll first need to extract the reads as FASTQ files.
This
is
easily
done
using
the
sratoolkits’
“fastq-dump”
tool
(http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s
=software). Files from the European Nucleotide Archive (ENA) can used directly Before
actually starting the analysis workflow, the user has to define a project directory in which
all data, that is generated during the analysis is saved (see Figure 1).

Figure 1: Project setup dialog. Create a new directory in which the analysis results and all other relevant
data will be stored

1.3.1 Quality checking
The first step in the workflow is the import of raw sequencing data – so after
choosing the RNA-Seq workflow, clicking the “Add” button will open a file browser in
which the input files can be selected. To-date, RobiNA accepts Illumina/Solexa-type
FASTQ files only, since this is the preferred data type for RNA-Seq based analysis of
differential gene expression. However, in future releases, FASTA sequence files with
separate FASTA quality file originating from other technologies like e.g. 454/Roche will
also be supported (although the depth of sequencing provided by standard runs at the
current state of this technology is, to our knowledge, not sufficient for in-depth sensitive
analysis of differential gene expression). Upon clicking “Next” the workflow moves on
to the quality checking step (see Figure 1). RobiNA offers a choice of quality test
modules that cover a range of quality aspects of the data:

Figure 2: Raw data import panel. RobiNA accepts Illumina/Solexa fastq files as input. The quality
encoding version will automatically be deduced from the data. Alternatively, BAM/SAM alignment files or
precomputed counts tables can be imported.

Figure 3: Quality checking options for raw deep sequencing data.

1) Base call quality: This module summarizes the base call quality scores included in
the FASTQ file. All high-throughput sequencing technologies include a base
calling step in their processing pipelines. The purpose of this step is to transform
the raw signals recorded by, e.g. in the case of Illumina/Solexa sequencing, the
fluorescence scanner into nucleotide base calls (i.e. A, C, G, T or N). The
software component that accomplishes this step evaluates the raw signal and tries
to assign a nucleotide to each signal. For each assignment, an error probability is
computed, which, in turn, is transformed into a quality score according to the
following equations:
Equation 1: PHRED score

QPHRED = "10 * log10 (Pe )

Equation 2: Solexa score

QSolexa = "10 * log10 (

Pe
)
1 " Pe

40
30
20
10
0

Quality score (Q)

0.0

0.2

0.4

0.6

0.8

1.0

Error probability (P)

Figure 4: Plot showing the difference between quality scores computed according to equation 1 and
equation 2. The blue dashed line denotes p=0.05. The diversion for higher error probabilities is stronger
while quality of approx. Q=13 and greater have almost identical error probabilities in both equations.

with Pe being the probability of error. Depending on the processing pipeline
version used to generate the data, either Equation 1 (version >= 1.3) or Equation 2
(prior to pipeline version 1.3) are used for computing the quality scores. In
FASTAQ files, these quality scores are usually encoded as ASCII characters.
Again depending on the pipeline version used, the scores are encoded by adding
either 64 (Illumina pipeline until version 1.5+) or 33 (version 1.8+) to the quality
score. For example a quality score of 35 would be encoded as ASCII character
64+35=99, i.e. a lower case ‘c’ in all versions prior to 1.8 and as 33+35=68 i.e. an
upper case ‘D’ in version 1.8+.
Note: When working with RobiNA, the user usually does not have to deal with the
quality encoding manually. RobiNA tries to extrapolate the encoding version by
scanning the first 10000 entries of each file and evaluating the range of encoding
characters observed. In most cases this yields a correct guess of the version used in the
data. In case, however, this approach fails, the user can override the version manually
on the “General” settings tab of the quality checking options panel (Figure 3)

1) Consecutive homopolymers. In rare cases, we have encountered raw sequencing
data that showed the occurrence of short homopolymer stretches in every or
almost every read starting at a specific nucleotide position that was the same
across all reads. The base that was called, however, was not the same across the
reads, i.e. read X might show a AAAAA pentamer starting at position 22 while
read Y contains a GGGGG stretch at the same position. We are not sure what
technical defect is causing this effect, but it will manifest itself as a clear spike
towards a value of 1.0 on the consecutive homopolymer plot (see Figure 5)

Figure 5: Example of homopolymer plots. Panel A) shows the homopolymer plot of a sequence set
whose quality deteriorates strongly at the end of the reads leading to “N”-only base calls at the end of all
reads from approximately position 75 on. This is a rather expected and, in the case shown, hardly alarming
result, since there is no indication of an increased fraction of homopolymer base calls until the very low
quality end of the reads. Panel B) shows a severe case with two blocks of homopolymer calls caused by a
defect in the sequencing pipeline.

2) Kmer frequency scan. Sequence biases in a large set of reads can be detected by
breaking up the reads into shorter fragments of a defined size K (i.e. Kmers) and
conting the frequency of occurrence of each such fragment. The expected
frequency of occurrence of each Kmer can be computed from the nucleotide
composition of the sampled sequences. If the observed frequency of a specific
Kmer exceeds its expected frequency by a factor of 3, the Kmer is considered as
enriched and recorded. The 10 most enriched Kmers will be reported in the
quality results summary, also graphically showing the positional enrichment
within the reads. By default, RobiNA will only scan for enriched 5mer Kmers.
The user can, however, choose to record a wider range of Kmer lengths and also
restrict the maximal number of individual Kmers to record for the enrichment
analysis by changing the settings on the “Kmer” settings tab. Changing the
settings there to higher values will also lead to a strongly increased computation
time and memory consumption of the Java Virtual Machine (JVM) – so please

make sure that you have a fast computer that provides the necessary. Please note
that the Kmer enrichment analysis is computed for each input file individually –
so if you run the quality checks in more than one parallel thread (see “parallel
processes” in the “general” settings tab), the memory requirement is multiplied by
the number of threads.
3) Base call frequencies. This module simply records the frequency in which each
base (A, C, G, T and N) was seen at each position in the reads. Depending on the
sample that was sequenced, the expected result will vary. Usually, when
sequencing poly-A mRNA, you would expect to see roughly the same nucleotide
composition as that of the underlying genome. In the case of many plant genomes,
the composition usually shows a slight AT bias.
4) Overrepresented sequences. Analogous to the Kmer frequency module, this
module identifies longer sequences that occur more often than expected. Usually
these would be adapter sequences carried over from the library generation etc.
After choosing the quality check modules, clicking “Next” will start the processing of
the input sequences. The “General” section of the settings tab allows to control some
technical aspects related to the performance of the quality checking modules. The left
part of the tab shows how much random access memory (RAM) is available and free
on the machine in total, how much is available and free for RobiNA and how many
processors (CPU) have been detected on the system. Based on these measurements,
RobiNA automatically sets up the number of parallel quality checking and filtering
processes in a very conservative way: The number of parallel processes automatically
configured equals the number of CPUs found. On a machine with a lot of memory, it
should usually be safe to run more parallel processes than CPUs – e.g. on a 32GB 8core machine it would be safe to run 10-12 parallel quality checking threads. The
more threads run in parallel, the faster the quality checking step will finish, given the
underlying hardware supports it.
RobiNA will display the progress of each quality checking process – as soon as a
process is finished, you can view the results by clicking on the respective status box
(see Figure 6). In case a file is corrupted and cannot be processed, RobiNA will
display a warning triangle and highlight file in red. Clicking it will display an error
message indicating why the processing failed.

Figure 6: The quality check progress (upper half) and result browser panel (lower half).

1.3.2

Read filtering

When importing raw next generation sequencing data, it is very important to filter out
low quality data and (identifiable) contaminating sequences. RobiNA provides a flexible
way of doing this by offering a range of trimmer modules that can be combined freely to
build a filtering pipeline that best suits the users’ needs. The trimmer pipeline is also
available
as
a
stand-alone
command
line
tool
at
http://www.usadellab.org/cms/index.php?page=trimmomatic. The trimming pipeline can
be pieced together simply by dragging individual modules from the list on the lower left
side of the Trimmomatic panel and dropping them into the white area on the right side
with the mouse. For users who are unsure about what modules to use, we have included a
default set of trimmers that can be added by clicking the “Standard” button on the lower
right. The following trimmer modules are available:

Figure 7: The barcode splitter module

1) Barcode Splitter: This module is not really a trimmer, because it does not remove
any reads from the input. Instead, it splits the reads based on user-supplied
barcode sequences. When working with multiplexed data, that means several
different sample libraries have been sequenced in one sequencing run (e.g. one
lane on a Illumina/Solexa platform), the different sample libraries are usually
tagged by a short barcode sequence at the 5' end of the reads. To further analyse
the data, the multiplexed reads need to be split into different files - one for each of
the samples. To do so, you have to supply the barcodes used plus a short human
readable label either by typing the information in the table provided in the
barcode splitter box or by loading it from a tab-separated text file. You can set the
barcode splitter to accept mismatches in the barcode sequence, however this is not
really recommended. Reads that have a barcode that does not match to any of the

user-provided ones will be collected in a separated file with the file name prefix
"UNKNOWN".

Figure 8: The adapter clipper module

2) Adapter clipper. In many cases, the raw sequence data contains a significant
amount of reads that originate from adapter molecules that were used during the
library preparation protocol. These reads should be removed prior to analyses
since they constitute an artificial contamination and could, depending on their
abundance, bias downstream analyses. The adapter clipper module clipper
modules can be used to remove known adapter (and other short nucleotide
sequences with a length <= the read length) sequences from the input data. These
sequences can be provided by the user as a standard FASTA file. The "max. seed
mismatch" and "match stringency" settings can be used to tweak the adapter
search performance. The settings chosen should work robustly for most cases.
However, if you expect very short adapters, the match stringency might have to
be reduced in order to reliably detect them. For legal reasons we are not allowed
to bundle known adapter sequences with the RobiNA software bundle. Feel free
to contact us for further information on adapter sequences (please visit
http://mapman.gabipd.org/web/guest/forum).

Figure 9: The leading trimmer module.

3) Leading trimmer. The leading trimmer simply removes bases below the specified
minimal quality score from the start of each read, leaving the rest of the read
untouched (even if there are more low quality reads in the middle of the end of the
read – see “Trailing trimmer” and “sliding window trimmer” below).

Figure 10: The read length cropper module.

4) Read cropper. The read cropper is very simply only cutting off all bases beyond
the specified maximal length from all reads in the input. In the example given in
Figure 10 this would mean that after processing all reads are 35 bases long or
shorter.

Figure 11: The trailing trimmer module.

5) Trailing trimmer. This trimmer module works analogous to the leading trimmer,
but starts removing bases from the end. In its default setting, the trimmer is very
conservative in only removing very low quality (score <= 3) bases.

Figure 12: The sliding window trimmer module.

6) Sliding window trimmer. The sliding window trimmer will scan across each read
in a window of the specified length and trim the read if the average quality within
the window drops below the specified threshold. That means that long reads that
contain a low base call quality stretch might get truncated at this stretch even if
the quality downstream of this stretch is acceptable again. Depending on whether
you want to be more conservative or stricter you might consider increasing (more
conservative) or decreasing (stricter) the window size.

Figure 13: The read length filter module.

7) Read length filter. Removes all reads that are shorter than the specified minimal
length. This module is very useful (and recommended) as the last processing step
in a multistep pipeline that removes bases based on quality. If it is omitted, reads
that are very short – possibly too short to be unambiguously mappable to a
reference gene or transcript – will be carried through to the mapping step (which
would unnecessarily extend the computation time).

1.3.3

Sample definition

Now that low quality reads and detectable contaminants have been removed, the next step
in the analysis is the definition of the different treatments and samples. Every individual
sequence file was generated by sequencing an RNA sample that was taken from
biological material treated in a defined way. This information has to be entered on the
“Experiment set up” panel (see Figure 14). First, all experimental conditions, or
treatments, have to be listed in box in the lower left corner of the panel. For example,
when different tissues are to be compared, the user would have to list which tissues these
are (e.g. “root”, “flower”; or “liver”, “brain” etc.). After all conditions have been entered,
the samples can be defined simply by selecting an input file and a condition and then
clicking the “add” button.

Figure 14: Experiment set up panel. In order to continue to the mapping step and generate a counts table
for each sample, the samples and treatments in the experiment have to be defined.

More than one sequence file can be chosen per sample to account for raw data that was
generated by sequencing the same original RNA sample more than once to obtain a
sufficient amount of reads.
1.3.4

Read mapping

Since the final aim of the RobiNA analysis is to identify differentially expressed genes
that are significantly responding to the different experimental treatments, it is necessary
to compute a measure of transcript abundance from the reads in each of the defined
samples. This is done by mapping the reads in each sample separately to a reference
sequence (genome or transcriptome) and subsequently counting the number of reads that
aligned unambiguously to each of the transcripts or genes of the reference. The
intermediate result of this step is a counts table that contains a column for each sample
and a row for each gene that has been see at least once in at least one of the samples. This
table will be saved in the “detailed_results” folder of the project directory and can be
easily imported into e.g. spreadsheet applications or other tools. The counts table will be
used as the basis to estimate the expression level of each of the genes contained and
hence is the most important input into the statistical analysis of differential gene
expression. Figure 15 shows the mapping panel user interface – First, the user has to

choose whether a transcriptome or a genome is to be used as the reference sequence.
When using a transcriptome, the data has to be supplied in the form of a multiple FASTA
file containing the nucleotide sequences of all transcripts. While working with a
transcriptome usually makes the mapping process faster, it is also biased, because only
reads that are mapping to the known transcripts supplied in the FASTA file will be
recorded. If the reference transcript set is incomplete and / or inaccurate (as might very
well be the case for an ill described organism or a primary transcriptome assembly), a
substantial fraction of the input reads might be lost because they map to transcripts that
are not included in the reference set.

Figure 15: Mapping panel. In the mapping step, the filtered reads are aligned to a user-supplied reference
sequence to determine how “often” each transcript was sequenced.

When working with a genome sequence as the reference, both the raw sequence data
(again in FASTA format) and a matching annotation file in GFF3 format
(http://www.sequenceontology.org/gff3.shtml) has to be supplied. After choosing your
desired reference format, the next step is to configure the mapping tool. The current
version of RobiNA uses BOWTIE [2] to align the reads. The settings for the alignment
can be modified using the “Bowtie settings” panel. While there are a lot of options
available for tweaking BOWTIE, RobiNA limits itself to controlling the mapping by 1)
setting an option that will ignore any non-ambiguously mappable reads (-m1) and 2)
allowing the user to control the alignment stringency by explicitly setting a) the maximal
number of mismatches allowed in the seed region, b) the seed region length and c) the
maximal sum of mismatch quality scores. The default setting will accept no mismatches

at all which is very conservative and might lead to a lot of well-aligning reads being
discarded just because of 1 mismatch. Especially when working with reads that originate
from e.g. a different ecotype (or other genetic variant) than the reference, this setting
might be too strict. So we recommend to either use the ‘up to 2 mismatches’ preset or
change the settings manually to even more permissive values.

Figure 16: Reference genome panel. Users can either provide new genomic sequence ( in a FASTA file)
and annotation (in a GFF3 file) or choose an already generated reference index from the drop box.

To confirm the alignment settings and continue the workflow, click ‘use these settings’.
Before starting the actual mapping process, the reference sequence has to be provided.
Since the generation of a BOWTIE sequence index can be quite time consuming when
working with large reference files, RobiNA requires the user to do this only once. All
indices that have been generated via the RobiNA application will be saved and are
subsequently available via the drop-down box in the reference data panel (see Figure 16)
so that the indices don’t have to be created every time a new analysis is run. If a new
genomic reference was entered, the indexing procedure will start as soon as the input is
complete. If an existing index was chosen, the main mapping process can directly be
started by clicking the ‘start mapping’ button.

Figure 17: Short summary of single sample mapping. The graph on the left shows a scaled histogram of
count observations. The expected shape of the plot would be similar to an exponential decay curve, since

most of the genes have a low count while only a few are counted very often. The right half shows a textual
summary of the mapping process as reported by the BOWTIE alignment tool.

As soon as the mapping process is finished, the ‘Mapping result overview’ box in the
lower right corner will show a short result summary for each of the samples. The
summary consists of a count frequency histogram and further textual summary data
giving the number of reads processed, discarded and unambiguously mapped to the
reference. Figure 17 shows a typical count frequency histogram.

1.3.5

Statistical assessment of differential gene expression

After the mapping step, the counts table is computed and can be found in the
‘detailed_results’ subfolder of the project. To identify genes that are expressed
differentially between two experimental conditions, the user first has to define which
conditions or treatments are to be compared. This is done on the experiment designer
panel (Figure 18). Each of the conditions or treatments entered previously will be
represented by a blue box on the experiment designer panel. To define a comparison
between two of them you simply have to draw an arrow by holding down the control key
and then click-dragging from one box to the other. After releasing the mouse and control
key, you will see a picture similar to Figure 18, which means that the comparison
STRESS minus CONTROL has been defined. You can define any number of
comparisons you are interested in. To delete an arrow, simply right-click on it and choose
delete from the context menu.
The lower half of the experiment designer panel allows you to choose the statistical
analysis method and modify the settings of the analysis. To-date, two different methods,
based on two different Bioconductor libraries, are available:
1) edgeR [3]
The statistical assessment methods implemented in the edgeR package are based
on the assumption that the read counts recorded in a RNA-Seq experiment follow
a negative binomial distribution. It uses the discrete reads counts recorded in the
counts table for the assessment of differential gene expression (including
correction for library size and compositional bias). edgeR employs empirical
Bayes methods to estimate the gene-specific biological variation and thereby can
also be used for experiments with little or even no biological replicate samples.
However, we want to emphasize that true biological replication should
always be a main focus when designing an experiment. The edgeR-based
statistics computed by RobiNA uses exact tests as described by [4, 5]. Future
releases of RobiNA will also provide the option to compute more complex
multifactorial statistics based on generalized linear models according to [6]. For
further in-depth information on the statistical approaches implemented in edgeR,
please
see
the
edgeR
User’s
Guide

(http://www.bioconductor.org/packages/2.9/bioc/vignettes/edgeR/inst/doc/edgeR
UsersGuide.pdf)
2) DESeq [7]
The method implemented in the DESeq package is essentially based on the same
assumption of a negative binomial distribution of RNA-Seq data as the edgeR
method. However it uses a slightly different approach for the computation of
differentially expressed genes. Please see the publication cited above for an indepth description of the method.
When the main analysis step is finished successfully, you can annotate the result data
given that a matching MapMan mapping is available – please refer to section 4.4 for
details as the process is identical for microarray and RNA-Seq data sets).
1.3.6

Result Browser

Once the analysis is finished, you can browse a summary of the whole analysis directly
within the RobiNA application. The summary gives an overview of all steps of the
workflow comprising input data, trimming, settings of the statistical analysis and various
overview plots visualizing the results of the differential expression analysis. Additionally,
full lists of differentially expressed genes, including the log fold change, and p-values are
given.

Figure 18: Experiment designer panel. Comparisons of interest between the different treatments that
were entered earlier can be defined by simply drawing arrows between the boxes representing the different
treatments. The lower part of the panel allows modifying the settings of the statistical assessment.

1.4 RobiNA Troubleshooting
1.4.1 Installation on Linux
When using RobiNA on Linux (or any other UNIX-based operating system), the RobiNA
installer package will not include an embedded R engine and hence requires that R (R
version 2.14.2 when working with RobiNA version 1.1.0) be installed.
The first time RobiNA is started, it will ask the user for the location of an R installation
on the host system and try to install all required packages (if not already present).
Depending on the configuration of the R installation and the permissions owned by the
user running RobiNA this might cause problems, e.g. when the user lacks the permissions
to install R packages in the standard R library path. This, however, can be resolved by
setting a user-specific library path: Start an R session and type
> Sys.getenv("R_LIBS_USER")
[1] "~/Library/R/2.14/library"

to have R report the default location for the user specific library. In a standard
installation, the reported directory might not exist, so you’ll have to switch to a terminal
and create the directory first. In e.g. a bash shell this would be accomplished by typing
mkdir ~/Library/R/2.14/library
(make sure to replace the path with the actual path reported on your system). If you
subsequently try to install new packages, these should be installed in the user-specific
library path. The next time RobiNA is started, it should be able to install all required
packages and set up your local R environment for being used with RobiNA.

Analysing microarray data with Robin

2.1 Introduction
Robin represents an easy to use graphical interface for microarray (Affymetrix GeneChip,
other single channel (e.g. Agilent) and two colour) analysis functions from
R/BioConductor. It is available on all Java-enabled computer platforms that are also
supported by the R Development team. The main objective of Robin is enabling the
individual biologist to use state of the art microarray pre-processing and analysis tools
that are provided by the BioConductor project without in-depth knowledge of
programming in R. To this end Robin provides documented, standard workflows for the
quality assessment, normalization and statistical analysis of microarray data originating
from many commonly used technical platforms. These workflows should allow for the
analysis of most experimental setups that are conducted in microarray experiments
carried out in labs around the world.
This manual gives a detailed guideline through the Robin analysis workflows for
different types of microarray experiments (e.g. Affymetrix, two colour, Agilent single
channel…) and explains the concepts and methods of quality assessment, normalization
and analysis of differential gene expression. The output that is generated by Robin can
directly be imported into meta-analysis tools like MapMan and PageMan for further
visualization and analysis of the data in a biological context or into Microsoft Excel.

2.2 In brief: What can Robin do for you?
You can use Robin for…
(i)
quality assessment of your data
(ii)
normalization of your microarray data
(iii)
detection of differentially expressed genes
(iv)
preparation of the data for an import into MapMan and/or excel
(v)
generation of informative plots on your experiment

Preconditions and Glossary

3.1 Commonly used Terms
Robin helps in evaluating microarrays using advanced normalization strategies and
statistics from R/BioConductor. Nevertheless, please bear in mind that most statistics and
most normalization techniques make some strong assumptions and have some general
terminology.
When dealing with microarrays, almost always one will deal with values which have
been transformed by taking the logarithm to the base of 2. The reason is, that by logging
the data, the data becomes roughly normally distributed (Gauss shaped), which then
allows using tests, like student’s t-test, making assumption about standard deviations etc..
Unlogged data is almost always NOT normally distributed, meaning t-tests are NOT

applicable (even though they might still perform reasonably well). Thus, a difference of 1
unit means a two-fold increase or decrease in expression.
Often data is not represented as treatment value and control value, but instead of M and
A. Here, M stands for treatment minus control (on the log scale, being a division on the
normal scale), and A stands for (treatment plus control)/2. So M is a measure of your
treatment effect and A of the expression level of that gene.
(Actually another reason for using M and A values is that it is easier to see if values
deviate from the zero line as if they were deviating from a line with a slope of one. Please
see Figure 19)

Figure 19: Comparison of MA plot versus Scatter plot of normalized expression values. The left
panel shows an MA plot of the log2-fold changes when comparing two chips (M) plotted on the Y
axis and the average log2 intensities (A) plotted on the X axis. On the right panel the same two chips’
expression values have been plotted against each other. The MA plots gives a clearer representation
of the cganges in gene expression when comparing the two chips.

3.2 Affymetrix Files
When dealing with Affymetrix chips, you will be confronted with .CEL and .CDF files,
the former describes the scanned intensity for every spot (usually there are 2 times ~11
spots per gene). The CDF file describes where the spots for a probe-set are to be found on
the chip, since these are not clustered to compensate for local effects such as bubbles,
smears, etc.

3.3 Other single channel and two color data files
Data derived from other microarray experiments may come in a variety of different file
formats depending on the microarray scanner hardware and software used. Robin
supports direct import of generic file types that contain the data in text files with each
column of data points separated by a special character (e.g. semicolon, TAB etc.). Import
of generic data is managed by a generic data import dialog that allows you to specify

which column contains what kind of data. Using this dialog it should be possible to
import arbitrary microarray data. Since the generic import mechanism does not work for
some data formats (like the tab-separated raw text files produced by Agilent scanners),
customized settings have been supplied to allow import of these formats. Please set the
import type on the file import panel according to your microarray data type if it is listed.
If not, try generic import settings. If these fail we will be happy to create a customized
filter for your data, if you supply us with a sample of the format. When working with
generic data you’ll also have to know the layout of the chips you want to analyze –
presets for commonly used chip types are already included in Robin. This list can be
completed with your custom layouts.

3.4 Assumptions
The strongest assumption probably being, that not much changes in your experiment. I.e.
the assumption is that let’s say not more than 5, 10 % of your genes are changing and that
thus everything is comparable.
If this assumption is violated, you may not get satisfactory results, or worse wrong
results. To demonstrate this issue, just consider the probably oldest, easiest
normalization, namely median centering. Here, one just subtracts the median of one
experiment from each data point. In this extreme example, Gene1 and Gene2 are
completely switched off.
XP1

XP2

Gene1
Gene2
Gene3

10.2
3.2
4.5

0
0
4.7

Gene4
Gene5
Gene6
median

7.8

7.9
9.8
10.2
6.3

9.9
10
8.85

Table 1: Experiment before normalization
XP1
Gene1
Gene2
Gene3
Gene4
Gene5
Gene6

XP2
1.35
-5.65
-4.35
-1.05
1.05
1.15

-6.3
-6.3
-1.6
1.6
3.5
3.9

Table 2: Experiment after normalization
As an effect, Genes 5 and 6 seem to be upregulated, even though they were unchanged.
These effects would disappear in this case, if also some genes were turned on, which
often might be the case, but if you have strong suspicions, that very many genes change,
and/or that these change in one direction only, you might have to consult an expert
statistician.

Walkthroughs

The following sections of the manual provide step-by-step walkthroughs through
microarray data analysis using Robin. Since Affymetrix data analysis is the most
common task it is described in all detail. The workflows for two color and generic single
channel analysis resemble the Affymetrix workflow and are hence described in an
abbreviated fashion, focusing on the steps that are different from the Affymetrix analysis
procedure.
A common step at the beginning of all workflows is choosing the project directory. This
directory will be used to store all data and results related to the analysis at hand. If you
want to continue or modify a previous analysis, you can do that by simply choosing an
existing project directory. Robin will import the data and settings from the project folder
allowing you to conveniently modify the settings and run a new analysis. To distinguish
the new analysis results from the imported data, Robin requires you to specify a name
extension that will be appended to the imported project’s name to generate a new folder
holding the results of the modified analysis run. If you import e.g. project AFFYTEST
and specify “NEW” to be the extension, the new results will be placed into the directory
AFFYTEST_NEW to make sure that previous results won’t be overwritten.
PLEASE NOTE that the project import feature only works with analysis projects that
have been generated using Robin version 1.1 or higher. Trying to import a project
from an older version will generate an error message. The version number is always
displayed in the title bar of Robin’s main window.

4.1 Using Robin to analyze Affymetrix microarray data
Firstly, when using Robin, you have to localize your CEL files. Robin comes preinstalled
with specialized CDF files for a small selection of organisms (arabidopsis, maize, lotus,
yeast etc.), when dealing with other organisms, you will need an internet connection, so
Robin can use the Bioconductor framework to install missing CDF files. The INFO
button can be used to display some details about the imported CEL files such as
microarray type, algorithm parameters and all the technical data included in the header
section of the CEL file.

Figure 20: Importing CEL files into Robin

After having selected your CEL files, you are presented with various options to
investigate into the quality of the arrays.

Figure 21: Quality control options available for Affymetrix(r) arrays in Robin.

The ”expert options“ box is not shown by default – the preselected values there can be
used to correctly analyze most standard experiments. If you activate the expert settings
box you can explicitly choose which normalization method, p-value correction and
general analysis strategy is to be used on your data.
4.1.1 Quality Control
After running the chosen quality control methods on your data, Robin will present a
summary page showing thumbnails of the generated plots (see Figure 22). Clicking on
the individual rows will open the images in full size and offer a possibility to save the
image. PLEASE NOTE: You don’t have to open each image individually and save them
manually – all generated quality control plots will automatically be saved together with
the results of your analysis.

Figure 22: Quality analysis summary page.

Some of the quality assessments functions may have issued warnings – clicking on the
small warning icon will open an info panel that tells you more specifically why the
warning was generated. For example the RNA degradation analysis may have identified
chips that display slopes higher than the accepted threshold or whose slopes deviate by
more than 10 per cent from the median slope (see section 5.1.5 for details). Individual
chips displaying an extraordinarily bad quality in the PLM-Plot (see 5.1.3) or MA Plot
(see 5.1.2) can be excluded from further analyses by checking the “Exclude” box. Section
5 describes all available quality control methods in detail and gives examples of good and
bad quality check results.
4.1.2 Experiment design and statistical analysis
The next step in the analysis workflow is the assignment of the chips to groups of
biological replicates. NOTE: Robin analyses all replicates as biological replicates – there
is no way implemented yet that allows for proper consideration of technical replicates.
Please be aware that if technical replicates are imported the statistical test outputs will not
be sound any more.
You can choose a descriptive unique name for each group of replicates (like “mutant”,
“wildtpye” etc : see Figure 23). After sorting the chips, clicking “next” will proceed to
the graphical experiment designer. Here the user can set up the comparisons that are to be
made by CTRL-click-dragging connections between the groups (see Figure 24).

Figure 23: Sorting of replicate experiments into named groups.
4.1.2.1 Single factor analysis
If only a single experimental factor is varied in the experiment (e.g. when comparing
“wildtype” and a “mutant” genotype under identical environmental conditions), direct
comparisons between the groups are defined by simply dragging an arrow from the
“wildtype” to the “mutants” box on the left panel of the graphical designer screen (Figure
24.1).
4.1.2.2 Two factor analysis
If experiments with more than one varying experimental condition are to be analysed the
user can combine groups into “meta groups” and define comparisons of meta groups by
dragging connections between them. First, the user has to link two groups of replicate
chips by an arrow defining the direction of the comparison between them. To define a
“meta group”, the user has to select these two simple groups of replicates by clicking or
dragging a box around them and then click “create metagroup”. An orange box that is
named after the comparison between the two selected groups will be added to the
designer pane (see Figure 24, panel 3). Subsequently, comparisons can be defined
between “meta groups” by simply drawing arrows accordingly.
In the example experiment shown in Figure 24, mutant and wild type plants were
compared both under stress and normal conditions - so the experiment varies in two
dimensions with genotype (wild type or mutant) being one factor and treatment (stress,

no stress) being the other. The first four direct comparisons (Figure 24.2) will yield the
genes that are responding to the treatment in the wild type (“wildtype – wildtype stressed)
and the mutant (“mutant – mutant stressed”), which genes respond differently between
the genotypes under normal conditions (“wildtype – mutant”) and stress conditions. The
fifth comparison defined between the meta groups named “widtype – wildtyope stressed”
and “mutant – mutant stressed” will extract genes that generally respond differentially to
stress in the two genotypes (“(wildtype – wildtype stressed) – (mutant – mutant stressed”)
– this is also referred to as the interaction term; see Figure 24.3).
PLEASE NOTE: The direction of the arrow specifies the direction of the
comparison. When the arrow points from the wildtype to the mutant this should be
read as “wildtype minus mutant”. Genes showing a higher expression in the mutant
when compared to the wildtype will accordingly yield a negative log2-fold change
value as a result!

Figure 24: Setting up the experiment using Robins graphical designer.

The experiment designer panel also offers an expert option box that enables the
experienced user to influence specific parameters of the statistical inference. By default,
the normalization method used for the main analysis will be the same that was chosen on
the quality check panel (see Figure 24; if nothing was changed the default will be robust

multi array averaging - RMA) to ensure consistency between the quality check and
differential expression statistics. The user can define significance cut-offs like discarding
genes that show a log2 fold change in expression lesser than 1 (i.e. less than 2-fold up- or
down regulation) and genes showing a p-value greater than e.g. 0.05 (i.e. 5% false
discovery rate is accepted). A choice of multiple testing methods is available for the
inference of differentially expressed genes:
1) “separate” – Does the multiple testing for each comparison (contrast) separately.
Using this method, each specific comparison will always give the same result
irrespective of the set of comparisons being made in the analysis. It is the simplest
method available as it does not consider multiple testing adjustment between the
comparisons and assumes the same raw p-value cut-off for all comparisons
(which might be very different).
2) “global” - Implements multiple testing correction across all comparisons and
probes simultaneously ensuring a consistent p-value cut-off across all
comparisons.
3) “hierarchical” – Does p-value adjustment first for all genes and then across
comparisons, which offers more statistical power to control the family-wise error
rate when using the method described by [8] for p-value adjustment.
4) “nestedF” – First does p-value adjustment for all genes and uses a nested F test to
classify the comparisons as significant or not for the selected genes.
Users that are familiar with R programming can activate the “preview R script”-mode in
which all scripts generated by Robin are shown in an internal editor for review and
modification prior to execution. Even if this option was not chosen, all R scripts
generated by Robin will be written to the “source” folder in the final output directory.
When the design step is completed, clicking the “Next” button will first open a file
browser asking for a location to save the results to and then move on to the execution of
the analysis. After completing the calculations, Robin will show a summary of the
warnings generated during the workflow (if any) and offer options to exit, restart or
modify the current experiment.

Figure 25: Two color data import wizard. Robin automatically removes header sections from
different tabular file formats and extracts the column headers. The user has to define which column
contains which data (1) by assigning the proper column names to the required column fields. After
choosing a chip layout from the list of layout presets (or defining a new layout and saving it as a
preset; see 2), the user can save the import settings (3) and reuse them later when importing data of
the same format.

4.2 Analysing two-colour microarray data
The first step when working with two colour microarray data is data import. Robin
provides a wizard dialog that helps the user to import various import formats with the
only restriction that the data has to be provided in plain text format (.csv, .tab etc).
Loading MS Excel worksheets directly is not supported (yet). Aside from this any kind of
tabular data can be used. When importing data, the user only needs to know which
column separator was used. Layouts of frequently used microarray types are included as
clickable presets in the layout preset list – if the layout of your favorite chips is not
included in the list, you can define a new layout and save it for later use. The minimal
data required to analyze two color chips is an identifier uniquely identifying the oligos /
cDNAs spotted on the chip and the red and green channel foreground and background
signal intensities. The table view on the left half of the import dialog facilitates choosing
the column containing the required data, and after specifying the column names under
“Required columns” the information needed to import the data is complete. Robin will
create copies of the input files that are stripped off any header text and checked row-wise

for data format consistency. The processed input files will be placed in a separate folder
in the output folder.

Figure 26: Defining the RNA targets table for two-color microarrays

The next piece of information Robin needs is which different RNA samples (RNA
targets) have been hybridized to which channel on which chip. This can be conveniently
entered on the targets table panel (see Figure 26). Robin will run some checks on the
input to assert consistency. Analogous to Affymetrix data analysis the next panel
provides a choice of quality check methods adapted to two color arrays and an expert
settings box granting deeper control of the analysis parameters (See Figure 27).
Each step of the normalization process, namely background correction, within-array
normalization and between-array normalization can be configured separately to o
Quality check results will be summarized in a list resembling Figure 22. Depending on
the amount of factors being varied in the experiment (i.e. the amount of different RNA
samples hybridized) clicking “Next” on the quality check panel will either directly start
the main analysis (e.g. in a simple two sample comparison) or open the graphical
experiment designer panel (see Figure 24). Experiment layout is done exactly as for
Affymetrix arrays – please refer to section “Experiment design“ for a detailed
description.

Figure 27: Quality check and expert settings for two color microarrays

4.3 Analysis of generic single channel arrays (e.g. Agilent)
Analysis of generic single channel arrays resembles the workflow for two color chip
analysis in the largest part. The flexible import dialog (see Figure 28) allows for
configuration of any tabular text file based data. Please note that you have to specify
whether the data originates from an Agilent scanner prior to import to make sure that
Robin can correctly remove the header section of the data files. Robin will process the
input according the configuration chosen in the import dialog and create cleaned-up
working copies, leaving the original data untouched. In the next step, analogous to
Affymetrix data analysis, several assessment methods can be chosen to investigate into
the chips’ quality. Since most generic single channel chips are not based on a probeset
design (several probes per target) but only contain one probe per target transcript, the
probeset specific quality checks available for Affymetrix arrays (i.e. PLM-based
analyses, RNA degradation plot) cannot be used.

Figure 28: Import dialog for generic single channel microarray data.

Following the review of quality check results as depicted on Figure 22 and described in
section 4.1.1, the individual chips have to be organized into groups of biological
replicates. Depending on the statistical analysis strategy chosen (rank product- or linear
model-based) two or more groups can

Figure 29: Quality check and expert settings panel for non-Affymetrix single channel microarrays

4.4 Functional annotation of the results
The MapMan project provides functional classifications, so called “Bins”, for a wide
range of completed genomes and microarray platforms. These classifications are
available in the form of mapping files that contain entries for each gene (probe; probeset)
assigning them to functional Bins. Multiple assignments are possible in cases where e.g. a
transcription factor is member of a certain family of transcription factors (e.g. MADS box
factors) and is known to be involved in a specific biological process (e.g. flower
development). All available mapping files can be freely downloaded by academic users
via the MapManStore webpage (http://mapman.gabipd.org/web/guest/mapmanstore). At
the end of the analysis workflow, Robin will show the result annotation dialog (see
Figure 30) asking the user whether functional annotation information should be merged
into the results. To integrate the annotation, the user simply has to choose the correct
mapping file from the drop-down list and click “Annotate”. If an appropriate file
matching the microarrays platform used is not included in the list, it might be available
for download in the MapManStore. In case the MapManStore is also not containing a
mapping file for your favorite platform please contact us via the MapMen website’s
support forum (http://mapman.gabipd.org/web/guest/forum). Robin tries to interleave the
chosen mapping file with the results and finally displays an informative summary giving
the numbers of identifiers found in the mapping and the results file that could be
matched. If the mapping file that was chosen is not compatible with the microarray
platform that was used to generate the data Robin will issue a warning. Theoretically, if

mapping file and chip type match perfectly, 100% of the identifiers in the result file
should be contained in the mapping, however raw result files often contain several
control spots/probesets that are not included in the mapping files (that are only containing
functional information for genes). This might lead to numbers lower than 100% but
usually above 90%.

Figure 30: Result annotation dialog. If the microarray platform is supported by the MapMan project, the
results can be augmented by adding MapMan functional classification information. The Robin installer
package contains a variety of annotation files for common microarray platforms – more can be downloaded
from the MapManStore repository

Chip quality assessment

When analysing your own primary microarray data or reanalysing data that is publicly
available the first step is quality assessment. Individual chips displaying a very bad
quality might strongly impact the final results of your microarray experiment and hence
lead to incorrect biological assumptions. Chip quality can be affected on different levels
and Robin offers a range of informative plots that cover many different aspects of the
chip data quality. In the following section these methods will be described in detail.

5.1 Affymetrix chip quality checks
5.1.1

Analysis of signal intensity distribution

Figure 31: Box plots (left panel) and smoothed signal intensity densities (right panel). The red circles
highlight individual chips that show strong outlier behaviour indicating low quality.

Box plots of the unnormalized expression values on each chip give a global overview of
the signal intensity distributions. Ideally all chips should have a comparable distribution
already before normalization (see Figure 31, left panel). Another way to visualize the
distribution of signal intensities is plotting smoothed histograms of the (log2) signal
intensity of all perfect match (PM) probes (see Figure 31, right panel). The red circles
point out outliers.
5.1.2

MA plots

Figure 32: MA plots and box plots. The left panel shows an unobjectionable behavior while the data
displayed on the right panel strongly deviates from normal values. In the box plot (see Figure 31) the
two highlighted chips are also clearly showing an outlying intensity distribution.

On the MA plots, the average log2 probe signal intensity A = ½ * (logR + logG) is
plotted against the log2 fold change in expression M = logR – logG. In the case of
Affymetrix and other single channel chips G is a synthetic chip created from the median
expression values of all chips in the input. For two-color chips the M values are
calculated as the log2-fold ratio of the normalized red and green signal intensity. Based
on the assumption that most of the genes will not show differential expression. Robin will
issue a warning if more than 10% of the genes show a greater than two-fold change (log2
fold change of 1, resp. M >= 1 or M <= -1) in expression. The actual percentage of genes
showing a higher than two-fold change in expression is shown on the plot as “%>LFC1”.
To capture artifacts that are related to the signal intensity A, a lowess fit curve over the
data points is calculated (see Figure 32). If the integral of the absolute values of the
lowess curve over the zero line is greater than 1 another warning is generated indicating
that there seems to be a signal intensity-dependent artifactual effect (the integrals value is
shown on the plot as “I”). The median M value also given on each plot is usually less
informative as can be seen on the right panel of Figure 32 where the median shows a
normal value while the data quality is severely affected. MA plots are available for all
microarray types.

5.1.3 False color images of probe level model weights
A linear model is fitted (using RMA style, more later) to your probeset (i.e. your 11
probes), using the boundary, that the effect of all probes in each probeset is zero.
Weights are attached to the different probes in each probeset, low weights are colored in
green (i.e. they were not important for the model), and high values in white.

Figure 33: PLM weight image. Here a potential artifact is visible in the upper right corner.

For some examples of probe level model (PLM) image plots showing different artifact
have a look at: http://plmimagegallery.bmbolstad.com/. The weights applied to each
probe are visualized as pseudo chip images (see Figure 33). Areas on the chip that show
consistently low probe weights might indicate technical problems cause e.g. by washing,
dust on the chips or scanner malfunction. PLM-based analyses (pseudo images, NUSE
and RLE, see next section) are only available for Affymetrix chips.
5.1.4 Normalized unscaled standard error and relative logarithmic expression
The normalized unscaled standard error (NUSE) plots show the standard error estimates
of probe level models for each probeset standardized across all chips so that the median
standard error for each probeset is 1. The NUSE plot visualizes the distribution of the
standard errors for each individual chip. Chips showing a consistently increased standard
error are probably of lower quality. The relative logarithmic expression (RLE) is
computed by comparing the logarithmic expression of each probeset on each chip to the
median expression of this probeset across all chips. According to the assumption that
most of the genes are not differentially expressed under a given treatment, the median
RLE value should be zero. Individual arrays showing a deviation of the median from the
zero line and/or increased spread on a box plot of the RLE values are presumably of low
quality.

Figure 34: Relative logarithmic expression and normalized unscaled standard error plots. Note the
two arrays that are consistently showing low quality behaviour across both plots

5.1.5 RNA degradation
In each probeset the probes are ordered directionally from the 5’ to the 3’ end. Average
probe intensities are plotted by probe number. The resulting plot visualizes the global
RNA degradation state of the samples used. Generally, RNA degradation is more active
at the 5’ terminus - signal intensities of the probes closer to this terminus are accordingly
lower. If the slope of the probe intensity curve is exceeding a certain threshold value or
the slopes of individual chips are deviating from the median by more than 10% Robin
issues a warning (see Figure 35 ). As this kind of analysis relies on probesets consisting
of more than one probe, it is only available for Affymetrix arrays.

Figure 35: RNA degradation plot.

5.1.6 Scatter plots
If the scatter plot option is chosen, Robin plots pair wise comparisons of the normalized
expression values of all possible combinations of two chips. NOTE: Using this feature on
a large number of input chips will generate a lot of images and might increase calculation
time and memory demand significantly. The scatter plots are useful for assessing whether
two replicate chips are showing similar behavior. If they do, the points should lie on a
perfect diagonal line. Replicate samples that are not showing this behavior strongly
suggest a problem (e.g. accidentally swapped or mislabeled samples, technical problems
on one individual chip, strong RNA degradation effects etc. )

Figure 36: Scatter plots of normalized expression values. The left panel shows two biological
replicates of acceptable reproducibility plotted against each other while the right panel shows two
chips with very different expression profiles. Identical values are plotted on the blue (0) line; The red
lines indicate a log2-fold difference of 1.

5.1.7 Principal component analysis and hierarchical clustering
The data generated in a microarray experiment can be understood as a matrix of p
columns, where p is the number of chips used, and n rows, where n is the number of
genes (probesets, probes) measured. Such a dataset could be visualized as a set of n
points in a p-dimensional space. The principal component analysis reduces the
dimensionality of the dataset by finding a small number of linear combinations of the
data that explain most of the variance in the dataset. These are the principal components
(PCs). The principal components are ordered by the amount of variance explained and
subsequently the first two PCs are plotted against each other. The example on the right
panel of Figure 37 shows a PCA on eight samples, six of which are grouping closely
together on two groups of three replicates while the last two are completely unrelated.
The hierarchical clustering method performs a clustering of the Pearson correlation of
raw normalized expression estimates for each chip a the distance measure for the

clustering. Chips that show similar expression profiles should cluster together when using
this approach. The results are shown as a dendrogram where the branch length depicts 1correlation score. The hierarchical clustering gives an overview of the internal structure
of the data and identifies experimental conditions that generate similar global responses
in gene expression. Replicate chips should always cluster closely. Accordingly, the
samples six samples belonging to two groups of three replicates form distinct clusters
while the last two are very distant from them and each other. The PCA and hierarchical
clustering analyses are only available for Affymetrix and generic single channel
experiments.

Figure 37: Principal component analysis and hierarchical clustering of normalized expression values.
The red circles highlight chips with strongly deviating behavior.

5.2 Two color microarray quality checks
Quality check methods that are specific to two color arrays are described in the following
section. Some quality checks that can be run for all chip types – these will not be
described again below (e.g. MA plots).

5.2.1

Image plots of two-color background intensities and unnormalized M values

Figure 38: Two-color microarray background signal intensities and unnormalized M value plots.

The background signal intensities measured on two-color and generic single channel
chips (not shown here) can be visualized as false-color images. This is very useful for the
identification of washing artifacts like those visible on the two left plots in Figure 38.
Both color channels display obvious traces of droplets, so called washing artifacts. In the
worst case these artifacts carry over to the foreground signal and cannot be eliminated by
background subtraction. If this happens they would also be visible on the M value plot
shown on the right side of Figure 38 (in the example given, however, this is not the case).
The M value plots is simply a false-color image of the merged red and green foreground
signal intensities measured on the chip prior to normalization.

5.2.2

Overview of two color signal intensity distribution

Figure 39: Two color microarray signal intensity distribution assessment. Upper left: Smoothed
signal intensity distributions are shown for the red and green channel separately for each chip.
Lower left: Box plots of the raw foreground signal intensities for each chip and color channel. The
left hand plots show data prior to normalization while the plots on the right half show normalized
data. The title of the right hand intensity distribution plot reflects the chosen normalization settings.
For the shown example within-array printtiploess normalization without background correction and
between-array scaling were performed.

Analogous to the box plots and smoothed histograms that are generated for Affymetrix
arrays (see section “Analysis of signal intensity distribution” and Figure 31).

Data normalization

When analyzing microarray experiments, the raw data obtained by scanning probe
intensities on the chips can be strongly influenced by different technical effects. These
can be different levels of background signal due to inhomogeneous washing,
systematically deviating probe signal intensities due to different scanner settings (or even
same settings on different devices), probe-specific hybridization affinity effects etc.
To make sure that the microarrays you are going to analyze in a differential expression
experiment can actually be compared it is very important to eliminate these effects. This
process is called normalization. Since the first application of microarray technology many
different normalization techniques have been proposed - the most widely used ones are
available in Robin. If your favorite method is not among them feel free to contact us.
Generally, all normalization methods consist of two (three in the case of Affymetrix
GeneChip microarrays) major steps: (I) background correction, (II) normalization of
background-corrected probe level data and (III) summarization of probe-level data to
yield one expression measure per probeset.

6.1 Single channel microarray normalization
6.1.1

Normalization methods for Affymetrix arrays

6.1.1.1 RMA [9]
The robust multi-array average (RMA) normalization method proposed by [9] has been
widely used and accepted as a well-performing approach for inference of differential
gene expression from Affymetrix GeneChip(R)-based experiments. The RMA procedure
first does background correction based on the assumption that the background signal is
normally distributed while the real probe signal is exponentially distributed (convolution
model). The background-corrected data is then quantile normalized. Quantile
normalization assumes that the distribution of gene abundances is nearly the same across
all chips. A reference distribution is created using the pooled intensity probe distribution
on all chips. To normalize each chip, the quantile of each intensity value is computed and
then the original value is transformed to the corresponding quantile’s value on the pooled
reference chip (that is created by averaging the values of each probe across all chips in
the experiment). In the last step, a linear model is fitted to the corrected, normalized and
log2-transformed probe intensities:
with
being
a probe affinity effect, µi representing the log2 expression level on array I and
representing a noise error with mean = 0. The model parameters are estimated using
the median polish procedure that is robust against outliers.
6.1.1.2 GCRMA [10]
The GCRMA method adds a more refined background adjustment to the standard RMA
normalization. This background adjustment method models the different hybridization
affinities for each PM-MM probe pair based on its nucleotide sequence which results in a
more precise estimate of the background. While the standard RMA approach ignores the

MM probe-derived signal, GCRMA subtracts a shrunken MM value that was corrected
for its binding affinity from the PM signal. More specifically, the model assumes:
and
with O being the optical noise, N
being the non-specific binding effect and S being proportional to the real concentration of
the target transcript. Hence, the model takes into account the observation, that the MM
signal may contain real transcript signal.
6.1.1.3 MAS 5.0 (Affymetrix Microarray Analysis Suite 5.0 )
In contrast to the other normalization methods described here, MAS 5.0 works on a single
chip basis. Briefly, each chip is divided into 16 (4x4) equally sized grid regions and a
background and noise signal value is calculated based on the lowest 2% of measured
probe intensities for each grid region. The probe intensities in each grid block are
adjusted to the weighted average of the background signal where the weight is dependent
on the (euclidean) distance of the probe to the centroid of the grid block. In the next step
the perfect match (PM) and mismatch (MM) probe pairs are considered. The original
purpose of the PM/MM probe pair design was to use the MM probe signal intensity as
unspecific signal intensity and subtract it from the PM probe to generate a reliable probe
signal. However it turned out that up to 30% of the MM probes display a signal intensity
that is higher than the corresponding PM probe so that a simple subtraction would yield
negative values. To work around this problem, the so called ideal mismatch (IM) was
introduced. If the PM intensity is larger than MM, IM equals the MM value. In cases
where PM=MM or PMLFC1”
Percentage
of
probes/probesets displaying a log2-fold
change greater than 1. Based on the
assumption that most of the genes will not
show differential expression, a warning will
be issued of more than 5% of the probes
show an absolute log2 fold change higher
than
1
(meaning
2-fold
upor
downregulation).
“median” – Gives the median value of M. In
an ideal experiment this should be zero.
TMP_plm1..n.png

qualitychecks

Shows pseudo images of the model weights
for each probe after fitting linear probe level
models. Low weights are indicated by
stronger red or green color

TMP_rle.png

qualitychecks

Boxplots of the relative logarithmic
expression (RLE) values on each chip. The
boxes should be centered around zero.

TMP_nuse.png

qualitychecks

Boxplots of the normalized unscaled
standard errors (NUSE) of the probe level
models on each chip. The plots should be
centered around zero and display
comparable spread.

TMP_scat1..n1..m.png

qualitychecks

Scatter plots of all possible combinations of
two chips. The normalized expression values
are plotted against each other.

A, S

TMP_rna.png

qualitychecks

RNA degradation plot (only available for
Affymetrix arrays). Shows mean intensities
of probes in all probesets ordered from the
5’ to the 3’ end of the target sequence. This
plot allows a good overview of the global
RNA quality on the chips.

TMP_bground.png

qualitychecks

Pseudo images of the background signal
intensities measured on two color or nonAffymetrix single channel arrays.

T, S

TMP_mvalues.png

qualitychecks

Pseudo image plots of the unnormalized M
(= log2 ratios of green and red signal) values
of two color chips.

XYZ_robin

input

Cleaned-up copies of the input raw data files

T, S

EXP_NAME.main.analysis
.R

source

The R script file containing code for the
main analysis. The file can be used as a
starting point for customizations of the
analysis. Please note that the file contains
some hard coded paths.

qualityChecks.R

source

Quality checks R source code file.

MAplot_GRPa-GRPb.png

plots

The plots folder contains some informative
plots on the results of the main analysis. MA
plots are generated for each contrast that was
defined on the experiment designer panel.
Genes that are significantly differentially
expressed according to the statistical
analysis are highlighted by red circles.

vennDiagram_down/total/
up.png

plots

Venn diagrams showing the number of
significantly up- down- and total regulated
genes for up to four contrasts.

PCAplot.png

plots

Principal component analysis plot analogous
to the plots generated in the quality checks
section. This plot does in addition highlight
the groups of replicate experiments as
defined on the groups panel.

A, S

REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.

Cock, P.J., et al., The Sanger FASTQ file format for sequences with quality
scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res, 2010. 38(6):
p. 1767-71.
Langmead, B., et al., Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol, 2009. 10(3): p. R25.
Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor
package for differential expression analysis of digital gene expression data.
Bioinformatics, 2010. 26(1): p. 139-40.
Robinson, M.D. and G.K. Smyth, Moderated statistical tests for assessing
differences in tag abundance. Bioinformatics, 2007. 23(21): p. 2881-7.
Robinson, M.D. and G.K. Smyth, Small-sample estimation of negative binomial
dispersion, with applications to SAGE data. Biostatistics, 2008. 9(2): p. 321-32.
McCarthy, D.J., Y. Chen, and G.K. Smyth, Differential expression analysis of
multifactor RNA-Seq experiments with respect to biological variation. Nucleic
Acids Research, 2012.
Anders, S. and W. Huber, Differential expression analysis for sequence count
data. Genome Biol, 2010. 11(10): p. R106.
Holm, S., A stagewise rejective multiple test procedure based on a modified
Bonferroni test. Scandinavian Journal of Statistics, 1979(6): p. 65-70.
Irizarry, R.A., et al., Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics (Oxford, England), 2003.
4(2): p. 249-64.
Wu, Z., et al., A Model-Based Background Adjustment for Oligonucleotide
Expression Arrays. Journal of the American Statistical Association, 2004.
99(468): p. 909-917.
Ritchie, M.E., et al., A comparison of background correction methods for twocolour microarrays. Bioinformatics, 2007. 23(20): p. 2700-7.
Smyth, G.K. and T. Speed, Normalization of cDNA microarray data. Methods,
2003. 31(4): p. 265-73.
Smyth, G.K., Linear models and empirical bayes methods for assessing
differential expression in microarray experiments. Statistical applications in
genetics and molecular biology, 2004. 3: p. Article3.
Breitling, R., et al., Rank products: a simple, yet powerful, new method to detect
differentially regulated genes in replicated microarray experiments. FEBS Lett,
2004. 573(1-3): p. 83-92.
Hong, F., et al., RankProd: a bioconductor package for detecting differentially
expressed genes in meta-analysis. Bioinformatics, 2006. 22(22): p. 2825-7.

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
Linearized                      : No
Page Count                      : 55
PDF Version                     : 1.4
Title                           : robin_and_robina_manual_030212
Author                          : Marc Lohse
Subject                         : 
Producer                        : Mac OS X 10.6.8 Quartz PDFContext
Creator                         : Microsoft Word
Create Date                     : 2013:06:24 11:21:20Z
Modify Date                     : 2013:06:24 11:21:20Z
Apple Keywords                  :

EXIF Metadata provided by EXIF.tools

Robina Manual

Navigation menu

Versions of this User Manual:

Views

Navigation