TreeCmp_manual Tree Cmp Manual

User Manual: Pdf

Open the PDF directly: View PDF PDF.
Page Count: 10

DownloadTreeCmp_manual Tree Cmp Manual
Open PDF In BrowserView PDF
TreeCmp: comparison of trees in polynomial time – the manual

1. Introduction
A phylogenetic tree represents historical evolutionary relationship between different species or
organisms. There are various methods for reconstructing phylogenetic trees. Applying those
techniques usually results in different trees for the same input data. An important problem is to
determine how distant two trees reconstructed in such a way are from each other. Comparing
phylogenetic trees is also useful in mining phylogenetic information databases. The TreeCmp
application was designed to compute distances between arbitrary (not necessary binary)
phylogenetic trees.

2. Input data format
The TreeCmp software was designed to support BEAST (http://beast.bio.ed.ac.uk/) and
MrBayes (http://mrbayes.csit.fsu.edu/) date files, where phylogenetic trees are stored in the
Newick format. Note that plain text files containing only trees in this format are supported as
well.

1

3. Running TreeCmp
The TreeCmp application is distributed as a zip archive. In order to unpack the file any
software supporting zip compression, for example free software 7-zip (http://www.7-zip.org/),
can be used. In order to run the TreeCmp application Java VM in version at least 1.5 is
required.

3.1.

Directory structure

bin

config
data

examples
align
beast
mr_bayes
plain
plain2
prune
ref_tree
scaled
src

3.2.

Description
contains main jar file: TreeCmp.jar and lib folder with
necessary open source libraries: pal-1.5.1
(http://www.cebl.auckland.ac.nz/pal-project/) and
commons-cli-1.2 (http://commons.apache.org/cli/)
contains xml configuration file
contains text files with pre-computed data (average value
and other statistics) for all the 8 metrics under the two
models of generation of random binary trees: the Yule
model and the uniform model.
contains subdirectories with examples
contains an example of creating alignments
contains an example input file created using BEAST
contains an example input file created using MrBayes
contains an example input file with plain trees
contains an example input file with plain trees
contains an example of comparing trees having different sets
of taxa
contains an example of comparing a single tree to a set of
trees
contains an example with reporting scaled values of chosen
metrics
contains source code of this application

Command line syntax

Usage:
java -jar TreeCmp.jar -w |-s|-m|-r  –d  -i
 -o  [-N] [-P] [-I] [-A|-O]

Note that options order is important. See section 4 for details regarding output file format for a
particular combination of the options.
Mandatory switches:
•

The comparison mode options (only one option should be specified,):
o –s – overlapping pair comparison mode; every two neighboring trees in the
input file are compared,

2

o -w  – window comparison mode; every two trees within a window with
a specified size are compared – the average distance and the standard deviation
go to the output file,
o –m – matrix comparison mode; every two trees in the input file are compared.
o -r  – single tree to all trees mode. Each tree in the input file is
compared to the single referenced tree.
Details of the computation flow in each of these case are explained in the pictures
below.

Pair comparison (-s)
Tree 1
Tree 2
Tree 3
Tree 4
Tree 5
Tree 6
…
…
Tree n-2
Tree n-1
Tree n
Input

Window comparison (-w 3) Matrix comparison (-m)
Tree 1
2 rows
Tree 2
1 row
Tree 3
(n-1) rows
Tree 1
2 rows
Tree 4
(n-2) rows
Tree 2
Tree 5
1 row
…
…
Tree 6
1 row
Tree n-1
…
…
Tree n
2 rows
Tree n-2
Tree n-1
1 row
Tree n

Output

Input

Output

Input

Output

Single (reference) tree to all trees mode (-r )
Tree 1
Tree 2
…
…
Tree n-1
Tree n

Reference tree
n rows

Input
•

Output

The metric option (-d). At least one and at most 8 metrics can be specified (numbers in
square brackets correspond to the reference list. Metrics should be separated by space
character.
Metrics for unrooted trees:
o ms – the Matching Split distance (Bogdanowicz and Giaro 2012),
o rf – the Robinson-Foulds distance (Robinson and Foulds 1981)
o pd – the path difference distance (Steel and Penny 1993),
o qt – the quartet distance (Estabrook 1985).

3

Metrics for rooted trees:
o mc – the Matching Cluster metric (Bogdanowicz et al. 2012),
o rc – the Robinson-Foulds metric based on clusters (Robinson and Foulds 1981),
o ns – the Nodal Splitted metric with L2 norm (Cardona et al. 2010),
o tt – the Triples metric (Crichlow et al. 1996).
Example: -d ms rf
•

IO options (both options should be specified):
o -i  – input data file with trees in the Newick format,
o -o  – output data file with the results of computations.

Optional switches:
•

General options:
o –N – report normalized distances δm for a particular metric m (Bogdanowicz et
al. 2012; based on an average value from pre-computed data). This functionality
is available for trees with number of leaves between 4 and 1000. Note that
normalized tree similarity for a particular metric m (NTSm) can be expressed by
normalized distance as follows: NTSm. = 1 - δm (Bogdanowicz et al. 2012).
o –P – prune compared trees if needed. This option is design to allow comparing
trees having different (partially overlapping) sets of taxa. After using this option
three additional columns appear in the output file (see section 4 for details).
o –I – -include summary section in the output file.

•

Matching metric specific options (only one option should be specified).
o –A – Generate alignment files – this option should be used together with
selection the MS or MC metrics. As a result additional files containing aligned
splits or clusters are generated:
- [output_file_name].out.aln_MS.txt,
- [output_file_name].out.aln_MC.txt,
where [output_file_name] is the file name specified after -o option.
o -O – use special implementations of MS/MC metrics optimized for similar trees.

Note that if a rooted tree (with bifurcation in the root) is compared using metrics for unrooted
trees the tree will be automatically transform into unrooted one, i.e., the bifurcation will be
replaced with an arbitrary trifurcation.

4

4. Output data format
All output files created by the application regardless of chosen mode have similar structure.
Output files are tab separated text files (TSV), which means that they can be easily read by
various data analysis software (e.g. MS Excel, R, OpenOffice.org). An output file consists of
two sections. The first section contains formatted in rows values of distances in selected
metrics. The second (optional) section contains summary data computed based on all rows that
appears in the first section.

4.1.

Basic output file structure

Base output file format for options -s, -m, and -w
No

Tree1

Tree2

MetricName_1

MetricName_2

…

MetricName_n

Comparison
number

Tree1
number

Tree2
number

Distance
value

Distance
value

…

Distance
value

Base output file format for option -r,
Tree

MetricName_1

MetricName_2

…

MetricName_n

Comparison number (= tree number)

Distance value

Distance value

…

Distance value

Tree, tree1, tree2 numbers

in the output file correspond to the number of the tree in the

input file.
The following table contains a mapping between available metrics and column names in the
output file that are related to them.
Metric name in
the output file
MatchingSplit
R-F
PathDiffernce
Quartet
MatchingCluster
R-F_Cluster
NodalSplitted
Triples

4.2.

Full metric name
the Matching Split distance
the Robinson-Foulds distance
the path difference distance
the quartet distance
the Matching Cluster metric
the Robinson-Foulds metric based on clusters
the Nodal Splitted metric with L2 norm
the Triples metric

TreeCmp command
line parameter
ms
rf
pd
qt
mc
rc
ns
tt

Additional columns (-P and -N options)

After using switch -P the following three columns appear additionally in the output file.
Tree1_taxa

Tree2_taxa (or
RefTree_taxa)

Common_taxa

Number of taxa in the
first tree

Number of taxa in the
second (or reference) tree

Number of taxa in
common

5

After using switch -N the following two columns per each chosen metric appear additionally in
the output file. These columns contain the value of the distance in a particular metric divided
by its empirical average value. If the number of common leaves in compared trees is out of
supported range (which is form 4 to 1000), then “N/A” value is inserted.
MetricName_toYuleAvg

MetricName_toUnifAvg

(Distance value)/(Empirical average
value in the Yule model)

(Distance value)/(Empirical average
value in the uniform model)

For details regarding generating phylogenetic trees under the Yule and uniform models see
(McKenzie and Steel 2000; Semple and Steel 2003).

4.3.

Summary section format (-I option)

Name

Avg

Std

Min

Max

Count

Metric
name 1

Average
value

Standard
deviation
value

Minimal value

Maximal value

Number of analyzed
values

Metric
name 2

…

…

…

…

…

…

…

…

…

…

…

…

…

…

…

…
Metric
name n

5. Useful Java VM parameters
In the case of an analysis of large trees the following exceptions might occur:
1. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
To solve the problem increase Java heap space memory limit using JVM option –Xmx
Example:
java –Xmx700m –jar TreeCmp.jar 

2. Exception in thread "main" java.lang.StackOverflowError at
pal.io.FormattedInput.skipWhiteSpace(FormattedInput.java:111)
at pal.io.FormattedInput.readNextChar(FormattedInput.java:131)
at pal.tree.ReadTree.readNH(ReadTree.java:81)
…..
at pal.tree.ReadTree.readNH(ReadTree.java:89)
To solve the problem increase Java thread stack size limit using JVM option –Xss
Example:
java –Xss1m –jar TreeCmp.jar 

These options can be used in conjunction.

6

6. Examples
6.1.

Running application to compare trees using MS

Input file: \examples\beast\testBSP.newick
Invocation:
java -jar TreeCmp.jar -w 2 -d ms -i testBSP.newick -o testBSP.newick_w_2.out -I

Console output:
TreeCmp version 1.0-b291
Active options:
Type of the analysis: window comparison mode (-w) with window size: 2
Metrics:
1. MatchingSplit (ms)
Input file: testBSP.newick
Output file: testBSP.newick_w_2.out
Additional options:
I - Include summary section in the output file.
----2011-08-27 16:03:17: Start of scanning input file: testBSP.newick
2011-08-27 16:03:17: End of scanning input file: testBSP.newick
2011-08-27 16:03:17: 11 valid trees found in file: testBSP.newick
2011-08-27 16:03:17: Start of calculation...please wait...
2011-08-27 16:03:17: 0.00% completed...
2011-08-27 16:03:17: 20.00% completed...
2011-08-27 16:03:17: 40.00% completed...
2011-08-27 16:03:17: 60.00% completed...
2011-08-27 16:03:17: 80.00% completed...
2011-08-27 16:03:17: 100.00% completed.
2011-08-27 16:03:17: End of calculation.
2011-08-27 16:03:17: Total calculation time: 62 ms.

Output file testBSP.newick_w_2.out:
No
Tree1 Tree2
1
1
2
2
3
4
3
5
6
4
7
8
5
9
10
--------Summary:
Name Avg
Std
MatchingSplit

6.2.

MatchingSplit
58.0000
24.0000
10.0000
13.0000
14.0000

Min
23.8

Max
Count
17.73583942191629

10.0

58.0

5

Computing normalized distances

Reporting distances divided by pre-computed empirical average values for random trees
(generated according to Yule and uniform models, -N option) can help in an interpretation of
the similarity level of analyzed trees in chosen metric. In the following example, the distance in
the MS metric of each tree from a given set to the reference tree is computed. Analyzed trees
have 15 leaves.

7

Input files:

\examples\sclaed\ref_tree.trees
\examples\sclaed\test_set.trees

Invocation:
java -jar TreeCmp.jar -r ref_tree.trees -d ms -i test_set.trees -o
test_set.trees.r.out –N

Output file test_set.trees.r.out:
Tree
1
2
3
4
5
6
7
8
9
10
11
12

MatchingSplit
43.0000
43.0000
41.0000
40.0000
43.0000
41.0000
43.0000
41.0000
39.0000
40.0000
0.0000
6.0000

MatchingSplit_toYuleAvg
1.0742
1.0742
1.0242
0.9992
1.0742
1.0242
1.0742
1.0242
0.9742
0.9992
0.0000
0.1499

MatchingSplit_toUnifAvg
0.9663
0.9663
0.9214
0.8989
0.9663
0.9214
0.9663
0.9214
0.8764
0.8989
0.0000
0.1348

Basic interpretation:
• Tree number 11 has the same topology as the reference tree.
• Tree number 12 is very similar to the reference tree in comparison to similarly of
random on 15 leaves (the normalized distance is about 0.15 and 0.13 depending on the
random model).
• Trees with numbers 1 to 10 are approximately as similar to the reference tree as random
trees to each other (the normalized distance is close to 1).
In ordered to perform more advance similarity analysis, e.g. involving different model of
generation of random trees, user my need to use TreeCmp twice:
• to compute distances between custom set of random trees generated by other software,
e.g. Evolver application form PAML package
(http://abacus.gene.ucl.ac.uk/software/paml.html) to obtain the empirical average
distance in a particular metric or its distribution,
• to compute the distance between analyzed trees.

6.3.

Finding the most similar trees in the input file

The most convenient comparison mode for such purpose is a matrix mode (-m). In the
following example, the Matching Split distance is used.
Input file: \examples\plain2\plain2.trees
(a,(b,c),(d,e));
(a,b,(c,(d,e)));
(((a,b),c),d,e);
(a,(b,(c,d)),e);

Invocation:
java -jar TreeCmp.jar -m -d ms -i plain2.trees -o plain2.trees.m.out

8

Output file plain2.trees.m.out:
No
1
2
3
4
5
6

Tree1
1
1
1
2
2
3

Tree2
2
3
4
3
4
4

MatchingSplit
2.0000
2.0000
3.0000
0.0000
3.0000
3.0000

The most similar trees

Trees number 2, i.e.: (a,b,(c,(d,e))) and 3, i.e.:(((a,b),c),d,e) in the input file are the most
similar. In fact, they have the same topology (trees are assumed to be unrooted as metric for
unrooted trees is used) because their distance is 0.

6.4.

Exporting data to other applications: MS Excel, R

In order to export data to MS Excel open the output file in any text editor and use copy and
paste mechanism. Alternatively, you can open the input file directly in MS Excel application
using the tabular character as a filed separator.
In order to pass data to R (http://www.r-project.org/) it is convenient to have the TreeCmp
output file in a simple tabular form (therefore, it is recommended to avoid -I option, because it
results in generation the summary section, which disturb the tabular order). Such files can be
easily read by R environment by using for example the read.table function as follows:
treeCmpData<-read.table("C:\\Program
Files\\TreeCmp\\examples\\plain\\plain.trees.m.out", header = TRUE, sep = "\t")

In the example, the file to read “plain.trees.m.out” is placed in “C:\Program
Files\TreeCmp\examples\plain” folder.

9

7. License
Copyright (C) 2012, Damian Bogdanowicz
This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of
the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program.
If not, see http://www.gnu.org/licenses/.

References
1. Bogdanowicz D, Giaro K: Matching Split Distance for Unrooted Binary
Phylogenetic Trees. IEEE/ACM Trans Comput Biol Bioinform 2012, 9: 150-160.
2. Bogdanowicz D, Giaro K., Wróbel B. TreeCmp: comparison of trees in polynomial
time. Evol. Bioinform. 2012, in press.
3. Cardona G, Llabrés M, Rosselló F, Valiente G: Nodal distances for rooted
phylogenetic trees, J Math Biol 2010 61:253-276.
4. Critchlow DE, Pearl DK, Qian C: The Triples Distance for Rooted Bifurcating
Phylogenetic Trees, Syst Biol 1996, 45: 323-334.
5. Estabrook GF, McMorris FR, Meacham CA: Comparison of Undirected
Phylogenetic Trees Based on Subtrees of Four Evolutionary Units. Syst Biol 1985,
34:193-200.
6. McKenzie A, Steel M: Distributions of cherries for two models of trees. Math Biosci
2000, 164:81-92.
7. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci 1981,
53:131-147.
8. Steel MA, Penny D: Distributions of Tree Comparison Metrics – Some New
Results. Syst Biol 1993, 42:126-141.
9. Semple C, Steel M: Phylogenetics, Oxford University Press 2003.

10



Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Page Count                      : 10
XMP Toolkit                     : XMP toolkit 2.9.1-13, framework 1.6
About                           : uuid:b125adf4-ac8d-11e1-0000-d7ea18532112
Producer                        : GPL Ghostscript 9.05
Keywords                        : ()
Modify Date                     : 2012:05:30 10:33:42+02:00
Create Date                     : 2012:05:30 10:33:42+02:00
Creator Tool                    : PDFCreator Version 1.3.2
Document ID                     : uuid:b125adf4-ac8d-11e1-0000-d7ea18532112
Format                          : application/pdf
Title                           : TreeCmp_manual
Creator                         : user
Description                     : ()
Author                          : user
Subject                         : 
EXIF Metadata provided by EXIF.tools

Navigation menu