TreeCmp_manual Tree Cmp Manual
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 10
Download | ![]() |
Open PDF In Browser | View PDF |
TreeCmp: comparison of trees in polynomial time – the manual 1. Introduction A phylogenetic tree represents historical evolutionary relationship between different species or organisms. There are various methods for reconstructing phylogenetic trees. Applying those techniques usually results in different trees for the same input data. An important problem is to determine how distant two trees reconstructed in such a way are from each other. Comparing phylogenetic trees is also useful in mining phylogenetic information databases. The TreeCmp application was designed to compute distances between arbitrary (not necessary binary) phylogenetic trees. 2. Input data format The TreeCmp software was designed to support BEAST (http://beast.bio.ed.ac.uk/) and MrBayes (http://mrbayes.csit.fsu.edu/) date files, where phylogenetic trees are stored in the Newick format. Note that plain text files containing only trees in this format are supported as well. 1 3. Running TreeCmp The TreeCmp application is distributed as a zip archive. In order to unpack the file any software supporting zip compression, for example free software 7-zip (http://www.7-zip.org/), can be used. In order to run the TreeCmp application Java VM in version at least 1.5 is required. 3.1. Directory structure bin config data examples align beast mr_bayes plain plain2 prune ref_tree scaled src 3.2. Description contains main jar file: TreeCmp.jar and lib folder with necessary open source libraries: pal-1.5.1 (http://www.cebl.auckland.ac.nz/pal-project/) and commons-cli-1.2 (http://commons.apache.org/cli/) contains xml configuration file contains text files with pre-computed data (average value and other statistics) for all the 8 metrics under the two models of generation of random binary trees: the Yule model and the uniform model. contains subdirectories with examples contains an example of creating alignments contains an example input file created using BEAST contains an example input file created using MrBayes contains an example input file with plain trees contains an example input file with plain trees contains an example of comparing trees having different sets of taxa contains an example of comparing a single tree to a set of trees contains an example with reporting scaled values of chosen metrics contains source code of this application Command line syntax Usage: java -jar TreeCmp.jar -w|-s|-m|-r –d -i -o [-N] [-P] [-I] [-A|-O] Note that options order is important. See section 4 for details regarding output file format for a particular combination of the options. Mandatory switches: • The comparison mode options (only one option should be specified,): o –s – overlapping pair comparison mode; every two neighboring trees in the input file are compared, 2 o -w – window comparison mode; every two trees within a window with a specified size are compared – the average distance and the standard deviation go to the output file, o –m – matrix comparison mode; every two trees in the input file are compared. o -r – single tree to all trees mode. Each tree in the input file is compared to the single referenced tree. Details of the computation flow in each of these case are explained in the pictures below. Pair comparison (-s) Tree 1 Tree 2 Tree 3 Tree 4 Tree 5 Tree 6 … … Tree n-2 Tree n-1 Tree n Input Window comparison (-w 3) Matrix comparison (-m) Tree 1 2 rows Tree 2 1 row Tree 3 (n-1) rows Tree 1 2 rows Tree 4 (n-2) rows Tree 2 Tree 5 1 row … … Tree 6 1 row Tree n-1 … … Tree n 2 rows Tree n-2 Tree n-1 1 row Tree n Output Input Output Input Output Single (reference) tree to all trees mode (-r ) Tree 1 Tree 2 … … Tree n-1 Tree n Reference tree n rows Input • Output The metric option (-d). At least one and at most 8 metrics can be specified (numbers in square brackets correspond to the reference list. Metrics should be separated by space character. Metrics for unrooted trees: o ms – the Matching Split distance (Bogdanowicz and Giaro 2012), o rf – the Robinson-Foulds distance (Robinson and Foulds 1981) o pd – the path difference distance (Steel and Penny 1993), o qt – the quartet distance (Estabrook 1985). 3 Metrics for rooted trees: o mc – the Matching Cluster metric (Bogdanowicz et al. 2012), o rc – the Robinson-Foulds metric based on clusters (Robinson and Foulds 1981), o ns – the Nodal Splitted metric with L2 norm (Cardona et al. 2010), o tt – the Triples metric (Crichlow et al. 1996). Example: -d ms rf • IO options (both options should be specified): o -i – input data file with trees in the Newick format, o -o – output data file with the results of computations. Optional switches: • General options: o –N – report normalized distances δm for a particular metric m (Bogdanowicz et al. 2012; based on an average value from pre-computed data). This functionality is available for trees with number of leaves between 4 and 1000. Note that normalized tree similarity for a particular metric m (NTSm) can be expressed by normalized distance as follows: NTSm. = 1 - δm (Bogdanowicz et al. 2012). o –P – prune compared trees if needed. This option is design to allow comparing trees having different (partially overlapping) sets of taxa. After using this option three additional columns appear in the output file (see section 4 for details). o –I – -include summary section in the output file. • Matching metric specific options (only one option should be specified). o –A – Generate alignment files – this option should be used together with selection the MS or MC metrics. As a result additional files containing aligned splits or clusters are generated: - [output_file_name].out.aln_MS.txt, - [output_file_name].out.aln_MC.txt, where [output_file_name] is the file name specified after -o option. o -O – use special implementations of MS/MC metrics optimized for similar trees. Note that if a rooted tree (with bifurcation in the root) is compared using metrics for unrooted trees the tree will be automatically transform into unrooted one, i.e., the bifurcation will be replaced with an arbitrary trifurcation. 4 4. Output data format All output files created by the application regardless of chosen mode have similar structure. Output files are tab separated text files (TSV), which means that they can be easily read by various data analysis software (e.g. MS Excel, R, OpenOffice.org). An output file consists of two sections. The first section contains formatted in rows values of distances in selected metrics. The second (optional) section contains summary data computed based on all rows that appears in the first section. 4.1. Basic output file structure Base output file format for options -s, -m, and -w No Tree1 Tree2 MetricName_1 MetricName_2 … MetricName_n Comparison number Tree1 number Tree2 number Distance value Distance value … Distance value Base output file format for option -r, Tree MetricName_1 MetricName_2 … MetricName_n Comparison number (= tree number) Distance value Distance value … Distance value Tree, tree1, tree2 numbers in the output file correspond to the number of the tree in the input file. The following table contains a mapping between available metrics and column names in the output file that are related to them. Metric name in the output file MatchingSplit R-F PathDiffernce Quartet MatchingCluster R-F_Cluster NodalSplitted Triples 4.2. Full metric name the Matching Split distance the Robinson-Foulds distance the path difference distance the quartet distance the Matching Cluster metric the Robinson-Foulds metric based on clusters the Nodal Splitted metric with L2 norm the Triples metric TreeCmp command line parameter ms rf pd qt mc rc ns tt Additional columns (-P and -N options) After using switch -P the following three columns appear additionally in the output file. Tree1_taxa Tree2_taxa (or RefTree_taxa) Common_taxa Number of taxa in the first tree Number of taxa in the second (or reference) tree Number of taxa in common 5 After using switch -N the following two columns per each chosen metric appear additionally in the output file. These columns contain the value of the distance in a particular metric divided by its empirical average value. If the number of common leaves in compared trees is out of supported range (which is form 4 to 1000), then “N/A” value is inserted. MetricName_toYuleAvg MetricName_toUnifAvg (Distance value)/(Empirical average value in the Yule model) (Distance value)/(Empirical average value in the uniform model) For details regarding generating phylogenetic trees under the Yule and uniform models see (McKenzie and Steel 2000; Semple and Steel 2003). 4.3. Summary section format (-I option) Name Avg Std Min Max Count Metric name 1 Average value Standard deviation value Minimal value Maximal value Number of analyzed values Metric name 2 … … … … … … … … … … … … … … … … Metric name n 5. Useful Java VM parameters In the case of an analysis of large trees the following exceptions might occur: 1. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space To solve the problem increase Java heap space memory limit using JVM option –Xmx Example: java –Xmx700m –jar TreeCmp.jar 2. Exception in thread "main" java.lang.StackOverflowError at pal.io.FormattedInput.skipWhiteSpace(FormattedInput.java:111) at pal.io.FormattedInput.readNextChar(FormattedInput.java:131) at pal.tree.ReadTree.readNH(ReadTree.java:81) ….. at pal.tree.ReadTree.readNH(ReadTree.java:89) To solve the problem increase Java thread stack size limit using JVM option –Xss Example: java –Xss1m –jar TreeCmp.jar These options can be used in conjunction. 6 6. Examples 6.1. Running application to compare trees using MS Input file: \examples\beast\testBSP.newick Invocation: java -jar TreeCmp.jar -w 2 -d ms -i testBSP.newick -o testBSP.newick_w_2.out -I Console output: TreeCmp version 1.0-b291 Active options: Type of the analysis: window comparison mode (-w) with window size: 2 Metrics: 1. MatchingSplit (ms) Input file: testBSP.newick Output file: testBSP.newick_w_2.out Additional options: I - Include summary section in the output file. ----2011-08-27 16:03:17: Start of scanning input file: testBSP.newick 2011-08-27 16:03:17: End of scanning input file: testBSP.newick 2011-08-27 16:03:17: 11 valid trees found in file: testBSP.newick 2011-08-27 16:03:17: Start of calculation...please wait... 2011-08-27 16:03:17: 0.00% completed... 2011-08-27 16:03:17: 20.00% completed... 2011-08-27 16:03:17: 40.00% completed... 2011-08-27 16:03:17: 60.00% completed... 2011-08-27 16:03:17: 80.00% completed... 2011-08-27 16:03:17: 100.00% completed. 2011-08-27 16:03:17: End of calculation. 2011-08-27 16:03:17: Total calculation time: 62 ms. Output file testBSP.newick_w_2.out: No Tree1 Tree2 1 1 2 2 3 4 3 5 6 4 7 8 5 9 10 --------Summary: Name Avg Std MatchingSplit 6.2. MatchingSplit 58.0000 24.0000 10.0000 13.0000 14.0000 Min 23.8 Max Count 17.73583942191629 10.0 58.0 5 Computing normalized distances Reporting distances divided by pre-computed empirical average values for random trees (generated according to Yule and uniform models, -N option) can help in an interpretation of the similarity level of analyzed trees in chosen metric. In the following example, the distance in the MS metric of each tree from a given set to the reference tree is computed. Analyzed trees have 15 leaves. 7 Input files: \examples\sclaed\ref_tree.trees \examples\sclaed\test_set.trees Invocation: java -jar TreeCmp.jar -r ref_tree.trees -d ms -i test_set.trees -o test_set.trees.r.out –N Output file test_set.trees.r.out: Tree 1 2 3 4 5 6 7 8 9 10 11 12 MatchingSplit 43.0000 43.0000 41.0000 40.0000 43.0000 41.0000 43.0000 41.0000 39.0000 40.0000 0.0000 6.0000 MatchingSplit_toYuleAvg 1.0742 1.0742 1.0242 0.9992 1.0742 1.0242 1.0742 1.0242 0.9742 0.9992 0.0000 0.1499 MatchingSplit_toUnifAvg 0.9663 0.9663 0.9214 0.8989 0.9663 0.9214 0.9663 0.9214 0.8764 0.8989 0.0000 0.1348 Basic interpretation: • Tree number 11 has the same topology as the reference tree. • Tree number 12 is very similar to the reference tree in comparison to similarly of random on 15 leaves (the normalized distance is about 0.15 and 0.13 depending on the random model). • Trees with numbers 1 to 10 are approximately as similar to the reference tree as random trees to each other (the normalized distance is close to 1). In ordered to perform more advance similarity analysis, e.g. involving different model of generation of random trees, user my need to use TreeCmp twice: • to compute distances between custom set of random trees generated by other software, e.g. Evolver application form PAML package (http://abacus.gene.ucl.ac.uk/software/paml.html) to obtain the empirical average distance in a particular metric or its distribution, • to compute the distance between analyzed trees. 6.3. Finding the most similar trees in the input file The most convenient comparison mode for such purpose is a matrix mode (-m). In the following example, the Matching Split distance is used. Input file: \examples\plain2\plain2.trees (a,(b,c),(d,e)); (a,b,(c,(d,e))); (((a,b),c),d,e); (a,(b,(c,d)),e); Invocation: java -jar TreeCmp.jar -m -d ms -i plain2.trees -o plain2.trees.m.out 8 Output file plain2.trees.m.out: No 1 2 3 4 5 6 Tree1 1 1 1 2 2 3 Tree2 2 3 4 3 4 4 MatchingSplit 2.0000 2.0000 3.0000 0.0000 3.0000 3.0000 The most similar trees Trees number 2, i.e.: (a,b,(c,(d,e))) and 3, i.e.:(((a,b),c),d,e) in the input file are the most similar. In fact, they have the same topology (trees are assumed to be unrooted as metric for unrooted trees is used) because their distance is 0. 6.4. Exporting data to other applications: MS Excel, R In order to export data to MS Excel open the output file in any text editor and use copy and paste mechanism. Alternatively, you can open the input file directly in MS Excel application using the tabular character as a filed separator. In order to pass data to R (http://www.r-project.org/) it is convenient to have the TreeCmp output file in a simple tabular form (therefore, it is recommended to avoid -I option, because it results in generation the summary section, which disturb the tabular order). Such files can be easily read by R environment by using for example the read.table function as follows: treeCmpData<-read.table("C:\\Program Files\\TreeCmp\\examples\\plain\\plain.trees.m.out", header = TRUE, sep = "\t") In the example, the file to read “plain.trees.m.out” is placed in “C:\Program Files\TreeCmp\examples\plain” folder. 9 7. License Copyright (C) 2012, Damian Bogdanowicz This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/. References 1. Bogdanowicz D, Giaro K: Matching Split Distance for Unrooted Binary Phylogenetic Trees. IEEE/ACM Trans Comput Biol Bioinform 2012, 9: 150-160. 2. Bogdanowicz D, Giaro K., Wróbel B. TreeCmp: comparison of trees in polynomial time. Evol. Bioinform. 2012, in press. 3. Cardona G, Llabrés M, Rosselló F, Valiente G: Nodal distances for rooted phylogenetic trees, J Math Biol 2010 61:253-276. 4. Critchlow DE, Pearl DK, Qian C: The Triples Distance for Rooted Bifurcating Phylogenetic Trees, Syst Biol 1996, 45: 323-334. 5. Estabrook GF, McMorris FR, Meacham CA: Comparison of Undirected Phylogenetic Trees Based on Subtrees of Four Evolutionary Units. Syst Biol 1985, 34:193-200. 6. McKenzie A, Steel M: Distributions of cherries for two models of trees. Math Biosci 2000, 164:81-92. 7. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci 1981, 53:131-147. 8. Steel MA, Penny D: Distributions of Tree Comparison Metrics – Some New Results. Syst Biol 1993, 42:126-141. 9. Semple C, Steel M: Phylogenetics, Oxford University Press 2003. 10
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.4 Linearized : No Page Count : 10 XMP Toolkit : XMP toolkit 2.9.1-13, framework 1.6 About : uuid:b125adf4-ac8d-11e1-0000-d7ea18532112 Producer : GPL Ghostscript 9.05 Keywords : () Modify Date : 2012:05:30 10:33:42+02:00 Create Date : 2012:05:30 10:33:42+02:00 Creator Tool : PDFCreator Version 1.3.2 Document ID : uuid:b125adf4-ac8d-11e1-0000-d7ea18532112 Format : application/pdf Title : TreeCmp_manual Creator : user Description : () Author : user Subject :EXIF Metadata provided by EXIF.tools