get homologues manual

Bruno Contreras-Moreira (1,2) and Pablo Vinuesa (3)
1. Fundación ARAID and 2. Estación Experimental de Aula Dei-CSIC
3. Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México

May 24, 2018

Contents

1 Description
2 Requirements and installation
  2.1 Perl modules
  2.2 Required binaries
  2.3 Optional software dependencies
3 User manual
  3.1 Input data
  3.2 Obtaining (bacterial) GenBank input files
  3.3 (Eukaryotic) FASTA amino acid input files
  3.4 Program options
  3.5 Accompanying scripts
4 A few examples of use
  4.1 Clustering orthologous proteins from a few FASTA files
  4.2 Clustering orthologous proteins from a single FASTA file
  4.3 Clustering genes and proteins extracted from GenBank files
  4.4 Clustering genes and proteins that share Pfam domain architecture
  4.5 Clustering syntenic/neighbor genes
  4.6 Comparing clusters with external sequence sets
  4.7 Clustering intergenic segments from GenBank files
  4.8 Performing genome composition analyses
    4.8.1 Obtaining a pangenome matrix
    4.8.2 Interrogating a pangenome matrix
    4.8.3 Calculating cloud, shell and core genomes
    4.8.4 Estimating core/pan-genome size by sampling genomes
    4.8.5 Calculating Pfam enrichment of cluster sets
    4.8.6 Estimating average identity matrices
    4.8.7 Finding out best hits of a particular sequence
  4.9 A script to test most get homologues features with a sample dataset
5 Frequently asked questions (FAQs)
  5.1 Installation
  5.2 Run options
  5.3 Downstream analyses
6 Frequent warnings and error messages
7 Credits and references

1 Description

This document describes the software GET HOMOLOGUES and provides a few examples on how to install and use it. get homologues is mainly written in the Perl programming language and includes several algorithms designed for three main tasks:

• Clustering protein and nucleotide sequences in homologous (possibly orthologous) groups, on the grounds of sequence similarity.
• Identification of orthologous groups of intergenic regions, flanked by orthologous open reading frames (ORFs), conserved across related genomes.

• Definition of pan- and core-genomes by calculation of overlapping sets of proteins.

While the program was mainly developed for the study of bacterial genomes in GenBank format, it can also be applied to eukaryotic sets of sequences, although in this case the meaning of terms such as core- or pan-genome might change, and intergenic regions might be much larger. The twin manual get homologues-est.pdf describes the specific options of get homologues-est, designed and tested for the task of clustering transcripts and, more generally, DNA sequences of strains of the same species. The output files of GET HOMOLOGUES can be used to drive phylogenomics and population genetics analyses with the kin pipeline GET PHYLOMARKERS.

2 Requirements and installation

get homologues.pl is a Perl5 program bundled with a few binary files. The software has been tested on 64-bit Linux boxes and on Intel MacOSX 10.11.1 systems. Therefore a Perl5 interpreter is needed to run this software, which is usually installed by default on these operating systems. In addition, the package includes a few extra scripts which can be useful for downloading GenBank files and for the analysis of the results. In order to install and test this software please follow these steps:

1. Unpack the software with:
   $ tar xvfz get_homologues_X.Y.tgz

2. $ cd get_homologues_X.Y

3. $ ./install.pl
   Please follow the indications in case some required part is missing.

4. Type $ ./get_homologues.pl -v , which will tell exactly which features are available.

5. Test the main Perl script, named get_homologues.pl, with the included sample input folder sample_buch_fasta by means of the instruction:
   $ ./get_homologues.pl -d sample_buch_fasta
   You should get an output similar to the contents of file sample_output.txt.

6.
Optionally modify your $PATH environment variable to include get homologues.pl. To do so, copy the following lines to the .bash_profile or .bashrc file found in your home directory, replacing [INSTALL_PATH] with the full path of the installation folder:

export GETHOMS=[INSTALL_PATH]/get_homologues_X.Y
export PATH=${GETHOMS}/:${PATH}

This change will be effective in a new terminal or after running:

$ source ~/.bash_profile

The rest of this section may be safely skipped if installation went fine; it was written to help solve installation problems.

2.1 Perl modules

A few Perl core modules are required by the get homologues.pl script, which should already be installed on your system: Cwd, FindBin, File::Basename, File::Spec, File::Temp, FileHandle, List::Util, Getopt::Std, Benchmark and Storable. In addition, the Bio::Seq, Bio::SeqIO, Bio::Graphics and Bio::SeqFeature::Generic modules from the Bioperl collection, and the module Parallel::ForkManager, are also required; they have been included in the get homologues bundle for your convenience. Should this version of BioPerl fail on your system (as diagnosed by install.pl) it might be necessary to install it from scratch. However, before trying to download it, you might want to check whether it is already living on your system, by typing on the terminal:

$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION'

If you get a message Can't locate Bio/Root/Version... then you need to install it, which can sometimes become troublesome due to failed dependencies. For this reason the easiest way of installing it, provided that you have root privileges, is usually to use the software manager of your Linux distribution (such as synaptic/apt-get in Ubuntu, yum in Fedora or YaST in openSUSE).
If you prefer the terminal please use the cpan program with administrator privileges (sudo in Ubuntu):

$ cpan -i C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz

This form should also be valid:

$ perl -MCPAN -e 'install C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz'

Please check this tutorial if you need further help.

The accompanying script download genomes ncbi.pl imports File::Fetch, which should be bundled by default as well. In case it is missing on Fedora systems, it can be installed as root with:

$ yum install perl-File-Fetch

2.2 Required binaries

The Perl script install.pl, already mentioned in section 2, checks whether the included precompiled binaries for COGtriangles, hmmer, MCL and BLAST are in place and ready to be used by get homologues. However, if any of these binaries fails to work on your system, perhaps due to a different architecture or to missing libraries, it will be necessary to obtain an appropriate version for your system or to compile them with your own compiler.

In order to compile MCL the GNU gcc compiler is required, although it should most certainly already be installed on your system. If not, you might install it by any of the alternatives listed in section 2.1. For instance, in Ubuntu this works well: $ sudo apt-get install gcc . The compilation steps are as follows:

$ cd bin/mcl-14-137
$ ./configure
$ make

To compile COGtriangles the GNU g++ compiler is required. You should obtain it by any of the alternatives listed in section 2.1. The compilation then includes several steps:

$ cd bin/COGsoft
$ cd COGlse; make
$ cd ../COGmakehash; make
$ cd ../COGreadblast; make
$ cd ../COGtriangles; make

Regarding BLAST, get homologues uses BLAST+ binaries, which can be easily downloaded from the NCBI FTP site. The packed binaries are blastp and makeblastdb from version ncbi-blast-2.2.27+. If these do not work on your machine, or you prefer to use older BLAST versions, it will be necessary to edit file lib/phyTools.pm.
First, the environment variable $ENV{'BLAST_PATH'} needs to be set to the right path on your system (inside subroutine sub set_phyTools_env). Variables $ENV{'EXE_BLASTP'} and $ENV{'EXE_FORMATDB'} also need to be changed to the appropriate BLAST binaries, which are respectively blastall and formatdb.

2.3 Optional software dependencies

It is possible to make use of get homologues on a computer farm or high-performance computing cluster managed by gridengine. In particular we have tested this feature with versions GE 6.0u8, 6.2u4 and 2011.11p1, invoking the program with option -m cluster. For this command to work it might be necessary to edit the get homologues.pl file and add the right path to set global variable $SGEPATH. To find out the installation path of your SGE installation you might try the next terminal command:

$ which qsub

In case you have access to a multi-core computer you can follow the next steps to set up your own Grid Engine cluster and speed up your calculations:

# updated 02032017
# 1) go to http://arc.liv.ac.uk/downloads/SGE/releases ,
# create user 'sgeadmin' and download the latest binary packages
# (Debian-like here) matching your architecture (amd64 here):

wget -c http://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-common_8.1.9_all.deb
wget -c http://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9_amd64.deb
wget -c http://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-dbg_8.1.9_amd64.deb

sudo useradd sgeadmin
sudo dpkg -i sge-common_8.1.9_all.deb
sudo dpkg -i sge_8.1.9_amd64.deb
sudo dpkg -i sge-dbg_8.1.9_amd64.deb
sudo apt-get install -f

# 2) set hostname to anything but localhost by editing /etc/hosts so that
# the first line is something like this (localhost or 127.0.x.x IPs not valid):
# 172.1.1.1 yourhost

# 3) install Grid Engine server with defaults except cluster name ('yourhost')
# and admin user name ('sgeadmin'):
sudo su
cd /opt/sge/
chown -R sgeadmin sge
chgrp -R sgeadmin sge
./install_qmaster

# 4) install Grid Engine client with
# all defaults:
./install_execd
exit

# 5) check the path to your sge binaries, which can be 'lx-amd64'
ls /opt/sge/bin

# 6) set relevant environment variables in /etc/bash.bashrc [can also be named /etc/bashrc]
# or alternatively in ~/.bashrc for a given user
export SGE_ROOT=/opt/sge
export PATH=$PATH:"$SGE_ROOT/bin/lx-amd64"

# 7) optionally configure the default all.q queue:
qconf -mq all.q

# 8) add your host to the list of admitted hosts:
qconf -as yourhost

For cluster-based operations three bundled Perl scripts are invoked: _cluster_makeHomolog.pl, _cluster_makeInparalog.pl and _cluster_makeOrtholog.pl .

It is also possible to invoke Pfam domain scanning from get homologues. This option requires the bundled binary hmmscan, which is part of the HMMER3 package, whose path is set in file lib/phyTools.pm (variable $ENV{'EXE_HMMPFAM'}). Should this binary not work on your system, a fresh install might be the solution, say in /your/path/hmmer-3.1b2/. In this case you'll have to edit file lib/phyTools.pm and modify the relevant line:

if( ! defined($ENV{'EXE_HMMPFAM'}) )
{ $ENV{'EXE_HMMPFAM'} = '/your/path/hmmer-3.1b2/src/hmmscan --noali --acc --cut_ga '; }

The Pfam HMM library is also required and the install.pl script should take care of it. However, you can manually download it from the appropriate Pfam FTP site. This file needs to be decompressed, either in the default db folder or in any other location, and then it should be formatted with the program hmmpress, which is also part of the HMMER3 package. A valid command sequence could be:

$ cd db
$ wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
$ gunzip Pfam-A.hmm.gz
$ /your/path/hmmer-3.1b2/src/hmmpress Pfam-A.hmm

Finally, you'll need to edit file lib/phyTools.pm and modify the relevant line to:

if( !
defined($ENV{"PFAMDB"}) ){ $ENV{"PFAMDB"} = "db/Pfam-A.hmm"; }

In order to reduce the memory footprint of get homologues it is possible to take advantage of the Berkeley DB database engine, which requires the Perl core module DB_File; this should be installed on all major Linux distributions. Similarly, in order to take full advantage of the accompanying script parse pangenome matrix.pl, particularly for option -p, the installation of module GD is recommended. An easy way to install them, provided that you have administrator privileges, is with help from the software manager of your Linux distribution (such as synaptic/apt-get in Ubuntu, yum in Fedora or YaST in openSUSE). This can usually be done on the terminal as well, in different forms:

$ sudo apt-get -y install libgd-gd2-perl    # Ubuntu/Debian-based distros
$ yum -y install perl-GD                    # RedHat and derived distros
$ zypper --assume-yes install perl-GD       # SuSE
$ cpan -i GD                    # will require administrator privileges (sudo)
$ perl -MCPAN -e 'install GD'   # will require administrator privileges (sudo)

The installation of perl-GD on macOSX systems is known to be troublesome.

The accompanying scripts compare clusters.pl, plot pancore matrix.pl, parse pangenome matrix.pl, plot matrix heatmap.sh and hcluster pangenome matrix.sh require the statistical software R, which is usually listed by the software managers of all major Linux distributions. In some cases (some SuSE versions and some RedHat-like distros) it will be necessary to add a repository to your package manager. R can be installed from the terminal:

$ sudo apt-get -y install r-base r-base-dev              # Ubuntu/Debian-based distros
$ yum -y install R                                       # RedHat and derived distros
$ zypper --assume-yes install R-patched R-patched-devel  # SuSE

Please visit CRAN to download and install R on macOSX systems, which is straightforward.
In addition to R itself, plot matrix heatmap.sh and hcluster pangenome matrix.sh require some R packages to run, which can be easily installed from the R command line with:

> install.packages(c("ape", "gplots", "cluster", "dendextend", "factoextra"), dependencies=TRUE)

The script compare clusters.pl might require the program PARS from the PHYLIP suite, which should already be bundled with your copy of get homologues. Finally, download genomes ncbi.pl might require wget in order to download WGS genomes; it should be installed on most systems.

3 User manual

This section describes the available options for the get homologues software.

3.1 Input data

The input required to run get homologues can be of two types:

1. A single file with amino acid sequences in FASTA format, in which headers include a taxon name between square brackets, comprising at least two words with a first capital letter, as shown in the example:

>protein 123 [Species 1]
MANILLLDNIDSFTYNLVEQLRNQKNNVLVYRNTVSIDIIFNSLKKLTHPILMLSPGPSLPKHAGCMLDL
PEKFVINSYFEKMIMSVRNNCDRVCGFQFHPESILTTHGDQILEKIIHWASLKYITNKKQ
>gi|10923402| [Species 2]
IKKVKGDIPIVGICLGHQAIVEAYGGIIGYAGEIFHGKASLIRHDGLEMFEGVPQPLPVARYHSLICNKI
PEKFVINSYFEKMIMSVRNNCDRVCGFQFHPESILTTHGDQILEKIIHWASLKYITNKKQ
...

2. A directory or folder containing several files in either FASTA format (extensions '.faa' or '.fasta', containing amino acid sequences) or GenBank format (extension '.gbk', one file per organism). The advantage of this option is that new files can be added to the input folder in the future and previous calculations will be conserved. This might be useful to study a group of organisms for which a few genomes are available, and then keep adding new genomes as they become available. This directory can actually contain a mixture of FASTA and GenBank files.

3.2 Obtaining (bacterial) GenBank input files

The GenBank format is routinely used to describe genomic sequences, usually taking one file per chromosome or genomic contig.
Each file contains a reference DNA genomic sequence plus a collection of genes and their products, making it possible to extract simultaneously the sequence of every ORF and its corresponding protein product. GenBank files are the recommended input format for bacterial sequences, as they permit the compilation of both DNA and protein sequence clusters, which might have different applications.

There are many ways to obtain GenBank files, starting with manual browsing and downloading from the NCBI site, keeping in mind that full files, which include the reference nucleotide sequences, should be downloaded. In fact, get homologues.pl will fail to parse any ORF in summary GenBank files.

Figure 1: NCBI download widget showing the choice of 'GenBank (full)' format, which contains the raw reference nucleotide sequences.

Often users take custom-made GenBank files, resulting from in-house genome assemblies, to be analysed. In most cases genes in such files don't have GenBank identifiers assigned yet, so we recommend adding the field locus_tag to each CDS feature so that parsed sequences can be properly identified.

For their use with get homologues, GenBank files for the same species (for example, from the main chromosome and from a couple of plasmids) must be concatenated. For instance, the genomic sequences of Rhizobium etli CFN 42 comprise seven files, which can be concatenated into a single Rhizobium etli CFN42.gbk file. In order to assist in this task this software package includes the accompanying script download genomes ncbi.pl.
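The per-species concatenation just described can be scripted directly. The sketch below uses tiny placeholder files (the file names and contents are made up for illustration; only the cat call matters):

```shell
# make three placeholder single-replicon GenBank files, standing in for
# the real chromosome and plasmid files of one species
printf 'LOCUS chromosome\n//\n' > chr.gbk
printf 'LOCUS plasmid_p42a\n//\n' > p42a.gbk
printf 'LOCUS plasmid_p42b\n//\n' > p42b.gbk

# concatenate all replicons of the species into a single .gbk input file
cat chr.gbk p42a.gbk p42b.gbk > Rhizobium_etli_CFN42.gbk

# each replicon keeps its own LOCUS ... // record inside the merged file
grep -c '^LOCUS' Rhizobium_etli_CFN42.gbk
```

The merged file is then dropped into the -d input folder like any other per-genome .gbk file.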
We will explain its use by fetching some of the Yersinia pestis genomic sequences used in a 2010 paper by Morelli et al:

Group    Name         Accession Number   Status
0.PE2    Pestoides F  NC_009381          Completed Sanger genome
0.PE3    Angola       NC_010159          Completed Sanger genome
0.PE4    91001        NC_005810          Completed Sanger genome
0.ANT2   B42003004    NZ_AAYU00000000    Draft Sanger genome
1.ANT1   UG05-0454    NZ_AAYR00000000    Draft Sanger genome (12.3X coverage)
1.ANT1   Antiqua      NC_008150          Completed Sanger genome
1.IN3    E1979001     NZ_AAYV00000000    Draft Sanger genome
1.ORI1   CA88-4125    NZ_ABCD00000000    Draft Sanger genome
1.ORI1   FV-1         NZ_AAUB00000000    Draft Sanger genome
1.ORI1   CO92         NC_003143          Completed Sanger genome
1.ORI2   F1991016     NZ_ABAT00000000    Draft Sanger genome
1.ORI3   IP674        ERA000177          Draft 454 genome (82X coverage)
1.ORI3   IP275        NZ_AAOS00000000    Draft Sanger genome (7.6X coverage)
1.ORI3   MG05-1020    NZ_AAYS00000000    Draft Sanger genome (12.1X coverage)
2.ANT1   Nepal516     NZ_ACNQ00000000    Draft Sanger genome
2.MED1   KIM          NC_004088          Completed Sanger genome
2.MED2   K1973002     NZ_AAYT00000000    Draft Sanger genome

In order to use download genomes ncbi.pl it is necessary to feed it a text file listing which genomes are to be downloaded. The next examples show the exact format required, as does the bundled file sample_genome_list.txt.
First, it can be seen that completed genomes have NC accession numbers, and can be added to the list as follows:

NC_010159 Yersinia_pestis_Angola

Other annotated genomes can be added using their assembly code, as in this example:

GCA_000016445.1_ASM1644v1 Yersinia_pestis_Pestoides_F

Finally, draft WGS genomes, which can be browsed at http://www.ncbi.nlm.nih.gov/Traces/wgs, can be listed in our download file by adding the four-letter code of their prefixes, as follows:

AAYU01 Yersinia_pestis_B42003004

The resulting genome_list.txt file will look like this:

NC_010159                  Yersinia_pestis_Angola
GCA_000016445.1_ASM1644v1  Yersinia_pestis_Pestoides_F
AAYU01                     Yersinia_pestis_B42003004

Note that only the first two columns (separated by blanks) are read in, and that lines can be commented out by adding a '#' as the first character. Now we can run the following terminal command to fetch these genomes:

$ ./download_genomes_ncbi.pl genome_list.txt

This will put several Yersinia pestis *.gbk files in the current directory, which are then ready to be used by get homologues.

3.3 (Eukaryotic) FASTA amino acid input files

Due to the complexity of eukaryotic genomes, which are split in many chromosomes and contigs and usually contain complex gene models, the preferred format taken by get homologues for their sequences is FASTA. While eukaryotic GenBank files can be fed in, during development we have not tested nor benchmarked the compilation of clusters of nucleotide eukaryotic sequences, which can be more error prone due to the inclusion of, for instance, introns and pseudogenes. Therefore we currently cannot recommend the use of eukaryotic GenBank input files. Of course FASTA format can also be used for prokaryotic amino acid sequences, as in the case of the example sample_buch_fasta folder, which contains protein sequences found in four Buchnera aphidicola genomes.
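Whichever way a single-file (-i) FASTA input is assembled, every header must carry a bracketed taxon name (section 3.1). A quick sanity check along these lines may save a failed run; the file name and sequences below are made up for illustration:

```shell
# write a tiny two-sequence file in the -i input format
cat > proteins.faa <<'EOF'
>protein 123 [Species 1]
MANILLLDNIDSFTYNLVEQ
>gi|10923402| [Species 2]
IKKVKGDIPIVGICLGHQAI
EOF

# count headers, then headers carrying a [taxon] tag; both counts should match
grep -c '^>' proteins.faa
grep '^>' proteins.faa | grep -c '\[.*\]'
```

If the two counts differ, the headers without brackets are the ones to fix before running get homologues with -i.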
If your data are DNA coding sequences you can translate them to protein sequences for use with get homologues, for instance by means of a Perl command in the terminal, with a little help from Bioperl (section 2.1). It is a long command, split into three chunks to fit on the page:

$ perl -MBio::Seq -lne 'if(/^(>.*)/){$h=$1}else{$fa{$h}.=$_} \
  END{ foreach $h (sort(keys(%fa))){ $fa{$h}=Bio::Seq->new(-seq=>$fa{$h})->translate()->seq(); \
  print "$h\n$fa{$h}\n" }}' your_CDS_file.fna

3.4 Program options

-v  print version, credits and checks installation
-d  directory with input FASTA files ( .faa / .fna ), GenBank files ( .gbk ), 1 per genome, or
    a subdirectory ( subdir.clusters / subdir_ ) with pre-clustered sequences ( .faa / .fna );
    allows for new files to be added later; creates output folder named 'directory_homologues'
    (overrides -i, use of pre-clustered sequences ignores -c, -g)
-i  input amino acid FASTA file with [taxon names] in headers, creates output folder named
    'file_homologues' (required unless -d is set)

Optional parameters:
-o  only run BLAST/Pfam searches and exit (useful to pre-compute searches)
-c  report genome composition analysis (follows order in -I file if enforced,
    ignores -r, -t, -e)
-R  set random seed for genome composition analysis (optional, requires -c,
    example -R 1234, required for mixing -c with -c -a runs)
-s  save memory by using BerkeleyDB; default parsing stores sequence hits in RAM
-m  runmode [local|cluster] (default local)
-n  nb of threads for BLAST/HMMER/MCL in 'local' runmode (default=2)
-I  file with .faa/.gbk files in -d to be included (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):
-G  use COGtriangle algorithm (COGS, PubMed=20439257) (requires 3+ genomes|taxa)
-M  use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:
-X  use diamond instead of blastp (optional, set threads with -n)
-C  min %coverage in BLAST pairwise alignments (range [1-100], default=75)
-E  max E-value (default=1e-05, max=0.01)
-D  require equal Pfam domain composition when defining similarity-based orthology
    (recommended with -m cluster)
-S  min %sequence identity in BLAST query/subj pairs (range [1-100], default=1 [BDBH|OMCL])
-N  min BLAST neighborhood correlation PubMed=18475320 (range [0,1], default=0 [BDBH|OMCL])
-b  compile core-genome with minimum BLAST searches (ignores -c [BDBH])

Options that control clustering:
-t  report sequence clusters including at least t taxa (default t=numberOfTaxa,
    t=0 reports all clusters [OMCL|COGS])
-a  report clusters of sequence features in GenBank files instead of default 'CDS' GenBank
    features (requires -d and .gbk files, example -a 'tRNA,rRNA', NOTE: uses blastn instead
    of blastp, ignores -g, -D)
-g  report clusters of intergenic sequences flanked by ORFs in addition to default 'CDS'
    clusters (requires -d and .gbk files)
-f  filter by %length difference within clusters (range [1-100], by default sequence length
    is not checked)
-r  reference proteome .faa/.gbk file (by default takes file with least sequences; with BDBH
    sets first taxa to start adding genes)
-e  exclude clusters with inparalogues (by default inparalogues are included)
-x  allow sequences in multiple COG clusters (by default sequences are allocated to single
    clusters [COGS])
-F  orthoMCL inflation value (range [1-5], default=1.5 [OMCL])
-A  calculate average identity of clustered sequences, by default uses blastp results but
    can use blastn with -a (optional, creates tab-separated matrix, recommended with
    -t 0 [OMCL|COGS])
-z  add soft-core to genome composition analysis (optional, requires -c [OMCL|COGS])

Figure 2: Flowchart of get homologues.

Typing $ ./get_homologues.pl -h on the terminal will show the available options, listed above. The only required option is either -i, used to choose an input file, or -d instead, which indicates an input folder, as seen in section 3.1.
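Concretely, the two entry points look like this. The -d call uses the bundled sample data mentioned in section 2; the -i file name is a hypothetical example, and neither command is run here:

```shell
# folder input: one .faa/.gbk file per genome;
# output goes to sample_buch_fasta_homologues/
./get_homologues.pl -d sample_buch_fasta

# single-file input: amino acid FASTA with [taxon name] tags in the headers;
# output goes to all_proteins_homologues/ (hypothetical file name)
./get_homologues.pl -i all_proteins.faa
```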
In previous versions only files with extensions .fa[a] and .gb[k] were considered when parsing the -d directory. Currently, GZIP- or BZIP2-compressed input files are also accepted.

By using .faa input files, in theory you can only calculate clusters of protein sequences. In contrast, the advantage of using .gbk files is that you obtain both nucleotide and protein clusters. If both types of input files are combined, only protein clusters will be produced. However, if each input .faa file has a twin .fna file in place, containing the corresponding nucleotide sequences in the same order, the program will attempt to produce the corresponding clusters of nucleotide sequences. The possible input file combinations are summarized in Table 1:

input file extensions    output clusters
.gbk                     amino acid + DNA sequence
.faa                     amino acid sequence
.gbk & .faa              amino acid sequence
.faa & .fna              amino acid + DNA sequence
.gbk & .faa & .fna       amino acid + DNA sequence

Table 1: Valid input file combinations.

The use of an input folder or directory (-d) is recommended, as it allows new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters change when adding an extra 10 genomes tomorrow, by copying these new 10 .faa / .gbk input files to the pre-existing -d folder, so that all previous BLAST searches are re-used.

In addition to .gbk and .faa files, the input directory can also contain one subfolder with pre-clustered sequences. This feature was designed so that users can add previously produced get homologues clusters, or any other set of grouped sequences in FASTA format, to be analysed. For such a subfolder to be recognized, it must be named subdir.clusters or subdir_. Sample data folder sample_buch_fasta/ contains such an example subfolder, which can be uncompressed to be tested. It is important to note that, during subsequent calculations, these clusters are represented by the first sequence found in each. However, the output of the program will include all pre-clustered sequences for convenience.

All remaining flags are options that can modify the default behavior of the program, which is to use the bidirectional best-hit algorithm (BDBH) in order to compile clusters of potential orthologous ORFs, taking the smallest genome as a reference. By default protein sequences are used to guide the clustering, thus relying on BLASTP searches. Perhaps the most important optional parameter is the choice of clustering algorithm (Table 2):

name   option
BDBH   default   Starting from a reference genome, keep adding genomes stepwise while storing
                 the sequence clusters that result from merging the latest bidirectional best
                 hits, as illustrated in Figure 3.
COGS   -G        Merges triangles of inter-genomic symmetrical best matches, as described in
                 PubMed=20439257. Note that a single sequence might occasionally be included
                 in several COGS clusters with option -x.
OMCL   -M        OrthoMCL v1.4, uses the Markov Cluster Algorithm to group sequences, with
                 inflation (-F) controlling cluster granularity, as described in
                 PubMed=12952885.

Table 2: List of available clustering algorithms.

The remaining options are now reviewed:

• Apart from showing the credits, option -v can be helpful after installation, for it prints the enabled features of the program, which of course depend on the required and optional binaries mentioned in sections 2.2 and 2.3.

• -o is ideally used to submit the required BLAST (and Pfam) searches to a computer cluster, preparing a job for posterior analysis on a single computer.

• -c is used to request a pan- and core-genome analysis of the input genomes, which will be output as a tab-separated data file.
The number of samples for the genome composition analysis is set to 10 by default, but this can be edited at the header of get_homologues.pl (check the $NOFSAMPLESREPORT variable). In addition, variables $MIN_PERSEQID_HOM and $MIN_COVERAGE_HOM, with default values 0 and 20, respectively, control how homologues are called. Note that these are stringent values. They can be edited at lib/marfil_homology.pm to relax calling a sequence homologous, and therefore redundant. For instance, the equivalent values used by Tettelin and collaborators (PubMed=16172379) are 50 and 50, respectively.

• -R takes a number that will be used to seed the random generator used with option -c. By using the same seed in different -c runs the user ensures that genomes are sampled in the same order.

• -s can be used to reduce the memory footprint, provided that the Perl module BerkeleyDB is in place (please check section 2.3). This option usually makes get homologues slower, but for very large datasets, or on machines with little memory, this might be the only way to complete a job.

• -m allows the choice of runmode, which can be either -m local (the default) or -m cluster. In the second case the global variable $SGEPATH might need to be appropriately set, as explained in section 2.3, as well as $QUEUESETTINGS, which specifies for instance a particular queue name for your cluster jobs.

• -n sets the number of threads/CPUs to dedicate to each BLAST/HMMER job run locally, which by default is 2.

Figure 3: Flowchart of the BDBH algorithm with default parameters and G genomes. First, inparalogues, defined as intraspecific bidirectional best hits (BDBHs), are identified in each genome; second, new genomes are incrementally compared to the reference genome and their BDBHs annotated; finally, clusters containing at least 1 sequence per genome are conserved.

• -I list_file.txt allows the user to restrict a get homologues job to a subset of the genomes included in the input -d folder.
This flag can be used in conjunction with -c to control the order in which genomes are considered during pan- and core-genome analyses. Taking the sample_buch_fasta folder, a valid list_file.txt could contain these lines:

Buch_aph_APS.faa
Buch_aph_Bp.faa
Buch_aph_Cc.faa

• option -X indicates that peptide similarity searches are to be performed with DIAMOND (PubMed=25402007) instead of BLASTP. This has a small sensitivity penalty (see blog post), but the tests summarized in Table 3 show that it speeds up protein similarity searches when compared to default BLAST jobs. The gain in computing time increases as more genomes are fed in, allowing the analysis of even larger datasets (see Table 4). Figure 4 shows that cloud clusters are where most of the differences arise when DIAMOND is used, with more singletons produced than in BLASTP-based analyses (see section 4.8.3 to learn how these plots can be produced).

• option -C sets the minimum percentage of coverage required to call two sequences best hits, as illustrated in Figure 5. The larger this value gets, the smaller the chance that two sequences are found to be reciprocal best hits. The default coverage value is 75%. When using the COGS algorithm the maximum accepted coverage is 99%. This parameter has a large impact on the results obtained, and its optimal value will depend on the input data and the anticipated use of the produced clusters.

• -E sets the maximum expectation value (E-value) for BLAST alignments, by default 1e-05. This parameter might need adjusting for nucleotide BLAST searches or for very short proteins, under 40 residues.

runmode             wallclock time (s)   speedup
-m local -n 20      1,189
-m local -n 20 -X   302                  3.93X
-m cluster          733
-m cluster -X       276                  2.65X

Table 3: Time required to perform BLASTP and DIAMOND (-X) searches among 9 Mycobacterium africanum accessions.
runmode                     clusters   wallclock time (s)   RAM (MB)
-m cluster -X               1,287      74,652               2,385
-m cluster -X -M -t 0       30,098     81,059               34,155
-m cluster -X -M -t 0 -A    30,098     315,781              13,474

Table 4: Resources required to analyze 220 Escherichia coli genomes with DIAMOND (-X) on a 32-core Linux box running SGE. The total number of sequences was 1,104,985. The OMCL (-M) commands were run after the BDBH job and thus avoided re-running DIAMOND. While the input genomes took 626MB of drive space, the uncompressed output was 172GB.

• -S can be passed to require a minimum % sequence identity for two sequences to be called best hits. This option does not affect COGS runs; its default value is set to 1.

• -N sets a minimum neighborhood correlation, as defined in PubMed=18475320, for two sequences to be called best hits. In this context ’neighborhood’ is the set of homologous sequences reported by BLAST, with the idea that two reliable best hits should have similar sets of homologous sequences.

• -D is an extra restriction for calling best hits, which are then required to have identical Pfam domain compositions. Note that this option requires scanning all input sequences for Pfam domains, a task that needs some extra software to be installed (see section 2.3) and extra computing time, ideally on a computer cluster (-m cluster). While for BDBH domain filtering is done at the time bidirectional best hits are called, this processing step is performed only after the standard OMCL and COGS algorithms have completed, to preserve each algorithm's features.

Figure 4: Comparison of OMCL-based pangenomes of 9 Mycobacterium africanum accessions computed with BLASTP (left) and DIAMOND (right).

Figure 5: Coverage [BDBH,OMCL] and overall segment coverage [COGS] illustrated with the alignment of sequence ’query’ to two aligned fragments of sequence ’subject’, where 1,s1,e1,s2,e2 and L are alignment coordinates.
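The staged strategy timed in Table 4 can be sketched as follows; the input folder name is a placeholder (not part of the bundled samples), and only the command lines are printed here. Running the OMCL jobs after the BDBH job lets them reuse the stored DIAMOND results:

```shell
DIR=my_ecoli_gbk   # hypothetical folder holding the GenBank files
# 1) core-genome clusters with DIAMOND; populates the results directory
echo "./get_homologues.pl -d $DIR -m cluster -X"
# 2) full OMCL pangenome (-M -t 0), reusing the stored similarity searches
echo "./get_homologues.pl -d $DIR -m cluster -X -M -t 0"
# 3) the same, plus a matrix of average sequence identities (-A)
echo "./get_homologues.pl -d $DIR -m cluster -X -M -t 0 -A"
```

As the RAM column of Table 4 suggests, it is the -M -t 0 pangenome step, not the similarity searches, that dominates memory use on large datasets.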
• -b reduces the number of pairwise BLAST searches performed while compiling core-genomes with algorithm BDBH, considerably reducing memory and run-time requirements (for G genomes, 3G searches are launched instead of the default G²). It comes at the cost of being less exhaustive in finding inparalogues, but in our bacterial benchmarks this potential, undesired effect was negligible.

• -t is used to control which sequence clusters should be reported. By default only clusters which include at least one sequence per genome are output. However, a value of -t 2 would report all clusters containing sequences from at least 2 taxa. A special case is -t 0, which will report all clusters found, even those with sequences from a single genome.

• -a forces the program to extract user-selected sequence features typically contained in GenBank files, such as tRNA or rRNA genes, instead of the default CDSs. When using this option clusters are compiled by comparing nucleotide sequences with BLASTN. Note that such BLASTN searches are expected to be less sensitive than default BLASTP searches.

• -g can be used to request the compilation of clusters of intergenic sequences. This implies the calculation of ORF clusters and then a search for pairs of ’orthologous’ ORFs whose flanking intergenic regions are conserved, with the constraints set by three global variables in the header of get_homologues.pl:

my $MININTERGENESIZE = 200; # minimum length (nts) required for intergenic
                            # segments to be considered

Figure 6: For G genomes, a typical get homologues job requires running G² BLAST searches in order to compare all sequences against all, including each genome against itself to help infer inparalogues. Therefore, the resources required for the BLAST jobs grow quadratically. Instead, the BDBH algorithm with option -b requires only 3G BLAST searches (in grey) for any reference genome.
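For instance, a hedged sketch of combining -a and -t (the feature names follow the -a ’rRNA,tRNA’ usage mentioned later in this manual; the command is printed rather than run):

```shell
# Cluster rRNA and tRNA genes with BLASTN, keeping clusters present
# in at least 2 taxa; -t 0 instead would keep every cluster found,
# including single-genome clusters.
CMD="./get_homologues.pl -d sample_plasmids_gbk -a 'rRNA,tRNA' -t 2"
echo "$CMD"
```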
my $MAXINTERGENESIZE = 700;
my $INTERGENEFLANKORF = 180; # length in nts of intergene flanks borrowed
                             # from neighbor ORFs

Figure 7: Two divergent ORFs flanking an intergenic region. Only 180 bases from each ORF are taken for compiling intergenic clusters.

• -f filters out cluster sequences with large differences in length. This flag compares sequences within a cluster to the first (arbitrary) reference sequence. Those with a length difference (either shorter or longer) beyond the selected threshold will be removed. This might cause the resulting cluster to be entirely removed if the final number of taxa falls below the -t minimum.

• -r allows the choice of any input genome (of course included in the -d folder) as the reference, instead of the default. If possible, resulting clusters are named using gene names from this genome, so a well-annotated species can be selected for this purpose. In addition, when using the default BDBH algorithm, the reference proteome is the one chosen to start adding genes in the clustering process. Therefore, when using BDBH, the choice of reference proteome can have a large impact on the resulting number of clusters. By default, the taxon with the fewest genes is taken as reference. It is possible to change the way clusters are named by editing subroutine extract_gene_name in file lib/phyTools.pm.

• -e excludes clusters with inparalogues, defined as sequences with best hits in their own genome. This option might be helpful to rule out clusters including several sequences from the same species, which might be of interest for users employing these clusters for primer design, for instance.

• -x allows COG-generated sequence clusters to contain the same sequence in more than one cluster.

• -F is the inflation value that governs Markov Clustering in OMCL runs, as explained in PubMed=12952885.
As a rule of thumb, low inflation values (-F 1) result in the inclusion of more sequences in fewer groups, whilst large values produce more, smaller clusters (-F 4).

• -A tells the program to produce a tab-separated file with average % sequence identity values among pairs of genomes, computed from sequences in the final set of clusters (see also option -t). By default these identities are derived from BLASTP alignments, and hence correspond to amino acid sequence identities. However, as explained earlier, option -a forces the program to use nucleotide sequences and run BLASTN instead, and therefore -a ’CDS’ combined with -A will produce genomic average nucleotide sequence identities (ANI), as used in the literature to help define prokaryotic species (PubMed=19855009).

• -z can be called when performing a genome composition analysis with clustering algorithms OMCL or COGS. In addition to the core- and pan-genome tab-separated files mentioned earlier (see option -c), this flag requests a soft-core report, considering all sequence clusters present in a fraction of genomes defined by global variable $SOFTCOREFRACTION, with a default value of 0.95. This choice produces a composition report more robust to assembly or annotation errors than the core-genome.

3.5 Accompanying scripts

The following Perl scripts are included in the bundle to assist in the interpretation of results generated by get homologues.pl:

• download genomes ncbi.pl, a script which is described in section 3.2 with examples.

• compare clusters.pl primarily calculates the intersection between cluster sets, which can be used to select clusters supported by different algorithms or settings. This script can also produce syntenic clusters, pangenome matrices, OrthoXML reports, and Venn diagrams (this last optional feature requires R, please check section 2.3). Examples of use of this script are presented in sections 4.1, 4.5 and 4.8.1.
• parse pangenome matrix.pl is a script that can be used to analyze pan-genome sets, in order to find genes present in a group A of species which are absent in set B. The identified genes can be mapped onto the underlying genome contigs of a reference genome included in A. Moreover, this script can be used for calculating and plotting cloud, shell and core genome compartments. Please see examples in sections 4.8.2 and 4.8.3.

• make nr pangenome matrix.pl is provided to post-process pangenome matrices in case the user wishes to remove redundant clusters.

• plot pancore matrix.pl, a Perl script to plot pan/core-genome sampling results and to fit regression curves with help from R functions. An example of use of this script is given in section 4.8.4. Please check section 2.3 for the requirements of this script.

• check BDBHs.pl is a script that can be used, after a previous get homologues run, to find out the bidirectional best hits of a sequence identifier chosen by the user. It can also retrieve the Pfam annotations of a sequence and its reciprocal best hits. See section 4.8.7.

• add pancore matrices.pl can be used to add pan/core-matrices produced by previous get homologues -c -R runs on the same set of genomes, with the aim of combining default CDS clusters and -a ’rRNA,tRNA’ results.

• add pangenome matrices.pl can be used similarly to add two pangenome matrices produced by compare clusters.pl, for instance from sets of CDS and rRNA,tRNA clusters.

• pfam enrich.pl calculates the enrichment of a set of sequence clusters in terms of Pfam domains, by using Fisher’s exact test.

• annotate cluster.pl can be used to retrieve a multiple alignment view of the supporting local BLAST alignments of the sequences in a cluster, and to annotate any encoded Pfam domain. It can also be used to find sequence variants private to an arbitrary group of sequences.
In addition, two shell scripts are also included:

• plot matrix heatmap.sh calculates ordered heatmaps with attached row and column dendrograms from squared tab-separated numeric matrices, which can be presence/absence pangenomic matrices or similarity/identity matrices such as those produced by get homologues with flag -A. From the latter type of matrix a distance matrix can optionally be calculated to drive a neighbor-joining tree. See the example in section 4.8.1.

• hcluster pangenome matrix.sh generates a distance matrix out of a tab-separated presence/absence pangenome matrix, which is then used to call R functions hclust() and heatmap.2() in order to produce a heatmap.

To check the options of any of these scripts please invoke them from the terminal with flag -h. For instance, typing $ ./compare_clusters.pl -h in the terminal will produce the following:

-h this message
-d comma-separated names of cluster directories
-o output directory
-n use nucleotide sequence .fna clusters
-r take first cluster dir as reference set, which might contain
   a single representative sequence per cluster
-s use only clusters with syntenic genes
-t use only clusters with single-copy orthologues from taxa >= t
-I produce clusters with single-copy seqs from ALL taxa in file
-m produce intersection pangenome matrices
-x produce cluster report in OrthoXML format
-T produce parsimony-based pangenomic tree

4 A few examples of use

This section presents a few different ways of running get homologues.pl and the accompanying scripts with provided sample input data.

4.1 Clustering orthologous proteins from a few FASTA files

This example takes the provided sample input folder sample_buch_fasta, which contains the protein sets of four Buchnera aphidicola strains, and compiles clusters of BDBH sequences, which are candidates to be orthologues, with this command:

$ ./get_homologues.pl -d sample_buch_fasta
The output should look like this (contained in file sample_output.txt): # ./get_homologues.pl -i 0 -d sample_buch_fasta -o 0 -e 0 -f 0 -r 0 -t all -c 0 -I 0 # -m local -n 2 -M 0 -G 0 -P 0 -C 75 -S 1 -E 1e-05 -F 1.5 -N 0 -B 50 -s 0 -D 0 -g 0 -a ’0’ -x # results_directory=sample_buch_fasta_homologues # parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 # # # # # checking input files... Buch_aph_APS.faa 574 Buch_aph_Bp.faa 507 Buch_aph_Cc.faa 357 Buch_aphid_Sg.faa 546 # 4 genomes, 1984 sequences # taxa considered = 4 sequences = 1984 residues = 650959 MIN_BITSCORE_SIM = 17.2 # mask=BuchaphCc_f0_alltaxa_algBDBH_e0_ (_algBDBH) # running makeblastdb with sample_buch_fasta_homologues/Buch_aph_APS.faa.fasta # running makeblastdb with sample_buch_fasta_homologues/Buch_aph_Bp.faa.fasta # running makeblastdb with sample_buch_fasta_homologues/Buch_aph_Cc.faa.fasta # running makeblastdb with sample_buch_fasta_homologues/Buch_aphid_Sg.faa.fasta # running BLAST searches ... # done # # # # # # concatenating and sorting blast results... sorting _Buch_aph_APS.faa results (0.12MB) sorting _Buch_aph_Bp.faa results (0.11MB) sorting _Buch_aph_Cc.faa results (0.084MB) sorting _Buch_aphid_Sg.faa results (0.11MB) done # parsing blast result! (sample_buch_fasta_homologues/tmp/all.blast , 0.42MB) # parsing blast file finished # creating indexes, this might take some time (lines=9.30e+03) ... 
# construct_taxa_indexes: number of taxa found = 4 # number of file addresses = 9.3e+03 number of BLAST queries 21 = 2.0e+03 -R 0 # clustering orthologous sequences # clustering inparalogues in Buch_aph_Cc.faa (reference) # 0 sequences # clustering inparalogues in Buch_aph_APS.faa # 1 sequences # finding BDBHs between Buch_aph_Cc.faa and Buch_aph_APS.faa # 324 sequences # clustering inparalogues in Buch_aph_Bp.faa # 0 sequences # finding BDBHs between Buch_aph_Cc.faa and Buch_aph_Bp.faa # 326 sequences # clustering inparalogues in Buch_aphid_Sg.faa # 0 sequences # finding BDBHs between Buch_aph_Cc.faa and Buch_aphid_Sg.faa # 317 sequences # looking for valid ORF clusters (n_of_taxa=4)... # number_of_clusters = 305 # cluster_list = sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_.cluster_list # cluster_directory = sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_ # runtime: 64 wallclock secs ( 0.74 usr # RAM use: 20.3 MB 0.08 sys + 61.49 cusr 0.47 csys = 62.78 CPU) In summary, the output details the processing steps required: • Reading and parsing input files (Buch_aph_APS.faa,Buch_aph_Bp.faa,Buch_aph_Cc.faa,Buch_aphid_Sg.faa), which contain 574, 507, 357 and 546 protein sequences, respectively. In total there are four input taxa and 1984 sequences. • Preparing input sequences for BLAST. • Running BLAST searches and sorting the results. • Parsing the complete volume of sorted BLAST results. • Searching for orthologous sequences using the BDBH algorithm, which requires a reference taxon or proteome to start with (see Figure 3). • Clustering orthologous sequences and put them in files inside an appropriate folder. In this example the relevant output is directory sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_ together with file sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_.cluster_list, which lists the found clusters and their taxa composition. 
It can be seen that the folder name contains the key settings used to cluster the sequences contained therein:

BuchaphCc_f0_alltaxa_algBDBH_e0_
|        |  |       |       |
|        |  |       |       -e option was not used (inparalogues are in)
|        |  |       the clustering algorithm is BDBH (default)
|        |  all clusters contain at least 1 sequence from each taxa (default -t behavior)
|        -f option not used (no length filtering)
reference proteome

In this case a total of 305 protein sequence clusters are produced, which include the original FASTA headers plus information on which segment was actually aligned by BLAST for inclusion in the cluster:

>gi|116515296| Rho [Buchnera aphidicola str. Cc (Cinara cedri)] | aligned:1-419 (420)
MNLTKLKNTSVSKLIILGEKIGLENLARMRKQDIIFSILKQHSKSGEDIFGDGVLEILQDGFGFLRSSDSSYLAGPDDIYVSPS...
>gi|15617182| termination factor Rho [Buchnera aphidicola str. APS] | aligned:1-419 (419)
MNLTALKNMPVSELITLGEKMGLENLARMRKQDIIFAILKQHAKSGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPS...
>gi|27905006| termination factor Rho [Buchnera aphidicola str. Bp] | aligned:1-419 (419)
MNLTALKNIPVSELIFLGDNAGLENLARMRKQDIIFSILKQHAKSGEDIFGDGVLEILQDGFGFLRSSDSSYLAGPDDIYVSPS...
>gi|21672828| termination factor Rho [Buchnera aphidicola str. Sg] | aligned:1-419 (419)
MNLTALKNMPVSELITLGEKMGLENLARMRKQDIIFAILKQHAKSGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPS...

If we wanted to test a different sequence clustering algorithm we could run $ ./get_homologues.pl -d sample_buch_fasta -G, which will produce 298 clusters employing the COG triangles algorithm (see Table 2) in folder sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algCOG_e0_. Furthermore, typing $ ./get_homologues.pl -d sample_buch_fasta -M produces 308 clusters employing the OMCL algorithm in folder sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algOMCL_e0_. Now we can make use of script compare clusters.pl to get the intersection between these cluster sets and choose only the consensus subset.
We will need to type (without any blanks between folder names, in a single long line) and execute: ./compare_clusters.pl -o sample_intersection -d \ sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_, \ sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algCOG_e0_, \ sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algOMCL_e0_ The following output is produced: # number of input cluster directories = 3 # # # # # # # # # parsing clusters in sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_ ... cluster_list in place, will parse it (BuchaphCc_f0_alltaxa_algBDBH_e0_.cluster_list) number of clusters = 305 parsing clusters in sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algCOG_e0_ ... cluster_list in place, will parse it (BuchaphCc_f0_alltaxa_algCOG_e0_.cluster_list) number of clusters = 298 parsing clusters in sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algOMCL_e0_ ... cluster_list in place, will parse it (BuchaphCc_f0_alltaxa_algOMCL_e0_.cluster_list) number of clusters = 308 # intersection output directory: sample_intersection # intersection size = 295 clusters # intersection list = sample_intersection/intersection_t0.cluster_list # input set: sample_intersection/BuchaphCc_f0_alltaxa_algBDBH_e0_.venn_t0.txt 23 # input set: sample_intersection/BuchaphCc_f0_alltaxa_algCOG_e0_.venn_t0.txt # input set: sample_intersection/BuchaphCc_f0_alltaxa_algOMCL_e0_.venn_t0.txt # # # # Venn Venn Venn Venn diagram = sample_intersection/venn_t0.pdf region file: sample_intersection/unique_BuchaphCc_f0_alltaxa_algBDBH_e0_.venn_t0.txt (5) region file: sample_intersection/unique_BuchaphCc_f0_alltaxa_algCOG_e0_.venn_t0.txt (0) region file: sample_intersection/unique_BuchaphCc_f0_alltaxa_algOMCL_e0_.venn_t0.txt (5) The 295 resulting clusters, those present in all input cluster sets, are placed in a new folder which was designated by parameter -o sample_intersection. 
Note that these are clusters that belong to the core-genome, as they contain sequence from all input taxa. A Venn diagram, such as the one in Figure 8, might also be produced which summarizes the analysis. Figure 8: Venn diagram showing the overlap between clusters of ’orthologous’ sequences produced by three different algorithms and otherwise identical settings. If we are interested only in clusters containing single-copy proteins from all input species, as they are probably safer orthologues, we can add the option -t 4 to our previous command, as our example dataset contains 4 input proteomes. 24 4.2 Clustering orthologous proteins from a single FASTA file A similar analysis could be performed with a single input FASTA file containing amino acid sequences, provided that each contains a [taxon name] in its header, as explained in section 3.1: >gi|10957100|ref|NP_057962.1| ... [Buchnera aphidicola str. APS (Acyrthosiphon pisum)] MFLIEKRRKLIQKKANYHSDPTTVFNHLCGSRPATLLLETAEVNKKNNLESIMIVDSAIRVSAVKNSVKI TALSENGAEILSILKENPHKKIKFFEKNKSINLIFPSLDNNLDEDKKIFSLSVFDSFRFIMKSVNNTKRT SKAMFFGGLFSYDLISNFESLPNVKKKQKCPDFCFYLAETLLVVDHQKKTCLIQSSLFGRNVDEKNRIKK RTEEIEKKLEEKLTSIPKNKTTVPVQLTSNISDFQYSSTIKKLQKLIQKGEIFQVVPSRKFFLPCDNSLS AYQELKKSNPSPYMFFMQDEDFILFGASPESSLKYDEKNRQIELYPIAGTRPRGRKKDGTLDLDLDSRIE LEMRTNHKELAEHLMLVDLARNDLARICEPGSRYVSDLVKVDKYSHVMHLVSKVVGQLKYGLDALHAYSS CMNMGTLTGAPKVRAMQLIAEYEGEGRGSYGGAIGYFTDLGNLDTCITIRSAYVESGVATIQAGAGVVFN SIPEDEVKESLNKAQAVINAIKKAHFTMGSS [...] >gi|15616637|ref|NP_239849.1| ... [Buchnera aphidicola str. APS (Acyrthosiphon pisum)] MTSTKEIKNKIVSVTNTKKITKAMEMVAVSKMRKTEERMRSGRPYSDIIRKVIDHVTQGNLEYKHSYLEE RKTNRIGMIIISTDRGLCGGLNTNLFKQVLFKIQNFAKVNIPCDLILFGLKSLSVFKLCGSNILAKATNL GENPKLEELINSVGIILQEYQCKRIDKIFIAYNKFHNKMSQYPTITQLLPFSKKNDQDASNNNWDYLYEP ESKLILDTLFNRYIESQVYQSILENIASEHAARMIAMKTATDNSGNRIKELQLVYNKVRQANITQELNEI VSGASAVSID [...] >gi|21672839|ref|NP_660906.1| ... [Buchnera aphidicola str. 
Sg (Schizaphis graminum)] MHLNKMKKVSLKTYLVLFFLIFFIFCSFWFIKPKEKKLKLEKLRYEEVIKKINAKNNQNLKSVENFITEN KNIYGTLSSLFLAKKYILDKNLDKALIQLNNSLKYTKEENLQNILKIRIAKIKIQQNKNQDAIKILEEIK DNSWKNIVENMKGDIFMKNKEIKKAILAWKKSKYLEKSNASKEIINMKINEIKR It is possible to analyze the provided sample input file sample_buchnera.faa with the following command: $ ./get_homologues.pl -i sample_buchnera.faa . Obtaining: # results_directory=sample_buchnera_homologues # parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 # checking input files... # sample_buchnera.faa # created file sample_buchnera_homologues/tmp/all.fa (4 genomes, 1984 sequences) # taxa considered = 4 sequences = 1984 residues = 650959 MIN_BITSCORE_SIM = 17.2 # mask=BuchneraaphidicolastrCcCinaracedri3_f0_alltaxa_algBDBH_e0_ (_algBDBH) # running makeblastdb with sample_buchnera_homologues/tmp/all.fa # running local BLAST search # done # parsing blast result! (sample_buchnera_homologues/tmp/all.blast , 0.44MB) # parsing blast file finished # creating indexes, this might take some time (lines=9.30e+03) ... # construct_taxa_indexes: number of taxa found = 4 # number of file addresses = 9.3e+03 number of BLAST queries # clustering orthologous sequences 25 = 2.0e+03 # clustering inparalogues in Buchnera_aphidicola_str__Cc__Cinara_cedri__3.faa (reference) # 0 sequences [...] # looking for valid ORF clusters (n_of_taxa=4)... # number_of_clusters = 305 # cluster_list = [...]/BuchneraaphidicolastrCcCinaracedri3_f0_alltaxa_algBDBH_e0_.cluster_list # cluster_directory = [...]/BuchneraaphidicolastrCcCinaracedri3_f0_alltaxa_algBDBH_e0_ # runtime: 55 wallclock secs ( 0.76 usr # RAM use: 21.3 MB 0.04 sys + 51.75 cusr 26 0.23 csys = 52.78 CPU) 4.3 Clustering genes and proteins extracted from GenBank files The use of input files in GenBank format allows clustering nucleotide sequences in addition to proteins, since this format supports the annotation of raw genomic sequences. 
This example illustrates this feature by taking the input folder sample_plasmids_gbk, which contains 12 GenBank files of plasmid replicons, which we analyze by running $ ./get_homologues.pl -d sample_plasmids_gbk : # results_directory=sample_plasmids_gbk_homologues # parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 # # # # # # # # # # # # # checking input files... E_coli_ST131_plasmid_pKC394.gb 55 E_coli_plasmid_pMUR050.gb 60 IncN_plasmid_R46.gb 63 K_oxytoca_plasmid_pKOX105.gb 69 K_pneumoniae_12_plasmid_12.gb 92 K_pneumoniae_9_plasmid_9.gb 87 K_pneumoniae_KP96_plasmid_pKP96.gb 64 S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2.gb 52 Uncultured_bacterium_plasmid_pRSB201.gb 58 Uncultured_bacterium_plasmid_pRSB203.gb 49 Uncultured_bacterium_plasmid_pRSB205.gb 52 Uncultured_bacterium_plasmid_pRSB206.gb 55 # 12 genomes, 756 sequences # taxa considered = 12 sequences = 756 residues = 184339 MIN_BITSCORE_SIM = 16.0 # mask=EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_ (_algBDBH) [..] # running BLAST searches ... # done # # # # # # # # # # # # # # concatenating and sorting blast results... sorting _E_coli_ST131_plasmid_pKC394.gb results (0.026MB) sorting _E_coli_plasmid_pMUR050.gb results (0.026MB) sorting _IncN_plasmid_R46.gb results (0.026MB) sorting _K_oxytoca_plasmid_pKOX105.gb results (0.031MB) sorting _K_pneumoniae_12_plasmid_12.gb results (0.036MB) sorting _K_pneumoniae_9_plasmid_9.gb results (0.027MB) sorting _K_pneumoniae_KP96_plasmid_pKP96.gb results (0.026MB) sorting _S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2.gb results (0.025MB) sorting _Uncultured_bacterium_plasmid_pRSB201.gb results (0.029MB) sorting _Uncultured_bacterium_plasmid_pRSB203.gb results (0.023MB) sorting _Uncultured_bacterium_plasmid_pRSB205.gb results (0.026MB) sorting _Uncultured_bacterium_plasmid_pRSB206.gb results (0.026MB) done # parsing blast result! 
(sample_plasmids_gbk_homologues/tmp/all.blast , 0.33MB) # parsing blast file finished # creating indexes, this might take some time (lines=7.61e+03) ... # construct_taxa_indexes: number of taxa found = 12 # number of file addresses = 7.6e+03 number of BLAST queries 27 = 7.6e+02 # clustering orthologous sequences # clustering inparalogues in E_coli_plasmid_pMUR050.gb (reference) # 2 sequences [...] # looking for valid ORF clusters (n_of_taxa=12)... # number_of_clusters = 24 # cluster_list = [...]_homologues/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_.cluster_list # cluster_directory = sample_plasmids_gbk_homologues/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_ This outcome is similar to that explained in example 4.1, with the notable difference that now both protein and nucleotide sequence clusters (24) are produced, as GenBank files usually contain both types of sequences. File EcoliplasmidpMUR050_f0 summarizes the contents and composition of the clusters stored in folder EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_. For instance, the data concerning cluster 100_traJ looks like this: cluster 100_traJ size=12 taxa=12 file: 100_traJ.faa dnafile: 100_traJ.fna : E_coli_plasmid_pMUR050.gb : E_coli_ST131_plasmid_pKC394.gb : IncN_plasmid_R46.gb : K_oxytoca_plasmid_pKOX105.gb : K_pneumoniae_12_plasmid_12.gb : K_pneumoniae_9_plasmid_9.gb : K_pneumoniae_KP96_plasmid_pKP96.gb : S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2.gb : Uncultured_bacterium_plasmid_pRSB201.gb : Uncultured_bacterium_plasmid_pRSB203.gb : Uncultured_bacterium_plasmid_pRSB205.gb : Uncultured_bacterium_plasmid_pRSB206.gb The two FASTA files produced for this cluster are now dissected. Note that each header includes the coordinates of the sequence in the context of a genomic contig. For instance, the first sequence was extracted from the leading strand of GenBank contig AY522431, positions 44726-46255, out of a total 56634 nucleotides. 
Furthermore, the names of neighboring genes are annotated when available, in order to capture some synteny information. These syntenic data can be valuable when evaluating possible orthologous genes, as conservation of genomic position (also operon context) strongly suggests orthology among prokaryotes:

>ID:ABG33824.1 |[Escherichia coli]||traJ|1530|AY522431(56634):44726-46255:-1 [...]|neighbour_genes:traI,traK|
ATGGACGATAGAGAAAGAGGCTTAGCATTTTTATTTGCAATTACTTTGCCTCCAGTGATGGTATGGTTTCTAGTT...
[...]

and

>ID:ABG33824.1 |[Escherichia coli]||traJ|1530|AY522431(56634):44726-46255:-1 [...] | aligned:1-509 (509)
MDDRERGLAFLFAITLPPVMVWFLV...

4.4 Clustering genes and proteins that share Pfam domain architecture

The BDBH algorithm in get homologues.pl can be modified by requiring bidirectional best hits to share the same domain architecture, annotated in terms of Pfam domains. For large volumes of sequences this task should be accomplished on a computer cluster, but of course it can also be performed locally. The command on the terminal could then be:

$ ./get_homologues.pl -d sample_plasmids_gbk -D

The generated output should be:

# results_directory=sample_plasmids_gbk_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250
# checking input files...
E_coli_ST131_plasmid_pKC394.gb 55 E_coli_plasmid_pMUR050.gb 60 IncN_plasmid_R46.gb 63 K_oxytoca_plasmid_pKOX105.gb 69 K_pneumoniae_12_plasmid_12.gb 92 K_pneumoniae_9_plasmid_9.gb 87 K_pneumoniae_KP96_plasmid_pKP96.gb 64 S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2.gb 52 Uncultured_bacterium_plasmid_pRSB201.gb 58 Uncultured_bacterium_plasmid_pRSB203.gb 49 Uncultured_bacterium_plasmid_pRSB205.gb 52 Uncultured_bacterium_plasmid_pRSB206.gb 55 # 12 genomes, 756 sequences # taxa considered = 12 sequences = 756 residues = 184339 MIN_BITSCORE_SIM = 16.0 # mask=EcoliplasmidpMUR050_f0_alltaxa_algBDBH_Pfam_e0_ (_algBDBH_Pfam) # skipped genome parsing (sample_plasmids_gbk_homologues/tmp/selected.genomes) # submitting Pfam HMMER jobs ... [...] # done # concatenating Pfam files ([...]/_E_coli_ST131_plasmid_pKC394.gb.fasta.pfam)... # done [..] # parsing Pfam domain assignments (generating sample_plasmids_gbk_homologues/tmp/all.pfam) ... # skip BLAST searches and parsing # WARNING: please remove/rename results directory: # ’/home/contrera/codigo/cvs/get_homologues/sample_plasmids_gbk_homologues/’ # if you change the sequences in your .gbk/.faa files or want to re-run # creating indexes, this might take some time (lines=7.61e+03) ... # construct_taxa_indexes: number of taxa found = 12 # number of file addresses = 7.6e+03 number of BLAST queries 29 = 7.6e+02 # creating Pfam indexes, this might take some time (lines=7.54e+02) ... # clustering orthologous sequences # clustering inparalogues in E_coli_plasmid_pMUR050.gb (reference) # 2 sequences (re-using previous results) [...] # looking for valid ORF clusters (n_of_taxa=12)... # number_of_clusters = 24 # cluster_list = [...]/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_Pfam_e0_.cluster_list # cluster_directory = sample_plasmids_gbk_homologues/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_Pfam_e0_ Matching Pfam domains are summarized in the .cluster_list file, with this format: cluster 606_.. 
size=8 taxa=8 Pfam=PF04471, file: 606_...faa 606_...fna

4.5 Clustering syntenic/neighbor genes

The sequence clusters derived from a set of GenBank files can be further processed in order to select those that contain only syntenic genes, defined as those having at least one neighbor included in other clusters. Again we will invoke script compare clusters.pl for this task:

./compare_clusters.pl -o sample_intersection -s -d \
sample_plasmids_gbk_homologues/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_

The following output is produced:

# number of input cluster directories = 1
# parsing clusters in sample_plasmids_gbk_homologues/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_ ...
# cluster_list in place, will parse it ([...]/EcoliplasmidpMUR050_f0_alltaxa_algBDBH_e0_.cluster_list)
# number of clusters = 24
# intersection output directory: sample_intersection
# intersection size = 21 clusters (syntenic)
# intersection list = sample_intersection/intersection_t0_s.cluster_list

Figure 9: A cluster is called syntenic when it contains neighboring genes which are also contained in other single clusters. In this example, genes X and Z of species 1, 2 and 3 are found to be syntenic, regardless of their orientation.

4.6 Comparing clusters with external sequence sets

Sometimes we will need to compare clusters of possibly orthologous sequences, produced by get homologues.pl in any of the ways explained earlier, with a set of sequences defined elsewhere, for instance in a publication. This can be done to validate a set of core clusters and to check that nothing important was left out. We can accomplish just this with help from script compare clusters.pl, invoking option -r, which indicates that the first parsed cluster folder is actually a reference to be compared. To illustrate this application we have set up a folder with 4 protein sequences from Buchnera aphidicola strain Cinara cedri (directory sample_buch_fasta/sample_proteins), each sequence in a single FASTA file.
Note that these clusters must contain sequences present in the larger dataset we want to compare with; otherwise the script will not match them. Headers are not used by the program, only the sequences matter. In order to check whether these sequences are clustered in any of the clusters generated earlier, say with BDBH, we will issue a command such as:

./compare_clusters.pl -o sample_intersection -r -d \
sample_buch_fasta/sample_proteins,\
sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_

The following output should be produced:

# number of input cluster directories = 2
# parsing clusters in sample_buch_fasta/sample_proteins ...
# no cluster list in place, checking directory content ...
# WARNING: [taxon names] will be automatically extracted from FASTA headers,
# please watch out for errors
# number of clusters = 4
# parsing clusters in sample_buch_fasta_homologues/BuchaphCc_f0_alltaxa_algBDBH_e0_ ...
# cluster_list in place, will parse it ([...]/BuchaphCc_f0_alltaxa_algBDBH_e0_.cluster_list)
# number of clusters = 305
# intersection output directory: sample_intersection
# intersection size = 4 clusters
# intersection list = sample_intersection/intersection_t0.cluster_list
# input set: sample_intersection/sample_proteins.venn_t0.txt
# input set: sample_intersection/BuchaphCc_f0_alltaxa_algBDBH_e0_.venn_t0.txt
# Venn diagram = sample_intersection/venn_t0.pdf
# Venn region file: sample_intersection/unique_sample_proteins.venn_t0.txt (0)
# Venn region file: sample_intersection/unique_BuchaphCc_f0_alltaxa_algBDBH_e0_.venn_t0.txt (301)

4.7 Clustering intergenic segments from GenBank files

The use of input files in GenBank format also allows the extraction of clusters of flanked orthologous intergenic regions, which might be of interest as these are expected to mutate at higher rates than coding sequences.
In this example this feature is illustrated by processing folder sample_plasmids_gbk with options -g -I sample_plasmids_gbk/include_list.txt. The restraints that apply to the parsed intergenic regions are defined by three global variables within get_homologues.pl, as explained in section 3.4. These default values might be edited for specific tasks; for instance, chloroplast intergenic regions are usually much smaller than 200 bases, the default size, and therefore variable $MININTERGENESIZE should be set to a smaller value. Moreover, in this example we restrict the search for conserved intergenic segments to Klebsiella pneumoniae plasmids, by creating a file sample_plasmids_gbk/include_list.txt with these contents:

K_pneumoniae_12_plasmid_12.gb
K_pneumoniae_9_plasmid_9.gb
K_pneumoniae_KP96_plasmid_pKP96.gb

We can now execute $ ./get_homologues.pl -d sample_plasmids_gbk -g -I sample_plasmids_gbk/include_list.txt:

# results_directory=sample_plasmids_gbk_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250
# checking input files...
# E_coli_ST131_plasmid_pKC394.gb 55 (intergenes=7)
# E_coli_plasmid_pMUR050.gb 60 (intergenes=12)
# IncN_plasmid_R46.gb 63 (intergenes=11)
# K_oxytoca_plasmid_pKOX105.gb 69 (intergenes=13)
# K_pneumoniae_12_plasmid_12.gb 92 (intergenes=11)
# K_pneumoniae_9_plasmid_9.gb 87 (intergenes=12)
# K_pneumoniae_KP96_plasmid_pKP96.gb 64 (intergenes=18)
# S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2.gb 52 (intergenes=9)
# Uncultured_bacterium_plasmid_pRSB201.gb 58 (intergenes=9)
# Uncultured_bacterium_plasmid_pRSB203.gb 49 (intergenes=7)
# Uncultured_bacterium_plasmid_pRSB205.gb 52 (intergenes=8)
# Uncultured_bacterium_plasmid_pRSB206.gb 55 (intergenes=10)
# 12 genomes, 756 sequences
# included input files (3):
: K_pneumoniae_12_plasmid_12.gb 92
: K_pneumoniae_9_plasmid_9.gb 87
: K_pneumoniae_KP96_plasmid_pKP96.gb 64
[...]
# looking for valid ORF clusters (n_of_taxa=3)...
# number_of_clusters = 32
# cluster_list = sample_plasmids_gbk_homologues/[...]include_list.txt_algBDBH_e0_.cluster_list
# cluster_directory = sample_plasmids_gbk_homologues/[...]include_list.txt_algBDBH_e0_
# looking for valid clusters of intergenic regions (n_of_taxa=3)...
# parameters: MININTERGENESIZE=200 MAXINTERGENESIZE=700 INTERGENEFLANKORF=180
# number_of_intergenic_clusters = 1
# intergenic_cluster_list = [...]/[...]_intergenic200_700_180_.cluster_list
# intergenic_cluster_directory = sample_plasmids_gbk_homologues/[...]_intergenic200_700_180_
# runtime: 1 wallclock secs ( 0.10 usr 0.01 sys + 0.05 cusr 0.01 csys = 0.17 CPU)
# RAM use: 27.6 MB

Intergenic clusters, illustrated by Figure 3.4, include upper-case nucleotides to mark the sequence of the flanking ORFs, with the intergenic region itself in lower-case, and the names of the flanking ORFs in the FASTA header, with their strand in parentheses:

>1 | intergenic18|coords:63706..64479|length:774|neighbours:ID:ABY74399.1(-1),ID:ABY74398.1(1)...
CGCGCCATTGCTGGCCTGAAGGTATTCCCAATACCCTCCCTGGTAGTCTTTAGCGTAACGATTCAGAAAGGACTGAATGAAGTGATCTGCGCTGAAGAAAGCG
CCACGAAATGCCGCAGGCATGAAGTTCATGCGGGCGTTTTCAGAAATGTAGCGGGCGGTGATTTCGATAGTTTCCATgatacttcctctttaagccgataccg
gcgatggttaagcggcaggcacatcacctgccactttttaattatcgtacaatggggcgttaaagtcaatacaagtacggattatatttacctaattttatgc
ccgtcagagcatggaaggcgacctcgccggactccaccggacaccgggggcaaatcgccggaaactgcgggactgaccggagcgacaggccacccccctccct
gctagcccgccgccacgcggccggttacaggggacactgagaaagcagaaagccaacaaacactatatatagcgttcgttggcagctgaagcagcactacata
tagtagagtacctgtaaaacttgccaacctgaccataacagcgatactgtataagtaaacagtgatttggaagatcgctATGAAGGTCGATATTTTTGAAAGC
TCCGGCGCCAGCCGGGTACACAGCATCCCTTTTTATCTGCAAAGAATTTCTGCGGGGTTCCCCAGCCCGGCCCAGGGCTATGAAAAGCAGGAGTTAAACCTGC
ATGAGTATTGTGTTCGTCACCCTTCAGCAACTTACTTCCTGCGGGTTTCTGGC
>2 | intergenic3|coords:9538..10293|length:756|neighbours:ID:ACI62996.1(-1),ID:ACI62997.1(1)...
CGCGCCATTGCTGGCCTGAAGGTATTCCCAATACCCTCCCTGGTAGTCTTTAGCGTAACGATTCAGAAAGGACTGAATGAAGTGATCTGCGCTGAAGAAAGCG CCACGAAATGCCGCAGGCATGAAGTTCATGCGGGCGTTTTCAGAAATGTAGCGGGCGGTGATTTCGATAGTTTCCATgatacttcctctttaagccgataccg gcgatggttaagcggcaggcacatcacctgccactttttaattatcgtacaatggggcgttaaagtcaatacaagtacggattatatttacctaattttatgc ccgtcagagcatggaaggcgacctcgccggactccaccggacaccgggggcaaatcgccggaaactgcgggactgaccggagcgacaggccacccccctccct gctagcccgccgccacgcggccggttacaggggacactgagaaagcagaaagccaacaaacactatatatagcgttcgttggcagctgaagcagcactacata tagtagagtacctgtaaaacttgccaacctgaccataacagcgatactgtataagtaaacaGTGATTTGGAAGATCGCTATGAAGGTCGATATTTTTGAAAGC TCCGGCGCCAGCCGGGTACACAGCATCCCTTTTTATCTGCAAAGAATTTCTGCGGGGTTCCCCAGCCCGGCCCAGGGCTATGAAAAGCAGGAGTTAAACCTGC ATGAGTATTGTGTTCGTCACCCTTCAGCAACTTAC ... 34 4.8 Performing genome composition analyses The next few examples illustrate how get homologues.pl might be used to analyze the genomic evolution of a group of related organisms, the core-genome and the pan-genome, using the terms coined by Tettelin and collaborators (PubMed=16172379). 4.8.1 Obtaining a pangenome matrix First we will try option -t 0 in combination with the OMCL or the COG algorithms. By enforcing this option we are actually asking for all possible clusters, including those which might not contain sequences from all input genomes (taxa). For this reason the use of this option usually means that a large number of clusters are reported. This is particularly true for COG runs, since this algorithm does not resolve clusters involving less than 3 genomes. The default algorithm BDBH is not available with this option. For instance, by calling $ ./get_homologues.pl -d sample_plasmids_gbk -t 0 -G we obtain 199 clusters: [...] # looking for valid ORF clusters (n_of_taxa=0)... 
# number_of_clusters = 199
# cluster_list = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algCOG_e0_.cluster_list
# cluster_directory = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algCOG_e0_

By choosing the OMCL algorithm we obtain a smaller set of clusters, which we can test by typing on the terminal $ ./get_homologues.pl -d sample_plasmids_gbk -t 0 -M:

[...]
# looking for valid ORF clusters (n_of_taxa=0)...
# number_of_clusters = 193
# cluster_list = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_

We can now take advantage of script compare_clusters.pl, and the generated cluster directories, to compile the corresponding pangenome matrix. This can be accomplished for a single cluster set:

./compare_clusters.pl -o sample_intersection -m -d \
sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algCOG_e0_

or for the intersection of several sets, in order to get a consensus pangenome matrix:

./compare_clusters.pl -o sample_intersection -m -d \
sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algCOG_e0_,\
sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algOMCL_e0_

The output of the latter command will include the following lines:

[...]
# number of input cluster directories = 2
# parsing clusters in sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algCOG_e0_ ...
# cluster_list in place, will parse it ([...]/Uncultured[...]_f0_0taxa_algCOG_e0_.cluster_list)
# WARNING: skipping cluster 62_transposase.faa , seems to duplicate 59_transposase.faa
# WARNING: skipping cluster 116_tnpA.faa , seems to duplicate 59_transposase.faa
# number of clusters = 197
# parsing clusters in sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algOMCL_e0_ ...
# cluster_list in place, will parse it ([...]/Uncultured[...]_f0_0taxa_algOMCL_e0_.cluster_list)
# number of clusters = 193
[...]
# intersection size = 180 clusters
# intersection list = sample_intersection/intersection_t0.cluster_list
# pangenome_file = sample_intersection/pangenome_matrix_t0.tab
# pangenome_phylip file = sample_intersection/pangenome_matrix_t0.phylip
# pangenome_FASTA file = sample_intersection/pangenome_matrix_t0.fasta
# pangenome CSV file (Scoary) = sample_intersection/pangenome_matrix_t0.tr.csv
# input set: sample_intersection/Uncultured[...]_f0_0taxa_algCOG_e0_.venn_t0.txt
# input set: sample_intersection/Uncultured[...]_f0_0taxa_algOMCL_e0_.venn_t0.txt
# Venn diagram = sample_intersection/venn_t0.pdf
# Venn region file: sample_intersection/unique_Uncultured[...]_f0_0taxa_algCOG_e0_.venn_t0.txt (17)
# Venn region file: sample_intersection/unique_Uncultured[...]_f0_0taxa_algOMCL_e0_.venn_t0.txt (13)

Note that the skipped clusters correspond precisely to COG unresolved clusters. This script produces several versions of the same pangenomic matrix:

• A full detailed matrix in tab-separated columns, with taxa/genomes as rows and sequence clusters as columns, in which natural numbers indicate how many sequences from a given cluster are present in a given taxon. Such files can be read and edited with any text editor or spreadsheet software.

source:folder 7_transposase_A.faa 8_tnpA.faa 9_mphA.faa ...
K_oxytoca_plasmid_pKOX105.gb 0 0 0 ...
E_coli_plasmid_pMUR050.gb 0 0 0 ...
Uncultured_bacterium_plasmid_pRSB206.gb 0 0 0 ...
[...]

• A reduced binary matrix in a format suitable for PHYLIP discrete character analysis software, which looks like this:

12 180 <- 12 taxa, 180 clusters
0000000000 0000000010000000000000100000000000111111111111111 ...
0000000001 0000000101111111111111000000000000000000000000000 ...
0000000002 0000000010001000000000000000000000000000000000000 ...
0000000003 0000000001011000000001111111111111000000000000000 ...
0000000004 0000000011000000000000100000000000000000000000000 ...
0000000005 0000000110011100100000100100000000010000100000000 ...
0000000006 0000000010000000000000000000000000000000001000000 ...
0000000007 0000000010000000000000000000000000000000000000000 ...
0000000008 0000000010000000000000000000000000000100000000000 ...
0000000009 0000000011001000000000100000011111110110000000000 ...
0000000010 0000000000000000000000000000000000000000000000000 ...
0000000011 1111111110000000000000000000000000000000000000000 ...

• A reduced binary matrix in FASTA format suitable for binary character analysis software such as IQ-TREE, which can compute bootstrap and aLRT support (see example in Figure 11). The contents of this file look like this:

>K_oxytoca_plasmid_pKOX105.gb
00000000100000000000001000000000001111111111111111111110000000...
>E_coli_plasmid_pMUR050.gb
00000001011111111111110000000000000000000000000000000000000000...
>Uncultured_bacterium_plasmid_pRSB206.gb
00000000100010000000000000000000000000000000000000000100000000...
>IncN_plasmid_R46.gb
00000000010110000000011111111111110000000000000000000000000000...
...

• A transposed, reduced binary matrix in CSV format suitable for pangenome-wide association analysis with software Scoary. The contents of this file look like this:

Gene,Non-unique Gene name,Annotation,No. isolates,No. sequences,...
601_repA.faa,,,,,,,,,,,,,,1,1,1,1,1,1,1,1,1,1,1,1
602_resP.faa,,,,,,,,,,,,,,1,1,1,1,1,1,1,1,1,1,1,1
614_hypothetical_protein.faa,,,,,,,,,,,,,,1,1,1,1,1,1,1,1,1,1,1,1
...
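The reduced binary forms can be derived from the full TAB matrix by hand. The following sketch collapses copy numbers to presence/absence and emits records in the binary FASTA style shown above; the miniature matrix, its file name and genome names are all made up for illustration, not real get_homologues output:

```shell
# build a hypothetical miniature pangenome matrix in the TAB layout above:
# first column = genome, remaining columns = sequence copies per cluster
printf 'source:folder\tc1.faa\tc2.faa\tc3.faa\n'  > toy_matrix.tab
printf 'genomeA.gb\t2\t0\t1\n'                   >> toy_matrix.tab
printf 'genomeB.gb\t0\t0\t1\n'                   >> toy_matrix.tab

# collapse copy numbers to presence/absence and print binary FASTA records
awk -F'\t' 'NR > 1 {
    printf(">%s\n", $1)
    for (i = 2; i <= NF; i++) printf("%d", ($i > 0) ? 1 : 0)
    print ""
}' toy_matrix.tab > toy_matrix.fasta
```

Note that cells holding several inparalogues collapse to the same '1' as single-copy cells, which is why the binary versions discard copy-number information kept in the full matrix.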
Indeed, when option -T is toggled, as in the next example,

./compare_clusters.pl -o sample_intersection -m -T -d \
sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algCOG_e0_,\
sample_plasmids_gbk_homologues/Uncultured[...]_f0_0taxa_algOMCL_e0_

then the script calls program PARS from the PHYLIP suite to produce one or more alternative parsimony trees that capture the phylogeny implied in this matrix, adding the following lines to the produced output:

# parsimony results by PARS (PHYLIP suite, evolution.genetics.washington.edu/phylip/doc/pars.html):
# pangenome_phylip tree = sample_intersection/pangenome_matrix_t0.phylip.ph
# pangenome_phylip log = sample_intersection/pangenome_matrix_t0.phylip.log

The resulting tree (with extension .ph) is in Newick format and is shown in Figure 10. Please note that such files might contain several equally parsimonious trees separated by ';', one per line. In order to plot them, as in the next figure, it might be necessary to leave only one, depending on the software used. Both parsimony and ML trees with support estimates can be computed with script estimate_pangenome_trees.sh from the GET_PHYLOMARKERS pipeline. A complementary view of the same data can be obtained with script plot_matrix_heatmap.sh, which was called to produce Figure 12:

./plot_matrix_heatmap.sh -i sample_intersection/pangenome_matrix_t0.tab -o pdf \
-r -H 8 -W 14 -m 28 -t "sample pangenome (clusters=180)" -k "genes per cluster"

Figure 10: Example of pangenomic tree of the consensus COG and OMCL pangenomic matrix obtained for a few plasmids. Such trees can be useful to create the A and B lists discussed in the next section. Plot produced with FigTree, with midpoint root.

4.8.2 Interrogating a pangenome matrix

Script parse_pangenome_matrix.pl can be used to analyze a pangenome matrix, such as that created in the previous section.
It was primarily designed to identify genes present in a group A of species which are absent in another group B, but can also be used to find expansions/contractions of gene families. If you require the genes present/expanded in B with respect to A, just swap the two groups. Expanded clusters are defined as those where all A taxa contain more sequences than the maximum number of corresponding sequences in any taxon of group B. We now review these features with the same plasmid set of previous sections, analyzing the pangenome matrix produced by intersecting several cluster sets in section 4.8.1.

Let's say we are interested in finding plasmid genes present in Klebsiella oxytoca which are not encoded in K. pneumoniae KP96. In order to do this we first create a couple of text files to define sets A and B, called A.txt and B.txt, which we place inside folder sample_plasmids_gbk. The content of the A and B files should be one line per species. In this example file A.txt contains a single line:

K_oxytoca_plasmid_pKOX105.gb

as does B.txt:

K_pneumoniae_KP96_plasmid_pKP96.gb

We can now execute the script as follows:

./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab \
-A sample_plasmids_gbk/A.txt -B sample_plasmids_gbk/B.txt -g

The output should be:

# matrix contains 180 clusters and 12 taxa
# taxa included in group A = 1
# taxa included in group B = 1
# finding genes present in A which are absent in B ...
# file with genes present in set A and absent in B (21): [...]pangenome_matrix_t0__pangenes_list.txt

Figure 11: Pangenomic tree of the consensus COG and OMCL pangenomic matrix with bootstrap and aLRT values, respectively. Tree computed with a binary model and plotted with IQ-TREE.

Figure 12: Heatmap of the previous pangenome matrix, with dendrograms sorting genomes according to cluster occupancy.

It can be seen that 21 genes were found to be present in A and absent in B.
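What parse_pangenome_matrix.pl computes for the simple one-genome-vs-one-genome case above can be sketched with a short awk one-off. The miniature matrix and all names below are invented for illustration, not real output:

```shell
# hypothetical two-genome pangenome matrix in the TAB layout described earlier
printf 'source:folder\tg1.faa\tg2.faa\tg3.faa\tg4.faa\n'  > toy_matrix.tab
printf 'A_genome.gb\t1\t0\t2\t1\n'                       >> toy_matrix.tab
printf 'B_genome.gb\t1\t0\t0\t0\n'                       >> toy_matrix.tab

# keep clusters with >=1 copy in the A genome and 0 copies in the B genome
awk -F'\t' '
NR == 1 { ncol = NF; for (i = 2; i <= NF; i++) name[i] = $i; next }
$1 == "A_genome.gb" { for (i = 2; i <= NF; i++) inA[i] = $i }
$1 == "B_genome.gb" { for (i = 2; i <= NF; i++) inB[i] = $i }
END { for (i = 2; i <= ncol; i++) if (inA[i] > 0 && inB[i] == 0) print name[i] }
' toy_matrix.tab > present_A_absent_B.txt
```

With real A and B lists holding several genomes each, the same test is applied over all rows of each group, and option -P relaxes it to a percentage of genomes.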
In the case of pangenome matrices derived from GenBank files, as in this example, it is possible to produce a map of these genes in the genomic context of any species included in A, which should be queried using option -l. A valid syntax would be:

./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab \
-A sample_plasmids_gbk/A.txt -B sample_plasmids_gbk/B.txt -g \
-p 'Klebsiella oxytoca KOX105'

By default, parse_pangenome_matrix.pl requires genes to be present in all genomes of A and in none of B. However, as genomes might not be completely annotated, it is possible to make these tests more flexible by controlling the cutoff for inclusion with flag -P. For instance, the next command will require genes to be present only in 90% of A genomes and missing in 90% of B genomes:

./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab \
-A sample_plasmids_gbk/A.txt -B sample_plasmids_gbk/B.txt -g -P 90

Figure 13: Map of plasmid pKOX105 highlighting 7 genes absent in pKP96.

Note that the most flexible way of finding genes absent in a set of genomes within a pangenome matrix is by using option -a, which does not require an A list; a B list is sufficient. It is called as in the example:

./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab \
-B sample_plasmids_gbk/B.txt -a

4.8.3 Calculating cloud, shell and core genomes

parse_pangenome_matrix.pl can also be employed to classify genes in these four compartments:

class       alternative name                    definition
core        -                                   Genes contained in all considered genomes/taxa.
soft core   -                                   Genes contained in 95% of the considered genomes/taxa, as in the work of Kaas and collaborators (PubMed=23114024).
shell       dispensable (PubMed=25483351)       Remaining moderately conserved genes, present in several genomes/taxa.
cloud       strain-specific (PubMed=25483351)   Rare genes present only in a few genomes/taxa. The cutoff is defined as the class next to the most populated non-core cluster class.

Table 5: Definitions of pangenome compartments/occupancy classes used by GET_HOMOLOGUES, taken from PubMed=23241446. Accessory genes include both shell and cloud genes.

The script is invoked as follows:

./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab -s

The output is as follows:

# matrix contains 180 clusters and 12 taxa
# cloud size: 124 list: sample_intersection/pangenome_matrix_t0__cloud_list.txt
# shell size: 23 list: sample_intersection/pangenome_matrix_t0__shell_list.txt
# soft core size: 33 list: sample_intersection/pangenome_matrix_t0__softcore_list.txt
# core size: 24 (included in soft core) list: sample_intersection/pangenome_matrix_t0__core_list.txt
# using default colors, defined in %COLORS
# globals controlling R plots: $YLIMRATIO=1.2
# shell bar plots: sample_intersection/pangenome_matrix_t0__shell.png , [...]shell.pdf
# shell circle plots: sample_intersection/pangenome_matrix_t0__shell_circle.png , [...]circle.pdf
# pan-genome size estimates (Snipen mixture model PMID:19691844): [...]shell_estimates.tab

              Core.size  Pan.size  BIC               LogLikelihood
2 components  24         193       1056.08217644576  -520.251652946544
3 components  23         370       583.140322949438  -278.587769347493
4 components  14         417       570.835786136617  -267.242544090193
5 components  13         703       579.364023427564  -266.313705884776
6 components  12         954       589.754861142051  -266.31616789113
7 components  13         808       600.134442088951  -266.313001513689
8 components  8          549       610.689900917556  -266.397774077102
9 components  0          572       621.283299953595  -266.501516744231
10 components 0          489       632.354613188809  -266.844216
Sample        24         180       NA                NA

Apart from text files listing the cluster names that take part in each of the four compartments, two types of plots are generated. The length of the Y-axes in barplots can be controlled with global variable $YLIMRATIO, as can the colors used, by editing variable %RGBCOLORS.
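As a rough illustration of this occupancy-based classification, the sketch below labels made-up clusters by how many of 4 genomes contain them. This is a simplification for illustration only: the real script applies the 95% soft-core rule and derives the cloud cutoff from the distribution of occupancy classes, as defined in Table 5:

```shell
# made-up occupancy rows: cluster name, then copy numbers in 4 genomes
printf 'clusterA\t1\t1\t1\t1\n'  > toy_occupancy.tab
printf 'clusterB\t1\t1\t1\t0\n' >> toy_occupancy.tab
printf 'clusterC\t1\t0\t0\t0\n' >> toy_occupancy.tab

# count occupied genomes per cluster; crude cutoffs chosen for this toy set:
# all genomes -> core, at most one genome -> cloud, in between -> shell
awk -F'\t' '{
    n = 0
    for (i = 2; i <= NF; i++) if ($i > 0) n++
    cls = (n == NF - 1) ? "core" : (n <= 1 ? "cloud" : "shell")
    print $1, cls
}' toy_occupancy.tab > toy_classes.txt
```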
Note that the output also includes estimates of the pan- and core-genome sizes as calculated by the binomial mixture model of Snipen and collaborators (PubMed=19691844). A simple interpretation is that as soon as likelihood converges then adding more components does not improve the mixture model. Please check that paper for a full explanation. 41 Figure 14: Barplot of the pangenome matrix created in section 4.8.1. Core clusters are in white for clarity, but note that according to the definitions in Table 5 the soft core also includes the strict core. 42 Figure 15: Area plot of the pangenome matrix created in section 4.8.1. Note that the soft core compartment includes also the core, as implied by the definition in Table 5. Optional flag -x can be added to the previous command in order to produce an intersection pangenome matrix: ./parse_pangenome_matrix.pl -m sample_intersection/pangenome_matrix_t0.tab -s -x The output will now include the following: # intersection pangenome matrix: pangenome_matrix_t0__intersection.tab # mean %cluster intersection: 71.57 The reported mean value is the percentage of sequence clusters shared by any two input genomes. The TAB-separated file contains the number of clusters shared by all pairs of input genomes/sequence sets. Note that these clusters might contain several inparalogues of the same species: intersection E_coli_ST131_plasmid_pKC394.gb E_coli_plasmid_pMUR050.gb IncN_plasmid_R46.gb ... E_coli_ST131_plasmid_pKC394.gb 51 39 39 ... E_coli_plasmid_pMUR050.gb 39 52 41 ... IncN_plasmid_R46.gb 39 41 58 ... ... 43 4.8.4 Estimating core/pan-genome size by sampling genomes The pioneer work of Tettelin and collaborators (PubMed=16172379) unveiled that bacterial genomes are dynamic containers that harbour essential genes and also accessory elements, which might be unique to each community. get homologues.pl can be used to perform such genome composition analyses. 
The rationale is to sample a set of genomes (present in the input folder) and keep adding genome after genome keeping track of i) the novel genes added to the pool and ii) the genes that fall in pre-existing clusters. This sampling experiment can be done with any of the included 3 algorithms (please see Table 2), by invoking option -c . For instance, we could try $ ./get_homologues.pl -d sample_buch_fasta -c: [... same as first example ...] # # # # # # # genome composition report (samples=10,permutations=24) genomic report parameters: MIN_PERSEQID_HOM=0 MIN_COVERAGE_HOM=20 genome order: 0 Buch_aph_APS.faa 1 Buch_aph_Bp.faa 2 Buch_aph_Cc.faa 3 Buch_aphid_Sg.faa ## sample 0 (Buch_aph_APS.faa | 0,1,2,3,) # adding Buch_aph_APS.faa: core=574 pan=574 [...] # pan-genome (number of genes, can be plotted with plot_pancore_matrix.pl) # file=sample_buch_fasta_homologues/pan_genome_algBDBH.tab genomes mean stddev | samples 0 490 90 | 574 507 507 546 357 357 546 574 357 574 1 572 28 | 598 585 585 587 521 521 562 592 575 598 2 597 5 | 606 593 594 594 594 596 600 599 591 606 3 608 5 | 615 602 600 608 605 605 611 613 605 615 # core-genome (number of genes, can be plotted with plot_pancore_matrix.pl) # file=sample_buch_fasta_homologues/core_genome_algBDBH.tab genomes mean stddev | samples 0 490 90 | 574 507 507 546 357 357 546 574 357 574 1 420 82 | 466 466 466 523 324 324 317 523 324 466 2 327 36 | 319 319 434 315 310 315 313 318 309 319 3 310 4 | 313 313 313 311 304 304 311 313 304 313 [... same as first example ...] As can be seen, the output now contains two data frames which summarize the genome composition analysis done by sampling, which are also stored as tab-separated files. These text files can be used to plot the core- and pan-genome, with help from the accompanying script plot pancore matrix.pl. 
A suitable command would be:

./plot_pancore_matrix.pl -i sample_buch_fasta_homologues/core_genome_algBDBH.tab -f core_Tettelin

The script also supports the core function as modified by Willenbrock and collaborators (PubMed=18088402), as shown in the next figure for a more realistic set of 35 genomes. Both fits can be superimposed by calling option -f core_both. Besides standard core- and pan-genomes, it is possible to estimate the evolution of the soft core-genome, which is a relaxed version of the core that considers genes found in a fraction (by default 0.95) of genomes, and thus accommodates some annotation or assembly errors. This experiment can be done with either the OMCL or COGS algorithms by invoking options -c -z. The resulting data file can be plotted the same way.

The genome composition analyses presented so far are actually random sampling experiments. It is thus worth mentioning that the user can control the order in which genomes are sampled during these simulations, by enforcing a list of genomes with option -I, already introduced in section 4.7, or by setting the seed of the random number generator with option -R. In the first case only one sampling is performed and therefore the standard deviation of the core and pan values is zero. The second strategy ensures that the sampling order is conserved across program executions and thus allows merging CDS core-genomes and non-coding gene (such as rRNA) core-genomes computed separately over the same set of taxa, with help from accompanying script add_pancore_matrices.pl.

Figure 16: Core-genome (left) and pan-genome (right) estimates after ten random samples of 4 taxa. Fitted curves follow functions first proposed by Tettelin in 2005 (PubMed=16172379). Residual standard errors are reported on the right margin as a measure of the goodness of fit.

Figure 17: Core-genome estimate after ten random samples of 35 taxa.
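The sampling experiment itself reduces to set operations over per-genome gene inventories. The sketch below, using three invented gene lists standing in for genomes, accumulates the pan-genome (union) and core-genome (intersection) as genomes are added in one fixed order; get_homologues.pl repeats such additions over many random orderings to obtain the means and standard deviations reported above:

```shell
# made-up gene inventories, one file per genome (hypothetical names)
printf 'g1\ng2\ng3\n' > genome1.txt
printf 'g2\ng3\ng4\n' > genome2.txt
printf 'g3\ng4\ng5\n' > genome3.txt

# seed pan and core with the first genome, then add the rest in order
sort -u genome1.txt > pan.txt
cp pan.txt core.txt
for g in genome2.txt genome3.txt; do
    sort -u "$g" > next.txt
    sort -u pan.txt next.txt > tmp.txt && mv tmp.txt pan.txt    # union
    comm -12 core.txt next.txt > tmp.txt && mv tmp.txt core.txt # intersection
    echo "added $g: core=$(wc -l < core.txt) pan=$(wc -l < pan.txt)"
done
```

The core curve can only shrink and the pan curve only grow as genomes are added, which is exactly the behaviour of the two data frames produced with option -c.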
4.8.5 Calculating Pfam enrichment of cluster sets

Provided that Pfam domains have been annotated in advance (see section 4.4), it is possible to calculate whether a set of clusters, for instance those that make up the shell (see 4.8.3), is enriched in a set of protein domains. To this end pfam_enrich.pl can be invoked as follows:

$ ./pfam_enrich.pl -d sample_plasmids_gbk_homologues/ \
-c sample_plasmids_gbk_homologues/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_ \
-x sample_intersection/pangenome_matrix_t0__shell_list.txt -t less -p 1.0

Several input data are required for this kind of analysis:

• A folder containing previously computed Pfam annotations (-d).
• A directory with FASTA-format cluster files (-c), obtained in earlier steps. These will be the 'control' set.
• A file with a list of clusters defining a subset of 'experiment' clusters (-x).
• The desired type of Fisher's exact test, either greater, two.sided or less, as in this example. The default is greater, which tests for Pfam domains over-represented in 'experiment' clusters, whereas less tests for under-represented Pfam domains.
• A threshold on FDR-adjusted P-values, set to 0.05 by default. Note that in this toy example it was set to 1.0.
• Optionally flag -r can be used to sample only sequences from a selected reference taxon.
• If using nucleotide clusters, then option -n should be called.

The output is as follows:

# parsing clusters...
# 756 sequences extracted from 193 clusters
# total experiment sequence ids = 148
# total control sequence ids = 756
# parse_Pfam_freqs: set1 = 19 Pfams set2 = 103 Pfams
# fisher exact test type: 'less'
# multi-testing p-value adjustment: fdr
# adjusted p-value threshold: 1
# total annotated domains: experiment=19 control=144

#PfamID  counts(exp)  counts(ctr)  freq(exp)  freq(ctr)  p-value    p-value(adj)  description
PF00239  0            6            0.000e+00  4.167e-02  4.833e-01  9.860e-01     Resolvase, N terminal domain
PF01526  0            6            0.000e+00  4.167e-02  4.833e-01  9.860e-01     Tn3 transposase DDE domain
PF02796  0            3            0.000e+00  2.083e-02  6.928e-01  9.860e-01     Helix-turn-helix domain of resolvase
PF12681  0            3            0.000e+00  2.083e-02  6.928e-01  9.860e-01     Glyoxalase-like domain
[...]

4.8.6 Estimating average identity matrices

If we recall for a moment the example GenBank files analyzed in section 4.3, we can demonstrate how to calculate average identity matrices, which can then be used to compare genome members of a pangenome. To do so we will add a few flags to the previous command, in addition to flag -A, which specifically asks for an identity matrix to be calculated:

$ ./get_homologues.pl -d sample_plasmids_gbk -A -t 0 -M

This will produce the following output:

[...]
# number_of_clusters = 193
# cluster_list = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_
# average_identity_matrix_file =
# [...]/UnculturedbacteriumplasmidpRSB203_f0_0taxa_algOMCL_e0_Avg_identity.tab
# NOTE: matrix computed on blastp results

Note that in this example the produced identity matrix was calculated from the BLASTP scores among protein sequences included in the resulting clusters (193). If average nucleotide identities are desired the command must be modified to:

$ ./get_homologues.pl -d sample_plasmids_gbk -a 'CDS' -A -t 0 -M

Such matrices can then be used to calculate heatmaps and dendrograms that capture how similar the coding sequences are among genomes. An example of this would be:

cd sample_plasmids_gbk_homologues
../plot_matrix_heatmap.sh -i EcoliST131plasmidpKC394_f0_0taxa_CDS_algOMCL_e0_Avg_identity.tab \
-d 2 -t "CDS ANI matrix with 2 decimals"

Figure 18: Example heatmap derived from an Average Nucleotide Identity (ANI) matrix calculated with get_homologues.pl.
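Conceptually, each cell of the Avg_identity.tab matrix is the mean of the per-cluster BLAST identities observed between a pair of genomes. A minimal sketch with invented genome names and identity values:

```shell
# invented per-cluster records: genome pair and BLAST %identity
printf 'gA\tgB\t98.0\ngA\tgB\t96.0\ngA\tgC\t90.0\ngA\tgC\t92.0\n' > toy_identities.tab

# average the identities of every genome pair, one pair per output line
awk -F'\t' '{ sum[$1 FS $2] += $3; n[$1 FS $2]++ }
END { for (p in sum) printf("%s\t%.1f\n", p, sum[p] / n[p]) }' toy_identities.tab |
sort > toy_avg_identity.tab
```

The real matrix is square (every genome against every other) and is computed from the BLASTP or BLASTN alignments backing the clusters, but the averaging step is as simple as shown.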
4.8.7 Finding out best hits of a particular sequence

After running get_homologues.pl with almost any set of parameters, you will always end up with a lot of BLAST files from the all-against-all searches among the involved taxa. The accompanying script check_BDBHs.pl can help you find out which are the best BLAST hits of any sequence that might interest you. First, as cluster names are not always informative, you'll need to find out the internal identifier used by the software to handle your target sequence. For instance, we might want to investigate protein CspE among our 4 Buchnera taxa. The command to achieve this would be:

$ ./check_BDBHs.pl -d sample_buch_fasta_homologues

And we obtain:

# Sequences containing label CspE:
1360,Buch_aph_Cc.faa,gi|116515229|ref|YP_802858.1| CspE [Buchnera aphidicola str. Cc (Cinara cedri)]

Now that we know the identifier (1360), we can check its best hits:

$ ./check_BDBHs.pl -d sample_buch_fasta_homologues -i 1360

The output contains the identifiers of best hits in both directions, their bit-scores, E-values, alignment %coverages and annotated Pfam protein domains when available:

# query = 1360
# query fullname = gi|116515229|ref|YP_802858.1| CspE [Buchnera aphidicola str. Cc (Cinara cedri)]
# list of bidirectional best-hits:
dir query sbjct bits Eval  %ident cover Pfam annotation
: [Buch_aph_APS.faa]
> 1360 467  136 4e-42 97.1 100.0 NA gi|15617086|..cold shock protein E
< 467  1360 136 4e-42 97.1 100.0 NA
: [Buch_aph_Bp.faa]
> 1360 972  135 1e-41 95.7 100.0 NA gi|27904911|..cold shock protein E
< 972  1360 135 1e-41 95.7 100.0 NA
: [Buch_aphid_Sg.faa]
> 1360 1883 136 4e-42 97.1 100.0 NA gi|21672738|..cold shock protein E
< 1883 1360 136 4e-42 97.1 100.0 NA

If previous get_homologues jobs included the calculation of Pfam domains, then option -D can be added to produce a richer report, which now includes the identifiers of Pfam domains such as PF00313, sorted by their position along the sequence:

dir query sbjct bits Eval  %ident cover Pfam annotation
: [Buch_aph_APS.faa]
> 1360 467  136 4e-42 97.1 100.0 PF00313, gi|15617086|..cold shock protein E
< 467  1360 136 4e-42 97.1 100.0 PF00313,
...

Note that this script works by parsing files all.p2o.csv and all.bpo, which are created at run-time by get_homologues in folder tmp/ within the results directory. These are text files that can be inspected with any text editor.

4.9 A script to test most get_homologues features with a sample dataset

File HOWTOTettelin is a shell script which performs typical uses of get_homologues.pl. This script can be made executable on the terminal with:

$ chmod +x HOWTOTettelin

and then executed with:

$ ./HOWTOTettelin

The first task carried out by the script is to download the same GenBank files used in the landmark work of Tettelin and collaborators (PubMed=16172379); afterwards several analyses are sequentially undertaken:

# 1.0) optionally download genomes in GenBank format from NCBI FTP site
cd test_Streptococcus
../download_genomes_ncbi.pl test_Streptococcus_download_list.txt
cd ..
# 1.1) run BLAST jobs with 4 CPU cores and optionally HMMER
./get_homologues.pl -d test_Streptococcus/ -n 4 -o
./get_homologues.pl -d test_Streptococcus/ -n 4 -D -o

# 1.2) calculate core-genomes with all BDBH, OMCL & COG algorithms
./get_homologues.pl -d test_Streptococcus/
./get_homologues.pl -d test_Streptococcus/ -M
./get_homologues.pl -d test_Streptococcus/ -G
./get_homologues.pl -d test_Streptococcus/ -M -D

# 1.3) calculate consensus core-genome with syntenic genes
./compare_clusters.pl -s -n -o test_Streptococcus_intersection -d \
test_Streptococcus_homologues/S_f0_alltaxa_algCOG_e0_,\
test_Streptococcus_homologues/S_f0_alltaxa_algOMCL_e0_,\
test_Streptococcus_homologues/S_f0_alltaxa_algBDBH_e0_

# 1.4) calculate core-genome with coverage and identity as in the Tettelin paper (cover=50%,%ID=50)
./get_homologues.pl -d test_Streptococcus/ -M -C 50 -S 50

# 1.6) calculate core intergenic clusters
./get_homologues.pl -d test_Streptococcus/ -g

# 1.7) estimate and plot core- and pangenome sizes with all BDBH, OMCL & COG algorithms
./get_homologues.pl -d test_Streptococcus/ -c
./get_homologues.pl -d test_Streptococcus/ -M -c
./get_homologues.pl -d test_Streptococcus/ -G -c

# 2.1) calculate pan-genome with OMCL & COG algorithms
./get_homologues.pl -d test_Streptococcus/ -M -t 0
./get_homologues.pl -d test_Streptococcus/ -G -t 0

# 2.2) build consensus pangenomic matrix
./compare_clusters.pl -d test_Streptococcus_homologues/S_f0_0taxa_algCOG_e0_,\
test_Streptococcus_homologues/S_f0_0taxa_algOMCL_e0_ -m -T -o test_Streptococcus_intersection

# 2.3) plot pangenomic compartments and estimate core- and pan-genome size with mixture models
./parse_pangenome_matrix.pl -m test_Streptococcus_intersection/pangenome_matrix_t0.tab -s

# 2.4) check accessory genes present in A genomes and absent in B
./parse_pangenome_matrix.pl -m test_Streptococcus_intersection/pangenome_matrix_t0.tab \
-A test_Streptococcus/A.list -B test_Streptococcus/B.list -g -e

Readers looking for a fully worked
out example, including a comprehensive interpretation of results, can read our book chapter "Robust Identification of Orthologues and Paralogues for Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids" (PubMed=25343868).

5 Frequently asked questions (FAQs)

5.1 Installation

• Is it necessary to have a computer cluster to run get_homologues.pl?
Not really. It can be run on a single machine, but if you are comparing, say, more than a few microbial genomes, the required all-against-all BLAST jobs will take some time to complete, depending on your hardware. If this is your case we recommend setting option -n to the number of cores of your processor, as this way BLAST/HMMER jobs will be split into batches and sent to different cores, and the required computing time proportionally reduced.

Figure 19: Runtime of typical bacterial genomic BLASTP jobs as calculated with 1 to 8 CPU cores when splitting the job in batches. Note that the optimal batch size in this test is 100 sequences (compared to 250 and 500 sequences).

• How do I set up get_homologues.pl to run in a computer cluster?
Cluster jobs are submitted by invoking option -m cluster, instead of explicitly calling qsub. The first time such a job is run the following error can be seen:

# running BLAST searches ...
Unable to run job: You have to specify a queue, with the '-q' option to qsub.
Exiting. [...]

This usually means that qsub jobs inside get_homologues.pl lack a target queue in the cluster. Please find out the name of the right queue for your jobs, perhaps something as simple as -q default, and edit the relevant line in the header of get_homologues.pl:

my $QUEUESETTINGS = ' -q default ';

• Does get_homologues.pl run on Windows?
As Perl is not installed by default on Windows, nor is BerkeleyDB easily installed there, we haven't ported this software to that system. However, with some work, you could probably set up a Windows machine to run get_homologues.pl and most of its dependencies.
Nevertheless, we have tested the package only on Linux and macOS systems, and currently the only option for Windows users would be to set up a VirtualBox Linux box and then install get_homologues.pl there. NOTE: if you don't know how to do this please check these resources: Install-Ubuntu-on-VirtualBox or virtualboxes.

• How much memory does my computer require to run get_homologues.pl?
This depends directly on the size and number of the genomes to be analyzed. A recent benchmark with 71 Mycobacterium genomes and 280k sequences yielded a RAM usage of just over 3700 MB, confirming that versions 2.* have a much reduced memory footprint (see Figure 20; in the original benchmark up to 40 microbial genomes could be analyzed on an 8 GB RAM Linux box). Moreover, as mentioned in section 3.4, the -s flag reduces the memory footprint of get_homologues up to 20-fold, allowing large jobs to run on machines with small RAM resources. Therefore, if you are running a large job that fails to finish successfully, the most common explanation is that it was killed by the operating system for taking too much memory. In those cases it is advisable to re-run the same job with the -s flag.

Figure 20: Computing time and RAM requirements of the BDBH, COG and OMCL algorithms when processing input volumes of increasing size. Performance is measured also with BerkeleyDB (-s flag).

• What's the performance gain of v2?
After evolving parts of the original code base, and fixing some bugs (see CHANGES.txt), both get_homologues.pl and get_homologues-est.pl have significantly improved their performance, as can be seen in the figure, which combines data from the original benchmark and new data generated after v2 was in place.

Figure 21: Computing time and RAM requirements of the original algorithm (OMCL, measured on 6 sequence sets) as compared to the updated v2 code (measured on 3 sets).

• BLAST jobs fail in my computer cluster, how can I sort this out?
When solving problems related to submitting jobs to the cluster it is necessary to check the generated .queue files, which capture any errors that might occur during job submission. If you check one of these files and find a message such as

blast: error while loading shared libraries: libbz2.so.1: cannot open shared object file

this means that some cluster nodes require the installation of the 32-bit compatibility library libbz2.so.1, which can be done with root privileges by typing this command from your cluster master node:

rocks run host compute "yum -y install bzip2-libs.i386"

Another solution would be to use BLAST binaries native to your cluster architecture, as explained in section 2.2.

5.2 Run options

• Why is it that I can't re-use BLAST results from the original get_homologues.pl after updating to the v2 code?
In the v2 code family BLAST parsing evolved to capture -outfmt 6 'qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore'. However, the original code base captured different data columns, and this implies that old BLAST results from previous releases won't work anymore.

• Is there any real scenario in which the default value for the -S flag is the best option?
Option -S 1 means that two sequences can be considered orthologues or inparalogues with any sequence identity, provided that Evalue < max(Evalue) and alignment coverage is sufficient (-C). This can be useful when comparing very divergent genomes/proteomes, where orthologues can be distant in terms of sequence. If a reciprocal hit between genomes A and B is, say, 30% identical, it is still a likely orthologue. It could of course be a false positive, and that's why you need to know the average identity among proteins in those genomes, perhaps by using -A, and ultimately a gene phylogenetic tree, including species A, B and others, to compare to the species tree.

• Could you please explain a bit more what option -o is good for?
This option is useful in situations like the following: you have access to a compute cluster with modern compute nodes (64-bit cores and a good amount of memory), but the master computer is an old, 32-bit machine with limited RAM (say 1 GB or less). Under such a setting, it would be convenient to distribute the first steps of the pipeline (all-against-all BLAST jobs) on the cluster using options -o -m cluster. The downstream parsing of the BLAST results, which is memory-consuming, could then be executed logged in from one of the more powerful compute nodes or from a powerful server.

• Would it be possible, after running the all-against-all BLAST jobs using option -o, to log into three different compute nodes and execute the downstream pipeline on each node for a different clustering algorithm (i.e. BDBH, COG and OMCL)?
No, as each get_homologues.pl job requires exclusive access to the data directory and prevents other jobs from accessing it simultaneously.

• Does the -I option have any impact on the calculation of the core and pan genomes, including their graphical representations?
Of course, as this option forces the program to sample the list of genomes only once, in the implicit order of the list. The fitting of pan-/core-genome functions will be affected as there will be fewer points for the non-linear fitting.

• What is the minimum value of the BLAST neighborhood correlation parameter?
This parameter captures the concept of BLAST neighborhood explained in PubMed=18475320, and is calculated as a positive Pearson correlation coefficient. In this context, sequences A and B should have a similar list of BLAST matches if they truly are orthologues. This parameter is set to 0 by default, so it has no effect. As its value approaches 1, it gets more difficult to call bidirectional best hits, as they will be required to have very similar lists of BLAST hits.

• Could you explain a bit more about the meaning/effect of the -t option?
You should use -t 0 (zero) when you need to get all clusters of homologous sequences. This will generate files containing 2 or more homologous sequences from one or more taxa. By default, get_homologues.pl sets t=numberOfTaxa, that is, it will provide the user with clusters of homologous sequences that contain at least one sequence from each taxon. Note that single-copy clusters (just one sequence per organism) will be produced if combined with option -e, which excludes clusters containing inparalogues. In this latter case, the resulting clusters will contain only single-copy genes from each taxon, i.e. the orthologues. This is convenient if you want to use the resulting gene families, for example, for genome-level phylogenetic analyses using only the repertoire of orthologous single-copy genes. Note, however, that such clusters are not suitable for pangenome analyses. For such analyses, instead, please use auxiliary script compare_clusters.pl with the set of clusters obtained with -t 0.

• And what about option -a, what are these other GenBank file features?
Thoroughly annotated genomes (see for example the GenBank file for Escherichia coli K12 MG1655: Escherichia_coli_K_12_substr_MG1655_uid225.gbk) have many features, including ribosomal rRNAs, intergenic spacers, tRNAs, 'mat_peptide', 'repeat_region', and miscellaneous features, noted as 'misc_feature' in the corresponding GenBank files. Miscellaneous features report on diverse things such as cryptic prophages, the target sites of resolvases involved in replicon replication, and many more "miscellaneous" features. Note that there is great heterogeneity both in the format and detail of genome feature annotation. By default, get_homologues.pl will extract only the 'CDS' features (i.e. protein-coding sequences) from the GenBank file. By using option -a the user can select which features to parse, such as 'tRNA' or 'rRNA'.
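To decide which feature names to pass to -a, it can help to first list the feature keys a GenBank file actually annotates. The snippet below is a quick sketch using standard shell tools on a tiny invented GenBank fragment (mock.gbk is not part of the package); it relies on the GenBank convention that feature keys are indented by exactly five spaces in the feature table:

```shell
# List the distinct feature keys annotated in a GenBank file, to choose
# values for option -a. mock.gbk is an invented three-feature fragment.
printf '     CDS             1..99\n     tRNA            120..190\n     rRNA            200..300\n' > mock.gbk
# Feature keys sit after exactly 5 leading spaces in the feature table;
# qualifier lines are indented further and therefore do not match.
grep -oE '^ {5}[A-Za-z_]+' mock.gbk | tr -d ' ' | sort -u
```

On a real .gbk file this prints keys such as CDS, gene, tRNA, rRNA or misc_feature, any of which can be passed to -a.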
• Could you explain what the inflation parameter of OrthoMCL is and how to decide what value to use?
The original OrthoMCL paper (PubMed=12952885) explains it: "[...] changing the inflation index affects cluster tightness: Lower inflation values result in the inclusion of more sequences in fewer groups, whereas increasing the inflation index fragments clusters and reduces the number of sequences included". In our benchmarks with bacterial genomes this parameter shows a very modest, negligible effect.

• I ran the get_homologues.pl pipeline some time ago. I would like to add some new genomes to the previous analysis. How do I proceed so as to reuse as much of the previous computations as possible?
If you still keep the original input folder with FASTA or GenBank sequence files and the results _homologues directory, both contained in the same directory, all you have to do is copy the new sequence files to the input folder and re-run get_homologues.pl. This will ensure that only the newly required BLAST/Pfam searches are performed, re-using the previous results as much as possible.

• How does get_homologues.pl decide how to name a certain cluster file? Is this affected by the use of -r?
Cluster files are named using gene names from the reference genome, or from the first included genome otherwise. If a given genome R is selected with option -r then gene names from R will be used preferentially. It is possible to change the way clusters are named by editing subroutine extract_gene_name in file lib/phyTools.pm.

• What happens if I perform the above explained steps but using a different reference genome?
The most obvious effect is that any resulting clusters will now be named according to gene names of the new reference genome. A more subtle consequence for BDBH jobs is that now all genomes are compared to this reference (see Figure 3), and this will change the order in which bidirectional best hits are computed.
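When adding genomes to a previous run, it can be handy to preview which input files are new before re-running. The following is only a sketch: the directory names and the results layout are invented for illustration, and the real _homologues directory layout may differ.

```shell
# Sketch: list input FASTA files that have no counterpart yet in the
# results directory. All paths and the layout here are invented examples.
mkdir -p my_genomes my_genomes_homologues
touch my_genomes/strainA.faa my_genomes/strainB.faa
touch my_genomes_homologues/strainA.faa.fasta   # pretend strainA was processed
for f in my_genomes/*.faa; do
  b=$(basename "$f")
  # report inputs lacking a processed counterpart
  [ -e "my_genomes_homologues/$b.fasta" ] || echo "new input: $b"
done
```

get_homologues.pl itself detects new files automatically; a check like this merely previews which inputs will trigger new BLAST/Pfam searches.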
• How are the gene clusters named if no -r reference genome is specified?
In this case the genome with the least number of genes/features will be taken as the reference, and the resulting sequence clusters will be named according to gene names of this reference genome, which might not be the best-annotated genome in your set. For this reason it is often a good idea to set as reference a genome with a good annotation, for instance the species or strain described in RefSeq.

• I have 40 draft genomes annotated in gbk format and I am using get_homologues to obtain the core and pan genomes. My plan is to run get_homologues with BDBH and -b to speed up the process. How can I choose the most appropriate reference genome?
Option -b is only suited for core-genome calculation, not pangenome calculation. If this is really the desired task, the genome with the least number of annotated genes should be used as reference, which is what the program does by default, or else the best-annotated among the small genomes. However, note that this sort of core-genome calculation is most sensitive to missing genes, usually due to poor automatic annotations, which is why compiling a pangenome matrix is recommended when possible (see Section 4.8.1), so that a more robust soft-core can be estimated.

• Why does option '-t 0' not work with BDBH in get_homologues.pl? Is this also the reason BDBH cannot be used in a pangenome matrix analysis?
The reason is that BDBH uses a single reference genome and thus by definition cannot track genes not present in the reference. Therefore, pangenome matrices produced by the BDBH algorithm would be incomplete, considering only clusters including genes from the reference genome. For this reason BDBH is adequate for core-genome calculation, but not for pan-genomes.

• When the initial BLAST is being performed, does get_homologues.pl take into account the database size (i.e.
the number of genes being BLASTed), or, because the BLASTing is not done in an all-vs-all manner, do you not consider this a factor in the analysis?
Within lib/marfil_homology.pm there is a global variable $BLAST_DB_SIZE set to 100 000 000 for that purpose. That value fixes the effective search space during BLAST searches so that any resulting E-values are comparable, even across experiments or algorithms (BDBH, OMCL, COGS).

• I would like to get information on inparalogues and orthologues from each genome. I found several files in the tmp directory, such as inparalogues_Buch_aph_APS.faa. Could you explain these files or a method to extract the information?
According to our working definition, all sequences grouped together in the same .faa/.fna cluster are likely orthologues, although you should always keep in mind that orthology is an evolutionary concept and therefore sequence-based approaches such as those in get_homologues.pl are simpler approximations. What is different with inparalogues? They are supposed to be duplicated genes that appeared after species separation, and therefore their orthology relationships are many-to-one or many-to-many. Inparalogues are easy to spot in clusters produced by get_homologues.pl because they appear as 2+ sequences from the same genome in the same cluster. If you wish to know which inparalogue is most similar to an orthologous gene the best option is to run check_BDBHs.pl, which is explained in Section 4.8.7.

• I have been using GET_HOMOLOGUES and I could not figure out the default value for saving blastp hits, I mean, the value set for the '-max_target_seqs' BLAST parameter.
-max_target_seqs is set to the number of sequences of the query proteome, which usually is a large number that ensures all good-quality hits are recovered.

• How does buffer flushing affect get_homologues.pl?
Although get_homologues.pl scripts explicitly flush output buffers (set with $|=1), users can occasionally experience buffering problems when writing to slow, external hard drives, as output files are often very large. Such problems have been reported when calling option -G, which in turn invokes subroutine find_COGs and calls several external binaries, whose buffers cannot be flushed from the scripts. If that happens to you please consider increasing the sleep time in that sub, which by default is 10 seconds.

5.3 Downstream analyses

• What are pancore and pangenome files/matrices?
Pancore matrices contain estimates of core- and pan-genomes and are produced by get_homologues.pl with option -c. These files take names such as pan_genome_algBDBH_C75.tab, which record the algorithm employed, and are generated by random-sampling genomes. Sampling can be controlled and reproduced by using the same random-number generator seed (see option -R in Section 3.4). Such files can be used to render plots (and fitted functions) with script plot_pancore_matrix.pl, as shown in Section 4.8.4. Instead, pangenome matrices are generated by accessory script compare_clusters.pl, and contain information about which genomes contain sequences from each gene cluster, with no sampling involved (see Section 4.8.1). They take names such as pangenome_matrix_t0.tab.

• Is it possible to plot pangenome matrices with compare_clusters.pl using a single cluster directory?
Yes, no problem. The script will generate the intersection_t0.cluster_list, pangenome_matrix_t0.phylip and pangenome_matrix_t0.tab files based on the clusters found by the chosen algorithm. The only thing you won't find in the directory are Venn diagrams, since at least 2 cluster sets (generated by 2 algorithms) are required to compute them.

• What is the advantage of providing multiple cluster output directories to the -d 'dir1,dir2,dir3' option of compare_clusters.pl?
When provided with the output directories holding the clusters generated by 2 or 3 algorithms, the script will select only those clusters that contain exactly the same sequences under each algorithm. This may be valuable for example if the user is interested in defining a very robust core-genome set, containing only those families with exactly the same members independently of the chosen algorithm. However, it is possible that several otherwise important families get lost for downstream analyses, such as presence-absence analyses of gene families in pairs of lineages, which can be done with parse_pangenome_matrix.pl.

• We have sequenced 25 bacterial genomes and used get_homologues.pl to get orthologs included in all the 25 strains. As I already used the BDBH method for the core-genome analysis, I cannot now switch to the COG method for pangenome matrix generation. Could you indicate how to get the full pangenome matrix using the BDBH method?
BDBH results cannot be used for pangenome analysis, but you could re-run the software with the same 25 input genomes, now adding -t 0 -M for OMCL, probably the best choice for such an analysis. This will re-use all BLAST results previously calculated and resume until the OMCL analysis is completed. Usually core sets produced by BDBH, COGS and OMCL are very similar. Therefore, most of your previously tested core genes should also be picked up by OMCL on this second run, and presumably you could then proceed to the pangenome analysis.

• I am a PhD student doing some work on different strains. My input is .gbk files from draft genomes and my aim at this point is to see which genes are core in each of 3 groups and which of these are shared with the other 2 groups. I have been working under the impression that I can make 3 different groups and compare them, or can I only make 1 big group and compare the results of different methods?
You have calculated 3 core-genomes from 3 different sets of strains and now you want to know the subset of core genes present in all individual core-genomes. If you check the compare_clusters.pl options you'll see that it says "by default cluster dirs are expected to be derived from the same taxa", which is exactly what's failing in your examples. If you want to find common core clusters present in all 3 sets, and also those shared by only some of the taxa, you should build a pangenome matrix by running get_homologues.pl with all strains together with option -t 0. The matrix itself will give you the clusters present/absent in all and some of the strains, which is probably what you need. The script parse_pangenome_matrix.pl can read this matrix and further help you identify these clusters.

• Could you please give a use case for compare_clusters.pl -r?
Option -r can be used to compare a list of core genes from a single genome G, that is, with clusters containing only sequences from G, to clusters of a larger group of taxa (A,B,C,...,G) that includes G.

• The parsimony tree produced by compare_clusters.pl -T cannot be opened by MEGA
Program pars from the PHYLIP suite often produces several alternative parsimony trees, contained in the pangenome_matrix_t0.phylip.ph file, one per line. Some phylogeny programs, such as MEGA, require splitting these trees into separate files in order to properly read and plot them.

• Parsing a pangenome matrix with A & B lists yields zero clusters
When a command such as

./parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -A A.txt -B B.txt -g

is invoked, the passed A & B files must contain taxon names matching exactly those of the corresponding input files, including the extension. Instead of a list such as:

taxonA
taxonB_001
...
taxonZ244

the list should be:

taxonA.faa
taxonB_001.faa
...
taxonZ244.faa

• The number of clusters produced with -C 75 -S 70 does not match the pangenome size estimated with option -c
The reason for these discrepancies is that they are fundamentally different analyses. While the default runmode simply groups sequences, trying to put orthologues and inparalogues in the same clusters, a genome composition analysis performs a simulation in order to estimate how many novel sequences are added by genomes sampled in random order. In terms of code, there are a couple of key global variables set in lib/marfil_homology.pm, lines 130-131, which control how a gene is compared to previously processed pangenome genes in order to call it novel:

$MIN_PERSEQID_HOM = 0;
$MIN_COVERAGE_HOM = 20;

The first variable is set by default to 0, meaning that there is no %identity requirement to call homologues. The second is set to 20, which means that any sequence matching a previous gene with coverage ≥ 20% will be considered homologous and thus won't be considered new. As you can see these are very lenient values. Now, in your settings, you might want to change these values. For instance, Tettelin and collaborators in their landmark work used values of 50 and 50 (PubMed=16172379), which means that protein sequences with coverage ≥ 50% and identity ≥ 50% to previous genes will be called homologues, and therefore won't be accumulated to the growing pangenome. In other words, you should tweak these variables to your particular settings.

• I produced 2 different parsimony trees with compare_clusters.pl, is there a way to merge them and add bootstrap values?
The two trees are equally parsimonious, that is, they have the same parsimony score but different topologies. One possibility to combine them into one topology is to compute the majority-rule consensus (mjr) tree, for example with consense from the PHYLIP package, or to represent a consensus network with a program such as SplitsTree.
Regarding the bootstrapping, you could write some R code (or code in any other language) to randomly sample the columns of the pangenome matrix with replacement, constructing a new, bootstrapped matrix with the same number of columns as the original one. You should generate 100 or 500 of these matrices and then run pars on each one of them. Then run consense to obtain the mjr consensus tree and associated split frequencies (bootstrap support values for each bipartition). An easy way of achieving this would be with R's boot package. Another option is to call seqboot from the PHYLIP package to generate N bootstrap pseudo-replicates of the matrix. Rename the resulting file as infile and call pars to read the file, using option M with the number of pseudo-replicates to run the standard (Fitch) parsimony analysis on each of the bootstrapped matrices. PHYLIP pars will generate an outtree file containing as many trees as bootstrapped matrices found in infile. Rename outtree to intree and call consense to generate the default majority-rule consensus tree. This tree is only a cladogram (topology only, no branch lengths). The node labels correspond to the number of bootstrap pseudo-replicates in which that particular bipartition was found.

• How can I produce a maximum likelihood (ML) tree with bootstrap values from a pangenome matrix?
We recommend script estimate_pangenome_trees.sh from the GET_PHYLOMARKERS pipeline. Another alternative is to use the .fasta version of the pangenome matrix produced by script ./compare_clusters.pl -m, next to the .tab and .phy files. This FASTA file can be analyzed with ML software such as IQ-TREE (PubMed=25371430), either online or in the terminal with a command such as:

path_to_iqtree -s pangenome_matrix_t0.fasta -st BIN -m TEST -bb 1000 -alrt 1000

This will produce an optimal ML tree after selecting a binary substitution model, with both bootstrap and aLRT support values.

• Is there a way to plot ANI matrices of soft-core clusters?
Let's say you have 20 genomes; then 95% of them is exactly 19 taxa, which is the minimum occupancy that defines soft-core clusters (see global variable $SOFTCOREFRACTION). You should then compute the ANI matrix as follows:

./get_homologues.pl -d your_data -a 'CDS' -A -M -t 19

And then plot the resulting matrix with script hcluster_pangenome_matrix.sh.

• When I use the hcluster_pangenome_matrix.sh script the trees in the output newick file and the heatmap differ. Is there a reason for this?
The topologies of the NJ tree and the row dendrogram of the heatmap differ because the heatmap is ordered bi-dimensionally. That is, the heatmap plot shows only the row dendrogram, but the matrix is ordered also by columns. The NJ tree is computed from the distance matrix that you indicate the program to calculate for you (ward.D2 is the default).

• What are the numbers on the branches of the tree generated with compare_clusters.pl -T?
The PARS program from the PHYLIP suite reconstructs the pan-genome phylogeny using the parsimony optimality criterion. As such, each column in the input pangenome matrix is considered an independent character, and character state changes (within columns) are considered to represent a single evolutionary event (have a weight of 1) in standard parsimony. As can be read in its documentation, the algorithm used for estimating branch lengths averages the number of reconstructed changes of state over all sites, over all possible most parsimonious placements of the changes of state among branches. Note that it does not correct in any way for multiple changes that overlay each other. Therefore, the scale that you find at the bottom of the tree corresponds to the average number of state transitions or steps, as estimated under the standard or Fitch parsimony criterion. It is hence the parsimony estimate of the "distance" in number of steps (differences in cluster composition, in our case).
Given that it is an average, the branch lengths (number of steps) are real numbers, not integers, as one might expect.

• I ran get_homologues with amino acid FASTA files of 78 bacterial genomes; is there a way to export a 78 x 78 matrix of the number of homologues shared between each pair of genomes?
If you did your analysis requesting clusters of all occupancies (-t 0) then you can get what you want in two steps. First, you must produce a pangenome matrix with compare_clusters.pl -d ... -m. It is then possible to request an intersection pangenome matrix (pangenome_matrix_t0__intersection.tab), which contains the number of sequence clusters shared by any pair of genomes, with parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab. Note that these clusters might contain several inparalogues of the same species.

• How does make_nr_pangenome_matrix.pl work?
Please refer to manual_get_homologues-est.pdf for an example application of this script.

• How does annotate_cluster.pl work?
Please refer to manual_get_homologues-est.pdf for an example application of this script.

6 Frequent warnings and error messages

error message: EXIT : cannot find previous input file XXXX, please re-run everything
practical meaning: This can happen when re-running the program with an input -d directory which used to contain more sequence files, or files with different names. This prevents the software from recycling previous results, as it cannot ensure that sequences are still numbered consistently.

error message: WARNING: could not extract nucleotide sequences from file XXXX
practical meaning: You'll see this warning when using an incomplete input GenBank file, lacking the nucleotide sequence at the bottom.

error message: WARNING: can only extract genes (not CDSs) from file XXXX
practical meaning: Occurs when reading a GenBank file lacking CDS features.

error message: WARNING: cannot use nucleotide sequences in file XXXX as they do not match those in file YYYY
practical meaning: This warning occurs when a twin XXXX .fna file (see Table 1) contains a different number of sequences than the corresponding YYYY .faa file, and cannot therefore be safely used to compile DNA clusters.

error message: EXIT, XXXX does not exist, Pfam search failed...
practical meaning: Occurs when a Pfam job submitted to the cluster (option -D) failed to report back and terminate. The solution is often to re-run the program, as it will only re-submit the missing Pfam jobs. When solving problems with submitting jobs to the cluster queue it is helpful to check the .queue files.

error message: EXIT: cannot format BLAST sequence base
practical meaning: Happens when for some reason the collection of input sequences could not be formatted for BLAST. This might surface hard drive trouble or simply an architecture issue.

error message: EXIT, XXXX.blastout does not exist, BLAST search failed...
practical meaning: Again a BLAST error, spotted for failing to produce a BLAST output. Often the solution is simply to re-run, as this might be a cluster overload problem. When solving problems with submitting jobs to the cluster queue it is often helpful to check the .queue files.

error message: EXIT: parsed XXXX output (YYYY) seems to be empty, please remove 'input homologues/' and re-run
practical meaning: Another BLAST/Pfam error, which can happen if the program fails to parse the results. The simplest solution is usually to do as suggested and re-run.

error message: WARNING: please remove/rename results directory: XXXX if you change the sequences in your .gbk/.faa files or want to rerun
practical meaning: This warning is issued only to make it clear that the program is recycling previous BLAST results, which is usually a good idea, unless you specifically changed the contents of your input files (which shouldn't be that common).

error message: EXIT: cannot compile intergenic clusters as not all input GenBank files are valid
practical meaning: This message appears when the user requested intergenic clusters (option -g) but not all parsed GenBank files contained nucleotide sequences. The solution is to check the input files and correct the offending one, which is likely incomplete and lacks the nucleotide sequence at the bottom.

error message: WARNING: skipping cluster 123 XXX.fna, seems to duplicate 456 YYY.fna
practical meaning: This is issued by compare_clusters.pl when it finds, usually singleton, clusters with identical sequences produced by the COG or OMCL algorithms. This can happen when such clusters contain short sequences, or sequences with composition biases, that yield few or even no BLAST hits when compared to all other sequences in a given setup. As these kinds of clusters can confound posterior analyses they are currently ignored by compare_clusters.pl.

Table 6: Frequent warnings and error messages produced by get_homologues and kin scripts.

7 Credits and references

get_homologues.pl is designed, created and maintained at the Laboratory of Computational Biology at Estación Experimental de Aula Dei/CSIC in Zaragoza (Spain) and at the Center for Genomic Sciences of Universidad Nacional Autónoma de México (CCG/UNAM). The code was written mostly by Bruno Contreras-Moreira and Pablo Vinuesa, but it also includes code and binaries from OrthoMCL v1.4 (algorithm OMCL, -M), COGtriangles v2.1 (algorithm COGS, -G), NCBI Blast+, MVIEW and BioPerl 1.5.2. Other contributors: Carlos P Cantalapiedra, Roland Wilhelm, David A Wilkinson. We ask the reader to cite the main references describing the get_homologues software:

• Contreras-Moreira B and Vinuesa P (2013) GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. 79:7696-7701.
• Vinuesa P and Contreras-Moreira B (2015) Robust Identification of Orthologues and Paralogues for Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In Bacterial Pangenomics, Methods in Molecular Biology Volume 1231, 203-232, edited by A Mengoni, M Galardini and M Fondi.

and also the original papers describing the included algorithms and databases, accordingly:

• Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13(9):2178-89.

• Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A (2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26(12):1481-7.

• Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-3402.

• Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12(10):1611-8.

• hmmscan :: search sequence(s) against a profile database. HMMER 3.1b2 (Feb 2015) http://hmmer.org Copyright (C) 2015 Howard Hughes Medical Institute. Freely distributed under the GNU General Public License (GPLv3).

• Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44(D1):D279-85.

• Brown NP, Leroy C, Sander C (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics 14(4):380-381.

• Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12(1):59-60.

If you use the accompanying scripts the following references should also be cited:

• R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org
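The twin-file consistency check mentioned in Table 6 above can be reproduced by hand: a cluster's nucleotide (.fna) file is only safe for compiling DNA clusters if it holds exactly as many sequences as its amino acid (.faa) counterpart. Here is a minimal sketch of that check; the file names and sequences are hypothetical examples, not files produced by an actual run:

```shell
# Hypothetical twin cluster files; a real run would compare the XXXX.fna /
# YYYY.faa pairs written by get_homologues.pl
printf '>g1\nATGAAA\n>g2\nTTGCCC\n' > cluster.fna
printf '>g1\nMK\n>g2\nLP\n'         > cluster.faa

# Count FASTA headers (lines starting with '>') in each twin file
fna=$(grep -c '^>' cluster.fna)
faa=$(grep -c '^>' cluster.faa)

# Matching counts mean the .fna twin can be safely used; a mismatch is
# what triggers the corresponding warning in Table 6
if [ "$fna" -eq "$faa" ]; then
  echo "OK: $fna sequences in both twins"
else
  echo "WARNING: $fna vs $faa sequences, skipping"
fi
```

Running the sketch on the two example files above prints the OK branch, since both twins contain two sequences.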