Manual Get Homologues Est

User Manual:

Open the PDF directly: View PDF .
Page Count: 37

Download
Open PDF In Browser	View PDF

get homologues-est manual
Bruno Contreras-Moreira (1,2) and Pablo Vinuesa (3)
1. Fundación ARAID and 2. Estación Experimental de Aula Dei-CSIC
3. Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México
July 23, 2018

Contents
1

Description

Requirements and installation
2.1 Perl modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Required binaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Optional software dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3
4
4
5

User manual
3.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Program options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3 Accompanying scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8
8
9
14

A few examples
4.1 Extracting coding sequences (CDS) from transcripts . . . . . . . . .
4.2 Clustering orthologous transcripts from FASTA files, one per strain .
4.3 Producing a nucleotide-based pangenome matrix . . . . . . . . . .
4.4 Estimating protein domain enrichment of some sequence clusters . .
4.5 Making and annotating a non-redundant pangenome matrix . . . . .
4.6 Annotating a sequence cluster . . . . . . . . . . . . . . . . . . . .

16
16
18
23
24
26
28

.
.
.
.
.
.

A step-by-step protocol with barley assembled transcripts

Frequently asked questions (FAQs)

Credits and references

Description

This document describes GET HOMOLOGUES-EST, a fork of get homologues for clustering homologous gene/transcript
sequences of strains/populations of the same species. This algorithm has been designed and tested with plant transcripts
and CDS sequences, and uses BLASTN to compare DNA sequences. The main tasks for which this was conceived are:
• Finding and translating coding regions (CDSs) within raw transcripts.
• Clustering transcripts/CDS nucleotide sequences in homologous (possibly orthologous) groups, on the grounds of
DNA sequence similarity.
• Definition of pan- and core-transcriptomes by calculation of overlapping sets of CDSs.
The core algorithms of get homologues-est have been adapted from get homologues, and are therefore explained in
manual get homologues.pdf. This document focuses mostly on EST-specific options.
When obtaining twin DNA and peptide CDS files, the output of GET HOMOLOGUES-EST can be used to drive
phylogenomics and population genetics analyses with the kin pipeline GET PHYLOMARKERS.
This table lists features developed for get homologues-est which were not available in the original get homologues
release, although most have been backported.
name
transcripts2cds.pl
redundant isoform calling
ANI matrices
make nr pangenome matrix.pl
pfam enrich.pl
annotate cluster.pl

description
Script to extract coding sequences CDS from raw transcripts by combining Transdecoder
and BLASTX.
get homologues-est can handle redundant isoforms which otherwise will degrade clustering
performance.
get homologues-est can compute Average Nucleotide Identity (ANI) matrices which summarize the genetic distance among input genotypes.
Produces a non-redundant pangenome matrix by comparing all nucleotide/peptide clusters
to each other.
Script to test whether a set of sequence clusters are enriched in some Pfam domains.
Produces a multiple alignment view of the supporting local BLAST alignments of sequences in a cluster. It can also annotate Pfam domains and find private sequence variants
private to an arbitrary group of sequences.

Table 1: List of novel scripts/features in get homologues-est.

Requirements and installation

get homologues-est.pl is a Perl5 program bundled with a few binary files. The software has been tested on 64-bit Linux
boxes, and on Intel MacOSX 10.11.1 systems. Therefore, a Perl5 interpreter is needed to run this software, which is
usually installed by default on these operating systems.
In order to install and test this software please follow these steps:
1. Unpack the software with: $ tar xvfz get_homologues_X.Y.tgz
2. $ cd get_homologues_X.Y
3. $ ./install.pl
Please follow the indications in case some required part is missing.
4. Type $ ./get_homologues-est.pl -v which will tell exactly which features are available.
5. Test the main Perl script, named get_homologues-est.pl, with the included sample input folder
sample_transcripts_fasta by means of the instruction:
$ ./get_homologues-est.pl -d sample_transcripts_fasta . You should get an output similar to the contents of file sample_transcripts_output.txt.
3

6. Optionally modify your $PATH environment variable to include get homologues-est.pl. Please copy the following
lines to the .bash_profile or .bashrc files, found in your home directory, replacing [INSTALL_PATH] by the
full path of the installation folder:
export GETHOMS=[INSTALL_PATH]/get_homologues_X.Y
export PATH=${GETHOMS}/:${PATH}
This change will be effective in a new terminal or after running: $ source ~/.bash_profile
The rest of this section might be safely skipped if installation went fine, it was written to help solve installation
problems.

2.1

Perl modules

A few Perl core modules are required by the get homologues-est.pl script, which should be already installed on your system: Cwd, FindBin, File::Basename, File::Spec, File::Temp, FileHandle, List::Util, Getopt::Std, Benchmark and Storable.
In addition, the Bio::Seq, Bio::SeqIO, Bio::Graphics and Bio::SeqFeature::Generic modules from the Bioperl collection, and modules Parallel::ForkManager, URI::Escape are also required, and have been included in the get homologuesest bundle for your convenience.
Should this version of BioPerl fail in your system (as diagnosed by install.pl) it might be necessary to install it from
scratch. However, before trying to download it, you might want to check whether it is already living on your system, by
typing on the terminal:
$ perl -MBio::Root::Version -e ’print $Bio::Root::Version::VERSION’
If you get a message Can’t locate Bio/Root/Version... then you need to actually install it, which can sometimes become troublesome due to failed dependencies. For this reason usually the easiest way of installing it, provided
that you have root privileges, it is to use the software manager of your Linux distribution (such as synaptic/apt-get in
Ubuntu, yum in Fedora or YaST in openSUSE). If you prefer the terminal please use the cpan program with administrator
privileges (sudo in Ubuntu):
$ cpan -i C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz
This form should be also valid:
$ perl -MCPAN -e ’install C/CJ/CJFIELDS/BioPerl-1.6.1.tar.gz’
Please check this tutorial if you need further help.

2.2

Required binaries

In order to properly read (optionally) compressed input files, get homologues-est requires gunzip and bunzip2, which
should be universally installed on most systems.
The Perl script install.pl, already mentioned in section 2, checks whether the included precompiled binaries for hmmer,
MCL and BLAST are in place and ready to be used by get homologues-est. This includes also COGtriangles, which is
used only by prokaryotic get homologues. However, if any of these binaries fails to work in your system, perhaps due a
different architecture or due to missing libraries, it will be necessary to obtain an appropriate version for your system or
to compile them with your own compiler.
In order to compile MCL the GNU gcc compiler is required, although it should most certainly already be installed on
your system. If not, you might install it by any of the alternatives listed in section 2.1. For instance, in Ubuntu this works
well: $ sudo apt-get install gcc . The compilation steps are as follows:
$ cd bin/mcl-14-137;
$ ./configure‘;
$ make
To compile COGtriangles the GNU g++ compiler is required. You should obtain it by any of the alternatives listed in
section 2.1. The compilation would then include several steps:
$cd
$cd
$cd
$cd
$cd

bin/COGsoft;
COGlse; make;
../COGmakehash;make;
../COGreadblast;make;
../COGtriangles;make
4

Regarding BLAST, get homologues-est uses BLAST+ binaries, which can be easily downloaded from the NCBI
FTP site. The packed binaries are blastp and makeblastdb from version ncbi-blast-2.2.27+. If these do not work
in your machine or your prefer to use older BLAST versions, then it will be necessary to edit file lib/phyTools.pm.
First, environmental variable $ENV{’BLAST_PATH’} needs to be set to the right path in your system (inside subroutine
sub set_phyTools_env).
Variables $ENV{’EXE_BLASTP’} and $ENV{’EXE_FORMATDB’} also need to be changed to the appropriate BLAST binaries, which are respectively blastall and formatdb.

2.3

Optional software dependencies

It is possible to make use of get homologues-est on a computer farm or high-performance computing (HPC) cluster
managed by gridengine. In particular we have tested this feature with versions GE 6.0u8, 6.2u4, 2011.11p1 invoking the
program with option -m cluster. For this command to work it might be necessary to edit the get homologues-est.pl
file and add the right path to set global variable $SGEPATH. To find out the installation path of your SGE installation you
might try the next terminal command: $ which qsub
In case you have access to a multi-core computer you can follow the next steps to set up your own Grid Engine cluster
and speed up your calculations:
# updated 23072018
# 1) go to https://arc.liv.ac.uk/downloads/SGE/releases/ ,
# create user ’sgeadmin’ and download the latest binary packages
# (Debian-like here) matching your architecture (amd64 here):
wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-common_8.1.9_all.deb
wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9_amd64.deb
wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-dbg_8.1.9_amd64.deb
sudo
sudo
sudo
sudo
sudo

useradd
dpkg -i
dpkg -i
dpkg -i
apt-get

sgeadmin
sge-common_8.1.9_all.deb
sge_8.1.9_amd64.deb
sge-dbg_8.1.9_amd64.deb
install -f

# 2) set hostname to anything but localhost by editing /etc/hosts so that
# the first line is something like this (localhost or 127.0.x.x IPs not valid):
# 172.1.1.1
yourhost
# 3) install Grid Engine server with defaults except cluster name (’yourhost’)
# and admin user name (’sgeadmin’):
sudo su
cd /opt/sge/
chown -R sgeadmin sge
chgrp -R sgeadmin sge
./install_qmaster
# 4) install Grid Engine client with all defaults:
./install_execd
exit
# 5) check the path to your sge binaries, which can be ’lx-amd64’
ls /opt/sge/bin
# 6) Set relevant environment variables in /etc/bash.bashrc [can also be named /etc/basrhc]
# or alternatively in ~/.bashrc for a given user
export SGE_ROOT=/opt/sge

export PATH=$PATH:"$SGE_ROOT/bin/lx-amd64"
# 7) Optionally configure default all.q queue:
qconf -mq all.q
# 8) Add your host to list of admitted hosts:
qconf -as yourhost
For cluster-based operations three bundled Perl scripts are invoked:
_cluster_makeHomolog.pl, _cluster_makeInparalog.pl and _cluster_makeOrtholog.pl .
It is also possible to invoke Pfam domain scanning from get homologues-est. This option requires the bundled binary
hmmscan, which is part of the HMMER3 package, whose path is set in file lib/phyTools.pm (variable $ENV{’EXE_HMMPFAM’}).
Should this binary not work in your system, a fresh install might be the solution, say in /your/path/hmmer-3.1b2/. In
this case you’ll have to edit file lib/phyTools.pm and modify the relevant:
if( ! defined($ENV{’EXE_HMMPFAM’}) )
{
$ENV{’EXE_HMMPFAM’} = ’/your/path/hmmer-3.1b2/src/hmmscan --noali --acc --cut_ga ’;
}
The Pfam HMM library is also required and the install.pl script should take care of it. However, you can manually
download it from the appropriate Pfam FTP site. This file needs to be decompressed, either in the default db folder or
in any other location, and then it should be formatted with the program hmmpress, which is also part of the HMMER3
package. A valid command sequence could be:
$
$
$
$

cd db;
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz .;
gunzip Pfam-A.hmm.gz;
/your/path/hmmer-3.1b2/src/hmmpress Pfam-A.hmm

Finally, you’ll need to edit file lib/phyTools.pm and modify the relevant line to:
if( ! defined($ENV{"PFAMDB"}) ){ $ENV{"PFAMDB"} = "db/Pfam-A.hmm"; }
In order to reduce the memory footprint of get homologues-est it is possible to take advantage of the Berkeley DB
database engine, which requires Perl core module DB File, which should be installed on all major Linux distributions.
The accompanying script transcripts2cds.pl should work out of the box, but the more efficient transcripts2cdsCPP.pl
requires the installation of module Inline::CPP, which in turn requires Inline::C and g++, the GNU C++ compiler. The
installation of these modules is known to be troublesome in some systems, but the standard way should work in most
cases:
$ yum -y install gcc-c++ perl-Inline-C perl-Inline-CPP

# Redhat and derived distros

$ sudo apt-get -y install g++

# Ubuntu/Debian-based distros, and then cpan below

$ cpan -i Inline::C Inline::CPP

# will require administrator privileges (sudo)

This script may optionally use Diamond instead of BLASTX. The bundled linux binary should work out of the box;
in case the macOS binary does not work in your system you might have to re-compile it with:
cd bin/diamond-0.8.25/
build_simple.sh
cd ../../..
Similarly, in order to take full advantage of the accompanying script parse pangenome matrix.pl, particularly for option -p, the installation of module GD is recommended. An easy way to install them, provided that you have administrator
privileges, is with help from the software manager of your Linux distribution (such as synaptic/apt-get in Ubuntu, yum in
Fedora or YaST in openSUSE).
This can usually be done on the terminal as well, in different forms:
6

$ sudo apt-get -y install libgd-gd2-perl

# Ubuntu/Debian-based distros

$ yum -y install perl-GD

# Redhat and derived distros

$ zypper --assume-yes install perl-GD

# SuSE

$ cpan -i GD

# will require administrator privileges (sudo)

$ perl -MCPAN -e ’install GD’

# will require administrator privileges (sudo)

The installation of perl-GD on macOSX systems is known to be troublesome.

The accompanying scripts compare clusters.pl, plot pancore matrix.pl, parse pangenome matrix.pl,
plot matrix heatmap.sh, hcluster pangenome matrix.sh require the installation of the statistical software R, which usually
is listed by software managers in all major Linux distributions. In some cases (some SuSE versions and some Redhat-like
distros) it will be necessary to add a repository to your package manager. R can be installed from the terminal:
$ sudo apt-get -y install r-base r-base-dev

# Ubuntu/Debian-based distros

$ yum -y install R

# RedHat and derived distros

$ zypper --assume-yes R-patched R-patched-devel

# Suse

Please visit CRAN to download and install R on macOSX systems, which is straightforward.

In addition to R itself, plot matrix heatmap.sh and hcluster pangenome matrix.sh require some R packages to run,
which can be easily installed from the R command line with:
> install.packages(c("ape", "gplots", "cluster", "dendextend, "factoextra"), dependencies=TRUE)
Finally, the script compare clusters.pl might require the installation of program PARS from the PHYLIP suite, which
should be already bundled with your copy of get homologues.

User manual

This section describes the available options for the get homologues-est software.

3.1

Input data

This program takes input sequences in FASTA format, which might be GZIP- or BZIP2-compressed, contained in a
directory or folder containing several files with extension ’.fna’, which can have twin .faa files with translated amino
acid sequences for the corresponding CDSs. File names matching the tag ’flcdna’ are handled as full-length transcripts,
and this information will be used downstream in order to estimate coverage. Global variable $MINSEQLENGTH controls
the minimum length of sequences to be considered; the default value is 20.

3.2

Program options

Typing $ ./get_homologues-est.pl -h on the terminal will show the basic options:
-v print version, credits and checks installation
-d directory with input FASTA files (.fna , optionally .faa),
1 per sample, or subdirectories (subdir.clusters/subdir_)
with pre-clustered sequences (.faa/.fna ). Files matching
tag ’flcdna’ are handled as full-length transcripts.
Allows for files to be added later.
Creates output folder named ’directory_est_homologues’
Optional parameters:
-o only run BLASTN/Pfam searches and exit
-i cluster redundant isoforms, including those that can be
concatenated with no overhangs, and perform
calculations with longest
-c report transcriptome composition analysis

-R set random seed for genome composition analysis
-s save memory by using BerkeleyDB; default parsing stores
sequence hits in RAM
-m runmode [local|cluster]
-n nb of threads for BLASTN/HMMER/MCL in ’local’ runmode
-I file with .fna files in -d to be included

(use of pre-clustered sequences
ignores -c)

(useful to pre-compute searches)
(min overlap, default: -i 40,
use -i 0 to disable)

(follows order in -I file if enforced,
with -t N skips clusters occup
-h
-p
-l
-g
-d
-E
-n
-X

this message
check only ’plus’ strand
min length for CDS
genetic code to use during translation
run blastx against selected protein FASTA database file
max E-value during blastx search
number of threads for BLASTX jobs
use diamond instead of blastx

(optional, default both strands)
(optional, default=50 amino acid residues)
(optional, default=1, example: -g 11)
(default=swissprot, example: -d db.faa)
(default=1e-05)
(default=2)
(optional, much faster for many sequences)

-G show available genetic codes and exit
The main output of this script are two files, as shown in section 4.1, which contain inferred nucleotide and peptide
CDS sequences. These FASTA files contain in each header the evidence supporting each called CDS, which can be
blastx, transdecoder or a combination of both, giving precedence to blastx in case of conflict. Note that we
14

have observed that the output of TransDecoder might change if a single sequence is analyzed alone, in contrast to the
analysis of a batch of sequences. The next table shows the rules and evidence codes used by this script in order to call
CDS sequences by merging BLASTX (1) and TransDecoder (2) predictions. The rules are mutually exclusive and are
tested hierarchically from top to bottom. Sequences from 1 and 2 with less than 90 consecutive matches (30 amino
acid residues) are considered to be non-overlapping (last rule). Note that the occurrence of mismatches are checked as a
control:
graphical summary

evidence

1----------

blastx

2----------

transdecoder

description
no transdecoder
no blastx

1---------2-----------

blastx.transdecoder

inferred CDS overlap with no
mismatches and are concatenated

1---------2-----------

transdecoder.blastx

inferred CDS overlap with no
mismatches and are concatenated

1----------------2-----------

blastx 2), which
we do with this command:
$ ./get_homologues-est.pl -d cds -M -t 3 -m cluster
The output should include the next lines:
# number_of_clusters = 34248
# cluster_list = cds_est_homologues/Alexis_3taxa_algOMCL_e0_.cluster_list
# cluster_directory = cds_est_homologues/Alexis_3taxa_algOMCL_e0_
We should be now in position to compile the pan-genome matrix corresponding to these clusters:
./compare_clusters.pl -d cds_est_homologues/Alexis_3taxa_algOMCL_e0_ \
-o clusters_cds_t3 -m -n
which should produce:
# number of clusters = 34248
# intersection output directory: clusters_cds_t3
# intersection size = 34248 clusters
# intersection list = clusters_cds_t3/intersection_t0.cluster_list
# pangenome_file = clusters_cds_t3/pangenome_matrix_t0.tab
# pangenome_phylip file = clusters_cds_t3/pangenome_matrix_t0.phylip
We should now interrogate the pan-genome matrix, for instance looking for clusters found in one genotype (A) but
not in others (B):
./parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab \
-A cds/SBCC073.list -B cds/ref.list -g
You should obtain a list of 4348 accessory clusters:
# matrix contains 34248 clusters and 16 taxa
# taxa included in group A = 1
# taxa included in group B = 2
# finding genes present in A which are absent in B ...
# file with genes present in set A and absent in B (4348):
clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt
24

Finally, we will now estimate whether these clusters are enriched in any Pfam domain, producing also a single FASTA
file with the tested sequences:
./pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n -t greater \
-x clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt -e -p 0.05 \
-r SBCC073 -f SBCC073_accessory.fna
The output should be:
# 39400 sequences extracted from 113222 clusters
# total experiment sequence ids = 4818
# total control
sequence ids = 39400
# parse_Pfam_freqs: set1 = 562 Pfams set2 = 3718 Pfams

# created FASTA file: SBCC073_accessory.fna
# sequences=4818 mean length=353.8 , seqs/cluster=1.11
# fisher exact test type: ’greater’
# multi-testing p-value adjustment: fdr
# adjusted p-value threshold: 1
# total annotated domains: experiment=1243 control=19192
#PfamID
PF00009
PF00010
...
PF00665
PF07727
PF00931
PF13976

counts(exp) counts(ctr) freq(exp) freq(ctr) p-value p-value(adj) description
0 20 0.000e+00 1.042e-03 1.000e+00 1.000e+00 Elongation factor Tu GTP binding domain
0 32 0.000e+00 1.667e-03 1.000e+00 1.000e+00 Helix-loop-helix DNA-binding domain
13
28
44
14

31
61
201
19

1.046e-02
2.253e-02
3.540e-02
1.126e-02

1.615e-03
3.178e-03
1.047e-02
9.900e-04

1.418e-06
3.033e-13
1.750e-10
2.744e-09

1.318e-03
1.128e-09
3.253e-07
3.401e-06

Integrase core domain
Reverse transcriptase (RNA-dep DNA pol)
NB-ARC domain
GAG-pre-integrase domain

4.5

Making and annotating a non-redundant pangenome matrix

The script make_nr_pangenome_matrix.pl produces a non-redundant pangenome matrix by comparing all clusters to
each other, taking the median sequence in each cluster. By default nucleotide sequences are compared, but if the original
input of get homologues-est comprised both DNA and protein sequences, the user can also choose peptide sequences to
compute redundancy, which probably make more sense in terms of protein function. On the contrary, it would seem more
appropriate to use DNA sequences to measure diversity.
In this example a DNA-based non-redundant pangenome matrix is computed with BLASTN assuming that sequences
might be truncated (option -e) and using 10 processor cores and a coverage cutoff of 50%:
./make_nr_pangenome_matrix.pl -m outdir/pangenome_matrix_t0.tab -n 10 -e -C 50
# input matrix contains 5241 clusters and 4 taxa
# filtering clusters ...
# 5241 clusters with taxa >= 1 and sequence length >= 0
# sorting clusters and extracting median sequence ...
# running makeblastdb with outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.fna
# parsing blast result! (outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.blast , 0.37MB)
# parsing file finished
# 5172 non-redundant clusters
# created: outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.fna
# printing nr pangenome matrix ...
# created: outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.tab
Note that the previous command can be modified to match external reference sequences, for instance from Swissprot,
or pre-computed clusters, such as groups of orthologous sequences, so that the resulting matrix contains cross-references
to those external clusters, and their annotations. In either case, both input clusters and reference sequences must be of the
same type: either nucleotides or peptides.
The next example shows how a set of clusters produced by get homologues-est can be matched to some nucleotide
reference sequences, in this case annotated rice cDNAs:
./make_nr_pangenome_matrix.pl -m outdir/pangenome_matrix_t0.tab -n 10 -e -C 50 -f oryza.fna
This is the produced output:
# input matrix contains 5241 clusters and 4 taxa
# filtering clusters ...
# 5241 clusters with taxa >= 1 and sequence length >= 0
# sorting clusters and extracting median sequence ...
# re-using previous BLAST output outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.blast
# parsing blast result! (outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.blast , 0.34MB)
# parsing file finished
# 5172 non-redundant clusters
# created: outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90.fna
# 66339 reference sequences parsed in oryza.fna
# parsing blast result! (outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90_ref.blast , 0.37MB)
# parsing file finished

# matching nr clusters to reference (%alignment coverage cutoff=50) ...
# printing nr pangenome matrix ...
# created: outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90_ref_c50_s50.tab
# NOTE: matrix can be transposed for your convenience with:
perl -F’\t’ -ane ’$r++;for(1 .. @F){$m[$r][$_]=$F[$_-1]}; \
$mx=@F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}’ \
outdir/pangenome_matrix_t0_nr_t1_l0_e1_C50_S90_ref_c50_s50.tab
The suggested perl command can be invoked to tranpose the matrix, which now contains rows such as these:
non-redundant Franka.bz2.nucl
1_TR2804-c0_g1_i1.fna 1 1 0 0
2_TR1554-c0_g1_i1.fna 0 2 1 0
6_TR3918-c0_g1_i1.fna 0 1 1 0
...

Esterel.bz2.nucl flcdnas_Hnijo.gz.nucl ... redundant reference
NA LOC_Os09g07300.1 cDNA|BIG, putative, expressed
NA LOC_Os03g53280.1 cDNA|WD domain containing protein
NA NA

Pangenome matrices with more than 4 taxa can be plotted with help from script parse pangenome matrix.pl, as explained in manual get homologues.pdf.

4.6

Annotating a sequence cluster

After analyzing pan-genome or pan-transcriptome clusters it might be interesting to find out what kind of proteins they
enconde, or we might just want to double-check the BLAST matches that support a produced cluster. The script annotate cluster.pl does just that, and be used with both nucleotide and peptide clusters. It can be called like this:
./annotate_cluster.pl -f outdir/1004_TR425-c0_g2_i1.fna -o 1004_TR425-c0_g2_i1.aln.fna -D
And will produce this output:
# DEFBLASTNTASK=megablast DEFEVALUE=10
# MINBLUNTBLOCK=100 MAXSEQNAMELEN=60
# MAXMISMCOLLAP=0 MAXGAPSCOLLAP=2
# ./annotate_cluster.pl -f outdir/1004_TR425-c0_g2_i1.fna -r \
#
-o 1004_TR425-c0_g2_i1.aln.fna -P 1 -b 0 -D 1 -c 0 -A -B
# total

sequences: 3 taxa: 2

# Pfam domains: PF10602,PF01399,
# Pfam annotation: 26S proteasome subunit RPN7;PCI domain;
# aligned sequences: 3 width:
1595
# alignment sites: SNP=3 parsimony-informative=0 (outdir/1004_TR425-c0_g2_i1.fna)
# taxa included in alignment: 2
# alignment file: 1004_TR425-c0_g2_i1.aln.fna
If option -b is enforced a blunt-end alignment is produced, which might be useful for further analyses. In either case,
the produced FASTA alignment file will contain Pfam domains in each header, in addition to the relevant BLAST scores:
>TR425|c0_g2_i1_[Esterel.trinity.fna.bz2] bits E-value N qy ht 1:1595 Pfam:..
CCTGCTGGTGCATTTTTTTACAAACAGTTGGCACAGAGTATTTGTTGCTAATTGTGTTCGTTTTCTTGAA...

Figure 10: Fragment of alignment produced by annotate cluster.pl, rendered with BioEdit.

Finally, option -c can be invoked to collapse aligned sequences from the same species or taxon. This might be
useful when working with clusters of transcript isoforms, which are often redundant and broken in possibly overlapping
fragments. Taking the same example cluster, we could try to collapse isoforms with overlaps ≥ 30 residues like this:
./annotate_cluster.pl -f outdir/1004_TR425-c0_g2_i1.fna -o 1004_TR425-c0_g2_i1.aln.fna -D -c 30
Note that by default the script does not tolerate mismatches between sequences to be collapsed, but that behaviour can
be relaxed by editing the value of variable $MAXMISMCOLLAP=0 at the top of the script. Instead, as BLASTN-placed gaps
in identical sequences can often move, by default two such gaps are accepted (see variable MAXGAPSCOLLAP=2).
By default, the script annotate_cluster.pl looks for the longest sequences and aligns all other cluster sequences
to it with BLASTN (megablast). The user can also pass an external, reference sequence to guide cluster alignment (see
option -r). However, in either case, clusters of transcripts often contain a fraction of BLASTN hits that do not match
the longest/reference sequence; instead, they align towards the 5’ or 3’ of other sequences of the clusters and are thus not
included in the produced multiple sequence alignment (MSA):
-------------------------------------------------------------....
..

<= longest/reference sequence

<= sequences not included in MSA

A step-by-step protocol with barley assembled transcripts

This section describes the steps required to proceed with the analysis of barley transcripts with folder test_barley,
which you should get with the software. The following commands are to be pasted in your terminal:
## set get_homologues path if not already in $PATH
export GETHOMS=~/soft/github/get_homologues/
cd test_barley
## 1) prepare sequences
cd seqs
# download all transcriptomes
wget -c -i wgetlist.txt
# extract CDS sequences (this takes several hours)
# choose cdsCPP.sh if dependency Inline::CPP is available in your system
# the script will use 20 CPU cores, please adapt it to your system
./cds.sh
# clean and compress
#rm -f _* *noORF* *transcript*
#gzip *diamond*
# put cds sequences aside
mv *cds.f*gz ../cds
cd ..
# check lists of accessions are in place (see HOWTO.txt there)
ls cds/*list

## 2) cluster sequences and start the analyses
# calculate protein domain frequencies (Pfam)
$GETHOMS/get_homologues-est.pl -d cds -D -m cluster -o &> log.cds.pfam
# alternatively, if not running in a SGE cluster, taking for instance 20 CPUs
$GETHOMS/get_homologues-est.pl -d cds -D -n 20 -o &> log.cds.pfam
# calculate ’control’ cds clusters
$GETHOMS/get_homologues-est.pl -d cds -M -t 0 -m cluster &> log.cds
# get non-cloud clusters
$GETHOMS/get_homologues-est.pl -d cds -M -t 3 -m cluster &> log.cds.t3
# clusters for dN/dS calculations
$GETHOMS/get_homologues-est.pl -d cds -e -M -t 4 -m cluster &> log.cds.t4.e
# leaf clusters and pangenome growth simulations with soft-core
$GETHOMS/get_homologues-est.pl -d cds -c -z \
-I cds/leaf.list -M -t 3 -m cluster &> log.cds.leaf.t3.c
# produce pan-genome matrix and allocate clusters to occupancy classes
# all occupancies
30

$GETHOMS/compare_clusters.pl -d cds_est_homologues/Alexis_0taxa_algOMCL_e0_ \
-o clusters_cds -m -n &> log.compare_clusters.cds
# excluding cloud clusters, the most unreliable in our benchmarks
$GETHOMS/compare_clusters.pl -d cds_est_homologues/Alexis_3taxa_algOMCL_e0_ \
-o clusters_cds_t3 -m -n &> log.compare_clusters.cds.t3
$GETHOMS/parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab -s \
&> log.parse_pangenome_matrix.cds.t3
# make pan-genome growth plots
$GETHOMS/plot_pancore_matrix.pl -i cds_est_homologues/core_genome_leaf.list_algOMCL.tab \
-f core_both &> log.core.plots
$GETHOMS/plot_pancore_matrix.pl -i cds_est_homologues/pan_genome_leaf.list_algOMCL.tab \
-f pan &> log.pan.plots

## 3) annotate accessory genes
# find [-t 3] SBCC073 clusters absent from references
$GETHOMS/parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab \
-A cds/SBCC073.list -B cds/ref.list -g &> log.acc.SBCC073
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt \
clusters_cds_t3/SBCC073_pangenes_list.txt
# how many SBCC073 clusters are there?
perl -lane ’if($F[0] =~ /SBCC073/){ foreach $c (1 .. $#F){ if($F[$c]>0){ $t++ } }; print $t }’ \
clusters_cds_t3/pangenome_matrix_t0.tab
# find [-t 3] Scarlett clusters absent from references
$GETHOMS/parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab \
-A cds/Scarlett.list -B cds/ref.list -g &> log.acc.Scarlett
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt \
clusters_cds_t3/Scarlett_pangenes_list.txt
# find [-t 3] H.spontaneum clusters absent from references
$GETHOMS/parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab \
-A cds/spontaneum.list -B cds/ref.list -g &> log.acc.spontaneum
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt \
clusters_cds_t3/spontaneum_pangenes_list.txt
# Pfam enrichment tests
# core
$GETHOMS/pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
-x clusters_cds_t3/pangenome_matrix_t0__core_list.txt -e -p 1 \
-r SBCC073 > SBCC073_core.pfam.enrich.tab
$GETHOMS/pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
-x clusters_cds_t3/pangenome_matrix_t0__core_list.txt -e -p 1 \
-r SBCC073 -t less > SBCC073_core.pfam.deplet.tab
# accessory
$GETHOMS/pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
-x clusters_cds_t3/SBCC073_pangenes_list.txt -e -p 1 -r SBCC073 \
-f SBCC073_accessory.fna > SBCC073_accessory.pfam.enrich.tab

$GETHOMS/pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
-x clusters_cds_t3/Scarlett_pangenes_list.txt -e -p 1 -r Scarlett \
-f Scarlett_accessory.fna > Scarlett_accessory.pfam.enrich.tab
$GETHOMS/pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
-x clusters_cds_t3/spontaneum_pangenes_list.txt -e -p 1 -r Hs_ \
-f spontaneum_accessory.fna > spontaneum_accessory.pfam.enrich.tab
# note that output files contain data such as the mean length of sequences
# get merged stats for figure
perl suppl_scripts/_add_Pfam_domains.pl > accessory_stats.tab
perl -lane ’print if($F[0] >= 5 || $F[1] >= 5 || $F[2] >= 5)’ \
accessory_stats.tab > accessory_stats_min5.tab
Rscript suppl_scripts/_plot_heatmap.R

Frequently asked questions (FAQs)

Please see also the FAQs in manual get homologues.pdf.
• What’s the performance gain of v2?
After evolving parts of the original code base, and fixing some bugs (see CHANGES.txt), both get homologues.pl
and get homologues-est.pl have significantly improved their performance, as can be seen in the figure, which combines data from the original benchmark and new data generated after v2 was in place.

Figure 11: Computing time and RAM requirements of the original algorithm (OMCL, measured on 6 sequence sets) as
compared to the updated v2 code (measured on 3 three sets).

• What are the main caveats when clustering transcripts/CDS sequences?
get homologues-est.pl has been mainly tested with plant sequences, using both CDS sets from whole-genome annotations and also transcripts from expression experiments. The main problems we have found so far are split genes,
frequent artifacts in genome assemblies, incomplete genes which lack exons, for the same previous reasons, and
retained introns, which are common among plant transcripts. These three common situations are illustrated in the
figure.
• What are those chimeras warnings produced when running transcripts2cds?
33

Figure 12: Common problems faced when clustering transcripts/CDS sequences.

When subroutine transcripts::parseb lastxc dss equencesreadsBLAST X/DIAMONDresultscheckswhethersecondaryalignmentstoth
• Why have you not implemented the COG algorithm in the get homologues-est.pl?
We have left the COG algorithm out of get homologues-est.pl as it will take some more work to integrate it with redundant
isoform calling, which is important for EST datasets. However, it should be possible to do it.
• The number of clusters produced with -C 75 -S 85 does not match the pangenome/pantranscriptome size estimated
with option -c
The reason for these discrepancies is that these are fundamentally different analyses. While the default runmode simply
groups sequences trying to put in the same cluster isoforms of orthologues and very close inparalogues, a genome composition analysis performs a simulation in order to estimate how many novel sequences are added by genomes/transcriptomes
sampled in random order. In terms of code, there are a couple of key global variables set in lib/marfil_homology.pm,
lines 135-138, which control how a gene/transcript is compared to previously processed sequences in order to call it novel:
$MIN_PERSEQID_HOM_EST = 70.0;
$MIN_COVERAGE_HOM_EST = 50.0;
These values are equivalent to say that any sequence with coverage ≥ 50% and identity ≥ 70% to previous genes/transcripts
will be considered simply a homologue and won’t be accumulated to the growing pangenome/pantranscriptome. You
might want to change these values to increase or relax the stringency and to match the parameters set to produce your
clusters.
• Can get homologues-est.pl be used to analyze non-coding sequences?
In principle the software should work with any type of nucleotide sequences. For instance, the next figure shows how it
can used to analyze conserved non-coding sequences among Brachypodium distachyon and rice, with a median BLASTN
alignment length of 32.
• I produced 2 different parsimony trees with compare clusters.pl, is there a way to merge them and add bootstrap values?

Figure 13: Core-genome composition analysis of conserved non-coding sequences (CNS) from 56 Brachypodium distachyon ecotypes and rice.

The two trees are equally parsimonious, that is, have the same parsimony score but different topologies. One possibility
to combine them into one topology is to compute either the majority rule consensus (mjr) tree, for example with consense
from the PHYLIP package, or represent a network consensus with a program such as splitstree.
Regarding the bootstrapping, you could write some R code (or in any other language) to randomly sample the columns
in the pangenome matrix with replacement to construct a new, bootstrapped matrix with the same number of columns as
the original one. You should generate 100 or 500 of these matrices and then run pars on each one of them. Then run
consense to obtain the mjr consensus tree and associated split frequencies (bootstrap support values for each bipartition).
An easy way of achieving this would be with R’s boot package (see example).
Another option is to call seqboot from the PHYLIP package to generate N bootstrap pseudo-replicates of the matrix.
Rename the resulting file as infile and call pars to read the file, using the option m no_of_pseudoreplicates to run
the standard (Fitch) parsimony analysis on each of the bootstraped matrices. PHYLIP pars will generate an outtree file
containing as many trees as bootstraped matrices found in infile. Rename outtree to intree and call consense to generate
the default majority rule consensus tree. This tree is only a cladogram (only topology, no branch lengths). The node labels
correspond to the number of bootstrap pseudoreplicates in which that particular bipartition was found.
• How can I produce a maximum likelihood (ML) tree with bootstrap values from a pangenome matrix?
We recommend script estimate_pangenome_trees.sh from the GET PHYLOMARKERS pipeline.
Another alternative is to use the .fasta version of the pangenome matrix produced by script ./compare_clusters.pl -m ,
next to the .tab and .phy files. This FASTA file can be analyzed with ML software such as IQ-TREE (PubMed=25371430)
both online or in the terminal with a command such as:
path_to_iqtree -s pangenome_matrix_t0.fasta -st BIN -m TEST -bb 1000 -alrt 1000
This will produce an optimal ML tree after selecting a binary substitution model with both bootstrap and aLRT support
values.
• Is there a way to plot ANI matrices of soft-core clusters?
Let’s say you have 20 genomes, then 95% of them are exaclty 19 taxa, which is the minimum occupancy that defines
soft-core clusters (see global variable $SOFTCOREFRACTION). You should then compute the ANI matrix as follows:
./get_homologues.pl -d your_data -a ’CDS’ -A -M -t 19
And then plot the resulting matrix with script hcluster pangenome matrix.sh.

• When I use the hcluster pangenome matrix.sh script the trees in the output of the newick file and the heatmap differ. Is
there a reason for this?
The difference in the topologies of the NJ trees and the row-dendrogram of the heatmaps differ because the heatmaps are
ordered bi-dimensionally. That is, the heatmap plot shows only the row-dendrogram, but the matrix is ordered also by
columns. The NJ tree is computed from the distance matrix that you indicate the program to calculate for you (ward.D2
is the default).
• I ran get homologues with fasta files of 78 genomes; is there a way to export a 78 x 78 matrix of the number of homologues
shared between each genome?
If you did your analysis requesting cluster of all occupancies (-t 0) then you can get what you want in two steps. First,
you must produce a pangenome matrix with compare_clusters -d ... -m. Now it is possible to request an intersection pangenome matrix (pangenome_matrix_t0__intersection.tab) which contains the number of sequence clusters shared by any two pairs of genomes with parse_pangenome_matrix.pl -m pangenome_matrix_t0.tab -s -x.
Note that these clusters might contain several inparalogues of the same species.
• How can I use get homologues to produce clusters for GET PHYLOMARKERS?
You should use option -e and make sure that your input FASTA nucleotide files have twin files with matching peptide
sequences. Then, in GET PHYLOMARKERS make sure to set option -f EST.

Credits and references

get homologues-est.pl is designed, created and maintained at the Laboratory of Computational Biology at Estación Experimental de Aula Dei/CSIC in Zaragoza (Spain) and at the Center for Genomic Sciences of Universidad Nacional
Autónoma de México (CCG/UNAM).
The code was written mostly by Bruno Contreras-Moreira and Pablo Vinuesa, but it also includes code and binaries
from OrthoMCL v1.4 (algorithm OMCL, -M), NCBI Blast+, MVIEW, DIAMOND and BioPerl 1.5.2.
Other contributors: Carlos P Cantalapiedra, Roland Wilhelm.
We ask the reader to cite the main reference describing the get homologues software,

• Contreras-Moreira B, Cantalapiedra CP, Garcia Pereira MJ, Gordon S, Vogel JP, Igartua E, Casas AM and Vinuesa P
(2017) Analysis of plant pan-genomes and transcriptomes with GETH OMOLOGUES−EST, aclusteringsolution f orsequenceso f thesam
and also the original papers describing the included algorithms and databases, accordingly:
• Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes.
Genome Res. 13(9):2178-89.
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ (1997) Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25(17): 3389-3402.
• Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H,
Lehvslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E,
Wilkinson MD, Birney E. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12(10):16118.
• hmmscan :: search sequence(s) against a profile database HMMER 3.1b2 (Feb 2015) http://hmmer.org Copyright
(C) 2015 Howard Hughes Medical Institute. Freely distributed under the GNU General Public License (GPLv3).
• Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, SangradorVegas A, Salazar GA, Tate J, Bateman A. (2016) The Pfam protein families database: towards a more sustainable
future. Nucleic Acids Res. 44(D1):D279-85
• Haas BJ, Papanicolaou A, Yassour M et al. (2013) De novo transcript sequence reconstruction from RNA-seq using
the Trinity platform for reference generation and analysis. Nat Protoc. 8(8):1494-512.
• Brown NP, Leroy C, Sander C (1998) MView: A Web compatible database search or multiple alignment viewer.
Bioinformatics. 14 (4):380-381.
• Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods.
12(1):59-60
If you use the accompanying scripts the following references should also be cited:
• R Core Team (2013) R: A Language and Environment for Statistical Computing. http://www.R-project.org R
Foundation for Statistical Computing, Vienna, Austria, ISBN3-900051-07-0

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Page Count                      : 37
Page Mode                       : UseOutlines
Author                          : 
Title                           : 
Subject                         : 
Creator                         : LaTeX with hyperref package
Producer                        : pdfTeX-1.40.16
Create Date                     : 2018:07:23 15:06:48+02:00
Modify Date                     : 2018:07:23 15:06:48+02:00
Trapped                         : False
PTEX Fullbanner                 : This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) kpathsea version 6.2.1

EXIF Metadata provided by EXIF.tools

Manual Get Homologues Est

Navigation menu

Versions of this User Manual:

Views

Navigation