GREDU User Reference Manual
Mingfu Shao ∗
Laboratory for Computational Biology and Bioinformatics,
École Polytechnique Fédérale de Lausanne (EPFL),
Lausanne, Switzerland

October 7, 2018

1 Introduction

GREDU (Genome REarrangements with DUplications) is a software package that implements several
exact algorithms to compute the edit distances between two genomes with duplicate genes under
various evolutionary models. Specifically, GREDU contains the following five programs:
1. dcj: implements an exact algorithm to compute the DCJ (Double-Cut-and-Join) distance between two genomes with duplicate genes [1].
2. segdcj: implements an exact algorithm to compute the edit distance under segmental duplications and DCJ operations for two genomes with duplicate genes [2].
3. exemplar: implements an exact algorithm to compute the exemplar breakpoint distance for two genomes with duplicate genes [3].
4. intermediate: implements an exact algorithm to compute the intermediate breakpoint distance for two genomes with duplicate genes [4].
5. maxmatching: implements an exact algorithm to compute the maximum-matching breakpoint distance for two genomes with duplicate genes [4].

2 Installation

To install GREDU, you need to download two libraries (BOOST and GUROBI), set up the corresponding environment variables, and then compile the source code of GREDU.
∗ shaomingfu@gmail.com


2.1 Install BOOST

Download BOOST from http://www.boost.org. Uncompress it somewhere (compiling and
installing are not necessary). Set the environment variable BOOST_HOME to the directory of
BOOST. For example, on Unix platforms, add the following statement to the file ~/.bash_profile:
export BOOST_HOME="/directory/to/your/boost/boost_1_60_0"

2.2 Install GUROBI

Download GUROBI from http://www.gurobi.com/ and uncompress the package somewhere (compiling and installing are not required). You need to apply for an academic license to use the full features
of GUROBI (please refer to the GUROBI documentation for more information). After that, set two
environment variables, GUROBI_HOME and GRB_LICENSE_FILE, which indicate the directory of
GUROBI and the location of your license file, respectively. For example, on Unix platforms, add
the following two statements to the file ~/.bash_profile:
export GUROBI_HOME="/directory/to/your/gurobi/linux64"
export GRB_LICENSE_FILE="/your/license/gurobi.lic"
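
Before compiling, it can help to confirm that the three variables are visible to new shells. The following is a minimal Python sketch of such a check, not part of GREDU. The variable names come from this manual, while the helper name check_environment and the rule that the two HOME variables must be directories and the license variable must be a file are assumptions of this sketch.

```python
import os

# Variable names are taken from this manual. Treating the two *_HOME values as
# directories and the license value as a regular file is an assumption.
REQUIRED = {
    "BOOST_HOME": os.path.isdir,
    "GUROBI_HOME": os.path.isdir,
    "GRB_LICENSE_FILE": os.path.isfile,
}


def check_environment(env=None):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name, path_exists in REQUIRED.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not path_exists(value):
            problems.append(f"{name} points to a missing path: {value}")
    return problems


if __name__ == "__main__":
    # Print one line per problem; print nothing when everything is in place.
    for problem in check_environment():
        print(problem)
```

Run it from a freshly opened terminal so that the settings in ~/.bash_profile have been loaded.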

2.3 Compile GREDU

Get the source code of GREDU through git:
$ git clone git@github.com:shaomingfu/grudo.git .
Compile the libraries, the main source code, and the tools by running each of the following commands from the top-level directory of the repository:
$ cd lib; ./build.sh
$ cd src; ./build.sh
$ cd tools; ./build.sh
After that, all executable files (dcj, segdcj, exemplar, intermediate and maxmatching) will be
present in the bin directory.

3 Command line

All five programs use the same parameters. Take exemplar as an example:
$ ./exemplar <genome-file-1> <genome-file-2> <time-limit>


The first two arguments specify the two files in which the two genomes are encoded. The third argument specifies the time limit (in seconds) for the GUROBI solver. A set of example input files is
provided under bin. For program dcj, you can use human.dcj and mouse.dcj as input files. For
the other four programs, you can use human.all and mouse.all as input files.
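
For scripted runs, the three-argument invocation described above can be wrapped in a few lines of Python. This is a sketch under stated assumptions, not part of GREDU: the helper names build_command and run_gredu are hypothetical, and the code only relies on the argument order given in this section.

```python
import subprocess


def build_command(program, genome1, genome2, time_limit):
    """Assemble the command line: two genome files, then the time limit in seconds."""
    return [program, genome1, genome2, str(time_limit)]


def run_gredu(program, genome1, genome2, time_limit):
    """Run one GREDU program and return the last non-empty line of its standard output.

    The line is returned unparsed because this sketch makes no assumption
    about how the programs format their output.
    """
    result = subprocess.run(
        build_command(program, genome1, genome2, time_limit),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program exits with an error
    )
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```

For example, run_gredu("./exemplar", "human.all", "mouse.all", 600) would run exemplar on the bundled example files from inside bin, where 600 is an arbitrary time limit chosen for illustration.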

4 Input Format

The structure of the genome file is as follows. A genome contains several linear or circular chromosomes, and each chromosome consists of a sequence of genes in the order of their location on the
chromosome. Each chromosome spans several lines, and each line specifies one gene using
four fields separated by spaces.
1. The first field is a string, which specifies the identifier of this gene. The identifier should be
unique for each gene.
2. The second field is a signed integer, which specifies the family of this gene. Genes in the same
gene family should have the same absolute value. The orientation of this gene is specified by
the sign of this integer.
3. The third field is a string, which specifies the name of the chromosome containing this gene.
4. The fourth field is an integer chosen from {1, 2}, where 1 means this chromosome is linear
and 2 means this chromosome is circular.
NOTE: For program dcj, make sure that each gene family has the same number of genes in both
genomes (there is no such requirement for the other four programs).
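
The four-field layout and the dcj requirement above can be checked before launching the solver. The following Python sketch is an illustration only and is not how GREDU itself reads its input. The names Gene, parse_genome and unbalanced_families are hypothetical, skipping blank lines is an assumption, and the gene identifiers in the usage note are made up.

```python
from collections import Counter, namedtuple

Gene = namedtuple("Gene", ["identifier", "family", "orientation", "chromosome", "circular"])


def parse_genome(lines):
    """Parse genome lines, each holding four space-separated fields."""
    genes = []
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skipping blank lines is an assumption of this sketch
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        identifier, signed_family, chromosome, kind = fields
        signed_family, kind = int(signed_family), int(kind)
        if kind not in (1, 2):
            raise ValueError(f"fourth field must be 1 (linear) or 2 (circular): {line!r}")
        if identifier in seen:
            raise ValueError(f"duplicate gene identifier: {identifier}")
        seen.add(identifier)
        orientation = -1 if signed_family < 0 else 1  # the sign encodes orientation
        genes.append(Gene(identifier, abs(signed_family), orientation, chromosome, kind == 2))
    return genes


def unbalanced_families(genes1, genes2):
    """Return the families whose gene counts differ between the two genomes."""
    sizes1 = Counter(gene.family for gene in genes1)
    sizes2 = Counter(gene.family for gene in genes2)
    return sorted(f for f in sizes1.keys() | sizes2.keys() if sizes1[f] != sizes2[f])
```

With genome lines such as "a1 1 chr1 1" and "a2 -2 chr1 1", parse_genome yields one Gene per line, and a pair of genomes satisfies the dcj requirement when unbalanced_families returns an empty list.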

5 Output Format

A file named mapping will be generated in the current directory, specifying the optimal one-to-one correspondence between the genes of the two genomes, in which the identifiers of the genes are
used. For program segdcj, an additional file named segments will be generated, illustrating the optimal
segmental duplications for each genome. The optimal edit distance between the two given genomes
is displayed on the last line of the standard output.

6 Tools

If you use data from Ensembl (http://www.ensembl.org), we provide some Perl scripts
at tools/ensembltool to generate the input files for these programs from the raw downloaded data.
To download gene order data from Ensembl, use the "customise your download" page. Take the human genome as an example. First, choose the database (Ensembl Genes 74) and then the dataset (Homo
sapiens genes). Second, in the Filters, choose the proper chromosomes in the REGION field (1-22, X
and Y), and choose protein coding genes in the GENE field if necessary. Third, in the Attributes,
choose Ensembl Gene ID, Ensembl Transcript ID, Chromosome Name, Strand, Transcript
Start, Transcript End in the GENE field, and Ensembl Protein Family ID(s) in the PROTEIN
DOMAINS field. Make sure that these attributes are selected in the order described above. Last,
go to the Results tab and save the results to a CSV file.
After downloading the data, for example for the human and mouse genomes, use the Perl
script tools/ensembltool/build.pl to transform it to the required format:
$ ./build.pl <raw-file-1> <raw-file-2> <input-file-1> <input-file-2>
The first two parameters are the names of the two raw data files, and the last two parameters are
the names of the GREDU input files to be written in the required format. build.pl calls the other three
scripts in the same directory: longest.pl selects the longest transcript for each gene,
family.pl selects those gene families with the same number of genes in each genome, and
join.pl transforms the genes in the selected families to the required format.
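
To make the first of these steps concrete, here is a minimal Python sketch of selecting one transcript per gene. It is an illustration and not a port of longest.pl. It assumes the CSV columns appear in the attribute order listed above, that the export starts with a header row, that transcript length is measured as end minus start plus one, and that ties keep the first transcript seen. The names read_rows and longest_transcripts are hypothetical.

```python
import csv

# Column positions follow the attribute order given above (an assumption about the CSV):
# 0 gene ID, 1 transcript ID, 2 chromosome, 3 strand, 4 start, 5 end, 6 family ID.
GENE_ID, START, END = 0, 4, 5


def read_rows(handle):
    """Read the exported CSV, skipping the first line as a header row (assumed)."""
    reader = csv.reader(handle)
    next(reader, None)
    return [row for row in reader if row]


def longest_transcripts(rows):
    """Keep, for each gene ID, the row whose transcript spans the most bases."""
    best = {}
    for row in rows:
        length = int(row[END]) - int(row[START]) + 1
        gene_id = row[GENE_ID]
        # Strict comparison: on a tie, the transcript seen first is kept.
        if gene_id not in best or length > best[gene_id][0]:
            best[gene_id] = (length, row)
    return [row for _, row in best.values()]
```

A typical call would be longest_transcripts(read_rows(open("human.csv", newline=""))), where human.csv stands for whatever name the raw download was saved under.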
NOTE: For program dcj, you have to uncomment a few lines of family.pl to generate the correct
input files (please follow the instructions on line 47). This is because dcj requires that each gene
family has exactly the same number of genes in both genomes. The modified family.pl will keep only
those gene families with the same gene copy numbers and remove the others.

References
[1] M. Shao, Y. Lin, and B.M.E. Moret. An exact algorithm to compute the DCJ distance for
genomes with duplicate genes. In Proc. 18th Int’l Conf. Comput. Mol. Biol. (RECOMB’14),
volume 8394 of Lecture Notes in Comp. Sci., pages 280–292, 2014.
[2] M. Shao and B.M.E. Moret. Comparing genomes with rearrangements and segmental duplications. Bioinformatics, 31(12):i329–i338, 2015.
[3] M. Shao and B.M.E. Moret. A fast and exact algorithm for the exemplar breakpoint distance.
In Proc. 19th Int’l Conf. Comput. Mol. Biol. (RECOMB’15), volume 9029 of Lecture Notes in
Comp. Sci., pages 309–322, 2015.
[4] M. Shao and B.M.E. Moret. On computing breakpoint distances for genomes with duplicate
genes. In Proc. 20th Int’l Conf. Comput. Mol. Biol. (RECOMB’16), volume 9649 of Lecture
Notes in Comp. Sci., pages 189–203, 2016.
