User Manual

LRSDAY: Long-read Sequencing Data Analysis for Yeasts
Release v1.1.0
(2018-07-11)

Author Contact:
Jia-Xing Yue (岳家兴)
Email: yuejiaxing@gmail.com
GitHub: yjx1217
Twitter: iAmphioxus
Website: http://www.iamphioxus.org

Table of Contents
INTRODUCTION
    Background
    Overview of the LRSDAY workflow
    Comparison with other methods
    Experimental Design
    Limitations and potential adaptation
    Expected improvements
CITATIONS
MATERIALS
    Hardware, operating system and network
    Software or library requirements
    Input data
    Example data
PROCEDURE
    Download, install and configure LRSDAY ● TIMING <1 h
    Run LRSDAY with the testing example ● TIMING 50 h
? TROUBLESHOOTING
● TIMING
ANTICIPATED RESULTS
REFERENCES
APPENDIX
    Appendix 1: Pre-shipped supporting data for LRSDAY
    Appendix 2: Tips for adapting LRSDAY to other eukaryotic organisms

INTRODUCTION
Background
Twenty years ago, the genome sequence of the budding yeast Saccharomyces cerevisiae was
published1. As the first complete eukaryotic genome ever sequenced, this marked a major
scientific milestone in biology. Since then, the genomes of many model and non-model organisms
have been sequenced, with this process accelerating after the emergence of next-generation
sequencing (NGS) technologies. Despite the notably improved throughputs, NGS technologies
suffer from the limitation of short reads and usually result in highly fragmented genome
assemblies containing numerous gaps and local mis-assemblies. The recently developed long-read
sequencing technologies represented by Pacific Biosciences (PacBio) and Oxford Nanopore
offer compelling alternatives to overcome such hurdles, producing high-quality genome
assemblies with substantially improved continuity and accuracy. Although initially tested in
microbial genome sequencing, their recent applications in complex mammalian and plant
genomes also achieved high-quality results2–6. With such new sequencing technologies,
challenging genomic regions with enriched repetitive elements, strongly biased GC-content, or
complex structural variants can often be correctly resolved. It is therefore anticipated that genome
sequencing projects will routinely adopt long-read-based sequencing technologies in the coming
years to gain insight into these complex genomic regions.
Yeast is a leading model organism with great importance in both basic biomedical research and
biotechnological applications. Its small genome size makes it particularly suitable for long-read-based
high-coverage genome sequencing. The resulting complete genome assembly with fully-resolved
subtelomere structure can in turn illuminate the genetic basis of many complex
phenotypic traits with unprecedented resolution. Recently, we used the long-read sequencing
technologies to generate the first panel of population-level end-to-end reference genomes of 12
yeast strains representing major subpopulations of the partially domesticated S. cerevisiae and
its sister species Saccharomyces paradoxus7. In addition, there have been a number of other
studies carrying out long-read sequencing for many S. cerevisiae strains8–11. Given the vast
genomic and phenotypic diversity of S. cerevisiae, we expect the incoming collection of long-read-based
high-quality genome assemblies of strains from widespread geographic locations and
ecological niches will substantially deepen our understanding of S. cerevisiae natural genetic
variation and its associated biotechnological value.

Overview of the LRSDAY workflow
Here we present a highly organized and modular computational framework named Long-Read
Sequencing Data Analysis for Yeasts (LRSDAY), which enables automated high-quality yeast
genome assembly and annotation production from raw long-read sequencing data. The prototype
of LRSDAY has been developed to generate the yeast population reference panel (YPRP) in our
previous study7. Under the hood, LRSDAY contains a series of task-specific modules handling
long-read-based de novo genome assembly, short-read-based assembly polishing, reference-guided
assembly scaffolding, as well as comprehensive genomic feature annotations. These
tasks can be run individually, selectively or coordinately depending on users’ needs. LRSDAY
supports both leading long-read sequencing technologies: PacBio and Oxford Nanopore.
Running the full LRSDAY workflow, the final output is a chromosome-level genome assembly with
high-quality annotations of centromeres, protein-coding genes, tRNAs, transposable elements
(TEs; Ty1-Ty5 for S. cerevisiae and S. paradoxus), and telomere associated core X and Y’
elements. LRSDAY is shipped with various auxiliary scripts, configuration files and supporting
data that enable its semi-automatic installation, configuration, and execution with minimal manual
intervention (Appendix 1). This design concept greatly alleviates the technical barrier for bench
biologists with limited bioinformatics experience. In addition, a real case example and its final
outputs are also provided for users’ test and comparison. All these task-specific modules, auxiliary
files, installed tools, sample outputs, together with the user-created project directories for the
testing example and users’ own data are hosted under the same home directory
($LRSDAY_HOME) in a self-contained fashion (Fig. 1). This design makes LRSDAY well-isolated
from the rest of the system and therefore greatly improves its portability. To sum up, LRSDAY is
a highly transparent, automated and powerful computational framework that handles both
genome assembly and annotation, which suits the needs of the yeast community in performing
long-read-based genome sequencing projects. In the PROCEDURE section of this article, we
provide a step-by-step walkthrough on how to install, configure, and run LRSDAY with our
prepared testing example.

Figure 1. Overview of the LRSDAY directory system. All the top-level directories (boxes, solid
lines) and individual files of LRSDAY are listed and briefly described. Additional directories and
files will be generated during the installation or execution of LRSDAY (boxes, dashed lines).

Comparison with other methods
Genome assembly is a rapidly moving field, co-evolving with the fast-paced development of
sequencing technologies. In recent years, both hybrid (i.e. using both long and short reads) and
native (i.e. using long reads only) assemblers that support long-read sequencing data have been
developed and tested on many different organisms8,12–17. As for gene annotation, there is also a
wide range of choices that perform gene model prediction in an ab initio fashion or based on
additional evidence (e.g. mRNA transcripts, protein-sequence alignment)18–20. Specifically for
yeasts, a web-based gene annotation tool has been developed that combines both approaches21.
However, there is currently no integrated solution that handles both genome assembly and
annotation in a seamless way. To fill this gap, we revamped our original workflow for deriving the
yeast population-level reference genomes7 into a self-contained package to considerably
streamline this process with modular design and automated implementation. Moreover, rather
than simply combining the existing tools for genome assembly and gene annotation, LRSDAY
assembled a well-integrated workflow with many other functionalities (e.g. reference-guided
scaffolding, gene orthology identification, and additional genomic feature annotation) built in,
which makes it a unique one-stop solution for high-quality genome assembly and annotation
production from long-read sequencing data.

Experimental Design
Genome assembly and annotation are complex computational processes with many intermediate
steps and inputs/outputs involved. With LRSDAY, we designed a highly structured project
directory system to help users to run the whole workflow in an organized and modular way (Fig.
2). Within such project directory system, the three subdirectories holding the pre-shipped
reference genome as well as the user-supplied long (PacBio or Oxford Nanopore) and short
(Illumina) reads are numbered as “00” and the task-specific subdirectories are numbered
sequentially from “01” to “13” according to their execution order. For each subdirectory, a
self-explanatory name is attached after the number index to help users navigate the workflow.
To run each task, users only need to edit (i.e. to specify the input and output file paths and certain
task-specific parameters) and execute the task-specific pipeline scripts pre-placed in these
subdirectories. These pipeline scripts will automatically set environment variables, process the
data, and formulate the results. All computationally intensive tasks can be processed using
multiple threads to substantially save computation time. Although LRSDAY is mainly designed for
yeasts, most of these tasks can be further adapted for the analysis of any eukaryotic organism
(See Appendix 2 for details). Below we briefly describe the computational processes executed by
each task-specific module in LRSDAY with the corresponding PROCEDURE Step labeled in
parentheses.

01. Long-read-based_Genome_Assembly (Step 13): Long reads generated from PacBio or
Oxford Nanopore technologies are used to perform de novo genome assembly using an
overlap-layout-consensus (OLC) algorithm.
02. Illumina-read-based_Assembly_Polishing (Step 15; optional): Illumina reads are first
clipped and trimmed to remove potential sequencing adapters and low-quality regions.
The cleaned reads are subsequently mapped to the raw long-read-based genome
assembly. The resulting BAM file is further processed for alignment sorting, mate
information and read group fixing, duplicates removal as well as local realignment. Finally,
the processed bam file is used for correcting base-level errors of the long-read-based
assembly.
03. Reference-guided_Assembly_Scaffolding (Step 16): The contigs from the polished
genome assembly are first aligned to the reference genome to identify their shared
sequence homology, based on which reference-guided assembly scaffolding is
subsequently performed. The chromosomal identity of each scaffold is labeled
accordingly. Structural rearrangements captured in the contigs will remain untouched
during the scaffolding.
04. Centromere_Identity_Profiling (Step 18): The pre-shipped S. cerevisiae centromere
sequences are searched against the scaffolded assembly for chromosome-specific
centromere identity profiling.
05. Mitochondrial_Genome_Assembly_Improvement (Step 19): The polished contigs
corresponding to the mitochondrial genome are re-collected from the scaffolded assembly.
The mitochondrial contigs spanning over the designated starting point (the ATP6 gene by
default) are broken into subsegments to prevent assembly problems caused by the
circular organization of the mitochondrial genome. The resulting contigs are then reassembled into a single linear sequence, which is further circularized by the designated
ATP6 starting point. The nuclear scaffolds and the circularized mitochondrial sequence
together form the improved genome assembly.
06. Supervised_Final_Assembly (Steps 20-22): A modification list containing the ordering,
orientation and naming information of each sequence from the improved genome
assembly is generated for users to review and to make manual adjustment when needed.
The final genome assembly is further generated based on the user-edited modification
list.
07. Centromere_Annotation (Step 23): The pre-shipped S. cerevisiae centromere
sequences are searched against the final genome assembly for centromere annotation.

08. Gene_Annotation (Step 25): De novo protein-coding and tRNA gene annotations are
performed for the final genome assembly and are further supported by evidence from mRNA
transcript and protein sequence alignments.
09. TE_Annotation (Step 28): The pre-shipped curated TE library (containing the long
terminal repeats (LTRs) and internal sequences of the S. cerevisiae and S. paradoxus
Ty1-Ty5 by default) is searched against the final genome assembly to identify TEs. The
identified TEs are further curated and classified into the full-length, truncated, and
solo-LTRs of Ty1-Ty5.
10. Core_X_Element_Annotation (Step 30): The pre-shipped curated hidden Markov model
(HMM) of the S. cerevisiae core X elements is searched against the final genome
assembly to annotate core X elements.
11. Y_Prime_Element_Annotation (Step 32): The pre-shipped representative S. cerevisiae
Y’ element sequence is searched against the final genome assembly to annotate Y’
elements. Note that Y’ elements can have long, short or degenerated forms22, and we
used a representative long-form Y’ element as the query to maximize detection power.
12. Gene_Orthology_Identification (Step 34): The annotated protein-coding genes are
compared with the reference protein-coding genes based on both sequence homology
and gene order conservation to identify gene orthology relationship between these two
sets. Based on such gene orthology relationships, the Saccharomyces Genome Database
(SGD; http://www.yeastgenome.org/) systematic names are assigned to the annotated
protein-coding genes.
13. Annotation_Integration (Step 35): The annotations of centromeres, TEs, protein-coding
genes, tRNAs, as well as core X and Y’ elements are combined and sorted to form a final
integrated multi-feature annotation.
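The numbering scheme described above can be sketched as a short shell function that prints the thirteen task subdirectories in execution order. This is only an illustration: the task names are copied from the list above, and the “NN.Task_Name” pattern is assumed to match the one shown for 01.Long-read-based_Genome_Assembly in the PROCEDURE section. Check the names against your own Project_Template directory.

```shell
# Sketch only: task names come from the list above; the "NN.Name" pattern
# is an assumption based on 01.Long-read-based_Genome_Assembly.
list_task_dirs() {
  i=1
  for name in \
    Long-read-based_Genome_Assembly \
    Illumina-read-based_Assembly_Polishing \
    Reference-guided_Assembly_Scaffolding \
    Centromere_Identity_Profiling \
    Mitochondrial_Genome_Assembly_Improvement \
    Supervised_Final_Assembly \
    Centromere_Annotation \
    Gene_Annotation \
    TE_Annotation \
    Core_X_Element_Annotation \
    Y_Prime_Element_Annotation \
    Gene_Orthology_Identification \
    Annotation_Integration
  do
    # Zero-pad the task index to two digits, e.g. 01, 02, ..., 13.
    printf '%02d.%s\n' "$i" "$name"
    i=$((i + 1))
  done
}

list_task_dirs
```

Comparing this listing with the output of “ls” inside a project directory is a quick way to confirm that no task subdirectory is missing.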

Figure 2. The workflow of LRSDAY. Each box represents an individual module. These modules
are numbered according to the tasks described in Experimental Design, with the corresponding
protocol step numbers also indicated. Modules that can be adapted for other eukaryotes are
colored in light blue while those yeast-specific are colored in orange.

Limitations and potential adaptation
In its distributed form, LRSDAY is tailored for the model budding yeast S. cerevisiae and its closely
related sister species S. paradoxus with a number of pre-shipped auxiliary data files configured
accordingly. However, given its modular design, the backbone of LRSDAY can be adapted for
virtually any eukaryotic organism to perform genome assembly, sequence polishing, reference-guided
scaffolding, protein-coding gene and tRNA annotation, gene orthology identification,
and annotation integration. In Appendix 2, we provide some tips with regard to such adaptation.
Moreover, those assembly polishing, scaffolding and various annotation modules (See
Experimental Design) can also be used to analyze existing genome assemblies derived from any
sequencing technology or combination of technologies. Such flexibility makes LRSDAY very useful for
expanded use cases and therefore suits the needs of a broader audience.

Expected improvements
As thousands of yeast strains have been or are currently under sequencing23–26, our knowledge
of the overall genome content diversity27 of this important model organism is expanding rapidly,
revealing a whole new picture of the pan-genome diversity of S. cerevisiae. For example, our lab
is currently working on characterizing the pan-genome of >1,000 S. cerevisiae isolates across the
globe (The 1002 Yeast Genomes Project26; http://1002genomes.u-strasbg.fr/). Future
developments of LRSDAY will incorporate such pan-genome dataset to provide additional
annotation information for those non-reference genes, especially with regard to their evolutionary
origin, population prevalence, and putative functions. Such information will greatly help users to
dissect and interpret complex genotype-phenotype interactions in diverse ecological and
biotechnological settings. An additional potential future direction of our research is the direct
integration with the downstream synteny analysis tools (e.g. CHROnicle28, MCScanX29, etc) to
perform automatic large-scale structural variants discovery, which exploits one of the major
benefits of having a high-quality genome assembly derived from long reads. Finally, we envision
developing a dedicated web-based tool to implement such database and tool integration at a
larger scale towards fully automated genomics analysis for the yeast community in the long run.

CITATIONS
Jia-Xing Yue & Gianni Liti. (2018) Long-read sequencing data analysis for yeasts. Nature
Protocols, 13:1213–1231.
Jia-Xing Yue, Jing Li, Louise Aigrain, Johan Hallin, Karl Persson, Karen Oliver, Anders
Bergström, Paul Coupland, Jonas Warringer, Marco Cosentino Lagomarsino, Gilles Fischer,
Richard Durbin, Gianni Liti. (2017) Contrasting evolutionary genome dynamics between
domesticated and wild yeasts. Nature Genetics, 49:913-924.

MATERIALS
Hardware, operating system and network
This protocol is designed for a desktop or computing server running an x86-64-bit Linux operating
system. Multi-threaded processors are preferred to speed up the process since many steps can
be configured to use multiple threads in parallel. For assembling and analyzing the budding yeast
genomes (genome size = ~12 Mb), at least 16 Gb of RAM and 100 Gb of free disk space are
recommended. When adapted for other eukaryotic organisms with larger genome sizes, the RAM
and disk space consumption will scale up, majorly during de novo genome assembly (performed
by Canu17). Please refer to Canu’s manual (http://canu.readthedocs.io/en/latest/) for suggested
RAM and disk space consumption for assembling large genomes. Stable Internet connection is
required for the installation and configuration of LRSDAY as well as for retrieving the testing data.
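As a rough aid, the recommended minimums above can be checked from the shell before installation. The sketch below is not part of LRSDAY. It assumes a Linux system that exposes /proc/meminfo and a POSIX-compatible “df”, and it checks the file system of the current directory. Values are rounded to whole gigabytes, so treat the result as indicative only.

```shell
# Sketch: compare total RAM and free disk space with the recommended minimums.
# Assumes Linux (/proc/meminfo) and a POSIX "df"; thresholds are from the text above.
min_ram_gb=16
min_disk_gb=100

# Succeeds when the first value (have) is at least the second (need).
meets_minimum() {
  [ "$1" -ge "$2" ]
}

report() {  # usage: report LABEL HAVE_GB NEED_GB
  if meets_minimum "$2" "$3"; then
    echo "$1: $2 GB (OK, recommended >= $3 GB)"
  else
    echo "$1: $2 GB (below the recommended $3 GB)"
  fi
}

if [ -r /proc/meminfo ]; then
  # MemTotal is reported in kB; convert to GB and round to the nearest integer.
  ram_gb=$(awk '/^MemTotal:/ {printf "%.0f", $2 / 1024 / 1024}' /proc/meminfo)
  report "RAM" "$ram_gb" "$min_ram_gb"
fi

# Column 4 of "df -Pk" is the available space in 1024-byte blocks.
disk_gb=$(df -Pk . | awk 'NR == 2 {printf "%.0f", $4 / 1024 / 1024}')
report "Free disk space" "$disk_gb" "$min_disk_gb"
```

Run it from the directory where you plan to unpack LRSDAY so that the disk check refers to the right file system.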

Software or library requirements
● Bash (https://www.gnu.org/software/bash/)
● Bzip2 (http://www.bzip.org/)
● Cmake (https://cmake.org/)
● GCC and G++ v4.7 or newer (https://gcc.gnu.org/)
● Ghostscript (https://www.ghostscript.com)
● Git (https://git-scm.com/)
● GNU make (https://www.gnu.org/software/make/)
● Gzip (https://www.gnu.org/software/gzip/)
● Java runtime environment (JRE) v1.8.0 or newer (https://www.java.com)
● Perl v5.12 or newer (https://www.perl.org/)
● Python v2.7.9 or newer (https://www.python.org/)
● Python v3.6 or newer (https://www.python.org/)
● Tar (https://www.gnu.org/software/tar/)
● Unzip (http://infozip.sourceforge.net/UnZip.html)
● Virtualenv v15.1.0 or newer (https://virtualenv.pypa.io)
● Wget (https://www.gnu.org/software/wget/)
● Zlib (https://zlib.net/)
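A quick way to see which of these prerequisites are already available is to look each one up on the PATH. The sketch below is an illustration, not an LRSDAY script. The command names are assumptions about how each package is usually exposed (for example, Ghostscript as “gs” and the two Python versions as “python2” and “python3”). It does not check version numbers or libraries such as Zlib.

```shell
# Sketch: report which prerequisite commands can be found on the PATH.
# Command names are assumptions (e.g. Ghostscript as "gs"); adjust as needed.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in bash bzip2 cmake gcc g++ gs git make gzip java perl \
            python2 python3 tar unzip virtualenv wget; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "MISSING: $tool"
  fi
done
```

Any tool reported as missing can then be installed with your distribution's package manager before running the installer.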

Input data
● Long reads: A single FASTQ file containing PacBio or Oxford Nanopore reads is needed,
which will be used for long-read-based de novo genome assembly (Task 01).
● Short reads: Short reads are optional, but when they are available LRSDAY can use them
to perform additional assembly polishing (Task 02). If paired-end Illumina sequencing is
performed, two FASTQ files containing the forward and reverse Illumina reads respectively
are needed. If only single-end Illumina sequencing data is available, one FASTQ file
containing the single-end reads is needed.
● Reference genome: For the budding yeast S. cerevisiae, we pre-shipped two reference
genome files (one original assembly and one with hard-masked subtelomeres and
chromosome-ends based on our previous study7). The masked version is used for
chromosomal scaffolding to minimize the confounding effect due to interchromosomal
subtelomeric rearrangements. When working with organisms of which the subtelomeric
regions are undefined, users can just use a single raw reference genome instead. The
reference genome file(s) will be used for reference-guided scaffolding, mitochondrial
genome assembly improvement and supervised final genome assembly (Tasks 03, 05 and
06 respectively).
● A number of S. cerevisiae-specific auxiliary data files have been pre-shipped with LRSDAY
for genomic feature annotation and gene orthology identification (Tasks 07-12).

Example data
● The S. cerevisiae reference genome pre-shipped with LRSDAY is taken from our study7 with
the GenBank accession number GCA_002057635.1
(https://www.ncbi.nlm.nih.gov/assembly/GCA_002057635.1/). The sequencing reads used for the
testing example come from the same study and consist of both PacBio and Illumina reads
produced from the S. cerevisiae strain SK1. The PacBio reads can be retrieved with the ENA
analysis accession number ERZ448251 (https://www.ebi.ac.uk/ena/data/view/ERZ448251). The
Illumina reads can be retrieved with the SRA sequencing run accession number SRR4074258
(https://www.ncbi.nlm.nih.gov/sra/?term=SRR4074258). In LRSDAY, we have provided bash
scripts to automatically download and set up these data for the testing example.

PROCEDURE
Download, install and configure LRSDAY ● TIMING <1 h
1) Download the latest LRSDAY release (current version: v1.1.0) by entering the following
commands in a terminal window:
$ wget https://github.com/yjx1217/LRSDAY/releases/download/v1.1.0/LRSDAY-v1.1.0.tar.gz
$ tar xvzf LRSDAY-v1.1.0.tar.gz
$ cd LRSDAY-v1.1.0
$ bash install_dependencies.sh

▲CRITICAL STEP Make sure you have a fast and stable Internet connection when
running this step since many tools will be downloaded here. Check that all the
prerequisites (see Software or library requirements in MATERIALS or the prerequisite.txt
file in the downloaded LRSDAY directory) have been installed on your system.
▲CRITICAL STEP Upon successful completion of the bash script, you
should see the confirmation message: “LRSDAY message: This bash script
has been successfully processed! :)”. Otherwise, it means an error has occurred during
the execution of the bash script, which interrupted the automatic installation process. This
also applies to all the bash scripts used in Steps 8-35. Whenever an error is encountered,
please check the error message and refer to the troubleshooting section when available.
When the cause of the error is fixed, re-run this step to initiate a new installation. The
installer will ask you to confirm the deletion of the build directory generated by the
previous run; always answer “yes” so that the new installation can start.
▲CRITICAL STEP Pay attention to the final message prompted by the installation script
for a number of notes with regard to the additional manual setup steps as well as the
licensing information for commercial users.
? TROUBLESHOOTING
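When the output of a script is captured in a log file, the confirmation message can also be checked programmatically. The following sketch assumes that the message appears verbatim in the captured output. The log file name in the usage comment is hypothetical.

```shell
# Sketch: test whether a captured log contains the LRSDAY confirmation message.
# The message text is quoted from this manual; the log file name is hypothetical.
success_msg='LRSDAY message: This bash script has been successfully processed! :)'

script_succeeded() {  # usage: script_succeeded LOG_FILE
  # -F treats the message as a fixed string, -q suppresses output.
  grep -qF "$success_msg" "$1"
}

# Example usage, with the log captured beforehand via:
#   bash install_dependencies.sh 2>&1 | tee install_log.txt
#
#   if script_succeeded install_log.txt; then
#     echo "installation finished"
#   else
#     echo "check install_log.txt for errors"
#   fi
```

The same check applies to the logs of the task-specific scripts, since they print the same message on success.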
2) Load the environment settings for LRSDAY by entering:
$ source env.sh

After loading the pre-configured environment settings, the current directory should be
assigned to the environment variable $LRSDAY_HOME. You can check to see if the full
path to your current directory is displayed after entering:
$ echo $LRSDAY_HOME

▲CRITICAL STEP Although most required tools have been automatically installed and
configured, manual downloading and/or configuration are needed for GATK30 and
RepeatMasker31 due to license restriction of these tools or of their dependent libraries.
Make sure to run this command to load the pre-configured environment settings before
such manual setup (described in Steps 3-7). If you exited your terminal session before or
in the middle of your manual setup, you need to re-load the environment settings before
proceeding. These environment settings will be automatically loaded each time the
task-specific bash pipelines of LRSDAY are executed.
? TROUBLESHOOTING
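Before starting the manual setup in Steps 3-7, it can help to confirm that the environment settings are loaded in the current terminal session. The sketch below only tests that $LRSDAY_HOME is set and points to an existing directory. The function name is made up for this illustration.

```shell
# Sketch: confirm that the LRSDAY environment has been loaded in this session.
# lrsday_env_loaded is a hypothetical helper, not part of LRSDAY.
lrsday_env_loaded() {
  [ -n "${LRSDAY_HOME:-}" ] && [ -d "$LRSDAY_HOME" ]
}

if lrsday_env_loaded; then
  echo "LRSDAY_HOME is set to: $LRSDAY_HOME"
else
  echo "LRSDAY_HOME is not set; run 'source env.sh' in the LRSDAY directory first." >&2
fi
```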
3) Download and set up GATK3. Go to the official website of GATK
(https://software.broadinstitute.org/gatk/download/archive) and download GATK v3.8 (the
recently released GATK4 will not work). Registration might be needed for unregistered
GATK users. Place the downloaded GATK package (file name:
GenomeAnalysisTK-3.8-1-gf15c1c3ef.tar.bz2) in the GATK installation directory and set up GATK
by entering the following commands in a terminal:
$ mv GenomeAnalysisTK-3.8-1-gf15c1c3ef.tar.bz2 $LRSDAY_HOME/build/GATK3
$ cd $LRSDAY_HOME/build/GATK3
$ tar xjf GenomeAnalysisTK-3.8-1-gf15c1c3ef.tar.bz2
$ mv ./GenomeAnalysisTK-3.8-1*/GenomeAnalysisTK.jar .
$ chmod 755 GenomeAnalysisTK.jar

4) Set up TRF for RepeatMasker. Go to the TRF website
(http://tandem.bu.edu/trf/trf.download.html) to download the TRF v4.09 executable built
for 64-bit Linux. Place the downloaded file (file name: trf409.linux64) in the RepeatMasker
installation directory and set up TRF by entering:
$ mv trf409.linux64 $LRSDAY_HOME/build/RepeatMasker/
$ cd $LRSDAY_HOME/build/RepeatMasker/
$ chmod 755 trf409.linux64
$ ln -s trf409.linux64 trf

5) Set up the RepBase library for RepeatMasker. Go to the RepBase website
(http://www.girinst.org/repbase/) and register a user account (use a non-profit
institutional email address; addresses with a “.com” extension, for example, will not work)
to download the
RepeatMasker edition of the RepBase library (version: 20170127). Place the downloaded
file (file name: RepBaseRepeatMaskerEdition-20170127.tar.gz) in the RepeatMasker
installation directory and uncompress it by entering:
$ mv RepBaseRepeatMaskerEdition-20170127.tar.gz $LRSDAY_HOME/build/RepeatMasker/

$ cd $LRSDAY_HOME/build/RepeatMasker/
$ tar xzf RepBaseRepeatMaskerEdition-20170127.tar.gz

? TROUBLESHOOTING
6) Get the installation path of TRF and rmblastn/makeblastdb by entering:
$ echo $trf_dir
$ echo $rmblast_dir

Remember these two paths since they will be used for the RepeatMasker configuration in
Step 7.
? TROUBLESHOOTING
7) Run the configuration script for RepeatMasker by entering:
$ perl ./configure

This configuration script will ask several questions; answer them as follows.
Enter “env” for the question about the installation path of Perl.
Just press enter for the question about the installation path of RepeatMasker. Enter the
first path that you obtained in Step 6 for the question about the installation path of TRF.
Enter “2” for the question about selecting a search engine. Then enter the second path
that you obtained in Step 6 for the question about the installation path of rmblastn and
makeblastdb. Just press enter for the question about the default search engine. And finally
enter “5” to complete the configuration.

Run LRSDAY with the testing example ● TIMING 50 h
8) Create the project directory. When running LRSDAY with your own data, it is
recommended to make a copy of our Project_Template directory to create your own
project directory such as Project_abc, where “abc” can be any string containing letters,
numbers, or underscores. For this testing example, we make a copy of the
Project_Template directory and name it as Project_Example by entering:
$ cd $LRSDAY_HOME
$ cp -r Project_Template Project_Example

▲CRITICAL STEP Before proceeding to your own project, it is advised to first run our
prepared testing example to check if LRSDAY is working properly as well as to get
acquainted with the logic and workflow of LRSDAY.
9) Prepare the reference genome files. When running LRSDAY with your own data, you can
directly put the reference genome (in FASTA format without compression) in the
00.Ref_Genome subdirectory of your project directory (e.g. Project_abc). If your
sequenced organism is S. cerevisiae or S. paradoxus, you can use the reference genome
pre-shipped with LRSDAY. Here we prepare the pre-shipped reference genome for the
testing example by entering:
$ cd ./Project_Example/00.Ref_Genome
$ bash LRSDAY.00.Prepare_Sc_Ref_Genome.sh

10) Prepare the long reads. When running LRSDAY with your own data, you can directly put
the long reads in the 00.Long_Reads subdirectory of your project directory (e.g.
Project_abc). The reads file should be in FASTQ or FASTA format. Compressed files with
file extensions of “.gz”, “.bz2”, and “.xz” are also supported. For this testing example, you
can download the sample long reads by entering:
$ cd ./../00.Long_Reads
$ bash LRSDAY.00.Retrieve_Sample_PacBio_Reads.sh
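As a small illustration of the supported compression types listed above, the following sketch classifies a reads file by its extension. The function name and the example file name are hypothetical. The sketch inspects only the file name, not the file contents.

```shell
# Sketch: classify a reads file by the compression extensions listed in Step 10.
# reads_compression is a hypothetical helper; it looks at the file name only.
reads_compression() {  # usage: reads_compression FILE_NAME
  case "$1" in
    *.gz)  echo "gzip" ;;
    *.bz2) echo "bzip2" ;;
    *.xz)  echo "xz" ;;
    *)     echo "uncompressed" ;;
  esac
}

reads_compression "long_reads.fastq.gz"
```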

11) (Optional) If your long reads are generated from the PacBio Sequel platform, your reads
are likely to be in BAM format. In this case, convert it to our supported format (FASTA or
FASTQ) using the following commands:
$ source ./../../env.sh
$ $bedtools_dir/bedtools bamtofastq -i long_reads.bam -fq long_reads.fastq
$ gzip long_reads.fastq

12) (Optional) Prepare the Illumina reads. Like for the long reads, directly put your Illumina
reads in the 00.Illumina_Reads subdirectory of your project directory (e.g. Project_abc)
when running LRSDAY for your own data. The reads file should be in FASTQ format with
“gzip” compression (identified by the “.gz” extension). For this testing example, you can
download the Illumina reads by entering:
$ cd ./../00.Illumina_Reads
$ bash LRSDAY.00.Retrieve_Sample_Illumina_Reads.sh

? TROUBLESHOOTING
13) Perform long-read-based de novo genome assembly by running the following commands:
$ cd ./../01.Long-read-based_Genome_Assembly
$ bash LRSDAY.01.Long-read-based_Genome_Assembly.sh

Upon the completion of this step, a summary file (SK1.canu.stats.txt for this testing
example) will be generated to report some basic summary statistics (e.g. total assembly
size, N50 (i.e. the contig length such that contigs of at least this length contain 50% of
the total assembly size), L50 (i.e. the smallest number of contigs whose combined length
reaches 50% of the total assembly size), GC-content, etc.) to help gauge the genome
assembly quality (Table 1). Two VCF files (SK1.canu.filter.mummer2vcf.SNP.vcf and
SK1.canu.filter.mummer2vcf.INDEL.vcf for the testing example) will also be generated to
report base-level differences between the raw genome assembly and the reference
genome in their uniquely alignable regions, which can also help in assessing the
genome assembly quality.
▲CRITICAL STEP This step can be run with multiple threads to speed up the process.
Depending on the CPU configuration of your Linux server/desktop, you can edit the
“threads=” option in the bash script LRSDAY.01.Long-read-based_Genome_Assembly.sh
to enable multi-threading. You can do the same for all the following tasks whenever the
“threads=” option is provided in the corresponding task-specific bash script. Simple text
editors such as emacs, vim, gedit or pico are highly recommended for such editing. Rich
text editors might not work.
▲CRITICAL STEP Note that this step will take a long time to finish (see TIMING), so we
recommend running this step and all the other time-consuming steps with “nohup”
(https://en.wikipedia.org/wiki/Nohup), which allows the process to continue running after
you exit the terminal or log out from the server. As an example, you can run the bash script
using nohup as follows:
$ nohup bash LRSDAY.01.Long-read-based_Genome_Assembly.sh >run_log.txt 2>&1 &

▲CRITICAL STEP Starting from v1.1.0, LRSDAY provides the
“customized_canu_parameters” option to support customized parameter settings for the
Canu assembler. In addition, two more assemblers, Flye and smartdenovo, are now
supported. In our tests, Flye32 (https://github.com/fenderglass/Flye) and
smartdenovo (https://github.com/ruanjue/smartdenovo) ran much faster than Canu, but
at the cost of assembly precision, as reflected by the higher
base-level error rate of their resulting assemblies. Therefore, when assembling with Flye
or smartdenovo, post-assembly polishing is strongly recommended.
▲CRITICAL STEP When running LRSDAY with your own data, modify the bash script to
specify the input reads and reference genome, the input reads type (e.g. “pacbio_raw”,
“pacbio_corrected”, “nanopore_raw” or “nanopore_corrected”), the estimated genome
size for the assembled genome, as well as the prefix for the output data. Remember to do
similar project-specific adjustment for all the following steps.
? TROUBLESHOOTING
14) (Optional) Polishing genome assembly with long reads. When running LRSDAY with your
own data, if you performed PacBio sequencing and also have access to a locally installed
PacBio SMRT Analysis software package
(http://www.pacb.com/products-and-services/analytical-software/smrt-analysis/), we
recommend running first-pass polishing of the assembly generated in Step 13 based on raw
PacBio reads by using PacBio’s own Quiver/Arrow pipeline13. If you performed Oxford
Nanopore sequencing, we recommend using nanopolish (https://github.com/jts/nanopolish)
or equivalent tools to polish the assembly based on raw Nanopore reads in this step. We
expect to package nanopolish directly into LRSDAY in our next release. We will also try to
do the same for PacBio’s Quiver/Arrow pipeline, but this appears quite challenging. Finally,
we also tested Racon (https://github.com/isovic/racon) for polishing, but its performance
was not satisfactory.
15) (Optional) Polishing genome assembly with Illumina reads. When Illumina reads are
available, we recommend running this additional polishing step either for the raw assembly
generated in Step 13 (when Step 14 is skipped as in our testing example) or for the
long-read-polished assembly generated in Step 14. Use the following commands to perform
Illumina-read-based assembly polishing. This step can be run with multiple threads.
$ cd ./../02.Illumina-read-based_Assembly_Polishing
$ bash LRSDAY.02.Illumina-read-based_Assembly_Polishing.sh

TABLE 1 | Assembly statistics for the genome of S. cerevisiae strain SK1 assembled in the
testing example.
Assembly statistics            Raw assembly   Scaffolded assembly   Final assembly
Total sequence count
Total sequence length (bp)     12505999       12511058              12489291
Min sequence length (bp)       1248
Max sequence length (bp)       1480196        1480203
Mean sequence length (bp)      357314.26      367972.29             378463.36
Median sequence length (bp)    103703.00      74951.00              84648.00
N50 (bp)                       818058         923633
L50
N90 (bp)                       341459         341463
L90
GC%                            38.27          38.25                 38.29
N%                             0.00           0.08                  0.04

Note
N50: the contig length such that 50% of the total assembly size is contained in contigs of at
least this size.
L50: the number of longest contigs such that 50% of the total assembly size is contained.
N90: the contig length such that 90% of the total assembly size is contained in contigs of at
least this size.
L90: the number of longest contigs such that 90% of the total assembly size is contained.
GC%: the percentage of guanine (G) and cytosine (C) bases in the nucleotide sequences.
N%: the percentage of the N bases in the nucleotide sequence. In genome assembly, the N
bases are usually used to represent scaffolding gaps.
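
The N50 and L50 definitions in the note above can be illustrated with a minimal shell sketch. This is not part of LRSDAY: the function name n50_l50 is hypothetical, and the sketch assumes a FASTA file supplied on standard input and the standard awk and sort utilities.

```shell
# Hypothetical helper (not an LRSDAY script): print "N50 L50" for a FASTA
# file read from standard input.
n50_l50() {
  # Pass 1: emit one sequence length per FASTA record (wrapped lines are summed).
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }' |
  sort -rn |
  # Pass 2: walk the lengths from longest to shortest until at least half of
  # the total assembly size is covered.
  awk '{ lens[NR] = $1; total += $1 }
       END {
         for (i = 1; i <= NR; i++) {
           cum += lens[i]
           if (cum * 2 >= total) { print lens[i], i; exit }
         }
       }'
}

# Toy input with contigs of 8, 5, 4 and 3 bp (total 20 bp).
printf '>a\nACGTACGT\n>b\nACGTA\n>c\nACGT\n>d\nACG\n' | n50_l50   # prints "5 2"
```

For a real assembly, the function could be called as `n50_l50 < SK1.final.fa`; the values reported in the LRSDAY stats file remain the reference.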

Figure 3. Genome-wide dotplots of the S. cerevisiae SK1 genome assembly generated in
the LRSDAY testing example. Both the raw scaffolded assembly (panel a; generated in Step
16) and the final assembly (panel b, generated in Step 22) are analyzed. The forward and reverse
sequence matches are depicted in red and blue respectively, while the zoomed-in views of the
mitochondrial genome (chrMT) comparison are shown in insets. In addition to the sixteen nuclear
chromosomes and the mitochondrial genome, the scaffolded and final assemblies also contain
some short contigs (named as tig*_pilon) that are derived from highly repetitive regions.
16) Perform chromosome-level scaffolding for the long-read-based assembly. Run the
following commands:
$ cd ./../03.Reference-guided_Assembly_Scaffolding
$ bash LRSDAY.03.Reference-guided_Assembly_Scaffolding.sh

This step can be run with multiple threads. Upon completion, a list of summary statistics
(SK1.ragout.stats.txt for this testing example) will be generated for the scaffolded
assembly (Table 1).
▲CRITICAL STEP Please check the generated genome-wide dotplot
(SK1.ragout.filter.pdf for the testing example) (Fig. 3a) to verify the correctness of
chromosomal identity assignment performed by Ragout33 and apply manual adjustment in
Step 21 when necessary. When running LRSDAY with your own data, you might see a
single scaffold corresponding to more than one reference chromosome, which could be
due to shared sequence homology between duplicated regions or interchromosomal
rearrangements. Both types of events can be correctly interpreted based on the
genome-wide dotplot generated in this step. In either case, LRSDAY can correctly assign
the chromosomal identity of the corresponding scaffold based on its encompassed
centromere identity as annotated in Step 18. Check the generated AGP file
(SK1.ragout.agp for the testing example) for the details of reference-based scaffolding.
▲CRITICAL STEP Due to the high AT and repeat contents and the circular conformation
of the mitochondrial genome, multiple contigs corresponding to the mitochondrial genome
are often obtained from the raw genome assembly, as shown in the generated
mitochondrial genome dotplot (SK1.ragout.chrMT.filter.pdf for the testing example) (Fig.
3a, inset). A list of such mitochondrial contigs will also be generated (SK1.mt_contig.list
for the testing example), which will be used in Step 19 for improving mitochondrial genome
assembly.
17) (Optional) When running LRSDAY with your own data, if you have strong evidence for
mis-scaffolding based on prior knowledge or other experimental data (e.g. mate-pair libraries
or chromosomal contact data), break the corresponding Ragout scaffolds back into contigs
and re-join them in the corrected order using the pre-shipped Perl scripts
break_scaffolds_by_N.pl, join_contigs_by_N.pl and extract_region_from_genome.pl in
the $LRSDAY_HOME/scripts directory by running the following commands:
perl $LRSDAY_HOME/scripts/break_scaffolds_by_N.pl -i -o <the output FASTA file containing the scaffolds after the breaking> -g
perl $LRSDAY_HOME/scripts/join_contigs_by_N.pl -i -o -g -t
perl $LRSDAY_HOME/scripts/extract_region_from_genome.pl -i -o -q -f

One scenario for this use case is when the breakpoints of structural rearrangements
coincide with the breakpoints of the genome assembly. In this case, the reference-based
scaffolding will arrange contigs according to the reference genome configuration and
therefore undo the genome rearrangement.
18) Perform centromere profiling for the scaffolded genome assembly by running the following
commands:
$ cd ./../04.Centromere_Identity_Profiling
$ bash LRSDAY.04.Centromere_Identity_Profiling.sh

▲CRITICAL STEP The chromosome-specific centromere identities profiled here will be
used as another layer of information for the final chromosomal identity assignment in Step
21. The profiled centromere identities usually agree well with the chromosomal identities
labeled in Step 16, so that chrI will have the CEN1 centromere, chrII will have the
CEN2 centromere, etc. Exceptions can occur when interchromosomal rearrangements are
present in your sequenced genome. In such cases, we recommend naming those
rearranged chromosomes according to their encompassed centromeres in Step 21.
19) Perform mitochondrial genome assembly improvement by running the following
commands:
$ cd ./../05.Mitochondrial_Genome_Assembly_Improvement
$ bash LRSDAY.05.Mitochondrial_Genome_Assembly_Improvement.sh

▲CRITICAL STEP Check the generated final mitochondrial genome dotplot
(SK1.mt_improved.chrMT.filter.pdf for the testing example) and compare it with the
mitochondrial genome dotplot generated in Step 16 to see how the mitochondrial genome
assembly has been improved when aligning with the reference mitochondrial genome (Fig.
3b, inset). When running this step for your own data, the degree of such improvement may
vary because it depends on both the complexity of the assembled mitochondrial genome
and the quality of library preparation and sequencing experiments.
? TROUBLESHOOTING
20) Generate the assembly modification list file for performing the final chromosome
assignment by running the following commands:

$ cd ./../06.Supervised_Final_Assembly
$ bash LRSDAY.06.Supervised_Final_Assembly.1.sh

21) Edit the generated assembly modification list file (SK1.modification.list for the testing
example) based on the genome-wide dotplot generated in Step 16 and the centromere
profiles generated in Step 18. The modification list file consists of three comma-separated
columns, which correspond to the original sequence name, sequence orientation, and new
sequence name respectively. With this file, you can do three types of editing:
If you need to change the current sequence order, you can move the corresponding
rows upward or downward to reflect the correct order.
If you need to invert the orientation of a given sequence, you can change its
orientation from “+” to “-” in column 2.
If you need to rename a given sequence, you can specify the new name in the third
column.
For this testing example here, we need to move the row “chrIX,+,chrIX” downward to place
it after the row “chrVIII,+,chrVIII”, so that chrIX will be placed after chrVIII in the final
assembly. Also, we need to change the row “chrMT_Contig1,+,chrMT_Contig1” to
“chrMT_Contig1,+,chrMT” for renaming the assembled sequence corresponding to the
mitochondrial genome.
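
The two edits described for the testing example can also be scripted. The following is only a sketch under stated assumptions: the function name edit_modification_list is hypothetical, the toy input does not reproduce the real contents or row order of SK1.modification.list, and editing the file by hand in a text editor remains the procedure described in this step.

```shell
# Hypothetical helper (not an LRSDAY script): read a three-column,
# comma-separated modification list on stdin, move the chrIX row to just after
# the chrVIII row, and rename chrMT_Contig1 to chrMT in column 3.
edit_modification_list() {
  awk -F, -v OFS=, '
    $1 == "chrMT_Contig1" { $3 = "chrMT" }        # new name goes in column 3
    $1 == "chrIX"         { moved = $0; next }    # hold this row back
    { rows[++n] = $0; names[n] = $1 }
    END {
      for (i = 1; i <= n; i++) {
        print rows[i]
        if (names[i] == "chrVIII" && moved != "") { print moved; moved = "" }
      }
      if (moved != "") print moved                # chrVIII absent: append at the end
    }'
}

# Toy input (assumed row order, for illustration only).
printf 'chrI,+,chrI\nchrIX,+,chrIX\nchrVIII,+,chrVIII\nchrMT_Contig1,+,chrMT_Contig1\n' |
  edit_modification_list
```

Writing the result to a new file and comparing it with the original (for example with diff) before running Step 22 keeps the manual review that this step calls for.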
22) Once all the modifications have been specified, run the following bash script to generate
the final genome assembly as well as the associated genome-wide dotplot (Fig. 3b),
assembly statistics (Table 1), and VCF files:
$ bash LRSDAY.06.Supervised_Final_Assembly.2.sh

23) Re-run centromere annotation for the final genome assembly using the following
commands:
$ cd ./../07.Centromere_Annotation
$ bash LRSDAY.07.Centromere_Annotation.sh

24) (Optional) Customize the configuration file for gene annotation. When running LRSDAY
with your own data, edit the configuration file
$LRSDAY_HOME/misc/maker_opts.customized.ctl if your sequenced organism is
neither S. cerevisiae nor S. paradoxus (see Appendix 2 for the details). If your sequenced
organism is S. cerevisiae or S. paradoxus, no customization is needed unless you have
native transcriptome or expressed sequence tag (EST) data for the strain that you
sequenced. In this case, you can edit line 16 of this file to provide the full path of the
native transcriptome or EST assembly for your sequenced strain.

25) Annotate protein-coding genes and tRNAs for the final genome assembly, using the
following commands. This step can be run with multiple threads.
$ cd ./../08.Gene_Annotation
$ bash LRSDAY.08.Gene_Annotation.sh

26) (Optional) A manual checklist file (SK1.EVM.manual_check.list for the testing example)
containing a list of genes with suspicious annotations will be generated in Step 25. As
labeled in this file, these annotated gene models could be fragmented, frameshifted or
contain internal stop codons. Potentially, these genes could be good candidates for
pseudogenes. Manually inspect the annotated gene models of these genes by loading the
annotation result (SK1.EVM.gff3 for the testing example) together with the
protein/EST-alignment evidence files generated during the annotation step
(SK1.protein_evidence.gff3 and SK1.est_evidence.gff3 for the testing example) into IGV34
to check how well these suspicious gene models are supported by the corresponding
protein/EST-alignment evidence and to tag or remove those truly problematic ones in your
downstream analysis.
27) (Optional) Perform dedicated mitochondrial protein-coding and RNA annotation. If you are
interested in studying mitochondrial genomes, we highly recommend running dedicated
mitochondrial feature annotation with specialized software such as MFannot
(http://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl). Although
MFannot does not support local installation so far, you can run MFannot conveniently via
its web portal. Be sure to select the correct genetic code table (e.g. “3 Yeast Mitochondrial”
for annotating yeast mitochondrial genomes) for your analysis.
28) Annotate transposable elements (TEs) for the final genome assembly. This step can be
run with multiple threads using the following commands:
$ cd ./../09.TE_Annotation
$ bash LRSDAY.09.TE_Annotation.sh

29) (Optional) TE activity can be highly dynamic in the genome with many complex cases
such as fragmentation and nested insertion. In LRSDAY, we used REannotate35 to
automatically resolve these complex cases, which works well in most cases but can
occasionally misjoin two adjacent TEs when they are closely spaced. Further inspect and
curate the LRSDAY TE annotation output (SK1.TE.gff3 for the testing example) by
visualizing it in IGV34 together with the raw REannotate annotation (SK1.REannotate.gff
for the testing example). For each TE found in the LRSDAY TE annotation output, examine
its corresponding LTR and internal region structure based on the raw REannotate
annotation to check for misjoins. If needed, you can manually edit the corresponding
TE annotation output file to split the misjoined elements.
30) Annotate yeast telomere-associated core X elements for the final genome assembly, using
the following commands:
$ cd ./../10.Core_X_Element_Annotation
$ bash LRSDAY.10.Core_X_Element_Annotation.sh

31) (Optional) Inspect the alignment file for the annotated core X elements. In LRSDAY, we
label the identified core X elements as “partial” if they are shorter than 300 bp. This should
work in most cases, but we recommend inspecting the generated alignment file
(SK1.X_element.aln.fa for the testing example) for further curation. After curation,
manually adjust the “partial” labeling in the annotation file (SK1.X_element.gff3 for the
testing example) when needed.
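
To shortlist the elements worth checking in the alignment, feature lengths in the GFF3 file can be compared against the 300 bp cutoff. The sketch below assumes only the standard GFF3 layout (tab-separated, start in column 4, end in column 5); the function name report_short_features and the toy records are hypothetical, and the sketch does not reproduce how LRSDAY itself writes the “partial” label.

```shell
# Hypothetical helper (not an LRSDAY script): list each GFF3 feature read from
# stdin with its length and whether it falls below the cutoff given as $1.
report_short_features() {
  awk -F'\t' -v OFS='\t' -v cutoff="$1" '
    /^#/ || NF < 9 { next }          # skip comment lines and incomplete records
    {
      len = $5 - $4 + 1              # GFF3 coordinates are 1-based and inclusive
      status = (len < cutoff) ? "below_cutoff" : "at_or_above_cutoff"
      print $1, $4, $5, len, status
    }'
}

# Toy records (hypothetical feature names and coordinates).
printf 'chrI\tLRSDAY\tX_element\t101\t350\t.\t+\t.\tID=x1\nchrII\tLRSDAY\tX_element\t1\t500\t.\t-\t.\tID=x2\n' |
  report_short_features 300
```

The same function could be applied to the Y’ element annotation with the 3500 bp cutoff, for example `report_short_features 3500 < SK1.Y_prime_element.gff3`.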
32) Annotate yeast telomere-associated Y’ elements for the final genome assembly by using
the following commands:
$ cd ./../11.Y_Prime_Element_Annotation
$ bash LRSDAY.11.Y_Prime_Element_Annotation.sh

33) (Optional) Inspect the alignment file for the annotated Y’ elements. As with the core X
element annotation, we used a hard length cutoff (3500 bp) to label whether the identified
Y’ elements are “partial”. We recommend manually inspecting the generated alignment file
(SK1.Y_prime_element.aln.fa for the testing example) to check if the “partial” labeling is
needed and editing the annotation file (SK1.Y_prime_element.gff3 for the testing example)
accordingly.
34) Perform orthology identification for protein coding genes by using the following
commands:
$ cd ./../12.Gene_Orthology_Identification
$ bash LRSDAY.12.Gene_Orthology_Identification.sh

In this step, a gene orthology relationship list is created between the annotated proteome
and the SGD S. cerevisiae reference proteome based on both sequence similarity and
synteny conservation. Based on this list, we further attach SGD systematic names to our
gene annotation as shown in the “Name=” field of the generated GFF3 file
(SK1.updated.gff3 for the testing example). For a given annotated gene, when more than
one orthologous gene can be found in the SGD reference proteome, we label all of its
co-orthologs in the “Name=” field with “/” between the alternative SGD systematic names
(e.g. “YAR071W/YHR215W”), whereas when no orthologous gene can be found, we
label its gene name as “Name=NA”. This step can be run with multiple threads.
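
As a quick check of the naming outcome, the “Name=” values can be tallied from the GFF3 file. The sketch below is an illustration only: the function name summarize_ortholog_names is hypothetical, and it assumes that gene records carry the type “gene” in column 3 and a “Name=” entry in the semicolon-separated attribute column.

```shell
# Hypothetical helper (not an LRSDAY script): count gene records on stdin by the
# form of their Name= attribute (single name, "/"-joined co-orthologs, or NA).
summarize_ortholog_names() {
  awk -F'\t' '
    /^#/ || $3 != "gene" { next }    # assumption: gene records use type "gene"
    {
      name = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
      if (name == "NA") none++
      else if (index(name, "/") > 0) multi++
      else if (name != "") single++
    }
    END { printf "single=%d co-orthologs=%d none=%d\n", single, multi, none }'
}

# Toy records (hypothetical IDs and coordinates).
printf 'chrI\tLRSDAY\tgene\t1\t900\t.\t+\t.\tID=g1;Name=YAR071W/YHR215W\nchrI\tLRSDAY\tgene\t1000\t1900\t.\t+\t.\tID=g2;Name=NA\nchrI\tLRSDAY\tgene\t2000\t2900\t.\t-\t.\tID=g3;Name=YAL001C\n' |
  summarize_ortholog_names   # prints "single=1 co-orthologs=1 none=1"
```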

35) Integrate the annotation of different genomic features into a unified GFF3 file by using the
following commands:
$ cd ./../13.Annotation_Integration
$ bash LRSDAY.13.Annotation_Integration.sh

? TROUBLESHOOTING
Troubleshooting advice can be found in Table 2.
TABLE 2 | Troubleshooting table.
Problem: Downloading errors or unresponsive remote servers.
Possible reason: Unstable internet connection or temporary problems of the remote
servers where the tools for downloading are hosted.
Solution: Stabilize the internet connection and try again.

Problem: Compilation or installation error for specific tools.
Possible reason: Prerequisites not satisfied or corner cases due to your specific system
settings.
Solution: If missing prerequisites are found (see software and library requirements in
EQUIPMENT), install the missing prerequisites and try again as described above.
Otherwise, record the error message and email it to the developers of the problematic
tools or the authors of this protocol for further problem diagnosis.

Problem: Cannot find the file “env.sh”.
Possible reason: Step 1 has failed.
Solution: Check the error message that you got when running Step 1 and refer to the
troubleshooting for Step 1.

Problem: Rejected by RepBase for the license registration.
Possible reason: A personal or commercial email address (e.g. with a .com extension) was
used for the registration.
Solution: Use a non-commercial institutional email address for the registration or
purchase the commercial license if you work in a commercial entity. Also, the installation
of RepBase can potentially be skipped, although this is not recommended by RepeatMasker.
Users can still follow the remaining steps of this protocol even if they skip the RepBase
installation here.

Problem: “echo $rmblast_dir” returns nothing.
Possible reason: Step 2 has failed or your terminal session got interrupted before this
step.
Solution: Re-load environment settings in Step 2 and then re-try this step.

Step: 12
Problem: Downloading warnings/errors encountered.
Possible reason: Temporary SRA server problems.
Solution: Directly download the sample Illumina reads for the testing example by running
the following commands:
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR407/008/SRR4074258/SRR4074258_1.fastq.gz
$ ln -s SRR4074258_1.fastq.gz SRR4074258_pass_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR407/008/SRR4074258/SRR4074258_2.fastq.gz
$ ln -s SRR4074258_2.fastq.gz SRR4074258_pass_2.fastq.gz

Step: 13
Problem: The total assembly size is much smaller than the expected value or the assembly
is too fragmented.
Possible reason: Insufficient sequencing depth of coverage.
Solution: Obtain more reads. We recommend a minimal sequencing depth of 20X.

Step: 13
Problem: Assembly runs too slow.
Possible reason: This is likely to happen for Canu with high-depth nanopore reads.
Solution: Three options are available:
1) Use the pre-shipped Perl script subsampling_sequences.pl in the
$LRSDAY_HOME/scripts directory to downsample the reads (either randomly or,
preferably, by sampling the longer ones). Example:
perl subsampling_sequences.pl -i input.fq(.gz) -f fastq -s 0.1 -m random -p output # randomly sample 10% of the sequences
or
perl subsampling_sequences.pl -i input.fq(.gz) -f fastq -s 0.1 -m longest -p output # sample the 10% longest sequences
2) Use the customized parameters as suggested in the bash script
LRSDAY.01.Long-read-based_Genome_Assembly.sh
3) Use alternative assemblers such as Flye or smartdenovo for this step.

Step: 19
Problem: Only partial assembly is obtained for the mitochondrial genome.
Possible reason: High AT and repeat content of the mitochondrial genome is challenging
for de novo genome assembly.
Solution: If the sequencing was done with PacBio, obtaining more reads should solve the
problem. If the sequencing was done with Oxford Nanopore, currently there is no
satisfactory solution. Future improvement of the Oxford Nanopore sequencing technology
will likely solve this problem given the rapid development of this technology.
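
The 20X minimal depth recommended in the troubleshooting advice for Step 13 can be checked before assembly with a rough estimate: total read bases divided by the estimated genome size. The sketch below is not part of LRSDAY; the function name estimate_depth is hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record on standard input.

```shell
# Hypothetical helper (not an LRSDAY script): estimate sequencing depth as
# total read bases / genome size. $1 is the estimated genome size in bp
# (must be non-zero); FASTQ records are read from stdin.
estimate_depth() {
  awk -v gsize="$1" '
    NR % 4 == 2 { bases += length($0) }   # the sequence is line 2 of each record
    END { printf "%.1f\n", bases / gsize }'
}

# Toy input: two 10-bp reads against a 4-bp "genome".
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' |
  estimate_depth 4   # prints "5.0"
```

For gzip-compressed reads, the input could be supplied as `gzip -cd reads.fastq.gz | estimate_depth 12500000`, where reads.fastq.gz is a placeholder file name and 12500000 is an assumed approximate genome size in line with the total sequence lengths in Table 1.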

● TIMING
The following timing information was measured on a Linux computing server with an Intel Xeon
CPU E5-2630L v3 (1.80 GHz) using 4 threads. Enabling multithreading can substantially decrease
the processing time. No optional steps other than Steps 12 and 15 were run when
measuring the computation time.
Steps 1-7, set up LRSDAY: < 1 h.
Step 8, prepare the project directory for the testing example: 5 s.
Step 9, prepare the reference genomes for the testing example: 1 s.
Step 10, prepare the long reads for the testing example: 22 min.
Step 12 (optional), prepare the Illumina reads for the testing example: 20 min.
Step 13, de novo assembly using long-reads: 26 h.
Step 15 (optional), assembly polishing using Illumina reads: 1 h.
Step 16, reference-guided scaffolding for the raw assembly: 12 min.
Step 18, centromere identity profiling for the scaffolded assembly: 1 s.
Step 19, mitochondrial genome assembly improvement: 4 min.
Step 20, generate the assembly modification list file: 1 s.
Step 21, manually edit the assembly modification list file: 30 s.
Step 22, finalize genome assembly based on the assembly modification list file: 20 s.
Step 23, centromere annotation for the final assembly: 1 s.
Step 25, protein-coding gene and tRNA annotation for the final assembly: 19 h.
Step 28, TE annotation for the final assembly: 5 min.
Step 30, core X element annotation for the final assembly: 2 min.
Step 32, Y’ element annotation for the final assembly: 2.5 min.
Step 34, gene orthology identification for the annotated protein-coding genes: 20 min.
Step 35, final annotation integration: 1 min.

ANTICIPATED RESULTS
Upon the completion of the LRSDAY workflow described above, users can expect to obtain a
chromosome-level genome assembly with comprehensive genomic feature annotation, which will
lay a solid foundation for all kinds of downstream genomic and functional analyses. As
demonstrated in Fig. 3b, the final genome assembly is highly continuous, with each chromosome
assembled in an end-to-end fashion. Both genome-wide dotplots (like those shown in Fig. 3) and
summary statistics (like those listed in Table 1) will be generated to help users to evaluate the
genome assembly quality both graphically and quantitatively. As for the annotation, LRSDAY
profiles a full spectrum of genomic features for the assembled yeast genome, which include
centromeres, protein-coding genes, tRNAs, Ty1-Ty5 transposable elements, as well as the
telomere-associated core X and Y’ elements. The availability of such rich information can be very
valuable for users working on diverse biological questions. In Table 3, we further summarized the
major outputs of the testing example. The final genome assembly and annotation outputs
generated with this testing example are also provided in the directory Example_Outputs for users
to make direct comparison with their own results.
TABLE 3 | Major LRSDAY outputs from the testing example.
Task
Step
Output file or directory
Explanation for the output
01

SK1.canu.fa
SK1.canu.stats.txt

SK1.canu.filter.pdf
SK1.canu.filter.mummer2vcf.SNP
.vcf

The long-read-based de novo genome
assembly containing all the contigs
assembled by Canu17.
The summary table reporting basic
assembly statistics, such as the number
of the assembled sequences, the total
length of the assembled sequences, the
minimal, maximal, mean and median
lengths of the assembled sequences, the
N50, L50, N90, and L90 of the assembled
sequences, as well as the base
composition (A%, T%, G%, C%, AT%,
GC% and N%) of the assembled
sequences. Users can compare this file
with the similar file generated in Step 16
and Step 22. We also summarized such
comparison in Table 1.
The genome-wide dotplot for the
comparison between the raw genome
assembly and the reference genome.
The VCF file showing SNP differences
between the raw genome assembly and
the reference genome. This file can be
used for assessing the raw assembly
quality.

SK1.canu.filter.mummer2vcf.IND
EL.vcf

SK1_canu_out
02

SK1.pilon.fa
SK1.pilon.vcf

SK1.pilon.changes

SK1.realn.bam
03

SK1.ragout.fa
SK1.ragout.stats.txt

SK1.ragout.agp
SK1_ragout_out
SK1.ragout.filter.pdf
SK1.mt_contig.list
SK1.mt_contig.fa
SK1.ragout.chrMT.filter.pdf

SK1.centromere.gff3

SK1.mt_improved.fa

SK1.mt_improved.chrMT.filter.pdf

The VCF file showing INDEL differences
between the raw genome assembly and
the reference genome. This file can be
used for assessing the raw assembly
quality.
the directory containing all the output files
of Canu
The polished genome assembly
generated by Pilon36.
The VCF file reporting the variants
identified by Pilon based on short reads
mapping against the input genome
assembly.
The space-delimited record of all the
changes that Pilon made during the
assembly polishing. The four columns
are: the original sequence coordinate, the
new sequence coordinate after the
correction, the original base, the new
base after the correction.
The BAM file of short reads mapping
against the input genome assembly.
The scaffolded genome assembly based
on the reference genome.
The summary table reporting basic
assembly statistics of the scaffolded
genome assembly. Users can compare
this file with the similar file generated in
Step 13 and Step 22. We also
summarized such comparison in Table 1.
the AGP file reporting the order and
orientation of each input contig used
during scaffolding.
The directory containing all the output
files of Ragout33.
The genome-wide dotplot for the
comparison between the scaffolded
assembly and the reference genome.
The list of assembled contigs
corresponding to the mitochondrial
genome. This file will be used for Step 19.
The assembled contig sequences
corresponding to the mitochondrial
genome.
The dotplot for the comparison between
the scaffolded mitochondrial genome
assembly and the reference mitochondrial
genome.
The profiled centromere identities for the
scaffolded genome assembly.
The improved genome assembly with
better processing (re-assembling and
circularization) of the mitochondrial
genome.
The dotplot for the comparison between
the improved mitochondrial genome

20-22

SK1.modification.list
SK1.final.fa
SK1.final.filter.pdf
SK1.final.stats.txt

SK1.final.filter.mummer2vcf.SNP.
vcf

SK1.final.filter.mummer2vcf.INDE
L.vcf

SK1.centromere.gff3

SK1.maker.raw.gff3
SK1.EVM.gff3
SK1.protein_evidence.gff3

SK1.est_evidence.gff3

SK1.EVM.trimmed_cds.fa
SK1.EVM.trimmed_cds.log
SK1.EVM.pep.fa
SK1.EVM.manual_check.list
SK1.EVM.PoFF.gff

assembly and the reference mitochondrial
genome. You should see improved
collinearity in this plot than the similar plot
generated in Step 16.
The assembly modification list file for
manual editing to guide the final genome
assembly.
The final genome assembly generated by
LRSDAY.
The genome-wide dotplot for the
comparison between the final genome
assembly and the reference genome.
The summary table reporting basic
assembly statistics of the final genome
assembly. Users can compare this file
with the similar file generated in Step 13
and Step 16. We also summarized such
comparison in Table 1.
The VCF file showing SNP differences
between the final genome assembly and
the reference genome. This file can be
used for assessing the final assembly
quality.
The VCF file showing INDEL differences
between the final genome assembly and
the reference genome. This file can be
used for assessing the final assembly
quality.
The centromere annotation for the final
genome assembly.
The raw MAKER37 annotation for proteincoding genes and tRNA genes.
The final EVM38 annotation for proteincoding genes and tRNA genes with
systematically assigned gene IDs.
The protein-to-genome alignment
evidences generated by MAKER. This file
can be used for manual curation of those
suspicious annotations.
The EST-to-genome alignment evidences
generated by MAKER. This file can be
used for manual curation of those
suspicious annotations.
The CDSs of the final EVM protein-coding
annotation with the out-of-frame parts
trimmed.
The log file of the CDS trimming for the
final EVM protein-coding gene annotation.
The translated protein sequences of the
trimmed CDSs derived from the final EVM
protein-coding gene annotation.
The list of suspicious gene annotations for
manual curation.
The gene synteny information derived
from SK1.EVM.gff3, which will be used for
Task 12 (Step 34).

SK1.EVM.PoFF.ffn: Same as SK1.EVM.trimmed_cds.fa but with simpler sequence IDs, which could be used for Task 12 (Step 34).
SK1.EVM.PoFF.ffa: Same as SK1.EVM.pep.fa but with simpler sequence IDs, which will be used for Task 12 (Step 34).

Task 09
SK1.REannotate.gff: The raw TE annotation from REannotate. This file can be used for further curating TE annotation.
SK1.TE.gff3: The final TE annotation from LRSDAY.

Task 10
SK1.X_element.gff3: The final core X element annotation from LRSDAY.
SK1.X_element.fa: The sequences of all the annotated core X elements.
SK1.X_element.aln.fa: The sequence alignment of all the annotated core X elements for further checking whether the annotated feature is complete or partial as well as whether this is consistent with the labeling in the annotation file SK1.X_element.gff3.

Task 11
SK1.Y_prime_element.gff3: The final Y’ element annotation from LRSDAY.
SK1.Y_prime_element.fa: The sequences of all the annotated Y’ elements.
SK1.Y_prime_element.aln.fa: The sequence alignment of all the annotated Y’ elements for further checking whether the annotated feature is complete or partial as well as whether this is consistent with the labeling in the annotation file SK1.Y_prime_element.gff3.

Task 12
SK1.proteinortho: The gene orthology mapping between the annotated genes and the reference gene sets based only on sequence similarity.
SK1.poff: The gene orthology mapping between the annotated genes and the reference gene sets based on both sequence similarity and synteny conservation.
SK1.updated.gff3: The updated gene annotation with reference-based gene name labeling.

Task 13
SK1.final.gff3: The final integrated annotation from LRSDAY.
SK1.final.trimmed_cds.fa: The CDSs of the final protein-coding gene annotation with the out-of-frame parts trimmed.
SK1.final.trimmed_cds.log: The log file of the CDS trimming for the final protein-coding gene annotation.
SK1.final.pep.fa: The translated protein sequences of the trimmed CDSs derived from the final protein-coding gene annotation.
SK1.final.manual_check.list: The list of suspicious gene annotations for manual curation.
SK1.final.fa: A copy of the final genome assembly generated in Task 06 (Step 22).
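
As a quick sanity check after the final annotation step, the expected final output files can be checked with a short shell function. This is only a sketch: the function name list_missing_outputs is hypothetical (it is not part of LRSDAY), and the output directory is an assumption that you should replace with the directory where your own run wrote its final files.

```shell
# list_missing_outputs DIR PREFIX
# Print every expected final output file that does not exist in DIR.
# (Hypothetical helper, not shipped with LRSDAY.)
list_missing_outputs() {
  local dir=$1 prefix=$2 suffix
  for suffix in final.gff3 final.trimmed_cds.fa final.trimmed_cds.log \
                final.pep.fa final.manual_check.list final.fa; do
    # -e rather than -s: the log and the manual-check list may be empty.
    [ -e "$dir/$prefix.$suffix" ] || echo "$dir/$prefix.$suffix"
  done
}

# Example: check the current directory for the SK1 outputs.
list_missing_outputs "." "SK1"
```

No output means that all six files are present; otherwise each missing file is printed on its own line.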

REFERENCES
1. Goffeau, A. et al. Life with 6000 Genes. Science 274, 546–567 (1996).
2. Gordon, D. et al. Long-read sequence assembly of the gorilla genome. Science 352, aae0344 (2016).
3. VanBuren, R. et al. Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature 527, 508–511 (2015).
4. Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650 (2017).
5. Badouin, H. et al. The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546, 148–152 (2017).
6. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
7. Yue, J.-X. et al. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nat. Genet. 49, 913–924 (2017).
8. Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 25, 1750–1756 (2015).
9. McIlwain, S. J. et al. Genome sequence and analysis of a stress-tolerant, wild-derived strain of Saccharomyces cerevisiae used in biofuels research. G3 (Bethesda). 6, 1757–1766 (2016).

10. Istace, B. et al. De novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer. Gigascience 6, 1–13 (2017).
11. Giordano, F. et al. De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Sci. Rep. 7, 3935 (2017).
12. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
13. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
14. Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).
15. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
16. Li, H. Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
17. Koren, S. et al. Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

18. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
19. Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: A web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
20. Besemer, J. & Borodovsky, M. GeneMark: Web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, (2005).
21. Proux-Wéra, E., Armisén, D., Byrne, K. P. & Wolfe, K. H. A pipeline for automated annotation of yeast genome sequences by a conserved-synteny approach. BMC Bioinformatics 13, 237 (2012).
22. Louis, E. J. & Haber, J. E. The structure and evolution of subtelomeric Y’ repeats in Saccharomyces cerevisiae. Genetics 131, 559–574 (1992).
23. Strope, P. K. et al. The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen. Genome Res. 25, 762–774 (2015).
24. Almeida, P. et al. A population genomics insight into the Mediterranean origins of wine yeast domestication. Mol. Ecol. 24, 5412–5427 (2015).
25. Gallone, B. et al. Domestication and divergence of Saccharomyces cerevisiae beer yeasts. Cell 166, 1397–1410.e16 (2016).
26. Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018).
27. Bergström, A. et al. A high-definition view of functional genetic variation from natural yeast genomes. Mol. Biol. Evol. 31, 872–888 (2014).
28. Drillon, G., Carbone, A. & Fischer, G. SynChro: A fast and easy tool to reconstruct and visualize synteny blocks along eukaryotic chromosomes. PLoS One 9, (2014).
29. Wang, Y. et al. MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, (2012).
30. McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
31. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013-2015. http://www.repeatmasker.org (2013).
32. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. Assembly of Long Error-Prone Reads Using Repeat Graphs. bioRxiv 247148 (2018). doi:10.1101/247148
33. Kolmogorov, M., Raney, B., Paten, B. & Pham, S. Ragout - A reference-assisted assembly tool for bacterial genomes. Bioinformatics 30, (2014).
34. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
35. Pereira, V. Automated paleontology of repetitive DNA with REANNOTATE. BMC Genomics 9, 614 (2008).
36. Walker, B. J. et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, (2014).
37. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
38. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).
39. Lowe, T. M. & Eddy, S. R. A computational screen for methylation guide snoRNAs in yeast. Science 283, 1168–1171 (1999).
40. Minkin, I., Patel, A., Kolmogorov, M., Vyahhi, N. & Pham, S. Sibelia: A scalable and comprehensive synteny block generation tool for closely related microbial genomes. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8126 LNBI, 215–229 (2013).

APPENDIX
Appendix 1: Pre-shipped supporting data for LRSDAY
LRSDAY pre-ships the following supporting datasets, which it needs for automatic execution. Unless labeled otherwise, these datasets were either described or generated in our previous study (ref. 7).
1) ATP6.cds.fa # the coding sequence (CDS) of the S. cerevisiae S288C ATP6 gene.
2) fuzzy_defragmentation.txt # the fuzzy defragmentation file for REannotate (ref. 35).
3) Proteome_DB_for_annotation.CDhit_I95.fa # our curated proteome dataset for S.
cerevisiae and other closely related yeast sensu stricto species.
4) query.Y_prime_element.long.fa # the sequence of a representative S. cerevisiae Y’
element.
5) S288C.ASM205763v1.fa.gz # the S. cerevisiae S288C genome assembly.
6) S288C.ASM205763v1.noncore_masked.fa.gz # the S. cerevisiae S288C genome
assembly with subtelomeres and chromosome-ends hard-masked.
7) S288C.centromere.fa # the centromere sequence of S. cerevisiae S288C.
8) S288C.gene.hmm # the hidden Markov model (HMM) for de novo gene annotation based
on S. cerevisiae S288C.
9) S288C.X_element.hmm # the hidden Markov model (HMM) for the core X element
annotation based on S. cerevisiae S288C.
10) Sc-meth.sites # the S. cerevisiae methylation sites (shipped with snoScan; ref. 39).
11) Sc-rRNA.fa # the S. cerevisiae rRNA sequences (shipped with snoScan; ref. 39).
12) SGDref.PoFF.faa # the proteinortho proteome file generated for the SGD reference
genome.
13) SGDref.PoFF.ffn # the proteinortho CDS file generated for the SGD reference genome.
14) SGDref.PoFF.gff # the proteinortho gene order gff file generated for the SGD reference
genome.
15) te_proteins.fasta # protein sequences for genes encoded within TEs (shipped with
MAKER; ref. 37).
16) TY2_specific_region.fa # the sequence of a Ty2-specific region for differentiating Ty1
and Ty2.
17) TY_lib.Yue_et_al_2017_NG.fa # a custom RepeatMasker library for Ty annotation in S.
cerevisiae and S. paradoxus.

18) TY_lib.Yue_et_al_2017_NG.LTRonly.fa # representative Ty LTR sequences of S.
cerevisiae and S. paradoxus.

Appendix 2: Tips for adapting LRSDAY to other eukaryotic organisms
The backbone modules of LRSDAY can be easily adapted for other eukaryotic organisms. Here
are some tips with regard to this:
1) For Task 01 (long-read-based genome assembly; Step 13), be sure to adjust the genome
size parameter in the bash script LRSDAY.01.Long-read-based_Genome_Assembly.sh
based on the estimated genome size of the organism that you sequenced.
2) For Task 03 (reference-guided assembly scaffolding; Step 16), be sure to modify the bash
script LRSDAY.03.Reference-guided_Assembly_Scaffolding.sh to provide the reference
genome file of your sequenced organism to guide the scaffolding and chromosome
assignment. It is very likely that the chromosomal cores and subtelomeres of your
reference genome have not been clearly defined. In such a case, you can provide the same
reference genome file for both the “ref_genome_raw=” and
“ref_genome_noncore_masked=” parameters. The “chrMT_tag=” and “gap_size=”
parameters should also be adjusted for your own project.
3) By default, Task 03 (reference-guided assembly scaffolding; Step 16) performs
reference-guided scaffolding using the whole genome alignment constructed by Sibelia
(ref. 40). Sibelia is designed for processing small genomes (<100 Mb). For organisms with
large genomes, you should install Progressive Cactus
(https://github.com/glennhickey/progressiveCactus) and use it to build the whole genome
alignment in HAL format and feed it directly to Ragout (ref. 33). Please refer to Ragout’s
user manual (http://fenderglass.github.io/Ragout/usage.html) for this advanced usage. We
have pre-shipped the HAL tools (https://github.com/ComparativeGenomicsToolkit/hal) to
enable Ragout to process the HAL file generated by Progressive Cactus.
4) For Task 05 (mitochondrial genome assembly improvement; Step 19), be sure to modify
the “gene_start=”, “ref_genome_raw=”, and “chrMT_tag=” parameters in the bash script
LRSDAY.05.Mitochondrial_Genome_Assembly_Improvement.sh based on your own
project.
5) While many of the genomic feature annotation tasks are specific to S. cerevisiae and
S. paradoxus, Task 08 (protein-coding genes and tRNA annotation; Step 25) can be adapted
for any eukaryotic organism. In general, you will need to edit lines 16, 22, 34, 36, 44,
45, and 68-71 in the $LRSDAY_HOME/misc/maker_opts.customized.ctl file to feed
organism-specific parameters into MAKER. Also, please refer to MAKER’s own Wiki page
(http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page) and protocols
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/) for technical details and
advanced usage. Similarly, you can learn more about EVM from its website
(http://evidencemodeler.github.io/).
6) While Task 09 (TE annotation; Step 28) is heavily tuned for yeasts, the same tools that
we used here (RepeatMasker, ref. 31, and REannotate, ref. 35) can be used for any
eukaryotic organism. We recommend reading their respective manuals (distributed with the
corresponding software) before adapting these tools to your own study.
7) For Task 12 (gene orthology identification; Step 34), you need to edit the “ref_PoFF_faa=”
and “ref_PoFF_gff=” parameters based on the reference gene annotation that you used.
Please check ProteinOrtho’s manual
(https://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html) for more details on
the required file formats. The pre-shipped Perl scripts prepare_PoFFfaa_simple.pl and
prepare_PoFFgff_simple.pl in the $LRSDAY_HOME/scripts directory should help with this.
You can run these two scripts as follows:
$ source ./../../env.sh
$ perl $LRSDAY_HOME/scripts/prepare_PoFFfaa_simple.pl -i prefix.pep.fa -o prefix.PoFF.faa
$ perl $LRSDAY_HOME/scripts/prepare_PoFFgff_simple.pl -i prefix.raw.gff -o prefix.PoFF.gff
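
Several of the tips above come down to changing a “name=value” parameter line in one of the LRSDAY bash scripts (for example “ref_genome_raw=”, “ref_genome_noncore_masked=”, “chrMT_tag=”, or “gap_size=”). Editing the script in a text editor is the simplest way to do this. For scripted setups, the following is a minimal sketch of the same edit. The helper name set_param and the file name my_reference.fa are hypothetical, and the sketch assumes that each parameter starts at the beginning of a line. It rewrites the whole line, so any trailing comment on that line is dropped, and the new value must not contain “|” or “&”.

```shell
# set_param FILE KEY VALUE
# Replace the line that starts with KEY= by KEY="VALUE" in FILE.
# (Hypothetical helper, not shipped with LRSDAY.)
set_param() {
  local file=$1 key=$2 value=$3
  sed "s|^${key}=.*|${key}=\"${value}\"|" "$file" > "${file}.tmp"
  mv "${file}.tmp" "$file"
}

# Demonstration on a throwaway file that mimics three parameter lines.
demo=$(mktemp)
printf '%s\n' \
  'ref_genome_raw="old.fa"' \
  'ref_genome_noncore_masked="old.masked.fa"' \
  'gap_size=5000' > "$demo"

# Point both reference parameters at the same genome file (tip 2).
set_param "$demo" ref_genome_raw "./my_reference.fa"
set_param "$demo" ref_genome_noncore_masked "./my_reference.fa"
cat "$demo"
```

To apply this to a real task script, work on a copy first and compare it with the original (for example with diff) before running the task.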
