LTR_retriever User Manual

Shujun Ou and Ning Jiang
oushujun@msu.edu & jiangn@msu.edu
Department of Horticulture, Michigan State University, East Lansing, MI, 48824, USA

In any research documents using LTR_retriever please cite the following paper:
Shujun Ou and Ning Jiang (2017) LTR_retriever: a highly accurate and sensitive program for identification of LTR retrotransposons (in preparation)

LTR_retriever is licensed under GNU GPLv3.

For questions and issues, please see: https://github.com/oushujun/LTR_retriever

Sept. 21, 2017

1  Introduction

LTR_retriever is a command line program (in Perl) for accurate identification of LTR retrotransposons (LTR-RTs) from the outputs of LTRharvest (1), LTR_FINDER (2), and MGEScan-LTR (3, 4), and for generation of a non-redundant LTR-RT library for genome annotation. As one of the most prevalent classes of transposable elements (TEs), LTR-RTs comprise the largest portion of most plant genomes (5). Because LTR-RT sequences are highly diverse, identifying these elements by sequence homology is inefficient. Their element structure, however, is conserved across species, and several programs have been developed to search for LTR-RTs using the relevant structural characteristics. These programs are very sensitive, but they are not very accurate or specific in LTR-RT identification. LTR_retriever was developed to address the need for accuracy and specificity, and it adds several new functions to facilitate genome annotation and other downstream studies.

LTR_retriever aims to identify high-quality LTR-RT exemplars (Figure 1A) that are intact and non-redundant from a variety of LTR-RT candidates. To retain sensitivity, sequences of nested LTRs and truncated LTRs (Figure 1CD) that are not represented by intact LTR-RTs are also included in the exemplars. This package excludes the vast majority of non-LTR false positives. The most common false positives arise from two adjacent repeats, such as SINEs, LINEs, DNA TEs, or solo-LTRs derived from different elements (Figure 3). In addition, LTR_retriever excludes non-LTR open reading frames derived from LINEs, DNA TEs, or plant coding sequences, which reduces misannotation of non-LTR coding sequences as LTR elements. LTR_retriever also identifies and removes nested LTR-RT insertions in the identified intact LTR-RTs, which further reduces library redundancy. The program can accurately identify rare non-canonical LTR-RTs whose terminal motifs differ from the canonical 5'-TG..CA-3' motif. It was built from a variety of Perl scripts that can also be used for downstream analyses.

1.1  Main features of LTR_retriever

- A command line Perl program;
- Supports multi-threading;
- Identifies intact LTR-RTs with accurate boundaries;
- Identifies rare LTR-RTs with non-canonical (non-'TGCA') motifs;
- Supports multiple inputs: LTRharvest, LTR_FINDER, and/or MGEScan-LTR;
- Sequence input: FASTA format (contigs, scaffolds, genomes, corrected PacBio reads, etc.);
- Output: a non-redundant LTR-RT library (FASTA), GFF3 for all intact LTR-RTs, whole-genome LTR-RT annotation (GFF), and a comprehensive table.

2  The Structure and Characteristics of LTR-RTs

The structure of an LTR retrotransposon (LTR-RT) is characterized by long terminal repeats (LTRs) ranging from 75 bp to 5000 bp (Figure 1A). The region between the 5' LTR and the 3' LTR is termed the internal region, which encodes the proteins for transposition. At the very termini of the LTRs are dinucleotide motifs, which are 5'-TG..CA-3' in most cases. However, various other motifs were detected in the sacred lotus (Nelumbo nucifera) genome and in the rice (Oryza sativa) genome during our manual annotation, and have also been found in other studies (e.g., Tos17 (6); AtRE1 (7); and TARE1 (8)). Flanking the terminal motifs is the target site duplication (TSD), which is generated by staggered cuts from integrase activity (Figure 2) during LTR-RT insertion.
TSDs are typically 5 bp in plants but can range from 3 to 6 bp, and the 5' and 3' TSDs should be identical because of the mechanism of their formation (Figure 2).

A recently inserted LTR-RT has highly similar LTRs that are recognizable by sequence alignment, which is the primary search scheme of LTR search programs (1, 2, 4, 9). However, if two highly similar repetitive elements other than LTRs (e.g., DNA TEs, LINEs, SINEs, solo-LTRs, or tandem repeats) are located close to each other (Figure 3), search tools may mistake them for a pair of LTRs and report them as an LTR-RT candidate. These are the most frequent false positives in de novo searches for LTR-RTs. Because LTR-RTs follow a "copy-and-paste" duplication scheme, the regions flanking a newly inserted LTR-RT are unlikely to be identical to the termini of its internal region. For example, in an intact LTR-RT (Figure 1A), region “a” is not identical to region “c”, and region “b” is not identical to region “d”. Thus, by aligning the flanking regions of the two LTR fragments (Figure 3), LTR_retriever can obtain the boundary information for the candidate.
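
The structural criteria described in this section (terminal dinucleotide motifs, identical TSDs of 3 to 6 bp, and flanking regions that differ between the two LTRs) can be illustrated with a minimal Python sketch. This is not LTR_retriever's implementation, which is written in Perl and relies on sequence alignment. The function names, the 0-based half-open coordinates, the 8-bp flank width, the ungapped identity measure, and the toy sequence are all assumptions made for illustration. The flank naming assumes that "a" and "b" lie immediately outside the 5' LTR and "c" and "d" immediately outside the 3' LTR.

```python
def terminal_motif(seq, start, end):
    """Dinucleotides at the two outer termini of a candidate, e.g. ('TG', 'CA')."""
    return seq[start:start + 2].upper(), seq[end - 2:end].upper()


def find_tsd(seq, start, end, min_len=3, max_len=6):
    """Longest identical pair of sequences directly flanking the candidate, or None."""
    for k in range(max_len, min_len - 1, -1):
        if start - k < 0 or end + k > len(seq):
            continue  # not enough flanking sequence for this length
        left, right = seq[start - k:start].upper(), seq[end:end + k].upper()
        if left == right:
            return left
    return None


def identity(x, y):
    """Ungapped fraction of matching positions (a stand-in for a real alignment)."""
    return sum(p == q for p, q in zip(x.upper(), y.upper())) / len(x)


def flank_identities(seq, ltr5, ltr3, width=8):
    """Compare flank a with c (upstream of each LTR) and b with d (downstream)."""
    (s5, e5), (s3, e3) = ltr5, ltr3
    if s5 - width < 0 or e3 + width > len(seq):
        raise ValueError("not enough flanking sequence")
    a, b = seq[s5 - width:s5], seq[e5:e5 + width]
    c, d = seq[s3 - width:s3], seq[e3:e3 + width]
    return identity(a, c), identity(b, d)


# Hypothetical toy element: flank + TSD + LTR + internal region + LTR + TSD + flank.
ltr = "TGACGTTAGCATCGGATTCA"
internal = "CCAGTACGCATGCAATGCAGGCAGGAGGAA"
tsd = "GATCC"
element = "AAACCCGGGTTT" + tsd + ltr + internal + ltr + tsd + "TTTGGGCCCAAA"

s5 = 12 + len(tsd)        # start of the 5' LTR
e5 = s5 + len(ltr)        # end of the 5' LTR
s3 = e5 + len(internal)   # start of the 3' LTR
e3 = s3 + len(ltr)        # end of the 3' LTR

print(terminal_motif(element, s5, e3))   # ('TG', 'CA')
print(find_tsd(element, s5, e3))         # GATCC
print(flank_identities(element, (s5, e5), (s3, e3)))
```

In this sketch, low a/c and b/d identities are consistent with a genuine insertion, whereas high identities suggest that the similarity extends beyond the predicted LTRs, as expected when two adjacent copies of another repeat are mistaken for an LTR pair. Real flanks can contain insertions and deletions, so a gapped alignment would be needed in place of the position-by-position comparison used here.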
