FasParser Manual User Guide

User Manual: Pdf

Open the PDF directly: View PDF PDF.
Page Count: 18

DownloadFasParser Manual User Guide
Open PDF In BrowserView PDF
FasParser Manual
[For version 2.1.1]

Sun Yan-Bo (edit at 2017-12-25)
Kunming Institute of Zoology, Chinese Academy of Science

Contents
Introduction .................................................................................................................................................. 2
Background ............................................................................................................................................. 2
Citations ................................................................................................................................................... 2
Installation ............................................................................................................................................... 3
Package Usage ............................................................................................................................................ 4
[Align]........................................................................................................................................................ 4
[Trim] ......................................................................................................................................................... 4
[Sim]........................................................................................................................................................... 5
[Format] .................................................................................................................................................... 6
[Merge] ..................................................................................................................................................... 7
[Sort] .......................................................................................................................................................... 7
Sort Sequences .................................................................................................................................. 7
Rename sequences ........................................................................................................................... 8
Classify sequences ............................................................................................................................. 9
[Filter] ...................................................................................................................................................... 10
Extraction & Filtration based on IDs ........................................................................................... 10
Extract 3-site columns for Codon sequences ......................................................................... 10
Filtration of columns based on Gap frequency....................................................................... 11
Filtration of sequences based on sequence length ............................................................... 12
Filtration based on similarity with a reference sequence ..................................................... 12
[DNA2AA] .............................................................................................................................................. 13
[ORF] ....................................................................................................................................................... 13
[2Seq] ...................................................................................................................................................... 14
[2Align] ................................................................................................................................................... 15
[Primer] ................................................................................................................................................... 15
[PAML] .................................................................................................................................................... 16
[GeneDrawer] ....................................................................................................................................... 17
Contact ........................................................................................................................................................ 17

Introduction
Background
With the development of sequencing technology in recent times, a great number
of molecular sequences (DNA and RNA) have been generated. Molecular analyses
based on these sequences have become one of the most important measures for
assessing their potential biological significance. The increase in the amount of
available sequence data has made its manipulation tricky, especially for those
without programming experience. Hence, it has now become necessary to develop
one or more user-friendly software to perform such analyses in a batch mode.
Common manipulations may include extracting/removing subsequences or
subalignments, estimating the sequence similarity and identifying the
non-homologous sequence(s), converting between file formats, and/or alignment
trimming to exclude poorly aligned regions from a multiple sequence alignment.
While, with the increase in the amount of available sequence data, it made
sequence manipulation tricky, especially for those without programming
experience.
To achieve achieve the objective of simplifying many common tasks in
sequence manipulation for biologists, we provide a new program package named
‘FasParser’ for manipulating sequence files. It is designed with a user-friendly GUI
and batch processing modes, which allows users to handle multiple sequence files
in a simple way. Presently, the main functions of FasParser2 involve: (1) batch
performing alignment construction with Muscle (Edgar, 2004), Mafft (Katoh, et al.,
2002), or Prank (Loytynoja and Goldman, 2005), (2) alignment trimming on
poorly-aligned regions, (3) alignment format conversion between FASTA and
either PHYLIP, PAML, or NEXUS, (4) sorting & classifying sequences according to
accession numbers or sequence length, which can also rename sequences, (5)
concatenating & merging sequences for a particular set of samples from multiple
sequence files, (6) translation and ORF prediction, (7) extracting and filtering
sequences according to ID, sequence length, or sequence similarity, (8) detecting
positive selection with CodeML (Yang, 2007), (9) designing PCR primers with
Primer3 (Rozen and Skaletsky, 2000), and (10) extracting consensus-aligned
regions between different aligner results (for a same gene). Moreover, there has
been designed a specialized Editor for FasParser (v2.0+) to easily viewing and
editing sequences.

Citations
If you use this package in your work, please use these references:
1. Yan-Bo Sun. FasParser: a package for manipulating sequence data, Zoological

Research 38(2): 110-112, 2017
2. Yan-Bo Sun. FasParser2: A Graphical Platform for Batch Manipulating Biological
Sequence unpublished.
Below is a non-exhaustive list of publications related to the programs integrated:
1. Edgar RC. 2004. MUSCLE: a multiple sequence alignment method with reduced time
and space complexity. BMC Bioinformatics, 5: 113.
2. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30:
3059-3066.
3. Loytynoja A & Goldman N. 2005. An algorithm for progressive multiple alignment of
sequences with insertions. Proceedings of the National Academy of Sciences of the
United States of America, 102: 10557-10562.
4. Nei M & Gojobori T. 1986. Simple methods for estimating the numbers of
synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and
Evolution, 3: 418-426.
5. Untergasser A, et al. (2012) Primer3--new capabilities and interfaces. Nucleic Acids
Res 40(15):e115.
6. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Molecular
Biology and Evolution 24(8):1586-1591.

Installation
The ‘FasParser’ has been developed into a standalone Windows System Application
(compiled and tested on Windows 7/10). It can run on most Windows systems with
no installation of other programs.
Download the newest setup program (i.e. ‘FasParserX.X_setup.exe’) from
https://github.com/Sun-Yanbo/FasParser to your disk, and then double click it to
install the whole package. Normally, it is well using the default installation
parameters (by clicking the Next button to end). After successful installation, you
would get a screen of the Home page of this package (Figure 1).

Figure 1. The Home page of the FasParser package.

Package Usage
[Align]
This program is designed to construct alignments for multiple FASTA files. There
are 3 aligners (MUSCLE, MAFFT, and PRANK, see below reference) have been
integrated into FasParser. The first two aligners are always faster than Prank, but
Prank could generate more accurate results. In addition, the first two aligners can
auto recognize the input type (nuclear DNA or protein sequence), while, the Prank
should be told the sequence type; user should select the corresponding aligner by
yourself (including Prank-DNA, Prank-Codon, and Prank-AA, Figure 2). The
running process information can be viewed in the bottom text box. User should use
‘Scan’ button to select the Fasta files to analyze. The output files will be saved in the
folder which contains the input files, with ‘*. .fas’

Figure 2. Overview of alignment construction

[Trim]
Alignment quality is of great importance for downstream analyses, like
phylogenetic inference and positive selection detection. The low-quality alignment
always produces false positive results, like wrong phylogenetic positions of species.
So, it is much better to strip the poorly-aligned regions in the raw alignments
before conducting downstream evolutionary analyses. The ‘Trim’ function provides

a simple and efficient trimming method to filter the low-quality regions in each
alignment. [It is more suitable for preparing high-quality alignments for positive
selection detection with PAML]
‘Trim’ will apply a sliding window method to identify which regions hold
low-similarity sequences based on a dynamic cutoff that will be estimated from the
total alignment (1000 times). There are two steps to trim an alignment. The 1st step
is for the gap-near regions, and the window will be sliding from the gap "-" to both
left and right ends. The 2nd step is for the block regions. After the 1st step
trimming, the 2nd will trim again from the left end to right.
User should tell which type of seqs you provided: codon or DNA, and then you
can click the "Start" button to start the trimming analyses for multiple files (Figure 3).
The output files will be saved in the folder which contains the input files, with ‘*.
trimC.fas’ or ‘*. trimD.fas’ for codon or DNA, respectively.

Figure 3. Overview of alignment trimming.

[Sim]
It is important to know the conservation level for a gene or a segment among
different species (always refer to the species you studied). If the mean identity of an
alignment is too low, it might be due to rapid evolution of this gene or
non-homologous prediction for one or more species in this alignment. For each
alignment, ‘Sim’ (Similarity) provides three identity calculations: the minimum
pair-wise identity within the alignment (min.); the maximum pair-wise identity
(max.); and the mean identity (mean).
Similarity, user should select files to analyze through clicking the ‘Scan’ button,
and then click the “Start” button to start the estimates (Figure 4).

Figure 4. Overview of alignment similarity estimates.

[Format]
This function is used to convert the FASTA files to other formatted ones. Presently,
there are 4 different formats available: FASTA, PHYLIP (Phylipi and Phylips), PAML,
and NEXUS (Figure 5). The output files will be saved in the folder which contains the
input files, with ‘*. .fas’.

Figure 5. Overview of alignment format conversion.

[Merge]
This program is used to ‘concatenate’ sequences of the same IDs from multiple
FASTA files, or just ‘merge’ all the selected files together. It is much useful in
phylogenetic or other analyses, especially when users generated multiple loci
sequences for a particular set of samples, and want to generate a “super” sequence
by concatenating all the loci sequences together for each sample.
Please note that for concatenate, if the raw FASTA files are not aligned before,
the final concatenated FASTA file should be aligned first before conducting some
other analyses. (We recommend user to construct alignment for each file before,
and each file must have a same ID list).
To start this analysis, user should define an output file name and its location
(can be automated created by FasParser). User can define which IDs will be
concatenated by editing the ID list in the textbox (left-below, Figure 6).

Figure 6. Overview of sequence concatenation manipulation.

[Sort]
Sort Sequences
This program will allow users to sort their FASTA files according to either the ID

names, sequence lengths, or a provided list of IDs. Part of this function (with the ID
list provided by users) is much like the Extraction analysis in [Filter] module.
Please note that the ID is automatedly recognized from the raw ID, which is the
first continuous string before symbols space, “.”, and “|”. For example, if the raw ID
in a FASTA file is:
‘>Uma_R000001.2 locus=scaffold79:384179:406202:-’

You can use the ID ‘Uma_R000001.2’ to search its sequence. Sometimes, the ID
‘Uma_R000001’ is also ok, if there is only one targeted ID. If the provided IDs cannot
be recognized, there will be no sequence reported. Another example. You can use
‘gi|947195581’ to represent the sequence ID :
‘>gi|947195581|ref|XP_006139827.2| PREDICTED…’

The output files will be saved in the folder which contains the input files, with ‘*.
sort.1.fas’ for alphabetically, ‘*. sort.2.1.fas’ and ‘*. sort.2.2.fas’ for length-descend
and length-ascend, respectively.

Rename sequences
With the provided ID list (left-below textbox), the program can also rename the
raw sequence IDs. To achieve that, the ID list should have two columns, the first
column refers to the raw ID, and the second column refers to the new ID, the
separator between the two columns must be “=>”, as below:
Uma_R000001.2=>R000001.2
Uma_R000003.2
Uma_R000002.2=>R000002
…

Under the above query IDs, program will use R000001.2 to replace
Uma_R000001.2, and use R000002 to replace Uma_R000002.2. If there is only one
column, the raw IDs will be retained (Figure 7.1). The output files will be saved in
the folder which contains the input files, with ‘*. sort.3.fas’.

Figure 7.1. Overview of Sort & Rename sequences.

Classify sequences
In addition, this section also provides a Classification function, which means
that you can use one or more keywords to extract sequences from a raw FASTA file
and save them into separate FASTA files. The keywords can be the genus name,
species name, or some other words, which should be part of the raw IDs (Figure
7.2).
The user should input the keywords in the right-below textbox, and then click
the “Classification” button to run this analysis. The output files will be saved in the
folder which contains the input files, with ‘*.class.[keyword].fas’.

Figure 7.2. Overview of Classification function.

[Filter]
This program is designed to perform some extraction and filtration analyses within
a particular FASTA file. It is a much common sequence manipulation. With this
program, users can extract or remove a set of sequences from the raw FASTA file
based on query IDs, the positions per codon, and can also filter the sequences
based on the sequence length, or similarity with a provided reference sequence.

Extraction & Filtration based on IDs
Based on this function, user can extract or remove some sequences from one or
more FASTA files. Just select the files you want to analyze, and then input the IDs
(you want to query) in the left-below textbox. Select the appropriated operation:
extract or remove, whether or not with reverse complement operation, and then
click the “Run” button (Figure. 8.1). The output files will be saved in the folder which
contains the input files, with ‘*.extract.fas’ or ‘*.removed.fas’.

Figure 8.1. Overview of extraction & filtration based on query IDs.

Extract 3-site columns for Codon sequences
Based on this function, user can extract each-site columns of codon sequences for
multiple FASTA files. Just select the files you want to analyze, and then click the
“Run” button (Figure. 8.2). The output files will be saved in the folder which contains

the input files, with ‘*.codon1.fas’, ‘*.codon2.fas’, and ‘*.codon3.fas’.

Figure 8.2. Overview of extraction of site columns for codon sequences.

Filtration of columns based on Gap frequency
This is a common function to cut the raw alignment by deleting columns with many
gaps (‘-’). This function can also fill up the missing data (with ‘N’) at the beginning
and ending of an alignment, which is useful for some phylogenetic analyses. To
perform this analysis, user should provide a gap-frequency cutoff value, and then
select appropriated operation, like analyzing which region, and how to deal with
the columns with a lower value (than the cutoff). Click the ‘ Run’ button to start this
analysis (Figure 8.3).
The output files will be save in the folder which contains the raw input files, with
a name format of “*.cut.end.fas” or “*.cut.all.fas”.

Figure 8.3. Overview of the column-filtration based on Gap frequency.

Filtration of sequences based on sequence length
Maybe there are some too short sequences in a file, which would significantly
influence the final available sequence length. So, it is usually useful to identify and
remove the too short sequences. To do this, FasParser provides some different
methods.
The 1st method is designed for alignment files (slower). The program will
auto-estimate the block (no-gap regions) length under removal of each single
sequence, and then identify the bad sequences whose removal can increase
significantly the final block lengths.
The 2nd method is designed for non-aligned files (faster). This method will
identify directly the too short sequences according to their sequence lengths. The
formula used is "mean - 3*std" if there are more than 100 sequences. If there are
less than 100 sequences, Grubbs method will be used.

Filtration based on similarity with a reference sequence
This function is designed for manipulating multiple sequences, which are
always collected from different sources without few sequence checking tasks. There
may be some non-homologous sequences in the raw inputted sequences, or the
sequences come from different parts of a same gene. Before downstream analyses,
it is much better to do some filtration task on these sequences to improve their
availability in other analyses.

To do this work, user should provide a reference sequence that is from your
previous or other peoples’ work to identify the non-homologous sequences and
cut the input sequences to a same length based on the alignment result with the
reference sequence.
After setting appropriated parameters, just click the “Run” button to run this
analysis. There would generate 2 or 4 output files for each input file, which contain
the final alignment (cut to a length based on the reference sequence), as well as the
homologous identification results (Figure. 8.4).

Figure. 8.4. Overview of the sequence filtration based on similarity with a reference
sequence.

[DNA2AA]
This function is used to translate the DNA sequences to protein sequences.
There are two modes available. If users have many FASTA files, you can use the
batch mode to conduct such analyses, with genetic code selected. The output files
will be saved into the folder which contains the raw input files, with a filename as
“*.aa.fas”. You can also input a pure DNA sequence or a FASTA file under the
manual mode, the program can automate recognize them.

[ORF]
This program is designed to identify the ORF for cDNA sequence(s). The cDNA
sequences can be organized into a FASTA file or as single sequence. There are 2

choices to obtain the ORFs: a) get the longest one if there are present more than
one potential ORF in the input cDNA sequence, and b) get all the potential ORF
sequences (Figure 9).

Figure 9. overview of identification of ORFs for sequences.

[2Seq]
The program “2Seq” is designed to easily count and view differences between
two DNA sequences at both DNA and codon levels. Under the codon level, the
program can also estimate the total number of synonymous (S) and
non-synonymous (N) sites for the first sequence and then calculate the number of
synonymous and non-synonymous substitutions between the two sequences. To
do that, users just need to put two sequences into the two above textboxes and
click the Run button. The program could then provide a view the differences
between the two sequences in colors and the summary of identified mutations or
substitutions (Figure 10).

Figure 10. Overview of comparison between two sequences.

[2Align]
This program “2Align” is designed to compare different alignments of a same
gene that could be generated by different aligners. It is well known that there is
almost no current method correctly aligns the entire sequence, and different
aligners always correctly align different regions. One simple method is to identify
the overlapped regions between different aligner-generated alignments, which
might be useful for some other analyses, like phylogenetic reference and/or
positive selection detection. This function needs users to input two alignments
through the above scan buttons and click the “Run” button to view their overlaps
(Figure 11). *This is the only function currently cannot run at the batch mode.

Figure 11. Overview of the estimates of consensus alignment.

[Primer]
This is an interface to design PCR primers with an excellent Primer designing tool
(Primer3). To do this, jst input the DNA template (either in Fasta or pure DNA
sequence) and modify the parameters. Click the "Run" button will obtain the primer
results if all run properly (Figure 12).

Figure 12. Overview of the primer designing function.

[PAML]
FasParser provided a batch mode to run codeml in PAML package (Yang et al.
2007). It is much better for user to read the manual of PAML to understand the
parameter settings and add proper symbols to the phylogenetic tree (like below,
Figure 13).

Figure 13. Positive selection detection.
Below is an example of running output:

[GeneDrawer]
FasParser (since version 2) provides a gene structure drawing fiction. It is very easy
to use. It can be run in two different mode, based on the data type of inputs. It now
can read either gene annotation or gene sequences (Figure 14). Presently, the
drawer function has litter photo-editing interfaces, which will be modified in the
future versions.

Figure 14. Overview of gene structure drawing function.

Contact
If you have any question, or program bug, please feel free to contact the me (Yan-Bo Sun). My
email address is sunyanbo@mail.kiz.ac.cn
FasParser also has a bug report interface ([Bug]), through which you can also contact me
and report the errors of this program. I believe FasParser will grow better under your
suggestions.



Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : No
Page Count                      : 18
Language                        : zh-CN
Tagged PDF                      : Yes
XMP Toolkit                     : 3.1-701
Producer                        : Microsoft® Word 2016
Title                           : FasParser Manual
Creator                         : Sun Yan-Bo
Description                     : [For version 2.1.1]
Creator Tool                    : Microsoft® Word 2016
Create Date                     : 2017:12:26 16:14:42+08:00
Modify Date                     : 2017:12:26 16:14:42+08:00
Document ID                     : uuid:C1C1F30F-2940-4809-A95E-C9454D00B9BA
Instance ID                     : uuid:C1C1F30F-2940-4809-A95E-C9454D00B9BA
Author                          : Sun Yan-Bo
Subject                         : [For version 2.1.1]
EXIF Metadata provided by EXIF.tools

Navigation menu