A toolkit for DNA sequence
analysis and manipulation
D. Pratas (pratas@ua.pt)
J. R. Almeida (joao.rafael.almeida@ua.pt)
A. J. Pinho (ap@ua.pt)

IEETA/DETI, University of Aveiro, Portugal

Version 1.7.17

Contents

1 Introduction
  1.1 Installation
  1.2 License
2 FASTQ tools
3 FASTA tools
  3.1 Program goose-fasta2seq
  3.2 Program goose-fastaextract
  3.3 Program goose-fastaextractbyread
  3.4 Program goose-fastainfo
  3.5 Program goose-mutatefasta
  3.6 Program goose-randfastaextrachars
4 Genomic sequence tools
  4.1 Program goose-mutatedna
  4.2 Program goose-randseqextrachars
5 Amino acid sequence tools
  5.1 Program goose-AminoAcidToGroup
  5.2 Program goose-ProteinToPseudoDNA
6 General purpose tools
Bibliography

Chapter 1

Introduction
Recent advances in DNA sequencing have revolutionized the field of genomics, making it possible for research groups to generate large amounts of sequenced data very rapidly and at substantially lower cost. These data are stored in specific file formats, such as FASTQ and FASTA, so their analysis and manipulation is crucial [?]. Several frameworks for analysis and manipulation have emerged, namely GALAXY [?], GATK [?], HTSeq [?], MEGA [?], among others. Most of these frameworks require licenses and do not provide low-level access to the information, since they are commonly approached through scripting or interfaces.

We describe GOOSE, a (free) novel toolkit for analyzing and manipulating FASTA-FASTQ formats and sequences (DNA, amino acids, text), with many complementary tools. The toolkit is for Linux-based systems and is built for fast processing. GOOSE supports pipes for easy integration. It includes tools for information display, randomizing, editing, conversion, extraction, searching, calculation and visualization. GOOSE is prepared to deal with very large datasets, typically on the scale of gigabytes or terabytes.

The toolkit is a command-line version, where each program name is the prefix goose- followed by a suffix with the respective name of the program. GOOSE is implemented in the C language and is available, under GPLv3, at:

https://pratas.github.io/goose

1.1 Installation

For GOOSE installation, run:

git clone https://github.com/pratas/goose.git
cd goose/src/
make

1.2 License

The license is GPLv3. In summary, everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. For details on the license, consult: http://www.gnu.org/licenses/gpl-3.0.html.

Chapter 2

FASTQ tools
Currently available tools for FASTQ format analysis and manipulation include:

1. goose-fastq2fasta
2. goose-fastq2mfasta
3. goose-fastqclustreads
4. goose-FastqExcludeN
5. goose-FastqExtractQualityScores
6. goose-FastqInfo
7. goose-FastqMaximumReadSize
8. goose-FastqMinimumLocalQualityScoreForward
9. goose-FastqMinimumLocalQualityScoreReverse
10. goose-FastqMinimumQualityScore
11. goose-FastqMinimumReadSize
12. goose-count
13. goose-extractreadbypattern
14. goose-fastqpack
15. goose-fastqsimulation
16. goose-FastqSplit
17. goose-FastqTrimm
18. goose-fastqunpack
19. goose-filter
20. goose-findnpos
21. goose-genrandomdna
22. goose-getunique
23. goose-info
24. goose-mfmotifcoords
25. goose-mutatefastq
26. goose-newlineonnewx
27. goose-period
28. goose-permuteseqbyblocks
29. goose-randfastqextrachars
30. goose-real2binthreshold
31. goose-reducematrixbythreshold
32. goose-renamehumanheaders
33. goose-searchphash
34. goose-seq2fasta
35. goose-seq2fastq
36. goose-SequenceToGroupSequence
37. goose-splitreads
38. goose-wsearch

Chapter 3

FASTA tools
Currently available FASTA tools, for analysis and manipulation, are:

1. goose-fasta2seq: it converts a FASTA or Multi-FASTA file format to a seq.
2. goose-fastaextract: it extracts sequences from a FASTA file, within a range defined by the user in the parameters.
3. goose-fastaextractbyread: it extracts sequences from each read in a Multi-FASTA file (split by \n), within a range defined by the user in the parameters.
4. goose-fastainfo: it shows the read information of a FASTA or Multi-FASTA file format.
5. goose-mutatefasta: it creates a synthetic mutation of a FASTA file given specific rates of edits, deletions and insertions.
6. goose-randfastaextrachars: it substitutes, in the DNA sequence, the characters outside the ACGT alphabet by random ACGT symbols.
7. goose-geco
8. goose-gede
9. goose-reverse

3.1 Program goose-fasta2seq

The goose-fasta2seq program converts a FASTA or Multi-FASTA file format to a seq. For help type:

./goose-fasta2seq -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-fasta2seq program needs two streams for the computation, namely the standard input and the standard output. The input stream is a FASTA or Multi-FASTA file. The attribution is given according to:

Usage: ./goose-fasta2seq [options] [[--] args]
   or: ./goose-fasta2seq [options]

It converts a FASTA or Multi-FASTA file format to a seq.

    -h, --help            show this help message and exit

Basic options
    < input.fasta         Input FASTA or Multi-FASTA file format (stdin)
    > output.seq          Output sequence file (stdout)

Example: ./goose-fasta2seq < input.fasta > output.seq

An example of such an input file is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGT
GGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTG
GTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAA
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

Output

The output of the goose-fasta2seq program is a sequence. An example, for the input, is:

ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAAACAAGATGCCATTGTCCCCCGGCCTCCTGCTG
CTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCA
GGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCG
GGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGAC
AGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTT
TAATTACAGACCTGAA
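
The conversion shown above can be illustrated with a minimal Python sketch. It is not the GOOSE implementation (which is written in C); it only mirrors the behaviour visible in the example, where header lines are dropped and the bases of all reads are concatenated. The function name fasta_to_seq is hypothetical.

```python
def fasta_to_seq(fasta_text: str) -> str:
    """Drop FASTA header lines and concatenate the remaining sequence lines.

    Assumption: headers start with '>' and blank lines carry no data.
    """
    bases = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and read headers
        bases.append(line)
    return "".join(bases)


if __name__ == "__main__":
    example = ">read1 first\nACGT\nACG\n>read2 second\nTTGA\n"
    print(fasta_to_seq(example))
```

The sketch returns one unbroken string; the 80-column wrapping in the manual's example is treated here as a display detail.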

3.2 Program goose-fastaextract

The goose-fastaextract program extracts sequences from a FASTA file, within a range defined by the user in the parameters. For help type:

./goose-fastaextract -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-fastaextract program needs two parameters, which define the beginning and the end of the extraction, and two streams for the computation, namely the standard input and the standard output. The input stream is a FASTA file. The attribution is given according to:

Usage: ./goose-fastaextract [options] [[--] args]
   or: ./goose-fastaextract [options]

It extracts sequences from a FASTA file.

    -h, --help            show this help message and exit

Basic options
    -i, --init=<int>      The first position to start the extraction (default 0)
    -e, --end=<int>       The last extract position (default 100)
    < input.fasta         Input FASTA or Multi-FASTA file format (stdin)
    > output.seq          Output sequence file (stdout)

Example: ./goose-fastaextract -i <init> -e <end> < input.fasta > output.seq

An example of such an input file is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

Output

The output of the goose-fastaextract program is a sequence. An example, using the value 0 as the extraction starting point and 50 as the end, for the provided input, is:

ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGG
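
As a rough illustration of the range semantics, the following Python sketch extracts a slice from the sequence of a FASTA text. It assumes 0-based positions with an exclusive end, which is inferred only from the example above (init 0 and end 50 yield 50 bases); the actual tool may treat boundaries differently. The function name fasta_extract is hypothetical.

```python
def fasta_extract(fasta_text: str, init: int = 0, end: int = 100) -> str:
    """Return the bases in positions [init, end) of the FASTA sequence.

    Assumption: positions are 0-based and `end` is exclusive, matching the
    manual's example where init=0 and end=50 produce 50 bases.
    """
    sequence = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )
    return sequence[init:end]


if __name__ == "__main__":
    print(fasta_extract(">h\nACGTACGT\nTTTT\n", 2, 6))
```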

3.3 Program goose-fastaextractbyread

The goose-fastaextractbyread program extracts sequences from each read of a FASTA or Multi-FASTA file, within a range defined by the user in the parameters. For help type:

./goose-fastaextractbyread -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-fastaextractbyread program needs two parameters, which define the beginning and the end of the extraction, and two streams for the computation, namely the standard input and the standard output. The input stream is a FASTA or Multi-FASTA file. The attribution is given according to:

Usage: ./goose-fastaextractbyread [options] [[--] args]
   or: ./goose-fastaextractbyread [options]

It extracts sequences from each read in a Multi-FASTA file (splited by \n)

    -h, --help            show this help message and exit

Basic options
    -i, --init=<int>      The first position to start the extraction (default 0)
    -e, --end=<int>       The last extract position (default 100)
    < input.fasta         Input FASTA or Multi-FASTA file format (stdin)
    > output.fasta        Output FASTA or Multi-FASTA file format (stdout)

Example: ./goose-fastaextractbyread -i <init> -e <end> < input.fasta > output.fasta

An example of such an input file is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGT
GGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTG
GTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAA
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

Output

The output of the goose-fastaextractbyread program is a FASTA or Multi-FASTA file with the extracted sequences. An example, using the value 0 as the extraction starting point and 50 as the end, for the provided input, is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGG
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
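
The per-read variant can be sketched in Python in the same spirit. This is an illustration under assumptions, not the GOOSE code: positions are taken as 0-based with an exclusive end (as the example above suggests), and the names read_fasta and fasta_extract_by_read are hypothetical.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA or Multi-FASTA text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_extract_by_read(text: str, init: int = 0, end: int = 100) -> str:
    """Keep each header and the bases [init, end) of its read."""
    lines = []
    for header, sequence in read_fasta(text):
        lines.append(header)
        lines.append(sequence[init:end])
    return "\n".join(lines)


if __name__ == "__main__":
    print(fasta_extract_by_read(">a\nACGTAC\nGT\n>b\nTTTTGGGG\n", 0, 4))
```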

3.4 Program goose-fastainfo

The goose-fastainfo program shows the read information of a FASTA or Multi-FASTA file format. For help type:

./goose-fastainfo -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-fastainfo program needs two streams for the computation, namely the standard input and the standard output. The input stream is a FASTA or Multi-FASTA file. The attribution is given according to:

Usage: ./goose-fastainfo [options] [[--] args]
   or: ./goose-fastainfo [options]

It shows read information of a FASTA or Multi-FASTA file format.

    -h, --help            show this help message and exit

Basic options
    < input.fasta         Input FASTA or Multi-FASTA file format (stdin)
    > output              Output read information (stdout)

Example: ./goose-fastainfo < input.fasta > output

An example of such an input file is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGT
GGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTG
GTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAA
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
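
For an input like the one above, the quantities that goose-fastainfo reports (number of reads, number of bases, and the minimum, maximum and average number of bases per read, as listed in the Output subsection) can be approximated with a short Python sketch. It is an illustration only; the function name fasta_info and the dictionary keys are hypothetical, and the real tool's handling of edge cases is not documented here.

```python
from typing import Dict, List, Union


def fasta_info(text: str) -> Dict[str, Union[int, float]]:
    """Compute per-read length statistics for FASTA or Multi-FASTA text."""
    lengths: List[int] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)  # a header opens a new read
        elif lengths:
            lengths[-1] += len(line)  # count bases of the current read
    reads = len(lengths)
    bases = sum(lengths)
    return {
        "reads": reads,
        "bases": bases,
        "min": min(lengths) if reads else 0,
        "max": max(lengths) if reads else 0,
        "avg": bases / reads if reads else 0.0,
    }


if __name__ == "__main__":
    print(fasta_info(">a\nACGT\nAC\n>b\nTT\n"))
```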

Output

The output of the goose-fastainfo program is a set of information related to the file read. An example, for the input, is:

Number of reads      : 2
Number of bases      : 736
MIN of bases in read : 368
MAX of bases in read : 368
AVG of bases in read : 368.0000

3.5 Program goose-mutatefasta

The goose-mutatefasta program creates a synthetic mutation of a FASTA file given specific rates of edits, deletions and insertions. All these parameters are defined by the user, and they are optional. For help type:

./goose-mutatefasta -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-mutatefasta program needs two streams for the computation, namely the standard input and the standard output. Optional settings can also be supplied, such as the starting point (seed) of the random generator and the edit, deletion and insertion rates. The user can also choose to use the ACGTN alphabet in the synthetic mutation. The input stream is a FASTA or Multi-FASTA file. The attribution is given according to:

Usage: ./goose-mutatefasta [options] [[--] args]
   or: ./goose-mutatefasta [options]

Creates a synthetic mutation of a fasta file given specific rates of editions, deletions and additions

    -h, --help                    show this help message and exit

Basic options
    < input.fasta                 Input FASTA or Multi-FASTA file format (stdin)
    > output.fasta                Output FASTA or Multi-FASTA file format (stdout)

Optional
    -s, --seed=<int>              Starting point to the random generator
    -e, --edit-rate=<dbl>         Defines the edition rate (default 0.0)
    -d, --deletion-rate=<dbl>     Defines the deletion rate (default 0.0)
    -i, --insertion-rate=<dbl>    Defines the insertion rate (default 0.0)
    -a, --ACGTN-alphabet          When active, the application uses the ACGTN alphabet

Example: ./goose-mutatefasta -s <seed> -e <edit rate> -d <deletion rate> -i <insertion rate> -a < input.fast

An example of such an input file is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACAAGACGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCCTGGAGGGTCCACCGCTGCCCTGCTGCCATTGTCCCC
GGCCCCACCTAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAA
GTGGTTTGAGTGGACCTCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGCAGGCCAGTGCC
GCGAATCCGCGCGCCGGGACAGAATCTCCTGCAAAGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCACCCCCCCAGC
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGT
GGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTG
GTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAA
TAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

Output

The output of the goose-mutatefasta program is a FASTA or Multi-FASTA file with the synthetic mutation of the input file. Using a seed value of 1 and an edit rate of 0.5, an example for this input is:

> AB000264 | acc = AB000264 | descr = Homo sapiens mRNA
ACGCAACGNATTCCTGCTGATCATANTGTNCCGCNCCCCNGCGACGGGGNCTCNCNNGCACACATNGTACCATTGTCCAC
NCTTNCANGTNANCGCTAGCAGGCTACNGTTTNTCCTCNCCTANNCCAANCNGGCGTNNNTACACTGGCACGTGCAGGCA
TNGGTCGGCNGGNNCCTCCGGNAACGGCACCGGAGACGAAGCTCGGNGGNTATACAGGTGTCANGAAACATCCCCGCGNC
GNGTGNCCNNGAANCCANAGAGTATCTCACTCACAACCCTGCGTGCACNTCTAGAGNANGACCTTACNCACCNTCCCNTT
NNGTACCACACCAATGAACGCTGCAGAAAGTCTGTTTNNAGGNGNGCA
> AB000263 | acc = AB000263 | descr = Homo sapiens mRNA
ATTTGAAGGCAANCGGNCCAGNAATNCGGNGGGTGCNGCTCNTGTNGGCTACGGNCATCGCGGCCCTGCTNTANTAAGCN
TGAACCACCGNTCGNNGCACTTAGCAATNGCGNAANCCGTCGGCACGGCGGAGACNAANCCGCTANTNNTTTCCCGCTNA
ATGGNTGTACAAGACCNACTANACCANCCTCCGTCACCACACTGGAGCGCANGATGGNNCGCTGNCTAGNAGNCNNTGAG
GCGCTCCNTCCTANAAANCCGTGGNCGAGCNCCCTATGGNAGNGTGGGGGTTTTACCGGAAGACCNTCGNGCCCTATGGG
AGCAATCANAANCTAGAAAGCTTACNGATGGTGANGAANTAGACTANG
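
To make the role of the three rates concrete, here is a minimal Python sketch of a rate-driven mutation of a single sequence. It is an assumption-laden illustration, not the algorithm used by goose-mutatefasta: the order of the insertion, deletion and edit steps, the uniform choice of replacement symbols, and the random generator are all choices made for this example, so its output will not reproduce the manual's example. The function name mutate_sequence is hypothetical.

```python
import random


def mutate_sequence(
    sequence: str,
    edit_rate: float = 0.0,
    deletion_rate: float = 0.0,
    insertion_rate: float = 0.0,
    seed: int = 0,
    alphabet: str = "ACGT",
) -> str:
    """Apply independent per-base insertion, deletion and edit events.

    Assumptions (not taken from GOOSE): events are tested in the order
    insertion, deletion, edit, and new symbols are drawn uniformly from
    `alphabet` (pass "ACGTN" to mimic the ACGTN option).
    """
    rng = random.Random(seed)  # seeded generator for reproducible output
    mutated = []
    for base in sequence:
        if rng.random() < insertion_rate:
            mutated.append(rng.choice(alphabet))  # extra symbol before base
        if rng.random() < deletion_rate:
            continue  # drop this base
        if rng.random() < edit_rate:
            base = rng.choice(alphabet)  # may pick the same symbol again
        mutated.append(base)
    return "".join(mutated)


if __name__ == "__main__":
    print(mutate_sequence("ACGTACGTACGT", edit_rate=0.5, seed=1, alphabet="ACGTN"))
```

For a FASTA file, such a function would be applied to the sequence of each read while the header lines are copied unchanged, as in the example output above.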

3.6 Program goose-randfastaextrachars

The goose-randfastaextrachars program substitutes, in the DNA sequence, the characters outside the ACGT alphabet by random ACGT symbols. It works with both FASTA and Multi-FASTA file formats. For help type:

./goose-randfastaextrachars -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-randfastaextrachars program needs two streams for the computation, namely the standard input and the standard output. The input stream is a FASTA or Multi-FASTA file. The attribution is given according to:

Usage: ./goose-randfastaextrachars [options] [[--] args]
   or: ./goose-randfastaextrachars [options]

It substitues in the DNA sequence the outside ACGT chars by random ACGT symbols.
It works both in FASTA and Multi-FASTA file formats

    -h, --help            show this help message and exit

Basic options
    < input.fasta         Input FASTA or Multi-FASTA file format (stdin)
    > output.fasta        Output FASTA or Multi-FASTA file format (stdout)

Example: ./goose-randfastaextrachars < input.fasta > output.fasta

An example of such an input file is:

to do

Output

The output of the goose-randfastaextrachars program is a FASTA or Multi-FASTA file. An example, for the input, is:

to do
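
Since the manual does not yet give an example for this program, the following Python sketch illustrates the substitution described in this section: every sequence symbol outside ACGT is replaced by a randomly chosen ACGT symbol, and header lines are left untouched. It is a sketch under assumptions, not the GOOSE implementation; the function name rand_fasta_extra_chars and its seed parameter are hypothetical (the tool's help lists no seed option), and lowercase bases are treated as outside the alphabet.

```python
import random


def rand_fasta_extra_chars(fasta_text: str, seed: int = 0) -> str:
    """Replace every non-ACGT sequence symbol by a random ACGT symbol.

    Assumption: lines starting with '>' are headers and are kept as-is.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    output_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            output_lines.append(line)
        else:
            output_lines.append(
                "".join(c if c in "ACGT" else rng.choice("ACGT") for c in line)
            )
    return "\n".join(output_lines)


if __name__ == "__main__":
    print(rand_fasta_extra_chars(">read1 with N\nACNNGT\nRYKM\n"))
```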


Chapter 4

Genomic sequence tools
Currently available genomic sequence tools, for analysis and manipulation, are:

1. goose-mutatedna
2. goose-randseqextrachars

4.1 Program goose-mutatedna

The goose-mutatedna ...

For help type:

./goose-mutatedna -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-mutatedna program needs ... The attribution is given according to:

TO DO

An example of such an input file is:

TO DO

Output

The output of the goose-mutatedna program ... An example, for the input, is:

TO DO

4.2 Program goose-randseqextrachars

The goose-randseqextrachars ...

For help type:

./goose-randseqextrachars -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-randseqextrachars program needs ... The attribution is given according to:

TO DO

An example of such an input file is:

TO DO

Output

The output of the goose-randseqextrachars program ... An example, for the input, is:

TO DO

Chapter 5

Amino acid sequence tools
Currently available amino acid sequence tools, for analysis and manipulation, are:

1. goose-AminoAcidToGroup: it converts an amino acid sequence to a group sequence.
2. goose-ProteinToPseudoDNA: it converts an amino acid (protein) sequence to a pseudo DNA sequence.

5.1 Program goose-AminoAcidToGroup

The goose-AminoAcidToGroup program converts an amino acid sequence to a group sequence. For help type:

./goose-AminoAcidToGroup -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-AminoAcidToGroup program needs two streams for the computation, namely the standard input and the standard output. The input stream is an amino acid sequence. The attribution is given according to:

Usage: ./goose-AminoAcidToGroup [options] [[--] args]
   or: ./goose-AminoAcidToGroup [options]

It converts a amino acid sequence to a group sequence.

    -h, --help            show this help message and exit

Basic options
    < input.prot          Input amino acid sequence file (stdin)
    > output.group        Output group sequence file (stdout)

Example: ./goose-AminoAcidToGroup < input.prot > output.group

Table:
Prot  Group
R     P      Amino acids with electric charged side chains: POSITIVE
H     P
K     P
D     N      Amino acids with electric charged side chains: NEGATIVE
E     N
S     U      Amino acids with electric UNCHARGED side chains
T     U
N     U
Q     U
C     S      Special cases
U     S
G     S
P     S
A     H      Amino acids with hydrophobic side chains
V     H
I     H
L     H
M     H
F     H
Y     H
W     H
*     *      Others
X     X      Unknown

It can be used to group amino acids by properties, such as electric charge (positive and negative), uncharged side chains, hydrophobic side chains and special cases. An example of such an input file is:

IPFLLKKQFALADKLVLSKLRQLLGGRIKMMPCGGAKLEPAIGLFFHAIGINIKLGYGMTETTATVSCWHDFQFNPNSIG
TLMPKAEVKIGENNEILVRGGMVMKGYYKKPEETAQAFTEDGFLKTGDAGEFDEQGNLFITDRIKELMKTSNGKYIAPQY
IESKIGKDKFIEQIAIIADAKKYVSALIVPCFDSLEEYAKQLNIKYHDRLELLKNSDILKMFE

Output

The output of the goose-AminoAcidToGroup program is a group sequence. An example, for the input, is:

HSHHHPPUHHHHNPHHHUPHPUHHSSPHPHHSSSSHPHNSHHSHHHPHHSHUHPHSHSHUNUUHUHUSHPNHUHUSUUHS
UHHSPHNHPHSNUUNHHHPSSHHHPSHHPPSNNUHUHHUNNSHHPUSNHSNHNNUSUHHHUNPHPNHHPUUUSPHHHSUH
HNUPHSPNPHHNUHHHHHNHPPHHUHHHHSSHNUHNNHHPUHUHPHPNPHNHHPUUNHHPHHN
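
The mapping in the table of this section can be written as a small Python lookup, shown here as an illustrative sketch rather than the GOOSE code. Two points are assumptions: line breaks are passed through unchanged (the example output keeps the input's line lengths), and any symbol missing from the table is mapped to X. The names GROUPS, AA_TO_GROUP and amino_acid_to_group are hypothetical.

```python
# Group letter -> amino acids belonging to that group (from the table above).
GROUPS = {
    "P": "RHK",        # electrically charged side chains: positive
    "N": "DE",         # electrically charged side chains: negative
    "U": "STNQ",       # uncharged side chains
    "S": "CUGP",       # special cases
    "H": "AVILMFYW",   # hydrophobic side chains
    "*": "*",          # others
    "X": "X",          # unknown
}

# Invert the table into an amino acid -> group lookup.
AA_TO_GROUP = {aa: group for group, members in GROUPS.items() for aa in members}


def amino_acid_to_group(sequence: str) -> str:
    """Map each amino acid to its group letter, keeping line breaks.

    Assumption: symbols not present in the table are reported as 'X'.
    """
    return "".join(
        c if c == "\n" else AA_TO_GROUP.get(c, "X") for c in sequence
    )


if __name__ == "__main__":
    print(amino_acid_to_group("IPFLLKKQ"))  # HSHHHPPU
```

The printed value matches the first eight symbols of the example output above for the first eight residues of the example input.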

5.2 Program goose-ProteinToPseudoDNA

The goose-ProteinToPseudoDNA program converts an amino acid (protein) sequence to a pseudo DNA sequence. For help type:

./goose-ProteinToPseudoDNA -h

In the following subsections, we explain the input and output parameters.

Input parameters

The goose-ProteinToPseudoDNA program needs two streams for the computation, namely the standard input and the standard output. The input stream is an amino acid sequence. The attribution is given according to:

Usage: ./goose-ProteinToPseudoDNA [options] [[--] args]
   or: ./goose-ProteinToPseudoDNA [options]

It converts a protein sequence to a pseudo DNA sequence.

    -h, --help            show this help message and exit

Basic options
    < input.prot          Input amino acid sequence file (stdin)
    > output.dna          Output DNA sequence file (stdout)

Example: ./goose-ProteinToPseudoDNA < input.prot > output.dna

Table:
Prot  DNA
A     GCA
C     TGC
D     GAC
E     GAG
F     TTT
G     GGC
H     CAT
I     ATC
K     AAA
L     CTG
M     ATG
N     AAC
P     CCG
Q     CAG
R     CGT
S     TCT
T     ACG
V     GTA
W     TGG
Y     TAC
*     TAG
X     GGG

It can be used to generate pseudo-DNA with characteristics inherited from amino acid (protein) sequences. An example of such an input file is:

IPFLLKKQFALADKLVLSKLRQLLGGRIKMMPCGGAKLEPAIGLFFHAIGINIKLGYGMTETTATVSCWHDFQFNPNSIG
TLMPKAEVKIGENNEILVRGGMVMKGYYKKPEETAQAFTEDGFLKTGDAGEFDEQGNLFITDRIKELMKTSNGKYIAPQY
IESKIGKDKFIEQIAIIADAKKYVSALIVPCFDSLEEYAKQLNIKYHDRLELLKNSDILKMFE

Output

The output of the goose-ProteinToPseudoDNA program is a DNA sequence. An example, for the input, is:

ATCCCGTTTCTGCTGAAAAAACAGTTTGCACTGGCAGACAAACTGGTACTGTCTAAACTGCGTCAGCTGCTGGGCGGCCG
TATCAAAATGATGCCGTGCGGCGGCGCAAAACTGGAGCCGGCAATCGGCCTGTTTTTTCATGCAATCGGCATCAACATCA
AACTGGGCTACGGCATGACGGAGACGACGGCAACGGTATCTTGCTGGCATGACTTTCAGTTTAACCCGAACTCTATCGGC
ACGCTGATGCCGAAAGCAGAGGTAAAAATCGGCGAGAACAACGAGATCCTGGTACGTGGCGGCATGGTAATGAAAGGCTA
CTACAAAAAACCGGAGGAGACGGCACAGGCATTTACGGAGGACGGCTTTCTGAAAACGGGCGACGCAGGCGAGTTTGACG
AGCAGGGCAACCTGTTTATCACGGACCGTATCAAAGAGCTGATGAAAACGTCTAACGGCAAATACATCGCACCGCAGTAC
ATCGAGTCTAAAATCGGCAAAGACAAATTTATCGAGCAGATCGCAATCATCGCAGACGCAAAAAAATACGTATCTGCACT
GATCGTACCGTGCTTTGACTCTCTGGAGGAGTACGCAAAACAGCTGAACATCAAATACCATGACCGTCTGGAGCTGCTGA
AAAACTCTGACATCCTGAAAATGTTTGAG
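
The table-driven conversion can be sketched in Python as follows. This is an illustration built from the table in this section, not the GOOSE source. Two points are assumptions: whitespace (including line breaks) is dropped, since the example output is re-wrapped rather than aligned with the input lines, and symbols missing from the table are encoded like X. The names CODONS and protein_to_pseudo_dna are hypothetical.

```python
# Amino acid -> pseudo-DNA triplet, copied from the table above.
CODONS = {
    "A": "GCA", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTT",
    "G": "GGC", "H": "CAT", "I": "ATC", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "TCT", "T": "ACG", "V": "GTA", "W": "TGG", "Y": "TAC",
    "*": "TAG", "X": "GGG",
}


def protein_to_pseudo_dna(sequence: str) -> str:
    """Replace each amino acid by its triplet from the table.

    Assumptions: whitespace is skipped and unknown symbols map to the
    triplet of 'X'.
    """
    return "".join(
        CODONS.get(c, CODONS["X"]) for c in sequence if not c.isspace()
    )


if __name__ == "__main__":
    print(protein_to_pseudo_dna("IPFLLKKQ"))  # ATCCCGTTTCTGCTGAAAAAACAG
```

The printed value matches the first 24 bases of the example output above for the first eight residues of the example input.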


Chapter 6

General purpose tools
1. goose-comparativemap: visualisation of comparative maps. It builds an image given specific patterns between two sequences.
2. goose-BruteForceString: it generates, line by line, multiple combinations of strings up to a certain size.
3. goose-char2line: it writes each char of the input on its own line.
4. goose-sum: it adds the second column value to the first column value.
5. goose-min: it finds the minimum value between two column values.
6. goose-minus: it subtracts the second column value from the first column value.
7. goose-max: it finds the maximum value between two column values.
8. goose-extract: it extracts a subsequence of a sequence by coordinates.
9. goose-segment: it segments a sequence given a certain threshold.


