Isomir Sea Manual V0 1


isomiR-SEA: isomiRNA seed and extension based aligner for RNA-Seq analysis ∗

Tool for miRNAs/isomiRs expression level profiling and miRNA-mRNA interaction site evaluation

Gianvito Urgese†

July 10, 2015


∗Version 0.1, 2015/04/11

†Email: gianvito.urgese@polito.it

Contents

1 Introduction

1.1 isomiR-SEA philosophy

1.2 Main features

2 Program usage

2.1 Installation procedures

2.1.1 Pre-compiled release

2.2 Dependencies

2.3 The miRNAs databases

2.4 Program options

2.4.1 Alignment parameters

2.4.2 Paths and files

2.4.3 DNA/RNA conversion and filtering

3 Analysis work-flow

3.1 The interaction sites point of view

3.2 The isomiRs point of view

3.3 isomiR-SEA Algorithm

3.3.1 Parameters setting

3.3.2 miRNAs DB and tags counts file processing

3.3.3 isomiR-SEA alignment procedure

3.3.4 isomiR-SEA output files

4 Files formats

4.1 Input files

4.2 Output files

Bibliography

1 Introduction

1.1 isomiR-SEA philosophy

The main idea behind isomiR-SEA is to provide users with an integrated and user-friendly tool capable of extracting miRNA and isomiR expression levels from high-throughput sequencing data with a high degree of accuracy and detail.

The innovation introduced by isomiR-SEA lies in the pragmatic approach it implements to detect and quantify miRNAs in the analysed samples.

The tool treats as miRNA-derived reads all the sequences that retain specific miRNA features and that, as a consequence, can still interact with mRNA target molecules. These interactions depend strictly on the number and kind of differences detected in the read with respect to the miRNA from which it derives. Furthermore, specific miRNA positions are now considered fundamental for miRNA:mRNA target interaction, as shown by several studies conducted in the last decade and exploited by most miRNA target prediction algorithms.

In the light of these considerations, and given the lack of software tools able to take these miRNA features into account during the read alignment phase, we have developed isomiR-SEA.

isomiR-SEA is the first tool implemented to align reads against miRNA databases while considering the pairing aspects of the miRNA:mRNA interaction. This novelty allows isomiR-SEA to reach high accuracy in extracting miRNA expression levels and, at the same time, to provide users with a very detailed description of the detected expression profiles.

In other words, why should biologists look at pictures from a 0.5-megapixel camera when the available knowledge allows them to look at the same images from a 10-megapixel camera? Moreover, it is usually simpler to reduce the level of detail of a picture than to increase it, so isomiR-SEA has been implemented to provide users with the most complete and detailed information that can be extracted from the data.

This precise representation, made possible by an accurate and fast alignment procedure, includes the mature miRNAs and miRNA isoforms detected in the sample and their possible modes of interaction with a target mRNA. Users can freely manage the obtained information to highlight the aspects they consider most meaningful.

1.2 Main features

isomiR-SEA is a software tool for the accurate and detailed processing of miRNA sequencing data. It takes as input a set of reads coming from miRNA sequencing, or a pre-processed tags file, and provides as output the tag counts attributed to the different miRNAs, distinguishing among their isoforms. Moreover, conserved miRNA:mRNA interaction sites are reported in the output files.

isomiR-SEA reaches high accuracy in miRNA quantification thanks to a seed-based alignment procedure. The procedure begins with the search for an exact miRNA seed match between read and miRNA and proceeds with the extension of the alignment. Furthermore, isomiR-SEA supports multi-species miRNA analyses: one or more species contained in the miRBase database can be selected when setting the parameters, so that reads are assigned to the most similar miRNA. This feature is useful in several situations, such as the joint analysis of pathogens and host organisms or inter-species comparisons.

The advantages of isomiR-SEA can be summarized as follows:

• High accuracy in miRNA detection and quantification, thanks to a miRNA-specific alignment procedure;

• High degree of detail and precision in isomiR detection and profiling, thanks to accurate read assignment;

• Ability to detect, for each read supporting a specific miRNA, all the possible modes of interaction with a target mRNA;

• Contained computational cost despite the high degree of detail provided;

• Ability to perform multi-species analyses;

• High degree of flexibility.

2 Program usage

2.1 Installation procedures

The current version of isomiR-SEA is provided only in a pre-compiled format.

2.1.1 Pre-compiled release

To make the installation of isomiR-SEA as easy as possible, we provide a binary package for the three most common operating systems. To use these pre-built packages, download the one appropriate for your operating system, decompress it and check that the binaries are stored in a folder listed in your PATH environment variable.

2.2 Dependencies

isomiR-SEA is implemented in C++ as a stand-alone tool. No third-party tools need to be installed if the standard input file format is used.

However, the most common pre-processing activities can be carried out with the following programs:

• The SRA Toolkit, to convert .sra files into other formats such as .fasta or .fastq. http://www.ncbi.nlm.nih.gov/books/NBK158900/

• Flexbar (flexible barcode and adapter removal), to remove the adapter at the 3' end of the reads. The adapter sequence used during the sequencing procedure has to be provided. http://sourceforge.net/p/flexbar/wiki/Manual/

• The FASTX Toolkit, and in particular the fastx_collapser program, to generate the tags counts file. http://hannonlab.cshl.edu/fastx_toolkit/index.html

2.3 The miRNAs databases

The current version of isomiR-SEA adopts miRBase (http://www.mirbase.org/) as its miRNA reference database. In particular, it uses the miRBase file that stores the mature miRNA sequences.

To download the complete latest miRBase release from a Unix shell:

$ wget -r ftp://mirbase.org/pub/mirbase/CURRENT/

To download only the miRBase mature miRNAs file:

$ wget -nd ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz

To download it from a browser, use the following link:

ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz

2.4 Program options

Run isomiR-SEA from the command line:

$ miR-SEA_1_6 -s hsa -l 16 -b 4 -i DIRECTORY_NAME -p DIRECTORY_NAME/test_16_4 -ss 6 -h 11 -m MATURE_DB -t COUNTFILE

2.4.1 Alignment parameters

-s, species_codes STR (hsa)

One or more species to be selected as reference(s) during the tag alignment procedure.

-l, min_tag_length INT (10)

Minimum alignment length required to perform the third ungapped extension.

-ss, seed_size INT (6)

Size of the miRNA seed to be searched for in the tag sequences.

-sb, begin_seed INT (2)

miRNA seed start position.

-se, end_seed INT (7)

miRNA seed end position.

-b, tag_begin INT (4)

Maximum coordinate in the tag at which the miRNA seed is searched for.

2.4.2 Paths and ﬁles

-p, path_out STR (NULL)

Output folder path.

-i, path_in STR (NULL)

Input folder path.

-m, db_file_name STR (NULL)

Reference database file name (a FASTA-formatted file saved with a .txt extension). The extension must not be specified.

-t, tag_file_name STR (NULL)

Tags counts file name, without the .txt extension.

2.4.3 DNA/RNA conversion and ﬁltering

-r, dna_or_rna BOOL (1)

If this parameter is set to 0, the conversion from DNA to RNA is skipped.

-h, tag_selection_thr INT (11)

Threshold used to discard multi-mapped tags with higher scores during the alignment score check phase and the tag/miRNA assignment procedure.
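When isomiR-SEA is launched from a script, the options listed in Section 2.4 can be assembled programmatically. The following is a minimal sketch, assuming the executable is named miR-SEA_1_6 as in the example command; the helper name build_command and its argument names are illustrative and are not part of the tool.

```python
import shlex


def build_command(species, min_tag_length, tag_begin, path_in, path_out,
                  seed_size, tag_selection_thr, db_file_name, tag_file_name,
                  executable="miR-SEA_1_6"):
    """Return the argument list for one isomiR-SEA run.

    Only the flags shown in the example command of Section 2.4 are used.
    The helper itself is hypothetical.
    """
    return [
        executable,
        "-s", species,
        "-l", str(min_tag_length),
        "-b", str(tag_begin),
        "-i", path_in,
        "-p", path_out,
        "-ss", str(seed_size),
        "-h", str(tag_selection_thr),
        "-m", db_file_name,   # given without the .txt extension
        "-t", tag_file_name,  # given without the .txt extension
    ]


cmd = build_command("hsa", 16, 4, "DIRECTORY_NAME",
                    "DIRECTORY_NAME/test_16_4", 6, 11,
                    "MATURE_DB", "COUNTFILE")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the binary is reachable through PATH.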

3 Analysis work-ﬂow

The analysis of miRNA-Seq data is now considered a central step in the characterization of biological samples. To perform such an analysis (Figure 3.1), the huge number of reads produced by a sequencing run needs to be reduced. This is done by grouping all the identical reads and counting their occurrences: the resulting output file is called the tags counts file.
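The grouping step can be illustrated with a few lines of Python. This is only a sketch of the idea (collapse identical reads into tags and count them); the function name collapse_reads is hypothetical, and the reads are assumed to be already adapter-clipped.

```python
from collections import Counter


def collapse_reads(reads):
    """Group identical reads into tags and count their occurrences.

    Returns (tag_sequence, tag_count) pairs, most abundant tag first.
    """
    return Counter(reads).most_common()


reads = ["ACGTACGT", "ACGTACGT", "TTGACCA", "ACGTACGT", "TTGACCA", "GGCA"]
for tag, count in collapse_reads(reads):
    print(tag, count)
```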

State-of-the-art tools such as miRanalyzer and miRExpress detect miRNA expression levels in the sample under investigation by aligning each tag on a selected miRNA database.

A tag is then considered an expression of a specific miRNA if it can be attributed to that miRNA with a high level of confidence.

However, the alignment engines used by these tools are in most cases general-purpose alignment algorithms that do not take miRNA-specific evidence into account.

Figure 3.1: Example of miRNA expression levels analysis pipeline.

To improve the information that can be derived from miRNA-Seq data, we have designed the isomiRNA Seed Extension Aligner (isomiR-SEA).

Unlike state-of-the-art tools, its whole pipeline has been developed taking into account miRNA features such as the seed region, miRNA families, miRNA isoforms and possible interaction sites.

3.1 The interaction sites point of view

Recent research by Bartel and colleagues has shown that almost all of these short non-coding molecules are characterized by similar structures [3]. In particular, they share a common region between nucleotides 2 and 7 that is fundamental for miRNA target recognition (Figure 3.2).

Moreover, this region is usually conserved among miRNAs belonging to the same family.

Figure 3.2: miRNA seed region.

As stated in Section 1.1, one of the cornerstones of isomiR-SEA is the pragmatic approach adopted to detect miRNA expression levels. This approach required an extensive literature review to identify the miRNA features that guarantee the correct attribution of a tag to a specific miRNA. Among these, the modes of miRNA:mRNA interaction have been considered fundamental.

Figure 3.3 reports the most common miRNA:mRNA interaction modes. Four interaction sites have been identified as responsible for these interaction modes:

• Seed site interaction: a perfect seed match is required in all the presented cases, with the exception of G and H;

• Offset site interaction: the pairing of miRNA nucleotide 8 is frequently required for the miRNA:mRNA interaction;

• Supplementary site interaction: in cases F, G and H the pairing of the miRNA region between nucleotides 13 and 16 is required for the interaction;

• Central site interaction: recently identified in [4] as the only mode of interaction able to compensate for the lack of a strong pairing between the miRNA seed and the target mRNA.

Figure 3.3: miRNA:mRNA interaction regions (extracted from [3] and [4]).

Although the miRNA seed region is considered fundamental by all miRNA target recognition tools, this feature has not yet been exploited for the extraction of miRNA expression levels.

The isomiR-SEA algorithm is built around this feature (Figure 3.2), so miRNA seed identification is the first activity performed during the alignment procedure.

As a consequence, only the tags that exactly match a miRNA seed sequence in a position allowed by the miRNA structure (usually nt 1-10) are passed as input to the second alignment step of the pipeline.
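The seed filter described above can be sketched as follows. This is not the tool's C++ implementation: the function names are hypothetical, coordinates are assumed to be 1-based, and tag_begin is interpreted as the largest tag coordinate at which the seed may start, which is one possible reading of the parameter description. The miRNA sequence is used purely as an illustration.

```python
def extract_seed(mirna, begin_seed=2, end_seed=7):
    """Return the seed of a miRNA, i.e. nucleotides begin_seed..end_seed (1-based)."""
    return mirna[begin_seed - 1:end_seed]


def find_seed(tag, seed, tag_begin=4):
    """Return the 1-based start of an exact seed match in the tag.

    Returns None when the seed is absent or when its first occurrence
    starts beyond tag_begin (assumed interpretation of the parameter).
    """
    pos = tag.find(seed)
    if pos == -1 or pos + 1 > tag_begin:
        return None
    return pos + 1


mirna = "UAGCAGCACAUAAUGGUUUGUG"  # illustrative 22-nt sequence
seed = extract_seed(mirna)
print(seed)
print(find_seed("UAGCAGCACAUAAUGGUUUGUG", seed))
print(find_seed("GGGGGUAGCAGCACAUAA", seed))
```

In the third call the seed starts at tag position 7, so under the stated assumption the tag is rejected before any extension is attempted.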

3.2 The isomiRs point of view

Since miRNA:mRNA interaction sites are defined on the miRNA structure, even small variations in the miRNA sequence (which give rise to isomiRs) can lead to the loss of interaction sites. For example, an isomiR carrying a single nucleotide polymorphism in one of its interaction sites may no longer be able to interact with, and possibly downregulate, the target genes of its miRNA of origin (as shown in Figure 3.3).

For this reason, all the different isoforms that contribute to the expression level of a miRNA are reported in the output of an isomiR-SEA run. isomiR-SEA distinguishes among several isoforms, illustrated in Figure 3.4 and listed below:

1. miRNA exact (mirna_exact), the original sequence reported in the database;

2. 5' isoforms (iso_5p), characterized by insertions or deletions at the 5' end of the miRNA;

3. single nucleotide polymorphism isoforms (iso_snp), presenting one mismatch with respect to the miRNA sequence;

4. multiple nucleotide polymorphism isoforms (iso_multi_snp), presenting more than one mismatch with respect to the miRNA sequence;

5. 3' isoforms (iso_3p), characterized by insertions or deletions at the 3' end of the miRNA.

Figure 3.4: isomiR sequences aligned on the original miRNA sequence.

Moreover, since a tag can show more than one kind of variation at the same time, combinations of isoforms also have to be considered. In summary, tags can be classified as deriving from:

1. mirna_exact

2. iso_5p-iso_snp-iso_3p

3. iso_5p-iso_snp

4. iso_5p-iso_multi_snp-iso_3p

5. iso_5p-iso_multi_snp

6. iso_5p-iso_3p

7. iso_5p_only

8. iso_snp-iso_3p

9. iso_snp_only

10. iso_multi_snp-iso_3p

11. iso_multi_snp_only

12. iso_3p_only

The proposed classification allows users to easily discard, from the whole set of mapped tags, those attributed to a specific kind of isoform.
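The twelve classes listed above follow a regular naming pattern, which the sketch below reproduces from three observations on a tag. The function classify_tag and its arguments are hypothetical; the sketch only mirrors the class names given in this section and does not show how isomiR-SEA derives the flags.

```python
def classify_tag(has_5p_variation, n_mismatches, has_3p_variation):
    """Build the class label of a tag from its isoform components."""
    parts = []
    if has_5p_variation:
        parts.append("iso_5p")
    if n_mismatches == 1:
        parts.append("iso_snp")
    elif n_mismatches > 1:
        parts.append("iso_multi_snp")
    if has_3p_variation:
        parts.append("iso_3p")

    if not parts:
        return "mirna_exact"
    if len(parts) == 1:
        return parts[0] + "_only"
    return "-".join(parts)


print(classify_tag(False, 0, False))
print(classify_tag(True, 1, True))
print(classify_tag(False, 2, False))
```

Enumerating the two 5' states, the three mismatch states and the two 3' states gives exactly the twelve labels of the list.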

3.3 isomiR-SEA Algorithm

The isomiR-SEA analysis pipeline is implemented in C++ and takes advantage of the SeqAn library [?].

Thanks to the data analysis algorithms it provides, the SeqAn library allows isomiR-SEA to adopt a modular framework and to be very fast even when computationally expensive tasks have to be performed. The isomiR-SEA algorithm, shown in Figure 3.5, is divided into the following phases:

• The preprocessing step, in which parameters are set and input files are loaded in memory;

• The alignment procedure, in which each miRNA seed is compared with all the tags to check for similarities, so that sequences that are too divergent are discarded, with a remarkable reduction in search time and space. This activity applies all the rules developed from the biological evidence on miRNA structure;

• The output file generation, which distinguishes between the so-called unique and multi-mapped tags.

3.3.1 Parameters setting

Twelve parameters allow users to select the preferred configuration for an isomiR-SEA run:

• species_codes: Selects one or more species as the reference on which tags will be aligned. Default: Human (hsa);

• min_tag_length: Specifies the minimum alignment length after the second ungapped extension required to proceed further in the alignment procedure. Default: 10;

• seed_size: Indicates the size of the seed to be searched for in the tags. Default: 6;

• begin_seed: Fixes the location of the first miRNA seed nucleotide. Default: 2;

• end_seed: Locates the last miRNA seed nucleotide. Default: 7;

• tag_begin: Specifies the maximum location in the tag at which the seed is searched for. Default: 4 (in order to allow iso 5p predictions);

• path_out: Specifies the location of the output files;

• path_in: Sets the location of the input file;

• db_file_name: Specifies the name of the reference database to be used (without the .txt extension). The current isomiR-SEA release requires a .txt file;

• tag_file_name: Specifies the name of the input tags file (without the .txt extension). The current isomiR-SEA release requires a .txt file;

• dna_or_rna: Boolean flag used to skip the conversion from DNA to RNA nucleotides when RNA reads are provided as input. Default: 1;

• tag_selection_thr: Specifies the maximum alignment score for multi-mapped tags to be reported in the output file. Default: 11.

Figure 3.5: isomiR-SEA flowchart.

The usage of these parameters is further detailed in the rest of the guide, where they are encountered.

3.3.2 miRNAs DB and tags counts ﬁle processing

In this phase both the database containing the reference miRNA sequences and the input tags counts file are processed. miRNA reference sequences can be downloaded from several public repositories, such as the TargetScan database [5], HMDD [6], PolymiRTS [7], miRNEST [8], CoGemiR [9] and the widely adopted miRBase [10].

The format currently accepted by isomiR-SEA for the reference file is that of the miRBase mature miRNAs file. More details about this file format are provided in Section 4.1.

The miRBase v21 set includes 35828 mature miRNAs from 223 different species among animals, bacteria and plants.

In the seed-based sorting phase of the mature miRNA DB, seeds are extracted from the miRNA sequences. All the miRNAs belonging to the same family, which therefore share the same seed, are then classified by species and grouped together. The seed-based groups are stored in a convenient data structure so that they are easily accessible during the alignment procedure (Figure 3.5).
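A possible shape for such a seed-indexed structure is sketched below. It is an assumption made for illustration, not a description of the C++ data structure used by the tool: the function name group_by_seed is hypothetical, the species code is assumed to be the prefix of the miRNA name before the first hyphen (as in miRBase names such as hsa-...), and the record names and sequences are invented examples.

```python
from collections import defaultdict


def group_by_seed(records, begin_seed=2, end_seed=7):
    """Group (name, sequence) records by seed and then by species code.

    Returns a nested mapping: {seed: {species_code: [(name, sequence), ...]}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for name, sequence in records:
        species = name.split("-", 1)[0]           # e.g. "hsa" from "hsa-..."
        seed = sequence[begin_seed - 1:end_seed]  # nucleotides 2..7 by default
        groups[seed][species].append((name, sequence))
    return groups


# Hypothetical records used only to exercise the function.
records = [
    ("hsa-example-1", "UAGCAGCACAUAAUGGUUUGUG"),
    ("hsa-example-2", "UAGCAGCACGUAAAUAUUGGCG"),
    ("mmu-example-1", "UAGCAGCACAUAAUGGUUUGUG"),
    ("hsa-example-3", "UGAGGUAGUAGGUUGUAUAGUU"),
]
groups = group_by_seed(records)
for seed in sorted(groups):
    for species in sorted(groups[seed]):
        print(seed, species, [name for name, _ in groups[seed][species]])
```

With this layout, a seed found in a tag gives direct access to every miRNA of the selected species that shares it.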

Regarding the inputs, isomiR-SEA accepts the standard tags counts file format, in which each tag sequence is followed by a number representing the occurrence of identical reads supporting that tag (Section 4.1). This file is loaded in memory.

Although the conversion from fastq reads to the tags counts file is outside the scope of this manual, it is briefly described here for completeness. First, the adapter sequences introduced during library preparation are removed from the reads with the Flexbar tool [10]. Then the clipped reads are collapsed by means of an ad hoc script to obtain the so-called table of counts file, which stores all the different clipped read sequences detected in the sample (tags) together with their occurrence (tag_count).

3.3.3 isomiR-SEA alignment procedure

Each miRNA seed is searched for in the tags contained in the input file. If the seed is found in a tag, the alignment between the miRNA and the tag is extended without gaps in both the 5' and 3' directions (ungapped extension). Similarly, the tag is aligned on all the miRNAs that share the identified seed.

For each alignment, the position on the tag where the first match has been encountered is checked for compatibility with the standard miRNA seed position. Tags whose seed is located in an incompatible position are discarded, as shown in Figure 3.6.G.

All the tags with a correctly located miRNA seed are instead extended in the 3' direction, allowing for the presence of a second mismatch.

A threshold, called min_tag_length, is imposed in this phase to evaluate the distance of this second mismatch from the first one: tags that do not satisfy the threshold, because their two mismatches are too close to each other, are not considered in the next alignment steps, as depicted in Figure 3.6.H. Alignments whose two mismatches are far enough from each other are instead extended once again without gaps until the third mismatch, which marks the end of the alignment procedure.
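The extension steps described above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the actual implementation: positions are 0-based, the seed match is taken as already located, the 3' extensions simply step over one mismatch at a time, and the min_tag_length check is applied as "alignment length after the second extension must be at least the threshold". The function names match_run and extend_alignment are hypothetical.

```python
def match_run(a, b, i, j, step):
    """Count consecutive matching positions from a[i], b[j], moving by step."""
    n = 0
    while 0 <= i < len(a) and 0 <= j < len(b) and a[i] == b[j]:
        n += 1
        i += step
        j += step
    return n


def extend_alignment(mirna, tag, m_seed, t_seed, seed_size=6, min_tag_length=10):
    """Extend an exact seed match; return (begin_mirna, begin_tag, sizes) or None.

    sizes holds the alignment length after each ungapped extension.
    """
    # First extension: grow the seed match in both directions until a mismatch.
    left = match_run(mirna, tag, m_seed - 1, t_seed - 1, -1)
    right = match_run(mirna, tag, m_seed + seed_size, t_seed + seed_size, +1)
    begin_m, begin_t = m_seed - left, t_seed - left
    size = left + seed_size + right
    sizes = [size]

    # Second and third extensions: step over one mismatch, continue towards 3'.
    for extension in (2, 3):
        i, j = begin_m + size, begin_t + size
        if i >= len(mirna) or j >= len(tag):
            break  # one sequence ended: there is no mismatch to step over
        if extension == 3 and size < min_tag_length:
            return None  # the first two mismatches are too close (assumed rule)
        size += 1 + match_run(mirna, tag, i + 1, j + 1, +1)
        sizes.append(size)
    return begin_m, begin_t, sizes


mirna = "UAGCAGCACAUAAUGGUUUGUG"            # illustrative sequence
one_mismatch = "UAGCAGCACGUAAUGGUUUGUG"     # differs at position 10
two_adjacent = "UAGCAGCACGCAAUGGUUUGUG"     # differs at positions 10 and 11
print(extend_alignment(mirna, one_mismatch, 1, 1))
print(extend_alignment(mirna, two_adjacent, 1, 1, min_tag_length=11))
```

The first tag is aligned over its whole length with a single mismatch, while the second one is rejected because its alignment is still too short when the second mismatch is met.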

The isomiR-SEA alignment policy is further detailed in the eight cases reported in Figure 3.6, which refer to the hsa-miR-15a-5p miRNA. In all these examples the miRNA sequence is shown in the first line, the tag in the third line and the alignment representation in the middle. The numbered lines in each sub-figure identify the algorithm step in which the sub-sequences have been evaluated:

1. The red line numbered 1 represents the search for the miRNA seed sequence in the tag;

2. The blue line numbered 2 shows the seed extension step in both the 5' and 3' directions until the first mismatch is encountered;

3. The green line numbered 3 describes the alignment extension in the 3' direction until a second mismatch is found;

4. The violet line numbered 4 depicts the last extension step, executed if a minimum alignment length has been reached in the previous phases, until the last allowed mismatch.

For these tests the isomiR-SEA parameters were set as follows: Size_min = 11 (minimum alignment length for the last extension) and Begin_Tag_min < 4 (start position in the tag of the alignment on the miRNA seed). With these values, the alignments shown in Figure 3.6 with a light red background are discarded and those with a light green background are retained. Figure 3.6.A shows a tag alignment characterized by a mismatch in position 10. This mismatch leads the algorithm to continue the alignment procedure in its second phase, since no additional mismatches are detected between the miRNA and the tag sequence.

As depicted in Figure 3.6.B, the mismatch encountered in the tag at position 14 leads isomiR-SEA to the same behaviour highlighted in Figure 3.6.A.

In Figure 3.6.C the alignment procedure comprises all four phases of the algorithm, since the two mismatches found in the tag sequence are located in positions compatible with the imposed parameters.

The alignment in Figure 3.6.D instead ends in the second phase, after the first encountered mismatch, which is located in the last admissible nucleotide.

In Figure 3.6.E the tag alignment ends before the third gap, at position 20.

The mismatch detected in the tag at position 1 allows isomiR-SEA to proceed with the alignment, as depicted in Figure 3.6.F, when the parameters detailed above are imposed. This mismatch will, however, influence the overall alignment size, with consequences on the calculated alignment score.

According to Begin_Tag_min < 4, the tag in Figure 3.6.G is discarded after the second algorithm step because the tag alignment start position is beyond the third nucleotide. This condition, that is, a high number of nucleotides flanking the 5' side of the seed on the tag, could indeed indicate a different kind of RNA structure, such as a pre-miRNA, as discussed in Bartel et al. [?].

In Figure 3.6.H the tag is discarded because a second mismatch is detected in position 12, which does not satisfy the imposed Size_min threshold. Once again this choice depends on experimental observations: Bartel et al. [?] showed that sequences having two mismatches that are close to each other, and also close to the seed sequence, in most cases do not derive from miRNAs.

At the end of the alignment procedure, an alignment score is assigned to each alignment. This score is very useful when the same tag is mapped on more than one miRNA, or when the user wants to apply some kind of filtering to isolate the most significant aligned tags (those with a higher degree of similarity to the miRNA).

The alignment score ($align_{score}$), based on several alignment parameters, is calculated as reported in Equation 3.1.

$$align_{score} = \frac{miRNA_{size} - (align_{size} - \#gaps)}{miRNA_{size}} \times 100 \qquad (3.1)$$

where $miRNA_{size}$ is the length of the miRNA sequence, $\#gaps$ represents the number of gaps in the alignment (maximum 2) and $align_{size}$ is the miRNA-tag alignment length (remember that an alignment is stopped when the third gap is found). This parameter makes it possible to evaluate the alignment quality: lower scores indicate better alignments.

Figure 3.6: isomiR-SEA alignment cases.

Moreover, in order to also consider the tag length when selecting the alignments reported as output of an isomiR-SEA run, a second parameter is computed as shown in Equation 3.2:

$$miRNA\_tag_{diff} = miRNA_{size} - tag_{size} \qquad (3.2)$$

Positive values of this parameter correspond to miRNA sequences longer than the tag, and vice versa.
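As a worked illustration of Equations 3.1 and 3.2, the two quantities can be computed with the short sketch below. The function names are chosen for this example only, and the numeric inputs are made up.

```python
def align_score(mirna_size, align_size, n_gaps):
    """Equation 3.1: share of the miRNA not covered by matches, in percent."""
    return (mirna_size - (align_size - n_gaps)) / mirna_size * 100


def mirna_tag_diff(mirna_size, tag_size):
    """Equation 3.2: positive when the miRNA is longer than the tag."""
    return mirna_size - tag_size


# A 22-nt miRNA aligned over its full length without gaps scores 0 (best).
print(align_score(22, 22, 0))
# The same alignment with one gap leaves 1 of 22 positions unmatched.
print(round(align_score(22, 22, 1), 2))
# A 20-nt tag against a 22-nt miRNA.
print(mirna_tag_diff(22, 20))
```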

These two parameters are used in the last phase of the algorithm to choose the best alignment for tags that have been mapped on more than one miRNA.

In future versions, users will also be able to apply the proposed scoring scheme to uniquely mapped tags, so that a minimum alignment quality can be required before a tag assignment is reported.

3.3.4 isomiR-SEA output ﬁles

All the tags found to be aligned on the miRNA sequences are finally saved in two files according to the kind of alignment detected. In particular, if a tag is aligned on only one miRNA it is saved into a unique tags file (see Section 4.2); otherwise, if it is mapped on multiple miRNAs, it is saved into an ambiguous tags file.

Moreover, the multi-mapped tags are scanned and assigned to the miRNA for which they present the lowest alignment score ($align_{score}$) and the smallest $miRNA\_tag_{diff}$. After this selection the resulting tags are organized into a new file with the same format as the one used for the uniquely mapped tags, which additionally reports, for each tag, the number of miRNAs and miRNA families on which the tag has been mapped and the difference between the alignment score of the tag on the chosen miRNA and its score on the second-ranked one.
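The selection rule for multi-mapped tags can be sketched as a sort on the two parameters. This is an interpretation rather than the tool's code: the function select_best is hypothetical, candidates are represented as plain dictionaries, "smaller" $miRNA\_tag_{diff}$ is assumed to mean smaller in absolute value, and the tag_selection_thr filter is not modelled.

```python
def select_best(candidates):
    """Pick the best miRNA for a multi-mapped tag.

    candidates: list of dicts with keys "mirna", "align_score" and
    "mir_tag_size_diff". Returns (best_candidate, score_gap), where
    score_gap is the score of the runner-up minus the score of the winner,
    or None when there is a single candidate.
    """
    ranked = sorted(
        candidates,
        key=lambda c: (c["align_score"], abs(c["mir_tag_size_diff"])),
    )
    best = ranked[0]
    score_gap = None
    if len(ranked) > 1:
        score_gap = ranked[1]["align_score"] - best["align_score"]
    return best, score_gap


# Invented candidates for one tag.
candidates = [
    {"mirna": "miRNA_A", "align_score": 9.1, "mir_tag_size_diff": 1},
    {"mirna": "miRNA_B", "align_score": 4.5, "mir_tag_size_diff": 0},
    {"mirna": "miRNA_C", "align_score": 4.5, "mir_tag_size_diff": -2},
]
best, score_gap = select_best(candidates)
print(best["mirna"], score_gap)
```

Here two candidates share the lowest score, so the length difference breaks the tie and the reported score gap is zero.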

Furthermore, two miRNA expression files are provided to the user: one for the uniquely mapped tags and one for those ambiguously aligned. These files contain useful information such as the number of different tags attributed to each miRNA and the tag counts of the detected isomiRs and interaction sites (details in Section 4.2).

4 Files formats

4.1 Input ﬁles

Two input files are required to run the isomiR-SEA tool: a file containing the mature miRNA reference sequences and the tags counts file. The mature miRNAs file has to be provided in the standard fasta format but, to be compatible with isomiR-SEA, its extension has to be .txt (this limitation will be removed in the next versions).
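One way to obtain a .txt reference file from the compressed miRBase download is sketched below. The function name gunzip_to_txt and the file names in the usage comment are illustrative assumptions; the content of the file is copied unchanged, only the compression and the extension differ.

```python
import gzip
import shutil


def gunzip_to_txt(gz_path, txt_path):
    """Decompress a gzipped FASTA file and save it under a .txt name."""
    with gzip.open(gz_path, "rb") as src, open(txt_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return txt_path


# Example call (file names are placeholders):
# gunzip_to_txt("mature.fa.gz", "mature_21.txt")
```

The resulting file would then be passed to the -m option without its .txt extension.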

Concerning the tags counts file (Figure 4.1), we have implemented an ad hoc script that, given the fastq file containing the input reads, removes the 3' adapter sequences using the Flexbar tool [10], then collapses all the identical clipped reads into a unique tag and counts the occurrence of the reads supporting it. Each line of the tag file reports the tag sequence in the first column and the occurrence of the reads supporting it in the second column. The extension of this file should be .txt.

Figure 4.1: Tags counts file format
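A reader for this two-column layout could look like the following sketch. The column separator is assumed to be whitespace, since the manual does not state it, and the function name parse_tag_counts is hypothetical. The optional conversion replaces T with U, which is what a DNA to RNA conversion of the tag sequence amounts to.

```python
def parse_tag_counts(lines, to_rna=True):
    """Parse tags counts lines into (tag_sequence, tag_count) pairs.

    lines: iterable of text lines, each holding a sequence and a count
    separated by whitespace (assumed). Blank or incomplete lines are skipped.
    """
    tags = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        sequence = fields[0].upper()
        if to_rna:
            sequence = sequence.replace("T", "U")
        tags.append((sequence, int(fields[1])))
    return tags


example_lines = ["TAGCAGCACATAATGGTTTGTG\t1520\n", "\n", "TGAGGTAGTAGGTTGTATAGTT 87\n"]
for sequence, count in parse_tag_counts(example_lines):
    print(sequence, count)
```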

4.2 Output ﬁles

Nine output files are generated by an isomiR-SEA run. The names of these files are formatted as out_result_DBNAME_SPECIALIZATION.txt, with the exception of the first two files, which store the mature miRNA sequences grouped and sorted on the basis of miRNA seed and species. These files are:

• mature_21_db_group.txt

• mature_21_db_group_sorted.txt

• out_result_mature_21_ambigue.txt

• out_result_mature_21_ambigue_ambigue_selected.txt

• out_result_mature_21_tag_ambigue.txt

• out_result_mature_21_tag_ambigue_selected.txt

• out_result_mature_21_tag_unique.txt

• out_result_mature_21_unique.txt

• out_result_mature_21_unique_ambigue_selected.txt

The meaning of each field included in these files is described below.

1. mature_21_db_group.txt is a file printed as output of the first phase of the algorithm, when the mature miRNAs database is loaded into memory. It is the representation of the memory structure used during the alignment procedure; it can be stored and then reused in all the isomiR-SEA alignments that adopt the same miRNA reference sequences. The first line of this file reports all the species codes detected in the input miRNAs database; the following lines are organized as follows:

• The first column reports the seed sequence;

• The second column indicates the species code of the miRNAs that follow;

• The third column reports the first miRNA sequence characterized by that specific seed;

• The fourth column reports some information related to the first detected miRNA;

• The other columns report all the other miRNAs characterized by the same seed and species.

2. mature_21_db_group_sorted.txt has the same format as the previous file, but it is alphabetically sorted on the first field, the miRNA seed sequence.

3. out_result_mature_21_tag_unique.txt stores, for each tag, the unique miRNA on which it has been mapped. This file contains the tags that have been assigned to a single miRNA, together with a series of information extracted during the alignment phase and listed below:

• tag_index: An integer value that stores the unique tag index assigned to each tag during the loading phase (it is equal to the tag row position in the input tags counts file);

• tag_sequence: The original tag sequence;

• tag_quality: In this isomiR-SEA version it stores an "e" value. In the next releases of the tool, however, it will be dedicated to storing the average tag quality computed for each nucleotide of the tag;

• #count_tags: The number of identical reads supporting the tag;

• mirna_id: An identification number for the miRNA (this number is assigned during the mature miRNAs database loading phase);

• mirna_name: A series of information about the miRNA that has been assigned to the tag. These data are recovered from the miRNAs database;

• mirna_seq: The sequence of the miRNA that has been assigned to the tag;

• seed_index: The identification number of the seed characterizing the miRNA assigned to the tag (attributed during the loading phase);

• begin_ungapped_mirna: The miRNA nucleotide position where the tag alignment begins;

• begin_ungapped_tag: The tag nucleotide position where the alignment with the selected miRNA begins;

• size_ungapped: The number of nucleotides aligned during the first ungapped extension. This number refers to perfect matches between the two sequences;

• size_ungapped_1: The number of nucleotides aligned during the second ungapped extension. This number refers to perfect matches and a single mismatch between the two sequences;

• size_ungapped_2: The number of nucleotides aligned during the third ungapped extension;

• align_score: The alignment score computed with Equation 3.1;

• mir_tag_size_diff: The score calculated as reported in Equation 3.2;

• mirna_exact, iso_5p, iso_snp, iso_multi_snp, iso_3p, offset_site, suppl_compens_site, central_site: The flags recovered during the alignment phase, which provide information about the isoforms identified and the interaction sites conserved in the miRNA-tag alignment.

4. out_result_mature_21_tag_ambigue.txt stores all the tags that have been mapped on more than one miRNA. It is formatted like the previous unique file, with a small difference: isoform and interaction site flags are not reported, and each line stores, for every miRNA on which the tag has been mapped, the miRNA and alignment information previously described for out_result_mature_21_tag_unique.txt. All the miRNAs associated with a tag are therefore reported on the same tag line, separated by the pipe (|) character.

5. out_result_mature_21_tag_ambigue_selected.txt reports all the tags initially evaluated as ambiguous that, after the analysis of the mapping scores, have been assigned to a specific miRNA.

This file shares the format of out_result_mature_21_tag_unique.txt, with the addition of three parameters:

a) mir2-mir1_align_score: The difference between the score of the second best aligned miRNA and the score of the best aligned one;

b) num_mir_family: The number of different miRNA families on which the tag has been mapped;

c) num_of_mapped_miRNAs: The number of miRNAs that have been attributed to the same tag.

6. out_result_mature_21_unique.txt reports, for each miRNA, the overall number of assigned tags and some other information concerning isoforms and interaction sites. Only tags uniquely assigned to a miRNA are considered here.

Reported features are:

a) mirna_index: The identification number of the miRNA;

b) #different_tags: The number of tags with different sequences mapped on this miRNA;

c) #count_tags: The overall number of reads supporting the miRNA;

d) norm_expressed_all: A field not used in the current version of isomiR-SEA that will be considered in future versions of the tool;

e) norm_expressed_mapped: A field not used in the current version of isomiR-SEA that will be considered in future versions of the tool;

f) #mirna_exact_tags: The number of reads whose sequence is identical to the miRNA;

g) #iso_5p_tags: The number of reads having some mutations on the 5' side. A hierarchical aggregation method has been adopted for this and the following parameters: if a tag presents a mutation on both the 5' and the 3' side, that tag is counted only for iso_5p. This choice avoids overestimation due to the assignment of tags to more than one isoform bin. However, the complete isoform spectrum can be retrieved from the tag files;

h) #iso_snp_tags: The number of reads characterized by a SNP. As for the previous field, the assignment of tags to this category has priority over iso_3p;

i) #iso_multi_snp_tags: The number of reads having more than one SNP in the alignment;

j) #iso_3p_tags: The number of reads having only mutations on the 3' side;

k) #offset_site_tags: The number of reads exactly mapped at miRNA nucleotide 8;

l) #suppl_compens_site_tags: The number of reads exactly mapped on miRNA positions 2-16;

m) seed_index: The miRNA seed identification number;

n) mirna_seq: The miRNA sequence;

o) mirna_info: The miRNA information retrieved from the miRNAs reference database;

7. out_result_mature_21_ambigue.txt contains the same information for the re-evaluated tags initially marked as multi-mapped.

The last two files, out_result_mature_21_ambigue_ambigue_selected.txt and out_result_mature_21_unique_ambigue_selected.txt, store information about the discarded multi-mapped alignments.
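The hierarchical aggregation described for the #iso_5p_tags and #iso_snp_tags fields can be illustrated with the sketch below. The priority order iso_5p, iso_snp, iso_multi_snp, iso_3p is an assumption based on the order in which the fields are listed; the manual only states explicitly that iso_5p and iso_snp take precedence over iso_3p. The function names and the input layout are hypothetical.

```python
from collections import Counter

# Assumed priority order; each tag is counted in the first bin whose flag is set.
PRIORITY = ("iso_5p", "iso_snp", "iso_multi_snp", "iso_3p")


def isoform_bin(flags):
    """Return the single isoform bin in which a tag is counted."""
    for name in PRIORITY:
        if flags.get(name):
            return name
    return "mirna_exact"


def aggregate(tags):
    """Sum read counts per isoform bin.

    tags: iterable of (read_count, flags) pairs, where flags maps
    isoform names to booleans.
    """
    totals = Counter()
    for read_count, flags in tags:
        totals[isoform_bin(flags)] += read_count
    return totals


# Invented tags for one miRNA.
tags = [
    (10, {"iso_5p": True, "iso_3p": True}),   # counted only as iso_5p
    (5, {"iso_snp": True, "iso_3p": True}),   # counted only as iso_snp
    (3, {"iso_3p": True}),
    (7, {}),                                  # exact match
]
print(dict(aggregate(tags)))
```

Because every tag falls into exactly one bin, the bin totals add up to the overall read count of the miRNA.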

Bibliography

[1] Peter Wilson, The Memoir Class for Configurable Typesetting – User Guide, 2010.

[2] Vincent Zoonekynd. On-line list of different chapter styles for LaTeX. Available at http://zoonek.free.fr/LaTeX/LaTeX_samples_chapter/0.html.

[3] Bartel, D.P. (2009) MicroRNAs: Target Recognition and Regulatory Functions. Cell, 136, 215–233.

[4] Shin, C., Nam, J. W., Farh, K. K. H., Chiang, H. R., Shkumatava, A. and Bartel, D. P. (2010) Expanding the microRNA targeting code: functional sites with centered pairing. Molecular Cell, 38, 789–802.

[5] Lewis, B. P., Burge, C. B. and Bartel, D. P. (2005) Conserved Seed Pairing, Often Flanked by Adenosines, Indicates that Thousands of Human Genes are MicroRNA Targets. Cell, 120:15-20.

[6] Lu M, Zhang Q, Deng M, Miao J, Guo Y, et al. (2008) An Analysis of Human MicroRNA and Disease Associations. PLoS ONE 3(10): e3420. doi:10.1371/journal.pone.0003420. http://www.microrna.org/microrna/

[7] Bhattacharya A, Ziebarth JD, Cui Y. PolymiRTS Database 3.0: linking polymorphisms in microRNAs and their target sites with human diseases and biological pathways. Nucleic Acids Res. 2014; 42(D1):D86-D91.

[8] Szcześniak, Michal Wojciech et al. miRNEST Database: An Integrative Approach in microRNA Search and Annotation. Nucleic Acids Research 40.Database issue (2012): D198-D204. PMC. Web. 11 Dec. 2014.

[9] Maselli, V., Di Bernardo, D. and Banfi, S. CoGemiR: a comparative genomics microRNA database. BMC Genomics 2008, 9:457 (6 October 2008). http://www.biomedcentral.com/1471-2164/9/457

[10] Matthias Dodt, Johannes T. Roehr, Rina Ahmed, Christoph Dieterich: Flexbar-flexible barcode and adapter processing for next-generation sequencing platforms. MDPI Biology 2012, 1(3):895-905.
