Overview
- Indexing
- Quantification
  - Library Format String
License
References

Sailfish: User Guide v0.6.3

Author: Rob Patro

Overview

Sailfish is a tool for transcript quantification from RNA-seq data. It requires a
set of target transcripts (either from a reference or a de novo assembly) to quantify.
All you need to run Sailfish is a FASTA file containing your reference transcripts
and a (set of) FASTA/FASTQ file(s) containing your reads. Sailfish runs in two
phases: indexing and quantification. The indexing step is independent of the
reads, and only needs to be run once for a particular set of reference transcripts
and choice of k (the k-mer size). The quantification step is specific to
the set of RNA-seq reads and is thus run more frequently.

Indexing

To quantify the abundance, in a sample, of a set of target transcripts, you must
first create a Sailfish index of this transcript set. The Sailfish index is actually
just a collection of different files, kept together inside a directory (the directory
itself is referred to as the index), that allow Sailfish to efficiently access the
information it needs about the target transcripts.

The Sailfish index is generated via the Sailfish index command. Like all
user-level commands in Sailfish, index is a subcommand of the main Sailfish
program. Thus, the index command is invoked as:

> sailfish index [options]

The options are as follows:

• -v | --version  Print the version of Sailfish being used and exit.

• -h | --help  Print the help message describing the parameters.

• -t | --transcripts  A FASTA format file containing the transcripts on which
  the index will be built.

• -m | --tgmap  Provide a transcript-to-gene map (currently unused).

• -k | --kmerSize  The size of the k-mer on which the index is built. There
  is a tradeoff here between the distinctiveness of the k-mers and their
  robustness to errors. The shorter the k-mers, the more robust they will be
  to errors in the reads, but the longer the k-mers, the more distinct they
  will be. We generally recommend using a k-mer size of at least 20. Because
  of the way k-mers are encoded in Sailfish, the current maximum k-mer size
  is 32, but this may change in the future.

• -o | --out  The directory in which the Sailfish index will be placed.

• -p | --threads  The maximum number of concurrent threads to use when
  building the index.

• -f | --force  If -o is provided with a directory that already exists, then
  force the rebuilding of the index, replacing the current contents of that
  directory.
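The trade-off described for the -k option can be made concrete with a little arithmetic. A read of length L contains L - k + 1 overlapping k-mers, and a single erroneous base falls inside at most k of them, so longer k-mers lose a larger share of a read to each error. The following is an illustrative Python sketch, not part of Sailfish; the function names and the read length of 76 are assumptions chosen for the example.

```python
def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmers_hit_by_error(read_len, k, error_pos):
    """Count the k-mers of a read that overlap one erroneous base.

    error_pos is the 0-based position of the error within the read.
    """
    first_start = max(0, error_pos - k + 1)    # earliest k-mer covering the error
    last_start = min(error_pos, read_len - k)  # latest k-mer covering the error
    return max(0, last_start - first_start + 1)


if __name__ == "__main__":
    read_len = 76  # hypothetical read length
    for k in (20, 31):
        total = read_len - k + 1
        lost = kmers_hit_by_error(read_len, k, error_pos=read_len // 2)
        print(f"k={k}: {total} k-mers per read, {lost} affected by one mid-read error")
```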

To generate the Sailfish index for your reference set of transcripts, for example,
you would run a command like the following:

> sailfish index -t <ref_transcripts> -o <out_dir> -k <kmer_len>

This will build a Sailfish index for k-mers of length <kmer_len> for the reference
transcripts provided in the file <ref_transcripts> and place the index under the
directory <out_dir>.
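If the indexing step is driven from a script, the command line above can be assembled programmatically. The sketch below uses only the flags documented in this section; the helper name build_index_command, the example file names, and the lower bound of 1 on the k-mer size are assumptions (the manual states only the maximum of 32).

```python
import subprocess


def build_index_command(transcripts, out_dir, kmer_len, threads=None, force=False):
    """Assemble the argument list for `sailfish index`."""
    # The manual gives 32 as the current maximum; the lower bound is assumed.
    if not 1 <= kmer_len <= 32:
        raise ValueError("k-mer size must be between 1 and 32")
    cmd = ["sailfish", "index", "-t", transcripts, "-o", out_dir, "-k", str(kmer_len)]
    if threads is not None:
        cmd += ["-p", str(threads)]
    if force:
        cmd.append("-f")
    return cmd


if __name__ == "__main__":
    # Hypothetical file and directory names.
    cmd = build_index_command("transcripts.fa", "sailfish_index", 20, threads=4)
    print(" ".join(cmd))
    # Uncomment to run; this requires sailfish to be on the PATH.
    # subprocess.run(cmd, check=True)
```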

Quantification

Now that you have generated the Sailfish index (say that it’s the directory
<index_dir>, which corresponds to the <out_dir> argument provided in the
previous step), you can quantify the transcript expression for a given set of
reads. Just as indexing is performed via the index subcommand of Sailfish,
quantification is performed via the quant subcommand. The quantification
command is invoked as follows:

> sailfish quant [options]

The options are as follows:

• -v | --version  Print the version of Sailfish being used and exit.

• -h | --help  Print the help message describing the parameters.

• -i | --index  The path to the Sailfish index built on the set of target
  transcripts.

• -l | --libtype  A string describing the library type of the provided reads.
  For a description of the different possible library strings and their meanings,
  see the Library Format String section below.

• -r | --unmated_reads  A list of one or more FASTA or FASTQ format
  files (or a named pipe providing reads in one of these formats) containing
  unpaired reads. This option should directly follow the -l option, and is
  only valid if the library format is of the single-end type (SE).

• -1 | --mates1  A list of one or more FASTA or FASTQ format files (or a
  named pipe providing reads in one of these formats) containing the first
  mate for a set of reads. This option should directly follow the -l option,
  and is only valid if the library format is of the paired-end type (PE).

• -2 | --mates2  A list of one or more FASTA or FASTQ format files (or a
  named pipe providing reads in one of these formats) containing the second
  mate for a set of reads. This option should directly follow the -1 option,
  and is only valid if the library format is of the paired-end type (PE). This
  list should contain the same number of files (paired read-for-read) as the
  list of mates provided to the -1 option.

• --no_bias_correct  Normally, Sailfish outputs two quantification files in
  the requested output directory, quant.sf and quant_bias_corrected.sf.
  If this option is provided, bias correction is not performed and the
  bias-corrected file is not produced.

• -m | --min_abundance  Set to 0 the abundance of any transcripts with a
  computed K-mers Per Kilobase per Million mapped k-mers (KPKM) lower
  than the provided value.

• -o | --out  The output directory where the quantification results (and other
  relevant files) are written.

• -n | --iterations  The maximum number of iterations of the EM step
  to carry out. The optimization algorithm that computes the transcript
  estimates will terminate when either the convergence criterion specified by
  the -d option (below) is met, or when this number of iterations has been
  performed.

• -d | --delta  The maximum allowable delta between consecutive iterations
  of the optimization procedure. If the maximum relative change in any
  transcript’s abundance is less than this value between two consecutive
  iterations of the optimization, then the procedure will be considered to
  have converged and the optimization will terminate.

• -p | --threads  The maximum number of threads to use when counting
  k-mers and computing transcript abundance.

• -f | --force  By default, if the output folder provided to the -o option
  already exists, the k-mer counts recorded by the previous run will be used
  and only the quantification will be performed again. Passing in this option
  forces both a re-counting of the k-mers and a re-quantification of the target
  transcripts.

• -a | --polya  If this flag is set, then polyA/polyT k-mers will not be counted.
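The interaction between the -n and -d options amounts to a two-part stopping rule: stop once the largest relative change in any abundance drops below delta, or once the iteration limit is reached, whichever comes first. The following is a minimal Python sketch of that rule only. It is not Sailfish's EM implementation; the update function is a hypothetical placeholder, and skipping zero-valued entries when computing the relative change is an assumption of the sketch.

```python
def max_relative_change(old, new):
    """Largest relative change between two abundance vectors.

    Entries that were zero in `old` are skipped (an assumption of this
    sketch) to avoid dividing by zero.
    """
    changes = [abs(n - o) / o for o, n in zip(old, new) if o > 0]
    return max(changes, default=0.0)


def iterate(update, abundances, max_iterations, delta):
    """Apply `update` until convergence or until max_iterations is reached.

    Returns the final abundances and the number of iterations performed.
    """
    for iteration in range(1, max_iterations + 1):
        updated = update(abundances)
        converged = max_relative_change(abundances, updated) < delta
        abundances = updated
        if converged:
            return abundances, iteration
    return abundances, max_iterations


if __name__ == "__main__":
    # Placeholder update: move each value halfway toward a fixed target.
    target = [10.0, 30.0, 60.0]
    step = lambda xs: [x + 0.5 * (t - x) for x, t in zip(xs, target)]
    result, n = iterate(step, [1.0, 1.0, 1.0], max_iterations=1000, delta=5e-3)
    print(n, [round(x, 2) for x in result])
```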

So, a typical invocation of the Sailfish quant command will look something
like the following:

> sailfish quant -i <index_dir> -l "<libtype>" \
  {-r <unmated> | -1 <mates1> -2 <mates2>} -o <quant_dir>

Here, <index_dir> is, as described above, the location of the Sailfish index;
<libtype> is a string describing the format of the read library (see Library
Format String below); <unmated> is a list of files containing unmated reads; and
<mates{1,2}> are lists of files containing, respectively, the first and second
mates of paired-end reads. Finally, <quant_dir> is the directory where the output
should be written.
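The choice between -r and the -1/-2 pair, and the ordering of options recommended above, can be captured in a small helper. This is a sketch under stated assumptions: build_quant_command is a hypothetical name, and it uses only the flags documented in this section.

```python
def build_quant_command(index_dir, libtype, out_dir,
                        unmated=None, mates1=None, mates2=None):
    """Assemble the argument list for `sailfish quant`.

    Pass `unmated` for single-end libraries, or `mates1` and `mates2`
    (lists of equal length) for paired-end libraries.
    """
    # The read options are placed directly after -l, as the manual recommends.
    cmd = ["sailfish", "quant", "-i", index_dir, "-l", libtype]
    if unmated and not (mates1 or mates2):
        cmd += ["-r", *unmated]
    elif mates1 and mates2 and not unmated:
        if len(mates1) != len(mates2):
            raise ValueError("-1 and -2 need the same number of files")
        cmd += ["-1", *mates1, "-2", *mates2]
    else:
        raise ValueError("give either unmated reads or both mate lists")
    cmd += ["-o", out_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical file and directory names.
    print(build_quant_command("sailfish_index", "T=PE:O=><:S=SA", "quant_out",
                              mates1=["r_1.fq"], mates2=["r_2.fq"]))
```

When such a list is handed to Python's subprocess.run without a shell, the library format string needs no extra quoting, because each list element is passed to the program as a single argument.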

When the quantification step is finished, the directory <quant_dir> will contain
a file named “quant.sf”, which holds the result of the Sailfish quantification
step. The file contains a number of columns (which are listed in the last of the
header lines beginning with ‘#’). Specifically, the columns are (1) Transcript ID,
(2) Transcript Length, (3) Transcripts per Million (TPM), (4) Reads Per Kilobase
per Million mapped reads (RPKM), (5) K-mers Per Kilobase per Million mapped
k-mers (KPKM), (6) Estimated number of k-mers (an estimate of the number
of k-mers drawn from this transcript given the transcript’s relative abundance
and length) and (7) Estimated number of reads (an estimate of the number of
reads drawn from this transcript given the transcript’s relative abundance and
length). The first two columns are self-explanatory, the next four are measures
of transcript abundance, and the final one is a commonly used input for
differential expression tools.

The Transcripts per Million quantification number is computed
as described in [1], and is meant as an estimate of the number of transcripts, per
million observed transcripts, originating from each isoform. Its benefit over the
K/RPKM measure is that it is independent of the mean expressed transcript
length (i.e. if the mean expressed transcript length varies between samples, for
example, this alone can affect differential analysis based on the K/RPKM). The
RPKM is a classic measure of relative transcript abundance, and is an estimate
of the number of reads per kilobase of transcript (per million mapped reads)
originating from each transcript. The KPKM should closely track the RPKM,
but is also defined for very short features that are longer than the chosen k-mer
length but may be shorter than the read length. Typically, you should prefer
the KPKM measure to the RPKM measure, since the k-mer is the most natural
unit of coverage for Sailfish.
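The relationship between RPKM and TPM can be illustrated with the textbook definitions of both measures. The sketch below is Python written for this guide, not Sailfish code: Sailfish derives its estimates from k-mer counts, whereas this example starts from made-up read counts and uses raw transcript lengths, which is a simplification.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


if __name__ == "__main__":
    counts = [100, 200, 700]     # hypothetical reads per transcript
    lengths = [1000, 2000, 500]  # hypothetical transcript lengths in bases
    for name, values in (("RPKM", rpkm(counts, lengths)), ("TPM", tpm(counts, lengths))):
        print(name, [round(v, 1) for v in values], "sum =", round(sum(values), 1))
```

Within one sample the two measures are proportional to each other. The difference is that TPM values always sum to one million, while the sum of the RPKM values depends on the lengths of the expressed transcripts, which is why RPKM is harder to compare across samples.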

Library Format String

The library format string is given as a parameter to the quant step of Sailfish.
Since Sailfish works with the reads directly and not alignments, the purpose of
this string is to inform Sailfish of relevant information about the reads in the
library. Not all of this information is currently used, but some of it is and other
pieces of it may be in the future.

The library format string consists of 3 parts (one of which is sometimes optional),

provided as key-value pairs. The relevant keys and possible value options are:

(T|TYPE)=(PE|SE)

This option specifies the “paired-end” status of the read library. If the reads are
paired-end, then this should be set to PE, and the library format string should
be followed by the -1 and -2 options with the respective mate-pair reads. If the
reads are unpaired, then this should be set to SE and the library format string
should be followed by the -r option and the list of files containing unpaired reads.

(O|ORIENTATION)=(>>|<>|><)

This option specifies the relative orientation of reads within a pair. If the library
consists of unpaired reads, then this key-value pair should be omitted.
If the library consists of paired-end reads, this key-value pair should be provided.
Note that this denotes the relative orientation of the reads, not their absolute
directionality with respect to the reference. The options are meant to denote,
visually, how the reads could be oriented.

The first option, >>, denotes that the mates are oriented in the same direction;
e.g. if the 5’ end of mate 1 is upstream from the 3’ end, then the 5’ end of mate
2 is upstream from its 3’ end and vice versa.

The second option, <>, denotes that the mates are oriented away from each other.
This implies that the start of mate 1 is closer to the start of mate 2 than to the
end of mate 2, etc.

The third option, ><, is, perhaps, the most common relative orientation. It denotes
that the mates are oriented toward each other, so that the start of mate 1 is
farther from the start of mate 2 than it is from the end of mate 2 and vice versa.

(S|STRAND)=(AS|SA|S|A|U)

This option specifies the strandedness of the reads. If the type is SE, the only
allowable options are S, A, and U, which denote, respectively, that the reads come
from the sense strand, the antisense strand, or are of unknown strandedness (in
which case both strands are tried and the one resulting in more matching k-mers
is used).

If the type of the read library is PE, then any of the options are valid. The S, A
and U options described in the above paragraph have the same meaning. The AS
option specifies that mate 1 is from the antisense strand and mate 2 is from the
sense strand, while SA specifies that mate 1 is from the sense strand and mate 2
is from the antisense strand.

Because of the way argument parsing works, the library format string must be
enclosed in quotation marks. An example format string specifying that the read
library consists of unpaired reads of unknown strandedness is:

-l "T=SE:S=U"

Alternatively, a format string specifying that the read library consists of
paired-end reads, oriented toward each other, where mate 1 is from the sense
strand and mate 2 is from the antisense strand is:

-l "T=PE:O=><:S=SA"

Many, but not all, combinations of the three options (type, orientation and
strandedness) are possible. Sailfish will perform a coarse-grained sanity check to
ensure that the provided library string is not impossible (e.g. T=SE:O=><:S=A,
which is not possible because unpaired reads can’t have a relative orientation).
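The rules in this section can be collected into a small validator. The following Python sketch illustrates the kind of coarse-grained check described above; it is not Sailfish's actual parser. The function name parse_libtype is hypothetical, and the sketch assumes that fields are separated by colons (as in the examples), that keys and values are case-sensitive, and that the strand field is always required.

```python
KEY_ALIASES = {"T": "T", "TYPE": "T", "O": "O", "ORIENTATION": "O",
               "S": "S", "STRAND": "S"}
ORIENTATIONS = {">>", "<>", "><"}
# Strand values allowed for each library type, as listed in this section.
STRANDS = {"PE": {"AS", "SA", "S", "A", "U"}, "SE": {"S", "A", "U"}}


def parse_libtype(fmt):
    """Parse and sanity-check a library format string such as 'T=PE:O=><:S=SA'."""
    fields = {}
    for part in fmt.split(":"):
        key, sep, value = part.partition("=")
        if not sep or key not in KEY_ALIASES:
            raise ValueError(f"unrecognized field: {part!r}")
        fields[KEY_ALIASES[key]] = value

    lib_type = fields.get("T")
    if lib_type not in STRANDS:
        raise ValueError("T must be PE or SE")
    if lib_type == "SE" and "O" in fields:
        raise ValueError("unpaired reads cannot have a relative orientation")
    if lib_type == "PE" and fields.get("O") not in ORIENTATIONS:
        raise ValueError("paired-end libraries need O set to >>, <> or ><")
    if fields.get("S") not in STRANDS[lib_type]:
        raise ValueError(f"S={fields.get('S')!r} is not valid for T={lib_type}")
    return fields


if __name__ == "__main__":
    for example in ("T=SE:S=U", "T=PE:O=><:S=SA", "T=SE:O=><:S=A"):
        try:
            print(example, "->", parse_libtype(example))
        except ValueError as err:
            print(example, "-> rejected:", err)
```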

License

This program is free software: you can redistribute it and/or modify it under

the terms of the GNU General Public License as published by the Free Software

Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along with

this program. If not, see <http://www.gnu.org/licenses/>.

References

[1] Li, Bo, et al. “RNA-Seq gene expression estimation with read mapping

uncertainty.” Bioinformatics 26.4 (2010): 493-500.
