Sa TAnn Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 27

Package ‘SaTAnn’

April 12, 2019

Title Splice-Aware Translatome Annotation

Version 0.99.0

Description

SaTAnn is a method that quantiﬁes translation at the single ORF level using Ribo-seq data.

Depends rtracklayer, BSgenome, devtools, Biostrings, GenomicFeatures,

foreach, doMC, multitaper, GenomicAlignments, GenomicFiles,

reshape2, ggplot2, cowplot, grid, BiocGenerics, knitr,

gridExtra, rmarkdown

License GPL-3 or above

Encoding UTF-8

LazyData FALSE

Name SaTAnn

biocViews RiboSeq, GenomeAnnotation, Transcriptomics, Software

RoxygenNote 6.1.1

NeedsCompilation no

Author Lorenzo Calviello [aut, cre],

Uwe Ohler [rev, fnd]

Maintainer Lorenzo Calviello <calviello.l.bio@gmail.com>

Rtopics documented:

annotate_ORFs ....................................... 2

annotate_splicing ...................................... 4

calc_orf_pval ........................................ 5

create_SaTAnn_html_report ................................ 6

detect_readthrough ..................................... 7

detect_translated_orfs.................................... 8

from_tx_togen........................................ 9

get_orfs ........................................... 10

get_ps_fromsplicemin.................................... 11

get_ps_fromspliceplus ................................... 11

get_reathr_seq........................................ 12

2annotate_ORFs

load_annotation....................................... 13

plot_SaTAnn_results .................................... 13

prepare_annotation_ﬁles .................................. 15

prepare_for_SaTAnn .................................... 17

run_SaTAnn......................................... 18

SaTAnn ........................................... 20

select_quantify_ORFs.................................... 22

select_start ......................................... 24

select_txs .......................................... 25

take_Fvals_spect ...................................... 26

Index 27

annotate_ORFs Annotate detected ORFs in transcript and genome space

Description

This function annotates quantiﬁed ORFs with respect to other detected ORFs and annotated ones,

in both genome and transcript space.

Usage

annotate_ORFs(results_ORFs, Annotation, genome_sequence, region,

genetic_code)

Arguments

results_ORFs Full list of detected ORFs, from select_quantify_ORFs

Annotation Rannot object containing annotation of CDS and transcript structures (see prepare_annotation_files

genome_sequence

BSgenome object

region genomic region being analyzed

genetic_code GENETIC_CODE table to use

Details

As multiple transcripts can contain the same ORF, all the transcript and transcript biotypes are

indicated, with a preference for protein_coding transcripts in the "compatible" columns (to be con-

servative when assessing translation of non-protein coding transcripts). Such compatibility is also

output considering the most upstream start codon for that ORF.

Splice features of each orf is annotated with respect to the longest coding transcripts and to the

highest translated ORF in that gene.

Variants in N or C terminus of the translated proteins are also indicated (Beta).

ORF annotation with respect to the annotated transcript is also indicated, as follows:

novel: no ORF annotated in the transcript.

annotate_ORFs 3

ORF_annotated: same exact ORF as annotated.

N_extension: N terminal extension.

N_truncation: N terminal extension.

uORF: upstream ORF.

overl_uORF: upstream overlappin uORF.

NC_extension: N and C termini extension.

dORF: downstream ORF.

overl_dORF: downstream overlapping ORF.

nested_ORF: nested ORF.

C_truncation: C terminal truncation.

C_extension: C terminal extension.

As transcipt-speciﬁc annotation can be misleading due to a plethora of different transcripts, it is

important to distinguish ORFs also on the basis of their overlap with know CDS regions. ORF

annotation with respect to the entire set of CDS exon for the analyzed genomic regions is indicated

as follows:

novel: No CDS region is annotated in the entire region.

novel_Upstream: ORF is upstream of annotated CDS regions (does not overlap).

novel_Downstream: ORF is downstream of annotated CDS regions (does not overlap).

novel_Internal: genomic location of the ORF is present between the start of the ﬁrst, and the end

of the last CDS region (does not overlap).

exact_start_stop: Same start and end locations.

Alt5_start: Different start region, upstream.

Alt3_start: Different start region, downstream.

Alt5_stop: Different end region, upstream.

Alt3_stop: Different end region, downstream.

Another layer of annotation is performed by checking the position of the ORF stop codon with

respect to the last exon-exon junction.

Value

Exon structure of detected ORF including possible missing exons from reference, together with a

spl_type column including the annotation for each exon (e.g. alternative acceptors or donor).

Additional columns are added to the ORFs_tx object:

compatible_with: Set of transcript ids possibly containing the entire ORF structure.

compatible_biotype: Compatible transcript biotype; if a protein coding transcript can contain the

ORF, this is set to protein_coding.

compatible_tx: One selected compatible transcript (preference if protein_coding).

compatible_ORF_id_tr: ORF_id_tr id if selecting the compatible transcript.

compatible_with_longest: Same as compatible_with but using the most upstream start codon.

compatible_ORF_id_tr_longest: Same as compatible_ORF_id_tr but using the most upstream

start codon .

ref_id: transcript_id of the transcript used to annotate splicing (longest) .

ref_id_maxORF: ORF_id_tr of the ORF used to annotated splicing (most translated of the gene).

NC_protein_isoform: Annotation of possible N or C termini variant (when transcript is pro-

tein_coding) .

4annotate_splicing

ORF_category_Tx: ORF annotation with respect to ORF position in the transcript .

ORF_category_Tx_compatible: ORF annotation with respect to ORF position in the transcript,

using the compatible_ORF_id_tr .

ORF_category_Gen: ORF annotation with respect to its genomic position .

NMD_candidate: TRUE or FALSE, depending on the presence of an additional exon-exon junction

downstream the stop codon.

Distance_to_lastExEx: Distance (in nt) between the last exon-exon junction and the stop codon.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

select_quantify_ORFs,annotate_splicing

annotate_splicing Annotate splice features of detected ORFs

Description

This function detects usage of different exons and exonic boundaries of one ORF with respect to a

reference ORF.

Usage

annotate_splicing(orf_gen, ref_cds)

Arguments

orf_gen Exon structure of a detected ORF

ref_cds Exon structure of a reference ORF

Details

each exon is aligned to the closest one to match acceptor and donor sites, or to annotate missing

exons. 5ss and 3ss indicate exon 5’ and 3’, respectively. CDS_spanning indicates retained intron;

missing_CDS indicates no overlapping exon (missed or included); monoCDS indicates a single-exon

ORF; firstCDS and lastCDS indicate ﬁrst CDS exon or last CDS exon.

Value

Exon structure of detected ORF including possible missing exons from reference, together with a

spl_type column including the annotation for each exon (e.g. alternative acceptors or donor).

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

calc_orf_pval 5

See Also

detect_translated_orfs,annotate_ORFs

calc_orf_pval Collect ORF Ribo-seq statistics

Description

This function calculates statistics for the analysis of P_sites proﬁles for each ORF

Usage

calc_orf_pval(ORFs, P_sites_rle, P_sites_uniq_rle, P_sites_uniq_mm_rle,

cutoff = 0.5, tapers = 24, bw = 12)

Arguments

ORFs Set of detected ORFs

P_sites_rle Rle signal of P_sites along the transcript

P_sites_uniq_rle

Rle signal of uniquely mapping P_sites along the transcript

P_sites_uniq_mm_rle

Rle signal of uniquely mapping P_sites with mismatches along the transcript

cutoff cutoff of average in-frame signal for each codon in the ORF. Defaults to .5

tapers Number of tapers to use in the multitaper analysis. Defaults to 24

bw time_bw parameter to use in the multitaper analysis. Defaults to 12

Details

Number of P_sites (uniquely mapping or all), frame percentage and multitaper test statistics are col-

lected for each ORF. The parameter space for the multitaper analysis was explored in the RiboTaper

paper.

Value

Set of detected ORFs, including info about the possible longest ORF for that frame.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,get_orfs,take_Fvals_spect

6create_SaTAnn_html_report

create_SaTAnn_html_report

Create an html report summarizing SaTAnn results

Description

This function creates an html report showing summary statistics for SaTAnn-detected ORFs.

Usage

create_SaTAnn_html_report(input_files, input_sample_names, output_file)

Arguments

input_files Character vector with full paths to plot ﬁles (*SaTAnn_plots_RData) generated

with plot_SaTAnn_results. Must be of same length as input_sample_names.

input_sample_names

Character vector containing input names. Must be of same length as input_files.

output_file String; full path to html report ﬁle.

Details

This function creates the html report visualizing ﬁnal SaTAnn results.

Input are two lists of the same length:

a) input_files: list of full paths to one or multiple input ﬁles (*SaTAnn_plots_RData ﬁles gener-

ated with plot_SaTAnn_results) and

b) input_sample_names: list of corresponding names describing the ﬁle content (these are used as

names in the report).

For the report, a RMarkdown ﬁle is rendered as html document, saved as output_file.

Value

The function saves the html report ﬁle with the ﬁle path output_file.

Author(s)

Lorenzo Calviello, <calviello.bio@gmail.com>

See Also

plot_SaTAnn_results,run_SaTAnn

detect_readthrough 7

detect_readthrough Analyzed translation on possible readthrough regions (beta)

Description

This function uses the multitaper method to look for readthrough translation

Usage

detect_readthrough(results_orf, P_sites, P_sites_uniq, P_sites_uniq_mm,

genome_sequence, annotation, genetic_code_table, cutoff_fr_ave = 0.5)

Arguments

results_orf Full list of detected ORFs, from select_quantify_ORFs and annotate_ORFs

P_sites GRanges object with P_sites positions

P_sites_uniq GRanges object with uniquely mapping P_sites positions

P_sites_uniq_mm

Rle signal of uniquely mapping P_sites with mismatches along the transcript

genome_sequence

BSgenome object

annotation Rannot object containing annotation of CDS and transcript structures (see prepare_annotation_files)

genetic_code_table

GENETIC_CODE table to use

cutoff_fr_ave cutoff parameter for the calc_orf_pval functions

Details

The function looks for stop-stop pairs after the stop codon of the detected ORF

Value

GRanges object with the set of translated readthrough regions

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,select_quantify_ORFs,annotate_ORFs,get_reathr_seq

8detect_translated_orfs

detect_translated_orfs

Detect actively translated ORFs

Description

This function detects translated ORFs

Usage

detect_translated_orfs(selected_txs, genome_sequence, annotation, P_sites,

P_sites_uniq, P_sites_uniq_mm, genomic_region, genetic_code,

all_starts = T, nostarts = F, start_sel_cutoff = NA,

start_sel_cutoff_ave = 0.5, cutoff_fr_ave = 0.5)

Arguments

selected_txs set of selected transcripts, output from select_txs

genome_sequence

BSgenome object

annotation Rannot object containing annotation of CDS and transcript structures (see prepare_annotation_files)

P_sites GRanges object with P_sites positions

P_sites_uniq GRanges object with uniquely mapping P_sites positions

P_sites_uniq_mm

GRanges object with uniquely mapping (with mismatches) P_sites positions

genomic_region GRanges object with genomic coordinates of the genomic region analyzed

genetic_code GENETIC_CODE table to use

all_starts get_all_starts parameter for the get_orfs function

nostarts Stop_Stop parameter for the get_orfs function

start_sel_cutoff

cutoff parameter for the select_start function

start_sel_cutoff_ave

cutoff_ave parameter for the select_start function

cutoff_fr_ave cutoff parameter for the calc_orf_pval functions

Details

A set of transcripts, together with genome sequence and Ribo-signal are analyzed to extract trans-

lated ORFs

from_tx_togen 9

Value

A list with transcript coordinates, exonic coordinates and statistics for each ORF exonic bin and

junction(from select_txs).

The value for each column is as follows:

ave_pct_fr: average percentage of in-frame reads for each codon in the ORF pct_fr: percent-

age of in-frame reads in the ORF ave_pct_fr: average percentage of in-frame reads for each

codon in the ORF ave_pct_fr_st: average percentage of in-frame reads per each codon between

the selected start codon and the next candidate one pct_fr_st: percentage of in-frame reads be-

tween the selected start codon and the next candidate one longest_ORF: GRanges coordinates

for the longest ORF with the same stop codon pval: P-value for the multitaper F-test at 1/3 us-

ing the ORF P_sites proﬁle pval_uniq: P-value for the multitaper F-test at 1/3 using the ORF

P_sites proﬁle (only uniquely mapping reads) P_sites_raw: Raw number of P_sites mapping to

the ORF pct_uniq: Percentage of raw number of P_sites mapping to the ORF TrP_raw: Raw

multitaper spectral coefﬁcient at 1/3 using the P_sites ORF signal ORF_id_tr: ORF id containing

<tx_id>_<start>_<end> Protein: AAString sequence of the translated protein region: Genomic

coordinates of the analyzed region gene_id: gene_id for the corresponding analyzed transcript

gene_biotype: gene biotype for the corresponding analyzed transcript gene_name: gene name

for the corresponding analyzed transcript transcript_id: transcript_id for the corresponding ana-

lyzed ORF transcript_biotype: transcript biotype for the corresponding analyzed ORF

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

select_txs,get_orfs,take_Fvals_spect,select_start,prepare_annotation_files

from_tx_togen Map transcript coordinates to genomic coordinates

Description

This function uses the mapFromTranscripts function to switch between transcript and genomic

coordinates

Usage

from_tx_togen(ORFs, exons, introns)

Arguments

ORFs Set of detected ORFs from the calc_orf_pval function

exons exonic regions of the analyzed transcripts, as a GRangesList object

introns intronic regions of the analyzed transcripts, as a GRangesList object

10 get_orfs

Value

exonic coordinates for each ORF.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

mapFromTranscripts

get_orfs Find ATG-starting ORFs in a sequence

Description

This function loads the annotation created by the prepare_annotation_files function

Usage

get_orfs(tx_name, sequence, get_all_starts = T, Stop_Stop = F,

scores = c(1, 0.5), genetic_code_table)

Arguments

tx_name transcript_id

sequence DNAString object containing the sequence of the transcript

get_all_starts Output all possible start codons? Defaults to TRUE

Stop_Stop Find Stop-Stop pairs (no deﬁned start codon)? Defaults to FALSE

scores Deprecated

genetic_code_table

GENETIC_CODE table to use

Value

GRanges object containing coordinates for the detected ORFs

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs

get_ps_fromsplicemin 11

get_ps_fromsplicemin Offset spliced reads on minus strand

Description

This function calculates P-sites positions for spliced reads on the minus strand

Usage

get_ps_fromsplicemin(x, cutoff)

Arguments

xaGAlignments object with a cigar string

cutoff number representing the offset value

Value

aGRanges object with offset reads

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

prepare_for_SaTAnn

get_ps_fromspliceplus Offset spliced reads on plus strand

Description

This function calculates P-sites positions for spliced reads on the plus strand

Usage

get_ps_fromspliceplus(x, cutoff)

Arguments

xaGAlignments object with a cigar string

cutoff number representing the offset value

12 get_reathr_seq

Value

aGRanges object with offset reads

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

prepare_for_SaTAnn

get_reathr_seq Extract possible readthrough sequences (beta)

Description

This function extracts readthrough regions for subsequent analysis

Usage

get_reathr_seq(tx_name, orf, sequence, genetic_code)

Arguments

tx_name transcript_id

orf transcript-level ORF coordinates

sequence DNAString object containing the sequence of the transcript

genetic_code GENETIC_CODE table to use

Details

The function looks for stop-stop pairs after the stop codon of the detected ORF

Value

GRanges object with the set of possible readthrough sequences

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,select_quantify_ORFs

load_annotation 13

load_annotation Load genomic features and genome sequence

Description

This function loads the annotation created by the prepare_annotation_files function

Usage

load_annotation(path)

Arguments

path Full path to the *Rannot R ﬁle in the annotation directory used in the prepare_annotation_files function

Value

introduces a GTF_annotation object and a genome_seq object in the parent environment

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

prepare_annotation_files

plot_SaTAnn_results Plot general statistics about SaTAnn results

Description

This function produces a series of plots and statistics about the set ORFs called by SaTAnn com-

pared to the annotation. IMPORTANT: Use only on transcriptome-wide SaTAnn results. See

run_SaTAnn

Usage

plot_SaTAnn_results(for_SaTAnn_file, SaTAnn_output_file, annotation_file,

coverage_file_plus = NA, coverage_file_minus = NA,

output_plots_path = NA, prefix = NA)

14 plot_SaTAnn_results

Arguments

for_SaTAnn_file

path to the "for_SaTAnn" ﬁle containing P_sites positions and junction reads

SaTAnn_output_file

Full path to the "_ﬁnal_SaTAnn_results" RData object output by SaTAnn. See

run_SaTAnn

annotation_file

Full path to the *Rannot R ﬁle in the annotation directory used in the prepare_annotation_files function

coverage_file_plus

Full path to a Ribo-seq coverage (no P-sites but read coverage) bigwig ﬁle (plus

strand), as the ones created by RiboseQC

coverage_file_minus

Full path to a Ribo-seq coverage (no P-sites but read coverage) bigwig ﬁle (mi-

nus strand), as the ones created by RiboseQC

output_plots_path

Full path to the directory where plots in .pdf format are stored.

prefix preﬁx appended to output ﬁlenames

Value

the function exports a RData object (*SaTAnn_plots_RData) containing data to produce all plots,

and produces different QC plots in .pdf format. The plots created are as follows:

ORFs_found: Number of ORF categories detected per gene biotype.

ORFs_found_pct_tr: Distribution of ORF_pct_P_sites ( ORFs_found_P_sites_pNpM: Distribu-

tion of ORF_P_sites_pNpM (P-sites per nucleotide per Million, similar to TPM) for different ORF

categories and gene biotypes.

ORFs_found_len: Distribution of ORF length for different ORF categories and gene biotypes.

ORFs_genes: Number of detected ORFs per gene.

ORFs_genes_tpm: Gene level TPM values, plotted by number of ORFs detected.

ORFs_maxiso: Number of genes plotted against the percentages of gene translation of their most

translated ORF.

ORFs_maxiso_tpm: Gene level TPM values, plotted against the percentages of gene translation of

their most translated ORF.

Sel_txs_genes: Number of genes plotted against the number of selected transcripts.

Sel_txs_genes_tpm: Gene level TPM values, plotted against the number of selected transcripts.

Sel_txs_genes_pct: Percentages of annotated trascripts per gene, plotted against the number of

selected transcripts.

Sel_txs_bins_juns: Percentages of covered exonic bins or junctions, using all annotated tran-

scripts, coding transcripts only, or the set of selected transcripts.

Meta_splicing_coverage: Aggregate signal of Ribo-seq coverage and normalized ORF coverage

across different splice sites combinations, with different mixtures of translated overlapping ORFs.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

prepare_annotation_ﬁles 15

See Also

run_SaTAnn

prepare_annotation_files

Prepare comprehensive sets of annotated genomic features

Description

This function processes a gtf ﬁle and a twobit ﬁle (created using faToTwoBit from ucsc tools:

http://hgdownload.soe.ucsc.edu/admin/exe/ ) to create a comprehensive set of genomic regions of

interest in genomic and transcriptomic space (e.g. introns, UTRs, start/stop codons). In addition,

by linking genome sequence and annotation, it extracts additional info, such as gene and transcript

biotypes, genetic codes for different organelles, or chromosomes and transcripts lengths.

Usage

prepare_annotation_files(annotation_directory, twobit_file, gtf_file,

scientific_name = "Homo.sapiens", annotation_name = "genc25",

export_bed_tables_TxDb = T, forge_BSgenome = T, create_TxDb = T)

Arguments

annotation_directory

The target directory which will contain the output ﬁles

twobit_file Full path to the genome ﬁle in twobit format

gtf_file Full path to the annotation ﬁle in GTF format

scientific_name

A name to give to the organism studied; must be two words separated by a ".",

defaults to Homo.sapiens

annotation_name

A name to give to annotation used; defaults to genc25

export_bed_tables_TxDb

Export coordinates and info about different genomic regions in the annotation_directory?

It defaults to TRUE

forge_BSgenome Forge and install a BSgenome package? It defaults to TRUE

create_TxDb Create a TxDb object and a *Rannot object? It defaults to TRUE

Details

This function uses the makeTxDbFromGFF function to create a TxDb object and extract genomic

regions and other info to a *Rannot R ﬁle; the mapToTranscripts and mapFromTranscripts func-

tions are used to map features to genomic or transcript-level coordinates. GTF ﬁle mist contain

"exon" and "CDS" lines, where each line contains "transcript_id" and "gene_id" values. Additional

values such as "gene_biotype" or "gene_name" are also extracted. Regarding sequences, the twobit

16 prepare_annotation_ﬁles

ﬁle, together with input scientiﬁc and annotation names, is used to forge and install a BSgenome

package using the forgeBSgenomeDataPkg function.

The resulting GTF_annotation object (obtained after runnning load_annotation) contains:

txs: annotated transcript boundaries.

txs_gene: GRangesList including transcript grouped by gene.

seqinfo: indicating chromosomes and chromosome lengths.

start_stop_codons: the set of annotated start and stop codon, with respective transcript and

gene_ids. reprentative_mostcommon,reprentative_boundaries and reprentative_5len represent the

most common start/stop codon, the most upstream/downstream start/stop codons and the start/stop

codons residing on transcripts with the longest 5’UTRs

cds_txs: GRangesList including CDS grouped by transcript.

introns_txs: GRangesList including introns grouped by transcript.

cds_genes: GRangesList including CDS grouped by gene.

exons_txs: GRangesList including exons grouped by transcript.

exons_bins: the list of exonic bins with associated transcripts and genes.

junctions: the list of annotated splice junctions, with associated transcripts and genes.

genes: annotated genes coordinates.

threeutrs: collapsed set of 3’UTR regions, with correspinding gene_ids. This set does not overlap

CDS region.

fiveutrs: collapsed set of 5’UTR regions, with correspinding gene_ids. This set does not overlap

CDS region.

ncIsof: collapsed set of exonic regions of protein_coding genes, with correspinding gene_ids. This

set does not overlap CDS region.

ncRNAs: collapsed set of exonic regions of non_coding genes, with correspinding gene_ids. This

set does not overlap CDS region.

introns: collapsed set of intronic regions, with correspinding gene_ids. This set does not overlap

exonic region.

intergenicRegions: set of intergenic regions, deﬁned as regions with no annotated genes on ei-

ther strand.

trann: DataFrame object including (when available) the mapping between gene_id, gene_name,

gene_biotypes, transcript_id and transcript_biotypes.

cds_txs_coords: transcript-level coordinates of ORF boundaries, for each annotated coding tran-

script. Additional columns are the same as as for the start_stop_codons object.

genetic_codes: an object containing the list of genetic code ids used for each chromosome/organelle.

see GENETIC_CODE_TABLE for more info.

genome_package: the name of the forged BSgenome package. Loaded with load_annotation

function.

stop_in_gtf: stop codon, as deﬁned in the annotation.

Value

a TxDb ﬁle and a *Rannot ﬁles are created in the speciﬁed annotation_directory. In addition, a

BSgenome object is forged, installed, and linked to the *Rannot object

prepare_for_SaTAnn 17

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

load_annotation,forgeBSgenomeDataPkg,makeTxDbFromGFF,run_SaTAnn.

prepare_for_SaTAnn Prepare the "for_SaTAnn" ﬁle

Description

Prepare the "for_SaTAnn" ﬁle

Usage

prepare_for_SaTAnn(annotation_file, bam_file,

path_to_rl_cutoff_file = NA, chunk_size = 5e+06,

path_to_P_sites_plus_bw = NA, path_to_P_sites_minus_bw = NA,

path_to_P_sites_uniq_plus_bw = NA,

path_to_P_sites_uniq_minus_bw = NA,

path_to_P_sites_uniq_mm_plus_bw = NA,

path_to_P_sites_uniq_mm_minus_bw = NA, dest_name = NA)

Arguments

annotation_file

Full path to the annotation ﬁle (*Rannot)

bam_file Full path to the bam ﬁle

path_to_rl_cutoff_file

path to the rl_cutoff_ﬁle ﬁle specifying in 3 columns the read lengths, cutoffs

and compartments ("nucl" for standard chromosomes)

chunk_size the number of alignments to read at each iteration, defaults to 5000000, increase

when more RAM is available

path_to_P_sites_plus_bw

path to a bigwig ﬁle containing P_sites positions on the plus strand

path_to_P_sites_minus_bw

path to a bigwig ﬁle containing P_sites positions on the minus strand

path_to_P_sites_uniq_plus_bw

(Optional) path to a bigwig ﬁle containing uniquely mapping P_sites positions

on the plus strand

path_to_P_sites_uniq_minus_bw

(Optional) path to a bigwig ﬁle containing uniquely mapping P_sites positions

on the minus strand

18 run_SaTAnn

path_to_P_sites_uniq_mm_plus_bw

(Optional) path to a bigwig ﬁle containing uniquely mapping (with mismatches)

P_sites positions on the plus strand

path_to_P_sites_uniq_mm_minus_bw

(Optional) path to a bigwig ﬁle containing uniquely mapping (with mismatches)

P_sites positions on the minus strand

dest_name preﬁx to use for the output ﬁles. Defaults to same as bam_file (appends "for_SaTAnn"

to its ﬁlename)

Details

This function uses a list of pre-determined read lengths, cutoffs and compartments to calculate

P_sites positions.

Alternatively, bigwig ﬁles containing P_sites position for each strand can be speciﬁed. Optional

bigwig ﬁles for uniquely mapping P_sites position (with and without mismatches) can be speciﬁed

to obtain more statistics on the SaTAnn-identiﬁed ORFs

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

run_SaTAnn

run_SaTAnn Run the SaTAnn pipeline

Description

This wrapper function runs the entire SaTAnn pipeline

Usage

run_SaTAnn(for_SaTAnn_file, annotation_file, n_cores,

prefix = for_SaTAnn_file, gene_name = NA, gene_id = NA,

genomic_region = NA, write_temp_files = T, write_GTF_file = T,

write_protein_fasta = T, interactive = T,

stn.orf_find.all_starts = T, stn.orf_find.nostarts = F,

stn.orf_find.start_sel_cutoff = NA,

stn.orf_find.start_sel_cutoff_ave = 0.5,

stn.orf_find.cutoff_fr_ave = 0.5, stn.orf_quant.cutoff_cums = NA,

stn.orf_quant.cutoff_pct = 2, stn.orf_quant.cutoff_P_sites = NA)

run_SaTAnn 19

Arguments

for_SaTAnn_file

REQUIRED - path to the "for_SaTAnn" ﬁle containing P_sites positions and

junction reads

annotation_file

REQUIRED - path to the *Rannot R ﬁle in the annotation directory used in the

prepare_annotation_files function

n_cores REQUIRED - number of cores to use

prefix preﬁx to use for the output ﬁles. Defaults to same as for_SaTAnn_file (ap-

pends to its ﬁlename)

gene_name character vector of gene names to analyze.

gene_id character vector of gene ids to analyze

genomic_region GRanges object with genomic regions to analyze

write_temp_files

write temporary ﬁles. Defaults to TRUE

write_GTF_file write a GTF ﬁles with the ORF coordinates. Defaults to TRUE

write_protein_fasta

write a protein fasta ﬁle. Defaults to TRUE

interactive should put R object in global environment? Defaults to TRUE

stn.orf_find.all_starts

orf_find.all_starts parameter for the SaTAnn function

stn.orf_find.nostarts

orf_find.nostarts parameter for the SaTAnn function

stn.orf_find.start_sel_cutoff

orf_find.start_sel_cutoff parameter for the SaTAnn function

stn.orf_find.start_sel_cutoff_ave

orf_find.start_sel_cutoff_ave parameter for the SaTAnn functio

stn.orf_find.cutoff_fr_ave

orf_find.cutoff_fr_ave parameter for the SaTAnn function

stn.orf_quant.cutoff_cums

orf_quant.cutoff_cums parameter for the SaTAnn function

stn.orf_quant.cutoff_pct

orf_quant.cutoff_pct parameter for the SaTAnn function

stn.orf_quant.cutoff_P_sites

orf_quant.cutoff_P_sites parameter for the SaTAnn function

Details

A set of transcripts, together with genome sequence and Ribo-signal are analyzed to extract trans-

lated ORFs

20 SaTAnn

Value

A set of output ﬁles containing transcript coordinates, exonic coordinates and annotation for each

ORF, including optional GTF and protein fasta ﬁles.

The description for each list object is as follows:

tmp_SaTAnn_results: (Optional) RData object ﬁle containing the entire set of results for each

genomic region.

final_SaTAnn_results: RData object ﬁle containing the ﬁnal SaTAnn results, see SaTAnn.

Protein_sequences.fasta: (Optional) Fasta ﬁle containing the set of translated proteins .

Detected_ORFs.gtf: GTF ﬁle containing coordinates of the detected ORFs.

In addition, new columns are added in the ORFs_tx ﬁle:

TrP_pM: (Beta) multitaper spectral coefﬁcient of the P_sites track for each ORF, summing up to

a million.

TrP_pN: (Beta) multitaper spectral coefﬁcient of the P_sites track for each ORF, divided by ORF

length.

TrP_pNpM: (Beta) multitaper spectral coefﬁcient of the P_sites track for each ORF, divided by ORF

length and summing up to a million (akin to TPM).

P_sites_pM: number of P_sites for each ORF, summing up to a million.

P_sites_pN: number of P_sites for each ORF, divided by ORF length.

P_sites_pNpM: number of P_sites for each ORF, divided by ORF length and summing up to a mil-

lion (akin to TPM).

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

prepare_annotation_files,load_annotation,SaTAnn

SaTAnn Detection, quantiﬁcation and annotation of translated ORFs in a ge-

nomic region

Description

This function detects, quantiﬁes and annotates actively translated ORF in a genomic region

Usage

SaTAnn(region, for_SaTAnn, genetic_code_region, orf_find.all_starts = T,

orf_find.nostarts = F, orf_find.start_sel_cutoff = NA,

orf_find.start_sel_cutoff_ave = 0.5, orf_find.cutoff_fr_ave = 0.5,

SaTAnn 21

orf_quant.cutoff_cums = NA, orf_quant.cutoff_pct = 2,

orf_quant.cutoff_P_sites = NA)

Arguments

region GRanges object with genomic coordinates of the genomic region analyzed

for_SaTAnn "for_SaTAnn" Robject containing P_sites positions and junction reads

genetic_code_region

GENETIC_CODE table to use

orf_find.all_starts

get_all_starts parameter for the detect_translated_orfs function

orf_find.nostarts

Stop_Stop parameter for the detect_translated_orfs function

orf_find.start_sel_cutoff

cutoff parameter for the detect_translated_orfs function

orf_find.start_sel_cutoff_ave

cutoff_ave parameter for the detect_translated_orfs function

orf_find.cutoff_fr_ave

cutoff parameter for the detect_translated_orfs function

orf_quant.cutoff_cums

cutoff_cums parameter for the select_quantify_ORFs function

orf_quant.cutoff_pct

cutoff_pct parameter for the select_quantify_ORFs function

orf_quant.cutoff_P_sites

cutoff_P_sites parameter for the select_quantify_ORFs function

Details

A set of transcripts, together with genome sequence and Ribo-signal are analyzed to extract trans-

lated ORFs

Value

A list containing transcript coordinates, exonic coordinates and annotation for each ORF.

The description for each list object is as follows:

ORFs_tx: transcript coordinates of the detected ORFs.

ORFs_gen: genomic (exon) coordinates of the detected ORFs.

ORFs_feat: list of ORF features together with mapping reads and uniqueness.

ORFs_txs_feats: list of transcript features present in the genomic region, together with mapping

reads and uniqueness.

ORFs_spl_feat_longest: splicing annotation for each ORF exon, with respect to the longest an-

notated coding transcript for each gene.

ORFs_spl_feat_maxORF: splicing annotation for each ORF exon, with respect to the most trans-

lated ORF in each gene.

selected_txs: character vector containing the transcript ids of the selected transcripts.

22 select_quantify_ORFs

ORFs_readthroughs: (Beta) transcript coordinates of the detected ORFs readthroughs.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

select_txs,detect_translated_orfs,select_quantify_ORFs,annotate_ORFs,detect_readthrough

select_quantify_ORFs Select and quantify ORF translation

Description

This function selects a subset of detected ORFs and quantiﬁes their translation

Usage

select_quantify_ORFs(results_ORFs, P_sites, P_sites_uniq,

cutoff_cums = NA, cutoff_pct = 2, cutoff_P_sites = NA,

optimiz = FALSE, scaling = TRUE)

Arguments

results_ORFs Full list of detected ORFs, from detect_translated_ORFs

P_sites GRanges object with P_sites positions

P_sites_uniq GRanges object with uniquely mapping P_sites positions

cutoff_cums cutoff to select ORFs until <x> percentage of total gene translation. Defaults to

cutoff_pct minimum percentage of total gene translation for an ORF to be selected. De-

faults to 1

cutoff_P_sites minimum number of P_sites assigned to the ORF to be selected. Defaults to 10

optimiz (Beta) should numerical optimization (minimizing distance between observed

coverage and expected coverage) be used to quantify ORF translation? Defaults

to FALSE

scaling Additional scaling value taking into account total signal on the detected ORFs

to adjust quantiﬁcation estimates (recommended). Defaults to TRUE

select_quantify_ORFs 23

Details

ORFs are ﬁrst selected using the same method as in the select_txs function, but using ORF fea-

tures (ORF structures are treated as transcript structures).

Ribo-seq coverage (reads/length) on bins and junctions (set to a length of 60) is used to derive a

scaling factor (0-1) for each ORF, which indicates how much of the ORF coverage can be assigned

to such ORF (1 when no other ORF is present). When no unique features are present on an ORF,

an adjusted scaling value is calculated subtracting coverage expected from a ORF with a unique

feature. When no unique features are present on any ORF, scaling values are calculated assuming

uniform coverage on each ORF.

ORFs are then further ﬁltered to exclude lowly translated ORFs and quantiﬁcation/selection is re-

iterated until no ORF is further ﬁltered out. Percentage of total gene translation and length-adjusted

quantiﬁcation estimates are produced. More details about the quantiﬁcatin procedure can be found

in the SaTAnn manuscript.

Additional columns are added to the ORFs_tx object:

TrP: TrP_raw values (spectral coefﬁcient) from detect_translated_ORFs divided by the ORF

scaling value.

ORF_pct_TrP: Percentage of gene translation output for the ORF, derived using TrP values.

ORF_pct_TrP_pN: Percentage of gene translation ouptut (adjusted by length) for the ORF, derived

using TrP values.

P_sites: P_sites_raw value from detect_translated_ORFs divided by the ORF scaling value.

ORF_pct_P_sites: Percentage of gene translation output for the ORF, derived using P_sites values.

ORF_pct_P_sites_pN: Percentage of gene translation ouptut (adjusted by length) for the ORF, de-

rived using P_sites values.

unique_features_reads: initial number of reads on each unique ORF feature. NA when no unique

feature is present.

adj_unique_features_reads: ﬁnal number of reads on each unique ORF feature after the ORF

ﬁltering/quantiﬁcation procedure. NA when no unique feature is present.

scaling_factors: Set of 3 scaling factors assigned to the ORF using intial unique ORF features,

after adjusting for the presence of ORFs with no unique features, and ﬁnal scaling factor after

correcting for total Ribo-seq coverage on the gene.

Value

modiﬁed results_ORFs object with the selected ORFs including quantiﬁcation estimates.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,select_txs

24 select_start

select_start Select start codon

Description

This function selects the start codon for ORFs in the same transcript

Usage

select_start(ORFs, P_sites_rle, cutoff = NA, cutoff_ave = 0.5)

Arguments

ORFs Set of detected ORFs

P_sites_rle Rle signal of P_sites along the transcript

cutoff cutoff of total in-frame signal between start codons (sensitive to outliers). De-

faults to NA

cutoff_ave cutoff for frequency of in-frame codons between two start codons (less sensitive

to outliers). Defaults to .5

Details

ORFs are divided based on stop codon and Ribo-seq signal between start codons is used to select

one.

When more than cutoff_ave fraction of codons is in-frame between two candidate start codons,

the most upstream is selected.

Value

Set of detected ORFs, including info about the possible longest ORF for that frame.

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,get_orfs

select_txs 25

select_txs Select a subset of transcripts with Ribo-seq data

Description

This function ﬂattens all annotated transcript structures and uses Ribo-seq to select a subset of

transcripts.

Usage

select_txs(region, annotation, P_sites, P_sites_uniq, junction_counts)

Arguments

region genomic region being analyzed

annotation Rannot object containing annotation of CDS and transcript structures (see prepare_annotation_files)

P_sites GRanges object with P_sites positions

P_sites_uniq GRanges object with uniquely mapping P_sites positions

junction_counts

GRanges object containing Ribo-seq counts on the set of annotated junctions

Details

Features (bins and junctions) are divided into shared and unique features, and into with support and

without support (with or without reads mapping). A set of logical rules ﬁlters out transcripts with

internal features with no support and no unique features with reads. More speciﬁc details can be

found in the SaTAnn manuscript.

Value

GRanges object with the set of counts on each exonic bin and junctions, together with the list of

selected transcripts

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

prepare_annotation_files

26 take_Fvals_spect

take_Fvals_spect Extract output from multitaper analysis of a signal

Description

This function uses the multitaper tool to extract F-values and multitaper spectral coefﬁcients

Usage

take_Fvals_spect(x, n_tapers, time_bw, slepians_values)

Arguments

xnumeric signal to analyze

n_tapers n of tapers to use

time_bw time_bw parameter

slepians_values

set of calculated slepian functions to use in the multitaper analysis

Details

Values reported correspond to the closest frequency to 1/3 (same parameters as in RiboTaper).

Padding to a minimum length of 1024 is performed to increase spectral resolution.

Value

two numeric values representing the F-value for the multitaper test and its corresponding spectral

coefﬁcient at the closest frequency to 1/3

Author(s)

Lorenzo Calviello, <calviello.l.bio@gmail.com>

See Also

detect_translated_orfs,spec.mtm,dpss

Index

∗Topic Ribo-seQC,

get_ps_fromsplicemin,11

get_ps_fromspliceplus,11

prepare_annotation_files,15

∗Topic Ribo-seQC

load_annotation,13

plot_SaTAnn_results,13

∗Topic SaTAnn,

load_annotation,13

plot_SaTAnn_results,13

∗Topic SaTAnn

annotate_ORFs,2

annotate_splicing,4

calc_orf_pval,5

create_SaTAnn_html_report,6

detect_readthrough,7

detect_translated_orfs,8

from_tx_togen,9

get_orfs,10

get_ps_fromsplicemin,11

get_ps_fromspliceplus,11

get_reathr_seq,12

prepare_annotation_files,15

prepare_for_SaTAnn,17

run_SaTAnn,18

SaTAnn,20

select_quantify_ORFs,22

select_start,24

select_txs,25

take_Fvals_spect,26

annotate_ORFs,2,5,7,22

annotate_splicing,4,4

calc_orf_pval,5

create_SaTAnn_html_report,6

detect_readthrough,7,22

detect_translated_orfs,5,7,8,10,12,

22–24,26

dpss,26

forgeBSgenomeDataPkg,17

from_tx_togen,9

get_orfs,5,9,10,24

get_ps_fromsplicemin,11

get_ps_fromspliceplus,11

get_reathr_seq,7,12

load_annotation,13,17,20

makeTxDbFromGFF,17

mapFromTranscripts,10

plot_SaTAnn_results,6,13

prepare_annotation_files,9,13,15,20,

prepare_for_SaTAnn,11,12,17

run_SaTAnn,6,15,17,18,18

SaTAnn,20,20

select_quantify_ORFs,4,7,12,22,22

select_start,9,24

select_txs,9,22,23,25

spec.mtm,26

take_Fvals_spect,5,9,26

Sa TAnn Manual

Navigation menu

Versions of this User Manual:

Views

Navigation