Reference Manual

Package ‘riboWaltz’
November 23, 2018
Type Package
Title Optimization of ribosome P-site positioning in ribosome profiling data
Version 1.0.1
Description riboWaltz is an R package designed for the analysis of
ribosome profiling (RiboSeq) data aimed at the identification of the
P-site offset. The P-site offset (PO) is specified by the localization of
the P-site of ribosomes within the fragments of the RNA (reads) resulting
from RiboSeq assays. It is defined as the distance of the P-site from the
two ends of the reads. Determining the PO is a crucial step for a variety of
RiboSeq-based analyses such as verifying the so-called 3-nt periodicity of
ribosomes along the coding sequence, deriving translation initiation and
elongation rates and revealing new translational events in unannotated open
reading frames and ncRNAs. riboWaltz performs accurate computation of the
PO for all the lengths of reads from single or multiple samples, taking
advantage of an original two-step algorithm. Moreover, riboWaltz
provides the user with a variety of graphical representations, laying
the groundwork for further positional analyses and new biological
discoveries.
License MIT
LazyData TRUE
Depends R (>= 3.3.0)
Imports Biostrings (>= 2.46.0),
data.table (>= 1.10.4.3),
GenomicAlignments (>= 1.14.1),
GenomicFeatures (>= 1.24.5),
GenomicRanges (>= 1.24.3),
ggplot2 (>= 2.2.1),
ggrepel (>= 0.6.5),
IRanges (>= 2.12.0)
biocViews
RoxygenNote 6.0.1
Suggests knitr,
rmarkdown
VignetteBuilder knitr

R topics documented:
bamtobed
bamtolist
bedtolist
cds_coverage
codon_coverage
codon_usage_psite
create_annotation
frame_psite
frame_psite_length
length_filter
metaheatmap_psite
metaprofile_psite
mm81cdna
psite
psite_info
psite_offset
reads_list
reads_psite_list
region_psite
rends_heat
rlength_distr
Index

bamtobed

From BAM files to BED files.

Description
This function reads one or multiple BAM files converting them into BED files that contain, for each
read: i) the name of the corresponding reference sequence (i.e. of the transcript on which it aligns);
ii) its leftmost and rightmost position with respect to the 1st nucleotide of the reference sequence;
iii) its length; iv) the strand on which it aligns. Please note: this function relies on the bamtobed
utility of the BEDTools suite and can only be run on UNIX, LINUX and Apple OS X operating
systems. Moreover, to generate R data structures containing reads information, bedtolist
must be run on the resulting BED files. For these reasons the authors suggest the use of bamtolist.
Usage
bamtobed(bamfolder, bedfolder = NULL)

Arguments
bamfolder

Character string specifying the path to the folder storing BAM files. Please note:
the function looks for BAM files recursively starting from the specified folder.

bedfolder

Character string specifying the path to the directory where BED files should be
stored. If the specified folder doesn’t exist, it is automatically created. If NULL
(the default), BED files are stored in a new subfolder of the working directory,
called bed.

Examples
## path_bam <- "path/to/BAM/files"
## path_bed <- "path/to/output/directory"
## bamtobed(bamfolder = path_bam, bedfolder = path_bed)

bamtolist

From BAM files to lists of data tables or GRangesList objects.

Description
This function reads one or multiple BAM files converting them into data tables or GRanges objects,
arranged in a list or a GRangesList, respectively. In both cases the list elements contain, for each
read: i) the name of the corresponding reference sequence (i.e. of the transcript on which it aligns);
ii) its leftmost and rightmost position with respect to the 1st nucleotide of the reference sequence;
iii) its length; iv) the leftmost and rightmost position of the annotated CDS of the reference sequence
(if any) with respect to its 1st nucleotide. Please note: start and stop codon positions for transcripts
without annotated CDS are set to 0.
Usage
bamtolist(bamfolder, annotation, transcript_align = TRUE,
name_samples = NULL, rm_version = FALSE, granges = FALSE)
Arguments
bamfolder

Character string specifying the path to the folder storing BAM files.

annotation

Data table as generated by create_annotation. Please make sure the names of
reference transcripts in the annotation data table match those in the BAM files
(see rm_version).

transcript_align

Logical value whether BAM files in bamfolder come from a transcriptome
alignment (intended as an alignment against reference transcript sequences, see
Details). If TRUE (the default), reads mapping on the negative strand should
not be present and, if any, they are automatically removed.

name_samples

Named character string vector specifying the desired name for the output list
elements. A character string for each BAM file in bamfolder is required. Please
be careful to name each element of the vector after the correct corresponding
BAM file in bamfolder, leaving their path and extension out. No specific order
is required. Default is NULL i.e. list elements are named after the name of the
BAM files, leaving their path and extension out.

rm_version

Logical value whether to remove the transcript version at the end of their ID,
usually dot-separated. It might be required to make the transcripts IDs in the
BAM files match those in the annotation table. Default is FALSE.

granges

Logical value whether to return a GRangesList object. Default is FALSE i.e.
a list of data tables is returned instead (the required input for length_filter,
psite, psite_info, rends_heat and rlength_distr).

Details
riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the
main purpose of RiboSeq assays to study translational events through the isolation and sequencing
of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and
not on introns and intergenic regions. Nevertheless, BAM files based on transcript coordinates can
be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against
standard chromosome sequences, requiring the outputs to be translated into transcript coordinates.
The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA
file where each sequence represents a transcript, from the beginning of the 5’ UTR to the end
of the 3’ UTR. The second procedure is based on reference FASTA files where each sequence
represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF).
The STAR aligner, with its option --quantMode TranscriptomeSAM (see Chapter 6 of its manual),
is an example of a tool providing such a feature.
Value
A list of data tables or a GRangesList object.
Examples
## path_bam <- "path/to/BAM/files"
## annotation_dt <- datatable_with_transcript_annotation
## bamtolist(bamfolder = path_bam, annotation = annotation_dt)
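
The effect of rm_version can be pictured with a few lines of base R. This is only an illustrative sketch, not the code used by riboWaltz: it assumes that the version is a trailing dot followed by digits, and the transcript IDs and the helper strip_version are made up for the example.

```r
# Hypothetical transcript IDs as they might appear in a BAM file (made up).
bam_ids <- c("ENSMUST00000000001.4", "ENSMUST00000000003.13", "ENSMUST00000000010")

# Assumption: the version is a trailing ".<digits>" suffix.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

clean_ids <- strip_version(bam_ids)
print(clean_ids)

# IDs of a hypothetical annotation table that carries no version suffix.
annotation_ids <- c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000000010")

# After stripping, every read-level ID finds its match in the annotation.
print(all(clean_ids %in% annotation_ids))
```

IDs without a version suffix are left untouched, so applying the helper to already clean IDs is harmless.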

bedtolist

From BED files to lists of data tables or GRangesList objects.

Description
This function reads one or multiple BED files, as generated by bamtobed, converting them into
data tables or GRanges objects, arranged in a list or a GRangesList, respectively. In both cases
two columns are attached to the original data containing, for each read, the leftmost and rightmost
position of the annotated CDS of the reference sequence (if any) with respect to its 1st nucleotide.
Please note: start and stop codon positions for transcripts without annotated CDS are set to 0.

Usage
bedtolist(bedfolder, annotation, transcript_align = TRUE,
name_samples = NULL, rm_version = FALSE, granges = FALSE)
Arguments
bedfolder

Character string specifying the path to the folder storing BED files as generated
by bamtobed.

annotation

Data table as generated by create_annotation. Please make sure the name of
reference transcripts in the annotation data table match those in the BED files
(see also rm_version).

transcript_align

Logical value whether BED files in bedfolder come from a transcriptome
alignment (intended as an alignment against reference transcript sequences, see
Details). If TRUE (the default), reads mapping on the negative strand should
not be present and, if any, they are automatically removed.

name_samples

Named character string vector specifying the desired name for the output list
elements. A character string for each BED file in bedfolder is required. Please
be careful to name each element of the vector after the correct corresponding
BED file in bedfolder, leaving their path and extension out. No specific order
is required. Default is NULL i.e. list elements are named after the name of the
BED files, leaving their path and extension out.

rm_version

Logical value whether to remove the transcript version at the end of their ID,
usually dot-separated. It might be required to make the transcripts IDs in the
BED files match those in the annotation table. Default is FALSE.

granges

Logical value whether to return a GRangesList object. Default is FALSE i.e.
a list of data tables is returned instead (the required input for length_filter,
psite, psite_info, rends_heat and rlength_distr).

Examples
## path_bed <- "path/to/BED/files"
## annotation_dt <- datatable_with_transcript_annotation
## bedtolist(bedfolder = path_bed, annotation = annotation_dt)

cds_coverage

Number of in-frame P-sites per coding sequence.

Description
This function generates a data table containing, for each transcript: i) its name; ii) its length; iii)
the number of in-frame P-sites falling in its annotated coding sequence (if any) for all samples. A
chosen number of nucleotides at the beginning and/or at the end of the CDSs can be excluded for
restricting the analysis to a subregion of the original sequence. Please note: transcripts without
annotated CDS are automatically discarded.
Usage
cds_coverage(data, annotation, start_nts = 0, stop_nts = 0)
Arguments
data

List of data tables from psite_info.

annotation

Data table as generated by create_annotation.

start_nts

Positive integer specifying the number of nucleotides at the beginning of the
coding sequences to be excluded from the analysis. Default is 0.

stop_nts

Positive integer specifying the number of nucleotides at the end of the coding
sequences to be excluded from the analysis. Default is 0.

Value
A data table.
Examples
data(reads_psite_list)
data(mm81cdna)
## Compute the number of in-frame P-sites per whole coding sequence.
psite_cds <- cds_coverage(reads_psite_list, mm81cdna)
## Compute the number of in-frame P-sites per coding sequence excluding
## the first 15 nucleotides and the last 10 nucleotides.
psite_cds <- cds_coverage(reads_psite_list, mm81cdna, start_nts = 15, stop_nts = 10)
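
To make the role of start_nts and stop_nts more concrete, the following base R sketch counts in-frame P-sites on toy data. It is not the riboWaltz implementation. It assumes that a P-site is in frame when its distance from the first nucleotide of the start codon (psite_from_start) is a multiple of 3, and that the excluded nucleotides are simply trimmed from the two ends of the CDS. The helper count_inframe and all values are hypothetical.

```r
# Toy annotation with the column names described for create_annotation.
annotation <- data.frame(transcript = c("tx1", "tx2"),
                         l_tr = c(300, 150),
                         l_utr5 = c(30, 0),
                         l_cds = c(240, 0),
                         l_utr3 = c(30, 150),
                         stringsAsFactors = FALSE)

# Toy P-sites: psite_from_start is the distance (nt) from the first
# nucleotide of the start codon (0 = first nucleotide of the CDS).
psites <- data.frame(transcript = c("tx1", "tx1", "tx1", "tx1", "tx1", "tx2"),
                     psite_from_start = c(0, 3, 4, 18, 237, 12),
                     stringsAsFactors = FALSE)

count_inframe <- function(psites, annotation, start_nts = 0, stop_nts = 0) {
  # Transcripts without an annotated CDS are discarded.
  ann <- annotation[annotation$l_cds > 0, ]
  m <- merge(psites, ann, by = "transcript")
  # Keep P-sites inside the (possibly trimmed) CDS window.
  in_window <- m$psite_from_start >= start_nts &
    m$psite_from_start < m$l_cds - stop_nts
  # Assumption: in frame means frame 0 with respect to the start codon.
  in_frame <- m$psite_from_start %% 3 == 0
  keep <- m[in_window & in_frame, ]
  counts <- table(factor(keep$transcript, levels = ann$transcript))
  data.frame(transcript = ann$transcript,
             length = ann$l_tr,
             in_frame_psites = as.integer(counts),
             stringsAsFactors = FALSE)
}

whole <- count_inframe(psites, annotation)
trimmed <- count_inframe(psites, annotation, start_nts = 15, stop_nts = 10)
print(whole)
print(trimmed)
```

In this toy case tx2 has no annotated CDS and is dropped, while trimming the CDS of tx1 removes the P-sites close to the start and stop codons from the count.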

codon_coverage

Number of reads per codon.

Description
This function computes transcript-specific codon coverages, defined as the number of either read
footprints or P-sites mapping on each triplet of coding sequences and UTRs (see Details). The
resulting data table contains, for each triplet: i) the name of the corresponding reference sequence
(i.e. of the transcript to which it belongs); ii) its leftmost and rightmost position with respect to the
1st nucleotide of the reference sequence; iii) its position with respect to the 1st and the last codon
of the annotated CDS of the reference sequence; iv) the region of the transcript (5’ UTR, CDS, 3’
UTR) it is in; v) the number of read footprints or P-sites falling in that region for all samples.
Usage
codon_coverage(data, annotation, sample = NULL, psite = FALSE,
min_overlap = 1, granges = FALSE)
Arguments
data

List of data tables from psite_info. Data tables generated by bamtolist and
bedtolist can be used if psite is FALSE (the default).

annotation

Data table as generated by create_annotation.

sample

Character string vector specifying the name of the sample(s) of interest. Default
is NULL i.e. all samples in data are processed.

psite

Logical value whether to return the number of P-sites per codon. Default is
FALSE i.e. the number of read footprints per codon is returned instead.

min_overlap

Positive integer specifying the minimum number of overlapping positions (in
nucleotides) between reads and codons to be considered overlapping. If psite
is TRUE this parameter must be 1 (the default).

granges

Logical value whether to return a GRanges object. Default is FALSE i.e. a
data table is returned instead.

Details
The sequence of every transcript is divided into triplets starting from the annotated translation
initiation site (if any) and proceeding towards the UTRs extremities, possibly discarding the exceeding
1 or 2 nucleotides at the extremities of the transcript. Please note: transcripts not associated with any
annotated 5’ UTR, CDS and 3’ UTR and transcripts whose coding sequence length is not divisible
by 3 are automatically discarded.
Value
A data table or a GRanges object.

Examples
data(reads_psite_list)
data(mm81cdna)
## Compute the codon coverage based on the number of ribosome footprints per
## codon, setting the minimum overlap between reads and triplets to 3 nts:
coverage_dt <- codon_coverage(reads_psite_list, mm81cdna, min_overlap = 3)
## Compute the coverage based on the number of P-sites per codon:
coverage_dt <- codon_coverage(reads_psite_list, mm81cdna, psite = TRUE)
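
The subdivision into triplets described in Details can be sketched in base R as follows. This is a minimal illustration for a single transcript, not the code used by codon_coverage. The region lengths are made up, positions are 1-based, and all variable names apart from l_utr5, l_cds and l_utr3 are hypothetical.

```r
# Hypothetical region lengths (nt) for one transcript: 5' UTR, CDS, 3' UTR.
l_utr5 <- 8
l_cds <- 12   # must be divisible by 3
l_utr3 <- 7
l_tr <- l_utr5 + l_cds + l_utr3

# 1-based position of the first nucleotide of the start codon.
start_pos <- l_utr5 + 1

# Triplets are anchored on the start codon, so the l_utr5 %% 3 nucleotides
# at the 5' extremity of the transcript do not belong to any triplet.
first_start <- start_pos - 3 * (l_utr5 %/% 3)
n_triplets <- (l_tr - first_start + 1) %/% 3
triplet_start <- first_start + 3 * (seq_len(n_triplets) - 1)
triplet_end <- triplet_start + 2

# Codon index with respect to the first codon of the CDS (0 = start codon).
from_cds_start <- (triplet_start - start_pos) / 3

# Region of the transcript each triplet belongs to.
region <- ifelse(triplet_end < start_pos, "5utr",
                 ifelse(triplet_start > l_utr5 + l_cds, "3utr", "cds"))

triplets <- data.frame(triplet_start, triplet_end, from_cds_start, region)
print(triplets)
```

With these lengths the first two nucleotides and the last nucleotide of the transcript are left out, and the CDS is covered by exactly l_cds / 3 triplets.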

codon_usage_psite

Empirical codon usage indexes.

Description
This function computes empirical codon usage indexes based on the frequencies of ribosome P-sites,
A-sites or E-sites associated with in-frame P-sites within the coding sequence. It computes 64
codon usage indexes (one per triplet) normalized for the frequency of the corresponding codons
within the CDS and generates a bar plot of the resulting values. Optionally, this function compares
the computed codon usage indexes with a set of 64 values provided by the user. In this case the
function returns a scatter plot, reporting the result of a linear regression between the two variables
(i.e. the two sets of values) and the corresponding Pearson correlation coefficient.
Usage
codon_usage_psite(data, annotation, sample, site = "psite",
fastapath = NULL, fasta_genome = TRUE, bsgenome = NULL,
gtfpath = NULL, txdb = NULL, dataSource = NA, organism = NA,
transcripts = NULL, codon_values = NULL, scatter_label = FALSE,
aminoacid = FALSE)
Arguments
data

List of data tables from psite_info. Each data table may or may not include
one or more columns among p_site_codon, a_site_codon and e_site_codon reporting the three nucleotides covered by the P-site, A-site and E-site, respectively. These columns can be previously generated by the psite_info function.
Otherwise the column of interest can be specified by site and is automatically
generated starting from a FASTA file or a BSgenome data package.

annotation

Data table as generated by create_annotation.

sample

Character string specifying the name of the sample of interest.

site

Either "psite", "asite" or "esite". It specifies if the empirical codon usage indexes
should be based on ribosome P-sites ("psite"), A-sites ("asite") or E-sites ("esite"). Default is "psite".

fastapath

Character string specifying the FASTA file used in the alignment step, including
its path, name and extension. This file can contain reference nucleotide sequences either of a genome assembly or of all the transcripts (see Details and
fasta_genome). Please make sure the sequences derive from the same release
of the annotation file used in the create_annotation function. Note: either
fastapath or bsgenome is required to compute the codon frequencies within
the CDS used as normalization factors, even when data already includes one or
more columns among p_site_codon, a_site_codon and e_site_codon. Default is
NULL.

fasta_genome

Logical value whether the FASTA file specified by fastapath contains nucleotide
sequences of a genome assembly. If TRUE (the default), an annotation object is
required (see gtfpath and txdb). FALSE implies that the nucleotide sequences of
all the transcripts are provided instead.

bsgenome

Character string specifying the BSgenome data package with the genome sequences
to be loaded. If not already present in the system, it is automatically installed
through the biocLite.R script (check the list of available BSgenome data
packages by running the available.genomes function of the BSgenome package).
This parameter must be coupled with an annotation object (see gtfpath
and txdb). Please make sure the sequences included in the specified BSgenome
data package are in agreement with the sequences used in the alignment step.
Note: either fastapath or bsgenome is required to compute the codon frequencies
within the CDS used as normalization factors, even when data already includes
one or more columns among p_site_codon, a_site_codon and
e_site_codon. Default is NULL.

gtfpath

Character string specifying the location of a GTF file, including its path, name
and extension. Please make sure the GTF file and the sequences specified by
fastapath or bsgenome derive from the same release. Note that either gtfpath
or txdb is required if and only if nucleotide sequences of a genome assembly
are provided (see fastapath or bsgenome). Default is NULL.

txdb

Character string specifying the TxDb annotation package to be loaded. If not
already present in the system, it is automatically installed through the biocLite.R
script (check here the list of available TxDb annotation packages). Please make
sure the TxDb annotation package and the sequences specified by fastapath
or bsgenome derive from the same release. Note that either gtfpath or txdb is
required if and only if nucleotide sequences of a genome assembly are provided
(see fastapath or bsgenome). Default is NULL.

dataSource

Optional character string describing the origin of the GTF data file. This parameter is considered only if gtfpath is specified. For more information about this
parameter please refer to the description of dataSource of the makeTxDbFromGFF
function included in the GenomicFeatures package.

organism

Optional character string reporting the genus and species of the organism of
the GTF data file. This parameter is considered only if gtfpath is specified.
For more information about this parameter please refer to the description of
organism of the makeTxDbFromGFF function included in the GenomicFeatures
package.

transcripts

Character string vector listing the name of transcripts to be included in the
analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts
without annotated CDS and transcripts whose coding sequence length is not
divisible by 3 are automatically discarded.

codon_values

Data table containing 64 codon-specific values. If specified, the provided values
are compared with the empirical codon usage indexes computed for the sample
of interest. The data table must contain the DNA or RNA nucleotide sequence
of the 64 codons and the corresponding values arranged in two columns named
codon and value, respectively. Please note: a data table of the same format is
returned by codon_usage_psite itself. Default is NULL.

scatter_label

Logical value whether to label the dots of the scatter plot generated by
specifying codon_values. Each dot is labeled using either the nucleotide sequence
of the codon or the corresponding amino acid symbol (see aminoacid). This
parameter is considered only if codon_values is specified. Default is FALSE.

aminoacid

Logical value whether to use the amino acid symbols to label the dots of the
scatter plot generated by specifying codon_values. Default is FALSE i.e. the
nucleotide sequences of the codons are used instead. This parameter is
considered only if codon_values is specified and scatter_label is TRUE.

Details
riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the
main purpose of RiboSeq assays to study translational events through the isolation and sequencing
of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and
not on introns and intergenic regions. Nevertheless, BAM files based on transcript coordinates can
be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against
standard chromosome sequences, requiring the outputs to be translated into transcript coordinates.
The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA
file where each sequence represents a transcript, from the beginning of the 5’ UTR to the end
of the 3’ UTR. The second procedure is based on reference FASTA files where each sequence
represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF).
The STAR aligner, with its option --quantMode TranscriptomeSAM (see Chapter 6 of its manual),
is an example of a tool providing such a feature.
Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt"). If
codon_values is specified, an additional ggplot2 object ("plot_comparison") is returned.
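
The two-column layout expected for codon_values (columns codon and value, one row per codon) can be prepared along these lines. The sketch uses a plain data frame and random placeholder numbers, which are made up and carry no biological meaning; the object name codon_values_dt is hypothetical.

```r
# Enumerate the 64 DNA codons in a fixed order (AAA, AAC, AAG, AAT, ACA, ...).
bases <- c("A", "C", "G", "T")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)

# Placeholder values (made up): replace them with real codon-specific values.
set.seed(1)
codon_values_dt <- data.frame(codon = codons,
                              value = runif(length(codons)),
                              stringsAsFactors = FALSE)

print(head(codon_values_dt))
print(dim(codon_values_dt))
```

Since the argument is documented as a data table, the data frame would presumably need to be converted, for instance with data.table::as.data.table, before being passed to codon_usage_psite.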

create_annotation

Annotation data table.

Description
This function generates transcript basic annotation data tables starting from GTF files or TxDb objects. Annotation data tables include a column named transcript reporting the name of the reference
transcripts and four columns named l_tr, l_utr5, l_cds and l_utr3 reporting the length of the transcripts and of their annotated 5’ UTRs, CDSs and 3’ UTRs, respectively. Please note: if a transcript
region is not annotated its length is set to 0.

Usage
create_annotation(gtfpath = NULL, txdb = NULL, dataSource = NA,
organism = NA)
Arguments
gtfpath

A character string specifying the path to a GTF file, including its name and
extension. Please make sure the GTF file derives from the same release of the
sequences used in the alignment step. Note that either gtfpath or txdb must be
specified. Default is NULL.

txdb

Character string specifying the TxDb annotation package to be loaded. If not already present in the system, it is automatically installed through the biocLite.R
script (check here the list of available TxDb annotation packages). Please make
sure the TxDb annotation package derives from the same release of the sequences used in the alignment step. Note that either gtfpath or txdb must
be specified. Default is NULL.

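dataSource

Optional character string describing the origin of the GTF data file. This
parameter is considered only if gtfpath is specified. For more information about
this parameter please refer to the description of dataSource of the
makeTxDbFromGFF function included in the GenomicFeatures package.

organism

Optional character string reporting the genus and species of the organism of
the GTF data file. This parameter is considered only if gtfpath is specified.
For more information about this parameter please refer to the description of
organism of the makeTxDbFromGFF function included in the GenomicFeatures
package.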

Value
A data table.
Examples
## gtf_file <- "path/to/GTF/file.GTF"
## create_annotation(gtfpath = gtf_file, dataSource = "gencode6", organism = "Mus musculus")
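
As a rough picture of the structure described above, the sketch below builds a toy annotation table by hand and runs two sanity checks on it. The transcript names and lengths are made up, a plain data frame stands in for the data table, and the checks are only suggestions, not something create_annotation is documented to perform.

```r
# Toy annotation table using the column names listed in the Description.
annotation_dt <- data.frame(
  transcript = c("tx_coding", "tx_noncoding"),
  l_tr   = c(1500, 800),
  l_utr5 = c(120, 0),
  l_cds  = c(1200, 0),
  l_utr3 = c(180, 0),
  stringsAsFactors = FALSE
)

# Regions that are not annotated have length 0, so transcripts with an
# annotated CDS can be selected through l_cds.
has_cds <- annotation_dt$l_cds > 0

# In this toy table the three regions of the coding transcript add up to
# the transcript length.
regions_sum <- with(annotation_dt, l_utr5 + l_cds + l_utr3)
consistent <- regions_sum[has_cds] == annotation_dt$l_tr[has_cds]

# Several functions discard CDSs whose length is not divisible by 3.
divisible <- annotation_dt$l_cds[has_cds] %% 3 == 0

print(annotation_dt[has_cds, ])
print(c(consistent = all(consistent), divisible = all(divisible)))
```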

frame_psite

Percentage of P-sites per reading frame.

Description
This function computes the percentage of P-sites falling in the three possible translation reading
frames and generates a bar plot of the resulting values. It only handles annotated 5’ UTRs, coding
sequences and 3’ UTRs, separately.

Usage
frame_psite(data, sample = NULL, transcripts = NULL, region = "all",
length_range = "all", plot_title = NULL)
Arguments
data

List of data tables from psite_info.

sample

Character string vector specifying the name of the sample(s) of interest. Default
is NULL i.e. all samples in data are processed.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used.

region

Character string specifying the region(s) of the transcripts to be analysed. It can
be either "5utr", "cds", "3utr" for 5’ UTRs, CDSs and 3’ UTRs, respectively. Default is "all" i.e. all regions are considered. According to this parameter the bar
plots are differently arranged to optimise the organization and the visualization
of the data.

length_range

Integer or an integer vector specifying the read length(s) to be included in the
analysis. Default is "all" i.e. all read lengths are used.

plot_title

Character string specifying the title of the plot. If "auto", the title of the plot
reports the region specified by region (if any) and the considered read length(s).
Default is NULL i.e. no title is plotted.

Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").
Examples
data(reads_psite_list)
## Generate the bar plot for all read lengths:
frame_whole <- frame_psite(reads_psite_list, sample = "Samp1")
## Generate the bar plot restricting the analysis to coding sequences and
## reads of 28 nucleotides:
frame_sub <- frame_psite(reads_psite_list, sample = "Samp1", region = "cds",
length_range = 28)
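
The quantity shown in the bar plot can be approximated by hand. The sketch below assumes that the frame of a P-site is its distance from the start codon (psite_from_start) modulo 3 and only considers P-sites in the CDS; the values are made up and the code is an illustration, not the function's implementation.

```r
# Toy distances (nt) of P-sites from the first nucleotide of the start codon.
psite_from_start <- c(0, 3, 6, 9, 12, 1, 4, 2, 15, 18)

# Assumption: the reading frame is the distance from the start codon modulo 3.
frame <- psite_from_start %% 3

# Percentage of P-sites falling in each of the three frames.
counts <- table(factor(frame, levels = 0:2))
percentage <- 100 * as.numeric(counts) / sum(counts)

frame_dt <- data.frame(frame = 0:2, percentage = percentage)
print(frame_dt)
```

A strong enrichment in one frame, as in this toy example, is what is expected along coding sequences when the P-site offsets are well calibrated.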

frame_psite_length

Percentage of P-sites per reading frame stratified by read length.

Description
Similar to frame_psite, but the results are stratified by read lengths and plotted as heatmaps.

Usage
frame_psite_length(data, sample = NULL, transcripts = NULL,
region = "all", cl = 100, length_range = "all", plot_title = NULL)
Arguments
data

List of data tables from psite_info.

sample

Character string vector specifying the name of the sample(s) of interest. Default
is NULL i.e. all samples in data are processed.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used.

region

Character string specifying the region(s) of the transcripts to be analysed. It can
be either "5utr", "cds", "3utr" for 5’ UTRs, CDSs and 3’ UTRs, respectively.
Default is "all" i.e. all regions are considered. According to this parameter the
heatmaps are differently arranged to optimise the organization and the visualization of the data.

cl

Integer value in [1,100] specifying a confidence level for restricting the analysis
to a sub-range of read lengths i.e. to the cl read lengths associated to the
highest signals. Default is 100. This parameter has no effect if length_range is
specified.

length_range

Integer or an integer vector specifying the read length(s) to be included in the
analysis. Default is "all" i.e. all read lengths are used. If specified, this parameter
prevails over cl.

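plot_title

Character string specifying the title of the plot. If "auto", the title of the plot
reports the region specified by region (if any) and the considered read length(s).
Default is NULL i.e. no title is plotted.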

Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").
Examples
data(reads_psite_list)
## Generate the heatmap for all read lengths:
frame_len_whole <- frame_psite_length(reads_psite_list, sample = "Samp1")
## Generate the heatmap restricting the analysis to coding sequences and a
## sub-range of read lengths:
frame_len_sub <- frame_psite_length(reads_psite_list, sample = "Samp1",
region = "cds", cl = 90)

length_filter

Read length filtering.

Description
Read length filtering.
Usage
length_filter(data, length_filter_mode, length_filter_vector = NULL,
periodicity_threshold = 50, granges = FALSE)
Arguments
data
List of data tables from bamtolist, bedtolist or psite_info.
length_filter_mode
Either "custom" or "periodicity". It specifies how read length selection should
be performed. "custom": only read lengths specified by the user are kept (see
length_filter_vector); "periodicity": only read lengths satisfying a periodicity threshold (see periodicity_threshold) are kept. The latter mode enables
the removal of all reads with low or no periodicity.
length_filter_vector
Integer or an integer vector specifying a read length or a range of read lengths to
keep, respectively. This parameter is considered only if length_filter_mode
is "custom".
periodicity_threshold
Integer in [10, 100]. Only read lengths satisfying this threshold (i.e. a higher
percentage of read extremities falls in one of the three reading frames along the
CDS) are kept. This parameter is considered only if length_filter_mode is
"periodicity". Default is 50.
granges

Logical value whether to return a GRangesList object. Default is FALSE i.e. a
list of data tables is returned instead (the required input for psite, psite_info,
rends_heat and rlength_distr).

Value
A list of data tables or a GRangesList object.
Examples
data(reads_list)
## Keep reads of length between 27 and 30 nucleotides (included):
filtered_list <- length_filter(reads_list, length_filter_mode = "custom",
length_filter_vector = 27:30)
## Keep reads of lengths satisfying a periodicity threshold (70%):
filtered_list <- length_filter(reads_list, length_filter_mode = "periodicity",
periodicity_threshold = 70)
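
The idea behind the "periodicity" mode can be sketched on toy data. The code below is one possible reading of the description, not the riboWaltz implementation: it assumes that the frame of one read extremity is already available for each read, that the periodicity of a read length is the highest percentage of reads found in a single frame, and that a read length is kept when this percentage is at least the threshold. Column names and values are hypothetical.

```r
# Toy reads: read length and frame of one read extremity with respect to the
# start codon (both columns are hypothetical and made up for illustration).
reads <- data.frame(
  length = c(rep(28, 10), rep(31, 9)),
  frame  = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 2,    # 28 nt: 80% in frame 0
             0, 0, 0, 1, 1, 1, 2, 2, 2)       # 31 nt: no preferred frame
)

# Highest percentage of reads falling in a single frame, per read length.
max_frame_pct <- tapply(reads$frame, reads$length, function(f) {
  100 * max(table(factor(f, levels = 0:2))) / length(f)
})
print(max_frame_pct)

# Keep only the read lengths reaching the threshold (assumed to be inclusive).
periodicity_threshold <- 70
kept_lengths <- as.integer(names(max_frame_pct)[max_frame_pct >= periodicity_threshold])
filtered <- reads[reads$length %in% kept_lengths, ]
print(kept_lengths)
```

Reads of 31 nucleotides are spread evenly over the three frames and are therefore removed, while reads of 28 nucleotides pass the filter.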

metaheatmap_psite

Ribosome occupancy metaheatmaps at single-nucleotide resolution.

Description
This function generates two heatmap-like metaprofiles (metaheatmaps) displaying the abundance of
P-sites around the start and the stop codon of annotated CDSs. It works similarly to metaprofile_psite
but the intensity of signal is represented by a continuous color scale rather than by the height of a
line chart. This graphical output is a good option to visualize several profiles at once and compare
results obtained with different read lengths or in multiple conditions.
Usage
metaheatmap_psite(data, annotation, sample, scale_factors = NULL,
length_range = "all", transcripts = NULL, utr5l = 25, cdsl = 50,
utr3l = 25, log_colour = F, colour = "black", plot_title = NULL)
Arguments
data

List of data tables from psite_info.

annotation

Data table as generated by create_annotation.

sample

List of either character strings specifying the name of the sample(s) of interest or
character string vectors specifying the name of their replicates. In the latter case
the final metaheatmaps for each element of the list are generated by merging the
results for the corresponding replicates exploiting the scale factors specified by
scale_factors. The row(s) of the final plot are labelled according to the name
of the elements of the list.

scale_factors

Named numeric vector specifying the scale factors for generating metaprofiles
from multiple replicates (see sample). Scale factors can be defined for a subset
of list elements of sample i.e. for all replicates of selected samples. If so, the remaining scale factors are set automatically to 1. Please be careful to name each
element of the vector after the correct corresponding string in sample. No specific order is required. Default is NULL i.e. all scale factors are automatically
set to 1.

length_range

Integer or an integer vector specifying the read length(s) to be included in the
analysis. Default is "all" i.e. all read lengths are used.

transcripts

Character string vector listing the name of transcripts to be included in the
analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts
with either 5’ UTR, coding sequence or 3’ UTR shorter than utr5l, 2*cdsl and
utr3l, respectively, are automatically discarded.

utr5l

Positive integer specifying the length (in nucleotides) of the 5’ UTR region
flanking the start codon to be considered in the analysis. Default is 25.

cdsl

Positive integer specifying the length (in nucleotides) of the CDS regions flanking both the start and stop codon to be considered in the analysis. Default is
50.

utr3l

Positive integer specifying the length (in nucleotides) of the 3’ UTR region
flanking the stop codon to be considered in the analysis. Default is 25.

log_colour

Logical value whether to use a logarithmic colour scale (strongly suggested in
case of large signal variations). Default is FALSE.

colour

Character string specifying the colour of the plot. The colour scheme is as
follows: tiles corresponding to the lowest signal are always white, tiles
corresponding to the highest signal are of the specified colour and the progression
between these two colours follows either linear or logarithmic gradients (see
log_colour). Default is "black".

plot_title

Character string specifying the title of the plot. If "auto", the title of the plot
reports the number of transcripts and the read length(s) employed for generating
the metaprofiles. Default is NULL i.e. no title is displayed.

Details
The intensity of signal in the metaprofiles corresponds, for each nucleotide, to the sum of the number
of P-sites (defined by their leftmost position) mapping on that position for all transcripts in one or
multiple replicates.
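
The computation described above can be sketched in base R. This is only a minimal illustration, not the package's implementation: the toy tables, the replicate names rep1 and rep2 and the helpers fill_scale_factors and metaprofile_signal are hypothetical, and it is assumed that each scale factor multiplies the counts of its replicate before the replicates are summed.

```r
# Toy P-site tables for two replicates (hypothetical data): one row per read,
# psite_from_start is the P-site position relative to the start codon.
reps <- list(
  rep1 = data.frame(transcript = c("tA", "tA", "tB"),
                    psite_from_start = c(0, 3, 0)),
  rep2 = data.frame(transcript = c("tA", "tB", "tB"),
                    psite_from_start = c(0, 0, 3))
)

# Complete a named scale factor vector: replicates without a factor get 1.
fill_scale_factors <- function(sample_names, scale_factors = NULL) {
  out <- setNames(rep(1, length(sample_names)), sample_names)
  if (!is.null(scale_factors)) {
    stopifnot(all(names(scale_factors) %in% sample_names))
    out[names(scale_factors)] <- scale_factors
  }
  out
}

# Sum, position by position, the scaled number of P-sites over all
# transcripts and replicates (assumed use of the scale factors).
metaprofile_signal <- function(reps, scale_factors = NULL) {
  sf <- fill_scale_factors(names(reps), scale_factors)
  positions <- sort(unique(unlist(lapply(reps, `[[`, "psite_from_start"))))
  signal <- setNames(numeric(length(positions)), positions)
  for (n in names(reps)) {
    counts <- table(reps[[n]]$psite_from_start)
    signal[names(counts)] <- signal[names(counts)] + sf[[n]] * as.numeric(counts)
  }
  signal
}

signal <- metaprofile_signal(reps, scale_factors = c(rep2 = 0.5))
print(signal)
```

Only rep2 is given a scale factor here, so rep1 falls back to 1, mirroring the behaviour described for scale_factors.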
Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").
Examples
data(reads_psite_list)
## Generate metaheatmaps employing all read lengths:
metaheat_whole <- metaheatmap_psite(reads_psite_list, mm81cdna, sample = list("Whole"=c("Samp1")))
## Generate metaheatmaps employing reads of 27, 28 and 29 nucleotides and a
## subset of transcripts (in this example only transcripts with at least one
## P-site mapping on the translation initiation site are kept):
sample_name <- "Samp1"
sub_reads_psite_list <- subset(reads_psite_list[[sample_name]], psite_from_start == 0)
transcript_names <- as.character(sub_reads_psite_list$transcript)
metaheat_sub <- metaheatmap_psite(reads_psite_list, mm81cdna, sample = list("sub"=sample_name),
length_range = 27:29, transcripts = transcript_names, plot_title = "auto")
## Generate two sets of metaheatmaps, displayed in the same plot. In this
## example one set of metaheatmaps is based on all read lengths while the
## other one is generated employing only reads of 28 nucleotides:
sample_name <- "Samp1"
metaheat_df <- list()
metaheat_df[["subsample_28nt"]] <- subset(reads_psite_list[[sample_name]], length == 28)
metaheat_df[["whole_sample"]] <- reads_psite_list[[sample_name]]
names_list <- list("Only_28" = c("subsample_28nt"), "All" = c("whole_sample"))

metaheat_comparison <- metaheatmap_psite(metaheat_df, mm81cdna, sample = names_list)

metaprofile_psite

Ribosome occupancy metaprofiles at single-nucleotide resolution.

Description
This function generates two metaprofiles displaying the abundance of P-sites around the start and
the stop codon of annotated CDSs.
Usage
metaprofile_psite(data, annotation, sample, scale_factors = NULL,
length_range = "all", transcripts = NULL, utr5l = 25, cdsl = 50,
utr3l = 25, plot_title = NULL)
Arguments
data

List of data tables from psite_info.

annotation

Data table as generated by create_annotation.

sample

Either a character string specifying the name of the sample of interest or a character string vector specifying the name of its replicates. In the latter case the final
metaprofiles are generated by merging the results for each replicate exploiting
the scale factors specified by scale_factors.

scale_factors

Named numeric vector the same length as sample specifying the scale factors
for generating metaprofiles from multiple replicates (see sample). Please be
careful to name each element of the vector after the correct corresponding string
in sample. No specific order is required. Default is NULL i.e. all scale factors
are automatically set to 1.

length_range

Integer or an integer vector specifying the read length(s) to be included in the
analysis. Default is "all" i.e. all read lengths are used.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts
with either 5’ UTR, coding sequence or 3’ UTR shorter than utr5l, 2∗cdsl and
utr3l, respectively, are automatically discarded.

utr5l

Positive integer specifying the length (in nucleotides) of the 5’ UTR region
flanking the start codon to be considered in the analysis. Default is 25.

cdsl

Positive integer specifying the length (in nucleotides) of the CDS regions flanking both the start and stop codon to be considered in the analysis. Default is
50.

utr3l

Positive integer specifying the length (in nucleotides) of the 3’ UTR region
flanking the stop codon to be considered in the analysis. Default is 25.

plot_title

Character string specifying the title of the plot. If "auto", the title of the plot reports the sample(s) specified by sample as well as the number of transcripts and
the read length(s) employed for generating the metaprofiles. Default is NULL
i.e. no title is displayed.
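
The length filter mentioned for transcripts can be illustrated with a short base R sketch. It assumes an annotation table with the l_utr5, l_cds and l_utr3 columns described for mm81cdna; the transcript names, the lengths and the helper keep_transcripts are hypothetical and do not come from the package.

```r
# Toy annotation with the length columns used by the filter
# (transcript names and lengths are hypothetical).
annotation <- data.frame(
  transcript = c("tA", "tB", "tC", "tD"),
  l_utr5 = c(120, 10, 60, 0),
  l_cds  = c(900, 600, 80, 0),
  l_utr3 = c(300, 200, 150, 0),
  stringsAsFactors = FALSE
)

# Keep transcripts whose 5' UTR, CDS and 3' UTR are at least utr5l, 2 * cdsl
# and utr3l nucleotides long, respectively. The CDS must host cdsl nucleotides
# after the start codon and cdsl nucleotides before the stop codon.
keep_transcripts <- function(annotation, utr5l = 25, cdsl = 50, utr3l = 25) {
  ok <- annotation$l_utr5 >= utr5l &
    annotation$l_cds >= 2 * cdsl &
    annotation$l_utr3 >= utr3l
  annotation$transcript[ok]
}

kept <- keep_transcripts(annotation)
print(kept)
```

With the default values only tA survives: tB has a 5’ UTR shorter than 25 nucleotides, tC has a CDS shorter than 100 nucleotides and tD has no annotated regions.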

mm81cdna

Annotation

Description
A dataset containing basic information about 109,712 mouse mRNAs (Ensembl v81 transcript annotation).
Usage
mm81cdna
Format
A data table with 109,712 rows and 5 variables:
transcript Name of the transcript (ENST ID and version, dot separated)
l_tr Length of the transcript, in nucleotides
l_utr5 Length of the annotated 5’ UTR (if any), in nucleotides
l_cds Length of the annotated CDS (if any), in nucleotides
l_utr3 Length of the annotated 3’ UTR (if any), in nucleotides

psite

Ribosome P-sites position within reads.

Description
This function identifies the exact position of the ribosome P-site within each read, determined by the
localisation of its first nucleotide (see Details). It returns a data table containing, for all samples
and read lengths: i) the percentage of reads in the whole dataset, ii) the percentage of reads aligning
on the start codon (if any); iii) the distance of the P-site from the two extremities of the reads before
and after the correction step; iv) the name of the sample. Optionally, this function plots a collection
of read length-specific occupancy metaprofiles displaying the P-site offsets computed through the
process.
Usage
psite(data, flanking = 6, start = TRUE, extremity = "auto",
plot = FALSE, plot_dir = NULL, plot_format = "png", cl = 99)
Arguments
data

List of data tables from bamtolist, bedtolist or length_filter.

flanking

Integer value specifying for the selected reads the minimum number of nucleotides that must flank the reference codon in both directions. Default is 6.

start

Logical value whether to use the translation initiation site as reference codon.
Default is TRUE. If FALSE, the second to last codon is used instead.

extremity

Either "5end", "3end" or "auto". It specifies if the correction step should be
based on 5’ extremities ("5end") or 3’ extremities ("3end"). Default is "auto"
i.e. the optimal extremity is automatically selected.

plot

Logical value whether to plot the occupancy metaprofiles displaying the P-site
offsets computed in both steps of the algorithm. Default is FALSE.

plot_dir

Character string specifying the directory where read length-specific occupancy
metaprofiles should be stored. If the specified folder doesn’t exist, it is automatically created. If NULL (the default), the metaprofiles are stored in a new
subfolder of the working directory, called offset_plot. This parameter is considered only if plot is TRUE.

plot_format

Either "png" (the default) or "pdf". This parameter specifies the file format
storing the length-specific occupancy metaprofiles. It is considered only if plot
is TRUE.

cl

Integer value in [1,100] specifying a confidence level for generating occupancy
metaprofiles for a sub-range of read lengths i.e. for the cl% of read lengths associated
with the highest signals. Default is 99. This parameter is considered only if plot is TRUE.

Details
The P-site offset (PO) is defined as the distance between the extremities of a read and the first
nucleotide of the P-site itself. The function processes all samples separately, starting from reads
mapping on the reference codon (either the start codon or the second to last codon, see start) of
any annotated coding sequence. Read length-specific POs are inferred in two steps.
First, reads mapping on the reference codon are grouped according to their length, each group
corresponding to a bin. Reads whose extremities are too close to the reference codon are discarded
(see flanking). For each bin, temporary 5’ and 3’ POs are defined as the distances between the first
nucleotide of the reference codon and the nucleotide corresponding to the global maximum found in
the profiles of the 5’ and the 3’ end at the left and at the right of the reference codon, respectively.
After the identification of the P-site for all reads aligning on the reference codon, the POs
corresponding to each length are assigned to each read of the dataset.
Second, the most frequent temporary POs associated with the optimal extremity (see extremity) and
the predominant bins are exploited as reference values for correcting the temporary POs of smaller
bins. Briefly, the correction step defines for each length bin a new PO based on the local maximum
whose distance from the reference codon is the closest to the most frequent temporary POs. For
further details please refer to the riboWaltz article.
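
The first step can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code used by psite: the toy reads, the helper temporary_offsets and the exact rule used to discard reads too close to the reference codon are hypothetical, and the correction step is not reproduced.

```r
# Toy reads overlapping a start codon whose first nucleotide is at
# transcript position 101 (hypothetical data).
reads <- data.frame(
  end5 = c(89, 89, 88, 89, 88, 88, 88, 99),
  end3 = c(116, 116, 115, 116, 116, 116, 116, 126),
  cds_start = 101
)
reads$length <- reads$end3 - reads$end5 + 1

temporary_offsets <- function(reads, flanking = 6) {
  # Distances of the two read extremities from the first nucleotide of the
  # reference codon.
  d5 <- reads$cds_start - reads$end5
  d3 <- reads$end3 - reads$cds_start
  # Discard reads whose extremities are too close to the reference codon
  # (assumed rule: at least `flanking` nucleotides on each side of the codon).
  keep <- d5 >= flanking & d3 >= flanking + 2
  reads <- reads[keep, ]
  d5 <- d5[keep]
  d3 <- d3[keep]
  # One bin per read length; the temporary offset of a bin is the most
  # frequent distance, i.e. the global maximum of the extremity profile.
  mode_of <- function(x) as.integer(names(which.max(table(x))))
  data.frame(
    length = sort(unique(reads$length)),
    offset_from_5 = as.integer(tapply(d5, reads$length, mode_of)),
    offset_from_3 = as.integer(tapply(d3, reads$length, mode_of))
  )
}

temp_po <- temporary_offsets(reads)
print(temp_po)
```

In this toy dataset the last read starts only 2 nucleotides upstream of the start codon and is therefore discarded before the bins are built.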
Value
A data table.
Examples
data(reads_list)
## Compute the P-site offset automatically selecting the optimal read
## extremity for the correction step and not plotting any metaprofile:
psite(reads_list, flanking = 6, extremity="auto")
## Compute the P-site offset specifying the extremity used in the correction
## step and plotting the length-specific occupancy metaprofiles for a
## sub-range of read lengths (the middle 95%). Since plot_dir is not specified,
## the plots are stored in the offset_plot subfolder of the working directory:
psite_offset <- psite(reads_list, flanking = 6, extremity = "3end", plot = TRUE, cl = 95)

psite_info

Update reads information according to the inferred P-sites.

Description
This function provides additional reads information according to the position of the P-site identified
by psite. It attaches to each data table in a list four columns reporting i) the P-site position with
respect to the 1st nucleotide of the transcript, ii) the P-site position with respect to the start and the
stop codon of the annotated coding sequence (if any) and iii) the region of the transcript (5’ UTR,
CDS, 3’ UTR) that includes the P-site. Please note: for transcripts not associated to any annotated
CDS the position of the P-site with respect to the start and the stop codon is set to NA. Optionally,
additional columns reporting the three nucleotides covered by the P-site, the A-site and the E-site
are attached, based on FASTA files or BSgenome data packages containing the transcript nucleotide
sequences.
Usage
psite_info(data, offset, site = NULL, fastapath = NULL,
fasta_genome = TRUE, bsgenome = NULL, gtfpath = NULL, txdb = NULL,
dataSource = NA, organism = NA, granges = FALSE)
Arguments
data

List of data tables from bamtolist, bedtolist or length_filter.

offset

Data table from psite.

site

Either "psite", "asite", "esite" or a combination of these strings. It specifies if
additional column(s) reporting the three nucleotides covered by the ribosome P-site ("psite"), A-site ("asite") and E-site ("esite") should be added. Note: either
fastapath or bsgenome is required for this purpose. Default is NULL.

fastapath

Character string specifying the FASTA file used in the alignment step, including
its path, name and extension. This file can contain reference nucleotide sequences either of a genome assembly or of all the transcripts (see Details and
fasta_genome). Please make sure the sequences derive from the same release
of the annotation file used in the create_annotation function. Note: either
fastapath or bsgenome is required to generate additional column(s) specified
by site. Default is NULL.

fasta_genome

bsgenome

Character string specifying the BSgenome data package with the genome sequences to be loaded. If not already present in the system, it is automatically installed through the biocLite.R script (check the list of available BSgenome data
packages by running the available.genomes function of the BSgenome package). This parameter must be coupled with an annotation object (see gtfpath
and txdb). Please make sure the sequences included in the specified BSgenome
data package are in agreement with the sequences used in the alignment step.
Note: either fastapath or bsgenome is required to generate additional column(s) specified by site. Default is NULL.

gtfpath

txdb

required if and only if nucleotide sequences of a genome assembly are provided
(see fastapath or bsgenome). Default is NULL.
dataSource

organism

granges

Logical value whether to return a GRangesList object. Default is FALSE i.e.
a list of data tables (the required input for downstream analyses and graphical
outputs provided by riboWaltz) is returned instead.

Value
A list of data tables or a GRangesList object.

Examples
data(reads_list)
data(psite_offset)
data(mm81cdna)
reads_psite_list <- psite_info(reads_list, psite_offset)

psite_offset

P-site offsets

Description
An example dataset containing length-specific ribosome P-site offsets as returned by psite applied
to reads_list.
Usage
psite_offset
Format
A data table with 31 rows and 9 variables:
length Length of the read, in nucleotides
total_percentage Percentage of reads of the considered length in the whole dataset
start_percentage Percentage of reads of the considered length aligning on the start codon (if any)
around_start A logical value whether at least one read of the considered length aligns on the start
codon (T = yes, F = no)
offset_from_5 Temporary P-site offset from the 5’ end of the read, in nucleotides (before the correction step)
offset_from_3 Temporary P-site offset from the 3’ end of the read, in nucleotides (before the correction step)
corrected_offset_from_5 P-site offset from the 5’ end of the read, in nucleotides (after the correction step)
corrected_offset_from_3 P-site offset from the 3’ end of the read, in nucleotides (after the correction step)
sample Name of the sample

reads_list

Reads information

Description
An example dataset containing details on reads mapping on the mouse transcriptome, generated
from BAM or BED files. A subset of the original dataset is provided, including only reads aligning
on the translation initiation site. Please contact the authors for more information.
Usage
reads_list

Format
A list of data tables with 1 object (named Samp1) of 393,338 rows and 6 variables:
transcript Name of the transcript (ENST ID and version, dot separated)
end5 Position of the 5’ end of the read with respect to the first nucleotide of the transcript, in
nucleotides
end3 Position of the 3’ end of the read with respect to the first nucleotide of the transcript, in
nucleotides
length Length of the read, in nucleotides
cds_start Leftmost position of the annotated CDS with respect to the first nucleotide of the transcript, in nucleotides
cds_stop Rightmost position of the annotated CDS with respect to the first nucleotide of the transcript, in nucleotides

reads_psite_list

Reads details updated with P-site information

Description
An example dataset that combines details on reads mapping on the mouse transcriptome (see
reads_list) and length-specific ribosome P-site offsets (see psite_offset), as returned by psite_info.
Usage
reads_psite_list
Format
A list of data tables with 1 object (named Samp1) of 393,338 rows and 10 variables:
transcript Name of the transcript (ENST ID and version, dot separated)
end5 Position of the 5’ end of the read with respect to the first nucleotide of the transcript, in
nucleotides
psite Position of the P-site with respect to the first nucleotide of the transcript, in nucleotides
end3 Position of the 3’ end of the read with respect to the first nucleotide of the transcript, in
nucleotides
length Length of the read, in nucleotides
cds_start Leftmost position of the CDS with respect to the first nucleotide of the transcript, in
nucleotides
cds_stop Rightmost position of the CDS with respect to the first nucleotide of the transcript, in
nucleotides
psite_from_start Position of the P-site with respect to the first nucleotide of the annotated CDS (if
any), in nucleotides
psite_from_stop Position of the P-site with respect to the last nucleotide of the annotated CDS (if
any), in nucleotides
psite_region Region of the transcript that includes the P-site (5utr, cds, 3utr)
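
The relationship between these columns can be sketched in base R. The example is illustrative only: the toy reads, the offsets and the helper add_psite_columns are hypothetical, and it is assumed that the P-site position is obtained by adding the length-specific offset from the 5’ end to end5, which may differ from the convention used internally by psite_info.

```r
# Toy read table and length-specific offsets (hypothetical values).
reads <- data.frame(
  transcript = c("tA", "tA", "tA"),
  end5 = c(40, 89, 390),
  end3 = c(67, 116, 418),
  cds_start = 101,
  cds_stop = 400,
  stringsAsFactors = FALSE
)
reads$length <- reads$end3 - reads$end5 + 1
offsets <- data.frame(length = c(28, 29), corrected_offset_from_5 = c(12, 13))

add_psite_columns <- function(reads, offsets) {
  # Attach to each read the offset associated with its length.
  idx <- match(reads$length, offsets$length)
  # Assumed convention: the first nucleotide of the P-site lies `offset`
  # nucleotides downstream of the 5' end of the read.
  reads$psite <- reads$end5 + offsets$corrected_offset_from_5[idx]
  # Positions relative to the first and the last nucleotide of the CDS.
  reads$psite_from_start <- reads$psite - reads$cds_start
  reads$psite_from_stop <- reads$psite - reads$cds_stop
  # Region of the transcript that includes the P-site.
  reads$psite_region <- ifelse(reads$psite_from_start < 0, "5utr",
                        ifelse(reads$psite_from_stop > 0, "3utr", "cds"))
  reads
}

reads_psite <- add_psite_columns(reads, offsets)
print(reads_psite[, c("psite", "psite_from_start", "psite_from_stop", "psite_region")])
```

The second read has psite_from_start equal to 0, i.e. its P-site maps on the translation initiation site.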

region_psite

Percentage of P-sites per transcript region.

Description
This function computes the percentage of P-sites falling in the three annotated regions of the transcripts (5’ UTR, CDS and 3’ UTR) and generates a bar plot of the resulting values.
Usage
region_psite(data, annotation, sample = NULL, transcripts = NULL,
label_sample = NULL, colour = c("gray70", "gray40", "gray10"))
Arguments
data

List of data tables from psite_info.

annotation

Data table as generated by create_annotation.

sample

Character string vector specifying the name of the sample(s) of interest. Default
is NULL i.e. all samples in data are processed.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts
without annotated 5’ UTR, CDS and 3’ UTR are automatically discarded.

label_sample

Named character string vector the same length as sample specifying the sample
names to be displayed in the plot. Please be careful to name each element of
the vector after the correct corresponding string in sample. No specific order is
required. Default is NULL i.e. sample names in sample are used.

colour

Character string vector of three elements specifying the colour for the 5’ UTR,
CDS and 3’ UTR bars, respectively. Default is a grayscale.

Details
Column "RNAs" reports the percentage of region length for the transcripts included in the analysis,
based on the cumulative nucleotide length of 5’ UTRs, CDSs and 3’ UTRs. These values reflect the
expected read distribution from a random fragmentation of RNA and can be used as a baseline to
verify the expected enrichment of ribosome (P-site) signal in CDSs.
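
Both sets of percentages can be reproduced on toy data with base R. The sketch below is only an illustration of the arithmetic: the P-site regions, the transcript lengths and the column names psite_pct and rna_pct are hypothetical and are not taken from the output of region_psite.

```r
# Toy inputs (hypothetical): region of each P-site and lengths of two transcripts.
psite_region <- c("cds", "cds", "5utr", "cds", "3utr", "cds", "cds", "cds")
annotation <- data.frame(
  transcript = c("tA", "tB"),
  l_utr5 = c(100, 100),
  l_cds = c(600, 400),
  l_utr3 = c(300, 500)
)
regions <- c("5utr", "cds", "3utr")

# Percentage of P-sites falling in each region.
psite_counts <- table(factor(psite_region, levels = regions))
psite_pct <- 100 * as.numeric(psite_counts) / sum(psite_counts)

# Baseline: percentage of the cumulative length of each region.
region_len <- c(sum(annotation$l_utr5), sum(annotation$l_cds), sum(annotation$l_utr3))
rna_pct <- 100 * region_len / sum(region_len)

region_table <- data.frame(region = regions, psite_pct = psite_pct, rna_pct = rna_pct)
print(region_table)
```

Here 75% of the P-sites fall in the CDS although the CDS accounts for only 50% of the cumulative transcript length, which is the kind of enrichment the baseline is meant to expose.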
Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").
Examples
data(reads_psite_list)
data(mm81cdna)
reg_psite <- region_psite(reads_psite_list, mm81cdna, sample = "Samp1")
reg_psite[["plot"]]

rends_heat

Metaheatmaps of the two extremities of the reads.

Description
This function generates four metaheatmaps displaying the abundance of the 5’ and 3’ extremity of
reads mapping around the start and the stop codon of annotated CDSs, stratified by their length.
Usage
rends_heat(data, annotation, sample, transcripts = NULL, cl = 95,
utr5l = 50, cdsl = 50, utr3l = 50, log_colour = F, colour = "black")
Arguments
data

List of data tables from bamtolist, bedtolist, length_filter or psite_info.

annotation

Data table as generated by create_annotation.

sample

Character string specifying the name of the sample of interest.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used.

cl

Integer value in [1,100] specifying a confidence level for restricting the plot to
a sub-range of read lengths i.e. to the cl% of read lengths associated to the highest
signals. Default is 95.

utr5l

Positive integer specifying the length (in nucleotides) of the 5’ UTR region
flanking the start codon to be considered in the analysis. Default is 50.

cdsl

Positive integer specifying the length (in nucleotides) of the CDS regions flanking both the start and stop codon to be considered in the analysis. Default is
50.

utr3l

Positive integer specifying the length (in nucleotides) of the 3’ UTR region
flanking the stop codon to be considered in the analysis. Default is 50.

log_colour

Logical value whether to use a logarithmic colour scale (strongly suggested in
case of large signal variations). Default is FALSE.

colour

Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").

Examples
data(reads_list)
data(mm81cdna)
## Generate metaheatmaps for all read lengths:
heatend_whole <- rends_heat(reads_list, mm81cdna, sample = "Samp1", cl = 100)
## Generate metaheatmaps for a sub-range of read lengths shortening the
## flanking regions around the start and stop codon:
heatend_sub95 <- rends_heat(reads_list, mm81cdna, sample = "Samp1", cl = 95,
utr5l = 30, cdsl = 40, utr3l = 30)

rlength_distr

Read length distributions.

Description
This function generates read length distributions.
Usage
rlength_distr(data, sample, transcripts = NULL, cl = 100)
Arguments
data

List of data tables from bamtolist, bedtolist, length_filter or psite_info.

sample

Character string specifying the name of the sample of interest.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used.

cl

Integer value in [1,100] specifying a confidence level for restricting the plot to
a sub-range of read lengths i.e. to the cl% of read lengths associated to the highest
signals. Default is 100.
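
One possible reading of cl, suggested by the "middle 95%" wording used in the psite examples, is that the plotted range spans the central cl% of the read length distribution. The base R sketch below follows that assumption; the toy read lengths and the helper length_range_for_cl are hypothetical and may not match the rule applied by rlength_distr.

```r
# Toy read lengths (hypothetical): 100 reads, most of them 27-30 nt long.
read_lengths <- rep(c(20, 27, 28, 29, 30, 40), times = c(1, 20, 40, 30, 8, 1))

# Bounds of the central cl% of the read length distribution
# (assumed interpretation of the confidence level).
length_range_for_cl <- function(lengths, cl = 100) {
  stopifnot(cl >= 1, cl <= 100)
  tail_prob <- (1 - cl / 100) / 2
  bounds <- quantile(lengths, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
  c(min = bounds[1], max = bounds[2])
}

range_100 <- length_range_for_cl(read_lengths, cl = 100)
range_95 <- length_range_for_cl(read_lengths, cl = 95)
print(range_100)
print(range_95)
```

With cl = 100 the whole range of lengths is kept, while cl = 95 drops the two isolated reads of 20 and 40 nucleotides.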

Value
A list containing a ggplot2 object ("plot") and the data table with the associated data ("dt").
Examples
data(reads_list)
## Generate the length distribution for all read lengths:
lendist_whole <- rlength_distr(reads_list, sample = "Samp1", cl = 100)
lendist_whole[["plot"]]
## Generate the length distribution for a sub-range of read lengths:
lendist_sub95 <- rlength_distr(reads_list, sample = "Samp1", cl = 95)
lendist_sub95[["plot"]]

