Reference Manual

Package ‘riboWaltz’
September 4, 2018
Type Package
Title Optimization of ribosome P-site positioning in ribosome profiling data
Version 1.0.0
Description riboWaltz is an R package for the analysis of
ribosome profiling (RiboSeq) data, aimed at identifying the
P-site offset. The P-site offset (PO) describes where the P-site of
a ribosome lies within the RNA fragments (reads) produced by
RiboSeq assays. It is defined as the distance of the P-site from the
two ends of the reads. Determining the PO is a crucial step for a variety of
RiboSeq-based analyses, such as verifying the so-called 3-nt periodicity of
ribosomes along the coding sequence, deriving translation initiation and
elongation rates, and revealing new translational events in unannotated open
reading frames and ncRNAs. riboWaltz accurately computes the
PO for all read lengths from single or multiple samples, taking
advantage of an original two-step algorithm. Moreover, riboWaltz
provides the user with a variety of graphical representations, laying
the groundwork for further positional analyses and new biological
discoveries.
License MIT
LazyData TRUE
Depends R (>= 3.3.0)
Imports Biostrings (>= 2.46.0),
data.table (>= 1.10.4.3),
GenomicAlignments (>= 1.14.1),
GenomicFeatures (>= 1.24.5),
GenomicRanges (>= 1.24.3),
ggplot2 (>= 2.2.1),
ggrepel (>= 0.6.5),
IRanges (>= 2.12.0)
biocViews
RoxygenNote 6.0.1
Suggests knitr,
rmarkdown
VignetteBuilder knitr

R topics documented:
bamtobed
bamtolist
bedtolist
codon_coverage
codon_usage_psite
create_annotation
frame_psite
frame_psite_length
length_filter
metaheatmap_psite
metaprofile_psite
mm81cdna
psite
psite_info
psite_offset
psite_per_cds
reads_list
reads_psite_list
region_psite
rends_heat
rlength_distr
Index

bamtobed

Convert BAM files into BED files.

Description
Converts one or several BAM files into a list of BED files containing for each read the name of the
reference sequence (i.e. of the transcript) on which it aligns, the leftmost and rightmost position of
the read, its length and the associated strand. Please note: this function calls the bamtobed utility
of the BEDTools suite.
Usage
bamtobed(bamfolder, bedfolder = NULL)

Arguments
bamfolder

A character string specifying the path to the directory containing the BAM files.
The function recursively looks for BAM files starting from the specified
folder.

bedfolder

A character string specifying the location (existing or not) of the directory where
the BED files should be stored. By default this argument is NULL, meaning that
the BED files are stored in a subdirectory of bamfolder called bed.

Examples
## path_bam <- "location_of_BAM_files"
## path_bed <- "location_of_output_directory"
## bamtobed(bamfolder = path_bam, bedfolder = path_bed)
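
The relation between the two arguments can be illustrated with a small base-R sketch. It only mimics the behaviour described above (recursive search for BAM files, default output folder named bed); it does not call BEDTools, and the helper name find_bam_files is hypothetical, not part of riboWaltz.

```r
# Hypothetical helper (not part of riboWaltz): list the BAM files found below
# a folder and work out where the BED files would go when bedfolder is NULL.
find_bam_files <- function(bamfolder, bedfolder = NULL) {
  if (is.null(bedfolder)) {
    # Default described above: a subdirectory of bamfolder called "bed"
    bedfolder <- file.path(bamfolder, "bed")
  }
  bam <- list.files(bamfolder, pattern = "\\.bam$", recursive = TRUE,
                    full.names = TRUE)
  list(bam = bam, bedfolder = bedfolder)
}

# Toy demonstration in a temporary directory with empty placeholder files
tmp <- file.path(tempdir(), "bam_demo")
dir.create(file.path(tmp, "run1"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(tmp, "Samp1.bam"),
                      file.path(tmp, "run1", "Samp2.bam")))
res <- find_bam_files(tmp)
basename(res$bam)
```

The placeholder files are empty and serve only to show the recursive lookup; real BAM files are needed for the actual conversion.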

bamtolist

Convert BAM files into a list of data tables or into a GRangesList
object.

Description
Reads one or several BAM files, converts each file into a data table and combines them into a list.
Alternatively, it returns a GRangesList, i.e. a list of GRanges objects. In both cases the data structure
contains, for each read, the name of the reference sequence (i.e. of the transcript) on which it aligns,
the leftmost and rightmost position of the read and its length. Two additional columns are attached,
reporting the leftmost and rightmost position of the CDS of the reference sequence with respect to
its 1st nucleotide. Please note: if a transcript is not associated with any annotated CDS, the positions
of its start and stop codons are set to 0.
Usage
bamtolist(bamfolder, annotation, transcript_align = TRUE, list_name = NULL,
rm_version = FALSE, granges = FALSE)
Arguments
bamfolder

A character string indicating the path to the folder containing the BAM files.

annotation

A data table from create_annotation. Please make sure that the names of
the reference sequences in the annotation data table coincide with those in the
BAM files.

transcript_align

A logical value indicating whether or not the BAM files within bamfolder refer to a
transcriptome alignment (i.e. an alignment based on a reference FASTA
of all the transcript sequences). When this parameter is TRUE (the default), no
reads mapping on the negative strand are expected, so any such reads are
removed.

list_name

A character string vector specifying the desired names for the data tables of
the output list. Its length must coincide with the number of BAM files within
bamfolder. Please pay attention to the order in which they are provided: the
first string is assigned to the first file, the second string to the second one and so
on. By default this argument is NULL, implying that the data tables are named
after the BAM files, leaving their path and extension out.

rm_version

A logical value indicating whether or not to remove the version of the transcripts from the
end of their ID, usually separated by a dot. This option might be useful to make
the transcript IDs in the BAM files match those in the annotation table.
Default is FALSE.

granges

A logical value indicating whether or not to return a GRangesList object. Default is
FALSE, meaning that a list of data tables (the required input for length_filter,
psite, psite_info, rends_heat and rlength_distr) is returned instead.

Value
A list of data tables or a GRangesList object.
Examples
## path_bam <- "path/to/BAM/files"
## annotation_dt <- datatable_with_transcript_annotation
## bamtolist(bamfolder = path_bam, annotation = annotation_dt)
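
The default naming of the output data tables and the effect of rm_version can be sketched in base R as follows. This is an illustration of the behaviour described in the Arguments section, not the package's own code; the file paths and transcript IDs are made up, and the version suffix is assumed to be numeric.

```r
# Default names: file name of each BAM file without path and extension
bam_paths <- c("path/to/BAM/files/Samp1.bam", "path/to/BAM/files/Samp2.bam")
default_names <- tools::file_path_sans_ext(basename(bam_paths))

# What rm_version = TRUE is described as doing: drop a trailing ".<version>"
# (assumption: the version is a run of digits after the last dot)
transcript_ids <- c("ENSMUST00000000001.4", "ENSMUST00000000003.13",
                    "tx_without_version")
ids_no_version <- sub("\\.[0-9]+$", "", transcript_ids)

default_names
ids_no_version
```

IDs without a version suffix are left untouched, so the same step can be applied safely to mixed inputs.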

bedtolist

Convert BED files into a list of data tables or a GRangesList.

Description
Reads one or several BED files, converts each file into a data table and combines them into a list.
Alternatively, it returns a GRangesList, i.e. a list of GRanges objects. In both cases two additional
columns are attached to the data structures, reporting the leftmost and rightmost position of the
CDS of the reference sequence with respect to its 1st nucleotide. Please note: if a transcript is not
associated with any annotated CDS, the positions of its start and stop codons are set to 0.
Usage
bedtolist(bedfolder, annotation, transcript_align = TRUE, list_name = NULL,
rm_version = FALSE, granges = FALSE)
Arguments
bedfolder

A character string indicating the path to the folder containing the BED files from
bamtobed.

annotation

A data table from create_annotation. Please make sure that the names of the
reference sequences in the annotation data table coincide with those in the BED
files.

transcript_align

A logical value indicating whether or not the BED files within bedfolder refer to a
transcriptome alignment (i.e. an alignment based on a reference FASTA of
all the transcript sequences). When this parameter is TRUE (the default), no
reads mapping on the negative strand are expected, so any such reads are
removed.

list_name

A character string vector specifying the desired names for the data tables of
the output list. Its length must coincide with the number of BED files within
bedfolder. Please pay attention to the order in which they are provided: the
first string is assigned to the first file, the second string to the second one and so
on. By default this argument is NULL, implying that the data tables are named
after the BED files, leaving their path and extension out.

rm_version

A logical value indicating whether or not to remove the version of the transcripts from the
end of their ID, usually separated by a dot. This option might be useful to make
the transcript IDs in the BED files match those in the annotation table.
Default is FALSE.

granges

A logical value indicating whether or not to return a GRangesList object. Default is
FALSE, meaning that a list of data tables (the required input for length_filter,
psite, psite_info, rends_heat and rlength_distr) is returned instead.

Value
A list of data tables or a GRangesList object.
Examples
## path_bed <- "path/to/BED/files"
## annotation_dt <- datatable_with_transcript_annotation
## bedtolist(bedfolder = path_bed, annotation = annotation_dt)
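
The two steps described above (removal of negative-strand reads and attachment of the CDS boundaries) can be pictured with toy data. The sketch below uses base R data frames rather than data tables, and the column names end5, end3, cds_start and cds_stop are hypothetical; the 1-based convention for the CDS boundaries is an assumption made for illustration.

```r
# Toy annotation in the layout described for create_annotation
annotation <- data.frame(
  transcript = c("tx1", "tx2"),
  l_tr   = c(300L, 120L),
  l_utr5 = c(50L, 0L),
  l_cds  = c(210L, 0L),   # tx2 has no annotated CDS
  l_utr3 = c(40L, 0L),
  stringsAsFactors = FALSE
)
# Toy reads: transcript, leftmost position, rightmost position, strand
reads <- data.frame(
  transcript = c("tx1", "tx1", "tx2"),
  end5 = c(45L, 100L, 10L),
  end3 = c(72L, 128L, 38L),
  strand = c("+", "-", "+"),
  stringsAsFactors = FALSE
)
# transcript_align = TRUE: reads on the negative strand are dropped
reads <- reads[reads$strand == "+", ]
reads$length <- reads$end3 - reads$end5 + 1L
# Assumed convention: 1-based CDS boundaries, 0 when no CDS is annotated
idx <- match(reads$transcript, annotation$transcript)
has_cds <- annotation$l_cds[idx] > 0
reads$cds_start <- ifelse(has_cds, annotation$l_utr5[idx] + 1L, 0L)
reads$cds_stop  <- ifelse(has_cds,
                          annotation$l_utr5[idx] + annotation$l_cds[idx], 0L)
reads
```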

codon_coverage

Compute the number of reads per codon.

Description
For the specified sample(s), this function computes the codon coverage defined either as the number
of read footprints per codon or as the number of P-sites per codon.
Usage
codon_coverage(data, annotation, sample = NULL, psite = FALSE,
min_overlap = 1, granges = FALSE)

Arguments
data

A list of data tables from psite_info. Data tables generated by bamtolist and
bedtolist can be used only if psite is FALSE (the default).

annotation

A data table as generated by create_annotation.

sample

A character string vector specifying the name of the sample(s) of interest. By
default this argument is NULL, meaning that the coverage is computed for all
the samples in data.

psite

A logical value indicating whether or not to return the number of P-sites per codon. Default
is FALSE, meaning that the number of read footprints per codon is computed
instead.

min_overlap

A positive integer specifying the minimum number of overlapping positions (in
nucleotides) between a read and a codon for them to be considered overlapping.
When psite is TRUE this parameter must be 1 (the default).

granges

A logical value indicating whether or not to return a GRanges object. Default is FALSE,
meaning that a data table is returned instead.

Details
The sequence of every transcript is divided into triplets starting from the annotated translation
initiation site (if any) and proceeding towards the extremities of the UTRs; the 1 or 2 nucleotides
left over at the extremities of the transcript are discarded. Please note that transcripts lacking
an annotated 5’ UTR, CDS or 3’ UTR, and transcripts whose coding sequence length is not divisible
by 3, are automatically discarded.
Value
A data table or a GRanges object.
Examples
data(reads_psite_list)
data(mm81cdna)
## Compute the coverage based on the number of ribosome footprints per codon,
## setting the minimum overlap between reads and triplets to 3 nts
## coverage_dt <- codon_coverage(reads_psite_list, mm81cdna, min_overlap = 3)
## Compute the coverage based on the number of P-sites per codon
##coverage_dt <- codon_coverage(reads_psite_list, mm81cdna, psite = TRUE)
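
The triplet partition described in the Details section can be sketched for a single transcript as follows. This is a base-R illustration under stated assumptions, not the implementation used by codon_coverage: the helper codon_starts and the toy P-site positions are hypothetical, and positions are taken to be 1-based.

```r
# Hypothetical helper: 1-based start positions of the triplets of one
# transcript, anchored on the first nucleotide of the CDS.
codon_starts <- function(l_utr5, l_cds, l_utr3) {
  # Transcripts failing these checks are described as being discarded
  stopifnot(l_utr5 > 0, l_cds > 0, l_utr3 > 0, l_cds %% 3 == 0)
  l_tr <- l_utr5 + l_cds + l_utr3
  first <- l_utr5 %% 3 + 1                # skip 0-2 nt at the 5' extremity
  n_codons <- (l_tr - (first - 1)) %/% 3  # drops 0-2 nt at the 3' extremity
  first + 3 * (seq_len(n_codons) - 1)
}

starts <- codon_starts(l_utr5 = 50, l_cds = 210, l_utr3 = 40)

# P-sites per codon from toy P-site positions (first nucleotide of each P-site)
psite_pos <- c(51, 51, 54, 60, 258)
codon_index <- findInterval(psite_pos, starts)
psites_per_codon <- tabulate(codon_index, nbins = length(starts))
```

With a 50-nt 5’ UTR the first two nucleotides are dropped, so that position 51 (the first nucleotide of the CDS) opens a triplet.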

codon_usage_psite

Compute and plot empirical codon usage indexes.

Description
For a specified sample this function computes an empirical codon usage index based on the
frequency of in-frame P-sites along the coding sequence (or of one of the other two ribosome sites
relative to them and falling in the CDS). It computes the codon usage index for all 64 triplets,
normalizes the indexes by the frequency of each codon within the CDS and returns a bar plot with the
resulting values. This function also allows the computed codon usage indexes to be compared with a set
of 64 values provided by the user.
Usage
codon_usage_psite(data, annotation, sample, site = "psite",
fastapath = NULL, fasta_genome = TRUE, bsgenome = NULL,
gtfpath = NULL, txdb = NULL, dataSource = NA, organism = NA,
transcripts = NULL, codon_values = NULL, scatter_label = FALSE,
aminoacid = FALSE)
Arguments
data

A list of data tables from psite_info that may or may not include one or more
columns among p_site_codon, a_site_codon and e_site_codon. These columns
report the three nucleotides covered by the P-site, A-site and E-site, respectively, and can be
generated beforehand by the psite_info function. If not already present, the column of interest
specified by site is generated automatically by the function, starting from a FASTA file or a BSgenome data
package.

annotation

A data table as generated by create_annotation.

sample

A character string vector specifying the name of the sample of interest.

site

Either "psite", "asite" or "esite". This parameter specifies which of the three ribosome sites (P-site, A-site and E-site, respectively) must be used for computing
the empirical codon usage indexes. Default is "psite".

fastapath

An optional character string specifying the path to the FASTA file used in the
alignment step, including its name and extension. This file can contain reference
nucleotide sequences either of a genome assembly or of all the transcripts (see
fasta_genome). Please make sure the sequences derive from the same release
of the annotation file used in the create_annotation function. Note: either
fastapath or bsgenome is required to normalize the data, even if one or more
columns among p_site_codon, a_site_codon and e_site_codon have been previously generated by psite_info. Default is NULL.

fasta_genome

A logical value whether or not the FASTA file specified by fastapath contains
nucleotide sequences of a genome assembly. FALSE means that the nucleotide
sequences of all the transcripts are provided instead. When this parameter is
TRUE (the default), an annotation object is required (see gtfpath and txdb).

bsgenome

An optional character string specifying the name of the BSgenome data package containing the genome sequences to be loaded. If it is not already present
in your system, it will be installed through the biocLite.R script (check the list
of data packages available in the Bioconductor repositories for your version of
R/Bioconductor with the available.genomes function of the BSgenome package). This parameter also requires an annotation object (see gtfpath and txdb).
Please make sure the sequences included in the specified BSgenome data package are in agreement with the sequences used in the alignment step. Note: either
fastapath or bsgenome is required to normalize the data, even if one or more
columns among p_site_codon, a_site_codon and e_site_codon have been previously generated by psite_info. Default is NULL.
gtfpath

A character string specifying the path to the GTF file, including its name and
extension. Please make sure the GTF derives from the same release as the sequences
specified by fastapath or by bsgenome. Note that either gtfpath or txdb must
be specified when the nucleotide sequences of a genome assembly are provided
(see fastapath or bsgenome). Default is NULL.

txdb

A character string specifying the name of the annotation package for TxDb object(s) to be loaded. If it is not already present in your system, it will be installed
through the biocLite.R script (check the list of TxDb annotation packages available in the Bioconductor repositories at http://bioconductor.org/packages/release/BiocViews.html#___TxD
)). Please make sure the annotation package derives from the same release of
what is specified by fastapath or by bsgenome. Note that either gtfpath or
txdb must be specified when the nucleotide sequences of a genome assembly
are provided (see fastapath or bsgenome). Default is NULL.

dataSource

An optional character string describing the origin of the GTF data file. For more
information about this parameter please refer to the description of dataSource
of the makeTxDbFromGFF function included in the GenomicFeatures package.

organism

An optional character string reporting the genus and species of the organism when
gtfpath is specified. For more information about this parameter please refer to
the description of organism of the makeTxDbFromGFF function included in the
GenomicFeatures package.

transcripts

A character string vector specifying the name of the transcripts to be included
in the analysis. By default this argument is NULL, meaning that all the transcripts in data are used. Please note that the transcripts not associated to any
annotated 5’ UTR, CDS and 3’UTR and transcripts with coding sequence length
not divisible by 3 are automatically discarded.

codon_values

A data table containing codon-specific values provided by the user. These values
are compared with the empirical codon usage indexes of the sample of interest.
The data table must contain at least the 64 codons and the corresponding values arranged in two columns named codon and value, respectively. Note that
a similar data table is also returned by codon_usage_psite itself. Default is
NULL.

scatter_label

A logical value whether or not to label the dots of the scatter plot generated by
specifying codon_values. Each dot can be labeled either after the three nucleotides of the codon or after the corresponding amino acid (see aminoacid).
This parameter is considered only if codon_values is specified. Default is
FALSE.

aminoacid

A logical value indicating whether or not to label the dots of the scatter plot generated by
specifying codon_values using the amino acids corresponding to the triplets.
Default is FALSE, meaning that the three nucleotides of the codon are used
instead. This parameter is considered only if codon_values is specified and
scatter_label is TRUE.

Value
A list containing a ggplot2 object (named "plot"), and a data table ("dt") with the associated data.
An additional ggplot2 object ("plot_comparison") is returned if codon_values is specified.
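
The layout required for codon_values (64 codons and their values in two columns named codon and value) can be sketched as below. The values are placeholders, the use of the DNA alphabet (T rather than U) is an assumption, and a plain data frame is built with base R; it can be converted to a data table with data.table::as.data.table if that package is available.

```r
# All 64 triplets over the DNA alphabet (assumption: T, not U)
bases <- c("A", "C", "G", "T")
grid <- expand.grid(first = bases, second = bases, third = bases,
                    stringsAsFactors = FALSE)
codons <- sort(paste0(grid$first, grid$second, grid$third))

# Placeholder values only; replace them with real codon-specific measurements
codon_values <- data.frame(codon = codons,
                           value = seq_along(codons) / 64,
                           stringsAsFactors = FALSE)
head(codon_values, 3)
```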

create_annotation

Create an annotation data table.

Description
Starting from a GTF file or a TxDb object this function generates a data table containing a basic
annotation of the transcripts. The data table includes a column named transcript reporting the name
of the reference sequences and four columns named l_tr, l_utr5, l_cds and l_utr3 reporting the
length of the transcripts and of their annotated 5’ UTR, CDS and 3’ UTR, respectively.
Usage
create_annotation(gtfpath = NULL, txdb = NULL, dataSource = NA,
organism = NA)
Arguments
gtfpath

A character string specifying the path to the GTF file, including its name and
extension. Please make sure the GTF derives from the same release as the sequences
used in the alignment step. Note that either gtfpath or txdb must be
specified.

txdb

A character string specifying the name of the annotation package for TxDb object(s) to be loaded. If it is not already present in your system, it will be installed
through the biocLite.R script (check the list of TxDb annotation packages available in the Bioconductor repositories at http://bioconductor.org/packages/release/BiocViews.html#___TxD
)). Please make sure the annotation package derives from the same release of the
sequences used in the alignment step. Note that either gtfpath or txdb must be
specified.

dataSource

An optional character string describing the origin of the GTF data file. For more
information about this parameter please refer to the description of dataSource
of the makeTxDbFromGFF function included in the GenomicFeatures package.

organism

An optional character string reporting the genus and species of the organism when
gtfpath is specified. For more information about this parameter please refer to
the description of organism of the makeTxDbFromGFF function included in the
GenomicFeatures package.

Value
A data table.

Examples
## gtf_file <- "location_of_GTF_file"
## annotation_dt <- create_annotation(gtfpath = gtf_file, dataSource = "gencode6",
## organism = "Mus musculus")
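
Several functions in this manual discard transcripts lacking an annotated 5’ UTR, CDS or 3’ UTR, or whose coding sequence length is not divisible by 3. The base-R sketch below shows one way to read that rule on a toy table with the columns described above; the exact condition used inside the package is an assumption here, and the transcripts are invented.

```r
# Toy annotation with the five columns described above
annotation <- data.frame(
  transcript = c("tx1", "tx2", "tx3", "tx4"),
  l_tr   = c(300L, 120L, 251L, 200L),
  l_utr5 = c(50L, 0L, 30L, 0L),
  l_cds  = c(210L, 0L, 200L, 150L),
  l_utr3 = c(40L, 0L, 21L, 50L),
  stringsAsFactors = FALSE
)

# Assumed reading of the rule: all three regions annotated, CDS in whole codons
keep <- with(annotation,
             l_utr5 > 0 & l_cds > 0 & l_utr3 > 0 & l_cds %% 3 == 0)
usable <- annotation$transcript[keep]
usable
```

Here tx2 has no CDS, tx3 has a CDS of 200 nucleotides (not a multiple of 3) and tx4 has no 5’ UTR, so only tx1 is retained.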

frame_psite

Compute the percentage of P-sites per frame.

Description
For one or several samples this function computes the percentage of P-sites falling on the three
reading frames of the transcripts and generates a barplot of the resulting values. This analysis is
performed for the annotated 5’ UTR, coding sequence and 3’ UTR, separately. It is possible to
compute the percentage of P-sites per frame using all the read lengths or to restrict the analysis to a
sub-range of read lengths.
Usage
frame_psite(data, sample = NULL, region = "all", length_range = "all",
plot_title = NULL)
Arguments
data

A list of data tables from psite_info.

sample

A character string vector specifying the name of the sample(s) of interest. By default this argument is NULL, meaning that all the samples in data are included
in the analysis.

region

Either "all" or a character string among "5utr", "cds", "3utr" specifying the region of the transcript (5’ UTR, CDS or 3’ UTR, respectively) that must be
included in the analysis. Default is "all", meaning that all the regions are
considered.

length_range

Either "all", an integer or an integer vector. Default is "all", meaning that all
the read lengths are included in the analysis. Otherwise, only the read lengths
matching the specified value(s) are kept.

plot_title

Any character string specifying the title of the plot. If "auto", the title of the plot
reports the region specified by region (if any) and the length(s) of the reads
used for generating the barplot. Default is NULL, meaning that no title will be
added to the plot.

Value
A list containing a ggplot2 object and a data table with the associated data.

Examples
data(reads_psite_list)
## Generate the barplot for all the read lengths
frame_whole <- frame_psite(reads_psite_list, sample = "Samp1")
## Generate the barplot restricting the analysis to the coding sequence and
## to the reads of 28 nucleotides
frame_sub <- frame_psite(reads_psite_list, sample = "Samp1", region = "cds",
length_range = 28)
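
The quantity plotted by frame_psite can be illustrated on toy data. The sketch below assumes that the reading frame of a P-site is given by its distance from the first nucleotide of the CDS modulo 3, and uses the psite_from_start column name that appears elsewhere in this manual; the data are invented and the code is not the package's implementation.

```r
# Toy P-site table: distance from the first CDS nucleotide and read length
psites <- data.frame(
  psite_from_start = c(0, 3, 6, 9, 12, 1, 4, 2, 15, 18),
  length = c(28, 28, 29, 28, 30, 28, 29, 28, 28, 28)
)
psites$frame <- psites$psite_from_start %% 3

# Percentage of P-sites per frame using all the read lengths
frame_counts <- table(factor(psites$frame, levels = 0:2))
frame_pct <- 100 * as.numeric(frame_counts) / sum(frame_counts)
names(frame_pct) <- paste0("frame_", 0:2)

# Same computation restricted to 28-nt reads, as with length_range = 28
sub <- psites[psites$length == 28, ]
sub_pct <- 100 * as.numeric(table(factor(sub$frame, levels = 0:2))) / nrow(sub)
frame_pct
```

A strong enrichment in one frame, as in this toy example, is the signature of 3-nt periodicity.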

frame_psite_length

Compute the number of P-sites per frame stratified by read length.

Description
Similar to frame_psite but the results are stratified by the length of the reads.
Usage
frame_psite_length(data, sample = NULL, region = "all", cl = 100,
length_range = "all", plot_title = NULL)
Arguments
data

A list of data tables from psite_info.

sample

A character string vector specifying the name of the sample(s) of interest. By default this argument is NULL, meaning that all the samples in data are included
in the analysis.

region

Either "all" or a character string among "5utr", "cds", "3utr" specifying the region of the transcript (5’ UTR, CDS or 3’ UTR, respectively) that must be
included in the analysis. Default is "all", meaning that all the regions are
considered.

cl

An integer value in [1,100] specifying the confidence level for restricting the
analysis to a sub-range of read lengths. Default is 100. This parameter has no
effect if length_range is specified.

length_range

Either "all", an integer or an integer vector. Default is "all", meaning that all
the read lengths are included in the analysis. Otherwise, only the read lengths
matching the specified value(s) are kept. If specified, this parameter prevails
over cl.

plot_title

Any character string specifying the title of the plot. When "auto", the title of the
plot reports the region specified by region (if any). Default is NULL, meaning
that no title will be added to the plot.

Value
A list containing a ggplot2 object and a data table with the associated data.

Examples
data(reads_psite_list)
## Generate the heatmap for all the read lengths
frame_len_whole <- frame_psite_length(reads_psite_list, sample = "Samp1")
## Generate the heatmap for a sub-range of read lengths (the middle 90%) and
## restricting the analysis to the coding sequence
frame_len_sub <- frame_psite_length(reads_psite_list, sample = "Samp1",
region = "cds", cl = 90)
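
One plausible reading of cl, consistent with "the middle 90%" in the example above, is that the range of read lengths is bounded by symmetric quantiles of the read length distribution. The sketch below implements that reading in base R on invented data; whether the package computes the range exactly this way is an assumption.

```r
# Toy read length distribution (100 reads)
read_lengths <- c(rep(20, 2), rep(27, 10), rep(28, 40), rep(29, 35),
                  rep(30, 10), rep(35, 3))

cl <- 90
tail_prob <- (1 - cl / 100) / 2   # 5% cut from each tail when cl = 90
len_range <- quantile(read_lengths, probs = c(tail_prob, 1 - tail_prob),
                      names = FALSE)

# Reads whose length falls inside the central range are kept
kept <- read_lengths[read_lengths >= len_range[1] &
                     read_lengths <= len_range[2]]
len_range
```

With cl = 100 both quantiles coincide with the minimum and maximum length, so nothing is excluded.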

length_filter

Filter the reads according to their length.

Description
Filter the reads according to their length.
Usage
length_filter(data, length_filter_mode, length_filter_vector = NULL,
periodicity_threshold = 50, granges = FALSE)
Arguments
data
A list of data tables from either bamtolist or bedtolist.
length_filter_mode
Either "custom" or "periodicity". It specifies how the read lengths to keep are
selected. "custom": only read lengths specified by the user are kept (see length_filter_vector);
"periodicity": only read lengths satisfying a periodicity threshold (see periodicity_threshold)
are kept. This mode enables the removal of all the reads with low or no periodicity.
length_filter_vector
An integer or an integer vector specifying either a read length or a range of
read lengths to keep, respectively. This parameter is considered only when
length_filter_mode is set to "custom".
periodicity_threshold
An integer in [10, 100]. Only the read lengths satisfying this threshold (i.e.
with a higher percentage of read extremities falling in one of the three reading
frames along the CDS) are kept. This parameter is considered only when
length_filter_mode is set to "periodicity". Default is 50.
granges
A logical value indicating whether or not to return a GRangesList object. Default is
FALSE, meaning that a list of data tables (the required input for psite,
psite_info, rends_heat and rlength_distr) is returned instead.
Value
A list of data tables or a GRangesList object.

Examples
data(reads_list)
## Keep only reads of length between 27 and 30 nucleotides (included)
filtered_list <- length_filter(reads_list, length_filter_mode = "custom",
length_filter_vector = 27:30)
## Keep only reads of lengths satisfying a periodicity threshold (70%)
filtered_list <- length_filter(reads_list, length_filter_mode = "periodicity",
periodicity_threshold = 70)
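
The idea behind the "periodicity" mode can be sketched on toy data: for each read length, compute the percentage of read extremities falling in the most populated reading frame, then keep the lengths reaching the threshold. The sketch below considers only the 5’ extremity and uses a hypothetical column end5_from_start (distance of the 5’ end from the first CDS nucleotide); it is a simplification for illustration, not the package's algorithm.

```r
# Toy reads: length and distance of the 5' end from the first CDS nucleotide
reads <- data.frame(
  length = c(rep(28, 10), rep(31, 9)),
  end5_from_start = c(0, 3, 6, 9, 12, 15, 18, 21, 1, 2,  # 28 nt: 8/10 in frame 0
                      0, 1, 2, 3, 4, 5, 6, 7, 8)          # 31 nt: 3/3/3
)
reads$frame <- reads$end5_from_start %% 3

# Percentage of reads in the most populated frame, per read length
best_frame_pct <- tapply(reads$frame, reads$length, function(f) {
  100 * max(tabulate(f + 1, nbins = 3)) / length(f)
})

periodicity_threshold <- 70
kept_lengths <- as.integer(
  names(best_frame_pct)[best_frame_pct >= periodicity_threshold])
filtered <- reads[reads$length %in% kept_lengths, ]
kept_lengths
```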

metaheatmap_psite

Plot ribosome occupancy metaheatmaps at single-nucleotide resolution.

Description
For one or more samples this function plots a heatmap-like metaprofile based on the P-site of the
reads mapping around the start and the stop codon of the annotated CDS (if any). It works similarly
to metaprofile_psite but the intensity of the signal is represented by a continuous color scale
rather than by the height of a line chart. This graphical output is a good option for analyzing several
samples at once and for comparing the profiles generated by different read lengths or in multiple
conditions.
Usage
metaheatmap_psite(data, annotation, sample, scale_factors = NULL,
length_range = "all", transcripts = NULL, utr5l = 25, cdsl = 50,
utr3l = 25, log = F, colour = "black", plot_title = NULL)
Arguments
data

A list of data tables from psite_info.

annotation

A data table as generated by create_annotation.

sample

A list of character string vectors specifying the names of the samples (or of their
replicates) of interest. The elements of each vector are merged together using the
scale factors specified by scale_factors. The names of the elements of the list
are used for labelling the rows of the heatmap.

scale_factors

A numeric vector of scale factors for merging the replicates (if any). The vector
must contain at least one value for each replicate, named after the strings listed
in sample. No specific order is required. Default is NULL, meaning that all the
scale factors are set to 1.

length_range

Either "all", an integer or an integer vector. Default is "all", meaning that all
the read lengths are included in the analysis. Otherwise, only the read lengths
matching the specified value(s) are kept.

transcripts

A character string vector specifying the names of the transcripts to be included in
the analysis. By default this argument is NULL, meaning that all the transcripts
in data are used. Note that if either the 5’ UTR, the coding sequence or the 3’
UTR of a transcript is shorter than utr5l, 2*cdsl and utr3l respectively, the
transcript is automatically discarded.

utr5l

A positive integer specifying the length (in nucleotides) of the 5’ UTR region
that in the plot flanks the start codon. The default value is 25.

cdsl

A positive integer specifying the length (in nucleotides) of the CDS region that
in the plot will flank both the start and the stop codon. The default value is 50.

utr3l

A positive integer specifying the length (in nucleotides) of the 3’ UTR region
that in the plot flanks the stop codon. The default value is 25.

log

A logical value indicating whether or not to use a logarithmic colour scale (strongly suggested in case of large variations of the signal). Default is FALSE.

colour

A character string specifying the colour of the plot. Default is "black".

plot_title

Any character string specifying the title of the plot. When "auto", the title of the
plot reports the number of the transcripts and the length(s) of the reads considered for generating the metaprofile. Default is NULL, meaning that no title will
be added to the plot.

Value
A list containing a ggplot2 object, a data table with the associated data and the transcripts employed
for generating the plot.
Examples
data(reads_psite_list)
data(mm81cdna)
## Generate the metaheatmap for all the read lengths
metaheat_whole <- metaheatmap_psite(reads_psite_list, mm81cdna,
sample = list("Whole" = c("Samp1")))
## Generate the metaheatmap employing reads of 27, 28 and 29 nucleotides and
## a subset of transcripts (for example with at least one P-site mapping on the
## translation initiation site)
sample_name <- "Samp1"
sub_reads_psite_list <- subset(reads_psite_list[[sample_name]], psite_from_start == 0)
transcript_names <- as.character(sub_reads_psite_list$transcript)
metaheat_sub <- metaheatmap_psite(reads_psite_list, mm81cdna, sample = list("sub" = sample_name),
length_range = 27:29, transcripts = transcript_names, plot_title = "auto")
## Generate two metaheatmaps displayed in the same plot. In this example one
## data table includes all the read lengths, while the other contains only
## reads of 28 nucleotides
sample_name <- "Samp1"
metaheat_df <- list()
metaheat_df[["subsample_28nt"]] <- subset(reads_psite_list[[sample_name]], length == 28)
metaheat_df[["whole_sample"]] <- reads_psite_list[[sample_name]]
names_list <- list("Only_28" = c("subsample_28nt"), "All" = c("whole_sample"))
metaheat_comparison <- metaheatmap_psite(metaheat_df, mm81cdna, sample = names_list)
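
How replicates might be combined through scale_factors can be sketched with toy counts. The sketch assumes that merging means multiplying the signal of each replicate by its scale factor and summing the results; the replicate names, the counts and the object sample_list are invented for illustration and do not reproduce the package's internals.

```r
# Toy per-position P-site counts for two replicates of one sample
counts <- list(
  Samp1_rep1 = c(10, 40, 5, 8),
  Samp1_rep2 = c(20, 60, 10, 12)
)
# Same layout as the sample argument: one named element per heatmap row
sample_list <- list("Samp1" = c("Samp1_rep1", "Samp1_rep2"))
# Named scale factors; they are matched by name, so the order does not matter
scale_factors <- c(Samp1_rep2 = 0.5, Samp1_rep1 = 1)

# Assumed merge: weighted sum of the replicates of each sample
merged <- lapply(sample_list, function(reps) {
  scaled <- lapply(reps, function(r) counts[[r]] * scale_factors[[r]])
  Reduce(`+`, scaled)
})
merged$Samp1
```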

metaprofile_psite

Plot ribosome occupancy metaprofiles at single-nucleotide resolution.

Description
For a specified sample this function generates a metaprofile based on the P-site of the reads mapping
around the start and the stop codon of the annotated CDS (if any). It sums up the number of P-sites
(defined by their first nucleotide) per nucleotide computed for all the transcripts starting from one
or more replicates.
Usage
metaprofile_psite(data, annotation, sample, scale_factors = NULL,
length_range = "all", transcripts = NULL, utr5l = 25, cdsl = 50,
utr3l = 25, plot_title = NULL)
Arguments
data

A list of data tables from psite_info.

annotation

A data table as generated by create_annotation.

sample

A character string vector specifying the name of the sample (or of its replicates)
of interest. Its elements are merged together using the scale factors specified by
scale_factors.

scale_factors

A numeric vector of scale factors for merging the replicates (if any). The vector
must contain one value for each replicate, named after the strings listed
in sample. No specific order is required. Default is NULL, meaning that all the
scale factors are set to 1.

length_range

Either "all", an integer or an integer vector. Default is "all", meaning that all
the read lengths are included in the analysis. Otherwise, only the read lengths
matching the specified value(s) are kept.

transcripts

A character string vector specifying the name of the transcripts to be included in
the analysis. By default this argument is NULL, meaning that all the transcripts
in data are used. Note that if the 5’ UTR, the coding sequence or the 3’
UTR of a transcript is shorter than utr5l, 2*cdsl and utr3l respectively, the
transcript is automatically discarded.

utr5l

A positive integer specifying the length (in nucleotides) of the 5’ UTR region
that in the plot flanks the start codon. The default value is 25.

cdsl

A positive integer specifying the length (in nucleotides) of the CDS region that
in the plot will flank both the start and the stop codon. The default value is 50.

utr3l

A positive integer specifying the length (in nucleotides) of the 3’ UTR region
that in the plot flanks the stop codon. The default value is 25.

plot_title

Any character string specifying the title of the plot. When "auto", the title of the
plot reports the sample(s) specified by sample and the number of the transcripts
and the length(s) of the reads considered for generating the metaprofile. Default
is NULL, meaning that no title will be added to the plot.

Value
A list containing a ggplot2 object, a data table with the associated data and the transcripts employed
for generating the plot.
Examples
data(reads_psite_list)
data(mm81cdna)
## Generate the metaprofile for all the read lengths
metaprof_whole <- metaprofile_psite(reads_psite_list, mm81cdna, sample = "Samp1")
metaprof_whole[["plot"]]
## Generate the metaprofile employing reads of 27, 28 and 29 nucleotides and
## a subset of transcripts (for example with at least one P-site mapping on
## the translation initiation site)
sample_name <- "Samp1"
sub_reads_psite_list <- subset(reads_psite_list[[sample_name]], psite_from_start == 0)
transcript_names <- as.character(sub_reads_psite_list$transcript)
metaprof_sub <- metaprofile_psite(reads_psite_list, mm81cdna, sample = sample_name,
length_range = 27:29, transcripts = transcript_names)
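
The role of scale_factors when several replicates are merged can be pictured with a small base R sketch. It assumes that merging means multiplying the per-nucleotide P-site counts of each replicate by its scale factor and then summing them. The counts are made up, the replicate name Samp2 is hypothetical, and this is an illustration rather than the package's internal code.

```r
# Toy P-site counts per position (rows) for two hypothetical replicates (columns).
counts <- cbind(Samp1 = c(10, 4, 0, 6), Samp2 = c(20, 8, 2, 10))
rownames(counts) <- c("-1", "0", "1", "2")  # distance from the start codon

# Named scale factors: the order does not need to match the columns,
# because the values are looked up by name below.
scale_factors <- c(Samp2 = 0.5, Samp1 = 1)

# Scale each replicate and sum per position (assumed merging rule).
merged <- as.vector(counts %*% scale_factors[colnames(counts)])
names(merged) <- rownames(counts)
print(merged)
```

Looking the factors up by name is what makes the "no specific order is required" behaviour possible in this sketch.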

mm81cdna

Annotation

Description
A dataset containing basic information about 109,712 mouse mRNAs (using the Ensembl v81 transcript annotation).
Usage
mm81cdna
Format
A data table with 109,712 rows and 5 variables (the lengths are expressed in nucleotides):
transcript Name of the transcript (ENST ID and version, dot separated)
l_tr Length of the transcript
l_utr5 Length of the annotated 5’ UTR (if any)
l_cds Length of the annotated CDS (if any)
l_utr3 Length of the annotated 3’ UTR (if any)
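
A table with these columns is enough to anticipate which transcripts survive the length thresholds (utr5l, 2*cdsl, utr3l) described for the plotting functions. The sketch below applies that rule to a made-up annotation in base R. The transcript names and lengths are hypothetical, and the use of >= at the boundary is an assumption.

```r
# Toy annotation with the same column names as mm81cdna (values are made up).
annotation <- data.frame(
  transcript = c("tx_A", "tx_B", "tx_C"),
  l_tr   = c(500, 140, 1200),
  l_utr5 = c(60, 10, 100),
  l_cds  = c(300, 90, 900),
  l_utr3 = c(140, 40, 200),
  stringsAsFactors = FALSE
)

# Window sizes, set here to the metaprofile_psite defaults.
utr5l <- 25
cdsl  <- 50
utr3l <- 25

# A transcript is kept only if every region is long enough for its window;
# the CDS must cover both the start-side and the stop-side window (2 * cdsl).
keep <- annotation$l_utr5 >= utr5l &
  annotation$l_cds >= 2 * cdsl &
  annotation$l_utr3 >= utr3l
kept_transcripts <- annotation$transcript[keep]
print(kept_transcripts)
```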

psite

Identify the ribosome P-site position within the reads.

Description
This function identifies within each read the position of the ribosome P-site, determined by the
localisation of its first nucleotide. The function processes the samples separately starting from the
reads aligning on the reference codon (selected by the user between the start codon and the second
to last codon) of any annotated coding sequence. It then returns the position of the P-site specifically
inferred for all the read lengths. It can also plot a collection of read length-specific occupancy
metaprofiles showing the P-site offsets computed throughout the two steps of the algorithm.
Usage
psite(data, flanking = 6, start = TRUE, extremity = "auto",
plot = FALSE, plotdir = NULL, plotformat = "png", cl = 99)
Arguments
data

A list of data tables from bamtolist, bedtolist or length_filter.

flanking

An integer that specifies how many nucleotides, at least, of the reads mapping on
the reference codon must flank the reference codon in both directions. Default
is 6.

start

A logical value whether or not to compute the P-site offsets starting from the
reads aligning on the translation initiation site. FALSE implies that the reads
mapping on the last triplet before the stop codon are used instead. Default is
TRUE.

extremity

A character string specifying which extremity of the reads should be used in the
correction step of the algorithm. It can be either "5end" or "3end" for the 5’ and
the 3’ extremity, respectively. Default is "auto", meaning that the best extremity
is automatically selected.

plot

A logical value whether or not to plot the occupancy metaprofiles showing the
P-site offsets computed throughout the two steps of the algorithm. Default is
FALSE.

plotdir

A character string specifying the (existing or not) location of the directory where
the occupancy metaprofiles should be stored. This parameter is considered only
if plot is TRUE. By default this argument is NULL, which implies it is set as a
subfolder of the working directory, called offset_plot.

plotformat

Either "png" (the default) or "pdf", this parameter specifies the file format of the
generated metaprofiles. It is considered only if plot is TRUE.

cl

An integer value in [1,100] specifying the confidence level for restricting the
generation of the occupancy metaprofiles to a sub-range of read lengths. By
default it is set to 99. This parameter is considered only if plot is TRUE.

Value
A data table.
Examples
data(reads_list)
## Compute the P-site offset automatically selecting the optimal read
## extremity for the correction step and not plotting any metaprofile
psite(reads_list, flanking = 6, extremity="auto")
## Compute the P-site offset specifying the extremity used in the correction
## step and plotting the metaprofiles only for a sub-range of read lengths (the
## middle 95%). Since plotdir is not set, the plots are stored in the default
## subfolder of the working directory, called offset_plot.
psite_offset <- psite(reads_list, flanking = 6, extremity="3end", plot = TRUE, cl = 95)
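
The "middle 95%" selected by cl can be illustrated with a base R sketch. It assumes the sub-range is the central cl% of the read length distribution, taken between two symmetric quantiles. The exact rule used inside psite may differ, and the read lengths below are made up.

```r
# Toy read lengths: 100 reads, mostly 27-30 nt, with two outliers.
read_lengths <- rep(c(20, 27, 28, 29, 30, 40), times = c(1, 30, 40, 20, 8, 1))

cl <- 95
tail_prob <- (1 - cl / 100) / 2

# type = 1 returns observed read lengths instead of interpolated values.
lower <- unname(quantile(read_lengths, tail_prob, type = 1))
upper <- unname(quantile(read_lengths, 1 - tail_prob, type = 1))
length_range <- c(lower, upper)

# Reads whose length falls inside the central range.
kept_lengths <- read_lengths[read_lengths >= lower & read_lengths <= upper]
print(length_range)
print(length(kept_lengths))
```

With cl = 100 the two quantiles become the minimum and the maximum, so no read length is excluded.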

psite_info

Update reads information according to the inferred P-sites.

Description
Starting from the P-site position identified by psite, this function updates the data tables that contain
information about the reads. It attaches to the data tables 4 columns reporting the P-site
position with respect to the 1st nucleotide of the transcript, the start and the stop codon of the
annotated coding sequence (if any) and the region of the transcript (5’ UTR, CDS, 3’ UTR) that includes
the P-site. Please note: if a transcript is not associated to any annotated CDS then the positions
of the P-site from both the start and the stop codon are set to NA. One or more additional columns
reporting the three nucleotides covered by the P-site, the A-site or the E-site can be attached by
providing either a FASTA file or a BSgenome data package with the nucleotide sequences.
Usage
psite_info(data, offset, site = NULL, fastapath = NULL,
fasta_genome = TRUE, bsgenome = NULL, gtfpath = NULL, txdb = NULL,
dataSource = NA, organism = NA, granges = FALSE)
Arguments
data

A list of data tables from bamtolist, bedtolist or length_filter.

offset

A data table from psite.

site

Either NULL, "psite", "asite", "esite" or a vector with a combination of the three
character strings. When this parameter is not NULL (NULL is the default), it specifies which
column(s) reporting the three nucleotides covered by the P-site ("psite"), A-site ("asite")
or E-site ("esite") must be added. Note: either
fastapath or bsgenome is required to generate the additional column(s).

fastapath

An optional character string specifying the path to the FASTA file used in the
alignment step, including its name and extension. This file can contain reference
nucleotide sequences either of a genome assembly or of all the transcripts (see
fasta_genome). Please make sure the sequences derive from the same release
of the annotation file used in the create_annotation function. Note: either
fastapath or bsgenome is required to generate the additional column(s) specified by site. Default is NULL.

fasta_genome

A logical value whether or not the FASTA file specified by fastapath contains
nucleotide sequences of a genome assembly. FALSE means that the nucleotide
sequences of all the transcripts are provided instead. When this parameter is
TRUE (the default), an annotation object is required (see gtfpath and txdb).

bsgenome

An optional character string specifying the name of the BSgenome data package
containing the genome sequences to be loaded. If it is not already present
in your system, it will be installed through the biocLite.R script (check the list
of data packages available in the Bioconductor repositories for your version of
R/Bioconductor by the available.genomes function of the BSgenome package).
This parameter also requires an annotation object (see gtfpath and txdb).
Please make sure the sequences included in the specified BSgenome data package
are in agreement with the sequences used in the alignment step. Note: either
fastapath or bsgenome is required to generate the additional column(s) specified
by site. Default is NULL.

gtfpath

A character string specifying the path to the GTF file, including its name and
extension. Please make sure the GTF derives from the same release of what is
specified by fastapath or by bsgenome. Note that either gtfpath or txdb must
be specified when the nucleotide sequences of a genome assembly are provided
(see fastapath or bsgenome). Default is NULL.

txdb

A character string specifying the name of the annotation package for TxDb object(s) to be loaded. If it is not already present in your system, it will be installed
through the biocLite.R script (check the list of TxDb annotation packages available in the Bioconductor repositories at http://bioconductor.org/packages/release/BiocViews.html#___TxD
)). Please make sure the annotation package derives from the same release of
what is specified by fastapath or by bsgenome. Note that either gtfpath or
txdb must be specified when the nucleotide sequences of a genome assembly
are provided (see fastapath or bsgenome). Default is NULL.

dataSource

An optional character string describing the origin of the GTF data file. For more
information about this parameter please refer to the description of dataSource
of the makeTxDbFromGFF function included in the GenomicFeatures package.

organism

An optional character string reporting the genus and species of the organism when
gtfpath is specified. For more information about this parameter please refer to
the description of organism of the makeTxDbFromGFF function included in the
GenomicFeatures package.

granges

A logical value whether or not to return a GRangesList object. Default is
FALSE, meaning that a list of data tables (the required input for the downstream
analyses and graphical outputs provided by riboWaltz) is returned instead.

Value
A list of data tables or a GRangesList object.
Examples
data(reads_list)
data(psite_offset)
data(mm81cdna)
reads_psite_list <- psite_info(reads_list, psite_offset)
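
The columns that psite_info attaches can be pictured with a base R sketch on made-up reads. It assumes the P-site position is the 5' end of the read plus the corrected offset from the 5' end for that read length, and that distances from the CDS boundaries are plain differences from start_pos and stop_pos. These conventions, the boundary handling and all values below are assumptions for illustration, not the package's internal code.

```r
# Toy reads on one hypothetical transcript (positions in nucleotides).
reads <- data.frame(
  transcript = c("tx_A", "tx_A", "tx_A"),
  end5   = c(89, 150, 395),
  end3   = c(116, 178, 422),
  length = c(28, 29, 28),
  start_pos = 101,
  stop_pos  = 400,
  stringsAsFactors = FALSE
)

# Toy length-specific offsets, named after the columns of psite_offset.
offsets <- data.frame(length = c(28, 29), adj_offset_from_5 = c(12, 13))

# Look up the offset by read length, then place the P-site (assumed rule).
idx <- match(reads$length, offsets$length)
reads$psite <- reads$end5 + offsets$adj_offset_from_5[idx]

# Distances from the CDS boundaries and the transcript region.
reads$psite_from_start <- reads$psite - reads$start_pos
reads$psite_from_stop  <- reads$psite - reads$stop_pos
reads$psite_region <- ifelse(reads$psite_from_start < 0, "5utr",
                      ifelse(reads$psite_from_stop > 0, "3utr", "cds"))
print(reads)
```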

psite_offset

P-site offsets

Description
This dataset contains information on the offset computed by psite starting from reads_list.
Usage
psite_offset
Format
A data table with 31 rows and 9 variables (the lengths and the distances are expressed in nucleotides):
length Length of the read
total_percentage Percentage of reads of the considered length in the whole dataset
start_percentage Percentage of reads of the considered length aligning on the start codon (if any)
around_start A logical value reporting whether at least one read of the specified length aligns on
the start codon (T = yes, F = no)
offset_from_5 Temporary P-site offset from the 5’ end of read (before the correction step)
offset_from_3 Temporary P-site offset from the 3’ end of read (before the correction step)
adj_offset_from_5 P-site offset from the 5’ end of read after the correction step
adj_offset_from_3 P-site offset from the 3’ end of read after the correction step
sample Name of the sample

psite_per_cds

Compute the number of in-frame P-sites per coding sequence.

Description
For each sample and each transcript this function computes the number of P-sites in frame 0 within
the coding sequence. It is possible to exclude from the analysis a specified number of nucleotides
at the beginning and/or at the end of the CDS, restricting the analysis to a subsequence of the coding
region. Please note that only the transcripts associated to an annotated CDS are kept for the analysis.
The resulting data table reports the name of the transcripts along with the length of the considered
region (in nucleotides) and the associated number of P-sites for all the samples.
Usage
psite_per_cds(data, annotation, start_nts = 0, stop_nts = 0)
Arguments
data

A list of data tables from psite_info.

annotation

A data table as generated by create_annotation.

start_nts

A non-negative integer specifying the number of nucleotides at the beginning of the
coding sequences to be excluded from the analysis. Default is 0.

stop_nts

A non-negative integer specifying the number of nucleotides at the end of the coding
sequences to be excluded from the analysis. Default is 0.

Value
A data table.
Examples
data(reads_psite_list)
data(mm81cdna)
## Compute the number of P-sites in frame on the whole coding sequence.
psite_cds <- psite_per_cds(reads_psite_list, mm81cdna)
## Compute the number of P-sites in frame on the coding sequence excluding
## the first 15 nucleotides and the last 10 nucleotides.
psite_cds <- psite_per_cds(reads_psite_list, mm81cdna, start_nts = 15, stop_nts = 10)
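
What "in frame 0 within the coding sequence" means, and how start_nts and stop_nts trim the CDS, can be sketched in base R on made-up P-sites. The sketch assumes that frame 0 corresponds to psite_from_start being a multiple of 3 and that the trimmed window is inclusive at both ends. The transcript names and positions are hypothetical.

```r
# Toy P-sites with the column names of reads_psite_list (values are made up).
psites <- data.frame(
  transcript       = c("tx_A", "tx_A", "tx_A", "tx_A", "tx_B", "tx_B"),
  psite_from_start = c(0, 4, 18, 30, 81, 9),
  psite_from_stop  = c(-299, -295, -281, -269, -6, -78),
  psite_region     = "cds",
  stringsAsFactors = FALSE
)

start_nts <- 15  # nucleotides skipped at the beginning of the CDS
stop_nts  <- 10  # nucleotides skipped at the end of the CDS

# P-sites inside the trimmed coding region (assumed inclusive boundaries).
in_window <- psites$psite_region == "cds" &
  psites$psite_from_start >= start_nts &
  psites$psite_from_stop <= -stop_nts

# Frame 0: the distance from the first CDS nucleotide is a multiple of 3.
in_frame <- psites$psite_from_start %% 3 == 0

# Count the in-frame P-sites per transcript, keeping transcripts with zero counts.
n_psites <- vapply(split(in_window & in_frame, psites$transcript), sum, integer(1))
print(n_psites)
```

In this toy table tx_B ends up with zero counts: one P-site is in frame but lies within the last 10 nucleotides, the other lies within the first 15.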

reads_list

Reads information

Description
This dataset contains details on mapping reads from BAM or BED files.
Usage
reads_list
Format
A list of data tables with 1 object (named Samp1) of 393,338 rows and 6 variables (the lengths and
the distances are expressed in nucleotides):
transcript Name of the transcript (ENST ID and version, dot separated)
end5 Position of the 5’ end of the read with respect to the first nucleotide of the transcript
end3 Position of the 3’ end of the read with respect to the first nucleotide of the transcript
length Length of the read
start_pos Leftmost position of the annotated CDS with respect to the first nucleotide of the transcript
stop_pos Rightmost position of the annotated CDS with respect to the first nucleotide of the transcript

reads_psite_list

P-sites and reads information

Description
This dataset contains details on mapping reads after the identification of the P-site and the update
of reads_list.
Usage
reads_psite_list

Format
A list of data tables with 1 object (named Samp1) of 393,338 rows and 10 variables (the lengths and
the distances are expressed in nucleotides):
transcript Name of the transcript (ENST ID and version, dot separated)
end5 Position of the 5’ end of the read with respect to the first nucleotide of the transcript
psite Position of the P-site with respect to the first nucleotide of the transcript
end3 Position of the 3’ end of the read with respect to the first nucleotide of the transcript
length Length of the read
start_pos Leftmost position of the CDS with respect to the first nucleotide of the transcript
stop_pos Rightmost position of the CDS with respect to the first nucleotide of the transcript
psite_from_start Position of the P-site with respect to the first nucleotide of the annotated CDS (if
any)
psite_from_stop Position of the P-site with respect to the last nucleotide of the annotated CDS (if
any)
psite_region Region of the transcript that includes the P-site (5utr, cds, 3utr)

region_psite

Plot the percentage of P-sites per transcript region.

Description
For one or several samples this function computes the percentage of P-sites falling in the three
annotated regions of the transcripts (5’ UTR, CDS and 3’ UTR) and generates a barplot of the
resulting values. The function also calculates and plots the percentage of region length for the
selected transcripts (reported in column "RNAs").
Usage
region_psite(data, annotation, sample = NULL, transcripts = NULL,
label = NULL, colour = c("gray70", "gray40", "gray10"))
Arguments
data

A list of data tables from psite_info.

annotation

A data table as generated by create_annotation.

sample

A character string vector specifying the name of the sample(s) of interest. By default this argument is NULL, meaning that all the samples in data are included
in the analysis.

transcripts

A character string vector specifying the name of the transcripts to be included in
the analysis. By default this argument is NULL, meaning that all the transcripts
in data are used. Please note that the transcripts not associated to any annotated
5’ UTR, CDS and 3’ UTR are automatically discarded.

label

A character string vector of the same length as sample specifying the names
of the samples to be displayed in the plot. By default this argument is NULL,
meaning that the names of the samples are used.

colour

A character string vector of three elements specifying the colours of the bars
corresponding to the 5’ UTR, the CDS and the 3’ UTR respectively. The default
is a grayscale.

Value
A list containing a ggplot2 object, and a data table with the associated data.
Examples
data(reads_psite_list)
data(mm81cdna)
reg_psite <- region_psite(reads_psite_list, mm81cdna, sample = "Samp1")
reg_psite[["plot"]]
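
The two kinds of percentages shown in the barplot can be reproduced by hand on toy data. The base R sketch below assumes that the P-site percentages are plain proportions of P-sites per region and that the "RNAs" column is the share of the cumulative region lengths over the selected transcripts. All values are made up and the computation is an illustration, not the package's internal code.

```r
# Toy region assignments for ten P-sites (labels as in psite_region).
psite_region <- factor(
  c("5utr", "cds", "cds", "cds", "cds", "cds", "cds", "3utr", "cds", "cds"),
  levels = c("5utr", "cds", "3utr")
)
psite_pct <- 100 * prop.table(table(psite_region))

# Toy region lengths for two hypothetical transcripts.
annotation <- data.frame(
  l_utr5 = c(60, 100),
  l_cds  = c(300, 900),
  l_utr3 = c(140, 100)
)
region_len <- colSums(annotation)
rna_pct <- 100 * region_len / sum(region_len)

# Side-by-side comparison: P-site share versus region length share.
comparison <- rbind(P_sites = as.vector(psite_pct), RNAs = as.vector(rna_pct))
colnames(comparison) <- levels(psite_region)
print(comparison)
```

Comparing the two rows shows whether P-sites are enriched in a region beyond what its length alone would predict.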

rends_heat

Plot metaheatmaps based on the two extremities of the reads.

Description
For a specified sample this function plots four metaheatmaps showing the abundance of the 5’ and
the 3’ end of the reads mapping around the start and the stop codon of the annotated CDS (if any),
stratified by their length. It is possible to visualise the metaheatmaps for all the read lengths or to
restrict the graphical output to a sub-range of read lengths.
Usage
rends_heat(data, annotation, sample, transcripts = NULL, cl = 95,
utr5l = 50, cdsl = 50, utr3l = 50, log = F, colour = "black")
Arguments
data

A list of data tables from bamtolist, bedtolist or length_filter.

annotation

A data table as generated by create_annotation.

sample

A character string specifying the name of the sample of interest.

transcripts

A character string vector specifying the name of the transcripts to be included in
the analysis. By default this argument is NULL, meaning that all the transcripts
in data are used. Note that if the 5’ UTR, the coding sequence or the 3’
UTR of a transcript is shorter than utr5l, 2*cdsl and utr3l respectively, the
transcript is automatically discarded.

cl

An integer value in [1,100] specifying the confidence level for restricting the
plot to a sub-range of read lengths. Default is 95.

utr5l

A positive integer specifying the length (in nucleotides) of the 5’ UTR region
that in the plot flanks the start codon. The default value is 50.

cdsl

A positive integer specifying the length (in nucleotides) of the CDS region that
in the plot will flank both the start and the stop codon. The default value is 50.

utr3l

A positive integer specifying the length (in nucleotides) of the 3’ UTR region
that in the plot flanks the stop codon. The default value is 50.

log

A logical value whether or not to use a logarithmic colour scale (strongly suggested in case of large variations of the signal). Default is FALSE.

colour

A character string specifying the colour of the plot. Default is "black".

Value
A list containing a ggplot2 object, and a data table with the associated data.
Examples
data(reads_list)
data(mm81cdna)
## Visualise the metaheatmaps for all the read lengths
heatend_whole <- rends_heat(reads_list, mm81cdna, sample = "Samp1", cl = 100)
## Visualise the
## 95%) reducing
heatend_sub95 