Primary Transcript Annotation Manual

Package ‘primaryTranscriptAnnotation’
June 3, 2019
Type Package
Title Data-driven gene coordinate inference and assignment of
annotations to transcriptional units identified de novo
Version 0.1.0
Author Warren Anderson, Mete Civelek, Michael Guertin
Maintainer Warren Anderson 
Description This package was developed for two purposes: (1) to generate data-driven transcript annotations based on nascent transcript sequencing data, and (2) to annotate de novo identified transcriptional units (e.g., genes and enhancers) based on existing annotations such as those from (1). The code for this package was employed in analyses underlying our manuscript in preparation: Transcriptional mechanisms and gene regulatory networks underlying early adipogenesis (this note will be updated upon publication).
License NA
Encoding UTF-8
LazyData true
RoxygenNote 6.1.1
Imports bigWig, pracma, dplyr

R topics documented:
adjacent.gene.tss
adjacent.gene.tts
apply.TSS.coords
gene.end.plot
gene.overlaps
get.end.intervals
get.gene.end
get.largest.interval
get.TSS
get.TTS
multi.overlap.assign
multi.overlaps
read.count.end
single.overlap.assign
single.overlaps
TSS.count.dist
TTS.count.dist

adjacent.gene.tss

Function to identify transcription start sites (TSSs) for adjacent gene
pairs

Description
We have identified adjacent gene pairs with overlaps based on manual analyses. We empirically
define TSSs for these genes by binning the region spanned by both genes, fitting smooth spline
curves to the binned read counts, and identifying the two largest peaks separated by a specified
distance. We set a bin size and apply the constraint that the identified ’pause’ peaks must be a given
distance apart. The search region includes a specified distance upstream of the upstream-most gene.
We shift the identified peaks upstream. For the spline fits, we set the number of knots to the number
of bins divided by a specified setting (knot.div). See also get.TSS().
Usage
adjacent.gene.tss(fix.genes = NULL, bed.long = NULL, bw.plus = NULL,
bw.minus = NULL, bp.bin = NULL, shift.up = NULL,
delta.tss = NULL, knot.div = 4, knot.thresh = 5, diff.tss = 1000,
fname = "adjacentTSS.pdf")
Arguments
fix.genes     frame with upstream and downstream genes
bed.long      long gene annotations, see get.largest.interval()
bw.plus       plus strand bigWig data
bw.minus      minus strand bigWig data
bp.bin        the interval will be separated into adjacent bins of this size
shift.up      look upstream of the long gene start by this amount to search for a viable TSS
delta.tss     amount by which to shift the identified TSS upstream of the identified interval
knot.div      the number of knots for the spline fit is defined as the number of bins covering
              the search region divided by this number
knot.thresh   minimum number of knots
diff.tss      minimum distance between TSSs
fname         file name (.pdf) for output plots

Value
A vector of TSSs, along with a plot. The titles for the plots indicate the upstream gene, the downstream gene, and the strand. For genes on the plus strand, the upstream gene is on the left. For the
minus strand, the upstream gene is on the right. The red line denotes the spline fit and the vertical
lines indicate the TSSs. The plot is intended for diagnostic purposes.
Examples
# get the start sites for adjacent gene pairs
bp.bin = 10
knot.div = 40
shift.up = 100
delta.tss = 50
diff.tss = 2000
TSS.adjacent = adjacent.gene.tss(fix.genes=fix.genes, bed.long=bed.long,
bw.plus=bw.plus, bw.minus=bw.minus,
knot.div=knot.div, bp.bin=bp.bin,
shift.up=shift.up, delta.tss=delta.tss,
diff.tss=diff.tss, fname="adjacentTSS.pdf")
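The peak-finding approach described for adjacent.gene.tss() can be illustrated without bigWig data. The following is a minimal sketch under stated assumptions, not the package's implementation: the counts are simulated for a single plus-strand gene pair, base R smooth.spline() stands in for the spline fit, local maxima are found with diff(), and every name ending in .sketch (as well as x, lambda, pk1, pk2) is hypothetical.

```r
# Sketch only: simulated data and base R stand-ins for the package internals
set.seed(2)
bp.bin <- 10; knot.div <- 40; knot.thresh <- 5
diff.tss <- 2000; delta.tss <- 50

# Bin centers over a 10 kb region and simulated counts with two pause peaks
x <- seq(bp.bin / 2, 10000, by = bp.bin)
lambda <- 2 + 80 * exp(-((x - 1000) / 500)^2) + 50 * exp(-((x - 6000) / 500)^2)
counts <- rpois(length(x), lambda)

# Number of knots: number of bins / knot.div, but at least knot.thresh
n.knots <- max(knot.thresh, round(length(x) / knot.div))
y.fit <- predict(smooth.spline(x, counts, nknots = n.knots), x)$y

# Local maxima of the fit, ordered from highest to lowest
pk <- which(diff(sign(diff(y.fit))) == -2) + 1
pk <- pk[order(y.fit[pk], decreasing = TRUE)]

# Largest peak, then the largest peak at least diff.tss away from it
pk1 <- pk[1]
pk2 <- pk[abs(x[pk] - x[pk1]) >= diff.tss][1]

# Plus strand assumed: shift both peaks upstream by delta.tss
tss.sketch <- sort(x[c(pk1, pk2)]) - delta.tss
```

With these simulated counts, tss.sketch should fall roughly delta.tss upstream of the two simulated peaks near 1000 and 6000 bp.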

adjacent.gene.tts

Function to identify gene ends (TTSs) for adjacent gene pairs

Description
We set the TTSs for the upstream genes to a given distance from the starts of the downstream genes
for the adjacent gene pairs. See also adjacent.gene.tss().
Usage
adjacent.gene.tts(fix.genes = NULL, bed.long = NULL,
TSS.adjacent = NULL, dist.from.start = NULL)
Arguments
fix.genes         frame with upstream and downstream genes for adjacent gene pairs
bed.long          comprehensive gene annotations with large intervals defined by
                  get.largest.interval()
TSS.adjacent      TSSs for adjacent genes, identified from adjacent.gene.tss()
dist.from.start   distance between the TTS of the upstream gene and the TSS of the
                  downstream gene
Value
A bed6 frame with TTSs for upstream genes in adjacent gene pairs

Examples
# get the end sites for adjacent upstream genes, integrate with TSSs
dist.from.start = 50
adjacent.tts.up = adjacent.gene.tts(fix.genes=fix.genes, bed.long=bed.long,
TSS.adjacent=TSS.adjacent, dist.from.start=dist.from.start)
tss = TSS.adjacent$TSS
names(tss) = TSS.adjacent$gene
adjacent.tts.up = apply.TSS.coords(bed=adjacent.tts.up, tss=tss)
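The coordinate arithmetic behind adjacent.gene.tts() can be written out directly. This is a hedged sketch: it assumes that "a given distance from the start of the downstream gene" means dist.from.start bp upstream of the downstream TSS in the direction of transcription, and upstream.tts.sketch is a hypothetical helper, not a package function.

```r
# Sketch only: TTS of the upstream gene placed dist.from.start bp before
# the TSS of the downstream gene, respecting strand orientation
upstream.tts.sketch <- function(tss.downstream, strand, dist.from.start) {
  ifelse(strand == "+",
         tss.downstream - dist.from.start,   # plus strand: upstream is to the left
         tss.downstream + dist.from.start)   # minus strand: upstream is to the right
}

tts.sketch <- upstream.tts.sketch(tss.downstream = c(20000, 80000),
                                  strand = c("+", "-"),
                                  dist.from.start = 50)
```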

apply.TSS.coords

Apply previously identified TSSs to an existing gene annotation

Description
This function will apply transcription start site (TSS) coordinates to a bed6 frame to incorporate the
output from get.TSS() into an existing bed frame.

Usage
apply.TSS.coords(bed = NULL, tss = NULL)

Arguments
bed           bed6 file with comprehensive gene annotations
tss           TSS estimates, output from get.TSS(). Note that the TSS genes must be a
              subset of the genes in the bed6 data.

Value
A bed6 file with estimated TSSs incorporated

Examples
# apply TSS coordinates to the expression-filtered long gene annotations
bed.long.filtered.tss = apply.TSS.coords(bed=bed.long.filtered, tss=TSS.gene)
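As an illustration of what applying TSS coordinates to a bed6 frame involves, here is a small sketch. It assumes the usual BED convention that the TSS is the start column for plus-strand genes and the end column for minus-strand genes; apply.tss.sketch, bed.sketch, and the column names are hypothetical and may differ from the package's own.

```r
# Sketch only: overwrite the strand-appropriate coordinate with the new TSS
apply.tss.sketch <- function(bed, tss) {
  stopifnot(all(names(tss) %in% bed$gene))   # TSS genes must be a subset
  idx  <- match(names(tss), bed$gene)
  plus <- bed$strand[idx] == "+"
  bed$start[idx[plus]] <- tss[plus]          # plus strand: TSS is the start
  bed$end[idx[!plus]]  <- tss[!plus]         # minus strand: TSS is the end
  bed
}

bed.sketch <- data.frame(chr = "chr1", start = c(100, 5000), end = c(900, 7000),
                         gene = c("GeneA", "GeneB"), xy = 0,
                         strand = c("+", "-"), stringsAsFactors = FALSE)
tss.sketch <- c(GeneA = 130, GeneB = 6950)
bed.tss.sketch <- apply.tss.sketch(bed.sketch, tss.sketch)
```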

gene.end.plot

Plot the results of transcription termination site identification

Description
This function can be used to evaluate the performance of get.gene.end() and get.TTS(). We also
recommend visualizing the results of data-driven gene annotations using a genome browser.
Usage
gene.end.plot(bed = NULL, gene = NULL, bw.plus = NULL,
bw.minus = NULL, bp.bin = NULL, pk.thresh = 0.1, knot.div = 40,
knot.thresh = 5, cnt.thresh = 5, add.to.end = 50000,
tau.dist = 10000, frac.max = 1, frac.min = 0.2)
Arguments
bed           bed6 gene coordinates with gene end intervals for estimation of the TTS
gene          a specific gene for plotting
bw.plus       plus strand bigWig data
bw.minus      minus strand bigWig data
bp.bin        the interval at the gene end will be separated into adjacent bins of this size
pk.thresh     the TTS is defined as the point at which the spline fit decays to this
              fraction of its max peak
knot.div      the number of knots for the spline fit is defined as the number of bins covering
              the gene end divided by this number; increasing this parameter results in a
              smoother curve
knot.thresh   minimum number of knots; if knot.div gives a lower number, use this
cnt.thresh    read count number below which the TTS is not evaluated
add.to.end    the maximal length of the search region (bp)
tau.dist      distance constant for the exponential defining the region for peak detection
frac.max      maximal fraction of gene end region for peak detection
frac.min      minimal fraction of gene end region for peak detection

Value
A single plot is generated
Examples
gene.end.plot(bed=bed.for.tts.eval, gene="Aamp",
bw.plus=bw.plus, bw.minus=bw.minus,
bp.bin=bp.bin, add.to.end=add.to.end, knot.div=knot.div,
pk.thresh=pk.thresh, knot.thresh=knot.thresh,
cnt.thresh=cnt.thresh, tau.dist=tau.dist,
frac.max=frac.max, frac.min=frac.min)

gene.overlaps

Documentation of gene overlaps

Description
This function identifies gene overlaps that do not involve identical coordinates.
Usage
gene.overlaps(bed = NULL)
Arguments
bed           bed6 file with processed gene annotations

Value
A list with has.start.inside and is.a.start.inside. has.start.inside = bed6 file that documents genes in
which there are starts of other genes. is.a.start.inside = bed6 file that documents genes that start
inside other genes.
Examples
# run overlap analysis
overlap.data = gene.overlaps( bed = bed.long.filtered2.tss )
has.start.inside = overlap.data$has.start.inside
is.a.start.inside = overlap.data$is.a.start.inside
dim(has.start.inside)
dim(is.a.start.inside)
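The two outputs of gene.overlaps() can be pictured with a toy annotation. The sketch below is an assumption about the logic, not the package code: it treats a gene as "inside" another when its strand-aware start lies strictly within the other gene's interval on the same chromosome and strand. overlap.sketch and bed.sketch are hypothetical names.

```r
# Sketch only: flag genes that contain, or are, a start inside another gene
overlap.sketch <- function(bed) {
  tss <- ifelse(bed$strand == "+", bed$start, bed$end)   # strand-aware start
  has.in <- logical(nrow(bed))
  is.in  <- logical(nrow(bed))
  for (i in seq_len(nrow(bed))) {
    inside <- bed$chr == bed$chr[i] & bed$strand == bed$strand[i] &
      tss > bed$start[i] & tss < bed$end[i]
    inside[i] <- FALSE                                   # ignore the gene itself
    if (any(inside)) {
      has.in[i] <- TRUE
      is.in[inside] <- TRUE
    }
  }
  list(has.start.inside = bed[has.in, ], is.a.start.inside = bed[is.in, ])
}

bed.sketch <- data.frame(chr = "chr1",
                         start = c(1000, 3000, 9000),
                         end = c(5000, 8000, 12000),
                         gene = c("GeneA", "GeneB", "GeneC"),
                         xy = 0, strand = "+", stringsAsFactors = FALSE)
ov.sketch <- overlap.sketch(bed.sketch)
```

Here GeneB starts inside GeneA, so GeneA is reported in has.start.inside and GeneB in is.a.start.inside; GeneC is in neither.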

get.end.intervals

Get intervals at the gene ends for transcription termination site (TTS)
estimation

Description
We define regions at the gene ends to examine read counts for TTS identification. Note that transcription frequently extends beyond the poly-A site of a gene. To capture the end of transcription,
it is critical to examine regions beyond annotated gene boundaries. We evaluate evidence of transcriptional termination in regions extending from a 3’ subset of the gene to a selected number of
base pairs downstream of the most distal annotated gene end. We initiate the search region with the
end of the largest gene annotation (see get.largest.interval()). We extend the search region up to a
given distance past the annotated gene end. We also apply the constraint that a TTS cannot be identified closer than a specified distance to the previously identified TSS of a downstream gene. This
analysis also incorporates the constraint that a gene end region identified cannot cross the TSS of a
downstream gene, thereby preventing gene overlaps on a given strand. Thus, we clip the number of
bases added on to the gene end as necessary to avoid overlaps.

Usage
get.end.intervals(bed = NULL, add.to.end = NULL, fraction.end = NULL,
dist.from.start = NULL)
Arguments
bed               processed bed6 file with one entry per gene and identified TSSs
add.to.end        number of bases to add to the end of each gene to define the search region
fraction.end      fraction of the gene annotation to consider at the end of the gene
dist.from.start   the minimal allowable distance between the end of one gene and the start
                  of the next
Value
A bed6 for gene end evaluation
Examples
# get intervals for TTS evaluation
add.to.end = 100000
fraction.end=0.1
dist.from.start=50
bed.for.tts.eval = get.end.intervals(bed=bed.long.filtered4.tss,
add.to.end=add.to.end,
fraction.end=fraction.end,
dist.from.start=dist.from.start)
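The clipping rule described for get.end.intervals() can be worked through for a single plus-strand gene. This is a sketch under assumptions: end.interval.sketch is a hypothetical helper, the handling of the minus strand is omitted, and the package's own bookkeeping of clipped bases may differ.

```r
# Sketch only: search region for one plus-strand gene
end.interval.sketch <- function(start, end, next.tss,
                                add.to.end, fraction.end, dist.from.start) {
  # begin within the 3' fraction of the annotated gene
  region.start <- end - round(fraction.end * (end - start))
  # extend past the gene end, but stop dist.from.start before the next TSS
  region.end <- min(end + add.to.end, next.tss - dist.from.start)
  c(start = region.start, end = region.end,
    clipped = (end + add.to.end) - region.end)
}

# gene at 10000-30000 with the next downstream TSS at 60000
iv.sketch <- end.interval.sketch(start = 10000, end = 30000, next.tss = 60000,
                                 add.to.end = 100000, fraction.end = 0.1,
                                 dist.from.start = 50)

# no downstream gene: nothing is clipped
iv.free.sketch <- end.interval.sketch(start = 10000, end = 30000, next.tss = Inf,
                                      add.to.end = 100000, fraction.end = 0.1,
                                      dist.from.start = 50)
```

In the first case the region runs from 28000 to 59950 and 70050 of the requested 100000 bases are clipped.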

get.gene.end

Defining gene ends

Description
Given regions within which to search for TTSs, we operationally define the TTSs by binning the
gene end regions, counting reads within the bins, fitting smooth spline curves to the bin counts,
and detecting points at which the curves decay towards zero. We applied the constraint that there
must be a specified number of bases in the gene end interval; otherwise the TTS analysis is not
applied. Similarly, if the number of knots identified is too low, then we set the number of knots
to a specified threshold. For this analysis, we set a sub-region at the beginning of the gene end
region and identify the maximal peak from the spline fit. Then we identify the point at which the
spline fit decays to a threshold level of the peak level. We reasoned that the sub-region should
be largest for genes with the greatest numbers of clipped bases, because such cases occur when
the conventional gene ends are proximal to identified TSSs, and we should include these entire
regions for analysis of the TTS. Similarly, we reasoned that for genes with substantially fewer clipped
bases, and correspondingly larger gene end regions with greater potential for observing enhancers
or divergent transcripts, the sub-regions should be smaller sections of the upstream-most gene end
region. We use an exponential model to define the sub-regions. The user should use the output of
get.end.intervals() as an input to this function. Note that gene ends will be set to zeros if there are
no counts or if the interval is too small. This function calls get.TTS().

Usage
get.gene.end(bed = NULL, bw.plus = NULL, bw.minus = NULL,
bp.bin = NULL, pk.thresh = 0.1, knot.div = 40, knot.thresh = 5,
cnt.thresh = 5, add.to.end = 50000, tau.dist = 10000,
frac.max = 1, frac.min = 0.2)
Arguments
bed           bed6 gene coordinates with gene end intervals for estimation of the TTS
bw.plus       plus strand bigWig data
bw.minus      minus strand bigWig data
bp.bin        the interval at the gene end will be separated into adjacent bins of this size
pk.thresh     the TTS is defined as the point at which the spline fit decays to this
              fraction of its max peak
knot.div      the number of knots for the spline fit is defined as the number of bins covering
              the gene end divided by this number; increasing this parameter results in a
              smoother curve
knot.thresh   minimum number of knots; if knot.div gives a lower number, use this
cnt.thresh    read count number below which the TTS is not evaluated
add.to.end    the maximal length of the search region (bp)
tau.dist      distance constant for the exponential defining the region for peak detection
frac.max      maximal fraction of gene end region for peak detection
frac.min      minimal fraction of gene end region for peak detection

Value
A bed6 file with TTS estimates incorporated; zeros are present if the TTS could not be estimated.
The output is a list including the bed frame along with several metrics (see example below and
vignette).
Examples
# identify gene ends
add.to.end = max(bed.for.tts.eval$xy)
knot.div = 40
pk.thresh = 0.02
bp.bin = 50
knot.thresh = 5
cnt.thresh = 5
tau.dist = 10000
frac.max = 1
frac.min = 0.2
gene.ends = get.gene.end(bed=bed.for.tts.eval, bw.plus=bw.plus, bw.minus=bw.minus,
                         bp.bin=bp.bin, add.to.end=add.to.end, knot.div=knot.div,
                         pk.thresh=pk.thresh, knot.thresh=knot.thresh,
                         cnt.thresh=cnt.thresh, tau.dist=tau.dist,
                         frac.max=frac.max, frac.min=frac.min)

# get metrics
minus.lowcount = length(gene.ends$minus.lowcount)
plus.lowcount = length(gene.ends$plus.lowcount)
minus.knotmod = length(gene.ends$minus.knotmod)
plus.knotmod = length(gene.ends$plus.knotmod)
ends = gene.ends$bed
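The steps described for get.gene.end() (bin, fit a spline, find the peak in a sub-region, follow the decay to a threshold) can be sketched on simulated counts. Several parts are assumptions and are not taken from the package: base R smooth.spline() stands in for the spline fit, the exponential expression for the sub-region fraction is only one plausible form consistent with the description (largest when few bases were added, approaching frac.min when many were added), and x, bases.added, and tts.offset are hypothetical names.

```r
# Sketch only: simulated gene end region, plus strand, offsets in bp
set.seed(1)
bp.bin <- 50; knot.div <- 40; knot.thresh <- 5; pk.thresh <- 0.1
tau.dist <- 10000; frac.max <- 1; frac.min <- 0.2

# 20 kb region: bin centers and counts that decay after a peak near 3 kb
x <- seq(bp.bin / 2, 20000, by = bp.bin)
counts <- rpois(length(x), 100 * exp(-((x - 3000) / 4000)^2))

# Number of knots: number of bins / knot.div, but at least knot.thresh
n.knots <- max(knot.thresh, round(length(x) / knot.div))
y.fit <- predict(smooth.spline(x, counts, nknots = n.knots), x)$y

# ASSUMED exponential model for the fraction of the region used for
# peak detection; the package's exact formula is not given in this manual
bases.added <- 20000
frac <- frac.min + (frac.max - frac.min) * exp(-bases.added / tau.dist)
sub.bins <- seq_len(ceiling(frac * length(x)))

# Maximal peak within the sub-region, then the first bin at or beyond the
# peak where the fit has decayed to pk.thresh of the peak level
pk.idx <- sub.bins[which.max(y.fit[sub.bins])]
below <- which(y.fit[pk.idx:length(x)] <= pk.thresh * y.fit[pk.idx])
tts.offset <- x[pk.idx + below[1] - 1]
```

For these simulated counts the underlying curve falls to 10 percent of its peak roughly 9 kb into the region, so tts.offset should land near that position.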

get.largest.interval

Get the largest interval for each gene, given multiple TSS and TTS
annotations

Description
The input bed6 file can be derived from a GENCODE annotation file, as described in the vignette.
Usage
get.largest.interval(bed = NULL)
Arguments
bed           bed6 frame with comprehensive gene annotations, defaults to NULL

Value
A bed6 frame with the largest annotation for each gene
Examples
# get intervals for furthest TSS and TTS +/- interval
bed.long = get.largest.interval(bed=dat0)
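To show what "largest interval" means in practice, the sketch below collapses several annotated isoforms per gene to the smallest start and the largest end using base R. It is an illustration with hypothetical names (largest.interval.sketch, dat0.sketch) and assumed column names, not the package's code.

```r
# Sketch only: several annotated isoforms per gene in bed6 layout
dat0.sketch <- data.frame(
  chr    = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start  = c(100, 150, 120, 5000, 5200),
  end    = c(900, 1200, 800, 7000, 7500),
  gene   = c("GeneA", "GeneA", "GeneA", "GeneB", "GeneB"),
  xy     = 0,
  strand = c("+", "+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

# One row per gene: smallest start and largest end over all annotations
largest.interval.sketch <- function(bed) {
  starts <- tapply(bed$start, bed$gene, min)
  ends   <- tapply(bed$end, bed$gene, max)
  out <- bed[!duplicated(bed$gene), ]
  out$start <- as.numeric(starts[out$gene])
  out$end   <- as.numeric(ends[out$gene])
  rownames(out) <- NULL
  out
}

bed.long.sketch <- largest.interval.sketch(dat0.sketch)
```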

get.TSS

Select an optimal transcription start site (TSS) for each gene

Description
We set a range around each annotated gene start within which to search for a region of peak read
density. Such regions of peak read density occur at ’pause sites’ that are typically 20-80 bp from the
TSS. Within each region around an annotated gene start, we slide a window across the region
to find the sub-region with maximal density. We determine the window with the maximal density
out of all regions considered and define the TSS as the upstream boundary of this window, minus
a shift of a specified amount. The output of this function should be processed using the function
apply.TSS.coords().

Usage
get.TSS(bed = NULL, bw.plus = NULL, bw.minus = NULL,
bp.range = NULL, bp.delta = NULL, bp.bin = NULL,
delta.tss = NULL, cnt.thresh = NULL)
Arguments
bed           bed6 file with comprehensive gene annotations
bw.plus       plus strand bigWig data
bw.minus      minus strand bigWig data
bp.range      range to be centered on each gene start for evaluating read counts. Note that
              the value should be divisible by 2
bp.delta      number of bases to move incrementally - sliding window interval
bp.bin        bin size for evaluating read counts within a given region
delta.tss     amount by which to shift the identified TSS upstream of the identified interval
cnt.thresh    read count number below which the TSS is not evaluated

Value
A vector of identified transcription start site coordinates. Note that strand and chromosome information are not provided here.
Examples
# select the TSS for each gene
bp.range = 1200
bp.delta = 10
bp.bin = 50
delta.tss = 50
cnt.thresh = 5
TSS.gene = get.TSS(bed=dat0, bw.plus=bw.plus, bw.minus=bw.minus,
bp.range=bp.range, bp.delta=bp.delta, bp.bin=bp.bin,
delta.tss=delta.tss, cnt.thresh=cnt.thresh)
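The sliding-window search described for get.TSS() can be sketched for one plus-strand gene using a simulated per-base coverage vector in place of bigWig data. The coverage values and the names pos, cvg, win.starts, and tss.sketch are hypothetical, and the package may handle annotations with multiple starts and the minus strand differently.

```r
# Sketch only: one annotated start on the plus strand, simulated coverage
bp.range <- 1200; bp.delta <- 10; bp.bin <- 50; delta.tss <- 50
ann.start <- 10000

# Positions in a window of bp.range centered on the annotated start
pos <- (ann.start - bp.range / 2):(ann.start + bp.range / 2 - 1)
cvg <- rep(1, length(pos))
cvg[pos >= 10230 & pos < 10280] <- 20     # simulated pause peak

# Slide a window of width bp.bin in steps of bp.delta and count reads
win.starts <- seq(min(pos), max(pos) - bp.bin + 1, by = bp.delta)
win.counts <- sapply(win.starts,
                     function(s) sum(cvg[pos >= s & pos < s + bp.bin]))

# Upstream boundary of the densest window, shifted upstream by delta.tss
best.start <- win.starts[which.max(win.counts)]
tss.sketch <- best.start - delta.tss
```

The densest window starts at 10230, so the sketch places the TSS at 10180.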

get.TTS

Estimate the transcription termination sites (TTSs)

Description
This function estimates the TTSs. It is called by get.gene.end() and is not designed to be
run on its own.

Usage
get.TTS(bed = NULL, bw = NULL, bp.bin = NULL, pk.thresh = NULL,
knot.div = NULL, cnt.thresh = NULL, knot.thresh = NULL,
add.to.end = NULL, tau.dist = NULL, frac.max = NULL,
frac.min = NULL)
Arguments
bed           bed6 gene coordinates with gene end intervals for estimation of the TTS
bw            plus or minus strand bigWig data
bp.bin        the interval at the gene end will be separated into adjacent bins of this size
pk.thresh     the TTS is defined as the point at which the spline fit decays to this
              fraction of its max peak
knot.div      the number of knots for the spline fit is defined as the number of bins covering
              the gene end divided by this number
cnt.thresh    read count number below which the TTS is not evaluated
knot.thresh   minimum number of knots; if knot.div gives a lower number, use this
add.to.end    the maximal length of the search region (bp)
tau.dist      distance constant for the exponential defining the region for peak detection
frac.max      maximal fraction of gene end region for peak detection
frac.min      minimal fraction of gene end region for peak detection

Value
A vector of TTS coordinates
Examples
See get.gene.end()

multi.overlap.assign

Function to assign identifiers for class5 overlap TUs

Description
This function is called inside of multi.overlaps() and is not recommended to be used on its own.
Usage
multi.overlap.assign(fr = NULL, tss.thresh = NULL, delta.tss = NULL,
delta.tts = NULL, cl = NULL)

Arguments
fr            frame with full bedtools overlap data for a class5 TU with multiple internal
              annotations
tss.thresh    number of bp a TU beginning can be off from an annotation in order to be
              assigned that annotation
delta.tss     max distance between an upstream gene end and downstream gene start
delta.tts     max distance between an annotated gene end and the start of a
              downstream gene before an intermediate TU id is assigned
cl            multi-overlap class

Value
Bed data with chr, start, end, id, class, and strand. Note that id = 0 for regions of large TUs that
do not match input annotations.
Examples
See multi.overlaps()

multi.overlaps

Assign identifiers to TUs with multiple gene overlaps

Description
This function will assign identifiers to TUs that overlapped with multiple genes. Trusted gene
annotations are considered in terms of their degree of overlap with TUs identified in an unbiased
manner. Marginal regions of TUs outside of gene overlaps are given generic identifiers.
Usage
multi.overlaps(overlaps = NULL, tss.thresh = NULL, delta.tss = NULL,
delta.tts = NULL)
Arguments
overlaps      frame with bed intersect of inferred TUs (col1-6) with annotated TUs (col7-12)
              and overlapping bps (col14)
tss.thresh    number of bp a TU beginning can be off from an annotation in order to be
              assigned that annotation
delta.tss     max distance between an upstream gene end and downstream gene start
delta.tts     max distance between an annotated gene end and the start of a
              downstream gene before an intermediate TU id is assigned

Value
A list with bed data (bed) and counts for each class (cnt5-8). bed = data with chr, start, end, id,
class, and strand. cnt5 = count of TU from class 5 overlaps
Examples
dup.hmm.rows = overlap.tu[duplicated(overlap.tu[,c(1:3,6)]) |
duplicated(overlap.tu[,c(1:3,6)], fromLast=TRUE),]
dup.full = dup.hmm.rows
names(dup.full)[1:6] = c("infr.chr","infr.start","infr.end",
"infr.gene","infr.xy","infr.strand")
nrow(dup.full[,1:3] %>% unique)
nrow(dup.full %>% select(ann.gene) %>% unique)
tss.thresh = 200
delta.tss = 50
delta.tts = 1000
class5678 = multi.overlaps(overlaps=dup.full, tss.thresh=tss.thresh,
delta.tss=delta.tss, delta.tts=delta.tts)
new.ann.mult = class5678$bed

read.count.end

Select regions containing the gene ends

Description
Select a fraction of the annotated gene end and consider an additional number of base pairs beyond
the gene end within which to count reads
Usage
read.count.end(bed = NULL, bw.plus = NULL, bw.minus = NULL,
fraction.end = NULL, add.to.end = NULL)
Arguments
bed            a bed6 frame
bw.plus        plus strand bigWig data
bw.minus       minus strand bigWig data
fraction.end   fraction of the annotated gene end (0,1)
add.to.end     number of base pairs beyond the gene end within which to count reads

Value
A list with counts and density. counts = vector with end of gene counts for each gene. density =
vector with end of gene densities for each gene.

Examples
# get read counts and densities at the end of each annotated gene
fraction.end = 0.1
add.to.end = 0
end.reads = read.count.end(bed=bed.long, bw.plus=bw.plus, bw.minus=bw.minus,
fraction.end=fraction.end, add.to.end=add.to.end)
hist(log(end.reads$density))
hist(log(end.reads$counts))

single.overlap.assign

Function to assign identifiers for class1 overlap TUs

Description
This function is called inside of single.overlaps() and is not recommended to be used on its own.
Usage
single.overlap.assign(fr = NULL, tss.thresh = NULL, delta.tss = NULL,
delta.tts = NULL, cl = NULL)
Arguments
fr            frame with full bedtools overlap data for a class1 TU with multiple internal
              annotations
tss.thresh    number of bp a TU beginning can be off from an annotation in order to be
              assigned that annotation
delta.tss     max distance between an upstream gene end and downstream gene start
cl            multi-overlap class

Value
A bed data frame with chr, start, end, id, class, and strand. Note that id = 0 for regions of large TUs
that do not match input annotations.
Examples
See single.overlaps()

single.overlaps

Assign identifiers to TUs with single gene overlaps

Description
This function will assign identifiers to TUs that overlapped with single genes. Trusted gene annotations are considered in terms of their degree of overlap with TUs identified in an unbiased manner.
Marginal regions of TUs outside of gene overlaps are given generic identifiers.
Usage
single.overlaps(overlaps = NULL, tss.thresh = NULL, delta.tss = NULL,
delta.tts = NULL)
Arguments
overlaps      frame with bed intersect of inferred TUs (col1-6) with annotated TUs (col7-12)
              and overlapping bps (col14)
tss.thresh    number of bp a TU beginning can be off from an annotation in order to be
              assigned that annotation
delta.tss     max distance between an upstream gene end and downstream gene start
delta.tts     max distance between an annotated gene end and the start of a
              downstream gene before an intermediate TU id is assigned

Value
A list with bed data (bed) and counts for each class (cnt1-4). bed = data with chr, start, end, id,
class, and strand. cnt1 = count of TU from class 1 overlaps
Examples
sing.hmm.rows = overlap.tu[!duplicated(overlap.tu[,c(1:3,6)]) &
!duplicated(overlap.tu[,c(1:3,6)], fromLast=TRUE),]
overlaps = sing.hmm.rows
names(overlaps)[1:6] = c("infr.chr","infr.start","infr.end",
"infr.gene","infr.xy","infr.strand")
tss.thresh = 200
delta.tss = 50
delta.tts = 1000
class1234 = single.overlaps(overlaps=overlaps, tss.thresh=tss.thresh,
delta.tss=delta.tss, delta.tts=delta.tts)
new.ann.sing = class1234$bed

TSS.count.dist

Get distances from identified transcription start sites (TSSs) to the
nearest regions with peak reads

Description
This function was designed to evaluate the results of our TSS identification analysis. We specify a
region centered on the TSS. We obtain read counts in bins that span this window. We sort the genes
based on the bin with the maximal reads and we scale the data to the interval (0,1) for visualization.
We compute the distances between the TSS and the bin with the max reads within the specified
window.
Usage
TSS.count.dist(bed = NULL, bw.plus = NULL, bw.minus = NULL,
window = NULL, bp.bin = NULL)
Arguments
bed           bed file for TSS evaluation
bw.plus       plus strand bigWig data
bw.minus      minus strand bigWig data
window        region size, centered on the TSS for analysis
bp.bin        the interval will be separated into adjacent bins of this size

Value
A list with three vector elements: raw, scaled, and dist. All outputs are organized based on TSS
position (left-upstream). raw = binned raw counts. scaled = binned scaled counts (0,1). dist = upper
bound distances ( min(|dists|) = bp.bin ). See examples and vignette for more details.
Examples
# look at read distribution around identified TSSs
bp.bin = 10
window = 1000
tss.dists = TSS.count.dist(bed=bed.tss.tts, bw.plus=bw.plus, bw.minus=bw.minus,
window=window, bp.bin=bp.bin)
# look at read distribution around 'long gene' annotation TSSs
tss.dists.lng = TSS.count.dist(bed=bed.long[bed.long$gene %in% names(tss.dists$dist),],
bw.plus=bw.plus, bw.minus=bw.minus,
window=window, bp.bin=bp.bin)
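The scaling and distance calculations described for TSS.count.dist() can be illustrated on a small matrix of binned counts. This sketch rests on assumptions: the counts are made up, min-max scaling is used for the (0,1) interval, and the "upper bound" distance is taken to be the far edge of the bin with the maximal reads, which is consistent with min(|dists|) = bp.bin. The names left, raw.sketch, max.bin, and tss.dist are hypothetical.

```r
# Sketch only: made-up binned counts, TSS at position 0
bp.bin <- 10; window <- 100

# Left edges of the bins, in bp relative to the TSS
left <- seq(-window / 2, window / 2 - bp.bin, by = bp.bin)

# One row of binned read counts per gene
raw.sketch <- rbind(GeneA = c(0, 1, 2, 1, 0, 3, 9, 4, 1, 0),
                    GeneB = c(0, 0, 1, 8, 2, 1, 0, 0, 0, 0))

# Scale each gene to the interval (0,1) for visualization
scaled.sketch <- t(apply(raw.sketch, 1,
                         function(r) (r - min(r)) / (max(r) - min(r))))

# Bin with the maximal reads for each gene
max.bin <- apply(raw.sketch, 1, which.max)

# Upper bound distance from the TSS: far edge of the maximal bin
tss.dist <- ifelse(left[max.bin] >= 0, left[max.bin] + bp.bin, left[max.bin])
names(tss.dist) <- rownames(raw.sketch)

# Sort genes by the position of their maximal bin (upstream first)
scaled.sorted <- scaled.sketch[order(max.bin), ]
```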

TTS.count.dist

Get read counts around transcription termination sites (TTSs)

Description
We set a window around the TTS and segment the window into bins. To sort the read data, we
compute cumulative counts of reads, and we sort based on the bin at which a specified percentage
of the reads are found. We also take the ratio of counts downstream / upstream of the TTS,
within a region of ratio.region bp on either side.
Usage
TTS.count.dist(bed = NULL, bw.plus = NULL, bw.minus = NULL,
window = NULL, bp.bin = NULL, frac.max = NULL,
ratio.region = NULL)
Arguments
bed            bed frame for TTS evaluation
bw.plus        plus strand bigWig data
bw.minus       minus strand bigWig data
window         region size, centered on the TTS for analysis
bp.bin         the interval will be separated into adjacent bins of this size
frac.max       fraction of cumulative distribution for sorting; entries are sorted by
               indices min( cumsum(x)/sum(x) ) < frac.max
ratio.region   number of bp on either side of the TTS to evaluate the ratio entries

Value
A list with three elements: raw, scaled, and ratio. All outputs are organized based on TTS position
(left-upstream, see frac.max). raw = binned raw counts. scaled = binned scaled counts (0,1). ratio
= downstream(ratio.region) / upstream(ratio.region).
Examples
# look at read distribution around identified TTSs
window = 1000
bp.bin = 10
frac.max = 0.8
ratio.region = 300
tts.dists = TTS.count.dist(bed=bed.tss.tts, bw.plus=bw.plus, bw.minus=bw.minus,
window=window, bp.bin=bp.bin, frac.max=frac.max,
ratio.region=ratio.region)
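The cumulative-count sorting index and the downstream / upstream ratio described for TTS.count.dist() can be worked through for one gene. This is a sketch with made-up counts; in particular, taking the sorting index as the first bin at which the cumulative fraction reaches frac.max is an assumption about the exact rule, and left, raw.sketch, and sort.idx are hypothetical names.

```r
# Sketch only: made-up binned counts for one gene, TTS at position 0
bp.bin <- 10; window <- 100; frac.max <- 0.8; ratio.region <- 30

# Left edges of the bins, in bp relative to the TTS
left <- seq(-window / 2, window / 2 - bp.bin, by = bp.bin)
raw.sketch <- c(9, 8, 9, 7, 6, 2, 1, 1, 0, 0)

# Cumulative fraction of reads and the first bin reaching frac.max
cum.frac <- cumsum(raw.sketch) / sum(raw.sketch)
sort.idx <- which(cum.frac >= frac.max)[1]

# Counts within ratio.region bp on either side of the TTS
up.counts   <- sum(raw.sketch[left >= -ratio.region & left < 0])
down.counts <- sum(raw.sketch[left >= 0 & left < ratio.region])
ratio.sketch <- down.counts / up.counts
```

For this gene the cumulative fraction first reaches 0.8 in bin 5, and the ratio is 4 / 22, reflecting the drop in reads downstream of the TTS.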

