SPOT Instructions V1

sRNA-target Prediction Organizing Tool v1 – Manual 10 Sept-18

1. Rationale

2. Installing SPOT

3. Running SPOT – Quick Start

4. Running SPOT

5. Data input formats

6. Data output formats

7. Setting up an AWS account

8. Setting up personal AWS interface on your laptop

9. Starting an AWS instance

10. Logging into your AWS instance

11. References

Cite: A.M. King, C.K. Vanderpool, and P.H. Degnan. sRNA-target Prediction Organizing Tool (SPOT) integrates computational and experimental data to facilitate functional characterization of bacterial small RNAs.

1. Rationale

Computational approaches for sRNA target prediction have limitations but are relied upon to generate testable hypotheses for sRNA function. Some algorithms are available online or for download (e.g., TargetRNA2, IntaRNA); however, these tools frequently yield distinct results and differ in their output formats and default search parameters. Manually compiling results from these disparate tools and integrating the predictions with existing experimental data is therefore not trivial. To streamline the use of multiple existing sRNA target prediction algorithms and to integrate their predictions with experimental data into a unified set of target predictions, we have developed SPOT, a flexible software pipeline that searches for sRNA-mRNA binding sites in parallel using separate search tools, collates the predictions, and integrates experimental data using customizable results filters.

Figure 1. Schematic of SPOT pipeline analysis (King et al.)

2. Installing SPOT

SPOT is a PERL program that runs TargetRNA2, IntaRNA, StarPicker and CopraRNA in parallel and collates the results to find consensus sRNA-mRNA targets (Figure 1). Additional data types can be used to filter the results, including expression differences, known binding sites, operon predictions and the window size of possible binding sites.

As written, the program can run on any Unix/Linux based system; however, it has a number of dependencies. To facilitate its use we have set up an Amazon Web Services (AWS) cloud Amazon Machine Image (AMI) with all of the required software installed. Skip to sections 7-10 for setting up your own SPOT AMI. Alternatively, using the code available here you can set up and run SPOT on a local server.

First, download and install the following software tools and all of their dependencies according to the authors’ instructions:

• TargetRNA v2

• StarPicker

• IntaRNA v1.0.4

• CopraRNA v1.2.9

Several modifications were made to the StarPicker, IntaRNA and CopraRNA code to accommodate the demands of the pipeline.

Replace the following programs with those provided in the GitHub link. Modifications in the code are marked with ## comments and/or initials (PHD). Descriptions of the edits made are listed briefly below.

StarPicker:

sTarPicker_global2.pl : changes made to input of command line arguments

IntaRNA v1.0.4:

add_GI_genename_annotation.pl : distinguish GeneIDs vs GI Nos

get_refseq_from_ftp.pl : Replacement code for get_refseq_from_ftp.sh

IntaRNA_wrapper.pl : Option added to use local GenBank files, use get_refseq_from_ftp.pl

rerun_enrichment.pl : code snippet re-running enrichment analysis from IntaRNA_Wrapper.pl

termClusterReport.pl : code modified to handle GeneIDs vs GI Nos

CopraRNA v1.2.9:

get_refseq_from_ftp.pl : Replacement code for get_refseq_from_ftp.sh

termClusterReport.pl : code modified to handle GeneIDs vs GI Nos

get_CDS_from_gbk.pl : code modified to skip and flag GenBank files not present in kegg2refseqnew.csv list

Note: The D3 JavaScript library used to generate functional enrichment heatmaps (http://d3js.org/d3.v3.min.js) may or may not be accessible using the existing framework. If problems are encountered, it is possible to edit the master html files in IntaRNA and CopraRNA to use a local version of d3.v3.min.js.

Be sure all programs are added to the user path and that all path references in StarPicker, IntaRNA, and CopraRNA match your system installation.

The statistics program R is installed as a requirement for IntaRNA and CopraRNA. Add the following two R packages to that installation:

• RColorBrewer

• gplots

$ sudo R

> install.packages(c('RColorBrewer', 'gplots'))

Some of the output from the SPOT program will be written in an xlsx format using the

Excel Writer PERL module:

• Excel-Writer-XLSX-0.98

$ sudo cpan Excel::Writer::XLSX

Most existing Unix/Linux installations should have sendmail installed. If not, install the appropriate package:

• sendmail

$ sudo apt install sendmail-bin

SPOT can work with local copies of genomes and annotations. However, to access

genomes from NCBI install the efetch program from the Entrez Direct (edirect)

toolkit.

• edirect

Retrieve and decompress the SPOT directory from GitHub, which contains the core pipeline script and its required support PERL scripts.

• SPOT

Make sure SPOT and all of the programs are in your user path. Modify the core pipeline

script with the absolute path locations for TargetRNA2, IntaRNA, StarPicker and

CopraRNA, and other support PERL scripts.

3. Running SPOT - Quick Start

SPOT is a pipeline script that when run without arguments will print all of the possible

program options:

$ spot.pl

Usage ./spot.pl

Input parameters:

-r Fasta file of sRNA query

-a RefSeq Accession number (assumes any local files have RefSeq

number as their prefix)

...

The minimum data required for a SPOT search are:

1. A fasta file of the small RNA sequence

2. A RefSeq genome accession number

$ spot.pl -r sgrS.fasta -a NC_000913

This will initiate a job using SgrS as the sRNA query and E. coli str. K12 (NC_000913) as the reference genome. Progress of the search will be printed to the screen. Run time will depend on the number of processors available, as each search tool is distributed to a separate subprocess. By default CopraRNA is not run unless specifically requested.

4. Running SPOT

SPOT has an array of options that control the input, algorithm parameters, and results filtering.

$ spot.pl

Usage ./spot.pl

Input parameters:

-r Fasta file of sRNA query

-a RefSeq Accession number (assumes any local files have RefSeq

number as their prefix)

-o output file prefix (default = TEST)

-g Use local GenBank or PTT&FNA files for all Programs? (default = N

use latest from GenBank, CopraRNA cannot use local files)

-n Other genome RefSeq ids for CopraRNA listed in quotes '' ,

current max is 5 genomes (default ='')

-m Multisequence sRNA file for each genome in CopraRNA list

(default ='')

-x Email address for job completion notification (default ='')

Algorithm parameters:

-u Number of nt upstream of start site to search (default = 60)

-d Number of nt downstream of start site to search (default = 60)

-s seed sizes for I, T, S e.g., '6 7 6' (defaults TargetRNA = 7,

IntaRNA & Starpicker = 6)

-c P/Threshold value Cutoff for T, S, I e.g., '0.5 .001 un'

(defaults Target = 0.05, Starpicker = 0.5, IntaRNA = top)

Results Filters:

-b Number of nt upstream of start site to filter results

(default = -20)

-e Number of nt downstream of start site to filter results

(default = 20)

Note: -b and -e ignored if using a list (-l) or Rockhopper

results (-t)

Note: Set -b and -e to -u and -d to get all possible matches in

results

-l List of up and/or down regulated genes, include binding coord if

known e.g.,

b1101\tdown\n

b3826\tup\tsRNA_start\tsRNA_stop\tmRNA_start\tmRNA_stop\n

-t transcriptome expression file from Rockhopper *_transcripts.txt

-f Rockhopper fold change cutoff (default = 1.5)

-q Rockhopper q value cutoff (default = 0.01)

-k Rockhopper Expression cutoff value (default = 100)

-p Operon file from DOOR-2 (http://csbl.bmb.uga.edu/DOOR/index.php)

(optional)

-w Report all genes even if List or Rockhopper provided?

(default = No)

-y Exclude target predictions by only 1 method? (default = Yes)

Note: Does not apply to genes on List or significantly expressed

from Rockhopper

-z Skip sRNA-mRNA detection steps, and just re-analyze data [Yy]es

(default = No) (Run in the same directory & requires original

results files from each program)
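
When many runs are launched with different settings, it can help to assemble the command line in a small script. The following is a minimal sketch, not part of SPOT: the function name build_spot_command and its arguments are hypothetical, and only flags listed in the usage text above are used. It assumes that -b takes a negative offset for upstream positions (its default is -20), so matching the results filter to the search window means passing -b as the negative of -u.

```python
import shlex


def build_spot_command(srna_fasta, refseq_id, prefix="TEST",
                       upstream=60, downstream=60,
                       match_filter_to_search=False, extra=None):
    """Assemble a spot.pl argument list (hypothetical helper, not part of SPOT)."""
    cmd = ["spot.pl", "-r", srna_fasta, "-a", refseq_id, "-o", prefix,
           "-u", str(upstream), "-d", str(downstream)]
    if match_filter_to_search:
        # Usage note: set -b and -e to -u and -d to get all possible matches.
        # Assumption: -b is a negative offset, as suggested by its default of -20.
        cmd += ["-b", str(-upstream), "-e", str(downstream)]
    # Any further flags from the usage text are passed through unchanged.
    cmd += list(extra or [])
    return cmd


cmd = build_spot_command("sgrS.fasta", "NC_000913", prefix="relaxed",
                         upstream=150, downstream=100,
                         match_filter_to_search=True,
                         extra=["-c", "0.5 0.001 un"])
# Print a shell-quoted version that can be pasted after the $ prompt.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run, or the printed line can be used with screen or a queueing tool.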

Given how long SPOT runs can take, it is recommended to use a queueing tool on large distributed servers (qsub, slurm). Alternatively, on the AWS server, a laptop or other smaller computers, it is recommended to use screen so that jobs are not prematurely aborted when the user logs out.

$ screen -L spot.pl -r sgrS.fasta -a NC_000913

Four test datasets and precomputed output files are included in the folder

example_files. The following examples correspond to the four provided test

datasets.

test01 - Examine the entire E. coli str. K12 genome for SgrS sRNA target mRNAs. This folder contains only the sRNA sequence in a fasta file. The run uses each program's default seed size and significance settings and retrieves the genome sequence for E. coli from GenBank. The final option (-x) requests that an email be sent to the user after the job has completed.

$ cd test01

$ ls

sgrS.fasta

$ spot.pl -r sgrS.fasta -o stringent -a NC_000913 -x username@email.edu

=========Prepping RefSeq Files====================

[Thu Aug 23 23:14:27 UTC 2018]

...

test02 - Examine the E. coli str. K12 genome for SgrS sRNA target mRNA matches among a set of defined differentially expressed genes (sgrS_diff.txt). In this case the user has a genome fasta file (.fna) and a traditional GenBank protein translation table (PTT) file, which are used as local files (-g Y). The user also specifies a larger window, 150 nt upstream and 100 nt downstream of the CDS start position, in which to search for binding sites.

$ cd test02

$ ls

sgrS.fasta

sgrS_diff.txt

NC_000913.fna

NC_000913.ptt

$ spot.pl -r sgrS.fasta -l sgrS_diff.txt -u 150 -d 100 -c '0.5 0.001 un' -o relaxed -a NC_000913 -g Y

Note: PTT files can be easily generated in Excel, allowing for customization of gene annotations and subsequent analyses. The script fnaptt2gbk.pl, included with SPOT, can be used to generate GenBank files using the genome PTT and fasta files as inputs. Always make sure that Mac or DOS line breaks are converted into Unix line breaks.
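
Files saved from Excel often carry DOS (CRLF) or old Mac (CR) line breaks. The following is a minimal sketch of one way to convert them; the function names are hypothetical and not part of SPOT. It works on bytes so that no other characters in the file are altered.

```python
def to_unix_line_breaks(data: bytes) -> bytes:
    """Convert DOS (CRLF) and old Mac (CR) line breaks to Unix (LF)."""
    # Replace CRLF first so that a CRLF pair does not become two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at 'path' in place with Unix line breaks."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(to_unix_line_breaks(data))


# Example on an in-memory gene list with mixed line breaks.
example = b"b1101\tdown\r\nb3826\tup\r"
print(to_unix_line_breaks(example))
```

Calling convert_file on a .ptt file or gene list before starting a run applies the same conversion in place; keep a backup copy if the original matters.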

test03 - SPOT was designed to allow re-analysis of existing results. This example is run in a folder containing the results of test02’s search. Even though the region searched was 150 nt upstream and 100 nt downstream, the re-analysis eliminates any binding sites found outside of 50 nt upstream and 30 nt downstream. This search also does not use the list of differentially expressed genes.

$ cd test03

$ ls

sgrS.fasta

sgrS_diff.txt

NC_000913.fna

NC_000913.ptt

...

$ spot.pl -r sgrS.fasta -u 150 -d 100 -c '0.5 0.001 un' -o changed_50_30 -a NC_000913 -g Y -b -50 -e 30 -z Y

test04 - SPOT can also be run directly on a *_transcripts.txt file generated by the RNAseq analysis program Rockhopper (instead of a list as in example test02). In this example the default expression cutoffs are used; however, these can be specified by the user (-f, -q, -k). In addition, when provided a set of sRNA homologs (-m) and target genomes (-n), CopraRNA can be run. In these instances only genomes in RefSeq can be used; custom genome annotations cannot be utilized.

$ cd test04

$ ls

NC_000913_SgrS_transcripts.txt

sgrS.fasta

sgrS_homologs.fasta

$ spot.pl -r sgrS.fasta -t NC_000913_SgrS_transcripts.txt -o express -a NC_000913 -m sgrS_homologs.fasta -n 'NC_002695 NC_011740' -u 150 -d 100

When the jobs have completed, compare your results to the files in the corresponding _results folder.

5. Data input formats

sRNA fasta file – DNA sequence of sRNA in a standard fasta file. File extension does

not matter (.fasta, .fa, .fna, .frn, .ffn)

RefSeq ID – Standard RefSeq IDs can be used, and GenBank files (.gbff) will be retrieved using efetch. The program will retrieve additional replicons (e.g., plasmids) or scaffolds associated with the provided RefSeq ID; however, the search will only be carried out on the file whose name corresponds to the input RefSeq ID. By default the .gbff file is renamed to a .gb file, and .fna and .ptt files are generated.

Local Files – Different combinations of local files can be used. They all must have the same prefix and end in one of the following suffixes:

.fna          Genome fasta sequence
.ptt          Protein translation table (gene annotation)
.gb or .gbk   GenBank file

Files without these suffixes will be ignored. All must have Unix line breaks and the .ptt file must be tab separated. Allowed input combinations include (x = file present, - = file absent):

.fna   .ptt   .gb or .gbk   Status   Action
x      x      x             okay     Start run
x      x      -             okay     Make .gb file, start run
x      -      x             okay     Make .ptt file, start run
-      x      x             okay     Make .fna file, start run
-      -      x             okay     Make .fna and .ptt file, start run
x      -      -             bad      Abort run
-      x      -             bad      Abort run
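
The combinations above can be checked before starting a run. The following is a minimal sketch under one reading of the table: a run can start whenever a GenBank file is present, or when both the .fna and .ptt files are present, and any missing file is generated first. The function check_local_files is hypothetical and does not reproduce SPOT's own code.

```python
def check_local_files(suffixes):
    """Return (status, action) for a set of local file suffixes sharing one prefix."""
    present = set(suffixes)
    has_fna = ".fna" in present
    has_ptt = ".ptt" in present
    has_gb = ".gb" in present or ".gbk" in present

    if has_gb:
        # A GenBank file alone is enough; missing .fna/.ptt files can be made from it.
        missing = [name for name, ok in ((".fna", has_fna), (".ptt", has_ptt)) if not ok]
    elif has_fna and has_ptt:
        # Sequence plus annotation is enough to build the GenBank file.
        missing = [".gb"]
    else:
        # A lone .fna or .ptt file (or nothing at all) cannot be completed.
        return ("bad", "Abort run")

    if not missing:
        return ("okay", "Start run")
    return ("okay", "Make " + " and ".join(missing) + " file, start run")


for combo in ([".fna", ".ptt", ".gbk"], [".fna", ".ptt"], [".gb"], [".fna"]):
    print(combo, check_local_files(combo))
```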

.ptt Files – This is a legacy GenBank annotation format. However, the StarPicker algorithm used here requires this format. This format is very easy to generate in Excel and allows users of SPOT to customize their annotations. See example:

Escherichia coli str. K-12 substr. MG1655, complete genome - 1..4641652
4141 proteins
Location     Strand  Length  PID  Gene  Synonym  Code  COG  Product
190..255                          thrL  b0001               thr operon leader peptide
337..2799            820          thrA  b0002               Bifunctional aspartokinase
2801..3733           310          thrB  b0003               homoserine kinase
3734..5020           428          thrC  b0004               L-threonine synthase
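
A PTT file can also be written by a short script instead of Excel. The following is a minimal sketch: the function names are hypothetical, the strand values and the "-" placeholders for PID, Code and COG are assumptions made for illustration, and whether every search tool accepts such placeholders has not been verified here. The length is computed as the number of codons minus the stop codon, which reproduces the Length values in the example above.

```python
PTT_COLUMNS = ["Location", "Strand", "Length", "PID", "Gene",
               "Synonym", "Code", "COG", "Product"]


def ptt_length(start, stop):
    """Protein length in amino acids: codons in the CDS minus the stop codon."""
    return (stop - start + 1) // 3 - 1


def format_ptt(title, genes):
    """Build tab separated PTT text from (start, stop, strand, gene, locus, product)."""
    lines = [title, f"{len(genes)} proteins", "\t".join(PTT_COLUMNS)]
    for start, stop, strand, gene, locus, product in genes:
        # "-" marks fields left empty in this sketch (PID, Code, COG): an assumption.
        lines.append("\t".join([f"{start}..{stop}", strand,
                                str(ptt_length(start, stop)), "-",
                                gene, locus, "-", "-", product]))
    # Unix line breaks, as required for SPOT input files.
    return "\n".join(lines) + "\n"


genes = [
    (337, 2799, "+", "thrA", "b0002", "Bifunctional aspartokinase"),
    (2801, 3733, "+", "thrB", "b0003", "homoserine kinase"),
]
ptt_text = format_ptt("Escherichia coli str. K-12 substr. MG1655, "
                      "complete genome - 1..4641652", genes)
print(ptt_text)
```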

Note: As indicated above, customization of PTT files allows users to correct or change annotations based on new data. Furthermore, by modifying PTT files, RNAs can be included in the annotation. First, this allows sRNA-RNA interactions to be identified. Second, this approach was used in the manuscript to perform a ‘reverse’ search. For a ‘reverse’ search the PTT file is edited to include ONLY the known sRNAs. Then, the user supplies the UTR or putative sRNA binding region to SPOT as a fasta file, as if it were the sRNA. ‘Reverse’ searches cannot use CopraRNA, and because sRNAs do not have GI numbers and may not have GeneIDs, no functional enrichment plots will be produced. This may result in several warnings when SPOT is run; however, it should not influence the final composite predictions.

Differentially expressed genes – Lists of differentially expressed genes can be formatted as tab separated files in one of two ways. Do not include a header line; the column names below are shown only for reference.

Simple:

Locus   Expression
b1101   down
b3826   up

With known binding sites:

Locus   Expression   sRNA_start   sRNA_stop   mRNA_start   mRNA_stop
b1101   down         168          187         -30          -9
b3826   up           168          187         -96          -76
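
A gene list can be checked for formatting problems before it is passed to SPOT with -l. The following is a minimal sketch and not SPOT's own parser: the function parse_gene_list is hypothetical, and it assumes that each line has either 2 fields (locus, up/down) or 6 fields (locus, up/down, then the four binding coordinates), separated by tabs. Both layouts are accepted within one file, as in the -l usage example.

```python
def parse_gene_list(text):
    """Parse a tab separated gene list into (locus, direction, coords) tuples."""
    records = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip empty lines such as the one after the final line break
        fields = line.split("\t")
        if len(fields) not in (2, 6):
            raise ValueError(f"line {number}: expected 2 or 6 tab separated "
                             f"fields, got {len(fields)}")
        locus, direction = fields[0], fields[1]
        if direction not in ("up", "down"):
            # A stray carriage return from DOS line breaks is also caught here.
            raise ValueError(f"line {number}: expression must be 'up' or "
                             f"'down', got {direction!r}")
        # sRNA_start, sRNA_stop, mRNA_start, mRNA_stop when a site is known.
        coords = tuple(int(value) for value in fields[2:]) or None
        records.append((locus, direction, coords))
    return records


example = "b1101\tdown\t168\t187\t-30\t-9\nb3826\tup\n"
for record in parse_gene_list(example):
    print(record)
```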

Rockhopper *_transcripts.txt files – SPOT can read the default output files of Rockhopper from simple pairwise RNAseq experiments. Files generated with the verbose output option in Rockhopper cannot be read. Files should have 12 columns, including the normalized expression values for the treatment and control, the q values and the estimated fold-change.

sRNA Multisequence Fasta Files – If running CopraRNA, sRNA files must conform to

expectations of the CopraRNA program:

1. RNA sequence must have Us instead of Ts

2. The sequence names must correspond to the individual genome RefSeq IDs

3. Must include the focal genome sRNA sequence as well
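
The three requirements above can be applied to an existing DNA fasta file with a short script. The following is a minimal sketch: the function prepare_copra_fasta is hypothetical, it assumes that the first word of each fasta header must exactly equal a RefSeq ID, and the sequences in the example are placeholders rather than real sRNA sequences.

```python
def prepare_copra_fasta(fasta_text, genome_ids):
    """Return fasta text with T replaced by U, after checking the sequence names."""
    names, out = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            words = line[1:].split()
            names.append(words[0] if words else "")
            out.append(line)
        elif line.strip():
            # Sequence line: convert DNA letters to RNA letters, keeping case.
            out.append(line.strip().replace("T", "U").replace("t", "u"))
    # Every genome, including the focal genome, needs its own sRNA sequence.
    missing = [genome for genome in genome_ids if genome not in names]
    if missing:
        raise ValueError("no sRNA sequence for: " + ", ".join(missing))
    return "\n".join(out) + "\n"


dna = ">NC_000913\nGATTACA\n>NC_002695\nggttaa\n"  # placeholder sequences
print(prepare_copra_fasta(dna, ["NC_000913", "NC_002695"]))
```

The list passed as genome_ids should contain the focal genome given with -a as well as the genomes given with -n.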

6. Data output formats

Data from each individual algorithm is preserved in the output folder for manual

investigation.

TargetRNA2_*txt = TargetRNA2 Primary report

*.output = Starpicker Primary report

intarna_websrv_table_truncated.csv = IntaRNA Primary report

*_ hIntaRNA.csv = CopraRNA Primary report

SPOT generates several output files for further analysis:

XLSX file – Primary file containing consensus table of sRNA-mRNA predictions from the

3 or 4 tools used in the run. File name prefix corresponds to run output prefix that was

assigned (-o , default= TEST).

• Sheet 1 (complete.txt) shows the aligned predictions, p values, and coordinates

for the predicted interaction for each gene.

• Sheet 2 (summary.txt) has the counts of predictions for each gene, and a summary letter and ranking based on location and on the number of algorithms that found the same prediction.

A Prediction overlaps a known binding site

E Predictions that are not coincident with a known binding site when one was

provided for that gene. Shared letters overlap the same site.

I Predictions when no known binding site was provided. Shared letters overlap the

same site.
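
The three letter classes can be illustrated with a simple interval comparison. The following is a minimal sketch and not SPOT's implementation: the function names are hypothetical, it assumes that "overlaps" means the predicted and known mRNA intervals share at least one position, and it does not attempt the ranking or the grouping of predictions that share a letter.

```python
def overlaps(a_start, a_stop, b_start, b_stop):
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_stop and b_start <= a_stop


def summary_class(predicted, known=None):
    """Classify a predicted mRNA site as 'A', 'E' or 'I'.

    predicted, known: (start, stop) relative to the start site; known may be None.
    """
    if known is None:
        return "I"  # no known binding site was provided for this gene
    if overlaps(predicted[0], predicted[1], known[0], known[1]):
        return "A"  # prediction overlaps the known binding site
    return "E"      # a known site exists, but the prediction lies elsewhere


known_site = (-30, -9)  # mRNA_start, mRNA_stop from the b1101 gene list example
print(summary_class((-28, -12), known_site))
print(summary_class((40, 55), known_site))
print(summary_class((40, 55)))
```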

*_summary.pdf file – This file has an R-generated plot that corresponds to Sheet 2 (summary.txt) and can be imported into Illustrator.

COLLATED_RESULTS folder – This folder contains plots, generated using IntaRNA tools, that show the localization of binding sites on the mRNA and sRNA as *pdf, *png and *ps files. In addition, a functional enrichment heatmap is included as a *pdf file, similar to those individually provided by IntaRNA and CopraRNA; however, it represents the collated results.

Example result files for the sRNA SgrS and corresponding test datasets are available

with the SPOT software distribution.

b4184-yjfL

b1312-ycjP

b1785-yeaI

b0128-yadH

b4036-lamB

b0810-glnP

b2128-yehW

b2983-yghQ

b1101-ptsG

b2966-yqgA

123

1:2.34topologicaldomain:Periplasmic

2:2.08topologicaldomain:Cytoplasmic

3:1.7transmembraneregion

group1:1.35

7. Setting up an AWS account

Navigate to the new account setup page:

https://portal.aws.amazon.com/billing/signup#/start

For now, set your home region to “US East (N. Virginia)”; you can switch this later as necessary.

Note that you will need a credit card number to set up an account.

Education credit - Depending on your application, it may be possible to apply for education credits to defray the cost of the AWS server time:

https://aws.amazon.com/education/awseducate/

8. Setting up personal AWS interface on your laptop

People with Macs:

Terminal is already installed in /Applications/Utilities

Download & Install XQuartz if not already installed http://xquartz.macosforge.org/landing/

Download & Install Cyberduck https://cyberduck.io/?l=en

People with PCs:

Download & Install PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Download & Install Xming http://sourceforge.net/project/downloading.php?group_id=156984&filename=Xming-6-9-0-31-setup.exe

How to set up Xming: http://www.geo.mtu.edu/geoschem/docs/putty_install.html

Download & Install WinSCP http://winscp.net/eng/download.php

Download & Install Cyberduck https://cyberduck.io/?l=en

9. Starting an AWS instance

For in-depth instructions regarding starting an AWS instance please see:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/launching-instance.html

1. After creating and logging into your AWS account, find your way to the EC2 (Elastic Compute Cloud) page. You can find it under the “Services” menu in the upper left-hand corner of the page:

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Home:

2. Make sure your home region is “US East (N. Virginia)”. Your region is indicated in the upper right-hand corner of the page.

3. On the left-hand side bar under “IMAGES” select “AMIs”

4. In the search bar switch from “Owned by me” to “Public images” and search for

“SPOTv1”

5. Select the blue “Launch” button

6. Now you are on AWS “Step 2: Choose an Instance Type” – Select your computer:

t2.micro is the only free option; however, it is limited to 1 GiB of RAM, 1 processor and 30 GiB of storage, and is very slow.

m5.2xlarge has 8 virtual processors and 32 GiB of RAM.

7. Select “Next: Configure Instance Details” button on bottom-right

8. On “Step 3: Configure Instance Details” page – leave defaults as-is

9. Select the “Next: Add Storage” button on bottom-right, to move to the next step

10. On the “Step 4: Add Storage” page adjust the local disk size to 30 GiB

11. Select the “Next: Add Tags” button on bottom-right, to move to the next step

12. On the “Step 5: Add Tags” page optionally hit the “Add Tag” button OR skip to step 14

13. For example, add a key = “Name” and value = “my-SPOT” or “SPOT-server”

14. Select the “Next: Configure Security Group” button on bottom-right, to move to the

next step

15. On “Step 6: Configure Security Group” page – leave defaults as-is

**Note: For now, we will ignore the Warning. In the future consider making your

instances harder to access by non-users in your lab/group**

16. Select the “Review and Launch” button on bottom-right, to move to the next step

17. You can inspect the settings before hitting the “Launch” button. As before ignore

warnings.

18. Now it asks you to select or create a key pair.

19. You will need to download the key and save it to a private location on your computer

(e.g., the folder ~/.ssh/) .

20. From here you can navigate using the left-hand side bar to your “Instances”

21. Instance state will be “Initializing” until the computer has “booted” up.

22. Once the Instance state switches to “running” and you select the instance, details of

the instance will be shown below.

23. Find and copy the “IPv4 Public IP” address for your instance. You will use this to log into your instance.

10. Logging into your AWS instance

To log into the server you will need your:

1. Private ssh key yourid_key.pem

2. username = first name and last initial as one word (e.g., Jane Doe = janed)

3. XX-XX-XX-XX = Your specific IPv4 Public IP from above

$ ssh -Y -i ~/.ssh/yourid_key.pem username@XX-XX-XX-XX

a. Open PuTTY

b. Under Category, click on SSH > Auth

c. Click browse

d. Find your private key (yourid_key.pem) and select it

e. Under Category, click Session and input the address of your EC2 instance (XX-XX-XX-XX) in the "host name" box

f. Type "SPOT" in the box under saved sessions and click save.

g. Double-click on the "SPOT" that appears under saved sessions.

h. Log in with your username. Your key should be used automatically.

i. For future logins, just double-click the "SPOT" saved session.

Once logged in you will find yourself on the command line interface.

11. References

Busch A, Richter AS, Backofen R. 2008. IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics 24:2849-2856.

Kery MB, Feldman M, Livny J, Tjaden B. 2014. TargetRNA2: identifying targets of small regulatory RNAs in Bacteria. Nucleic Acids Research 42:W124-129.

King AM, Vanderpool CK, Degnan PH. sRNA-target Prediction Organizing Tool (SPOT) integrates computational and experimental data to facilitate functional characterization of bacterial small RNAs.

McClure R, Balasubramanian D, Sun Y, Bobrovskyy M, Sumby P, Genco CA, Vanderpool CK, Tjaden B. 2013. Computational analysis of bacterial RNA-Seq data. Nucleic Acids Research 41:e140.

Wright PR, Georg J, Martin M, Sorescu DA, Richter AS, Lott S, Kleinkauf R, Hess WR, Backofen R. 2014. CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains. Nucleic Acids Research 42:W119-W123.

Ying X, Cao Y, Wu J, Liu Q, Cha L, Li W. 2011. sTarPicker: a method for efficient prediction of bacterial sRNA targets based on a two-step model for hybridization. PLoS One 6:e22705.
