Workflow_instructions.dvi Workflow Instructions

User Manual:

Open the PDF directly: View PDF .
Page Count: 4

Instructions for the proteomaps workﬂow - for internal use

1 Overview of the proteomaps workﬂow

The fastest and easiest way to generate proteomaps is to use the website

http://www.protecs.uni-greifswald.de/bionic-vis/index.php

Instructions (in particular, regarding the choice of protein identiﬁers in the input ﬁle) can be found on the web-

site. However, the website works only for standard cases (organisms for which proteomaps have been established

already), and does not provide all the output ﬁles that we need to include new maps into our main website

http://www.proteomaps.net/. In all other cases, a number of python and matlab scripts (“proteomaps work-

ﬂow”) need to be run. This text gives some basic information about how to prepare data ﬁles for the workﬂow.

Currently, generating proteomaps involves the following (partially manual) steps:

1. Only for new organisms: prepare organism ﬁles (Wolf, with input data provided by user - see section 3 and

2. Prepare processed ﬁles for paver (Wolf, with input data from user - see section 2)

3. Use the paver software to make proteomaps (Dan, using the processed ﬁles)

4. If desired, put proteomaps on website (Wolf, using the paver output ﬁles)

Files used in the workﬂow are stored in diﬀerent places:

1. General method description: http://www.proteomaps.net/methods.html

2. Selection of data ﬁles: http://www.proteomaps.net/download.html

3. Github project proteomaps-workﬂow with public data ﬁles:

https://github.com/liebermeister/proteomaps-workflow

2 How to prepare data ﬁles for the proteomaps workﬂow

In this section, we assume that proteomaps for an organism have already been established, and that we want to

generate a proteomap with a new data set for this organism. Proteomaps are generated by the paver software,

and input ﬁles for paver can be generated by a python script that preprocesses the original proteome data. This

script expects the original data to be given in a ﬁxed table format. The input ﬁle is a table in tab-separated

.csv format. It contains two columns, one containing the protein identiﬁers (see below), and one containing the

abundance data (numbers). For clarity, the ﬁle can also contain a header row with extra information, and a row

with the column headers. These two rows are marked, respectively, by “!!” and “!”. Here is an example (for

E. coli proteins):

!!SBtab TableType=’Proteomaps’ Organism=’eco’

!ProteinIdentifier !Abundance

b0878 7

b0948 12

b3933 12

.. ..

The ﬁle attributes in the ﬁrst row (such as “Organism” in the example) contain additional information. Rec-

ommended possible attributes are “Organism”, “Condition”, “Tissue”, “ValueType”, “Unit”, and “Title”. The

leading rows starting with “!” are optional and are currently not evaluated by the python script; thus, it is also

possible to simply write

b0878 7

b0948 12

b3933 12

.. ..

In any case, it is the order of columns that counts. The protein identiﬁers (in the ﬁrst column) must be chosen

according to the instructions given in “How to add a new organism to the proteomaps workﬂow”. The abundance

data (second column) must represent protein molecule count numbers. Do not use data that correspond to

mass-weighted count numbers (mass weighting will be done anyway in the workﬂow). Do not use logarithmic

values.

Several data sets can be provided together, but must be given as separate ﬁles. For each ﬁle, the following

information needs to be given:

1. The organism (three-letter) shortname

2. A short name of the experiment or paper corresponding to the data set, to be used within ﬁle names (e.g.,

Valgepea 2013 48)

3. A human-readable name of the experiment or paper corresponding to the data set (e.g. “Escherichia coli

(µ=0.48/h) - Valgepea et al. (2013)”)

3 How to add protein annotations

In this section, we assume that proteomaps for an organism have been established, but that some proteins

have wrong or missing annotations, If you notice that proteins in proteomaps are wrongly placed in the func-

tional hierarchy or not mapped at all, you can change this by adding new protein annotations. To do this, you

need to ﬁnd the proteins to be annotated (or reannotated, which does not make a diﬀerence here), their pro-

tein identiﬁers, protein names, and KO numbers, as well as the desired pathway annotations, and put all this

information into a table ﬁle. The ﬁle format for your additional annotations follows the format of the ﬁle ge-

nomic data/KO gene hierarchy/KO gene hierarchy changes.csv in the github project proteomaps-workﬂow. The

columns are as follows:

Organism shortname

Protein identiﬁer

Protein short name

KO number

Pathway name

Protein long name

Comment or provenance information

Example (human hemoglobin Hba2, assigned to the pathway “Hemoglobin”):

hsa P69905 Hba2 K13822 Hemoglobin Hemoglobin, alpha 2 http://ghr.nlm.nih.gov/gene/HBA2

The ﬁrst three entries are mandatory. The “KO number entry” can be left blank. Finding the right pathway

name is a bit involved. You may start by checking the pathway given on KEGG’s website for that gene.

Since our pathway deﬁnitions diﬀers from the original KEGG pathway deﬁnitions, please make sure you adhere

to our (not KEGG’s) naming scheme. To see our list of pathways, you can browse the hierarchy tree ﬁle (at

http://www.proteomaps.net/download.html, link “ Hierarchy tree (levels 1-3)”). In your annotations, you

can only refer to the lowest-level categories. Please make sure that you use the right spelling (one mistake, and

the entry will not be recognized by the program).

When your ﬁle is ready, please send it to Wolf. If you think that a pathway should be added to the hierarchy or

that the hierarchy should be restructured, please contact Wolf.

A practical way to proceed is to focus on not-mapped, highly expressed proteins. To do so, have a look at

the preprocessed version of your proteome data ﬁle (which contains, for every protein, the current pathway

annotation). If you sort the proteins by abundance, you will see which proteins deserve new annotations.

In some cases, you will notice that a protein has an entry in KEGG, but does not carry an annotation in the

proteomaps ﬁles. Usually, the reason is that in KEGG (i) no gene name is assigned or (ii) not pathway is assigned

to this protein (the entry in KEGG’s ﬁeld “Deﬁnition” does not serve as a gene name nor pathway).

4 How to add a new organism to the proteomaps workﬂow

In this section, we assume that for a certain organism, proteomaps have not been established yet. Currently the

workﬂow can only handle the organisms from the following list. For each of them, a speciﬁc type of protein

identiﬁers must be used (the examples shown refer to enolase proteins):

Organism org ID type Example Example URL Estimated protein count

Mycoplasma pneumoniae mpn Locus tag MPN606 www.ncbi.nlm.nih.gov/gene/?term=MPN606 50000

Mycobacterium tuberculosis mtu Locus tag Rv1023 www.genome.jp/dbget-bin/www bget?mtu: 1500000

Escherichia coli eco Locus tag b2779 www.genome.jp/dbget-bin/www bget?eco:b2779 3000000

Synechococcus elongatus sp. PCC7942 syf CyanoBase Synpcc7942 0639 genome.microbedb.jp/cyanobase/SYNPCC7942/genes/slr0752 3000000

Synechocystis sp. 6803 syn CyanoBase slr0752 genome.microbedb.jp/cyanobase/Synechocystis/genes/slr0752 3000000

Saccharomyces cerevisiae sce Locus tag YHR174W identifiers.org/sgd/YHR174W 100000000

Schizosaccharomyces pombe spo PomBase SPBC1815.01 www.pombase.org/spombe/result/SPBC1815.01 300000000

Arabidopsis thaliana ath Tair AT1G74030.1 www.arabidopsis.org/servlets/TairObject?type=locus&name=AT1G74030.1 10000000000

Drosophila melanogaster dme FlyBase CG17654 www.genome.jp/dbget-bin/www bget?dme:Dmel CG17654 10000000000

Mus musculus mmu NCBI Gene Id 13806 www.ncbi.nlm.nih.gov/gene/13806 10000000000

Pan troglodytes hsa UniProt P06733 identifiers.org/uniprot/P06733 10000000000

Homo sapiens hsa UniProt P06733 identifiers.org/uniprot/P06733 10000000000

The list is given as a ﬁle in github project proteomaps-workﬂow (ﬁle genomic data/KO gene hierarchy/organisms.csv)

and on the proteomaps website (http://www.proteomaps.net/download.html, link “Organism information”) To

add a new organism to the list, the necessary information need to be prepared, and a ﬁle with protein lengths

must be provided.

As an example, assume that you wanted to add E. coli as a new organism (which, in fact, is already included). For

the example, consider the KEGG page http://www.genome.jp/dbget-bin/www bget?eco:b3124 for enolase

in E. coli.

1. Find out KEGG organism shortname (three letters)). On the KEGG page, it appears in the ﬁeld Organism:

(in the example, eco for Escherichia coli K-12 MG1655). Make sure that not only the organism, but also

the strain matches the one for which you want to generate proteomaps. Choose the organism name (e.g.,

“Escherichia coli”) to be used in proteomaps (which can, but does not have to coincide with the name used

in KEGG).

2. Find out the type of protein identiﬁers used in KEGG for this organism. On the KEGG page, it appears ﬁrst

in the ﬁeld Entry: (in the example, b2779 for enolase. Make sure that your protein data carry the same

type of protein identiﬁers.

3. Find the protein identiﬁer for enolase (which we use as an example case)

4. Find a reference database that provides URLs for these protein identiﬁers (see table above). In the case of

E. coli, we simply use KEGG itself: www.genome.jp/dbget-bin/www bget?eco:; in other cases, we use

an organism-speciﬁc database, e.g. genome.microbedb.jp/cyanobase/ for cyanobacteriae).

5. Find out an (estimated) protein count number per cell; in case of doubt, contact Ron.

6. Prepare a table with protein lengths for these proteins. Note that length information refers to PROTEINS

and not to mRNA! Important: unlike other tables (which use protein identiﬁers as keys), this table must

use protein shortnames (e.g., “eno”) as keys (in a column called Protein:Name). The format is:

Protein:Name Protein:Size

eno 432

.. ..

Information about protein lengths can be obtained from uniprot, but due to gene name conversions, there

may be some loss of information.

7. Send all this information, as well as a test protein data set (see instruction “How to prepare input data”),

to Wolf; he prepares the input ﬁles for paver, and Dan makes the ﬁrst picture.

8. Be aware that heavy reannotating is usually necessary after the ﬁrst test pictures.

5 Contact

wolfram.liebermeister@gmail.com

Workflow_instructions.dvi Workflow Instructions

Navigation menu

Versions of this User Manual:

Views

Navigation