Instructions for the proteomaps workflow - for internal use

1 Overview of the proteomaps workflow

The fastest and easiest way to generate proteomaps is to use the website
http://www.protecs.uni-greifswald.de/bionic-vis/index.php
Instructions (in particular, regarding the choice of protein identifiers in the input file) can be found on the website. However, the website works only for standard cases (organisms for which proteomaps have already been established) and does not provide all the output files that we need to add new maps to our main website,
http://www.proteomaps.net/. In all other cases, a number of Python and MATLAB scripts (the “proteomaps workflow”) need to be run. This text gives some basic information about how to prepare data files for the workflow.
Currently, generating proteomaps involves the following (partially manual) steps:
1. Only for new organisms: prepare organism files (Wolf, with input data provided by user - see sections 3 and 4)
2. Prepare processed files for paver (Wolf, with input data from user - see section 2)
3. Use the paver software to make proteomaps (Dan, using the processed files)
4. If desired, put proteomaps on website (Wolf, using the paver output files)
Files used in the workflow are stored in different places:
1. General method description: http://www.proteomaps.net/methods.html
2. Selection of data files: http://www.proteomaps.net/download.html
3. Github project proteomaps-workflow with public data files:
https://github.com/liebermeister/proteomaps-workflow

2 How to prepare data files for the proteomaps workflow

In this section, we assume that proteomaps for an organism have already been established, and that we want to
generate a proteomap with a new data set for this organism. Proteomaps are generated by the paver software,
and input files for paver can be generated by a python script that preprocesses the original proteome data. This
script expects the original data to be given in a fixed table format. The input file is a table in tab-separated
.csv format. It contains two columns, one containing the protein identifiers (see below), and one containing the
abundance data (numbers). For clarity, the file can also contain a header row with extra information, and a row
with the column headers. These two rows are marked, respectively, by “!!” and “!”. Here is an example (for
E. coli proteins):

!!SBtab TableType='Proteomaps'   Organism='eco'
!ProteinIdentifier               !Abundance
b0878                            7
b0948                            12
b3933                            12
..                               ..

The attributes in the first row (such as “Organism” in the example) carry additional information about the file. Recommended attributes are “Organism”, “Condition”, “Tissue”, “ValueType”, “Unit”, and “Title”. The
leading rows starting with “!” are optional and are currently not evaluated by the Python script; thus, it is also
possible to simply write
b0878                            7
b0948                            12
b3933                            12
..                               ..

In any case, it is the order of columns that counts. The protein identifiers (in the first column) must be chosen
according to the instructions given in “How to add a new organism to the proteomaps workflow”. The abundance
data (second column) must represent protein molecule count numbers. Do not use data that correspond to
mass-weighted count numbers (mass weighting will be done anyway in the workflow). Do not use logarithmic
values.
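
The format rules above can be checked before a file is handed over. The following is a minimal sketch in Python, not the preprocessing script of the workflow itself; the function name read_abundance_table is hypothetical, and the check for negative values is an added assumption (a negative number cannot be a molecule count and may indicate logarithmic data).

```python
import csv
import io


def read_abundance_table(handle):
    """Read a two-column, tab-separated proteomaps input table.

    Rows whose first field starts with "!" (including "!!") are treated
    as optional header rows and skipped. Hypothetical helper, not part
    of the proteomaps workflow.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("!"):
            continue  # blank line or optional header row
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        identifier = row[0].strip()
        value = float(row[1])  # raises ValueError for non-numeric entries
        if value < 0:
            # Assumption of this sketch: negative values are rejected,
            # since they cannot be molecule counts.
            raise ValueError(f"negative abundance for {identifier}: {value}")
        records.append((identifier, value))
    return records


example = (
    "!!SBtab TableType='Proteomaps'\tOrganism='eco'\n"
    "!ProteinIdentifier\t!Abundance\n"
    "b0878\t7\n"
    "b0948\t12\n"
    "b3933\t12\n"
)
table = read_abundance_table(io.StringIO(example))
print(table)
```

Only the column order is used, in line with the rule that the order of columns is what counts; the header names are ignored.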
Several data sets can be provided together, but must be given as separate files. For each file, the following
information needs to be given:
1. The organism (three-letter) shortname
2. A short name of the experiment or paper corresponding to the data set, to be used within file names (e.g.,
Valgepea 2013 48)
3. A human-readable name of the experiment or paper corresponding to the data set (e.g. “Escherichia coli
(µ=0.48/h) - Valgepea et al. (2013)”)

3 How to add protein annotations

In this section, we assume that proteomaps for an organism have been established, but that some proteins
have wrong or missing annotations. If you notice that proteins in proteomaps are wrongly placed in the functional hierarchy or not mapped at all, you can change this by adding new protein annotations. To do this, you
need to find the proteins to be annotated (or reannotated, which makes no difference here), their protein identifiers, protein names, and KO numbers, as well as the desired pathway annotations, and put all this
information into a table file. The file format for your additional annotations follows the format of the file genomic data/KO gene hierarchy/KO gene hierarchy changes.csv in the GitHub project proteomaps-workflow. The
columns are as follows:
Organism shortname
Protein identifier
Protein short name
KO number
Pathway name
Protein long name
Comment or provenance information
Example (human hemoglobin Hba2, assigned to the pathway “Hemoglobin”):

hsa   P69905   Hba2   K13822   Hemoglobin   Hemoglobin, alpha 2   http://ghr.nlm.nih.gov/gene/HBA2

The first three entries are mandatory. The “KO number” entry can be left blank. Finding the right pathway
name is a bit involved. You may start by checking the pathway given on KEGG’s website for that gene.
Since our pathway definitions differ from the original KEGG pathway definitions, please make sure you adhere
to our (not KEGG’s) naming scheme. To see our list of pathways, you can browse the hierarchy tree file (at
http://www.proteomaps.net/download.html, link “Hierarchy tree (levels 1-3)”). In your annotations, you
can only refer to the lowest-level categories. Please make sure that you use the right spelling (a single mistake, and
the entry will not be recognized by the program).
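
Because a single misspelled pathway name makes an entry unusable, it can help to check each row before sending the file. The sketch below is an illustration only: the function check_annotation_row and the set known_pathways are hypothetical, and in practice the set would have to be filled by hand from the lowest-level categories of the hierarchy tree file.

```python
COLUMNS = [
    "Organism shortname",
    "Protein identifier",
    "Protein short name",
    "KO number",
    "Pathway name",
    "Protein long name",
    "Comment or provenance information",
]


def check_annotation_row(row, known_pathways):
    """Return a list of problems found in one annotation row (empty if none).

    Hypothetical helper: `known_pathways` is assumed to hold the
    lowest-level category names of the proteomaps hierarchy.
    """
    if len(row) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} columns, got {len(row)}"]
    problems = []
    # The first three entries are mandatory.
    for name, value in zip(COLUMNS[:3], row[:3]):
        if not value.strip():
            problems.append(f"missing mandatory entry: {name}")
    # Pathway names must match exactly, including spelling and case.
    pathway = row[4]
    if pathway and pathway not in known_pathways:
        problems.append(f"unknown pathway name: {pathway!r}")
    return problems


known_pathways = {"Hemoglobin"}  # stand-in for the real category list
row = ["hsa", "P69905", "Hba2", "K13822", "Hemoglobin",
       "Hemoglobin, alpha 2", "http://ghr.nlm.nih.gov/gene/HBA2"]
print(check_annotation_row(row, known_pathways))
```

The comparison is deliberately exact rather than fuzzy, since the text states that misspelled names are not recognized.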
When your file is ready, please send it to Wolf. If you think that a pathway should be added to the hierarchy or
that the hierarchy should be restructured, please contact Wolf.
A practical way to proceed is to focus on not-mapped, highly expressed proteins. To do so, have a look at
the preprocessed version of your proteome data file (which contains, for every protein, the current pathway
annotation). If you sort the proteins by abundance, you will see which proteins deserve new annotations.
In some cases, you will notice that a protein has an entry in KEGG, but does not carry an annotation in the
proteomaps files. Usually, the reason is that in KEGG (i) no gene name is assigned or (ii) no pathway is assigned
to this protein (the entry in KEGG’s field “Definition” serves neither as a gene name nor as a pathway).
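
The prioritisation described above (sort by abundance, then look at the proteins that are not mapped) can be sketched as follows. The layout of the preprocessed file is not specified in this text, so the sketch works on an in-memory list of (identifier, abundance, pathway) tuples; the function name, the use of an empty string for “not mapped”, and the identifiers and pathway name in the example are all made up for illustration.

```python
def unmapped_by_abundance(records, top=10):
    """Return the most abundant proteins that lack a pathway annotation.

    `records` is a list of (identifier, abundance, pathway) tuples.
    Assumption of this sketch: an empty pathway string means "not mapped".
    """
    unmapped = [record for record in records if not record[2]]
    return sorted(unmapped, key=lambda record: record[1], reverse=True)[:top]


# Made-up example data, not taken from any real data set.
records = [
    ("protA", 1200.0, "Glycolysis"),
    ("protB", 900.0, ""),
    ("protC", 15.0, ""),
    ("protD", 4000.0, ""),
]
candidates = unmapped_by_abundance(records, top=2)
print(candidates)
```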

4 How to add a new organism to the proteomaps workflow

In this section, we assume that for a certain organism, proteomaps have not been established yet. Currently the
workflow can only handle the organisms from the following list. For each of them, a specific type of protein
identifiers must be used (the examples shown refer to enolase proteins):
Organism                             org  ID type       Example          Example URL                                                          Estimated protein count
Mycoplasma pneumoniae                mpn  Locus tag     MPN606           www.ncbi.nlm.nih.gov/gene/?term=MPN606                               50000
Mycobacterium tuberculosis           mtu  Locus tag     Rv1023           www.genome.jp/dbget-bin/www_bget?mtu:                                1500000
Escherichia coli                     eco  Locus tag     b2779            www.genome.jp/dbget-bin/www_bget?eco:b2779                           3000000
Synechococcus elongatus sp. PCC7942  syf  CyanoBase     Synpcc7942_0639  genome.microbedb.jp/cyanobase/SYNPCC7942/genes/slr0752               3000000
Synechocystis sp. 6803               syn  CyanoBase     slr0752          genome.microbedb.jp/cyanobase/Synechocystis/genes/slr0752            3000000
Saccharomyces cerevisiae             sce  Locus tag     YHR174W          identifiers.org/sgd/YHR174W                                          100000000
Schizosaccharomyces pombe            spo  PomBase       SPBC1815.01      www.pombase.org/spombe/result/SPBC1815.01                            300000000
Arabidopsis thaliana                 ath  Tair          AT1G74030.1      www.arabidopsis.org/servlets/TairObject?type=locus&name=AT1G74030.1  10000000000
Drosophila melanogaster              dme  FlyBase       CG17654          www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG17654                    10000000000
Mus musculus                         mmu  NCBI Gene Id  13806            www.ncbi.nlm.nih.gov/gene/13806                                      10000000000
Pan troglodytes                      hsa  UniProt       P06733           identifiers.org/uniprot/P06733                                       10000000000
Homo sapiens                         hsa  UniProt       P06733           identifiers.org/uniprot/P06733                                       10000000000

The list is given as a file in the GitHub project proteomaps-workflow (file genomic data/KO gene hierarchy/organisms.csv)
and on the proteomaps website (http://www.proteomaps.net/download.html, link “Organism information”). To
add a new organism to the list, the necessary information needs to be prepared, and a file with protein lengths
must be provided.
As an example, assume that you want to add E. coli as a new organism (which, in fact, is already included). For
the example, consider the KEGG page http://www.genome.jp/dbget-bin/www_bget?eco:b2779 for enolase
in E. coli.
1. Find out the KEGG organism shortname (three letters). On the KEGG page, it appears in the field Organism:
(in the example, eco for Escherichia coli K-12 MG1655). Make sure that not only the organism, but also
the strain matches the one for which you want to generate proteomaps. Choose the organism name (e.g.,
“Escherichia coli”) to be used in proteomaps (which can, but does not have to, coincide with the name used
in KEGG).
2. Find out the type of protein identifiers used in KEGG for this organism. On the KEGG page, it appears first
in the field Entry: (in the example, b2779 for enolase). Make sure that your protein data carry the same
type of protein identifiers.
3. Find the protein identifier for enolase (which we use as an example case).
4. Find a reference database that provides URLs for these protein identifiers (see table above). In the case of
E. coli, we simply use KEGG itself: www.genome.jp/dbget-bin/www_bget?eco:; in other cases, we use
an organism-specific database (e.g., genome.microbedb.jp/cyanobase/ for cyanobacteria).
5. Find out an (estimated) protein count number per cell; in case of doubt, contact Ron.
6. Prepare a table with protein lengths for these proteins. Note that length information refers to PROTEINS
and not to mRNA! Important: unlike other tables (which use protein identifiers as keys), this table must
use protein shortnames (e.g., “eno”) as keys (in a column called Protein:Name). The format is:
Protein:Name   Protein:Size
eno            432
..             ..

Information about protein lengths can be obtained from UniProt, but because gene names have to be converted, some
proteins may be lost along the way.
7. Send all this information, as well as a test protein data set (see section 2, “How to prepare data files for the proteomaps workflow”),
to Wolf; he prepares the input files for paver, and Dan makes the first picture.
8. Be aware that extensive reannotation is usually necessary after the first test pictures.
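
Steps 4 and 6 above can be illustrated with a short Python sketch. It is not part of the workflow: the function names write_length_table and example_url are hypothetical, and writing the length table as tab-separated text is an assumption carried over from the input file format described in section 2.

```python
import csv
import io


def write_length_table(lengths, handle):
    """Write protein lengths keyed by protein short name.

    `lengths` maps short names (e.g. "eno") to lengths in amino acids.
    Tab-separated output is an assumption of this sketch.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Protein:Name", "Protein:Size"])
    for name, size in sorted(lengths.items()):
        if int(size) <= 0:
            raise ValueError(f"non-positive length for {name}: {size}")
        writer.writerow([name, int(size)])


def example_url(prefix, identifier):
    """Join a reference-database URL prefix and a protein identifier."""
    return prefix + identifier


buffer = io.StringIO()
write_length_table({"eno": 432}, buffer)
print(buffer.getvalue())
print(example_url("www.genome.jp/dbget-bin/www_bget?eco:", "b2779"))
```

Keying the table by short name rather than by protein identifier follows the note in step 6; the URL helper only concatenates strings and assumes that the prefix already ends where the identifier starts.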

5 Contact

wolfram.liebermeister@gmail.com
