Instructions for the proteomaps workflow - for internal use

1 Overview of the proteomaps workflow

The fastest and easiest way to generate proteomaps is to use the website http://www.protecs.uni-greifswald.de/bionic-vis/index.php. Instructions (in particular, regarding the choice of protein identifiers in the input file) can be found on the website. However, the website works only for standard cases (organisms for which proteomaps have been established already), and does not provide all the output files that we need to include new maps into our main website http://www.proteomaps.net/. In all other cases, a number of python and matlab scripts (“proteomaps workflow”) need to be run. This text gives some basic information about how to prepare data files for the workflow.

Currently, generating proteomaps involves the following (partially manual) steps:

1. Only for new organisms: prepare organism files (Wolf, with input data provided by user - see section 3 and 4)
2. Prepare processed files for paver (Wolf, with input data from user - see section 2)
3. Use the paver software to make proteomaps (Dan, using the processed files)
4. If desired, put proteomaps on website (Wolf, using the paver output files)

Files used in the workflow are stored in different places:

1. General method description: http://www.proteomaps.net/methods.html
2. Selection of data files: http://www.proteomaps.net/download.html
3. Github project proteomaps-workflow with public data files: https://github.com/liebermeister/proteomaps-workflow

2 How to prepare data files for the proteomaps workflow

In this section, we assume that proteomaps for an organism have already been established, and that we want to generate a proteomap with a new data set for this organism. Proteomaps are generated by the paver software, and input files for paver can be generated by a python script that preprocesses the original proteome data. This script expects the original data to be given in a fixed table format.
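Before sending a file, it can help to check that it really follows the table format described in this section. The following is a minimal sketch, not the workflow's own preprocessing script: the function name `read_abundance_table` is hypothetical, the file is assumed to be tab-separated with the identifier in the first column and the abundance in the second, and the rejection of negative values is an added sanity check (molecule counts cannot be negative, and negative numbers often indicate logarithmic data).

```python
import csv
import io


def read_abundance_table(handle):
    """Read (identifier, abundance) pairs from a two-column tab-separated table.

    Rows whose first field starts with "!" (this includes "!!") are skipped,
    because these leading rows are optional and carry no data.
    """
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        # Skip blank lines.
        if not fields or not fields[0].strip():
            continue
        # Skip the optional header rows marked by "!!" and "!".
        if fields[0].startswith("!"):
            continue
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected two tab-separated columns")
        identifier = fields[0].strip()
        abundance = float(fields[1])  # raises ValueError for non-numeric entries
        # Assumption: a negative value is treated as an error (possibly log data).
        if abundance < 0:
            raise ValueError(f"line {line_number}: negative abundance {abundance}")
        rows.append((identifier, abundance))
    return rows


sample = (
    "!!SBtab TableType='Proteomaps' Organism='eco'\n"
    "!ProteinIdentifier\t!Abundance\n"
    "b0878\t7\n"
    "b0948\t12\n"
    "b3933\t12\n"
)
print(read_abundance_table(io.StringIO(sample)))
# [('b0878', 7.0), ('b0948', 12.0), ('b3933', 12.0)]
```

For a file on disk, the same function could be called as `read_abundance_table(open(path, newline=""))`, where `path` is whatever file name you chose.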
The input file is a table in tab-separated .csv format. It contains two columns: one with the protein identifiers (see below) and one with the abundance data (numbers). For clarity, the file can also contain a header row with extra information and a row with the column headers. These two rows are marked by “!!” and “!”, respectively. Here is an example (for E. coli proteins):

    !!SBtab TableType='Proteomaps' Organism='eco'
    !ProteinIdentifier    !Abundance
    b0878                 7
    b0948                 12
    b3933                 12
    ..                    ..

The file attributes in the first row (such as “Organism” in the example) contain additional information. Recommended attributes are “Organism”, “Condition”, “Tissue”, “ValueType”, “Unit”, and “Title”. The leading rows starting with “!” are optional and are currently not evaluated by the python script; thus, it is also possible to simply write

    b0878    7
    b0948    12
    b3933    12
    ..       ..

In any case, it is the order of columns that counts. The protein identifiers (in the first column) must be chosen according to the instructions given in section 4, “How to add a new organism to the proteomaps workflow”. The abundance data (second column) must represent protein molecule count numbers. Do not use mass-weighted count numbers (mass weighting is done in the workflow anyway). Do not use logarithmic values.

Several data sets can be provided together, but must be given as separate files. For each file, the following information needs to be given:

1. The organism (three-letter) shortname
2. A short name of the experiment or paper corresponding to the data set, to be used within file names (e.g., Valgepea 2013 48)
3. A human-readable name of the experiment or paper corresponding to the data set (e.g. “Escherichia coli (µ=0.48/h) - Valgepea et al.
(2013)”)

3 How to add protein annotations

In this section, we assume that proteomaps for an organism have been established, but that some proteins have wrong or missing annotations. If you notice that proteins in proteomaps are wrongly placed in the functional hierarchy or not mapped at all, you can change this by adding new protein annotations. To do this, you need to find the proteins to be annotated (or reannotated, which makes no difference here), their protein identifiers, protein names, and KO numbers, as well as the desired pathway annotations, and put all this information into a table file.

The file format for your additional annotations follows the format of the file genomic data/KO gene hierarchy/KO gene hierarchy changes.csv in the github project proteomaps-workflow. The columns are as follows:

1. Organism shortname
2. Protein identifier
3. Protein short name
4. KO number
5. Pathway name
6. Protein long name
7. Comment or provenance information

Example (human hemoglobin Hba2, assigned to the pathway “Hemoglobin”):

    hsa    P69905    Hba2    K13822    Hemoglobin    Hemoglobin, alpha 2    http://ghr.nlm.nih.gov/gene/HBA2

The first three entries are mandatory. The “KO number” entry can be left blank.

Finding the right pathway name is a bit involved. You may start by checking the pathway given on KEGG’s website for that gene. Since our pathway definitions differ from the original KEGG pathway definitions, please make sure you adhere to our (not KEGG’s) naming scheme. To see our list of pathways, you can browse the hierarchy tree file (at http://www.proteomaps.net/download.html, link “Hierarchy tree (levels 1-3)”). In your annotations, you can only refer to the lowest-level categories. Please make sure that you use the right spelling (one mistake, and the entry will not be recognized by the program).

When your file is ready, please send it to Wolf. If you think that a pathway should be added to the hierarchy or that the hierarchy should be restructured, please contact Wolf.
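Since a single misspelled pathway name causes an entry to be ignored, it may be worth checking each annotation row before sending the file. The sketch below is only an illustration of the rules stated above (seven columns, first three mandatory, pathway name must match a lowest-level category). The function `check_annotation_row` and the set `known_pathways` are hypothetical: the set would have to be extracted from the hierarchy tree file, and the exact, case-sensitive comparison is an assumption about how the workflow matches names.

```python
# Column order as listed in the text above.
COLUMNS = [
    "Organism shortname",
    "Protein identifier",
    "Protein short name",
    "KO number",
    "Pathway name",
    "Protein long name",
    "Comment or provenance information",
]
MANDATORY = COLUMNS[:3]  # the first three entries are mandatory


def check_annotation_row(fields, known_pathways):
    """Return a list of problems found in one annotation row (empty if none).

    fields         -- the row, already split into its column values
    known_pathways -- set of lowest-level pathway names (hypothetical input,
                      to be taken from the hierarchy tree file)
    """
    problems = []
    if len(fields) != len(COLUMNS):
        problems.append(f"expected {len(COLUMNS)} columns, found {len(fields)}")
        return problems
    row = dict(zip(COLUMNS, (field.strip() for field in fields)))
    for name in MANDATORY:
        if not row[name]:
            problems.append(f"missing mandatory entry: {name}")
    pathway = row["Pathway name"]
    # Assumption: names are compared exactly, including upper/lower case.
    if pathway and pathway not in known_pathways:
        problems.append(f"unknown pathway name: {pathway!r}")
    return problems


example = ["hsa", "P69905", "Hba2", "K13822", "Hemoglobin",
           "Hemoglobin, alpha 2", "http://ghr.nlm.nih.gov/gene/HBA2"]
pathways = {"Hemoglobin"}  # illustrative subset only, not the real hierarchy
print(check_annotation_row(example, pathways))
# []
```

The function works on already-split fields, so it makes no assumption about the column separator used in the annotation file.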
A practical way to proceed is to focus on unmapped, highly expressed proteins. To do so, have a look at the preprocessed version of your proteome data file (which contains, for every protein, the current pathway annotation). If you sort the proteins by abundance, you will see which proteins deserve new annotations. In some cases, you will notice that a protein has an entry in KEGG, but does not carry an annotation in the proteomaps files. Usually, the reason is that in KEGG (i) no gene name is assigned or (ii) no pathway is assigned to this protein (the entry in KEGG’s field “Definition” serves neither as a gene name nor as a pathway).

4 How to add a new organism to the proteomaps workflow

In this section, we assume that for a certain organism, proteomaps have not been established yet. Currently, the workflow can only handle the organisms from the following list. For each of them, a specific type of protein identifier must be used (the examples shown refer to enolase proteins):

| Organism | org | ID type | Example | Example URL | Estimated protein count |
|---|---|---|---|---|---|
| Mycoplasma pneumoniae | mpn | Locus tag | MPN606 | www.ncbi.nlm.nih.gov/gene/?term=MPN606 | 50000 |
| Mycobacterium tuberculosis | mtu | Locus tag | Rv1023 | www.genome.jp/dbget-bin/www_bget?mtu: | 1500000 |
| Escherichia coli | eco | Locus tag | b2779 | www.genome.jp/dbget-bin/www_bget?eco:b2779 | 3000000 |
| Synechococcus elongatus sp. PCC7942 | syf | CyanoBase | Synpcc7942_0639 | genome.microbedb.jp/cyanobase/SYNPCC7942/genes/slr0752 | 3000000 |
| Synechocystis sp. 6803 | syn | CyanoBase | slr0752 | genome.microbedb.jp/cyanobase/Synechocystis/genes/slr0752 | 3000000 |
| Saccharomyces cerevisiae | sce | Locus tag | YHR174W | identifiers.org/sgd/YHR174W | 100000000 |
| Schizosaccharomyces pombe | spo | PomBase | SPBC1815.01 | www.pombase.org/spombe/result/SPBC1815.01 | 300000000 |
| Arabidopsis thaliana | ath | Tair | AT1G74030.1 | www.arabidopsis.org/servlets/TairObject?type=locus&name=AT1G74030.1 | 10000000000 |
| Drosophila melanogaster | dme | FlyBase | CG17654 | www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG17654 | 10000000000 |
| Mus musculus | mmu | NCBI Gene Id | 13806 | www.ncbi.nlm.nih.gov/gene/13806 | 10000000000 |
| Pan troglodytes | hsa | UniProt | P06733 | identifiers.org/uniprot/P06733 | 10000000000 |
| Homo sapiens | hsa | UniProt | P06733 | identifiers.org/uniprot/P06733 | 10000000000 |

The list is given as a file in the github project proteomaps-workflow (file genomic data/KO gene hierarchy/organisms.csv) and on the proteomaps website (http://www.proteomaps.net/download.html, link “Organism information”).

To add a new organism to the list, the necessary information needs to be prepared, and a file with protein lengths must be provided. As an example, assume that you wanted to add E. coli as a new organism (which, in fact, is already included). For the example, consider the KEGG page http://www.genome.jp/dbget-bin/www_bget?eco:b2779 for enolase in E. coli.

1. Find out the KEGG organism shortname (three letters). On the KEGG page, it appears in the field Organism: (in the example, eco for Escherichia coli K-12 MG1655). Make sure that not only the organism, but also the strain matches the one for which you want to generate proteomaps.
   Choose the organism name (e.g., “Escherichia coli”) to be used in proteomaps (which can, but does not have to, coincide with the name used in KEGG).
2. Find out the type of protein identifiers used in KEGG for this organism. On the KEGG page, it appears first in the field Entry: (in the example, b2779 for enolase). Make sure that your protein data carry the same type of protein identifiers.
3. Find the protein identifier for enolase (which we use as an example case).
4. Find a reference database that provides URLs for these protein identifiers (see table above). In the case of E. coli, we simply use KEGG itself (www.genome.jp/dbget-bin/www_bget?eco:); in other cases, we use an organism-specific database (e.g., genome.microbedb.jp/cyanobase/ for cyanobacteria).
5. Find out an (estimated) protein count number per cell; in case of doubt, contact Ron.
6. Prepare a table with protein lengths for these proteins. Note that length information refers to PROTEINS and not to mRNA! Important: unlike other tables (which use protein identifiers as keys), this table must use protein shortnames (e.g., “eno”) as keys (in a column called Protein:Name). The format is:

       Protein:Name    Protein:Size
       eno             432
       ..              ..

   Information about protein lengths can be obtained from UniProt, but due to gene name conversions, there may be some loss of information.
7. Send all this information, as well as a test protein data set (see section 2), to Wolf; he prepares the input files for paver, and Dan makes the first picture.
8. Be aware that heavy reannotating is usually necessary after the first test pictures.

5 Contact

wolfram.liebermeister@gmail.com