The_SeqHound_Admin_Manual The Seq Hound Admin Manual

The SeqHound Manual

Part II: Sections 4-7

For Administrators and Developers

Release 3.3

(April 20th, 2005)

Authors

Ian Donaldson, Katerina Michalickova, Hao Lieu, Renan Cavero, Michel Dumontier,

Doron Betel, Ruth Isserlin, Marc Dumontier, Michael Matan, Rong Yao, Zhe Wang,

Victor Gu, Elizabeth Burgess, Kai Zheng, Rachel Farrall

Edited by

Rachel Farrall and Ian Donaldson

The SeqHound Manual 2 of 421 18/04/2005

Table of Contents

About this manual............................................................................................................ 7

Conventions .....................................................................................................................8

How to contact us. ........................................................................................................... 8

Who is SeqHound?........................................................................................................... 9

4. Setting up SeqHound locally. ....................................................................................... 10

4.1 Overview.................................................................................................................. 10

4.2 SeqHound system requirements............................................................................... 11

OS and hardware architecture .................................................................................... 11

Memory (RAM) ......................................................................................................... 11

Hard Disk ................................................................................................................... 12

Source code and executables .................................................................................. 12

Database.................................................................................................................. 12

Other Software ........................................................................................................... 12

Compiling SeqHound Code yourself. ........................................................................ 13

ODBC compliant database engines............................................................................ 13

Library dependencies ................................................................................................. 13

4.3 Obtaining precompiled SeqHound executables....................................................... 14

4.3.1 Obtaining SeqHound Source Code...................................................................... 16

4.4 Compiling SeqHound executables on Solaris.......................................................... 18

4.5 Building the SeqHound system on Solaris............................................................... 26

Catch up on SeqHound daily updates ........................................................................ 45

Setting up daily sequence updates.............................................................................. 47

Setting up SeqHound servers. Overview................................................................... 53

Trouble-shooting notes............................................................................................... 57

Error logs ................................................................................................................ 57

Recompiling SeqHound.......................................................................................... 57

Restarting the Apache server .................................................................................. 57

Other useful links.................................................................................................... 58

Parser schedule........................................................................................................ 58

MySQL errors ......................................................................................................... 58

5. Description of the SeqHound parsers and data tables by module................................. 59

What are modules? ........................................................................................................ 59

How to use this section. ................................................................................................. 59

Parser descriptions........................................................................................................ 59

Table descriptions.......................................................................................................... 60

An overview of the SeqHound data table structure ....................................................... 63

Parsers and resource files needed to build and update modules of SeqHound. ........... 64

seqhound@blueprint.org Version 3.3

The SeqHound Manual 3 of 421 18/04/2005

core module................................................................................................................ 66

mother parser .......................................................................................................... 66

update parser ........................................................................................................... 71

postcomgen parser .................................................................................................. 72

asndb table .............................................................................................................. 75

parti table ................................................................................................................ 78

nucprot table............................................................................................................ 80

accdb table .............................................................................................................. 82

histdb table.............................................................................................................. 88

pubseq table ............................................................................................................ 91

taxgi table................................................................................................................ 94

sengi table ............................................................................................................... 97

sendb table .............................................................................................................. 99

chrom table............................................................................................................ 101

gichromid table ..................................................................................................... 105

contigchromid table .............................................................................................. 107

gichromosome table.............................................................................................. 109

contigchromosome table ....................................................................................... 111

Redundant protein sequences (redundb) module ..................................................... 113

redund parser......................................................................................................... 113

redund table........................................................................................................... 115

Complete genomes tracking (gendb) module........................................................... 119

Taxonomy hierarchy (taxdb) module....................................................................... 120

importtaxdb parser ................................................................................................ 120

taxdb table............................................................................................................. 122

gcodedb table ........................................................................................................ 127

divdb table............................................................................................................. 132

del table................................................................................................................. 135

merge table............................................................................................................ 137

Structural databases (strucdb) module ..................................................................... 139

cbmmdb parser...................................................................................................... 139

vastblst parser........................................................................................................ 144

pdbrep parser......................................................................................................... 146

mmdb table............................................................................................................ 148

mmgi table ............................................................................................................ 154

domdb table........................................................................................................... 156

Protein sequence neighbours (neighdb) module ...................................................... 162

Installing nblast:.................................................................................................... 162

Configuration of nblast environment:................................................................... 163

Running NBLAST ................................................................................................ 164

NBLAST Update Procedure ................................................................................. 166

nbraccess program* .............................................................................................. 168

BLASTDB table................................................................................................... 169

NBLASTDB table................................................................................................. 172

Locus link functional annotations (lldb) module ..................................................... 177

llparser................................................................................................................... 177

seqhound@blueprint.org Version 3.3

The SeqHound Manual 4 of 421 18/04/2005

addgoid parser....................................................................................................... 179

ll_omim table ........................................................................................................ 181

ll_go table.............................................................................................................. 183

ll_llink table .......................................................................................................... 186

ll_cdd table............................................................................................................ 188

GENE module .......................................................................................................... 191

parse_gene_files.pl parser..................................................................................... 191

gene_dbxref table.................................................................................................. 193

gene_genomicgi table ........................................................................................... 195

gene_history table ................................................................................................. 198

gene_info table...................................................................................................... 201

gene_object table .................................................................................................. 204

gene_productgi table............................................................................................. 206

gene_pubmed table ............................................................................................... 208

gene_synonyms table............................................................................................ 210

Gene Ontology hierarchy (godb) module................................................................. 212

goparser................................................................................................................. 212

go_parent table...................................................................................................... 214

go_name table ....................................................................................................... 216

go_reference table................................................................................................. 219

go_synonym table ................................................................................................. 221

Gene Ontology Association (GOA) module ............................................................ 223

Table summarizing input files, parsers and command line parameters for GOA

module................................................................................................................... 225

Gene Ontology Module Diagram.......................................................................... 228

goa_seq_dbxref table ............................................................................................ 230

goa_association table ............................................................................................ 234

goa_reference table ............................................................................................... 237

goa_with table....................................................................................................... 239

goa_xdb table........................................................................................................ 242

goa_gigo table....................................................................................................... 245

dbxref module .......................................................................................................... 248

Who Cross-references who? ................................................................................. 249

Explanation of the data table structure: ................................................................ 249

How to update the DBXref and GO Annotation modules using a cluster. .............. 256

Understanding the dbxref.ini file ............................................................................. 257

Table summarizing input files, parsers and command line parameters for dbxref

module................................................................................................................... 262

dbxref table ........................................................................................................... 265

dbxrefsourcedb table............................................................................................. 268

Contents of dbxrefsourcedb table ......................................................................... 270

RPS-BLAST domains (rpsdb) module..................................................................... 272

domname parser.................................................................................................... 272

Rpsdb parser.......................................................................................................... 273

domname table...................................................................................................... 274

rpsdb table............................................................................................................. 278

seqhound@blueprint.org Version 3.3

The SeqHound Manual 5 of 421 18/04/2005

Molecular Interaction (MI) module.......................................................................... 285

MI-BIND parser.................................................................................................... 285

MI_source table .................................................................................................... 289

MI_ints table ......................................................................................................... 291

MI_objects table.................................................................................................... 292

MI_obj_dbases table ............................................................................................. 294

MI_mol_types table .............................................................................................. 295

MI_dbases table .................................................................................................... 296

MI_record_types table .......................................................................................... 297

MI_complexes table.............................................................................................. 298

MI_complex2ints table ......................................................................................... 299

MI_complex2subunits table.................................................................................. 300

MI_complex2subunits table.................................................................................. 301

MI_refs table......................................................................................................... 302

MI_refs_db table................................................................................................... 304

MI_exp_methods table.......................................................................................... 305

MI_obj_labels table .............................................................................................. 306

Text mining module ................................................................................................. 307

mother parser ........................................................................................................ 307

text searcher parser ............................................................................................... 308

yeastnameparser.pl parser..................................................................................... 312

text_bioentity table................................................................................................ 314

text_bioname table................................................................................................ 317

text_secondrefs table............................................................................................. 321

text_bioentitytype table......................................................................................... 324

text_fieldtype table................................................................................................ 325

text_nametype table .............................................................................................. 326

text_rules table...................................................................................................... 327

text_db table.......................................................................................................... 328

text_doc table........................................................................................................ 329

text_docscore table................................................................................................ 331

text_evidencescore table ....................................................................................... 336

text_method table.................................................................................................. 338

text_point table...................................................................................................... 341

text_pointscore table ............................................................................................. 342

text_result table..................................................................................................... 344

text_resultscore table ............................................................................................ 346

text_search table.................................................................................................... 348

text_searchscore table ........................................................................................... 351

text_rng table ........................................................................................................ 353

text_rngresult table................................................................................................ 355

text_doctax table ................................................................................................... 357

text_organism table............................................................................................... 359

text_englishdict table ............................................................................................ 361

text_bncorpus table ............................................................................................... 363

text_pattern table................................................................................................... 365

seqhound@blueprint.org Version 3.3

The SeqHound Manual 6 of 421 18/04/2005

text_stopword table............................................................................................... 367

6. Developing for SeqHound. ......................................................................................... 369

Open source development............................................................................................ 369

Code organization. ...................................................................................................... 370

Adding/Modifying a remote API function to SeqHound.............................................. 373

Overall architecture of the SeqHound system.......................................................... 374

Adding a new module to SeqHound............................................................................. 380

Database layer .......................................................................................................... 381

Parser layer............................................................................................................... 382

Local API layer (Query layer).................................................................................. 383

CGI layer.................................................................................................................. 383

Remote API layer..................................................................................................... 384

7. Appendices.................................................................................................................. 387

Example GenBank record ........................................................................................ 388

Example SwissProt record ....................................................................................... 393

Example EMBL record ............................................................................................ 400

Example PDB record................................................................................................ 406

Example Biostruc ..................................................................................................... 411

GO background material .......................................................................................... 421

* not available at time of writing

seqhound@blueprint.org Version 3.3

The SeqHound Manual 7 of 421 18/04/2005

About this manual.

This manual contains everything that has been documented about SeqHound. It is

distributed in two Parts (Part I: For Users and Part II: For Administrators and

Developers).

If you can’t find the answer here then please contact us. This manual was written and

reviewed by the persons listed under “Who is SeqHound”. Any errors should be reported

to seqhound@blueprint.org.

You can find out more about the general architecture of SeqHound by reading the

SeqHound paper that is freely available from BioMed Central. This paper is included in

the supplementary material distributed with this manual. See:

Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW.

SeqHound: biological sequence and structure database as a platform for bioinformatics

research. BMC Bioinformatics. 2002 Oct 25;3(1):32.

PMID: 12401134

The SeqHound Manual (Part I: Sections 1-3) For Users.

Section1 and Section 2 is a one page description that tells you what to read first to get

started depending on what kind of user you are.

Section 3 is of interest to programmers who want to use the remote API to access

information in the SeqHound database maintained by the Blueprint Initiative.

The SeqHound Manual (Part II: Sections 4-7) For Administrators and Developers

Section 4 is of interest to programmers and system administrators who want to set up

SeqHound themselves so they can use the local API.

Section 5 is an in-depth description of everything that’s in the SeqHound database and

how it gets there (table by table). This section will be of interest to all users.

Section 6 describes how programmers can add to SeqHound. This section also describes

our internal development process at Blueprint.

Section 7 includes Appendices of background and reference material.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 8 of 421 18/04/2005

Conventions

The following section describes the conventions used in this manual.

Italic

is used for filenames, file extensions, URLs, and email addresses.

Constant Width

is used for code examples, function names and system output.

Constant Bold

is used in examples for user input.

Constant Italic

is used in examples to show variables for which a context-specific substitution should be

made.

How to contact us.

General enquiries or comments can be posted to the SeqHound usergroup mailing list

seqhound.usergroup@blueprint.org. You may also subscribe to this list to receive

regular updates about SeqHound developments by going to

http://lists.blueprint.org/mailman/listinfo/seqhound.usergroup .

Private enquiries, bug reports from external users, questions about SeqHound or errors

found in this manual may be sent to seqhound@blueprint.org.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 9 of 421 18/04/2005

Who is SeqHound?

Chronologically ordered according to when the person first started work on SeqHound.

Chris Hogue

Katerina Michalickova

Gary Bader

Ian Donaldson

Ruth Isserlin

Michel Dumontier

Hao Lieu

Marc Dumontier

Doron Betel

Renan Cavero

Ivy Lu

Rong Yao

Volodya Grytsan

Zhe Wang

Victor Gu

Rachel Farrall

Michael Matan

Elizabeth Burgess

Kai Zheng

seqhound@blueprint.org Version 3.3

The SeqHound Manual 10 of 421 18/04/2005

4. Setting up SeqHound locally.

4.1 Overview.

This section describes how one can set up the SeqHound system on your own hardware

using freely available SeqHound executables. These executables will allow you to build

and update the SeqHound database as well as run a web-interface and a remote API

server.

Section 4.2 should be reviewed first for system requirements before attempting to install

the SeqHound system.

Section 4.3 tells you how to download executables from the SeqHound ftp site for your

platform and operating system. SeqHound code may also be downloaded from this site.

Section 4.4 describes how SeqHound code may be compiled on your own hardware using

the freely available code available on the SeqHound ftp site. This step is only required if

SeqHound executables are not available for your platform or if you want to make use of

the local programming API. If you obtain SeqHound executables from the ftp site and

want to build your local SeqHound database, you still need to go through Steps 8, 9, 10,

11 and 13 in this section which describe how to install the MySQL server and ODBC

driver.

Section 4.5 contains detailed instructions for using the executables to build the SeqHound

data tables and for setting up the SeqHound web-interface and remote API server.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 11 of 421 18/04/2005

4.2 SeqHound system requirements.

Before attempting to set up SeqHound yourself, you should review the system

requirements listed below. The SeqHound system is able to run on a number of operating

systems (we recommend and can best support a UNIX operating system like Sun Solaris

or Red Hat Linux). Setting up SeqHound will require approximately 700 GB of disk

space (see below).

Questions about system requirements, compilation, setup and maintenance can be

addressed to seqhound@blueprint.org. We will do our best to address all inquiries but

resources may not allow us to solve all problems arising on all possible set ups.

OS and hardware architecture

SeqHound code is compiled on the following platforms based on release version code.

Blueprint production SeqHound is compiled and run on Sun-Fire-880 - Sun Solaris

(version 9). We have also compiled and tested SeqHound on the Fedora Core 2.0 and

the MacOS X operating systems.

Release versions of SeqHound executables are available for.

x86 architecture (Fedora Core 2.0)

Sun-Fire-880 Sun Solaris (version 9)

PowerPC architecture MacOS X

We have also successfully built executables on the following platforms.

x86 architecture FreeBSD

x86 architecture QNX

x86 architecture Windows NT

PowerPC architecture PPC Linux

SGI Irix 6

Alpha architecture Compaq Alpha OS

HPPA 2.0 architecture HPUX 11.0

HPPA 1.1 architecture PA-RISC Linux

Memory (RAM)

We recommend a minimum of 1 GB of RAM to run the SeqHound executables.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 12 of 421 18/04/2005

Hard Disk

Source code and executables

Component Image Size

SeqHound Source and compiled 220.0 MB

NCBI Toolkit 560.0 MB

NCBI C++ Toolkit 12GB

bzip2 Library 4.5 MB

slri lib 7.3MB

slri lib_cxx 9.4 MB

Source code and executables (total) 13GB approx

Database

Component Image Size

data tables 300 GB

data tables backup 300 GB

Database (total) 700 GB*

*700GB includes 300 GB for a single copy of the SeqHound data tables. The SeqHound

system includes a second copy of the data tables used for back up and updating. We

suggest a minimum of 700 GB for SeqHound installation. This allows for yearly growth

of the data tables as well as for a RAID5 disk configuration.

We are using the MySQL database storage engine InnoDB, which provides transaction

support and automatic recovery in the event of database server outage. There is no need

to keep a separate instance of the database when the InnoDB storage engine is used. To

prevent deadlock during data insertion and update, you should not run SeqHound parsers

in parallel against the InnoDB database server. As a result, it takes up to three extra days

for the initial build of SeqHound database using the InnoDB storage engine. If you wish

to use the MyISM storage engine, you can run parallel parsers to speed up the initial

build of SeqHound. However, you will need to keep a separate database instance for

database update and backup as the storage engine MyISM does not support transaction

and automatic recovery.

Other Software

Apache

Webserver(version 1.3)

See http://www.apache.org/ for software installation for you

platform.

Apache Jakarta Tocat

JSP/Servlet Container

(version 4.1)

See http://jakarta.apache.org/tomcat/ for software installation for

you platform.

Perl (version 5.8.3) See http://www.cpan.org/ for installation for your platform.

Requiredmodules include Net/FTP.pm, sun4-solaris-64/DBI.pm

seqhound@blueprint.org Version 3.3

The SeqHound Manual 13 of 421 18/04/2005

Compiling SeqHound Code yourself.

It is not necessary to compile SeqHound executables yourself; the system may be set up

using the executables provided on the ftp site for selected Operating Systems. However,

if you wish to make use of the local API then you must compile SeqHound yourself.

ODBC compliant database engines

Blueprint uses the ODBC compliant MySQL database engine. We are using version

4.1.10 in production; this version supports nested SQL queries and internationalization.

We have not tested SeqHound on other ODBC compliant RDBMS such as Oracle, DB2

and PostgreSQL.

Library dependencies

Library Source

NCBI Toolkit from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/

NCBI C++ Toolkit (optional*) from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/

bzip2 Library from http://sourceforge.net/projects/slritools/

slri lib from http://sourceforge.net/projects/slritools/

slri lib_cxx (optional*) from http://sourceforge.net/projects/slritools/

* This library is only required if you plan to use the SeqHound remote C++ API.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 14 of 421 18/04/2005

4.3 Obtaining precompiled SeqHound executables.

It is not necessary to compile SeqHound executables yourself; the system may be set up

using the precompiled executables provided on the ftp site for selected Operating

Systems. If you choose to compile the executables yourself, skip to step 4.3.1.

You will require about 220 MB of disk space to store the SeqHound compiled

executables. These instructions assume you are logged in as user “seqhound” on a UNIX

system running the bash shell and you have perl installed on your system.

1. Decide the location to install the SeqHound binary executables. For example, if you

want to install in the directory /home/seqhound/execs, do the following:

mkdir execs

cd execs

2. Download the SeqHound installation utility script installseqhound.pl from the FTP

site: ftp.blueprint.org

ftp ftp.blueprint.org

When prompted for a name enter

anonymous

When prompted for a password type your email address:

myemail@home.com

cd pub/SeqHound/script

get installseqhound.pl

Close the ftp session by typing:

bye

3. Run the perl script to download and install SeqHound executables. The perl script

will download SeqHound binary executables based on the specified platform (linux or

solaris), unpack the tar ball, modify the configurations files .odbc.ini and .intrezrc (for

ODBC database access) and deploy the configuration files. It requires two

commandline arguments: platform (linux or solaris) and installation path (e.g.

/home/seqhound/execs). Enter the path to the ODBC driver (e.g.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 15 of 421 18/04/2005

/usr/lib/libmyodbc3.so, please refer to step 10 in section 4.4 for ODBC driver path),

database server name, port number, user id, password and database instance name

when prompted by the perl script.

./installseqhound.pl [linux OR solaris] [/home/seqhound/execs]

Upon successful execution of the perl script, you should see the following directories

in the directory execs:

build

config

example

include

lib

sql

test

updates

www

The configuration file .odbc.ini can be found in the home

directory (e.g. /home/seqhound).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 16 of 421 18/04/2005

4.3.1 Obtaining SeqHound Source Code.

Follow the instructions below to download SeqHound source code . If you downloaded

and unpacked the executables, you can skip section 4.3.1 and 4.4 and continue with

section 4.5.

1. In your home directory, make a new directory where you will store the new

SeqHound code.

mkdir compile

Move into this directory and set an environment variable called COMPILE to point to

this directory.

cd compile

export COMPILE=`pwd`

(where (`) is a single back-quote)

2. Download the perl utility seqhoundsrcdownload from the SeqHound ftp site

Note: We no longer support SeqHound download from the Sourceforge

FTP site. Please download SeqHound from

ftp://ftp.blueprint.org/pub/SeqHound/

From the compile directory, type:

ftp ftp.blueprint.org

When prompted for a name enter

anonymous

When prompted for a password type your email address:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 17 of 421 18/04/2005

myemail@home.com

cd pub/SeqHound/script

get seqhoundsrcdownload.pl

Close the ftp session by typing:

bye

3. Download SeqHound source code by running the perl script seqhoundsrcdownload.pl.

The script will download the source code tar file and unpack the tar file into two

directories slri and bzip2. You will also see a release note file Release_notes_x.x.txt

in the same directory compile.

./seqhoundsrcdownload.pl

4. Set the SLRI environment variable

Move to the slri directory and set the environment variable “SLRI” to point to this

directory.

cd $COMPILE/slri

export SLRI=`pwd`

seqhound@blueprint.org Version 3.3

The SeqHound Manual 18 of 421 18/04/2005

4.4 Compiling SeqHound executables on Solaris

These instructions describe how to compile SeqHound running on the Solaris platform.

They may be used as a guide for compiling SeqHound code on other platforms.

Instructions are similar for Linux and differences are noted.

Using these instructions

These instructions assume that:

You have downloaded the SeqHound code from the ftp server and you have set

environment variables called COMPILE and SLRI. See section 4.3.1

You are using the bash shell.

Note: On Linux platforms, to compile SeqHound libs with ODBC support you also need

unixODBC-devel package which contains the sql.h + other libs/headers required to

compile SeqHound libs with ODBC support. This is not needed to run SeqHound, just to

compile it.

These instructions were tested on a Sun-Fire-880 architecture running a Sun Solaris OS

(version 9). The system information for the test-box (results of a “uname –a” call)

were:

SunOS machine_name 5.9 Generic_117171-15 sun4u sparc

SUNW,Sun-Fire-880

1. Download the NCBI toolkit

SeqHound is dependent on code in the NCBI toolkit

Move to the compile directory and ftp to the NCBI ftp site:

cd $COMPILE

ftp ftp.ncbi.nlm.nih.gov

When prompted for a name enter anonymous

When prompted for a password type myemail@home.com

cd toolbox/CURRENT

Make a note of the FAQ.html and the readme.htm files.

Change your transfer type to binary and get the zipped directory called ncbi.tar.gz

bin

get ncbi.tar.gz

Close the ftp session by typing:

bye

seqhound@blueprint.org Version 3.3

The SeqHound Manual 19 of 421 18/04/2005

Uncompress the toolkit.

gunzip ncbi.tar.gz

tar xvf ncbi.tar

2. Edit the platform make file.

Go to the platform directory and locate the file with a “.mk” extension that applies to

your platform. For 64-bit Solaris system the file is “solaris64.ncbi.mk” and in Linux

the file is linux-x86.ncbi.mk.

cd $COMPILE/ncbi

cd platform

In Linux linux-x86.ncbi.mk replace the line /home/coremake/ncbi with

${NCBI}

Use the following line (a Perl command) to replace the string in the Solaris file

/netopt/ncbi_tools/ncbi64/ncbi with the string ${NCBI}

in the solaris64.ncbi.mk file:

perl -p -i.bak -e 's|/netopt/ncbi_tools/

ncbi64/ncbi|\${NCBI}|g' solaris64.ncbi.mk

so for instance, the line

NCBI_INCDIR = /netopt/ncbi_tools/ncbi64/ncbi/include

Will become:

NCBI_INCDIR = ${NCBI}/include

You could also edit this file in hand using a text editor if you don’t have Perl

installed.

Copy the file up one level to the ncbi directory and rename it “ncbi.mk”

cp solaris64.ncbi.mk ../ncbi.mk

3. Set environment variables in preparation for the toolkit build.

Move back to the ncbi directory and set the environment variable NCBI to point to

that directory

cd $COMPILE/ncbi

export NCBI=`pwd`

check this by typing

echo $NCBI

the value shown will replace ${NCBI} in the “solaris64.ncbi.mk” file that you

modified in the above step when the make file is run.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 20 of 421 18/04/2005

Note: The make file in the NCBI toolkit will use the C compiler from Sun

instead of the compiler gcc. We do not recommend using gcc as it

generates seqhound parsers that lead to segmentation fault at run time.

Finally, paths to the compiler and the archive executable ar should be added to your

PATH variable:

export

PATH=/usr/local/bin:/opt/SUNWspro/prod/bin:/usr/ccs/bin:$

PATH

You can check all of your environment variables by typing

set | sort

At this point, the relevant environment variables should be something like this:

COMPILE=/export/home/your_user_name/compile

NCBI=/export/home/your_user_name/compile/ncbi

OSTYPE=solaris2.9

PATH=/opt/SUNWspro/prod/bin:/usr/local/bin:/usr/ccs/bin:/

usr/bin:/usr/ucb:/etc:.

If you want, you can read the readme file in the make directory.

cd make

Other useful links

SLRI on Sourceforge http://sourceforge.net/projects/slritools/

NCBI Info Engineering branch: http://www.ncbi.nlm.nih.gov/IEB/

Concurrent Versioning System (CVS) http://www.cvshome.org/

MySQL http://www.mysql.com/

Parser schedule

Parsers are run on a periodic basis as “cron” jobs on Unix platforms and as “Schedules

tasks” on Windows platforms.

The cron job schedule is set up in the file “seqhound/update/list_crontabs” on UNIX

platforms

The setup for the current production version of SeqHound described in these instructions

is shown below:

0 19 * * * /arena/seqhound/update/llcron_odbc.pl

0 21 * * * /arena/seqhound/update/redundcron_odbc.pl

30 22 * * * /arena/seqhound/update/dupdcron_odbc.pl

30 24 * * * /arena/seqhound/update/precomcron_odbc.pl

0 7 * * * /arena/seqhound/update/isshoundoncron_odbc.pl

MySQL errors

ERROR 1153 at line X: Got a packet bigger than

'max_allowed_packet'

When MySQL receives a packet bigger than max_allowed_packet bytes, it issues a

Packet too large error and closes the connection. For example, when a single SQL

statement from a mysqldump being imported exceeds the value for

"max_allowed_packet" configured on the MySQL server. Increasing this value to

64MB from the default 16MB should resolve the error. This value may be changed in the

global config or setting this on the running server via:

set global max_allowed_packet=67108864;

Please see http://dev.mysql.com/doc/mysql/en/Packet_too_large.html for more

information

seqhound@blueprint.org Version 3.3

The SeqHound Manual 59 of 421 18/04/2005

5. Description of the SeqHound parsers and data tables by module

What are modules?

The SeqHound system is divided into one required “core” module and several optionally configured modules. Modules are groups of

tables and API calls that are filled using a common data resource, for example the 3D structures. The purpose of this division is to give

the user an option to control hardware resources and complexity of system administration when parts of the SeqHound system are not

required. The list of SeqHound modules and their data resources is contained in the table below.

After a system build, the module information is recorded in the configuration file which is then utilized by the API to determine if

certain operations can be achieved with the current setup. The configuration file is called .intrezrc (Unix platforms) The relevant

section of this file is under the heading “[sections]”. Consult Section 4 under “Building the SeqHound system” for more information

on specifying the available modules in this file.

How to use this section.

This section describes the SeqHound system in detail module by module.

Parser and data tables associated with a given module are described under the section for that module.

A brief description of all of the parsers can be found below and in the Table 2 of the SeqHound paper. The table is repeated below

and will be updated here.

Note: It is assumed that you have read the material in section 3 and section 4 before delving into this material. These

two sections include everything you need to know to start using the SeqHound remote API or to install you own local

version of SeqHound. This section is intended for users who may want more details about how SeqHound is

constructed and exactly what it is doing behind the scenes. This section is also meant as background material for

developers who want to further develop the SeqHound system

Parser descriptions

Parser descriptions contain the following headings and information:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 60 of 421 18/04/2005

purpose: a brief description of what the parser is for

logic: more details on the parser – see update parser for example

module: what module the parser belongs to

input files: input files required by the parser (also available in Table 2)

tables altered: tables in SeqHound that are modified or created by the parser (also

available in Table 2)

source code

location:

location of the parser source code in the slri development tree

config file

dependencies:

what configuration file parameters must be set for the parser to work

command line

parameters:

used by the parser

example use: of the parser from the command line

associated scripts: that are used to run the parser

error and run-time

logs:

where they are located and what is in them

troubleshooting: problems that may occur with this parser

additional info: where to find it

Table descriptions

Each table that is relevant to a module is described under that module.

Here is an example with comments in brackets. A data table description consists of the following sections and content:

Database: that the table belongs to (almost always SeqHound)

Table: name of the table like “accdb”

Definition: a brief description of the table's purpose (for example, “This table

seqhound@blueprint.org Version 3.3

The SeqHound Manual 61 of 421 18/04/2005

correlates gi’s to accession identifiers).

Observation: notes about the table

Source db: where does this table’s information come from

Source file: the source file (used by the parser to fill this table) location is listed here

Parser: the name of parser if a single parser is responsible for filling this entire

table.

This is followed by a summary of the table’s definition (for example):

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

gi int(11)

N

o 0 GenInfo Identifier

asn1 longblob Yes

N

ULL BioSeqs: Sequences

and indexes

Keyname Type Field

PRIMARY PRIMARY gi

iasndb_rowid INDEX rowid

iasndb_gi INDEX gi

(each field is then described...only one field description is shown for this example)

***gi***

description: definition of the column (for example, GenInfo sequence record identifier”)

example: of a column entry (for example, “1232452”)

default value: if the value has a default value, it is listed here

ASN.1 structure: If the value in this column is derived from an NCBI data structure, this gives

you a quick idea of where to locate it--you can find more info by searching

for this data structure at:

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SB/hbr.html.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 62 of 421 18/04/2005

for example, “Bioseq->Seq-id (choice 12)

Alternatively, the column may store a binary object that is an NCBI or SLRI

ASN.1 object. Descriptions of SLRI ASN.1 objects may be found at

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/slritools/slri/seqhound/asn.

source: If the value in this column is derived from a text file, this describes how to

find the information present in this column in the source file (for example in

the fourth column)

parser: this is the parser(s) that retrieves the value from some other db (for example,

mother)

function: this is the parser function that retrieves the value

API: if a SHound API function retrieves this value from this table it is listed here

(for example, ShoundFindAcc)

more info: (other notes)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 63 of 421 18/04/2005

An overview of the SeqHound data table structure

An overview of the SeqHound data table structure is available as a separate document in pdf format. See

http://www.blueprint.org/seqhound/api_help/docs/SeqHound_Schema_Prod.pdf.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 64 of 421 18/04/2005

Parsers and resource files needed to build and update modules of SeqHound.

This table will be updated shortly

Input File Resource Parser Tables

Modified Module

ASN.1

sequences

f

tp://ftp.ncbi.nih.gov/ncbi-asn1/*.aso

f

tp://ftp.ncbi.nih.gov/

r

efseq/cumulative

/*.bna

mother

asndb, parti,

nucprot, accdb,

pubseq, taxgi,

sendb, sengi

core

ASN.1

sequences

f

tp://ftp.ncbi.nih.gov/ncbi-asn1/daily-

nc/*.aso

f

tp://ftp.ncbi.nih.gov/refseq/daily/*bna

update

asndb, parti,

nucprot, accdb,

pubseq, taxgi,

sendb, sengi

core

FASTA nr

database

f

tp://ftp.ncbi.nih.gov/blast/db/nr redund redund redundb

List of complete

genomes (flat

file)

http://cvs.sourceforge.net/cgi-

bin/viewcvs.cgi/slritools/slri/seqhound

/genomes/chromff

chrom chrom gendb

ASN.1 for

complete

genomes

f

tp://ftp.ncbi.nih.gov/genomes/*/*.asn comgen taxgi, accdb gendb

Taxonomy

release (flat file)

f

tp://ftp.ncbi.nih.gov/pub/taxonomy/tax

dump.tar importtaxdb TAX, GCODE,

DIV, del, merge taxdb

ASN.1 MMDB

release

f

tp://ftp.ncbi.nih.gov/mmdb/mmdbdata

/*.val cbmmdb mmdb, mmgi strucdb

MMDB

(database table) mmdb table vastblst domdb strucdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 65 of 421 18/04/2005

3-D chain

BLAST sets

(flat file)

f

tp://ftp.ncbi.nih.gov/mmdb/nrtable/nr

p

db.* pdbrep domdb strucdb

FASTA nr

database

f

tp://ftp.ncbi.nih.gov/blast/db/nr nblast nrB neigdb

BLAST ASN.1

results

nrB table available nrB and nrN tables

available at

f

tp://ftp.blueprint.org/pub/SeqHound/

NBLAST/

nbraccess nrN neigdb

LL_tmpl (flat

file)

f

tp://ftp.ncbi.nih.gov/refseq/LocusLink

/

LL_tmpl llparser ll_omim, ll_go,

ll_llink, ll_cdd lldb

gene_associaton

.com

pugen.GenBank

/Swissprot (flat

files)

http://www.geneontology.org addgoid ll_go lldb

function.ontolog

y

process.ontolog

y

component.onto

logy (flat files)

http://www.geneontology.org goparser

go_parent,

go_name,

go_reference,

go_synonym

godb

CDD database

f

tp://ftp.ncbi.nih.gov/pub/mmdb/cdd/ domname domname rpsdb

FASTA nr

database and

CDD database

f

tp://ftp.ncbi.nih.gov/blast/db/nr;

f

tp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

DOMNAME and RPSDB tables

available at

f

tp://ftp.blueprint.org/pub/SeqHound/

RPS/

rpsdb rpsdb rpsdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 66 of 421 18/04/2005

core module

Last updated April 11, 2005. This section is maintained bt Elizabeth Burgess.

mother parser

Last updated April 11, 2005

purpose:

The mother parser is the first parser that is used to initially build the SeqHound “core” set of data tables. The input files consist of the

latest release of GenBank and RefSeq in binary ASN.1 format available on the NCBI ftp site. The resulting SeqHound data tables

hold DNA, RNA and protein sequence record information.

As of release 4.0, the mother parser is also used with the GenBank and RefSeq daily updates to update SeqHound so that that

sequence information is synchronized with that of NCBI. Previously, updates were handled by a separate parser called “update”.

logic:

The mother parser is run on a daily basis in update mode (-u T) to update with two input files. One input file is the daily update from

GenBank and the other is the daily update from the RefSeq database.

For each bioseq in the input file, mother retrieves the GI and accession number.

Mother then looks for this pair of GI and accession identifiers in the SeqHound accdb table.

If neither the accession (nor the GI) is found, then the record in the daily update represents a GI that is to be ADDED to SeqHound.

If the accession in the update file is found in SeqHound and is associated with the same GI (as listed in the update file) then the record

in the daily update represents a GI that has been CHANGED. This means that the sequence record was resubmitted to GenBank with

the same GI; the sequence remains the same but the associated annotation has changed.

If the accession in the update file is found in SeqHound but the GI associated with this accession differs between the update file and

SeqHound, then the GI that is newly associated with the accession represents a change in the sequence. The accession and updated GI

seqhound@blueprint.org Version 3.3

The SeqHound Manual 67 of 421 18/04/2005

pair point to a sequence record that will be ADDED to SeqHound. The accession and old GI (currently in SeqHound) point to a

record in SeqHound that will be KILLED (deleted).

Mother then records in the SeqHound history table whether the GI and accession in the update file represent an ADDED or

CHANGED record. The previous gi associated with the accession is also recorded, if there is one. Mother also records in the history

table those sequence records currently in SeqHound that will be KILLED (see above).

This process is completed for every bioseq in the update file.

A list of ADDED, CHANGED and KILLED GI’s is written to the file rsncmmddgis where rsncmmdd refers to the update file that was

being processed.

Mother then deletes from the following tables: parti, accdb, sendb,sengi, taxgi, pubseq, nucprot asndb. If the record is a complete

genome record, then the gi will be deleted from gichromid, contigchromid, gichromosome and contigchromosome. The strategy for

complete genome deletions is as follows:

Complete genome information in NCBI may be stored in several different places in the record.

1. Records from organisms other than those for certain higher eukaryotes ( e.g. human, mouse, rat, chicken and bee). These

records contain the flag NCBI_GENOMES and list the RNA and protein gis in the annotation. The RNA and protein gis will

be written to gichromid. The gi of the contig that contains the annotation is also written to this record if it is known.

2. Records from human, mouse, rat etc. The top level contig contains the NCBI_GENOMES flag, but no annotation. Instead, it

lists the gis for contigs that make up this chromosome. The annotation of these contig records contains the gis and proteins that

belong to this chromosome. The lower level contig gis are written to contigchromid so that the protein and RNA gis can be

parsed later from the contig bioseqs by the postcomgen parser. The gi of the top level contig is also written to this record.

3. Some records exist that only contain a chromosome number in the description of the bioseq. These records can be a contig

record or a record for an individual protein or RNA. These records are first written to gichromosome or contigchromosome and

later moved to gichromid and contigchromid by postcomgen.

For updating, we check to see if the accession for that gi is in the chrom table. The only records that are written to this table are the top

level gis that contain the NCBI_GENOMES flag. If the gi belongs to a top level contig, then all gis in contigchromid that belong to

that contig are retrieved and all gis in gichromid that belong to each low level contig gi are deleted. The low level contig gis are then

deleted from contigchromid.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 68 of 421 18/04/2005

If the record is present in contigchromid, then all gis in gichromid from that contig are deleted before the record is deleted from

contigchromid.

If the gi is present in only gichromid and contains a contig gi, then the appropriate record in contigchromid is marked as changed.

That way, after mother finishes, postcomgen will only process those contigs that have changed.

Mother writes errors and messages to a log file called rsncmmddrun where rsncmmdd refers to the update file that was being

processed.

module: core

input files:

latest GenBank release (ftp://ftp.ncbi.nih.gov/ncbi-asn1/*.aso)

latest RefSeq release (ftp://ftp.ncbi.nih.gov/refseq/cumulative/*.bna)

daily GenBank release (ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/*.aso)

daily RefSeq release (ftp://ftp.ncbi.nih.gov/refseq/daily/*.bna)

tables altered:

asndb, parti, nucprot, accdb, pubseq, taxgi, sendb, sengi, bioentity, bioname, chrom, gichromid, contigchromid, gichromosome,

contigchromosome

source code location:

slri/seqhound/parsers/mother.c

config file dependencies:

The relevant configuration file is:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 69 of 421 18/04/2005

slri/seqhound/config/.intrezrc

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see section 4).

command line parameters:

Typing “./mother –“ at the command line while in the directory where mother resides will return a list of command line

parameters and default settings.

Note that -n and -m are listed for historical Codebase purposes. These values are always F for ODBC.

For example:

> ./mother –

mother arguments:

-i Filename for asn.1 input [File In]

-r Release [String]

-n Initialize the ASNDB database file [T/F] Optional

default = F

-m Initialize the remaing database files [T/F] Optional

default = F

-u Is this an update [T/F] Optional

default = F

-c Is this a file for human/mouse complete genome [T/F] Optional

default = F

-t Read input file in text mode [T/F] Optional

default = F

seqhound@blueprint.org Version 3.3

The SeqHound Manual 70 of 421 18/04/2005

example use:

mother –i nc1227.aso -r 135 -n F -m F

mother –I nc1227.aso –r 135

Note that mother is normally run under the control of a script (see below).

associated scripts:

The initial build of SeqHound is executed using the script called “seqhound_build.sh”

See “slri/seqhound/scripts/seqhound_build.sh”

This script cycles through all the GenBank and RefSeq release files downloaded by asn_ftp.pl unzips them, processes them with

mother and then zips them up again.

The script must be run in the directory containing the GenBank and RefSeq release files and a copy of the mother parser The script

takes one argument (a release number).

./seqhound_build.sh 135

mother is also called by the scripts that generate the daily updates (see “associated scripts” under update parser).

error and run-time logs:

mother writes to a log file called “rsncmmddrun” where rsncmmdd refers to the GenBank release or update file that was being

processed. When run in update mode, it also writes the gis to a file called “rsncmmddgis”. The log files are created in the directory

logs.

troubleshooting:

additional info:

See readme files that accompany the input files on the GenBank ftp site.

See data table descriptions for each of the tables that are listed under “tables altered”.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 71 of 421 18/04/2005

update parser

Last updated April 11, 2005

note:

As of Release 4.0, update functionality has been moved to the mother parser and associated scripts. See above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 72 of 421 18/04/2005

postcomgen parser

Last updated April 11, 2005

purpose:

The postcomgen parser is used in conjunction with the mother parser. It is used to update the taxgi table with complete genome

information using the chrom, contigchromid, gichromid, gichromosome and contigchromosome tables. It is run after the initial build

and after any updates.

logic:

Complete genome information in NCBI may be stored in several different places in different records.

1.Records from organisms other than those for certain higher eukaryotes ( e.g. human, mouse, rat, chicken and bee) contain the flag

NCBI_GENOMES and list the RNA and protein gis in the annotation.

2.Records from certain higher eukaryotes (human, mouse, rat etc) are different. The top level contig contains the NCBI_GENOMES

flag, but no annotation. Instead, it lists gis for the contigs that make up this chromosome. The annotation of these records contains the

gis and proteins that belong to this chromosome.

3. Some records exist that only contain a chromosome number in the description of the bioseq. These records can be contig records or

records for an individual protein or RNA.

At the time that mother is run, gis in the annotation may not yet have been entered into accdb and taxgi. For this reason, mother writes

these gis out to the table gichromid. When mother encounters a record that contains only contig gis, these gis are written out to

contigchromid. The records that contain only a chromosome number are written to gichromosome or contigchromosome.

After mother is finished, postcomgen is run to process the data in these tables and write the kloodge to the taxgi table for each gi:

1. Postcomgen first writes every gi in gichromosome to gichromid with the appropriate chromid. After this is done, all records are

deleted from gichromosome.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 73 of 421 18/04/2005

2. Next, all contigchromosome gis are written to contigchromid with the appropriate chromid and all records are deleted from

contigchromosome.

3. For each gi in contigchromid, the bioseq is obtained from asndb and any protein or RNA gis in the annotation are written to

gichromid. In addition to the chromid, the contiggi is stored so that the appropriate contigchromid record can be reread on update if

necessary.

4. Each record in contigchromid is marked as read, so that upon update only records that have been modified need be read.

5. Finally, taxgi is updated for each gi in gichromid with the appropriate kloodge.

module: core

input files:

There are no input files. The parser uses the tables gichromid, contigchromid, gichromosome and contigchromosome, which are filled

by mother, to update the taxgi table.

tables altered:

taxgi, gichromid, contigchromid, gichromosome, contigchromosome

source code location:

slri/seqhound/genomes/postcomgen.c

config file dependencies:

The relevant configuration file is:

slri/seqhound/config/.intrezrc

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

seqhound@blueprint.org Version 3.3

The SeqHound Manual 74 of 421 18/04/2005

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username, password,

dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE respectively in your

.odbc.ini (see 4).

command line parameters:

None.

example use:

postcomgen

associated scripts:

error and run-time logs:

Postcomgen writes to a log file called “postcomgen.log”.

troubleshooting:

additional info:

See data table descriptions for each of the tables that are listed under “tables altered”.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 75 of 421 18/04/2005

asndb table

Last updated April 11, 2005

Database: SeqHound

Module: core

Table: asndb

Definition: Central table that stores sequence records in binary ASN.1 format and indexed by gi.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 76 of 421 18/04/2005

asndb table

Field Type Null Default Column_Definition Example Source API

rowid int(11)

N

o

N

ULL Mysql autoincrement

column

ts timestamp Yes

CURRENT_TIMES

TAMP timestamp 2005-03-23 12:32:16

gi int(11)

N

o 0

GenInfo primary

sequence record

identifier

4001923 FillASNDB calls GetGi.

See Bioseq->seqid->gi

asn1 longblob Yes

N

ULL

bioseq in binary ASN.1 An example of a bioseq

is included in the

Appendix (Example of

GenBank record).

ToBioseqSEQENTRY

See Bioseq->seqid->gi

SHoundGetBioseq

SHoundSequenceLength

division varchar(25)

N

o

N

CBI division PRI GetDivisionFromGBBlock.

See

Bioseq->descr->source

->org->orgname->div

release varchar(10)

N

o

The name of the release

given as a command

line input to the mother

parser

135

type varchar(15)

N

o The type of molecule

that this record refers to.

protein FillASNDB calls GetType.

See Bioseq-> mol

SHoundMoleculeType

seqhound@blueprint.org Version 3.3

The SeqHound Manual 77 of 421 18/04/2005

asndb indices

Keyname Type Field

PRIMARY PRIMARY gi

iasndb_rowid INDEX rowid

iasndb_ts INDEX ts

iasndb_gi INDEX gi

seqhound@blueprint.org Version 3.3

The SeqHound Manual 78 of 421 18/04/2005

parti table

Last updated April 11, 2005

Note: This table is documented for historical purposes and may be deprecated in a future release. PARTI was required

by codebase as an index to the ASNDB table.

Database: seqhound

Table: parti

Module: core

Definition: This table maps the gi to the NCBI division.

parti table

Field Type Null Default Column_Definition Example Source API

rowid int(11)

N

o MySQL autoincrement

column.

ts timestamp Yes CURRENT_TI

MESTAMP

MySQL timestamp

column

2005-03-23 12:32:16

gi int(11)

N

o 0

GenInfo Identifier 21676275 FillASNDB calls

AppendRecordPARTI. See

Bioseq->Seqid->gi (choice 12)

division char(15) Yes

N

ULL

N

CBI Division PRI GetDivisionFromGBBlock. See

Bioseq->descr->source

->org->orgname->div

seqhound@blueprint.org Version 3.3

The SeqHound Manual 79 of 421 18/04/2005

parti indices

Keyname Type Field

PRIMARY PRIMARY gi

iparti_rowid INDEX rowid

iparti_ts INDEX ts

iparti_gi INDEX gi

iparti_div INDEX division

seqhound@blueprint.org Version 3.3

The SeqHound Manual 80 of 421 18/04/2005

nucprot table

Last updated April 11, 2005

Database: seqhound

Table: nucprot

Module: core

Definition: This table maps the gi of a protein to the gi of its encoding DNA.

Observation: A DNA may map to more than one protein. A protein can only map to one DNA. Not every DNA may

map to a protein, eg synthetic DNA fragments typically do not have a protein. In a bioseqset structure of

type nucprot, the first bioseq record is a nucleic acid, subsequent bioseq records correspond to proteins.

In such a case, the entries added to the nucprot database would be:

record 1) gi_1 gi_2

record 2) gi_1 gi_3

record 3) gi_1 gi_4

Each record would be a mapping of the nucleic acid (gi_1) to each protein (gi_2 ... gi_N)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 81 of 421 18/04/2005

nucprot table

Field Type Null Default Column_Definition Source API

rowid int(11) MySQL autoincrement

column.

Example

N

o

gin

N

o 0 27464927 FillNUCPROT. See SHoundDNAFromProtein

timestamp Yes MySQL timestamp column

2005-03-23 12:32:16

int(11)

N

o

int(11) DNA GenInfo Identifier Bioseq->Seqid->gi (choice 12)

CURRENT_

TIMESTAM

P

ts

0 Protein GenInfo Identifier 27464928 FillNUCPROT. See

Bioseq->Seqid->gi (choice 12)

SHoundProteinFromDNA

gia

nucprot indices

Keyname Type Field

PRIMARY PRIMARY gia

inucprot_rowid INDEX rowid

inucprot_ts INDEX ts

inuc_gin INDEX gin

inuc_gia gia INDEX

seqhound@blueprint.org Version 3.3

The SeqHound Manual 82 of 421 18/04/2005

accdb table

Last updated April 11, 2005

Database: seqhound

Table: accdb

Definition: This table maps gi's to accession numbers and other identifiers (a combination of a database name and

identifier) found in GenBank sequence records.

Observation Note that gi is not unique in ACCDB. Also note that a gi does not always have an accession associated

with it. There are 2 kinds of records in ACCDB. In one case the gi is associated with an accession and in

the other case, the gi is associated with a database name and some identifier from that database (name

column). Any given gi may have both record types and may be associated with more than one name.

Note that internal NCBI databases may be stored in ACCDB, for example NCBI_GENOMES. In this

case the record will not have an accession, but it will have a name.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 83 of 421 18/04/2005

accdb table

Field Type Null Default Column_Definition Example Source API

rowid int(10)

N

o MySQL autoincrement

column.

ts timestamp Yes

CURRENT_TI

MESTAMP MySQL timestamp column 2005-03-23 12:32:16

gi int(11)

N

o

0 GenInfo Identifier 2313082 FillACCDB calls GetGi. See

Bioseq->seqid->gi

SHoundGiFromGBAcc

(ShoundFindAcc)

db varchar(15)

N

o

The source database of the

sequence record.

See note 1 below.

pdb FillACCDB. See

Bioseq->Seq-Id->dbtag->db

SHoundDbNameAndIdListFrom

GBAcc

name varchar(30)

N

o This is an accession number

from a foreign database.

1AAP See note 2 below.

SHoundDbNameAndIdListFrom

GBAcc

(ShoundFindName)

(ShoundGetNameByGi)

namelow varchar(30) Yes

N

ULL

Same as name but in lower

case. Present for historical

reasons and may probably be

removed in the future.

1aap See note 2 below.

access varchar(20)

N

o

Genbank Accession. The

primary sequence record

identifier.

n/a WriteACCDB. See Bioseq-

>Seq-id->Textseq-id-

>accession. This field is n/a

for sequences from PDB and

other databases. This field is

only relevant when name field

is n/a.

SHoundGBAccFromGi

(ShoundAccFromGi)

chain varchar(20) Yes

N

ULL

This describes which one of

(possibly) many chains in a

structure a sequence refers to.

This only refers to PDB

sequences.

A FillACCDB. See

Bioseq->Seq-id->PDB-seq-id-

>chain. See note 5 below.

SHoundGiFromPDBchain

seqhound@blueprint.org Version 3.3

The SeqHound Manual 84 of 421 18/04/2005

release varchar(20) Yes

N

ULL release date of the record Sep 14, 1990 See note 3 below.

SHoundSeqIdFromGi

version int(11) Yes

N

ULL version of the record

0 See note 4 below.

This value cannot be directly

retrieved from the table by the

API but a Seq-id can be retrieved

using ShoundSeqIdFromGi

given a gi.

title text Yes

N

ULL a brief description of the

sequence

Chain A, Protease

Inhibitor Domain Of

Alzheimer's Amyloid

Beta-Protein Precursor

(APPI)

FillACCDB calls

CreateDefline.

Notes

1.

A complete list of db names that may appear in this column is listed under API supplementary material at

http://www.blueprint.org/seqhound/apisupplement.html. Some common db name abbreviations are listed below.

gb means GenBank

embl means EMBL

pir means Protein Information Resource (PIR)

sp means Swiss-Prot

pbs means other database

ref means RefSeq

dbj means DNA Database of Japan (DDBJ)

prf means PRF

pdb means Protein Data Bank (PDB)

tpe means third party annotation from embl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 85 of 421 18/04/2005

tpg means third party annotation from genbank

tpd means third party annotation from ddbj

other means some other database

2.

WriteAccdb.

For all db’s except PDB and ‘other’:

Bioseq->Seq-id->Textseq-id->name

For PDB sequences:

Bioseq->Seq-id->PDB->seq-id->PDB-mol-id (4 characters)

For other databases:

Bioseq->Seq-id->Dbtag->Object-id->string or integer

3.

FillACCDB, given a Bioseq pointer, retrieves a Seq-id pointer. If the Seq-id is of the type, pdb (protein data bank), then a PDB-seq-

id ptr is passed to WriteACCDB which retrieves the value of rel. If the Seq-id is of the type, genbank, embl, pir, swissprot, other,

ddbj, prf, tpg, tpd or tpe then a Textseq-id pointer is passed to WriteACCDB which retrieves the value of release. If the Seq-id is of

the type, giimport then a Giimport-id ptr is passed to WriteACCDB which retrieves the value of release. In all cases, the formatting

of the date is attempted with NCBI's function DatePrint (like this 07-OCT-2004). If the Formatting cannot be done, then the string

is just copied.

Bioseq->Seq-id->Textseq-id->release (an INTEGER) for all databases besides PDB, Giimport and ‘other’

Bioseq->Seq-id->PDB->seq-id->rel (a string or an NCBI data type) for PDB sequences

Bioseq->Seq-id->Giimport-id->release (a string) for Giimport

n/a for all ‘other’ databases.

4.

FillACCDB, given a Bioseq pointer, retrieves a Seq-id pointer. If the Seq-id is of the type, genbank, embl, pir, swissprot, other,

ddbj, prf, tpg, tpd or tpe then a Textseq-id pointer is passed to WriteACCDB which retrieves the value of version. For all other

databases, this value is n/a. See

seqhound@blueprint.org Version 3.3

The SeqHound Manual 86 of 421 18/04/2005

5.

Sequences that are part of structures will have a PDB identifier. Since there may be more than one chain (sequence) in a structure, the

PDB identifier must be accompanied by a chain identifier to uniquely identifiy a chain within the structure. A PDB identifier

supplemented with a chain identifier will be associated with a single sequence record (GI). For example, see PDB id “9XIM” and

chain A; this corresponds to GI sequence 443580. See example ASN.1 record for 9XIM_A in the Appendix.

Chain in the ASN.1 annotation is of type character, not of type string so it appears in its decimal form. So 65 is A and so on.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 87 of 421 18/04/2005

accdb indices

Keyname Type Field

PRIMARY PRIMARY gi

db

access

iaccdb_rowid INDEX rowid

iaccdb_ts INDEX ts

iaccdb_gi INDEX gi

iaccdb_db INDEX db

iaccdb_name INDEX name

iaccdb_namelow INDEX namelow

iaccdb_acc INDEX access

seqhound@blueprint.org Version 3.3

The SeqHound Manual 88 of 421 18/04/2005

histdb table

Last updated April 11, 2005

Database: seqhound

Table: histdb

Module: hist

Definition: A history table of gis. Includes the date and any action taken on them.

Observation: As of release 4.0, histdb is no longer populated by the histparser and update. Instead, mother populates

this table, both during the initial build and the update. During the initial build, gi-accession number pairs,

the version number, filename and date are stored in histdb with an action ACTION_ADDED.

During updates, mother extracts the gi-accession number pairs from ACCDB. If neither the gi nor the

accession are found, then this is a new record (ACTION_ADDED). If the accession is found in

SeqHound but is associated with a different gi, then the record represents a changed record

(ACTION_CHANGED), and if the gi-accession pair is present in SeqHound, then the record represents a

deleted record (ACTION_KILLED). The gi, accession, action along with the version, filename and date

(current date update executed) are logged in histdb. Gi's with ACTION_ADDED will get added to accdb,

nucprot, taxgi, pubseq, sengi, sendb, parti by mother. Gi's with ACTION_CHANGED will be deleted and

the new updated information will be added to the core databases. For gi's with ACTION_KILLED, the

old record information in the core databases is now obsolete and will be deleted from the core databases

and the new information will be added to the databases . For any gi that is killed, there will be another gi

(with the same accession) that is added.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 89 of 421 18/04/2005

histdb table

Field Type Null Default Column_Definition Example Source API

rowid int(11)

N

o MySQL autoincrement column.

ts timestamp Yes

CURRENT_

TIMESTAM

P

MySQL timestamp column

2005-03-23 12:32:16

gi int(11)

N

o 0 GenInfo Identifier

21676275 GetGi is called by ToBioseqUp in the

initial build and by ToBioseqUp in

updates. See:

Bioseq-->Seqid->id (choice 12)

oldgi int(11) Yes

N

ULL

The gi previously associated with this

accession. Not used during the initial

build.

10954454 ToBioseqUp retrieves the gi from ACCDB

using GetACCDBRecord.

access char(20)

N

o

The genbank accession from ACCDB

for the gi. If the accession is ‘n/a’,

then the ‘name’ is written here.

See details in accdb table description.

N

C_001911 ToBioseqUp retrieves the accession or

name from ACCDB using

GetACCDBRecord.

version int(11) Yes

N

ULL Version of the record. See details in

accdb table description.

1 ToBioseq and ToBioseqUp call

GetSeqIdInfoFromBioseq. See Bioseq-

>Seq-id->Textseq-id->version.

date date

N

o 0000-00-00 current date when histdb is updated in

the form (YYYYMMDD)

20040128 Mother uses the current date.

action int(11)

N

o 0

the action taken on the record:

ACTION_ADDED,

ACTION_KILLED,

ACTION_CHANGED

ACTION_CHANGED ToBioseq and ToBioseqUp calls

LogHistory.

filename Varchar(80)

N

o The name of the file that contains this

gi for debugging purposes.

rsnc0311.bna Main passes this to ToBioseq and

ToBioseqUp.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 90 of 421 18/04/2005

histdb indices

Keyname Type Field

ihistdb_rowid INDEX rowid

ihistdb_ts INDEX ts

ihistdb_gi INDEX gi

ihistdb_date INDEX date

ihistdb_action INDEX action

ihistdb_acc INDEX access

ihistdb_filename INDEX filename

ihistdb_oldgi INDEX oldgi

seqhound@blueprint.org Version 3.3

The SeqHound Manual 91 of 421 18/04/2005

pubseq table

Last updated April 11, 2005

Database: seqhound

Table: pubseq

Definition: This database maps a geninfo identifier to the medline/pubmed article(s) found in the sequence record

indicating who first published the sequence.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 92 of 421 18/04/2005

pubseq table

Field Type Null Default Column_Definition Example Source API

rowid int(11)

N

o MySQL autoincrement column.

ts timestamp Yes

CURRENT_

TIMESTAM

P

MySQL timestamp column

2005-03-23 12:32:16

ts

gi int(11)

N

o 0 GenInfo Identifier

25992759 ToBioseq calls GetGi.

See Bioseq-->Seqid

(choice 12)

SHoundGiFromReferenceID

SHoundGiFromReferenceList

muid int(11)

N

o

0

Medline identifier found in this

record (points to same article as

pmid (if listed)).

99091706 FillPUBSEQ calls

GetMuid. See Seq-

entry->Bioseq-

>Seqdescr->Pubdesc-

>Pub-equiv->Pub-

>muid (choice 4)

SHoundGetReferenceIDFromGi

SHoundGetReferenceIDFromGiList

SHoundMuidFrom3D

SHoundMuidFrom3Dlist

(ShoundMuidFromGi)

(ShoundMuidFromGiList)

pmid int(11)

N

o

0

Pubmed identifier found in this

record (points to same article as

muid (if listed))

9873079 FillPUBSEQ calls

GetMuid. See Seq-

entry->Bioseq-

>Seqdescr->Pubdesc-

>Pub-equiv->Pub-

>pmid (choice 13)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 93 of 421 18/04/2005

pubseq indices

Keyname Type Field

PRIMARY PRIMARY gi

p

mi

d

ipubseq_rowid INDEX rowid

ipubseq_ts INDEX ts

ipubseq_gi INDEX gi

ipubseq_muid INDEX muid

ipubseq_pmid INDEX

p

mid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 94 of 421 18/04/2005

taxgi table

Last updated April 11, 2005

Database: seqhound

Table: taxgi

Module: core

Definition: This table associates a gi with a taxon identifier.

Observation.: Gis are stored in taxgi even if they are not associated with any taxonomy id. This occurs, for example, for

some patent gis.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 95 of 421 18/04/2005

taxgi table

Field Type Null Default Column_Definition Example Source API

rowid int(11)

N

o MySQL autoincrement

colulmn.

ts timestamp Yes

CURRENT_T

IMESTAMP MySQL timestamp column 2005-03-23

16:50:42

gi int(11)

N

o 0

GenInfo Identifier 42565179 WriteTAXGI calls GetGi.

See Bioseq-->Seq-id (choice 12)

SHoundDNAFromTaxID()

SHoundDNAFromTaxIDList()

SHoundProteinsFromTaxID()

SHoundProteinsFromTaxIDList()

SHoundProteinsFromChromosome()

SHoundProteinsFromChromosomeList()

SHoundDNAFromChromosome()

SHoundDNAFromChromosomeList()

taxid int(11)

N

o 0 Taxonomy Identifier from

N

CBI’s taxonomy database.

3702 See note 1 below. SHoundTaxIDFromGi()

SHoundTaxIDFromGiList()

kloodge int(11) Yes

N

ULL Blueprint chromosome id from

the chrom table.

392 See note 2 below.

type varchar(50) Yes

N

ULL

Type of macromolecule

DNA, RNA, protein, NA,

other, "not specified".

protein WriteTaxgi calls GetType. See

Bioseq->mol (see note 3 below).

SHoundDNAFromChromosome()

SHoundDNAFromChromosomeList()

SHoundProteinsFromChromosome()

ShoundProteinsFromChromosomeList()

Notes:

1.

FillTAXGI calls GetTaxId. See Bioseq-->Seq-desc->BioSource->Org-ref->Dbtag.db == "taxon"

Bioseq-->Seq-desc->BioSource->Org-ref->Dbtag.tag

2.

Mother: ToBioseq calls SearchCHROMByChromNum.

Postcomgen: GetAllRecordsFromGICHROMID and UpdateTaxgiWithKloodge

seqhound@blueprint.org Version 3.3

The SeqHound Manual 96 of 421 18/04/2005

3.

DNA = 1

RNA = 2

Protein = 3

NA = 4

Other = 255

taxgi indices

Keyname Type Field

PRIMARY PRIMARY gi

itaxgi_rowid INDEX rowid

itaxgi_ts INDEX timestamp

itaxgi_gi INDEX gi

itaxgi_taxid INDEX taxid

itaxgi_kl INDEX kloodge

itaxgi_type INDEX type

seqhound@blueprint.org Version 3.3

The SeqHound Manual 97 of 421 18/04/2005

sengi table

Last updated April 11, 2005

Database: seqhound

Table: sengi

Module: core

Definition: Seqentry to 'gi' conversion.

Observation: Seqhound keeps track of which Bioseqs come from which Bioseq-sets using the sengi table. In the sengi

table, each Bioseq is represented by a GI (a unique identifier for the sequence record) and each Bioseq-set

is represented by the first GI in the set of sequences, the seid.

Sengi also stores the division for historical purposes.

sengi table

Field Type Null Default Column_Definition Example Source API

rowid int(10)

N

o MySQL autoincrement column

ts timestamp Yes

CURRENT_TI

MESTAMP MySQL timestamp column 2005-03-23 12:32:16

gi int(11)

N

o

0 GenInfo Identifier 2313082 ToBioseqSEQENTRY calls GetGi. See Seq-entry->Bioseq-

set->Bioseq-->Seqid (choice 12)

seid int(11)

N

o 0

GenInfo Identifier: the first

one in the bioseq-set. seid

means seqentry id.

231308 ToBioseq calls GetGi for the first Bioseq in a SeqEntry See

Seq-entry->Bioseq-set->Bioseq-->Seqid (choice 12).

division char(15) Yes

N

ULL Genbank division. PRI ToBioseqSEQENTRY calls GetDivisionFromPARTI.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 98 of 421 18/04/2005

sengi indices

Keyname Type Field

PRIMARY PRIMARY gi

isengi_rowid INDEX rowid

isengi_ts INDEX ts

isengi_gi INDEX gi

isengi_seid INDEX seid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 99 of 421 18/04/2005

sendb table

Last updated April 11, 2005

Database: seqhound

Table: sendb

Module: core

Definition: SeqEntry Data Base

Observation: Sequences are distributed by GenBank in packages. Each package is referred to as a seq-entry.

A seq-entry may contain either a single sequence record (called a Bioseq) or a set of sequence records

(called a Bioseq-set). If a Seq-entry contains a Bioseq-set then that Bioseq-set contains a list of Seq-entry

packages (yes, this data structure is recursive). Each of these Seq-entry packages contains a single

sequence record (a Bioseq). There is annotation that is associated with single sequence records (Bioseqs).

An example of annotation is a list of authors who are responsible for submitting a sequence record. There

is also annotation associated with sets of sequence records (Bioseq-sets). This type of annotation applies

to all of the sequence records that are in the set. For example a set of authors may be responsible for all

of the sequence records in the set.

Seqhound stores Bioseqs and their associated GenInfo id's in one central table (called asndb). Since each

of these Bioseqs may have come from a Bioseq-set, Seqhound needs a way to store the Bioseq-set

associated annotation (that applies to each of the Bioseqs in the set). To accomplish this, Seqhound takes

the Bioseq-set, removes the Bioseqs that it contains, and stores the remainder of the Bioseq-set

(annotation) in the sendb table. Seqhound keeps track of which Bioseqs come from which Bioseq-sets

using the sengi table. In the sengi table, each Bioseq is represented by a GI (a unique identifier for the

sequence record) and each Bioseq-set is represented by the first GI in the set of sequences (the seid).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 100 of 421 18/04/2005

sendb table

Field Type Null Default Column_Definition Example Source API

rowid int(10)

N

o Auto number row identifier

ts timestamp Yes

CURRENT_TI

MESTAMP MySQL timestamp column 2005-03-23 12:32:16

seid int(11)

N

o 0 Gene Info identifier for the first

bioseq in the SeqEntry.

2313082 ToBioseq calls GetGi for the

first Bioseq in a SeqEntry.

Main calls

AppendRecordSENDB. See

Seq-entry->Bioseq-set-

>Seq-entry->Bioseq->Seqid

(choice 12)

asn1 mediumblob

N

o

ASN1 Binary SeqEntry. Note

that the bioseqs themselves have

been removed to save space.

binary data ToBioseqSeqEntry removes

the bioseqs. Main calls

AppendRecordSENDB.

These functions return the Seqentry

or BioseqSet stored in this field

after first replacing all of the

Bioseqs that belong in it.

SHoundGetBioseqSet()

SHoundGetSeqEntry()

sendb indices

Keyname Type Field

PRIMARY PRIMARY seid

isendb_rowid INDEX rowid

isendb_ts INDEX ts

isendb_seid INDEX seid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 101 of 421 18/04/2005

chrom table

Database: seqhound

Table: chrom

Module: core

Definition: This table maps taxonomy ids with their complete genome information (chromid, chromflag, accession, and name)

Observation: This table keeps track of all completely sequenced chromosomes of a particular organism (taxid). We also store the NCBI

chromid (chromid), the type of sequence (chromflag), the name (name) and accession number (access).

Every chromid will have a corresponding kloodge identifier (this is Blueprint’s internal version of the NCBI chromid. In

addition, every taxon will have one kloodge identifier that has no corresponding chromid. This kloodge id is assigned to any

proteins or RNAs that are have no chromid; i.e. they have not been assigned to a chromosome yet. The kloodge is the value of

an autoincrement column and is written to taxgi by postcomgen.

We also store the chromosome number (chromnum) as text. This is used so that we can assign records which have no complete

genome information other than taxid and chromosome number to a chromosome identifier.

For example, let’s say that chromosome 4 (circular plasmid, named chr_4, access XIV) of species “species X” (taxid 123) is

completely sequenced and that NCBI has assigned this chromosome a chromid of 191. Let us say our internal kloodge id for

this same chromosome is 1254. Then a record is added to the chrom table with fields:

kloodge: 1254

taxid: 123

chromid: 191

chromflag: plasmid

accession: XIV

name: species X chr_4

chromnum: chr 4

See more example entries in the notes section below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 102 of 421 18/04/2005

chrom table

Field Type Null Default Column_Definition Example1Source API

kloodge int(11)

N

o

Blueprint chromosome

id (used internally).

Mysql Autoincrement

column.

568

ts timestamp

N

o CURRENT_T

IMESTAMP

2005-03-23

16:17:11

taxid int(11)

N

o 0

Taxonomy identifier 9606 ToBioseq calls GetTaxid See

Bioseq-->Seq-desc->BioSource->Org-

ref->Dbtag->tag

SHoundAllGenomes

chromid int(11)

N

o 0

N

CBI chromsome

identifier (see note 2

below)

1 ToBioseq calls

GetChromIdFromBioseq. See Bioseq-

>SeqId->Dbtag->tag

chromflag int(11)

N

o 0

the type of chromosome

(chromosome, plasmid,

mitochrondria etc)

defined as byte macros

in intrez.h.

(see note 3 below)

1 ToBioseq calls GetGenomeInfo, See

Bioseq->descr->source->subtype-

>subtype

SHoundChromosomeFromGenome

access varchar(20)

N

o NULL

Accession for the

sequence (see note 2

below).

N

C_000001 GetGenomeInfo calls

Misc_GetAccession. See Bioseq->Seq-

id->Textseq-id->accession

name text

N

o

The scientific name of

the organism with the

chromosome name.

Homo sapiens

Chromosome 1

GetGenomeFromBiosource constructs

this from various places in the

Biosource. See Bioseq->descr->source

chromnum varchar(10) Yes

N

ULL

Chromosome number as

text.

1 GetGenomeFromBiosource constructs

this from various places in the

Biosource subtype. See Bioseq->descr-

>source->subtype

seqhound@blueprint.org Version 3.3

The SeqHound Manual 103 of 421 18/04/2005

Notes.

1.

Example entries from the chrom table.

kloodge ts taxid chromid chromflag access name chromnum

1 '2005-03-23 12:33:24' 9 13723 9 'NC_001911' 'Buchnera aphidicola Plasmid pLeu-Dn' 'pLeu-Dn'

2 '2005-03-23 12:33:24' 9 17119 9 'NC_004843' 'Buchnera aphidicola Plasmid pBPS1' 'pBPS1'

567 '2005-03-23 16:06:41' 9606 -9606 1 'Homo sapiens Chromosome Un' 'Un'

568 '2005-03-23 16:17:11' 9606 1 1 'NC_000001' 'Homo sapiens Chromosome 1' '1'

569 '2005-03-23 16:17:12' 9606 2 1 'NC_000002' 'Homo sapiens Chromosome 2' '2'

570 '2005-03-23 16:17:12' 9606 3 1 'NC_000003' 'Homo sapiens Chromosome 3' '3'

592 '2005-03-23 16:35:05' 9606 2000021 5 'AC_000021' 'Homo sapiens Mitochondria'

4289 '2005-04-05 09:29:55' 287944 18301 5 'NC_006916' 'Harpiosquilla harpax Mitochondria'

4290 '2005-04-05 09:29:55' 302098 18316 5 'NC_006931' 'Eubalaena japonica Mitochondria'

Note that the third entry down represents an entry corresponding to an “unknown” chromosome. Proteins or RNAs that have not been

mapped to a chromosome will be assigned this kloodge identifier (567) in the accdb and table.

2.

Kloodge identifiers representing unknown chromosmomes will not have a corresponding chromid from NCBI. The negative value of

the taxid is used instead. See third entry in example table above. These entries will also have no accession associated with them.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 104 of 421 18/04/2005

notes continued…

3.

CHROM_PHAGE (phage sequence)

CHROM_NR

CHROM_ECE

CHROM_PLMD (plasmid sequence)

CHROM_CHLO (chloroplast sequence)

CHROM_MITO (mitochrondrial sequence)

CHROM_CHROM (chromosome sequence)

CHROM_ALL

chrom indices

Keyname Type Field

PRIMARY PRIMARY kloodge

ichrom_chromid INDEX chromid

ichrom_taxid INDEX taxid , chromnum

ichrom_kl INDEX kloodge

ichrom_acc INDEX access

Ichrom_chromnum INDEX chromnum

seqhound@blueprint.org Version 3.3

The SeqHound Manual 105 of 421 18/04/2005

gichromid table

Last updated April 11, 2005

Database: SeqHound

Table: gichromid

Module: core

Definition: Stores RNA and protein gis that are part of a complete genome. Used to

update taxgi with the kloodge.

Observation: This table is used internally to store RNA and Protein gis parsed from

contigs that are part of a complete genome. After a complete build is

done, the gis and chromids from this table are used by postcomgen to

update taxgi with the kloodge.

gichromid table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Primary key. Mysql auto-

increment column.

ts timestamp Yes CURRENT_TIMESTAMP

MySQL timestamp

column

2005-03-23 12:32:16

gi int(11)

N

o 0 Gene Info identifier

231308 ToBioseq calls

FillGiChromid. See Bioseq--

>Seqid (choice 12)

chromid int(11)

N

o 0

N

CBI chromosome id 3144

N

CBI chromid

contiggi int(11) Yes

N

ULL Gi of the contig that

contains this gi.

231308 ToBioseq calls

GetChromidFromBioseq.

See Bioseq->Seqid(choice

11)->tag->id

gichromid indices

Keyname Type Field

seqhound@blueprint.org Version 3.3

The SeqHound Manual 106 of 421 18/04/2005

id PRIMARY gi

gi INDEX gi

igits INDEX ts

igichromid INDEX chromid

igicontiggi INDEX contiggi

seqhound@blueprint.org Version 3.3

The SeqHound Manual 107 of 421 18/04/2005

contigchromid table

Last updated April 11, 2005

Database: SeqHound

Table: contigchromid

Module: core

Definition: Stores contig gis that are part of a complete genome. Used to update

taxgi with the kloodge.

Observation: This table is used internally to store lists of GI’s that make up a contig

sequence for a complete genome. After a complete build is done, the gis

and chromids from these contigs are used by postcomgen to update

taxgi with the kloodge.

contigchromid table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Mysql auto-increment column.

ts timestamp Yes

CURRENT_TI

MESTAMP MySQL timestamp column 2005-03-23 12:32:16

contiggi int(11)

N

o 0 Gene Info identifier for this sequence record

that is part of a contig.

50754263 ToBioseq calls GetGi. See Bioseq--

>Seqid (choice 12)

topgi int(11) Yes

N

ULL

Genbank gi of the contig that contains

multiple sequences (i.e. multiple

contiggi’s).

51039201 ToBioseq calls GetGi. See Bioseq--

>Seqid (choice 12)

chromid int(11)

N

o 0

N

CBI chromosome id.

461 ToBioseq calls

GetChromidFromBioseq. See Bioseq-

>Seqid(choice 11)->tag->id

changed Int(11) Yes

N

ULL

Indicates whether the record needs to be

processed by postcomgen. Used by mother

in update mode.

1 = process

2 = do not process

ToBioseq calls FillContigChromid. See

Bioseq->Seqid(choice 11)->tag->id

contigchromid indices

seqhound@blueprint.org Version 3.3

The SeqHound Manual 108 of 421 18/04/2005

Keyname Type Field

contiggi PRIMARY id

icontigts INDEX ts

contigi UNIQUE contiggi

iccontigi INDEX contiggi

ictopgi INDEX topgi

seqhound@blueprint.org Version 3.3

The SeqHound Manual 109 of 421 18/04/2005

gichromosome table

Last updated April 11, 2005

Database: SeqHound

Table: gichromosome

Module: core

Definition: Stores RNA and protein gis that have a chromosome name, but no other

complete genome info.

Observation: This table is used internally to store RNA and protein gis that have a

chromosome name, but no other complete genome info. After a

complete build is done, postcomgen will write these gis to gichromid

with the appropriate chromid.

gichromosome table

Field Type Null Default Column

_

Definition Example Source API

id int(11)

N

o

Primary key Mysql

auto-increment

column.

ts timestamp Yes

CURRENT_TIM

ESTAMP

MySQL timestamp

column

2005-03-23 12:32:16

gi int(11)

N

o 0 GenInfo Identifier 231308 ToBioseq calls GetGi. See Bioseq-->Seqid

(choice 12)

chromnum varchar(10)

N

o Chromosome

number

1 ToBioseq calls GetGenomeInfo. See Bioseq-

>descr->source->subtype->name

seqhound@blueprint.org Version 3.3

The SeqHound Manual 110 of 421 18/04/2005

gichromosome indices

Keyname Type Field

id PRIMARY id

igichrom_ts INDEX ts

gi UNIQUE gi

igichrom_gi INDEX gi

igichrom_num INDEX chromnum

seqhound@blueprint.org Version 3.3

The SeqHound Manual 111 of 421 18/04/2005

contigchromosome table

Last updated April 11, 2005

Database: SeqHound

Table: gichromosome

Module: core

Definition: Stores contig gis that have a chromosome name, but no other complete

genome info.

Observation: This table is used internally to store contig gis that have a chromosome

name, but no other complete genome info. After a complete build is

done, postcomgen will writethese gis to contigchromid with the

appropriate chromid.

contigchromosome table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Primary key Mysql auto-

increment column.

ts timestamp Yes

CURREN

T_TIMES

TAMP

MySQL timestamp column

2005-03-23 12:32:16

contiggi int(11)

N

o GenInfo gi 231308 ToBioseq calls GetGi. See Bioseq-->Seqid

(choice 12)

chromnum Varchar(10)

N

o Chromosome number 1 ToBioseq calls GetGenomeInfo. See Bioseq--

>Seqid (choice 12)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 112 of 421 18/04/2005

contigchromosome indices

Keyname Type Field

id PRIMARY id

igichrom_ts INDEX ts

contiggi UNIQUE contiggi

icontigchrom_gi INDEX contiggi

icontigchrom_num INDEX chromnum

seqhound@blueprint.org Version 3.3

The SeqHound Manual 113 of 421 18/04/2005

Redundant protein sequences (redundb) module

Last updated: August 4, 2004

redund parser

purpose:

The redund parser builds the redundb module, which consists of the redund table. The input file consists of information pertaining to

redundant GIs, accession numbers and the sequence information which the GIs refer to. The resulting data table contain information

on the GIs, redundant GI groups and the ranking of each GI within their redundant group.

module:

redundb

input files:

nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

table created:

redund

source code location:

slri/seqhound/parsers/redund.c

config file dependencies:

slri/seqhound/config/.intrezrc (UNIX platform)

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

seqhound@blueprint.org Version 3.3

The SeqHound Manual 114 of 421 18/04/2005

;this should be set to 1 to allow usuage of the redundb

;redundant protein sequences

redundb = 1

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] redundb should be 1.

commandline parameters

Typing ./redund at the command line will return a list of command line parameters and default settings. For example:

>./redund

redund arguments:

-i Input non-redundant database fasta file [File In]

-n Initialize database file [T/F] Optional

default = F

example use:

./redund -i nr -n T

associated scripts:

nrftp.pl: retrieves the relevant data file from NCBI's ftp site

error & run-time logs:

redund writes to a log file called redundlog

additional information:

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

See data table descriptions for each of the tables that are listed under “tables altered”

seqhound@blueprint.org Version 3.3

The SeqHound Manual 115 of 421 18/04/2005

redund table

Last updated: August 4, 2004

Database: seqhound

Table: redund

Module: redundb

Definition: Partitions GIs into groups based on sequence redundancy, and ranks

each GI within their group.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

gi int(11)

N

o 0 GenInfo Identifier

rordinal int(11)

N

o 0 Position in redundant group

rgroup int(11) Yes

N

ULL ID of redundant group(not static)

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY gi

iredund_rowid INDEX rowid

iredund_gi INDEX gi

iredund_ordinal INDEX rordinal

iredund_rgroup INDEX rgroup

seqhound@blueprint.org Version 3.3

The SeqHound Manual 116 of 421 18/04/2005

Observation: redund is a database table that places GIs with equivalent sequences

into redundant groups. The same sequence may be worked on by

independent researchers. Each discovery is submitted and assigned

a different gi. When the discoveries concern a common sequence,

the GIs point to sequence records that describe redundant sequences,

i.e. a single sequence has multiple GIs. These GIs then belong to a

common group (rgroup). In such cases, it is common for one

particular GI to be used more than the others because it is better

annotated. The ordinal (rordinal) is a ranking of the GIs, 1 for the

b

est annotated GI, 2 for the second, 3 for the third etc. This ordering

is determined by the ordering of the sequence headers in the nr

source file. These 3 pieces of information form the redund database

table.

Source org: NCBI

Source file: nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

parser: redund

***gi***

description: GenInfo Identifier for a sequence record.

example: 7228451

default value: 0

source: parsed from nr (line 1 of record) as token(|) delimited fields

parser: redund

function: LabelGI, AssignRedund (redund.c)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 117 of 421 18/04/2005

API: SHoundRedundantGroup (returns all GIs in the same group as

the input GI)

SHoundRedundantGroupFromID (returns GIs from the group

with input group ID)

SHoundFirstOfRedundantGroupFromID (GI of rank 1,

given a group)

SHoundIsNRFirst (true/false, is this GI is ranked 1 in it's

group)

***rordinal***

description: rank position in redundant set

example: 1

default value: n/a

source: based on the position of the gi within the record, the first gi has

rordinal 1, the 2nd has ordinal 2

parser: redund

function: LabelGI, AssignRedund (redund.c)

API: SHoundIsNRFirst (tells if a gi has rordinal = 1)

***rgroup***

description: redundant group the gi is a member of

example: 1

default value: n/a

source: rgroup IDs are assigned to GIs as the input file is parsed. Each GI

encountered is assigned the rgroup ID of those GIs already

encountered which have an identical sequence. If no such GIs have

been encountered, the GI is assigned a new rgroup ID one number

higher than the largest current rgroup ID in the database. The first

GI encountered is given an rgroup ID of 1.

parser: redund

seqhound@blueprint.org Version 3.3

The SeqHound Manual 118 of 421 18/04/2005

function: LabelGI, AssignRedund (redund.c)

API: SHoundRedundantGroupIDFromGI[List] (returns the ID of

the rgroup that the input GI is a part of)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 119 of 421 18/04/2005

Complete genomes tracking (gendb) module

As of release 4.0 of SeqHound, the complete genomes module has been moved into the core module. The initial build and updates to

these tables are handled by the mother and postcomgen parsers. See the description of the core module above for more details.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 120 of 421 18/04/2005

Taxonomy hierarchy (taxdb) module

Last updated August 18, 2004

importtaxdb parser

purpose:

The importtaxdb parser builds the taxonomy module, which consist of the taxdb, gcodedb, divdb, del and merge table. The input file

consists of taxonomic information such as the taxonomy nodes, names, division, genetic codes, deleted nodes and merged nodes.

Table taxdb holds taxonomy ids and a binary file associated with each id. Table gcodedb holds genetic code ids and a binary file

associated with each id. Table divdb holds division ids and a binary file associated with each id. Table del holds the taxonomy id of

the deleted nodes. Table merge holds the taxonomy ids of nodes which has been merged and which is result of merging.

module:

taxdb

input files:

taxdump.tar.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

tables created:

taxdb

gcodedb

divdb

del

merge

source code location:

slri/seqhound/taxon/importtaxdb.c

config file dependencies:

slri/seqhound/config/.intrezrc (UNIX platform)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 121 of 421 18/04/2005

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;taxonomy hierarchy

taxdb = 1

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] taxdb should be 1.

commandline parameters

There are no command line parameters. Typing ./importtaxdb will run the program:

example use:

./importtaxdb

associated scripts:

taxftp.pl: retrieves the relevant data file from NCBI's ftp site

error & run-time logs:

importtaxdb writes to a log file called importtaxdb_log.txt

additional information:

readme files at ftp://ftp.ncbi.nih.gov/pub/taxnomy/

See data table descriptions for each of the tables that are listed under “tables altered”

seqhound@blueprint.org Version 3.3

The SeqHound Manual 122 of 421 18/04/2005

taxdb table

Database: seqhound

Table: taxdb

Module: taxdb

Definition: Maps a taxid to information about the taxid. The information about the

taxid is stored as an ASN blob.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

tid int(11)

N

o 0 Taxonomy identifier

asn mediumblob Yes

N

ULL Information about tid

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY tid

itax_rowid INDEX rowid

itax_tid INDEX tid

Observation: The entire taxonomy can be viewed as a hierarchical taxonomy tree.

The tree expresses the relationships between the nodes within the tree.

At the root of the tree is a generic node that provides no information.

Descendents of the root node include superkingdoms such as Archea,

Eubacteria, Eukaryota, Viroids, Viruses. etc. Each of these

superkingdoms have corresponding taxonomy children. Also included

in the taxonomy tree are artificial sequences and unclassified

taxonomies such as the prions, unidentified agents, etc. taxdb maps the

taxonomy identifiers to the relevant information describing the

taxonomies. Organizing the taxonomy as a tree allows users to request

seqhound@blueprint.org Version 3.3

The SeqHound Manual 123 of 421 18/04/2005

complete lineages, ancestors, children and permits the user to browse

the tree exploiting its tree structure.

taxdb contains information from both nodes.dmp and names.dmp

Source org: NCBI

Source files: nodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

names.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

parser: importtaxdb

***tid***

description: the unique taxonomy identifier.

example: 9606 (Human)

default value: n/a

source: nodes.dmp: column 1

parser: importtaxdb

function: Parser_TaxDBNodeRecord

API: SHoundGetTaxChildNodes

SHoundGetTaxChildNodesList

SHoundGetAllTaxProgeny

SHoundGetTaxParent

ShoundGetAllTaxAncestors

Note that although relationships (parent, child) are not

stored in the database, it is a part of the asn field, and as

such it makes it possible to retrieve parent, children,

ancestors and lineages (see asn blob).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 124 of 421 18/04/2005

***asn***

description: ASN blob containing information about the taxonomy.

Fields in the ASN blob include:

1) parent taxonomy of node

2) rank of node

3) embl code for node

4) division of node (see divdb)

5) inherited div flag (1 if this node inherits its division from its parent)

6) genetic code (see gcodedb)

7) inherited genetic code (1 if this nodes inherits its genetic code from

parent)

8) mitochondrial genetic code

9) inherited mit. genetic code (1 if nodes inherits mitochondrial gencode

from parent)

10) genbank hidden flag (1 if name is suppressed in GenBank entry

lineage)

11) subtree root flag (1 if this subtree has no sequence data yet)

12) comments

13) name of taxid (from names.dmp)

Possible rankings of node (there may be more):

superkingdom, kingdom

genus, subgenus

subspecies, species

subfamily, family, superfamily

phylum, subphylum,

subtribe, tribe

varietas

seqhound@blueprint.org Version 3.3

The SeqHound Manual 125 of 421 18/04/2005

infraorder, order, suborder

infraclass, class, subclass

no rank

example: A textual representation of the blob for taxid 9606:

SLRI-taxon ::= {

taxId 9606 ,

parent-taxId 9605 ,

children-taxId {

63221 } ,

names {

{

name "Homo sapiens" ,

name-class scientific-name } ,

{

name "human" ,

name-class other ,

other-class "genbank common name" } ,

{

name "man" ,

name-class common-name } } ,

rank {

rank species ,

premod none ,

postmod none } ,

embl-code "HS" ,

division 5 ,

inherited-div TRUE ,

gencode 1 ,

inherited-gencode TRUE ,

mito-gencode 2 ,

inherited-mito-gencode TRUE ,

genbank-hidden FALSE ,

hidden-subtree-root FALSE }

default value: n/a

source: nodes.dmp: column 2 - 13 and names.dmp: column 2, 3

parser: importtaxdb

function: Parser_TaxDBNodeRecord

API: The ASN object is not directly accessible through the API. Instead,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 126 of 421 18/04/2005

fields in the ASN object may be retrieved and be accessible from the

API. eg.

SHoundGetTaxNameFromTaxIDByClass

SHoundGetTaxNameFromTaxID

SHoundGetTaxLineageFromTaxID

You can also directly retrieve the ASN SLRITaxon by:

#include <taxdb_db.h>

AsnIoPtr aip = NULL;

SLRITaxonPtr ptax = NULL;

SHoundInit(FALSE, "name");

ptax = DB_GetTaxRec(9606);

aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);

SLRITaxonAsnWrite(ptax, aip, NULL);

AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 127 of 421 18/04/2005

gcodedb table

database: seqhound

table: gcodedb

Module: taxdb

definition: Maps a genetic code identifier to the corresponding ASN taxonomy

genecode record.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

gcid int(11)

N

o 0 genetic code

asn mediumblob Yes

N

ULL ASN taxon-gencode record

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY gcid

igcode_rowid INDEX rowid

igcode_gcid INDEX gcid

Observation: gcodedb is part of a group of databases (taxdb, gcodedb, divdb) that can

be used to form a taxonomy hierarchy tree. In the case of gcode, the

records map a genetic code identifier to an ASN object that holds

information about that genetic code. taxdb (see taxdb later) will specify

the type of genetic material each taxonomy uses, eg if the taxonomy has

a plasmid as a genetic code, mitochrondria DNA, or standard

chromosomal DNA etc. gcodedb houses information concerning each of

the genetic code. Relevant information include the translation table (the

mapping of nucleic acid codons to amino acids), & the start codon.

Further information at: http://www.ncbi.nlm.nih.gov/Taxonomy

seqhound@blueprint.org Version 3.3

The SeqHound Manual 128 of 421 18/04/2005

Source org: NCBI

Source file: gencode.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Parser: importtaxdb

***gcid***

description: The genetic code identifier. The identifiers may not be consecutive

because historically codes may have merged or new ones were created.

The code and their corresponding names:

0 Unspecified

1 Standard

2 Vertebrate Mitochondrial

3 Yeast Mitochondrial

4 Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate

Mitochondrial; Mycoplasma; Spiroplasma

5 Invertebrate Mitochondrial

6 Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear

9 Echinoderm Mitochondrial; Flatworm Mitochondrial

10 Euplotid Nuclear

11 Bacterial and Plant Plastid

12 Alternative Yeast Nuclear

13 Ascidian Mitochondrial

14 Alternative Flatworm Mitochondrial

15 Blepharisma Macronuclear

16 Chlorophycean Mitochondrial

21 Trematode Mitochondrial

22 Scenedesmus obliquus mitochondrial

seqhound@blueprint.org Version 3.3

The SeqHound Manual 129 of 421 18/04/2005

23 Thraustochytrium mitochondrial code

example: Because many of the taxonomy groups are bacterial, 11 is a common

genetic code.

default value: n/a

source: gencode.dmp: column 1

parser: importtaxdb

function: Parse_TaxDBGenCodeRecord

API: n/a

***asn***

description: An ASN blob for the genetic code information corresponding to the

genetic code identifiers. The individual fields in the blob are:

1) an optional abbreviation

2) name of genetic code (see gcid for names)

3) translation table

4) start codon

The translation table maps the nucleic acid codon to amino acids:

In the source file, the map is represented as:

FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

There are 64 characters in this map. Each position in the map

corresponds to a codon (see below) and the letter at that position

corresponds to the resulting amino acid that will be translated from that

codon. The same can map be better expressed as a direct mapping from

codon to amino acid where

1) first 3 letters represent a nucleic acid sequence (codon),

2) an amino acid letter identifier

3) an amino acid abbreviation

seqhound@blueprint.org Version 3.3

The SeqHound Manual 130 of 421 18/04/2005

TTT F Phe TCT S Ser TAT Y Tyr TGT C Cys

TTC F Phe TCC S Ser TAC Y Tyr TGC C Cys

TTA L Leu TCA S Ser TAA * Ter TGA * Ter

TTG L Leu TCG S Ser TAG * Ter TGG W Trp

CTT L Leu CCT P Pro CAT H His CGT R Arg

CTC L Leu CCC P Pro CAC H His CGC R Arg

CTA L Leu CCA P Pro CAA Q Gln CGA R Arg

CTG L Leu CCG P Pro CAG Q Gln CGG R Arg

ATT I Ile ACT T Thr AAT N Asn AGT S Ser

ATC I Ile ACC T Thr AAC N Asn AGC S Ser

ATA I Ile ACA T Thr AAA K Lys AGA R Arg

ATG M Met ACG T Thr AAG K Lys AGG R Arg

GTT V Val GCT A Ala GAT D Asp GGT G Gly

GTC V Val GCC A Ala GAC D Asp GGC G Gly

GTA V Val GCA A Ala GAA E Glu GGA G Gly

GTG V Val GCG A Ala GAG E Glu GGG G Gly

The start codon is the particular codon that signals the start of protein

translation for that particular genetic molecular (i.e. a standard

chromosome starts translation at a different codon than a protozoan

mitochrondrial DNA).

The start codon is shown in the same manner as the translation table:

----------------------------------MM----------------------------

This indicates that M (methionine) is the start codon.

Further information can be found at:

htpp://www.ncbi.nlm.nih.gov/Taxonomy in Genetic codes link.

example: A textually representation of a blob for genetic code 11:

SLRI-taxon-gencode ::= {

gencode-id 11 ,

name "Bacterial and Plant Plastid" ,

trans-table

"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGG

G " ,

start-codons "---M---------------M------------MMMM--------------

seqhound@blueprint.org Version 3.3

The SeqHound Manual 131 of 421 18/04/2005

-M------------ " }

default value: n/a

source: gencode.dmp: column 2, 3, 4, 5 are stored in the database as an ASN

SLRITaxonGencode structure.

parser importtaxdb

function: Parse_TaxDBGenCodeRecord

API: not directly accessible from the API, but you can retrieve the

SLRITaxonGencode object using:

#include <taxdb_db.h>

AsnIoPtr aip = NULL;

SLRITaxGencodePtr ptax = NULL;

SHoundInit(FALSE, "name");

ptax = DB_GetTaxGencodeRec(11);

aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);

SLRITaxGencodeAsnWrite(ptax, aip, NULL);

AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 132 of 421 18/04/2005

divdb table

Database: seqhound

Table: divdb

Module: taxdb

Definition: Maps taxonomy division ID to an ASN SLRI-taxon-div object

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

did int(11)

N

o 0 Taxonomy Division ID

asn mediumblob Yes

N

ULL ASN taxonomy division record

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY did

idiv_rowid INDEX rowid

idiv_did INDEX did

Observation: divdb is part of a group of databases (taxdb, gcodedb, divdb) that can be

used to form a taxonomy hierarchy tree. In the case of divdb, the

records map a division identifier to an ASN object that holds

information about that division. This information include a 3-character

GenBank division code, a division name and an optional comment.

There are 10 divisions (see below).

Source org: NCBI

Source file: division.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Parser: importtaxdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 133 of 421 18/04/2005

***did***

description: The division identifier (0-11, see asn for the names of division).

example: 0

source: division.dmp: column 1

parser: importtaxdb

function: Parse_TaxDBDivRecord

API: n/a

***asn***

description: The ASN blob for the taxonomy division. The individual fields in the

ASN blob are:

1) a 3 character division code (see below)

2) a division name (see below)

3) an optional comment

Possible division codes and corresponding division names are:

BCT Bacteria

INV Invertebrates

MAM Mammals

PHG Phages

PLN Plants

PRI Primates

ROD Rodents

SYN Synthetic

UNA Unassigned

seqhound@blueprint.org Version 3.3

The SeqHound Manual 134 of 421 18/04/2005

VRL Viruses

VRT Vertebrates

example: the entire field is one ASN blob. A possible blob could be represented

textually as:

SLRI-taxon-div ::= {

div-id 1 ,

div-code "INV" ,

div-name "Invertebrates" }

source: Column 2, 3, and 4 of division.dmp are stored as one ASN blob

(SLRITaxonDiv) created in-house.

parser: importtaxdb

function: Parse_TaxDBDivRecord (importtaxdb)

API: there is no API function that allows you to retrieve the entire ASN blob.

The information is used to construct information about a taxid, (see

taxdb). You can retrieve a TaxonDivRec using:

#include <taxdb_db.h>

AsnIoPtr aip = NULL;

SLRITaxonDivPtr ptax = NULL;

SHoundInit(FALSE, "name");

ptax = DB_GetTaxDivRec(1);

aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);

SLRITaxonDivAsnWrite(ptax, aip, NULL);

AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 135 of 421 18/04/2005

del table

Database: seqhound

Table: del

Module: taxdb

Definition: Contains all the deleted taxonomy identifiers.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

tid int(11)

N

o 0 Taxonomy ID

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY tid

idel_rowid INDEX rowid

idel_tid INDEX tid

Observation:

Source org: NCBI

Source files: delnodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

parser: importtaxdb

***tid***

description: The taxonomy identifier

example: 15

seqhound@blueprint.org Version 3.3

The SeqHound Manual 136 of 421 18/04/2005

default value: n/a

source: column 1 of delnodes.dmp

parser: importtaxdb

function: main of importtaxdb

API: SHoundIsTaxDeleted (true if taxid parameter is deleted)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 137 of 421 18/04/2005

merge table

Database: seqhound

Table: merge

Module: taxdb

Definition: Contains all tax nodes that have merged. Maps an old taxid before

merging with its new taxid after the merge. A taxonomy may be

deleted because initially it was considered species X, but when its DNA

is analyzed, it is deemed to be insignificantly different from species Y,

in which case one of the species gets merged.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

otid int(11)

N

o 0 old taxonomy id

ntid int(11)

N

o 0 new taxonomy id

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY otid

imerge_rowid INDEX rowid

imerge_ntid INDEX ntid

imerge_otid INDEX otid

Observation:

Source org: NCBI

Source files: merged.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

parser: importtaxdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 138 of 421 18/04/2005

***otid***

description: the old taxid

example: 12

default value: n/a

source: merged.dmp: column 1

parser: importtaxdb

function: main of importtaxdb

API: SHoundIsTaxMerged (takes an old taxid and checks if it has been

merged into a new taxid)

***ntid***

description: the new taxid

example: 74109

default value:

source: merged.dmp: column 2

parser: importtaxdb

main of importtaxdb

API:

n/a

function:

SHoundIsTaxMerged (takes an old taxid and checks if it's been

merged into a new taxid)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 139 of 421 18/04/2005

Structural databases (strucdb) module

Last updated October 6, 2004

cbmmdb parser

purpose:

The cbmmdb parser builds the strucdb module databases. The input files to cbmmdb contain data for experimentally determined

macromolecular 3D structures. The data files are in ASN.1 format available from the NCBI ftp site.

Module:

Build the Strucdb module

Change to the mmdbdata directory.

cd $SEQH/7.mmdb.files

Create tables of the Strucdb module in the database.

Make sure file strucdb.sql has line use seqhound close to the beginning of the file.

mysql –u my_id –p –P my_port –h my_server < strucdb.sql

Where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP

address or the server name of the database server repectively. You will be prompted to enter your password.

This creates tables mmdb, mmgi and domdb in the database.

strucdb

input files:

*.val.gz (ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata)

tables altered:

mmdb, mmgi

source code location:

slri/seqhound/parsers/cbmmdb.c

config file & other dependencies:

The relevant configuration files are:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 140 of 421 18/04/2005

slri/seqhound/config/.intrezrc (UNIX platform)

slri/seqhound/config/.mmdbrc (UNIX platform

ncbi/config/.ncbirc (UNIX platform)

ncbi/data/bstdt.val (optional)

* The requirement that bstdt.val be in the same directory as the parser is not strictly enforced. As long as .ncbirc (see below) is

properly configured, bstdt.val will be located. However, if bstdt.val is not in the same directory as parser, a warning will be issued in

cbmmdb's log file. Despite the warning, the parser should work properly, granted the configurations files are all set properly.

The .mmdbrc file should have at least 1 section [MMDB] with 3 fields:

1) Database: the path where the data source files (*.val.gz) are located

2) Index: the index file for the data source files (mmdb.idx: this is also downloaded by mmdbftp.pl)

3) Gunzip: path to gunzip

seqhound@blueprint.org Version 3.3

The SeqHound Manual 141 of 421 18/04/2005

sample .mmdbrc file:

[MMDB]

;Where the data source files are located

Database = ./

;Index of all data source files

Index = mmdb.idx

;Path to gunzip

Gunzip = /bin/gunzip

The .ncbirc file should have 1 section [NCBI] with 1 field:

1) Data: the path to bstdt.val

sample .ncbirc file:

In file .ncbirc, variable DATA should have a value which is the path of directory ncbi/data on your machine. File .ncbirc looks like

(change text in italics):

[NCBI]

ROOT=/

DATA=/my_home/compile/ncbi/data/

[NCBI]

;where bstdt.val is located

Data=/ncbi/data

The .intrezrc file should have the pathmm set to where you want the databases to be stored.

sample .intrezrc file:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;structural databases

strucdb = 1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 142 of 421 18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] strucdb should be 1.

mmdb.idx should be stored in the same directory as the data sources (*.val.gz files). It is an index, containing a list of MMDB ID and

PDB code pairs. The first line is the number of Biostrucs in MMDB in the current release. The * in *.val.gz is the MMDB ID. The

cbmmdb parser will initially parse the content of mmdb.idx to ensure its integrity.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 143 of 421 18/04/2005

command line parameters:

Typing "./cbmmdb" at the command line will return a list of parameters

and default settings. For example:

> ./cbmmdb

cbmmdb arguments:

-n Initialize Database File for Biostrucs [T/F]

-m Initialize Database File for MMDB Id and GI pairs [T/F]

example use:

./cbmmdb -i T -m T

associated scripts:

mmdbftp.pl retrieves all the data files for input to cbmmdb.

error and run-time logs:

cbmmdb writes to cbmmdblog

additional info:

See additional readme files on the NCBI ftp site (ftp://ftp.ncbi.nih.gov/mmdb)

See data table descriptions for each of the tables that are listed under “tables altered”

seqhound@blueprint.org Version 3.3

The SeqHound Manual 144 of 421 18/04/2005

vastblst parser

purpose:

The vastblst parser constructs the 3D domain information that is extracted from the MMDB 3D structures. vastblst builds the domdb

table to store this information.

module:

strucdb

input files:

mmdb table

tables altered:

domdb

source code location:

slri/seqhound/domains/vastblst.c

config file dependencies:

slri/seqhound/config/.intrezrc (UNIX platform)

The relevant section in the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;structural databases

strucdb = 1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 145 of 421 18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] strucdb should be 1.

command line parameters:

Typing ./vastblst at the command line will return a list of the command line

parameters and default setting. For example:

> ./vastblst

vastblst arguments:

-n Initialize Database File for Domains [T/F]

example use

./vastblst -n T

associated scripts:

n/a

additional information:

See data table description for each table listed under “tables altered”

seqhound@blueprint.org Version 3.3

The SeqHound Manual 146 of 421 18/04/2005

pdbrep parser

purpose:

The pdbrep parser fills in the 'rep' field in the domdb table. 'rep' is the best representative domain for that blast set. The input files

consist of non-redundant PDB chain set, BLAST rankings for the set and the representative for that chain set. pdbrep stores the

representative into the domdb table.

module:

strucdb

input files:

nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable

tables altered:

domdb

source code location:

slri/seqhound/domains/pdbrep.c

config file dependencies:

The relevant configuration file is:

slri/seqhound/config/.intrezrc (UNIX platform)

The relevant sections are:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;structural databases

seqhound@blueprint.org Version 3.3

The SeqHound Manual 147 of 421 18/04/2005

strucdb = 1

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,

password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] strucdb should be 1.

command line parameters:

Typing ./pdbrep at the commandline will generate a listing of commandline

parameters. For example:

>./pdbrep

pdbrep arguments:

-i Input nrpdb table [File In]

example use:

./pdbrep -i nrpdb.010400

associated scripts:

n/a

error & run-time log:

pdbrep writes to pdbreplog

additional information:

README's at ftp://ftp.ncbi.nih.gov/mmdb/nrtable

see data table descriptions for each of the tables that are listed under "tables altered"

seqhound@blueprint.org Version 3.3

The SeqHound Manual 148 of 421 18/04/2005

mmdb table

Database: seqhound

Table: mmdb

Module: strucdb

Definition: Is a database of experimentally determined macromolecular 3D

structures (Molecular Modeling DB).

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

mmdbid int(11)

N

o 0 Molecular modeling database

identifier

asn1 mediumblob

N

o Biostruc blob

p

dbi

d

varchar(20)

N

o PDB id

bwhat int(11) Yes

N

ULL types of molecules in the biostruc

models int(11) Yes

N

ULL number of models in record

molecules int(11) Yes

N

ULL number of molecules in record

size int(11) Yes

N

ULL size of uncompressed biostruc

bzsize int(11) Yes

N

ULL size of compressed biostruc

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY mmdbid

immdb_rowid INDEX rowid

immdb_mmdbid INDEX mmdbid

immdb_pdbid INDEX

p

dbid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 149 of 421 18/04/2005

Observation: Most 3D structures are obtained through X-ray crystallography and

NMR-spectroscopy. MMDB is a subset of Protein Data Bank (PDB),

excluding theoretical models. 3D structures can provide information

pertaining to the biological function, mechanism of function,

evolutionary history and relationships of a biomolecule.

Source org: NCBI

Source file: *.val.gz from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata

FTP script: slri/seqhound/scripts/mmdbftp.pl

parser: cbmmdb

***mmdbid***

description: a unique, stable numerical identifier. MMDB ID are assigned when a new

structure enters MMDB. If an entry is deleted from PDB, then its MMDB entry

is obsolete. Obsolete structures get archived.

example: 10

default value: n/a

ASN.1 struct: Biostruc->Biostruc-id (choice 1)

parser: cbmmdb

function: Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)

API: SHound3DExists

***asn1***

description: the Biostruct blob (see ASN structure below). Contains information about the

3D structure of a molecule including information about the researchers,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 150 of 421 18/04/2005

molecule bond information, etc.

example: see below

default value: n/a

source: *.val.gz

parser: cbmmdb

function: MMDBBiostrucGet

API: SHoundGet3DfromPdbId

SHoundGetPDB3D

SHoundGet3D

**pdbid

***

description: a 4 character identifier from the Protein Data Bank.

example: 100D

default: n/a

ASN.1 struct: Biostruc->Biostruc-id(choice 2)

parser: cbmmdb

function: Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)

API: no functions return the pdbid from mmdb, it is used as a key to retrieve a 3D

structure, see asn1 API.

***bwhat

***

description: an integer that represents the type of molecule.

The possible flags are (defined in ncbi/biostruc/mmdbapi1.h) and their

corresponding byte values are:

1) AM_ION 0x80

seqhound@blueprint.org Version 3.3

The SeqHound Manual 151 of 421 18/04/2005

2) AM_RNA 0x40

3) AM_WAT 0x20

4) AM_SOL 0x10

5) AM_HET 0x08

6) AM_DNA 0x04

7) AM_PROT 0x0

8) AM_POLY 0x01

9) AM_UNK 0x00

example: AM_ION

default: AM_UNK

ASN.1 stuct: Biostruc->Biostruc-graph->Molecule-graph->Bio-mol-descr->molecule-type

parser: cbmmdb

function: Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)

API: SHound3DbWhat

***models***

description: an enumeration of the different models for this structure

example: 3

default: n/a

source: this is enumerated by NCBI code, increment for EACH Biostruc->Biostruc-

model in the ASN structure

parser: cbmmdb

function: Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)

API: n/a, no public interface provided to this field

seqhound@blueprint.org Version 3.3

The SeqHound Manual 152 of 421 18/04/2005

***molecules***

description: an enumeration of the number of molecules in the structure

example: 3

default: n/a

source: this is enumerated by NCBI code, increment for EACH Biostruc->Biostruc-

graph->Molecule-graph in the ASN structure

parser: cbmmdb

function: Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)

API: n/a, no public interface provided to this field

***size***

description: the size of the uncompressed biostruc

example: 7691

default: n/a

source: the size of the uncompressed biostruc is determined just before it is compressed

and appended

parser: cbmmdb

function: AssignASNMemBZMemo (intrez_cb.c)

API: n/a, no public interface provided to this field

***bzsize***

the size of the compressed biostruc

description:

example: 31

default:

source:

n/a

the compressed size is determined after it is compressed

seqhound@blueprint.org Version 3.3

The SeqHound Manual 153 of 421 18/04/2005

parser: cbmmdb

n/a, no public interface provided to this field

function: AssignASNMemBZMemo (intrez_cb.c)

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 154 of 421 18/04/2005

mmgi table

Database: seqhound

Table: mmgi

Module:

Definition:

strucdb

A mapping of mmdbid to gi.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

mmdbid int(11)

N

o mmdb identifier 0

gi int(11)

N

o 0 GenInfo Identifier

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY mmdbid

gi

immgi_rowid INDEX rowid

immgi_mmdbid INDEX mmdbid

immgi_gi INDEX gi

Observation:

Source org: NCBI

Source file: *.val.gz from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata

FTP script: slri/seqhound/scripts/mmdbftp.pl

parser: cbmmdb (see mmdb for initialization info)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 155 of 421 18/04/2005

***mmdbid***

description: 3D molecular model unique identifier

example: 1

ASN.1 struct:

parser:

Biostruc->Biostruc-id (choice 1)

cbmmdb

function: Main

API: SHound3DExists

SHound3DFromGi

SHound3DFromGiList

***gi***

description: GenInfo Identifier

example: 1420979

ASN.1 struct: Biostruc->Biostruc-id

parser: cbmmdb

function: Main

API: SHoundGiFrom3D

SHoundGiFrom3DList

seqhound@blueprint.org Version 3.3

The SeqHound Manual 156 of 421 18/04/2005

domdb table

Database: seqhound

Table: domdb

Module: strucdb

Definition: Stores the information of structural domains.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

mmdbid int(11)

N

o 0 mmdb identifier

asn1 mediumblob

N

o asn blob

p

dbi

d

varchar(20)

N

o

p

db identifie

r

p

dbchain varchar(10) Yes

N

ULL

p

db chain

gi int(11)

N

o 0 geninfo identifier

domno int(11) Yes

N

ULL domain number

domall int(11)

N

o 0 number of domains in the whole

structure

domid int(11)

N

o 0 vast domain id

rep int(11) Yes

N

ULL representative of blast set

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY domid

idomdb_rowid INDEX rowid

idomdb_mmdbid INDEX mmdbid

idomdb_pdbid INDEX

p

dbid

idomdb_gi INDEX gi

seqhound@blueprint.org Version 3.3

The SeqHound Manual 157 of 421 18/04/2005

idomdb_domall INDEX domall

idomdb_domid INDEX domid

Observation: VAST (vector alignment search tool) is an NCBI-developed algorithm

used to identify similar 3-D structures that may not possess sequence

similarities. Structural information is taken from the mmdb database.

3D structures are converted into vectors (which corresponding to

structurally significant regions of the protein). Adjacent vectors are

grouped into domains. Domain information is used to determine distant

homologs and provide structural information. The domains of proteins

can be compared with other protein domains. Those with similar

domain overlaps are likely to be distant homologs. The domdb database

stores the domain information.

Source org: NCBI

Source file: MMDB database & nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable

Parser: vastblst & pdbrep

Note: vastblst parser uses SHoundGet3D to retrieve the MMDB information..

***mmdbid***

description: the 3D structure id from which the domain was computed.

example: 3446

default value: n/a

source: retrieved from mmdbid by calling SHoundAllMMDBID.

parser: vastblst

function: Main calls SHoundAllMMDBID and MakeAModelstruc

seqhound@blueprint.org Version 3.3

The SeqHound Manual 158 of 421 18/04/2005

API: not used by the API, there are other tables with more complete

mmdbid's.

***asn1***

description: the asn structure of the domain. The structure gives the gi of the domain

and the starting and ending point of the domain.

example: a text output of the SLRIValNode:

SLRIValNode ::= {

domain {

gi 443581 ,

from 2 ,

to 393 } }

default value: n/a

source: extracted from the biostruc reconstructed from what is stored in mmdb.

parser: vastblst

API:

function: WriteFASTAByDomain calls MakeDomain

SHoundGetDomain to actually print out the ASN structure, after

calling SHoundGetDomain, the return value (a SLRIValNodePtr

object) must be opened and streamed into an asn IO stream. eg.

{ // start of program

SLRIValNodePtr pdom = NULL;

AsnIoPtr aip = NULL;

pdom = SHoundGetDomain(443581, 0);

aip = AsnIoNew(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);

SLRIValNodeAsnWrite(pdom, aip, NULL);

AsnIoClose(aip);

} // end of program, text output as in example

SHoundGetFastaDomain (to retrieve the text, follow code above,

replace SLRIValNode with SLRIFasta

seqhound@blueprint.org Version 3.3

The SeqHound Manual 159 of 421 18/04/2005

***pdbid***

description: the PDB id of the 3D structure from which the domain was computed

example: 9XIM

default value: n/a

ASN.1 struct: n/a

parser: vastblst

function: WriteFASTAByDomain

API: SHoundGetDomain

***gi***

description: the gi of the chain from which the domain was computed

example: 443581

default value: n/a

source: extracted from the biostruc object reconstructed from data in mmdb.

Biostruc->Biostruc-id (choice 1)

parser: vastblst

function: WriteFASTAByDomain

API: SHoundGiFromPDBchain

This field can also be parsed from the SLRIValNode and

SLRIFasta ASN structures (above).

***domno***

description: a protein chain may have several domains, each domain is enumerated,

eg if a protein X has 10 domains, the first domain has domno 0, the

seqhound@blueprint.org Version 3.3

The SeqHound Manual 160 of 421 18/04/2005

second domain has domno 1, ...etc

example: 0

default value: n/a

source: this is enumerated by NCBI code

parser: vastblast

function: WriteFASTAByDomain

API: n/a, by itself, the domno provides little information, so it is not

provided in the API

***domall***

description: The total number of domains in the 3D structure. In the above example

protein X, domall is 10

example: 1

default value: n/a

source: this is enumerated by NCBI code

parser: vastblast

function: WriteFASTAByDomain

API: n/a

***domid***

description: the id for the domain. domid is computed by hashing various id's,

including (but not limited to) mmdbid, domno, etc.

example: 34460200

default value: n/a

source: hashed in WriteFASTAByDomain

seqhound@blueprint.org Version 3.3

The SeqHound Manual 161 of 421 18/04/2005

parser: vastblast

function: WriteFASTAByDomain

API: n/a

***rep***

description: domains get grouped based on properties. The domains are then

compared to the other domains in their group and then the best over

domain representative of that group is chosen. This field reflects that

representative.

example: 1

default: 0

source: nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable columns 6, 9 & A

parser: pdbrep

function: Main

API: n/a

seqhound@blueprint.org Version 3.3

The SeqHound Manual 162 of 421 18/04/2005

Protein sequence neighbours (neighdb) module

Last updated: August 18, 2004

Note: The neighbours tables are precalculated on a cluster and the resulting tables are distributed in MySQL format on

our ftp site. Therefore, this section is provided for informational purposes only, or for those who would like to build

neighbours tables from there own sequence data; it is not necessary if one wishes simply to include the neighbours

module into their own seqhound instance, in which case they should simply download the precomputed tables.

Note: the nblast development tree is in the slri directory on the same level as SeqHound (i.e.; slri/nblast).

This documentation includes a description of the nblast program.

The NBLAST paper is freely available from BioMed Central and is provided in the supplementary documents directory distributed

with this manual. See:

Dumontier M, Hogue CW. NBLAST: a cluster variant of BLAST for NxN comparisons.

BMC Bioinformatics. 2002 May 8;3(1):13. Epub 2002 May 08.

PMID: 12019022

Purpose: This section describes how to build and use nblast, a program for generating a database of sequence neighbours.

Installing nblast:

There are two ways to install nblast:

* Download nblast binaries for your platform from the sourceforge website:

http://sourceforge.net/project/showfiles.php?group_id=17918

or

* Compile nblast on your system: In order to compile nblast, two pieces of third party software are prerequisite: NCBI's C toolkit

and Sequiter software's CodeBase database library. Nblast currently only supports CodeBase as its database backend, but it may

be ported to other databases in the future. If your system meets these requirements, download the platform independent source

code for nblast from the link above and proceed.

The NCBI C toolkit can be downloaded from NCBI's website at

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz

CodeBase is a commercial database software library available from Sequiter software.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 163 of 421 18/04/2005

A description of how to compile and set up the nblast software follows. If you installed the binaries, or already have compiled the

software, skip to the Configuration section. The installation procedure for CodeBase and the NCBI C toolkit are not detailed here, but

are assumed to have been successful.

Once you have unpacked the source code tree, change directories into the topmost nblast directory and then exercise one of the

following sets of instructions, depending on whether you are using a Unix or Windows system.

Unix:

• Modify the nblast.mk file for different source and library paths.

• Modify the make.nblast file or project settings to create an executable that incorporates NBLAST (-D NBLAST_API), logging

(-D NBLAST_LOG), and/or MoBiDiCK(-D MOBIDICK_API) (The MoBiDick library is currently not publicly available).

• From the src directory type 'make -f make.nblast', if the nblast binary is built then build nblastcleanup with 'make -f

make.nblastcleanup' The binaries will be placed in the build/ subdirectory of the source tree.

Windows:

• Open the nblast.dsw workspace in MSVC.

Build NBLAST and NBLASTCLEANUP

The two nblast executables, nblast and nblastcleanup, should now be present in the slri/nblast/build/ directory. Set your PATH

environment variable appropriately so that you can execute them. Before you can use nblast, you'll have to do some more

configuration, described in the next section.

Configuration of nblast environment:

At least one configuration file is required to use nblast and it's associated applications, more if you wish to have some of the relevant

files outside of the current directory. At the very least, you must have an nblast configuration file, called .nblastrc on Unix, and

nblast.ini on Windows. This file must be present either in your current directory or your home directory[may not apply on windows]

and has the following formats:

.nblastrc on Unix:

; NBLAST configuration file

[NBLAST]

writepath = /home/nblast/build/

nblast.ini on Windows:

[NBLAST]

writepath = g:\code\slri\nblast\build

seqhound@blueprint.org Version 3.3

The SeqHound Manual 164 of 421 18/04/2005

where the directory path named after “writepath = “ specifies where the files generated by nblast will be written to. A trailing

slash(unix) or backslash(windows) is necessary for the directory path to work.

If you want to place the scoring matrix (used by the blast algorithm) and/or the formatted fasta database in any directory but the

current working one, you must also have an ncbi configuration file:

.ncbirc on Unix:

[NCBI]

DATA = g:\code\ncbi\data

[BLAST]

BLASTDB = g:\blastdb\

ncbi.ini on Windows:

[NCBI]

DATA = /code/ncbi/data

[BLAST]

BLASTDB = /blastdb/

Where the path following “DATA =” specifies the path to find the scoring matrix file, and the path following “BLASTDB” specifies

the path to find the formatted fasta database.

The BLAST algorithm uses the BLOSUM62 scoring matrix. The file containing this matrix is required for nblast to function. The file

is named BLOSUM62, and comes with the NCBI C toolkit. It should also come with the binary distribution of nblast.

Once the configuration files have been properly set, you are ready to run nblast, detailed in the next section.

Running NBLAST

* Format the fasta database using formatdb: Before Nblast can process the protein sequences, they must be properly formatted.

This is done using the formatdb program, available from NCBI, which takes a fasta formatted database of protein sequences

and processes them into a form which can be BLASTed. Once you have downloaded formatdb, format your fasta database

using the following command:

formatdb -oT -i <dbname>

Where <dbname> is the file name of your fasta formatted protein sequence database. Generally this is nr, the non-redundant

protein sequence database, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.tar.gz file (at the time of this

writing, that file is ~500 MB compressed). <dbname> will be used throughout the rest of this document to mean the filename of

your original fasta formatted sequence database.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 165 of 421 18/04/2005

* Use NBLAST to build the skeleton database: Before nblast can compare sequences, it needs to build a skeleton of the nblastdb

table with information on how the sequences will be ordered and indexed. This is done by running nblast with the -N1

command line argument as follows:

nblast -i<dbname> -d<dbname> -N1

* Use Nblast to do the blast comparisons. Before the neighbour tables can be built, nblast must do the pairwise blast

comparisons. In addition to the nblast specific command line options mentioned here, you may also apply options related to

NCBI's blastall to augment the BLAST results. For small databases, it is feasible to do this step on a single computer. Single

computer execution:

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N2

Where <eval_cutoff> is the maximum evalue allowed for BLAST comparisons between sequences. BLAST comparisons

which result in evalues higher than this are not saved.

Parallel Computer Execution: Since these comparisons are computationally intensive for large databases (like nr), nblast is

capable of splitting up the task across multiple compute nodes. This is done by distributing the <dbname>N.* files generated by

the last step, and the <dbname>.p* files generated by formatdb, to the compute nodes, and then running nblast on each compute

node with node specific options:

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N2 -C<node_#> -D<total_#of_nodes>

Where <total_#of_nodes> is the total number of compute nodes in your parallel computing system, and <node_#> is the

index (ranging from 1 to <total_#of_nodes>) of the current compute node with respect to the set of nodes which the nblast

task is divided across. <eval_cutoff> has the same meaning as in single computer usage.

Both single and parallel computer execution of the N2 mode of nblast results in creation of files entitled

<dbname>B<node_#>.* (for single computers, <node_#> defaults to 1). These contain the set of blast results which that

particular compute node generated.

* Build the nblastdb table of sequence neighbours: Now the blastdb table of pairwise blast results can be used to build/fill the

nblastdb table of neighbouring sequences. The pairwise BLAST results generated during the last step are processed and

consolidated to generate the records for the neighbouring sequences table, nblastdb. Single Computer Execution:

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N4

Parallel Computer Execution: You must first collect the database files generated on the compute nodes during the last step onto

the head node before running this step. Then run the following on the head node:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 166 of 421 18/04/2005

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N4 -C<node_#> -D<total_#of_nodes>

* Cleanup the database and generate number of neighbours fields: The neighbours table, nblastdb, is nearly complete. All that

needs to be done is to fill in the #Neighbours fields of the nblastdb table. This is done using the program nblastcleanup with

the following arguments:

nblastcleanup -bT -pT -aT -qT -d<dbname> -n <num_blastdbs>

Where <num_blastdbs> is the number of blastdb's to check/build from, generally equal to the number of compute nodes you

used (1 in the single computer case).

Your Neighbours database should now be completely built.

NBLAST Update Procedure

Nblast was designed to process NCBI's non-redundant sequence database, named nr. This database contains a non-redundant list of

protein sequences and their associated GenInfo identifier numbers (GIs). The NCBI's nr database is updated on a regular basis, with

some GI's being removed, and others being added. It is desirable to keep the NeighDB module's nblastdb and blastdb tables updated

with the nr database, without having to recompute all the neighbours. This is the purpose of nblast's update procedure, which removes

GI's which have been “killed” from any entries in the nblast tables which contain them, and BLASTs and inserts “new” GI's where

they are appropriate. This process involves multiple steps, which are outlined and explained here.

* Format new version of the fasta database using formatdb: Use formatdb to format your new version of the fasta database in the

same way you did before. Note that your new fasta database should have the same name as the one you initially used to build the

nblast tables.

formatdb -oT -i <dbname>

* Run Nblast in Update mode N5: This step compares the GIs in the out of date nblastdb with those in the updated FASTA

database. It removes entries of the killed Gis from the neighbour lists of all gi's, deletes records which only exist because of the

killed GI, and inserts empty records for the newly added GIs in the nblastdb. It also creates an update file called

NBLAST_UPDATE.val which lists which GI's have been killed and which are to be added, for the next stages to use. On the head

compute node, execute:

nblast -i<dbname> -N5

* Run Nblast in Update mode N6: This step performs the pairwise BLAST comparisons between the newly added sequences and all

the old sequences that haven't been killed, as well as comparing the new sequences with each other. Single Computer Execution:

nblast -i<dbname> -e<eval_cutoff> -N6

seqhound@blueprint.org Version 3.3

The SeqHound Manual 167 of 421 18/04/2005

Parallel Computer Execution:

nblast -i<dbname> -e<eval_cutoff> -N6 -C<node_#> -D<total_#of_nodes>

Where <eval_cutoff> is the maximum allowed evalue for comparison results, as described earlier.

* Rebuild the Neighbouring Sequences Table: Now rebuild the nblast database using the same command you used to build it

previously. Single Computer Execution:

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N4

Parallel Computer Execution: First collect all the <dbname>B<node#>.* files from the compute nodes onto the head node.

Then run the following command:

nblast -i<dbname> -d<dbname> -e<eval_cutoff> -N4 -C<node_#> -D<total_#of_nodes>

* Cleanup: The cleanup procedure is done in the same manner as for the initial build of the neighbour tables. Single Computer

Execution:

nblastcleanup -bT -pT -aT -qT -d<dbname> -n <num_blastdbs>

Where <num_blastdbs> is the number of blastdb's to check/build from, generally equal to the number of compute nodes you

used (1 in the single computer case).

Your Neighbours database should now be properly updated.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 168 of 421 18/04/2005

nbraccess program*

Note: Not available at time of release.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 169 of 421 18/04/2005

BLASTDB table

Last updated: August 4, 2004

Database: SeqHound

Module: (Sequence) Neighbours

Source: NBLAST

Source File: Derived from NxN comparison of sequences from the NR database

from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

Parser/Application: NBLAST http://www.sourceforge.net/projects/slritools (source &

binaries)

Published

Documentation:

BMC Bioinformatics http://www.biomedcentral.com/1471-

2105/3/13/

Definition: Pairwise sequence alignments from NBLAST computation

Note: The neighbours tables are precalculated on a cluster and the resulting tables are distributed in MySQL format on

our ftp site. Therefore, this section is provided for informational purposes only, or for those who would like to build

neighbours tables from there own sequence data; it is not necessary if one wishes simply to include the neighbours

module into their own seqhound instance, in which case they should simply download the precomputed tables.

Codebase

Column_Name Index NULL Data_Type Size Column_Definition

UID P

N

INTEGER/Hash

Key

14 Perfect Hash for 2

Ordinal Numbers

EVAL

N

FLOAT E-Value 12.7 Alignment Evalue

ASNSA

N

MEMO ASN.1 Modified SeqAlign for

p

airwise sequence

alignment

MySQL

Field Type Null Default Column_Definition

rowid int(11) Yes

N

ULL Auto incremented id

uid bigint(20) Yes

N

ULL Perfect Hash for 2 Ordinal

N

umbers

seqhound@blueprint.org Version 3.3

The SeqHound Manual 170 of 421 18/04/2005

eval decimal(12,7) Yes

N

ULL 12.7 Alignment Evalue

asnsa mediumblob Yes

N

ULL Modified SeqAlign for pairwise

sequence alignment

MySQL Indexes:

Keyname Type Field

iblastdb_rowid INDEX rowid

iblastdb_uid INDEX uid

*** uid ***

Description: 64 bit Perfect Hash Key generated from 2 32 bit Ordinal Numbers in

nblastdb

example:

default value: none

source: NBLAST - a cluster computer extension of BLAST to compute NxN

sequence comparisons

functions: ShoundGetBlastResult: Given two GIs, checks if alignment

exists, returns NBlast-result-set else NULL

SHoundGetBlastSeqAlign : Given two GIs, calls

SHoundGetBlastResult and formats the NBlast-result-set to a

SeqAlign

*** eval ***

Description: The BLAST e-value reported for this alignment

example: 0.0001

default value: none

function: Used by NBLAST to create the NBLASTDB table using a user-

specified minimum evalue

seqhound@blueprint.org Version 3.3

The SeqHound Manual 171 of 421 18/04/2005

***asnsa***

Description: NBlast-Result-Set (ASN.1 definition is described in

slri/nblast/asn/NBlastasn.asn)

- Stores the useful parts of a seqalign from the BLAST computation

including the alignment, bitscore, evalue

seqhound@blueprint.org Version 3.3

The SeqHound Manual 172 of 421 18/04/2005

NBLASTDB table

Last updated: August 4, 2004

Database: SeqHound

Module: (Sequence) Neighbours

Source: NBLAST

Source File: Derived from NxN comparison of the NR database from

f

tp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

Parser/Application: NBLAST http://www.sourceforge.net/projects/slritools (source &

binaries)

Published

Documentation:

BMC Bioinformatics http://www.biomedcentral.com/1471-

2105/3/13/

Definition: Sequence neighbour lists from NBLAST computation

Codebase

Column_Name Index NULL Data_Type Size Column_Definition

ORD P

N

INTEGER

Ordinal

N

umbe

r

10 The ordinal number of

this entry (auto-

increment)

GI I

N

INTEGER

GenInfo Id

10 The GenInfo Identifier

for the sequence for

which there are

neighbours in the list

N

UM I

N

INTEGER#

N

eighbours

10

N

umber of neighbours

in ASN.1 structure

ASNNBR

N

MEMO ASN.1 ASN.1 structure

containing the list of GIs

whose sequences

neighbour that of this

entries GI, and the

corresponding evalues

of the alignments.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 173 of 421 18/04/2005

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

ord int(11)

N

o 0 The ordinal number of this entry

gi int(11)

N

o 0

The GenInfo Identifier for the

sequence for which there are

neighbours in the list

num int(11)

N

o 0

N

umber of neighbours in ASN.1

structure

asnnbr mediumblob

N

o

ASN.1 structure containing the

list of GIs whose sequences

neighbour that of this entries GI,

and the corresponding evalues of

the alignments.

MySQL Indexes:

Keyname Type Field

inblastdb_rowid INDEX rowid

inblastdb_ord INDEX ord

inblastdb_gi INDEX gi

inblastdb_num INDEX num

***ord***

Description: The ordinal value of the entry in the database (starts with 1). The

ordinal values from two GIs in a pairwise alignment are perfectly

and reversibly hashed into/from the BLASTDB UID value. This

approach saves on space required to store two 32 bit integers.

Example: 1

Default Value: NA

Function:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 174 of 421 18/04/2005

***gi***

Description: The GenInfo Identifier of the query sequence

Example: 2495000

Default Value: NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 175 of 421 18/04/2005

API Function: SHoundNeighboursFromGiEx: Given GI, an evalue cutoff and

a limit of 100 returned results, returns FLinkSet containing list of

neighbour GIs and their alignment e-value

Other functions that use SeqHound Functionality (Redundant

Groups & Taxonomy Protein Lists)

SHoundNeighboursFromGi: Calls

SHoundNeighboursFromGiEx, and if the GI is not found,

searches through the list of redundant GIs for the respective

sequence to find an equivalent GI for which there is neighbour

information, returning the ShoundNeighboursFromGiEx results for

that GI.

SHoundNeighboursFromGiList: Calls

SHoundNeighboursFromGi with each GI in a valnode list

SHoundNeighboursFromTaxID: Calls

ShoundNeighboursFromGiList for each GI in a given

taxonomy (SHoundProteinsFromOrganism)

SHoundNeighboursOfNeighbours: Fetches the neighbours o

f

supplied GI and each of their neighbours (limit 100)

SHoundNeighboursOfNeighboursList: Fetches the

neighbours of supplied GI list and each of their neighbours

SHoundGiAndNumNeighboursList: returns a list of all the

GIs with more than 0 neighbours

SHoundNumNeighboursInDB: Fetches the number of

neighbours in the nblastdb

***num***

seqhound@blueprint.org Version 3.3

The SeqHound Manual 176 of 421 18/04/2005

Description: Quick lookup value to find out how many neighbours are in the

ASN.1 structure. Can also be used to sort the GIs with most/least

neighbours

Example: range of integer values

Default Value: 0

Function:

***asnnbr***

Description: NBlast-GiAndEval-set (ASN.1 definition is described in

slri/nblast/asn/NBlastasn.asn)

stores the GI and it's list of sequence neighbours (as GI, evalue

pairs)

Functions: see above for GI

seqhound@blueprint.org Version 3.3

The SeqHound Manual 177 of 421 18/04/2005

Locus link functional annotations (lldb) module

llparser

Last updated: August 5, 2004

purpose:

The llparser parses the LocusLink data files to create and update a set of tables that

correlate curated sequence data and descriptive information about genetic loci. It retrieves

information on official nomenclature, aliases, sequence accessions, phenotypes, EC

numbers, MIM numbers, UniGene clusters, homology and map locations.

module: lldb

input files:

LL_tmpl

From ftp://ftp.ncbi.nih.gov/refseq/LocusLink/

tables altered:

ll_cdd, ll_go, ll_llink, ll_omim

source code location:

slri/seqhound/locuslink/llparser.c

config file dependencies:

The relevant configuration file is:

slri/seqhound/parsers/.intrezrc (for Unix platforms) or

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;locus link functional annotations

lldb = 1

Text in italics must be changed for the .intrezrc file to function correctly with your

SeqHound set-up. Variables username, password, dsn, database in section

[datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections]

lldb should be 1.

command line parameters:

This parser does not have any command line parameter, Just type “./llparser” and

it will parse the LL_tmpl file. This input file must be in the same directory as the llparser

executable.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 178 of 421 18/04/2005

associated scripts:

see “slri/seqhound/scripts/llftp.pl”

This script retrieves the input file or this parser.

error and run-time logs:

llparser writes to a log file called “llparserlog” where it writes Time Stamp, Error #,

goparser.c line # and cause of the problem.

e.g.

=================[ May 21, 2003 3:49 PM ]==================

ERROR: [000.000] {llparser.c, line 81} Main: Cannot find LL_tmpl file.

troubleshooting:

additional info:

The LocusLink web page is at NCBI http://www.ncbi.nlm.nih.gov/LocusLink/

The LocusLink README file describes the input file for this parser

ftp://ftp.ncbi.nih.gov/refseq/LocusLink/

See data table descriptions for each of the tables in the lldb module.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 179 of 421 18/04/2005

addgoid parser

Last updated: August 5, 2004

purpose:

Correlates sequence record gi's with GO annotation identifiers.

The Gene Ontology flat file parser adds information to the supplements ll_go table. This

information correlates sequence records GI’s with GO annotation identifiers. This

information is retrieved from the gene_association.compugen.Genbank and

gene_association.compugen.Swissprot files from www.geneontology.org.

Other GO annotation data is available at this site but it is not currently incorporated into

SeqHound. This is actively being worked on at the time of writing.

Note that the input files used here do not contain PubMed Identifiers.

This parser is dependent on the following tables, asndb, parti, accdb and nucprot. This

parser is also dependent on SeqHound API.

For this reason, the mother parser must be run before using the addgoid parser.

module: lldb

input files:

gene_ association.compugen.Genbank

gene_association.compugen.Swissprot

from ftp://ftp.geneontology.org/pub/go/gene-associations/

tables altered:

ll_go

source code location:

slri/seqhound/locuslink/addgoid.c

config file dependencies:

The relevant configuration file is:

slri/seqhound/parsers/.intrezrc (for Unix platforms) or

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;locus link functional annotations

lldb = 1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 180 of 421 18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your

SeqHound set-up. Variables username, password, dsn, database in section

[datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections]

lldb should be 1.

command line parameters:

Typing “./addgoid –“ at the command line while in the directory where addgoid

resides will return a list of command line parameters and default settings. For example:

> ./addgoid –

pdbrep arguments:

-i Input file [File In]

associated scripts:

see /slri/seqhound/goftp.pl

The script retrieves the files used as input by the addgoid parser. This script also

retrieves the three input files required by the goparser (see godb module).

error and run-time logs:

addgoid parser writes to a log file called addgoidlog where it writes Time Stamp, Error #,

goparser.c line # and cause of the problem. For example

==================[ May 21, 2003 4:14 PM ]==================

NOTE: CoreLib [002.003] {ncbifile.c, line 624} FileOpen("gene_association.compugen.Swissprot","r")

failed

troubleshooting:

additional info:

The NCBI LocusLink web page is at

http://www.ncbi.nlm.nih.gov/LocusLink/.

The Gene Ontology Consortium documentation at is at :http://www.geneontology.org/..

See data table descriptions for each of the tables that are listed under the lldb module.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 181 of 421 18/04/2005

ll_omim table

Last updated: August 5, 2004

Database seqhound

Table ll_omim

Module lldb

Definition OMIM Online Mendelian Inheritance in Man.

Online Mendelian Inheritance in Man (OMIMTM) is a continuously

updated catalog of human genes and genetic disorders. OMIM focuses

primarily on inherited, or heritable, genetic diseases. It is also

considered to be a phenotypic companion to the human genome project.

OMIM is based upon the text Mendelian Inheritance in Man, authored

and edited by Dr. Victor A. McKusick and a team of science writers and

editors at Johns Hopkins University and elsewhere.

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

ll_id int(11)

N

o 0 Locus Link Identifier

omim_id int(11)

N

o 0 OMIM Identifier.

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY ll_id

omim_id

illomim_rowid INDEX rowid

illomim_llid INDEX ll_id

illomim_omimid INDEX omim_id

Observation:

Organization: NCBI

Source db: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

*** ll_id ***

description: Locus link record identifier.

example: 1

default value: 0

source: First field in each locus link record. Eg:

>>1

is the record with locus link id 1.

parser: llparser.c --> ll_parser.c

seqhound@blueprint.org Version 3.3

The SeqHound Manual 182 of 421 18/04/2005

function: LL_ParseFile() --> LL_LineParser

db_layer: LL_Append2OMIM_DB()

API: SHoundLLIDFromOMIM()

*** omim_id ***

description: OMIM Online Mendelian

example: 138670

default value: n/a

source: After tag 'OMIM: '

parser: llparser.c --> ll_parser.c

function: LL_LineParser() --> LL_ParseOMIM()

db_layer: LL_AppendRecord() --> LL_Append2OMIM_DB()

API: SHoundOMIMFromLLID()

SHoundOMIMFromGi()

SHoundOMIMFromGiList()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 183 of 421 18/04/2005

ll_go table

Last updated: August 5, 2004

Database: seqhound

Table: ll_go

Module: lldb

Definition: Associates a locus link identifier with the gene ontology (GO)

annotation.

Field Type Default

MySQL

Null Column_Definition

int(11)

N

o Auto incremented id

int(11)

N

o 0 Locus Link Identifier

go_id int(11) Yes

N

ULL Gene Ontology ID.

p

mi

d

int(11) Yes

N

ULL Pub Med ID.

evidence varchar(50) Yes

N

ULL

rowid

ll_id

Evidence Code.

MySQL Indexes

Keyname Type Field

illgo_rowid INDEX rowid

illgo_llid INDEX ll_id

illgo_goid INDEX go_id

illgo_pmid INDEX

p

mi

d

Observation.:

Organization: NCBI

Source file: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

Parser: ll_parser.c

*** ll_id ***

description: Locus link identifier

example: 2

default value: n/a

source: see above

parser: ll_parser.c -> ll_parser.c

function: LL_LineParser() -> LL_ParseNPUnit

SHoundLLIDFromGOIDAndECode ()

***go_id***

description:

API:

Gene Ontology Identifier.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 184 of 421 18/04/2005

example: 5717

default value: n/a

source: after tag GO, 4th element

e.g.:

GO: cellular

component|extracellular|IDA|GO:0005574|GOA|na

parser: ll_parser.c or ll_parser.c

function: LL_LineParser() -> LL_ParseGOUnit

API: SHoundGOIDFromLLID()

SHoundGOIDFromGi()

SHoundGOIDFromRedundantGi

SHoundGOIDFromGiList

SHoundGOIDFromRedundantGiList

example:

GO: cellular component|extracellular|IDA|GO:0005574|GOA|345201

eg:

***pmid***

description: PubMed Identification.

3458201

default value: n/a

source: after tag GO, 6th element

e.g.:

parser: llparser.c -> ll_parser.c

function: LL_LineParser() -> LL_ParseGOUnit

API: SHoundGOPMIDFromGiAndGOID()

***evidence***

description: Every GO annotation must indicate the type of evidence that supports

it; these evidence codes correspond to broad categories of experimental

or other support. http://www.geneontology.org/GO.evidence.html

example: IC: Inferred by Curator.

ND: No biological Data available.

TAS: Traceable Author Statement.

IEA: Inferred from Electronic Annotation.

default value: n/a

source: After a 'GO:' line, it's the 3rd. field element

GO: cellular

component|extracellular|IDA|GO:0005574|GOA|34520

1

parser: llparser.c -> ll_parser.c

seqhound@blueprint.org Version 3.3

The SeqHound Manual 185 of 421 18/04/2005

function: LL_LineParser()-> LL_ParseGOUnit()

API: SHoundGOECodeFromGiAndGOID()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 186 of 421 18/04/2005

ll_llink table

Last updated: August 5, 2004

Database: seqhound

Table:

Module:

ll_llink

lldb

Definition: Associates a locus link id with a sequence identifier

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

ll_id int(11)

N

o 0 Locus Link Identifier.

gi int(11)

N

o 0 Gene Info Identifier

map text Yes

N

ULL Gene location in chromosome.

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY ll_id

gi

illll_rowid INDEX rowid

illll_llid INDEX ll_id

illll_gi INDEX gi

Observation.: Maps a locus link id with a gi and its location in the chromosome. Not

all locus link ids will have a gi. A gi may be specified as an NP gi or a

XP gi ( experimentally determined ). If an NP gi is available, it will be

used. If an NP gi is not available, then the XP gi will be used. If neither

the NP nor XP gi is available, then no gi will be used.

Organization: NCBI

Source file: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

*** ll_id ***

description: Locus Link Identifier.

example: 2

default value: n/a

source: start of record

eg.

>>2

parser: llparser.c --> ll_parser.c

function: LL_AppendLLID_DB()

LL_ParseFile() -> LL_LineParser()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 187 of 421 18/04/2005

API: SHoundLLIDFromGi()

SHoundLLIDFromGiList()

***gi***

description: Gene Info Identifier

example: 21071030

default value: n/a

source: in the NP tag, 2nd element

eg

NP: NP_570602|21071030

parser: llparser.c --> ll_parser.c

function: LL_ParseFile() --> LL_LineParser() -->

LL_Append2LLID_DB()

API: SHoundGiFromLLID()

SHoundGiFromLLIDList()

***map***

description: Location of the gene in the Chromosome

example: 19q13.4

default value: n/a

source: in the MAP tag: 1st element

MAP: 19q13.4|RefSeq|C|

parser: llparser.c --> ll_parser.c

function: LL_ParseFile() --> LL_LineParser()

LL_Append2LLID_DB()

API: SHoundLocusFromGi()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 188 of 421 18/04/2005

ll_cdd table

Last updated: August 5, 2004

Database: seqhound

Table: ll_cdd

Module: lldb

Definition: Conserved Domain Database.

Maps a locus link id to a Conserved Domain ID.

Proteins often contain several modules or domains, each with a distinct

evolutionary origin and function. The CDD database may be used to

identify the conserved domains present in a protein sequence.

Conserved Domains can be summarized with multiple local sequence

alignments. Computational biologists have compiled collections of

such alignments representing conserved domains, and LocusLink

imports them from three major sources:

* Smart, the Simple Modular Architecture Research Tool

* Pfam (UK), Pfam-A seed alignments

* COG (Clusters of Orthologous Groups)

Smart and Pfam are public domain databases, which are offered in

combination with HMM-based search engines and alignment

visualization services.

COG is an NCBI-curated protein classification resource. Sequence

alignments corresponding to COGs are created automatically from

constituent sequences and have not been validated manually for import

in CDD.

The default e-value cutoff of for data in this table is 0.01

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

ll_id int(11)

N

o 0 Locus Link Identifier

cdd_id varchar(50) Yes

N

ULL CDD Source Database.

evalue decimal(20,10)Yes

N

ULL Descriptive match.

MySQL Indexes

Keyname Type Field

illcdd_rowid INDEX rowid

illcdd_llid INDEX ll_id

illcdd_cddid INDEX cdd_id

Observation.: Domains can be thought of as distinct functional and/or structural units

of a protein. These two classifications coincide rather often, as a matter

of fact, and what is found as an independently folding unit of a

seqhound@blueprint.org Version 3.3

The SeqHound Manual 189 of 421 18/04/2005

polypeptide chain also carries specific function. Domains are often

identified as recurring (sequence or structure) units, which may exist in

various contexts. e.g.: of the CDD Record.

CDD: pfam00207: Alpha-2-macroglobulin

family|5952|2318|na|8.970050e+02

CDDID: pfam00207

Evalue: 8.970050e+02

SHoundLLIDFromCDDID()

after CDD tag: 1st element

function:

***evalue***

Organization: NCBI

Source db: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

*** ll_id ***

description: Locus link identifier

example: 2

default value: n/a

source: start of record

eg.

>>2

parser: llparser.c --> ll_parser.c

function: LL_AppendLLID_DB()

LL_ParseFile() -> LL_LineParser()

API:

SHoundLLIDFromGi()

SHoundLLIDFromGiList()

***cdd_id***

description: Conserved Domain Database Identifier

example: pfam00207

default value: n/a

source:

CDD: pfam00207: Alpha-2-macroglobulin

family|5952|2318|na|8.970050e+02

parser: llparser.c --> ll_parser.c

LL_ParseNPUnit()

LL_ParseFile() --> LL_LineParser()

API: SHoundCDDIDFromGi()

SHoundCDDIDFromGiList()

SHoundCDDIDFromLLID()

description: Describes match between CDD and the sequence

seqhound@blueprint.org Version 3.3

The SeqHound Manual 190 of 421 18/04/2005

example: 8.970050e+02

default value: 0

source: after CDD tag: last element

CDD: pfam00207: Alpha-2-macroglobulin

family|5952|2318|na|8.970050e+02

parser: llparser.c --> ll_parser.c

function: LL_ParseNPUnit()

LL_ParseFile() --> LL_LineParser()

API: SHoundCDDScoreFromGi()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 191 of 421 18/04/2005

GENE module

parse_gene_files.pl parser

Last updated September 27, 2004

purpose:

parse_gene_files.pl parses 4 files from the NCBI Gene database and populates gene_*

tables in SeqHound.

logic:

parse_gene_files.pl first drops the gene_* tables and then recreates them. It then reads the

files generated by gene_cron.pl to populate the new tables. In the future, the tables will

not be dropped upon update.

module: gene

gene_infoUniq

tables altered: gene_dbxref, gene_genomicgi, gene_history, gene_info, gene_productgi,

gene_pubmed, gene_object, gene_synonyms.

[datab]

password=your_pass_word

input files:

ftp://ftp.ncbi.nih.gov/gene/DATA

gene2refseqUniq

gene_historyUniq

gene2pubmedUniq

source code location:

/seqhound/gene/parse_gene_files.pl

/seqhound/scripts/gene_cron.pl

config file dependencies:

The relevant configuration file is:

slri/seqhound/config/.intrezrc (for Unix platforms)

The relevant section of the configuration file is:

;seqhound database that you are connecting

username=your_user_name

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

Text in italics must be changed for the .intrezrc file to function correctly with your

SeqHound set-up. Variables username, password, dsn, database in section

[datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) .intrezrc must reside in the

directory where the parser is running and .odbc.ini should be in your home directory.

You should also have a copy of shconfig.pm in the directory where the parser is running

to read .intrezrc and .odbc.ini.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 192 of 421 18/04/2005

command line parameters:

example use:

perl parse_gene_files.pl

associated scripts:

Four files are downloaded from ftp://ftp.ncbi.nih.gov/gene/DATA by gene_cron.pl:

gene2refseq.gz

gene_info.gz

gene_history.gz

gene2pubmed.gz.

gene_cron.pl unzips the files and makes sure that each file only contains unique records

generating the following files:

gene2refseqUniq

gene_infoUniq

gene_historyUniq

error and run-time logs:

parse_gene_files.log

gene2pubmedUniq

troubleshooting:

additional info:

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html

Note that the NCBI gene database is experimental. This means that the

input files and the SeqHound tables may change.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 193 of 421 18/04/2005

gene_dbxref table

Last updated September 28, 2004

Database: SeqHound

Table: gene_dbxref

Module: GENE

Definition: The table holds cross references to other databases for GeneIds and their

associated gis.

MySQL

Field Type Null Default Column_Definition

id Int(11)

N

o Identifier for a cross reference.

geneinfoid Int(11)

N

o Id of the gene_info record that

this cross reference refers to.

dbname Varchar(255)

N

o Database name

Varchar(30)

N

o Accession. dbaccession

MySQL Indexes

Keyname Type Field

igenedbxrefs_id INDEX id

igenedbxrefs_infoid INDEX geneinfoid

igenedbxrefs_db_name INDEX dbname

igenedbxrefs_db_access INDEX dbaccess

Observation:

Source org: NCBI

Source file: gene_info.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

***id***

Identifier for the cross reference. Mysql auto-increment table.

function:

description:

example: 1

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 194 of 421 18/04/2005

***geneinfoid***

description: Identifier for the geneinfo record where this cross reference was found.

example: 53577

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

not available yet

***dbname***

description: Name of the database that contains the second reference.

example: SGD

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***dbaccess***

description: Accession for the record that contains the name.

example: S0000572

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 195 of 421 18/04/2005

gene_genomicgi table

Last updated September 29, 2004

Module: GENE

Database: SeqHound

Table: gene_genomicgi

Definition: Table that contains the gis of genomic DNAs that contain genes. The

start, stop location and orientation on the genomic DNA are also

included.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Identifier of a product gi. Mysql

auto-increment column.

geneobjectid int(11)

N

o Identifier of the geneobject that

this genomic gi refers to.

gi int(11)

N

o The genomic gi.

int(11) Yes

The start location on the genomic

DNA.

end int(11) Yes

The start location on the genomic

DNA.

orientation char(1) Yes The orientation on the genomic

DNA.

start

MySQL Indexes

Keyname Type Field

id

igenomic_objectid geneobjectid

igenomic_gi gi

igenomic_id

Observation:

Source org: NCBI

Source file: gene_info.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 196 of 421 18/04/2005

***id***

description: Identifier of a genomic gi.

example: 1

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***geneobjectid***

description: The gene object that this gi refers to.

example:

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***gi***

description: The genomic gi.

example: 10954454

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***start***

description: The start position of the gene on the genomic DNA.

example: 348

default value:

ASN.1 struct:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 197 of 421 18/04/2005

source:

parser: parse_gene_files.pl

function:

API: not available yet

***end***

description: The end position of the gene on the genomic DNA.

example: 1190

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***orientation***

description: The orientation of the gene on the genomic DNA.

- meaning "minus" strand or

+ meaning "plus" strand

example: -

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 198 of 421 18/04/2005

gene_history table

Last updated September 29,2004

Database: SeqHound

Table: gene_history

Module: GENE

Definition: This table has information about geneids that are no longer current.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Identifier for a history record.

Mysql auto-increment column.

taxid int(11) Yes

N

CBI taxonomy identifier.

currentgeneid int(11) Yes The current NCBI Gene Id for

this gene.

oldgeneid int(11) Yes The discontinued Gene Id.

oldsymbol varchar(20) Yes

The symbol assigned to the

discontinued Gene Id, if the

discontinued record was not

replaced with another.

MySQL Indexes

Keyname Type Field

igenehistory_id Index id

current_gene_id Index currentgeneid

discont_gene_id Index oldgeneid

discont_symbol Index oldsymbol

Observation: Note that all fields except for id are optional.

Source org: NCBI

Source file: gene_history.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 199 of 421 18/04/2005

***id***

description: Identifier for the history record. Mysql auto-increment column.

example: 212815

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***taxid***

description: The NCBI taxonomy identifier for this gene..

example: 10116

default value:

ASN.1 struct:

***currentgeneid**

*

ASN.1 struct:

The discontinued NCBI gene id.

source:

parser: parse_gene_files.pl

function:

API: not available yet

description: The current NCBI gene id.

example: 29666

default value:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***oldgeneid***

description:

example:

default value:

ASN.1 struct:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 200 of 421 18/04/2005

source:

parser: parse_gene_files.pl

function:

API: not available yet

***oldsymbol***

description: The symbol associated with the discontinued gene id, if no new record

replaced the discontinued record.

example: hlyB

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 201 of 421 18/04/2005

gene_info table

Last updated September 29, 2004

Database: SeqHound

Table: gene_info

Module: GENE

Definition: Information about a gene.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Identifier for this record.

geneobjectid int(11)

N

o geneobjectid association with this

information.

symbol varchar(255) Yes The default symbol for this gene.

locustag varchar(255) Yes The LocusTag for this gene.

varchar(32) Yes

maplocation varchar(255) Yes The map location of this gene on

the chromosome.

description mediumtext Yes A description of this gene.

chromosome The chromosome where this gene

is found.

MySQL Indexes

Keyname Type Field

igeneinfo_id Index

geneobjectid

symbol Index symbol

locus_tag Index locustag

chr Index chromosome

id

object_id Index

Observation:

FTP script:

Source org: NCBI

Source file: gene_info.gz

gene_cron.pl

Parser: parse_gene_files.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 202 of 421 18/04/2005

***id***

description: Identifier for this record.

example: 74058

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***geneobjectid**

description: The geneobjectid for the gene that this information refers to.

example: 73870

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***symbol***

description: The symbol for this gene.

example: 1A981

default value:

ASN.1 struct: gene->locus (NCBI says this is where the data comes from).

source:

parser: parse_gene_files.pl

function:

API:

***locustag***

description: The locus tag for this gene.

example: 1A981

default value:

ASN.1 struct: gene->locus-tag (NCBI says this is where the data comes from).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 203 of 421 18/04/2005

source:

parser: parse_gene_files.pl

function:

API:

***chromosome***

description: The chromosome where this gene is found.

example: I

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***maplocation***

description: The map location of this gene on the chromosome.

example: I;-19.13 cM (interpolated genetic position)

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***description***

description: A description of this gene.

example: putative protein (69.3 kD) (1A981)

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 204 of 421 18/04/2005

gene_object table

Last updated September 29,2004

Database: SeqHound

Table: gene_object

Module: GENE

Definition: This table contains information on the status of gene records.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Identifier for this gene.

geneid int(11)

N

o

N

CBI's Gene Id.

status varchar(64)) Yes

Status of the RefSeq for this gene.

May be provisional, INFERRED,

MODEL, PREDICTED,

Reviewed or VALIDATED

MySQL Indexes

Keyname Type Field

igeneobject_id Index id

igeneobject_geneid Index geneid

igeneobject_status Index status

Observation:

Source file:

Source org: NCBI

gene2refseq.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 205 of 421 18/04/2005

***id***

description: Identifier for this gene.

example: 8

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***geneid**

description: NCBI's Gene Id.

example: 1246510

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***status***

description: The status of the RefSeq record that contains this gene.

NULL

INFERRED

MODEL

PREDICTED

Provisional

Reviewed

VALIDATED

example: Provisional

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 206 of 421 18/04/2005

gene_productgi table

Last updated September 29, 2004

Database: SeqHound

Table: gene_productgi

Module: GENE

Definition: Stores the gis of the RNA and Proteins produced by a gene.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Internal identifier for the product

gi. Mysql auto-increment column.

geneobjectid int(11)

N

o The geneobjectid for the gene that

this gi belongs to.

gi int(11)

N

o The gi.

MySQL Indexes

Keyname Type Field

iproduct_id Index id

iproduct_objected Index geneobjectid

iproduct_gi Index gi

Observation:

Source org: NCBI

Source file: gene2refseq.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

***id***

description: Internal identifier for this gi. Mysql auto-increment column.

example: 9

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***geneobjectid***

seqhound@blueprint.org Version 3.3

The SeqHound Manual 207 of 421 18/04/2005

description: NCBI's identifier for the gene that this gi belongs to.

example: 10

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

***gi***

description: The gi. May be a protein or an RNA gi.

example: 32455275

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 208 of 421 18/04/2005

gene_pubmed table

Last updated September 29, 2004

Database: SeqHound

Table: gene_pubmed

Module: GENE

Definition: This table stores publications that refer to genes.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Internal identifier for this

reference.

geneobjectid int(11) Yes The geneobjectid that this

reference is about.

p

mi

d

int(11) Yes The pmid of the reference.

MySQL Indexes

Keyname Type Field

ipubmed_id Index id

ipubmed_geneid Index geneobjectid

ipmid Index

p

mi

d

Observation:

Source org: NCBI

Source file: gene2pubmed.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

***id***

description: Internal identifier of this reference or of this gene objectid-pmid pair

example: 112394

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API: not available yet

seqhound@blueprint.org Version 3.3

The SeqHound Manual 209 of 421 18/04/2005

***geneobjectid***

description: The geneobjectid of the gene.

example: 173392

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***pmid***

description: The pmid of the reference.

example: 8889548

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 210 of 421 18/04/2005

gene_synonyms table

Last updated September 28, 2004

Database: SeqHound

Table: gene_synonyms table

Module: GENE

Definition: Stores synonyms for genes.

MySQL

Field Type Null Default Column_Definition

id int(11)

N

o Identifier of this synonym, Mysql

auto-increment column

geneinfoid int(11) Yes The geneinfoid for this synonym.

synonym text Yes The synonym.

MySQL Indexes

Keyname Type Field

igenesyn_id Index id

igeneinfoid Index geneinfoid

isynonym Index synonym

Observation:

Source org: NCBI

Source file: gene_info.gz

FTP script: gene_cron.pl

Parser: parse_gene_files.pl

***id*

description: Internal identifier for this synonym geneinfoid pair. Basically a rowid.

Mysql auto-increment column.

example: 11179

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 211 of 421 18/04/2005

***geneinfoid

description: The geneinfoid for the record where this synonym was found.

example: 27686

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

***synonym

description: The synonym.

example: T1J24.19

default value:

ASN.1 struct:

source:

parser: parse_gene_files.pl

function:

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 212 of 421 18/04/2005

Gene Ontology hierarchy (godb) module

goparser

Last updated: August 5, 2004

purpose:

Go data files contain information from the Gene Ontology Consortium that describe

controlled vocabularies for the description of the molecular function, biological process

and cellular component of gene products.

The GO parser is responsible for generating the hierarchical gene ontology datafiles:

go_name, go_parent, go_synonym and go_reference. This is accomplished by parsing

the flat files component.ontology, function.ontology and process.ontology.

module: godb

input files:

component.ontology

function.ontology

process.ontology.

From ftp://ftp.geneontology.org/pub/go/ontology/

tables altered:

go_name, go_parent, GO_REF, GO_SYN

source code location:

slri/seqhound/go/goparser.c

config file dependencies:

The relevant configuration file is:

slri/seqhound/parsers/.intrezrc (for Unix platforms) or

The relevant section of the configuration file is:

[datab]

;seqhound database that you are connecting

username=your_user_name

password=your_pass_word

dsn=dsn_in_.odbc.ini_file

database=seqhound

local=

[sections]

;gene ontology hierarchy

godb = 1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 213 of 421 18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your

SeqHound set-up. Variables username, password, dsn, database in section

[datab] should have the same values as USER, PASSWORD, DSN and DATABASE

respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections]

godb should be 1.

command line parameters:

This parser does not have any command line parameters. Typing “./goparser” will

process the three Gene Ontology files.

These files (component.ontology, function.ontology and process.ontology) must be

present in the same directory as the compiled goparser executable

associated scripts:

see “/slri/seqhound/goftp.pl”

This script retrieves the three input files required by the goparser. The script also

retrieves the files used as input by the addgoid parser (see lldb module).

error and run-time logs:

goparser writes to a log file called “goparserlog” where it writes Time Stamp, Error #,

goparser.c line # and cause of the problem.

e.g.

================[ May 21, 2003 2:36 PM====================

ERROR: [000.000] {goparser.c, line 74} Main: Cannot find function.ontology file

in current directory.

troubleshooting:

additional info:

The Gene Ontology Consortium documentation is located at

http://www.geneontology.org/

See data table descriptions for each of the tables that are listed under SeqHound's Data

Dictionary.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 214 of 421 18/04/2005

go_parent table

Last updated: August 5, 2004

Database: Seqhound

Table: go_parent

Module: Godb

Definition:

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto number row identifier

go_id int(11)

N

o 0 GeneOntology ID

p

arent_goi

d

int(11)

N

o 0 Parent Go Id..

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY go_id

p

arent_goi

d

igoparent_rowid INDEX rowid

igoparent_goid INDEX go_id

igoparent_pid INDEX

p

arent_goi

d

Observation.: Function, process and component are represented as directed acyclic

graphs (DAGs) or networks. The difference between a DAG and a

hierarchy is that in the latter each child can only have one parent; a

DAG allows a child to have more than one parent. A child term may be

an "instance" of its parent term(isa relationship) or a component of its

parent term (part-of relationship). A child term may have more than one

parent term and may have a different class of relationship with its

different parents.

Organization: Gene Ontology (ftp://ftp.geneontology.org/pub/go/ontology)

Source db: function.ontology (Molecular function)

process.ontology (Biological process)

component.ontology (Cellular component)

***go_id***

description: Gene Ontology identifier Child.

example: 0016172

%antifreeze activity ; GO:0016172 {Parent}

%ice nucleation inhibitor activity ; GO:0016173 {Child}

default value: n/a

source: related to the indentation in the file.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 215 of 421 18/04/2005

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Name()->

GO_AppendParent()

GO_GetParentOf()

GO_GetChildrenOf()

GO_GetAllChildren()

GO_GetAllAncestors()

API: SHoundGODBGetChildrenOf()

***parent_goid***

description: Gene Ontology identifier Parent.

example: 0016173

%antifreeze activity ; GO:0016172 {Parent}

%ice nucleation inhibitor activity ; GO:0016173 {Child}

default value: n/a

source: Related to indentation in the file

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Parent()

API:

GO_GetParentOf()

GO_GetChildrenOf()

GO_GetAllChildren()

GO_GetAllAncestors()

SHoundGODBGetParentOf()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 216 of 421 18/04/2005

go_name table

Last updated: August 5, 2004

Database: seqhound

Table: go_name

Module: godb

Definition:

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

go_id int(11)

N

o 0 GeneOntology ID.

go_name text

N

o Go_id Identifier name.

go_db int(11)

N

o 0 GO Database name.

go_level int(11)

N

o 0 Hierarchy level of the GO ID.

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY

go_id

go_name

go_db

go_level

igoname_rowid INDEX rowid

igoname_goid INDEX go_id

Observation.: This table contains a list of 'go_id' and it's molecular function name as

well the database where the go_id was parsed from and the level in the

Gene Ontology hierarchy.

Organization: GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)

Source db: function.ontology (Molecular function)

process.ontology (Biological process)

Gene Ontology identifier.

15643

component.ontology (Cellular component)

***go_id***

description:

example:

default value: n/a

source: See example Go File. The GO id follows 'GO:' tag.

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Name()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 217 of 421 18/04/2005

GO_GetRecordByID()

API: SHoundGOIDFromRedundantGi

SHoundGOIDFromRedundantGiList

SHoundGOIDFromGi

SHoundGOIDFromGiList

***go_name***

This is not used because the same GO ID can appear at

description: Gene Ontology identifier name.

example: anti-toxin;

default value: n/a

source: go_name follows, '%' and proceeds ';GO:'.

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Name()

GO_GetNameByID()

API: SHoundGODBGetNameByID

***go_db***

description: Gene Ontology Database file. This is an integer 1, 2 or 3

example: 1

where:

1 GO_MOLFUNCTION

2 GO_BIOPROCESS

3 GO_CELLCOMPONEN

default value: n/a

source: NA

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Name()

GO_GetClassification()

API: SHoundGODBGetClassification()

***go_level***

description: Hierarchy level of the GO ID.

different levels making the process to determine its level

irrelevant. This was intended to be used to indicate the

distance of a given GO node from the root node.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 218 of 421 18/04/2005

example: 3

default value: n/a

source: Related to the indentation of

'%{function name}'

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Name()

GO_GetHierarchyLevel()

API: SHoundGODBGetHierarchyLevel() this function is deprecated.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 219 of 421 18/04/2005

go_reference table

Last updated: August 5, 2004

Database: seqhound

Table:

MySQL

go_reference

Module: godb

Definition:

Field Type Null Default Column_Definition

rowid int(11) Auto incremented id

go_id int(11)

N

o 0 GeneOntology ID

go_ref text

N

o GeneOntology Reference.

go_ref_db varchar(50) Yes

N

ULL Reference Data Base Name

N

o

MySQL Indexes

Keyname Type Field

PRIMARY PRIMARY go_id

go_ref

igoref_rowid INDEX rowid

igoref_go_ref_db INDEX go_ref_db

Observation.: This table stores the 'go_id' with its database cross-reference. The

cross-reference can be an external database identifier that points to

something that is equivalent to a given GO term.

Organization: GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)

Source db: function.ontology (Molecular function)

process.ontology (Biological process)

component.ontology (Cellular component)

***go_id***

description: Gene Ontology identifier.

example: 45174

default value: n/a

source: The go_id comes after '%{function name} ; GO:xxxxxx ; '

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()-

>Go_Append_Reference()

GO_GetRecordByReference()

seqhound@blueprint.org Version 3.3

The SeqHound Manual 220 of 421 18/04/2005

API: SHoundGODBGetRecordByReference()

example:

***go_ref***

description: Gene Ontology Synonym Description.

example: ISBN: 0198506732

default value: n/a

source: go_ref comes after "GO:xxxxxx ; ISBN: 0198506732 ;"

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()-

>GO_Append_Reference()

GO_GetRecordByReference()

API: NA

***go_ref_db***

description: Gene Ontology Reference Database.

EC

ISBN

TC

default value: n/a

source: After '%{function_name} ; GO:xxxxxxx ; {go_ref_db}

After '%{component_name} ; GO:xxxxxxx ; {go_ref_db}

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()-

>GO_Append_Reference()

GO_GetRecordByReference()

API: NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 221 of 421 18/04/2005

go_synonym table

Last updated: August 5, 2004

Database: seqhound

Table: go_synonym

Module: godb

Definition:

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

go_id int(11)

N

o 0 GeneOntology ID

go_syn text

N

o GeneOntology Synonym.

MySQL Indexes

Keyname Type Field

PRIMARY KEY INDEX go_id

go_syn(100)

igosynonym_rowid INDEX go_id

igosynonym_go_id INDEX go_syn

Observation.:

Organization: GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)

Source db: function.ontology (Molecular function)

process.ontology (Biological process)

component.ontology (Cellular component)

Example of source

file:

n/a

%glycine binding activity ; GO:0016594 ; synonym:Gly binding ;

synonym:aminoacetic acid binding ; synonym:aminoethanoic acid

binding

***go_id***

description: Gene Ontology identifier.

example: 45174

default value: n/a

source: The go_id follows '%{function name}; GO:'

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->Go_Append_Synonym()

API:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 222 of 421 18/04/2005

***go_syn***

description: Gene Ontology Synonym Description. This can be a 'synonym' of a

'process' or 'component' name.

example: 'Gly binding'

default value:

synonym:aminoacetic acid binding ;

n/a

source: After '%{function_name} ; GO:xxxxxxx ; synonym:{synonym name}'

After '%{component_name} ; GO:xxxxxxx ; synonym:{synonym name}'

After '%{process_name} ; GO:xxxxxxx ; synonym:{synonym name}'

glutathione dehydrogenase (ascorbate); glutamic acid binding

%glycine binding activity ; GO:0016594 ;

synonym:Gly binding ;

synonym:aminoethanoic acid binding

parser: goparser.c -> go_parser.c

function: GODB_ParseFile()->GO_LineParser()-

>GO_AppendRecord()

db_layer: GO_AppendRecord()->GO_Append_Synonym()

API: n/a

seqhound@blueprint.org Version 3.3

The SeqHound Manual 223 of 421 18/04/2005

Gene Ontology Association (GOA) module

Last updated April 5, 2005

This section is maintained by Renan Cavero.

purpose:

The GOA Module provides Gene Ontology (GO) information for all possible organisms that have GO terms. GO terms are controlled

vocabulary for molecular function, biological process and cellular component of gene products. GO Annotation assignments are

derived from ftp://ftp.geneontology.org/pub/go/gene-associations/ and from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz.

The GO terms are linked to identifiers provided by the curating database (e.g. FlyBase) AND to the NCBI Gene Info identifier (GI)

using the SeqHound DBXref Module.

The GOA Module contains

• GO term assignments to genes or proteins

• Literature references like PubMed IDs

• Evidence codes between the gene product and the GO term

• Object types that get annotated such as gene, transcript, protein or protein structure

• Gene symbols or other associated text

• Taxonomic identifiers

• Date on which the annotation was made.

module:

GOA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 224 of 421 18/04/2005

input files:

All files found under gene ontology FTP site

ftp://ftp.geneontology.org/pub/go/gene-associations/

NCBI EntrezGene

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz

tables altered:

goa_association

goa_gigo

goa_reference

goa_seq_dbxref

goa_with

goa_xdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 225 of 421 18/04/2005

Table summarizing input files, parsers and command line parameters for GOA module.

Input file parser command line

parameters

All files from: ftp://ftp.geneontology.org/pub/go/gene-associations/ goa_parser_cluster.pl

See note below.

-d {Org.abbr.} (*)

–c {Cluster: T F}

-o {Offset: n}

-n {Num. Lines: n}

-f {Flag File}

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz go_geneparser_cluster.pl

See note below.

-d GO_GENE (**)

-c {Cluster: T F}

-o {Offset: n}

-n {Num. Lines: n}

-f {Flag File}

Tables: goa_gigo, dbxref, goa_seq_dbxref, goa_association, accdb,

redund, taxgi.

Seqhound_gigo_cluster.pl

See note below.

-c {Cluster: T F}

-o {Offset: n}

-n {Num. Lines: n}

-f {Flag File}

Tables: goa_gigo, goa_seq_dbxref, goa_association, accdb, redund, taxgi. Seqhound_gigoPDB_cluster.pl

See note below.

-c {Cluster: T F}

-o {Offset: n}

-n {Num. Lines: n}

-f {Flag File}

Tables: goa_gigo, dbxref, goa_seq_dbxref, goa_association. Seqhound_gigoCGEN_cluster.pl

See note below.

-c {Cluster: T F}

-o {Offset: n}

-n {Num. Lines: n}

-f {Flag File}

Notes:

(*) Parser Command Line Parameters. (See dbxref.ini for details.)

goa_parser_cluster.pl: Parse Gene Ontology Association files and populate GOA module tables.

-d Organism Database File Abbreviation: Can be found under dbxref.ini Partition [GENE_ASSOCIATON_FILE] Ex.: GOA_DDB,

GOA_CGEN.

-c Cluster Option: T: True (run in a cluster environment); F: False: (run stand alone).

The following options are optional. Needs to be set-up only if running in a Cluster environment.

-o Offset: The offset in a flat file where a cluster node will start parsing.

-n Number of lines: The number of line-records to process by a cluster node after the Offset is reached.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 226 of 421 18/04/2005

-f Flag File: For synchronization purpose. A file name that will be generated by a cluster node when finished parsing, telling the

Cluster’s head node that the parsing was completed. (Cluster Head node will track completion of all nodes before continuing with the

next process in the queue.

(**)

goa_gene_parser: Same as above but this parser will only parse gene2go file.

seqhound_gigo_cluster.pl: Populates goa_gigo with GI’s (GenBank assessions) extracted from dbxref table and by looking-up Xref

in goa_seq_dbxref will get GO terms from goa_association. When looking up GI’s in accdb, redundancy and taxonomies are

considered.

seqhound_gigoPDB_cluster.pl: Looks up PDB records and chains from goa_seq_dbxref and by looking-up GIs in accdb populates

gi-go pairs in goa_gigo. When looking up GI’s in accdb, redundancy and taxonomies are considered.

seqhound_gigoCGEN_cluster.pl: Same as above but for GIs from Compugen records.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 227 of 421 18/04/2005

source code location:

The parsers for this module have been updated to work in a cluster configuration. Source code is unavailable at the time of this

manual release but will be released with the next code release.

/slri/seqhound/dbxrefgoa

config file dependencies:

The relevant configuration file is:

dbxref.ini (for Unix platforms)

see dbxref.ini documentation.

command line parameters:

See summary table above.

associated scripts:

The following shell scripts execute the parsers:

dbxrefgoa_cron.sh: Cron job that runs dbxrefgoa_updatecron.pl

dbxrefgoa_updatecron.pl: Program that automates deployment and runs parsers in a cluster. (see dbxref.ini documentation for details)

auxiliary scripts:

In a cluster environment the following scripts will help dbxrefgoa_updatecron.pl in the deployment and execution:

parsers.deploy.sh: deploy files to be parsed to the cluster nodes.

generate_goa_run.pl: Generate an instruction script that distributes processes in the cluster nodes.

run{organism database_file_abbreviation}: instruction script generated by generate_goa_run.pl. Example.: runGOA_MGI.sh

clean.sh: clean (remove) files previously deployed by script deploy.sh.

wait.pl: makes the cluster’s head node wait until all the nodes finish processing the parser that was deployed. When wait.pl receives a

signal from all the nodes, it will continue with the next parser.

See dbxref.ini for more details.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 228 of 421 18/04/2005

error and run-time logs:

at summarize number of records updates by each goa parser.

dbxrefgoa_updatecron.pl will execute instructions from the configuration file dbxref.ini. Any error generated by this process or by the

parsers will be recorded under dbxrefgoa_error.log. The final results will be summarized in dbxrefgoa_updatecron.log and

goa_parser.log. By following the error messages, the SeqHound administrator will be able to adjust configuration parameters under

dbxref.ini and run it again.

dbxrefgoa_errors.log: log that tracks run time error by the parsers.

dbxrefgoa_updatecron.log: log that indicate what parser was run, completed or failed . A copy of this log will be send by email to the

SeqHound administrator.

goa_parser.log: log th

troubleshooting:

additional info:

The Gene Ontology Consortium documentation is located at

http://www.geneontology.org/GO.contents.doc.html

ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec

Gene Ontology Module Diagram

The tables also appears in http://www.blueprint.org/seqhound/api_help/docs/SeqHound_Schema_Prod.pdf

seqhound@blueprint.org Version 3.3

The SeqHound Manual 229 of 421 18/04/2005

goa_seq_dbxref

PK id

FK3 goa_xdb_id

I1 xref_key (2)

taxid_1 (13a)

taxid_2 (13b)

type_id (12)

symbol (3)

full_name (10)

synonym (11)

timestamp

goa_with

PK id

FK4 goa_association_id

FK5 goa_xdb_id (8)

xref_key (8a, 8b, 8x)

key_type

FK6 goa_seq_dbxref_id

timestamp

goa_xdb

PK id

abbreviation (1)

name

object

example

generic_url

url_syntax

url_example

timestamp

goa_reference

PK id

FK3 goa_association_id

goa_xdb_id

xref_key (6a, 6b, 6x)

timestamp

go_name

PK go_id

go_name

I2 go_db

I1 go_level

goa_association

PK,FK4 id

is_not (4)

FK8 go_id (5)

goa_seq_dbxref_id

assigned_by (15)

code (7)

date (14)

timestamp

Gene Ontology Flat File

DB (1)

DB_Object_ID (2)

DB_Object_Symbol (3)

NOT (4)

GOid (5)

DB_Reference (6)

Evidence (7)

With (8)

Aspect (9)

DB_Object_Name (10)

DB_Object_Synonym (11)

DB_Object_Type (12)

taxon (13)

Date (14)

Assigned_by (15)

Relational Data Base representation of Gene

Ontology Association module in SeqHound.

(Last updated March 23, 2005)

Numbers beside column names refer to columns in

the Gene Ontology flat file representation

(see http://www.geneontology.org/

GO.annotation.shtml#file and table below).

goa_gigo

gi

go

code

xdbid

xref_key

seqhound@blueprint.org Version 3.3

The SeqHound Manual 230 of 421 18/04/2005

goa_seq_dbxref table

Last updated April 12, 2005

Database:

Table:

Seqhound

goa_seq_dbxref

Module: GOA

Definition: The main table of the GOA Module. This table contains entries for a

database record ID (e.g.: Fly Base ID FBgn0013277) and the

information associated with the record that can be found in

gene_association files from Gene Ontology. The record is the base to

obtain other information like GO terms, reference and other annotation.

Source org: Gene Ontology ftp://ftp.geneontology.org/pub/go/gene-associations/

Source file: See module description above.

Parser: See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 231 of 421 18/04/2005

goa_seq_dbxref table

Field Type API Null Default Column_Definition Example Source

id int(10)

N

o Auto incremented

integer.

Internal primary record

identifier.

2099131 Assigned by source

document parser.

N

A

goa_xdb_id int(10)

N

o

0

Database identifier from

table goa_xdb. For

example, an xdb_id of 4

indicates the Candida

Genome Database.

5 See note below.

N

A

xref_key varchar(30)

N

o

Record ID (in the above-

mentioned database)

which points to the object

being annotated.

PrID1098818 See note below.

N

A

taxid_1 int(11)

N

o

0

Taxonomic identifier of

the species encoding the

gene product.

11676 See note below.

N

A

taxid_2

int(11) Yes 0

Taxonomic identifier of

the species where the gene

product acts (in the

manner described by the

GO annotation). For

example, if the protein is

from a virus, the host

organism might be listed

here.

0 See note below.

N

A

type_id varchar(20)

N

o

Indicates the type of

object being annotated

(gene, transcript, protein

etc.)

protein See note below.

N

A

seqhound@blueprint.org Version 3.3

The SeqHound Manual 232 of 421 18/04/2005

symbol varchar(30)

N

o

A unique and valid

symbol to which the

xref_key is matched. It

can be an ORF name,

gene product symbol or

any other identifier.

GI424263 See note below.

N

A

full_name varchar(255) Yes

N

ame of gene or gene

product.

cell surface glycoprotein

gp138

See note below.

N

A

synonym varchar(50) Yes

N

ULL Gene symbol or other text.fusA See note below.

N

A

lastupdate CURRENT_TIMES

TAMP

2005-04-01 17:25:44

N

A

N

ULL

timestamp

N

o When was this entry last

updated.

Note: columns in this table may be mapped to the GO Annotation flat file format described at

http://www.geneontology.org/GO.annotation.shtml#file. Also see the Figure entitled “GOA relationships” below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 233 of 421 18/04/2005

goa_seq_dbxref indices

Keyname Type Cardinality Field

id_idx INDEX 2138717 id

goa_xdb_id_idx PRIMARY 91 goa_xdb_id

xref_key_idx PRIMARY VC0002 xref_key

taxid_1_idx INDEX 686 taxid_1

symbol_idx INDEX VC0002 symbol

synonym_idx INDEX mioC synonym

lastupdate_idx INDEX 2005-04-01 17:25:44 lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 234 of 421 18/04/2005

goa_association table

Last updated April 12, 2005

Database: Seqhound

Table: goa_association

Module: GOA

Definition: This table contains GO term information associated with the Record ID

(goa_seq_xref.xref_key).

Source file: See module description above.

Parser: See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 235 of 421 18/04/2005

goa_association table

TField ype Null Default Column_Definition Example Source API

id int(10)

N

o Auto

increment

integer.

Internal unique identifier for this record. 33 This identifier is incremented

by the source file parser.

N

A

N

ULL Flags that modify the interpretation of an

annotation. A GO ID with a Not in this

field means that a particular gene product

is NOT associated with a particular GO

term.

For more information please read "Using

the Qualifier column" from

“T” indicates that “NOT was found in

this column.

“F” indicates that “NOT” was not found.

F See note below.

N

A

go_id int(10)

N

o 0 Gene Ontology identifier for the term

attributed to the object described by

goa_seq_xref.xref_key.

3677 See note below.

N

A

goa_seq_dbxref_id int(10)

N

o 0 Foreign key pointing to

goa_seq_dbxref.id

7 See note below.

N

A

assigned_by int(10)

N

o 0 database that made the annotation. 117 See note below.

code char(4)

N

o Evidence Code. One of IMP, IGI, IPI,

ISS, IDA, IEP, IEA, TAS, NAS, ND, IC.

IEA See note below.

N

A

date char(8) Yes

N

ULL Date on which the annotation was made. 20040107

N

A

lastupdate timestamp Yes

N

ULL Date when this record was last modified.

2005-04-01 17:25:44

See note below.

N

A

is_not char(1) Yes

http://www.geneontology.org/GO.annota

tion.html

See note below.

Note: columns in this table may be mapped to the GO Annotation flat file format described at

http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 236 of 421 18/04/2005

goa_association indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 8720064 id

go_id

goa_seq_dbxref_id

go_id UNIQUE 8720064

code

goa_id_idx INDEX 21268 go_id

goa_seq_dbxref_id_idx INDEX 8720064 goa_seq_dbxref_id

assigned_by_idx INDEX 19 assigned_by

code_idx INDEX code

lastupadate_idx INDEX 2005-04-01 17:25:44 lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 237 of 421 18/04/2005

goa_reference table

Last updated April 12, 2005

Database: Seqhound

Table: goa_reference

Module: GOA

Definition: This table contains the reference identifier for the GO annotation.

Reference identifiers are unique identifiers appropriate to a database

authority for the attribution of the go_id to the Record ID. They may be

a literature reference or a database record.

Source file: See module description above.

Parser: See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 238 of 421 18/04/2005

goa_reference table

Field Type Null Default Column_Definition Example Source API

id int(10)

N

o Auto

increment

integer.

A unique identifier for this record.

1 This value is autoincremented

by the source file parser.

N

A

goa_association_id int(10)

N

o 0 Foreign key pointing to

goa_association.id.

530508 See note below.

N

A

goa_xdb_id int(10)

N

o 0 Database identifier that made the

reference. Foreign key pointing to

goa_xdb.id..

49 See note below.

N

A

xref_key varchar(20)

N

o Reference Identifier. For example, if

the reference is a published paper

that has a PubMed ID, the PubMed

ID number will be in this field

MGI:1354194 See note below.

N

A

lastupdate timestamp Yes

N

ULL When was this entry last updated? 2005-04-01 17:25:44

N

A

Note: columns in this table may be mapped to the GO Annotation flat file format described at

http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

goa_reference indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 8424423 id

goa_association_id_idx INDEX 8424423 goa_association_id

goa_xdb_id_idx INDEX 18 goa_xdb_id

xref_key_idx INDEX 2864 xref_key

lastupdate_idx INDEX 17122 lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 239 of 421 18/04/2005

goa_with table

Last updated April 12, 2005

Database: Seqhound

Table: goa_with

Module: GOA

Definition: This table is used to hold additional identifiers for annotations using

certain evidence codes. For more information please see "With (or)

From" section at http://www.geneontology.org/GO.annotation.html

Source files: See module description above.

Parsers: See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 240 of 421 18/04/2005

goa_with table

Field Type Null Default Column_Definition Example Source API

id int(10)

N

o Auto

increment

integer.

A unique identifier for this record. 1 This column is autoincremented

by the source file parser.

N

A

goa_association_id int(10)

N

o 0 Foreign key pointing to

goa_association.id

530509 See note below.

N

A

goa_xdb_id int(10)

N

o 0 Database Identifier that made the

'with' annotation.

37 See note below.

N

A

xref_key varchar(20)

N

o Reference identifier. IPR001601 See note below.

N

A

key_type int(10)

N

o 0 Type of the symbol annotated:

Type of the symbol annotated:

1=gene_symbol,

2=allele_symbol,

3=gene_id,

4=sequence_id,

5=go_id.

0 See note below.

N

A

goa_seq_dbxref_id int(10)

N

o 0 Foreign key pointing to

goa_seq_dbxref.id for quick look-

up from a ‘with’ annotation back

to the record id.

0 See note below.

This field is not currently

implemented.

N

A

lastupdate timestamp

N

o

N

ULL When was this entry last updated? 2005-04-01 17:25:44

N

A

Note: columns in this table may be mapped to the GO Annotation flat file format described at

http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 241 of 421 18/04/2005

goa_with indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 1234449 id

goa_association_id_idx INDEX 1234449 goa_association_id

goa_xdb_id_idx INDEX 17 goa_xdb_id

xref_key_idx INDEX xref_key

key_type_idx INDEX key_type

goa_seq_dbxref_id_idx INDEX goa_seq_dbxref_id

lastupdate_idx INDEX lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 242 of 421 18/04/2005

goa_xdb table

Last updated April 12, 2005

Database: Seqhound

Table: goa_xdb

Module: GOA

Definition: The table goa_xdb contains metadata about the organizations which

contribute to the GO. There is a one to one relationship between

abbreviations and URLs where data can be retrieved. A single URL

which can be queried using database ids is referred to as a datasource.

Each organization may have multiple datasources. Each abbreviation

identifies one section of the file which provides the abbreviation and

full name of that data source, the object type which is retrieved, an

example database id, the generic url which identifies that data source

uniquely and globally and the syntax of an actual query request with

parameters filled in.

For more information please read:

ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec

Source files: See module description above.

Parsers: See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 243 of 421 18/04/2005

goa_xdb table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Auto

increment

integer.

A unique identifier for this record.

N

ote that this identifier is unstable

and may change from one release

to another.

76 This column is

autoincremented

by the source file

parser.

N

A

abbreviation varchar(50)

N

o Database abbreviation. SWISS-PROT See note below.

N

A

name varchar(255) Yes

N

ULL Database name or description. Swiss-Prot protein database. See note below.

N

A

object varchar(255) Yes

N

ULL type of identifier returned from the

data source:

Accession number

Locus identifier

call number

Gene symbol etc.

See note below.

N

A

example varchar(50) Yes

N

ULL An example database identifier. Swiss-Prot:P45867 See note below.

N

A

generic_url varchar(255) Yes

N

ULL The root or representative URL for

this data source.

http://ca.expasy.org/spro

t/

See note below.

N

A

url_syntax varchar(255) Yes

N

ULL A string to which one can append

a database ID and get a valid URL

query for the object referenced by

that id. There is no wild card to

represent the database ID, it is

simply appended to the end of the

string.

http://www.expasy.ch/cgi-

bin/sprot-search-ac?

See note below.

N

A

url_example varchar(255) Yes

N

ULL An example of what the

url_syntax will look like

http://www.expasy.ch/cgi-

bin/sprot-search-

ac?P45867

See note below.

N

A

lastupdate timestamp

N

o

N

ULL 2005-04-01 17:25:44

N

A

Note: columns in this table may be mapped to the GO Annotation flat file format described at

http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 244 of 421 18/04/2005

goa_xdb indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 117 id

abbreviation_idx INDEX abbreviation

name_idx INDEX name

seqhound@blueprint.org Version 3.3

The SeqHound Manual 245 of 421 18/04/2005

goa_gigo table

Pre-computed list of gi-go pairs.

Parsers:

Last updated April 5, 2005

Database: Seqhound

Table:

Module:

goa_gigo

GOA

Definition:

Since SeqHound database is NCBI GI Centric and GOA is organism

database ID centric; the process to get a GO term given a GI can be

complicated and time consuming, for this reason a pair list of gi-go

were pre-computed and stored in goa_gigo table. The SeqHound core

database (mainly accdb and redund tables) is required to run these

parsers.

Source files: See module description above.

See module description above.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 246 of 421 18/04/2005

goa_gigo table

Field API Type Null Default Column_Definition Example Source

gi INTEGER

N

o 0 Gene Info Identifier for an NCBI sequence

record

6552303 The record indicated by the last

two columns of this table are

converted to a matching Gene

Info Identifier using the

DBXref module.

N

A

go INTEGER

N

o 0 Gene Ontology Identifier. 6281 From table

goa_association:gi_id

N

A

varchar(4) Yes

N

ULL Evidence Code. One of IMP, IGI, IPI, ISS,

IDA, IEP, IEA, TAS, NAS, ND, IC.

TAS

For more information please read

http://www.geneontology.org/GO.evidenc

e.html

From table

goa_association:code

N

A

xdb_id INTEGER Yes

N

ULL Identifier for database. See also xref_key

below. From table goa_xdb.

103 From table goa_xdb.id

N

A

xref_key varchar(30) Yes

N

ULL Identifier in database (see previous

column) pointing to object that was

originally annotated.

P38398 From table

goa_seq_dbxref:xref_key.

N

A

lastupdate timestamp

N

o

N

ULL When was this entry last updated? 2005-04-01 17:25:44

N

A

code

seqhound@blueprint.org Version 3.3

The SeqHound Manual 247 of 421 18/04/2005

goa_gigo indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 7936320

gi

go

code

gi_idx INDEX 3968160 gi

go_idx INDEX 7254 go

code_idx INDEX code

xdb_id_idx INDEX 22 xdb_id

xref_key_idx INDEX 7936320 xref_key

lastupdate_idx INDEX 22 lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 248 of 421 18/04/2005

dbxref module

Last updated April 12, 2005

purpose:

The purpose of the SeqHound dbxref module is to have a centralized data source (cross references) where related information can be

found from a given ID. By using the dbxref module, it is possible to find one to "n" relationships between IDs from 3rd party

databases for DNA, Protein Sequences, Domains and Interactions, GenBank Accession Numbers, Swiss-Prot, LocusLink, SGD,

MGD, ZFIN, FB, PFAM, SMART, etc. See the “Explanation of the data table structure” below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 249 of 421 18/04/2005

Who Cross-references who?

The following table indicates what database records we collect as cross-references from primary database records in the SeqHound

DBXref module.

This table was last updated April 15th and may change on a regular basis.

DB Cross references to:

Primary DBs GB SPTR CG TAIR UNIGENE IPI ENSEMBL OMIM

GENE X X X X X X X

SPTR X X X X X X X X

ENSEMBL X X X

AFCS X

TIGR_ATH X

MGI X X

RGD X

FB X

WB X X

SGD X

DDB X

ZFIN X

GRAMINE X X

GENEDB_SPOMBE X

TIGR_ATH X

TIGR_CMR X

UNIGENE X

VIDA X

Explanation of the data table structure:

The dbxref table is the core of the dbxref module. This explanation refers to that table.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 250 of 421 18/04/2005

Dbxref represents cross-references between 3rd party databases and GeneBank. This table is created by parsing 3rd party flat files and

creating records in "dbxref" for each record parsed. dbxref is a self-referencing table.

The value in the "record_id" field may represent one of two things:

1. A source record:

If the value of "parent_id" in the row is zero (0) it is referred to as an object id. Its "record_id" is the identifier/primary id for a

record in the database ("source_db") from which cross-references have been retrieved.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 251 of 421 18/04/2005

For example, from the dbxref table content example listed below.

P38903 is a Primary ID in SwissProt Database

S00055403 is a Primary ID in SGD Database.

2. A database cross-reference found in a source record. The retrieved cross-references are stored as a combination of

a) Source database ("source_db").

b) an identifier for a record in that database ("record_id").

c) the record that the cross-reference was found in. The field "parent_id" contains an integer. This integer refers to the row in this

table (see "id") that contains the identifier of the record from which this cross reference was retrieved.

d) cross references are retrieved from some field in a record. The name of this field is recorded in the "field" column or if there

are no field names then the column number is recorded. For example : Col1 or Col4.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 252 of 421 18/04/2005

Example entries

+----+-----------+-----------+-----------+------+-------+----+

| 1 | SP | P38903 | 0 | 0 | ID | 0 |

| 2 | GB | U06630 | 1 | 0 | DR | 1 |

| 3 | GB | AAB38372 | 1 | 2 | DR | 2 |

| 4 | GB | S79635 | 1 | 0 | DR | 1 |

| 5 | GB | AAB35312 | 1 | 4 | DR | 2 |

| 6 | GB | X87331 | 1 | 0 | DR | 1 |

| 7 | GB | CAA60763 | 1 | 6 | DR | 2 |

| 8 | GB | Z74922 | 1 | 0 | DR | 1 |

| 9 | GB | CAA99203 | 1 | 8 | DR | 2 |

| 10 | InterPro | IPR001757 | 1 | 0 | DR | 3 |

| 11 | Pfam | PF00689 | 1 | 0 | DR | 3 |

| 12 | SGD | S0005540 | 1 | 0 | DR | 0 |

| 13 | SP | P47096 | 0 | 0 | ID | 0 |

| 14 | GB | Z49525 | 13 | 0 | DR | 1 |

| 15 | GB | CAA89550 | 13 | 14 | DR | 2 |

| 16 | GB | X87297 | 13 | 0 | DR | 1 |

| 17 | IPR | IPR007113 | 13 | 0 | DR | 0 |

| 18 | SGD | S00055403 | 0 | 0 | Col4 | 0 |

| 19 | GB | 1420113 | 18 | 0 | GI | 0 |

| 20 | GermOnline| 143602 | 18 | 0 | Col1 | 0 |

| 21 | DIP | 4191 | 18 | 0 | Col1 | 4 |

| 22 | GB | 6324588 | 18 | 0 | GI | 0 |

| 23 | GB | AX596518 | 18 | 0 | Col1 | 1 |

| 24 | CandidaDB | CA1247 | 18 | 0 | Col1 | 0 |

| 25 | GB | CAA99203 | 18 | 0 | Col1 | 2 |

| 26 | InterPro | IPR002554 | 18 | 0 | Col1 | 0 |

| 27 | GB | NP_014656 | 18 | 0 | Col1 | 2 |

| 28 | SP | P38903 | 18 | 0 | Col1 | 0 |

| 29 | PIR | S54620 | 18 | 0 | Col1 | 0 |

| 30 | GB | S79635 | 18 | 0 | Col1 | 1 |

| 31 | GB | U06630 | 18 | 0 | Col1 | 1 |

| 32 | GB | X87331 | 18 | 0 | Col1 | 1 |

| 33 | MIPS | YOR014W | 18 | 0 | Col1 | 0 |

| 34 | GB | Z74922 | 18 | 0 | Col1 | 1 |

+----+-----------+-----------+-----------+------+-------+----+

The "link", "field" and "cv" fields in the dbxref table help specify where the information is coming from. These are described below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 253 of 421 18/04/2005

link

The link field was created to support some databases (for example Swiss-Prot) which may store more than one cross-reference in one

field. For example a swiss-prot record for a protein may contain a "DR" field that lists two EMBL identifiers. ie.:

DR EMBL; U06630; AAB38372.1; -.

The first identifier is a cross-reference for a nucleotide record in EMBL that encodes a protein (second identifier).

The link field was created to capture this relationship between the two cross-references. The integer in the link field points to a row in

the dbxref table (see "id") and indicates that the current cross-reference is linked (or comes after) some other cross reference in the

same source record and field.

The exact meaning of the database cross-reference can be discerned from the "cv" (Controled Vocabulary) column of this table. In

this example, one cross-reference would be labeled as "nucleotide" and one would be labeled as "protein".

2: points to record "id=2" meaning that protein sequence identifier "AAB38372" comes after the identifier “U06630” in the DR

field.

0: if no relationship exists or if the dbxref is the first one in a list

field

This describes the field in the source record where the database cross-reference was found. For example, “DR” is a field name in

Swiss-Prot records indicating a Database Reference. Alternatively, a column number might be listed here is the source file was a tab-

delimited text file.

cv

cv is a controlled vocabulary term that is used to describe the type of record that the database cross-reference is pointing to. This

controlled vocabulary is simple at the moment and may be expanded in future. Briefly,

1 indicates a DNA sequence record

2 indicates a protein sequence record

3 RNA

100 Swiss-Prot record

101 Trembl record

110 Swiss-Prot secondary accession

111 Trembl secondary accession

seqhound@blueprint.org Version 3.3

The SeqHound Manual 254 of 421 18/04/2005

0 means not defined.

source code location: /slri/seqhound/dbxref

input Files: See summary table.

parsers: See summary table.

config file dependencies:

dbxref.ini

command line parameters: See summary table.

example use: See summary table.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 255 of 421 18/04/2005

associated scripts: (See associated scripts under GOA Module.)

The following shell scripts execute the parsers.

dbxrefgoa_cron_monthly.sh: Script to create all tables and sub-directories when running dbxref-goa module for the first time.

dbxrefgoa_updatecron.pl: Program that automates deployment and runs parsers in a cluster. (see dbxref.ini documentation for details)

error and run-time logs:

dbxrefgoa_errors.log: log that tracks run time error by the parsers.

dbxrefgoa_updatecron.log: log that indicate what parser was run, completed or failed . A copy of this log will be sent by email to the

SeqHound administrator.

dbxref_parser.log: log that summarizes number of record updates by each dbxref parser.

Error messages are sent by email when dbxrefgoa_updatecron.pl runs.

troubleshooting:

Check the email sent by dbxrefgoa_updatecron.pl to find out if any parsers had problems. For more details consult the log files.

additional info:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 256 of 421 18/04/2005

How to update the DBXref and GO Annotation modules using a cluster.

Last updated April 15th, 2005.

This section is maintained by Zhe Wang and Renan Cavero.

Note for those installing there own local instance of SeqHound. The data sets generated by this process are made available on the

SeqHound ftp site at ftp://ftp.blueprint.org/pub/SeqHound/Data/ . These instructions are supplied as a description of our internal

process. The scripts described are provided as part of the SeqHound code release package for those who may wish to use them as a

guide for setting up their own internal cluster build. The framework presented here might also be useful to others for setting up cluster

jobs.

Dbxrefgoa_update.pl will generate scripts for distribution and execution of processes on a cluster. To accomplish this it uses clusterit

tools. More information about clusterit can be found at: http://www.garbled.net/clusterit.html.

The file dbxref.ini has all the configurations to build and update the DBXRef/GOA modules on a computer cluster. This configuration

file is read and executed by the script dbxrefgoa_update.pl.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 257 of 421 18/04/2005

Understanding the dbxref.ini file

Section [DBXREFDATA] describes the database to which the updated or new DBXRef/GOA data will be written. Variable

“database” should be an existing database. The text in italics should be modified. The user must have write privilege to the

database.

[DBXREFDATA]

# this should point to the server which has dbxref and goa modules

host = staging_box

port = 33306

user = user

password = passwd

database = dbxrefgoa

table = dbxref

tablegigo = goa_gigo

Section [SEQHOUND] describes the SeqHound database from which information will be read in order to update the DBXRef/GOA

modules specified in section [DBXREFDATA]. The user must have read privilege to the database.

[SEQHOUND]

# this should point to the server which has the most up-to-date tables of seqhound

hostshound = production_box

portshound = 3306

usersshound = user

passwordshound = passwd

databaseshound = seqhound

working_dir = /home/user123/dbxrefgoa/

data_dir = /scratch/dbxrefgoa/download

myemail = you@you.org

The variable “working_dir” specifies the directory in which to run script dbxref_updatecron.pl.

Directory /home/ is mapped to all cluster nodes.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 258 of 421 18/04/2005

Variable “data_dir” specifies where the input files are saved.

Directory /scratch/ is a local directory on each node.

Variable “myemail” should be the e-mail address of the SeqHound administrator who will be notified of the result of the

DBXRef/GOA update.

Section [LOG_FILES] has the name for three logs files:

[LOG_FILES]

results_log = dbxrefgoa_results.log

errors_log = dbxrefgoa_errors.log

update_log = dbxrefgoa_updatecron.log

Section [DBXREF_FILES] and [GENE_ASSOCIATION_FILES] specify the data files to be processed.

Each of sections [ORGANISM_DBXREF] and [ORGANISM_GOA] specifies the categories of data to be processed.

[ORGANISM_DBXREF]

ORGANISMS = “GENE;SP;TR”

The value of the variable “ORGANISMS” indicates that there are three groups of data that need to be processed. This particular order

of GENE, SP and TR is important for internal purposes. By looking at the value of the variable “ORGANISMS”, script

dbxrefgoa_update.pl goes to the proper section in this configuration file, dbxref.ini, to find information such as where to download the

data file and what commands to be executed on the data file.

For example, when dbxrefgoa_update.pl reads “GENE”, it looks for section [GENE] in the dbxref.ini file. Section [GENE] is shown

below. “GENE_URL” specifies the FTP path of the data file, “GENE_CMD” specifies the command to be run, and

“GENE_CMD2RUN” specifies the command to be run on the cluster nodes.

[GENE]

GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene"

GENE_CMD2RUN=perl dbxref_gene_cluster.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 259 of 421 18/04/2005

GENE_CMD="./deploy.sh GENE gene2accession"

GENE_CMD="./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh"

GENE_CMD="./runXREF_GENE.sh"

GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz"

GENE_OMIM_CMD2RUN=perl dbxref_gene_extra.pl -d OMIM

GENE_CMD="./deploy.sh GENE mim2gene"

GENE_CMD="./generate_dbxref_run.pl XREF_GENE_OMIM > runXREF_GENE_OMIM.sh"

GENE_CMD="./runXREF_GENE_OMIM.sh"

GENE_CMD="./clean.sh GENE"

GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene"

GENE_UNIGENE_CMD2RUN=perl dbxref_gene_extra.pl -d UNIGENE

GENE_CMD="./deploy.sh GENE gene2unigene"

GENE_CMD="./generate_dbxref_run.pl XREF_GENE_UNIGENE > runXREF_GENE_UNIGENE.sh"

GENE_CMD="./runXREF_GENE_UNIGENE.sh"

GENE_CMD="./clean.sh GENE"

Script dbxrefgoa_update.pl first creates a sub-directory called “GENE/wget/” in the directory specified by the variable “data_dir”.

The newly created directory would be /scratch/dbxrefgoa/download/GENEl/wget/. Script dbxrefgoa_update.pl then reads all

GENE_URLs sequentially and downloads the data files if they been updated. These files are placed into the directory

/scratch/dbxrefgoa/download/GENE/wget/, and then subsequently copied one level up to the directory

/scratch/dbxrefgoa/download/GENE/. If no up-to-date data files are downloaded, script dbxrefgoa_update.pl finishes with this section

and moves to the next according to the value of “ORGANISMS” in section [ORGANISM_DBXREF]. Once all the data files are

downloaded, the script dbxrefgoa_update.pl reads the values of “GENE_CMD” and executes one command at a time.

The first part of section “[GENE]” is used to explain how this works.

[GENE]

#GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz"

GENE_CMD2RUN=perl dbxref_gene_cluster.pl

GENE_CMD="./deploy.sh GENE gene2accession"

GENE_CMD="./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh"

GENE_CMD="./runXREF_GENE.sh"

seqhound@blueprint.org Version 3.3

The SeqHound Manual 260 of 421 18/04/2005

The first command to be executed is "./deploy.sh GENE gene2accession ". Script deploy.sh makes directory

/scratch/dbxrefgoa/download/GENE/ on each of the cluster nodes and copy the data file gene2accession into the newly created

directory on each node.

The second command to be executed is "./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh". Script

generate_dbxref_run.pl takes in one parameter, XREF_GENE, which refers to the full path of the data file specified in variable

“data_dir” plus “XREF_GENE” in section [DBXREF_FILE] of the configuration file dbxref.ini. The data file has been copied to each

cluster node by executing command "./deploy.sh GENE gene2accession". Each of the cluster nodes will process a part of the data file.

The data file will be evenly divided into multiple segments each of which is processed by one node. In this way, the jobs on the nodes

can finish in approximately the same time.

The cluster nodes need to know where in the file to start and to end. Script generate_dbxref_run.pl reads the data file and

calculates the start and end points of the file for each cluster node to process. The script also constructs the command line for each

cluster node. The output of script generate_dbxref_run.pl is written to the file runXREF_GENE.sh.

An example of such a shell script is:

#! /usr/bin/bash

rsh an090 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 0 -n

269259 -f an090.dat >> /home/user123/dbxrefgoa/xref_parser.log 2>&1 &"

rsh an091 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 269259 -n 269259 -f an091.dat >>

/home/user123/dbxrefgoa/xref_parser.log 2>&1 &"

rsh an092 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 538518 -n 269259 -f an092.dat >>

/home/user123/dbxrefgoa/xref_parser.log 2>&1 &"

# Num of records: 807777

./wait.pl an090.dat an091.dat an092.dat

rm an090.dat an091.dat an092.dat

Where “perl dbxref_gene_cluster.pl” is the value of variable “GENE_CMD2RUN” in section [GENE]. In this example, the data file

has 807,777 records that are evenly divided into three segments each of which is processed on one cluster node.

Script dbxref_ll.pl takes four parameters:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 261 of 421 18/04/2005

–c indicates whether this is a run on a cluster node (-c F would make the script process the entire data file regardless of the values of –

o and –n),

–o indicates the starting point or offset in the file for this node,

–n indicates the number of records to be processed,

–f specifies the name of the flag file.

Since the next data file cannot be processed until the previous one is finished, it is necessary to know whether the process on every

cluster node is completed. On each node, a flag file named node.dat (e.g. an091.dat) is generated when the process on that particular

node is finished. The next data file won’t be processed until script wait.pl finds file node.dat on all of the nodes. After this condition is

met, wait.pl removes all flag files.

The last line in section [GENE] is:

GENE_CMD="./cleana.sh"

This command deletes all data files downloaded for this section from the cluster nodes in order to reclaim disk space.

All of the other sections in the dbxref.ini file have the same format as just described for [GENE]. These additional sections are:

[GENE], [SP], [TR], [FB]. [WB], [MGI], [SGD], [TIGR_ATH], [DDB], [RGD], [ZFIN], [XREFGOA], [GENEDB_SPOMBE],

[TAIR], [TIGR_CMR], [UNIGENE], [VIDA], [GOA_CGEN], [GOA_DDB], [GOA_FB], [GOA_GENEDB_GMORSITANS],

[GOA_GENEDB_LMAJOR], [GOA_GENEDB_PFALCIPARUM], [GOA_GENEDB_SPOMBE], [GOA_GENEDB_TBRUCEI],

[GOA_GRAMENE_ORYZA], [GOA_MGI], [GOA_GOA_PDB], [GOA_RGD], [GOA_SGD], [GOA_TAIR], [GOA_TIGR_ATH1],

[GOA_TIGR_CMR], [GOA_TIGR_TBRUCEI_CHR2], [GOA_TIGR_GENE_INDEX], [GOA_VIDA], [GOA_WB],

[GOA_GOA_UNIPROT], [GOA_ZFIN] and [GO_GENE].

seqhound@blueprint.org Version 3.3

The SeqHound Manual 262 of 421 18/04/2005

Table summarizing input files, parsers and command line parameters for dbxref module.

Input file parser 1command line parameters

ftp://expasy.org/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz dbxref_sptr_cluster.pl -d XREF_SP

ftp://expasy.org/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz dbxref_sptr_cluster.pl -d XREF_TR

http://flybase.bio.indiana.edu/allied-data/extdb/external-databases.txt dbxref_fb_cluster.pl none

ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep.table dbxref_wb_cluster.pl none

ftp://ftp.informatics.jax.org/pub/reports/MRK_Sequence.rpt dbxref_mgi_cluster.pl none

ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt dbxref_mgi_cluster.pl none

ftp://genome-

ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/dbxref.tab

dbxref_sgd_cluster.pl none

ftp://ftp.afcs.org/pub/mpdata/afcsflat.txt dbxref_AFCS_cluster.pl none

ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/DATA_RELEASE_SUPPLEMENT/r

elease_5.genbank_accessions.txt.gz

dbxref_tigr_ath_cluster.pl

-d XREF_TIGR_ATH

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tigr_ath dbxref_tigr_ath_cluster.pl -d XREF_TIGR_ATH_GP

ftp://ftp.blueprint.org/pub/SeqHound/Private/DDB/dictybaseid_gb_accession.t

xt.gz

dbxref_ddb_cluster.pl none

ftp://rgd.mcw.edu/pub/data_release/genbank_to_gene_ids.txt dbxref_rgd_cluster.pl none

http://zfin.org/data_transfer/Downloads/genbank.txt dbxref_zfin_cluster.pl none

http://zfin.org/data_transfer/Downloads/refseq.txt dbxref_zfin_cluster.pl none

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.zfin dbxref_zfin_cluster.pl none

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/human.xrefs.gz dbxref_goa_xrefs_cluster.pl -d

XREF_XREFGOA_HUMAN

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/mouse.xrefs.gz dbxref_goa_xrefs_cluster.pl –d

seqhound@blueprint.org Version 3.3

The SeqHound Manual 263 of 421 18/04/2005

Input file parser 1command line parameters

XREF_XREFGOA_MOUSE

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/rat.xrefs.gz dbxref_goa_xrefs_cluster.pl -d XREF_XREFGOA_RAT

ftp://ftp.sanger.ac.uk/pub/yeast/pombe/Mappings/gp2swiss.txt dbxref_DBs_SPTR_cluster.pl -d spombe

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tair dbxref_DBs_SPTR_cluster.pl -d tair

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tigr_cmr dbxref_DBs_SPTR_cluster.pl -d tigr_cmr

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.unigene dbxref_DBs_SPTR_cluster.pl -d unigene

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.vida dbxref_DBs_SPTR_cluster.pl -d vida

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accessions.gz dbxref_gene_cluster.pl

ftp://ftp.ncbi.nih.gov/gene/DATA/mim2gene.gz dbxref_gene_extra.pl -d OMIM

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2unigene.gz dbxref_gene_extra.pl -d UNIGENE

1 command line parameters:

All DBXref parsers can run in a cluster environment setting the –c argument.

-d Organism Database File Abbreviation: Can be found under dbxref.ini Partition [ORGANISM_DBXREF] Ex.: SPTR; FB.

-c Cluster Option: T: True (run in a cluster environment); F: False: (run stand alone).

The following options are optional. Needs to be set-up only if running in a Cluster environment.

-o Offset: The offset in a flat file where a cluster node will start parsing.

-n Number of lines: The number of line-records to process by a cluster node after the Offset is reached.

-f Flag File: For synchronization purposes. A file name that will be generated when a cluster node finishes parsing telling the Cluster’s

head node that the parsing is finished. (The Cluster Head node will track completion of all nodes before it continues with the next

process in queue).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 264 of 421 18/04/2005

seqhound@blueprint.org Version 3.3

The SeqHound Manual 265 of 421 18/04/2005

dbxref table

Last updated April 12, 2005

Database: SeqHound

Table: dbxref

Module: dbxref

Definition: The dbxref table is the core of the dbxref module.

It represents cross references between 3rd party databases and

GeneBank. This table is created by parsing 3rd party flat files and

creating records in "dbxref" for each record parsed. dbxref is a self-

referencing table. See “Explanation of data table structure” above.

Source org: multiple - see summary table

multiple - see summary table

Source file:

multiple - see summary table

FTP script:

multiple - see summary table

Parser:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 266 of 421 18/04/2005

dbxref table

Field Type Null Default Column_Definition Example Source1API

id int(11)

N

o Auto

increment Identifier for this entry. 2 All SQL insert statements will auto

increment this field.

N

A

source_id int(11)

N

o

0

Foreign key pointing to an identifier for a database, in table

dbxrefsourcedb. For example,

1 = Swiss-Prot,

2 = GenBank…etc.

2 See dbxrefsourcedb:source_id.

N

A

record_id char(30)

N

o

Record identifier in database mentioned in previous column.

The format of this identifier will be that of the source

database.

AAD12597

N

A

parent_id int(11)

N

o

0

Link pointing to "id" (first column in this table) to the

record that is the source id of the Cross-Reference. An

entry of “0” indicates that this is a source record from which

database cross-references are parsed.

1 See dbxref:id

N

A

link int(11) Yes

N

ULL

The link field was created to support some databases (for

example Swiss-Prot) which may store more than one cross-

reference in one field. See the section above “Explanation

of data table structure”.

0 SQL insert statements for proteins that

have dbXref to the nucleotide will

have the "link" field pointing to the

"id" record of the nucleotide.

If this relationship cannot be

established the "link" value will be set

to 0.

N

A

field char(20) Yes

N

ULL

Describes the field in the source record where the database

cross-reference was found. For example, “DR” is a field

name for Database Reference found in Swiss-Prot records.

Col6

N

A

cv int(11) Yes

N

ULL

Controled vocabulary. This will be used to describe the

type of record that the cross-reference is pointing to. See

the section above “Explanation of data table structure”.

2

N

A

lastupdate timestamp

N

o

N

ULL When was this entry last updated 2005-04-01

17:25:44

N

A

1. Multiple source files are parsed by multiple parsers. For details, see summary table.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 267 of 421 18/04/2005

dbxref indices

Keyname Type Cardinality Field

id_idx INDEX 16849002 id

source_id_idx PRIMARY 18 source_id

dbxref_id _idx PRIMARY 16849002 record_id

parent_id_idx PRIMARY 16849002 parent_id

link_idx INDEX 16849002 link

field_idx INDEX field

cv_idx INDEX 18 cv

lastupdate_idx INDEX 51843 lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 268 of 421 18/04/2005

dbxrefsourcedb table

Last updated April 12, 2005

Database: seqhound

Table: dbxrefsourcedb

Module: dbxref

Definition: This table assigns an internal identifier to all data sources where cross-

references are found. These identifiers are used in the source_id

column of the dbxref table.

Source org: multiple - see summary table

multiple - see summary table

Source file:

multiple - see summary table

FTP script:

multiple - see summary table

Parser:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 269 of 421 18/04/2005

dbxrefsourcedb table1

Field Type Null Default Column_Definition Example Source1API

source_id int(11)

N

o

0

Database ID, primary key in table dbxrefsourcedb.

For example,

1 = Swiss-Prot

1

N

A

source_db char(50)

N

o

Abbreviated name of database.

Abbreviations for database names are the same as

those used by the Gene Ontology group see:

"GO.xrf_abbs"

f

tp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs

This value is mandatory.

SP

N

A

isprimary_db tinyint(4)

N

o

0

Are cross-references retrieved from this database?

1 = YES

0 = NO

1

N

A

lastupdate timestamp

N

o

2005-04-01 17:25:44

N

A

1. The contents of this table are hand-edited as part of the dbxrefgoa.sql file. The entire contents of this table as of April 5, 2005 are

listed below.

dbxrefsourcedb indices

Keyname Type Cardinality Field

PRIMARY PRIMARY 33 source_id

source_db_idx INDEX 33 source_db

lastupdate_idx INDEX lastupdate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 270 of 421 18/04/2005

Contents of dbxrefsourcedb table

(last updated March 22, 2005).

source

_

id source_db isprimar

y_

db

1 SP 1

2 GB 0

3 PFAM 0

4 INTERPRO 0

5 MGI 1

6 SGD 1

7 SMART 0

8 ZFIN 1

9 FB 1

10 CG 0

11 TR 1

12 SPTR 1

13 WB 1

14 LL 0

15 CGEN 0

16 TIGR_ATH 1

17 REFSEQ 0

18 GENEDB_SPOMBE 1

19 DDB 1

GR 1

21 TAIR

22 TIGR_CMR 1

23 UNIGENE 1

24 VIDA 1

20

1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 271 of 421 18/04/2005

25 RGD 1

26 IPI 1

27 ENSEMBL 1

28 AFCS 1

29 HUGO 0

30 OMIM 0

31 PIR 0

32 GENE 1

seqhound@blueprint.org Version 3.3

The SeqHound Manual 272 of 421 18/04/2005

RPS-BLAST domains (rpsdb) module

domname parser

Note: Not available at this time, as it has not been ported to an ODBC backend. Please go to our ftp site at

ftp://ftp.blueprint.org/pub/SeqHound/RPS/

to download precomputed domname tables.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 273 of 421 18/04/2005

Rpsdb parser

Last updated: August 4, 2004

Note: The rpsdb table is precalculated on a cluster and the resulting table is distributed in MySQL format on our ftp

site. Therefore, this section is provided for informational purposes only, or for those who would like to build rpsdb

tables from there own sequence/domain data; it is not necessary if one wishes simply to include the rpsdb module into

their own seqhound instance, in which case they should simply download the precomputed tables.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 274 of 421 18/04/2005

domname table

Last updated: August 4, 2004

Database: SeqHound

Table: domname

Definition: Domain information parsed from CDD database.

Parser: The table is populated by the program “DomNameToDB” whose

source file is seqhoun/rps/domname.c

The program traverses a directory that contains all the *.acd files

(obtained from NCBI ftp site

f

tp://ftp.ncbi.nih.gov/pub/mmdb/cdd/acd.tar.gz) which contains the

CDD ASN.1 records

Comments: Similarly to rpsdb, this table has two API’s. The CDD ASN.1

definition seems to be under constant flux and with every new

release there are new and revised fields. This in turn requires that we

modify the parser functions and update API functions accordingly.

Codebase (for historical purposes)

Column_name Indexed NULL Column_Definition Data_type Size

ACCESSION Yes

N

o String 15 CDD id

N

AME Yes

N

o String 25 Short domain label

PDB-ID Yes Yes String 10 ID of PDB structure that is

used as 3D representative

of the domain

ASN1

N

o Binary

object

The entire CDD ASN.1

record from the source file

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

accession varchar(15)

N

o CDD id

name varchar(25)

N

o Short domain label

seqhound@blueprint.org Version 3.3

The SeqHound Manual 275 of 421 18/04/2005

p

dbi

d

varchar(12)

N

o

ID of PDB structure that is used

as 3D representative of the

domain

asn1 mediumblob

N

o The entire CDD ASN.1 record

from the source file

MySQL Indexes

Keyname Type Field

idom_rowid INDEX rowid

idom_acc INDEX accession

idom_name INDEX name

idom_pdbid INDEX

p

dbid

***accession***

Description: This is the domain’s Conserved Domain Database (CDD) unique

identifier assigned by the CDD group in NCBI. It can also be

SMART, Pfam, LOAD or CDD identifier.

Null

CddIdPtr,

DomNameToDB

Default Value:

Source (ASN.1) :

Parser :

Function: FillDomNameNode

More Info: In early versions of CDD the accession was the Pfam or SMART

identifier. Therefore, in some of the comments in the code this field

is referred to as the Pfam or SMART identifier. Currently, most of

the domains have a unique CDD id but some may not.

More info: The API functions return a list. The reason is that a domain label

such as SH3 may have a SMART and Pfam entry each stored as a

separate CDD entry. Both ID’s will be returned.

API: SHoundGetDomainIdFromLabel(CharPtr label);

seqhound@blueprint.org Version 3.3

The SeqHound Manual 276 of 421 18/04/2005

***name***

Description: Domain’s short label.

Default value: Null

source: This information is obtained from *.csq file that contains a FASTA

description of the domain. The label is parsed from the definition

line.

string GetDomainLabelFromDomainId(string s); CharPtr LIBCALL

SHoundGetDomainLabelFromDomainId(CharPtr accession)

Get3DStructureFromDomainId

Function: FillDomNameNode

More info:

API:

***pdbid***

Description:

Default value:

A 3D structure representative of the domain.

null

source: pCdd->master3d, part of Cdd record

FillDomNameNode

Function:

More info:

API: SHoundGetDomain3DStructure

***asn1***

Description: This is the entire NCBI CDD ASN.1 structure.

Default value: Null

source: Cdd

More info:

Important Note: The CDD structure contains place holders for describing the domains parent, sibling and child

domains. These are domains that are structurally or otherwise related to the domain on the sequence level. The program

DomNameToDB collects this information. In the latest release of CDD that was parsed these fields were left empty.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 277 of 421 18/04/2005

NCBI plans to (or may have) include this information in future releases. This is important information pertaining to the

domains that may be included in future versions of the DomName table. Currently, the table has not been expanded

with additional fields to hold this information.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 278 of 421 18/04/2005

rpsdb table

Last updated: October 07, 2004

Database: SeqHound

Table: rpsdb

Definition: Domain annotation of proteins derived from RPS-BLAST and

Conserved Domains Database (CDD).

Comments: The rpsdb is precalculated on a cluster and distributed in MySQL

format on our ftp site.

This table contains two local API versions (C and C++) defined in

rpsdbapi.hpp and rpsdbapi.c. For the most part the same information

can be accessed by both API albeit in different forms. However,

there may be cases where the two APIs are not identical.

The default e-value cutoff for data in rpsdb is 1

Note: The rpsdb table is precalculated on a cluster and the resulting table is distributed in MySQL format on our ftp

site. Therefore, this section is provided for informational purposes only, or for those who would like to build rpsdb

tables from there own sequence/domain data; it is not necessary if one wishes simply to include the rpsdb module into

their own seqhound instance, in which case they should simply download the precomputed tables.

Codebase (for historical purposes)

Column_name Indexed NULL Data_type Size Column_Definition

GI Yes

N

o Integer 10 Sequence identifier

CDDID Yes

N

o Integer 10 CDD ID (domain ID from

CDD)

DOMID Yes

N

o String 12 Domain ID from primary

database (Pfam, SMART,

COG, KOG or cd)

FROM

N

o

N

o Integer 6 First a.a. position aligned

to this domain

seqhound@blueprint.org Version 3.3

The SeqHound Manual 279 of 421 18/04/2005

ALIGN_LEN Yes

N

o Integer 6 Length of alignment b/w

p

rotein and domain

SCORE

N

o

N

o Integer 10 RPS-BLAST score

EVALUE

N

o

N

o Double 15,8 Base 10 log of RPS-

BLAST E- value

BITSCORE

N

o

N

o Double 15,8 RPS-BLAST bit score

MISSING_N

N

o

N

o Integer 6 The length of N-terminus

residues on the domain

that were not aligned

MISSING_C

N

o

N

o Integer 6 The length of C-terminus

residues on the domain

that were not aligned

N

UMDOM Yes

N

o Integer 4

N

umber of total domains

mapped to this protein

MySQL

Field Type Null Default Column_Definition

rowid int(11)

N

o Auto incremented id

gi int(11)

N

o 0 Sequence identifier

cddid int(11)

N

o 0 CDD ID (domain ID from CDD)

domid char(12)

N

o

Domain ID from primary

database (Pfam, SMART, COG,

KOG or cd)

rfrom int(11)

N

o 0 First a.a. position aligned to this

domain

align_len int(11)

N

o 0 Length of alignment b/w protein

and domain

score int(11) 0 RPS-BLAST score

evalue decimal(15,8) Yes

N

ULL Base 10 log of RPS-BLAST E-

value

N

o

seqhound@blueprint.org Version 3.3

The SeqHound Manual 280 of 421 18/04/2005

bitscore decimal(15,8) Yes

N

ULL RPS-BLAST bit score

missing_n int(11)

N

o 0

The length of N-terminus residues

on the domain that were not

aligned

missing_c int(11)

N

o 0

The length of C-terminus residues

on the domain that were not

aligned

numdom int(11)

N

o 0

N

umber of total domains mapped

to this protein

MySQL Indexes

Keyname Type Field

irps_rowid INDEX rowid

irps_gi INDEX gi

irps_cddid INDEX cddid

irps_domid INDEX domid

irps_len INDEX align_len

irps_numdom INDEX numdom

irps_gi_e INDEX gi

evalue

Source db: SeqHound redundant table

Source program:

Parser :

RPS-BLAST results

rpsdb.h/c in seqhound/rps

***gi***

description: Primary sequence identifier assigned at NCBI

Default value : 0

seqhound@blueprint.org Version 3.3

The SeqHound Manual 281 of 421 18/04/2005

source: GI’s are collected from Seqhound redund table. Proteins that were

not annotated in this table were the hypothetical ORF from RefSeq

(XP_xxxxxx) and SWISS-PROT proteins. These proteins are not

present in SeqHound (no Bioseq) although their GI's are in redund

table. In each redundant group the first GI was used for computing

RPS-BLAST (ordinal 1). These proteins are considered to be the

best representative of the redundant group. However, some

redundant groups may not have the first GI in SeqHound, in those

cases the program collects ordinal 2 or higher. The redundant list is

collected from redund table using “redundlist” program

(seqhound/rps).

API: SHoundGetGisByDomainIdAndEvalue

SHoundGetGisByDomainId

SHoundGetGisByNumberOfDomains

More info:

Comment: There are two functions in the C version that use Codebase relational

query. They should not be used and are there for experimental

purposes only.

seqhound/rps/rpsdb_README.txt

***cddid***

description: Conserved Domain Database unique identifier.

Default Value: none

Source (ASN.1) : CddIdPtr, can also be collected from CddHitPtr cdd_id field.

Parser : rpsdb

Function: RPSDBSHoundRedund2ResultsCallback

More info: seqhound/rps/rpsdb_README.txt

***domid***

Description: Domain identifier from the primary database of origin. These

correspond to either Pfam or SMART string identifiers.

Default Value: Null

seqhound@blueprint.org Version 3.3

The SeqHound Manual 282 of 421 18/04/2005

source: Definition field in CddHitPtr structure.

Function: RPSDBSHoundRedund2ResultsCallback

***rfrom***

Description: The index position of the first amino acid in the protein that is

aligned with this domain.

Default value: 0

source:

Function:

Start field in CddHitPtr structure

RPSDBSHoundRedund2ResultsCallback

***align_len***

Description: The length of sequence alignment between the protein and the

domain.

Default value : 0

source:

Function:

stop- start field values in CddHitPtr.

RPSDBSHoundRedund2ResultsCallback

***score***

Description : RPS-BLAST score parameter

Default value: 0

source: score field in CddHitPtr

Function: RPSDBSHoundRedund2ResultsCallback

***evalue***

Description : Base 10 log of RPS-BLAST evalue score

0

Default value:

Source: evalue field in CddHitPtr

Function:

RPSDBSHoundRedund2ResultsCallback

***bitscore***

seqhound@blueprint.org Version 3.3

The SeqHound Manual 283 of 421 18/04/2005

Description: RPS-BLAST bit score

0

Default:

source:

Default :

source: bitscore field in CddHitPtr

Function:

RPSDBSHoundRedund2ResultsCallback

***missing_n***

Description: The number of residues on the domain’s N-terminus that were not

aligned with the protein. The missing length on the N-terminus of

the domain.

0

source: It is collected from DenseDegPtr which is part of SeqAlign

structure. This structure is filled up by the RPS-BLAST engine. It

contains the collection of aligned segments between the domain and

the protein.

Function: RPSDBSHoundRedund2ResultsCallback

***missing_c***

Description: The number of missing residues not aligned in the domain’s C-

terminus.

0

Same as MISSING_N

Function: RPSDBSHoundRedund2ResultsCallback

***numdom***

Description: The total number of domains aligned with the protein.

Default: 0

source: The number of entries for the query GI in the rpsdb..

Function: SLRICddCountSeqAligns in rpsdb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 284 of 421 18/04/2005

API: The above fields are all accessed through a set of calls that retrieve

the domain annotation based on different requirements.

SHoundGetDomainsFromGi

SHoundGetDomainsFromGiWithEvalue

SHoundGetDomainsFromGiListWithEvalue

seqhound@blueprint.org Version 3.3

The SeqHound Manual 285 of 421 18/04/2005

Molecular Interaction (MI) module

MI-BIND parser

Last updated October 1, 2004

purpose:

The Molecular Interaction module is meant to consolidate the interaction data, and associated annotation, from disparate source

interaction databases(e.g. BIND, IntAct, MINT, etc). The source data is parsed out of their own unique formats and placed into the

MI module tables' data model. This data model has been designed to provide maximum flexibility in terms of the complexity of

queries that can be made to it. Additionally, the module adds value to the data by cross-referencing distinct records, regardless of their

source databases, to provide information on molecular object redundancy and interaction similarity. The MI module's set of parsers

takes records from source interaction modules, parse out their data, and insert that data into the MI tables. Currently, there exists only

one such parser, for parsing BIND XML records; parsers for other interaction databases will be developed in the future.

Logic:

MI-BIND parser:

This parses the BIND XML records, available on the BIND ftp site. It uses the SAX XML parsing API, in order to achieve maximum

parsing speed. In order to foster code reusability, the parser is broken down into three components: The first component parses out

the BIND XML records interaction data and places it into a general MI data structure. This data structure is then passed to another

component, which is responsible for placing the contents into the MI tables in the database. Another component is then called to do

the cross-referencing and redundancy/similarity analysis on the data just processed, and places that additional annotation into the

database. This break-down is convenient, because it means that developers of MI parsers for other source interaction databases will

only need to replace the first component, the one which parses the data out of the source record, which they can then pass to the

already written data feed and cross-referencing components. The parser was written in the Java programming language to take

advantage of Java's mature XML processing capabilities, and because a great deal of java source code was available from the open

source BIND project for processing BIND XML records.

input files:

The MI-BIND parser processes BIND records in their native XML format. These are available from the BIND ftp site, in a number of

different partitions, each of which can be used interchangeable. In order to import all of BIND into the MI module, one would

download the “Divisions” partition from

ftp://ftp.blueprint.org/pub/BIND/data/divisions/xml/*.xml

seqhound@blueprint.org Version 3.3

The SeqHound Manual 286 of 421 18/04/2005

and then process this list of input files using the MI-BIND parser.

tables altered:

MI_complex2ints

MI_complex2subunits

MI_complexes

MI_dbases

MI_exp_methods

MI_ints

MI_mol_type

MI_obj_dbases

MI_obj_labels

MI_objects

MI_record_types

MI_refs

MI_refs_db

MI_source

source code location:

The source is contained in the following seqhound java packages:

org.bluprint.seqhound.parsers.mi

org.bluprint.seqhound.parsers.mi.bind

config file dependencies:

MI.properties file

This file must be present in the same directory as where the MI-BIND parser is being invoked, and must contain the entries:

dbDriverName=myDatabaseDriversName

dbUserName=myDatabaseUserName

dbPassWord=myDatabasePassword

dbURL=myDatabaseConnectionURL

These contain the settings, which the parser will use to connect to the database management system, which manages the MI data

tables. An example entry follows:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 287 of 421 18/04/2005

dbDriverName=com.mysql.jdbc.Driver

dbUserName=johnsmith

dbPassWord=johmsmithspassword

dbURL=jdbc:mysql://dbhostname:dbportnumber/MIdbname

seqhound@blueprint.org Version 3.3

The SeqHound Manual 288 of 421 18/04/2005

library dependencies:

Along with the standard java runtime environment, the MI-BIND parser requires one third party java library; this is the open source

xerces2 XML parser implementation, available for free from the apache foundation(see http://xml.apache.org/xerces2-j/download.cgi

for free download instructions).

The xerces jar files xercesImpl.jar and xmlParserAPIs.jar must be in classpath when executing the parser.

The MI-BIND parser has been tested only with xerces2 java version 2.6.2, but may work with other versions.

command line parameters:

The MI-BIND parser takes one command line argument: a file containing a list of filenames of files for the parser to process.

example use:

java org.blueprint.seqhound.parsers.mi.bind.bmdParse BINDXMLFileList.txt

Where BINDXMLFileList.txt is a text file containing a newline delimited list of the names of files containing the BIND XML

records to be processed; these BIND XML files must be in the same directory. Note that the compiled bind module parser files, as

well as the xerces jars, must be in classpath in order for the above command to function.

associated scripts:

none

error and run-time logs:

runtime and error information is printed to standard output and standard error, respectively. This will likely be changed in a future

release, to print to a standard log file.

troubleshooting:

additional info:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 289 of 421 18/04/2005

MI_source table

Last updated October 4, 2004

Database:

MI_source

MI

Table:

Module: Molecular Interaction(MI)

Definition: Contains information about the source of an interaction record

MySQL

Field Type Null Default Column Definition

uid Int(11)

N

o Internal MI identifier for record

intcompid Int(11)

N

o interaction, complex or pathway

this record refers to

0

Internal MI identifier of the

db Smallint(6) Yes

N

ULL

Internal MI identifier of the

source db which this record

comes room, corresponds to db

column of MI_dbases

acc Varchar(10) Yes

N

ull Source db accession of this

record, if it exists

id Int(11) Yes

N

ull Source db ID of this record, if it

exists

type Smallint(6)

N

ull

Type of this record, either

interaction(1), complex(2) or

p

athway(3)

descr Text Yes

N

ull Text description of this record

from the source db

data_blob Longblob Yes Binary version of original record

from source db, if it exists

data_clob Longtext Yes

N

ull Text version of original record

N

o

N

ull

seqhound@blueprint.org Version 3.3

The SeqHound Manual 290 of 421 18/04/2005

from source db, if it exists

MySQL Indexes

Keyname Field Type

Primary Primary uid

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

***uid***

function:

description:

example:

default value

ASN.1 structure:

parser:

API:

more info:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 291 of 421 18/04/2005

MI_ints table

Last updated October 4, 2004

Database: MI

Table:

Module:

MI_ints

MI

Definition: Contains information about individual interactions

MySQL

Field Type Null Default Column Definition

intid Int(11)

N

o Internal MI id of this interaction

objAid Int(11)

N

o 0 Internal MI id of the first object

in this interaction

objBid Int(11)

N

o 0 Internal MI id of the second

object in this interaction

rig Int(11)

N

o 0

Internal MI id of the redundant

interaction group to which this

interaction belongs

MySQL Indexes

Keyname Type Field

Primary Primary intid

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 292 of 421 18/04/2005

MI_objects table

Last updated October 4, 2004

Database: MI

Table: MI_objects

Module: MI

Definition: Contains information about the individual objects involved in

interactions, complexes and pathways

MySQL

Field Type Null Default Column Definition

objid Int(11)

N

o Internal MI identifier for the

object which this row describes

type Smallint(6)

N

o 0

Internal MI identifier for the

molecule type of this object;

corresponds to column type of

MI_mol_type

db Smallint(6)

N

o 0

Internal MI identifier for the

source db of this object;

corresponds to column db of

MI_obj_dbases

id Int(11) Yes

N

ull Source db id of this object

tax Int(11) Yes

N

ull

N

CBI taxonomy id of this object,

if applicable

acc Varchar(20) Yes

N

ull Source db accession of this object

rog Int(11)

N

o 0

Internal MI id of the redundant

object group to which this object

belongs

MySQL Indexes

Keyname Type Field

Primary Primary objid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 293 of 421 18/04/2005

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 294 of 421 18/04/2005

MI_obj_dbases table

Last updated October 4, 2004

Database: MI

Table: MI_obj_dbases

Module: MI

Definition: Table of objects source databases

MySQL

Field Type Null Default Column Definition

db Smallint(6)

N

o Internal MI identifier for the

source db

db_name Varchar(30)

N

o The name of the object source db

MySQL Indexes

Keyname Type Field

Primary Primary db

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 295 of 421 18/04/2005

MI_mol_types table

Last updated October 4, 2004

Database: MI

Table: MI_mol_types

Module: MI

Definition: Molecular types of objects

MySQL

Field Type Null Default Column Definition

type Smallint(6)

N

o Internal MI identifier for

molecules of this type

type_name Varchar(15)

N

o

N

atural language name for this

molecule type

MySQL Indexes

Keyname Type Field

Primary Primary type

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 296 of 421 18/04/2005

MI_dbases table

Last updated October 4, 2004

Database: MI

Table: MI_dbases

Module: MI

Definition: Table of source databases for interaction records

MySQL

Field Type Null Default Column Definition

db Smallint(6)

N

o Internal MI identifier for this

source interaction database

db_name Varchar(30)

N

o

N

atural language name for this

source interaction database.

MySQL Indexes

Keyname Type Field

Primary Primary db

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 297 of 421 18/04/2005

MI_record_types table

Last updated October 4, 2004

Database: MI

Table: MI_record_types

Module: MI

Definition: Table of types for interaction records

MySQL

Field Type Null Default Column Definition

type Smallint(6)

N

o Internal MI identifier for this

record type

type_name Varchar(20)

N

o

N

atural language name for this

record type(by default, either

interaction, complex or pathway).

MySQL Indexes

Keyname Type Field

Primary Primary type

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 298 of 421 18/04/2005

MI_complexes table

Last updated October 4, 2004

Database: MI

Table: MI_complexes

Module: MI

Definition: Table of complex and pathway records

MySQL

Field Type Null Default Column Definition

compid Int(11) Internal MI identifier for this

complex record

numsubunits Int(11)

N

o 0

N

umber of subunits in this

complex

N

o

MySQL Indexes

Keyname Type Field

Primary Primary compid

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 299 of 421 18/04/2005

MI_complex2ints table

Last updated October 4, 2004

Database: MI

Table: MI_complex2ints

Module: MI

Definition: Mapping of complexes to their component interactions

MySQL

Field Type Null Default Column Definition

compid Int(11)

N

o 0Internal MI identifier for this

complex

intid Int(11)

N

o 0

Internal MI identifier for an

interaction which is a component

of this complex

MySQL Indexes

Keyname Type Field

Observation:

Source org: Blueprint

Source file:

FTP script:

BIND division files

N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 300 of 421 18/04/2005

MI_complex2subunits table

Last updated October 4, 2004

Database:

Table:

MI

MI_complex2subunits

MI

Module:

Definition: Mapping of complexes to subunits

MySQL

Field Type Null Default Column Definition

compid Int(11)

N

o 0Internal MI identifier for this

complex

objid Int(11)

N

o 0

Internal MI identifier for the

subunit object which belongs to

this complex

MySQL Indexes

Keyname Type Field

Observation:

Source org:

Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 301 of 421 18/04/2005

MI_complex2subunits table

Last updated October 4, 2004

Database: MI

Table: MI_complex2subunits

Module: MI

Definition: Mapping of complexes to subunits

MySQL

Field Type Null Default Column Definition

compid Int(11)

N

o 0 Internal MI identifier for this

complex

objid Int(11)

N

o 0

Internal MI identifier for the

subunit object which belongs to

this complex

MySQL Indexes

Keyname Type Field

Observation:

Source org: Blueprint

Source file: BIND division files

N/A

FTP script:

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 302 of 421 18/04/2005

MI_refs table

Last updated October 6, 2004

Database: MI

Table: MI_refs

Module: MI

Definition: Mapping of MI records to literature references which support it

MySQL

Field Type Null Default Column_Definition

uid Int(11)

N

o 0

Internal MI identifier for the

interaction record which this

reference is a part of

db Smallint(6)

N

o 0

Internal MI identifier for the

reference database which this

reference is from (eg pubmed or

medline)

acc Varchar(15) Yes

N

ull Accession of this reference from

source reference database

id Int(11) Yes

N

ull ID of this reference from the

source reference database

method Smallint(6)

N

o 0

Internal MI identifier for the

experimental method used in the

referenced experiment

MySQL Indexes

Keyname Type Field

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

seqhound@blueprint.org Version 3.3

The SeqHound Manual 303 of 421 18/04/2005

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 304 of 421 18/04/2005

MI_refs_db table

Last updated October 6, 2004

Database: MI

Table: MI_refs_db

Module: MI

Definition: Mapping of internal MI reference database Ids to source reference

databases

MySQL

Field Type Null Default Column_Definition

db Smallint(6)

N

o Internal MI identifier for this

reference database

db_name Varchar(15) Yes

N

ull The natural language name of the

reference source database

MySQL Indexes

Keyname Type Field

Primary Primary db

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script:

Parser:

N/A

MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 305 of 421 18/04/2005

MI_exp_methods table

Last updated October 6, 2004

Database: MI

Table: MI_exp_methods

Module: MI

Definition: Mapping of internal experimental method identifiers to their

descriptions

MySQL

Field Type Null Default Column_Definition

method Smallint(6)

N

o Internal MI identifier for this

experimental method

method_descr Varchar(40) Yes

N

ull

N

atural language na

m

e of the

experimental method

MySQL Indexes

Keyname Type Field

Primary Primary method

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 306 of 421 18/04/2005

MI_obj_labels table

Last updated October 6, 2004

Database: MI

Table: MI_obj_labels

Module: MI

Definition: Mapping of molecular objects to their free form labels

MySQL

Field Type Null Default Column_Definition

uid Int(11)

N

o 0

Internal MI identifier for the

interaction/complex/pathway

record which assigns this label to

this object

objid Int(11)

N

o 0 Internal MI identifier for the

molecular object being labeled

label Text Yes

N

ull

Free form label given to this

molecular object by this

interaction/complex/pathway

record

MySQL Indexes

Keyname Type Field

Observation:

Source org: Blueprint

Source file: BIND division files

FTP script: N/A

Parser: MI-BIND

seqhound@blueprint.org Version 3.3

The SeqHound Manual 307 of 421 18/04/2005

Text mining module

Overview:

The SeqHound text mining module helps researchers locate mentions and co-mentions of biologically related entities in the scientific

literature.

At the time of writing this is limited to finding protein mentions in PubMed abstracts. The module however has been designed to be

extensible to small molecules, complexes and even biological concepts in both abstract and full-text articles. Each of the steps in the

process may be scored by multiple methods that can be developed internally or by external developers.

mother parser

Note that a number of the tables in the text module are created at the same time as the creation of the other core tables (see core.sql)

and some of these tables are populated by the mother parser (see table descriptions below). Mother, is used to retrieve protein names

and synonyms from the RefSeq database, so these tables are created at the same time as other core module tables. The mother parser

is described under the core module section above. Parsers specific to the text module are described below.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 308 of 421 18/04/2005

text searcher parser

last updated February 25th, 2005

purpose:

The text update parser and related programs are used to collect bionames, search against a literature database, and then investigate co-

mentions of bioentities in the literatures. The co-mentions of bioentities are scored using pattern recognition and statistical machine

learning methods to look for potential biophysical interactions between bioentities. The text mining module tables are updated daily.

logic:

Most of the update logic is implemented in the Text.pm Perl module and other scoring programs. The text update parser calls the

functions in Text.pm and other Perl modules. The text mining module depends on the CORE module to generate and update names of

proteins and on the MyMED in house literature database. These resources are updated daily. The steps taken to update the text-mining

module are as follows:

Step 1: Collect bioentity names from the lexicon

Step 2: Formulate searches; collect search results and scores

Step 3: Collect co-occurrences of names and scores

Step 4: Summarize evidence for each pair of bioentities and scores

module: text

input files:

Latest Medline release: ftp://ftp.ncbi.nih.gov/nlmdata/.medlease/*.xml.gz

Latest PubMed Central release: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/

Latest Biomed Central release: ftp://ftp.biomedcentral.com/

Latest NRC LitMiner SVM release: http://ii200.iit.nrc.ca/~martinj/

British National Corpus: ftp://ftp.itri.bton.ac.uk/bnc/

seqhound@blueprint.org Version 3.3

The SeqHound Manual 309 of 421 18/04/2005

Moby project English word list: http://www.dcs.shef.ac.uk/research/ilash/Moby/

PubMed help stop word list: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#Stopwords

Smart English stop word list: ftp://ftp.cs.cornell.edu/pub/smart/english.stop

tables altered:

text_bncorpus, text_db, text_doc, text_docscore, text_doctax, text_englishdict, text_evidence, text_evidencescore, text_method,

text_namepair, text_namepairresult, text_organism, text_pattern, text_point, text_pointscore, text_result, text_resultscore, text_rng,

text_rngscore, text_search, text_searchscore, text_stopword

source code location:

slri/seqhound/text/text_update.pl

config file dependencies:

slri/seqhound/text/text.ini

command line parameters:

Typing “./text_update.pl -“ or “perl text_update.pl -” at the command line while in the directory where

text_update.pl resides will return a list of command line parameters and default settings.

text_update.pl arguments:

-r redo all name searches against new insertions from MyMED using time stamp [T/F] optional

-t taxnomy ids for organism to be searched, seperated by comma. If specified in command line, only provided

organism(s) will be searched. Otherwise, a default list of taxids stored in text.ini file will be searched.

Optional

-u update target table only, default is all. Optional

seqhound@blueprint.org Version 3.3

The SeqHound Manual 310 of 421 18/04/2005

example use:

For example:

>./text_update.pl -r F –t 4932,10090,9606 -u text_searchscore

associated scripts:

slri/seqhound/text/text.sql Data Definition file for text mining module tables.

slri/seqhound/text/text_dump.sh Mysql dump script for portable tables.

slri/seqhound/text/text_create.sh Script used to create Text database, the starting point.

slri/seqhound/text/text_regex.pl Script used to score evidence using regex patterns.

slri/seqhound/text/Text.pm contains most functions that will be using for update.

slri/seqhound/text/Pattern.pm Perl module used to represent regular expression objects.

slri/seqhound/text/Tee.pm Perl module used to branch the output to different outputs.

slri/seqhound/text/text_updatecron.sh Text module daily update cron script.

slri/seqhound/text/myeutils.pl Entrez eutil script used for comparing results between MyMED and entrez searches.

error and run-time logs:

Errors and runtime logs will be directed to file specified in text.ini , current default log file name is text_update.log

Daily update log will also be send to email account specified in text.ini file

troubleshooting:

Check the email message sent by text_update.pl to see if there is any error during update. Consult update log file to look for detail

problem.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 311 of 421 18/04/2005

additional info:

All text mining module tables are in small cases, and are prefixed with "text_". Tables with auto incremented rowid have a primary

key "id" as default. When referencing primary key in other table, the field name in the referencing table will be the referenced table

name plus id.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 312 of 421 18/04/2005

yeastnameparser.pl parser

Last updated September 27, 2004

purpose:

The yeastnameparser extracts names from the SGD file SGD_features.tab. Only names that belong to yeast records that are already in

RefSeq are added, as determined using the DBXRef module. This means that this parser MUST be run after the DBXRef parsers.

It is not necessary to run this parser, if yeast names are not desired.

logic:

The yeastnameparser reads through each record in SGD_features.tab. It searches for a refseq cross reference for each record. If one is

found, then the parser gets the relevant bioentityid and checks all names for that bioentityid.

If the name does not already exist in the database, then it is added and the db field is set to “sgd” and the access field is set to the

SGDID. The action field is set to 1, ADD and the current data is written to actiondate.

Existing names in the database are also compared to the file. Names which are no longer present in the yeast file are marked as deleted

(action =2) and the current data is written to actiondate. If the name already exists in refseq, then a SECONDREFS record is filled out

for the yeast record.

module: names

input files:

SGD_features.tab from

ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/

tables altered:

bioentity, bioname, secondrefs

source code location:

slri/seqhound/names/yeastnameparser.pl

config file dependencies:

The relevant configuration files are:

.intrezrc and .odbc.ini should be set up as described above for seqhound. The values will be read by shconfig.pm, which should be

located in the same directory as yeastnameparser.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 313 of 421 18/04/2005

command line parameters:

None.

example use:

perl yeastnameparser.pl

associated scripts:

The program yeastnamecron.pl can be used for both the initial read of SGD_features.tab and updates of the file. yeastnamecron.pl

checks whether SGD_features.tab needs updating, downloads it and calls yeastnameparser.pl. SGD_features.tab is updated weekly by

SGD.

error and run-time logs:

yeastnameparser writes errors to a file called yeastname.log. Updates are written to a file called yeastupdate.log as a tab delimited file

where the fields are: name, sgdid, bionameid and field. Additions are written to a file called yeastadd.log as a tab delimited file where

the fields are:name, sgdid bioentityid and field.

troubleshooting:

additional info:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 314 of 421 18/04/2005

text_bioentity table

Last updated Febuary 15, 2004

Database: SeqHound

Table: text_bioentity

Module: text (note mother parser is part of the core module)

Definition: A bioentity refers to any biological object with names that may be used in the literature to refer to these objects.

Currently, all bioentities are proteins from RefSeq. This table tells us which database contains the primary record

for this bioentity, the accession and the field which refers to this bioentity. At the moment, all bioentities are

obtained from RefSeq. RefSeq was chosen because it represents a high quality database of non-redundant records.

The intention of this part of the core is to collect a non-redundant list of biological objects and the names that are

used in written language to refer to them by.

Source file: *.bna.gz from ftp://ftp.ncbi.nih.gov/refseq/release/complete

FTP script: asnftp.pl

Parser: mother

seqhound@blueprint.org Version 3.3

The SeqHound Manual 315 of 421 18/04/2005

text_bioentity table

Field Type Null Default Column_Definition Example Source API

id

int(11) no

Auto incremented id. A

unique identifier for this

bioentity.

1 mother parser

autoincrements this

field

SHoundBioentityIdFromGi

SHoundBioentityIdFromAcc

SHoundBioentityIdListFromBionameAndTaxId

bioentitytypeid int(11)

N

o

0

Bioentity type (for

example, molecule type).

Id from the bioentitytype

table.

1 mother parser (see

GetType()) parsing

and see bioseq->mol

SHoundBioentityFromBioentityId

db varchar(15)

N

o

The primary database

where this bioentity was

found.

ref mother parser inserts

“ref”

SHoundBioentityFromBioentityId

access varchar(20)

N

o

Accession in the primary

database. Any

alphanumeric identifier

used by the primary

database.

N

P_858066 mother parser (see

FillBionameDB())

and see bioseq--

>seqid

SHoundBioentityFromBioentityId

identifier int(11) Yes

N

umeric identifier for this

bioentity in the primary

database; for example an

N

CBI Gene Info identifier

(GI). This field is not

required.

31982991 mother parser (see

GetGI) see bioseq--

>seqid (choice 12)).

SHoundBioentityFromBioentityId

fieldtypeid int(11)

N

o

A number that represents

the field from which this

bioentity was derived.

This is the id from the

fieldtype table. For

example, 1 indicates the

ASN.1 path

“seqentry/seqset/bioseq”

1 mother parser (see

ASN.1 path)

SHoundBioentityFromBioentityId

seqhound@blueprint.org Version 3.3

The SeqHound Manual 316 of 421 18/04/2005

text_bioentity indices

Keyname Type Field

PRIMARY PRIMARY access

ibioe_id INDEX id

ibioe_type INDEX bioentitytypeid

ibioe_identifier INDEX identifier

seqhound@blueprint.org Version 3.3

The SeqHound Manual 317 of 421 18/04/2005

text_bioname table

Last updated February 15, 2005

Database: SeqHound

Table:

Module:

text_bioname

text (note that the mother parser is part of the core module)

Definition: This table holds names for bioentities. A Bioname is some “name” applied to some bioentity.

Source org: NCBI

Source file: *.bna.gz from ftp://ftp.ncbi.nih.gov/ncbi/refseq/release/complete

FTP script: asnftp.pl

Parser: mother

seqhound@blueprint.org Version 3.3

The SeqHound Manual 318 of 421 18/04/2005

text_bioname table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o A unique identifier for this

bioentity-name pair.

1 mother parser

auto increments

this column

SHoundBionameListFromBioentityId

timestamp datetime Yes

N

ULL mother parser

yeastnameparse

r

bioentityid int(11)

N

o

0The id of the bioentity to which

this name refers.

1 mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

name text

N

o

mother parser

The name.

2-

isopropylmal

ate synthase

yeastnameparse

r

SHoundBionameListFromBioentityId

nametypeid int(11)

N

o

0

Type of name. For example,

protein name (1) or gene name

(2). This is the id from the

nametype table.

1 mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

db varchar(15)

N

o

The database in which this name

was found.

ref mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

access varchar(20)

N

o The accession of the record in

which this name was found.

NP_047187 mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

identifier int(11) Yes

N

ULL

The identifier of the record in

which this name was found; for

example a Gene Info identifier

(GI).

10954458 mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

fieldtypeid int(11)

N

o

0

The field of the record in which

this name was found. This is the

id from the fieldtype table; for

example, 5 indicates the

ASN.1 path

“seqentry/seqset/bioseq/descr-

title/”

5 mother parser

yeastnameparse

r

SHoundBionameListFromBioentityId

seqhound@blueprint.org Version 3.3

The SeqHound Manual 319 of 421 18/04/2005

official

int(11) Yes 0 Is this an official name?

1=Yes, 2=N0

1 yeastnameparse

r

SHoundBionameListFromBioentityId

deprecated int(11) Yes 0 Has this name been deprecated?

N

ot used at present.

0 SHoundBionameListFromBioentityId

datedeprecated datetime Yes

000000000

00000

The date the name was

deprecated.

00000000000

000

mother parser

yeastna

m

eparser

SHoundBionameListFromBioentityId

ruleid int(11) Yes

N

ULL

What rule was used to construct

this name? This is the id from the

rules table. For example, 1

indicates that A gene name is

being used to refer to a protein.

1 mother parser SHoundBionameListFromBioentityId

yeastnameparser

action

char(1) Yes A

What was the last action taken on

this record. A=ADD,

D=DELETE.

mother parser

yeastnameparse

r

actiondate datetime Yes

000000000

00000 Date of the last action. 2004-08-20

01:30:08

mother parser

yeastnameparser

seqhound@blueprint.org Version 3.3

The SeqHound Manual 320 of 421 18/04/2005

text_bioname indices

Keyname Type Field

ibioname_id INDEX id

ibioname_identifier INDEX identifier

ibioname_access INDEX access

ibioname_bioentityid INDEX bioentityid

ibioname_nametypeid INDEX nametypeid

ibioname_official INDEX official

ibioname_deprecated INDEX deprecated

ibioname_ruleid INDEX ruleid

ibioname_action INDEX action

ibioname_actiondate INDEX actiondate

seqhound@blueprint.org Version 3.3

The SeqHound Manual 321 of 421 18/04/2005

text_secondrefs table

Last updated February 15, 2005

Database: SeqHound

Table: text_secondrefs

Module: core

Definition: Additional references for sources of bionames.

Source org: SGD

Source file: SGD_features.tab from ftp://genome-

ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/

FTP script: yeastnamecron.pl

Parser: yeastnameparser.pl

seqhound@blueprint.org Version 3.3

The SeqHound Manual 322 of 421 18/04/2005

text_secondrefs table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o

id of the second reference. Mysql

auto-increment column.

1 yeastnameparser

will Auto-

increment this

column

timestamp datetime Yes

N

ULL yeastnameparse

r

bionameid int(11)

N

o

0id of the name that this reference

refers to.

1 yeastnameparser

retrieves this from

text_bioname table

db varchar(15)

N

o

Database in which this reference is

found.

sgd yeastnameparser

(see

AddSeconfRef())

access varchar(20) Yes

N

ULL Accession of the record that refers

to the name.

S000033 yeastnameparser

(see

AddSeconfRef())

fieldtypeid int(11)

N

o

0Identifies the field in the record

where this name was found.

101 yeastnameparser

(see

AddSeconfRef())

seqhound@blueprint.org Version 3.3

The SeqHound Manual 323 of 421 18/04/2005

text_secondrefs indices

Keyname Type Field

isecondrefs_id INDEX id

isecondrefs_bionameid INDEX bionameid

isecondrefs_dbsearch INDEX access, db

isecondrefs_field INDEX field

seqhound@blueprint.org Version 3.3

The SeqHound Manual 324 of 421 18/04/2005

text_bioentitytype table

Last updated February 15, 2005

Database: SeqHound

Table: text_bioentitytype

Module: text (note that this table is currently defined when the core module is

created)

Definition: Look up table that stores bioentity types.

Source org: Blueprint

Source file: core.sql

FTP script: NA

Parser: NA

text_bioentitytype table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Identifier for a bioentitytype.

Mysql auto-increment column.

1 core.sql

type varchar(80)

N

o The molecule type. protein core.sql

text_bioentitytype indices

NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 325 of 421 18/04/2005

text_fieldtype table

Last updated February 15, 2005

Database: SeqHound

Table: text_fieldtype

Module: text (note that this table is defined when the core module is created)

Definition: Look up table that stores the field that contains the name.

Source org: Blueprint

Source file: This is a look up table that is filled by core.sql

FTP script: NA

Parser: NA

text_fieldtype table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Identifier for a fieldtype. 1 core.sql

pathtofield varchar(80)

N

o

The field that contains the name.

For an ASN.1 record, this is the

ASN.1 "path" to the field that

contains the name. For a flat file,

this is of the form "Column

#:Column Name".

seqentry/seq

set/bioseq/s

eq-

annot/seqfea

t-gene/syn

core.sql

text_fieldtype indices

NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 326 of 421 18/04/2005

text_nametype table

Last updated February 15, 2005

Database:

Table:

SeqHound

text_nametype

Module: text (note that this table is defined when the core module is created)

Definition:

Look up table that stores name types.

Source org: Blueprint

Source file: This is a look up table that is filled by core.sql

FTP script: NA

Parser: NA

text_nametype table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o

Identifier for a nametype.

Mysql auto-increment

column.

1 core.sql

type varchar(880)

N

o The type of name.

p

rotein core.sql

text_nametype indices

NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 327 of 421 18/04/2005

text_rules table

Last updated September 30, 2004

Database: SeqHound

Table: text_rules

Module: text (note that this table is defined when the core module is created)

Definition: Look up table that stores rules for generating names.

Source org: Blueprint

Source file: This is a look up table that is filled by core.sql

FTP script: NA

Parser: NA

text_rules table

Field Type Example Null Default Column_Definition Source API

id int(11) Identifier for a rule. Mysql auto-

increment column.

core.sql

type varchar(80) The rule. Use gene name

for protein.

core.sql

N

o 1

N

o

text_rules indices

NA

seqhound@blueprint.org Version 3.3

The SeqHound Manual 328 of 421 18/04/2005

text_db table

Last updated February 15, 2005

Database: seqhound

Table: text_db

Module: text

Definition: Lookup table that lists biomedical literature databases used by the text

module.

Source file:

Source org: Blueprint

FTP script: NA

Parser: NA

text_db table

Field Type Null Default Column Definition Example Source API

dbid int(11)

N

o Auto incremented id 1 text.sql

name varchar(80)

N

o name of database PubMed text.sql

text_db indices

Keyname Type Field

PRIMARY PRIMARY dbid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 329 of 421 18/04/2005

text_doc table

Last updated February 16, 2005

Database: seqhound

Table: text_doc

Module: text

Definition: This table lists the accession ids from each literature database and

assigns them an internal document id.

Source file:

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/

Source org: NLM MEDLINE and other biomedical literature databases

FTP script: slri/medline/updates/updatecron.sh

slri/medline/pubmedcentral/pmc_updatecron.sh

Parser: text_update.pl

ftp://ftp.nlm.nih.gov/nlmdata/.medlease/medline*.xml.gz

seqhound@blueprint.org Version 3.3

The SeqHound Manual 330 of 421 18/04/2005

text_doc table

Field Type Null Default Column Definition Example Source API

docid int(11)

N

o auto incremented identifier 1077882

dbid int(11)

N

o 0 literature database id 1

accession int(11)

N

o

0

accession in literature

database such as a PubMed

identifier (PMID)

1077882 Medline Xpath:

“/MedlineCitati

on/PMID”

Or

PMC Xpath:

/art/ui[@type="

p

mid"]'

status char(10) Yes

N

ULL insert, delete or update

N

ULL

text_doc indices

Keyname Type Field

id PRIMARY id

dbid INDEX dbid

accession INDEX accession

seqhound@blueprint.org Version 3.3

The SeqHound Manual 331 of 421 18/04/2005

text_docscore table

Last updated September 28, 2004

Database: seqhound

Table:

text_docscore

Module: text

Definition: This table lists scores for documents. Multiple scores (from different

scoring methods) may be listed for each document.

Source file:

Source org: NRC

FTP script: wget –P ~/nrc -m http://ii200.iit.nrc.ca/~martinj/

slri/text/nrc.sh

Program:

LitMiner :http://textomy.iit.nrc.ca/cgi-

bin/bindpresent.cgi?qry=10747882

Notes: The current method for scoring a document is an SVM (support vector

machine) classifier implemented by National Research Council's Joel

Martin and Berry deBruijn. Detailed information can be found in the

prebind publication:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pu

bmedid=12689350

seqhound@blueprint.org Version 3.3

The SeqHound Manual 332 of 421 18/04/2005

text_docscore table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o foreign key from doc table. See

docid

25

methodid int(11)

N

o

scoring method identifier. See the

text_method table. For example, 1

indicates an SVM trained to

recognize papers describing

interaction data.

1

score double

N

o

0 score -0.852876574 http://ii200.iit.nr

c.ca/~martinj/

text_docscore indices

Keyname Type Field

PRIMARY PRIMARY (docid, methodid, score)

methodid INDEX methodid

score INDEX score

seqhound@blueprint.org Version 3.3

The SeqHound Manual 333 of 421 18/04/2005

text_evidence table

Last updated February 16, 2004

Database: seqhound

Table: text_evidence

Module: text

Definition: This table assigns a unique identifier to each co-occurrence of two

bionames in the same document. A bioname points to a specific name

bioentity pair in the bioname table (this pair is repeated here). This is to

be distinguished from a pair of names that co-occur in a document; this

pair of names is identified by a namepairid and does not make any

reference to a specific pair of bioentities. This evidence is in support of

some point and is based on some name pair.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 334 of 421 18/04/2005

text_evidence table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o A unique identifier for this

p

iece of evidence.

1 Auto

incremented

identifier

docid int(11)

N

o

0the document identifier

where this evidence occurs

325387

int(11)

N

o

0search result identifier for

bioname A

2200

bioentityida int(11)

N

o 0 identifier for bioentity A 107564

nama char(80)

N

o name referring to A LEU1

resultidB int(11)

N

o

0search result identifier for

bioname B

9513

bioentityidb int(11)

N

o 0 identifier for bioentity B 109911

namb char(80)

N

o name referring to B HIS4

p

ointi

d

int(11) Yes

N

ULL

POtential INTeraction

between two bioentites that

this evidence supports

2

namepairid Yesint(11)

N

ULL

corresponding namepairid

on which this evidence is

based

10634

state smallint(6) Yes book keeping 0

resultidA

N

ULL

seqhound@blueprint.org Version 3.3

The SeqHound Manual 335 of 421 18/04/2005

text_evidence indices

Keyname Type Field

PRIMARY PRIMARY evidenceid

docid INDEX docid

resultida INDEX resultidA

resultidb INDEX resultidB

seqhound@blueprint.org Version 3.3

The SeqHound Manual 336 of 421 18/04/2005

text_evidencescore table

Last updated February 16, 2005

Database: seqhound

Table: text_evidencescore

Module: text

Definition: This table assigns a score to a piece of evidence. See evidence table.

Multiples cores may be assigned to one evidence if different methods

were used. Currently, this table stores potential protein-protein

interaction scores for each co-occurrence of bionames. The current

scoring method uses a set of manually collected regular expression

patterns to identity interactions. Detailed information can be found at:

ftp://ftp.blueprint.org/pub/BIND/PreBIND/README

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 337 of 421 18/04/2005

text_evidencescore table

Field Type Null Default Column_Definition Example Source API

evidenceid int(11)

N

o 0 see text_evidence table 8

methodid int(11)

N

o 0

scoring method identifier

(see text_method table); for

example, 2 indicates a

method that uses regular

expressions to detect

bioentities that are

p

hysically interacting

2

score double

N

o 0.63 0 the score

text_evidencescore indices

Keyname Type Field

PRIMARY PRIMARY (evidenceid, methodid, score)

methodid INDEX methodid

score INDEX score

seqhound@blueprint.org Version 3.3

The SeqHound Manual 338 of 421 18/04/2005

text_method table

Last updated March 15, 2005

Database: seqhound

Table: text_method

Module: text

Definition: This table assigns a unique identifier to each of the multiple scoring

schemes used in the SeqHound text mining module.

Source: This file is hand-edited.

Source Org.

Blueprint

seqhound@blueprint.org Version 3.3

The SeqHound Manual 339 of 421 18/04/2005

text_method table

Field Type Default Source Null Column_Definition Example API

id int(11)

N

o 0 Auto incremented identifier 1

type varchar(30)

N

o

This field describes the type of

score that is generated by the

method. It will be one of

searchscore, docscore, resultscore,

evidencescore or pointscore.

searchscore

hypothid

int(11) Yes 0

corresponding hypothesis

identifier. See the text_hypoth

table.

23

hypoth text Yes

This document

describes

biophysical

interaction data for

some molecule(s).

method text Yes

N

ULL

a more detailed description of the

method with pointers to more

details about the method and its

performance.

A support vector

machine was

trained to recognize

abstracts containing

biophysical

interaction data.

See PubMed

Identifier 12689350

for more details.

positive text Yes

N

ULL

the value (range) of scores

corresponding to the hypothesis

being found TRUE

>0

negative text Yes

N

ULL

the value (range) of scores

corresponding to the hypothesis

being found FALSE

<0

N

ULL a hypothesis that this method

attempts to support or refute.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 340 of 421 18/04/2005

undecided text Yes

N

ULL

the value (range) of scores

corresponding to the hypothesis

being found UNDECIDED

0

implemented enumerated Yes

N

ULL has this method been implemented

N

O

assume text Yes

N

ULL

the score value that can be

assumed if the method has been

implemented but no score is found

>0

script text Yes

N

ULL the script or program that

implements the method

text_search.pl

text_method indices

Keyname Type Field

methodid PRIMARY methodid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 341 of 421 18/04/2005

text_point table

Last updated February 16, 2005

seqhound

Database:

Table: text_point

Module: text

Definition: A POINT represents two bioentities for which some POtential

INTeraction may occur in the literature. Each POINT may be supported

by multiple pieces of evidence in the evidence table.

Program:

text_point table

Field Type Column_Definition Example API Null Default Source

id int(11)

N

o auto incremented identifier 1

bioentityidA int(11)

N

o 0 bioentity identifier 110070

bioentityidB int(11)

N

o 0 bioentity identifier 1268438

small int Yes

N

ULL book keeping 0state

text_point indices

Keyname Type Field

tid PRIMARY id

bioentityida INDEX bioentityida

bioentityidb INDEX bioentityidb

seqhound@blueprint.org Version 3.3

The SeqHound Manual 342 of 421 18/04/2005

text_pointscore table

Last updated February 16, 2005

Database: seqhound

Table: text_pointscore

Module: text

Definition: This table lists scores for potential interactions. These scores may be

viewed as a summary of the scores for all the pieces of evidence that

support this POINT. A POINT may have multiple scores that are

generated by multiple methods.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 343 of 421 18/04/2005

text_pointscore table

Field Type Null Default Column_Definition Example Source API

p

ointi

d

int(11)

N

o e

POINT (see text_point table)

0a unique identifier for som 11

methodid

an identifier for the method

used to generate this score

for the POINT (see

text_method table).

3

score double

N

o 0 the score 0.63

int(11)

N

o 0

text_pointscore indices

Ty Field

PRIMARY PRIMARY

Keyname

(pointid, methodid, score)

methodid

pe

INDEX methodid

score score INDEX

p

ointi

d

INDEX

p

ointi

d

seqhound@blueprint.org Version 3.3

The SeqHound Manual 344 of 421 18/04/2005

text_result table

Last updated February 16, 2004

Database: seqhound

Table: text_result

Module: text

Definition: This table stores all the search results (document ids) for performed

searches (see text_search table). Position default is 0, which means that

the bioname appears somewhere in the document without specifying

exactly where or how many times.

Program:

text_result table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o auto incremented identifier 1

searchid int(11)

N

o

0identifies the search that

generated this result

1

docid int(11)

N

o

0the document identifier

where the name was found

7874750

p

ositioni

d

int(11)

N

o

0

the postion id in the

document where the name

appears (0 indicates no

specified position)

0

state smallint(6) Yes

N

ULL book keeping 0

text_result indices

Keyname Type Field

tid PRIMARY id

seqhound@blueprint.org Version 3.3

The SeqHound Manual 345 of 421 18/04/2005

searchid INDEX searchid

docid INDEX docid

p

ositione

d

INDEX

p

ositioni

d

seqhound@blueprint.org Version 3.3

The SeqHound Manual 346 of 421 18/04/2005

text_resultscore table

Last updated February 16, 2005

Database: seqhound

Table: text_resultscore

Module: text

Definition: This table holds scores for search results; i.e., is the searched-for

bioentity really mentioned in the document. This table might be used to

store disambiguation scores for search results. Because a bioname can

refer to many different bioentities, an algorithm (some method) )may be

used to determine which bioentity a name occurrence refers to. This

table might also hold the results of methods that disambiguate

bioentities that have English words as names.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 347 of 421 18/04/2005

text_resultscore table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o 0 Auto incremented identifier 1

methodid int(11)

N

o

0

identifier for the method

used to score this result (see

text_method table)

3

score double

N

o 0 the score -1

text_resultscore indices

Keyname Type Field

PRIMARY PRIMARY (resultid, methodid, score)

resultid INDEX resultid

methodid INDEX methodid

score INDEX score

seqhound@blueprint.org Version 3.3

The SeqHound Manual 348 of 421 18/04/2005

text_search table

Last updated February 16, 2004

Database: seqhound

Table: text_search

Module: text

Definition: This search table is generated using the text_bioname table. A search is

minimally composed of a name that is used to look for some bioentity

(listed in this table) using some method. Currently all bioname items

with nametypeid=2 and a taxid that can be found in the taxgi table

and are inserted into this search table for MyMedline searching.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 349 of 421 18/04/2005

text_search table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o auto incremented identifier 1

bioentityid int(11)

N

o 0 bioentity being searched for 106945

bionameid int(11)

N

o 0

b

ionameid used to search for

bioentity

414737

name char(80)

N

o name used to find mention

of bioentity

AI1

taxid int(11)

N

o

0taxonomy identifier for

bioentity (if applicable)

4932

rngid int(11) Yes

N

ULL redundant name group

identifier

70513

methodid int(11)

N

o

method used to search for

the bioentity mention using

name

1

searched datetime

N

o

0000-00-

00

00:00:00

date and time that search

was last performed

2004-12-01 10:12:22

results int(11) Yes

N

ULL number of search results

returned

17

text_search indices

Keyname Type Field

PRIMARY PRIMARY searchid

bioentityid INDEX bioentityid

bionameid INDEX bionameid

taxid INDEX taxid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 350 of 421 18/04/2005

searched INDEX searched

seqhound@blueprint.org Version 3.3

The SeqHound Manual 351 of 421 18/04/2005

text_searchscore table

Last updated February 16, 2004

Database: seqhound

Table: text_searchscore

Module: text

Definition: This table holds some score that may be used to determine if a given

search strategy WILL BE informative if performed or its results are

likely TO BE informative if examined. This score might also be used to

determine whether a search strategy is to be performed at all or if some

search strategy is best left until more informative searches have been

informed. For example, if the name to be used is an English word, the

search may be scored so as to skip this search. Multiple methods (and

their scores) may be applied to a single search.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 352 of 421 18/04/2005

text_searchscore table

Field Type Null Default Column_Definition Example Source API

searchid int(11)

N

o

0identifies a search strategy (see

text_search table)

1

methodid int(11)

N

o 0 identifies the method used to score

the search strategy

score double

N

o 0 the score

text_searchscore indices

Keyname Type Field

PRIMARY PRIMARY (searchid, methodid, score)

searchid resultid INDEX

methodid INDEX methodid

score INDEX score

seqhound@blueprint.org Version 3.3

The SeqHound Manual 353 of 421 18/04/2005

text_rng table

Last updated February 16, 2004

Database: seqhound

Table: text_rng (redundant name group)

Module: text

Definition: This table groups together bionames that have equivalent names (i.e.

homonyms). This table facilitates searching by reducing the eliminating

redundant searches for the same string in the document collection. It is

thus an intermediate table in the process of creating the text_result table.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 354 of 421 18/04/2005

text_rng table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o

auto incremented identifier. This

is a unique identifier for the

redundant name group identifier.

8

name char(80)

N

o the homonym represented by this

group

SRB8

searched datetime

N

o 0000-00-00

00:00:00

when this name was last used to

search

2004-12-01 12:02:23

results Yes

N

ULL the number of documents returned

by this search

0

status char(20) Yes

N

ULL searched book keeping

text_rng indices

Keyname Type Field

Id PRIMARY id

N

ame Unique

N

ame

Searched INDEX Searched

Results Results INDEX

Status status INDEX

seqhound@blueprint.org Version 3.3

The SeqHound Manual 355 of 421 18/04/2005

text_rngresult table

Last updated February 16, 2004

seqhound

Database:

Table: text_rngresult

Module: text

Definition: This table is an intermediate step in one method for searching for

mentions of protein names in text. The table stores search results for

redundant name groups. These results are combined with the

text_doctax table to generate the final text_results table.

Program:

Note: This is an intermediate table and is not distributed as part of SeqHound.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 356 of 421 18/04/2005

text_rngresults table

Field Type Default Example Null Column_Definition Source API

id int(11)

N

o Auto incremented identifier. 1

rngid int(11)

N

o 2

0a redundant name group identifier.

See the text_rng table.

docid int(11) Yes

N

ULL a document identifier where this

name appears

45549

pmid int(11) Yes

N

ULL the corresponding PubMed

identifier where this name appears

45549

state int(11)

N

o 0 book keeping 0

text_rngresults indices

Keyname Type Field

Id PRIMARY Id

Rngid INDEX Rngid

Docid INDEX Docid

Pmid INDEX Pmid

State INDEX State

seqhound@blueprint.org Version 3.3

The SeqHound Manual 357 of 421 18/04/2005

text_doctax table

Last updated February 16, 2004

Database: seqhound

Table: text_doctax

Module: text

Definition: This table keeps a list of the organisms (by taxon identifiers) that are

described in a document.

Program:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 358 of 421 18/04/2005

text_doctax table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Auto incremented identifier. A

unique identifier.

1

docid int(11)

N

o

0document identifier (see text_doc

table).

taxid int(11)

N

o

0

organism described in this

document (listed by NCBI

taxonomy database identifier)

text_doctax indices

Keyname Type Field

Id PRIMARY Id

Docid INDEX Docid

Taxid Index taxid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 359 of 421 18/04/2005

text_organism table

Last updated February 16, 2004

Database: seqhound

Table: text_organism

Module: text

Definition: This table keeps a list of the MESH terms that are used to identify the

presence of an organism in a PubMed abstract.

Program:

Source Org: Blueprint

Source file:

Note:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 360 of 421 18/04/2005

text_organism table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o auto incremented identifier 1

taxid int(11)

N

o 0

N

CBI taxonomy identifie

r

4932

mesh timestamp

N

o mesh term found in abstracts

that describe this organism

Saccharomyces cerevisiae

searched timestamp Yes

CURRENT_TIM

ESTAMP

time that this search was last

completed

2005-02-15 15:58:38

results int(11) Yes

N

ULL number of documents

returned

N

ULL

bioentities int(11) Yes

N

ULL number of bioentities for this

organism

N

ULL

bionames int(11) Yes

N

ULL number of names for this

organism

N

ULL

maxbionameid int(11) Yes

N

ULL maximum bioname identifier

N

ULL

lastupdate timestamp Yes

0000-00-00

00:00:00

time that all all updates were

last completed

0000-00-00 00:00:00

text_organism indices

Keyname Type Field

Id PRIMARY Id

seqhound@blueprint.org Version 3.3

The SeqHound Manual 361 of 421 18/04/2005

text_englishdict table

Last updated February 16, 2004

Database: seqhound

Table: text_englishdict

Module: text

Definition: This table holds an English Dictionary from Moby project

http://www.dcs.shef.ac.uk/research/ilash/Moby/

Program:

Organization: Oxford University

File:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 362 of 421 18/04/2005

text_englishdict table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o

Auto incremented identifier.

Aunique identifier for this word-

pos combination.

1

word char(16) Yes

N

ULL the word the

pos char(10) Yes

N

ULL part of speech Det

freq int(11) Yes

N

ULL frequency in the bnc per million

words

61847

count int(11) Yes

N

ULL how many bioname identifiers

does this word refer to

1

source char(10) Yes

N

ULL

pubmed = this word is in th

pubmed stopword list

stop = present in other stopword

lists

other possibilities

pubmed

text_englishdict indices

Keyname Type Field

Id PRIMARY Id

Word Index Word

Pos Index Pos

Freq Index Freq

Count Index count

seqhound@blueprint.org Version 3.3

The SeqHound Manual 363 of 421 18/04/2005

text_bncorpus table

Last updated February 16, 2004

Database: seqhound

Table: text_bncorpus

Module: text

Definition: This table holds the British National Corpus.

Parser:

Organization: Oxford University (http://www.natcorp.ox.ac.uk/)

File: ftp://ftp.itri.bton.ac.uk/bnc/

Note: this file is not distributed with SeqHound

seqhound@blueprint.org Version 3.3

The SeqHound Manual 364 of 421 18/04/2005

text_bncorpus table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o auto incremented identifier. A

unique identifier for this word.

1

word char(160)

N

o the word the

freq int(11)

N

o 0 frequency in the corpus 6187267

pos char(16)

N

o part of speech at0

files int(11)

N

o

0number of files (out of 4120) that

this word appears in

4120

text_bncorpus indices

Keyname Type Field

Id Primary Id

Word Index Word

Freq Index Freq

seqhound@blueprint.org Version 3.3

The SeqHound Manual 365 of 421 18/04/2005

text_pattern table

Last updated February 16, 2004

Database: seqhound

Table: text_pattern

Module: text

Definition: This table holds a list of regular expressions used to detect mentions of

biophysical interactions between two given names that appear in a

sentence.

Program:

Source Org: Blueprint

Source File:

Note:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 366 of 421 18/04/2005

text_pattern table

Field Type Null Default Column_Definition Example Source API

id varchar(8)

N

o a unique identifier for this regular

expression.

9920

parentid varchar(8) Yes

N

ULL

the parent identifier if this

expression is derived from

another.

99

class int(11)

N

o

0

does this expression identify an

interaction (1) or the absence of an

interaction (-1).

1

score double Yes

N

ULL score for this regex 0.85

regex text Yes

N

ULL the regular expression A(\S*\s+){0,4}\S*B(\S*\s+){0,4}\S

*heterodimer

text_pattern indices

Keyname Type Field

Id Primary Id

Parentid Index Parentid

seqhound@blueprint.org Version 3.3

The SeqHound Manual 367 of 421 18/04/2005

text_stopword table

Last updated February 16, 2004

Database: seqhound

Table: text_stopword

Module: text

Definition: This table contains a complete list of stopwords.

1.http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#Sto

pwords

2.ftp://ftp.cs.cornell.edu/pub/smart/english.stop

3. Manually collected from bioname table

Program:

Source Org: Various (see definition)

Source file:

Note:

seqhound@blueprint.org Version 3.3

The SeqHound Manual 368 of 421 18/04/2005

text_stopword table

Field Type Null Default Column_Definition Example Source API

id int(11)

N

o Auto incremented identifier 1

word char(16)

N

o the stop word a

source char(16) Yes Source of this stop word PubMed help

text_ indices

Keyname Type Field

Id Primary Id

Word Index word

seqhound@blueprint.org Version 3.3

The SeqHound Manual 369 of 421 18/04/2005

6. Developing for SeqHound.

Open source development.

SeqHound code is developed on a cvs tree internal to the Samuel Lunenfeld Research

Institute by the members of the Blueprint Initiative Development team. The most stable

current release is available at

ftp://ftp.blueprint.org/pub/SeqHound/

External developers are encouraged to discuss major additions or modifications to the

system with the Project Manager at seqhound@blueprint.org.

Minor additions or corrections may also submitted to seqhound@blueprint.org.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 370 of 421 18/04/2005

Code organization.

Note: The following section is under revision and will be updated shortly.

This document summarizes the contents in the directories under slri/seqhound and

slri/nblast as of June of 2003. The contents of directories are first stated and then the

purpose of the contents is given.

Under the current directory hierarchy, the seqhound directory contains the following

directories (/) and files (*):

asn/

bioperl/

build/

cgi/

config/

db2/

domains/

examples/

genomes/

go/

html/

include/

include_cxx

java/

lib/

locuslink/

parsers/

perl/

rps/

scripts/

shreadme/

shreadme_cxx/

src/

src_cxx/

taxon/

tindex/

updates/

yeast/

seqhound .mk*

seqhound_cb.mk*

seqhound_db2.mk*

seqhound_odbc.mk*

seqhound_rem.mk*

shreadme*

shreadme_cxx*

seqhound@blueprint.org Version 3.3

The SeqHound Manual 371 of 421 18/04/2005

asn/ : contains the asn specifications (*.asn) for various objects used in

seqhound, auxiliary files used by datatool & asntool (*.def), and the

scripts used by asntool and datatool to autogenerate the objects in unix

(*.sh) and windows (*.bat) platforms.

bioperl/: contains the files used in the SeqHound bioperl module

build/: contains a directory structure used to store the executables. As

executables get compiled, they will be moved into their relevant

directories inside the build directory. Executables using Codebase will

be moved to cb/, db2 executables are moved to db2/, and odbc

executables to odbc/. The appropriate directories will get created as the

need arises.

cgi/: contains the source code for the cgi and web services.

config/: the configuration files used in seqhound remote & local applications.

db2/: scripts to create the tables in db2 and redund for db2.

domains/: source code for the domain module.

examples/: source code showing how to use some of the code in seqhound, ex the

asn structures and functions, the C++ remote library.

genomes/: source code for the complete genome module.

go/: source code for the gene ontology module.

html/: documentation for SeqHound API, and some scripts for converting the

documentation to html

include/: the *.h files for SeqHound

include_cxx/: the include files for the C++ remote library

java/: source code for the SeqHound Java remote library

lib/: libraries for SeqHound are copied here once compiled.

locuslink/: source code for the locuslink module.

parsers/: source code for various parsers (mother, cbmmdb, cddb, redund,

mmdbloc)

perl/: the SeqHound perl module (deprecated in favor of bioperl)

rps/: source code for the rps module (domname, rpsdb)

scripts/: various scripts to retrieve flatfiles from NCBI, and to build SeqHound

src/: source code for SeqHound (includes db layers for GO, HIST,

LOCUSLINK, NBR, rpsdb, taxdb& core modules), the C remote API

src_cxx/: source code for the remote C++ library.

taxon/: source code for the taxon module parser.

tindex/: source code for the text indexer.

updates/: source code for the daily updates & histparser.

yeast/: source code for importing yeast GO into Seqhound (never completed).

seqhound@blueprint.org Version 3.3

The SeqHound Manual 372 of 421 18/04/2005

*.mk: various auxiliary files used by SeqHound makefiles.

shreadme*: readme files for C & C++ SeqHound

Under the current directory hierarchy, the nblast directory contains the following

directories(/) and files(*):

asn/

db/

docs/

lib/

msvc/

scripts/

src/

nblast.mk*

nblastflags.mk*

asn/: contains the ASN specifications for nblast.

db/: source code for nblast Codebase layer.

docs/: instructions for compiling nblast.

lib/: source code for the ASN nblast object.

msvc/: files for Microsoft Visual C++ project.

scripts/

:

various scripts for retrieving NCBI Blast flatfiles & setting up NBlast.

src/: core source code for nblast.

*.mk: auxiliary makefile for nblast applications.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 373 of 421 18/04/2005

Adding/Modifying a remote API function to SeqHound.

Note: This section is included for historical purposes and is being

rewritten for a future release (5.0)

Overall steps:

1. open a new Bugzilla report

2. create the database search functionality

3. create the local API call

4. create the CGI call

5. add the remote calls (C/C++/Java/Bioperl/Perl)

6. test the new functionality in each of the layers

7. write/update documentation

8. inform technical writer

9. inform tester

10. update seqrem and local library in production server

11. In some rare cases, modifications may have to be made to the underlying data table

structure, then SeqHound must be rebuilt in test and production environments before

code is checked into cvs.

12. check in the new source code

13. update API website

14. close Bugzilla report

This document only goes over steps 2-5 in detail.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 374 of 421 18/04/2005

Overall architecture of the SeqHound system.

The diagram shows how the various code layers in SeqHound.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 375 of 421 18/04/2005

2. Create the Database Search Functionality

If adding a new data file, you will need to write the database search functions, one for

each data file. Database Search functions are found in ‘slri/seqhound/src/*_cb.c’

files.

‘*’ is the name of the module that the searched table belongs to like ‘GO’ or ‘intrez’.

‘cb’ refers to the database engine that the function talks to like CodeBase (cb) or

ODBC compliant databases (odbc).

Functions that search data tables are typically called SearchXXX. New data files

require you to define the ASN structure, typically each field in the data file

corresponds to an ASN field.

Most of the existing databases already have search functions so you should not have

to write these yourself. If they don’t exist, then you should consult the instructions

under Creating a new SeqHound module first. The details below are here to help you

understand the database layer and to use Search functions in your API function.

Boolean LIBCALL SearchACCDB (StAccdbPtr PNTR ppac)

The search functions should find ALL instances of a key in the database and return

them through the ASN pointer. Each ASN structure is a linked list, so multiple

records can be retrieved. Pseudo code for the typical search function:

Boolean SearchXXX(ASNLinkList Pointer)

{

foreach record that matches key in Pointer

create a new ASNLinkList Node;

fill node with record fields;

join node to Pointer;

end foreach

return TRUE if key found

else return FALSE;

}

As an example, the function

Boolean LIBCALL SearchACCDB (StAccdbPtr PNTR ppac)

finds records in the accdb table for an ODBC compliant database engine. This

function is located in slri/seqhound/src/intrez_odbc.c. The corresponding function

for a CodeBase database engine has the same name but is found in

slri/seqhound/src/intrez_cb.c. The ASN structure that StAccdbPtr points to is defined

in slri/seqhound/asn/slristruc.asn (search for StAccdb) and the corresponding C

structure is defined in slri/seqhound/include/objslristruc.h. You need to know this

when you use a database search function in your local API function.

3. Create The Local API Function

You will need to change one of three files in order to add new functionality to

SeqHound.

a) go_query.c : to add functionality to gene ontology module API functions

b) ll_query.c : to add to locuslink module API functions

c) intrez.c : to add new functionality to the other data tables.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 376 of 421 18/04/2005

Declare function prototype in slri/seqhound/include/seqhound.h. Try to group the

functions logically in the .c and .h files.

The API functions follow a general logic. You should try to stick to that logic.

Typical example of an API function:

Int4 LIBCALL SHoundFindAcc(CharPtr pcAcc)

{

Int4 gi = 0;

StAccdbPtr pac = NULL, pachead = NULL;

Int2 res = 0;

if ((pcAcc == NULL) || (strcmp(pcAcc, "n/a") == 0))

{

ErrPostEx(SEV_ERROR,0,0,"SHoundFindAcc: No accession.");

Input integrity

check

Each data table has specific asn structure.

return 0;

}

pac = StAccdbNew();

pachead = pac;

pac->access = StrSave(pcAcc);

res = SearchACCDB(&pac);

if (res == FALSE)

{

ErrPostEx(SEV_ERROR,0,0,"SHoundFindAcc: Search failed.");

StAccdbFree(pachead);

Failed search.

Free structure

and return

Successful search - extract what

you need and return it.

Set the key in the asn structure

database layer to search for the key.

The asn structure will hold all the

records that match the key.

return 0;

}else if(res == TRUE){

gi = pac->gi;

pac = pac->next;

}

StAccdbFree(pachead);

return gi;

}

4. Step 4. Create the CGI Function

Definitions:

CGI (Common Gateway Interface):

a program that allows 2 computers connected to the internet to communicate with

each other.

HTTP (Hypertext Transfer Protocol):

a way of passing messages between the 2 computers. Is used for web services, eg

CGI, servlets … Format of HTTP:

http://server_name/path/to/cgi/cgi_name?key1=value1&key2=value2&….

Query string:

The portion of the HTTP call that is used to pass parameters to the CGI. Format of

query strings: ?param1=value1&param2=value2….

SeqHound’s CGI

is slri/seqhound/cgi/seqrem.c.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 377 of 421 18/04/2005

Remote users

can call seqrem on our servers using HTTP calls. Format of a HTTP call:

http://seqhound.blueprint.org/cgi-

bin/seqrem?fnct=SeqHoundFindAcc&acc=AA73235

2 parts to edit in seqrem.c

1. if statement in function: ProcessFnctRequest(CharPtr pfnct);

2. Add a corresponding CGI function call that calls the local API

There is currently an if-else statement in the

ProcessFnctRequest function. It needs to be modified to check if

the first field of the query string contains your new function name. If so, it

will call a CGI function that calls your local API function.

SLRI_ERR ProcessFnctRequest(CharPtr pfnct)

{

if(SeqHoundInit() == SLRI_FAIL) {

MemFree(pfnct);

return 1;

} else if (strcmp(pfnct, "SeqHoundFindAcc") == 0) {

if(SeqHoundFindAcc() == SLRI_FAIL) {

MemFree(pfnct);

Calling internal CGI function

… see below

return 1;

}

} ...

} else if(strcmp(pfnc, “YourNewFunction” ..

if(SeqHoundYourNewFunction() ...

}

Next you need to define a function with the prototype

SLRI_ERR SeqHoundYourNewFunction(void)

This function will extract the remaining keys in the query string portion of the HTTP

call, and call the local API function you defined in step 3, then output the HTTP

response to be sent back to the user.

http://seqhound.blueprint.org/cgi-bin/seqrem?fnct=SeqHoundFindAcc&acc=AA73235

SLRI_ERR SeqHoundFindAcc(void)

{

Int4 gi = 0;

Int4 IndexArgs = -1;

printf("Content-type: text/html\r\n\r\n");

if ((IndexArgs = WWWFindName(ginfo,"acc")) >= 0)

{

pcThis = WWWGetValueByIndex(ginfo,IndexArgs);

pacc = StringSave(pcThis);

}

if ((pacc == NULL) || (strlen(pacc) == 0))

Output the HTTP

headers

Extract the remaining

fields from the CGI

ti

that calls Internal CGI function

the local API

Your new function gets

called here

{

ErrPostEx(SEV_ERROR,0,0, " Failed to get parameter value.");

fprintf(stdout, "SEQHOUND_ERROR Failed to get parameters.");

return SLRI_FAIL;

seqhound@blueprint.org Version 3.3

}

gi = SHoundFindAcc(pacc); Calling the local API

fti

The SeqHound Manual 378 of 421 18/04/2005

fprintf(stdout, "SEQHOUND_OK\n");

Send output

back to remote

users using

HTTP

fprintf(stdout, "%ld\n", (long) gi);

MemFree(pacc);

return SLRI_SUCCESS;

}

seqhound@blueprint.org Version 3.3

The SeqHound Manual 379 of 421 18/04/2005

5. Step 5. Create the Remote Calls

We have programs available in most of the widely used languages that allow SeqHound

users to write programs in their favorite language, accessing SeqHound data without

having to understand how everything works. Our remote programs in effect constructs

the HTTP calls (described above), sends the HTTP calls to the server and then parses the

server’s return value and sends this back to the user’s program.

Our remote interfaces are:

1. slri/seqhound/src/seqhoundapi.c

2. slri/seqhound/src_cxx/Seqhound.cpp

3. slri/seqhound/java/SeqHound.java

4. slri/seqhound/perl/SeqHound.pm

5. slri/seqhound/bioperl/SeqHound.pm

Int4 LIBCALL SHoundFindAcc(CharPtr pcAcc)

{

Char fpath[PATH_MAX];

Int4 gi = 0;

if(pcAcc == NULL)

Making the HTTP

call.

Contains the server

name

,

p

ath to CGI

,

{

ErrPostEx(SEV_ERROR,0,0, "Invalid parameter.");

return 0;

}

if(SHoundWWWGetfile(slri_sername, fpath) == 0)

sprintf(fpath,"%s?fnct=SeqHoundFindAcc&acc=%s", slri_cgipath, pcAcc);

ErrPostEx(SEV_INFO,0,0, "SeqHoundFindAcc request: %s.\n", fpath);

{

ErrPostEx(SEV_ERROR,0,0, "SHoundWWWGetfile failed.");

Send the HTTP

call

return 0;

}

gi = ReplyBSGetInteger();

if (gi == 0)

{

ErrPostEx(SEV_INFO,0,0, "SeqHoundFindAcc returned zero.");

Get the return

value.

A family of

ReplyBSGetXX

X exists for

return 0;

}

return gi;

}

Similar logic is used in the remote C++, Java, and Perl libraries.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 380 of 421 18/04/2005

Adding a new module to SeqHound

Note: This section is included for historical purposes and is being

rewritten for a future release (5.0)

This section describes how to go about adding a new module to SeqHound from the

developer’s point of view. It is basically a description of the different files that have to

be written for a new module and where they should go in the code tree. So this section

may be used as a guide for looking at existing modules to find out what files are expected

and where they are. However, historically, several programmers have added modules to

SeqHound and have used different code organization schemes; therefore, historical

modules may be organized differently. Going forward (as of November 2003) all

modules will have the components that are organized in the way described below.

1. Start a new module project plan

Creating a new module begins with starting a new project plan (see Format for project

reports). This should contain a clearly stated “Need” and “Objective” as well as a

“General approach” to developing the module; this may include background

information on the data resource being incorporated into SeqHound. The “Detailed

planning” section will contain a section on “Data table design” and “Code

Organization” that will be described further. See the DBXREF (Database cross-

reference) module Project Plan as an example

(./slri/seqhound_priv/dbxref/dbxref_module_desc.txt).

The components of a new module are summarized in the diagram below. The rest of

this section describes the creation of each of these components and their organization.

File names in bold italics indicate files that may be looked at as examples.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 381 of 421 18/04/2005

Database layer

2. Design the data table structure

The module’s project description should contain a description of the final table

design. See the example in the DBXREF (Database cross-reference) module Project

Plan as an example.

3. Write the script file that creates the data table.

You must modify the existing script file that creates the SeqHound data tables (this

step is only for ODBC SeqHound). This file is called seqhound.sql and is located in

./slri/seqhound/sql/seqhound.sql

Add the line

DROP TABLE SEQHOUND.DBXREF_tablename;

near the top of the file. Every time this script is run, it will destroy any pre-existing

data tables that belong to the module.

Add the lines that describe the table(s) that are a part of the module so the script

creates the new tables belonging to the module. See the example file “seqhound.sql”.

Note that CodeBase tables to not require a script to be made; these tables are created

by the function InitCodeBase in the <module_name>_cb.c file.

4. Write ASN.1 structure(s) that corresponds to data table descriptions.

The file ./slri/seqhound/asn/slristruc.asn must be modified to contain a description

of tables in the module. See the example structure StDbXref in the slristruc.asn file.

The shell script “./slri/seqhound/asn/makeasn” calls asntool which auto generates

slristruc.h, objsslristruc.h and objslristruc.c files. These files contain functions that

allow one to allocate and free memory for a structure corresponding to a database

record. See the makeasn file.

5. Step 5: Design new functions for the Database code layer

The DB layer contains C functions that create and retrieve records from tables in the

module. In the case of the DBXREF module, the tables are populated by a parser

written in PERL so there are no functions listed that write to the tables. The

requirement for a certain Perl modules is however noted in the project plan; these

modules allow Perl to converse with a database.

The DB layer will have three files that were auto generated from the preceding step.

a) ./slri/seqhound/include/slristruc.h

b) ./slri/seqhound/include/objsslristruc.h

c) ./slri/seqhound/src/objsslristruc.c

These files will be placed in this location by the makeasn script.

In addition, DB layer will require that two new files be made.

a) ./slri/seqhound/include/<module_name>_odbc.h

This will contain function prototypes and comments for functions in

seqhound@blueprint.org Version 3.3

The SeqHound Manual 382 of 421 18/04/2005

b) ./slri/seqhound/src/<module_name>_odbc.c

This will contain at least one functions called “Search<module_name> that

retrieves records from the database. Other functions that read or write to the database

may be included in this file.

The example function “SearchDBXREF” (see dbxref_odbc.h and dbxref_odbc.c)

takes as input a pointer to a structure called StDbXref. This structure has the same

fields as the DBXREF table. This structure is described in ASN.1 and functions to

allocate and free memory for this structure are auto generated using ASNtool. Any

row in a DBXREF table that contains field values matching anyone of the field values

passed to “SearchDBXREF” will be returned in a linked list of Valnodes that

contain pointers to StDBXref structures.

6. Make changes to existing code to accommodate changes to the database code layer.

You must modify the existing function called InitCodeBase so that it handles

tables that are new to the module. This function is found in two locations:

a) ./slri/seqhound/src/intrez_cb.c is the file that supports a CodeBase database

backend to SeqHound. InitCodeBase() needs to be changed to open the

code base data files that belong to the new module. See the example intrez_cb.c

file.

b) ./slri/seqhound/src/intrez_odbc.c is the file that supports a supports an ODBC

database backend to SeqHound. InitCodeBase() need to contain the

necessary code to establish a connection to the Database Server. See the example

intrez_odbc.c file.

The function InitCodeBase() under intrez_odbc.c has a function call:

GetAppParam(intrez", "datab", "db2alias", NULL,

(Char*)dsn, sizeof(char) * 10) <= 0 )

that retrieves database connection information from the

./slri/seqhound/config/.intrezrc configuration file.

Parser layer

7. Design the parser layer.

Parsers are generally written in C or Perl. A separate script is written to download

some file from an external ftp site. The parser takes this file as input and uses it to

populate a set of data tables belonging to the module.

Pseudocode for parsers should be documented in the project plan.

Parser layer code is located in

./slri/seqhound/<module_name>/[parser_name]

for example

./slri/seqhound/dbxref/dbxref_parser_sp.pl

Finally, by project end, all parsers must be documented according to the examples

given in the SeqHound manual. See the example parser description for “mother”.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 383 of 421 18/04/2005

Local API layer (Query layer)

8. Design the Local API layer

Design the local API local query layer. This layer will consist of three files

a) ./slri/seqhound/src /[Module name]_query.c

This file contains all of the API functions that query the module’s tables as well

as auxiliary functions (if any). Note that this naming convention is not followed

by API calls that belong to the core module of SeqHound; code for these local

API calls that query core module tables is in ./slri/seqhound/src/intrez.c.

b) ./slri/seqhound/include/[Module name]_query.h

This file contains function prototypes and comments for auxiliary functions (if

any) that may be used by the module’s parsers, API functions or other

applications specific to the module.

c) ./slri/seqhound/include/seqhound.h

This is where all publicly available API functions are defined. Note that the local

and the remote API’s use the same header file. ALL API functions for ALL

modules are defined in this header. This file already exists and must be simply

modified.

Examples are:

./slri/seqhound/src /dbxref_query.c

./slri/seqhound/include/dbxref_query.h

./slri/seqhound/include/seqhound.h

An example of an API function is a function(s) to retrieve data base cross-references

given a source record (SHoundDBXREFGetDBXrefListBySourceRecord).

See the example in the dbxref_query.c file. This is a local API call. Notice the

naming convention: ‘Shound’ followed by the module name ‘GODB’ followed by the

actual API function name.

In the example function note the line

Note that this function calls a database layer function called

if(!SHoundModule("godb"))

This checks the SeqHound configuration file to make certain that the build of

SeqHound actually includes this module. All API function calls must have an

analogous check.

SearchDBXREF.

CGI layer

9. Design the CGI layer

The SeqHound cgi layer that supports the remote API is contained in only one file

./slri/seqhound/cgi/seqrem.c

There is no header file for seqrem.c since all functions are defined before “Main()”

and Main only calls functions in this file.

seqhound@blueprint.org Version 3.3

The SeqHound Manual 384 of 421 18/04/2005

This file must be edited to include cgi support for the remote API calls for the new

module. See the examples in the seqrem.c file. See the example function

SeqHoundDBXREFGetDBXrefListBySourceRecord.

10. Design the remote API layer

For C:

SEQH_SRC_ODBC

Note the naming convention: SeqHound followed by the module name followed by

the same function name used by the local API function.

Remote API layer

The remote API for SeqHound supports 4 languages. So there are four files that must

be modified to include new API functions for the new module.

./slri/seqhound/src/seqhoundapi.c

For C++: ./slri/seqhound/src_cxx/SeqHound.c

For Java: ./slri/seqhound/java/SeqHound.java

For Perl: ./slri/seqhound/perl/SeqHound.pm

The function names will be exactly the same as those listed in the local API layer (for

example; SHoundDBXREFGetDBXrefListBySourceRecord).

Examples for the C remote API are given in the seqhoundapi.c file.

11. Modify the seqhound config file.

New modules require that another entry be made in the .intrezrc file. This setting

allows the SeqHound administrator to indicate whether any given SeqHound module

has been built. The function SHoundModule() will look at these config file

settings every time a local API call is made to determine if the module is present.

Modify the ./slri/seqhound/config/intrezrc according to the example. Analogous

additions must be made to the ./slri/seqhound/config/intrez.ini file for Windows

platforms.

Modify the ShoundModule() function in ./slri/seqhound/src/intrez_cfg.c file to

support the new module. Follow the example in this file.

12. Modify/create make files to support the new module.

Modify the following files.

./slri/seqhound/seqhound_odbc.mk

SEQH_OBJ_OODBC

./slri/seqhound/seqhound.mk

SEQH_SRC_COM

SEQH_OBJ_COM

SEQH_ODBC_COM

seqhound@blueprint.org Version 3.3

The SeqHound Manual 385 of 421 18/04/2005

./slri/seqhound/seqhound_rem.mk

./slri/seqhound/seqhoundrem.mk??

./slri/seqhound/src/make.shoundlocllib

./slri/seqhound/src/make.shoundremlib

Create the following make files for any parsers written in C.

./slri/seqhound/<module_name>/make.[parser_name]

see the example in

./slri/seqhound/locuslink/make.llparser

13. Design regression tests

Regression tests are based on the CuTest C Unit Testing Framework.

(see http://cutest.sourceforge.net/).

One file must be modified to support tests for the new module. Follow the examples

in:

./slri/seqhound/test/regresion/main.c

And a new file must be created that contains the actual test functions. See the

examples in the test driver for the database layer for the DBXREF module:

./slri/seqhound/test/regresion/dbxref_odbc_driver.c

Examples of some test cases Function calls.

void testDBXREF_GetObjectIDbyAcc(CuTest *tc){...}

Another driver contains test functions for the

./slri/seqhound/test/regresion/dbxref_query_driver.c

Finally, the following file must be modified to accommodate the new module test

drivers. Follow the examples in:

./slri/seqhound/test/regresion/make.test_driver

14. Design test cases

These test cases refer to tests of the local and remote API functions relevant to the

module.

15. Code test and debug

16. Finish documentation

17. Compile SeqHound code in test environment

18. Build SeqHound db in test environment

19. Check in code

20. Pass the project on to delivery team (Test Dev/Systems Dev/Software Training)

21. Build module in production (SEQHOUND ADMIN)

22. Update data tables (SEQHOUND ADMIN)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 386 of 421 18/04/2005

23. Update seqrem (SEQHOUND ADMIN)

24. Update docs (SEQHOUND ADMIN)

25. Update website (SEQHOUND ADMIN)

seqhound@blueprint.org Version 3.3

The SeqHound Manual 387 of 421 18/04/2005

7. Appendices

Example GenBank record in ASN.1 format

Example SwissProt record in ASN.1 format

Example EMBL record in ASN.1 format

Example PDB record in ASN.1 format

Example Biostruc in ASN.1 format

GO background material

seqhound@blueprint.org Version 3.3

The SeqHound Manual 388 of 421 18/04/2005

Example GenBank record

Seq-entry ::= set {

title "Vairimorpha necatrix largest subunit of RNA polymerase II

source {

db {

std {

name {

country "UK" ,

month 4 ,

day 4 } ,

class nuc-prot ,

descr {

(RPB1)

gene, complete cds." ,

org {

taxname "Vairimorpha necatrix" ,

{

db "taxon" ,

tag

id 6039 } } , <==========================TAXGI/taxid

orgname {

name

binomial {

genus "Vairimorpha" ,

species "necatrix" } ,

lineage "Eukaryota; Fungi; Microsporidia; Burenellidae;

Vairimorpha" ,

gcode 1 ,

mgcode 1 ,

div "INV" } } } ,

create-date

std {

year 1998 ,

month 12 ,

day 10 } ,

pub {

sub {

authors {

names

{

name

name {

last "Hirt" ,

first "R" ,

initials "R.P." } } ,

{

name

last "Healy" ,

first "B" ,

initials "B." } } } ,

affil

std {

affil "The Natural History Museum" ,

div "Zoology" ,

city "London" ,

street "Cromwell Road" ,

postal-code "SW7 5BD" } } ,

medium email ,

date

std {

year 1998 ,

day 16 } } } } ,

update-date

std {

year 1999 ,

month 2 ,

pub {

article {

seqhound@blueprint.org Version 3.3

The SeqHound Manual 389 of 421 18/04/2005

title {

name "Microsporidia are related to Fungi: evidence from the

largest subunit of RNA polymerase II and other proteins." } ,

authors {

names

std {

{

name

name {

last "Hirt" ,

initials "R.P." } } ,

{

name

name {

last "Doolittle" ,

name

medline 99110933 } } ,

last "Logsdon" ,

initials "J.M." ,

suffix "Jr." } } ,

{

name

name {

last "Healy" ,

initials "B." } } ,

{

name

name {

last "Dorey" ,

initials "M.W." } } ,

{

name

name {

initials "W.F." } } ,

{

name {

last "Embley" ,

initials "T.M." } } } ,

affil

str "Department of Zoology, The Natural History Museum,

London

SW7 5BD, United Kingdom." } ,

from

journal {

title {

iso-jta "Proc. Natl. Acad. Sci. U.S.A." ,

ml-jta "Proc Natl Acad Sci U S A" ,

issn "0027-8424" ,

name "Proceedings of the National Academy of Sciences

of the

United States of America." } ,

imp {

date

std {

year 1999 ,

month 1 ,

day 19 } ,

volume "96" ,

issue "2" ,

pages "580-585" ,

language "eng" } } ,

ids {

pubmed 9892676 ,

muid 99110933 ,

pmid 9892676 } } } ,

seq-set {

<==================================beginning of bioseq

<===========================look down for end of bioseq(see ASNDB/asn1)

seq {

id {

genbank{<===============================ACCDB/DB

name "AF060234" ,<================== ======ACCDB/name

seqhound@blueprint.org Version 3.3

The SeqHound Manual 390 of 421 18/04/2005

accession "AF060234" ,<=====================ACCDB/accession

version 1 } ,<============================ACCDB/version

gi 4001823 } , <==========================Many tables/GI

descr {

molinfo {

biomol genomic } } ,

inst {

repr raw ,

mol dna ,

length 5019 ,

seq-data

ncbi2na

'171B6C0C434EFC0FBDC301F7E3FFFF3F396FF3FD7FFFF355C03BF8E03EC

10020DDB4350F6BF3FDD48203080352EDB5030D4578053A00EAF57037AEBF0C8FC00EA4

71202A7

FFCEF4D7B8008C3FF7E57AD3FE944F83C1C250EF53BEBC4E18034200DFA0EEFEFFCFBDC

87C03CB

1000DF0808FE0FFB3A0CFDC21C02EFE60AE03EA803ABF129EE80C2497B0C00082838BF8

C9FF380

A208202E3AC0B0FF83A2022F4C33DE0823EC0E08E7BBFFABFE3482FC70978BA7CFF87BD

FF2F5D7

5F7B0A5CB3ED3A0A0EF889208E3F874C0FA48CCB009C31F37C203383C82A917AD3B2C2A

3CE043C

750FD33E510E3E10E130BA5051291F4802EA657F0024FDE761E029020A98B5AA03F0EA0

02AC8FD

2E52B7B0F1978D60CFDEC820B68BD7D223E608D447F5480F3075FF0CF8597C701FB740E

A5038B1

56A983CEC3080E1A1020C87C0FD0CAA233C2F2082BCECB622113908688ECBFCFC32105F

7F1100E

24E3A74FFB58B4E829007FC8703FB7EEDD95CC19A3FE1AA180E0DF44E5902F104B0A683

C820F3B

F2C2C042CF0B510DC3014B0EA8CB1084BF049DC21FFC4F188F4FFF8DB22010E43CF33DE

C0CF0C3

CE2FC78F6D80F0F3874E3874FEB0C3F131E0837D033CE00CF03F549CFDF35C203EE87AC

040DF0B

C4FF140C43F30E80370E244182087E803B608F7C6D30F203A6037C2E83CF842027B6AD0

4429AFC

3D130F90387FAD7848B87EFDFE38E748038E0DFCFD9C430197FDCFA0FAE190C92102010

E2D0B10

20B32012702042C0E034CB4091000301C808F14ACED0E22227F80B42C0F13FE0C050884

FDE99B4

9F4037F8BFFB0C138A18EB5E928D00A74FCC0CFD10B449FBF2B4100EC828008357F6BFC

3CC8DFC

513FCB0268F3DEA037228FEC8032F3B00A0F1D7820F7DFD39CEA2B220ADC3E33E73C012

4804A33

3D020872C0243A08E71ED1DF8C8DEC228923A3F3F3433831A208EBFE3947FFC80E40038

74E386E

9C702386EDFF003F93F2C84EFC48FE0FD90C000803BC48D03F103CF871E3B03F1200FF3

38E0FE2

EBC0E00EC000CE000E0CF9CB53B0FD40830C0FC98FCC0FE3EC802884FD56CFC3FC8C7F0

00FC320

0FC50F083F3C3E033C300CC3F3630228FF0383303CD3C827CC3E8FF0080F03F00FF00D0

C3F7503

80EB291FE9A7437B2A81494540E1FC0C7F53F24A2CD240CF18EA2C508F0080F303BE600

33C21D7

ECE08FCFC02353F0C01FC80E9C200D40B83C83FD233002DDEE0FCB80FCF3857B8C8211F

7300823

023FEF4A0CFF8FD5E3813F23FCB00E5C0FCD308F003E120F03EB0B003F80F200CFB002F

C4E297F

5C33FD4CC3C8D238032D20FC30C20C23B3F4DFE0C30EC833CC3F10CC00CCF03F023CEAF

30C0342

02FFCC2E08C08408E2EB3F12123A2FECC22033DDD3503B20A13FAC1F70E3F0380CB20BF

CA0C827

9C888437E0E0F87F2F321A03A0BCED0D329134971E9E3B8E10E0A8C5F12B304213A2E0C

0BA8FEA

E47019934FE0807BA33FC721927F2C920033B0402BCC488330E3A845C957FAA12832805

E7FE8EC

2C0F230273147F70148330F3820BE3194FCC44B52ED2005E2CC24BA01E895A7CFCB20A0

3633994

0042FCCF75862561F32D4189504CCB5462541332D4109507C495842541F32D4189585C7

4504D94

DB32D610B58BC4B5042562F12D4109589CCB5042D4DF3254109537C4B504D16DF12D413

seqhound@blueprint.org Version 3.3

The SeqHound Manual 391 of 421 18/04/2005

45B7C4B

504256DF32D41095B7CCB504256DF32D41B18C38600801C3280600A024300CF0CB0000F

C00FFFF

FFEC03FFCCD035CCF8DCCBFC33C030304'H } ,

annot {

{

data

ftable {

{

data

gene {

locus "RPB1" } ,

location

int {

from 100 ,

to 4917 ,

id

gi 4001823 } } } } } } ,

<==============================the first bioseq ends here

see ASNDB/asn1

<==============================next bioseq begins here

seq {

id {

genbank {

accession "AAD12604" ,

NGPNEYP

version 1 } ,

gi 4001824 } ,

descr {

molinfo {

biomol peptide ,

tech concept-trans-a } ,

title "largest subunit of RNA polymerase II; RPO21 [Vairimorpha

necatrix]" } ,

inst {

repr raw ,

mol aa ,

length 1605 ,

topology not-set ,

seq-data

ncbieaa

"MFDEIVTKRISSIQFGLFSPEEIRKSSVVQIIHPETMENGFPKSGGLIDLKMGTTERAF

LCSSCEKDNFSCPGHFGHIELTKPMFHVGYMTKIKKILECVCFYCSRLKISTKNLKKDLNFVWNISKTKSV

CEGEIGE

NGFTGCGNKQPVIKKEGMSLIAFMKGEEESDGKVILNGERVHNILKKIVNEDAVFLGFDQKFTKPEWLILT

VLLVPPP

SVRPSIVMEGMLRAEDDLTHKLADIVKANTYLKKYELEGAPGHVVRDYEQLLQFHIATMIDNDISGQPQAL

QKSGRPL

KSISARLKGKEGRVRGNLMGKRVDFSARSVITPDPNISVEEVGVPSEIAKIHTFPEIITPFNIDRLTKLVS

GANYVIRNDGQRIDLNFNRGDIKLEEGYVVERHMQDGDVVLFNRQPSLHKMSMMAHFVRVMEGKTFRLNLS

CVSPYNA

DFDGDEMNLHMPQSYNSKAELEELCLVSKQVLSPQSNKPVMGIVQDSLTALRLFTLRDSFFDRRETMQLLY

SVNINNY

EFTDSSKLIMTHDDSFGNNLHTEESSNIMKILNFPAISYPKKLWTGKQILSYILPNTIYNGKSNEHNEEDL

ENVEDSY

VIIRNGEILSGIIDKKAVGSTQGGLIHIIANDFGPDRVTCFFDDAQKMMNLYFATINAFSIGIGDAIADKE

TMSQVQR

SIETAKEQVNEIIVKAQKNKLERLPGMSMRESFESQVNYILNKPRDISGASASKSLSFCNNMRTMVLAGSK

GSFINIS

QVTACLGQQNVEGKRIPFGFNYRSLPHFSKADYSGKSRGFVENSYVKGITPEEFFFHAMGGREGLIDIAIK

TAETGYI

QRRLVKAMEDATVTLDRSVRGADGFIYQYEYGEDGFDATFLEMQKMTHDDVATKDDVSFKNLHLVDMFTDL

NFAIKKE

NVTDQIYKLLTTDVNLQKILYDEFEWLNENVKKYEKMNIASPCNFQRIINLAIYKFDCRKGDISPYLILDT

LKNLIEN

LPIKNLLIEILIKYNLSIKRILNEYKLSLEAYNWILKEIKFKILKSIISPNEMVGTLAAQSVGEPATQMTL

NTFHLAG

VSANITMGVPRLKEIINVAKNIKTPCMKIYLKDPFNKTLEMAKKIQSELEFSDIKSLCEFSEIYYDPVIED

TSIKEDK

DFVQEYFDFPDEHLDFSKMPKFIIRLKIDRIKLVSKNLKLENIVKSLHEAFPNIFHIIRSDENSQNLIIRI

RCISSLN

seqhound@blueprint.org Version 3.3

The SeqHound Manual 392 of 421 18/04/2005

NNVEYYNLQYKNILNLKIMGYNKIKKVFISEDKDKDEWYLQTDGVCIREIFSHPNVEGHLVTSNDLNEIVE

VLGIEAA

RETILNELTLVIDGNGSYVNHRHISLLADVMTMKGYLTGITRHGVNKVGFGCTKRASFEETVDILLDAALV

AEKYVTK

GYTENIMMGHLAPLGTGIGNLLLDVSKLDKAIPLSKPEYNYEEVDTPFIHSPVSENLSISSGNWSPAYLVE

GNRYAPK

TSLYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPSYSPTSPSYSPTSPSYSPTS

"RPO21" } } ,

data

PSYSPTS

PSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTYDNDEKKTNRKRKGKQ" } ,

annot {

{

data

ftable {

{

data

prot {

name {

"largest subunit of RNA polymerase II" ,

location

int {

from 0 ,

to 1604 ,

strand plus ,

id

gi 4001824 } } } } } } } ,

annot {

{

ftable {

{

data

cdregion {

frame one ,

code {

id 1 } } ,

product

whole

gi 4001824 ,

location

int {

from 100 ,

to 4917 ,

strand plus ,

id

gi 4001823 } } } } } }

seqhound@blueprint.org Version 3.3

The SeqHound Manual 393 of 421 18/04/2005

Example SwissProt record

Seq-entry ::= seq {

id {

swissprot {

POLYMERASE II FOR THE MRNA PRECURSOR, AND POLYMERASE III FOR 5S AND

FAMILY." ,

gi 886082 ,

str "L0001744" } ,

name "RPB1_YEAST" ,<============================ACCDB/name

accession "P04050" } ,<=========================ACCDB/accession

gi 2507347 } ,

descr { <======ACCDB/title

title "DNA-DIRECTED RNA POLYMERASE II LARGEST SUBUNIT (B220)." ,

comment "----------------------------------------------------------

-------

--~This SWISS-PROT entry is copyright. It is produced through

a~collaboration

between the Swiss Institute of Bioinformatics and~the EMBL outstation

- the

European Bioinformatics Institute.~The original entry is available

from

http://www.expasy.ch/sprot~and

http://www.ebi.ac.uk/sprot~-------------------------------------------

-------

----------------" ,

comment "[FUNCTION] DNA-DEPENDENT RNA POLYMERASE CATALYZES THE

TRANSCRIPTION OF DNA INTO RNA USING THE FOUR RIBONUCLEOSIDE

TRIPHOSPHATES AS

SUBSTRATES." ,

comment "[CATALYTIC ACTIVITY] N NUCLEOSIDE TRIPHOSPHATE = N

PYROPHOSPHATE

+ RNA(N)." ,

comment "[SUBUNIT] RNA POLYMERASE II CONSISTS OF 12 DIFFERENT

SUBUNITS.

THIS SUBUNIT IS THE LARGEST COMPONENT OF RNA POLYMERASE II." ,

comment "[SUBCELLULAR LOCATION] NUCLEAR." ,

comment "[PTM] THE TANDEM 7 RESIDUES REPEATS CAN BE HIGHLY

PHOSPHORYLATED.

THE PHOSPHORYLATION ACTIVATES POL2." ,

comment "[MISCELLANEOUS] THREE DISTINCT ZINC-CONTAINING RNA

POLYMERASES

ARE FOUND IN EUKARYOTIC NUCLEI: POLYMERASE I FOR THE RIBOSOMAL RNA

PRECURSOR,

TRNA

GENES." ,

comment "[SIMILARITY] BELONGS TO THE RNA POLYMERASE BETA' CHAIN

sp {

class standard ,

extra-acc {

"Q12364" ,

"Q92315" } ,

seqref {

gi 4397 ,

gi 4398 ,

gi 1419218 ,

gi 1419221 ,

gi 1431216 ,

gi 1431217 ,

gi 886080 ,

gi 2144431 } ,

dbref {

{

db "SGD" ,

tag

{

db "PFAM" ,

tag

str "PF00623" } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 394 of 421 18/04/2005

{

db "PROSITE" ,

tag

str "PS00115" } } ,

keywords {

"Transferase" ,

month 11 ,

org {

db "taxon" ,

orgname {

gcode 1 ,

molinfo {

pub {

"DNA-directed RNA polymerase" ,

"Transcription" ,

"Zinc" ,

"Repeat" ,

"DNA-binding" ,

"Nuclear protein" ,

"Phosphorylation" ,

"Zinc-finger" } ,

created

std {

year 1986 ,

month 11 ,

day 1 } ,

sequpd

std {

year 1997 ,

month 11 ,

day 1 } ,

annotupd

std {

year 1999 ,

month 7 ,

day 15 } } ,

create-date

std {

year 1986 ,

day 1 } ,

update-date

std {

year 1999 ,

month 7 ,

day 15 } ,

source {

taxname "Saccharomyces cerevisiae" ,

common "baker's yeast" ,

db {

{

tag

id 4932 } } ,

name

binomial {

genus "Saccharomyces" ,

species "cerevisiae" } ,

lineage "Eukaryota; Fungi; Ascomycota; Saccharomycotina;

Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces"

,

mgcode 3 ,

div "PLN" } } } ,

biomol peptide ,

completeness complete } ,

pub {

gen {

serial-number 1 } ,

muid 85282617 ,

article {

title {

name "Extensive homology among the largest subunits of

eukaryotic

and prokaryotic RNA polymerases." } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 395 of 421 18/04/2005

authors {

names

std {

{

name

name {

last "Allison" ,

initials "L.A." } } ,

name

name {

issn "0092-8674" ,

language "eng" } } ,

muid 97127826 ,

{

last "Moyle" ,

initials "M." } } ,

{

name

name {

last "Shales" ,

initials "M." } } ,

{

name

name {

last "Ingles" ,

initials "C.J." } } } } ,

from

journal {

title {

iso-jta "Cell" ,

ml-jta "Cell" ,

name "Cell." } ,

imp {

date

std {

year 1985 ,

month 9 } ,

volume "42" ,

issue "2" ,

pages "599-610" ,

ids {

pubmed 3896517 ,

medline 85282617 } } ,

pmid 3896517 } ,

comment "SEQUENCE FROM N.A.~STRAIN=A364A" } ,

pub {

gen {

serial-number 2 } ,

article {

title {

name "Analysis of a 26,756 bp segment from the left arm of

yeast

chromosome IV." } ,

authors {

names

std {

name

name {

last "Wolfl" ,

initials "S." } } ,

{

name

name {

last "Hanemann" ,

initials "V." } } ,

{

name

name {

last "Saluz" ,

initials "H.P." } } } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 396 of 421 18/04/2005

affil

str "Hans-Knoll-Institut fur Naturstoff-Forschung,

Department of

Cell and Molecular Biology, Jena, Germany." } ,

from

journal {

title {

iso-jta "Yeast" ,

ml-jta "Yeast" ,

issn "0749-503X" ,

name "Yeast (Chichester, England)" } ,

imp {

date

std {

year 1996 ,

month 12 } ,

volume "12" ,

issue "15" ,

pages "1549-1554" ,

language "eng" } } ,

ids {

pubmed 8972577 ,

medline 97127826 } } ,

pmid 8972577 } ,

comment "SEQUENCE FROM N.A.~STRAIN=S288C / FY1679" } ,

pub {

gen {

serial-number 3 } ,

muid 95377607 ,

article {

authors {

Urbana

6180, USA." } ,

journal {

date

title {

name "The gene encoding the biotin-apoprotein ligase of

Saccharomyces cerevisiae." } ,

names

std {

{

name

name {

last "Cronan" ,

initials "J.E." ,

suffix "Jr." } } ,

{

name

name {

last "Wallace" ,

initials "J.C." } } } ,

affil

str "Department of Microbiology, University of Illinois,

from

title {

iso-jta "FEMS Microbiol. Lett." ,

ml-jta "FEMS Microbiol Lett" ,

issn "0378-1097" ,

name "FEMS microbiology letters." } ,

imp {

std {

year 1995 ,

month 8 ,

day 1 } ,

volume "130" ,

issue "2-3" ,

pages "221-229" ,

language "eng" } } ,

ids {

pubmed 7649444 ,

medline 95377607 } } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 397 of 421 18/04/2005

pmid 7649444 } ,

comment "SEQUENCE OF 1669-1733 FROM N.A.~STRAIN=S288C" } } ,

inst {

repr raw ,

mol aa ,

length 1733 ,

seq-data

ncbieaa

"MVGQQYSSAPLRTVKEVQFGLFSPEEVRAISVAKIRFPETMDETQTRAKIGGLNDPRLGSIDR

NLKCQTCQEGMNECPGHFGHIDLAKPVFHVGFIAKIKKVCECVCMHCGKLLLDEHNELMRQALAIKDSKKR

FAAIWTL

SIAKTLT

ETIAEAK

ASEIYYD

ids {

int {

CKTKMVCETDVPSEDDPTQLVSRGGCGNTQPTIRKDGLKLVGSWKKDRATGDADEPELRVLSTEEILNIFK

HISVKDF

TSLGFNEVFSRPEWMILTCLPVPPPPVRPSISFNESQRGEDDLTFKLADILKANISLETLEHNGAPHHAIE

EAESLLQ

FHVATYMDNDIAGQPQALQKSGRPVKSIRARLKGKEGRIRGNLMGKRVDFSARTVISGDPNLELDQVGVPK

YPEVVTPYNIDRLTQLVRNGPNEHPGAKYVIRDSGDRIDLRYSKRAGDIQLQYGWKVERHIMDNDPVLFNR

QPSLHKM

SMMAHRVKVIPYSTFRLNLSVTSPYNADFDGDEMNLHVPQSEETRAELSQLCAVPLQIVSPQSNKPCMGIV

QDTLCGI

RKLTLRDTFIELDQVLNMLYWVPDWDGVIPTPAIIKPKPLWSGKQILSVAIPNGIHLQRFDEGTTLLSPKD

NGMLIID

GQIIFGVVEKKTVGSSNGGLIHVVTREKGPQVCAKLFGNIQKVVNFWLLHNGFSTGIGDTIADGPTMREIT

KKVLDVTKEAQANLLTAKHGMTLRESFEDNVVRFLNEARDKAGRLAEVNLKDLNNVKQMVMAGSKGSFINI

AQMSACV

GQQSVEGKRIAFGFVDRTLPHFSKDDYSPESKGFVENSYLRGLTPQEFFFHAMGGREGLIDTAVKTAETGY

IQRRLVK

ALEDIMVHYDNTTRNSLGNVIQFIYGEDGMDAAHIEKQSLDTIGGSDAAFEKRYRVDLLNTDHTLDPSLLE

SGSEILG

DLKLQVLLDEEYKQLVKDRKFLREVFVDGEANWPLPVNIRRIIQNAQQTFHIDHTKPSDLTIKDIVLGVKD

LQENLLV

LRGKNEIIQNAQRDAVTLFCCLLRSRLATRRVLQEYRLTKQAFDWVLSNIEAQFLRSVVHPGEMVGVLAAQ

SIGEPAT

QMTLNTFHFAGVASKKVTSGVPRLKEILNVAKNMKTPSLTVYLEPGHAADQEQAKLIRSAIEHTTLKSVTI

PDPRSTVIPEDEEIIQLHFSLLDEEAEQSFDQQSPWLLRLELDRAAMNDKDLTMGQVGERIKQTFKNDLFV

IWSEDND

EKLIIRCRVVRPKSLDAETEAEEDHMLKKIENTMLENITLRGVENIERVVMMKYDRKVPSPTGEYVKEPEW

VLETDGV

NLSEVMTVPGIDPTRIYTNSFIDIMEVLGIEAGRAALYKEVYNVIASDGSYVNYRHMALLVDVMTTQGGLT

SVTRHGF

NRSNTGALMRCSFEETVEILFEAGASAELDDCRGVSENVILGQMAPIGTGAFDVMIDEESLVKYMPEQKIT

EIEDGQD

GGVTPYSNESGLVNADLDVKDELMFSPLVDSGSNDAMAGGFTAYGGADYGEATSPFGAYGEAPTSPGFGVS

SPGFSPT

SPTYSPTSPAYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTS

PSYSPTS

PSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPAYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSP

NYSPTSP

SYSPTSPGYSPGSPAYSPKQDEQKHNENENSR" ,

hist {

replaces {

date

std {

year 1997 ,

month 10 ,

day 9 } ,

gi 133330 } } } } ,

annot {

{

data

ftable {

{

data

region "Zinc finger region" ,

comment "C2H2-TYPE (POTENTIAL)." ,

location

from 66 ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 398 of 421 18/04/2005

to 82 ,

id

gi 2507347 } ,

exp-ev not-experimental } ,

{

data

region "Domain" ,

gi 2507347 } ,

data

{

comment "CARBOXYL-TERMINAL 7-RESIDUE REPEATS." ,

location

int {

from 1543 ,

to 1718 ,

id

gi 2507347 } ,

exp-ev experimental } ,

{

data

region "Variant" ,

comment "MISSING (IN STRAIN A364A)." ,

location

int {

from 1652 ,

to 1658 ,

id

exp-ev experimental } ,

{

data

region "Conflict" ,

comment "A -> V (IN REF. 1)." ,

location

pnt {

point 1513 ,

id

gi 2507347 } ,

exp-ev experimental } ,

{

region "Conflict" ,

comment "G -> A (IN REF. 1)." ,

location

pnt {

point 1523 ,

id

gi 2507347 } ,

exp-ev experimental } ,

data

region "Conflict" ,

comment "T -> M (IN REF. 1)." ,

location

pnt {

point 1600 ,

id

gi 2507347 } ,

exp-ev experimental } ,

{

data

gene {

locus "RPB1" ,

syn {

"RPO21" ,

"RPB220" ,

"SUA8" ,

"YDL140C" ,

"D2150" } } ,

location

int {

from 0 ,

to 1732 ,

id

gi 2507347 } } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 399 of 421 18/04/2005

{

data

prot {

name { <=============ACCDB/title

"DNA-DIRECTED RNA POLYMERASE II LARGEST SUBUNIT" } ,

ec {

"2.7.7.6" } } ,

location

int {

from 0 ,

to 1732 ,

id

gi 2507347 } } } } } }

seqhound@blueprint.org Version 3.3

The SeqHound Manual 400 of 421 18/04/2005

Example EMBL record

Seq-entry ::= set {

level 1 ,

class nuc-prot ,

descr {

pub {

gen {

cit "Unpublished" ,

authors {

names

std {

{

name

name {

last "Drebot" ,

initials "M.A." } } ,

{

name

name {

last "Jansma" ,

initials "D." } } ,

{

name

name {

last "Himmelfarb" ,

initials "H.J." } } ,

{

name

name {

last "Friesen" ,

initials "J.D." } } } } ,

title "Suppressors of yeast RNA polymerase II mutations

belong to a

family of gene products that interact with a protein kinase" } } } ,

pub {

sub {

authors {

names

std {

{

name

name {

last "Jansma" ,

initials "D." } } } ,

affil

str "David Jansma, Genetics, Hospital for Sick Children,

555

University, Avenue, Toronto, Ontario, M5G 1X8, Canada" } ,

medium other ,

date

std {

year 1992 ,

month 7 ,

day 28 } } } } ,

create-date

std {

year 1992 ,

month 12 ,

day 11 } ,

update-date

std {

year 1993 ,

month 3 ,

day 12 } ,

source {

org {

taxname "Saccharomyces cerevisiae" ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 401 of 421 18/04/2005

common "baker's yeast" ,

db {

{

db "taxon" ,

tag

id 4932 } } ,

orgname {

name

binomial {

genus "Saccharomyces" ,

species "cerevisiae" } ,

mod {

{

subtype isolate ,

subname "Mating type a" } } ,

lineage "Eukaryota; Fungi; Ascomycota; Saccharomycotina;

Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces"

,

gcode 1 ,

mgcode 3 ,

div "PLN" } } ,

subtype {

{

subtype chromosome ,

name "7" } } } } ,

seq-set {

seq {

id {

embl {

name "SCSPM2" ,<==============================ACCDB/name

accession "Z14128" ,<========================ACCDB/accession

version 1 } ,<==================================ACCDB/version

gi 287914 } ,

descr {

title "S.cerevisiae spm2+ gene for spm2+ protein" ,

embl {

div fun ,

creation-date

std {

year 1992 ,

month 12 ,

day 11 } ,

update-date

std {

year 1993 ,

month 3 ,

day 12 } ,

keywords {

"CDC68/SPT16 protein" ,

"spm2+ gene" ,

"spm2+ protein" } ,

xref {

{

dbname

name "SGD" ,

id {

str "L0001891" ,

str "SIP2" } } ,

{

dbname

code swissprot ,

id {

str "P34164" ,

str "SIP2_YEAST" } } } } ,

molinfo {

biomol genomic } } ,

inst {

repr raw ,

mol dna ,

length 2032 ,

seq-data

ncbi2na

seqhound@blueprint.org Version 3.3

The SeqHound Manual 402 of 421 18/04/2005

'00F965EC6C171E0103E3BFD06B54022FC17F9A55FFCDD14FFD770607B4E

0107C933066C7C4C8C8F0323053933470724DB4E0C91BF973183E69693ADFF33E757E85

8B023F1

7E6FF68DB4037FCA80A303114423D7A80E00B3780C3EE7FE82AC78892E0BF73AB1C610B

4D49500

024041000BB6E45CD38B86F2200152439D0A798164203A39EE24202F1E0789F0303BD23

D50866A

925F42A2A373C50000361BCF9E6E38A18951CE502737BCE8079EE847877AB7DB71C98E3

822A233

CF9521848950908E52D78E38D9D294F7D952228A841240C6E50209F6A2B50B80D089D3C

3AF56E8

8C2BA490AEBD00BF1B84A74F4503A28038DAFE3178F78430E9D3F4EE09C29E7D4A444C8

F48FCCB

A30E27C8BCB8FD7951241E348EA01F6D0F1320B4A4148000541C180230AD428248F5398

5D51498

D8D743E77903EB0A35218FFAE1A3310AFD38237B45485D6FA0CC41213D596EFC7856DED

3A0631C

F317E8DB4902431E31B4E9E114550F175D2FA20EF37E0C0C739464214BD062041D299DE

54F5814

EEB9F051FAF1CB24F0910C47FBB25D4DBD8CC0480CED1523DFCC654C8B5DB0D6A3D71FF

C0A6CFB

333FFA49CE911C3F73F50FCBB0922332093304C38FC70EA74B1DD0E7CFBCD1014631C31

5A0B66E

37FFFFF6803883651FF706020CF80DF0CFAE208A0C02C9C222D1A8700AC00BD273D41FC

75FDE33

E45FDB81E9F6ECFC2E0CE'H ,

hist {

replaces {

date

std {

year 1993 ,

month 6 ,

day 11 } ,

ids {

gi 4525 } } } } ,

annot {

{

data

ftable {

{

data

pub {

muid 92017853 ,

article {

title {

name "CDC68, a yeast gene that affects

regulation of

cell proliferation and transcription, encodes a protein with a highly

acidic

carboxyl terminus." } ,

authors {

names

std {

{

name

name {

last "Rowley" ,

initials "A." } } ,

{

name

name {

last "Singer" ,

initials "R.A." } } ,

{

name

name {

last "Johnston" ,

initials "G.C." } } } ,

affil

str "Department of Microbiology, Dalhousie

University, Halifax, Nova Scotia, Canada." } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 403 of 421 18/04/2005

from

journal {

title {

iso-jta "Mol. Cell. Biol." ,

ml-jta "Mol Cell Biol" ,

issn "0270-7306" ,

name "Molecular and cellular biology." }

,

imp {

date

std {

year 1991 ,

month 11 } ,

volume "11" ,

issue "11" ,

pages "5718-5726" ,

language "eng" } } ,

ids {

pubmed 1833637 ,

medline 92017853 } } ,

pmid 1833637 } } ,

location

int {

from 1282 ,

to 2031 ,

id

gi 287914 } } ,

{

data

gene {

locus "spm2+" } ,

location

int {

from 398 ,

to 1645 ,

id

gi 287914 } } } } } } ,

seq {

id {

embl {

accession "CAA78503" ,

version 1 } ,

gi 287915 } ,

descr {

title "spm2+ [Saccharomyces cerevisiae]" ,

molinfo {

biomol peptide } } ,

inst {

repr raw ,

mol aa ,

length 415 ,

seq-data

ncbieaa

"MGTTTSHPAQKKQTTKKCRAPIMSDVREKPSNAQGCEPQEMDAVSKKVTELSLNKCSDS

QDAGQPSREGSITKKKSTLLLRDEDEPTMPKLSVMETAVDTDSGSSSTSDDEEGDIIAQTTEPKQDASPDD

DRSGHSS

PREEGQQQIRAKEASGGPSEIKSSLMVPVEIRWQQGGSKVYVTGSFTKWRKMIGLIPDSDNNGSFHVKLRL

LPGTHRF

RFIVDNELRVSDFLPTATDQMGNFVNYIEVRQPEKNPTNEKIRSKEADSMRPPTSDRSSIALQIGKDPDDF

GDGYTRF

HEDLSPRPPLEYTTDIPAVFTDPSVMERYYYTLDRQQSNTDTSWLTPPQLPPQLENVILNKYYATQDQFNE

NNSGALP

IPNHVVLNHLVTSSIKHNTLCVASIVRYKQKYVTQILYTPIESS" ,

hist {

replaces {

date

std {

year 1993 ,

month 6 ,

day 11 } ,

ids {

gi 4526 } } } } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 404 of 421 18/04/2005

annot {

{

data

ftable {

{

data

prot {

name {

"spm2+" } ,

activity {

"Wild-type version of SPM2,a dominant extragenic

suppressor of some temperature-sensitve mutations in RPO21 and PRP4."

} } ,

location

whole

gi 287915 } } } } } ,

seq {

id {

embl {

accession "CAA78504" ,

version 1 } ,

gi 4388554 } ,

descr {

title "CDC68 /SPT16 [Saccharomyces cerevisiae]" ,

annot {

prot {

data

molinfo {

biomol peptide ,

completeness partial } } ,

inst {

repr raw ,

mol aa ,

length 1 ,

seq-data

ncbieaa "M" } ,

{

data

ftable {

{

data

name {

"CDC68 /SPT16" } } ,

partial TRUE ,

location

whole

gi 4388554 } } } } } } ,

annot {

{

data

ftable {

{

cdregion {

frame one ,

code {

id 1 } } ,

product

whole

gi 287915 ,

location

int {

from 398 ,

to 1645 ,

id

gi 287914 } ,

dbxref {

{

db "SWISS-PROT" ,

tag

str "P34164" } } } ,

{

seqhound@blueprint.org Version 3.3

The SeqHound Manual 405 of 421 18/04/2005

cdregion {

frame one ,

code {

id 1 } } ,

partial TRUE ,

product

whole

gi 4388554 ,

location

int {

from 2029 ,

to 2031 ,

id

gi 287914 ,

fuzz-to

lim gt } ,

cit

pub {

muid 92017853 } } } } } }

seqhound@blueprint.org Version 3.3

The SeqHound Manual 406 of 421 18/04/2005

Example PDB record

Seq-entry ::= set {

class pdb-entry ,

descr {

pdb {

deposition

std {

year 1992 ,

month 4 ,

day 3 } ,

class "Isomerase(Intramolecular Oxidoreductse)" ,

compound {

"D-Xylose Isomerase (E.C.5.3.1.5) Mutant With Glu 186 Replaced

By Gln

(E186Q) Complex With Xylose And Mn" } ,

source {

"(Actinoplanes missouriensis) E186Q Mutant Gene Expressed In

(Escherichia coli)" } ,

exp-method "X-Ray Diffraction" } ,

comment "Revision History:~JUL 15 93 Initial Entry" ,

pub {

sub {

authors {

names

std {

{

name

name {

last "Janin" ,

full "J.Janin" ,

initials "J." } } } } ,

date

std {

year 1992 ,

month 4 ,

day 3 } } } } ,

pub {

muid 92304915 ,

article {

title {

name "Protein engineering of xylose (glucose) isomerase

from

title {

Actinoplanes missouriensis. 1. Crystallography and site-directed

mutagenesis

of metal binding sites." } ,

authors {

names

str {

"J.Jenkins" ,

"J.Janin" ,

"F.Rey" ,

"M.Chiadmi" ,

"H.van Tilbeurgh" ,

"I.Lasters" ,

"M.De Maeyer" ,

"D.Van Belle" ,

"S.J.Wodak" ,

"M.Lauwereys" ,

"P.Stanssens" ,

"N.T.Mrabet" ,

"J.Snauwaert" ,

"G.Matthyssens" ,

"A.-M.Lambeir" } } ,

from

journal {

seqhound@blueprint.org Version 3.3

The SeqHound Manual 407 of 421 18/04/2005

iso-jta "Biochemistry" ,

ml-jta "Biochemistry" ,

issn "0006-2960" ,

name "Biochemistry." } ,

year 1992 ,

issue "24" ,

pubmed 1610791 ,

article {

"N.T.Mrabet" ,

"H.van Tilbeurgh" ,

year 1992 ,

pubmed 1610792 ,

imp {

date

std {

month 6 ,

day 23 } ,

volume "31" ,

pages "5449-5458" ,

language "eng" } } ,

ids {

medline 92304915 } } ,

pmid 1610791 } } ,

pub {

muid 92304916 ,

title {

name "Protein engineering of xylose (glucose) isomerase

from

Actinoplanes missouriensis. 2. Site-directed mutagenesis of the xylose

binding site." } ,

authors {

names

str {

"A.-M.Lambeir" ,

"M.Lauwereys" ,

"P.Stanssens" ,

"J.Snauwaert" ,

"G.Matthyssens" ,

"I.Lasters" ,

"M.De Maeyer" ,

"S.J.Wodak" ,

"J.Jenkins" ,

"M.Chiadmi" ,

"J.Janin" } } ,

from

journal {

title {

iso-jta "Biochemistry" ,

ml-jta "Biochemistry" ,

issn "0006-2960" ,

name "Biochemistry." } ,

imp {

date

std {

month 6 ,

day 23 } ,

volume "31" ,

issue "24" ,

pages "5459-5466" ,

language "eng" } } ,

ids {

medline 92304916 } } ,

pmid 1610792 } } ,

pub {

muid 92304917 ,

article {

title {

name "Protein engineering of xylose (glucose) isomerase

from

Actinoplanes missouriensis. 3. Changing metal specificity and the pH

seqhound@blueprint.org Version 3.3

The SeqHound Manual 408 of 421 18/04/2005

profile

by site-directed mutagenesis." } ,

authors {

names

std {

{

name

name {

last "van Tilbeurgh" ,

initials "H." } } ,

{

name

name {

last "Jenkins" ,

initials "J." } } ,

{

name

name {

last "Chiadmi" ,

initials "M." } } ,

{

name

name {

last "Janin" ,

initials "J." } } ,

{

name

name {

last "Wodak" ,

initials "S.J." } } ,

{

name

name {

last "Mrabet" ,

initials "N.T." } } ,

{

name

name {

last "Lambeir" ,

initials "A.M." } } } ,

affil

str "Plant Genetic Systems N.V., Gent, Belgium." } ,

from

name "Biochemistry." } ,

journal {

title {

iso-jta "Biochemistry" ,

ml-jta "Biochemistry" ,

issn "0006-2960" ,

imp {

date

std {

year 1992 ,

month 6 ,

day 23 } ,

volume "31" ,

issue "24" ,

pages "5467-5471" ,

language "eng" } } ,

ids {

pubmed 1610793 ,

medline 92304917 } } ,

pmid 1610793 } } ,

pub {

muid 92172844 ,

article {

title {

name "Arginine residues as stabilizing elements in

proteins." } ,

authors {

names

seqhound@blueprint.org Version 3.3

The SeqHound Manual 409 of 421 18/04/2005

str {

"N.T.Mrabet" ,

"A.Van Den Broek" ,

"I.Van Den Brande" ,

"P.Stanssens" ,

"Y.Laroche" ,

"A.-M.Lambeir" ,

"G.Matthijssens" ,

"J.Jenkins" ,

"M.Chiadmi" ,

"H.van Tilbeurgh" ,

"F.Rey" ,

"J.Janin" ,

"W.J.Quax" ,

"I.Lasters" ,

"M.De Maeyer" ,

"S.J.Wodak" } } ,

from

journal {

title {

iso-jta "Biochemistry" ,

ml-jta "Biochemistry" ,

issn "0006-2960" ,

name "Biochemistry." } ,

imp {

date

std {

year 1992 ,

month 3 ,

day 3 } ,

volume "31" ,

issue "8" ,

pages "2239-2253" ,

language "eng" } } ,

ids {

pubmed 1540579 ,

medline 92172844 } } ,

pmid 1540579 } } ,

pub {

muid 89184498 ,

article {

title {

name "Structural analysis of the 2.8 A model of Xylose

isomerase

from Actinoplanes missouriensis." } ,

authors {

names

std {

{

name

name {

last "Rey" ,

initials "F." } } ,

{

name

name {

last "Jenkins" ,

initials "J." } } ,

{

name

name {

last "Janin" ,

initials "J." } } ,

{

name

name {

last "Lasters" ,

initials "I." } } ,

{

name

name {

seqhound@blueprint.org Version 3.3

The SeqHound Manual 410 of 421 18/04/2005

last "Alard" ,

initials "P." } } ,

{

name

name {

last "Claessens" ,

initials "M." } } ,

{

name

name {

last "Matthyssens" ,

initials "G." } } ,

{

name

name {

last "Wodak" ,

initials "S." } } } ,

affil

str "Laboratoire de Biologie Physicochimique, Universite

Paris

Sud, Orsay, France." } ,

from

journal {

title {

iso-jta "Proteins" ,

ml-jta "Proteins" ,

issn "0887-3585" ,

name "Proteins." } ,

imp {

date

std {

year 1988 } ,

volume "4" ,

issue "3" ,

pages "165-172" ,

language "eng" } } ,

ids {

pubmed 3237716 ,

medline 89184498 } } ,

pmid 3237716 } } } ,

seq-set {

seq {

id {

pdb {

mol "9XIM" ,<===================================ACCDB/name

chain 65 ,<====================================ACCDB/chain

rel

std {

year 1992 , ,<===========================ACCDB/release

month 4 ,

day 3 } } ,

gi 443580 } ,

descr {

// record truncated

seqhound@blueprint.org Version 3.3

The SeqHound Manual 411 of 421 18/04/2005

Example Biostruc

An example of the ASN biostruc. Some data has been removed for

the sake of brevity.

Biostruc ::= {

id {

mmdb-id 2 } ,

descr {

name "101D" ,

pdb-comment "remark 3: Refinement." ,

pdb-comment "remark 3: Program Nuclsq" ,

pdb-comment "remark 3: Authors Westhof,Dumas,Moras" ,

pdb-comment "remark 3: R Value 0.163" ,

pdb-comment "remark 3: Free R Value 0.252" ,

pdb-comment "remark 3: Number Of Reflections 2430" ,

pdb-comment "remark 3: Resolution Range 8.0 - 2.25 Angstroms" ,

pdb-comment "remark 3: Data Cutoff 2.0 Sigma(F)" ,

pdb-comment "remark 3: Number Of Protein Atoms

0" ,

pdb-comment "remark 3: Number Of Nucleic Acid Atoms

488" ,

pdb-comment "remark 3: Number Of Solvent Atoms

33" ,

pdb-comment "remark 3: Rms Deviations From Ideal Values (The Values Of" ,

pdb-comment "remark 3: Sigma, In Parentheses, Are The Input Estimated" ,

pdb-comment "remark 3: Standard Deviations That Determine The Relative" ,

pdb-comment "remark 3: Weights Of The Corresponding Restraints)" ,

pdb-comment "remark 3: Distance Restraints (Angstroms)" ,

pdb-comment "remark 3: Sugar-Base Bond Distance

0.024(0.030)" ,

pdb-comment "remark 3: Sugar-Base Bond Angle Distance

0.040(0.040)" ,

pdb-comment "remark 3: Phosphate Bond Distance

0.026(0.040)" ,

pdb-comment "remark 3: Phosphate Bond Angle Distance, H-Bond

0.057(0.050)" ,

pdb-comment "remark 3: Plane Restraint (Angstroms)

0.014(0.020)" ,

pdb-comment "remark 3: Chiral-Center Restraint (Angstroms3)

0.161(0.150)" ,

pdb-comment "remark 3: Non-Bonded Contact Restraints (Angstroms)" ,

pdb-comment "remark 3: Single Torsion Contact

0.093(0.100)" ,

pdb-comment "remark 3: Multiple Torsion Contact

0.097(0.100)" ,

pdb-comment "remark 3: Isotropic Thermal Factor Restraints (Angstroms2)" ,

pdb-comment "remark 3: Sugar-Base Bond

4.282(6.000)" ,

pdb-comment "remark 3: Sugar-Base Angle

4.990(6.000)" ,

pdb-comment "remark 3: Phosphate Bond

5.693(6.000)" ,

pdb-comment "remark 3: Phosphate Bond Angle, H-Bond

5.227(6.000)" ,

pdb-comment "remark 105: Nomenclature For Atoms Of The DeoxyriboseRIBOSE

pdb-comment "remark 106: The Hydrogen Bonds Between Base Pairs In This

pdb-comment "remark 101: Residue +c A 9 Has Br Bonded To C5." ,

pdb-comment "remark 101: Residue +c B 21 Has Br Bonded To C5." ,

pdb-comment "remark 105: The Protein Data Bank Has Adopted The Saccharide

Chemists" ,

MOIETY" ,

pdb-comment "remark 105: Rather Than That Of The Nucleoside Chemists. The

Ring" ,

pdb-comment "remark 105: Oxygen Atom Is Labelled O4 Instead Of O1." ,

Entry Follow" ,

pdb-comment "remark 106: The Conventional Watson-Crick Hydrogen Bonding

Pattern." ,

pdb-comment "remark 106: They Have Not Been Presented On Conect Records In

This" ,

pdb-comment "remark 106: Entry." ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 412 of 421 18/04/2005

history {

data-source {

name-of-database "Protein Data Bank" ,

version-of-database

release-date

std {

year 1995 ,

month 2 ,

day 28 } ,

database-entry-id

other-database {

db "PDB" ,

tag

str "101D" } ,

database-entry-date

std {

year 1994 ,

month 12 ,

day 14 } } } ,

attribution

sub {

authors {

names

std {

{

name

name {

last "Goodsell" ,

full "D.S.Goodsell" ,

initials "D.S." } } ,

{

name

name {

last "Kopka" ,

full "M.L.Kopka" ,

initials "M.L." } } ,

{

name

name {

last "Dickerson" ,

full "R.E.Dickerson" ,

initials "R.E." } } } } ,

imp {

date

std {

year 1994 ,

month 12 ,

day 14 } } } ,

attribution

gen {

cit "To Be Published" ,

authors {

names

std {

{

name

name {

last "Goodsell" ,

full "D.S.Goodsell" ,

initials "D.S." } } ,

{

name

name {

last "Kopka" ,

full "M.L.Kopka" ,

initials "M.L." } } ,

{

name

name {

last "Dickerson" ,

full "R.E.Dickerson" ,

initials "R.E." } } } } ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 413 of 421 18/04/2005

title "Refinement Of Netropsin Bound To Dna: Bias And Feedback In

Electron Density Map Interpretation" } ,

attribution

equiv {

muid 85264810 ,

article {

title {

name "Binding of an antitumor drug to DNA, Netropsin and

C-G-C-G-A-A-T-T-BrC-G-C-G." } ,

authors {

names

std {

{

name

name {

last "Kopka" ,

initials "M.L." } } ,

{

name

name {

last "Yoon" ,

initials "C." } } ,

{

name

name {

last "Goodsell" ,

initials "D." } } ,

{

name

name {

last "Pjura" ,

initials "P." } } ,

{

name

name {

last "Dickerson" ,

initials "R.E." } } } } ,

from

journal {

title {

iso-jta "J. Mol. Biol." ,

ml-jta "J Mol Biol" ,

issn "0022-2836" ,

jta "J6V" } ,

imp {

date

std {

year 1985 ,

month 6 ,

day 25 } ,

volume "183" ,

issue "4" ,

pages "553-563" } } } } } ,

chemical-graph {

descr {

name "Dna (5'-D(CpGpCpGpApApTpTp(Br)cpGpCpG)-3') Complexed With

Netropsin, Re-Refinement" ,

pdb-class "Deoxyribonucleic Acid" ,

pdb-source "Synthetic" ,

assembly-type other } ,

molecule-graphs {

{

id 1 ,

descr {

name "A" ,

pdb-comment "SEQRES" ,

molecule-type dna ,

organism {

org {

taxname "synthetic construct" ,

db {

{

seqhound@blueprint.org Version 3.3

The SeqHound Manual 414 of 421 18/04/2005

db "taxon" ,

tag

id 32630 } } ,

orgname {

name

partial {

{

fixed-level other ,

level "species" ,

name "synthetic construct" } } ,

lineage "artificial sequence" ,

gcode 11 ,

div "SYN" } } } } ,

seq-id

gi 996094 ,

residue-sequence {

{

id 1 ,

name " 1 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 66 } } ,

{

id 5 ,

name " 5 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 61 } } ,

{

id 8 ,

name " 8 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 70 } } ,

{

id 9 ,

name " 9 " ,

residue-graph

local 1 } ,

{

id 10 ,

name " 10 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 67 } } ,

{

id 11 ,

name " 11 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

seqhound@blueprint.org Version 3.3

The SeqHound Manual 415 of 421 18/04/2005

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 64 } } ,

{

id 12 ,

name " 12 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 68 } } } ,

inter-residue-bonds {

{

atom-id-1 {

molecule-id 1 ,

residue-id 11 ,

atom-id 9 } ,

atom-id-2 {

molecule-id 1 ,

residue-id 12 ,

atom-id 1 } } } } ,

{

id 2 ,

descr {

name "B" ,

pdb-comment "SEQRES" ,

molecule-type dna ,

organism {

org {

taxname "synthetic construct" ,

db {

{

db "taxon" ,

tag

id 32630 } } ,

orgname {

name

partial {

{

fixed-level other ,

level "species" ,

name "synthetic construct" } } ,

lineage "artificial sequence" ,

gcode 11 ,

div "SYN" } } } } ,

seq-id

gi 996095 ,

residue-sequence {

{

id 9 ,

name " 21 " ,

residue-graph

local 1 } ,

{

id 10 ,

name " 22 " ,

residue-graph

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

residue-graph-id 67 } } ,

{

id 12 ,

name " 24 " ,

residue-graph

seqhound@blueprint.org Version 3.3

The SeqHound Manual 416 of 421 18/04/2005

standard {

biostruc-residue-graph-set-id

other-database {

db "Standard residue dictionary" ,

tag

id 1 } ,

inter-molecule-bonds {

residue-graph-id 68 } } } ,

inter-residue-bonds {

{

atom-id-1 {

molecule-id 2 ,

residue-id 1 ,

atom-id 6 } ,

atom-id-2 {

molecule-id 2 ,

residue-id 2 ,

atom-id 1 } } ,

{

atom-id-1 {

molecule-id 2 ,

residue-id 11 ,

atom-id 9 } ,

atom-id-2 {

molecule-id 2 ,

residue-id 12 ,

atom-id 1 } } } } ,

{

id 3 ,

descr {

name "1" ,

molecule-type other-nonpolymer } ,

residue-sequence {

{

id 1 ,

name " 9 " ,

residue-graph

local 2 } } } ,

{

id 4 ,

descr {

name "2" ,

molecule-type other-nonpolymer } ,

residue-sequence {

{

id 1 ,

name " 21 " ,

residue-graph

local 2 } } } ,

{

id 38 ,

descr {

name "3" ,

molecule-type solvent } ,

residue-sequence {

{

id 1 ,

name " 58 " ,

residue-graph

local 5 } } } ,

{

id 39 ,

descr {

name "3" ,

molecule-type solvent } ,

residue-sequence {

{

id 1 ,

name " 59 " ,

residue-graph

local 5 } } } } ,

{

seqhound@blueprint.org Version 3.3

The SeqHound Manual 417 of 421 18/04/2005

atom-id-1 {

molecule-id 1 ,

residue-id 9 ,

atom-id 18 } ,

atom-id-2 {

molecule-id 3 ,

residue-id 1 ,

atom-id 1 } } ,

{

atom-id-1 {

molecule-id 2 ,

residue-id 9 ,

atom-id 18 } ,

atom-id-2 {

molecule-id 4 ,

residue-id 1 ,

atom-id 1 } } } ,

residue-graphs {

{

id 1 ,

descr {

name " +C DNA" ,

pdb-comment "" } ,

residue-type deoxyribonucleotide ,

iupac-code {

"N" } ,

atoms {

{

id 1 ,

name " P " ,

iupac-code {

" P " } ,

element p } ,

{

id 18 ,

name " C5 " ,

iupac-code {

" C5 " } ,

element c } ,

{

id 19 ,

name " C6 " ,

iupac-code {

" C6 " } ,

element c } } ,

bonds {

{

atom-id-1 16 ,

atom-id-2 17 ,

bond-order unknown } ,

{

atom-id-1 16 ,

atom-id-2 18 ,

bond-order unknown } } } ,

{

id 2 ,

descr {

name " BR" ,

pdb-comment "Bromine" } ,

residue-type other ,

iupac-code {

"X" } ,

atoms {

{

id 1 ,

name "BR " ,

iupac-code {

"BR " } ,

element br } } ,

bonds {

} } ,

{

seqhound@blueprint.org Version 3.3

The SeqHound Manual 418 of 421 18/04/2005

id 3 ,

descr {

name " NT" ,

pdb-comment "Netropsin" } ,

residue-type other ,

iupac-code {

"X" } ,

atoms {

{

id 1 ,

name " C1 " ,

iupac-code {

" C1 " } ,

element c } ,

{

id 5 ,

name " C2 " ,

iupac-code {

" C2 " } ,

element c } ,

{

id 31 ,

name " N10" ,

iupac-code {

" N10" } ,

element n } } ,

bonds {

{

atom-id-1 1 ,

atom-id-2 2 ,

bond-order unknown } ,

{

atom-id-1 1 ,

atom-id-2 3 ,

bond-order unknown } ,

{

atom-id-1 12 ,

atom-id-2 14 ,

bond-order unknown } ,

{

atom-id-1 29 ,

atom-id-2 31 ,

bond-order unknown } } } ,

{

id 4 ,

descr {

name "MO3" ,

pdb-comment "Magnesium Ion, 3 Waters Coordinated" } ,

residue-type other ,

iupac-code {

"X" } ,

atoms {

{

id 1 ,

name "MG " ,

iupac-code {

"MG " } ,

element mg } ,

{

id 2 ,

name " O1 " ,

iupac-code {

" O1 " } ,

element o } ,

{

id 3 ,

name " O2 " ,

iupac-code {

" O2 " } ,

element o } ,

{

id 4 ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 419 of 421 18/04/2005

name " O3 " ,

iupac-code {

" O3 " } ,

element o } } ,

bonds {

{

atom-id-1 1 ,

atom-id-2 2 ,

bond-order unknown } ,

{

atom-id-1 1 ,

atom-id-2 3 ,

bond-order unknown } ,

{

atom-id-1 1 ,

atom-id-2 4 ,

bond-order unknown } } } ,

{

id 5 ,

descr {

name "HOH" ,

pdb-comment "" } ,

residue-type other ,

iupac-code {

"X" } ,

atoms {

{

id 1 ,

name " O " ,

iupac-code {

" O " } ,

element o } } ,

bonds {

} } } } ,

model {

{

id 3 ,

type pdb-model ,

descr {

name "Model 1 from PDB entry 101D" ,

pdb-reso "Resolution: 2.25" ,

pdb-method "X-Ray Diffraction" ,

pdb-comment "FEB 27 95 Initial Entry" } ,

model-space {

coordinate-units angstroms ,

thermal-factor-units b } ,

model-coordinates {

{

id 1 ,

coordinates

literal

atomic {

number-of-points 556 ,

atoms {

number-of-ptrs 556 ,

molecule-ids {

1 ,

2 ,

3 ,

4 ,

5 ,

6 ,

seqhound@blueprint.org Version 3.3

The SeqHound Manual 420 of 421 18/04/2005

6 ,

7 ,

8 ,

9 ,

10 ,

11 ,

12 ,

13 ,

14 ,

15 ,

16 ,

17 ,

18 ,

19 ,

20 ,

12 ,

1 ,

1 } } ,

sites {

scale-factor 1000 ,

x {

18598 ,

19853 ,

20375 ,

16812 } ,

y {

34469 ,

34632 ,

33233 ,

30694 } ,

z {

24672 ,

22605 ,

8518 ,

-2033 ,

26343 } } ,

temperature-factors

isotropic {

scale-factor 1000 ,

b {

23239 ,

24930 } } } } } } } }

seqhound@blueprint.org Version 3.3

The SeqHound Manual 421 of 421 18/04/2005

seqhound@blueprint.org Version 3.3

GO background material

Please see: http://www.geneontology.org/doc/GO.doc.html}

The_SeqHound_Admin_Manual The Seq Hound Admin Manual

Navigation menu

Versions of this User Manual:

Views

Navigation