The_SeqHound_Admin_Manual The Seq Hound Admin Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 421

Download
Open PDF In Browser	View PDF

The SeqHound Manual
Part II: Sections 4-7
For Administrators and Developers
Release 3.3
(April 20th, 2005)

Authors
Ian Donaldson, Katerina Michalickova, Hao Lieu, Renan Cavero, Michel Dumontier,
Doron Betel, Ruth Isserlin, Marc Dumontier, Michael Matan, Rong Yao, Zhe Wang,
Victor Gu, Elizabeth Burgess, Kai Zheng, Rachel Farrall
Edited by
Rachel Farrall and Ian Donaldson

The SeqHound Manual

2 of 421

18/04/2005

Table of Contents
About this manual............................................................................................................ 7
Conventions ..................................................................................................................... 8
How to contact us. ........................................................................................................... 8
Who is SeqHound?........................................................................................................... 9
4. Setting up SeqHound locally. ....................................................................................... 10
4.1 Overview.................................................................................................................. 10
4.2 SeqHound system requirements............................................................................... 11
OS and hardware architecture .................................................................................... 11
Memory (RAM) ......................................................................................................... 11
Hard Disk ................................................................................................................... 12
Source code and executables .................................................................................. 12
Database.................................................................................................................. 12
Other Software ........................................................................................................... 12
Compiling SeqHound Code yourself. ........................................................................ 13
ODBC compliant database engines............................................................................ 13
Library dependencies ................................................................................................. 13
4.3 Obtaining precompiled SeqHound executables....................................................... 14
4.3.1 Obtaining SeqHound Source Code...................................................................... 16
4.4 Compiling SeqHound executables on Solaris.......................................................... 18
4.5 Building the SeqHound system on Solaris............................................................... 26
Catch up on SeqHound daily updates ........................................................................ 45
Setting up daily sequence updates.............................................................................. 47
Setting up SeqHound servers. Overview................................................................... 53
Trouble-shooting notes............................................................................................... 57
Error logs ................................................................................................................ 57
Recompiling SeqHound .......................................................................................... 57
Restarting the Apache server .................................................................................. 57
Other useful links.................................................................................................... 58
Parser schedule........................................................................................................ 58
MySQL errors ......................................................................................................... 58
5. Description of the SeqHound parsers and data tables by module................................. 59
What are modules? ........................................................................................................ 59
How to use this section. ................................................................................................. 59
Parser descriptions........................................................................................................ 59
Table descriptions.......................................................................................................... 60
An overview of the SeqHound data table structure ....................................................... 63
Parsers and resource files needed to build and update modules of SeqHound. ........... 64
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

3 of 421

18/04/2005

core module ................................................................................................................ 66
mother parser .......................................................................................................... 66
update parser ........................................................................................................... 71
postcomgen parser .................................................................................................. 72
asndb table .............................................................................................................. 75
parti table ................................................................................................................ 78
nucprot table............................................................................................................ 80
accdb table .............................................................................................................. 82
histdb table .............................................................................................................. 88
pubseq table ............................................................................................................ 91
taxgi table................................................................................................................ 94
sengi table ............................................................................................................... 97
sendb table .............................................................................................................. 99
chrom table............................................................................................................ 101
gichromid table ..................................................................................................... 105
contigchromid table .............................................................................................. 107
gichromosome table .............................................................................................. 109
contigchromosome table ....................................................................................... 111
Redundant protein sequences (redundb) module ..................................................... 113
redund parser......................................................................................................... 113
redund table........................................................................................................... 115
Complete genomes tracking (gendb) module........................................................... 119
Taxonomy hierarchy (taxdb) module....................................................................... 120
importtaxdb parser ................................................................................................ 120
taxdb table............................................................................................................. 122
gcodedb table ........................................................................................................ 127
divdb table............................................................................................................. 132
del table................................................................................................................. 135
merge table............................................................................................................ 137
Structural databases (strucdb) module ..................................................................... 139
cbmmdb parser...................................................................................................... 139
vastblst parser........................................................................................................ 144
pdbrep parser......................................................................................................... 146
mmdb table............................................................................................................ 148
mmgi table ............................................................................................................ 154
domdb table........................................................................................................... 156
Protein sequence neighbours (neighdb) module ...................................................... 162
Installing nblast:.................................................................................................... 162
Configuration of nblast environment:................................................................... 163
Running NBLAST ................................................................................................ 164
NBLAST Update Procedure ................................................................................. 166
nbraccess program* .............................................................................................. 168
BLASTDB table................................................................................................... 169
NBLASTDB table................................................................................................. 172
Locus link functional annotations (lldb) module ..................................................... 177
llparser................................................................................................................... 177

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

4 of 421

18/04/2005

addgoid parser....................................................................................................... 179
ll_omim table ........................................................................................................ 181
ll_go table.............................................................................................................. 183
ll_llink table .......................................................................................................... 186
ll_cdd table............................................................................................................ 188
GENE module .......................................................................................................... 191
parse_gene_files.pl parser..................................................................................... 191
gene_dbxref table.................................................................................................. 193
gene_genomicgi table ........................................................................................... 195
gene_history table ................................................................................................. 198
gene_info table...................................................................................................... 201
gene_object table .................................................................................................. 204
gene_productgi table............................................................................................. 206
gene_pubmed table ............................................................................................... 208
gene_synonyms table ............................................................................................ 210
Gene Ontology hierarchy (godb) module................................................................. 212
goparser................................................................................................................. 212
go_parent table...................................................................................................... 214
go_name table ....................................................................................................... 216
go_reference table................................................................................................. 219
go_synonym table ................................................................................................. 221
Gene Ontology Association (GOA) module ............................................................ 223
Table summarizing input files, parsers and command line parameters for GOA
module................................................................................................................... 225
Gene Ontology Module Diagram.......................................................................... 228
goa_seq_dbxref table ............................................................................................ 230
goa_association table ............................................................................................ 234
goa_reference table ............................................................................................... 237
goa_with table....................................................................................................... 239
goa_xdb table ........................................................................................................ 242
goa_gigo table....................................................................................................... 245
dbxref module .......................................................................................................... 248
Who Cross-references who? ................................................................................. 249
Explanation of the data table structure: ................................................................ 249
How to update the DBXref and GO Annotation modules using a cluster. .............. 256
Understanding the dbxref.ini file ............................................................................. 257
Table summarizing input files, parsers and command line parameters for dbxref
module................................................................................................................... 262
dbxref table ........................................................................................................... 265
dbxrefsourcedb table............................................................................................. 268
Contents of dbxrefsourcedb table ......................................................................... 270
RPS-BLAST domains (rpsdb) module..................................................................... 272
domname parser .................................................................................................... 272
Rpsdb parser.......................................................................................................... 273
domname table ...................................................................................................... 274
rpsdb table............................................................................................................. 278

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

5 of 421

18/04/2005

Molecular Interaction (MI) module.......................................................................... 285
MI-BIND parser.................................................................................................... 285
MI_source table .................................................................................................... 289
MI_ints table ......................................................................................................... 291
MI_objects table.................................................................................................... 292
MI_obj_dbases table ............................................................................................. 294
MI_mol_types table .............................................................................................. 295
MI_dbases table .................................................................................................... 296
MI_record_types table .......................................................................................... 297
MI_complexes table.............................................................................................. 298
MI_complex2ints table ......................................................................................... 299
MI_complex2subunits table.................................................................................. 300
MI_complex2subunits table.................................................................................. 301
MI_refs table......................................................................................................... 302
MI_refs_db table................................................................................................... 304
MI_exp_methods table.......................................................................................... 305
MI_obj_labels table .............................................................................................. 306
Text mining module ................................................................................................. 307
mother parser ........................................................................................................ 307
text searcher parser ............................................................................................... 308
yeastnameparser.pl parser ..................................................................................... 312
text_bioentity table................................................................................................ 314
text_bioname table ................................................................................................ 317
text_secondrefs table............................................................................................. 321
text_bioentitytype table......................................................................................... 324
text_fieldtype table................................................................................................ 325
text_nametype table .............................................................................................. 326
text_rules table ...................................................................................................... 327
text_db table.......................................................................................................... 328
text_doc table ........................................................................................................ 329
text_docscore table................................................................................................ 331
text_evidencescore table ....................................................................................... 336
text_method table.................................................................................................. 338
text_point table...................................................................................................... 341
text_pointscore table ............................................................................................. 342
text_result table..................................................................................................... 344
text_resultscore table ............................................................................................ 346
text_search table.................................................................................................... 348
text_searchscore table ........................................................................................... 351
text_rng table ........................................................................................................ 353
text_rngresult table................................................................................................ 355
text_doctax table ................................................................................................... 357
text_organism table............................................................................................... 359
text_englishdict table ............................................................................................ 361
text_bncorpus table ............................................................................................... 363
text_pattern table................................................................................................... 365

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

6 of 421

18/04/2005

text_stopword table............................................................................................... 367
6. Developing for SeqHound. ......................................................................................... 369
Open source development............................................................................................ 369
Code organization. ...................................................................................................... 370
Adding/Modifying a remote API function to SeqHound.............................................. 373
Overall architecture of the SeqHound system.......................................................... 374
Adding a new module to SeqHound............................................................................. 380
Database layer .......................................................................................................... 381
Parser layer............................................................................................................... 382
Local API layer (Query layer).................................................................................. 383
CGI layer .................................................................................................................. 383
Remote API layer ..................................................................................................... 384
7. Appendices.................................................................................................................. 387
Example GenBank record ........................................................................................ 388
Example SwissProt record ....................................................................................... 393
Example EMBL record ............................................................................................ 400
Example PDB record................................................................................................ 406
Example Biostruc ..................................................................................................... 411
GO background material .......................................................................................... 421
* not available at time of writing

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

7 of 421

18/04/2005

About this manual.
This manual contains everything that has been documented about SeqHound. It is
distributed in two Parts (Part I: For Users and Part II: For Administrators and
Developers).
If you can’t find the answer here then please contact us. This manual was written and
reviewed by the persons listed under “Who is SeqHound”. Any errors should be reported
to seqhound@blueprint.org.
You can find out more about the general architecture of SeqHound by reading the
SeqHound paper that is freely available from BioMed Central. This paper is included in
the supplementary material distributed with this manual. See:
Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW.
SeqHound: biological sequence and structure database as a platform for bioinformatics
research. BMC Bioinformatics. 2002 Oct 25;3(1):32.
PMID: 12401134
The SeqHound Manual (Part I: Sections 1-3) For Users.
Section1 and Section 2 is a one page description that tells you what to read first to get
started depending on what kind of user you are.
Section 3 is of interest to programmers who want to use the remote API to access
information in the SeqHound database maintained by the Blueprint Initiative.
The SeqHound Manual (Part II: Sections 4-7) For Administrators and Developers
Section 4 is of interest to programmers and system administrators who want to set up
SeqHound themselves so they can use the local API.
Section 5 is an in-depth description of everything that’s in the SeqHound database and
how it gets there (table by table). This section will be of interest to all users.
Section 6 describes how programmers can add to SeqHound. This section also describes
our internal development process at Blueprint.
Section 7 includes Appendices of background and reference material.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

8 of 421

18/04/2005

Conventions
The following section describes the conventions used in this manual.
Italic
is used for filenames, file extensions, URLs, and email addresses.
Constant Width
is used for code examples, function names and system output.
Constant Bold
is used in examples for user input.
Constant Italic
is used in examples to show variables for which a context-specific substitution should be
made.

How to contact us.
General enquiries or comments can be posted to the SeqHound usergroup mailing list
seqhound.usergroup@blueprint.org. You may also subscribe to this list to receive
regular updates about SeqHound developments by going to
http://lists.blueprint.org/mailman/listinfo/seqhound.usergroup .
Private enquiries, bug reports from external users, questions about SeqHound or errors
found in this manual may be sent to seqhound@blueprint.org.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

9 of 421

18/04/2005

Who is SeqHound?
Chronologically ordered according to when the person first started work on SeqHound.
Chris Hogue
Katerina Michalickova
Gary Bader
Ian Donaldson
Ruth Isserlin
Michel Dumontier
Hao Lieu
Marc Dumontier
Doron Betel
Renan Cavero
Ivy Lu
Rong Yao
Volodya Grytsan
Zhe Wang
Victor Gu
Rachel Farrall
Michael Matan
Elizabeth Burgess
Kai Zheng

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

10 of 421

18/04/2005

4. Setting up SeqHound locally.
4.1 Overview.
This section describes how one can set up the SeqHound system on your own hardware
using freely available SeqHound executables. These executables will allow you to build
and update the SeqHound database as well as run a web-interface and a remote API
server.
Section 4.2 should be reviewed first for system requirements before attempting to install
the SeqHound system.
Section 4.3 tells you how to download executables from the SeqHound ftp site for your
platform and operating system. SeqHound code may also be downloaded from this site.
Section 4.4 describes how SeqHound code may be compiled on your own hardware using
the freely available code available on the SeqHound ftp site. This step is only required if
SeqHound executables are not available for your platform or if you want to make use of
the local programming API. If you obtain SeqHound executables from the ftp site and
want to build your local SeqHound database, you still need to go through Steps 8, 9, 10,
11 and 13 in this section which describe how to install the MySQL server and ODBC
driver.
Section 4.5 contains detailed instructions for using the executables to build the SeqHound
data tables and for setting up the SeqHound web-interface and remote API server.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

11 of 421

18/04/2005

4.2 SeqHound system requirements.
Before attempting to set up SeqHound yourself, you should review the system
requirements listed below. The SeqHound system is able to run on a number of operating
systems (we recommend and can best support a UNIX operating system like Sun Solaris
or Red Hat Linux). Setting up SeqHound will require approximately 700 GB of disk
space (see below).
Questions about system requirements, compilation, setup and maintenance can be
addressed to seqhound@blueprint.org. We will do our best to address all inquiries but
resources may not allow us to solve all problems arising on all possible set ups.
OS and hardware architecture
SeqHound code is compiled on the following platforms based on release version code.
Blueprint production SeqHound is compiled and run on Sun-Fire-880 - Sun Solaris
(version 9). We have also compiled and tested SeqHound on the Fedora Core 2.0 and
the MacOS X operating systems.
Release versions of SeqHound executables are available for.
x86 architecture
Sun-Fire-880
PowerPC architecture

(Fedora Core 2.0)
Sun Solaris (version 9)
MacOS X

We have also successfully built executables on the following platforms.
x86 architecture
FreeBSD
x86 architecture
QNX
x86 architecture
Windows NT
PowerPC architecture
PPC Linux
SGI
Irix 6
Alpha architecture
Compaq Alpha OS
HPPA 2.0 architecture
HPUX 11.0
HPPA 1.1 architecture
PA-RISC Linux
Memory (RAM)
We recommend a minimum of 1 GB of RAM to run the SeqHound executables.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

12 of 421

18/04/2005

Hard Disk
Source code and executables
Component
SeqHound Source and compiled
NCBI Toolkit
NCBI C++ Toolkit
bzip2 Library
slri lib
slri lib_cxx
Source code and executables (total)
Database

Image Size
220.0 MB
560.0 MB
12GB
4.5 MB
7.3MB
9.4 MB
13GB approx

Component
data tables
data tables backup

Image Size
300 GB
300 GB
700 GB*
Database (total)
*700GB includes 300 GB for a single copy of the SeqHound data tables. The SeqHound
system includes a second copy of the data tables used for back up and updating. We
suggest a minimum of 700 GB for SeqHound installation. This allows for yearly growth
of the data tables as well as for a RAID5 disk configuration.
We are using the MySQL database storage engine InnoDB, which provides transaction
support and automatic recovery in the event of database server outage. There is no need
to keep a separate instance of the database when the InnoDB storage engine is used. To
prevent deadlock during data insertion and update, you should not run SeqHound parsers
in parallel against the InnoDB database server. As a result, it takes up to three extra days
for the initial build of SeqHound database using the InnoDB storage engine. If you wish
to use the MyISM storage engine, you can run parallel parsers to speed up the initial
build of SeqHound. However, you will need to keep a separate database instance for
database update and backup as the storage engine MyISM does not support transaction
and automatic recovery.
Other Software
Apache
Webserver(version 1.3)
Apache Jakarta Tocat
JSP/Servlet Container
(version 4.1)
Perl (version 5.8.3)

seqhound@blueprint.org

See http://www.apache.org/ for software installation for you
platform.
See http://jakarta.apache.org/tomcat/ for software installation for
you platform.
See http://www.cpan.org/ for installation for your platform.
Requiredmodules include Net/FTP.pm, sun4-solaris-64/DBI.pm

Version 3.3

The SeqHound Manual

13 of 421

18/04/2005

Compiling SeqHound Code yourself.
It is not necessary to compile SeqHound executables yourself; the system may be set up
using the executables provided on the ftp site for selected Operating Systems. However,
if you wish to make use of the local API then you must compile SeqHound yourself.

ODBC compliant database engines
Blueprint uses the ODBC compliant MySQL database engine. We are using version
4.1.10 in production; this version supports nested SQL queries and internationalization.
We have not tested SeqHound on other ODBC compliant RDBMS such as Oracle, DB2
and PostgreSQL.

Library dependencies
Library
Source
NCBI Toolkit
from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
NCBI C++ Toolkit (optional*)
from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
bzip2 Library
from http://sourceforge.net/projects/slritools/
slri lib
from http://sourceforge.net/projects/slritools/
slri lib_cxx (optional*)
from http://sourceforge.net/projects/slritools/
* This library is only required if you plan to use the SeqHound remote C++ API.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

14 of 421

18/04/2005

4.3 Obtaining precompiled SeqHound executables.
It is not necessary to compile SeqHound executables yourself; the system may be set up
using the precompiled executables provided on the ftp site for selected Operating
Systems. If you choose to compile the executables yourself, skip to step 4.3.1.
You will require about 220 MB of disk space to store the SeqHound compiled
executables. These instructions assume you are logged in as user “seqhound” on a UNIX
system running the bash shell and you have perl installed on your system.
1. Decide the location to install the SeqHound binary executables. For example, if you
want to install in the directory /home/seqhound/execs, do the following:
mkdir execs
cd execs
2. Download the SeqHound installation utility script installseqhound.pl from the FTP
site: ftp.blueprint.org
ftp ftp.blueprint.org
When prompted for a name enter
anonymous
When prompted for a password type your email address:
myemail@home.com
cd pub/SeqHound/script
get installseqhound.pl
Close the ftp session by typing:
bye
3. Run the perl script to download and install SeqHound executables. The perl script
will download SeqHound binary executables based on the specified platform (linux or
solaris), unpack the tar ball, modify the configurations files .odbc.ini and .intrezrc (for
ODBC database access) and deploy the configuration files. It requires two
commandline arguments: platform (linux or solaris) and installation path (e.g.
/home/seqhound/execs). Enter the path to the ODBC driver (e.g.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

15 of 421

18/04/2005

/usr/lib/libmyodbc3.so, please refer to step 10 in section 4.4 for ODBC driver path),
database server name, port number, user id, password and database instance name
when prompted by the perl script.
./installseqhound.pl [linux OR solaris] [/home/seqhound/execs]

Upon successful execution of the perl script, you should see the following directories
in the directory execs:
build
config
example
include
lib
sql
test
updates
www
The configuration file .odbc.ini can be found in the home
directory (e.g. /home/seqhound).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

16 of 421

18/04/2005

4.3.1 Obtaining SeqHound Source Code.
Follow the instructions below to download SeqHound source code . If you downloaded
and unpacked the executables, you can skip section 4.3.1 and 4.4 and continue with
section 4.5.
1. In your home directory, make a new directory where you will store the new
SeqHound code.
mkdir compile
Move into this directory and set an environment variable called COMPILE to point to
this directory.
cd compile
export COMPILE=`pwd`
(where (`) is a single back-quote)
2. Download the perl utility seqhoundsrcdownload from the SeqHound ftp site

Note: We no longer support SeqHound download from the Sourceforge
FTP site. Please download SeqHound from
ftp://ftp.blueprint.org/pub/SeqHound/

From the compile directory, type:
ftp ftp.blueprint.org
When prompted for a name enter
anonymous
When prompted for a password type your email address:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

17 of 421

18/04/2005

myemail@home.com
cd pub/SeqHound/script
get seqhoundsrcdownload.pl
Close the ftp session by typing:
bye
3. Download SeqHound source code by running the perl script seqhoundsrcdownload.pl.
The script will download the source code tar file and unpack the tar file into two
directories slri and bzip2. You will also see a release note file Release_notes_x.x.txt
in the same directory compile.
./seqhoundsrcdownload.pl
4. Set the SLRI environment variable
Move to the slri directory and set the environment variable “SLRI” to point to this
directory.
cd $COMPILE/slri
export SLRI=`pwd`

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

18 of 421

18/04/2005

4.4 Compiling SeqHound executables on Solaris
These instructions describe how to compile SeqHound running on the Solaris platform.
They may be used as a guide for compiling SeqHound code on other platforms.
Instructions are similar for Linux and differences are noted.
Using these instructions
These instructions assume that:
You have downloaded the SeqHound code from the ftp server and you have set
environment variables called COMPILE and SLRI. See section 4.3.1
You are using the bash shell.
Note: On Linux platforms, to compile SeqHound libs with ODBC support you also need
unixODBC-devel package which contains the sql.h + other libs/headers required to
compile SeqHound libs with ODBC support. This is not needed to run SeqHound, just to
compile it.
These instructions were tested on a Sun-Fire-880 architecture running a Sun Solaris OS
(version 9). The system information for the test-box (results of a “uname –a” call)
were:
SunOS machine_name 5.9 Generic_117171-15 sun4u sparc
SUNW,Sun-Fire-880
1.

Download the NCBI toolkit
SeqHound is dependent on code in the NCBI toolkit
Move to the compile directory and ftp to the NCBI ftp site:
cd $COMPILE
ftp ftp.ncbi.nlm.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com
cd toolbox/CURRENT
Make a note of the FAQ.html and the readme.htm files.
Change your transfer type to binary and get the zipped directory called ncbi.tar.gz
bin
get ncbi.tar.gz
Close the ftp session by typing:
bye

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

19 of 421

18/04/2005

Uncompress the toolkit.
gunzip ncbi.tar.gz
tar xvf ncbi.tar
2. Edit the platform make file.
Go to the platform directory and locate the file with a “.mk” extension that applies to
your platform. For 64-bit Solaris system the file is “solaris64.ncbi.mk” and in Linux
the file is linux-x86.ncbi.mk.
cd $COMPILE/ncbi
cd platform
In Linux linux-x86.ncbi.mk replace the line /home/coremake/ncbi with
${NCBI}
Use the following line (a Perl command) to replace the string in the Solaris file
/netopt/ncbi_tools/ncbi64/ncbi with the string ${NCBI}
in the solaris64.ncbi.mk file:
perl -p -i.bak -e 's|/netopt/ncbi_tools/
ncbi64/ncbi|\${NCBI}|g' solaris64.ncbi.mk
so for instance, the line
NCBI_INCDIR = /netopt/ncbi_tools/ncbi64/ncbi/include
Will become:
NCBI_INCDIR = ${NCBI}/include
You could also edit this file in hand using a text editor if you don’t have Perl
installed.
Copy the file up one level to the ncbi directory and rename it “ncbi.mk”
cp solaris64.ncbi.mk ../ncbi.mk
3. Set environment variables in preparation for the toolkit build.
Move back to the ncbi directory and set the environment variable NCBI to point to
that directory
cd $COMPILE/ncbi
export NCBI=`pwd`
check this by typing
echo $NCBI
the value shown will replace ${NCBI} in the “solaris64.ncbi.mk” file that you
modified in the above step when the make file is run.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

20 of 421

18/04/2005

Note: The make file in the NCBI toolkit will use the C compiler from Sun
instead of the compiler gcc. We do not recommend using gcc as it
generates seqhound parsers that lead to segmentation fault at run time.
Finally, paths to the compiler and the archive executable ar should be added to your
PATH variable:
export
PATH=/usr/local/bin:/opt/SUNWspro/prod/bin:/usr/ccs/bin:$
PATH
You can check all of your environment variables by typing
set | sort
At this point, the relevant environment variables should be something like this:
COMPILE=/export/home/your_user_name/compile
NCBI=/export/home/your_user_name/compile/ncbi
OSTYPE=solaris2.9
PATH=/opt/SUNWspro/prod/bin:/usr/local/bin:/usr/ccs/bin:/
usr/bin:/usr/ucb:/etc:.
If you want, you can read the readme file in the make directory.
cd make
more readme.unx

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

21 of 421

18/04/2005

Note: For the Solaris UNIX OS only, the SeqHound API functions
SHoundGetGenBankff and SHoundGetGenBankffList breaks
due to a bug in the NCBI library file ncbistr.c (in directory ncbi/corelib
and ncbi/build). To fix the problem, replace all the code inside the
function Nlm_TrimSpacesAroundString() in the file ncbistr.c
with the following text
char *ptr, *dst, *revPtr;
int spaceCounter = 0;
ptr = dst = revPtr = str;
if ( !str || str[0] == '\0' )
return str;
while ( *revPtr != '\0' )
if ( *revPtr++ <= ' ' )
spaceCounter++;
if ( (revPtr - str) <= spaceCounter )
{
*str = '\0';
return str;
}
while ( revPtr > str && *revPtr <= ' ' )
revPtr--;
while ( ptr < revPtr && *ptr <= ' ' ) ptr++;
while ( ptr <= revPtr ) *dst++ = *ptr++;
*dst = '\0';
return str;

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

22 of 421

18/04/2005

4. Build the NCBI toolkit
Move back up to the compile directory and run the make command.
cd $COMPILE
./ncbi/make/makedis.csh |& tee out.makedis.txt
Note: to build Solaris 64 bit binaries add the following to the command
line:
SOLARIS_MODE=64 ./ncbi/make/makedis.csh
This runs a c-shell script to make the toolkit and tees the output to the screen and a
log file “out.makedis.txt”. It is safe to ignore the multiple error messages that you
may see.
At the end of a successful build you will see
*********************************************************
*The new binaries are located in ./ncbi/build/ directory*
*********************************************************
The ncbi.tar file can be removed from the “compile” directory after the successful build
process has been completed.
5. Make the bzip2 library
The bzip2 code was downloaded as part of the seqhound code in step 4.3.1 above.
Move to the bzip2 directory and run the make file.
cd $COMPILE/bzip2
make –f make.bzlib
6. Set the BZDIR environment variable.
cd $COMPILE/bzip2
export BZDIR=`pwd`
7. In your home directory, add the following environment parameters to the appropriate
configuration file such as .bashrc or .bash_profile. Text in italics should be changed
to the correct path on your machine that points to directory having DBI.pm:
export NCBI=$COMPILE/ncbi
export BZDIR=$COMPILE/bzip2
export SLRI=$COMPILE/slri
export VIBLIBS="-L/usr/X11R6/lib -lXm -lXpm -lXmu -lXp lXt -X11 -lXext"
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

23 of 421

18/04/2005

export
PERL5LIB=/usr/local/lib/perl5/site_perl/5.8.3/sun4solaris-64
8. Install MySQL server and create database “seqhound”.
SeqHound is built and tested in MySQL version 4.1.10. You can download MySQL
from http://dev.mysql.com/downloads/mysql/4.1.html and follow the manual at
http://dev.mysql.com/doc/mysql/en/index.html to install MySQL on your server. The
data directory where the MySQL server points to should have 700 GB for a full
SeqHound database. After MySQL is installed, you need to log into MySQL and
create database “seqhound”:
create database seqhound;
Note that ";" must be used at the end of all MySQL statements.
9. Install ODBC driver:
Note that for Linux platforms, the unixODBC package needs to be
installed prior to the ODBC driver otherwise the following error will
occur:
error: Failed dependencies:
libodbcinst.so.1 is needed by MyODBC-3.51.09-1

a) Go to web site: http://dev.mysql.com/doc/connector/odbc/en/faq_2.html
b) Find and download RPM distribution of ODBC driver MyODBC-3.51.071.i586.rpm.
c) As user "root", install the driver.
For first time installation
rpm -ivh MyODBC-3.51.01.i386-1.rpm
For upgrade
rpm -Uvh MyODBC-3.51.01.i386-1.rpm
d) The library file libmyodbc3. will be installed in directory /usr/lib or
/usr/local/lib.
10. Set up the configuration file for ODBC driver.
Create a configuration file called .odbc.ini in your home directory with the following
content:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

24 of 421

18/04/2005

Edit the file called .intrezrc in directory slri/seqhound/config/.
header must not be used for other sections
[mysqlsh]
Description = MySQL ODBC 3.51 Driver DSN
Trace
= Off
TraceFile
= stderr
your library path
Driver
= /usr/lib/libmyodbc3.so
DSN
= mysqlsh
same as the header name
SERVER
= my_server
PORT
= my_port
USER
= my_id
PASSWORD
= my_pwd
DATABASE
= seqhound
database name
Text in italics should be changed. Text /usr in the value of variable Driver
should be changed to the path where unixodbc resides. Text my_server should be
changed to the IP address or the server name of the MySQL server. Text my_port
should be changed to port number of the MySQL instance. Text my_id and my_pwd
should be replaced by your user id and password to the MySQL database.
Note that the values for the headers such as DSN, USER, PASSWORD and
DATABASE must be less than 9 characters.
11. Set up ODBC related variables:
export ODBC=path_to_unixodbc
Where path_to_unixodbc should be replaced by the path of the UnixODBC
driver on your machine.
In your home directory, add parameter “LD_LIBRARY_PATH” to the appropriate
configuration file such as .bashrc or .bash_profile:
export LD_LIBRARY_PATH =
/usr/local/unixodbc/lib:/usr/local/unixodbc/odbc/lib:/usr
/local/mysql/lib/mysql:/usr/local/mysql/lib/mysql/lib
The value of variable “LD_LIBRARY_PATH” should have all the paths that have the
library files libodbc*, libmyodbc*, and libmysqlclient*
12. Build the SeqHound executables
Move to the compile directory and list all the files in the directory:
cd $COMPILE
ls
You should see:
> ls
bzip2
ncbi
slri
out.makedis.txt

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

25 of 421

18/04/2005

Before proceeding you should check your environment variables
set | sort
to ensure that correct paths have been specified for each of the following variables:
NCBI
SLRI
ODBC
BZDIR
Compile the SLRI libraries using the following commands:
cd $SLRI/lib
make -f make.slrilib
make -f make.slrilib odbc
The above commands will build the SLRI libraries needed by SeqHound.
The make files which you are about to invoke call on these variables therefore the
paths must be correct. Move to the make directory for SeqHound and run the makeall
script. The script requires two command line arguments. The first parameter indicates
what database backend is to be used for the build (currently the only valid target is
odbc). The second parameter indicates what SeqHound programs are to be made (a
choice of all, cgi, domains, examples, genomes, go,
locuslink, parsers,scripts, taxon, updates). The output of the
build script will be captured in the text file out.makeseqhound.txt.
cd $SLRI/seqhound
./makeallsh odbc all 2>&1 | tee out.makeseqhound.txt
It is safe to ignore the multiple warning messages that you may see.
After this has finished running, move to the directory slri/seqhound/build/odbc/
where you will find the executables for SeqHound.
cd build/odbc
ls -1
You will see
>ls –1
addgoid
cbmmdb
chrom
clustmask
clustmasklist
comgen
fastadom
gen2fasta
gen2struc
goparser
goquery

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

26 of 421

18/04/2005

histparser
importtaxdb
isshoundon
llgoa
llparser
llquery
mother
pdbrep
precompute
redund
seqrem
sh_nbhrs
shunittest_odbc_local
shunittest_odbc_rem
shtest
update
vastblst
wwwseekgi
13. Set up the SQL files that create tables.
cd $SLRI/seqhound/sql
In each of files core.sql, redund.sql, ll.sql, taxdb.sql, gendb.sql,
strucdb.sql, cddb.sql, godb.sql, rps.sql, nbr.sql, there is a line close to
the beginning of each file:
#use testsql;
This line should be changed to
use seqhound;

4.5 Building the SeqHound system on Solaris
Using these instructions
These instructions show how the SeqHound executables may be used to build the
SeqHound system under a Solaris 8 OS. These instructions may also be used as a guide
for setting up SeqHound under other operating systems. These instructions assume that:
• You have downloaded the latest release version of the SeqHound code (see step
4.3.3)
•

You have successfully installed MySQL

•

You have successfully compiled the SeqHound code yourself (section 4.4)
OR
you have downloaded the SeqHound executables for your platform and operating
system (section 4.3.4).

•

You have set environment variables called COMPILE and SLRI (see steps 4.3.1 and
4.3.6).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

27 of 421

18/04/2005

•

You have a default install of an Apache server running. See http://www.apache.org/
for freely available software and instructions for your platform.

•

You have installed Perl. See http://www.cpan.org/ for freely available software and
installation instructions.

•

You have at least 300 MB space available in a directory where you can check out
code and compile it.

•

You have at least 600 GB available for the SeqHound executables and data tables.
See section 4.2.

These instructions were tested on a Sun Ultra machine running the Sun-Solaris 8 OS. The
system information for the test-box (results of a “uname –a” call) were:
SunOS machine_name 5.8 Generic_108528-01 sun4u sparc
SUNW,Ultra-4
These instructions assume that you are using the c shell. Syntax may differ for some
commands in other shells.
Note: These instructions begin with ‘step 14’.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

28 of 421

18/04/2005

14. Prepare to build the SeqHound database.
Create a new directory where you will set up SeqHound.
mkdir seqhound
Set the environment variable SEQH to point to this directory.
cd seqhound
setenv SEQH `pwd`
Move to this directory and create new directories
cd seqhound
mkdir 1.core.files
mkdir 2.redund.files
mkdir 3.taxdb.files
mkdir 4.godb.files
mkdir 5.lldb.files
mkdir 6.comgenome.files
mkdir 7.mmdb.files
mkdir 8.hist.files
mkdir 9.neighbours.files
mkdir 10.rpsdb.files
mkdir precompute
The numbered directories will hold parsers and files required for the build of the
SeqHound data tables. Directory “precompute” will hold the precomputed data of the
database.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

29 of 421

18/04/2005

Move to each of the numbered directories and copy all of the scripts and executables
required for the build.
cd $SEQH/1.core.files
cp $SLRI/seqhound/sql/core.sql .
cp $SLRI/seqhound/scripts/asnftp.pl .
cp $SLRI/seqhound/scripts/seqhound_build.sh .
cp $SLRI/seqhound/build/odbc/mother .
cp $SLRI/seqhound/build/odbc/update .
cp $SLRI/seqhound/config/.intrezrc .
cd
cp
cp
cp

$SEQH/2.redund.files
$SLRI/seqhound/sql/redund.sql .
$SLRI/seqhound/scripts/nrftp.pl .
$SLRI/seqhound/build/odbc/redund .

cd
cp
cp
cp

$SEQH/3.taxdb.files
$SLRI/seqhound/sql/taxdb.sql .
$SLRI/seqhound/scripts/taxftp.pl .
$SLRI/seqhound/build/odbc/importtaxdb .

cd
cp
cp
cp

$SEQH/4.godb.files
$SLRI/seqhound/sql/godb.sql .
$SLRI/seqhound/scripts/goftp.pl .
$SLRI/seqhound/build/odbc/goparser .

cd
cp
cp
cp
cp

$SEQH/5.lldb.files
$SLRI/seqhound/sql/ll.sql .
$SLRI/seqhound/scripts/llftp.pl .
$SLRI/seqhound/build/odbc/llparser .
$SLRI/seqhound/build/odbc/addgoid .

cd
cp
cp
cp
cp
cp
cp
cp
cp

$SEQH/6.comgenomes.files
$SLRI/seqhound/sql/gendb.sql .
$SLRI/seqhound/scripts/genftp.pl .
$SLRI/seqhound/scripts/humoasn.pl .
$SLRI/seqhound/scripts/humouse_build.sh .
$SLRI/seqhound/scripts/comgencron_odbc.pl .
$SLRI/seqhound/scripts/shconfig.pm .
$SLRI/seqhound/genomes/gen_cxx .
$SLRI/seqhound/genomes/pregen.pl .

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

30 of 421

cp
cp
cp
cp
cp

$SLRI/seqhound/genomes/gen.pl .
$SLRI/seqhound/genomes/ncbi.bacteria.pl .
$SLRI/seqhound/build/odbc/chrom .
$SLRI/seqhound/build/odbc/comgen .
$SLRI/seqhound/build/odbc/mother .

cd
cp
cp
cp
cp
cp

$SEQH/7.mmdb.files
$SLRI/seqhound/sql/strucdb.sql .
$SLRI/seqhound/scripts/mmdbftp.pl .
$SLRI/seqhound/config/.mmdbrc .
$SLRI/seqhound/config/.ncbirc .
$SLRI/seqhound/build/odbc/cbmmdb .

18/04/2005

cd $SEQH/8.hist.files
cp $SLRI/seqhound/build/odbc/histparser .
Open the .intrezrc file with a text editor like pico and edit.
cd $SEQH/1.core.files
pico .intrezrc
An example .intrezrc file follows. Lines preceded by a semi-colon are comments that
explain what the settings are used for and their possible values.
Text in italics must be changed for the .intrezrc file to function correctly with
your SeqHound set-up. Variables username, password, dsn, database in
section [datab] should have the same values as USER, PASSWORD, DSN and
DATABASE respectively in the .odbc.ini file you set up in Step 10 in section 4.4. For
variable path and indexfile in section [precompute], replace the text in
italics with the absolute path of directory “precompute” you just created.
Warning: This file may have wrapped lines. Take care when editing this
file that you do not break any of the lines (i.e. introduce any unwanted
carriage returns).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

31 of 421

18/04/2005

-------------------------------example .intrezrc begins-------------------------------[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[config]
;the executable the cgi runs off of.
CGI=wwwseekgi
[precompute]
;precomputed taxonomy queries
MaxQueries = 100
MaxQueryTime = 10
QueryCount = 50
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index
[sections]
;indicated what modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy
godb = 1
;locus link functional annotations
lldb = 1
;taxonomy hierarchy
taxdb = 1
;protein sequence neighbours
neigdb = 1
;structural databases
strucdb = 1
;complete genomes tracking
gendb = 1
;redundant protein sequences
redundb = 1
;open reading frame database
;currently not exported to outside users of SeqHound
cddb = 0
;RPS-BLAST domains
rpsdb = 1
;DBXref Database Cross_Reference
dbxref = 0
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./
mail=user\@host.org
defaultrelease=141
pathflags=./
-------------------------------example .intrezrc ends----------------------------------

This file should be copied to other directories used during the build process:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

32 of 421

18/04/2005

cp .intrezrc $SEQH/2.redund.files/.
cp .intrezrc $SEQH/3.taxdb.files/.
cp .intrezrc $SEQH/4.godb.files/.
cp .intrezrc $SEQH/5.lldb.files/.
cp .intrezrc $SEQH/6.comgenome.files/.
cp .intrezrc $SEQH/7.mmdb.files/.
cp .intrezrc $SEQH/8.hist.files/.
cp .intrezrc $SEQH/9.neighbours.files/.
cp .intrezrc $SEQH/10.rpsdb.files/.
15. Build the core module of SeqHound.
Building the core module (basically all of the sequence data tables) is not optional.
The rest of the modules are optional if there is a need to spare resources or
administrative efforts but the corresponding API functionality will not be present.
cd $SEQH/1.core.files
Create the core tables in the database
Make sure file core.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < core.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates core tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi,
bioentity, bioname, secondrefs, bioentitytype, nametype, rules, fieldtype and histdb.
If you are building a full-instance of the SeqHound database then run the asnftp.pl
script while in the build directory:
./asnftp.pl
Note that any command in these instructions can be run as a ‘nohup’ to
prevent the process from ending if your connection to the machine should
be lost. For example:
nohup ./asnftp.pl &

If you only want to build a small test version of the database then manually download
a single file. For example:
ftp ftp.ncbi.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com
cd refseq/cumulative
bin
get rscu.bna.Z (do not uncompress this file)
bye

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

33 of 421

18/04/2005

The asnftp.pl script downloads all of the GenBank sequence records (in binary ASN.1
format) required to make an initial build of the SeqHound core module. This script
will take approximately 24 hours to run and will consume 14 GB of disk space.
Note that all scripts are described in detail in section 5.
Two other files are generated by this script:
asn.list is a list of the sequence files that the script intends to download.
asnftp.log is where the script logs error messages during execution time.
If you open another session with the machine where you are building SeqHound, you
can check how far along asnftp.pl is by comparing the number of lines in the asn.list
file
grep “.aso.gz” asn.list | wc –l
to the number of lines in the build directory (number of files actually downloaded so
far)
ls *.aso.gz | wc -l
Once asnftp has finished, these two numbers should be the same.
Run the seqhound build script. Before running this script, make certain that the
.intrezrc file, in the same directory, and .odbc.ini, in your home directory, have
correct configuration values. (see steps 10 in section 4.4 and step 14 in the current
section). This parser MUST be given a single parameter that represents the release
version of GenBank. You can find the release number in the file:
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release.
./seqhound_build.sh 141
seqhound_build.sh executes the mother parser over all source files and populates
tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi, bioentity, bioname,
secondrefs, bioentitytype, nametype, rules and fieldtype. This will take about 75
hours. Table histdb is still empty at this stage. It is populated in Step 25.
Parser mother creates a log file for every *.aso file that it parses. These log files are
located in a subdirectory called “logs” and are named “rsnc0506run” where
“rsnc0506” is the name of the file that was being processed.
While seqhound_build.sh is running, you can move on to steps 16-18.
Once seqhound_build.sh has finished you can test that all of the files were properly
processed by showing that the results of
cd logs
grep “Done” | wc –l
is the same as
ls *run | wc –l
is the same as
cd ..
ls *aso.gz | wc -l

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

34 of 421

18/04/2005

The seqhound_build.sh script unzips .aso.gz files before feeding them as input to the
mother program. seqhound_build.sh then rezips the file after mother is done with it.
If for some reason, the build should crash part way through, you have to
a) recreate core tables using core.sql (see above) and
b) search for any unzipped (*.aso files) in the build directory and rezip them
c) restart seqhound_build.sh.
Once the seqhound_build.sh script has finished, you should move all of the *.aso.gz
files into a directory where they will be out of the way:
mkdir asofiles
mv *.aso.gz asofiles/.
16. Build the redundb module.
cd $SEQH/2.redund.files
Create table redund in the database.
Make sure file redund.sql has the line use seqhound close to the beginning of the
file.
mysql –u my_id –p –P my_port –h my_server < redund.sql
Where my_id”, “my_port” and “my_server” should be replaced by your userid
for the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates table redund in the database.
Run the nrftp.pl script to download the FASTA nr database of proteins
(ftp://ftp.ncbi.nlm.nih.gov/blast/db).
./nrftp.pl
nrftp.pl generates a log file “nrftp.log” that informs you what happened. If everything
went ok, the last two lines should read:
Getting nr.gz
closing connection
A new file should appear in the build directory called “nr.Z”. You will have to
unpack this file by typing:
gunzip nr.gz
Run the redund parser to make the redund table of identical protein sequences.
Before running this script, make certain that the .intrezrc file in the same directory
and .odbc.ini in your home directory have correct configuration values (see step 10 in
section 4.4 and step 14 in the current section).
./redund -i nr -n F
redund generates the log file “redundlog”. If everything went ok, the only line in this
file should be:
NOTE: [000.000] {redund.c, line 259} Done.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

35 of 421

18/04/2005

And about 3 millions records will be inserted into table redund.
17. Build the taxdb module
Create tables of the taxdb module in the database.
cd $SEQH/3.taxdb.files
Make sure file taxdb.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < taxdb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables taxdb, gcodedb, divdb, del, merge in the database.
Run the taxftp.pl script to download taxonomy info from the NCBI
(ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).
taxftp.pl
taxftp.pl generates a log file taxftp.log that informs you what happened. If everything
went ok, the last two lines should read:
Getting taxdump.tar.gz
closing connection
A new file should appear in the build directory called taxdump.tar.gz. You will have
to unpack this file by typing:
gzip –d taxdump.tar.gz
tar -xvf taxdump.tar
There will be seven new files:
delnodes.dmp
division.dmp
gc.prt
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
Run the importtaxdb parser to make the taxonomy data tables. Taxdump must be in
the same directory as this parser.
./importtaxdb
importtaxdb has no command line parameters. importtaxdb generates the log file
importtaxdb_log.txt. If everything went ok, the output of this file should be
something like:
Program start at Thu Sep 4 13:47:51 2003
Number of Tax ID records parsed: 191647
Number of Tax ID Name records parsed: 246263
Number of Division records parsed: 11
Number of Genetic Code records parsed: 18

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

36 of 421

18/04/2005

Number of Deleted Node records parsed: 25475
Number of Merged Node records parsed: 4607
Program end at Thu Aug 12 13:49:43 2004

And records will be inserted into tables taxdb, gcodedb, divdb, del and merge.
18. Build the GODB module
Create tables of the godb module in the database.
cd $SEQH/4.godb.files
Make sure file godb.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < godb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables go_parent, go_name, go_reference, go_synonym in the database.
Run the goftp.pl script to download the gene ontology files
(ftp://ftp.geneontology.org/pub/go/gene-associations and
ftp://ftp.geneontology.org/pub/go/ontology).
goftp.pl
There is a log file for this script called goftp.log that indicates that it got all of these
files. Three new files should appear in the build directory called
component.ontology
function.ontology
process.ontology
Two other files also appear called
gene_association.Compugen.GenBank.gz
gene_association.Compugen.UnitProt.gz
but these are used as input files by addgoid in the next step.
Run the goparser to make the hierarchical gene ontology data tables. The three input
files must be in the same directory as this parser.
./goparser
goparser has no command line parameters. goparser generates the log file
goparserlog. If everything went ok, the output of this file should have only one
NOTE line:
NOTE: [000.000] {goparser.c, line 101} Main: Done!
And records will be inserted into tables go_parent, go_name, go_reference,
go_synonym.
19. Build the LLDB module
Create tables of the locus link module in the database.
cd $SEQH/5.lldb.files
Make sure file ll.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < ll.sql
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

37 of 421

18/04/2005

Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables ll_omim, ll_go, ll_llink, ll_cdd in the database.
Run the llftp.pl script to download the locus link template file (LL_tmpl) which is the
source for function annotation tables
(ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz).
llftp.pl
This script generates the llftp.log file. If everything executes correctly, the last two
lines of the file should read:
Getting LL_tmpl.gz
closing connection
And a new file should appear in the build directory called LL_tmpl.gz which you will
have to unpack using the commands
gzip –d LL_tmpl.gz
Run the llparser to create the set of functional annotation data tables. The input file
must be in the same directory as this parser.
./llparser
llparser has no command line parameters. llparser generates the log file
“llparserlog”. At the time of writing, the output of this file will have thousands of
lines like:
NOTE: [000.000] {ll_cb.c, line 654} LL_AppendRecord: No
NP id. Record skipped.
(these lines are expected since many LocusLink records are not linked to specific
sequence records)
followed by the last line of the file:
NOTE: [000.000] {llparser.c, line 90} Main: Done!
Records will be inserted into tables ll_omim, ll_go, ll_llink and ll_cdd. Run the
addgoid parser to populate the go annotation table. This parser uses input files that
were downloaded in the previous step 13. Copy those files to this directory:
cp ../4.godb.files/gene_association.Compugen.GenBank.gz
./
cp ../4.godb.files/gene_association.Compugen.UniProt.gz
./
The files need to be unpacked.
gunzip gene_association.Compugen.GenBank.gz
gunzip gene_association.Compugen.UnitProt.gz
The input files must be in the same directory as addgoid
./addgoid –i gene_association.Compugen.GenBank
after this parser has finished, use it to parse the other input file
./addgoid –i gene_association.Compugen.UniProt
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

38 of 421

18/04/2005

At the time of writing, this second input file is not parsed since cross references
between Swissprot and GenBank ids are not available. This is being corrected by the
dbxref module project.
addgoid MUST BE EXECUTED AFTER ALL CORE TABLES AND
LLDB TABLES HAVE BEEN BUILT; the llparser makes the ll_go table
into which the addgoid script writes. This program is dependent on tables
asndb, parti, accdb and nucprot..
addgoid generates the log file addgoidlog. The output of this file will look like:
=========[ Sep 5, 2003 10:28
ERROR: [000.000] {addgoid.c,
ERROR: [000.000] {addgoid.c,
ERROR: [000.000] {addgoid.c,
ERROR: [000.000] {addgoid.c,

AM ]========================
line 235} No GI from 100K_RAT.
line 235} No GI from 100K_RAT.
line 235} No GI from 100K_RAT.
line 235} No GI from 100K_RAT.

This is normal. These errors are caused by the inability to find GI’s for names of
proteins/loci that are annotated in the GO input file. This problem is being addressed
by the dbxref module.dir
This program writes to the existing ll_go table that was generated by llparser.
20. Build the GENDB module
Change directories to the Complete Genomes directory (comgenomes).
cd $SEQH/6.comgenomes.files
Create tables of the GENDB module in the database.
Make sure file gendb.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < gendb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates table chrom in the database.
Building the GENDB module involves several steps. To simplify the process, a perl
script, comgencron_odbc.pl groups together all of the necessary scripts or binaries for
each individual step. These scripts and binaries must be present in this directory.
They are:
comgencron_odbc.pl
shconfig.pm
gen_cxx
pregen.pl
gen.pl
ncbi.bacteria.pl
genftp.pl
humoasn.pl
chrom

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

39 of 421

18/04/2005

iterateparti
humouse_build.sh
mother
comgen
Before building the GENDB module, the [crons] section in configuration file
.intrezrc should be set up properly. It should look like the following. Text in
italics must be changed. Variable mail should have the e-mail address where
you want the message to be sent to. Variable defaultrelease should have the
release number of the GenBank files you use to build the core tables of SeqHound
database (see Step 15):
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./genfiles/
mail=your_email_addr
defaultrelease=141
pathflags=./flag/

Make a subdirectory flag where the flag file comgen_complete.flg will be saved.
mkdir flag
Run the script to build the GENDB module:
./comgencron_odbc.pl
comgencron_odbc.pl generates flat file genff, log files bacteria.log, chromlog,
comgenlog, gen.log, iteratapartilog, a subdirectory genfiles and a lot of logs file with
postfix run which will be moved to a subdirectory logs. It also downloads many .asn
files which will be moved to subdirectory genfiles. During the process, temporary file
comff and directory asn are created. They are deleted before the end of the build
process. If the build process fails in the middle, they should be removed along with
file genff manually.
There are several lines printed on the screen during the build like:
mail = your_email_addr
pathupdates = ./
pathinputfilescomgen = ./genfiles/
defaultrelease = 141
pathflags = ./flag/
No source or subsource Plasmpdium falciparum NC_03043.
Update 1 chromosome type by hand.
It is OK to see above line.
An e-mail will be sent to the address you provide to inform if the process succeeds or
fails. If everything went ok, you will see the last line in file comgenlog as:
NOTE: [000.000] {comgen.c, line 504} Main: Done.
The last line in file iteratepartilog as:
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

40 of 421

18/04/2005

NOTE: [000.000] {iterateparti.c, line 170} Done.
The last line in file chromlog as:
NOTE: [000.000] {chrom.c, line 173} Done.
The last two lines in file bacteria.log as:
deleteing asn
See bacteria.results for changes to ./genff
The last two lines in file gen.log as:
Removing asn
Deleting comff
The following is a detailed explanation of the script comgencron_odbc.pl. You may skip
it.
21. Generate flat file genff.
genff is a tab-delimited text file where each line in this file represents one "DNA unit"
(chromosome, plasmid, extrachromosomal element etc.) belonging to a complete
genome.
Column
1
2
3
4
5

Description
Taxonomy identifier for the genome
Unique integer identifier for a given chromosome
Type of molecule (1 or chromosome, 8 for plasmid, …)
FTP file name for the genome without the .asn
extension)
Full name of the organism

Here is an example of several rows from genff:
305
258594
781
782
90370
90370
90370
209261

286
287
288
289
290
291
292
293

8
1
1
1
1
8
8
1

NC_003296
NC_005296
NC_003103
NC_000963
NC_003198
NC_003384
NC_003385
NC_004631

Ralstonia solanacearum plasmid pGMI1000MP
Rhodopseudomonas palustris CGA009 chromosome
Rickettsia conorii chromosome
Rickettsia prowazekii chromosome
Salmonella typhi chromosome
Salmonella typhi plasmid pHCM1
Salmonella typhi plasmid pHCM2
Salmonella typhi Ty2 chromosome

The genff flat file is generated in two steps.
a) gen.pl which will CREATE genff using the eukaryotic complete genomes.
b) ncbi.bacteria.pl which will UPDATE genff with bacteria complete genomes.
* both gen.pl and ncbi.bacteria.pl are dependent on pregen.pl so this must be in the
same directory as gen.pl and ncbi.bacteria.pl when you run it.
gen.pl will backup the current (if it exists) genff as genff.backup and then create a new
genff file. gen.pl will download asn files from NCBI’s ftp site and then extract the
relevant fields (as described above) and store them as records in genff.
The data of bacteria complete genome is written to genff by running ncbi.bacteria.pl.
This perl utility will compare the data in genff to the contents of the
/genomes/bacteria directory in NCBI’s ftp site and then automatically update genff.
ncbi.bacteria.pl will save the names of the bacteria that have been newly added to
genff in a separate file called bacteria.results. You can use this file to quickly verify
the results.
A sample output of bacteria.results.pl
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

41 of 421

18/04/2005

***********PERFECT MATCH***********
Aeropyrum pernix
***********SEMI MATCHED NCBI BACTERIA*************
NCBI BACTERIA
CHROMFF
---------------------------------------Buchnera aphidicola
Buchnera sp
Buchnera aphidicola Sg
Buchnera sp
***********UNMATCHED NCBI BACTERIA*************
Agrobacterium tumefaciens C58 Cereon
Agrobacterium tumefaciens C58 UWash

Perfectly matched bacteria are already present in genff. Semi matched bacteria means
that there is an organism that is closely related to a new organism. For the above
example, Buchnera aphidocola Sg and Buchnera aphidocola were newly released and
closely related to the Buchnera sp. The newly released data will have been added to
genff. Unmatched bacteria are completely new organism and will be added to genff.
Both gen.pl and ncbi.bacteria.pl will create an intermediate file called comff, and a
temporary directory asn. These are temporary and are critical to the functionality of
the perl scripts. Both gen.pl and ncbi.bacteria.pl will delete comff and asn after
execution.
While running gen.pl and ncbi.bacteria.pl you may see the following on the screen.
No source or subsource Plasmodium falciparum NC_03043.
Update 1 chromosome type by hand.
It means that for the specified organism, the asn file is missing the chromosome type.
In such a scenario, the chromosome type will default to 1 (chromosome
Once you have generated file genff, you will likely need to run it again periodically,
in case some of the data in genff has changed, for example if an organism taxid
changes, in which case it is crucial to rerun gen.pl.
Script genftp.pl downloads complete genome files from
ftp://ftp.ncbi.nih.gov/genomes/*.
A script called humoasn.pl must be in the same directory as genftp.pl since genftp.pl
calls the script.
humoasn.pl is a misnomer because the script actually processes files for
human, mouse AND rat genomes.
Each of these genomes has two files called rna.asn and protein.asn (these files are
called the same thing regardless of the organism that they refer to: the only way you
can tell which organism the file refers to is by looking at the directory name that it
came from or by looking at the contents. genftp.pl renames rna.asn and protein.asn
files to more specific names so they can be processed with the humoasn.pl script.
rna.asn and protein.asn files mostly contain XM’s and XP’s sequences: see for
example genomes/H_sapiens/protein. The sequences in these files are “loose”
bioseqs that have to be “stitched” together into bioseq sets by humoasn.pl. This
allows these sequences to be processed by the mother parser in the next step.
Many new *.asn files will appear in the comgenomes directory after this is run. There
is no log file for this script.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

42 of 421

18/04/2005

a) Populate table chrom
Binary chrom is used to populate table chrom from the list of complete genomes
found in genff. Chrom generates the log file “chromlog”. This log will look
something like:
============[ Sep 5, 2003
NOTE: [000.000] {chrom.c,
NOTE: [000.000] {chrom.c,
NOTE: [000.000] {chrom.c,
NOTE: [000.000] {chrom.c,
NOTE: [000.000] {chrom.c,
…
NOTE: [000.000] {chrom.c,

2:30 PM ]====================
line 130} Assigned TaxId 56636.
line 137} Assigned Kloodge 1.
line 144} Assigned Chromfl 1.
line 149} Assigned Access NC_000854
line 152} Assigned Name Aeropyrum pernix.
line 167} Done.

b) Delete all records from division gbchm from the tables of the core module.
This step is carried out for data integrity purpose. All the records that are inserted
into the core module tables are labeled as from division gbchm. Before they are
inserted, it needs to ensure no such record exists in the database. This is
accomplished using binary iterateparti. iterateparti takes the division name as one
parameter and deletes all GI’s that are part of that division from all of the tables in
the core module.
c) Set kloodge to 0 in table taxgi
This step is also carried out for data integrity purpose. The field “kloodge” in
table taxgi for all records should be set to 0 before they are updated in a later step
by binary comgen.
d) Move all Apis mellifera related files to a subdirectory.
The chromosome, rna and protein files of Apis mellifera are not processed at the
time of writing. They are moved to a subdirectory.
e) Add records to the core module tables.
Since the human, mouse and rat sequences from this source (the “Complete
Genomes” directory) are not a part of the GenBank release, the records are added
to the core module tables by script humouse_build.sh. 141
This script feeds all chromosome, rna and protein files downloaded by genftp.pl to
the mother parser. The mother parser makes a new division called “gbchm”
(GenBank Chromosome Human and Mouse) and touches all core module tables.
Log files will be created by mother for every chromosome file processed (called
*run).
f) Update field kloodge in table taxgi and field name in table accdb
Parser comgen is used to label sequences as belonging to a complete genome.
This program uses the files downloaded by genftp.pl and marks the complete
genomes in table taxgi. This program also adds loci names into table accdb (if
they are not present). comgen is dependent on the chrom table and writes to
accdb and taxgi. The comgen program has to be executed after all databases are
built.
Comgen writes to the log file comgenlog in the same directory where it is run.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

43 of 421

18/04/2005

22. Build the Strucdb module
Change to the mmdbdata directory.
cd $SEQH/7.mmdb.files
Create tables of the Strucdb module in the database.
Make sure file strucdb. sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < gendb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server repectively. You will be prompted to enter your password.
This creates tables mmdb, mmgi and domdb in the database.
Make certain that the configuration files have been properly set up. These include:
.mmdbrc, .ncbirc and .intrezrc.
In file .mmdbrc, variable “Gunzip” should have a value which is the path of
gunzip on the machine (change text in italics). File .mmdbrc looks like:
[MMDB]
;Database and Index required when local MMDB database is used
Database = ./
Index = mmdb.idx
Gunzip = /bin/gunzip
; [VAST]
;Database required for local VAST fetches.
; Database = .

In file .ncbirc, variable DATA should have a value which is the path of directory
ncbi/data on your machine. File .ncbirc looks like (change text in italics):
[NCBI]
ROOT=/
DATA=/my_home/compile/ncbi/data/

Copy file bstdt.val from the ncbi/data directory:
cp ~/compile/ncbi/data/bstdt.val ./
Run the mmdbftp.pl script to download the mmdb (Molecular Model Database)
ASN.1 files from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata. This will take
approximately 10 hours..
./mmdbftp.pl
This script writes to the mmdb.log file and records the files downloaded.
Approximately 20000 *.val.gz files will appear in the mmdbdata directory after
running this. Look at the first line in the mmdb.idx index file and this states the
number of files that should have been downloaded.
Run the cbmmdb parser to make the MMDB and MMGI datafiles. Use:
./cbmmdb –n F -m F
This program takes about 12 hours to run and writes errors to the cbmmdblog file.
After a typical run this file will contain:
============[ Nov 3, 2003 1:21 AM ]======================
ERROR: [004.001] {cbmmdb.c, line 125} Error opening MMDB id 22339

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

44 of 421

18/04/2005

WARNING: [011.001] {cbmmdb.c, line 240} Total elapsed time: 41857 seconds
NOTE: [000.000] {cbmmdb.c, line 245} Main: Done!

And records are inserted into tables mmdb and mmgi.
Run the vastblst parser to make the DOMDB datafile.
./vastblst –n F
This program writes errors to the vastblstlog file. After a typical run this file will
contain no messages and records are inserted into table domdb
In addition, vastblst makes a FASTA datafile of domains called mmdbdom.fas in the
directory where it is run.
Get the most recent nrpdb.* file from the NCBI ftp site in hand
(ftp://ftp.ncbi.nih.gov/mmdb/nrtable/nrpdb
Run the pdbrep parser to label representatives of nr chain sets in the domdb datatable.
This parser writes to the domdb table. Use:
uncompress nrpdb*.Z
pdbrep –i nrpdb.*
Where nrpdb.* is the name of the input file set. pdbrep will write errors to the
pdbreplog file in the same directory where it is run.
23. Build the Neighdb module
The sequence neighbours tables can be downloaded from
ftp://ftp.blueprint.org/pub/SeqHound/NBLAST/ as MySql database table files, as well
as mysqldump output, which should be adaptable to most SQL database systems. See
the readme on the ftp site for information on these files. To incorporate the mysql
database table files into your instance of seqhound, simply copy the files extracted
from the nblastdb and blastdb archives, downloaded from the ftp site, into your
seqhound database directory in your mysql instance. To incorporate the mysql dumps
of these tables into your seqhound instance, you need only pipe the contents of the
dump(which are SQL statements) to your database server. In the case of mysql,
simply execute:
gunzip -c
gunzip -c

seqhound.blastdb.SQLdump.YYYYMMDD.gz | mysql seqhound
seqhound.nblastdb.SQLdump.YYYYMMDD.gz | mysql seqhound

Be sure to fill in any required mysql options, such as username, hostname and
port number.
24. Build the Rpsdb and Domname modules
The pre-computed rps-blast table and the domname table can be downloaded from
ftp://ftp.blueprint.org/pub/SeqHound/RPS/ as MySQL database table files, as well as
mysqldump output, which should be adaptable to most SQL database systems. To
incorporate the mysql database table files into your instance of seqhound, simply
copy the files extracted from the rpsdb and domname archive, downloaded from the
ftp site, into your seqhound database directory in your mysql instance. To
incorporate the mysql dumps of these tables into your seqhound instance, you need
only pipe the contents of the dump(which are SQL statements) to your database
server. In the case of mysql, simply execute:
gunzip -c
gunzip -c

seqhound.rpsdb.SQLdump.YYYYMMDD.gz | mysql seqhound
seqhound.domname.SQLdump.YYYYMMDD.gz | mysql seqhound

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

45 of 421

18/04/2005

Be sure to fill in any required mysql options, such as username, hostname and port
number.
25. Build the histdb table.
cd $SEQH/8.hist.files
./histparser –n F
This parser populates table histdb. An entry will be generated for each of the
sequences that have valid accessions in table accdb that indicates that the sequence
was added on this day (when you ran histparser). This parser writes to the
histparserlog. This parser requires the accdb table and will take about 15 hours to
run.
26. You are done with the initial build of SeqHound.
If you did not build any of the optional modules, you will have to
remember this when setting up the .intrezrc configuration file for any
SeqHound application.
Set module values to zero if you did not build them. See the following section of the
.intrezrc configuration file.
example:
[sections]
;indicate what modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy (did you run goparser?)
godb = 1
;locus link functional annotations (did you run llparser and addgoid?)
lldb = 1
;taxonomy hierarchy (did you run importtaxdb?)
taxdb = 1
;protein sequence neighbours (did you download neighbours tables?)
neigdb = 1
;structural databases (did you run cbmmdb, vastblst and pdbrep?)
strucdb = 1
;complete genomes tracking (did you run chrom and comgen?)
gendb = 1
;redundant protein sequences (did you run redund?)
redundb = 1
;open reading frame database (currently not exported at all)
cddb = 0
;RPS-BLAST tables (did you download RPS-BLAST tables?)
rpsdb = 1

Catch up on SeqHound daily updates
27. Download all daily update files for genbank
Warning: There might have been a new GenBank release while you were
building SeqHound, in this case you cannot get updates from
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/ any more. You have to rebuild
SeqHound with a fresh GenBank release. You should check the file
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release to make certain that it
contains the same release number that was present when you started step
15.
cd $SEQH/
mkdir seqsync
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

46 of 421

18/04/2005

cd seqsync
ftp ftp.ncbi.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com cd ncbi-asn1
cd daily-nc
bin
prompt
mget nc*.aso.gz
bye
Do not download the con_nc*.aso.gz files from this directory. SeqHound does not
use them.
28. Download all daily update files for refseq.
From ftp://ftp.ncbi.nih.gov/refseq/daily/ download all files past the date stamp on
gbrscu.aso.gz. gbrscu.aso.gz is the latest cumulative RefSeq division which was
downloaded by asnftp.pl and is located (in this example) in seqhound/build/asofiles.
cd $SEQH/seqsync
ftp ftp.ncbi.nih.gov
enter anonymous and your email address when prompted
cd refseq
cd daily
bin
get rsnc.****.2003.bna.Z
(where **** are files with timestamps greater than gbrscu.aso.gz)
bye
You must uncompress all of these files and rezip them so they can be processed by
the mother parser.
compress –d *.Z
gzip *.bna
29. Run update and mother on all downloaded files (excluding today's one; crons will do
it in the evening).
You can use the scripts all_update.sh and all_update_rs.sh. You will also need
mother, update and a properly configured .intrezrc file in the same directory as all of
the daily update files.
cd $SEQH/seqsync
cp $COMPILE/slri/seqhound/scripts/all_update.sh .
cp $COMPILE/slri/seqhound/scripts/all_update_rs.sh .
cp $SEQH/1.core.files/.intrezrc .
cp $SEQH/1.core.files/mother .
cp $SEQH/1.core.files/update .
Run all_update.sh first
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

47 of 421

18/04/2005

./all_update.sh 141
where 141 is the release number.
Run all_update_rs.sh second.
./all_update_rs.sh 141
These scripts will run update and mother executables (consecutively) on all
downloaded files present in the current directory.
All daily updates in SeqHound are stored in one division called gbupd regardless
how long SeqHound runs without a core rebuild.
mother will make a log file called “*run” for every file that it processes
update will make two log files called “*gis” and “*log” for every file that it processes
You can check that the two parsers have completed successfully. Each of the
following queries should return the same number (the number of starting input files):
ls *aso.gz | wc –l
ls *gis | wc –l
ls nc*log | wc –l
ls nc*run | wc –l
grep Done nc*run |wc -l
Setting up daily sequence updates
30. Make a new directory from where you will run daily sequence updates.
Populate this with the necessary scripts and programs.
cd $SEQH
mkdir updates
cd updates
cp $SLRI/seqhound/scripts/*cron_odbc.pl
cp $SLRI/seqhound/scripts/shconfig.pm .
cp $SLRI/seqhound/build/odbc/redund .
cp $SLRI/seqhound/build/odbc/mother .
cp $SLRI/seqhound/build/odbc/update .
cp $SLRI/seqhound/build/odbc/precompute
cp $SLRI/seqhound/build/odbc/isshoundon
cp $SLRI/seqhound/build/odbc/importaxdb
cp $SLRI/seqhound/build/odbc/goparser .
cp $SLRI/seqhound/build/odbc/llparser .
cp $SLRI/seqhound/build/odbc/addgoid .
cp $SLRI/seqhound/build/odbc/comgen .

seqhound@blueprint.org

.
.
.

Version 3.3

The SeqHound Manual

48 of 421

18/04/2005

cp $SLRI/seqhound/build/odbc/chrom .
cp $SLRI/seqhound/scripts/genftp.pl .
cp $SLRI/seqhound/scripts/humoasn.pl .
cp $SLRI/seqhound/scripts/humouse_build.sh .
cp $SLRI/seqhound/genomes/gen_cxx .
cp $SLRI/seqhound/genomes/pregen.pl .
cp $SLRI/seqhound/genomes/gen.pl .
cp $SLRI/seqhound/genomes/ncbi.bacteria.pl .
mkdir logs
mkdir asofiles
mkdir inputfiles
mkdir genfiles
mkdir flags
31. Copy the .intrezrc config file to the updates directory and edit it.
cd $SEQH/updates
cp $SLRI/seqhound/config/.intrezrc .
cp $SEQH/1.core.files/.intrezrc .
Text in italics must be changed. in [crons] section, variable pathupdates
points to the path where the update jobs will be set up; variable pathinputfiles
points to the path that saves the input files (other than *.aso.gz and *.bna.gz files from
the core module and *.asn files from the gendb module); variable
pathinputfilescomgen points to the path that saves the input files *.asn for the
gendb module; variable mail indicates your e-mail address; variable
defaultrelease is the GenBank release you build SeqHound database with;
variable pathflags points to the path that save the flag files generated by each
updating job.
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./inputfiles/
pathinputfilescomgen=./genfiles/
mail=my_email
defaultrelease=141
pathflags=./flags/

The cron daemon may consider your home directory to be the “current directory”.
For this reason, the .intrezrc file should be copied to your home directory too.
cd $SEQH/updates
cp .intrezrc ~/.
32. Set up the dupdcron_odbc.pl cron job.
dupdcron_odbc.pl (daily update cron) is a PERL script that retrieves the latest
GenBank and RefSeq update files from the NCBI ftp site and then passes them to

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

49 of 421

18/04/2005

“update” and “mother” where they are used to update the SeqHound data tables.
Specifically, it
a) downloads update files with today's date (from ftp://ftp.ncbi.nih.gov/ncbiasn1/daily-nc/ nc*.aso.gz and ftp://ftp.ncbi.nih.gov/refseq/daily/ rsnc*.bna.Z
b) runs update
(update -i nc*.aso.gz)
and then
c) runs mother
(mother -i nc*.aso.gz -r version# -n F -m F -u T).
You need to know this because if you miss a few updates before setting up
the cron job (and after completing the seqsync steps above) you have to
run update and mother in hand using the above commands.
All scripts (like dupdcron_odbc.pl) report success or failure via email. The mailto
address is set in the shconfig.pm script which you have just edited.
dupdcron_odbc.pl is the first cron job that has to be set up. Make a new text file
called list_crontab where you will list the cron jobs.
cd $SEQH/updates
pico list_crontabs
This file should have the single line
30 22 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./dupdcron_odbc.pl

where libpath should be replaced by the correct path you set up in Step 11 for
environment variable LD_LIBRARY_PATH. You can find it out by:
echo $LD_LIBRARY_PATH
This line specifies the time to run a job on a recurring basis. It consists of 6 fields
separated by spaces. The fields and allowable values are of the form:
minute
(0-59) in this case 30
hour
(0-23) in this case 22
day of the month (1-31) in this case *
month
(1-12) in this case *
day-of-week
(0-6 where 0 is Sunday) in this case *
command to run
The above line indicates that dupdcron_odbc.pl is to be run at 10:30 PM every day of
the month, every month, regardless of the day of the week. The * character is a wildcard. The actual command consists of changing to the directory where
dupdcron_odbc.pl exists (this path will have to be modified depending on your set
up)
cd /seqhound/update;
and then executing the perl script
./dupdcron_odbc.pl
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

50 of 421

18/04/2005

After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
If for some reason, you want to deactivate the cron job, type:
crontab –r list_crontabs
To find out what cron jobs you have activated, type
crontab -l
For more information on setting up cron jobs on UNIX type:
man crontab
33. Set up redundcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 23 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./redundcron_odbc.pl

See Step 32 for the explanation of libpath.
After adding the above line, edit it to match your setup and close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically does three things:
a) checks if file “nr” is updated on the ftp site ftp://ftp.ncbi.nlm.nih.gov/blast/db. If it
is, retrieves it
b) drops table redund from the database and recreate it.
c) rebuilds table redund using the downloaded nr file and the redund parser.
34. Run precompute for the first time.
First set up the configuration file
cd $SEQH/updates
pico .intrezrc
Edit the section under [precompute] to make it look like:
[precompute]
;precomputed taxonomy queries
MaxQueries = 0
MaxQueryTime = 10
QueryCount = 0
#path to precomputed searches has to have "/" at the end !!
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index

Make sure the value of path is the absolute path of directory precompute you
make in Step 14 and the value of indexfile is the value of path plus index.
Variable path is the directory that holds results of the precompute executable.
indexfile is a path to the index that will be created by precompute.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

51 of 421

18/04/2005

Finally, run the precompute executable:
cd $SEQH/updates
./precompute –a redo
Where –a redo specifies that the program is being run for the first time.
This program basically precomputes the number of proteins and nucleic acids (and
their GI values) for each taxon in the taxgi table. The results of this query are stored
and indexed in text files (in the directory specified by path) if this query takes
longer than x seconds (where x is defined by MaxQueryTime in the above .intrezrc
file). These text files are used by SeqHound API calls such as
SHoundProteinsFromTaxIDIII(taxid)
35. Set up precomcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 1 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./precomcron_odbc.pl

See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically runs the command
precompute -a update
and updates the precomputed search results.
36. Set up isshoundoncron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 7 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./isshoundoncron_odbc.pl

See Step 32 for the explanation of “libpath”.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically does two things:
a) runs the executable called isshoundon. This program makes a single call to
the local SeqHound API to ensure that it is working.
b) moves all log, run and gis log files into a directory called logs
37. Set up llcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

52 of 421

18/04/2005

Add the following line:
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./llcron_odbc.pl

See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically repeats the actions listed in step 14 above and re-creates the
locus link tables in SeqHound. This includes:
a) getting the latest LL_tmpl.gz file from the NCBI ftp site.
b) removing the locus link tables from SeqHound
c) running llparser
d) getting 2 GO annotation files from GO ftp site
e) running the addgoid parser on these two files
38. Set up comgencron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./comgencron_odbc.pl

See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically repeats the actions listed in step 15 above and re-creates the
chrom table in SeqHound and updates the complete genome information in the core
tables. This includes:
a) generating a list of “DNA units” that belongs to a complete genome,
b) downloading complete genome files from NCBI ftp site,
c) rebuilding table chrom
d) removing all records in the core tables that belongs to division “gbchm”,
e) running script humous_build.sh to insert records into core tables
f) resetting the kloodge field in table taxgi for all records to 0
g) updating kloodge by running parser comgen

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

53 of 421

18/04/2005

Setting up SeqHound servers. Overview
39. Setting up SeqHound servers. Overview.
There are two web server applications that make up the SeqHound system:
a) wwwseekgi produces html pages for the SeqHound web interface and
b) seqrem processes requests to the SeqHound remote API.
Step 40 shows you how to find the two directories where you will set up these two
applications (assuming that you are using a default installation of Apache). The two
directories are called:
cgi-bin
htdocs
Step 40 may be skipped if you already know or have already been told where these two
directories are.
Steps 41 - describe the files that must be placed into these two sub-directories in order to
start the wwwseekgi and seqrem servers.
40. Examining the httpd.conf file for Apache.
These instructions assume that you already have an Apache server running. In order
to proceed further you must locate the directory where executables will be run from
(called “cgi-bin” in a default set-up of Apache) and a directory that contains html
documents (called “htdocs” in a default set-up of Apache). You can find (and reset)
the location of these two directories in an Apache configuration file called
“httpd.conf”. In a default set-up of Apache, the httpd.conf file can be accessed by
changing to the directory:
cd /etc/apache
and then opening the httpd.conf file found in this directory using a text editor such as
pico:
pico httpd.conf
To find the cgi-bin directory location, look for the line beginning with
“ScriptAlias”. In the default set-up, this line looks like this:
ScriptAlias /cgi-bin/ “/var/apache/cgi-bin/”
In this example, the path to the cgi-bin directory is /var/apache/cgi-bin/.
Write this path down, whatever it is.
To find the htdocs directory, look for the line beginning with “DocumentRoot”. In
the default set-up, this line looks like this:`
DocumentRoot “/var/apache/htdocs/”
In this example, the path to the cgi-bin directory is /var/apache/htdocs/.
Write this path down, whatever it is.
Also make a note of the line beginning with “User” and “Group” (who has
ownership to the server). In a default Apache set-up, these lines are likely
User nobody
Group nobody

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

54 of 421

18/04/2005

Make a note of this, whatever it is.
Exit from the httpd.conf file and save your changes. If you made changes to the file,
you must restart the Apache server using the command:
/usr/apache/bin/apachectl restart
See the Trouble Shooting section at the end for more information on this.
In the steps below you will set up the SeqHound server by adding to these two
directories
Contents of the cgi-bin and htdocs directories
directory

contents

cgi-bin

the SeqHound wwwseekgi and seqrem server applications will placed
here

all of the static html pages used by the SeqHound interface will be placed
here
41. Set up the cgi-bin directory.
htdocs

Move to the cgi-bin directory you found in the step above. For the default set-up:
cd /var/apache/cgi-bin/
make a new subdirectory here called SeqHound
mkdir seqhound
cd seqhound
copy the SeqHound server applications here:
cp $COMPILE/slri/seqhound/build/odbc/seqrem .
cp $COMPILE/slri/seqhound/build/odbc/wwwseekgi .
also copy the following files to this directory:
cp $COMPILE/slri/seqhound/html/seekhead.txt .
cp $COMPILE/slri/seqhound/html/seektail.txt .
cp $COMPILE/slri/seqhound/html/seekhead.txt pics/.
cp $COMPILE/slri/seqhound/config/.intrezrc .
cp $COMPILE/slri/seqhound/config/.ncbirc .
42. Edit the .ncbirc configuration file.
Open the file with a text editor such as pico.
The setting for Data should contain a path to the ncbi/data directory. This directory
was downloaded as part of the ncbi toolkit in step 2.
--------------------example .ncbirc file begins----------------------[NCBI]
Data=/home/ncbi/data
--------------------example .ncbirc file ends-------------------------

43. Edit the .intrezrc configuration file.
Refer to step 14 in the current section for setting up of the .intrezrc file. The settings
for username, password, dsn and database in section [datab] should be
valid for the SeqHound database you have just built, and the setting for path and
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

55 of 421

18/04/2005

indexfile in section [precompute] should point to the valid path as in step 34
in the current section. Set up the index.html file for the web interface.
Move to the htdocs directory for your web-server. In the default case:
cd /var/apache/htdocs/
Make a SeqHound directory here:
mkdir seqhound
cd seqhound
Copy the index.html page to this directory:
cp $COMPILE/slri/seqhound/html/index.html .
Open the file in a text editor like pico and edit it so that its action points to the
wwwseekgi server.
pico index.html
then edit the line

where "/cgi-bin/seqhound/wwwseekgi" should specify the path to the
wwwseekgi executable.
44. Set up ODBC configuration file .odbc.ini:.
Move to the home directory of the owner of the binary seqrem in directory
/var/apache/cgi-bin/. Text in italics should be changed (see Step 10):
cd /homedir
Set up file .odbc.ini as the following (text in italics should be changed):
[mysqlsh]
Description
Trace
TraceFile
Driver
DSN
SERVER
PORT
USER
PASSWORD
DATABASE

=
=
=
=
=
=
=
=
=
=

MySQL ODBC 3.51 Driver DSN
On
stderr
/software/64/unixodbc/odbc/lib/libmyodbc3.so
mysqlsh
my_server
my_port
user_id
my_pwd
seqhound

45. Set permissions on the cgi-bin directory.
Move to the cgi-bin directory.
cd /var/apache/cgi-bin/
Change the user and group ownership to nobody (or whatever the values of “User”
and “Group” were set to in step 40).
chown –R nobody:nobody seqhound
46. Set permissions on the htdocs directory.
Move to the htdocs directory.
cd /var/apache/htdocs/

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

56 of 421

18/04/2005

Change the user and group ownership to nobody (or whatever the values of “User”
and “Group” were set to in step 40).
chown –R nobody:nobody seqhound
47. Test the SeqHound web interface.
Open an internet browser and, in this example, go to the url
http://yourmachinename/cgi-bin/seqhound/
You should see the front page of the SeqHound wwwseekgi interface.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

57 of 421

18/04/2005

Trouble-shooting notes
Error logs
Error logs for each of the SeqHound parsers are described in the steps above where the
parser is used to initially build a given SeqHound module.
Error logs for the SeqHound wwwseekgi server software is located in the same directory
as the executable; see wwwseekgilog.
Error logs for the seqrem server are located in the same directory as the executable; see
seqremlogs.
Error logs for the Apache server software are located (in a default set-up) in
/var/apache/logs.
Recompiling SeqHound
If you make changes to and recompile SeqHound, you should first do a clean of the
existing object files and executables. If you are still in a super-user shell, you may wish
to exit this shell and return to the shell where you had set all of your environment
variables. These variables are required by the clean and make scripts.
COMPILE
compile directory
SLRI
slri directory
NCBI

ncbi directory

gcc
PATH
/usr/local/bin:/usr/ccs/bin:${PATH}
EXTRAOPT
–D_FILE_OFFSET_BITS=64
ODBC
(path to unixodbc)
LD_LIBRARY_PATH (path to mysql and odbc libraries)
The first three variables refer to directories that are created during the above instructions.
To clean, run any make file with the ‘clean’ target. For example, to clean the cgi
executables, type:
cd $COMPILE/slri/seqhound/cgi
make –f make.seqrem clean
make –f make.wwwseekgi clean
Restarting the Apache server
If changes are made to the httpd.conf file, the Apache server must be restarted for the
changes to take effect. Use the apachectl script to do this
For a default install of Apache this script is in the directory
cd /usr/apache/bin
and to start the server, type
./apachectl restart
or to get a list of apachectl script commands type
./apachectl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

58 of 421

18/04/2005

Other useful links
SLRI on Sourceforge
http://sourceforge.net/projects/slritools/
NCBI Info Engineering branch:
http://www.ncbi.nlm.nih.gov/IEB/
Concurrent Versioning System (CVS)
http://www.cvshome.org/
MySQL
http://www.mysql.com/
Parser schedule
Parsers are run on a periodic basis as “cron” jobs on Unix platforms and as “Schedules
tasks” on Windows platforms.
The cron job schedule is set up in the file “seqhound/update/list_crontabs” on UNIX
platforms
The setup for the current production version of SeqHound described in these instructions
is shown below:
0 19 * * * /arena/seqhound/update/llcron_odbc.pl
0 21 * * * /arena/seqhound/update/redundcron_odbc.pl
30 22 * * * /arena/seqhound/update/dupdcron_odbc.pl
30 24 * * * /arena/seqhound/update/precomcron_odbc.pl
0 7 * * * /arena/seqhound/update/isshoundoncron_odbc.pl

MySQL errors
ERROR 1153 at line X: Got a packet bigger than
'max_allowed_packet'
When MySQL receives a packet bigger than max_allowed_packet bytes, it issues a
Packet too large error and closes the connection. For example, when a single SQL
statement from a mysqldump being imported exceeds the value for
"max_allowed_packet" configured on the MySQL server. Increasing this value to
64MB from the default 16MB should resolve the error. This value may be changed in the
global config or setting this on the running server via:
set global max_allowed_packet=67108864;
Please see http://dev.mysql.com/doc/mysql/en/Packet_too_large.html for more
information

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

59 of 421

18/04/2005

5. Description of the SeqHound parsers and data tables by module
What are modules?
The SeqHound system is divided into one required “core” module and several optionally configured modules. Modules are groups of
tables and API calls that are filled using a common data resource, for example the 3D structures. The purpose of this division is to give
the user an option to control hardware resources and complexity of system administration when parts of the SeqHound system are not
required. The list of SeqHound modules and their data resources is contained in the table below.
After a system build, the module information is recorded in the configuration file which is then utilized by the API to determine if
certain operations can be achieved with the current setup. The configuration file is called .intrezrc (Unix platforms) The relevant
section of this file is under the heading “[sections]”. Consult Section 4 under “Building the SeqHound system” for more information
on specifying the available modules in this file.

How to use this section.
This section describes the SeqHound system in detail module by module.
Parser and data tables associated with a given module are described under the section for that module.
A brief description of all of the parsers can be found below and in the Table 2 of the SeqHound paper. The table is repeated below
and will be updated here.
Note: It is assumed that you have read the material in section 3 and section 4 before delving into this material. These
two sections include everything you need to know to start using the SeqHound remote API or to install you own local
version of SeqHound. This section is intended for users who may want more details about how SeqHound is
constructed and exactly what it is doing behind the scenes. This section is also meant as background material for
developers who want to further develop the SeqHound system

Parser descriptions
Parser descriptions contain the following headings and information:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

purpose:
logic:
module:
input files:
tables altered:
source code
location:
config file
dependencies:
command line
parameters:
example use:
associated scripts:
error and run-time
logs:
troubleshooting:
additional info:

60 of 421

18/04/2005

a brief description of what the parser is for
more details on the parser – see update parser for example
what module the parser belongs to
input files required by the parser (also available in Table 2)
tables in SeqHound that are modified or created by the parser (also
available in Table 2)
location of the parser source code in the slri development tree
what configuration file parameters must be set for the parser to work
used by the parser
of the parser from the command line
that are used to run the parser
where they are located and what is in them
problems that may occur with this parser
where to find it

Table descriptions
Each table that is relevant to a module is described under that module.
Here is an example with comments in brackets. A data table description consists of the following sections and content:
Database:
Table:
Definition:

that the table belongs to (almost always SeqHound)
name of the table like “accdb”
a brief description of the table's purpose (for example, “This table

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

61 of 421

18/04/2005

correlates gi’s to accession identifiers).
notes about the table
Observation:
where does this table’s information come from
Source db:
the source file (used by the parser to fill this table) location is listed here
Source file:
the name of parser if a single parser is responsible for filling this entire
Parser:
table.
This is followed by a summary of the table’s definition (for example):
Field
Type
Null Default
Column_Definition
rowid
int(11)
No
Auto number row identifier
gi
int(11)
No
0
GenInfo Identifier
asn1
longblob
Yes NULL
BioSeqs: Sequences
and indexes
Keyname
Type
Field
PRIMARY
PRIMARY
gi
iasndb_rowid
INDEX
rowid
iasndb_gi
INDEX
gi
(each field is then described...only one field description is shown for this example)
***gi***
description:
example:
default value:
ASN.1 structure:

definition of the column (for example, GenInfo sequence record identifier”)
of a column entry (for example, “1232452”)
if the value has a default value, it is listed here
If the value in this column is derived from an NCBI data structure, this gives
you a quick idea of where to locate it--you can find more info by searching
for this data structure at:
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SB/hbr.html.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

source:

parser:
function:
API:
more info:

62 of 421

18/04/2005

for example, “Bioseq->Seq-id (choice 12)
Alternatively, the column may store a binary object that is an NCBI or SLRI
ASN.1 object. Descriptions of SLRI ASN.1 objects may be found at
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/slritools/slri/seqhound/asn.
If the value in this column is derived from a text file, this describes how to
find the information present in this column in the source file (for example in
the fourth column)
this is the parser(s) that retrieves the value from some other db (for example,
mother)
this is the parser function that retrieves the value
if a SHound API function retrieves this value from this table it is listed here
(for example, ShoundFindAcc)
(other notes)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

63 of 421

18/04/2005

An overview of the SeqHound data table structure
An overview of the SeqHound data table structure is available as a separate document in pdf format. See
http://www.blueprint.org/seqhound/api_help/docs/SeqHound_Schema_Prod.pdf.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

64 of 421

18/04/2005

Parsers and resource files needed to build and update modules of SeqHound.
This table will be updated shortly
Input File

Resource

ASN.1
sequences

ftp://ftp.ncbi.nih.gov/ncbi-asn1/*.aso
ftp://ftp.ncbi.nih.gov/refseq/cumulative mother
/*.bna

ASN.1
sequences

ftp://ftp.ncbi.nih.gov/ncbi-asn1/dailync/*.aso

Parser

update

ftp://ftp.ncbi.nih.gov/refseq/daily/*bna
FASTA nr
redund
ftp://ftp.ncbi.nih.gov/blast/db/nr
database
List of complete http://cvs.sourceforge.net/cgigenomes (flat bin/viewcvs.cgi/slritools/slri/seqhound chrom
file)
/genomes/chromff
ASN.1 for
complete
ftp://ftp.ncbi.nih.gov/genomes/*/*.asn comgen
genomes
Taxonomy
ftp://ftp.ncbi.nih.gov/pub/taxonomy/tax
importtaxdb
release (flat file) dump.tar
ASN.1 MMDB ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata
cbmmdb
release
/*.val
MMDB
mmdb table
vastblst
(database table)

seqhound@blueprint.org

Tables
Module
Modified
asndb, parti,
nucprot, accdb,
core
pubseq, taxgi,
sendb, sengi
asndb, parti,
nucprot, accdb,
core
pubseq, taxgi,
sendb, sengi
redund

redundb

chrom

gendb

taxgi, accdb

gendb

TAX, GCODE,
taxdb
DIV, del, merge
mmdb, mmgi

strucdb

domdb

strucdb

Version 3.3

The SeqHound Manual

3-D chain
BLAST sets
(flat file)
FASTA nr
database

65 of 421

18/04/2005

ftp://ftp.ncbi.nih.gov/mmdb/nrtable/nr
pdbrep
pdb.*
ftp://ftp.ncbi.nih.gov/blast/db/nr

nblast

nrB table available nrB and nrN tables
BLAST ASN.1 available at
nbraccess
results
ftp://ftp.blueprint.org/pub/SeqHound/
NBLAST/
LL_tmpl (flat ftp://ftp.ncbi.nih.gov/refseq/LocusLink/
llparser
file)
LL_tmpl
gene_associaton
.com
addgoid
pugen.GenBank http://www.geneontology.org
/Swissprot (flat
files)
function.ontolog
y
process.ontolog
goparser
http://www.geneontology.org
y
component.onto
logy (flat files)
CDD database ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/ domname
ftp://ftp.ncbi.nih.gov/blast/db/nr;
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
FASTA nr
DOMNAME and RPSDB tables
database and
rpsdb
available at
CDD database
ftp://ftp.blueprint.org/pub/SeqHound/
RPS/

seqhound@blueprint.org

domdb

strucdb

nrB

neigdb

nrN

neigdb

ll_omim, ll_go,
lldb
ll_llink, ll_cdd
ll_go

lldb

go_parent,
go_name,
go_reference,
go_synonym

godb

domname

rpsdb

Version 3.3

The SeqHound Manual

66 of 421

18/04/2005

core module
Last updated April 11, 2005. This section is maintained bt Elizabeth Burgess.
mother parser
Last updated April 11, 2005
purpose:
The mother parser is the first parser that is used to initially build the SeqHound “core” set of data tables. The input files consist of the
latest release of GenBank and RefSeq in binary ASN.1 format available on the NCBI ftp site. The resulting SeqHound data tables
hold DNA, RNA and protein sequence record information.
As of release 4.0, the mother parser is also used with the GenBank and RefSeq daily updates to update SeqHound so that that
sequence information is synchronized with that of NCBI. Previously, updates were handled by a separate parser called “update”.
logic:
The mother parser is run on a daily basis in update mode (-u T) to update with two input files. One input file is the daily update from
GenBank and the other is the daily update from the RefSeq database.
For each bioseq in the input file, mother retrieves the GI and accession number.
Mother then looks for this pair of GI and accession identifiers in the SeqHound accdb table.
If neither the accession (nor the GI) is found, then the record in the daily update represents a GI that is to be ADDED to SeqHound.
If the accession in the update file is found in SeqHound and is associated with the same GI (as listed in the update file) then the record
in the daily update represents a GI that has been CHANGED. This means that the sequence record was resubmitted to GenBank with
the same GI; the sequence remains the same but the associated annotation has changed.
If the accession in the update file is found in SeqHound but the GI associated with this accession differs between the update file and
SeqHound, then the GI that is newly associated with the accession represents a change in the sequence. The accession and updated GI

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

67 of 421

18/04/2005

pair point to a sequence record that will be ADDED to SeqHound. The accession and old GI (currently in SeqHound) point to a
record in SeqHound that will be KILLED (deleted).
Mother then records in the SeqHound history table whether the GI and accession in the update file represent an ADDED or
CHANGED record. The previous gi associated with the accession is also recorded, if there is one. Mother also records in the history
table those sequence records currently in SeqHound that will be KILLED (see above).
This process is completed for every bioseq in the update file.
A list of ADDED, CHANGED and KILLED GI’s is written to the file rsncmmddgis where rsncmmdd refers to the update file that was
being processed.
Mother then deletes from the following tables: parti, accdb, sendb,sengi, taxgi, pubseq, nucprot asndb. If the record is a complete
genome record, then the gi will be deleted from gichromid, contigchromid, gichromosome and contigchromosome. The strategy for
complete genome deletions is as follows:
Complete genome information in NCBI may be stored in several different places in the record.
1. Records from organisms other than those for certain higher eukaryotes ( e.g. human, mouse, rat, chicken and bee). These
records contain the flag NCBI_GENOMES and list the RNA and protein gis in the annotation. The RNA and protein gis will
be written to gichromid. The gi of the contig that contains the annotation is also written to this record if it is known.
2. Records from human, mouse, rat etc. The top level contig contains the NCBI_GENOMES flag, but no annotation. Instead, it
lists the gis for contigs that make up this chromosome. The annotation of these contig records contains the gis and proteins that
belong to this chromosome. The lower level contig gis are written to contigchromid so that the protein and RNA gis can be
parsed later from the contig bioseqs by the postcomgen parser. The gi of the top level contig is also written to this record.
3. Some records exist that only contain a chromosome number in the description of the bioseq. These records can be a contig
record or a record for an individual protein or RNA. These records are first written to gichromosome or contigchromosome and
later moved to gichromid and contigchromid by postcomgen.
For updating, we check to see if the accession for that gi is in the chrom table. The only records that are written to this table are the top
level gis that contain the NCBI_GENOMES flag. If the gi belongs to a top level contig, then all gis in contigchromid that belong to
that contig are retrieved and all gis in gichromid that belong to each low level contig gi are deleted. The low level contig gis are then
deleted from contigchromid.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

68 of 421

18/04/2005

If the record is present in contigchromid, then all gis in gichromid from that contig are deleted before the record is deleted from
contigchromid.
If the gi is present in only gichromid and contains a contig gi, then the appropriate record in contigchromid is marked as changed.
That way, after mother finishes, postcomgen will only process those contigs that have changed.
Mother writes errors and messages to a log file called rsncmmddrun where rsncmmdd refers to the update file that was being
processed.
module: core
input files:
latest GenBank release (ftp://ftp.ncbi.nih.gov/ncbi-asn1/*.aso)
latest RefSeq release (ftp://ftp.ncbi.nih.gov/refseq/cumulative/*.bna)
daily GenBank release (ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/*.aso)
daily RefSeq release (ftp://ftp.ncbi.nih.gov/refseq/daily/*.bna)
tables altered:
asndb, parti, nucprot, accdb, pubseq, taxgi, sendb, sengi, bioentity, bioname, chrom, gichromid, contigchromid, gichromosome,
contigchromosome
source code location:
slri/seqhound/parsers/mother.c
config file dependencies:
The relevant configuration file is:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

69 of 421

18/04/2005

slri/seqhound/config/.intrezrc
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,
password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE
respectively in your .odbc.ini (see section 4).
command line parameters:
Typing “./mother –“ at the command line while in the directory where mother resides will return a list of command line
parameters and default settings.
Note that -n and -m are listed for historical Codebase purposes. These values are always F for ODBC.
For example:
> ./mother –
mother
arguments:
-i Filename for asn.1 input [File In]
-r Release [String]
-n Initialize the ASNDB database file [T/F] Optional
default = F
-m Initialize the remaing database files [T/F] Optional
default = F
-u Is this an update [T/F] Optional
default = F
-c Is this a file for human/mouse complete genome [T/F] Optional
default = F
-t Read input file in text mode [T/F] Optional
default = F

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

70 of 421

18/04/2005

example use:
mother –i nc1227.aso -r 135 -n F -m F
mother –I nc1227.aso –r 135

Note that mother is normally run under the control of a script (see below).
associated scripts:
The initial build of SeqHound is executed using the script called “seqhound_build.sh”
See “slri/seqhound/scripts/seqhound_build.sh”
This script cycles through all the GenBank and RefSeq release files downloaded by asn_ftp.pl unzips them, processes them with
mother and then zips them up again.
The script must be run in the directory containing the GenBank and RefSeq release files and a copy of the mother parser The script
takes one argument (a release number).
./seqhound_build.sh 135
mother is also called by the scripts that generate the daily updates (see “associated scripts” under update parser).
error and run-time logs:
mother writes to a log file called “rsncmmddrun” where rsncmmdd refers to the GenBank release or update file that was being
processed. When run in update mode, it also writes the gis to a file called “rsncmmddgis”. The log files are created in the directory
logs.
troubleshooting:
additional info:
See readme files that accompany the input files on the GenBank ftp site.
See data table descriptions for each of the tables that are listed under “tables altered”.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

71 of 421

18/04/2005

update parser
Last updated April 11, 2005
note:
As of Release 4.0, update functionality has been moved to the mother parser and associated scripts. See above.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

72 of 421

18/04/2005

postcomgen parser
Last updated April 11, 2005
purpose:
The postcomgen parser is used in conjunction with the mother parser. It is used to update the taxgi table with complete genome
information using the chrom, contigchromid, gichromid, gichromosome and contigchromosome tables. It is run after the initial build
and after any updates.
logic:
Complete genome information in NCBI may be stored in several different places in different records.
1.Records from organisms other than those for certain higher eukaryotes ( e.g. human, mouse, rat, chicken and bee) contain the flag
NCBI_GENOMES and list the RNA and protein gis in the annotation.
2.Records from certain higher eukaryotes (human, mouse, rat etc) are different. The top level contig contains the NCBI_GENOMES
flag, but no annotation. Instead, it lists gis for the contigs that make up this chromosome. The annotation of these records contains the
gis and proteins that belong to this chromosome.
3. Some records exist that only contain a chromosome number in the description of the bioseq. These records can be contig records or
records for an individual protein or RNA.
At the time that mother is run, gis in the annotation may not yet have been entered into accdb and taxgi. For this reason, mother writes
these gis out to the table gichromid. When mother encounters a record that contains only contig gis, these gis are written out to
contigchromid. The records that contain only a chromosome number are written to gichromosome or contigchromosome.
After mother is finished, postcomgen is run to process the data in these tables and write the kloodge to the taxgi table for each gi:
1. Postcomgen first writes every gi in gichromosome to gichromid with the appropriate chromid. After this is done, all records are
deleted from gichromosome.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

73 of 421

18/04/2005

2. Next, all contigchromosome gis are written to contigchromid with the appropriate chromid and all records are deleted from
contigchromosome.
3. For each gi in contigchromid, the bioseq is obtained from asndb and any protein or RNA gis in the annotation are written to
gichromid. In addition to the chromid, the contiggi is stored so that the appropriate contigchromid record can be reread on update if
necessary.
4. Each record in contigchromid is marked as read, so that upon update only records that have been modified need be read.
5. Finally, taxgi is updated for each gi in gichromid with the appropriate kloodge.
module: core
input files:
There are no input files. The parser uses the tables gichromid, contigchromid, gichromosome and contigchromosome, which are filled
by mother, to update the taxgi table.
tables altered:
taxgi, gichromid, contigchromid, gichromosome, contigchromosome
source code location:
slri/seqhound/genomes/postcomgen.c
config file dependencies:
The relevant configuration file is:
slri/seqhound/config/.intrezrc
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

74 of 421

18/04/2005

password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username, password,
dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE respectively in your
.odbc.ini (see 4).
command line parameters:
None.
example use:
postcomgen
associated scripts:
error and run-time logs:
Postcomgen writes to a log file called “postcomgen.log”.
troubleshooting:
additional info:
See data table descriptions for each of the tables that are listed under “tables altered”.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

75 of 421

18/04/2005

asndb table
Last updated April 11, 2005
SeqHound
Database:
core
Module:
asndb
Table:
Central table that stores sequence records in binary ASN.1 format and indexed by gi.
Definition:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

76 of 421

18/04/2005

asndb table
Field

Type

Null Default

rowid

int(11)

NULL

timestamp

Yes

CURRENT_TIMES
timestamp
TAMP

int(11)

asn1

longblob

Yes NULL

Column_Definition
Mysql autoincrement
column

GenInfo primary
sequence record
identifier

varchar(25)

API

2005-03-23 12:32:16
4001923

FillASNDB calls GetGi.
See Bioseq->seqid->gi

PRI

release

varchar(10)

type

varchar(15)

The type of molecule protein
that this record refers to.

SHoundGetBioseq
SHoundSequenceLength

GetDivisionFromGBBlock.
See
Bioseq->descr->source
->org->orgname->div

The name of the release 135
given as a command
line input to the mother
parser

seqhound@blueprint.org

Source

bioseq in binary ASN.1 An example of a bioseq ToBioseqSEQENTRY
is included in the
See Bioseq->seqid->gi
Appendix (Example of
GenBank record).
NCBI division

division

Example

FillASNDB calls GetType. SHoundMoleculeType
See Bioseq-> mol

Version 3.3

The SeqHound Manual

77 of 421

18/04/2005

asndb indices
Keyname
PRIMARY
iasndb_rowid
iasndb_ts
iasndb_gi

Type
PRIMARY
INDEX
INDEX
INDEX

seqhound@blueprint.org

Field
gi
rowid
ts
gi

Version 3.3

The SeqHound Manual

78 of 421

18/04/2005

parti table
Last updated April 11, 2005
Note: This table is documented for historical purposes and may be deprecated in a future release. PARTI was required
by codebase as an index to the ASNDB table.
Database:
Table:
Module:
Definition:

seqhound
parti
core
This table maps the gi to the NCBI division.

parti table
Field

Type

Null

rowid

int(11)

timestamp

Yes

CURRENT_TI MySQL timestamp
MESTAMP column

int(11)

division

char(15)

Yes

seqhound@blueprint.org

Default

NULL

Column_Definition
MySQL autoincrement
column.

Example

Source

2005-03-23 12:32:16

GenInfo Identifier

21676275

FillASNDB calls
AppendRecordPARTI. See
Bioseq->Seqid->gi (choice 12)

NCBI Division

PRI

GetDivisionFromGBBlock. See
Bioseq->descr->source
->org->orgname->div

Version 3.3

API

The SeqHound Manual

parti indices
Keyname
PRIMARY
iparti_rowid
iparti_ts
iparti_gi
iparti_div

Type
PRIMARY
INDEX
INDEX
INDEX
INDEX

seqhound@blueprint.org

79 of 421

18/04/2005

Field
gi
rowid
ts
gi
division

Version 3.3

The SeqHound Manual

80 of 421

18/04/2005

nucprot table
Last updated April 11, 2005
seqhound
Database:
nucprot
Table:
core
Module:
This table maps the gi of a protein to the gi of its encoding DNA.
Definition:
A DNA may map to more than one protein. A protein can only map to one DNA. Not every DNA may
Observation:
map to a protein, eg synthetic DNA fragments typically do not have a protein. In a bioseqset structure of
type nucprot, the first bioseq record is a nucleic acid, subsequent bioseq records correspond to proteins.
In such a case, the entries added to the nucprot database would be:
record 1) gi_1 gi_2
record 2) gi_1 gi_3
record 3) gi_1 gi_4
Each record would be a mapping of the nucleic acid (gi_1) to each protein (gi_2 ... gi_N)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

81 of 421

18/04/2005

nucprot table
Field

Type

Null Default

rowid

int(11)

gin

int(11)

timestamp

Yes

CURRENT_
2005-03-23 12:32:16
TIMESTAM MySQL timestamp column
P

gia

int(11)

nucprot indices
Keyname
PRIMARY
inucprot_rowid
inucprot_ts
inuc_gin
inuc_gia

Type
PRIMARY
INDEX
INDEX
INDEX
INDEX

seqhound@blueprint.org

Column_Definition
MySQL autoincrement
column.
DNA GenInfo Identifier

Protein GenInfo Identifier

Example

Source

API

27464927

FillNUCPROT. See
Bioseq->Seqid->gi (choice 12)

SHoundDNAFromProtein

FillNUCPROT. See
Bioseq->Seqid->gi (choice 12)

SHoundProteinFromDNA

27464928

Field
gia
rowid
ts
gin
gia

Version 3.3

The SeqHound Manual

82 of 421

18/04/2005

accdb table
Last updated April 11, 2005
seqhound
Database:
accdb
Table:
This table maps gi's to accession numbers and other identifiers (a combination of a database name and
Definition:
identifier) found in GenBank sequence records.
Note that gi is not unique in ACCDB. Also note that a gi does not always have an accession associated
Observation
with it. There are 2 kinds of records in ACCDB. In one case the gi is associated with an accession and in
the other case, the gi is associated with a database name and some identifier from that database (name
column). Any given gi may have both record types and may be associated with more than one name.
Note that internal NCBI databases may be stored in ACCDB, for example NCBI_GENOMES. In this
case the record will not have an accession, but it will have a name.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

83 of 421

18/04/2005

accdb table
Field

Type

Null Default

rowid

int(10)

timestamp

Yes

CURRENT_TI
MySQL timestamp column
MESTAMP

int(11)

name

varchar(15)

varchar(30)

namelow varchar(30)

Column_Definition
MySQL autoincrement
column.

GenInfo Identifier
The source database of the
sequence record.
See note 1 below.

Example

Source

API

2313082

FillACCDB calls GetGi. See
Bioseq->seqid->gi

SHoundGiFromGBAcc
(ShoundFindAcc)

pdb

FillACCDB. See
Bioseq->Seq-Id->dbtag->db

SHoundDbNameAndIdListFrom
GBAcc

1AAP

See note 2 below.

SHoundDbNameAndIdListFrom
GBAcc
(ShoundFindName)
(ShoundGetNameByGi)

2005-03-23 12:32:16

This is an accession number
from a foreign database.

Yes

Same as name but in lower 1aap
case. Present for historical
reasons and may probably be
removed in the future.

NULL

See note 2 below.

n/a
access

chain

varchar(20)

Genbank Accession. The
primary sequence record
identifier.

Yes

This describes which one of A
(possibly) many chains in a
structure a sequence refers to.
This only refers to PDB
sequences.

seqhound@blueprint.org

NULL

WriteACCDB. See Bioseq- SHoundGBAccFromGi
>Seq-id->Textseq-id(ShoundAccFromGi)
>accession. This field is n/a
for sequences from PDB and
other databases. This field is
only relevant when name field
is n/a.
FillACCDB. See
SHoundGiFromPDBchain
Bioseq->Seq-id->PDB-seq-id>chain. See note 5 below.

Version 3.3

The SeqHound Manual

release

version

title

varchar(20)

int(11)

text

Yes

84 of 421

NULL

release date of the record

18/04/2005

Sep 14, 1990

See note 3 below.
SHoundSeqIdFromGi

See note 4 below.

Chain A, Protease
Inhibitor Domain Of
Alzheimer's Amyloid
Beta-Protein Precursor
(APPI)

FillACCDB calls
CreateDefline.

version of the record

a brief description of the
sequence

Notes
1.
A complete list of db names that may appear in this column is listed under API supplementary material at
http://www.blueprint.org/seqhound/apisupplement.html. Some common db name abbreviations are listed below.
gb means GenBank
embl means EMBL
pir means Protein Information Resource (PIR)
sp means Swiss-Prot
pbs means other database
ref means RefSeq
dbj means DNA Database of Japan (DDBJ)
prf means PRF
pdb means Protein Data Bank (PDB)
tpe means third party annotation from embl

seqhound@blueprint.org

Version 3.3

This value cannot be directly
retrieved from the table by the
API but a Seq-id can be retrieved
using ShoundSeqIdFromGi
given a gi.

The SeqHound Manual

85 of 421

18/04/2005

tpg means third party annotation from genbank
tpd means third party annotation from ddbj
other means some other database
2.
WriteAccdb.
For all db’s except PDB and ‘other’:
Bioseq->Seq-id->Textseq-id->name
For PDB sequences:
Bioseq->Seq-id->PDB->seq-id->PDB-mol-id (4 characters)
For other databases:
Bioseq->Seq-id->Dbtag->Object-id->string or integer
3.
FillACCDB, given a Bioseq pointer, retrieves a Seq-id pointer. If the Seq-id is of the type, pdb (protein data bank), then a PDB-seqid ptr is passed to WriteACCDB which retrieves the value of rel. If the Seq-id is of the type, genbank, embl, pir, swissprot, other,
ddbj, prf, tpg, tpd or tpe then a Textseq-id pointer is passed to WriteACCDB which retrieves the value of release. If the Seq-id is of
the type, giimport then a Giimport-id ptr is passed to WriteACCDB which retrieves the value of release. In all cases, the formatting
of the date is attempted with NCBI's function DatePrint (like this 07-OCT-2004). If the Formatting cannot be done, then the string
is just copied.
Bioseq->Seq-id->Textseq-id->release (an INTEGER) for all databases besides PDB, Giimport and ‘other’
Bioseq->Seq-id->PDB->seq-id->rel (a string or an NCBI data type) for PDB sequences
Bioseq->Seq-id->Giimport-id->release (a string) for Giimport
n/a for all ‘other’ databases.
4.
FillACCDB, given a Bioseq pointer, retrieves a Seq-id pointer. If the Seq-id is of the type, genbank, embl, pir, swissprot, other,
ddbj, prf, tpg, tpd or tpe then a Textseq-id pointer is passed to WriteACCDB which retrieves the value of version. For all other
databases, this value is n/a. See

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

86 of 421

18/04/2005

5.
Sequences that are part of structures will have a PDB identifier. Since there may be more than one chain (sequence) in a structure, the
PDB identifier must be accompanied by a chain identifier to uniquely identifiy a chain within the structure. A PDB identifier
supplemented with a chain identifier will be associated with a single sequence record (GI). For example, see PDB id “9XIM” and
chain A; this corresponds to GI sequence 443580. See example ASN.1 record for 9XIM_A in the Appendix.
Chain in the ASN.1 annotation is of type character, not of type string so it appears in its decimal form. So 65 is A and so on.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

accdb indices
Keyname
PRIMARY

Type
PRIMARY

iaccdb_rowid
iaccdb_ts
iaccdb_gi
iaccdb_db
iaccdb_name
iaccdb_namelow
iaccdb_acc

INDEX
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX

seqhound@blueprint.org

87 of 421

18/04/2005

Field
gi
db
access
rowid
ts
gi
db
name
namelow
access

Version 3.3

The SeqHound Manual

88 of 421

18/04/2005

histdb table
Last updated April 11, 2005
seqhound
Database:
histdb
Table:
hist
Module:
A history table of gis. Includes the date and any action taken on them.
Definition:
As of release 4.0, histdb is no longer populated by the histparser and update. Instead, mother populates
Observation:
this table, both during the initial build and the update. During the initial build, gi-accession number pairs,
the version number, filename and date are stored in histdb with an action ACTION_ADDED.
During updates, mother extracts the gi-accession number pairs from ACCDB. If neither the gi nor the
accession are found, then this is a new record (ACTION_ADDED). If the accession is found in
SeqHound but is associated with a different gi, then the record represents a changed record
(ACTION_CHANGED), and if the gi-accession pair is present in SeqHound, then the record represents a
deleted record (ACTION_KILLED). The gi, accession, action along with the version, filename and date
(current date update executed) are logged in histdb. Gi's with ACTION_ADDED will get added to accdb,
nucprot, taxgi, pubseq, sengi, sendb, parti by mother. Gi's with ACTION_CHANGED will be deleted and
the new updated information will be added to the core databases. For gi's with ACTION_KILLED, the
old record information in the core databases is now obsolete and will be deleted from the core databases
and the new information will be added to the databases . For any gi that is killed, there will be another gi
(with the same accession) that is added.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

89 of 421

18/04/2005

histdb table
Field
rowid

Type
int(11)

Null Default
No

timestamp

Yes

Column_Definition
MySQL autoincrement column.

CURRENT_
TIMESTAM MySQL timestamp column
P

Example

Source

2005-03-23 12:32:16

21676275

GetGi is called by ToBioseqUp in the
initial build and by ToBioseqUp in
updates. See:
Bioseq-->Seqid->id (choice 12)

int(11)

GenInfo Identifier

oldgi

int(11)

Yes

NULL

The gi previously associated with this 10954454
accession. Not used during the initial
build.

ToBioseqUp retrieves the gi from ACCDB
using GetACCDBRecord.

The genbank accession from ACCDB NC_001911
for the gi. If the accession is ‘n/a’,
then the ‘name’ is written here.
See details in accdb table description.

ToBioseqUp retrieves the accession or
name from ACCDB using
GetACCDBRecord.

access

char(20)

ToBioseq and ToBioseqUp call
GetSeqIdInfoFromBioseq. See Bioseq>Seq-id->Textseq-id->version.

version int(11)

Yes

NULL

Version of the record. See details in
accdb table description.

date

0000-00-00

current date when histdb is updated in 20040128
the form (YYYYMMDD)

Mother uses the current date.

the action taken on the record:
ACTION_ADDED,
ACTION_KILLED,
ACTION_CHANGED

ToBioseq and ToBioseqUp calls
LogHistory.

action

date

int(11)

filename Varchar(80) No

seqhound@blueprint.org

ACTION_CHANGED

The name of the file that contains this rsnc0311.bna
gi for debugging purposes.

Version 3.3

Main passes this to ToBioseq and
ToBioseqUp.

API

The SeqHound Manual

90 of 421

18/04/2005

histdb indices
Keyname
ihistdb_rowid
ihistdb_ts

Type
INDEX
INDEX

Field
rowid
ts

ihistdb_gi

INDEX

ihistdb_date

INDEX

date

ihistdb_action

INDEX

action

ihistdb_acc
ihistdb_filename

INDEX
INDEX

access
filename

ihistdb_oldgi

INDEX

oldgi

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

91 of 421

18/04/2005

pubseq table
Last updated April 11, 2005
seqhound
Database:
pubseq
Table:
This database maps a geninfo identifier to the medline/pubmed article(s) found in the sequence record
Definition:
indicating who first published the sequence.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

92 of 421

18/04/2005

pubseq table
Field
rowid

Type
int(11)

Null Default
No

timestamp

Yes

CURRENT_
TIMESTAM MySQL timestamp column
P

int(11)

muid

pmid

int(11)

seqhound@blueprint.org

Column_Definition
MySQL autoincrement column.

Example

Source

API

2005-03-23 12:32:16
ts
25992759

ToBioseq calls GetGi. SHoundGiFromReferenceID
See Bioseq-->Seqid SHoundGiFromReferenceList
(choice 12)

99091706

FillPUBSEQ calls
GetMuid. See Seqentry->Bioseq>Seqdescr->Pubdesc>Pub-equiv->Pub>muid (choice 4)

9873079

FillPUBSEQ calls
GetMuid. See Seqentry->Bioseq>Seqdescr->Pubdesc>Pub-equiv->Pub>pmid (choice 13)

GenInfo Identifier

Medline identifier found in this
record (points to same article as
pmid (if listed)).

Pubmed identifier found in this
record (points to same article as
muid (if listed))

Version 3.3

SHoundGetReferenceIDFromGi
SHoundGetReferenceIDFromGiList
SHoundMuidFrom3D
SHoundMuidFrom3Dlist
(ShoundMuidFromGi)
(ShoundMuidFromGiList)

The SeqHound Manual

93 of 421

18/04/2005

pubseq indices
Keyname
PRIMARY

Type
PRIMARY

Field
gi
pmid

ipubseq_rowid

INDEX

rowid

ipubseq_ts
ipubseq_gi
ipubseq_muid
ipubseq_pmid

INDEX
INDEX
INDEX
INDEX

ts
gi
muid
pmid

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

94 of 421

18/04/2005

taxgi table
Last updated April 11, 2005
seqhound
Database:
taxgi
Table:
core
Module:
This table associates a gi with a taxon identifier.
Definition:
Gis are stored in taxgi even if they are not associated with any taxonomy id. This occurs, for example, for
Observation.:
some patent gis.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

95 of 421

18/04/2005

taxgi table
Field

Type

Null Default

rowid

int(11)

timestamp

Yes

CURRENT_T
MySQL timestamp column
IMESTAMP

int(11)

taxid

int(11)

Yes

NULL

kloodge int(11)

type

varchar(50) Yes

Column_Definition
MySQL autoincrement
colulmn.

NULL

Example

Source

API

2005-03-23
16:50:42

GenInfo Identifier

42565179

WriteTAXGI calls GetGi.
See Bioseq-->Seq-id (choice 12)

SHoundDNAFromTaxID()
SHoundDNAFromTaxIDList()
SHoundProteinsFromTaxID()
SHoundProteinsFromTaxIDList()
SHoundProteinsFromChromosome()
SHoundProteinsFromChromosomeList()
SHoundDNAFromChromosome()
SHoundDNAFromChromosomeList()

Taxonomy Identifier from
NCBI’s taxonomy database.

3702

See note 1 below.

SHoundTaxIDFromGi()
SHoundTaxIDFromGiList()

Blueprint chromosome id from 392
the chrom table.
protein
Type of macromolecule
DNA, RNA, protein, NA,
other, "not specified".

See note 2 below.
WriteTaxgi calls GetType. See
Bioseq->mol (see note 3 below).

Notes:
1.
FillTAXGI calls GetTaxId. See Bioseq-->Seq-desc->BioSource->Org-ref->Dbtag.db == "taxon"
Bioseq-->Seq-desc->BioSource->Org-ref->Dbtag.tag
2.
Mother: ToBioseq calls SearchCHROMByChromNum.
Postcomgen: GetAllRecordsFromGICHROMID and UpdateTaxgiWithKloodge

seqhound@blueprint.org

Version 3.3

SHoundDNAFromChromosome()
SHoundDNAFromChromosomeList()
SHoundProteinsFromChromosome()
ShoundProteinsFromChromosomeList()

The SeqHound Manual

96 of 421

18/04/2005

3.
DNA = 1
RNA = 2
Protein = 3
NA = 4
Other = 255

taxgi indices
Keyname
PRIMARY

Type
PRIMARY

Field
gi

itaxgi_rowid
itaxgi_ts

INDEX
INDEX

rowid
timestamp

itaxgi_gi

INDEX

itaxgi_taxid

INDEX

taxid

itaxgi_kl

INDEX

kloodge

itaxgi_type

INDEX

type

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

97 of 421

18/04/2005

sengi table
Last updated April 11, 2005
seqhound
Database:
sengi
Table:
core
Module:
Seqentry to 'gi' conversion.
Definition:
Seqhound keeps track of which Bioseqs come from which Bioseq-sets using the sengi table. In the sengi
Observation:
table, each Bioseq is represented by a GI (a unique identifier for the sequence record) and each Bioseq-set
is represented by the first GI in the set of sequences, the seid.
Sengi also stores the division for historical purposes.
sengi table
Field
rowid

Type
int(10)

timestamp Yes

CURRENT_TI
MySQL timestamp column
MESTAMP

int(11)

GenInfo Identifier

seid

int(11)

Yes

NULL

division char(15)

Null Default
No

seqhound@blueprint.org

Column_Definition
Example
MySQL autoincrement column

Source

2005-03-23 12:32:16
2313082

ToBioseqSEQENTRY calls GetGi. See Seq-entry->Bioseqset->Bioseq-->Seqid (choice 12)

GenInfo Identifier: the first
one in the bioseq-set. seid
means seqentry id.

231308

ToBioseq calls GetGi for the first Bioseq in a SeqEntry See
Seq-entry->Bioseq-set->Bioseq-->Seqid (choice 12).

Genbank division.

PRI

ToBioseqSEQENTRY calls GetDivisionFromPARTI.

Version 3.3

API

The SeqHound Manual

98 of 421

18/04/2005

sengi indices
Keyname
PRIMARY
isengi_rowid

Type
PRIMARY
INDEX

Field
gi
rowid

isengi_ts

INDEX

isengi_gi

INDEX

isengi_seid

INDEX

seid

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

99 of 421

18/04/2005

sendb table
Last updated April 11, 2005
seqhound
Database:
sendb
Table:
core
Module:
SeqEntry Data Base
Definition:
Sequences are distributed by GenBank in packages. Each package is referred to as a seq-entry.
Observation:
A seq-entry may contain either a single sequence record (called a Bioseq) or a set of sequence records
(called a Bioseq-set). If a Seq-entry contains a Bioseq-set then that Bioseq-set contains a list of Seq-entry
packages (yes, this data structure is recursive). Each of these Seq-entry packages contains a single
sequence record (a Bioseq). There is annotation that is associated with single sequence records (Bioseqs).
An example of annotation is a list of authors who are responsible for submitting a sequence record. There
is also annotation associated with sets of sequence records (Bioseq-sets). This type of annotation applies
to all of the sequence records that are in the set. For example a set of authors may be responsible for all
of the sequence records in the set.
Seqhound stores Bioseqs and their associated GenInfo id's in one central table (called asndb). Since each
of these Bioseqs may have come from a Bioseq-set, Seqhound needs a way to store the Bioseq-set
associated annotation (that applies to each of the Bioseqs in the set). To accomplish this, Seqhound takes
the Bioseq-set, removes the Bioseqs that it contains, and stores the remainder of the Bioseq-set
(annotation) in the sendb table. Seqhound keeps track of which Bioseqs come from which Bioseq-sets
using the sengi table. In the sengi table, each Bioseq is represented by a GI (a unique identifier for the
sequence record) and each Bioseq-set is represented by the first GI in the set of sequences (the seid).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

100 of 421

18/04/2005

sendb table
Field
rowid

Type
int(10)

Null Default
No

timestamp

Yes

seid

asn1

int(11)

Column_Definition
Auto number row identifier

CURRENT_TI
MySQL timestamp column
MESTAMP

mediumblob No

Example

API

2005-03-23 12:32:16
2313082

ToBioseq calls GetGi for the
first Bioseq in a SeqEntry.
Main calls
AppendRecordSENDB. See
Seq-entry->Bioseq-set>Seq-entry->Bioseq->Seqid
(choice 12)

binary data

ToBioseqSeqEntry removes These functions return the Seqentry
the bioseqs. Main calls
or BioseqSet stored in this field
AppendRecordSENDB.
after first replacing all of the
Bioseqs that belong in it.

Gene Info identifier for the first
bioseq in the SeqEntry.

ASN1 Binary SeqEntry. Note
that the bioseqs themselves have
been removed to save space.

Source

SHoundGetBioseqSet()
SHoundGetSeqEntry()

sendb indices
Keyname
PRIMARY

Type
PRIMARY

Field
seid

isendb_rowid

INDEX

rowid

isendb_ts

INDEX

isendb_seid

INDEX

seid

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

101 of 421

18/04/2005

chrom table
Database:
Table:
Module:
Definition:
Observation:

seqhound
chrom
core
This table maps taxonomy ids with their complete genome information (chromid, chromflag, accession, and name)
This table keeps track of all completely sequenced chromosomes of a particular organism (taxid). We also store the NCBI
chromid (chromid), the type of sequence (chromflag), the name (name) and accession number (access).
Every chromid will have a corresponding kloodge identifier (this is Blueprint’s internal version of the NCBI chromid. In
addition, every taxon will have one kloodge identifier that has no corresponding chromid. This kloodge id is assigned to any
proteins or RNAs that are have no chromid; i.e. they have not been assigned to a chromosome yet. The kloodge is the value of
an autoincrement column and is written to taxgi by postcomgen.
We also store the chromosome number (chromnum) as text. This is used so that we can assign records which have no complete
genome information other than taxid and chromosome number to a chromosome identifier.
For example, let’s say that chromosome 4 (circular plasmid, named chr_4, access XIV) of species “species X” (taxid 123) is
completely sequenced and that NCBI has assigned this chromosome a chromid of 191. Let us say our internal kloodge id for
this same chromosome is 1254. Then a record is added to the chrom table with fields:
kloodge: 1254
taxid: 123
chromid: 191
chromflag: plasmid
accession: XIV
name: species X chr_4
chromnum: chr 4
See more example entries in the notes section below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

102 of 421

18/04/2005

chrom table
Field

Type

Column_Definition

Blueprint chromosome 568
id (used internally).
Mysql Autoincrement
column.

kloodge

int(11)

timestamp

CURRENT_T
IMESTAMP

taxid

int(11)

chromid

int(11)

chromflag int(11)

Example1

Null Default

access

varchar(20) No

name

text

seqhound@blueprint.org

API

2005-03-23
16:17:11
Taxonomy identifier

9606

ToBioseq calls GetTaxid See
SHoundAllGenomes
Bioseq-->Seq-desc->BioSource->Orgref->Dbtag->tag

NCBI chromsome
identifier (see note 2
below)

ToBioseq calls
GetChromIdFromBioseq. See Bioseq>SeqId->Dbtag->tag

the type of chromosome 1
(chromosome, plasmid,
mitochrondria etc)
defined as byte macros
in intrez.h.
(see note 3 below)

ToBioseq calls GetGenomeInfo, See
Bioseq->descr->source->subtype>subtype

Accession for the
sequence (see note 2
below).

NC_000001

NULL

GetGenomeInfo calls
Misc_GetAccession. See Bioseq->Seqid->Textseq-id->accession

The scientific name of
the organism with the
chromosome name.

Homo sapiens
Chromosome 1

GetGenomeFromBiosource constructs
this from various places in the
Biosource. See Bioseq->descr->source

chromnum varchar(10) Yes

Source

NULL

Chromosome number as 1
text.

GetGenomeFromBiosource constructs
this from various places in the
Biosource subtype. See Bioseq->descr>source->subtype

Version 3.3

SHoundChromosomeFromGenome

The SeqHound Manual

103 of 421

18/04/2005

Notes.
1.
Example entries from the chrom table.
kloodge

taxid

chromid

chromflag

access

name

chromnum

'2005-03-23 12:33:24'

13723

'NC_001911'

'Buchnera aphidicola Plasmid pLeu-Dn'

'pLeu-Dn'

'2005-03-23 12:33:24'

17119

'NC_004843'

'Buchnera aphidicola Plasmid pBPS1'

'pBPS1'

567

'2005-03-23 16:06:41'

9606

-9606

'Homo sapiens Chromosome Un'

'Un'

568

'2005-03-23 16:17:11'

9606

'NC_000001'

'Homo sapiens Chromosome 1'

'1'

569

'2005-03-23 16:17:12'

9606

'NC_000002'

'Homo sapiens Chromosome 2'

'2'

570

'2005-03-23 16:17:12'

9606

'NC_000003'

'Homo sapiens Chromosome 3'

'3'

592

'2005-03-23 16:35:05'

9606

2000021

'AC_000021'

'Homo sapiens Mitochondria'

4289

'2005-04-05 09:29:55'

287944

18301

'NC_006916'

'Harpiosquilla harpax Mitochondria'

4290

'2005-04-05 09:29:55'

302098

18316

'NC_006931'

'Eubalaena japonica Mitochondria'

Note that the third entry down represents an entry corresponding to an “unknown” chromosome. Proteins or RNAs that have not been
mapped to a chromosome will be assigned this kloodge identifier (567) in the accdb and table.
2.
Kloodge identifiers representing unknown chromosmomes will not have a corresponding chromid from NCBI. The negative value of
the taxid is used instead. See third entry in example table above. These entries will also have no accession associated with them.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

104 of 421

18/04/2005

notes continued…
3.
CHROM_PHAGE (phage sequence)
CHROM_NR
CHROM_ECE
CHROM_PLMD (plasmid sequence)
CHROM_CHLO (chloroplast sequence)
CHROM_MITO (mitochrondrial sequence)
CHROM_CHROM (chromosome sequence)
CHROM_ALL

chrom indices
Keyname
PRIMARY
ichrom_chromid

Type
PRIMARY
INDEX

Field
kloodge
chromid

ichrom_taxid

INDEX

taxid , chromnum

ichrom_kl

INDEX

kloodge

ichrom_acc
INDEX
Ichrom_chromnum INDEX

seqhound@blueprint.org

access
chromnum

Version 3.3

The SeqHound Manual

105 of 421

18/04/2005

gichromid table
Last updated April 11, 2005
SeqHound
Database:
gichromid
Table:
core
Module:
Stores RNA and protein gis that are part of a complete genome. Used to
Definition:
update taxgi with the kloodge.
This table is used internally to store RNA and Protein gis parsed from
Observation:
contigs that are part of a complete genome. After a complete build is
done, the gis and chromids from this table are used by postcomgen to
update taxgi with the kloodge.
gichromid table
Field

Type

Null Default

Column_Definition

int(11)

Primary key. Mysql autoincrement column.

timestamp Yes

CURRENT_TIMESTAMP

MySQL timestamp
column

int(11)

Gene Info identifier

chromid int(11)

NCBI chromosome id

NULL

Gi of the contig that
contains this gi.

contiggi int(11)

Yes

Example

Source

2005-03-23 12:32:16
231308

ToBioseq calls
FillGiChromid. See Bioseq->Seqid (choice 12)

3144

NCBI chromid

231308

ToBioseq calls
GetChromidFromBioseq.
See Bioseq->Seqid(choice
11)->tag->id

gichromid indices
Keyname

Type

seqhound@blueprint.org

Field

Version 3.3

API

The SeqHound Manual

106 of 421

18/04/2005

id
gi

PRIMARY
INDEX

gi
gi

igits

INDEX

igichromid
igicontiggi

INDEX
INDEX

chromid
contiggi

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

107 of 421

18/04/2005

contigchromid table
Last updated April 11, 2005
SeqHound
Database:
contigchromid
Table:
core
Module:
Stores contig gis that are part of a complete genome. Used to update
Definition:
taxgi with the kloodge.
This table is used internally to store lists of GI’s that make up a contig
Observation:
sequence for a complete genome. After a complete build is done, the gis
and chromids from these contigs are used by postcomgen to update
taxgi with the kloodge.
contigchromid table
Field

Type

Null Default

Column_Definition

int(11)

Mysql auto-increment column.

timestamp

Yes

CURRENT_TI
MySQL timestamp column
MESTAMP

contiggi int(11)

Gene Info identifier for this sequence record 50754263
that is part of a contig.

ToBioseq calls GetGi. See Bioseq->Seqid (choice 12)

Yes

NULL

Genbank gi of the contig that contains
multiple sequences (i.e. multiple
contiggi’s).

51039201

topgi

ToBioseq calls GetGi. See Bioseq->Seqid (choice 12)

461
chromid int(11)

NCBI chromosome id.

ToBioseq calls
GetChromidFromBioseq. See Bioseq>Seqid(choice 11)->tag->id

changed Int(11)

Yes

NULL

Indicates whether the record needs to be
1 = process
processed by postcomgen. Used by mother 2 = do not process
in update mode.

int(11)

Example
2005-03-23 12:32:16

contigchromid indices

seqhound@blueprint.org

Source

Version 3.3

ToBioseq calls FillContigChromid. See
Bioseq->Seqid(choice 11)->tag->id

API

The SeqHound Manual

108 of 421

18/04/2005

Keyname
contiggi
icontigts

Type
PRIMARY
INDEX

Field
id
ts

contigi

UNIQUE

contiggi

iccontigi
ictopgi

INDEX
INDEX

contiggi
topgi

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

109 of 421

18/04/2005

gichromosome table
Last updated April 11, 2005
SeqHound
Database:
gichromosome
Table:
core
Module:
Stores RNA and protein gis that have a chromosome name, but no other
Definition:
complete genome info.
This table is used internally to store RNA and protein gis that have a
Observation:
chromosome name, but no other complete genome info. After a
complete build is done, postcomgen will write these gis to gichromid
with the appropriate chromid.

gichromosome table
Field

Type

Null Default

Column_Definition Example

int(11)

Primary key Mysql
auto-increment
column.

timestamp

Yes

CURRENT_TIM MySQL timestamp 2005-03-23 12:32:16
ESTAMP
column

int(11)

chromnum

varchar(10)

seqhound@blueprint.org

GenInfo Identifier
Chromosome
number

Source

231308

ToBioseq calls GetGi. See Bioseq-->Seqid
(choice 12)

ToBioseq calls GetGenomeInfo. See Bioseq>descr->source->subtype->name

Version 3.3

API

The SeqHound Manual

110 of 421

18/04/2005

gichromosome indices
Keyname
Type
id
PRIMARY
igichrom_ts
INDEX

Field
id
ts

UNIQUE

igichrom_gi
igichrom_num

INDEX
INDEX

gi
chromnum

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

111 of 421

18/04/2005

contigchromosome table
Last updated April 11, 2005
SeqHound
Database:
gichromosome
Table:
core
Module:
Stores contig gis that have a chromosome name, but no other complete
Definition:
genome info.
This table is used internally to store contig gis that have a chromosome
Observation:
name, but no other complete genome info. After a complete build is
done, postcomgen will writethese gis to contigchromid with the
appropriate chromid.
contigchromosome table
Field

Type

Null

int(11)

timestamp

Yes

contiggi

int(11)

GenInfo gi

chromnum

Varchar(10)

Chromosome number

seqhound@blueprint.org

Default

Column_Definition

Example

Source

Primary key Mysql autoincrement column.
CURREN
2005-03-23 12:32:16
T_TIMES MySQL timestamp column
TAMP
231308

ToBioseq calls GetGi. See Bioseq-->Seqid
(choice 12)

ToBioseq calls GetGenomeInfo. See Bioseq->Seqid (choice 12)

Version 3.3

API

The SeqHound Manual

112 of 421

18/04/2005

contigchromosome indices
Keyname
Type
id
PRIMARY

Field
id

igichrom_ts

INDEX

contiggi

UNIQUE

contiggi

icontigchrom_gi
icontigchrom_num

INDEX
INDEX

contiggi
chromnum

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

113 of 421

18/04/2005

Redundant protein sequences (redundb) module
Last updated: August 4, 2004
redund parser
purpose:
The redund parser builds the redundb module, which consists of the redund table. The input file consists of information pertaining to
redundant GIs, accession numbers and the sequence information which the GIs refer to. The resulting data table contain information
on the GIs, redundant GI groups and the ranking of each GI within their redundant group.
module:
redundb
input files:
nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
table created:
redund
source code location:
slri/seqhound/parsers/redund.c
config file dependencies:
slri/seqhound/config/.intrezrc (UNIX platform)
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

114 of 421

18/04/2005

;this should be set to 1 to allow usuage of the redundb
;redundant protein sequences
redundb = 1

Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. Variables username,
password, dsn, database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE
respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections] redundb should be 1.
commandline parameters
Typing ./redund at the command line will return a list of command line parameters and default settings. For example:
>./redund
redund
arguments:
-i Input non-redundant database fasta file [File In]
-n Initialize database file [T/F] Optional
default = F
example use:
./redund -i nr -n T
associated scripts:
nrftp.pl: retrieves the relevant data file from NCBI's ftp site
error & run-time logs:
redund writes to a log file called redundlog
additional information:
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
See data table descriptions for each of the tables that are listed under “tables altered”

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

115 of 421

18/04/2005

redund table
Last updated: August 4, 2004
seqhound
Database:
redund
Table:
redundb
Module:
Partitions GIs into groups based on sequence redundancy, and ranks
Definition:
each GI within their group.
MySQL
Field
Type
Null Default
Column_Definition
rowid
int(11)
No
Auto number row identifier
gi
int(11)
No
0
GenInfo Identifier
rordinal
int(11)
No
0
Position in redundant group
rgroup
int(11)
Yes NULL
ID of redundant group(not static)
MySQL Indexes
Keyname
Type
Field
PRIMARY
PRIMARY
gi
iredund_rowid
INDEX
rowid
iredund_gi
INDEX
gi
iredund_ordinal
INDEX
rordinal
iredund_rgroup
INDEX
rgroup

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

116 of 421

18/04/2005

Source org:
Source file:
parser:

redund is a database table that places GIs with equivalent sequences
into redundant groups. The same sequence may be worked on by
independent researchers. Each discovery is submitted and assigned
a different gi. When the discoveries concern a common sequence,
the GIs point to sequence records that describe redundant sequences,
i.e. a single sequence has multiple GIs. These GIs then belong to a
common group (rgroup). In such cases, it is common for one
particular GI to be used more than the others because it is better
annotated. The ordinal (rordinal) is a ranking of the GIs, 1 for the
best annotated GI, 2 for the second, 3 for the third etc. This ordering
is determined by the ordering of the sequence headers in the nr
source file. These 3 pieces of information form the redund database
table.
NCBI
nr.gz from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
redund

***gi***
description:
example:
default value:
source:
parser:
function:

GenInfo Identifier for a sequence record.
7228451
0
parsed from nr (line 1 of record) as token(|) delimited fields
redund
LabelGI, AssignRedund (redund.c)

Observation:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

API:

***rordinal***
description:
example:
default value:
source:
parser:
function:
API:
***rgroup***
description:
example:
default value:
source:

parser:

117 of 421

18/04/2005

SHoundRedundantGroup (returns all GIs in the same group as
the input GI)
SHoundRedundantGroupFromID (returns GIs from the group
with input group ID)
SHoundFirstOfRedundantGroupFromID (GI of rank 1,
given a group)
SHoundIsNRFirst (true/false, is this GI is ranked 1 in it's
group)

rank position in redundant set
1
n/a
based on the position of the gi within the record, the first gi has
rordinal 1, the 2nd has ordinal 2
redund
LabelGI, AssignRedund (redund.c)
SHoundIsNRFirst (tells if a gi has rordinal = 1)

redundant group the gi is a member of
1
n/a
rgroup IDs are assigned to GIs as the input file is parsed. Each GI
encountered is assigned the rgroup ID of those GIs already
encountered which have an identical sequence. If no such GIs have
been encountered, the GI is assigned a new rgroup ID one number
higher than the largest current rgroup ID in the database. The first
GI encountered is given an rgroup ID of 1.
redund

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

function:
API:

118 of 421

18/04/2005

LabelGI, AssignRedund (redund.c)
SHoundRedundantGroupIDFromGI[List] (returns the ID of
the rgroup that the input GI is a part of)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

119 of 421

18/04/2005

Complete genomes tracking (gendb) module
As of release 4.0 of SeqHound, the complete genomes module has been moved into the core module. The initial build and updates to
these tables are handled by the mother and postcomgen parsers. See the description of the core module above for more details.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

120 of 421

18/04/2005

Taxonomy hierarchy (taxdb) module
Last updated August 18, 2004
importtaxdb parser
purpose:
The importtaxdb parser builds the taxonomy module, which consist of the taxdb, gcodedb, divdb, del and merge table. The input file
consists of taxonomic information such as the taxonomy nodes, names, division, genetic codes, deleted nodes and merged nodes.
Table taxdb holds taxonomy ids and a binary file associated with each id. Table gcodedb holds genetic code ids and a binary file
associated with each id. Table divdb holds division ids and a binary file associated with each id. Table del holds the taxonomy id of
the deleted nodes. Table merge holds the taxonomy ids of nodes which has been merged and which is result of merging.
module:
taxdb
input files:
taxdump.tar.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy/
tables created:
taxdb
gcodedb
divdb
del
merge
source code location:
slri/seqhound/taxon/importtaxdb.c
config file dependencies:
slri/seqhound/config/.intrezrc (UNIX platform)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

121 of 421

18/04/2005

The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;taxonomy hierarchy
taxdb = 1

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

122 of 421

18/04/2005

taxdb table
Database:
Table:
Module:
Definition:
MySQL
Field
rowid
tid
asn
MySQL Indexes
Keyname
PRIMARY
itax_rowid
itax_tid
Observation:

seqhound
taxdb
taxdb
Maps a taxid to information about the taxid. The information about the
taxid is stored as an ASN blob.
Type
int(11)
int(11)
mediumblob
Type
PRIMARY
INDEX
INDEX

Null
No
No
Yes

Default
0
NULL

Column_Definition
Auto number row identifier
Taxonomy identifier
Information about tid
Field
tid
rowid
tid

The entire taxonomy can be viewed as a hierarchical taxonomy tree.
The tree expresses the relationships between the nodes within the tree.
At the root of the tree is a generic node that provides no information.
Descendents of the root node include superkingdoms such as Archea,
Eubacteria, Eukaryota, Viroids, Viruses. etc. Each of these
superkingdoms have corresponding taxonomy children. Also included
in the taxonomy tree are artificial sequences and unclassified
taxonomies such as the prions, unidentified agents, etc. taxdb maps the
taxonomy identifiers to the relevant information describing the
taxonomies. Organizing the taxonomy as a tree allows users to request

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

Source org:
Source files:
parser:

***tid***
description:
example:
default value:
source:
parser:
function:
API:

123 of 421

18/04/2005

complete lineages, ancestors, children and permits the user to browse
the tree exploiting its tree structure.
taxdb contains information from both nodes.dmp and names.dmp
NCBI
nodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
names.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
importtaxdb

the unique taxonomy identifier.
9606 (Human)
n/a
nodes.dmp: column 1
importtaxdb
Parser_TaxDBNodeRecord
SHoundGetTaxChildNodes
SHoundGetTaxChildNodesList
SHoundGetAllTaxProgeny
SHoundGetTaxParent
ShoundGetAllTaxAncestors
Note that although relationships (parent, child) are not
stored in the database, it is a part of the asn field, and as
such it makes it possible to retrieve parent, children,
ancestors and lineages (see asn blob).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***asn***
description:

124 of 421

18/04/2005

ASN blob containing information about the taxonomy.
Fields in the ASN blob include:
1) parent taxonomy of node
2) rank of node
3) embl code for node
4) division of node (see divdb)
5) inherited div flag (1 if this node inherits its division from its parent)
6) genetic code (see gcodedb)
7) inherited genetic code (1 if this nodes inherits its genetic code from
parent)
8) mitochondrial genetic code
9) inherited mit. genetic code (1 if nodes inherits mitochondrial gencode
from parent)
10) genbank hidden flag (1 if name is suppressed in GenBank entry
lineage)
11) subtree root flag (1 if this subtree has no sequence data yet)
12) comments
13) name of taxid (from names.dmp)
Possible rankings of node (there may be more):
superkingdom, kingdom
genus, subgenus
subspecies, species
subfamily, family, superfamily
phylum, subphylum,
subtribe, tribe
varietas

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

125 of 421

18/04/2005

infraorder, order, suborder
infraclass, class, subclass
no rank
example:

A textual representation of the blob for taxid 9606:
SLRI-taxon ::= {
taxId 9606 ,
parent-taxId 9605 ,
children-taxId {
63221 } ,
names {
{
name "Homo sapiens" ,
name-class scientific-name } ,
{
name "human" ,
name-class other ,
other-class "genbank common name" } ,
{
name "man" ,
name-class common-name } } ,
rank {
rank species ,
premod none ,
postmod none } ,
embl-code "HS" ,
division 5 ,
inherited-div TRUE ,
gencode 1 ,
inherited-gencode TRUE ,
mito-gencode 2 ,
inherited-mito-gencode TRUE ,
genbank-hidden FALSE ,
hidden-subtree-root FALSE }

default value:
source:
parser:
function:
API:

n/a
nodes.dmp: column 2 - 13 and names.dmp: column 2, 3
importtaxdb
Parser_TaxDBNodeRecord
The ASN object is not directly accessible through the API. Instead,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

126 of 421

18/04/2005

fields in the ASN object may be retrieved and be accessible from the
API. eg.
SHoundGetTaxNameFromTaxIDByClass
SHoundGetTaxNameFromTaxID
SHoundGetTaxLineageFromTaxID
You can also directly retrieve the ASN SLRITaxon by:
#include
AsnIoPtr aip = NULL;
SLRITaxonPtr ptax = NULL;
SHoundInit(FALSE, "name");
ptax = DB_GetTaxRec(9606);
aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);
SLRITaxonAsnWrite(ptax, aip, NULL);
AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

127 of 421

18/04/2005

gcodedb table
database:
table:
Module:
definition:
MySQL
Field
rowid
gcid
asn
MySQL Indexes
Keyname
PRIMARY
igcode_rowid
igcode_gcid
Observation:

seqhound
gcodedb
taxdb
Maps a genetic code identifier to the corresponding ASN taxonomy
genecode record.
Type
int(11)
int(11)
mediumblob
Type
PRIMARY
INDEX
INDEX

Null
No
No
Yes

Default
0
NULL

Column_Definition
Auto number row identifier
genetic code
ASN taxon-gencode record
Field
gcid
rowid
gcid

gcodedb is part of a group of databases (taxdb, gcodedb, divdb) that can
be used to form a taxonomy hierarchy tree. In the case of gcode, the
records map a genetic code identifier to an ASN object that holds
information about that genetic code. taxdb (see taxdb later) will specify
the type of genetic material each taxonomy uses, eg if the taxonomy has
a plasmid as a genetic code, mitochrondria DNA, or standard
chromosomal DNA etc. gcodedb houses information concerning each of
the genetic code. Relevant information include the translation table (the
mapping of nucleic acid codons to amino acids), & the start codon.
Further information at: http://www.ncbi.nlm.nih.gov/Taxonomy

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

Source org:
Source file:
Parser:

***gcid***
description:

128 of 421

18/04/2005

NCBI
gencode.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
importtaxdb

The genetic code identifier. The identifiers may not be consecutive
because historically codes may have merged or new ones were created.
The code and their corresponding names:
0 Unspecified
1 Standard
2 Vertebrate Mitochondrial
3 Yeast Mitochondrial
4 Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate
Mitochondrial; Mycoplasma; Spiroplasma
5 Invertebrate Mitochondrial
6 Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
9 Echinoderm Mitochondrial; Flatworm Mitochondrial
10 Euplotid Nuclear
11 Bacterial and Plant Plastid
12 Alternative Yeast Nuclear
13 Ascidian Mitochondrial
14 Alternative Flatworm Mitochondrial
15 Blepharisma Macronuclear
16 Chlorophycean Mitochondrial
21 Trematode Mitochondrial
22 Scenedesmus obliquus mitochondrial

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

example:
default value:
source:
parser:
function:
API:
***asn***
description:

129 of 421

18/04/2005

23 Thraustochytrium mitochondrial code
Because many of the taxonomy groups are bacterial, 11 is a common
genetic code.
n/a
gencode.dmp: column 1
importtaxdb
Parse_TaxDBGenCodeRecord
n/a

An ASN blob for the genetic code information corresponding to the
genetic code identifiers. The individual fields in the blob are:
1) an optional abbreviation
2) name of genetic code (see gcid for names)
3) translation table
4) start codon
The translation table maps the nucleic acid codon to amino acids:
In the source file, the map is represented as:
FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

There are 64 characters in this map. Each position in the map
corresponds to a codon (see below) and the letter at that position
corresponds to the resulting amino acid that will be translated from that
codon. The same can map be better expressed as a direct mapping from
codon to amino acid where
1) first 3 letters represent a nucleic acid sequence (codon),
2) an amino acid letter identifier
3) an amino acid abbreviation

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

130 of 421

18/04/2005

TTT
TTC
TTA
TTG

F
F
L
L

Phe
Phe
Leu
Leu

TCT
TCC
TCA
TCG

S
S
S
S

Ser
Ser
Ser
Ser

TAT
TAC
TAA
TAG

Y
Y
*
*

Tyr
Tyr
Ter
Ter

TGT
TGC
TGA
TGG

C
C
*
W

Cys
Cys
Ter
Trp

CTT
CTC
CTA
CTG

L
L
L
L

Leu
Leu
Leu
Leu

CCT
CCC
CCA
CCG

P
P
P
P

Pro
Pro
Pro
Pro

CAT
CAC
CAA
CAG

H
H
Q
Q

His
His
Gln
Gln

CGT
CGC
CGA
CGG

R
R
R
R

Arg
Arg
Arg
Arg

ATT
ATC
ATA
ATG

I
I
I
M

Ile
Ile
Ile
Met

ACT
ACC
ACA
ACG

T
T
T
T

Thr
Thr
Thr
Thr

AAT
AAC
AAA
AAG

N
N
K
K

Asn
Asn
Lys
Lys

AGT
AGC
AGA
AGG

S
S
R
R

Ser
Ser
Arg
Arg

GTT
GTC
GTA
GTG

V
V
V
V

Val
Val
Val
Val

GCT
GCC
GCA
GCG

A
A
A
A

Ala
Ala
Ala
Ala

GAT
GAC
GAA
GAG

D
D
E
E

Asp
Asp
Glu
Glu

GGT
GGC
GGA
GGG

G
G
G
G

Gly
Gly
Gly
Gly

The start codon is the particular codon that signals the start of protein
translation for that particular genetic molecular (i.e. a standard
chromosome starts translation at a different codon than a protozoan
mitochrondrial DNA).
The start codon is shown in the same manner as the translation table:
----------------------------------MM----------------------------

example:

This indicates that M (methionine) is the start codon.
Further information can be found at:
htpp://www.ncbi.nlm.nih.gov/Taxonomy in Genetic codes link.
A textually representation of a blob for genetic code 11:
SLRI-taxon-gencode ::= {
gencode-id 11 ,
name "Bacterial and Plant Plastid" ,
trans-table
"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGG
G " ,
start-codons "---M---------------M------------MMMM--------------

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

131 of 421

18/04/2005

-M------------ " }

default value:
source:
parser
function:
API:

n/a
gencode.dmp: column 2, 3, 4, 5 are stored in the database as an ASN
SLRITaxonGencode structure.
importtaxdb
Parse_TaxDBGenCodeRecord
not directly accessible from the API, but you can retrieve the
SLRITaxonGencode object using:
#include

AsnIoPtr aip = NULL;
SLRITaxGencodePtr ptax = NULL;
SHoundInit(FALSE, "name");
ptax = DB_GetTaxGencodeRec(11);
aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);
SLRITaxGencodeAsnWrite(ptax, aip, NULL);
AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

132 of 421

18/04/2005

divdb table
Database:
Table:
Module:
Definition:
MySQL
Field
rowid
did
asn
MySQL Indexes
Keyname
PRIMARY
idiv_rowid
idiv_did

Observation:

Source org:
Source file:
Parser:

seqhound
divdb
taxdb
Maps taxonomy division ID to an ASN SLRI-taxon-div object
Type
int(11)
int(11)
mediumblob
Type
PRIMARY
INDEX
INDEX

Null
No
No
Yes

Default
0
NULL

Column_Definition
Auto number row identifier
Taxonomy Division ID
ASN taxonomy division record
Field
did
rowid
did

divdb is part of a group of databases (taxdb, gcodedb, divdb) that can be
used to form a taxonomy hierarchy tree. In the case of divdb, the
records map a division identifier to an ASN object that holds
information about that division. This information include a 3-character
GenBank division code, a division name and an optional comment.
There are 10 divisions (see below).
NCBI
division.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
importtaxdb

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***did***
description:
example:
source:
parser:
function:
API:
***asn***
description:

133 of 421

18/04/2005

The division identifier (0-11, see asn for the names of division).
0
division.dmp: column 1
importtaxdb
Parse_TaxDBDivRecord
n/a

The ASN blob for the taxonomy division. The individual fields in the
ASN blob are:
1) a 3 character division code (see below)
2) a division name (see below)
3) an optional comment
Possible division codes and corresponding division names are:
BCT Bacteria
INV Invertebrates
MAM Mammals
PHG Phages
PLN Plants
PRI Primates
ROD Rodents
SYN Synthetic
UNA Unassigned

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

example:

134 of 421

18/04/2005

VRL Viruses
VRT Vertebrates
the entire field is one ASN blob. A possible blob could be represented
textually as:
SLRI-taxon-div ::= {
div-id 1 ,
div-code "INV" ,
div-name "Invertebrates" }

source:
parser:
function:
API:

Column 2, 3, and 4 of division.dmp are stored as one ASN blob
(SLRITaxonDiv) created in-house.
importtaxdb
Parse_TaxDBDivRecord (importtaxdb)
there is no API function that allows you to retrieve the entire ASN blob.
The information is used to construct information about a taxid, (see
taxdb). You can retrieve a TaxonDivRec using:
#include
AsnIoPtr aip = NULL;
SLRITaxonDivPtr ptax = NULL;
SHoundInit(FALSE, "name");
ptax = DB_GetTaxDivRec(1);
aip = AsnIoNex(ASNIO_TEXT_OUT, stdout, NULL, NULL, NULL);
SLRITaxonDivAsnWrite(ptax, aip, NULL);
AsnIoClose(aip);

This will produce the text in the example above.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

135 of 421

18/04/2005

del table
Database:
Table:
Module:
Definition:
MySQL
Field
rowid
tid
MySQL Indexes
Keyname
PRIMARY
idel_rowid
idel_tid

seqhound
del
taxdb
Contains all the deleted taxonomy identifiers.
Type
int(11)
int(11)

Null
No
No

Default
0

Type
PRIMARY
INDEX
INDEX

Column_Definition
Auto number row identifier
Taxonomy ID
Field
tid
rowid
tid

Observation:
Source org:
Source files:
parser:

NCBI
delnodes.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
importtaxdb

***tid***
description:
example:

The taxonomy identifier
15

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

default value:
source:
parser:
function:
API:

136 of 421

18/04/2005

n/a
column 1 of delnodes.dmp
importtaxdb
main of importtaxdb
SHoundIsTaxDeleted (true if taxid parameter is deleted)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

137 of 421

18/04/2005

merge table
Database:
Table:
Module:
Definition:

MySQL
Field
rowid
otid
ntid
MySQL Indexes
Keyname
PRIMARY
imerge_rowid
imerge_ntid
imerge_otid
Observation:
Source org:
Source files:
parser:

seqhound
merge
taxdb
Contains all tax nodes that have merged. Maps an old taxid before
merging with its new taxid after the merge. A taxonomy may be
deleted because initially it was considered species X, but when its DNA
is analyzed, it is deemed to be insignificantly different from species Y,
in which case one of the species gets merged.

Type
int(11)
int(11)
int(11)
Type
PRIMARY
INDEX
INDEX
INDEX

Null
No
No
No

Default
0
0

Column_Definition
Auto number row identifier
old taxonomy id
new taxonomy id
Field
otid
rowid
ntid
otid

NCBI
merged.dmp from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
importtaxdb

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***otid***
description:
example:
default value:
source:
parser:
function:
API:

***ntid***
description:
example:
default value:
source:
parser:
function:
API:

138 of 421

18/04/2005

the old taxid
12
n/a
merged.dmp: column 1
importtaxdb
main of importtaxdb
SHoundIsTaxMerged (takes an old taxid and checks if it has been
merged into a new taxid)

the new taxid
74109
n/a
merged.dmp: column 2
importtaxdb
main of importtaxdb
SHoundIsTaxMerged (takes an old taxid and checks if it's been
merged into a new taxid)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

139 of 421

18/04/2005

Structural databases (strucdb) module
Last updated October 6, 2004
cbmmdb parser
purpose:
The cbmmdb parser builds the strucdb module databases. The input files to cbmmdb contain data for experimentally determined
macromolecular 3D structures. The data files are in ASN.1 format available from the NCBI ftp site.
Module:
Build the Strucdb module
Change to the mmdbdata directory.
cd $SEQH/7.mmdb.files
Create tables of the Strucdb module in the database.
Make sure file strucdb.sql has line use seqhound close to the beginning of the file.
mysql –u my_id –p –P my_port –h my_server < strucdb.sql
Where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP
address or the server name of the database server repectively. You will be prompted to enter your password.
This creates tables mmdb, mmgi and domdb in the database.
strucdb
input files:
*.val.gz (ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata)
tables altered:
mmdb, mmgi
source code location:
slri/seqhound/parsers/cbmmdb.c
config file & other dependencies:
The relevant configuration files are:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

140 of 421

18/04/2005

slri/seqhound/config/.intrezrc (UNIX platform)
slri/seqhound/config/.mmdbrc (UNIX platform
ncbi/config/.ncbirc (UNIX platform)
ncbi/data/bstdt.val (optional)
* The requirement that bstdt.val be in the same directory as the parser is not strictly enforced. As long as .ncbirc (see below) is
properly configured, bstdt.val will be located. However, if bstdt.val is not in the same directory as parser, a warning will be issued in
cbmmdb's log file. Despite the warning, the parser should work properly, granted the configurations files are all set properly.
The .mmdbrc file should have at least 1 section [MMDB] with 3 fields:
1) Database: the path where the data source files (*.val.gz) are located
2) Index: the index file for the data source files (mmdb.idx: this is also downloaded by mmdbftp.pl)
3) Gunzip: path to gunzip

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

141 of 421

18/04/2005

sample .mmdbrc file:
[MMDB]
;Where the data source files are located
Database = ./
;Index of all data source files
Index = mmdb.idx
;Path to gunzip
Gunzip = /bin/gunzip
The .ncbirc file should have 1 section [NCBI] with 1 field:
1) Data: the path to bstdt.val
sample .ncbirc file:
In file .ncbirc, variable DATA should have a value which is the path of directory ncbi/data on your machine. File .ncbirc looks like
(change text in italics):
[NCBI]
ROOT=/
DATA=/my_home/compile/ncbi/data/
[NCBI]
;where bstdt.val is located
Data=/ncbi/data
The .intrezrc file should have the pathmm set to where you want the databases to be stored.
sample .intrezrc file:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;structural databases
strucdb = 1

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

142 of 421

18/04/2005

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

143 of 421

18/04/2005

command line parameters:
Typing "./cbmmdb" at the command line will return a list of parameters
and default settings. For example:
> ./cbmmdb
cbmmdb
-n
-m

arguments:
Initialize Database File for Biostrucs [T/F]
Initialize Database File for MMDB Id and GI pairs [T/F]

example use:
./cbmmdb -i T -m T
associated scripts:
mmdbftp.pl retrieves all the data files for input to cbmmdb.
error and run-time logs:
cbmmdb writes to cbmmdblog
additional info:
See additional readme files on the NCBI ftp site (ftp://ftp.ncbi.nih.gov/mmdb)
See data table descriptions for each of the tables that are listed under “tables altered”

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

144 of 421

18/04/2005

vastblst parser
purpose:
The vastblst parser constructs the 3D domain information that is extracted from the MMDB 3D structures. vastblst builds the domdb
table to store this information.
module:
strucdb
input files:
mmdb table
tables altered:
domdb
source code location:
slri/seqhound/domains/vastblst.c
config file dependencies:
slri/seqhound/config/.intrezrc (UNIX platform)
The relevant section in the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;structural databases
strucdb = 1

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

145 of 421

18/04/2005

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

146 of 421

18/04/2005

pdbrep parser
purpose:
The pdbrep parser fills in the 'rep' field in the domdb table. 'rep' is the best representative domain for that blast set. The input files
consist of non-redundant PDB chain set, BLAST rankings for the set and the representative for that chain set. pdbrep stores the
representative into the domdb table.
module:
strucdb
input files:
nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable
tables altered:
domdb
source code location:
slri/seqhound/domains/pdbrep.c
config file dependencies:
The relevant configuration file is:
slri/seqhound/config/.intrezrc (UNIX platform)
The relevant sections are:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;structural databases

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

147 of 421

18/04/2005

strucdb = 1

arguments:
Input nrpdb table [File In]

example use:
./pdbrep -i nrpdb.010400
associated scripts:
n/a
error & run-time log:
pdbrep writes to pdbreplog
additional information:
README's at ftp://ftp.ncbi.nih.gov/mmdb/nrtable
see data table descriptions for each of the tables that are listed under "tables altered"

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

148 of 421

18/04/2005

mmdb table
Database:
Table:
Module:
Definition:

seqhound
mmdb
strucdb
Is a database of experimentally determined macromolecular 3D
structures (Molecular Modeling DB).

MySQL
Field
rowid

Type
int(11)

Null
No

Default

mmdbid

int(11)

asn1
pdbid
bwhat
models
molecules
size
bzsize
MySQL Indexes
Keyname
PRIMARY
immdb_rowid
immdb_mmdbid
immdb_pdbid

mediumblob
varchar(20)
int(11)
int(11)
int(11)
int(11)
int(11)

No
No
Yes
Yes
Yes
Yes
Yes

NULL
NULL
NULL
NULL
NULL

Type
PRIMARY
INDEX
INDEX
INDEX

seqhound@blueprint.org

Column_Definition
Auto number row identifier
Molecular modeling database
identifier
Biostruc blob
PDB id
types of molecules in the biostruc
number of models in record
number of molecules in record
size of uncompressed biostruc
size of compressed biostruc
Field
mmdbid
rowid
mmdbid
pdbid

Version 3.3

The SeqHound Manual

Observation:

Source org:
Source file:
FTP script:
parser:

149 of 421

18/04/2005

Most 3D structures are obtained through X-ray crystallography and
NMR-spectroscopy. MMDB is a subset of Protein Data Bank (PDB),
excluding theoretical models. 3D structures can provide information
pertaining to the biological function, mechanism of function,
evolutionary history and relationships of a biomolecule.
NCBI
*.val.gz from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata
slri/seqhound/scripts/mmdbftp.pl
cbmmdb

***mmdbid***
description:

example:
default value:
ASN.1 struct:
parser:
function:
API:
***asn1***
description:

seqhound@blueprint.org

a unique, stable numerical identifier. MMDB ID are assigned when a new
structure enters MMDB. If an entry is deleted from PDB, then its MMDB entry
is obsolete. Obsolete structures get archived.
10
n/a
Biostruc->Biostruc-id (choice 1)
cbmmdb
Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)
SHound3DExists

the Biostruct blob (see ASN structure below). Contains information about the
3D structure of a molecule including information about the researchers,

Version 3.3

The SeqHound Manual

example:
default value:
source:
parser:
function:
API:

**pdbid
***
description:
example:
default:
ASN.1 struct:
parser:
function:
API:
***bwhat
***
description:

seqhound@blueprint.org

150 of 421

18/04/2005

molecule bond information, etc.
see below
n/a
*.val.gz
cbmmdb
MMDBBiostrucGet
SHoundGet3DfromPdbId
SHoundGetPDB3D
SHoundGet3D

a 4 character identifier from the Protein Data Bank.
100D
n/a
Biostruc->Biostruc-id(choice 2)
cbmmdb
Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)
no functions return the pdbid from mmdb, it is used as a key to retrieve a 3D
structure, see asn1 API.

an integer that represents the type of molecule.
The possible flags are (defined in ncbi/biostruc/mmdbapi1.h) and their
corresponding byte values are:
1) AM_ION 0x80

Version 3.3

The SeqHound Manual

example:
default:
ASN.1 stuct:
parser:
function:
API:
***models***
description:
example:
default:
source:
parser:
function:
API:

seqhound@blueprint.org

151 of 421

18/04/2005

2) AM_RNA 0x40
3) AM_WAT 0x20
4) AM_SOL 0x10
5) AM_HET 0x08
6) AM_DNA 0x04
7) AM_PROT 0x0
8) AM_POLY 0x01
9) AM_UNK 0x00
AM_ION
AM_UNK
Biostruc->Biostruc-graph->Molecule-graph->Bio-mol-descr->molecule-type
cbmmdb
Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)
SHound3DbWhat

an enumeration of the different models for this structure
3
n/a
this is enumerated by NCBI code, increment for EACH Biostruc->Biostrucmodel in the ASN structure
cbmmdb
Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)
n/a, no public interface provided to this field

Version 3.3

The SeqHound Manual

***molecules***
description:
example:
default:
source:
parser:
function:
API:
***size***
description:
example:
default:
source:

152 of 421

18/04/2005

an enumeration of the number of molecules in the structure
3
n/a
this is enumerated by NCBI code, increment for EACH Biostruc->Biostrucgraph->Molecule-graph in the ASN structure
cbmmdb
Biostruc2Modelstruc (ncbi/biostruc/mmdbapi1.c)
n/a, no public interface provided to this field

parser:
function:
API:

the size of the uncompressed biostruc
7691
n/a
the size of the uncompressed biostruc is determined just before it is compressed
and appended
cbmmdb
AssignASNMemBZMemo (intrez_cb.c)
n/a, no public interface provided to this field

***bzsize***
description:
example:
default:
source:

the size of the compressed biostruc
31
n/a
the compressed size is determined after it is compressed

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

parser:
function:
API:

seqhound@blueprint.org

153 of 421

18/04/2005

cbmmdb
AssignASNMemBZMemo (intrez_cb.c)
n/a, no public interface provided to this field

Version 3.3

The SeqHound Manual

154 of 421

18/04/2005

mmgi table
seqhound
mmgi
strucdb
A mapping of mmdbid to gi.

Database:
Table:
Module:
Definition:
MySQL
Field
rowid
mmdbid
gi
MySQL Indexes
Keyname

Type

PRIMARY

immgi_rowid
immgi_mmdbid
immgi_gi

INDEX
INDEX
INDEX

Observation:
Source org:
Source file:
FTP script:
parser:

Type
int(11)
int(11)
int(11)

Null
No
No
No

Default
0
0

Column_Definition
Auto number row identifier
mmdb identifier
GenInfo Identifier
Field
mmdbid
gi
rowid
mmdbid
gi

NCBI
*.val.gz from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata
slri/seqhound/scripts/mmdbftp.pl
cbmmdb (see mmdb for initialization info)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***mmdbid***
description:
example:
ASN.1 struct:
parser:
function:
API:

***gi***
description:
example:
ASN.1 struct:
parser:
function:
API:

155 of 421

18/04/2005

3D molecular model unique identifier
1
Biostruc->Biostruc-id (choice 1)
cbmmdb
Main
SHound3DExists
SHound3DFromGi
SHound3DFromGiList

GenInfo Identifier
1420979
Biostruc->Biostruc-id
cbmmdb
Main
SHoundGiFrom3D
SHoundGiFrom3DList

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

156 of 421

18/04/2005

domdb table
seqhound
domdb
strucdb
Stores the information of structural domains.

Database:
Table:
Module:
Definition:
MySQL
Field
rowid
mmdbid
asn1
pdbid
pdbchain
gi
domno

Type
int(11)
int(11)
mediumblob
varchar(20)
varchar(10)
int(11)
int(11)

Null
No
No
No
No
Yes
No
Yes

NULL
0
NULL

domall

int(11)

domid
rep
MySQL Indexes
Keyname
PRIMARY
idomdb_rowid
idomdb_mmdbid
idomdb_pdbid
idomdb_gi

int(11)
int(11)

No
Yes

0
NULL

Type
PRIMARY
INDEX
INDEX
INDEX
INDEX

seqhound@blueprint.org

Default
0

Column_Definition
Auto number row identifier
mmdb identifier
asn blob
pdb identifier
pdb chain
geninfo identifier
domain number
number of domains in the whole
structure
vast domain id
representative of blast set
Field
domid
rowid
mmdbid
pdbid
gi

Version 3.3

The SeqHound Manual

idomdb_domall
idomdb_domid

157 of 421

INDEX
INDEX

18/04/2005

domall
domid

Source org:
Source file:
Parser:
Note:

VAST (vector alignment search tool) is an NCBI-developed algorithm
used to identify similar 3-D structures that may not possess sequence
similarities. Structural information is taken from the mmdb database.
3D structures are converted into vectors (which corresponding to
structurally significant regions of the protein). Adjacent vectors are
grouped into domains. Domain information is used to determine distant
homologs and provide structural information. The domains of proteins
can be compared with other protein domains. Those with similar
domain overlaps are likely to be distant homologs. The domdb database
stores the domain information.
NCBI
MMDB database & nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable
vastblst & pdbrep
vastblst parser uses SHoundGet3D to retrieve the MMDB information..

***mmdbid***
description:
example:
default value:
source:
parser:
function:

the 3D structure id from which the domain was computed.
3446
n/a
retrieved from mmdbid by calling SHoundAllMMDBID.
vastblst
Main calls SHoundAllMMDBID and MakeAModelstruc

Observation:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

API:

***asn1***
description:
example:

default value:
source:
parser:
function:
API:

158 of 421

18/04/2005

not used by the API, there are other tables with more complete
mmdbid's.

the asn structure of the domain. The structure gives the gi of the domain
and the starting and ending point of the domain.
a text output of the SLRIValNode:
SLRIValNode ::= {
domain {
gi 443581 ,
from 2 ,
to 393 } }
n/a
extracted from the biostruc reconstructed from what is stored in mmdb.
vastblst
WriteFASTAByDomain calls MakeDomain
SHoundGetDomain to actually print out the ASN structure, after
calling SHoundGetDomain, the return value (a SLRIValNodePtr
object) must be opened and streamed into an asn IO stream. eg.
{ // start of program
SLRIValNodePtr pdom = NULL;
AsnIoPtr aip = NULL;
pdom = SHoundGetDomain(443581,
aip = AsnIoNew(ASNIO_TEXT_OUT,
SLRIValNodeAsnWrite(pdom, aip,
AsnIoClose(aip);
} // end of program, text output

0);
stdout, NULL, NULL, NULL);
NULL);
as in example

SHoundGetFastaDomain (to retrieve the text, follow code above,
replace SLRIValNode with SLRIFasta

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***pdbid***
description:
example:
default value:
ASN.1 struct:
parser:
function:
API:
***gi***
description:
example:
default value:
source:
parser:
function:
API:

***domno***
description:

159 of 421

18/04/2005

the PDB id of the 3D structure from which the domain was computed
9XIM
n/a
n/a
vastblst
WriteFASTAByDomain
SHoundGetDomain

the gi of the chain from which the domain was computed
443581
n/a
extracted from the biostruc object reconstructed from data in mmdb.
Biostruc->Biostruc-id (choice 1)
vastblst
WriteFASTAByDomain
SHoundGiFromPDBchain
This field can also be parsed from the SLRIValNode and
SLRIFasta ASN structures (above).

a protein chain may have several domains, each domain is enumerated,
eg if a protein X has 10 domains, the first domain has domno 0, the

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

example:
default value:
source:
parser:
function:
API:

***domall***
description:
example:
default value:
source:
parser:
function:
API:
***domid***
description:
example:
default value:
source:

160 of 421

18/04/2005

second domain has domno 1, ...etc
0
n/a
this is enumerated by NCBI code
vastblast
WriteFASTAByDomain
n/a, by itself, the domno provides little information, so it is not
provided in the API

The total number of domains in the 3D structure. In the above example
protein X, domall is 10
1
n/a
this is enumerated by NCBI code
vastblast
WriteFASTAByDomain
n/a

the id for the domain. domid is computed by hashing various id's,
including (but not limited to) mmdbid, domno, etc.
34460200
n/a
hashed in WriteFASTAByDomain

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

parser:
function:
API:
***rep***
description:

example:
default:
source:
parser:
function:
API:

161 of 421

18/04/2005

vastblast
WriteFASTAByDomain
n/a

domains get grouped based on properties. The domains are then
compared to the other domains in their group and then the best over
domain representative of that group is chosen. This field reflects that
representative.
1
0
nrpdb.* from ftp://ftp.ncbi.nih.gov/mmdb/nrtable columns 6, 9 & A
pdbrep
Main
n/a

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

162 of 421

18/04/2005

Protein sequence neighbours (neighdb) module
Last updated: August 18, 2004
Note: The neighbours tables are precalculated on a cluster and the resulting tables are distributed in MySQL format on
our ftp site. Therefore, this section is provided for informational purposes only, or for those who would like to build
neighbours tables from there own sequence data; it is not necessary if one wishes simply to include the neighbours
module into their own seqhound instance, in which case they should simply download the precomputed tables.
Note: the nblast development tree is in the slri directory on the same level as SeqHound (i.e.; slri/nblast).
This documentation includes a description of the nblast program.
The NBLAST paper is freely available from BioMed Central and is provided in the supplementary documents directory distributed
with this manual. See:
Dumontier M, Hogue CW. NBLAST: a cluster variant of BLAST for NxN comparisons.
BMC Bioinformatics. 2002 May 8;3(1):13. Epub 2002 May 08.
PMID: 12019022
Purpose: This section describes how to build and use nblast, a program for generating a database of sequence neighbours.
Installing nblast:
There are two ways to install nblast:
* Download nblast binaries for your platform from the sourceforge website:
http://sourceforge.net/project/showfiles.php?group_id=17918
or
* Compile nblast on your system: In order to compile nblast, two pieces of third party software are prerequisite: NCBI's C toolkit
and Sequiter software's CodeBase database library. Nblast currently only supports CodeBase as its database backend, but it may
be ported to other databases in the future. If your system meets these requirements, download the platform independent source
code for nblast from the link above and proceed.
The NCBI C toolkit can be downloaded from NCBI's website at
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
CodeBase is a commercial database software library available from Sequiter software.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

163 of 421

18/04/2005

A description of how to compile and set up the nblast software follows. If you installed the binaries, or already have compiled the
software, skip to the Configuration section. The installation procedure for CodeBase and the NCBI C toolkit are not detailed here, but
are assumed to have been successful.
Once you have unpacked the source code tree, change directories into the topmost nblast directory and then exercise one of the
following sets of instructions, depending on whether you are using a Unix or Windows system.
Unix:
• Modify the nblast.mk file for different source and library paths.
•

Modify the make.nblast file or project settings to create an executable that incorporates NBLAST (-D NBLAST_API), logging
(-D NBLAST_LOG), and/or MoBiDiCK(-D MOBIDICK_API) (The MoBiDick library is currently not publicly available).

•

From the src directory type 'make -f make.nblast', if the nblast binary is built then build nblastcleanup with 'make -f
make.nblastcleanup' The binaries will be placed in the build/ subdirectory of the source tree.

Windows:
• Open the nblast.dsw workspace in MSVC.
Build NBLAST and NBLASTCLEANUP
The two nblast executables, nblast and nblastcleanup, should now be present in the slri/nblast/build/ directory. Set your PATH
environment variable appropriately so that you can execute them. Before you can use nblast, you'll have to do some more
configuration, described in the next section.
Configuration of nblast environment:
At least one configuration file is required to use nblast and it's associated applications, more if you wish to have some of the relevant
files outside of the current directory. At the very least, you must have an nblast configuration file, called .nblastrc on Unix, and
nblast.ini on Windows. This file must be present either in your current directory or your home directory[may not apply on windows]
and has the following formats:
.nblastrc on Unix:
; NBLAST configuration file
[NBLAST]
writepath = /home/nblast/build/
nblast.ini on Windows:
[NBLAST]
writepath = g:\code\slri\nblast\build

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

164 of 421

18/04/2005

where the directory path named after “writepath = “ specifies where the files generated by nblast will be written to. A trailing
slash(unix) or backslash(windows) is necessary for the directory path to work.
If you want to place the scoring matrix (used by the blast algorithm) and/or the formatted fasta database in any directory but the
current working one, you must also have an ncbi configuration file:
.ncbirc on Unix:
[NCBI]
DATA = g:\code\ncbi\data
[BLAST]
BLASTDB = g:\blastdb\
ncbi.ini on Windows:
[NCBI]
DATA = /code/ncbi/data
[BLAST]
BLASTDB = /blastdb/
Where the path following “DATA =” specifies the path to find the scoring matrix file, and the path following “BLASTDB” specifies
the path to find the formatted fasta database.
The BLAST algorithm uses the BLOSUM62 scoring matrix. The file containing this matrix is required for nblast to function. The file
is named BLOSUM62, and comes with the NCBI C toolkit. It should also come with the binary distribution of nblast.
Once the configuration files have been properly set, you are ready to run nblast, detailed in the next section.
Running NBLAST
* Format the fasta database using formatdb: Before Nblast can process the protein sequences, they must be properly formatted.
This is done using the formatdb program, available from NCBI, which takes a fasta formatted database of protein sequences
and processes them into a form which can be BLASTed. Once you have downloaded formatdb, format your fasta database
using the following command:
formatdb -oT -i
Where is the file name of your fasta formatted protein sequence database. Generally this is nr, the non-redundant
protein sequence database, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.tar.gz file (at the time of this
writing, that file is ~500 MB compressed). will be used throughout the rest of this document to mean the filename of
your original fasta formatted sequence database.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

165 of 421

18/04/2005

Use NBLAST to build the skeleton database: Before nblast can compare sequences, it needs to build a skeleton of the nblastdb
table with information on how the sequences will be ordered and indexed. This is done by running nblast with the -N1
command line argument as follows:

nblast -i -d -N1
* Use Nblast to do the blast comparisons. Before the neighbour tables can be built, nblast must do the pairwise blast
comparisons. In addition to the nblast specific command line options mentioned here, you may also apply options related to
NCBI's blastall to augment the BLAST results. For small databases, it is feasible to do this step on a single computer. Single
computer execution:
nblast -i -d -e -N2
Where is the maximum evalue allowed for BLAST comparisons between sequences. BLAST comparisons
which result in evalues higher than this are not saved.
Parallel Computer Execution: Since these comparisons are computationally intensive for large databases (like nr), nblast is
capable of splitting up the task across multiple compute nodes. This is done by distributing the N.* files generated by
the last step, and the .p* files generated by formatdb, to the compute nodes, and then running nblast on each compute
node with node specific options:
nblast -i -d -e -N2 -C -D
Where is the total number of compute nodes in your parallel computing system, and is the
index (ranging from 1 to ) of the current compute node with respect to the set of nodes which the nblast
task is divided across. has the same meaning as in single computer usage.
Both single and parallel computer execution of the N2 mode of nblast results in creation of files entitled
B.* (for single computers, defaults to 1). These contain the set of blast results which that
particular compute node generated.
* Build the nblastdb table of sequence neighbours: Now the blastdb table of pairwise blast results can be used to build/fill the
nblastdb table of neighbouring sequences. The pairwise BLAST results generated during the last step are processed and
consolidated to generate the records for the neighbouring sequences table, nblastdb. Single Computer Execution:
nblast -i -d -e -N4
Parallel Computer Execution: You must first collect the database files generated on the compute nodes during the last step onto
the head node before running this step. Then run the following on the head node:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

166 of 421

18/04/2005

nblast -i -d -e -N4 -C -D
* Cleanup the database and generate number of neighbours fields: The neighbours table, nblastdb, is nearly complete. All that
needs to be done is to fill in the #Neighbours fields of the nblastdb table. This is done using the program nblastcleanup with
the following arguments:
nblastcleanup -bT -pT -aT -qT -d -n
Where is the number of blastdb's to check/build from, generally equal to the number of compute nodes you
used (1 in the single computer case).
Your Neighbours database should now be completely built.
NBLAST Update Procedure
Nblast was designed to process NCBI's non-redundant sequence database, named nr. This database contains a non-redundant list of
protein sequences and their associated GenInfo identifier numbers (GIs). The NCBI's nr database is updated on a regular basis, with
some GI's being removed, and others being added. It is desirable to keep the NeighDB module's nblastdb and blastdb tables updated
with the nr database, without having to recompute all the neighbours. This is the purpose of nblast's update procedure, which removes
GI's which have been “killed” from any entries in the nblast tables which contain them, and BLASTs and inserts “new” GI's where
they are appropriate. This process involves multiple steps, which are outlined and explained here.
* Format new version of the fasta database using formatdb: Use formatdb to format your new version of the fasta database in the
same way you did before. Note that your new fasta database should have the same name as the one you initially used to build the
nblast tables.
*

formatdb -oT -i
Run Nblast in Update mode N5: This step compares the GIs in the out of date nblastdb with those in the updated FASTA
database. It removes entries of the killed Gis from the neighbour lists of all gi's, deletes records which only exist because of the
killed GI, and inserts empty records for the newly added GIs in the nblastdb. It also creates an update file called
NBLAST_UPDATE.val which lists which GI's have been killed and which are to be added, for the next stages to use. On the head
compute node, execute:
nblast -i -N5
Run Nblast in Update mode N6: This step performs the pairwise BLAST comparisons between the newly added sequences and all
the old sequences that haven't been killed, as well as comparing the new sequences with each other. Single Computer Execution:
nblast -i -e -N6

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

167 of 421

18/04/2005

Parallel Computer Execution:
nblast -i -e -N6 -C -D
Where is the maximum allowed evalue for comparison results, as described earlier.
Rebuild the Neighbouring Sequences Table: Now rebuild the nblast database using the same command you used to build it
previously. Single Computer Execution:

nblast -i -d -e -N4
Parallel Computer Execution: First collect all the B.* files from the compute nodes onto the head node.
Then run the following command:
nblast -i -d -e -N4 -C -D
* Cleanup: The cleanup procedure is done in the same manner as for the initial build of the neighbour tables. Single Computer
Execution:
nblastcleanup -bT -pT -aT -qT -d -n
Where is the number of blastdb's to check/build from, generally equal to the number of compute nodes you
used (1 in the single computer case).
Your Neighbours database should now be properly updated.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

168 of 421

18/04/2005

nbraccess program*
Note: Not available at time of release.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

169 of 421

18/04/2005

BLASTDB table
Last updated: August 4, 2004
SeqHound
Database:
(Sequence) Neighbours
Module:
NBLAST
Source:
Derived from NxN comparison of sequences from the NR database
Source File:
from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
Parser/Application: NBLAST http://www.sourceforge.net/projects/slritools (source &
binaries)
BMC Bioinformatics http://www.biomedcentral.com/1471Published
2105/3/13/
Documentation:
Pairwise sequence alignments from NBLAST computation
Definition:
Note: The neighbours tables are precalculated on a cluster and the resulting tables are distributed in MySQL format on
our ftp site. Therefore, this section is provided for informational purposes only, or for those who would like to build
neighbours tables from there own sequence data; it is not necessary if one wishes simply to include the neighbours
module into their own seqhound instance, in which case they should simply download the precomputed tables.
Codebase
Column_Name
UID

Index
P

NULL
N

EVAL
ASNSA

N
N

MySQL
Field
rowid
uid

Type
int(11)
bigint(20)

seqhound@blueprint.org

Data_Type
INTEGER/Hash
Key
FLOAT
MEMO

Null
Yes
Yes

Default
NULL
NULL

Size
14

Column_Definition
Perfect Hash for 2
Ordinal Numbers
E-Value 12.7 Alignment Evalue
ASN.1 Modified SeqAlign for
pairwise sequence
alignment

Column_Definition
Auto incremented id
Perfect Hash for 2 Ordinal
Numbers

Version 3.3

The SeqHound Manual

eval
asnsa

decimal(12,7) Yes
mediumblob Yes

MySQL Indexes:
Keyname
iblastdb_rowid
iblastdb_uid

Type
INDEX
INDEX

*** uid ***
Description:
example:
default value:
source:
functions:

*** eval ***
Description:
example:
default value:
function:

170 of 421

NULL
NULL

18/04/2005

12.7 Alignment Evalue
Modified SeqAlign for pairwise
sequence alignment
Field
rowid
uid

64 bit Perfect Hash Key generated from 2 32 bit Ordinal Numbers in
nblastdb
none
NBLAST - a cluster computer extension of BLAST to compute NxN
sequence comparisons
ShoundGetBlastResult: Given two GIs, checks if alignment
exists, returns NBlast-result-set else NULL
SHoundGetBlastSeqAlign : Given two GIs, calls
SHoundGetBlastResult and formats the NBlast-result-set to a
SeqAlign

The BLAST e-value reported for this alignment
0.0001
none
Used by NBLAST to create the NBLASTDB table using a userspecified minimum evalue

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***asnsa***
Description:

171 of 421

18/04/2005

NBlast-Result-Set (ASN.1 definition is described in
slri/nblast/asn/NBlastasn.asn)
- Stores the useful parts of a seqalign from the BLAST computation
including the alignment, bitscore, evalue

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

172 of 421

18/04/2005

NBLASTDB table
Last updated: August 4, 2004
SeqHound
Database:
(Sequence) Neighbours
Module:
NBLAST
Source:
Derived from NxN comparison of the NR database from
Source File:
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
Parser/Application: NBLAST http://www.sourceforge.net/projects/slritools (source &
binaries)
BMC Bioinformatics http://www.biomedcentral.com/1471Published
2105/3/13/
Documentation:
Sequence neighbour lists from NBLAST computation
Definition:
Codebase
Column_Name
Index NULL Data_Type
Size
Column_Definition
ORD
P
N
INTEGER
10
The ordinal number of
Ordinal
this entry (autoincrement)
Number
GI
I
N
INTEGER
10
The GenInfo Identifier
GenInfo Id
for the sequence for
which there are
neighbours in the list
NUM
I
N
INTEGER# 10
Number of neighbours
in ASN.1 structure
Neighbours
ASNNBR
N
N
MEMO
ASN.1
ASN.1 structure
containing the list of GIs
whose sequences
neighbour that of this
entries GI, and the
corresponding evalues
of the alignments.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

173 of 421

MySQL
Field
rowid
ord

Type
int(11)
int(11)

Null
No
No

int(11)

num

int(11)

asnnbr

mediumblob

MySQL Indexes:
Keyname
inblastdb_rowid
inblastdb_ord
inblastdb_gi
inblastdb_num

Type
INDEX
INDEX
INDEX
INDEX

***ord***
Description:

Example:
Default Value:
Function:

Default

18/04/2005

Column_Definition
Auto incremented id
The ordinal number of this entry
The GenInfo Identifier for the
sequence for which there are
neighbours in the list
Number of neighbours in ASN.1
structure
ASN.1 structure containing the
list of GIs whose sequences
neighbour that of this entries GI,
and the corresponding evalues of
the alignments.
Field
rowid
ord
gi
num

The ordinal value of the entry in the database (starts with 1). The
ordinal values from two GIs in a pairwise alignment are perfectly
and reversibly hashed into/from the BLASTDB UID value. This
approach saves on space required to store two 32 bit integers.
1
NA

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***gi***
Description:
Example:
Default Value:

174 of 421

18/04/2005

The GenInfo Identifier of the query sequence
2495000
NA

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

API Function:

175 of 421

18/04/2005

SHoundNeighboursFromGiEx: Given GI, an evalue cutoff and
a limit of 100 returned results, returns FLinkSet containing list of
neighbour GIs and their alignment e-value
Other functions that use SeqHound Functionality (Redundant
Groups & Taxonomy Protein Lists)
SHoundNeighboursFromGi: Calls
SHoundNeighboursFromGiEx, and if the GI is not found,
searches through the list of redundant GIs for the respective
sequence to find an equivalent GI for which there is neighbour
information, returning the ShoundNeighboursFromGiEx results for
that GI.
SHoundNeighboursFromGiList: Calls
SHoundNeighboursFromGi with each GI in a valnode list
SHoundNeighboursFromTaxID: Calls
ShoundNeighboursFromGiList for each GI in a given
taxonomy (SHoundProteinsFromOrganism)
SHoundNeighboursOfNeighbours: Fetches the neighbours of
supplied GI and each of their neighbours (limit 100)
SHoundNeighboursOfNeighboursList: Fetches the
neighbours of supplied GI list and each of their neighbours
SHoundGiAndNumNeighboursList: returns a list of all the
GIs with more than 0 neighbours
SHoundNumNeighboursInDB: Fetches the number of
neighbours in the nblastdb

***num***

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

Description:
Example:
Default Value:
Function:
***asnnbr***
Description:

Functions:

176 of 421

18/04/2005

Quick lookup value to find out how many neighbours are in the
ASN.1 structure. Can also be used to sort the GIs with most/least
neighbours
range of integer values
0

NBlast-GiAndEval-set (ASN.1 definition is described in
slri/nblast/asn/NBlastasn.asn)
stores the GI and it's list of sequence neighbours (as GI, evalue
pairs)
see above for GI

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

177 of 421

18/04/2005

Locus link functional annotations (lldb) module
llparser
Last updated: August 5, 2004
purpose:
The llparser parses the LocusLink data files to create and update a set of tables that
correlate curated sequence data and descriptive information about genetic loci. It retrieves
information on official nomenclature, aliases, sequence accessions, phenotypes, EC
numbers, MIM numbers, UniGene clusters, homology and map locations.
module: lldb
input files:
LL_tmpl
From ftp://ftp.ncbi.nih.gov/refseq/LocusLink/
tables altered:
ll_cdd, ll_go, ll_llink, ll_omim
source code location:
slri/seqhound/locuslink/llparser.c
config file dependencies:
The relevant configuration file is:
slri/seqhound/parsers/.intrezrc (for Unix platforms) or
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;locus link functional annotations
lldb = 1

Version 3.3

The SeqHound Manual

178 of 421

18/04/2005

associated scripts:
see “slri/seqhound/scripts/llftp.pl”
This script retrieves the input file or this parser.
error and run-time logs:
llparser writes to a log file called “llparserlog” where it writes Time Stamp, Error #,
goparser.c line # and cause of the problem.
e.g.
=================[ May 21, 2003 3:49 PM ]==================
ERROR: [000.000] {llparser.c, line 81} Main: Cannot find LL_tmpl file.

troubleshooting:
additional info:
The LocusLink web page is at NCBI http://www.ncbi.nlm.nih.gov/LocusLink/
The LocusLink README file describes the input file for this parser
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/
See data table descriptions for each of the tables in the lldb module.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

179 of 421

18/04/2005

addgoid parser
Last updated: August 5, 2004
purpose:
Correlates sequence record gi's with GO annotation identifiers.
The Gene Ontology flat file parser adds information to the supplements ll_go table. This
information correlates sequence records GI’s with GO annotation identifiers. This
information is retrieved from the gene_association.compugen.Genbank and
gene_association.compugen.Swissprot files from www.geneontology.org.
Other GO annotation data is available at this site but it is not currently incorporated into
SeqHound. This is actively being worked on at the time of writing.
Note that the input files used here do not contain PubMed Identifiers.
This parser is dependent on the following tables, asndb, parti, accdb and nucprot. This
parser is also dependent on SeqHound API.
For this reason, the mother parser must be run before using the addgoid parser.
module: lldb
input files:
gene_ association.compugen.Genbank
gene_association.compugen.Swissprot
from ftp://ftp.geneontology.org/pub/go/gene-associations/
tables altered:
ll_go
source code location:
slri/seqhound/locuslink/addgoid.c
config file dependencies:
The relevant configuration file is:
slri/seqhound/parsers/.intrezrc (for Unix platforms) or
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;locus link functional annotations
lldb = 1

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

180 of 421

18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your
SeqHound set-up. Variables username, password, dsn, database in section
[datab] should have the same values as USER, PASSWORD, DSN and DATABASE
respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections]
lldb should be 1.
command line parameters:
Typing “./addgoid –“ at the command line while in the directory where addgoid
resides will return a list of command line parameters and default settings. For example:
> ./addgoid –
pdbrep arguments:
-i Input file [File In]
associated scripts:
see /slri/seqhound/goftp.pl
The script retrieves the files used as input by the addgoid parser. This script also
retrieves the three input files required by the goparser (see godb module).
error and run-time logs:
addgoid parser writes to a log file called addgoidlog where it writes Time Stamp, Error #,
goparser.c line # and cause of the problem. For example
==================[ May 21, 2003 4:14 PM ]==================
NOTE: CoreLib [002.003] {ncbifile.c, line 624} FileOpen("gene_association.compugen.Swissprot","r")
failed

troubleshooting:
additional info:
The NCBI LocusLink web page is at
http://www.ncbi.nlm.nih.gov/LocusLink/.
The Gene Ontology Consortium documentation at is at :http://www.geneontology.org/..
See data table descriptions for each of the tables that are listed under the lldb module.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

181 of 421

18/04/2005

ll_omim table
Last updated: August 5, 2004
seqhound
Database
ll_omim
Table
lldb
Module
OMIM Online Mendelian Inheritance in Man.
Definition
Online Mendelian Inheritance in Man (OMIMTM) is a continuously
updated catalog of human genes and genetic disorders. OMIM focuses
primarily on inherited, or heritable, genetic diseases. It is also
considered to be a phenotypic companion to the human genome project.
OMIM is based upon the text Mendelian Inheritance in Man, authored
and edited by Dr. Victor A. McKusick and a team of science writers and
editors at Johns Hopkins University and elsewhere.
MySQL
Field
Type
Null Default
Column_Definition
rowid
int(11)
No
Auto incremented id
ll_id
int(11)
No
0
Locus Link Identifier
omim_id
int(11)
No
0
OMIM Identifier.
MySQL Indexes
Keyname
Type
Field
ll_id
PRIMARY
PRIMARY
omim_id
illomim_rowid
INDEX
rowid
illomim_llid
INDEX
ll_id
illomim_omimid
INDEX
omim_id
Observation:
Organization:
Source db:

*** ll_id ***
description:
example:
default value:
source:

parser:

NCBI
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

Locus link record identifier.
1
0
First field in each locus link record. Eg:
>>1
is the record with locus link id 1.
llparser.c --> ll_parser.c

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
function:
API:
*** omim_id ***
description:
example:
default value:
source:
parser:
function:
API:

182 of 421

18/04/2005

LL_ParseFile() --> LL_LineParser
db_layer: LL_Append2OMIM_DB()
SHoundLLIDFromOMIM()

OMIM Online Mendelian
138670
n/a
After tag 'OMIM: '
llparser.c --> ll_parser.c
LL_LineParser() --> LL_ParseOMIM()
db_layer: LL_AppendRecord() --> LL_Append2OMIM_DB()
SHoundOMIMFromLLID()
SHoundOMIMFromGi()
SHoundOMIMFromGiList()

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

183 of 421

18/04/2005

ll_go table
Last updated: August 5, 2004
seqhound
Database:
ll_go
Table:
lldb
Module:
Associates a locus link identifier with the gene ontology (GO)
Definition:
annotation.
MySQL
Field
Type
Null Default
Column_Definition
int(11)
No
Auto incremented id
rowid
int(11)
No
0
Locus Link Identifier
ll_id
int(11)
Yes NULL
Gene Ontology ID.
go_id
int(11)
Yes NULL
Pub Med ID.
pmid
varchar(50)
Yes NULL
Evidence Code.
evidence
MySQL Indexes
Keyname
Type
Field
illgo_rowid
INDEX
rowid
illgo_llid
INDEX
ll_id
illgo_goid
INDEX
go_id
illgo_pmid
INDEX
pmid
Observation.:
Organization:
Source file:
Parser:

NCBI
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
ll_parser.c

*** ll_id ***
description:
example:
default value:
source:
parser:
function:
API:

Locus link identifier
2
n/a
see above
ll_parser.c -> ll_parser.c
LL_LineParser() -> LL_ParseNPUnit
SHoundLLIDFromGOIDAndECode ()

***go_id***
description:

Gene Ontology Identifier.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
example:
default value:
source:

parser:
function:
API:

***pmid***
description:
example:
default value:
source:

184 of 421

18/04/2005

5717
n/a
after tag GO, 4th element
e.g.:
GO: cellular
component|extracellular|IDA|GO:0005574|GOA|na
ll_parser.c or ll_parser.c
LL_LineParser() -> LL_ParseGOUnit
SHoundGOIDFromLLID()
SHoundGOIDFromGi()
SHoundGOIDFromRedundantGi
SHoundGOIDFromGiList
SHoundGOIDFromRedundantGiList

PubMed Identification.
3458201
n/a
after tag GO, 6th element
e.g.:
GO: cellular component|extracellular|IDA|GO:0005574|GOA|345201

parser:
function:
API:
***evidence***
description:

example:

default value:
source:

parser:

llparser.c -> ll_parser.c
LL_LineParser() -> LL_ParseGOUnit
SHoundGOPMIDFromGiAndGOID()

Every GO annotation must indicate the type of evidence that supports
it; these evidence codes correspond to broad categories of experimental
or other support. http://www.geneontology.org/GO.evidence.html
IC: Inferred by Curator.
ND: No biological Data available.
TAS: Traceable Author Statement.
IEA: Inferred from Electronic Annotation.
n/a
After a 'GO:' line, it's the 3rd. field element
eg:
GO: cellular
component|extracellular|IDA|GO:0005574|GOA|34520
1
llparser.c -> ll_parser.c

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
function:
API:

185 of 421

18/04/2005

LL_LineParser()-> LL_ParseGOUnit()
SHoundGOECodeFromGiAndGOID()

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

186 of 421

18/04/2005

ll_llink table
Last updated: August 5, 2004
seqhound
Database:
ll_llink
Table:
lldb
Module:
Associates a locus link id with a sequence identifier
Definition:
MySQL
Field
Type
Null Default
Column_Definition
rowid
int(11)
No
Auto incremented id
ll_id
int(11)
No
0
Locus Link Identifier.
gi
int(11)
No
0
Gene Info Identifier
map
text
Yes NULL
Gene location in chromosome.
MySQL Indexes
Keyname
Type
Field
PRIMARY
PRIMARY
ll_id
gi
illll_rowid
INDEX
rowid
illll_llid
illll_gi
Observation.:

Organization:
Source file:

INDEX
INDEX

ll_id
gi

Maps a locus link id with a gi and its location in the chromosome. Not
all locus link ids will have a gi. A gi may be specified as an NP gi or a
XP gi ( experimentally determined ). If an NP gi is available, it will be
used. If an NP gi is not available, then the XP gi will be used. If neither
the NP nor XP gi is available, then no gi will be used.
NCBI
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

*** ll_id ***
description:
example:
default value:
source:

parser:
function:

seqhound@blueprint.org

Locus Link Identifier.
2
n/a
start of record
eg.
>>2
llparser.c --> ll_parser.c
LL_AppendLLID_DB()
LL_ParseFile() -> LL_LineParser()
Version 3.3

The SeqHound Manual
API:

***gi***
description:
example:
default value:
source:

parser:
function:
API:

***map***
description:
example:
default value:
source:
parser:
function:
API:

seqhound@blueprint.org

187 of 421

18/04/2005

SHoundLLIDFromGi()
SHoundLLIDFromGiList()

Gene Info Identifier
21071030
n/a
in the NP tag, 2nd element
eg
NP: NP_570602|21071030
llparser.c --> ll_parser.c
LL_ParseFile() --> LL_LineParser() -->
LL_Append2LLID_DB()
SHoundGiFromLLID()
SHoundGiFromLLIDList()

Location of the gene in the Chromosome
19q13.4
n/a
in the MAP tag: 1st element
MAP: 19q13.4|RefSeq|C|
llparser.c --> ll_parser.c
LL_ParseFile() --> LL_LineParser()
LL_Append2LLID_DB()
SHoundLocusFromGi()

Version 3.3

The SeqHound Manual

188 of 421

18/04/2005

ll_cdd table
Last updated: August 5, 2004
seqhound
Database:
ll_cdd
Table:
lldb
Module:
Conserved Domain Database.
Definition:
Maps a locus link id to a Conserved Domain ID.
Proteins often contain several modules or domains, each with a distinct
evolutionary origin and function. The CDD database may be used to
identify the conserved domains present in a protein sequence.
Conserved Domains can be summarized with multiple local sequence
alignments. Computational biologists have compiled collections of
such alignments representing conserved domains, and LocusLink
imports them from three major sources:
* Smart, the Simple Modular Architecture Research Tool
* Pfam (UK), Pfam-A seed alignments
* COG (Clusters of Orthologous Groups)
Smart and Pfam are public domain databases, which are offered in
combination with HMM-based search engines and alignment
visualization services.
COG is an NCBI-curated protein classification resource. Sequence
alignments corresponding to COGs are created automatically from
constituent sequences and have not been validated manually for import
in CDD.
The default e-value cutoff of for data in this table is 0.01
MySQL
Field
rowid
ll_id
cdd_id
evalue
MySQL Indexes
Keyname
illcdd_rowid

Type
Null
int(11)
No
int(11)
No
varchar(50)
Yes
decimal(20,10)Yes

Default
0
NULL
NULL

Column_Definition
Auto incremented id
Locus Link Identifier
CDD Source Database.
Descriptive match.

Type
INDEX

Field
rowid

illcdd_llid

INDEX

ll_id

illcdd_cddid

INDEX

cdd_id

Observation.:

Domains can be thought of as distinct functional and/or structural units
of a protein. These two classifications coincide rather often, as a matter
of fact, and what is found as an independently folding unit of a

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

Organization:
Source db:

*** ll_id ***
description:
example:
default value:
source:

parser:
function:
API:

***cdd_id***
description:
example:
default value:
source:

parser:
function:
API:

***evalue***
description:

189 of 421

18/04/2005

polypeptide chain also carries specific function. Domains are often
identified as recurring (sequence or structure) units, which may exist in
various contexts. e.g.: of the CDD Record.
CDD: pfam00207: Alpha-2-macroglobulin
family|5952|2318|na|8.970050e+02
CDDID: pfam00207
Evalue: 8.970050e+02
NCBI
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz

Locus link identifier
2
n/a
start of record
eg.
>>2
llparser.c --> ll_parser.c
LL_AppendLLID_DB()
LL_ParseFile() -> LL_LineParser()
SHoundLLIDFromCDDID()
SHoundLLIDFromGi()
SHoundLLIDFromGiList()

Conserved Domain Database Identifier
pfam00207
n/a
after CDD tag: 1st element
CDD: pfam00207: Alpha-2-macroglobulin
family|5952|2318|na|8.970050e+02
llparser.c --> ll_parser.c
LL_ParseNPUnit()
LL_ParseFile() --> LL_LineParser()
SHoundCDDIDFromGi()
SHoundCDDIDFromGiList()
SHoundCDDIDFromLLID()

Describes match between CDD and the sequence

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

190 of 421

18/04/2005

example:
default value:
source:

8.970050e+02
0
after CDD tag: last element
CDD: pfam00207: Alpha-2-macroglobulin
family|5952|2318|na|8.970050e+02

parser:
function:

llparser.c --> ll_parser.c
LL_ParseNPUnit()
LL_ParseFile() --> LL_LineParser()
SHoundCDDScoreFromGi()

API:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

191 of 421

18/04/2005

GENE module
parse_gene_files.pl parser
Last updated September 27, 2004
purpose:
parse_gene_files.pl parses 4 files from the NCBI Gene database and populates gene_*
tables in SeqHound.
logic:
parse_gene_files.pl first drops the gene_* tables and then recreates them. It then reads the
files generated by gene_cron.pl to populate the new tables. In the future, the tables will
not be dropped upon update.
module: gene
input files:
ftp://ftp.ncbi.nih.gov/gene/DATA
gene2refseqUniq
gene_infoUniq
gene_historyUniq
gene2pubmedUniq
tables altered: gene_dbxref, gene_genomicgi, gene_history, gene_info, gene_productgi,
gene_pubmed, gene_object, gene_synonyms.
source code location:
/seqhound/gene/parse_gene_files.pl
/seqhound/scripts/gene_cron.pl
config file dependencies:
The relevant configuration file is:
slri/seqhound/config/.intrezrc (for Unix platforms)
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

192 of 421

18/04/2005

command line parameters:
example use:
perl parse_gene_files.pl

associated scripts:
Four files are downloaded from ftp://ftp.ncbi.nih.gov/gene/DATA by gene_cron.pl:
gene2refseq.gz
gene_info.gz
gene_history.gz
gene2pubmed.gz.
gene_cron.pl unzips the files and makes sure that each file only contains unique records
generating the following files:
gene2refseqUniq
gene_infoUniq
gene_historyUniq
gene2pubmedUniq
error and run-time logs:
parse_gene_files.log
troubleshooting:
additional info:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html
Note that the NCBI gene database is experimental. This means that the
input files and the SeqHound tables may change.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

193 of 421

18/04/2005

gene_dbxref table
Last updated September 28, 2004
SeqHound
Database:
gene_dbxref
Table:
GENE
Module:
The table holds cross references to other databases for GeneIds and their
Definition:
associated gis.
MySQL
Field
id

Type
Int(11)

Null
No

geneinfoid

Int(11)

dbname

Varchar(255) No

dbaccession
Varchar(30)
MySQL Indexes
Keyname
Type
igenedbxrefs_id
INDEX
igenedbxrefs_infoid
INDEX
igenedbxrefs_db_name INDEX
igenedbxrefs_db_access INDEX
Observation:
Source org:
Source file:
FTP script:
Parser:

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

Default

Column_Definition
Identifier for a cross reference.
Id of the gene_info record that
this cross reference refers to.
Database name
Accession.
Field
id
geneinfoid
dbname
dbaccess

NCBI
gene_info.gz
gene_cron.pl
parse_gene_files.pl

Identifier for the cross reference. Mysql auto-increment table.
1

parse_gene_files.pl
not available yet

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***geneinfoid***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***dbname***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***dbaccess***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

194 of 421

18/04/2005

Identifier for the geneinfo record where this cross reference was found.
53577

parse_gene_files.pl
not available yet

Name of the database that contains the second reference.
SGD

parse_gene_files.pl
not available yet

Accession for the record that contains the name.
S0000572

parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

195 of 421

18/04/2005

gene_genomicgi table
Last updated September 29, 2004
SeqHound
Database:
gene_genomicgi
Table:
GENE
Module:
Table that contains the gis of genomic DNAs that contain genes. The
Definition:
start, stop location and orientation on the genomic DNA are also
included.
MySQL
Field

Type

Null

int(11)

geneobjectid

int(11)

start

int(11)

Yes

The start location on the genomic
DNA.

end

int(11)

Yes

The start location on the genomic
DNA.

orientation

char(1)

Yes

The orientation on the genomic
DNA.

MySQL Indexes
Keyname
igenomic_id
igenomic_objectid
igenomic_gi

Observation:
Source org:
Source file:
FTP script:
Parser:

Type

Default

Column_Definition
Identifier of a product gi. Mysql
auto-increment column.
Identifier of the geneobject that
this genomic gi refers to.
The genomic gi.

Field
id
geneobjectid
gi

NCBI
gene_info.gz
gene_cron.pl
parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***geneobjectid***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***gi***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***start***
description:
example:
default value:
ASN.1 struct:

seqhound@blueprint.org

196 of 421

18/04/2005

Identifier of a genomic gi.
1

parse_gene_files.pl
not available yet

The gene object that this gi refers to.

parse_gene_files.pl
not available yet

The genomic gi.
10954454

parse_gene_files.pl
not available yet

The start position of the gene on the genomic DNA.
348

Version 3.3

The SeqHound Manual
source:
parser:
function:
API:
***end***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***orientation***
description:

example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

seqhound@blueprint.org

197 of 421

18/04/2005

parse_gene_files.pl
not available yet

The end position of the gene on the genomic DNA.
1190

parse_gene_files.pl
not available yet

The orientation of the gene on the genomic DNA.
- meaning "minus" strand or
+ meaning "plus" strand
-

parse_gene_files.pl
not available yet

Version 3.3

The SeqHound Manual

198 of 421

18/04/2005

gene_history table
Last updated September 29,2004
SeqHound
Database:
gene_history
Table:
GENE
Module:
This table has information about geneids that are no longer current.
Definition:
MySQL
Field

Type

Null

int(11)

taxid

int(11)

Yes

currentgeneid

int(11)

Yes

oldgeneid

int(11)

Yes

oldsymbol

varchar(20)

MySQL Indexes
Keyname
igenehistory_id
current_gene_id
discont_gene_id
discont_symbol

Type
Index
Index
Index
Index

Observation:
Source org:
Source file:
FTP script:
Parser:

Yes

Default

Column_Definition
Identifier for a history record.
Mysql auto-increment column.
NCBI taxonomy identifier.
The current NCBI Gene Id for
this gene.
The discontinued Gene Id.
The symbol assigned to the
discontinued Gene Id, if the
discontinued record was not
replaced with another.
Field
id
currentgeneid
oldgeneid
oldsymbol

Note that all fields except for id are optional.
NCBI
gene_history.gz
gene_cron.pl
parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***taxid***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***currentgeneid**
*
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***oldgeneid***
description:
example:
default value:
ASN.1 struct:

199 of 421

18/04/2005

Identifier for the history record. Mysql auto-increment column.
212815

parse_gene_files.pl
not available yet
The NCBI taxonomy identifier for this gene..
10116

parse_gene_files.pl
not available yet

The current NCBI gene id.
29666

parse_gene_files.pl
not available yet

The discontinued NCBI gene id.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
source:
parser:
function:
API:
***oldsymbol***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

200 of 421

18/04/2005

parse_gene_files.pl
not available yet

The symbol associated with the discontinued gene id, if no new record
replaced the discontinued record.
hlyB

parse_gene_files.pl
not available yet

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

201 of 421

18/04/2005

gene_info table
Last updated September 29, 2004
SeqHound
Database:
gene_info
Table:
GENE
Module:
Information about a gene.
Definition:
MySQL
Field
id

Type
int(11)

Null
No

geneobjectid

int(11)

symbol

varchar(255)

Yes

The default symbol for this gene.

locustag

varchar(255)

Yes

The LocusTag for this gene.

chromosome

varchar(32)

Yes

The chromosome where this gene
is found.

maplocation

varchar(255)

Yes

description

mediumtext

Yes

MySQL Indexes
Keyname
igeneinfo_id
object_id
symbol
locus_tag
chr

Type
Index
Index
Index
Index
Index

Observation:
Source org:
Source file:
FTP script:
Parser:

Default

Column_Definition
Identifier for this record.
geneobjectid association with this
information.

The map location of this gene on
the chromosome.
A description of this gene.

Field
id
geneobjectid
symbol
locustag
chromosome

NCBI
gene_info.gz
gene_cron.pl
parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***geneobjectid**
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***symbol***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***locustag***
description:
example:
default value:
ASN.1 struct:

202 of 421

18/04/2005

Identifier for this record.
74058

parse_gene_files.pl

The geneobjectid for the gene that this information refers to.
73870

parse_gene_files.pl

The symbol for this gene.
1A981
gene->locus (NCBI says this is where the data comes from).
parse_gene_files.pl

The locus tag for this gene.
1A981
gene->locus-tag (NCBI says this is where the data comes from).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
source:
parser:
function:
API:
***chromosome***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***maplocation***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***description***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

203 of 421

18/04/2005

parse_gene_files.pl

The chromosome where this gene is found.
I

parse_gene_files.pl

The map location of this gene on the chromosome.
I;-19.13 cM (interpolated genetic position)

parse_gene_files.pl

A description of this gene.
putative protein (69.3 kD) (1A981)

parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

204 of 421

18/04/2005

gene_object table
Last updated September 29,2004
SeqHound
Database:
gene_object
Table:
GENE
Module:
This table contains information on the status of gene records.
Definition:
MySQL
Field
id
geneid

Type
int(11)
int(11)

status

varchar(64))

MySQL Indexes
Keyname
igeneobject_id
igeneobject_geneid
igeneobject_status

Type
Index
Index
Index

Observation:
Source org:
Source file:
FTP script:
Parser:

Null
No
No
Yes

Default

Column_Definition
Identifier for this gene.
NCBI's Gene Id.
Status of the RefSeq for this gene.
May be provisional, INFERRED,
MODEL, PREDICTED,
Reviewed or VALIDATED

Field
id
geneid
status

NCBI
gene2refseq.gz
gene_cron.pl
parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***geneid**
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***status***
description:

example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

205 of 421

18/04/2005

Identifier for this gene.
8

parse_gene_files.pl
not available yet

NCBI's Gene Id.
1246510

parse_gene_files.pl
not available yet

The status of the RefSeq record that contains this gene.
NULL
INFERRED
MODEL
PREDICTED
Provisional
Reviewed
VALIDATED
Provisional

parse_gene_files.pl
not available yet

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

206 of 421

18/04/2005

gene_productgi table
Last updated September 29, 2004
SeqHound
Database:
gene_productgi
Table:
GENE
Module:
Stores the gis of the RNA and Proteins produced by a gene.
Definition:
MySQL
Field

Type

Null

int(11)

geneobjectid

int(11)

gi
MySQL Indexes
Keyname
iproduct_id
iproduct_objected
iproduct_gi

int(11)

Type
Index
Index
Index

Observation:
Source org:
Source file:
FTP script:
Parser:

***id***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

Default

Column_Definition
Internal identifier for the product
gi. Mysql auto-increment column.
The geneobjectid for the gene that
this gi belongs to.
The gi.
Field
id
geneobjectid
gi

NCBI
gene2refseq.gz
gene_cron.pl
parse_gene_files.pl

Internal identifier for this gi. Mysql auto-increment column.
9

parse_gene_files.pl
not available yet

***geneobjectid***

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***gi***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

207 of 421

18/04/2005

NCBI's identifier for the gene that this gi belongs to.
10

parse_gene_files.pl
not available yet

The gi. May be a protein or an RNA gi.
32455275

parse_gene_files.pl
not available yet

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

208 of 421

18/04/2005

gene_pubmed table
Last updated September 29, 2004
SeqHound
Database:
gene_pubmed
Table:
GENE
Module:
This table stores publications that refer to genes.
Definition:
MySQL
Field

Type

Null

int(11)

geneobjectid

int(11)

Yes

pmid
MySQL Indexes
Keyname
ipubmed_id
ipubmed_geneid
ipmid

int(11)

Yes

Observation:
Source org:
Source file:
FTP script:
Parser:

***id***
description:
example:
default value:
ASN.1 struct:
source:
source:
parser:
function:
API:

Type
Index
Index
Index

Default

Column_Definition
Internal identifier for this
reference.
The geneobjectid that this
reference is about.
The pmid of the reference.
Field
id
geneobjectid
pmid

NCBI
gene2pubmed.gz
gene_cron.pl
parse_gene_files.pl

Internal identifier of this reference or of this gene objectid-pmid pair
112394

parse_gene_files.pl
not available yet

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
***geneobjectid***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***pmid***
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

209 of 421

18/04/2005

The geneobjectid of the gene.
173392

parse_gene_files.pl

The pmid of the reference.
8889548

parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

210 of 421

18/04/2005

gene_synonyms table
Last updated September 28, 2004
SeqHound
Database:
gene_synonyms table
Table:
GENE
Module:
Stores synonyms for genes.
Definition:
MySQL
Field

Type

Null

int(11)

geneinfoid

int(11)

Yes

Column_Definition
Identifier of this synonym, Mysql
auto-increment column
The geneinfoid for this synonym.

synonym
MySQL Indexes
Keyname
igenesyn_id
igeneinfoid
isynonym

text

Yes

The synonym.

Observation:
Source org:
Source file:
FTP script:
Parser:

***id*
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

Type
Index
Index
Index

Default

Field
id
geneinfoid
synonym

NCBI
gene_info.gz
gene_cron.pl
parse_gene_files.pl

Internal identifier for this synonym geneinfoid pair. Basically a rowid.
Mysql auto-increment column.
11179

parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
***geneinfoid
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:
***synonym
description:
example:
default value:
ASN.1 struct:
source:
parser:
function:
API:

211 of 421

18/04/2005

The geneinfoid for the record where this synonym was found.
27686

parse_gene_files.pl

The synonym.
T1J24.19

parse_gene_files.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

212 of 421

18/04/2005

Gene Ontology hierarchy (godb) module
goparser
Last updated: August 5, 2004
purpose:
Go data files contain information from the Gene Ontology Consortium that describe
controlled vocabularies for the description of the molecular function, biological process
and cellular component of gene products.
The GO parser is responsible for generating the hierarchical gene ontology datafiles:
go_name, go_parent, go_synonym and go_reference. This is accomplished by parsing
the flat files component.ontology, function.ontology and process.ontology.
module: godb
input files:
component.ontology
function.ontology
process.ontology.
From ftp://ftp.geneontology.org/pub/go/ontology/
tables altered:
go_name, go_parent, GO_REF, GO_SYN
source code location:
slri/seqhound/go/goparser.c
config file dependencies:
The relevant configuration file is:
slri/seqhound/parsers/.intrezrc (for Unix platforms) or
The relevant section of the configuration file is:
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[sections]
;gene ontology hierarchy
godb = 1

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

213 of 421

18/04/2005

Text in italics must be changed for the .intrezrc file to function correctly with your
SeqHound set-up. Variables username, password, dsn, database in section
[datab] should have the same values as USER, PASSWORD, DSN and DATABASE
respectively in your .odbc.ini (see Step 10 in section 4.4.) In section [sections]
godb should be 1.
command line parameters:
This parser does not have any command line parameters. Typing “./goparser” will
process the three Gene Ontology files.
These files (component.ontology, function.ontology and process.ontology) must be
present in the same directory as the compiled goparser executable
associated scripts:
see “/slri/seqhound/goftp.pl”
This script retrieves the three input files required by the goparser. The script also
retrieves the files used as input by the addgoid parser (see lldb module).
error and run-time logs:
goparser writes to a log file called “goparserlog” where it writes Time Stamp, Error #,
goparser.c line # and cause of the problem.
e.g.
================[ May 21, 2003 2:36 PM====================
ERROR: [000.000] {goparser.c, line 74} Main: Cannot find function.ontology file
in current directory.

troubleshooting:
additional info:
The Gene Ontology Consortium documentation is located at
http://www.geneontology.org/
See data table descriptions for each of the tables that are listed under SeqHound's Data
Dictionary.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

214 of 421

go_parent table
Last updated: August 5, 2004
Seqhound
Database:
go_parent
Table:
Godb
Module:
Definition:
MySQL
Field
Type
Null
rowid
int(11)
No
go_id
int(11)
No
parent_goid
int(11)
No
MySQL Indexes
Keyname
Type

Default
0
0

18/04/2005

Column_Definition
Auto number row identifier
GeneOntology ID
Parent Go Id..

PRIMARY

igoparent_rowid

INDEX

Field
go_id
parent_goid
rowid

igoparent_goid

INDEX

go_id

igoparent_pid

INDEX

parent_goid

Observation.:

Organization:
Source db:

***go_id***
description:
example:

Function, process and component are represented as directed acyclic
graphs (DAGs) or networks. The difference between a DAG and a
hierarchy is that in the latter each child can only have one parent; a
DAG allows a child to have more than one parent. A child term may be
an "instance" of its parent term(isa relationship) or a component of its
parent term (part-of relationship). A child term may have more than one
parent term and may have a different class of relationship with its
different parents.
Gene Ontology (ftp://ftp.geneontology.org/pub/go/ontology)
function.ontology (Molecular function)
process.ontology (Biological process)
component.ontology (Cellular component)

Gene Ontology identifier Child.
0016172
%antifreeze activity ; GO:0016172 {Parent}
%ice nucleation inhibitor activity ; GO:0016173

default value:
source:

{Child}

n/a
related to the indentation in the file.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
parser:
function:

215 of 421

18/04/2005

API:

goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Name()->
GO_AppendParent()
GO_GetParentOf()
GO_GetChildrenOf()
GO_GetAllChildren()
GO_GetAllAncestors()
SHoundGODBGetChildrenOf()

***parent_goid***
description:
example:

Gene Ontology identifier Parent.
0016173

default value:
source:
parser:
function:

API:

%antifreeze activity ; GO:0016172
{Parent}
%ice nucleation inhibitor activity ; GO:0016173

{Child}

n/a
Related to indentation in the file
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Parent()
GO_GetParentOf()
GO_GetChildrenOf()
GO_GetAllChildren()
GO_GetAllAncestors()
SHoundGODBGetParentOf()

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

216 of 421

go_name table
Last updated: August 5, 2004
seqhound
Database:
go_name
Table:
godb
Module:
Definition:
MySQL
Field
Type
rowid
int(11)
go_id
int(11)
go_name
text

Null
No
No

Default
0

18/04/2005

Column_Definition
Auto incremented id
GeneOntology ID.
Go_id Identifier name.

go_db

int(11)

GO Database name.

go_level
MySQL Indexes
Keyname

int(11)

Hierarchy level of the GO ID.

PRIMARY

igoname_rowid
igoname_goid

INDEX
INDEX

Observation.:

Organization:
Source db:

***go_id***
description:
example:
default value:
source:
parser:
function:

Type

Field
go_id
go_name
go_db
go_level
rowid
go_id

This table contains a list of 'go_id' and it's molecular function name as
well the database where the go_id was parsed from and the level in the
Gene Ontology hierarchy.
GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)
function.ontology (Molecular function)
process.ontology (Biological process)
component.ontology (Cellular component)

Gene Ontology identifier.
15643
n/a
See example Go File. The GO id follows 'GO:' tag.
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Name()

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

API:

***go_name***
description:
example:
default value:
source:
parser:
function:

API:
***go_db***
description:
example:

217 of 421

18/04/2005

GO_GetRecordByID()
SHoundGOIDFromRedundantGi
SHoundGOIDFromRedundantGiList
SHoundGOIDFromGi
SHoundGOIDFromGiList

Gene Ontology identifier name.
anti-toxin;
n/a
go_name follows, '%' and proceeds ';GO:'.
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Name()
GO_GetNameByID()
SHoundGODBGetNameByID

API:

Gene Ontology Database file. This is an integer 1, 2 or 3
1
where:
1 GO_MOLFUNCTION
2 GO_BIOPROCESS
3 GO_CELLCOMPONEN
n/a
NA
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Name()
GO_GetClassification()
SHoundGODBGetClassification()

***go_level***
description:

Hierarchy level of the GO ID.

default value:
source:
parser:
function:

This is not used because the same GO ID can appear at
different levels making the process to determine its level
irrelevant. This was intended to be used to indicate the
distance of a given GO node from the root node.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual
example:
default value:
source:

218 of 421

18/04/2005

3
n/a
Related to the indentation of
'%{function name}'

parser:
function:

API:

goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Name()
GO_GetHierarchyLevel()
SHoundGODBGetHierarchyLevel() this function is deprecated.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

219 of 421

go_reference table
Last updated: August 5, 2004
seqhound
Database:
go_reference
Table:
godb
Module:
Definition:
MySQL
Field
Type
Null
rowid
int(11)
No
go_id
int(11)
No

Default
0

18/04/2005

Column_Definition
Auto incremented id
GeneOntology ID

go_ref

text

go_ref_db
MySQL Indexes
Keyname

varchar(50)

Yes

PRIMARY

igoref_rowid

INDEX

rowid

igoref_go_ref_db

INDEX

go_ref_db

Observation.:

Organization:
Source db:

***go_id***
description:
example:
default value:
source:
parser:
function:

Type

GeneOntology Reference.
NULL

Reference Data Base Name
Field
go_id
go_ref

This table stores the 'go_id' with its database cross-reference. The
cross-reference can be an external database identifier that points to
something that is equivalent to a given GO term.
GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)
function.ontology (Molecular function)
process.ontology (Biological process)
component.ontology (Cellular component)

Gene Ontology identifier.
45174
n/a
The go_id comes after '%{function name} ; GO:xxxxxx ; '
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()>Go_Append_Reference()
GO_GetRecordByReference()

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
API:
***go_ref***
description:
example:
default value:
source:
parser:
function:

API:
***go_ref_db***
description:
example:

default value:
source:
parser:
function:

API:

220 of 421

18/04/2005

SHoundGODBGetRecordByReference()

Gene Ontology Synonym Description.
ISBN: 0198506732
n/a
go_ref comes after "GO:xxxxxx ; ISBN: 0198506732 ;"
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()>GO_Append_Reference()
GO_GetRecordByReference()
NA

Gene Ontology Reference Database.
EC
ISBN
TC
n/a
After '%{function_name} ; GO:xxxxxxx ; {go_ref_db}
After '%{component_name} ; GO:xxxxxxx ; {go_ref_db}
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()>GO_Append_Reference()
GO_GetRecordByReference()
NA

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

221 of 421

go_synonym table
Last updated: August 5, 2004
seqhound
Database:
go_synonym
Table:
godb
Module:
Definition:
MySQL
Field
Type
Null
rowid
int(11)
No
go_id
int(11)
No
go_syn
MySQL Indexes
Keyname

text

PRIMARY KEY

INDEX

Column_Definition
Auto incremented id
GeneOntology ID
GeneOntology Synonym.

igosynonym_rowid INDEX

Field
go_id
go_syn(100)
go_id

igosynonym_go_id INDEX

go_syn

Observation.:
Organization:
Source db:

Example of source
file:

***go_id***
description:
example:
default value:
source:
parser:
function:

API:

Type

Default

18/04/2005

GeneOntology (ftp://ftp.geneontology.org/pub/go/ontology)
function.ontology (Molecular function)
process.ontology (Biological process)
component.ontology (Cellular component)
%glycine binding activity ; GO:0016594 ; synonym:Gly binding ;
synonym:aminoacetic acid binding ; synonym:aminoethanoic acid
binding

Gene Ontology identifier.
45174
n/a
The go_id follows '%{function name}; GO:'
goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->Go_Append_Synonym()
n/a

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
***go_syn***
description:
example:
default value:
source:

222 of 421

18/04/2005

Gene Ontology Synonym Description. This can be a 'synonym' of a
'process' or 'component' name.
'Gly binding'
n/a
After '%{function_name} ; GO:xxxxxxx ; synonym:{synonym name}'
After '%{component_name} ; GO:xxxxxxx ; synonym:{synonym name}'
After '%{process_name} ; GO:xxxxxxx ; synonym:{synonym name}'
glutathione dehydrogenase (ascorbate); glutamic acid binding
%glycine binding activity ; GO:0016594 ;
synonym:Gly binding ;
synonym:aminoacetic acid binding ;
synonym:aminoethanoic acid binding

parser:
function:

API:

goparser.c -> go_parser.c
GODB_ParseFile()->GO_LineParser()>GO_AppendRecord()
db_layer: GO_AppendRecord()->GO_Append_Synonym()
n/a

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

223 of 421

18/04/2005

Gene Ontology Association (GOA) module
Last updated April 5, 2005
This section is maintained by Renan Cavero.
purpose:
The GOA Module provides Gene Ontology (GO) information for all possible organisms that have GO terms. GO terms are controlled
vocabulary for molecular function, biological process and cellular component of gene products. GO Annotation assignments are
derived from ftp://ftp.geneontology.org/pub/go/gene-associations/ and from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz.
The GO terms are linked to identifiers provided by the curating database (e.g. FlyBase) AND to the NCBI Gene Info identifier (GI)
using the SeqHound DBXref Module.
The GOA Module contains
• GO term assignments to genes or proteins
•

Literature references like PubMed IDs

•

Evidence codes between the gene product and the GO term

•

Object types that get annotated such as gene, transcript, protein or protein structure

•

Gene symbols or other associated text

•

Taxonomic identifiers

•

Date on which the annotation was made.

module:
GOA

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

224 of 421

18/04/2005

input files:
All files found under gene ontology FTP site
ftp://ftp.geneontology.org/pub/go/gene-associations/
NCBI EntrezGene
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
tables altered:
goa_association
goa_gigo
goa_reference
goa_seq_dbxref
goa_with
goa_xdb

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

225 of 421

18/04/2005

Table summarizing input files, parsers and command line parameters for GOA module.
Input file

parser

All files from: ftp://ftp.geneontology.org/pub/go/gene-associations/

goa_parser_cluster.pl
See note below.

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz

go_geneparser_cluster.pl
See note below.

Tables: goa_gigo, dbxref, goa_seq_dbxref, goa_association, accdb,
redund, taxgi.

Seqhound_gigo_cluster.pl
See note below.

Tables: goa_gigo, goa_seq_dbxref, goa_association, accdb, redund, taxgi.

Seqhound_gigoPDB_cluster.pl
See note below.

Tables: goa_gigo, dbxref, goa_seq_dbxref, goa_association.

Seqhound_gigoCGEN_cluster.pl
See note below.

command line
parameters
-d {Org.abbr.} (*)
–c {Cluster: T F}
-o {Offset: n}
-n {Num. Lines: n}
-f {Flag File}
-d GO_GENE
(**)
-c {Cluster: T F}
-o {Offset: n}
-n {Num. Lines: n}
-f {Flag File}
-c {Cluster: T F}
-o {Offset: n}
-n {Num. Lines: n}
-f {Flag File}
-c
-o
-n
-f
-c
-o
-n
-f

{Cluster: T F}
{Offset: n}
{Num. Lines: n}
{Flag File}
{Cluster: T F}
{Offset: n}
{Num. Lines: n}
{Flag File}

Notes:
(*) Parser Command Line Parameters. (See dbxref.ini for details.)
goa_parser_cluster.pl: Parse Gene Ontology Association files and populate GOA module tables.
-d Organism Database File Abbreviation: Can be found under dbxref.ini Partition [GENE_ASSOCIATON_FILE] Ex.: GOA_DDB,
GOA_CGEN.
-c Cluster Option: T: True (run in a cluster environment); F: False: (run stand alone).
The following options are optional. Needs to be set-up only if running in a Cluster environment.
-o Offset: The offset in a flat file where a cluster node will start parsing.
-n Number of lines: The number of line-records to process by a cluster node after the Offset is reached.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

226 of 421

18/04/2005

-f Flag File: For synchronization purpose. A file name that will be generated by a cluster node when finished parsing, telling the
Cluster’s head node that the parsing was completed. (Cluster Head node will track completion of all nodes before continuing with the
next process in the queue.
(**)
goa_gene_parser: Same as above but this parser will only parse gene2go file.
seqhound_gigo_cluster.pl: Populates goa_gigo with GI’s (GenBank assessions) extracted from dbxref table and by looking-up Xref
in goa_seq_dbxref will get GO terms from goa_association. When looking up GI’s in accdb, redundancy and taxonomies are
considered.
seqhound_gigoPDB_cluster.pl: Looks up PDB records and chains from goa_seq_dbxref and by looking-up GIs in accdb populates
gi-go pairs in goa_gigo. When looking up GI’s in accdb, redundancy and taxonomies are considered.
seqhound_gigoCGEN_cluster.pl: Same as above but for GIs from Compugen records.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

227 of 421

18/04/2005

source code location:
The parsers for this module have been updated to work in a cluster configuration. Source code is unavailable at the time of this
manual release but will be released with the next code release.
/slri/seqhound/dbxrefgoa
config file dependencies:
The relevant configuration file is:
dbxref.ini (for Unix platforms)
see dbxref.ini documentation.
command line parameters:
See summary table above.
associated scripts:
The following shell scripts execute the parsers:
dbxrefgoa_cron.sh: Cron job that runs dbxrefgoa_updatecron.pl
dbxrefgoa_updatecron.pl: Program that automates deployment and runs parsers in a cluster. (see dbxref.ini documentation for details)
auxiliary scripts:
In a cluster environment the following scripts will help dbxrefgoa_updatecron.pl in the deployment and execution:
parsers.deploy.sh: deploy files to be parsed to the cluster nodes.
generate_goa_run.pl: Generate an instruction script that distributes processes in the cluster nodes.
run{organism database_file_abbreviation}: instruction script generated by generate_goa_run.pl. Example.: runGOA_MGI.sh
clean.sh: clean (remove) files previously deployed by script deploy.sh.
wait.pl: makes the cluster’s head node wait until all the nodes finish processing the parser that was deployed. When wait.pl receives a
signal from all the nodes, it will continue with the next parser.
See dbxref.ini for more details.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

228 of 421

18/04/2005

error and run-time logs:
dbxrefgoa_errors.log: log that tracks run time error by the parsers.
dbxrefgoa_updatecron.log: log that indicate what parser was run, completed or failed . A copy of this log will be send by email to the
SeqHound administrator.
goa_parser.log: log that summarize number of records updates by each goa parser.
troubleshooting:
dbxrefgoa_updatecron.pl will execute instructions from the configuration file dbxref.ini. Any error generated by this process or by the
parsers will be recorded under dbxrefgoa_error.log. The final results will be summarized in dbxrefgoa_updatecron.log and
goa_parser.log. By following the error messages, the SeqHound administrator will be able to adjust configuration parameters under
dbxref.ini and run it again.
additional info:
The Gene Ontology Consortium documentation is located at
http://www.geneontology.org/GO.contents.doc.html
ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec
Gene Ontology Module Diagram
The tables also appears in http://www.blueprint.org/seqhound/api_help/docs/SeqHound_Schema_Prod.pdf

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

229 of 421

goa_seq_dbxref
PK

FK3
I1

goa_xdb_id
xref_key (2)
taxid_1 (13a)
taxid_2 (13b)
type_id (12)
symbol (3)
full_name (10)
synonym (11)
timestamp

18/04/2005

goa_association
PK,FK4
FK8

is_not (4)
go_id (5)
goa_seq_dbxref_id
assigned_by (15)
code (7)
date (14)
timestamp

FK4
FK5

goa_association_id
goa_xdb_id (8)
xref_key (8a, 8b, 8x)
key_type
goa_seq_dbxref_id
timestamp

FK6

go_id

I2
I1

go_name
go_db
go_level

Relational Data Base representation of Gene
Ontology Association module in SeqHound.
(Last updated March 23, 2005)

goa_with
PK

go_name

Numbers beside column names refer to columns in
the Gene Ontology flat file representation
(see http://www.geneontology.org/
GO.annotation.shtml#file and table below).

Gene Ontology Flat File
goa_reference

goa_xdb
PK

id
abbreviation (1)
name
object
example
generic_url
url_syntax
url_example
timestamp

seqhound@blueprint.org

FK3

goa_association_id
goa_xdb_id
xref_key (6a, 6b, 6x)
timestamp
goa_gigo

gi
go
code
xdbid
xref_key

DB (1)
DB_Object_ID (2)
DB_Object_Symbol (3)
NOT (4)
GOid (5)
DB_Reference (6)
Evidence (7)
With (8)
Aspect (9)
DB_Object_Name (10)
DB_Object_Synonym (11)
DB_Object_Type (12)
taxon (13)
Date (14)
Assigned_by (15)

Version 3.3

The SeqHound Manual

230 of 421

18/04/2005

goa_seq_dbxref table
Last updated April 12, 2005
Seqhound
Database:
goa_seq_dbxref
Table:
GOA
Module:
The main table of the GOA Module. This table contains entries for a
Definition:
database record ID (e.g.: Fly Base ID FBgn0013277) and the
information associated with the record that can be found in
gene_association files from Gene Ontology. The record is the base to
obtain other information like GO terms, reference and other annotation.
Gene Ontology ftp://ftp.geneontology.org/pub/go/gene-associations/
Source org:
See module description above.
Source file:
See module description above.
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

231 of 421

18/04/2005

goa_seq_dbxref table
Field

Type

Null

int(10)

goa_xdb_id

int(10)

xref_key

varchar(30)

taxid_1

int(11)

taxid_2

type_id

int(11)

varchar(20)

seqhound@blueprint.org

Yes

Default
Auto incremented
integer.

Column_Definition
Internal primary record
identifier.

Example
2099131

Source
Assigned by source
document parser.

API
NA

Database identifier from 5
table goa_xdb. For
example, an xdb_id of 4
indicates the Candida
Genome Database.

See note below.

Record ID (in the above- PrID1098818
mentioned database)
which points to the object
being annotated.

See note below.

Taxonomic identifier of
the species encoding the
gene product.

See note below.

Taxonomic identifier of 0
the species where the gene
product acts (in the
manner described by the
GO annotation). For
example, if the protein is
from a virus, the host
organism might be listed
here.
Indicates the type of
protein
object being annotated
(gene, transcript, protein
etc.)

See note below.

11676

Version 3.3

The SeqHound Manual

232 of 421

18/04/2005

A unique and valid
symbol to which the
xref_key is matched. It
can be an ORF name,
gene product symbol or
any other identifier.

GI424263

See note below.

cell surface glycoprotein See note below.
gp138

symbol

varchar(30)

full_name

varchar(255)

Yes

NULL

Name of gene or gene
product.

synonym

varchar(50)

Yes

NULL

Gene symbol or other text.fusA

lastupdate

timestamp

CURRENT_TIMES When was this entry last
TAMP
updated.
2005-04-01 17:25:44

See note below.

Note: columns in this table may be mapped to the GO Annotation flat file format described at
http://www.geneontology.org/GO.annotation.shtml#file. Also see the Figure entitled “GOA relationships” below.

seqhound@blueprint.org

Version 3.3

NA
NA

The SeqHound Manual

233 of 421

goa_seq_dbxref indices
Keyname
Type
Cardinality
id_idx
INDEX
2138717
goa_xdb_id_idx PRIMARY 91
xref_key_idx
PRIMARY VC0002
taxid_1_idx
INDEX
686
symbol_idx
INDEX
VC0002
synonym_idx INDEX
mioC
lastupdate_idx

INDEX

seqhound@blueprint.org

18/04/2005

Field
id
goa_xdb_id
xref_key
taxid_1
symbol
synonym

2005-04-01 17:25:44 lastupdate

Version 3.3

The SeqHound Manual

234 of 421

18/04/2005

goa_association table
Last updated April 12, 2005
Seqhound
Database:
goa_association
Table:
GOA
Module:
This table contains GO term information associated with the Record ID
Definition:
(goa_seq_xref.xref_key).
See module description above.
Source file:
See module description above.
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

235 of 421

18/04/2005

goa_association table
Field
id

Type
int(10)

Null Default
Column_Definition
Example
No Auto
Internal unique identifier for this record. 33
increment
integer.

Source
API
This identifier is incremented NA
by the source file parser.

is_not

char(1)

Yes

NULL

Flags that modify the interpretation of an F
annotation. A GO ID with a Not in this
field means that a particular gene product
is NOT associated with a particular GO
term.
For more information please read "Using
the Qualifier column" from
http://www.geneontology.org/GO.annota
tion.html
“T” indicates that “NOT was found in
this column.
“F” indicates that “NOT” was not found.

See note below.

go_id

int(10)

Gene Ontology identifier for the term
attributed to the object described by
goa_seq_xref.xref_key.

3677

See note below.

goa_seq_dbxref_id

int(10)

Foreign key pointing to
goa_seq_dbxref.id

See note below.

assigned_by

int(10)

database that made the annotation.

117

See note below.

code

char(4)

Evidence Code. One of IMP, IGI, IPI, IEA
ISS, IDA, IEP, IEA, TAS, NAS, ND, IC.

See note below.

date

char(8)

Yes

NULL

Date on which the annotation was made. 20040107

See note below.

lastupdate

timestamp

Yes

NULL

Date when this record was last modified.

See note below.

2005-04-01 17:25:44
Note: columns in this table may be mapped to the GO Annotation flat file format described at
http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

236 of 421

goa_association indices
Keyname
Type
Cardinality
PRIMARY
PRIMARY 8720064

18/04/2005

Field

id
go_id
go_id
UNIQUE 8720064
goa_seq_dbxref_id
code
goa_id_idx
INDEX
21268
go_id
goa_seq_dbxref_id_idx INDEX
8720064
goa_seq_dbxref_id
assigned_by_idx
INDEX
19
assigned_by
code_idx
INDEX
code
lastupadate_idx
INDEX
2005-04-01 17:25:44 lastupdate

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

237 of 421

18/04/2005

goa_reference table
Last updated April 12, 2005
Seqhound
Database:
goa_reference
Table:
GOA
Module:
This table contains the reference identifier for the GO annotation.
Definition:
Reference identifiers are unique identifiers appropriate to a database
authority for the attribution of the go_id to the Record ID. They may be
a literature reference or a database record.
See module description above.
Source file:
See module description above.
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

238 of 421

18/04/2005

goa_reference table
Field
id

Type
int(10)

Null Default Column_Definition
No Auto
increment A unique identifier for this record.
integer.

Example
1

Source
API
This value is autoincremented NA
by the source file parser.

goa_association_id

int(10)

Foreign key pointing to
goa_association.id.

530508

See note below.

goa_xdb_id

int(10)

Database identifier that made the
reference. Foreign key pointing to
goa_xdb.id..

See note below.

xref_key

varchar(20)

See note below.

Reference Identifier. For example, if MGI:1354194
the reference is a published paper
that has a PubMed ID, the PubMed
ID number will be in this field

timestamp

Yes

NULL

When was this entry last updated?

goa_reference indices
Keyname
Type Cardinality
Field
PRIMARY
PRIMARY 8424423
id
goa_association_id_idx INDEX
8424423
goa_association_id
goa_xdb_id_idx
INDEX
18
goa_xdb_id
xref_key_idx
INDEX
2864
xref_key
lastupdate_idx

INDEX

seqhound@blueprint.org

17122

lastupdate

Version 3.3

The SeqHound Manual

239 of 421

18/04/2005

goa_with table
Last updated April 12, 2005
Seqhound
Database:
goa_with
Table:
GOA
Module:
This table is used to hold additional identifiers for annotations using
Definition:
certain evidence codes. For more information please see "With (or)
From" section at http://www.geneontology.org/GO.annotation.html
See module description above.
Source files:
See module description above.
Parsers:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

240 of 421

18/04/2005

goa_with table
Field
id

Type
int(10)

Null
No

Default
Column_Definition
Example
Auto
A unique identifier for this record. 1
increment
integer.

Source
API
This column is autoincremented NA
by the source file parser.

goa_association_id

int(10)

Foreign key pointing to
goa_association.id

See note below.

goa_xdb_id

int(10)

Database Identifier that made the 37
'with' annotation.

See note below.

xref_key

varchar(20) No

Reference identifier.

IPR001601

See note below.

key_type

int(10)

Type of the symbol annotated:
Type of the symbol annotated:
1=gene_symbol,
2=allele_symbol,
3=gene_id,
4=sequence_id,
5=go_id.

See note below.

goa_seq_dbxref_id

int(10)

Foreign key pointing to
0
goa_seq_dbxref.id for quick lookup from a ‘with’ annotation back
to the record id.

See note below.
This field is not currently
implemented.

lastupdate

timestamp No

NULL

When was this entry last updated? 2005-04-01 17:25:44

530509

Note: columns in this table may be mapped to the GO Annotation flat file format described at
http://www.geneontology.org/GO.annotation.shtml#file. See the Figure entitled “GOA relationships” below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

241 of 421

18/04/2005

goa_with indices
Keyname
Type Cardinality
Field
PRIMARY
PRIMARY 1234449
id
goa_association_id_idx INDEX
1234449
goa_association_id
goa_xdb_id_idx
INDEX
17
goa_xdb_id
xref_key_idx
INDEX
xref_key
key_type_idx
INDEX
key_type
goa_seq_dbxref_id_idx INDEX
goa_seq_dbxref_id
lastupdate_idx
INDEX
lastupdate

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

242 of 421

18/04/2005

goa_xdb table
Last updated April 12, 2005
Seqhound
Database:
goa_xdb
Table:
GOA
Module:
The table goa_xdb contains metadata about the organizations which
Definition:
contribute to the GO. There is a one to one relationship between
abbreviations and URLs where data can be retrieved. A single URL
which can be queried using database ids is referred to as a datasource.
Each organization may have multiple datasources. Each abbreviation
identifies one section of the file which provides the abbreviation and
full name of that data source, the object type which is retrieved, an
example database id, the generic url which identifies that data source
uniquely and globally and the syntax of an actual query request with
parameters filled in.
For more information please read:
ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec
See module description above.
Source files:
See module description above.
Parsers:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

243 of 421

18/04/2005

goa_xdb table
Field
id

Type
int(11)

Null
No

abbreviation

varchar(50)

name

varchar(255)

Yes

object

varchar(255)

example

Default
Column_Definition
Example
Auto
A unique identifier for this record. 76
increment Note that this identifier is unstable
integer.
and may change from one release
to another.

Source
API
This column is NA
autoincremented
by the source file
parser.

Database abbreviation.

SWISS-PROT

See note below. NA

NULL

Database name or description.

Swiss-Prot protein database.

See note below. NA

Yes

NULL

type of identifier returned from the
data source:
Accession number
Locus identifier
call number
Gene symbol etc.

varchar(50)

Yes

NULL

An example database identifier.

generic_url

varchar(255)

Yes

NULL

The root or representative URL for http://ca.expasy.org/spro See note below. NA
this data source.
t/

url_syntax

varchar(255)

Yes

NULL

A string to which one can append http://www.expasy.ch/cgi- See note below. NA
a database ID and get a valid URL bin/sprot-search-ac?
query for the object referenced by
that id. There is no wild card to
represent the database ID, it is
simply appended to the end of the
string.

url_example

varchar(255)

Yes

NULL

An example of what the
url_syntax will look like

See note below. NA

Swiss-Prot:P45867

See note below. NA

http://www.expasy.ch/cgi- See note below. NA
bin/sprot-searchac?P45867

timestamp

seqhound@blueprint.org

NULL

Version 3.3

The SeqHound Manual

244 of 421

18/04/2005

goa_xdb indices
Keyname
Type Cardinality
Field
PRIMARY
PRIMARY 117
id
abbreviation_idx INDEX
abbreviation
name_idx
INDEX
name

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

245 of 421

18/04/2005

goa_gigo table
Last updated April 5, 2005
Seqhound
Database:
goa_gigo
Table:
GOA
Module:
Pre-computed list of gi-go pairs.
Definition:
Since SeqHound database is NCBI GI Centric and GOA is organism
database ID centric; the process to get a GO term given a GI can be
complicated and time consuming, for this reason a pair list of gi-go
were pre-computed and stored in goa_gigo table. The SeqHound core
database (mainly accdb and redund tables) is required to run these
parsers.
See module description above.
Source files:
See module description above.
Parsers:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

246 of 421

18/04/2005

goa_gigo table
Field
gi

Type
INTEGER

Null
No

Default Column_Definition
Example
0
Gene Info Identifier for an NCBI sequence 6552303
record

Source
API
The record indicated by the last NA
two columns of this table are
converted to a matching Gene
Info Identifier using the
DBXref module.

INTEGER

6281

From table
goa_association:gi_id

code

varchar(4)

Yes

NULL Evidence Code. One of IMP, IGI, IPI, ISS, TAS
IDA, IEP, IEA, TAS, NAS, ND, IC.
For more information please read
http://www.geneontology.org/GO.evidenc
e.html

From table
goa_association:code

xdb_id

INTEGER

Yes

NULL Identifier for database. See also xref_key 103
below. From table goa_xdb.

From table goa_xdb.id

xref_key

varchar(30)

Yes

NULL Identifier in database (see previous
column) pointing to object that was
originally annotated.

P38398

From table
goa_seq_dbxref:xref_key.

lastupdate

timestamp

NULL When was this entry last updated?

2005-04-01 17:25:44

seqhound@blueprint.org

Gene Ontology Identifier.

Version 3.3

The SeqHound Manual

goa_gigo indices
Keyname
Type
PRIMARY

247 of 421

Cardinality

PRIMARY 7936320

gi_idx
INDEX
go_idx
INDEX
code_idx
INDEX
xdb_id_idx
INDEX
xref_key_idx INDEX
lastupdate_idx INDEX

seqhound@blueprint.org

3968160
7254
22
7936320
22

18/04/2005

Field
gi
go
code
gi
go
code
xdb_id
xref_key
lastupdate

Version 3.3

The SeqHound Manual

248 of 421

18/04/2005

dbxref module
Last updated April 12, 2005
purpose:
The purpose of the SeqHound dbxref module is to have a centralized data source (cross references) where related information can be
found from a given ID. By using the dbxref module, it is possible to find one to "n" relationships between IDs from 3rd party
databases for DNA, Protein Sequences, Domains and Interactions, GenBank Accession Numbers, Swiss-Prot, LocusLink, SGD,
MGD, ZFIN, FB, PFAM, SMART, etc. See the “Explanation of the data table structure” below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

249 of 421

18/04/2005

Who Cross-references who?
The following table indicates what database records we collect as cross-references from primary database records in the SeqHound
DBXref module.
This table was last updated April 15th and may change on a regular basis.
DB Cross references to:
Primary DBs
GENE
SPTR
ENSEMBL
AFCS
TIGR_ATH
MGI
RGD
FB
WB
SGD
DDB
ZFIN
GRAMINE
GENEDB_SPOMBE
TIGR_ATH
TIGR_CMR
UNIGENE
VIDA

GB SPTR

CG TAIR

UNIGENE

IPI

ENSEMBL

OMIM

X
X
X
X
X
X
X
X
X
X
X
X
X

X
X

X
X
X

X
X

X
X
X

X
X

X
X
X
X

Explanation of the data table structure:
The dbxref table is the core of the dbxref module. This explanation refers to that table.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

250 of 421

18/04/2005

Dbxref represents cross-references between 3rd party databases and GeneBank. This table is created by parsing 3rd party flat files and
creating records in "dbxref" for each record parsed. dbxref is a self-referencing table.
The value in the "record_id" field may represent one of two things:
1. A source record:
If the value of "parent_id" in the row is zero (0) it is referred to as an object id. Its "record_id" is the identifier/primary id for a
record in the database ("source_db") from which cross-references have been retrieved.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

251 of 421

18/04/2005

For example, from the dbxref table content example listed below.
P38903
S00055403

is a Primary ID in SwissProt Database
is a Primary ID in SGD Database.

2. A database cross-reference found in a source record. The retrieved cross-references are stored as a combination of
a) Source database ("source_db").
b) an identifier for a record in that database ("record_id").
c) the record that the cross-reference was found in. The field "parent_id" contains an integer. This integer refers to the row in this
table (see "id") that contains the identifier of the record from which this cross reference was retrieved.
d) cross references are retrieved from some field in a record. The name of this field is recorded in the "field" column or if there
are no field names then the column number is recorded. For example : Col1 or Col4.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

252 of 421

18/04/2005

Example entries
| id | source_db | record_id | parent_id | link | field | cv |
+----+-----------+-----------+-----------+------+-------+----+
| 1 | SP
| P38903
|
0 |
0 | ID
| 0 |
| 2 | GB
| U06630
|
1 |
0 | DR
| 1 |
| 3 | GB
| AAB38372 |
1 |
2 | DR
| 2 |
| 4 | GB
| S79635
|
1 |
0 | DR
| 1 |
| 5 | GB
| AAB35312 |
1 |
4 | DR
| 2 |
| 6 | GB
| X87331
|
1 |
0 | DR
| 1 |
| 7 | GB
| CAA60763 |
1 |
6 | DR
| 2 |
| 8 | GB
| Z74922
|
1 |
0 | DR
| 1 |
| 9 | GB
| CAA99203 |
1 |
8 | DR
| 2 |
| 10 | InterPro | IPR001757 |
1 |
0 | DR
| 3 |
| 11 | Pfam
| PF00689
|
1 |
0 | DR
| 3 |
| 12 | SGD
| S0005540 |
1 |
0 | DR
| 0 |
| 13 | SP
| P47096
|
0 |
0 | ID
| 0 |
| 14 | GB
| Z49525
|
13 |
0 | DR
| 1 |
| 15 | GB
| CAA89550 |
13 |
14 | DR
| 2 |
| 16 | GB
| X87297
|
13 |
0 | DR
| 1 |
| 17 | IPR
| IPR007113 |
13 |
0 | DR
| 0 |
| 18 | SGD
| S00055403 |
0 |
0 | Col4 | 0 |
| 19 | GB
| 1420113
|
18 |
0 | GI
| 0 |
| 20 | GermOnline| 143602
|
18 |
0 | Col1 | 0 |
| 21 | DIP
| 4191
|
18 |
0 | Col1 | 4 |
| 22 | GB
| 6324588
|
18 |
0 | GI
| 0 |
| 23 | GB
| AX596518 |
18 |
0 | Col1 | 1 |
| 24 | CandidaDB | CA1247
|
18 |
0 | Col1 | 0 |
| 25 | GB
| CAA99203 |
18 |
0 | Col1 | 2 |
| 26 | InterPro | IPR002554 |
18 |
0 | Col1 | 0 |
| 27 | GB
| NP_014656 |
18 |
0 | Col1 | 2 |
| 28 | SP
| P38903
|
18 |
0 | Col1 | 0 |
| 29 | PIR
| S54620
|
18 |
0 | Col1 | 0 |
| 30 | GB
| S79635
|
18 |
0 | Col1 | 1 |
| 31 | GB
| U06630
|
18 |
0 | Col1 | 1 |
| 32 | GB
| X87331
|
18 |
0 | Col1 | 1 |
| 33 | MIPS
| YOR014W
|
18 |
0 | Col1 | 0 |
| 34 | GB
| Z74922
|
18 |
0 | Col1 | 1 |
+----+-----------+-----------+-----------+------+-------+----+

The "link", "field" and "cv" fields in the dbxref table help specify where the information is coming from. These are described below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

253 of 421

18/04/2005

link
The link field was created to support some databases (for example Swiss-Prot) which may store more than one cross-reference in one
field. For example a swiss-prot record for a protein may contain a "DR" field that lists two EMBL identifiers. ie.:
DR

EMBL; U06630; AAB38372.1; -.

The first identifier is a cross-reference for a nucleotide record in EMBL that encodes a protein (second identifier).
The link field was created to capture this relationship between the two cross-references. The integer in the link field points to a row in
the dbxref table (see "id") and indicates that the current cross-reference is linked (or comes after) some other cross reference in the
same source record and field.
The exact meaning of the database cross-reference can be discerned from the "cv" (Controled Vocabulary) column of this table. In
this example, one cross-reference would be labeled as "nucleotide" and one would be labeled as "protein".
2: points to record "id=2" meaning that protein sequence identifier "AAB38372" comes after the identifier “U06630” in the DR
field.
0: if no relationship exists or if the dbxref is the first one in a list
field
This describes the field in the source record where the database cross-reference was found. For example, “DR” is a field name in
Swiss-Prot records indicating a Database Reference. Alternatively, a column number might be listed here is the source file was a tabdelimited text file.
cv
cv is a controlled vocabulary term that is used to describe the type of record that the database cross-reference is pointing to. This
controlled vocabulary is simple at the moment and may be expanded in future. Briefly,
1 indicates a DNA sequence record
2 indicates a protein sequence record
3 RNA
100 Swiss-Prot record
101 Trembl record
110 Swiss-Prot secondary accession
111 Trembl secondary accession

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

254 of 421

18/04/2005

0 means not defined.
source code location: /slri/seqhound/dbxref
input Files: See summary table.
parsers: See summary table.

config file dependencies:
dbxref.ini
command line parameters: See summary table.
example use: See summary table.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

255 of 421

18/04/2005

associated scripts: (See associated scripts under GOA Module.)
The following shell scripts execute the parsers.
dbxrefgoa_cron_monthly.sh: Script to create all tables and sub-directories when running dbxref-goa module for the first time.
dbxrefgoa_updatecron.pl: Program that automates deployment and runs parsers in a cluster. (see dbxref.ini documentation for details)
error and run-time logs:
dbxrefgoa_errors.log: log that tracks run time error by the parsers.
dbxrefgoa_updatecron.log: log that indicate what parser was run, completed or failed . A copy of this log will be sent by email to the
SeqHound administrator.
dbxref_parser.log: log that summarizes number of record updates by each dbxref parser.
Error messages are sent by email when dbxrefgoa_updatecron.pl runs.
troubleshooting:
Check the email sent by dbxrefgoa_updatecron.pl to find out if any parsers had problems. For more details consult the log files.
additional info:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

256 of 421

18/04/2005

How to update the DBXref and GO Annotation modules using a cluster.
Last updated April 15th, 2005.
This section is maintained by Zhe Wang and Renan Cavero.
Note for those installing there own local instance of SeqHound. The data sets generated by this process are made available on the
SeqHound ftp site at ftp://ftp.blueprint.org/pub/SeqHound/Data/ . These instructions are supplied as a description of our internal
process. The scripts described are provided as part of the SeqHound code release package for those who may wish to use them as a
guide for setting up their own internal cluster build. The framework presented here might also be useful to others for setting up cluster
jobs.
Dbxrefgoa_update.pl will generate scripts for distribution and execution of processes on a cluster. To accomplish this it uses clusterit
tools. More information about clusterit can be found at: http://www.garbled.net/clusterit.html.
The file dbxref.ini has all the configurations to build and update the DBXRef/GOA modules on a computer cluster. This configuration
file is read and executed by the script dbxrefgoa_update.pl.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

257 of 421

18/04/2005

Understanding the dbxref.ini file
Section [DBXREFDATA] describes the database to which the updated or new DBXRef/GOA data will be written. Variable
“database” should be an existing database. The text in italics should be modified. The user must have write privilege to the
database.
[DBXREFDATA]
# this should point to the server which has dbxref and goa modules
host
= staging_box
port
= 33306
user
= user
password
= passwd
database
= dbxrefgoa
table
= dbxref
tablegigo
= goa_gigo

Section [SEQHOUND] describes the SeqHound database from which information will be read in order to update the DBXRef/GOA
modules specified in section [DBXREFDATA]. The user must have read privilege to the database.
[SEQHOUND]
# this should point to the server which has the most up-to-date tables of seqhound
hostshound
= production_box
portshound
= 3306
usersshound
= user
passwordshound = passwd
databaseshound = seqhound
working_dir
= /home/user123/dbxrefgoa/
data_dir
= /scratch/dbxrefgoa/download
myemail
= you@you.org

The variable “working_dir” specifies the directory in which to run script dbxref_updatecron.pl.
Directory /home/ is mapped to all cluster nodes.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

258 of 421

18/04/2005

Variable “data_dir” specifies where the input files are saved.
Directory /scratch/ is a local directory on each node.
Variable “myemail” should be the e-mail address of the SeqHound administrator who will be notified of the result of the
DBXRef/GOA update.
Section [LOG_FILES] has the name for three logs files:
[LOG_FILES]
results_log
errors_log
update_log

= dbxrefgoa_results.log
= dbxrefgoa_errors.log
= dbxrefgoa_updatecron.log

Section [DBXREF_FILES] and [GENE_ASSOCIATION_FILES] specify the data files to be processed.
Each of sections [ORGANISM_DBXREF] and [ORGANISM_GOA] specifies the categories of data to be processed.
[ORGANISM_DBXREF]
ORGANISMS = “GENE;SP;TR”

The value of the variable “ORGANISMS” indicates that there are three groups of data that need to be processed. This particular order
of GENE, SP and TR is important for internal purposes. By looking at the value of the variable “ORGANISMS”, script
dbxrefgoa_update.pl goes to the proper section in this configuration file, dbxref.ini, to find information such as where to download the
data file and what commands to be executed on the data file.
For example, when dbxrefgoa_update.pl reads “GENE”, it looks for section [GENE] in the dbxref.ini file. Section [GENE] is shown
below. “GENE_URL” specifies the FTP path of the data file, “GENE_CMD” specifies the command to be run, and
“GENE_CMD2RUN” specifies the command to be run on the cluster nodes.
[GENE]
GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene"
GENE_CMD2RUN=perl dbxref_gene_cluster.pl

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

259 of 421

18/04/2005

GENE_CMD="./deploy.sh GENE gene2accession"
GENE_CMD="./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh"
GENE_CMD="./runXREF_GENE.sh"
GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz"
GENE_OMIM_CMD2RUN=perl dbxref_gene_extra.pl -d OMIM
GENE_CMD="./deploy.sh GENE mim2gene"
GENE_CMD="./generate_dbxref_run.pl XREF_GENE_OMIM > runXREF_GENE_OMIM.sh"
GENE_CMD="./runXREF_GENE_OMIM.sh"
GENE_CMD="./clean.sh GENE"
GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene"
GENE_UNIGENE_CMD2RUN=perl dbxref_gene_extra.pl -d UNIGENE
GENE_CMD="./deploy.sh GENE gene2unigene"
GENE_CMD="./generate_dbxref_run.pl XREF_GENE_UNIGENE > runXREF_GENE_UNIGENE.sh"
GENE_CMD="./runXREF_GENE_UNIGENE.sh"
GENE_CMD="./clean.sh GENE"

Script dbxrefgoa_update.pl first creates a sub-directory called “GENE/wget/” in the directory specified by the variable “data_dir”.
The newly created directory would be /scratch/dbxrefgoa/download/GENEl/wget/. Script dbxrefgoa_update.pl then reads all
GENE_URLs sequentially and downloads the data files if they been updated. These files are placed into the directory
/scratch/dbxrefgoa/download/GENE/wget/, and then subsequently copied one level up to the directory
/scratch/dbxrefgoa/download/GENE/. If no up-to-date data files are downloaded, script dbxrefgoa_update.pl finishes with this section
and moves to the next according to the value of “ORGANISMS” in section [ORGANISM_DBXREF]. Once all the data files are
downloaded, the script dbxrefgoa_update.pl reads the values of “GENE_CMD” and executes one command at a time.
The first part of section “[GENE]” is used to explain how this works.
[GENE]
#GENE_URL="ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz"
GENE_CMD2RUN=perl dbxref_gene_cluster.pl
GENE_CMD="./deploy.sh GENE gene2accession"
GENE_CMD="./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh"
GENE_CMD="./runXREF_GENE.sh"

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

260 of 421

18/04/2005

The first command to be executed is "./deploy.sh GENE gene2accession ". Script deploy.sh makes directory
/scratch/dbxrefgoa/download/GENE/ on each of the cluster nodes and copy the data file gene2accession into the newly created
directory on each node.
The second command to be executed is "./generate_dbxref_run.pl XREF_GENE > runXREF_GENE.sh". Script
generate_dbxref_run.pl takes in one parameter, XREF_GENE, which refers to the full path of the data file specified in variable
“data_dir” plus “XREF_GENE” in section [DBXREF_FILE] of the configuration file dbxref.ini. The data file has been copied to each
cluster node by executing command "./deploy.sh GENE gene2accession". Each of the cluster nodes will process a part of the data file.
The data file will be evenly divided into multiple segments each of which is processed by one node. In this way, the jobs on the nodes
can finish in approximately the same time.
The cluster nodes need to know where in the file to start and to end. Script generate_dbxref_run.pl reads the data file and
calculates the start and end points of the file for each cluster node to process. The script also constructs the command line for each
cluster node. The output of script generate_dbxref_run.pl is written to the file runXREF_GENE.sh.
An example of such a shell script is:
#! /usr/bin/bash
rsh an090 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 0 -n
269259 -f an090.dat >> /home/user123/dbxrefgoa/xref_parser.log 2>&1 &"
rsh an091 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 269259 -n 269259 -f an091.dat >>
/home/user123/dbxrefgoa/xref_parser.log 2>&1 &"
rsh an092 "cd /home/user123/dbxrefgoa; perl dbxref_gene_cluster.pl -c T -o 538518 -n 269259 -f an092.dat >>
/home/user123/dbxrefgoa/xref_parser.log 2>&1 &"
# Num of records: 807777
./wait.pl an090.dat an091.dat an092.dat
rm an090.dat an091.dat an092.dat

Where “perl dbxref_gene_cluster.pl” is the value of variable “GENE_CMD2RUN” in section [GENE]. In this example, the data file
has 807,777 records that are evenly divided into three segments each of which is processed on one cluster node.
Script dbxref_ll.pl takes four parameters:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

261 of 421

18/04/2005

–c indicates whether this is a run on a cluster node (-c F would make the script process the entire data file regardless of the values of –
o and –n),
–o indicates the starting point or offset in the file for this node,
–n indicates the number of records to be processed,
–f specifies the name of the flag file.
Since the next data file cannot be processed until the previous one is finished, it is necessary to know whether the process on every
cluster node is completed. On each node, a flag file named node.dat (e.g. an091.dat) is generated when the process on that particular
node is finished. The next data file won’t be processed until script wait.pl finds file node.dat on all of the nodes. After this condition is
met, wait.pl removes all flag files.
The last line in section [GENE] is:
GENE_CMD="./cleana.sh"

This command deletes all data files downloaded for this section from the cluster nodes in order to reclaim disk space.
All of the other sections in the dbxref.ini file have the same format as just described for [GENE]. These additional sections are:
[GENE], [SP], [TR], [FB]. [WB], [MGI], [SGD], [TIGR_ATH], [DDB], [RGD], [ZFIN], [XREFGOA], [GENEDB_SPOMBE],
[TAIR], [TIGR_CMR], [UNIGENE], [VIDA], [GOA_CGEN], [GOA_DDB], [GOA_FB], [GOA_GENEDB_GMORSITANS],
[GOA_GENEDB_LMAJOR], [GOA_GENEDB_PFALCIPARUM], [GOA_GENEDB_SPOMBE], [GOA_GENEDB_TBRUCEI],
[GOA_GRAMENE_ORYZA], [GOA_MGI], [GOA_GOA_PDB], [GOA_RGD], [GOA_SGD], [GOA_TAIR], [GOA_TIGR_ATH1],
[GOA_TIGR_CMR], [GOA_TIGR_TBRUCEI_CHR2], [GOA_TIGR_GENE_INDEX], [GOA_VIDA], [GOA_WB],
[GOA_GOA_UNIPROT], [GOA_ZFIN] and [GO_GENE].

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

262 of 421

18/04/2005

Table summarizing input files, parsers and command line parameters for dbxref module.
Input file
ftp://expasy.org/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz

parser
dbxref_sptr_cluster.pl

ftp://expasy.org/databases/uniprot/knowledgebase/uniprot_trembl.dat.gz

dbxref_sptr_cluster.pl

-d XREF_TR

http://flybase.bio.indiana.edu/allied-data/extdb/external-databases.txt

dbxref_fb_cluster.pl

none

ftp://ftp.sanger.ac.uk/pub/databases/wormpep/wormpep.table

dbxref_wb_cluster.pl

none

ftp://ftp.informatics.jax.org/pub/reports/MRK_Sequence.rpt

dbxref_mgi_cluster.pl

none

ftp://ftp.informatics.jax.org/pub/reports/MRK_SwissProt_TrEMBL.rpt

dbxref_mgi_cluster.pl

none

ftp://genomeftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/dbxref.tab

dbxref_sgd_cluster.pl

none

ftp://ftp.afcs.org/pub/mpdata/afcsflat.txt

dbxref_AFCS_cluster.pl

none

ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/DATA_RELEASE_SUPPLEMENT/r
elease_5.genbank_accessions.txt.gz

dbxref_tigr_ath_cluster.pl

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tigr_ath

dbxref_tigr_ath_cluster.pl

-d XREF_TIGR_ATH_GP

ftp://ftp.blueprint.org/pub/SeqHound/Private/DDB/dictybaseid_gb_accession.t
xt.gz

dbxref_ddb_cluster.pl

none

ftp://rgd.mcw.edu/pub/data_release/genbank_to_gene_ids.txt

dbxref_rgd_cluster.pl

none

http://zfin.org/data_transfer/Downloads/genbank.txt

dbxref_zfin_cluster.pl

none

http://zfin.org/data_transfer/Downloads/refseq.txt

dbxref_zfin_cluster.pl

none

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.zfin

dbxref_zfin_cluster.pl

none

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/human.xrefs.gz

dbxref_goa_xrefs_cluster.pl

-d
XREF_XREFGOA_HUMAN

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/mouse.xrefs.gz

dbxref_goa_xrefs_cluster.pl

–d

seqhound@blueprint.org

command line parameters
-d XREF_SP

-d XREF_TIGR_ATH

Version 3.3

The SeqHound Manual

263 of 421

Input file

18/04/2005

parser

command line parameters

XREF_XREFGOA_MOUSE

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/rat.xrefs.gz

dbxref_goa_xrefs_cluster.pl

-d XREF_XREFGOA_RAT

ftp://ftp.sanger.ac.uk/pub/yeast/pombe/Mappings/gp2swiss.txt

dbxref_DBs_SPTR_cluster.pl

-d spombe

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tair

dbxref_DBs_SPTR_cluster.pl

-d tair

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.tigr_cmr

dbxref_DBs_SPTR_cluster.pl

-d tigr_cmr

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.unigene

dbxref_DBs_SPTR_cluster.pl

-d unigene

ftp://ftp.geneontology.org/pub/go/gp2protein/gp2protein.vida

dbxref_DBs_SPTR_cluster.pl

-d vida

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accessions.gz

dbxref_gene_cluster.pl

ftp://ftp.ncbi.nih.gov/gene/DATA/mim2gene.gz

dbxref_gene_extra.pl

-d OMIM

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2unigene.gz

dbxref_gene_extra.pl

-d UNIGENE

command line parameters:
All DBXref parsers can run in a cluster environment setting the –c argument.
-d Organism Database File Abbreviation: Can be found under dbxref.ini Partition [ORGANISM_DBXREF] Ex.: SPTR; FB.
-c Cluster Option: T: True (run in a cluster environment); F: False: (run stand alone).
The following options are optional. Needs to be set-up only if running in a Cluster environment.
-o Offset: The offset in a flat file where a cluster node will start parsing.
-n Number of lines: The number of line-records to process by a cluster node after the Offset is reached.
-f Flag File: For synchronization purposes. A file name that will be generated when a cluster node finishes parsing telling the Cluster’s
head node that the parsing is finished. (The Cluster Head node will track completion of all nodes before it continues with the next
process in queue).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

seqhound@blueprint.org

264 of 421

18/04/2005

Version 3.3

The SeqHound Manual

265 of 421

18/04/2005

dbxref table
Last updated April 12, 2005
SeqHound
Database:
dbxref
Table:
dbxref
Module:
The dbxref table is the core of the dbxref module.
Definition:
It represents cross references between 3rd party databases and
GeneBank. This table is created by parsing 3rd party flat files and
creating records in "dbxref" for each record parsed. dbxref is a selfreferencing table. See “Explanation of data table structure” above.
multiple - see summary table
Source org:
multiple - see summary table
Source file:
multiple - see summary table
FTP script:
multiple - see summary table
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

266 of 421

18/04/2005

dbxref table
Field

Type

int(11)

Null Default Column_Definition
Auto
No
Identifier for this entry.
increment

Foreign key pointing to an identifier for a database, in table 2
dbxrefsourcedb. For example,
1 = Swiss-Prot,
2 = GenBank…etc.

source_id

int(11)

record_id

char(30)

Record identifier in database mentioned in previous column. AAD12597
The format of this identifier will be that of the source
database.

Link pointing to "id" (first column in this table) to the
1
record that is the source id of the Cross-Reference. An
entry of “0” indicates that this is a source record from which
database cross-references are parsed.

parent_id

int(11)

Example
2

Source1
All SQL insert statements will auto
increment this field.

API
NA

See dbxrefsourcedb:source_id.

See dbxref:id

SQL insert statements for proteins that NA
have dbXref to the nucleotide will
have the "link" field pointing to the
"id" record of the nucleotide.
If this relationship cannot be
established the "link" value will be set
to 0.

link

int(11)

Yes

NULL

The link field was created to support some databases (for
example Swiss-Prot) which may store more than one crossreference in one field. See the section above “Explanation
of data table structure”.

char(20)

Yes

NULL

Describes the field in the source record where the database Col6
cross-reference was found. For example, “DR” is a field
name for Database Reference found in Swiss-Prot records.

field

int(11)

Yes

NULL

Controled vocabulary. This will be used to describe the
type of record that the cross-reference is pointing to. See
the section above “Explanation of data table structure”.

NULL

When was this entry last updated

2005-04-01
17:25:44

lastupdate timestamp

1. Multiple source files are parsed by multiple parsers. For details, see summary table.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

267 of 421

18/04/2005

dbxref indices
Keyname
Type Cardinality Field
id_idx
INDEX
16849002 id
source_id_idx PRIMARY 18
source_id
dbxref_id _idx PRIMARY 16849002 record_id
parent_id_idx PRIMARY 16849002 parent_id
link_idx
INDEX
16849002 link
field_idx
INDEX
field
cv_idx
INDEX
18
cv
lastupdate_idx INDEX
51843
lastupdate

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

268 of 421

18/04/2005

dbxrefsourcedb table
Last updated April 12, 2005
seqhound
Database:
dbxrefsourcedb
Table:
dbxref
Module:
This table assigns an internal identifier to all data sources where crossDefinition:
references are found. These identifiers are used in the source_id
column of the dbxref table.
multiple - see summary table
Source org:
multiple - see summary table
Source file:
multiple - see summary table
FTP script:
multiple - see summary table
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

269 of 421

18/04/2005

dbxrefsourcedb table1
Field

Type

Null Default

source_id

int(11)

Column_Definition
Database ID, primary key in table dbxrefsourcedb.
For example,
1 = Swiss-Prot

Example
1

Source1

API
NA

Abbreviated name of database.
SP
Abbreviations for database names are the same as
those used by the Gene Ontology group see:
"GO.xrf_abbs"
ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs
This value is mandatory.

isprimary_db tinyint(4)

Are cross-references retrieved from this database?
1 = YES
0 = NO

lastupdate

source_db

char(50)

timestamp

2005-04-01 17:25:44
1. The contents of this table are hand-edited as part of the dbxrefgoa.sql file. The entire contents of this table as of April 5, 2005 are
listed below.
dbxrefsourcedb indices
Keyname
Type Cardinality Field
PRIMARY PRIMARY 33
source_id
source_db_idx INDEX
33
source_db
lastupdate_idx INDEX
lastupdate

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

270 of 421

18/04/2005

Contents of dbxrefsourcedb table
(last updated March 22, 2005).
source_id

source_db

isprimary_db

1 SP

2 GB

3 PFAM

4 INTERPRO

5 MGI

6 SGD

7 SMART

8 ZFIN

9 FB

10 CG

11 TR

12 SPTR

13 WB

14 LL

15 CGEN

16 TIGR_ATH

17 REFSEQ

18 GENEDB_SPOMBE

19 DDB

20 GR

21 TAIR

22 TIGR_CMR

23 UNIGENE

24 VIDA

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

271 of 421

25 RGD

26 IPI

27 ENSEMBL

28 AFCS

29 HUGO

30 OMIM

31 PIR

32 GENE

seqhound@blueprint.org

18/04/2005

Version 3.3

The SeqHound Manual

272 of 421

18/04/2005

RPS-BLAST domains (rpsdb) module
domname parser
Note: Not available at this time, as it has not been ported to an ODBC backend. Please go to our ftp site at
ftp://ftp.blueprint.org/pub/SeqHound/RPS/
to download precomputed domname tables.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

273 of 421

18/04/2005

Rpsdb parser
Last updated: August 4, 2004
Note: The rpsdb table is precalculated on a cluster and the resulting table is distributed in MySQL format on our ftp
site. Therefore, this section is provided for informational purposes only, or for those who would like to build rpsdb
tables from there own sequence/domain data; it is not necessary if one wishes simply to include the rpsdb module into
their own seqhound instance, in which case they should simply download the precomputed tables.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

274 of 421

18/04/2005

domname table
Last updated: August 4, 2004
SeqHound
Database:
domname
Table:
Domain information parsed from CDD database.
Definition:
The table is populated by the program “DomNameToDB” whose
Parser:
source file is seqhoun/rps/domname.c
The program traverses a directory that contains all the *.acd files
(obtained from NCBI ftp site
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/acd.tar.gz) which contains the
CDD ASN.1 records
Similarly to rpsdb, this table has two API’s. The CDD ASN.1
Comments:
definition seems to be under constant flux and with every new
release there are new and revised fields. This in turn requires that we
modify the parser functions and update API functions accordingly.
Codebase (for historical purposes)
Column_name Indexed NULL Data_type Size
Column_Definition
ACCESSION Yes
No
String
15
CDD id
NAME
Yes
No
String
25
Short domain label
PDB-ID
Yes
Yes
String
10
ID of PDB structure that is
used as 3D representative
of the domain
ASN1
No
Binary
The entire CDD ASN.1
object
record from the source file
MySQL
Field
Type
Null Default
Column_Definition
rowid
int(11)
No
Auto incremented id
accession
varchar(15)
No
CDD id
name
varchar(25)
No
Short domain label

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

275 of 421

pdbid

varchar(12)

asn1

mediumblob

MySQL Indexes
Keyname
idom_rowid
idom_acc
idom_name
idom_pdbid

Type
INDEX
INDEX
INDEX
INDEX

***accession***
Description:
Default Value:
Source (ASN.1) :
Parser :
Function:
More Info:

More info:
API:

18/04/2005

ID of PDB structure that is used
as 3D representative of the
domain
The entire CDD ASN.1 record
from the source file
Field
rowid
accession
name
pdbid

This is the domain’s Conserved Domain Database (CDD) unique
identifier assigned by the CDD group in NCBI. It can also be
SMART, Pfam, LOAD or CDD identifier.
Null
CddIdPtr,
DomNameToDB
FillDomNameNode
In early versions of CDD the accession was the Pfam or SMART
identifier. Therefore, in some of the comments in the code this field
is referred to as the Pfam or SMART identifier. Currently, most of
the domains have a unique CDD id but some may not.
The API functions return a list. The reason is that a domain label
such as SH3 may have a SMART and Pfam entry each stored as a
separate CDD entry. Both ID’s will be returned.
SHoundGetDomainIdFromLabel(CharPtr label);

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

***name***
Description:
Default value:
source:
Function:
More info:
API:

***pdbid***
Description:
Default value:
source:
Function:
More info:
API:
***asn1***
Description:
Default value:
source:
More info:

276 of 421

18/04/2005

Domain’s short label.
Null
This information is obtained from *.csq file that contains a FASTA
description of the domain. The label is parsed from the definition
line.
FillDomNameNode
string GetDomainLabelFromDomainId(string s); CharPtr LIBCALL
SHoundGetDomainLabelFromDomainId(CharPtr accession)

A 3D structure representative of the domain.
null
pCdd->master3d, part of Cdd record
FillDomNameNode
Get3DStructureFromDomainId
SHoundGetDomain3DStructure

This is the entire NCBI CDD ASN.1 structure.
Null
Cdd

Important Note: The CDD structure contains place holders for describing the domains parent, sibling and child
domains. These are domains that are structurally or otherwise related to the domain on the sequence level. The program
DomNameToDB collects this information. In the latest release of CDD that was parsed these fields were left empty.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

277 of 421

18/04/2005

NCBI plans to (or may have) include this information in future releases. This is important information pertaining to the
domains that may be included in future versions of the DomName table. Currently, the table has not been expanded
with additional fields to hold this information.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

278 of 421

18/04/2005

rpsdb table
Last updated: October 07, 2004
SeqHound
Database:
rpsdb
Table:
Domain annotation of proteins derived from RPS-BLAST and
Definition:
Conserved Domains Database (CDD).
The rpsdb is precalculated on a cluster and distributed in MySQL
Comments:
format on our ftp site.
This table contains two local API versions (C and C++) defined in
rpsdbapi.hpp and rpsdbapi.c. For the most part the same information
can be accessed by both API albeit in different forms. However,
there may be cases where the two APIs are not identical.
The default e-value cutoff for data in rpsdb is 1
Note: The rpsdb table is precalculated on a cluster and the resulting table is distributed in MySQL format on our ftp
site. Therefore, this section is provided for informational purposes only, or for those who would like to build rpsdb
tables from there own sequence/domain data; it is not necessary if one wishes simply to include the rpsdb module into
their own seqhound instance, in which case they should simply download the precomputed tables.
Codebase (for historical purposes)
Column_name Indexed
GI
Yes
CDDID
Yes

NULL
No
No

Data_type
Integer
Integer

Size
10
10

DOMID

Yes

String

FROM

Integer

seqhound@blueprint.org

Column_Definition
Sequence identifier
CDD ID (domain ID from
CDD)
Domain ID from primary
database (Pfam, SMART,
COG, KOG or cd)
First a.a. position aligned
to this domain

Version 3.3

The SeqHound Manual

279 of 421

18/04/2005

ALIGN_LEN

Yes

Integer

SCORE
EVALUE

No
No

Integer
Double

10
15,8

BITSCORE
MISSING_N

No
No

Double
Integer

15,8
6

MISSING_C

Integer

NUMDOM

Yes

Integer

MySQL
Field
rowid
gi
cddid

Type
int(11)
int(11)
int(11)

Null
No
No
No

domid

char(12)

rfrom

int(11)

align_len

int(11)

score

int(11)

evalue

decimal(15,8) Yes

seqhound@blueprint.org

Default
0
0

NULL

Length of alignment b/w
protein and domain
RPS-BLAST score
Base 10 log of RPSBLAST E- value
RPS-BLAST bit score
The length of N-terminus
residues on the domain
that were not aligned
The length of C-terminus
residues on the domain
that were not aligned
Number of total domains
mapped to this protein

Column_Definition
Auto incremented id
Sequence identifier
CDD ID (domain ID from CDD)
Domain ID from primary
database (Pfam, SMART, COG,
KOG or cd)
First a.a. position aligned to this
domain
Length of alignment b/w protein
and domain
RPS-BLAST score
Base 10 log of RPS-BLAST Evalue

Version 3.3

The SeqHound Manual

280 of 421

18/04/2005

bitscore

decimal(15,8) Yes

NULL

missing_n

int(11)

missing_c

int(11)

numdom

int(11)

MySQL Indexes
Keyname
irps_rowid
irps_gi
irps_cddid
irps_domid
irps_len
irps_numdom

Type
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX

irps_gi_e

INDEX

Source db:
Source program:
Parser :

SeqHound redundant table
RPS-BLAST results
rpsdb.h/c in seqhound/rps

***gi***
description:
Default value :

Primary sequence identifier assigned at NCBI
0

seqhound@blueprint.org

RPS-BLAST bit score
The length of N-terminus residues
on the domain that were not
aligned
The length of C-terminus residues
on the domain that were not
aligned
Number of total domains mapped
to this protein
Field
rowid
gi
cddid
domid
align_len
numdom
gi
evalue

Version 3.3

The SeqHound Manual

281 of 421

18/04/2005

More info:

GI’s are collected from Seqhound redund table. Proteins that were
not annotated in this table were the hypothetical ORF from RefSeq
(XP_xxxxxx) and SWISS-PROT proteins. These proteins are not
present in SeqHound (no Bioseq) although their GI's are in redund
table. In each redundant group the first GI was used for computing
RPS-BLAST (ordinal 1). These proteins are considered to be the
best representative of the redundant group. However, some
redundant groups may not have the first GI in SeqHound, in those
cases the program collects ordinal 2 or higher. The redundant list is
collected from redund table using “redundlist” program
(seqhound/rps).
SHoundGetGisByDomainIdAndEvalue
SHoundGetGisByDomainId
SHoundGetGisByNumberOfDomains
There are two functions in the C version that use Codebase relational
query. They should not be used and are there for experimental
purposes only.
seqhound/rps/rpsdb_README.txt

***cddid***
description:
Default Value:
Source (ASN.1) :
Parser :
Function:
More info:

Conserved Domain Database unique identifier.
none
CddIdPtr, can also be collected from CddHitPtr cdd_id field.
rpsdb
RPSDBSHoundRedund2ResultsCallback
seqhound/rps/rpsdb_README.txt

source:

API:
Comment:

***domid***
Description:
Default Value:

Domain identifier from the primary database of origin. These
correspond to either Pfam or SMART string identifiers.
Null

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

source:
Function:
***rfrom***
Description:
Default value:
source:
Function:
***align_len***
Description:

282 of 421

18/04/2005

Definition field in CddHitPtr structure.
RPSDBSHoundRedund2ResultsCallback

The index position of the first amino acid in the protein that is
aligned with this domain.
0
Start field in CddHitPtr structure
RPSDBSHoundRedund2ResultsCallback

Default value :
source:
Function:

The length of sequence alignment between the protein and the
domain.
0
stop- start field values in CddHitPtr.
RPSDBSHoundRedund2ResultsCallback

***score***
Description :
Default value:
source:
Function:

RPS-BLAST score parameter
0
score field in CddHitPtr
RPSDBSHoundRedund2ResultsCallback

***evalue***
Description :
Default value:
Source:
Function:

Base 10 log of RPS-BLAST evalue score
0
evalue field in CddHitPtr
RPSDBSHoundRedund2ResultsCallback

***bitscore***

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

Description:
Default :
source:
Function:
***missing_n***
Description:
Default:
source:

Function:
***missing_c***
Description:

283 of 421

18/04/2005

RPS-BLAST bit score
0
bitscore field in CddHitPtr
RPSDBSHoundRedund2ResultsCallback

The number of residues on the domain’s N-terminus that were not
aligned with the protein. The missing length on the N-terminus of
the domain.
0
It is collected from DenseDegPtr which is part of SeqAlign
structure. This structure is filled up by the RPS-BLAST engine. It
contains the collection of aligned segments between the domain and
the protein.
RPSDBSHoundRedund2ResultsCallback

Default:
source:
Function:

The number of missing residues not aligned in the domain’s Cterminus.
0
Same as MISSING_N
RPSDBSHoundRedund2ResultsCallback

***numdom***
Description:
Default:
source:
Function:

The total number of domains aligned with the protein.
0
The number of entries for the query GI in the rpsdb..
SLRICddCountSeqAligns in rpsdb

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

API:

284 of 421

18/04/2005

The above fields are all accessed through a set of calls that retrieve
the domain annotation based on different requirements.
SHoundGetDomainsFromGi
SHoundGetDomainsFromGiWithEvalue
SHoundGetDomainsFromGiListWithEvalue

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

285 of 421

18/04/2005

Molecular Interaction (MI) module
MI-BIND parser
Last updated October 1, 2004
purpose:
The Molecular Interaction module is meant to consolidate the interaction data, and associated annotation, from disparate source
interaction databases(e.g. BIND, IntAct, MINT, etc). The source data is parsed out of their own unique formats and placed into the
MI module tables' data model. This data model has been designed to provide maximum flexibility in terms of the complexity of
queries that can be made to it. Additionally, the module adds value to the data by cross-referencing distinct records, regardless of their
source databases, to provide information on molecular object redundancy and interaction similarity. The MI module's set of parsers
takes records from source interaction modules, parse out their data, and insert that data into the MI tables. Currently, there exists only
one such parser, for parsing BIND XML records; parsers for other interaction databases will be developed in the future.
Logic:
MI-BIND parser:
This parses the BIND XML records, available on the BIND ftp site. It uses the SAX XML parsing API, in order to achieve maximum
parsing speed. In order to foster code reusability, the parser is broken down into three components: The first component parses out
the BIND XML records interaction data and places it into a general MI data structure. This data structure is then passed to another
component, which is responsible for placing the contents into the MI tables in the database. Another component is then called to do
the cross-referencing and redundancy/similarity analysis on the data just processed, and places that additional annotation into the
database. This break-down is convenient, because it means that developers of MI parsers for other source interaction databases will
only need to replace the first component, the one which parses the data out of the source record, which they can then pass to the
already written data feed and cross-referencing components. The parser was written in the Java programming language to take
advantage of Java's mature XML processing capabilities, and because a great deal of java source code was available from the open
source BIND project for processing BIND XML records.
input files:
The MI-BIND parser processes BIND records in their native XML format. These are available from the BIND ftp site, in a number of
different partitions, each of which can be used interchangeable. In order to import all of BIND into the MI module, one would
download the “Divisions” partition from
ftp://ftp.blueprint.org/pub/BIND/data/divisions/xml/*.xml

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

286 of 421

18/04/2005

and then process this list of input files using the MI-BIND parser.
tables altered:
MI_complex2ints
MI_complex2subunits
MI_complexes
MI_dbases
MI_exp_methods
MI_ints
MI_mol_type
MI_obj_dbases
MI_obj_labels
MI_objects
MI_record_types
MI_refs
MI_refs_db
MI_source
source code location:
The source is contained in the following seqhound java packages:
org.bluprint.seqhound.parsers.mi
org.bluprint.seqhound.parsers.mi.bind
config file dependencies:
MI.properties file
This file must be present in the same directory as where the MI-BIND parser is being invoked, and must contain the entries:
dbDriverName=myDatabaseDriversName
dbUserName=myDatabaseUserName
dbPassWord=myDatabasePassword
dbURL=myDatabaseConnectionURL
These contain the settings, which the parser will use to connect to the database management system, which manages the MI data
tables. An example entry follows:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

287 of 421

18/04/2005

dbDriverName=com.mysql.jdbc.Driver
dbUserName=johnsmith
dbPassWord=johmsmithspassword
dbURL=jdbc:mysql://dbhostname:dbportnumber/MIdbname

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

288 of 421

18/04/2005

library dependencies:
Along with the standard java runtime environment, the MI-BIND parser requires one third party java library; this is the open source
xerces2 XML parser implementation, available for free from the apache foundation(see http://xml.apache.org/xerces2-j/download.cgi
for free download instructions).
The xerces jar files xercesImpl.jar and xmlParserAPIs.jar must be in classpath when executing the parser.
The MI-BIND parser has been tested only with xerces2 java version 2.6.2, but may work with other versions.
command line parameters:
The MI-BIND parser takes one command line argument: a file containing a list of filenames of files for the parser to process.
example use:
java org.blueprint.seqhound.parsers.mi.bind.bmdParse BINDXMLFileList.txt
Where BINDXMLFileList.txt is a text file containing a newline delimited list of the names of files containing the BIND XML
records to be processed; these BIND XML files must be in the same directory. Note that the compiled bind module parser files, as
well as the xerces jars, must be in classpath in order for the above command to function.
associated scripts:
none
error and run-time logs:
runtime and error information is printed to standard output and standard error, respectively. This will likely be changed in a future
release, to print to a standard log file.
troubleshooting:
additional info:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

289 of 421

18/04/2005

MI_source table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_source
Molecular Interaction(MI)
Contains information about the source of an interaction record

Field
uid

Type
Int(11)

Null
No

Default

intcompid

Int(11)

Smallint(6)

Yes

NULL

acc

Varchar(10)

Yes

Null

Int(11)

Yes

Null

type

Smallint(6)

Null

descr

Text

Yes

Null

data_blob

Longblob

Yes

Null

data_clob

Longtext

Yes

Null

MySQL

seqhound@blueprint.org

Version 3.3

Column Definition
Internal MI identifier for record
Internal MI identifier of the
interaction, complex or pathway
this record refers to
Internal MI identifier of the
source db which this record
comes room, corresponds to db
column of MI_dbases
Source db accession of this
record, if it exists
Source db ID of this record, if it
exists
Type of this record, either
interaction(1), complex(2) or
pathway(3)
Text description of this record
from the source db
Binary version of original record
from source db, if it exists
Text version of original record

The SeqHound Manual

290 of 421

18/04/2005

from source db, if it exists
MySQL Indexes
Keyname
Primary

Type
Primary

Field
uid

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

***uid***
description:
example:
default value
ASN.1 structure:
parser:
function:
API:
more info:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

291 of 421

18/04/2005

MI_ints table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_ints
MI
Contains information about individual interactions

Field
intid

Type
Int(11)

Null
No

Default

objAid

Int(11)

objBid

Int(11)

rig

Int(11)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Column Definition
Internal MI id of this interaction
Internal MI id of the first object
in this interaction
Internal MI id of the second
object in this interaction
Internal MI id of the redundant
interaction group to which this
interaction belongs

MySQL Indexes

seqhound@blueprint.org

Field
intid

Version 3.3

The SeqHound Manual

292 of 421

18/04/2005

MI_objects table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_objects
MI
Contains information about the individual objects involved in
interactions, complexes and pathways

Field

Type

Null

objid

Int(11)

type

Smallint(6)

Int(11)

Yes

Null

tax

Int(11)

Yes

Null

acc

Varchar(20)

Yes

Null

rog

Int(11)

Keyname
Primary

Type
Primary

MySQL
Default

Column Definition
Internal MI identifier for the
object which this row describes
Internal MI identifier for the
molecule type of this object;
corresponds to column type of
MI_mol_type
Internal MI identifier for the
source db of this object;
corresponds to column db of
MI_obj_dbases
Source db id of this object
NCBI taxonomy id of this object,
if applicable
Source db accession of this object
Internal MI id of the redundant
object group to which this object
belongs

MySQL Indexes

seqhound@blueprint.org

Field
objid

Version 3.3

The SeqHound Manual

293 of 421

Observation:
Source org:
Source file:
FTP script:
Parser:

seqhound@blueprint.org

18/04/2005

Blueprint
BIND division files
N/A
MI-BIND

Version 3.3

The SeqHound Manual

294 of 421

18/04/2005

MI_obj_dbases table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_obj_dbases
MI
Table of objects source databases

Field

Type

Null

Smallint(6)

db_name

Varchar(30)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for the
source db
The name of the object source db

MySQL Indexes

seqhound@blueprint.org

Field
db

Version 3.3

The SeqHound Manual

295 of 421

18/04/2005

MI_mol_types table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_mol_types
MI
Molecular types of objects

Field

Type

Null

type

Smallint(6)

type_name

Varchar(15)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for
molecules of this type
Natural language name for this
molecule type

MySQL Indexes

seqhound@blueprint.org

Field
type

Version 3.3

The SeqHound Manual

296 of 421

18/04/2005

MI_dbases table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_dbases
MI
Table of source databases for interaction records

Field

Type

Null

Smallint(6)

db_name

Varchar(30)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for this
source interaction database
Natural language name for this
source interaction database.

MySQL Indexes

seqhound@blueprint.org

Field
db

Version 3.3

The SeqHound Manual

297 of 421

18/04/2005

MI_record_types table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_record_types
MI
Table of types for interaction records

Field

Type

Null

type

Smallint(6)

type_name

Varchar(20)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for this
record type
Natural language name for this
record type(by default, either
interaction, complex or pathway).

MySQL Indexes

seqhound@blueprint.org

Field
type

Version 3.3

The SeqHound Manual

298 of 421

18/04/2005

MI_complexes table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_complexes
MI
Table of complex and pathway records

Field

Type

Null

compid

Int(11)

numsubunits

Int(11)

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for this
complex record
Number of subunits in this
0
complex

MySQL Indexes

seqhound@blueprint.org

Field
compid

Version 3.3

The SeqHound Manual

299 of 421

18/04/2005

MI_complex2ints table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_complex2ints
MI
Mapping of complexes to their component interactions

Field

Type

Null

compid

Int(11)

intid

Int(11)

Keyname

Type

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for this
0
complex
Internal MI identifier for an
0interaction which is a component
of this complex

MySQL Indexes

seqhound@blueprint.org

Field

Version 3.3

The SeqHound Manual

300 of 421

18/04/2005

MI_complex2subunits table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_complex2subunits
MI
Mapping of complexes to subunits

Field

Type

Null

compid

Int(11)

objid

Int(11)

Keyname

Type

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Column Definition
Internal MI identifier for this
0
complex
Internal MI identifier for the
0subunit object which belongs to
this complex

MySQL Indexes

seqhound@blueprint.org

Field

Version 3.3

The SeqHound Manual

301 of 421

18/04/2005

MI_complex2subunits table
Last updated October 4, 2004
Database:
Table:
Module:
Definition:

MI
MI_complex2subunits
MI
Mapping of complexes to subunits

Field

Type

Null

Default

compid

Int(11)

objid

Int(11)

Keyname

Type

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Column Definition
Internal MI identifier for this
complex
Internal MI identifier for the
subunit object which belongs to
this complex

MySQL Indexes

seqhound@blueprint.org

Field

Version 3.3

The SeqHound Manual

302 of 421

18/04/2005

MI_refs table
Last updated October 6, 2004
Database:
Table:
Module:
Definition:

MI
MI_refs
MI
Mapping of MI records to literature references which support it

Field

Type

Null

Default

uid

Int(11)

Smallint(6)

acc

Varchar(15)

Yes

Null

Int(11)

Yes

Null

method

Smallint(6)

Keyname

Type

Observation:
Source org:
Source file:
FTP script:

Blueprint
BIND division files
N/A

MySQL
Column_Definition
Internal MI identifier for the
interaction record which this
reference is a part of
Internal MI identifier for the
reference database which this
reference is from (eg pubmed or
medline)
Accession of this reference from
source reference database
ID of this reference from the
source reference database
Internal MI identifier for the
experimental method used in the
referenced experiment

MySQL Indexes

seqhound@blueprint.org

Field

Version 3.3

The SeqHound Manual

303 of 421

Parser:

seqhound@blueprint.org

18/04/2005

MI-BIND

Version 3.3

The SeqHound Manual

304 of 421

18/04/2005

MI_refs_db table
Last updated October 6, 2004
Database:
Table:
Module:
Definition:

MI
MI_refs_db
MI
Mapping of internal MI reference database Ids to source reference
databases

Field

Type

Null

Smallint(6)

db_name

Varchar(15)

Yes

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Null

Column_Definition
Internal MI identifier for this
reference database
The natural language name of the
reference source database

MySQL Indexes

seqhound@blueprint.org

Field
db

Version 3.3

The SeqHound Manual

305 of 421

18/04/2005

MI_exp_methods table
Last updated October 6, 2004
Database:
Table:
Module:
Definition:

MI
MI_exp_methods
MI
Mapping of internal experimental method identifiers to their
descriptions

Field

Type

Null

method

Smallint(6)

method_descr

Varchar(40)

Yes

Keyname
Primary

Type
Primary

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Default

Null

Column_Definition
Internal MI identifier for this
experimental method
Natural language name of the
experimental method

MySQL Indexes

seqhound@blueprint.org

Field
method

Version 3.3

The SeqHound Manual

306 of 421

18/04/2005

MI_obj_labels table
Last updated October 6, 2004
Database:
Table:
Module:
Definition:

MI
MI_obj_labels
MI
Mapping of molecular objects to their free form labels

Field

Type

Null

Default

uid

Int(11)

objid

Int(11)

label

Text

Yes

Null

Keyname

Type

Observation:
Source org:
Source file:
FTP script:
Parser:

Blueprint
BIND division files
N/A
MI-BIND

MySQL
Column_Definition
Internal MI identifier for the
interaction/complex/pathway
record which assigns this label to
this object
Internal MI identifier for the
molecular object being labeled
Free form label given to this
molecular object by this
interaction/complex/pathway
record

MySQL Indexes

seqhound@blueprint.org

Field

Version 3.3

The SeqHound Manual

307 of 421

18/04/2005

Text mining module
Overview:
The SeqHound text mining module helps researchers locate mentions and co-mentions of biologically related entities in the scientific
literature.
At the time of writing this is limited to finding protein mentions in PubMed abstracts. The module however has been designed to be
extensible to small molecules, complexes and even biological concepts in both abstract and full-text articles. Each of the steps in the
process may be scored by multiple methods that can be developed internally or by external developers.

mother parser
Note that a number of the tables in the text module are created at the same time as the creation of the other core tables (see core.sql)
and some of these tables are populated by the mother parser (see table descriptions below). Mother, is used to retrieve protein names
and synonyms from the RefSeq database, so these tables are created at the same time as other core module tables. The mother parser
is described under the core module section above. Parsers specific to the text module are described below.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

308 of 421

18/04/2005

text searcher parser
last updated February 25th, 2005
purpose:
The text update parser and related programs are used to collect bionames, search against a literature database, and then investigate comentions of bioentities in the literatures. The co-mentions of bioentities are scored using pattern recognition and statistical machine
learning methods to look for potential biophysical interactions between bioentities. The text mining module tables are updated daily.
logic:
Most of the update logic is implemented in the Text.pm Perl module and other scoring programs. The text update parser calls the
functions in Text.pm and other Perl modules. The text mining module depends on the CORE module to generate and update names of
proteins and on the MyMED in house literature database. These resources are updated daily. The steps taken to update the text-mining
module are as follows:
Step 1: Collect bioentity names from the lexicon
Step 2: Formulate searches; collect search results and scores
Step 3: Collect co-occurrences of names and scores
Step 4: Summarize evidence for each pair of bioentities and scores
module: text
input files:
Latest Medline release: ftp://ftp.ncbi.nih.gov/nlmdata/.medlease/*.xml.gz
Latest PubMed Central release: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/
Latest Biomed Central release: ftp://ftp.biomedcentral.com/
Latest NRC LitMiner SVM release: http://ii200.iit.nrc.ca/~martinj/
British National Corpus: ftp://ftp.itri.bton.ac.uk/bnc/

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

309 of 421

18/04/2005

Moby project English word list: http://www.dcs.shef.ac.uk/research/ilash/Moby/
PubMed help stop word list: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#Stopwords
Smart English stop word list: ftp://ftp.cs.cornell.edu/pub/smart/english.stop
tables altered:
text_bncorpus, text_db, text_doc, text_docscore, text_doctax, text_englishdict, text_evidence, text_evidencescore, text_method,
text_namepair, text_namepairresult, text_organism, text_pattern, text_point, text_pointscore, text_result, text_resultscore, text_rng,
text_rngscore, text_search, text_searchscore, text_stopword
source code location:
slri/seqhound/text/text_update.pl
config file dependencies:
slri/seqhound/text/text.ini
command line parameters:
Typing “./text_update.pl -“ or “perl text_update.pl -” at the command line while in the directory where
text_update.pl resides will return a list of command line parameters and default settings.
text_update.pl
-r

arguments:

redo all name searches against new insertions from MyMED using time stamp [T/F] optional

-t taxnomy ids for organism to be searched, seperated by comma. If specified in command line, only provided
organism(s) will be searched. Otherwise, a default list of taxids stored in text.ini file will be searched.
Optional
-u update target table only, default is all. Optional

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

310 of 421

18/04/2005

example use:
For example:
>./text_update.pl -r F –t 4932,10090,9606 -u text_searchscore
associated scripts:
slri/seqhound/text/text.sql Data Definition file for text mining module tables.
slri/seqhound/text/text_dump.sh Mysql dump script for portable tables.
slri/seqhound/text/text_create.sh Script used to create Text database, the starting point.
slri/seqhound/text/text_regex.pl Script used to score evidence using regex patterns.
slri/seqhound/text/Text.pm contains most functions that will be using for update.
slri/seqhound/text/Pattern.pm Perl module used to represent regular expression objects.
slri/seqhound/text/Tee.pm Perl module used to branch the output to different outputs.
slri/seqhound/text/text_updatecron.sh Text module daily update cron script.

slri/seqhound/text/myeutils.pl Entrez eutil script used for comparing results between MyMED and entrez searches.
error and run-time logs:
Errors and runtime logs will be directed to file specified in text.ini , current default log file name is text_update.log
Daily update log will also be send to email account specified in text.ini file
troubleshooting:
Check the email message sent by text_update.pl to see if there is any error during update. Consult update log file to look for detail
problem.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

311 of 421

18/04/2005

additional info:
All text mining module tables are in small cases, and are prefixed with "text_". Tables with auto incremented rowid have a primary
key "id" as default. When referencing primary key in other table, the field name in the referencing table will be the referenced table
name plus id.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

312 of 421

18/04/2005

yeastnameparser.pl parser
Last updated September 27, 2004
purpose:
The yeastnameparser extracts names from the SGD file SGD_features.tab. Only names that belong to yeast records that are already in
RefSeq are added, as determined using the DBXRef module. This means that this parser MUST be run after the DBXRef parsers.
It is not necessary to run this parser, if yeast names are not desired.
logic:
The yeastnameparser reads through each record in SGD_features.tab. It searches for a refseq cross reference for each record. If one is
found, then the parser gets the relevant bioentityid and checks all names for that bioentityid.
If the name does not already exist in the database, then it is added and the db field is set to “sgd” and the access field is set to the
SGDID. The action field is set to 1, ADD and the current data is written to actiondate.
Existing names in the database are also compared to the file. Names which are no longer present in the yeast file are marked as deleted
(action =2) and the current data is written to actiondate. If the name already exists in refseq, then a SECONDREFS record is filled out
for the yeast record.
module: names
input files:
SGD_features.tab from
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/
tables altered:
bioentity, bioname, secondrefs
source code location:
slri/seqhound/names/yeastnameparser.pl
config file dependencies:
The relevant configuration files are:
.intrezrc and .odbc.ini should be set up as described above for seqhound. The values will be read by shconfig.pm, which should be
located in the same directory as yeastnameparser.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

313 of 421

18/04/2005

command line parameters:
None.
example use:
perl yeastnameparser.pl
associated scripts:
The program yeastnamecron.pl can be used for both the initial read of SGD_features.tab and updates of the file. yeastnamecron.pl
checks whether SGD_features.tab needs updating, downloads it and calls yeastnameparser.pl. SGD_features.tab is updated weekly by
SGD.
error and run-time logs:
yeastnameparser writes errors to a file called yeastname.log. Updates are written to a file called yeastupdate.log as a tab delimited file
where the fields are: name, sgdid, bionameid and field. Additions are written to a file called yeastadd.log as a tab delimited file where
the fields are:name, sgdid bioentityid and field.
troubleshooting:
additional info:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

314 of 421

18/04/2005

text_bioentity table
Last updated Febuary 15, 2004
SeqHound
Database:
text_bioentity
Table:
text (note mother parser is part of the core module)
Module:
A bioentity refers to any biological object with names that may be used in the literature to refer to these objects.
Definition:
Currently, all bioentities are proteins from RefSeq. This table tells us which database contains the primary record
for this bioentity, the accession and the field which refers to this bioentity. At the moment, all bioentities are
obtained from RefSeq. RefSeq was chosen because it represents a high quality database of non-redundant records.
The intention of this part of the core is to collect a non-redundant list of biological objects and the names that are
used in written language to refer to them by.
*.bna.gz from ftp://ftp.ncbi.nih.gov/refseq/release/complete
Source file:
asnftp.pl
FTP script:
mother
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

315 of 421

18/04/2005

text_bioentity table
Field

Type

Null

Default

int(11)

Auto incremented id. A
unique identifier for this
bioentity.

Bioentity type (for
1
example, molecule type).
Id from the bioentitytype
table.

mother parser (see
SHoundBioentityFromBioentityId
GetType()) parsing
and see bioseq->mol
mother parser inserts SHoundBioentityFromBioentityId
“ref”

Example
1

Source
mother parser
autoincrements this
field

API
SHoundBioentityIdFromGi
SHoundBioentityIdFromAcc
SHoundBioentityIdListFromBionameAndTaxId

bioentitytypeid

int(11)

varchar(15)

The primary database
ref
where this bioentity was
found.

Accession in the primary NP_858066 mother parser (see
database. Any
FillBionameDB())
and see bioseq-alphanumeric identifier
>seqid
used by the primary
database.

Yes

Numeric identifier for this 31982991
bioentity in the primary
database; for example an
NCBI Gene Info identifier
(GI). This field is not
required.

mother parser (see
SHoundBioentityFromBioentityId
GetGI) see bioseq->seqid (choice 12)).

mother parser (see
ASN.1 path)

A number that represents 1
the field from which this
bioentity was derived.
This is the id from the
fieldtype table. For
example, 1 indicates the
ASN.1 path
“seqentry/seqset/bioseq”

access

identifier

fieldtypeid

varchar(20)

int(11)

seqhound@blueprint.org

Column_Definition

Version 3.3

SHoundBioentityFromBioentityId

The SeqHound Manual

text_bioentity indices
Keyname
Type
PRIMARY
PRIMARY
ibioe_id
INDEX
ibioe_type
INDEX
ibioe_identifier
INDEX

seqhound@blueprint.org

316 of 421

18/04/2005

Field
access
id
bioentitytypeid
identifier

Version 3.3

The SeqHound Manual

317 of 421

18/04/2005

text_bioname table
Last updated February 15, 2005
SeqHound
Database:
text_bioname
Table:
text (note that the mother parser is part of the core module)
Module:
This table holds names for bioentities. A Bioname is some “name” applied to some bioentity.
Definition:
NCBI
Source org:
*.bna.gz from ftp://ftp.ncbi.nih.gov/ncbi/refseq/release/complete
Source file:
asnftp.pl
FTP script:
mother
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

318 of 421

18/04/2005

text_bioname table
Field

Type

Null

Default

Column_Definition

int(11)

timestamp

datetime

Yes

NULL

bioentityid

int(11)

name

text

The name.
Type of name. For example,
protein name (1) or gene name
(2). This is the id from the
nametype table.

A unique identifier for this
bioentity-name pair.

Source
API
mother parser
SHoundBionameListFromBioentityId
auto increments
this column
mother parser
yeastnameparser

The id of the bioentity to which
this name refers.

SHoundBionameListFromBioentityId
mother parser
yeastnameparser

2mother parser
SHoundBionameListFromBioentityId
isopropylmal yeastnameparser
ate synthase
1

mother parser
SHoundBionameListFromBioentityId
yeastnameparser

nametypeid

int(11)

varchar(15)

The database in which this name ref
was found.

access

varchar(20)

The accession of the record in
which this name was found.

SHoundBionameListFromBioentityId
NP_047187 mother parser
yeastnameparser
10954458

Yes

NULL

The identifier of the record in
which this name was found; for
example a Gene Info identifier
(GI).

The field of the record in which 5
this name was found. This is the
id from the fieldtype table; for
example, 5 indicates the
ASN.1 path
“seqentry/seqset/bioseq/descrtitle/”

identifier

fieldtypeid

int(11)

seqhound@blueprint.org

Example
1

Version 3.3

SHoundBionameListFromBioentityId
mother parser
yeastnameparser

mother parser
SHoundBionameListFromBioentityId
yeastnameparser

The SeqHound Manual

319 of 421

18/04/2005

official

int(11)

Yes

Is this an official name?
1=Yes, 2=N0

yeastnameparser SHoundBionameListFromBioentityId

deprecated

int(11)

Yes

Has this name been deprecated? 0
Not used at present.

SHoundBionameListFromBioentityId

datedeprecated

datetime

Yes

000000000 The date the name was
00000
deprecated.

00000000000 mother parser
SHoundBionameListFromBioentityId
000
yeastnameparser

ruleid

int(11)

Yes

NULL

What rule was used to construct 1
this name? This is the id from the
rules table. For example, 1
indicates that A gene name is
being used to refer to a protein.

action

char(1)

Yes

What was the last action taken on
this record. A=ADD,
D=DELETE.

actiondate

datetime

Yes

000000000
Date of the last action.
00000

seqhound@blueprint.org

mother parser
SHoundBionameListFromBioentityId
yeastnameparser

mother parser
yeastnameparser
2004-08-20 mother parser
01:30:08
yeastnameparser

Version 3.3

The SeqHound Manual

text_bioname indices
Keyname
ibioname_id
ibioname_identifier
ibioname_access
ibioname_bioentityid
ibioname_nametypeid
ibioname_official
ibioname_deprecated
ibioname_ruleid
ibioname_action
ibioname_actiondate

seqhound@blueprint.org

320 of 421

Type
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX
INDEX

18/04/2005

Field
id
identifier
access
bioentityid
nametypeid
official
deprecated
ruleid
action
actiondate

Version 3.3

The SeqHound Manual

321 of 421

18/04/2005

text_secondrefs table
Last updated February 15, 2005
SeqHound
Database:
text_secondrefs
Table:
core
Module:
Additional references for sources of bionames.
Definition:
SGD
Source org:
SGD_features.tab from ftp://genomeSource file:
ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/
yeastnamecron.pl
FTP script:
yeastnameparser.pl
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

322 of 421

18/04/2005

text_secondrefs table
Field

Type

Null

int(11)

timestamp

datetime

Yes

NULL

bionameid

int(11)

varchar(15)

access

varchar(20)

Yes

NULL

Accession of the record that refers
to the name.

fieldtypeid

int(11)

Identifies the field in the record
where this name was found.

seqhound@blueprint.org

Default

Column_Definition

Example
1

id of the second reference. Mysql
auto-increment column.

Source
yeastnameparser
will Autoincrement this
column
yeastnameparser

id of the name that this reference
refers to.
Database in which this reference is
found.

yeastnameparser
retrieves this from
text_bioname table

sgd

yeastnameparser
(see
AddSeconfRef())

S000033

yeastnameparser
(see
AddSeconfRef())

101

yeastnameparser
(see
AddSeconfRef())

Version 3.3

API

The SeqHound Manual

text_secondrefs indices
Keyname
isecondrefs_id
isecondrefs_bionameid
isecondrefs_dbsearch
isecondrefs_field

323 of 421

Type
INDEX
INDEX
INDEX
INDEX

seqhound@blueprint.org

18/04/2005

Field
id
bionameid
access, db
field

Version 3.3

The SeqHound Manual

324 of 421

18/04/2005

text_bioentitytype table
Last updated February 15, 2005
SeqHound
Database:
text_bioentitytype
Table:
text (note that this table is currently defined when the core module is
Module:
created)
Look up table that stores bioentity types.
Definition:
Blueprint
Source org:
core.sql
Source file:
NA
FTP script:
NA
Parser:
text_bioentitytype table
Field

Type

Null

int(11)

type

varchar(80)

Default

Column_Definition
Identifier for a bioentitytype.
Mysql auto-increment column.

Example
1

Source
core.sql

The molecule type.

protein

core.sql

text_bioentitytype indices
NA

seqhound@blueprint.org

Version 3.3

API

The SeqHound Manual

325 of 421

18/04/2005

text_fieldtype table
Last updated February 15, 2005
SeqHound
Database:
text_fieldtype
Table:
text (note that this table is defined when the core module is created)
Module:
Look up table that stores the field that contains the name.
Definition:
Blueprint
Source org:
This is a look up table that is filled by core.sql
Source file:
NA
FTP script:
NA
Parser:
text_fieldtype table
Field
id

pathtofield

Type
int(11)

varchar(80)

Null
No

Default

Column_Definition
Identifier for a fieldtype.

Example
1

The field that contains the name.
For an ASN.1 record, this is the
ASN.1 "path" to the field that
contains the name. For a flat file,
this is of the form "Column
#:Column Name".

seqentry/seq core.sql
set/bioseq/s
eqannot/seqfea
t-gene/syn

text_fieldtype indices
NA

seqhound@blueprint.org

Version 3.3

Source
core.sql

API

The SeqHound Manual

326 of 421

18/04/2005

text_nametype table
Last updated February 15, 2005
SeqHound
Database:
text_nametype
Table:
text (note that this table is defined when the core module is created)
Module:
Look up table that stores name types.
Definition:
Blueprint
Source org:
This is a look up table that is filled by core.sql
Source file:
NA
FTP script:
NA
Parser:
text_nametype table
Field
Type
id
type

Null Default Column_Definition
Identifier for a nametype.
Mysql auto-increment
int(11)
No
column.
varchar(880) No
The type of name.

Example
1

Source
core.sql

protein

core.sql

text_nametype indices
NA

seqhound@blueprint.org

Version 3.3

API

The SeqHound Manual

327 of 421

18/04/2005

text_rules table
Last updated September 30, 2004
SeqHound
Database:
text_rules
Table:
text (note that this table is defined when the core module is created)
Module:
Look up table that stores rules for generating names.
Definition:
Blueprint
Source org:
This is a look up table that is filled by core.sql
Source file:
NA
FTP script:
NA
Parser:
text_rules table
Field

Type

Null

int(11)

type

varchar(80)

Default

Column_Definition
Identifier for a rule. Mysql autoincrement column.
The rule.

Example
1

Source
core.sql

Use gene name
for protein.

core.sql

text_rules indices
NA

seqhound@blueprint.org

Version 3.3

API

The SeqHound Manual

328 of 421

18/04/2005

text_db table
Last updated February 15, 2005
seqhound
Database:
text_db
Table:
text
Module:
Lookup table that lists biomedical literature databases used by the text
Definition:
module.
Source file:
Blueprint
Source org:
NA
FTP script:
NA
Parser:
text_db table
Field
dbid

Type
int(11)

Null
No

name

varchar(80)

text_db indices
Keyname
PRIMARY

Type
PRIMARY

seqhound@blueprint.org

Default

Column Definition
Auto incremented id

Example
1

Source
text.sql

name of database

PubMed

text.sql

Field
dbid

Version 3.3

API

The SeqHound Manual

329 of 421

18/04/2005

text_doc table
Last updated February 16, 2005
seqhound
Database:
text_doc
Table:
text
Module:
This table lists the accession ids from each literature database and
Definition:
assigns them an internal document id.
ftp://ftp.nlm.nih.gov/nlmdata/.medlease/medline*.xml.gz
Source file:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/
NLM MEDLINE and other biomedical literature databases
Source org:
slri/medline/updates/updatecron.sh
FTP script:
slri/medline/pubmedcentral/pmc_updatecron.sh
text_update.pl
Parser:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

text_doc table
Field
docid
dbid

Type
int(11)
int(11)

330 of 421

18/04/2005

Null Default Column Definition
Example Source
No
auto incremented identifier 1077882
No 0
literature database id
1
1077882

accession

int(11)

accession in literature
database such as a PubMed
identifier (PMID)

status

char(10)

Yes

NULL

insert, delete or update

text_doc indices
Keyname
id
dbid
accession

Type
PRIMARY
INDEX
INDEX

seqhound@blueprint.org

Medline Xpath:
“/MedlineCitati
on/PMID”
Or
PMC Xpath:
/art/ui[@type="
pmid"]'

NULL

Field
id
dbid
accession

Version 3.3

API

The SeqHound Manual

331 of 421

18/04/2005

text_docscore table
Last updated September 28, 2004
seqhound
Database:
text_docscore
Table:
text
Module:
This table lists scores for documents. Multiple scores (from different
Definition:
scoring methods) may be listed for each document.
Source file:
NRC
Source org:
wget –P ~/nrc -m http://ii200.iit.nrc.ca/~martinj/
FTP script:
slri/text/nrc.sh
LitMiner :http://textomy.iit.nrc.ca/cgiProgram:
bin/bindpresent.cgi?qry=10747882
The current method for scoring a document is an SVM (support vector
Notes:
machine) classifier implemented by National Research Council's Joel
Martin and Berry deBruijn. Detailed information can be found in the
prebind publication:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pu
bmedid=12689350

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

332 of 421

18/04/2005

text_docscore table
Field

Type

Null

int(11)

methodid

int(11)

score

double

text_docscore indices
Keyname
Type
PRIMARY
PRIMARY
methodid
INDEX
score
INDEX

seqhound@blueprint.org

Default

Column_Definition
foreign key from doc table. See
docid

Example
25

Source

scoring method identifier. See the 1
text_method table. For example, 1
indicates an SVM trained to
recognize papers describing
interaction data.
0

score

-0.852876574

Field
(docid, methodid, score)
methodid
score

Version 3.3

http://ii200.iit.nr
c.ca/~martinj/

API

The SeqHound Manual

333 of 421

18/04/2005

text_evidence table
Last updated February 16, 2004
seqhound
Database:
text_evidence
Table:
text
Module:
This table assigns a unique identifier to each co-occurrence of two
Definition:
bionames in the same document. A bioname points to a specific name
bioentity pair in the bioname table (this pair is repeated here). This is to
be distinguished from a pair of names that co-occur in a document; this
pair of names is identified by a namepairid and does not make any
reference to a specific pair of bioentities. This evidence is in support of
some point and is based on some name pair.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

334 of 421

18/04/2005

text_evidence table
Field
Type

Null Default Column_Definition

int(11)

docid

int(11)

the document identifier
325387
where this evidence occurs

resultidA

int(11)

search result identifier for
bioname A

2200

bioentityida

int(11)

identifier for bioentity A

107564

nama

char(80)

name referring to A

LEU1

resultidB

int(11)

search result identifier for
bioname B

9513

bioentityidb

int(11)

identifier for bioentity B

109911

namb

char(80)

name referring to B

HIS4

pointid

int(11)

Yes

NULL

POtential INTeraction
2
between two bioentites that
this evidence supports

int(11)

Yes

NULL

corresponding namepairid
on which this evidence is
based

10634

namepairid
state

smallint(6) Yes

NULL

book keeping

seqhound@blueprint.org

A unique identifier for this
piece of evidence.

Example
1

Version 3.3

Source
Auto
incremented
identifier

API

The SeqHound Manual

text_evidence indices
Keyname
Type
PRIMARY
PRIMARY
docid
INDEX
resultida
INDEX
resultidb
INDEX

seqhound@blueprint.org

335 of 421

18/04/2005

Field
evidenceid
docid
resultidA
resultidB

Version 3.3

The SeqHound Manual

336 of 421

18/04/2005

text_evidencescore table
Last updated February 16, 2005
seqhound
Database:
text_evidencescore
Table:
text
Module:
This table assigns a score to a piece of evidence. See evidence table.
Definition:
Multiples cores may be assigned to one evidence if different methods
were used. Currently, this table stores potential protein-protein
interaction scores for each co-occurrence of bionames. The current
scoring method uses a set of manually collected regular expression
patterns to identity interactions. Detailed information can be found at:
ftp://ftp.blueprint.org/pub/BIND/PreBIND/README
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

text_evidencescore table
Field
Type
evidenceid
int(11)

methodid

int(11)

score

double

337 of 421

Null Default Column_Definition
Example
No 0
see text_evidence table
8
scoring method identifier
2
(see text_method table); for
example, 2 indicates a
method that uses regular
No 0
expressions to detect
bioentities that are
physically interacting
No

text_evidencescore indices
Keyname
Type
PRIMARY
PRIMARY
methodid
INDEX
score
INDEX

seqhound@blueprint.org

18/04/2005

the score

0.63

Field
(evidenceid, methodid, score)
methodid
score

Version 3.3

Source

API

The SeqHound Manual

338 of 421

18/04/2005

text_method table
Last updated March 15, 2005
seqhound
Database:
text_method
Table:
text
Module:
This table assigns a unique identifier to each of the multiple scoring
Definition:
schemes used in the SeqHound text mining module.
This file is hand-edited.
Source:
Blueprint
Source Org.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

339 of 421

18/04/2005

text_method table
Field
id

Type
int(11)

Null
No

type

varchar(30)

hypothid

int(11)

Yes

hypoth

text

Yes

Default
0

Column_Definition
Auto incremented identifier

Example
1

This field describes the type of
searchscore
score that is generated by the
method. It will be one of
searchscore, docscore, resultscore,
evidencescore or pointscore.
0

corresponding hypothesis
identifier. See the text_hypoth
table.

NULL

a hypothesis that this method
attempts to support or refute.

This document
describes
biophysical
interaction data for
some molecule(s).

method

text

Yes

NULL

A support vector
machine was
trained to recognize
a more detailed description of the
abstracts containing
method with pointers to more
biophysical
details about the method and its
interaction data.
performance.
See PubMed
Identifier 12689350
for more details.

text

Yes

NULL

the value (range) of scores
corresponding to the hypothesis
being found TRUE

positive

text

Yes

NULL

the value (range) of scores
corresponding to the hypothesis
being found FALSE

negative

seqhound@blueprint.org

Version 3.3

Source

API

The SeqHound Manual

340 of 421

18/04/2005

undecided

text

Yes

NULL

the value (range) of scores
corresponding to the hypothesis
being found UNDECIDED

implemented

enumerated

Yes

NULL

has this method been implemented NO

assume

text

Yes

NULL

script

text

Yes

NULL

text_method indices
Keyname
Type
methodid
PRIMARY

seqhound@blueprint.org

the score value that can be
>0
assumed if the method has been
implemented but no score is found
the script or program that
implements the method

text_search.pl

Field
methodid

Version 3.3

The SeqHound Manual

341 of 421

18/04/2005

text_point table
Last updated February 16, 2005
seqhound
Database:
text_point
Table:
text
Module:
A POINT represents two bioentities for which some POtential
Definition:
INTeraction may occur in the literature. Each POINT may be supported
by multiple pieces of evidence in the evidence table.
Program:
text_point table
Field
Type
id
int(11)

Null Default Column_Definition
Example
No
auto incremented identifier 1

bioentityidA

int(11)

bioentity identifier

110070

bioentityidB

int(11)

bioentity identifier

1268438

state

small int

Yes

NULL

book keeping

text_point indices
Keyname
tid
bioentityida
bioentityidb

Type
PRIMARY
INDEX
INDEX

seqhound@blueprint.org

Field
id
bioentityida
bioentityidb

Version 3.3

Source

API

The SeqHound Manual

342 of 421

18/04/2005

text_pointscore table
Last updated February 16, 2005
seqhound
Database:
text_pointscore
Table:
text
Module:
This table lists scores for potential interactions. These scores may be
Definition:
viewed as a summary of the scores for all the pieces of evidence that
support this POINT. A POINT may have multiple scores that are
generated by multiple methods.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

text_pointscore table
Field
Type
pointid

int(11)

methodid

int(11)

score

double

343 of 421

Null Default Column_Definition
Example
a unique identifier for some 11
No 0
POINT (see text_point table)
an identifier for the method 3
used to generate this score
No 0
for the POINT (see
text_method table).
No

text_pointscore indices
Keyname
Type
PRIMARY
PRIMARY
methodid
INDEX
score
INDEX
pointid
INDEX

seqhound@blueprint.org

18/04/2005

the score

0.63

Field
(pointid, methodid, score)
methodid
score
pointid

Version 3.3

Source

API

The SeqHound Manual

344 of 421

18/04/2005

text_result table
Last updated February 16, 2004
seqhound
Database:
text_result
Table:
text
Module:
This table stores all the search results (document ids) for performed
Definition:
searches (see text_search table). Position default is 0, which means that
the bioname appears somewhere in the document without specifying
exactly where or how many times.
Program:
text_result table
Field
Type
id
int(11)

Null Default Column_Definition
Example
No
auto incremented identifier 1

searchid

int(11)

identifies the search that
generated this result

docid

int(11)

the document identifier
7874750
where the name was found
0

the postion id in the
document where the name
appears (0 indicates no
specified position)

NULL

book keeping

positionid

int(11)

state

smallint(6) Yes

text_result indices
Keyname
Type
tid
PRIMARY

seqhound@blueprint.org

Field
id

Version 3.3

Source

API

The SeqHound Manual

searchid
docid
positioned

INDEX
INDEX
INDEX

seqhound@blueprint.org

345 of 421

18/04/2005

searchid
docid
positionid

Version 3.3

The SeqHound Manual

346 of 421

18/04/2005

text_resultscore table
Last updated February 16, 2005
seqhound
Database:
text_resultscore
Table:
text
Module:
This table holds scores for search results; i.e., is the searched-for
Definition:
bioentity really mentioned in the document. This table might be used to
store disambiguation scores for search results. Because a bioname can
refer to many different bioentities, an algorithm (some method) )may be
used to determine which bioentity a name occurrence refers to. This
table might also hold the results of methods that disambiguate
bioentities that have English words as names.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

347 of 421

18/04/2005

text_resultscore table
Field
Type
id
int(11)

Null Default Column_Definition
Example
No 0
Auto incremented identifier 1

methodid

int(11)

identifier for the method
3
used to score this result (see
text_method table)

score

double

the score

text_resultscore indices
Keyname
Type
PRIMARY
PRIMARY
resultid
INDEX
methodid
INDEX
score
INDEX

seqhound@blueprint.org

-1

Field
(resultid, methodid, score)
resultid
methodid
score

Version 3.3

Source

API

The SeqHound Manual

348 of 421

18/04/2005

text_search table
Last updated February 16, 2004
seqhound
Database:
text_search
Table:
text
Module:
This search table is generated using the text_bioname table. A search is
Definition:
minimally composed of a name that is used to look for some bioentity
(listed in this table) using some method. Currently all bioname items
with nametypeid=2 and a taxid that can be found in the taxgi table
and are inserted into this search table for MyMedline searching.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

349 of 421

18/04/2005

text_search table
Field
Type
id
int(11)
bioentityid
int(11)

Null Default Column_Definition
Example
No
auto incremented identifier 1
No 0
bioentity being searched for 106945

bionameid

int(11)

name

char(80)

taxid

int(11)

rngid

int(11)

Yes

bionameid used to search for 414737
bioentity
name used to find mention
of bioentity

AI1

taxonomy identifier for
bioentity (if applicable)

4932

NULL

redundant name group
identifier

70513

method used to search for 1
the bioentity mention using
name

methodid

int(11)

searched

datetime

0000-00date and time that search
00
was last performed
00:00:00

results

int(11)

Yes

NULL

text_search indices
Keyname
PRIMARY
bioentityid
bionameid
taxid

Type
PRIMARY
INDEX
INDEX
INDEX

seqhound@blueprint.org

number of search results
returned

2004-12-01 10:12:22

Field
searchid
bioentityid
bionameid
taxid

Version 3.3

Source

API

The SeqHound Manual

searched

INDEX

seqhound@blueprint.org

350 of 421

18/04/2005

searched

Version 3.3

The SeqHound Manual

351 of 421

18/04/2005

text_searchscore table
Last updated February 16, 2004
seqhound
Database:
text_searchscore
Table:
text
Module:
This table holds some score that may be used to determine if a given
Definition:
search strategy WILL BE informative if performed or its results are
likely TO BE informative if examined. This score might also be used to
determine whether a search strategy is to be performed at all or if some
search strategy is best left until more informative searches have been
informed. For example, if the name to be used is an English word, the
search may be scored so as to skip this search. Multiple methods (and
their scores) may be applied to a single search.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

352 of 421

18/04/2005

text_searchscore table
Field

Type

Null

Default

searchid

int(11)

methodid

int(11)

identifies the method used to score
the search strategy

score

double

the score

text_searchscore indices
Keyname
Type
PRIMARY
PRIMARY
searchid
INDEX
methodid
INDEX
score
INDEX

seqhound@blueprint.org

Column_Definition
identifies a search strategy (see
text_search table)

Example
1

Field
(searchid, methodid, score)
resultid
methodid
score

Version 3.3

Source

API

The SeqHound Manual

353 of 421

18/04/2005

text_rng table
Last updated February 16, 2004
seqhound
Database:
text_rng (redundant name group)
Table:
text
Module:
This table groups together bionames that have equivalent names (i.e.
Definition:
homonyms). This table facilitates searching by reducing the eliminating
redundant searches for the same string in the document collection. It is
thus an intermediate table in the process of creating the text_result table.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

354 of 421

18/04/2005

text_rng table
Field

Type

Null

int(11)

name

char(80)

searched

datetime

0000-00-00 when this name was last used to
00:00:00 search

Yes

NULL

the number of documents returned 0
by this search

Yes

NULL

book keeping

results
status

text_rng indices
Keyname
Id
Name
Searched
Results
Status

char(20)

Type
PRIMARY
Unique
INDEX
INDEX
INDEX

seqhound@blueprint.org

Default

Column_Definition
Example
auto incremented identifier. This 8
is a unique identifier for the
redundant name group identifier.
the homonym represented by this SRB8
group
2004-12-01 12:02:23

searched

Field
id
Name
Searched
Results
status

Version 3.3

Source

API

The SeqHound Manual

355 of 421

18/04/2005

text_rngresult table
Last updated February 16, 2004
seqhound
Database:
text_rngresult
Table:
text
Module:
This table is an intermediate step in one method for searching for
Definition:
mentions of protein names in text. The table stores search results for
redundant name groups. These results are combined with the
text_doctax table to generate the final text_results table.
Program:
This is an intermediate table and is not distributed as part of SeqHound.
Note:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

356 of 421

18/04/2005

text_rngresults table
Field
id

Type
int(11)

Null
No

Default

rngid

int(11)

a redundant name group identifier. 2
See the text_rng table.

docid

int(11)

Yes

NULL

a document identifier where this
name appears

pmid

int(11)

Yes

NULL

the corresponding PubMed
45549
identifier where this name appears

state

int(11)

book keeping

text_rngresults indices
Keyname
Type
Id
PRIMARY
Rngid
INDEX
Docid
INDEX
Pmid
INDEX
State
INDEX

seqhound@blueprint.org

Column_Definition
Auto incremented identifier.

Example
1

45549

Field
Id
Rngid
Docid
Pmid
State

Version 3.3

Source

API

The SeqHound Manual

357 of 421

18/04/2005

text_doctax table
Last updated February 16, 2004
seqhound
Database:
text_doctax
Table:
text
Module:
This table keeps a list of the organisms (by taxon identifiers) that are
Definition:
described in a document.
Program:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

358 of 421

18/04/2005

text_doctax table
Field

Type

Null

int(11)

docid

int(11)

document identifier (see text_doc
table).

taxid

int(11)

organism described in this
document (listed by NCBI
taxonomy database identifier)

text_doctax indices
Keyname
Id
Docid
Taxid

Type
PRIMARY
INDEX
Index

seqhound@blueprint.org

Default

Column_Definition
Auto incremented identifier. A
unique identifier.

Example
1

Field
Id
Docid
taxid

Version 3.3

Source

API

The SeqHound Manual

359 of 421

18/04/2005

text_organism table
Last updated February 16, 2004
seqhound
Database:
text_organism
Table:
text
Module:
This table keeps a list of the MESH terms that are used to identify the
Definition:
presence of an organism in a PubMed abstract.
Program:
Blueprint
Source Org:
Source file:
Note:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

360 of 421

18/04/2005

text_organism table
Field
id

Type
int(11)

Null
No

Default

Column_Definition
Example
auto incremented identifier 1

taxid

int(11)

NCBI taxonomy identifier

mesh

timestamp

searched

timestamp

Yes

CURRENT_TIM time that this search was last 2005-02-15 15:58:38
ESTAMP
completed

results

int(11)

Yes

NULL

number of documents
returned

bioentities

int(11)

Yes

NULL

number of bioentities for this NULL
organism

bionames

int(11)

Yes

NULL

number of names for this
organism

maxbionameid

int(11)

Yes

NULL

maximum bioname identifierNULL

lastupdate

timestamp

Yes

0000-00-00
00:00:00

time that all all updates were 0000-00-00 00:00:00
last completed

text_organism indices
Keyname
Type
Id
PRIMARY

seqhound@blueprint.org

4932

mesh term found in abstracts Saccharomyces cerevisiae
that describe this organism

NULL

Field
Id

Version 3.3

Source

API

The SeqHound Manual

361 of 421

18/04/2005

text_englishdict table
Last updated February 16, 2004
seqhound
Database:
text_englishdict
Table:
text
Module:
This table holds an English Dictionary from Moby project
Definition:
http://www.dcs.shef.ac.uk/research/ilash/Moby/
Program:
Oxford University
Organization:
File:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

362 of 421

18/04/2005

text_englishdict table
Field

Type

Null

Default

Column_Definition
Example
Auto incremented identifier.
1
Aunique identifier for this wordpos combination.

int(11)

word

char(16)

Yes

NULL

the word

the

pos

char(10)

Yes

NULL

part of speech

Det

freq

int(11)

Yes

NULL

frequency in the bnc per million
words

61847

count

int(11)

Yes

NULL

how many bioname identifiers
does this word refer to

1
pubmed

NULL

pubmed = this word is in th
pubmed stopword list
stop = present in other stopword
lists
other possibilities

source

char(10)

Yes

text_englishdict indices
Keyname
Type
Id
PRIMARY
Word
Index
Pos
Index
Freq
Index
Count
Index

seqhound@blueprint.org

Field
Id
Word
Pos
Freq
count

Version 3.3

Source

API

The SeqHound Manual

363 of 421

18/04/2005

text_bncorpus table
Last updated February 16, 2004
seqhound
Database:
text_bncorpus
Table:
text
Module:
This table holds the British National Corpus.
Definition:
Parser:
Oxford University (http://www.natcorp.ox.ac.uk/)
Organization:
ftp://ftp.itri.bton.ac.uk/bnc/
File:
this file is not distributed with SeqHound
Note:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

364 of 421

18/04/2005

text_bncorpus table
Field

Type

Null

int(11)

word

char(160)

freq

int(11)

pos

char(16)

files

int(11)

text_bncorpus indices
Keyname
Type
Id
Primary
Word
Index
Freq
Index

seqhound@blueprint.org

Default

Column_Definition
auto incremented identifier. A
unique identifier for this word.

Example
1

the word

the

frequency in the corpus

6187267

part of speech

at0

number of files (out of 4120) that 4120
this word appears in

Field
Id
Word
Freq

Version 3.3

Source

API

The SeqHound Manual

365 of 421

18/04/2005

text_pattern table
Last updated February 16, 2004
seqhound
Database:
text_pattern
Table:
text
Module:
This table holds a list of regular expressions used to detect mentions of
Definition:
biophysical interactions between two given names that appear in a
sentence.
Program:
Blueprint
Source Org:
Source File:
Note:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

366 of 421

18/04/2005

text_pattern table
Field

Type

Null

varchar(8)

parentid

varchar(8)

Yes

NULL

the parent identifier if this
expression is derived from
another.

class

int(11)

does this expression identify an 1
interaction (1) or the absence of an
interaction (-1).

score

double

Yes

NULL

score for this regex

0.85

regex

text

Yes

NULL

the regular expression

A(\S*\s+){0,4}\S*B(\S*\s+){0,4}\S
*heterodimer

text_pattern indices
Keyname
Type
Id
Primary
Parentid
Index

seqhound@blueprint.org

Default

Column_Definition
Example
a unique identifier for this regular 9920
expression.
99

Field
Id
Parentid

Version 3.3

Source

API

The SeqHound Manual

367 of 421

18/04/2005

text_stopword table
Last updated February 16, 2004
seqhound
Database:
text_stopword
Table:
text
Module:
This table contains a complete list of stopwords.
Definition:
1.http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#Sto
pwords
2.ftp://ftp.cs.cornell.edu/pub/smart/english.stop
3. Manually collected from bioname table
Program:
Various (see definition)
Source Org:
Source file:
Note:

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

368 of 421

18/04/2005

text_stopword table
Field
id

Type
int(11)

Null
No

word

char(16)

source

char(16)

text_ indices
Keyname
Id
Word

Type
Primary
Index

seqhound@blueprint.org

Default

Column_Definition
Auto incremented identifier

Example
1

the stop word

Yes

Source of this stop word

PubMed help

Field
Id
word

Version 3.3

Source

API

The SeqHound Manual

369 of 421

18/04/2005

6. Developing for SeqHound.
Open source development.
SeqHound code is developed on a cvs tree internal to the Samuel Lunenfeld Research
Institute by the members of the Blueprint Initiative Development team. The most stable
current release is available at
ftp://ftp.blueprint.org/pub/SeqHound/
External developers are encouraged to discuss major additions or modifications to the
system with the Project Manager at seqhound@blueprint.org.
Minor additions or corrections may also submitted to seqhound@blueprint.org.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

370 of 421

18/04/2005

Code organization.
Note: The following section is under revision and will be updated shortly.
This document summarizes the contents in the directories under slri/seqhound and
slri/nblast as of June of 2003. The contents of directories are first stated and then the
purpose of the contents is given.
Under the current directory hierarchy, the seqhound directory contains the following
directories (/) and files (*):
asn/

bioperl/
build/
cgi/
config/
db2/
domains/
examples/
genomes/

go/
html/
include/
include_cxx
java/
lib/
locuslink/
parsers/

perl/
rps/
scripts/
shreadme/

shreadme_cxx/
src/
src_cxx/
taxon/
tindex/
updates/
yeast/
seqhound .mk*
seqhound_cb.mk*
seqhound_db2.mk*
seqhound_odbc.mk*
seqhound_rem.mk*
shreadme*
shreadme_cxx*
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

asn/ :

bioperl/:
build/:

cgi/:
config/:
db2/:
domains/:
examples/:
genomes/:
go/:
html/:
include/:
include_cxx/:
java/:
lib/:
locuslink/:
parsers/:
perl/:
rps/:
scripts/:
src/:
src_cxx/:
taxon/:
tindex/:
updates/:
yeast/:

371 of 421

18/04/2005

contains the asn specifications (*.asn) for various objects used in
seqhound, auxiliary files used by datatool & asntool (*.def), and the
scripts used by asntool and datatool to autogenerate the objects in unix
(*.sh) and windows (*.bat) platforms.
contains the files used in the SeqHound bioperl module
contains a directory structure used to store the executables. As
executables get compiled, they will be moved into their relevant
directories inside the build directory. Executables using Codebase will
be moved to cb/, db2 executables are moved to db2/, and odbc
executables to odbc/. The appropriate directories will get created as the
need arises.
contains the source code for the cgi and web services.
the configuration files used in seqhound remote & local applications.
scripts to create the tables in db2 and redund for db2.
source code for the domain module.
source code showing how to use some of the code in seqhound, ex the
asn structures and functions, the C++ remote library.
source code for the complete genome module.
source code for the gene ontology module.
documentation for SeqHound API, and some scripts for converting the
documentation to html
the *.h files for SeqHound
the include files for the C++ remote library
source code for the SeqHound Java remote library
libraries for SeqHound are copied here once compiled.
source code for the locuslink module.
source code for various parsers (mother, cbmmdb, cddb, redund,
mmdbloc)
the SeqHound perl module (deprecated in favor of bioperl)
source code for the rps module (domname, rpsdb)
various scripts to retrieve flatfiles from NCBI, and to build SeqHound
source code for SeqHound (includes db layers for GO, HIST,
LOCUSLINK, NBR, rpsdb, taxdb& core modules), the C remote API
source code for the remote C++ library.
source code for the taxon module parser.
source code for the text indexer.
source code for the daily updates & histparser.
source code for importing yeast GO into Seqhound (never completed).

seqhound@blueprint.org

Version 3.3

The SeqHound Manual
*.mk:
shreadme*:

372 of 421

18/04/2005

various auxiliary files used by SeqHound makefiles.
readme files for C & C++ SeqHound

Under the current directory hierarchy, the nblast directory contains the following
directories(/) and files(*):
asn/
db/
docs/
lib/
msvc/
scripts/
src/
nblast.mk*
nblastflags.mk*
asn/:
db/:
docs/:
lib/:
msvc/:
scripts/
:
src/:
*.mk:

contains the ASN specifications for nblast.
source code for nblast Codebase layer.
instructions for compiling nblast.
source code for the ASN nblast object.
files for Microsoft Visual C++ project.
various scripts for retrieving NCBI Blast flatfiles & setting up NBlast.
core source code for nblast.
auxiliary makefile for nblast applications.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

373 of 421

18/04/2005

Adding/Modifying a remote API function to SeqHound.
Note: This section is included for historical purposes and is being
rewritten for a future release (5.0)
Overall steps:
1. open a new Bugzilla report
2. create the database search functionality
3. create the local API call
4. create the CGI call
5. add the remote calls (C/C++/Java/Bioperl/Perl)
6. test the new functionality in each of the layers
7. write/update documentation
8. inform technical writer
9. inform tester
10. update seqrem and local library in production server
11. In some rare cases, modifications may have to be made to the underlying data table
structure, then SeqHound must be rebuilt in test and production environments before
code is checked into cvs.
12. check in the new source code
13. update API website
14. close Bugzilla report
This document only goes over steps 2-5 in detail.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

374 of 421

18/04/2005

Overall architecture of the SeqHound system.
The diagram shows how the various code layers in SeqHound.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

375 of 421

18/04/2005

2. Create the Database Search Functionality
If adding a new data file, you will need to write the database search functions, one for
each data file. Database Search functions are found in ‘slri/seqhound/src/*_cb.c’
files.
‘*’ is the name of the module that the searched table belongs to like ‘GO’ or ‘intrez’.
‘cb’ refers to the database engine that the function talks to like CodeBase (cb) or
ODBC compliant databases (odbc).
Functions that search data tables are typically called SearchXXX. New data files
require you to define the ASN structure, typically each field in the data file
corresponds to an ASN field.
Most of the existing databases already have search functions so you should not have
to write these yourself. If they don’t exist, then you should consult the instructions
under Creating a new SeqHound module first. The details below are here to help you
understand the database layer and to use Search functions in your API function.
Boolean LIBCALL SearchACCDB (StAccdbPtr PNTR ppac)

The search functions should find ALL instances of a key in the database and return
them through the ASN pointer. Each ASN structure is a linked list, so multiple
records can be retrieved. Pseudo code for the typical search function:
Boolean SearchXXX(ASNLinkList Pointer)
{
foreach record that matches key in Pointer
create a new ASNLinkList Node;
fill node with record fields;
join node to Pointer;
end foreach
return TRUE if key found
else return FALSE;
}

As an example, the function
Boolean LIBCALL SearchACCDB (StAccdbPtr PNTR ppac)

finds records in the accdb table for an ODBC compliant database engine. This
function is located in slri/seqhound/src/intrez_odbc.c. The corresponding function
for a CodeBase database engine has the same name but is found in
slri/seqhound/src/intrez_cb.c. The ASN structure that StAccdbPtr points to is defined
in slri/seqhound/asn/slristruc.asn (search for StAccdb) and the corresponding C
structure is defined in slri/seqhound/include/objslristruc.h. You need to know this
when you use a database search function in your local API function.
3. Create The Local API Function
You will need to change one of three files in order to add new functionality to
SeqHound.
a) go_query.c : to add functionality to gene ontology module API functions
b) ll_query.c : to add to locuslink module API functions
c) intrez.c : to add new functionality to the other data tables.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

376 of 421

18/04/2005

Declare function prototype in slri/seqhound/include/seqhound.h. Try to group the
functions logically in the .c and .h files.
The API functions follow a general logic. You should try to stick to that logic.
Typical example of an API function:
Int4 LIBCALL SHoundFindAcc(CharPtr pcAcc)
{
Int4 gi = 0;
StAccdbPtr pac = NULL, pachead = NULL;
Int2 res = 0;
if ((pcAcc == NULL) || (strcmp(pcAcc, "n/a") == 0))
{
ErrPostEx(SEV_ERROR,0,0,"SHoundFindAcc: No accession.");
return 0;
}
pac = StAccdbNew();
pachead = pac;
pac->access = StrSave(pcAcc);
res = SearchACCDB(&pac);

Input integrity
check

Each data table has specific asn structure.
Set the key in the asn structure
database layer to search for the key.
The asn structure will hold all the
records that match the key.

if (res == FALSE)
Failed search.
{
Free structure
ErrPostEx(SEV_ERROR,0,0,"SHoundFindAcc: Search failed.");
and return
StAccdbFree(pachead);
return 0;
}else if(res == TRUE){
gi = pac->gi;
Successful search - extract what
pac = pac->next;
}
you need and return it.
StAccdbFree(pachead);
return gi;
}

4. Step 4. Create the CGI Function
Definitions:
CGI (Common Gateway Interface):
a program that allows 2 computers connected to the internet to communicate with
each other.
HTTP (Hypertext Transfer Protocol):
a way of passing messages between the 2 computers. Is used for web services, eg
CGI, servlets … Format of HTTP:
http://server_name/path/to/cgi/cgi_name?key1=value1&key2=value2&….
Query string:
The portion of the HTTP call that is used to pass parameters to the CGI. Format of
query strings: ?param1=value1¶m2=value2….
SeqHound’s CGI
is slri/seqhound/cgi/seqrem.c.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

377 of 421

18/04/2005

Remote users
can call seqrem on our servers using HTTP calls. Format of a HTTP call:
http://seqhound.blueprint.org/cgibin/seqrem?fnct=SeqHoundFindAcc&acc=AA73235
2 parts to edit in seqrem.c
1. if statement in function: ProcessFnctRequest(CharPtr pfnct);
2. Add a corresponding CGI function call that calls the local API
There is currently an if-else statement in the
ProcessFnctRequest function. It needs to be modified to check if
the first field of the query string contains your new function name. If so, it
will call a CGI function that calls your local API function.
SLRI_ERR ProcessFnctRequest(CharPtr pfnct)
{
if(SeqHoundInit() == SLRI_FAIL) {
MemFree(pfnct);
return 1;
} else if (strcmp(pfnct, "SeqHoundFindAcc") == 0) {
if(SeqHoundFindAcc() == SLRI_FAIL) {
MemFree(pfnct);
return 1;
Calling internal
}
… see below
} ...

CGI function

Your new function gets
called here

} else if(strcmp(pfnc, “YourNewFunction” ..
if(SeqHoundYourNewFunction() ...
}
}

Next you need to define a function with the prototype
SLRI_ERR SeqHoundYourNewFunction(void)

This function will extract the remaining keys in the query string portion of the HTTP
call, and call the local API function you defined in step 3, then output the HTTP
response to be sent back to the user.
http://seqhound.blueprint.org/cgi-bin/seqrem?fnct=SeqHoundFindAcc&acc=AA73235

Internal CGI function that calls
the local API
SLRI_ERR SeqHoundFindAcc(void)
{
Int4 gi = 0;
Int4 IndexArgs = -1;

Output the HTTP
printf("Content-type: text/html\r\n\r\n");
if ((IndexArgs = WWWFindName(ginfo,"acc")) >= 0)
headers
{
pcThis = WWWGetValueByIndex(ginfo,IndexArgs);
Extract the remaining
pacc = StringSave(pcThis);
}
fields from the CGI
if ((pacc == NULL) || (strlen(pacc) == 0))
ti
{
ErrPostEx(SEV_ERROR,0,0, " Failed to get parameter value.");
fprintf(stdout, "SEQHOUND_ERROR Failed to get parameters.");
return SLRI_FAIL;
}
gi = SHoundFindAcc(pacc);
Calling the local API
f
seqhound@blueprint.org

ti
Version 3.3

The SeqHound Manual

378 of 421

fprintf(stdout, "SEQHOUND_OK\n");
fprintf(stdout, "%ld\n", (long) gi);
MemFree(pacc);
return SLRI_SUCCESS;
}

seqhound@blueprint.org

18/04/2005

Send output
back to remote
users using
HTTP

Version 3.3

The SeqHound Manual

379 of 421

18/04/2005

5. Step 5. Create the Remote Calls
We have programs available in most of the widely used languages that allow SeqHound
users to write programs in their favorite language, accessing SeqHound data without
having to understand how everything works. Our remote programs in effect constructs
the HTTP calls (described above), sends the HTTP calls to the server and then parses the
server’s return value and sends this back to the user’s program.
Our remote interfaces are:
1. slri/seqhound/src/seqhoundapi.c
2. slri/seqhound/src_cxx/Seqhound.cpp
3. slri/seqhound/java/SeqHound.java
4. slri/seqhound/perl/SeqHound.pm
5. slri/seqhound/bioperl/SeqHound.pm
Int4 LIBCALL SHoundFindAcc(CharPtr pcAcc)
{
Char fpath[PATH_MAX];
Int4 gi = 0;
if(pcAcc == NULL)
{
ErrPostEx(SEV_ERROR,0,0, "Invalid parameter.");
return 0;
}

Making the HTTP
call.
Contains the server
name, path to CGI,

sprintf(fpath,"%s?fnct=SeqHoundFindAcc&acc=%s", slri_cgipath, pcAcc);
ErrPostEx(SEV_INFO,0,0, "SeqHoundFindAcc request: %s.\n", fpath);
if(SHoundWWWGetfile(slri_sername, fpath) == 0)
{
ErrPostEx(SEV_ERROR,0,0, "SHoundWWWGetfile failed.");
return 0;
}
gi = ReplyBSGetInteger();
if (gi == 0)
{
ErrPostEx(SEV_INFO,0,0, "SeqHoundFindAcc returned zero.");
return 0;
}
return gi;

Send the HTTP
call
Get the return
value.
A family of
ReplyBSGetXX
X exists for

}

Similar logic is used in the remote C++, Java, and Perl libraries.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

380 of 421

18/04/2005

Adding a new module to SeqHound
Note: This section is included for historical purposes and is being
rewritten for a future release (5.0)
This section describes how to go about adding a new module to SeqHound from the
developer’s point of view. It is basically a description of the different files that have to
be written for a new module and where they should go in the code tree. So this section
may be used as a guide for looking at existing modules to find out what files are expected
and where they are. However, historically, several programmers have added modules to
SeqHound and have used different code organization schemes; therefore, historical
modules may be organized differently. Going forward (as of November 2003) all
modules will have the components that are organized in the way described below.
1. Start a new module project plan
Creating a new module begins with starting a new project plan (see Format for project
reports). This should contain a clearly stated “Need” and “Objective” as well as a
“General approach” to developing the module; this may include background
information on the data resource being incorporated into SeqHound. The “Detailed
planning” section will contain a section on “Data table design” and “Code
Organization” that will be described further. See the DBXREF (Database crossreference) module Project Plan as an example
(./slri/seqhound_priv/dbxref/dbxref_module_desc.txt).
The components of a new module are summarized in the diagram below. The rest of
this section describes the creation of each of these components and their organization.
File names in bold italics indicate files that may be looked at as examples.

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

381 of 421

18/04/2005

Database layer
2. Design the data table structure
The module’s project description should contain a description of the final table
design. See the example in the DBXREF (Database cross-reference) module Project
Plan as an example.
3. Write the script file that creates the data table.
You must modify the existing script file that creates the SeqHound data tables (this
step is only for ODBC SeqHound). This file is called seqhound.sql and is located in
./slri/seqhound/sql/seqhound.sql
Add the line
DROP TABLE SEQHOUND.DBXREF_tablename;

near the top of the file. Every time this script is run, it will destroy any pre-existing
data tables that belong to the module.
Add the lines that describe the table(s) that are a part of the module so the script
creates the new tables belonging to the module. See the example file “seqhound.sql”.
Note that CodeBase tables to not require a script to be made; these tables are created
by the function InitCodeBase in the _cb.c file.
4. Write ASN.1 structure(s) that corresponds to data table descriptions.
The file ./slri/seqhound/asn/slristruc.asn must be modified to contain a description
of tables in the module. See the example structure StDbXref in the slristruc.asn file.
The shell script “./slri/seqhound/asn/makeasn” calls asntool which auto generates
slristruc.h, objsslristruc.h and objslristruc.c files. These files contain functions that
allow one to allocate and free memory for a structure corresponding to a database
record. See the makeasn file.
5. Step 5: Design new functions for the Database code layer
The DB layer contains C functions that create and retrieve records from tables in the
module. In the case of the DBXREF module, the tables are populated by a parser
written in PERL so there are no functions listed that write to the tables. The
requirement for a certain Perl modules is however noted in the project plan; these
modules allow Perl to converse with a database.
The DB layer will have three files that were auto generated from the preceding step.
a) ./slri/seqhound/include/slristruc.h
b) ./slri/seqhound/include/objsslristruc.h
c) ./slri/seqhound/src/objsslristruc.c
These files will be placed in this location by the makeasn script.
In addition, DB layer will require that two new files be made.
a) ./slri/seqhound/include/_odbc.h
This will contain function prototypes and comments for functions in
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

382 of 421

18/04/2005

b) ./slri/seqhound/src/_odbc.c
This will contain at least one functions called “Search that
retrieves records from the database. Other functions that read or write to the database
may be included in this file.
The example function “SearchDBXREF” (see dbxref_odbc.h and dbxref_odbc.c)
takes as input a pointer to a structure called StDbXref. This structure has the same
fields as the DBXREF table. This structure is described in ASN.1 and functions to
allocate and free memory for this structure are auto generated using ASNtool. Any
row in a DBXREF table that contains field values matching anyone of the field values
passed to “SearchDBXREF” will be returned in a linked list of Valnodes that
contain pointers to StDBXref structures.
6. Make changes to existing code to accommodate changes to the database code layer.
You must modify the existing function called InitCodeBase so that it handles
tables that are new to the module. This function is found in two locations:
a) ./slri/seqhound/src/intrez_cb.c is the file that supports a CodeBase database
backend to SeqHound. InitCodeBase() needs to be changed to open the
code base data files that belong to the new module. See the example intrez_cb.c
file.
b) ./slri/seqhound/src/intrez_odbc.c is the file that supports a supports an ODBC
database backend to SeqHound. InitCodeBase() need to contain the
necessary code to establish a connection to the Database Server. See the example
intrez_odbc.c file.
The function InitCodeBase() under intrez_odbc.c has a function call:
GetAppParam(intrez", "datab", "db2alias", NULL,
(Char*)dsn, sizeof(char) * 10) <= 0 )
that retrieves database connection information from the
./slri/seqhound/config/.intrezrc configuration file.
Parser layer
7. Design the parser layer.
Parsers are generally written in C or Perl. A separate script is written to download
some file from an external ftp site. The parser takes this file as input and uses it to
populate a set of data tables belonging to the module.
Pseudocode for parsers should be documented in the project plan.
Parser layer code is located in
./slri/seqhound//[parser_name]
for example
./slri/seqhound/dbxref/dbxref_parser_sp.pl
Finally, by project end, all parsers must be documented according to the examples
given in the SeqHound manual. See the example parser description for “mother”.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

383 of 421

18/04/2005

Local API layer (Query layer)
8. Design the Local API layer
Design the local API local query layer. This layer will consist of three files
a) ./slri/seqhound/src /[Module name]_query.c
This file contains all of the API functions that query the module’s tables as well
as auxiliary functions (if any). Note that this naming convention is not followed
by API calls that belong to the core module of SeqHound; code for these local
API calls that query core module tables is in ./slri/seqhound/src/intrez.c.
b) ./slri/seqhound/include/[Module name]_query.h
This file contains function prototypes and comments for auxiliary functions (if
any) that may be used by the module’s parsers, API functions or other
applications specific to the module.
c) ./slri/seqhound/include/seqhound.h
This is where all publicly available API functions are defined. Note that the local
and the remote API’s use the same header file. ALL API functions for ALL
modules are defined in this header. This file already exists and must be simply
modified.
Examples are:
./slri/seqhound/src /dbxref_query.c
./slri/seqhound/include/dbxref_query.h
./slri/seqhound/include/seqhound.h
An example of an API function is a function(s) to retrieve data base cross-references
given a source record (SHoundDBXREFGetDBXrefListBySourceRecord).
See the example in the dbxref_query.c file. This is a local API call. Notice the
naming convention: ‘Shound’ followed by the module name ‘GODB’ followed by the
actual API function name.
In the example function note the line
if(!SHoundModule("godb"))
This checks the SeqHound configuration file to make certain that the build of
SeqHound actually includes this module. All API function calls must have an
analogous check.
Note that this function calls a database layer function called
SearchDBXREF.
CGI layer
9. Design the CGI layer
The SeqHound cgi layer that supports the remote API is contained in only one file
./slri/seqhound/cgi/seqrem.c
There is no header file for seqrem.c since all functions are defined before “Main()”
and Main only calls functions in this file.
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

384 of 421

18/04/2005

This file must be edited to include cgi support for the remote API calls for the new
module. See the examples in the seqrem.c file. See the example function
SeqHoundDBXREFGetDBXrefListBySourceRecord.
Note the naming convention: SeqHound followed by the module name followed by
the same function name used by the local API function.
Remote API layer
10. Design the remote API layer
The remote API for SeqHound supports 4 languages. So there are four files that must
be modified to include new API functions for the new module.
For C:

./slri/seqhound/src/seqhoundapi.c

For C++:

./slri/seqhound/src_cxx/SeqHound.c

For Java:

./slri/seqhound/java/SeqHound.java

For Perl:

./slri/seqhound/perl/SeqHound.pm

The function names will be exactly the same as those listed in the local API layer (for
example; SHoundDBXREFGetDBXrefListBySourceRecord).
Examples for the C remote API are given in the seqhoundapi.c file.
11. Modify the seqhound config file.
New modules require that another entry be made in the .intrezrc file. This setting
allows the SeqHound administrator to indicate whether any given SeqHound module
has been built. The function SHoundModule() will look at these config file
settings every time a local API call is made to determine if the module is present.
Modify the ./slri/seqhound/config/intrezrc according to the example. Analogous
additions must be made to the ./slri/seqhound/config/intrez.ini file for Windows
platforms.
Modify the ShoundModule() function in ./slri/seqhound/src/intrez_cfg.c file to
support the new module. Follow the example in this file.
12. Modify/create make files to support the new module.
Modify the following files.
./slri/seqhound/seqhound_odbc.mk
SEQH_SRC_ODBC
SEQH_OBJ_OODBC
./slri/seqhound/seqhound.mk
SEQH_SRC_COM
SEQH_OBJ_COM
SEQH_ODBC_COM
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

385 of 421

18/04/2005

./slri/seqhound/seqhound_rem.mk
./slri/seqhound/seqhoundrem.mk??
./slri/seqhound/src/make.shoundlocllib
./slri/seqhound/src/make.shoundremlib
Create the following make files for any parsers written in C.
./slri/seqhound//make.[parser_name]
see the example in
./slri/seqhound/locuslink/make.llparser
13. Design regression tests
Regression tests are based on the CuTest C Unit Testing Framework.
(see http://cutest.sourceforge.net/).
One file must be modified to support tests for the new module. Follow the examples
in:
./slri/seqhound/test/regresion/main.c
And a new file must be created that contains the actual test functions. See the
examples in the test driver for the database layer for the DBXREF module:
./slri/seqhound/test/regresion/dbxref_odbc_driver.c
Examples of some test cases Function calls.
void testDBXREF_GetObjectIDbyAcc(CuTest *tc){...}
Another driver contains test functions for the
./slri/seqhound/test/regresion/dbxref_query_driver.c
Finally, the following file must be modified to accommodate the new module test
drivers. Follow the examples in:
./slri/seqhound/test/regresion/make.test_driver
14. Design test cases
These test cases refer to tests of the local and remote API functions relevant to the
module.
15. Code test and debug
16. Finish documentation
17. Compile SeqHound code in test environment
18. Build SeqHound db in test environment
19. Check in code
20. Pass the project on to delivery team (Test Dev/Systems Dev/Software Training)
21. Build module in production (SEQHOUND ADMIN)
22. Update data tables (SEQHOUND ADMIN)
seqhound@blueprint.org

Version 3.3

The SeqHound Manual

386 of 421

18/04/2005

23. Update seqrem (SEQHOUND ADMIN)
24. Update docs (SEQHOUND ADMIN)
25. Update website (SEQHOUND ADMIN)

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

387 of 421

18/04/2005

7. Appendices
Example GenBank record in ASN.1 format
Example SwissProt record in ASN.1 format
Example EMBL record in ASN.1 format
Example PDB record in ASN.1 format
Example Biostruc in ASN.1 format
GO background material

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

388 of 421

18/04/2005

Example GenBank record
Seq-entry ::= set {
class nuc-prot ,
descr {
title "Vairimorpha necatrix largest subunit of RNA polymerase II
(RPB1)
gene, complete cds." ,
source {
org {
taxname "Vairimorpha necatrix" ,
db {
{
db "taxon" ,
tag
id 6039 } } , <==========================TAXGI/taxid
orgname {
name
binomial {
genus "Vairimorpha" ,
species "necatrix" } ,
lineage "Eukaryota; Fungi; Microsporidia; Burenellidae;
Vairimorpha" ,
gcode 1 ,
mgcode 1 ,
div "INV" } } } ,
create-date
std {
year 1998 ,
month 12 ,
day 10 } ,
pub {
pub {
sub {
authors {
names
std {
{
name
name {
last "Hirt" ,
first "R" ,
initials "R.P." } } ,
{
name
name {
last "Healy" ,
first "B" ,
initials "B." } } } ,
affil
std {
affil "The Natural History Museum" ,
div "Zoology" ,
city "London" ,
country "UK" ,
street "Cromwell Road" ,
postal-code "SW7 5BD" } } ,
medium email ,
date
std {
year 1998 ,
month 4 ,
day 16 } } } } ,
update-date
std {
year 1999 ,
month 2 ,
day 4 } ,
pub {
pub {
article {

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

389 of 421

18/04/2005

title {
name "Microsporidia are related to Fungi: evidence from the
largest subunit of RNA polymerase II and other proteins." } ,
authors {
names
std {
{
name
name {
last "Hirt" ,
initials "R.P." } } ,
{
name
name {
last "Logsdon" ,
initials "J.M." ,
suffix "Jr." } } ,
{
name
name {
last "Healy" ,
initials "B." } } ,
{
name
name {
last "Dorey" ,
initials "M.W." } } ,
{
name
name {
last "Doolittle" ,
initials "W.F." } } ,
{
name
name {
last "Embley" ,
initials "T.M." } } } ,
affil
str "Department of Zoology, The Natural History Museum,
London
SW7 5BD, United Kingdom." } ,
from
journal {
title {
iso-jta "Proc. Natl. Acad. Sci. U.S.A." ,
ml-jta "Proc Natl Acad Sci U S A" ,
issn "0027-8424" ,
name "Proceedings of the National Academy of Sciences
of the
United States of America." } ,
imp {
date
std {
year 1999 ,
month 1 ,
day 19 } ,
volume "96" ,
issue "2" ,
pages "580-585" ,
language "eng" } } ,
ids {
pubmed 9892676 ,
medline 99110933 } } ,
muid 99110933 ,
pmid 9892676 } } } ,
seq-set {
<==================================beginning of bioseq
<===========================look down for end of bioseq(see ASNDB/asn1)
seq {
id {
genbank{<===============================ACCDB/DB
name "AF060234" ,<================== ======ACCDB/name

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

390 of 421

18/04/2005

accession "AF060234" ,<=====================ACCDB/accession
version 1 } ,<============================ACCDB/version
gi 4001823 } , <==========================Many tables/GI
descr {
molinfo {
biomol genomic } } ,
inst {
repr raw ,
mol dna ,
length 5019 ,
seq-data
ncbi2na
'171B6C0C434EFC0FBDC301F7E3FFFF3F396FF3FD7FFFF355C03BF8E03EC
10020DDB4350F6BF3FDD48203080352EDB5030D4578053A00EAF57037AEBF0C8FC00EA4
71202A7
FFCEF4D7B8008C3FF7E57AD3FE944F83C1C250EF53BEBC4E18034200DFA0EEFEFFCFBDC
87C03CB
1000DF0808FE0FFB3A0CFDC21C02EFE60AE03EA803ABF129EE80C2497B0C00082838BF8
C9FF380
A208202E3AC0B0FF83A2022F4C33DE0823EC0E08E7BBFFABFE3482FC70978BA7CFF87BD
FF2F5D7
5F7B0A5CB3ED3A0A0EF889208E3F874C0FA48CCB009C31F37C203383C82A917AD3B2C2A
3CE043C
750FD33E510E3E10E130BA5051291F4802EA657F0024FDE761E029020A98B5AA03F0EA0
02AC8FD
2E52B7B0F1978D60CFDEC820B68BD7D223E608D447F5480F3075FF0CF8597C701FB740E
A5038B1
56A983CEC3080E1A1020C87C0FD0CAA233C2F2082BCECB622113908688ECBFCFC32105F
7F1100E
24E3A74FFB58B4E829007FC8703FB7EEDD95CC19A3FE1AA180E0DF44E5902F104B0A683
C820F3B
F2C2C042CF0B510DC3014B0EA8CB1084BF049DC21FFC4F188F4FFF8DB22010E43CF33DE
C0CF0C3
CE2FC78F6D80F0F3874E3874FEB0C3F131E0837D033CE00CF03F549CFDF35C203EE87AC
040DF0B
C4FF140C43F30E80370E244182087E803B608F7C6D30F203A6037C2E83CF842027B6AD0
4429AFC
3D130F90387FAD7848B87EFDFE38E748038E0DFCFD9C430197FDCFA0FAE190C92102010
E2D0B10
20B32012702042C0E034CB4091000301C808F14ACED0E22227F80B42C0F13FE0C050884
FDE99B4
9F4037F8BFFB0C138A18EB5E928D00A74FCC0CFD10B449FBF2B4100EC828008357F6BFC
3CC8DFC
513FCB0268F3DEA037228FEC8032F3B00A0F1D7820F7DFD39CEA2B220ADC3E33E73C012
4804A33
3D020872C0243A08E71ED1DF8C8DEC228923A3F3F3433831A208EBFE3947FFC80E40038
74E386E
9C702386EDFF003F93F2C84EFC48FE0FD90C000803BC48D03F103CF871E3B03F1200FF3
38E0FE2
EBC0E00EC000CE000E0CF9CB53B0FD40830C0FC98FCC0FE3EC802884FD56CFC3FC8C7F0
00FC320
0FC50F083F3C3E033C300CC3F3630228FF0383303CD3C827CC3E8FF0080F03F00FF00D0
C3F7503
80EB291FE9A7437B2A81494540E1FC0C7F53F24A2CD240CF18EA2C508F0080F303BE600
33C21D7
ECE08FCFC02353F0C01FC80E9C200D40B83C83FD233002DDEE0FCB80FCF3857B8C8211F
7300823
023FEF4A0CFF8FD5E3813F23FCB00E5C0FCD308F003E120F03EB0B003F80F200CFB002F
C4E297F
5C33FD4CC3C8D238032D20FC30C20C23B3F4DFE0C30EC833CC3F10CC00CCF03F023CEAF
30C0342
02FFCC2E08C08408E2EB3F12123A2FECC22033DDD3503B20A13FAC1F70E3F0380CB20BF
CA0C827
9C888437E0E0F87F2F321A03A0BCED0D329134971E9E3B8E10E0A8C5F12B304213A2E0C
0BA8FEA
E47019934FE0807BA33FC721927F2C920033B0402BCC488330E3A845C957FAA12832805
E7FE8EC
2C0F230273147F70148330F3820BE3194FCC44B52ED2005E2CC24BA01E895A7CFCB20A0
3633994
0042FCCF75862561F32D4189504CCB5462541332D4109507C495842541F32D4189585C7
4504D94
DB32D610B58BC4B5042562F12D4109589CCB5042D4DF3254109537C4B504D16DF12D413

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

391 of 421

18/04/2005

45B7C4B
504256DF32D41095B7CCB504256DF32D41B18C38600801C3280600A024300CF0CB0000F
C00FFFF
FFEC03FFCCD035CCF8DCCBFC33C030304'H } ,
annot {
{
data
ftable {
{
data
gene {
locus "RPB1" } ,
location
int {
from 100 ,
to 4917 ,
id
gi 4001823 } } } } } } ,

<==============================the first bioseq ends here
see ASNDB/asn1
<==============================next bioseq begins here
seq {
id {
genbank {
accession "AAD12604" ,
version 1 } ,
gi 4001824 } ,
descr {
molinfo {
biomol peptide ,
tech concept-trans-a } ,
title "largest subunit of RNA polymerase II; RPO21 [Vairimorpha
necatrix]" } ,
inst {
repr raw ,
mol aa ,
length 1605 ,
topology not-set ,
seq-data
ncbieaa
"MFDEIVTKRISSIQFGLFSPEEIRKSSVVQIIHPETMENGFPKSGGLIDLKMGTTERAF
LCSSCEKDNFSCPGHFGHIELTKPMFHVGYMTKIKKILECVCFYCSRLKISTKNLKKDLNFVWNISKTKSV
CEGEIGE
NGFTGCGNKQPVIKKEGMSLIAFMKGEEESDGKVILNGERVHNILKKIVNEDAVFLGFDQKFTKPEWLILT
VLLVPPP
SVRPSIVMEGMLRAEDDLTHKLADIVKANTYLKKYELEGAPGHVVRDYEQLLQFHIATMIDNDISGQPQAL
QKSGRPL
KSISARLKGKEGRVRGNLMGKRVDFSARSVITPDPNISVEEVGVPSEIAKIHTFPEIITPFNIDRLTKLVS
NGPNEYP
GANYVIRNDGQRIDLNFNRGDIKLEEGYVVERHMQDGDVVLFNRQPSLHKMSMMAHFVRVMEGKTFRLNLS
CVSPYNA
DFDGDEMNLHMPQSYNSKAELEELCLVSKQVLSPQSNKPVMGIVQDSLTALRLFTLRDSFFDRRETMQLLY
SVNINNY
EFTDSSKLIMTHDDSFGNNLHTEESSNIMKILNFPAISYPKKLWTGKQILSYILPNTIYNGKSNEHNEEDL
ENVEDSY
VIIRNGEILSGIIDKKAVGSTQGGLIHIIANDFGPDRVTCFFDDAQKMMNLYFATINAFSIGIGDAIADKE
TMSQVQR
SIETAKEQVNEIIVKAQKNKLERLPGMSMRESFESQVNYILNKPRDISGASASKSLSFCNNMRTMVLAGSK
GSFINIS
QVTACLGQQNVEGKRIPFGFNYRSLPHFSKADYSGKSRGFVENSYVKGITPEEFFFHAMGGREGLIDIAIK
TAETGYI
QRRLVKAMEDATVTLDRSVRGADGFIYQYEYGEDGFDATFLEMQKMTHDDVATKDDVSFKNLHLVDMFTDL
NFAIKKE
NVTDQIYKLLTTDVNLQKILYDEFEWLNENVKKYEKMNIASPCNFQRIINLAIYKFDCRKGDISPYLILDT
LKNLIEN
LPIKNLLIEILIKYNLSIKRILNEYKLSLEAYNWILKEIKFKILKSIISPNEMVGTLAAQSVGEPATQMTL
NTFHLAG
VSANITMGVPRLKEIINVAKNIKTPCMKIYLKDPFNKTLEMAKKIQSELEFSDIKSLCEFSEIYYDPVIED
TSIKEDK
DFVQEYFDFPDEHLDFSKMPKFIIRLKIDRIKLVSKNLKLENIVKSLHEAFPNIFHIIRSDENSQNLIIRI
RCISSLN

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

392 of 421

18/04/2005

NNVEYYNLQYKNILNLKIMGYNKIKKVFISEDKDKDEWYLQTDGVCIREIFSHPNVEGHLVTSNDLNEIVE
VLGIEAA
RETILNELTLVIDGNGSYVNHRHISLLADVMTMKGYLTGITRHGVNKVGFGCTKRASFEETVDILLDAALV
AEKYVTK
GYTENIMMGHLAPLGTGIGNLLLDVSKLDKAIPLSKPEYNYEEVDTPFIHSPVSENLSISSGNWSPAYLVE
GNRYAPK
TSLYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPTYSPTSPSYSPTSPSYSPTSPSYSPTS
PSYSPTS
PSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTYDNDEKKTNRKRKGKQ" } ,
annot {
{
data
ftable {
{
data
prot {
name {
"largest subunit of RNA polymerase II" ,
"RPO21" } } ,
location
int {
from 0 ,
to 1604 ,
strand plus ,
id
gi 4001824 } } } } } } } ,
annot {
{
data
ftable {
{
data
cdregion {
frame one ,
code {
id 1 } } ,
product
whole
gi 4001824 ,
location
int {
from 100 ,
to 4917 ,
strand plus ,
id
gi 4001823 } } } } } }

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

393 of 421

18/04/2005

Example SwissProt record
Seq-entry ::= seq {
id {
swissprot {
name "RPB1_YEAST" ,<============================ACCDB/name
accession "P04050" } ,<=========================ACCDB/accession
gi 2507347 } ,
descr {
<======ACCDB/title
title "DNA-DIRECTED RNA POLYMERASE II LARGEST SUBUNIT (B220)." ,
comment "-----------------------------------------------------------------~This SWISS-PROT entry is copyright. It is produced through
a~collaboration
between the Swiss Institute of Bioinformatics and~the EMBL outstation
- the
European Bioinformatics Institute.~The original entry is available
from
http://www.expasy.ch/sprot~and
http://www.ebi.ac.uk/sprot~----------------------------------------------------------------" ,
comment "[FUNCTION] DNA-DEPENDENT RNA POLYMERASE CATALYZES THE
TRANSCRIPTION OF DNA INTO RNA USING THE FOUR RIBONUCLEOSIDE
TRIPHOSPHATES AS
SUBSTRATES." ,
comment "[CATALYTIC ACTIVITY] N NUCLEOSIDE TRIPHOSPHATE = N
PYROPHOSPHATE
+ RNA(N)." ,
comment "[SUBUNIT] RNA POLYMERASE II CONSISTS OF 12 DIFFERENT
SUBUNITS.
THIS SUBUNIT IS THE LARGEST COMPONENT OF RNA POLYMERASE II." ,
comment "[SUBCELLULAR LOCATION] NUCLEAR." ,
comment "[PTM] THE TANDEM 7 RESIDUES REPEATS CAN BE HIGHLY
PHOSPHORYLATED.
THE PHOSPHORYLATION ACTIVATES POL2." ,
comment "[MISCELLANEOUS] THREE DISTINCT ZINC-CONTAINING RNA
POLYMERASES
ARE FOUND IN EUKARYOTIC NUCLEI: POLYMERASE I FOR THE RIBOSOMAL RNA
PRECURSOR,
POLYMERASE II FOR THE MRNA PRECURSOR, AND POLYMERASE III FOR 5S AND
TRNA
GENES." ,
comment "[SIMILARITY] BELONGS TO THE RNA POLYMERASE BETA' CHAIN
FAMILY." ,
sp {
class standard ,
extra-acc {
"Q12364" ,
"Q92315" } ,
seqref {
gi 4397 ,
gi 4398 ,
gi 1419218 ,
gi 1419221 ,
gi 1431216 ,
gi 1431217 ,
gi 886080 ,
gi 886082 ,
gi 2144431 } ,
dbref {
{
db "SGD" ,
tag
str "L0001744" } ,
{
db "PFAM" ,
tag
str "PF00623" } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

394 of 421

18/04/2005

{
db "PROSITE" ,
tag
str "PS00115" } } ,
keywords {
"Transferase" ,
"DNA-directed RNA polymerase" ,
"Transcription" ,
"Zinc" ,
"Repeat" ,
"DNA-binding" ,
"Nuclear protein" ,
"Phosphorylation" ,
"Zinc-finger" } ,
created
std {
year 1986 ,
month 11 ,
day 1 } ,
sequpd
std {
year 1997 ,
month 11 ,
day 1 } ,
annotupd
std {
year 1999 ,
month 7 ,
day 15 } } ,
create-date
std {
year 1986 ,
month 11 ,
day 1 } ,
update-date
std {
year 1999 ,
month 7 ,
day 15 } ,
source {
org {
taxname "Saccharomyces cerevisiae" ,
common "baker's yeast" ,
db {
{
db "taxon" ,
tag
id 4932 } } ,
orgname {
name
binomial {
genus "Saccharomyces" ,
species "cerevisiae" } ,
lineage "Eukaryota; Fungi; Ascomycota; Saccharomycotina;
Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces"
,
gcode 1 ,
mgcode 3 ,
div "PLN" } } } ,
molinfo {
biomol peptide ,
completeness complete } ,
pub {
pub {
gen {
serial-number 1 } ,
muid 85282617 ,
article {
title {
name "Extensive homology among the largest subunits of
eukaryotic
and prokaryotic RNA polymerases." } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

395 of 421

18/04/2005

authors {
names
std {
{
name
name {
last "Allison" ,
initials "L.A." } } ,
{
name
name {
last "Moyle" ,
initials "M." } } ,
{
name
name {
last "Shales" ,
initials "M." } } ,
{
name
name {
last "Ingles" ,
initials "C.J." } } } } ,
from
journal {
title {
iso-jta "Cell" ,
ml-jta "Cell" ,
issn "0092-8674" ,
name "Cell." } ,
imp {
date
std {
year 1985 ,
month 9 } ,
volume "42" ,
issue "2" ,
pages "599-610" ,
language "eng" } } ,
ids {
pubmed 3896517 ,
medline 85282617 } } ,
pmid 3896517 } ,
comment "SEQUENCE FROM N.A.~STRAIN=A364A" } ,
pub {
pub {
gen {
serial-number 2 } ,
muid 97127826 ,
article {
title {
name "Analysis of a 26,756 bp segment from the left arm of
yeast
chromosome IV." } ,
authors {
names
std {
{
name
name {
last "Wolfl" ,
initials "S." } } ,
{
name
name {
last "Hanemann" ,
initials "V." } } ,
{
name
name {
last "Saluz" ,
initials "H.P." } } } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

396 of 421

18/04/2005

affil
str "Hans-Knoll-Institut fur Naturstoff-Forschung,
Department of
Cell and Molecular Biology, Jena, Germany." } ,
from
journal {
title {
iso-jta "Yeast" ,
ml-jta "Yeast" ,
issn "0749-503X" ,
name "Yeast (Chichester, England)" } ,
imp {
date
std {
year 1996 ,
month 12 } ,
volume "12" ,
issue "15" ,
pages "1549-1554" ,
language "eng" } } ,
ids {
pubmed 8972577 ,
medline 97127826 } } ,
pmid 8972577 } ,
comment "SEQUENCE FROM N.A.~STRAIN=S288C / FY1679" } ,
pub {
pub {
gen {
serial-number 3 } ,
muid 95377607 ,
article {
title {
name "The gene encoding the biotin-apoprotein ligase of
Saccharomyces cerevisiae." } ,
authors {
names
std {
{
name
name {
last "Cronan" ,
initials "J.E." ,
suffix "Jr." } } ,
{
name
name {
last "Wallace" ,
initials "J.C." } } } ,
affil
str "Department of Microbiology, University of Illinois,
Urbana
6180, USA." } ,
from
journal {
title {
iso-jta "FEMS Microbiol. Lett." ,
ml-jta "FEMS Microbiol Lett" ,
issn "0378-1097" ,
name "FEMS microbiology letters." } ,
imp {
date
std {
year 1995 ,
month 8 ,
day 1 } ,
volume "130" ,
issue "2-3" ,
pages "221-229" ,
language "eng" } } ,
ids {
pubmed 7649444 ,
medline 95377607 } } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

397 of 421

18/04/2005

pmid 7649444 } ,
comment "SEQUENCE OF 1669-1733 FROM N.A.~STRAIN=S288C" } } ,
inst {
repr raw ,
mol aa ,
length 1733 ,
seq-data
ncbieaa
"MVGQQYSSAPLRTVKEVQFGLFSPEEVRAISVAKIRFPETMDETQTRAKIGGLNDPRLGSIDR
NLKCQTCQEGMNECPGHFGHIDLAKPVFHVGFIAKIKKVCECVCMHCGKLLLDEHNELMRQALAIKDSKKR
FAAIWTL
CKTKMVCETDVPSEDDPTQLVSRGGCGNTQPTIRKDGLKLVGSWKKDRATGDADEPELRVLSTEEILNIFK
HISVKDF
TSLGFNEVFSRPEWMILTCLPVPPPPVRPSISFNESQRGEDDLTFKLADILKANISLETLEHNGAPHHAIE
EAESLLQ
FHVATYMDNDIAGQPQALQKSGRPVKSIRARLKGKEGRIRGNLMGKRVDFSARTVISGDPNLELDQVGVPK
SIAKTLT
YPEVVTPYNIDRLTQLVRNGPNEHPGAKYVIRDSGDRIDLRYSKRAGDIQLQYGWKVERHIMDNDPVLFNR
QPSLHKM
SMMAHRVKVIPYSTFRLNLSVTSPYNADFDGDEMNLHVPQSEETRAELSQLCAVPLQIVSPQSNKPCMGIV
QDTLCGI
RKLTLRDTFIELDQVLNMLYWVPDWDGVIPTPAIIKPKPLWSGKQILSVAIPNGIHLQRFDEGTTLLSPKD
NGMLIID
GQIIFGVVEKKTVGSSNGGLIHVVTREKGPQVCAKLFGNIQKVVNFWLLHNGFSTGIGDTIADGPTMREIT
ETIAEAK
KKVLDVTKEAQANLLTAKHGMTLRESFEDNVVRFLNEARDKAGRLAEVNLKDLNNVKQMVMAGSKGSFINI
AQMSACV
GQQSVEGKRIAFGFVDRTLPHFSKDDYSPESKGFVENSYLRGLTPQEFFFHAMGGREGLIDTAVKTAETGY
IQRRLVK
ALEDIMVHYDNTTRNSLGNVIQFIYGEDGMDAAHIEKQSLDTIGGSDAAFEKRYRVDLLNTDHTLDPSLLE
SGSEILG
DLKLQVLLDEEYKQLVKDRKFLREVFVDGEANWPLPVNIRRIIQNAQQTFHIDHTKPSDLTIKDIVLGVKD
LQENLLV
LRGKNEIIQNAQRDAVTLFCCLLRSRLATRRVLQEYRLTKQAFDWVLSNIEAQFLRSVVHPGEMVGVLAAQ
SIGEPAT
QMTLNTFHFAGVASKKVTSGVPRLKEILNVAKNMKTPSLTVYLEPGHAADQEQAKLIRSAIEHTTLKSVTI
ASEIYYD
PDPRSTVIPEDEEIIQLHFSLLDEEAEQSFDQQSPWLLRLELDRAAMNDKDLTMGQVGERIKQTFKNDLFV
IWSEDND
EKLIIRCRVVRPKSLDAETEAEEDHMLKKIENTMLENITLRGVENIERVVMMKYDRKVPSPTGEYVKEPEW
VLETDGV
NLSEVMTVPGIDPTRIYTNSFIDIMEVLGIEAGRAALYKEVYNVIASDGSYVNYRHMALLVDVMTTQGGLT
SVTRHGF
NRSNTGALMRCSFEETVEILFEAGASAELDDCRGVSENVILGQMAPIGTGAFDVMIDEESLVKYMPEQKIT
EIEDGQD
GGVTPYSNESGLVNADLDVKDELMFSPLVDSGSNDAMAGGFTAYGGADYGEATSPFGAYGEAPTSPGFGVS
SPGFSPT
SPTYSPTSPAYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTS
PSYSPTS
PSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPAYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSP
NYSPTSP
SYSPTSPGYSPGSPAYSPKQDEQKHNENENSR" ,
hist {
replaces {
date
std {
year 1997 ,
month 10 ,
day 9 } ,
ids {
gi 133330 } } } } ,
annot {
{
data
ftable {
{
data
region "Zinc finger region" ,
comment "C2H2-TYPE (POTENTIAL)." ,
location
int {
from 66 ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

398 of 421

18/04/2005

to 82 ,
id
gi 2507347 } ,
exp-ev not-experimental } ,
{
data
region "Domain" ,
comment "CARBOXYL-TERMINAL 7-RESIDUE REPEATS." ,
location
int {
from 1543 ,
to 1718 ,
id
gi 2507347 } ,
exp-ev experimental } ,
{
data
region "Variant" ,
comment "MISSING (IN STRAIN A364A)." ,
location
int {
from 1652 ,
to 1658 ,
id
gi 2507347 } ,
exp-ev experimental } ,
{
data
region "Conflict" ,
comment "A -> V (IN REF. 1)." ,
location
pnt {
point 1513 ,
id
gi 2507347 } ,
exp-ev experimental } ,
{
data
region "Conflict" ,
comment "G -> A (IN REF. 1)." ,
location
pnt {
point 1523 ,
id
gi 2507347 } ,
exp-ev experimental } ,
{
data
region "Conflict" ,
comment "T -> M (IN REF. 1)." ,
location
pnt {
point 1600 ,
id
gi 2507347 } ,
exp-ev experimental } ,
{
data
gene {
locus "RPB1" ,
syn {
"RPO21" ,
"RPB220" ,
"SUA8" ,
"YDL140C" ,
"D2150" } } ,
location
int {
from 0 ,
to 1732 ,
id
gi 2507347 } } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

399 of 421

18/04/2005

{
data
prot {
name {
<=============ACCDB/title
"DNA-DIRECTED RNA POLYMERASE II LARGEST SUBUNIT" } ,
ec {
"2.7.7.6" } } ,
location
int {
from 0 ,
to 1732 ,
id
gi 2507347 } } } } } }

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

400 of 421

18/04/2005

Example EMBL record
Seq-entry ::= set {
level 1 ,
class nuc-prot ,
descr {
pub {
pub {
gen {
cit "Unpublished" ,
authors {
names
std {
{
name
name {
last "Drebot" ,
initials "M.A." } } ,
{
name
name {
last "Jansma" ,
initials "D." } } ,
{
name
name {
last "Himmelfarb" ,
initials "H.J." } } ,
{
name
name {
last "Friesen" ,
initials "J.D." } } } } ,
title "Suppressors of yeast RNA polymerase II mutations
belong to a
family of gene products that interact with a protein kinase" } } } ,
pub {
pub {
sub {
authors {
names
std {
{
name
name {
last "Jansma" ,
initials "D." } } } ,
affil
str "David Jansma, Genetics, Hospital for Sick Children,
555
University, Avenue, Toronto, Ontario, M5G 1X8, Canada" } ,
medium other ,
date
std {
year 1992 ,
month 7 ,
day 28 } } } } ,
create-date
std {
year 1992 ,
month 12 ,
day 11 } ,
update-date
std {
year 1993 ,
month 3 ,
day 12 } ,
source {
org {
taxname "Saccharomyces cerevisiae" ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

401 of 421

18/04/2005

common "baker's yeast" ,
db {
{
db "taxon" ,
tag
id 4932 } } ,
orgname {
name
binomial {
genus "Saccharomyces" ,
species "cerevisiae" } ,
mod {
{
subtype isolate ,
subname "Mating type a" } } ,
lineage "Eukaryota; Fungi; Ascomycota; Saccharomycotina;
Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces"
,
gcode 1 ,
mgcode 3 ,
div "PLN" } } ,
subtype {
{
subtype chromosome ,
name "7" } } } } ,
seq-set {
seq {
id {
embl {
name "SCSPM2" ,<==============================ACCDB/name
accession "Z14128" ,<========================ACCDB/accession
version 1 } ,<==================================ACCDB/version
gi 287914 } ,
descr {
title "S.cerevisiae spm2+ gene for spm2+ protein" ,
embl {
div fun ,
creation-date
std {
year 1992 ,
month 12 ,
day 11 } ,
update-date
std {
year 1993 ,
month 3 ,
day 12 } ,
keywords {
"CDC68/SPT16 protein" ,
"spm2+ gene" ,
"spm2+ protein" } ,
xref {
{
dbname
name "SGD" ,
id {
str "L0001891" ,
str "SIP2" } } ,
{
dbname
code swissprot ,
id {
str "P34164" ,
str "SIP2_YEAST" } } } } ,
molinfo {
biomol genomic } } ,
inst {
repr raw ,
mol dna ,
length 2032 ,
seq-data
ncbi2na

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

402 of 421

18/04/2005

'00F965EC6C171E0103E3BFD06B54022FC17F9A55FFCDD14FFD770607B4E
0107C933066C7C4C8C8F0323053933470724DB4E0C91BF973183E69693ADFF33E757E85
8B023F1
7E6FF68DB4037FCA80A303114423D7A80E00B3780C3EE7FE82AC78892E0BF73AB1C610B
4D49500
024041000BB6E45CD38B86F2200152439D0A798164203A39EE24202F1E0789F0303BD23
D50866A
925F42A2A373C50000361BCF9E6E38A18951CE502737BCE8079EE847877AB7DB71C98E3
822A233
CF9521848950908E52D78E38D9D294F7D952228A841240C6E50209F6A2B50B80D089D3C
3AF56E8
8C2BA490AEBD00BF1B84A74F4503A28038DAFE3178F78430E9D3F4EE09C29E7D4A444C8
F48FCCB
A30E27C8BCB8FD7951241E348EA01F6D0F1320B4A4148000541C180230AD428248F5398
5D51498
D8D743E77903EB0A35218FFAE1A3310AFD38237B45485D6FA0CC41213D596EFC7856DED
3A0631C
F317E8DB4902431E31B4E9E114550F175D2FA20EF37E0C0C739464214BD062041D299DE
54F5814
EEB9F051FAF1CB24F0910C47FBB25D4DBD8CC0480CED1523DFCC654C8B5DB0D6A3D71FF
C0A6CFB
333FFA49CE911C3F73F50FCBB0922332093304C38FC70EA74B1DD0E7CFBCD1014631C31
5A0B66E
37FFFFF6803883651FF706020CF80DF0CFAE208A0C02C9C222D1A8700AC00BD273D41FC
75FDE33
E45FDB81E9F6ECFC2E0CE'H ,
hist {
replaces {
date
std {
year 1993 ,
month 6 ,
day 11 } ,
ids {
gi 4525 } } } } ,
annot {
{
data
ftable {
{
data
pub {
pub {
muid 92017853 ,
article {
title {
name "CDC68, a yeast gene that affects
regulation of
cell proliferation and transcription, encodes a protein with a highly
acidic
carboxyl terminus." } ,
authors {
names
std {
{
name
name {
last "Rowley" ,
initials "A." } } ,
{
name
name {
last "Singer" ,
initials "R.A." } } ,
{
name
name {
last "Johnston" ,
initials "G.C." } } } ,
affil
str "Department of Microbiology, Dalhousie
University, Halifax, Nova Scotia, Canada." } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

403 of 421

18/04/2005

from
journal {
title {
iso-jta "Mol. Cell. Biol." ,
ml-jta "Mol Cell Biol" ,
issn "0270-7306" ,
name "Molecular and cellular biology." }
,
imp {
date
std {
year 1991 ,
month 11 } ,
volume "11" ,
issue "11" ,
pages "5718-5726" ,
language "eng" } } ,
ids {
pubmed 1833637 ,
medline 92017853 } } ,
pmid 1833637 } } ,
location
int {
from 1282 ,
to 2031 ,
id
gi 287914 } } ,
{
data
gene {
locus "spm2+" } ,
location
int {
from 398 ,
to 1645 ,
id
gi 287914 } } } } } } ,
seq {
id {
embl {
accession "CAA78503" ,
version 1 } ,
gi 287915 } ,
descr {
title "spm2+ [Saccharomyces cerevisiae]" ,
molinfo {
biomol peptide } } ,
inst {
repr raw ,
mol aa ,
length 415 ,
seq-data
ncbieaa
"MGTTTSHPAQKKQTTKKCRAPIMSDVREKPSNAQGCEPQEMDAVSKKVTELSLNKCSDS
QDAGQPSREGSITKKKSTLLLRDEDEPTMPKLSVMETAVDTDSGSSSTSDDEEGDIIAQTTEPKQDASPDD
DRSGHSS
PREEGQQQIRAKEASGGPSEIKSSLMVPVEIRWQQGGSKVYVTGSFTKWRKMIGLIPDSDNNGSFHVKLRL
LPGTHRF
RFIVDNELRVSDFLPTATDQMGNFVNYIEVRQPEKNPTNEKIRSKEADSMRPPTSDRSSIALQIGKDPDDF
GDGYTRF
HEDLSPRPPLEYTTDIPAVFTDPSVMERYYYTLDRQQSNTDTSWLTPPQLPPQLENVILNKYYATQDQFNE
NNSGALP
IPNHVVLNHLVTSSIKHNTLCVASIVRYKQKYVTQILYTPIESS" ,
hist {
replaces {
date
std {
year 1993 ,
month 6 ,
day 11 } ,
ids {
gi 4526 } } } } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

404 of 421

18/04/2005

annot {
{
data
ftable {
{
data
prot {
name {
"spm2+" } ,
activity {
"Wild-type version of SPM2,a dominant extragenic
suppressor of some temperature-sensitve mutations in RPO21 and PRP4."
} } ,
location
whole
gi 287915 } } } } } ,
seq {
id {
embl {
accession "CAA78504" ,
version 1 } ,
gi 4388554 } ,
descr {
title "CDC68 /SPT16 [Saccharomyces cerevisiae]" ,
molinfo {
biomol peptide ,
completeness partial } } ,
inst {
repr raw ,
mol aa ,
length 1 ,
seq-data
ncbieaa "M" } ,
annot {
{
data
ftable {
{
data
prot {
name {
"CDC68 /SPT16" } } ,
partial TRUE ,
location
whole
gi 4388554 } } } } } } ,
annot {
{
data
ftable {
{
data
cdregion {
frame one ,
code {
id 1 } } ,
product
whole
gi 287915 ,
location
int {
from 398 ,
to 1645 ,
id
gi 287914 } ,
dbxref {
{
db "SWISS-PROT" ,
tag
str "P34164" } } } ,
{
data

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

405 of 421

18/04/2005

cdregion {
frame one ,
code {
id 1 } } ,
partial TRUE ,
product
whole
gi 4388554 ,
location
int {
from 2029 ,
to 2031 ,
id
gi 287914 ,
fuzz-to
lim gt } ,
cit
pub {
muid 92017853 } } } } } }

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

406 of 421

18/04/2005

Example PDB record
Seq-entry ::= set {
class pdb-entry ,
descr {
pdb {
deposition
std {
year 1992 ,
month 4 ,
day 3 } ,
class "Isomerase(Intramolecular Oxidoreductse)" ,
compound {
"D-Xylose Isomerase (E.C.5.3.1.5) Mutant With Glu 186 Replaced
By Gln
(E186Q) Complex With Xylose And Mn" } ,
source {
"(Actinoplanes missouriensis) E186Q Mutant Gene Expressed In
(Escherichia coli)" } ,
exp-method "X-Ray Diffraction" } ,
comment "Revision History:~JUL 15 93 Initial Entry" ,
pub {
pub {
sub {
authors {
names
std {
{
name
name {
last "Janin" ,
full "J.Janin" ,
initials "J." } } } } ,
date
std {
year 1992 ,
month 4 ,
day 3 } } } } ,
pub {
pub {
muid 92304915 ,
article {
title {
name "Protein engineering of xylose (glucose) isomerase
from
Actinoplanes missouriensis. 1. Crystallography and site-directed
mutagenesis
of metal binding sites." } ,
authors {
names
str {
"J.Jenkins" ,
"J.Janin" ,
"F.Rey" ,
"M.Chiadmi" ,
"H.van Tilbeurgh" ,
"I.Lasters" ,
"M.De Maeyer" ,
"D.Van Belle" ,
"S.J.Wodak" ,
"M.Lauwereys" ,
"P.Stanssens" ,
"N.T.Mrabet" ,
"J.Snauwaert" ,
"G.Matthyssens" ,
"A.-M.Lambeir" } } ,
from
journal {
title {

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

407 of 421

18/04/2005

iso-jta "Biochemistry" ,
ml-jta "Biochemistry" ,
issn "0006-2960" ,
name "Biochemistry." } ,
imp {
date
std {
year 1992 ,
month 6 ,
day 23 } ,
volume "31" ,
issue "24" ,
pages "5449-5458" ,
language "eng" } } ,
ids {
pubmed 1610791 ,
medline 92304915 } } ,
pmid 1610791 } } ,
pub {
pub {
muid 92304916 ,
article {
title {
name "Protein engineering of xylose (glucose) isomerase
from
Actinoplanes missouriensis. 2. Site-directed mutagenesis of the xylose
binding site." } ,
authors {
names
str {
"A.-M.Lambeir" ,
"M.Lauwereys" ,
"P.Stanssens" ,
"N.T.Mrabet" ,
"J.Snauwaert" ,
"H.van Tilbeurgh" ,
"G.Matthyssens" ,
"I.Lasters" ,
"M.De Maeyer" ,
"S.J.Wodak" ,
"J.Jenkins" ,
"M.Chiadmi" ,
"J.Janin" } } ,
from
journal {
title {
iso-jta "Biochemistry" ,
ml-jta "Biochemistry" ,
issn "0006-2960" ,
name "Biochemistry." } ,
imp {
date
std {
year 1992 ,
month 6 ,
day 23 } ,
volume "31" ,
issue "24" ,
pages "5459-5466" ,
language "eng" } } ,
ids {
pubmed 1610792 ,
medline 92304916 } } ,
pmid 1610792 } } ,
pub {
pub {
muid 92304917 ,
article {
title {
name "Protein engineering of xylose (glucose) isomerase
from
Actinoplanes missouriensis. 3. Changing metal specificity and the pH

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

408 of 421

18/04/2005

profile
by site-directed mutagenesis." } ,
authors {
names
std {
{
name
name {
last "van Tilbeurgh" ,
initials "H." } } ,
{
name
name {
last "Jenkins" ,
initials "J." } } ,
{
name
name {
last "Chiadmi" ,
initials "M." } } ,
{
name
name {
last "Janin" ,
initials "J." } } ,
{
name
name {
last "Wodak" ,
initials "S.J." } } ,
{
name
name {
last "Mrabet" ,
initials "N.T." } } ,
{
name
name {
last "Lambeir" ,
initials "A.M." } } } ,
affil
str "Plant Genetic Systems N.V., Gent, Belgium." } ,
from
journal {
title {
iso-jta "Biochemistry" ,
ml-jta "Biochemistry" ,
issn "0006-2960" ,
name "Biochemistry." } ,
imp {
date
std {
year 1992 ,
month 6 ,
day 23 } ,
volume "31" ,
issue "24" ,
pages "5467-5471" ,
language "eng" } } ,
ids {
pubmed 1610793 ,
medline 92304917 } } ,
pmid 1610793 } } ,
pub {
pub {
muid 92172844 ,
article {
title {
name "Arginine residues as stabilizing elements in
proteins." } ,
authors {
names

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

409 of 421

18/04/2005

str {
"N.T.Mrabet" ,
"A.Van Den Broek" ,
"I.Van Den Brande" ,
"P.Stanssens" ,
"Y.Laroche" ,
"A.-M.Lambeir" ,
"G.Matthijssens" ,
"J.Jenkins" ,
"M.Chiadmi" ,
"H.van Tilbeurgh" ,
"F.Rey" ,
"J.Janin" ,
"W.J.Quax" ,
"I.Lasters" ,
"M.De Maeyer" ,
"S.J.Wodak" } } ,
from
journal {
title {
iso-jta "Biochemistry" ,
ml-jta "Biochemistry" ,
issn "0006-2960" ,
name "Biochemistry." } ,
imp {
date
std {
year 1992 ,
month 3 ,
day 3 } ,
volume "31" ,
issue "8" ,
pages "2239-2253" ,
language "eng" } } ,
ids {
pubmed 1540579 ,
medline 92172844 } } ,
pmid 1540579 } } ,
pub {
pub {
muid 89184498 ,
article {
title {
name "Structural analysis of the 2.8 A model of Xylose
isomerase
from Actinoplanes missouriensis." } ,
authors {
names
std {
{
name
name {
last "Rey" ,
initials "F." } } ,
{
name
name {
last "Jenkins" ,
initials "J." } } ,
{
name
name {
last "Janin" ,
initials "J." } } ,
{
name
name {
last "Lasters" ,
initials "I." } } ,
{
name
name {

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

410 of 421

18/04/2005

last "Alard" ,
initials "P." } } ,
{
name
name {
last "Claessens" ,
initials "M." } } ,
{
name
name {
last "Matthyssens" ,
initials "G." } } ,
{
name
name {
last "Wodak" ,
initials "S." } } } ,
affil
str "Laboratoire de Biologie Physicochimique, Universite
Paris
Sud, Orsay, France." } ,
from
journal {
title {
iso-jta "Proteins" ,
ml-jta "Proteins" ,
issn "0887-3585" ,
name "Proteins." } ,
imp {
date
std {
year 1988 } ,
volume "4" ,
issue "3" ,
pages "165-172" ,
language "eng" } } ,
ids {
pubmed 3237716 ,
medline 89184498 } } ,
pmid 3237716 } } } ,
seq-set {
seq {
id {
pdb {
mol "9XIM" ,<===================================ACCDB/name
chain 65 ,<====================================ACCDB/chain
rel
std {
year 1992 , ,<===========================ACCDB/release
month 4 ,
day 3 } } ,
gi 443580 } ,
descr {
// record truncated

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

411 of 421

18/04/2005

Example Biostruc
An example of the ASN biostruc. Some data has been removed for
the sake of brevity.
Biostruc ::= {
id {
mmdb-id 2 }
descr {
name "101D"
pdb-comment
pdb-comment
pdb-comment
pdb-comment
pdb-comment
pdb-comment
pdb-comment
pdb-comment
pdb-comment

,
,
"remark
"remark
"remark
"remark
"remark
"remark
"remark
"remark
"remark

3:
3:
3:
3:
3:
3:
3:
3:
3:
0"
3:

Refinement." ,
Program
Nuclsq" ,
Authors
Westhof,Dumas,Moras" ,
R Value
0.163" ,
Free R Value
0.252" ,
Number Of Reflections
2430" ,
Resolution Range
8.0 - 2.25 Angstroms" ,
Data Cutoff
2.0
Sigma(F)" ,
Number Of Protein Atoms
,
Number Of Nucleic Acid Atoms

pdb-comment "remark
488" ,
pdb-comment "remark
3: Number Of Solvent Atoms
33" ,
pdb-comment "remark
3: Rms Deviations From Ideal Values (The Values Of" ,
pdb-comment "remark
3: Sigma, In Parentheses, Are The Input Estimated" ,
pdb-comment "remark
3: Standard Deviations That Determine The Relative" ,
pdb-comment "remark
3: Weights Of The Corresponding Restraints)" ,
pdb-comment "remark
3: Distance Restraints (Angstroms)" ,
pdb-comment "remark
3: Sugar-Base Bond Distance
0.024(0.030)" ,
pdb-comment "remark
3: Sugar-Base Bond Angle Distance
0.040(0.040)" ,
pdb-comment "remark
3: Phosphate Bond Distance
0.026(0.040)" ,
pdb-comment "remark
3: Phosphate Bond Angle Distance, H-Bond
0.057(0.050)" ,
pdb-comment "remark
3: Plane Restraint (Angstroms)
0.014(0.020)" ,
pdb-comment "remark
3: Chiral-Center Restraint (Angstroms3)
0.161(0.150)" ,
pdb-comment "remark
3: Non-Bonded Contact Restraints (Angstroms)" ,
pdb-comment "remark
3: Single Torsion Contact
0.093(0.100)" ,
pdb-comment "remark
3: Multiple Torsion Contact
0.097(0.100)" ,
pdb-comment "remark
3: Isotropic Thermal Factor Restraints (Angstroms2)" ,
pdb-comment "remark
3: Sugar-Base Bond
4.282(6.000)" ,
pdb-comment "remark
3: Sugar-Base Angle
4.990(6.000)" ,
pdb-comment "remark
3: Phosphate Bond
5.693(6.000)" ,
pdb-comment "remark
3: Phosphate Bond Angle, H-Bond
5.227(6.000)" ,
pdb-comment "remark 101: Residue +c A
9 Has Br
Bonded To C5." ,
pdb-comment "remark 101: Residue +c B 21 Has Br
Bonded To C5." ,
pdb-comment "remark 105: The Protein Data Bank Has Adopted The Saccharide
Chemists" ,
pdb-comment "remark 105: Nomenclature For Atoms Of The DeoxyriboseRIBOSE
MOIETY" ,
pdb-comment "remark 105: Rather Than That Of The Nucleoside Chemists. The
Ring" ,
pdb-comment "remark 105: Oxygen Atom Is Labelled O4 Instead Of O1." ,
pdb-comment "remark 106: The Hydrogen Bonds Between Base Pairs In This
Entry Follow" ,
pdb-comment "remark 106: The Conventional Watson-Crick Hydrogen Bonding
Pattern." ,
pdb-comment "remark 106: They Have Not Been Presented On Conect Records In
This" ,
pdb-comment "remark 106: Entry." ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

412 of 421

18/04/2005

history {
data-source {
name-of-database "Protein Data Bank" ,
version-of-database
release-date
std {
year 1995 ,
month 2 ,
day 28 } ,
database-entry-id
other-database {
db "PDB" ,
tag
str "101D" } ,
database-entry-date
std {
year 1994 ,
month 12 ,
day 14 } } } ,
attribution
sub {
authors {
names
std {
{
name
name {
last "Goodsell" ,
full "D.S.Goodsell" ,
initials "D.S." } } ,
{
name
name {
last "Kopka" ,
full "M.L.Kopka" ,
initials "M.L." } } ,
{
name
name {
last "Dickerson" ,
full "R.E.Dickerson" ,
initials "R.E." } } } } ,
imp {
date
std {
year 1994 ,
month 12 ,
day 14 } } } ,
attribution
gen {
cit "To Be Published" ,
authors {
names
std {
{
name
name {
last "Goodsell" ,
full "D.S.Goodsell" ,
initials "D.S." } } ,
{
name
name {
last "Kopka" ,
full "M.L.Kopka" ,
initials "M.L." } } ,
{
name
name {
last "Dickerson" ,
full "R.E.Dickerson" ,
initials "R.E." } } } } ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

413 of 421

18/04/2005

title "Refinement Of Netropsin Bound To Dna: Bias And Feedback In
Electron Density Map Interpretation" } ,
attribution
equiv {
muid 85264810 ,
article {
title {
name "Binding of an antitumor drug to DNA, Netropsin and
C-G-C-G-A-A-T-T-BrC-G-C-G." } ,
authors {
names
std {
{
name
name {
last "Kopka" ,
initials "M.L." } } ,
{
name
name {
last "Yoon" ,
initials "C." } } ,
{
name
name {
last "Goodsell" ,
initials "D." } } ,
{
name
name {
last "Pjura" ,
initials "P." } } ,
{
name
name {
last "Dickerson" ,
initials "R.E." } } } } ,
from
journal {
title {
iso-jta "J. Mol. Biol." ,
ml-jta "J Mol Biol" ,
issn "0022-2836" ,
jta "J6V" } ,
imp {
date
std {
year 1985 ,
month 6 ,
day 25 } ,
volume "183" ,
issue "4" ,
pages "553-563" } } } } } ,
chemical-graph {
descr {
name "Dna (5'-D(CpGpCpGpApApTpTp(Br)cpGpCpG)-3') Complexed With
Netropsin, Re-Refinement" ,
pdb-class "Deoxyribonucleic Acid" ,
pdb-source "Synthetic" ,
assembly-type other } ,
molecule-graphs {
{
id 1 ,
descr {
name "A" ,
pdb-comment "SEQRES" ,
molecule-type dna ,
organism {
org {
taxname "synthetic construct" ,
db {
{

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

414 of 421

db "taxon" ,
tag
id 32630 } } ,
orgname {
name
partial {
{
fixed-level other ,
level "species" ,
name "synthetic construct" } }
lineage "artificial sequence" ,
gcode 11 ,
div "SYN" } } } } ,
seq-id
gi 996094 ,
residue-sequence {
{
id 1 ,
name "
1 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary"
tag
id 1 } ,
residue-graph-id 66 } } ,
{
id 5 ,
name "
5 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary"
tag
id 1 } ,
residue-graph-id 61 } } ,
{
id 8 ,
name "
8 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary"
tag
id 1 } ,
residue-graph-id 70 } } ,
{
id 9 ,
name "
9 " ,
residue-graph
local 1 } ,
{
id 10 ,
name " 10 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary"
tag
id 1 } ,
residue-graph-id 67 } } ,
{
id 11 ,
name " 11 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {

seqhound@blueprint.org

18/04/2005

Version 3.3

The SeqHound Manual

415 of 421

18/04/2005

db "Standard residue dictionary" ,
tag
id 1 } ,
residue-graph-id 64 } } ,
{
id 12 ,
name " 12 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary" ,
tag
id 1 } ,
residue-graph-id 68 } } } ,
inter-residue-bonds {
{
atom-id-1 {
molecule-id 1 ,
residue-id 11 ,
atom-id 9 } ,
atom-id-2 {
molecule-id 1 ,
residue-id 12 ,
atom-id 1 } } } } ,
{
id 2 ,
descr {
name "B" ,
pdb-comment "SEQRES" ,
molecule-type dna ,
organism {
org {
taxname "synthetic construct" ,
db {
{
db "taxon" ,
tag
id 32630 } } ,
orgname {
name
partial {
{
fixed-level other ,
level "species" ,
name "synthetic construct" } } ,
lineage "artificial sequence" ,
gcode 11 ,
div "SYN" } } } } ,
seq-id
gi 996095 ,
residue-sequence {
{
id 9 ,
name " 21 " ,
residue-graph
local 1 } ,
{
id 10 ,
name " 22 " ,
residue-graph
standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary" ,
tag
id 1 } ,
residue-graph-id 67 } } ,
{
id 12 ,
name " 24 " ,
residue-graph

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

416 of 421

18/04/2005

standard {
biostruc-residue-graph-set-id
other-database {
db "Standard residue dictionary" ,
tag
id 1 } ,
residue-graph-id 68 } } } ,
inter-residue-bonds {
{
atom-id-1 {
molecule-id 2 ,
residue-id 1 ,
atom-id 6 } ,
atom-id-2 {
molecule-id 2 ,
residue-id 2 ,
atom-id 1 } } ,
{
atom-id-1 {
molecule-id 2 ,
residue-id 11 ,
atom-id 9 } ,
atom-id-2 {
molecule-id 2 ,
residue-id 12 ,
atom-id 1 } } } } ,
{
id 3 ,
descr {
name "1" ,
molecule-type other-nonpolymer } ,
residue-sequence {
{
id 1 ,
name "
9 " ,
residue-graph
local 2 } } } ,
{
id 4 ,
descr {
name "2" ,
molecule-type other-nonpolymer } ,
residue-sequence {
{
id 1 ,
name " 21 " ,
residue-graph
local 2 } } } ,
{
id 38 ,
descr {
name "3" ,
molecule-type solvent } ,
residue-sequence {
{
id 1 ,
name " 58 " ,
residue-graph
local 5 } } } ,
{
id 39 ,
descr {
name "3" ,
molecule-type solvent } ,
residue-sequence {
{
id 1 ,
name " 59 " ,
residue-graph
local 5 } } } } ,
inter-molecule-bonds {
{

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

417 of 421

18/04/2005

atom-id-1 {
molecule-id 1 ,
residue-id 9 ,
atom-id 18 } ,
atom-id-2 {
molecule-id 3 ,
residue-id 1 ,
atom-id 1 } } ,
{
atom-id-1 {
molecule-id 2 ,
residue-id 9 ,
atom-id 18 } ,
atom-id-2 {
molecule-id 4 ,
residue-id 1 ,
atom-id 1 } } } ,
residue-graphs {
{
id 1 ,
descr {
name " +C DNA" ,
pdb-comment "" } ,
residue-type deoxyribonucleotide ,
iupac-code {
"N" } ,
atoms {
{
id 1 ,
name " P " ,
iupac-code {
" P " } ,
element p } ,
{
id 18 ,
name " C5 " ,
iupac-code {
" C5 " } ,
element c } ,
{
id 19 ,
name " C6 " ,
iupac-code {
" C6 " } ,
element c } } ,
bonds {
{
atom-id-1 16 ,
atom-id-2 17 ,
bond-order unknown } ,
{
atom-id-1 16 ,
atom-id-2 18 ,
bond-order unknown } } } ,
{
id 2 ,
descr {
name " BR" ,
pdb-comment "Bromine" } ,
residue-type other ,
iupac-code {
"X" } ,
atoms {
{
id 1 ,
name "BR " ,
iupac-code {
"BR " } ,
element br } } ,
bonds {
} } ,
{

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

418 of 421

18/04/2005

id 3 ,
descr {
name " NT" ,
pdb-comment "Netropsin" } ,
residue-type other ,
iupac-code {
"X" } ,
atoms {
{
id 1 ,
name " C1 " ,
iupac-code {
" C1 " } ,
element c } ,
{
id 5 ,
name " C2 " ,
iupac-code {
" C2 " } ,
element c } ,
{
id 31 ,
name " N10" ,
iupac-code {
" N10" } ,
element n } } ,
bonds {
{
atom-id-1 1 ,
atom-id-2 2 ,
bond-order unknown } ,
{
atom-id-1 1 ,
atom-id-2 3 ,
bond-order unknown } ,
{
atom-id-1 12 ,
atom-id-2 14 ,
bond-order unknown } ,
{
atom-id-1 29 ,
atom-id-2 31 ,
bond-order unknown } } } ,
{
id 4 ,
descr {
name "MO3" ,
pdb-comment "Magnesium Ion, 3 Waters Coordinated" } ,
residue-type other ,
iupac-code {
"X" } ,
atoms {
{
id 1 ,
name "MG " ,
iupac-code {
"MG " } ,
element mg } ,
{
id 2 ,
name " O1 " ,
iupac-code {
" O1 " } ,
element o } ,
{
id 3 ,
name " O2 " ,
iupac-code {
" O2 " } ,
element o } ,
{
id 4 ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

419 of 421

18/04/2005

name " O3 " ,
iupac-code {
" O3 " } ,
element o } } ,
bonds {
{
atom-id-1 1 ,
atom-id-2 2 ,
bond-order unknown } ,
{
atom-id-1 1 ,
atom-id-2 3 ,
bond-order unknown } ,
{
atom-id-1 1 ,
atom-id-2 4 ,
bond-order unknown } } } ,
{
id 5 ,
descr {
name "HOH" ,
pdb-comment "" } ,
residue-type other ,
iupac-code {
"X" } ,
atoms {
{
id 1 ,
name " O " ,
iupac-code {
" O " } ,
element o } } ,
bonds {
} } } } ,
model {
{
id 3 ,
type pdb-model ,
descr {
name "Model 1 from PDB entry 101D" ,
pdb-reso "Resolution: 2.25" ,
pdb-method "X-Ray Diffraction" ,
pdb-comment "FEB 27 95 Initial Entry" } ,
model-space {
coordinate-units angstroms ,
thermal-factor-units b } ,
model-coordinates {
{
id 1 ,
coordinates
literal
atomic {
number-of-points 556 ,
atoms {
number-of-ptrs 556 ,
molecule-ids {
1 ,
1 ,
1 ,
1 ,
2 ,
2 ,
2 ,
3 ,
4 ,
5 ,
5 ,
5 ,
5 ,
6 ,
6 ,
6 ,

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

420 of 421

18/04/2005

6 ,
7 ,
8 ,
9 ,
10 ,
11 ,
12 ,
13 ,
14 ,
15 ,
16 ,
17 ,
18 ,
19 ,
20 ,
12 ,
12 ,
12 ,
12 ,
12 ,
1 ,
1 ,
1 } } ,
sites {
scale-factor 1000 ,
x {
18598 ,
19853 ,
20375 ,
16812 } ,
y {
34469 ,
34632 ,
33233 ,
30694 } ,
z {
24672 ,
22605 ,
8518 ,
-2033 ,
26343 } } ,
temperature-factors
isotropic {
scale-factor 1000 ,
b {
23239 ,
24930 } } } } } } } }

seqhound@blueprint.org

Version 3.3

The SeqHound Manual

421 of 421

18/04/2005

GO background material
Please see: http://www.geneontology.org/doc/GO.doc.html}

seqhound@blueprint.org

Version 3.3

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.3
Linearized                      : Yes
Create Date                     : 2005:04:18 16:53:04-04:00
Modify Date                     : 2005:04:18 17:51:54-03:00
Page Count                      : 421
Creation Date                   : 2005:04:18 20:53:04Z
Mod Date                        : 2005:04:18 20:53:04Z
Producer                        : Acrobat Distiller 5.0.5 (Windows)
Author                          : idonalds
Metadata Date                   : 2005:04:18 20:53:04Z
Creator                         : idonalds
Title                           : Microsoft Word - The_SeqHound_Admin_Manual.doc
Page Mode                       : UseOutlines
Tagged PDF                      : Yes

EXIF Metadata provided by EXIF.tools

The_SeqHound_Admin_Manual The Seq Hound Admin Manual

Navigation menu

Versions of this User Manual:

Views

Navigation