The SeqHound Manual
Part II: Sections 4-7
For Administrators and Developers
Release 3.3
(April 20th, 2005)
Authors
Ian Donaldson, Katerina Michalickova, Hao Lieu, Renan Cavero, Michel Dumontier,
Doron Betel, Ruth Isserlin, Marc Dumontier, Michael Matan, Rong Yao, Zhe Wang,
Victor Gu, Elizabeth Burgess, Kai Zheng, Rachel Farrall
Edited by
Rachel Farrall and Ian Donaldson
© 2005 Mount Sinai Hospital
The SeqHound Manual
2 of 421
18/04/2005
Table of Contents
About this manual............................................................................................................ 7
Conventions ..................................................................................................................... 8
How to contact us. ........................................................................................................... 8
Who is SeqHound?........................................................................................................... 9
4. Setting up SeqHound locally. ....................................................................................... 10
4.1 Overview.................................................................................................................. 10
4.2 SeqHound system requirements............................................................................... 11
OS and hardware architecture .................................................................................... 11
Memory (RAM) ......................................................................................................... 11
Hard Disk ................................................................................................................... 12
Source code and executables .................................................................................. 12
Database.................................................................................................................. 12
Other Software ........................................................................................................... 12
Compiling SeqHound Code yourself. ........................................................................ 13
ODBC compliant database engines............................................................................ 13
Library dependencies ................................................................................................. 13
4.3 Obtaining precompiled SeqHound executables....................................................... 14
4.3.1 Obtaining SeqHound Source Code...................................................................... 16
4.4 Compiling SeqHound executables on Solaris.......................................................... 18
4.5 Building the SeqHound system on Solaris............................................................... 26
Catch up on SeqHound daily updates ........................................................................ 45
Setting up daily sequence updates.............................................................................. 47
Setting up SeqHound servers. Overview................................................................... 53
Trouble-shooting notes............................................................................................... 57
Error logs ................................................................................................................ 57
Recompiling SeqHound .......................................................................................... 57
Restarting the Apache server .................................................................................. 57
Other useful links.................................................................................................... 58
Parser schedule........................................................................................................ 58
MySQL errors ......................................................................................................... 58
5. Description of the SeqHound parsers and data tables by module................................. 59
What are modules? ........................................................................................................ 59
How to use this section. ................................................................................................. 59
Parser descriptions........................................................................................................ 59
Table descriptions.......................................................................................................... 60
An overview of the SeqHound data table structure ....................................................... 63
Parsers and resource files needed to build and update modules of SeqHound. ........... 64
core module ................................................................................................................ 66
mother parser .......................................................................................................... 66
update parser ........................................................................................................... 71
postcomgen parser .................................................................................................. 72
asndb table .............................................................................................................. 75
parti table ................................................................................................................ 78
nucprot table............................................................................................................ 80
accdb table .............................................................................................................. 82
histdb table .............................................................................................................. 88
pubseq table ............................................................................................................ 91
taxgi table................................................................................................................ 94
sengi table ............................................................................................................... 97
sendb table .............................................................................................................. 99
chrom table............................................................................................................ 101
gichromid table ..................................................................................................... 105
contigchromid table .............................................................................................. 107
gichromosome table .............................................................................................. 109
contigchromosome table ....................................................................................... 111
Redundant protein sequences (redundb) module ..................................................... 113
redund parser......................................................................................................... 113
redund table........................................................................................................... 115
Complete genomes tracking (gendb) module........................................................... 119
Taxonomy hierarchy (taxdb) module....................................................................... 120
importtaxdb parser ................................................................................................ 120
taxdb table............................................................................................................. 122
gcodedb table ........................................................................................................ 127
divdb table............................................................................................................. 132
del table................................................................................................................. 135
merge table............................................................................................................ 137
Structural databases (strucdb) module ..................................................................... 139
cbmmdb parser...................................................................................................... 139
vastblst parser........................................................................................................ 144
pdbrep parser......................................................................................................... 146
mmdb table............................................................................................................ 148
mmgi table ............................................................................................................ 154
domdb table........................................................................................................... 156
Protein sequence neighbours (neighdb) module ...................................................... 162
Installing nblast:.................................................................................................... 162
Configuration of nblast environment:................................................................... 163
Running NBLAST ................................................................................................ 164
NBLAST Update Procedure ................................................................................. 166
nbraccess program* .............................................................................................. 168
BLASTDB table................................................................................................... 169
NBLASTDB table................................................................................................. 172
Locus link functional annotations (lldb) module ..................................................... 177
llparser................................................................................................................... 177
addgoid parser....................................................................................................... 179
ll_omim table ........................................................................................................ 181
ll_go table.............................................................................................................. 183
ll_llink table .......................................................................................................... 186
ll_cdd table............................................................................................................ 188
GENE module .......................................................................................................... 191
parse_gene_files.pl parser..................................................................................... 191
gene_dbxref table.................................................................................................. 193
gene_genomicgi table ........................................................................................... 195
gene_history table ................................................................................................. 198
gene_info table...................................................................................................... 201
gene_object table .................................................................................................. 204
gene_productgi table............................................................................................. 206
gene_pubmed table ............................................................................................... 208
gene_synonyms table ............................................................................................ 210
Gene Ontology hierarchy (godb) module................................................................. 212
goparser................................................................................................................. 212
go_parent table...................................................................................................... 214
go_name table ....................................................................................................... 216
go_reference table................................................................................................. 219
go_synonym table ................................................................................................. 221
Gene Ontology Association (GOA) module ............................................................ 223
Table summarizing input files, parsers and command line parameters for GOA
module................................................................................................................... 225
Gene Ontology Module Diagram.......................................................................... 228
goa_seq_dbxref table ............................................................................................ 230
goa_association table ............................................................................................ 234
goa_reference table ............................................................................................... 237
goa_with table....................................................................................................... 239
goa_xdb table ........................................................................................................ 242
goa_gigo table....................................................................................................... 245
dbxref module .......................................................................................................... 248
Who Cross-references who? ................................................................................. 249
Explanation of the data table structure: ................................................................ 249
How to update the DBXref and GO Annotation modules using a cluster. .............. 256
Understanding the dbxref.ini file ............................................................................. 257
Table summarizing input files, parsers and command line parameters for dbxref
module................................................................................................................... 262
dbxref table ........................................................................................................... 265
dbxrefsourcedb table............................................................................................. 268
Contents of dbxrefsourcedb table ......................................................................... 270
RPS-BLAST domains (rpsdb) module..................................................................... 272
domname parser .................................................................................................... 272
Rpsdb parser.......................................................................................................... 273
domname table ...................................................................................................... 274
rpsdb table............................................................................................................. 278
Molecular Interaction (MI) module.......................................................................... 285
MI-BIND parser.................................................................................................... 285
MI_source table .................................................................................................... 289
MI_ints table ......................................................................................................... 291
MI_objects table.................................................................................................... 292
MI_obj_dbases table ............................................................................................. 294
MI_mol_types table .............................................................................................. 295
MI_dbases table .................................................................................................... 296
MI_record_types table .......................................................................................... 297
MI_complexes table.............................................................................................. 298
MI_complex2ints table ......................................................................................... 299
MI_complex2subunits table.................................................................................. 300
MI_complex2subunits table.................................................................................. 301
MI_refs table......................................................................................................... 302
MI_refs_db table................................................................................................... 304
MI_exp_methods table.......................................................................................... 305
MI_obj_labels table .............................................................................................. 306
Text mining module ................................................................................................. 307
mother parser ........................................................................................................ 307
text searcher parser ............................................................................................... 308
yeastnameparser.pl parser ..................................................................................... 312
text_bioentity table................................................................................................ 314
text_bioname table ................................................................................................ 317
text_secondrefs table............................................................................................. 321
text_bioentitytype table......................................................................................... 324
text_fieldtype table................................................................................................ 325
text_nametype table .............................................................................................. 326
text_rules table ...................................................................................................... 327
text_db table.......................................................................................................... 328
text_doc table ........................................................................................................ 329
text_docscore table................................................................................................ 331
text_evidencescore table ....................................................................................... 336
text_method table.................................................................................................. 338
text_point table...................................................................................................... 341
text_pointscore table ............................................................................................. 342
text_result table..................................................................................................... 344
text_resultscore table ............................................................................................ 346
text_search table.................................................................................................... 348
text_searchscore table ........................................................................................... 351
text_rng table ........................................................................................................ 353
text_rngresult table................................................................................................ 355
text_doctax table ................................................................................................... 357
text_organism table............................................................................................... 359
text_englishdict table ............................................................................................ 361
text_bncorpus table ............................................................................................... 363
text_pattern table................................................................................................... 365
text_stopword table............................................................................................... 367
6. Developing for SeqHound. ......................................................................................... 369
Open source development............................................................................................ 369
Code organization. ...................................................................................................... 370
Adding/Modifying a remote API function to SeqHound.............................................. 373
Overall architecture of the SeqHound system.......................................................... 374
Adding a new module to SeqHound............................................................................. 380
Database layer .......................................................................................................... 381
Parser layer............................................................................................................... 382
Local API layer (Query layer).................................................................................. 383
CGI layer .................................................................................................................. 383
Remote API layer ..................................................................................................... 384
7. Appendices.................................................................................................................. 387
Example GenBank record ........................................................................................ 388
Example SwissProt record ....................................................................................... 393
Example EMBL record ............................................................................................ 400
Example PDB record................................................................................................ 406
Example Biostruc ..................................................................................................... 411
GO background material .......................................................................................... 421
* not available at time of writing
About this manual.
This manual contains everything that has been documented about SeqHound. It is
distributed in two Parts (Part I: For Users and Part II: For Administrators and
Developers).
If you can’t find the answer here then please contact us. This manual was written and
reviewed by the persons listed under “Who is SeqHound”. Any errors should be reported
to seqhound@blueprint.org.
You can find out more about the general architecture of SeqHound by reading the
SeqHound paper that is freely available from BioMed Central. This paper is included in
the supplementary material distributed with this manual. See:
Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW.
SeqHound: biological sequence and structure database as a platform for bioinformatics
research. BMC Bioinformatics. 2002 Oct 25;3(1):32.
PMID: 12401134
The SeqHound Manual (Part I: Sections 1-3) For Users.
Sections 1 and 2 are one-page descriptions that tell you what to read first to get
started, depending on what kind of user you are.
Section 3 is of interest to programmers who want to use the remote API to access
information in the SeqHound database maintained by the Blueprint Initiative.
The SeqHound Manual (Part II: Sections 4-7) For Administrators and Developers
Section 4 is of interest to programmers and system administrators who want to set up
SeqHound themselves so they can use the local API.
Section 5 is an in-depth description of everything that’s in the SeqHound database and
how it gets there (table by table). This section will be of interest to all users.
Section 6 describes how programmers can add to SeqHound. This section also describes
our internal development process at Blueprint.
Section 7 includes Appendices of background and reference material.
Conventions
The following section describes the conventions used in this manual.
Italic
is used for filenames, file extensions, URLs, and email addresses.
Constant Width
is used for code examples, function names and system output.
Constant Bold
is used in examples for user input.
Constant Italic
is used in examples to show variables for which a context-specific substitution should be
made.
How to contact us.
General enquiries or comments can be posted to the SeqHound usergroup mailing list
seqhound.usergroup@blueprint.org. You may also subscribe to this list to receive
regular updates about SeqHound developments by going to
http://lists.blueprint.org/mailman/listinfo/seqhound.usergroup .
Private enquiries, bug reports from external users, questions about SeqHound or errors
found in this manual may be sent to seqhound@blueprint.org.
Who is SeqHound?
Contributors are listed chronologically according to when each person first started work on SeqHound.
Chris Hogue
Katerina Michalickova
Gary Bader
Ian Donaldson
Ruth Isserlin
Michel Dumontier
Hao Lieu
Marc Dumontier
Doron Betel
Renan Cavero
Ivy Lu
Rong Yao
Volodya Grytsan
Zhe Wang
Victor Gu
Rachel Farrall
Michael Matan
Elizabeth Burgess
Kai Zheng
4. Setting up SeqHound locally.
4.1 Overview.
This section describes how to set up the SeqHound system on your own hardware
using freely available SeqHound executables. These executables will allow you to build
and update the SeqHound database as well as run a web-interface and a remote API
server.
Section 4.2 should be reviewed first for system requirements before attempting to install
the SeqHound system.
Section 4.3 tells you how to download executables from the SeqHound ftp site for your
platform and operating system. SeqHound code may also be downloaded from this site.
Section 4.4 describes how SeqHound code may be compiled on your own hardware using
the code freely available on the SeqHound ftp site. This step is only required if
SeqHound executables are not available for your platform or if you want to make use of
the local programming API. If you obtain SeqHound executables from the ftp site and
want to build your local SeqHound database, you still need to go through Steps 8, 9, 10,
11 and 13 in this section which describe how to install the MySQL server and ODBC
driver.
Section 4.5 contains detailed instructions for using the executables to build the SeqHound
data tables and for setting up the SeqHound web-interface and remote API server.
4.2 SeqHound system requirements.
Before attempting to set up SeqHound yourself, you should review the system
requirements listed below. The SeqHound system is able to run on a number of operating
systems (we recommend and can best support a UNIX operating system like Sun Solaris
or Red Hat Linux). Setting up SeqHound will require approximately 700 GB of disk
space (see below).
Questions about system requirements, compilation, setup and maintenance can be
addressed to seqhound@blueprint.org. We will do our best to address all inquiries but
resources may not allow us to solve all problems arising on all possible setups.
OS and hardware architecture
SeqHound release code is compiled on the following platforms.
Blueprint production SeqHound is compiled and run on Sun-Fire-880 - Sun Solaris
(version 9). We have also compiled and tested SeqHound on the Fedora Core 2.0 and
the MacOS X operating systems.
Release versions of SeqHound executables are available for:

x86 architecture          Fedora Core 2.0
Sun-Fire-880              Sun Solaris (version 9)
PowerPC architecture      MacOS X
We have also successfully built executables on the following platforms.

x86 architecture          FreeBSD
x86 architecture          QNX
x86 architecture          Windows NT
PowerPC architecture      PPC Linux
SGI                       Irix 6
Alpha architecture        Compaq Alpha OS
HPPA 2.0 architecture     HPUX 11.0
HPPA 1.1 architecture     PA-RISC Linux
Memory (RAM)
We recommend a minimum of 1 GB of RAM to run the SeqHound executables.
Hard Disk

Source code and executables

Component                              Image Size
SeqHound Source and compiled           220.0 MB
NCBI Toolkit                           560.0 MB
NCBI C++ Toolkit                       12 GB
bzip2 Library                          4.5 MB
slri lib                               7.3 MB
slri lib_cxx                           9.4 MB
Source code and executables (total)    13 GB approx.

Database

Component                              Image Size
data tables                            300 GB
data tables backup                     300 GB
Database (total)                       700 GB*
*700 GB includes 300 GB for a single copy of the SeqHound data tables. The SeqHound
system includes a second copy of the data tables used for backup and updating. We
suggest a minimum of 700 GB for a SeqHound installation. This allows for yearly growth
of the data tables as well as for a RAID5 disk configuration.
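The suggested minimum can be sanity-checked with a little shell arithmetic. Note that the 100 GB headroom figure below is our own assumed split of the margin for growth and RAID overhead; the manual itself only states the two 300 GB copies and the 700 GB total.

```shell
#!/bin/sh
# Two copies of the data tables are stated at 300 GB each; the remaining
# ~100 GB of headroom is an assumed allocation, not a figure from the manual.
tables=300      # single copy of the SeqHound data tables, in GB
backup=300      # second copy used for backup and updating, in GB
headroom=100    # assumed margin for yearly growth and RAID overhead, in GB
echo "suggested minimum: $(( tables + backup + headroom )) GB"
```

Running this prints the 700 GB figure quoted above; adjust the headroom to match your own growth expectations.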
We are using the MySQL database storage engine InnoDB, which provides transaction
support and automatic recovery in the event of a database server outage. There is no
need to keep a separate instance of the database when the InnoDB storage engine is
used. To prevent deadlock during data insertion and update, you should not run
SeqHound parsers in parallel against the InnoDB database server. As a result, it takes
up to three extra days for the initial build of the SeqHound database using the InnoDB
storage engine. If you wish to use the MyISAM storage engine, you can run parallel
parsers to speed up the initial build of SeqHound. However, you will need to keep a
separate database instance for database update and backup as the MyISAM storage
engine does not support transactions or automatic recovery.
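As a rough illustration of the kind of server configuration implied above, a my.cnf fragment along these lines makes InnoDB the default storage engine so that newly created tables gain transaction support. This fragment is not shipped with SeqHound; the sizes shown are placeholder assumptions to be tuned to your own hardware and MySQL 4.1 installation.

```ini
# Illustrative my.cnf fragment -- not from the SeqHound distribution.
# Adjust the buffer and log sizes to your own hardware.
[mysqld]
default-storage-engine  = InnoDB
innodb_data_file_path   = ibdata1:10M:autoextend
innodb_buffer_pool_size = 512M
innodb_log_file_size    = 128M
```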
Other Software

Apache Webserver (version 1.3)
    See http://www.apache.org/ for software installation for your platform.
Apache Jakarta Tomcat JSP/Servlet Container (version 4.1)
    See http://jakarta.apache.org/tomcat/ for software installation for your platform.
Perl (version 5.8.3)
    See http://www.cpan.org/ for installation for your platform.
    Required modules include Net/FTP.pm, sun4-solaris-64/DBI.pm
Compiling SeqHound Code yourself.
It is not necessary to compile SeqHound executables yourself; the system may be set up
using the executables provided on the ftp site for selected Operating Systems. However,
if you wish to make use of the local API then you must compile SeqHound yourself.
ODBC compliant database engines
Blueprint uses the ODBC compliant MySQL database engine. We are using version
4.1.10 in production; this version supports nested SQL queries and internationalization.
We have not tested SeqHound on other ODBC compliant RDBMS such as Oracle, DB2
and PostgreSQL.
Library dependencies

    Library                        Source
    NCBI Toolkit                   ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
    NCBI C++ Toolkit (optional*)   ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
    bzip2 Library                  http://sourceforge.net/projects/slritools/
    slri lib                       http://sourceforge.net/projects/slritools/
    slri lib_cxx (optional*)       http://sourceforge.net/projects/slritools/
* This library is only required if you plan to use the SeqHound remote C++ API.
4.3 Obtaining precompiled SeqHound executables.
It is not necessary to compile the SeqHound executables yourself; the system may be set
up using the precompiled executables provided on the ftp site for selected Operating
Systems. If you choose to compile the executables yourself, skip to section 4.3.1.
You will require about 220 MB of disk space to store the SeqHound compiled
executables. These instructions assume you are logged in as user “seqhound” on a UNIX
system running the bash shell and you have perl installed on your system.
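Before downloading, you may want to confirm that the 220 MB is actually available in the target directory. The helper below is a sketch (not part of SeqHound) and assumes a POSIX-compliant df:

```shell
# avail_kb DIR: print the kilobytes available on the filesystem
# holding DIR (-P keeps the output on a single line).
avail_kb() {
  df -kP "$1" | awk 'NR==2 { print $4 }'
}
```

For a 220 MB download you would want `avail_kb /home/seqhound/execs` to report well over 225280 (KB).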
1. Decide the location to install the SeqHound binary executables. For example, if you
want to install in the directory /home/seqhound/execs, do the following:
mkdir execs
cd execs
2. Download the SeqHound installation utility script installseqhound.pl from the FTP
site: ftp.blueprint.org
ftp ftp.blueprint.org
When prompted for a name enter
anonymous
When prompted for a password type your email address:
myemail@home.com
cd pub/SeqHound/script
get installseqhound.pl
Close the ftp session by typing:
bye
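The interactive session above can also be scripted for unattended use. The sketch below assumes the stock ftp client; "myemail@home.com" is a placeholder for your own email address:

```shell
# Write the anonymous FTP command sequence to a file; the download can
# then run without keyboard interaction.
cat > ftp.cmds <<'EOF'
user anonymous myemail@home.com
cd pub/SeqHound/script
get installseqhound.pl
bye
EOF
# The download itself would then be:
#   ftp -n ftp.blueprint.org < ftp.cmds
```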
3. Run the perl script to download and install the SeqHound executables. The script
   downloads the SeqHound binary executables for the specified platform (linux or
   solaris), unpacks the tar ball, modifies the configuration files .odbc.ini and .intrezrc
   (for ODBC database access) and deploys them. It requires two command-line
   arguments: the platform (linux or solaris) and the installation path (e.g.
   /home/seqhound/execs). When prompted by the script, enter the path to the ODBC
   driver (e.g. /usr/lib/libmyodbc3.so; please refer to step 10 in section 4.4 for the
   ODBC driver path), the database server name, port number, user id, password and
   database instance name.
./installseqhound.pl [linux OR solaris] [/home/seqhound/execs]
Upon successful execution of the perl script, you should see the following directories
in the directory execs:
build
config
example
include
lib
sql
test
updates
www
The configuration file .odbc.ini can be found in the home
directory (e.g. /home/seqhound).
4.3.1 Obtaining SeqHound Source Code.
Follow the instructions below to download the SeqHound source code. If you downloaded
and unpacked the executables, you can skip sections 4.3.1 and 4.4 and continue with
section 4.5.
1. In your home directory, make a new directory where you will store the new
SeqHound code.
mkdir compile
Move into this directory and set an environment variable called COMPILE to point to
this directory.
cd compile
export COMPILE=`pwd`
(where (`) is a single back-quote)
2. Download the perl utility seqhoundsrcdownload from the SeqHound ftp site
Note: We no longer support SeqHound download from the Sourceforge
FTP site. Please download SeqHound from
ftp://ftp.blueprint.org/pub/SeqHound/
From the compile directory, type:
ftp ftp.blueprint.org
When prompted for a name enter
anonymous
When prompted for a password type your email address:
myemail@home.com
cd pub/SeqHound/script
get seqhoundsrcdownload.pl
Close the ftp session by typing:
bye
3. Download SeqHound source code by running the perl script seqhoundsrcdownload.pl.
The script will download the source code tar file and unpack the tar file into two
directories slri and bzip2. You will also see a release note file Release_notes_x.x.txt
in the same directory compile.
./seqhoundsrcdownload.pl
4. Set the SLRI environment variable
Move to the slri directory and set the environment variable “SLRI” to point to this
directory.
cd $COMPILE/slri
export SLRI=`pwd`
4.4 Compiling SeqHound executables on Solaris
These instructions describe how to compile SeqHound running on the Solaris platform.
They may be used as a guide for compiling SeqHound code on other platforms.
Instructions are similar for Linux and differences are noted.
Using these instructions
These instructions assume that:
You have downloaded the SeqHound code from the ftp server and you have set
environment variables called COMPILE and SLRI. See section 4.3.1
You are using the bash shell.
Note: On Linux platforms, to compile SeqHound libs with ODBC support you also need
unixODBC-devel package which contains the sql.h + other libs/headers required to
compile SeqHound libs with ODBC support. This is not needed to run SeqHound, just to
compile it.
These instructions were tested on a Sun-Fire-880 architecture running a Sun Solaris OS
(version 9). The system information for the test-box (results of a "uname -a" call)
were:
SunOS machine_name 5.9 Generic_117171-15 sun4u sparc SUNW,Sun-Fire-880
1. Download the NCBI toolkit
SeqHound is dependent on code in the NCBI toolkit.
Move to the compile directory and ftp to the NCBI ftp site:
cd $COMPILE
ftp ftp.ncbi.nlm.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com
cd toolbox/CURRENT
Make a note of the FAQ.html and the readme.htm files.
Change your transfer type to binary and get the zipped directory called ncbi.tar.gz
bin
get ncbi.tar.gz
Close the ftp session by typing:
bye
Uncompress the toolkit.
gunzip ncbi.tar.gz
tar xvf ncbi.tar
2. Edit the platform make file.
Go to the platform directory and locate the file with a “.mk” extension that applies to
your platform. For 64-bit Solaris system the file is “solaris64.ncbi.mk” and in Linux
the file is linux-x86.ncbi.mk.
cd $COMPILE/ncbi
cd platform
In Linux linux-x86.ncbi.mk replace the line /home/coremake/ncbi with
${NCBI}
Use the following line (a Perl command) to replace the string in the Solaris file
/netopt/ncbi_tools/ncbi64/ncbi with the string ${NCBI}
in the solaris64.ncbi.mk file:
perl -p -i.bak -e 's|/netopt/ncbi_tools/ncbi64/ncbi|\${NCBI}|g' solaris64.ncbi.mk
so, for instance, the line
NCBI_INCDIR = /netopt/ncbi_tools/ncbi64/ncbi/include
will become:
NCBI_INCDIR = ${NCBI}/include
You could also edit this file by hand using a text editor if you don't have Perl
installed.
Copy the file up one level to the ncbi directory and rename it “ncbi.mk”
cp solaris64.ncbi.mk ../ncbi.mk
3. Set environment variables in preparation for the toolkit build.
Move back to the ncbi directory and set the environment variable NCBI to point to
that directory
cd $COMPILE/ncbi
export NCBI=`pwd`
check this by typing
echo $NCBI
the value shown will replace ${NCBI} in the “solaris64.ncbi.mk” file that you
modified in the above step when the make file is run.
Note: The make file in the NCBI toolkit will use the C compiler from Sun
instead of the compiler gcc. We do not recommend using gcc as it
generates seqhound parsers that lead to segmentation fault at run time.
Finally, paths to the compiler and the archive executable ar should be added to your
PATH variable:
export PATH=/usr/local/bin:/opt/SUNWspro/prod/bin:/usr/ccs/bin:$PATH
You can check all of your environment variables by typing
set | sort
At this point, the relevant environment variables should be something like this:
COMPILE=/export/home/your_user_name/compile
NCBI=/export/home/your_user_name/compile/ncbi
OSTYPE=solaris2.9
PATH=/opt/SUNWspro/prod/bin:/usr/local/bin:/usr/ccs/bin:/usr/bin:/usr/ucb:/etc:.
If you want, you can read the readme file in the make directory.
cd make
more readme.unx
Note: On the Solaris UNIX OS only, the SeqHound API functions
SHoundGetGenBankff and SHoundGetGenBankffList break
due to a bug in the NCBI library file ncbistr.c (in directories ncbi/corelib
and ncbi/build). To fix the problem, replace all the code inside the
function Nlm_TrimSpacesAroundString() in the file ncbistr.c
with the following text:
    char *ptr, *dst, *revPtr;
    int spaceCounter = 0;

    ptr = dst = revPtr = str;
    if ( !str || str[0] == '\0' )
        return str;
    while ( *revPtr != '\0' )
        if ( *revPtr++ <= ' ' )
            spaceCounter++;
    if ( (revPtr - str) <= spaceCounter )
    {
        *str = '\0';
        return str;
    }
    while ( revPtr > str && *revPtr <= ' ' )
        revPtr--;
    while ( ptr < revPtr && *ptr <= ' ' ) ptr++;
    while ( ptr <= revPtr ) *dst++ = *ptr++;
    *dst = '\0';
    return str;
4. Build the NCBI toolkit
Move back up to the compile directory and run the make command.
cd $COMPILE
./ncbi/make/makedis.csh |& tee out.makedis.txt
Note: to build Solaris 64 bit binaries add the following to the command
line:
SOLARIS_MODE=64 ./ncbi/make/makedis.csh
This runs a c-shell script to make the toolkit and tees the output to the screen and a
log file “out.makedis.txt”. It is safe to ignore the multiple error messages that you
may see.
At the end of a successful build you will see
*********************************************************
*The new binaries are located in ./ncbi/build/ directory*
*********************************************************
The ncbi.tar file can be removed from the “compile” directory after the successful build
process has been completed.
5. Make the bzip2 library
The bzip2 code was downloaded as part of the seqhound code in step 4.3.1 above.
Move to the bzip2 directory and run the make file.
cd $COMPILE/bzip2
make -f make.bzlib
6. Set the BZDIR environment variable.
cd $COMPILE/bzip2
export BZDIR=`pwd`
7. In your home directory, add the following environment parameters to the appropriate
   configuration file such as .bashrc or .bash_profile. Text in italics should be changed
   to the correct path on your machine that points to the directory containing DBI.pm:
export NCBI=$COMPILE/ncbi
export BZDIR=$COMPILE/bzip2
export SLRI=$COMPILE/slri
export VIBLIBS="-L/usr/X11R6/lib -lXm -lXpm -lXmu -lXp -lXt -lX11 -lXext"
export PERL5LIB=/usr/local/lib/perl5/site_perl/5.8.3/sun4-solaris-64
8. Install MySQL server and create database “seqhound”.
SeqHound is built and tested in MySQL version 4.1.10. You can download MySQL
from http://dev.mysql.com/downloads/mysql/4.1.html and follow the manual at
http://dev.mysql.com/doc/mysql/en/index.html to install MySQL on your server. The
data directory that the MySQL server points to should have 700 GB available for a
full SeqHound database. After MySQL is installed, you need to log into MySQL and
create database “seqhound”:
create database seqhound;
Note that ";" must be used at the end of all MySQL statements.
9. Install ODBC driver:
Note that for Linux platforms, the unixODBC package needs to be
installed prior to the ODBC driver otherwise the following error will
occur:
error: Failed dependencies:
libodbcinst.so.1 is needed by MyODBC-3.51.09-1
a) Go to web site: http://dev.mysql.com/doc/connector/odbc/en/faq_2.html
b) Find and download the RPM distribution of the ODBC driver, e.g.
   MyODBC-3.51.07-1.i586.rpm.
c) As user "root", install the driver.
   For a first-time installation:
   rpm -ivh MyODBC-3.51.01.i386-1.rpm
   For an upgrade:
   rpm -Uvh MyODBC-3.51.01.i386-1.rpm
d) The library file libmyodbc3.so will be installed in directory /usr/lib or
   /usr/local/lib.
10. Set up the configuration file for ODBC driver.
Create a configuration file called .odbc.ini in your home directory with the following
content:
[mysqlsh]
Description = MySQL ODBC 3.51 Driver DSN
Trace = Off
TraceFile = stderr
Driver = /usr/lib/libmyodbc3.so
DSN = mysqlsh
SERVER = my_server
PORT = my_port
USER = my_id
PASSWORD = my_pwd
DATABASE = seqhound

The section header [mysqlsh] must be the same as the DSN value and must not be used
for other sections. Text in italics should be changed: the /usr portion of the Driver
value should be changed to the path where unixodbc resides, my_server to the IP
address or the server name of the MySQL server, my_port to the port number of the
MySQL instance, and my_id and my_pwd to your user id and password for the MySQL
database.
Note that the values of the DSN, USER, PASSWORD and DATABASE fields must be
less than 9 characters.
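The 9-character limit is easy to trip over. The helper below is a hypothetical sketch (not part of SeqHound) that flags over-long values in an .odbc.ini file:

```shell
# check_odbc_ini FILE: print any DSN/USER/PASSWORD/DATABASE value
# longer than 8 characters (the values must be less than 9).
check_odbc_ini() {
  awk -F' *= *' '/^(DSN|USER|PASSWORD|DATABASE)/ {
    if (length($2) > 8) print $1 " value too long: " $2
  }' "$1"
}
```

Running `check_odbc_ini ~/.odbc.ini` should print nothing when all four values fit.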
11. Set up ODBC related variables:
export ODBC=path_to_unixodbc
Where path_to_unixodbc should be replaced by the path of the UnixODBC
driver on your machine.
In your home directory, add parameter “LD_LIBRARY_PATH” to the appropriate
configuration file such as .bashrc or .bash_profile:
export LD_LIBRARY_PATH=/usr/local/unixodbc/lib:/usr/local/unixodbc/odbc/lib:/usr/local/mysql/lib/mysql:/usr/local/mysql/lib/mysql/lib
The value of variable “LD_LIBRARY_PATH” should have all the paths that have the
library files libodbc*, libmyodbc*, and libmysqlclient*
12. Build the SeqHound executables
Move to the compile directory and list all the files in the directory:
cd $COMPILE
ls
You should see:
> ls
bzip2
ncbi
slri
out.makedis.txt
Before proceeding you should check your environment variables
set | sort
to ensure that correct paths have been specified for each of the following variables:
NCBI
SLRI
ODBC
BZDIR
Compile the SLRI libraries using the following commands:
cd $SLRI/lib
make -f make.slrilib
make -f make.slrilib odbc
The above commands will build the SLRI libraries needed by SeqHound.
The make files which you are about to invoke call on these variables, therefore the
paths must be correct. Move to the make directory for SeqHound and run the makeall
script. The script requires two command line arguments. The first parameter indicates
what database backend is to be used for the build (currently the only valid target is
odbc). The second parameter indicates what SeqHound programs are to be made (a
choice of all, cgi, domains, examples, genomes, go,
locuslink, parsers, scripts, taxon, updates). The output of the
build script will be captured in the text file out.makeseqhound.txt.
cd $SLRI/seqhound
./makeallsh odbc all 2>&1 | tee out.makeseqhound.txt
It is safe to ignore the multiple warning messages that you may see.
After this has finished running, move to the directory slri/seqhound/build/odbc/
where you will find the executables for SeqHound.
cd build/odbc
ls -1
You will see
> ls -1
addgoid
cbmmdb
chrom
clustmask
clustmasklist
comgen
fastadom
gen2fasta
gen2struc
goparser
goquery
histparser
importtaxdb
isshoundon
llgoa
llparser
llquery
mother
pdbrep
precompute
redund
seqrem
sh_nbhrs
shunittest_odbc_local
shunittest_odbc_rem
shtest
update
vastblst
wwwseekgi
13. Set up the SQL files that create tables.
cd $SLRI/seqhound/sql
In each of the files core.sql, redund.sql, ll.sql, taxdb.sql, gendb.sql,
strucdb.sql, cddb.sql, godb.sql, rps.sql and nbr.sql, there is a line near
the beginning of the file:
#use testsql;
This line should be changed to
use seqhound;
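Rather than editing ten files by hand, the change can be scripted. The helper below is a sketch (GNU or BSD sed assumed; a .bak backup of each file is kept):

```shell
# switch_db FILE...: change the "#use testsql;" line in each given
# .sql file to "use seqhound;", keeping a .bak backup copy.
switch_db() {
  for f in "$@"; do
    sed -i.bak 's/^#use testsql;/use seqhound;/' "$f"
  done
}
```

For example: `switch_db core.sql redund.sql ll.sql taxdb.sql gendb.sql strucdb.sql cddb.sql godb.sql rps.sql nbr.sql`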
4.5 Building the SeqHound system on Solaris
Using these instructions
These instructions show how the SeqHound executables may be used to build the
SeqHound system under a Solaris 8 OS. These instructions may also be used as a guide
for setting up SeqHound under other operating systems. These instructions assume that:
• You have downloaded the latest release version of the SeqHound code (see section
  4.3.1).
• You have successfully installed MySQL.
• You have successfully compiled the SeqHound code yourself (section 4.4) OR you
  have downloaded the SeqHound executables for your platform and operating
  system (section 4.3).
• You have set environment variables called COMPILE and SLRI (see section 4.3.1).
• You have a default install of an Apache server running. See http://www.apache.org/
  for freely available software and instructions for your platform.
• You have installed Perl. See http://www.cpan.org/ for freely available software and
  installation instructions.
• You have at least 300 MB space available in a directory where you can check out
  code and compile it.
• You have at least 600 GB available for the SeqHound executables and data tables.
  See section 4.2.
These instructions were tested on a Sun Ultra machine running the Sun-Solaris 8 OS. The
system information for the test-box (results of a “uname –a” call) were:
SunOS machine_name 5.8 Generic_108528-01 sun4u sparc
SUNW,Ultra-4
These instructions assume that you are using the c shell. Syntax may differ for some
commands in other shells.
Note: These instructions begin with ‘step 14’.
14. Prepare to build the SeqHound database.
Create a new directory where you will set up SeqHound.
mkdir seqhound
Set the environment variable SEQH to point to this directory.
cd seqhound
setenv SEQH `pwd`
Create the new directories:
mkdir 1.core.files
mkdir 2.redund.files
mkdir 3.taxdb.files
mkdir 4.godb.files
mkdir 5.lldb.files
mkdir 6.comgenome.files
mkdir 7.mmdb.files
mkdir 8.hist.files
mkdir 9.neighbours.files
mkdir 10.rpsdb.files
mkdir precompute
The numbered directories will hold parsers and files required for the build of the
SeqHound data tables. Directory “precompute” will hold the precomputed data of the
database.
Move to each of the numbered directories and copy all of the scripts and executables
required for the build.
cd $SEQH/1.core.files
cp $SLRI/seqhound/sql/core.sql .
cp $SLRI/seqhound/scripts/asnftp.pl .
cp $SLRI/seqhound/scripts/seqhound_build.sh .
cp $SLRI/seqhound/build/odbc/mother .
cp $SLRI/seqhound/build/odbc/update .
cp $SLRI/seqhound/config/.intrezrc .
cd $SEQH/2.redund.files
cp $SLRI/seqhound/sql/redund.sql .
cp $SLRI/seqhound/scripts/nrftp.pl .
cp $SLRI/seqhound/build/odbc/redund .

cd $SEQH/3.taxdb.files
cp $SLRI/seqhound/sql/taxdb.sql .
cp $SLRI/seqhound/scripts/taxftp.pl .
cp $SLRI/seqhound/build/odbc/importtaxdb .

cd $SEQH/4.godb.files
cp $SLRI/seqhound/sql/godb.sql .
cp $SLRI/seqhound/scripts/goftp.pl .
cp $SLRI/seqhound/build/odbc/goparser .

cd $SEQH/5.lldb.files
cp $SLRI/seqhound/sql/ll.sql .
cp $SLRI/seqhound/scripts/llftp.pl .
cp $SLRI/seqhound/build/odbc/llparser .
cp $SLRI/seqhound/build/odbc/addgoid .

cd $SEQH/6.comgenome.files
cp $SLRI/seqhound/sql/gendb.sql .
cp $SLRI/seqhound/scripts/genftp.pl .
cp $SLRI/seqhound/scripts/humoasn.pl .
cp $SLRI/seqhound/scripts/humouse_build.sh .
cp $SLRI/seqhound/scripts/comgencron_odbc.pl .
cp $SLRI/seqhound/scripts/shconfig.pm .
cp $SLRI/seqhound/genomes/gen_cxx .
cp $SLRI/seqhound/genomes/pregen.pl .
cp $SLRI/seqhound/genomes/gen.pl .
cp $SLRI/seqhound/genomes/ncbi.bacteria.pl .
cp $SLRI/seqhound/build/odbc/chrom .
cp $SLRI/seqhound/build/odbc/comgen .
cp $SLRI/seqhound/build/odbc/mother .

cd $SEQH/7.mmdb.files
cp $SLRI/seqhound/sql/strucdb.sql .
cp $SLRI/seqhound/scripts/mmdbftp.pl .
cp $SLRI/seqhound/config/.mmdbrc .
cp $SLRI/seqhound/config/.ncbirc .
cp $SLRI/seqhound/build/odbc/cbmmdb .

cd $SEQH/8.hist.files
cp $SLRI/seqhound/build/odbc/histparser .
Open the .intrezrc file with a text editor like pico and edit.
cd $SEQH/1.core.files
pico .intrezrc
An example .intrezrc file follows. Lines preceded by a semi-colon are comments that
explain what the settings are used for and their possible values.
Text in italics must be changed for the .intrezrc file to function correctly with
your SeqHound set-up. Variables username, password, dsn, database in
section [datab] should have the same values as USER, PASSWORD, DSN and
DATABASE respectively in the .odbc.ini file you set up in Step 10 in section 4.4. For
variable path and indexfile in section [precompute], replace the text in
italics with the absolute path of directory “precompute” you just created.
Warning: This file may have wrapped lines. Take care when editing this
file that you do not break any of the lines (i.e. introduce any unwanted
carriage returns).
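A quick sanity check for accidental wraps can be scripted. The helper below is a hypothetical sketch: it simply flags any line that is not a comment, a [section] header, a key = value pair, or a blank line:

```shell
# lint_intrezrc FILE: print, with line numbers, any line that does
# not look like a comment (;...), a [section] header, a key=value
# pair, or a blank line -- a likely sign of a broken line.
lint_intrezrc() {
  grep -n -v -E '^(;|\[|[A-Za-z_]+[[:space:]]*=|[[:space:]]*$)' "$1"
}
```

An empty result from `lint_intrezrc .intrezrc` suggests no lines were broken by editing.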
-------------------------------example .intrezrc begins-------------------------------
[datab]
;seqhound database that you are connecting
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=
[config]
;the executable the cgi runs off of.
CGI=wwwseekgi
[precompute]
;precomputed taxonomy queries
MaxQueries = 100
MaxQueryTime = 10
QueryCount = 50
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index
[sections]
;indicates which modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy
godb = 1
;locus link functional annotations
lldb = 1
;taxonomy hierarchy
taxdb = 1
;protein sequence neighbours
neigdb = 1
;structural databases
strucdb = 1
;complete genomes tracking
gendb = 1
;redundant protein sequences
redundb = 1
;open reading frame database
;currently not exported to outside users of SeqHound
cddb = 0
;RPS-BLAST domains
rpsdb = 1
;DBXref Database Cross_Reference
dbxref = 0
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./
mail=user\@host.org
defaultrelease=141
pathflags=./
-------------------------------example .intrezrc ends----------------------------------
This file should be copied to other directories used during the build process:
cp .intrezrc $SEQH/2.redund.files/.
cp .intrezrc $SEQH/3.taxdb.files/.
cp .intrezrc $SEQH/4.godb.files/.
cp .intrezrc $SEQH/5.lldb.files/.
cp .intrezrc $SEQH/6.comgenome.files/.
cp .intrezrc $SEQH/7.mmdb.files/.
cp .intrezrc $SEQH/8.hist.files/.
cp .intrezrc $SEQH/9.neighbours.files/.
cp .intrezrc $SEQH/10.rpsdb.files/.
15. Build the core module of SeqHound.
Building the core module (essentially all of the sequence data tables) is not optional.
The remaining modules are optional; skipping one spares resources and administrative
effort, but the corresponding API functionality will not be present.
cd $SEQH/1.core.files
Create the core tables in the database
Make sure file core.sql has the line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < core.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates core tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi,
bioentity, bioname, secondrefs, bioentitytype, nametype, rules, fieldtype and histdb.
If you are building a full-instance of the SeqHound database then run the asnftp.pl
script while in the build directory:
./asnftp.pl
Note that any command in these instructions can be run as a ‘nohup’ to
prevent the process from ending if your connection to the machine should
be lost. For example:
nohup ./asnftp.pl &
If you only want to build a small test version of the database then manually download
a single file. For example:
ftp ftp.ncbi.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com
cd refseq/cumulative
bin
get rscu.bna.Z (do not uncompress this file)
bye
The asnftp.pl script downloads all of the GenBank sequence records (in binary ASN.1
format) required to make an initial build of the SeqHound core module. This script
will take approximately 24 hours to run and will consume 14 GB of disk space.
Note that all scripts are described in detail in section 5.
Two other files are generated by this script:
asn.list is a list of the sequence files that the script intends to download.
asnftp.log is where the script logs error messages during execution time.
If you open another session with the machine where you are building SeqHound, you
can check how far along asnftp.pl is by comparing the number of lines in the asn.list
file
grep ".aso.gz" asn.list | wc -l
to the number of lines in the build directory (number of files actually downloaded so
far)
ls *.aso.gz | wc -l
Once asnftp has finished, these two numbers should be the same.
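The comparison above can be wrapped in a small helper. This is only a sketch (not part of the SeqHound distribution), to be run from the build directory while asnftp.pl is working:

```shell
# check_download: report how many of the files named in asn.list
# have actually arrived in the current directory so far.
check_download() {
  expected=$(grep -c "\.aso\.gz" asn.list)
  actual=$(ls *.aso.gz 2>/dev/null | wc -l)
  echo "$actual of $expected files downloaded"
}
```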
Run the seqhound build script. Before running this script, make certain that the
.intrezrc file, in the same directory, and .odbc.ini, in your home directory, have
correct configuration values. (see steps 10 in section 4.4 and step 14 in the current
section). This parser MUST be given a single parameter that represents the release
version of GenBank. You can find the release number in the file:
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release.
./seqhound_build.sh 141
seqhound_build.sh executes the mother parser over all source files and populates
tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi, bioentity, bioname,
secondrefs, bioentitytype, nametype, rules and fieldtype. This will take about 75
hours. Table histdb is still empty at this stage. It is populated in Step 25.
Parser mother creates a log file for every *.aso file that it parses. These log files are
located in a subdirectory called “logs” and are named “rsnc0506run” where
“rsnc0506” is the name of the file that was being processed.
While seqhound_build.sh is running, you can move on to steps 16-18.
Once seqhound_build.sh has finished you can test that all of the files were properly
processed by showing that the results of
cd logs
grep "Done" *run | wc -l
is the same as
ls *run | wc -l
is the same as
cd ..
ls *aso.gz | wc -l
The seqhound_build.sh script unzips .aso.gz files before feeding them as input to the
mother program. seqhound_build.sh then rezips the file after mother is done with it.
If for some reason, the build should crash part way through, you have to
a) recreate core tables using core.sql (see above) and
b) search for any unzipped (*.aso files) in the build directory and rezip them
c) restart seqhound_build.sh.
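Step (b) can be done with a short loop. The helper below is a sketch; run it in the build directory before restarting seqhound_build.sh:

```shell
# rezip_aso: rezip any .aso files that a crashed build left unpacked
# in the current directory.
rezip_aso() {
  for f in *.aso; do
    [ -e "$f" ] || continue   # glob did not match: nothing to rezip
    gzip "$f"
  done
}
```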
Once the seqhound_build.sh script has finished, you should move all of the *.aso.gz
files into a directory where they will be out of the way:
mkdir asofiles
mv *.aso.gz asofiles/.
16. Build the redundb module.
cd $SEQH/2.redund.files
Create table redund in the database.
Make sure file redund.sql has the line use seqhound close to the beginning of the
file.
mysql -u my_id -p -P my_port -h my_server < redund.sql
Where my_id, my_port and my_server should be replaced by your userid
for the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates table redund in the database.
Run the nrftp.pl script to download the FASTA nr database of proteins
(ftp://ftp.ncbi.nlm.nih.gov/blast/db).
./nrftp.pl
nrftp.pl generates a log file “nrftp.log” that informs you what happened. If everything
went ok, the last two lines should read:
Getting nr.gz
closing connection
A new file should appear in the build directory called "nr.gz". You will have to
unpack this file by typing:
gunzip nr.gz
Run the redund parser to make the redund table of identical protein sequences.
Before running this script, make certain that the .intrezrc file in the same directory
and .odbc.ini in your home directory have correct configuration values (see step 10 in
section 4.4 and step 14 in the current section).
./redund -i nr -n F
redund generates the log file “redundlog”. If everything went ok, the only line in this
file should be:
NOTE: [000.000] {redund.c, line 259} Done.
And about 3 million records will be inserted into table redund.
17. Build the taxdb module
Create tables of the taxdb module in the database.
cd $SEQH/3.taxdb.files
Make sure file taxdb.sql has the line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < taxdb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables taxdb, gcodedb, divdb, del, merge in the database.
Run the taxftp.pl script to download taxonomy info from the NCBI
(ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).
./taxftp.pl
taxftp.pl generates a log file taxftp.log that informs you what happened. If everything
went ok, the last two lines should read:
Getting taxdump.tar.gz
closing connection
A new file should appear in the build directory called taxdump.tar.gz. You will have
to unpack this file by typing:
gzip -d taxdump.tar.gz
tar -xvf taxdump.tar
There will be seven new files:
delnodes.dmp
division.dmp
gc.prt
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
Run the importtaxdb parser to make the taxonomy data tables. Taxdump must be in
the same directory as this parser.
./importtaxdb
importtaxdb has no command line parameters. importtaxdb generates the log file
importtaxdb_log.txt. If everything went ok, the output of this file should be
something like:
Program start at Thu Sep 4 13:47:51 2003
Number of Tax ID records parsed: 191647
Number of Tax ID Name records parsed: 246263
Number of Division records parsed: 11
Number of Genetic Code records parsed: 18
Number of Deleted Node records parsed: 25475
Number of Merged Node records parsed: 4607
Program end at Thu Aug 12 13:49:43 2004
And records will be inserted into tables taxdb, gcodedb, divdb, del and merge.
18. Build the GODB module
Create tables of the godb module in the database.
cd $SEQH/4.godb.files
Make sure file godb.sql has the line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < godb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables go_parent, go_name, go_reference, go_synonym in the database.
Run the goftp.pl script to download the gene ontology files
(ftp://ftp.geneontology.org/pub/go/gene-associations and
ftp://ftp.geneontology.org/pub/go/ontology).
./goftp.pl
There is a log file for this script called goftp.log that indicates that it got all of these
files. Three new files should appear in the build directory called
component.ontology
function.ontology
process.ontology
Two other files also appear called
gene_association.Compugen.GenBank.gz
gene_association.Compugen.UniProt.gz
but these are used as input files by addgoid in the next step.
Run the goparser to make the hierarchical gene ontology data tables. The three input
files must be in the same directory as this parser.
./goparser
goparser has no command line parameters. goparser generates the log file
goparserlog. If everything went ok, the output of this file should have only one
NOTE line:
NOTE: [000.000] {goparser.c, line 101} Main: Done!
And records will be inserted into tables go_parent, go_name, go_reference,
go_synonym.
19. Build the LLDB module
Create tables of the locus link module in the database.
cd $SEQH/5.lldb.files
Make sure file ll.sql has line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < ll.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables ll_omim, ll_go, ll_llink, ll_cdd in the database.
Run the llftp.pl script to download the locus link template file (LL_tmpl) which is the
source for function annotation tables
(ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz).
llftp.pl
This script generates the llftp.log file. If everything executes correctly, the last two
lines of the file should read:
Getting LL_tmpl.gz
closing connection
A new file called LL_tmpl.gz should appear in the build directory; unpack it with:
gzip -d LL_tmpl.gz
Run the llparser to create the set of functional annotation data tables. The input file
must be in the same directory as this parser.
./llparser
llparser has no command line parameters. llparser generates the log file
“llparserlog”. At the time of writing, the output of this file will have thousands of
lines like:
NOTE: [000.000] {ll_cb.c, line 654} LL_AppendRecord: No
NP id. Record skipped.
(these lines are expected since many LocusLink records are not linked to specific
sequence records)
followed by the last line of the file:
NOTE: [000.000] {llparser.c, line 90} Main: Done!
Records will be inserted into tables ll_omim, ll_go, ll_llink and ll_cdd. Run the
addgoid parser to populate the GO annotation table. This parser uses the input files
that were downloaded in the GODB build step above. Copy those files to this directory:
cp ../4.godb.files/gene_association.Compugen.GenBank.gz ./
cp ../4.godb.files/gene_association.Compugen.UniProt.gz ./
The files need to be unpacked.
gunzip gene_association.Compugen.GenBank.gz
gunzip gene_association.Compugen.UniProt.gz
The input files must be in the same directory as addgoid
./addgoid -i gene_association.Compugen.GenBank
After this parser has finished, use it to parse the other input file:
./addgoid -i gene_association.Compugen.UniProt
At the time of writing, this second input file is not parsed since cross references
between Swissprot and GenBank ids are not available. This is being corrected by the
dbxref module project.
addgoid MUST BE EXECUTED AFTER ALL CORE TABLES AND
LLDB TABLES HAVE BEEN BUILT; the llparser creates the ll_go table
into which the addgoid parser writes. This program also depends on tables
asndb, parti, accdb and nucprot.
addgoid generates the log file addgoidlog. The output of this file will look like:
=========[ Sep 5, 2003 10:28 AM ]========================
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
This is normal. These errors are caused by the inability to find GI’s for names of
proteins/loci that are annotated in the GO input file. This problem is being addressed
by the dbxref module.
This program writes to the existing ll_go table that was generated by llparser.
20. Build the GENDB module
Change directories to the Complete Genomes directory (comgenomes).
cd $SEQH/6.comgenomes.files
Create tables of the GENDB module in the database.
Make sure file gendb.sql has line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < gendb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates table chrom in the database.
Building the GENDB module involves several steps. To simplify the process, a perl
script, comgencron_odbc.pl, groups together all of the necessary scripts and binaries for
each individual step. These scripts and binaries must be present in this directory.
They are:
comgencron_odbc.pl
shconfig.pm
gen_cxx
pregen.pl
gen.pl
ncbi.bacteria.pl
genftp.pl
humoasn.pl
chrom
iterateparti
humouse_build.sh
mother
comgen
Before building the GENDB module, the [crons] section in configuration file
.intrezrc should be set up properly. It should look like the following. Text in
italics must be changed. Variable mail should have the e-mail address where
you want the message to be sent to. Variable defaultrelease should have the
release number of the GenBank files you use to build the core tables of SeqHound
database (see Step 15):
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./genfiles/
mail=your_email_addr
defaultrelease=141
pathflags=./flag/
Make a subdirectory flag where the flag file comgen_complete.flg will be saved.
mkdir flag
Run the script to build the GENDB module:
./comgencron_odbc.pl
comgencron_odbc.pl generates the flat file genff; the log files bacteria.log, chromlog,
comgenlog, gen.log and iteratepartilog; a subdirectory genfiles; and many log files
with the suffix run, which are moved to a subdirectory logs. It also downloads many
.asn files, which are moved to the subdirectory genfiles. During the process, a
temporary file comff and a directory asn are created; they are deleted before the end
of the build process. If the build process fails in the middle, remove them manually,
along with the file genff.
There are several lines printed on the screen during the build like:
mail = your_email_addr
pathupdates = ./
pathinputfilescomgen = ./genfiles/
defaultrelease = 141
pathflags = ./flag/
No source or subsource Plasmodium falciparum NC_03043.
Update 1 chromosome type by hand.
It is OK to see the above lines.
An e-mail will be sent to the address you provide to inform if the process succeeds or
fails. If everything went ok, you will see the last line in file comgenlog as:
NOTE: [000.000] {comgen.c, line 504} Main: Done.
The last line in file iteratepartilog as:
NOTE: [000.000] {iterateparti.c, line 170} Done.
The last line in file chromlog as:
NOTE: [000.000] {chrom.c, line 173} Done.
The last two lines in file bacteria.log as:
deleteing asn
See bacteria.results for changes to ./genff
The last two lines in file gen.log as:
Removing asn
Deleting comff
The following is a detailed explanation of the script comgencron_odbc.pl. You may skip
it.
21. Generate flat file genff.
genff is a tab-delimited text file where each line in this file represents one "DNA unit"
(chromosome, plasmid, extrachromosomal element etc.) belonging to a complete
genome.
Column  Description
1       Taxonomy identifier for the genome
2       Unique integer identifier for a given chromosome
3       Type of molecule (1 for chromosome, 8 for plasmid, …)
4       FTP file name for the genome (without the .asn extension)
5       Full name of the organism
Here is an example of several rows from genff:
305     286  8  NC_003296  Ralstonia solanacearum plasmid pGMI1000MP
258594  287  1  NC_005296  Rhodopseudomonas palustris CGA009 chromosome
781     288  1  NC_003103  Rickettsia conorii chromosome
782     289  1  NC_000963  Rickettsia prowazekii chromosome
90370   290  1  NC_003198  Salmonella typhi chromosome
90370   291  8  NC_003384  Salmonella typhi plasmid pHCM1
90370   292  8  NC_003385  Salmonella typhi plasmid pHCM2
209261  293  1  NC_004631  Salmonella typhi Ty2 chromosome
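Since every downstream parser assumes five tab-delimited columns per genff line, a quick awk check can catch malformed rows. This is a sketch, not part of the SeqHound distribution; genff.sample is a stand-in for your real genff file:

```shell
# Build a tiny stand-in genff (point awk at your real genff instead).
printf '305\t286\t8\tNC_003296\tRalstonia solanacearum plasmid pGMI1000MP\n' > genff.sample
printf '781\t288\t1\tNC_003103\tRickettsia conorii chromosome\n' >> genff.sample

# Count lines that do not have exactly 5 tab-delimited fields.
awk -F'\t' 'NF != 5 { bad++ } END { printf "%d malformed lines\n", bad }' genff.sample
```

A non-zero count usually means a name field contains an embedded tab or a row was truncated during download.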
The genff flat file is generated in two steps.
a) gen.pl which will CREATE genff using the eukaryotic complete genomes.
b) ncbi.bacteria.pl which will UPDATE genff with bacteria complete genomes.
* Both gen.pl and ncbi.bacteria.pl depend on pregen.pl, so pregen.pl must be in the
same directory as gen.pl and ncbi.bacteria.pl when you run them.
gen.pl will backup the current (if it exists) genff as genff.backup and then create a new
genff file. gen.pl will download asn files from NCBI’s ftp site and then extract the
relevant fields (as described above) and store them as records in genff.
The data for complete bacterial genomes is written to genff by running ncbi.bacteria.pl.
This perl utility will compare the data in genff to the contents of the
/genomes/bacteria directory in NCBI’s ftp site and then automatically update genff.
ncbi.bacteria.pl will save the names of the bacteria that have been newly added to
genff in a separate file called bacteria.results. You can use this file to quickly verify
the results.
A sample output of bacteria.results:
***********PERFECT MATCH***********
Aeropyrum pernix
***********SEMI MATCHED NCBI BACTERIA*************
NCBI BACTERIA            CHROMFF
----------------------------------------
Buchnera aphidicola      Buchnera sp
Buchnera aphidicola Sg   Buchnera sp
***********UNMATCHED NCBI BACTERIA*************
Agrobacterium tumefaciens C58 Cereon
Agrobacterium tumefaciens C58 UWash
Perfectly matched bacteria are already present in genff. Semi matched bacteria means
that an organism closely related to a new organism is already present. In the above
example, Buchnera aphidicola Sg and Buchnera aphidicola were newly released and are
closely related to Buchnera sp. The newly released data will have been added to
genff. Unmatched bacteria are completely new organisms and will be added to genff.
Both gen.pl and ncbi.bacteria.pl will create an intermediate file called comff, and a
temporary directory asn. These are temporary and are critical to the functionality of
the perl scripts. Both gen.pl and ncbi.bacteria.pl will delete comff and asn after
execution.
While running gen.pl and ncbi.bacteria.pl you may see the following on the screen.
No source or subsource Plasmodium falciparum NC_03043.
Update 1 chromosome type by hand.
It means that for the specified organism, the asn file is missing the chromosome type.
In such a scenario, the chromosome type will default to 1 (chromosome).
Once you have generated the genff file, you will likely need to regenerate it
periodically, since some of the data in genff may change; for example, if an
organism's taxid changes it is crucial to rerun gen.pl.
Script genftp.pl downloads complete genome files from
ftp://ftp.ncbi.nih.gov/genomes/*.
A script called humoasn.pl must be in the same directory as genftp.pl since genftp.pl
calls the script.
humoasn.pl is a misnomer because the script actually processes files for
human, mouse AND rat genomes.
Each of these genomes has two files called rna.asn and protein.asn (the files are
named the same regardless of the organism they refer to; the only way to tell which
organism a file refers to is by the name of the directory it came from or by looking
at its contents). genftp.pl renames the rna.asn and protein.asn files to more
specific names so they can be processed with the humoasn.pl script.
rna.asn and protein.asn files mostly contain XM’s and XP’s sequences: see for
example genomes/H_sapiens/protein. The sequences in these files are “loose”
bioseqs that have to be “stitched” together into bioseq sets by humoasn.pl. This
allows these sequences to be processed by the mother parser in the next step.
Many new *.asn files will appear in the comgenomes directory after this is run. There
is no log file for this script.
a) Populate table chrom
Binary chrom is used to populate table chrom from the list of complete genomes
found in genff. Chrom generates the log file “chromlog”. This log will look
something like:
============[ Sep 5, 2003 2:30 PM ]====================
NOTE: [000.000] {chrom.c, line 130} Assigned TaxId 56636.
NOTE: [000.000] {chrom.c, line 137} Assigned Kloodge 1.
NOTE: [000.000] {chrom.c, line 144} Assigned Chromfl 1.
NOTE: [000.000] {chrom.c, line 149} Assigned Access NC_000854
NOTE: [000.000] {chrom.c, line 152} Assigned Name Aeropyrum pernix.
…
NOTE: [000.000] {chrom.c, line 167} Done.
b) Delete all records from division gbchm from the tables of the core module.
This step is carried out for data integrity purposes. All the records that are inserted
into the core module tables are labeled as belonging to division gbchm. Before they
are inserted, we must ensure that no such records already exist in the database. This
is accomplished using the binary iterateparti, which takes the division name as a
parameter and deletes all GI’s that are part of that division from all of the tables in
the core module.
c) Set kloodge to 0 in table taxgi
This step is also carried out for data integrity purposes. The field “kloodge” in
table taxgi for all records should be set to 0 before they are updated in a later step
by binary comgen.
d) Move all Apis mellifera related files to a subdirectory.
The chromosome, rna and protein files of Apis mellifera are not processed at the
time of writing. They are moved to a subdirectory.
e) Add records to the core module tables.
Since the human, mouse and rat sequences from this source (the “Complete
Genomes” directory) are not a part of the GenBank release, the records are added
to the core module tables by the script humouse_build.sh.
This script feeds all chromosome, rna and protein files downloaded by genftp.pl to
the mother parser. The mother parser makes a new division called “gbchm”
(GenBank Chromosome Human and Mouse) and touches all core module tables.
Log files will be created by mother for every chromosome file processed (called
*run).
f) Update field kloodge in table taxgi and field name in table accdb
Parser comgen is used to label sequences as belonging to a complete genome.
This program uses the files downloaded by genftp.pl and marks the complete
genomes in table taxgi. This program also adds loci names into table accdb (if
they are not present). comgen is dependent on the chrom table and writes to
accdb and taxgi. The comgen program has to be executed after all databases are
built.
Comgen writes to the log file comgenlog in the same directory where it is run.
22. Build the Strucdb module
Change to the mmdbdata directory.
cd $SEQH/7.mmdb.files
Create tables of the Strucdb module in the database.
Make sure file strucdb.sql has line use seqhound close to the beginning of the file.
mysql -u my_id -p -P my_port -h my_server < strucdb.sql
Where my_id, my_port and my_server should be replaced by your userid for
the database, the port of the database and the IP address or the server name of the
database server respectively. You will be prompted to enter your password.
This creates tables mmdb, mmgi and domdb in the database.
Make certain that the configuration files have been properly set up. These include:
.mmdbrc, .ncbirc and .intrezrc.
In file .mmdbrc, variable “Gunzip” should have a value which is the path of
gunzip on the machine (change text in italics). File .mmdbrc looks like:
[MMDB]
;Database and Index required when local MMDB database is used
Database = ./
Index = mmdb.idx
Gunzip = /bin/gunzip
; [VAST]
;Database required for local VAST fetches.
; Database = .
In file .ncbirc, variable DATA should have a value which is the path of directory
ncbi/data on your machine. File .ncbirc looks like (change text in italics):
[NCBI]
ROOT=/
DATA=/my_home/compile/ncbi/data/
Copy file bstdt.val from the ncbi/data directory:
cp ~/compile/ncbi/data/bstdt.val ./
Run the mmdbftp.pl script to download the mmdb (Molecular Model Database)
ASN.1 files from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata. This will take
approximately 10 hours.
./mmdbftp.pl
This script writes to the mmdb.log file and records the files downloaded.
Approximately 20000 *.val.gz files will appear in the mmdbdata directory after
running this. The first line of the mmdb.idx index file states the number of files
that should have been downloaded.
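The download can be verified against the index with a short sketch. This assumes, as described above, that the first line of mmdb.idx holds the expected file count; the tiny index written here is a stand-in so the demo is self-contained (in the real mmdbdata directory, skip the printf line):

```shell
# Stand-in index: first line is the expected count (remove for real use).
printf '3\n1ABC\n1DEF\n1GHI\n' > mmdb.idx

# Compare the promised count with the *.val.gz files actually on disk.
expected=$(head -1 mmdb.idx)
actual=$(ls *.val.gz 2>/dev/null | wc -l)
echo "expected $expected files, downloaded $actual"
```

If the two numbers differ, rerun mmdbftp.pl; it records the files it fetched in mmdb.log.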
Run the cbmmdb parser to make the MMDB and MMGI datafiles. Use:
./cbmmdb -n F -m F
This program takes about 12 hours to run and writes errors to the cbmmdblog file.
After a typical run this file will contain:
============[ Nov 3, 2003 1:21 AM ]======================
ERROR: [004.001] {cbmmdb.c, line 125} Error opening MMDB id 22339
WARNING: [011.001] {cbmmdb.c, line 240} Total elapsed time: 41857 seconds
NOTE: [000.000] {cbmmdb.c, line 245} Main: Done!
And records are inserted into tables mmdb and mmgi.
Run the vastblst parser to make the DOMDB datafile.
./vastblst -n F
This program writes errors to the vastblstlog file. After a typical run this file will
contain no messages and records are inserted into table domdb
In addition, vastblst makes a FASTA datafile of domains called mmdbdom.fas in the
directory where it is run.
Download the most recent nrpdb.* file from the NCBI ftp site
(ftp://ftp.ncbi.nih.gov/mmdb/nrtable/).
Run the pdbrep parser to label representatives of nr chain sets in the domdb datatable.
This parser writes to the domdb table. Use:
uncompress nrpdb*.Z
./pdbrep -i nrpdb.*
Where nrpdb.* is the name of the input file set. pdbrep will write errors to the
pdbreplog file in the same directory where it is run.
23. Build the Neighdb module
The sequence neighbours tables can be downloaded from
ftp://ftp.blueprint.org/pub/SeqHound/NBLAST/ as MySql database table files, as well
as mysqldump output, which should be adaptable to most SQL database systems. See
the readme on the ftp site for information on these files. To incorporate the mysql
database table files into your instance of seqhound, simply copy the files extracted
from the nblastdb and blastdb archives, downloaded from the ftp site, into your
seqhound database directory in your mysql instance. To incorporate the mysql dumps
of these tables into your seqhound instance, you need only pipe the contents of the
dump (which are SQL statements) to your database server. In the case of mysql,
simply execute:
gunzip -c seqhound.blastdb.SQLdump.YYYYMMDD.gz | mysql seqhound
gunzip -c seqhound.nblastdb.SQLdump.YYYYMMDD.gz | mysql seqhound
Be sure to fill in any required mysql options, such as username, hostname and
port number.
24. Build the Rpsdb and Domname modules
The pre-computed rps-blast table and the domname table can be downloaded from
ftp://ftp.blueprint.org/pub/SeqHound/RPS/ as MySQL database table files, as well as
mysqldump output, which should be adaptable to most SQL database systems. To
incorporate the mysql database table files into your instance of seqhound, simply
copy the files extracted from the rpsdb and domname archive, downloaded from the
ftp site, into your seqhound database directory in your mysql instance. To
incorporate the mysql dumps of these tables into your seqhound instance, you need
only pipe the contents of the dump (which are SQL statements) to your database
server. In the case of mysql, simply execute:
gunzip -c seqhound.rpsdb.SQLdump.YYYYMMDD.gz | mysql seqhound
gunzip -c seqhound.domname.SQLdump.YYYYMMDD.gz | mysql seqhound
Be sure to fill in any required mysql options, such as username, hostname and port
number.
25. Build the histdb table.
cd $SEQH/8.hist.files
./histparser -n F
This parser populates table histdb. For each sequence with a valid accession in table
accdb, an entry is generated indicating that the sequence was added on the day you
ran histparser. This parser writes to the histparserlog, requires the accdb table and
will take about 15 hours to run.
26. You are done with the initial build of SeqHound.
If you did not build any of the optional modules, you will have to
remember this when setting up the .intrezrc configuration file for any
SeqHound application.
Set module values to zero if you did not build them. See the following section of the
.intrezrc configuration file.
example:
[sections]
;indicate what modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy (did you run goparser?)
godb = 1
;locus link functional annotations (did you run llparser and addgoid?)
lldb = 1
;taxonomy hierarchy (did you run importtaxdb?)
taxdb = 1
;protein sequence neighbours (did you download neighbours tables?)
neigdb = 1
;structural databases (did you run cbmmdb, vastblst and pdbrep?)
strucdb = 1
;complete genomes tracking (did you run chrom and comgen?)
gendb = 1
;redundant protein sequences (did you run redund?)
redundb = 1
;open reading frame database (currently not exported at all)
cddb = 0
;RPS-BLAST tables (did you download RPS-BLAST tables?)
rpsdb = 1
Catch up on SeqHound daily updates
27. Download all daily update files for genbank
Warning: There might have been a new GenBank release while you were
building SeqHound, in which case you cannot get updates from
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/ any more. You would have to
rebuild SeqHound with a fresh GenBank release. Check the file
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release to make certain that it
contains the same release number that was present when you started step
15.
cd $SEQH/
mkdir seqsync
cd seqsync
ftp ftp.ncbi.nih.gov
When prompted for a name enter anonymous
When prompted for a password type myemail@home.com
cd ncbi-asn1
cd daily-nc
bin
prompt
mget nc*.aso.gz
bye
Do not download the con_nc*.aso.gz files from this directory. SeqHound does not
use them.
28. Download all daily update files for refseq.
From ftp://ftp.ncbi.nih.gov/refseq/daily/ download all files past the date stamp on
gbrscu.aso.gz. gbrscu.aso.gz is the latest cumulative RefSeq division which was
downloaded by asnftp.pl and is located (in this example) in seqhound/build/asofiles.
cd $SEQH/seqsync
ftp ftp.ncbi.nih.gov
enter anonymous and your email address when prompted
cd refseq
cd daily
bin
get rsnc.****.2003.bna.Z
(where **** are files with timestamps greater than gbrscu.aso.gz)
bye
You must uncompress all of these files and rezip them so they can be processed by
the mother parser.
compress -d *.Z
gzip *.bna
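The uncompress-and-rezip step can be sketched as a small loop. This is a demo, not part of the SeqHound distribution: demo.bna stands in for a real rsnc.*.bna file so the example is self-contained, and on real data you would run compress -d on the .Z files first:

```shell
# demo.bna stands in for a decompressed rsnc.*.bna file (remove for real use).
printf 'demo record\n' > demo.bna

# mother expects gzipped input, so re-zip every decompressed .bna file.
for f in *.bna; do
    [ -e "$f" ] && gzip -f "$f"    # produces demo.bna.gz
done
ls demo.bna.gz
```

The -f flag makes gzip overwrite any stale .gz file left over from an earlier attempt.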
29. Run update and mother on all downloaded files (excluding today's one; crons will do
it in the evening).
You can use the scripts all_update.sh and all_update_rs.sh. You will also need
mother, update and a properly configured .intrezrc file in the same directory as all of
the daily update files.
cd $SEQH/seqsync
cp $COMPILE/slri/seqhound/scripts/all_update.sh .
cp $COMPILE/slri/seqhound/scripts/all_update_rs.sh .
cp $SEQH/1.core.files/.intrezrc .
cp $SEQH/1.core.files/mother .
cp $SEQH/1.core.files/update .
Run all_update.sh first
./all_update.sh 141
where 141 is the release number.
Run all_update_rs.sh second.
./all_update_rs.sh 141
These scripts will run update and mother executables (consecutively) on all
downloaded files present in the current directory.
All daily updates in SeqHound are stored in one division called gbupd regardless of
how long SeqHound runs without a core rebuild.
mother will make a log file called “*run” for every file that it processes.
update will make two log files called “*gis” and “*log” for every file that it processes.
You can check that the two parsers have completed successfully. Each of the
following queries should return the same number (the number of starting input files):
ls *aso.gz | wc -l
ls *gis | wc -l
ls nc*log | wc -l
ls nc*run | wc -l
grep Done nc*run | wc -l
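The five checks above can be wrapped into one sketch script. The touch/printf lines fabricate a single processed input set so the demo is self-contained; in the real seqsync directory, drop them and just run the comparison:

```shell
# Fabricate one processed input set (remove these two lines for real use).
touch nc0101.aso.gz nc0101.gis nc0101.log
printf 'Main: Done!\n' > nc0101.run

# Every input file should have matching gis, log and run files,
# and every run file should contain "Done".
aso=$(ls *aso.gz 2>/dev/null | wc -l)
gis=$(ls *gis 2>/dev/null | wc -l)
logs=$(ls nc*log 2>/dev/null | wc -l)
runs=$(ls nc*run 2>/dev/null | wc -l)
done_count=$(grep -l Done nc*run 2>/dev/null | wc -l)
if [ "$aso" -eq "$gis" ] && [ "$aso" -eq "$logs" ] && \
   [ "$aso" -eq "$runs" ] && [ "$aso" -eq "$done_count" ]; then
    echo "all $aso update file sets processed"
else
    echo "mismatch: aso=$aso gis=$gis log=$logs run=$runs done=$done_count"
fi
```

A mismatch usually means one of the downloaded files failed partway through update or mother and should be rerun by hand.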
Setting up daily sequence updates
30. Make a new directory from where you will run daily sequence updates.
Populate this with the necessary scripts and programs.
cd $SEQH
mkdir updates
cd updates
cp $SLRI/seqhound/scripts/*cron_odbc.pl .
cp $SLRI/seqhound/scripts/shconfig.pm .
cp $SLRI/seqhound/build/odbc/redund .
cp $SLRI/seqhound/build/odbc/mother .
cp $SLRI/seqhound/build/odbc/update .
cp $SLRI/seqhound/build/odbc/precompute .
cp $SLRI/seqhound/build/odbc/isshoundon .
cp $SLRI/seqhound/build/odbc/importtaxdb .
cp $SLRI/seqhound/build/odbc/goparser .
cp $SLRI/seqhound/build/odbc/llparser .
cp $SLRI/seqhound/build/odbc/addgoid .
cp $SLRI/seqhound/build/odbc/comgen .
cp $SLRI/seqhound/build/odbc/chrom .
cp $SLRI/seqhound/scripts/genftp.pl .
cp $SLRI/seqhound/scripts/humoasn.pl .
cp $SLRI/seqhound/scripts/humouse_build.sh .
cp $SLRI/seqhound/genomes/gen_cxx .
cp $SLRI/seqhound/genomes/pregen.pl .
cp $SLRI/seqhound/genomes/gen.pl .
cp $SLRI/seqhound/genomes/ncbi.bacteria.pl .
mkdir logs
mkdir asofiles
mkdir inputfiles
mkdir genfiles
mkdir flags
31. Copy the .intrezrc config file to the updates directory and edit it.
cd $SEQH/updates
cp $SLRI/seqhound/config/.intrezrc .
cp $SEQH/1.core.files/.intrezrc .
Text in italics must be changed. In the [crons] section, variable pathupdates
points to the path where the update jobs will be set up; variable pathinputfiles
points to the path that holds the input files (other than the *.aso.gz and *.bna.gz
files of the core module and the *.asn files of the gendb module); variable
pathinputfilescomgen points to the path that holds the *.asn input files for the
gendb module; variable mail is your e-mail address; variable defaultrelease
is the GenBank release you built the SeqHound database with; variable pathflags
points to the path that holds the flag files generated by each update job.
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./inputfiles/
pathinputfilescomgen=./genfiles/
mail=my_email
defaultrelease=141
pathflags=./flags/
The cron daemon may consider your home directory to be the “current directory”.
For this reason, the .intrezrc file should be copied to your home directory too.
cd $SEQH/updates
cp .intrezrc ~/.
32. Set up the dupdcron_odbc.pl cron job.
dupdcron_odbc.pl (daily update cron) is a PERL script that retrieves the latest
GenBank and RefSeq update files from the NCBI ftp site and then passes them to
“update” and “mother” where they are used to update the SeqHound data tables.
Specifically, it
a) downloads update files with today's date (nc*.aso.gz from
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/ and rsnc*.bna.Z from
ftp://ftp.ncbi.nih.gov/refseq/daily/)
b) runs update
(update -i nc*.aso.gz)
and then
c) runs mother
(mother -i nc*.aso.gz -r version# -n F -m F -u T).
You need to know this because if you miss a few updates before setting up
the cron job (and after completing the seqsync steps above) you will have to
run update and mother by hand using the above commands.
All scripts (like dupdcron_odbc.pl) report success or failure via email. The mailto
address is set in the shconfig.pm script which you have just edited.
dupdcron_odbc.pl is the first cron job that has to be set up. Make a new text file
called list_crontabs where you will list the cron jobs.
cd $SEQH/updates
pico list_crontabs
This file should have the single line
30 22 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./dupdcron_odbc.pl
where libpath should be replaced by the correct path you set up in Step 11 for
environment variable LD_LIBRARY_PATH. You can find it out by:
echo $LD_LIBRARY_PATH
This line specifies the time to run a job on a recurring basis. It consists of 6 fields
separated by spaces. The fields and allowable values are of the form:
minute            (0-59)                       in this case 30
hour              (0-23)                       in this case 22
day of the month  (1-31)                       in this case *
month             (1-12)                       in this case *
day-of-week       (0-6 where 0 is Sunday)      in this case *
command to run
The above line indicates that dupdcron_odbc.pl is to be run at 10:30 PM every day of
the month, every month, regardless of the day of the week. The * character is a wildcard. The actual command consists of changing to the directory where
dupdcron_odbc.pl exists (this path will have to be modified depending on your set
up)
cd /seqhound/update;
and then executing the perl script
./dupdcron_odbc.pl
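The field layout described above can be demonstrated with a short shell snippet. This is a hypothetical helper, not part of SeqHound; set -f disables filename globbing so the * fields survive word splitting:

```shell
# Split a crontab entry into its five time fields plus command.
line='30 22 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./dupdcron_odbc.pl'
set -f            # stop the shell expanding the * wildcards into filenames
set -- $line      # word-split the entry into positional parameters
echo "minute=$1 hour=$2 day=$3 month=$4 weekday=$5"
set +f
```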
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
If for some reason you want to deactivate the cron jobs, type:
crontab -r
Note that crontab -r removes your entire active crontab; the list_crontabs file
itself is left untouched.
To find out what cron jobs you have activated, type
crontab -l
For more information on setting up cron jobs on UNIX type:
man crontab
33. Set up redundcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 23 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./redundcron_odbc.pl
See Step 32 for the explanation of libpath.
After adding the above line, edit it to match your setup and close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically does three things:
a) checks whether the file “nr” has been updated on the ftp site
ftp://ftp.ncbi.nlm.nih.gov/blast/db and, if it has, retrieves it
b) drops table redund from the database and recreates it
c) rebuilds table redund using the downloaded nr file and the redund parser.
34. Run precompute for the first time.
First set up the configuration file
cd $SEQH/updates
pico .intrezrc
Edit the section under [precompute] to make it look like:
[precompute]
;precomputed taxonomy queries
MaxQueries = 0
MaxQueryTime = 10
QueryCount = 0
#path to precomputed searches has to have "/" at the end !!
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index
Make sure the value of path is the absolute path of the precompute directory you
made in Step 14, and the value of indexfile is the value of path plus index.
Variable path is the directory that holds results of the precompute executable.
indexfile is a path to the index that will be created by precompute.
Finally, run the precompute executable:
cd $SEQH/updates
./precompute -a redo
Where -a redo specifies that the program is being run for the first time.
This program basically precomputes the number of proteins and nucleic acids (and
their GI values) for each taxon in the taxgi table. The results of this query are stored
and indexed in text files (in the directory specified by path) if this query takes
longer than x seconds (where x is defined by MaxQueryTime in the above .intrezrc
file). These text files are used by SeqHound API calls such as
SHoundProteinsFromTaxIDIII(taxid)
35. Set up precomcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 1 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./precomcron_odbc.pl
See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically runs the command
precompute -a update
and updates the precomputed search results.
36. Set up isshoundoncron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 7 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./isshoundoncron_odbc.pl
See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script basically does two things:
a) runs the executable called isshoundon. This program makes a single call to
the local SeqHound API to ensure that it is working.
b) moves all log, run and gis log files into a directory called logs
37. Set up llcron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./llcron_odbc.pl
See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script repeats the actions listed in step 14 above and re-creates the
LocusLink tables in SeqHound. This includes:
a) getting the latest LL_tmpl.gz file from the NCBI ftp site,
b) removing the LocusLink tables from SeqHound,
c) running llparser,
d) getting two GO annotation files from the GO ftp site,
e) running the addgoid parser on those two files.
38. Set up comgencron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs
Add the following line:
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./comgencron_odbc.pl
See Step 32 for the explanation of libpath.
After adding the above line and editing it to match your setup, close the file.
To activate this crontab file, type
crontab list_crontabs
This script repeats the actions listed in step 15 above: it re-creates the
chrom table in SeqHound and updates the complete genome information in the core
tables. This includes:
a) generating a list of “DNA units” that belong to a complete genome,
b) downloading the complete genome files from the NCBI ftp site,
c) rebuilding the table chrom,
d) removing all records in the core tables that belong to the division “gbchm”,
e) running the script humous_build.sh to insert records into the core tables,
f) resetting the kloodge field in the table taxgi to 0 for all records,
g) updating kloodge by running the comgen parser.
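Once steps 35 through 38 are complete, list_crontabs contains four entries. The sketch below reproduces them in a scratch copy of the file (LIBPATH is a stand-in for the ODBC library path discussed in Step 32, and the .sample file name is an illustration-only assumption) and runs a minimal format check on the schedule fields.

```shell
# Scratch copy of list_crontabs as it should look after steps 35-38.
# LIBPATH is a placeholder; substitute the library path from Step 32.
cat > list_crontabs.sample <<'EOF'
30 1 * * * cd /seqhound/update; LD_LIBRARY_PATH=LIBPATH ./precomcron_odbc.pl
30 7 * * * cd /seqhound/update; LD_LIBRARY_PATH=LIBPATH ./isshoundoncron_odbc.pl
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=LIBPATH ./llcron_odbc.pl
30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=LIBPATH ./comgencron_odbc.pl
EOF

# Every entry needs the five cron schedule fields before the command.
awk 'NF < 6 { print "bad line: " $0; exit 1 }' list_crontabs.sample \
  && echo "crontab file OK"
```

The real file is activated, as in the steps above, with crontab list_crontabs.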
39. Setting up SeqHound servers. Overview.
There are two web server applications that make up the SeqHound system:
a) wwwseekgi produces html pages for the SeqHound web interface and
b) seqrem processes requests to the SeqHound remote API.
Step 40 shows you how to find the two directories where you will set up these two
applications (assuming that you are using a default installation of Apache). The two
directories are called:
cgi-bin
htdocs
Step 40 may be skipped if you already know or have already been told where these two
directories are.
The steps from 41 onward describe the files that must be placed into these two
sub-directories in order to start the wwwseekgi and seqrem servers.
40. Examining the httpd.conf file for Apache.
These instructions assume that you already have an Apache server running. In order
to proceed further you must locate the directory where executables will be run from
(called “cgi-bin” in a default set-up of Apache) and a directory that contains html
documents (called “htdocs” in a default set-up of Apache). You can find (and reset)
the location of these two directories in an Apache configuration file called
“httpd.conf”. In a default set-up of Apache, the httpd.conf file can be accessed by
changing to the directory:
cd /etc/apache
and then opening the httpd.conf file found in this directory using a text editor such as
pico:
pico httpd.conf
To find the cgi-bin directory location, look for the line beginning with
“ScriptAlias”. In the default set-up, this line looks like this:
ScriptAlias /cgi-bin/ "/var/apache/cgi-bin/"
In this example, the path to the cgi-bin directory is /var/apache/cgi-bin/.
Write this path down, whatever it is.
To find the htdocs directory, look for the line beginning with “DocumentRoot”. In
the default set-up, this line looks like this:
DocumentRoot "/var/apache/htdocs/"
In this example, the path to the htdocs directory is /var/apache/htdocs/.
Write this path down, whatever it is.
Also make a note of the lines beginning with “User” and “Group” (these
identify the user and group the server runs as). In a default Apache set-up, these lines are likely:
User nobody
Group nobody
Make a note of this, whatever it is.
Exit from the httpd.conf file and save your changes. If you made changes to the file,
you must restart the Apache server using the command:
/usr/apache/bin/apachectl restart
See the Trouble Shooting section at the end for more information on this.
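The three lookups above (ScriptAlias, DocumentRoot, and User/Group) can also be pulled out of httpd.conf in a single grep pass. The sketch below runs against an inline sample config written to a temporary file just for the demonstration; on a real server, point grep at /etc/apache/httpd.conf (or wherever your httpd.conf lives) instead.

```shell
# A sample httpd.conf fragment using the default values from this manual.
cat > /tmp/httpd.conf.sample <<'EOF'
ServerRoot "/var/apache"
ScriptAlias /cgi-bin/ "/var/apache/cgi-bin/"
DocumentRoot "/var/apache/htdocs/"
User nobody
Group nobody
EOF

# Print only the directives this step cares about.
grep -E '^(ScriptAlias|DocumentRoot|User|Group)' /tmp/httpd.conf.sample
```

This prints the four directive lines, giving you the cgi-bin path, the htdocs path, and the server ownership in one look.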
In the steps below you will set up the SeqHound server by adding files to these two
directories.
Contents of the cgi-bin and htdocs directories
directory    contents
cgi-bin      the SeqHound wwwseekgi and seqrem server applications will be placed here
htdocs       all of the static html pages used by the SeqHound interface will be placed here
41. Set up the cgi-bin directory.
Move to the cgi-bin directory you found in the step above. For the default set-up:
cd /var/apache/cgi-bin/
make a new subdirectory here called seqhound:
mkdir seqhound
cd seqhound
copy the SeqHound server applications here:
cp $COMPILE/slri/seqhound/build/odbc/seqrem .
cp $COMPILE/slri/seqhound/build/odbc/wwwseekgi .
also copy the following files to this directory:
cp $COMPILE/slri/seqhound/html/seekhead.txt .
cp $COMPILE/slri/seqhound/html/seektail.txt .
cp $COMPILE/slri/seqhound/html/seekhead.txt pics/.
cp $COMPILE/slri/seqhound/config/.intrezrc .
cp $COMPILE/slri/seqhound/config/.ncbirc .
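Before moving on, it is worth confirming that all six files landed in the cgi-bin/seqhound directory. The sketch below demonstrates the check against a throwaway directory created with mktemp; on your server, set CGIDIR to the real cgi-bin/seqhound path instead. The CGIDIR variable and the temp-directory setup are illustration-only assumptions, not part of the SeqHound distribution.

```shell
# Demonstration directory standing in for /var/apache/cgi-bin/seqhound.
CGIDIR=$(mktemp -d)
touch "$CGIDIR/seqrem" "$CGIDIR/wwwseekgi" "$CGIDIR/seekhead.txt" \
      "$CGIDIR/seektail.txt" "$CGIDIR/.intrezrc" "$CGIDIR/.ncbirc"

# Report anything the servers will be missing.
missing=0
for f in seqrem wwwseekgi seekhead.txt seektail.txt .intrezrc .ncbirc; do
  [ -e "$CGIDIR/$f" ] || { echo "missing: $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "cgi-bin setup complete"
```

On a correctly populated directory, this prints "cgi-bin setup complete"; otherwise it names each missing file.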
42. Edit the .ncbirc configuration file.
Open the file with a text editor such as pico.
The setting for Data should contain a path to the ncbi/data directory. This directory
was downloaded as part of the ncbi toolkit in step 2.
--------------------example .ncbirc file begins--------------------
[NCBI]
Data=/home/ncbi/data
--------------------example .ncbirc file ends----------------------
43. Edit the .intrezrc configuration file.
Refer to step 14 in the current section for setting up the .intrezrc file. The settings
for username, password, dsn and database in section [datab] should be
valid for the SeqHound database you have just built, and the setting for path and
indexfile in section [precompute] should point to the valid path set in step 34
in the current section.
44. Set up the index.html file for the web interface.
Move to the htdocs directory for your web-server. In the default case:
cd /var/apache/htdocs/
Make a SeqHound directory here:
mkdir seqhound
cd seqhound
Copy the index.html page to this directory:
cp $COMPILE/slri/seqhound/html/index.html .
Open the file in a text editor like pico and edit it so that its action points to the
wwwseekgi server.
pico index.html
then edit the line