The SeqHound Manual
Part II: Sections 4-7, For Administrators and Developers
Release 3.3 (April 20th, 2005)

Authors: Ian Donaldson, Katerina Michalickova, Hao Lieu, Renan Cavero, Michel Dumontier, Doron Betel, Ruth Isserlin, Marc Dumontier, Michael Matan, Rong Yao, Zhe Wang, Victor Gu, Elizabeth Burgess, Kai Zheng, Rachel Farrall

Edited by Rachel Farrall and Ian Donaldson

© 2005 Mount Sinai Hospital

The SeqHound Manual 421 pages 18/04/2005

Table of Contents

About this manual
Conventions
How to contact us
Who is SeqHound?

4. Setting up SeqHound locally
   4.1 Overview
   4.2 SeqHound system requirements
       OS and hardware architecture
       Memory (RAM)
       Hard Disk
           Source code and executables
           Database
       Other Software
       Compiling SeqHound Code yourself
       ODBC compliant database engines
       Library dependencies
   4.3 Obtaining precompiled SeqHound executables
       4.3.1 Obtaining SeqHound Source Code
   4.4 Compiling SeqHound executables on Solaris
   4.5 Building the SeqHound system on Solaris
       Catch up on SeqHound daily updates
       Setting up daily sequence updates
       Setting up SeqHound servers. Overview
       Trouble-shooting notes
           Error logs
           Recompiling SeqHound
           Restarting the Apache server
           Other useful links
           Parser schedule
           MySQL errors

5. Description of the SeqHound parsers and data tables by module
   What are modules?
   How to use this section
   Parser descriptions
   Table descriptions
   An overview of the SeqHound data table structure
   Parsers and resource files needed to build and update modules of SeqHound
   core module
       mother parser
       update parser
       postcomgen parser
       asndb table
       parti table
       nucprot table
       accdb table
       histdb table
       pubseq table
       taxgi table
       sengi table
       sendb table
       chrom table
       gichromid table
       contigchromid table
       gichromosome table
       contigchromosome table
   Redundant protein sequences (redundb) module
       redund parser
       redund table
   Complete genomes tracking (gendb) module
   Taxonomy hierarchy (taxdb) module
       importtaxdb parser
       taxdb table
       gcodedb table
       divdb table
       del table
       merge table
   Structural databases (strucdb) module
       cbmmdb parser
       vastblst parser
       pdbrep parser
       mmdb table
       mmgi table
       domdb table
   Protein sequence neighbours (neighdb) module
       Installing nblast
       Configuration of nblast environment
       Running NBLAST
       NBLAST Update Procedure
       nbraccess program*
       BLASTDB table
       NBLASTDB table
   Locus link functional annotations (lldb) module
       llparser
       addgoid parser
       ll_omim table
       ll_go table
       ll_llink table
       ll_cdd table
   GENE module
       parse_gene_files.pl parser
       gene_dbxref table
       gene_genomicgi table
       gene_history table
       gene_info table
       gene_object table
       gene_productgi table
       gene_pubmed table
       gene_synonyms table
   Gene Ontology hierarchy (godb) module
       goparser
       go_parent table
       go_name table
       go_reference table
       go_synonym table
   Gene Ontology Association (GOA) module
       Table summarizing input files, parsers and command line parameters for GOA module
       Gene Ontology Module Diagram
       goa_seq_dbxref table
       goa_association table
       goa_reference table
       goa_with table
       goa_xdb table
       goa_gigo table
   dbxref module
       Who Cross-references who?
       Explanation of the data table structure
       How to update the DBXref and GO Annotation modules using a cluster
       Understanding the dbxref.ini file
       Table summarizing input files, parsers and command line parameters for dbxref module
       dbxref table
       dbxrefsourcedb table
       Contents of dbxrefsourcedb table
   RPS-BLAST domains (rpsdb) module
       domname parser
       rpsdb parser
       domname table
       rpsdb table
   Molecular Interaction (MI) module
       MI-BIND parser
       MI_source table
       MI_ints table
       MI_objects table
       MI_obj_dbases table
       MI_mol_types table
       MI_dbases table
       MI_record_types table
       MI_complexes table
       MI_complex2ints table
       MI_complex2subunits table
       MI_refs table
       MI_refs_db table
       MI_exp_methods table
       MI_obj_labels table
   Text mining module
       mother parser
       text searcher parser
       yeastnameparser.pl parser
       text_bioentity table
       text_bioname table
       text_secondrefs table
       text_bioentitytype table
       text_fieldtype table
       text_nametype table
       text_rules table
       text_db table
       text_doc table
       text_docscore table
       text_evidencescore table
       text_method table
       text_point table
       text_pointscore table
       text_result table
       text_resultscore table
       text_search table
       text_searchscore table
       text_rng table
       text_rngresult table
       text_doctax table
       text_organism table
       text_englishdict table
       text_bncorpus table
       text_pattern table
       text_stopword table

6. Developing for SeqHound
   Open source development
   Code organization
   Adding/Modifying a remote API function to SeqHound
       Overall architecture of the SeqHound system
   Adding a new module to SeqHound
       Database layer
       Parser layer
       Local API layer (Query layer)
       CGI layer
       Remote API layer

7. Appendices
   Example GenBank record
   Example SwissProt record
   Example EMBL record
   Example PDB record
   Example Biostruc
   GO background material

* not available at time of writing

About this manual.

This manual contains everything that has been documented about SeqHound. It is distributed in two parts (Part I: For Users; Part II: For Administrators and Developers). If you cannot find the answer here, please contact us.

This manual was written and reviewed by the persons listed under "Who is SeqHound?". Any errors should be reported to seqhound@blueprint.org.

You can find out more about the general architecture of SeqHound by reading the SeqHound paper, which is freely available from BioMed Central and is included in the supplementary material distributed with this manual. See:

Michalickova K, Bader GD, Dumontier M, Lieu H, Betel D, Isserlin R, Hogue CW. SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics. 2002 Oct 25;3(1):32. PMID: 12401134

The SeqHound Manual (Part I: Sections 1-3) For Users. Sections 1 and 2 are one-page descriptions that tell you what to read first to get started, depending on what kind of user you are. Section 3 is of interest to programmers who want to use the remote API to access information in the SeqHound database maintained by the Blueprint Initiative.
The SeqHound Manual (Part II: Sections 4-7) For Administrators and Developers. Section 4 is of interest to programmers and system administrators who want to set up SeqHound themselves so they can use the local API. Section 5 is an in-depth description of everything that is in the SeqHound database and how it gets there (table by table); this section will be of interest to all users. Section 6 describes how programmers can add to SeqHound; this section also describes our internal development process at Blueprint. Section 7 includes appendices of background and reference material.

Conventions

The following section describes the conventions used in this manual.

Italic is used for filenames, file extensions, URLs, and email addresses.
Constant Width is used for code examples, function names and system output.
Constant Bold is used in examples for user input.
Constant Italic is used in examples to show variables for which a context-specific substitution should be made.

How to contact us.

General enquiries or comments can be posted to the SeqHound usergroup mailing list, seqhound.usergroup@blueprint.org. You may also subscribe to this list to receive regular updates about SeqHound developments by going to http://lists.blueprint.org/mailman/listinfo/seqhound.usergroup.

Private enquiries, bug reports from external users, questions about SeqHound or errors found in this manual may be sent to seqhound@blueprint.org.

Who is SeqHound?

Chronologically ordered according to when each person first started work on SeqHound.
Chris Hogue
Katerina Michalickova
Gary Bader
Ian Donaldson
Ruth Isserlin
Michel Dumontier
Hao Lieu
Marc Dumontier
Doron Betel
Renan Cavero
Ivy Lu
Rong Yao
Volodya Grytsan
Zhe Wang
Victor Gu
Rachel Farrall
Michael Matan
Elizabeth Burgess
Kai Zheng

4. Setting up SeqHound locally.

4.1 Overview.

This section describes how you can set up the SeqHound system on your own hardware using freely available SeqHound executables. These executables allow you to build and update the SeqHound database as well as run a web interface and a remote API server.

Section 4.2 should be reviewed first for system requirements before attempting to install the SeqHound system.

Section 4.3 tells you how to download executables for your platform and operating system from the SeqHound ftp site. SeqHound source code may also be downloaded from this site.

Section 4.4 describes how SeqHound code may be compiled on your own hardware using the freely available code on the SeqHound ftp site. This step is only required if SeqHound executables are not available for your platform or if you want to make use of the local programming API. If you obtain SeqHound executables from the ftp site and want to build your local SeqHound database, you still need to go through Steps 8, 9, 10, 11 and 13 in that section, which describe how to install the MySQL server and ODBC driver.

Section 4.5 contains detailed instructions for using the executables to build the SeqHound data tables and for setting up the SeqHound web interface and remote API server.

4.2 SeqHound system requirements.

Before attempting to set up SeqHound yourself, you should review the system requirements listed below. The SeqHound system is able to run on a number of operating systems; we recommend, and can best support, a UNIX operating system such as Sun Solaris or Red Hat Linux.
Setting up SeqHound will require approximately 700 GB of disk space (see below).

Questions about system requirements, compilation, setup and maintenance can be addressed to seqhound@blueprint.org. We will do our best to address all inquiries, but resources may not allow us to solve all problems arising on all possible setups.

OS and hardware architecture

SeqHound code is compiled on the following platforms, based on the release version code. The Blueprint production SeqHound is compiled and run on a Sun-Fire-880 under Sun Solaris (version 9). We have also compiled and tested SeqHound on the Fedora Core 2.0 and MacOS X operating systems.

Release versions of SeqHound executables are available for:

   Architecture         Operating system
   x86                  Fedora Core 2.0
   Sun-Fire-880         Sun Solaris (version 9)
   PowerPC              MacOS X

We have also successfully built executables on the following platforms:

   Architecture         Operating system
   x86                  FreeBSD
   x86                  QNX
   x86                  Windows NT
   PowerPC              PPC Linux
   SGI                  Irix 6
   Alpha                Compaq Alpha OS
   HPPA 2.0             HPUX 11.0
   HPPA 1.1             PA-RISC Linux

Memory (RAM)

We recommend a minimum of 1 GB of RAM to run the SeqHound executables.

Hard Disk

Source code and executables

   Component                              Image Size
   SeqHound source and compiled           220.0 MB
   NCBI Toolkit                           560.0 MB
   NCBI C++ Toolkit                       12 GB
   bzip2 library                          4.5 MB
   slri lib                               7.3 MB
   slri lib_cxx                           9.4 MB
   Source code and executables (total)    13 GB approx.

Database

   Component              Image Size
   data tables            300 GB
   data tables backup     300 GB
   Database (total)       700 GB*

* 700 GB includes 300 GB for a single copy of the SeqHound data tables. The SeqHound system includes a second copy of the data tables used for backup and updating. We suggest a minimum of 700 GB for a SeqHound installation. This allows for yearly growth of the data tables as well as for a RAID5 disk configuration.
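Before starting a build, it is worth confirming that the target filesystem actually has the recommended space. A minimal sketch (run it from the directory where the data tables will live, and compare the "Avail" column against the 700 GB figure above):

```shell
# Report free space on the filesystem holding the current directory.
# The "Avail" column should comfortably exceed the ~700 GB recommended
# for the data tables plus their backup copy.
df -h .
```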
We are using the MySQL database storage engine InnoDB, which provides transaction support and automatic recovery in the event of a database server outage. There is no need to keep a separate instance of the database when the InnoDB storage engine is used. To prevent deadlock during data insertion and update, you should not run SeqHound parsers in parallel against the InnoDB database server. As a result, the initial build of the SeqHound database takes up to three extra days with the InnoDB storage engine.

If you wish to use the MyISAM storage engine, you can run parsers in parallel to speed up the initial build of SeqHound. However, you will need to keep a separate database instance for database update and backup, as the MyISAM storage engine does not support transactions or automatic recovery.

Other Software

   Software                                                     Installation
   Apache Webserver (version 1.3)                               See http://www.apache.org/ for software installation for your platform.
   Apache Jakarta Tomcat JSP/Servlet Container (version 4.1)    See http://jakarta.apache.org/tomcat/ for software installation for your platform.
   Perl (version 5.8.3)                                         See http://www.cpan.org/ for installation for your platform. Required modules include Net/FTP.pm and sun4-solaris-64/DBI.pm.

Compiling SeqHound Code yourself.

It is not necessary to compile SeqHound executables yourself; the system may be set up using the executables provided on the ftp site for selected operating systems. However, if you wish to make use of the local API then you must compile SeqHound yourself.

ODBC compliant database engines

Blueprint uses the ODBC compliant MySQL database engine. We are using version 4.1.10 in production; this version supports nested SQL queries and internationalization. We have not tested SeqHound on other ODBC compliant RDBMSs such as Oracle, DB2 and PostgreSQL.
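After installing Perl, you can check that the required modules listed above are actually available before running any SeqHound scripts. A minimal sketch (it only assumes perl is on your PATH; Net::FTP ships with core Perl, DBI normally comes from CPAN):

```shell
# Check that the Perl modules SeqHound's scripts rely on are installed;
# prints "ok" or "missing" for each module.
for mod in Net::FTP DBI; do
  if perl -M"$mod" -e 1 2>/dev/null; then
    echo "$mod ok"
  else
    echo "$mod missing"
  fi
done
```

Any module reported missing can be installed from CPAN before proceeding.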
Library dependencies

    Library                         Source
    NCBI Toolkit                    ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
    NCBI C++ Toolkit (optional*)    ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/
    bzip2 Library                   http://sourceforge.net/projects/slritools/
    slri lib                        http://sourceforge.net/projects/slritools/
    slri lib_cxx (optional*)        http://sourceforge.net/projects/slritools/

* This library is only required if you plan to use the SeqHound remote C++ API.

4.3 Obtaining precompiled SeqHound executables.

It is not necessary to compile SeqHound executables yourself; the system may be set up using the precompiled executables provided on the ftp site for selected Operating Systems. If you choose to compile the executables yourself, skip to step 4.3.1. You will require about 220 MB of disk space to store the SeqHound compiled executables. These instructions assume you are logged in as user "seqhound" on a UNIX system running the bash shell and you have perl installed on your system.

1. Decide the location to install the SeqHound binary executables. For example, if you want to install in the directory /home/seqhound/execs, do the following:

    mkdir execs
    cd execs

2. Download the SeqHound installation utility script installseqhound.pl from the FTP site ftp.blueprint.org:

    ftp ftp.blueprint.org

When prompted for a name enter anonymous. When prompted for a password type your email address: myemail@home.com

    cd pub/SeqHound/script
    get installseqhound.pl

Close the ftp session by typing:

    bye

3. Run the perl script to download and install SeqHound executables. The perl script will download SeqHound binary executables based on the specified platform (linux or solaris), unpack the tar ball, modify the configuration files .odbc.ini and .intrezrc (for ODBC database access) and deploy the configuration files. It requires two command-line arguments: platform (linux or solaris) and installation path (e.g. /home/seqhound/execs).
Enter the path to the ODBC driver (e.g. /usr/lib/libmyodbc3.so; please refer to step 10 in section 4.4 for the ODBC driver path), the database server name, port number, user id, password and database instance name when prompted by the perl script.

    ./installseqhound.pl [linux OR solaris] [/home/seqhound/execs]

Upon successful execution of the perl script, you should see the following directories in the directory execs:

    build config example include lib sql test updates www

The configuration file .odbc.ini can be found in the home directory (e.g. /home/seqhound).

4.3.1 Obtaining SeqHound Source Code.

Follow the instructions below to download SeqHound source code. If you downloaded and unpacked the executables, you can skip sections 4.3.1 and 4.4 and continue with section 4.5.

1. In your home directory, make a new directory where you will store the new SeqHound code.

    mkdir compile

Move into this directory and set an environment variable called COMPILE to point to this directory.

    cd compile
    export COMPILE=`pwd`

(where (`) is a single back-quote)

2. Download the perl utility seqhoundsrcdownload.pl from the SeqHound ftp site.

Note: We no longer support SeqHound download from the Sourceforge FTP site. Please download SeqHound from ftp://ftp.blueprint.org/pub/SeqHound/

From the compile directory, type:

    ftp ftp.blueprint.org

When prompted for a name enter anonymous. When prompted for a password type your email address: myemail@home.com

    cd pub/SeqHound/script
    get seqhoundsrcdownload.pl

Close the ftp session by typing:

    bye

3. Download SeqHound source code by running the perl script seqhoundsrcdownload.pl. The script will download the source code tar file and unpack it into two directories, slri and bzip2.
You will also see a release note file Release_notes_x.x.txt in the same directory, compile.

    ./seqhoundsrcdownload.pl

4. Set the SLRI environment variable. Move to the slri directory and set the environment variable "SLRI" to point to this directory.

    cd $COMPILE/slri
    export SLRI=`pwd`

4.4 Compiling SeqHound executables on Solaris

These instructions describe how to compile SeqHound on the Solaris platform. They may be used as a guide for compiling SeqHound code on other platforms. Instructions are similar for Linux and differences are noted.

Using these instructions

These instructions assume that:

• You have downloaded the SeqHound code from the ftp server and you have set environment variables called COMPILE and SLRI. See section 4.3.1.
• You are using the bash shell.

Note: On Linux platforms, to compile SeqHound libs with ODBC support you also need the unixODBC-devel package, which contains sql.h and the other libs/headers required to compile SeqHound libs with ODBC support. This is not needed to run SeqHound, just to compile it.

These instructions were tested on a Sun-Fire-880 architecture running a Sun Solaris OS (version 9). The system information for the test-box (results of a "uname -a" call) was:

    SunOS machine_name 5.9 Generic_117171-15 sun4u sparc SUNW,Sun-Fire-880

1. Download the NCBI toolkit. SeqHound is dependent on code in the NCBI toolkit. Move to the compile directory and ftp to the NCBI ftp site:

    cd $COMPILE
    ftp ftp.ncbi.nlm.nih.gov

When prompted for a name enter anonymous. When prompted for a password type myemail@home.com

    cd toolbox/CURRENT

Make a note of the FAQ.html and the readme.htm files. Change your transfer type to binary and get the zipped directory called ncbi.tar.gz:

    bin
    get ncbi.tar.gz

Close the ftp session by typing:

    bye

Uncompress the toolkit:

    gunzip ncbi.tar.gz
    tar xvf ncbi.tar

2.
Edit the platform make file. Go to the platform directory and locate the file with a ".mk" extension that applies to your platform. For a 64-bit Solaris system the file is "solaris64.ncbi.mk" and in Linux the file is "linux-x86.ncbi.mk".

    cd $COMPILE/ncbi
    cd platform

In Linux, in linux-x86.ncbi.mk replace the line /home/coremake/ncbi with ${NCBI}.

Use the following line (a Perl command) to replace the string /netopt/ncbi_tools/ncbi64/ncbi with the string ${NCBI} in the solaris64.ncbi.mk file:

    perl -p -i.bak -e 's|/netopt/ncbi_tools/ncbi64/ncbi|\${NCBI}|g' solaris64.ncbi.mk

So, for instance, the line

    NCBI_INCDIR = /netopt/ncbi_tools/ncbi64/ncbi/include

will become:

    NCBI_INCDIR = ${NCBI}/include

You could also edit this file by hand using a text editor if you don't have Perl installed. Copy the file up one level to the ncbi directory and rename it "ncbi.mk":

    cp solaris64.ncbi.mk ../ncbi.mk

3. Set environment variables in preparation for the toolkit build. Move back to the ncbi directory and set the environment variable NCBI to point to that directory:

    cd $COMPILE/ncbi
    export NCBI=`pwd`

Check this by typing:

    echo $NCBI

The value shown will replace ${NCBI} in the "solaris64.ncbi.mk" file that you modified in the above step when the make file is run.

Note: The make file in the NCBI toolkit will use the C compiler from Sun instead of the compiler gcc. We do not recommend using gcc as it generates seqhound parsers that lead to segmentation faults at run time.
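If Perl is not installed, the path substitution in step 2 can also be scripted with sed. The sketch below is an assumption-laden alternative, not part of the manual's procedure: it requires GNU sed (the -i in-place flag is not supported by the stock Solaris sed, where you would redirect to a temporary file instead).

```shell
# Sketch: the same substitution as the Perl one-liner, wrapped in a
# small helper. Assumes GNU sed; writes a .bak backup like -i.bak does.
fix_ncbi_paths() {
    sed -i.bak 's|/netopt/ncbi_tools/ncbi64/ncbi|${NCBI}|g' "$1"
}

# Usage:
# fix_ncbi_paths solaris64.ncbi.mk
# grep '^NCBI_INCDIR' solaris64.ncbi.mk
```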
Finally, paths to the compiler and the archive executable ar should be added to your PATH variable:

    export PATH=/usr/local/bin:/opt/SUNWspro/prod/bin:/usr/ccs/bin:$PATH

You can check all of your environment variables by typing:

    set | sort

At this point, the relevant environment variables should be something like this:

    COMPILE=/export/home/your_user_name/compile
    NCBI=/export/home/your_user_name/compile/ncbi
    OSTYPE=solaris2.9
    PATH=/opt/SUNWspro/prod/bin:/usr/local/bin:/usr/ccs/bin:/usr/bin:/usr/ucb:/etc:.

If you want, you can read the readme file in the make directory:

    cd make
    more readme.unx

Note: For the Solaris UNIX OS only, the SeqHound API functions SHoundGetGenBankff and SHoundGetGenBankffList break due to a bug in the NCBI library file ncbistr.c (in directory ncbi/corelib and ncbi/build). To fix the problem, replace all the code inside the function Nlm_TrimSpacesAroundString() in the file ncbistr.c with the following text:

    char *ptr, *dst, *revPtr;
    int spaceCounter = 0;

    ptr = dst = revPtr = str;
    if ( !str || str[0] == '\0' )
        return str;
    while ( *revPtr != '\0' )
        if ( *revPtr++ <= ' ' )
            spaceCounter++;
    if ( (revPtr - str) <= spaceCounter ) {
        *str = '\0';
        return str;
    }
    while ( revPtr > str && *revPtr <= ' ' )
        revPtr--;
    while ( ptr < revPtr && *ptr <= ' ' )
        ptr++;
    while ( ptr <= revPtr )
        *dst++ = *ptr++;
    *dst = '\0';
    return str;

4. Build the NCBI toolkit. Move back up to the compile directory and run the make command:

    cd $COMPILE
    ./ncbi/make/makedis.csh 2>&1 | tee out.makedis.txt

Note: to build Solaris 64-bit binaries add the following to the command line:

    SOLARIS_MODE=64 ./ncbi/make/makedis.csh

This runs a c-shell script to make the toolkit and tees the output to the screen and a log file "out.makedis.txt". It is safe to ignore the multiple error messages that you may see.
At the end of a successful build you will see:

    *********************************************************
    *The new binaries are located in ./ncbi/build/ directory*
    *********************************************************

The ncbi.tar file can be removed from the "compile" directory after the successful build process has been completed.

5. Make the bzip2 library. The bzip2 code was downloaded as part of the seqhound code in step 4.3.1 above. Move to the bzip2 directory and run the make file:

    cd $COMPILE/bzip2
    make -f make.bzlib

6. Set the BZDIR environment variable.

    cd $COMPILE/bzip2
    export BZDIR=`pwd`

7. In your home directory, add the following environment parameters to the appropriate configuration file such as .bashrc or .bash_profile. Text in italics should be changed to the correct path on your machine that points to the directory containing DBI.pm:

    export NCBI=$COMPILE/ncbi
    export BZDIR=$COMPILE/bzip2
    export SLRI=$COMPILE/slri
    export VIBLIBS="-L/usr/X11R6/lib -lXm -lXpm -lXmu -lXp -lXt -lX11 -lXext"
    export PERL5LIB=/usr/local/lib/perl5/site_perl/5.8.3/sun4-solaris-64

8. Install MySQL server and create database "seqhound". SeqHound is built and tested with MySQL version 4.1.10. You can download MySQL from http://dev.mysql.com/downloads/mysql/4.1.html and follow the manual at http://dev.mysql.com/doc/mysql/en/index.html to install MySQL on your server. The data directory that the MySQL server points to should have 700 GB for a full SeqHound database. After MySQL is installed, you need to log into MySQL and create the database "seqhound":

    create database seqhound;

Note that ";" must be used at the end of all MySQL statements.

9.
Install the ODBC driver.

Note that for Linux platforms, the unixODBC package needs to be installed prior to the ODBC driver, otherwise the following error will occur:

    error: Failed dependencies: libodbcinst.so.1 is needed by MyODBC-3.51.09-1

a) Go to web site: http://dev.mysql.com/doc/connector/odbc/en/faq_2.html
b) Find and download the RPM distribution of the ODBC driver, MyODBC-3.51.071.i586.rpm.
c) As user "root", install the driver. For a first time installation:

    rpm -ivh MyODBC-3.51.01.i386-1.rpm

For an upgrade:

    rpm -Uvh MyODBC-3.51.01.i386-1.rpm

d) The library file libmyodbc3.so will be installed in directory /usr/lib or /usr/local/lib.

10. Set up the configuration file for the ODBC driver. Create a configuration file called .odbc.ini in your home directory with the following content:

    [mysqlsh]
    Description = MySQL ODBC 3.51 Driver DSN
    Trace       = Off
    TraceFile   = stderr
    Driver      = /usr/lib/libmyodbc3.so
    DSN         = mysqlsh
    SERVER      = my_server
    PORT        = my_port
    USER        = my_id
    PASSWORD    = my_pwd
    DATABASE    = seqhound

Text in italics should be changed. The section header [mysqlsh] must not be used for other sections; Driver is your library path; DSN must be the same as the header name; DATABASE is the database name. Text /usr in the value of the variable Driver should be changed to the path where unixodbc resides. Text my_server should be changed to the IP address or the server name of the MySQL server. Text my_port should be changed to the port number of the MySQL instance. Text my_id and my_pwd should be replaced by your user id and password for the MySQL database. Note that the values for the headers such as DSN, USER, PASSWORD and DATABASE must be less than 9 characters.

Also edit the file called .intrezrc in directory slri/seqhound/config/.

11. Set up ODBC related variables:

    export ODBC=path_to_unixodbc

where path_to_unixodbc should be replaced by the path of the unixODBC driver on your machine.
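Before moving on, it can save debugging time to sanity-check the .odbc.ini from step 10. The helper below is a hypothetical sketch (not part of SeqHound): it only confirms that the [mysqlsh] section exists and that the Driver path points at a file that is actually installed; a live connection can then be tried with unixODBC's isql utility.

```shell
# Hypothetical helper: basic sanity checks on an .odbc.ini file --
# the [mysqlsh] section must exist and the Driver value must name a
# file that is present on disk.
check_odbc_ini() {
    ini="$1"
    grep -q '^\[mysqlsh\]' "$ini" || { echo "no [mysqlsh] section" >&2; return 1; }
    driver=$(sed -n 's/^Driver *= *//p' "$ini" | head -1)
    if [ ! -f "$driver" ]; then
        echo "driver library not found: $driver" >&2
        return 1
    fi
    echo ".odbc.ini looks OK"
}

# Usage:
# check_odbc_ini ~/.odbc.ini
# then test the live connection (placeholder credentials):
# isql mysqlsh my_id my_pwd
```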
In your home directory, add the parameter "LD_LIBRARY_PATH" to the appropriate configuration file such as .bashrc or .bash_profile:

    export LD_LIBRARY_PATH=/usr/local/unixodbc/lib:/usr/local/unixodbc/odbc/lib:/usr/local/mysql/lib/mysql:/usr/local/mysql/lib/mysql/lib

The value of the variable "LD_LIBRARY_PATH" should include all the paths that contain the library files libodbc*, libmyodbc* and libmysqlclient*.

12. Build the SeqHound executables. Move to the compile directory and list all the files in the directory:

    cd $COMPILE
    ls

You should see:

    bzip2 ncbi slri out.makedis.txt

Before proceeding you should check your environment variables:

    set | sort

to ensure that correct paths have been specified for each of the following variables: NCBI, SLRI, ODBC, BZDIR. The make files which you are about to invoke call on these variables, therefore the paths must be correct.

Compile the SLRI libraries using the following commands:

    cd $SLRI/lib
    make -f make.slrilib
    make -f make.slrilib odbc

The above commands will build the SLRI libraries needed by SeqHound.

Move to the make directory for SeqHound and run the makeallsh script. The script requires two command line arguments. The first parameter indicates what database backend is to be used for the build (currently the only valid target is odbc). The second parameter indicates what SeqHound programs are to be made (a choice of all, cgi, domains, examples, genomes, go, locuslink, parsers, scripts, taxon, updates). The output of the build script will be captured in the text file out.makeseqhound.txt.

    cd $SLRI/seqhound
    ./makeallsh odbc all 2>&1 | tee out.makeseqhound.txt

It is safe to ignore the multiple warning messages that you may see. After this has finished running, move to the directory slri/seqhound/build/odbc/ where you will find the executables for SeqHound.
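The environment check described above can also be scripted so that a missing variable stops the build early. This is a minimal sketch, not part of the SeqHound distribution:

```shell
# Sketch: fail fast if any of the build-time variables is unset or
# empty before invoking the make files.
check_build_vars() {
    rc=0
    for var in NCBI SLRI ODBC BZDIR; do
        eval "val=\${$var}"
        if [ -z "$val" ]; then
            echo "ERROR: $var is not set" >&2
            rc=1
        fi
    done
    return $rc
}

# Usage:
# check_build_vars && echo "environment looks complete"
```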
    cd build/odbc
    ls -1

You will see:

    addgoid
    cbmmdb
    chrom
    clustmask
    clustmasklist
    comgen
    fastadom
    gen2fasta
    gen2struc
    goparser
    goquery
    histparser
    importtaxdb
    isshoundon
    llgoa
    llparser
    llquery
    mother
    pdbrep
    precompute
    redund
    seqrem
    sh_nbhrs
    shunittest_odbc_local
    shunittest_odbc_rem
    shtest
    update
    vastblst
    wwwseekgi

13. Set up the SQL files that create tables.

    cd $SLRI/seqhound/sql

In each of the files core.sql, redund.sql, ll.sql, taxdb.sql, gendb.sql, strucdb.sql, cddb.sql, godb.sql, rps.sql and nbr.sql, there is a line close to the beginning of the file:

    #use testsql;

This line should be changed to:

    use seqhound;

4.5 Building the SeqHound system on Solaris

Using these instructions

These instructions show how the SeqHound executables may be used to build the SeqHound system under a Solaris 8 OS. These instructions may also be used as a guide for setting up SeqHound under other operating systems. These instructions assume that:

• You have downloaded the latest release version of the SeqHound code (see step 4.3.3)
• You have successfully installed MySQL
• You have successfully compiled the SeqHound code yourself (section 4.4) OR you have downloaded the SeqHound executables for your platform and operating system (section 4.3.4).
• You have set environment variables called COMPILE and SLRI (see steps 4.3.1 and 4.3.6).
• You have a default install of an Apache server running. See http://www.apache.org/ for freely available software and instructions for your platform.
• You have installed Perl. See http://www.cpan.org/ for freely available software and installation instructions.
• You have at least 300 MB space available in a directory where you can check out code and compile it.
• You have at least 600 GB available for the SeqHound executables and data tables. See section 4.2.
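As an aside on step 13 above: rather than editing each of the ten SQL files by hand, the change can be scripted. This sketch assumes GNU sed (the -i in-place flag is not supported by stock Solaris sed); it is an illustration, not part of the SeqHound distribution.

```shell
# Sketch: point the "use" line at the seqhound database in each
# table-creation script passed as an argument. Assumes GNU sed.
set_sql_database() {
    for f in "$@"; do
        sed -i 's/^#use testsql;/use seqhound;/' "$f"
    done
}

# Usage (from $SLRI/seqhound/sql):
# set_sql_database core.sql redund.sql ll.sql taxdb.sql gendb.sql \
#                  strucdb.sql cddb.sql godb.sql rps.sql nbr.sql
```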
These instructions were tested on a Sun Ultra machine running the Sun-Solaris 8 OS. The system information for the test-box (results of a "uname -a" call) was:

    SunOS machine_name 5.8 Generic_108528-01 sun4u sparc SUNW,Ultra-4

These instructions assume that you are using the c shell. Syntax may differ for some commands in other shells.

Note: These instructions begin with 'step 14'.

14. Prepare to build the SeqHound database. Create a new directory where you will set up SeqHound.

    mkdir seqhound

Set the environment variable SEQH to point to this directory.

    cd seqhound
    setenv SEQH `pwd`

Move to this directory and create new directories:

    mkdir 1.core.files
    mkdir 2.redund.files
    mkdir 3.taxdb.files
    mkdir 4.godb.files
    mkdir 5.lldb.files
    mkdir 6.comgenome.files
    mkdir 7.mmdb.files
    mkdir 8.hist.files
    mkdir 9.neighbours.files
    mkdir 10.rpsdb.files
    mkdir precompute

The numbered directories will hold parsers and files required for the build of the SeqHound data tables. Directory "precompute" will hold the precomputed data of the database.

Move to each of the numbered directories and copy all of the scripts and executables required for the build.

    cd $SEQH/1.core.files
    cp $SLRI/seqhound/sql/core.sql .
    cp $SLRI/seqhound/scripts/asnftp.pl .
    cp $SLRI/seqhound/scripts/seqhound_build.sh .
    cp $SLRI/seqhound/build/odbc/mother .
    cp $SLRI/seqhound/build/odbc/update .
    cp $SLRI/seqhound/config/.intrezrc .

    cd $SEQH/2.redund.files
    cp $SLRI/seqhound/sql/redund.sql .
    cp $SLRI/seqhound/scripts/nrftp.pl .
    cp $SLRI/seqhound/build/odbc/redund .

    cd $SEQH/3.taxdb.files
    cp $SLRI/seqhound/sql/taxdb.sql .
    cp $SLRI/seqhound/scripts/taxftp.pl .
    cp $SLRI/seqhound/build/odbc/importtaxdb .

    cd $SEQH/4.godb.files
    cp $SLRI/seqhound/sql/godb.sql .
    cp $SLRI/seqhound/scripts/goftp.pl .
    cp $SLRI/seqhound/build/odbc/goparser .
    cd $SEQH/5.lldb.files
    cp $SLRI/seqhound/sql/ll.sql .
    cp $SLRI/seqhound/scripts/llftp.pl .
    cp $SLRI/seqhound/build/odbc/llparser .
    cp $SLRI/seqhound/build/odbc/addgoid .

    cd $SEQH/6.comgenome.files
    cp $SLRI/seqhound/sql/gendb.sql .
    cp $SLRI/seqhound/scripts/genftp.pl .
    cp $SLRI/seqhound/scripts/humoasn.pl .
    cp $SLRI/seqhound/scripts/humouse_build.sh .
    cp $SLRI/seqhound/scripts/comgencron_odbc.pl .
    cp $SLRI/seqhound/scripts/shconfig.pm .
    cp $SLRI/seqhound/genomes/gen_cxx .
    cp $SLRI/seqhound/genomes/pregen.pl .
    cp $SLRI/seqhound/genomes/gen.pl .
    cp $SLRI/seqhound/genomes/ncbi.bacteria.pl .
    cp $SLRI/seqhound/build/odbc/chrom .
    cp $SLRI/seqhound/build/odbc/comgen .
    cp $SLRI/seqhound/build/odbc/mother .

    cd $SEQH/7.mmdb.files
    cp $SLRI/seqhound/sql/strucdb.sql .
    cp $SLRI/seqhound/scripts/mmdbftp.pl .
    cp $SLRI/seqhound/config/.mmdbrc .
    cp $SLRI/seqhound/config/.ncbirc .
    cp $SLRI/seqhound/build/odbc/cbmmdb .

    cd $SEQH/8.hist.files
    cp $SLRI/seqhound/build/odbc/histparser .

Open the .intrezrc file with a text editor like pico and edit it.

    cd $SEQH/1.core.files
    pico .intrezrc

An example .intrezrc file follows. Lines preceded by a semi-colon are comments that explain what the settings are used for and their possible values. Text in italics must be changed for the .intrezrc file to function correctly with your SeqHound set-up. The variables username, password, dsn and database in section [datab] should have the same values as USER, PASSWORD, DSN and DATABASE respectively in the .odbc.ini file you set up in step 10 in section 4.4. For the variables path and indexfile in section [precompute], replace the text in italics with the absolute path of the directory "precompute" you just created.

Warning: This file may have wrapped lines. Take care when editing this file that you do not break any of the lines (i.e. introduce any unwanted carriage returns).
-------------------------------example .intrezrc begins-------------------------------

[datab]
;seqhound database that you are connecting to
username=your_user_name
password=your_pass_word
dsn=dsn_in_.odbc.ini_file
database=seqhound
local=

[config]
;the executable the cgi runs off of
CGI=wwwseekgi

[precompute]
;precomputed taxonomy queries
MaxQueries = 100
MaxQueryTime = 10
QueryCount = 50
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index

[sections]
;indicates what modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy
godb = 1
;locus link functional annotations
lldb = 1
;taxonomy hierarchy
taxdb = 1
;protein sequence neighbours
neigdb = 1
;structural databases
strucdb = 1
;complete genomes tracking
gendb = 1
;redundant protein sequences
redundb = 1
;open reading frame database
;currently not exported to outside users of SeqHound
cddb = 0
;RPS-BLAST domains
rpsdb = 1
;DBXref Database Cross_Reference
dbxref = 0

[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./
mail=user\@host.org
defaultrelease=141
pathflags=./

-------------------------------example .intrezrc ends----------------------------------

This file should be copied to the other directories used during the build process:

    cp .intrezrc $SEQH/2.redund.files/.
    cp .intrezrc $SEQH/3.taxdb.files/.
    cp .intrezrc $SEQH/4.godb.files/.
    cp .intrezrc $SEQH/5.lldb.files/.
    cp .intrezrc $SEQH/6.comgenome.files/.
    cp .intrezrc $SEQH/7.mmdb.files/.
    cp .intrezrc $SEQH/8.hist.files/.
    cp .intrezrc $SEQH/9.neighbours.files/.
    cp .intrezrc $SEQH/10.rpsdb.files/.

15. Build the core module of SeqHound. Building the core module (basically all of the sequence data tables) is not optional.
The rest of the modules are optional if there is a need to spare resources or administrative effort, but the corresponding API functionality will not be present.

    cd $SEQH/1.core.files

Create the core tables in the database. Make sure the file core.sql has the line "use seqhound" close to the beginning of the file.

    mysql -u my_id -p -P my_port -h my_server < core.sql

where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password. This creates the core tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi, bioentity, bioname, secondrefs, bioentitytype, nametype, rules, fieldtype and histdb.

If you are building a full instance of the SeqHound database then run the asnftp.pl script while in the build directory:

    ./asnftp.pl

Note that any command in these instructions can be run with 'nohup' to prevent the process from ending if your connection to the machine should be lost. For example:

    nohup ./asnftp.pl &

If you only want to build a small test version of the database then manually download a single file. For example:

    ftp ftp.ncbi.nih.gov

When prompted for a name enter anonymous. When prompted for a password type myemail@home.com

    cd refseq/cumulative
    bin
    get rscu.bna.Z

(do not uncompress this file)

    bye

The asnftp.pl script downloads all of the GenBank sequence records (in binary ASN.1 format) required to make an initial build of the SeqHound core module. This script will take approximately 24 hours to run and will consume 14 GB of disk space. Note that all scripts are described in detail in section 5. Two other files are generated by this script: asn.list is a list of the sequence files that the script intends to download; asnftp.log is where the script logs error messages during execution time.
If you open another session with the machine where you are building SeqHound, you can check how far along asnftp.pl is by comparing the number of lines in the asn.list file:

    grep ".aso.gz" asn.list | wc -l

to the number of files in the build directory (the number of files actually downloaded so far):

    ls *.aso.gz | wc -l

Once asnftp has finished, these two numbers should be the same.

Run the seqhound build script. Before running this script, make certain that the .intrezrc file, in the same directory, and .odbc.ini, in your home directory, have correct configuration values (see step 10 in section 4.4 and step 14 in the current section). This parser MUST be given a single parameter that represents the release version of GenBank. You can find the release number in the file ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release.

    ./seqhound_build.sh 141

seqhound_build.sh executes the mother parser over all source files and populates tables accdb, asndb, nucprot, parti, pubseq, sendb, sengi, taxgi, bioentity, bioname, secondrefs, bioentitytype, nametype, rules and fieldtype. This will take about 75 hours. Table histdb is still empty at this stage; it is populated in step 25.

The parser mother creates a log file for every *.aso file that it parses. These log files are located in a subdirectory called "logs" and are named like "rsnc0506run", where "rsnc0506" is the name of the file that was being processed. While seqhound_build.sh is running, you can move on to steps 16-18. Once seqhound_build.sh has finished you can test that all of the files were properly processed by showing that the result of

    cd logs
    grep "Done" *run | wc -l

is the same as

    ls *run | wc -l

which is the same as

    cd ..
    ls *.aso.gz | wc -l

The seqhound_build.sh script unzips .aso.gz files before feeding them as input to the mother program. seqhound_build.sh then rezips the file after mother is done with it.
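The log comparison above can be collapsed into one command. The helper below is a hypothetical sketch, not part of SeqHound: it counts the per-file mother logs that reached "Done" against the downloaded .aso.gz files, and succeeds only when the two counts match.

```shell
# Hypothetical helper: report whether every downloaded .aso.gz file has
# a mother log in logs/ that reached "Done". Run it from the build
# directory (e.g. $SEQH/1.core.files).
check_build_progress() {
    done_count=$(grep -l "Done" logs/*run 2>/dev/null | wc -l | tr -d ' ')
    file_count=$(ls *.aso.gz 2>/dev/null | wc -l | tr -d ' ')
    echo "$done_count of $file_count files processed"
    [ "$done_count" -eq "$file_count" ]
}

# Usage:
# check_build_progress && echo "core build complete"
```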
If for some reason the build should crash part way through, you have to a) recreate the core tables using core.sql (see above), b) search for any unzipped *.aso files in the build directory and rezip them and c) restart seqhound_build.sh.

Once the seqhound_build.sh script has finished, you should move all of the *.aso.gz files into a directory where they will be out of the way:

    mkdir asofiles
    mv *.aso.gz asofiles/.

16. Build the redundb module.

    cd $SEQH/2.redund.files

Create the table redund in the database. Make sure the file redund.sql has the line "use seqhound" close to the beginning of the file.

    mysql -u my_id -p -P my_port -h my_server < redund.sql

where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password. This creates the table redund in the database.

Run the nrftp.pl script to download the FASTA nr database of proteins (ftp://ftp.ncbi.nlm.nih.gov/blast/db).

    ./nrftp.pl

nrftp.pl generates a log file "nrftp.log" that informs you what happened. If everything went ok, the last two lines should read:

    Getting nr.gz
    closing connection

A new file should appear in the build directory called "nr.gz". You will have to unpack this file by typing:

    gunzip nr.gz

Run the redund parser to make the redund table of identical protein sequences. Before running this script, make certain that the .intrezrc file in the same directory and .odbc.ini in your home directory have correct configuration values (see step 10 in section 4.4 and step 14 in the current section).

    ./redund -i nr -n F

redund generates the log file "redundlog". If everything went ok, the only line in this file should be:

    NOTE: [000.000] {redund.c, line 259} Done.

And about 3 million records will be inserted into the table redund.

17.
Build the taxdb module. Create the tables of the taxdb module in the database.

    cd $SEQH/3.taxdb.files

Make sure the file taxdb.sql has the line "use seqhound" close to the beginning of the file.

    mysql -u my_id -p -P my_port -h my_server < taxdb.sql

where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password. This creates the tables taxdb, gcodedb, divdb, del and merge in the database.

Run the taxftp.pl script to download taxonomy info from the NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).

    ./taxftp.pl

taxftp.pl generates a log file taxftp.log that informs you what happened. If everything went ok, the last two lines should read:

    Getting taxdump.tar.gz
    closing connection

A new file should appear in the build directory called taxdump.tar.gz. You will have to unpack this file by typing:

    gzip -d taxdump.tar.gz
    tar -xvf taxdump.tar

There will be seven new files:

    delnodes.dmp
    division.dmp
    gc.prt
    gencode.dmp
    merged.dmp
    names.dmp
    nodes.dmp

Run the importtaxdb parser to make the taxonomy data tables. The taxdump files must be in the same directory as this parser.

    ./importtaxdb

importtaxdb has no command line parameters. importtaxdb generates the log file importtaxdb_log.txt. If everything went ok, the output of this file should be something like:

    Program start at Thu Sep 4 13:47:51 2003
    Number of Tax ID records parsed: 191647
    Number of Tax ID Name records parsed: 246263
    Number of Division records parsed: 11
    Number of Genetic Code records parsed: 18
    Number of Deleted Node records parsed: 25475
    Number of Merged Node records parsed: 4607
    Program end at Thu Aug 12 13:49:43 2004

And records will be inserted into the tables taxdb, gcodedb, divdb, del and merge.

18. Build the GODB module. Create the tables of the godb module in the database.
    cd $SEQH/4.godb.files

Make sure the file godb.sql has the line "use seqhound" close to the beginning of the file.

    mysql -u my_id -p -P my_port -h my_server < godb.sql

where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password. This creates the tables go_parent, go_name, go_reference and go_synonym in the database.

Run the goftp.pl script to download the gene ontology files (ftp://ftp.geneontology.org/pub/go/gene-associations and ftp://ftp.geneontology.org/pub/go/ontology).

    ./goftp.pl

There is a log file for this script called goftp.log that indicates that it got all of these files. Three new files should appear in the build directory:

    component.ontology
    function.ontology
    process.ontology

Two other files also appear:

    gene_association.Compugen.GenBank.gz
    gene_association.Compugen.UniProt.gz

but these are used as input files by addgoid in the next step.

Run the goparser to make the hierarchical gene ontology data tables. The three input files must be in the same directory as this parser.

    ./goparser

goparser has no command line parameters. goparser generates the log file goparserlog. If everything went ok, the output of this file should have only one NOTE line:

    NOTE: [000.000] {goparser.c, line 101} Main: Done!

And records will be inserted into the tables go_parent, go_name, go_reference and go_synonym.

19. Build the LLDB module. Create the tables of the locus link module in the database.

    cd $SEQH/5.lldb.files

Make sure the file ll.sql has the line "use seqhound" close to the beginning of the file.

    mysql -u my_id -p -P my_port -h my_server < ll.sql

where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively.
You will be prompted to enter your password.

This creates tables ll_omim, ll_go, ll_llink, ll_cdd in the database.

Run the llftp.pl script to download the locus link template file (LL_tmpl), which is the source for the function annotation tables (ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz).

llftp.pl

This script generates the llftp.log file. If everything executes correctly, the last two lines of the file should read:

Getting LL_tmpl.gz
closing connection

A new file called LL_tmpl.gz should appear in the build directory; unpack it with:

gzip -d LL_tmpl.gz

Run the llparser to create the set of functional annotation data tables. The input file must be in the same directory as this parser.

./llparser

llparser has no command line parameters. llparser generates the log file llparserlog. At the time of writing, the output of this file will have thousands of lines like:

NOTE: [000.000] {ll_cb.c, line 654} LL_AppendRecord: No NP id. Record skipped.

(These lines are expected since many LocusLink records are not linked to specific sequence records.) They are followed by the last line of the file:

NOTE: [000.000] {llparser.c, line 90} Main: Done!

Records will be inserted into tables ll_omim, ll_go, ll_llink and ll_cdd.

Run the addgoid parser to populate the GO annotation table. This parser uses the input files that were downloaded in the GODB build step above. Copy those files to this directory:

cp ../4.godb.files/gene_association.Compugen.GenBank.gz ./
cp ../4.godb.files/gene_association.Compugen.UniProt.gz ./

The files need to be unpacked.
gunzip gene_association.Compugen.GenBank.gz
gunzip gene_association.Compugen.UniProt.gz

The input files must be in the same directory as addgoid.

./addgoid -i gene_association.Compugen.GenBank

After this parser has finished, use it to parse the other input file:

./addgoid -i gene_association.Compugen.UniProt

At the time of writing, this second input file is not parsed since cross references between Swissprot and GenBank ids are not available. This is being corrected by the dbxref module project.

addgoid MUST BE EXECUTED AFTER ALL CORE TABLES AND LLDB TABLES HAVE BEEN BUILT; the llparser makes the ll_go table into which addgoid writes. This program is dependent on tables asndb, parti, accdb and nucprot.

addgoid generates the log file addgoidlog. The output of this file will look like:

=========[ Sep 5, 2003 10:28 AM ]========================
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.
ERROR: [000.000] {addgoid.c, line 235} No GI from 100K_RAT.

This is normal. These errors are caused by the inability to find GIs for names of proteins/loci that are annotated in the GO input file. This problem is being addressed by the dbxref module. This program writes to the existing ll_go table that was generated by llparser.

20. Build the GENDB module

Change directories to the Complete Genomes directory (comgenomes).

cd $SEQH/6.comgenomes.files

Create tables of the GENDB module in the database. Make sure file gendb.sql has line use seqhound close to the beginning of the file.

mysql -u my_id -p -P my_port -h my_server < gendb.sql

Where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password.
This creates table chrom in the database.

Building the GENDB module involves several steps. To simplify the process, a perl script, comgencron_odbc.pl, groups together all of the necessary scripts or binaries for each individual step. These scripts and binaries must be present in this directory. They are:

comgencron_odbc.pl
shconfig.pm
gen_cxx
pregen.pl
gen.pl
ncbi.bacteria.pl
genftp.pl
humoasn.pl
chrom
iterateparti
humouse_build.sh
mother
comgen

Before building the GENDB module, the [crons] section in configuration file .intrezrc should be set up properly. It should look like the following. Text in italics must be changed. Variable mail should have the e-mail address where you want the message to be sent. Variable defaultrelease should have the release number of the GenBank files you used to build the core tables of the SeqHound database (see Step 15):

[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./
pathinputfilescomgen=./genfiles/
mail=your_email_addr
defaultrelease=141
pathflags=./flag/

Make a subdirectory flag where the flag file comgen_complete.flg will be saved.

mkdir flag

Run the script to build the GENDB module:

./comgencron_odbc.pl

comgencron_odbc.pl generates the flat file genff; the log files bacteria.log, chromlog, comgenlog, gen.log and iteratepartilog; a subdirectory genfiles; and many log files with the postfix run, which will be moved to a subdirectory logs. It also downloads many .asn files, which will be moved to the subdirectory genfiles. During the process, the temporary file comff and directory asn are created. They are deleted before the end of the build process. If the build process fails in the middle, they should be removed along with file genff manually.
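The manual removal after a failed run can be wrapped in a small helper. This is a minimal sketch based only on the file names listed above (comff, the asn directory and genff); the function name is our own:

```shell
#!/bin/sh
# comgen_cleanup DIR - remove the leftovers of a failed GENDB build:
# the temporary file comff, the temporary directory asn, and the
# partially written genff flat file.
comgen_cleanup() {
    rm -f "$1"/comff "$1"/genff
    rm -rf "$1"/asn
}

# Example: clean the comgenomes build directory after a failed run.
# comgen_cleanup $SEQH/6.comgenomes.files
```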
There are several lines printed on the screen during the build like:

mail = your_email_addr
pathupdates = ./
pathinputfilescomgen = ./genfiles/
defaultrelease = 141
pathflags = ./flag/
No source or subsource Plasmodium falciparum NC_03043. Update 1 chromosome type by hand.

It is OK to see the above lines. An e-mail will be sent to the address you provided to tell you whether the process succeeded or failed.

If everything went OK, you will see the last line in file comgenlog as:

NOTE: [000.000] {comgen.c, line 504} Main: Done.

The last line in file iteratepartilog as:

NOTE: [000.000] {iterateparti.c, line 170} Done.

The last line in file chromlog as:

NOTE: [000.000] {chrom.c, line 173} Done.

The last two lines in file bacteria.log as:

deleteing asn
See bacteria.results for changes to ./genff

The last two lines in file gen.log as:

Removing asn
Deleting comff

The following is a detailed explanation of the script comgencron_odbc.pl. You may skip it.

21. Generate flat file genff.

genff is a tab-delimited text file where each line in this file represents one "DNA unit" (chromosome, plasmid, extrachromosomal element etc.) belonging to a complete genome.
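Because genff is tab-delimited with exactly five fields per record (the columns are listed next), malformed lines can be caught before chrom consumes the file. A minimal sketch; the helper name is our own:

```shell
#!/bin/sh
# genff_check FILE - print each line (with its line number) that does not
# have exactly five tab-separated fields, or whose taxonomy identifier
# and chromosome identifier (columns 1 and 2) are not integers.
genff_check() {
    awk -F'\t' 'NF != 5 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ \
        { print NR ": " $0 }' "$1"
}
```

A clean run prints nothing; any output points at a line worth inspecting by hand.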
Column  Description
1       Taxonomy identifier for the genome
2       Unique integer identifier for a given chromosome
3       Type of molecule (1 for chromosome, 8 for plasmid, ...)
4       FTP file name for the genome (without the .asn extension)
5       Full name of the organism

Here is an example of several rows from genff:

305     286  8  NC_003296  Ralstonia solanacearum plasmid pGMI1000MP
258594  287  1  NC_005296  Rhodopseudomonas palustris CGA009 chromosome
781     288  1  NC_003103  Rickettsia conorii chromosome
782     289  1  NC_000963  Rickettsia prowazekii chromosome
90370   290  1  NC_003198  Salmonella typhi chromosome
90370   291  8  NC_003384  Salmonella typhi plasmid pHCM1
90370   292  8  NC_003385  Salmonella typhi plasmid pHCM2
209261  293  1  NC_004631  Salmonella typhi Ty2 chromosome

The genff flat file is generated in two steps:

a) gen.pl, which will CREATE genff using the eukaryotic complete genomes.
b) ncbi.bacteria.pl, which will UPDATE genff with bacterial complete genomes.

Both gen.pl and ncbi.bacteria.pl are dependent on pregen.pl, so pregen.pl must be in the same directory as gen.pl and ncbi.bacteria.pl when you run them.

gen.pl will back up the current genff (if it exists) as genff.backup and then create a new genff file. gen.pl will download asn files from NCBI's ftp site, extract the relevant fields (as described above) and store them as records in genff.

The data for bacterial complete genomes is written to genff by running ncbi.bacteria.pl. This perl utility will compare the data in genff to the contents of the /genomes/bacteria directory on NCBI's ftp site and then automatically update genff. ncbi.bacteria.pl will save the names of the bacteria that have been newly added to genff in a separate file called bacteria.results. You can use this file to quickly verify the results.
A sample output of bacteria.results:

***********PERFECT MATCH***********
Aeropyrum pernix

***********SEMI MATCHED NCBI BACTERIA*************
NCBI BACTERIA                 CHROMFF
----------------------------------------
Buchnera aphidicola           Buchnera sp
Buchnera aphidicola Sg        Buchnera sp

***********UNMATCHED NCBI BACTERIA*************
Agrobacterium tumefaciens C58 Cereon
Agrobacterium tumefaciens C58 UWash

Perfectly matched bacteria are already present in genff. Semi matched bacteria means that genff contains an organism that is closely related to a new organism. In the above example, Buchnera aphidicola Sg and Buchnera aphidicola were newly released and closely related to the Buchnera sp already in genff. The newly released data will have been added to genff. Unmatched bacteria are completely new organisms and will be added to genff.

Both gen.pl and ncbi.bacteria.pl will create an intermediate file called comff and a temporary directory asn. These are temporary but critical to the functionality of the perl scripts. Both gen.pl and ncbi.bacteria.pl will delete comff and asn after execution.

While running gen.pl and ncbi.bacteria.pl you may see the following on the screen:

No source or subsource Plasmodium falciparum NC_03043. Update 1 chromosome type by hand.

It means that for the specified organism, the asn file is missing the chromosome type. In such a scenario, the chromosome type will default to 1 (chromosome).

Once you have generated file genff, you will likely need to regenerate it periodically, in case some of the data in genff has changed; for example, if an organism taxid changes, it is crucial to rerun gen.pl.

Script genftp.pl downloads complete genome files from ftp://ftp.ncbi.nih.gov/genomes/*. A script called humoasn.pl must be in the same directory as genftp.pl since genftp.pl calls the script. humoasn.pl is a misnomer because the script actually processes files for human, mouse AND rat genomes.
Each of these genomes has two files called rna.asn and protein.asn. (These files are called the same thing regardless of the organism that they refer to: the only way you can tell which organism a file refers to is by looking at the directory name that it came from or by looking at the contents.) genftp.pl renames the rna.asn and protein.asn files to more specific names so they can be processed with the humoasn.pl script.

rna.asn and protein.asn files mostly contain XM and XP sequences: see for example genomes/H_sapiens/protein. The sequences in these files are "loose" bioseqs that have to be "stitched" together into bioseq sets by humoasn.pl. This allows these sequences to be processed by the mother parser in the next step. Many new *.asn files will appear in the comgenomes directory after this is run. There is no log file for this script.

a) Populate table chrom

Binary chrom is used to populate table chrom from the list of complete genomes found in genff. chrom generates the log file chromlog. This log will look something like:

============[ Sep 5, 2003 2:30 PM ]====================
NOTE: [000.000] {chrom.c, line 130} Assigned TaxId 56636.
NOTE: [000.000] {chrom.c, line 137} Assigned Kloodge 1.
NOTE: [000.000] {chrom.c, line 144} Assigned Chromfl 1.
NOTE: [000.000] {chrom.c, line 149} Assigned Access NC_000854
NOTE: [000.000] {chrom.c, line 152} Assigned Name Aeropyrum pernix.
...
NOTE: [000.000] {chrom.c, line 167} Done.

b) Delete all records from division gbchm from the tables of the core module

This step is carried out for data integrity purposes. All the records that are inserted into the core module tables are labeled as being from division gbchm. Before they are inserted, it must be ensured that no such records exist in the database. This is accomplished using binary iterateparti.
iterateparti takes the division name as a parameter and deletes all GIs that are part of that division from all of the tables in the core module.

c) Set kloodge to 0 in table taxgi

This step is also carried out for data integrity purposes. The field "kloodge" in table taxgi should be set to 0 for all records before they are updated in a later step by binary comgen.

d) Move all Apis mellifera related files to a subdirectory

The chromosome, rna and protein files of Apis mellifera are not processed at the time of writing. They are moved to a subdirectory.

e) Add records to the core module tables

Since the human, mouse and rat sequences from this source (the "Complete Genomes" directory) are not a part of the GenBank release, the records are added to the core module tables by script humouse_build.sh:

humouse_build.sh 141

This script feeds all chromosome, rna and protein files downloaded by genftp.pl to the mother parser. The mother parser makes a new division called "gbchm" (GenBank Chromosome Human and Mouse) and touches all core module tables. Log files will be created by mother for every chromosome file processed (called *run).

f) Update field kloodge in table taxgi and field name in table accdb

Parser comgen is used to label sequences as belonging to a complete genome. This program uses the files downloaded by genftp.pl and marks the complete genomes in table taxgi. This program also adds loci names into table accdb (if they are not present). comgen is dependent on the chrom table and writes to accdb and taxgi. The comgen program has to be executed after all databases are built. comgen writes to the log file comgenlog in the same directory where it is run.

22. Build the Strucdb module

Change to the mmdbdata directory.

cd $SEQH/7.mmdb.files

Create tables of the Strucdb module in the database. Make sure file strucdb.sql has line use seqhound close to the beginning of the file.
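This "use seqhound close to the beginning" check recurs for every module's .sql file, so it is worth scripting. A minimal sketch, assuming "close to the beginning" means within the first five lines; the helper name is our own:

```shell
#!/bin/sh
# has_use_seqhound SQLFILE - succeed if a line starting with
# "use seqhound" appears within the first five lines of the file.
has_use_seqhound() {
    head -n 5 "$1" | grep -q '^use seqhound'
}

# Example: warn before loading a schema that would land in the wrong
# database (replace strucdb.sql with the module's .sql file).
if [ -f strucdb.sql ] && ! has_use_seqhound strucdb.sql; then
    echo "strucdb.sql is missing 'use seqhound'" >&2
fi
```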
mysql -u my_id -p -P my_port -h my_server < strucdb.sql

Where my_id, my_port and my_server should be replaced by your userid for the database, the port of the database and the IP address or the server name of the database server respectively. You will be prompted to enter your password.

This creates tables mmdb, mmgi and domdb in the database.

Make certain that the configuration files have been properly set up. These include: .mmdbrc, .ncbirc and .intrezrc.

In file .mmdbrc, variable "Gunzip" should have a value which is the path of gunzip on the machine (change text in italics). File .mmdbrc looks like:

[MMDB]
;Database and Index required when local MMDB database is used
Database = ./
Index = mmdb.idx
Gunzip = /bin/gunzip
;
[VAST]
;Database required for local VAST fetches.
; Database = .

In file .ncbirc, variable DATA should have a value which is the path of directory ncbi/data on your machine. File .ncbirc looks like (change text in italics):

[NCBI]
ROOT=/
DATA=/my_home/compile/ncbi/data/

Copy file bstdt.val from the ncbi/data directory:

cp ~/compile/ncbi/data/bstdt.val ./

Run the mmdbftp.pl script to download the mmdb (Molecular Model Database) ASN.1 files from ftp://ftp.ncbi.nih.gov/mmdb/mmdbdata. This will take approximately 10 hours.

./mmdbftp.pl

This script writes to the mmdb.log file and records the files downloaded. Approximately 20000 *.val.gz files will appear in the mmdbdata directory after running this. The first line of the mmdb.idx index file states the number of files that should have been downloaded.

Run the cbmmdb parser to make the MMDB and MMGI datafiles. Use:

./cbmmdb -n F -m F

This program takes about 12 hours to run and writes errors to the cbmmdblog file.
After a typical run this file will contain:

============[ Nov 3, 2003 1:21 AM ]======================
ERROR: [004.001] {cbmmdb.c, line 125} Error opening MMDB id 22339
WARNING: [011.001] {cbmmdb.c, line 240} Total elapsed time: 41857 seconds
NOTE: [000.000] {cbmmdb.c, line 245} Main: Done!

Records are inserted into tables mmdb and mmgi.

Run the vastblst parser to make the DOMDB datafile.

./vastblst -n F

This program writes errors to the vastblstlog file. After a typical run this file will contain no messages, and records are inserted into table domdb. In addition, vastblst makes a FASTA datafile of domains called mmdbdom.fas in the directory where it is run.

Get the most recent nrpdb.* file from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/mmdb/nrtable/nrpdb).

Run the pdbrep parser to label representatives of nr chain sets in the domdb datatable. This parser writes to the domdb table. Use:

uncompress nrpdb*.Z
pdbrep -i nrpdb.*

Where nrpdb.* is the name of the input file set. pdbrep will write errors to the pdbreplog file in the same directory where it is run.

23. Build the Neighdb module

The sequence neighbours tables can be downloaded from ftp://ftp.blueprint.org/pub/SeqHound/NBLAST/ as MySQL database table files, as well as mysqldump output, which should be adaptable to most SQL database systems. See the readme on the ftp site for information on these files.

To incorporate the MySQL database table files into your instance of SeqHound, simply copy the files extracted from the nblastdb and blastdb archives, downloaded from the ftp site, into your seqhound database directory in your MySQL instance. To incorporate the MySQL dumps of these tables into your SeqHound instance, you need only pipe the contents of the dump (which are SQL statements) to your database server.
In the case of MySQL, simply execute:

gunzip -c seqhound.blastdb.SQLdump.YYYYMMDD.gz | mysql seqhound
gunzip -c seqhound.nblastdb.SQLdump.YYYYMMDD.gz | mysql seqhound

Be sure to fill in any required mysql options, such as username, hostname and port number.

24. Build the Rpsdb and Domname modules

The pre-computed rps-blast table and the domname table can be downloaded from ftp://ftp.blueprint.org/pub/SeqHound/RPS/ as MySQL database table files, as well as mysqldump output, which should be adaptable to most SQL database systems.

To incorporate the MySQL database table files into your instance of SeqHound, simply copy the files extracted from the rpsdb and domname archives, downloaded from the ftp site, into your seqhound database directory in your MySQL instance. To incorporate the MySQL dumps of these tables into your SeqHound instance, you need only pipe the contents of the dump (which are SQL statements) to your database server. In the case of MySQL, simply execute:

gunzip -c seqhound.rpsdb.SQLdump.YYYYMMDD.gz | mysql seqhound
gunzip -c seqhound.domname.SQLdump.YYYYMMDD.gz | mysql seqhound

Be sure to fill in any required mysql options, such as username, hostname and port number.

25. Build the histdb table.

cd $SEQH/8.hist.files
./histparser -n F

This parser populates table histdb. An entry will be generated for each of the sequences that have valid accessions in table accdb, indicating that the sequence was added on this day (when you ran histparser). This parser writes to the histparserlog. This parser requires the accdb table and will take about 15 hours to run.

26. You are done with the initial build of SeqHound.

If you did not build any of the optional modules, you will have to remember this when setting up the .intrezrc configuration file for any SeqHound application. Set module values to zero if you did not build them. See the following section of the .intrezrc configuration file.
example:

[sections]
;indicate what modules are available in SeqHound
;1 for available, 0 for not available
;gene ontology hierarchy (did you run goparser?)
godb = 1
;locus link functional annotations (did you run llparser and addgoid?)
lldb = 1
;taxonomy hierarchy (did you run importtaxdb?)
taxdb = 1
;protein sequence neighbours (did you download neighbours tables?)
neigdb = 1
;structural databases (did you run cbmmdb, vastblst and pdbrep?)
strucdb = 1
;complete genomes tracking (did you run chrom and comgen?)
gendb = 1
;redundant protein sequences (did you run redund?)
redundb = 1
;open reading frame database (currently not exported at all)
cddb = 0
;RPS-BLAST tables (did you download RPS-BLAST tables?)
rpsdb = 1

Catch up on SeqHound daily updates

27. Download all daily update files for GenBank.

Warning: there might have been a new GenBank release while you were building SeqHound; in this case you cannot get updates from ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/ any more. You have to rebuild SeqHound with a fresh GenBank release. You should check the file ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Release to make certain that it contains the same release number that was present when you started step 15.

cd $SEQH/
mkdir seqsync
cd seqsync
ftp ftp.ncbi.nih.gov

When prompted for a name, enter anonymous. When prompted for a password, type myemail@home.com.

cd ncbi-asn1
cd daily-nc
bin
prompt
mget nc*.aso.gz
bye

Do not download the con_nc*.aso.gz files from this directory. SeqHound does not use them.

28. Download all daily update files for RefSeq.

From ftp://ftp.ncbi.nih.gov/refseq/daily/ download all files past the date stamp on gbrscu.aso.gz. gbrscu.aso.gz is the latest cumulative RefSeq division which was downloaded by asnftp.pl and is located (in this example) in seqhound/build/asofiles.
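Picking out "all files past the date stamp" can be automated once the candidate files are local, by comparing modification times against the cumulative file. A minimal sketch; the paths in the example are placeholders and the helper name is our own:

```shell
#!/bin/sh
# newer_than REF DIR - list the files in DIR whose modification time is
# later than that of the reference file REF.
newer_than() {
    find "$2" -maxdepth 1 -type f -newer "$1"
}

# Example (placeholder paths): list RefSeq dailies newer than the
# cumulative division downloaded during the core build.
# newer_than seqhound/build/asofiles/gbrscu.aso.gz ./refseq-daily
```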
cd $SEQH/seqsync
ftp ftp.ncbi.nih.gov

Enter anonymous and your email address when prompted.

cd refseq
cd daily
bin
get rsnc.****.2003.bna.Z
bye

(Where **** are files with timestamps greater than gbrscu.aso.gz.)

You must uncompress all of these files and rezip them so they can be processed by the mother parser.

compress -d *.Z
gzip *.bna

29. Run update and mother on all downloaded files (excluding today's one; crons will do it in the evening).

You can use the scripts all_update.sh and all_update_rs.sh. You will also need mother, update and a properly configured .intrezrc file in the same directory as all of the daily update files.

cd $SEQH/seqsync
cp $COMPILE/slri/seqhound/scripts/all_update.sh .
cp $COMPILE/slri/seqhound/scripts/all_update_rs.sh .
cp $SEQH/1.core.files/.intrezrc .
cp $SEQH/1.core.files/mother .
cp $SEQH/1.core.files/update .

Run all_update.sh first:

./all_update.sh 141

where 141 is the release number. Run all_update_rs.sh second:

./all_update_rs.sh 141

These scripts will run the update and mother executables (consecutively) on all downloaded files present in the current directory. All daily updates in SeqHound are stored in one division called gbupd regardless of how long SeqHound runs without a core rebuild.

mother will make a log file called *run for every file that it processes. update will make two log files called *gis and *log for every file that it processes.

You can check that the two parsers have completed successfully. Each of the following queries should return the same number (the number of starting input files):

ls *aso.gz | wc -l
ls *gis | wc -l
ls nc*log | wc -l
ls nc*run | wc -l
grep Done nc*run | wc -l

Setting up daily sequence updates

30. Make a new directory from where you will run daily sequence updates. Populate this with the necessary scripts and programs.
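The five count checks from step 29 can be rolled into a single pass/fail test before you move on to the daily-update setup. A minimal sketch using the file patterns from the text above; the function name is our own, and grep -l is used so that each *run file with a Done line is counted once:

```shell
#!/bin/sh
# update_counts_ok DIR - succeed if the number of input files matches the
# number of log files produced by update and mother, and every *run log
# contains a Done line.
update_counts_ok() (
    cd "$1" || exit 1
    n=$(ls *aso.gz 2>/dev/null | wc -l)
    [ "$(ls *gis 2>/dev/null | wc -l)" -eq "$n" ] &&
    [ "$(ls nc*log 2>/dev/null | wc -l)" -eq "$n" ] &&
    [ "$(ls nc*run 2>/dev/null | wc -l)" -eq "$n" ] &&
    [ "$(grep -l Done nc*run 2>/dev/null | wc -l)" -eq "$n" ]
)

# Example:
# update_counts_ok $SEQH/seqsync && echo "all updates processed"
```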
cd $SEQH
mkdir updates
cd updates
cp $SLRI/seqhound/scripts/*cron_odbc.pl .
cp $SLRI/seqhound/scripts/shconfig.pm .
cp $SLRI/seqhound/build/odbc/redund .
cp $SLRI/seqhound/build/odbc/mother .
cp $SLRI/seqhound/build/odbc/update .
cp $SLRI/seqhound/build/odbc/precompute .
cp $SLRI/seqhound/build/odbc/isshoundon .
cp $SLRI/seqhound/build/odbc/importtaxdb .
cp $SLRI/seqhound/build/odbc/goparser .
cp $SLRI/seqhound/build/odbc/llparser .
cp $SLRI/seqhound/build/odbc/addgoid .
cp $SLRI/seqhound/build/odbc/comgen .
cp $SLRI/seqhound/build/odbc/chrom .
cp $SLRI/seqhound/scripts/genftp.pl .
cp $SLRI/seqhound/scripts/humoasn.pl .
cp $SLRI/seqhound/scripts/humouse_build.sh .
cp $SLRI/seqhound/genomes/gen_cxx .
cp $SLRI/seqhound/genomes/pregen.pl .
cp $SLRI/seqhound/genomes/gen.pl .
cp $SLRI/seqhound/genomes/ncbi.bacteria.pl .

mkdir logs
mkdir asofiles
mkdir inputfiles
mkdir genfiles
mkdir flags

31. Copy the .intrezrc config file to the updates directory and edit it.

cd $SEQH/updates
cp $SLRI/seqhound/config/.intrezrc .
cp $SEQH/1.core.files/.intrezrc .

Text in italics must be changed. In the [crons] section: variable pathupdates points to the path where the update jobs will be set up; variable pathinputfiles points to the path that holds the input files (other than *.aso.gz and *.bna.gz files from the core module and *.asn files from the gendb module); variable pathinputfilescomgen points to the path that holds the *.asn input files for the gendb module; variable mail indicates your e-mail address; variable defaultrelease is the GenBank release you built the SeqHound database with; variable pathflags points to the path that holds the flag files generated by each updating job.
[crons]
;customizable variables in cron jobs
;NOTE: all paths must end in '/'
pathupdates=./
pathinputfiles=./inputfiles/
pathinputfilescomgen=./genfiles/
mail=my_email
defaultrelease=141
pathflags=./flags/

The cron daemon may consider your home directory to be the "current directory". For this reason, the .intrezrc file should be copied to your home directory too.

cd $SEQH/updates
cp .intrezrc ~/.

32. Set up the dupdcron_odbc.pl cron job.

dupdcron_odbc.pl (daily update cron) is a PERL script that retrieves the latest GenBank and RefSeq update files from the NCBI ftp site and then passes them to "update" and "mother" where they are used to update the SeqHound data tables. Specifically, it:

a) downloads update files with today's date (nc*.aso.gz from ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc/ and rsnc*.bna.Z from ftp://ftp.ncbi.nih.gov/refseq/daily/),
b) runs update (update -i nc*.aso.gz) and then
c) runs mother (mother -i nc*.aso.gz -r version# -n F -m F -u T).

You need to know this because if you missed a few updates before setting up the cron job (and after completing the seqsync steps above) you have to run update and mother by hand using the above commands.

All scripts (like dupdcron_odbc.pl) report success or failure via email. The mailto address is set in the shconfig.pm script which you have just edited.

dupdcron_odbc.pl is the first cron job that has to be set up. Make a new text file called list_crontabs where you will list the cron jobs.

cd $SEQH/updates
pico list_crontabs

This file should have the single line:

30 22 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./dupdcron_odbc.pl

where libpath should be replaced by the correct path you set up in Step 11 for environment variable LD_LIBRARY_PATH. You can find it out by:

echo $LD_LIBRARY_PATH

This line specifies the time to run a job on a recurring basis. It consists of 6 fields separated by spaces.
The fields and allowable values are:

minute (0-59): in this case 30
hour (0-23): in this case 22
day of the month (1-31): in this case *
month (1-12): in this case *
day-of-week (0-6 where 0 is Sunday): in this case *
command to run

The above line indicates that dupdcron_odbc.pl is to be run at 10:30 PM every day of the month, every month, regardless of the day of the week. The * character is a wildcard. The actual command consists of changing to the directory where dupdcron_odbc.pl exists (this path will have to be modified depending on your set-up):

cd /seqhound/update;

and then executing the perl script:

./dupdcron_odbc.pl

After adding the above line and editing it to match your setup, close the file. To activate this crontab file, type:

crontab list_crontabs

If for some reason you want to deactivate the cron jobs, type:

crontab -r

To find out what cron jobs you have activated, type:

crontab -l

For more information on setting up cron jobs on UNIX type:

man crontab

33. Set up redundcron_odbc.pl to run daily.

cd $SEQH/updates
pico list_crontabs

Add the following line:

30 23 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./redundcron_odbc.pl

See Step 32 for the explanation of libpath. After adding the above line, edit it to match your setup and close the file. To activate this crontab file, type:

crontab list_crontabs

This script basically does three things:

a) checks if file "nr" has been updated on the ftp site ftp://ftp.ncbi.nlm.nih.gov/blast/db and, if it has, retrieves it,
b) drops table redund from the database and recreates it,
c) rebuilds table redund using the downloaded nr file and the redund parser.

34. Run precompute for the first time.
First set up the configuration file:

cd $SEQH/updates
pico .intrezrc

Edit the section under [precompute] to make it look like:

[precompute]
;precomputed taxonomy queries
MaxQueries = 0
MaxQueryTime = 10
QueryCount = 0
#path to precomputed searches has to have "/" at the end !!
path = /seqhound/precompute/
indexfile = /seqhound/precompute/index

Make sure the value of path is the absolute path of the directory precompute you made in Step 14 and the value of indexfile is the value of path plus index. Variable path is the directory that holds results of the precompute executable. indexfile is the path to the index that will be created by precompute.

Finally, run the precompute executable:

cd $SEQH/updates
./precompute -a redo

Where -a redo specifies that the program is being run for the first time. This program basically precomputes the number of proteins and nucleic acids (and their GI values) for each taxon in the taxgi table. The results of a query are stored and indexed in text files (in the directory specified by path) if the query takes longer than x seconds (where x is defined by MaxQueryTime in the above .intrezrc file). These text files are used by SeqHound API calls such as SHoundProteinsFromTaxIDIII(taxid).

35. Set up precomcron_odbc.pl to run daily.

cd $SEQH/updates
pico list_crontabs

Add the following line:

30 1 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./precomcron_odbc.pl

See Step 32 for the explanation of libpath. After adding the above line and editing it to match your setup, close the file. To activate this crontab file, type:

crontab list_crontabs

This script basically runs the command precompute -a update and updates the precomputed search results.

36. Set up isshoundoncron_odbc.pl to run daily.
cd $SEQH/updates
pico list_crontabs

Add the following line:

30 7 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./isshoundoncron_odbc.pl

See Step 32 for the explanation of libpath. After adding the above line and editing it to match your setup, close the file. To activate this crontab file, type:

crontab list_crontabs

This script basically does two things:

a) runs the executable called isshoundon. This program makes a single call to the local SeqHound API to ensure that it is working.
b) moves all log, run and gis log files into a directory called logs.

37. Set up llcron_odbc.pl to run daily.

cd $SEQH/updates
pico list_crontabs

Add the following line:

30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./llcron_odbc.pl

See Step 32 for the explanation of libpath. After adding the above line and editing it to match your setup, close the file. To activate this crontab file, type:

crontab list_crontabs

This script basically repeats the actions listed in step 19 above and re-creates the locus link tables in SeqHound. This includes:

a) getting the latest LL_tmpl.gz file from the NCBI ftp site,
b) removing the locus link tables from SeqHound,
c) running llparser,
d) getting the 2 GO annotation files from the GO ftp site,
e) running the addgoid parser on these two files.

38. Set up comgencron_odbc.pl to run daily.

cd $SEQH/updates
pico list_crontabs

Add the following line:

30 21 * * * cd /seqhound/update; LD_LIBRARY_PATH=libpath ./comgencron_odbc.pl

See Step 32 for the explanation of libpath. After adding the above line and editing it to match your setup, close the file. To activate this crontab file, type:

crontab list_crontabs

This script basically repeats the actions listed in step 20 above, re-creates the chrom table in SeqHound and updates the complete genome information in the core tables.
This includes:
a) generating a list of "DNA units" that belong to a complete genome;
b) downloading complete genome files from the NCBI FTP site;
c) rebuilding the chrom table;
d) removing all records in the core tables that belong to the division "gbchm";
e) running the script humous_build.sh to insert records into the core tables;
f) resetting the kloodge field in the taxgi table to 0 for all records;
g) updating kloodge by running the comgen parser.

39. Setting up SeqHound servers. Overview.

There are two web server applications that make up the SeqHound system:
a) wwwseekgi produces the HTML pages for the SeqHound web interface, and
b) seqrem processes requests to the SeqHound remote API.

Step 40 shows you how to find the two directories where you will set up these two applications (assuming that you are using a default installation of Apache). The two directories are called:

    cgi-bin
    htdocs

Step 40 may be skipped if you already know or have already been told where these two directories are. Steps 41 - describe the files that must be placed into these two sub-directories in order to start the wwwseekgi and seqrem servers.

40. Examining the httpd.conf file for Apache.

These instructions assume that you already have an Apache server running. Before proceeding, you must locate the directory from which executables are run (called "cgi-bin" in a default Apache set-up) and the directory that contains HTML documents (called "htdocs" in a default Apache set-up). You can find (and reset) the location of these two directories in an Apache configuration file called "httpd.conf".
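If you prefer not to scan httpd.conf by eye, the directives you will look for in this step can be pulled out with grep. The sketch below is self-contained: it creates a sample file with the default values quoted in the text, so the path assignments here are illustrative assumptions. On a real system, point the variable at your own httpd.conf (for example, /etc/apache/httpd.conf in the default set-up).

```shell
# Sketch: extract the directives needed in this step (ScriptAlias,
# DocumentRoot, User, Group) with grep instead of reading by eye.
# A sample file is created here so the example is self-contained;
# on a real system set conf to your httpd.conf location instead.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ScriptAlias /cgi-bin/ "/var/apache/cgi-bin/"
DocumentRoot "/var/apache/htdocs/"
User nobody
Group nobody
EOF
# Print only the four directives of interest.
grep -E '^(ScriptAlias|DocumentRoot|User|Group)' "$conf"
```

The same grep works on a full httpd.conf, where these directives sit among many others; note down the paths and user/group it prints, as the instructions below require.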
In a default set-up of Apache, the httpd.conf file can be reached by changing to the directory:

    cd /etc/apache

and then opening the httpd.conf file found there with a text editor such as pico:

    pico httpd.conf

To find the location of the cgi-bin directory, look for the line beginning with "ScriptAlias". In the default set-up, this line looks like this:

    ScriptAlias /cgi-bin/ "/var/apache/cgi-bin/"

In this example, the path to the cgi-bin directory is /var/apache/cgi-bin/. Write this path down, whatever it is.

To find the htdocs directory, look for the line beginning with "DocumentRoot". In the default set-up, this line looks like this:

    DocumentRoot "/var/apache/htdocs/"

In this example, the path to the htdocs directory is /var/apache/htdocs/. Write this path down, whatever it is.

Also make a note of the lines beginning with "User" and "Group" (these identify who owns the server processes). In a default Apache set-up, these lines are likely:

    User nobody
    Group nobody

Make a note of these values, whatever they are. Exit from the httpd.conf file and save your changes. If you made changes to the file, you must restart the Apache server using the command:

    /usr/apache/bin/apachectl restart

See the Trouble Shooting section at the end for more information on this.

In the steps below you will set up the SeqHound server by adding files to these two directories.

Contents of the cgi-bin and htdocs directories:

    directory    contents
    cgi-bin      the SeqHound wwwseekgi and seqrem server applications will be placed here
    htdocs       all of the static HTML pages used by the SeqHound interface will be placed here

41. Set up the cgi-bin directory.

Move to the cgi-bin directory you found in the step above. For the default set-up:

    cd /var/apache/cgi-bin/

Make a new subdirectory here called seqhound:

    mkdir seqhound
    cd seqhound

Copy the SeqHound server applications here:

    cp $COMPILE/slri/seqhound/build/odbc/seqrem .
    cp $COMPILE/slri/seqhound/build/odbc/wwwseekgi .

Also copy the following files to this directory:

    cp $COMPILE/slri/seqhound/html/seekhead.txt .
    cp $COMPILE/slri/seqhound/html/seektail.txt .
    cp $COMPILE/slri/seqhound/html/seekhead.txt pics/.
    cp $COMPILE/slri/seqhound/config/.intrezrc .
    cp $COMPILE/slri/seqhound/config/.ncbirc .

42. Edit the .ncbirc configuration file.

Open the file with a text editor such as pico. The setting for Data should contain the path to the ncbi/data directory. This directory was downloaded as part of the NCBI toolkit in Step 2.

    --------------------example .ncbirc file begins--------------------
    [NCBI]
    Data=/home/ncbi/data
    --------------------example .ncbirc file ends----------------------

43. Edit the .intrezrc configuration file.

Refer to Step 14 in the current section for setting up the .intrezrc file. The settings for username, password, dsn and database in the [datab] section should be valid for the SeqHound database you have just built, and the settings for path and indexfile in the [precompute] section should point to the valid path as in Step 34 in the current section.

Set up the index.html file for the web interface.

Move to the htdocs directory for your web server. In the default case:

    cd /var/apache/htdocs/

Make a seqhound directory here:

    mkdir seqhound
    cd seqhound

Copy the index.html page to this directory:

    cp $COMPILE/slri/seqhound/html/index.html .

Open the file in a text editor like pico and edit it so that its action points to the wwwseekgi server:

    pico index.html

then edit the line