BLAST® Command Line Applications User Manual

Last Updated: 2016 Nov 4

National Center for Biotechnology Information (US)

Bethesda (MD)

BLAST is a Registered Trademark of the National Library of Medicine

NLM Citation: BLAST® Command Line Applications User Manual [Internet]. Bethesda (MD): National
Center for Biotechnology Information (US); 2008-.

This manual documents the BLAST (Basic Local Alignment Search
Tool) command line applications developed at the National Center
for Biotechnology Information (NCBI).

Table of Contents

Contributors
Introduction
  Installation
  Dependencies
  Quick start
User Manual
  Functionality offered by BLAST+ applications
  BLAST+ features
  Configuring BLAST
  Input formats to BLAST
Cookbook
  Query a BLAST database with a GI, but exclude that GI from the results
  Create a masked BLAST database
  Search with database masking enabled
  Display BLAST search results with custom output format
  Use blastdb_aliastool to manage the BLAST databases
  Reformat BLAST reports with blast_formatter
  Extracting data from BLAST databases with blastdbcmd
  Use Windowmasker to filter the query sequence(s) in a BLAST search
  Building a BLAST database with local sequences
  Limiting a Search with a List of Identifiers
  Multiple databases vs. spaces in filenames and paths
  Specifying a sequence as the multiple sequence alignment master in psiblast
  Ignoring the consensus sequence in the multiple sequence alignment in psiblast
  Performing a DELTA-BLAST search
Appendices
  Conversion from C toolkit applications
  Exit codes
  Options for the command-line applications
  BLASTN reward/penalty values
  BLAST Substitution Matrices
  References

Contributors

Christiam Camacho: camacho@ncbi.nlm.nih.gov

Thomas Madden: madden@ncbi.nlm.nih.gov

Tao Tao: tao@ncbi.nlm.nih.gov

Richa Agarwala: richa@ncbi.nlm.nih.gov

Aleksandr Morgulis: morgulis@ncbi.nlm.nih.gov

Introduction

Created: June 23, 2008; Updated: May 26, 2016.

Sequence similarity searching is one of the more important bioinformatics activities and
often provides the first evidence for the function of a newly sequenced gene or piece of
sequence. Basic Local Alignment Search Tool (BLAST) is probably the most popular
similarity search tool. The National Center for Biotechnology Information (NCBI) first
introduced BLAST in 1989. The NCBI has continued to maintain and update BLAST
since the first version. In 2009, the NCBI introduced a new version of the stand-alone
BLAST applications (BLAST+). The BLAST+ applications have a number of
improvements that allow faster searches as well as more flexibility in output formats and
in the search input. These improvements include: splitting of longer queries so as to
reduce the memory usage and to take advantage of modern CPU architectures; use of a
database index to dramatically speed up the search; the ability to save a “search strategy”
that can be used later to start a new search; and greater flexibility in the formatting of
tabular results.

The functionality of the BLAST+ applications is organized by search type. As an example,
there is a “blastp” application that compares protein queries to protein databases. The
“blastx” application translates a nucleotide query in six frames and searches it against a
protein database. This organization is different from that of the applications first released
in 1997 (e.g., blastall) that supported all types of searches with one application, but it
resembles that of the NCBI BLAST web site. An advantage of this design is that each
application has only the options relevant to the searches it performs. Additionally, each
application can compare a query to a set of FASTA sequences in a file, bypassing the need
to create a BLAST database for small and infrequently searched sets. Finally, a “remote”
option permits each application to send off a search to the NCBI servers.

This manual has several sections. It provides brief installation instructions, a QuickStart, a
section describing BLAST+ features in more depth, a “Cook Book” section on how to
perform a number of tasks, as well as several appendices. The first appendix discusses tools
to help with the transition from the older applications (e.g., blastall) to the BLAST+
applications. The second appendix documents exit codes from the BLAST+ applications.
The third appendix is a table of BLAST options, the type of input required, and the default
values for each application. The fourth appendix lists the scoring parameters that the
blastn application supports.

An introduction to BLAST is outside the scope of this manual. More information on this
subject can be found at
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs.

Please feel free to contact us with any questions, feedback, or bug reports at
blast-help@ncbi.nlm.nih.gov.

Installation

Installation instructions are available for Windows and LINUX/UNIX. This section
provides instructions for a few cases not covered by those entries.

The BLAST+ applications are distributed both as executables and as source code. For
the executable formats we provide installers as well as tarballs; the source code is only
provided as a tarball. These are freely available at
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/. Please be sure to use the most recent
available version; this will be indicated in the file name (for instance, in the sections
below, version 2.2.18 is listed, but this should be replaced accordingly).

MacOSX

For users without administrator privileges: follow the procedure described in
http://www.ncbi.nlm.nih.gov/books/NBK52640/

For users with administrator privileges and machines running MacOSX version 10.5 or higher:
Download the ncbi-blast-2.2.18+.dmg installer and double click on it. Double click the
newly mounted ncbi-blast-2.2.18+ volume, double click on ncbi-blast-2.2.18+.pkg and
follow the instructions in the installer. By default the BLAST+ applications are installed
in /usr/local/ncbi/blast, overwriting its previous contents (an uninstaller is provided and
its use is recommended when upgrading a BLAST+ installation).

RedHat Linux

Download the appropriate *.rpm file for your platform and either install or upgrade the
ncbi-blast+ package as appropriate using the commands:

Install:

rpm -ivh ncbi-blast-2.2.18-1.x86_64.rpm

Upgrade:

rpm -Uvh ncbi-blast-2.2.18-1.x86_64.rpm

Note: one must have root privileges to run these commands. If you do not have root
privileges, please use the procedure described in
http://www.ncbi.nlm.nih.gov/books/NBK52640/

Source tarball

Use this approach to build the BLAST+ applications yourself. Download the tarball,

expand it, change directories to the newly created directory, and type the following

commands:

cd c++

./configure

cd ReleaseMT/build

make all_r

The compiled executables will be found in c++/ReleaseMT/bin. Please note that this
sequence of commands will build the applications with optimizations and with support for
multi-threading, and it may require the installation of dependencies (see following
section). If a different configuration is desired, please use the configure.orig script located
in the same directory.

Building sources in Windows

Extract the appropriate tarball (e.g.: ncbi-blast-VERSION+-x64-win64-tar.gz) and open
the appropriate MSVC solution or project file (e.g.:
c++\compilers\msvc1000_prj\static\build\ncbi_cpp.sln), build the -CONFIGURE- project,
click on “Reload” when prompted by the development environment, and then build the
-BUILD-ALL- project. The compiled executables will be found in the directory
corresponding to the build configuration selected (e.g.:
c++\compilers\msvc1000_prj\static\bin\debugdll).

Note regarding building the source code

The BLAST source tarballs are a subset of the NCBI C++ toolkit. Information on using
and compiling the NCBI C++ toolkit is available at
http://www.ncbi.nlm.nih.gov/toolkit/doc/book. Please send questions about compiling the
NCBI C++ toolkit to toolbox@ncbi.nlm.nih.gov

Dependencies

Starting with BLAST+ 2.5.0, the usage of the -remote option requires the GNUTLS
development libraries (http://www.gnutls.org/) to make a secure connection to NCBI. The
pre-compiled Linux and MacOS binaries link these statically and the Windows binaries
include the required DLLs.

If you are compiling the sources, please be sure your system has the appropriate

dependencies installed before building the BLAST+ applications.

Quick start

A BLAST search against a database requires at least a -query and -db option. The
command:

blastn -db nt -query nt.fsa -out results.out

will run a search of nt.fsa (a nucleotide sequence in FASTA format) against the nt
database, printing results to the file results.out. If “-out results.out” had been left off, the
results would have been printed to stdout (i.e., the screen). The blastn application searches
a nucleotide query against a nucleotide database.

To send the search to our servers and databases, add the -remote option:

blastn -db nt -query nt.fsa -out results.out -remote

See more about this option in the section below, BLAST+ remote service.

The BLAST+ applications print documentation when invoked with the -h or -help
option. The -h option provides abbreviated help, and the -help flag provides more
extensive documentation. For example, use -help to get a list of output options for the
-outfmt option.

Create a custom database from a multi-FASTA file of sequences with this minimal
command:

makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids

See the section below, Building a BLAST database with local sequences, for more details.

The BLAST databases are required to run BLAST locally and to support automatic
resolution of sequence identifiers. Documentation about these identifiers can be found at
http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/#ch_demo.T5. The databases
may be retrieved automatically with the update_blastdb.pl PERL script, which is included
as part of this distribution. This script will download multiple tar files for each BLAST
database volume if necessary, without having to designate each volume. For example:

./update_blastdb.pl htgs

will download all the relevant HTGs tar files (htgs.00.tar.gz, …, htgs.N.tar.gz).

The script can also compare your local copy of the database tar file(s) and only download
tar files if the date stamp has changed, reflecting a newer version of the database. This
allows the script to run on a schedule and only download tar files when needed.

Documentation for the update_blastdb.pl script can be obtained by running the script
without any arguments (perl is required).

RPS-BLAST ready databases are available at ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

The BLAST taxonomy database is required in order to print the scientific name, common
name, blast name, or super kingdom as part of the BLAST report or in a report with
blastdbcmd. The BLAST database contains only the taxid (an integer) for each entry, and
the taxonomy database allows BLAST to retrieve the scientific name etc. from a taxid. The
BLAST taxonomy database consists of a pair of files (taxdb.bti and taxdb.btd) that are
available as a compressed archive from the NCBI BLAST FTP site
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz). The update_blastdb.pl script can be used to
download and update this archive; it is recommended that the uncompressed contents of
the archive be installed in the same directory where the BLAST databases reside.
Assuming proper file permissions and that the BLASTDB environment variable contains
the path to the installation directory of the BLAST databases, the following commands
accomplish that:

# Download the taxdb archive

perl update_blastdb.pl taxdb

# Install it in the BLASTDB directory

gunzip -cd taxdb.tar.gz | (cd $BLASTDB; tar xvf - )

User Manual

Created: June 23, 2008; Updated: November 4, 2016.

Functionality offered by BLAST+ applications

e functionality oered by the BLAST+ applications has been organized by program

type, as to more closely resemble Web BLAST.

As an example, to run a search of a nucleotide query (translated “on the fly” by BLAST)
against a protein database one would use the blastx application. The blastx application will
also work in “Blast2Sequences” mode (i.e.: accept FASTA sequences instead of a BLAST
database as targets) and can also send BLAST searches over the network to the public
NCBI server if desired.

e BLAST+ package oers three categories of applications: 1.) search tools, 2.) BLAST

database tools, and 3.) sequence ltering tools. e blastn, blastp, blastx, tblastx, tblastn,

psiblast, rpsblast, and rpstblastn are considered search applications, as they execute a

BLAST search, whereas makeblastdb, blastdb_aliastool, makeproledb, and blastdbcmd

are considered BLAST database applications, as they either create or examine BLAST

databases.

There is also a new set of sequence filtering applications described in the section Sequence
filtering applications and an application to build database indices that greatly speed up
megablast in some cases (see section titled Megablast indexed searches).

BLAST+ features

Tasks

The blastn and blastp applications have a -task option. This option sets the parameters
(e.g., word-size or gap values) to typical values for a specific type of search. For example,
the “megablast” task is optimized for intraspecies comparison as it uses a large word-size,
whereas “blastn” is better suited for interspecies comparisons with a shorter word-size.
These tasks resemble the “Program Selection” section of the BLAST web pages and do not
preclude the user from setting other options to override those specified by the task. See
Appendix "Options for the command-line application" for documentation on parameter
values for different tasks. The following tasks are currently available:

Program  Task Name     Description
blastp   blastp        Traditional BLASTP to compare a protein query to a protein database
         blastp-short  BLASTP optimized for queries shorter than 30 residues
blastn   blastn        Traditional BLASTN requiring an exact match of 11
         blastn-short  BLASTN program optimized for sequences shorter than 50 bases
         megablast     Traditional megablast used to find very similar (e.g., intraspecies or closely related species) sequences
         dc-megablast  Discontiguous megablast used to find more distant (e.g., interspecies) sequences
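
As an illustration of how the table might be applied, the sketch below picks a blastn task from the query length. The 50-base cutoff comes from the table; the fallback to "megablast" for longer queries, the function name pick_blastn_task, and the file name query.fsa are assumptions made for this example only.

```shell
#!/bin/sh
# Pick a blastn task from the query length in bases.
# The 50-base cutoff is taken from the task table; using "megablast" for
# everything longer is an assumption of this sketch, not a rule from BLAST+.
pick_blastn_task() {
    if [ "$1" -lt 50 ]; then
        echo "blastn-short"
    else
        echo "megablast"
    fi
}

# Hypothetical inputs: query.fsa (one nucleotide sequence) and a database "nt".
# The search only runs when blastn is installed and the query file exists.
if command -v blastn >/dev/null 2>&1 && [ -f query.fsa ]; then
    # Count sequence letters, skipping the FASTA definition line.
    len=$(( $(grep -v '^>' query.fsa | tr -d ' \r\n' | wc -c) ))
    blastn -task "$(pick_blastn_task "$len")" -db nt -query query.fsa -out results.out
fi
```

Any option given explicitly on the command line (for example -word_size) still overrides the value implied by the chosen task.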

Megablast indexed searches

Indexing provides an alternative way to search for initial matches in nucleotide-nucleotide

searches (blastn and megablast) by pre-indexing the N-mer locations in a special data

structure, called a database index.

Using an index can improve search times significantly under certain conditions. It is most
beneficial when the queries are much shorter than the database and works best for queries
under 1 Mbases long. The advantage comes from the fact that the whole database does not
have to be scanned during the search.

Indices can capture masking information, thereby enabling search against databases

masked for repeats, low complexity, etc.

There are, however, limitations to using indexed search in blast:

• Index files are about four times larger than the blast databases. If an index does not
fit into computer operating memory, then the advantage of using it is eliminated.
• Word size must be set to 16 or more in order to use an indexed search.
• Discontiguous search is not supported.

Reference: Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schäffer AA.
Database Indexing for Production MegaBLAST Searches. Bioinformatics 2008, 24(16):
1757-64. PMID:18567917
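
The word-size limitation can be checked before an indexed search is requested. The following is a minimal sketch: the helper names, the database name mydb and the query file reads.fsa are hypothetical, and the makembindex invocation in the comment reflects our understanding of the index-building tool shipped with BLAST+ rather than anything stated in this section.

```shell
#!/bin/sh
# Indexed searches need a word size of 16 or more (see the limitations above).
index_word_size_ok() {
    [ "$1" -ge 16 ]
}

# Print the search command for a given database, query and word size,
# adding -use_index only when the word size allows it.
# Note: the command is built as a plain string, so names must not contain spaces.
indexed_megablast_cmd() {
    db=$1; query=$2; ws=$3
    cmd="blastn -task megablast -db $db -query $query -word_size $ws"
    if index_word_size_ok "$ws"; then
        cmd="$cmd -use_index true"
    fi
    echo "$cmd"
}

# Hypothetical names: database "mydb", query file "reads.fsa".
# The index itself would be built beforehand, e.g. (assumed invocation):
#   makembindex -input mydb -iformat blastdb -old_style_index false
indexed_megablast_cmd mydb reads.fsa 28
```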

BLAST search strategies

BLAST search strategies are files that encode the inputs necessary to perform a BLAST
search. The purpose of these files is to be able to seamlessly reproduce a BLAST search in
various environments (Web BLAST, command line applications, etc).

Exporting search strategies on the Web BLAST

Click on "download" next to the RID/saved strategy in the "Recent Results" or "Saved

Strategies" tabs.

Exporting search strategies with BLAST+ applications

Add the -export_search_strategy option along with a file name to the command line options.

Importing search strategies on Web BLAST

Go to the "Saved Strategies" tab, click on "Browse" to select your search strategy file, then
click on "View" to load it into the submission page.

Importing search strategies with BLAST+ applications

Add the -import_search_strategy option along with the name of the search strategy
file. Note that if provided, the -query, -db, -use_index, and -index_name command line
options will override the specifications of the search strategy file provided (no other
command line options will override the contents of the search strategy file).
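
A minimal sketch of an export followed by an import is shown below. The file names (nt.fsa, other.fsa, strategy.asn) and the run helper are hypothetical; by default the helper only prints the commands, so nothing is searched unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Echo commands by default; set DRY_RUN=0 to execute them for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical file names: nt.fsa, other.fsa, strategy.asn.
# 1) Run a search and save its strategy alongside the results.
run blastn -db nt -query nt.fsa -out results.out -export_search_strategy strategy.asn
# 2) Reuse the strategy; -query on the command line overrides the stored query.
run blastn -import_search_strategy strategy.asn -query other.fsa -out other.out
```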

Negative GI lists

Search applications support negative GI lists. This feature provides a means to exclude GIs
from a BLAST database search. The expect values in the BLAST results are based upon the
sequences actually searched and not on the underlying database. For an example, see the
cookbook.

Masking in BLAST databases

It is now possible to create BLAST databases that contain filtered sequences (also known
as masking information or masks). This filtering information can be used for soft or hard
masking of the subject sequences. For instructions on creating masked BLAST databases,
please see the cookbook.

Custom output formats for BLAST searches

The BLAST+ search command line applications support custom output formats for the
tabular and comma-separated value output formats. For more details see “outfmt” in
Appendix “Options for the command-line application” as well as the cookbook.

Custom output formats to extract BLAST database data

blastdbcmd supports custom output formats to extract data from BLAST databases via the

-outfmt command line option. For more details see the blastdbcmd options in Appendix

“Options for the command-line application” as well as the cookbook.

Improved software installation packages

The BLAST+ applications are available via Windows and MacOSX installers as well as
RPMs (source and binary) and unix tarballs. For more details about these, refer to the
installation section.

Sequence filtering applications

The BLAST+ applications include a new set of sequence filtering applications, namely
segmasker, dustmasker, and windowmasker. Segmasker is an application that identifies
and masks low complexity regions of protein sequences. The dustmasker application
provides a similar functionality for nucleotide sequences. Windowmasker uses a genome
to identify sequences represented too often to be of interest to most users. See
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/README.dustmasker and
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/README.windowmasker for more
information.

Best-Hits filtering algorithm

e Best-Hit ltering algorithm is designed for use in applications that are searching for

only the best matches for each query region reporting matches. Its -best_hit_overhang

parameter, H, controls when an HSP is considered short enough to be ltered due to

presence of another HSP. For each HSP A that is ltered, there exists another HSP B such

that the query region of HSP A extends each end of the query region of HSP B by at most

H times the length of the query region for B.

Additional requirements that must also be met in order to lter A on account of B are:

i. evalue(A) >= evalue(B)

ii. score(A)/length(A) < (1.0 – score_edge) * score(B)/length(B)

We consider 0.1 to 0.25 to be an acceptable range for the -best_hit_overhang parameter

and 0.05 to 0.25 to be an acceptable range for the -best_hit_score_edge parameter.

Increasing the value of the overhang parameter eliminates a higher number of matches,

but increases the running time; increasing the score_edge parameter removes smaller

number of hits.
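
Requirements (i) and (ii) can be evaluated by hand for a pair of HSPs. The sketch below does only that: it is not the BLAST+ implementation, the function name best_hit_rules_i_ii is hypothetical, and the overhang test on the query regions is deliberately left out.

```shell
#!/bin/sh
# best_hit_rules_i_ii EVALUE_A SCORE_A LEN_A EVALUE_B SCORE_B LEN_B SCORE_EDGE
# Prints "filter" when requirements (i) and (ii) both hold for A versus B,
# otherwise "keep". The overhang condition is not checked here.
best_hit_rules_i_ii() {
    awk -v ea="$1" -v sa="$2" -v la="$3" \
        -v eb="$4" -v sb="$5" -v lb="$6" -v edge="$7" 'BEGIN {
        # (i) A is no more significant than B.
        cond_i = ((ea + 0) >= (eb + 0))
        # (ii) score density of A is below that of B by more than score_edge.
        cond_ii = ((sa / la) < ((1.0 - edge) * sb / lb))
        verdict = (cond_i && cond_ii) ? "filter" : "keep"
        print verdict
    }'
}

# Example values chosen for illustration only.
best_hit_rules_i_ii 1e-5 40 100 1e-20 180 200 0.1
```

With a score_edge of 0.1, A is a candidate for filtering only when its score per aligned position is below 90% of B's.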

Automatic resolution of sequence identifiers

The BLAST+ search applications support automatic resolution of query and subject
sequence identifiers specified as GIs or accessions (see the cookbook section for an
example). This feature enables the user to specify one or more sequence identifiers (GIs
and/or accessions, one per line) in a file as the input to the -query and -subject command
line options.

Upon encountering this type of input, by default the BLAST+ search applications will try
to resolve these sequence identifiers in locally available BLAST databases first, then in the
BLAST databases at NCBI, and finally in Genbank (the latter two data sources require a
properly configured internet connection). These data sources can be configured via the
DATA_LOADERS configuration option and the BLAST databases to search can be
configured via the BLASTDB_PROT_DATA_LOADER and
BLASTDB_NUCL_DATA_LOADER configuration options (see the section on
Configuring BLAST).

BLAST-WindowMasker integration in BLAST+ search applications

The BLAST+ search applications support integration with the windowmasker files via the
-window_masker_taxid command line option and the WINDOW_MASKER_PATH
configuration parameter (see Configuring BLAST) or via the -window_masker_db command
line option.

In the first case, the WINDOW_MASKER_PATH configuration parameter should refer to
a directory which contains subdirectories named after NCBI taxonomy IDs (e.g.: 9606 for
human, 10090 for mouse), where the windowmasker unit counts data files should be
placed with the following naming convention: wmasker.obinary (for files generated with
the obinary format) and/or wmasker.oascii (for files generated with the oascii format). For
an example on how to create these files, please see the Cookbook. Once these
windowmasker files and the configuration file are in place, this feature can be invoked by
providing the taxonomy ID to the -window_masker_taxid command line option.

Alternatively, this feature can also be invoked by providing the path to the windowmasker
unit counts data file via the -window_masker_db command line option.

Please see the Cookbook for a usage example of this feature.
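
The directory convention described above can be expressed as a small path helper. This is a sketch only: the function name wm_counts_path, the default root directory, and the query and database names in the comments are placeholders.

```shell
#!/bin/sh
# Path where a unit counts file is expected for a taxonomy ID.
# $1 = WINDOW_MASKER_PATH directory, $2 = NCBI taxonomy ID,
# $3 = format suffix: obinary or oascii.
wm_counts_path() {
    echo "$1/$2/wmasker.$3"
}

# Hypothetical root directory; 9606 is the human taxonomy ID.
wm_root=${WINDOW_MASKER_PATH:-/home/guest/blast/db/windowmasker}
counts=$(wm_counts_path "$wm_root" 9606 obinary)
echo "expected counts file: $counts"

# With the file in place, either option enables the filter
# (query.fsa and the nt database are placeholders):
#   blastn -query query.fsa -db nt -window_masker_taxid 9606
#   blastn -query query.fsa -db nt -window_masker_db "$counts"
```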

DELTA-BLAST: A tool for sensitive protein sequence search

DELTA-BLAST uses RPS-BLAST to search for conserved domains matching a query,
constructs a PSSM from the sequences associated with the matching domains, and
searches a sequence database. Its sensitivity is comparable to that of PSI-BLAST, and it
does not require several iterations of searches against a large sequence database. See the
cookbook for more information.

Concatenation of queries

BLAST works more eciently if it scans the database once for multiple queries. is

feature is known as concatenation. It speeds up MegaBLAST searches the most as they

spend little time on tasks that consume CPU and most of the time streaming through the

database. BLASTN and discontiguous MegaBLAST searches also run faster with

concatenation, though the eect is less pronounced. BLAST+ applies concatenation on all

types of searches (e.g., also BLASTP, etc.), and it can be very benecial if the input is a

large number of queries in FASTA format. BLAST+ concatenates queries by grouping

them together until a specic number of letters (or “chunk size”) is reached.

Unfortunately, a constant chunk size for each database scan causes certain problems. For
some searches the chunk size is too large, too many letters are searched at once, and the
process consumes too much memory. Tests have shown that the number of successful
ungapped extensions performed in the preliminary stage is a good predictor of overall
memory use during a search. The BLASTN application (starting with the 2.2.28 release)
takes advantage of this insight to provide an “adaptive chunk size”. The application starts
with a low initial chunk size of 10,000 bases and records how many successful ungapped
extensions were performed during search. It adjusts the chunk size on the next database
scan with a target of performing two million extensions during the search.

Query concatenation also means that BLAST will produce no output until the first set of
concatenated queries has been processed. Some users find this disconcerting, but it is not
a problem.

BLAST+ remote service

The BLAST+ applications can also send a search to the servers at the NCBI. In this case,
the BLAST+ application is acting as a client and there is no need to install a database or
provide more than minimal computing power. The BLAST+ remote service uses the same
servers used by the NCBI BLAST website. The BLAST server can return a Request ID
(RID) as part of the results, and that RID can be used to reformat the results with the
blast_formatter or on the NCBI website. In general, the servers keep the results for an RID
for 36 hours. The BLAST+ applications will use the remote service if the -remote flag is
added to the command line. The BLAST+ remote service uses a shared resource (the
computers at the NCBI), so only one BLAST+ application should run remote searches at a
time. An example in the cookbook section demonstrates a remote search.
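
One way to respect the one-search-at-a-time guideline is to submit queries in a plain sequential loop. The sketch below assumes hypothetical query files ending in .fsa and uses a run helper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Echo commands by default; set DRY_RUN=0 to contact the NCBI servers.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Submit the given query files one after another, never in parallel,
# because the remote service is a shared resource.
remote_batch() {
    for q in "$@"; do
        # Results go to a file named after the query (a.fsa -> a.out).
        run blastn -db nt -query "$q" -out "${q%.fsa}.out" -remote
    done
}

# Hypothetical query files.
remote_batch a.fsa b.fsa
```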

Configuring BLAST

e BLAST+ search applications can be congured by means of a conguration le or

environment variables.

Configuring BLAST via configuration file

is can be accomplished with a conguration le named .ncbirc (on Unix-like platforms)

or ncbi.ini (on Windows). is is a plain text le that contains sections and key-value

pairs to specify conguration parameters. Lines starting with a semi-colon are considered

comments. e application will search for the le in the following order and locations:

1. Current working directory (*)
2. User's HOME directory (*)
3. Directory specified by the NCBI environment variable
4. The standard system directory (“/etc” on Unix-like systems, and given by the
environment variable SYSTEMROOT on Windows)

(*) Unless the NCBI_DONT_USE_LOCAL_CONFIG environment variable is defined.

e search for this le will stop at the rst location where it is found and the

congurations settings from that le will be applied. If the conguration le is not found

or if the NCBI_DONT_USE_NCBIRC environment variable is dened, the default values

will apply. e following are the possible conguration parameters that impact the BLAST

+ applications:

Conguration Parameter Species Default value

BLASTDB Path to BLAST databases. Current working

directory

The following is an example with comments describing the available parameters for
configuration:

; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/guest/blast/db
; Specifies the data sources to use for automatic resolution
; for sequence identifiers
DATA_LOADERS=blastdb
; Specifies the BLAST database to use to resolve protein sequences
BLASTDB_PROT_DATA_LOADER=custom_protein_database
; Specifies the BLAST database to use to resolve nucleotide sequences
BLASTDB_NUCL_DATA_LOADER=/home/some_user/my_nucleotide_db
; Windowmasker settings
[WINDOW_MASKER]
WINDOW_MASKER_PATH=/home/guest/blast/db/windowmasker
; end of file
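
The lookup order for the configuration file on Unix-like platforms can be sketched as a shell function. This is an approximation for illustration, not the toolkit's own code: the function name find_blast_config is hypothetical, the NCBI variable is treated as a directory (as in the numbered list), and the Windows locations are not covered.

```shell
#!/bin/sh
# Print the path of the first .ncbirc found, following the search order
# described above; print nothing when the default values would apply.
find_blast_config() {
    name=".ncbirc"
    # A defined NCBI_DONT_USE_NCBIRC disables configuration files entirely.
    if [ -n "${NCBI_DONT_USE_NCBIRC+set}" ]; then
        return 0
    fi
    # Locations 1 and 2 are skipped when NCBI_DONT_USE_LOCAL_CONFIG is defined.
    if [ -z "${NCBI_DONT_USE_LOCAL_CONFIG+set}" ]; then
        for dir in "$PWD" "$HOME"; do
            if [ -f "$dir/$name" ]; then
                echo "$dir/$name"
                return 0
            fi
        done
    fi
    # Location 3 (directory named by NCBI, if set), then location 4 (/etc).
    for dir in "${NCBI:-}" /etc; do
        if [ -n "$dir" ] && [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 0
}

cfg=$(find_blast_config)
echo "configuration file: ${cfg:-none found, default values apply}"
```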

Configuring BLAST via environment variables

Please note that the environment variables take precedence over any settings from the
NCBI configuration file.

Environment Variable          Specifies
NCBI                          Path to NCBI configuration file.
NCBI_DONT_USE_NCBIRC          If defined, no NCBI configuration file will be used.
NCBI_DONT_USE_LOCAL_CONFIG    If defined, no NCBI configuration file in the local directory or the user’s HOME directory will be used.
BLASTDB                       Path to BLAST databases.
BLASTMAT                      Path to scoring matrix files.
BATCH_SIZE                    See the “Controlling concatenation of queries” and “Memory usage” sections below.
NCBI_CONFIG__BLAST__X         Assuming X is any of the configuration parameters from the previous section, it serves the same purpose.
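The precedence described above can be sketched in shell. This is only an illustration of the lookup order for the BLASTDB setting (environment variable first, then the [BLAST] section of a configuration file, then the current working directory); it is not how the BLAST+ applications are implemented. The function names blastdb_from_config and effective_blastdb are hypothetical, and the sketch reads the single file passed to it instead of performing the full search for the configuration file.

```shell
#!/usr/bin/env bash
# Sketch only: mimics the documented precedence for the BLASTDB setting.
# blastdb_from_config and effective_blastdb are hypothetical helper names.

# Print the BLASTDB value found in the [BLAST] section of an ini-style file.
blastdb_from_config() {
  awk '
    /^[ \t]*;/  { next }                               # skip comment lines
    /^[ \t]*\[/ { in_blast = ($0 ~ /^[ \t]*\[BLAST\][ \t]*$/); next }
    in_blast && /^[ \t]*BLASTDB[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")                         # drop "BLASTDB="
      sub(/[ \t]+$/, "")                               # drop trailing blanks
      print
      exit
    }
  ' "$1"
}

# Print the directory that would be used for BLAST databases.
effective_blastdb() {
  local config_file=$1 value
  # 1. The environment variable takes precedence over the file.
  if [ -n "${BLASTDB:-}" ]; then
    printf '%s\n' "$BLASTDB"
    return
  fi
  # 2. The configuration file, unless NCBI_DONT_USE_NCBIRC is defined.
  if [ -z "${NCBI_DONT_USE_NCBIRC+x}" ] && [ -r "$config_file" ]; then
    value=$(blastdb_from_config "$config_file")
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return
    fi
  fi
  # 3. Default value: the current working directory.
  pwd
}
```

For example, `effective_blastdb "$HOME/.ncbirc"` prints the value from the file unless BLASTDB is set in the environment. The sketch deliberately ignores the NCBI_CONFIG__BLAST__X form.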

Controlling concatenation of queries

As described above, BLAST+ works more efficiently if it scans the database once for multiple queries. This feature is known as concatenation. Unfortunately, for some searches the concatenation values are not optimal: too many queries are searched at once, and the process can consume too much memory. For applications other than BLASTN (which uses an adaptive approach), it is possible to control these values by setting the BATCH_SIZE environment variable. Setting the value too low will degrade performance dramatically, so this environment variable should be used with caution.

Memory usage

The BLAST search programs can exhaust all memory on a machine if the input is too large or if there are too many hits to the BLAST database. If this is the case, please see your operating system documentation to limit the memory used by a program (e.g., ulimit on Unix-like platforms). Setting the BATCH_SIZE environment variable as described above may help.
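As a minimal sketch of combining the two measures, the helper below runs a command in a subshell with a soft virtual-memory limit and a BATCH_SIZE override, so neither setting leaks into the calling shell. The name run_limited is hypothetical, and the numbers in the usage line are placeholders: this section does not state suitable BATCH_SIZE values, so they would need to be found by experiment.

```shell
#!/usr/bin/env bash
# Sketch only: run a command with a soft cap on virtual memory and a
# BATCH_SIZE override. run_limited is a hypothetical helper name.
run_limited() {
  local max_kb=$1 batch_size=$2
  shift 2
  (
    # ulimit -v takes kilobytes; the limit applies only inside this subshell.
    ulimit -S -v "$max_kb" || echo "warning: could not set memory limit" >&2
    BATCH_SIZE=$batch_size
    export BATCH_SIZE
    "$@"
  )
}

# Usage (placeholder values and file names; requires BLAST+ on PATH):
#   run_limited 8000000 100000 blastp -db nr -query queries.fa -out hits.txt
```

If the limit is hit, the search fails instead of exhausting the machine, which is usually the easier failure to diagnose.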

Input formats to BLAST

Multiple sequence alignment

The -in_msa psiblast option provides a way to jump start psiblast from a master-slave multiple sequence alignment computed outside psiblast. The multiple sequence alignment must contain the query sequence as one of its sequences, but it need not be the first sequence. The multiple sequence alignment must be specified in a format that is derived from Clustal, but without some headers and trailers (see example below).

The rules can also be stated in words. Suppose the multiple sequence alignment has N sequences. It may be presented in one or more blocks, where each block presents a range of columns from the multiple sequence alignment. E.g., the first block might have columns 1-60, the second block might have columns 61-95, and the third block might have columns 96-128. Each block should have N rows, one row per sequence. The sequences should be in the same order in every block. Blocks are separated by one or more blank lines. Within a block there are no blank lines, and each line consists of one sequence identifier followed by some whitespace followed by characters (and gaps) for that sequence in the multiple sequence alignment. In each column, all letters must be in upper case, or all letters must be in lower case.

# Example multiple sequence alignment file

align1

------

26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgkla
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsakl
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvkniln
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaake
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrc
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrr
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqt
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspe
eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw----------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw----------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspe
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgae
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircne
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlket

26SPS9_Hs     lryagrqtealkcvaqasknrsladfekaltdy---------------------------
F57B9_Ce      alkyngsdldamkaiaaaaqkrslkdfqvafgsf--------------------------
YDL097c_Sc    akytketyqsrgidamkavaeaynnrslldfntalkqy----------------------
YMJ5_Ce       ivayqkspriiairsmadafrkrslkdfvkalaeh-------------------------
FUS6_ARATH    asglahlelkkyklaarkfldvnpelgnsyneviapqdiatygglcalasfdrselkqkv
COS41.8_Ci    kfleaaqrynelsyksaiheteqtkalekalncailapagqqrsrmlatlfkdercqllp
644879        qailtklkcaaglaelaarkykqaakclllasfdhcdfpellspsnvaiygglcalatfd
YPR108w_Sc    llslisttaalqsissltislyasdyasyfpyllety-----------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       vlavlpqnesmssleacinslylcdysgffrtladve-----------------------
KIAA0107_Hs   ilevlhslpavrqylfslyecrysvffqslavv---------------------------
F49C12.8_Hs   vqeqltggglngtlipvreylesyydchydrffiqlaale--------------------
Int-6_Mm      idnnsvssplqslqqrtwlihwslfvffnhpkgrdniidlflyqpqylnaiqtmcphilr

26SPS9_Hs     ------------------------------------------------------------
F57B9_Ce      ------------------------------------------------------------
YDL097c_Sc    ------------------------------------------------------------
YMJ5_Ce       ------------------------------------------------------------
FUS6_ARATH    idninfrnflelvpdvrelindfyssryascleylasl----------------------
COS41.8_Ci    sfgilekmfldriiksdemeefar------------------------------------
644879        rqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem--------------
YPR108w_Sc    ------------------------------------------------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       ------------------------------------------------------------
KIAA0107_Hs   ------------------------------------------------------------
F49C12.8_Hs   ------------------------------------------------------------
Int-6_Mm      ylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkklrec

26SPS9_Hs     RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMI
F57B9_Ce      PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMI
YDL097c_Sc    EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMI
YMJ5_Ce       KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMI
FUS6_ARATH    KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALI
COS41.8_Ci    QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMI
644879        KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLI
YPR108w_Sc    ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFI
eif-3p110_Hs  DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMI
T23D8.4_Ce    NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMI
YD95_Sp       VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFI
KIAA0107_Hs   EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFI
F49C12.8_Hs   SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLI
Int-6_Mm      ESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLI

26SPS9_Hs     LDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      LDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    LDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       LDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    TDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    CESRMKGHIDQIDGIVFFERRET
644879        LEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    PNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs  INEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    IQEELSATLDEPTDCLIMHRVEP
YD95_Sp       PDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   AAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      RNARLDAKIDSKLGHVVMGNNAV
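The block rules above can be checked mechanically before a file is handed to psiblast. The following is a rough sketch, not part of BLAST+: check_msa_blocks is a hypothetical helper that verifies only that every block has the same number of rows and that identifiers appear in the same order in each block. It assumes the file contains nothing but alignment rows of the form "identifier, whitespace, residues" plus blank separator lines, and it does not check the per-column upper/lower case rule.

```shell
#!/usr/bin/env bash
# Sketch only: structural check of a block-formatted multiple sequence
# alignment. check_msa_blocks is a hypothetical helper name.
check_msa_blocks() {
  awk '
    # Close the current block and compare its row count with the first block.
    function end_block() {
      if (rows == 0) return
      blocks++
      if (blocks == 1) n = rows
      else if (rows != n) {
        printf "block %d has %d rows, expected %d\n", blocks, rows, n
        bad = 1
      }
      rows = 0
    }
    NF == 0 { end_block(); next }            # blank line separates blocks
    NF != 2 {
      printf "line %d: expected identifier and residues\n", NR
      bad = 1
      next
    }
    {
      rows++
      if (blocks == 0) id[rows] = $1         # remember order from first block
      else if (id[rows] != $1) {
        printf "line %d: identifier %s is out of order\n", NR, $1
        bad = 1
      }
    }
    END {
      end_block()
      if (!bad) printf "ok: %d sequences in %d blocks\n", n, blocks
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

The function exits with a non-zero status when a problem is found, so it can guard a psiblast call in a script, e.g. `check_msa_blocks align1.txt && psiblast -in_msa align1.txt -db nr` (file and database names here are placeholders).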

Cookbook

Created: June 23, 2008; Updated: May 26, 2016.

Query a BLAST database with a GI, but exclude that GI from the results

Extract a GI from the ecoli database:

$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \

tee exclude_me

1786181

Run the restricted database search, which shows there are no self-hits:

$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \

-query exclude_me | grep `cat exclude_me`

Query= gi|1786181|gb|AE000111.1|AE000111

Create a masked BLAST database

Creating a masked BLAST database is a two-step process:

a. Generate the masking data using a sequence filtering utility like windowmasker or dustmasker

b. Generate the actual BLAST database using makeblastdb

For both steps, the input file can be a text file containing sequences in FASTA format, or an existing BLAST database created using makeblastdb. We will provide examples for both scenarios.

Collect mask information files

For nucleotide sequence data in FASTA files or BLAST database format, we can generate the mask information files using windowmasker or dustmasker. Windowmasker masks the over-represented sequence data and it can also mask the low complexity sequence data using the built-in dust algorithm (through the -dust option). To mask low-complexity sequences only, we will need to use dustmasker.

For protein sequence data in FASTA files or BLAST database format, we need to use segmasker to generate the mask information file.

The following examples assume that BLAST databases, listed in “Obtaining sample data for this cookbook entry”, are available in the current working directory. Note that you should use sequence id parsing consistently. In all our examples, we enable this function by including “-parse_seqids” in the command line arguments.

Create masking information using dustmasker

We can generate the masking information with dustmasker using a single command line:

$ dustmasker -in hs_chr -infmt blastdb -parse_seqids \

-outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Here we specify that the input is a BLAST database named hs_chr (-in hs_chr -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the output file hs_chr_dust.asnb (-out hs_chr_dust.asnb).

If the input is the original FASTA file, hs_chr.fa, we need to change the -in and -infmt options as follows:

$ dustmasker -in hs_chr.fa -infmt fasta -parse_seqids \

-outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Create masking information using windowmasker

To generate the masking information using windowmasker from the BLAST database hs_chr, we first need to generate a counts file:

$ windowmasker -in hs_chr -infmt blastdb -mk_counts \

-parse_seqids -out hs_chr_mask.counts -sformat obinary

Here we specify the input BLAST database (-in hs_chr -infmt blastdb), request it to generate the counts (-mk_counts) with sequence id parsing (-parse_seqids), and save the output to a file named hs_chr_mask.counts (-out hs_chr_mask.counts).

To use the FASTA file hs_chr.fa to generate the counts, we need to change the input file name and format:

$ windowmasker -in hs_chr.fa -infmt fasta -mk_counts \

-parse_seqids -out hs_chr_mask.counts -sformat obinary

With the counts file we can then proceed to create the file containing the masking information as follows:

$ windowmasker -in hs_chr -infmt blastdb -ustat hs_chr_mask.counts \

-outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Here we need to use the same input (-in hs_chr -infmt blastdb) and the output of step 1 (-ustat hs_chr_mask.counts). We set the mask file format to binary asn.1 (-outfmt maskinfo_asn1_bin), enable sequence id parsing (-parse_seqids), and save the masking data to hs_chr_mask.asnb (-out hs_chr_mask.asnb).

To use the FASTA file hs_chr.fa, we change the input file name and file type:

$ windowmasker -in hs_chr.fa -infmt fasta -ustat hs_chr_mask.counts \

-outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Create masking information using segmasker

We can generate the masking information with segmasker using a single command line:

$ segmasker -in refseq_protein -infmt blastdb -parse_seqids \

-outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Here we specify the refseq_protein BLAST database (-in refseq_protein -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the output file refseq_seg.asnb (-out refseq_seg.asnb).

If the input is a FASTA file, we need to change the command line to specify the input format:

$ segmasker -in refseq_protein.fa -infmt fasta -parse_seqids \

-outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Extract masking information from FASTA sequences with lowercase masking

We can also extract the masking information from a FASTA sequence file with lowercase masking (generated by various means) using the convert2blastmask utility. An example command line follows:

$ convert2blastmask -in hs_chr.mfa -parse_seqids -masking_algorithm repeat \

-masking_options "repeatmasker, default" -outfmt maskinfo_asn1_bin \

-out hs_chr_mfa.asnb

Here we specify the input file (-in hs_chr.mfa), enable parsing of sequence ids (-parse_seqids), specify the masking algorithm name (-masking_algorithm repeat) and its parameter (-masking_options “repeatmasker, default”), and ask for asn.1 output (-outfmt maskinfo_asn1_bin) to be saved in the specified file (-out hs_chr_mfa.asnb).

Create BLAST database with the masking information

Using the masking information data files generated in the previous 4 steps, we can create a BLAST database with masking information incorporated.

Note: we should use “-parse_seqids” in a consistent manner – either use it in both steps or

not use it at all.

Create BLAST database with masking information using an existing BLAST database or FASTA sequence file as input

For example, we can use the following command line to apply the masking information,

created above, to the existing BLAST database generated in Obtaining sample data for this

cookbook entry:

$ makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \

-mask_data hs_chr_mask.asnb -out hs_chr -title \

"Human Chromosome, Ref B37.1"

Here, we use the existing BLAST database as the input file (-in hs_chr), specify its input and molecule type (-input_type blastdb -dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mask.asnb), and name the output database with the same base name (-out hs_chr), overwriting the existing one.

To use the original FASTA sequence file (hs_chr.fa) as the input, we need to use “-in hs_chr.fa” to instruct makeblastdb to use that FASTA file instead.
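The windowmasker steps and the makeblastdb step shown above can be chained in a small script. The sketch below only strings together the commands already given in this entry; build_masked_db and run are hypothetical helper names, and the DRY_RUN switch is an addition of this sketch so the command lines can be inspected before anything is overwritten.

```shell
#!/usr/bin/env bash
# Sketch only: chain the windowmasker and makeblastdb commands from this
# cookbook entry. build_masked_db and run are hypothetical helper names.

# With DRY_RUN=1 a command is printed instead of executed
# (printing drops the shell quoting around arguments such as the title).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# build_masked_db DB TITLE: add windowmasker masking to nucleotide database DB.
build_masked_db() {
  local db=$1 title=$2
  # Step a, part 1: collect the counts
  run windowmasker -in "$db" -infmt blastdb -mk_counts -parse_seqids \
      -out "${db}_mask.counts" -sformat obinary || return
  # Step a, part 2: turn the counts into masking data
  run windowmasker -in "$db" -infmt blastdb -ustat "${db}_mask.counts" \
      -outfmt maskinfo_asn1_bin -parse_seqids -out "${db}_mask.asnb" || return
  # Step b: rebuild the database with the masking data attached
  run makeblastdb -in "$db" -input_type blastdb -dbtype nucl -parse_seqids \
      -mask_data "${db}_mask.asnb" -out "$db" -title "$title"
}

# Usage:
#   DRY_RUN=1 build_masked_db hs_chr "Human Chromosome, Ref B37.1"   # preview
#   build_masked_db hs_chr "Human Chromosome, Ref B37.1"             # run
```

Stopping at the first failing step (`|| return`) avoids rebuilding the database from incomplete masking data.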

We can check the “re-created” database to find out if the masking information was added properly, using blastdbcmd with the following command line:

$ blastdbcmd -db hs_chr -info

This command prints out a summary of the target database:

Database: human chromosomes, Ref B37.1

24 sequences; 3,095,677,412 total bases

Date: Aug 13, 2009 3:02 PM Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

Algorithm ID Algorithm name Algorithm options

30 windowmasker

Volumes:

/export/home/tao/blast_test/hs_chr

Extra lines under “Available filtering algorithms …” describe the masking algorithms available. The “Algorithm ID” field, 30 in our case, is what we need to use if we want to invoke database soft masking during an actual search through the “-db_soft_mask” parameter.
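In a script, the Algorithm ID can be picked out of the summary instead of being read by eye. The helper below is a sketch under the assumption that the report keeps the layout shown in this section (a header row starting with “Algorithm ID”, one row per algorithm, then “Volumes:”); mask_algorithm_ids is a hypothetical name, not a BLAST+ tool.

```shell
#!/usr/bin/env bash
# Sketch only: print "ID name" pairs from `blastdbcmd -info` output on stdin.
# mask_algorithm_ids is a hypothetical helper name; the parsing assumes the
# report layout shown in this section.
mask_algorithm_ids() {
  awk '
    /^[ \t]*Algorithm ID/ { in_table = 1; next }   # header row of the table
    /^[ \t]*Volumes:/     { in_table = 0 }         # the table ends here
    in_table && $1 ~ /^[0-9]+$/ { print $1, $2 }   # numeric ID, then name
  '
}

# Usage (requires BLAST+ and the hs_chr database):
#   blastdbcmd -db hs_chr -info | mask_algorithm_ids
```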

We can apply additional masking data to an existing BLAST database with one type of

masking information already added. For example, we can apply the dust masking

generated above to the database generated earlier by using this command line:

$ makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \

-mask_data hs_chr_dust.asnb -out hs_chr -title "Human Chromosome, Ref B37.1"

Here, we use the existing database as the input file (-in hs_chr), specify its input and molecule type (-input_type blastdb -dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the dust masking data (-mask_data hs_chr_dust.asnb), and name the database with the same base name (-out hs_chr), overwriting the existing one.

Checking the “re-generated” database with blastdbcmd:

$ blastdbcmd -db hs_chr -info

we can see that both sets of masking information are available:

Database: Human Chromosome, Ref B37.1

24 sequences; 3,095,677,412 total bases

Date: Aug 25, 2009 4:43 PM Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

Algorithm ID Algorithm name Algorithm options

11 dust window=64; level=20; linker=1

30 windowmasker

Volumes:

/net/gizmo4/export/home/tao/blast_test/hs_chr

A more straightforward approach is to apply multiple sets of masking information in a single makeblastdb run by providing the masking data files in a comma-delimited list:

$ makeblastdb -in hs_chr -input_type blastdb -dbtype nucl -parse_seqids \

-mask_data hs_chr_dust.asnb,hs_chr_mask.asnb -out hs_chr

Create a protein BLAST database with masking information

We can use the masking data file generated in “Create masking information using segmasker” to create a protein BLAST database:

$ makeblastdb -in refseq_protein -input_type blastdb -dbtype prot -parse_seqids \

-mask_data refseq_seg.asnb -out refseq_protein -title \

"RefSeq Protein Database"

Using blastdbcmd, we can check the database thus generated:

$ blastdbcmd -db refseq_protein -info

This produces the following summary, which includes the masking information:

Database: RefSeq Protein Database

7,044,477 sequences; 2,469,203,411 total residues

Date: Sep 1, 2009 10:50 AM Longest sequence: 36,805 residues

Available filtering algorithms applied to database sequences:

Algorithm ID Algorithm name Algorithm options

21 seg window=12; locut=2.2; hicut=2.5

Volumes:

/export/home/tao/blast_test/refseq_protein2.00

/export/home/tao/blast_test/refseq_protein2.01

/export/home/tao/blast_test/refseq_protein2.02

Create a nucleotide BLAST database using the masking information extracted from lower case masked FASTA file

We use the following command line:

$ makeblastdb -in hs_chr.mfa -input_type fasta -dbtype nucl -parse_seqids \

-mask_data hs_chr_mfa.asnb -out hs_chr_mfa -title "Human chromosomes (mfa)"

Here we use the lowercase masked FASTA sequence file as input (-in hs_chr.mfa), specify its file type (-input_type fasta), specify the database as nucleotide (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mfa.asnb), and name the resulting database hs_chr_mfa (-out hs_chr_mfa).

Checking the database thus generated using blastdbcmd, we have:

Database: Human chromosomes (mfa)

24 sequences; 3,095,677,412 total bases

Date: Aug 26, 2009 11:41 AM Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

Algorithm ID Algorithm name Algorithm options

40 repeat repeatmasker lowercase

Volumes:

/export/home/tao/hs_chr_mfa

The algorithm name and algorithm options are the values we provided in “Extract masking information from FASTA sequences with lowercase masking”.

Obtaining Sample data for this cookbook entry

For input nucleotide sequences, we use the BLAST database generated from a FASTA input file hs_chr.fa, containing complete human chromosomes from BUILD38, generated by inflating and combining the hs_ref_*.fa.gz files located at:

ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/

We use this command line to create the BLAST database from the input nucleotide

sequences:

$ makeblastdb -in hs_chr.fa -dbtype nucl -parse_seqids \

-out hs_chr -title "Human chromosomes, Ref B38"

For input nucleotide sequences with lowercase masking, we use the FASTA file hs_chr.mfa, containing the complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.mfa.gz files located in the same FTP directory.

For input protein sequences, we use the preformatted refseq_protein database from the NCBI blast/db/ FTP directory:

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.00.tar.gz

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.01.tar.gz

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.02.tar.gz

Search with database masking enabled

Database masking has two modes. The first is known as "soft-masking", and BLAST uses the database mask only during the (initial) word-finding phase of BLAST. The second is known as "hard-masking", and BLAST uses the database mask during all phases of the search. Here, we look at both types of masking.

To enable database masking during a BLAST search, we use the -info parameter of blastdbcmd to discover the masking Algorithm ID. For the database generated in the previous cookbook entry, we can use the following command line to activate the windowmasker soft masking:

$ blastn -query HTT_gene -task megablast -db hs_chr -db_soft_mask 30 \

-outfmt 7 -out HTT_megablast_softmask.out -num_threads 4

Here, we search a nucleotide query, HTT_gene* (-query HTT_gene), with the megablast algorithm (-task megablast) against the database hs_chr (-db hs_chr). We use soft masking (-db_soft_mask 30), set the result format to tabular output (-outfmt 7), and save the result to a file named HTT_megablast_softmask.out (-out HTT_megablast_softmask.out). We also activated the multi-threaded feature of blastn to speed up the search by using 4 CPUs$ (-num_threads 4).

For the database generated in the previous cookbook entry, we can use the following

command line to activate the windowmasker hard masking:

$ blastn -query HTT_gene -task megablast -db hs_chr -db_hard_mask 30 \

-outfmt 7 -out HTT_megablast_hardmask.out -num_threads 4

The options are similar to the ones for soft masking, except that we use -db_hard_mask rather than -db_soft_mask. Additionally, we changed the name of the output file.

Hard masking is much more aggressive than soft masking. In interspersed or simple repeats, soft masking normally provides the best results. Hard masking may be warranted to remove vector or other contamination from the BLAST results.

*This is a genomic fragment containing the HTT gene from human, including 5 kb up- and down-stream of the transcribed region. It is represented by NG_009378.

$ The number to use in your run will depend on the number of CPUs your system has.

In a test run on a 64-bit Linux machine, the search with soft masking took about 1.5 seconds real time, and the search with hard masking took about 2.5 seconds real time. The search without database masking took about 31 minutes.

Display BLAST search results with custom output format

The -outfmt option permits formatting arbitrary fields from the BLAST tabular format. Use the -help option on the command-line application (e.g., blastn) to see the supported fields. The -max_target_seqs option should be used with any tabular output to control the number of matches reported.

Example of custom output format

The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown:

$ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send"

# BLASTN 2.2.18+

# Query: gi|1786181|gb|AE000111.1|AE000111

# Database: ecoli

# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end

# 85 hits found

AE000111 AE000111 0.0 1 10596 1 10596

AE000111 AE000174 8e-30 5565 5671 6928 6821

AE000111 AE000394 1e-27 5587 5671 135 219

AE000111 AE000425 6e-26 5587 5671 8552 8468

AE000111 AE000171 3e-24 5587 5671 2214 2130
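Because the custom format is plain whitespace-separated text, it is easy to post-process. As a small sketch, the helper below keeps only hits at or below an e-value cutoff; filter_by_evalue is a hypothetical name, and it treats column 3 as the e-value only because of the field order requested in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: keep hits at or below an e-value cutoff from tabular output
# produced with -outfmt "7 qacc sacc evalue qstart qend sstart send".
# filter_by_evalue is a hypothetical helper name.
filter_by_evalue() {
  awk -v max="$1" '
    /^#/ { next }                                  # skip comment lines
    NF >= 3 && ($3 + 0) <= (max + 0) { print $2, $3 }   # subject acc., evalue
  '
}

# Usage (requires BLAST+ and the ecoli database):
#   echo 1786181 | blastn -db ecoli \
#     -outfmt "7 qacc sacc evalue qstart qend sstart send" | filter_by_evalue 1e-25
```

Applied to the five hits shown above with a cutoff of 1e-25, this keeps the first four and drops the 3e-24 hit.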

Trace-back operations (BTOP)

The “Blast trace-back operations” (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1.) a number with a count of matching letters, 2.) two letters showing a mismatch (e.g., “AG” means A was replaced by G), or 3.) a dash (“-”) and a letter showing a gap. The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.

$ blastn -query test_q.fa -subject test_s.fa -dust no -outfmt "6 qseqid sseqid btop" -parse_deflines

query1 q_multi 7AG39

query1 q_multi 7A-39

query1 q_multi 6-G-A41

$ blastn -query test_q.fa -subject test_s.fa -dust no -parse_deflines

BLASTN 2.2.24+

Query= query1

Length=47

Subject=

Length=142

Score = 82.4 bits (44), Expect = 9e-22

Identities = 46/47 (97%), Gaps = 0/47 (0%)

Strand=Plus/Plus

Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  47  ACGTCCGGGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  93

Score = 80.5 bits (43), Expect = 3e-21

Identities = 46/47 (97%), Gaps = 1/47 (2%)

Strand=Plus/Plus

Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  1   ACGTCCG-GACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  46

Score = 78.7 bits (42), Expect = 1e-20

Identities = 47/49 (95%), Gaps = 2/49 (4%)

Strand=Plus/Plus

Query  1   ACGTCC--GAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||  |||||||||||||||||||||||||||||||||||||||||
Sbjct  94  ACGTCCGAGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  142
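To make the three kinds of BTOP operations concrete, the sketch below splits a BTOP string into one operation per line. btop_decode is a hypothetical helper, not part of BLAST+. It reads each two-character operation as “query letter, subject letter”, which is an interpretation based on comparing the BTOP strings above with the alignments in the box (for example, 7A-39 corresponds to the alignment where query A faces a gap in the subject).

```shell
#!/usr/bin/env bash
# Sketch only: split a BTOP string into one operation per line.
# btop_decode is a hypothetical helper name. Two-character operations are
# read as <query letter><subject letter>, based on the examples in this section.
btop_decode() {
  awk -v s="$1" 'BEGIN {
    while (length(s) > 0) {
      if (match(s, /^[0-9]+/)) {
        # a run of digits: that many matching letters
        printf "match %s\n", substr(s, 1, RLENGTH)
        s = substr(s, RLENGTH + 1)
      } else {
        q = substr(s, 1, 1)
        t = substr(s, 2, 1)
        if (q == "-")      printf "gap-in-query subject=%s\n", t
        else if (t == "-") printf "gap-in-subject query=%s\n", q
        else               printf "mismatch query=%s subject=%s\n", q, t
        s = substr(s, 3)
      }
    }
  }'
}

# Usage:
#   btop_decode 7AG39
#   btop_decode 6-G-A41
```

For 6-G-A41 the decoded operations add up to 6 + 2 + 41 = 49 alignment columns, which agrees with the third alignment in the report above.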

Use blastdb_aliastool to manage the BLAST databases

Oen we need to search multiple databases together or wish to search a specic subset of

sequences within an existing database. At the BLAST search level, we can provide multiple

database names to the “-db” parameter, or to provide a GI le specifying the desired

subset to the “-gilist” parameter. However for these types of searches, a more convenient

way to conduct them is by creating virtual BLAST databases for these. Note: When

combining BLAST databases, all the databases must be of the same molecule type. e

following examples assume that the two databases as well as the GI le are in the current

working directory.

Aggregate existing BLAST databases

To combine the two nematode nucleotide databases, named “nematode_mrna” and

“nematode_genomic", we use the following command line:

$ blastdb_aliastool -dblist "nematode_mrna nematode_genomic" -dbtype nucl \

-out nematode_all -title "Nematode RefSeq mRNA + Genomic"

Create a subset of a BLAST database

The nematode_mrna database contains RefSeq mRNAs for several species of round worms. The best subset is from C. elegans. In most cases, we want to search this subset instead of the complete collection. Since the database entries are from NCBI nucleotide databases and the database is formatted with “-parse_seqids”, we can use the “-gilist c_elegance_mrna.gi” parameter/value pair to limit the search to the subset of interest. Alternatively, we can create a subset of the nematode_mrna database as follows:

$ blastdb_aliastool -db nematode_mrna -gilist c_elegance_mrna.gi -dbtype \

nucl -out c_elegance_mrna -title "C. elegans refseq mRNA entries"

Note: one can also specify multiple databases using the -db parameter of blastdb_aliastool.

Reformat BLAST reports with blast_formatter

It may be helpful to view the same BLAST results in different formats. A user may first parse the tabular format looking for matches meeting certain criteria, then go back and examine the relevant alignments in the full BLAST report. They may also first look at pairwise alignments, then decide to use a query-anchored view. Viewing a BLAST report in different formats has been possible on the NCBI BLAST web site since 2000, but has not been possible with stand-alone BLAST runs. The blast_formatter allows this, if the original search produced blast archive format using the -outfmt 11 switch. The query sequence, the BLAST options, the masking information, the name of the database, and the alignment are written out as ASN.1 (a structured format similar to XML). The -max_target_seqs option should be used to control the number of matches recorded in the alignment. The blast_formatter reads this information and formats a report. The BLAST database used for the original search must be available, or the sequences need to be fetched from the NCBI, assuming the database contains sequences in the public dataset. The box below illustrates the procedure. A blastn run first produces the BLAST archive format, and the blast_formatter then reads the file and produces tabular output.

Blast_formatter will format stand-alone searches performed with an earlier version of a database if both the search and formatting databases are prepared so that fetching by sequence ID is possible. To enable fetching by sequence ID use the -parse_seqids flag when running makeblastdb, or (if available) download preformatted BLAST databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ using update_blastdb.pl (provided as part of the BLAST+ package). Currently the blast archive format and blast_formatter do not work with database-free searches (i.e., -subject rather than -db was used for the original search).

$ echo 1786181 | blastn -db ecoli -outfmt 11 -out out.1786181.asn

$ blast_formatter -archive out.1786181.asn -outfmt "7 qacc sacc evalue qstart qend sstart send"

# BLASTN 2.2.24+

# Query: gi|1786181|gb|AE000111.1|AE000111 Escherichia coli K-12 MG1655 section 1 of 400

# Database: ecoli

# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end

# 85 hits found

AE000111 AE000111 0.0 1 10596 1 10596

AE000111 AE000174 8e-30 5565 5671 6928 6821

AE000111 AE000394 1e-27 5587 5671 135 219

AE000111 AE000425 6e-26 5587 5671 8552 8468

AE000111 AE000171 3e-24 5587 5671 2214 2130

AE000111 AE000171 1e-23 5587 5670 10559 10642

AE000111 AE000376 1e-22 5587 5675 129 42

AE000111 AE000268 1e-22 5587 5671 6174 6090

AE000111 AE000112 1e-22 10539 10596 1 58

AE000111 AE000447 5e-22 5587 5670 681 598

AE000111 AE000344 6e-21 5587 5671 4112 4196

AE000111 AE000490 2e-20 5584 5671 4921 4835

AE000111 AE000280 2e-20 5587 5670 12930 12847

Extracting data from BLAST databases with blastdbcmd

Extract lowercase masked FASTA from a BLAST database with masking information

If a BLAST database contains masking information, this can be extracted as lowercase masked FASTA using the blastdbcmd option -mask_sequence_with, passing one of the Algorithm IDs reported by -info, as follows:

$ blastdbcmd -info -db mask-data-db

Database: Mask data test

10 sequences; 12,609 total residues

Date: Feb 17, 2009 5:10 PM Longest sequence: 1,694 residues

Available filtering algorithms applied to database sequences:

Algorithm ID Algorithm name Algorithm options

20 seg default options used

40 repeat -species Desmodus_rotundus

Volumes:

mask-data-db

$ blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837

>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 [Ustilago maydis 521]

MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPPPSTHG

GPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavQklslANEaawvsIGsaaetm

EdydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdTWGSMGHCYLMMDDLQRAYTAYQQ

ALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQQNKFPASLECFRYILDNPPR

PLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAGFNNQERAIQFLTKSLESDPNDAQSW

YLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDALDAYSRAIRLNPYISEVWFDLGSLYEACNN

QISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVPQDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLV

GPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGTDDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPG

PPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPPHVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAG

PGGPPPPLDHYGRPMGGPMSEREREMEWEREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVY

GRDEGRDYYNNSHPGSGPGGPRGGYERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFR

GPPPGLGRPTPDWERTRAGEYGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSP

WSAKGGVAAPGKNGKASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLA

ARAPSSRMVDEDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPED

SYKRAKSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD

GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIHEDPID

APAKRADEDGAK
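To check which residues came out masked, the lowercase runs in a record like the one above can be turned back into coordinate ranges. The following is a minimal sketch, not part of BLAST+: the function name masked_runs is hypothetical, it handles a single FASTA record on standard input, and the zero-based start with exclusive end is a convention chosen for this sketch.

```shell
# Print the lowercase (masked) runs of one FASTA record read from stdin
# as "start-end;" pairs (zero-based start, exclusive end).
# masked_runs is a hypothetical helper name, not a BLAST+ command.
masked_runs() {
  awk '
    /^>/ { next }                       # skip the definition line
    { seq = seq $0 }                    # join wrapped sequence lines
    END {
      n = length(seq); start = -1; out = ""
      for (i = 1; i <= n; i++) {
        c = substr(seq, i, 1)
        low = (c != toupper(c))         # true for lowercase letters
        if (low && start < 0) start = i - 1
        if (!low && start >= 0) { out = out start "-" (i - 1) ";"; start = -1 }
      }
      if (start >= 0) out = out start "-" n ";"
      print out
    }'
}

# Possible usage (requires blastdbcmd and the database from the example):
# blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837 | masked_runs
```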


Extract all human sequences from the nr database

Although one cannot select GIs by taxonomy directly from a database, a combination of Unix command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \

awk ' { if ($2 == 9606) { print $1 } } ' | \

blastdbcmd -db nr -entry_batch - -out human_sequences.txt

e rst blastdbcmd invocation produces 2 entries per sequence (GI and taxonomy ID),

the awk command selects from the output of that command those sequences which have a

taxonomy ID of 9606 (human) and prints its GIs, and nally the second blastdbcmd

invocation uses those GIs to print the sequence data for the human sequences in the nr

database.
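The awk step can be generalized so that the taxonomy ID is a parameter rather than a hard-coded value. This is a minimal sketch under the assumption that the input has the two-column "GI taxid" layout produced by -outfmt "%g %T"; the function name gis_for_taxid is hypothetical.

```shell
# Read "GI taxid" lines on stdin and print the GI of every line whose
# taxonomy ID equals the first argument.
# gis_for_taxid is a hypothetical helper name, not a BLAST+ command.
gis_for_taxid() {
  awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Possible usage (requires blastdbcmd and the nr database); 10090 is the
# taxonomy ID for mouse (Mus musculus):
# blastdbcmd -db nr -entry all -outfmt "%g %T" | gis_for_taxid 10090 | \
#   blastdbcmd -db nr -entry_batch - -out mouse_sequences.txt
```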

Custom data extraction and formatting from a BLAST database

The following examples show how to extract selected information from a BLAST database and how to format it:

Extract the accession, sequence length, and masked locations for GI 71022837:

$ blastdbcmd -entry 71022837 -db Test/mask-data-db -outfmt "%a %l %m"

XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216;
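The masked locations can be summarized with a short awk filter. Comparing the ranges above with the lowercase runs in the masked FASTA for the same GI shown earlier suggests that each start is zero-based and each end is exclusive (for example, 140-144 corresponds to the four residues "klsl"). That reading is an inference from this one example, not a documented guarantee, and the sketch below relies on it. The function name masked_total is hypothetical.

```shell
# Read '%a %l %m' lines on stdin and print: accession, length, and the
# total number of masked residues.
# ASSUMPTION: each range is "start-end" with an exclusive end, so its
# length is end - start (inferred from the example, not documented here).
# masked_total is a hypothetical helper name, not a BLAST+ command.
masked_total() {
  awk '{
    n = split($3, ranges, ";"); total = 0
    for (i = 1; i <= n; i++) {
      if (ranges[i] == "") continue     # skip the empty field after the last ";"
      split(ranges[i], se, "-")
      total += se[2] - se[1]
    }
    print $1, $2, total
  }'
}

# Possible usage (requires blastdbcmd and the database from the example):
# blastdbcmd -entry 71022837 -db Test/mask-data-db -outfmt "%a %l %m" | masked_total
```

Under that assumption, the line shown above yields a total of 90 masked residues out of 1292.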

Extract different sequence ranges from the BLAST databases

The command below extracts two different sequence ranges: bases 40-80 of human chromosome Y (GI 13626247), with the masked regions in lowercase characters (note the argument 30, the ID of a masking algorithm available in this BLAST database), and bases 1-10 on the minus strand of human chromosome 20 (GI 14772189).

$ printf "%s %s %s %s\n%s %s %s\n" 13626247 40-80 plus 30 14772189 1-10 minus \
| blastdbcmd -db GPIPE/9606/current/all_contig -entry_batch -

>gi|13626247|ref|NT_025975.2|:40-80 Homo sapiens chromosome Y genomic contig, GRCh37.p10 Primary Assembly

tgcattccattctattctcttctACTGCATACAatttcact

>gi|14772189|ref|NT_025215.4|:c10-1 Homo sapiens chromosome 20 genomic contig, GRCh37.p10 Primary Assembly

GCTCTAGATC
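When more than a couple of ranges are needed, the batch lines may be easier to read if each one is produced by a small helper instead of a single long printf format. This is a minimal sketch: the function name batch_entry is hypothetical, and the field order (identifier, range, strand, optional masking algorithm ID) is simply the one used in the printf example above.

```shell
# Emit one -entry_batch line: identifier, range, strand, and an optional
# masking algorithm ID, in the same order as the printf example.
# batch_entry is a hypothetical helper name, not a BLAST+ command.
batch_entry() {
  if [ "$#" -eq 4 ]; then
    printf '%s %s %s %s\n' "$1" "$2" "$3" "$4"
  else
    printf '%s %s %s\n' "$1" "$2" "$3"
  fi
}

# Possible usage (requires blastdbcmd and the database from the example):
# { batch_entry 13626247 40-80 plus 30; batch_entry 14772189 1-10 minus; } | \
#   blastdbcmd -db GPIPE/9606/current/all_contig -entry_batch -
```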

Display the locations where BLAST will search for BLAST databases

This is accomplished by using the -show_blastdb_search_path option in blastdbcmd:

$ blastdbcmd -show_blastdb_search_path

:/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:
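The search path is printed as a single colon-separated string, which can be hard to scan when it contains many directories. The sketch below splits it into one directory per line and drops the empty fields produced by the leading and trailing colons. The function name split_search_path is hypothetical, and the sketch assumes no directory name contains a colon.

```shell
# Split a colon-separated search path read from stdin into one directory
# per line, dropping empty entries.
# split_search_path is a hypothetical helper name, not a BLAST+ command.
split_search_path() {
  tr ':' '\n' | awk 'NF'
}

# Possible usage (requires blastdbcmd):
# blastdbcmd -show_blastdb_search_path | split_search_path
```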


Display the available BLAST databases at a given directory

This is accomplished by using the -list option in blastdbcmd (add -recursive to also search subdirectories):

$ blastdbcmd -list repeat -recursive

repeat/repeat_3055 Nucleotide

repeat/repeat_31032 Nucleotide

repeat/repeat_35128 Nucleotide

repeat/repeat_3702 Nucleotide

repeat/repeat_40674 Nucleotide

repeat/repeat_4530 Nucleotide

repeat/repeat_4751 Nucleotide

repeat/repeat_6238 Nucleotide

repeat/repeat_6239 Nucleotide

repeat/repeat_7165 Nucleotide

repeat/repeat_7227 Nucleotide

repeat/repeat_7719 Nucleotide

repeat/repeat_7955 Nucleotide

repeat/repeat_9606 Nucleotide

repeat/repeat_9989 Nucleotide

e rst column of the default output is the le name of the BLAST database (usually

provided as the –db argument to other BLAST+ applications), the second column

represents the molecule type of the BLAST database. is output is congurable via the

list_outfmt command line option.
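Because the default output has two whitespace-separated columns, it can be filtered by molecule type with awk. This is a minimal sketch that assumes the default two-column layout described above and database paths without spaces; the function name dbs_of_type is hypothetical.

```shell
# Read default "blastdbcmd -list" output on stdin and print the database
# names whose molecule type (column 2) equals the first argument.
# dbs_of_type is a hypothetical helper name, not a BLAST+ command.
dbs_of_type() {
  awk -v type="$1" '$2 == type { print $1 }'
}

# Possible usage (requires blastdbcmd):
# blastdbcmd -list repeat -recursive | dbs_of_type Nucleotide
```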

Use Windowmasker to filter the query sequence(s) in a BLAST search

The blastn executable can filter a query sequence using the windowmasker data files. This option can be used to mask interspersed repeats that may lead to spurious matches. The windowmasker data files should be created as discussed in step 1 of “Create masking information using windowmasker” or downloaded from the NCBI FTP site. Follow the instructions in Configuring BLAST to make sure BLAST will be able to find the windowmasker files in the examples below.

1. Run BLAST search using Windowmasker for sequence filtering based upon taxid

(9606 is the taxid for human).

$ blastn -query input -db database -window_masker_taxid 9606 -out results.txt

2. Run BLAST search using Windowmasker for sequence filtering based upon the windowmasker file name.

$ blastn -query input -db database -window_masker_db 9606/wmasker.obinary
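Before starting a long search, it can help to confirm that the windowmasker file for a taxid actually exists. The sketch below assumes the directory layout suggested by the example path, that is, a subdirectory named after the taxid containing wmasker.obinary. The function name wmasker_file and its base-directory argument are hypothetical.

```shell
# Print the path of the windowmasker file for a taxid if it exists under
# the given base directory; return non-zero otherwise.
# ASSUMPTION: files are laid out as <base>/<taxid>/wmasker.obinary, as in
# the example path above. wmasker_file is a hypothetical helper name.
wmasker_file() {
  f="$1/$2/wmasker.obinary"
  if [ -f "$f" ]; then
    printf '%s\n' "$f"
  else
    echo "no windowmasker file for taxid $2 under $1" >&2
    return 1
  fi
}

# Possible usage (requires blastn; "." stands in for the base directory):
# wm=$(wmasker_file . 9606) && blastn -query input -db database -window_masker_db "$wm"
```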


Building a BLAST database with local sequences

The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured. The text in the definition line will be stored in the BLAST database and displayed in the BLAST report, but it will not be possible to fetch individual sequences using blastdbcmd or to limit the search with the -seqidlist option. Use the -parse_seqids flag when invoking makeblastdb to enable retrieval of sequences based upon sequence identifiers. In this case, each sequence must have a unique identifier, and that identifier must have a specific format. The identifier should begin right after the “>” sign on the definition line, contain no spaces, and follow the formats described in the table at http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/#ch_demo.T5 (user-supplied sequences should make use of the local or general identifiers described in that table).
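If a FASTA file has unstructured definition lines, one way to prepare it for -parse_seqids is to prefix each definition line with a general identifier of the form gnl|database|number. This is a minimal sketch: the function name add_general_ids, the sequential numbering, and the output file name in the usage comment are assumptions made for illustration.

```shell
# Read FASTA on stdin and prefix each definition line with a general ID
# of the form gnl|<db>|<n>, where <db> is the first argument and <n> is a
# running counter. The original definition text is kept after the ID.
# add_general_ids is a hypothetical helper name, not a BLAST+ command.
add_general_ids() {
  awk -v db="$1" '
    /^>/ { n++; sub(/^>[ \t]*/, ""); print ">gnl|" db "|" n " " $0; next }
    { print }'
}

# Possible usage (requires makeblastdb; file names are placeholders):
# add_general_ids MYDB < unstructured.fsa > mydb.fsa
# makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids
```

Sequential numbers keep every identifier unique and free of spaces, which are the two requirements stated above.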

A FASTA le with general IDs would look like:

$ cat mydb.fsa

>gnl|MYDB|1 this is sequence 1

GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG

TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA

GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT

GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG

GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT

TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG

CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT

TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT

CACAGACAGCCTGAGACAGTTCTTACGGAAACACCCCAGGACACAATTGAATTAAACAGATTGAATTTAGAATCTTCCAA

>gnl|MYDB|2 this is sequence 2