Batch BLAST Help
Contents
Guidelines for Use
Input format for sequences
Submission Form
Results
Overview
Batch BLAST is a service provided by SWBIC to the NMSU research community.
It runs the Basic Local Alignment Search Tool (BLAST) for each sequence
in a set of input (i.e., query) sequences. The sequences are input as a group
in FASTA format. The user may select which type of BLAST to run (blastp,
blastn, or blastx), which sequence database to search, and other BLAST options.
The results of these BLASTs may be sorted by the queries that resulted in
the most significant matches to database sequences. The BLAST runs are performed
in parallel on SWBIC's 32-node Beowulf cluster
(darwin), which is capable
of completing thousands of BLAST runs each hour. Because a large job may
still take considerable time, however, results are returned in an e-mail that
provides the URL of the main page of your BLAST results. These results provide
more than the standard BLAST report: they include a graphical alignment of
High-scoring Segment Pairs (HSPs), more complete descriptions of subject
sequences, and links to the GenBank entries of the matched subject sequences.
Batch BLAST uses the latest version of BLAST from NCBI, as well as sequence
databases from NCBI that are updated weekly. The NCBI BLAST Help page contains
information about the various BLAST options, although parts of it refer
specifically to the NCBI BLAST interface.
Guidelines for Use
Batch BLAST runs are queued on a first-come-first-served basis,
and are assigned a lower priority in the queue than other SWBIC
bioinformatics services.
The time it will take to return results will depend on how busy
darwin is, where your job is in the queue, and how large your job is.
We ask that you submit large jobs in off-peak hours, if possible.
What constitutes a large job depends on several factors: the number and length
of sequences submitted, the type of BLAST being run, and the size of the
database searched. A blastx run takes approximately six times longer
than a blastp run, and three times longer than a blastn run. The largest databases are
the non-redundant and EST databases. Please keep these considerations in
mind when you submit a job.
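As a rough way to compare job sizes before submitting, the ratios above can be turned into relative weights. The sketch below is only an illustration: it reads "six times longer" as a factor of six, assumes run time scales with the total length of the query sequences, and ignores database size and queue load entirely. The names `RELATIVE_COST` and `relative_job_size` are hypothetical and are not part of Batch BLAST.

```python
# Relative weights derived from the ratios stated above:
# blastx ~ 6 x blastp and blastx ~ 3 x blastn, hence blastn ~ 2 x blastp.
RELATIVE_COST = {"blastp": 1.0, "blastn": 2.0, "blastx": 6.0}


def relative_job_size(program, sequence_lengths):
    """Return a unitless size estimate for a job.

    Assumption (not from Batch BLAST): cost is proportional to the
    total query length multiplied by the program's relative weight.
    """
    if program not in RELATIVE_COST:
        raise ValueError("unknown BLAST program: %r" % program)
    return RELATIVE_COST[program] * sum(sequence_lengths)


# Two jobs with the same queries, differing only in program.
lengths = [100, 200]
print(relative_job_size("blastp", lengths))
print(relative_job_size("blastx", lengths))
```

The numbers are only meaningful relative to one another; they do not predict wall-clock time on darwin.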
The result web pages will be retained on darwin for at least two weeks. If
you need to keep them online for a longer time, contact
John Spalding.
Input format for sequences
You may paste one or more sequences in the text box,
or click the UPLOAD button to specify a file of sequences on your computer.
All sequences must be in FASTA format, i.e.,
each sequence begins with the ">" character
in the first position, followed by descriptive text (the "definition
line"). One or more lines containing the sequence then follow. These
lines may be of varying length and should contain only sequence characters
that are valid to BLAST.
The input may consist of multiple sequences, one following the other.
The following example of two sequences is valid:
>seq001 this is the description
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
TAAMYQLGGQ VLDLNPSVTQ VGRGEPIQDT ARVLDRYIDI
LAVRTFKQTDLQTFADHAKM
>seq002 this is the description
MRVFLAICLSLTVALAAETGKYTPFQYNRVYSTVSPFVYKPGRYVADPGR
GFYTGSGTAGGPGGAYVGTKEDLSKYLGDAYKGSSIVPLPVVKPTIPVPV
APEATTT
IMPORTANT: The first field of each definition line (defline) must be unique
because Batch BLAST identifies query sequences by this field. The first field
consists of the characters between the leading '>' and the first blank. The
above example follows this rule. The next example does not, because both
deflines have the first field "experiment", and it will cause Batch BLAST to fail:
>experiment 1
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
>experiment 2
MRVFLAICLSLTVALAAETGKYTPFQYNRVYSTVSPFVYKPGRYVADPGR
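You can check a file for this problem before submitting it. The following is a minimal sketch, not the check Batch BLAST itself performs: it extracts the first field of each defline as defined above (the characters between the leading '>' and the first blank) and reports any field that occurs more than once. The function names `defline_ids` and `duplicate_ids` are hypothetical.

```python
import re
from collections import Counter


def defline_ids(fasta_text):
    """Return the first field of every definition line, in input order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Characters between the leading '>' and the first blank.
            ids.append(re.split(r"\s", line[1:], maxsplit=1)[0])
    return ids


def duplicate_ids(fasta_text):
    """Return the sorted list of first fields that appear more than once."""
    counts = Counter(defline_ids(fasta_text))
    return sorted(field for field, n in counts.items() if n > 1)


bad = ">experiment 1\nMGIKALAGRD\n>experiment 2\nMRVFLAICLS\n"
print(duplicate_ids(bad))  # ['experiment']
```

An empty list means every defline has a unique first field.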
Submission Form
There are four types of input to Batch BLAST:
- Query sequences: these must be in FASTA format. Each sequence contains
  a definition line (starting with a ">" followed by a description of the
  sequence) and then one or more lines of sequence data. This set of sequences
  may be pasted into the text window or uploaded from a file on your computer.
- BLAST options:
  - BLAST program: the choices are blastn (nucleotide query and database),
    blastp (protein query and database), and blastx (nucleotide query translated
    into six reading frames and searched against a protein database).
  - Database to search: the available databases are offered in this list.
    For more information on these databases see SWBIC sequence databases.
  - Max E-value: this is a cut-off value; HSPs with an E-value greater
    than this will not be included in the BLAST output.
  - Max number of matches: this is a cut-off value; the number of matched
    subject sequences will not exceed this number.
  - Protein substitution matrix: applicable for blastp and blastx searches.
    See NCBI Help for more information.
  - Filter query: uncheck this box if you do not want to apply a
    low-complexity filter (NCBI Help).
  - Costs to open and extend a gap: these are set to defaults, but you
    may specify your own values. Be aware, however, that BLAST does not accept
    every combination of gap costs, and may stop if it detects an invalid pair
    of values (NCBI Help).
  - Advanced options: this text box allows the user to enter other BLAST
    options not offered in the Batch BLAST form. Options must be provided
    in the Unix "-flag value" format required as input to the "blastall" program
    (NCBI Help). Do not enter options that are already defined in the input form.
- Sort options: this determines the order of query sequences in the
  main result page. They may be sorted
  - by the queries that resulted in the smallest E-value matches to subject
    sequences,
  - by the queries that produced the most BLAST matches of subject sequences,
  - or not sorted (i.e., in the order of query sequences input).
- E-mail address: this is a required field.
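To illustrate the "-flag value" format expected in the Advanced options box, the sketch below splits such a string into pairs and rejects flags that the form presumably sets already. It is an assumption-laden example, not the Batch BLAST implementation: the set `FORM_FLAGS` is a guess at which legacy blastall flags correspond to the form fields (program, database, query file, E-value, matrix, filter, gap costs, and number of matches), and `parse_advanced_options` is a hypothetical helper.

```python
import shlex

# Assumption: blastall flags that the form's own fields would set.
FORM_FLAGS = {"-p", "-d", "-i", "-e", "-M", "-F", "-G", "-E", "-v", "-b"}


def parse_advanced_options(text, reserved=FORM_FLAGS):
    """Split a '-flag value' string into (flag, value) pairs.

    Raises ValueError if the string is not made of flag/value pairs
    or if a flag duplicates one already set by the form.
    """
    tokens = shlex.split(text)
    if len(tokens) % 2 != 0:
        raise ValueError("options must be given as '-flag value' pairs")
    options = []
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("-") or len(flag) < 2:
            raise ValueError("expected a flag, got %r" % flag)
        if flag in reserved:
            raise ValueError("%s is already set in the input form" % flag)
        options.append((flag, value))
    return options


# Word size and gapped-alignment switch, neither of which is in the form.
print(parse_advanced_options("-W 7 -g F"))  # [('-W', '7'), ('-g', 'F')]
```

Only even-numbered tokens are treated as flags, so a negative value such as the "-3" in "-q -3" is accepted as a value rather than mistaken for a flag.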
Results
Batch BLAST results are organized in a series of web pages stored on darwin.
When all of the analyses are completed, the user is sent an e-mail with the
URL of the main page of results. These results are organized in a table,
in which each row contains a summary of and links to the results for each
query sequence. Each row contains the following information: the number of
subject sequences matched, a link to the query sequence, the query description
and link to the detailed results, the description of the top-scoring subject,
and the scores (bits and E-value) of this subject.
The query sequences may be re-sorted by the methods described above. A re-sort
requires little computer time, so results are returned quickly to your browser.
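The three sort orders listed under Sort options can be sketched as follows. This is only an illustration of the ordering rules, assuming each row of the results table is reduced to a few fields; the `QuerySummary` record and `sort_queries` function are hypothetical names, and placing queries with no matches last in the E-value sort is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuerySummary:
    input_index: int               # position of the query in the input
    query_id: str                  # first field of the defline
    num_matches: int               # number of subject sequences matched
    best_evalue: Optional[float]   # E-value of the top subject, None if no match


def sort_queries(rows: List[QuerySummary], method: str) -> List[QuerySummary]:
    """Order result rows by 'evalue', 'matches', or 'input'."""
    if method == "evalue":
        # Smallest E-value first; queries without matches go last.
        return sorted(rows, key=lambda r: float("inf")
                      if r.best_evalue is None else r.best_evalue)
    if method == "matches":
        # Most matched subject sequences first.
        return sorted(rows, key=lambda r: -r.num_matches)
    if method == "input":
        return sorted(rows, key=lambda r: r.input_index)
    raise ValueError("unknown sort method: %r" % method)


rows = [
    QuerySummary(0, "seq001", 12, 1e-5),
    QuerySummary(1, "seq002", 4, 1e-50),
    QuerySummary(2, "seq003", 0, None),
]
print([r.query_id for r in sort_queries(rows, "evalue")])
# ['seq002', 'seq001', 'seq003']
```

Because each order depends only on values already stored in the table, no BLAST run has to be repeated, which is consistent with a re-sort being quick.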
Clicking on a query description opens a page of complete results for that
query in a second window. At the top is a graphical display showing the alignment
of all HSPs against the query, with reverse complement HSPs (blastn and blastx)
shown as hollow bars. Statistics for each HSP are shown to the right in the
following order: E-value, bits score, percent identity, percent positives,
percent gaps, and length of the subject sequence. When you hold the mouse
pointer over an HSP bar, a short description of that subject sequence is
shown. Click on an HSP bar to jump to that sequence in the summary table below.
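For readers unfamiliar with the HSP statistics, the sketch below shows how percent identity and percent gaps can be computed from one aligned pair of strings. It assumes that both percentages are taken over the alignment length (including gap columns) and that "-" marks a gap; percent positives is omitted because it depends on the substitution matrix. The function `hsp_percentages` is a hypothetical name, not part of Batch BLAST.

```python
def hsp_percentages(query_aln, subject_aln):
    """Return percent identity and percent gaps for one aligned HSP.

    Both strings must have the same length; '-' marks a gap.
    """
    if not query_aln or len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must be non-empty and equal in length")
    length = len(query_aln)
    pairs = list(zip(query_aln, subject_aln))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return {
        "identity": 100.0 * identities / length,
        "gaps": 100.0 * gaps / length,
    }


# 8 alignment columns: 6 identical, 1 mismatch (K/R), 1 gap.
print(hsp_percentages("MKTLAY-G", "MRTLAYWG"))
# {'identity': 75.0, 'gaps': 12.5}
```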
Below the HSP graphical alignments is a table of data for each subject sequence,
sorted by the score of the top-scoring HSP (the same order as in the BLAST
report). The description is divided into two parts: the GenBank identifier
(which is also a link to the GenBank entry) and the description of the subject
sequence. The scores (E-value and bits) for the top-scoring HSP are also
shown. The sequence description is also a link to the alignment (HSP) results
for that subject in the BLAST report.
The last section of the results is the BLAST report itself. You may jump
to the alignment (HSP) results for a particular subject sequence by clicking
on its description in the table above.