Batch BLAST Help

Contents

Guidelines for Use
Input format for sequences
Submission Form
Results

Overview

Batch BLAST is a service provided by SWBIC to the NMSU research community. It runs the Basic Local Alignment Search Tool (BLAST) on each sequence in a set of input (i.e., query) sequences. The sequences are submitted as a group in FASTA format. The user may select which type of BLAST to run (blastp, blastn, or blastx), which sequence database to search, and other BLAST options. The results may be sorted so that the queries with the most significant matches to database sequences appear first. The BLAST runs are performed in parallel on SWBIC’s 32-node Beowulf cluster (darwin), which is capable of completing thousands of BLAST runs each hour. Because a job may still take considerable time, results are returned in an e-mail that provides the URL of the main page of your BLAST results. These results provide more than the standard BLAST report: they include a graphical alignment of High Scoring Segment Pairs (HSPs), more complete descriptions of subject sequences, and links to the GenBank entries of the matched subject sequences.

Batch BLAST uses the latest version of BLAST from NCBI, as well as sequence databases from NCBI that are updated weekly. The NCBI BLAST Help page contains information about the various BLAST options, although parts of it apply only to the NCBI BLAST interface.

Guidelines for Use

Batch BLAST runs are queued on a first-come, first-served basis, and are assigned a lower priority in the queue than other SWBIC bioinformatics services. The time it takes to return results depends on how busy darwin is, where your job is in the queue, and how large your job is. We ask that you submit large jobs in off-peak hours, if possible. What constitutes a large job depends on several factors: the number and length of sequences submitted, the type of BLAST being run, and the size of the database searched. blastx takes approximately six times as long to run as blastp, and three times as long as blastn. The largest databases are the non-redundant and EST databases. Please keep these considerations in mind when you submit a job.
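As a rough illustration of the ratios above, the following Python sketch takes them at face value: if blastx costs six times blastp and three times blastn, then blastn costs about twice blastp. The function name and the idea of scaling linearly with the number of queries are assumptions made for this example only; actual run time also depends on sequence length, database size, and the load on darwin.

```python
# Relative run-time factors derived from the stated ratios, with blastp = 1.
# blastx = 6 x blastp and blastx = 3 x blastn, so blastn = 6 / 3 = 2 x blastp.
RELATIVE_COST = {"blastp": 1.0, "blastn": 2.0, "blastx": 6.0}


def relative_job_cost(program, num_queries):
    """Return a unitless cost estimate (hypothetical helper, not part of Batch BLAST)."""
    if program not in RELATIVE_COST:
        raise ValueError("unknown BLAST program: " + program)
    # Assumption: cost grows linearly with the number of query sequences.
    return RELATIVE_COST[program] * num_queries


if __name__ == "__main__":
    for prog in ("blastp", "blastn", "blastx"):
        print(prog, relative_job_cost(prog, 1000))
```

Under these assumptions, 1,000 blastx queries weigh about as much as 3,000 blastn queries, which is one way to judge whether a job belongs in off-peak hours.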

The result web pages will be retained on darwin for at least two weeks. If you need to keep them online for a longer time, contact John Spalding.

Input format for sequences

You may paste one or more sequences in the text box, or click the UPLOAD button to specify a file of sequences on your computer. All sequences must be in FASTA format, i.e., each sequence begins with the “>” character in the first position, followed by descriptive text (the “definition line”). One or more lines containing the sequence then follow. These lines may be of varying length and should contain only sequence characters that are valid for BLAST. The input may consist of multiple sequences, one following the other. The following example of two sequences is valid:

>seq001 this is the description
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
TAAMYQLGGQ VLDLNPSVTQ VGRGEPIQDT ARVLDRYIDI
LAVRTFKQTDLQTFADHAKM
>seq002 this is the description
MRVFLAICLSLTVALAAETGKYTPFQYNRVYSTVSPFVYKPGRYVADPGR
GFYTGSGTAGGPGGAYVGTKEDLSKYLGDAYKGSSIVPLPVVKPTIPVPV
APEATTT

IMPORTANT: The first field of each defline must be unique because Batch BLAST identifies query sequences by this field. The first field consists of the characters between the leading ‘>’ and the first blank. The above example follows this rule. The next example does not and will cause Batch BLAST to fail:

>experiment 1
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
>experiment 2
MRVFLAICLSLTVALAAETGKYTPFQYNRVYSTVSPFVYKPGRYVADPGR
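Before submitting a large file, it can help to check the uniqueness rule locally. The Python sketch below extracts the first field of each definition line (the characters between the leading ">" and the first blank) and reports any that occur more than once. The function names are illustrative and are not part of Batch BLAST; the sequences are shortened for brevity.

```python
import re
from collections import Counter


def first_fields(fasta_text):
    """Return the first field of every definition line, in input order."""
    fields = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Everything between the leading '>' and the first whitespace character.
            fields.append(re.split(r"\s", line[1:], maxsplit=1)[0])
    return fields


def duplicate_first_fields(fasta_text):
    """Return the sorted list of first fields that appear more than once."""
    counts = Counter(first_fields(fasta_text))
    return sorted(name for name, n in counts.items() if n > 1)


valid = """>seq001 this is the description
MGIKALAGRDLL
>seq002 this is the description
MRVFLAICLSLT
"""

invalid = """>experiment 1
MGIKALAGRDLL
>experiment 2
MRVFLAICLSLT
"""

if __name__ == "__main__":
    print(duplicate_first_fields(valid))    # []
    print(duplicate_first_fields(invalid))  # ['experiment']
```

In the second input both deflines share the first field "experiment"; renaming them to, for example, "experiment_1" and "experiment_2" would satisfy the rule.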

Submission Form

There are four types of inputs to Batch BLAST:

  • Query sequences: these must be in FASTA format. Each sequence contains a definition line (starting with a “>” followed by a description of the sequence) and then one or more lines of sequence data. This set of sequences may be pasted into the text window or uploaded from a file on your computer.
  • BLAST options:
    • BLAST program: the choices are blastn (nucleotide query and database), blastp (protein query and database), and blastx (nucleotide query translated into six reading frames and searched against a protein database).
    • Database to search: the available databases are offered in this list. For more information on these databases see SWBIC sequence databases.
    • Max E-value: this is a cut-off value; HSPs with an E-value greater than this will not be included in the BLAST output.
    • Max number of matches: this is a cut-off value; the number of matched subject sequences will not exceed this number.
    • Protein substitution matrix: applicable for blastp and blastx searches. See NCBI Help for more information.
    • Filter query: uncheck this box if you do not want to apply a low-complexity filter (NCBI Help).
    • Costs to open and extend a gap: these are set to defaults, but you may specify your own values. Be aware, however, that BLAST does not accept any combination of gap costs, and may stop if it detects an invalid pair of values (NCBI Help).
    • Advanced options: this text box allows the user to enter other BLAST options not offered in the Batch BLAST form. Options must be provided in the Unix “-flag value” format required as input to the “blastall” program (NCBI Help). Do not enter options that are already defined in the input form.
  • Sort options: This determines the order of query sequences in the main result page. They may be sorted:
    • by the queries that resulted in the smallest E-value matches to subject sequences,
    • by the queries that produced the most BLAST matches of subject sequences, or
    • not sorted (i.e., in the order of query sequences input).
  • E-mail address: This is a required field.
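To make the “-flag value” format concrete, here is a Python sketch of how the form fields might correspond to a blastall command line. How Batch BLAST actually assembles its command is not documented here, so the mapping is an assumption: in particular, applying “Max number of matches” to both -v (one-line descriptions) and -b (alignments) is a guess, and the function name and default values are placeholders. The flags themselves (-p, -d, -i, -e, -v, -b, -F, -M, -G, -E) are standard blastall options.

```python
import shlex


def build_blastall_args(program, database, query_file, max_evalue=10.0,
                        max_matches=50, matrix=None, filter_query=True,
                        gap_open=None, gap_extend=None, advanced=""):
    """Assemble a blastall argument list (hypothetical mapping of the form fields)."""
    if program not in ("blastn", "blastp", "blastx"):
        raise ValueError("unsupported program: " + program)

    pairs = [
        ("-p", program),                       # BLAST program
        ("-d", database),                      # database to search
        ("-i", query_file),                    # FASTA query file
        ("-e", str(max_evalue)),               # max E-value
        ("-v", str(max_matches)),              # assumed: one-line descriptions
        ("-b", str(max_matches)),              # assumed: alignments shown
        ("-F", "T" if filter_query else "F"),  # low-complexity filter
    ]
    if matrix is not None and program in ("blastp", "blastx"):
        pairs.append(("-M", matrix))           # protein substitution matrix
    if gap_open is not None:
        pairs.append(("-G", str(gap_open)))    # cost to open a gap
    if gap_extend is not None:
        pairs.append(("-E", str(gap_extend)))  # cost to extend a gap

    # Advanced options must be strict "-flag value" pairs and must not
    # repeat a flag that the form already sets.
    form_flags = {flag for flag, _ in pairs}
    extra = shlex.split(advanced)
    if len(extra) % 2 != 0:
        raise ValueError("advanced options must be '-flag value' pairs")
    for flag in extra[::2]:
        if not flag.startswith("-"):
            raise ValueError("expected a flag, got: " + flag)
        if flag in form_flags:
            raise ValueError("option already set by the form: " + flag)

    args = ["blastall"]
    for flag, value in pairs:
        args += [flag, value]
    return args + extra


if __name__ == "__main__":
    print(" ".join(build_blastall_args(
        "blastx", "nr", "queries.fasta", max_evalue=1e-5,
        max_matches=25, matrix="BLOSUM62", advanced="-a 2")))
```

The check at the end mirrors the rule for the Advanced options box: an entry such as “-e 0.01” is rejected because the E-value is already set by its own form field.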

Results

Batch BLAST results are organized in a series of web pages stored on darwin. When all of the analyses are completed, the user is sent an e-mail with the URL of the main page of results. These results are organized in a table, in which each row contains a summary of and links to the results for each query sequence. Each row contains the following information: the number of subject sequences matched, a link to the query sequence, the query description and link to the detailed results, the description of the top-scoring subject, and the scores (bits and E-value) of this subject.

The query sequences may be re-sorted by the methods described above. Re-sorting requires little computer time, so results are returned quickly to your browser.
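The three sort orders can be sketched in a few lines of Python. The record type and field names below are hypothetical stand-ins for a row of the main results table, and placing queries with no matches last when sorting by E-value is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuerySummary:
    """One row of the main results table (hypothetical field names)."""
    query_id: str
    num_matches: int
    best_evalue: Optional[float]  # None when no subject sequence matched


def sort_queries(rows, method="none"):
    """Return the rows ordered by 'evalue', 'matches', or 'none' (input order)."""
    if method == "evalue":
        # Smallest E-value first; queries without matches go last (assumption).
        return sorted(rows, key=lambda r: (r.best_evalue is None,
                                           r.best_evalue or 0.0))
    if method == "matches":
        # Most matched subject sequences first; ties keep their input order.
        return sorted(rows, key=lambda r: r.num_matches, reverse=True)
    if method == "none":
        return list(rows)
    raise ValueError("unknown sort method: " + method)


rows = [
    QuerySummary("seq001", 3, 1e-10),
    QuerySummary("seq002", 0, None),
    QuerySummary("seq003", 7, 1e-50),
]

if __name__ == "__main__":
    for method in ("evalue", "matches", "none"):
        print(method, [r.query_id for r in sort_queries(rows, method)])
```

Because only the summary rows are reordered and no BLAST run is repeated, an operation like this is cheap, which is consistent with re-sorted results returning quickly.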

Clicking on a query description opens a page of complete results for that query in a second window. At the top is a graphical display showing the alignment of all HSPs against the query, with reverse-complement HSPs (blastn and blastx) shown as hollow bars. Statistics for each HSP are shown to the right in the following order: E-value, bit score, percent identity, percent positives, percent gaps, and length of the subject sequence. When you hold the mouse pointer over an HSP bar, a short description of that subject sequence is shown. Click on an HSP bar to jump to that sequence in the summary table below.

Below the HSP graphical alignments is a table of data for each subject sequence, sorted by the score of the top-scoring HSP (the same order as in the BLAST report). The description is divided into two parts: the GenBank identifier (which is also a link to the GenBank entry) and the description of the subject sequence. The scores (E-value and bits) for the top-scoring HSP are also shown. The sequence description is also a link to the alignment (HSP) results for that subject in the BLAST report.

The last section of the results is the BLAST report itself. You may jump to the alignment (HSP) results for a particular subject sequence by clicking on its description in the table above.