BLAST Filter Help
Contents
Input format for sequences
Step 1: BLAST Analysis Options
Step 2: Rule Analysis Results
Step 2: Rule Analysis (Filter) Options
Overview
BLAST Filter is a bioinformatics tool that builds a set of related sequences
from a single query sequence. The full sequences are returned and can
then be used as input for multiple alignment, hidden Markov models, or
other analyses that require a set of homologous sequences. It works in two
steps:
-
Step 1: BLAST Analysis. Your query sequence (protein or DNA) is searched
by BLAST (blastp or blastn) against a sequence database.
-
Step 2: Rule Analysis. The results of the BLAST search are displayed
graphically by showing the alignments of all HSPs against the query. A set
of rules is also shown; initially, all subject sequences from the BLAST search
pass all these rules. You may then change the values of one or more of the
rules and re-apply the rules to remove or "filter" sequences from
the full set returned by BLAST. Sequences failing one or more rules are shown
in different colors.
You may re-apply changed rules as often as you like, and even return to the
Step 1 page to change BLAST parameters. When you are finished, a FASTA-format
file of the full sequences matched by BLAST and passing all rules can be
viewed and saved. The color-coding of the graphical alignment display
makes it easy to see which sequences and HSPs (BLAST local alignments) are
failing the current set of rules.
Limits:
-
Number of BLAST matches returned: 1000 subject sequences
-
Sequence length: 25,000 letters (for any single sequence)
Compute time considerations: Step 1 (BLAST analysis) can be
time-consuming because, in addition to the BLAST search, CLUSTALW is run
on all query-subject pairs to compute global pairwise statistics. Therefore,
a long sequence (> 500 residues) with many BLAST matches (> 500) can
take several minutes or more to complete. An intermediate page between Steps
1 and 2 displays run time information and automatically goes to Step 2
when Step 1 has completed.
Also, the Maximum similarity (%) rule involves
all-against-all pairwise alignments and can be time-consuming; in
general, you should leave this rule at its default value (100%) until
you have refined other rules.
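To get a feel for why the Maximum similarity (%) rule is more expensive than
the Step 1 alignments, it helps to compare how many alignments each involves.
The following is a rough sketch only; the function names are illustrative and
are not part of BLAST Filter.

```python
def query_subject_alignments(n_subjects: int) -> int:
    """Step 1: one global pairwise alignment per subject against the query."""
    return n_subjects


def all_against_all_alignments(n_sequences: int) -> int:
    """Maximum similarity rule: one alignment per unordered pair of sequences."""
    return n_sequences * (n_sequences - 1) // 2
```

At the 1000-sequence limit, Step 1 needs 1000 query-subject alignments, while
an all-against-all comparison of the same set would need 499,500. Because the
pair count grows roughly with the square of the set size, reducing the set
with the other rules first cuts this cost sharply.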
Page loading time: Internet Explorer is much more efficient than Netscape
Navigator (version 4.7) in displaying the Step 2: Rule Analysis page,
which includes the graphical alignment of HSPs. If you are using Netscape,
and find this a problem, you may wish to switch to Internet Explorer for
this application.
Input format for sequences
You may paste a sequence in the text box, or click the UPLOAD button to specify
a sequence file on your computer. The sequence may be in raw format (just lines
of sequence) or FASTA format. In FASTA format, the first line begins with the
">" character in the first position, followed by descriptive text (the
"definition line"). One or more lines containing the sequence then follow. These
lines may be of varying length but must contain only letters, blanks, and
tabs. The following example is valid:
>seq123 this is description
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
TAAMYQLGGQ VLDLNPSVTQ VGRGEPIQDT ARVLDRYIDI
LAVRTFKQTDLQTFADHAKM
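If you want to check a sequence before pasting or uploading it, the format
rules above can be approximated with a short Python function. This is a
minimal sketch, not the validation code used by BLAST Filter: the function
name is hypothetical, skipping empty lines is an assumption, and the
25,000-letter limit is taken from the Limits section of this page.

```python
import re

MAX_SEQUENCE_LENGTH = 25000  # per-sequence limit stated in the Overview


def parse_sequence_input(text: str) -> tuple[str, str]:
    """Return (definition_line, sequence) from raw or FASTA-format text.

    The definition line is "" for raw input. Blanks and tabs inside
    sequence lines are dropped; any other non-letter raises ValueError.
    """
    # Assumption: completely empty lines are ignored.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no sequence supplied")

    definition = ""
    if lines[0].startswith(">"):
        definition = lines[0][1:].strip()
        lines = lines[1:]

    parts = []
    for ln in lines:
        if ln.startswith(">"):
            raise ValueError("only a single sequence is accepted")
        if not re.fullmatch(r"[A-Za-z \t]+", ln):
            raise ValueError(f"invalid characters in line: {ln!r}")
        parts.append(re.sub(r"[ \t]", "", ln))

    sequence = "".join(parts)
    if not sequence:
        raise ValueError("no sequence letters found")
    if len(sequence) > MAX_SEQUENCE_LENGTH:
        raise ValueError("sequence exceeds the length limit")
    return definition, sequence
```

Called on the example above, the function would return the definition line
"seq123 this is description" and the three sequence lines joined into one
string with the blanks removed.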
Step 1: BLAST Analysis Options
In this step you supply a query sequence, which is searched against a database
using BLAST. After that, the BLAST results are analyzed and all query-subject,
global pairwise alignments are computed by CLUSTALW because some of the rules
use those data in addition to BLAST data. If any errors occur in this step,
this page will be re-displayed along with error messages.
For help on BLAST inputs, see the NCBI BLAST Help page. Note that the E-value
(or "Expect") threshold is not an option. It is fixed at 10 in this step, but
can be changed in a rule in Step 2.
Select a program
There are only two choices: blastp (protein query, protein database) and
blastn (DNA query, DNA database).
Select a database
Choose one from the list of available databases; their type (protein or DNA)
is specified in the list. You must select a database that matches the type
of the query sequence. For more information on available databases see
SWBIC sequence databases.
Paste or upload sequence
Either paste a sequence in the correct format in the
text box, or specify a file on your computer containing the single sequence.
Filter, matrix, and gap options
These options control whether a low-complexity filter is used, which protein
substitution scoring matrix is used, and the gap open and extend penalties.
A value of zero for a gap cost means that the default BLAST gap cost will
be used.
Advanced options
You may enter one or more BLAST advanced options in this text field, for
example "-W 5 -X 10". Each option begins with a hyphen, then a letter (this
indicates which option), a blank, and the option argument. Multiple options
may be entered; two are shown in this example. If there are any conflicts
with options fixed by BLAST Filter, you will receive an error report explaining
which options cannot be specified. For a description of these options, see
the NCBI BLAST Advanced Options help.
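The option syntax described above (hyphen, letter, blank, argument) can be
illustrated with a small parser. This is a sketch under stated assumptions:
the function name is hypothetical, and the set of letters treated as fixed is
a guess made up for the example, since this page does not list which options
BLAST Filter reserves.

```python
# Hypothetical set of option letters reserved by the tool (illustration only).
FIXED_OPTIONS = frozenset({"p", "d", "i", "e"})


def parse_advanced_options(text: str, fixed: frozenset = FIXED_OPTIONS) -> dict:
    """Parse a string such as "-W 5 -X 10" into {"W": "5", "X": "10"}."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("each option needs a letter and an argument")

    options = {}
    conflicts = []
    # Tokens alternate: option flag, then its argument.
    for flag, argument in zip(tokens[0::2], tokens[1::2]):
        if len(flag) != 2 or flag[0] != "-" or not flag[1].isalpha():
            raise ValueError(f"malformed option: {flag!r}")
        letter = flag[1]
        if letter in fixed:
            conflicts.append(flag)
        options[letter] = argument

    if conflicts:
        raise ValueError("options fixed by the tool: " + ", ".join(conflicts))
    return options
```

For the example string "-W 5 -X 10" this returns a mapping with two entries,
one for W and one for X. Arguments are kept as text because different options
take different kinds of values.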
Submit and reset buttons
Click on "Submit Query" to begin the BLAST analysis step of BLAST
Filter. The "Reset" button clears the sequence box and resets all
fields to their default values.
Step 2: Rule Analysis Results
You may bookmark this page and return within 2 days for further analyses.
Result options
The five buttons at the top of the page give you these options:
-
Return to step 1 (BLAST): Returns to the Step 1 page, to allow you
to change BLAST input parameters.
-
Redisplay...: Switches back and forth between showing all sequences,
or only passed sequences. Use it to "declutter" the graphical alignment and
show only passed sequences.
-
View one-line descriptions: Shows the score lines for each subject
sequence matched by BLAST.
-
View output sequence set: Shows the output sequence set (i.e., the
sequences that passed all rules).
-
View complete BLAST report: Shows the text BLAST report.
The last two "View" buttons each display a page of text.
To save the BLAST report or output sequence set,
view the page and use your browser's "save as" function to download as a file.
HSP graphical alignment
The next section of the report is a color-coded, graphical alignment of subject
sequence HSPs against the query. The first bar is the query sequence, labeled
by residue positions. Below are the HSPs for each sequence matched by BLAST,
in the order returned by BLAST (ranked by score). At the left is the sequence
number. For each sequence there are one or more bars showing the HSP(s) for
that subject sequence aligned with the query. The numbers at each end of
an HSP bar are the residue positions of the subject sequence at the start
and end of the HSP. At the far right are sequence and HSP statistics. For each
HSP is shown its E-value, score (bits), and percent identity. For the first HSP
of each sequence the global percent identity (to the query sequence) and sequence
length are also given.
Pausing the
mouse cursor over an HSP displays the first part of the definition line of
that sequence. Clicking on the HSP jumps to that sequence in the "Score lines
for subject sequences" page.
Each sequence has a background color and a colored bar (the HSP). There may
be more than one HSP per sequence. The legend at the top of this section
shows the color schemes. The bar (HSP) color scheme is:
-
black bar: HSP passed all HSP statistic rules
-
red bar: HSP failed one or more HSP statistic rules
-
hollow bar (black or red): the HSP is on the complementary DNA strand (blastn
DNA-DNA comparisons only)
The background color behind the bar indicates whether the sequence passed
all the rules, or has been filtered out by failing one or more rules. The
background color scheme is:
-
green background: sequence passed all rules
-
pink background: sequence is filtered out (failed one or more of the
subject/query comparison rules)
-
orange background: sequence is filtered out because it was redundant (see
Maximum similarity rule)
-
purple background: sequence is filtered out because it exceeded the maximum
number (see Maximum number of sequences rule)
Score lines for subject sequences
These results show the subject sequence definition
lines ranked by score (similar to this section in a standard BLAST report).
At the left of each sequence is its rank number. At the right are the E-value
and score (bits) of the highest scoring HSP for that subject sequence, followed
by query/subject global pairwise alignment statistics (percent identity, positives
and gaps), and ending with the length of the sequence. To view the complete GenBank
entry for a sequence, click on the accession number.
If you have selected the button "Redisplay only passed sequences",
only the score lines of passed (non-filtered) sequences are shown. This is
consistent with the graphical display of HSPs, where only sequences passing
all rules are shown.
Step 2: Rule Analysis (Filter) Options
The last section of the results page summarizes the BLAST analysis and lists
the rules. First is shown the length of the query sequence and the number of
subject sequences matched by the BLAST search. After changing rule values, click
on "Apply change rules" to redisplay this page. The number of sequences
failing each rule is shown at the right of that rule.
The next section of the page displays the rules that can be used to filter
(i.e., remove) sequences matched by BLAST. The following is shown for each
rule: name, range of valid values, current value, and number of subject sequences
failing that rule. The first time this page is displayed the rule values
are such that all sequences pass all rules. The rules are grouped as follows;
the allowable value range for each rule is shown in brackets ("[]")
following the rule name.
Rules for length of subject sequence as a proportion of query sequence length
Subject sequences that are too short or long (with respect to the query length)
may be filtered out with these rules. For example, a "Minimum sequence length"
of 0.5 means that sequences shorter than half the query length will fail.
-
Minimum sequence length [0.0-1.0]: filters subject sequences that are shorter
than the specified proportion of the query sequence length
-
Maximum sequence length [1.0-999]: filters subject sequences that are longer
than the specified proportion of the query sequence length
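The two length rules amount to a range check on the ratio of subject length
to query length. The sketch below assumes that a sequence exactly at a limit
passes, which matches the wording above ("shorter than half the query length
will fail") but is not confirmed for the maximum; the function name is
hypothetical.

```python
def passes_length_rules(subject_len: int, query_len: int,
                        min_ratio: float = 0.0,
                        max_ratio: float = 999.0) -> bool:
    """Return True if subject length / query length lies within the limits.

    The defaults are the widest allowed values, so every sequence passes.
    """
    if query_len <= 0:
        raise ValueError("query length must be positive")
    ratio = subject_len / query_len
    return min_ratio <= ratio <= max_ratio
```

With a 200-residue query and a Minimum sequence length of 0.5, a 99-residue
subject (ratio 0.495) fails and a 100-residue subject (ratio 0.5) passes.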
Rules for global pairwise alignment of the subject with the query
These rules filter sequences based on CLUSTALW global pairwise alignments
with the query.
-
Minimum identity (%) [0-100]: filters subject sequences with less than the
specified percent identity (residue matches) with the query
-
Minimum positives (%) [0-100]: filters subject sequences with less than the
specified percent positives (residue matches plus conservative replacements)
with the query
-
Maximum gaps (%) [0-100]: filters subject sequences with more than the specified
percent gaps with the query
Rules based on HSP statistics determined by BLAST for each subject sequence
These rules filter sequences based on BLAST HSP (local alignment) statistics.
In a BLAST report, each matched subject sequence may have one or more HSPs.
This type of rule works as follows: if at least one HSP for a subject sequence
passes a rule, the sequence passes that rule; i.e., for a subject sequence
to fail based on this type of rule, all HSPs for that sequence must fail
the rule.
-
Minimum Expectation value [0-10]: fails HSPs with an E-value less than the
specified value. The value may be entered in normal notation (e.g., 1 or .001)
or scientific notation (e.g., 1e-10 or 2.5e-5, where "e" means "times 10 raised
to the power of").
-
Maximum Expectation value [0-10]: fails HSPs with an E-value greater than the
specified value (see above for how values may be entered)
-
Minimum score (bits) [0-9999]: fails HSPs with less than the specified score
(bits)
-
Maximum score (bits) [0-9999]: fails HSPs with more than the specified score
(bits)
-
Minimum overlap to query (%) [0-100]: fails HSPs whose length is less than
the specified percentage of the query length
-
Minimum identity (%) [0-100]: fails HSPs with less than the specified percent
identity (residue matches) with the query
-
Minimum positives (%) [0-100]: fails HSPs with less than the specified percent
positives (residue matches plus conservative replacements) with the query
-
Maximum gaps (%) [0-100]: fails HSPs with more than the specified percent
gaps with the query
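The "at least one HSP must pass" logic of these rules can be sketched as
follows, using the Maximum Expectation value rule as the example. The class
and function names are hypothetical and the data layout is an assumption; the
sketch also shows that the normal and scientific notations described above are
both accepted by Python's built-in float().

```python
from dataclasses import dataclass


@dataclass
class HSP:
    evalue: float
    bit_score: float


def max_evalue_rule(limit_text: str):
    """Build a per-HSP test from a limit typed as "1", ".001" or "1e-10"."""
    limit = float(limit_text)
    if not 0 <= limit <= 10:
        raise ValueError("value must be in the range 0-10")
    # An HSP fails only if its E-value is greater than the limit.
    return lambda hsp: hsp.evalue <= limit


def sequence_passes_hsp_rule(hsps, hsp_passes) -> bool:
    """A subject sequence passes if at least one of its HSPs passes.

    Every subject matched by BLAST has at least one HSP; an empty list
    is treated as failing.
    """
    return any(hsp_passes(hsp) for hsp in hsps)
```

For a subject with two HSPs whose E-values are 1e-3 and 1e-20, a limit of
1e-10 fails the first HSP but passes the second, so the sequence as a whole
passes the rule.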
The following rules are applied to the sequences that pass the above
rules
The set of subject sequences that pass all of the above rules is then tested
against two rules designed to limit the redundancy and size of the final
sequence set.
-
Maximum similarity (%) [0-100]:
All-against-all pairwise alignments (CLUSTALW
fast pairwise alignment method) are made for all members of the filtered
sequence set. For each sequence pair, the sequence with the lower BLAST score
is removed if the CLUSTALW percent similarity (similar to, but not exactly
the same as, the percent identity) for the sequence pair is greater than the
specified value. This results in a non-redundant set of related sequences,
i.e., no two sequences share more than a certain percent similarity.
-
Maximum number of sequences [1-9999]: After redundant sequences are removed
by the previous rule, the final sequence set is truncated to this number
by removing the lowest scoring sequences.
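One way to read these two final rules is sketched below. This follows the
description literally (every pair above the threshold removes its
lower-scoring member) and is an assumption about the procedure, not the
tool's actual code; BLAST Filter might, for instance, compare only against
sequences that are still retained. The function names and the dictionary
layout are hypothetical.

```python
from itertools import combinations


def remove_redundant(scores: dict, similarity: dict, max_similarity: float) -> list:
    """Return sequence ids, best BLAST score first, with redundant ones removed.

    scores:     sequence id -> BLAST score
    similarity: frozenset({id_a, id_b}) -> percent similarity of that pair
    """
    removed = set()
    for a, b in combinations(scores, 2):
        if similarity[frozenset((a, b))] > max_similarity:
            # Drop the lower-scoring member; on a tie, drop the second one.
            removed.add(a if scores[a] < scores[b] else b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [seq_id for seq_id in ranked if seq_id not in removed]


def truncate(ranked_ids: list, max_sequences: int) -> list:
    """Keep only the highest-scoring max_sequences ids."""
    return ranked_ids[:max_sequences]
```

At the default value of 100% no pair can exceed the threshold, so nothing is
removed, which is consistent with all sequences passing when the page is first
displayed. Truncation is applied afterwards to the ranked, non-redundant list.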