BLAST Filter Help
BLAST Filter is a bioinformatics tool that builds a set of related sequences from a single query sequence. It returns the full sequences, which can then be used as input for a multiple alignment, hidden Markov models, or other analyses that require a set of homologous sequences. It works in two steps:
- Step 1: BLAST Analysis. Your query sequence (protein or DNA) is searched by BLAST (blastp or blastn) against a sequence database.
- Step 2: Rule Analysis. The results of the BLAST search are displayed graphically by showing the alignments of all HSPs against the query. A set of rules is also shown; initially, all subject sequences from the BLAST search pass all these rules. You may then change the values of one or more of the rules and re-apply the rules to remove or “filter” sequences from the full set returned by BLAST. Sequences failing one or more rules are shown in different colors.
You may re-apply changed rules as often as you like, and even return to the Step 1 page to change BLAST parameters. When you are finished, a FASTA-format file of the full sequences matched by BLAST and passing all rules can be viewed and saved. The color-coding of the graphical alignment display makes it easy to see which sequences and HSPs (BLAST local alignments) are failing the current set of rules.
- Number of BLAST matches returned: 1000 subject sequences
- Sequence length: 25,000 letters (for any single sequence)
Compute time considerations: Step 1 (BLAST analysis) can be time-consuming because, in addition to the BLAST search, CLUSTALW is run on all query-subject pairs to compute global pairwise statistics. Therefore, a long sequence (> 500 residues) with many BLAST matches (> 500) can take several minutes or more to complete. An intermediate page between Steps 1 and 2 will display run time information and automatically go to Step 2 when Step 1 has completed. Also, the Maximum similarity (%) rule involves all-against-all pairwise alignments and can be time-consuming; in general, you should leave this rule at its default value (100%) until you have refined the other rules.
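To give a rough sense of why the Maximum similarity (%) rule is the costlier one, the sketch below counts alignments. It assumes that "all-against-all" means one alignment per unordered pair of sequences; the function name is illustrative and not part of BLAST Filter.

```python
def alignment_counts(n_subjects):
    """Return (query-subject alignments, all-against-all alignments).

    Assumption: all-against-all aligns each unordered pair once.
    """
    query_subject = n_subjects
    all_against_all = n_subjects * (n_subjects - 1) // 2
    return query_subject, all_against_all


# 500 BLAST matches: 500 query-subject alignments in Step 1,
# but 124750 pairwise alignments if every match reaches the similarity rule.
print(alignment_counts(500))
```

The pair count grows with the square of the number of sequences, so filtering with the other rules first keeps this step manageable.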
Page loading time: Internet Explorer is much more efficient than Netscape Navigator (version 4.7) at displaying the Step 2: Rule Analysis page, which includes the graphical alignment of HSPs. If you are using Netscape and find this a problem, you may wish to switch to Internet Explorer for this application.
You may paste a sequence in the text box, or click the UPLOAD button to specify a sequence file on your computer. You may use raw (just lines of sequence) or FASTA-format, i.e., the sequence begins with the “>” character in the first position, followed by descriptive text (the “definition line”). One or more lines containing the sequence then follow. These lines may be of varying length but must contain only letters, blanks, and tabs. The following example is valid:
>seq123 this is description
MGIKALAGRDLLAIADLTIEEMKSLLQLAADLKSGVLKPHCRKILGLLFYKASTRTRVSF
TAAMYQLGGQ VLDLNPSVTQ VGRGEPIQDT ARVLDRYIDI
LAVRTFKQTDLQTFADHAKM
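The format rules above can be checked with a few lines of code. This is a minimal sketch, not the validation BLAST Filter itself performs; the function name is hypothetical, and restricting letters to ASCII is an assumption.

```python
def parse_sequence_input(text):
    """Split raw or FASTA-format input into (definition_line, sequence)."""
    lines = [line for line in text.splitlines() if line.strip()]
    definition = ""
    # FASTA format: ">" in the first position of the first line.
    if lines and lines[0].startswith(">"):
        definition = lines[0][1:].strip()
        lines = lines[1:]
    sequence = ""
    for line in lines:
        # Blanks and tabs are allowed inside sequence lines and are dropped.
        residues = line.replace(" ", "").replace("\t", "")
        if not (residues.isascii() and residues.isalpha()):
            raise ValueError(f"invalid characters in sequence line: {line!r}")
        sequence += residues
    if not sequence:
        raise ValueError("no sequence found")
    return definition, sequence


definition, sequence = parse_sequence_input(
    ">seq123 this is description\nMGIKALAG\nTAAMYQLGGQ VLDLNPSVTQ\n"
)
print(definition)
print(len(sequence))
```

Raw input (no definition line) goes through the same function and returns an empty definition.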
In this step you supply a query sequence, which is searched against a database using BLAST. After that, the BLAST results are analyzed and all query-subject, global pairwise alignments are computed by CLUSTALW because some of the rules use those data in addition to BLAST data. If any errors occur in this step, this page will be re-displayed along with error messages.
For help on BLAST inputs, see the NCBI BLAST Help page. Note that the E-value (or “Expect”) threshold is not an option. It is fixed at 10 in this step, but can be changed in a rule in Step 2.
Select a program
There are only two choices: blastp (protein query, protein database) and blastn (DNA query, DNA database).
Select a database
Choose one from the list of available databases; their type (protein or DNA) is specified in the list. You must select a database that matches the type of the query sequence. For more information on available databases see SWBIC sequence databases.
Paste or upload sequence
Either paste a sequence in the correct format in the text box, or specify a file on your computer containing the single sequence.
Filter, matrix, and gap options
These options control whether a low-complexity filter is used, which protein substitution scoring matrix is used, and the gap open and extend penalties. A value of zero for a gap cost means that the default BLAST gap cost will be used.
You may enter one or more BLAST advanced options in this text field, for example “-W 5 -X 10”. Each option begins with a hyphen, then a letter (this indicates which option), a blank, and the option argument. Multiple options may be entered; two are shown in this example. If there are any conflicts with options fixed by BLAST Filter, you will receive an error report explaining which options cannot be specified. For a description of these options, see the NCBI BLAST Advanced Options help.
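The option syntax described above (hyphen, letter, blank, argument) can be sketched as a small parser. The set of fixed options below is an assumption for illustration: only the E-value is stated to be fixed in Step 1, and "-e" is assumed to be its option letter. BLAST Filter's real conflict list is not documented here.

```python
# Hypothetical: option letters reserved by the tool (assumed, not documented).
FIXED_OPTIONS = {"e"}


def parse_advanced_options(text, fixed=FIXED_OPTIONS):
    """Parse a string such as "-W 5 -X 10" into {"W": "5", "X": "10"}."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("each option needs a letter and an argument")
    options = {}
    conflicts = []
    # Tokens alternate: option flag, then its argument.
    for flag, argument in zip(tokens[0::2], tokens[1::2]):
        if len(flag) != 2 or flag[0] != "-" or not flag[1].isalpha():
            raise ValueError(f"malformed option: {flag!r}")
        letter = flag[1]
        if letter in fixed:
            conflicts.append("-" + letter)
        else:
            options[letter] = argument
    if conflicts:
        raise ValueError("options that cannot be specified: " + ", ".join(conflicts))
    return options


print(parse_advanced_options("-W 5 -X 10"))
```

Reading tokens in pairs lets an argument begin with a hyphen (a negative number, for example) without being mistaken for an option.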
Submit and reset buttons
Click on “Submit Query” to begin the BLAST analysis step of BLAST Filter. The “Reset” button clears the sequence box and resets all fields to their default values.
You may bookmark this page and return within 2 days for further analyses.
The five buttons at the top of the page give you these options:
- Return to step 1 (BLAST): Returns to the Step 1 page, to allow you to change BLAST input parameters.
- Redisplay…: Switches back and forth between showing all sequences, or only passed sequences. Use it to “declutter” the graphical alignment and show only passed sequences.
- View one-line descriptions: Shows the score lines for each subject sequence matched by BLAST.
- View output sequence set: Shows the output sequence set (i.e., the sequences that passed all rules).
- View complete BLAST report: Shows the text BLAST report.
The last two “View” buttons each display a page of text. To save the BLAST report or output sequence set, view the page and use your browser’s “save as” function to download as a file.
HSP graphical alignment
The next section of the report is a color-coded, graphical alignment of subject sequence HSPs against the query. The first bar is the query sequence, labeled by residue positions. Below it are the HSPs for each sequence matched by BLAST, in the order returned by BLAST (ranked by score). At the left is the sequence number. For each sequence there are one or more bars showing the HSP(s) for that subject sequence aligned with the query. The numbers at each end of an HSP bar are the residue positions of the subject sequence at the start and end of the HSP. At the far right are sequence and HSP statistics: the E-value, score (bits), and percent identity are shown for each HSP. For the first HSP of each sequence, the global percent identity (to the query sequence) and the sequence length are also given.
Pausing the mouse cursor over an HSP displays the first part of the definition line of that sequence. Clicking on the HSP jumps to that sequence in the “Score lines for subject sequences” page.
Each sequence has a background color and a colored bar (the HSP). There may be more than one HSP per sequence. The legend at the top of this section shows the color schemes. The bar (HSP) color scheme is:
- black bar: HSP passed all HSP statistic rules
- red bar: HSP failed one or more HSP statistic rules
- hollow bar (black or red): the HSP is on the complementary DNA strand (blastn DNA-DNA comparisons only)
The background color behind the bar indicates whether the sequence passed all the rules, or has been filtered out by failing one or more rules. The background color scheme is:
- green background: sequence passed all rules
- pink background: sequence is filtered out (failed one or more of the subject/query comparison rules)
- orange background: sequence is filtered out because it was redundant (see Maximum similarity rule)
- purple background: sequence is filtered out because it exceeded the maximum number (see Maximum number of sequences rule)
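The two color schemes can be summarized as a pair of lookups. This is only an illustrative sketch of the legend, not the display code; the function names, the boolean flags, and the word "solid" for a non-hollow bar are assumptions.

```python
def bar_style(hsp_passed, complementary_strand=False):
    """Return (color, fill) for one HSP bar."""
    color = "black" if hsp_passed else "red"
    # Hollow bars mark HSPs on the complementary DNA strand (blastn only).
    fill = "hollow" if complementary_strand else "solid"
    return color, fill


def background_color(failed_comparison_rules, redundant=False, over_limit=False):
    """Return the background color for one subject sequence."""
    if failed_comparison_rules:
        return "pink"
    if redundant:        # removed by the Maximum similarity rule
        return "orange"
    if over_limit:       # removed by the Maximum number of sequences rule
        return "purple"
    return "green"


print(bar_style(hsp_passed=False, complementary_strand=True))
print(background_color(failed_comparison_rules=False, redundant=True))
```

The checks are ordered this way because the similarity and maximum-number rules are applied only to sequences that have already passed the other rules.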
Score lines for subject sequences
These results show the subject sequence definition lines ranked by score (similar to this section in a standard BLAST report). At the left of each sequence is its rank number. At the right are the E-value and score (bits) of the highest scoring HSP for that subject sequence, followed by query/subject global pairwise alignment statistics (percent identity, positives and gaps), and ending with the length of the sequence. To view the complete GenBank entry for a sequence, click on the accession number.
If you have selected the button “Redisplay only passed sequences” only the score lines of passed (non-filtered) sequences are shown. This is consistent with the graphical display of HSPs, where only sequences passing all rules are shown.
The last section of the results page shows the results of the BLAST analysis. First, the length of the query sequence and the number of subject sequences matched by the BLAST search are shown. After changing rule values, click on “Apply change rules” to redisplay this page. The number of sequences failing each rule is shown at the right of that rule.
The next section of the page displays the rules that can be used to filter (i.e., remove) sequences matched by BLAST. The following is shown for each rule: name, range of valid values, current value, and number of subject sequences failing that rule. The first time this page is displayed, the rule values are such that all sequences pass all rules. The rules are grouped as follows, with the allowable value range for each rule shown in brackets (“[ ]”) following the rule name.
Rules for length of subject sequence as a proportion of query sequence length
Subject sequences that are too short or long (with respect to the query length) may be filtered out with these rules. For example, a “Minimum sequence length” of 0.5 means that sequences shorter than half the query length will fail.
- Minimum sequence length [0.0-1.0]: filters subject sequences that are less than a proportion of the query sequence length
- Maximum sequence length [1.0-999]: filters subject sequences that are greater than a proportion of the query sequence length
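As a worked example of the two length rules, the sketch below compares the subject length with the query length as a ratio. The function and parameter names are illustrative; the defaults are simply the widest values the ranges allow, assumed here so that every sequence passes initially.

```python
def passes_length_rules(subject_len, query_len, min_ratio=0.0, max_ratio=999.0):
    """True if the subject length is within the allowed proportion of the query length."""
    ratio = subject_len / query_len
    # Only sequences strictly shorter than the minimum or longer than the maximum fail.
    return min_ratio <= ratio <= max_ratio


# Query of 200 residues, "Minimum sequence length" set to 0.5:
print(passes_length_rules(99, 200, min_ratio=0.5))    # shorter than half the query
print(passes_length_rules(100, 200, min_ratio=0.5))   # exactly half the query
```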
Rules for global pairwise alignment of the subject with the query
These rules filter sequences based on CLUSTALW global pairwise alignments with the query.
- Minimum identity (%) [0-100]: filters subject sequences with less than the specified percent identity (residue matches) with the query
- Minimum positives (%) [0-100]: filters subject sequences with less than the specified percent positives (residue matches plus conservative replacements) with the query
- Maximum gaps (%) [0-100]: filters subject sequences with more than the specified percent gaps with the query
Rules based on HSP statistics determined by BLAST for each subject sequence
These rules filter sequences based on BLAST HSP (local alignment) statistics. In a BLAST report, each matched subject sequence may have one or more HSPs. This type of rule works as follows: if at least one HSP for a subject sequence passes a rule, the sequence passes that rule; i.e., for a subject sequence to fail based on this type of rule, all HSPs for that sequence must fail the rule.
- Minimum Expectation value [0-10]: fails HSPs with an E-value less than the specified value. The value may be entered in normal notation (e.g., 1 or .001) or scientific notation (e.g., 1e-10 or 2.5e-5, where “e” means “times 10 raised to the power of”).
- Maximum Expectation value [0-10]: fails HSPs with more than the specified E-value (see above for how values may be entered)
- Minimum score (bits) [0-9999]: fails HSPs with less than the specified score (bits)
- Maximum score (bits) [0-9999]: fails HSPs with more than the specified score (bits)
- Minimum overlap to query (%) [0-100]: fails HSPs whose length is less than a percent of the query length
- Minimum identity (%) [0-100]: fails HSPs with less than the specified percent identity (residue matches) with the query
- Minimum positives (%) [0-100]: fails HSPs with less than the specified percent positives (residue matches plus conservative replacements) with the query
- Maximum gaps (%) [0-100]: fails HSPs with more than the specified percent gaps with the query
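The "at least one HSP passes" logic can be sketched as follows. This is a minimal illustration under stated assumptions, not BLAST Filter's code: the `Hsp` class, the function names, and the subset of rules shown are hypothetical, and the HSP length is assumed to be the aligned span in residues.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    evalue: float
    bits: float
    length: int          # assumed: aligned span in residues
    identity_pct: float


def build_hsp_rules(query_len, min_evalue=0.0, max_evalue=10.0, min_bits=0.0,
                    max_bits=9999.0, min_overlap_pct=0.0, min_identity_pct=0.0):
    """Map each rule name to a test applied to a single HSP."""
    return {
        "Minimum Expectation value": lambda h: h.evalue >= min_evalue,
        "Maximum Expectation value": lambda h: h.evalue <= max_evalue,
        "Minimum score (bits)": lambda h: h.bits >= min_bits,
        "Maximum score (bits)": lambda h: h.bits <= max_bits,
        "Minimum overlap to query (%)":
            lambda h: 100.0 * h.length / query_len >= min_overlap_pct,
        "Minimum identity (%)": lambda h: h.identity_pct >= min_identity_pct,
    }


def failed_rules(hsps, rules):
    """A sequence fails a rule only if every one of its HSPs fails that rule."""
    return [name for name, passes in rules.items()
            if not any(passes(h) for h in hsps)]


hsps = [Hsp(1e-30, 250.0, 180, 62.0), Hsp(0.5, 30.0, 40, 35.0)]
# float() accepts both normal and scientific notation, e.g. ".001" or "1e-10".
rules = build_hsp_rules(query_len=200, max_evalue=float("1e-10"), min_overlap_pct=50)
print(failed_rules(hsps, rules))
```

Each rule is evaluated independently across the HSPs, which follows the description above literally: a sequence can pass two rules by way of two different HSPs.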
The following rules are applied to the sequences that pass the above rules
The set of subject sequences that pass all of the above rules are then tested against two rules designed to limit the redundancy and size of the final sequence set.
- Maximum similarity (%) [0-100]: All-against-all pairwise alignments (CLUSTALW fast pairwise alignment method) are made for all members of the filtered sequence set. For each sequence pair, the sequence with the lower BLAST score is removed if the CLUSTALW percent similarity (similar to, but not exactly the same as, the percent identity) for the sequence pair is greater than the specified value. This results in a non-redundant set of related sequences, i.e., no two sequences share more than the specified percent similarity.
- Maximum number of sequences [1-9999]: After redundant sequences are removed by the previous rule, the final sequence set is truncated to this number by removing the lowest scoring sequences.
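These two final rules can be sketched as below. The code follows a literal reading of the description (every pair above the threshold removes its lower-scoring member); whether BLAST Filter instead processes sequences greedily in score order is not stated, so treat this as one possible interpretation. The function name and the `similarity` callback are hypothetical stand-ins for the CLUSTALW results.

```python
from itertools import combinations


def filter_redundant(scores, similarity, max_similarity=100.0, max_sequences=9999):
    """scores: dict of name -> BLAST score; similarity(a, b) -> percent similarity."""
    removed = set()
    for a, b in combinations(scores, 2):
        if similarity(a, b) > max_similarity:
            # Drop the lower-scoring member of the pair (the second one on a tie).
            removed.add(a if scores[a] < scores[b] else b)
    kept = sorted((name for name in scores if name not in removed),
                  key=lambda name: scores[name], reverse=True)
    # Truncate by removing the lowest-scoring sequences.
    return kept[:max_sequences]


scores = {"A": 300, "B": 250, "C": 100}
table = {frozenset(("A", "B")): 95.0,
         frozenset(("A", "C")): 40.0,
         frozenset(("B", "C")): 50.0}
print(filter_redundant(scores, lambda a, b: table[frozenset((a, b))],
                       max_similarity=90))
```

With the default value of 100, no pair can exceed the threshold, so no sequence is removed as redundant.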