Module 4: Similarity Searches

 

Similarity Search Methods

Distance methods and scoring matrices module 4 contents Exercise 4 back to the index of modules


This discussion will concentrate on three of the most commonly employed computational tools for scanning protein and DNA databases for similarity to a test or query sequence. The first two, FASTA and BLAST are heuristic algorithms which use certain assumptions and approximations. Both programs, although using different approaches, first identify very short exact matches between the query sequence and the database sequence(s). Next, the best short hits from the first step are extended to look for longer stretches of similarity. Finally, the best hits are optimized with some form of dynamic programming.

On the other hand, the Smith-Waterman approach is a wholly dynamic programming tool which effectively makes all possible pairwise comparisons to all of the database sequences. Hence, it is a much more sensitive technique as compared to FASTA and BLAST, but it is much more computationally expensive and slower than either one of them. In general, the sensitivity and selectivity of FASTA and BLAST searches are comparable to those of Smith-Waterman searches due to the incorporation of certain statistical parameters into the former two programs.

FASTA

  • It is based on a heuristic approach with restrictions of word size and window size.
  • As in dot plots, FASTA compares two sequences at a time. Instead of comparing one nucleotide at a time, it compares a word of 5-6 nucleotides (or 2 amino acids at a time). The parameter in FASTA for word size is named "ktup". Most FASTA programs use a default ktup of 6 nucleotides or 2 amino acids. If the sensitivity of the search is to be increased, then the ktup window size should be decreased (to 3 or 4 nucleotides or 1 amino acid). This will slow down the search process. The program matches all identical words from the two sequences and creates diagonals by joining adjacent matches which are non-overlapping (Figure A).
  • Next, the highest scoring regions (in similarity) are re-scored using a substitution matrix such as the PAM250 and the best of these scores is called "init1" (Figure B).
  • The high scoring diagonals are joined together allowing for gaps between them. The diagonal with the best score gets the value of "initn" (Figure C).
  • In the final step, an optimal local alignment is performed between the query sequence and a limited number of database "hit" sequences with high initn values. For this alignment, the Smith-Waterman dynamic programming approach is employed. This optimal alignment score is labeled "opt" (Figure D).
  • What to look for in FASTA outputs?
    • FASTA calculates an expectation of significance value [E() value]. If this value is <0.02, the similarity measure between the query sequence and a sequence(s) in a database is statistically significant.
    • The z-score is derived from the opt score after correcting for differences in sequence length.

FASTA format for DNA and protein sequence data is compatible with virtually all molecular biology programs. A sequence in this format has a single-line description (header) followed by lines of sequence data. The header line is distinguished from the data lines by a greater-than (">") symbol as its first character. An example of sequence in FASTA format is as follows:

>gi|129369|sp|P04637|P53_HUMAN CELLULAR TUMOR ANTIGEN P53 (PHOSPHOPROTEIN P53) MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA PPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT CPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRN TFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGR DRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEL KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

REFERENCES

1. W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444- 2448,

2. W. R. Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and FASTA" Methods in Enzymology 183:63- 98).

BLAST (Basic Local Alignment Search Tool)

  • Heuristic approach developed at the NCBI (BLAST) for local alignments that can detect relationships among sequences which share only isolated regions of similarity.
  • Two step process - word search followed by the maximal segment pair alignment method (a simplification of the Smith-Waterman algorithm)
  • Fast but less sensitive
  • Much more effective for protein than DNA sequences
  • BLAST2 can incorporate gaps
  • BLAST outputs - what to look for? how to read e values?

The sequences in these databases are compared against one another using the BLAST algorithm for finding ungapped local alignments. This will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities. Nucleotide neighbors are most successfully used to build contigs rather than discern biological function.

REFERENCES

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J Mol Biol, 215, 403-10.

Smith-Waterman search (ref. "pairwise sequence alignment" by R. Giegerich & D. Wheeler - Virtual School of Natural Sciences [VSNS] web site)

  • Dynamic programming approach
  • Slow, but more sensitive
  • Best pairwise alignment with user-defined parameters
  • Queries by local alignments
  • Allows gaps in the alignment
  • blitz from embl & bioccelerator at the weizmann institute of science


Distance methods and scoring matrices module 4 contents Exercise 4 back to the index of modules

| Return to SWBIC home |

The Southwest Biotechnology and Informatics Center WWW server is located at "http://www.swbic.org/".
Please send comments and suggestions to: [email protected]
SWBIC 2001