Module 4: Similarity Searches




Although Modules 2 and 3 focused on DNA sequences only, this module and the next one on alignments (Module 5) also consider protein sequences. In most cases, the protein sequence encoded by a DNA (gene) sequence is a better yardstick for similarity searches and alignment methods, since the protein is the functional unit of genetic information in any organism.

In searching for sequences that are similar to a test sequence, one has to be aware of certain terms (both biological and computational) that are often used loosely. The following terms and phrases are defined here before the discussion of similarity searching and sequence alignment methods.
  • Similarity vs. Homology: When two sequences (nucleic acid or protein) are similar, whether over a short stretch or globally, it means only that they share identical or similar nucleotides or amino acids in that region. It does not necessarily mean that they share a common function; i.e., sequence data does not necessarily correlate directly with functional information. Homology, on the other hand, refers to a common evolutionary relationship between the sequences (genes or proteins). Homologous sequences often, though not always, share common functional features somewhere along their sequences. A very high level of similarity between two or more sequences is a strong indicator of homology between them. As a rule of thumb, about 25% identity over a stretch of 100 amino acids suggests a common evolutionary origin for two proteins.
  • Sensitivity and Selectivity: Sensitivity in a similarity search is the ability to detect more distantly related sequences. The more sensitive a search is, the more likely it is to drag in "false positive" matches. Protein similarity searches are generally more sensitive than DNA sequence searches for the following reasons:
    • The DNA alphabet has only 4 letters (the nucleotides A, G, T and C) compared with the 20 amino acid letters of the protein alphabet, so chance matches are more probable in DNA-DNA comparisons than in protein-protein comparisons.
    • Unlike DNA letters, protein letters (amino acids) are flexible in their interchangeability at any given position along the protein. In pairwise comparisons, a pair of DNA bases is generally scored either as a match or a mismatch, whereas two amino acids can share varying degrees of similarity based on their physical and chemical properties, the degeneracy of codon usage and natural mutation (exchange) rates.
    • Protein databases are much smaller than DNA databases, so protein searches return fewer false positives than DNA searches.
    Selectivity is achieved by focusing on a narrow set of data points. For example, a researcher already aware of the function of a query protein may want to search selectively for proteins with similar function or sequence. In general, the lower the selectivity of a search, the larger the number of "false positives" it will bring in (i.e., the search reports sequences that are not genuinely similar to the query sequence: "Close but no cigar!"). Hence selectivity can be thought of as the ability to avoid false positives. A "false negative", by contrast, is a genuinely related sequence that the search misses, which is a failure of sensitivity. Later in this module and the next (Module 5) we will learn how to adjust various parameters in the different search tools to choose between a search that is sensitive and one that is selective. This choice is at the discretion of the scientist, based on what he/she needs for any given search.
  • Global vs. Local: Global similarity methods consider the entire length of the sequences being compared and assign a quantitative "similarity score". Initial attempts at creating algorithms for similarity searches focused mainly on global similarities (Ref: Needleman & Wunsch, 1970). Global algorithms are usually not sensitive enough for highly diverged sequences that retain only some localized similarities. Such sequences are better analyzed with local similarity algorithms which, in general, assign a total similarity score for two sequences based on a summation of local similarity scores. In word-based methods, local similarities are detected by comparing similar "words" between the two sequences. A word could be a certain number of nucleotides (perhaps 5 or 6) or amino acids (1 or 2). We will look at the three most widely employed algorithms (FASTA, BLAST and Smith-Waterman).
  • Heuristic approach: A heuristic approach to problem-solving is the process of knowing by trying rather than by following some pre-established formula; i.e., it can be considered a "trial and error" learning method. It can also be a way of learning by experience, or the "rule-of-thumb" approach (e.g., as with human chess players). Computational molecular biologists have devised algorithms that solve similarity search and alignment problems heuristically, by successive approximations. Such heuristic algorithms are much faster than algorithms based on dynamic programming, at the cost of not being guaranteed to find the optimal result.
  • Dynamic programming: This is a mathematically complex computational technique that structures a large search space (such as the comparison of nucleic acid or protein sequences) into a succession of stages such that
    • the initial stage contains trivial solutions to sub-problems (partial solutions),
    • each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage,
    • the final stage contains the overall solution.
    Dynamic programming is then employed to arrive at an optimal alignment for the two sequences. A good discussion of this topic is Chapter 1 of the VSNS (Virtual School of Natural Sciences) Biocomputing Hypertext Coursebook, by Robert Giegerich and David Wheeler.
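
The three stages listed under dynamic programming, and the global/local distinction, can be illustrated with a minimal Python sketch. It is not taken from the course material: the function name `alignment_score` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen only for illustration, and the code returns just the optimal score, not the alignment itself. With `local=False` it follows the global (Needleman-Wunsch style) recurrence; with `local=True` it follows the local (Smith-Waterman style) recurrence.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False: global score (Needleman-Wunsch style recurrence).
    local=True:  best local score (Smith-Waterman style recurrence).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Initial stage: trivial sub-problems in which one sequence is empty.
    score = [[0] * cols for _ in range(rows)]
    if not local:
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell depends on only three earlier partial solutions.
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell

    # Final stage: the overall solution.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                         # global score
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))    # local score
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two sequence lengths. That cost is the reason the faster heuristic methods are preferred for searches against whole databases.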

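The word-based detection of local similarities mentioned under "Global vs. Local" can be sketched in the same way. This is a simplified illustration in the spirit of the first stage of word-based heuristic tools, not the actual FASTA or BLAST implementation; the function names `word_hits` and `best_diagonal` and the example sequences are hypothetical. The sketch indexes every k-letter word of one sequence, looks up the words of the other, and then finds the diagonal (query position minus subject position) that collects the most exact word matches.

```python
from collections import defaultdict


def word_hits(query, subject, k=5):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    # Index every word of the subject by its starting position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (offset, count) for the diagonal with the most word hits, or None."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[i - j] += 1  # hits on one diagonal share the same offset
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


hits = word_hits("ACGTACGTGG", "TTACGTACGT", k=5)
print(hits)
print(best_diagonal(hits))
```

A diagonal with many hits marks a region where the two sequences line up without gaps, which a heuristic search can then examine more closely. A smaller word size k finds more hits (more sensitive) but also more chance matches (less selective).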


The Southwest Biotechnology and Informatics Center WWW server is located at "".
SWBIC 2001