Module 4: Similarity Searches
Although Modules 2 and 3 focused on DNA sequences only, protein sequences
will also be considered in this module and the next one on alignments (Module 5). In
most cases, the protein sequence corresponding to a DNA (gene) sequence is a better yardstick in similarity
searches and alignment methods, since the protein is the functional unit of genetic information in any organism.
When searching for sequences that are similar to a test sequence, one has to be aware of certain terms (both biological and
computational) that are often used loosely. The following terms and phrases are worth defining here before
discussing the methodologies of similarity searching and sequence alignment.
- Similarity vs. Homology: When two sequences (nucleic acid or protein) are similar over a short stretch or
globally, it does not necessarily mean that they share common functions. It means only that the two sequences share
identical or similar nucleotides or amino acids, locally or globally; i.e., sequence similarity does not
necessarily correlate directly with functional information. Homology, on the other hand, refers to a common evolutionary
origin of the sequences (genes or proteins). Homologous sequences often, though not always, share functional features
somewhere along their length. Similarity is a measurable quantity, whereas homology is an inference drawn from it.
A very high level of similarity between two or more sequences is a
strong indicator of homology between those sequences. As a rule of thumb, 25% identity over a stretch of 100 amino acids
is taken to suggest a common evolutionary origin for two proteins.
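The identity figure quoted above can be computed from a pair of already-aligned sequences. The following is a minimal sketch, not a standard tool: the function name `percent_identity`, the gap symbol "-", and the choice of denominator (all aligned columns, excluding columns where both sequences have a gap) are assumptions made for illustration. Other conventions for the denominator exist and give somewhat different percentages.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Assumption: identity = identical residues / aligned columns,
    where columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep every column except those that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == gap and y == gap)]
    if not columns:
        return 0.0

    # A column counts as identical only if both residues match and are not gaps.
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


# Hypothetical aligned peptide fragments, for illustration only.
print(percent_identity("ACDE", "ACDF"))
print(percent_identity("AC-E", "ACDE"))
```

A value near or above 25% over roughly 100 aligned amino acids would, by the rule of thumb in the text, suggest that the two proteins are worth examining as possible homologs. It does not by itself establish homology.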
- Sensitivity and Selectivity: Sensitivity in a similarity search is the ability to detect more distantly
related sequences. The more sensitive a search is, the more likely it is to drag in "false positive" matches. Protein
similarity searches are generally more sensitive than DNA sequence searches for the following reasons. (1) Since there are
only 4 letters in the DNA alphabet (i.e., the 4 nucleotides A, G, T and C) compared to the 20 amino acid letters of
the protein alphabet, chance matches are more probable in DNA-DNA comparisons than in protein-protein comparisons.
(2) Unlike DNA letters, protein letters (amino acids) are flexible in their interchangeability at any given position along
the protein. Hence, in pairwise comparisons, a pair of DNA bases is generally scored either as a match or a mismatch,
whereas two amino acids can share varying degrees of similarity based on their physical and chemical properties, the degeneracy
of codon usage and natural mutation (exchange) rates. (3) Protein databases are much smaller than DNA databases,
so protein searches return fewer false positives than DNA searches. Selectivity is achieved by
focusing on a narrow set of data points. For example, a researcher already aware of the function of a query protein may
want to search selectively for proteins with similar function or sequence. In general, the lower the selectivity of a search,
the larger the number of "false positives" the search will bring in (i.e., the search reports sequences that are not
genuinely similar to the query sequence - "Close but no cigar!"). Hence selectivity can be thought of as the ability "to
avoid false positives", whereas sensitivity is the ability to avoid false negatives (genuinely related sequences that the
search misses). Later in this module and the next (Module 5) we will learn
how one can adjust various parameters in the different search tools to choose between a search that is sensitive and one
that is selective. It is important to note that this choice is at the discretion of the scientist, based on what they
need for any given search.
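The trade-off described above can be made concrete by scoring one search against a set of sequences known to be genuinely related to the query. This is a minimal sketch under stated assumptions: the function name `search_quality` and the sequence identifiers are hypothetical, a reported hit that is not genuinely related is counted as a false positive and a related sequence that is not reported as a false negative (the usual statistical sense), and selectivity is taken to be the fraction of reported hits that are genuine.

```python
def search_quality(reported_hits, true_relatives):
    """Summarize one search against a known set of genuinely related sequences."""
    reported = set(reported_hits)
    truth = set(true_relatives)

    true_pos = len(reported & truth)    # genuine relatives that were found
    false_pos = len(reported - truth)   # reported hits that are not genuine
    false_neg = len(truth - reported)   # genuine relatives that were missed

    # Sensitivity: fraction of genuine relatives that the search detected.
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    # Selectivity (assumed definition): fraction of reported hits that are genuine.
    selectivity = true_pos / (true_pos + false_pos) if reported else 0.0

    return {"sensitivity": sensitivity, "selectivity": selectivity,
            "false_positives": false_pos, "false_negatives": false_neg}


# Hypothetical identifiers: P1..P4 are genuine relatives, X1..X3 are not.
relatives = ["P1", "P2", "P3", "P4"]
permissive_search = ["P1", "P2", "P3", "X1", "X2", "X3"]
strict_search = ["P1", "P2"]

print(search_quality(permissive_search, relatives))
print(search_quality(strict_search, relatives))
```

In this made-up example the permissive search finds more of the genuine relatives but reports unrelated sequences as well, while the strict search reports only genuine relatives but misses half of them.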
- Global vs. Local: Global similarity methods consider the entire length of the sequences being compared and
assign a quantitative "similarity score". Initial attempts at creating algorithms for similarity searches focused mainly
on global similarities (Ref: Needleman & Wunsch, 1970). Global algorithms are usually not sensitive enough for highly
diverged sequences that retain only some localized similarities. Such sequences can be better analyzed with local
similarity algorithms which, in general, assign a total similarity score for two sequences based on a summation of local
similarity scores. Local similarities are detected by comparing similar "words" between the two sequences. A word could
be a certain number of nucleotides (perhaps 5 or 6) or amino acids (1 or 2). We will look at the three most widely employed
algorithms (FASTA, BLAST,
- Heuristic approach: A heuristic approach to problem-solving arrives at a solution by trying rather
than by following a pre-established formula; i.e., it can be considered a "trial and error" learning method. It
can also be thought of as learning by experience, or a "rule-of-thumb" approach (e.g., as with human chess players).
Computational molecular biologists have devised algorithms that solve similarity search and alignment problems with a
heuristic approach based on successive approximations. Such heuristic algorithms are much faster than algorithms based on
- Dynamic programming: This is a mathematically complex computational technique which structures a large search
space (such as nucleic acid or protein sequence comparisons) into a succession of stages such that
  - the initial stage contains trivial solutions to sub-problems (partial solutions),
  - each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in
    an earlier stage,
  - the final stage contains the overall solution.
Dynamic programming is then employed to arrive at an optimal alignment for the two sequences. A good reference for a
discussion of this topic is the VSNS (Virtual School of Natural Sciences)
Biocomputing Hypertext Coursebook, Chapter 1 by
Robert Giegerich and David Wheeler.
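The three stages described for dynamic programming can be sketched as a small global alignment scorer in the style of Needleman & Wunsch. This is an illustrative sketch, not the implementation used by any particular tool: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity, and only the optimal score is computed, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not taken from the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial partial solutions, i.e. aligning a prefix
    # of one sequence against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on a fixed number (three) of
    # previously computed partial solutions.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap             # a[i-1] against a gap
            left = score[i][j - 1] + gap           # b[j-1] against a gap
            score[i][j] = max(diagonal, up, left)

    # Final stage: the last cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 2: three matches, one gap
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, so the work grows with the product of the two sequence lengths. That cost is modest for a single pair of sequences but adds up quickly when a query must be compared against every entry in a large database.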