Module 4: Similarity Searches
Distance Methods and Scoring Matrices
Algorithms for similarity searches and sequence alignments use two different ways of quantifying how alike a pair of
sequences is: a distance measure and a similarity measure (usually in the form of a scoring matrix). In general, for any
given pair of sequences, these two quantities are inversely related, i.e., the larger the distance between two
sequences, the smaller the similarity, and vice versa.
- The distance between two sequences is computed as the sum of the differences between them.
- In the case of nucleic acid sequences, differences due to insertions and deletions are generally given a larger
distance score than substitutions. Replacement of one nucleotide by any of the other three can be treated as
equally weighted substitutions, or different weights can be assigned based on the naturally occurring frequencies of
transitions (purines to purines or pyrimidines to pyrimidines) and transversions (purines to pyrimidines or vice
versa). In general, transitions are much more common than transversions (due to the physical constraints of base
- With protein sequences, simple distance methods are not well suited to sequence comparisons, because the likelihood of
an amino acid replacement depends on several factors, such as the following: (1) Some amino acid substitutions do not
change protein structure or function, so the residues involved are considered highly interchangeable (naturally
mutable). (2) The degeneracy of the genetic code has to be taken into account when amino acid frequencies are used to
calculate the probabilities of substitutions between two protein sequences.
- As a practical alternative, tables of amino acid replacement frequencies observed among related sequences in protein
families have been developed. These are the PAM (Percent Accepted Mutations) matrices.
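The nucleotide distance described above can be sketched as a weighted sum over the columns of an existing alignment.
This is only an illustration: the function name and the three weights (1.0 for a transition, 2.0 for a transversion,
3.0 for an insertion or deletion) are assumptions chosen to respect the ordering given in the text, not values taken
from any particular program. The two sequences are assumed to be already aligned, with "-" marking gaps.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES | {"-"}


def weighted_distance(seq1, seq2, transition=1.0, transversion=2.0, gap=3.0):
    """Sum per-column differences between two pre-aligned nucleotide sequences.

    The default weights are illustrative assumptions, not standard values.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in VALID or b not in VALID:
            raise ValueError(f"unexpected symbol in column: {a!r}/{b!r}")
        if a == b:
            continue  # identical column (or gap facing gap): no difference
        if a == "-" or b == "-":
            total += gap  # insertion or deletion
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            total += transition  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            total += transversion  # purine <-> pyrimidine
    return total


# One transition (A/G), one transversion (C/A) and one gap: 1.0 + 2.0 + 3.0
print(weighted_distance("AGCT-A", "GGATTA"))  # prints 6.0
```

Setting transition and transversion to the same value gives the simpler scheme in which all substitutions count
equally.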
- PAM (Percent Accepted Mutations) similarity matrix
- PAM matrices for protein sequences are constructed from the amino acid replacements observed in evolutionarily
related sequences. Each entry is a score reflecting how likely one amino acid is to be replaced by another, based on
natural mutation rates in related protein families. Hence, these matrices are sometimes called "substitution matrices".
- A score above zero for a pair of amino acids indicates that the two replace each other more often than
expected by chance alone, i.e., they are functionally exchangeable.
- A negative score (below zero) indicates that the two amino acids are rarely interchangeable, e.g., a basic
amino acid and an acidic one, or one with an aromatic side chain and one with an aliphatic side chain.
- The number 250 in PAM250 corresponds to an evolutionary distance of, on average, 250 accepted amino acid replacements
per 100 residues, extrapolated from a data set of 71 groups of closely related protein sequences [Ref: Dayhoff, M.,
Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. In: Atlas of Protein Sequence
and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver
Spring, MD]. The higher the matrix number, the greater the evolutionary distance between the compared sequences.
Figure: PAM250 Matrix
- The PAM45 matrix is the most commonly used default matrix for comparing nucleotide sequences (a score of +5 for
matches and -4 for mismatches; it assumes equal rates of transitions and transversions).
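The sign convention for substitution scores described above (positive when a pair is seen more often than chance
predicts, negative when it is seen less often) follows from expressing each score as a log-odds ratio. The sketch below
illustrates this with made-up frequencies. The function name and the input numbers are assumptions for illustration
only, and the scaling (10 times the base-10 logarithm, rounded to an integer) follows the way Dayhoff-style matrices
are usually presented.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=10.0):
    """Return a rounded log-odds score for one pair of amino acids.

    observed_pair_freq: how often the pair is seen aligned in related sequences
    freq_a, freq_b:     background frequencies of the two amino acids
    All inputs here are hypothetical values, not data from a real matrix.
    """
    expected_by_chance = freq_a * freq_b
    return round(scale * math.log10(observed_pair_freq / expected_by_chance))


# Pair seen twice as often as chance predicts: positive score
print(log_odds_score(0.02, 0.1, 0.1))   # prints 3
# Pair seen half as often as chance predicts: negative score
print(log_odds_score(0.005, 0.1, 0.1))  # prints -3
```

A score of zero corresponds to a pair that is aligned exactly as often as expected by chance.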
- BLOSUM (Blocks Substitution Matrix) matrix
- These are substitution matrices derived from the observed frequencies of amino acid replacements in highly conserved
regions of ungapped local alignments. The data for the substitution scores in these matrices come from about 2000
blocks of aligned sequence segments characterizing more than 500 groups of related proteins [Ref: Henikoff, S.,
and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89: 10915-10919]
- The BLAST server from NCBI and the search servers from EBI use different versions of the BLOSUM matrix for protein
similarity searches and alignments.
- It is advisable to run protein sequence comparisons with several different substitution matrices. Comparing the
results helps to identify the most reliable matches.
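To see why the choice of matrix matters, the sketch below scores one query against two candidates using two different
matrices and reports the best hit under each. Everything here is a toy assumption: the two matrices cover only three
residues (K, R, D), their values are invented for illustration and are not taken from any PAM or BLOSUM table, and the
function names are hypothetical. Alignments are ungapped and of equal length.

```python
# Toy matrices (invented values). Each unordered pair is stored once.
TOY_STRICT = {
    ("K", "K"): 5, ("R", "R"): 5, ("D", "D"): 5,
    ("K", "R"): 0, ("K", "D"): -3, ("R", "D"): -3,
}
TOY_PERMISSIVE = {
    ("K", "K"): 3, ("R", "R"): 3, ("D", "D"): 3,
    ("K", "R"): 2, ("K", "D"): -1, ("R", "D"): -1,
}


def pair_score(matrix, a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(query, target, matrix):
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(query) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(matrix, a, b) for a, b in zip(query, target))


def best_hit_per_matrix(query, candidates, matrices):
    """Return the top-scoring candidate under each named matrix."""
    return {
        name: max(candidates, key=lambda c: ungapped_score(query, c, matrix))
        for name, matrix in matrices.items()
    }


matrices = {"strict": TOY_STRICT, "permissive": TOY_PERMISSIVE}
print(best_hit_per_matrix("KKD", ["RRD", "KKK"], matrices))
# prints {'strict': 'KKK', 'permissive': 'RRD'}
```

The two toy matrices disagree on the best hit because one rewards the conservative K/R replacement and the other does
not. Raw scores from different matrices are on different scales, so it is the ranking of hits under each matrix that
is compared here, not the score values themselves.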