Module 4: Similarity Searches


Distance Methods and Scoring Matrices


Algorithms for similarity searches and sequence alignments quantify how alike two sequences are in one of two ways: with a distance measure or with a similarity measure (usually in the form of a scoring matrix). For any given pair of sequences, the two are inversely related: the larger the distance between two sequences, the smaller their similarity, and vice versa.
  • Distance between two sequences is computed as the sum of differences between the two sequences.
    • In the case of nucleic acid sequences, differences due to insertions and deletions are generally given a larger distance score than substitutions. Replacement of one nucleotide by any of the other three can be treated as equally weighted, or different weights can be assigned based on the naturally occurring frequencies of transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or vice versa). In general, transitions are observed more often than transversions, largely because they exchange bases of similar size and ring structure.
    • With protein sequences, simple distance counts are a poor basis for sequence comparison, because the likelihood of an amino acid replacement depends on several factors: (1) Some amino acid substitutions do not change protein structure or function, so the residues involved are considered highly interchangeable (naturally mutable). (2) The degeneracy of the genetic code has to be taken into account when amino acid frequencies are used to calculate substitution probabilities between two protein sequences.
    • In practice, tables of amino acid replacement frequencies observed within families of related proteins have been developed for this purpose. The best known are the PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutations) matrices.
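The weighted distance described above for nucleic acid sequences can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function name `weighted_distance` and the weights (1 for a transition, 2 for a transversion, 3 for an insertion or deletion) are assumptions chosen only to respect the ordering given in the text, and the two sequences are assumed to be already aligned, with "-" marking gaps.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def weighted_distance(seq1, seq2, transition=1.0, transversion=2.0, gap=3.0):
    """Sum weighted differences between two aligned nucleotide sequences.

    The default weights are illustrative assumptions, not published values.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue  # identical columns (including gap/gap) add no distance
        if a == "-" or b == "-":
            total += gap  # insertion or deletion
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            total += transition  # purine<->purine or pyrimidine<->pyrimidine
        else:
            total += transversion  # purine<->pyrimidine
    return total

# One transition (A/G), one transversion (G/T) and one gap: 1 + 2 + 3
print(weighted_distance("ACGT-A", "GCTTAA"))  # 6.0
```

Setting `transition` and `transversion` to the same value reproduces the simpler scheme in which all substitutions count equally.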
  • PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutations) similarity matrix
    • PAM matrices for protein sequences are constructed from amino acid replacements observed in alignments of evolutionarily related sequences. Each entry is a log-odds score that reflects how often one amino acid replaces another, relative to chance, based on the mutations accepted in related protein families. Hence, these matrices are often called "substitution matrices".
      • A score above zero for a pair of amino acids indicates that the two replace each other more often than expected by chance alone, i.e., they are functionally exchangeable.
      • A negative score (below zero) indicates that the two amino acids replace each other less often than expected by chance, e.g., a basic amino acid for an acidic one, or one with an aromatic side chain for one with an aliphatic side chain.
    • The number 250 in PAM250 corresponds to an average of 250 accepted amino acid replacements per 100 residues, extrapolated from a data set of 71 groups of closely related proteins [Ref: Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. O. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.]. Because the same site can change more than once, sequences separated by 250 PAMs still share some identical residues. The higher the matrix number, the greater the evolutionary distance between the compared sequences.

      Figure: PAM250 Matrix

    • PAM-style matrices also exist for nucleotide sequences. The PAM45 matrix, a commonly used default for nucleotide comparisons, scores +5 for matches and -4 for mismatches, and assumes equal rates of transition and transversion.
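To illustrate how a PAM number translates into scores, here is a minimal sketch for the nucleotide case. It assumes a uniform mutation model in which each base changes with probability 0.01 per PAM step, equally often to each of the other three bases, and equal background frequencies of 0.25. The function names, the 0.01 rate and the choice to rescale the match score to 5 are all assumptions made for illustration. The PAM-n matrix is obtained by multiplying the one-step matrix by itself n times, and each score is the logarithm of the ratio between the resulting probability and the chance expectation.

```python
import math

def pam1_uniform(rate=0.01):
    """One-step matrix: a base changes with probability `rate`,
    split equally among the three other bases (assumed model)."""
    off = rate / 3
    return [[1 - rate if i == j else off for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(n, rate=0.01):
    """Extrapolate to n PAM steps by repeated multiplication."""
    step = pam1_uniform(rate)
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(n):
        result = mat_mul(result, step)
    return result

def log_odds_scores(n, match_scale=5.0, background=0.25):
    """Return (match, mismatch) log-odds scores, rescaled so match == match_scale."""
    p = pam_n(n)
    match = math.log(p[0][0] / background)
    mismatch = math.log(p[0][1] / background)
    factor = match_scale / match
    return match * factor, mismatch * factor

match, mismatch = log_odds_scores(45)
print(round(match), round(mismatch))  # 5 -4
```

Under these assumptions, 45 PAM steps give a mismatch score of about -4.07 when the match score is scaled to 5, which rounds to the +5/-4 scheme. Larger values of n pull both scores toward zero, reflecting the weaker signal between more distant sequences.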

  • BLOSUM (Blocks Substitution Matrix) similarity matrix
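    • These are substitution matrices derived from the observed frequencies of amino acid replacements in highly conserved regions (blocks) of ungapped local alignments. The data for the substitution scores in these matrices come from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins [Ref: Henikoff, S., and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919]. The number in a matrix name such as BLOSUM62 is the percent identity at which the sequences within a block were clustered before counting, so, unlike PAM, a higher number suits more closely related sequences.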
    • The BLAST server from NCBI and the search servers from EBI use different versions of the BLOSUM matrix for protein similarity searches and alignments.
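The idea of deriving scores from observed replacement frequencies in ungapped blocks can be sketched as follows. This is a simplified illustration, not the published procedure: it omits the clustering step, the block `toy_block` is made-up data, and the function name `blosum_style_scores` is hypothetical. Pairs are counted column by column, expected pair frequencies are computed from the residue frequencies implied by those pairs, and each score is twice the base-2 logarithm of observed over expected, rounded to an integer.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_style_scores(block):
    """Log-odds scores (half-bit units) from one ungapped block of equal-length sequences."""
    # Count every residue pair within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the observed pairs.
    p = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Score = 2 * log2(observed / expected); pairs never observed get no score here.
    scores = {}
    for (a, b), freq in observed.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

# Hypothetical four-sequence block, for illustration only.
toy_block = ["ALAI", "ALAI", "ALAV", "ALLV"]
scores = blosum_style_scores(toy_block)
print(scores[("I", "V")], scores[("A", "L")])  # 5 -2
```

In this toy block, I and V share a column more often than their frequencies predict and receive a positive score, while A and L rarely do and receive a negative one. With only four short sequences the magnitudes carry no biological meaning.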

  • It is advisable to run protein sequence comparisons with several different substitution matrices, since each matrix is tuned to a particular evolutionary distance. Comparing the results helps identify the best matches.



SWBIC 2001