Module 4: Similarity Searches


Exercise 4

We will perform similarity searches using BLAST and FASTA as hands-on tools. We will also briefly discuss the Smith-Waterman algorithm as another potential similarity search tool.

FASTA Protein Similarity Search

  1. Link to the Fasta3 search engine at the European Bioinformatics Institute. You may check out the links to the "Help" and "Tool" screens.
  2. Your e-mail address and a search title for the sequence are optional entries. Enter "Murine IL-7 Receptor" as your search title. Choose "interactive" as the results option, although you can also receive the results by e-mail.
  3. Change the scoring matrix to Blosum62 since this matrix has been shown to detect most protein similarities when the query sequence is long. (The murine IL-7 receptor is 459 amino acids long.)
  4. To limit the number of hits (similar sequences) in this search, change the number of scores to 30 and the number of alignments to 10. You may get a histogram of the results by changing the "HIST" drop-down menu to "yes". Leave the other parameters at their default values. We will search the default database, "swall", which is the Swiss-Prot non-redundant database combined with TrEMBL and TrEMBLnew (TrEMBL = Translated EMBL; TrEMBLnew = new sequences in TrEMBL).
  5. Copy and paste the murine IL-7 receptor (IL-7R) sequence from this text file. The input sequence can be in any format.
  6. Click the "Run Fasta3" button for the search results.
  7. You may view the same results as a graphical output by clicking on the "VisualFasta" button from the "Results of Search" screen.
  8. Interpretation of the results:
    • In general, one treats sequence similarities with an E() value < 0.02 as statistically significant matches.
    • As expected, the murine IL-7R sequences in the database (Accession numbers Q9R0C1, P16872) show the best similarities to the query sequence (the highest opt and z-scores). Only two other sequences, corresponding to the human IL-7R gene (Acc. #'s P16871, Q9UPC1), show fairly high opt and z-scores. Even the human protein isoforms (Acc. #'s P16871-02, P16871_01), which share some identical residues with the query sequence, have lower opt and z-scores.
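
The E() cutoff described above can be applied to a list of hits once you have copied the accession numbers and E() values out of the results screen. The following is only a minimal sketch: the `Hit` type, the `significant_hits` function, and the placeholder accessions and E() values are all hypothetical illustrations, not real search output.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database hit: accession number and its E() value (hypothetical type)."""
    accession: str
    e_value: float


def significant_hits(hits: List[Hit], threshold: float = 0.02) -> List[Hit]:
    """Keep only hits whose E() value is strictly below the threshold."""
    return [hit for hit in hits if hit.e_value < threshold]


# Placeholder data for illustration only; these are NOT real search results.
example_hits = [
    Hit("PLACEHOLDER_1", 1e-120),
    Hit("PLACEHOLDER_2", 3e-5),
    Hit("PLACEHOLDER_3", 0.5),
]

for hit in significant_hits(example_hits):
    print(hit.accession, hit.e_value)
```

The threshold is a parameter, so the same function can be reused with a stricter or looser cutoff.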

BLAST Protein Similarity Search

  1. Connect to the BLAST site at NCBI. Click the link to the "Standard protein-protein BLAST [blastp]" page. Familiarize yourself with the various features of the site.
  2. Choose "nr" for the non-redundant database to search.
  3. As with the FASTA exercise above, copy and paste the murine IL-7 receptor (IL-7R) sequence from this text file into the large data entry field. This sequence is already in the FASTA format.
  4. Limit the number of hits to a manageable size by changing the Expect value to 1 from the default value of 10. You may also restrict the number of hits returned by decreasing the number of Descriptions and Alignments returned.
  5. Use the default Blosum62 scoring matrix as selected at the bottom of the page.
  6. Click on the "Search" button to perform the similarity search immediately. On the Blast CGI screen that shows up next, view the results by pressing the "Format results" button. You may also check for any conserved domains between your sequence and the database sequences. You may wish to get the BLAST results by e-mail by providing your e-mail address on the BLAST search screen.
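
The Expect value changed in step 4 is the number of hits with at least a given score that would be expected by chance in a search of this size. As a rough sketch of why lowering it trims the hit list, the standard relationship between a normalized bit score S' and the E value is E = m × n × 2^(-S'), where m and n are the (effective) lengths of the query and the database. The function names and the example lengths below are illustrative assumptions and are not part of the NCBI interface.

```python
import math


def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """E = m * n * 2^(-S') for bit score S', query length m, database length n."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(expect: float, query_len: float, db_len: float) -> float:
    """Smallest bit score whose E value does not exceed the given Expect threshold."""
    return math.log2(query_len * db_len / expect)


# Assumed sizes for illustration: the 459-residue query from this exercise
# and a hypothetical database of 100 million residues.
m, n = 459, 100_000_000
print("bit score needed at Expect = 10:", round(min_bit_score(10, m, n), 2))
print("bit score needed at Expect = 1: ", round(min_bit_score(1, m, n), 2))
```

Whatever sizes are assumed, moving the Expect threshold from 10 to 1 raises the required bit score by log2(10), about 3.3 bits, so weaker matches drop out of the report.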

Comparison of Fasta and BLAST Search Results

  1. Check for database sequences that have been pulled out as common hits by the two search algorithms. How do these sequences common to both searches show up in the graphical alignment figures from BLAST and FASTA?
  2. How do the E values for common sequences compare between the two programs?
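
One way to organize this comparison is to record each program's hits as accession number and E value, then list the shared accessions side by side. The sketch below assumes you have typed the values into two dictionaries by hand; the `compare_hits` function and the placeholder entries are hypothetical and do not represent real output from either program.

```python
from typing import Dict, List, Tuple


def compare_hits(
    fasta_hits: Dict[str, float], blast_hits: Dict[str, float]
) -> Tuple[List[Tuple[str, float, float]], List[str], List[str]]:
    """Return (shared hits with both E values, FASTA-only hits, BLAST-only hits)."""
    shared = sorted(set(fasta_hits) & set(blast_hits))
    fasta_only = sorted(set(fasta_hits) - set(blast_hits))
    blast_only = sorted(set(blast_hits) - set(fasta_hits))
    table = [(acc, fasta_hits[acc], blast_hits[acc]) for acc in shared]
    return table, fasta_only, blast_only


# Placeholder accessions and E values for illustration only (not real results).
fasta = {"PLACEHOLDER_1": 1e-100, "PLACEHOLDER_2": 2e-8, "PLACEHOLDER_3": 0.01}
blast = {"PLACEHOLDER_1": 1e-110, "PLACEHOLDER_2": 5e-9, "PLACEHOLDER_4": 0.3}

table, fasta_only, blast_only = compare_hits(fasta, blast)
for accession, fasta_e, blast_e in table:
    print(accession, "FASTA E:", fasta_e, "BLAST E:", blast_e)
print("FASTA only:", fasta_only)
print("BLAST only:", blast_only)
```
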

Smith-Waterman Algorithm

You may connect to the Bioccelerator site at EMBL to use the Smith-Waterman algorithm for sequence similarity searches.

  1. This search tool is a mathematically rigorous dynamic programming algorithm. It fills in a matrix in which each cell holds the best local alignment score for one pairing of a query position with a database sequence position, computed from the neighboring cells.
  2. It is very computationally intensive, so similarity searches may take longer than with FASTA or BLAST.
  3. It is guaranteed to find the optimal local alignment for the chosen scoring matrix and gap penalties, whereas FASTA and BLAST rely on heuristic shortcuts. It can therefore be more sensitive (higher scores for "true" homologues), especially for distantly related sequences.
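
The matrix calculation described above can be sketched in a few lines. This is a simplified illustration, not the Bioccelerator implementation: it assumes a flat match/mismatch score instead of a substitution matrix such as Blosum62, and a linear gap penalty instead of separate gap-open and gap-extend costs. The function name and default scores are arbitrary choices for the example.

```python
from typing import Tuple


def smith_waterman(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> Tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("HEAGAWGHEE", "GAWG"))
```

Every cell of the matrix is computed for every query and database sequence pair, which is why the method costs more time than the heuristic searches in FASTA and BLAST.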

The Southwest Biotechnology and Informatics Center WWW server is located at "".
Please send comments and suggestions to: [email protected]
SWBIC 2001