PCAPSS is the acronym for Protein Classification through the Assessment of Predicted Secondary Structure. It is a method for identifying remote protein homologs in the Protein Data Bank (PDB) whose secondary structure (SS) is similar to that of the query protein. PCAPSS was designed for orphan or hypothetical proteins, i.e., those for which amino acid sequence searches fail to find similar sequences with functional annotation. Its purpose is to generate hypotheses about the structure and function of such hypothetical or putative protein sequences based on fold recognition. The input is a single query protein sequence. Because an analysis takes time (half an hour or more, depending on the query), results are stored here and you are e-mailed the URL of the results. You may then save any result pages you wish on your local computer.
A manuscript on the methods and validation study results is under preparation and will be made available here soon. In general, the approach is to characterize (in the form of a hidden Markov model, or HMM) the predicted SS of the query protein based on a training set of proteins closely related to the query. A database of experimentally-determined SSs is then searched to identify proteins of similar SS. This database is derived from the atomic coordinate data files of the PDB. A flowchart of the processing steps in PCAPSS is shown in Figure 1. The specific methods involved in each of these steps are detailed in the following sections.
HMM training set construction
These methods are performed in steps 2-3 (Figure 1). For each protein selected, a group of homologous proteins is created by first using the “blastp” sequence similarity search program (Altschul et al., 1997) with the National Center for Biotechnology Information (NCBI) “nr” (non-redundant) protein database. Up to 500 matches with Expectation values <= 0.1 are allowed.
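The selection of BLAST matches described above can be sketched as follows. This is only an illustration, not the PCAPSS code: the `BlastHit` record and `select_hits` function are hypothetical names, the hits are assumed to have already been parsed from a BLAST report, and sorting by E-value before applying the 500-match cap is an assumption.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """Hypothetical record for one subject sequence matched by blastp."""
    subject_id: str
    evalue: float


def select_hits(hits, max_evalue=0.1, max_hits=500):
    """Keep hits with E-value <= max_evalue, lowest E-value first, capped at max_hits."""
    passing = [h for h in hits if h.evalue <= max_evalue]
    # Assumption: when more than max_hits pass, the best (lowest) E-values are kept.
    passing.sort(key=lambda h: h.evalue)
    return passing[:max_hits]


# Made-up hits for illustration only.
hits = [
    BlastHit("a", 1e-30),
    BlastHit("b", 0.5),   # fails the E-value cutoff
    BlastHit("c", 0.1),   # passes: the cutoff is inclusive
    BlastHit("d", 2e-5),
]
print([h.subject_id for h in select_hits(hits)])  # ['a', 'd', 'c']
```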
This set of related proteins is further reduced to become the training set for the construction of an HMM (step 2). Based on results from preliminary analyses, we refined a set of 10 rules to be applied to the subject proteins matched by the BLAST search (Figure 1). The first eight rules compare each subject protein to the query based on its whole sequence length, global percent identity, and High Scoring Segment Pair (HSP) statistics extracted from the BLAST report. Rules 1-8 were designed to retain sequences of similar length to the query with relatively long, higher quality HSPs. Rule 3 (minimum global percent identity with query) is applied to the result of a ClustalW (Thompson et al., 1994) pairwise alignment of the query and each subject. Rule 9 is applied to sequences passing rules 1-8, and reduces the redundancy within the set such that no two sequences have more than a 95% ClustalW similarity. This avoids over-weighting the HMM with many highly similar sequences. If there are more than 50 proteins in this non-redundant set, it is reduced by removing the subject sequences with the lowest HSP Expectation values (rule 10). The limit of 50 was found in preliminary analyses to be sufficient for building an HMM while reducing computation time. These rules are described more fully in the Rule Analysis Options Help for the BLAST Filter bioinformatics tool, which uses the same software as PCAPSS for identifying and reducing a set of related sequences based on rules.
NOTE: The rules described above and in Figure 1 have been loosened, which results in larger training sets. Recent results indicate that these new rules actually improve the accuracy of PCAPSS. More information on these rules and changes in validation results will be posted here soon.
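The redundancy reduction of rule 9 can be illustrated with a greedy filter. This is a minimal sketch under stated assumptions, not the PCAPSS implementation: `percent_identity` is a toy stand-in for a ClustalW pairwise similarity, `reduce_redundancy` is a hypothetical name, and the 95% threshold is the originally described value, which the note above says has since been loosened. Rules 1-8 and rule 10 are not shown.

```python
def percent_identity(a, b):
    """Toy similarity (assumption): percent of identical positions over the longer
    sequence, with no alignment. PCAPSS uses ClustalW pairwise alignments instead."""
    if not a and not b:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def reduce_redundancy(seqs, similarity=percent_identity, max_similarity=95.0):
    """Greedy filter: keep a sequence only if its similarity to every
    already-kept sequence is at most max_similarity percent."""
    kept = []
    for name, seq in seqs:
        if all(similarity(seq, kept_seq) <= max_similarity for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


# Made-up sequences for illustration only.
seqs = [
    ("s1", "A" * 40),
    ("s2", "A" * 39 + "G"),       # 97.5% identical to s1, so it is dropped
    ("s3", "A" * 30 + "G" * 10),  # 75% identical to s1, so it is kept
]
print([name for name, _ in reduce_redundancy(seqs)])  # ['s1', 's3']
```

Because the filter is greedy, the outcome depends on input order; ordering the sequences by match quality beforehand would be one way to decide which member of a redundant pair survives.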
The HMM training set is formed (step 3) by predicting the SS sequences for each of the proteins in the reduced set of related proteins. In preliminary studies we used the following programs: PREDATOR (Frishman and Argos; 1996, 1997), DSC (Discrimination of protein Secondary structure Class; King and Sternberg, 1996), and PSIPRED Versions 1.2 and 2.01 (Jones, 1999). Although results were similar among the methods, the current implementation of PCAPSS uses the latest version (2.3) of PSIPRED because of its claimed accuracy (>76%) and its very high processing speed relative to the other prediction programs.
HMM building and database searching
These methods are performed in steps 4-7 (Figure 1). HMMs are constructed with HMMER Release 2.1.1 (Eddy, 1998). Each HMM is built (HMMER program “hmmbuild”) from a ClustalW multiple alignment of the training set of predicted SSs formed in steps 1-3. For this alignment, amino-acid-specific behavior is turned off by specifying the identity protein weight matrix, disabling the residue-specific options, and turning off the hydrophilic gaps option.
HMMER was developed to produce protein and nucleic acid profile HMMs. The code was modified by us to use the CEH (coil, extended, alpha helix) SS alphabet output by SS prediction programs. Several input parameters were modified to reflect characteristics of SS sequences, including supplying custom priors (based on the observed frequencies of SS states in the PDB) for the null model, and Dirichlet match and insert emission probabilities (also based on observed SS frequencies in the PDB). The “hmmcalibrate” HMMER program is used to improve the accuracy of statistical significance estimates. HMMs are built with the default configuration; this means that, when a sequence database is searched with the HMM, the best non-overlapping partial alignments of each subject sequence to the complete model are identified. Other HMM configurations are under study.
The HMMs trained from predicted SS sequences are used to search an updated database of protein SS sequences (step 6) derived from the database of DSSP structures (Dictionary of Protein Secondary Structure; Kabsch and Sander, 1983) maintained by the PDB. The 8-state assignments of DSSP are reduced to the 3 states predicted by PSIPRED (coil, sheet, helix, or CEH) using “Method B” of Cuff and Barton (1999), which generally improved prediction accuracy compared to “Method A.” The “Method B” 8-to-3 state reduction is: E stays as E, H stays as H, and all other states become coil, including EE and HHHH. This database was further refined to reduce search time by removing all sequences that are more than 95% coil or shorter than 50 residues.
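The state reduction and database filter described above can be sketched as follows. This is an illustrative sketch with hypothetical function names (`reduce_dssp`, `keep_chain`), not the PCAPSS code. It implements only the per-residue mapping (E to E, H to H, everything else to C); the special treatment of EE and HHHH segments is left out because its exact form is not spelled out here.

```python
def reduce_dssp(dssp):
    """Map a DSSP 8-state string to 3 states: E and H are kept, all other
    symbols (including blanks for unassigned residues) become coil, C."""
    return "".join(state if state in "EH" else "C" for state in dssp)


def keep_chain(ss, max_coil_fraction=0.95, min_length=50):
    """Return False for chains shorter than min_length residues or with a
    coil fraction above max_coil_fraction; True otherwise."""
    if len(ss) < min_length:
        return False
    return ss.count("C") / len(ss) <= max_coil_fraction


print(reduce_dssp("HGIEBTS -"))          # HCCECCCCC
print(keep_chain("C" * 49))              # False: shorter than 50 residues
print(keep_chain("C" * 96 + "H" * 4))    # False: 96% coil
print(keep_chain("C" * 90 + "H" * 10))   # True
```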
The search of the PDB database (step 7) is performed by the HMMER program “hmmsearch.” After preliminary studies, it was determined that normalizing HMMER scores (bits) by the length of the matched protein chain improved PCAPSS prediction accuracy. Finally, annotation from the PDB is incorporated in the results to facilitate evaluation of the top scoring chains for each PCAPSS analysis.
These automated methods were run on a set of 50 validation protein chains with identities or structural homologs in the PDB; PCAPSS correctly identified a functionally similar chain in the PDB for 49 of these. The global pairwise percent identity between queries and their “target” (same function) sequences in the PDB ranged from 100% (3 cases) to 11%. The case with 11% identity was the single failure. Thirty successful cases had low percent identities (17-40%) with the protein chains they matched. Training set size ranged from the maximum of 50 down to 8. The single failure case had a training set of 8 sequences, but another case with this training set size succeeded. Normalized scores ranged from 0.18 to 1.05; the single failure case also had the lowest normalized score. Three successes had normalized scores between 0.19 and 0.31, and the remainder were greater than 0.42.
From this and other analyses, we concluded that two statistics were most correlated with success: training set size and normalized score. Because of the former, a PCAPSS analysis is run only for cases that yield training sets of 10 sequences or more. Normalized scores are difficult to interpret, but our guidelines are to consider scores > 0.3 as indicating a strong SS similarity between the query and subject and scores < 0.1 as weak or non-existent. Scores > 0.6 indicate a very high SS similarity. The HMMER scores (bits) and E-values are also provided in the results (see Results section).
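As a worked illustration of these guidelines, the normalized score and its interpretation can be written as below. The function names and the "intermediate" label are assumptions made for this sketch; the guidelines above do not characterize scores between 0.1 and 0.3.

```python
def normalized_score(bits, chain_length):
    """HMMER score in bits divided by the length of the matched PDB chain."""
    if chain_length <= 0:
        raise ValueError("chain length must be positive")
    return bits / chain_length


def interpret(score):
    """Apply the guideline thresholds for SS similarity to a normalized score."""
    if score > 0.6:
        return "very high"
    if score > 0.3:
        return "strong"
    if score < 0.1:
        return "weak or non-existent"
    # Assumption: the range 0.1-0.3 is not characterized by the guidelines.
    return "intermediate"


# Made-up example: a 300-residue chain matched with a score of 126 bits.
score = normalized_score(126.0, 300)
print(round(score, 2), interpret(score))  # 0.42 strong
```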
There are only two inputs to PCAPSS: the query protein sequence and your e-mail address. The sequence may be in FASTA format (i.e., with an initial definition line starting with the “>” character, followed by one or more sequence lines) or raw format (sequence lines only). The e-mail address is required because an analysis may take half an hour or more.
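The two accepted sequence formats can be told apart by the leading “>” character. The following is a minimal sketch of such a reader, assuming a hypothetical `read_query` function; it is not the parser used by PCAPSS, and converting the sequence to upper case is an assumption.

```python
def read_query(text):
    """Return (definition, sequence) from FASTA or raw input.

    The definition is None for raw input. Whitespace inside sequence
    lines is removed; an empty sequence raises ValueError.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    definition = None
    if lines and lines[0].startswith(">"):
        definition = lines[0][1:].strip()
        lines = lines[1:]
    sequence = "".join("".join(line.split()) for line in lines).upper()
    if not sequence:
        raise ValueError("empty query protein sequence")
    return definition, sequence


print(read_query(">q1 test\nMKV\nLLA\n"))  # ('q1 test', 'MKVLLA')
print(read_query("mkv lla"))               # (None, 'MKVLLA')
```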
The results are stored in HTML pages on our bioinformatics computer (darwin) and the URL of the main results page is e-mailed to you as soon as the analysis is finished. To view the results, simply use that URL; if you wish to save any of the result pages, use your browser’s “save page” function to store them on your local computer for later viewing. Result pages will be retained for one week.
If the query sequence did not result in a training set of at least 10 sequences, the analysis is stopped, and the e-mail message will indicate this. Other errors, such as an empty query protein sequence, will also be indicated in the e-mailed results.
Associated output files
At the top of the PCAPSS results page are links to view the query sequence and results produced as part of a PCAPSS analysis. These result links are:
- View BLAST search of the non-redundant protein database: This is the BLAST output from searching the query sequence against the NCBI non-redundant protein database. The subject sequences matched in this search form the base from which the training set was produced (see Methods).
- View FASTA search of the PDB protein database: This is the FASTA output from searching the query sequence against the protein sequences in the PDB. You may use it to see which PDB sequences are similar in amino acid sequence to the query.
- View set of related protein sequences (Step 2): This is the group of related proteins used to form the SS training set for the HMM. The generation of this group is based on the rule set used in Step 2.
- View HMM training set (predicted SS of related protein sequences) (Step 3): This is the group of predicted SS sequences of related proteins that will form the HMM training set. See Step 3.
- View Secondary Structure Multiple Alignment of HMM training set (Step 4): This is the multiple alignment output file from ClustalW for the group of predicted SS sequences. Strictly speaking, this alignment is the training set from which the HMM is built. See Step 4.
PCAPSS predictions are presented in a table in which the top-scoring PDB chains are presented ranked by normalized score. The top 50 scoring chains with positive bits scores are shown. The columns contain the following information for each matched PDB chain:
- Norm’d score rank: The rank of the chain based on normalized score.
- Bits score rank: The rank the chain would have if the table were sorted by raw score (bits). Sorting by E-value gives the same order.
- PDB ID :Chain: The 4-character PDB protein identifier followed by the chain identifier. This is also a link to the entry for that protein in the PDB.
- Seq. Length: The length of the PDB chain.
- Bits Score: The score in bits produced by the HMMER program “hmmsearch” for the match of the HMM to that chain.
- E-value: The Expectation value produced by the HMMER program “hmmsearch” for the match of the HMM to that chain. The interpretation of the E-value is the same as in other programs, i.e., the number of sequences with that E-value or lower expected to be matched by chance.
- Norm’d Score: The PCAPSS normalized score. This is the HMMER score (bits) divided by the length of the matched PDB chain. Our guidelines are to consider scores > 0.3 as indicating a strong SS similarity between the query and subject, and scores < 0.1 as weak or non-existent. Scores > 0.6 indicate a very high SS similarity. See the Validation section for more discussion of normalized scores.
- Description: The annotation for this protein as maintained by the PDB in the “TITLE” field. If it is blank, this is because the compound description database file obtained from the PDB does not contain this protein. To view its description, click on the PDB ID link (column 3).
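The relationship between the two rank columns and the normalized score can be sketched as follows. This is an illustration only: `rank_chains` is a hypothetical name, the chain identifiers and scores are made up, and applying the top-50 cutoff after sorting by normalized score is an assumption about how the table is assembled.

```python
def rank_chains(hits, top_n=50):
    """Build result rows from (chain_id, bits, length) tuples.

    Each row is (norm_rank, bits_rank, chain_id, length, bits, norm_score).
    Only chains with positive bits scores are ranked.
    """
    positive = [h for h in hits if h[1] > 0]
    by_bits = sorted(positive, key=lambda h: h[1], reverse=True)
    bits_rank = {h[0]: rank for rank, h in enumerate(by_bits, start=1)}
    by_norm = sorted(positive, key=lambda h: h[1] / h[2], reverse=True)
    rows = []
    for norm_rank, (chain_id, bits, length) in enumerate(by_norm[:top_n], start=1):
        rows.append((norm_rank, bits_rank[chain_id], chain_id, length, bits,
                     round(bits / length, 2)))
    return rows


# Made-up chains for illustration only.
hits = [("1ABC:A", 120.0, 400), ("2XYZ:B", 90.0, 150), ("3DEF:A", -5.0, 200)]
for row in rank_chains(hits):
    print(row)
# (1, 2, '2XYZ:B', 150, 90.0, 0.6)
# (2, 1, '1ABC:A', 400, 120.0, 0.3)
```

In this example the shorter chain ranks first by normalized score even though its raw bits score is lower, which is the effect that length normalization is meant to have.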
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
Cuff, J.A., and Barton, G.J. 1999. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins Struct. Funct. Genet. 34, 508-519.
Eddy, S.R. 1998. The HMMER 2.1.1 User’s Guide. Unpublished. Available from http://hmmer.wustl.edu/.
Frishman, D., and Argos, P. 1996. Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9, 133-142.
Frishman, D., and Argos, P. 1997. Seventy-five percent accuracy in protein secondary structure prediction. Proteins Struct. Funct. Genet. 27, 329-335.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
Kabsch, W., and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.
King, R.D., and Sternberg, M.J.E. 1996. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Sci. 5, 2298-2310.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.
The authors of PCAPSS may be contacted if you have further questions.
- Dr. Peter J. Lammers, Dept. of Chemistry and Biochemistry, New Mexico State University
- Dr. John B. Spalding, Molecular Biology Program, New Mexico State University (use this contact for reporting errors or technical implementation questions)
Both are members of the SWBIC Leadership Team.