Validation study and score interpretation
PCAPSS is the acronym for Protein Classification through the
Assessment of Predicted Secondary Structure.
It is a method for identifying remote protein homologs in the Protein Data
Bank (PDB) that share a similar secondary structure (SS) to the query protein.
PCAPSS was designed to be used with orphan or hypothetical proteins, i.e.,
those for which amino acid sequence searches fail to find similar sequences
with functional annotation. Its purpose is to generate hypotheses for structure
and function for such hypothetical or putative protein sequences based on
fold recognition. The input is a single query protein sequence. Because an
analysis can take half an hour or more, depending on the query, results are
stored here and the URL of the results is e-mailed to you. You may then
save the result pages you wish on your local computer.
A manuscript on the methods and validation study results is under preparation
and will be made available here soon.
In general, the approach is to characterize
(in the form of a hidden Markov model, or HMM) the predicted SS of the query
protein based on a training set of proteins closely related to the query.
A database of experimentally-determined SSs is then searched to identify
proteins of similar SS. This database is derived from the atomic coordinate
data files of the PDB. A flowchart of the processing steps in PCAPSS is shown
in Figure 1. The specific methods involved in each of these steps are detailed
in the following sections.
HMM training set construction
These methods are performed in steps 2-3 (Figure 1). For each protein selected,
a group of homologous proteins is created by first using the "blastp" sequence
similarity search program (Altschul et al., 1997) with the National Center
for Biotechnology Information (NCBI) "nr" (non-redundant) protein database.
Up to 500 matches with Expectation values <= 0.1 are allowed.
This set of related proteins is further reduced to become the training set
for the construction of an HMM (step 2). Based on results from preliminary
analyses, we refined a set of 10 rules to be applied to the subject proteins
matched by the BLAST search (Figure 1). The first eight rules compare each
subject protein to the query based on its whole sequence length, global percent
identity, and High Scoring Segment Pair (HSP) statistics extracted from the
BLAST report. Rules 1-8 were designed to retain sequences of similar length
to the query with relatively long, higher quality HSPs. Rule 3 (minimum global
percent identity with query) is applied to the result of a ClustalW (Thompson
et al., 1994) pairwise alignment of the query and each subject. Rule 9 is
applied to sequences passing rules 1-8, and reduces the redundancy within
the set such that no two sequences have more than a 95% ClustalW similarity.
This was to avoid over-weighting the HMM with many highly similar sequences.
If there are more than 50 proteins in this non-redundant set, it is reduced
by removing the subject sequences with the lowest HSP Expectation values
(rule 10). This number was determined in preliminary analyses as being sufficient
for building an HMM and reducing computation time. These rules are described
more fully in the
Analysis Options Help for the
bioinformatics tool, which uses the same software as PCAPSS for identifying
and reducing a set of related sequences based on rules.
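The redundancy reduction of rule 9 can be sketched as a greedy filter. This is only an illustration, not the PCAPSS code: the `similarity` function is a stand-in built on Python's `difflib`, whereas PCAPSS uses ClustalW pairwise similarity, and the names `similarity` and `reduce_redundancy` are hypothetical. The sketch also assumes the input list is already ordered by preference, so the first member of each group of near-duplicates is the one retained.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; PCAPSS itself uses ClustalW."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences, max_similarity=0.95):
    """Keep sequences so that no two kept sequences exceed max_similarity."""
    kept = []
    for seq in sequences:
        # Retain a sequence only if it is not too similar to any kept one.
        if all(similarity(seq, other) <= max_similarity for other in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    candidates = ["A" * 100, "A" * 99 + "C", "C" * 100]
    print(len(reduce_redundancy(candidates)))
```

A greedy pass like this depends on input order; a different ordering of the same sequences can retain a different representative from each group.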
NOTE: The rules described above and in Figure 1 have been loosened, which
results in larger training sets. Recent results indicate that these new rules
actually improve the accuracy of PCAPSS. More information on these rules and
changes in validation results will be posted here soon.
The HMM training set is formed (step 3) by predicting the SS sequences for
each of the proteins in the reduced set of related proteins. In preliminary
studies we used the following programs: PREDATOR (Frishman and Argos; 1996,
1997), DSC (Discrimination of protein Secondary structure Class; King and
Sternberg, 1996), and PSIPRED Versions 1.2 and 2.01 (Jones, 1999). Although
results were similar among the methods, the current implementation of PCAPSS
uses the latest version (2.3) of PSIPRED because of its claimed accuracy (>76%)
and its very high
processing speed relative to the other prediction programs.
HMM building and database searching
These methods are performed in steps 4-7 (Figure 1). HMMs are constructed
with HMMER Release 2.1.1 (Eddy, 1998). Each HMM is built (HMMER program
"hmmbuild") from a ClustalW multiple alignment of the training set of predicted
SSs formed in steps 1-3; amino acid-specific options are turned off by specifying
the identity protein weight matrix and by switching off both the residue-specific
options and the hydrophilic gaps option.
HMMER was developed to produce protein and nucleic acid profile HMMs. The
code was modified by us to use the CEH (coil, extended, alpha helix) SS alphabet
output by SS prediction programs. Several input parameters were modified
to reflect characteristics of SS sequences, including supplying custom priors
(based on the observed frequencies of SS states in the PDB) for the null
model, and Dirichlet match and insert emission probabilities (also based
on observed SS frequencies in the PDB). The "hmmcalibrate" HMMER program
is used to improve the accuracy of statistical significance estimates. HMMs
are built with the default configuration; this means that, when a sequence
database is searched with the HMM, the best non-overlapping partial alignments
of each subject sequence to the complete model are identified. Other HMM
configurations are under study.
The HMMs trained from predicted SS sequences are used to search an updated
database of protein SS sequences (step 6) derived from the database of DSSP
structures (Dictionary of Protein Secondary Structure; Kabsch and Sander,
1983) maintained by the PDB. The 8-state assignments of DSSP are reduced
to the 3 states (coil, sheet, helix, or CEH) predicted by PSIPRED using "Method
B" of Cuff and Barton (1999), which generally improved prediction accuracy
compared to "Method A." The "Method B" 8-to-3 state reduction is: E stays as
E, H stays as H, and all other states become coil, including EE and HHHH. This database
was further refined to remove all sequences > 95% coil or shorter than
50 residues to reduce search time.
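The state reduction and database filter described above might look like the following sketch. It rests on one assumption about the wording "including EE and HHHH": that isolated runs of exactly two E or exactly four H are converted to coil. The function names `reduce_dssp` and `keep_chain` are hypothetical and are not taken from PCAPSS.

```python
import re


def reduce_dssp(dssp: str) -> str:
    """Reduce an 8-state DSSP string to the 3-state CEH alphabet."""
    # E stays E, H stays H; every other state (including blanks) becomes C.
    three = "".join(c if c in "EH" else "C" for c in dssp)
    # Assumed reading of "including EE and HHHH": isolated runs of exactly
    # two E or exactly four H are also converted to coil.
    three = re.sub(r"(?<!E)EE(?!E)", "CC", three)
    three = re.sub(r"(?<!H)HHHH(?!H)", "CCCC", three)
    return three


def keep_chain(ss: str, min_length: int = 50,
               max_coil_fraction: float = 0.95) -> bool:
    """Database filter: drop chains that are short or more than 95% coil."""
    if len(ss) < min_length:
        return False
    return ss.count("C") / len(ss) <= max_coil_fraction


if __name__ == "__main__":
    print(reduce_dssp("GGGHHHHHTT EEEEE SS"))
```

Only chains for which `keep_chain` returns `True` would remain in the searched database under this reading.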
The search of the PDB database (step 7) is performed by the HMMER program
"hmmsearch." After preliminary studies, it was determined that normalizing
HMMER scores (bits) by the length of the matched protein chain improved PCAPSS
prediction accuracy. Finally, annotation from the PDB is incorporated in
the results to facilitate evaluation of the top scoring chains for each PCAPSS
Validation study and score interpretation
These automated methods were run on a set of 50 validation protein chains
with identical sequences or structural homologs in the PDB; PCAPSS correctly identified
a functionally similar chain in the PDB for 49 of these. The global pairwise
percent identity between queries and their "target" (same function) sequences
in the PDB ranged from 100% (3 cases) to 11%. The 11% identity case was
the single failure. Thirty successful cases had low percent identities
(17-40%) with the protein chains they matched. Training set size ranged from
the maximum of 50 to 8. The single failure case had a training set of 8
sequences, but another case with this training set size succeeded. Normalized
scores ranged from 0.18 to 1.05; the single failure case also had the lowest
normalized score. Three successes had normalized scores between 0.19 and
0.31, and the remainder were greater than 0.42.
From this and other analyses, we concluded that two statistics were most
correlated with success: training set size and normalized score. Because
of the former, a PCAPSS analysis is run only for cases that yield training
sets of 10 sequences or more. Normalized scores are difficult to interpret,
but our guidelines are to consider scores > 0.3 as indicating a strong
SS similarity between the query and subject and scores < 0.1 as weak or
non-existent. Scores > 0.6 indicate a very high SS similarity. The HMMER
scores (bits) and E-values are also provided in the results (see Results
There are only two inputs to PCAPSS: the query protein sequence and your
e-mail address. The sequence may be in FASTA format (i.e., with an initial
definition line starting with the ">" character, followed by one or more
sequence lines) or raw format (sequence lines only). The e-mail address is
required because an analysis may take 1/2 hour or more.
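As an illustration of the two accepted sequence formats, the sketch below reads either a FASTA or a raw query and returns the definition line (empty for raw input) and the sequence. The function name `parse_query` and the choices to strip whitespace and upper-case the residues are assumptions made for the example; they do not describe how the PCAPSS server itself parses input.

```python
def parse_query(text: str):
    """Return (definition, sequence) from FASTA or raw sequence text."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    definition = ""
    # A FASTA query starts with a definition line beginning with ">".
    if lines and lines[0].startswith(">"):
        definition = lines[0][1:].strip()
        lines = lines[1:]
    # Join the sequence lines, dropping any internal whitespace.
    sequence = "".join("".join(line.split()) for line in lines).upper()
    if not sequence:
        raise ValueError("empty query protein sequence")
    return definition, sequence


if __name__ == "__main__":
    print(parse_query(">query1 test protein\nMKT\nLLV\n"))
    print(parse_query("mkt llv"))
```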
The results are stored in HTML pages on our bioinformatics computer
(darwin) and the URL of
the main results page is e-mailed to you as soon as the analysis is finished.
To view the results, simply use that URL; if you wish to save any of the
result pages, use your browser's "save page" function to store them on your
local computer for later viewing. Result pages will be retained for one week.
If the query sequence did not result in a training set of at least 10 sequences,
the analysis is stopped, and the e-mail message will indicate this. Other
errors, such as an empty query protein sequence, will also be indicated in
the e-mailed results.
Associated output files
At the top of the PCAPSS results page are links to view the query sequence
and results produced as part of a PCAPSS analysis. These result links are:
View BLAST search of the non-redundant protein database:
This is the BLAST output from searching the query sequence against the NCBI
non-redundant protein database. The subject sequences matched in this
search form the base
from which the training set was produced (see Methods).
View FASTA search of the PDB protein database: This is the
FASTA output from searching the query sequence against the protein
sequences in the PDB. You may use it to see which PDB sequences are
similar in amino acid sequence to the query.
View set of related protein sequences (Step 2): This is the
group of related proteins used to form the SS
training set for the HMM. The generation of this group is based on the
rule set used in Step 2.
View HMM training set (predicted SS of related
protein sequences) (Step 3): This is the
group of predicted SS sequences of related proteins
that will form the HMM training set. See Step 3.
View Secondary Structure Multiple Alignment of HMM training set
(Step 4): This is the multiple alignment output file from ClustalW
for the group of predicted SS sequences. Strictly
speaking, this alignment is the training set from which the HMM
is built. See Step 4.
PCAPSS predictions are presented in a table in which the top-scoring PDB
chains are presented ranked by normalized score. The top 50 scoring chains
with positive bits scores are shown. The columns contain the following
information for each matched PDB chain:
Norm'd score rank: The rank of the chain based on normalized score.
Bits score rank: The rank the chain would have if the table were sorted by
raw score (bits). Sorting by E-value gives the same order.
PDB ID :Chain: The 4-character PDB protein identifier followed by
the chain identifier. This is also a link to the entry for that protein in
Seq. Length: The length of the PDB chain.
Bits Score: The score in bits produced by the HMMER program "hmmsearch"
for the match of the HMM to that chain.
E-value: The Expectation value produced by the HMMER program "hmmsearch"
for the match of the HMM to that chain. The interpretation of the E-value
is the same as in other programs, i.e., the number of sequences expected
to match by chance with a score at least as good.
Norm'd Score: The PCAPSS normalized score. This is the HMMER score
(bits) divided by the length of the matched PDB chain. Our guidelines are
to consider scores > 0.3 as indicating a strong SS similarity between
the query and subject, and scores < 0.1 as weak or non-existent. Scores
> 0.6 indicate a very high SS similarity. See the
Validation section for more discussion of normalized scores.
Description: The annotation for this protein as maintained by the
PDB in the "TITLE" field. If it is blank, this is because the compound
description database file obtained from the PDB does not contain this protein.
To view its description, click on the PDB ID link (column 3).
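The score columns described above can be reproduced with a short sketch. It assumes the table keeps chains with positive bits scores, ranks them by normalized score, and shows the first 50; the `Hit` class, the function names, and the example identifiers are hypothetical. The label returned for scores from 0.1 to 0.3 is a placeholder, since the guidelines give no interpretation for that range.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_chain: str  # illustrative identifier, e.g. "1AAA:A"
    length: int     # length of the PDB chain
    bits: float     # hmmsearch score in bits


def normalized_score(hit: Hit) -> float:
    """HMMER score (bits) divided by the length of the matched chain."""
    return hit.bits / hit.length


def interpret(score: float) -> str:
    """Apply the published guideline thresholds to a normalized score."""
    if score > 0.6:
        return "very high"
    if score > 0.3:
        return "strong"
    if score < 0.1:
        return "weak or non-existent"
    return "no guideline given"  # placeholder for the 0.1 to 0.3 range


def results_table(hits, top=50):
    """Rows of (norm rank, bits rank, chain, length, bits, norm score, label)."""
    positive = [h for h in hits if h.bits > 0]
    by_bits = sorted(positive, key=lambda h: h.bits, reverse=True)
    bits_rank = {h.pdb_chain: i for i, h in enumerate(by_bits, start=1)}
    by_norm = sorted(positive, key=normalized_score, reverse=True)[:top]
    return [
        (rank, bits_rank[h.pdb_chain], h.pdb_chain, h.length, h.bits,
         round(normalized_score(h), 2), interpret(normalized_score(h)))
        for rank, h in enumerate(by_norm, start=1)
    ]


if __name__ == "__main__":
    demo = [Hit("1AAA:A", 100, 70.0), Hit("2BBB:B", 400, 160.0)]
    for row in results_table(demo):
        print(row)
```

Dividing by chain length is what lets a short chain with a modest bits score outrank a long chain with a higher one, which is why the two rank columns can disagree.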
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
Cuff, J.A., and Barton, G.J. 1999. Evaluation and improvement of multiple
sequence methods for protein secondary structure prediction. Proteins Struct.
Funct. Genet. 34, 508-519.
Eddy, S.R. 1998. The HMMER 2.1.1 User's Guide. Unpublished. Available from
Frishman, D., and Argos, P. 1996. Incorporation of non-local interactions
in protein secondary structure prediction from the amino acid sequence. Protein
Eng. 9, 133-142.
Frishman, D., and Argos, P. 1997. Seventy-five percent accuracy in protein
secondary structure prediction. Proteins Struct. Funct. Genet. 27, 329-335.
Jones, D.T. 1999. Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
Kabsch, W., and Sander, C. 1983. Dictionary of protein secondary structure:
Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers
King, R.D., and Sternberg, M.J.E. 1996. Identification and application of
the concepts important for accurate and reliable protein secondary structure
prediction. Protein Sci. 5, 2298-2310.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Nucleic
Acids Res. 22, 4673-4680.
The authors of PCAPSS may be contacted if you have further questions.
Dr. Peter J. Lammers, Dept. of Chemistry
and Biochemistry, New Mexico State University
Dr. John B. Spalding, Molecular
Biology Program, New Mexico State University (use this contact for
reporting errors or technical implementation questions)
Both are members of the SWBIC