Bioinformatics: From Genome Sequences to Protein Structures

Structure prediction


e. Fold recognition

The aim of fold recognition is to find which of the currently known folds (as represented by the protein structures in the PDB) most closely matches (i.e. is most compatible with) the query sequence.

There are currently over 670 unique folds in the PDB. Some estimates put the number of protein folds found in nature at around 1,000. Some folds occur more often than others, such as the 10 superfolds shown below.

   1. globin (1hlm)
   2. updown (2hmz)
   3. trefoil (1i1b)
   4. TIM barrel (7tim)
   5. OB roll (1mjc)
   6. doubly wound (3eca)
   7. immunoglobulin (3hhr)
   8. UB roll (1sha)
   9. jelly roll (2stv)
  10. plaitfold (1aps)

The 10 superfolds
[taken from Salem et al. (1999), J. Mol. Biol., 287, 969-981].

Current methods of fold recognition are not particularly reliable. They are most likely to fail if the query sequence corresponds to a multi-domain protein, or if its fold is not one of those in the PDB.

There are a number of protein fold recognition servers on the Web:-

3D-PSSM prediction

All these methods can take a very long time to run. So, here is a set of results prepared earlier. The results were generated by submitting the sequence to the Phyre Web server.

Simply scroll down the upper box to look at the results. The lower box contains the probe sequence, a secondary structure prediction and its assigned reliability. Briefly compare the prediction with the results from the Jpred consensus secondary structure prediction.

In the table in the upper frame the significance of each hit is given by confidence scores:-

The top hit in the table is, happily, the true structure (1sro), whose fold is an OB-fold, a type of mainly-beta barrel. The next 7 hits are also OB-folds, which is encouraging. Remember, though, that these results were obtained from a database containing the structure we're looking for, so they are likely to be biased towards giving a correct answer. Such an apparently high level of success is not always obtained when the sequence genuinely belongs to a protein whose structure is unknown.

CASP2 results

At the time our sequence was being used as Target 4 for the CASP2 fold recognition experiment, the 1sro structure had, of course, not been published. Of the 708 fold predictions made for this sequence, the following PDB codes were most frequently given as being compatible with the target sequence:-

PDB code          Protein                    Fold          Number of times matched
1csp              Major cold shock protein   Beta-barrel   33
1mjc              Major cold shock protein   Beta-barrel   27
1bov(A)           Verotoxin-1                Beta-barrel    7
1lyl(A) (dom 1)   Lysyl-tRNA synthetase      Beta-barrel    7
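
To put these counts in perspective, it can help to express each one as a share of the 708 predictions. The short Python sketch below does only that arithmetic, using the figures from the table above; the names `matches` and `share` are illustrative and not part of any CASP tool.

```python
# Counts taken from the CASP2 table above (Target 4, 708 predictions in total).
TOTAL_PREDICTIONS = 708
matches = {
    "1csp": 33,
    "1mjc": 27,
    "1bov(A)": 7,
    "1lyl(A) dom 1": 7,
}


def share(count, total=TOTAL_PREDICTIONS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    for code, count in matches.items():
        print(f"{code:14s} {count:3d}  {share(count):4.1f}%")
    combined = sum(matches.values())
    print(f"{'combined':14s} {combined:3d}  {share(combined):4.1f}%")
```

Even the most frequently chosen structure accounts for only a small share of all the predictions submitted, so "most frequently given" here does not mean that most predictions agreed.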

Which of these are also in the Phyre hits?

Let's see how these folds compare with the correct answer (i.e. the structure of 1sro). We will use the SAS database to align and superpose the above proteins so that we can view them in 3D using RasMol.

The steps are as follows:-

  1. Go to the SAS database.

  2. Paste the mystery sequence, given again below, into the appropriate box.

    AEIEVGRVYTGKVTRIVDFGAFVAIGGGKEGLVHISQIADKRVEKVTDYL
    QMGQEVPVKVLEVDRQGRIRLSIKEATEQSQPAA

  3. Change the Search: option from the default of "All protein sequences in PDB" to "Specified subset" by clicking the appropriate button. Here we merely want our database to consist of the sequences in the table above.

    So, type the following series of PDB codes and chain identifiers into the lower of the two boxes. (In fact, you can just cut-and-paste from here!):-

  4. Then press the green Search button to generate the alignment of our sequences.

    The sequence identities in the SAS alignment are given at the bottom of the lower frame. As you might expect, the mystery sequence is 100% identical to 1sro. Its identities with the other sequences are all in the 20-35% Twilight Zone. (The one apparent exception, 1bov(A), has a sequence identity of 43.5%, but this is not significant as it applies only over a short segment of 23 residues.)

  5. Alter the "Annotate by" option in the top frame to secondary structure and press Apply.

    In the new version of the alignment the blue residues represent residues in beta strands, while the red ones represent residues in helices.

    Comparing all the sequences suggests that the patterns of secondary structure regions match quite well.

    Compare the structures in Swiss-PdbViewer.
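
The sequence identities quoted in step 4 can be reproduced by hand from any pairwise alignment. The Python sketch below is a minimal illustration and not the method SAS itself uses: it counts identical residues over the aligned (non-gap) columns and then labels the result using the 20-35% Twilight Zone range mentioned above. The function names and the `min_overlap` cut-off of 30 residues are assumptions made for this example only.

```python
# The mystery sequence from step 2 (84 residues).
MYSTERY = (
    "AEIEVGRVYTGKVTRIVDFGAFVAIGGGKEGLVHISQIADKRVEKVTDYL"
    "QMGQEVPVKVLEVDRQGRIRLSIKEATEQSQPAA"
)


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, overlap) for two aligned sequences.

    Both strings must come from the same alignment, so they have equal
    length. Columns with a gap in either sequence are ignored; overlap
    is the number of columns that remain.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0, 0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs), len(pairs)


def interpret(identity, overlap, min_overlap=30):
    """Label an identity value; min_overlap is an illustrative cut-off."""
    if overlap < min_overlap:
        return "overlap too short to judge"
    if identity > 35.0:
        return "above the Twilight Zone"
    if identity >= 20.0:
        return "Twilight Zone"
    return "below the Twilight Zone"


if __name__ == "__main__":
    identity, overlap = percent_identity(MYSTERY, MYSTERY)
    print(identity, overlap, interpret(identity, overlap))
```

Reporting the overlap alongside the percentage is the point of the design: 10 identical residues out of 23 aligned positions gives 43.5%, which looks impressive until the length of the segment is taken into account.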

Conclusion

From what we have seen so far, both the secondary structure prediction methods and the fold recognition methods performed impressively well on our mystery sequence. The secondary structure predictions were largely correct (in fact, in one part they even seemed to be more correct than the actual structure itself!), while the folds most often recognised by the CASP2 predictions also seemed spot on.

However, don't be lulled into a false sense of confidence in these methods. This was a special case which was relatively easy to predict. Have a look at the picture of the 10 superfolds given at the top of this page. You'll find 1mjc is one of them! In other words, our mystery sequence adopts one of the superfolds. The structure is a small one and the database contains many examples of this fold (although none are above the Twilight Zone in terms of sequence similarity to our mystery sequence). So it should be a relatively easy sequence to predict.

In all, CASP2 had 15 targets for the fold recognition exercise, of which our mystery sequence was one. These were classified as:-

where the classification from easy to medium to hard was based on the nature and extent of the solved structure's similarity to existing structures in the PDB. The impossible structures were those with no recognizable similarity to anything in the PDB.

Our mystery sequence was one of the two easy targets. And these were indeed fairly easy; a number of methods succeeded in identifying the folds correctly. The prediction success declined with increasing difficulty of the targets, and the impossible targets were not always recognised as being impossible.


This material is prepared with the support of the project ESF pro VŠ II na UK, Reg. num.: CZ.02.2.69/0.0/0.0/18_056/0013322.