Structure analysis (quality, fold, function)

Practical by Marian Novotny, 2005

The aim of this practical is to analyse a protein structure, in particular through structure validation, fold comparison and functional annotation.
This practical introduces a number of useful tools and handy services that can help you get the most out of macromolecular structural data. You might have seen some of the services before (especially if you attend the Molecular Bioinformatics X3 course or the Molecular Biology X3 course), but this time they will appear in a different context, and repetition is invaluable.

The practical is web-based; all the services we will use are accessible via your preferred web browser. We will also use DeepView (Swiss-PdbViewer) or PyMOL to visualize structures. Please make sure you can find one of these programs on your computer. If you cannot, please contact one of the lab assistants. Some of the results will be delivered to you by email, so check that you can access your mailbox.

The practical also includes questions. Please write your notes and answers in your favorite text editor so that you can check them later with the help of one of the lab assistants.

Have fun and good luck!


We will work with protein X (PrX) from Creatura mystiska and try to find out something about it. This protein does not exist in real life. It was designed only for this practical and its three-dimensional (3D) structure was modeled in our laboratory. (Otherwise the practical would be too trivial: you already know a lot about databases and various services and are able to do some detective work.) However, the protein was derived from existing sequences, so it is not completely absurd and it is possible to make qualified guesses about its function and biological importance.
There are many proteins with a known 3D structure but with very little data apart from the structural data. Therefore it is essential to learn to extract as much information as possible from the data we already have.

Let us start with the PrX protein's structure, or rather just with the PrX protein's structure image.


Now, just by looking at the picture, try to answer the following two questions.

Q. 1. How would you classify the protein based on its secondary structure content? All-alpha, all-beta or mixed alpha-beta?

Q. 2. Can you guess how many domains the structure contains? Are they most probably identical or different?


This practical is called Structure analysis, so we will work mostly with the structure of the so far mysterious PrX. You can find the structure coordinates (i.e. the PDB file) for PrX here. Please download the structure coordinates. You will need them throughout the practical.

Before we start doing anything with the structure, we should do some structure validation to get an idea of its quality. There are many ways to do that. You might, for example, compare bond lengths, bond angles and various other statistics with the corresponding statistics for already known protein structures to see whether your protein of interest looks similar to them.
We will do just a very basic validation: we will look at the Ramachandran plot of PrX. The Ramachandran plot is a very good indicator of the quality of a structure determination, because a protein structure is normally not optimized against the Ramachandran plot during structure determination.
Open your structure with Swiss-PdbViewer, select all residues and display the Ramachandran plot (Ctrl+R). You can find out the identity of a residue by pointing the mouse at it.
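Each point in the plot is a pair of torsion (dihedral) angles, and a torsion angle is defined by four consecutive atoms. As a minimal sketch of how such an angle can be computed from coordinates, the Python function below uses the common atan2 formulation. The function name `dihedral` and the example coordinates are made up for illustration, and Swiss-PdbViewer may well compute the angles differently internally.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (range -180..180) defined by four points."""
    b1 = _sub(p1, p0)          # first bond vector
    b2 = _sub(p2, p1)          # central bond, the rotation axis
    b3 = _sub(p3, p2)          # last bond vector
    n1 = _cross(b1, b2)        # normal of the plane p0-p1-p2
    n2 = _cross(b2, b3)        # normal of the plane p1-p2-p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Invented coordinates: the last point is rotated a quarter turn
# around the central bond relative to the first point.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

To obtain the angles for a real residue, pass the coordinates of the four backbone atoms that define the angle in question, read from the PDB file.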

Q. 3. What does the Ramachandran plot show? What are Psi and Phi?

Q. 4. How many residues are outside the three major allowed regions? What do the squares in the plot mean?


Good protein structures usually have the vast majority of their residues within the major allowed regions, as in the high-resolution structure of calmodulin (1exr), while poor structures have many residues outside these regions (see Figure below).




Q. 5. Judging from the Ramachandran plot of our mysterious structure, does PrX look like a protein? Is it worth analysing further?


Regardless of your answer to the previous question, we will continue with this structure.
What shall we do if we want to find out something about an unknown protein? Well, as we have done before we can try to find something similar in the databases. Previously we were using the sequence of the protein. Now that we have the structure of the protein, we can search the structural databases.
We will try to find structures that have a similar fold. If we find one, we can infer some properties of PrX from its similarity to a known (and hopefully annotated) structure.

There are many servers that perform fold comparison. A fold comparison is a structural alignment between your structure and every structure in the database that a particular server uses. Different servers use different methods and different databases (or rather different subsets of the PDB) and consequently give different results. No server is known to always give better results than the others, so it is generally a good idea to try more than one. A few years ago we did an evaluation of fold comparison servers; if you are interested in the results, consider looking at this page.

Here, we will look at two of these servers, SSM and DALI.
Let us begin with DALI, which was one of the best servers in our survey. You will do a search using DaliLite v.3. Upload the coordinates for PrX, fill in your e-mail address and click 'Submit'. DALI sends results by email and we have to wait about 10 minutes for them (if you have to wait more than 20 minutes, something is wrong).

Meanwhile, you can familiarize yourself with the DALI server...
Servers and databases are only useful when they are maintained and frequently updated. A user of any service should always check whether the service is still maintained or has been abandoned. Look at the DALI page and answer the following questions:

Q. 6. When was the DALI webpage last updated?

Q. 7. When was the DALI Lite database last updated? What database is used by DALI?

Q. 8. Do you think the DALI database is updated often enough?


We will also submit our structure to the SSM server. Before we do that, look at the webpage of the server and answer the following questions:

Q. 9. When was the SSM webpage last updated? Who is the primary developer of the server?

Q. 10. Which paper would give you more information about the SSM server?


Now, click on the 'Start SSM' button, change the 'Source' of the Query from 'PDB entry' to 'Coordinate file' and upload the PrX structure. In the 'Target' window you have a lot of options, but to stay on the safe side we can keep the default setting (All PDB archive, the biggest available database). We will also keep the default settings everywhere else and run the job by clicking 'Submit your query'. Nevertheless, try to think about what effect it would have to set, for example, 'lowest acceptable match' to 70%.
A list of hits should arrive almost instantly. Look at the Titles in the list.

SSM Results

Q. 11. How many different proteins are among the hits?

Q. 12. What score is decisive for ordering the hits? What does it mean?
Find the explanation for each column in the list.


Take a closer look at one of the hits, let's say 1PO6. The row for each hit gives many statistics that describe how similar your query is to the hit. You can also view any of the hits superimposed on your query protein, and many links to other services and databases are provided as well.

Q. 13. What is the sequence identity between 1PO6 and the query?

Q. 14. What is the SSE identity between 1PO6 and the query?


Now is the time to compare the results from both servers. If you still have not received an email from the DALI server, you can have a look at a locally deposited copy of the results. The DALI server reduces the number of hits by pruning the initial database: it uses only structures that share less than 25% sequence identity with each other.

Q. 15. How many hits did you get with the DALI server? More or fewer than with SSM?


A rather surprising answer, isn't it? One would expect fewer hits if the database is smaller. Apparently, the size of the database is not the only factor influencing the results. There have to be other things to consider.

Q. 16. How could you explain the different number of hits from SSM and DALI? What other factors could have played a role?


Spend some time with the DALI list of hits. It is similar to the SSM list: it shows a statistical significance score, the RMSD, the number of aligned residues, the sequence identity and a brief annotation of each hit. Note that the sequence identity is often very low, well below the twilight zone. Let us pay special attention to the column 'Description'. Do the hits help to guess the function of our protein, at least at a very general level?
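Two of the columns in such a list, RMSD and sequence identity, can be illustrated with a short Python sketch. It assumes that the two structures are already superimposed and that the alignment is given as two gapped strings of equal length. Identity is counted here over columns where both sequences have a residue, which is only one of several conventions and not necessarily the one DALI or SSM uses. The function names and the example data are invented.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, already superimposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns without a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a == b)
    return 100.0 * same / len(pairs)

# Invented example: every atom pair is exactly 1 unit apart.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))

# Invented alignment: 6 gap-free columns, 5 of them identical.
print(round(percent_identity("ACDEF-G", "ACKEFHG"), 1))
```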

Q. 17. What is the most probable function of our mysterious protein?


Let us explore the function of the protein a bit more in the last section of this practical. It seems that our protein is most similar to heterogeneous nuclear ribonucleoprotein A1 (also called helix-destabilizing protein, single-strand binding protein or hnRNP core protein A1; maybe there are still more names) from Homo sapiens. That is a very long name, but it doesn't say very much. Maybe you are an expert in cell biology and already know it all, but for the rest of us it will take some effort to find out more.

We can start at the PDBsum page of one of the heterogeneous nuclear ribonucleoprotein A1 structures, namely 1po6. PDBsum is a database of known protein and nucleic acid structures where we can easily find basic information about a particular structure and, more importantly, about the protein itself. There are also many links to other structural and sequence databases.

First of all, being sceptical scientists, we will do a quick validation of the heterogeneous nuclear ribonucleoprotein A1 structure (1po6). Find the link to EDS on the PDBsum page for the entry 1po6. EDS stands for the (Uppsala) Electron Density Server. EDS works only for structures determined by X-ray crystallography. The main objective of the server is to show how well the structure built by the authors corresponds to the data that they collected. One would expect a very good match, since the structure should be built according to the experimental data. However, this is not always the case. A protein can, for example, contain a mobile loop that adopts a few different conformations, but the structure deposited in the PDB is static and can contain just one of these conformations. In such a case, the EDS server will show a lower correlation between the structure and the experimental data (electron density), and a person interested in the protein will know that she or he should not trust this particular part of the structure to the last decimal place.

Click on "Real-space R-value" on the EDS page for the PDB entry 1po6. You will get a plot that shows the (dis)agreement between the model built by the crystallographer and the electron density for each amino acid. The higher the bar, the bigger the disagreement. EDS also lets you visualize the agreement graphically.
Click on the bar for Arg 97; a viewer should pop up and centre on Arg 97. Where does the amino acid lie (in the core or on the surface)? How well does it fit?
EDS has been retired and some of the data might no longer be available. If so, open 1po6 in PyMOL and identify where Arg 97 lies. Is it in the core? Is it on the surface?
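For orientation, the real-space R-value of a residue is commonly defined as the sum of |rho_obs - rho_calc| divided by the sum of |rho_obs + rho_calc|, taken over the map grid points around that residue. The sketch below shows only this final arithmetic step, in Python. How the grid points are selected and how the two maps are scaled is not shown and would be an assumption; the function name and the density values are invented.

```python
def real_space_r(rho_obs, rho_calc):
    """Real-space R-value from paired observed and calculated density values.

    0.0 means perfect agreement; larger values mean worse agreement.
    """
    if len(rho_obs) != len(rho_calc):
        raise ValueError("both lists must cover the same grid points")
    numerator = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    denominator = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    if denominator == 0:
        raise ValueError("density sums to zero, R-value is undefined")
    return numerator / denominator

# Invented density values at two grid points.
print(real_space_r([0.8, 0.6], [0.4, 0.2]))
```

A residue in a well-ordered part of the structure would be expected to give a low value, while a disordered side chain would give a higher one.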

Q. 18. Try to speculate: was Arg 97 built into the structure incorrectly, or are there objective reasons that make this residue difficult to build?


Please go to the InterPro database and browse by structure for the entry 1po6. InterPro can tell us what all the hits in the DALI output have in common. So, look at the domain composition of this protein (and possibly at the other hits as well).

Q. 19. Which domain occurs in hnRNP core protein A1?


We already know which protein of known structure is most similar to our protein of interest, and we also know the domain composition of that protein. To get still more specific information, we will move to the curated, annotated UniProt database to see what people before us have found out about hnRNP core protein A1.

Q. 20. What is the function of hnRNP core protein A1 according to UniProt?


And finally, let's explore the interaction between the protein and the nucleic acid. This way we can better understand how proteins recognize ssRNA or ssDNA. Download the PDB file for the 1po6 entry (there are many ways to do that; by now you should know at least one of them). Open the structure in Swiss-PdbViewer and make sure that the Control Panel is on. Select all the nucleotides. Go to the 'Select' menu and choose 'Neighbors of selected aa' (never mind that you selected nucleotides, not amino acids), then pick 'Select groups that are within'. The default value is 3.5 Å, which is about the maximum length of a hydrogen bond. Nucleic acid-protein interactions are usually mediated by hydrogen bonds, so 3.5 Å suits us. Click OK.
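The same kind of distance selection can be sketched in plain Python for readers who want to see what it involves. This is a brute-force illustration, not how Swiss-PdbViewer implements it. It assumes standard fixed-width PDB ATOM/HETATM records, recognizes only unmodified nucleotides by residue name and skips water. The helper `_toy_line`, the function names and all coordinates and residue numbers below are invented.

```python
import math

NUCLEOTIDES = {"DA", "DC", "DG", "DT", "A", "C", "G", "U"}

def parse_atom(line):
    """Read the fixed-width fields of one PDB ATOM/HETATM record."""
    return {
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

def residues_near_nucleotides(pdb_lines, cutoff=3.5):
    """Residues with at least one atom within `cutoff` of any nucleotide atom."""
    atoms = [parse_atom(l) for l in pdb_lines
             if l.startswith(("ATOM", "HETATM"))]
    nucleic = [a for a in atoms if a["resname"] in NUCLEOTIDES]
    # Everything that is neither a standard nucleotide nor water.
    others = [a for a in atoms
              if a["resname"] not in NUCLEOTIDES and a["resname"] != "HOH"]
    close = set()
    for atom in others:
        if any(math.dist(atom["xyz"], n["xyz"]) <= cutoff for n in nucleic):
            close.add((atom["chain"], atom["resseq"], atom["resname"]))
    return sorted(close)

def _toy_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a record with the standard PDB column layout (illustration only)."""
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, resname, chain, resseq, x, y, z)

toy = [
    _toy_line(1, "P", "DT", "B", 1, 0.0, 0.0, 0.0),      # nucleotide atom
    _toy_line(2, "NH1", "ARG", "A", 1, 3.0, 0.0, 0.0),   # 3.0 from it
    _toy_line(3, "CZ", "PHE", "A", 2, 0.0, 3.4, 0.0),    # 3.4 from it
    _toy_line(4, "CA", "GLY", "A", 3, 10.0, 0.0, 0.0),   # too far away
]
print(residues_near_nucleotides(toy))
```

With a real file, the lines would come from something like `open("1po6.pdb").read().splitlines()`, where the file name is an assumption about where you saved the download.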

Q. 21. How many residues are within hydrogen bonding distance? What amino acid types (hydrophobic, positively charged, ...) mediate the interaction?


Visualize just the DNA and the amino acids involved in the interaction. Also add hydrogen bonds (Tools - Compute H-bonds).

Q. 22. Is it the side chain or the main chain that is responsible for the interaction between the amino acids and the nucleic acid?

Q. 23. Which part of the nucleotide mostly takes part in the interaction?

Q. 24. Could other types of interactions also be taking place?


Now is the time to compare the nucleic acid binding properties of our protein with those of hnRNP core protein A1. We will align the sequences of both proteins and check whether the residues responsible for binding ssDNA in hnRNP core protein A1 are also conserved in our protein. If you find most of the residues conserved, there is a high probability that our protein also binds nucleic acid and possibly has a similar function.
At last, here comes the sequence of our protein.


>Protein "X"
KRPDQLGKLFIGNLSFQTSDESVRQHFEQWGEITDSIVMKDKNTGRSRGYGFVSYAPVED
VTAIMNARLHLLDGNVIEKKRKVSVEDNQRPVKKLFIRGIKESTTEEDLKEYFSE
YGDIELLEIVTDHASGKTRGFGFVTFDDKDTVMKLVINRYHIVNGHQCEARLALSRQEMA
SAS
 


You know how to get the sequence of hnRNP core protein A1, and you can align the two sequences at any pairwise sequence alignment server, e.g. here.
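Once you have the alignment, checking the binding residues by eye is perfectly fine. As a hedged illustration of the bookkeeping involved, the Python sketch below walks along a gapped pairwise alignment and reports which residue of the query sits opposite each chosen position of the reference. It assumes that the reference numbering starts at `first_number` and runs without breaks, which is not guaranteed for real PDB numbering. The function name, the two short sequences and the positions are invented.

```python
def residues_at_reference_positions(aligned_ref, aligned_query,
                                    positions, first_number=1):
    """Map reference residue numbers to (ref_aa, query_aa, is_conserved)."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(positions)
    result = {}
    number = first_number - 1
    for ref_aa, query_aa in zip(aligned_ref, aligned_query):
        if ref_aa == "-":
            continue  # insertion in the query, no reference residue here
        number += 1
        if number in wanted:
            result[number] = (ref_aa, query_aa, ref_aa == query_aa)
    return result

# Invented toy alignment; '-' marks a gap.
ref = "MKF-RGY"
query = "MRFAKG-"
for pos, (r, q, same) in sorted(
        residues_at_reference_positions(ref, query, [2, 3, 4, 6]).items()):
    print(pos, r, q, "conserved" if same else "different")
```

Note that exact identity is a strict criterion; a conservative substitution (for example one positively charged residue for another) may preserve binding even though it is reported as different here.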

Q. 25. Would you bet that our protein X could bind nucleic acid? Please justify your answer.


Congratulations, you have survived. In this exercise, you have seen and tried a few services that, working from protein 3D structures, can provide hints about the function and evolution of a protein.


Please forward comments and questions to Marian Novotny.

Last updated 11th of April 2018.