

WCG Status Update

WCG Post

Human Proteome Folding Newsletter - Mid March Update
[Mar 22, 2005 5:56:30 AM]

Human Proteome Folding - World Community Grid
Newsletter: Mid March Update
Richard Bonneau, Seattle, March 2005

Well, we've crossed an important milestone on the World Community Grid: 5 million results returned. Everyone pat yourselves on the back, because that means we've computed structure predictions for over 30,000 proteins and protein domains. To put that in perspective, the human genome has on the order of 30,000 protein-coding genes. It also means that we've evaluated over 840,000,000,000,000* different protein conformations. How big is that number? Well, if each of those protein conformations were the size of a hamster, then the whole calculation would be the size of Jupiter (if the hamsters are puffed up) or Saturn (if you get their fur wet). Compared to numbers from statistical mechanics or astronomy that's a small number, but given the speed of your average cluster node it is a big one. My facilities manager might be a bit upset about the electric bill if we tried to do this in the building.

* Disclaimer: I reserve the right to be off by several orders of magnitude at all points in this newsletter. Also, the hamster analogy is not related to the HMMSTR structure prediction server. For a cute picture of a hamster eating a protein see:

What does this mean from the perspective of usable results and science?

This means we're roughly a third of the way through the calculation that we'll do on the grid. The process consists of three phases:

1) PICK THE GENES FROM THE SET OF ALL SEQUENCED GENES TO FOLD ON THE GRID. Deciding what proteins to put on the grid, and the preprocessing required to make the work units (Rosetta input data).
2) WHAT THE CLIENT IS DOING ON YOUR COMPUTER. The folding, or protein structure prediction, that the Rosetta client performs on the grid.
3) GETTING THE RESULTS INTO THE HANDS OF BIOLOGISTS AND BIOMEDICAL RESEARCHERS. The post-processing step required to make sense of the results.

Step one is nearly complete, and the work units are mostly waiting for the World Community Grid to slurp them up. A lot of work went into deciding what proteins to place on the grid. First of all, we check whether there are simpler ways to predict the protein's structure (known as comparative modeling and fold recognition). If we can find a match to a known protein fold, then we can model the structure by mapping the protein in question onto the structure of a close match (by close match I mean a sequence-to-sequence match). This is much more efficient than folding the protein on the grid. Also, proteins must be shorter than 150 amino acids to be Rosetta-able. Therefore, we processed a lot more than 100,000 proteins to come up with 100,000 foldable (Rosettable) domains. All in all, we've processed nearly all protein sequences in publicly available databases. This task was carried out using a program called ginzu (Malmstroem, Kim, Chivian, Baker) at the University of Washington, in collaboration with Lars Malmstroem and David Baker. We'll continue putting the finishing touches on this process in the weeks to come, but the bulk of this task is finished.

Step two is one third done. That means we have huge numbers of protein conformations on disk at the ISB that were predicted on the grid. Here are some pictures of some of the structures generated on the grid. Until we perform the post-processing needed to distill function from these structures, we won't have much to say about any single protein.

Step three is just beginning. We'll say more about progress on this final front, and about what exactly we mean by post-processing, in later newsletters.

For the reader interested in the workings of Rosetta we offer this installment of "Rosetta Corner". In each newsletter we'll cover another part of the Rosetta simulation. This part gets a tiny bit hairy. For more details see:

How Does Rosetta Work Part 1: Fragments.

The first thing to understand about what Rosetta is doing is how Rosetta builds proteins. Finding the correct structure in the astronomically large space of all possible structures requires a strategy for efficiently creating and judging protein structures if we are to successfully predict protein topologies. This amounts to twisting the bond angles along the protein chain to get good global conformations (creating favorable contacts between non-local parts of the chain). The way we do this in Rosetta is to precompute (step 1 above) libraries of local conformations, or fragments of peptide chain structure, for each protein (fig 2). The problem of building protein structures is then reduced to assembling these structure fragments. That is what your client is showing. We start with a random chain configuration and then substitute in random fragments until we start to make good contacts and eventually good structures. When we can't make the structure any better, we've converged, and we start the process over again from a different random number seed. Each client will try to make anywhere from 50 to 500 structures this way, each structure made of pieces of other proteins (local fragments). This strategy has two main advantages: 1) all local parts are "protein-like" and 2) each fragment substitution is an efficient simultaneous sampling of several bond angles (we change 18 degrees of freedom in one move in an intelligent way).
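For readers who think better in code, here is a toy sketch of the fragment-substitution loop described above. It is not Rosetta: the fragment library, the score function, and every name in it (make_toy_library, toy_score, assemble, max_stall) are made up for illustration, and the accept-only-if-better rule is a simplification assumed here for brevity. The chain is stored as one (phi, psi) angle pair per residue, so swapping in a 9-residue fragment changes 18 angles in one move.

```python
import random

FRAG_LEN = 9  # residues per fragment: 9 x (phi, psi) = 18 angles per move


def random_chain(n_res, rng):
    """Random starting chain: one (phi, psi) pair, in degrees, per residue."""
    return [(rng.uniform(-180, 180), rng.uniform(-180, 180)) for _ in range(n_res)]


def make_toy_library(n_res, per_window, rng):
    """Hypothetical library: a list of candidate fragments for every window start."""
    shapes = [(-60.0, -45.0), (-120.0, 130.0)]  # helix-like and strand-like angles
    library = []
    for _ in range(n_res - FRAG_LEN + 1):
        frags = []
        for _ in range(per_window):
            phi, psi = rng.choice(shapes)
            frags.append([(phi + rng.gauss(0, 10), psi + rng.gauss(0, 10))
                          for _ in range(FRAG_LEN)])
        library.append(frags)
    return library


def toy_score(chain):
    """Placeholder score, lower is better. A stand-in for a real energy function."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in chain)


def insert_fragment(chain, start, fragment):
    """Return a copy of the chain with one window replaced by the fragment."""
    new_chain = list(chain)
    new_chain[start:start + FRAG_LEN] = fragment
    return new_chain


def assemble(n_res, library, score, rng, max_stall=200):
    """Substitute random fragments until max_stall tries in a row fail to improve."""
    chain = random_chain(n_res, rng)
    best = score(chain)
    stall = 0
    while stall < max_stall:
        start = rng.randrange(len(library))      # random window position
        fragment = rng.choice(library[start])    # random fragment for that window
        trial = insert_fragment(chain, start, fragment)
        trial_score = score(trial)
        if trial_score < best:                   # keep only improvements
            chain, best, stall = trial, trial_score, 0
        else:
            stall += 1
    return chain, best


if __name__ == "__main__":
    rng = random.Random(0)
    library = make_toy_library(30, 5, rng)
    # Each new seed gives a new random start and a new candidate structure.
    for seed in range(3):
        _, final_score = assemble(30, library, toy_score, random.Random(seed))
        print(seed, round(final_score, 1))
```

Restarting assemble with a different seed mirrors the "start over from a different random number seed" step: every run yields one candidate structure built entirely from library fragments.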

Step 1: Find Rosettable domains. All this happens before we get to the grid. Blue programs are sequence based, orange are structure based. [Bill this figure can go if it is too much]

Step 1 - part two: Pick fragments of local structure: This figure (made by Kim Simons right around the time of Rosetta's birth) shows the pieces of local structure that Rosetta will put together. We're just showing you a few fragments. We use 75 fragments (deep in this picture) for every possible 3- and 9-residue window (across in this figure). So there are a lot of ways to combine that many fragments at that many positions, and that is why we need the grid.
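To get a rough feel for "a lot of ways", here is a small back-of-the-envelope count. It only uses the two figures quoted above (75 fragments per window, windows of 3 and 9 residues); the function names and the non-overlapping tiling estimate are assumptions made for this sketch, not a count that Rosetta itself performs.

```python
FRAGS_PER_WINDOW = 75   # fragments kept for each window (figure from the text)
WINDOW_SIZES = (3, 9)   # fragment lengths in residues (from the text)


def n_windows(n_res, size):
    """Number of overlapping windows of a given size along a chain."""
    return max(n_res - size + 1, 0)


def n_single_moves(n_res):
    """Distinct single fragment substitutions available for one chain."""
    return sum(FRAGS_PER_WINDOW * n_windows(n_res, size) for size in WINDOW_SIZES)


def n_tilings(n_res, size=9):
    """Crude estimate: ways to fill the chain with non-overlapping fragments."""
    return FRAGS_PER_WINDOW ** (n_res // size)


if __name__ == "__main__":
    n_res = 150  # roughly the largest domain length mentioned in the newsletter
    print("single moves:", n_single_moves(n_res))
    print("non-overlapping 9-residue tilings:", n_tilings(n_res))
```

Even the crude tiling estimate ignores overlapping windows and the 3-residue fragments, so it undercounts the combinations available; in the spirit of the disclaimer above, treat it as an order-of-magnitude illustration only.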