
 

WCG Status Update

WCG Post

HPF2 Update
Hello crunchers,

It seems that we are progressing towards our goal very rapidly thanks to your compute power. We want to keep you posted on what's happening on our end as well as what you're currently computing.

First, we have been hard at work evaluating the computed structures, and we are about 98% done with the HPF1 batches. (So it's taken 6 months just to post-process the results from the grid, another reminder of how vast the grid is compared to an average in-house compute cluster.) To remind everyone what we do when we receive the raw structures back from your machines: we first gather all of the structures for a particular protein (from the many different grid agents, each of which started from a different random protein conformation) and "cluster" them. This essentially means we group them together by how similar they are. There may be clusters with a lot of structures that are very similar and others that only have a few structures. We generally have the most confidence in the clusters with a lot of structures. This is because all of the structures started from random positions, so if many of them folded up to something similar, it is likely that this is the correct structure. We then pick a representative from each cluster and compare it to all known structures to see if we get a match. We have done this for about 120,000 proteins so far, with only a few thousand more to go.
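For the curious, here is a rough idea of what "clustering" can look like in code. This is only a minimal sketch, not our actual pipeline: it assumes we already have a table of pairwise distances between structures (smaller means more similar), and the function name, the distance values and the threshold are all made up for illustration. It repeatedly takes the structure with the most close neighbours, makes a cluster out of it and those neighbours, and uses that structure as the cluster's representative.

```python
def cluster_by_similarity(dist, threshold):
    """Greedy clustering on a symmetric matrix of pairwise distances.

    Returns a list of (representative, members) tuples, largest cluster first.
    Hypothetical helper for illustration only.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        def neighbours(i):
            # Unassigned structures within `threshold` of structure i (includes i).
            return [j for j in remaining if dist[i][j] <= threshold]

        # The structure with the most close neighbours becomes the representative.
        # Ties go to the lowest index because we iterate in sorted order.
        center = max(sorted(remaining), key=lambda i: len(neighbours(i)))
        members = sorted(neighbours(center))
        clusters.append((center, members))
        remaining.difference_update(members)

    # Bigger clusters are the ones we would trust more, so list them first.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return clusters


# Made-up distances for five structures: 0, 1, 2 look alike, and so do 3 and 4.
example_dist = [
    [0.0, 1.0, 1.0, 8.0, 8.0],
    [1.0, 0.0, 1.0, 8.0, 8.0],
    [1.0, 1.0, 0.0, 8.0, 8.0],
    [8.0, 8.0, 8.0, 0.0, 1.5],
    [8.0, 8.0, 8.0, 1.5, 0.0],
]
example_clusters = cluster_by_similarity(example_dist, threshold=2.0)
for representative, members in example_clusters:
    print(f"representative {representative}: {len(members)} structures {members}")
```

In this toy example the three-structure cluster would be the one we trust most, simply because more independent starting points ended up there.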

Once we have the match to a known structure, we can then look to see what the known structure does (specifically, its function). For example, does it bind DNA, does it break apart peptides, or does it manufacture membrane? (The list goes on and on and on.) This is what we have been working on quite a bit in the last few months. What we do first is find all proteins whose functions and structures are both known. (This doesn't include the proteins we folded, because we're still trying to find out their function and structure.) We then figure out how often a specific structure has a specific function. Maybe proteins with structure A bind to DNA 75% of the time and break apart peptides 25% of the time. And maybe proteins with structure B manufacture membrane 95% of the time. Then if we're interested in a particular protein and it has a match to structure B, we can be pretty confident it manufactures membrane.
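The counting step in the paragraph above can be sketched in a few lines. This is a simplified illustration, not the method we actually use (which involves more than raw counts): it assumes a list of (structure, function) pairs for proteins where both are known, and the function name and the example data are invented to reproduce the 75% / 25% / 95% figures.

```python
from collections import Counter, defaultdict


def function_frequencies(annotated):
    """For each structure, the fraction of annotated proteins with each function.

    `annotated` is an iterable of (structure, function) pairs.
    Hypothetical helper for illustration only.
    """
    counts = defaultdict(Counter)
    for structure, function in annotated:
        counts[structure][function] += 1
    # frequency(function | structure) = count(structure, function) / count(structure)
    return {
        structure: {f: n / sum(c.values()) for f, n in c.items()}
        for structure, c in counts.items()
    }


# Made-up annotations: 4 proteins with structure A, 20 with structure B.
example_annotations = (
    [("A", "binds DNA")] * 3
    + [("A", "breaks apart peptides")] * 1
    + [("B", "manufactures membrane")] * 19
    + [("B", "binds DNA")] * 1
)
example_freqs = function_frequencies(example_annotations)

# A new protein that matches structure B: report its most frequent function.
best_function = max(example_freqs["B"], key=example_freqs["B"].get)
print(best_function, example_freqs["B"][best_function])
```

With only a handful of annotated proteins per structure, these fractions would be shaky, which is one reason a real confidence score needs more than simple counting.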

Now, since I'm writing this on Halloween and we're all in the mood for something scary, here's an equation:



This is one of the equations we use to compute our confidence that a protein has a particular function. If you think it's scary, it's not as scary as trying to explain it, which I'm too afraid of doing right now. :D

So far we have gotten pretty good results using our method but there is still some work to be done to fine-tune the parameters and get the best possible results.

Now to move on to the status of HPF2 and what is currently being crunched on your desktops, laptops or your company compute clusters without them knowing. We have told you that we're just about done with malaria, and when we looked at some of the results we noticed some very interesting things about the proteins. Not all of them folded as we expected. After scratching our heads for a while, we started to notice that malaria proteins have a feature that proteins from most other organisms don't have: a high percentage of disorder. So what is disorder? Disorder is a region of the protein that doesn't fold into a stable structure. Most proteins have a little disorder, and in a lot of cases the disorder actually helps the proteins do what they're supposed to do. But in malaria proteins, disordered regions appear to be very prominent. We saw this and expected it, so now we're working on alternate ways of chopping these disordered regions out and sending the cleaned-up proteins back to the grid. This means that if we want the best chance of getting the correct prediction for many of the malaria proteins, we'll have to send a few versions of each one out to the grid.

It is hypothesized that the disorder is there to trick our immune system into thinking it's not really a malaria protein, so that it escapes our body's defenses. These disordered regions are also highly variable (they mutate rapidly), so once our immune system figures out what's going on, the disorder changes and tricks us again. We've decided to look into these disordered regions more closely and try to remove them or identify/classify them before we send the proteins out to the grid again. So you may be crunching the same proteins you crunched before, but that is only because malaria is a really complicated creature and it tricked us the first time around. Hopefully we won't be fooled again.
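To give a feel for what "chopping out" disordered regions means, here is a toy sketch. It is not how we actually do it: it assumes that some disorder predictor (not shown, and not named here) has already flagged each residue as disordered or not, and the function name, the minimum length and the example sequence are all invented for illustration. It simply keeps the stretches that are not flagged and drops pieces too short to be worth folding.

```python
def ordered_segments(sequence, disordered, min_length=3):
    """Return the stretches of `sequence` that are not flagged as disordered.

    `disordered` holds one True/False flag per residue. Stretches shorter than
    `min_length` are dropped. Hypothetical helper for illustration only.
    """
    if len(sequence) != len(disordered):
        raise ValueError("need exactly one disorder flag per residue")

    segments = []
    current = []
    for residue, is_disordered in zip(sequence, disordered):
        if is_disordered:
            # A disordered residue ends the current ordered stretch.
            if len(current) >= min_length:
                segments.append("".join(current))
            current = []
        else:
            current.append(residue)
    # Do not forget an ordered stretch that runs to the end of the protein.
    if len(current) >= min_length:
        segments.append("".join(current))
    return segments


# Made-up 18-residue sequence with a flagged run of asparagines in the middle.
example_sequence = "MKVLAANNNNNNWTRLEG"
example_flags = [False] * 6 + [True] * 6 + [False] * 6
example_segments = ordered_segments(example_sequence, example_flags)
print(example_segments)
```

Changing where the cuts fall (for example by varying how strict the flags are) gives different versions of the same protein, which is why one protein can turn into several work units.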

Another set of proteins we are going to be crunching soon comes from a close relative of the malaria parasite we crunched before. We're excited about this one because it was just recently sequenced, and hopefully some of our results will help in the discovery of a vaccine. The official name is Plasmodium vivax (the other one we work on is Plasmodium falciparum), and it's not as deadly as the other but still causes plenty of people to get sick. This work is a collaboration with a top researcher in the field, Jane Carlton. Another hope for us is that it will have some of the same proteins as P. falciparum but without all the hairy disordered regions. The idea is that we sneak in the back door on these proteins; any way to beat this bug.

The last dataset of great interest, which you will be crunching soon, is the Global Ocean Sampling dataset, or GOS. The story behind this dataset begins with a man and his boat. Well, really it's a team of researchers and one researcher's yacht. (We don't all have yachts, by the way.) Anyway, they took this boat out to sea, stuck a bucket a foot below the surface, sequenced what was in the bucket, and did this across a few thousand miles of ocean. What they found was an absolutely incredible diversity of organisms. They almost doubled the number of known protein sequences and discovered over 1700 new protein families. Very exciting stuff. Besides the fundamental questions that can be answered by knowing the diversity of organisms, the organisms may have some pretty awesome applications. The researchers hope to find new antibiotics, new industrial enzymes, new organisms that bind toxic metals and who knows what else. We're going to fold some of the new protein families to begin the somewhat daunting task of annotating this massive dataset and to clue researchers in on where to look for interesting things. Hopefully you're on board with us. (I couldn't help putting in a cheesy rallying call.)

Again, thanks for crunching, and we'll keep you updated.



Survey sites of GOS expedition
http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0050077