

World Community Grid Status Update

World Community Grid Post - HPF2 Update, November 2011

Greetings to everyone,

It's been a stretch since the last update, but excitingly (!), we've been quite busy wrapping up ongoing projects with publications, and also getting our teeth into new projects and data. So, without further ado, I'd like to first mention our accepted and pending publications, and then go over the new data we're crunching and where it is leading us.

The lab has been very excited to recently have two gargantuan efforts come to fruition with the acceptance of one paper and the completion and submission of a second. The first, by Kevin Drew et al., is an enormous work covering nearly everything we do in terms of protein structure and function prediction, and was made conceivable in the first place and achievable in the second by the support of World Community Grid computing cycles.

The paper will be available in the journal Genome Research this month (November 2011). The abstract is as follows, and the lab spent extra to ensure an open license so that the paper could be viewed in full - take a look!

The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition and grid-computing enabled de novo structure prediction. We predict protein domain boundaries and 3D structures for protein domains from 94 genomes (including Human, Arabidopsis, Rice, Mouse, Fly, Yeast, E. coli and Worm). De novo structure predictions were distributed on a grid of over 1.5 million CPUs worldwide (World Community Grid). We generate significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.

The paper can be viewed here:

Also, take a quick look at this key image from the paper, which shows how we predict domain boundaries and use the grid to do de novo structure prediction for unknown domains:


The second piece of good news is that another paper involving protein structure folding has recently been submitted for publication.

Melissa Pentony et al. have presented work considering sites of positive selection (areas of faster-than-average evolution) in the proteomes of five major plant species in order to study plant protein evolution, and have extended this analysis in a novel way by mapping sites of positive selection in proteins onto 3D predicted protein structures. This is exciting because, as seen in the image below, it allows scientists to visualize where sites of increased evolution occur structurally on a protein.


[Image: a DNA-binding protein interacting with DNA, with positively selected residues of the protein highlighted by blue spheres. Notice that the parts of the protein interacting with DNA are under positive selection!]

This work is currently being revised and will be available for preview shortly. It is another example of the grid producing data (predicted protein structure!) that can be used in diverse biological studies to extend analyses and relate biological phenomena to the fundamental molecular machines of the human body (proteins!).

Now on to what's been grinding on your CPUs...

Code  | Experiment | Project/Organism      | Description                                                        | Status
oa-ok | 1169       | Microbiome            | Novel Gastro-Intestinal proteins from the Human Microbiome Project | Finished
ol-op | 1170       | P. Yoelii             | Plasmodium Yoelii Yoelii, a model rodent malaria                   | Finished
oq-ow | 1146-1161  | Haloferax, Haloarcula | Two Archaea, part of the third domain of life                      | Temp. Suspended
ox-qc | 1171       | Mouse                 | New proteome data for Mouse                                        | Running!
qd+   | 1171       | Mouse                 | New proteome data for Mouse                                        | Waiting
ql+   | 1176       | Human                 | New proteome data for Human                                        | Waiting

Processing for the Human Microbiome Project (described in the last update) was finished with batch 'ok', and from there we moved on to Plasmodium Yoelii Yoelii, which made up batches 'ol' through 'op'.

I mentioned the parasite Plasmodium Yoelii Yoelii in the previous status update and very briefly in my last forum post. Pyy is a rodent malaria used as a model organism for studying malaria in general, and human malaria specifically (using very similar model organisms is common in the field, and is extremely helpful for increasing data set size and for inferring properties of an organism from known properties of its model). For this reason, having accurate structural knowledge of Pyy is important for the malaria research community.

Knowing this, we looked up our collaborator Jane Carlton, who recently moved to the NYU Department of Biology, and asked for the most up-to-date data. We were pointed to a resource called PlasmoDB, and from the data we found there we put together five batches of novel protein domains to be sent for de novo structure prediction.

After malaria...

After malaria, while we updated our post-processing analyses to make better use of grid results, we moved on to Archaea, which make up the third domain of life (the other two being bacteria and eukaryotes). Archaea are incredibly interesting and important organisms - they're now getting a lot of press due to their role in the function of the human colonic system, and interestingly, some species are known to thrive in incredibly harsh environments, such as salt lakes and hot springs.

For more information on Archaea, check this Berkeley resource or, of course, the Wikipedia article on Archaea. The archaea Haloferax and Haloarcula make up batches oq through ow.

Pausing the Archaea

At the moment, we have a large list of archaea to analyze, but have switched priorities for two reasons. First, we have developed in house some extremely exciting new ideas for protein function prediction based on machine learning techniques (which sounds AI-cool, but is more statistics-cool). Second, we have revised proteome data for Mouse and Human.

We have decided to re-run this new mouse and human data through our domain prediction pipeline and send results to the grid in order to get the best possible protein structure data. With improvements and updates to our pre- and post-processing methods and increased sampling on the grid (we're now folding 100,000 structures per domain, up from 30,000!), we will be able to approach the problem of protein structure prediction in a novel and potentially game-changing way with the best data available.

In terms of work batches, ox through ql (we skipped the letter p in batch naming) are made up of Mouse protein data, with ox through qc running on the grid now. After ql, new Human data will take over.
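For anyone who wants to count batches, here is a small Python sketch of the naming scheme as it appears in this update. It rests on assumptions rather than on our actual pipeline code: batch codes are taken to be two lowercase letters, the second letter is taken to run a through z, and the skipped p is taken to apply only to the first letter (batch 'op' exists, so the second letter clearly can be p). The function name batch_codes is made up for this illustration.

```python
from string import ascii_lowercase


def batch_codes(start, end, skipped_first_letters=("p",)):
    """Return the two-letter batch codes from start to end, inclusive.

    Assumption (not confirmed by the pipeline itself): codes advance like
    oa, ob, ..., oz, qa, ..., with the listed first letters skipped.
    """
    codes = [
        first + second
        for first in ascii_lowercase
        if first not in skipped_first_letters
        for second in ascii_lowercase
    ]
    # list.index raises ValueError if a code is not part of the scheme.
    return codes[codes.index(start): codes.index(end) + 1]


# Batches named in this update.
malaria_batches = batch_codes("ol", "op")
mouse_batches = batch_codes("ox", "ql")

print(malaria_batches)
print(len(mouse_batches), "batches from ox through ql")
```

Under these assumptions, ol through op comes out to five batches, which matches the five malaria batches mentioned above, and the code after oz is qa rather than pa.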

The first culmination of this mouse and human redo, along with our new protein function prediction ideas, will be our presence at a nation-wide protein structure/function jamboree hosted by the University of California, San Diego in early December, where we will present the work of the grid and its incorporation into our new methods to hopefully astounding effect!

Cross your fingers for us…