
World Community Grid Status Update

World Community Grid Post - HPF2 Update, Fall/Winter 2012

Holiday greetings to all you crunchers (and other interested parties!) out there - here's an update on what we're working on, to keep everyone motivated through the long winter ahead (or, for those of you in the southern hemisphere… the hot summer?)

As mentioned in the last status update, we have been working feverishly on improving, testing, and generally developing a method to predict protein and gene function using custom machine learning techniques. As we mentioned last time, machine learning is a concept that has been around for quite some time, but has in the past few years become the hot topic in computer science and math, and generally any field that strives to predict features and outcomes in some system (biology and finance come to mind…).

To avoid digressing and poorly covering what some truly brilliant others have already done, I encourage you to check out Andrew Ng's Machine Learning course (available free from Stanford Uni. online) if interested - even watching the seven-minute 'Welcome' and 'What is Machine Learning?' lectures will give you a great idea of what this topic is about and why it's so valuable. They're available here: https://www.coursera.org/course/ml (just click 'Preview').


A new paper!

The exciting news for the Bonneau Lab on the machine learning project front is that we have submitted our first paper, entitled 'Parametric Bayesian Priors and Better Choice of Negative Examples Improve Protein Function Prediction,' to the journal Bioinformatics. The paper combines an important step for the math & computer science of machine learning (improving the predictive power of a learning technique via parametric Bayesian priors and a smarter choice of negative training examples) with an important step for predictive biology: making better protein function predictions. And predicting protein function as accurately as possible is what the Bonneau lab is all about.
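The full method involves Bayesian machinery beyond the scope of this post, but the "negative examples" idea can be illustrated with a toy sketch (hypothetical proteins and function terms, invented for this example - this is the general intuition, not the paper's actual algorithm): when training a classifier to recognize proteins with a given function, unannotated proteins make risky negatives, because "not yet annotated" doesn't mean "doesn't have the function."

```python
# Toy illustration of choosing negative training examples for protein
# function prediction. All data here is invented; NOT the paper's algorithm.

# Each protein maps to the set of functional terms it is annotated with.
annotations = {
    "protA": {"kinase", "signaling"},
    "protB": {"kinase"},
    "protC": {"transport"},
    "protD": {"transport", "membrane"},
    "protE": set(),  # unannotated -- its function is simply unknown
}

def positives(target_term):
    """Proteins annotated with the target function."""
    return {p for p, terms in annotations.items() if target_term in terms}

def naive_negatives(target_term):
    """Naive choice: everything not annotated with the target term,
    including unannotated proteins that might actually have it."""
    return set(annotations) - positives(target_term)

def safer_negatives(target_term, related_terms):
    """Safer choice: only annotated proteins whose terms include neither
    the target nor anything related to it -- fewer mislabeled negatives."""
    excluded = related_terms | {target_term}
    return {p for p, terms in annotations.items()
            if terms and not (terms & excluded)}

print(sorted(naive_negatives("kinase")))                  # includes unannotated protE
print(sorted(safer_negatives("kinase", {"signaling"})))   # only clearly unrelated proteins
```

The payoff is cleaner training labels: a classifier trained on the safer negative set isn't penalized for scoring an unannotated-but-actually-positive protein highly.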

Hopefully, the paper will be out in Bioinformatics shortly - we'll be sure to pass it along!


How the Grid fits into our Machine Learning goals

As we work to publish this first paper, we keep one eye firmly on how we plan to use this updated, robust method to go further - and that is where the World Community Grid comes in. Having developed this improved method for protein function prediction, we are now striving to incorporate the massive amount of predicted protein structure data produced by the Grid and you, its users, in a way that no one has done before.

In the past, our lab has shown that knowing a protein's structure contributes significantly to understanding its function. In our preliminary tests with the newly developed machine learning technique, we have found that protein structure is one of the data types that most strongly contributes to correctly predicting protein function, offering predictive insights not found in other data types. We now find ourselves in a great spot: we have a new method that is performing extremely well at protein function prediction, and we are about to incorporate a vast amount of high-quality, high-coverage structure data that will significantly improve our predictions!

As we complete the publishing process for the first paper and continue the groundwork required to process Grid data, we will also be preparing a second publication describing our incorporation of World Community Grid data and how it boosts protein function prediction to new heights.

Currently on the Grid

On to a quick update regarding what's on your machines: currently, we are just about finished with the new, high-sampling/high-resolution mouse data submitted a few months ago (very important as a model system for the human proteome!). As mouse finishes, we will move on to a new set of human data with a slightly lower sampling rate/resolution. Here's the chart:


Code    Experiment  Project/Organism  Description                                    Sampling rate  Status
ox-qi   1171        Mouse             New proteome data for Mouse                    100,000        Complete
qj,qk   1171        Mouse             New proteome data for Mouse                    100,000        Running
ql-qu   1176        Human             New proteome data for Human                    50,000         Waiting
ra+     1183        Microbiome        Gastro-intestinal proteins, Human Microbiome   50,000         Waiting


"New" data sets

Two quick things to describe here. In regards to "new" data sets, it is important to realize that the set of proteins representing an organism (a proteome) can change over time - new proteins are discovered, existing proteins are found to be simple variants of each other… in short, many changes and updates can occur. It can therefore be important to obtain the latest data for an organism of interest, and in the case of significant updates, re-run analyses on the data.

It also matters that, as time goes by, external resources such as the Protein Data Bank (PDB, a repository of experimentally determined protein structures) grow and improve. It can then be valuable to re-run certain data sets to match them against the latest versions of these resources. In our case, this often means our proteins matching newly determined experimental structures, giving us highly accurate structural information at a fraction of the computational cost.
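A toy sketch of why re-running pays off (invented sequences and an invented identity threshold; real pipelines use proper alignment tools such as BLAST, not exact-position identity): as the PDB grows, more query proteins gain a close experimental match, and those proteins no longer need expensive de novo structure prediction.

```python
# Toy model of "which proteins still need folding?" after a PDB update.
# Sequences and the 0.8 identity cutoff are invented for illustration.

def identity(a, b):
    """Fraction of matching positions between two toy sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def needs_folding(query, pdb_seqs, threshold=0.8):
    """A query still needs de novo folding if no PDB entry is a close match."""
    return all(identity(query, s) < threshold for s in pdb_seqs)

queries = ["MKVLAA", "GHTTRE"]
pdb_old = ["MKVLAG"]             # older PDB: only one query has a close match
pdb_new = ["MKVLAG", "GHTTRE"]   # updated PDB now covers the second query too

print([q for q in queries if needs_folding(q, pdb_old)])  # ['GHTTRE']
print([q for q in queries if needs_folding(q, pdb_new)])  # []
```

The same logic scaled up is why refreshed data sets can be both cheaper to run and more accurate: every new experimental match is one less protein to fold from scratch.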

This is the case with our "new" mouse and human data sets: new information is available, and because of updates to external resources and our own methods, it is highly valuable to re-run these organisms.


Sampling rate - Speed vs. Coverage

Secondly, the issue of sampling rate is very important to our structure prediction process. When asked to predict the structure of a protein segment, the Rosetta software running on your machines produces many candidate structures, each with a score reflecting how well it may fit the given segment. The number of candidates generated per segment is called the sampling rate. Roughly speaking, increasing Rosetta's sampling rate increases the "resolution" - the potential scope and accuracy - of the structure prediction. However, it also increases computation time significantly.

Currently, we're looking for the sweet spot between sampling rate and speed. For the new mouse proteins (codes 'ox' through 'qk'), we chose a sampling rate of 100,000 predicted structures per protein segment. For human, we are lowering this rate to 50,000 predicted structures per segment, as we believe that with 50,000 predicted structures we keep a good resolution of structure prediction but increase the speed of prediction significantly.
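To make the trade-off concrete, here's a toy simulation (randomly generated scores standing in for real Rosetta energies): drawing more candidate structures can only improve, never worsen, the best score found, but the gain per extra candidate shrinks while compute cost grows linearly - hence the search for a sweet spot.

```python
import random

random.seed(42)

# Pretend each candidate structure gets an energy-like score; lower is better.
# Real Rosetta scores come from physics-based terms; these are random stand-ins.
decoy_scores = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Best score found at several sampling rates. The samples are nested
# (the first 1,000 are inside the first 10,000, etc.), so the best score
# can only improve -- or stay the same -- as the sampling rate grows.
for n in (1_000, 10_000, 50_000, 100_000):
    best = min(decoy_scores[:n])
    print(f"sampling rate {n:>7,}: best score {best:.3f}")
```

Running this, the jump from 1,000 to 10,000 candidates typically improves the best score far more than the jump from 50,000 to 100,000 does - the same diminishing-returns logic behind choosing 50,000 for the human set.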


Future priorities

Notably, we have also just extended the duration of the Human Protein Folding project, phase 2 - we're in it for another 100 batches (100,000 predicted structures)! This is exciting for us, as it gives us some decisions to make: where will we go from here, and what will we fold? After the new human data is finished on the Grid, we will have an excellent, high-quality dataset of predicted structures for our two most important organisms. We plan to turn our focus back to microbiome proteins (hugely important for human health, as discussed in previous updates), and are also considering a renewed focus on important crop plant proteomes, such as those of rice and wheat. Best of all, we're not exactly sure what to prioritize, which means some investigation into what will be most important and meaningful in the future.

That said, there is certainly no shortage of interesting proteins to fold - so cheers to much more in 2013, and thanks to you all.