Recent and Pending Publications
CELL: A Validated Regulatory Network for Th17 Cell Specification
Maria Ciofani, Aviv Madar, Carolina Galan, MacLean Sellars, Kieran Mace, Florencia Pauli, Ashish Agarwal, Wendy Huang, Christopher N. Parkhurst, Michael Muratet, Kim M. Newberry, Sarah Meadows, Alex Greenfield, Yi Yang, Preti Jain, Francis K. Kirigin, Carmen Birchmeier, Erwin F. Wagner, Kenneth M. Murphy, Richard M. Myers, Richard Bonneau, Dan R. Littman. A Validated Regulatory Network for Th17 Cell Specification. Cell - 25 September 2012.
Th17 cells have critical roles in mucosal defense and are major contributors to inflammatory disease. Their differentiation requires the nuclear hormone receptor RORγt working with multiple other essential transcription factors (TFs). We have used an iterative systems approach, combining genome-wide TF occupancy, expression profiling of TF mutants, and expression time series to delineate the Th17 global transcriptional regulatory network. We find that cooperatively bound BATF and IRF4 contribute to initial chromatin accessibility and, with STAT3, initiate a transcriptional program that is then globally tuned by the lineage-specifying TF RORγt, which plays a focal deterministic role at key loci. Integration of multiple data sets allowed inference of an accurate predictive model that we computationally and experimentally validated, identifying multiple new Th17 regulators, including Fosl2, a key determinant of cellular plasticity. This interconnected network can be used to investigate new therapeutic approaches to manipulate Th17 functions in the setting of inflammatory disease.
Organic Letters: N-Naphthyl peptoid foldamers exhibiting atropisomerism
Bishwajit Paul, Glenn L. Butterfoss, Mikki G. Boswell, Mia L. Huang, Richard Bonneau, Christian Wolf, and Kent Kirshenbaum. N-Naphthyl Peptoid Foldamers Exhibiting Atropisomerism. Organic Letters 2012 14 (3), 926-929.
We introduce peptoid oligomers incorporating N-(1)-naphthyl glycine monomers. Axial chirality was established due to restricted rotation about the C-N(aryl) bond. Atropisomerism of both linear and cyclic peptoids was investigated by computational analysis, dynamic HPLC, and X-ray crystallographic studies.
GBE: The Plant Proteome Folding Project: Structure and positive selection in plant protein families
M. M. Pentony, P. Winters, D. Penfold-Brown, K. Drew, A. Narechania, R. DeSalle, R. Bonneau, and M. D. Purugganan. The Plant Proteome Folding Project: Structure and Positive Selection in Plant Protein Families. Genome Biol Evol (2012) Vol. 4 360-371 first published online February 16, 2012 doi:10.1093/gbe/evs015
Despite its importance, relatively little is known about the relationship between the structure, function, and evolution of proteins, particularly in land plant species. We have developed a database with predicted protein domains for five plant proteomes (http://pfp.bio.nyu.edu) and used both protein structural fold recognition and de novo Rosetta-based protein structure prediction to predict protein structure for Arabidopsis and rice proteins. Based on sequence similarity, we have identified ∼15,000 orthologous/paralogous protein family clusters among these species and used codon-based models to predict positive selection in protein evolution within 175 of these sequence clusters. Our results show that codons that display positive selection appear to be less frequent in helical and strand regions and are overrepresented in amino acid residues that are associated with a change in protein secondary structure. Like in other organisms, disordered protein regions also appear to have more selected sites. Structural information provides new functional insights into specific plant proteins and allows us to map positively selected amino acid sites onto protein structures and view these sites in a structural and functional context.
The proteome folding project: proteome-scale prediction of structure and function
Kevin Drew, Patrick Winters, Glenn L. Butterfoss, Viktors Berstis, Keith Uplinger, Jonathan Armstrong, Michael Riffle, Erik Schweighofer, Bill Bovermann, David R. Goodlett, Trisha N. Davis, Dennis Shasha, Lars Malmström and Richard Bonneau. The Proteome Folding Project: Proteome-scale prediction of structure and function. Genome Res. 2011. 21: 1981-1994.
The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition and grid-computing enabled de novo structure prediction. We predict protein domain boundaries and 3D structures for protein domains from 94 genomes (including Human, Arabidopsis, Rice, Mouse, Fly, Yeast, E. coli and Worm). De novo structure predictions were distributed on a grid of over 1.5 million CPUs worldwide (World Community Grid). We generate significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.
Structure based prediction of temperature sensitive mutations
Christopher S. Poultney, David Gresham, Nathan J. Brandt, Glenn L. Butterfoss, Michelle R. Gutwein, Kevin Drew, Kristin C. Gunsalus, Dennis E. Shasha, Richard Bonneau.
Invited submission to PLoS ONE special collection: RosettaCon 2010: Worked examples of macromolecular modeling and design.
Oligo(N-alkoxy glycines): Trans substantiating peptoid conformation
Peter Jordan, Glenn L. Butterfoss, P. Douglas Renfrew, Rich Bonneau, Kent Kirshenbaum.
Submitted to Biopolymers.
The Rosetta Developers Meeting, 2010 special collection: state of the art macromolecular modeling meets reproducible publishing
Charlie EM Strauss, P. Douglas Renfrew, Richard Bonneau.
Invited introduction to the RosettaCon 2010 PLoS ONE special collection.
Peptoid atropisomers
Paul B, Butterfoss GL, Boswell MG, Renfrew PD, Yeung FG, Shah NH, Wolf C, Bonneau R, Kirshenbaum K. J Am Chem Soc. 2011 Jul 20;133(28):10910-9. Epub 2011 Jun 22.
We report the isolation of N-aryl peptoid oligomers that adopt chiral folds, despite the absence of chiral centers. Peptoid monomers incorporating ortho-substituted N-aryl side chains are identified that exhibit axial chirality. We observe significant energy barriers to rotation about the stereogenic carbon-nitrogen bond, allowing chromatographic purification of stable atropisomeric forms. We study the atropisomerism of N-aryl peptoid oligomers by computational modeling, NMR, X-ray crystallography, dynamic HPLC, and circular dichroism. The results demonstrate a new approach to promote the conformational ordering of this important class of foldamer compounds.
ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules
Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, Kaufman K, Renfrew PD, Smith CA, Sheffler W, Davis IW, Cooper S, Treuille A, Mandell DJ, Richter F, Ban YE, Fleishman SJ, Corn JE, Kim DE, Lyskov S, Berrondo M, Mentzer S, Popović Z, Havranek JJ, Karanicolas J, Das R, Meiler J, Kortemme T, Gray JJ, Kuhlman B, Baker D, Bradley P. Methods Enzymol. 2011;487:545-74.
We have recently completed a full re-architecturing of the ROSETTA molecular modeling program, generalizing and expanding its existing functionality. The new architecture enables the rapid prototyping of novel protocols by providing easy-to-use interfaces to powerful tools for molecular modeling. The source code of this rearchitecturing has been released as ROSETTA3 and is freely available for academic use. At the time of its release, it contained 470,000 lines of code. Counting currently unpublished protocols at the time of this writing, the source includes 1,285,000 lines. Its rapid growth is a testament to its ease of use. This chapter describes the requirements for our new architecture, justifies the design decisions, sketches out central classes, and highlights a few of the common tasks that the new software can perform.
Multi-species integrative biclustering
Waltman P, Kacmarczyk T, Bate AR, Kearns DB, Reiss DJ, Eichenberger P, Bonneau R. Genome Biol. 2010;11(9):R96. Epub 2010, Sep 29. PubMed PMID: 20920250
We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well studied organisms.
DREAM4: Combining Genetic and Dynamic Information to Identify Biological Networks and Dynamical Models
Greenfield A, Madar A, Bonneau R. DREAM4 top performers special collection. PLoS ONE 2010, 5(10): e13397. doi:10.1371/journal.pone.0013397
Background: Current technologies have led to the availability of multiple genomic data types in sufficient quantity and quality to serve as a basis for automatic global network inference. Accordingly, there are currently a large variety of network inference methods that learn regulatory networks to varying degrees of detail. These methods have different strengths and weaknesses and thus can be complementary. However, combining different methods in a mutually reinforcing manner remains a challenge.
Methodology: We investigate how three scalable methods can be combined into a useful network inference pipeline. The first is a novel t-test–based method that relies on a comprehensive steady-state knock-out dataset to rank regulatory interactions. The remaining two are previously published mutual information and ordinary differential equation based methods (tlCLR and Inferelator 1.0, respectively) that use both time-series and steady-state data to rank regulatory interactions; the latter has the added advantage of also inferring dynamic models of gene regulation which can be used to predict the system's response to new perturbations.
Conclusion/Significance: Our t-test based method proved powerful at ranking regulatory interactions, tying for first place among the methods in the DREAM4 100-gene in-silico network inference challenge. We demonstrate complementarity between this method and the two methods that take advantage of time-series data by combining the three into a pipeline whose ability to rank regulatory interactions is markedly improved compared to any single method alone. Moreover, the pipeline is able to accurately predict the response of the system to new conditions (in this case new double knock-out genetic perturbations). Our evaluation of the performance of multiple methods for network inference suggests avenues for future methods development and provides simple considerations for genomic experimental design. Our code is publicly available at http://err.bio.nyu.edu/inferelator/.
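The idea of ranking regulatory interactions from a steady-state knock-out dataset can be illustrated with a small sketch. This is not the published method: it assumes a simple t-like score in which the expression of a target gene after knocking out a candidate regulator is compared against that target's level in wild type and in all other knock-outs. The function name `rank_knockout_edges`, the gene names, and the toy data are hypothetical.

```python
from statistics import mean, stdev

def rank_knockout_edges(wild_type, knockouts):
    """Rank candidate regulator -> target edges from steady-state knock-out data.

    wild_type: dict gene -> wild-type expression level
    knockouts: dict regulator -> dict gene -> expression after deleting the regulator
    Returns a list of (score, regulator, target), highest score first.
    NOTE: the scoring rule is an illustrative assumption, not the paper's statistic.
    """
    scores = []
    for regulator in knockouts:
        for target in wild_type:
            if target == regulator:
                continue
            # Background: the target in wild type and in every other knock-out
            # (excluding the knock-out of the target itself).
            background = [wild_type[target]] + [
                knockouts[ko][target]
                for ko in knockouts
                if ko not in (regulator, target)
            ]
            spread = stdev(background) if len(background) > 1 else 0.0
            shift = knockouts[regulator][target] - mean(background)
            score = abs(shift) / spread if spread > 0 else 0.0
            scores.append((score, regulator, target))
    scores.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scores

# Toy data (hypothetical): G1 activates G2, and G3 activates G4.
WILD_TYPE = {"G1": 1.0, "G2": 1.0, "G3": 1.0, "G4": 1.0}
KNOCKOUTS = {
    "G1": {"G1": 0.0, "G2": 0.2, "G3": 1.01, "G4": 0.99},
    "G2": {"G1": 1.02, "G2": 0.0, "G3": 0.98, "G4": 1.0},
    "G3": {"G1": 0.99, "G2": 1.03, "G3": 0.0, "G4": 0.3},
    "G4": {"G1": 1.0, "G2": 0.97, "G3": 1.02, "G4": 0.0},
}

if __name__ == "__main__":
    for score, regulator, target in rank_knockout_edges(WILD_TYPE, KNOCKOUTS)[:3]:
        print(f"{regulator} -> {target}: {score:.2f}")
```

In this toy example the two planted interactions receive the highest scores, because the knock-out of the true regulator moves the target far outside its variation across the remaining experiments.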
Innate immune detection of the type III secretion apparatus through the NLRC4 inflammasome
Miao EA, Mao DP, Yudkovsky N, Bonneau R, Lorang CG, Warren SE, Leaf IA, Aderem A. Proc Natl Acad Sci U S A. 2010 Feb 16;107(7):3076-80. Epub 2010 Feb 1. PubMed PMID: 20133635.
The mammalian innate immune system uses Toll-like receptors (TLRs) and Nod-LRRs (NLRs) to detect microbial components during infection. Often these molecules work in concert; for example, the TLRs can stimulate the production of the proforms of the cytokines IL-1beta and IL-18, whereas certain NLRs trigger their subsequent proteolytic processing via caspase 1. Gram-negative bacteria use type III secretion systems (T3SS) to deliver virulence factors to the cytosol of host cells, where they modulate cell physiology to favor the pathogen. We show here that NLRC4/Ipaf detects the basal body rod component of the T3SS apparatus (rod protein) from S. typhimurium (PrgJ), Burkholderia pseudomallei (BsaK), Escherichia coli (EprJ and EscI), Shigella flexneri (MxiI), and Pseudomonas aeruginosa (PscI). These rod proteins share a sequence motif that is essential for detection by NLRC4; a similar motif is found in flagellin that is also detected by NLRC4. S. typhimurium has two T3SS: Salmonella pathogenicity island-1 (SPI1), which encodes the rod protein PrgJ, and SPI2, which encodes the rod protein SsaI. Although PrgJ is detected by NLRC4, SsaI is not, and this evasion is required for virulence in mice. The detection of a conserved component of the T3SS apparatus enables innate immune responses to virulent bacteria through a single pathway, a strategy that is divergent from that used by plants in which multiple NB-LRR proteins are used to detect T3SS effectors or their effects on cells. Furthermore, the specific detection of the virulence machinery permits the discrimination between pathogenic and nonpathogenic bacteria.
DREAM3: Network Inference Using Dynamic Context Likelihood of Relatedness and the Inferelator
Madar A, Greenfield A, Vanden-Eijnden E, Bonneau R. PLoS ONE, 2010, 5(3): e9803. doi:10.1371/journal.pone.0009803
Background: Many current works aiming to learn regulatory networks from systems biology data must balance model complexity with respect to data availability and quality. Methods that learn regulatory associations based on unit-less metrics, such as Mutual Information, are attractive in that they scale well and reduce the number of free parameters (model complexity) per interaction to a minimum. In contrast, methods for learning regulatory networks based on explicit dynamical models are more complex and scale less gracefully, but are attractive as they may allow direct prediction of transcriptional dynamics and resolve the directionality of many regulatory interactions.
Methodology: We aim to investigate whether scalable information based methods (like the Context Likelihood of Relatedness method) and more explicit dynamical models (like Inferelator 1.0) prove synergistic when combined. We test a pipeline where a novel modification of the Context Likelihood of Relatedness (mixed-CLR, modified to use time series data) is first used to define likely regulatory interactions and then Inferelator 1.0 is used for final model selection and to build an explicit dynamical model.
Conclusions/Significance: Our method ranked 2nd out of 22 in the DREAM3 100-gene in silico networks challenge. Mixed-CLR and Inferelator 1.0 are complementary, demonstrating a large performance gain relative to any single tested method, with precision being especially high at low recall values. Partitioning the provided data set into four groups (knock-down, knock-out, time-series, and combined) revealed that using comprehensive knock-out data alone provides optimal performance. Inferelator 1.0 proved particularly powerful at resolving the directionality of regulatory interactions, i.e. “who regulates who” (approximately of identified true positives were correctly resolved). Performance drops for high in-degree genes, i.e. as the number of regulators per target gene increases, but not with out-degree, i.e. performance is not affected by the presence of regulatory hubs.
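As background for the information-based step, the following is a minimal sketch of the standard Context Likelihood of Relatedness scoring, not the mixed-CLR modification described above. It assumes a precomputed symmetric table of mutual information values; each value is converted to a z-score against the background of each of its two genes, negative z-scores are clipped to zero, and the two are combined. The function name `clr_scores` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def clr_scores(mi):
    """Compute CLR-style scores from a symmetric dict-of-dicts of mutual information.

    mi[g][h] is the mutual information between genes g and h (g != h).
    Returns a dict mapping (g, h) to the combined background-corrected score.
    """
    genes = sorted(mi)
    # Per-gene background: mean and spread of its MI values with all other genes.
    background = {}
    for g in genes:
        row = [mi[g][h] for h in genes if h != g]
        background[g] = (mean(row), pstdev(row))

    def z(g, h):
        mu, sd = background[g]
        if sd == 0:
            return 0.0
        return max(0.0, (mi[g][h] - mu) / sd)

    return {
        (g, h): sqrt(z(g, h) ** 2 + z(h, g) ** 2)
        for g in genes
        for h in genes
        if g != h
    }

def symmetric_table(pairs):
    """Build a symmetric dict-of-dicts from a {(g, h): value} mapping."""
    table = {}
    for (g, h), value in pairs.items():
        table.setdefault(g, {})[h] = value
        table.setdefault(h, {})[g] = value
    return table

# Toy mutual information values (hypothetical).
TOY_MI = symmetric_table({
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("A", "D"): 0.1,
    ("B", "C"): 0.2, ("B", "D"): 0.1, ("C", "D"): 0.1,
})

if __name__ == "__main__":
    scores = clr_scores(TOY_MI)
    best = max(scores, key=scores.get)
    print(best, round(scores[best], 3))
```

The score is symmetric in the two genes, which is one reason a dynamical model is needed afterwards to resolve the direction of each interaction.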
The Inferelator 2.0: A scalable framework for reconstruction of dynamic regulatory network models
Madar A, Greenfield A, Ostrer H, Vanden-Eijnden E, Bonneau R. Conf Proc IEEE Eng Med Biol Soc. 2009;1:5448-51. PubMed PMID: 19964678.
Current methods for reconstructing biological networks often learn either the topology of large networks or the kinetic parameters of smaller networks with a well-characterized topology. We have recently described a network reconstruction algorithm, the Inferelator 1.0, that given a set of genome-wide measurements as input, simultaneously learns both topology and kinetic-parameters. Specifically, it learns a system of ordinary differential equations (ODEs) that describe the rate of change in transcription of each gene or gene-cluster, as a function of environmental and transcription factors. In order to scale to large networks, in Inferelator 1.0 we have approximated the system of ODEs to be uncoupled, and have solved each ODE using a one-step finite difference approximation. Naturally, these approximations become crude as the simulated time-interval increases. Here we present, implement, and test a new Markov-Chain-Monte-Carlo (MCMC) dynamical modeling method, Inferelator 2.0, that works in tandem with Inferelator 1.0 and is designed to relax these approximations. We show results for the prokaryote Halobacterium that demonstrate a marked improvement in our predictive performance in modeling the regulatory dynamics of the system over longer time-scales.
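The remark that a one-step finite difference becomes crude over long time intervals can be checked on a toy problem. The sketch below assumes a single uncoupled linear ODE of the form tau * dx/dt = -x + drive with a constant drive; this specific form and all names are illustrative assumptions, not the model stated in the abstract. It compares one forward-difference step with the exact solution.

```python
from math import exp

def exact_response(x0, drive, tau, dt):
    """Exact solution of tau * dx/dt = -x + drive after time dt (constant drive)."""
    return drive + (x0 - drive) * exp(-dt / tau)

def one_step_finite_difference(x0, drive, tau, dt):
    """Single forward-difference step: (x(t+dt) - x(t)) / dt is set to (-x(t) + drive) / tau."""
    return x0 + dt * (-x0 + drive) / tau

def approximation_error(x0, drive, tau, dt):
    """Absolute gap between the one-step estimate and the exact solution."""
    estimate = one_step_finite_difference(x0, drive, tau, dt)
    return abs(estimate - exact_response(x0, drive, tau, dt))

if __name__ == "__main__":
    # Hypothetical parameters: start at 0, relax toward 1 with time constant 1.
    for dt in (0.1, 1.0, 5.0):
        print(f"dt={dt}: error={approximation_error(0.0, 1.0, 1.0, dt):.4f}")
```

For time steps that are small relative to tau the two agree closely, while for steps several times larger than tau the single-step estimate overshoots the steady state, which is the kind of error a method that integrates over longer intervals aims to avoid.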
A preliminary survey of the peptoid folding landscape
Butterfoss GL, Renfrew PD, Kuhlman B, Kirshenbaum K, Bonneau R. J Am Chem Soc. 2009 Nov 25;131(46):16798-807. PubMed PMID: 19919145.
We present an analysis of the conformational preferences of N-substituted glycine peptoid oligomers. We survey the backbone conformations observed in experimentally determined peptoid structures and provide a comparison with high-level quantum mechanics calculations of short peptoid oligomers. The dominant sources of structural variation derive from: side-chain dependent cis/trans isomerization of backbone amide bonds, side chain stereochemistry, and flexibility in the psi dihedral angle. We find good agreement between the clustering of experimentally determined peptoid torsion angles and local torsional minima predicted by theory for a disarcosine model. The calculations describe a well-defined conformational map featuring distinct energy minima. The general features of the peptoid backbone conformational landscape are consistent across a range of N-alkyl glycine side chains. Alteration of side chain types, however, creates subtle but potentially significant variations in local folding propensities. We identify a limited number of low energy local conformations, which may be preferentially favored by incorporation of particular monomer units. Greater variation in backbone dihedral angles is accessible in peptoids featuring trans amide bond geometries. These results confirm that computational approaches can play a valuable role in guiding the design of complex peptoid architectures and may lead to strategies for introducing constraints that select among a limited number of low energy local conformations.
The coat morphogenetic protein SpoVID is necessary for spore encasement in Bacillus subtilis
Wang KH, Isidro AL, Domingues L, Eskandarian HA, McKenney PT, Drew K, Grabowski P, Chua MH, Barry SN, Guan M, Bonneau R, Henriques AO, Eichenberger P. Mol Microbiol. 2009 Nov;74(3):634-49. Epub 2009 Sep 22. PubMed PMID: 19775244; PubMed Central PMCID: PMC2806667.
Endospores formed by Bacillus subtilis are encased in a tough protein shell known as the coat, which consists of at least 70 different proteins. We investigated the process of spore coat morphogenesis using a library of 40 coat proteins fused to green fluorescent protein and demonstrate that two successive steps can be distinguished in coat assembly. The first step, initial localization of proteins to the spore surface, is dependent on the coat morphogenetic proteins SpoIVA and SpoVM. The second step, spore encasement, requires a third protein, SpoVID. We show that in spoVID mutant cells, most coat proteins assembled into a cap at one side of the developing spore but failed to migrate around and encase it. We also found that SpoIVA directly interacts with SpoVID. A domain analysis revealed that the N-terminus of SpoVID is required for encasement and is a structural homologue of a virion protein, whereas the C-terminus is necessary for the interaction with SpoIVA. Thus, SpoVM, SpoIVA and SpoVID are recruited to the spore surface in a concerted manner and form a tripartite machine that drives coat formation and spore encasement.
BioNetBuilder2.0: bringing systems biology to chicken and other model organisms
Konieczka JH, Drew K, Pine A, Belasco K, Davey S, Yatskievych TA, Bonneau R, Antin PB. BMC Genomics. 2009 Jul 14;10 Suppl 2:S6. PubMed PMID: 19607657; PubMed Central PMCID: PMC2709267.
BACKGROUND: Systems Biology research tools, such as Cytoscape, have greatly extended the reach of genomic research. By providing platforms to integrate data with molecular interaction networks, researchers can more rapidly begin interpretation of large data sets collected for a system of interest. BioNetBuilder is an open-source client-server Cytoscape plugin that automatically integrates molecular interactions from all major public interaction databases and serves them directly to the user's Cytoscape environment. Until recently however, chicken and other eukaryotic model systems had little interaction data available.
RESULTS: Version 2.0 of BioNetBuilder includes a redesigned synonyms resolution engine that enables transfer and integration of interactions across species; this engine translates between alternate gene names as well as between orthologs in multiple species. Additionally, BioNetBuilder is now implemented to be part of the Gaggle, thereby allowing seamless communication of interaction data to any software implementing the widely used Gaggle software. Using BioNetBuilder, we constructed a chicken interactome possessing 72,000 interactions among 8,140 genes directly in the Cytoscape environment. In this paper, we present a tutorial on how to do so and analysis of a specific use case.
CONCLUSION: BioNetBuilder 2.0 provides numerous user-friendly systems biology tools that were otherwise inaccessible to researchers in chicken genomics, as well as other model systems. We provide a detailed tutorial spanning all required steps in the analysis. BioNetBuilder 2.0, the tools for maintaining its data bases, standard operating procedures for creating local copies of its back-end data bases, as well as all of the Gaggle and Cytoscape codes required, are open-source and freely available at http://err.bio.nyu.edu/cytoscape/bionetbuilder/.
Diurnally entrained anticipatory behavior in archaea
Whitehead K, Pan M, Masumura K, Bonneau R, Baliga NS. PLoS One. 2009;4(5):e5485. Epub 2009 May 8. PubMed PMID: 19424498; PubMed Central PMCID: PMC2675056.
By sensing changes in one or few environmental factors biological systems can anticipate future changes in multiple factors over a wide range of time scales (daily to seasonal). This anticipatory behavior is important to the fitness of diverse species, and in context of the diurnal cycle it is overall typical of eukaryotes and some photoautotrophic bacteria but is yet to be observed in archaea. Here, we report the first observation of light-dark (LD)-entrained diurnal oscillatory transcription in up to 12% of all genes of a halophilic archaeon Halobacterium salinarum NRC-1. Significantly, the diurnally entrained transcription was observed under constant darkness after removal of the LD stimulus (free-running rhythms). The memory of diurnal entrainment was also associated with the synchronization of oxic and anoxic physiologies to the LD cycle. Our results suggest that under nutrient limited conditions halophilic archaea take advantage of the causal influence of sunlight (via temperature) on O(2) diffusivity in a closed hypersaline environment to streamline their physiology and operate oxically during nighttime and anoxically during daytime.
Oligo(N-aryl glycines): a new twist on structured peptoids
Shah NH, Butterfoss GL, Nguyen K, Yoo B, Bonneau R, Rabenstein DL, Kirshenbaum K. J Am Chem Soc. 2008 Dec 10;130(49):16622-32. PubMed PMID: 19049458.
We explore strategies to enhance conformational ordering of N-substituted glycine peptoid oligomers. Peptoids bearing bulky N-alkyl side chains have previously been studied as important examples of biomimetic "foldamer" compounds, as they exhibit a capacity to populate helical structures featuring repeating cis-amide bonds. Substantial cis/trans amide bond isomerization, however, gives rise to conformational heterogeneity. Here, we report the use of N-aryl side chains as a tool to enforce the presence of trans-amide bonds, thereby engendering structural stability. Aniline derivatives and bromoacetic acid are used in the facile solid-phase synthesis of a diverse family of sequence-specific N-aryl glycine oligomers. Quantum mechanics calculations yield a detailed energy profile of the folding landscape and substantiate the hypothesis that the presence of anilide groups establishes a strong energetic preference for trans-amide bonds. X-ray crystallographic analysis and solution NMR studies verify this preference. Molecular modeling indicates that the linear oligomers can adopt helical structures resembling a polyproline type II helix. High resolution structures of macrocyclic oligomers incorporating both N-alkyl and N-aryl glycine units confirm the ability to direct the presence of trans-amide bonds specifically at N-aryl positions. These results are an important step in developing strategies for the rational de novo design of new structural motifs in biomimetic oligopeptoid systems.
Modeling gene regulation and spatial organization of sequence based motifs
Jochen Supper, Claas aufm Kampe, Dierk Wanke, Kenneth W. Berendzen, Klaus Harter, Richard Bonneau, and Andreas Zell. 8th IEEE international conference on BioInformatics and BioEngineering (BIBE 2008).
Reconstructing and modeling regulatory networks is an active area of research in bioinformatics and systems biology. Hence, various computational methods have been published, often successfully modeling one aspect of regulatory control. Gene regulation, however, is a process that depends on many different components such as transcription factors (TFs), cis-regulatory motifs and their temporal and spatial coordination. Accordingly, a promising new direction for computational analysis is the incorporation of multiple data types to discover, for instance, cluster membership, the spatial organization of cis-regulatory motifs and TFs that bind to these motifs. Here, we present such a data-driven framework, comprising four stages, to infer gene regulatory networks (GRNs) by modeling: 1. motif presence in the promoter, 2. spatial motif arrangement in co-regulated genes, 3. TFs that bind the respective motifs, and 4. dynamic properties of the GRN. A novel method is presented in stage 2, where we optimize for the spatial motif properties: orientation, occurrence of multiple motifs, relative distance between two motifs and distance to the Transcription Start Site (TSS). To find optimal distance based properties in efficient time we describe a dynamic programming approach. To combine multiple motif properties that are shared by genes with similar expression profiles a Hill-climber is employed. Subsequently, in stage 3 and 4, we infer GRNs by assigning TFs to the derived motifs and model time-dependent regulatory relationships between them with the Inferelator approach. None of the stages require the user to manually adjust any parameter, and thus derived properties can be analyzed without the bias introduced by parametrization. We applied this approach to S. cerevisiae data and obtained insight into individual and general properties of the spatial assembly of regulatory elements and inferred the corresponding GRN.
A protein domain-based interactome network for C. elegans early embryogenesis
Mike Boxem, Zoltan Maliga, Niels J. Klitgord, Na Li, Irma Lemmens, Miyeko Mana, Lorenzo De Lichtervelde, Joram Mul, Diederik van de Peut, Maxime Devos, Nicolas Simonis, Anne-Lore Schlaitz, Murat Cokol, Muhammed A. Yildirim, Tong Hao, Changyu Fan, Chenwei Lin, Mike Tipsword, Kevin Drew, Matilde Galli, Kahn Rhrissorrakrai, David Drechsel, David E. Hill, Richard Bonneau, Kristin C. Gunsalus, Frederick P. Roth, Fabio Piano, Jan Tavernier, Sander van den Heuvel, Anthony A. Hyman, Marc Vidal. Cell, 2008 134(3) pp. 534 - 545.
Many protein-protein interactions are mediated through independently folding modular domains. Proteome-wide efforts to model protein-protein interaction or “interactome” networks have largely ignored this modular organization of proteins. We developed an experimental strategy to efficiently identify interaction domains and generated a domain-based interactome network for proteins involved in C. elegans early embryonic cell divisions. Minimal interacting regions were identified for over 200 proteins, providing important information on their domain organization. Furthermore, our approach increased the sensitivity of the two-hybrid system, resulting in a more complete interactome network. This interactome modeling strategy revealed new insights into C. elegans centrosome function and is applicable to other biological processes in this and other organisms.
A predictive model for transcriptional control of physiology in a free living cell
Bonneau, R*, Facciotti, MT, Reiss, DJ, Madar A, et al., Baliga, NS*. (2007) Cell. Dec 131:1354-1365
The environment significantly influences the dynamic expression and assembly of all components encoded in the genome of an organism into functional biological networks. We have constructed a model for this process in Halobacterium salinarum NRC-1 through the data-driven discovery of regulatory and functional interrelationships among ∼80% of its genes and key abiotic factors in its hypersaline environment. Using relative changes in 72 transcription factors and 9 environmental factors (EFs) this model accurately predicts dynamic transcriptional responses of all these genes in 147 newly collected experiments representing completely novel genetic backgrounds and environments—suggesting a remarkable degree of network completeness. Using this model we have constructed and tested hypotheses critical to this organism's interaction with its changing hypersaline environment. This study supports the claim that the high degree of connectivity within biological and EF networks will enable the construction of similar models for any organism from relatively modest numbers of experiments.
BioNetBuilder, an automatic network interface
Iliana Avila-Campillo*, Kevin Drew*, John Lin, David J. Reiss, Richard Bonneau. (2007) Bioinformatics. Feb 1;23(3):392-3.
BioNetBuilder is an open-source client-server Cytoscape plugin that offers a user-friendly interface to create biological networks integrated from several databases. Users can create networks for ∼1500 organisms, including common model organisms and human. Currently supported databases include: DIP, BIND, Prolinks, KEGG, HPRD, The BioGrid and GO, among others. The BioNetBuilder plugin client is available as a Java Webstart, providing a platform-independent network interface to these public databases.
A conserved surface on Toll-like receptor 5 recognizes bacterial flagellin
Andersen-Nissen E, Smith KD, Bonneau R, Strong RK, Aderem A. (2007) J Exp Med. Feb 19;204(2):393-403.
The molecular basis for Toll-like receptor (TLR) recognition of microbial ligands is unknown. We demonstrate that mouse and human TLR5 discriminate between different flagellins, and we use this difference to map the flagellin recognition site on TLR5 to 228 amino acids of the extracellular domain. Through molecular modeling of the TLR5 ectodomain, we identify two conserved surface-exposed regions. Mutagenesis studies demonstrate that naturally occurring amino acid variation in TLR5 residue 268 is responsible for human and mouse discrimination between flagellin molecules. Mutations within one conserved surface identify residues D295 and D367 as important for flagellin recognition. These studies localize flagellin recognition to a conserved surface on the modeled TLR5 structure, providing detailed analysis of the interaction of a TLR with its ligand. These findings suggest that ligand binding at the beta sheets results in TLR activation and provide a new framework for understanding TLR-agonist interactions.
Genome-wide superfamily assignments for Saccharomyces cerevisiae protein domains through integration of de novo structure prediction with the gene ontology
Malmström L, Riffle M, Strauss CEM, Chivian D, Davis TN, Bonneau R, Baker D. PLoS Biol. (2007) Apr;5(4):e76.
Saccharomyces cerevisiae is one of the best-studied model organisms, yet the three-dimensional structure and molecular function of many yeast proteins remain unknown. Yeast proteins were parsed into 14,934 domains, and those lacking sequence similarity to proteins of known structure were folded using the Rosetta de novo structure prediction method on the World Community Grid. This structural data was integrated with process, component, and function annotations from the Saccharomyces Genome Database to assign yeast protein domains to SCOP superfamilies using a simple Bayesian approach. We have predicted the structure of 3,338 putative domains and assigned SCOP superfamily annotations to 581 of them. We have also assigned structural annotations to 7,094 predicted domains based on fold recognition and homology modeling methods. The domain predictions and structural information are available in an online database at http://rd.plos.org/10.1371_journal.pbio.0050076_01
General transcription factor specified global gene regulation in archaea
Facciotti MT, Reiss DJ, Pan M, Kaur A, Vuthoori M, Bonneau R, Shannon P, Srivastava A, Donohoe SM, Hood LE, Baliga NS. Proc Natl Acad Sci U S A. (2007) Mar 13;104(11):4630-5.
Cells responding to dramatic environmental changes or undergoing a developmental switch typically change the expression of numerous genes. In bacteria, sigma factors regulate much of this process, whereas in eukaryotes, four RNA polymerases and a multiplicity of generalized transcription factors (GTFs) are required. Here, by using a systems approach, we provide experimental evidence (including protein-coimmunoprecipitation, ChIP-Chip, GTF perturbation and knockout, and measurement of transcriptional changes in these genetically perturbed strains) for how archaea likely accomplish similar large-scale transcriptional segregation and modulation of physiological functions. We are able to associate GTFs to nearly half of all putative promoters and show evidence for at least 7 of the possible 42 functional GTF pairs. This report represents a significant contribution toward closing the gap in our understanding of gene regulation by GTFs for all three domains of life and provides an example for how to use various experimental techniques to rapidly learn significant portions of a global gene regulatory network of organisms for which little has been previously known.
Somatodendritic microRNAs identified by laser capture and multiplex RT-PCR
Kye MJ, Liu T, Levy SF, Xu NL, Groves BB, Bonneau R, Lao K, Kosik KS. RNA. (2007) Aug;13(8):1224-34.
The catalog of RNAs present in dendrites represents the complete repertoire of local translation that contributes to synaptic plasticity. Most views hold that a pool of dendritic mRNAs is selectively transported to a dendritic destination. This view requires that some mRNAs in the dendrite are locally enriched relative to the cell body; however, quantitative comparisons that would support this assumption do not currently exist. These issues related to somatodendritic distribution of mRNAs also apply to the microRNAs, approximately 21 nucleotide noncoding transcripts that bind to target mRNAs and either inhibit their translation or destabilize them. We combined laser capture with multiplex real-time RT (reverse transcription) PCR to quantify microRNAs in the neuritic and somatic compartments separately. The samples were standardized by RT-PCR measurements of a set of mRNAs, including known dendritic mRNAs, in these two compartments. Most neuronal miRNAs were detected in dendrites. With a few notable exceptions, most miRNAs were distributed through the somatodendritic compartment across a nearly constant gradient. Thus for lower-abundance miRNAs, the total neuronal concentration of the miRNA can remain readily detectable in the cell body but vanish from the dendrite. A very small number of miRNAs deviate from the distribution gradient across the miRNA population as relatively enriched or depleted in the dendrite.
The Inferelator: a procedure for learning parsimonious regulatory networks from systems-biology data-sets de novo
Bonneau R, Reiss DJ, Shannon P, Hood L, Baliga NS, Thorsson V (2006). Genome Biol. 7(5):R36.
We present a method (the Inferelator) for deriving genome-wide transcriptional regulatory interactions, and apply the method to predict a large portion of the regulatory network of the archaeon Halobacterium NRC-1. The Inferelator uses regression and variable selection to identify transcriptional influences on genes based on the integration of genome annotation and expression data. The learned network successfully predicted Halobacterium's global expression under novel perturbations with predictive power similar to that seen over training data. Several specific regulatory predictions were experimentally tested and verified.
Integrated biclustering of heterogeneous genome-wide datasets
David J Reiss, Nitin S Baliga, Bonneau R. (2006) BMC Bioinformatics. 7(1):280.
Background: The learning of global genetic regulatory networks from expression data is a severely under-constrained problem that is aided by reducing the dimensionality of the search space by means of clustering genes into putatively co-regulated groups, as opposed to those that are simply co-expressed. Because genes may be co-regulated only across a subset of all observed experimental conditions, biclustering (clustering of genes and conditions) is more appropriate than standard clustering. Co-regulated genes are also often functionally (physically, spatially, genetically, and/or evolutionarily) associated, and such a priori known or pre-computed associations can provide support for appropriately grouping genes. One important association is the presence of one or more common cis-regulatory motifs. In organisms where these motifs are not known, their de novo detection, integrated into the clustering algorithm, can help to guide the process towards more biologically parsimonious solutions.
Results: We have developed an algorithm, cMonkey, that detects putative co-regulated gene groupings by integrating the biclustering of gene expression data and various functional associations with the de novo detection of sequence motifs.
Conclusion: We have applied this procedure to the archaeon Halobacterium NRC-1, as part of our efforts to decipher its regulatory network. In addition, we used cMonkey on public data for three organisms in the other two domains of life: Helicobacter pylori, Saccharomyces cerevisiae, and Escherichia coli. The biclusters detected by cMonkey both recapitulated known biology and enabled novel predictions (some for Halobacterium were subsequently confirmed in the laboratory). For example, it identified the bacteriorhodopsin regulon, assigned additional genes to this regulon with apparently unrelated function, and detected its known promoter motif. We have performed a thorough comparison of cMonkey results against other clustering methods, and find that cMonkey biclusters are more parsimonious with all available evidence for co-regulation.
The Gaggle: A system for integrating bioinformatics and computational biology software and data sources
Shannon P, Reiss DJ, Bonneau R, Baliga NS. (2006) BMC Bioinformatics. 7:176.
Background: Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as of microarrays, of metabolic networks and of predicted protein structure. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing.
Results: In this paper we describe the Gaggle - a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily-extended environment. We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein - a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools.
Conclusion: We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and Pubmed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies and can be found at http://gaggle.systemsbiology.net.
Quantitative proteomic analysis of the budding yeast cell cycle using acid-cleavable isotope-coded affinity tag reagents
Flory MR, Lee H, Bonneau R, Mallick P, Serikawa K, Goodlett D, Morris D, Aebersold R. (2006) Proteomics Dec;6(23):6146-57.
Quantitative profiling of proteins, the direct effectors of nearly all biological functions, will undoubtedly complement technologies for the measurement of mRNA. Systematic proteomic measurement of the cell cycle is now possible by using stable isotopic labeling with isotope-coded affinity tag reagents and software tools for high-throughput analysis of LC-MS/MS data. We provide here the first such study achieving quantitative, global proteomic measurement of a time-course gene expression experiment in a model eukaryote, the budding yeast Saccharomyces cerevisiae, during the cell cycle. We sampled 48% of all predicted ORFs, and provide the data, including identifications, quantitations, and statistical measures of certainty, to the community in a sortable matrix. We do not detect significant concordance in the dynamics of the system over the time-course tested between our proteomic measurements and microarray measures collected from similarly treated yeast cultures. Our proteomic dataset therefore provides a necessary and complementary measure of eukaryotic gene expression, establishes a rich database for the functional analysis of S. cerevisiae proteins, and will enable further development of technologies for global proteomic analysis of higher eukaryotes.
UniPep, a database for human N-linked glycosites: A resource for biomarker discovery
Zhang, H., Loriaux, P., Eng, J., Bonneau, R., Smith, R & Aebersold, R. Genome Biology, (2006), 7:114
There has been considerable recent interest in proteomic analyses of plasma for the purpose of discovering biomarkers. Profiling N-linked glycopeptides is a particularly promising method because the population of N-linked glycosites represents the proteomes of plasma, the cell surface, and secreted proteins at very low redundancy and provides a compelling link between the tissue and plasma proteomes. Here, we describe UniPep (http://www.unipep.org) - a database of human N-linked glycosites - as a resource for biomarker discovery.
Comprehensive de novo structure prediction in a systems-biology context for the archaea Halobacterium sp. NRC-1
Bonneau R, Baliga NS, Deutsch EW, Shannon P, Hood L. (2004) Genome Biology. 5(8):R52-68
Background: Large fractions of all fully sequenced genomes code for proteins of unknown function. Annotating these proteins of unknown function remains a critical bottleneck for systems biology and is crucial to understanding the biological relevance of genome-wide changes in mRNA and protein expression, protein-protein and protein-DNA interactions. The work reported here demonstrates that de novo structure prediction is now a viable option for providing general function information for many proteins of unknown function.
Results: We have used Rosetta de novo structure prediction to predict three-dimensional structures for 1,185 proteins and protein domains (<150 residues in length) found in Halobacterium NRC-1, a widely studied halophilic archaeon. Predicted structures were searched against the Protein Data Bank to identify fold similarities and extrapolate putative functions. They were analyzed in the context of a predicted association network composed of several sources of functional associations such as: predicted protein interactions, predicted operons, phylogenetic profile similarity and domain fusion. To illustrate this approach, we highlight three cases where our combined procedure has provided novel insights into our understanding of chemotaxis, possible prophage remnants in Halobacterium NRC-1 and archaeal transcriptional regulators.
Conclusions: Simultaneous analysis of the association network, coordinated mRNA level changes in microarray experiments and genome-wide structure prediction has allowed us to glean significant biological insights into the roles of several Halobacterium NRC-1 proteins of previously unknown function, and significantly reduce the number of proteins encoded in the genome of this haloarchaeon for which no annotation is available.
Systems level insights into the stress response to UV radiation in the halophilic archaeon Halobacterium NRC-1
Baliga NS, Bjork SJ, Bonneau R, Pan M, Iloanusi C, Kottemann MC, Hood L, DiRuggiero J. (2004) Genome Res. 14(6):1025-35. Epub 2004 May 12.
We report a remarkably high UV-radiation resistance in the extremely halophilic archaeon Halobacterium NRC-1 withstanding up to 110 J/m² with no loss of viability. Gene knockout analysis in two putative photolyase-like genes (phr1 and phr2) implicated only phr2 in photoreactivation. The UV-response was further characterized by analyzing simultaneously, along with gene function and protein interactions inferred through comparative genomics approaches, mRNA changes for all 2400 genes during light and dark repair. In addition to photoreactivation, three other putative repair mechanisms were identified including d(CTAG) methylation-directed mismatch repair, four oxidative damage repair enzymes, and two proteases for eliminating damaged proteins. Moreover, a UV-induced down-regulation of many important metabolic functions was observed during light repair and seems to be a phenomenon shared by all three domains of life. The systems analysis has facilitated the assignment of putative functions to 26 of 33 key proteins in the UV response through sequence-based methods and/or similarities of their predicted three-dimensional structures to known structures in the PDB. Finally, the systems analysis has raised, through the integration of experimentally determined and computationally inferred data, many experimentally testable hypotheses that describe the metabolic and regulatory networks of Halobacterium NRC-1.
Genome sequence of Haloarcula marismortui: a halophilic archaeon from the Dead Sea
Baliga NS, Bonneau R, Facciotti M, Pan M, Deutsch E, Glusman G, Shannon P, Chiu Y, Weng RS, Kan JR, Hung P, Date S, Marcotte E, Hood L, Ng V. (2004) Genome Res. 14(11):2221-34.
We report the complete sequence of the 4,274,642-bp genome of Haloarcula marismortui, a halophilic archaeal isolate from the Dead Sea. The genome is organized into nine circular replicons of varying G+C compositions ranging from 54% to 62%. Comparison of the genome architectures of Halobacterium sp. NRC-1 and H. marismortui suggests a common ancestor for the two organisms and a genome of significantly reduced size in the former. Both of these halophilic archaea use the same strategy of high surface negative charge of folded proteins as means to circumvent the salting-out phenomenon in a hypersaline cytoplasm. A multitiered annotation approach, including primary sequence similarities, protein family signatures, structure prediction, and a protein function association network, has assigned putative functions for at least 58% of the 4242 predicted proteins, a far larger number than is usually achieved in most newly sequenced microorganisms. Among these assigned functions were genes encoding six opsins, 19 MCP and/or HAMP domain signal transducers, and an unusually large number of environmental response regulators—nearly five times as many as those encoded in Halobacterium sp. NRC-1—suggesting H. marismortui is significantly more physiologically capable of exploiting diverse environments. In comparing the physiologies of the two halophilic archaea, in addition to the expected extensive similarity, we discovered several differences in their metabolic strategies and physiological responses such as distinct pathways for arginine breakdown in each halophile. Finally, as expected from the larger genome, H. marismortui encodes many more functions and seems to have fewer nutritional requirements for survival than does Halobacterium sp. NRC-1.
Automated prediction of CASP-5 structures using the Robetta server
Chivian D, Kim DE, Malmstrom L, Bradley P, Robertson T, Murphy P, Strauss CE, Bonneau R, Rohl CA, Baker D. (2003) Proteins. 53 Suppl 6:524-33.
Robetta is a fully automated protein structure prediction server that uses the Rosetta fragment-insertion method. It combines template-based and de novo structure prediction methods in an attempt to produce high quality models that cover every residue of a submitted sequence. The first step in the procedure is the automatic detection of the locations of domains and selection of the appropriate modeling protocol for each domain. For domains matched to a homolog with an experimentally characterized structure by PSI-BLAST or Pcons2, Robetta uses a new alignment method, called K*Sync, to align the query sequence onto the parent structure. It then models the variable regions by allowing them to explore conformational space with fragments in a fashion similar to the de novo protocol, but in the context of the template. When no structural homolog is available, domains are modeled with the Rosetta de novo protocol, which allows the full length of the domain to explore conformational space via fragment-insertion, producing a large decoy ensemble from which the final models are selected. The Robetta server produced quite reasonable predictions for targets in the recent CASP-5 and CAFASP-3 experiments, some of which were at the level of the best human predictions.
An improved protein decoy set for testing energy functions for protein structure prediction
Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D. (2003) Proteins. 53(1):76-87.
We have improved the original Rosetta centroid/backbone decoy set by increasing the number of proteins and frequency of near native models and by building on sidechains and minimizing clashes. The new set consists of 1,400 model structures for 78 different and diverse protein targets and provides a challenging set for the testing and evaluation of scoring functions. We evaluated the extent to which a variety of all-atom energy functions could identify the native and close-to-native structures in the new decoy sets. Of various implicit solvent models, we found that a solvent-accessible surface area-based solvation provided the best enrichment and discrimination of close-to-native decoys. The combination of this solvation treatment with Lennard-Jones terms and the original Rosetta energy provided better enrichment and discrimination than any of the individual terms. The results also highlight the differences in accuracy of NMR and X-ray crystal structures: a large energy gap was observed between native and non-native conformations for X-ray structures but not for NMR structures.
De Novo Prediction of Three Dimensional Structures for Major Protein Families
Bonneau, R., Dylan Chivian, Charlie EM Strauss, Carol Rohl, David Baker. (2002) JMB, 322(1):65-78.
We use the Rosetta de novo structure prediction method to produce three-dimensional structure models for all Pfam-A sequence families with average length under 150 residues and no link to any protein of known structure. To estimate the reliability of the predictions, the method was calibrated on 131 proteins of known structure. For approximately 60% of the proteins one of the top five models was correctly predicted for 50 or more residues, and for approximately 35%, the correct SCOP superfamily was identified in a structure-based search of the Protein Data Bank using one of the models. This performance is consistent with results from the fourth critical assessment of structure prediction (CASP4). Correct and incorrect predictions could be partially distinguished using a confidence function based on a combination of simulation convergence, protein length and the similarity of a given structure prediction to known protein structures. While the limited accuracy and reliability of the method precludes definitive conclusions, the Pfam models provide the only tertiary structure information available for the 12% of publicly available sequences represented by these large protein families.
Contact Order and ab initio Structure Prediction
Bonneau, R., Ingo Ruczinski, Jerry Tsai, David Baker. (2002) Protein Science, 11(8):1937-44.
Although much of the motivation for experimental studies of protein folding is to obtain insights for improving protein structure prediction, there has been relatively little connection between experimental protein folding studies and computational structural prediction work in recent years. In the present study, we show that the relationship between protein folding rates and the contact order (CO) of the native structure has implications for ab initio protein structure prediction. Rosetta ab initio folding simulations produce a dearth of high CO structures and an excess of low CO structures, as expected if the computer simulations mimic to some extent the actual folding process. Consistent with this, the majority of failures in ab initio prediction in the CASP4 (critical assessment of structure prediction) experiment involved high CO structures likely to fold much more slowly than the lower CO structures for which reasonable predictions were made. This bias against high CO structures can be partially alleviated by performing large numbers of additional simulations, selecting out the higher CO structures, and eliminating the very low CO structures; this leads to a modest improvement in prediction quality. More significant improvements in predictions for proteins with complex topologies may be possible following significant increases in high-performance computing power, which will be required for thoroughly sampling high CO conformations (high CO proteins can take six orders of magnitude longer to fold than low CO proteins). Importantly for such a strategy, simulations performed for high CO structures converge much less strongly than those for low CO structures, and hence, lack of simulation convergence can indicate the need for improved sampling of high CO conformations.
The parallels between Rosetta simulations and folding in vivo may extend to misfolding: The very low CO structures that accumulate in Rosetta simulations consist primarily of local up-down β-sheets that may resemble precursors to amyloid formation.
Rosetta in CASP4: Progress in ab initio protein structure prediction
Bonneau, R., Jerry Tsai, Ingo Ruczinski, Dylan Chivian, Carol Rohl, Charlie EM Strauss, David Baker. (2001) Proteins. 45(S5):119-126.
Rosetta ab initio protein structure predictions in CASP4 were considerably more consistent and more accurate than previous ab initio structure predictions. Large segments were correctly predicted (>50 residues superimposed within an RMSD of 6.5 Å) for 16 of the 21 domains under 300 residues for which models were submitted. Models with the global fold largely correct were produced for several targets with new folds, and for several difficult fold recognition targets, the Rosetta models were more accurate than those produced with traditional fold recognition models. These promising results suggest that Rosetta may soon be able to contribute to the interpretation of genome sequence information.
Distributions of beta sheets in proteins with application to structure prediction
Ruczinski, I., Kooperberg, C., Bonneau, R., Baker, D. (2001) Proteins 48(1), 85-97.
We recently developed the Rosetta algorithm for ab initio protein structure prediction, which generates protein structures from fragment libraries using simulated annealing. The scoring function in this algorithm favors the assembly of strands into sheets. However, it does not discriminate between different sheet motifs. After generating many structures using Rosetta, we found that the folding algorithm predominantly generates very local structures. We surveyed the distribution of β-sheet motifs with two edge strands (open sheets) in a large set of non-homologous proteins. We investigated how much of that distribution can be accounted for by rules previously published in the literature, and developed a filter and a scoring method that enables us to improve protein structure prediction for β-sheet proteins.
Functional inferences from blind ab initio protein structure predictions
Bonneau, R., Jerry Tsai, Ingo Ruczinski, David Baker. (2001) J. Struct. Biol. 134(2-3), 186-90.
Ab initio protein structure prediction methods have improved dramatically in the past several years. Because these methods require only the sequence of the protein of interest, they are potentially applicable to the open reading frames in the many organisms whose sequences have been and will be determined. Ab initio methods cannot currently produce models of high enough resolution for use in rational drug design, but there is an exciting potential for using the methods for functional annotation of protein sequences on a genomic scale. Here we illustrate how functional insights can be obtained from low-resolution predicted structures using examples from blind ab initio structure predictions from the third and fourth critical assessment of structure prediction (CASP3, CASP4) experiments.
Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation
Bonneau, R., Strauss, C. & Baker, D. (2001) Proteins 43(1), 1-11.
This study explores the use of multiple sequence alignment (MSA) information and global measures of hydrophobic core formation for improving the Rosetta ab initio protein structure prediction method. The most effective use of the MSA information is achieved by carrying out independent folding simulations for a subset of the homologous sequences in the MSA and then identifying the free energy minima common to all folded sequences via simultaneous clustering of the independent folding runs. Global measures of hydrophobic core formation, using ellipsoidal rather than spherical representations of the hydrophobic core, are found to be useful in removing non-native conformations before cluster analysis. Through this combination of MSA information and global measures of protein core formation, we significantly increase the performance of Rosetta on a challenging test set.
Ab initio protein structure prediction of CASP III targets using ROSETTA
Simons, K. T., Bonneau, R., Ruczinski, I. & Baker, D. (1999) Proteins 37(S3), 171-176.
To generate structures consistent with both the local and nonlocal interactions responsible for protein stability, 3 and 9 residue fragments of known structures with local sequences similar to the target sequence were assembled into complete tertiary structures using a Monte Carlo simulated annealing procedure (Simons et al., J Mol Biol 1997;268:209–225). The scoring function used in the simulated annealing procedure consists of sequence-dependent terms representing hydrophobic burial and specific pair interactions such as electrostatics and disulfide bonding and sequence-independent terms representing hard sphere packing, α-helix and β-strand packing, and the collection of β-strands in β-sheets (Simons et al., Proteins 1999;34:82–95). For each of 21 small, ab initio targets, 1,200 final structures were constructed, each the result of 100,000 attempted fragment substitutions. The five structures submitted for the CASP III experiment were chosen from the approximately 25 structures with the lowest scores in the broadest minima (assessed through the number of structural neighbors; Shortle et al., Proc Natl Acad Sci USA 1998;95:1158–1162). The results were encouraging: highlights of the predictions include a 99-residue segment for MarA with an rmsd of 6.4 Å to the native structure, a 95-residue (full length) prediction for the EH2 domain of EPS15 with an rmsd of 6.0 Å, a 75-residue segment of DNAB helicase with an rmsd of 4.7 Å, and a 67-residue segment of ribosomal protein L30 with an rmsd of 3.8 Å. These results suggest that ab initio methods may soon become useful for low-resolution structure prediction for proteins that lack a close homologue of known structure.
Book Chapters, Reviews, and Perspectives
De novo protein structure prediction: methods and application
Kevin Drew, Dylan Chivian and Richard Bonneau. (2007) Structural Bioinformatics 2nd Edition. Wiley-Liss (Book Chapter).
Learning biological networks: from modules to dynamics
Bonneau, Richard. (2008) Nature Chemical Biology 4:658-664.
Prokaryotic Systems Biology
Waltman P, Kacmarczyk T, Bonneau R. In Plant Systems Biology, Annual Plant Reviews. 1st edition. Edited by: Coruzzi G, Gutierrez RA. Ames, Iowa: Blackwell Publishers; 2009:67-136
Learning global models of transcriptional regulatory networks from data
Madar A, Bonneau R. Methods Mol Biol. 2009;541:181. Review. PubMed PMID:19381524.
De novo structure prediction: methods and applications
Kevin Drew, Dylan Chivian, Richard Bonneau. Structural Bioinformatics, Second Edition. (2009), Edited by: Jenny Gu and Philip E. Bourne, p. 755-781, John Wiley and Sons. ISBN-10: 0470181052
Dissecting the Quorum-Sensing Receptor LuxN
Richard Bonneau. (2008) Cell, 134(3):390-391, ISSN 0092-8674
Systems approaches applied to the study of Saccharomyces cerevisiae and Halobacterium
Weston AD, Baliga NS, Bonneau R, Hood L. Cold Spring Harb Symp Quant Biol. 2003;68:345-57. PubMed PMID: 15338636.
Ab initio methods
Chivian D, Robertson T, Bonneau R, Baker D. (2003) Methods Biochem Anal. 44:547-57.
Ab Initio Protein Structure Prediction: Progress and Prospects
Bonneau, R & Baker, D. (2001). Annu. Rev. Biophys. Biomol. Struct. 30, 173-89. Review.