Integrating Heterogeneous Data Types to Learn Condition-Dependent Modules
Biological systems are modular. Clustering and biclustering (where "biclustering" is condition- or cell-state-specific clustering) pervade systems biology analysis and are performed for a wide variety of reasons. The main practical reasons are that biological systems are inherently modular and that grouping genes into modules dramatically reduces the effective complexity of any given dataset. If clustering or biclustering is done incorrectly, however, very little of the downstream analysis is likely to be correct. The problem is complicated by three factors: many genes are active in only a subset of cell states and environmental conditions, the available data are noisy, and the underlying regulatory system is complex. Although taking advantage of modularity is key to learning biological networks from data, it remains a difficult problem and should continue to be an active area of research.
A natural first step in the analysis of functional genomics data is the learning of co-regulated clusters. Early methods for clustering genes assumed that genes cluster across all observed cell states (or genetic backgrounds) and that each gene participates in only one cluster. Newer methods allow genes to participate in multiple clusters and allow those clusters to be condition-specific. Identifying the conditions over which a bicluster is relevant (in addition to its gene membership) is especially important when genes are not expressed in a significant number of conditions and when clusters are split into multiple clusters by additional regulatory factors that are active only under subsets of conditions.
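The value of condition-specific membership can be illustrated with a small sketch on synthetic data. It assumes a Cheng-Church style mean squared residue as the coherence measure, which is a stand-in chosen for illustration and not necessarily the measure used by any particular method discussed here. The function name `mean_squared_residue` and the planted pattern are hypothetical.

```python
import numpy as np

def mean_squared_residue(expr, genes, conds):
    """Mean squared residue of the submatrix expr[genes, conds].

    Illustrative coherence measure (assumption): values near zero mean
    the selected genes share an additive profile over the selected conditions.
    """
    sub = expr[np.ix_(genes, conds)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float((residue ** 2).mean())

# Synthetic expression matrix: 6 genes x 10 conditions of pure noise.
rng = np.random.default_rng(0)
n_genes, n_conds = 6, 10
expr = rng.normal(size=(n_genes, n_conds))

# Plant a coherent pattern: genes 0-3 share a profile in conditions 0-4 only,
# each gene shifted by its own constant offset.
pattern = np.array([2.0, -1.0, 0.5, 1.5, -2.0])
offsets = np.array([[0.0], [0.3], [-0.2], [0.1]])
expr[0:4, 0:5] = pattern + offsets

genes = [0, 1, 2, 3]
msr_subset = mean_squared_residue(expr, genes, [0, 1, 2, 3, 4])
msr_all = mean_squared_residue(expr, genes, list(range(n_conds)))

print(f"residue over conditions 0-4: {msr_subset:.3g}")
print(f"residue over all conditions: {msr_all:.3g}")
```

The same four genes look tightly coherent over the planted conditions and much less so over the full condition set, which is the situation in which clustering across all conditions would miss or dilute the group.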
Our approach to Integrative Biclustering: Methods for learning co-regulated gene groups can also take advantage of the fact that many co-regulated groups are also co-functional and often share detectable binding sites for transcription factors/regulators. For example, genes whose products form a protein complex are likely to be co-regulated. These associations can be derived either experimentally or computationally, and it is common practice to use one or more of them as a post-facto measure of the biological quality of a gene cluster. cMonkey (our program) groups genes and conditions into biclusters on the basis of:
- Coherence in expression data across subsets of experimental conditions
- Co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members
- The presence of highly connected sub-graphs in metabolic, signaling, protein-protein, and comparative genomics networks
cMonkey identifies the relevant conditions in which the genes within a given bicluster are expected to be co-regulated. Importantly, in later stages of analysis we use only these conditions to learn the transcription factors (TFs) and environmental factors (EFs) that influence each bicluster. The method computes the score component associated with each data type separately, but still effectively samples biclusters that optimally satisfy multiple model components (each representing a separate data type). The method was designed as a preprocessing step for network inference and performed well in comparison to all other methods tested when the trade-off between sensitivity, specificity, and coverage (the fraction of conditions and genes included in one or more biclusters) was considered, particularly in the context of other bulk characteristics (cluster size, residual, etc.).
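The idea of scoring each data type separately and then combining the components can be sketched as follows. This is a minimal illustration under stated assumptions, not cMonkey's actual scoring functions: the per-gene scores, the weights, the standardization step, and the logistic mapping to a membership probability are all hypothetical choices made for the example.

```python
import numpy as np

def standardize(x):
    """Put one component's scores on a common scale (zero mean, unit variance)."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)

def membership_probabilities(component_scores, weights):
    """Combine independently computed per-gene scores into one probability per gene.

    Assumption for this sketch: higher score = better fit to the bicluster,
    and the weighted sum is mapped to (0, 1) with a logistic function.
    """
    combined = sum(
        weights[name] * standardize(scores)
        for name, scores in component_scores.items()
    )
    return 1.0 / (1.0 + np.exp(-combined))

# Hypothetical per-gene scores for one bicluster, one array per data type.
scores = {
    "expression": [2.1, 1.8, 0.2, -1.5, -2.0],  # coherence with the bicluster profile
    "motif": [1.5, 0.1, 1.2, -0.9, -1.1],       # match to shared upstream motifs
    "network": [1.0, 1.4, -0.3, -0.2, -1.6],    # connectivity to current members
}
weights = {"expression": 1.0, "motif": 0.5, "network": 0.5}  # illustrative weights

probs = membership_probabilities(scores, weights)

# Sample membership instead of thresholding, so that genes with
# intermediate support can still move in or out of the bicluster.
rng = np.random.default_rng(1)
sampled_members = rng.random(probs.size) < probs
```

Keeping the components separate until the final combination means that a new data type can be added by supplying one more score array and weight, without changing how the other components are computed.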