Clustering expression data
To allow direct comparison of clustering methods, we generated the same number of clusters with each method. First, we performed complete-linkage hierarchical clustering on the data using the Cluster software (1) and cut the tree at its 6th level (where the root is defined as level 1), which produced 32 clusters. We then performed K-means clustering with k=32 and self-organizing map (SOM) clustering with an 8 by 4 map on the same dataset, using the program XCluster. The resulting 32-cluster outputs were converted to adjacency matrices (see “Conversions between pairs and gene groupings” below) and then served as input to the MAGIC software.
MAGIC Bayesian Network: Experimental methods for detection of interactions
The general naming and description of experimental methods follow GRID (http://biodata.mshri.on.ca/grid/servlet/HelpHtmlPages?pageID=3). A brief explanation for each experimental method is provided below:
1. Affinity Precipitation = both coIP and non-Ig precipitation
2. Affinity Chromatography = same as affinity precipitation, but without beads
3. Biochemical Assay = direct in-vitro binding assay
4. Dosage Lethality = strain has a mutation in gene A, and an increased level of gene B (high-copy lethality) affects viability of the strain
5. Purified complex = protein complex purified from in vivo
6. Reconstructed complex = proteins purified separately, and form a complex in vitro (presumably with some test of function of the formed complex)
7. Synthetic lethality = strain has a mutation in gene A and is viable, but is inviable when gene B is also mutated
8. Synthetic rescue = gene A suppresses mutation in gene B
9. Two hybrid = standard two hybrid system
10. TF binding sites = experimentally identified transcription factor binding sites (from SCPD)
The methods are combined using the probabilities in the conditional probability tables. These probabilities were formally elicited from molecular biology experts (see Methods). For example, the conditional probability table for the “Affinity precipitation” node in the network is shown below.

Physical association    Positive result    Negative result
Yes                     0.75               0.25
No                      0.05               0.95

It indicates that a pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result in the remaining 25%. On the other hand, two proteins that do not physically interact in vivo may have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.
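To make the role of such a table concrete, the sketch below applies Bayes' rule to the affinity precipitation probabilities quoted above. It is a single-node illustration only, not the MAGIC network itself. The prior of 0.01 is a hypothetical value chosen for the example, and `posterior_association` is an illustrative name.

```python
# P(affinity precipitation result | physical association), from the text.
CPT = {
    True:  {"positive": 0.75, "negative": 0.25},   # proteins physically associate
    False: {"positive": 0.05, "negative": 0.95},   # proteins do not associate
}

def posterior_association(result, prior):
    """P(physical association | result) via Bayes' rule for a given prior."""
    num = CPT[True][result] * prior
    den = num + CPT[False][result] * (1.0 - prior)
    return num / den

prior = 0.01  # hypothetical prior, NOT a value taken from MAGIC
post_pos = posterior_association("positive", prior)
post_neg = posterior_association("negative", prior)

# How strongly a positive result favors association over no association.
likelihood_ratio = CPT[True]["positive"] / CPT[False]["positive"]

print(round(post_pos, 3), round(post_neg, 4), round(likelihood_ratio, 1))
```

Under this assumed prior, a positive result raises the probability of association to about 0.13, because a positive result is 15 times more likely for an associating pair than for a non-associating one.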
Conversions between pairs and gene groupings (“clusters”)
Gene grouping (“clusters”) → adjacency matrix
Any gene grouping can be converted to an adjacency matrix and then serve as input to the MAGIC software. The conversion treats any two genes that appear in the same cluster as a “pair”, which therefore receives a non-zero score in the corresponding cell of the adjacency matrix.
For example, consider K-means clustering output in which gene A, gene B, and gene C are in the same cluster c but gene D is in a different cluster. The input matrix would contain s > 0 in cells (A, B), (A, C), (B, C) and their complements ((B, A), etc.), but 0 in cells (A, D), (B, D), (C, D). In this case, s can be defined as the average Pearson correlation of the two genes in cluster c to the centroid of cluster c:
s(A, B) = [r(A, μ_c) + r(B, μ_c)] / 2
where r(X, μ_c) is the Pearson correlation between the expression profile of gene X and the centroid μ_c of cluster c.
Non-binary scores for each method can then be binned into high, medium, and low belief groups. The exact binning process is determined for each individual scoring method.
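The conversion from a gene grouping to an adjacency matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the expression profiles are made-up numbers, the centroid is taken as the per-condition mean of the cluster members, the score for a pair is read as the mean of the two genes' Pearson correlations to that centroid, and the diagonal is left at zero. The function names are illustrative and do not come from MAGIC.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def clusters_to_adjacency(clusters, profiles):
    """Score every same-cluster pair; pairs in different clusters stay 0."""
    genes = sorted(g for members in clusters for g in members)
    adj = {a: {b: 0.0 for b in genes} for a in genes}
    for members in clusters:
        n_cond = len(profiles[members[0]])
        # Centroid assumed to be the per-condition mean over cluster members.
        centroid = [sum(profiles[g][k] for g in members) / len(members)
                    for k in range(n_cond)]
        r = {g: pearson(profiles[g], centroid) for g in members}
        for a in members:
            for b in members:
                if a != b:
                    adj[a][b] = (r[a] + r[b]) / 2   # symmetric by construction
    return adj

# Hypothetical expression profiles over four conditions.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.5, 2.5, 2.9, 4.2],
    "C": [0.8, 2.2, 3.1, 3.9],
    "D": [4.0, 3.0, 2.0, 1.0],
}
clusters = [["A", "B", "C"], ["D"]]
adj = clusters_to_adjacency(clusters, profiles)
print(adj["A"]["B"] > 0, adj["A"]["D"])
```

Binning the resulting scores into high, medium, and low belief groups would be a separate step with method-specific cutoffs, which are not given here and are therefore not sketched.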
Adjacency matrix → gene grouping
To construct groupings of genes based on the pair-wise output of MAGIC, we define “clusters” around each gene i, that is, around each row of the adjacency matrix (i = 1, …, N, where N is the total number of genes). For example, in the adjacency matrix A_m for the output of method m, the cluster around gene i includes any gene j for which A_m(i,j) > 0 (or A_m(i,j) > cutoff). This avoids the issue of defining fully connected clusters for all gene-gene pairings, which is an NP-complete problem equivalent to the clique-finding problem (for a graph defined by the adjacency matrix A_m). While this method is simple, it directly addresses the issue of gene function prediction by creating gene groupings around each gene with unknown biological process. In the future, we plan to investigate more complex methodologies for this conversion, such as clustering the rows of gene-gene adjacency matrices.
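The row-wise grouping described above can be sketched in a few lines. The adjacency scores below are hypothetical, the matrix is stored as a dictionary of rows, and leaving gene i out of its own member list is a choice made for this example rather than something the text specifies.

```python
def clusters_around_genes(adj, cutoff=0.0):
    """For each gene i, collect every gene j with adj[i][j] > cutoff."""
    return {
        i: sorted(j for j, score in row.items() if j != i and score > cutoff)
        for i, row in adj.items()
    }

# Hypothetical symmetric adjacency matrix for four genes.
adj = {
    "A": {"A": 0.0, "B": 0.9, "C": 0.4, "D": 0.0},
    "B": {"A": 0.9, "B": 0.0, "C": 0.0, "D": 0.0},
    "C": {"A": 0.4, "B": 0.0, "C": 0.0, "D": 0.7},
    "D": {"A": 0.0, "B": 0.0, "C": 0.7, "D": 0.0},
}

groups = clusters_around_genes(adj)                 # any non-zero score
strict = clusters_around_genes(adj, cutoff=0.5)     # higher-confidence pairs only
print(groups["A"], strict["A"])  # ['B', 'C'] ['B']
```

In this toy matrix the group around A contains B and C even though B and C are not themselves paired, which shows that the groupings may overlap and need not be cliques. Each group costs one pass over a row, so no clique search is required.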
It is important to note that MAGIC’s output is not a strict subset of the microarray-based pairs. MAGIC forms functional relationship predictions based on all of its heterogeneous inputs, including pairs of genes that may not be coexpressed in the microarray data set but that have non-expression evidence of involvement in the same biological process. For example, VMA8 and VMA4 encode two subunits of the vacuolar ATPase V1, an ATP-dependent proton pump that acidifies intracellular vacuolar compartments. The genes share an experimentally identified transcription factor binding site and have affinity precipitation evidence for interaction. They are not identified as coexpressed by any of the clustering methods in this study because VMA8 and VMA4 have different patterns of expression in the heat shock and nitrogen depletion experiments in the stress response data set (3). However, MAGIC identifies VMA8 and VMA4 as functionally related based on non-expression evidence.
In general, genes that are involved in a common biological process may be coexpressed only under certain conditions or may not be coregulated at the transcription level. MAGIC creates accurate groupings of functionally related genes by incorporating diverse experimental data and thus captures various types of functional relationships beyond coexpression.
Figure W1. Comparison of methods at their optimal (highest) proportion of true positive (TP) pairs (only predictions of over 100 TP pairs were considered). MAGIC increases the proportion of TP pairs as compared to the optimized clustering methods (non-optimized outputs of clustering methods are used for MAGIC’s input).
Table W1. Example clusters (from MAGIC) of genes induced or repressed in environmental stress response. Genes are annotated by their GO terms. MAGIC identifies clusters of genes in the general categories described by Gasch et al. (3) (carbohydrate metabolism, signal transduction, protein folding and degradation, etc.), but it organizes genes into more biologically specific groupings (groupings organized around specific biological processes).
WEB SUPPLEMENT REFERENCES
1. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.
2. Breitkreutz, B. J., Stark, C. & Tyers, M. (2002) Genome Biol. 3.
3. Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000) Mol. Biol. Cell 11, 4241-4257.
4. Ozcan, S. & Johnston, M. (1999) Microbiol. Mol. Biol. Rev. 63, 554-569.