Supplemental Information on Data Analysis
The initial set of tumors was analyzed on 22K arrays. Subsequently, larger 42K arrays replaced the 22K arrays and additional tumors were analyzed on these larger arrays. Dendrograms were obtained when both sets were analyzed separately (Supplemental Figure 1). The 22K group had 26 tumors, while the 42K group contained 20 tumors. In both array types, the tumors segregated into discrete groups according to pathologic diagnosis for the synovial sarcomas and the GI stromal tumors. The remaining tumors did not cluster clearly according to pathologic diagnosis, partly due to the low number of cases available for analysis.
Combining 22K and 42K Arrays
The initial gene selection procedure with the combined set of 22K and 42K arrays yielded 7425 well-measured genes that were present on both types of arrays, representing 20% of the maximum number of genes available for analysis. Selection for signal/background ratio and manually flagged spots removed 17% of the genes, selection for 80% good data for each gene removed a further 32%, and selection for a fluorescence ratio of at least 3-fold greater than the geometric mean ratio for the specimens examined in at least 2 arrays removed another 31%. When results from the 22K and 42K arrays were combined, a new dendrogram was derived (Supplemental Figure 2). With more tumors available for analysis, additional discrete groups of tumors were noted. For example, the two schwannomas, although run on different array types, formed a tight group distinct from the remaining specimens. In addition, a group of three leiomyosarcomas (including STT516, which was run on both 22K and 42K arrays) now formed a tight cluster. All synovial sarcoma and GIST samples continued to cluster in distinct groups. However, an apparent 22K versus 42K array bias was observed that contributed to the cluster pattern. For example, in the synovial sarcoma cluster, the five specimens that were run only on 42K arrays clustered on a branch distinct from the other specimens. Four of the five tumors that had been analyzed on both array types clustered together pair-wise, but one did not: GIST-STT094-A (22K array) seemed more similar to another GIST (STT219, also run on a 22K array) than to GIST-STT094-B (42K array). Finally, the correlation was quite low in three pairs, with only leiomyosarcoma STT516 showing a high degree of correlation between 22K and 42K arrays. The mean correlation coefficient, obtained with centered data, was 0.61 for the 5 pairs.
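The last two selection steps described at the start of this section (80% good data per gene, and a 3-fold deviation from the geometric mean ratio in at least 2 arrays) can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the function name `select_genes`, the use of log2 ratios with NaN marking spots rejected by the signal/background and manual-flag step, and the two-sided reading of the 3-fold criterion are all assumptions.

```python
import numpy as np

def select_genes(log_ratios, min_present=0.8, fold=3.0, min_arrays=2):
    """Return a boolean mask of genes (rows) that pass both filters.

    log_ratios: genes x arrays matrix of log2 fluorescence ratios.
    NaN marks a spot assumed to have been rejected upstream.
    """
    present = ~np.isnan(log_ratios)
    # Filter 1: the gene has usable data in at least 80% of the arrays.
    enough_data = present.mean(axis=1) >= min_present
    # The mean of the log ratios is the log of the geometric mean ratio.
    counts = present.sum(axis=1)
    filled = np.where(present, log_ratios, 0.0)
    gene_mean = filled.sum(axis=1) / np.maximum(counts, 1)
    # Filter 2 (assumed two-sided): at least `min_arrays` arrays differ from
    # the geometric mean ratio by `fold` or more. Missing spots count as 0.
    deviation = np.where(present, np.abs(filled - gene_mean[:, None]), 0.0)
    varies = (deviation >= np.log2(fold)).sum(axis=1) >= min_arrays
    return enough_data & varies

nan = np.nan
toy = np.array([
    [2.0, -2.0, 0.0, 0.0, 0.0],   # two arrays deviate more than 3-fold
    [0.1, -0.1, 0.0, 0.2, -0.2],  # nearly flat expression
    [3.0, -3.0, nan, nan, 0.0],   # only 60% good data
    [4.0, 0.0, 0.0, 0.0, 0.0],    # only one array deviates enough
])
mask = select_genes(toy)
print(mask)
```

In this toy matrix only the first gene is retained; the others fail either the data-completeness filter or the variation filter.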
Singular Value Decomposition (SVD)
In an attempt to identify and correct the 22K versus 42K array bias, we performed SVD (1). This analysis identified a number of eigengenes and corresponding eigenarrays in the dataset (Supplemental Figure 3). Several of the most significant eigengenes correlated with specific tumor groups such as the synovial sarcomas, gastrointestinal stromal tumors, and the subset of leiomyosarcomas expressing a cluster of muscle markers, including calponin. A single eigengene correlated almost perfectly with whether a 22K or a 42K array had been used to analyze each tumor (Supplemental Figure 4). Panel (a) shows the clustergram of all selected genes used for this report, with the arrays in the order obtained in the final dataset. Panel (b) shows the level of expression of this eigengene in each of the 46 arrays, with a near complete correlation between the expression level and the type of array used: a positive value is found in almost all 42K arrays and a negative value in almost all 22K arrays. A different representation of these data is shown in panel (c), where the arrays have been put in the order dictated by the value for this eigengene. This shows that the vast majority of 42K arrays have a value above zero for this eigengene. Finally, panel (d) shows the contribution of each of the genes to the eigenarray that represents the 22K versus 42K array bias. Only those genes whose value is zero in this analysis are not affected; thus, it appears that the vast majority of genes are influenced by this array bias.
The eigengene and eigenarray correlating with the slide bias were subtracted from the data set, and this adjusted data set was then reselected using the same criteria that generated the first data set. Because the corrected expression of genes heavily influenced by array bias in some cases no longer varied more than 3-fold from the geometric mean ratio in at least two experiments, this reselection step led to the removal of 1905 genes. Thus, the initial set of 7425 genes was reduced to 5520. It should be noted, however, that almost all genes received some contribution to their expression levels from array bias. Subtraction of the array-type bias thus not only removed a specific set of genes but also improved the biological significance of the expression levels determined for all genes.
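The subtraction step can be illustrated with a minimal NumPy sketch on synthetic data. Several things here are assumptions rather than details from the analysis itself: the function name `remove_batch_component`, the rule of picking the component whose eigengene correlates most strongly with an array-type label, and a complete input matrix (missing values would have to be filled in first). Following the usual convention, rows of `vt` are treated as eigengenes (one value per array) and columns of `u` as eigenarrays (one value per gene).

```python
import numpy as np

def remove_batch_component(data, is_42k):
    """Subtract the SVD component whose eigengene best tracks array type.

    data: genes x arrays matrix with no missing values (assumption).
    is_42k: boolean vector with one entry per array.
    Returns the corrected matrix and the index of the removed component.
    """
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    label = np.where(is_42k, 1.0, -1.0)
    # Absolute correlation of every eigengene with the array-type label.
    corr = np.array([abs(np.corrcoef(row, label)[0, 1]) for row in vt])
    k = int(np.argmax(np.nan_to_num(corr)))
    # Remove the rank-1 contribution of that eigengene/eigenarray pair.
    corrected = data - s[k] * np.outer(u[:, k], vt[k])
    return corrected, k

# Synthetic example: random "biology" plus a per-gene platform offset.
rng = np.random.default_rng(0)
n_genes, n_arrays = 200, 12
is_42k = np.arange(n_arrays) >= 6
biology = rng.normal(size=(n_genes, n_arrays))
gene_shift = rng.normal(scale=3.0, size=n_genes)
biased = biology + np.outer(gene_shift, np.where(is_42k, 1.0, -1.0))

corrected, k = remove_batch_component(biased, is_42k)

def platform_gap(m):
    """Mean absolute per-gene difference between 42K and 22K array means."""
    return np.abs(m[:, is_42k].mean(axis=1) - m[:, ~is_42k].mean(axis=1)).mean()

print(platform_gap(biased), platform_gap(corrected))
```

After such a subtraction, the gene selection criteria would be reapplied to the corrected matrix, as described above.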
The adjusted data set was reclustered to yield the final tumor dendrogram (Supplemental Figure 5a) and clustergram (Figure 2). Several observations can be made. First, all five tumors that had been analyzed on both array types were now located on shared terminal branches. Second, the mean correlation between the pair members had improved from 0.61 before SVD to 0.73 after SVD. It should be noted that the data used for this comparison were centered, which emphasizes differences rather than similarities in gene expression. Third, clustering based on array type was much less conspicuous, in that the synovial sarcomas run on 42K arrays were no longer located on a branch separate from the others. Finally, after removal of the array bias, the subset of calponin-expressing leiomyosarcomas that grouped tightly together had increased from 4 to 6 specimens.
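The remark that centering emphasizes differences can be illustrated with a small sketch. It assumes that "centered" means subtracting each gene's mean across arrays before computing a Pearson correlation between two arrays; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def centered_pair_correlation(data, i, j):
    """Pearson correlation of arrays i and j after centering each gene."""
    centered = data - data.mean(axis=1, keepdims=True)
    return float(np.corrcoef(centered[:, i], centered[:, j])[0, 1])

# Synthetic data: every array shares a per-gene baseline plus its own noise.
rng = np.random.default_rng(1)
baseline = rng.normal(scale=2.0, size=(500, 1))
data = baseline + rng.normal(size=(500, 6))

raw = float(np.corrcoef(data[:, 0], data[:, 1])[0, 1])
centered = centered_pair_correlation(data, 0, 1)
print(raw, centered)
```

Because centering removes the baseline that all arrays share, the centered correlation is lower than the uncentered one, so values such as 0.61 and 0.73 are conservative measures of agreement between paired arrays.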
After singular value decomposition and subtraction of the slide bias eigengene, the major gene clusters appear more condensed and more readily interpretable than on the uncorrected clustergram (Supplemental Figure 5b).
Comparison of classification of genes by hierarchical clustering, SVD and SAM
We used three complementary methods for the analysis of the data: hierarchical clustering, SVD and SAM. Clustering and SVD gave similar classifications of the tumor samples. Clustering, SVD and SAM gave similar classifications of the genes (where the supervised SAM analysis made use of the sample classification in generating gene classifications). For comparison of the classification of genes by clustering, SVD and SAM, we combined (see Web Table 4 and Supplemental Fig. 6) the GIST gene cluster (Fig. 3c) with the SVD scores (Web Table 2b) and SAM score (Web Table 3) for the genes in this cluster.
Clustering places the gene kit almost in the center of this tight cluster of 125 genes with a correlation coefficient of 0.75. SVD ranks kit as the 41st gene, based on the high negative projection of its expression pattern onto the direction defined by eigengene B, the eigengene that distinguishes between GIST and SynSarc samples. SVD also gives kit a high anticorrelation value of 0.59 with eigengene B. Together with the high value of anticorrelation with eigengene A, the eigengene that distinguishes GISTs and SynSarcs from the rest of the tumor samples, of 0.65, kit has about 0.9 (i.e., Sqrt[0.59^2+0.65^2]) of its expression in the "GIST subspace" that is defined by these two eigengenes.
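The arithmetic behind the "about 0.9" figure can be checked directly. The sketch assumes that the two anticorrelation values act as normalized projections onto mutually orthogonal eigengenes, so that their squares add; the helper name `subspace_fraction` is hypothetical.

```python
import math

def subspace_fraction(*correlations):
    """Length of a unit-normalized pattern's projection onto the subspace
    spanned by orthogonal eigengenes, given its correlation with each."""
    return math.sqrt(sum(c * c for c in correlations))

kit_fraction = subspace_fraction(0.59, 0.65)
print(round(kit_fraction, 2))  # 0.88
```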
Note that, out of the 125 genes in the GIST cluster, 64 genes (or 51%) overlapped with the list of 225 genes that combined the top 125 genes ranked by SVD for highest negative projection onto eigengene B and the top 125 genes ranked by SVD for high anticorrelation with eigengene B. Also, 85 genes (or 68%) overlapped with the top 125 genes ranked by the SAM score.
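The overlap percentages are simple set intersections relative to the 125 cluster genes. The sketch below uses placeholder integer identifiers, not real gene IDs, chosen only to reproduce the reported counts; the function name `overlap` is hypothetical.

```python
def overlap(cluster, ranked):
    """Return (number of shared genes, fraction of the cluster shared)."""
    cluster, ranked = set(cluster), set(ranked)
    shared = cluster & ranked
    return len(shared), len(shared) / len(cluster)

# Placeholder identifiers: 125 cluster genes, and a 225-gene SVD list
# built so that 64 of its members fall inside the cluster.
gist_cluster = range(125)
svd_list = range(61, 286)

shared, fraction = overlap(gist_cluster, svd_list)
print(shared, round(100 * fraction))  # 64 51
```

The same calculation with 85 shared genes gives 85/125 = 68% for the SAM comparison.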
We also performed another type of data analysis (ANOVA) to remove the artifact induced by the use of 2 types of gene arrays. This analysis gave results highly similar to those obtained through SVD.
Identification of misplaced genes on arrays
During the analysis for this project, a limited number of misplaced genes were identified. To date, only 35 such genes have been found in the pre-SVD dataset (a total of 7,425 genes), 27 of which remained after SVD correction (a total of 5,520 genes). These genes did not influence the data analysis and are not discussed in this report, although in theory they could have contributed to the bias introduced by the use of 22K and 42K arrays. The misplaced genes are noted in Web Table 5, which will be updated if additional errors are identified. The misplaced genes have been removed from the Gene Explorer dataset on this website.
Supplemental Information References
1. Alter, O. et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97, 10101-10106 (2000).