This page describes the methodology for the yeast-worm protein comparison used in Chervitz et al. (1998), Science 282:2022-2028. This documentation is divided into the following sections:
Reciprocal WU-BLAST (version 2.0) comparisons of the predicted proteins of yeast and worm
Each predicted yeast protein was searched against the entire set of predicted worm proteins and vice versa. This was done by using each translated yeast ORF as a query sequence against the translated worm ORF dataset. The reciprocal "worm-vs-yeast" BLASTP dataset was created by using each translated worm ORF as a query sequence against the translated yeast ORF dataset. The data sets were: S. cerevisiae: 6,217 ORF peptide sequences from SGD, October 28, 1998; C. elegans: 19,099 ORF peptide sequences from Sanger Centre, October 16, 1998. These datasets are available via ftp at ftp://genome-ftp.stanford.edu/yeast/data_download/sequence_similarity/yeast_worm_datasets/.
The WU-BLASTP program was used (2.0a19MP-WashU: Altschul et al., 1990; Gish and States, 1993) with the BLOSUM62 scoring matrix, the xnu and seg filters and gapping enabled, and all other parameters at their default values. Note that filtering and gapping can affect the results of a BLAST search.
Using the WU-BLASTP results from the reciprocal searches, the protein sequences were combined into groups. Each member of a group has a BLAST similarity with at least one other member of the group. The BLAST p-value was used to limit the members of a particular group. Groups with maximum p-values of 1e-10, 1e-20, 1e-50, and 1e-100 were constructed either with or without an 80% alignment constraint, which required that at least 80% of both the query and subject sequences be aligned for an HSP (high-scoring segment pair) to be counted. Note that the p-value for a particular HSP is sensitive to the size and composition of both the query sequence and the dataset used. Thus, the p-value for a given yeast-against-worm sequence comparison is likely to be different from the reciprocal worm-against-yeast p-value for the same sequences.
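The grouping rule described above (a sequence joins a group if it has a qualifying hit to at least one member) can be sketched with a union-find structure. This is a minimal illustration, not the code used in the study: the function name build_groups, the tuple layout of hits, and the sequence IDs in the usage lines are all assumptions made for the example.

```python
from collections import defaultdict


def build_groups(hits, max_p=1e-10, min_aligned=None):
    """Single-linkage grouping of sequences from BLAST hits.

    hits: iterable of (query, subject, p_value, query_frac, subject_frac),
          where the two fractions are the aligned share of each sequence
          (hypothetical layout, chosen for this sketch).
    max_p: largest p-value that still counts as a hit.
    min_aligned: if given, both fractions must reach this value.
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, p, q_frac, s_frac in hits:
        if query == subject or p > max_p:
            continue
        if min_aligned is not None and min(q_frac, s_frac) < min_aligned:
            continue
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_q] = root_s  # merge the two groups

    groups = defaultdict(set)
    for seq in list(parent):
        groups[find(seq)].add(seq)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical hits: (query, subject, p-value, query fraction, subject fraction)
example_hits = [
    ("Y1", "W1", 1e-30, 0.90, 0.85),
    ("W1", "Y2", 1e-12, 0.50, 0.90),
    ("Y3", "W2", 1e-5, 0.90, 0.90),
]
print(build_groups(example_hits))                    # [['W1', 'Y1', 'Y2']]
print(build_groups(example_hits, min_aligned=0.8))   # [['W1', 'Y1']]
```

In this sketch, tightening max_p or setting min_aligned only removes edges, so groups can shrink or split but never grow.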
CLUSTALW (version 1.74) analysis of predicted protein groups
In the cases where worm/yeast proteins could be combined into groups, all members were clustered and displayed in rooted and unrooted similarity trees. Multiple sequence alignments were generated with ClustalW (version 1.74; Thompson et al., 1994) using the default BLOSUM substitution matrices. All trees were generated by the programs Drawgram (rooted trees) and Drawtree (unrooted trees) included in the Phylip package (version 3.5c, J. Felsenstein).
Determination of unshared groups of predicted proteins
Intra-species BLAST searches were also run to build clusters of sequence families not shared by worm and yeast. A WU-BLASTP dataset was created that consisted of all yeast protein sequences that did not contain an HSP with a p-value of 1e-10 or smaller when compared against the worm dataset. Each yeast sequence in this dataset was compared using WU-BLASTP to all other sequences in this dataset; the same analysis was done separately with ungrouped worm sequences. Lists were then compiled for all hits at three different p-values (1e-20, 1e-50, 1e-100) with or without the >80% aligned constraint for the query sequence, as above. Each sequence group with 3 or more sequences was aligned using ClustalW, and trees were created using the Phylip package as described above.
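The first step, selecting the sequences of one species that have no cross-species HSP at the 1e-10 cutoff, might look like the sketch below. The function name, the three-field hit tuples, and the IDs in the usage lines are hypothetical, and the sketch assumes that only hits in which the sequence is the query are considered.

```python
def unshared_sequences(all_ids, cross_hits, max_p=1e-10):
    """Return IDs with no cross-species HSP at p <= max_p.

    all_ids: iterable of sequence IDs for one species.
    cross_hits: iterable of (query_id, subject_id, p_value) from searching
                that species against the other species' dataset
                (hypothetical layout, chosen for this sketch).
    """
    # Every query with at least one hit at or below the cutoff is "shared".
    shared = {query for query, _subject, p in cross_hits if p <= max_p}
    # Keep the input order so the result is easy to compare with the input.
    return [seq_id for seq_id in all_ids if seq_id not in shared]


# Hypothetical yeast IDs and yeast-vs-worm hits
yeast_ids = ["YAL001C", "YAL002W", "YAL003W"]
yeast_vs_worm = [("YAL001C", "W1", 1e-40), ("YAL002W", "W2", 1e-6)]
print(unshared_sequences(yeast_ids, yeast_vs_worm))  # ['YAL002W', 'YAL003W']
```

The returned IDs would then form the dataset for the intra-species all-against-all search.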
Comparing worm and yeast protein domains
To find worm proteins that are associated with functions in multicellularity, we defined a set of 122 eukaryotic protein domains involved in regulation of gene expression and signal transduction (Bork et al., 1997). The domains were in large part from the SMART database (Schultz et al., 1998), though several domains were added. The number of these domains and the domain architectures within the respective proteins were determined in both the yeast and worm datasets.
To determine the number of each domain in the worm and yeast data sets, representative sequences of each domain were compared to the nonredundant protein database (NCBI) using the PSI-BLAST program (Altschul et al., 1997) to retrieve position-dependent weight matrices (profiles). The number of search iterations and the cutoff for inclusion of sequences in the profile were adjusted individually for each domain. The profiles were then compared separately to the yeast and worm databases. Generally, the random expectation value of 0.01 was used as the criterion for domain identification, but the search results were additionally scrutinized for the conservation of patterns typical of the respective domain, to ensure the elimination of any false positives. The profiles can be obtained by ftp at ftp://ncbi.nlm.nih.gov/pub/koonin/WORM_YEAST/
This study, like any scientific analysis, has caveats. This section describes some that you should keep in mind.
Similarity groups
False negatives: This study aims at identifying orthologs. For any given sequence similarity group, there are likely to be additional, non-orthologous sequences with weaker but significant similarity. Thus, some similarity groups could be missing related sequences that were not detected as a result of the limitations of the WU-BLAST algorithm (described above) and the selection criteria used to create the groups.
False positives: some similarity groups could contain completely unrelated sequences, primarily because of the way similarity groups were constructed, as illustrated below:
Seq A: ---XXXXXXXX-----OOOOOOOOO------------------
Seq B: ----------------OOOOOOOOO------ZZZZZZZ-----
Seq C: -------------------------------ZZZZZZZ-----
Sequence A and sequence C will go into the ABC group even though they are unrelated to each other. This "chaining" effect is a result of the single-linkage clustering method used in the present study. This is much more of a problem for groups constructed without the >80% alignment constraint but will also be a factor for all of the groups reported at this site. View the multiple sequence alignment of the cluster to verify the regions of all aligned sequences.
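One way to spot possible chaining in a group is to list the member pairs that have no direct hit to each other. The sketch below is illustrative only: the function name chained_pairs and the pair-based hit format are assumptions, and a flagged pair is just a candidate for inspection in the multiple sequence alignment, not proof that the two sequences are unrelated.

```python
from itertools import combinations


def chained_pairs(group, hits):
    """List member pairs of a group that have no direct hit to each other.

    group: iterable of sequence IDs in one similarity group.
    hits: iterable of (seq_a, seq_b) pairs that passed the group's cutoffs
          (hypothetical layout, chosen for this sketch).
    """
    # Store hits as unordered pairs so A-vs-B and B-vs-A count as the same link.
    direct = {frozenset(pair) for pair in hits}
    return [
        (a, b)
        for a, b in combinations(sorted(group), 2)
        if frozenset((a, b)) not in direct
    ]


# The A/B/C case: A hits B and B hits C, but A never hits C.
print(chained_pairs(["A", "B", "C"], [("A", "B"), ("B", "C")]))  # [('A', 'C')]
```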
Anomalous groups
Cytochrome P450: by this analysis, the cytochrome P450 group is worm only. See, for example, the p100a80 group. The BLAST results indicate that the yeast hits to these worm sequences fall just short of the 1e-10 cutoff (p-values between 1e-05 and 1e-09). Also, the yeast proteins (Erg5 and Erg11) that are orthologous to the worm cytochrome P450s are not very similar to each other, so they do not even form a yeast-only similarity group by our criteria. See BLAST results for details. [Erratum: There is actually one shared group (P<=1e-10) for cytochrome P450, but of the 32 sequences in this group, only three are yeast.]
Guanylate Cyclase: this group also appears in the worm only dataset. There are several yeast proteins at 1e-09 but they all appear to be kinases, not cyclases. In this case, the 1e-10 cutoff proved useful in screening out non-orthologous proteins. See the similarity tree and the table containing links to the BLAST reports for details.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. Journal of Molecular Biology 215:403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Bork, P., Schultz, J., and Ponting, C.P. (1997). Cytoplasmic signalling domains: the next generation. Trends Biochem. Sci. 22:296-298.
Felsenstein, J. (1996). Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418-427.
Gish, W. and States, D.J. (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-272.
Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 95:5857-5864.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.