http://genome-www.stanford.edu/yeast-worm/

SGD Worm-Yeast Protein Comparison:
General Description and Methods



This page describes the methodology for the yeast-worm protein comparison used in Chervitz et al. (1998), Science 282:2022-2028. The documentation is divided into the sections below.

Reciprocal WU-BLAST (version 2.0) comparisons of the predicted proteins of yeast and worm

Each predicted yeast protein was searched against the entire set of predicted worm proteins, and vice versa. The "yeast-vs-worm" BLASTP dataset was created by using each translated yeast ORF as a query sequence against the translated worm ORF dataset. The reciprocal "worm-vs-yeast" BLASTP dataset was created by using each translated worm ORF as a query sequence against the translated yeast ORF dataset. The datasets were: S. cerevisiae: 6,217 ORF peptide sequences from SGD, October 28, 1998; C. elegans: 19,099 ORF peptide sequences from the Sanger Centre, October 16, 1998. These datasets are available via ftp at ftp://genome-ftp.stanford.edu/yeast/data_download/sequence_similarity/yeast_worm_datasets/.

The WU-BLASTP program was used (version 2.0a19MP-WashU; Altschul et al., 1990; Gish and States, 1993) with the BLOSUM62 scoring matrix, the xnu and seg filters, gapping on, and all other parameters at their default values. Note that filtering and gapping can affect the results of a BLAST search.

Grouping predicted proteins

Using the WU-BLASTP results from the reciprocal searches, the protein sequences were combined into groups. Each member of a group has a BLAST similarity with at least one other member of the group. The BLAST p-value was used to limit the members of a particular group. Groups with maximum p-values of 10^-10, 10^-20, 10^-50, and 10^-100 were constructed either with or without an 80% alignment constraint, which required that at least 80% of both the query and subject sequences were aligned for an HSP (high-scoring segment pair) to be counted. Note that the p-value for a particular HSP is sensitive to the size and composition of both the query sequence and the dataset used. Thus, the p-value for a given yeast-against-worm sequence comparison is likely to be different from the reciprocal worm-against-yeast p-value for the same sequences.
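The grouping rule described above (each member is similar to at least one other member) corresponds to single-linkage clustering. The following is a minimal sketch of that idea in Python, not the code used in the study. The Hit record, the build_groups function, and the sequence names in the usage note are hypothetical. The sketch also assumes that a hit from either search direction is enough to link two sequences, which the text above does not state.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One HSP from a BLASTP report (hypothetical record layout)."""
    query: str
    subject: str
    p_value: float
    query_aligned_frac: float    # fraction of the query covered by the HSP
    subject_aligned_frac: float  # fraction of the subject covered by the HSP


def build_groups(hits, max_p, min_aligned=None):
    """Single-linkage grouping: link two sequences whenever an HSP passes
    the p-value cutoff and, if requested, the alignment constraint."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for h in hits:
        if h.query == h.subject or h.p_value > max_p:
            continue
        if min_aligned is not None and (
            h.query_aligned_frac < min_aligned
            or h.subject_aligned_frac < min_aligned
        ):
            continue
        root_q, root_s = find(h.query), find(h.subject)
        if root_q != root_s:
            parent[root_s] = root_q

    groups = defaultdict(set)
    for seq in list(parent):
        groups[find(seq)].add(seq)
    # Sorted output so that results are reproducible.
    return sorted(sorted(members) for members in groups.values())
```

Calling build_groups(hits, max_p=1e-10, min_aligned=0.8) on hits pooled from both search directions would give the groups for the 10^-10 cutoff with the 80% constraint. Omitting min_aligned gives the unconstrained groups.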

CLUSTALW (version 1.74) analysis of predicted protein groups

In the cases where worm/yeast proteins could be combined into groups, all members were clustered and displayed in rooted and unrooted similarity trees. Multiple sequence alignments were generated using ClustalW (version 1.74; Thompson et al., 1994) with the default BLOSUM substitution matrices. All trees were generated by the programs Drawgram (rooted trees) and Drawtree (unrooted trees) included in the Phylip package (version 3.5c; Felsenstein, 1996).

Determination of unshared groups of predicted proteins

Intra-species BLAST searches were also done to build clusters of sequence families not shared by worm and yeast. A WU-BLASTP dataset was created that consisted of all yeast protein sequences that did not contain an HSP with a p-value of 10^-10 or smaller when compared against the worm dataset. Each yeast sequence in this dataset was compared using WU-BLASTP to all other sequences in this dataset; the same analysis was done separately with the ungrouped worm sequences. Lists were then compiled for all hits at three different p-values (10^-20, 10^-50, 10^-100) with or without the > 80% aligned constraint for the query sequence, as above. Each sequence group with 3 or more sequences was aligned using ClustalW, and trees were created using the Phylip package as described above.
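The two selection steps in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study. The function names are hypothetical, and the cross-species hits are assumed to be available as (query ID, p-value) pairs, one pair per HSP.

```python
def unshared_sequences(all_ids, cross_species_hits, max_p=1e-10):
    """Return the IDs that have no cross-species HSP with p <= max_p.

    all_ids: every sequence ID of one species, in any order.
    cross_species_hits: (query_id, p_value) pairs from searching that
        species against the other one (assumed input layout).
    """
    matched = {query for query, p in cross_species_hits if p <= max_p}
    return [seq for seq in all_ids if seq not in matched]


def groups_to_align(groups, min_size=3):
    """Keep only the groups large enough to be aligned and drawn as trees."""
    return [group for group in groups if len(group) >= min_size]
```

The sequences returned by unshared_sequences would then be searched against each other, grouped at the stricter cutoffs, and filtered with groups_to_align before alignment.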

Comparing worm and yeast protein domains

To find worm proteins that are associated with functions in multicellularity, we defined a set of 122 eukaryotic protein domains involved in regulation of gene expression and signal transduction (Bork et al., 1997). The domains were in large part from the SMART database (Schultz et al., 1998), though several domains were added. The number of these domains and the domain architectures within the respective proteins were determined in both the yeast and worm datasets.

To determine the number of each domain in the worm and yeast datasets, representative sequences of each domain were compared to the nonredundant protein database (NCBI) using the PSI-BLAST program (Altschul et al., 1997) to retrieve position-dependent weight matrices (profiles). The number of search iterations and the cutoff for inclusion of sequences in the profile were adjusted individually for each domain. The profiles were then compared separately to the yeast and worm databases. Generally, a random expectation value of 0.01 was used as the criterion for domain identification, but the search results were additionally scrutinized for the conservation of patterns typical of the respective domain, to ensure the elimination of any false positives. The profiles can be obtained by ftp at ftp://ncbi.nlm.nih.gov/pub/koonin/WORM_YEAST/.
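The counting step can be sketched as below, assuming the profile search results have already been parsed into (protein ID, domain name, start position, expectation value) tuples. The function name and the input layout are hypothetical, and the domain names in the usage note are only examples. The sketch applies the 0.01 cutoff alone; it cannot reproduce the additional manual scrutiny described above.

```python
from collections import Counter, defaultdict


def summarize_domains(hits, max_e=0.01):
    """Count domain occurrences and derive per-protein domain architectures.

    hits: iterable of (protein_id, domain_name, start, e_value) tuples
        (assumed input layout).
    Returns (domain_counts, architectures), where architectures maps each
    protein to its domains in N- to C-terminal order, joined by '-'.
    """
    per_protein = defaultdict(list)
    for protein, domain, start, e_value in hits:
        if e_value <= max_e:
            per_protein[protein].append((start, domain))

    domain_counts = Counter()
    architectures = {}
    for protein, found in per_protein.items():
        # Sort by start position to get the order along the sequence.
        ordered = [domain for _, domain in sorted(found)]
        domain_counts.update(ordered)
        architectures[protein] = "-".join(ordered)
    return domain_counts, architectures
```

For example, a protein with an SH2 hit at position 10 and an SH3 hit at position 120, both below the cutoff, would be reported with the architecture "SH2-SH3". Running the function once per species gives the counts to compare between yeast and worm.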

Important points to consider

This study, like any scientific analysis, has some caveats. This section describes some that you should keep in mind.

References:

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. Journal of Molecular Biology 215:403-10.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

Bork, P., Schultz, J., and Ponting, C.P. (1997). Cytoplasmic signalling domains: the next generation. Trends Biochem. Sci. 22:296-298.

Felsenstein, J. (1996). Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266:418-427.

Gish, W. and States, D.J. (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-272.

Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 95:5857-5864.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.


Last Modified: 1999-02-21 SAC