Our global analysis revealed two subsets of mRNAs that may have a unique mode of regulation. The first subset includes the 53 genes with most of their mRNA molecules associated with a single ribosome (peak in the monosome, listed in supplemental Table 1), and the second includes the 31 genes for which the majority of the mRNAs are not associated with ribosomes (low occupancy, listed in supplemental Table 2). We attempted to identify a common sequence motif that may regulate their unique behavior. For this, we used two motif search algorithms: BioProspector, which uses a Gibbs sampling strategy to identify binding sites (1), and MEME, which uses an expectation-maximization strategy (2). We searched the sequences 140 nts upstream or downstream of either the subset of 53 genes with a high peak in the monosome or the 31 genes (excluding redundant sequences) with low occupancy. We chose 140 nts because this is the average length of the 5'UTR in yeast, but similar results were obtained with sequences 500 nts long.
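The extraction of the flanking sequences described above can be sketched as follows. This is a minimal illustration only, not the procedure actually used: the function names (flanking_regions, drop_redundant), the 0-based half-open coordinate convention, and the assumption that the windows are measured from the ends of the annotated gene are all hypothetical choices made for the example.

```python
from typing import Dict, Tuple

# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanking_regions(chrom_seq: str, start: int, end: int,
                     strand: str, flank: int = 140) -> Tuple[str, str]:
    """Return (upstream, downstream) windows of up to `flank` nts.

    Assumption: `start` and `end` are 0-based, half-open coordinates of
    the gene on the forward strand of `chrom_seq`. Windows are truncated
    at the chromosome ends.
    """
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[end:end + flank]
    if strand == "+":
        return left, right
    # On the minus strand, the region upstream of the gene lies to the
    # right on the forward strand, and both windows are reverse-complemented.
    return reverse_complement(right), reverse_complement(left)


def drop_redundant(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first gene for each distinct sequence."""
    seen = set()
    unique = {}
    for name, seq in seqs.items():
        if seq not in seen:
            seen.add(seq)
            unique[name] = seq
    return unique
```

Calling flanking_regions with flank=500 instead of the default 140 would give the longer windows mentioned above.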
We cannot draw any definitive conclusion about sequence motifs in the 5' or 3' regions of these genes, for the following reasons: 1) All motifs show a high degree of degeneracy, and in cases where there are conserved sequences these are stretches of A or T. Similar stretches also appear in our control dataset (3 sets of randomly selected UTR sequences of 50 yeast genes). 2) In most cases fewer than half of the genes in each group contain the motifs. 3) Only one motif (at the 5'-UTR of the genes that peak in the monosome) appears to have a P value lower than 1*E-02. This motif is highly degenerate and resembles other motifs with higher P values; its significance is not clear.
We are therefore currently investigating other analysis tools for identifying regulatory elements: tools that are designed to identify RNA motifs (BioProspector and MEME were designed to identify DNA motifs) and that take into account other factors such as RNA structure.
1. Liu X, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001:127-38.
2. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press; 1994:28-36.
Results of motif search by BioProspector
Motifs at the 5'-UTR (140 nts) of genes that peak in the monosome (total of 53 genes)
Motif           # of genes with this motif    P value*
MMMASACMAMWM    27                            4*E-03
MMMASAMMAMWH    26                            1.1*E-02
RAMAAAYAAMAC    17                            1.6*E-02
*The P value represents the probability of finding this motif in a random set of sequences.
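To make the meaning of such a probability concrete, the sketch below estimates empirically how often a motif is found in at least as many sequences after each sequence is shuffled. This is an illustration under stated assumptions and is not the calculation performed by BioProspector: the function names, the exact-match criterion, the composition-preserving shuffle, and the add-one correction are all choices made for the example.

```python
import random
import re
from typing import List

# Redundancy (degenerate base) codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "W": "[AT]", "S": "[CG]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]",
}


def motif_regex(motif: str) -> re.Pattern:
    """Compile a degenerate consensus motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def empirical_p_value(motif: str, sequences: List[str],
                      trials: int = 1000, seed: int = 0) -> float:
    """Estimate the chance that shuffled sequences contain the motif in at
    least as many sequences as the real set does (hypothetical helper)."""
    rng = random.Random(seed)
    pattern = motif_regex(motif)
    seqs = [s.upper() for s in sequences]
    observed = sum(1 for s in seqs if pattern.search(s))
    hits = 0
    for _ in range(trials):
        shuffled_count = 0
        for s in seqs:
            letters = list(s)
            rng.shuffle(letters)  # preserves the base composition of s
            if pattern.search("".join(letters)):
                shuffled_count += 1
        if shuffled_count >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

Because the shuffle preserves base composition, motifs made of A or T stretches are expected to occur frequently in shuffled A/T-rich sequences as well, and would therefore receive a high estimate under this scheme.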
Motifs at the 3'-UTR (140 nts) of genes that peak in the monosome (total of 53 genes)
Motif           # of genes with this motif    P value
DHYCARCAAYMR    23                            1*E000
WTGRWWCMWMVW    24                            1*E000
CMAKWTCRKSCA    22                            1*E000
Motifs at the 5'-UTR (140 nts) of genes with low occupancy (total of 31 genes)
Motif           # of genes with this motif    P value
WYMAAYWAGCCT    11                            2.8*E-01
WYMMAMWMGCMW    11                            3.5*E-01
SWCAMAHWMGCC    7                             4.9*E-01
Motifs at the 3'-UTR (140 nts) of genes with low occupancy (total of 31 genes)
Motif           # of genes with this motif    P value
CKMCDSYYKRTY    16                            2*E-02
RYTKKTCMASCT    16                            2*E-02
WRRCKCCRSTYS    18                            6*E-02
Motifs at the 5'-UTR (140 nts) of 3 sets of control genes (total of 50 genes in each set)
Motif           # of genes with this motif    P value
SKMWATYRVTRG    26                            4*e-05
HCWAMYRWYAST    21                            4*e-04
TTTTYTTTTTYT    26                            2*e-03
WTYRATGAKRTK    20                            8*e-04
TYTTWTTTYTTT    27                            1*e-03
TTTTYTTTTTTT    24                            2*e-03
MWYWAYTAWWHK    26                            1.5*e-05
WWHRYTAWYAWH    39                            2*e-04
CCTTYRYYRAYR    19                            3*e-04
Motifs at the 3'-UTR (140 nts) of 3 sets of control genes (total of 50 genes in each set)
Motif           # of genes with this motif    P value
SKMWATYRVTRG    26                            4*e-05
HCWAMYRWYAST    21                            4*e-04
TTTTYTTTTTYT    26                            2*e-03
WTYRATGAKRTK    20                            8*e-04
TYTTWTTTYTTT    27                            1*e-03
TTTTYTTTTTTT    24                            2*e-03
MWYWAYTAWWHK    26                            1.5*e-05
WWHRYTAWYAWH    39                            2*e-04
CCTTYRYYRAYR    19                            3*e-04
Redundancy code
R    A/G
Y    C/T
M    A/C
K    G/T
W    A/T
S    C/G
B    C/G/T
D    A/G/T
H    A/C/T
V    A/C/G
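The redundancy code can be applied programmatically to check how many sequences in a group contain a given degenerate motif. The sketch below is one possible way to do this, assuming exact matching of the consensus on the given strand only; the names IUPAC, motif_to_regex and count_sequences_with_motif are introduced here for illustration, and the counts obtained this way need not agree with those reported by BioProspector, which may score sites differently.

```python
import re
from typing import Iterable

# Redundancy code from the table above, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "W": "[AT]", "S": "[CG]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a degenerate motif such as RAMAAAYAAMAC into a regex."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def count_sequences_with_motif(motif: str, sequences: Iterable[str]) -> int:
    """Count the sequences that contain at least one exact match."""
    pattern = motif_to_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq.upper()))
```

For example, count_sequences_with_motif("RAMAAAYAAMAC", utr_sequences), where utr_sequences is a hypothetical list of the 140 nt sequences of one group, would give the number of genes whose sequence carries an exact instance of that consensus.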
Results of motif search by MEME
Motifs at the 5'-UTR (140 nts) of genes that peak in the monosome (total of 53 genes)
Motifs are presented in multilevel consensus, showing the most conserved letter(s) at each motif position.
Length    # of genes with this motif    E value*      Motif
21        15                            2.8*e-001     G T A T T T T T T T C T C T C T T T C T T C C C A A C T G T C A C T G T G
28        5                             2.2*e+001     C A G G A A G C C A A A G A G T G C A G C A A A T A T C C C A C T G A C T T C G A T C A A G C C A T G C G T G G C C G T T
8         4                             5.5*e+002     C C C G C G G G T C T
*The E value is the expected number of alignments with the given information content in a set of random sequences of the same size.
Motifs at the 3'-UTR (140 nts) of genes that peak in the monosome (total of 53 genes)
Length    # of genes with this motif    E value       Motif
19        7                             1.3*e+001     G G G C T G G C A G T T C A G G G G G A A T A C A C C A T C T A
11        8                             2.2*e+001     T G G T T C T C C T T T A G
21        5                             2.2*e+002     C A C A G T T C G T C A A T C C C G C C T C T A A T T G G C T A A A C T G A T G G
Motifs at the 5'-UTR (140 nts) of genes with low occupancy (total of 31 genes)
Length    # of genes with this motif    E value       Motif
11        19                            2.2*e000      T T T T T T C T T T T C C C G A
41        6                             1.6*e+002     T A G A T A A G A G T A A A A G A C A G A A A A G A G A T A A A G A A C G G G T T G A A T A C A G T G T T T A T T A A T A
41        4                             2.3*e+002     T A T C C A A A T C G C T C T G C T A C A C T T T C T C C C T C A A T A G A A C C G C C T A C G G A A A A G A C G G G C A A T A T A C C A T C T C A T C T T G T T C T
Motifs at the 3'-UTR (140 nts) of genes with low occupancy (total of 31 genes)
Length    # of genes with this motif    E value       Motif
11        13                            7.3*e-001     C T T T C T T T T T C T A G C
18        4                             4.6*e+001     C A C C T T T G A A T G C C C A T G A C C G A T G G C C G G G
14        14                            1.2*e+002     G A A A A A A A A A A A A T C T T T T A C
Motifs at the 5'-UTR (140 nts) of 3 sets of control genes (total of 50 genes in each set)
Length    # of genes with this motif    E value       Motif
11        22                            4.8*e000      T T T T T T T C T T C C C G C A T G
15        4                             2.9*e+001     G C G G T C T C C A G G C A G T A G G A T G G G T T
11        8                             8.8*e+001     A G A A T T G G C T G T C C A T G
Motifs at the 3'-UTR (140 nts) of 3 sets of control genes (total of 50 genes in each set)
Length    # of genes with this motif    E value       Motif
50        6                             1.7*e+002     G C G C A A A G C T T A A A A A G G G A A A T A G A A A T C A A G T A G C A C C A G G A T C A G A A G T G C C A G A G T C A A G C A G C G C T A A G C A A T A G C G A A T G G C T C T T T T C T G T T G T C T
50        5                             4.7*e+001     A A C G T C C C G A G A A T G A A G A G A T A T T A T A A A A C A A A G A T A T T G C A G A T C C A C G A C G A G A G T T C A C C G A C A T T G T G T C C T C C C T C A G A A T G A G A A T T A G T T T C T G G G T G C G
10        2                             3.1*e+002     G G G C C C G C C G G