Codon Usage in the terminal region of E. coli genes
STEINBERGER, Cynthia
Abstract
A comparison of codon usage in the region close to the termination codon in E. coli genes with the average E. coli codon usage shows that those codons which differ from termination codons by one base change, called pretermination codons, appear more frequently at the end of the gene. The higher frequency of pretermination codons in this region might be due to single base mutations of previously existing multiple termination codons. In addition, a comparison is made of termination codon usage, tandem termination frequency, and termination context in E. coli, H. sapiens and bacteriophage T4.
STEINBERGER, Cynthia. Codon Usage in the terminal region of E. coli genes. Life Science Advances Molecular Genetics, 1988, vol. 7, p. 141-145
Available at:
http://archive-ouverte.unige.ch/unige:127566
Mol. Gen. (Life Sci. Adv.) 1988, 7: 141-145
Codon Usage in the terminal region of E. coli genes
C. Alff-Steinberger
Department of Molecular Biology, University of Geneva, 30 quai Ernest-Ansermet, 1211 Genève 4, Switzerland.
ABSTRACT
A comparison of codon usage in the region close to the termination codon in E. coli genes with the average E. coli codon usage shows that those codons which differ from termination codons by one base change, called pretermination codons, appear more frequently at the end of the gene. The higher frequency of pretermination codons in this region might be due to single base mutations of previously existing multiple termination codons. In addition, a comparison is made of termination codon usage, tandem termination frequency, and termination context in E. coli, H. sapiens and bacteriophage T4.
INTRODUCTION
The appearance of multiple termination codons in prokaryotes has been observed and commented upon for over a decade (Watson, 1976). It is not clear to what extent the tandem stops observed are still functional, or whether they are evolutionary remainders from a period when translation termination was less efficient than at present. In this study, the codon usage of 201 Escherichia coli genes was studied, and in particular, the usage of codons in the region close to the termination codon was compared to the average usage. It is found that in a small region 5' to the termination codon, there is a significant increase in the appearance of pretermination codons (by which term are denoted the 18 codons which differ from the termination codons by only one base). If use of multiple termination codons were a common feature of the ancestor of E. coli, but no longer required by the translation system of the modern prokaryote, single base substitution mutations of some of the redundant termination codons would lead to an enrichment of pretermination codons at the end of the gene.
This seems to be a likely explanation of the effect observed in the present study. On the 3' side of the termination codon, the triplet observed is also frequently a pretermination codon, or a termination codon, or another triplet beginning with U, confirming the importance of context on translation termination efficiency (Fluck et al., 1977; Kohli and Grosjean, 1981).
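The definition of a pretermination codon can be checked by enumeration. The following is a minimal sketch, not the author's program: the function names are hypothetical, and it simply lists every sense codon that lies one base substitution away from UAA, UAG, or UGA.

```python
BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def single_base_neighbours(codon):
    """Return all codons that differ from `codon` at exactly one position."""
    neighbours = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                neighbours.add(codon[:pos] + base + codon[pos + 1:])
    return neighbours


def pretermination_codons():
    """Sense codons that are one base change away from any termination codon."""
    found = set()
    for stop in STOP_CODONS:
        found |= single_base_neighbours(stop)
    # A stop codon reachable from another stop codon is not a sense codon.
    return found - STOP_CODONS


PRETERMINATION = pretermination_codons()
print(len(PRETERMINATION))     # 18, as stated in the text
print(sorted(PRETERMINATION))
```

The set it produces contains 18 codons, in agreement with the count given above.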
MATERIALS AND METHODS
Sequence Selection
The sequence data were obtained from the Nucleotide Sequence Data Library, Release 7, of the European Molecular Biology Laboratory (EMBL), Heidelberg, West Germany. Complete E. coli coding regions for known genes were selected. In the cases where more than one sequence was given for the same gene, one of the sequences was arbitrarily chosen to be included in the tabulation. Genes were rejected if there were gaps in the sequence, if the gene product was not identified, if the gene was plasmid related, if the initiation or termination codon was missing, if the length of the coding region was not a multiple of three, or if there were inconsistencies in the data library entry. In this way, 201 complete genes were selected.
Analysis
The selection, tabulation, and statistical analyses of the data were done using the CDC Cyber computer of the Cantonal Hospital in Geneva, using programs written by the author. The EMBL Data Library usually contains, for DNA entries, the sequence of the non-coding strand, which is homologous to the mRNA transcribed from the coding strand. The codon frequencies shown in Tables 1-4 are those of the mRNA.
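The original programs are not reproduced in the paper. As an illustration only, the sequence-level rejection criteria and the codon tabulation might be sketched as below. All function names are hypothetical, the set of accepted initiation codons is an assumption (the paper does not list them), and the criteria that depend on database annotation (gene product identified, plasmid related, inconsistent entry) are not modelled.

```python
from collections import Counter

STOP_CODONS = {"UAA", "UAG", "UGA"}
# Assumption: the paper does not state which initiation codons were accepted.
START_CODONS = {"AUG", "GUG"}


def to_mrna(dna):
    """Write a non-coding-strand DNA sequence as the homologous mRNA."""
    return dna.upper().replace("T", "U")


def passes_selection(mrna):
    """Apply the sequence-level rejection criteria to one coding region."""
    if set(mrna) - set("ACGU"):
        return False  # gaps or ambiguous bases in the sequence
    if len(mrna) < 6 or len(mrna) % 3 != 0:
        return False  # length is not a multiple of three
    if mrna[:3] not in START_CODONS:
        return False  # initiation codon missing
    return mrna[-3:] in STOP_CODONS  # termination codon must be present


def count_codons(mrna):
    """Tabulate codon frequencies over the whole coding region."""
    return Counter(mrna[i:i + 3] for i in range(0, len(mrna), 3))


def codons_before_stop(mrna, n):
    """Return up to n codons immediately 5' of the termination codon,
    nearest first, excluding the initiation codon."""
    codons = [mrna[i:i + 3] for i in range(3, len(mrna) - 3, 3)]
    return codons[::-1][:n]


mrna = to_mrna("ATGGCTAAATAA")
if passes_selection(mrna):
    print(count_codons(mrna))
    print(codons_before_stop(mrna, 3))
```

Summing `count_codons` over all accepted genes would give a table of whole-gene codon frequencies, and summing the output of `codons_before_stop` would give the corresponding table for the region close to the termination codon.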
RESULTS AND DISCUSSION
This study is concerned with codon usage in the terminal region of the gene. For this purpose, the codons are numbered in the 5' direction, starting with "0" for the termination codon, "1" for the next codon upstream, etc. The codon frequencies for the 201 E. coli genes are given in Table 1, for the entire gene. In Table 2, the codon frequencies are given for the 3 codons adjacent to the termination codon in the 5' direction, which are called codons (1-3), and in Table 3, for the 10 codons adjacent to the termination codon in the 5' direction, which are called codons (1-10). In Table 4 the distribution of the triplet following the termination codon in the 3' direction is given. Of our original sample of 201 genes, only 185 sequences extended beyond the termination codon. Tables 1-3 therefore contain entries from 201 genes, and Table 4 contains entries from 185 genes. The 18 pretermination codons are marked with an *.
The remaining codons, which are neither termination nor pretermination codons, are called here non-pretermination codons.

Table 1: Codon distribution for 201 complete E. coli genes
UUU   1039 PHE   UCU    842 SER   UAU*   912 TYR   UGU*   279 CYS
UUC   1458 PHE   UCC    803 SER   UAC*  1056 TYR   UGC*   365 CYS
UUA*   549 LEU   UCA*   331 SER   UAA    153 TERM  UGA     38 TERM
UUG*   660 LEU   UCG*   474 SER   UAG     10 TERM  UGG*   692 TRP
CUU    536 LEU   CCU    381 PRO   CAU    609 HIS   CGU   2045 ARG
CUC    598 LEU   CCC    201 PRO   CAC    816 HIS   CGC   1451 ARG
CUA    146 LEU   CCA    489 PRO   CAA*   820 GLUN  CGA*   148 ARG
CUG   4136 LEU   CCG   1777 PRO   CAG*  2210 GLUN  CGG    164 ARG
AUU   1635 ILEU  ACU    776 THR   AAU    866 ASPN  AGU    376 SER
AUC   2273 ILEU  ACC   1695 THR   AAC   1907 ASPN  AGC    998 SER
AUA    162 ILEU  ACA    318 THR   AAA*  2804 LYS   AGA*   102 ARG
AUG   1854 MET   ACG    687 THR   AAG*   870 LYS   AGG     55 ARG
GUU   1690 VAL   GCU   1493 ALA   GAU   2073 ASP   GGU   2340 GLY
GUC    861 VAL   GCC   1520 ALA   GAC   1751 ASP   GGC   2151 GLY
GUA    957 VAL   GCA   1463 ALA   GAA*  3243 GLU   GGA*   333 GLY
GUG   1643 VAL   GCG   2302 ALA   GAG*  1352 GLU   GGG    496 GLY
Total number of codons = 69234

Table 2: Codon distribution for codons (1-3) on the 5' side of the termination codon for 201 E. coli genes.
UUU      8 PHE   UCU      8 SER   UAU*     5 TYR   UGU*     2 CYS
UUC     10 PHE   UCC      8 SER   UAC*     8 TYR   UGC*     2 CYS
UUA*     6 LEU   UCA*     3 SER   UAA      0 TERM  UGA      0 TERM
UUG*    13 LEU   UCG*     2 SER   UAG      0 TERM  UGG*    11 TRP
CUU      5 LEU   CCU      2 PRO   CAU      9 HIS   CGU     14 ARG
CUC      2 LEU   CCC      2 PRO   CAC      1 HIS   CGC     12 ARG
CUA      0 LEU   CCA      2 PRO   CAA*    11 GLUN  CGA*     ? ARG
CUG     28 LEU   CCG      5 PRO   CAG*    28 GLUN  CGG      3 ARG
AUU      7 ILEU  ACU      6 THR   AAU      9 ASPN  AGU      5 SER
AUC     11 ILEU  ACC      6 THR   AAC     11 ASPN  AGC      ? SER
AUA      2 ILEU  ACA      1 THR   AAA*    52 LYS   AGA*     ? ARG
AUG      9 MET   ACG      4 THR   AAG*    30 LYS   AGG      1 ARG
GUU     18 VAL   GCU     21 ALA   GAU     11 ASP   GGU     15 GLY
GUC      8 VAL   GCC      8 ALA   GAC     13 ASP   GGC     12 GLY
GUA      5 VAL   GCA      9 ALA   GAA*    37 GLU   GGA*     4 GLY
GUG     10 VAL   GCG     20 ALA   GAG*    17 GLU   GGG     14 GLY
? = count illegible in the source.

Table 3: Codon distribution for codons (1-10) on the 5' side of the termination codon for 201 E. coli genes.
UUU     30 PHE   UCU     19 SER   UAU*    20 TYR   UGU*     5 CYS
UUC     43 PHE   UCC     18 SER   UAC*    25 TYR   UGC*    11 CYS
UUA*    15 LEU   UCA*    12 SER   UAA      0 TERM  UGA      0 TERM
UUG*    27 LEU   UCG*    18 SER   UAG      0 TERM  UGG*    25 TRP
CUU     21 LEU   CCU     13 PRO   CAU     27 HIS   CGU     46 ARG
CUC     14 LEU   CCC      4 PRO   CAC      7 HIS   CGC     44 ARG
CUA      4 LEU   CCA     13 PRO   CAA*    25 GLUN  CGA*    14 ARG
CUG    106 LEU   CCG     34 PRO   CAG*    62 GLUN  CGG     13 ARG
AUU     44 ILEU  ACU     14 THR   AAU     26 ASPN  AGU     16 SER
AUC     52 ILEU  ACC     37 THR   AAC     50 ASPN  AGC     21 SER
AUA      4 ILEU  ACA      7 THR   AAA*   126 LYS   AGA*    10 ARG
AUG     44 MET   ACG     20 THR   AAG*    50 LYS   AGG      4 ARG
GUU     60 VAL   GCU     68 ALA   GAU     49 ASP   GGU     55 GLY
GUC     37 VAL   GCC     35 ALA   GAC     39 ASP   GGC     43 GLY
GUA     25 VAL   GCA     63 ALA   GAA*   100 GLU   GGA*    11 GLY
GUG     50 VAL   GCG     63 ALA   GAG*    42 GLU   GGG     30 GLY

Table 4: Codon distribution for the triplet adjacent to the termination codon on the 3' side for 185 E. coli genes
UUU     11 PHE   UCU      5 SER   UAU*     4 TYR   UGU*     2 CYS
UUC      7 PHE   UCC      3 SER   UAC*     1 TYR   UGC*     5 CYS
UUA*     5 LEU   UCA*     4 SER   UAA     17 TERM  UGA      7 TERM
UUG*     3 LEU   UCG*     8 SER   UAG      6 TERM  UGG*     3 TRP
CUU      0 LEU   CCU      1 PRO   CAU      2 HIS   CGU      2 ARG
CUC      0 LEU   CCC      3 PRO   CAC      2 HIS   CGC      4 ARG
CUA      0 LEU   CCA      2 PRO   CAA*     1 GLUN  CGA*     1 ARG
CUG      2 LEU   CCG      3 PRO   CAG*     1 GLUN  CGG      1 ARG
AUU      1 ILEU  ACU      1 THR   AAU      2 ASPN  AGU      1 SER
AUC      0 ILEU  ACC      4 THR   AAC      2 ASPN  AGC      0 SER
AUA      5 ILEU  ACA      4 THR   AAA*     3 LYS   AGA*     2 ARG
AUG      3 MET   ACG      1 THR   AAG*     4 LYS   AGG      0 ARG
GUU      2 VAL   GCU      0 ALA   GAU      1 ASP   GGU      3 GLY
GUC      1 VAL   GCC      5 ALA   GAC      0 ASP   GGC      1 GLY
GUA      3 VAL   GCA      3 ALA   GAA*     1 GLU   GGA*     7 GLY
GUG      0 VAL   GCG      2 ALA   GAG*     3 GLU   GGG      4 GLY

In Tables 5, 6, and 7, the numbers of pretermination and non-pretermination codons in the terminal and other regions are compared. It is seen that for the codons closest to the termination codon, in positions [1-3], 40% are pretermination codons, while in positions [4-10] and in positions [11 up to but not including initiation], this fraction is 25%. To see if this difference is statistically significant, a chi-square test of independence is used. In Table 5, the numbers of pretermination and non-pretermination codons found close to the termination codon (positions [1-3]) are compared to those found upstream nearby (positions [4-10]). The chi-square of 41.6 found for this table indicates that the probability that the row and column variables are independent is very small, i.e., the higher frequency of pretermination codons in positions [1-3], compared with positions [4-10], is statistically significant. In Table 6, the numbers of pretermination and non-pretermination codons found close to the termination codon, in positions [1-3], are compared with those found upstream, positions [4 up to but not including initiation]. The chi-square of 71.2 found for this table indicates that the higher frequency of pretermination codons in positions [1-3], compared to positions upstream, is statistically significant. In Table 7, the numbers of pretermination and non-pretermination codons in positions [4-10], which are toward the end of the gene, are compared to those found in the region upstream, positions [11 up to but not including initiation]. The chi-square of 0.26 found for this table indicates that the frequency of pretermination codons found in positions [4-10] is not significantly different from that found in the upstream region. The choice of position 3 for the cut-off point for the chi-square calculation was made by examining the frequency of pretermination codons in positions [1-10], which are, respectively, 76, 86, 78, 46, 55, 56, 57, 38, 67, 39, out of a total of 201 codons at each position. Thus, it is seen that in a small region of length three codons, upstream from the termination codon, there is a significant increase in the frequency of pretermination codons.

It can be seen from Table 4 that the frequencies of the triplets adjacent to the termination codon on the 3' side are not random: 30 of the 185 genes have a double stop codon; 58 of the remaining 155 triplets are pretermination codons, which is significantly larger than would be expected if the triplet frequencies followed the genome average, as in Table 1; 91 of the 185 triplets begin with U, in contrast to average codon usage, where U is the least frequent base in the first position. All three stop codons tend to be followed by U. Of the 185 genes in Table 4, 143 terminate in UAA, of which 69 have U on the 3' side, 34 terminate in UGA, of which 18 have U on the 3' side, and 8 terminate in UAG, of which 4 have U on the 3' side.

Tables 5, 6, and 7: Comparison of the numbers of pretermination and non-pretermination codons in the terminal and other regions.

Table 5
                            Positions [1-3]   Positions [4-10]
Pretermination codons                   240                358
Non-pretermination codons               363               1049
Chi-square = 41.6, probability < .0001

Table 6
                            Positions [1-3]   Positions [4 up to but not including initiation]
Pretermination codons                   240              16960
Non-pretermination codons               363              51269
Chi-square = 71.2, probability < .0001
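The chi-square values quoted for Tables 5, 6, and 7 can be recomputed from the four counts in each table. The sketch below is an illustration rather than the original analysis program: the function names are hypothetical, and it assumes the uncorrected Pearson statistic for a 2x2 table (the paper does not say whether a continuity correction was applied), which matches the reported values to the printed precision.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, without continuity correction, for the 2x2
    table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Each tuple: (pretermination, non-pretermination) in the first region,
# then (pretermination, non-pretermination) in the second region.
TABLES = {
    "Table 5": (240, 363, 358, 1049),     # positions [1-3] vs [4-10]
    "Table 6": (240, 363, 16960, 51269),  # positions [1-3] vs [4 up to initiation]
    "Table 7": (358, 1049, 16602, 50220), # positions [4-10] vs [11 up to initiation]
}

for name, counts in TABLES.items():
    chi2 = chi_square_2x2(*counts)
    print(f"{name}: chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.2g}")

# Fraction of pretermination codons in positions [1-3] and [4-10]
print(round(240 / (240 + 363), 3), round(358 / (358 + 1049), 3))
```

The two fractions printed at the end are about 0.40 and 0.25, the percentages quoted in the text.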
A possible explanation for the increased pretermination codon frequencies in the terminal region is that they are remnants of a past usage of multiple termination codons.
If the use of multiple termination codons were common in the past, perhaps due to a less efficient translation termination system, and if an increase in the efficiency of translation termination has relaxed the requirement for multiple terminations, then it might be expected that single base
mutations occurring in the group of multiple terminations would lead to an increase in the frequency of pretermination codons on both sides of the termination codon, as is
observed in the current sample. The frequent appearance of U immediately after the
termination codon, as observed previously by Kohli and Grosjean, 1981, suggests that it may have some function in translation termination.
Table 7
                            Positions [4-10]   Positions [11 up to but not including initiation]
Pretermination codons                    358              16602
Non-pretermination codons               1049              50220
Chi-square = 0.26, probability < 0.61

The terminal regions of small samples of Homo sapiens and bacteriophage T4 genes have been investigated. Significant differences are seen between H. sapiens and E. coli. The H. sapiens sample used was that whose codon usage was tabulated by Alff-Steinberger, 1987. The relative frequency of the three stop codons is different in the two species. In the 35 gene H. sapiens sample, the frequencies of UAA, UAG, and UGA were 14, 10, and 11. 33 of these sequences extended beyond the termination codon. Of these 33, only 6 have U on the 3' side of the stop codon, and there is only 1 double termination. Thus, the context of H. sapiens termination is significantly different from that of E. coli on the 3' side. The number of H. sapiens codons in this sample is not large enough to permit a significant measurement of pretermination codon frequency in the terminal region.
Sequences of 37 bacteriophage T4 genes were provided by R. Epstein and collaborators from EMBL and other sources. Analysis of the terminal region of these genes shows that T4 is in some aspects similar to its host E. coli. The frequencies of the three stop codons UAA, UAG, and UGA were 30, 2, and 5, which is consistent with 153, 10, and 38 from E. coli, in Table 1. 36 of the 37 T4 sequences extended beyond the stop codon. Of these 36, 18 have U on the 3' side of the termination codon, of which 6 are double terminations. The double termination frequency is consistent with that observed in E. coli. The T4 genes have more UA content than those of E. coli, so that an increase in U or A in any given position is to be expected. There are 18 U's and 10 A's on the 3' side of the T4 stop codons. Clearly better statistics are needed before one can conclude that the increase in U is significant. As in the case of the H. sapiens sample, the number of T4 codons is not large enough to permit a significant observation of pretermination codon frequency in the terminal region.
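The three samples can be put side by side by converting the counts quoted in the text into proportions. This is a small worked illustration with hypothetical names; every count is taken from the text (the E. coli stop codon counts from Table 1 and its 3' context from the 185 genes of Table 4), and no significance test is attempted, since the H. sapiens and T4 samples are small.

```python
SAMPLES = {
    "E. coli": {
        "stops": {"UAA": 153, "UAG": 10, "UGA": 38},
        "extended": 185,      # sequences extending beyond the stop codon
        "u_after_stop": 91,   # triplet on the 3' side begins with U
        "double_stops": 30,   # triplet on the 3' side is itself a stop
    },
    "H. sapiens": {
        "stops": {"UAA": 14, "UAG": 10, "UGA": 11},
        "extended": 33,
        "u_after_stop": 6,
        "double_stops": 1,
    },
    "T4": {
        "stops": {"UAA": 30, "UAG": 2, "UGA": 5},
        "extended": 36,
        "u_after_stop": 18,
        "double_stops": 6,
    },
}


def summarize(sample):
    """Convert the raw counts of one sample into proportions."""
    total = sum(sample["stops"].values())
    return {
        "stop_fractions": {c: n / total for c, n in sample["stops"].items()},
        "u_after_stop": sample["u_after_stop"] / sample["extended"],
        "double_stop": sample["double_stops"] / sample["extended"],
    }


for name, sample in SAMPLES.items():
    s = summarize(sample)
    stops = ", ".join(f"{c} {f:.0%}" for c, f in s["stop_fractions"].items())
    print(f"{name}: {stops}; U after stop {s['u_after_stop']:.0%}; "
          f"double stop {s['double_stop']:.0%}")
```

In these proportions E. coli and T4 are close to each other in stop codon choice, in U following the stop codon, and in double terminations, while the H. sapiens sample differs on all three.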
Comparing Tables 1, 2, and 3, and looking at each amino acid separately, it is seen that the codon usage for most amino acids in the terminal region is consistent with the usage in the earlier part of the gene, the exceptions being histidine, leucine, arginine, lysine and alanine. For histidine, the codon CAU is much preferred in the terminal region. For leucine and arginine, some of the rarer codons appear more frequently. For lysine, AAG usage is slightly increased in the extreme terminal region. For alanine, the relative usage of its 4 codons is different, with GCU rather than GCG being most frequent. The GC content of the terminal region does not differ from the average GC content of the gene, unlike the region following the initiation codon, where modified codon usage results in lower GC content (Rodier et al., 1982).
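The per-amino-acid comparison can be made concrete by expressing each codon count as a fraction of all codons for the same amino acid. The sketch below uses hypothetical names and takes the histidine, lysine, and alanine counts from Table 1 (whole genes) and Table 2 (codons (1-3)).

```python
# Counts from Table 1 (whole genes)
GENOME = {
    "HIS": {"CAU": 609, "CAC": 816},
    "LYS": {"AAA": 2804, "AAG": 870},
    "ALA": {"GCU": 1493, "GCC": 1520, "GCA": 1463, "GCG": 2302},
}
# Counts from Table 2 (codons (1-3) before the termination codon)
TERMINAL = {
    "HIS": {"CAU": 9, "CAC": 1},
    "LYS": {"AAA": 52, "AAG": 30},
    "ALA": {"GCU": 21, "GCC": 8, "GCA": 9, "GCG": 20},
}


def relative_usage(counts):
    """Fraction of each synonymous codon among all codons for one amino acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


for amino_acid in GENOME:
    whole = relative_usage(GENOME[amino_acid])
    near_stop = relative_usage(TERMINAL[amino_acid])
    for codon in whole:
        print(f"{amino_acid} {codon}: whole gene {whole[codon]:.2f}, "
              f"codons (1-3) {near_stop[codon]:.2f}")
```

With these counts CAU rises from about 43% of histidine codons over the whole gene to 90% in codons (1-3), AAG from about 24% to about 37% of lysine codons, and the most frequent alanine codon changes from GCG to GCU.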
ACKNOWLEDGMENTS
I am grateful to L. Caro, R. Epstein and H.M. Krisch for discussions of this work. I thank the Computer Department of the Cantonal Hospital of Geneva for offering computer facilities and for user-friendly expertise, and H. Rieben for advice in using the Cyber.
This work was supported in part by grant No. 3.516-0.83 to L. Caro from the Swiss National Science Foundation.
REFERENCES
Alff-Steinberger, C. 1987. J. Theor. Biol., 124:89-95.
Fluck, M.M., Salser, W. & Epstein, R. 1977. Mol. Gen. Genet., 151:137-149.
Kohli, J. & Grosjean, H. 1981. Mol. Gen. Genet., 182:430-439.
Rodier, F., Gabarro-Arpa, J., Ehrlich, R. & Reiss, C. 1982. Nucleic Acids Res., 10:391-401.
Watson, J.D. 1976. Molecular Biology of the Gene, published by W.A. Benjamin, Menlo Park, California.