Molecular Biology and Evolution 18:1611-1630 (2001)
© 2001 Society for Molecular Biology and Evolution
Review Article
The Evolution of Controlled Multitasked Gene Networks: The Role of Introns and Other Noncoding RNAs in the Development of Complex Organisms
Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia;
Department of Mechanical Systems Engineering, Kanazawa University, Kanazawa, Ishikawa, Japan
| Abstract |
|---|
Eukaryotic phenotypic diversity arises from multitasking of a core proteome of limited size. Multitasking is routine in computers, as well as in other sophisticated information systems, and requires multiple inputs and outputs to control and integrate network activity. Higher eukaryotes have a mosaic gene structure with a dual output, mRNA (protein-coding) sequences and introns, which are released from the pre-mRNA by posttranscriptional processing. Introns have been enormously successful as a class of sequences and comprise up to 95% of the primary transcripts of protein-coding genes in mammals. In addition, many other transcripts (perhaps more than half) do not encode proteins at all, but appear both to be developmentally regulated and to have genetic function. We suggest that these RNAs (eRNAs) have evolved to function as endogenous network control molecules which enable direct gene-gene communication and multitasking of eukaryotic genomes. Analysis of a range of complex genetic phenomena in which RNA is involved or implicated, including co-suppression, transgene silencing, RNA interference, imprinting, methylation, and transvection, suggests that a higher-order regulatory system based on RNA signals operates in the higher eukaryotes and involves chromatin remodeling as well as other RNA-DNA, RNA-RNA, and RNA-protein interactions. The evolution of densely connected gene networks would be expected to result in a relatively stable core proteome due to the multiple reuse of components, implying that cellular differentiation and phenotypic variation in the higher eukaryotes result primarily from variation in the control architecture.
Thus, network integration and multitasking using trans-acting RNA molecules produced in parallel with protein-coding sequences may underpin both the evolution of developmentally sophisticated multicellular organisms and the rapid expansion of phenotypic complexity into uncontested environments such as those initiated in the Cambrian radiation and those seen after major extinction events.
| Introduction |
|---|
Our understanding of the relationship between genetic information and biological function is rooted in the one gene-one protein hypothesis and in classical studies of the lac operon and the "genetic code," i.e., the triplet code specifying amino acids in protein-coding sequences. The concept of DNA as a relatively stable, heritable source of template information for proteins, transduced through a temporary and discrete RNA readout, has become an article of faith and implicitly, but very powerfully, influenced our ideas on the structure of genetic systems. Accordingly, cells and organisms are thought of as being built from a myriad of structural and catalytic proteins whose expression is generally controlled by other regulatory proteins which bind to DNA. This is a biochemical rather than an informatic perspective, which, apart from local analysis of promoter function, gives little thought to the problem of how complex programs of gene activity in the higher organisms might be integrated and regulated in four dimensions.
Genome sequencing projects have shown that the core proteome sizes of Caenorhabditis elegans and Drosophila melanogaster are similar and that each is only about twice the size of those of yeast and some bacteria, despite these animals' every appearance of possessing more than twice the complexity of micro-organisms (Chervitz et al. 1998; Rubin et al. 2000), leading to the conclusion that "the evolution of additional complex attributes is essentially an organizational one; a matter of novel interactions that derive from the temporal and spatial segregation of fairly similar components" (Rubin et al. 2000). This conclusion is reinforced by the finding that the human genome has only about 30,000 protein-coding genes (Roest Crollius et al. 2000; International Human Genome Sequencing Consortium 2001; Venter et al. 2001), 99% of which are shared with the mouse (J. C. Venter, personal communication). The increased complexity of the higher eukaryotes is related, at least in part, to the production of different protein isoforms from the same gene by alternative splicing (Croft et al. 2000). However, the other striking feature of the evolution of these organisms, largely ignored to date, is the huge increase in the amount of complex non-protein-coding RNAs, which can represent up to 97%-98% of all transcriptional output from the genome. That is, the vast majority of the expressed information in the higher eukaryotes is in RNA, not protein-coding sequences. Moreover, less than 1% of the sequence differences between individual humans occur in protein-coding sequences (Venter et al. 2001), which suggests that the majority of phenotypic variation between individuals (and species) results from differences in the control architecture, not the proteins themselves. This is in contrast to bacteria, wherein phenotypic variation is primarily achieved by varying the proteome: different strains of Escherichia coli have been found to differ by over 20% in their gene complement (Hayashi et al. 2001).
The view that phenotypic variation in complex organisms results from the differential use of a set of core components is becoming common (Gerhart and Kirschner 1997; Duboule and Wilkins 1998) and includes such concepts as "synexpression groups" (Niehrs and Pollet 1999), "syntagms" of interacting genes (Huang 1998) and gene cassettes (Jan and Jan 1993), the reuse of modules in signaling pathways (Pawson 1995; T. Hunter 2000), and enhanced rates of evolution by varying connections between modular network components (Hartwell et al. 1999; Holland 1999). These concepts have been drawn primarily from electrical circuit design and have focused principally on the modules rather than on the interconnecting control architecture of the system.
Particular network models, which range in size from single regulated circuits (Mestl, Plahte, and Omholt 1995; Almeida, Fernandes de Lima, and Infantosi 1998; Mendoza and Alvarez-Buylla 1998; Yuh, Bolouri, and Davidson 1998) to complete genomes (Thieffry et al. 1998), have demonstrated that feedback subnetworks can exhibit computational behaviors including "learned behavior" (Bhalla and Iyengar 1999), that switching networks and transcriptional control networks can exhibit dynamical stability (Wolf and Eeckman 1998; Smolen, Baxter, and Byrne 2000), and that feedback circuits can implement oscillators governing cell cycles and circadian clocks (Dano, Sorensen, and Hynne 1999; Haase and Reed 1999; Shearman et al. 2000). Stochastic noise and time delays allowing feedback, molecular memory, and oscillations can be incorporated into such circuit models (Smolen, Baxter, and Byrne 1999), generating probabilistic phenotypic variation (McAdams and Arkin 1997) and amplification of signals (Hasty et al. 2000). Some of these models have been verified by synthesizing circuits in cells to feature bistability, oscillations, and stochastic destruction of temporal correlations (Becskei and Serrano 2000; Elowitz and Leibler 2000; Gardner, Cantor, and Collins 2000).
However, such models are unsuited to the analysis of global cellular connectivity and dynamics, as they cannot be scaled up to large network sizes: linear increases in the number of interconnected circuit nodes require quadratic increases in the number of interconnecting molecules. This leads to an explosive increase in model size which severely constrains numerical simulations using current computing technologies (see, e.g., Weng, Bhalla, and Iyengar 1999). A number of alternate approaches have sought to avoid this size explosion by treating subnetworks as active integrated logic components which are interconnected into larger networks (McAdams and Shapiro 1995), or by exploiting hierarchically organized control systems to significantly decrease analytical complexity (van der Gugten and Westerhoff 1997).
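The scaling problem described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every pair of circuit nodes needs its own dedicated interconnecting molecule, so that the number of connectors is n(n-1)/2; the function name `pairwise_links` is hypothetical and does not come from any of the cited models.

```python
def pairwise_links(n_nodes: int) -> int:
    """Count distinct node pairs, n(n-1)/2, in a fully interconnected network.

    Assumption (not from the cited models): each pair of nodes requires
    one dedicated interconnecting molecule.
    """
    return n_nodes * (n_nodes - 1) // 2


if __name__ == "__main__":
    # A tenfold increase in nodes gives roughly a hundredfold increase in links.
    for n in (10, 100, 1000, 10000):
        print(f"{n:6d} nodes -> {pairwise_links(n):10d} pairwise links")
```

Real networks are unlikely to be fully connected, so this is an upper bound on the growth in connectors rather than a prediction for any particular system.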
We suggest that biology has solved this problem differently. Here we examine first whether the types of network control architecture which are used to integrate and multitask computers (and which implicitly feature in other complex information processing systems) might also be employed by molecular biological networks to generate phenotypic complexity and variability. Second, we examine the proposition and collate the evidence that introns and other nonprotein-coding RNAs may have evolved to function as network control molecules in the higher organisms, freeing such organisms from the constraints of a simple single-output protein-based genetic operating system.
| Multitasking by Programmed Network Control |
|---|
Multitasking is employed in every computer in which control codes (program instructions) of n bits set the central processing circuit to process one of 2^n different operations. Sequences of control codes (a program) can be internally stored in memory, creating a self-contained programmed response network (a computer), as originally defined by von Neumann in 1945 (von Neumann 1982).
Existing genetic circuit models, although sophisticated, ignore endogenous controlled multitasking and consider each molecular subnetwork (involving a few genes, for instance) to be sparsely interconnected and either off or on to express only one dynamical output (see, e.g., McAdams and Shapiro 1995; Bhalla and Iyengar 1999; Weng, Bhalla, and Iyengar 1999). Such models require more complex genetic programs to be built from many subnetworks encoded by exponentially large numbers of genes, a severe constraint. In contrast, multitasking via n controls (single molecules suffice) can, in theory, achieve exponential (2^n) multitasking of subnetwork dynamical outputs and allow a wide range of programmed responses to be obtained from limited numbers of subnetworks (and genetic coding information). The imbalance between the exponential benefit of controlled multitasking and the small linear cost of control molecules makes it likely that evolution will have explored this option. Indeed, this may be the only feasible way to lift the constraints on the complexity and sophistication of genetic programming.
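The imbalance between linear cost and exponential benefit can be made concrete with a short sketch. It assumes idealized binary (on/off) controls, each of which independently changes the output of a subnetwork; the helper names `control_states` and `summarize` are hypothetical and are used only for this illustration.

```python
from itertools import product


def control_states(n_controls):
    """Return every on/off combination of n binary control molecules."""
    return list(product((0, 1), repeat=n_controls))


def summarize(max_controls=10):
    """Pair each control count n (the linear cost) with 2**n (the benefit)."""
    return [(n, 2 ** n) for n in range(1, max_controls + 1)]


if __name__ == "__main__":
    # Enumerate the selectable states for a small case.
    for state in control_states(3):
        print(state)
    # Compare cost (number of controls) with benefit (number of states).
    for n, states in summarize():
        print(f"{n:2d} controls -> {states:5d} selectable outputs")
```

In a real cell, controls need not be binary or independent, so 2^n should be read as a theoretical ceiling on the number of distinguishable subnetwork outputs.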
The relevant output dynamics of complex systems can only be found by a comprehensive search of input parameter space, as nonlinear interactions within the network can have unexpected and emergent properties. During evolution, genetic networks must perform a similar search of possible subnetwork dynamics, which can also be greatly accelerated when multitasking is employed. It is far easier to modify and expand the numbers of small control sequences than to duplicate and mutate entire subnetworks of genes. Additionally, simply turning off controls may reset the program, which is perhaps important in reproduction and survival. Most importantly, a control architecture makes it possible to coordinate activity across interacting sets of genes, while variation of this architecture can generate a large spectrum of different protein expression profiles.
However, multitasking controls are only useful to the extent that they convey information about the dynamical state of the network and its surrounding environment. To do this, nodes within the network must not only receive multiple inputs, but also generate multiple outputs (endogenous controls). In cells, molecular switches which act as input controls to relay metabolic, physiological, and environmental information by modifying protein structure and protein-protein and protein-nucleic acid binding affinities have been known for many years. However, endogenous controls need to be correlated with the internal cellular state, the central component of which is gene expression status. Importantly, in a fully integrated network, endogenously sourced controls are likely to be more numerous than externally sourced controls, just as computers must internally regulate millions of internal subnetwork controls to communicate with a few peripherals in the environment.
Ideally then, in order for a molecular genetic network to be capable of complex programming and multitasking, each of the gene subnetworks within a cell must produce numerous control molecules in parallel with their primary gene products, which dynamically communicate with other subnetworks (via transcriptional, splicing, and translational controls, among others). Such a system would be expected to display an exponential increase in its ability to manage and integrate larger genetic data sets and in its functionality and phenotypic range. In addition, because modulation of system dynamics can be readily achieved by mutation of control molecules, such a system should be able to explore new expression space at fast evolutionary rates over short evolutionary timescales.
A controlled multitasked molecular network is schematically shown in figure 1 in contrast to an uncontrolled regulated network. This network architecture can be equally applied to computer networks, neural networks, and cellular networks.
| The Evolution of Controlled Multitasked Gene Networks |
|---|
The nodes of a controlled multitasked network must be capable of generating and integrating multiple inputs and outputs. Such networks are generally stable and scale-free, with some nodes having high connectivity and others having low connectivity, similar to most communication and social networks, including the Internet (Albert, Jeong, and Barabasi 2000
In cells, genetic information is transduced into RNAs and proteins, the latter of which are considered to be the major functional outputs of the genome and to comprise the structural, metabolic, and regulatory systems by which cells and organisms function. Theoretically, it is possible for proteins to provide multiple input controls, and combinatorial regulation does occur in the case of, e.g., transcription factors, but for each genetic node to be multiply connected, a multiplex output is also required from each node, at least on average. At present, however, there is no evidence that proteins are used to provide an output connection function (i.e., in parallel with a primary gene product), and no output (networking) molecules acting as controls influencing the activity of other genes (or RNAs or proteins) have been identified, although intronic RNA could fulfill this function.
Prokaryote genomes consist almost entirely of protein-coding sequences that are separated by short intergenic regions containing promoters and transcription termination signals, and are flanked by 5' and 3' untranslated signals that are involved in translational control, mRNA localization, and mRNA stabilization. Prokaryotic genes are frequently arranged in operons allowing cotranscription of genes with related functions, such as the lac operon, although rarely if ever are broader regulatory (output control) proteins expressed from the same node (operon). Most regulatory proteins are expressed from separate nodes. For the lac operon, input control comes from the lac repressor (polling cellular lactose status) and the CAP protein (polling cellular cAMP/energy status), both of which are expressed separately (Reznikoff 1992). Transcription of the lac operon (and most operons) is therefore blind: no secondary communication signals are coexpressed and other cellular nodes remain unaware of the event, except indirectly through delayed feedback loops which relay metabolic state information. The number of regulatory proteins in bacteria is a relatively low proportion of the total, and the system appears to function as a set of sparsely connected local area networks, with each regulator contacting a limited number of nodes in the genome, and with controls usually composed of metabolic or environmental chemicals that intersect with these regulators.
Prokaryotes have limited genome sizes (upper limit approximately 10 Mb) and low phenotypic complexity, suggesting that advanced integrated control technologies are not widely employed in these organisms. The absence of a prokaryotic multiplex control system also implies that a system built primarily on proteins has inherent limitations. It is not as if prokaryotes have had insufficient time to evolve such a system; they have had four billion years and countless generations in which to explore all possible protein and phenotypic space, aided by lateral transfer to spread innovation. However, while multiplex input at complex promoters is possible (see below), a multiplex output (synchronous control signals based on proteins) is far more difficult. Prokaryotic gene transcripts are not processed to produce subspecies, and the only parallel outputs that are possible are separate proteins translated from polycistronic mRNAs. To average just one (additional) protein output per node requires doubling genome size, and the multiplex output necessary for true dynamical systems integration requires huge increases in both genome size and energy cost to the cell, making such integration unmanageable and ultimately impossible by this means. The lack of a sophisticated systems control technology in prokaryotes may be the primary reason why genomic and developmental complexity has not arisen in these lineages. This also reciprocally suggests that this constraint had to be solved before more complex organisms could evolve and that the network control mechanisms operating in the higher eukaryotes may not be principally protein-based.
The complexity and phenotypic versatility of the higher eukaryotes is thought to result primarily from a larger set of proteins and combinatorial (input) control of gene expression by such proteins. This includes multiple "transcription factors" and intersecting signal transduction pathways influencing gene expression, along with alternative splicing producing different proteins from the same gene (Lopez 1998; Croft et al. 2000; Smith and Valcarcel 2000), generating subtly or substantially different functions in different tissues. While gene number is higher in complex eukaryotes, and alternative splicing greatly increases protein isoform numbers, combinatorial control of gene expression only allows multiplex input control, and alternative splicing mainly provides flexibility in endpoint specialization. Neither of these systems allows multiplex output of control molecules at the point of gene expression, a principal requirement for a multitasked network.
One possible population of cellular molecules with the attributes required to act as controls in genetic multitasking is functional introns and other noncoding RNAs. These have previously been suggested to potentiate a parallel processing system with vastly expanded regulatory options, leading to more complex genetic data sets, programs, and phenotypes, a development which was perhaps critical to the evolution of multicellular organisms (Mattick 1994). These RNAs were initially christened iRNA (intronic/informational RNA) (Mattick 1994), but because of the ambiguity in that term (mRNA is also informational) and potential confusion with the recently discovered phenomenon termed RNAi (RNA interference), we have chosen to denote non-protein-coding RNAs which are involved in network integration and control as eRNA ("efference" RNA).
| A Role for Introns and Other Noncoding RNAs in Dynamical Gene-Gene Communication, Genetic Multitasking, and Systems Integration |
|---|
Potential cellular control molecules enabling multitasking and system integration must be capable of specifically targeted interactions with other molecules, must be plentiful (as limited numbers impair connectivity and adaptation in real and evolutionary time), and must carry information about the dynamical state of cellular gene expression. These goals are most simply achieved by spatially and temporally synchronizing control molecule production with gene expression. Most protein-coding genes of higher eukaryotes are mosaics containing one or more intervening sequences (introns) of generally high sequence complexity, which are spliced out during pre-mRNA processing to generate a nuclear population of intronic RNA with concentration profiles linked to that of the exons, which are reassembled during this process to form mRNA, and subsequently translated into protein. The numbers of protein-coding genes do not increase exponentially in complex organisms and hence cannot provide large-scale cellular connectivity (which does increase exponentially). The genomes of higher organisms are nevertheless much larger than those of single-celled organisms, with the vast majority of this size increase (after accounting for variable amounts of repetitive DNA) occurring within intron sequences and other non-protein-coding RNAs. Introns therefore fulfill the essential conditions for system connectivity and multitasking: (1) multiple output in parallel with gene expression; (2) large numbers, especially if, as is likely (see below), they are further processed to smaller molecules after excision from the primary transcript; and (3) the potential for specifically targeted interactions as a function of their sequence complexity. Sequences of just 20-30 nt should generally have sufficient specificity for homology-dependent or structure-specific interactions.
Introns are therefore excellent candidates for, and perhaps the only source of, possible control molecules for multitasking eukaryotic molecular networks, which relieve the problems associated with protein-based systems, as genetic output can be multiplexed and target specificity can be efficiently encoded, assuming a receptive infrastructure.
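The claim that sequences of 20-30 nt carry sufficient specificity can be checked with a rough back-of-envelope sketch. It assumes a deliberately simple null model that is not part of the original argument: target bases are independent and uniformly distributed, only exact matches on a single strand are counted, and the target is taken to be about 3 billion bases (an assumed figure on the order of a mammalian genome). The function names are hypothetical.

```python
def expected_chance_matches(k, target_length):
    """Expected number of exact chance matches of one k-nt sequence.

    Null-model assumptions: independent, uniformly distributed bases and a
    single strand, so each of the (target_length - k + 1) windows matches
    with probability 1 / 4**k.
    """
    positions = max(target_length - k + 1, 0)
    return positions / 4 ** k


def shortest_specific_length(target_length):
    """Smallest k for which fewer than one chance match is expected."""
    k = 1
    while expected_chance_matches(k, target_length) >= 1:
        k += 1
    return k


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed target size, not a figure from the text
    print("shortest specific length:", shortest_specific_length(genome))
    for k in (12, 16, 20, 30):
        print(f"k = {k:2d}: {expected_chance_matches(k, genome):.3g} chance matches")
```

Under these assumptions the expected number of chance matches falls below one at 16 nt, so 20-30 nt leaves a comfortable margin. Real sequences are repetitive and biased in composition, and functional RNA interactions may tolerate mismatches, so the true margin will be smaller than this idealized estimate suggests.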
Before considering the evidence that introns might fulfill such a function, it is necessary to address some preconceptions. The widely held idea that introns are nonfunctional is an assumption which dates back to the initial discovery of these sequences, a great surprise at the time (Williamson 1977). Introns were then interpreted in the light of the prevailing dogma that all cellular functions were directed by proteins and that genes were simply repositories of protein-coding sequences, a view which in turn was based on bacterial molecular genetics and which took hold in the absence of any understanding of the evolutionary history and origin of nuclear introns. There is no evidence to support the assumption that nuclear introns generally (as a class of sequences in the higher organisms) are nonfunctional, although the issue is confused by the fact that most introns are less conserved in sequence than accompanying protein-coding exons and that some or many will not have evolved function, as each intron will be evolving largely independently (see below).
| Introns Populated the Eukaryotic Lineage Late in Evolution |
|---|
It is now clear that modern nuclear introns are not ancient remnants of the prebiotic assembly of genes, but the evolutionary descendants of self-catalytic group II introns, which have similar splicing mechanisms (Lambowitz and Belfort 1993
The evolution of the nucleus and the separation of transcription and translation in the eukaryotes provided the opportunity for these introns to invade protein-coding genes, as long as their removal by self-splicing was efficient enough not to interfere with mRNA and protein production. The subsequent evolution of the spliceosome (involving the devolution of internal cis-acting catalytic RNAs into trans-acting spliceosomal RNAs and recruitment of accessory proteins) (Lambowitz and Belfort 1993; Mattick 1994; Newman 1994; Stoltzfus 1999; Yean et al. 2000) made intron processing easier, which reduced the negative selection against introns and allowed them more latitude. It also relaxed their internal sequence requirements, leaving them free to evolve and to explore new evolutionary space, based on RNA molecules produced in parallel with protein-coding sequences (Mattick 1994). This would have been accelerated by the co-evolution of receptor systems for these molecules, involving RNA-protein, RNA-RNA, and RNA-DNA/chromatin interactions, in the same way as other complex systems such as the ribosome and the spliceosome have evolved (Stoltzfus 1999). It does not follow that all introns in a given lineage will have evolved function (see below), but, rather, there will have been increasing opportunity to do so. This also applies to other types of insertion elements (International Human Genome Sequencing Consortium 2001). Any useful functions that may have been acquired would have provided a positive selection pressure, which is the basis of Darwinian evolution. The general hypothesis that intron-derived RNAs may have evolved trans-acting functions is therefore eminently feasible and should be entertained.
| Intron Density Correlates with Developmental Complexity |
|---|
Intron size and sequence complexity correlate well with developmental complexity, and introns comprise the majority of pre-mRNA sequences in the higher organisms. In developmentally simple eukaryotes like Schizosaccharomyces pombe, Aspergillus, and Dictyostelium, introns compose only 10%-20% of the primary transcript and are generally small, with an average length of less than 100 bases and a density of about one to three introns per kilobase of protein-coding sequence. These data are consistent with hybridization kinetic analyses of the relative sequence complexity of "heterogeneous nuclear RNA" (hnRNA) versus mRNA in lower eukaryotes (Davidson 1976). In the higher plants, there are two to four introns per gene of an average length of about 250 bases, comprising about 50% of the primary transcript. In animals, the average intron size increases to about 500 bases in Drosophila and C. elegans and to about 3,400 bases in humans (six to seven introns per gene, on average over 95% of the primary transcript) (Palmer and Logsdon 1991
Organisms with streamlined genomes provide a good test of the stringency of intron expansion. The pufferfish Fugu rubripes has, for unknown reasons, almost no repetitive (presumably superfluous) sequences in its genome: three quarters of pufferfish introns are very small, whereas the remainder are much larger and still account for the majority of total unique sequence (Brenner et al. 1993; Elgar 1996). A similar skewed distribution is observed in the compact genome of Arabidopsis thaliana (Carels and Bernardi 2000). (A comprehensive analysis of eukaryotic intron size can be found at http://isis.bit.uq.edu.au; Croft et al. 2000). Most of the small introns are probably vestigial, whereas, in these and probably in most organisms, larger introns with high sequence complexity may be considered to indicate functionality. This is the case in at least one instance (Cecconi et al. 1996). Interestingly, the complex alga Volvox carteri appears to possess large introns (Fabry et al. 1993). Since the order Volvocales contains a number of closely related members ranging from unicellular (Chlamydomonas) through a series of colonial forms to fully differentiated forms, this may represent a useful test case for the appearance of larger introns through an evolutionary developmental series.
| Introns Have the Signatures of Information |
|---|
Introns (and other nonprotein-coding RNAs; see below) of higher organisms exhibit all the signatures of information. They generally have high sequence complexity (Tautz, Trick, and Dover 1986
Nonetheless, some introns are highly conserved over substantial evolutionary distances (Garbe and Pardue 1986; Rieger and Franke 1988; Tournier-Lasserve et al. 1989; Lloyd and Gunning 1993; Starke and Gogarten 1993; Koop and Hood 1994; Bagavathi and Malathi 1996; John, Smith, and Kaiser 1996; Rosby, Alestrom, and Berg 1997; Kazmierczak et al. 1998; Aruscavage and Bass 2000; Sun et al. 2000; Yatsuki et al. 2000), often in large blocks (Jareborg, Birney, and Durbin 1999), indicating that they are under functional constraint. While such conservation might, in some cases, be ascribed to the presence of important cis-acting elements such as transcription enhancers, this cannot account for the extensive homology between, for example, the 94 kb of introns in the mouse and human T-cell receptor genes showing a high level of nucleotide sequence conservation (over 70%) similar to that of the accompanying exons (Koop and Hood 1994). Intron sequences can also evolve faster than silent positions in accompanying exons (Kloek et al. 1996) (sites that are presumably relatively neutral), indicating positive selection, further evidence of intron functionality. Moreover, if introns are acting as networking controls, the important issue is not the conservation of the sequence per se (i.e., to produce functional domains in the protein sense), but the conservation of interactions.
| Noncoding RNAs Comprise the Majority of Genomic Output |
|---|
Many (if not most; see below) transcripts from the genomes of higher organisms do not encode proteins at all (Eddy 1999
There is a general point to be made here. Gene regulation often involves "enhancers" located either downstream of the transcription start site (in introns) or in the upstream promoter region spanning many kilobases of DNA, as well as more distant regions sometimes referred to as "locus control regions." In some, and perhaps many, cases, these intergenic regions are themselves transcribed (into noncoding RNAs), suggesting that their effects might be related to trans-acting, not cis-acting sequences, which can confound interpretation of mutational analysis of "promoter regions." Such transcripts have been discovered by careful analysis of transcriptional activity around a locus of interest, such as β-globin (Ashe et al. 1997), but this has not often been done.
Also, as noted by Eddy (1999), most systematic genomic screens are biased against discovering noncoding RNAs. PolyA+ RNA preparations used in cDNA library construction are depleted of noncoding RNAs, and bioinformatic searches are limited by a lack of knowledge about the signatures and variety of these molecules, although comparative genomics to identify regions of sequence homology outside of protein-coding regions may provide clues. Many such homology regions are evident from comparison of the human and mouse genomes (V. R. Bonazzi, personal communication), and many noncoding regions in C. elegans encode sequences predicted to form thermodynamically stable complex secondary structures (F. Clark, personal communication). Genetic screens are probably also compromised by the likelihood that noncoding RNAs are less likely to be badly affected by point mutations. In Drosophila, most known mutants in "regulatory" regions that have strong phenotypic signatures are either large insertions or deletions. Furthermore, while there are very few known cases of point mutations in introns (or promoter regions) giving observable phenotypes in mammals, there is an unexpectedly high frequency of insertional mutants which give observable phenotypes in transgenic mice, most of which occur in introns or other noncoding regions (Meisler 1992). These observations not only strengthen the case that introns may have functions, but also suggest that these functions may only be readily revealed via extensive sequence disruption or deletion. This may also explain some of the unexpected results of gene knockouts in transgenic mice and confound interpretation of such experiments, which have not traditionally been designed to take account of introns and other non-protein-coding RNAs produced from the locus under study.
Additional evidence for large numbers of noncoding RNA transcripts in animal nuclei comes from earlier studies (preceding the discovery of introns) on the sequence complexity of heterogeneous nuclear RNA (hnRNA) (Davidson 1976), from which it was speculated that this RNA may represent regulatory transcripts (Britten and Davidson 1969; Davidson, Klein, and Britten 1977). Hybridization renaturation kinetics shows that hnRNA complexity in echinoderms is approximately 10–30 times that of mRNA (Davidson 1976), whereas we now know that protein-coding primary transcripts in vertebrates are about 5–20 times as complex as the resulting mRNAs (Deutsch and Long 1999). While these comparisons are crude, they suggest that a significant proportion of nuclear transcripts, perhaps more than half, do not contain protein-coding sequences. The nucleus of the higher organisms appears to be a very complex ball of RNA-DNA-protein interactions. On reflection, it may not be surprising that if an RNA communication network based on introns expressed in parallel with protein-coding sequences has evolved, a higher-order control network involving eRNA alone may also have evolved. In addition, even though a substantial proportion of the human genome is composed of repeated elements, many of these are transcribed, and it is well within the bounds of possibility that they have also evolved to form part of the regulatory architecture (International Human Genome Sequencing Consortium 2001).
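The crude comparison of hybridization complexities made above can be laid out as simple arithmetic. The sketch below is only an illustration, not a calculation taken from the cited studies: it reads the quoted figures as the ranges 10 to 30 (hnRNA relative to mRNA) and 5 to 20 (protein-coding primary transcripts relative to mRNA), assumes that complexities can be subtracted as if they were additive, and ignores the fact that the two ranges come from different taxa. The function names are hypothetical.

```python
# Illustrative arithmetic only. Assumptions (not from the cited studies):
# complexities are treated as additive, and the echinoderm and vertebrate
# ranges are compared directly.

HNRNA_FOLD_RANGE = (10.0, 30.0)    # hnRNA complexity relative to mRNA
PREMRNA_FOLD_RANGE = (5.0, 20.0)   # protein-coding primary transcripts relative to mRNA


def noncoding_fraction(hnrna_fold, premrna_fold):
    """Fraction of hnRNA complexity not accounted for by protein-coding
    primary transcripts, clamped at zero."""
    if hnrna_fold <= 0:
        raise ValueError("hnrna_fold must be positive")
    return max(0.0, 1.0 - premrna_fold / hnrna_fold)


def fraction_bounds(hnrna_range=HNRNA_FOLD_RANGE, premrna_range=PREMRNA_FOLD_RANGE):
    """Lowest and highest noncoding fraction compatible with the two ranges."""
    lowest = noncoding_fraction(hnrna_range[0], premrna_range[1])
    highest = noncoding_fraction(hnrna_range[1], premrna_range[0])
    return lowest, highest


if __name__ == "__main__":
    low, high = fraction_bounds()
    print(f"noncoding fraction lies between {low:.2f} and {high:.2f}")
    # Pairing the lower ends of both ranges (10-fold and 5-fold):
    print(f"lower ends paired: {noncoding_fraction(10.0, 5.0):.2f}")
```

Under these assumptions the figures bound the noncoding fraction only loosely, from none up to about five sixths, so "perhaps more than half" sits within the range the numbers allow rather than being fixed by them.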
| Examples of Gene Regulation and Communication by Introns and Noncoding RNAs |
|---|
Clear-cut instances of RNA-mediated gene regulation are beginning to appear. The activities of the heterochronic genes lin-14 and lin-41, which regulate developmental timing in C. elegans, are controlled by lin-4 and let-7 gene products encoding small RNAs that are antisense to repeated elements in the 3' untranslated region of target mRNAs and which appear to inhibit translation by RNA-RNA interactions (Lee, Feinbaum, and Ambros 1993
It has also been discovered that most small nucleolar RNAs (a group of more than 100 stable RNA molecules concentrated in the nucleolus) derive from processed introns of other genes, which encode various ribosomal proteins (e.g., L1, L5, L7, L13, S1, S3, S7, S8, S13, and others), ribosome-associated proteins (e.g., eIF-4A), nucleolar proteins (e.g., nucleolin, laminin, and fibrillarin), the heat shock protein hsc70, and the cell-cycle regulated protein RCC1, among others (Prislei et al. 1993; Sollner-Webb 1993; Bachellerie et al. 1995; Maxwell and Fournier 1995; Nicoloso et al. 1996; Rebane et al. 1998; Filipowicz et al. 1999; Filipowicz 2000). These provide both clear examples of dual gene outputs and potential instances of coordinate regulation (efference control) involving intronic sequences, in this case of ribosomal biogenesis and cell growth (Pelczar and Filipowicz 1998; Smith and Steitz 1998; Tanaka et al. 2000). More tellingly, some genes have so evolved that their protein-coding capacity no longer exists, and their primary product is intron-derived small nucleolar RNAs (Tycowski, Shu, and Steitz 1996; Bortolin and Kiss 1998; Pelczar and Filipowicz 1998; Smith and Steitz 1998; Tanaka et al. 2000), leading to the statement that "genes generating functionally important RNAs exclusively from their intron regions are probably more frequent than has been anticipated" (Bortolin and Kiss 1998).
These nucleolar RNAs are processed from introns by specific mechanisms involving endonucleolytic cleavage by double-stranded RNase III-related enzymes (Caffarelli et al. 1997; Chanfreau et al. 1998; Qu et al. 1999) (also implicated in RNAi, transgene silencing, and methylation [Mette et al. 2000]; see below), exonucleolytic trimming (Cecconi, Mariottini, and Amaldi 1995; Kiss and Filipowicz 1995; Mitchell et al. 1997; Allmang et al. 1999a, 1999b; van Hoof and Parker 1999; van Hoof, Lennertz, and Parker 2000), and possibly even adjacent RNA sequences that have self-cleaving activity (Prislei et al. 1995). This processing occurs in large RNA processing complexes called exosomes, which are also involved in processing rRNA and small nuclear RNAs, contain at least 10 3'→5' exonucleases, helicases, and RNA-binding proteins, and are found in both the nucleus and the cytoplasm (Mitchell et al. 1997; Allmang et al. 1999a, 1999b; van Hoof and Parker 1999; Mitchell and Tollervey 2000).
| Intron Processing, Stability, Decay, and Memory |
|---|
Intronic RNAs are more stable than is generally thought. The widespread view that excised introns are simply discarded and degraded derives from the unjustified a priori assumption that introns are nonfunctional. For example, it has been stated that "the half-life of excised introns is of the order of a few seconds" (Sharp et al. 1987
After splicing, introns (initially in lariat form) are debranched (Ruskin and Green 1985), a process that is itself subject to regulation (Ruskin and Green 1985; Qian et al. 1992), but subsequent events are unknown. We suggest that excised introns are likely to be processed by specific pathways, similar to those used to produce small nucleolar RNAs, which generate multiple smaller species that can function independently as trans-acting signals in the network (Mattick 1994), affecting the metabolism of other RNAs and the modulation of chromatin structure, among other things (see below). The intronic origins of small nucleolar RNAs became known only because of their relative stability and abundance, and they may be just the tip of a large iceberg: a much more complex milieu (tens of thousands) of other intron-derived and other non-protein-coding RNAs, which may be more transient and in much lower individual abundance and which have not yet been detected except by their genetic signatures, as in the case of lin-4 and let-7.
There are other documented examples of small trans-acting functional RNAs processed from longer transcripts (Sit, Vaewhongs, and Lommel 1998; Cavaille et al. 2000). There are also large numbers of ribonucleases and other RNA-related proteins in plants and animals (see below), most of whose functions and substrates are not well defined. Such processing may also involve other splicing pathways (Santoro et al. 1994; Kreivi and Lamond 1996) and guide RNAs, possibly derived from introns or other non-protein-coding RNAs. These have been described as "riboregulators" (in relation to antisense RNAs) (Delihas 1995) and the "ribotype" (in relation to alternatively spliced mRNAs) (Herbert and Rich 1999a) and may be considered part of the "soft wiring" of the cell (Mattick 1994; Herbert and Rich 1999b).
The decay characteristics of eRNAs are likely to be important to their function. Bot
