Molecular Biology and Evolution 18:1611-1630 (2001)
© 2001 Society for Molecular Biology and Evolution


Review Article

The Evolution of Controlled Multitasked Gene Networks: The Role of Introns and Other Noncoding RNAs in the Development of Complex Organisms

John S. Mattick and Michael J. Gagen

Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia;
Department of Mechanical Systems Engineering, Kanazawa University, Kanazawa, Ishikawa, Japan


    Abstract
 Abstract
 Introduction
 Multitasking by Programmed...
 The Evolution of Controlled...
 A Role for Introns...
 Introns Populated the Eukaryotic...
 Intron Density Correlates with...
 Introns Have the Signatures...
 Noncoding RNAs Comprise the...
 Examples of Gene Regulation...
 Intron Processing, Stability,...
 Unexplained Genetic Phenomena...
 Transvection and Chromatin...
 Genetic Programming and the...
 Acknowledgements
 References
 
Eukaryotic phenotypic diversity arises from multitasking of a core proteome of limited size. Multitasking is routine in computers, as well as in other sophisticated information systems, and requires multiple inputs and outputs to control and integrate network activity. Higher eukaryotes have a mosaic gene structure with a dual output, mRNA (protein-coding) sequences and introns, which are released from the pre-mRNA by posttranscriptional processing. Introns have been enormously successful as a class of sequences and comprise up to 95% of the primary transcripts of protein-coding genes in mammals. In addition, many other transcripts (perhaps more than half) do not encode proteins at all, but appear both to be developmentally regulated and to have genetic function. We suggest that these RNAs (eRNAs) have evolved to function as endogenous network control molecules which enable direct gene-gene communication and multitasking of eukaryotic genomes. Analysis of a range of complex genetic phenomena in which RNA is involved or implicated, including co-suppression, transgene silencing, RNA interference, imprinting, methylation, and transvection, suggests that a higher-order regulatory system based on RNA signals operates in the higher eukaryotes and involves chromatin remodeling as well as other RNA-DNA, RNA-RNA, and RNA-protein interactions. The evolution of densely connected gene networks would be expected to result in a relatively stable core proteome due to the multiple reuse of components, implying that cellular differentiation and phenotypic variation in the higher eukaryotes result primarily from variation in the control architecture. Thus, network integration and multitasking using trans-acting RNA molecules produced in parallel with protein-coding sequences may underpin both the evolution of developmentally sophisticated multicellular organisms and the rapid expansion of phenotypic complexity into uncontested environments such as those initiated in the Cambrian radiation and those seen after major extinction events.


    Introduction
 
Our understanding of the relationship between genetic information and biological function is rooted in the one gene–one protein hypothesis and in classical studies of the lac operon and the "genetic code," i.e., the triplet code specifying amino acids in protein-coding sequences. The concept of DNA as a relatively stable, heritable source of template information for proteins, transduced through a temporary and discrete RNA readout, has become an article of faith and has implicitly, but very powerfully, influenced our ideas on the structure of genetic systems. Accordingly, cells and organisms are thought of as being built from a myriad of structural and catalytic proteins whose expression is generally controlled by other regulatory proteins which bind to DNA. This is a biochemical rather than an informatic perspective, which, apart from local analysis of promoter function, gives little thought to the problem of how complex programs of gene activity in the higher organisms might be integrated and regulated in four dimensions.

Genome sequencing projects have shown that the core proteome sizes of Caenorhabditis elegans and Drosophila melanogaster are similar and that each is only about twice the size of that of yeast and some bacteria, despite these animals' every appearance of possessing more than twice the complexity of micro-organisms (Chervitz et al. 1998; Rubin et al. 2000), leading to the conclusion that "the evolution of additional complex attributes is essentially an organizational one; a matter of novel interactions that derive from the temporal and spatial segregation of fairly similar components" (Rubin et al. 2000). This conclusion is reinforced by the finding that the human genome has only about 30,000 protein-coding genes (Roest Crollius et al. 2000; International Human Genome Sequencing Consortium 2001; Venter et al. 2001), 99% of which are shared with the mouse (J. C. Venter, personal communication). The increased complexity of the higher eukaryotes is related, at least in part, to the production of different protein isoforms from the same gene by alternative splicing (Croft et al. 2000). However, the other striking feature of the evolution of these organisms, largely ignored to date, is the huge increase in the amount of complex non-protein-coding RNAs, which can represent up to 97%–98% of all transcriptional output from the genome. That is, the vast majority of the expressed information in the higher eukaryotes is in RNA, not protein-coding sequences. Moreover, less than 1% of the sequence differences between individual humans occur in protein-coding sequences (Venter et al. 2001), which suggests that the majority of phenotypic variation between individuals (and species) results from differences in the control architecture, not the proteins themselves. This is in contrast to bacteria, wherein phenotypic variation is primarily achieved by varying the proteome: different strains of Escherichia coli have been found to differ by over 20% in their gene complement (Hayashi et al. 2001).

The view that phenotypic variation in complex organisms results from the differential use of a set of core components is becoming common (Gerhart and Kirschner 1997; Duboule and Wilkins 1998) and includes such concepts as "synexpression groups" (Niehrs and Pollet 1999), "syntagms" of interacting genes (Huang 1998) and gene cassettes (Jan and Jan 1993), the reuse of modules in signaling pathways (Pawson 1995; T. Hunter 2000), and enhanced rates of evolution by varying connections between modular network components (Hartwell et al. 1999; Holland 1999). These concepts have been drawn primarily from electrical circuit design and have focused principally on the modules rather than on the interconnecting control architecture of the system.

Particular network models, which range in size from single regulated circuits (Mestl, Plahte, and Omholt 1995; Almeida, Fernandes de Lima, and Infantosi 1998; Mendoza and Alvarez-Buylla 1998; Yuh, Bolouri, and Davidson 1998) to complete genomes (Thieffry et al. 1998), have demonstrated that feedback subnetworks can exhibit computational behaviors including "learned behavior" (Bhalla and Iyengar 1999), that switching networks and transcriptional control networks can exhibit dynamical stability (Wolf and Eeckman 1998; Smolen, Baxter, and Byrne 2000), and that feedback circuits can implement oscillators governing cell cycles and circadian clocks (Dano, Sorensen, and Hynne 1999; Haase and Reed 1999; Shearman et al. 2000). Stochastic noise and time delays allowing feedback, molecular memory, and oscillations can be incorporated into such circuit models (Smolen, Baxter, and Byrne 1999), generating probabilistic phenotypic variation (McAdams and Arkin 1997) and amplification of signals (Hasty et al. 2000). Some of these models have been verified by synthesizing circuits in cells to feature bistability, oscillations, and stochastic destruction of temporal correlations (Becskei and Serrano 2000; Elowitz and Leibler 2000; Gardner, Cantor, and Collins 2000).

However, such models are unsuited to the analysis of global cellular connectivity and dynamics, as they cannot be scaled up to large network sizes, since linear increases in the number of interconnected circuit nodes require quadratic increases in the number of interconnecting molecules. This leads to an explosive increase in model size which severely constrains numerical simulations using current computing technologies (see, e.g., Weng, Bhalla, and Iyengar 1999). A number of alternate approaches have sought to avoid this size explosion by treating subnetworks as active integrated logic components which are interconnected into larger networks (McAdams and Shapiro 1995), or by exploiting hierarchically organized control systems to significantly decrease analytical complexity (van der Gugten and Westerhoff 1997).
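
The scaling problem described above can be sketched numerically. The snippet below is a minimal illustration only, assuming (as a simplification not stated in the source) that every pair of nodes needs its own dedicated interconnecting molecule; the function name `pairwise_links` is hypothetical.

```python
def pairwise_links(n_nodes: int) -> int:
    """Count distinct node pairs, i.e. the links needed if every pair
    of nodes requires its own interconnecting molecule (an assumption)."""
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    return n_nodes * (n_nodes - 1) // 2


if __name__ == "__main__":
    # Each tenfold increase in nodes costs roughly a hundredfold more links.
    for n in (10, 100, 1000):
        print(f"{n:>5} nodes -> {pairwise_links(n):>7} pairwise links")
```

Under this assumption the number of links grows as n(n-1)/2, which is the quadratic growth that makes such models hard to simulate at the scale of a whole cell.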

We suggest that biology has solved this problem differently. Here we examine first whether the types of network control architecture which are used to integrate and multitask computers (and which implicitly feature in other complex information processing systems) might also be employed by molecular biological networks to generate phenotypic complexity and variability. Second, we examine the proposition and collate the evidence that introns and other nonprotein-coding RNAs may have evolved to function as network control molecules in the higher organisms, freeing such organisms from the constraints of a simple single-output protein-based genetic operating system.


    Multitasking by Programmed Network Control
 
Multitasking is employed in every computer in which control codes (program instructions) of n bits set the central processing circuit to process one of 2^n different operations. Sequences of control codes (a program) can be internally stored in memory, creating a self-contained programmed response network (a computer) as originally defined by von Neumann in 1945 (von Neumann 1982). Prior to the arrival of the von Neumann computing architecture, a computer could only be reprogrammed by laborious rewiring of the central processing unit, whereas subsequent reprogramming simply required loading new control codes into memory. In all computing networks, processing requires not only stored program instructions, but also communication between nodes to synchronize and integrate network activity. In theory, gene networks could exploit similar technology using internal controls to multitask components and subnetworks to generate a wide range of programmed responses, such as in differentiation and development.

Existing genetic circuit models, although sophisticated, ignore endogenous controlled multitasking and consider each molecular subnetwork (involving a few genes, for instance) to be sparsely interconnected and either off or on to express only one dynamical output (see, e.g., McAdams and Shapiro 1995; Bhalla and Iyengar 1999; Weng, Bhalla, and Iyengar 1999). Such models require more complex genetic programs to be built from many subnetworks encoded by exponentially large numbers of genes, a severe constraint. In contrast, multitasking via n controls (single molecules suffice) can, in theory, achieve exponential (2^n) multitasking of subnetwork dynamical outputs and allow a wide range of programmed responses to be obtained from limited numbers of subnetworks (and genetic coding information). The imbalance between the exponential benefit of controlled multitasking and the small linear cost of control molecules makes it likely that evolution will have explored this option. Indeed, this may be the only feasible way to lift the constraints on the complexity and sophistication of genetic programming.
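
The contrast between linear cost and exponential benefit can be illustrated with a short sketch. It assumes, for illustration only, that each control molecule is simply present or absent and that every combination could in principle select a different subnetwork output, so 2^n is an upper bound; the name `control_states` is hypothetical.

```python
from itertools import product


def control_states(n_controls: int) -> list[tuple[int, ...]]:
    """Enumerate every on/off combination of n binary control molecules."""
    return list(product((0, 1), repeat=n_controls))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        states = control_states(n)
        # Cost grows linearly (n molecules); reachable settings grow as 2**n.
        print(f"{n} controls -> {len(states)} distinct settings")
```

Adding one more control molecule doubles the number of settings available to the subnetwork, whereas an on/off subnetwork without controls would need a new set of genes for each additional output.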

The relevant output dynamics of complex systems can only be found by a comprehensive search of input parameter space, as nonlinear interactions within the network can have unexpected and emergent properties. During evolution, genetic networks must perform a similar search of possible subnetwork dynamics, which can also be greatly accelerated when multitasking is employed. It is far easier to modify and expand the numbers of small control sequences than to duplicate and mutate entire subnetworks of genes. Additionally, simply turning off controls may reset the program, which is perhaps important in reproduction and survival. Most importantly, a control architecture makes it possible to coordinate activity across interacting sets of genes, while variation of this architecture can generate a large spectrum of different protein expression profiles.

However, multitasking controls are only useful to the extent that they convey information about the dynamical state of the network and its surrounding environment. To do this, nodes within the network must not only receive multiple inputs, but also generate multiple outputs (endogenous controls). In cells, molecular switches which act as input controls to relay metabolic, physiological, and environmental information by modifying protein structure and protein-protein and protein-nucleic acid binding affinities have been known for many years. However, endogenous controls need to be correlated with the internal cellular state, the central component of which is gene expression status. Importantly, in a fully integrated network, endogenously sourced controls are likely to be more numerous than externally sourced controls, just as computers must internally regulate millions of internal subnetwork controls to communicate with a few peripherals in the environment.

Ideally then, in order for a molecular genetic network to be capable of complex programming and multitasking, each of the gene subnetworks within a cell must produce numerous control molecules in parallel with their primary gene products, which dynamically communicate with other subnetworks (via transcriptional, splicing, and translational controls, among others). Such a system would be expected to display an exponential increase in its ability to manage and integrate larger genetic data sets and in its functionality and phenotypic range. In addition, because modulation of system dynamics can be readily achieved by mutation of control molecules, such a system should be able to explore new expression space at fast evolutionary rates over short evolutionary timescales.

A controlled multitasked molecular network is schematically shown in figure 1 in contrast to an uncontrolled regulated network. This network architecture can be equally applied to computer networks, neural networks, and cellular networks.



Fig. 1. Schematic representation of subnetworks of an uncontrolled regulated network and a controlled multitasked network. a, An uncontrolled subnetwork wherein nodes take limited numbers of regulatory inputs r_k and generate limited numbers of protein outputs g_k. Here, g_1 regulates n_2 while being subject to feedback interactions from g_2 (dotted line). b, The same subnetwork with each node expressing a multiplex output of protein product g_k and many control molecules c_k, each capable of targeted interactions to multitask the subnetwork. A sample of possible interactions (shown as dot-dash lines) includes control c_1 determining the alternative splicing of the node n_3 output giving g_3 or g'_3, the latter of which regulates node n_2 when expressed, while nodes n_1 and n_3 each feed back controls onto the other. It is evident that controls increase interconnectivity, which increases network dynamical output complexity.

 

    The Evolution of Controlled Multitasked Gene Networks
 
The nodes of a controlled multitasked network must be capable of generating and integrating multiple inputs and outputs. Such networks are generally stable and scale-free, with some nodes having high connectivity and others having low connectivity, similar to most communication and social networks, including the Internet (Albert, Jeong, and Barabasi 2000). Multiply connected networks are widely employed in other complex information processing systems, including neurobiology, where secondary networking signals, termed "efference" signals, underlie sensory awareness and motor coordination (Bridgeman 1995; Andersen et al. 1997). The concept of multiple inputs and outputs is also a well-established feature of neural networks in cognition, language, and memory (Plunkett et al. 1997; Elman 1998). These networks involve densely connected webs of processing units that propagate and transform complex patterns of activity and are capable of self-organization. They operate by a form of parallel distributed processing, whereby information is distributed across the system such that patterns of activation across sets of "hidden units" (i.e., controls), which define the state of the network, then determine the pattern of activation across output nodes (McClelland and Rumelhart 1985; Rumelhart and McClelland 1986; McClelland and Plaut 1993; Plunkett et al. 1997; Elman 1998).

In cells, genetic information is transduced into RNAs and proteins, the latter of which are considered to be the major functional outputs of the genome and to comprise the structural, metabolic, and regulatory systems by which cells and organisms function. Theoretically, it is possible for proteins to provide multiple input controls, and combinatorial regulation does occur in the case of, e.g., transcription factors, but for each genetic node to be multiply connected, a multiplex output is also required from each node, at least on average. At present, however, there is no evidence that proteins are used to provide an output connection function (i.e., in parallel with a primary gene product), and no output (networking) molecules acting as controls influencing the activity of other genes (or RNAs or proteins) have been identified, although intronic RNA could fulfill this function.

Prokaryote genomes consist almost entirely of protein-coding sequences that are separated by short intergenic regions containing promoters and transcription termination signals, and are flanked by 5' and 3' untranslated signals that are involved in translational control, mRNA localization, and mRNA stabilization. Prokaryotic genes are frequently arranged in operons allowing cotranscription of genes with related functions, such as the lac operon, although rarely if ever are broader regulatory (output control) proteins expressed from the same node (operon). Most regulatory proteins are expressed from separate nodes. For the lac operon, input control comes from the lac repressor (polling cellular lactose status) and the CAP protein (polling cellular cAMP/energy status), both of which are expressed separately (Reznikoff 1992). Transcription of the lac operon (and most operons) is therefore blind: no secondary communication signals are coexpressed and other cellular nodes remain unaware of the event, except indirectly through delayed feedback loops which relay metabolic state information. The number of regulatory proteins in bacteria is a relatively low proportion of the total, and the system appears to function as a set of sparsely connected local area networks, with each regulator contacting a limited number of nodes in the genome, and with controls usually composed of metabolic or environmental chemicals that intersect with these regulators.
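
The input-only character of the lac operon can be sketched as a two-input Boolean function. This is a deliberately simplified textbook model, not a quantitative one: it assumes that the repressor leaves the operator only when lactose is present, that CAP activates transcription only when cAMP is high, and it ignores basal transcription. The function name `lac_operon_transcribed` is hypothetical.

```python
def lac_operon_transcribed(lactose_present: bool, camp_high: bool) -> bool:
    """Simplified Boolean model of the two inputs named in the text.

    Assumption: the lac repressor releases the operator only when lactose
    is present, and CAP activates transcription only when cAMP is high.
    """
    repressor_bound = not lactose_present
    cap_active = camp_high
    return (not repressor_bound) and cap_active


if __name__ == "__main__":
    for lactose in (False, True):
        for camp in (False, True):
            state = lac_operon_transcribed(lactose, camp)
            print(f"lactose={lactose!s:5} cAMP high={camp!s:5} -> transcribed={state}")
```

The function takes two inputs and returns a single on/off state; nothing else is emitted that another node could read. In this sketch that is what "blind" transcription amounts to: the node has input controls but no parallel output controls.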

Prokaryotes have limited genome sizes (upper limit ~10 Mb) and low phenotypic complexity, suggesting that advanced integrated control technologies are not widely employed in these organisms. The absence of a prokaryotic multiplex control system also implies that a system built primarily on proteins has inherent limitations. It is not as if prokaryotes have had insufficient time to evolve such a system—they have had four billion years and countless generations in which to explore all possible protein and phenotypic space, aided by lateral transfer to spread innovation. However, while multiplex input at complex promoters is possible (see below), a multiplex output (synchronous control signals based on proteins) is far more difficult. Prokaryotic gene transcripts are not processed to produce subspecies, and the only parallel outputs that are possible are separate proteins translated from polycistronic mRNAs. To average just one (additional) protein output per node requires doubling genome size, and the multiplex output necessary for true dynamical systems integration requires huge increases in both genome size and energy cost to the cell, making such integration unmanageable and ultimately impossible by this means. The lack of a sophisticated systems control technology in prokaryotes may be the primary reason why genomic and developmental complexity has not arisen in these lineages. This also reciprocally suggests that this constraint had to be solved before more complex organisms could evolve and that the network control mechanisms operating in the higher eukaryotes may not be principally protein-based.

The complexity and phenotypic versatility of the higher eukaryotes is thought to result primarily from a larger set of proteins and combinatorial (input) control of gene expression by such proteins. This includes multiple "transcription factors" and intersecting signal transduction pathways influencing gene expression, along with alternative splicing producing different proteins from the same gene (Lopez 1998; Croft et al. 2000; Smith and Valcarcel 2000), generating subtly or substantially different functions in different tissues. While gene number is higher in complex eukaryotes, and alternative splicing greatly increases protein isoform numbers, combinatorial control of gene expression only allows multiplex input control, and alternative splicing mainly provides flexibility in endpoint specialization. Neither of these systems allows multiplex output of control molecules at the point of gene expression, a principal requirement for a multitasked network.

One possible population of cellular molecules with the attributes required to act as controls in genetic multitasking is functional introns and other noncoding RNAs. These have previously been suggested to potentiate a parallel processing system with vastly expanded regulatory options, leading to more complex genetic data sets, programs, and phenotypes, which was perhaps critical to the evolution of multicellular organisms (Mattick 1994). These RNAs were initially christened iRNA (intronic/informational RNA) (Mattick 1994), but because of the ambiguity in that term (mRNA is also informational) and potential confusion with the recently discovered phenomenon termed RNAi (RNA interference), we have chosen to denote non-protein-coding RNAs which are involved in network integration and control as eRNA ("efference" RNA).


    A Role for Introns and Other Noncoding RNAs in Dynamical Gene-Gene Communication, Genetic Multitasking, and Systems Integration
 
Potential cellular control molecules enabling multitasking and system integration must be capable of specifically targeted interactions with other molecules, must be plentiful (as limited numbers impair connectivity and adaptation in real and evolutionary time), and must carry information about the dynamical state of cellular gene expression. These goals are most simply achieved by spatially and temporally synchronizing control molecule production with gene expression. Most protein-coding genes of higher eukaryotes are mosaics containing one or more intervening sequences (introns) of generally high sequence complexity, which are spliced out during pre-mRNA processing to generate a nuclear population of intronic RNA with concentration profiles linked to that of the exons, which are reassembled during this process to form mRNA, and subsequently translated into protein. The numbers of protein-coding genes do not increase exponentially in complex organisms and hence cannot provide large-scale cellular connectivity (which does increase exponentially). The genomes of higher organisms are nevertheless much larger than those of single-celled organisms, with the vast majority of this size increase (after accounting for variable amounts of repetitive DNA) occurring within intron sequences and other non-protein-coding RNAs. Introns therefore fulfill the essential conditions for system connectivity and multitasking—(1) multiple output in parallel with gene expression; (2) large numbers, especially if, as is likely (see below), they are further processed to smaller molecules after excision from the primary transcript; and (3) the potential for specifically targeted interactions as a function of their sequence complexity. Sequences of just 20–30 nt should generally have sufficient specificity for homology-dependent or structure-specific interactions. 
Introns are therefore excellent candidates for, and perhaps the only source of, possible control molecules for multitasking eukaryotic molecular networks, which relieve the problems associated with protein-based systems, as genetic output can be multiplexed and target specificity can be efficiently encoded, assuming a receptive infrastructure.
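
The claim that 20–30 nt is enough for specific targeting can be checked with rough arithmetic. The sketch below assumes a random genome with equal base frequencies and an illustrative genome size of about 3 × 10^9 bp (approximately that of the human genome); real genomes are neither random nor uniform, so the numbers are order-of-magnitude guides only. The function names are hypothetical.

```python
def possible_sequences(length_nt: int) -> int:
    """Number of distinct sequences of a given length over a 4-letter alphabet."""
    return 4 ** length_nt


def expected_chance_matches(length_nt: int, genome_size_bp: int, strands: int = 2) -> float:
    """Expected number of exact matches to one sequence arising by chance,
    assuming a random genome with equal base frequencies (an idealization)."""
    return strands * genome_size_bp / possible_sequences(length_nt)


if __name__ == "__main__":
    GENOME_SIZE_BP = 3_000_000_000  # assumed, roughly human-sized
    for length in (10, 15, 20, 30):
        hits = expected_chance_matches(length, GENOME_SIZE_BP)
        print(f"{length} nt: about {hits:.3g} chance matches expected")
```

Under these assumptions a 10-nt sequence would be expected to match thousands of sites by chance, whereas a 20-nt sequence would be expected to match far fewer than one, which is consistent with the 20–30 nt range given in the text.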

Before considering the evidence that introns might fulfill such a function, it is necessary to address some preconceptions. The widely held idea that introns are nonfunctional is an assumption which dates back to the initial discovery of these sequences, a great surprise at the time (Williamson 1977). Introns were interpreted in the light of the prevailing dogma that all cellular functions were directed by proteins and that genes were simply repositories of protein-coding sequences. That dogma was in turn based on bacterial molecular genetics and took hold in the absence of any understanding of the evolutionary history and origin of nuclear introns. There is no evidence to support the assumption that nuclear introns generally (as a class of sequences in the higher organisms) are nonfunctional, although the issue is confused by the fact that most introns are less conserved in sequence than accompanying protein-coding exons and that some or many will not have evolved function, as each intron will be evolving largely independently (see below).


    Introns Populated the Eukaryotic Lineage Late in Evolution
 
It is now clear that modern nuclear introns are not ancient remnants of the prebiotic assembly of genes, but the evolutionary descendants of self-catalytic group II introns, which have similar splicing mechanisms (Lambowitz and Belfort 1993; Eickbush 2000). These elements appear to have penetrated the eukaryotic lineage late in evolution (Cavalier-Smith 1991; Palmer and Logsdon 1991; Mattick 1994; Stoltzfus et al. 1994; Cho and Doolittle 1997; Logsdon 1998; Wolf et al. 2000) and to have expanded initially by retrotransposition (Cousineau et al. 2000; Eickbush 2000) and later (after their sequence constraints were reduced by the evolution of the spliceosome) by other mutational, recombinational, and insertional processes (Tarrio, Rodriguez-Trelles, and Ayala 1998). Self-catalytic group II introns do occur in bacteria, usually in tRNA genes (Ferat and Michel 1993; Martinez-Abarca and Toro 2000), and the likely reason that introns are generally absent from prokaryotic protein-coding sequences is the intimate coupling of transcription and translation in these cells, which does not allow time for intron excision (Mattick 1994).

The evolution of the nucleus and the separation of transcription and translation in the eukaryotes provided the opportunity for these introns to invade protein-coding genes, as long as their removal by self-splicing was efficient enough not to interfere with mRNA and protein production. The subsequent evolution of the spliceosome (involving the devolution of internal cis-acting catalytic RNAs into trans-acting spliceosomal RNAs and recruitment of accessory proteins) (Lambowitz and Belfort 1993; Mattick 1994; Newman 1994; Stoltzfus 1999; Yean et al. 2000) made intron processing easier, which reduced the negative selection against introns and allowed them more latitude. It also relaxed their internal sequence requirements, leaving them free to evolve and to explore new evolutionary space, based on RNA molecules produced in parallel with protein-coding sequences (Mattick 1994). This would have been accelerated by the co-evolution of receptor systems for these molecules, involving RNA-protein, RNA-RNA, and RNA-DNA/chromatin interactions, in the same way as other complex systems such as the ribosome and the spliceosome have evolved (Stoltzfus 1999). It does not follow that all introns in a given lineage will have evolved function (see below), but, rather, there will have been increasing opportunity to do so. This also applies to other types of insertion elements (International Human Genome Sequencing Consortium 2001). Any useful functions that may have been acquired would have provided a positive selection pressure, which is the basis of Darwinian evolution. The general hypothesis that intron-derived RNAs may have evolved trans-acting functions is therefore eminently feasible and should be entertained.


    Intron Density Correlates with Developmental Complexity
 
Intron size and sequence complexity correlate well with developmental complexity, and introns comprise the majority of pre-mRNA sequences in the higher organisms. In developmentally simple eukaryotes like Schizosaccharomyces pombe, Aspergillus, and Dictyostelium, introns compose only 10%–20% of the primary transcript and are generally small, with an average length of less than 100 bases and a density of about one to three introns per kilobase of protein-coding sequence. These data are consistent with hybridization kinetic analyses of the relative sequence complexity of "heterogeneous nuclear RNA" (hnRNA) versus mRNA in lower eukaryotes (Davidson 1976). In the higher plants, there are two to four introns per gene of an average length of about 250 bases, comprising about 50% of the primary transcript. In animals, the average intron size increases to about 500 bases in Drosophila and C. elegans and to about 3,400 bases in humans (six to seven introns per gene, on average making up over 95% of the primary transcript) (Palmer and Logsdon 1991; Deutsch and Long 1999; International Human Genome Sequencing Consortium 2001; Venter et al. 2001).
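The relationship between intron number, intron length, and the intronic share of the primary transcript quoted above can be checked with simple arithmetic. The following is only an illustrative sketch: the function names are hypothetical, the inputs are the rounded averages given in the text (with 3 and 6.5 taken as midpoints of "two to four" and "six to seven" introns per gene), and multiplying averages in this way is an approximation rather than a per-gene calculation.

```python
def intron_fraction(n_introns: float, mean_intron_len: float, exonic_len: float) -> float:
    """Fraction of a primary transcript that is intronic sequence."""
    intronic = n_introns * mean_intron_len
    return intronic / (intronic + exonic_len)


def implied_exonic_len(n_introns: float, mean_intron_len: float, fraction: float) -> float:
    """Exonic length (bases) at which introns make up `fraction` of the transcript."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    intronic = n_introns * mean_intron_len
    return intronic * (1.0 - fraction) / fraction


if __name__ == "__main__":
    # Rounded averages from the text; midpoints are an assumption of this sketch.
    examples = [
        ("higher plants", 3.0, 250.0, 0.50),
        ("humans", 6.5, 3400.0, 0.95),
    ]
    for label, n, length, frac in examples:
        exonic = implied_exonic_len(n, length, frac)
        print(f"{label}: ~{n * length:.0f} intronic bases, "
              f"~{exonic:.0f} exonic bases at {frac:.0%} intronic")
```

On these assumptions, roughly 22 kb of intronic sequence per human gene accompanies only a little over 1 kb of exonic sequence, which is the scale of difference the 95% figure implies.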

Organisms with streamlined genomes provide a good test of the stringency of intron expansion. The pufferfish Fugu rubripes has, for unknown reasons, almost no repetitive (presumably superfluous) sequences in its genome: three quarters of pufferfish introns are very small, whereas the remainder are much larger and still account for the majority of total unique sequence (Brenner et al. 1993; Elgar 1996). A similar skewed distribution is observed in the compact genome of Arabidopsis thaliana (Carels and Bernardi 2000). (A comprehensive analysis of eukaryotic intron size can be found at http://isis.bit.uq.edu.au; Croft et al. 2000). Most of the small introns are probably vestigial, whereas, in these and probably in most organisms, larger introns with high sequence complexity may be considered to indicate functionality. This is the case in at least one instance (Cecconi et al. 1996). Interestingly, the complex alga Volvox carteri appears to possess large introns (Fabry et al. 1993). Since the order Volvocales contains a number of closely related members ranging from unicellular (Chlamydomonas) through a series of colonial forms to fully differentiated forms, this may represent a useful test case for the appearance of larger introns through an evolutionary developmental series.


    Introns Have the Signatures of Information
 
Introns (and other nonprotein-coding RNAs; see below) of higher organisms exhibit all the signatures of information. They generally have high sequence complexity (Tautz, Trick, and Dover 1986), although one must distinguish between introns that may have evolved function and those that have not (which will be more degenerate) and take account of the differing proportions of functional and nonfunctional introns in lineages of different developmental complexity. While introns generally show less conservation than adjacent protein-coding sequences, which are subject to strong constraints, so also do adjacent promoters and 5' and 3' untranslated regions of mRNA, all of which are known to be important in gene regulation. The plasticity and more rapid evolution of these regulatory sequences do not mean they are nonfunctional, and we suggest the same holds in general for introns.

Nonetheless, some introns are highly conserved over substantial evolutionary distances (Garbe and Pardue 1986; Rieger and Franke 1988; Tournier-Lasserve et al. 1989; Lloyd and Gunning 1993; Starke and Gogarten 1993; Koop and Hood 1994; Bagavathi and Malathi 1996; John, Smith, and Kaiser 1996; Rosby, Alestrom, and Berg 1997; Kazmierczak et al. 1998; Aruscavage and Bass 2000; Sun et al. 2000; Yatsuki et al. 2000), often in large blocks (Jareborg, Birney, and Durbin 1999), indicating that they are under functional constraint. While such conservation might, in some cases, be ascribed to the presence of important cis-acting elements such as transcription enhancers, this cannot account for the extensive homology between, for example, the 94 kb of introns in the mouse and human T-cell receptor genes showing a high level of nucleotide sequence conservation (over 70%) similar to that of the accompanying exons (Koop and Hood 1994). Intron sequences can also evolve faster than silent positions in accompanying exons (Kloek et al. 1996) (sites that are presumably relatively neutral), indicating positive selection, further evidence of intron functionality. Moreover, if introns are acting as networking controls, the important issue is not the conservation of the sequence per se (i.e., to produce functional domains in the protein sense), but the conservation of interactions.
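Conservation levels such as the "over 70%" figure quoted above are typically expressed as percent identity over aligned positions. The sketch below shows one common way of computing that quantity from a pairwise alignment. It is not the method used in the cited studies; the function name is hypothetical, the two short sequences are invented toy data, and skipping gap columns is only one of several conventions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, ungapped columns in which the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not scored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Invented toy alignment, for illustration only.
    seq_1 = "ACGTACGTAC"
    seq_2 = "ACGTTCGTAA"
    print(f"{percent_identity(seq_1, seq_2):.1f}% identity")
```

Because the result depends on the alignment and on how gaps are treated, identity values are only comparable between studies that use the same conventions.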


    Noncoding RNAs Comprise the Majority of Genomic Output
 
Many (if not most; see below) transcripts from the genomes of higher organisms do not encode proteins at all (Eddy 1999; Erdmann et al. 1999). Where they have been examined, these nonprotein-coding transcripts are conserved and clearly functional. Well-documented examples include XIST (involved in female X-chromosome inactivation) (Brockdorff 1998; Lee, Davidow, and Warshawsky 1999; Hong, Ontiveros, and Strauss 2000) and H19 (mutants of which promote tumor development) (Wrana 1994; Hurst and Smith 1999), both of which are imprinted and differentially spliced without encoding any protein (Hurst and Smith 1999; Hong, Ontiveros, and Strauss 2000; F. Clark, personal communication). Others include roX1 and roX2 RNAs involved in dosage response (male X-chromosome activation) in Drosophila, heat shock response RNA in Drosophila, oxidative stress response RNAs in mammals, His-1 RNA involved in viral response/carcinogenesis in humans and mice, SCA8 RNA involved in spinocerebellar ataxia type 8 which is antisense to an actin-binding protein, and ENOD40 RNA in legumes and other plants (Eddy 1999; Erdmann et al. 1999; Nemes, Benzow, and Koob 2000). The 200-kb bithorax-abdominalA/B locus of Drosophila produces seven major transcripts (there may be minor ones as well), only three of which encode proteins, but all of which have phenotypic signatures and are developmentally regulated (Akam et al. 1985; Hogness et al. 1985; Lipshitz, Peattie, and Hogness 1987; Sanchez-Herrero and Akam 1989). These are not isolated examples. Many loci, including imprinted loci, express noncoding antisense and intergenic transcripts, some of which are alternatively spliced and developmentally regulated (Ashe et al. 1997; Lipman 1997; Potter and Branford 1998; Lee, Davidow, and Warshawsky 1999; Filipowicz 2000; Hastings et al. 2000; Nemes, Benzow, and Koob 2000), in addition to being stably detectable in the nucleus (Ashe et al. 1997).

There is a general point to be made here. Gene regulation often involves "enhancers" located either downstream of the transcription start site (in introns) or in the upstream promoter region spanning many kilobases of DNA, as well as more distant regions sometimes referred to as "locus control regions." In some, and perhaps many, cases, these intergenic regions are themselves transcribed (into noncoding RNAs), suggesting that their effects might be related to trans-acting, not cis-acting sequences, which can confound interpretation of mutational analysis of "promoter regions." Such transcripts have been discovered by careful analysis of transcriptional activity around a locus of interest, such as β-globin (Ashe et al. 1997), but this has not often been done.

Also, as noted by Eddy (1999), most systematic genomic screens are biased against discovering noncoding RNAs. PolyA+RNA preparations used in cDNA library construction are depleted of noncoding RNAs, and bioinformatic searches are limited by a lack of knowledge about the signatures and variety of these molecules, although comparative genomics to identify regions of sequence homology outside of protein-coding regions may provide clues. Many such homology regions are evident from comparison of the human and mouse genomes (V. R. Bonazzi, personal communication), and many noncoding regions in C. elegans encode sequences predicted to form thermodynamically stable complex secondary structures (F. Clark, personal communication). Genetic screens are probably also compromised by the likelihood that noncoding RNAs are less likely to be badly affected by point mutations. In Drosophila, most known mutants in "regulatory" regions that have strong phenotypic signatures are either large insertions or deletions. Furthermore, while there are very few known cases of point mutations in introns (or promoter regions) giving observable phenotypes in mammals, there is an unexpectedly high frequency of insertional mutants which give observable phenotypes in transgenic mice, most of which occur in introns or other noncoding regions (Meisler 1992). These observations not only strengthen the case that introns may have functions, but also suggest that these functions may only be readily revealed via extensive sequence disruption or deletion. This may also explain some of the unexpected results of gene knockouts in transgenic mice and confound interpretation of such experiments, which have not traditionally been designed to take account of introns and other non-protein-coding RNAs produced from the locus under study.

Additional evidence for large numbers of noncoding RNA transcripts in animal nuclei comes from earlier studies (preceding the discovery of introns) on the sequence complexity of heterogeneous nuclear RNA (hnRNA) (Davidson 1976), from which it was speculated that this RNA may represent regulatory transcripts (Britten and Davidson 1969; Davidson, Klein, and Britten 1977). Hybridization renaturation kinetics shows that hnRNA complexity in echinoderms is approximately 10–30 times that of mRNA (Davidson 1976), whereas we now know that protein-coding primary transcripts in vertebrates are about 5–20 times as complex as the resulting mRNAs (Deutsch and Long 1999). While these comparisons are crude, they suggest that a significant proportion of nuclear transcripts, perhaps more than half, do not contain protein-coding sequences. The nucleus of the higher organisms appears to be a very complex ball of RNA-DNA-protein interactions. On reflection, it may not be surprising that if an RNA communication network based on introns expressed in parallel with protein-coding sequences has evolved, a higher-order control network involving eRNA alone may also have evolved. In addition, even though a substantial proportion of the human genome is composed of repeated elements, many of these are transcribed, and it is well within the bounds of possibility that they have also evolved to form part of the regulatory architecture (International Human Genome Sequencing Consortium 2001).


    Examples of Gene Regulation and Communication by Introns and Noncoding RNAs
 
Clear-cut instances of RNA-mediated gene regulation are beginning to appear. The activities of the heterochronic genes lin-14 and lin-41, which regulate developmental timing in C. elegans, are controlled by lin-4 and let-7 gene products encoding small RNAs that are antisense to repeated elements in the 3' untranslated region of target mRNAs and which appear to inhibit translation by RNA-RNA interactions (Lee, Feinbaum, and Ambros 1993; Wightman, Ha, and Ruvkun 1993; Feinbaum and Ambros 1999; Reinhart et al. 2000), possibly by targeting the mRNA for endoribonuclease attack (Nashimoto 2000). Lin-4 and let-7 do not contain obvious protein-coding sequences, and the surrounding genomic sequences suggest that both are derived from functional introns surrounded by vestigial exons (Lee, Feinbaum, and Ambros 1993; Reinhart et al. 2000; L. Croft, personal communication). Moreover, let-7 is functionally conserved in other bilaterian animals, from mollusks to mammals (Pasquinelli et al. 2000). Interestingly, the size of these RNAs (21–22 nt) is similar to that produced by the RNA interference (RNAi) pathway (Bass 2000; Parrish et al. 2000; Yang, Lu, and Erickson 2000; Zamore et al. 2000; Sharp 2001) (see below).

It has also been discovered that most small nucleolar RNAs (a group of more than 100 stable RNA molecules concentrated in the nucleolus) derive from processed introns of other genes, which encode various ribosomal proteins (e.g., L1, L5, L7, L13, S1, S3, S7, S8, S13, and others), ribosome-associated proteins (e.g., eIF-4A), nucleolar proteins (e.g., nucleolin, laminin, and fibrillarin), the heat shock protein hsc70, and the cell-cycle regulated protein RCC1, among others (Prislei et al. 1993; Sollner-Webb 1993; Bachellerie et al. 1995; Maxwell and Fournier 1995; Nicoloso et al. 1996; Rebane et al. 1998; Filipowicz et al. 1999; Filipowicz 2000). These provide both clear examples of dual gene outputs and potential instances of coordinate regulation (efference control) involving intronic sequences, in this case of ribosomal biogenesis and cell growth (Pelczar and Filipowicz 1998; Smith and Steitz 1998; Tanaka et al. 2000). More tellingly, some genes have so evolved that their protein-coding capacity no longer exists, and their primary product is intron-derived small nucleolar RNAs (Tycowski, Shu, and Steitz 1996; Bortolin and Kiss 1998; Pelczar and Filipowicz 1998; Smith and Steitz 1998; Tanaka et al. 2000), leading to the statement that "genes generating functionally important RNAs exclusively from their intron regions are probably more frequent than has been anticipated" (Bortolin and Kiss 1998).

These nucleolar RNAs are processed from introns by specific mechanisms involving endonucleolytic cleavage by double-stranded RNase III–related enzymes (Caffarelli et al. 1997; Chanfreau et al. 1998; Qu et al. 1999) (also implicated in RNAi, transgene silencing, and methylation [Mette et al. 2000]; see below), exonucleolytic trimming (Cecconi, Mariottini, and Amaldi 1995; Kiss and Filipowicz 1995; Mitchell et al. 1997; Allmang et al. 1999a, 1999b; van Hoof and Parker 1999; van Hoof, Lennertz, and Parker 2000), and possibly even adjacent RNA sequences that have self-cleaving activity (Prislei et al. 1995). This processing occurs in large RNA processing complexes called exosomes, which are also involved in processing rRNA and small nuclear RNAs, contain at least 10 3'–5' exonucleases, helicases, and RNA-binding proteins, and are found in both the nucleus and the cytoplasm (Mitchell et al. 1997; Allmang et al. 1999a, 1999b; van Hoof and Parker 1999; Mitchell and Tollervey 2000).


    Intron Processing, Stability, Decay, and Memory
 
Intronic RNAs are more stable than is generally thought. The widespread view that excised introns are simply discarded and degraded derives from the unjustified a priori assumption that introns are nonfunctional. For example, it has been stated that "the half-life of excised introns is of the order of a few seconds" (Sharp et al. 1987), but closer examination of the primary literature indicates that this estimate is the time taken to splice introns from primary transcripts (Padgett et al. 1986), not the half-life of the introns themselves. Free introns are rarely observed in Northern blots, as these are mostly performed with polyA+RNA preparations and/or cDNA probes and with different questions in mind. However, when examined, free introns in both lariat and linear form have been found to be present in "abundance" (Zeitlin and Efstratiadis 1984), and some are relatively stable (Qian et al. 1992). In situ hybridization studies suggest that while excised intronic sequences diffuse away from the spliceosome, they remain detectable (by this relatively insensitive technique) in the nucleus, exhibiting a broad signal with a "punctate" (spotted) pattern (Xing et al. 1993), consistent with the possibility of a life for intron-derived RNAs within the nuclear domain, and perhaps beyond.

After splicing, introns (initially in lariat form) are debranched (Ruskin and Green 1985), a process that is itself subject to regulation (Ruskin and Green 1985; Qian et al. 1992), but subsequent events are unknown. We suggest that excised introns are likely to be processed by specific pathways similar to those used to produce small nucleolar RNAs, generating multiple smaller species that can function independently as trans-acting signals in the network (Mattick 1994), affecting the metabolism of other RNAs and the modulation of chromatin structure, among other things (see below). The intronic origins of small nucleolar RNAs became known only because of their relative stability and abundance, and they may be just the tip of a large iceberg: a much more complex milieu (tens of thousands) of other intron-derived and other non-protein-coding RNAs, which may be more transient and in much lower individual abundance and which have not yet been detected except by their genetic signatures, as in the case of lin-4 and let-7.

There are other documented examples of small trans-acting functional RNAs processed from longer transcripts (Sit, Vaewhongs, and Lommel 1998; Cavaille et al. 2000). There are also large numbers of ribonucleases and other RNA-related proteins in plants and animals (see below), most of whose functions and substrates are not well defined. Such processing may also involve other splicing pathways (Santoro et al. 1994; Kreivi and Lamond 1996) and guide RNAs, possibly derived from introns or other nonprotein-coding RNAs. These have been described as "riboregulators" (in relation to antisense RNAs) (Delihas 1995) and the "ribotype" (in relation to alternatively spliced mRNAs) (Herbert and Rich 1999a) and may be considered part of the "soft wiring" of the cell (Mattick 1994; Herbert and Rich 1999b).

The decay characteristics of eRNAs are likely to be important to their function. Bot